Summarized using AI

Improving my own Ruby

monochrome • April 18, 2025 • Matsuyama, Ehime, Japan • Talk

The video presentation titled "Improving my own Ruby" by monochrome at RubyKaigi 2025 focuses on his new Ruby implementation, monoruby, written from scratch in Rust. The speaker recounts his progress since the last RubyKaigi, emphasizing recent advancements in monoruby, particularly RubyGems support and performance improvements.

Key Points:

- Introduction to monoruby: The system comprises a Ruby parser, interpreter, garbage collector, and JIT compiler.

- Progress since last presentation: Enhanced support for RubyGems and several performance optimizations.

- Core features and optimizations: The implementation includes features such as machine code inlining, method inlining, polymorphic method call optimizations, and improved access speeds for instance variables and arrays.

- Compatibility issues: Current limitations include lack of support for C extensions, native threads, and certain Ruby core features that impact performance.

- Performance benchmarks: monoruby runs 3 to 10 times faster than the standard CRuby interpreter in various microbenchmarks, demonstrating significant speed improvements.

- JIT compilation process: The speaker delves into the mechanics of JIT compilation, explaining how frequently called methods are compiled into machine code, with fallback to the interpreter when assumptions about types fail (see the sketch after this list).

- Memory and optimization tactics: An exploration of how the JIT compiler uses hardware resources effectively by allocating CPU registers and tracking where each value is stored.

- Specialization techniques: The talk discusses specialization of method calls, optimizing execution by knowing the argument and block structure at compile time, which improves performance significantly.

- Real-world implications: Concludes with a demonstration of Ruby code performance and how monoruby is addressing complex performance challenges.
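To make the compile-when-hot idea from the JIT bullet concrete, here is a minimal Rust sketch; the names and threshold are hypothetical and are not monoruby's actual internals.

```rust
// A minimal sketch of a "compile when hot" policy: each method (or loop)
// keeps an execution counter, and once it crosses a threshold the
// interpreter hands the bytecode to the JIT compiler and jumps to the
// generated machine code on subsequent calls.
struct MethodEntry {
    counter: u32,
    jit_code: Option<usize>, // entry address of compiled machine code, if any
}

const HOT_THRESHOLD: u32 = 1_000; // illustrative value, not monoruby's setting

impl MethodEntry {
    fn on_call(&mut self) {
        if self.jit_code.is_none() {
            self.counter += 1;
            if self.counter >= HOT_THRESHOLD {
                // Hot enough: translate the bytecode into machine code.
                self.jit_code = Some(compile_to_machine_code());
            }
        }
        // If jit_code is Some, the interpreter jumps to it instead of
        // interpreting the bytecode again.
    }
}

fn compile_to_machine_code() -> usize {
    // Placeholder for the real bytecode-to-x86-64 translation.
    0
}
```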

The overall message is that creating a Ruby compiler can be enjoyable, but that meaningful optimization requires careful measurement and evaluation. The speaker's passion for combining computing with personal experience makes the technical discussion relatable.

Improving my own Ruby
monochrome • Matsuyama, Ehime, Japan • Talk

Date: April 18, 2025
Published: May 27, 2025
Announced: unknown

monoruby is a new Ruby implementation I am working on, written in Rust and built from scratch, consisting of a parser, interpreter, garbage collector, and just-in-time (JIT) compiler. Since the last RubyKaigi, I have made progress in RubyGems support and other performance improvements. This time I want to present the details of the optimizations and the various features for performance tuning. Meaningful optimization requires measurement and evaluation, so I have implemented features to record where and why various events such as JIT compilation, JIT code invalidation, and deoptimization (falling back to the interpreter) occurred, and to display the generated assembly for each virtual machine instruction. The actual optimizations implemented in monoruby include machine code inlining, method inlining, polymorphic method call optimization, faster instance variable access, faster array access, and more. Making a Ruby compiler is fun.
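As a rough illustration of the event-recording feature mentioned in the abstract, here is a small Rust sketch; the types and field names are invented for illustration and are not monoruby's real API.

```rust
// A minimal sketch of per-site JIT event statistics: every compilation,
// invalidation, or deoptimization is logged with its source location and
// reason, so hot spots and failure causes can be inspected afterwards.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum JitEvent {
    Compile,
    Invalidate,
    Deoptimize,
}

#[derive(Default)]
struct JitStats {
    // (event, "file:line", reason) -> count
    counts: HashMap<(JitEvent, String, &'static str), usize>,
}

impl JitStats {
    fn record(&mut self, event: JitEvent, site: &str, reason: &'static str) {
        *self.counts.entry((event, site.to_string(), reason)).or_insert(0) += 1;
    }

    fn dump(&self) {
        for ((event, site, reason), n) in &self.counts {
            println!("{event:?} at {site}: {reason} ({n} times)");
        }
    }
}

fn main() {
    let mut stats = JitStats::default();
    stats.record(JitEvent::Deoptimize, "bench.rb:12", "receiver class changed");
    stats.record(JitEvent::Compile, "bench.rb:3", "loop became hot");
    stats.dump();
}
```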

https://rubykaigi.org/2025/presentations/s_isshiki1969.html

RubyKaigi 2025

00:00:05.120 Hello everyone. Thank you for
00:00:09.000 coming. Thank you. Thank
00:00:11.559 you. Thank you. I'm
00:00:14.599 monochrome. For these several years I have been
00:00:17.359 working on my own
00:00:19.320 implementation of Ruby. It is named
00:00:23.039 monoruby. At the last RubyKaigi I
00:00:26.160 presented an overview of monoruby, and
00:00:29.279 this time I want to talk about how it
00:00:31.519 works and
00:00:32.920 how to make it
00:00:35.239 faster. So first of all, let me
00:00:38.800 introduce myself. I am a
00:00:42.079 general surgeon. My specialty is the
00:00:45.120 detection and treatment of cancer in
00:00:48.480 the digestive organs, like gastric
00:00:51.840 cancer and colon cancer. It's kind of
00:00:55.440 being a hardware engineer of the human body with
00:00:58.640 a national license. I have a national
00:01:01.039 license that allows me to debug, fix,
00:01:04.159 cut, or stitch human beings. And I
00:01:08.799 also love Ruby, Rust, and JIT compilers.
00:01:12.799 So I'd like to mention my personal
00:01:16.080 connection to this city, Matsuyama.
00:01:19.759 Matsuyama is where my grandmother was born
00:01:23.280 and my father spent his childhood. Of
00:01:27.119 course my grandmother passed away many
00:01:29.840 years ago. My father is 91 years old,
00:01:34.159 doing well so far, and plays golf three
00:01:37.360 times a week. Meanwhile his son,
00:01:41.400 sitting all day long and writing
00:01:44.320 compilers and assembly, is not so
00:01:48.200 healthy. So let's get
00:01:50.680 started. This is about monoruby, and
00:01:55.840 this is
00:01:58.119 GitHub. This is the GitHub repository.
00:02:00.960 It consists of yet another
00:02:04.640 Ruby parser, a garbage collector, and an
00:02:08.200 interpreter. The parser is very
00:02:10.800 annoying, and we want to move to Prism
00:02:14.080 someday. So we have made some
00:02:17.560 progress: now RubyGems is supported
00:02:21.120 and we can require gems, and we are now
00:02:24.319 struggling with
00:02:25.959 Bundler.
00:02:27.480 Yes, and we must mention
00:02:30.560 compatibility with CRuby. Of course there is a
00:02:33.840 bunch of things to do; many classes and
00:02:35.920 methods are missing. But here, especially
00:02:39.200 for the functionalities which affect
00:02:41.879 performance, we support Bignum,
00:02:45.440 Fiber, Binding, and redefining basic
00:02:48.560 operator methods like Integer#+. On
00:02:52.239 the other hand, we do not support C
00:02:55.720 extensions. This is a big problem for
00:02:59.400 now. Native threads, the Thread
00:03:02.879 class, and Ractor: we do not support
00:03:07.239 them. We also do not support ObjectSpace,
00:03:10.480 TracePoint, refinements, and
00:03:13.040 callcc, and callcc will never be supported
00:03:15.760 in the future. I'm sorry.
00:03:18.959 And this is a
00:03:20.200 microbenchmark with yjit-bench, and
00:03:23.760 it's showing the ratio to CRuby
00:03:27.159 3.4.2. The red bars show monoruby, and
00:03:30.879 it shows 3 to 10 times faster performance
00:03:33.680 than the CRuby interpreter. And this pink
00:03:38.080 bar shows that the monoruby interpreter is
00:03:41.120 comparable to the CRuby
00:03:44.599 interpreter. Next, a benchmark on Optcarrot.
00:03:48.239 This shows frames per second over
00:03:51.080 time. This is the start point, and as time
00:03:54.400 goes on, the red line shows monoruby.
00:03:59.519 It achieves six or seven times faster
00:04:02.080 performance than the interpreters. The
00:04:04.159 interpreters are here, these pink and light
00:04:07.760 blue
00:04:08.840 lines. This blue line is YJIT, and
00:04:13.360 the green lines are
00:04:15.639 TruffleRuby, showing a slow startup and nice
00:04:19.600 peak
00:04:21.639 performance. And we have several build
00:04:24.240 options for debugging and profiling. This
00:04:26.960 is a profile option, with which we
00:04:29.520 can see stats for deoptimization,
00:04:32.639 recompilation, method cache failures,
00:04:35.360 and so on. And with another option we
00:04:39.360 can check when, where, and
00:04:42.560 why these optimization events
00:04:45.960 occurred.
00:04:49.120 This shows the layout of the bytecode. Our bytecode is
00:04:54.560 16 or 32 bytes long for one instruction.
00:04:58.720 This is the add instruction, and this is the
00:05:01.919 opcode. The operands mean the
00:05:05.600 register numbers of the left-hand side,
00:05:08.320 right-hand side, and
00:05:10.120 destination. And for a method call, this
00:05:12.880 opcode has operands like the
00:05:16.199 receiver, the arguments, the number
00:05:19.360 of arguments, and the
00:05:21.960 destination. So take a look at the
00:05:24.560 blue part. In this blue part there is
00:05:28.080 trace information attached to the
00:05:30.639 bytecodes; it's collected by the
00:05:33.320 interpreter. So for the add instruction
00:05:38.000 there are class IDs of the left-hand side and
00:05:40.400 right-hand side; for a method call there are
00:05:44.560 the receiver's class, the callee
00:05:49.520 method, and the method version. So it is a
00:05:53.440 kind of inline method cache: the
00:05:56.080 class of the receiver and the ID of the
00:06:00.080 callee method are stored there.
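A rough Rust sketch of the instruction layout and inline method cache described here; the type and field names are hypothetical, not monoruby's actual definitions.

```rust
// A hypothetical sketch of the 16/32-byte instruction layout: an opcode plus
// register-number operands, with a "trace" slot filled in by the interpreter
// (operand classes for binary ops, an inline method cache for calls).
type Reg = u16;      // virtual register number in the current frame
type ClassId = u32;  // class of a value observed at run time
type FuncId = u32;   // identifier of a resolved method
type Version = u32;  // method-table version, used for cache invalidation

enum Inst {
    // dst = lhs + rhs, with the operand classes the interpreter last saw
    Add {
        dst: Reg,
        lhs: Reg,
        rhs: Reg,
        lhs_class: Option<ClassId>,
        rhs_class: Option<ClassId>,
    },
    // dst = recv.method(args...), with an inline method cache attached
    Send {
        dst: Reg,
        recv: Reg,
        args: Reg,   // first argument register
        nargs: u16,  // number of arguments
        cache: Option<InlineCache>,
    },
}

struct InlineCache {
    receiver_class: ClassId, // class observed at this call site
    callee: FuncId,          // method resolved for that class
    version: Version,        // invalidated when the method is redefined
}
```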
00:06:04.639 So this is the structure of a stack
00:06:07.680 frame. If you call a certain method or
00:06:10.600 block, a stack frame is pushed, and
00:06:14.000 the frame is popped when you
00:06:15.960 return. The frame has a link to the
00:06:20.160 caller frame, like a chain, and also has
00:06:22.960 a link to a local
00:06:26.199 frame. This is the local frame. The local frame
00:06:29.039 holds self, which is register
00:06:33.080 zero, the arguments, local variables,
00:06:36.400 temporaries, the given block, and
00:06:40.080 some
00:06:41.000 metadata. It also has a link to the
00:06:45.720 outer scope, like
00:06:47.639 this. Local frames are mostly placed
00:06:50.720 on the stack and pushed and popped. But if a
00:06:53.759 Proc or Binding object was made inside
00:06:56.080 the frame, the local frame is copied like
00:06:59.840 this, copied to the heap for persistence.
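The frame layout described above might be sketched in Rust roughly as follows; all names are illustrative and are not monoruby's real structures.

```rust
// An illustrative sketch of the frame layout: a control frame links to its
// caller and to a local frame; the local frame holds self, arguments,
// locals, temporaries, the block, and a link to the outer scope. Local
// frames normally live on the stack, but are copied to the heap when a
// Proc or Binding captures them.
struct Value(u64); // stand-in for a boxed Ruby value

struct ControlFrame<'a> {
    caller: Option<&'a ControlFrame<'a>>, // chain back to the calling frame
    local: LocalFrameRef<'a>,
}

enum LocalFrameRef<'a> {
    Stack(&'a LocalFrame<'a>), // the common, cheap case
    Heap(Box<LocalFrame<'a>>), // promoted when a Proc/Binding outlives the call
}

struct LocalFrame<'a> {
    slf: Value,                        // register 0: self
    args_and_locals: Vec<Value>,       // arguments, local variables, temporaries
    block: Option<Value>,              // block given to this call, if any
    outer: Option<&'a LocalFrame<'a>>, // enclosing scope, for blocks
    // ...plus some metadata (method id, flags, etc.)
}
```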
00:07:04.919 And this is how the
00:07:07.120 interpreter, the JIT compiler, and the
00:07:08.960 generated JIT code work. The
00:07:11.960 interpreter fetches the bytecode,
00:07:14.400 dispatches, and executes it. This is an
00:07:17.919 infinite loop.
00:07:20.440 And if a certain method is
00:07:24.560 called many times, or a certain
00:07:29.199 loop is
00:07:30.120 executed many times, we compile the bytecode to
00:07:34.919 machine code. That is the JIT code, and we
00:07:39.120 just jump to it. And this is the JIT
00:07:43.720 code. JIT code is compiled with
00:07:48.000 some assumptions. An assumption means: for this
00:07:51.120 method
00:07:52.840 call, the receiver must be class
00:07:57.080 A. And Ruby is a dynamic language, so
00:08:02.319 these assumptions sometimes
00:08:05.120 break. If an assumption breaks, we
00:08:08.560 must
00:08:09.720 deoptimize. Deoptimize means we
00:08:12.879 fall back to the interpreter and do it again.
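A minimal Rust sketch of the guard-and-deoptimize pattern just described, assuming a hypothetical class check; it is not monoruby's actual code.

```rust
// JIT code compiled under the assumption "the receiver is class A" starts
// with a cheap class check; if the check fails, execution falls back to the
// interpreter at the same bytecode position and runs again.
type ClassId = u32;

enum ExecResult {
    Done(u64),                    // the specialized fast path completed
    Deopt { bytecode_pc: usize }, // assumption broke; resume interpretation here
}

fn run_jit_call(receiver_class: ClassId, assumed_class: ClassId, pc: usize) -> ExecResult {
    // Guard: the generated code is only valid for the class it was compiled for.
    if receiver_class != assumed_class {
        return ExecResult::Deopt { bytecode_pc: pc };
    }
    // ...type-specialized machine code would run here...
    ExecResult::Done(42)
}

fn execute(receiver_class: ClassId, pc: usize) -> u64 {
    match run_jit_call(receiver_class, /* assumed class A */ 7, pc) {
        ExecResult::Done(value) => value,
        ExecResult::Deopt { bytecode_pc } => interpret_from(bytecode_pc),
    }
}

fn interpret_from(_pc: usize) -> u64 {
    // Fall back to the generic fetch-dispatch-execute loop.
    0
}
```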
00:08:16.520 Yes. So a JIT compiler is an
00:08:20.319 extremely complex system, and the only
00:08:23.120 thing that justifies its complexity
00:08:26.160 is performance. So we must generate
00:08:29.680 nice
00:08:31.880 code. There are several hardware
00:08:34.640 resources the CPU can use: the stack in
00:08:38.240 main memory, which is slow, and the registers
00:08:41.120 in the CPU, which are
00:08:42.839 fast. In an
00:08:45.080 x86-64 CPU there are roughly two
00:08:49.880 categories of registers:
00:08:53.440 general-purpose registers and
00:08:55.279 floating-point registers. We must use the
00:08:58.240 floating-point registers to calculate
00:09:00.800 with floating
00:09:02.440 point. In the interpreter,
00:09:08.040 the value of each register is stored
00:09:12.240 on the stack; it's one to one, it's simple. But
00:09:16.560 in the compiler, for optimization, we must
00:09:20.000 track the state of each
00:09:21.880 register. A
00:09:23.640 state means where the value of the
00:09:26.720 register is stored at
00:09:28.760 runtime. It may be the
00:09:30.839 stack, like this, and it may be a CPU
00:09:34.399 register, like this, or a
00:09:36.959 floating-point register, like this. So we
00:09:39.600 must track where the value of each
00:09:43.680 register is actually
00:09:47.000 stored. If the compiler knows the value of a
00:09:50.720 register at compile time, like a float
00:09:53.120 literal, it is not necessary to generate
00:09:56.399 code for storing the value
00:09:59.120 anywhere; we
00:10:03.760 just record it in the
00:10:06.279 state, like
00:10:10.240 this.
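The per-register bookkeeping described here could be sketched like this in Rust; the names are illustrative only.

```rust
// For each virtual register, the compiler tracks where its value will
// actually live at run time: on the stack, in a general-purpose register,
// in a floating-point (XMM) register, or known at compile time as a
// literal, in which case no store needs to be emitted at all.
enum Loc {
    Stack,        // value lives in the frame slot in main memory (slow)
    Gpr(u8),      // value lives in a general-purpose register, e.g. R15
    Xmm(u8),      // unboxed float in an XMM register, e.g. XMM4
    Literal(f64), // value is known at compile time; nothing stored yet
}

struct RegState {
    locs: Vec<Loc>, // one entry per virtual register of the current frame
}

impl RegState {
    // Loading a literal only updates the state; no machine code is generated.
    fn load_float_literal(&mut self, reg: usize, value: f64) {
        self.locs[reg] = Loc::Literal(value);
    }
}
```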
00:10:14.120 So, think about a small method,
00:10:18.560 area, which calculates
00:10:21.200 the area of a circle with
00:10:24.959 radius r, and imagine you are the JIT compiler. At
00:10:29.200 first, the
00:10:30.880 argument r
00:10:31.680 is stored in register 1, but the compiler
00:10:35.120 does not know what it is. It may be an
00:10:38.240 Integer or a Float, and it can be a String, so it
00:10:42.000 is colored in gray.
00:10:44.480 Next, the multiply operation. The
00:10:48.640 trace information shows that register 1
00:10:51.360 is an Integer. So we must
00:10:55.160 check that the value of register 1 is actually an
00:10:59.240 Integer. This is after checking: we know
00:11:03.440 register 1 is an Integer, so register 1 is
00:11:06.320 colored in yellow. This is an Integer. Yes.
00:11:10.399 We multiply register 1 by itself and
00:11:14.240 get the integer result here,
00:11:19.120 stored in the general-purpose
00:11:21.760 register
00:11:23.320 R15. Then we must check the result: if an
00:11:27.360 overflow
00:11:28.680 occurred, we must deoptimize,
00:11:31.519 fall back to the interpreter, and do it
00:11:33.880 again. Check the overflow. If no overflow
00:11:38.200 occurs, link register 2 to this R15
00:11:44.120 register. Next, load the literal
00:11:48.120 3.14 to register 3. We know it is a Float.
00:11:54.120 Okay, so just change the state. No code is
00:11:57.440 generated; just link register 3 to
00:12:02.320 the literal
00:12:04.920 3.14. Multiply again, and now we know register
00:12:10.399 2 is an
00:12:13.160 Integer, so it is colored yellow, and
00:12:16.639 register 3 is a Float and colored blue, and
00:12:20.160 we can omit guards for both
00:12:23.320 sides. So we must load the values into
00:12:27.920 floating-point registers for the
00:12:30.680 calculation, like
00:12:34.120 this. Here, multiply
00:12:38.519 XMM2 by XMM3 and store the result in
00:12:43.120 XMM4.
00:12:44.519 Of course we know all of them are
00:12:47.360 Float, so they are colored
00:12:51.160 blue. Then link:
00:12:54.200 the return register is number 2, so we
00:12:58.240 must link register 2 to
00:13:02.360 XMM4. And finally we must return
00:13:06.959 register 2, but register 2 is stored in
00:13:10.959 XMM4.
00:13:12.639 It's an unboxed float, and we must convert it from a
00:13:15.920 float to a Ruby
00:13:20.600 value. So we convert and return. Well
00:13:25.440 done.
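A small Rust sketch of the guard-elision idea from this walkthrough: a class guard is emitted only while the class of a register is still unknown. The names and the textual "assembly" are purely illustrative, not monoruby's output.

```rust
// A class guard is emitted the first time a register of unknown class must
// be treated as an Integer; once the guard has passed, the register is
// marked as Integer and later uses skip the check, as happened with
// register 1 in `r * r`.
#[derive(Clone, Copy, PartialEq)]
enum Class {
    Unknown,
    Integer,
    Float,
}

fn use_as_integer(known: &mut Class, asm: &mut Vec<String>, reg: usize) {
    if *known != Class::Integer {
        // Emit a guard that deoptimizes to the interpreter if reg is not an Integer.
        asm.push(format!("guard_integer r{reg} -> deopt"));
        *known = Class::Integer; // from here on the class is proven
    }
    // ...the unboxed integer can now be used directly...
}

fn main() {
    let mut class_of_r1 = Class::Unknown;
    let mut asm = Vec::new();
    use_as_integer(&mut class_of_r1, &mut asm, 1); // first use: guard emitted
    use_as_integer(&mut class_of_r1, &mut asm, 1); // second use: no guard needed
    assert_eq!(asm.len(), 1);
}
```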
00:13:27.399 Yes, this is another topic for
00:13:30.760 optimization. It's
00:13:32.600 specialization. Think
00:13:34.959 about a method,
00:13:37.320 Array#each. The problem is, we cannot
00:13:41.360 know which block is given by the
00:13:44.680 caller, and so we cannot know the signature
00:13:48.800 of the given block. The signature means the
00:13:51.920 details of the parameters that a method
00:13:54.880 or block
00:13:56.600 has. It means how many
00:14:00.000 positional arguments and optional
00:14:02.160 arguments, and what kind of keyword
00:14:04.760 arguments. It is necessary to know the
00:14:07.199 caller's number of arguments and the callee's
00:14:09.600 parameters at compile time for
00:14:13.120 efficient code generation;
00:14:15.920 otherwise we must do a lot of things at
00:14:18.399 runtime.
00:14:21.120 So the idea is to take the caller method, the callee
00:14:24.000 method like Array#each, and the block,
00:14:27.760 and compile them all at once, and to write
00:14:31.199 Array#each itself in Ruby, not Rust, to use the
00:14:35.399 information efficiently.
00:14:39.360 This is Array#each in Ruby, this
00:14:43.440 is its bytecode, and this is the caller
00:14:46.839 side. Look at
00:14:50.760 this: there's block_given?.
00:14:54.079 This is a method, but we can optimize it
00:14:57.680 using inline assembly.
00:15:00.480 This is the code for Kernel#block_given?,
00:15:05.720 and it's a lot of code. Sorry. And if we
00:15:09.600 know the block is given in this specific
00:15:15.000 specialization, no code is
00:15:18.120 generated; we just change the state. And if
00:15:22.000 not, we can generate
00:15:23.880 simpler code
00:15:26.360 directly. So in this specialized context
00:15:30.480 we know block_given? is always true.
00:15:34.160 So we can remove
00:15:36.880 these instructions, these
00:15:39.199 conditional branches, and this basic
00:15:42.160 block
00:15:45.160 itself. There are
00:15:48.199 other methods, Array#size or Array#[], which
00:15:53.440 we can
00:15:55.800 inline; we can substitute inline
00:16:02.120 assembly. So generally, the
00:16:05.639 last thing is yield, and yield is very
00:16:11.240 slow, because we cannot know which block
00:16:14.240 is given at compile time, and we
00:16:17.279 cannot know the signature of the callee,
00:16:20.720 and we must use an indirect branch,
00:16:24.079 which is hard on the branch predictor of the
00:16:29.079 CPU. But in this specialized context
00:16:33.519 we know the signature of the callee block at
00:16:35.440 compile time, so we can generate efficient code.
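One way to picture the specialization described here is a cache of compiled code keyed by the callee method and the signature of the block passed at the call site. The following Rust sketch uses hypothetical types, not monoruby's actual implementation.

```rust
// Instead of compiling Array#each once for every possible caller, the
// compiler keys the compiled code on the signature of the block actually
// passed at this call site, so `yield` can become a direct call and
// `block_given?` a compile-time constant inside that specialized copy.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct BlockSignature {
    required: u8,      // positional parameters the block declares
    has_keyword: bool, // whether it takes keyword arguments
}

type FuncId = u32;
type CodePtr = usize; // entry point of generated machine code

#[derive(Default)]
struct SpecializationCache {
    // (callee method, signature of the given block, or None if no block) -> JIT code
    entries: HashMap<(FuncId, Option<BlockSignature>), CodePtr>,
}

impl SpecializationCache {
    fn get_or_compile(
        &mut self,
        method: FuncId,
        block: Option<BlockSignature>,
        compile: impl FnOnce() -> CodePtr,
    ) -> CodePtr {
        // Each (method, block shape) pair gets its own specialized code.
        *self.entries.entry((method, block)).or_insert_with(compile)
    }
}
```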
00:16:38.880 So this is a benchmark
00:16:42.959 for
00:16:44.759 specialization, and it shows
00:16:48.240 iterations per second. The higher, the
00:16:50.519 better. Each group of bars is a method like
00:16:53.600 Integer#times, Integer#step, or
00:16:57.240 Array#map,
00:16:59.560 and the left bars show monoruby,
00:17:04.559 while the blue bars show CRuby
00:17:10.839 3.4.2.
00:17:12.679 Yes. And the pink bars show monoruby
00:17:16.720 without this optimization.
00:17:21.160 So you can see
00:17:27.400 that there is an improvement in
00:17:33.799 performance. And
00:17:38.440 in the red bars, monoruby's methods
00:17:43.760 are written
00:17:46.120 in Ruby, and in the pink bars, before this
00:17:50.720 optimization, the
00:17:52.679 methods are written in Rust.
00:17:57.480 On the other hand, the green bars, for the
00:18:01.919 interpreter, show it getting slower after this
00:18:09.480 optimization.
00:18:14.280 So, I'll show some demos.
00:18:30.720 This is some idiotic Ruby
00:18:33.720 code, and we execute it
00:19:02.720 like this.
00:19:05.200 Thank you very
00:19:06.360 much. Thank
00:19:09.000 you. So, enjoy the rest of the conference. Thank you very
00:19:12.799 much.