Hello everyone. Thank you for coming. Thank you. I'm Monochrome. For the last several years I have been working on my own implementation of Ruby, named Monoruby. At the last RubyKaigi I presented an overview of Monoruby, and this time I want to talk about how it works and how to make it faster.

First of all, let me introduce myself. I am a general surgeon. My specialty is the detection and treatment of cancer in the digestive organs, such as gastric cancer and colon cancer. It's a kind of hardware engineering of the human body. I have a national license that allows me to debug, fix, cut, or stitch human beings. I also love Ruby, Rust, and JIT compilers.
I'd also like to mention my personal connection to this city, Matsuyama. Matsuyama is where my grandmother was born and where my father spent his childhood. My grandmother, of course, passed away many years ago. My father is 91 years old, is doing well so far, and plays golf three times a week, while his son sits all day long writing compilers and assembly and is not healthy.

So, let's get started. This is about Monoruby, and this is the GitHub repository.
It consists of yet another Ruby parser, a garbage collector, an interpreter, and a JIT compiler. The parser is very annoying, and we want to move to Prism someday. We have made some progress: gems are now supported, so we can require gems, and we are now struggling with Bundler.
Yes. And we must mention compatibility with CRuby. Of course there is a bunch of things to do, and many classes and methods are still missing, but here are the functionalities that especially affect performance. We support Bignum, Fiber, Binding, and redefining basic operator methods like Integer#+. On the other hand, we do not support C extensions. This is a big problem for now. We also do not support native threads (the Thread class) or Ractor, nor ObjectSpace, TracePoint, or refinements. And callcc will never be supported. I'm sorry.
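Supporting redefinition of basic operator methods matters for a JIT, because compiled code that inlined the built-in `Integer#+` becomes invalid the moment the operator is reopened. A minimal plain-Ruby illustration of the behavior that must be preserved (a deliberately silly redefinition for demonstration, not Monoruby internals):

```ruby
# Monkey-patching a basic operator: any JIT code that inlined the
# built-in Integer#+ must be discarded once this class is reopened.
class Integer
  alias_method :orig_add, :+
  def +(other)
    # A deliberately "wrong" addition, so the redefinition is observable.
    orig_add(other).orig_add(1)
  end
end

puts 1 + 2  # prints 4, not 3: the redefinition is visible everywhere
```

A JIT that inlines `+` therefore has to guard on the method version and deoptimize when it changes.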
This is a microbenchmark, showing the ratio to CRuby 3.4.2. The red bars show Monoruby: it is 3 to 10 times faster than the CRuby interpreter. And the pink bars show that the Monoruby interpreter is comparable to the CRuby interpreter.

Next, a benchmark on Optcarrot. This shows frames per second over time: this is the start point, and time goes on. The red line shows Monoruby, which achieves six or seven times the speed of the interpreters. The interpreters are here, the pink and light blue lines. This blue line is YJIT, and the green lines are TruffleRuby, showing a slow startup and nice peak performance.

We also have several build options for debugging and profiling. This is the profile option, with which we can see stats for deoptimization, recompilation, method cache failures, and so on. With another option we can check when, where, and why deoptimization occurred.
00:04:49.120
code I'm sorry And our bite code uh it's
00:04:54.560
16 or 32 bytes long for one instruction
00:04:58.720
This is at instruction and this is up
00:05:01.919
code This operance operance means a
00:05:05.600
register number of a left hand side and
00:05:08.320
right hand side and
00:05:10.120
destination And for method call uh this
00:05:12.880
op code and operance uh like
00:05:16.199
receiver and arguments and the number of
00:05:19.360
num number of arguments and
00:05:21.960
destination So uh take a look at the
00:05:24.560
blue part this blue part there there are
00:05:28.080
trace informations attached in the b
00:05:30.639
byte codes they're collected by
00:05:33.320
interpreter So for add instruction
00:05:38.000
there's a class id of left hand side and
00:05:40.400
right hand side for method call there is
00:05:44.560
uh reservers class and the uh method
00:05:49.520
call and method version So it means uh a
00:05:53.440
kind of inline method cache and this is
00:05:56.080
a class of receiver and id of a
00:06:00.080
colleague method that's stored there
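As a rough model of the trace information described above (a hypothetical Ruby sketch; the real layout is a packed 16/32-byte binary encoding in Rust):

```ruby
# Hypothetical model of bytecode instructions with trace-info slots.
# The interpreter fills the slots as it executes; the JIT compiler
# later reads them to decide which guards to emit.
AddInsn  = Struct.new(:dst, :lhs, :rhs, :lhs_class, :rhs_class)
CallInsn = Struct.new(:dst, :recv, :args,
                      :cached_class, :cached_method_id, :cached_version)

insn = AddInsn.new(2, 0, 1, nil, nil)  # r2 = r0 + r1, no trace info yet
insn.lhs_class = Integer               # recorded on first execution
insn.rhs_class = Integer
```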
00:06:04.639
So this is struct structure of stack
00:06:07.680
frame and if you call a certain method
00:06:10.600
block uh each stack frame is pushed and
00:06:14.000
the frame is popped when you
00:06:15.960
return the frame has a link to the uh
00:06:20.160
caller frame like a chain and also have
00:06:22.960
a link to a local
00:06:26.199
frame This is local frame Local frame
00:06:29.039
holds the self this registers register
00:06:33.080
zero and arguments and local v variables
00:06:36.400
and temporaries and the block given and
00:06:40.080
some
00:06:41.000
metadata It also has a link to the
00:06:45.720
outscope like
00:06:47.639
this and local frames are mostly placed
00:06:50.720
on the stack and push and pop But if
00:06:53.759
proc or binding object was made inside
00:06:56.080
the frame the local frame is copied like
00:06:59.840
this copied to heap for persistence
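The reason for the copy is that a Proc or Binding can outlive the frame that created it, so the locals it captures must survive after the frame is popped. For example:

```ruby
def make_counter
  count = 0          # lives in make_counter's local frame
  -> { count += 1 }  # the Proc captures `count`, so the frame's locals
end                  # must be copied to the heap to persist

counter = make_counter  # make_counter's stack frame is gone here,
counter.call            # but `count` is still alive inside the Proc
counter.call
puts counter.call       # prints 3
```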
And this is how the interpreter, the JIT compiler, and the generated JIT code work together. The interpreter fetches the bytecode, dispatches, and executes, in an infinite loop. If a certain method is called many times, or a certain loop is executed many times, we compile its bytecode to machine code. That is the JIT code, and we just jump to it.

JIT code is compiled with some assumptions. An assumption means, for example, that at this method call the receiver must be of class A. But Ruby is a dynamic language, so these assumptions sometimes break. If an assumption breaks, we must deoptimize. Deoptimizing means falling back to the interpreter and doing it again.

Yes. So a JIT compiler is an extremely complex system, and the only thing that justifies its complexity is performance. We must generate nice code.
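Here is a small, hypothetical Ruby example of the kind of assumption that can break: a call site that has only ever seen one receiver class suddenly sees another, so a class guard fails and execution must fall back to the interpreter.

```ruby
class A; def size = 1; end
class B; def size = 2; end

def total(items)
  items.sum(&:size)  # if this has only ever seen A instances, a JIT may
end                  # compile it assuming the receiver class is A

objs = Array.new(10) { A.new }
total(objs)          # warm-up: every receiver here is an A

objs << B.new        # now a B flows in: a "receiver is A" guard fails,
total(objs)          # and execution must fall back to the interpreter
```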
There are several hardware resources the CPU can use: the stack in main memory, which is slow, and the registers in the CPU, which are fast. In an x86-64 CPU there are roughly two categories of registers: general-purpose registers and floating-point registers. We must use the floating-point registers to calculate floating-point values.

In the interpreter, the value of each virtual register is stored on the stack. It's one-to-one; it's simple. But in the compiler, for optimization, we must track the state of each register. A state means where the value of the register is stored at runtime. It may be on the stack, like this; it may be in a CPU register, like this; or in a floating-point register, like this. So we must track where the value of each register is actually stored.

It is not always necessary to generate code for storing a value anywhere; sometimes we can just record it in the state. If the compiler knows the value of a register at compile time, like a float literal, no code needs to be generated to store the value; we just record it in the state.

So think about a small method, `area`, which calculates the area of a circle with radius r, and imagine you are the JIT compiler. At first, the argument r is stored in register one, but the compiler does not know what it is. It may be an Integer or a Float, and it could even be a String, so it is colored in gray.
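The method in question looks something like this (a sketch based on the walkthrough; the exact form is an assumption):

```ruby
# A circle-area method: r may be an Integer, a Float, or anything else,
# so the JIT must guard on its class before using a fast path.
def area(r)
  r * r * 3.14
end

area(10)   # Integer path: a guard checks that r really is an Integer
area(2.5)  # Float path: a different compile, or a deoptimization
```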
Next is a multiply operation, and the trace information shows that register one is an Integer. So we must check that the value of register one actually is an Integer. This is after checking: we know register one is an Integer, so register one is colored in yellow. This is an Integer. Yes.

Then we multiply register one by itself, get the integer result, and store it in a general-purpose register. We must check the result: if overflow occurred, we must deoptimize, fall back to the interpreter, and do it again. So, check the overflow; if no overflow occurs, link register two to this general-purpose register.

Next, load the literal 3.14 into register three. We know it is a Float at compile time. Okay, so just change the state: no code is generated, we just link register three to 3.14.

Multiply again. Now we know register two is an Integer (colored yellow) and register three is a Float (colored blue), so we can omit the guards for both sides. We must load the values into floating-point registers for the calculation, like this: multiply XMM2 by XMM3 and store the result in XMM4. Of course we know all of them are floats, so they are colored in blue. The return register is number two, so we must link register two to XMM4.

Finally we must return with register two, but register two is stored in XMM4 as a raw float, and we must convert it from a raw float to a Ruby value. So we convert and return. Well done.
Yes. This is another topic for optimization: specialization. Think about a method like Array#each. The problem is that we cannot know which block is given by the caller, and so we cannot know the signature of the given block. The signature means the details of the parameters a method or block has: how many positional and optional arguments, and what kind of keyword arguments. It is necessary to know the caller's number of arguments and the callee's parameters at compile time for efficient code generation; otherwise we must do a lot of things at runtime.

So the idea is this: take the caller method, the callee method like Array#each, and the block, and compile them all at once. And write Array#each itself in Ruby, not Rust, to use this information efficiently.

This is Array#each in Ruby, this is its bytecode, and this is a caller. Look at this: there is a block_given? call. This is a method, but we can optimize it using inline assembly.
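As an illustration, an `Array#each` written in Ruby might look roughly like this (a hypothetical sketch under an alias, not Monoruby's actual definition). When the caller, this method, and the block are compiled together, `block_given?` and the block's signature are all known at compile time:

```ruby
class Array
  def my_each             # stand-in name, so the real #each is untouched
    return to_enum(:my_each) unless block_given?
    i = 0
    while i < size        # Array#size can be replaced by inline assembly
      yield self[i]       # with specialization, the callee block is known
      i += 1
    end
    self
  end
end

acc = []
[1, 2, 3].my_each { |x| acc << x * 2 }  # acc becomes [2, 4, 6]
```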
This is the code for Kernel#block_given?; it's the Rust code, sorry. If we know the block is given in this specific specialization, no code is generated: we just change the state. If not, we can generate simpler code directly. So in this specialized context, we know block_given? is always true, and we can remove these instructions, these conditional branches, and this basic block itself.

There are other methods, like Array#size or Array#[], for which we can substitute inline assembly.

The last thing is yield, and yield is very slow, because we cannot know which block is given at compile time, we cannot know the signature of the callee, and we must use an indirect branch, which hurts the CPU's branch predictor. But in this specialized context, we know the signature of the callee block at compile time, so we can generate efficient code.
So, this is a benchmark for specialization. It shows iterations per second; the higher, the better. The bars are for methods like Integer#times, Integer#step, or Array#map. The red bars show Monoruby, and the blue bars show CRuby 3.4.2. The pink bars show Monoruby without this optimization.

You can see the improvement in performance. In the red bars, the Monoruby methods are written in Ruby; in the pink bars, before the optimization, the methods were written in Rust. On the other hand, the green bars, for the interpreter, show it getting slower after the optimization.
So let me show some demos. This is some idiotic Ruby code, and we execute it like this.

Thank you very much. Thank you. So, enjoy the rest of RubyKaigi. Thank you very much.