00:00:13.280
Hi everyone. Today I'm going to talk
00:00:16.640
about deoptimization: how YJIT speeds up
00:00:20.000
Ruby by slowing
00:00:22.199
down. My name is Kokubun. I work on the
00:00:26.240
YJIT team at Shopify. The YJIT team is part of the
00:00:29.519
Ruby infrastructure team. We maintain a
00:00:33.440
lot of Ruby-related open source full-time
00:00:36.320
and improve the ecosystem.
00:00:41.320
The news is we're no longer working on
00:00:44.719
YJIT. These days we are working on
00:00:48.320
another new compiler called ZJIT.
00:00:51.039
Maxime is going to give a talk about it
00:00:53.520
tomorrow, so if you're interested,
00:00:57.360
go to the talk. But today's
00:01:01.520
topic is still relevant to ZJIT as
00:01:03.840
well. The technique we are going to
00:01:07.119
talk about today is going to be
00:01:09.600
necessary for that compiler too, so don't
00:01:12.560
worry about it. Another thing is we are
00:01:16.000
also hiring. It's kind of a rare
00:01:18.400
position. If you're interested in working
00:01:20.159
on an open source project full-time, the
00:01:23.040
position is open. This QR code is
00:01:25.280
specific to RubyKaigi attendees, so
00:01:28.240
scan the QR code and apply for it now,
00:01:30.720
because the positions are limited and
00:01:33.360
if it's gone, it's gone. If you also
00:01:36.640
want to work on ZJIT, now is the best
00:01:38.799
time because it's not finished yet, so
00:01:41.520
you can build the infrastructure
00:01:43.680
and the core implementations of it. So
00:01:46.799
yeah, let's build ZJIT together.
00:01:51.119
I also maintain the latest stable branch
00:01:53.680
of Ruby releases. I just released 3.4.3
00:01:57.680
this Monday. We also kind of
00:02:00.399
collaborate on backporting things to
00:02:02.960
the Ruby 3.4 branch inside the
00:02:05.840
Ruby infrastructure team. I still
00:02:08.000
maintain the merge of everything, but
00:02:10.560
we share the maintenance of this
00:02:13.120
branch this
00:02:14.680
year. So today's talk is about YJIT. YJIT
00:02:18.400
stands for yet another just-in-time
00:02:21.319
compiler, and a just-in-time compiler does
00:02:24.720
optimization
00:02:26.920
by switching from the interpreter,
00:02:30.480
which interprets the virtual machine
00:02:32.640
instructions that are specific to the CRuby
00:02:35.319
implementation, to machine code that can be
00:02:38.480
natively executed by an Intel or ARM
00:02:42.360
CPU. The performance today looks like this.
00:02:47.120
This is speed.yjit.org, our website,
00:02:51.120
which shows YJIT's performance
00:02:53.599
compared to the interpreter on various
00:02:55.440
benchmarks. For example, on railsbench,
00:02:58.080
which performs Active
00:03:00.879
Record queries and does HTML rendering,
00:03:04.599
it's twice as fast as the interpreter
00:03:07.519
today. So even with the IO, we
00:03:11.840
can do a lot of optimizations on those
00:03:14.120
benchmarks. These are still benchmarks
00:03:17.200
and not real-world workloads, but
00:03:20.159
in actual production workloads, this
00:03:23.040
is the performance of Storefront
00:03:25.519
Renderer, which has the highest
00:03:27.360
traffic in our company, and it shows
00:03:30.480
an 18% speedup on average and a
00:03:33.599
33% speedup on, like, region. We
00:03:37.840
also deploy YJIT on Shopify's
00:03:41.120
largest monolith as well, so it's a
00:03:43.120
production-ready YJIT, I mean, sorry, a
00:03:44.720
production-ready JIT compiler. And
00:03:47.120
not only Shopify but
00:03:49.760
also other companies enable YJIT in
00:03:52.080
production as well. These are just the
00:03:53.760
articles that were written this year, but
00:03:56.319
from last year and years ago
00:03:58.959
we've seen a lot of other articles that
00:04:01.280
said they enabled YJIT in production, so please
00:04:04.640
do that if you haven't. Another
00:04:07.920
thing is, from Rails 7.2, if you're
00:04:12.159
using Ruby 3.3 or newer,
00:04:16.239
it's enabled by default by Rails. So
00:04:19.519
if you upgraded Rails to 7.2 or newer and
00:04:23.600
Ruby to 3.3 or newer, and you switched the
00:04:26.479
defaults to 7.2 or newer, then you might
00:04:29.199
already be running YJIT in production
00:04:31.040
without
00:04:32.759
noticing. Today's focus is going
00:04:36.320
to be on something called
00:04:38.320
deoptimization.
00:04:40.040
If you have attended RubyKaigi
00:04:44.240
before, you might remember this RubyKaigi
00:04:47.360
2016 talk called Optimizing Ruby,
00:04:50.320
done by Shyouhei. That was about
00:04:56.400
optimizing the Ruby interpreter by doing
00:05:00.720
something called deoptimization.
00:05:02.880
That idea was pretty interesting, and
00:05:06.800
it didn't end up being merged to the
00:05:08.720
master branch, but today, as of Ruby 3.4 or
00:05:13.600
3.5, we have a mechanism
00:05:16.479
called deoptimization in the master branch, and
00:05:18.320
it's already released in 3.3, 3.1, 3.2. Yeah,
00:05:22.400
every
00:05:23.320
release has the mechanism called
00:05:26.680
deoptimization. So what is it? So what
00:05:30.880
if you can slow down Ruby at any time?
00:05:33.919
You are probably not interested in
00:05:35.759
slowing down Ruby's performance, but
00:05:39.039
let's say you want to do some
00:05:40.479
optimization, and if you can cancel it at
00:05:43.360
any time, the optimization could be
00:05:45.520
anything. You can do whatever you
00:05:48.080
want and just throw it away if it's
00:05:50.600
necessary. That way we can kind of
00:05:54.240
forget about Ruby's dynamic features
00:05:56.400
that prevent Ruby from being faster.
00:05:59.759
That's what we call
00:06:02.919
deoptimization. To kind of give you
00:06:06.240
hands-on experience, if you are
00:06:09.360
interested, you can try building Ruby
00:06:12.319
with the configure flag --enable-yjit=dev.
00:06:14.720
If you build Ruby that way,
00:06:18.479
CRuby is going to support this extra
00:06:21.199
command-line flag called --yjit-dump-disasm,
00:06:23.919
which shows the machine code for
00:06:26.240
every single method you have compiled
00:06:28.319
with YJIT. For example,
00:06:32.000
if you build Ruby with --enable-yjit=dev,
00:06:35.600
-v with --yjit is going to show +YJIT dev.
00:06:39.840
It's usually just +YJIT,
00:06:41.759
but if it's +YJIT dev, it
00:06:44.080
means it's a development mode of YJIT,
00:06:46.000
so you can show the machine code with the
00:06:48.400
--yjit-dump-disasm option. The example
00:06:51.840
is like this. I'm going to talk
00:06:54.960
about similar code in the next
00:06:57.520
slides, but we just define a method,
00:07:01.120
call it, and redefine it, and then it
00:07:03.840
shows the machine code that was
00:07:06.319
generated for deoptimization purposes.
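
The slide code is not part of this transcript. A minimal sketch of a script one might feed to a dev build, assuming the constant example the talk walks through next (the names FOO and foo, and the choice to redefine the constant rather than the method, are illustrative assumptions), could look like this:

```ruby
# Sketch only: names are illustrative, not taken from the slide.
FOO = 1

def foo
  FOO
end

before = foo # a JIT may embed the constant's current value in machine code

# Redefining the constant has to invalidate any code that inlined the old value.
Object.send(:remove_const, :FOO)
FOO = 2

after = foo
puts before, after
```

On a build configured with --enable-yjit=dev, running something like `ruby --yjit --yjit-call-threshold=1 --yjit-dump-disasm script.rb` should print the generated machine code; what exactly is printed depends on the build and the platform.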
00:07:11.560
Next I'm going to talk about the
00:07:15.280
traditional YJIT
00:07:17.199
deoptimizations that have existed since
00:07:19.680
Ruby 3.1. The major thing we do
00:07:23.919
for deoptimization is called code
00:07:26.039
patching. Because we maintain
00:07:30.560
every single bit of machine code we
00:07:32.560
generate, we know exactly which address
00:07:36.639
should be invalidated when we need to do
00:07:39.039
so. For example, let's say you have
00:07:43.680
Ruby code like this: you define a
00:07:45.919
constant called FOO and it's 1, and just
00:07:49.120
define a method called lowercase foo and
00:07:51.759
just refer to the constant. The method
00:07:55.039
foo should return 1. YJIT just
00:07:59.360
looks at the content of the constant and
00:08:02.479
embeds the actual content of the
00:08:05.199
constant, like 3. As you may
00:08:09.199
know, an integer is left-shifted once and tagged with the low bit,
00:08:14.000
so integer 1 is a 3 in
00:08:16.560
the machine code. But if you do
00:08:20.160
this, then the machine code will always
00:08:23.280
return integer 1 from the
00:08:26.240
method foo, but that's not necessarily true
00:08:29.800
if the constant is redefined, because
00:08:32.560
Ruby's constants are just another kind
00:08:34.320
of global variable. It's not actually a
00:08:36.959
constant and could be redefined at
00:08:39.039
any time. When that happens, what
00:08:41.760
YJIT does is just patch the code and
00:08:45.519
rewrite it to another instruction called
00:08:47.760
jump. If we do this, the jump
00:08:51.120
could go to anything. For example, the
00:08:55.760
thing we actually do is jump to the
00:08:59.080
trampoline, which
00:09:01.600
goes back to the interpreter
00:09:03.120
from the JIT code. By doing this, we
00:09:06.320
can cancel the optimization of inlining
00:09:08.640
the content of the FOO constant and then
00:09:11.680
go back to the interpreter when it
00:09:14.720
executes this method
00:09:17.800
again. I'd also like to introduce
00:09:21.440
another deoptimization called
00:09:24.320
global invalidation. This is another piece of Ruby
00:09:28.880
code. I want you to think about what it
00:09:32.120
returns.
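
The slide itself is not in the transcript. A sketch that matches the behavior described in this part of the talk, assuming a method named number with a local variable named one (both guessed from the speaker's wording), might be:

```ruby
# Reconstruction under assumptions; the actual slide code may differ.
def number
  one = 1
  one
end

# A :line TracePoint runs its block before each new line executes, and it can
# rewrite the locals of the traced frame through that frame's binding.
trace = TracePoint.new(:line) do |tp|
  next unless tp.method_id == :number
  tp.binding.local_variable_set(:one, 5_000_000_000_000_000)
end

# enable with a block traces only while the block runs and returns its value.
result = trace.enable { number }
puts result
```

The hook fires again on the line that reads the local, after the assignment has already run, which is why the assignment inside the method does not decide the return value.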
00:09:33.640
Raise your hand if you think this is
00:09:36.480
not going to return
00:09:39.399
1. You win. Of course it's going to
00:09:44.160
return 5,000 trillion, because it's Ruby.
00:09:48.560
So
00:09:51.399
there are thousands of features that break
00:09:54.399
this kind of optimization in Ruby, and
00:09:58.640
this example has a TracePoint, in
00:10:02.000
particular a line TracePoint. If
00:10:04.959
you define a line TracePoint event,
00:10:07.600
when you execute another line you hook
00:10:10.480
the execution of Ruby and can do
00:10:12.720
anything there. In this example,
00:10:16.480
when you call the number method, after
00:10:19.760
executing one = 1,
00:10:22.160
it's
00:10:24.560
going to rerun the block of the
00:10:27.600
TracePoint and set the local variable
00:10:30.959
again to 5,000 trillion, so of course
00:10:34.399
it's going to return the number we
00:10:36.560
generated in the TracePoint.
00:10:38.839
When we do that, because TracePoint
00:10:42.720
messes up everything, we just throw
00:10:45.760
away everything. This example,
00:10:47.920
it's not the code we just
00:10:50.800
showed, it's just another
00:10:53.680
benchmark, shows that when the TracePoint line
00:10:56.800
event is enabled, we just generate
00:10:59.279
thousands of jump instructions to the
00:11:01.440
side exit code to the interpreter.
00:11:06.079
So yeah, this is what happens if you use a
00:11:08.079
line TracePoint. Today's takeaway is:
00:11:10.399
don't use a line TracePoint in
00:11:15.240
production. These are the existing
00:11:18.800
traditional deoptimizations we've had in
00:11:21.519
YJIT for years, but I'm also going to
00:11:25.360
talk about new deoptimizations that
00:11:28.560
we added to Ruby
00:11:30.760
3.4. The first thing I want to talk
00:11:33.519
about is the invalidation of escaped locals.
00:11:37.760
This is another example. Obviously
00:11:39.760
it's going to return 5,000 trillion, but
00:11:42.480
I'm not going to use TracePoint.
00:11:44.800
Can you guess what kind of Ruby
00:11:46.480
features you could use in the do
00:11:51.000
something? Somebody said binding,
00:11:53.839
and yeah, it is. Not just the
00:11:57.279
binding of the current frame, but any
00:11:59.920
random Ruby method could look at the
00:12:02.480
caller frame, an arbitrary caller frame,
00:12:05.120
using the public, official C
00:12:08.880
API we provide for messing up Ruby.
00:12:11.839
You can look at the caller frame and
00:12:15.440
then mess up the frame, basically. So
00:12:18.880
whenever you call some arbitrary method
00:12:20.800
that's not inlined into your JIT code, it
00:12:23.519
could just mess up the Ruby local
00:12:25.839
variables, so it's not guaranteed that the
00:12:28.639
Ruby local variables are not going to be
00:12:31.360
modified by the callee methods, even if
00:12:34.639
you are not passing the block to the
00:12:37.440
caller, sorry, the
00:12:39.560
method. With that, we introduced a
00:12:43.519
new optimization called local
00:12:45.360
variable register allocation to Ruby 3.4. We
00:12:49.040
used to just write to memory. In
00:12:51.920
Ruby 3.3 we were writing only to
00:12:55.279
memory when we needed to deal with
00:12:58.000
Ruby locals, but from 3.4 we use registers
00:13:02.000
as if it were a regular compiler, and
00:13:05.040
of course we are going to throw away the
00:13:07.519
code when the binding is fetched by
00:13:10.079
any callee or
00:13:13.560
wherever. Another example I want
00:13:17.120
to talk about is called invalidation on
00:13:19.279
singleton classes. It's another
00:13:21.839
deoptimization we added to 3.4. In this
00:13:24.800
example, the example method defines
00:13:28.639
the string local variable and then
00:13:34.040
concatenates another string returned
00:13:37.279
from the define method, that's bar. If
00:13:40.480
you don't specify true for the
00:13:43.839
flag, then it's going to concatenate the
00:13:47.360
empty string and bar, so it's going to
00:13:48.880
return bar. The next step is going to
00:13:52.880
set the flag to true, so it
00:13:55.279
redefines the String#+ method on
00:13:58.240
that specific string object. Instead
00:14:01.519
of concatenating the empty string and
00:14:04.160
bar, it's going to just return the foo
00:14:06.600
string. If you run this script with the
00:14:10.160
interpreter, the correct result is
00:14:12.720
going to be to print bar first and then
00:14:15.199
foo next.
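
The script on the slide is not captured here. A sketch consistent with the description, where the helper name define_method_on and its signature are assumptions made for illustration, could be:

```ruby
# Hypothetical helper: optionally redefines + on this one string object,
# then returns "bar".
def define_method_on(str, flag)
  if flag
    def str.+(_other)
      "foo"
    end
  end
  "bar"
end

def example(flag)
  # The talk describes a plain string literal; .dup is used here only so the
  # sketch also runs when string literals are frozen.
  str = "".dup
  # The receiver str is evaluated before the argument expression runs.
  str + define_method_on(str, flag)
end

puts example(false)
puts example(true)
```

The important detail is the evaluation order on the last line of example: by the time the singleton method is defined, the receiver has already been evaluated as an ordinary string.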
00:14:18.199
If you think about how you
00:14:21.360
compile this in YJIT, as of evaluating
00:14:25.360
the receiver of the plus instruction, the
00:14:28.199
string is pushed to the virtual
00:14:31.600
stack, and as of that point, because it's
00:14:34.079
initialized by the string literal, of
00:14:36.079
course it's going to be a bare string.
00:14:37.680
Nothing has happened to the string yet, even
00:14:39.839
if the flag is true. Because the
00:14:43.120
string is pushed to the virtual stack
00:14:45.279
before executing the define method, the
00:14:48.800
JIT compiler thinks it's going to be a
00:14:50.720
bare string. After pushing that to
00:14:53.760
the stack, it's going to call the define
00:14:55.839
method, and because the flag is true, it
00:14:58.399
redefines the String#+ method on
00:15:00.880
that particular object, so we swap the
00:15:04.480
class field of the object to the
00:15:06.880
singleton class that has this special
00:15:09.760
method definition.
00:15:12.959
After doing so, because the JIT thinks it's a
00:15:17.440
string, it may think you don't need
00:15:21.920
to check if it's actually a string. If
00:15:24.480
we don't do anything about it, it could
00:15:27.680
just return bar, because we
00:15:30.800
skipped the redefinition and we
00:15:33.040
already compiled assuming that
00:15:35.920
String#+ is going to be the actual
00:15:38.560
String#+. This is a
00:15:40.880
miscompilation that could happen if you
00:15:42.560
don't invalidate on singleton classes.
00:15:46.000
What we do today with Ruby 3.4 is that
00:15:49.680
when singleton classes are created on
00:15:53.199
particular classes we track, for example
00:15:55.440
String, Array, and Hash, for
00:15:59.279
those objects, we check if any
00:16:03.440
singleton class is created for those
00:16:05.279
classes, and if there's any singleton class
00:16:07.759
for those classes, we just skip
00:16:11.839
eliding the type check for those
00:16:14.000
classes and invalidate when the
00:16:18.160
String singleton class is
00:16:21.399
defined. The next thing I want to talk
00:16:24.320
about is lazy frame push.
00:16:27.160
With this example, we have a string,
00:16:32.160
an empty string, and setbyte at 0 with just the
00:16:35.120
letter a. Because the empty string
00:16:40.079
doesn't have any length, you can't set
00:16:42.560
a character at index zero
00:16:45.279
because the length is zero, so it
00:16:48.320
should raise an exception like index
00:16:51.680
0 out of string. But YJIT
00:16:57.120
optimizes this String#setbyte method, so
00:17:00.880
we could make it behave like
00:17:04.400
this: if you don't have any
00:17:07.640
invalidation, YJIT could just skip pushing the
00:17:10.880
frame for setbyte and just
00:17:13.120
inline the optimized instructions for
00:17:15.520
the setbyte implementation. If
00:17:17.760
you do that, even if the argument is
00:17:21.120
invalid and should raise an error, because
00:17:24.079
we haven't pushed the method frame for
00:17:26.319
setbyte, the behavior could be
00:17:29.760
the lower one, which doesn't show
00:17:32.559
String#setbyte in the backtrace.
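
The expected behavior can be checked with a few lines of Ruby. This is a sketch rather than the slide's code: it passes the byte value of the letter a, and the variable names are illustrative.

```ruby
begin
  # Index 0 does not exist in an empty string, so setbyte must raise.
  "".dup.setbyte(0, "a".ord)
rescue IndexError => e
  message = e.message
  # The innermost backtrace entry should be the setbyte frame itself.
  top_label = e.backtrace_locations.first.label
end

puts message
puts top_label
```

If the method frame were never pushed, the innermost entry would be the caller instead, which is the wrong behavior described above.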
00:17:36.480
It's the wrong behavior. The thing we
00:17:39.840
introduced to 3.4 is the lazy frame push.
00:17:44.400
When setbyte
00:17:48.960
sees the invalid argument, it
00:17:51.919
internally calls the CRuby API that
00:17:57.200
allocates a new
00:18:00.520
exception object, and when we
00:18:04.160
reach the C function that allocates
00:18:06.320
the exception object, it calls back
00:18:09.440
the YJIT hook that checks if it has
00:18:13.360
registered the invalidation for that
00:18:16.720
program location. If it's already
00:18:20.240
registered, then we lazily push a
00:18:23.039
frame to the stack when
00:18:25.720
necessary. The way it works is, we
00:18:30.480
have, this is the first time showing
00:18:32.880
Rust code at RubyKaigi, but it
00:18:36.520
looks at the hash table that has
00:18:40.080
program counters as keys. A program
00:18:43.440
counter is like a one-to-one mapping to
00:18:45.520
a specific program location, and the
00:18:48.640
values are what we need
00:18:50.960
for pushing the frame. Basically, it
00:18:53.760
has the location of the receiver on the
00:18:56.000
stack and also has the method entry, so
00:18:58.640
that we can materialize the method,
00:19:01.520
sorry, the frame content. The
00:19:05.440
trick is, because any method
00:19:09.280
could be executed from the same location,
00:19:12.240
for example, even if you call the
00:19:15.039
setbyte method, depending on the receiver
00:19:18.160
object, setbyte could be
00:19:20.160
something different. For example, you can
00:19:21.840
redefine setbyte for String, or any
00:19:24.559
other receiver could be used, so the
00:19:28.480
program counter is not necessarily tied
00:19:30.720
to a specific method entry. But because
00:19:35.919
we kind of assume that setbyte is
00:19:39.600
not going to be a polymorphic call, we
00:19:43.919
assume that it's a one-to-one mapping,
00:19:45.840
and if it's not, we can just side
00:19:48.320
exit to the interpreter when that
00:19:49.919
happens. That way we can calculate
00:19:53.360
the frame content we need to push
00:19:57.760
lazily based on the program counter,
00:20:01.280
which we have to set for other reasons
00:20:04.240
like tracing and side exits.
00:20:07.520
So we set the program counter anyway,
00:20:10.080
so we use that for materializing the
00:20:12.720
frame
00:20:14.280
later. The next thing I'm going to
00:20:16.799
talk about is the YJIT-only Ruby
00:20:18.720
methods. As you may know, Ruby is
00:20:22.080
faster than C, so
00:20:24.520
we have method definitions like this for
00:20:28.320
Array#each. This is the
00:20:31.679
historical C implementation of Array#each.
00:20:34.320
We attempted to just replace that
00:20:36.720
with a Ruby-based Array#each, which we
00:20:41.200
didn't end up doing, but in the
00:20:44.840
microbenchmark the Ruby version of
00:20:47.280
Array#each was actually faster, even on
00:20:50.159
the interpreter. But if you run a
00:20:53.919
large enough
00:20:55.760
benchmark, then it actually performed
00:20:58.400
worse, which is why we are now
00:21:02.240
doing the switching
00:21:04.400
between the C and Ruby versions.
00:21:08.159
With YJIT, going from Ruby to C and
00:21:13.600
then going back to JIT code from the C
00:21:16.480
world is actually slow, so we don't
00:21:19.600
want to do the boundary
00:21:22.159
crossing. Especially C to Ruby is a
00:21:25.039
bad thing. To avoid that,
00:21:29.039
if you define Array#each in C, then
00:21:32.320
you must cross the boundary from
00:21:35.120
C to Ruby, because the block is often
00:21:38.000
just defined in Ruby. So we should
00:21:41.520
define Array#each in Ruby if you want
00:21:44.320
that not to happen.
00:21:47.240
This is the complicated code we
00:21:52.960
wrote that's actually shipped in
00:21:54.640
Ruby 3.4. It's actually not really
00:21:59.280
deoptimization, but it achieves the
00:22:01.520
same kind of purpose. What it does
00:22:04.880
is it checks if the core Array#each C
00:22:08.880
implementation has been redefined, and if
00:22:11.520
it's not redefined, then it moves on
00:22:14.640
to execute this content and just
00:22:17.520
undefines the C-based implementation and
00:22:21.240
defines the Ruby-based Array#each.
00:22:26.159
The thing I want to talk about is
00:22:29.240
this one. This is new in Ruby 3.4
00:22:34.240
and it's called the c_trace attribute.
00:22:37.520
This Primitive.attr! is
00:22:41.919
the special syntax that's
00:22:45.039
specific to the CRuby internal core
00:22:48.080
classes, which was developed by Koichi
00:22:51.280
before. This is not a
00:22:54.000
method call; it's just parsed by
00:22:57.360
Prism and then compiled into a
00:23:00.159
special binary that's only possible
00:23:02.159
inside the CRuby core classes. If you
00:23:04.880
have the c_trace attribute, this method
00:23:07.600
entry is going to have the flag called
00:23:10.320
c_trace. If it does have the c_trace
00:23:13.039
flag, what does that do? This is the
00:23:18.480
original behavior of an exception
00:23:20.960
raised inside the block given
00:23:23.600
to Array#each. If you raise something
00:23:26.320
inside the block given to Array#each,
00:23:29.440
the backtrace of course should have the
00:23:32.320
Array#each entry, and the interesting
00:23:35.200
part is,
00:23:37.000
because Array#each is defined in C
00:23:41.520
in this example, it shows the file
00:23:45.039
name -e, which is the
00:23:47.679
-e given to the ruby command.
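
A small script can show which location the backtrace reports for the each frame. This is a sketch under the assumption that it is saved to a file and run directly; the variable names are illustrative, and the label is matched in both its older and newer spellings.

```ruby
begin
  [1].each { raise "boom" }
rescue => e
  frames = e.backtrace_locations
end

# Older Rubies label the frame "each", newer ones "Array#each".
each_frame = frames.find { |loc| ["each", "Array#each"].include?(loc.label) }

puts frames
puts "each frame reported at: #{each_frame.path}"
```

When each is implemented in C, the path printed for that frame is the caller's file (or -e on the command line), not a file belonging to the method's own definition.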
00:23:51.679
What happens if we redefine that
00:23:55.679
in Ruby is usually like this. The
00:23:58.159
lower one is the version that defines
00:24:01.120
Array#each in Ruby. When we
00:24:04.960
reimplement something in Ruby, the
00:24:08.360
backtrace has the Ruby file name. I mean,
00:24:12.159
it's actually not a file name,
00:24:14.240
it's just <internal:array> for core
00:24:16.320
classes defined in Ruby, but the file
00:24:20.080
name is going to be different.
00:24:22.240
For some reason C methods just look at the
00:24:25.039
caller frame's location and show -e,
00:24:28.240
which is not actually the location that
00:24:30.240
defines the Array#each written in C, but anyway,
00:24:34.080
if it's rewritten in Ruby, this is
00:24:37.039
going to be different. This is
00:24:38.960
something we don't want to happen
00:24:41.600
because,
00:24:45.000
if this happens,
00:24:49.520
if you enable YJIT, because of the
00:24:54.400
with_yjit helper, we swap
00:24:58.640
the implementation of Array#each
00:25:00.960
only if YJIT is enabled. So what
00:25:05.600
happens in the real world is, if you
00:25:08.559
enable YJIT, suddenly your test cases that
00:25:12.880
match the backtrace are going to break.
00:25:15.679
For some reason people like to test
00:25:17.520
backtraces, so these tests are going to break
00:25:20.080
just because you enable YJIT. We
00:25:22.720
don't want that to happen, so to
00:25:25.919
prevent that we added the c_trace
00:25:27.760
attribute, and that's what we do for
00:25:30.240
fixing or dealing with this problem.
00:25:32.799
The switch from -e
00:25:36.799
to <internal:array> happens all the
00:25:38.799
time when you upgrade Ruby minor
00:25:41.360
versions, because minor versions can
00:25:43.200
have backward incompatibilities, but
00:25:45.760
switching between the
00:25:47.520
interpreter and YJIT should be as smooth as
00:25:49.679
possible, so the c_trace attribute
00:25:53.039
is what we have for fixing that
00:25:56.919
problem. I talked about it a bit fast,
00:26:01.200
but that now comes to the conclusion.
00:26:04.080
Deoptimization enables
00:26:06.039
speculative optimizations with lazy
00:26:09.120
invalidation, throwing away the code later,
00:26:11.679
and Ruby 3.4 YJIT optimizes
00:26:15.600
local variables and method calls
00:26:17.279
using the technique called
00:26:18.480
deoptimization, which is the
00:26:20.400
conclusion of today. Thank you for coming
00:26:22.240
to the talk.