00:00:36.399
so most of japanese people uh talk in
00:00:39.680
english as the first uh first contact so
00:00:44.960
uh i try to uh say uh hello in japanese
00:00:49.280
but the this presentation i will uh
00:00:52.239
present in english so please
00:00:55.719
enjoy so today i'd like to talk about
00:00:58.960
garbage collection in luby more
00:01:02.160
specifically how we can move toward
00:01:05.680
lecture local garbage collection for
00:01:09.600
better
00:01:10.600
concurrency and performance so let's get
00:01:15.400
started so let me start with the
00:01:18.520
conclusion to uh of this talk so
00:01:22.680
introducing local di is
00:01:26.520
promising
00:01:28.600
direction but it's also a very
00:01:31.600
challenging
00:01:32.600
one the biggest issue is cross reference
00:01:38.240
as soon as objects start being shared
00:01:41.040
things get
00:01:42.759
complicated so we propose a conservative
00:01:46.600
approach we track all sharable objects
00:01:49.759
and uh treat them as gc roots as long as
00:01:54.479
the number of such object is small this
00:01:57.280
is a feasible approach it behaves the
00:02:00.719
same as the current garbage
00:02:03.079
collection which always does global
00:02:05.920
garbage collection
00:02:08.479
so we've observed s significant
00:02:11.680
performance improvement in
00:02:13.120
microbenchmarks with this approach so
00:02:16.480
i'll explain uh how we got to this point
00:02:20.720
in the following
00:02:23.720
slides uh before going into the
00:02:26.480
technical details i want to acknowledge
00:02:29.440
that the uh that the work is
00:02:32.800
collaboration with loit manon he was a
00:02:36.800
google sab code developer in
00:02:40.360
2022 and i had uh the pleasure of
00:02:43.760
mentoring him on the project uh titled
00:02:47.879
optimizing garbage question for ruby lus
00:02:51.760
so this is the first step of the l first
00:02:55.360
trial of the uh l local garbage
00:02:59.640
collection i initially proposed the core
00:03:03.200
ideas and we discussed uh it together so
00:03:08.000
loit continued developing his own
00:03:10.599
implementation extending my idea over
00:03:13.599
the next three
00:03:15.000
years i learned from his experiments and
00:03:18.959
developed my own garbage
00:03:21.720
collection so today i present uh my own
00:03:25.920
presentation but this uh work this
00:03:29.599
presentation is the collaboration work
00:03:31.840
with
00:03:35.080
light uh let me introduce myself so over
00:03:38.640
the year i'm co sasada uh over the years
00:03:41.840
i worked on several parts of luby
00:03:44.000
implementation and also i'm uh director
00:03:46.879
of luby association since
00:03:50.280
2012 and also uh an owner of the book at
00:03:54.799
second floor uh thank you for coming and
00:03:57.519
buying many copies of books
00:04:01.760
so before uh diving into the uh garbage
00:04:05.439
collection uh let me uh mention about my
00:04:08.080
proposal uh changing the lock api in
00:04:11.760
this
00:04:12.599
ticket so the idea is to introduce a new
00:04:16.400
communication primitive called lacta
00:04:18.320
port and duplicate the current lock and
00:04:21.840
lock take pair to communicate to each
00:04:24.759
other uh conceptually lock port is a
00:04:28.000
small enhancement of the mailbox concept
00:04:30.639
from the the actor model uh it provides
00:04:34.240
a cleaner clearer and more uh explicit
00:04:38.560
interface for uh for communication
00:04:41.520
between locks
00:04:44.440
so today's uh developers meeting so uh
00:04:48.880
the the today's morning session so there
00:04:51.759
there are many many compatibility
00:04:53.560
discussion with the ractor showing the
00:04:58.000
the warning so this uh this feature is
00:05:03.080
experimental feature so we can change
00:05:05.199
the api
00:05:08.039
easily so okay let me give uh you some
00:05:12.240
background on the garbage collection for
00:05:15.919
the
00:05:17.320
lers so l was introduced in ruby 3.0 a
00:05:21.840
new uh concurrent
00:05:25.000
model it was changed to uh enable two
00:05:28.800
things so at first the parallel
00:05:31.360
computing on ruby to take advantage of
00:05:35.120
parallel computers for faster ruby
00:05:38.600
applications second robust concurrent
00:05:42.520
programming so we as you know the sl
00:05:45.600
programming has many many issues so with
00:05:48.320
lactus isolated object spaces we provide
00:05:51.680
and we can avoid common concurren bugs
00:05:55.039
es especially those caused by shared mut
00:05:58.240
mutable uh
00:06:00.039
state so o overall recta gives us the
00:06:04.319
safer and faster way for concurrent
00:06:07.360
program in luby
00:06:11.280
so threads inside a single thre single
00:06:14.160
lockctor still need to acquire the gbl
00:06:17.759
which means the threads in locker don't
00:06:21.199
run in parallel and one thread in the
00:06:24.560
locker can long in lubric at the
00:06:29.479
time however each has its own gbl and
00:06:34.560
thread in different rockers can run in
00:06:37.520
parallel so on the parallel computers
00:06:40.560
such as the multiple cpu threads in
00:06:43.639
lockers one and lock two can be
00:06:46.639
scheduled independently and actually run
00:06:49.440
at the same time on different
00:06:54.280
cores each actor has uh in lubby has its
00:06:59.280
own isolated object space uh in
00:07:02.520
logically that means that every object
00:07:05.039
be belong to the exact one locker and it
00:07:08.400
cannot be accessed directly from other
00:07:11.360
lockers
00:07:14.360
basically so here uh here is a very
00:07:17.520
important rule about rockus so basically
00:07:20.720
you cannot access object in other rotors
00:07:23.840
this means that the lock one can't read
00:07:26.960
and write object from the
00:07:30.520
uh uh belong to lock two so this strict
00:07:35.360
isolation guarantees thread safety and
00:07:37.840
avoid bugs from shared mutable
00:07:41.240
states this is the foundation of for
00:07:45.120
robustness and safety
00:07:49.520
now if each doctor has its own isolated
00:07:52.240
object space then it seems natural to
00:07:55.680
learn garbage collection uh
00:07:57.800
independently or in parallel in each
00:08:01.560
ractor in the slide so each ractor has
00:08:05.120
its own
00:08:07.160
loot
00:08:09.240
loot and we can uh mark the uh all
00:08:13.280
object in each ractor and can uh garbage
00:08:16.479
collection independently and in
00:08:21.000
parallel so this idea what we call
00:08:24.560
laughter local garbage collection today
00:08:27.199
i want to
00:08:29.000
propose and on the surface it's very uh
00:08:33.120
it seems like very natural optimization
00:08:37.279
so it's not strange
00:08:39.800
one however so this is a problem so lux
00:08:44.600
can share certain object called sharable
00:08:49.160
objects so these include the classes and
00:08:52.720
modules closing objects that only
00:08:54.959
difference other shar objects and some
00:08:58.160
special classes such as the locker
00:09:00.160
objects themselves so in this example uh
00:09:04.399
both recta one and racta two are holding
00:09:06.959
references to the same sharable object
00:09:10.880
uh this uh lead
00:09:17.560
one and this is allowed and that to uh
00:09:22.320
totally fine this is allowed and so we
00:09:25.600
while most objects are isolated to
00:09:27.800
paractor shared objects act as
00:09:30.959
exceptions s as and can be referenced
00:09:34.720
safely from multiple
00:09:38.519
rockers now here is the key question so
00:09:42.080
if we allow each doctor to run garbage
00:09:45.440
collection
00:09:46.440
independently how should we deal with
00:09:48.959
sharable
00:09:50.440
objects so let's walk through this simp
00:09:53.760
uh simple example so step one l r r1
00:09:59.279
owns a sharable object
00:10:04.200
here and second the l two r2 has a
00:10:09.920
reference reference here this
00:10:13.800
one to the sh1 object and what when r1
00:10:19.279
learns its local garbage question it
00:10:21.600
doesn't seem reference to uh sh1 from
00:10:25.680
within r1 so r1 in r1 there is no
00:10:28.720
references to the uh sh1 so garbage
00:10:33.200
collector mistaken to free the sh1 but
00:10:38.399
we shouldn't free the sh1 because r2
00:10:42.079
needs this s sh1 object so this is a
00:10:45.600
problem
00:10:48.399
so with that background uh in mind so
00:10:51.279
let's move to the uh main challenges of
00:10:53.839
the uh implementing the lock local
00:10:56.839
gc so the first idea or ideal idea is to
00:11:03.360
track the reference to the sharable
00:11:05.279
objects across
00:11:06.839
lockers so if r1 knows so r1 knows the
00:11:11.040
ob sh1 is referenced from other lockers
00:11:14.640
then we can care about this reference
00:11:18.079
and the sh1 can be kept uh from
00:11:22.760
green so in this example uh sh1 is only
00:11:27.200
reachable from r2
00:11:32.000
so the challenge is how to maintain the
00:11:35.120
reference information such so the s sh1
00:11:38.399
is reference from the other
00:11:41.480
doctors so in theory uh tracking
00:11:44.640
reference between doctors uh sounds
00:11:46.720
reasonable but in practice it's very
00:11:52.279
difficult references can be updated at
00:11:55.360
any time and doctors learn in parallel
00:11:58.880
so this makes it hard to keep the
00:12:01.279
references uh reference state accurate
00:12:04.640
across all and if the tracking
00:12:08.079
information become out outdated or
00:12:10.639
incorrect the garbage collector might
00:12:13.440
miss a reference and accidentally free a
00:12:16.959
live
00:12:17.959
object so also from a performance
00:12:20.959
perspective tracking these uh interlock
00:12:23.920
references can introduce speed and
00:12:26.240
memory overheads
00:12:29.040
so let's talk about the tricky simple uh
00:12:32.320
sample scenario first uh imagine that
00:12:39.160
the sh1 a sharable object is sent to
00:12:42.959
from sent from ra one to l 2 and ra 2
00:12:46.880
holds reference to the sh1 and so we
00:12:51.920
send the objects so the the runtime
00:12:54.959
detect their reference move the
00:12:57.279
reference is uh passed to the lock two
00:13:03.440
so we can uh uh we can mark the s sh1 as
00:13:08.160
marked reference uh the reference to
00:13:11.440
level uh to the sh1 and that information
00:13:15.680
is now used to avoid the freeing sh1
00:13:18.800
during gc in
00:13:21.079
local gc in l one so there is no problem
00:13:26.600
now and now l to reference the uh
00:13:31.480
sh3 because we we can reach the sh3 from
00:13:37.880
sh1 but here is problem here is the
00:13:41.000
problem the difference tracking into uh
00:13:45.279
uh sorry reference tracking information
00:13:48.560
still says sh1 is referenced so not a
00:13:52.800
reference uh to the
00:13:56.040
sh3 and it hasn't been updated to uh
00:14:00.079
includes this new reference to sh3 so
00:14:04.079
now sh3 will marked concurren uh
00:14:07.360
correctly
00:14:08.519
because so in this case so the sh1 is
00:14:12.320
marked as a reference so uh the uh
00:14:16.480
traversing marking correctors
00:14:21.079
uh mark the this is uh sh1 sh2 sh3 is
00:14:27.279
alive living object and there is no
00:14:29.839
problem just now so the problem is if
00:14:33.920
the sh2 remove the link to the sh3 so
00:14:37.839
there is no uh information uh that the
00:14:41.519
sh3
00:14:43.560
is should not be uh
00:14:47.160
freed
00:14:48.920
so it means that the ractor loc at the
00:14:53.279
r1 uh sh3 can be freed and after that
00:14:58.240
the uh the if the doctor 2 uh try to use
00:15:04.000
the sh3 then it was freed and it
00:15:16.040
caus many many effort many many
00:15:18.560
overheads we can travel uh we can get a
00:15:21.519
reference from the object uh uh object
00:15:25.120
sh3 reference to the sh3 but
00:15:30.120
uh there are many many uh facts that we
00:15:34.240
cannot introduce such a uh mechanism so
00:15:37.440
this so this is a challenge so let's uh
00:15:40.639
not solve this
00:15:42.680
challenge so because uh solution one
00:15:46.399
seems difficult and we were using the
00:15:49.440
global object space just now so the
00:15:51.600
current master branch or ruby from ruby
00:15:54.320
3.0 to the ruby 3.4 and master branch uh
00:15:59.519
it uses a global object space so that
00:16:02.480
means that there is no power locked heap
00:16:05.360
all objects with uh whether created in
00:16:08.880
r1 or r2 are managed
00:16:12.759
together when we need to learn the
00:16:15.440
garbage collector ruby has to stop all
00:16:19.000
reacts and then traverse from all roots
00:16:22.560
mark live object this is accurate no
00:16:26.800
problem but it's never miss uh but it is
00:16:30.720
slow because we need to stop the all of
00:16:33.519
the object and all of the doctors and we
00:16:36.079
need to uh traverse all of the object
00:16:38.800
between doctors and also we can't
00:16:41.440
utilize the uh
00:16:45.720
parism so this compares the timeline for
00:16:49.199
of the two gc uh strategies so solution
00:16:54.680
one local gc and
00:16:57.480
uh solution two is the current global
00:17:01.279
garbage collector so in local gc each
00:17:05.280
runs uh garbage collector independently
00:17:08.000
and that allows gc uh to learn in
00:17:11.319
parallel but as we uh we've seen it's
00:17:15.199
difficult to
00:17:16.679
implement in contrast the current
00:17:19.360
garbage global garbage collection uh
00:17:22.319
solution
00:17:26.839
too so all lockers one uh if one
00:17:32.640
uh requests a pose then the runtime wait
00:17:36.240
until the all ractor are stopped and
00:17:39.360
runs garbage collector on the shared
00:17:42.919
heap this is
00:17:45.080
simple and space efficient because free
00:17:49.360
slots can be shared between lockers
00:17:52.960
uh but unfortunately it's
00:17:56.760
slow and it block uh parallel solution
00:18:00.880
uh ex
00:18:02.840
execution so our goal is that we want to
00:18:06.400
introduce the local gc to improve the
00:18:08.400
performance and round garbage collection
00:18:10.240
in parallel and the independently for
00:18:13.919
each lecture
00:18:16.480
uh like how it's doing in alon or elixir
00:18:22.440
langages but the big uh challenge is
00:18:26.720
shadow objects and it difficult to track
00:18:29.360
reference that goes across
00:18:33.480
lus and that's exactly why lactishi has
00:18:38.240
been missing for
00:18:41.240
years it's very hard to get it right so
00:18:44.880
the core challenge here is how do we
00:18:48.559
design
00:18:50.360
uh safe and practical algorithm that
00:18:54.240
enable rato logos without breaking
00:18:57.360
anything
00:19:00.000
so now that uh we've seen the challenge
00:19:02.640
and how do we safely implement lock gc
00:19:06.320
especially with shadow objects so let's
00:19:08.960
dive
00:19:10.520
in our proposion is simple and
00:19:13.600
conservative one so we do not try to
00:19:16.799
detect the reference liess of sharing
00:19:19.760
objects
00:19:21.280
so it means that we assume that all of
00:19:24.000
the sharable objects marked as sharable
00:19:26.559
object are
00:19:28.360
living while uh local garbage
00:19:31.799
collection because it's really hard to
00:19:34.880
track close to refine safely so instead
00:19:38.799
we assume the share of objects are
00:19:41.600
always arriving during local gc so this
00:19:44.880
work because this works this algorithm
00:19:48.240
works because the number of sharable
00:19:51.440
object is in small and we treat them as
00:19:56.320
root object during lock local gc then
00:19:59.520
the two correct the sharable object when
00:20:03.320
needed we fall back to the global gc so
00:20:06.640
we already have a global gc so we can
00:20:09.120
reuse this one
00:20:16.400
so
00:20:17.640
which ps all rocks and checks every
00:20:21.520
references as i said so it is slow but
00:20:24.799
we can learn global gc less frequently
00:20:27.280
than current one
00:20:31.000
so this this is why i propose this one
00:20:35.159
uh local gc will be enough i think
00:20:40.960
so this figure shows how to share
00:20:43.600
sharable object for each
00:20:46.280
doctor we maintain set of we uh we
00:20:50.320
maintain a set of sharable objects so
00:20:53.760
during local garbage collection we treat
00:20:56.720
these as root objects and mark them as
00:21:00.000
if they are always
00:21:02.520
alive so this is accurate we never miss
00:21:06.159
live references but there there is a
00:21:09.520
trade-off so if sharable object uh no
00:21:12.720
longer reachable uh won't be corrected
00:21:16.080
until global gc
00:21:18.200
happens so that means that uh memory for
00:21:22.799
uh such objects may stay uh around
00:21:26.120
longer but the benefit is we can safely
00:21:29.760
run this in parallel
00:21:33.080
paractor so in practice most of sharable
00:21:36.640
object are long lived for
00:21:38.840
example classes modules methods and bite
00:21:43.360
codes are of course long living
00:21:48.760
objects and ra objects themselves also
00:21:51.919
tend to live long uh longer so maybe uh
00:21:56.640
naturally the the most of shadow objects
00:22:00.000
will be long lived objects
00:22:03.960
so we can uh so we decided to uh employ
00:22:09.120
this
00:22:11.559
idea so this is proposed timeline with
00:22:14.960
our uh conservative
00:22:17.400
approaches so sometime so most of case
00:22:21.280
so we learn the local gc independently
00:22:24.240
in parallel and uh infrequently we you
00:22:27.520
we need to use the global
00:22:30.919
gc from uh from this work we uh from
00:22:35.440
this work we learn key insight so done
00:22:39.520
is better than perfect so we can't
00:22:43.760
introduce the lock local gc because
00:22:46.000
solution one is can't be uh uh completed
00:22:49.679
so we we discussed several years
00:22:53.640
but so this our the today's proposal is
00:22:58.720
not the best but it improves the
00:23:00.880
performance so we need to uh introduce
00:23:03.919
this is it fast and after that we can
00:23:07.280
discuss solution
00:23:09.400
too so this is my study of uh what i
00:23:13.760
learned from this work
00:23:18.000
so i implemented the local gc with these
00:23:21.799
steps but let's skip details because
00:23:24.960
there is no time to uh talk enough so
00:23:29.280
uh so i i do many many things to do
00:23:33.000
that and then we have uh many techniques
00:23:37.280
to uh improve the performance but i also
00:23:40.240
skip this slide so i will publish this
00:23:42.720
slide so if you have interest please
00:23:45.360
check it or ask me
00:23:47.480
later so from here i'd like to talk
00:23:50.720
about the evaluation we measures the
00:23:53.280
performance of our proposal through the
00:23:56.760
microbenchmarks comparing it against the
00:23:59.360
current lub implementation so here is
00:24:02.240
the benchmark u environments i use 40 14
00:24:07.760
cores with six performance core core and
00:24:11.200
eight deficient cores of inter
00:24:14.360
cpu for microbenchmarks we repeat simple
00:24:17.760
task end times across ractors so each
00:24:20.640
ractor pro performs tn tasks where tn
00:24:24.799
defined as n divided by the number of
00:24:27.480
ractors the benchmark script is
00:24:29.760
available this uh g
00:24:32.520
page so first of all we show the uh
00:24:36.799
simplest one so each repeat allocate
00:24:40.440
show live string objects so as shown in
00:24:44.880
code snippets here
00:24:47.799
here so it just create and discard empty
00:24:51.520
string tn times on each rocks on the
00:24:55.279
x-axis the n it is the number of rocks
00:24:58.960
and the y-axis uh we have ex we have
00:25:03.200
execution time in seconds so lower is
00:25:07.080
better and the blue line shows the
00:25:09.760
current uh lubby master and the orange
00:25:12.400
line is local distribution you can
00:25:14.960
clearly see that with more actors our
00:25:18.320
version scale scales better so this is
00:25:22.240
same benchmark as before but we show
00:25:24.799
looking at the speed up ratio compared
00:25:27.520
with the uh single l execution time so
00:25:30.799
higher is better here as you can see
00:25:33.919
those the orange lines for our modified
00:25:37.279
version scales quite well achieving
00:25:39.840
nearly five times speed up with 16 lers
00:25:44.480
in contrast the master version shows
00:25:46.960
almost no parallel benefits that's
00:25:49.520
because all are stopped at for global gc
00:25:52.960
eliminating the the advantage of par
00:25:55.760
execution of ractors this result shows
00:25:58.880
that the local dish work effectively for
00:26:02.720
local short uh local
00:26:06.039
short with shortlived
00:26:09.559
objects so next uh example is making a
00:26:15.440
long live long long lived object so this
00:26:19.279
cause the uh make a many array but al
00:26:23.919
doesn't read
00:26:28.640
so so the master version uh shows the
00:26:33.120
career bottleneck so as we increase the
00:26:36.480
number of rockers the execution becomes
00:26:39.559
worse even if it's long in parallel in
00:26:43.760
contrast our local gc version handles
00:26:47.360
this very
00:26:48.919
efficiently so this is a speed up ratio
00:26:51.760
of the the last benchmark so higher is
00:26:55.200
better for same benchmark again master
00:26:58.159
branch doesn't scale performance
00:27:00.799
actually get worse with more octas
00:27:04.000
that's because every
00:27:06.480
uh contribute to triggering global gc
00:27:09.520
more frequently in contrast the local gc
00:27:15.400
version local dish version uh uh
00:27:20.240
continue to scale achieving over four
00:27:22.640
times speed up with 16 rafters
00:27:26.640
this result shows how local helped long
00:27:29.520
with locked local
00:27:32.039
objects here is a benchmark long in
00:27:35.279
regular simple regular expression
00:27:38.039
matching and it also shows a good
00:27:41.200
performance before uh compared with uh
00:27:44.480
the the master one so this shows the
00:27:47.840
speed up pressure and orangey one uh is
00:27:51.120
modified version so higher is better and
00:27:53.600
the uh the our local gc shows a good
00:27:57.600
performance than the current
00:27:59.320
one so it includes the regular
00:28:01.919
expression execution time so it is not
00:28:05.080
uh uh better than the last two uh
00:28:09.520
examples but it also have good
00:28:12.880
performance compared with current one so
00:28:16.399
let's bring up the today's topic so we
00:28:21.440
want to uh we need to we need to
00:28:24.399
complete the implementation for lub 3.5
00:28:28.000
to introduce the doctor local gc uh so
00:28:32.000
we have many tasks and also we want to
00:28:34.640
make a parallel marking on the global gc
00:28:37.279
so we have several uh cores so we can
00:28:40.559
utilize them and also so after after
00:28:44.640
that we we we need uh we want to uh
00:28:48.399
introduce the difficult solution one for
00:28:51.520
more performance so l said he might have
00:28:55.279
a good idea and he implemented it and he
00:28:59.600
says that he has a good implementation
00:29:01.760
for that so we can wait for his good
00:29:04.720
great news after that
00:29:08.000
so this is a summary of this uh
00:29:10.799
presentation so i introduce why the
00:29:13.559
lagishi is difficult and i propose uh
00:29:18.000
conservative approach and it can intro
00:29:21.279
uh improve the performance on the
00:29:23.600
microbenchmark
00:29:25.320
significantly
00:29:27.080
so question is so address the this work
00:29:31.679
is collaboration with lo man and so
00:29:35.039
thank you for listening so question and
00:29:37.440
are very welcome so please ask me later
00:29:40.399
or at the uh stores booth so thank you
00:29:43.440
so much