SDB: Efficient Ruby Stack Scanning Without the GVL


Summarized using AI


Mike Yang • April 16, 2025 • Matsuyama, Ehime, Japan • Talk

In this presentation at RubyKaigi 2025, Mike Yang discusses SDB, a new stack profiler designed for Ruby that operates without the Global VM Lock (GVL). The GVL restricts concurrent execution in Ruby, significantly hindering profiling capabilities and overall application performance. SDB differentiates itself by allowing higher sampling rates of stack information (up to 1000 samples per second) while maintaining low CPU overhead, which is essential for real-time application performance.

Key Points Discussed:
- Purpose of SDB: The necessity for a new stack profiler arises from the limitations of existing profilers that often rely on instrumentation, resulting in blind spots and high CPU usage due to GVL constraints. SDB aims to streamline this process by providing a single tool that minimizes these drawbacks.
- Design Philosophy: Yang outlines the architectural components of SDB, which include:
  - A signal trigger that initiates stack scanning without holding the GVL.
  - An analyzer that processes collected stack data and produces human-readable formats.
  - A symbolizer that translates memory addresses to function names.
- Methodology for Non-GVL Scanning: The presentation explains how SDB achieves stack scanning without the GVL by leveraging the atomic nature of certain aligned memory operations, ensuring thread safety without significant overhead.
- Performance Evaluation: Yang conducted evaluations to measure SDB's impact on application latency and CPU usage compared to conventional profilers. Results showed that SDB increased latency by only 0.5% and used less than 3% CPU, showcasing its efficiency and readiness for integration into production-grade applications.
- Future Directions: Yang mentions ongoing developments for SDB to enhance its capabilities to support more Ruby versions, improve the symbolizer without additional dependencies (like eBPF), and facilitate better integration with Ruby's internal events for improved observability.
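The trigger/handler architecture summarized above can be sketched in plain Ruby. This is a hedged illustration, not SDB's implementation: `Thread#backtrace` in CRuby acquires the GVL, which is exactly what SDB avoids by scanning in native code, but the overall shape (a trigger thread sampling at a fixed interval, with analysis deferred) is the same.

```ruby
# Illustration only: the trigger/handler split of a sampling profiler.
# Unlike SDB, Thread#backtrace here takes the GVL, so this sketch
# perturbs the application it measures.
samples = []

def busy_work
  10_000.times { |i| Math.sqrt(i) }
end

worker = Thread.new { 300.times { busy_work } }

trigger = Thread.new do
  while worker.alive?
    bt = worker.backtrace          # the worker's frames at this instant
    samples << bt.first if bt && !bt.empty?
    sleep 0.001                    # ~1000 samples/s, the rate SDB targets
  end
end

worker.join
trigger.join
puts "collected #{samples.size} samples"
```

Raw samples are only logged here; turning them into per-method latency estimates is the analyzer's job, which can happen later and on another machine.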

The main conclusion drawn from the talk is that SDB has the potential to serve as a default observability solution for Ruby applications, allowing developers to conduct thorough performance analysis without the drawbacks commonly associated with existing profilers. The project is still experimental but shows promise for broad adoption in real-world applications.


Date: April 16, 2025
Published: May 27, 2025
Announced: unknown

Most existing Ruby stack profilers rely on the GVL, which blocks application execution and leads to high CPU usage.

SDB, on the other hand, releases the GVL when scanning Ruby thread stacks. This design minimizes its impact on Ruby applications, allowing us to increase the sampling rate from 100/s to 1000/s while still consuming very low CPU. This makes SDB a truly always-online stack profiler.

In this talk, I will explain how to implement a stack profiler without relying on the GVL, along with its benefits and challenges.

https://rubykaigi.org/2025/presentations/yfractal.html

RubyKaigi 2025

00:00:05.120 oh hello hello hello everyone good
00:00:06.879 afternoon yeah uh my topic today is SDB
00:00:11.440 efficient Ruby stack scanning without
00:00:14.160 the GVL when you see this title you
00:00:16.800 might wonder why create another stack
00:00:19.119 profiler Ruby already has stackprof and many
00:00:22.480 of them work pretty well in their cases
00:00:24.800 why should we create another yeah and uh
00:00:29.119 you might also wonder how or
00:00:31.599 why we want to scan
00:00:36.160 the stacks without the GVL how could we
00:00:38.239 do this yeah I will answer all of those
00:00:40.760 questions in my presentation I
00:00:43.760 will describe why I designed a new stack
00:00:46.399 profiler then I will describe how I
00:00:48.960 designed it after that I will discuss
00:00:51.440 how to scan the Ruby stacks without the
00:00:53.840 GVL yeah yeah let me answer the first
00:00:56.879 question why a new stack profiler yeah to
00:01:00.160 answer this we need to see how we use
00:01:02.480 instrumentation and the stack profiler
00:01:05.519 we know people use the instrumentation
00:01:07.760 as the default solution yeah and we
00:01:10.720 insert the code around the
00:01:12.799 functions we want to measure the
00:01:14.799 latency of yeah whether through an
00:01:17.520 instrumentation library such as
00:01:19.759 Datadog or New Relic or we can insert that
00:01:22.640 code ourselves but the
00:01:25.759 thing is we can't insert
00:01:28.320 instrumentation code into all the
00:01:30.240 functions that would cost too much I
00:01:32.880 mean both the manual effort and the
00:01:35.360 runtime overhead of that
00:01:37.840 instrumentation code so for the
00:01:40.400 functions that do not have
00:01:42.240 instrumentation code we can't know
00:01:44.799 their latency they are blind spots of
00:01:47.280 our instrumentation when a
00:01:49.759 performance issue happens inside
00:01:52.159 those blind spots we may enable
00:01:54.799 stackprof to see whether it can help us so you
00:01:58.159 see here first we have instrumentation
00:02:00.960 and we know it may not work then we use
00:02:04.000 another tool to work as a
00:02:06.880 complement but my question is
00:02:09.440 why don't we just use one tool
00:02:12.239 I mean use the stack profiler in
00:02:14.640 the first place yeah and uh doing this
00:02:18.080 the benefits seem to be obvious
00:02:21.120 you see the stack profiler is
00:02:23.599 implementation agnostic and uh it has
00:02:26.959 fewer blind spots or no blind spots more
00:02:30.440 importantly it doesn't require us
00:02:33.680 to write the instrumentation manually
00:02:36.400 and it means less effort but in practice
00:02:40.400 this does not seem to happen I mean not
00:02:42.560 really I think there must be
00:02:45.200 several reasons first I think uh most
00:02:48.640 stack profilers are designed for average
00:02:51.000 latency uh for example many of the
00:02:53.760 profilers use the 10 milliseconds as
00:02:56.480 their sampling interval that means if a
00:02:59.519 request takes around 50 milliseconds uh
00:03:03.519 those stack profilers can only gather
00:03:05.680 five samples the five samples can't say
00:03:08.239 much about a single request so I think
00:03:10.800 they are not enough for any
00:03:13.680 specific request however the
00:03:15.920 instrumentation works on the other side
00:03:18.400 even if it has blind spots it can
00:03:21.120 detect any request's state so
00:03:25.200 that's the good part of the
00:03:29.080 instrumentation another issue with
00:03:31.519 existing solutions is they may use
00:03:33.760 high CPU I mean if
00:03:37.360 a performance tool consumes more than
00:03:39.680 10 percent CPU we may hesitate to
00:03:42.879 use it more importantly
00:03:45.680 uh because existing solutions need the
00:03:48.400 GVL as we know
00:03:52.319 when the stack profiler holds the GVL
00:03:55.040 the Ruby application can't do its work
00:03:57.200 so it slows down our Ruby applications
00:04:00.599 yeah so I think if we
00:04:05.760 want a stack profiler that can
00:04:08.239 replace the instrumentation I mean we
00:04:10.159 can use a stack profiler alone this
00:04:13.360 profiler must fulfill several things
00:04:16.160 first we should reduce the sampling
00:04:18.880 interval from 10 milliseconds to 1
00:04:21.320 millisecond yeah and we should keep the
00:04:24.000 CPU usage as low as possible maybe less
00:04:26.960 than five or three percent and more
00:04:29.919 importantly we should not use the GVL when
00:04:33.199 we scan the stacks as no one wants to
00:04:35.520 slow down their applications okay after
00:04:39.280 we know why I want a new stack
00:04:42.400 profiler I will discuss how I designed
00:04:44.520 it yeah we can check the general
00:04:47.600 solution first yeah uh a typical
00:04:51.199 stack profiler has two parts
00:04:54.400 one is the signal trigger and another
00:04:56.560 one is the signal handler uh the signal
00:04:59.440 trigger sends signals to the
00:05:01.160 application threads regularly maybe
00:05:03.440 every 10 milliseconds or 1 millisecond
00:05:06.400 and when the application threads receive
00:05:08.720 those signals they invoke the handler and
00:05:11.840 in that handler it scans the
00:05:15.759 stacks and symbolizes or maybe does
00:05:18.000 some analyzing for the signal trigger
00:05:21.440 you see it actually just does a
00:05:24.320 simple job it just sends some signals and
00:05:27.039 we can release the GVL in the signal
00:05:29.199 trigger so for the signal trigger its job
00:05:32.320 is quite simple and there's not much to
00:05:34.639 say about it so let us check
00:05:37.360 each element of the handler uh one
00:05:40.800 component of the handler is the analyzer
00:05:43.680 which aggregates what we collected yeah
00:05:46.160 that's because the stack
00:05:48.400 scanner can only know which functions
00:05:50.320 are on the stacks or which functions are
00:05:53.280 being executed and we need another
00:05:56.720 component to calculate or infer those
00:05:59.680 functions' average latency
00:06:01.840 yeah that's the analyzer and after that
00:06:04.720 we may need the analyzer to convert the
00:06:07.039 data into other formats I mean maybe you
00:06:09.919 want to present the data as a
00:06:13.440 flame graph or have some
00:06:16.479 text summary about each function's
00:06:18.479 average latency so that's the thing uh
00:06:21.199 actually the analyzer does the hard work
00:06:23.840 yeah but for it we can do
00:06:26.880 lazy and offline analysis yeah a
00:06:30.560 default stack profiler
00:06:33.120 can collect a lot of samples but only
00:06:35.600 some of them interest us so we
00:06:38.800 can analyze only those samples we
00:06:42.479 think relate to our performance issues
00:06:44.720 so we can perform lazy analysis that can
00:06:47.199 reduce a lot of work yeah
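The lazy, offline analysis described here can be shown with a toy aggregator (all names are hypothetical, not SDB's API): at runtime only raw stacks are logged, and the counting is deferred until we actually investigate a performance issue.

```ruby
# Toy offline analyzer: turn logged stack samples into a summary.
# In a sampling profiler, a method's share of samples approximates
# its share of wall-clock time.
def summarize(samples)
  counts = Hash.new(0)
  samples.each { |stack| counts[stack.first] += 1 }  # count leaf frames
  total = samples.size.to_f
  counts.sort_by { |_, c| -c }
        .map { |method, c| [method, (100 * c / total).round(1)] }
end

logged = [
  ["User#find", "PostsController#show"],
  ["User#find", "PostsController#show"],
  ["Template#render", "PostsController#show"],
]
p summarize(logged)  # => [["User#find", 66.7], ["Template#render", 33.3]]
```

Because the log is just data, this step can run on a developer's machine instead of the production server, which is exactly the offline analysis the talk argues for.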
00:06:50.319 another thing is our design goal here is
00:06:52.800 to reduce the impact on our
00:06:55.919 production environment or online servers
00:06:58.960 that means we can move the data to
00:07:01.440 our local machines or other machines so
00:07:03.919 it doesn't impact our online servers yeah
00:07:07.120 for the analyzer the solution is to
00:07:09.440 perform lazy and offline analysis the
00:07:12.880 next component of the handler is the
00:07:15.720 symbolizer which converts memory addresses
00:07:18.479 into human readable symbols yeah the
00:07:21.840 stack scanner can only gather the
00:07:24.000 function or method addresses so to
00:07:26.639 understand them we need to convert those
00:07:29.280 addresses into method names and other
00:07:32.560 details yeah but we know
00:07:36.000 any given Ruby application
00:07:38.400 only has a limited set of methods that
00:07:41.759 means if we do not convert a
00:07:45.360 method address to symbols
00:07:47.520 repeatedly the work is
00:07:50.000 manageable so for the symbolizer the
00:07:53.199 solution is to cache the translation
00:07:55.400 results okay uh after those two changes
00:07:58.639 our architecture looks like this you see
00:08:00.960 the stack scanner and symbolizer only gather
00:08:03.599 data and log it into a file and after
00:08:07.039 that we can perform the analysis when
00:08:09.520 necessary so that's the offline analyzer
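The symbolizer's caching idea is plain memoization: a Ruby process has a bounded set of methods, so each address only needs to be translated once. The resolver below is a hypothetical stand-in for the real address-to-name lookup.

```ruby
# Memoizing symbolizer sketch: resolve each address once, then reuse.
class Symbolizer
  def initialize(&resolver)
    @resolver = resolver    # the expensive address -> name translation
    @cache = {}
    @misses = 0
  end

  def symbolize(addr)
    @cache.fetch(addr) do
      @misses += 1
      @cache[addr] = @resolver.call(addr)
    end
  end

  attr_reader :misses
end

# Hypothetical resolver; the real one would read iseq/symbol information.
sym = Symbolizer.new { |addr| "method_at_0x#{addr.to_s(16)}" }
3.times { sym.symbolize(0x1234) }    # one miss, then two cache hits
sym.symbolize(0x5678)                # second miss
p sym.misses  # => 2
```

The cache never needs invalidation here only because, as the talk notes later, SDB depends on objects the GC does not move; a symbolizer for movable objects would need more care.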
00:08:12.160 actually after those two changes the
00:08:15.199 stack profiler should have much better
00:08:17.479 performance however there is still one
00:08:20.080 last element the stack scanner uh the
00:08:23.680 stack scanner traverses and collects stacks
00:08:26.400 regularly maybe every 10 milliseconds or
00:08:29.199 1 millisecond yeah uh a
00:08:32.399 typical Ruby stack only has 100
00:08:35.919 to 200 frames a modern CPU can
00:08:39.200 run very fast that means if we scan a
00:08:44.159 200-element array it doesn't take much
00:08:46.560 time so for us it means if we do
00:08:50.320 not scan the stacks at an
00:08:53.040 extremely high sampling rate this part's
00:08:55.519 work is light compared to the other two
00:08:57.880 components but the existing solution is
00:09:00.880 not ideal because they use the GVL that
00:09:04.240 means when the stack scanner holds the GVL
00:09:06.959 our application can't do its job it
00:09:09.839 blocks our application's execution so it
00:09:13.760 slows down our applications and it
00:09:16.720 limits scanning frequency as well as
00:09:19.560 accuracy yeah so let's go to the next
00:09:22.080 section how to scan the stack without
00:09:24.720 the
00:09:25.720 GVL yeah let's recheck our stack scanner
00:09:29.519 the stack scanner reads the active frames
00:09:32.560 and records those frames' information into
00:09:35.440 a file or into memory
00:09:38.160 and at the same time the Ruby VM will
00:09:40.800 update them I mean push a frame onto the
00:09:43.760 stack or pop a frame off the stack
00:09:47.120 more importantly the frame is a Ruby
00:09:49.760 internal structure we know the GVL is
00:09:52.640 used to protect all Ruby objects yeah
00:09:56.720 so it seems that we must use the GVL but
00:10:00.720 if we look closely it's not entirely
00:10:03.360 true yeah uh to answer this we need to
00:10:06.720 check the duty of the GVL or a lock yeah
00:10:10.000 here is an example there are two threads
00:10:13.120 both of them share the same object
00:10:15.920 an instance of a linked list and
00:10:18.800 thread one wants to free the
00:10:21.680 last element of the linked list and
00:10:24.720 thread two at the same time wants to read
00:10:27.279 it unfortunately thread two first
00:10:30.480 holds the tail element's
00:10:33.680 address and after that thread one
00:10:36.640 frees that element then thread two
00:10:39.760 wants to read it as the element has
00:10:42.880 already been freed it will
00:10:45.200 cause a segmentation fault to solve this
00:10:48.160 people generally use a lock yeah you
00:10:50.959 see the lock groups things together and
00:10:53.760 it forms critical sections that means
00:10:56.560 those sections can't happen at the same
00:10:58.720 time so there's no such issue yeah if
00:11:03.279 we look closely we can find that
00:11:07.200 one job of the lock is to group
00:11:10.640 instructions I mean take several instructions and
00:11:12.720 make them a whole unit
00:11:15.760 that's atomic but here my question is what
00:11:19.120 if we only need one single instruction I
00:11:22.560 mean the instruction itself can
00:11:24.640 guarantee atomicity do we still need the
00:11:27.839 lock yeah okay the answer is no
00:11:31.600 we do not need the lock for example
00:11:34.160 there is a counter we can read and write
00:11:36.800 it without any lock yeah in my
00:11:40.160 program we may see some outdated
00:11:42.399 value but no corrupted data so for
00:11:45.040 example that means uh CPU one has
00:11:48.240 already updated the value to some number
00:11:50.560 and after that CPU two wants to read
00:11:53.040 it it may read some old value but it
00:11:56.480 will not read some invalid or corrupted
00:11:59.200 value I mean a value that doesn't belong
00:12:01.839 to any written number so for a counter the
00:12:04.640 lock is not necessary that's because
00:12:07.839 64-bit aligned memory reads and writes
00:12:11.360 are atomic that means a read or write
00:12:14.800 happens fully or not at all there's
00:12:17.839 no intermediate state this feature is
00:12:20.399 guaranteed by the hardware yeah after we
00:12:23.519 know this we can check the SDB stack
00:12:26.480 scanner actually the stack scanner only
00:12:28.880 depends on one single field which is iseq
00:12:32.000 it holds the address of the instruction
00:12:34.480 sequence structure you can consider this
00:12:36.880 structure as the internal representation
00:12:39.600 of Ruby methods to keep it simple
00:12:42.160 yeah and the iseq field is 64-bit
00:12:46.160 aligned memory that means we can read
00:12:49.120 it safely without any lock without the
00:12:51.760 GVL so that's the foundation of why we
00:12:54.720 can read the stack without the GVL yeah
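The "stale but never corrupt" property can be pictured with a shared counter in Ruby. Under CRuby the GVL already makes this trivially safe, so treat it purely as an illustration of the invariant SDB relies on at the native level: an aligned word-sized read returns a whole value that was actually written, possibly an old one, never a torn one.

```ruby
# A reader samples a counter while a writer updates it.
# Reads may be stale, but every observed value is one the
# writer actually produced, never a half-written number.
counter = 0
observed = []

writer = Thread.new { 100_000.times { counter += 1 } }
reader = Thread.new { 1_000.times { observed << counter } }
[writer, reader].each(&:join)

p observed.all? { |v| (0..100_000).include?(v) }  # => true
```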
00:12:59.360 but the former step only guarantees
00:13:02.160 that if a frame is alive we can
00:13:06.160 read it now we need to guarantee that we
00:13:09.760 only read the live threads yeah we know
00:13:12.560 Ruby uses a stack to store
00:13:15.600 those frames and the stack is a
00:13:18.480 contiguous array this array is allocated
00:13:21.040 when Ruby creates the thread and will be
00:13:23.920 freed after the thread finishes its work
00:13:26.880 so while a thread is alive its frames array
00:13:30.240 must be valid so it's safe to read
00:13:33.320 it yeah the implementation is
00:13:36.399 quite simple we can wrap code around
00:13:38.639 the thread's block it's
00:13:42.000 for synchronizing the Ruby VM with
00:13:44.240 the stack scanner it just tells the stack
00:13:46.800 scanner okay there's a new thread you can
00:13:49.440 start doing your work and after the thread
00:13:52.639 finishes its work it just tells the stack
00:13:55.360 scanner okay this thread will die
00:13:58.160 you should not scan it anymore yeah
00:14:02.240 after this we still need to consider how
00:14:05.040 the stack scanner works with the Ruby GC
00:14:08.079 that's because the stack scanner
00:14:10.079 scans Ruby memory we need to know how
00:14:12.639 Ruby manages its memory or how its
00:14:15.680 GC works one duty of GC is to free the
00:14:19.440 inaccessible objects for example after a
00:14:22.639 web request all its local objects are
00:14:26.079 not useful so we can release them and that
00:14:28.560 work is done by the GC but the SDB
00:14:32.399 stack scanner only scans live threads so
00:14:35.040 it's safe here another duty
00:14:38.639 of GC is memory compaction we know
00:14:41.519 Ruby can enable auto compaction
00:14:44.240 and in this feature it
00:14:46.720 moves objects to different locations as
00:14:50.000 SDB needs to scan the Ruby memory uh
00:14:53.920 if the memory has been
00:14:57.279 moved to another location it will cause
00:14:59.519 errors but the good news is uh the Ruby
00:15:03.279 VM doesn't move all the objects I mean
00:15:06.079 some objects once they are allocated
00:15:08.320 will not be moved to another place
00:15:11.120 and the SDB stack scanner only depends
00:15:14.240 on those objects I mean the
00:15:16.639 execution context and its stacks so
00:15:20.240 yeah so that's why it's safe yeah uh but
00:15:22.720 if you want to read some movable objects
00:15:25.199 without the GVL there are several
00:15:27.199 options you can read the immutable ones
00:15:29.839 and not read them during
00:15:33.199 any GC phases or you can insert some
00:15:35.519 locks for them uh to summarize the
00:15:39.360 SDB only reads the iseq field and it
00:15:44.279 synchronizes state with the Ruby VM
00:15:48.000 so it never reads any dead threads
00:15:52.079 and uh as SDB only depends on
00:15:56.560 fixed objects it works well with
00:16:00.000 GC compaction so that's why SDB can
00:16:04.079 scan the stacks without the GVL
00:16:06.920 safely okay we know the implementation
00:16:09.839 so now we need to discuss the details of
00:16:12.320 thread safety if you look closely you
00:16:15.199 see the Ruby VM updates the stack
00:16:17.680 while SDB reads it without a lock so
00:16:21.279 that means data races can happen it
00:16:24.000 has a thread safety issue but I haven't
00:16:26.399 solved it you can see several
00:16:28.880 similar tools have this issue for
00:16:30.720 example rbspy and py-spy both of them
00:16:34.079 offer a non-blocking mode that means
00:16:36.480 in that mode it has this issue I mean
00:16:39.199 it's not strictly safe in that mode or you
00:16:42.160 can see the Java async-profiler uses
00:16:44.880 some functions that don't guarantee
00:16:47.360 atomicity so it has the same issue so I
00:16:50.639 believe in practice the data race does not
00:16:54.240 have to be solved
00:16:57.440 uh that's because uh for a stack
00:16:59.920 profiler being non-blocking is far more
00:17:02.000 important than strict correctness you
00:17:04.880 can imagine the stack profiler can
00:17:07.199 yield some wrong results only when those
00:17:09.760 functions are extremely fast but our
00:17:12.799 stack scanner is for detecting
00:17:15.600 the high latency functions so those
00:17:18.079 extremely fast functions don't matter
00:17:20.000 to our interest so I think being non-blocking
00:17:22.880 is far more important than strict
00:17:25.199 correctness this trade-off happens in
00:17:27.919 many areas actually for example eventual
00:17:31.360 consistency in databases or RCU in
00:17:34.320 operating systems yeah however if you
00:17:37.039 still want to solve this we
00:17:40.240 have solutions yeah for example the IODB
00:17:43.200 paper published in OSDI 2024
00:17:47.039 uses a generation number for
00:17:48.960 concurrency control so it's still
00:17:51.080 solvable yeah after we have all of
00:17:54.960 this I mean we have this information
00:17:57.360 we know how it works we need to do some
00:17:59.919 tests to see whether it works as we expected
00:18:03.039 or not yeah uh so I did several
00:18:06.960 evaluations in my evaluation I focused on
00:18:09.440 two things one is the overhead I
00:18:12.559 measured the SDB against a GVL based
00:18:15.200 solution and ran several stack profilers
00:18:17.840 on a real application to
00:18:20.559 see the overhead of those tools
00:18:22.960 additionally I sampled one single
00:18:25.360 request to see how much information can
00:18:27.600 be gathered by the SDB uh the first
00:18:31.440 evaluation is on an AWS machine yeah I
00:18:35.039 used a simple script to do some
00:18:36.799 calculation I compared the SDB the no
00:18:40.400 GVL solution and a GVL based solution I
00:18:44.160 ran them at different sampling intervals
00:18:46.400 to see how the
00:18:49.600 different sampling intervals impact the
00:18:52.160 execution time I mean how much they
00:18:54.799 impact our
00:18:56.679 performance and for the GVL based I didn't
00:19:00.720 use any existing solutions
00:19:03.120 yeah because the existing solutions
00:19:06.240 do too much work making it hard to
00:19:08.400 isolate the impact of the GVL so I wrote
00:19:11.600 a simple one it only does the
00:19:14.240 bare minimum job for better
00:19:16.360 performance and here are the results you
00:19:19.280 see the X axis is the sampling interval
00:19:22.320 ranging from 1 millisecond to 1
00:19:25.039 microsecond and the Y axis is the
00:19:28.720 execution time larger means the
00:19:31.600 stack profiler impacts the script more
00:19:35.960 yeah uh the blue line represents the GVL
00:19:40.080 based solution you can see clearly
00:19:43.200 when we decrease the sampling
00:19:45.640 interval the execution time uh increases
00:19:49.360 sharply so that
00:19:51.919 demonstrates how the GVL impacts our
00:19:55.120 performance for our stack profiler
00:19:58.559 on the other side you see the no
00:20:01.840 GVL solution even when we
00:20:04.799 decrease the sampling interval it
00:20:06.960 doesn't change the execution time that
00:20:09.280 means it doesn't impact our applications
00:20:12.799 yeah that's the first evaluation yeah uh
00:20:17.600 after that I did an
00:20:21.280 evaluation on a real application because
00:20:24.559 many Ruby applications use the Rails web
00:20:27.840 framework so it represents the production
00:20:30.320 usage yeah the real use case yeah uh I
00:20:33.360 picked Homeland which is the most
00:20:36.640 famous Ruby forum in China yeah and it's
00:20:39.440 written by professional programmers
00:20:42.159 so I think this application is quite
00:20:44.520 representative so I ran it on a
00:20:47.679 machine and used a different DB and
00:20:49.679 a different stress sender and uh here is
00:20:52.559 the setup and I picked the sampling
00:20:55.360 interval as 1 millisecond the reason is
00:20:58.400 Homeland's latency is larger than
00:21:01.360 30 milliseconds so a 1 millisecond
00:21:04.159 sampling interval is quite enough
00:21:06.320 for it so I think it's okay and
00:21:10.159 the measurement method is quite simple I just
00:21:13.440 run Homeland without any stack
00:21:15.840 profilers to see its latency and its CPU
00:21:19.360 after that I run it with one
00:21:22.960 single uh stack profiler and measure the
00:21:25.360 CPU usage and the request average
00:21:28.400 latency again and the overhead of
00:21:31.600 this stack profiler is the difference
00:21:34.240 between those two tests so that's the
00:21:37.000 method yeah uh
00:21:40.880 uh and here is the detail of those stack
00:21:43.440 profilers I will not read them yeah and
00:21:46.880 uh here is our result you see uh there
00:21:49.919 are two groups of bars the first group is
00:21:52.480 the average latency increase I mean
00:21:55.039 after we enabled the stack profiler how
00:21:57.760 much average latency has increased
00:22:00.480 yeah and the second group of bars is the
00:22:03.360 CPU usage how much CPU those stack
00:22:06.000 profilers use yeah uh we can see
00:22:09.360 the green bars belong to the SDB it only
00:22:13.200 increases average latency by 0.5 percent
00:22:16.320 and uses less than 3 percent CPU I think it's
00:22:20.159 quite good for most applications yeah
00:22:23.039 even for some large scale Ruby
00:22:25.520 servers yeah and uh the blue bars
00:22:30.000 represent rbspy it doesn't
00:22:33.840 even increase our latency but it uses
00:22:36.240 a lot of CPU I guess it may be because it
00:22:40.000 needs to read the memory from another
00:22:42.000 process and it may do a lot
00:22:45.280 of work in its stack profiler
00:22:49.520 when we enable rbspy it calculates the
00:22:53.360 average latency dynamically so it could
00:22:56.000 cost a lot and the next bars belong to
00:23:00.840 Vernier 1.0 yeah it doesn't use much CPU but
00:23:04.640 it increases our average
00:23:07.840 latency by 10 percent I guess that's not good
00:23:11.679 as our default solution I mean
00:23:14.080 if we enable a stack profiler and it
00:23:16.880 increases latency by 10 percent maybe we will
00:23:20.080 not use it like this yeah uh and to be
00:23:24.320 clear this was version 1.0 it has several
00:23:28.000 newer versions I haven't tested
00:23:30.480 them for some reason uh but I
00:23:33.200 think the newest version of Vernier could
00:23:35.120 have much better performance than this
00:23:37.120 test and I must say both rbspy
00:23:41.679 and Vernier are quite good useful stack
00:23:44.760 profilers you can imagine if we just
00:23:47.440 want to enable them occasionally the
00:23:50.159 10 percent CPU uh the 10 percent average
00:23:53.679 latency increase doesn't matter that much
00:23:56.559 and for rbspy if we change the
00:23:59.039 sampling interval from 10 milliseconds
00:24:01.039 to 1 millisecond its CPU usage is
00:24:03.760 reasonable however the SDB
00:24:07.679 targets being enabled by default enabled
00:24:10.799 always so it
00:24:15.480 must make the impact as low as
00:24:18.240 possible and use low CPU or other
00:24:20.880 resources so that's the difference
00:24:23.840 here are
00:24:27.200 the samples of one request you see it
00:24:30.240 can collect a lot of information yeah it
00:24:34.240 collects a lot of methods and actually
00:24:36.559 the whole thing it collects is very
00:24:38.799 large and can't be shown in full so I
00:24:42.320 just cut maybe 10 percent of the whole image
00:24:46.159 so that's the result I think
00:24:48.559 it's quite enough for most cases yeah uh
00:24:53.039 the last question I'd like to answer is
00:24:55.440 does the effort pay off yeah
00:24:57.919 engineering is never as simple as finding a
00:25:00.880 better solution it's about the
00:25:04.159 payoff I mean is it worth it or not
00:25:06.960 here are the facts of the SDB its stack
00:25:10.400 scanner and symbolizer do not have many
00:25:12.480 lines of code they have less than 1,000
00:25:16.080 lines so I think its
00:25:18.640 code size is quite small but to write it
00:25:21.840 we need to understand several things
00:25:24.000 under the hood I mean we need to
00:25:27.039 understand the GVL the GC the iseq
00:25:29.919 and a little bit about low-level
00:25:32.480 concurrency after that we need to make
00:25:34.880 the design and the trade-offs so to be
00:25:37.840 honest it does introduce additional
00:25:40.880 effort yeah different people may have
00:25:43.200 different standards about whether this
00:25:45.279 effort is worth it or not so I can't answer
00:25:47.919 it directly yeah but for me this
00:25:50.720 journey I mean developing this stack
00:25:52.480 profiler is quite interesting even though it's
00:25:54.880 challenging to learn a large complex
00:25:57.360 system I mean the Ruby VM and it is
00:26:00.640 much much harder
00:26:02.880 to debug in it but I learned a lot
00:26:06.000 so it's interesting and
00:26:08.679 memorable for the current state
00:26:11.360 I believe the SDB is still in an
00:26:14.159 experimental state it's not production
00:26:15.919 ready because it only supports one Ruby
00:26:19.039 version yeah I will support more in the
00:26:21.520 future another thing is I want to
00:26:24.320 rewrite the symbolizer without eBPF
00:26:27.360 even though eBPF is a common solution
00:26:30.240 yeah a common solution for
00:26:32.240 observability several
00:26:34.559 libraries use it for example the
00:26:36.960 Golang OpenTelemetry implementation but
00:26:39.679 I think it's an unnecessary dependency
00:26:42.640 so I plan to remove it and additionally
00:26:45.679 I'd like to integrate more Ruby
00:26:49.600 internal events it can help us detect
00:26:52.400 the tail latency more easily so
00:26:55.919 that's my plan yeah this part is the
00:27:00.159 related work if you are interested in
00:27:02.559 this field you can check those things
00:27:04.799 but uh due to time reasons I will not
00:27:07.120 cover the details but I'd like to point
00:27:10.320 out one thing uh people devote
00:27:13.760 quite a lot of effort to making
00:27:16.400 performance debugging easier no
00:27:18.559 matter whether reducing the effort of manual
00:27:20.919 instrumentation or the effort of root cause
00:27:23.559 analysis uh several previous works have
00:27:26.880 proved that if we collect as much
00:27:30.080 information as we can we can make
00:27:32.400 root cause analysis super simple
00:27:35.840 but it also means it introduces a lot of
00:27:38.559 overhead uh the SDB wants to explore
00:27:42.320 that direction but
00:27:45.360 it wants to achieve some balance I
00:27:48.080 mean collect as much information as we
00:27:50.799 can but keep the
00:27:53.840 resource usage as low as possible
00:27:56.480 so it can be practical and really
00:27:59.200 be used in industry
00:28:02.360 projects okay to conclude a stack
00:28:05.760 profiler can release the GVL and I think
00:28:08.559 it is beneficial um the SDB
00:28:12.000 provides a high sampling rate with low
00:28:14.320 performance impact and low CPU usage
00:28:17.679 and due to those things
00:28:20.320 I believe the SDB has the potential to
00:28:22.960 be a default observability solution okay
00:28:26.799 that's my talk thank you thank you
00:28:28.640 very much and the project is available
00:28:31.200 here yeah okay so feel
00:28:35.200 free to ask questions I hope I have
00:28:37.600 described it clearly
00:28:48.360 yeah okay oh feel free to reach
00:28:51.360 out to me if you have anything if you're
00:28:53.679 interested in this one yeah anyone have a
00:28:59.960 question okay okay if we do not have
00:29:03.760 questions I have one little thing yeah
00:29:06.000 I have brought some synthetic rubies yeah
00:29:08.320 it is real ruby but synthetic ruby as a
00:29:10.640 small gift so I think for Ruby
00:29:13.200 programmers this gift is quite
00:29:15.200 interesting so feel free to reach out to me
00:29:17.120 and I will offer some but I haven't
00:29:19.600 brought many yeah yeah that's all thank
00:29:23.120 you thank you