Summarized using AI

Ruby Stability at Scale

Peter Zhu • September 05, 2025 • Amsterdam, Netherlands • Talk

Ruby Stability at Scale: Lessons from Shopify's Monolith

Peter Zhu's talk at Rails World 2025, "Ruby Stability at Scale," focuses on how to monitor and maintain the stability of large-scale Ruby on Rails applications, particularly when instability originates from Ruby itself or native gems, rather than application code or external libraries. Drawing from his experience in managing Shopify's monolithic Rails app, one of the largest in the world, Zhu provides a comprehensive overview of crash prevention, detection, and post-mortem analysis techniques.

Main Topic

The talk centers on managing and improving the stability of Ruby and native gems in production systems, exploring strategies for monitoring, debugging, and preventing Ruby-level crashes and outages.

Key Points

  • Complex Layers in Rails Infrastructure:

    • A modern Rails app consists of multiple layers, often managed by specialized teams, with Ruby as a foundational component.
    • Native gems, often numerous even in new Rails apps, introduce additional complexity and potential vulnerability.
  • Common Sources of Instability:

    • Ruby and native gems are frequently written in C, making them susceptible to C-specific bugs such as use-after-free, buffer overflows, missing garbage collector (GC) guards, memory leaks, and incorrect usage of the Ruby C API.
    • Examples include subtle bugs where memory mismanagement leads to hard-to-diagnose production crashes (e.g., bit flips in symbols), and misconceptions around safety in Rust gems that interface with the Ruby C API.
  • Proactive Crash Prevention:

    • Compiling Ruby with assertions enabled helps catch internal errors early during continuous integration (CI) runs.
    • Setting the YJIT call threshold to one during tests forces more of the codebase through the just-in-time compiler, making bugs more likely to surface.
    • Using memory error detection tools such as Valgrind and AddressSanitizer (ASAN) can expose memory mismanagement issues before production deployments.
    • Running nightly CI tests against the latest Ruby (master branch) helps identify incompatibilities and bugs ahead of official Ruby upgrades.
  • Crash Detection and Debugging in Production:

    • Ruby generates crash reports with Ruby and C stack traces; as of Ruby 3.3, crash logs can be redirected to files for easier triage.
    • Core dumps capture the memory state at the time of the crash, offering deep inspection at the point of failure but also containing sensitive data.
    • At Shopify, a custom Crash Reporter tool uploads core dumps and crash reports to secure cloud storage and integrates crash events with error monitoring systems.
    • Effective debugging of core dumps requires access to original binaries, symbols, and a production-matching environment, often simplified by containerized deployments (e.g., Docker).
  • Upstream Reporting and Patch Management:

  • Bugs found in native gems should be reported to their maintainers; bugs in Ruby should be reported to the Ruby core team.
    • Before reporting, ensure the Ruby version in use is maintained and up-to-date to avoid opening tickets for already-resolved issues.
  • Shopify maintains a public repository, Shopify/ruby-definitions, of ruby-build definitions for its internally patched Ruby versions.

Key Takeaways

  • Instability can stem from the Ruby interpreter or native extensions, not just application code.
  • Proactive measures in development and CI—like enabling assertions, aggressive JIT testing, and memory checkers—significantly improve the detection of bugs before production.
  • Robust crash handling, logging, and restricted access to crash artifacts (such as core dumps) are essential for effective post-crash debugging and for protecting sensitive data.
  • Open communication with upstream maintainers and timely upgrades ensure long-term stability and resilience.


Date: September 05, 2025
Published: Mon, 15 Sep 2025 00:00:00 +0000
Announced: Tue, 20 May 2025 00:00:00 +0000

There are many talks, articles, and tutorials on how to monitor your Rails app for stability. These assume the source of the bug comes from your application code, from Rails itself, or from a gem. But what if the source of instability comes from Ruby or a native gem? If Ruby crashes, do you have any monitoring or ways to debug it?

In this talk, we'll look at how we deal with Ruby crashes in the Shopify monolith, the world's largest Ruby on Rails application, and how you can use some of our techniques. We'll cover topics such as how we monitor crashes, capture core dumps for debugging, prevent crashes, and minimize the impact of crashes on production.

Rails World 2025

00:00:07.200 Hi everyone. It's an absolute honor to
00:00:09.360 be here speaking at Rails World 2025.
00:00:12.880 Infrastructure is a top priority at
00:00:15.120 Shopify and I am sure it is at your
00:00:17.600 company as well. It is important to
00:00:20.080 prevent outages, to prevent data
00:00:22.240 corruption, and ultimately to have a
00:00:24.400 good user experience. In this talk, I'll
00:00:27.599 be covering some ways to prevent and
00:00:29.519 respond to instability in Ruby.
00:00:32.880 You can find the slides uh at this URL
00:00:35.840 or by scanning this QR code. Don't worry
00:00:38.480 if you end up missing this because I'll
00:00:40.879 also have this QR code up at the end of
00:00:42.879 this talk. So, first I'll talk a little
00:00:45.440 bit about me. That's what I look like on
00:00:47.440 the internet. I'm currently based in
00:00:49.600 Toronto, Canada. I'm on the Ruby core
00:00:52.000 team and I'm a staff developer at
00:00:54.239 Shopify on the Ruby infrastructure team
00:00:57.039 where I work on performance and memory
00:00:58.960 management in Ruby. I'm the co-author of
00:01:01.760 the variable width allocation feature in
00:01:03.359 Ruby which improves performance and
00:01:04.960 memory efficiency of Ruby's garbage
00:01:06.720 collector. I'm also the co-author of the
00:01:08.960 Ruby free-at-exit feature, which frees
00:01:11.360 memory at shutdown to allow the use of
00:01:13.439 memory leak checkers such as Valgrind or
00:01:15.360 the Mac OS leaks tool. I also designed
00:01:18.320 and implemented the modular garbage
00:01:20.080 collector feature in Ruby.
00:01:25.439 I'm also the author of the ruby_memcheck
00:01:27.439 and autotuner gems, and in my
00:01:29.520 free time I like to travel and take
00:01:31.920 photos and I post them on Instagram at
00:01:33.920 peterzhu.photos.
00:01:36.159 So first I'll talk a little bit about
00:01:37.840 the outline of this talk. We'll first
00:01:40.560 take a look at what the infrastructure
00:01:42.479 of a typical Rails app looks like, the
00:01:44.720 teams that you might have maintaining
00:01:46.479 it, and where you might have blind
00:01:48.479 spots in your infrastructure. We'll then
00:01:50.960 take a look at some of the ways Ruby
00:01:52.960 could cause instability in your
00:01:54.720 infrastructure
00:01:57.759 and ways to proactively prevent crashes
00:02:00.000 from happening in production.
00:02:02.479 We'll end off by discussing how to
00:02:04.079 capture metrics and information about
00:02:05.920 crashes and how we can use that
00:02:07.680 information to debug.
00:02:10.160 There are many moving parts and many
00:02:12.160 layers to the tech stack of a Rails app.
00:02:14.400 In your company, you might have teams
00:02:16.239 dedicated to each uh part of this stack.
00:02:19.120 And let's see a simplified example of a
00:02:21.440 typical Rails app.
00:02:23.760 And here's a diagram of a simplified
00:02:26.080 example of what your infrastructure
00:02:27.599 might look like. You might have teams
00:02:29.120 that manage external services such as a
00:02:32.319 database, your caches, or other
00:02:34.560 microservices that you use. You might
00:02:36.959 have teams that manage deployments of
00:02:38.720 your app in production using tools like
00:02:40.879 Docker and Kubernetes. You probably have
00:02:43.519 a large number of product developers
00:02:45.519 working on your Rails application.
00:02:48.560 You might have some product
00:02:50.480 developers that might be Rails experts
00:02:53.280 or you may even have whole teams
00:02:55.040 dedicated to the architecture of your
00:02:56.959 app. And these people would be
00:02:58.800 responsible for reducing tech debt, for
00:03:00.720 triaging and debugging exceptions that
00:03:02.879 happen in production and performing
00:03:04.879 upgrades for Rails, gems, and Ruby.
00:03:09.360 However, all of this runs on top of Ruby
00:03:12.159 and Ruby is just another piece of
00:03:14.000 software. So, it can have bugs and it
00:03:16.319 can crash.
00:03:18.000 Do you have people responsible for
00:03:20.239 maintaining this level in your tech
00:03:22.080 stack? Do you have observability into
00:03:24.720 this layer? And do you know what to do
00:03:27.040 when Ruby crashes? In this talk, I'll be
00:03:30.000 talking about some of the reasons why
00:03:31.360 Ruby could crash, how to get information
00:03:33.680 and metrics about these crashes, and
00:03:36.080 what we can do about it. So, let's first
00:03:38.799 take a look at why Ruby could crash
00:03:43.599 or, I should add, native gems. Your app
00:03:47.360 may have tens or even hundreds of native
00:03:50.159 gems.
00:03:52.720 And in fact, if you don't believe me
00:03:54.319 that there are a lot of native gems in
00:03:56.000 your app, a brand new Rails app installs
00:03:58.879 21 native gems. This adds 21 additional
00:04:03.200 possible sources of instability inside
00:04:05.840 of your app.
00:04:07.760 And so now, let's take a look at some of
00:04:09.599 the common categories of bugs that Ruby
00:04:12.239 and native gems run into.
00:04:15.120 Ruby is written in C and so are a lot of
00:04:17.840 native gems. This means that they can
00:04:19.359 run into bugs found in C code. So in C
00:04:22.880 you have to manually allocate and free
00:04:25.680 memory. This is unlike Ruby where we
00:04:28.080 have a garbage collector which
00:04:30.080 determines when an object is alive or
00:04:32.800 dead. So if you free a piece of memory
00:04:35.600 in C before all of the places give up
00:04:38.080 references to that piece of memory then
00:04:41.280 you could cause a lot of different
00:04:43.120 problems including crashes or unexpected
00:04:45.840 behaviors. And this is known as a use
00:04:48.240 after free bug. Let's see a quick
00:04:50.479 example. This is a short snippet of C
00:04:53.199 code and we first allocate memory using
00:04:57.360 malloc and we write the string hello
00:04:59.280 rails world into it.
00:05:03.360 Then we print out the string, and then we
00:05:06.479 free the memory that
00:05:08.880 contains the string,
00:05:11.520 and then we print the string again. This
00:05:13.440 is not allowed because this is a use
00:05:15.440 after free bug, and the behavior of that
00:05:18.560 is nondeterministic: it could either
00:05:20.720 crash or it could even read other pieces
00:05:23.039 of memory.
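The slide code isn't reproduced in the transcript, so here is a minimal C sketch of the use-after-free pattern being described (variable names are illustrative):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* Allocate a buffer and write "Hello Rails World" into it. */
    char *str = malloc(strlen("Hello Rails World") + 1);
    strcpy(str, "Hello Rails World");

    printf("%s\n", str);   /* OK: we still own this memory. */

    free(str);             /* All references must be given up here... */

    printf("%s\n", str);   /* ...so this is a use-after-free: it may print
                              garbage, crash, or appear to work. */
    return 0;
}
```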
00:05:24.960 There are also buffer overflow bugs
00:05:28.080 which access past the end of a region of
00:05:30.320 memory which could potentially cause a
00:05:32.160 crash or it could even potentially end
00:05:34.560 up reading another piece of memory. And
00:05:37.199 so this could open up your app to
00:05:38.880 attacks and allow attackers to read from
00:05:41.440 or write to other pieces of memory. The
00:05:43.919 attacker could exploit it to
00:05:46.000 perform unintended behaviors in your app
00:05:48.320 or even brute force it to read other
00:05:50.479 pieces of memory, such as users' passwords
00:05:53.440 or secrets in your infrastructure. Let's
00:05:55.759 see a quick example. This is the program
00:05:58.560 again. It's similar to the example that
00:06:00.639 I showed for the use after free bug.
00:06:03.520 But what's the difference? Do you see
00:06:05.600 where the bug is?
00:06:09.360 plus one.
00:06:11.280 You're missing the plus one for the null
00:06:13.440 terminator. When we malloc, the null
00:06:15.840 terminator is important because it
00:06:18.960 signals the end of the
00:06:20.639 string. And so the strcpy here, which
00:06:23.919 copies into that string, will now
00:06:26.720 write the null terminator past the end
00:06:28.960 of the region of memory that was
00:06:31.280 allocated. So it
00:06:33.680 could be writing into
00:06:35.120 another piece of memory, and the printf
00:06:38.240 after it will now read past the end of
00:06:40.720 that region of memory. So if that null
00:06:42.880 terminator was overwritten by someone
00:06:44.880 else then this would allow the attacker
00:06:47.199 to read that region of memory.
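Again as a sketch of the slide being described: the same program, but the malloc size omits the +1 for the NUL terminator, so strcpy writes one byte past the allocation and the later printf can read past it:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* BUG: no "+ 1" for the NUL terminator, so the buffer is one byte
     * too small for the string plus its terminator. */
    char *str = malloc(strlen("Hello Rails World"));

    strcpy(str, "Hello Rails World");  /* writes the NUL one byte past the
                                          end of the allocation */

    printf("%s\n", str);  /* if that NUL was overwritten by someone else,
                             printf keeps reading adjacent memory until it
                             finds one */

    free(str);
    return 0;
}
```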
00:06:50.080 So here's an example of an error that we
00:06:52.240 saw in our production system. Our
00:06:54.960 developers saw this in production, and
00:06:58.319 you might be wondering what is course
00:07:01.440 ID. It looks like a typo, doesn't it?
00:07:06.800 If we look at the stack trace, it's
00:07:08.960 erroring out on this line of code. And
00:07:11.440 that doesn't make any sense because the
00:07:13.039 code clearly says source ID, not course
00:07:16.080 ID. So what happened to the first
00:07:18.319 character of that symbol?
00:07:21.599 If we look at the ASCII table, lowercase
00:07:24.000 s is hex 0x73, and in binary it looks
00:07:27.280 like this: 0111 0011. Lowercase q is hex
00:07:30.960 0x71, and in binary it looks like this: 0111 0001.
00:07:33.919 Notice how the two characters differ by
00:07:36.240 one bit. So somewhere someone has
00:07:38.800 flipped the second bit of this
00:07:40.479 character.
00:07:42.080 The cause of this bug was not certain, but
00:07:44.240 it was possibly a use after free or
00:07:46.400 buffer overflow bug where someone wrote
00:07:48.639 to memory that they no longer owned.
00:07:52.080 C requires manual memory management. So
00:07:54.319 if you forget to free that memory, then
00:07:56.400 it will leak. A few of these memory
00:07:58.400 leaks are benign. Ruby might end up just
00:08:00.479 using a little bit more memory. However,
00:08:02.560 if it keeps on happening, then the Ruby
00:08:05.039 process will eventually run out of
00:08:06.639 memory and be killed by the system. This
00:08:09.199 could cause instability in your system
00:08:11.039 as the Ruby process may be terminated by
00:08:13.199 the system halfway during a request.
00:08:16.560 At RubyKaigi 2024, I gave a talk with
00:08:19.199 Adam Hess from GitHub about finding and
00:08:21.759 fixing memory leaks in Ruby and in
00:08:23.599 native gems. If you're a native gem
00:08:26.000 maintainer, you can take advantage of
00:08:27.520 the Ruby free-at-exit feature and the
00:08:29.440 ruby_memcheck tool that I built to find
00:08:31.919 memory leaks in your native gem.
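The transcript doesn't include a leak example, so here is a minimal, hypothetical sketch of the kind of native-extension leak that Valgrind or ruby_memcheck (together with Ruby's free-at-exit mode) can catch; the method name shouty_copy is made up:

```c
#include "ruby.h"
#include <stdlib.h>
#include <string.h>

/* Hypothetical native method: upcases a Ruby String via a C scratch
 * buffer. Type checking is omitted for brevity. */
static VALUE
shouty_copy(VALUE self, VALUE str)
{
    long len = RSTRING_LEN(str);
    char *buf = malloc(len);            /* heap allocation owned by C code */
    memcpy(buf, RSTRING_PTR(str), len);

    for (long i = 0; i < len; i++) {
        if (buf[i] >= 'a' && buf[i] <= 'z') buf[i] -= 32;
    }

    VALUE result = rb_str_new(buf, len);
    /* BUG: `buf` is never freed, so every call leaks `len` bytes. Over
     * enough requests the process grows until the system kills it. */
    return result;
}

void
Init_shouty_copy(void)
{
    rb_define_global_function("shouty_copy", shouty_copy, 1);
}
```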
00:08:35.760 The Ruby C API can run into bugs that we
00:08:38.240 don't encounter in Ruby code. This is
00:08:40.640 because Ruby has automatic memory
00:08:42.640 management via the garbage collector
00:08:44.320 whereas C has manual memory
00:08:46.160 management. And this difference in
00:08:48.320 memory management paradigms means that
00:08:50.399 we need to be aware of how the Ruby
00:08:52.320 garbage collector works when we are
00:08:54.320 writing C code. The most common type of
00:08:57.279 bugs are missing garbage collector
00:08:59.519 guards. When Ruby runs the garbage
00:09:02.080 collector, it scans the C stack to find
00:09:05.040 potential Ruby objects in order to keep
00:09:07.279 them alive. This is known as uh
00:09:09.600 conservative stack scanning. However,
00:09:12.800 the C compiler may optimize the stack
00:09:15.120 space and reuse local variables on the
00:09:18.240 stack. And these local variables may
00:09:21.200 contain pointers to objects that are
00:09:23.120 actively being used. And this would
00:09:25.440 cause objects to be recycled or moved by
00:09:28.000 the garbage collector causing unexpected
00:09:30.399 behaviors in our code or even crashes.
00:09:33.440 So to demonstrate missing GC guards,
00:09:35.839 let's implement this simple Ruby method
00:09:38.720 called array_each_byte using the C API.
00:09:42.560 It calls this block with the integer
00:09:44.880 value of each character in each of the
00:09:47.600 strings in the array. And we can see an
00:09:50.240 example here of how to use this method
00:09:52.320 and the output that it generates.
00:09:55.360 So this is the C implementation of the
00:09:57.680 method. It's missing some checks like
00:10:00.000 checking that each of the elements in
00:10:01.760 the array is actually a string. But
00:10:03.920 that's not really important for this
00:10:05.600 demonstration. So let's quickly take a
00:10:07.920 look at uh this implementation. So we
00:10:10.640 first run a for loop that iterates a
00:10:13.279 counter over each element in the array.
00:10:16.480 We then get the string object at that
00:10:18.560 particular index
00:10:21.120 and then we get the underlying character
00:10:23.120 buffer of the string and the length.
00:10:27.040 We then iterate a counter over each
00:10:29.600 character in the string and then we
00:10:31.839 yield the character in the string by
00:10:34.800 converting it to a Ruby Fixnum. So now
00:10:38.640 let's try running the same script, but
00:10:40.720 with a call to this method,
00:10:43.040 verify_compaction_references, which
00:10:46.240 manually runs garbage collection
00:10:48.240 compaction in the block. Normally
00:10:51.519 compaction tries to be efficient by
00:10:53.839 minimizing the number of objects moved
00:10:56.480 in the garbage collector. However, since
00:10:58.880 we want bugs to show up, this method
00:11:01.279 will allow us to move the maximal number
00:11:03.760 of objects possible.
00:11:07.200 So then if we try to run this, we see
00:11:09.680 incorrect output
00:11:12.000 and sometimes when we run it, it even
00:11:14.160 crashes. So clearly something isn't
00:11:16.720 right here. And if we look at the code
00:11:19.360 again, what we're missing is a GC guard.
00:11:22.079 And we need a GC guard right at this
00:11:24.079 line because the str variable here isn't
00:11:27.440 used later on in the code. It's
00:11:29.680 optimized out of the stack by the C
00:11:31.839 compiler. However, this makes it
00:11:34.079 invisible to the Ruby garbage collector
00:11:36.399 that we're actively using this object in
00:11:39.440 the for loop where we're yielding each
00:11:41.360 character of the string. So the string
00:11:43.760 object ends up getting moved. And so by
00:11:46.160 adding this GC guard here, it ensures
00:11:48.240 that the C compiler will keep the
00:11:50.079 variable alive on the stack up until
00:11:52.480 that point. So now we can run the script
00:11:55.360 again and we will see correct output.
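The transcript describes the C implementation but doesn't reproduce it; the following is a minimal sketch of what an array_each_byte extension function along these lines could look like, with the GC guard fix included (it is not the speaker's exact code). Exercising it under GC.verify_compaction_references, as described above, is what makes the missing-guard failure reproducible:

```c
#include "ruby.h"

/* Sketch of the method described in the talk: yields the integer value
 * of every byte of every String in the given Array. Error checking
 * (e.g. verifying each element is a String) is omitted, as in the talk. */
static VALUE
array_each_byte(VALUE self, VALUE ary)
{
    for (long i = 0; i < RARRAY_LEN(ary); i++) {
        VALUE str = rb_ary_entry(ary, i);

        /* Raw pointer into the String's buffer. If the GC compacts and
         * moves `str` while we're iterating, this pointer goes stale. */
        const char *ptr = RSTRING_PTR(str);
        long len = RSTRING_LEN(str);

        for (long j = 0; j < len; j++) {
            /* rb_yield can allocate and therefore trigger GC. */
            rb_yield(INT2FIX((unsigned char)ptr[j]));
        }

        /* Without this guard the C compiler may drop `str` from the stack
         * after the RSTRING_PTR call, hiding it from Ruby's conservative
         * stack scan; the string could then be moved or recycled mid-loop. */
        RB_GC_GUARD(str);
    }
    return ary;
}

void
Init_array_each_byte(void)
{
    rb_define_global_function("array_each_byte", array_each_byte, 1);
}
```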
00:11:59.519 So errors raised in Ruby interrupt the
00:12:02.640 normal flow of your program, and they
00:12:05.760 can jump multiple stack frames in Ruby.
00:12:08.480 They behave similarly in C, where Ruby
00:12:11.440 uses a C feature called longjmp in
00:12:14.079 order to skip stack frames. However, if
00:12:16.880 you have manually managed memory, it
00:12:19.920 could be lost and leaked when you do
00:12:22.959 that longjmp. So we need to be careful
00:12:25.440 in determining what code could raise
00:12:27.680 errors when we're writing C code for
00:12:30.560 native gems.
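The talk doesn't show code for this case; as a hedged illustration, here is one common pattern for keeping manually managed memory safe across a possible raise, using rb_protect to intercept the non-local jump and rb_jump_tag to re-raise after cleanup (the method stringify_with_scratch is hypothetical):

```c
#include "ruby.h"
#include <stdlib.h>

/* Callback run under rb_protect; anything in here may raise. */
static VALUE
call_to_s(VALUE obj)
{
    return rb_funcall(obj, rb_intern("to_s"), 0);
}

/* A native method that owns a malloc'd scratch buffer while calling back
 * into Ruby. If rb_funcall were called directly and raised, the raise
 * would longjmp over the free() below and the buffer would leak. */
static VALUE
stringify_with_scratch(VALUE self, VALUE obj)
{
    char *scratch = malloc(1024);
    if (!scratch) rb_raise(rb_eNoMemError, "malloc failed");

    int state = 0;
    /* rb_protect catches any exception (or other non-local jump) raised
     * by the callback instead of letting it skip our cleanup. */
    VALUE result = rb_protect(call_to_s, obj, &state);

    free(scratch);                  /* cleanup runs on success and failure */

    if (state) rb_jump_tag(state);  /* re-raise the captured exception */
    return result;
}

void
Init_stringify_with_scratch(void)
{
    rb_define_global_function("stringify_with_scratch", stringify_with_scratch, 1);
}
```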
00:12:33.440 Missing write barriers could also cause
00:12:35.760 subtle bugs, and write barriers are
00:12:38.959 hard to implement correctly. They are
00:12:41.600 a little bit difficult to explain, so
00:12:43.120 learning more about this is left
00:12:45.440 as an exercise for you, the viewer.
00:12:48.560 As a side note, uh there is a common
00:12:50.880 misconception that people have and this
00:12:53.360 misconception is causing an increase in
00:12:55.760 instability in our systems and I'm
00:12:58.320 talking about the rise in the number of
00:13:00.000 Rust gems mainly due to the popularity
00:13:02.399 of Rust and the promise of memory safety
00:13:05.279 from the borrow checker in Rust. While
00:13:07.760 this is true, many Rust gems directly
00:13:10.480 interface with the Ruby C API. Ruby has
00:13:13.760 a C API, not a Rust API. So there are
00:13:16.880 many implementation challenges for Rust
00:13:19.440 gems, including how to get Rust to work
00:13:22.160 properly with the Ruby garbage
00:13:23.920 collector. Additionally, Rust gives
00:13:26.720 developers a false sense of security. So
00:13:29.519 many developers don't truly understand
00:13:31.519 the code that they've written.
00:13:34.240 So as a recap, this section was about
00:13:36.320 some common bugs in Ruby and in native
00:13:38.240 gems. We looked at bugs in C code,
00:13:41.200 including use after free, buffer overflow,
00:13:43.120 and memory leaks. We also looked at some
00:13:45.440 incorrect uses of the Ruby C API,
00:13:47.839 including missing GC guards, raised
00:13:49.760 errors causing memory leaks, and
00:13:51.120 missing write barriers. Now, let's look
00:13:53.440 at some ways to catch and prevent bugs
00:13:55.920 before they reach your production
00:13:58.079 systems. Ruby has powerful ways to check
00:14:01.440 its internal state and run in more
00:14:04.399 extreme ways for bugs to reproduce
00:14:06.800 sooner. So, now let's take a look at a
00:14:09.120 few of these.
00:14:10.800 Ruby has assertions in the code to check
00:14:13.440 that Ruby is running correctly and that
00:14:15.920 the assumptions we make during runtime
00:14:18.160 in the Ruby VM are correct. These
00:14:21.440 assertions are turned off by default
00:14:23.839 because of the negative performance
00:14:25.760 impacts that they bring. However, it is
00:14:28.720 useful to run with these assertions
00:14:30.639 turned on in CI because it can catch
00:14:33.120 bugs in Ruby and the native gems before
00:14:35.760 it is deployed to production.
00:14:38.880 In order to enable assertions, we have
00:14:41.519 to compile Ruby differently by adding
00:14:44.800 RUBY_DEBUG to the cppflags during
00:14:47.839 the configuration step
00:14:49.680 when we
00:14:51.760 compile Ruby. So for ruby-build, which
00:14:54.720 ships with rbenv, you want to use the
00:14:57.120 CONFIGURE_OPTS environment variable. And
00:14:59.519 for ruby-install, which ships with
00:15:01.360 chruby, add the cppflags using a double
00:15:03.680 dash at the end of the command.
00:15:06.320 Compiling Ruby with assertions enabled
00:15:08.240 is documented in the official Ruby
00:15:10.560 contributing guides.
00:15:13.519 Since Rails 7.2, all apps enable the
00:15:16.480 YJIT just in time compiler by default,
00:15:19.120 which improves performance. However, a
00:15:21.760 just in time compiler performs
00:15:23.519 optimizations and compiles Ruby
00:15:26.000 code into machine code. So, it is yet
00:15:28.560 another additional source of bugs. YJIT
00:15:31.519 by default only compiles the hot code,
00:15:33.920 meaning code that is the most
00:15:35.680 commonly run, and because compiling code
00:15:39.199 has performance and memory impacts,
00:15:42.320 YJIT limits it to code that is
00:15:44.240 executed the most frequently.
00:15:47.120 However, this may mean that not much of
00:15:48.959 the code in tests is using YJIT,
00:15:51.839 because code in tests isn't executed
00:15:54.720 repeatedly. So by setting the YJIT
00:15:57.839 call threshold to one, YJIT
00:16:00.560 will compile the code the first time it
00:16:02.880 executes it.
00:16:04.959 C uses manual memory management. We've
00:16:07.360 learned about that. So it is not
00:16:08.959 uncommon to have memory errors such as
00:16:11.120 use after free or out-of-bounds memory
00:16:13.199 access. Malloc implementations are
00:16:16.399 designed to be resilient against memory
00:16:18.639 errors in order to prevent memory
00:16:20.959 attacks. However, if there is a memory
00:16:24.240 error in your app, then resiliency is
00:16:27.120 exactly what we don't need during
00:16:29.199 testing.
00:16:30.959 So there are tools that can help us find
00:16:33.440 memory errors, such as Valgrind or
00:16:36.079 AddressSanitizer, also known as ASAN.
00:16:40.639 Here's an example of a memory error from
00:16:43.040 ASAN. When it encounters a memory
00:16:46.240 error, it causes the program to crash
00:16:48.000 and output an error message that looks
00:16:50.079 like this. These tools are powerful
00:16:53.360 because they do extensive checks on
00:16:55.920 every memory access. Therefore, they
00:16:58.720 make your program run several times
00:17:00.560 slower and use much more memory. So,
00:17:03.199 you'll need to keep that in mind and
00:17:05.120 adjust timeouts and memory limits in CI
00:17:07.679 in order to accommodate these tools.
00:17:11.120 So, for more details on how to build
00:17:13.280 Ruby with ASAN, follow this guide in
00:17:16.160 the official Ruby building guides.
00:17:19.600 We also run nightly tests on our Rails
00:17:22.400 monolith against the latest commit of
00:17:25.520 Ruby's master branch. This helps us
00:17:28.000 accomplish two things. First, it
00:17:31.039 allows us to find incompatibilities in
00:17:33.440 our codebase with the next version of
00:17:35.600 Ruby. This helps us incrementally
00:17:37.760 discover these incompatibilities instead
00:17:39.840 of having to do a big push for Ruby
00:17:42.480 upgrades each year. Secondly, it allows
00:17:45.600 us to discover bugs in Ruby that were not
00:17:48.640 caught by Ruby's test suite. We run
00:17:52.240 our nightly CI against Ruby head with
00:17:54.559 the various configurations mentioned before,
00:17:56.799 like enabling assertions, running with a
00:17:58.880 YJIT call threshold of one, and with
00:18:01.120 ASAN enabled. This has helped us find a
00:18:03.919 wide variety of bugs in Ruby and in native
00:18:06.400 gems and makes our annual Ruby upgrade
00:18:09.280 easy. For these reasons, I encourage
00:18:12.400 your company to also run nightly CI
00:18:15.760 against Ruby head across various
00:18:17.679 configurations. And then when you run
00:18:19.600 into crashes or regressions in Ruby,
00:18:22.400 native gems, or any gems, open
00:18:25.919 bug reports and send fixes upstream.
00:18:29.760 So to recap, in this section, I talked
00:18:32.000 about some techniques to prevent crashes
00:18:34.000 in production. First I talked about
00:18:36.480 compiling Ruby with assertions enabled,
00:18:38.880 which checks the internal state of Ruby;
00:18:41.280 second, running with a YJIT call
00:18:44.080 threshold of one, which makes YJIT
00:18:46.000 compile every method that gets run; and
00:18:48.240 lastly, using a memory checking tool such
00:18:49.919 as Valgrind or ASAN.
00:18:52.640 So there's still the inevitable case
00:18:54.480 where bugs are not caught by CI and only
00:18:57.280 happen in production. So now let's see
00:18:59.440 how we can capture information about
00:19:01.440 crashes that happen in production.
00:19:04.720 When Ruby crashes, it generates a crash
00:19:06.960 report that includes what kind of crash
00:19:09.039 it is, the Ruby stack trace, and the C
00:19:11.440 stack trace. Here's what a crash report
00:19:13.919 looks like.
00:19:16.720 We can first see what
00:19:19.440 kind of crash this is.
00:19:20.960 And this one is a
00:19:23.520 segmentation fault, meaning that it is
00:19:25.840 some sort of memory error. We then see
00:19:28.799 the Ruby stack trace. And using this
00:19:30.720 information, we can try to
00:19:32.799 isolate the issue and even try to find
00:19:35.760 a small reproduction script for
00:19:38.960 this crash.
00:19:40.799 We can then see the C stack trace and
00:19:43.039 this is useful for determining where the
00:19:44.880 bug comes from whether it's a bug in
00:19:46.799 Ruby or in a native gem. And we can also
00:19:49.520 use this to look at the source code in
00:19:51.679 Ruby or in native gems. And just by
00:19:53.840 looking at the source code, sometimes we
00:19:55.280 may even be able to identify the
00:19:57.840 bug. So normally this crash report
00:20:00.880 outputs to standard error. However, as
00:20:03.840 of Ruby 3.3, you can also redirect this
00:20:07.120 crash log into a file.
00:20:10.000 The RUBY_CRASH_REPORT environment
00:20:12.400 variable is documented in the Ruby man
00:20:14.960 pages and explains how to add specifiers
00:20:17.679 to the file name for timestamps or the
00:20:20.080 process ID uh in the generated crash
00:20:22.880 log. So the crash report is often not
00:20:26.240 enough to debug crashes. We need to
00:20:29.120 capture core dumps from the crash. But
00:20:31.520 what are core dumps?
00:20:33.760 Core dumps are files generated
00:20:35.679 containing the state of the program at
00:20:37.760 the time of the crash. This includes
00:20:40.080 everything on the C stack and heap and
00:20:42.640 includes all local and global variables.
00:20:45.360 And then we can open up this core dump
00:20:47.039 in the debugger and try to find the
00:20:49.520 bug. However, core dumps only contain
00:20:52.720 the state of your program at the time of
00:20:54.640 the crash. So it may be hard to
00:20:57.440 determine how you ended up in that
00:20:59.200 state. One thing to keep in mind
00:21:02.880 though is that core dumps contain all of
00:21:05.440 your program's memory. And this includes
00:21:07.919 things like decrypted user passwords and
00:21:10.799 PII in plain text as well as your
00:21:14.000 infrastructure secrets. So make sure to
00:21:17.120 treat core dumps with great care. At
00:21:19.679 Shopify, we upload core dumps to the
00:21:22.080 cloud so we can debug them, but we make
00:21:24.880 sure the access to the cloud bucket
00:21:27.760 is restricted to specific teams and it
00:21:30.400 is encrypted in order to protect
00:21:32.720 against attackers who gain access to
00:21:34.880 that bucket in the cloud.
00:21:37.760 At Shopify, we use a closed source
00:21:40.240 tool called Crash Reporter to upload
00:21:42.240 core dumps that we generate in
00:21:44.080 production. This tool essentially does
00:21:46.559 three things. It first uploads our core
00:21:49.200 dump to the bucket in the cloud. Then it
00:21:52.159 detects if uh the crashing binary is
00:21:54.960 Ruby and if it is then it finds the
00:21:57.360 associated crash report that we've
00:21:59.200 written to disk and uploads it. And then
00:22:02.000 finally it creates an event in our
00:22:04.720 error monitoring system in order to
00:22:06.799 keep track of this crash. So the first
00:22:09.520 step is capturing the core dump when it
00:22:11.120 occurs. You can configure the
00:22:14.080 behavior in Linux when a program crashes:
00:22:17.520 by default it does nothing. You can
00:22:20.000 also configure it to write to a
00:22:22.000 particular file, or you can use the
00:22:24.880 feature, which is the one that we use,
00:22:26.720 that pipes the core dump to a program's
00:22:29.600 standard input. So our crash reporter
00:22:32.640 tool reads this core dump from
00:22:34.880 the standard input and uploads it to a
00:22:37.120 bucket in the cloud.
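Shopify's Crash Reporter is closed source, so the following is only a rough sketch of the Linux core_pattern pipe mechanism it builds on: a tiny handler that reads the core dump from standard input and writes it to disk (paths and the handler name are illustrative; a real tool would upload to cloud storage and record an event instead):

```c
/* core-handler.c: minimal sketch of a Linux core_pattern pipe handler.
 * Configured with something like (illustrative path):
 *   echo '|/usr/local/bin/core-handler %p %e' > /proc/sys/kernel/core_pattern
 * The kernel then runs this program with the core dump on standard input. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const char *pid  = argc > 1 ? argv[1] : "unknown";  /* %p */
    const char *name = argc > 2 ? argv[2] : "unknown";  /* %e */

    char path[256];
    snprintf(path, sizeof(path), "/var/crash/core.%s.%s", name, pid);

    FILE *out = fopen(path, "wb");
    if (!out) return 1;

    /* Copy the core dump from stdin to the destination file. */
    char buf[65536];
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), stdin)) > 0) {
        fwrite(buf, 1, n, out);
    }

    fclose(out);
    return 0;
}
```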
00:22:39.600 Then secondly, when we have the core
00:22:41.520 dump, we detect whether this core dump
00:22:43.679 is from Ruby and if it is, we try to
00:22:46.000 find the associated crash report and
00:22:48.000 upload it because it contains valuable
00:22:50.240 information such as the Ruby and C-level
00:22:52.720 stack traces. Finally, we keep track of
00:22:55.840 these crashes by creating an event in
00:22:57.760 our error monitoring system. We also at
00:23:01.280 this step parse the crash report
00:23:03.520 acquired in the previous step for the
00:23:05.760 Ruby and the C-level stack traces.
00:23:09.039 This information is useful for
00:23:11.520 quickly triaging crashes and determining
00:23:14.320 if this crash is a new or a known
00:23:17.360 issue.
00:23:18.880 And using the Ruby-level stack trace,
00:23:21.440 we can even try to find a
00:23:23.440 minimum reproduction of the bug. So now
00:23:26.480 that we've collected the necessary
00:23:28.400 information for the crash, we need to
00:23:30.799 use it to figure out what the bug is.
00:23:33.520 Here's an AI generated image of what
00:23:35.280 that process looks like. And I'm going
00:23:36.880 to be honest, I could spend hours here
00:23:39.120 giving tips and techniques, but there's
00:23:41.520 so many possibilities that even during
00:23:44.000 that time, I couldn't even possibly
00:23:45.440 cover all of the potential issues. We've
00:23:48.480 even seen cases where the C compiler we
00:23:51.280 used to compile the Ruby binary and
00:23:53.120 native gems had a bug in it and was
00:23:56.320 emitting incorrect instructions into the
00:23:58.720 binary. But I'll briefly talk about how
00:24:01.200 to debug core dumps generated from
00:24:03.120 production.
00:24:04.720 So in order to debug core dumps we need
00:24:07.039 the following things. So of course we'll
00:24:10.000 need the core dump file. We'll need the
00:24:12.080 original binaries of Ruby, system
00:24:14.000 libraries, and native gems. And we need
00:24:16.080 these because they contain the
00:24:17.760 symbols in order to output meaningful
00:24:20.240 stack traces, structure definitions and
00:24:22.799 variables.
00:24:24.320 And we need to be on the same operating
00:24:26.720 system and usually the same CPU
00:24:28.559 architecture as the production system.
00:24:31.760 Those are quite a few requirements, but
00:24:33.760 fortunately, if you're using a container
00:24:35.840 system like Docker in production, you
00:24:38.000 can instead just use the production
00:24:39.679 container to debug the core dump. So, in
00:24:42.720 order to debug the core dump, you open
00:24:44.480 the core dump
00:24:46.240 up in a debugger such as GDB or LLDB.
00:24:49.520 You need to specify what the core dump
00:24:51.200 file is and the Ruby binary that
00:24:54.799 was crashing.
00:24:56.880 Then, when you're in the debugger, you
00:24:58.480 can analyze the core dump by looking at
00:25:00.159 the backtrace and reading variables and
00:25:02.799 pieces of memory.
00:25:06.080 So to recap in this section we talked
00:25:08.320 about some ways to capture information
00:25:10.320 about crashes. We saw what a crash
00:25:12.799 report looks like and how to redirect
00:25:15.200 crash reports to a file in order to
00:25:17.200 capture it. We looked at what core dumps
00:25:19.679 are and how we can generate them on
00:25:21.520 Linux systems. Finally, we looked at how
00:25:24.400 we can open core dumps in the debugger
00:25:26.240 in order to debug them. So once you've found
00:25:29.760 a bug, you should report it. If it's
00:25:32.320 a bug in a native gem, let the
00:25:34.400 maintainer know. If it's in Ruby, let
00:25:36.640 us, the Ruby core team, know about it by
00:25:39.120 reporting it to the bug tracker. But
00:25:41.440 wait, before you open a ticket, is the
00:25:44.240 Ruby you're using an actively maintained
00:25:46.320 version of Ruby? If you're unsure,
00:25:48.640 consult this page. Right now, only Ruby
00:25:51.279 3.3 and 3.4 are in normal
00:25:53.440 maintenance mode, and Ruby 3.2 is
00:25:56.720 only in security maintenance, meaning
00:25:58.720 that regular bugs are no longer fixed.
00:26:01.440 So if you're running on Ruby 3.2 or
00:26:03.600 earlier and there are crashes, then try
00:26:05.600 upgrading your Ruby version first.
00:26:08.640 Additionally, make sure you're
00:26:10.720 running on the latest patch version of
00:26:12.960 your major version of Ruby. These
00:26:15.120 contain all of the backported fixes, and
00:26:18.080 the most recent version as of
00:26:20.000 this talk is listed here.
00:26:24.320 If you're running a maintained version
00:26:25.919 of Ruby and you're on the latest patch
00:26:27.919 version, then submit a bug report on the
00:26:30.000 Ruby bug tracker and maybe even try to
00:26:32.159 submit a fix. The guide linked here
00:26:34.400 explains how to report bugs on the bug
00:26:36.320 tracker, including what kinds of
00:26:37.760 information to include in your ticket.
00:26:41.039 At Shopify, when we fix a bug in Ruby,
00:26:43.679 we want to upgrade our Ruby version to
00:26:46.080 include that patch as soon as possible.
00:26:49.200 However, a new version of Ruby may not
00:26:51.440 come out for a few months. So, we create
00:26:53.520 an internal release with these
00:26:56.159 fixes backported.
00:26:58.799 We've made all of our Ruby definitions
00:27:01.039 public, and you can find them in the
00:27:02.960 Shopify/ruby-definitions repository.
00:27:05.840 This contains ruby-build definitions for
00:27:08.000 all of the custom Ruby versions we run
00:27:10.000 at Shopify.
00:27:12.159 So today I've covered quite a few topics
00:27:14.640 including why stability of Ruby is
00:27:16.320 important, some sources of instability
00:27:18.240 in Ruby, how we can catch bugs in Ruby
00:27:21.120 and in native gems before they reach
00:27:22.799 your production systems, and finally how
00:27:25.039 to debug core dumps that you've
00:27:27.440 captured from production. So you can
00:27:29.919 find a copy of these slides at this QR
00:27:32.320 code. If you have any questions, feel
00:27:34.480 free to ask me after this talk or
00:27:36.400 through social media or through email.
00:27:38.799 Thank you for coming to my talk.