Summarized using AI

Ruby Stability at Scale

Peter Zhu • September 05, 2025 • Amsterdam, Netherlands • Talk

Ruby Stability at Scale: Lessons from Shopify's Monolith

Peter Zhu's talk at Rails World 2025, "Ruby Stability at Scale," focuses on how to monitor and maintain the stability of large-scale Ruby on Rails applications, particularly when instability originates from Ruby itself or native gems, rather than application code or external libraries. Drawing from his experience in managing Shopify's monolithic Rails app, one of the largest in the world, Zhu provides a comprehensive overview of crash prevention, detection, and post-mortem analysis techniques.

Main Topic

The talk centers on managing and improving the stability of Ruby and native gems in production systems, exploring strategies for monitoring, debugging, and preventing Ruby-level crashes and outages.

Key Points

  • Complex Layers in Rails Infrastructure:

    • A modern Rails app consists of multiple layers, often managed by specialized teams, with Ruby as a foundational component.
    • Native gems, often numerous even in new Rails apps, introduce additional complexity and potential vulnerability.
  • Common Sources of Instability:

    • Ruby and native gems are frequently written in C, making them susceptible to C-specific bugs such as use-after-free, buffer overflows, missing garbage collector (GC) guards, memory leaks, and incorrect usage of the Ruby C API.
    • Examples include subtle bugs where memory mismanagement leads to hard-to-diagnose production crashes (e.g., bit flips in symbols), and misconceptions around safety in Rust gems that interface with the Ruby C API.
  • Proactive Crash Prevention:

    • Compiling Ruby with assertions enabled helps catch internal errors early during continuous integration (CI) runs.
    • Setting the YJIT call threshold to one during tests forces more of the codebase through the just-in-time compiler, making bugs more likely to surface.
    • Using memory error detection tools such as Valgrind and AddressSanitizer (ASAN) can expose memory mismanagement issues before production deployments.
    • Running nightly CI tests against the latest Ruby (master branch) helps identify incompatibilities and bugs ahead of official Ruby upgrades.
  • Crash Detection and Debugging in Production:

    • Ruby generates crash reports with Ruby and C stack traces; as of Ruby 3.3, crash logs can be redirected to files for easier triage.
    • Core dumps capture the memory state at the time of the crash, offering deep inspection at the point of failure but also containing sensitive data.
    • At Shopify, a custom Crash Reporter tool uploads core dumps and crash reports to secure cloud storage and integrates crash events with error monitoring systems.
    • Effective debugging of core dumps requires access to original binaries, symbols, and a production-matching environment, often simplified by containerized deployments (e.g., Docker).
  • Upstream Reporting and Patch Management:

  • Bugs found in native gems should be reported to their maintainers; bugs in Ruby should be reported to the Ruby core team.
    • Before reporting, ensure the Ruby version in use is maintained and up-to-date to avoid opening tickets for already-resolved issues.
  • Shopify maintains a public repository, Shopify/ruby-definitions, of ruby-build definitions for its internally patched Ruby versions.

Key Takeaways

  • Instability can stem from the Ruby interpreter or native extensions, not just application code.
  • Proactive measures in development and CI—like enabling assertions, aggressive JIT testing, and memory checkers—significantly improve the detection of bugs before production.
  • Robust crash handling, logging, and restricted access to crash artifacts (such as core dumps) are essential for effective post-crash debugging and for protecting sensitive data.
  • Open communication with upstream maintainers and timely upgrades ensure long-term stability and resilience.


Date: September 05, 2025
Published: Mon, 15 Sep 2025 00:00:00 +0000
Announced: Tue, 20 May 2025 00:00:00 +0000

There are many talks, articles, and tutorials on how to monitor your Rails app for stability. These assume the source of the bug comes from your application code, from Rails itself, or from a gem. But what if the source of instability comes from Ruby or a native gem? If Ruby crashes, do you have any monitoring or ways to debug it?

In this talk, we'll look at how we deal with Ruby crashes in the Shopify monolith, the world's largest Ruby on Rails application, and how you can use some of our techniques. We'll cover topics such as how we monitor crashes, capture core dumps for debugging, prevent crashes, and minimize the impact of crashes on production.

Rails World 2025

00:00:07.200 Hi everyone. It's an absolute honor to
00:00:09.360 be here speaking at Rails World 2025.
00:00:12.880 Infrastructure is a top priority at
00:00:15.120 Shopify and I am sure it is at your
00:00:17.600 company as well. It is important to
00:00:20.080 prevent outages, to prevent data
00:00:22.240 corruption, and ultimately to have a
00:00:24.400 good user experience. In this talk, I'll
00:00:27.599 be covering some ways to prevent and
00:00:29.519 respond to instability in Ruby.
00:00:32.880 You can find the slides uh at this URL
00:00:35.840 or by scanning this QR code. Don't worry
00:00:38.480 if you end up missing this because I'll
00:00:40.879 also have this QR code up at the end of
00:00:42.879 this talk. So, first I'll talk a little
00:00:45.440 bit about me. That's what I look like on
00:00:47.440 the internet. I'm currently based in
00:00:49.600 Toronto, Canada. I'm on the Ruby core
00:00:52.000 team and I'm a staff developer at
00:00:54.239 Shopify on the Ruby infrastructure team
00:00:57.039 where I work on performance and memory
00:00:58.960 management in Ruby. I'm the co-author of
00:01:01.760 the variable width allocation feature in
00:01:03.359 Ruby which improves performance and
00:01:04.960 memory efficiency of Ruby's garbage
00:01:06.720 collector. I'm also the co-author of the
00:01:08.960 Ruby free-at-exit feature, which frees
00:01:11.360 memory at shutdown to allow the use of
00:01:13.439 memory leak checkers such as Valgrind or
00:01:15.360 the Mac OS leaks tool. I also designed
00:01:18.320 and implemented the modular garbage
00:01:20.080 collector feature in Ruby.
00:01:25.439 I'm also the author of the ruby_memcheck
00:01:27.439 and autotuner gems, and in my
00:01:29.520 free time I like to travel and take
00:01:31.920 photos and I post them on Instagram at
00:01:33.920 peterzhu.photos.
00:01:36.159 So first I'll talk a little bit about
00:01:37.840 the outline of this talk. We'll first
00:01:40.560 take a look at what the infrastructure
00:01:42.479 of a typical Rails app looks like, the
00:01:44.720 teams that you might have maintaining
00:01:46.479 it, and where you might have blind
00:01:48.479 spots in your infrastructure. We'll then
00:01:50.960 take a look at some of the ways Ruby
00:01:52.960 could cause instability in your
00:01:54.720 infrastructure
00:01:57.759 and ways to proactively prevent crashes
00:02:00.000 from happening in production.
00:02:02.479 We'll end off by discussing how to
00:02:04.079 capture metrics and information about
00:02:05.920 crashes and how we can use that
00:02:07.680 information to debug.
00:02:10.160 There are many moving parts and many
00:02:12.160 layers to the tech stack of a Rails app.
00:02:14.400 In your company, you might have teams
00:02:16.239 dedicated to each uh part of this stack.
00:02:19.120 And let's see a simplified example of a
00:02:21.440 typical Rails app.
00:02:23.760 And here's a diagram of a simplified
00:02:26.080 example of what your infrastructure
00:02:27.599 might look like. You might have teams
00:02:29.120 that manage external services such as a
00:02:32.319 database, your caches, or other
00:02:34.560 microservices that you use. You might
00:02:36.959 have teams that manage deployments of
00:02:38.720 your app in production using tools like
00:02:40.879 Docker and Kubernetes. You probably have
00:02:43.519 a large number of product developers
00:02:45.519 working on your Rails application.
00:02:48.560 You might have some product
00:02:50.480 developers that might be Rails experts
00:02:53.280 or you may even have whole teams
00:02:55.040 dedicated to the architecture of your
00:02:56.959 app. And these people would be
00:02:58.800 responsible for reducing tech debt, for
00:03:00.720 triaging and debugging exceptions that
00:03:02.879 happen in production and performing
00:03:04.879 upgrades for Rails, gems, and Ruby.
00:03:09.360 However, all of this runs on top of Ruby
00:03:12.159 and Ruby is just another piece of
00:03:14.000 software. So, it can have bugs and it
00:03:16.319 can crash.
00:03:18.000 Do you have people responsible for
00:03:20.239 maintaining this level in your tech
00:03:22.080 stack? Do you have observability into
00:03:24.720 this layer? And do you know what to do
00:03:27.040 when Ruby crashes? In this talk, I'll be
00:03:30.000 talking about some of the reasons why
00:03:31.360 Ruby could crash, how to get information
00:03:33.680 and metrics about these crashes, and
00:03:36.080 what we can do about it. So, let's first
00:03:38.799 take a look at why Ruby could crash
00:03:43.599 or, I should add, native gems. Your app
00:03:47.360 may have tens or even hundreds of native
00:03:50.159 gems.
00:03:52.720 And in fact, if you don't believe me
00:03:54.319 that there are a lot of native gems in
00:03:56.000 your app, a brand new Rails app installs
00:03:58.879 21 native gems. This adds 21 additional
00:04:03.200 possible sources of instability inside
00:04:05.840 of your app.
00:04:07.760 And so now, let's take a look at some of
00:04:09.599 the common categories of bugs that Ruby
00:04:12.239 and native gems run into.
00:04:15.120 Ruby is written in C and so are a lot of
00:04:17.840 native gems. This means that they can
00:04:19.359 run into bugs found in C code. So in C
00:04:22.880 you have to manually allocate and free
00:04:25.680 memory. This is unlike Ruby where we
00:04:28.080 have a garbage collector which
00:04:30.080 determines when an object is alive or
00:04:32.800 dead. So if you free a piece of memory
00:04:35.600 in C before all of the places give up
00:04:38.080 references to that piece of memory then
00:04:41.280 you could cause a lot of different
00:04:43.120 problems including crashes or unexpected
00:04:45.840 behaviors. And this is known as a use
00:04:48.240 after free bug. Let's see a quick
00:04:50.479 example. This is a short snippet of C
00:04:53.199 code and we first allocate memory using
00:04:57.360 malloc and we write the string hello
00:04:59.280 rails world into it.
00:05:03.360 Then we print out the string, and then we
00:05:06.479 free the memory that
00:05:08.880 contains the string,
00:05:11.520 and then we print the string again. This
00:05:13.440 is not allowed because this is a use
00:05:15.440 after free bug, and the behavior of that
00:05:18.560 is nondeterministic: it could either
00:05:20.720 crash or it could even read other pieces
00:05:23.039 of memory.
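The slide code isn't reproduced in the transcript, so here is a minimal C sketch of the use-after-free pattern being described (variable names are illustrative):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* Allocate a buffer and write "Hello Rails World" into it. */
    char *str = malloc(strlen("Hello Rails World") + 1);
    strcpy(str, "Hello Rails World");

    printf("%s\n", str);   /* OK: we still own this memory. */

    free(str);             /* All references must be given up here... */

    printf("%s\n", str);   /* ...so this is a use-after-free: it may print
                              garbage, crash, or appear to work. */
    return 0;
}
```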
00:05:24.960 There are also buffer overflow bugs
00:05:28.080 which access past the end of a region of
00:05:30.320 memory which could potentially cause a
00:05:32.160 crash or it could even potentially end
00:05:34.560 up reading another piece of memory. And
00:05:37.199 so this could open up your app to
00:05:38.880 attacks and allow attackers to read from
00:05:41.440 or write to other pieces of memory. The
00:05:43.919 attacker could exploit it to
00:05:46.000 perform unintended behaviors in your app
00:05:48.320 or even brute force it to read other
00:05:50.479 pieces of memory, such as users' passwords
00:05:53.440 or secrets in your infrastructure. Let's
00:05:55.759 see a quick example. This is the program
00:05:58.560 again. It's similar to the example that
00:06:00.639 I showed for the use after free bug.
00:06:03.520 But what's the difference? Do you see
00:06:05.600 where the bug is?
00:06:09.360 plus one.
00:06:11.280 You're missing the plus one for the null
00:06:13.440 terminator. When we malloc, the null
00:06:15.840 terminator is important because it
00:06:18.960 signals the end of the
00:06:20.639 string. And so the strcpy here, which
00:06:23.919 copies into that string, will now
00:06:26.720 write the null terminator past the end
00:06:28.960 of the region of memory that was
00:06:31.280 allocated. So it
00:06:33.680 could be writing into
00:06:35.120 another piece of memory, and the printf
00:06:38.240 after it will now read past the end of
00:06:40.720 that region of memory. So if that null
00:06:42.880 terminator was overwritten by someone
00:06:44.880 else then this would allow the attacker
00:06:47.199 to read that region of memory.
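Again as a sketch of the slide being described: the same program, but the malloc size omits the +1 for the NUL terminator, so strcpy writes one byte past the allocation and the later printf can read past it:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* BUG: no "+ 1" for the NUL terminator, so the buffer is one byte
     * too small for the string plus its terminator. */
    char *str = malloc(strlen("Hello Rails World"));

    strcpy(str, "Hello Rails World");  /* writes the NUL one byte past the
                                          end of the allocation */

    printf("%s\n", str);  /* if that NUL was overwritten by someone else,
                             printf keeps reading adjacent memory until it
                             finds one */

    free(str);
    return 0;
}
```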
00:06:50.080 So here's an example of an error that we
00:06:52.240 saw in our production system. Our
00:06:54.960 developers saw this in production, and
00:06:58.319 you might be wondering what is course
00:07:01.440 ID. It looks like a typo, doesn't it?
00:07:06.800 If we look at the stack trace, it's
00:07:08.960 erroring out on this line of code. And
00:07:11.440 that doesn't make any sense because the
00:07:13.039 code clearly says source ID, not course
00:07:16.080 ID. So what happened to the first
00:07:18.319 character of that symbol?
00:07:21.599 If we look at the ASCII table, lowercase
00:07:24.000 s is hex 0x73, and in binary it looks
00:07:27.280 like this: 0111 0011. Lowercase q is hex
00:07:30.960 0x71, and in binary it looks like this: 0111 0001.
00:07:33.919 Notice how the two characters differ by
00:07:36.240 one bit. So somewhere someone has
00:07:38.800 flipped the second bit of this
00:07:40.479 character.
00:07:42.080 The cause of this bug was not certain, but
00:07:44.240 it was possibly a use after free or
00:07:46.400 buffer overflow bug where someone wrote
00:07:48.639 to memory that they no longer owned.
00:07:52.080 C requires manual memory management. So
00:07:54.319 if you forget to free that memory, then
00:07:56.400 it will leak. A few of these memory
00:07:58.400 leaks are benign. Ruby might end up just
00:08:00.479 using a little bit more memory. However,
00:08:02.560 if it keeps on happening, then the Ruby
00:08:05.039 process will eventually run out of
00:08:06.639 memory and be killed by the system. This
00:08:09.199 could cause instability in your system
00:08:11.039 as the Ruby process may be terminated by
00:08:13.199 the system halfway during a request.
00:08:16.560 At RubyKaigi 2024, I gave a talk with
00:08:19.199 Adam Hess from GitHub about finding and
00:08:21.759 fixing memory leaks in Ruby and in
00:08:23.599 native gems. If you're a native gem
00:08:26.000 maintainer, you can take advantage of
00:08:27.520 the Ruby free-at-exit feature and the
00:08:29.440 ruby_memcheck tool that I built to find
00:08:31.919 memory leaks in your native gem.
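The transcript doesn't include a leak example, so here is a minimal, hypothetical sketch of the kind of native-extension leak that Valgrind or ruby_memcheck (together with Ruby's free-at-exit mode) can catch; the method name shouty_copy is made up:

```c
#include "ruby.h"
#include <stdlib.h>
#include <string.h>

/* Hypothetical native method: upcases a Ruby String via a C scratch
 * buffer. Type checking is omitted for brevity. */
static VALUE
shouty_copy(VALUE self, VALUE str)
{
    long len = RSTRING_LEN(str);
    char *buf = malloc(len);            /* heap allocation owned by C code */
    memcpy(buf, RSTRING_PTR(str), len);

    for (long i = 0; i < len; i++) {
        if (buf[i] >= 'a' && buf[i] <= 'z') buf[i] -= 32;
    }

    VALUE result = rb_str_new(buf, len);
    /* BUG: `buf` is never freed, so every call leaks `len` bytes. Over
     * enough requests the process grows until the system kills it. */
    return result;
}

void
Init_shouty_copy(void)
{
    rb_define_global_function("shouty_copy", shouty_copy, 1);
}
```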
00:08:35.760 The Ruby C API can run into bugs that we
00:08:38.240 don't encounter in Ruby code. This is
00:08:40.640 because Ruby has automatic memory
00:08:42.640 management via the garbage collector
00:08:44.320 whereas C has manual memory
00:08:46.160 management. And this difference in
00:08:48.320 memory management paradigms means that
00:08:50.399 we need to be aware of how the Ruby
00:08:52.320 garbage collector works when we are
00:08:54.320 writing C code. The most common type of
00:08:57.279 bugs are missing garbage collector
00:08:59.519 guards. When Ruby runs the garbage
00:09:02.080 collector, it scans the C stack to find
00:09:05.040 potential Ruby objects in order to keep
00:09:07.279 them alive. This is known as uh
00:09:09.600 conservative stack scanning. However,
00:09:12.800 the C compiler may optimize the stack
00:09:15.120 space and reuse local variables on the
00:09:18.240 stack. And these local variables may
00:09:21.200 contain pointers to objects that are
00:09:23.120 actively being used. And this would
00:09:25.440 cause objects to be recycled or moved by
00:09:28.000 the garbage collector causing unexpected
00:09:30.399 behaviors in our code or even crashes.
00:09:33.440 So to demonstrate missing GC guards,
00:09:35.839 let's implement this simple Ruby method
00:09:38.720 called array_each_byte using the C API.
00:09:42.560 It calls this block with the integer
00:09:44.880 value of each character in each of the
00:09:47.600 strings in the array. And we can see an
00:09:50.240 example here of how to use this method
00:09:52.320 and the output that it generates.
00:09:55.360 So this is the C implementation of the
00:09:57.680 method. It's missing some checks like
00:10:00.000 checking that each of the elements in
00:10:01.760 the array is actually a string. But
00:10:03.920 that's not really important for this
00:10:05.600 demonstration. So let's quickly take a
00:10:07.920 look at uh this implementation. So we
00:10:10.640 first run a for loop that iterates a
00:10:13.279 counter over each element in the array.
00:10:16.480 We then get the string object at that
00:10:18.560 particular index
00:10:21.120 and then we get the underlying character
00:10:23.120 buffer of the string and the length.
00:10:27.040 We then iterate a counter over each
00:10:29.600 character in the string and then we
00:10:31.839 yield the character in the string by
00:10:34.800 converting it to a Ruby Fixnum. So now
00:10:38.640 let's try running the same script, but
00:10:40.720 with a call to this method,
00:10:43.040 verify_compaction_references, which
00:10:46.240 manually runs garbage collection
00:10:48.240 compaction in the block. Normally
00:10:51.519 compaction tries to be efficient by
00:10:53.839 minimizing the number of objects moved
00:10:56.480 in the garbage collector. However, since
00:10:58.880 we want bugs to show up, this method
00:11:01.279 will allow us to move the maximal number
00:11:03.760 of objects possible.
00:11:07.200 So then if we try to run this, we see
00:11:09.680 incorrect output
00:11:12.000 and sometimes when we run it, it even
00:11:14.160 crashes. So clearly something isn't
00:11:16.720 right here. And if we look at the code
00:11:19.360 again, what we're missing is a GC guard.
00:11:22.079 And we need a GC guard right at this
00:11:24.079 line because the str variable here isn't
00:11:27.440 used later on in the code. It's
00:11:29.680 optimized out of the stack by the C
00:11:31.839 compiler. However, this makes it
00:11:34.079 invisible to the Ruby garbage collector
00:11:36.399 that we're actively using this object in
00:11:39.440 the for loop where we're yielding each
00:11:41.360 character of the string. So the string
00:11:43.760 object ends up getting moved. And so by
00:11:46.160 adding this GC guard here, it ensures
00:11:48.240 that the C compiler will keep the
00:11:50.079 variable alive on the stack up until
00:11:52.480 that point. So now we can run the script
00:11:55.360 again and we will see correct output.
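The transcript describes the C implementation but doesn't reproduce it; the following is a minimal sketch of what an array_each_byte extension function along these lines could look like, with the GC guard fix included (it is not the speaker's exact code). Exercising it under GC.verify_compaction_references, as described above, is what makes the missing-guard failure reproducible:

```c
#include "ruby.h"

/* Sketch of the method described in the talk: yields the integer value
 * of every byte of every String in the given Array. Error checking
 * (e.g. verifying each element is a String) is omitted, as in the talk. */
static VALUE
array_each_byte(VALUE self, VALUE ary)
{
    for (long i = 0; i < RARRAY_LEN(ary); i++) {
        VALUE str = rb_ary_entry(ary, i);

        /* Raw pointer into the String's buffer. If the GC compacts and
         * moves `str` while we're iterating, this pointer goes stale. */
        const char *ptr = RSTRING_PTR(str);
        long len = RSTRING_LEN(str);

        for (long j = 0; j < len; j++) {
            /* rb_yield can allocate and therefore trigger GC. */
            rb_yield(INT2FIX((unsigned char)ptr[j]));
        }

        /* Without this guard the C compiler may drop `str` from the stack
         * after the RSTRING_PTR call, hiding it from Ruby's conservative
         * stack scan; the string could then be moved or recycled mid-loop. */
        RB_GC_GUARD(str);
    }
    return ary;
}

void
Init_array_each_byte(void)
{
    rb_define_global_function("array_each_byte", array_each_byte, 1);
}
```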
00:11:59.519 So errors raised in Ruby interrupt the
00:12:02.640 normal flow of your program, and they
00:12:05.760 can jump multiple stack frames in Ruby.
00:12:08.480 They behave similarly in C, where Ruby
00:12:11.440 uses a C feature called longjmp in
00:12:14.079 order to skip stack frames. However, if
00:12:16.880 you have manually managed memory, it
00:12:19.920 could be lost and leaked when you do
00:12:22.959 that longjmp. So we need to be careful
00:12:25.440 in determining what code could raise
00:12:27.680 errors when we're writing C code for
00:12:30.560 native gems.
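The talk doesn't show code for this case; as a hedged illustration, here is one common pattern for keeping manually managed memory safe across a possible raise, using rb_protect to intercept the non-local jump and rb_jump_tag to re-raise after cleanup (the method stringify_with_scratch is hypothetical):

```c
#include "ruby.h"
#include <stdlib.h>

/* Callback run under rb_protect; anything in here may raise. */
static VALUE
call_to_s(VALUE obj)
{
    return rb_funcall(obj, rb_intern("to_s"), 0);
}

/* A native method that owns a malloc'd scratch buffer while calling back
 * into Ruby. If rb_funcall were called directly and raised, the raise
 * would longjmp over the free() below and the buffer would leak. */
static VALUE
stringify_with_scratch(VALUE self, VALUE obj)
{
    char *scratch = malloc(1024);
    if (!scratch) rb_raise(rb_eNoMemError, "malloc failed");

    int state = 0;
    /* rb_protect catches any exception (or other non-local jump) raised
     * by the callback instead of letting it skip our cleanup. */
    VALUE result = rb_protect(call_to_s, obj, &state);

    free(scratch);                  /* cleanup runs on success and failure */

    if (state) rb_jump_tag(state);  /* re-raise the captured exception */
    return result;
}

void
Init_stringify_with_scratch(void)
{
    rb_define_global_function("stringify_with_scratch", stringify_with_scratch, 1);
}
```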
00:12:33.440 Missing write barriers could also cause
00:12:35.760 subtle bugs, and write barriers are
00:12:38.959 hard to implement correctly. They are
00:12:41.600 a little bit difficult to explain, so
00:12:43.120 learning more about this is left
00:12:45.440 as an exercise for you, the viewer.
00:12:48.560 As a side note, uh there is a common
00:12:50.880 misconception that people have and this
00:12:53.360 misconception is causing an increase in
00:12:55.760 instability in our systems and I'm
00:12:58.320 talking about the rise in the number of
00:13:00.000 Rust gems mainly due to the popularity
00:13:02.399 of Rust and the promise of memory safety
00:13:05.279 from the borrow checker in Rust. While
00:13:07.760 this is true, many Rust gems directly
00:13:10.480 interface with the Ruby C API. Ruby has
00:13:13.760 a C API, not a Rust API. So there are
00:13:16.880 many implementation challenges for Rust
00:13:19.440 gems, including how to get Rust to work
00:13:22.160 properly with the Ruby garbage
00:13:23.920 collector. Additionally, Rust gives
00:13:26.720 developers a false sense of security. So
00:13:29.519 many developers don't truly understand
00:13:31.519 the code that they've written.
00:13:34.240 So as a recap, this section was about
00:13:36.320 some common bugs in Ruby and in native
00:13:38.240 gems. We looked at bugs in C code,
00:13:41.200 including use after free, buffer overflow,
00:13:43.120 and memory leaks. We also looked at some
00:13:45.440 incorrect uses of the Ruby C API,
00:13:47.839 including missing GC guards, raised
00:13:49.760 errors causing memory leaks, and
00:13:51.120 missing write barriers. Now, let's look
00:13:53.440 at some ways to catch and prevent bugs
00:13:55.920 before they reach your production
00:13:58.079 systems. Ruby has powerful ways to check
00:14:01.440 its internal state and run in more
00:14:04.399 extreme ways for bugs to reproduce
00:14:06.800 sooner. So, now let's take a look at a
00:14:09.120 few of these.
00:14:10.800 Ruby has assertions in the code to check
00:14:13.440 that Ruby is running correctly and that
00:14:15.920 the assumptions we make during runtime
00:14:18.160 in the Ruby VM are correct. These
00:14:21.440 assertions are turned off by default
00:14:23.839 because of the negative performance
00:14:25.760 impacts that they bring. However, it is
00:14:28.720 useful to run with these assertions
00:14:30.639 turned on in CI because it can catch
00:14:33.120 bugs in Ruby and the native gems before
00:14:35.760 it is deployed to production.
00:14:38.880 In order to enable assertions, we have
00:14:41.519 to compile Ruby differently by adding
00:14:44.800 RUBY_DEBUG to the cppflags during
00:14:47.839 the configuration step
00:14:49.680 when we
00:14:51.760 compile Ruby. So for ruby-build, which
00:14:54.720 ships with rbenv, you want to use the
00:14:57.120 CONFIGURE_OPTS environment variable. And
00:14:59.519 for ruby-install, which ships with
00:15:01.360 chruby, add the cppflags using a double
00:15:03.680 dash at the end of the command.
00:15:06.320 Compiling Ruby with assertions enabled
00:15:08.240 is documented in the official Ruby
00:15:10.560 contributing guides.
00:15:13.519 Since Rails 7.2, all apps enable the
00:15:16.480 YJIT just in time compiler by default,
00:15:19.120 which improves performance. However, a
00:15:21.760 just in time compiler performs
00:15:23.519 optimizations and compiles Ruby
00:15:26.000 code into machine code. So, it is yet
00:15:28.560 another additional source of bugs. YJIT
00:15:31.519 by default only compiles the hot code,
00:15:33.920 meaning code that is the most
00:15:35.680 commonly run, and because compiling code
00:15:39.199 has performance and memory impacts,
00:15:42.320 YJIT limits it to code that is
00:15:44.240 executed the most frequently.
00:15:47.120 However, this may mean that not much of
00:15:48.959 the code in tests is using YJIT,
00:15:51.839 because code in tests isn't executed
00:15:54.720 repeatedly. So by setting the YJIT
00:15:57.839 call threshold to one, YJIT
00:16:00.560 will compile the code the first time it
00:16:02.880 executes it.
00:16:04.959 C uses manual memory management. We've
00:16:07.360 learned about that. So it is not
00:16:08.959 uncommon to have memory errors such as
00:16:11.120 use after free or out-of-bounds memory
00:16:13.199 access. Malloc implementations are
00:16:16.399 designed to be resilient against memory
00:16:18.639 errors in order to prevent memory
00:16:20.959 attacks. However, if there is a memory
00:16:24.240 error in your app, then resiliency is
00:16:27.120 exactly what we don't need during
00:16:29.199 testing.
00:16:30.959 So there are tools that can help us find
00:16:33.440 memory errors, such as Valgrind or
00:16:36.079 AddressSanitizer, also known as ASAN.
00:16:40.639 Here's an example of a memory error from
00:16:43.040 ASAN. When it encounters a memory
00:16:46.240 error, it causes the program to crash
00:16:48.000 and output an error message that looks
00:16:50.079 like this. These tools are powerful
00:16:53.360 because they do extensive checks on
00:16:55.920 every memory access. Therefore, they
00:16:58.720 make your program run several times
00:17:00.560 slower and use much more memory. So,
00:17:03.199 you'll need to keep that in mind and
00:17:05.120 adjust timeouts and memory limits in CI
00:17:07.679 in order to accommodate these tools.
00:17:11.120 So, for more details on how to build
00:17:13.280 Ruby with ASAN, follow this guide in
00:17:16.160 the official Ruby building guides.
00:17:19.600 We also run nightly tests on our Rails
00:17:22.400 monolith against the latest commit of
00:17:25.520 Ruby's master branch. This helps us
00:17:28.000 accomplish two things. First, it
00:17:31.039 allows us to find incompatibilities in
00:17:33.440 our codebase with the next version of
00:17:35.600 Ruby. This helps us incrementally
00:17:37.760 discover these incompatibilities instead
00:17:39.840 of having to do a big push for Ruby
00:17:42.480 upgrades each year. Secondly, it allows
00:17:45.600 us to discover bugs in Ruby that were not
00:17:48.640 caught by Ruby's test suite. We run
00:17:52.240 our nightly CI against Ruby head with
00:17:54.559 the various configurations mentioned before,
00:17:56.799 like enabling assertions, running with a
00:17:58.880 YJIT call threshold of one, and with
00:18:01.120 ASAN enabled. This has helped us find a
00:18:03.919 wide variety of bugs in Ruby and in native
00:18:06.400 gems and makes our annual Ruby upgrade
00:18:09.280 easy. For these reasons, I encourage
00:18:12.400 your company to also run nightly CI
00:18:15.760 against Ruby head across various
00:18:17.679 configurations. And then when you run
00:18:19.600 into crashes or regressions in Ruby,
00:18:22.400 native gems, or any gems, open
00:18:25.919 bug reports and send fixes upstream.
00:18:29.760 So to recap, in this section, I talked
00:18:32.000 about some techniques to prevent crashes
00:18:34.000 in production. First I talked about
00:18:36.480 compiling Ruby with assertions enabled,
00:18:38.880 which checks the internal state of Ruby;
00:18:41.280 second, running with a YJIT call
00:18:44.080 threshold of one, which makes YJIT
00:18:46.000 compile every method that gets run; and
00:18:48.240 lastly, using a memory checking tool such
00:18:49.919 as Valgrind or ASAN.
00:18:52.640 So there's still the inevitable case
00:18:54.480 where bugs are not caught by CI and only
00:18:57.280 happen in production. So now let's see
00:18:59.440 how we can capture information about
00:19:01.440 crashes that happen in production.
00:19:04.720 When Ruby crashes, it generates a crash
00:19:06.960 report that includes what kind of crash
00:19:09.039 it is, the Ruby stack trace, and the C
00:19:11.440 stack trace. Here's what a crash report
00:19:13.919 looks like.
00:19:16.720 We can first see what
00:19:19.440 kind of crash this is.
00:19:20.960 And this one is a
00:19:23.520 segmentation fault, meaning that it is
00:19:25.840 some sort of memory error. We then see
00:19:28.799 the Ruby stack trace. And using this
00:19:30.720 information, we can try to
00:19:32.799 isolate the issue and even try to find
00:19:35.760 a small reproduction script for
00:19:38.960 this crash.
00:19:40.799 We can then see the C stack trace and
00:19:43.039 this is useful for determining where the
00:19:44.880 bug comes from whether it's a bug in
00:19:46.799 Ruby or in a native gem. And we can also
00:19:49.520 use this to look at the source code in
00:19:51.679 Ruby or in native gems. And just by
00:19:53.840 looking at the source code, sometimes we
00:19:55.280 may even be able to identify the
00:19:57.840 bug. So normally this crash report
00:20:00.880 outputs to standard error. However, as
00:20:03.840 of Ruby 3.3, you can also redirect this
00:20:07.120 crash log into a file.
00:20:10.000 The RUBY_CRASH_REPORT environment
00:20:12.400 variable is documented in the Ruby man
00:20:14.960 pages and explains how to add specifiers
00:20:17.679 to the file name for timestamps or the
00:20:20.080 process ID uh in the generated crash
00:20:22.880 log. So the crash report is often not
00:20:26.240 enough to debug crashes. We need to
00:20:29.120 capture core dumps from the crash. But
00:20:31.520 what are core dumps?
00:20:33.760 Core dumps are files generated
00:20:35.679 containing the state of the program at
00:20:37.760 the time of the crash. This includes
00:20:40.080 everything on the C stack and heap and
00:20:42.640 includes all local and global variables.
00:20:45.360 And then we can open up this core dump
00:20:47.039 in the debugger and try to find the
00:20:49.520 bug. However, core dumps only contain
00:20:52.720 the state of your program at the time of
00:20:54.640 the crash. So it may be hard to
00:20:57.440 determine how you ended up in that
00:20:59.200 state. One thing to keep in mind
00:21:02.880 though is that core dumps contain all of
00:21:05.440 your program's memory. And this includes
00:21:07.919 things like decrypted user passwords and
00:21:10.799 PII in plain text as well as your
00:21:14.000 infrastructure secrets. So make sure to
00:21:17.120 treat core dumps with great care. At
00:21:19.679 Shopify, we upload core dumps to the
00:21:22.080 cloud so we can debug them, but we make
00:21:24.880 sure the access to the cloud bucket
00:21:27.760 is restricted to specific teams and it
00:21:30.400 is encrypted in order to protect
00:21:32.720 against attackers who gain access to
00:21:34.880 that bucket in the cloud.
00:21:37.760 At Shopify, we use a closed source
00:21:40.240 tool called Crash Reporter to upload
00:21:42.240 core dumps that we generate in
00:21:44.080 production. This tool essentially does
00:21:46.559 three things. It first uploads our core
00:21:49.200 dump to the bucket in the cloud. Then it
00:21:52.159 detects if uh the crashing binary is
00:21:54.960 Ruby and if it is then it finds the
00:21:57.360 associated crash report that we've
00:21:59.200 written to disk and uploads it. And then
00:22:02.000 finally it creates an event in our
00:22:04.720 error monitoring system in order to
00:22:06.799 keep track of this crash. So the first
00:22:09.520 step is capturing the core dump when it
00:22:11.120 occurs. You can configure the
00:22:14.080 behavior in Linux when a program crashes:
00:22:17.520 by default it does nothing. You can
00:22:20.000 also configure it to write to a
00:22:22.000 particular file, or you can use the
00:22:24.880 feature, which is the one that we use,
00:22:26.720 that pipes the core dump to a program's
00:22:29.600 standard input. So our crash reporter
00:22:32.640 tool reads this core dump from
00:22:34.880 the standard input and uploads it to a
00:22:37.120 bucket in the cloud.
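Shopify's Crash Reporter is closed source, so the following is only a rough sketch of the Linux core_pattern pipe mechanism it builds on: a tiny handler that reads the core dump from standard input and writes it to disk (paths and the handler name are illustrative; a real tool would upload to cloud storage and record an event instead):

```c
/* core-handler.c: minimal sketch of a Linux core_pattern pipe handler.
 * Configured with something like (illustrative path):
 *   echo '|/usr/local/bin/core-handler %p %e' > /proc/sys/kernel/core_pattern
 * The kernel then runs this program with the core dump on standard input. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const char *pid  = argc > 1 ? argv[1] : "unknown";  /* %p */
    const char *name = argc > 2 ? argv[2] : "unknown";  /* %e */

    char path[256];
    snprintf(path, sizeof(path), "/var/crash/core.%s.%s", name, pid);

    FILE *out = fopen(path, "wb");
    if (!out) return 1;

    /* Copy the core dump from stdin to the destination file. */
    char buf[65536];
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), stdin)) > 0) {
        fwrite(buf, 1, n, out);
    }

    fclose(out);
    return 0;
}
```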
00:22:39.600 Then secondly, when we have the core
00:22:41.520 dump, we detect whether this core dump
00:22:43.679 is from Ruby and if it is, we try to
00:22:46.000 find the associated crash report and
00:22:48.000 upload it because it contains valuable
00:22:50.240 information such as the Ruby and C-level
00:22:52.720 stack traces. Finally, we keep track of
00:22:55.840 these crashes by creating an event in
00:22:57.760 our error monitoring system. We also at
00:23:01.280 this step parse the crash report
00:23:03.520 acquired in the previous step for the
00:23:05.760 Ruby and the C-level stack traces.
00:23:09.039 This information is useful for
00:23:11.520 quickly triaging crashes and determining
00:23:14.320 if this crash is a new or a known
00:23:17.360 issue.
00:23:18.880 And using the Ruby-level stack trace,
00:23:21.440 we can even try to find a
00:23:23.440 minimum reproduction of the bug. So now
00:23:26.480 that we've collected the necessary
00:23:28.400 information for the crash, we need to
00:23:30.799 use it to figure out what the bug is.
00:23:33.520 Here's an AI generated image of what
00:23:35.280 that process looks like. And I'm going
00:23:36.880 to be honest, I could spend hours here
00:23:39.120 giving tips and techniques, but there's
00:23:41.520 so many possibilities that even during
00:23:44.000 that time, I couldn't even possibly
00:23:45.440 cover all of the potential issues. We've
00:23:48.480 even seen cases where the C compiler we
00:23:51.280 used to compile the Ruby binary and
00:23:53.120 native gems had a bug in it and was
00:23:56.320 emitting incorrect instructions into the
00:23:58.720 binary. But I'll briefly talk about how
00:24:01.200 to debug core dumps generated from
00:24:03.120 production.
00:24:04.720 So in order to debug core dumps we need
00:24:07.039 the following things. So of course we'll
00:24:10.000 need the core dump file. We'll need the
00:24:12.080 original binaries of Ruby, system
00:24:14.000 libraries, and native gems. And we need
00:24:16.080 these because they contain the
00:24:17.760 symbols in order to output meaningful
00:24:20.240 stack traces, structure definitions and
00:24:22.799 variables.
00:24:24.320 And we need to be on the same operating
00:24:26.720 system and usually the same CPU
00:24:28.559 architecture as the production system.
00:24:31.760 Those are quite a few requirements, but
00:24:33.760 fortunately, if you're using a container
00:24:35.840 system like Docker in production, you
00:24:38.000 can instead just use the production
00:24:39.679 container to debug the core dump. So, in
00:24:42.720 order to debug the core dump, you open
00:24:44.480 the core dump
00:24:46.240 up in a debugger such as GDB or LLDB.
00:24:49.520 You need to specify what the core dump
00:24:51.200 file is and the Ruby binary that
00:24:54.799 was crashing.
00:24:56.880 Then, when you're in the debugger, you
00:24:58.480 can analyze the core dump by looking at
00:25:00.159 the backtrace and reading variables and
00:25:02.799 pieces of memory.
00:25:06.080 So to recap in this section we talked
00:25:08.320 about some ways to capture information
00:25:10.320 about crashes. We saw what a crash
00:25:12.799 report looks like and how to redirect
00:25:15.200 crash reports to a file in order to
00:25:17.200 capture it. We looked at what core dumps
00:25:19.679 are and how we can generate them on
00:25:21.520 Linux systems. Finally, we looked at how
00:25:24.400 we can open core dumps in the debugger
00:25:26.240 in order to debug them. So once you've found
00:25:29.760 a bug, you should report it. If it's
00:25:32.320 a bug in a native gem, let the
00:25:34.400 maintainer know. If it's in Ruby, let
00:25:36.640 us, the Ruby core team, know about it by
00:25:39.120 reporting it to the bug tracker. But
00:25:41.440 wait, before you open a ticket, is the
00:25:44.240 Ruby you're using an actively maintained
00:25:46.320 version of Ruby? If you're unsure,
00:25:48.640 consult this page. Right now, only Ruby
00:25:51.279 3.3 and 3.4 are in normal
00:25:53.440 maintenance mode, and Ruby 3.2 is
00:25:56.720 only in security maintenance, meaning
00:25:58.720 that regular bugs are no longer fixed.
00:26:01.440 So if you're running on Ruby 3.2 or
00:26:03.600 earlier and there are crashes, then try
00:26:05.600 upgrading your Ruby version first.
00:26:08.640 Additionally, make sure you're
00:26:10.720 running on the latest patch version of
00:26:12.960 your major version of Ruby. These
00:26:15.120 contain all of the backported fixes, and
00:26:18.080 the most recent version as of
00:26:20.000 this talk is listed here.
00:26:24.320 If you're running a maintained version
00:26:25.919 of Ruby and you're on the latest patch
00:26:27.919 version, then submit a bug report on the
00:26:30.000 Ruby bug tracker and maybe even try to
00:26:32.159 submit a fix. The guide linked here
00:26:34.400 explains how to report bugs on the bug
00:26:36.320 tracker, including what kinds of
00:26:37.760 information to include in your ticket.
00:26:41.039 At Shopify, when we fix a bug in Ruby,
00:26:43.679 we want to upgrade our Ruby version to
00:26:46.080 include that patch as soon as possible.
00:26:49.200 However, a new version of Ruby may not
00:26:51.440 come out for a few months. So, we create
00:26:53.520 an internal release with these
00:26:56.159 fixes backported.
00:26:58.799 We've made all of our Ruby definitions
00:27:01.039 public, and you can find them in the
00:27:02.960 Shopify/ruby-definitions repository.
00:27:05.840 This contains ruby-build definitions for
00:27:08.000 all of the custom Ruby versions we run
00:27:10.000 at Shopify.
00:27:12.159 So today I've covered quite a few topics
00:27:14.640 including why stability of Ruby is
00:27:16.320 important, some sources of instability
00:27:18.240 in Ruby, how we can catch bugs in Ruby
00:27:21.120 and in native gems before they reach
00:27:22.799 your production systems, and finally how
00:27:25.039 to debug core dumps that you've
00:27:27.440 captured from production. So you can
00:27:29.919 find a copy of these slides at this QR
00:27:32.320 code. If you have any questions, feel
00:27:34.480 free to ask me after this talk or
00:27:36.400 through social media or through email.
00:27:38.799 Thank you for coming to my talk.