Ruby Stability at Scale

Rails World 2025

#native-extensions

#memory-management

Peter Zhu • September 05, 2025 • Amsterdam, Netherlands • Talk

Ruby Stability at Scale: Lessons from Shopify's Monolith

Peter Zhu's talk at Rails World 2025, "Ruby Stability at Scale," focuses on how to monitor and maintain the stability of large-scale Ruby on Rails applications, particularly when instability originates from Ruby itself or native gems, rather than application code or external libraries. Drawing from his experience in managing Shopify's monolithic Rails app, one of the largest in the world, Zhu provides a comprehensive overview of crash prevention, detection, and post-mortem analysis techniques.

Main Topic

The talk centers on managing and improving the stability of Ruby and native gems in production systems, exploring strategies for monitoring, debugging, and preventing Ruby-level crashes and outages.

Key Takeaways

Instability can stem from the Ruby interpreter or native extensions, not just application code.
Proactive measures in development and CI—like enabling assertions, aggressive JIT testing, and memory checkers—significantly improve the detection of bugs before production.
Robust crash handling, logging, and restricted access to crash artifacts (such as core dumps) are essential for effective post-crash debugging and for protecting sensitive data.
Open communication with upstream maintainers and timely upgrades ensure long-term stability and resilience.

Date: September 05, 2025
Published: Mon, 15 Sep 2025 00:00:00 +0000
Announced: Tue, 20 May 2025 00:00:00 +0000

There are many talks, articles, and tutorials on how to monitor your Rails app for stability. These assume the source of the bug comes from your application code, from Rails itself, or from a gem. But what if the source of instability comes from Ruby or a native gem? If Ruby crashes, do you have any monitoring or ways to debug it?

In this talk, we'll look at how we deal with Ruby crashes in the Shopify monolith, the world's largest Ruby on Rails application, and how you can use some of our techniques. We'll cover topics such as how we monitor crashes, capture core dumps for debugging, prevent crashes, and minimize the impact of crashes on production.

00:00:07 Hi everyone. It's an absolute honor to be here speaking at Rails World 2025.

00:00:12 Infrastructure is a top priority at Shopify, and I'm sure it is at your company as well. It's important to prevent outages, avoid data corruption, and ultimately provide a good user experience. In this talk, I'll cover ways to prevent and respond to instability in Ruby.

00:00:32 You can find the slides at this URL or by scanning the QR code. Don't worry if you miss it now—the QR code will be shown again at the end of the talk.

00:00:45 A bit about me: I'm currently based in Toronto, Canada. I'm on the Ruby core team and a staff developer at Shopify on the Ruby infrastructure team, where I work on performance and memory management in Ruby. I'm the co-author of Ruby's variable-width allocation feature, which improves performance and the memory efficiency of the garbage collector. I'm also the co-author of the "free at exit" feature, which frees memory at shutdown to allow the use of memory-leak checkers such as Valgrind or macOS's leaks tool. I designed and implemented the modular garbage collector feature in Ruby. I'm the author of the ruby-memcheck and autotuner gems. In my free time I like to travel and take photos; you can find me on Instagram at peterzoo.phos.

00:01:36 Here's the outline of the talk. First, we'll take a look at what the infrastructure of a typical Rails app looks like, the teams that maintain it, and the potential blind spots in your stack. Then we'll examine how Ruby or native gems can cause instability and ways to proactively prevent crashes in production. Finally, we'll discuss how to capture metrics and information about crashes and how to use that data to debug issues.

00:02:10 There are many moving parts and multiple layers in a Rails app's tech stack. You may have teams dedicated to external services such as databases, caches, or microservices. Other teams may manage deployments to production with tools like Docker and Kubernetes. Product developers work on the application itself; some are Rails experts, and you may have teams focused on architecture to reduce technical debt, triage exceptions, and perform upgrades for Rails and gems. All of this runs on top of Ruby, which is just another piece of software and can have bugs or crash.

00:03:18 Do you have people responsible for maintaining the Ruby layer in your stack? Do you have observability into this layer, and do you know what to do when Ruby crashes? I'll cover reasons Ruby can crash, how to collect information and metrics about crashes, and what actions you can take.

00:03:38 Your app may include tens or even hundreds of native gems. A brand-new Rails app installs 21 native gems by default, which adds many potential sources of instability. Let's look at common categories of bugs that Ruby and native gems can encounter.

00:04:07 Ruby and many native gems are written in C, so they can run into C-related bugs. In C you must manually allocate and free memory, unlike Ruby's garbage-collected environment. If you free memory too early—before all references to it are gone—you can cause use-after-free bugs that lead to crashes or unpredictable behavior. For example, if you allocate memory, write a string into it, free it, and then try to print the string again, the behavior is undefined: the program might crash or it could read other memory.

00:05:24 Buffer overflow bugs access memory past the end of an allocated region. This can cause crashes or allow reading or writing into other memory areas. Such bugs can be exploited by attackers to perform unintended behavior or to read sensitive data like user passwords or secrets. A common mistake is forgetting the null terminator in C string allocations: omitting the extra byte for the terminator will cause writes past the allocated buffer and subsequent reads to go out of bounds.

00:06:52 Here's an example we saw in production. A symbol that should have been source_id appeared as qource_id because its first character was corrupted. The ASCII values of lowercase 's' (0x73) and 'q' (0x71) differ by exactly one bit, so a single-bit flip changed the character. The root cause was likely a memory corruption bug such as a use-after-free or buffer overflow, where memory that no longer belonged to the program was overwritten.
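
The one-bit difference between the two characters can be checked with a few lines of Ruby. This is only an arithmetic illustration of the claim above; the helper name differing_bits is made up for this sketch and is not from the talk.

```ruby
# Count how many bits differ between the first characters of two strings.
# `differing_bits` is a hypothetical helper used only for illustration.
def differing_bits(a, b)
  (a.ord ^ b.ord).to_s(2).count("1")
end

# Print both characters as 8-bit binary so the flipped bit is visible.
printf("s = %08b\n", "s".ord)
printf("q = %08b\n", "q".ord)
puts "bits that differ: #{differing_bits('s', 'q')}"
```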

00:07:52 C requires manual memory management, so forgetting to free memory results in leaks. A few leaks might be benign, but repeated leaks will cause the Ruby process to run out of memory and be killed by the system, disrupting requests. At RubyKaigi 2024, I gave a talk with Adam Hess from GitHub about finding and fixing memory leaks in Ruby and native gems. Native gem maintainers can use Ruby's "free at exit" feature and tools like ruby-memcheck to find leaks in their code.

00:08:35 The Ruby C API introduces classes of bugs not present in pure Ruby code because C uses manual memory management while Ruby has a garbage collector. One common issue is missing GC guards. When Ruby runs the garbage collector, it scans the C stack conservatively to find potential Ruby objects to keep alive. The C compiler can optimize and reuse stack space, possibly making local variables that point to Ruby objects invisible to the collector. If those objects are moved or recycled by the GC, your code can experience unexpected behavior or crashes. For example, implementing a Ruby method in C that iterates over an array of strings and yields each character can fail if the C compiler optimizes a local string variable away. Adding a GC guard ensures the variable remains on the stack and visible to the collector, preventing the object from being moved while in use.

00:11:59 Errors raised in Ruby interrupt the normal flow and can jump multiple stack frames. The C equivalent uses longjmp to skip frames. If your C code manages memory manually, resources can be leaked when a longjmp occurs, so you must carefully consider which C code paths might raise exceptions. Missing write barriers can also cause subtle bugs; write barriers are tricky to implement correctly and deserve careful attention when interfacing with the garbage collector.

00:12:48 There's a misconception that Rust gems eliminate these problems because Rust provides memory safety via the borrow checker. While Rust prevents many memory-safety bugs, many Rust gems call Ruby's C API directly, and that API was not designed for Rust. Integrating Rust with Ruby's garbage collector and runtime introduces challenges, and Rust can give a false sense of security if developers don't fully understand the semantics of their FFI code.

00:13:36 To recap this section: common bugs in Ruby and native gems include use-after-free, buffer overflows, and memory leaks in C code, as well as incorrect uses of the Ruby C API such as missing GC guards, raising errors that leak resources, and missing write barriers. Next, we'll look at ways to catch and prevent bugs before they reach production.

00:14:09 Ruby has built-in assertions that check internal VM state and runtime assumptions. These assertions are off by default because they impact performance, but you should enable them in CI to catch bugs in Ruby and native gems before deployment. To enable them, compile Ruby with the RUBY_DEBUG macro defined in its C preprocessor flags (cppflags). The Ruby building guide documents how to compile Ruby with assertions enabled.
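
A CI job may want to confirm that the Ruby it is running was actually built this way. The sketch below is a rough heuristic, not an official check: it assumes the define was passed through configure (for example as cppflags="-DRUBY_DEBUG=1") and was therefore recorded in RbConfig. The helper name built_with_ruby_debug? is hypothetical.

```ruby
require "rbconfig"

# Heuristic: look for a RUBY_DEBUG define in the flags recorded at build time.
# Matches -DRUBY_DEBUG and -DRUBY_DEBUG=1, but not -DRUBY_DEBUG=0 or
# unrelated macros that merely start with the same prefix.
RUBY_DEBUG_DEFINE = /-DRUBY_DEBUG(?:=1)?(?![\w=])/

def built_with_ruby_debug?(config = RbConfig::CONFIG)
  %w[cppflags CPPFLAGS configure_args].any? do |key|
    config[key].to_s.match?(RUBY_DEBUG_DEFINE)
  end
end

puts "RUBY_DEBUG seen in build flags: #{built_with_ruby_debug?}"
```

If the flag was injected some other way (for example by patching a header), this check will report false even though assertions are on, so treat it as a sanity check only.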

00:15:13 Since Rails 7.2, YJIT (the just-in-time compiler) is enabled by default, which improves performance. However, JIT compilers perform optimizations and generate machine code, introducing another source of bugs. By default, YJIT compiles hot code only—code executed frequently. Tests often don't execute code repeatedly, so JIT coverage in CI can be limited. Setting YJIT's compilation threshold to one causes methods to be compiled the first time they run, exposing more code to JIT-based testing.
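
As a sketch of how a CI run might guard against the JIT being silently off, the snippet below reports whether YJIT is active in the current process. RubyVM::YJIT.enabled? is part of Ruby; the helper name and the example command line in the comment are illustrative assumptions rather than anything shown in the talk.

```ruby
# Returns true when this process is running with YJIT enabled.
# Guarded with defined? because RubyVM::YJIT is absent on builds without
# YJIT and on other Ruby implementations.
def yjit_enabled?
  !!(defined?(RubyVM::YJIT) && RubyVM::YJIT.enabled?)
end

# A CI job could start the suite with a command along the lines of
#   ruby --yjit --yjit-call-threshold=1 -Itest test/example_test.rb
# (the test path is made up) and fail fast if the JIT is unexpectedly off.
puts "YJIT enabled: #{yjit_enabled?}"
```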

00:16:04 Because C uses manual memory management, memory errors like use-after-free or out-of-bounds access are not uncommon. Allocator implementations try to be resilient, but that resilience can mask bugs during testing. Tools such as Valgrind and AddressSanitizer (ASan) can help find memory errors by performing extensive checks on memory accesses. These tools make your program run slower and use more memory, so adjust timeouts and memory limits in CI accordingly. The Ruby build guides include instructions for building Ruby with ASan.

00:17:11 We run nightly tests of our Rails monolith against the latest commit of Ruby's master branch. This helps us discover incompatibilities with upcoming Ruby versions incrementally, instead of facing a large upgrade all at once, and it helps catch bugs in Ruby not covered by Ruby's own test suite. We run this nightly CI in several configurations: with assertions enabled, with the YJIT threshold set to one, and with ASan enabled. This approach has helped us find many bugs in Ruby and native gems and has made our annual Ruby upgrades easier. I encourage your organization to run nightly CI against Ruby head across multiple configurations, to report any crashes or regressions you find, and to upstream your fixes.

00:18:29 To recap preventative techniques: compile Ruby with assertions enabled to check internal state; set the YJIT compilation threshold to one to exercise JIT compilation in CI; and use memory-checking tools like Valgrind or ASan to catch memory errors.

00:18:52 Even with strong CI, some bugs will only appear in production. Let's look at how to capture information about crashes that occur in production.

00:19:06 When Ruby crashes, it generates a crash report that includes the type of crash, the Ruby stack trace, and the C-level stack trace. This report helps isolate the issue and may help you build a small reproducer. As of Ruby 3.3, you can redirect crash reports to a file by setting the RUBY_CRASH_REPORT environment variable; the ruby man page documents the specifiers (timestamp, process ID, and so on) you can include in the filename.
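
A minimal sketch of setting this up for a child process follows. It assumes the RUBY_CRASH_REPORT variable with the %p (process ID) and %t (time of dump) specifiers described in the ruby man page; the directory, the file name pattern, and the helper name are illustrative choices.

```ruby
# Build the environment for a child Ruby process so that, if the interpreter
# crashes, its report is written to a per-process file instead of stderr.
# `crash_report_env` and the path layout are hypothetical.
def crash_report_env(dir)
  # %t expands to the time of the dump and %p to the PID of the crashed process.
  { "RUBY_CRASH_REPORT" => File.join(dir, "ruby-crash-%t-%p.log") }
end

env = crash_report_env("/var/log/ruby-crashes")
puts env["RUBY_CRASH_REPORT"]

# A supervisor script could then launch the app with this environment, e.g.
#   spawn(env, "bundle", "exec", "puma")
```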

00:20:10 Often a crash report alone is not sufficient to debug a crash; you also need a core dump. A core dump captures the entire program state at the time of the crash, including the stack, the heap, and local and global variables. You can load a core dump into a debugger to inspect memory and variables. Core dumps contain all program memory, including decrypted passwords, PII, and infrastructure secrets, so treat them carefully. At Shopify we upload core dumps to the cloud with strict access controls and encryption to protect sensitive data.

00:21:14 At Shopify we use a crash-reporting tool to upload core dumps generated in production. The tool does three things: it uploads the core dump to a cloud bucket, it detects if the crashing binary is Ruby and uploads the associated crash report, and it creates an event in our error monitoring system so we can track the crash. On Linux you can configure core-dump behavior: by default nothing is written, but you can set the system to write core dumps to a file or to pipe them to a program's standard input. We use the piping approach so our crash reporter can read the core dump from stdin and upload it to a cloud bucket.
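
The piping approach can be sketched as a small handler that reads the core from standard input and streams it to a file. This is an assumption-laden illustration, not Shopify's tool: the handler path, the /var/crash directory, and the store_core helper are all hypothetical, and the upload step is left out. The %p and %e specifiers in the comment are the kernel's core_pattern placeholders for the PID and executable name.

```ruby
require "fileutils"

# Copy a core dump from an input stream (stdin when invoked by the kernel)
# to a file named after the crashing executable and PID.
# Returns the destination path and the number of bytes written.
def store_core(input, dir, pid:, exe:)
  FileUtils.mkdir_p(dir)
  path = File.join(dir, "core.#{exe}.#{pid}")
  # Stream the data instead of loading it all into memory;
  # core files can be many gigabytes.
  bytes = File.open(path, "wb") { |out| IO.copy_stream(input, out) }
  [path, bytes]
end

# Assumed wiring on Linux (run as root):
#   echo '|/usr/local/bin/core-handler %p %e' > /proc/sys/kernel/core_pattern
# The kernel then runs the handler with the core on stdin and the PID and
# executable name as arguments.
if $PROGRAM_NAME == __FILE__ && ARGV.size == 2
  $stdin.binmode
  path, bytes = store_core($stdin, "/var/crash", pid: ARGV[0], exe: ARGV[1])
  warn "stored #{bytes} bytes at #{path}"
end
```

A real handler would follow this with an encrypted upload and an error-tracker event, and would need to finish quickly because the crashed process stays around until the handler has consumed the dump.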

00:22:37 After obtaining the core dump and the crash report, we parse the report for Ruby and C-level stack traces to help triage the crash and determine whether it's a known issue. The Ruby-level stack trace can help create a minimal reproduction. Debugging core dumps can be complex: causes range from logic bugs to compiler issues that produce incorrect instructions. There's a wide variety of techniques for debugging, and depth of investigation depends on the symptoms.
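
A first triage pass might split the crash report into its sections so the Ruby-level and C-level backtraces can be compared against known issues. The parser below is a sketch that assumes sections are introduced by header lines of the form "-- Title ----"; the sample report is fabricated for illustration and heavily abbreviated.

```ruby
# Header lines look like "-- C level backtrace information -----------".
SECTION_HEADER = /\A-- (.+?) -{3,}\s*\z/

# Group the non-blank lines of a crash report by section title.
# Lines before the first header are collected under "header".
def crash_report_sections(text)
  sections = Hash.new { |hash, key| hash[key] = [] }
  current = "header"
  text.each_line(chomp: true) do |line|
    if (match = SECTION_HEADER.match(line))
      current = match[1]
    elsif !line.strip.empty?
      sections[current] << line
    end
  end
  sections
end

# Made-up sample input; real reports are far longer.
sample = <<~REPORT
  app.rb:3: [BUG] Segmentation fault at 0x0000000000000000
  ruby 3.4.0 (illustrative) [x86_64-linux]

  -- Ruby level backtrace information ----------------------------------------
  app.rb:3:in 'Example#call'
  app.rb:9:in '<main>'

  -- C level backtrace information -------------------------------------------
  /usr/lib/libruby.so(rb_bug+0x0) [0x0]
REPORT

sections = crash_report_sections(sample)
puts sections["Ruby level backtrace information"]
```

Grouping crashes by the top few C-level frames is one simple way to decide whether a new report matches an already-known issue.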

00:23:33 To debug core dumps you need the core file, the original binaries (Ruby, system libraries, and native gems) so you have symbols, and usually the same operating system and CPU architecture as production. If you use containers like Docker in production, you can debug using the production container image. Open the core dump in a debugger such as GDB or LLDB, specifying the core file and the crashing Ruby binary; then inspect the backtrace, variables, and memory to diagnose the issue.
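
For a quick, non-interactive first look, the debugger invocation can be scripted. The helper below only builds a GDB command line that prints every thread's backtrace and exits; the helper name and both paths are hypothetical, and the command is meant to be run inside an environment that matches production.

```ruby
require "shellwords"

# Build a batch-mode GDB command: load the Ruby binary and the core file,
# print backtraces for all threads, then exit.
# `gdb_backtrace_command` is a hypothetical helper.
def gdb_backtrace_command(ruby_binary, core_file)
  ["gdb", "--batch", "-ex", "thread apply all bt", ruby_binary, core_file]
end

cmd = gdb_backtrace_command("/usr/local/bin/ruby", "/var/crash/core.ruby.123")
puts cmd.shelljoin

# To actually run it and capture the output:
#   require "open3"
#   output, status = Open3.capture2e(*cmd)
```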

00:25:06 To recap crash-capture and debugging: collect crash reports and redirect them to files if needed, generate and securely store core dumps, and open core dumps in a debugger with the matching binaries and environment to analyze the failure.

00:25:29 Once you identify a bug, report it: if it's in a native gem, notify the gem maintainer; if it's in Ruby, report it on the Ruby core team's bug tracker. Before opening a ticket, make sure you are using an actively maintained Ruby version. At the time of this talk, only Ruby 3.3 and 3.4 are in normal maintenance; Ruby 3.2 is in security maintenance only. If you run an older release, try upgrading first. Always run the latest patch release of your Ruby release series (for example, the newest 3.4.x), as it contains backported fixes.
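
The version check can be automated before filing a report. This sketch hard-codes the maintenance status stated above, which was accurate at the time of the talk and will go stale; the constant and method names are made up for illustration.

```ruby
require "rubygems" # for Gem::Version; normally already loaded

# Maintenance status as stated in the talk (September 2025). Update as needed.
NORMAL_MAINTENANCE = %w[3.3 3.4].freeze
SECURITY_MAINTENANCE = %w[3.2].freeze

# Classify a Ruby version string by its release series (e.g. "3.4.5" -> "3.4").
def maintenance_status(version = RUBY_VERSION)
  series = version.split(".").first(2).join(".")
  return :normal if NORMAL_MAINTENANCE.include?(series)
  return :security_only if SECURITY_MAINTENANCE.include?(series)

  oldest_listed = Gem::Version.new(SECURITY_MAINTENANCE.min)
  # Anything older than the oldest listed series is end-of-life;
  # anything newer is simply not covered by this hard-coded table.
  Gem::Version.new(series) < oldest_listed ? :end_of_life : :not_listed
end

puts "#{RUBY_VERSION}: #{maintenance_status}"
```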

00:26:23 At Shopify, when we fix a Ruby bug we try to upgrade to include the patch as soon as possible. If a new Ruby release is delayed, we create internal releases with backported fixes. We've made our Ruby build definitions public in the Shopify/ruby-definitions repository, which contains build definitions for the custom Ruby versions we run.

00:27:02 Today I covered why Ruby stability is important, sources of instability in Ruby and native gems, techniques to catch bugs before they reach production, and how to debug core dumps captured from production. You can find a copy of these slides via the QR code. If you have questions, feel free to ask me after the talk, or reach out on social media or by email. Thank you for coming to my talk.
