00:00:07.200
Hi everyone. It's an absolute honor to
00:00:09.360
be here speaking at Rails World 2025.
00:00:12.880
Infrastructure is a top priority at
00:00:15.120
Shopify and I am sure it is at your
00:00:17.600
company as well. It is important to
00:00:20.080
prevent outages, to prevent data
00:00:22.240
corruption, and ultimately to have a
00:00:24.400
good user experience. In this talk, I'll
00:00:27.599
be covering some ways to prevent and
00:00:29.519
respond to instability in Ruby.
00:00:32.880
You can find the slides uh at this URL
00:00:35.840
or by scanning this QR code. Don't worry
00:00:38.480
if you end up missing this because I'll
00:00:40.879
also have this QR code up at the end of
00:00:42.879
this talk. So, first I'll talk a little
00:00:45.440
bit about me. That's what I look like on
00:00:47.440
the internet. I'm currently based in
00:00:49.600
Toronto, Canada. I'm on the Ruby core
00:00:52.000
team and I'm a staff developer at
00:00:54.239
Shopify on the Ruby infrastructure team
00:00:57.039
where I work on performance and memory
00:00:58.960
management in Ruby. I'm the co-author of
00:01:01.760
the variable width allocation feature in
00:01:03.359
Ruby which improves performance and
00:01:04.960
memory efficiency of Ruby's garbage
00:01:06.720
collector. I'm also the co-author of the
00:01:08.960
RUBY_FREE_AT_EXIT feature which frees
00:01:11.360
memory at shutdown to allow the use of
00:01:13.439
memory leak checkers such as Valgrind or
00:01:15.360
the macOS leaks tool. I also designed
00:01:18.320
and implemented the modular garbage
00:01:20.080
collector uh feature in Ruby.
00:01:25.439
I'm also the author of the
00:01:27.439
ruby_memcheck and the autotuner gems and in my
00:01:29.520
free time I like to travel and take
00:01:31.920
photos and I post them on Instagram at
00:01:33.920
peterzhu.photos.
00:01:36.159
So first I'll talk a little bit about
00:01:37.840
the outline of this talk. We'll first
00:01:40.560
take a look at what the infrastructure
00:01:42.479
of a typical Rails app looks like, the
00:01:44.720
teams that you might have maintaining
00:01:46.479
this and where you might have blind
00:01:48.479
spots in your infrastructure. We'll then
00:01:50.960
take a look at some of the ways uh Ruby
00:01:52.960
could cause instability in your
00:01:54.720
infrastructure
00:01:57.759
and ways to proactively prevent crashes
00:02:00.000
from happening in production.
00:02:02.479
We'll end off by discussing how to
00:02:04.079
capture metrics and information about
00:02:05.920
crashes and how we can use that
00:02:07.680
information to debug.
00:02:10.160
There are many moving parts and many
00:02:12.160
layers to the tech stack of a Rails app.
00:02:14.400
In your company, you might have teams
00:02:16.239
dedicated to each uh part of this stack.
00:02:19.120
And let's see a simplified example of a
00:02:21.440
typical Rails app.
00:02:23.760
And here's a diagram of a simplified
00:02:26.080
example of what your infrastructure
00:02:27.599
might look like. You might have teams
00:02:29.120
that manage external services such as a
00:02:32.319
database, your caches, or other
00:02:34.560
microservices that you use. You might
00:02:36.959
have teams that manage deployments of
00:02:38.720
your app in production using tools like
00:02:40.879
Docker and Kubernetes. You probably have
00:02:43.519
a large number of product developers
00:02:45.519
working on your Rails application.
00:02:48.560
Uh you might have some product
00:02:50.480
developers that might be Rails experts
00:02:53.280
or you may even have whole teams
00:02:55.040
dedicated to the architecture of your
00:02:56.959
app. And these people would be
00:02:58.800
responsible for reducing tech debt, for
00:03:00.720
triaging and debugging exceptions that
00:03:02.879
happen in production and performing
00:03:04.879
upgrades for Rails and gems in Ruby.
00:03:09.360
However, all of this runs on top of Ruby
00:03:12.159
and Ruby is just another piece of
00:03:14.000
software. So, it can have bugs and it
00:03:16.319
can crash.
00:03:18.000
Uh, do you have people responsible for
00:03:20.239
maintaining this level in your tech
00:03:22.080
stack? Do you have observability into
00:03:24.720
this layer? And do you know what to do
00:03:27.040
when Ruby crashes? In this talk, I'll be
00:03:30.000
talking about some of the reasons why
00:03:31.360
Ruby could crash, how to get information
00:03:33.680
and metrics about these crashes, and
00:03:36.080
what we can do about it. So, let's first
00:03:38.799
take a look at why Ruby could crash
00:03:43.599
or, I should add, native gems. Your app
00:03:47.360
may have tens or even hundreds of native
00:03:50.159
gems.
00:03:52.720
And in fact, if you don't believe me
00:03:54.319
that there's a lot of native gems in
00:03:56.000
your app, a brand new Rails app installs
00:03:58.879
21 native gems. This adds 21 additional
00:04:03.200
possible sources of instability inside
00:04:05.840
of your app.
00:04:07.760
And so now, let's take a look at some of
00:04:09.599
the common categories of bugs that Ruby
00:04:12.239
and native gems run into.
00:04:15.120
Ruby is written in C and so are a lot of
00:04:17.840
native gems. This means that they can
00:04:19.359
run into bugs found in C code. So in C
00:04:22.880
you have to manually allocate and free
00:04:25.680
memory. This is unlike Ruby where we
00:04:28.080
have a garbage collector which
00:04:30.080
determines when an object is alive or
00:04:32.800
dead. So if you free a piece of memory
00:04:35.600
in C before all of the places give up
00:04:38.080
references to that piece of memory then
00:04:41.280
you could cause a lot of different
00:04:43.120
problems including crashes or unexpected
00:04:45.840
behaviors. And this is known as a use
00:04:48.240
after free bug. Let's see a quick
00:04:50.479
example. This is a short snippet of C
00:04:53.199
code and we first allocate memory using
00:04:57.360
malloc and we write the string hello
00:04:59.280
rails world into it.
00:05:03.360
Then we print out the string and then we
00:05:06.479
free the memory that
00:05:08.880
contains the string
00:05:11.520
and then we print the string again. This
00:05:13.440
is not allowed because this is a use
00:05:15.440
after free bug and the behavior of that
00:05:18.560
is nondeterministic. It could either
00:05:20.720
crash or it could even read other pieces
00:05:23.039
of memory.
00:05:24.960
There are uh also buffer overflow bugs
00:05:28.080
which access past the end of a region of
00:05:30.320
memory which could potentially cause a
00:05:32.160
crash or it could even potentially end
00:05:34.560
up reading another piece of memory. And
00:05:37.199
so this could open up your app to
00:05:38.880
attacks and allow attackers to read from
00:05:41.440
or write to other pieces of memory. The
00:05:43.919
attacker could exploit it to
00:05:46.000
perform unintended behaviors in your app
00:05:48.320
or even brute force it to read other
00:05:50.479
pieces of memory such as users passwords
00:05:53.440
or secrets in your infrastructure. Let's
00:05:55.759
see a quick example. This is the program
00:05:58.560
again. It's similar to the example that
00:06:00.639
I showed for the use after free bug.
00:06:03.520
But what's the difference? Do you see
00:06:05.600
where the bug is?
00:06:09.360
plus one.
00:06:11.280
You're missing the plus one for the null
00:06:13.440
terminator when we malloc. The null
00:06:15.840
terminator is important because it
00:06:18.960
signals the end of the
00:06:20.639
string. And so the strcpy here which
00:06:23.919
copies uh into that string will now
00:06:26.720
write the null terminator past the end
00:06:28.960
of the region of memory that was
00:06:31.280
allocated. So it could be
00:06:33.680
writing into
00:06:35.120
another piece of memory and the printf
00:06:38.240
after it will now read past the end of
00:06:40.720
that region of memory. So if that null
00:06:42.880
terminator was overwritten by someone
00:06:44.880
else then this would allow the attacker
00:06:47.199
to read that region of memory.
00:06:50.080
So here's an example of an error that we
00:06:52.240
saw in our production system. Our
00:06:54.960
developers uh saw this in production and
00:06:58.319
you might be wondering what is
00:07:01.440
qource_id. It looks like a typo, doesn't it?
00:07:06.800
If we look at the stack trace, it's
00:07:08.960
erroring out on this line of code. And
00:07:11.440
that doesn't make any sense because the
00:07:13.039
code clearly says source_id, not
00:07:16.080
qource_id. So what happened to the first
00:07:18.319
character of that symbol?
00:07:21.599
If we look at the ASCII table, lowercase
00:07:24.000
s is hexadecimal 73 and in binary it looks
00:07:27.280
like this. Lowercase q is hexadecimal 71
00:07:30.960
and it looks like this in binary.
00:07:33.919
Notice how the two characters differ by
00:07:36.240
one bit. So somewhere someone has
00:07:38.800
flipped the second bit of this
00:07:40.479
character.
00:07:42.080
The cause of this bug was not certain, but
00:07:44.240
it was possibly a use after free or
00:07:46.400
buffer overflow bug where someone wrote
00:07:48.639
to memory that it no longer owned.
00:07:52.080
C requires manual memory management. So
00:07:54.319
if you forget to free that memory, then
00:07:56.400
it will leak. A few of these memory
00:07:58.400
leaks are benign. Ruby might end up just
00:08:00.479
using a little bit more memory. However,
00:08:02.560
if it keeps on happening, then the Ruby
00:08:05.039
process will eventually run out of
00:08:06.639
memory and be killed by the system. This
00:08:09.199
could cause instability in your system
00:08:11.039
as the Ruby process may be terminated by
00:08:13.199
the system halfway during a request.
00:08:16.560
At RubyKaigi 2024, I gave a talk with
00:08:19.199
Adam Hess from GitHub about finding and
00:08:21.759
fixing memory leaks in Ruby and in
00:08:23.599
native gems. If you're a native gem
00:08:26.000
maintainer, you can take advantage of
00:08:27.520
the RUBY_FREE_AT_EXIT feature and the
00:08:29.440
ruby_memcheck tool that I built to find
00:08:31.919
memory leaks in your native gem.
00:08:35.760
Ruby's C API can run into bugs that we
00:08:38.240
don't encounter in Ruby code. This is
00:08:40.640
because Ruby has automatic memory
00:08:42.640
management via the garbage collector
00:08:44.320
whereas C has manual memory
00:08:46.160
management. And this difference in
00:08:48.320
memory management paradigms means that
00:08:50.399
we need to be aware of how the Ruby
00:08:52.320
garbage collector works when we are
00:08:54.320
writing C code. The most common type of
00:08:57.279
bugs are missing garbage collector
00:08:59.519
guards. When Ruby runs the garbage
00:09:02.080
collector, it scans the C stack to find
00:09:05.040
potential Ruby objects in order to keep
00:09:07.279
them alive. This is known as uh
00:09:09.600
conservative stack scanning. However,
00:09:12.800
the C compiler may optimize the stack
00:09:15.120
space and reuse local variables on the
00:09:18.240
stack. And these local variables may
00:09:21.200
contain pointers to objects that are
00:09:23.120
actively being used. And this would
00:09:25.440
cause objects to be recycled or moved by
00:09:28.000
the garbage collector causing unexpected
00:09:30.399
behaviors in our code or even crashes.
00:09:33.440
So to demonstrate missing GC guards,
00:09:35.839
let's implement this simple Ruby method
00:09:38.720
called array each byte using the C API.
00:09:42.560
It calls this block with the integer
00:09:44.880
value of each character in each of the
00:09:47.600
strings in the array. And we can see an
00:09:50.240
example here of how to use this method
00:09:52.320
and the output that it generates.
00:09:55.360
So this is the C implementation of the
00:09:57.680
method. It's missing some checks like
00:10:00.000
checking that each of the elements in
00:10:01.760
the array is actually a string. But
00:10:03.920
that's not really important for this
00:10:05.600
demonstration. So let's quickly take a
00:10:07.920
look at uh this implementation. So we
00:10:10.640
first run a for loop that iterates a
00:10:13.279
counter over each element in the array.
00:10:16.480
We then get the string object at that
00:10:18.560
particular index
00:10:21.120
and then we get the underlying character
00:10:23.120
buffer of the string and the length.
00:10:27.040
We then iterate a counter over each
00:10:29.600
character in the string and then we
00:10:31.839
yield the character in the string uh by
00:10:34.800
converting it to a Ruby Fixnum. So now
00:10:38.640
let's try running the same script
00:10:40.720
but running this method called
00:10:43.040
uh GC.verify_compaction_references which
00:10:46.240
manually runs garbage collection
00:10:48.240
compaction in the block. Normally
00:10:51.519
compaction tries to be efficient by
00:10:53.839
minimizing the number of objects moved
00:10:56.480
in the garbage collector. However, since
00:10:58.880
we want bugs to show up, this method
00:11:01.279
will allow us to move the maximal number
00:11:03.760
of objects possible.
00:11:07.200
So then if we try to run this, we see
00:11:09.680
incorrect output
00:11:12.000
and sometimes when we run it, it even
00:11:14.160
crashes. So clearly something isn't
00:11:16.720
right here. And if we look at the code
00:11:19.360
again, what we're missing is a GC guard.
00:11:22.079
And we need a GC guard right at this
00:11:24.079
line because the str variable here isn't
00:11:27.440
used later on in the code. It's
00:11:29.680
optimized out of the stack by the C
00:11:31.839
compiler. However, this makes it
00:11:34.079
invisible to the Ruby garbage collector
00:11:36.399
that we're actively using this object in
00:11:39.440
the for loop where we're yielding each
00:11:41.360
character of the string. So the string
00:11:43.760
object ends up getting moved. And so by
00:11:46.160
adding this GC guard here, it ensures
00:11:48.240
that the C compiler will keep the
00:11:50.079
variable alive on the stack up until
00:11:52.480
that point. So now we can run the script
00:11:55.360
again and we will see correct output.
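Since the slide code is not part of this transcript, here is a minimal sketch of what the guarded method could look like, following the steps described above. The method name `each_byte` on `Array` and the extension name `array_each_byte` are assumptions; the C API calls themselves (`rb_ary_entry`, `RSTRING_PTR`, `RSTRING_LEN`, `rb_yield`, `RB_GC_GUARD`) are the standard ones. It has to be built as a native extension against the Ruby headers, so it will not compile as a standalone program, and like the version in the talk it skips type checks.

```c
#include <ruby.h>

/* Sketch: yields the integer value of every byte of every string in the
 * array. Assumes each element is a String (no type checks, as in the talk). */
static VALUE
array_each_byte(VALUE self)
{
    for (long i = 0; i < RARRAY_LEN(self); i++) {
        VALUE str = rb_ary_entry(self, i);

        /* Raw pointer into the string's buffer: the GC cannot see that
         * this pointer depends on `str` staying alive and in place. */
        const char *ptr = RSTRING_PTR(str);
        long len = RSTRING_LEN(str);

        for (long j = 0; j < len; j++) {
            /* The block may allocate and trigger GC or compaction. */
            rb_yield(INT2FIX((unsigned char)ptr[j]));
        }

        /* Keeps `str` on the C stack up to this point so conservative
         * stack scanning finds it and neither frees nor moves it. */
        RB_GC_GUARD(str);
    }

    return self;
}

void
Init_array_each_byte(void)
{
    rb_define_method(rb_cArray, "each_byte", array_each_byte, 0);
}
```

Without the `RB_GC_GUARD(str)` line, the compiler is free to drop `str` from the stack right after `ptr` and `len` are read, which is the situation the talk describes.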
00:11:59.519
So errors raised in Ruby interrupt the
00:12:02.640
normal flow of your program and uh it
00:12:05.760
could jump multiple stack frames in Ruby
00:12:08.480
and it behaves similarly in C and it
00:12:11.440
uses a C feature called longjmp in
00:12:14.079
order to skip stack frames. However, if
00:12:16.880
you have manually managed memory, uh it
00:12:19.920
could be lost and leaked uh when you do
00:12:22.959
that longjmp. So we need to be careful
00:12:25.440
in determining what code could raise
00:12:27.680
errors when we're writing C code for uh
00:12:30.560
native gems.
00:12:33.440
Missing write barriers could also cause
00:12:35.760
subtle bugs and uh write barriers are
00:12:38.959
hard to implement correctly. Uh they are
00:12:41.600
a little bit difficult to explain. So
00:12:43.120
learning more about this is left as
00:12:45.440
an exercise for you the viewer.
00:12:48.560
As a side note, uh there is a common
00:12:50.880
misconception that people have and this
00:12:53.360
misconception is causing an increase in
00:12:55.760
instability in our systems and I'm
00:12:58.320
talking about the rise in the number of
00:13:00.000
Rust gems mainly due to the popularity
00:13:02.399
of Rust and the promise of memory safety
00:13:05.279
from the borrow checker in Rust. While
00:13:07.760
this is true, many Rust gems directly
00:13:10.480
interface with the Ruby C API. Ruby has
00:13:13.760
a C API, not a Rust API. So there are
00:13:16.880
many implementation challenges for Rust
00:13:19.440
gems, including how to get Rust to work
00:13:22.160
properly with the Ruby garbage
00:13:23.920
collector. Additionally, Rust gives
00:13:26.720
developers a false sense of security. So
00:13:29.519
many developers don't truly understand
00:13:31.519
the code that they've written.
00:13:34.240
So as a recap, this section was about
00:13:36.320
some common bugs in Ruby and in native
00:13:38.240
gems. We looked at bugs in C code,
00:13:41.200
including use after free, buffer overflow,
00:13:43.120
and memory leaks. We also looked at some
00:13:45.440
incorrect uses of the Ruby C API,
00:13:47.839
including missing GC guards, raising
00:13:49.760
errors causing memory leaks, and
00:13:51.120
missing write barriers. Now, let's look
00:13:53.440
at some ways to catch and prevent bugs
00:13:55.920
before they reach your production
00:13:58.079
systems. Ruby has powerful ways to check
00:14:01.440
its internal state and run in more
00:14:04.399
extreme ways for bugs to reproduce
00:14:06.800
sooner. So, now let's take a look at a
00:14:09.120
few of these.
00:14:10.800
Ruby has assertions in the code to check
00:14:13.440
that Ruby is running correctly and that
00:14:15.920
the assumptions we make during runtime
00:14:18.160
in the Ruby VM are correct. These
00:14:21.440
assertions are turned off by default
00:14:23.839
because of the negative performance
00:14:25.760
impacts that they bring. However, it is
00:14:28.720
useful to run with these assertions
00:14:30.639
turned on in CI because it can catch
00:14:33.120
bugs in Ruby and the native gems before
00:14:35.760
it is deployed to production.
00:14:38.880
In order to enable assertions, we have
00:14:41.519
to uh compile Ruby differently by adding
00:14:44.800
this uh RUBY_DEBUG to the cppflags
00:14:47.839
during the compilation uh during the
00:14:49.680
configuration step when we
00:14:51.760
compile Ruby. So for ruby-build that
00:14:54.720
ships with rbenv, you want to use the
00:14:57.120
configure opts environment variable. And
00:14:59.519
for ruby-install that ships with
00:15:01.360
chruby, add the cppflags using a double
00:15:03.680
dash at the end of the command.
00:15:06.320
Compiling Ruby with assertions enabled
00:15:08.240
is documented in the official Ruby
00:15:10.560
contributing guides.
00:15:13.519
Since Rails 7.2, all apps enable the
00:15:16.480
YJIT just in time compiler by default,
00:15:19.120
which improves performance. However, a
00:15:21.760
just in time compiler performs
00:15:23.519
optimizations and compiles Ruby
00:15:26.000
code into machine code. So, it is yet
00:15:28.560
another additional source of bugs. YJIT
00:15:31.519
by default only compiles the hot code
00:15:33.920
meaning uh code that is the most
00:15:35.680
commonly run and because compiling code
00:15:39.199
has performance and memory impacts uh
00:15:42.320
YJIT limits it to code that is
00:15:44.240
executed the most frequently.
00:15:47.120
However, this may mean that not much of
00:15:48.959
the code in tests is using YJIT
00:15:51.839
because code in tests isn't executed
00:15:54.720
repeatedly. So by setting the YJIT
00:15:57.839
call threshold to one, YJIT
00:16:00.560
will compile the code the first time it
00:16:02.880
executes it.
00:16:04.959
C uses manual memory management. We've
00:16:07.360
learned about that. So it is not
00:16:08.959
uncommon to have memory errors such as
00:16:11.120
use after free or out-of-bounds memory
00:16:13.199
access. Malloc implementations are
00:16:16.399
designed to be resilient against memory
00:16:18.639
errors in order to prevent memory
00:16:20.959
attacks. However, if there is a memory
00:16:24.240
error in your app, then resiliency is
00:16:27.120
exactly what we don't need during
00:16:29.199
testing.
00:16:30.959
So there are tools that can help us find
00:16:33.440
memory errors such as Valgrind or
00:16:36.079
address sanitizer also known as ASAN.
00:16:40.639
Here's an example of a memory error in
00:16:43.040
ASAN. Uh when it encounters a memory
00:16:46.240
error, it causes the program to crash
00:16:48.000
and output an error message that looks
00:16:50.079
like this. These tools are powerful
00:16:53.360
because they do extensive checks on
00:16:55.920
every memory access. Therefore, they
00:16:58.720
make your program run several times
00:17:00.560
slower and use much more memory. So,
00:17:03.199
you'll need to keep that in mind and
00:17:05.120
adjust timeouts and memory limits in CI
00:17:07.679
in order to accommodate these tools.
00:17:11.120
So, uh for more details on how to build
00:17:13.280
Ruby with ASAN, follow this uh guide in
00:17:16.160
the official Ruby building guides.
00:17:19.600
We also run nightly tests on our Rails
00:17:22.400
monolith against the latest commit of
00:17:25.520
uh Ruby's master branch. This helps us
00:17:28.000
accomplish two things. First, it
00:17:31.039
allows us to find incompatibilities in
00:17:33.440
our codebase with the next version of
00:17:35.600
Ruby. This helps us incrementally
00:17:37.760
discover these incompatibilities instead
00:17:39.840
of having to do a big push for uh Ruby
00:17:42.480
upgrades each year. Secondly, it allows
00:17:45.600
us to discover bugs in Ruby that were not
00:17:48.640
caught by Ruby's test suite. We run
00:17:52.240
our nightly CI against Ruby head with
00:17:54.559
various configurations mentioned before
00:17:56.799
like enabling assertions, running with
00:17:58.880
YJIT call threshold of one, and with
00:18:01.120
ASAN enabled. This has helped us find a
00:18:03.919
wide variety of bugs in Ruby and in native
00:18:06.400
gems and makes our annual Ruby upgrade
00:18:09.280
easy. For these reasons, I encourage
00:18:12.400
your company to also run nightly CI
00:18:15.760
against Ruby head across various
00:18:17.679
configurations. And then when you run
00:18:19.600
into crashes or regressions in Ruby or
00:18:22.400
native gems or any gems, um open
00:18:25.919
bug reports and send fixes upstream.
00:18:29.760
So to recap, in this section, I talked
00:18:32.000
about some techniques to prevent crashes
00:18:34.000
in production. Uh first I talked about
00:18:36.480
compiling Ruby with assertions enabled
00:18:38.880
which checks the internal state of Ruby
00:18:41.280
and second uh running with YJIT call
00:18:44.080
threshold of one which makes YJIT
00:18:46.000
compile every method that gets run and
00:18:48.240
lastly using a memory checking tool such
00:18:49.919
as Valgrind or ASAN.
00:18:52.640
So there's still the inevitable case
00:18:54.480
where bugs are not caught by CI and only
00:18:57.280
happen in production. So now let's see
00:18:59.440
how we can capture information about
00:19:01.440
crashes that happen in production.
00:19:04.720
When Ruby crashes, it generates a crash
00:19:06.960
report that includes what kind of crash
00:19:09.039
it is, the Ruby stack trace, and the C
00:19:11.440
stack trace. Here's what a crash report
00:19:13.919
looks like.
00:19:16.720
We can first see what kind of I went a
00:19:19.440
little too fast. We can first see what
00:19:20.960
kind of crash this is. And this one is a
00:19:23.520
segmentation fault, meaning that it is
00:19:25.840
some sort of memory error. We then see
00:19:28.799
the Ruby stack trace. And using this
00:19:30.720
information, we can try to
00:19:32.799
isolate the issue and even try to find
00:19:35.760
uh a small reproduction script for
00:19:38.960
this crash.
00:19:40.799
We can then see the C stack trace and
00:19:43.039
this is useful for determining where the
00:19:44.880
bug comes from whether it's a bug in
00:19:46.799
Ruby or in a native gem. And we can also
00:19:49.520
use this to look at the source code in
00:19:51.679
Ruby or in native gems. And just by
00:19:53.840
looking at the source code, sometimes we
00:19:55.280
may even be able to just identify the
00:19:57.840
bug. So normally this crash report
00:20:00.880
outputs to standard error. However, as
00:20:03.840
of Ruby 3.3, you can also redirect this
00:20:07.120
crash log into a file.
00:20:10.000
The RUBY_CRASH_REPORT environment
00:20:12.400
variable is documented in the Ruby man
00:20:14.960
pages and explains how to add specifiers
00:20:17.679
to the file name for timestamps or the
00:20:20.080
process ID uh in the generated crash
00:20:22.880
log. So the crash report is often not
00:20:26.240
enough to debug crashes. We need to
00:20:29.120
capture core dumps from the crash. But
00:20:31.520
what are core dumps?
00:20:33.760
Core dumps are files generated
00:20:35.679
containing the state of the program at
00:20:37.760
the time of the crash. This includes
00:20:40.080
everything on the C stack and heap and
00:20:42.640
includes all local and global variables.
00:20:45.360
And then we can open up this core dump
00:20:47.039
in the debugger and try to find uh the
00:20:49.520
bug. However, uh core dumps only contain
00:20:52.720
the state of your program at the time of
00:20:54.640
the crash. So, uh it may be hard to
00:20:57.440
determine how you ended up in that
00:20:59.200
state. Uh and one thing to keep in mind
00:21:02.880
though is that core dumps contain all of
00:21:05.440
your program's memory. And this includes
00:21:07.919
things like decrypted user passwords and
00:21:10.799
PII in plain text as well as your
00:21:14.000
infrastructure secrets. So make sure to
00:21:17.120
treat core dumps with great care. At
00:21:19.679
Shopify, we upload core dumps to the
00:21:22.080
cloud so we can debug them, but we make
00:21:24.880
sure the access to the cloud bucket
00:21:27.760
is restricted to specific teams and it
00:21:30.400
is encrypted in order to protect
00:21:32.720
against attackers who gain access to
00:21:34.880
that bucket in the cloud.
00:21:37.760
At Shopify, we use a closed source
00:21:40.240
tool called Crash Reporter to upload
00:21:42.240
core dumps that we generate in
00:21:44.080
production. This tool essentially does
00:21:46.559
three things. It first uploads our core
00:21:49.200
dump to the bucket in the cloud. Then it
00:21:52.159
detects if uh the crashing binary is
00:21:54.960
Ruby and if it is then it finds the
00:21:57.360
associated crash report that we've
00:21:59.200
written to disk and uploads it. And then
00:22:02.000
finally it uh creates an event in our
00:22:04.720
error monitoring system in order to
00:22:06.799
keep track of this crash. So the first
00:22:09.520
step is capturing the core dump when it
00:22:11.120
occurs. So you can configure the
00:22:14.080
behavior in Linux when a program crashes.
00:22:17.520
By default it does nothing. Uh you can
00:22:20.000
also configure it to write to a
00:22:22.000
particular file or you can also use this
00:22:24.880
feature which is the one that we use
00:22:26.720
that pipes the core dump to a program's
00:22:29.600
standard input. So our crash reporter
00:22:32.640
tool reads this core dump from the
00:22:34.880
standard input and uploads it to a
00:22:37.120
bucket in the cloud.
00:22:39.600
Then secondly, when we have the core
00:22:41.520
dump, we detect whether this core dump
00:22:43.679
is from Ruby and if it is, we try to
00:22:46.000
find the associated crash report and
00:22:48.000
upload it because it contains valuable
00:22:50.240
information such as the Ruby and C-level
00:22:52.720
stack traces. Finally, we keep track of
00:22:55.840
these crashes by creating an event in
00:22:57.760
our error monitoring system. We also at
00:23:01.280
this step parse the crash report
00:23:03.520
acquired in the previous step for the
00:23:05.760
Ruby uh and the C-level stack traces.
00:23:09.039
This information is useful for
00:23:11.520
quickly triaging crashes and determining
00:23:14.320
uh if this crash is a new or known
00:23:17.360
issue
00:23:18.880
and uh using the Ruby level stack trace,
00:23:21.440
we can even use it to try to find a
00:23:23.440
minimum reproduction of the bug. So now
00:23:26.480
that we've collected the necessary
00:23:28.400
information for the crash, we need to
00:23:30.799
use it to figure out what the bug is.
00:23:33.520
Here's an AI generated image of what
00:23:35.280
that process looks like. And I'm going
00:23:36.880
to be honest, I could spend hours here
00:23:39.120
giving tips and techniques, but there's
00:23:41.520
so many possibilities that even during
00:23:44.000
that time, I couldn't even possibly
00:23:45.440
cover all of the potential issues. We've
00:23:48.480
even seen cases where the C compiler we
00:23:51.280
used to compile the Ruby binary and
00:23:53.120
native gem uh had a bug in it and it was
00:23:56.320
emitting incorrect instructions into the
00:23:58.720
binary. But I'll briefly talk about how
00:24:01.200
to debug core dumps generated from
00:24:03.120
production.
00:24:04.720
So in order to debug core dumps we need
00:24:07.039
the following things. So of course we'll
00:24:10.000
need the core dump file. We'll need the
00:24:12.080
original binaries of Ruby, system
00:24:14.000
libraries and native gems. And we need
00:24:16.080
these uh because they contain the
00:24:17.760
symbols in order to output meaningful
00:24:20.240
stack traces, structure definitions and
00:24:22.799
variables.
00:24:24.320
And we need to be on the same operating
00:24:26.720
system and usually the same CPU
00:24:28.559
architecture as the production system.
00:24:31.760
Those are quite a few requirements, but
00:24:33.760
fortunately, if you're using a container
00:24:35.840
system like Docker in production, you
00:24:38.000
can instead just use the production
00:24:39.679
container to debug the core dump. So, in
00:24:42.720
order to debug the core dump, you open
00:24:44.480
the core dump
00:24:46.240
up in a debugger such as GDB or LLDB.
00:24:49.520
You need to specify what the core dump
00:24:51.200
file is and the Ruby binary that
00:24:54.799
was crashing.
00:24:56.880
uh then when you're in the debugger you
00:24:58.480
can analyze the core dump by looking at
00:25:00.159
the back trace and read variables and
00:25:02.799
pieces of memory.
00:25:06.080
So to recap in this section we talked
00:25:08.320
about some ways to capture information
00:25:10.320
about crashes. We saw what a crash
00:25:12.799
report looks like and how to redirect
00:25:15.200
crash reports to a file in order to
00:25:17.200
capture it. We looked at what core dumps
00:25:19.679
are and how we can generate them on
00:25:21.520
Linux systems. Finally, we looked at how
00:25:24.400
we can open core dumps in the debugger
00:25:26.240
in order to debug it. So once you found
00:25:29.760
a bug, uh you should report it. If it's
00:25:32.320
a bug in a native gem, let the
00:25:34.400
maintainer know. If it's in Ruby, let
00:25:36.640
us, the Ruby core team, know about it by
00:25:39.120
reporting it to the bug tracker. But
00:25:41.440
wait, before you open a ticket, is the
00:25:44.240
Ruby you're using an actively maintained
00:25:46.320
version of Ruby? If you're unsure,
00:25:48.640
consult this page. Right now, only Ruby
00:25:51.279
3.3 and 3.4 are in normal
00:25:53.440
maintenance mode and Ruby 3.2 is
00:25:56.720
only in security maintenance, meaning
00:25:58.720
that regular bugs are no longer fixed.
00:26:01.440
So if you're running on Ruby 3.2 or
00:26:03.600
earlier and there are crashes, then try
00:26:05.600
upgrading your Ruby version first.
00:26:08.640
Additionally, make sure you're
00:26:10.720
running on the latest patch version of
00:26:12.960
your major version of Ruby. This
00:26:15.120
contains all of the backported fixes and
00:26:18.080
uh the most recent version as of
00:26:20.000
this talk is listed here.
00:26:24.320
If you're running a maintained version
00:26:25.919
of Ruby and you're on the latest patch
00:26:27.919
version, then submit a bug report on the
00:26:30.000
Ruby bug tracker and maybe even try to
00:26:32.159
submit a fix. The guide linked here
00:26:34.400
explains how to report bugs on the bug
00:26:36.320
tracker, including what kinds of
00:26:37.760
information to include in your ticket.
00:26:41.039
At Shopify, when we fix a bug in Ruby,
00:26:43.679
we want to upgrade our Ruby version to
00:26:46.080
include that patch as soon as possible.
00:26:49.200
However, a new version of Ruby may not
00:26:51.440
come out for a few months. So, we create
00:26:53.520
an internal release uh with these
00:26:56.159
fixes backported.
00:26:58.799
We've made all of our Ruby definitions
00:27:01.039
public and you can find it in the
00:27:02.960
Shopify/ruby-definitions repository.
00:27:05.840
This contains Ruby build definitions for
00:27:08.000
all of the custom Ruby versions we run
00:27:10.000
at Shopify.
00:27:12.159
So today I've covered quite a few topics
00:27:14.640
including why stability of Ruby is
00:27:16.320
important, some sources of instability
00:27:18.240
in Ruby, how we can catch bugs in Ruby
00:27:21.120
and in native gems before they reach
00:27:22.799
your production systems, and finally how
00:27:25.039
to debug core dumps uh that you've
00:27:27.440
captured from production. So you can
00:27:29.919
find a copy of these slides at this QR
00:27:32.320
code. If you have any questions, feel
00:27:34.480
free to ask me after this talk or
00:27:36.400
through social media or through email.
00:27:38.799
Thank you for coming to my talk.