00:00:07.200
So, taming the beast.
00:00:10.160
This I remember when I uh saw that my
00:00:12.080
talk was accepted, I realized like, wow,
00:00:13.759
I really picked a clickbaity title for
00:00:15.519
it. Uh we're going to be talking a bit
00:00:17.680
about how we manage schema migrations uh
00:00:20.640
particularly at intercom. Uh we've got a
00:00:22.560
big engineering team, but uh who am I?
00:00:26.320
You've heard that I've uh spoken three
00:00:28.000
times at Rails World, but I'm an
00:00:29.519
engineer on team data stores at
00:00:30.960
intercom. Uh I'm kind of directly
00:00:34.559
responsible for databases, caching,
00:00:36.880
things like that, but the team also owns
00:00:38.480
other stuff uh including our Ruby on
00:00:40.480
Rails platform, which is surprising to
00:00:42.160
some people. Uh we're responsible for
00:00:45.120
like holding and raising the bar for
00:00:47.039
productivity at the company. So,
00:00:49.039
anything that helps make our engineers
00:00:51.120
more effective is kind of part of our
00:00:53.360
wheelhouse. And uh as we already said,
00:00:56.000
it's my third time being here.
00:00:58.800
Um
00:01:00.320
quick disclaimer, uh lots of what we're
00:01:02.800
going to talk about kind of is MySQL
00:01:04.400
specific. Uh Intercom's a MySQL shop,
00:01:07.520
but the concept should broaden out to
00:01:14.880
should broaden out to other uh database
00:01:17.200
technology.
00:01:19.360
Uh
00:01:21.200
how do we think about productivity? This
00:01:22.880
is kind of important because for us like
00:01:25.040
making schema migration tooling uh
00:01:28.560
really good is an important part of
00:01:30.799
making our app available and keeping
00:01:32.159
everything like safe, but it's actually
00:01:33.680
a really important productivity tool as
00:01:35.360
well. Uh Rails has great resources for
00:01:38.560
learning. The foundation's been doing
00:01:40.240
some great work there. But things that
00:01:43.119
work when you're small don't necessarily
00:01:45.360
work at scale. uh like when you're
00:01:47.280
running a big database like intercom, we
00:01:49.119
have around 300 terabytes of relational
00:01:51.439
data, multiple like 10 plus 50 plus
00:01:54.720
billion row tables. Uh you need to know
00:01:57.520
some different things and like consider
00:01:59.119
some different stuff when you're working
00:02:00.159
at that scale. But we should still try
00:02:03.040
and make sure that like engineers
00:02:04.960
working at intercom can like Google
00:02:07.280
something, read the docs, look up the
00:02:09.360
Rails guides and get the right start.
00:02:12.239
The the worst thing would be if you
00:02:15.360
start as a new hireer in intercom,
00:02:17.840
you've worked on rails before, you know,
00:02:19.680
Rails and then you try and use Rails
00:02:22.160
like idioms in intercom and it doesn't
00:02:24.080
work. Uh, so that's what we're trying to
00:02:26.319
avoid. And that's kind of only getting
00:02:27.840
more important now that like LLMs are
00:02:29.760
doing more and more of the driving
00:02:31.040
because LLMs are trained on what's out
00:02:32.959
there on the public internet. If we had
00:02:34.640
some specific processes you had to
00:02:37.280
follow, that would be um even harder.
00:02:40.080
we'd like be pollutant context window
00:02:41.920
trying to tell tell them how to run a
00:02:43.280
migration. Uh so that's kind of that's
00:02:46.160
the framing we're coming at it from.
00:02:49.519
But you know maybe you've just started a
00:02:51.519
new Rails app.
00:02:53.680
You probably don't have a ton of data.
00:02:55.920
You maybe have no real data in your
00:02:57.519
database and you probably don't need a
00:03:00.480
you probably don't need a zero downtime
00:03:02.159
deployment. So like you can get away
00:03:04.159
with a lot. And like let me say this
00:03:06.720
right now. If that's you, if this fits
00:03:08.640
you, just you don't need to worry about
00:03:11.200
this stuff. If I didn't have to worry
00:03:12.720
about this stuff, I wouldn't worry about
00:03:14.159
it either. Uh early days, it's way more
00:03:17.280
important to focus on speed of
00:03:18.480
iteration. Like just try stuff. Uh we're
00:03:21.280
going to really quickly scale up to
00:03:22.879
talking about things that apply to the
00:03:24.720
scale intercom is at. And that might be
00:03:27.200
interesting to think about, but you
00:03:29.200
know, do what do the minimum amount of
00:03:30.799
process that will work for you and that
00:03:32.319
will keep you fast. Uh because yeah
00:03:35.840
scale is a is a nice problem to have. Uh
00:03:39.680
you know issues are going to happen as
00:03:41.360
you grow.
00:03:43.840
Uh issues are going to happen as you
00:03:45.200
grow. That's just kind of inevitable.
00:03:47.200
The first time you learn that your like
00:03:48.959
core data model is too large to migrate
00:03:51.360
safely will be when you try and run a
00:03:53.440
migration and everything goes down. And
00:03:55.599
the first time you learn that your
00:03:56.879
customers are sensitive to the downtime
00:03:59.040
being caused by your database operations
00:04:00.959
will be when you have a bunch of angry
00:04:02.400
customers beating down your door because
00:04:03.840
you just ran a migration. That's life.
00:04:06.159
Like that's fine. That's that's good
00:04:07.920
problems. But the important thing is to
00:04:10.239
learn from it and make sure that you
00:04:12.400
don't keep making those same mistakes.
00:04:14.000
You don't want to have to explain to
00:04:15.200
customers like, "Oh yeah, the four this
00:04:17.199
weak is because we keep doing the same
00:04:18.479
thing and it keeps going wrong the same
00:04:19.840
way."
00:04:20.959
Uh remember like the kind of saying from
00:04:23.600
from aviation safety and other safety
00:04:25.520
critical industries like checklists get
00:04:27.040
written in blood. So uh to that point
00:04:30.240
we're going to we're going to talk about
00:04:31.440
checklists. Uh because checklists like
00:04:34.320
runbooks whatever you want to call them.
00:04:36.400
As soon as you're big enough to start
00:04:37.759
needing some kind of process this is
00:04:39.280
probably where you're going to start.
00:04:40.479
This is like the the uh natural first
00:04:42.960
step.
00:04:44.560
So uh checklists like they document the
00:04:47.600
edge cases you're aware of. They tell
00:04:49.360
you the things you need to do. Uh, and
00:04:51.919
they're a great tool for for new people
00:04:54.400
like onboarding, understanding the
00:04:56.240
operational parameters of your app, like
00:04:58.000
where the sharp edges are. Uh, but you
00:05:00.880
want to like keep them small. You don't
00:05:02.639
want to clutter them up. Uh, I have like
00:05:04.960
an absolutely abiding hatred for what I
00:05:07.280
call a choose your own adventure
00:05:08.639
checklist where every item says like,
00:05:11.440
oh, in this situation, you have to keep
00:05:13.360
in this in mind, but in that situation,
00:05:15.360
you don't have to do it. skip on to the
00:05:16.960
next thing like turn to page four. That
00:05:19.360
stuff just drives me absolutely crazy.
00:05:21.919
Uh like every checklist item is a
00:05:24.320
liability. So you don't want to be
00:05:26.720
putting things in your process unless
00:05:28.240
you've seen it hurt you. You want to
00:05:30.000
keep those like sharp tools like BH was
00:05:32.160
talking about earlier. You want to keep
00:05:33.840
them like that's an important part of
00:05:35.440
productivity. You don't want to like try
00:05:37.440
and shave off every sharp edge like go
00:05:40.880
too far with guard rails, but you also
00:05:43.039
don't want to keep repeating the
00:05:44.479
mistakes like we talked about. Uh
00:05:48.560
so
00:05:51.039
if you have your checklist, you know
00:05:52.960
where what your operational parameters
00:05:54.720
are. It's only useful if people are
00:05:57.680
actually using it. And when people are
00:06:00.560
like working on your app, when they're
00:06:02.000
making a making a migration, uh where
00:06:04.560
are they going to end up? they're going
00:06:05.520
to end up in a PR.
00:06:07.919
So, I want to meet people where they're
00:06:09.919
at. If they're opening a migration PR,
00:06:12.319
they're in GitHub. They're doing PR
00:06:14.000
review. Uh, that's where we should put
00:06:17.039
it. When we first created a checklist
00:06:18.960
for migrations in intercom years ago, it
00:06:21.120
went into our internal wiki. I don't
00:06:23.280
know if anyone ever really read it.
00:06:25.199
Ended up in Notion. People didn't read
00:06:26.960
it there either. We have Koda now. It
00:06:29.360
might be in there. I can't find it. Uh,
00:06:32.240
so we want to post it onto the uh onto
00:06:34.639
their uh PRs. And to that point,
00:06:38.960
you can scan this QR code. It's a GitHub
00:06:40.880
action. It's what we use at Intercom
00:06:43.120
that to post the checklist. I've bundled
00:06:45.919
in a stripped down version of our
00:06:47.919
checklist. It's basically everything we
00:06:49.919
have in there, but we leave I left out
00:06:52.240
the uh the really gnarly stuff around
00:06:54.400
like how our ETL pipeline works. You
00:06:57.039
don't have to run our ETL pipeline. You
00:06:58.800
don't need to worry about how it works.
00:07:00.479
but it has some like common sense stuff.
00:07:03.039
I don't necessarily love everything
00:07:04.800
about it, but it is a good start. Uh,
00:07:08.000
I'll leave that up for a second more.
00:07:10.000
Uh, but we can take a look at what's in
00:07:14.400
there. I had to shrink this down to fit
00:07:16.639
it on the page. Don't worry, we're not
00:07:18.240
going to go through everything line by
00:07:19.520
line. This is the checklist that that
00:07:21.599
will post. Uh, the kind of maybe the
00:07:24.000
more interesting things
00:07:26.319
in yellow, but that means it will be
00:07:28.080
bold. Don't merge the PR if your
00:07:29.840
application changes before you complete
00:07:31.360
all the steps in a checklist. That's an
00:07:33.520
important part of like zero downtime
00:07:35.120
because when the deployment happens, the
00:07:37.759
migration may not have run yet. So you
00:07:39.520
don't want the code that relies on it
00:07:40.960
out there in production. Fine. Uh
00:07:43.840
another fun one, if you run Postgress,
00:07:46.880
you don't notice pain. Uh if you're
00:07:49.360
making more than one change, use change
00:07:51.599
table set bulk true because my SQL does
00:07:54.479
not have transactional DDL. So anytime
00:07:57.520
you run a statement that just happens
00:08:00.080
and if your migration fails in the
00:08:01.599
middle then it will be uh in a
00:08:03.520
inconsistent state. Uh one thing you'll
00:08:06.160
notice if you if you do read through it
00:08:07.680
or if you have opened up the the uh
00:08:09.599
GitHub action that I linked before not
00:08:12.240
everything in here is that kind of
00:08:14.080
operational scar tissue. Some of it is
00:08:16.000
purely educational
00:08:18.160
like uh your PR must contain a schema RB
00:08:21.759
file updated. If your migration file has
00:08:23.840
an older time stamp, leave the one in
00:08:26.800
the schema uh file untouched. You can't
00:08:29.120
make your migration go back in time,
00:08:30.879
that's not really that's not going to
00:08:32.479
cause an outage, but it kind of handles
00:08:35.680
things that new people to Rails might be
00:08:37.519
surprised by. So great, we have a
00:08:40.320
checklist now. All of our migrations are
00:08:43.120
going to be safe. The end, right? That's
00:08:45.760
it. Cool. Thank you. I'll be off. Uh no,
00:08:49.440
like seriously, it's not it's not the
00:08:51.600
end. Uh it's pretty good. Checklist pops
00:08:54.000
up in the migration. Fine. Uh it like it
00:08:57.360
does tell you things you need to know.
00:08:58.800
Great. But it's not quite enough. Like
00:09:03.040
you might be surprised to learn that
00:09:04.240
even checklists are fallible. Uh
00:09:08.080
I've had new hires DM me on Slack and
00:09:11.519
Intercom and ask me what does something
00:09:13.279
on the checklist mean? They tried it and
00:09:15.680
it didn't work. and I go and look and
00:09:17.680
realize like, oh, you know, that's not
00:09:20.000
worked that way for 18 months and nobody
00:09:23.440
has said anything at any point. Uh,
00:09:26.480
which tells me that everyone is doing a
00:09:28.080
real good job reading the checklist. And
00:09:29.920
like I'm guilty of it, too. I run schema
00:09:31.600
migrations. I don't read the checklist.
00:09:32.959
I wrote the checklist. I don't need to
00:09:34.320
read it. Uh, so uh by the time the PR is
00:09:38.959
open, like the engineering work is
00:09:40.399
finished, people have kind of switched
00:09:42.160
mode a bit. Uh, and if we tell them
00:09:44.880
like, oh, you have to go back and do
00:09:46.080
something differently, that's kind of
00:09:47.360
annoying. It's a bit of a productivity
00:09:48.880
drain. Uh, it's a useful tool in the
00:09:51.600
toolbox,
00:09:53.120
but you know, it's not everything. It
00:09:56.000
also doesn't do anything to handle like
00:09:57.839
if you have a big table, uh, running a
00:10:01.040
migration might take 24 hours. And if
00:10:04.000
you're doing certain operations in
00:10:05.600
MySQL, that table might be locked for
00:10:07.600
writes for 24 hours. That'd be a pretty
00:10:09.360
bad bad outage to have. And the
00:10:11.279
checklist kind of we're not going to go
00:10:13.120
back. The checklist doesn't really do
00:10:14.480
much to to help you with that. Uh and it
00:10:18.560
also doesn't handle people just making
00:10:20.480
mistakes. Like mistakes happen in the
00:10:22.480
same way that outages happen. You can't
00:10:24.079
just say like, "Oh, we'll just never
00:10:25.360
make a mistake and everything will be
00:10:26.720
fine." Uh so we do need to go a bit
00:10:29.040
deeper than this. Uh one of the things
00:10:31.360
we want to do is we want to make sure to
00:10:33.200
fix issues before they happen.
00:10:35.839
uh we should be telling engineers about
00:10:38.720
risks while they're writing their
00:10:40.320
migration and that kind of like when I
00:10:42.880
thought about it you know that sounds an
00:10:44.160
awful lot like linting uh and we have
00:10:46.880
rubocop so like let's get rub on the
00:10:50.560
case uh we can take an example now this
00:10:54.160
is I dug back through the annals of time
00:10:56.880
in intercom this is migration from 2011
00:11:00.000
when the company was only like four or
00:11:01.519
five people and I think it's it's pretty
00:11:04.000
representative of like a normal
00:11:06.000
migration that you might make on a new
00:11:07.440
Rails app. We're like adding columns on
00:11:10.480
the messages table. We're adding some
00:11:12.720
comment column col columns on the
00:11:14.880
comments table. You know, it's fine.
00:11:17.200
We're doing some ads. Uh we do the down.
00:11:19.920
We do the reverse. You know, it's all
00:11:22.240
good. This is a real thing. Uh there's
00:11:24.480
nothing wrong with it particularly. I
00:11:26.240
wouldn't let you merge it now, but
00:11:27.519
there's nothing wrong with it
00:11:28.399
particularly. Uh
00:11:31.200
why wouldn't I let you merge it? Uh,
00:11:34.480
like I said before, we kind of touched
00:11:35.839
on it in the checklist. Every line there
00:11:37.680
is executing a new SQL query. On MySQL,
00:11:40.720
none of those things aren't atomic. So,
00:11:42.720
the migration isn't atomic. You know,
00:11:44.959
you can, if you have Rubicop Rails, you
00:11:46.959
can put this into your Rubicop YAML like
00:11:49.200
Rails bulk change table enabled. You're
00:11:51.839
running on MySQL, include your
00:11:53.760
migrations uh directory, and it will
00:11:56.640
warn you that like, oh, hey, you know,
00:11:58.480
you could bulk those things together.
00:12:01.120
And that's pretty good. it will tell you
00:12:02.880
like oh those messages changes could
00:12:04.480
have been one migration the comments uh
00:12:07.680
columns could have been another bulk
00:12:09.200
change table but uh it only warns you
00:12:13.680
when operations that could be bulked
00:12:16.959
could be combined aren't being combined.
00:12:19.519
It doesn't warn you that like even if
00:12:21.200
you combined all of the changes on the
00:12:22.880
two tables that that would still result
00:12:25.920
in two distinct altars and that that
00:12:28.560
could still fail in the middle. Uh so we
00:12:31.360
went a bit further. Um this is I just
00:12:34.480
copy and pasted the code into a gist.
00:12:36.399
This is the actual cop we run in
00:12:38.000
intercom for this. Uh it kind of does
00:12:40.560
the same thing as as the uh the Rails
00:12:42.800
one, but it ensures that like in
00:12:45.120
addition to combining all of the
00:12:46.639
bulkable operations, there can only be
00:12:48.800
one operation left once they've all been
00:12:50.880
combined. So uh you can get into a
00:12:53.920
situation where a migration could fail
00:12:55.760
in the middle. It's it's a bit of an
00:12:57.760
edge case. It's annoying when it
00:12:59.200
happens. Uh so that's good. That's an
00:13:02.560
example of something we can do at
00:13:03.680
Rubocop. Something that people just
00:13:04.959
don't have to worry about anymore.
00:13:07.920
What else can we do? Uh so if you were
00:13:11.920
if you were in Toronto last year, if you
00:13:13.519
saw my talk from from 2024, you might
00:13:16.000
know you might know how this story ends.
00:13:18.480
Uh big in foreign keys. Uh so we had an
00:13:22.079
interesting problem in intercom where
00:13:24.880
since I don't remember what Rails
00:13:26.480
version the ID column defaults to being
00:13:29.760
a big int. It's a good change. Uh we've
00:13:32.880
hit max int many times on different
00:13:35.519
tables. Uh so the default being big int
00:13:39.200
is good but you can still sometimes end
00:13:41.839
up with external references being the
00:13:43.839
wrong type. uh we don't tend to use the
00:13:46.959
uh references function when we're making
00:13:49.040
migrations because actually we have a
00:13:50.800
lot of uh vertically partitioned
00:13:52.800
databases where tables live in in
00:13:54.560
different schemas so you can't join
00:13:56.639
across them or uh and also in in my SQL
00:13:59.440
you can't really rely on foreign key
00:14:01.279
constraints if you've got a a large
00:14:02.959
database so we don't do it uh but then
00:14:07.120
you can end up in a situation where like
00:14:09.040
table A references another table and
00:14:11.600
that has the foreign key is a regular
00:14:13.920
int table B is a big int and then you're
00:14:16.480
going to run out of uh when when table B
00:14:18.880
exceeds a 32-bit int, then everything
00:14:21.920
breaks because now you can't insert rows
00:14:23.519
into table A anymore. Getting pretty
00:14:25.440
abstract. Uh that causes us a lot of a
00:14:28.399
lot of headaches. Like you can watch the
00:14:30.639
watch the video from last year. It was
00:14:32.399
it was a painful time. Uh and one of the
00:14:35.040
things we did afterwards is another
00:14:37.839
simple custom cop. I mean I kind of hope
00:14:40.160
you don't need this one, but if you want
00:14:41.440
to have a look and see how it works, go
00:14:43.360
ahead. Uh what this does is look for
00:14:45.839
columns that end in underscore ID and
00:14:48.720
doesn't let you use a small int uh type
00:14:51.440
for them. It's a little heavy-handed,
00:14:53.600
but it works for us. And I think a lot
00:14:56.079
of the time with uh writing good rubocop
00:14:58.240
rules, it's a bit like uh what I think
00:15:00.160
of writing good reaxes. You don't want
00:15:02.000
to try and overfit. You just want to
00:15:04.000
handle the cases that you know about.
00:15:05.839
And uh yeah, that's that works pretty
00:15:08.480
well. Uh so last one and we're going to
00:15:12.800
lead with the code this time and I'll
00:15:14.639
you'll find out why. We have another
00:15:16.720
custom cop that blocks using the execute
00:15:20.160
method in migrations. So you probably
00:15:22.880
know that execute is the way to run raw
00:15:26.240
SQL. Um
00:15:28.800
it's fine. Like I know a lot of people
00:15:30.480
do it. Uh we don't have a ton of we
00:15:32.480
didn't have a ton of examples even
00:15:33.760
before adding this cop. Now we have
00:15:36.160
none. Uh
00:15:38.800
I like using those helper functions.
00:15:41.519
Sometimes people like writing raw SQL. I
00:15:43.680
can write raw SQL as well. But uh it's
00:15:46.240
something we wanted to stop in intercom.
00:15:48.000
And we wanted to stop it for a better
00:15:49.519
reason than just my like sense of
00:15:51.279
aesthetics because uh catching things in
00:15:54.560
development is great. You know, popping
00:15:56.160
up warnings in Rubocop, fine. That's
00:15:58.959
pretty good. But your development
00:16:00.639
database is still like different to your
00:16:02.639
production database. we don't
00:16:03.839
pre-provision everyone's developer
00:16:05.360
environment with 50 terabytes of
00:16:07.040
conversation data so that they can get a
00:16:09.440
realistic experience of working with the
00:16:11.120
production app. uh lots of context about
00:16:15.040
your production environment just isn't
00:16:16.720
available in dev and having to like
00:16:20.240
hardcode in all the rules saying like oh
00:16:22.480
well this table's big and that table you
00:16:24.639
know whatever too much too too brittle
00:16:27.600
uh you'd be like trying to you know it's
00:16:30.079
like a constantly shifting goalpost
00:16:31.600
whatever building foundations on sand so
00:16:34.880
uh we went a step further even than
00:16:37.279
doing the linting and we went for for
00:16:40.320
runtime checks and this is kind
00:16:42.480
This is, I think, the the gold standard
00:16:44.320
for doing this because the only time
00:16:46.720
you're going to have full context, like
00:16:48.240
the only way, the only place you ever
00:16:50.320
know what your production environment
00:16:51.600
looks like is in production. Uh, I don't
00:16:54.320
really believe in staging environments.
00:16:55.759
I think staging environments, we get
00:16:57.519
back to the problem of a constantly
00:16:59.279
shifting goalpost. It's hard to make
00:17:01.279
them look real. And the best place to to
00:17:04.319
know what production is is production.
00:17:07.120
Uh so if we insert ourselves into the
00:17:10.400
migration process in production then you
00:17:12.959
know we have access to the production
00:17:15.199
database. We have access to all of the
00:17:17.839
everything we need to know to to make a
00:17:19.760
guess as to whether the migration that's
00:17:21.600
happening is safe. Uh so we can kind of
00:17:24.480
start simple
00:17:26.319
uh and think about like what unsafe
00:17:28.720
operations are there.
00:17:31.039
uh like mistakes can happen like we said
00:17:33.200
before like we just talking about what's
00:17:36.160
safe in dev isn't necessarily safe in
00:17:37.760
production. Uh
00:17:42.880
I I guess yeah risks are even greater
00:17:44.640
now I think I mentioned this earlier
00:17:45.679
risks are even greater now with LLMs
00:17:47.200
doing the driving. So we've had a couple
00:17:48.960
of cases where LLMs are generating
00:17:51.520
migration files and they're not running
00:17:53.760
the generator and the migration file
00:17:55.919
that comes out at the end isn't
00:17:57.120
necessarily super wellformed.
00:17:59.520
Uh so we wanted to to handle cases like
00:18:02.880
that and one of the things we did
00:18:06.720
um kind of lost Fred there a little bit
00:18:09.679
but yeah we have this uh we we uh okay
00:18:13.760
yeah now I remember so one of the things
00:18:15.679
we did before we get into the really
00:18:17.120
lowlevel stuff is uh we just overrode in
00:18:20.559
like our rake file the DB namespace and
00:18:24.400
we blocked like DB reset DB drop things
00:18:28.480
like that we don't want anyone to drop
00:18:30.240
the database in production. So there's
00:18:31.760
no need to be able to run that rake
00:18:33.200
task. We don't want anyone to reset the
00:18:34.880
database in production. And another
00:18:36.880
thing we did, this kind of harks back to
00:18:38.720
the checklist from earlier, uh we don't
00:18:41.840
want you to run DB migrate, uh just the
00:18:45.679
bare version of DB migration because
00:18:48.240
that will run every pending migration.
00:18:51.120
And that's fine when you're like one
00:18:52.799
dev, two devs, like a small team, but we
00:18:56.640
sometimes have 10 migration PRs landing
00:18:59.440
within 15 minutes of each other from
00:19:01.440
different teams. We don't want them all
00:19:03.280
running like some of them might be for
00:19:05.120
those massive tables. Another one might
00:19:06.960
just be a create table for something
00:19:08.559
that's being spiked out. So, we want
00:19:10.480
everyone to be deliberate about only
00:19:12.160
running their own changes. So, we force
00:19:14.080
you to use the DB migrate up and specify
00:19:17.039
a version. Uh, so I hope you're never
00:19:20.480
going to need to worry about these
00:19:21.840
things yourself, but uh, if you do, like
00:19:25.039
you can just grab that code snippet.
00:19:26.640
It's right here. Drop it into your rake
00:19:28.160
file. It should work. Uh, but we want to
00:19:32.080
go further. Uh, we want to start
00:19:34.960
building up some idea of like what real
00:19:37.039
migrations are are safe and and we can
00:19:40.000
do that. So, uh, our checklist used to
00:19:43.440
tell teams, uh, that like, you know, if
00:19:47.360
they're migrating a large table, come
00:19:50.240
and talk to my team.
00:19:52.640
But that just introduces like one of
00:19:54.880
those decision points like I was talking
00:19:56.799
about before, one of these cases where
00:19:58.480
you're choosing your own adventure
00:19:59.679
because now you have to know like is a
00:20:01.440
table large and like at what at what
00:20:03.760
point is a table large?
00:20:06.160
um you know that's uh that's not very
00:20:08.799
clear. And we found that like what
00:20:10.400
tended to happen was people became
00:20:13.039
either they knew they were creating a
00:20:14.559
table and that's fine or they would just
00:20:17.120
be uncertain and they would want to come
00:20:19.280
through us for every migration and like
00:20:21.520
sense check it with us. Was it okay to
00:20:23.200
run it and like that's such a
00:20:24.720
productivity drain because there's only
00:20:26.160
there's only so many people on the team.
00:20:27.840
We're only able to respond so quickly.
00:20:30.799
uh when you want to run one of those big
00:20:33.679
migrations, there are tools like Ghost.
00:20:35.919
If you've seen Ghost, it's GitHub online
00:20:38.400
schema transformer, I think it stands
00:20:40.640
for, or there's like PT online schema
00:20:43.200
change or large hadron migrator. I think
00:20:46.159
that's the Facebook one. There's a lot
00:20:47.440
of different tools. Uh but you have to
00:20:49.840
learn how they work and and you have to
00:20:52.080
know all of these command line switches.
00:20:53.919
And again, we're deviating really far
00:20:56.640
from the the standard Rails flow at that
00:20:59.760
point. Like we said, it's hard to to use
00:21:02.720
a tool like that. We're kind of breaking
00:21:05.200
the uh the just be normal principle I
00:21:08.640
talked about earlier. You know, we want
00:21:10.240
to just run migrations of rake. I guess
00:21:12.640
you could package up running one of
00:21:14.080
those tools of rake. I wouldn't
00:21:15.600
personally feel super safe doing that,
00:21:18.320
but that's my risk tolerance in our
00:21:21.039
production environment. like your
00:21:22.559
mileage may vary. Uh so one of the
00:21:25.679
things we do to make sure that people
00:21:27.200
don't have to ask those questions is
00:21:29.200
actually just analyze the migrations at
00:21:31.440
runtime. Uh so we monkey patched active
00:21:35.440
record migration exec migration which if
00:21:38.559
you know how internals work like this is
00:21:41.120
effectively the thing that calls the up
00:21:43.280
method in your migration or calls the
00:21:45.200
change method. Uh we used the active
00:21:48.960
record command recorder which uh is a
00:21:51.600
nice bit of tech. If you've ever
00:21:53.120
wondered how Rails can take your change
00:21:55.919
method and work out how to reverse it
00:21:57.919
without you specifying the reverse, it's
00:22:00.080
thanks to command recorder where it like
00:22:02.000
executes the migration in a kind of dry
00:22:03.919
run environment, works out the commands,
00:22:06.080
and then it knows how to map every
00:22:07.919
command to the reversed version of it.
00:22:10.799
Uh so running your migration in a dry
00:22:13.120
run environment getting the commands out
00:22:14.799
that that's you might see where we're
00:22:16.640
going here. Uh and then we implemented
00:22:19.039
our heruristics that we were using like
00:22:21.360
you know if you came to me or any of my
00:22:22.960
teammates they're all sitting at the
00:22:24.159
front here. If you came to one of us and
00:22:25.840
asked like do you need to run uh this
00:22:28.240
migration with ghost did you need us to
00:22:30.480
run it for you? uh we just like wrote
00:22:32.960
coded up those rules and if you're going
00:22:36.240
to hit something there if you know if
00:22:38.400
you need to talk to us we just raise and
00:22:40.080
block the migration we'll tell you like
00:22:41.919
you okay no this this one you need to
00:22:43.679
come and talk to us we'll post it in
00:22:45.760
your console output we'll post it into
00:22:47.760
our Slack channel to say that someone
00:22:49.360
just ran the migration and they're going
00:22:50.720
to need some help uh and yeah that that
00:22:53.760
works pretty well so we again I had to
00:22:56.559
squeeze it onto the the page a bit but
00:22:59.440
uh commands here is the internal of a of
00:23:02.559
a command recorder. So, it's like a list
00:23:04.720
of operations. We block execute. And now
00:23:08.720
we get back into why we blocked execute
00:23:11.039
because uh the command recorder doesn't
00:23:13.840
actually know what you're doing in an
00:23:15.600
execute. It just knows that one
00:23:17.039
happened. So, we don't want to deal with
00:23:18.240
that. So, we block you using execute in
00:23:21.039
case you bypass the rubricop rule from
00:23:22.880
earlier. And then we check like is the
00:23:25.919
row count above a limit and is the store
00:23:28.320
size above a limit. And the reason we do
00:23:30.240
that is we found like if tables are are
00:23:33.039
sufficiently long like if they have
00:23:34.720
sufficiently many rows that can be a
00:23:36.240
problem. But a small table with a lot of
00:23:38.400
like very big rows like a very wide
00:23:40.720
table that can also be a problem. So for
00:23:43.440
us these limits are 5 million rows or 5
00:23:46.799
GB of data which is
00:23:49.919
to me low like maybe my brain is broken
00:23:52.960
at this point. Five million rows sounds
00:23:54.559
like an absolutely microscopic table to
00:23:56.320
me but I deal a lot with like 50
00:23:58.240
billion. So yeah, but uh it that means
00:24:01.600
that like if you're dealing with a new
00:24:04.480
table, like you're building out a new
00:24:05.760
feature, you're spiking it, you don't
00:24:07.200
have to worry about this. We'll try and
00:24:08.640
just get out of your way, but by the
00:24:10.240
time you have 5 million rows, it
00:24:12.000
probably has enough traction that that
00:24:14.240
doing something to it locking it for
00:24:16.000
some reason might be an issue. Uh and
00:24:18.480
then we print out some nice formatted
00:24:21.279
error message saying like, "Hey, it's
00:24:22.960
too big. You need to come talk to us."
00:24:25.520
uh and the way we get the rows uh if you
00:24:30.080
we can't run select count star because
00:24:32.159
that would also be very slow. So instead
00:24:34.240
we do explain select count star. Innov
00:24:37.360
will give you a guess at how big the
00:24:38.640
table is and that guess is good enough
00:24:40.159
for our purposes or we check the
00:24:42.159
information schema to figure out how big
00:24:44.000
is the table. We actually factor in the
00:24:46.240
index size when doing that. I've never
00:24:48.080
been completely certain if we need to
00:24:50.000
but we just do it anyway.
00:24:52.400
Uh we'll come back. you'll get a link to
00:24:54.640
the code. So, I'm skipping over a bit
00:24:56.240
fast for time. Uh, another thing is like
00:24:59.760
in older versions of MySQL and until
00:25:01.919
relatively recently, we were on 5.7,
00:25:04.320
which was still affected by this. Uh,
00:25:06.880
dropping a big table wasn't safe. Uh, it
00:25:10.480
wasn't safe because if the table was in
00:25:12.159
the buffer pool cache, dropping it would
00:25:14.159
synchronously ex uh,
00:25:17.440
God, I'm completely blanking on the
00:25:19.120
word. It would synchronously like flush
00:25:21.039
it out of the cache. and while it was
00:25:22.720
flushing the whole database could be
00:25:24.400
locked. Uh so we never wanted that to
00:25:27.360
happen. So if we ever saw that the
00:25:29.840
command being executed in a migration
00:25:31.760
was a drop table instead we would just
00:25:34.000
rename the table to something else with
00:25:36.320
a predictable format and then we had
00:25:39.360
another like periodic task that comes
00:25:41.360
along and cleans up those tables. We
00:25:44.080
again we don't worry if the table is uh
00:25:47.200
smaller than 2,000 rows. We don't really
00:25:49.279
mind what people do with that with those
00:25:50.960
little tables. So, this is just another
00:25:53.120
thing that now people can drop a table.
00:25:54.880
They don't need to know what's happening
00:25:56.080
under the hood. We just handle the rest
00:25:57.520
of it. Uh we also block change column.
00:26:02.240
Uh that might seem a little strange. If
00:26:05.279
you change a column from like int to big
00:26:08.159
int, fine. If you change a column from
00:26:10.159
like string to text, fine. Uh those are
00:26:13.120
all things that are normal and you might
00:26:15.120
want to do. But if you change the column
00:26:17.360
from int to string or from I don't know
00:26:20.559
some other type two types that aren't
00:26:22.240
compatible with each other during our
00:26:24.480
deployment process you would have a
00:26:26.000
bunch of processes that still know about
00:26:28.159
the old schema they have a cache you
00:26:29.600
have a bunch of processes know about the
00:26:31.039
new schema lots of breakages could
00:26:33.200
happen. So we just ban doing change
00:26:35.279
column. We go for like an expand
00:26:37.120
contract model where you add a new
00:26:38.880
column populated from the old column
00:26:41.039
then drop the first column again rather
00:26:43.679
than uh changing it in place. Uh it's a
00:26:46.559
little tiresome, but there isn't really
00:26:48.159
a better solution for making that safe.
00:26:51.039
Uh
00:26:53.840
we also just to have a it's not included
00:26:56.640
here. We include an escape hatch for
00:26:59.120
like expert operators, my team or or
00:27:01.840
some engineers to just like bypass these
00:27:04.480
checks because again you don't want to
00:27:06.720
like take away the sharp tools. You want
00:27:08.799
to make sure that like people aren't
00:27:10.559
just juggling knives or whatever, but it
00:27:12.960
is useful to be able to do them. So, uh,
00:27:16.880
we do let you bypass this, but we run
00:27:18.640
all of these checks and more in
00:27:20.240
production on every change and
00:27:24.000
you can too if you want to take a look
00:27:26.080
at the code. This is probably the
00:27:27.360
biggest single chunk of code. So,
00:27:29.039
there's a lot to go through. Um, I would
00:27:32.400
take it as a jumping off point. Uh, like
00:27:35.039
I said at the beginning, you don't want
00:27:36.320
to just blindly take every idea. We're
00:27:39.120
not I'm not here to pedal the the Google
00:27:41.360
SR book and tell you how to run your uh
00:27:44.080
run your startup, but like there are
00:27:46.000
ideas in there that I think are
00:27:47.440
generally useful. Uh and this is like
00:27:50.240
literally the code straight out of our
00:27:51.679
monolith. So feel free to take a look.
00:27:54.640
Uh that's kind of bringing me to the end
00:27:57.440
like where do we go in the future?
00:27:59.760
Intercom is currently in the process of
00:28:01.520
moving everything to planet scale away
00:28:03.200
from uh Aurora RDS in in uh AWS. uh that
00:28:08.399
does change a lot of the dynamics here
00:28:10.159
and it was part of why I wanted to talk
00:28:11.600
about this because a lot of these things
00:28:13.200
are probably going to go away from our
00:28:14.480
monolith and I think they are kind of
00:28:15.919
useful. uh we've already seen like a big
00:28:19.200
decrease in issues that uh engineers
00:28:22.159
were having both with like shipping
00:28:23.840
things and in production since we
00:28:25.520
implemented this for ourselves and now
00:28:27.360
moving on to planet scale it's kind of
00:28:29.279
changing things again there's a
00:28:30.720
different workflow we've actually
00:28:32.240
packaged up behind DB migrate again we
00:28:35.840
still are trying to stick to that just
00:28:37.360
be normal principle but that maybe uh
00:28:40.320
maybe will change planet scale has a
00:28:42.159
nice flow that's kind of like a pull
00:28:44.240
request for changing the schema and
00:28:46.159
maybe it makes sense to expose that to
00:28:47.919
everyone. Uh the dust is kind of still
00:28:50.640
settling there. So we'll see. One thing
00:28:53.200
we have got as a result of moving to
00:28:54.880
planet scale is now we never have to
00:28:57.120
tell engineers like hey you have to wait
00:28:59.120
for data stores on call to run your
00:29:01.360
ghost migration for you because that can
00:29:03.440
just be handled under the hood by by the
00:29:05.200
planet scale workflow. So that's nice.
00:29:08.159
Uh but yeah, thanks very much. This is a
00:29:12.399
link for like everything that's been in
00:29:14.159
the presentation so you can pull it
00:29:15.679
down. Um, feel free to have a look. I'
00:29:18.480
love to talk about migrations and
00:29:19.840
running big databases with anyone. And
00:29:22.159
otherwise, I'll leave it there. So,
00:29:24.799
thanks.