Taming the Beast: Safely Managing Database Operations in Rails in a Team of 100s


Summarized using AI

Taming the Beast: Safely Managing Database Operations in Rails in a Team of 100s

Miles McGuire • September 04, 2025 • Amsterdam, Netherlands • Talk

Introduction

This talk, "Taming the Beast: Safely Managing Database Operations in Rails in a Team of 100s," presented by Miles McGuire at Rails World 2025, discusses the challenges and solutions Intercom has developed for managing large-scale database schema migrations within a sizable engineering team using Rails and MySQL.

Main Theme

The main theme revolves around maintaining developer productivity and system safety while enabling fast, safe schema changes across massive databases, specifically focusing on processes, tooling, and automation to facilitate zero-downtime operations, prevent mistakes, and reduce risk.

Key Points

  • Context and Scale: Intercom uses Rails and MySQL to manage hundreds of terabytes of data with tables ranging up to tens of billions of rows, posing significant migration safety and productivity challenges as compared to smaller-scale projects.
  • Productivity and Developer Experience:
    • New engineers should be able to use standard Rails patterns without running into undocumented or unique process hurdles, crucial for onboarding and now especially important as more migrations are generated by LLMs.
  • The Evolution of Safety Processes:
    • Checklists and Runbooks: Intercom began with operational checklists to document migration safety steps, emphasizing that each checklist item should be justified by real experience to avoid productivity drains.
    • In-PR Automation: Checklists were integrated into GitHub PR reviews via GitHub Actions, meeting developers within their workflow and ensuring wider adoption.
  • Limitations of Checklists:
    • Checklists alone are fallible; often ignored or misunderstood, especially by experienced engineers.
    • Checklists do not prevent long-running, dangerous migrations or handle scale-specific risks.
  • Proactive Risk Reduction:
    • Linting with RuboCop: Custom RuboCop cops warn developers about risky migration patterns—such as forgetting to bulk changes, using non-BIGINT foreign keys, or using raw SQL via execute—helping catch errors before code review.
    • Blocking Risky Operations: Some operations (like execute in migrations) are blocked outright due to their bypassing of safety checks and the inability to inspect their contents programmatically.
  • Runtime Safety Checks:
    • Monkey-Patching Migrations: Intercom monkey-patches ActiveRecord migration execution in production to analyze migrations in real time, blocking those that fail heuristics (e.g., affecting tables beyond 5 million rows or 5GB in size).
    • Automated Decision Points: The system programmatically intercepts and blocks unsafe actions (such as drop table on large tables or change_column operations that could break deployments), relying on production data for accuracy.
    • Escape Hatches: Expert users can override checks if absolutely necessary, maintaining flexibility for advanced use-cases.
  • Migration Execution Practices:
    • Only single, explicitly-specified migrations are allowed in production runs to avoid conflicts and overlapping risky operations.
  • Move to PlanetScale:
    • The transition to PlanetScale will further shift workflows, leveraging built-in schema change review processes and abstracting many manual or custom controls away from the core team.

Conclusions and Takeaways

  • Safe schema migrations at scale require layered controls: documentation, code linting, automation, and runtime checks.
  • Automating safety net steps where possible minimizes productivity loss and reduces the burden on specialized database teams.
  • As database infrastructure evolves (e.g., migration to PlanetScale), many manual interventions will become obsolete, emphasizing the value of aligning with built-in tooling where possible.

Relevant Examples

  • Migration linting and blocking implemented via RuboCop and monkey-patching.
  • In-PR checklists posted via GitHub Action.
  • Runtime analysis blocking risky migrations in production (e.g., big table alters, risky drops, or unsupported changes).

Closing

The strategies developed by Intercom have reduced operational risk and improved developer experience in a large, fast-moving Rails organization, offering practical code examples and processes for others to adapt to their own scale and infrastructure.

Taming the Beast: Safely Managing Database Operations in Rails in a Team of 100s
Miles McGuire • Amsterdam, Netherlands • Talk

Date: September 04, 2025
Published: Mon, 15 Sep 2025 00:00:00 +0000
Announced: Tue, 20 May 2025 00:00:00 +0000

Intercom leverages Rails to empower hundreds of engineers to move fast and make the changes they need in production without a lengthy centralized review process. But allowing arbitrary migrations to run across hundreds of tables and hundreds of terabytes of data in MySQL comes with inherent risks. In this session, we'll look at where we came from, what changes we've made to reduce risk and enable people to move fast while safely leveraging Rails' power, and where we're going in the future.

Rails World 2025

00:00:07.200 So, taming the beast.
00:00:10.160 This I remember when I uh saw that my
00:00:12.080 talk was accepted, I realized like, wow,
00:00:13.759 I really picked a clickbaity title for
00:00:15.519 it. Uh we're going to be talking a bit
00:00:17.680 about how we manage schema migrations uh
00:00:20.640 particularly at intercom. Uh we've got a
00:00:22.560 big engineering team, but uh who am I?
00:00:26.320 You've heard that I've uh spoken three
00:00:28.000 times at Rails World, but I'm an
00:00:29.519 engineer on team data stores at
00:00:30.960 intercom. Uh I'm kind of directly
00:00:34.559 responsible for databases, caching,
00:00:36.880 things like that, but the team also owns
00:00:38.480 other stuff uh including our Ruby on
00:00:40.480 Rails platform, which is surprising to
00:00:42.160 some people. Uh we're responsible for
00:00:45.120 like holding and raising the bar for
00:00:47.039 productivity at the company. So,
00:00:49.039 anything that helps make our engineers
00:00:51.120 more effective is kind of part of our
00:00:53.360 wheelhouse. And uh as we already said,
00:00:56.000 it's my third time being here.
00:00:58.800 Um
00:01:00.320 quick disclaimer, uh lots of what we're
00:01:02.800 going to talk about kind of is MySQL
00:01:04.400 specific. Uh Intercom's a MySQL shop,
00:01:07.520 but the concept should broaden out to
00:01:14.880 should broaden out to other uh database
00:01:17.200 technology.
00:01:19.360 Uh
00:01:21.200 how do we think about productivity? This
00:01:22.880 is kind of important because for us like
00:01:25.040 making schema migration tooling uh
00:01:28.560 really good is an important part of
00:01:30.799 making our app available and keeping
00:01:32.159 everything like safe, but it's actually
00:01:33.680 a really important productivity tool as
00:01:35.360 well. Uh Rails has great resources for
00:01:38.560 learning. The foundation's been doing
00:01:40.240 some great work there. But things that
00:01:43.119 work when you're small don't necessarily
00:01:45.360 work at scale. uh like when you're
00:01:47.280 running a big database like intercom, we
00:01:49.119 have around 300 terabytes of relational
00:01:51.439 data, multiple like 10 plus 50 plus
00:01:54.720 billion row tables. Uh you need to know
00:01:57.520 some different things and like consider
00:01:59.119 some different stuff when you're working
00:02:00.159 at that scale. But we should still try
00:02:03.040 and make sure that like engineers
00:02:04.960 working at intercom can like Google
00:02:07.280 something, read the docs, look up the
00:02:09.360 Rails guides and get the right start.
00:02:12.239 The the worst thing would be if you
00:02:15.360 start as a new hireer in intercom,
00:02:17.840 you've worked on rails before, you know,
00:02:19.680 Rails and then you try and use Rails
00:02:22.160 like idioms in intercom and it doesn't
00:02:24.080 work. Uh, so that's what we're trying to
00:02:26.319 avoid. And that's kind of only getting
00:02:27.840 more important now that like LLMs are
00:02:29.760 doing more and more of the driving
00:02:31.040 because LLMs are trained on what's out
00:02:32.959 there on the public internet. If we had
00:02:34.640 some specific processes you had to
00:02:37.280 follow, that would be um even harder.
00:02:40.080 we'd like be pollutant context window
00:02:41.920 trying to tell tell them how to run a
00:02:43.280 migration. Uh so that's kind of that's
00:02:46.160 the framing we're coming at it from.
00:02:49.519 But you know maybe you've just started a
00:02:51.519 new Rails app.
00:02:53.680 You probably don't have a ton of data.
00:02:55.920 You maybe have no real data in your
00:02:57.519 database and you probably don't need a
00:03:00.480 you probably don't need a zero downtime
00:03:02.159 deployment. So like you can get away
00:03:04.159 with a lot. And like let me say this
00:03:06.720 right now. If that's you, if this fits
00:03:08.640 you, just you don't need to worry about
00:03:11.200 this stuff. If I didn't have to worry
00:03:12.720 about this stuff, I wouldn't worry about
00:03:14.159 it either. Uh early days, it's way more
00:03:17.280 important to focus on speed of
00:03:18.480 iteration. Like just try stuff. Uh we're
00:03:21.280 going to really quickly scale up to
00:03:22.879 talking about things that apply to the
00:03:24.720 scale intercom is at. And that might be
00:03:27.200 interesting to think about, but you
00:03:29.200 know, do what do the minimum amount of
00:03:30.799 process that will work for you and that
00:03:32.319 will keep you fast. Uh because yeah
00:03:35.840 scale is a is a nice problem to have. Uh
00:03:39.680 you know issues are going to happen as
00:03:41.360 you grow.
00:03:43.840 Uh issues are going to happen as you
00:03:45.200 grow. That's just kind of inevitable.
00:03:47.200 The first time you learn that your like
00:03:48.959 core data model is too large to migrate
00:03:51.360 safely will be when you try and run a
00:03:53.440 migration and everything goes down. And
00:03:55.599 the first time you learn that your
00:03:56.879 customers are sensitive to the downtime
00:03:59.040 being caused by your database operations
00:04:00.959 will be when you have a bunch of angry
00:04:02.400 customers beating down your door because
00:04:03.840 you just ran a migration. That's life.
00:04:06.159 Like that's fine. That's that's good
00:04:07.920 problems. But the important thing is to
00:04:10.239 learn from it and make sure that you
00:04:12.400 don't keep making those same mistakes.
00:04:14.000 You don't want to have to explain to
00:04:15.200 customers like, "Oh yeah, the four this
00:04:17.199 weak is because we keep doing the same
00:04:18.479 thing and it keeps going wrong the same
00:04:19.840 way."
00:04:20.959 Uh remember like the kind of saying from
00:04:23.600 from aviation safety and other safety
00:04:25.520 critical industries like checklists get
00:04:27.040 written in blood. So uh to that point
00:04:30.240 we're going to we're going to talk about
00:04:31.440 checklists. Uh because checklists like
00:04:34.320 runbooks whatever you want to call them.
00:04:36.400 As soon as you're big enough to start
00:04:37.759 needing some kind of process this is
00:04:39.280 probably where you're going to start.
00:04:40.479 This is like the the uh natural first
00:04:42.960 step.
00:04:44.560 So uh checklists like they document the
00:04:47.600 edge cases you're aware of. They tell
00:04:49.360 you the things you need to do. Uh, and
00:04:51.919 they're a great tool for for new people
00:04:54.400 like onboarding, understanding the
00:04:56.240 operational parameters of your app, like
00:04:58.000 where the sharp edges are. Uh, but you
00:05:00.880 want to like keep them small. You don't
00:05:02.639 want to clutter them up. Uh, I have like
00:05:04.960 an absolutely abiding hatred for what I
00:05:07.280 call a choose your own adventure
00:05:08.639 checklist where every item says like,
00:05:11.440 oh, in this situation, you have to keep
00:05:13.360 in this in mind, but in that situation,
00:05:15.360 you don't have to do it. skip on to the
00:05:16.960 next thing like turn to page four. That
00:05:19.360 stuff just drives me absolutely crazy.
00:05:21.919 Uh like every checklist item is a
00:05:24.320 liability. So you don't want to be
00:05:26.720 putting things in your process unless
00:05:28.240 you've seen it hurt you. You want to
00:05:30.000 keep those like sharp tools like BH was
00:05:32.160 talking about earlier. You want to keep
00:05:33.840 them like that's an important part of
00:05:35.440 productivity. You don't want to like try
00:05:37.440 and shave off every sharp edge like go
00:05:40.880 too far with guard rails, but you also
00:05:43.039 don't want to keep repeating the
00:05:44.479 mistakes like we talked about. Uh
00:05:48.560 so
00:05:51.039 if you have your checklist, you know
00:05:52.960 where what your operational parameters
00:05:54.720 are. It's only useful if people are
00:05:57.680 actually using it. And when people are
00:06:00.560 like working on your app, when they're
00:06:02.000 making a making a migration, uh where
00:06:04.560 are they going to end up? they're going
00:06:05.520 to end up in a PR.
00:06:07.919 So, I want to meet people where they're
00:06:09.919 at. If they're opening a migration PR,
00:06:12.319 they're in GitHub. They're doing PR
00:06:14.000 review. Uh, that's where we should put
00:06:17.039 it. When we first created a checklist
00:06:18.960 for migrations in intercom years ago, it
00:06:21.120 went into our internal wiki. I don't
00:06:23.280 know if anyone ever really read it.
00:06:25.199 Ended up in Notion. People didn't read
00:06:26.960 it there either. We have Koda now. It
00:06:29.360 might be in there. I can't find it. Uh,
00:06:32.240 so we want to post it onto the uh onto
00:06:34.639 their uh PRs. And to that point,
00:06:38.960 you can scan this QR code. It's a GitHub
00:06:40.880 action. It's what we use at Intercom
00:06:43.120 that to post the checklist. I've bundled
00:06:45.919 in a stripped down version of our
00:06:47.919 checklist. It's basically everything we
00:06:49.919 have in there, but we leave I left out
00:06:52.240 the uh the really gnarly stuff around
00:06:54.400 like how our ETL pipeline works. You
00:06:57.039 don't have to run our ETL pipeline. You
00:06:58.800 don't need to worry about how it works.
00:07:00.479 but it has some like common sense stuff.
00:07:03.039 I don't necessarily love everything
00:07:04.800 about it, but it is a good start. Uh,
00:07:08.000 I'll leave that up for a second more.
00:07:10.000 Uh, but we can take a look at what's in
00:07:14.400 there. I had to shrink this down to fit
00:07:16.639 it on the page. Don't worry, we're not
00:07:18.240 going to go through everything line by
00:07:19.520 line. This is the checklist that that
00:07:21.599 will post. Uh, the kind of maybe the
00:07:24.000 more interesting things
00:07:26.319 in yellow, but that means it will be
00:07:28.080 bold. Don't merge the PR if your
00:07:29.840 application changes before you complete
00:07:31.360 all the steps in a checklist. That's an
00:07:33.520 important part of like zero downtime
00:07:35.120 because when the deployment happens, the
00:07:37.759 migration may not have run yet. So you
00:07:39.520 don't want the code that relies on it
00:07:40.960 out there in production. Fine. Uh
00:07:43.840 another fun one, if you run Postgress,
00:07:46.880 you don't notice pain. Uh if you're
00:07:49.360 making more than one change, use change
00:07:51.599 table set bulk true because my SQL does
00:07:54.479 not have transactional DDL. So anytime
00:07:57.520 you run a statement that just happens
00:08:00.080 and if your migration fails in the
00:08:01.599 middle then it will be uh in a
00:08:03.520 inconsistent state. Uh one thing you'll
00:08:06.160 notice if you if you do read through it
00:08:07.680 or if you have opened up the the uh
00:08:09.599 GitHub action that I linked before not
00:08:12.240 everything in here is that kind of
00:08:14.080 operational scar tissue. Some of it is
00:08:16.000 purely educational
00:08:18.160 like uh your PR must contain a schema RB
00:08:21.759 file updated. If your migration file has
00:08:23.840 an older time stamp, leave the one in
00:08:26.800 the schema uh file untouched. You can't
00:08:29.120 make your migration go back in time,
00:08:30.879 that's not really that's not going to
00:08:32.479 cause an outage, but it kind of handles
00:08:35.680 things that new people to Rails might be
00:08:37.519 surprised by. So great, we have a
00:08:40.320 checklist now. All of our migrations are
00:08:43.120 going to be safe. The end, right? That's
00:08:45.760 it. Cool. Thank you. I'll be off. Uh no,
00:08:49.440 like seriously, it's not it's not the
00:08:51.600 end. Uh it's pretty good. Checklist pops
00:08:54.000 up in the migration. Fine. Uh it like it
00:08:57.360 does tell you things you need to know.
00:08:58.800 Great. But it's not quite enough. Like
00:09:03.040 you might be surprised to learn that
00:09:04.240 even checklists are fallible. Uh
00:09:08.080 I've had new hires DM me on Slack and
00:09:11.519 Intercom and ask me what does something
00:09:13.279 on the checklist mean? They tried it and
00:09:15.680 it didn't work. and I go and look and
00:09:17.680 realize like, oh, you know, that's not
00:09:20.000 worked that way for 18 months and nobody
00:09:23.440 has said anything at any point. Uh,
00:09:26.480 which tells me that everyone is doing a
00:09:28.080 real good job reading the checklist. And
00:09:29.920 like I'm guilty of it, too. I run schema
00:09:31.600 migrations. I don't read the checklist.
00:09:32.959 I wrote the checklist. I don't need to
00:09:34.320 read it. Uh, so uh by the time the PR is
00:09:38.959 open, like the engineering work is
00:09:40.399 finished, people have kind of switched
00:09:42.160 mode a bit. Uh, and if we tell them
00:09:44.880 like, oh, you have to go back and do
00:09:46.080 something differently, that's kind of
00:09:47.360 annoying. It's a bit of a productivity
00:09:48.880 drain. Uh, it's a useful tool in the
00:09:51.600 toolbox,
00:09:53.120 but you know, it's not everything. It
00:09:56.000 also doesn't do anything to handle like
00:09:57.839 if you have a big table, uh, running a
00:10:01.040 migration might take 24 hours. And if
00:10:04.000 you're doing certain operations in
00:10:05.600 MySQL, that table might be locked for
00:10:07.600 writes for 24 hours. That'd be a pretty
00:10:09.360 bad bad outage to have. And the
00:10:11.279 checklist kind of we're not going to go
00:10:13.120 back. The checklist doesn't really do
00:10:14.480 much to to help you with that. Uh and it
00:10:18.560 also doesn't handle people just making
00:10:20.480 mistakes. Like mistakes happen in the
00:10:22.480 same way that outages happen. You can't
00:10:24.079 just say like, "Oh, we'll just never
00:10:25.360 make a mistake and everything will be
00:10:26.720 fine." Uh so we do need to go a bit
00:10:29.040 deeper than this. Uh one of the things
00:10:31.360 we want to do is we want to make sure to
00:10:33.200 fix issues before they happen.
00:10:35.839 uh we should be telling engineers about
00:10:38.720 risks while they're writing their
00:10:40.320 migration and that kind of like when I
00:10:42.880 thought about it you know that sounds an
00:10:44.160 awful lot like linting uh and we have
00:10:46.880 rubocop so like let's get rub on the
00:10:50.560 case uh we can take an example now this
00:10:54.160 is I dug back through the annals of time
00:10:56.880 in intercom this is migration from 2011
00:11:00.000 when the company was only like four or
00:11:01.519 five people and I think it's it's pretty
00:11:04.000 representative of like a normal
00:11:06.000 migration that you might make on a new
00:11:07.440 Rails app. We're like adding columns on
00:11:10.480 the messages table. We're adding some
00:11:12.720 comment column col columns on the
00:11:14.880 comments table. You know, it's fine.
00:11:17.200 We're doing some ads. Uh we do the down.
00:11:19.920 We do the reverse. You know, it's all
00:11:22.240 good. This is a real thing. Uh there's
00:11:24.480 nothing wrong with it particularly. I
00:11:26.240 wouldn't let you merge it now, but
00:11:27.519 there's nothing wrong with it
00:11:28.399 particularly. Uh
00:11:31.200 why wouldn't I let you merge it? Uh,
00:11:34.480 like I said before, we kind of touched
00:11:35.839 on it in the checklist. Every line there
00:11:37.680 is executing a new SQL query. On MySQL,
00:11:40.720 none of those things aren't atomic. So,
00:11:42.720 the migration isn't atomic. You know,
00:11:44.959 you can, if you have Rubicop Rails, you
00:11:46.959 can put this into your Rubicop YAML like
00:11:49.200 Rails bulk change table enabled. You're
00:11:51.839 running on MySQL, include your
00:11:53.760 migrations uh directory, and it will
00:11:56.640 warn you that like, oh, hey, you know,
00:11:58.480 you could bulk those things together.
00:12:01.120 And that's pretty good. it will tell you
00:12:02.880 like oh those messages changes could
00:12:04.480 have been one migration the comments uh
00:12:07.680 columns could have been another bulk
00:12:09.200 change table but uh it only warns you
00:12:13.680 when operations that could be bulked
00:12:16.959 could be combined aren't being combined.
00:12:19.519 It doesn't warn you that like even if
00:12:21.200 you combined all of the changes on the
00:12:22.880 two tables that that would still result
00:12:25.920 in two distinct altars and that that
00:12:28.560 could still fail in the middle. Uh so we
00:12:31.360 went a bit further. Um this is I just
00:12:34.480 copy and pasted the code into a gist.
00:12:36.399 This is the actual cop we run in
00:12:38.000 intercom for this. Uh it kind of does
00:12:40.560 the same thing as as the uh the Rails
00:12:42.800 one, but it ensures that like in
00:12:45.120 addition to combining all of the
00:12:46.639 bulkable operations, there can only be
00:12:48.800 one operation left once they've all been
00:12:50.880 combined. So uh you can get into a
00:12:53.920 situation where a migration could fail
00:12:55.760 in the middle. It's it's a bit of an
00:12:57.760 edge case. It's annoying when it
00:12:59.200 happens. Uh so that's good. That's an
00:13:02.560 example of something we can do at
00:13:03.680 Rubocop. Something that people just
00:13:04.959 don't have to worry about anymore.
00:13:07.920 What else can we do? Uh so if you were
00:13:11.920 if you were in Toronto last year, if you
00:13:13.519 saw my talk from from 2024, you might
00:13:16.000 know you might know how this story ends.
00:13:18.480 Uh big in foreign keys. Uh so we had an
00:13:22.079 interesting problem in intercom where
00:13:24.880 since I don't remember what Rails
00:13:26.480 version the ID column defaults to being
00:13:29.760 a big int. It's a good change. Uh we've
00:13:32.880 hit max int many times on different
00:13:35.519 tables. Uh so the default being big int
00:13:39.200 is good but you can still sometimes end
00:13:41.839 up with external references being the
00:13:43.839 wrong type. uh we don't tend to use the
00:13:46.959 uh references function when we're making
00:13:49.040 migrations because actually we have a
00:13:50.800 lot of uh vertically partitioned
00:13:52.800 databases where tables live in in
00:13:54.560 different schemas so you can't join
00:13:56.639 across them or uh and also in in my SQL
00:13:59.440 you can't really rely on foreign key
00:14:01.279 constraints if you've got a a large
00:14:02.959 database so we don't do it uh but then
00:14:07.120 you can end up in a situation where like
00:14:09.040 table A references another table and
00:14:11.600 that has the foreign key is a regular
00:14:13.920 int table B is a big int and then you're
00:14:16.480 going to run out of uh when when table B
00:14:18.880 exceeds a 32-bit int, then everything
00:14:21.920 breaks because now you can't insert rows
00:14:23.519 into table A anymore. Getting pretty
00:14:25.440 abstract. Uh that causes us a lot of a
00:14:28.399 lot of headaches. Like you can watch the
00:14:30.639 watch the video from last year. It was
00:14:32.399 it was a painful time. Uh and one of the
00:14:35.040 things we did afterwards is another
00:14:37.839 simple custom cop. I mean I kind of hope
00:14:40.160 you don't need this one, but if you want
00:14:41.440 to have a look and see how it works, go
00:14:43.360 ahead. Uh what this does is look for
00:14:45.839 columns that end in underscore ID and
00:14:48.720 doesn't let you use a small int uh type
00:14:51.440 for them. It's a little heavy-handed,
00:14:53.600 but it works for us. And I think a lot
00:14:56.079 of the time with uh writing good rubocop
00:14:58.240 rules, it's a bit like uh what I think
00:15:00.160 of writing good reaxes. You don't want
00:15:02.000 to try and overfit. You just want to
00:15:04.000 handle the cases that you know about.
00:15:05.839 And uh yeah, that's that works pretty
00:15:08.480 well. Uh so last one and we're going to
00:15:12.800 lead with the code this time and I'll
00:15:14.639 you'll find out why. We have another
00:15:16.720 custom cop that blocks using the execute
00:15:20.160 method in migrations. So you probably
00:15:22.880 know that execute is the way to run raw
00:15:26.240 SQL. Um
00:15:28.800 it's fine. Like I know a lot of people
00:15:30.480 do it. Uh we don't have a ton of we
00:15:32.480 didn't have a ton of examples even
00:15:33.760 before adding this cop. Now we have
00:15:36.160 none. Uh
00:15:38.800 I like using those helper functions.
00:15:41.519 Sometimes people like writing raw SQL. I
00:15:43.680 can write raw SQL as well. But uh it's
00:15:46.240 something we wanted to stop in intercom.
00:15:48.000 And we wanted to stop it for a better
00:15:49.519 reason than just my like sense of
00:15:51.279 aesthetics because uh catching things in
00:15:54.560 development is great. You know, popping
00:15:56.160 up warnings in Rubocop, fine. That's
00:15:58.959 pretty good. But your development
00:16:00.639 database is still like different to your
00:16:02.639 production database. we don't
00:16:03.839 pre-provision everyone's developer
00:16:05.360 environment with 50 terabytes of
00:16:07.040 conversation data so that they can get a
00:16:09.440 realistic experience of working with the
00:16:11.120 production app. uh lots of context about
00:16:15.040 your production environment just isn't
00:16:16.720 available in dev and having to like
00:16:20.240 hardcode in all the rules saying like oh
00:16:22.480 well this table's big and that table you
00:16:24.639 know whatever too much too too brittle
00:16:27.600 uh you'd be like trying to you know it's
00:16:30.079 like a constantly shifting goalpost
00:16:31.600 whatever building foundations on sand so
00:16:34.880 uh we went a step further even than
00:16:37.279 doing the linting and we went for for
00:16:40.320 runtime checks and this is kind
00:16:42.480 This is, I think, the the gold standard
00:16:44.320 for doing this because the only time
00:16:46.720 you're going to have full context, like
00:16:48.240 the only way, the only place you ever
00:16:50.320 know what your production environment
00:16:51.600 looks like is in production. Uh, I don't
00:16:54.320 really believe in staging environments.
00:16:55.759 I think staging environments, we get
00:16:57.519 back to the problem of a constantly
00:16:59.279 shifting goalpost. It's hard to make
00:17:01.279 them look real. And the best place to to
00:17:04.319 know what production is is production.
00:17:07.120 Uh so if we insert ourselves into the
00:17:10.400 migration process in production then you
00:17:12.959 know we have access to the production
00:17:15.199 database. We have access to all of the
00:17:17.839 everything we need to know to to make a
00:17:19.760 guess as to whether the migration that's
00:17:21.600 happening is safe. Uh so we can kind of
00:17:24.480 start simple
00:17:26.319 uh and think about like what unsafe
00:17:28.720 operations are there.
00:17:31.039 uh like mistakes can happen like we said
00:17:33.200 before like we just talking about what's
00:17:36.160 safe in dev isn't necessarily safe in
00:17:37.760 production. Uh
00:17:42.880 I I guess yeah risks are even greater
00:17:44.640 now I think I mentioned this earlier
00:17:45.679 risks are even greater now with LLMs
00:17:47.200 doing the driving. So we've had a couple
00:17:48.960 of cases where LLMs are generating
00:17:51.520 migration files and they're not running
00:17:53.760 the generator and the migration file
00:17:55.919 that comes out at the end isn't
00:17:57.120 necessarily super wellformed.
00:17:59.520 Uh so we wanted to to handle cases like
00:18:02.880 that and one of the things we did
00:18:06.720 um kind of lost Fred there a little bit
00:18:09.679 but yeah we have this uh we we uh okay
00:18:13.760 yeah now I remember so one of the things
00:18:15.679 we did before we get into the really
00:18:17.120 lowlevel stuff is uh we just overrode in
00:18:20.559 like our rake file the DB namespace and
00:18:24.400 we blocked like DB reset DB drop things
00:18:28.480 like that we don't want anyone to drop
00:18:30.240 the database in production. So there's
00:18:31.760 no need to be able to run that rake
00:18:33.200 task. We don't want anyone to reset the
00:18:34.880 database in production. And another
00:18:36.880 thing we did, this kind of harks back to
00:18:38.720 the checklist from earlier, uh we don't
00:18:41.840 want you to run DB migrate, uh just the
00:18:45.679 bare version of DB migration because
00:18:48.240 that will run every pending migration.
00:18:51.120 And that's fine when you're like one
00:18:52.799 dev, two devs, like a small team, but we
00:18:56.640 sometimes have 10 migration PRs landing
00:18:59.440 within 15 minutes of each other from
00:19:01.440 different teams. We don't want them all
00:19:03.280 running like some of them might be for
00:19:05.120 those massive tables. Another one might
00:19:06.960 just be a create table for something
00:19:08.559 that's being spiked out. So, we want
00:19:10.480 everyone to be deliberate about only
00:19:12.160 running their own changes. So, we force
00:19:14.080 you to use the DB migrate up and specify
00:19:17.039 a version. Uh, so I hope you're never
00:19:20.480 going to need to worry about these
00:19:21.840 things yourself, but uh, if you do, like
00:19:25.039 you can just grab that code snippet.
00:19:26.640 It's right here. Drop it into your rake
00:19:28.160 file. It should work. Uh, but we want to
00:19:32.080 go further. Uh, we want to start
00:19:34.960 building up some idea of like what real
00:19:37.039 migrations are are safe and and we can
00:19:40.000 do that. So, uh, our checklist used to
00:19:43.440 tell teams, uh, that like, you know, if
00:19:47.360 they're migrating a large table, come
00:19:50.240 and talk to my team.
00:19:52.640 But that just introduces like one of
00:19:54.880 those decision points like I was talking
00:19:56.799 about before, one of these cases where
00:19:58.480 you're choosing your own adventure
00:19:59.679 because now you have to know like is a
00:20:01.440 table large and like at what at what
00:20:03.760 point is a table large?
00:20:06.160 um you know that's uh that's not very
00:20:08.799 clear. And we found that like what
00:20:10.400 tended to happen was people became
00:20:13.039 either they knew they were creating a
00:20:14.559 table and that's fine or they would just
00:20:17.120 be uncertain and they would want to come
00:20:19.280 through us for every migration and like
00:20:21.520 sense check it with us. Was it okay to
00:20:23.200 run it and like that's such a
00:20:24.720 productivity drain because there's only
00:20:26.160 there's only so many people on the team.
00:20:27.840 We're only able to respond so quickly.
00:20:30.799 uh when you want to run one of those big
00:20:33.679 migrations, there are tools like Ghost.
00:20:35.919 If you've seen Ghost, it's GitHub online
00:20:38.400 schema transformer, I think it stands
00:20:40.640 for, or there's like PT online schema
00:20:43.200 change or large hadron migrator. I think
00:20:46.159 that's the Facebook one. There's a lot
00:20:47.440 of different tools. Uh but you have to
00:20:49.840 learn how they work and and you have to
00:20:52.080 know all of these command line switches.
00:20:53.919 And again, we're deviating really far
00:20:56.640 from the the standard Rails flow at that
00:20:59.760 point. Like we said, it's hard to to use
00:21:02.720 a tool like that. We're kind of breaking
00:21:05.200 the uh the just be normal principle I
00:21:08.640 talked about earlier. You know, we want
00:21:10.240 to just run migrations of rake. I guess
00:21:12.640 you could package up running one of
00:21:14.080 those tools of rake. I wouldn't
00:21:15.600 personally feel super safe doing that,
00:21:18.320 but that's my risk tolerance in our
00:21:21.039 production environment. like your
00:21:22.559 mileage may vary. Uh so one of the
00:21:25.679 things we do to make sure that people
00:21:27.200 don't have to ask those questions is
00:21:29.200 actually just analyze the migrations at
00:21:31.440 runtime. Uh so we monkey patched active
00:21:35.440 record migration exec migration which if
00:21:38.559 you know how internals work like this is
00:21:41.120 effectively the thing that calls the up
00:21:43.280 method in your migration or calls the
00:21:45.200 change method. Uh we used the active
00:21:48.960 record command recorder which uh is a
00:21:51.600 nice bit of tech. If you've ever
00:21:53.120 wondered how Rails can take your change
00:21:55.919 method and work out how to reverse it
00:21:57.919 without you specifying the reverse, it's
00:22:00.080 thanks to command recorder where it like
00:22:02.000 executes the migration in a kind of dry
00:22:03.919 run environment, works out the commands,
00:22:06.080 and then it knows how to map every
00:22:07.919 command to the reversed version of it.
00:22:10.799 Uh so running your migration in a dry
00:22:13.120 run environment getting the commands out
00:22:14.799 that that's you might see where we're
00:22:16.640 going here. Uh and then we implemented
00:22:19.039 our heruristics that we were using like
00:22:21.360 you know if you came to me or any of my
00:22:22.960 teammates they're all sitting at the
00:22:24.159 front here. If you came to one of us and
00:22:25.840 asked like do you need to run uh this
00:22:28.240 migration with ghost did you need us to
00:22:30.480 run it for you? uh we just like wrote
00:22:32.960 coded up those rules and if you're going
00:22:36.240 to hit something there if you know if
00:22:38.400 you need to talk to us we just raise and
00:22:40.080 block the migration we'll tell you like
00:22:41.919 you okay no this this one you need to
00:22:43.679 come and talk to us we'll post it in
00:22:45.760 your console output we'll post it into
00:22:47.760 our Slack channel to say that someone
00:22:49.360 just ran the migration and they're going
00:22:50.720 to need some help uh and yeah that that
00:22:53.760 works pretty well so we again I had to
00:22:56.559 squeeze it onto the the page a bit but
00:22:59.440 uh commands here is the internal of a of
00:23:02.559 a command recorder. So, it's like a list
00:23:04.720 of operations. We block execute. And now
00:23:08.720 we get back into why we blocked execute
00:23:11.039 because uh the command recorder doesn't
00:23:13.840 actually know what you're doing in an
00:23:15.600 execute. It just knows that one
00:23:17.039 happened. So, we don't want to deal with
00:23:18.240 that. So, we block you using execute in
00:23:21.039 case you bypass the rubricop rule from
00:23:22.880 earlier. And then we check like is the
00:23:25.919 row count above a limit and is the store
00:23:28.320 size above a limit. And the reason we do
00:23:30.240 that is we found like if tables are are
00:23:33.039 sufficiently long like if they have
00:23:34.720 sufficiently many rows that can be a
00:23:36.240 problem. But a small table with a lot of
00:23:38.400 like very big rows like a very wide
00:23:40.720 table that can also be a problem. So for
00:23:43.440 us these limits are 5 million rows or 5
00:23:46.799 GB of data which is
00:23:49.919 to me low like maybe my brain is broken
00:23:52.960 at this point. Five million rows sounds
00:23:54.559 like an absolutely microscopic table to
00:23:56.320 me but I deal a lot with like 50
00:23:58.240 billion. So yeah, but uh it that means
00:24:01.600 that like if you're dealing with a new
00:24:04.480 table, like you're building out a new
00:24:05.760 feature, you're spiking it, you don't
00:24:07.200 have to worry about this. We'll try and
00:24:08.640 just get out of your way, but by the
00:24:10.240 time you have 5 million rows, it
00:24:12.000 probably has enough traction that that
00:24:14.240 doing something to it locking it for
00:24:16.000 some reason might be an issue. Uh and
00:24:18.480 then we print out some nice formatted
00:24:21.279 error message saying like, "Hey, it's
00:24:22.960 too big. You need to come talk to us."
00:24:25.520 uh and the way we get the rows uh if you
00:24:30.080 we can't run select count star because
00:24:32.159 that would also be very slow. So instead
00:24:34.240 we do explain select count star. Innov
00:24:37.360 will give you a guess at how big the
00:24:38.640 table is and that guess is good enough
00:24:40.159 for our purposes or we check the
00:24:42.159 information schema to figure out how big
00:24:44.000 is the table. We actually factor in the
00:24:46.240 index size when doing that. I've never
00:24:48.080 been completely certain if we need to
00:24:50.000 but we just do it anyway.
00:24:52.400 Uh we'll come back. you'll get a link to
00:24:54.640 the code. So, I'm skipping over a bit
00:24:56.240 fast for time. Uh, another thing is like
00:24:59.760 in older versions of MySQL and until
00:25:01.919 relatively recently, we were on 5.7,
00:25:04.320 which was still affected by this. Uh,
00:25:06.880 dropping a big table wasn't safe. Uh, it
00:25:10.480 wasn't safe because if the table was in
00:25:12.159 the buffer pool cache, dropping it would
00:25:14.159 synchronously ex uh,
00:25:17.440 God, I'm completely blanking on the
00:25:19.120 word. It would synchronously like flush
00:25:21.039 it out of the cache. and while it was
00:25:22.720 flushing the whole database could be
00:25:24.400 locked. Uh so we never wanted that to
00:25:27.360 happen. So if we ever saw that the
00:25:29.840 command being executed in a migration
00:25:31.760 was a drop table instead we would just
00:25:34.000 rename the table to something else with
00:25:36.320 a predictable format and then we had
00:25:39.360 another like periodic task that comes
00:25:41.360 along and cleans up those tables. We
00:25:44.080 again we don't worry if the table is uh
00:25:47.200 smaller than 2,000 rows. We don't really
00:25:49.279 mind what people do with that with those
00:25:50.960 little tables. So, this is just another
00:25:53.120 thing that now people can drop a table.
00:25:54.880 They don't need to know what's happening
00:25:56.080 under the hood. We just handle the rest
00:25:57.520 of it. Uh we also block change column.
00:26:02.240 Uh that might seem a little strange. If
00:26:05.279 you change a column from like int to big
00:26:08.159 int, fine. If you change a column from
00:26:10.159 like string to text, fine. Uh those are
00:26:13.120 all things that are normal and you might
00:26:15.120 want to do. But if you change the column
00:26:17.360 from int to string or from I don't know
00:26:20.559 some other type two types that aren't
00:26:22.240 compatible with each other during our
00:26:24.480 deployment process you would have a
00:26:26.000 bunch of processes that still know about
00:26:28.159 the old schema they have a cache you
00:26:29.600 have a bunch of processes know about the
00:26:31.039 new schema lots of breakages could
00:26:33.200 happen. So we just ban doing change
00:26:35.279 column. We go for like an expand
00:26:37.120 contract model where you add a new
00:26:38.880 column populated from the old column
00:26:41.039 then drop the first column again rather
00:26:43.679 than uh changing it in place. Uh it's a
00:26:46.559 little tiresome, but there isn't really
00:26:48.159 a better solution for making that safe.
00:26:51.039 Uh
00:26:53.840 we also just to have a it's not included
00:26:56.640 here. We include an escape hatch for
00:26:59.120 like expert operators, my team or or
00:27:01.840 some engineers to just like bypass these
00:27:04.480 checks because again you don't want to
00:27:06.720 like take away the sharp tools. You want
00:27:08.799 to make sure that like people aren't
00:27:10.559 just juggling knives or whatever, but it
00:27:12.960 is useful to be able to do them. So, uh,
00:27:16.880 we do let you bypass this, but we run
00:27:18.640 all of these checks and more in
00:27:20.240 production on every change and
00:27:24.000 you can too if you want to take a look
00:27:26.080 at the code. This is probably the
00:27:27.360 biggest single chunk of code. So,
00:27:29.039 there's a lot to go through. Um, I would
00:27:32.400 take it as a jumping off point. Uh, like
00:27:35.039 I said at the beginning, you don't want
00:27:36.320 to just blindly take every idea. We're
00:27:39.120 not I'm not here to pedal the the Google
00:27:41.360 SR book and tell you how to run your uh
00:27:44.080 run your startup, but like there are
00:27:46.000 ideas in there that I think are
00:27:47.440 generally useful. Uh and this is like
00:27:50.240 literally the code straight out of our
00:27:51.679 monolith. So feel free to take a look.
00:27:54.640 Uh that's kind of bringing me to the end
00:27:57.440 like where do we go in the future?
00:27:59.760 Intercom is currently in the process of
00:28:01.520 moving everything to planet scale away
00:28:03.200 from uh Aurora RDS in in uh AWS. uh that
00:28:08.399 does change a lot of the dynamics here
00:28:10.159 and it was part of why I wanted to talk
00:28:11.600 about this because a lot of these things
00:28:13.200 are probably going to go away from our
00:28:14.480 monolith and I think they are kind of
00:28:15.919 useful. uh we've already seen like a big
00:28:19.200 decrease in issues that uh engineers
00:28:22.159 were having both with like shipping
00:28:23.840 things and in production since we
00:28:25.520 implemented this for ourselves and now
00:28:27.360 moving on to planet scale it's kind of
00:28:29.279 changing things again there's a
00:28:30.720 different workflow we've actually
00:28:32.240 packaged up behind DB migrate again we
00:28:35.840 still are trying to stick to that just
00:28:37.360 be normal principle but that maybe uh
00:28:40.320 maybe will change planet scale has a
00:28:42.159 nice flow that's kind of like a pull
00:28:44.240 request for changing the schema and
00:28:46.159 maybe it makes sense to expose that to
00:28:47.919 everyone. Uh the dust is kind of still
00:28:50.640 settling there. So we'll see. One thing
00:28:53.200 we have got as a result of moving to
00:28:54.880 planet scale is now we never have to
00:28:57.120 tell engineers like hey you have to wait
00:28:59.120 for data stores on call to run your
00:29:01.360 ghost migration for you because that can
00:29:03.440 just be handled under the hood by by the
00:29:05.200 planet scale workflow. So that's nice.
00:29:08.159 Uh but yeah, thanks very much. This is a
00:29:12.399 link for like everything that's been in
00:29:14.159 the presentation so you can pull it
00:29:15.679 down. Um, feel free to have a look. I'
00:29:18.480 love to talk about migrations and
00:29:19.840 running big databases with anyone. And
00:29:22.159 otherwise, I'll leave it there. So,
00:29:24.799 thanks.
Explore all talks recorded at Rails World 2025
+19