Taming the Beast: Safely Managing Database Operations in Rails in a Team of 100s


Summarized using AI

Miles McGuire • September 04, 2025 • Amsterdam, Netherlands • Talk

Introduction

This talk, "Taming the Beast: Safely Managing Database Operations in Rails in a Team of 100s," presented by Miles McGuire at Rails World 2025, discusses the challenges and solutions Intercom has developed for managing large-scale database schema migrations within a sizable engineering team using Rails and MySQL.

Main Theme

The main theme revolves around maintaining developer productivity and system safety while enabling fast, safe schema changes across massive databases, specifically focusing on processes, tooling, and automation to facilitate zero-downtime operations, prevent mistakes, and reduce risk.

Key Points

  • Context and Scale: Intercom uses Rails and MySQL to manage hundreds of terabytes of data with tables ranging up to tens of billions of rows, posing significant migration safety and productivity challenges as compared to smaller-scale projects.
  • Productivity and Developer Experience:
    • New engineers should be able to use standard Rails patterns without running into undocumented or unique process hurdles; this is crucial for onboarding and is now especially important as more migrations are generated by LLMs.
  • The Evolution of Safety Processes:
    • Checklists and Runbooks: Intercom began with operational checklists to document migration safety steps, emphasizing that each checklist item should be justified by real experience to avoid productivity drains.
    • In-PR Automation: Checklists were integrated into GitHub PR reviews via GitHub Actions, meeting developers within their workflow and ensuring wider adoption.
  • Limitations of Checklists:
    • Checklists alone are fallible; often ignored or misunderstood, especially by experienced engineers.
    • Checklists do not prevent long-running, dangerous migrations or handle scale-specific risks.
  • Proactive Risk Reduction:
    • Linting with RuboCop: Custom RuboCop cops warn developers about risky migration patterns, such as failing to combine multiple changes with change_table(..., bulk: true), using non-BIGINT foreign keys, or using raw SQL via execute, helping catch errors before code review.
    • Blocking Risky Operations: Some operations (like execute in migrations) are blocked outright because they bypass safety checks and their contents cannot be inspected programmatically.
  • Runtime Safety Checks:
    • Monkey-Patching Migrations: Intercom monkey-patches ActiveRecord migration execution in production to analyze migrations in real time, blocking those that fail heuristics (e.g., affecting tables beyond 5 million rows or 5GB in size).
    • Automated Decision Points: The system programmatically intercepts and blocks unsafe actions (such as drop table on large tables or change_column operations that could break deployments), relying on production data for accuracy.
    • Escape Hatches: Expert users can override checks if absolutely necessary, maintaining flexibility for advanced use-cases.
  • Migration Execution Practices:
    • Only single, explicitly-specified migrations are allowed in production runs to avoid conflicts and overlapping risky operations.
  • Move to PlanetScale:
    • The transition to PlanetScale will further shift workflows, leveraging built-in schema change review processes and abstracting many manual or custom controls away from the core team.

Conclusions and Takeaways

  • Safe schema migrations at scale require layered controls: documentation, code linting, automation, and runtime checks.
  • Automating safety net steps where possible minimizes productivity loss and reduces the burden on specialized database teams.
  • As database infrastructure evolves (e.g., migration to PlanetScale), many manual interventions will become obsolete, emphasizing the value of aligning with built-in tooling where possible.

Relevant Examples

  • Migration linting and blocking implemented via RuboCop and monkey-patching.
  • In-PR checklists posted via GitHub Action.
  • Runtime analysis blocking risky migrations in production (e.g., big table alters, risky drops, or unsupported changes).

Closing

The strategies developed by Intercom have reduced operational risk and improved developer experience in a large, fast-moving Rails organization, offering practical code examples and processes for others to adapt to their own scale and infrastructure.


Date: September 04, 2025
Published: Mon, 15 Sep 2025 00:00:00 +0000
Announced: Tue, 20 May 2025 00:00:00 +0000

Intercom leverages Rails to empower hundreds of engineers to move fast and make the changes they need in production without a lengthy centralized review process. But allowing arbitrary migrations to run across hundreds of tables and hundreds of terabytes of data in MySQL comes with inherent risks. In this session, we'll look at where we came from, what changes we've made to reduce risk and enable people to move fast while safely leveraging Rails' power, and where we're going in the future.

Rails World 2025

00:00:07 So, "Taming the Beast." When I saw my talk was accepted I realized I picked a clickbaity title. We're going to talk about how we manage schema migrations, particularly at Intercom.
00:00:17 As some of you may have heard, I've spoken at Rails World before. I'm an engineer on the Data Stores team at Intercom, and I'm directly responsible for databases, caching, and related systems — our team also owns parts of the Ruby on Rails platform.
00:00:34 My team is responsible for raising the bar for productivity across the company, so anything that helps make our engineers more effective is part of our remit. As I mentioned, this is my third time being here.
00:00:56 Quick disclaimer: much of what we'll discuss is MySQL specific — Intercom is a MySQL shop — but the concepts should generalize to other database technologies.
00:01:21 How do we think about productivity? Making schema migration tooling robust is important for availability and safety, but it's also a key productivity enabler. Rails has great resources for learning, yet practices that work when you're small don't always scale. At Intercom we run a large relational database — around 300 terabytes of data with multiple tables having tens of billions of rows — so you need to consider different things at that scale. Still, we should make it possible for engineers to look up documentation and get the right start; the worst outcome is a new hire who knows Rails expecting idioms to work and finding they don't.
00:02:12 If you’re starting a small Rails app with little or no production data, you probably don’t need zero-downtime deployments and shouldn’t worry about many of these issues yet. Early on, focus on iteration speed and do the minimum amount of process that keeps you fast. Scale is a nice problem to have, but as you grow, issues will inevitably arise.
00:02:35 You’ll learn painful lessons as you grow: the first time you discover your core data model is too large to migrate safely will usually be when a migration causes an outage, and the first time you learn customers are sensitive to downtime is when you have angry customers after a migration. The important part is to learn from those incidents and stop repeating the same mistakes.
00:03:11 A common maxim from aviation and other safety-critical industries is that checklists are written in blood. Checklists and runbooks are often the natural first step once you need a process; they document the edge cases you know about and the steps people must follow.
00:03:34 Keep checklists small and actionable. Avoid complex conditional checklists that force a "choose your own adventure" flow — every checklist item is a liability. Only add items that have actually caused problems, and don’t over-engineer guardrails to the point they slow engineers down.
00:04:07 A checklist is only useful if people actually use it. When engineers work on a migration they end up in a PR doing code review, so meet them there: put the checklist in the PR. Our internal wiki or Notion pages were rarely read, so we use a GitHub Action to post a stripped-down checklist directly on migration PRs.
00:04:49 The GitHub Action posts a simplified version of our checklist. I removed the really gnarly operational details — for example, you don’t need to know how our ETL pipeline works — but it contains common-sense guidance that’s a good start.
00:05:03 The checklist calls out important items: don’t merge a PR if the application code depends on changes that haven’t been completed in the checklist, because deployments can happen before migrations run. If you’re on Postgres you may not notice some pain, but in MySQL you must take care: when making multiple changes, use change_table(..., bulk: true) because MySQL lacks transactional DDL; a migration that fails halfway can leave the schema in an inconsistent state.
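The bulk guidance above can be sketched in plain Ruby. This is an illustration of the SQL shapes involved, not ActiveRecord's actual SQL generation: without bulk, every change is its own ALTER TABLE, and because MySQL has no transactional DDL, a failure on the second statement leaves the first already applied.

```ruby
# Illustrative sketch: the difference between one bulked ALTER TABLE and a
# series of separate statements. With change_table(..., bulk: true), Rails
# combines the clauses into a single statement, so the migration either
# applies fully or not at all on MySQL.
def alter_statements(table, columns, bulk:)
  clauses = columns.map { |name, type| "ADD COLUMN #{name} #{type}" }
  if bulk
    ["ALTER TABLE #{table} #{clauses.join(', ')}"]
  else
    clauses.map { |clause| "ALTER TABLE #{table} #{clause}" }
  end
end
```

With two columns and `bulk: false` this yields two independent ALTERs; with `bulk: true`, a single combined statement.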
00:05:35 Not everything in the checklist is scar tissue; some items are educational. For example, ensure your PR includes an updated schema.rb. If your migration file has an older timestamp, leave the schema file untouched — you can’t make a migration go back in time. These rules help developers new to Rails avoid surprises.
00:06:04 Even with a checklist, things can go wrong. New hires sometimes ask about checklist items that no longer match reality, which shows checklists must be maintained. Also, by the time a PR is open, engineering work is often considered done, and telling someone to go back and change things is a productivity drain. Importantly, a checklist doesn't address long-running migrations that lock tables for hours, nor does it prevent human error — so we need to go deeper.
00:06:49 We want to fix issues earlier by warning engineers while they're writing migrations — in other words, linting migrations. We already run RuboCop, so adding migration-specific cops makes a lot of sense.
00:07:24 For example, I dug up a 2011 migration that added several columns. That’s normal in a small app, but at scale each line executes a new SQL statement and MySQL DDL is not transactional. RuboCop can warn you to bulk change_table operations, but the standard cop only warns when operations that could be bulked together aren’t combined; it doesn’t prevent multiple distinct ALTERs that could still fail mid-run.
00:08:00 So we built a custom cop that does the same work as the Rails one but goes further: after combining all bulkable operations there can be only one remaining operation. That prevents migrations that might fail in the middle. It’s an example of moving checks earlier so engineers don’t have to worry about these edge cases later.
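The stricter rule can be expressed as a small heuristic. This is a sketch of the idea, not Intercom's actual cop, and the list of bulkable operations is an assumption mirroring the Rails migration DSL: after all bulkable operations collapse into one change_table(..., bulk: true), at most one DDL statement may remain.

```ruby
# Heuristic sketch (not the real cop): given the operation names a migration
# performs, decide whether it would emit more than one DDL statement after
# bulking. The BULKABLE_OPS list is an illustrative assumption.
BULKABLE_OPS = %i[add_column remove_column change_column rename_column
                  add_index remove_index].freeze

def single_statement?(operations)
  bulkable, rest = operations.partition { |op| BULKABLE_OPS.include?(op) }
  # All bulkable operations collapse into one change_table(..., bulk: true).
  (bulkable.empty? ? 0 : 1) + rest.size <= 1
end
```

So a migration adding three columns passes (they bulk into one ALTER), while one that creates a table and also alters another fails the check.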
00:08:59 Another problem we faced was type mismatches for foreign keys during the transition to BigInt IDs. We avoid using references in migrations because we have vertically partitioned databases and MySQL foreign key constraints aren’t reliable at our scale. We wrote a custom cop that flags columns ending in _id if they use a small integer type. It’s heavy-handed but effective; good RuboCop rules handle the cases you know about without overfitting.
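The foreign-key check described above boils down to a simple predicate. This is a hypothetical rendering of the cop's logic, with illustrative type names rather than the actual implementation:

```ruby
# Hypothetical check: flag columns whose name ends in _id but whose type is
# a narrower integer than bigint. The NARROW_INT_TYPES list is illustrative.
NARROW_INT_TYPES = %i[integer smallint tinyint].freeze

def risky_foreign_key?(column_name, type)
  column_name.to_s.end_with?("_id") && NARROW_INT_TYPES.include?(type)
end
```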
00:09:41 We also added a cop that blocks using execute in migrations. execute runs raw SQL and bypasses the higher-level tools that inspect and analyze migrations. Before this cop we had a few examples of execute; now we have none. We prefer helper functions and higher-level primitives.
00:10:15 Catching issues in development is helpful, but development databases differ from production. We don’t pre-provision developer environments with terabytes of production data, so many production characteristics aren’t visible in dev. Linting is a useful early safety net, but it can’t see production context, which is why we moved to runtime checks.
00:11:14 Runtime checks are the gold standard because production is the only place with full context. I don’t put a lot of faith in staging environments — they’re hard to keep realistic. If we insert checks into the production migration process, we can inspect the production database and make a much better judgment about whether a migration is safe.
00:12:24 Concretely, we monkey-patched ActiveRecord::Migration.exec_migration to analyze migrations at runtime. We use ActiveRecord's command recorder to perform a dry run of the migration and capture the list of commands it would execute. That gives us a machine-readable representation of the migration operations so we can apply heuristics an experienced operator would use.
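The dry-run idea can be sketched with a minimal stand-in for the real ActiveRecord::Migration::CommandRecorder: an object that captures every migration DSL call instead of executing it, leaving a machine-readable list of operations to run heuristics against.

```ruby
# Sketch of a command recorder (the real one is
# ActiveRecord::Migration::CommandRecorder): captures each DSL call as
# [name, args] instead of executing it against the database.
class RecorderSketch
  attr_reader :commands

  def initialize
    @commands = []
  end

  def method_missing(name, *args)
    @commands << [name, args]
  end

  def respond_to_missing?(*)
    true
  end
end

# Replaying a migration's `change` body against the recorder:
recorder = RecorderSketch.new
recorder.add_column :users, :locale, :string
recorder.drop_table :legacy_events
recorder.commands
# => [[:add_column, [:users, :locale, :string]], [:drop_table, [:legacy_events]]]
```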
00:13:44 If the heuristics indicate operator attention is needed, we block the migration and surface a clear message in the console and in Slack so our on-call data stores team can assist. We block execute at runtime as well because the command recorder can’t inspect raw SQL; to analyze migrations we must be able to see each operation.
00:15:06 We check table row counts and storage size. SELECT COUNT(*) on large tables is too slow, so we use EXPLAIN SELECT COUNT(*) to get an estimated row count or consult information_schema, including index sizes. Our thresholds are conservative: we trigger operator review at about 5 million rows or 5 GB of storage. Those values feel small if you work with 50 billion row tables, but they capture tables that have enough traction that locking them could have user impact.
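The gate itself is a simple threshold check using the numbers from the talk (roughly 5 million rows or 5 GB). In production the inputs come from EXPLAIN SELECT COUNT(*) estimates or information_schema, never an exact COUNT(*); this sketch just shows the decision:

```ruby
# Heuristic gate with the thresholds mentioned in the talk (~5M rows, ~5 GB).
# estimated_rows and table_bytes would come from EXPLAIN estimates or
# information_schema in production.
ROW_THRESHOLD  = 5_000_000
BYTE_THRESHOLD = 5 * 1024**3

def needs_operator_review?(estimated_rows, table_bytes)
  estimated_rows > ROW_THRESHOLD || table_bytes > BYTE_THRESHOLD
end
```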
00:17:04 Older MySQL versions (we were on 5.7) also had problems dropping large tables because removing them from the buffer pool could synchronously flush pages and block the database. To avoid that, when a migration attempts DROP TABLE we instead rename the table to a predictable name and have a periodic cleanup task remove those tables safely. We also ban change_column operations that change between incompatible types; instead we use an expand-contract pattern: add a new column, populate it, then drop the old column. It’s more work, but it’s safer.
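The rename-instead-of-drop workaround can be sketched as follows. The `_dropped_` prefix and date suffix are assumptions for illustration, not Intercom's actual naming scheme; the point is a predictable name that a periodic cleanup task can find and remove safely later.

```ruby
require "date"

# Sketch: instead of DROP TABLE (which could stall the buffer pool on
# MySQL 5.7), rename the table to a predictable name for later cleanup.
# The naming convention here is an illustrative assumption.
def soft_drop_sql(table, on: Date.today)
  "RENAME TABLE #{table} TO _dropped_#{table}_#{on.strftime('%Y%m%d')}"
end
```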
00:18:40 We include an escape hatch so expert operators — data stores on-call or senior engineers — can bypass checks when necessary. We don’t want to remove sharp tools, only prevent people from juggling knives. We run these checks on every change in production. The implementation is substantial; treat our code as a jumping-off point rather than a drop-in solution.
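An escape hatch like the one described might look like an explicit flag an expert sets before running a migration, so the bypass is deliberate and auditable rather than the default path. The variable name here is purely a hypothetical example:

```ruby
# Hypothetical escape hatch (the variable name is an assumption): safety
# checks run unless an expert explicitly opts out for this run.
def safety_checks_enabled?(env = ENV)
  env["DANGEROUSLY_SKIP_MIGRATION_CHECKS"] != "1"
end
```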
00:19:56 Where are we going next? Intercom is migrating away from Aurora/RDS toward PlanetScale, and that changes a lot of dynamics. PlanetScale has a different workflow for schema changes that may make some of these monolith-specific measures unnecessary or alter how we expose schema workflows to engineers.
00:21:10 Moving to PlanetScale has already reduced the number of issues engineers experienced when shipping and in production. It also means we no longer need to tell engineers to wait for Data Stores on-call to run gh-ost migrations for them; the platform can handle some of that under the hood. The dust is still settling, so we'll see how the workflow evolves.