Simplifying at Scale: 7 Years of Rails Architecture at Persona


Summarized using AI

Alex Coomans • July 08, 2025 • Philadelphia, PA • Talk

Speaker: Alex Coomans (RailsConf 2025)

Overview

This talk details Persona’s journey scaling a Ruby on Rails architecture over seven years, evolving from early-stage simplicity to meeting the demands of a global identity platform. Alex Coomans discusses architectural turning points, trade-offs, and the surprising role of simplicity as infrastructure scales.

Persona’s Platform and Challenges

  • Persona builds an all-in-one identity management platform, serving customers in diverse sectors with varying compliance needs and workflows.
  • Flexibility and adaptability are architectural imperatives, deeply impacting all technology choices.

Key Architectural Evolution Steps

  • Early Days on Google App Engine:

    • Started with Rails 5.2 on Google App Engine for fast-moving, minimal ops overhead.
    • App Engine abstracted away most operational tasks but limited flexibility and control as the business grew.
  • Frontend and Asset Pipeline Evolution:

    • Initial apps used jQuery, later adopted React and TypeScript via Webpacker.
    • Complexity arose as front-end tools advanced; Persona migrated to Vite for faster, modern workflows.
    • Takeaway: Rails provides strong defaults, but scaling often requires integrating tools external to the Rails ecosystem.
  • Migration to Kubernetes (GKE):

    • Shifted to Kubernetes for more control over service deployment, scaling, and infrastructure management.
    • Kubernetes introduced greater flexibility in scaling (horizontal/vertical pod autoscaling), networking configuration, and observability—at the cost of increased responsibility and operational complexity.
  • Database Scaling and Sharding:

    • Outgrew a single primary MySQL cluster; began sharding with regional and jurisdictional compliance in mind.
    • Adopted both vertical and horizontal sharding, aligning with evolving Rails support and infrastructure demands.
    • Utilized MongoDB for lookup tables to efficiently route API requests to proper shards, solving for high-availability and global access.
  • Lessons in Data Growth:

    • Large tables, especially from Rails components like Active Storage, posed significant migration and backfill challenges at scale.
    • Migrations became risky and time-consuming; schema and access pattern design had to closely align with real-world data usage.
    • Began migrating to Shrine for file attachments to simplify the schema and operations.

Return to Simplicity: The "Stacks" Model

  • Consolidated multiple clusters and databases into a self-contained deployment unit called a "stack." Each stack is a runtime hosting multiple tenants with strong boundaries.
  • Retained some shared core services (such as the lookup table) while decentralizing others.
  • Emphasized strong routing accuracy and operational isolation, without per-customer environment overhead.

Key Takeaways

  • Complexity at scale is inevitable; the crucial decision is where to place and how to manage that complexity.
  • Rails defaults facilitate rapid development, but effective scaling often requires careful deviation and architectural investment.
  • Simplicity, intentional design, and a willingness to adopt tools outside Rails are essential for sustainable, large-scale growth.

The talk concludes with an invitation to discuss further with the Persona team for those interested in scaling Rails applications or joining the company.

Simplifying at Scale: 7 Years of Rails Architecture at Persona
Alex Coomans • Philadelphia, PA • Talk

Date: July 08, 2025
Published: July 23, 2025

What does it really take to scale a Ruby on Rails architecture to power a global identity platform trusted by businesses around the world? From our early days on Google App Engine to a globally distributed, multi-cluster Kubernetes deployment, this talk takes you through the evolution of Persona's architecture.

We'll dive into the architectural turning points that shaped our journey: the trade-offs we faced, the mistakes we made, and the surprising ways complexity crept in. Most importantly, we'll share why simplicity might be the key to scaling successfully without losing your mind. Whether you're just starting out or wrangling scale yourself, you'll leave with battle-tested insights on system design, technical decision-making, and building for the long haul that'll help you build best-in-class Rails applications.

RailsConf 2025

00:00:16.880 Alex Coomans, a software engineer on the
00:00:18.800 infrastructure team at Persona. For the
00:00:21.119 past several years, I've been part of
00:00:22.320 the team responsible for scaling and
00:00:24.000 evolving the architecture behind our
00:00:25.760 identity platform. These days, I help
00:00:28.400 design and maintain the globally
00:00:30.240 distributed multicluster Kubernetes
00:00:32.480 setup we operate today. Although I
00:00:35.040 wasn't around for the very first commits
00:00:36.880 by our founders Rick and Charles, I know
00:00:38.800 the challenges they faced like many of
00:00:40.800 y'all in taking this from an initial
00:00:43.120 idea to a fully functional Rails
00:00:45.120 application and product. But before we
00:00:47.360 dive in, let me first give you a bit of
00:00:49.120 context on who we are at Persona and the
00:00:51.600 kind of problems our platform is built
00:00:53.199 to solve. Persona is an all-in-one
00:00:56.800 identity platform. Think of onboarding,
00:00:58.800 compliance, fraud prevention. We power
00:01:01.359 the behind-the-scenes workflows that
00:01:02.879 ensure individuals and businesses are
00:01:05.280 who they say they are. Our platform's
00:01:07.439 flexibility is what makes us stand out
00:01:08.960 and applicable to so many different use
00:01:10.560 cases.
00:01:12.240 We work with customers across a wide
00:01:14.080 range of industries, fintech,
00:01:15.920 healthcare, marketplaces, crypto,
00:01:18.320 travel, government, and more. And each
00:01:20.479 one has their own unique set of
00:01:22.080 compliance requirements, user flows,
00:01:24.080 risk tolerances, and regional
00:01:26.080 regulations. That means almost every
00:01:28.240 customer uses our platform in a slightly
00:01:30.159 different way. And as such, we've had to
00:01:32.799 architect for adaptability, a choice
00:01:34.960 that has influenced almost every part of
00:01:36.560 our stack. And with that context, let's
00:01:39.280 rewind the clock and walk through how
00:01:40.720 our Rails architecture has had to evolve
00:01:42.880 to support that level of flexibility,
00:01:44.880 scale, and diversity of use.
00:01:47.840 Persona started the same way so many
00:01:49.680 other companies have, with an idea and a
00:01:51.920 command. One we've all used to bootstrap
00:01:53.840 our grand visions. But that simple
00:01:55.520 command comes with a plethora of options
00:01:57.119 behind the scenes. Some of which are
00:01:58.560 innocuous early on, but carry deep
00:02:00.640 implications for how your system will
00:02:02.079 scale, evolve, and operate years down
00:02:04.479 the line. In our case, we launched on
00:02:06.640 Google App Engine, which gave us just
00:02:08.319 enough infrastructure to move fast
00:02:09.840 without worrying too much about
00:02:11.280 provisioning or deployments. If you're
00:02:13.760 not familiar, it's a fully managed
00:02:15.280 platform that abstracts away most of the
00:02:17.200 operational complexity similar to Heroku
00:02:19.920 or a lightweight subset of what
00:02:21.280 Kubernetes offers out of the box.
00:02:24.160 We started on Rails 5.2, well into the
00:02:26.720 maturity of many Rails features. And
00:02:28.560 given that was a little over 7 years
00:02:30.160 ago, here's a refresher on some of the
00:02:32.080 major features that launched then. We'll
00:02:34.080 come back to a few of these and how we
00:02:35.519 manage scaling with them. But first,
00:02:37.200 let's talk about one that hit early and
00:02:38.879 often, the asset pipeline.
00:02:41.920 Let's time travel back to 2010, a time
00:02:44.000 when jQuery was your best friend. These
00:02:46.480 are screenshots of some actual code I
00:02:48.239 wrote in a Rails 2.3 application. You'd
00:02:50.879 manually include JavaScript files in
00:02:52.480 your templates. And yes, your actual
00:02:54.160 JavaScript logic would often live right
00:02:55.760 in your view templates, tightly coupled
00:02:57.519 to the markup it was enhancing, a
00:02:59.200 pattern that's come back in style. In
00:03:01.920 2011, Rails 3.1 introduced the
00:03:04.640 integrated asset pipeline. And it was a
00:03:06.400 game changer. It gave us a structured way to
00:03:08.560 organize, bundle, and minify assets. In
00:03:11.360 the lower example, we include a file
00:03:12.959 named user_tabs.js
00:03:15.760 to be executed on the page. But what
00:03:17.920 happens when that file changes? We
00:03:19.840 generally want browsers to cache script
00:03:21.440 content for performance reasons. But
00:03:23.760 if the file name stays the same, users
00:03:25.519 might keep getting the old version even
00:03:27.360 after we've deployed new code. The asset
00:03:30.080 pipeline solved this with
00:03:31.120 fingerprinting, appending a unique hash
00:03:33.519 to the file name. With a new file name,
00:03:35.440 the browser requests the file, giving us
00:03:37.200 the best of both worlds. Long-term
00:03:38.879 caching when things don't change and
00:03:40.400 instant updates when they do. The
00:03:43.120 integrated asset pipeline has evolved
00:03:44.959 significantly, but it originally meant
00:03:46.879 Sprockets with defaults of CoffeeScript
00:03:48.799 and Sass, languages that compile to
00:03:50.799 JavaScript and CSS, respectively. And of
00:03:53.519 course, as we already saw,
00:03:57.120 a healthy dose of jQuery, which was the
00:03:59.519 go-to solution for modern front-end
00:04:01.519 interactivity at the time. Options for
00:04:04.400 what we now consider a full-fledged
00:04:05.920 front-end framework were limited. The
00:04:07.840 first release of what would eventually
00:04:09.120 become Ember came later in 2011, led by
00:04:11.920 Yehuda Katz, a member of the Ruby on Rails core
00:04:14.000 team. The front-end landscape quickly
00:04:16.400 exploded though in the following years.
00:04:17.919 New frameworks, modern JavaScript
00:04:19.600 features, many of which were influenced
00:04:21.759 by CoffeeScript, richer UIs, and rising
00:04:24.320 expectations from users and product
00:04:25.840 teams alike. Ember, along with other
00:04:28.000 fledgling frameworks like
00:04:29.600 AngularJS and React, shifted more of
00:04:32.080 the UI into the browser, and with that
00:04:34.720 Rails increasingly took on the
00:04:36.960 role of an API provider rather than a
00:04:38.800 full page renderer.
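To make the fingerprinting behavior described a moment ago concrete, here is a minimal sketch; user_tabs is simply the file from the slide, and the digest shown in the comment is made up.

```ruby
# With the asset pipeline, the include tag resolves to a digested file name,
# so browsers can cache aggressively yet fetch new code the moment it changes.
helpers = ActionController::Base.helpers
helpers.javascript_include_tag("user_tabs")
# => "<script src=\"/assets/user_tabs-2f7ae91c....js\"></script>"  (digest is illustrative)
```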
00:04:41.280 With that explosion, the other major
00:04:43.040 responsibility left on Rails was
00:04:45.120 coordinating with a front-end build
00:04:46.560 system in the asset pipeline. That
00:04:48.880 resulted in the Webpacker gem, an npm
00:04:51.120 package that was added as an option with
00:04:52.720 Rails 5.1 in 2016 and became the default
00:04:56.000 with Rails 6.0 in 2019. Persona started
00:04:59.199 squarely in the middle of that
00:05:00.320 evolution, starting with React and
00:05:01.919 TypeScript through Webpacker. Webpacker
00:05:04.320 set out to connect Rails with modern
00:05:05.840 front-end tooling and for a while it
00:05:08.080 did. But as complexity grew, we started
00:05:10.160 to see the cost. Hard-to-debug
00:05:12.000 configurations, slow feedback loops, and
00:05:14.320 lagging support for emerging tools. The
00:05:16.400 breakneck pace of front-end innovation
00:05:18.160 made it nearly impossible for Webpacker
00:05:19.919 to keep up, turning what was meant to be
00:05:21.759 a bridge into a constant game of
00:05:23.680 catch-up, all the while struggling
00:05:25.840 to reconcile Rails's opinionated
00:05:27.919 defaults and focus on quick
00:05:29.440 implementation with the extensibility
00:05:31.360 and configurability of modern build
00:05:33.120 systems. In the end, we've moved from
00:05:35.759 Webpacker to Shakapacker and now we've
00:05:37.919 adopted Vite, a modern native JavaScript
00:05:40.479 build tool that's fast, flexible, and
00:05:42.400 designed for today's front-end
00:05:44.160 workflows. And that's been a recurring
00:05:46.240 theme for us. Rails gives you great
00:05:48.160 defaults, but you're not locked in. When
00:05:50.320 the built-in tools no longer fit your
00:05:52.000 scale, your team, or your workflows,
00:05:54.320 it's okay to step outside the box and
00:05:56.000 bring in what works.
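For a sense of what that swap looks like mechanically, here is a rough sketch (not Persona's actual diff) using the vite_rails gem behind Vite Ruby:

```ruby
# Gemfile
# gem "webpacker"        # the old bridge to the JS toolchain
gem "vite_rails"         # Vite Ruby's Rails integration

# In the layout, the Webpacker helpers give way to the Vite ones, roughly:
#   <%= javascript_pack_tag "application" %>
# becomes
#   <%= vite_client_tag %>
#   <%= vite_typescript_tag "application" %>
```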
00:05:58.800 As we started to grow, we started to hit
00:06:00.560 the natural limits of Google App Engine.
00:06:02.240 It had served us well during our early
00:06:03.919 days, giving us speed and simplicity
00:06:05.600 when we needed it most. But eventually,
00:06:07.680 the trade-offs became too hard to
00:06:09.199 ignore. We needed more flexibility
00:06:11.280 around how we structured services, or
00:06:13.120 really just to easily deploy multiple
00:06:14.800 services at all. To set the stage,
00:06:17.199 picture what scaling looked like in our
00:06:18.720 early days. A service would spike, an
00:06:20.639 alert would fire, and someone would jump
00:06:22.080 in and manually scale things up or down.
00:06:25.440 This is a slide from one of our all
00:06:26.960 hands meetings in April of 2020, showing
00:06:29.440 five straight days of manual scaling
00:06:31.520 operations, scaling services up and
00:06:33.520 down, sometimes multiple times a day,
00:06:35.600 just to keep things running smoothly.
00:06:37.600 that moment looking at this slide and
00:06:39.759 realizing how much of our energy was
00:06:41.360 going into keeping just the lights on,
00:06:43.680 it was a clear signal it was time to
00:06:45.440 grow into something more sustainable.
00:06:48.720 So, we knew we had to move off of Google
00:06:50.240 App Engine. But what options were
00:06:52.160 actually viable? The most basic option
00:06:54.639 would be to run raw virtual machines,
00:06:56.479 maybe wrapped in Terraform and managed
00:06:58.319 with Ansible or some other homegrown
00:06:59.840 tooling. Technically, that probably
00:07:02.319 would have worked, but it would have
00:07:03.680 meant taking on a ton of operational
00:07:05.280 complexity ourselves, solving problems
00:07:07.039 that much more mature tools had already
00:07:08.880 solved. Another option was to leave GCP
00:07:11.840 entirely and move to AWS or Azure for a
00:07:15.840 different platform as a service. But
00:07:17.680 realistically, that wouldn't have
00:07:18.960 guaranteed a solution to any of our core
00:07:20.800 issues, and it would have added a
00:07:22.240 massive migration on top of an already
00:07:24.080 complex problem. After weighing the
00:07:26.560 options, we decided on something that
00:07:27.919 gave us the control we needed without
00:07:29.599 starting from scratch. Kubernetes via
00:07:31.599 GKE. GKE is Google Cloud's managed
00:07:34.560 Kubernetes offering. It handles the
00:07:36.319 heavy lifting of cluster provisioning,
00:07:38.960 upgrades, and node management while
00:07:40.639 still giving us substantial operational
00:07:42.240 control.
00:07:44.400 Making the jump from Google App Engine
00:07:45.919 to Kubernetes wasn't just a change in
00:07:47.520 deployment systems. It was a fundamental
00:07:49.520 shift in how we thought about
00:07:50.560 infrastructure. App Engine handled most
00:07:52.639 of the heavy lifting for us.
00:07:53.759 Provisioning, scaling, networking, even
00:07:55.520 deployment, all abstracted behind a few
00:07:57.599 CLI commands. But that simplicity came
00:07:59.919 at the cost of control. Migrating to
00:08:01.840 Kubernetes gave us flexibility,
00:08:03.520 observability, and granular control, but
00:08:06.080 required a maturity leap in tooling and
00:08:08.319 practices because it asked us to take
00:08:10.000 ownership of every part of the stack.
00:08:11.919 From networking and observability to
00:08:13.440 deploy workflows and access control, we
00:08:15.440 suddenly had a lot more flexibility and
00:08:17.759 a lot more responsibility. Let's take a
00:08:20.240 sideby-side look at how each platform
00:08:21.840 handled the key components of our
00:08:23.039 infrastructure and what changed when we
00:08:24.720 made the switch. On App Engine, you
00:08:27.919 simply push code and Google takes care
00:08:29.199 of the compute. No servers to provision
00:08:31.199 or orchestration tools to configure. In
00:08:33.519 Kubernetes, you manage the full life
00:08:35.039 cycle of containers and the nodes they
00:08:37.200 run on. Depending on the cloud vendor or
00:08:40.000 on-prem setup, that can vary between relatively
00:08:42.399 easy with managed services like EKS and
00:08:45.279 GKE to fully hands-on if you're running
00:08:47.839 your own control plane and node
00:08:49.040 infrastructure.
00:08:50.560 GAE also doesn't support GPUs or other
00:08:52.800 specialized compute resources which have
00:08:54.640 become increasingly common as
00:08:56.720 modern workloads have exploded in
00:08:58.480 popularity and utility and those now
00:09:01.360 power critical parts of Persona's
00:09:03.040 platform like document analysis,
00:09:04.880 biometric matching and real-time image
00:09:06.880 processing.
00:09:08.240 When it comes to controlling your
00:09:09.440 application scaling behavior, App Engine
00:09:11.440 allows you to define targets for
00:09:12.560 CPU utilization and concurrent requests, but
00:09:15.440 that's about it. Kubernetes on the other
00:09:17.440 hand gives you fine grain control with
00:09:19.040 the ability to look at both system and
00:09:20.480 custom metrics in addition to being able
00:09:22.399 to scale both horizontally and
00:09:23.920 vertically. It even supports custom
00:09:25.760 scaling logic through integrations with
00:09:27.440 external metrics APIs and other
00:09:29.440 controllers, making it highly extensible
00:09:31.760 whether you're scaling based on queue depth,
00:09:33.440 request latency, web hooks, or any other
00:09:35.440 signal relevant to your application. As
00:09:37.760 we alluded to earlier, this
00:09:39.360 flexibility was a key driver in our
00:09:40.959 migration to Kubernetes. App Engine's
00:09:42.959 scaling model is heavily geared towards
00:09:44.800 request-response web traffic and it
00:09:46.640 didn't handle background job processing
00:09:48.399 like what we do with Sidekiq very well.
00:09:50.480 We needed more control over how and when
00:09:52.720 workers scaled, especially under our
00:09:54.800 very bursty workloads, and
00:09:57.519 Kubernetes gave us the tools to do that.
00:09:59.600 I'll be honest though, our move to
00:10:01.120 Kubernetes didn't immediately remove our
00:10:02.880 manual scaling woes. It took time,
00:10:05.360 experience, and a bit of patience to
00:10:07.440 craft horizontal pod autoscalers
00:10:10.000 that met our needs. But once we got
00:10:12.000 there, it changed the game. The system
00:10:13.760 finally started working with us, not
00:10:15.360 waiting for us to catch up.
00:10:17.839 Networking, like compute, is fully
00:10:19.519 managed by App Engine. There are no load
00:10:21.120 balancers to configure unless you'd
00:10:22.480 explicitly like to do so. And simply
00:10:24.240 deploying your application gives you a
00:10:25.600 publicly accessible endpoint out of the
00:10:27.120 box. With Kubernetes, you're empowered
00:10:29.120 with services, ingresses, gateways, and
00:10:31.519 a host of related objects and
00:10:33.519 configuration knobs. You can build
00:10:35.360 complex load balancing strategies with
00:10:37.279 traffic routed across multiple services,
00:10:39.120 paths, or backends, all without leaving
00:10:41.600 the Kubernetes ecosystem. But with that
00:10:43.600 power comes responsibility. You now have
00:10:45.839 to manage DNS, TLS, health checks,
00:10:48.320 firewall rules, and more. All of which
00:10:50.880 can add operational overhead if
00:10:53.040 not carefully designed and properly
00:10:55.200 configured. It's not uncommon to see
00:10:57.519 deployments missing critical pieces like
00:10:59.519 readiness probes or ingress annotations
00:11:01.519 leading to flaky traffic routing, failed
00:11:03.760 rollouts, or subtle production issues.
00:11:07.680 App Engine makes observability
00:11:10.000 effortless. Logs and metrics are
00:11:11.760 automatically captured and integrated
00:11:13.200 into Google Cloud's monitoring tools
00:11:14.800 with minimal setup. It's simple,
00:11:16.640 consistent, and good enough for many use
00:11:18.240 cases right out of the box. In contrast,
00:11:20.320 Kubernetes gives you a blank slate. You
00:11:23.040 have the freedom to plug in managed
00:11:24.240 observability platforms like Datadog or
00:11:26.480 build out your own stack with tools like
00:11:28.640 Prometheus. That freedom is powerful,
00:11:30.880 but it also means you're responsible for
00:11:32.640 wiring it all together, deciding what to
00:11:34.720 measure, and making sure nothing falls
00:11:36.800 through the cracks.
00:11:38.880 So, while the move to Kubernetes
00:11:40.640 gave us the control and flexibility we
00:11:42.240 needed to scale our infrastructure, it
00:11:44.079 also came with new complexity that we
00:11:46.399 had to learn to manage carefully. Our
00:11:48.399 deployment infrastructure wasn't the
00:11:49.839 only thing that had to evolve. As our
00:11:51.920 usage grew, one of the next places we
00:11:53.839 felt real pressure was in our database
00:11:55.360 layer. As our product and customer base
00:11:57.839 evolved, so did our data in volume,
00:12:00.000 structure, and complexity.
00:12:02.800 In mid 2022, we began sharding our
00:12:05.200 application to address the growing
00:12:06.639 pressure on our primary MySQL cluster.
00:12:08.880 Starting by adding a second shard in the
00:12:10.480 same compute location. And just to keep
00:12:12.480 things interesting, we kicked off work
00:12:14.079 at the same time to add a third shard.
00:12:15.920 This time in Europe, driven by data
00:12:17.760 residency requirements that called for
00:12:19.360 isolating customer data within specific
00:12:21.120 jurisdictions. In the span of just 6
00:12:23.600 months, we went from one database and
00:12:25.600 one Kubernetes cluster in one region to
00:12:28.240 three shards across two regions and an
00:12:30.639 additional Kubernetes cluster to support
00:12:32.480 it all.
00:12:34.560 Rather than relying upon a single
00:12:36.000 database, we use a combination of MySQL
00:12:37.920 and MongoDB as our primary data stores
00:12:40.240 along with Elasticsearch for search and
00:12:41.920 indexing workloads and Redis for
00:12:43.680 caching, Sidekiq queues, and other
00:12:45.519 ephemeral data. Each of these systems
00:12:47.760 brings its own strengths and its own
00:12:49.600 operational challenges, especially in
00:12:51.600 cloud managed environments. Choosing the
00:12:53.839 right one is only half the battle.
00:12:55.600 Scaling, tuning, and managing them in
00:12:57.760 production is where the real work
00:12:59.040 begins. While MongoDB offers native
00:13:01.600 support for sharding, making horizontal
00:13:03.680 scaling more straightforward, MySQL
00:13:06.240 posed a much harder challenge. Sharding
00:13:08.160 our relational data meant untangling
00:13:09.920 assumptions deeply buried in our
00:13:11.360 application code and schema. And that's
00:13:13.680 where we'll focus next.
00:13:16.320 Rails has only recently started offering
00:13:18.399 official support for sharding, but
00:13:19.760 applications have been hacking around
00:13:21.360 that limitation for years. Rails 6.0
00:13:24.639 added support for configurable database
00:13:26.639 connections by model, allowing
00:13:28.160 applications to route specific models or
00:13:30.079 even reads versus writes to different
00:13:31.760 database instances using the connects_to
00:13:33.519 and connected_to APIs. This is
00:13:36.000 effectively known as vertical sharding
00:13:37.519 where you split entire tables or domains
00:13:39.760 across databases. One database might
00:13:41.839 handle user data, another might handle
00:13:43.440 payments or audit logs. It's relatively
00:13:45.440 straightforward because the location of
00:13:46.880 each type of data is fixed. You always
00:13:49.040 know which database to query based on
00:13:50.560 the model. Importantly though, this laid
00:13:53.519 the groundwork for what we typically
00:13:54.639 think of when we say sharding,
00:13:56.079 horizontal sharding. Splitting rows of
00:13:58.160 the same table across multiple
00:13:59.519 databases, each holding a different
00:14:01.199 slice of the data, but sharing the same
00:14:02.720 schema.
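To ground the vertical sharding idea just described, here is a minimal sketch using the connects_to API; the model and database names are illustrative, not Persona's schema.

```ruby
# A sketch of vertical sharding with the Rails 6.0 APIs mentioned above.
# :payments and :payments_replica refer to entries in database.yml.
class PaymentsRecord < ApplicationRecord
  self.abstract_class = true

  # Route this whole domain to its own database (with a read replica),
  # while the rest of the app keeps using the primary.
  connects_to database: { writing: :payments, reading: :payments_replica }
end

class Payment < PaymentsRecord
end

# Reads can be sent to the replica explicitly:
ActiveRecord::Base.connected_to(role: :reading) do
  Payment.where(status: "settled").count
end
```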
00:14:05.440 Long-awaited, Rails 6.1 introduced
00:14:07.760 native horizontal sharding. It was
00:14:09.760 finally relatively easy to support
00:14:12.480 multiple shards of the same model in
00:14:14.160 your application. When we set out to
00:14:16.160 shard our MySQL cluster, we weren't
00:14:17.839 starting with a clean slate. We were
00:14:19.519 adapting a growing, rapidly changing
00:14:21.600 Rails application to a
00:14:23.279 pattern the framework had just recently
00:14:24.959 begun to support and as you can imagine
00:14:27.199 that came with its own set of surprises.
00:14:29.680 Rails's connected_to helper is an
00:14:32.000 essential building block for sharding.
00:14:33.360 It lets you swap the database connection
00:14:35.040 on the fly based on context. Think of it
00:14:37.680 like a railway system. Each shard is a
00:14:39.680 different destination and connected_to
00:14:41.279 is the track switch. Before the
00:14:43.920 train (your request, job, or rake task)
00:14:46.880 leaves the station, you need to flip the
00:14:48.560 switch to send it down the right track.
00:14:50.560 If you forget or flip the wrong one,
00:14:52.880 your data ends up at the wrong terminal
00:14:54.560 or worse on a collision course with
00:14:56.160 something else. And
00:14:58.240 just like in a real railway system, you
00:14:59.920 can't expect the train to figure it out
00:15:01.519 mid route. This context has to be set up
00:15:04.399 front. In practice, this means your
00:15:06.240 codebase needs to use that building
00:15:07.519 block everywhere you're interacting with
00:15:09.600 data. No small feat, even in a midsize
00:15:11.920 Rails application.
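Here is a minimal sketch of what the Rails 6.1 horizontal sharding setup can look like; the shard names and the Account model are illustrative.

```ruby
# A sketch of horizontal sharding with the Rails 6.1 APIs described above.
# :primary, :primary_shard_two, and :primary_eu are database.yml entries.
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  connects_to shards: {
    default:   { writing: :primary },
    shard_two: { writing: :primary_shard_two },
    shard_eu:  { writing: :primary_eu }
  }
end

# The "track switch": the shard must be selected before any query runs.
ActiveRecord::Base.connected_to(role: :writing, shard: :shard_eu) do
  Account.find_by!(token: "acct_123")
end
```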
00:15:14.639 Threading shard context through an app
00:15:16.160 isn't necessarily hard in isolated
00:15:18.079 cases, but it requires discipline and
00:15:20.000 consistency. For job processing, it's
00:15:22.240 relatively straightforward since you've
00:15:23.600 already looked up the shard by querying
00:15:25.440 the record. In our case, we added a
00:15:27.519 query parameter to the global ids of
00:15:29.199 objects passed to the jobs which
00:15:30.959 indicated the correct shard, allowing
00:15:32.560 the job to reconnect to the right
00:15:33.839 database when it runs.
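A sketch of the Global ID approach described here, assuming the extra-params support in the globalid gem; the job, model, and param names are illustrative.

```ruby
# Enqueue with the shard hint embedded in the Global ID. The gid is passed
# as a plain string so the custom param survives job serialization.
gid = account.to_gid(shard: "shard_two")
gid.to_s # => "gid://my-app/Account/123?shard=shard_two"
SyncAccountJob.perform_later(gid.to_s)

class SyncAccountJob < ApplicationJob
  def perform(gid_string)
    gid = GlobalID.parse(gid_string)

    # Reconnect to the shard indicated on the gid before touching the record.
    ActiveRecord::Base.connected_to(role: :writing, shard: gid.params[:shard].to_sym) do
      account = gid.find
      # ... do the actual work against the correct shard ...
    end
  end
end
```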
00:15:36.560 For web requests, though, it's a bit
00:15:38.079 trickier. You're now forced to rethink
00:15:39.760 what information is required to route a
00:15:41.360 request and make sure that shard context
00:15:43.440 is both available and trustworthy by the
00:15:45.600 time the controller code runs. Take a
00:15:47.600 public API as an example. You're
00:15:49.360 probably identifying requests with an
00:15:50.800 API key. That becomes your routing key,
00:15:53.199 a piece of context that tells you which
00:15:54.560 shard the request should go to. But
00:15:56.480 that's only half of the equation. You
00:15:58.320 now need some kind of directory or
00:15:59.920 lookup table, a centralized way to map
00:16:01.759 that routing key, API key, object token
00:16:04.560 to the correct shard. Notice the shard
00:16:06.720 lookup call in this
00:16:08.880 example from our application. Without
00:16:10.800 that layer of indirection, you're left
00:16:12.560 hard coding assumptions into your app
00:16:14.160 and that just doesn't scale. And here's
00:16:16.160 where things get even more interesting.
00:16:17.839 That lookup table might need to scale
00:16:19.519 far more than you'd initially expect. In
00:16:22.320 many cases, you're supporting APIs with
00:16:24.480 unchangeable contracts. Maybe they're
00:16:26.560 embedded in physical hardware, IoT
00:16:28.720 devices, or distributed SDKs in mobile
00:16:31.279 apps that can't be easily updated. That
00:16:33.920 means every request to your platform,
00:16:35.519 even the very first one, needs to hit
00:16:37.360 the right shard with no opportunity for
00:16:39.519 client side logic to help.
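A sketch of what resolving the shard from a routing key might look like at the controller layer; ShardDirectory and the header handling are hypothetical stand-ins for the real directory.

```ruby
# Every API request switches to its shard before any controller logic runs.
class Api::BaseController < ActionController::API
  around_action :switch_to_shard

  private

  def switch_to_shard
    # The API key is the routing key for public API traffic.
    api_key = request.headers["Authorization"].to_s.delete_prefix("Bearer ")
    shard   = ShardDirectory.shard_for(routing_key: api_key) # e.g. :shard_two

    ActiveRecord::Base.connected_to(role: :writing, shard: shard) do
      yield
    end
  end
end
```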
00:16:43.279 As a result, what looks like a simple
00:16:44.959 lookup turns into a high throughput, low
00:16:47.519 latency critical code path. One that
00:16:50.079 needs to be highly available, globally
00:16:52.000 accessible, and fast enough to sit in
00:16:54.000 front of any user-facing request.
00:16:56.800 At Persona, we solved this by backing
00:16:58.639 our lookup table with MongoDB, which
00:17:00.480 makes it very easy to support globally
00:17:02.160 distributed read replicas with minimal
00:17:04.079 operational overhead. That allowed us to
00:17:06.799 serve shard routing lookups close to the
00:17:08.640 user no matter where the request is
00:17:10.160 processing because the routing logic
00:17:12.400 sits in the critical path of almost
00:17:13.760 every request, especially in our public
00:17:16.160 or SDK facing APIs. Having that data
00:17:19.199 available fast and everywhere was
00:17:21.199 non-negotiable. As of last week, we had
00:17:23.839 just over a billion entries in that
00:17:26.000 lookup table.
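For illustration, a lookup-table document of this kind could be modeled with Mongoid roughly like this; the ShardRoute class and its fields are assumptions, not Persona's actual schema.

```ruby
# A globally replicated routing-key -> shard mapping backed by MongoDB.
class ShardRoute
  include Mongoid::Document

  field :routing_key, type: String  # API key, object token, session id, ...
  field :shard,       type: String  # e.g. "shard_two"

  index({ routing_key: 1 }, { unique: true })

  # Returns the shard symbol for a routing key, or nil if unknown.
  def self.shard_for(routing_key)
    where(routing_key: routing_key).only(:shard).first&.shard&.to_sym
  end
end
```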
00:17:28.720 Long before we even needed to shard, we
00:17:30.640 were already feeling the pressure of
00:17:31.840 working with large MySQL tables. As
00:17:34.000 usage grew, certain tables ballooned in
00:17:36.080 size, and that brought a new class of
00:17:37.840 problems. Slow queries, painful
00:17:39.919 migrations, unpredictable query plans,
00:17:42.320 and operational risk from even simple
00:17:44.080 schema changes. And while sharding helps
00:17:46.240 you scale horizontally, it doesn't
00:17:47.840 eliminate those problems. In fact, it
00:17:49.600 can make them even harder to manage. Now
00:17:51.679 you're not just maintaining one large
00:17:53.440 table. You're maintaining that same
00:17:54.799 large table across multiple shards.
00:17:57.200 Every schema change, every index tweak,
00:17:59.600 and every performance fix now has
00:18:01.360 to be repeated across N databases.
00:18:05.760 Schema changes on large MySQL tables can
00:18:08.480 be deceptively dangerous. By default,
00:18:10.720 operations like add column, modify or
00:18:12.960 drop index can lock the table, block
00:18:15.679 reads or writes and introduce unexpected
00:18:17.919 performance regressions, especially if
00:18:19.840 that table is in the critical path. For
00:18:21.760 a long time, tools like Percona's
00:18:23.840 pt-online-schema-change and the
00:18:26.400 LHM (Large Hadron Migrator) gem,
00:18:29.360 originally open sourced by SoundCloud
00:18:31.120 and now maintained by Shopify, have sought
00:18:33.360 to bridge that gap. For extremely large
00:18:36.000 tables though, it can result in
00:18:37.200 migrations taking weeks, potentially
00:18:39.120 stalling the work of other engineers or
00:18:41.120 changing query planning results and
00:18:42.880 slowing down unrelated parts of your
00:18:44.480 application. Recent versions of MySQL,
00:18:47.039 particularly 8.0, support more instant
00:18:48.960 DDL operations like adding and removing
00:18:51.200 some columns or modifying default values
00:18:53.520 without requiring full table rebuilds.
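As an example of the online schema change tools mentioned above, here is a minimal LHM sketch; the table and column names are illustrative.

```ruby
# LHM copies rows into a shadow table and swaps it in at the end, so the
# original table is not locked for the duration of the change.
require "lhm"

class AddLastAccessedAtToAttachments < ActiveRecord::Migration[7.0]
  def up
    Lhm.change_table :attachments do |m|
      m.add_column :last_accessed_at, "DATETIME NULL"
      m.add_index  [:last_accessed_at]
    end
  end
end
```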
00:18:57.039 That still leaves things like index
00:18:58.640 creation requiring full rebuilds. So
00:19:00.720 that's where access pattern design
00:19:02.080 becomes essential. One of the biggest
00:19:03.840 challenges with large MySQL tables is
00:19:05.679 that small inefficiencies at scale
00:19:08.640 really start to hurt. A missing index, a
00:19:11.039 poorly chosen primary key, or an
00:19:12.640 unexpected query plan might be invisible
00:19:14.480 with 100,000 rows, but with 100 million,
00:19:16.720 it becomes a problem you can't ignore.
00:19:18.799 It's not enough to model your schema
00:19:20.240 around the shape of your data. You have
00:19:22.559 to model around how that data will be
00:19:24.000 queried, filtered, and joined in real
00:19:26.320 application usage. If you've worked with
00:19:28.480 NoSQL systems like DynamoDB or MongoDB,
00:19:32.080 this mindset may already be familiar
00:19:33.840 where you have to design your schema
00:19:35.120 around your queries, not your entities.
00:19:37.679 In relational databases, that kind of
00:19:39.679 upfront thinking is often overlooked,
00:19:41.919 partly because you can get away with it,
00:19:44.320 especially early on, and it lets you
00:19:46.080 build faster. But as tables grow and
00:19:48.080 usage scales, those early shortcuts
00:19:50.160 start turning into real pain.
00:19:53.039 Large tables also introduce challenges
00:19:54.720 when teams need to run
00:19:56.240 backfills. Not just because they take a long
00:19:58.160 time, but because they can
00:19:59.280 unintentionally impact performance.
00:20:01.280 Depending on how the backfill is
00:20:02.559 executed, it can evict hot pages from
00:20:04.640 the MySQL buffer pool, alter index
00:20:06.720 statistics, or disrupt caching behavior.
00:20:08.799 All of which can degrade query
00:20:10.320 performance in unpredictable ways. Of
00:20:13.360 all the large tables at Persona,
00:20:15.600 the top two are from a familiar Rails
00:20:17.679 component.
00:20:20.160 Active storage was introduced to
00:20:22.080 simplify interactions with files stored
00:20:23.679 in cloud object stores like S3 or GCS
00:20:26.720 and provides a flexible attachment
00:20:28.240 system that makes it very easy to
00:20:29.679 associate with any active record model.
00:20:31.840 It's built around two main tables: blobs,
00:20:34.240 which represent the actual files
00:20:35.919 in the object store, and attachments, a
00:20:37.840 polymorphic join table that connects
00:20:39.760 those blobs to an application record. At
00:20:42.559 Persona, these two tables represent the
00:20:44.640 top two by row count in our application.
00:20:47.600 Since each file is attached to exactly
00:20:49.440 one record, their row counts are nearly
00:20:51.280 identical at around 3.4 billion. That
00:20:55.120 makes them the perfect storm. They're
00:20:56.640 huge, they're hot, and they're hard to
00:20:58.559 touch, which becomes especially painful
00:21:01.200 when you need to backfill metadata,
00:21:02.880 migrate attachments, or optimize access
00:21:05.039 patterns. And modifying models outside
00:21:07.440 of your application, whether
00:21:09.120 they're Rails components or external
00:21:10.799 gems, is particularly tricky. Given
00:21:13.440 these challenges, we've started
00:21:15.039 migrating to Shrine, which takes a more
00:21:16.640 lightweight modular approach, using
00:21:18.960 fields directly on individual records to
00:21:20.799 track the file attachments instead of
00:21:22.559 requiring a separate join model.
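A minimal sketch of the Shrine setup described here; the uploader, model, and file-system storage are illustrative, and a real object store would be configured instead.

```ruby
require "shrine"
require "shrine/storage/file_system"

Shrine.storages = {
  cache: Shrine::Storage::FileSystem.new("tmp/uploads"),    # temporary storage
  store: Shrine::Storage::FileSystem.new("public/uploads")  # permanent storage
}
Shrine.plugin :activerecord

class FileUploader < Shrine; end

class Document < ApplicationRecord
  # Expects a `file_data` text column on the documents table; attachment
  # metadata lives on the record itself, with no separate join model.
  include FileUploader::Attachment(:file)
end

# document.file = File.open("scan.jpg")  # writes file_data on save
# document.file_url                      # resolves to the configured storage
```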
00:21:25.360 Sometimes though, it's not
00:21:27.280 the core of your system
00:21:29.039 that causes the most pain. It's the
00:21:30.640 abstractions you thought you didn't need
00:21:32.320 to think about.
00:21:34.799 We've been talking a lot about MySQL,
00:21:36.400 and that's intentional. It's the
00:21:37.760 database powering many Rails
00:21:38.960 applications. But many of these lessons
00:21:40.720 aren't unique to MySQL. You'll encounter
00:21:42.400 them with any relational database. At
00:21:44.640 scale, every data store brings its own
00:21:46.640 set of challenges, and we face them all.
00:21:48.480 We've dealt with hot shards on MongoDB,
00:21:50.400 fought through index tuning and cluster
00:21:51.840 pressure on Elasticsearch, and even had
00:21:53.520 to shard Redis to keep up with Sidekiq
00:21:55.120 throughput. The reality is no database
00:21:57.760 stays easy forever. Once you're
00:21:59.679 operating at scale, even the managed
00:22:01.760 parts of your stack demand careful
00:22:03.440 planning, constant tuning, and a good
00:22:05.520 dose of humility. While I'd love to
00:22:07.919 unpack all those war stories, we simply
00:22:10.080 don't have time today. You might be
00:22:11.760 wondering though, where's the simplicity
00:22:13.280 in all this? That leads me into what
00:22:15.520 we're working on now, an intentional
00:22:17.200 return to simplicity.
00:22:19.679 You might be wondering why this slide
00:22:21.200 says one Kubernetes cluster and one
00:22:22.720 MySQL cluster. After everything we
00:22:24.640 talked about, that probably sounds
00:22:26.400 backwards. We didn't shrink. We didn't
00:22:29.039 suddenly stop needing to scale. What we
00:22:31.200 did was simplify. We took everything we
00:22:33.440 learned, the patterns, the guardrails,
00:22:35.600 the wins, the pain points, and
00:22:37.520 restructured it into a single
00:22:39.039 consolidated deployment model designed
00:22:41.120 for reuse, with strong
00:22:43.840 tenancy boundaries and predictable
00:22:45.440 growth curves. It's not a step
00:22:47.440 backwards. It's a step of years of
00:22:49.520 experience teaching us that for the way
00:22:51.919 we scale, simplicity might be the only
00:22:53.919 way to do it without losing your mind.
00:22:57.679 This architectural shift is a project we
00:22:59.600 call stacks. The idea was simple.
00:23:02.000 Instead of scattering complexity across
00:23:03.440 multiple clusters, databases, and other
00:23:05.679 systems, we define a single
00:23:07.360 self-contained unit that could run our
00:23:09.039 full platform.
00:23:10.880 And then we replicate it again and again
00:23:14.640 and again.
00:23:19.280 In some ways, this architecture
00:23:20.880 resembles what was often called single
00:23:22.480 tenancy, where each customer
00:23:24.480 gets their own isolated environment. In
00:23:27.200 our case, though, it's a bit more
00:23:28.400 nuanced. Each stack isn't necessarily
00:23:30.720 tied to one customer. It's more like a
00:23:33.120 self-contained runtime that can host
00:23:34.799 many tenants, but with strong boundaries
00:23:36.799 between the stacks themselves. So, while
00:23:38.640 we borrow some of the benefits of single
00:23:40.000 tenancy, like isolation, blast radius
00:23:42.159 reduction, and operational flexibility,
00:23:44.320 we don't take on the overhead of
00:23:45.520 spinning up a new environment for every
00:23:46.880 single customer, essentially a middle
00:23:48.559 ground. While the main components of
00:23:50.400 each stack are isolated,
00:23:52.159 including their own databases, there are
00:23:54.080 still a few database-backed services that
00:23:55.520 we share across all customers. Chief
00:23:57.600 among them is the lookup table we
00:23:59.120 discussed earlier, which helps route the
00:24:00.960 requests to the correct stack.
00:24:03.520 We call these cores centralized systems
00:24:06.080 that power critical functionality across
00:24:08.159 all our environments while everything
00:24:09.679 else remains stack specific.
00:24:13.039 Routing a request to the right stack
00:24:14.880 isn't all that different from what
00:24:16.480 we had to do with sharding. It has to be
00:24:18.080 correct from the very beginning. Just
00:24:20.159 like with database sharding, there's no
00:24:21.679 room for ambiguity. Once a request hits
00:24:23.679 our edge, we need to know exactly which
00:24:25.440 stack should handle it. If we get it
00:24:26.880 wrong, the request simply fails. So,
00:24:28.960 this isn't just a routing concern. It's
00:24:30.880 a critical correctness boundary. Let's
00:24:33.120 walk through a real world example to see
00:24:34.799 how all this comes together. Though,
00:24:36.640 just to be clear, we're pretending this
00:24:38.400 is the actual map of all of our edge and
00:24:40.000 main compute locations. The real one is
00:24:41.760 a bit too dense and not nearly as slide
00:24:43.600 friendly, but this gives you the general
00:24:45.200 idea hopefully. Consider the green
00:24:47.120 triangles to be our edge locations and
00:24:48.880 the coral cans as our main compute. Say
00:24:51.440 you make a request from here in
00:24:53.279 Philadelphia.
00:24:54.960 That request would get routed to the
00:24:56.320 nearest edge location. That might be as
00:24:58.320 close as down the street or a few
00:25:00.159 hundred miles away. Kind of depends on
00:25:02.080 how the internet's behaving that day.
00:25:04.559 Let's say that happens to be in
00:25:05.679 Virginia, a pretty short hop at light
00:25:07.440 speed. To make stack routing work, we
00:25:10.000 run code at the edge as close to the
00:25:11.919 user as possible. This layer inspects
00:25:14.240 each incoming request and parses out the
00:25:15.919 key routing metadata.
00:25:18.640 Things like object tokens, API keys,
00:25:21.039 sessions. We'll first attempt to look
00:25:23.679 that key up in a local cache. For things
00:25:26.559 like API keys where callers are
00:25:28.159 typically isolated to one or two
00:25:29.919 locations, we see a really high cache
00:25:32.320 hit rate, which means we can route that
00:25:34.320 request to the correct stack almost
00:25:36.000 instantly. In the roughly 7% of the time
00:25:38.559 that we miss, like for routing keys that
00:25:40.559 have callers spread across many
00:25:41.840 locations and don't frequently repeat,
00:25:44.480 we'll query the lookup table in
00:25:45.919 MongoDB. We have read replicas
00:25:48.960 distributed across the globe. So we're
00:25:50.640 able to make decisions in under 150
00:25:52.320 milliseconds for 95% of those cache miss
00:25:55.279 requests. Now that we've determined
00:25:57.600 where your request should go, we'll
00:25:58.960 actually route it there.
00:26:01.600 This time to a stack in Europe where the
00:26:03.520 request will be processed. All in all, this
00:26:06.400 approach allows us to introduce this
00:26:07.840 architecture change with zero
00:26:09.520 modifications to customer
00:26:10.799 implementations. No SDK updates, no
00:26:13.840 endpoint changes, no new headers, just a
00:26:17.360 cleaner, more scalable back-end
00:26:19.279 infrastructure that works exactly the
00:26:20.720 same from the outside. That kind of
00:26:22.640 seamless evolution is hard to pull off,
00:26:24.720 but when it works, it's one of the
00:26:26.080 clearest signs that your abstractions
00:26:27.600 are holding up.
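To summarize the edge routing flow in code, here is an illustrative Ruby sketch; the real edge layer is not necessarily Ruby, and StackRouter and its collaborators are hypothetical.

```ruby
# Cache-then-lookup routing: try the local per-location cache first, and fall
# back to the globally replicated lookup table only on a miss.
class StackRouter
  CACHE_TTL = 300 # seconds

  def initialize(cache:, directory:)
    @cache = cache          # local, per-edge-location cache
    @directory = directory  # globally replicated lookup table (MongoDB-backed)
  end

  # Returns the stack that should handle a request, given its routing key
  # (API key, object token, session, ...).
  def stack_for(routing_key)
    @cache.fetch(routing_key, expires_in: CACHE_TTL) do
      @directory.shard_for(routing_key) # cache miss: query the nearest replica
    end
  end
end

# router = StackRouter.new(cache: Rails.cache, directory: ShardRoute)
# router.stack_for(api_key) # => :shard_eu, then forward the request there
```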
00:26:30.960 While these stories have been about how
00:26:32.400 we've managed the last seven years, scaling
00:26:34.240 Rails, taming complexity, and evolving
00:26:36.320 our architecture, they're really about
00:26:38.480 something bigger. And we've learned a
00:26:40.000 few lessons along the way. Complexity is
00:26:42.240 inevitable, but if you're intentional,
00:26:44.320 you can choose where that complexity
00:26:45.600 lives. Rails gives us great defaults,
00:26:47.919 and we've embraced them. But we've also
00:26:49.919 learned not to treat those defaults as
00:26:51.360 constraints.
00:26:52.880 Simplify where you can, scale where you
00:26:54.880 must, and don't be afraid to step
00:26:56.320 outside the box when necessary. I
00:26:58.480 appreciate you all joining me today to
00:27:00.080 hear some of our war stories. And I hope
00:27:01.840 that some of these lessons help you on
00:27:03.039 your own journey scaling Rails, whether
00:27:04.799 you're just starting out or deep in the
00:27:06.400 trenches.
00:27:08.080 On a final note, this is the Persona
00:27:09.679 team we have at RailsConf this week.
00:27:11.120 You'll be able to find us at the
00:27:12.480 Persona booth back there, on
00:27:14.799 the floor above in the Liberty Foyer,
00:27:17.679 or around at sessions. We'd love to chat
00:27:19.520 with you. We are hiring. Thank you very
00:27:21.840 much.