Summarized using AI

SQLite Replication with Beamer

Kevin McConnell • September 04, 2025 • Amsterdam, Netherlands • Talk

Introduction

This talk, delivered by Kevin McConnell at Rails World 2025, focuses on "Beamer", a lightweight replication tool for SQLite databases. As SQLite gains popularity in powering Rails applications due to its simplicity and feature set, challenges arise when scaling beyond a single application server, especially in distributed or multi-server setups. Beamer aims to address these challenges by enabling easy replication and the use of read-only replicas for SQLite-based applications.

Key Points

  • Purpose of Database Replication:

    • Replication allows multiple live copies of a database across multiple machines.
    • Commonly involves a primary writable database and multiple read-only replicas.
    • Benefits include increased redundancy (avoiding a single point of failure), ability to horizontally scale reads, and isolation of resource-heavy queries.
  • Motivation behind Beamer:

    • Originated from the need to efficiently manage data for a new app (Fizzy) at 37signals, which relies on many SQLite database files and wide geographic distribution.
    • Inspired by real-world requirements such as reducing customer latency and making replication simple and easily adoptable by others.
  • How Replication Works (Technical Deep Dive):

    • Replication flow involves three key steps: capturing changes/transactions on the primary, transmitting them to replicas, and applying them to maintain consistency.
    • Data in SQLite is stored in trees, with the actual disk storage in fixed-size pages (typically 4KB).
    • SQLite supports a "WAL" (Write-Ahead Logging) mode, which records changes as a continuous stream (the WAL file) rather than overwriting in place, facilitating reliable change capture without blocking reads.
    • Beamer leverages SQLite's architecture by implementing a Virtual File System (VFS) shim. This allows Beamer to intercept transaction commits, capture changed pages from the WAL, and transmit these over HTTP to replicas.
    • On replicas, Beamer uses the virtual table sqlite_dbpage to write raw page data directly to the database, safely applying changes at the pager level.
  • Beamer in Use:

    • Integrates via a gem for Rails applications, making the necessary VFS available to capture transactions.
    • A separate Beamer process (Docker image) runs on each application server, managing network traffic and replication commands.
    • The key command is beamer switch, which designates the current primary server. Replication flow and read-only modes are managed automatically.
    • Supports failover and geographic distribution. The system can be paired with load balancers (e.g. Kamal Proxy) to direct reads to replicas and writes to the primary.
    • Illustrative example: Deploying primary databases and replicas in different regions to optimize latency for geographically distributed users.

Takeaways and Conclusions

  • Beamer makes it straightforward to add live, read-only replicas to SQLite databases, supporting scalability and geographic distribution for Rails applications.
  • Relies on low-level features of SQLite (WAL, VFS shims) and provides a relatively simple integration path.
  • The project is used internally and will be open sourced soon, with ongoing work to simplify multi-region and multi-writer configurations.

SQLite Replication with Beamer
Kevin McConnell • Amsterdam, Netherlands • Talk

Date: September 04, 2025
Published: Mon, 15 Sep 2025 00:00:00 +0000
Announced: Tue, 20 May 2025 00:00:00 +0000

SQLite is an increasingly popular choice for powering Rails applications. It's fast, simple to use, and has an impressive feature set. But without a conventional database server, scaling beyond a single application server can be challenging.

Beamer is a lightweight replication tool for SQLite, designed to make it easy to add read-only replicas to your writable databases. In this talk, we'll cover how Beamer works and how you can use it to build multi-server, geographically distributed SQLite-based Rails applications.

Rails World 2025

00:00:07.120 So I wanted to talk to you a little bit
00:00:08.639 today about Beamer, which is a
00:00:10.160 lightweight replication tool for SQLite.
00:00:13.519 So what we're going to cover today is um
00:00:15.599 we'll start by doing just a quick review
00:00:17.039 of what database replication does, why
00:00:18.880 it's a useful thing to have. Um but then
00:00:21.279 for the for the bulk of the talk, I
00:00:23.279 thought it'd be interesting if we
00:00:24.160 actually look under the hood to see how
00:00:25.760 Beamer works. And the way we're going to
00:00:27.439 approach that is by setting ourselves
00:00:28.960 the task of building a replication
00:00:30.480 system from scratch. We kind of work
00:00:32.800 through all of the points that we need
00:00:34.079 to solve in order to do that. And by the
00:00:35.920 time we get to the end of it, we'll have
00:00:37.120 figured out how Beamer works. Um, and
00:00:39.200 then at the end, uh, we'll just take a
00:00:41.200 couple of minutes to look at what it's
00:00:42.239 like to use Beamer in your Rails
00:00:43.600 application um, and your Kamal
00:00:45.440 deployments.
00:00:47.520 So database replication, if anyone's not
00:00:50.480 familiar with it, it's just a way to
00:00:51.840 have multiple copies of your databases
00:00:54.239 spread out over multiple machines. And
00:00:56.399 those copies are live in the sense that
00:00:58.800 changes that you make to one copy will
00:01:00.480 be visible in the other copies. And you
00:01:02.399 can access all of these copies to run
00:01:03.840 queries and do work.
00:01:06.320 Uh database replication is often
00:01:08.240 deployed uh in a set where for each
00:01:10.560 database you'll have one writable
00:01:11.920 version called the primary and then
00:01:13.280 multiple read-only versions called
00:01:14.720 replicas. It's not the only way to do
00:01:16.400 replication, but it's a very common one.
00:01:18.320 Um, and so for the uh point of
00:01:20.720 view of this talk, this is the kind of
00:01:21.920 replication that we have in mind that
00:01:23.360 we'll be talking about.
00:01:25.680 As for why this is useful, um, I think
00:01:27.759 there's really three main reasons. One
00:01:29.920 is just that it helps you avoid having
00:01:31.360 single points of failure in your
00:01:32.720 infrastructure because obviously if you
00:01:34.079 have all of your database data on a
00:01:35.520 single machine, it's going to be a bit
00:01:37.200 of a liability if anything happens to
00:01:38.640 that machine. So replication gives you a
00:01:40.240 way to um, spread that out and build in
00:01:42.320 some redundancy.
00:01:44.320 Second reason that replication would be
00:01:46.240 useful is you can then start
00:01:48.320 horizontally scaling your database
00:01:50.159 operations across all of the machines
00:01:51.840 that you're replicating to. Um, and this
00:01:54.720 is useful even in the type of
00:01:56.000 replication we're talking about where
00:01:57.200 your replicas will be read only because
00:01:58.960 so many applications do a lot more read
00:02:01.200 work than they do write work. Uh, using
00:02:03.759 Basecamp 4 as an example, I think we
00:02:05.759 often average something like 94% read
00:02:07.840 requests and only 6% writes, which is
00:02:10.560 not an unusual ratio to have in a web
00:02:12.560 application.
00:02:13.760 Um so if you're in a situation like that
00:02:15.280 you could see how having lots of even
00:02:17.360 read-only copies means now you can add a
00:02:19.120 lot more capacity to handle those reads
00:02:20.879 and by moving your reads to the read
00:02:22.319 replicas you also free up uh write
00:02:24.319 capacity on your primary to do more
00:02:26.160 writes there.
00:02:28.400 Uh and then the third reason I think
00:02:29.920 replication is useful to have is it's a
00:02:31.920 pattern that allows you to isolate
00:02:33.360 disruptive actions to machines that are
00:02:35.360 not um serving customer traffic. So an
00:02:38.160 example of this would be if you have
00:02:39.440 some heavy data analysis that you need
00:02:41.280 to do and you suspect that these queries
00:02:42.879 you're going to run will slow down
00:02:44.480 everything that's on the machine that
00:02:45.519 you do them on. You don't want to be
00:02:46.959 doing that on a machine that's also
00:02:47.920 serving customer traffic. But if you
00:02:49.440 have replication, you can just replicate
00:02:51.440 your data onto a separate machine and do
00:02:52.959 the work there and then it will be kind
00:02:54.400 of isolated.
00:02:57.920 I did want to also mention just a little
00:02:59.519 bit of context of why I went down this
00:03:01.280 replication rabbit hole and started to
00:03:02.879 look at this. Um although from David's
00:03:05.040 keynote this morning, you probably have
00:03:06.239 a sense of why we're looking at this. Um
00:03:09.120 but this really came from we're building
00:03:10.720 this new app at 37signals called Fizzy.
00:03:13.120 It's built on top of SQLite. Um it uses
00:03:15.680 tons and tons of database files. Um for
00:03:17.760 anyone who went to Mike's talk earlier
00:03:19.200 about Active Record tenanting, you will
00:03:20.800 know all of the hows and whys of how it
00:03:22.800 uses so many database files. We also
00:03:25.519 want to deploy this application um
00:03:28.000 spread out in various places
00:03:29.200 geographically so we can reduce customer
00:03:30.959 latency by moving the data closer to the
00:03:33.120 customers. So there's quite a lot going
00:03:34.640 on there and we need to be able to find
00:03:37.680 uh a good way to manage the replication
00:03:39.519 of the data in our SQLite databases
00:03:41.599 across all these machines. And so when
00:03:44.560 looking at a way to do this and also to
00:03:46.560 make it easy that's what led us to sort
00:03:48.720 of look through some of our options and
00:03:50.159 ultimately to start building Beamer. Um,
00:03:52.239 and we want to make this easy, not just
00:03:53.519 for our case of this app we're building,
00:03:54.799 but obviously for anyone else who wants
00:03:56.159 to build an app with a similar
00:03:57.519 architecture, we hope you can take these
00:03:58.959 pieces and find it easy as well.
00:04:03.360 Okay, so now that we know what um
00:04:06.400 database replication is for and why it's
00:04:08.159 a useful thing to have, this is the part
00:04:09.920 where we're going to start um digging
00:04:11.599 into the nuts and bolts of how Beamer
00:04:13.200 works by trying to build a replication
00:04:14.720 system from scratch. This is also going
00:04:17.040 to involve digging around a bit inside
00:04:18.880 SQLite and talking about how SQLite
00:04:20.560 works. And I did mention that because
00:04:22.720 there's a lot of detail going on there
00:04:24.240 and we don't have very much time. I'm
00:04:25.440 going to kind of speedrun some of the
00:04:26.960 SQLite parts. So if some of the
00:04:28.720 things that I say seem a bit wonky at
00:04:30.400 first, bear with me. I think it will all
00:04:32.240 make sense by the time we get to the
00:04:33.440 end.
00:04:36.560 So this is going to be our initial
00:04:39.280 starting description of what replication
00:04:40.880 does. This is going to serve as the spec
00:04:42.320 of what we're going to try and build. So
00:04:43.759 I think at the core of it, a replication
00:04:46.720 system needs to be able to do these
00:04:48.080 three things. You need to be able to
00:04:49.520 capture the changes that happen to a
00:04:50.800 primary database. Uh transmit those over
00:04:53.040 to a replica and then apply the
00:04:55.199 same changes to the replica. If you can
00:04:56.560 do these three steps reliably, then your
00:04:58.639 replica databases will do everything
00:05:00.080 your primary databases do and you've
00:05:01.520 essentially implemented replication at
00:05:03.280 that point. So you could picture this a
00:05:05.680 bit like this diagram where you have a
00:05:06.880 primary database, a replica database,
00:05:08.639 and there's this constant stream of
00:05:10.160 changes that flows from the primary to
00:05:11.840 the replica. um that's sometimes
00:05:13.600 referred to as a replication log.
00:05:16.479 So, as I say, this is our starting spec.
00:05:18.080 This is what we're going to try and
00:05:18.960 build, but we need a bit more detail in
00:05:20.720 order to do that. Um and looking at it
00:05:22.960 now, there's a couple of things that
00:05:24.080 stand out to me initially. One is that
00:05:26.880 we're talking here about capturing
00:05:28.560 changes, but change is sort of a vague
00:05:30.320 term. We should probably be a little bit
00:05:31.840 more precise with what we mean there.
00:05:33.520 But since we're dealing with relational
00:05:35.120 databases, uh the unit of change there
00:05:36.800 is a transaction. So let's start by
00:05:38.880 revising our spec a bit to say that
00:05:40.320 we're actually going to capture the
00:05:41.280 transactions on the primary and then
00:05:43.360 transmit and apply those transactions.
00:05:46.960 The second thing that um stands out to
00:05:49.039 me at first with this spec is that steps
00:05:51.199 one and three here seem like they're
00:05:53.199 doing database specific tasks, but step
00:05:55.600 two doesn't really. Step two just sounds
00:05:57.440 like it's sending information from one
00:05:59.120 machine to another machine. Um and
00:06:01.199 sending data over network is something
00:06:02.400 that as web developers we do this all
00:06:04.000 the time. We use HTTP and move data
00:06:06.080 around. We're kind of familiar with
00:06:07.680 this. It seems like a bit of a solved
00:06:09.120 problem. So, let's just go ahead and say
00:06:11.120 that we'll use HTTP as the
00:06:13.919 transport for this data. And we can
00:06:15.280 check off point 2 and consider that like
00:06:17.360 we figured out we know how to do two and
00:06:19.280 now we just have the other database
00:06:20.560 specific parts to solve.
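
To make that concrete (as an illustration, not code from the talk), step two could be as small as a POST from the primary to each replica. The /frames endpoint and payload framing here are invented for the sketch; the talk doesn't describe Beamer's actual wire protocol.

    require "net/http"
    require "uri"

    # Hypothetical transport for step two: POST a blob of captured page
    # data to a replica over plain HTTP.
    def transmit(payload, replica_host)
      uri = URI("http://#{replica_host}/frames")  # endpoint name is invented
      Net::HTTP.post(uri, payload, "Content-Type" => "application/octet-stream")
    end
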
00:06:23.520 Okay, so those were the only two sort of
00:06:25.520 obvious easy parts to me on this spec.
00:06:27.759 Um, now we need to start looking a bit
00:06:30.160 closer under the hood of how things work
00:06:31.600 in order to start figuring out how to do
00:06:33.199 the other parts. So we'll look at how to
00:06:34.560 capture transactions and the way we're
00:06:37.199 going to solve that uh involves looking
00:06:39.520 a bit inside SQLite. Um and the first
00:06:41.759 part we're going to look at is how
00:06:42.800 SQLite stores its data.
00:06:45.600 So logically the data you work with in
00:06:48.400 SQLite or any relational database is
00:06:50.319 tabular, right? Your data is
00:06:51.919 arranged in tables with rows and
00:06:53.440 columns. Um but that's not how it's
00:06:55.440 actually stored on disk, primarily
00:06:57.280 because if it was stored this way on
00:06:58.639 disk, then a lot of common operations
00:07:01.360 would turn out to be quite awkward and
00:07:02.720 expensive to do. So an example might be
00:07:04.800 if you need to insert a row near the
00:07:06.479 start of a table, if it's laid out in
00:07:08.400 this format on disk, then suddenly
00:07:09.680 you've got to move everything else down
00:07:11.039 to make space for that new row. And in
00:07:13.120 the worst case, you're going to end up
00:07:14.160 essentially rewriting that entire table
00:07:15.840 just to add a single row, which would be
00:07:18.080 quite slow. So what databases normally
00:07:21.120 do is use some kind of tree structure
00:07:22.720 instead.
00:07:24.319 A tree is really nice for this because
00:07:26.319 you can put your data into nodes and
00:07:28.160 then use pointers between nodes to kind
00:07:30.000 of maintain whatever ordering you want
00:07:31.520 to have in that data. But where you
00:07:34.000 store those nodes on disk doesn't have
00:07:35.520 to match that. So you kind of separate
00:07:37.440 the disk layout from the logical layout
00:07:39.520 of the data. And that's what makes it
00:07:41.520 cheap to do things like insert a new
00:07:42.880 node at the top of a tree because you
00:07:44.240 only have to change the nodes it
00:07:45.280 touches. You don't have to touch any of
00:07:46.560 the rest of the data.
00:07:48.639 So SQLite does this. In SQLite,
00:07:51.039 everything is a tree. So a table is a
00:07:52.800 tree, an index is a tree because an
00:07:54.800 index is really just a table that has
00:07:56.240 pointers for values. Um the schema
00:07:58.400 catalog which is the kind of main list
00:08:00.240 of all of the uh tables, indexes and
00:08:03.599 other objects in your database, that's
00:08:04.960 also a tree. Um so this is how SQLite
00:08:08.639 organizes uh its data.
00:08:11.840 But when we're dealing with trees that
00:08:13.360 need to persist to disk, um, one
00:08:15.840 trade-off that we have to be conscious
00:08:17.120 of is that the size of the nodes in our
00:08:19.919 tree is going to be important in terms
00:08:21.440 of performance. In particular, if we
00:08:23.199 make those nodes really small, then
00:08:25.039 every time we need to do something to
00:08:26.319 work with that tree that involves uh
00:08:28.319 fetching a node or saving a node, we're
00:08:29.840 going to be doing tiny little disk
00:08:31.120 operations and lots and lots of them,
00:08:32.560 which will end up being quite
00:08:33.519 inefficient. There's a bit of an
00:08:35.120 overhead for every IO operation you do.
00:08:37.440 and disks perform a
00:08:39.839 lot better if you can batch that work up
00:08:41.519 and kind of write chunks at a time. So
00:08:44.159 the way SQLite deals with this is that
00:08:46.399 all its trees are built out of these
00:08:47.920 fixed-size blocks called pages. By
00:08:50.080 default they're 4K and they'll always be
00:08:51.680 the same size um within a single
00:08:53.839 database. So we'll assume just the
00:08:56.080 default for now. They're always 4K
00:08:57.600 blocks. Um that means that the pages can
00:09:00.080 contain data for multiple rows as well
00:09:02.080 as the pointers to the other um pages in
00:09:04.800 order to kind of form the tree
00:09:05.920 structure. Then the SQLite data file is
00:09:09.200 literally just all of those 4K pages
00:09:11.839 written out in their page number order
00:09:13.760 in a file. That's the file format of a
00:09:15.519 SQLite database. And then whenever
00:09:17.760 SQLite needs to um read and write any
00:09:21.120 changes to the data, it's always working
00:09:22.399 on pages. So it's always doing these
00:09:23.760 like 4K reads, 4K writes kind of thing.
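
You can see this layout for yourself. A small sketch using the sqlite3 gem, assuming a freshly checkpointed database with no pending WAL frames: the data file's size is exactly page_size times page_count.

    require "sqlite3"

    db = SQLite3::Database.new("example.sqlite3")
    db.execute("CREATE TABLE IF NOT EXISTS notes(id INTEGER PRIMARY KEY, body TEXT)")

    page_size  = db.get_first_value("PRAGMA page_size")   # typically 4096
    page_count = db.get_first_value("PRAGMA page_count")

    # The file is just the pages, one after another, in page-number order.
    puts File.size("example.sqlite3") == page_size * page_count  # => true
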
00:09:27.519 Okay, I know we're getting in the weeds
00:09:29.040 a bit here, but knowing this um we can
00:09:31.680 start making our first revision to the
00:09:33.279 spec to get more detail in there. And
00:09:35.839 now we'll just note that when we say
00:09:37.600 we're going to capture transactions for
00:09:38.959 a primary, what we're probably going to
00:09:40.320 do is capture these changed pages
00:09:42.320 because that's how SQLite works with
00:09:43.839 them. So at least now we have a little
00:09:45.600 bit of a notion of the the file format
00:09:47.440 of the data that we're dealing with.
00:09:50.480 Um, but what we don't know yet
00:09:52.160 is how do we find which of the pages are
00:09:54.800 the changed pages whenever there's a
00:09:56.480 transaction. So we'll continue poking
00:09:59.279 around in SQLite and we can find the
00:10:00.640 answer to that, this time looking more
00:10:02.720 at how SQLite updates data when it
00:10:04.640 changes.
00:10:06.480 So we just said that um
00:10:09.440 reads and writes always happen as whole
00:10:11.040 pages and we also said that SQLite's
00:10:12.959 data file is just these pages laid out
00:10:14.800 on disk one after the other in their
00:10:16.399 order. So when it comes time to persist
00:10:18.640 a change to disk the simplest possible
00:10:21.279 thing SQLite could do would just be
00:10:24.079 write the new version of the pages
00:10:25.360 directly into that file whenever they
00:10:26.880 happen. And that would work fine if you
00:10:29.440 are the only process accessing the
00:10:31.040 database at that time, but very often
00:10:33.360 you're not. Um, often there'll be lots
00:10:35.040 of processes and threads, maybe
00:10:36.240 different Puma workers or something that
00:10:37.600 are all accessing the same database. And
00:10:39.600 so we can't just write changes on top of
00:10:41.440 other um, pages anytime we want because
00:10:44.079 we're going to trip up the work that
00:10:45.279 some other process is doing if it's
00:10:46.640 trying to read from a page we're writing
00:10:47.920 to and so on.
00:10:50.079 So if we can't do this the very simplest
00:10:52.240 thing perhaps we can do the next
00:10:54.320 simplest thing which would be just to
00:10:56.079 add a big lock to this. So anytime we
00:10:59.760 want to write the process that wants to
00:11:01.519 write could take an exclusive lock on
00:11:03.040 the whole database. Then it can write
00:11:04.560 whatever it wants to whatever pages it
00:11:06.079 needs to change. When it's done it can
00:11:07.920 release the lock and then the other
00:11:09.440 processes that want to use the database
00:11:10.880 can continue their work. This actually
00:11:13.680 works fine and um SQLite has a mode of
00:11:16.079 operation that is essentially this. Um,
00:11:18.800 but the downside to this is it's just
00:11:20.320 really bad for concurrency. If you have
00:11:22.240 a lot of processes that all want to do
00:11:24.000 things to the database, but anytime one
00:11:26.000 wants to write, all the others stop. Um,
00:11:28.720 then in practice, you'll find a lot of
00:11:30.399 time where only one of your processes is
00:11:32.160 really making any progress on its work
00:11:34.000 and all the other processes are just
00:11:35.360 kind of waiting on this lock to be
00:11:36.640 released.
00:11:38.160 Um, so to improve on this, SQLite has
00:11:41.600 another mode of operation which is
00:11:43.200 called WAL mode. WAL mode works by
00:11:46.720 adding a second data file alongside the
00:11:48.640 main data file called the WAL file,
00:11:50.320 which stands for the write-ahead log.
00:11:52.160 It's where WAL gets its name. And the
00:11:54.800 WAL file is built out of these same
00:11:56.560 fixed-size pages as the main database
00:11:58.560 file. But the WAL file is an append-only
00:12:01.040 file. So rather than overwriting
00:12:02.640 anything, you're always appending to the
00:12:04.240 end. Uh the way this gets used is that
00:12:07.600 whenever SQLite wants to record a change
00:12:09.519 to a page, it will write it onto the end
00:12:12.320 of that WAL file. And then whenever it
00:12:14.240 needs to read in a page from the
00:12:15.360 database, it'll look to see if there is
00:12:16.639 a copy of it in the WAL file first. If
00:12:18.480 there is, it can use that. Um, and if
00:12:20.880 it's not in the WAL, it will
00:12:22.399 look in the main database file. So,
00:12:24.560 another way to think of this is that
00:12:27.200 writing a copy of a page to the end of
00:12:29.279 the WAL is effectively shadowing any
00:12:31.040 previous copy you had of that page in
00:12:32.720 the WAL or in the main database
00:12:34.480 file.
00:12:36.800 It can't do this forever. You can't just
00:12:38.720 continually append to a file because
00:12:39.920 eventually the WAL file will become
00:12:41.120 massive and consume all your disk space.
00:12:42.880 So there's also a periodic task called a
00:12:44.800 checkpoint which runs when the WAL
00:12:46.880 starts to get too big and the checkpoint
00:12:48.480 will just take the latest version of
00:12:49.920 each of these pages, copy them back up
00:12:52.399 to the main database and it can throw
00:12:53.600 the WAL away and start over with a new
00:12:55.279 one. And so the WAL file will sort of
00:12:57.279 periodically rewind. But in between
00:12:59.839 those checkpoints, the point that I'm
00:13:01.360 trying to get at here is that all of the
00:13:04.079 write operations are essentially just
00:13:05.519 append, append, append onto the end of a
00:13:07.360 WAL file. So if you could see it
00:13:08.959 inside, it would probably look like
00:13:10.639 this, I think.
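
In code, switching a database into WAL mode and forcing a checkpoint looks like this (a sketch using the sqlite3 gem; file names are illustrative):

    require "sqlite3"

    db = SQLite3::Database.new("example.sqlite3")
    db.execute("PRAGMA journal_mode=WAL")   # returns "wal" once enabled

    db.execute("CREATE TABLE IF NOT EXISTS notes(id INTEGER PRIMARY KEY, body TEXT)")
    db.execute("INSERT INTO notes(body) VALUES (?)", "hello")
    # The committed pages were appended to example.sqlite3-wal, not
    # written into the main data file.

    # What the periodic checkpoint does: copy the latest version of each
    # page back into the main file and reset the WAL.
    db.execute("PRAGMA wal_checkpoint(TRUNCATE)")
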
00:13:13.680 Now, WAL mode has a couple of really
00:13:16.160 nice properties that make it work better
00:13:17.680 for concurrent workloads. Um, one of
00:13:19.519 them is that writes don't need to block
00:13:21.200 reads anymore. Um, since you're not
00:13:23.600 overwriting pages, that means if other
00:13:25.120 processes want to read the database
00:13:26.240 while a write's happening, they can
00:13:27.360 continue to do so and nothing's going to
00:13:29.360 like pull the rug out from under their
00:13:30.800 feet by changing a page they're reading.
00:13:33.040 So, that's nice. Um but from our point
00:13:35.760 of view as people who are trying to
00:13:37.120 build a replication system, the other
00:13:39.279 reason WAL mode is nice is that if
00:13:41.279 you look at the WAL file, it's
00:13:42.720 essentially uh a continuous stream of
00:13:45.040 changes that just happened
00:13:46.880 to your primary database. Um they're
00:13:48.720 even delineated by transaction. There's
00:13:50.560 little markers in the WAL that tell
00:13:51.680 you where each transaction starts and
00:13:53.200 stops. Um and this is perfect for what
00:13:55.519 we need because this tells you all of
00:13:57.199 the changes that just happen to your
00:13:58.399 primary. This starts to look a lot like
00:14:00.320 that uh replication log on our first
00:14:02.320 diagram. And if we can capture this
00:14:04.880 information and play it back on our
00:14:06.160 replica database, our replica will do
00:14:08.399 whatever the primary did. And
00:14:09.600 essentially, we're getting to the point
00:14:10.880 where replication would work.
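
Those transaction markers are part of the documented WAL format (sqlite.org/fileformat2.html): a 32-byte file header, then repeating 24-byte frame headers, each followed by one page of data, where a non-zero "database size" field marks the commit frame that ends a transaction. Here's a sketch that walks a WAL file and groups frames by transaction; it skips the salt and checksum verification a real reader needs, and the path is just an example.

    # Sketch: group the frames of a WAL file into transactions.
    def each_transaction(wal_path)
      File.open(wal_path, "rb") do |wal|
        header = wal.read(32)
        break unless header && header.bytesize == 32
        page_size = header[8, 4].unpack1("N")  # big-endian, bytes 8..11
        pending = []

        while (frame = wal.read(24)) && frame.bytesize == 24
          pgno, db_size = frame.unpack("NN")
          pending << [pgno, wal.read(page_size)]

          # A non-zero db_size marks a commit frame: everything gathered
          # since the last commit is one transaction we could ship.
          if db_size != 0
            yield pending
            pending = []
          end
        end
      end
    end

    each_transaction("storage/production.sqlite3-wal") do |pages|
      puts "transaction touched pages #{pages.map(&:first).inspect}"
    end
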
00:14:14.959 Knowing all that, um, once again, we can
00:14:17.279 revise our spec with a bit more detail.
00:14:18.880 And just to note that capturing
00:14:20.720 transactions now means that we'll
00:14:21.920 capture the changed pages of each
00:14:23.199 transaction from the WAL. I realized
00:14:25.040 that was a really long digression just
00:14:26.320 to get three more words in our
00:14:28.240 description there, but they're important
00:14:29.600 words. They're doing a lot of work
00:14:30.800 there.
00:14:33.279 So at this point we know quite a bit
00:14:35.680 about what we need to do for step
00:14:38.240 one here. But the part we're missing I
00:14:39.760 think is the timing-related part. We
00:14:41.600 know what format of data we're
00:14:43.120 capturing. We know where we're going to
00:14:45.360 find it. But we don't know how do we
00:14:47.760 know when there is a new transaction
00:14:49.120 ready to be captured, if you see what I
00:14:50.800 mean. We need the time aspect of
00:14:52.399 this.
00:14:53.920 So once again we'll go looking at SQLite
00:14:56.560 for this, uh but this time we're taking
00:14:58.800 kind of a high level view on SQLite and
00:15:01.600 this is what its overall architecture
00:15:03.120 looks like. So SQLite's built with a
00:15:05.120 layered architecture um where each of
00:15:07.199 the layers does one specific task and it
00:15:09.199 does that by using the services of the
00:15:11.120 layer below it. So code that you write
00:15:13.360 or active record or whatever will
00:15:15.199 interact at the top level with this
00:15:16.560 interface. It can send in SQL commands
00:15:18.720 which go to the SQL command processor
00:15:20.720 where they'll be parsed out. um an
00:15:22.639 execution plan is devised and they get
00:15:24.320 turned into byte code which runs on a
00:15:25.839 virtual machine. That virtual machine
00:15:28.480 carries out that program by manipulating
00:15:30.399 the trees we were looking at earlier
00:15:31.760 using a layer called the B-tree and the
00:15:34.160 B-tree is going to need to read and
00:15:35.440 write those 4K pages we looked at
00:15:36.959 earlier um which it does with a layer
00:15:38.959 called the pager.
00:15:41.120 The pager is going to need to work with
00:15:42.800 files on disk for this. But it doesn't
00:15:45.120 talk uh directly to the operating system
00:15:46.959 to do this. Instead, it goes through
00:15:48.160 this last layer at the bottom called the
00:15:49.519 virtual file system or VFS. The VFS
00:15:52.800 exists mainly for portability. It's one
00:15:54.720 of the reasons that SQLite is quite easy
00:15:56.160 to port to different uh operating
00:15:58.079 systems and platforms because if you
00:16:00.399 make a VFS for your operating system,
00:16:02.560 then all these other layers can operate
00:16:04.079 on top of it and all of SQLite can run
00:16:05.920 there.
00:16:08.399 The other interesting thing about
00:16:10.560 the virtual file system is that
00:16:14.000 in addition to the built-in one, so
00:16:15.440 SQLite has a virtual file system for
00:16:16.880 Windows and for Unix and so on. Uh, but
00:16:19.199 you can also build your own and you can
00:16:20.720 load in new ones at runtime and tell
00:16:22.320 SQLite to use them. Um, so essentially
00:16:24.800 the top five layers of this diagram are
00:16:26.880 all hardwired together, but the bottom
00:16:28.639 part is pluggable. And because the
00:16:31.040 bottom part is pluggable and because you
00:16:32.560 can write your own, you can also do a
00:16:34.320 neat thing where you can wrap a file
00:16:36.399 system with another file system. And
00:16:38.399 when you do this, SQLite calls this a
00:16:39.920 VFS shim. And so a VFS shim is a way
00:16:43.199 that you can get your own code into this
00:16:45.519 stack. Um, and your own code can be
00:16:47.519 involved in the interaction between the
00:16:49.040 pager and the virtual file system. But
00:16:50.560 you don't have to build a whole file
00:16:51.759 system to do it. You can just delegate
00:16:53.120 the work to the file system below it.
00:16:55.680 Right? So that's good news for us. um
00:16:59.199 particularly because one of the things
00:17:01.279 that happens at that level is that
00:17:03.040 whenever the pager records a
00:17:05.520 new transaction to the database and
00:17:07.280 commits it and records it into the WAL,
00:17:08.880 it fires us a particular event into the
00:17:11.120 virtual file system to let it know that
00:17:12.720 that happened. So the reason that's good
00:17:15.280 for us is because it gives us the last
00:17:17.360 piece of information we need to figure
00:17:18.880 out how to implement step one of our
00:17:20.400 replication, which is that if we
00:17:22.160 implement a VFS shim, we'll be able to
00:17:24.720 intercept the commits, capture those
00:17:26.400 change pages out of the wall for every
00:17:28.000 transaction as it happens. Then we can
00:17:30.160 transmit to those to the replicas over
00:17:32.320 HTTP.
00:17:34.160 And now we just need to figure out how
00:17:35.360 to do the step on the replica part
00:17:36.960 three.
00:17:38.559 Um, so figuring out part three also
00:17:41.039 requires going back into SQL light, but
00:17:42.640 I will say uh part three is way easier
00:17:44.960 than part one. So this one won't take
00:17:46.400 nearly as long to sort out.
00:17:50.080 So looking back at this architecture
00:17:51.919 diagram,
00:17:53.760 the transaction data that we captured
00:17:55.600 out of the WAL um was those pages. It's
00:17:58.000 pager level data and that's what we want
00:18:00.720 to apply on the replicas. But normally
00:18:02.240 when we work with SQLite, we're dealing
00:18:04.400 with things up at this interface level.
00:18:06.160 We don't want to be going through all
00:18:07.760 these layers again on the replica
00:18:09.039 because all that work already happened
00:18:10.480 before we captured the transaction. What
00:18:12.480 we really need is a way to just skip
00:18:14.320 over that. Um,
00:18:17.520 as luck would have it, SQLite has a
00:18:19.280 thing that lets you do exactly that and
00:18:20.720 skip over it. So SQLite has a virtual
00:18:22.720 table called SQLite DB page. And SQLite
00:18:26.160 DB page is not really a table. It's
00:18:28.000 actually the page you're masquerading as
00:18:29.600 a table. So if you select out of SQLite
00:18:32.000 DB page, it will return the raw 4K
00:18:35.120 binary chunks of each page to you. You
00:18:37.679 get back rows where you have a page
00:18:39.280 number and you have that that binary
00:18:41.280 data for the page. Even better than
00:18:43.679 that, you can write to SQL IDB page. Um
00:18:46.880 so if you insert into it, you can pass
00:18:48.640 in a page number, a 4K chunk of data,
00:18:50.880 and the pager will go and write that
00:18:52.480 into the database just as if it had come
00:18:54.240 from all the layers above normally. And
00:18:57.280 because the pager as we said was that
00:18:59.200 part that's responsible for transaction
00:19:01.200 safety and concurrency safety then it
00:19:03.039 will do this in a way where you can feed
00:19:05.200 data into it like this while other
00:19:06.640 people are reading the database and it
00:19:08.480 will make sure that that all happens um
00:19:10.160 in a kind of controlled and safe way. So
00:19:12.960 that turned out to be the only thing we
00:19:14.400 needed to do for step three. Um this now
00:19:17.760 gives us the whole picture and kind of a
00:19:19.760 detailed enough spec that we could go
00:19:21.039 off and build this and this would be the
00:19:22.320 replication flow that would manage
00:19:23.760 changes from a primary database to a
00:19:25.120 replica database. And it's exactly the
00:19:26.880 way Beamer does it.
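
With the sqlite3 gem, applying pages through sqlite_dbpage looks roughly like this. A sketch only: it assumes a SQLite build compiled with SQLITE_ENABLE_DBPAGE_VTAB and defensive mode off, and it glosses over details a real replicator must handle, like matching page sizes and applying each transaction's frames atomically.

    require "sqlite3"

    src = SQLite3::Database.new("primary.sqlite3")
    dst = SQLite3::Database.new("replica.sqlite3")

    # Each row of sqlite_dbpage is a page number plus that page's raw bytes.
    src.execute("SELECT pgno, data FROM sqlite_dbpage") do |pgno, data|
      # Writing the row hands the page straight to the replica's pager,
      # skipping the SQL layers above it.
      dst.execute("INSERT INTO sqlite_dbpage(pgno, data) VALUES (?, ?)",
                  [pgno, SQLite3::Blob.new(data)])
    end
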
00:19:30.240 I should add though that all this part
00:19:32.640 that we've been discussing um is kind of
00:19:35.120 that core replication flow like I said.
00:19:37.520 So um it's the main thing you need to do
00:19:40.320 if you want to replicate your databases.
00:19:42.400 Um but
00:19:44.559 it's uh it's not the only thing you need
00:19:47.440 in order to have
00:19:49.840 um like a system that you can set up and
00:19:51.840 work with and operate and kind of is
00:19:53.520 easy to use. There's a bunch of other
00:19:54.720 plumbing you're going to need. So,
00:19:55.679 Beamer does add kind of a number of
00:19:57.679 other things um to make that easier, but
00:20:02.080 it's all really built around this kind
00:20:03.760 of central idea of shipping changes.
00:20:06.240 Hopefully, that makes sense.
00:20:10.000 Okay, so that kind of covers, or we
00:20:13.120 have covered, what database replication's
00:20:14.880 for um and why it's a useful thing to have.
00:20:17.760 Now, we know how Beamer does it, and/or
00:20:20.559 how to build it yourself if you want to
00:20:22.000 build your own. So the only other thing
00:20:24.000 that I wanted to talk about at the end
00:20:25.440 was just what it looks like to use
00:20:26.720 Beamer in your applications.
00:20:30.240 Uh getting Beamer into your application
00:20:32.000 has really two parts. One is that we
00:20:33.679 need that VFS to be available um so that
00:20:36.480 we can capture the transactions because
00:20:38.080 that's the part where we implemented
00:20:39.760 that. Um so for a Rails app, we could
00:20:42.799 just use a gem. Beamer has a gem that
00:20:44.400 when you include that in your app, it'll
00:20:45.919 make sure the VFS is available. And then
00:20:47.679 in your database YAML, you can specify
00:20:49.520 which of your databases you'd like to be
00:20:51.120 replicated um and which not if you don't
00:20:53.360 want to do them all.
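
As a purely hypothetical sketch of what that opt-in could look like (Beamer isn't released yet, so the gem's real configuration keys may well differ):

    # config/database.yml (keys below are invented for illustration)
    production:
      primary:
        adapter: sqlite3
        database: storage/production.sqlite3
        replicated: true    # hypothetical flag: replicate this database
      cache:
        adapter: sqlite3
        database: storage/cache.sqlite3
        replicated: false   # hypothetical flag: keep this one local
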
00:20:55.520 Second part is that there's a Beamer
00:20:57.039 process that runs on each app server.
00:20:58.799 That process is responsible for things
00:21:00.480 like the um the network traffic between
00:21:02.640 the nodes and responding to commands
00:21:04.240 that you run and so on. Um that's a
00:21:06.559 Docker image. So you can just add that
00:21:08.480 to your Kamal deployment as an
00:21:09.840 accessory and it will run on each of
00:21:11.120 your app nodes. If you do these two
00:21:12.720 things, then Beamer is available and you
00:21:14.480 can start replicating.
00:21:17.440 At that point, um there are
00:21:20.640 several different commands
00:21:22.320 that Beamer provides that you can run to
00:21:23.919 control what the replication's actually
00:21:25.600 doing. Only one of them is really
00:21:27.520 required and that's you need a way to
00:21:29.520 tell Beamer which of your uh app servers
00:21:32.320 you would like to act as the primary. So
00:21:34.240 this command beamer switch sets the
00:21:36.320 current primary in a cluster of um app
00:21:38.720 servers. When you do this, um, the
00:21:42.720 machine that you specify to be the
00:21:44.240 primary will allow you to write to the
00:21:46.080 databases there. It will capture the
00:21:47.600 changes and make them available to be
00:21:49.039 replicated to the other machines. All
00:21:50.640 the other machines that are not the
00:21:51.919 primary will follow along with the
00:21:54.080 changes on the databases of that
00:21:55.440 machine. They'll also set their copy of
00:21:57.360 the databases into a read-only mode so
00:21:59.120 that you don't accidentally make a
00:22:00.400 change on a replica when you shouldn't.
00:22:03.520 Um, you can run beamer switch at any
00:22:05.280 time. So if you need to do a failover
00:22:06.720 like if uh app one dies or you just want
00:22:09.120 to move your primary somewhere else so
00:22:10.480 you can take app 01 offline then you can
00:22:12.480 just call beamer switch again and all
00:22:13.679 these nodes will reconfigure themselves
00:22:15.120 to start replicating in a different
00:22:16.640 direction. Um and you could obviously
00:22:18.559 automate this as well if you want to put
00:22:20.000 it in something like Consul so that
00:22:21.360 when a machine dies it will
00:22:22.480 automatically elect a new one.
00:22:26.320 Um there are several other beamer
00:22:28.799 commands as well. We're not going to go
00:22:30.000 through them all um in detail, but I
00:22:32.400 just put a few of them up here so you
00:22:33.600 can get a sense of the kinds of things
00:22:34.960 that it can do.
00:22:38.400 And then the other part that I wanted to
00:22:39.840 talk about in terms of using Beamer is
00:22:41.919 that when you've set up a cluster of
00:22:43.600 machines and you've got replication
00:22:44.880 running between them and now you have
00:22:46.320 read-only replicas, you're going to want
00:22:47.840 to be able to serve some of your app
00:22:49.120 traffic to them. So to be able to do
00:22:51.600 that, we've added load balancing to
00:22:53.440 Kamal Proxy. So it now has the ability
00:22:55.200 to load balance traffic amongst a number
00:22:57.039 of nodes, but it also knows how to
00:22:59.200 split that traffic between reads and
00:23:00.640 writes um based on the HTTP method
00:23:02.799 that's in each of those requests. So you
00:23:04.720 can do something like this where you set
00:23:06.000 up a set of machines which uses beamer
00:23:08.480 to replicate. So you have one primary um
00:23:10.799 in that orange color and a couple of
00:23:12.240 read replicas and then commal proxy will
00:23:14.880 send all of the rights to the primary
00:23:16.480 and all the reads will be load balanced
00:23:18.000 amongst the other machines there.
00:23:21.440 This kind of setup is also the um the
00:23:24.240 basis of how you can start doing the
00:23:25.919 thing of moving data nearer to customers
00:23:27.840 to reduce their latency. So the simplest
00:23:31.120 example of that I think would be that
00:23:33.039 you can make two deployments of your
00:23:34.480 app. You can use some kind of geodns
00:23:36.240 service that will route customers
00:23:37.600 traffic to the nearest point um
00:23:40.159 depending on where the traffic
00:23:41.280 originates. Um you could make a main
00:23:44.080 deployment in Europe where you have uh
00:23:45.840 your primary database and a couple of
00:23:47.280 read replicas but you can also make like
00:23:48.799 an outpost deployment in the US which
00:23:51.039 has read replicas there. So then your US
00:23:53.360 customers their traffic's going to land
00:23:54.960 on that load balancer in the US and all
00:23:57.360 their read requests will be much faster
00:23:59.360 because they're uh coming from those
00:24:01.360 read replicas that are in the US and
00:24:02.799 have much less latency.
00:24:04.880 Obviously in this setup there's still
00:24:06.240 only one primary which is in Europe. So
00:24:08.320 the US customers will still be paying
00:24:10.320 the latency to get to Europe whenever
00:24:12.159 they do a write. That might be fine. I
00:24:14.559 mean, if you're an app like Basecamp 4
00:24:16.559 and 94% of your requests are reads, then
00:24:18.720 just doing this has made 94% of your
00:24:21.039 requests faster and the other 6% are
00:24:22.960 just the same as they always were. So
00:24:24.559 that's already kind of helpful. Um, but
00:24:27.279 if you do want to get a bit fancier, you
00:24:28.960 can always segment your customers into
00:24:30.640 different writable databases and then
00:24:32.080 start putting those in different places
00:24:33.520 as well. And that way you can have
00:24:35.520 your US customers live in a database
00:24:37.200 that's in the US and they'll get faster
00:24:39.200 reads and writes and same for Europe.
00:24:41.120 And obviously this can be expanded out
00:24:42.640 to multiple different places.
00:24:47.520 Okay. So the last thing I wanted to
00:24:49.600 leave you with before I wrap up is just
00:24:51.360 um a note on the current status of where
00:24:53.039 the Beamer project is right now. Um so
00:24:55.919 we're using this internally. It's what
00:24:57.520 powers this new app Fizzy that we've
00:24:59.039 been talking about a few times today and
00:25:00.880 which we're planning to release soon.
00:25:02.000 And so at the point where Fizzy
00:25:03.120 releases, that's when Beamer will start
00:25:04.400 having kind of real production traffic.
00:25:06.799 Um, so that should be soon. Um, we are
00:25:09.600 still experimenting a little bit with
00:25:11.039 different strategies for how to operate
00:25:13.120 this multi-region, multi-writer kind of
00:25:15.840 setup like we saw in the last slide,
00:25:17.919 because you can do it now, but it feels
00:25:19.279 a little bit more complicated to set up
00:25:20.720 than it should be. Um, so we have some
00:25:22.640 ideas about trying to make that dead
00:25:24.000 simple. Um, but in any case, all of this
00:25:26.400 is going to be open source soon. You can
00:25:27.840 try it out for yourself and if you do, we'd
00:25:29.360 love to hear what you think of it.
00:25:30.720 Thanks very much.