Summarized using AI

SQLite Replication with Beamer

Kevin McConnell • September 04, 2025 • Amsterdam, Netherlands • Talk

Introduction

This talk, delivered by Kevin McConnell at Rails World 2025, focuses on "Beamer", a lightweight replication tool for SQLite databases. As SQLite gains popularity in powering Rails applications due to its simplicity and feature set, challenges arise when scaling beyond a single application server, especially in distributed or multi-server setups. Beamer aims to address these challenges by enabling easy replication and the use of read-only replicas for SQLite-based applications.

Key Points

  • Purpose of Database Replication:

    • Replication allows multiple live copies of a database across multiple machines.
    • Commonly involves a primary writable database and multiple read-only replicas.
    • Benefits include increased redundancy (avoiding a single point of failure), ability to horizontally scale reads, and isolation of resource-heavy queries.
  • Motivation behind Beamer:

    • Originated from the need to efficiently manage data for a new app (Fizzy) at 37signals, which relies on many SQLite database files and wide geographic distribution.
    • Inspired by real-world requirements such as reducing customer latency and making replication simple and easily adoptable by others.
  • How Replication Works (Technical Deep Dive):

    • Replication flow involves three key steps: capturing changes/transactions on the primary, transmitting them to replicas, and applying them to maintain consistency.
    • Data in SQLite is stored in trees, with the actual disk storage in fixed-size pages (typically 4KB).
    • SQLite supports a "WAL" (Write-Ahead Logging) mode, which records changes as a continuous stream (the WAL file) rather than overwriting in place, facilitating reliable change capture without blocking reads.
    • Beamer leverages SQLite's architecture by implementing a Virtual File System (VFS) shim. This allows Beamer to intercept transaction commits, capture changed pages from the WAL, and transmit these over HTTP to replicas.
    • On replicas, Beamer uses the virtual table sqlite_dbpage to write raw page data directly to the database, safely applying changes at the pager level.
  • Beamer in Use:

    • Integrates via a gem for Rails applications, making the necessary VFS available to capture transactions.
    • A separate Beamer process (Docker image) runs on each application server, managing network traffic and replication commands.
    • The key command is beamer switch, which designates the current primary server. Replication flow and read-only modes are managed automatically.
    • Supports failover and geographic distribution. The system can be paired with load balancers (e.g. Kamal Proxy) to direct reads to replicas and writes to the primary.
    • Illustrative example: Deploying primary databases and replicas in different regions to optimize latency for geographically distributed users.

Takeaways and Conclusions

  • Beamer makes it straightforward to add live, read-only replicas to SQLite databases, supporting scalability and geographic distribution for Rails applications.
  • Relies on low-level features of SQLite (WAL, VFS shims) and provides a relatively simple integration path.
  • The project is used internally and will be open sourced soon, with ongoing work to simplify multi-region and multi-writer configurations.

SQLite Replication with Beamer
Kevin McConnell • Amsterdam, Netherlands • Talk

Date: September 04, 2025
Published: Mon, 15 Sep 2025 00:00:00 +0000
Announced: Tue, 20 May 2025 00:00:00 +0000

SQLite is an increasingly popular choice for powering Rails applications. It's fast, simple to use, and has an impressive feature set. But without a conventional database server, scaling beyond a single application server can be challenging.

Beamer is a lightweight replication tool for SQLite, designed to make it easy to add read-only replicas to your writable databases. In this talk, we'll cover how Beamer works and how you can use it to build multi-server, geographically distributed SQLite-based Rails applications.

Rails World 2025

00:00:07.120 So I wanted to talk to you a little bit
00:00:08.639 today about Beamer, which is a
00:00:10.160 lightweight replication tool for SQLite.
00:00:13.519 So what we're going to cover today is um
00:00:15.599 we'll start by doing just a quick review
00:00:17.039 of what database replication does, why
00:00:18.880 it's a useful thing to have. Um but then
00:00:21.279 for the for the bulk of the talk, I
00:00:23.279 thought it'd be interesting if we
00:00:24.160 actually look under the hood to see how
00:00:25.760 Beamer works. And the way we're going to
00:00:27.439 approach that is by setting ourselves
00:00:28.960 the task of building a replication
00:00:30.480 system from scratch. We kind of work
00:00:32.800 through all of the points that we need
00:00:34.079 to solve in order to do that. And by the
00:00:35.920 time we get to the end of it, we'll have
00:00:37.120 figured out how Beamer works. Um, and
00:00:39.200 then at the end, uh, we'll just take a
00:00:41.200 couple of minutes to look at what it's
00:00:42.239 like to use Beamer in your Rails
00:00:43.600 application um, and your Kamal
00:00:45.440 deployments.
00:00:47.520 So database replication, if anyone's not
00:00:50.480 familiar with it, it's just a way to
00:00:51.840 have multiple copies of your databases
00:00:54.239 spread out over multiple machines. And
00:00:56.399 those copies are live in the sense that
00:00:58.800 changes that you make to one copy will
00:01:00.480 be visible in the other copies. And you
00:01:02.399 can access all of these copies to run
00:01:03.840 queries and do work.
00:01:06.320 Uh database replication is often
00:01:08.240 deployed uh in a set where for each
00:01:10.560 database you'll have one writable
00:01:11.920 version called the primary and then
00:01:13.280 multiple read-only versions called
00:01:14.720 replicas. It's not the only way to do
00:01:16.400 replication, but it's a very common one.
00:01:18.320 Um, and so for the uh point of
00:01:20.720 view of this talk, this is the kind of
00:01:21.920 replication that we have in mind that
00:01:23.360 we'll be talking about.
00:01:25.680 As for why this is useful, um, I think
00:01:27.759 there's really three main reasons. One
00:01:29.920 is just that it helps you avoid having
00:01:31.360 single points of failure in your
00:01:32.720 infrastructure because obviously if you
00:01:34.079 have all of your database data on a
00:01:35.520 single machine, it's going to be a bit
00:01:37.200 of a liability if anything happens to
00:01:38.640 that machine. So replication gives you a
00:01:40.240 way to um, spread that out and build in
00:01:42.320 some redundancy.
00:01:44.320 Second reason that replication would be
00:01:46.240 useful is you can then start
00:01:48.320 horizontally scaling your database
00:01:50.159 operations across all of the machines
00:01:51.840 that you're replicating to. Um, and this
00:01:54.720 is useful even in the type of
00:01:56.000 replication we're talking about where
00:01:57.200 your replicas will be read only because
00:01:58.960 so many applications do a lot more read
00:02:01.200 work than they do write work. Uh, using
00:02:03.759 Basecamp 4 as an example, I think we
00:02:05.759 often average something like 94% read
00:02:07.840 requests and only 6% writes, which is
00:02:10.560 not an unusual ratio to have in a web
00:02:12.560 application.
00:02:13.760 Um so if you're in a situation like that
00:02:15.280 you could see how having lots of even
00:02:17.360 read-only copies means now you can add a
00:02:19.120 lot more capacity to handle those reads
00:02:20.879 and by moving your reads to the read
00:02:22.319 replicas you also free up uh write
00:02:24.319 capacity on your primary to do more
00:02:26.160 writes there.
00:02:28.400 Uh and then the third reason I think
00:02:29.920 replication is useful to have is it's a
00:02:31.920 pattern that allows you to isolate
00:02:33.360 disruptive actions to machines that are
00:02:35.360 not um serving customer traffic. So an
00:02:38.160 example of this would be if you have
00:02:39.440 some heavy data analysis that you need
00:02:41.280 to do and you suspect that these queries
00:02:42.879 you're going to run will slow down
00:02:44.480 everything that's on the machine that
00:02:45.519 you do them on. You don't want to be
00:02:46.959 doing that on a machine that's also
00:02:47.920 serving customer traffic. But if you
00:02:49.440 have replication, you can just replicate
00:02:51.440 your data onto a separate machine and do
00:02:52.959 the work there and then it will be kind
00:02:54.400 of isolated.
00:02:57.920 I did want to also mention just a little
00:02:59.519 bit of context of why I went down this
00:03:01.280 replication rabbit hole and started to
00:03:02.879 look at this. Um although from David's
00:03:05.040 keynote this morning, you probably have
00:03:06.239 a sense of why we're looking at this. Um
00:03:09.120 but this really came from we're building
00:03:10.720 this new app at 37signals called Fizzy.
00:03:13.120 It's built on top of SQLite. Um it uses
00:03:15.680 tons and tons of database files. Um for
00:03:17.760 anyone who went to Mike's talk earlier
00:03:19.200 about Active Record tenanting, you will
00:03:20.800 know all of the hows and whys of how it
00:03:22.800 uses so many database files. We also
00:03:25.519 want to deploy this application um
00:03:28.000 spread out in various places
00:03:29.200 geographically so we can reduce customer
00:03:30.959 latency by moving the data closer to the
00:03:33.120 customers. So there's quite a lot going
00:03:34.640 on there and we need to be able to find
00:03:37.680 uh a good way to manage the replication
00:03:39.519 of the data in our SQLite databases
00:03:41.599 across all these machines. And so when
00:03:44.560 looking at a way to do this and also to
00:03:46.560 make it easy that's what led us to sort
00:03:48.720 of look through some of our options and
00:03:50.159 ultimately to start building Beamer. Um,
00:03:52.239 and we want to make this easy, not just
00:03:53.519 for our case of this app we're building,
00:03:54.799 but obviously for anyone else who wants
00:03:56.159 to build an app with a similar
00:03:57.519 architecture, we hope you can take these
00:03:58.959 pieces and find it easy as well.
00:04:03.360 Okay, so now that we know what um
00:04:06.400 database replication is for and why it's
00:04:08.159 a useful thing to have, this is the part
00:04:09.920 where we're going to start um digging
00:04:11.599 into the nuts and bolts of how Beamer
00:04:13.200 works by trying to build a replication
00:04:14.720 system from scratch. This is also going
00:04:17.040 to involve digging around a bit inside
00:04:18.880 SQLite and talking about how SQLite
00:04:20.560 works. And I did mention that because
00:04:22.720 there's a lot of detail going on there
00:04:24.240 and we don't have very much time. I'm
00:04:25.440 going to kind of speedrun some of the
00:04:26.960 SQLite parts. So if some of the
00:04:28.720 things that I say seem a bit wonky at
00:04:30.400 first, bear with me. I think it will all
00:04:32.240 make sense by the time we get to the
00:04:33.440 end.
00:04:36.560 So this is going to be our initial
00:04:39.280 starting description of what replication
00:04:40.880 does. This is going to serve as the spec
00:04:42.320 of what we're going to try and build. So
00:04:43.759 I think at the core of it, a replication
00:04:46.720 system needs to be able to do these
00:04:48.080 three things. You need to be able to
00:04:49.520 capture the changes that happen to a
00:04:50.800 primary database. Uh transmit those over
00:04:53.040 to a replica and then apply the
00:04:55.199 same changes to the replica. If you can
00:04:56.560 do these three steps reliably, then your
00:04:58.639 replica databases will do everything
00:05:00.080 your primary databases do and you've
00:05:01.520 essentially implemented replication at
00:05:03.280 that point. So you could picture this a
00:05:05.680 bit like this diagram where you have a
00:05:06.880 primary database, a replica database,
00:05:08.639 and there's this constant stream of
00:05:10.160 changes that flows from the primary to
00:05:11.840 the replica. um that's sometimes
00:05:13.600 referred to as a replication log.
00:05:16.479 So, as I say, this is our starting spec.
00:05:18.080 This is what we're going to try and
00:05:18.960 build, but we need a bit more detail in
00:05:20.720 order to do that. Um and looking at it
00:05:22.960 now, there's a couple of things that
00:05:24.080 stand out to me initially. One is that
00:05:26.880 we're talking here about capturing
00:05:28.560 changes, but change is sort of a vague
00:05:30.320 term. We should probably be a little bit
00:05:31.840 more precise with what we mean there.
00:05:33.520 But since we're dealing with relational
00:05:35.120 databases, uh the unit of change there
00:05:36.800 is a transaction. So let's start by
00:05:38.880 revising our spec a bit to say that
00:05:40.320 we're actually going to capture the
00:05:41.280 transactions on the primary and then
00:05:43.360 transmit and apply those transactions.
00:05:46.960 The second thing that um stands out to
00:05:49.039 me at first with this spec is that steps
00:05:51.199 one and three here seem like they're
00:05:53.199 doing database specific tasks, but step
00:05:55.600 two doesn't really. Step two just sounds
00:05:57.440 like it's sending information from one
00:05:59.120 machine to another machine. Um and
00:06:01.199 sending data over network is something
00:06:02.400 that as web developers we do this all
00:06:04.000 the time. We use HTTP and move data
00:06:06.080 around. We're kind of familiar with
00:06:07.680 this. It seems like a bit of a solved
00:06:09.120 problem. So, let's just go ahead and say
00:06:11.120 that we'll use HTTP as the
00:06:13.919 transport for this data. And we can
00:06:15.280 check off point 2 and consider that like
00:06:17.360 we figured out we know how to do two and
00:06:19.280 now we just have the other database
00:06:20.560 specific parts to solve.
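
To make that concrete (as an illustration, not code from the talk), step two could be as small as a POST from the primary to each replica. The /frames endpoint and payload framing here are invented for the sketch; the talk doesn't describe Beamer's actual wire protocol.

    require "net/http"
    require "uri"

    # Hypothetical transport for step two: POST a blob of captured page
    # data to a replica over plain HTTP.
    def transmit(payload, replica_host)
      uri = URI("http://#{replica_host}/frames")  # endpoint name is invented
      Net::HTTP.post(uri, payload, "Content-Type" => "application/octet-stream")
    end
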
00:06:23.520 Okay, so those were the only two sort of
00:06:25.520 obvious easy parts to me on this spec.
00:06:27.759 Um, now we need to start looking a bit
00:06:30.160 closer under the hood of how things work
00:06:31.600 in order to start figuring out how to do
00:06:33.199 the other parts. So we'll look at how to
00:06:34.560 capture transactions and the way we're
00:06:37.199 going to solve that uh involves looking
00:06:39.520 a bit inside SQLite. Um and the first
00:06:41.759 part we're going to look at is how
00:06:42.800 SQLite stores its data.
00:06:45.600 So logically the data you work with in
00:06:48.400 SQLite or any relational database is
00:06:50.319 tabular, right? Your data is
00:06:51.919 arranged in tables with rows and
00:06:53.440 columns. Um but that's not how it's
00:06:55.440 actually stored on disk, primarily
00:06:57.280 because if it was stored this way on
00:06:58.639 disk, then a lot of common operations
00:07:01.360 would turn out to be quite awkward and
00:07:02.720 expensive to do. So an example might be
00:07:04.800 if you need to insert a row near the
00:07:06.479 start of a table, if it's laid out in
00:07:08.400 this format on disk, then suddenly
00:07:09.680 you've got to move everything else down
00:07:11.039 to make space for that new row. And in
00:07:13.120 the worst case, you're going to end up
00:07:14.160 essentially rewriting that entire table
00:07:15.840 just to add a single row, which would be
00:07:18.080 quite slow. So what databases normally
00:07:21.120 do is use some kind of tree structure
00:07:22.720 instead.
00:07:24.319 A tree is really nice for this because
00:07:26.319 you can put your data into nodes and
00:07:28.160 then use pointers between nodes to kind
00:07:30.000 of maintain whatever ordering you want
00:07:31.520 to have in that data. But where you
00:07:34.000 store those nodes on disk doesn't have
00:07:35.520 to match that. So you kind of separate
00:07:37.440 the disk layout from the logical layout
00:07:39.520 of the data. And that's what makes it
00:07:41.520 cheap to do things like insert a new
00:07:42.880 node at the top of a tree because you
00:07:44.240 only have to change the nodes it
00:07:45.280 touches. You don't have to touch any of
00:07:46.560 the rest of the data.
00:07:48.639 So SQLite does this. In SQLite,
00:07:51.039 everything is a tree. So a table is a
00:07:52.800 tree, an index is a tree because an
00:07:54.800 index is really just a table that has
00:07:56.240 pointers for values. Um the schema
00:07:58.400 catalog which is the kind of main list
00:08:00.240 of all of the uh tables, indexes and
00:08:03.599 other objects in your database, that's
00:08:04.960 also a tree. Um so this is how SQLite
00:08:08.639 organizes uh its data.
00:08:11.840 But when we're dealing with trees that
00:08:13.360 need to persist to disk, um, one
00:08:15.840 trade-off that we have to be conscious
00:08:17.120 of is that the size of the nodes in our
00:08:19.919 tree is going to be important in terms
00:08:21.440 of performance. In particular, if we
00:08:23.199 make those nodes really small, then
00:08:25.039 every time we need to do something to
00:08:26.319 work with that tree that involves uh
00:08:28.319 fetching a node or saving a node, we're
00:08:29.840 going to be doing tiny little disk
00:08:31.120 operations and lots and lots of them,
00:08:32.560 which will end up being quite
00:08:33.519 inefficient. There's a bit of an
00:08:35.120 overhead for every IO operation you do.
00:08:37.440 and disks perform a
00:08:39.839 lot better if you can batch that work up
00:08:41.519 and kind of write chunks at a time. So
00:08:44.159 the way SQLite deals with this is that
00:08:46.399 all its trees are built out of these
00:08:47.920 fixed-size blocks called pages. By
00:08:50.080 default they're 4K and they'll always be
00:08:51.680 the same size um within a single
00:08:53.839 database. So we'll assume just the
00:08:56.080 default for now. They're always 4K
00:08:57.600 blocks. Um that means that the pages can
00:09:00.080 contain data for multiple rows as well
00:09:02.080 as the pointers to the other um pages in
00:09:04.800 order to kind of form the tree
00:09:05.920 structure. Then the SQLite data file is
00:09:09.200 literally just all of those 4K pages
00:09:11.839 written out in their page number order
00:09:13.760 in a file. That's the file format of a
00:09:15.519 SQLite database. And then whenever
00:09:17.760 SQLite needs to um read and write any
00:09:21.120 changes to the data, it's always working
00:09:22.399 on pages. So it's always doing these
00:09:23.760 like 4K reads, 4K writes kind of thing.
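
You can see this layout for yourself. A small sketch using the sqlite3 gem, assuming a freshly checkpointed database with no pending WAL frames: the data file's size is exactly page_size times page_count.

    require "sqlite3"

    db = SQLite3::Database.new("example.sqlite3")
    db.execute("CREATE TABLE IF NOT EXISTS notes(id INTEGER PRIMARY KEY, body TEXT)")

    page_size  = db.get_first_value("PRAGMA page_size")   # typically 4096
    page_count = db.get_first_value("PRAGMA page_count")

    # The file is just the pages, one after another, in page-number order.
    puts File.size("example.sqlite3") == page_size * page_count  # => true
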
00:09:27.519 Okay, I know we're getting in the weeds
00:09:29.040 a bit here, but knowing this um we can
00:09:31.680 start making our first revision to the
00:09:33.279 spec to get more detail in there. And
00:09:35.839 now we'll just note that when we say
00:09:37.600 we're going to capture transactions for
00:09:38.959 a primary, what we're probably going to
00:09:40.320 do is capture these changed pages
00:09:42.320 because that's how SQLite works with
00:09:43.839 them. So at least now we have a little
00:09:45.600 bit of a notion of the the file format
00:09:47.440 of the data that we're dealing with.
00:09:50.480 Um, but what we don't know yet
00:09:52.160 is how do we find which of the pages are
00:09:54.800 the changed pages whenever there's a
00:09:56.480 transaction. So we'll continue poking
00:09:59.279 around in SQLite and we can find the
00:10:00.640 answer to that, this time looking more
00:10:02.720 at how SQLite updates data when it
00:10:04.640 changes.
00:10:06.480 So we just said that um
00:10:09.440 reads and writes always happen as whole
00:10:11.040 pages and we also said that SQLite's
00:10:12.959 data file is just these pages laid out
00:10:14.800 on disk one after the other in their
00:10:16.399 order. So when it comes time to persist
00:10:18.640 a change to disk the simplest possible
00:10:21.279 thing SQLite could do would just be
00:10:24.079 write the new version of the pages
00:10:25.360 directly into that file whenever they
00:10:26.880 happen. And that would work fine if you
00:10:29.440 are the only process accessing the
00:10:31.040 database at that time, but very often
00:10:33.360 you're not. Um, often there'll be lots
00:10:35.040 of processes and threads, maybe
00:10:36.240 different Puma workers or something that
00:10:37.600 are all accessing the same database. And
00:10:39.600 so we can't just write changes on top of
00:10:41.440 other um, pages anytime we want because
00:10:44.079 we're going to trip up the work that
00:10:45.279 some other process is doing if it's
00:10:46.640 trying to read from a page we're writing
00:10:47.920 to and so on.
00:10:50.079 So if we can't do this the very simplest
00:10:52.240 thing perhaps we can do the next
00:10:54.320 simplest thing which would be just to
00:10:56.079 add a big lock to this. So anytime we
00:10:59.760 want to write the process that wants to
00:11:01.519 write could take an exclusive lock on
00:11:03.040 the whole database. Then it can write
00:11:04.560 whatever it wants to whatever pages it
00:11:06.079 needs to change. When it's done it can
00:11:07.920 release the lock and then the other
00:11:09.440 processes that want to use the database
00:11:10.880 can continue their work. This actually
00:11:13.680 works fine and um SQLite has a mode of
00:11:16.079 operation that is essentially this. Um,
00:11:18.800 but the downside to this is it's just
00:11:20.320 really bad for concurrency. If you have
00:11:22.240 a lot of processes that all want to do
00:11:24.000 things to the database, but anytime one
00:11:26.000 wants to write, all the others stop. Um,
00:11:28.720 then in practice, you'll find a lot of
00:11:30.399 time where only one of your processes is
00:11:32.160 really making any progress on its work
00:11:34.000 and all the other processes are just
00:11:35.360 kind of waiting on this lock to be
00:11:36.640 released.
00:11:38.160 Um, so to improve on this, SQLite has
00:11:41.600 another mode of operation which is
00:11:43.200 called WAL mode. WAL mode works by
00:11:46.720 adding a second data file alongside the
00:11:48.640 main data file called the WAL file,
00:11:50.320 which stands for the write-ahead log.
00:11:52.160 It's where WAL gets its name. And the
00:11:54.800 WAL file is built out of these same
00:11:56.560 fixed-size pages as the main database
00:11:58.560 file. But the WAL file is an append-only
00:12:01.040 file. So rather than overwriting
00:12:02.640 anything, you're always appending to the
00:12:04.240 end. Uh the way this gets used is that
00:12:07.600 whenever SQLite wants to record a change
00:12:09.519 to a page, it will write it onto the end
00:12:12.320 of that WAL file. And then whenever it
00:12:14.240 needs to read in a page from the
00:12:15.360 database, it'll look to see if there is
00:12:16.639 a copy of it in the WAL file first. If
00:12:18.480 there is, it can use that. Um, and if
00:12:20.880 it's not in the WAL, it will
00:12:22.399 look in the main database file. So,
00:12:24.560 another way to think of this is that
00:12:27.200 writing a copy of a page to the end of
00:12:29.279 the WAL is effectively shadowing any
00:12:31.040 previous copy you had of that page in
00:12:32.720 the WAL or in the main database
00:12:34.480 file.
00:12:36.800 It can't do this forever. You can't just
00:12:38.720 continually append to a file because
00:12:39.920 eventually the WAL file will become
00:12:41.120 massive and consume all your disk space.
00:12:42.880 So there's also a periodic task called a
00:12:44.800 checkpoint which runs when the WAL
00:12:46.880 starts to get too big and the checkpoint
00:12:48.480 will just take the latest version of
00:12:49.920 each of these pages, copy them back up
00:12:52.399 to the main database and it can throw
00:12:53.600 the WAL away and start over with a new
00:12:55.279 one. And so the WAL file will sort of
00:12:57.279 periodically rewind. But in between
00:12:59.839 those checkpoints, the point that I'm
00:13:01.360 trying to get at here is that all of the
00:13:04.079 write operations are essentially just
00:13:05.519 append, append, append onto the end of a
00:13:07.360 WAL file. So if you could see it
00:13:08.959 inside, it would probably look like
00:13:10.639 this, I think.
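
In code, switching a database into WAL mode and forcing a checkpoint looks like this (a sketch using the sqlite3 gem; file names are illustrative):

    require "sqlite3"

    db = SQLite3::Database.new("example.sqlite3")
    db.execute("PRAGMA journal_mode=WAL")   # returns "wal" once enabled

    db.execute("CREATE TABLE IF NOT EXISTS notes(id INTEGER PRIMARY KEY, body TEXT)")
    db.execute("INSERT INTO notes(body) VALUES (?)", "hello")
    # The committed pages were appended to example.sqlite3-wal, not
    # written into the main data file.

    # What the periodic checkpoint does: copy the latest version of each
    # page back into the main file and reset the WAL.
    db.execute("PRAGMA wal_checkpoint(TRUNCATE)")
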
00:13:13.680 Now, WAL mode has a couple of really
00:13:16.160 nice properties that make it work better
00:13:17.680 for concurrent workloads. Um, one of
00:13:19.519 them is that writes don't need to block
00:13:21.200 reads anymore. Um, since you're not
00:13:23.600 overwriting pages, that means if other
00:13:25.120 processes want to read the database
00:13:26.240 while a write's happening, they can
00:13:27.360 continue to do so and nothing's going to
00:13:29.360 like pull the rug out from under their
00:13:30.800 feet by changing a page they're reading.
00:13:33.040 So, that's nice. Um but from our point
00:13:35.760 of view as people who are trying to
00:13:37.120 build a replication system, the other
00:13:39.279 reason WAL mode is nice is that if
00:13:41.279 you look at the WAL file, it's
00:13:42.720 essentially uh a continuous stream of
00:13:45.040 changes that just happened
00:13:46.880 to your primary database. Um they're
00:13:48.720 even delineated by transaction. There's
00:13:50.560 little markers in the WAL that tell
00:13:51.680 you where each transaction starts and
00:13:53.200 stops. Um and this is perfect for what
00:13:55.519 we need because this tells you all of
00:13:57.199 the changes that just happen to your
00:13:58.399 primary. This starts to look a lot like
00:14:00.320 that uh replication log on our first
00:14:02.320 diagram. And if we can capture this
00:14:04.880 information and play it back on our
00:14:06.160 replica database, our replica will do
00:14:08.399 whatever the primary did. And
00:14:09.600 essentially, we're getting to the point
00:14:10.880 where replication would work.
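
Those transaction markers are part of the documented WAL format (sqlite.org/fileformat2.html): a 32-byte file header, then repeating 24-byte frame headers, each followed by one page of data, where a non-zero "database size" field marks the commit frame that ends a transaction. Here's a sketch that walks a WAL file and groups frames by transaction; it skips the salt and checksum verification a real reader needs, and the path is just an example.

    # Sketch: group the frames of a WAL file into transactions.
    def each_transaction(wal_path)
      File.open(wal_path, "rb") do |wal|
        header = wal.read(32)
        break unless header && header.bytesize == 32
        page_size = header[8, 4].unpack1("N")  # big-endian, bytes 8..11
        pending = []

        while (frame = wal.read(24)) && frame.bytesize == 24
          pgno, db_size = frame.unpack("NN")
          pending << [pgno, wal.read(page_size)]

          # A non-zero db_size marks a commit frame: everything gathered
          # since the last commit is one transaction we could ship.
          if db_size != 0
            yield pending
            pending = []
          end
        end
      end
    end

    each_transaction("storage/production.sqlite3-wal") do |pages|
      puts "transaction touched pages #{pages.map(&:first).inspect}"
    end
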
00:14:14.959 Knowing all that, um, once again, we can
00:14:17.279 revise our spec with a bit more detail.
00:14:18.880 And just to note that capturing
00:14:20.720 transactions now means that we'll
00:14:21.920 capture the changed pages of each
00:14:23.199 transaction from the WAL. I realized
00:14:25.040 that was a really long digression just
00:14:26.320 to get three more words in our
00:14:28.240 description there, but they're important
00:14:29.600 words. They're doing a lot of work
00:14:30.800 there.
00:14:33.279 So at this point we know quite a bit
00:14:35.680 about what we need to do for step
00:14:38.240 one here. But the part we're missing I
00:14:39.760 think is the timing-related part. We
00:14:41.600 know what format of data we're
00:14:43.120 capturing. We know where we're going to
00:14:45.360 find it. But we don't know how do we
00:14:47.760 know when there is a new transaction
00:14:49.120 ready to be captured, if you see what I
00:14:50.800 mean. We need the time aspect of
00:14:52.399 this.
00:14:53.920 So once again we'll go looking at SQLite
00:14:56.560 for this, uh but this time we're taking
00:14:58.800 kind of a high level view on SQLite and
00:15:01.600 this is what its overall architecture
00:15:03.120 looks like. So SQLite's built with a
00:15:05.120 layered architecture um where each of
00:15:07.199 the layers does one specific task and it
00:15:09.199 does that by using the services of the
00:15:11.120 layer below it. So code that you write
00:15:13.360 or active record or whatever will
00:15:15.199 interact at the top level with this
00:15:16.560 interface. It can send in SQL commands
00:15:18.720 which go to the SQL command processor
00:15:20.720 where they'll be parsed out. um an
00:15:22.639 execution plan is devised and they get
00:15:24.320 turned into byte code which runs on a
00:15:25.839 virtual machine. That virtual machine
00:15:28.480 carries out that program by manipulating
00:15:30.399 the trees we were looking at earlier
00:15:31.760 using a layer called the B-tree and the
00:15:34.160 B-tree is going to need to read and
00:15:35.440 write those 4K pages we looked at
00:15:36.959 earlier um which it does with a layer
00:15:38.959 called the pager.
00:15:41.120 The pager is going to need to work with
00:15:42.800 files on disk for this. But it doesn't
00:15:45.120 talk uh directly to the operating system
00:15:46.959 to do this. Instead, it goes through
00:15:48.160 this last layer at the bottom called the
00:15:49.519 virtual file system or VFS. The VFS
00:15:52.800 exists mainly for portability. It's one
00:15:54.720 of the reasons that SQLite is quite easy
00:15:56.160 to port to different uh operating
00:15:58.079 systems and platforms because if you
00:16:00.399 make a VFS for your operating system,
00:16:02.560 then all these other layers can operate
00:16:04.079 on top of it and all of SQLite can run
00:16:05.920 there.
00:16:08.399 The other interesting thing about
00:16:10.560 the virtual file system is that
00:16:14.000 in addition to the built-in one, so
00:16:15.440 SQLite has a virtual file system for
00:16:16.880 Windows and for Unix and so on. Uh, but
00:16:19.199 you can also build your own and you can
00:16:20.720 load in new ones at runtime and tell
00:16:22.320 SQLite to use them. Um, so essentially
00:16:24.800 the top five layers of this diagram are
00:16:26.880 all hardwired together, but the bottom
00:16:28.639 part is pluggable. And because the
00:16:31.040 bottom part is pluggable and because you
00:16:32.560 can write your own, you can also do a
00:16:34.320 neat thing where you can wrap a file
00:16:36.399 system with another file system. And
00:16:38.399 when you do this, SQLite calls this a
00:16:39.920 VFS shim. And so a VFS shim is a way
00:16:43.199 that you can get your own code into this
00:16:45.519 stack. Um, and your own code can be
00:16:47.519 involved in the interaction between the
00:16:49.040 pager and the virtual file system. But
00:16:50.560 you don't have to build a whole file
00:16:51.759 system to do it. You can just delegate
00:16:53.120 the work to the file system below it.
00:16:55.680 Right? So that's good news for us. um
00:16:59.199 particularly because one of the things
00:17:01.279 that happens at that level is that
00:17:03.040 whenever the pager records a
00:17:05.520 new transaction to the database and
00:17:07.280 commits it and records it into the WAL,
00:17:08.880 it fires us a particular event into the
00:17:11.120 virtual file system to let it know that
00:17:12.720 that happened. So the reason that's good
00:17:15.280 for us is because it gives us the last
00:17:17.360 piece of information we need to figure
00:17:18.880 out how to implement step one of our
00:17:20.400 replication, which is that if we
00:17:22.160 implement a VFS shim, we'll be able to
00:17:24.720 intercept the commits, capture those
00:17:26.400 change pages out of the wall for every
00:17:28.000 transaction as it happens. Then we can
00:17:30.160 transmit to those to the replicas over
00:17:32.320 HTTP.
00:17:34.160 And now we just need to figure out how
00:17:35.360 to do the step on the replica part
00:17:36.960 three.
00:17:38.559 Um, so figuring out part three also
00:17:41.039 requires going back into SQL light, but
00:17:42.640 I will say uh part three is way easier
00:17:44.960 than part one. So this one won't take
00:17:46.400 nearly as long to sort out.
00:17:50.080 So looking back at this architecture
00:17:51.919 diagram,
00:17:53.760 the transaction data that we captured
00:17:55.600 out of the WAL um was those pages. It's
00:17:58.000 pager level data and that's what we want
00:18:00.720 to apply on the replicas. But normally
00:18:02.240 when we work with SQLite, we're dealing
00:18:04.400 with things up at this interface level.
00:18:06.160 We don't want to be going through all
00:18:07.760 these layers again on the replica
00:18:09.039 because all that work already happened
00:18:10.480 before we captured the transaction. What
00:18:12.480 we really need is a way to just skip
00:18:14.320 over that. Um,
00:18:17.520 as luck would have it, SQLite has a
00:18:19.280 thing that lets you do exactly that and
00:18:20.720 skip over it. So SQLite has a virtual
00:18:22.720 table called SQLite DB page. And SQLite
00:18:26.160 DB page is not really a table. It's
00:18:28.000 actually the page you're masquerading as
00:18:29.600 a table. So if you select out of SQLite
00:18:32.000 DB page, it will return the raw 4K
00:18:35.120 binary chunks of each page to you. You
00:18:37.679 get back rows where you have a page
00:18:39.280 number and you have that that binary
00:18:41.280 data for the page. Even better than
00:18:43.679 that, you can write to SQL IDB page. Um
00:18:46.880 so if you insert into it, you can pass
00:18:48.640 in a page number, a 4K chunk of data,
00:18:50.880 and the pager will go and write that
00:18:52.480 into the database just as if it had come
00:18:54.240 from all the layers above normally. And
00:18:57.280 because the pager as we said was that
00:18:59.200 part that's responsible for transaction
00:19:01.200 safety and concurrency safety then it
00:19:03.039 will do this in a way where you can feed
00:19:05.200 data into it like this while other
00:19:06.640 people are reading the database and it
00:19:08.480 will make sure that that all happens um
00:19:10.160 in a kind of controlled and safe way. So
00:19:12.960 that turned out to be the only thing we
00:19:14.400 needed to do for step three. Um this now
00:19:17.760 gives us the whole picture and kind of a
00:19:19.760 detailed enough spec that we could go
00:19:21.039 off and build this and this would be the
00:19:22.320 replication flow that would manage
00:19:23.760 changes from a primary database to a
00:19:25.120 replica database. And it's exactly the
00:19:26.880 way Beamer does it.
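
With the sqlite3 gem, applying pages through sqlite_dbpage looks roughly like this. A sketch only: it assumes a SQLite build compiled with SQLITE_ENABLE_DBPAGE_VTAB and defensive mode off, and it glosses over details a real replicator must handle, like matching page sizes and applying each transaction's frames atomically.

    require "sqlite3"

    src = SQLite3::Database.new("primary.sqlite3")
    dst = SQLite3::Database.new("replica.sqlite3")

    # Each row of sqlite_dbpage is a page number plus that page's raw bytes.
    src.execute("SELECT pgno, data FROM sqlite_dbpage") do |pgno, data|
      # Writing the row hands the page straight to the replica's pager,
      # skipping the SQL layers above it.
      dst.execute("INSERT INTO sqlite_dbpage(pgno, data) VALUES (?, ?)",
                  [pgno, SQLite3::Blob.new(data)])
    end
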
00:19:30.240 I should add though that all this part
00:19:32.640 that we've been discussing um is kind of
00:19:35.120 that core replication flow like I said.
00:19:37.520 So um it's the main thing you need to do
00:19:40.320 if you want to replicate your databases.
00:19:42.400 Um but
00:19:44.559 it's uh it's not the only thing you need
00:19:47.440 in order to have
00:19:49.840 um like a system that you can set up and
00:19:51.840 work with and operate and kind of is
00:19:53.520 easy to use. There's a bunch of other
00:19:54.720 plumbing you're going to need. So,
00:19:55.679 Beamer does add kind of a number of
00:19:57.679 other things um to make that easier, but
00:20:02.080 it's all really built around this kind
00:20:03.760 of central idea of shipping changes.
00:20:06.240 Hopefully, that makes sense.
00:20:10.000 Okay, so that kind of covers, or we
00:20:13.120 have covered, what database replication's
00:20:14.880 for um and why it's a useful thing to have.
00:20:17.760 Now, we know how Beamer does it, and/or
00:20:20.559 how to build it yourself if you want to
00:20:22.000 build your own. So the only other thing
00:20:24.000 that I wanted to talk about at the end
00:20:25.440 was just what it looks like to use
00:20:26.720 Beamer in your applications.
00:20:30.240 Uh getting Beamer into your application
00:20:32.000 has really two parts. One is that we
00:20:33.679 need that VFS to be available um so that
00:20:36.480 we can capture the transactions because
00:20:38.080 that's the part where we implemented
00:20:39.760 that. Um so for a Rails app, we could
00:20:42.799 just use a gem. Beamer has a gem that
00:20:44.400 when you include that in your app, it'll
00:20:45.919 make sure the VFS is available. And then
00:20:47.679 in your database YAML, you can specify
00:20:49.520 which of your databases you'd like to be
00:20:51.120 replicated um and which not if you don't
00:20:53.360 want to do them all.
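
As a purely hypothetical sketch of what that opt-in could look like (Beamer isn't released yet, so the gem's real configuration keys may well differ):

    # config/database.yml (keys below are invented for illustration)
    production:
      primary:
        adapter: sqlite3
        database: storage/production.sqlite3
        replicated: true    # hypothetical flag: replicate this database
      cache:
        adapter: sqlite3
        database: storage/cache.sqlite3
        replicated: false   # hypothetical flag: keep this one local
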
00:20:55.520 Second part is that there's a Beamer
00:20:57.039 process that runs on each app server.
00:20:58.799 That process is responsible for things
00:21:00.480 like the um the network traffic between
00:21:02.640 the nodes and responding to commands
00:21:04.240 that you run and so on. Um that's a
00:21:06.559 Docker image. So you can just add that
00:21:08.480 to your Kamal deployment as an
00:21:09.840 accessory and it will run on each of
00:21:11.120 your app nodes. If you do these two
00:21:12.720 things, then Beamer is available and you
00:21:14.480 can start replicating.
00:21:17.440 At that point, um there are
00:21:20.640 several different commands
00:21:22.320 that Beamer provides that you can run to
00:21:23.919 control what the replication's actually
00:21:25.600 doing. Only one of them is really
00:21:27.520 required and that's you need a way to
00:21:29.520 tell Beamer which of your uh app servers
00:21:32.320 you would like to act as the primary. So
00:21:34.240 this command beamer switch sets the
00:21:36.320 current primary in a cluster of um app
00:21:38.720 servers. When you do this, um, the
00:21:42.720 machine that you specify to be the
00:21:44.240 primary will allow you to write to the
00:21:46.080 databases there. It will capture the
00:21:47.600 changes and make them available to be
00:21:49.039 replicated to the other machines. All
00:21:50.640 the other machines that are not the
00:21:51.919 primary will follow along with the
00:21:54.080 changes on the databases of that
00:21:55.440 machine. They'll also set their copy of
00:21:57.360 the databases into a read-only mode so
00:21:59.120 that you don't accidentally make a
00:22:00.400 change on a replica when you shouldn't.
00:22:03.520 Um, you can run beamer switch at any
00:22:05.280 time. So if you need to do a failover
00:22:06.720 like if uh app one dies or you just want
00:22:09.120 to move your primary somewhere else so
00:22:10.480 you can take app 01 offline then you can
00:22:12.480 just call beamer switch again and all
00:22:13.679 these nodes will reconfigure themselves
00:22:15.120 to start replicating in a different
00:22:16.640 direction. Um and you could obviously
00:22:18.559 automate this as well if you want to put
00:22:20.000 it in something like Consul so that
00:22:21.360 when a machine dies it will
00:22:22.480 automatically elect a new one.
00:22:26.320 Um there are several other beamer
00:22:28.799 commands as well. We're not going to go
00:22:30.000 through them all um in detail, but I
00:22:32.400 just put a few of them up here so you
00:22:33.600 can get a sense of the kinds of things
00:22:34.960 that it can do.
00:22:38.400 And then the other part that I wanted to
00:22:39.840 talk about in terms of using Beamer is
00:22:41.919 that when you've set up a cluster of
00:22:43.600 machines and you've got replication
00:22:44.880 running between them and now you have
00:22:46.320 read-only replicas, you're going to want
00:22:47.840 to be able to serve some of your app
00:22:49.120 traffic to them. So to be able to do
00:22:51.600 that, we've added load balancing to
00:22:53.440 Kamal Proxy. So it now has the ability
00:22:55.200 to load balance traffic amongst a number
00:22:57.039 of nodes, but it also knows how to
00:22:59.200 split that traffic between reads and
00:23:00.640 writes um based on the HTTP method
00:23:02.799 that's in each of those requests. So you
00:23:04.720 can do something like this where you set
00:23:06.000 up a set of machines which uses beamer
00:23:08.480 to replicate. So you have one primary um
00:23:10.799 in that orange color and a couple of
00:23:12.240 read replicas and then commal proxy will
00:23:14.880 send all of the rights to the primary
00:23:16.480 and all the reads will be load balanced
00:23:18.000 amongst the other machines there.
00:23:21.440 This kind of setup is also the um the
00:23:24.240 basis of how you can start doing the
00:23:25.919 thing of moving data nearer to customers
00:23:27.840 to reduce their latency. So the simplest
00:23:31.120 example of that I think would be that
00:23:33.039 you can make two deployments of your
00:23:34.480 app. You can use some kind of geodns
00:23:36.240 service that will route customers
00:23:37.600 traffic to the nearest point um
00:23:40.159 depending on where the traffic
00:23:41.280 originates. Um you could make a main
00:23:44.080 deployment in Europe where you have uh
00:23:45.840 your primary database and a couple of
00:23:47.280 read replicas but you can also make like
00:23:48.799 an outpost deployment in the US which
00:23:51.039 has read replicas there. So then your US
00:23:53.360 customers their traffic's going to land
00:23:54.960 on that load balancer in the US and all
00:23:57.360 their read requests will be much faster
00:23:59.360 because they're uh coming from those
00:24:01.360 read replicas that are in the US and
00:24:02.799 have much less latency.
00:24:04.880 Obviously in this setup there's still
00:24:06.240 only one primary which is in Europe. So
00:24:08.320 the US customers will still be paying
00:24:10.320 the latency to get to Europe whenever
00:24:12.159 they do a write. That might be fine. I
00:24:14.559 mean, if you're an app like Basecamp 4
00:24:16.559 and 94% of your requests are reads, then
00:24:18.720 just doing this has made 94% of your
00:24:21.039 requests faster and the other 6% are
00:24:22.960 just the same as they always were. So
00:24:24.559 that's already kind of helpful. Um, but
00:24:27.279 if you do want to get a bit fancier, you
00:24:28.960 can always segment your customers into
00:24:30.640 different writable databases and then
00:24:32.080 start putting those in different places
00:24:33.520 as well. And that way you can have
00:24:35.520 your US customers live in a database
00:24:37.200 that's in the US and they'll get faster
00:24:39.200 reads and writes and same for Europe.
00:24:41.120 And obviously this can be expanded out
00:24:42.640 to multiple different places.
00:24:47.520 Okay. So the last thing I wanted to
00:24:49.600 leave you with before I wrap up is just
00:24:51.360 um a note on the current status of where
00:24:53.039 the Beamer project is right now. Um so
00:24:55.919 we're using this internally. It's what
00:24:57.520 powers this new app Fizzy that we've
00:24:59.039 been talking about a few times today and
00:25:00.880 which we're planning to release soon.
00:25:02.000 And so at the point where Fizzy
00:25:03.120 releases, that's when Beamer will start
00:25:04.400 having kind of real production traffic.
00:25:06.799 Um, so that should be soon. Um, we are
00:25:09.600 still experimenting a little bit with
00:25:11.039 different strategies for how to operate
00:25:13.120 this multi-region, multi-writer kind of
00:25:15.840 setup like we saw in the last slide,
00:25:17.919 because you can do it now, but it feels
00:25:19.279 a little bit more complicated to set up
00:25:20.720 than it should be. Um, so we have some
00:25:22.640 ideas about trying to make that dead
00:25:24.000 simple. Um, but in any case, all of this
00:25:26.400 is going to be open source soon. You can
00:25:27.840 try it out for yourself and if you do, we'd
00:25:29.360 love to hear what you think of it.
00:25:30.720 Thanks very much.