00:00:07
I want to talk today about Beamer, a lightweight replication tool for SQLite. We'll start with a quick review of what database replication does and why it's useful. For the bulk of the talk, we'll look under the hood to see how Beamer works by setting ourselves the task of building a replication system from scratch. We'll work through the points we need to solve, and by the end we'll understand how Beamer operates. Finally, we'll take a couple of minutes to look at what it's like to use Beamer in your Rails application and in your deployments.
00:00:47
If anyone is unfamiliar with database replication, it's simply a way to have multiple copies of your database spread across multiple machines. Those copies are live: changes made to one copy are visible in the others, and you can access any of them to run queries and do work.
00:01:06
Replication is often deployed with a single writable instance called the primary and multiple read-only instances called replicas. It's not the only replication model, but it's a very common one and it's the approach we'll focus on in this talk.
00:01:25
There are three main reasons replication is useful. The first is redundancy: replication helps you avoid single points of failure. If all your database data lives on a single machine, that machine becomes a major liability if something goes wrong, so replication spreads the risk across multiple machines.
00:01:44
The second reason is horizontal scaling: replicas allow you to scale read operations across multiple machines. Many applications are read-heavy — for example, Basecamp 4 averages around 94% reads and only 6% writes — so adding read-only replicas increases read capacity and frees up the primary to handle more writes.
00:02:28
The third reason is isolation: replication lets you run disruptive or heavy queries on machines that are not serving customer traffic. If you need to run heavy data analysis that might slow down a production server, you can replicate the data to a separate machine and run the work there, isolating any impact.
00:02:57
A bit of context on why I dove into replication: we're building a new app at 37signals called Vizzy, which is built on SQLite and uses a large number of database files. We also plan to deploy the application across multiple geographic regions to reduce latency for customers by moving data closer to them. We needed a way to manage replication of our SQLite databases across all these machines, and that exploration led us to build Beamer. We wanted it to be easy to use, not just for our app but for anyone building a similar architecture.
00:04:03
Now that we've covered what replication is for and why it's useful, we'll dig into the nuts and bolts of how Beamer works by trying to build a replication system from scratch. This will involve examining some internals of SQLite. There's a lot of detail, so I'll speedrun some of the SQLite parts; it may sound a bit wonky at first, but it will make sense by the end.
00:04:36
Here's our initial specification of replication. A replication system needs to do three things reliably: capture the changes that happen to a primary database, transmit those changes to a replica, and apply the same changes on the replica. If you can do these three steps, your replicas will behave like the primary and you've implemented replication. You can picture a primary and a replica with a constant stream of changes flowing from the primary to the replica: a replication log.
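To make the three-step spec concrete, here's a minimal sketch in Python. Everything in it is a hypothetical stand-in, not Beamer's actual code: the change stream is modeled as a list of (page number, bytes) pairs, and "transmit" is just handing that list over rather than sending it over HTTP.

```python
# Conceptual sketch of the three-step spec: capture, transmit, apply.
# "Primary" and "Replica" are illustrative stand-ins, not real Beamer classes.

class Primary:
    def __init__(self):
        self.pages = {}   # page_number -> page bytes
        self.log = []     # captured changes awaiting transmission

    def write(self, page_number, data):
        self.pages[page_number] = data
        self.log.append((page_number, data))   # step 1: capture

    def drain(self):
        # Step 2: transmit. Here we just hand the batch over; in a real
        # system this batch would travel to the replica over HTTP.
        changes, self.log = self.log, []
        return changes

class Replica:
    def __init__(self):
        self.pages = {}

    def apply(self, changes):
        for page_number, data in changes:      # step 3: apply
            self.pages[page_number] = data

primary, replica = Primary(), Replica()
primary.write(1, b"hello")
primary.write(2, b"world")
replica.apply(primary.drain())
print(replica.pages == primary.pages)   # True: replica now mirrors the primary
```

If all three steps run reliably, the replica's state is always a faithful copy of the primary's, which is the whole game.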
00:05:16
We should refine that spec. First, 'change' is vague: in relational databases the unit of change is a transaction, so we'll capture transactions on the primary and transmit and apply those transactions. Second, step two — transmitting data — is a generic networking problem. We'll use HTTP as the transport for this data and treat it as a solved problem; that leaves us to solve the database-specific parts of capturing and applying changes.
00:06:23
To capture transactions we need to understand how SQLite stores its data. Logically your data is tabular, but that's not how it's stored on disk. If tables were laid out exactly as rows on disk, common operations like inserting a row at the start would require rewriting a lot of data. Databases therefore use tree structures to organize data, letting you insert or change nodes without rewriting entire tables. SQLite represents tables, indexes, and the schema catalog as trees.
00:07:11
When trees must persist to disk, node size matters for performance. Tiny nodes cause many small I/O operations, which are inefficient. SQLite organizes trees using fixed-size blocks called pages, which are 4KB by default and consistent within a database. Pages contain multiple rows and pointers to other pages to form the tree structure. The SQLite data file is literally all of those 4KB pages written out in page number order, and SQLite always reads and writes data at the page granularity.
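You can see the page structure directly in a SQLite file. This short script, a sketch using only the standard library, creates a database and then reads the documented 100-byte file header: the magic string, and the page size stored as a big-endian 16-bit integer at offset 16.

```python
import os
import sqlite3
import struct
import tempfile

# Create a small database, then inspect its file header to confirm the
# file really is a whole number of fixed-size pages.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE t (v TEXT)")
conn.execute("INSERT INTO t VALUES ('hello')")
conn.commit()
conn.close()

with open(path, "rb") as f:
    header = f.read(100)

magic = header[:16]                                 # b"SQLite format 3\x00"
(page_size,) = struct.unpack(">H", header[16:18])   # big-endian u16
if page_size == 1:                                  # the value 1 means 65536
    page_size = 65536
file_size = os.path.getsize(path)

print(magic)
print(page_size, file_size % page_size)   # remainder 0: whole pages only
```

The remainder is always zero: the main database file contains nothing but complete pages, written out in page-number order.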
00:09:00
Given that SQLite works in terms of pages, we revise our spec: capturing transactions will mean capturing the changed pages for each transaction. We now understand the file format we're dealing with, but we still need to know how to identify which pages changed during a transaction.
00:09:51
Let's look at how SQLite updates data. The simplest approach would be to write the new page versions directly into the database file when they change, but that fails when multiple processes access the database concurrently: writing over a page another process is reading would break things. One option is to take a big exclusive lock on the whole database while writing, which SQLite supports as one of its modes of operation. That works but is terrible for concurrency: whenever one process writes, all others block, and only one process makes progress at a time.
00:11:38
SQLite improves concurrency with WAL mode (write-ahead log). WAL adds a second file alongside the main database called the WAL file. The WAL file uses the same fixed-size pages but is append-only. Whenever SQLite records a change to a page, it appends the new version to the end of the WAL file. When reading a page, SQLite checks the WAL first for a newer copy and falls back to the main database file if needed. In effect, writing a page to the WAL shadows previous copies without overwriting them.
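Switching into WAL mode is a one-line pragma. This sketch enables it and then shows the -wal file appearing next to the database once a transaction commits:

```python
import os
import sqlite3
import tempfile

# Turn on WAL mode and watch the WAL file appear alongside the database.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

conn.execute("CREATE TABLE t (v TEXT)")
conn.execute("INSERT INTO t VALUES ('hello')")
conn.commit()

# Every committed page version lands in demo.db-wal as an append,
# shadowing older copies in the main file without overwriting them.
wal_exists = os.path.exists(path + "-wal")
print(mode, wal_exists)
```

While the connection is open, all new page versions accumulate in that -wal file; the main database file is only touched at checkpoint time.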
00:13:13
The WAL cannot grow forever, so SQLite periodically runs a checkpoint that copies the latest version of each page from the WAL back into the main database file and then resets the WAL. Between checkpoints, all write operations are simple appends to the WAL file. From the standpoint of a replication system, the WAL is a continuous, append-only stream of changes delineated by transactions, which looks a lot like the replication log we need.
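You can trigger a checkpoint by hand and watch the WAL reset. In this sketch, `PRAGMA wal_checkpoint(TRUNCATE)` copies the latest pages back into the main file and then truncates the WAL to zero bytes:

```python
import os
import sqlite3
import tempfile

# Build up some WAL content, then checkpoint and watch the WAL reset.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE t (v TEXT)")
conn.execute("INSERT INTO t VALUES ('hello')")
conn.commit()

before = os.path.getsize(path + "-wal")
# Returns (busy, frames in the WAL, frames checkpointed).
busy, log_frames, ckpt_frames = conn.execute(
    "PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
after = os.path.getsize(path + "-wal")

print(before > 0, after == 0, busy == 0)
```

Between checkpoints, every write is an append to the WAL, which is exactly the append-only stream of changes a replication system wants to tap.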
00:14:14
With that in mind, we update our spec: capturing transactions means capturing the changed pages of each transaction from the WAL. Those three words — 'changed pages from the WAL' — capture the core of how we'll record what happened on the primary.
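The WAL file format is simple enough to parse by hand, which shows how "changed pages from the WAL" works in practice. Per SQLite's documented format, the file starts with a 32-byte header (page size at offset 8), followed by frames of a 24-byte frame header plus one page; a nonzero "database size" field in a frame header marks the commit frame that ends a transaction. A sketch:

```python
import os
import sqlite3
import struct
import tempfile

# Write a couple of transactions, then parse the WAL file frame by frame.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE t (v TEXT)")
conn.execute("INSERT INTO t VALUES ('hello')")
conn.commit()

with open(path + "-wal", "rb") as f:
    wal = f.read()

(page_size,) = struct.unpack(">I", wal[8:12])   # from the 32-byte WAL header
frames = []
offset = 32                                      # skip the WAL header
while offset + 24 + page_size <= len(wal):
    # Frame header starts with page number and post-commit database size;
    # a nonzero size means this frame commits a transaction.
    page_no, commit_size = struct.unpack(">II", wal[offset:offset + 8])
    page_data = wal[offset + 24 : offset + 24 + page_size]
    frames.append((page_no, commit_size != 0, page_data))
    offset += 24 + page_size

# The changed pages of each transaction, ready to ship to a replica.
print([(n, is_commit) for n, is_commit, _ in frames])
```

Grouping frames up to each commit frame yields exactly the per-transaction batches of changed pages the spec calls for.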
00:14:33
We now know what data to capture and where to find it, but we still need the timing information: how do we know when a new transaction is committed and ready to capture? We need a way to detect commits as they happen.
00:15:01
Looking at SQLite's architecture helps. SQLite is layered: SQL commands are parsed and compiled to bytecode and executed by a virtual machine that manipulates trees through a B-tree layer. The B-tree layer reads and writes pages through a pager layer, and the pager uses a virtual file system (VFS) layer to interact with files on disk. The VFS is pluggable and exists for portability: you can implement and load your own VFS at runtime. SQLite also supports VFS shims, which wrap an underlying VFS and let you insert custom code between the pager and the file system. Importantly, when the pager commits a transaction and writes it to the WAL, that write passes through the VFS. By implementing a VFS shim we can intercept commits and capture the changed pages from the WAL as transactions occur, then transmit those changes to replicas over HTTP.
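The shim idea can be sketched in plain Python, with the caveat that this is purely conceptual: the real thing is a C shim implementing SQLite's sqlite3_vfs interface, and `BaseFile`, `CaptureShim`, and the simulated pager calls below are illustrative stand-ins, not SQLite APIs.

```python
# Conceptual sketch of a VFS shim: pass writes through to the underlying
# layer while recording them, and treat a sync as the commit signal.

class BaseFile:
    """Stand-in for the lower-level VFS file."""
    def __init__(self):
        self.data = b""

    def write(self, data, offset):
        self.data = self.data[:offset] + data

class CaptureShim:
    """Wraps the underlying file, recording every write it forwards —
    the same interposition point Beamer's shim occupies on the WAL."""
    def __init__(self, inner, on_commit):
        self.inner = inner
        self.on_commit = on_commit
        self.pending = []

    def write(self, data, offset):
        self.pending.append((offset, data))    # capture
        self.inner.write(data, offset)         # pass through

    def sync(self):
        # Conceptually, the pager syncs the WAL at commit time; that's
        # when the shim hands the captured pages to the replicator.
        self.on_commit(self.pending)
        self.pending = []

captured = []
shim = CaptureShim(BaseFile(), captured.append)

# Simulate the pager committing one transaction through the shim.
shim.write(b"page-5-bytes", 0)
shim.write(b"page-9-bytes", 12)
shim.sync()
print(len(captured), len(captured[0]))   # one commit, two captured writes
```

The key property is that the shim is invisible to the layers above it: the pager writes and syncs as usual, and the capture happens as a side effect in between.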
00:17:34
The final piece is applying those pages on a replica. Normally you interact with SQLite at the SQL layer, but we already have the pager-level data (the pages). Fortunately, SQLite provides a virtual table called sqlite_dbpage that exposes raw pages as blobs: selecting from it returns rows with a page number and the binary page data. More importantly, you can insert into sqlite_dbpage to write a page into the database. Writing pages this way goes through the pager, which enforces transactional and concurrency safety, so on the replica we can apply the captured WAL pages directly using sqlite_dbpage and let the pager handle correctness.
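Here's a small probe of the read side of sqlite_dbpage. One assumption to flag: the virtual table is only present when SQLite was compiled with SQLITE_ENABLE_DBPAGE_VTAB, so this sketch checks for it rather than assuming it. Applying pages on a replica would use the write side, an `INSERT INTO sqlite_dbpage(pgno, data) VALUES (?, ?)` per captured page.

```python
import os
import sqlite3
import tempfile

# sqlite_dbpage exists only in builds compiled with
# SQLITE_ENABLE_DBPAGE_VTAB, so probe for it before relying on it.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE t (v TEXT)")
conn.execute("INSERT INTO t VALUES ('hello')")
conn.commit()

try:
    # Each row is (page number, raw page blob) for every page in the file.
    rows = conn.execute(
        "SELECT pgno, length(data) FROM sqlite_dbpage").fetchall()
    status = "ok"
    print(rows)
except sqlite3.OperationalError:
    status = "dbpage-unavailable"
print(status)
```

Because those inserts go through the pager like any other write, the replica gets the same transactional guarantees as normal SQL traffic.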
00:19:30
That gives us a complete, detailed specification for replication: intercept commits using a VFS shim on the primary, capture changed WAL pages for each transaction, transmit them over HTTP, and apply them on replicas via sqlite_dbpage. This is the core replication flow and it's exactly how Beamer works.
00:20:06
The core flow is the main requirement, but a usable system needs additional plumbing. To use Beamer in a Rails app, you first need the VFS available so you can capture transactions. Beamer provides a gem that ensures the VFS shim is loaded. In your database YAML you can specify which databases you want to replicate.
00:21:17
Second, Beamer runs a process on each app server to handle networking and control commands; that process is provided as a Docker image and runs as an accessory on each node. Beamer provides commands to control replication. The essential command is beamer switch, which designates which app server should act as primary. The primary allows writes, captures changes, and makes them available for replication. All other machines follow the primary and put their copies of the databases into read-only mode so you don't accidentally write to replicas. You can run beamer switch at any time to perform failover or reconfigure which node is primary, and you can automate this for automatic election when a machine fails.
00:22:26
There are other Beamer commands to control and observe replication, and we won't cover them all here. To serve traffic to read-only replicas, we've added load balancing to Kamal Proxy: it can load balance traffic among nodes and split reads and writes based on the HTTP method. In a typical setup, Kamal Proxy will route writes to the primary and distribute reads across replicas.
00:23:38
This setup also enables moving data closer to customers to reduce latency. For example, you can have a main deployment in Europe with the primary and some replicas, and an outpost deployment in the U.S. with read replicas. A geodns service can route users to the nearest deployment so their reads are served from nearby replicas. Writes will still go to the single primary (in this example, Europe), which may be fine for read-heavy applications. If you need lower write latency in multiple regions, you can segment customers into separate writable databases and place those databases in different regions.
00:24:47
A quick note on Beamer's status: we're using it internally to power Vizzy, the app we mentioned, and we expect Beamer to see real production traffic when Vizzy releases soon. We're still experimenting with strategies for multi-region and multi-writer setups to make them simpler to operate. In any case, Beamer will be open source soon, and we'd love feedback when you try it.