00:00:02
Welcome to Rails World, everybody. I'm so happy to welcome you to Rails World. Yesterday I learned something interesting: if you use a specific green color as the background on your slides, it becomes transparent. The display up here shows my slides on that green background, and the video system keys it out for you. I wanted to do that in my slides today because I wanted to be completely transparent with all of you.
00:00:57
Happy Friday, everybody. It is always Friday somewhere, and today it is Friday here. I had planned my keynote in advance and was going to talk about system tests using my Mac, but since I can't do that today, I'm going to change topics.
00:01:29
I'm going to talk about David's keynote. I really enjoyed it. We all know how much David thinks about the Roman Empire, and I'm excited about the new framework, Action Push. Fun fact: it was originally called Action Push Native, but the rest of the core team gave David some pushback, so he renamed it. I also enjoyed his presentation about Omarchy. Actually, it's pronounced 'omachi' in Japanese, and the phrase is very popular there. (Side note: it's GNU/Omarchy; sorry, I couldn't resist correcting him on stage.)
00:02:52
I didn't have David growing a neck beard on my Rails World bingo card this year, but I was excited by the '30,000 assertions in 2 minutes' demo. I thought I could beat that, so I demonstrated running 30,001 assertions using Minitest (or, at that scale, a 'megatest') and got it down to about 10 milliseconds on my old MacBook Air M4.
00:03:49
My name is Aaron Patterson, Tenderlove on the internet. I've been on the Ruby core team since 2009 and the Rails core team since 2011. I also speak some Japanese and want to teach you a handy phrase: 'shōshō omachi kudasai', which means 'Could you wait a little bit, please?' You'll hear it all the time in Japan, in restaurants and hotels. The key word in there is 'omachi', or 'o-machi'.
00:04:48
I work as a Senior Staff Engineer at a small company called Shopify. We use Ruby and Rails, and I think we run one of the biggest Rails apps in the world (as measured, at least, by the font size on this slide). Since this is the last talk of the conference, I joked about keeping things light, but actually I'm going to give a very technical presentation about the work my team has been doing, along with some pro tips for Rails developers.
00:05:38
At Shopify I'm on the Ruby and Rails Infrastructure team. A big goal for us is improving machine utilization by working on performance. Concretely, we want to increase the amount of parallel work a machine can handle without increasing latency. That could mean serving more web requests, running faster test suites, or doing other parallel workloads. The requests we receive are unpredictable: some are I/O-bound and some are CPU-bound. For example, on a four-core machine we might pre-fork six worker processes so we can handle six requests in parallel. If those requests are I/O-bound, CPUs sit idle and we could take on more work; if they're CPU-bound, six processes fight over four CPUs, increasing latency and causing noisy-neighbor problems. This issue affects all web servers, and the root cause is that Ruby historically has had only one construct for CPU-bound parallelism: processes.
00:10:40
To illustrate, we benchmarked a Fibonacci computation. Running sequentially on my MacBook M4 takes about two seconds. Running the same work using threads, or fibers via an async framework, took the same two seconds: no speedup, because the GVL allows only one thread to run Ruby code at a time. Running it with processes got us down to roughly 480 milliseconds, because processes can run in parallel on multiple CPUs.
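As a rough sketch of what that benchmark looks like (the workload size and process count here are my own assumptions, not the exact numbers from the talk):

```ruby
require "benchmark"

# Naive Fibonacci: pure CPU work, no I/O.
def fib(n) = n < 2 ? n : fib(n - 1) + fib(n - 2)

N = 8 # independent chunks of work

Benchmark.bm(10) do |x|
  # One chunk after another on a single core.
  x.report("serial") { N.times { fib(30) } }

  # Threads interleave, but the GVL lets only one run Ruby at a time.
  x.report("threads") do
    N.times.map { Thread.new { fib(30) } }.each(&:join)
  end

  # Each process has its own GVL, so the work spreads across cores.
  x.report("processes") do
    N.times { fork { fib(30) } }
    Process.waitall
  end
end
```

(Note that fork is Unix-only, and this sketch throws away the child processes' results; real pre-forking servers pay extra cost to ship results back to the parent.)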
00:11:45
What we'd like is a load-aware web server: if the first requests are I/O-bound, we can handle more concurrent requests by starting more processes; if the load becomes CPU-bound, we should stop accepting more work on that machine. One idea is using HTTP/2 to provide back pressure: lower the server's advertised max concurrent streams (SETTINGS_MAX_CONCURRENT_STREAMS) or otherwise inform the proxy that the server is busy. This remains theoretical, but it could help us load balance better. If we decide to accept a new request but need more capacity quickly, we have a few options: fork a process (can we fork fast enough?), start a new thread or fiber (they don't help for CPU-bound work), or use ractors. Ractor.new lets us allocate new ractors quickly and handle CPU-bound work in parallel, which is the context for my team's work.
00:13:25
We're attacking these problems on two fronts: multi-CPU performance with ractors, and single-core performance with a new method-based JIT compiler called ZJIT. This year our team has been improving ractor speed and usability for Ruby 3.5. John Hawthorn is leading the project, and we've been collaborating closely with Koichi Sasada, the original author of ractors. People often ask which concurrency primitive is best: threads, fibers, processes, or ractors. My short (joking) answer is: ractors.
00:15:16
Ractors are Ruby actors—an actor-style parallelism model that provides true parallelism. Ruby still has a Global VM Lock (GVL), but each ractor has its own independent GVL, so ractors can run in parallel. In our Fibonacci benchmark, ractors, like processes, brought runtime down from two seconds to about 480 milliseconds, demonstrating true parallelism. Currently ractors print a warning that they're experimental: their API and implementation have changed, and there are issues to solve. Our team's work includes stabilizing the API and fixing implementation problems so users will feel comfortable using ractors in production.
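The ractor variant of the benchmark sketch is nearly identical, assuming the Ruby 3.5 API where Ractor#value returns the block's result:

```ruby
def fib(n) = n < 2 ? n : fib(n - 1) + fib(n - 2)

# Each ractor has its own GVL, so these run on separate cores,
# much like the process version, but within a single process.
ractors = 8.times.map { Ractor.new { fib(30) } }
p ractors.map(&:value)
```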
00:17:26
There are important behavior rules for ractors. You cannot share mutable objects between ractors; mutable objects are copied when they cross ractor boundaries. For example, a mutable string sent via a queue between threads retains the same object id, but when sent between ractors it gets duplicated and has a different object id. If you make the objects immutable (for example, by enabling frozen string literals or freezing the object), ractors can share them without copying. For deeply nested structures, use Ractor.make_shareable to deeply freeze the structure, or use libraries that return frozen data (like JSON.parse with freeze: true) so you can pass the result between ractors without copying.
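Here's a small sketch of that copy-versus-share behavior; the Ruby 3.5 messaging API (Ractor#send, Ractor.receive, Ractor#value) is assumed:

```ruby
# A ractor that reports the object id of whatever it receives.
r = Ractor.new { Ractor.receive.object_id }

s = "hello"                   # mutable string
r.send(s)
p r.value == s.object_id      #=> false: copied across the boundary

f = "world".freeze            # immutable string
r2 = Ractor.new { Ractor.receive.object_id }
r2.send(f)
p r2.value == f.object_id     #=> true: shared without copying

# Deeply nested structures can be deep-frozen in one call:
config = Ractor.make_shareable({ retries: 3, hosts: ["a", "b"] })
p Ractor.shareable?(config)   #=> true
```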
00:21:20
In Ruby 3.5 ractors use ports to communicate. Ports are essentially queues: every ractor has a default port. Any ractor can write to a port, but only the ractor that created a port can read from it. That's a mental shift if you come from the threaded world where many threads can share and read from the same queue. With ractors, only the creating ractor reads, and others write, so we need different patterns for producer-consumer coordination.
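A minimal sketch of that model, assuming the Ruby 3.5 Ractor::Port API in which << writes and receive reads:

```ruby
results = Ractor::Port.new          # created by the main ractor, so
                                    # only the main ractor may read it

workers = 4.times.map do |i|
  Ractor.new(results, i) do |results, i|
    results << [i, i * i]           # any ractor may write to the port
  end
end

4.times { p results.receive }       # the creator drains its own port
workers.each(&:join)                # Ractor#join, also new in 3.5
```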
00:23:25
A recommended pattern is to create a coordinator ractor that collects work from producers and hands it out to worker ractors. Workers explicitly request work from the coordinator rather than pulling from a shared queue. This setup is slightly more complex, but it avoids locks and mutexes entirely: there's no Mutex#synchronize and no opportunity for deadlock. Importantly, this pattern gives you CPU parallelism in pure Ruby code.
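A sketch of that pattern under the same Ruby 3.5 port assumptions: the coordinator owns a request port, each worker owns a reply port, and work is handed out only when a worker asks for it. The names and the square-the-number 'work' are mine, purely for illustration.

```ruby
requests = Ractor::Port.new            # coordinator reads requests here

workers = 4.times.map do
  Ractor.new(requests) do |requests|
    replies = Ractor::Port.new         # this worker reads its jobs here
    results = []
    loop do
      requests << replies              # ask the coordinator for work
      job = replies.receive
      break if job == :done
      results << job * job             # stand-in for real CPU work
    end
    results
  end
end

# Coordinator loop: hand a job to whichever worker asks next.
jobs = (1..20).to_a
until jobs.empty?
  requests.receive << jobs.shift
end
workers.size.times { requests.receive << :done }

p workers.flat_map(&:value).sort       # results move back, uncopied
```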
00:25:02
There is one interesting exception to the copy behavior: a ractor's return value is not copied. Since the ractor is finished with the object, it can be handed to the receiver as-is, exactly once. If a ractor allocates an object and then returns it, the main ractor receives the exact same object (same object id), and it will not be frozen. This has useful implications. Historically, C extensions were the way to get CPU parallelism in Ruby, because C extensions can release the GVL (for example, the bcrypt gem calls a C function while releasing the GVL so other requests can be serviced). Ractors let you achieve the same kind of 'no GVL' behavior in pure Ruby by running CPU-intensive work in a ractor and returning the result.
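A sketch of that hand-off, again assuming Ractor#value from Ruby 3.5:

```ruby
r = Ractor.new do
  buf = String.new("expensive result")  # allocated inside the ractor
  [buf.object_id, buf]                  # return the id alongside the object
end

id, obj = r.value
p id == obj.object_id   #=> true: the very same object crossed over
p obj.frozen?           #=> false: it arrives mutable, ready to use
```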
00:26:33
For example, imagine BCrypt implemented in pure Ruby inside a ractor: create a ractor, perform the CPU-bound bcrypt calculation, return the result, and continue servicing other requests in your web server while the ractor runs. This approach can be used for other CPU-heavy tasks, such as parsing JSON: move the parsing into a ractor and other work can proceed in parallel. When Ruby 3.5 is available, try wrapping CPU-intensive code in ractors to improve parallelism.
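In outline, the pattern looks something like this. Here slow_hash is a toy stand-in for a pure-Ruby bcrypt, and the names are mine:

```ruby
# Toy key stretching: pure Ruby, CPU-bound, no I/O.
def slow_hash(password, rounds: 2_000_000)
  h = 0
  rounds.times { password.each_byte { |b| h = (h * 31 + b) & 0xffffffff } }
  h.to_s(16)
end

# Offload the CPU work to a ractor; it runs on another core with
# its own GVL, so this thread is free to keep serving requests.
hasher = Ractor.new("s3cret") do |password|
  slow_hash(password)
end

# ... handle other requests here while the hash is computed ...

digest = hasher.value   # collect the result when we need it
```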
00:29:12
We encountered and fixed some surprising bottlenecks while testing ractors. A user reported that parsing JSON in ractors was slower than parsing serially. The cause: when parsing JSON we deduplicate identical frozen string keys using an internal fstring table (frozen string table) so that identical keys map to the same object. That table was a global shared structure and ractors had to lock it, causing contention. John rewrote the string table to be lock-free, which made JSON parsing up to twelve times faster in that case. Round of applause for that fix.
00:31:55
The frozen string table wasn't the only global table causing contention. We found other global structures inside CRuby—an ID table for symbols, a CC table for inline caches, and an encoding table for string encodings—each of which could cause locking and contention. Our team is finding and fixing these bottlenecks so ractors can be practical in production.
00:33:06
Now I want to talk about ZJIT, a new JIT compiler shipping with Ruby 3.5. First, some definitions: Ruby's VM is YARV, a bytecode interpreter. When I say 'the interpreter' I mean the YARV virtual machine executing bytecode. A JIT compiler assembles machine code at runtime—usually lazily—and replaces interpreter steps with direct machine code execution to speed programs up. A JIT must preserve interpreter semantics, but it can eliminate interpreter overhead, cache values, embed constants, speculate and deoptimize, and remove redundant type checks, all to make code faster.
00:35:04
How can a JIT make code faster? First, by eliminating interpreter overhead so code runs directly on the CPU rather than through the bytecode interpreter. Second, by caching values—such as embedding literal constants into machine code—so code doesn't repeatedly load them from memory. Third, by speculating on values: the JIT can generate optimized code for common cases and deoptimize to interpreter semantics if assumptions break (for example, if someone later monkey-patches a method). Finally, the JIT can eliminate repeated type checks by proving types earlier and avoiding repeated verification later.
00:38:51
YJIT is a lazy basic block versioning (LBBV) compiler that ships with Ruby today. It discovers and compiles basic blocks lazily, which yields very fast warm-up, low memory overhead, and low overhead for type discovery because it pauses execution and inspects values at chosen program points. One downside is register allocation: when compiling small basic blocks independently, it's hard to allocate registers consistently across different branches, so YJIT often spills local variables to memory, which is slower than keeping values in registers. Method-based compilers look at the whole method and can allocate registers more consistently, fold constants more aggressively, and use traditional compiler techniques, but they risk compiling code paths that are never executed and require more infrastructure for tracking types at runtime.
00:46:35
ZJIT is our method-based JIT that aims to combine lessons from YJIT with a method-based approach. It is new and still maturing, so expectations should be tempered: currently ZJIT is not yet as fast or as polished as YJIT, but we intend to improve it over time. If you experiment with Ruby edge builds, you can try running with the --zjit flag, but be aware it is an early-stage effort.
00:47:56
What can Rails developers do today to be JIT-friendly? I have one pro tip: monomorphize call sites. Polymorphic call sites see many types and force the JIT or runtime to handle multiple receiver types, which slows down method dispatch and optimization. Monomorphic call sites see only one type and are easier to optimize. This applies to instance variables too: the order and presence of instance variables determine an object's shape, so initialize them consistently to help the VM assume object layouts. Focus on useful polymorphism that encodes domain behavior (strategy objects, different payment processors, and so on) and remove low-value polymorphism, such as repeatedly calling to_s in a cache lookup. Instead, make the caller consistent (always pass strings or always pass symbols) or normalize once at the boundary so call sites become monomorphic. In one case, monomorphizing call sites in Prism yielded about a 13% speed improvement even on the interpreter.
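Here's a before-and-after sketch of that cache-lookup example; the class and method names are mine:

```ruby
class PolyCache
  def initialize
    @store = {}
  end

  def read(key)
    @store[key.to_s]   # this to_s call site sees Symbols AND Strings,
  end                  # so its inline cache must handle both classes
end

class MonoCache
  def initialize
    @store = {}
  end

  def read(key)
    @store[key]        # contract: callers always pass Strings, so
  end                  # every call site in here sees exactly one type
end

poly = PolyCache.new
poly.read(:user)       # a Symbol from one caller...
poly.read("user")      # ...a String from another: polymorphic

mono = MonoCache.new
mono.read("user")      # normalized at the boundary: monomorphic
```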
00:53:36
To recap: we discussed ractors and parallelism, some surprising internal bottlenecks and fixes, how Ractor.new can act as a 'no-GVL' block for pure Ruby code, and JIT compilers—how and why they work—and differences between YJIT and ZJIT. We also covered a pro tip for writing JIT-friendly Rails code: monomorphize call sites. Please upgrade to Ruby 3.5 when you can. As a final data point: allocations are much faster in Ruby 3.5—object instantiation in one microbenchmark improved roughly 70% compared to Ruby 3.4—so upgrading should give you immediate wins.
00:55:29
Thank you—it's been an honor to be here. I'm very happy to be in Amsterdam. One last joke: I was worried whether restaurants here would split the bill, but it turns out everybody here is dining Dutch. Thank you.