
Closing Keynote

Aaron Patterson • September 05, 2025 • Amsterdam, Netherlands • Keynote

Closing Keynote at Rails World 2025

Aaron Patterson, a long-standing Ruby and Rails core team member and Senior Staff Engineer at Shopify, delivered the closing keynote at Rails World 2025. The primary focus was on innovations Shopify’s Ruby & Rails Infrastructure team is contributing to Ruby core, specifically around enabling better parallelism (via Ractors) and introducing a new method-based JIT compiler (ZJIT). Patterson also shared actionable advice for Rails developers on optimizing their code for JIT performance and parallelism.

Main Theme

The main topic addressed was improving Ruby’s concurrency, parallelism, and runtime performance, making Ruby and Rails applications faster and more scalable for production use cases.

Key Points

  • Shopify’s Infrastructure Work:

    • The team’s goal is to improve machine utilization and throughput without increasing application latency.
    • Challenges with serving a mix of IO-bound and CPU-bound requests on multicore machines and Ruby’s historical process-based parallelism limitations were discussed.
  • Parallelism and Web Servers:

    • Demonstrated the limits of Ruby threads and fibers for CPU-bound work, noting that only processes (and now Ractors) provide true CPU parallelism.
    • Described a theoretical model for a load-aware web server that uses HTTP/2 for back pressure.
  • Ractors in Ruby 3.5:

    • Patterson explained Ractors as Ruby’s actor-style concurrency solution, providing true parallelism by assigning each Ractor its own GVL (Global VM Lock).
    • Emphasized current experimental status but shared ongoing efforts to stabilize and improve them in Ruby 3.5.
    • Provided examples on using Ractors, rules for passing immutable data, and communication via ports.
    • Noted design trade-offs: Ractors are harder to use than threads, but they eliminate the need for mutexes and avoid common issues like deadlocks.
    • Illustrated a production bottleneck (JSON parsing contention) and how Shopify’s team removed global lock contention, significantly speeding up parallel JSON parsing.
  • JIT Compilers: YJIT and ZJIT:

    • Explained what JIT compilers are and how they can improve performance by removing interpreter overhead, caching values, speculating on types, and eliminating type checks.
    • Compared YJIT (lazy basic block versioning) with the new ZJIT (method-based JIT). YJIT compiles code on-demand and is efficient for warm-up and memory usage, but ZJIT can produce better register allocation and performance in some cases.
    • ZJIT is new and experimental but aims to build on YJIT’s foundations.
  • Pro Tips for Rails Developers:

    • Advised on writing JIT-friendly, performance-oriented code:
      • Monomorphize call sites (use consistent types to help the compiler).
      • Set instance variables in a consistent order.
      • Avoid unnecessary, low-value polymorphism.
    • Real-world examples demonstrated how small adjustments can produce double-digit performance gains, even without a JIT.

Conclusions & Takeaways

  • Ruby 3.5 will introduce both faster object allocations and substantial improvements to parallelism and JIT compilation.
  • Developers are encouraged to experiment with Ractors and wrap CPU-intensive work in Ractors to utilize all CPU cores efficiently.
  • Upgrading to Ruby 3.5 is recommended to achieve these performance benefits in Rails apps.

Relevant Examples & Anecdotes

  • Demonstrated benchmarks for request handling, Fibonacci calculations using threads/fibers/Ractors, and the effect of Ractor improvements on JSON parsing.
  • Discussed real-world issues Shopify’s team faced and resolved in Ruby internals related to global shared tables.
  • Used relatable programming sketches, jokes, and compiler metaphors to explain complex concepts accessibly.

Audience

This talk targets Ruby and Rails developers interested in concurrency, application optimization, and upcoming Ruby core features.

Closing Keynote
Aaron Patterson • Amsterdam, Netherlands • Keynote

Date: September 05, 2025
Published: Sat, 13 Sep 2025 00:00:00 +0000
Announced: Tue, 20 May 2025 00:00:00 +0000

In the #RailsWorld Closing Keynote, Aaron Patterson (Ruby core team member since 2009, Rails core since 2011, and Senior Staff Engineer at Shopify) talks about the work that Shopify’s Ruby & Rails Infrastructure team is tackling in Ruby core, including Ractors for better parallelism and a new method-based JIT compiler, ZJIT, and shares some pro tips for Rails developers on writing JIT-friendly code.

Rails World 2025

00:00:16.240 welcome you to Rails World. Um, yesterday I learned something really
00:00:21.439 interesting and that is that if you use a specific color like a green color as
00:00:27.599 the background on your slides, it becomes transparent like this. Yeah.
00:00:33.200 Isn't that cool? Yeah. Yeah. So, my display up here shows the slide like
00:00:38.800 this, but it's got a green background. So, they actually key it out for you. I wanted to do this in my slides
00:00:43.920 today because uh I wanted to be uh completely
00:00:49.600 transparent with all of you. Yes. Thank you. Yeah. Yeah.
00:00:57.920 Uh h happy Friday. Happy Friday everybody. Say happy Friday please. Yes.
00:01:04.640 Uh it is always Friday somewhere and today it is Friday here and I'm very happy that it is actually Friday. The
00:01:11.040 sad thing is though today I had planned out my keynote in advance, done all this
00:01:16.799 work. I was going to talk to all of you about uh system tests today
00:01:22.880 using my Macintosh.
00:01:29.040 But but since I can't do that today, uh today I'm going to talk to you about David's keynote. Uh
00:01:36.240 I really really enjoyed David's keynote. Uh Now we all know how much uh David thinks
00:01:43.520 about the Roman Empire. Uh I'm also I'm also really excited about the new framework action push.
00:01:50.159 That seems really exciting. Right. Right. Yes. It's very exciting. Uh but
00:01:55.600 I'm going to let you all in on a little bit of inside baseball. It wasn't actually originally called Action Push. It was actually originally called Action
00:02:02.479 Push Native. Like that was the original name. But uh the rest of us on the core team, we didn't really like that name so
00:02:09.679 much. Uh so we gave David some active push back. Um but the thing is when it
00:02:16.239 comes to active push, when active push comes to action shove.
00:02:22.000 David can be a reasonable person. So he renamed it and I appreciate that. Um I I
00:02:28.239 also enjoyed David's presentation about Omarchy, and I hate to correct him on stage. Wait, no, never
00:02:34.879 mind. I love to correct him on stage. Um, it's
00:02:40.720 actually GNU/Omarchy, just
00:02:52.239 anyway. Um, I I really didn't have David growing a neck beard on my Rails World
00:02:58.160 bingo card this year. Uh, yes. I was also excited about 30,000 assertions
00:03:06.239 in 2 minutes. That's really wild, right? But I mean, come on, please. I know how to beat this. I can easily beat
00:03:12.560 this. So, I want to I want to give a demonstration to you. This is this is intense. Uh, right here, we're going to
00:03:18.000 beat that number. I have a test right here. Let's go. Um, so we're going
00:03:23.760 to do 30,001 assertions. Yeah, look at that. Using mega test.
00:03:31.680 Look at that. 10 milliseconds. 10 milliseconds. Yes.
00:03:42.000 And this was all done on my slow old MacBook Air M4. Terrible.
00:03:49.599 Anyway, my name is Aaron Patterson. Hello everybody. On the internet, my name is Tenderlove. Um,
00:03:55.760 I've been on the Ruby core team since 2009 and I've been on the Rails core team since 2011. Uh, I don't know how
00:04:03.840 many of you are here this morning, but it is true I do speak Japanese. And I want to give like a very short lesson
00:04:10.879 like I'm going to teach you all a handy phrase today. Uh, the phrase is omachi kudasai.
00:04:19.199 And uh, what it means is, like, can you wait a little bit, please? And you'll hear this all the
00:04:24.639 time. What David didn't know is that omachi is like super duper popular in Japan. Like really popular. You you'll
00:04:31.120 hear this phrase when you go to restaurants, when you go to hotels, everywhere. Uh so I want to teach you
00:04:37.120 another phrase, too. Like, what if you're going up to somebody and you say, "Hey, I'd like to have a little Omarchy,
00:04:42.639 please. Can I have that? I would like to try out this operating system." You can say exactly the same
00:04:48.160 thing: omachi kudasai. That is how you would say
00:04:54.880 it. Um, I work as a Senior Staff Engineer at a mom-and-pop startup called Shopify.
00:05:01.199 Small company. Hopefully you've heard of us. We use Ruby and Ruby on Rails. Uh I
00:05:07.280 think we run probably the biggest Rails app in the world. Now unfortunately I
00:05:13.280 didn't really know how to measure that. Like how do you measure what is the biggest app in the world? So the way I
00:05:19.039 decided to measure it was by font size, and indeed we have the
00:05:25.919 biggest Rails app in the world. Um, since today is Friday and this is the last
00:05:32.000 talk of the conference, I thought we would try to have some fun and talk about some very, very light
00:05:38.080 topics. So I hope you're all excited for that. Uh, just I'm just
00:05:43.680 kidding. We are not we are not going to do that today. Today we are going to have a very very technical presentation and I apologize. I know for some of you
00:05:51.120 you're happy because this means the end of the pun section of my presentation and onto the technical section of my
00:05:57.600 presentation. But like why why am I doing a technical presentation?
00:06:03.919 Uh the reason is mainly because I love programming. I love programming a lot.
00:06:10.000 Uh, I like to do it as my hobby and I also get to do it as my job and I just
00:06:15.039 really love it and I'm excited about the things that I work on. I'm very excited. Yeah, slide. There we go. I'm very
00:06:21.759 excited and I'm really excited to share the stuff that uh I've been working on with my team at work. So, that's what I
00:06:28.000 want to talk about today is mainly uh work stuff. So today I'm going to talk about work stuff, specifically the stuff
00:06:34.240 that my team has been working on at Shopify and how it's going to improve our lives and your lives, as well as
00:06:40.800 some pro tips for Rails developers. Uh at Shopify, I'm on the Ruby and Rails
00:06:47.840 infrastructure team. And you may be surprised to learn this, but we work on
00:06:53.759 Ruby and Rails and infrastructure. So we are the Ruby and Rails
00:06:59.199 infrastructure team. Our team works on
00:07:05.120 a lot of stuff, but I want to give you some context around the work that we do, and a big goal of
00:07:12.479 ours is to improve machine utilization at work so we want to work on performance and that helps us improve
00:07:18.960 machine utilization I want to make that a little bit more concrete and give you an example from work of the like kind of
00:07:25.280 things we're focusing on and I want to do this to provide context for the work that we do inside of uh Ruby and Rails.
00:07:32.720 Uh so one of the things that we want to do is increase, and when I say
00:07:37.759 improve, what I mean is: we want to increase the amount of parallel
00:07:43.919 work that we do on a machine, but we don't want to increase latency. So for example, we want to
00:07:49.520 service more requests but we don't want to destroy latency for anyone. So we want to improve the amount of parallel
00:07:54.879 work that we're able to handle anywhere. Uh, and I'm careful not to say requests necessarily because we're really talking
00:08:00.879 about parallel work. It could be web servers, test suites, whatever. Uh, we we just want to get more work done on a
00:08:06.800 machine, but we don't want to degrade latency for anybody. And to make this these terms a little bit more concrete,
00:08:13.840 uh, we have a very large application at work uh, that has a very very unpredictable workload. And
00:08:19.520 unfortunately what this means is when we get a request coming in, we don't know whether that request is going to be IO
00:08:25.199 bound or whether it's going to be CPU-bound. We don't really know that. And what this means is,
00:08:32.399 let's say we have a process-based web server with four cores. This is a mom-and-pop
00:08:40.399 startup, so we can only afford four-core machines. Um, let's say there's
00:08:47.680 four cores, and we know that some of the requests are going to be IO-bound and some of them are going to be CPU-bound, and we want to
00:08:54.320 utilize this machine as much as possible. So what we'll do is we'll fork off, say, 1.5 times the
00:09:00.080 number of cores. So we'll have six processes on our four-core machine. We'll pre-fork them. Each process can
00:09:07.680 only handle one request at a time, so we can handle six requests in parallel. But I want to consider two extreme
00:09:15.360 cases. Let's say we get six requests coming in and they're all doing IO-bound
00:09:20.959 work. So if we if that happens, all of these processes are basically like doing IO stuff. I don't know, writing to a
00:09:27.519 database, doing whatever. And our CPU utilization, our CPUs aren't doing anything at all. This is kind of a
00:09:33.200 bummer because our web server could take on more work, but uh it's not. we're
00:09:39.040 we're not processing any more requests. Now let's consider the other end of the spectrum. Let's say we get six processes
00:09:46.720 coming in or six CPUbound requests coming in. Now we have six processes that are all fighting over four CPUs. So
00:09:55.440 unfortunately this impacts our latency, and we don't like that either. This ends up increasing latency
00:10:02.320 because we have noisy neighbors. They all want to get some time on the CPU. So some of them have to be scheduled off
00:10:07.440 the CPU and that's just not good for our end users. And a side note like I want to go on a little bit of a side note.
00:10:13.200 This impacts all web servers. This isn't just process-based web servers. It's also like Falcon, Puma, whatever. And
00:10:20.959 the reason this happens is because we only have one construct in Ruby for doing CPU-bound parallelization, and
00:10:28.160 that's processes. We can only run code in parallel with processes. It's the only way we can do it. And I want
00:10:34.959 to show a little bit of a demonstration here. So let's say we've got two examples. Uh we're going to calculate a
00:10:40.560 Fibonacci sequence because that's what we do at work.
00:10:46.160 Fibonacci sequence. There's going to be a lot of this in the presentation. We're going to do it sequentially versus threads. So we'll do
00:10:53.760 one just straight line and we'll compare it to threads. If we do this on my lowly
00:10:59.279 MacBook M4, we get it done in about 2 seconds. If we do the same example using threads, unfortunately, it
00:11:06.480 takes the same amount of time, 2 seconds. So, we didn't get any parallelization, zero.
00:11:11.760 Now, let's compare that again to fibers, for example, using the async framework. Again, 2 seconds is our
00:11:18.560 baseline. If we run this example with async again 2 seconds, we're not doing any better than before.
00:11:25.760 fibers will take the same amount of time. Let's do this again, but this time we'll use processes. Of
00:11:31.200 course, run linearly it takes two seconds. If we run this with processes on my machine, it was about 480
00:11:37.200 milliseconds, around there. So we're actually seeing some real parallelization here.
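(For reference, a minimal sketch of the kind of benchmark being described; the fib workload and counts are illustrative, not the exact slide code, and the process case assumes a fork-capable platform.)

```ruby
# Threads share one GVL, so CPU-bound work gains nothing; forked processes
# each get their own GVL and can use all the cores. Illustrative sketch.
def fib(n) = n < 2 ? n : fib(n - 1) + fib(n - 2)

require "benchmark"

Benchmark.bm(10) do |x|
  x.report("serial")    { 4.times { fib(30) } }

  # CPU-bound work on threads: still serialized by the GVL.
  x.report("threads")   { 4.times.map { Thread.new { fib(30) } }.each(&:join) }

  # One process per job: true CPU parallelism.
  x.report("processes") { 4.times { fork { fib(30) } }; Process.waitall }
end
```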
00:11:45.920 So, let's return to web servers now for a minute. What we'd really like to have is a load-aware web server
00:11:52.079 where uh for example let's say our first four requests come in they get scheduled
00:11:57.120 to these different processes. Let's say all of them are IO-bound; they come in, and now another
00:12:03.279 request comes in. We can take that on; we'll just spin up another process and
00:12:09.040 take it on, because our machine isn't that busy; we can take on more. Now let's compare that to, say, we get a
00:12:15.920 CPU-bound request coming in. So the CPU-bound request will come in, we start getting load on the CPU. Eventually
00:12:22.720 we start using up all four of our CPUs and then when the next request comes in, uh we'll say, "Yeah, you know, we can't
00:12:29.040 take that. Let's not do that one right now. We're busy. Can you send that off to another machine? We're
00:12:35.839 currently at capacity." But the question is, how do you do this? And this is going to be a little bit handwavy here, but an idea that
00:12:42.399 we've talked about is to provide back pressure using HTTP/2. The reason we are thinking about
00:12:49.600 doing this is because H2 can send information upstream
00:12:54.959 asynchronously. So we can say, like, "Oh hey, proxy, I'm busy." We can set max
00:13:00.880 concurrent streams for example and say like I can't take any more streams right now. Please like send data somewhere
00:13:07.120 else. So we can actually provide back pressure to the proxy. This solves the
00:13:12.240 communication gap between the router, the proxy, and the web server, and we should be able to load balance stuff better. And this is, by the
00:13:18.800 way this is all theoretical. This is what we would like to get to. So let's say we had this uh let's say this
00:13:25.360 actually existed. What if it did exist? Now when a request comes in, I showed
00:13:30.639 this before like what do we do in this particular case? We have to create a new process, right? we have to spin up a new
00:13:37.360 process. Now, how do we do that? What is the code that we write to do that? One idea is,
00:13:44.480 well, we could fork. We could just fork a new process in our web server and then take on that request. But the question
00:13:49.920 is, can we fork fast enough? We don't know if that's true. Uh, another potential
00:13:55.920 answer is we could say Thread.new or Fiber.new: create a new thread, create a new fiber. But as
00:14:01.600 we saw in the previous benchmarks, those can only handle IO-bound requests. They can't handle CPU-bound requests. So what
00:14:08.720 could we do? And the answer is Ractor.new. This is where Ractors come into play. We can absolutely allocate
00:14:15.440 new Ractors fast enough, and they allow us to handle CPU-bound parallelization.
00:14:21.120 So this is kind of the context for our team. This is what we've been working on. uh the stuff that I've described
00:14:27.680 these production problems are the things we've been thinking about, and what I want to discuss today is,
00:14:34.720 from the language level, where we're trying to attack these problems. We're trying to attack them on
00:14:40.480 two different fronts. The first front is multi-CPU performance, so
00:14:46.000 multi-core performance, and that would be working with Ractors. We're hoping that Ractors
00:14:53.199 will allow us to make the most efficient usage of all CPUs on the machine at the same time. But this doesn't address
00:15:00.560 single-core performance. For single-core performance, we're working on a new JIT compiler called
00:15:07.600 ZJIT. And I want to talk about both of these efforts today, Ractors and ZJIT. So first let's
00:15:16.880 discuss Ractors. Uh this year my team has been working on improving Ractor speed and usability in Ruby 3.5. John
00:15:23.279 Hawthorn, if you've met him here at the conference, he is leading the project, and our team has been working
00:15:29.519 really closely with Koichi Sasada, who is the original author of Ractors. So we're
00:15:35.600 uh we're working on these now you might be asking we get this question all the time we have a lot of concurrency
00:15:42.240 choices in Ruby which one is the best one we have threads we have fibers we have processes we have ractors which one
00:15:47.600 is the best and the problem is if you ask you know talking heads like me what
00:15:53.839 the answer is to this, they'll say "it depends," but I think this is a small dog answer.
00:16:04.240 I think the big dog answer is Ractors. You just always use Ractors.
00:16:15.279 Also, uh, John made a really great logo for Ractors, which I want to
00:16:29.199 all right. So, uh, if you're not familiar, let's discuss Ractors. Ractors are Ruby actors,
00:16:35.839 and that's where the name comes from. So, it's Ruby actors: Ractors. They are basically an actor style of
00:16:42.720 parallelism in Ruby, and Ractors give us true parallelism. Now,
00:16:48.320 unfortunately, well, actually, no, it doesn't matter. Why did I say unfortunately? There is still a GVL. Ruby still has a GVL, but the
00:16:56.000 way that we've designed the system is such that each Ractor has its own independent GVL, and they can
00:17:02.880 all run independently of each other. And that means, oh, let's do the slide. Each Ractor has its own GVL that can run
00:17:08.559 independently of each other. And that means that we can get true parallelism out of all of them. So let's do our
00:17:15.520 Fibonacci sequence test again. If we run this serially, of
00:17:20.720 course it's two seconds, like you saw before. If we try this with Ractors, we'll hit 480 milliseconds. So we're
00:17:26.720 able to do true parallelization with Ractors. And just to refresh
00:17:33.520 your memory, I've got a benchmark here where we're checking our base measurement versus threads, fibers,
00:17:39.039 processes, and Ractors. And you'll see that the first three are around 2 seconds,
00:17:44.400 whereas if we use processes or Ractors we can actually get down to 480 milliseconds. So again, true
00:17:51.919 parallelism: the only way we can get that in Ruby is via processes or Ractors, and we want to focus on the Ractor solution.
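(A hedged sketch of the Ractor variant; Ractor is experimental, and the result-reading API differs across Ruby versions.)

```ruby
# Each Ractor runs under its own GVL, so four Ractors can use four cores
# at once. Illustrative sketch, not the exact slide code.
def fib(n) = n < 2 ? n : fib(n - 1) + fib(n - 2)

ractors = 4.times.map { Ractor.new { fib(30) } }

# Ruby 3.5's port-based API exposes Ractor#value; on older Rubies,
# read the result with Ractor#take instead.
p ractors.map(&:value)
```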
00:17:59.600 Unfortunately, if you use Ractors today, you'll
00:18:04.960 see this warning. This warning comes out and it is a big scary warning,
00:18:10.160 and it says Ractor is experimental and the behavior
00:18:15.440 may change. And unfortunately that's totally true. It absolutely has changed. APIs are changing, and part of the
00:18:22.960 work that we want to do on the team is stabilize that API, make sure that it works well and fast. And the
00:18:30.799 other thing that it says is that there are many implementation issues, and that is true as well. There are
00:18:36.720 many implementation issues. So we are trying to take on those behavioral changes as
00:18:43.600 well as the implementation issues, make sure that the behavior is stable and that the implementation
00:18:50.000 issues are fixed. And we're hoping that if we can do enough work on that, we can actually get rid of this message, so that
00:18:56.640 people will feel a lot more comfortable using Ractors in the future. So I want to talk a little bit about the behavior
00:19:02.000 changes, or how to use Ractors. And this is in Ruby 3.5. So you need
00:19:09.760 to be using Ruby either wait for Ruby 3.5 or don't. You can build it from
00:19:14.799 edge. Please do that. Um and we'll also
00:19:19.919 talk about some implementation issues. So Ractors are very similar to threads. They're like threads but harder to use,
00:19:27.360 but in a good way. And I'm going to explain that in a minute here. So Ractors have kind of a rule with
00:19:33.200 them. The rule is that you cannot share mutable objects between Ractors. That's not
00:19:39.840 allowed. So we don't allow that. Ractors will copy mutable objects, and
00:19:45.520 I have an example of that here. So here we have two code examples. On the left is a thread example and on the
00:19:51.919 right is our Ractor example. And all we're doing here is we're
00:19:57.360 getting a mutable string. And this is important: we've got a mutable string here. We're
00:20:02.480 popping the string off of a queue. So in thread world, we'll pop it off a queue. In
00:20:07.760 Ractor world, we'll pop it off a queue, but this is actually our default mailbox, what we call the mailbox for the Ractor.
00:20:14.000 After that, we'll print out the object ID of the string along with
00:20:19.760 the string itself. One nice thing about Ractors is we don't need to make a
00:20:25.440 queue. They already have default queues. So we can just push onto that default queue. If we run this code, we'll see
00:20:33.039 here on the threaded side, the object IDs are identical. There was no copy.
00:20:38.559 But if we look on the Ractor side, we'll see that the object IDs changed. And that's because we actually duped the
00:20:44.320 string when it crossed the boundary. So when that string went from one Ractor to the other, we ended up copying the
00:20:49.600 string. So we're not allowed to share mutable objects between Ractors. If
00:20:55.919 we change this code, so we just add, where is it? frozen_string_literal:
00:21:02.000 true at the top, the thing I was complaining about this morning. If we add that at the top, we'll see that the object IDs are identical.
00:21:08.240 So as long as the objects are immutable, we can pass references between Ractors and it doesn't matter.
00:21:13.840 It's fine. There's no copy.
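(A small sketch of that rule; on Rubies before 3.5, read the result with Ractor#take instead of Ractor#value.)

```ruby
# Mutable objects are copied when they cross a Ractor boundary; frozen
# (immutable) objects are shared by reference. Illustrative sketch.
mutable = String.new("hello")
frozen  = "hello".freeze

r = Ractor.new(mutable, frozen) { |m, f| [m.object_id, f.object_id] }
m_id, f_id = r.value

p m_id == mutable.object_id # => false: the mutable string was duped
p f_id == frozen.object_id  # => true: shared, with no copy
```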
00:21:20.400 So how do we make immutable data? I already showed one example: we can use frozen string literals, or we can freeze an object. But unfortunately
00:21:26.720 that's not going to work if we have a deeply nested object. Let's say for example we parse some JSON and we got a
00:21:32.159 big old JSON hash out of it. In that case we can use Ractor.make_shareable, and that will deeply freeze the
00:21:38.960 object's data structure or like some libraries like JSON will provide APIs
00:21:44.159 that give you back frozen data structures already. So in this case you can pass freeze: true into
00:21:49.600 JSON.parse, and the data structure that comes back out of JSON.parse will already be frozen. So you can pass that
00:21:55.520 between your Ractors without copying.
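(A quick sketch of those two approaches to deeply frozen, shareable data.)

```ruby
# Two ways to get deeply frozen (and therefore Ractor-shareable) data,
# as described above. Illustrative sketch.
require "json"

config = { "db" => { "host" => "localhost" } }
Ractor.make_shareable(config)   # deep-freezes the nested structure
p Ractor.shareable?(config)     # => true

data = JSON.parse('{"a":{"b":1}}', freeze: true)
p Ractor.shareable?(data)       # => true: safe to pass without a copy
```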
00:22:00.640 So, communication: how does that work? We know how to make immutable data, but how do we communicate between Ractors? Ractors use what is called a
00:22:06.080 port, and this is in Ruby 3.5. Again, ports are basically a queue, and we
00:22:15.200 use them like this. Every Ractor has a default port. These two bits of code are actually exactly
00:22:21.120 identical, except that rather than calling Ractor.current.default_port and
00:22:27.120 receiving on that, we can just call Ractor.receive. So one is shorthand for the other; these two chunks of
00:22:33.679 code do exactly the same thing. Ports are just like queues. However, they have to abide by two rules. One is
00:22:39.840 that any Ractor can write to a port. And here
00:22:45.120 is the mind-bending rule: only the Ractor that
00:22:50.960 created the port can read from it. So this kind of takes a second to
00:22:56.080 wrap your head around: only the creating Ractor can read. And I'm showing an example here on the left. This
00:23:01.840 example works fine. We create a port in the main Ractor. We pass it to a child Ractor, and we're allowed
00:23:08.080 to write to that port, and the main Ractor is allowed to read from that port because it created it. Now, on the
00:23:13.840 other hand, if we created a port on the main Ractor, but we pass it to a child Ractor and we try
00:23:19.679 to read from it, you'll get an exception. It doesn't work.
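(A sketch of those two rules using the Ruby 3.5 Ractor::Port API, which is experimental and may change.)

```ruby
# Any Ractor may write to a port; only the Ractor that created it may read.
port = Ractor::Port.new            # created by the main Ractor

Ractor.new(port) do |port|
  port << "hello from a child"     # writing from another Ractor: allowed
  # port.receive here would raise: this Ractor didn't create the port
end

p port.receive                     # reading in the creating Ractor: allowed
```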
00:23:25.039 So these are kind of the rules that we have to play by. Of course, if you're coming from the threaded world, this may require some mental shifts. I know
00:23:31.679 it did for me. And here's an example of threads. These threads are
00:23:37.280 all sharing references to exactly the same queue. So, we have many threads referring to the same queue, and
00:23:43.760 each child thread is trying to read from that queue. But if only one Ractor is allowed to read from
00:23:49.280 a port, how can you write this code example? How can you do this? It's a little bit more
00:23:54.720 complicated in the Ractor world, and I want to show you how we can accomplish this. In the Ractor
00:24:00.480 world we create a rendezvous point. I don't know the right name for it, but I call it kind of a rendezvous point
00:24:07.120 or a coordination point for producers and consumers. So first we have a coordination point. This Ractor is in
00:24:14.480 charge of collecting work from producers and handing that work out to consumers. So in this case we'll have a
00:24:19.919 coordinator. It takes the work in and then it hands that work out to the consumers.
00:24:25.360 The biggest difference between this and the queue-based system is that all of the worker Ractors have to specifically
00:24:32.159 ask for work rather than pulling the work. So here our coordinator
00:24:37.760 says, "Hey, I'm going to wait for the next Ractor that wants work, and as soon as I get that Ractor, I'm going to give
00:24:44.320 that Ractor work." So each of the child Ractors has to ask for work. So,
00:24:50.159 here are what the workers look like. Our workers say, "Hey, coordinator, I need work.
00:24:56.320 I need work. Please give me something to do." Uh, the coordinator will give it work. It sits here and waits until the
00:25:02.799 coordinator gives it work and then it does whatever it needs to do. So, it's a little bit more complicated setup than
00:25:08.480 the queuing system, but what's nice is there are no locks or mutexes. We didn't have to write synchronize anywhere.
00:25:14.080 That's not a thing. There is no synchronization; there are no deadlocks. And what's even
00:25:20.480 better is that we get CPU parallelism in pure Ruby code.
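(A hedged sketch of that coordinator pattern, assuming the Ruby 3.5 Ractor::Port API described above.)

```ruby
# Workers ask the coordinator for work instead of pulling from a shared
# queue. Illustrative sketch; the port API is experimental.
requests = Ractor::Port.new        # workers announce themselves here

workers = 4.times.map do
  Ractor.new(requests) do |requests|
    reply = Ractor::Port.new       # only this worker can read from it
    loop do
      requests << reply            # "I need work, answer on this port"
      job = reply.receive
      break if job == :done
      puts job * job               # do the work
    end
  end
end

# The main Ractor coordinates: wait for the next hungry worker, hand it a job.
(1..10).each { |job| requests.receive << job }
workers.size.times { requests.receive << :done }
workers.each(&:join)
```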
00:25:26.799 One more feature I want to show, and I think this particular feature is especially interesting to Rails developers. So I was talking about how
00:25:34.799 mutable data gets copied. Let's say for example we have two Ractors; one Ractor creates a string and wants to pass
00:25:41.679 that string to a second Ractor. When that happens, it copies the data. Now
00:25:46.799 there's one exception to this, which I think is a very interesting exception, and that is that the Ractor return value
00:25:53.039 is not copied, but only once. And I'm going to show you an example to make this a little bit more clear. Let's say we have a Ractor like this. We're
00:25:59.840 creating a Ractor r. This Ractor allocates object o. We're going to
00:26:05.919 print out the object ID and the frozen state of the object. r returns the object o, and then our main
00:26:13.679 Ractor will try to read that object, get it, and then it's going to print out the
00:26:19.360 we do that, we'll see that these objects are identical. Oh no. Oh no, it's the same object. All
00:26:26.240 right, I'm going to do the next slide. I'm sad my little thing didn't line up. Anyway, um if you run this code, you'll
00:26:33.200 see that neither of them are frozen. They're not frozen. However, the object IDs are identical. So that means that uh
00:26:41.679 we are able to pass an a mutable object between two ractors without doing a copy. So why is this interesting? Why
00:26:49.120 why would any of this be interesting to us as Rails developers? I'm going to tell you why did I ask that? I'm going
00:26:54.480 to tell you why. It's a rhetorical question. Let's talk let's talk about it. So
00:27:00.880 yes, I am nervous. Um what is one use case for C extensions?
00:27:06.720 There there is a use case for C extensions. Usually we're using C extensions to bind to native libraries
00:27:12.240 like libxml: Nokogiri is a binding for libxml. Uh of course
00:27:17.760 there are many other C extensions. There is another use case and that other use case is for releasing the GVL. In C
00:27:25.520 extensions we actually have a way where we can release the GVL and do other work. So a good example of this is the
00:27:31.360 bcrypt gem. The bcrypt gem is written in C. We have
00:27:36.720 these three lines in the bcrypt gem, and basically what they're saying is: hey, I want to call this function, but I want
00:27:44.159 you to release the GVL, then call the function, and then acquire the GVL again. So what this
00:27:51.120 means for us is that we're calculating password hashes without holding the GVL. So let's say for example you're using
00:27:57.440 Puma as your web server. Uh somebody goes to log into your application you start calculating that password hash.
00:28:04.159 while you're calculating the password hash, Puma is able to service another request in parallel. So, we're able to
00:28:10.559 we're able to get CPU parallelism in C extensions by releasing the GVL.
00:28:16.000 What's nice about uh Ractors is we can kind of think of them as a no GVL block.
00:28:22.000 So, let's imagine that we had BCrypt written in pure Ruby. Let's imagine that was pure Ruby and not C. We could write
00:28:28.480 something like this, where we create a new Ractor, we calculate bcrypt, and we return the value of bcrypt. And we can
00:28:35.120 think of this as kind of just a no-GVL block where other threads can run. We can have our Puma web server, and it can serve
00:28:42.480 up requests while we're calculating bcrypt, doing exactly the same thing that the C extension did, but we were able to
00:28:48.399 write it in pure Ruby. So, maybe not calculating bcrypt; a real-world
00:28:54.080 example might be just parsing JSON. Maybe you have an API server and you're spending a lot of time
00:29:00.159 parsing JSON: throw it in a Ractor, and now all of a sudden you can do other requests in parallel while you're
00:29:05.440 parsing that JSON. So this seems like a very low-effort way to start introducing Ractors into your system and get some
00:29:12.000 parallelization, but without necessarily having a Ractor-based web server.
00:29:17.679 So one thing I would like to pitch to all of you is: when you upgrade to Ruby 3.5 in
00:29:22.720 the future, and you're going to do that right away, right? Yes. Yes. Try
00:29:28.320 wrapping your CPU-intensive code inside of a Ractor and see if that is able to help out your
00:29:33.679 parallelism. Of course, do this in Ruby 3.5. So another thing I want to talk about here is weird
00:29:39.360 bottlenecks that we've had to solve. So, we've been trying out Ractors, finding and fixing
00:29:46.240 crashes, uh trying to improve the API, but I want to talk about a weird bottleneck just because I think it's
00:29:52.720 very very fun and interesting. So, somebody filed a an issue on the Ruby
00:29:58.399 bug tracker. And the issue, like I'm just going to give you a summary of the issue. The issue is uh parsing parsing
00:30:05.360 JSON is slower in Ractors than if I do it serially. So, if I try to parse JSON in parallel, it's actually slower than
00:30:11.840 if I just tried to do one at a time. And uh we've fixed this bug. We've since
00:30:17.120 fixed this bug. But I want to tell you the problem because I think it's a little bit surprising and interesting.
00:30:22.159 Uh yes, we fixed the bug. Nice. Nice. Yes. John, not we. John fixed the bug. Great
00:30:28.880 job, John. So, here is the problem. When we parse this JSON here,
00:30:34.240 we get a hash back. And the key to the hash is a string. And interestingly, both of these strings are
00:30:42.000 exactly the same object, right? They're exactly the same. They have the same object ID. They get deduplicated when we
00:30:48.720 parse the JSON. And we have to do
00:30:54.080 this deduplication when we're doing it inside of a Ractor as well. So when we parse the JSON here, we
00:30:59.600 check that key's object ID, and we'll see: oh, this is indeed the
00:31:04.640 same object. And the way that this is implemented internally to Ruby is we have something called an fstring table.
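(You can see the deduplication the fstring table provides from plain Ruby.)

```ruby
# Identical JSON keys come back as the very same string object, resolved
# through the global fstring table described above. Illustrative sketch.
require "json"

a = JSON.parse('{"name":1}').keys.first
b = JSON.parse('{"name":2}').keys.first
p a.object_id == b.object_id # => true: one shared, deduplicated key
```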
00:31:10.480 I think it stands for frozen strings, because these strings are frozen. But this fstring table is
00:31:17.120 essentially a global within Ruby. And we have to consult this global hash table in order to resolve those keys to
00:31:24.399 be exactly the same object. So while we were doing multiple Ractors, these Ractors had to take a lock on this
00:31:30.640 fstring hash. So we ended up getting lock contention on this global hash
00:31:36.640 table within the internals of Ruby. And you wouldn't know this necessarily just from looking at the Ruby code itself.
00:31:43.039 The solution to this was to turn the fstring table into a lock-free hash, which John was able to do.
00:31:50.000 And once he did that uh JSON parsing got 12x faster. So we should Yes. round of
00:31:55.600 applause.
00:32:01.600 Do people do people here parse JSON? Is that a thing?
00:32:07.760 Now, unfortunately, the fstring table is not the only thing like this. We've found we found within Ruby internals.
00:32:13.679 Uh, also we have a global table, an ID table. This is for symbols, so symbols
00:32:18.880 get resolved into the same object; we had contention there. A thing called the CC table, which is for inline
00:32:24.880 caches: whenever we needed to create an inline cache or look up an inline cache,
00:32:29.919 we would end up locking. Another example is the encoding table, when we're looking up string encodings. So
00:32:36.080 internally to C Ruby, we have all these these global data structures that you wouldn't necessarily know are global
00:32:42.000 while you're writing Ruby code. Uh anyway, the point is that our our team is trying to find these bottlenecks and
00:32:48.080 fix them. Figure out how we can remove these bottlenecks so that when you upgrade, we can actually start using Ractors for
00:32:54.320 real in production. Uh the next thing I want to move on to I'm going to move on to the next topic
00:32:59.440 now which is Zjget. Uh Zjget is a new compiler that we're
00:33:06.080 going to be shipping with Ruby 3.5. Uh so I want to talk a little bit about the work that we're doing on that and I want
00:33:12.880 to talk about what like what are what are JIT compilers in general because people have asked me this. uh we'll look
00:33:19.360 at the differences between yjit and zjit and then I want to give some like tips for rails developers on how you can
00:33:25.519 write more JIT friendly code in your application. So the first thing is that uh before we get to the meat of this JIT
00:33:32.399 section uh I kind of want to define a term which was confusing to me when I first started working on uh JIT
00:33:39.279 compilers. Ruby's virtual machine is called Yarve and it's what we call a
00:33:44.320 bite code interpreter. So it's interpreting the bite code. When we compile your Ruby code, we turn that into bite code. That bite code gets
00:33:51.200 interpreted. Uh and we have a virtual machine that interprets that bite code. So if anybody, namely me on the stage
00:33:58.640 here, refers to something as the interpreter, what they mean is this virtual machine that's interpreting that
00:34:04.000 bite code. Uh so what like what is a JIT? I think this is an important question to ask because there I think
00:34:11.040 there are a lot of definitions for what a JIT compiler could be. But to me, a JIT compiler is something that assembles
00:34:16.720 code at runtime and assembles machine code at runtime and is usually as lazy
00:34:21.760 as possible. So here's here's an example of something that could be considered a JIT compiler. Here we're saying, hey, uh
00:34:28.720 we're not going to define the attr accessor right now. We're going to wait until method_missing gets called, and
00:34:33.760 then when method_missing is called, we're going to define the attr accessor. So here we'll say, ah yeah, we'll
00:34:39.839 just define that, and then the next time the method is called, we don't call method_missing anymore, because we
00:34:45.040 generated this code at runtime. So we could kind of think of this as a JIT compiler. Of course, we wouldn't
00:34:51.599 actually write this in our code, right? Please don't.
00:34:58.560 Um but this is an example that could be considered one. So we generate code at
00:35:04.560 runtime. I I think of them as generating code at runtime. They're usually lazy. We try to be lazy about it. Meaning that
00:35:10.560 they're late. They're kind of late. They do it later. Uh and the another important aspect is that they they
00:35:16.960 should speed up our program. This is uh unfortunately I have written JIT
00:35:22.800 compilers that do not speed up programs. So yeah, this is I think this is an
00:35:28.560 important aspect. Another question people ask me is, you know, how can a JIT speed up our code? Like,
00:35:36.079 how can it do that? The JIT compiler has to generate code that does exactly the same thing as our program
00:35:42.160 did originally. If it's doing exactly the same thing, how can it possibly be any faster than our byte code
00:35:47.920 interpreter? So, I want to talk about that a little bit. Uh, the JIT and the interpreter must match. As I was saying,
00:35:54.960 whatever code the JIT compiler produces, it must have exactly the same behavior as the interpreter. So, how can
00:36:01.760 we fix anything? One thing that we can do with a JIT
00:36:08.240 compiler is that we can eliminate the interpreter overhead. So, this interpreter has overhead and typically
00:36:13.520 when we're running a program in Ruby, we'll have something that looks like this. So we have our CPU uh we have
00:36:19.440 YARV, which is our bytecode interpreter, and on top of that we're running our Ruby code. The interpreter is running
00:36:26.160 that Ruby code on the CPU. YARV is a C program, and that C program is running code on your
00:36:33.280 actual CPU. The way I think of a JIT compiler is it's taking
00:36:38.800 this and getting rid of this YARV step here. It's basically promoting your code to running directly
00:36:45.520 on the CPU. You can kind of think of this as like, well, you're running your code inside of a Docker container, but
00:36:51.680 somehow you're able to escape that and run on instead of running inside of a container, you're running on the bare
00:36:57.520 metal itself. And the JIT compiler is just doing that at runtime for you automatically. The other way that we can
00:37:03.520 speed up programs is by caching values. So here's an example,
00:37:08.880 again back to our Fibonacci sequence, because that's what we do: speed up Fibonacci. We have an example here of
00:37:15.359 the number 35, and you can see if we look at the YARV bytecode, we have a literal 35
00:37:20.400 in there. We have that number. What we can do in the JIT compiler is we can say, hey, I'm going to actually embed that
00:37:26.960 number directly into the machine code. So when we compile this, if you were to disassemble the machine code, you might
00:37:32.480 see something like this, where we have the number 35 directly in there. So that's another way
00:37:39.040 that we can speed this up: rather than loading that 35 from memory, we already have it in the machine code.
00:37:45.520 another thing that we can do is speculate. We can speculate on values and we're able to do this in a way that
00:37:51.359 the interpreter cannot do. So let's say we have code that looks like this: sum = 2 + 5. Our JIT compiler might say,
00:37:58.720 "Hey, you know what? I don't think anybody is going to monkey patch plus
00:38:03.839 hopefully." So in advance, what it can do is it can say, "I'm just going to add those two
00:38:09.359 numbers together. We're going to get the number seven and we're just going to keep that there and return that rather
00:38:14.720 than executing the plus method itself." But a cool thing a JIT compiler can do that our interpreter
00:38:21.680 can't do is it can deoptimize this. So let's say somebody does monkey patch plus: it did the
00:38:27.119 constant folding, and then all of a sudden somebody monkey patches plus. Our compiler can detect that and say, oh, you
00:38:33.280 know what, I messed up, my speculation was wrong, I'm going to fall back to this
00:38:38.800 particular implementation. So I'll just go back to calling plus on the number and then do whatever the interpreter
00:38:44.160 would have done.
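(The hazard the deopt path guards against looks like this in plain Ruby; a contrived, illustrative sketch.)

```ruby
# Why speculation needs a deopt path: folding 2 + 5 into 7 is only valid
# until someone redefines Integer#+.
def sum = 2 + 5
p sum # => 7; a JIT could constant-fold this

class Integer
  def +(other) = 42 # please never do this
end
p sum # => 42; the folded code is now wrong, so the JIT must deoptimize
```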
00:38:51.200 Another thing we can do is eliminate type checking. Yeah, everyone. Wait, who likes type checking? Well, we
00:38:57.200 don't have to do it. We can eliminate this type checking. Although this is a different type
00:39:02.320 of type checking, and I'm going to explain this here. When we have code like this,
00:39:07.760 our friend Fibonacci again, uh when we have code like this, when it runs, we have to check like, okay, uh when we do
00:39:14.640 this comparison, is n an integer? The interpreter has to check this. Is n an integer? If so, I'm going to do an
00:39:20.079 integer comparison. And then when we do minus, it's like, oh, is n an integer? If so, I'm going to do minus. In both of
00:39:26.880 these cases, in our compiler, we can say, well, I know that n is an
00:39:33.040 integer on the first one. I did that test. I was able to do that type check, and now there's no reason for me to do
00:39:38.320 it later on in the program because I know for a fact that n is an integer. So, it's able to eliminate these two
00:39:44.880 particular type checks. Uh so these are different ways that a compiler can speed up the code that an
00:39:51.520 interpreter cannot do. So let's take a look at yjit and zjit uh and how they
00:39:58.240 work and the differences between them. Who is using YJIT in production? Anyone? Yeah. Oh my gosh, so
00:40:04.400 many people. That's awesome. Thank you. That's great. All right. YJIT is a
00:40:11.119 lazy basic block versioning compiler: LBBV. Of course, this
00:40:17.119 already ships with Ruby. You all are using it. Many, many of you are using it in production. YJIT uses
00:40:22.720 a technique that was pioneered by Maxime, the author of the JIT compiler, in her PhD thesis on lazy basic
00:40:30.640 block versioning. So, I'm going to describe how this compiler works and then move
00:40:35.839 on to ZJIT and the differences between them. So, this is how LBBV works in action. This compiler
00:40:47.760 discovers basic blocks lazily. What a basic block is, is a straight line of
00:40:53.760 code that has no jumps in it. So no if statements nothing like that. So it's just code until we find an if statement
00:41:00.960 that is a basic block. So in this particular example, when we compile the add method, what YJIT will do is it'll
00:41:07.520 go here and it'll say, hey, I'm going to compile y = z. All of a sudden it gets to the if statement and it's like, oh, this is the end of a basic block,
00:41:13.920 because it could jump between one of these branches. So it compiles that basic block and then it waits. It
00:41:21.040 basically waits and it says which side of this if statement am I going to execute?
00:41:26.720 Let's say it executes this side of the if statement, the top one. In that case it'll compile that side of the if
00:41:32.400 statement; it'll create a new basic block until there's another jump again, and after that we'll
00:41:39.280 create a third basic block down here that's just return y. So we end up with something that looks like this. So we
00:41:44.880 compile like that. And one thing to notice is that this code right here
00:41:50.480 never got compiled. We didn't compile it because we didn't actually use it. That never happened.
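(The method in the walkthrough is shaped roughly like this; an illustrative sketch with the basic-block boundaries marked in comments.)

```ruby
def add(z, flag)
  y = z       # basic block 1: straight-line code, compiled first
  if flag     # a jump ends the block; YJIT stops here and waits
    y += 1    # basic block 2: compiled only when this side actually runs
  else
    y -= 1    # never compiled if this branch is never taken
  end
  y           # basic block 3: the return
end
```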
00:41:57.599 What this means is YJIT gives us really, really fast warm-up, because we're compiling as
00:42:02.720 little code as possible. Another nice thing is that we have low memory overhead, again because we're
00:42:08.880 compiling as little code as possible and we also have low overhead on type discovery. So what I mean by this is
00:42:15.440 when YJIT pauses at those particular locations, it's able to look at all the values on the stack and say, "Oh, n
00:42:22.480 that's an integer. I know that's an integer." and then generate code that's specific for that integer.
00:42:28.480 Some of the downsides of YJIT, though, are that register allocation is a little bit harder. Whenever we're
00:42:36.079 dealing with variables in our program, we want the compiler to keep those variables in registers and that's
00:42:41.680 because registers are faster to access than memory. So we would really want to keep those values
00:42:47.839 inside of registers. Unfortunately, registers are a finite
00:42:53.920 resource. So we can't just put all of our variables in registers. We can only put some of them in registers some
00:42:59.359 of the time. So an issue is, let's say we're compiling this code with YJIT.
00:43:05.440 I've rewritten the code a little bit here, but the functionality is exactly the same. Let's take a look at this. We have y0 at the top. Uh we're defining
00:43:12.720 two other variables y1 in both branches of the if statement and then we return
00:43:17.839 y1 at the end. So let's say YJIT compiled this and we
00:43:23.119 turned this into machine code, and we decided that this value y0 should go into the x0 register on my crappy ARM
00:43:31.680 machine. So we put it in there, and that's where we decide to put it. Now
00:43:37.280 when YJIT goes to compile the other branch, let's say we execute this code again and we want to compile the
00:43:43.280 else statement right here. The question is, which register do we use? We have to
00:43:50.560 maintain the knowledge that the other side of the if statement used x0 as the register for the value of y1. And this is
00:43:58.000 actually a very difficult problem. This particular case, this if statement, is a big problem in
00:44:05.359 compilers, and one of the main things that folks focus on. So the way that we handle this in YJIT is we actually
00:44:11.200 just end up writing these local variables into essentially known locations in memory. They're written to
00:44:16.800 known locations. So rather than keep track of all the registers, we just basically treat it as if it has an
00:44:23.599 infinite number of registers, where those registers are essentially just memory. So I'm not going to get into the
00:44:30.720 details uh but as we discussed reading memory is not as fast as reading registers. So we'd prefer to keep things
00:44:36.319 in registers. So comparing this with a method-based compiler, a method-based compiler is a little bit different. it
00:44:41.760 can say like, "Oh, okay. I'm just going to look at the entire method and it'll compile both sides of the if statement."
00:44:48.319 And because it has the context of all the variables used in either side of those if statements, it's able to figure
00:44:53.680 out like, oh, okay, in one side we stored y1 in x0, so we're going to do
00:44:58.720 the same thing on the other side. And then all of our code works together. So, we're able to keep those values in registers 100% of the time, rather than
00:45:05.200 spilling them to memory a lot of the time. So another nice thing
00:45:10.720 with method-based compilers is it's a little bit easier for us to fold constants. As I said, we can do more
00:45:16.960 efficient register usage. The other thing is it's a little bit easier to learn. A lot of the compiler
00:45:23.200 documentation out there focuses on method-based compilers. You can use traditional resources for
00:45:29.200 compiler theory on a method-based compiler. Now the downside, though, is that it might compile too much
00:45:35.520 code. As we saw earlier, if we don't take one side of that if statement, a method-based compiler is going to compile
00:45:42.000 it, where YJIT would have done nothing. Another downside is that tracking types
00:45:47.359 is a little bit harder. We're not pausing in the
00:45:52.560 actual code with a method-based JIT. So what we have to do is start tracking types as the
00:45:58.720 interpreter is executing. So we had to add some infrastructure for tracking types. So I guess
00:46:06.319 the idea with ZJIT was, we thought: what if we could
00:46:12.640 take what we had learned from YJIT, like how to actually put a JIT
00:46:18.400 into CRuby, how to deploy it, how to do development on a JIT compiler, how to
00:46:24.640 make it low overhead? What if we could take all of those techniques and apply them to a method-based compiler? And
00:46:30.400 that's where we came up with the idea for ZJIT, a next-generation JIT compiler
00:46:35.440 which is essentially our method-based JIT. Now, I am very excited about this work. However,
00:46:41.520 I would like to temper your expectations a little bit. It is very, very new. So, if I can make a
00:46:49.040 humble request that you lower your expectations of this JIT compiler: a little bit lower,
00:46:56.240 this is probably good. It's not rock solid at the moment. If you go
00:47:02.560 check out Ruby edge and you try to run Ruby edge in production, like all of you are going to do after this
00:47:08.160 presentation, right? Yeah. Yes. You can use ZJIT if you do ruby
00:47:16.000 --zjit. But it's probably not going to be as fast as YJIT. Wait, no, not
00:47:21.359 probably. It will not be as fast as YJIT, but we are going to get it there. So with that said, I want to talk
00:47:27.440 about, what can I do today? What can I do today to write JIT-friendly code?
00:47:34.319 So as a Rails developer, why do I care about this and what can I do about it? Now, me working on compiler stuff, I
00:47:43.200 really want to say to you, write whatever code that you want because the idea is like you should be able to write
00:47:49.839 any code that you want to and then we'll just figure out how to speed it up so that you don't have to change anything.
00:47:56.079 But that is theoretically how things should work. The truth is that
00:48:01.200 some patterns are easier to speed up than others. Otherwise, why
00:48:06.960 bother profiling your code? Just wait for the newer version of Ruby. But sometimes we do want to speed
00:48:12.319 things up. Uh there's an opportunity cost. So you can wait for us to improve the JIT compiler or you can write code
00:48:19.680 that is a little bit more friendly and have those speed ups today. So do you want to wait or do you want to have
00:48:24.880 speed ups today? And I'm going to give you one. I said I was going to give you pro tips, but I'm going to give you just
00:48:31.040 one pro tip. We are going to monomorphize these pro tips.
00:48:38.319 Thank you, Ufuk. Yes. So,
00:48:44.559 thanks. We're going to talk about monomorphizing call sites.
00:48:49.599 This is a very dumb compiler joke. I'm so sorry. In order to describe to
00:48:55.440 you what this is, first I want to describe polymorphic call sites. We're all familiar with
00:49:00.559 polymorphism, right? We use polymorphism every day. Yes. We write OO code.
00:49:06.079 Polymorphism is great. Here's an example of a polymorphic call site. We have two classes, A and B, and we're
00:49:13.599 passing instances into call_foo, which calls foo on the thing. This is what we call a polymorphic call site, and we call it
00:49:20.480 polymorphic because it sees different types. So, computer vision is getting really
00:49:27.680 good; it's able to see them. Yeah. So here we have an
00:49:33.119 instance of A and an instance of B both passed to the same method, and this thing calls a method on two different types.
00:49:40.480 Now we'll compare this with monomorphic. All monomorphic means is that it sees just one type. So it's just one versus
00:49:46.960 many. In this case we only see one type here. It's a monomorphic call
00:49:53.680 site, because it only ever sees that one A type.
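Based on that description, the slide's example presumably looked something like this (the class and method names are assumed):

```ruby
class A
  def foo
    "a"
  end
end

class B
  def foo
    "b"
  end
end

# Polymorphic: this call site sees instances of both A and B.
def call_foo(obj)
  obj.foo
end

call_foo(A.new)
call_foo(B.new)

# Monomorphic: this call site only ever sees A.
def call_foo_mono(obj)
  obj.foo
end

call_foo_mono(A.new)
call_foo_mono(A.new)
```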
00:50:00.960 So if you have a polymorphic call site and you JIT compile that code, the JIT code is going to end up with a whole bunch of tests saying: okay, is this an
00:50:08.000 instance of class A? Is it an instance of class B? Is it an instance of class C? Etc. So the more types that you
00:50:13.920 see there, the longer it's going to take to find that particular method and call it. So if you can monomorphize that call
00:50:21.119 site, make it only one type, you can actually speed up that call site very nicely.
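Continuing the sketch above, a rough model of what that means (illustrative Ruby, not actual JIT output):

```ruby
# Illustrative only: after seeing classes A and B at the call site,
# the compiled code behaves roughly like a chain of class checks
# guarding each cached target.
def dispatch_foo(obj)
  klass = obj.class
  if klass == A
    A.instance_method(:foo).bind_call(obj) # cached hit for A
  elsif klass == B
    B.instance_method(:foo).bind_call(obj) # cached hit for B
  else
    obj.send(:foo) # slow path: full method lookup
  end
end
```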
00:50:28.400 And I have a real-world example of it today. Here we had a pull request for visitor methods in Prism. And we were
00:50:34.800 able to see a 13% speed improvement over Ruby 3.4 by
00:50:40.480 monomorphizing call sites inside of Prism. And this isn't even
00:50:45.839 with the JIT compiler; this is just with the normal interpreter. So this particular pro tip will work for you whether you're
00:50:51.119 using YJIT or not.
00:50:57.760 The other thing is that it's not just about types, like which class you use. It's also about instance
00:51:03.680 variables. We can think of this as a polymorphic instance variable read, because depending on the branch
00:51:09.920 that you take in initialize, you'll end up with two different shapes. So the order in which instance variables are set actually matters.
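A minimal sketch of that situation, with a hypothetical class:

```ruby
# Hypothetical example: the same ivars assigned in two different
# orders give instances two different object shapes.
class Config
  def initialize(fast)
    if fast
      @a = 1
      @b = 2
    else
      @b = 2 # @b is set first on this branch...
      @a = 1 # ...so these instances get a different shape
    end
  end

  def sum
    @a + @b # this read site now sees two shapes
  end
end
```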
00:51:16.559 So the thing that I want to convey to you is not "don't use polymorphism, polymorphism is terrible." I think it's
00:51:21.839 great. I really think it's great. But what I want to convey is: use useful types, and set instance
00:51:30.240 variables in a consistent order. Try to set all of your ivars in a consistent
00:51:35.359 order; don't set them randomly. When I say useful types, what does that mean? What is a useful type?
00:51:41.920 When I say useful, I mean useful to your application: something that actually provides you with value. For example,
00:51:48.000 maybe you have multiple different credit card processors, and you want some sort of
00:51:55.040 strategy object that changes depending on the payment processor backend that
00:52:01.440 you're using. In this case we have an example where polymorphism is really helping out our application. It's
00:52:07.280 encoding business logic. It's making the code easier to understand and easier to maintain. This is something that
00:52:14.079 should be celebrated and something that we should be doing in our code.
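As a sketch of that kind of high-value polymorphism (the processor classes and method names are invented for illustration):

```ruby
# Hypothetical payment-processor strategy objects: each backend
# implements the same charge interface, so the polymorphism here
# encodes real business logic.
class StripeProcessor
  def charge(amount_cents)
    puts "charging #{amount_cents} cents via Stripe"
  end
end

class PayPalProcessor
  def charge(amount_cents)
    puts "charging #{amount_cents} cents via PayPal"
  end
end

class Checkout
  def initialize(processor)
    @processor = processor
  end

  def purchase(amount_cents)
    # Polymorphic, but high value: the type IS the business rule.
    @processor.charge(amount_cents)
  end
end

Checkout.new(StripeProcessor.new).purchase(1299)
```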
00:52:20.880 When I say low-value polymorphism, what I'm talking about is code that looks kind of like this, where we've got a cache, we're passing a key to the
00:52:26.800 cache, and we're just calling to_s on that thing. And the reason we're calling to_s
00:52:32.480 is that we want to take either strings or symbols as our inputs, and we need them to be consistent when we look them up in
00:52:38.400 the cache. So we're just calling to_s on the key.
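A minimal sketch of the pattern being described, with a hypothetical Cache class:

```ruby
# Hypothetical cache that normalizes keys with to_s so callers
# can pass either a String or a Symbol.
class Cache
  def initialize
    @store = {}
  end

  def store(key, value)
    @store[key.to_s] = value
  end

  def lookup(key)
    @store[key.to_s] # key.to_s sees String and Symbol: polymorphic
  end
end

cache = Cache.new
cache.store(:user, "aaron") # Symbol key
cache.lookup("user")        # String key => "aaron"
```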
00:52:45.040 This is my very hot take on stage today: I don't think this type of code is very valuable, because what are we
00:52:51.599 actually getting out of this? You could just change the caller to use a string instead. So we could say, "Instead
00:52:57.119 of using a symbol here, let's be consistent and always use a string." Or even better, you're like,
00:53:03.359 "No, no, no, Aaron. I need to use a symbol there. I've got to use a symbol. We don't know where the data is coming from. It could be a symbol; we
00:53:10.079 need to make it a symbol." So maybe instead of doing the to_s here, we could change the caller and
00:53:15.839 call to_s there instead. If you do that, you end up with two monomorphic call sites. And
00:53:23.359 of course, since it's always a consistent type going into the lookup method, we can just remove the to_s, and
00:53:28.640 now we only have a single monomorphic call site.
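Continuing the hypothetical Cache sketch from above, the refactor might look like this:

```ruby
# Step 1: make each caller normalize its own key...
cache.lookup(:user.to_s) # this to_s only ever sees a Symbol
cache.lookup("user")     # and lookup now only ever sees Strings

# Step 2: since every caller passes a String already, reopen the
# hypothetical Cache and drop the conversion entirely.
class Cache
  def lookup(key)
    @store[key] # a single monomorphic call site, no conversion
  end
end
```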
00:53:36.000 This is the type of, in my opinion, low-value polymorphism that we should be removing from our applications. A famous person once said: keep
00:53:44.720 useful polymorphism, remove useless polymorphism.
00:53:51.599 Love quoting myself. All right, let's wrap this up. We talked about
00:53:56.640 Ractors. We talked about parallelism. We talked about weird bottlenecks, which I thought were very, very fun. We
00:54:03.119 talked about how Ractor.new can be thought of as a "no GVL" block. We talked about JIT compilers, how and why
00:54:10.960 they work. We talked about the differences between ZJIT and YJIT, as well as writing JIT-friendly code
00:54:16.720 by monomorphizing our code. So I want all of you to please, please upgrade to
00:54:22.000 Ruby 3.5 now. Right now. I'll wait. I know we have a party coming.
00:54:28.720 Oh, hold on, one more thing. I have a question for all of you. Do
00:54:34.880 you allocate objects in your Rails app?
00:54:40.319 Yeah. Yes, let's cheer for object allocation. Woo.
00:54:45.599 I love asking questions where I already know the answer. Like, yes, we allocate.
00:54:53.920 Allocations are much, much faster in Ruby 3.5, which I hope is a good
00:54:59.200 reason for all of you to upgrade. I want to show you a benchmark. We made allocations faster. We have
00:55:04.559 a User class here, and we're going to instantiate it 500,000 times. We
00:55:11.359 actually made this 70% faster on Ruby 3.5 than it is in Ruby 3.4.
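A minimal sketch of that kind of benchmark (the User class and its fields are assumptions, not the exact slide code):

```ruby
require "benchmark"

# Hypothetical stand-in for the User class on the slide.
class User
  def initialize(name, email)
    @name  = name
    @email = email
  end
end

# Allocate 500,000 users and time it: the claim is that this kind
# of allocation-heavy loop is ~70% faster on Ruby 3.5 than on 3.4.
puts Benchmark.measure {
  500_000.times { User.new("aaron", "aaron@example.com") }
}
```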
00:55:24.160 So, please, please, please upgrade. It's been an honor to be here with you this year. I'm so happy I could come to
00:55:29.520 Amsterdam. Oh, wait. I have one more joke.
00:55:34.880 I was going out to dinner here in Amsterdam, and I was really, really worried about whether or not they
00:55:41.200 would split the bill for me. But it turns out that everybody here is dining Dutch.
00:55:49.839 Oh, come on. All right. Thank you.