From Resque to SolidQueue - Rethinking our background jobs for modern times


Summarized using AI


Andrew Markle • July 10, 2025 • Philadelphia, PA • Talk

Introduction

Andrew Markle, a staff engineer at Fullscript, presents at RailsConf 2025 on their migration from Resque to Solid Queue, detailing why the change was necessary, how they approached the migration, the challenges they hit, and the benefits realized.

Background and Motivation

  • Fullscript's background job system had used Resque since 2011, but as the company scaled (to over 150 engineers), Resque's limitations became apparent: forked dependencies to maintain, mysteriously stalled workers, underwhelming reliability, and operational complexity.
  • Solid Queue, which became Rails' default queue adapter in 2024, was attractive because of:
    • Native support for the databases Rails supports (SQLite, MySQL, Postgres)
    • Deep integration with Rails, implying long-term community support
    • Improved observability by leveraging familiar database tooling

Migration Strategy

  • Preparation: Ensured all jobs used Active Job (Rails' job abstraction); legacy Resque jobs had been converted to this interface years earlier, which eased the transition.
  • Database Decisions: Implemented Solid Queue on a separate MySQL database in production to avoid adding heavy write load to the application’s primary database. Development, review environments, and staging use the primary database under a different schema (see the configuration sketch after this list).
  • Transactional Integrity: Used Solid Queue's enqueue_after_transaction_commit setting, leaving it off during the migration so behavior matched Resque. The isolator gem helped identify places where transactional integrity around jobs could become a problem.
  • Queue Renaming and SLO Alignment:
    • Moved away from confusing, legacy priority/domain-based queue names to names that state a job's maximum tolerable latency (e.g., within_1_minute).
    • This improved clarity for developers and operations, made SLOs explicit, and enabled actionable alerting and autoscaling tied to queue latency metrics.
  • Worker Configuration: Set up dedicated workers for each latency queue, enabling straightforward autoscaling by queue demand. Fast queues were over-provisioned to handle unexpected spikes, while balancing resource costs.
  • Monitoring: Used the Yabeda gems to pull queue latency and job-count metrics from the database and push them to Prometheus, making it easy to build dashboards for real-time observability of queue health.
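A minimal sketch of the database split described under "Database Decisions", assuming MySQL and the standard Solid Queue installer layout; database names and paths are illustrative, not Fullscript's actual configuration:

```yaml
# config/database.yml -- production points Solid Queue at its own database
default: &default
  adapter: mysql2
  encoding: utf8mb4
  pool: <%= ENV.fetch("RAILS_MAX_THREADS", 5) %>

production:
  primary:
    <<: *default
    database: app_production
  queue:
    <<: *default
    database: app_queue_production
    migrations_paths: db/queue_migrate
```

```ruby
# config/environments/production.rb
# Solid Queue reads and writes through the dedicated queue database.
config.solid_queue.connects_to = { database: { writing: :queue } }

# Flipped only at the very end of the migration, once everything runs on Solid Queue:
# config.active_job.queue_adapter = :solid_queue
```

In development, review environments, and staging, the queue entry can simply point at the primary database (or a separate schema on the same server), which keeps costs down where scale isn't a concern.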

Migration Execution and Pragmatic Lessons

  • Incremental Rollout: Overrode Active Job's queue_as method to route jobs to Solid Queue or Resque by queue name, allowing gradual, low-risk migration (see the sketch after this list).
  • Error Handling and Rollback: Added an around_enqueue callback that falls back to Resque when Solid Queue fails to enqueue, ensuring no lost jobs during the transition.
  • Challenges Encountered:
    • Argument length limits – resolved by widening the arguments column from TEXT to MEDIUMTEXT.
    • Unexpected worker shutdowns – traced to memory limits, resolved by increasing instance memory.
    • Database connection exhaustion – occurred during large job influxes, mitigated by scaling DB and limiting autoscaler.
    • Misplaced long-running jobs – enforced a per-queue execution ceiling of 10% of the queue's latency target, monitored via Sentry alerts.
    • Migrating delayed/scheduled jobs – custom scripts were used to migrate jobs from Redis to Solid Queue.
    • Observed that batch job enqueueing creates DB spikes; addressed with techniques like batching and randomized scheduling.
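A minimal sketch of that incremental routing, assuming the SLO queue names used in the talk; the validation details and the example job are illustrative, not Fullscript's exact code:

```ruby
# app/jobs/application_job.rb
class ApplicationJob < ActiveJob::Base
  # The only queues supported going forward, named after their latency SLOs.
  SLO_QUEUES = %w[
    within_1_minute
    within_5_minutes
    within_10_minutes
    within_1_hour
    within_24_hours
  ].freeze

  # Jobs that declare one of the new SLO queues are enqueued via Solid Queue;
  # everything else falls through to the existing adapter (Resque).
  def self.queue_as(part_name = nil, &block)
    if part_name.to_s.start_with?("within_")
      raise ArgumentError, "unknown queue: #{part_name}" unless SLO_QUEUES.include?(part_name.to_s)

      self.queue_adapter = :solid_queue
    end
    super
  end
end

# Migrating a job is then a one-line change: jobs already waiting in Resque
# stay there, new ones are enqueued through Solid Queue.
class SyncPatientRecordsJob < ApplicationJob
  queue_as :within_5_minutes

  def perform(practitioner_id); end
end
```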

Results and Takeaways

  • Migrated all jobs with minimal downtime and no lost jobs over a couple of months with a small team.
  • After migration, job failure rates dropped significantly for high-volume jobs, demonstrating improved reliability.
  • Solid Queue delivered strong performance and scaling, with improved observability and operational simplicity.
  • Renaming queues based on latency SLOs was a substantial benefit, clarifying responsibilities and making alerting actionable.
  • The process serves as a real-world playbook for modernizing background job infrastructure in Rails without disrupting service.

From Resque to SolidQueue - Rethinking our background jobs for modern times
Andrew Markle • Philadelphia, PA • Talk

Date: July 10, 2025
Published: July 23, 2025

If your Rails app has been around for a while, you might still be using Resque for background jobs. We were too—until scaling issues, missing features, and increasing maintenance costs made it clear that Resque was no longer working for us.

This year we migrated to SolidQueue, Rails’ new default job runner and haven't looked back. This talk will walk you through how we did it—what worked, what didn’t, and what we learned along the way.

Key takeaways:
• Why we left Resque
• How we migrated with minimal disruption using a parallel rollout
• Why we went through the effort to rename all our queues so that they were SLO-based (within_1_minute) and why this matters
• Lessons learned, pitfalls to avoid, and how SolidQueue made our jobs (and jobs!) easier

If your background jobs are from a previous era, this talk will give you a practical, real-world migration playbook to modernize with SolidQueue—without breaking anything.

RailsConf 2025

00:00:00.000 All
00:00:16.960 right, let's let's get started. So, like
00:00:18.960 Rosa said, I'm a staff engineer at
00:00:20.560 Fullscript, which is a platform for uh we
00:00:23.600 help practitioners deliver holistic
00:00:25.359 medicine to their patients. And uh our
00:00:28.560 first commit was in 2011 and what we've
00:00:32.640 seen is the business has like grown a
00:00:34.079 lot since then. So when I joined which
00:00:35.760 is about 7 years ago and we were a dozen
00:00:38.160 engineers and now engineering is more
00:00:40.719 than 150 and as we grow and scale what
00:00:43.680 we're finding is the solutions that we
00:00:45.520 had in the past are no longer scaling
00:00:47.360 with the business and like a good
00:00:49.440 example of this is our background
00:00:50.879 queuing system. So last year we came to
00:00:53.280 the conclusion that it was really time
00:00:55.600 to say goodbye to Resque. Uh Resque had
00:00:58.960 served us very well for many years. Uh
00:01:02.879 and when our app started, Resque was
00:01:04.720 kind of the clear choice. It was before
00:01:06.880 Sidekiq, which continues to be great
00:01:08.640 today. It was before Active Job. But in
00:01:11.280 the last couple of years, there's been
00:01:12.799 some really nice improvements in kind of
00:01:14.159 like rethinking what's possible in this
00:01:16.479 space. And I think it started with uh
00:01:18.640 with GoodJob and some improvements to
00:01:20.640 databases themselves. Uh now using a
00:01:23.280 database is a viable alternative to
00:01:25.360 Redis and um after GoodJob, Solid Queue was
00:01:28.880 built as an alternative and you could
00:01:31.360 use like all the databases that Rails
00:01:33.680 supports. So SQLite, MySQL, Postgres
00:01:36.400 and in 2024 Solid Queue was incorporated
00:01:38.799 into Rails as the default queue adapter. So
00:01:41.600 around that time we were starting to
00:01:42.799 like really think seriously about
00:01:44.960 replacing Resque with something else and
00:01:47.520 it was because we had like a couple
00:01:49.360 problems.
00:01:51.439 The first was a dependency story. We
00:01:54.320 were maintaining a bunch of forks that
00:01:56.320 we never wrote and didn't understand. Uh
00:02:00.000 we also started to have this issue where
00:02:03.200 our Resque workers would kind of just
00:02:04.799 like uh take a nap at various times. uh
00:02:08.000 they just kind of stopped working and
00:02:09.920 and uh any jobs were in the middle of
00:02:11.840 running they would just get killed and
00:02:13.040 they'd throw this error and now we
00:02:15.120 didn't lose any jobs because of this. Uh
00:02:17.120 we could always like manually retry them
00:02:19.200 but it didn't give us a lot of
00:02:20.480 confidence that we had a reliable and
00:02:22.239 resilient system in place here and it
00:02:24.640 was really kind of fairly mysterious as
00:02:26.239 to like why this was exactly happening.
00:02:28.640 The error we were getting was uh pretty
00:02:31.040 like low-level C code and we kind of
00:02:33.760 had a choice. Do we invest more time and
00:02:36.560 energy into Resque? We fix the bug. Do
00:02:39.680 we get it going again? Maybe we have to
00:02:41.280 fork Resque or whatever. Or do we invest
00:02:44.879 our time and energy into an alternative?
00:02:48.800 And so we started shopping around for
00:02:50.160 alternatives. And um I think like
00:02:52.720 Sidekiq is a great alternative. I think
00:02:54.959 we would have been happy had we chosen
00:02:56.720 Sidekiq. Uh GoodJob is great but it
00:02:59.519 supports only Postgres and we're on
00:03:01.599 MySQL so that's not an option for us. And
00:03:04.080 Solid Queue at this time was made the new
00:03:06.560 default in Rails. So we were like very
00:03:08.080 curious about it and there's uh some
00:03:10.959 nice things about Solid Queue. So the main
00:03:12.640 one is it's backed by a database that
00:03:15.120 supports MySQL like I said and while
00:03:17.280 it's possible to have metrics and kind
00:03:19.519 of like get information out of Redis,
00:03:22.720 it's actually way more easy and
00:03:24.480 straightforward just to make a read
00:03:26.080 replica of your database and devs can
00:03:28.480 then go and write SQL queries and figure
00:03:30.159 out what's going on in there. Uh that's
00:03:32.080 really nice. like I like that a lot. The
00:03:35.200 other major incentive is that it's part
00:03:37.200 of Rails, which to me means it's going
00:03:39.440 to be a robust solution that's well
00:03:41.599 supported and maintained for the future.
00:03:43.599 And we're heavily invested in Rails in
00:03:45.200 the future of Rails and it just made
00:03:47.440 sense to use the defaults uh wherever
00:03:49.920 possible.
00:03:52.480 Uh the other thing is that a bunch of us
00:03:54.239 were really lucky enough to go to Rails
00:03:55.920 World last year and we watched Rosa's
00:03:58.239 talk about Solid Queue, how it was built,
00:04:00.319 how it works, and why Basecamp decided
00:04:02.239 to build it. Uh I really recommend
00:04:04.400 watching one of Rosa's talks if you want
00:04:06.159 to deep dive into how Solid Queue works
00:04:08.000 under the hood. And we came away from
00:04:10.159 that talk really wanting to give Solid
00:04:11.760 Queue a try and see if it would be a good
00:04:13.519 fit for us.
00:04:16.239 So early this year, we started to work
00:04:18.079 on it and like here's the plan. So, I'm
00:04:19.759 going to go through this throughout the
00:04:22.000 course of this presentation and uh yeah,
00:04:24.960 let's get started. So, the first thing
00:04:27.759 you just need to make sure that all of
00:04:29.360 your jobs are using Active Job. That's a
00:04:32.320 requirement. Like I said, Resque was
00:04:33.840 written before Active Job ever existed.
00:04:36.560 So, it has its own way of enqueuing jobs.
00:04:38.960 Uh and you have to kind of convert those
00:04:40.639 all over to be Active Job. And we had
00:04:42.800 luckily done this like years ago. Uh so,
00:04:45.280 we were kind of ready to go in this
00:04:46.639 scenario. Uh the next thing you can do
00:04:48.880 is you can set up Mission Control – Jobs.
00:04:50.639 And this is the dashboard app which
00:04:52.479 allows you to retry or discard failed
00:04:54.639 jobs. It supports both Resque and Solid
00:04:57.199 Queue.
00:04:59.360 And to set up Solid Queue is fairly
00:05:01.600 simple like the readme is really great.
00:05:03.680 Uh but there is a couple important
00:05:05.360 decisions that you need to make. And the
00:05:07.600 most important one is like uh do you use
00:05:10.800 your primary database or do you make a
00:05:13.440 separate one? And the answer is really
00:05:15.600 going to depend on the scale that you're
00:05:17.919 operating in. For us, our primary
00:05:20.720 database was not an option. Solid Q is
00:05:23.759 going to add like a ton of write heavy
00:05:25.759 load to your DB. Uh and this could be a
00:05:27.919 risk for us. We don't want to do that.
00:05:29.120 So instead, we created a new MySQL uh Q
00:05:32.320 database for production. But in
00:05:34.880 development, review environments,
00:05:36.960 staging, etc., we use the same database
00:05:39.680 as our primary just on a different
00:05:41.440 schema. And for those environments, it's
00:05:43.600 perfect. We can save on some cost and uh
00:05:46.720 we aren't worried about scale in those
00:05:48.240 environments.
00:05:50.880 All right, so pop quiz. What happens
00:05:54.080 uh if this transaction gets rolled back?
00:05:58.400 Well, the answer is it depends. So on
00:06:00.880 Resque, there's no transactional
00:06:03.039 integrity. Uh the job is always going to
00:06:05.680 get fired and this can be a big source
00:06:07.280 of bugs in your in your system. So a
00:06:10.080 user never gets created and then you try
00:06:11.600 to send them an email. Great. Uh but
00:06:14.639 then this kind of like makes me ask a
00:06:16.160 question, you know, like if we have a
00:06:18.400 separate DB in production but the same
00:06:21.199 one in staging, would we have
00:06:22.960 transactional integrity in staging but
00:06:24.639 not in production? Like how's that going
00:06:25.919 to work?
00:06:27.600 And that would be kind of confusing. So
00:06:28.960 it's really nice that in uh in Rails 8
00:06:31.280 we have this config option for Solid Queue
00:06:33.039 and it allows us to toggle transaction
00:06:35.440 integrity on and off for each job if you
00:06:38.400 want. And this is false by default. So
00:06:41.600 in this example, when enqueue_after_transaction_commit
00:06:44.319 is true, transaction
00:06:47.120 completes, job fires, transaction rolls
00:06:49.840 back, job doesn't fire. So for us, while
00:06:53.039 we're migrating, we just leave it off
00:06:54.560 because we want to just mimic how
00:06:56.479 everything's going in Resque. We don't
00:06:58.080 want to change anything at this point.
00:06:59.919 Uh but we can think about like enabling
00:07:01.599 it in the future.
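A sketch of that Rails 8 per-job toggle as described here; the job and model are made up for illustration:

```ruby
class WelcomeEmailJob < ApplicationJob
  # Only enqueue once the surrounding transaction commits (false by default,
  # which mimics Resque's behaviour during the migration).
  self.enqueue_after_transaction_commit = true

  def perform(user)
    UserMailer.welcome(user).deliver_now
  end
end

User.transaction do
  user = User.create!(email: "jane@example.com")
  WelcomeEmailJob.perform_later(user)
  raise ActiveRecord::Rollback # rolled back, so the job is never enqueued
end
```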
00:07:04.160 Okay, so to recap, we're using the same
00:07:06.560 type of database. For us, it's MySQL as
00:07:09.199 our primary, but we've configured Solid
00:07:12.080 Queue to be on a separate database in
00:07:14.319 production environments. We have no
00:07:16.639 transactional integrity. And for now, we
00:07:18.479 want to keep things the same as Resque.
00:07:21.120 And uh if you're curious about like
00:07:25.039 figuring out transactional integrity, I
00:07:27.280 recommend looking at the isolator gem.
00:07:29.039 It's going to help you find all the
00:07:30.400 places in your app where transactional
00:07:32.240 integrity with jobs might be a problem.
00:07:36.479 Now, the second most important thing to
00:07:38.080 think about is like how do you set up
00:07:39.919 your workers? And to talk about that, I
00:07:41.360 first need to talk about queues.
00:07:45.120 So, with Resque, we had some technical
00:07:47.680 debt and it was around how we named our
00:07:50.080 queues. So, this is a list of all the queues
00:07:52.880 we had and there's a lot of problems
00:07:55.919 with this setup. So the big problem is
00:07:58.240 we had a mix of priority-based queues and
00:08:02.560 domain-specific queues. And domain
00:08:06.560 queues kind of like happen because uh
00:08:09.599 maybe you first put your job in the
00:08:12.960 critical queue because it's really
00:08:14.319 important, but then there's a bunch of
00:08:15.520 other stuff in the critical queue. So
00:08:17.680 either your job is slowing those jobs
00:08:19.840 down or you're or like something's
00:08:22.240 slowing something down. And so you're
00:08:23.759 like, I know what to do. I'm just going
00:08:25.199 to make my own queue. Right. Uh but this
00:08:28.960 does not scale. Uh nothing else can go
00:08:31.599 into this queue. Ops has to set up new
00:08:34.479 infrastructure to support this new
00:08:36.240 queue. And what happens when some
00:08:38.320 notifications that need to go out
00:08:39.760 immediately and there's some that can go
00:08:42.320 out in the next hour and those are
00:08:43.839 mixing together. You still have the
00:08:45.760 potential problem where some less
00:08:47.279 important jobs are slowing down the
00:08:49.279 delivery of other jobs. And you also
00:08:51.680 have a language problem. So, what type
00:08:53.040 of notification can I put in the
00:08:54.880 notifications queue? I don't know.
00:08:59.680 And priority-based queues are no better.
00:09:01.839 The problem with this setup is no one
00:09:03.600 knows what any of these mean. As a
00:09:06.480 developer, you write a new job. Which
00:09:08.480 queue do you put it in?
00:09:11.040 I don't know. Maybe you know. For ops,
00:09:14.240 when do you need to autoscale the medium
00:09:16.240 queue? Do you need to or is it okay if
00:09:19.040 this job just fills up for most of the
00:09:20.800 day and then clears up by the next day?
00:09:23.040 I don't know. The problem is no one
00:09:25.519 knows. So, what do we actually want to
00:09:28.480 know? We want to know how long is my job
00:09:31.839 going to sit in the queue for.
00:09:34.959 There's a great talk uh from RubyConf
00:09:37.440 2022 called What does high priority
00:09:40.000 mean? The Secret to Happy Queues. Uh it's
00:09:42.640 great. You should go watch it at some
00:09:44.080 point. Uh, I watched this a couple years
00:09:46.399 ago and like it kind of like stuck in my
00:09:48.000 brain and I thought like whenever we
00:09:49.600 like touch our background queuing
00:09:50.880 system, I'm going to do that. And in
00:09:53.600 Daniel's talk, he really suggests naming
00:09:56.000 your queues based on latency tolerances.
00:09:58.560 So based on that idea,
00:10:01.360 our job queues went from this
00:10:04.240 uh to this.
00:10:11.839 So this does a couple things. It makes it
00:10:14.560 clear to developers and to ops what the
00:10:18.079 queues mean. It becomes a contract. So
00:10:20.240 each queue has an implicit SLO. What
00:10:22.399 we're saying is here is jobs in the
00:10:24.880 within one minute queue are guaranteed
00:10:26.560 to run within 1 minute or less. That's
00:10:29.120 the longest tolerable latency that's
00:10:32.000 going to be acceptable for that queue.
00:10:33.519 You put a job in the within one minute
00:10:35.200 queue, it's going to get in and get out
00:10:36.880 in under a minute. That's what we're
00:10:38.160 promising. The beauty of this is we can
00:10:40.880 directly tie our alerting to this
00:10:43.040 expected SLO. If the queue latency is
00:10:46.160 taking longer than it should, we can
00:10:47.760 raise an alert. We can autoscale more
00:10:50.079 resources. And that's because we now
00:10:51.839 understand what each of these queues mean.
00:10:54.880 And to get these names, uh, we just made
00:10:57.440 a spreadsheet. So, we listed all the
00:10:59.440 jobs. We assigned uh the team that owned
00:11:02.079 that job. and we just asked uh what's
00:11:04.560 the longest amount of time that this job
00:11:06.399 can sit in the queue before something
00:11:08.399 bad happens and then we aggregated
00:11:11.440 that data and we figured out these
00:11:13.519 new queue names and this is how we
00:11:15.600 got them.
00:11:17.839 So now we have our queue names and what we
00:11:21.600 wanted to do is we want to have
00:11:22.800 dedicated workers for each queue. So
00:11:25.200 meaning that they don't share resources.
00:11:27.519 A worker is uh these are basically just
00:11:29.839 like Rails servers that we boot up. Each
00:11:32.079 one is in charge of running bin/jobs
00:11:34.000 which will run a Solid Queue supervisor
00:11:36.000 which will fork a separate process for
00:11:37.680 each Solid Queue worker and the workers
00:11:40.079 just churn through jobs and the idea
00:11:42.560 behind this is that we can autoscale up
00:11:44.320 and down depending on how busy we are.
00:11:46.560 If the within one hour queue is very
00:11:48.880 busy the queue latency is getting a bit
00:11:51.120 undesirable we'll just like autoscale up
00:11:53.200 more workers to address that load. Uh
00:11:55.600 but there is a a bit of a caveat here.
00:11:59.200 uh it takes us about a couple minutes
00:12:00.720 for us to boot up a new Rails server and
00:12:02.959 if we notice that the queue latency is
00:12:04.959 is creeping up in the within one minute
00:12:07.839 uh and it's getting really bad and we
00:12:09.200 want to autoscale it by the time that
00:12:11.360 new servers come online we've already
00:12:13.120 blown through our SLO
00:12:15.600 now to fix that problem uh we just over
00:12:19.120 overprovision the really fast queues uh
00:12:21.760 and that's been working for us but it
00:12:23.440 does mean that we have some servers just
00:12:24.959 sitting idle for a bunch of the time now
00:12:27.200 this costs us something.
00:12:28.959 But our jobs get delivered even when
00:12:32.959 there's an unexpected spike.
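For reference, per-queue workers in Solid Queue are declared in config/queue.yml, roughly like the sketch below; thread and process counts are illustrative, and dedicating machines to a single queue can be done by pointing each deployment at its own config file rather than this shared one. This is not Fullscript's real configuration.

```yaml
# config/queue.yml -- numbers are illustrative only
production:
  dispatchers:
    - polling_interval: 1
      batch_size: 500
  workers:
    - queues: within_1_minute
      threads: 5
      processes: 3       # over-provisioned so spikes don't blow the SLO
      polling_interval: 0.1
    - queues: within_1_hour
      threads: 5
      processes: 1
      polling_interval: 1
```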
00:12:37.200 The other thing that we wanted to do
00:12:38.399 early on is just have some metrics. So
00:12:40.720 we needed to know the queue latency. And
00:12:42.639 what this means in other words is like
00:12:44.480 how long has the longest job in the
00:12:46.560 queue been waiting for.
00:12:48.720 And because Solid Queue is just a database
00:12:50.480 you can just query the database and get
00:12:51.839 an answer, right? So we use uh the Yabeda
00:12:54.959 gems which basically will just like run
00:12:56.639 this query at some interval. It'll
00:12:58.880 collect those metrics and it'll push
00:13:00.320 them up to Prometheus and then from
00:13:02.399 there we can create some dashboards. Uh
00:13:04.880 and these are like an example of some
00:13:06.720 dashboards that we built. Top three rows
00:13:09.120 is queue latency. Uh red is bad, green is
00:13:12.240 good. Bottom is the number of jobs in
00:13:15.040 the queue at that time. And with these
00:13:17.120 we're able to monitor the health of our
00:13:18.560 queues and just to make sure that we're
00:13:19.839 meeting our SLOs or not.
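The query itself isn't shown in the talk; a minimal sketch of that kind of collector with the Yabeda DSL (exported via yabeda-prometheus) might look like this. The metric name and the use of SolidQueue::ReadyExecution are assumptions, not Fullscript's exact code.

```ruby
# config/initializers/yabeda.rb
Yabeda.configure do
  group :solid_queue do
    gauge :queue_latency_seconds,
          tags: [:queue_name],
          comment: "Age of the oldest ready job in each queue"
  end

  # Runs on each metrics collection cycle; queues with no ready jobs report nothing.
  collect do
    SolidQueue::ReadyExecution.group(:queue_name).minimum(:created_at).each do |queue_name, oldest|
      Yabeda.solid_queue.queue_latency_seconds.set({ queue_name: queue_name }, Time.current - oldest)
    end
  end
end
```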
00:13:23.440 Okay, so that's the setup. That's all
00:13:25.839 the configuration and philosophy behind
00:13:28.240 what we're doing. And then how do we
00:13:30.560 actually like migrate from Resque to
00:13:32.720 Solid Queue? And remember that ideally we
00:13:35.440 want like no downtime, no lost jobs,
00:13:38.240 right? And the answer is like pretty
00:13:40.560 simple. So first of all, we just made
00:13:42.959 this constant in our ApplicationJob. Uh
00:13:45.760 this just lists all the queues that we're
00:13:48.000 going to support going forward.
00:13:50.560 Then we override the queue_as method. Now
00:13:53.839 this comes from Active Job. And what
00:13:55.279 we're doing is we're just looking for
00:13:57.360 the queue name. If the queue name is within
00:14:01.279 underscore something. We just make sure
00:14:03.519 it's valid. And then if it matches that,
00:14:05.680 then we set the queue adapter to Solid Queue. If
00:14:08.639 it doesn't match, we just call super and
00:14:11.760 it just does what it normally did. And
00:14:13.839 we enqueue the job in Resque. And that's
00:14:15.600 it.
00:14:17.440 Then your changes are really
00:14:18.959 simple. So all you need to do is change
00:14:20.079 the queue_as in your job and then
00:14:22.399 when this is deployed what happens is
00:14:24.880 any job that was previously enqueued in
00:14:26.800 Resque will stay in Resque but new jobs
00:14:29.199 are getting enqueued with Solid Queue
00:14:32.639 like pretty simple and and it mostly is
00:14:35.839 this simple. So for 95% of our jobs,
00:14:38.000 this was the only change that we needed.
00:14:40.240 And what we did is we started by moving
00:14:42.320 a couple jobs where it's it's kind of
00:14:44.160 okay if something bad were to happen. If
00:14:46.639 the worst case scenario happens here and
00:14:48.399 we lose a bunch of jobs, it's like it's
00:14:50.160 okay for those ones. And then we started
00:14:52.800 with those ones and then we started to
00:14:55.440 uh take on ones that are higher risk,
00:14:58.399 more complicated as we went, as we
00:15:00.320 ironed out the kinks in the system. And
00:15:02.480 I'll go through a few problems that we
00:15:04.000 encountered along the way and how we
00:15:06.160 solved them.
00:15:08.000 Uh but here's a tip.
00:15:10.399 Pretty much all the problems we
00:15:12.240 encountered was because our
00:15:14.320 infrastructure was underprovisioned for
00:15:16.639 the needs of our system. And you
00:15:19.199 probably like have a good guess of like
00:15:21.279 what the needs of your system are. Like
00:15:23.040 we did some napkin math as to figuring
00:15:24.959 out like how many jobs are we running
00:15:26.480 and what size of database are we going
00:15:28.800 to need and we got it completely wrong.
00:15:34.000 So if you have something like this
00:15:35.519 though, if you get it wrong,
00:15:37.040 everything's okay. So what this does,
00:15:39.040 it's just an around_enqueue block in your
00:15:41.040 ApplicationJob. Uh if Solid Queue can't
00:15:43.440 enqueue your job, uh we just
00:15:46.800 rescue that error. Uh we re-enqueue the job
00:15:50.480 back into Resque and this way you don't
00:15:53.199 lose anything. Nothing bad happens if
00:15:54.800 you make a mistake. You can like roll
00:15:56.399 back that change. You have some time to
00:15:58.079 figure out what went wrong and you can
00:15:59.920 address it.
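A sketch of that safety net, with assumptions: the rescued error class is deliberately broad, and the fallback re-enqueue goes through Active Job's Resque adapter, which was still in place while both systems ran side by side. This is not the exact code from the talk.

```ruby
# app/jobs/application_job.rb
class ApplicationJob < ActiveJob::Base
  around_enqueue do |job, block|
    block.call
  rescue StandardError => error
    # If Solid Queue can't accept the job (exhausted connections, oversized
    # arguments, etc.), report the error and push the job into Resque instead
    # so nothing is lost while the problem is investigated.
    Rails.error.report(error, handled: true)
    ActiveJob::QueueAdapters::ResqueAdapter.new.enqueue(job)
  end
end
```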
00:16:02.880 Uh so a couple problems we ran into. So
00:16:05.440 one was we had uh too many arguments. So
00:16:08.720 Solid Queue needs to store everything about
00:16:10.560 your job in a table including the job's
00:16:12.480 arguments. And for whatever reason we
00:16:16.480 had a couple jobs that had a an argument
00:16:20.399 list that was insanely long. Uh so by
00:16:23.440 default MySQL is going to store 65,535
00:16:25.839 bytes in the TEXT column here. And uh
00:16:29.839 for one of our jobs it just wasn't
00:16:31.759 enough.
00:16:33.759 So the solution was uh just to make that
00:16:36.320 column larger. So we bumped it up to
00:16:38.160 MEDIUMTEXT. This gives us 16 million
00:16:40.240 bytes in MySQL. Uh and that just solved
00:16:42.880 the problem for us.
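The fix is a one-line migration against the queue database; a sketch, assuming the default Solid Queue schema where the serialized arguments live on solid_queue_jobs:

```ruby
# db/queue_migrate/20250101000000_widen_solid_queue_job_arguments.rb (illustrative)
class WidenSolidQueueJobArguments < ActiveRecord::Migration[8.0]
  def up
    # TEXT (~64 KB) -> MEDIUMTEXT (~16 MB) on MySQL
    change_column :solid_queue_jobs, :arguments, :text, size: :medium
  end

  def down
    change_column :solid_queue_jobs, :arguments, :text
  end
end
```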
00:16:45.839 But then we started to like migrate some more jobs
00:16:47.759 in batches. And we see in Mission
00:16:50.240 Control that like some workers are just
00:16:52.079 starting to die. They would kind of just
00:16:54.320 quit unexpectedly. Like what's going on
00:16:56.079 here? This was surprising to us because
00:16:58.800 we configured our workers to have like a
00:17:00.720 graceful shutdown, which is what you should do.
00:17:02.639 So instead of like sending a SIGKILL,
00:17:04.640 which is that signal 9 in the error
00:17:06.720 here, instead you want to send a
00:17:09.360 SIGTERM or a SIGINT. And it just tells the
00:17:11.600 worker like, hey, we're going to shut
00:17:12.799 you down. Finish what you're doing.
00:17:15.039 Don't take on any more work. And then
00:17:17.120 you give it a bit of time, and then only
00:17:19.280 then if the worker hasn't exited, then
00:17:21.120 you send a sik kill. And we were doing
00:17:22.880 that. So this error was surprising and
00:17:25.120 it kind of reminded us of this like
00:17:26.959 problem we were having with rescue. But
00:17:29.440 then we realized we dug a bit deeper and
00:17:31.919 it just turns out that we were just
00:17:33.120 running completely out of memory. We
00:17:35.200 added a few new jobs and those jobs were
00:17:37.600 just more memory intensive than the
00:17:39.360 previous ones and required more
00:17:40.799 resources to do what they were doing.
00:17:43.200 The solution for us, we just bumped up
00:17:45.440 the memory. We doubled it from two gigs
00:17:48.000 to four gigs. And this is kind of a
00:17:49.280 brute force solution, right? If you're
00:17:51.520 really concerned with like cost, you
00:17:53.760 could separate out the jobs that require
00:17:56.640 heavy use of memory into their own queues.
00:17:58.400 You could separate out jobs that have
00:18:00.320 high CPU into their own queues. For
00:18:02.160 example, you could do a within one
00:18:04.000 minute high memory or within one minute
00:18:06.240 high CPU if you really wanted to, but
00:18:08.480 for us it's not a big concern. We just
00:18:10.080 prefer to keep it simple. So bumping up
00:18:11.760 the memory made the most sense for us.
00:18:15.120 Uh the next problem we encountered was
00:18:17.440 we ran out of database connections. Why
00:18:20.480 was that happening? Uh, and as it turns
00:18:22.799 out, we did this to ourselves. What
00:18:25.440 happened was we migrated a new job over
00:18:27.679 to Solid Queue and it was set to run on a
00:18:29.840 cron. So, at 1 p.m. Eastern, it
00:18:34.400 enqueued tens of thousands of new jobs at
00:18:37.280 the same time. And our autoscaler did
00:18:40.320 exactly what we told it to do. It
00:18:42.240 increased workers to meet demand. it uh
00:18:45.679 made so many workers that we just ran
00:18:48.240 out of database connections completely.
00:18:52.320 Now the solution here was just we
00:18:54.640 increased the size of our database.
00:18:56.080 Again we had underprovisioned and
00:18:57.919 underestimated what we actually needed.
00:19:00.320 Uh and we uh we use AWS for this. We
00:19:04.160 just did like a blue/green deployment of
00:19:05.840 the database. And to do this, you
00:19:08.400 just kind of like basically just create
00:19:10.000 a replica of your instance
00:19:12.240 and then when they're in sync, you just
00:19:14.080 swap the load balancer and then you're
00:19:15.600 on the new instance essentially. Um, and
00:19:19.760 now we have like more than double the DB
00:19:21.600 connections that we had and that kind of
00:19:23.120 leaves us with a lot of room for any
00:19:24.720 spikes uh going forward. Uh, we also
00:19:28.080 though put a limit onto the number of
00:19:30.160 workers that we can autoscale up to so
00:19:32.400 that we don't do this to ourselves
00:19:34.160 again.
00:19:37.360 Okay, so you might see a problem with
00:19:40.320 this setup though. So like what happens
00:19:42.559 for example if someone
00:19:45.039 uh takes a job that takes a really
00:19:47.280 long time to run. Imagine we have a job
00:19:49.120 that takes five minutes to run from
00:19:50.720 start to finish and and someone puts
00:19:52.640 that in the within one minute queue.
00:19:54.559 Like that's not going to work, right?
00:19:58.000 So the problem with this is if you have
00:20:00.240 these slow jobs, they start consuming
00:20:02.080 all the threads on your workers. And
00:20:03.760 what that means is that worker can't do
00:20:06.160 anything until these jobs finish
00:20:07.760 running. You end up with a backlog of
00:20:09.760 jobs just waiting for the queue to be
00:20:11.840 executed and it's essentially blocked.
00:20:16.160 So to ensure this doesn't happen, we
00:20:17.840 just want to set up a mechanism in place
00:20:19.919 to enforce the idea that fast jobs need
00:20:22.480 to run on the fast queues, slow jobs can
00:20:24.960 run on the slower queues. And we set an
00:20:26.720 objective of one tenth, so 10% of the
00:20:29.919 allowable time. So within 1 minute,
00:20:32.400 that's 6 seconds. Within 5 minutes, 30
00:20:35.039 seconds, 10 minutes, 1 minute. You get
00:20:37.440 the idea.
00:20:40.320 And to do this, we just set up a rule in
00:20:42.640 an around_perform action in our
00:20:44.880 ApplicationJob. Um, and all this
00:20:48.720 does is measure when the job starts and
00:20:50.559 when it finishes. So if it's longer than
00:20:52.880 10% of the latency in the queue's name, we just
00:20:56.159 send an alert up to Sentry. This way we
00:20:59.200 know if a job is like too slow for a
00:21:01.039 given queue and we'll either just move
00:21:02.799 the job to a slower queue or we'll work
00:21:06.320 with that team and we'll figure out like
00:21:08.000 how can we make this job more performant
00:21:09.840 so it can stay in this queue.
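A sketch of that check, with assumptions: the SLO lookup table and the Sentry message are illustrative, not the code shown in the talk.

```ruby
# app/jobs/application_job.rb
class ApplicationJob < ActiveJob::Base
  # Maximum tolerable latency implied by each queue name, in seconds.
  QUEUE_SLO_SECONDS = {
    "within_1_minute"   => 60,
    "within_5_minutes"  => 300,
    "within_10_minutes" => 600,
    "within_1_hour"     => 3_600,
    "within_24_hours"   => 86_400
  }.freeze

  around_perform do |job, block|
    started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    block.call
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at

    slo = QUEUE_SLO_SECONDS[job.queue_name]
    if slo && duration > slo * 0.1
      # Too slow for this queue: flag it so the job gets moved or optimized.
      Sentry.capture_message(
        "#{job.class.name} took #{duration.round(1)}s, too slow for #{job.queue_name}",
        level: :warning
      )
    end
  end
end
```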
00:21:14.320 The other challenge we came across is
00:21:16.559 delayed jobs. So what these are are
00:21:19.760 scheduled jobs that you scheduled at
00:21:22.480 some point into the future. Now I got to
00:21:24.960 be clear like I don't like this pattern.
00:21:27.039 Uh but we have we actually had a lot of
00:21:30.320 jobs like scheduled like so far into the
00:21:33.280 future like I'm talking like I will be
00:21:35.679 dead before they ever run
00:21:38.720 and this is not a not a great pattern
00:21:41.039 like a lot can change from now until
00:21:42.960 then and a lot of the work was just like
00:21:44.880 refactoring that code so that doesn't
00:21:46.799 happen maybe we set up a cron job every
00:21:48.960 day and we check like what jobs we need
00:21:50.400 to run today and we do that uh but some
00:21:53.679 jobs we did need to migrate over
00:21:56.640 And to do that, we wrote this gnarly
00:21:58.400 script that doesn't fit on a slide. Uh
00:22:01.039 I'll include like a QR code at the end
00:22:02.880 of this talk. So you can uh uh grab this
00:22:05.280 code if you need it. But it would
00:22:07.600 basically go into Redis, grab the
00:22:09.440 scheduled jobs, and then convert them
00:22:11.120 and enqueue them into Solid Queue and then
00:22:12.880 delete them from Redis. That's what it
00:22:14.240 does. And it's like fairly complicated.
00:22:16.480 Uh some of these jobs were uh enqueued in
00:22:20.480 a version of Rails that was fairly old
00:22:22.640 and the arguments are like different
00:22:24.480 between different versions of Rails and
00:22:26.240 it was uh we had to support all those
00:22:28.400 and it was like kind of gnarly. Uh but
00:22:31.440 we did it. It happened.
00:22:34.400 Uh the other insight that we had is that
00:22:38.000 uh MySQL is just not the same as
00:22:39.840 Redis. So, uh, my thinking is that like
00:22:44.799 Redis, you can kind of just like throw
00:22:46.640 anything at it and it's like mostly
00:22:48.080 fine. It doesn't really seem to care.
00:22:50.159 Uh, but for a relational database,
00:22:51.840 that's probably not the best. And what
00:22:53.360 we noticed is that we had these kind of
00:22:55.360 like big peaks and valleys of like CPU
00:22:58.240 and database connections and load and
00:23:00.880 that's not ideal. And I kind of
00:23:02.880 wanted to like flatten this curve like
00:23:04.320 how can we make this like more round?
00:23:06.240 Uh, what's actually going on here under
00:23:07.840 the hood?
00:23:09.679 And we looked into it and most
00:23:13.120 of that was just you have something and
00:23:18.000 you enqueue thousands upon thousands of
00:23:20.960 jobs one after the next and this is
00:23:25.039 pretty expensive right uh and I wanted
00:23:27.919 to like yeah decrease those peaks. So a
00:23:31.200 really kind of like easy way to do this
00:23:32.960 is you can just do like a find_in_batches.
00:23:35.440 What you can do is in this
00:23:38.080 case you initialize 500 jobs into memory
00:23:42.080 and you pass this to Active Job's
00:23:44.799 perform_all_later. And so instead of 500 SQL
00:23:48.559 insert statements one after the next for
00:23:50.960 this batch and there's probably more
00:23:52.320 batches to come, right? Uh it's just one
00:23:55.039 SQL insert statement per batch which is
00:23:57.760 really nice. And the other thing you can
00:24:00.159 do uh if you want to get a bit fancy is
00:24:02.320 you can like set a like a random wait
00:24:05.039 time for each job. And this will spread
00:24:06.799 out when the jobs are actually run uh
00:24:10.400 from 0 to 30 minutes. And that'll kind
00:24:12.559 of like spread them out a bit more if
00:24:14.080 you want to do that.
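Roughly what that looks like in code; the model and job names here are made up, and the 0 to 30 minute jitter matches the range mentioned above.

```ruby
# One bulk enqueue per batch instead of thousands of back-to-back inserts;
# with Solid Queue, perform_all_later turns each batch into a single INSERT.
Notification.where(pending: true).find_in_batches(batch_size: 500) do |batch|
  jobs = batch.map { |notification| DeliverNotificationJob.new(notification) }
  ActiveJob.perform_all_later(jobs)
end

# And, to get a bit fancier, a random wait spreads the actual execution out
# over the next 30 minutes instead of hammering the workers all at once
# (enqueued one by one here, trading the single-INSERT benefit for smoothness):
Notification.where(pending: true).find_each do |notification|
  DeliverNotificationJob.set(wait: rand(0..30).minutes).perform_later(notification)
end
```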
00:24:18.159 Now, assuming you've done all this and
00:24:20.000 converted everything over, you're pretty
00:24:21.919 much ready to swap the queue adapter over to
00:24:24.480 Solid Queue. So, just don't forget uh
00:24:27.279 there's gems that might be using it.
00:24:29.760 There's mailers, there's Active Storage.
00:24:32.400 Uh don't forget about those. But at this
00:24:34.880 point, like once you deploy this and
00:24:37.200 everything's working, you're pretty much
00:24:38.640 ready to delete Resque from your
00:24:40.400 codebase.
00:24:43.600 All right, so that's what we did.
00:24:48.400 How did it go? So we had about two devs
00:24:52.240 working on this for a couple months uh
00:24:54.480 including myself and we migrated all the
00:24:56.960 jobs over in that time and most of the
00:24:58.960 work was really just setting up Solid Queue
00:25:00.480 and adjusting our infrastructure when we
00:25:02.559 got it wrong. Uh but once we were
00:25:05.600 confident we could kind of just migrate
00:25:07.440 the jobs in big batches over a couple
00:25:09.360 weeks. It was like pretty fast once we
00:25:10.960 kind of figured it all out. Uh we had
00:25:13.039 one dev from ops to help us set up the
00:25:15.120 infrastructure and fine-tune it as we
00:25:17.360 went. And we also had a DBA to make
00:25:19.200 sure that our DB was like running
00:25:21.760 properly and we had good metrics for it.
00:25:24.080 Everything was running smooth.
00:25:26.400 Uh so this is probably the best
00:25:28.000 screenshot I have of the difference
00:25:29.279 between running jobs in rescue versus
00:25:31.039 solid Q. One of our team leads uh sent
00:25:33.760 me this message with this graph.
00:25:37.679 Uh he thanked me for fixing the problem
00:25:39.520 and all I did was switch the job from
00:25:41.360 running from rescue to over to solid Q.
00:25:43.440 And what these are, these are just like
00:25:44.960 job failures for a high volume job that
00:25:47.360 we run every day. On the left is the job
00:25:49.679 running with Resque and on the right is
00:25:51.600 when we switched over to Solid Queue. All
00:25:53.600 those failures just stopped cold. We had
00:25:56.080 really achieved the reliability that we
00:25:57.840 were after
00:26:04.960 and all in all, really happy with it. Uh
00:26:07.679 we're processing about like three
00:26:09.039 million jobs a day. Uh performance of
00:26:11.760 Solid Queue is amazing compared to Resque.
00:26:14.559 In fact, it's too good in a lot of
00:26:17.760 instances. Uh for our within-24-hours
00:26:21.200 queue, the queue latency is like 30
00:26:24.000 seconds or something ridiculous. We need
00:26:26.159 to kind of like slow something down in
00:26:28.559 there like uh reduce the resources or
00:26:31.200 something and I'm not sure yet. Uh but
00:26:33.120 it's really easy to horizontally scale
00:26:35.520 more workers based on demand. That's
00:26:37.120 really nice. Uh the observability is
00:26:39.279 great. uh being able to write SQL
00:26:41.200 queries and just sort of see what's
00:26:42.720 going on. Awesome.
00:26:44.960 And I think renaming our queues was one of
00:26:47.200 the biggest benefits. Just gaining a lot
00:26:48.880 of understanding as to what the queues are
00:26:50.880 and what they mean. And again, like I
00:26:52.880 said, reliable. I really like the
00:26:55.279 reliability of it. It's been running
00:26:57.360 smooth and I'm really happy with it.
00:27:00.640 Thank you very much.