00:00:00.000
All
00:00:16.960
right, let's get started. So, like
00:00:18.960
Rosa said, I'm a staff engineer at
00:00:20.560
Fullscript, which is a platform where we
00:00:23.600
help practitioners deliver holistic
00:00:25.359
medicine to their patients. Our
00:00:28.560
first commit was in 2011 and what we've
00:00:32.640
seen is that the business has grown a
00:00:34.079
lot since then. When I joined, which
00:00:35.760
was about seven years ago, we were a dozen
00:00:38.160
engineers; now engineering is more
00:00:40.719
than 150. And as we grow and scale, what
00:00:43.680
we're finding is the solutions that we
00:00:45.520
had in the past are no longer scaling
00:00:47.360
with the business. A good
00:00:49.440
example of this is our background
00:00:50.879
queuing system. So last year we came to
00:00:53.280
the conclusion that it was really time
00:00:55.600
to say goodbye to Resque. Resque had
00:00:58.960
served us very well for many years,
00:01:02.879
and when our app started, Resque was
00:01:04.720
kind of the clear choice. It was before
00:01:06.880
Sidekiq, which continues to be great
00:01:06.880
today. It was before Active Job. But in
00:01:11.280
the last couple of years, there's been
00:01:12.799
some really nice improvements in
00:01:14.159
rethinking what's possible in this
00:01:16.479
space. And I think it started with
00:01:18.640
GoodJob and some improvements to
00:01:20.640
databases themselves. Now using a
00:01:23.280
database is a viable alternative to
00:01:25.360
Redis. And after GoodJob, Solid Queue was
00:01:28.880
built as an alternative, and you can
00:01:31.360
use all the databases that Rails
00:01:33.680
supports: SQLite, MySQL, Postgres.
00:01:36.400
And in 2024, Solid Queue was incorporated
00:01:38.799
into Rails as the default queue adapter. So
00:01:41.600
around that time we were starting to
00:01:42.799
think really seriously about
00:01:44.960
replacing Resque with something else, and
00:01:47.520
it was because we had a couple of
00:01:49.360
problems.
00:01:51.439
The first was a dependency story. We
00:01:54.320
were maintaining a bunch of forks that
00:01:56.320
we never wrote and didn't understand.
00:02:00.000
We also started to have this issue where
00:02:03.200
our Resque workers would just
00:02:04.799
take a nap at various times.
00:02:08.000
They just stopped working, and
00:02:09.920
any jobs that were in the middle of
00:02:11.840
running would just get killed and
00:02:13.040
throw this error. Now, we
00:02:15.120
didn't lose any jobs because of this;
00:02:17.120
we could always manually retry them,
00:02:19.200
but it didn't give us a lot of
00:02:20.480
confidence that we had a reliable and
00:02:22.239
resilient system in place here, and it
00:02:24.640
was fairly mysterious as
00:02:26.239
to why exactly this was happening.
00:02:28.640
The error we were getting was pretty
00:02:31.040
low-level C code, and we had
00:02:33.760
a choice: do we invest more time and
00:02:36.560
energy into Resque? Fix the bug,
00:02:39.680
get it going again, maybe have to
00:02:41.280
fork Resque or whatever? Or do we invest
00:02:44.879
our time and energy into an alternative?
00:02:48.800
And so we started shopping around for
00:02:50.160
alternatives. And I think
00:02:52.720
Sidekiq is a great alternative; I think
00:02:54.959
we would have been happy had we chosen
00:02:56.720
Sidekiq. GoodJob is great, but it
00:02:59.519
supports only Postgres and we're on
00:03:01.599
MySQL, so that's not an option for us. And
00:03:04.080
Solid Queue at this time had been made the new
00:03:06.560
default in Rails, so we were very
00:03:08.080
curious about it. And there are some
00:03:10.959
nice things about Solid Queue. The main
00:03:12.640
one is that it's backed by a database that
00:03:15.120
supports MySQL, like I said. And while
00:03:17.280
it's possible to get metrics and
00:03:19.519
pull information out of Redis,
00:03:22.720
it's actually much easier and more
00:03:24.480
straightforward just to make a read
00:03:26.080
replica of your database; devs can
00:03:28.480
then go and write SQL queries and figure
00:03:30.159
out what's going on in there. That's
00:03:32.080
really nice; I like that a lot. The
00:03:35.200
other major incentive is that it's part
00:03:37.200
of Rails, which to me means it's going
00:03:39.440
to be a robust solution that's well
00:03:41.599
supported and maintained for the future.
00:03:43.599
And we're heavily invested in Rails and
00:03:45.200
the future of Rails, and it just made
00:03:47.440
sense to use the defaults wherever
00:03:49.920
possible.
00:03:52.480
The other thing is that a bunch of us
00:03:54.239
were lucky enough to go to Rails
00:03:55.920
World last year, and we watched Rosa's
00:03:58.239
talk about Solid Queue: how it was built,
00:04:00.319
how it works, and why Basecamp decided
00:04:02.239
to build it. I really recommend
00:04:04.400
watching one of Rosa's talks if you want
00:04:06.159
to deep-dive into how Solid Queue works
00:04:08.000
under the hood. And we came away from
00:04:10.159
that talk really wanting to give Solid
00:04:11.760
Queue a try and see if it would be a good
00:04:13.519
fit for us.
00:04:16.239
So early this year, we started to work
00:04:18.079
on it, and here's the plan. I'm
00:04:19.759
going to go through this throughout the
00:04:22.000
course of this presentation, so
00:04:24.960
let's get started. The first thing is
00:04:27.759
you need to make sure that all of
00:04:29.360
your jobs are using Active Job. That's a
00:04:32.320
requirement. Like I said, Resque was
00:04:33.840
written before Active Job ever existed,
00:04:36.560
so it has its own way of enqueuing jobs,
00:04:38.960
and you have to convert those
00:04:40.639
all over to Active Job. We had
00:04:42.800
luckily done this years ago, so
00:04:45.280
we were ready to go in this
00:04:46.639
scenario. The next thing you can do
00:04:48.880
is set up Mission Control Jobs.
00:04:50.639
And this is the dashboard app which
00:04:52.479
allows you to retry or discard failed
00:04:54.639
jobs. It supports both Resque and Solid
00:04:57.199
Queue.
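For reference, the setup for that dashboard is roughly this small (the mount path here is just an example):

```ruby
# Gemfile
gem "mission_control-jobs"

# config/routes.rb
Rails.application.routes.draw do
  # Dashboard for inspecting, retrying, and discarding jobs.
  mount MissionControl::Jobs::Engine, at: "/jobs"
end
```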
00:04:59.360
And setting up Solid Queue is fairly
00:05:01.600
simple; the README is really great.
00:05:03.680
But there are a couple of important
00:05:05.360
decisions that you need to make. And the
00:05:07.600
most important one is: do you use
00:05:10.800
your primary database or do you make a
00:05:13.440
separate one? And the answer is really
00:05:15.600
going to depend on the scale that you're
00:05:17.919
operating in. For us, our primary
00:05:20.720
database was not an option. Solid Queue is
00:05:23.759
going to add a ton of write-heavy
00:05:25.759
load to your DB, and this could be a
00:05:27.919
risk for us; we didn't want to do that.
00:05:29.120
So instead, we created a new MySQL queue
00:05:32.320
database for production. But in
00:05:34.880
development, review environments,
00:05:36.960
staging, etc., we use the same database
00:05:39.680
as our primary, just on a different
00:05:41.440
schema. And for those environments, it's
00:05:43.600
perfect: we save on some cost, and
00:05:46.720
we aren't worried about scale in those
00:05:48.240
environments.
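As a rough sketch of that split (the database names and migrations path are illustrative, not our exact config):

```yaml
# config/database.yml
production:
  primary:
    <<: *default
    database: app_production
  queue:
    <<: *default
    database: app_queue_production
    migrations_paths: db/queue_migrate
```

```ruby
# config/environments/production.rb
# Point Solid Queue at the dedicated queue database.
config.solid_queue.connects_to = { database: { writing: :queue } }
```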
00:05:50.880
All right, so pop quiz. What happens
00:05:54.080
if this transaction gets rolled back?
00:05:58.400
Well, the answer is: it depends. On
00:06:00.880
Resque, there's no transactional
00:06:03.039
integrity. The job is always going to
00:06:05.680
get fired, and this can be a big source
00:06:07.280
of bugs in your system. Say a
00:06:10.080
user never gets created, and then you try
00:06:11.600
to send them an email. Great. But
00:06:14.639
then this makes me ask a
00:06:16.160
question: if we have a
00:06:18.400
separate DB in production but the same
00:06:21.199
one in staging, would we have
00:06:22.960
transactional integrity in staging but
00:06:24.639
not in production? Like how's that going
00:06:25.919
to work?
00:06:27.600
And that would be kind of confusing. So
00:06:28.960
it's really nice that in Rails 8
00:06:31.280
we have this config option for Solid Queue,
00:06:33.039
and it allows us to toggle transactional
00:06:35.440
integrity on and off for each job if you
00:06:38.400
want. And this is false by default. So
00:06:41.600
in this example, when
00:06:44.319
enqueue_after_transaction_commit is true: transaction
00:06:47.120
completes, job fires; transaction rolls
00:06:49.840
back, job doesn't fire. So for us, while
00:06:53.039
we're migrating, we just leave it off
00:06:54.560
because we want to mimic how
00:06:56.479
everything behaves in Resque. We don't
00:06:58.080
want to change anything at this point,
00:06:59.919
but we can think about enabling
00:07:01.599
it in the future.
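Per job, that toggle is one line (SendWelcomeEmailJob is a made-up example):

```ruby
class SendWelcomeEmailJob < ApplicationJob
  # False by default in Rails 8: when true, enqueues issued inside a
  # transaction are deferred until commit, and dropped on rollback.
  self.enqueue_after_transaction_commit = true
end
```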
00:07:04.160
Okay, so to recap, we're using the same
00:07:06.560
type of database. For us, it's MySQL as
00:07:09.199
our primary, but we've configured Solid
00:07:12.080
Queue to be on a separate database in
00:07:14.319
production environments. We have no
00:07:16.639
transactional integrity. And for now, we
00:07:18.479
want to keep things the same as Resque.
00:07:21.120
And if you're curious about
00:07:25.039
figuring out transactional integrity, I
00:07:27.280
recommend looking at the Isolator gem.
00:07:29.039
It's going to help you find all the
00:07:30.400
places in your app where transactional
00:07:32.240
integrity with jobs might be a problem.
00:07:36.479
Now, the second most important thing to
00:07:38.080
think about is how you set up
00:07:39.919
your workers. And to talk about that, I
00:07:41.360
first need to talk about queues.
00:07:45.120
So, with Resque, we had some technical
00:07:47.680
debt around how we named our
00:07:50.080
queues. This is a list of all the queues
00:07:52.880
we had, and there are a lot of problems
00:07:55.919
with this setup. The big problem is
00:07:58.240
we had a mix of priority-based queues and
00:08:02.560
domain-specific queues. And domain
00:08:06.560
queues happen because
00:08:09.599
maybe you first put your job in the
00:08:12.960
critical queue because it's really
00:08:14.319
important, but then there's a bunch of
00:08:15.520
other stuff in the critical queue. So
00:08:17.680
either your job is slowing those jobs
00:08:19.840
down, or something's
00:08:22.240
slowing your job down. And so you're
00:08:23.759
like: I know what to do, I'm just going
00:08:25.199
to make my own queue. Right? But this
00:08:28.960
does not scale. Nothing else can go
00:08:31.599
into this queue. Ops has to set up new
00:08:34.479
infrastructure to support this new
00:08:36.240
queue. And what happens when there are some
00:08:38.320
notifications that need to go out
00:08:39.760
immediately, and some that can go
00:08:42.320
out in the next hour, and those are
00:08:43.839
mixed together? You still have the
00:08:45.760
potential problem where some less
00:08:47.279
important jobs are slowing down the
00:08:49.279
delivery of other jobs. And you also
00:08:51.680
have a language problem. So, what type
00:08:53.040
of notification can I put in the
00:08:54.880
notifications queue? I don't know.
00:08:59.680
And priority-based queues are no better.
00:09:01.839
The problem with this setup is no one
00:09:03.600
knows what any of these mean. As a
00:09:06.480
developer, you write a new job. Which
00:09:08.480
queue do you put it in?
00:09:11.040
I don't know. Maybe you know. For ops,
00:09:14.240
when do you need to autoscale the medium
00:09:16.240
queue? Do you need to, or is it okay if
00:09:19.040
this queue just fills up for most of the
00:09:20.800
day and then clears up by the next day?
00:09:23.040
I don't know. The problem is no one
00:09:25.519
knows. So, what do we actually want to
00:09:28.480
know? We want to know how long is my job
00:09:31.839
going to sit in the queue for.
00:09:34.959
There's a great talk from RubyConf
00:09:37.440
2022 called "What does high priority
00:09:40.000
mean? The secret to happy queues." It's
00:09:42.640
great. You should go watch it at some
00:09:44.080
point. I watched this a couple of years
00:09:46.399
ago, and it stuck in my
00:09:48.000
brain, and I thought: whenever we
00:09:49.600
touch our background queuing
00:09:50.880
system, I'm going to do that. And in
00:09:53.600
Daniel's talk, he suggests naming
00:09:56.000
your queues based on latency tolerances.
00:09:58.560
So, based on that idea,
00:10:01.360
our job queues went from this
00:10:04.240
to this.
00:10:11.839
This does a couple of things. It makes it
00:10:14.560
clear to developers and to ops what the
00:10:18.079
queues mean. It becomes a contract:
00:10:20.240
each queue has an implicit SLO. What
00:10:22.399
we're saying here is that jobs in the
00:10:24.880
within-one-minute queue are guaranteed
00:10:26.560
to run within 1 minute or less. That's
00:10:29.120
the longest tolerable latency that's
00:10:32.000
going to be acceptable for that queue.
00:10:33.519
You put a job in the within-one-minute
00:10:35.200
queue, it's going to get in and get out
00:10:36.880
in under a minute. That's what we're
00:10:38.160
promising. The beauty of this is we can
00:10:40.880
directly tie our alerting to this
00:10:43.040
expected SLO. If the queue latency is
00:10:46.160
taking longer than it should, we can
00:10:47.760
raise an alert. We can autoscale more
00:10:50.079
resources. And that's because we now
00:10:51.839
understand what each of these queues means.
00:10:54.880
And to get these names, we just made
00:10:57.440
a spreadsheet. We listed all the
00:10:59.440
jobs, we assigned the team that owned
00:11:02.079
each job, and we just asked: what's
00:11:04.560
the longest amount of time that this job
00:11:06.399
can sit in the queue before something
00:11:08.399
bad happens? Then we aggregated
00:11:11.440
that data and figured out these
00:11:13.519
new queue names. That's how we
00:11:15.600
got them.
00:11:17.839
So now we have our queue names, and what we
00:11:21.600
wanted was to have
00:11:22.800
dedicated workers for each queue,
00:11:25.200
meaning that they don't share resources.
00:11:27.519
Workers are basically just
00:11:29.839
Rails servers that we boot up. Each
00:11:32.079
one is in charge of running bin/jobs,
00:11:34.000
which runs a Solid Queue supervisor,
00:11:36.000
which forks a separate process for
00:11:37.680
each Solid Queue worker, and the workers
00:11:40.079
just churn through jobs. The idea
00:11:42.560
behind this is that we can autoscale up
00:11:44.320
and down depending on how busy we are.
00:11:46.560
If the within-one-hour queue is very
00:11:48.880
busy and the queue latency is getting a bit
00:11:51.120
undesirable, we'll just autoscale up
00:11:53.200
more workers to address that load.
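A sketch of what the worker side can look like in Solid Queue's config (queue names follow our latency scheme; the thread counts and intervals are illustrative):

```yaml
# config/queue.yml
production:
  dispatchers:
    - polling_interval: 1
      batch_size: 500
  workers:
    # One worker definition per queue, so queues don't share resources.
    - queues: within_1_minute
      threads: 5
      polling_interval: 0.1
    - queues: within_1_hour
      threads: 5
      polling_interval: 1
```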
00:11:55.600
But there is a bit of a caveat here.
00:11:59.200
It takes about a couple of minutes
00:12:00.720
for us to boot up a new Rails server, and
00:12:02.959
if we notice that the queue latency is
00:12:04.959
creeping up in the within-one-minute queue
00:12:07.839
and it's getting really bad and we
00:12:09.200
want to autoscale, by the time the
00:12:11.360
new servers come online we've already
00:12:13.120
blown through our SLO.
00:12:15.600
Now, to fix that problem, we just
00:12:19.120
overprovision the really fast queues,
00:12:21.760
and that's been working for us but it
00:12:23.440
does mean that we have some servers just
00:12:24.959
sitting idle a bunch of the time. Now,
00:12:27.200
this costs us some money,
00:12:28.959
but our jobs get delivered even when
00:12:32.959
there's an unexpected spike.
00:12:37.200
The other thing that we wanted to do
00:12:38.399
early on is just have some metrics. So
00:12:40.720
we needed to know the queue latency. And
00:12:42.639
what this means, in other words, is:
00:12:44.480
how long has the oldest job in the
00:12:46.560
queue been waiting?
00:12:48.720
And because Solid Queue is just a database,
00:12:50.480
you can just query the database and get
00:12:51.839
an answer, right? So we use the Yabeda
00:12:54.959
gems, which basically run
00:12:56.639
this query at some interval. They
00:12:58.880
collect those metrics and push
00:13:00.320
them up to Prometheus, and from
00:13:02.399
there we can create some dashboards.
00:13:04.880
These are an example of some
00:13:06.720
dashboards that we built. The top three rows
00:13:09.120
are queue latency: red is bad, green is
00:13:12.240
good. The bottom is the number of jobs in
00:13:15.040
the queue at that time. And with these,
00:13:17.120
we're able to monitor the health of our
00:13:18.560
queues and make sure whether we're
00:13:19.839
meeting our SLOs or not.
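The query itself is simple, and hooking it up is only a handful of lines. A minimal sketch, assuming the yabeda and yabeda-prometheus gems (the metric and group names are just ours to choose):

```ruby
# config/initializers/yabeda.rb
Yabeda.configure do
  group :solid_queue do
    gauge :queue_latency_seconds,
          tags: [:queue_name],
          comment: "Age of the oldest ready job in each queue"
  end

  # Runs on Yabeda's collection interval, e.g. on each Prometheus scrape.
  collect do
    oldest = SolidQueue::ReadyExecution.group(:queue_name).minimum(:created_at)
    oldest.each do |queue_name, created_at|
      Yabeda.solid_queue.queue_latency_seconds.set(
        { queue_name: queue_name },
        (Time.current - created_at).to_f
      )
    end
  end
end
```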
00:13:23.440
Okay, so that's the setup. That's all
00:13:25.839
the configuration and philosophy behind
00:13:28.240
what we're doing. And then how do we
00:13:30.560
actually migrate from Resque to
00:13:32.720
Solid Queue? And remember, ideally we
00:13:35.440
want no downtime and no lost jobs,
00:13:38.240
right? And the answer is pretty
00:13:40.560
simple. First of all, we just made
00:13:42.959
this constant in our ApplicationJob.
00:13:45.760
It just lists all the queues that we're
00:13:48.000
going to support going forward.
00:13:50.560
Then we override the queue_as method. Now,
00:13:53.839
this comes from Active Job. And what
00:13:55.279
we're doing is just looking at
00:13:57.360
the queue name. If the queue name is
00:14:01.279
within_ something, we just make sure
00:14:03.519
it's valid. And then if it matches,
00:14:05.680
we set the queue adapter to Solid Queue. If
00:14:08.639
it doesn't match, we just call super and
00:14:11.760
it does what it normally did, and
00:14:13.839
we enqueue the job in Resque. And that's
00:14:15.600
it.
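In outline, the override looks something like this (the queue list and validation are illustrative sketches of what was on the slide):

```ruby
class ApplicationJob < ActiveJob::Base
  # The latency-based queues we support going forward.
  QUEUES = %w[within_1_minute within_5_minutes within_1_hour within_24_hours].freeze

  # Jobs that declare a latency-based queue get routed to Solid Queue;
  # everything else falls through to the default (Resque) via super.
  def self.queue_as(queue = nil, &block)
    if queue.to_s.start_with?("within_")
      raise ArgumentError, "unknown queue #{queue}" unless QUEUES.include?(queue.to_s)

      self.queue_adapter = :solid_queue
    end
    super
  end
end
```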
00:14:17.440
Then your changes are really
00:14:18.959
simple. All you need to do is change
00:14:20.079
the queue_as value in your job, and then
00:14:22.399
when this is deployed, what happens is
00:14:24.880
any job that was previously enqueued in
00:14:26.800
Resque will stay in Resque, but new jobs
00:14:29.199
get enqueued with Solid Queue.
00:14:32.639
Pretty simple, and it mostly is
00:14:35.839
this simple. For 95% of our jobs,
00:14:38.000
this was the only change that we needed.
00:14:40.240
And what we did is we started by moving
00:14:42.320
a couple of jobs where it's kind of
00:14:44.160
okay if something bad were to happen. If
00:14:46.639
the worst-case scenario happens and
00:14:48.399
we lose a bunch of jobs, it's
00:14:50.160
okay for those ones. We started
00:14:52.800
with those, and then we started to
00:14:55.440
take on ones that were higher risk and
00:14:58.399
more complicated as we went, as we
00:15:00.320
ironed out the kinks in the system. And
00:15:02.480
I'll go through a few problems that we
00:15:04.000
encountered along the way and how we
00:15:06.160
solved them.
00:15:08.000
But here's a tip.
00:15:10.399
Pretty much all the problems we
00:15:12.240
encountered were because our
00:15:14.320
infrastructure was underprovisioned for
00:15:16.639
the needs of our system. And you
00:15:19.199
probably have a good guess of
00:15:21.279
what the needs of your system are. Well,
00:15:23.040
we did some napkin math to figure
00:15:24.959
out how many jobs we were running
00:15:26.480
and what size of database we were going
00:15:28.800
to need, and we got it completely wrong.
00:15:34.000
So if you have something like this
00:15:35.519
though, if you get it wrong,
00:15:37.040
everything's okay. What this does is:
00:15:39.040
it's just an around_enqueue block in your
00:15:41.040
ApplicationJob. If Solid Queue can't
00:15:43.440
enqueue your job, we just
00:15:46.800
rescue that error and re-enqueue the job
00:15:50.480
back into Resque. This way, you don't
00:15:53.199
lose anything. Nothing bad happens if
00:15:54.800
you make a mistake. You can roll
00:15:56.399
back that change. You have some time to
00:15:58.079
figure out what went wrong and you can
00:15:59.920
address it.
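A minimal sketch of that safety net (the rescued error class and the fallback call are assumptions; rescue whatever your setup actually raises):

```ruby
class ApplicationJob < ActiveJob::Base
  around_enqueue do |job, block|
    block.call
  rescue ActiveRecord::ActiveRecordError => error
    # Solid Queue couldn't take the job; log it and push the same job
    # through the Resque adapter instead, so nothing is lost.
    Rails.logger.warn("Solid Queue enqueue failed: #{error.class}")
    ActiveJob::QueueAdapters::ResqueAdapter.new.enqueue(job)
  end
end
```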
00:16:02.880
So, a couple of problems we ran into.
00:16:05.440
One was that we had too many arguments.
00:16:08.720
Solid Queue needs to store everything about
00:16:10.560
your job in a table, including the job's
00:16:12.480
arguments. And for whatever reason, we
00:16:16.480
had a couple of jobs with an argument
00:16:20.399
list that was insanely long. By
00:16:23.440
default, MySQL stores up to 65,535
00:16:25.839
bytes in a TEXT column, and
00:16:29.839
for one of our jobs it just wasn't
00:16:31.759
enough.
00:16:33.759
So the solution was just to make that
00:16:36.320
column larger. We bumped it up to
00:16:38.160
MEDIUMTEXT, which gives us 16 million
00:16:40.240
bytes in MySQL, and that just solved
00:16:42.880
the problem for us.
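The fix is a one-line migration against the queue database (the class name and Rails version tag are illustrative):

```ruby
class WidenSolidQueueJobArguments < ActiveRecord::Migration[8.0]
  def up
    # size: :medium maps to MEDIUMTEXT on MySQL (~16 MB instead of ~64 KB).
    change_column :solid_queue_jobs, :arguments, :text, size: :medium
  end

  def down
    change_column :solid_queue_jobs, :arguments, :text
  end
end
```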
00:16:45.839
But then we started to migrate some more jobs
00:16:47.759
in batches, and we saw in Mission
00:16:50.240
Control that some workers were just
00:16:52.079
starting to die. They would just
00:16:54.320
quit unexpectedly. What was going on
00:16:56.079
here? This was surprising to us because
00:16:58.800
we had configured our workers to have a
00:17:00.720
graceful shutdown, which is what you should do.
00:17:02.639
So instead of sending a SIGKILL,
00:17:04.640
which is that signal 9 in the error
00:17:06.720
here, you want to send a
00:17:09.360
SIGTERM or a SIGINT. That tells the
00:17:11.600
worker: hey, we're going to shut
00:17:12.799
you down. Finish what you're doing.
00:17:15.039
Don't take on any more work. Then
00:17:17.120
you give it a bit of time, and only
00:17:19.280
then, if the worker hasn't exited, do
00:17:21.120
you send a SIGKILL. And we were doing
00:17:22.880
that. So this error was surprising, and
00:17:25.120
it reminded us of the
00:17:26.959
problem we were having with Resque. But
00:17:29.440
then we dug a bit deeper, and
00:17:31.919
it turned out that we were just
00:17:33.120
running completely out of memory. We
00:17:35.200
added a few new jobs and those jobs were
00:17:37.600
just more memory intensive than the
00:17:39.360
previous ones and required more
00:17:40.799
resources to do what they were doing.
00:17:43.200
The solution for us: we just bumped up
00:17:45.440
the memory. We doubled it from two gigs
00:17:48.000
to four gigs. And this is kind of a
00:17:49.280
brute-force solution, right? If you're
00:17:51.520
really concerned with cost, you
00:17:53.760
could separate out the jobs that require
00:17:56.640
heavy use of memory into their own queues.
00:17:58.400
You could separate out jobs that have
00:18:00.320
high CPU into their own queues. For
00:18:02.160
example, you could do a within-one-
00:18:04.000
minute-high-memory or within-one-minute-
00:18:06.240
high-CPU queue if you really wanted to, but
00:18:08.480
for us it's not a big concern. We just
00:18:10.080
prefer to keep it simple. So bumping up
00:18:11.760
the memory made the most sense for us.
00:18:15.120
The next problem we encountered was that
00:18:17.440
we ran out of database connections. Why
00:18:20.480
was that happening? As it turns
00:18:22.799
out, we did this to ourselves. What
00:18:25.440
happened was we migrated a new job over
00:18:27.679
to Solid Queue, and it was set to run on a
00:18:29.840
cron. So at 1 p.m. Eastern, it
00:18:34.400
enqueued tens of thousands of new jobs at
00:18:37.280
the same time. And our autoscaler did
00:18:40.320
exactly what we told it to do. It
00:18:42.240
increased workers to meet demand. It
00:18:45.679
made so many workers that we just ran
00:18:48.240
out of database connections completely.
00:18:52.320
Now, the solution here was just that we
00:18:54.640
increased the size of our database.
00:18:56.080
Again, we had underprovisioned and
00:18:57.919
underestimated what we actually needed.
00:19:00.320
We use AWS for this; we
00:19:04.160
just did a blue/green deployment of
00:19:05.840
the database. To do this, AWS
00:19:08.400
basically creates
00:19:10.000
a replica of your instance,
00:19:12.240
and then when they're in sync, you just
00:19:14.080
swap the load balancer and then you're
00:19:15.600
on the new instance, essentially. And
00:19:19.760
now we have more than double the DB
00:19:21.600
connections that we had, and that
00:19:23.120
leaves us with a lot of room for any
00:19:24.720
spikes going forward. We also,
00:19:28.080
though, put a limit on the number of
00:19:30.160
workers that we can autoscale up to so
00:19:32.400
that we don't do this to ourselves
00:19:34.160
again.
00:19:37.360
Okay, so you might see a problem with
00:19:40.320
this setup, though. What happens,
00:19:42.559
for example, if someone
00:19:45.039
has a job that takes a really
00:19:47.280
long time to run? Imagine we have a job
00:19:49.120
that takes five minutes to run from
00:19:50.720
start to finish, and someone puts
00:19:52.640
that in the within-one-minute queue.
00:19:54.559
That's not going to work, right?
00:19:58.000
So the problem with this is if you have
00:20:00.240
these slow jobs, they start consuming
00:20:02.080
all the threads on your workers. And
00:20:03.760
what that means is the worker can't do
00:20:06.160
anything until those jobs finish
00:20:07.760
running. You end up with a backlog of
00:20:09.760
jobs just waiting in the queue to be
00:20:11.840
executed; it's essentially blocked.
00:20:16.160
So to ensure this doesn't happen, we
00:20:17.840
just want a mechanism in place
00:20:19.919
to enforce the idea that fast jobs need
00:20:22.480
to run on the fast queues; slow jobs can
00:20:24.960
run on the slower queues. We set an
00:20:26.720
objective of one tenth, so 10% of the
00:20:29.919
allowable time. For within 1 minute,
00:20:32.400
that's 6 seconds; within 5 minutes, 30
00:20:35.039
seconds; within 10 minutes, 1 minute. You get
00:20:37.440
the idea.
00:20:40.320
And to do this, we just set up a rule in
00:20:42.640
an around_perform action in our
00:20:44.880
ApplicationJob. All this
00:20:48.720
does is measure when the job starts and
00:20:50.559
when it finishes. If that takes longer than
00:20:52.880
10% of the queue's tolerance, we just
00:20:56.159
send an alert up to Sentry. This way, we
00:20:59.200
know if a job is too slow for a
00:21:01.039
given queue, and we'll either just move
00:21:02.799
the job to a slower queue, or we'll work
00:21:06.320
with that team to figure out
00:21:08.000
how we can make the job more performant
00:21:09.840
so it can stay in this queue.
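A sketch of that rule (the tolerance table and message are illustrative; we report rather than fail the job):

```ruby
class ApplicationJob < ActiveJob::Base
  # Longest tolerable queue latency, keyed by queue name.
  TOLERANCES = {
    "within_1_minute"  => 1.minute,
    "within_5_minutes" => 5.minutes,
    "within_1_hour"    => 1.hour
  }.freeze

  around_perform do |job, block|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    block.call
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

    tolerance = TOLERANCES[job.queue_name]
    if tolerance && elapsed > tolerance.to_f / 10
      # Over 10% of the queue's budget: flag it, don't raise.
      Sentry.capture_message(
        "#{job.class} ran #{elapsed.round(1)}s, over 10% of #{job.queue_name}"
      )
    end
  end
end
```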
00:21:14.320
The other challenge we came across is
00:21:16.559
delayed jobs. These are
00:21:19.760
jobs that you scheduled to run at
00:21:22.480
some point in the future. Now, I've got to
00:21:24.960
be clear: I don't like this pattern.
00:21:27.039
But we actually had a lot of
00:21:30.320
jobs scheduled so far into the
00:21:33.280
future that, I'm talking, I will be
00:21:35.679
dead before they ever run.
00:21:38.720
And this is not a great pattern;
00:21:41.039
a lot can change from now until
00:21:42.960
then. A lot of the work was just
00:21:44.880
refactoring that code so that doesn't
00:21:46.799
happen; maybe we set up a cron job every
00:21:48.960
day that checks what jobs we need
00:21:50.400
to run today, and we do that. But some
00:21:53.679
jobs we did need to migrate over.
00:21:56.640
And to do that, we wrote this gnarly
00:21:58.400
script that doesn't fit on a slide.
00:22:01.039
I'll include a QR code at the end
00:22:02.880
of this talk so you can grab this
00:22:05.280
code if you need it. But it would
00:22:07.600
basically go into Redis, grab the
00:22:09.440
scheduled jobs, and then convert them
00:22:11.120
and enqueue them into Solid Queue, and then
00:22:12.880
delete them from Redis. That's what it
00:22:14.240
does. And it's fairly complicated:
00:22:16.480
some of these jobs were enqueued in
00:22:20.480
a version of Rails that was fairly old,
00:22:22.640
and the arguments are different
00:22:24.480
between different versions of Rails, and
00:22:26.240
we had to support all of those.
00:22:28.400
It was kind of gnarly, but
00:22:31.440
we did it.
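The full script is the one behind the QR code; the core idea, heavily simplified, is something like this (key names follow resque-scheduler's conventions, and this sketch skips all the legacy-argument handling):

```ruby
# One-off sketch: move resque-scheduler's delayed jobs into Solid Queue.
while (timestamp = Resque.redis.zrange("delayed_queue_schedule", 0, 0).first)
  while (payload = Resque.redis.lpop("delayed:#{timestamp}"))
    entry = JSON.parse(payload)
    # For Active Job jobs, args[0] is the serialized job payload.
    job = ActiveJob::Base.deserialize(entry["args"].first)
    job.enqueue(wait_until: Time.zone.at(timestamp.to_i))
  end
  Resque.redis.zrem("delayed_queue_schedule", timestamp)
end
```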
00:22:34.400
The other insight that we had is that
00:22:38.000
MySQL is just not the same as
00:22:39.840
Redis. My thinking is that with
00:22:44.799
Redis, you can just throw
00:22:46.640
anything at it and it's mostly
00:22:48.080
fine. It doesn't really seem to care.
00:22:50.159
But for a relational database,
00:22:51.840
that's probably not the best. And what
00:22:53.360
we noticed is that we had these
00:22:55.360
big peaks and valleys of CPU
00:22:58.240
and database connections and load, and
00:23:00.880
that's not ideal. And I
00:23:02.880
wanted to flatten this curve.
00:23:04.320
How can we make this more even?
00:23:06.240
What's actually going on here under
00:23:07.840
the hood?
00:23:09.679
We looked into it, and most
00:23:13.120
of that was just: you have some process, and
00:23:18.000
you enqueue thousands upon thousands of
00:23:20.960
jobs one after the next, and this is
00:23:25.039
pretty expensive, right? And I wanted
00:23:27.919
to decrease those peaks. A
00:23:31.200
really easy way to do this
00:23:32.960
is you can just do a find_in_batches.
00:23:35.440
What you can do is, in this
00:23:38.080
case, you initialize 500 jobs into memory
00:23:42.080
and you pass those to ActiveJob's
00:23:44.799
perform_all_later. So instead of 500 SQL
00:23:48.559
INSERT statements, one after the next, for
00:23:50.960
this batch (and there are probably more
00:23:52.320
batches to come, right?), it's just one
00:23:55.039
SQL INSERT statement per batch, which is
00:23:57.760
really nice. And the other thing you can
00:24:00.159
do, if you want to get a bit fancy, is
00:24:02.320
set a random wait
00:24:05.039
time for each job. This will spread
00:24:06.799
out when the jobs actually run,
00:24:10.400
say from 0 to 30 minutes. That'll
00:24:12.559
spread them out a bit more if
00:24:14.080
you want to do that.
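Putting those two together looks roughly like this (the model, job, and scope are made up for the example):

```ruby
# Bulk-enqueue one batch at a time: one INSERT per 500 jobs instead of 500.
User.where(needs_sync: true).find_in_batches(batch_size: 500) do |users|
  jobs = users.map do |user|
    job = SyncUserJob.new(user.id)
    # Optional jitter: spread the actual run times over the next 30 minutes.
    job.scheduled_at = rand(0..30).minutes.from_now
    job
  end
  ActiveJob.perform_all_later(jobs)
end
```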
00:24:18.159
Now, assuming you've done all this and
00:24:20.000
converted everything over, you're pretty
00:24:21.919
much ready to swap the queue adapter over to
00:24:24.480
Solid Queue. Just don't forget
00:24:27.279
there are gems that might be using it:
00:24:29.760
there are mailers, there's Active Storage.
00:24:32.400
Don't forget about those. But at this
00:24:34.880
point, once you deploy this and
00:24:37.200
everything's working, you're pretty much
00:24:38.640
ready to delete Resque from your
00:24:40.400
codebase.
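The final cutover itself is the one-line default (plus removing the queue_as override and the Resque gems):

```ruby
# config/environments/production.rb
# Everything (mailers, Active Storage jobs, etc.) now defaults to Solid Queue.
config.active_job.queue_adapter = :solid_queue
```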
00:24:43.600
All right, so that's what we did.
00:24:48.400
How did it go? So we had about two devs
00:24:52.240
working on this for a couple of months,
00:24:54.480
including myself, and we migrated all the
00:24:56.960
jobs over in that time. Most of the
00:24:58.960
work was really just setting up Solid Queue
00:25:00.480
and adjusting our infrastructure when we
00:25:02.559
got it wrong. But once we were
00:25:05.600
confident, we could just migrate
00:25:07.440
the jobs in big batches over a couple of
00:25:09.360
weeks. It was pretty fast once we
00:25:10.960
figured it all out. We had
00:25:13.039
one dev from ops to help us set up the
00:25:15.120
infrastructure and fine-tune it as we
00:25:17.360
went. And we also had a DBA making
00:25:19.200
sure that our DB was running
00:25:21.760
properly and we had good metrics for it.
00:25:24.080
Everything was running smoothly.
00:25:26.400
So this is probably the best
00:25:28.000
screenshot I have of the difference
00:25:29.279
between running jobs in Resque versus
00:25:31.039
Solid Queue. One of our team leads sent
00:25:33.760
me this message with this graph.
00:25:37.679
He thanked me for fixing the problem,
00:25:39.520
and all I did was switch the job from
00:25:41.360
running on Resque over to Solid Queue.
00:25:43.440
What these are is just
00:25:44.960
job failures for a high-volume job that
00:25:47.360
we run every day. On the left is the job
00:25:49.679
running with Resque, and on the right is
00:25:51.600
when we switched over to Solid Queue. All
00:25:53.600
those failures just stopped cold. We had
00:25:56.080
really achieved the reliability that we
00:25:57.840
were after.
00:26:04.960
And all in all, we're really happy with it.
00:26:07.679
We're processing about three
00:26:09.039
million jobs a day. The performance of
00:26:11.760
Solid Queue is amazing compared to Resque.
00:26:14.559
In fact, it's too good in a lot of
00:26:17.760
instances. For our within-24-hours
00:26:21.200
queue, the queue latency is like 30
00:26:24.000
seconds or something ridiculous. We need
00:26:26.159
to slow something down in
00:26:28.559
there, like reduce the resources or
00:26:31.200
something; I'm not sure yet. But
00:26:33.120
it's really easy to horizontally scale
00:26:35.520
more workers based on demand. That's
00:26:37.120
really nice. The observability is
00:26:39.279
great: being able to write SQL
00:26:41.200
queries and just see what's
00:26:42.720
going on. Awesome.
00:26:44.960
And I think renaming our queues was one of
00:26:47.200
the biggest benefits: just gaining a lot
00:26:48.880
of understanding as to what the queues are
00:26:50.880
and what they mean. And again, like I
00:26:52.880
said, it's reliable. I really like the
00:26:55.279
reliability of it. It's been running
00:26:57.360
smoothly, and I'm really happy with it.
00:27:00.640
Thank you very much.