00:00:15.280
perfect uh everybody um I'm going to be talking about thanks
00:00:20.600
whoever said that thank you uh I'm gonna be talking about zero downtime deploys and how to make them easy uh so real
00:00:26.960
quick I'm Matt Duncan I work on the Rails team at Yammer uh I also have a huge confession to make
00:00:32.599
which is that this is not really a simple problem at all and there's really not a super easy way to do this uh but
00:00:39.280
it turns out if you say there is uh people will come to your talk um so really I'm just going to like depress
00:00:45.120
the hell out of everyone and uh show you the horrible workarounds and then
00:00:50.239
hopefully we can all kind of figure out a better solution so there is no silver bullet sorry um also this is like a
00:00:58.320
really broad topic so I'm going to kind of focus on anything from your framework
00:01:03.760
down uh so I'm going to ignore web servers I'm going to ignore your network stack anything beyond that uh I'll
00:01:09.840
assume that you know how to deploy those uh without downtime you may or may not
00:01:16.680
um also this isn't just a problem with Rails uh I'm going to be kind of talking specifically about Rails uh just because
00:01:22.119
it's probably most familiar to everyone um and in addition Active Record but
00:01:27.759
it's really a problem with uh pretty much any framework um so yeah uh three
00:01:34.920
topics uh in general which is how to do database migrations uh how to make
00:01:40.799
changes to your database while not taking your site down um how to make changes to your async workers um stuff
00:01:48.520
running in the background how to make changes to the way you queue things up um
00:01:53.640
and also how to handle external services so basically anything other than your database that is
00:02:00.520
uh used by your application so first let's start with the database uh it's kind of probably
00:02:06.600
the most familiar to everyone um and we're going to walk through a simple example um let's say we have this
00:02:14.160
awesome site it has tons of users and we've decided that we want to be able to make users administrators uh we want
00:02:20.560
some of our users to see different features than others um so that's easy right we'll just make a new migration
00:02:27.120
we'll add an admin column to the table uh it'll default to false obviously because
00:02:33.080
we don't want everyone to be an administrator and we will make it not null because being null doesn't really
00:02:39.280
make sense in this case so uh we'll push that code out we will run our
00:02:45.560
migration and we're going to notice a couple things immediately one is that that migration is taking a really long
00:02:52.159
time to run uh that's bad also uh if we have really good monitoring we're going
00:02:57.840
to notice something else which is that our site is actually down now um so
00:03:03.040
what's going on what's happening um well all of our web processes are
00:03:08.319
just hung right now uh they are stuck so let's hop over to the database uh this
00:03:13.760
is how you do it in Postgres there's similar ways to do it in basically any database you're using um but this
00:03:20.840
basically will check to see what has granted locks on tables uh granted
00:03:26.400
exclusive locks so any lock which will not allow reads or writes now obviously
00:03:32.959
uh our migration is actually still running because it's just taking forever and we're going to see that the
00:03:40.439
table change that we were making is the command that has the users table
00:03:46.080
locked so why did that happen well migrations are transactional
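The lock check described above can be sketched roughly as follows. This is a minimal illustration, not the query from the slide: it assumes Postgres 9.2 or newer (where `pg_stat_activity` exposes `pid` and `query`), and the constant name is made up for the example. It lists sessions holding a granted `AccessExclusiveLock`, the mode that blocks both reads and writes.

```ruby
# Sketch of a Postgres lock inspection query, kept as a Ruby string so it can
# be pasted into psql or handed to a database client. The constant name is
# hypothetical; pg_locks, pg_class and pg_stat_activity are Postgres system
# catalogs/views. Assumes Postgres 9.2+ column names (pid, query).
GRANTED_EXCLUSIVE_LOCKS_SQL = <<~SQL
  SELECT c.relname AS table_name,
         l.mode    AS lock_mode,
         l.pid     AS backend_pid,
         a.query   AS running_query
  FROM pg_locks l
  JOIN pg_class c         ON c.oid = l.relation
  JOIN pg_stat_activity a ON a.pid = l.pid
  WHERE l.granted
    AND l.mode = 'AccessExclusiveLock';
SQL

puts GRANTED_EXCLUSIVE_LOCKS_SQL
```

While the slow migration is running, a query along these lines would be expected to show the `ALTER TABLE` statement holding the lock on `users`.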
00:03:51.319
um in this case we're actually doing two things even though it looked like we were only doing one um we're doing one
00:03:57.319
thing which is really fast and really easy and we're doing one thing which is not as easy necessarily um all in that
00:04:04.439
one command so what the database actually needs to do is it needs to add the new column that's easy that's fast
00:04:12.040
uh if it's not you should find a new database probably um the other thing is it needs to go through every single row
00:04:18.639
in the database and update it uh because we said it can't be null here's the
00:04:23.919
default so it needs to go through and write that default value in every single row um so the larger your table the
00:04:30.440
longer it's going to take um now how do we how do we get around this
00:04:37.120
well you could probably just do it at off peak times right uh let's wait until we have less users on the site uh so
00:04:43.960
we'll do it like Friday night late um your traffic graph if you have a
00:04:49.120
reasonably popular site though probably looks a little bit like this uh that big peak in the center is probably weekday
00:04:55.320
traffic those dips are weekend traffic uh if you have a different type of site it may be reversed where you get more
00:05:01.360
traffic on the weekends in any case notice how that Valley doesn't quite touch zero there uh it gets lower for
00:05:08.560
sure but it never actually touches zero uh so anytime you do this you're going
00:05:13.720
to affect real users um so there's a trade-off involved here right like we
00:05:20.080
can do it while not as many people are using our site but
00:05:26.120
we're still going to affect them that may be okay because it was really easy to write that simple migration and run it
00:05:32.520
um the other thing we can do is just get a faster database right throw some money at the problem that's that's always a
00:05:38.880
simple solution uh that'll help right um that means we can actually do more of
00:05:44.160
those updates uh during that transaction before users start to actually notice eventually though you're going to
00:05:51.600
be working with tables that are hundreds of millions or billions of records and just throwing money at the problem is
00:05:59.000
not really feasible uh eventually you have to start throwing money at people who need to come in and
00:06:05.880
do weird things and it's yeah weird things happen um so let's walk through
00:06:11.360
how we could actually do this uh without throwing money at the problem we'll just throw a little bit of time at the problem
00:06:17.120
instead uh so in this case we'll do almost the exact same thing in our migration uh we're going to make one
00:06:23.720
simple change which is we're going to allow null values so this lets us bypass that entire hard part of the uh table
00:06:31.639
change and just do the fast part which is adding that new column so right now
00:06:37.880
every single record in that uh table is going to be null that's fine we're not using it yet uh the next thing we're
00:06:44.000
going to do is write a quick task that will do the hard part for us so the
00:06:49.319
reason we're doing this is so that we can uh basically use some very small locks while we're uh doing the updates
00:06:57.400
so in this case we're just going to update a single record uh which has a null value for the admin and just change it
00:07:04.319
to false um and this can run behind the scenes you can let it run for as long as it takes it doesn't matter it's not
00:07:10.680
going to cause any significant load on your database uh you're not going to even notice it's running probably uh
00:07:16.960
unless you have a really crappy database in which case throw a little money at the problem um so we'll push that code out
00:07:24.120
we'll run our migration uh this time we'll notice something awesome which is that it ran really really fast that's
00:07:30.960
good uh then we'll go ahead and kick off our task and run that behind the scenes
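A minimal sketch of that backfill task, using an in-memory array as a stand-in for the users table so it runs without a database. The method name, `batch_size` and `pause` parameters are assumptions for illustration; against a real database each pass would be its own short transaction.

```ruby
# In-memory stand-in for the users table: every row starts with admin = nil,
# which is the state right after the nullable column has been added.
users = (1..10).map { |id| { id: id, admin: nil } }

# Fill in the default a few rows at a time so each write only touches (and
# locks) a handful of rows. batch_size: 1 mirrors the one-record-at-a-time
# update from the talk; larger batches are an assumption.
def backfill_admin(rows, batch_size: 1, pause: 0)
  passes = 0
  loop do
    batch = rows.select { |row| row[:admin].nil? }.first(batch_size)
    break if batch.empty?

    batch.each { |row| row[:admin] = false }
    passes += 1
    sleep(pause) if pause > 0 # optional breathing room between passes
  end
  passes
end

passes = backfill_admin(users, batch_size: 3)
puts "backfilled in #{passes} passes"
```

Because the loop only ever looks for rows that are still null, it can be stopped and restarted at any point without redoing work.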
00:07:37.240
um you can start it up in a screen session on one of your servers and just let it run um you can also get more
00:07:44.159
creative with it uh for example at Yammer we actually have some tools which let us uh parallelize these types of
00:07:50.240
things so that we can run multiple at the same time really easily um and
00:07:55.319
really simply uh so once that's done we can actually go back and change our table back to uh having that non-null
00:08:02.360
constraint so this time all we need to do uh or all the database needs to do is just verify it needs to do a basically a
00:08:08.960
quick table scan to make sure that there aren't any null values uh if there are it'll update them to the default if
00:08:15.520
there aren't it's done basically um so it's super quick uh it's basically as fast as your database can do a
00:08:21.479
sequential scan so again we'll push that code up we'll run our migration and awesome site
00:08:29.720
still up also wow that was a lot of work right uh turns out this is actually just
00:08:35.719
the tip of the iceberg um but like I said uh this talk is all about trade-offs um in a lot of cases it's
00:08:43.680
going to be worth the effort to do all of that work because it will keep your site up in a lot of other cases it's
00:08:49.519
not going to be because as you can see it was a huge pain in the ass um so yeah
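Put together, the sequence walked through above could look something like the following Postgres statements, listed here as Ruby strings in the order they would run. This is a sketch under assumptions: the constant name and the batch size of 1000 are made up, and setting the default as a separate step right after adding the column is one common ordering rather than necessarily what the slides showed.

```ruby
# The phases as Postgres statements, in the order they would be run.
# Table and column names follow the talk's example; the constant name and
# the batch size of 1000 are arbitrary assumptions.
ZERO_DOWNTIME_ADD_COLUMN = [
  # Phase 1: add the column with no default and no constraint (fast).
  "ALTER TABLE users ADD COLUMN admin boolean",
  # New rows written from now on get false without touching existing rows.
  "ALTER TABLE users ALTER COLUMN admin SET DEFAULT false",
  # Phase 2: repeat until it updates zero rows; each run locks only a few rows.
  "UPDATE users SET admin = false WHERE id IN " \
    "(SELECT id FROM users WHERE admin IS NULL LIMIT 1000)",
  # Phase 3: the database scans the table once to verify there are no nulls.
  "ALTER TABLE users ALTER COLUMN admin SET NOT NULL"
].freeze

ZERO_DOWNTIME_ADD_COLUMN.each { |statement| puts "#{statement};" }
```

Each statement would go out as its own step, with the application deploys happening in between as described in the talk.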
00:08:55.839
uh migrations are also not the only place that can cause these issues uh this is actually apparently the iceberg
00:09:02.880
that is thought to have sunk the Titanic uh very innocent yet uh does big damage
00:09:10.920
um so yeah be careful um so so let's kind of walk through the rest of the
00:09:16.279
stuff that can happen inside the database um as we were just talking about long database locks are a big
00:09:23.240
problem um they can happen in these are the two biggest cases where they'll
00:09:28.360
happen uh when you are adding non-null constraints with default values to a table um also when you're adding indexes
00:09:34.920
uh indexes need to lock the table in most cases uh if you're using Postgres create the index concurrently
00:09:41.839
uh it'll run behind the scenes and then it'll just switch into being used if you're using MySQL just switch to
00:09:49.200
Postgres uh or there are tools that you can use but they're probably harder to use than
00:09:55.680
switching to Postgres so uh yeah anyway um the other case is out of sync
00:10:01.600
schema so this is actually an interesting one because uh what happens
00:10:08.680
is your application thinks that you have one schema and your database knows that
00:10:13.920
you have another schema um so how many of you have ever seen an error like this before anyone no one thank you a lot of
00:10:21.600
you actually awesome um yeah so so what happens here well we have removed a
00:10:27.399
column from our database but our application is still doing this it's still sending it along uh so why is that
00:10:34.120
happening well when Active Record loads a model uh it asks the database for your
00:10:39.200
schema right um that's why you don't have to specify the schema inside of every model that you write uh you know
00:10:46.880
don't repeat yourself right um the problem with that is uh if the schema
00:10:52.440
changes Active Record doesn't actually go through and update them it would have to poll or something and that's just
00:10:58.560
kind of painful um so yeah uh that's that's one
00:11:05.399
problem um here's kind of the most common cases uh renaming columns renaming tables uh try to avoid
00:11:13.399
renaming tables it's probably not worth the effort um and removing columns removing columns is uh obviously the
00:11:20.200
most common one um you have a column that you don't really need anymore you don't want to leave it around because
00:11:25.800
then it just stays forever basically um so let's walk through the process of getting rid of that column
00:11:33.000
without again taking the site down uh so three steps uh we're obviously going to
00:11:38.959
start off initially writing to that column uh so next step is to stop
00:11:44.680
writing to the column uh how do we do that well first thing we need to do is tell the database that it's cool if we
00:11:50.720
don't write to the column so tell the database that it can have null values in the
00:11:56.560
column uh the next thing we need to do is actually tell Active Record to ignore
00:12:02.200
that column uh so it's relatively simple um which is we just override or
00:12:08.920
define the columns method here and just ignore the column that we're getting rid of so when Active Record
00:12:16.399
loads the schema for that table uh it's going to just skip that
00:12:21.639
column and it's as if it never actually existed um and this is all before we run our migration so now we
00:12:29.639
want to remove the admin column we will remove the admin column um oh yeah and
00:12:35.440
then we also need to go back and clean up all of the stuff that we just added to our users model um so all
00:12:44.199
of all of this code here uh we can actually just get rid of because we don't need it
00:12:50.320
anymore uh oh yeah uh your and couch and Lotus Notes won't solve these
00:12:56.680
problems either um they will like let you think about data
00:13:02.079
and schemas in different ways um and that may potentially be useful actually um but they're definitely not going to
00:13:07.480
solve these problems uh I didn't have enough time to go into um the same
00:13:12.600
problems with those uh but hopefully uh hopefully the lessons
00:13:18.600
transfer all right so moving right along uh background workers uh stuff that runs
00:13:24.279
behind the scenes stuff that you dump into a queue and then have workers pick up and work um
00:13:30.920
these are actually fairly simple uh if you just keep one thing in mind queues are not going to be empty um whenever you
00:13:37.920
deploy code whenever you deploy your workers uh the queues are going to have stuff in them uh just make that
00:13:44.240
assumption there are cases where they probably won't uh and you could purge them if you really wanted to but that's
00:13:50.519
probably a bad idea so just make the assumption that they won't be empty and that there will be jobs in there from
00:13:55.720
the previous code so uh if you're changing the format of your messages uh
00:14:01.079
so you're adding a new parameter make sure you handle the case where that parameter doesn't exist um
00:14:07.759
and the case where the parameter does exist once all of those messages have flowed through and you know it's clear
00:14:14.199
uh you can then stop handling uh that previous case um again if you're getting
00:14:19.279
rid of uh workers so you have something that you don't really need anymore you probably just need to go purge the queue
00:14:25.040
or leave one of the workers around to kind of finish off running those processes
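As a rough illustration of handling both message formats during a deploy, here is a small sketch. The job, its `user_id` and `locale` fields, and the default value are all hypothetical; the point is only that a job enqueued by the previous code, without the new parameter, still runs.

```ruby
require "json"

# Hypothetical job: the old code enqueued {"user_id": 1}; the new code also
# sends "locale". Jobs enqueued before the deploy will not have the new key.
DEFAULT_LOCALE = "en"

def perform_welcome_email(raw_message)
  payload = JSON.parse(raw_message)
  user_id = payload.fetch("user_id")
  # Old-format messages lack "locale", so fall back instead of raising.
  locale = payload.fetch("locale", DEFAULT_LOCALE)
  "welcome email for user #{user_id} in #{locale}"
end

old_job = JSON.generate({ "user_id" => 1 })
new_job = JSON.generate({ "user_id" => 2, "locale" => "fr" })

puts perform_welcome_email(old_job)
puts perform_welcome_email(new_job)
```

Once the queue has drained of old-format messages, the fallback can be removed in a later deploy.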
00:14:31.000
all right so so those were the those were the easy cases uh let's move on to the more interesting one which is
00:14:37.320
external Services um and I'm going to talk about Services inside of your
00:14:43.480
company but first I'm going to get some
00:14:48.680
water um so I'm going to talk about Services inside your company uh primarily because it's easier to
00:14:56.040
reason about them um you control their entire life cycle so uh first thing version them um
00:15:05.120
doesn't really matter how you do it uh you can use URLs you can use headers you can use whatever weird format you want
00:15:12.360
uh just make sure they're versioned um so let's walk through like an ideal
00:15:18.279
world scenario and then let's shoot holes in it uh so the ideal world scenario is you have an application it's
00:15:25.600
using the first version of your API uh you deploy a new version and you start
00:15:30.839
writing to it you don't read from it yet but you start writing to it uh so the reason you would do that is so that you can do a lot
00:15:38.560
of things actually um you can start doing validations on the data uh to make sure that your new version is actually
00:15:45.440
doing what you expect it to be doing and that the writes match what the first version was doing uh you can also make
00:15:52.000
sure it can handle your production load um and you can do any backfilling of
00:15:57.399
data into that version if you need to uh let's say they may be potentially running on different uh data
00:16:04.160
sources so then eventually you can switch your uh write or your reads
00:16:09.480
excuse me um to your new version and you can continue writing to your first
00:16:15.000
version if you want to or not uh one of the nice things about continuing to write to that version is
00:16:21.360
that you could always fall back to it if something goes catastrophically wrong um so it's kind of a little safety net uh
00:16:28.040
but eventually obviously you'll move off of it so uh one thing that you should have
00:16:35.639
uh in mind uh one thing that is super useful is the ability to uh switch these
00:16:42.440
things around uh to switch them on and off uh the reason for that is when you're deploying um things are actually
00:16:48.240
going to look a little more like this uh you're going to have some servers that are uh have old code still running um
00:16:55.800
this is mid deploy you're going to have servers that still have old code running uh which are reading from your
00:17:01.160
first version and some servers which have the new code deployed which are reading from your second version um so
00:17:09.640
having a switch in place that lets you more atomically uh make that transition
00:17:15.439
is really important um oh yeah and don't forget to deploy your services in the exact same
00:17:21.520
way that we've been deploying everything else because these same issues apply um
00:17:27.600
I've I've been talking a lot about uh your kind of main core application uh and how it interacts with uh the
00:17:34.840
database and other services but obviously this like flows all the way down into your services and their data
00:17:40.520
sources and then their services and their data sources and on and on um so yeah
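The write-to-both, read-from-one pattern with a switch can be sketched as below. `ProfileClient` and its two hashes are hypothetical in-memory stand-ins for two versions of an internal service; in practice the switch would need to live somewhere all servers can see, which this sketch does not attempt.

```ruby
# Hypothetical in-memory stand-ins for two versions of an internal service.
class ProfileClient
  attr_accessor :read_from_v2

  def initialize
    @v1 = {}
    @v2 = {}
    @read_from_v2 = false # flipped at runtime, independent of code deploys
  end

  # Write to both versions so v2 sees production load and v1 stays a fallback.
  def write(key, value)
    @v1[key] = value
    @v2[key] = value
  end

  # Reads follow the switch, so flipping it back falls back to v1.
  def read(key)
    (read_from_v2 ? @v2 : @v1)[key]
  end

  # Validation step: do both versions hold the same data?
  def consistent?
    @v1 == @v2
  end
end

client = ProfileClient.new
client.write("matt", "admin")
client.read_from_v2 = true if client.consistent?
puts client.read("matt")
```

Keeping the read switch separate from the deploy means old and new servers can be moved over together, and moved back if the new version misbehaves.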
00:17:50.160
uh what what happens if you can't um well you know uh give yourself a way to
00:17:56.720
turn services off um the ability to just flip the switch on a service uh so let's
00:18:03.679
say for example your search um the ability to just remove that search bar um for you know 10 or 15 minutes while
00:18:11.400
you're deploying that new version you can actually just take that thing out of production uh do things nicely uh do
00:18:18.039
things quickly not have to do that whole migration dance that we just saw um and then just flip it back on again and
00:18:24.039
users may notice they may not um but your site isn't going to be down and your users aren't going to see errors they're just
00:18:29.640
not going to see the full features that they might have before so uh let's let's talk about uh
00:18:37.480
what we can learn from this um hopefully not everyone is super depressed yet um anything can go wrong uh there
00:18:46.520
are like so many ways that this can go wrong um I skipped a lot of them um so
00:18:52.520
for for example uh you know you roll out a new validation for users um
00:18:59.760
I have loaded the signup form entered some data you deploy the site and then I
00:19:04.880
hit submit oh all of the valid data that I just submitted is now invalid because
00:19:10.559
the logic on that is totally changed uh so I see errors and I get really annoyed
00:19:16.080
and I never sign up um not everything is worth the effort though uh cases like that are rarely
00:19:24.240
worth the effort in handling uh sometimes they are uh maybe sign up is a
00:19:29.520
case where it is worth the effort uh I would argue that it's probably not worth
00:19:34.840
the effort to handle um the forms case uh in most
00:19:45.640
cases um also make your deploy simple and fast uh if you don't have a push button deploy you should do that uh this
00:19:54.640
is kind of what it looks like to deploy stuff at Yammer uh you literally pick what you want to deploy where you want
00:19:59.960
to deploy it and just hit deploy um and to fit into the last talk we have
00:20:06.400
metrics uh so we can actually see how many times things get deployed um and the cool thing is like the easier
00:20:13.280
something is to deploy the more likely you are to deploy it right um the other thing you should do
00:20:20.720
is separate your migrations from your deploys uh you should think of them as totally separate things uh deploying
00:20:27.440
migrations are basically database deploys uh deploying
00:20:32.559
your application is an application deploy uh you probably won't be deploying those at the same time you
00:20:38.080
probably shouldn't be deploying those at the same time um so I'm G to get some more
00:20:45.360
water it's a very dry city
00:20:50.679
um so the way we used to deploy Yammer uh when I first started uh and you can
00:20:56.720
tell the Yammer employees in the room because they will start laughing as I tell the story uh was that we all like
00:21:02.240
crammed into this room and it was really hot and really sweaty and we would drink
00:21:07.600
because we were really afraid that we were going to take the site down uh and we would run the migrations and then we'd
00:21:13.240
play a lot of really loud music and it looked a little like this and it was
00:21:18.320
really terrible um and we would basically frantically run the migrations and then deploy the site as quickly as
00:21:24.440
we could because we knew the site was probably down uh because of all the things that we just talked about um and
00:21:31.240
we probably all lost a few years off of our life due to stress um yeah also roll
00:21:36.919
out your services gradually um one of the things we do whenever we uh roll out a new service at Yammer is that we roll
00:21:44.360
out Services really slowly um we'll put maybe 5% of our traffic onto them and
00:21:49.840
then kind of bump that up to 10% and then if that looks good maybe 20% maybe 50% um this gives us the advantage of
00:21:58.679
kind of forcing us to think of the graceful uh scenario when we need to
00:22:04.880
degrade um and it gives us the ability to turn things off if we need to um so
00:22:11.600
this this is an example that I found uh we we have our own internal tool but
00:22:16.720
this one looked pretty awesome uh it has exactly what you need uh in a tool like this uh which is that you can roll it
00:22:23.679
out to a certain percentage of users or uh of requests or whatever um you can
00:22:30.919
pick groups so for example roll out services to uh inside your site first um
00:22:38.039
or excuse me inside of your company uh so that you can dog food those Services before they hit production uh especially
00:22:45.400
make sure your CEO has access to these because he will be the most likely to complain about them if things go wrong
00:22:52.400
and things will get fixed much quicker uh if he is the one seeing them um but
00:22:58.520
you can also add in specific users uh if you want to give access to a few users uh say they're beta users or whatever
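A tool with those three controls (percentage, groups, specific users) can be sketched in a few lines. This is not Yammer's internal tool or the one on the slide; the `Rollout` class and its parameters are hypothetical. It hashes the feature name and user id into a stable bucket from 0 to 99, so a user who is in at 5% is still in at 50%.

```ruby
require "set"
require "zlib"

# Hypothetical feature gate: a percentage of users, whole groups, and explicit
# allow and deny lists. A percentage of 0 with empty lists turns the feature off.
class Rollout
  def initialize(feature, percentage: 0, groups: [], allow: [], deny: [])
    @feature = feature
    @percentage = percentage
    @groups = groups.to_set
    @allow = allow.to_set
    @deny = deny.to_set
  end

  def enabled?(user_id, group: nil)
    return false if @deny.include?(user_id)
    return true if @allow.include?(user_id) || @groups.include?(group)

    # Stable bucket in 0..99 so a user stays enabled as the percentage grows.
    Zlib.crc32("#{@feature}:#{user_id}") % 100 < @percentage
  end
end

search_v2 = Rollout.new("search_v2", percentage: 5, groups: [:staff], allow: [42], deny: [7])
puts search_v2.enabled?(42)               # explicitly allowed user
puts search_v2.enabled?(7, group: :staff) # deny list wins over group membership
```

The deny list is what makes it possible to force a user out of a rollout as well as into one.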
00:23:07.400
um yeah uh so I'm gonna I'm gonna go to the question slide slightly differently
00:23:14.159
than most people uh which is I'm gonna ask you guys some questions uh which is like how do we make this easier uh this is
00:23:21.360
really annoying to have to do all that kind of weird stuff um how how could we build like databases
00:23:28.520
that make this easier how can we build frameworks that make this easier uh yeah
00:23:33.919
um I can answer some questions maybe yes here a
00:23:42.600
mic have you looked at uh Chanko which is written by Cookpad to do rolling out
00:23:48.320
features to parts of users to uh no uh explain a little bit more about it um
00:23:54.880
Chanko allows you to roll out a feature to a certain set of users or a
00:24:00.400
certain percentage of users and it's all baked into Rails and provides a framework to do part of it interesting
00:24:06.679
yeah yeah yeah we have our own internal tool set uh but yeah that is like exactly the type of tool that uh
00:24:13.799
I would recommend using um which is the ability to roll out to both a certain percentage of users and also pick users
00:24:21.840
that can get into that rollout group um it's always nice to be able to force users into a group um both seeing
00:24:28.640
something and not seeing something uh yeah
00:24:43.240
absolutely um I recently heard of a technique I thought was rather interesting which was if you're going to
00:24:50.320
be modifying a table and upgrading your code to the
00:24:55.360
table um for almost I think zero down time you
00:25:00.760
just uh cause a copy of the table to go with the new schema element in
00:25:06.440
it and Meanwhile your old code is still working on the old table and then you
00:25:13.320
when that's done you then deploy code that talks to the new table
00:25:18.799
and then uh a final little task to bring up anything that's in the Delta between the
00:25:25.120
old and the new table and then you're you're kind of running and then when you're confident you can drop the old
00:25:30.679
table yeah just yeah that's that's an interesting strategy one of the I guess one of the downsides would be that you
00:25:36.159
have that like Delta of data between those um but if that's something that you can live with yeah that's absolutely
00:25:41.720
a great way to handle that um and it totally avoids the uh annoyingness of
00:25:47.279
like all of those steps um
00:25:53.840
yeah anyone else I saw
00:26:05.799
I'm interested how you manage the complexity of like sort of the the multi-step migration deploy that you
00:26:12.320
talked about at the beginning while you're striving for like simple push button deploys like that first part has
00:26:18.360
to be a manual process right yes it is a manual process uh poorly is the answer to that
00:26:24.320
um the so so when we when we look at uh the push button process that's all
00:26:29.399
application uh level push button process the database stuff is not quite as
00:26:35.000
awesome um we're looking to make it better uh
00:26:40.279
our our site stuff used to be not push button as well uh it's become push button um yeah uh
00:26:48.200
poorly uh yeah I I I I don't have a great solution for that uh the way the way that we normally do it now is we'll
00:26:54.559
kind of run one migration at a time uh manually usually there aren't so many that that becomes
00:27:00.399
unreasonable um and it also gives us a way to kind of vet things that are
00:27:08.279
shipping uh to answer the question of one way we can actually do it easier is
00:27:13.559
um don't remove your tests until after your code is out um make sure that all
00:27:19.399
of your old tests are passing yeah so that you don't run into that that weird migration window yeah yeah absolutely
00:27:31.039
hi I'm just curious how you enforce this migration policy among all the engineers
00:27:37.080
right see it would seem like a complicated thing because yeah if there are multiple
00:27:42.200
steps you know a new guy is coming in he doesn't remember that step everything uh
00:27:48.039
poorly uh no uh yeah I mean it it's it's
00:27:53.159
a it's kind of a human management problem right like it's a it's something
00:27:58.679
that uh all of the engineers basically just kind of need to be on the same page about
00:28:09.480
um uh I would say trust your engineers uh you should be able to trust
00:28:16.440
your engineers
00:28:24.080
uh uh yeah I I I mean I I would still separate that into the like make your
00:28:31.399
database deploys separate make your database schema changes separate from your application deploys um for example
00:28:38.760
like creating a table is perfectly safe like you could do that at any time um it's not going to affect your
00:28:44.600
application at all unless you're doing something really weird uh but yeah uh I
00:28:52.039
mean it's it's hard uh it's it's basically a human
00:28:58.240
problem uh a kind of like Collective understanding problem um just kind of getting everyone on the same page uh
00:29:05.240
having having people kind of like ultimately responsible for uh the deployment of those migrations uh does
00:29:10.919
help um because there are usually other things that we need to look at uh when we're for example adding or removing
00:29:17.799
stuff uh for example our analytics team may be using uh some of that data and they need to know to upgrade their
00:29:24.039
scripts uh to get rid of those columns as well
00:29:36.080
um Etsy's done some talks on using code to do defaults versus using the database
00:29:41.320
to do defaults like Auto incrementing and what have you guys experimented with that to find so you're not locking your
00:29:47.760
tables when you're doing null value uh we have not really um I I would be
00:29:52.799
interested to try it though uh the the database constraints are nice
00:29:58.880
um but yeah I mean we we arguably don't get a lot of benefit out of them uh just
00:30:04.679
because we have a single application writing and reading from the database uh if you have more than one application
00:30:11.840
it's kind of less useful to have those uh defaults in each application uh or I should say it's more useful to have them
00:30:17.519
in a centralized place um but yeah I it's it's definitely an interesting idea
00:30:24.000
that we haven't really explored very much um yeah uh let's do one
00:30:31.919
more basically what we're describing here are uh transitions between legal states of the production environment um
00:30:38.960
yeah and uh so are you familiar with anyone who is exploring um uh describing
00:30:45.000
those transitions at a higher level where we can say uh that I'm I'm starting at a state that has these known
00:30:52.240
constraints on it and I want to apply this set of forward transforms it kind of like we do with migration but at a
00:30:58.720
higher level yeah at a at a more like operational level you're saying um
00:31:04.760
vaguely yes uh I I don't know of any I'm not aware of any uh good way to do that
00:31:10.639
or kind of simple way to do that uh arguably this wasn't very simple either um but it is kind of more familiar to
00:31:18.440
people uh yeah I I would definitely be interested in learning about them though yeah uh all right cool thank you