
MongoDB to MySQL: The How and the Why

Sarah Mei • September 29, 2011 • New Orleans, Louisiana • Talk

In her talk "MongoDB to MySQL: The How and the Why" at RubyConf 2011, Sarah Mei recounts how Diaspora, a decentralized social network built on Rails, moved from MongoDB to MySQL as its primary data store. She walks through the development choices made, the problems that followed, and the eventual decision to migrate, challenging the common perception that social data is ill-suited to relational databases.

Key Points Discussed:
- Initial Technology Choice: The Diaspora team originally chose MongoDB because of the widespread perception that social data isn't suited to relational databases, and because it was an exciting new technology the team wanted to try. They believed MongoDB's document-oriented storage could better accommodate complex social interactions.
- Problems Encountered: Mei details significant issues that arose from using MongoDB, particularly with relational aspects of social data. Notably, the social network's data structure involved many interlinked objects, which MongoDB doesn't handle well. This led to increased development friction and challenges in querying data efficiently.
- Performance and Development Friction: As development progressed, the team's velocity slowed because of friction surrounding MongoDB, such as weaker gem ecosystem support and more difficult testing. That friction proved detrimental to a startup that needed to iterate rapidly.
- Migration Strategy: Mei outlines the process of migrating to MySQL, emphasizing thorough testing and the importance of getting the migration right the first time, given the risk of data loss. The team test-drove the conversion in small steps, starting with single sub-documents before tackling entire tables.
- Lessons Learned: The transition highlighted the importance of aligning technology with the nature of the data. Mei concludes that choosing non-standard technology can reduce a startup's ability to pivot and innovate quickly, which is critical for survival in competitive markets.

Overall, the talk illustrates the complexities of data management in large-scale applications, advocating for consideration of relational databases in scenarios involving interconnected data.

MongoDB to MySQL: The How and the Why
Sarah Mei • New Orleans, Louisiana • Talk

Date: September 29, 2011
Published: December 12, 2011

Diaspora is the crowd-funded open-source decentralized social network built on Rails. Full buzzword compliance: on by default. We have many thousands of active users and they generate a lot of social data. But after nine months of full-time development with MongoDB as our primary storage engine, a few months ago we converted it all to MySQL. Wait...what? Most people are going the other way, dropping Mongo into a project in place of MySQL or PostgreSQL. Plus, conventional wisdom says that social data is ill-suited to a traditional data store. Come hear a story about a large-scale Rails project that tried it both ways. You'll see crisis and redemption, facts and figures, nerds, kittens, ponycorns, and, of course, the secret sauce. Hecklers will be piped to /dev/null.

RubyConf 2011

00:00:19.199 Morning! You guys are awfully awake for 9:30 in the morning. Lots of caffeine? Yeah? Well, you've got one up on me, then; they didn't have the coffee out when I went out this morning. So, are we ready to talk about some Ruby this morning? Yeah? Yay.

00:00:37.239 All right. Well, it's too bad there's not actually a lot of Ruby in this talk. This is a talk about taking a Rails application that started out on MongoDB and moving it over to MySQL, and I think we all know there's not a whole lot of Ruby in Rails these days. So there's a little bit of Ruby, but there's also some other stuff. There's a lot of data; I'll try not to let it take over the slides too much. There are a lot of different ways of storing it, ways of thinking about it, ways of querying it. There's a little bit of youthful indiscretion. And there's some more data.

00:01:12.240 But before we get into that, let me talk a little bit about something else. This is me, in 96-point font, I believe. I'm Sarah Mei on GitHub and Twitter. And because this is a larger RubyConf than we've had before, and there are a lot of people here I don't know: this is Sarah Allen. We have the same first name. We work together on a lot of projects. We're both on the board of RailsBridge, which is a group that's working to bring more underrepresented folks into Ruby, and which also promotes Ruby as a teaching and education tool. We also work on Pi, which is a DSL for simple games. But despite all of that, we are not the same person. So now you know.
00:01:57.920 I am a software consultant at Pivotal Labs in San Francisco. How many people have heard of Pivotal Labs? Oh, yay. How many people use Pivotal Tracker? Not bad. Many people (probably not the people in this room, but many people) are surprised to find out that Pivotal is not, in fact, a product company. We're mostly a consulting company, and Pivotal Tracker is an internal tracking tool we developed and then released for other people to use a few years ago. So 95% of Pivotal employees don't work on Tracker, and I'm one of the ones who doesn't. Instead, I spend all of my time doing Rails projects. Pivotal does a lot of mobile these days, but I pretty much spend all my time doing Rails, mostly because I just like it and I don't want to learn how to manage my own memory again.

00:02:44.519 So of course, since I spend my days doing Rails, I spend my free time volunteering on another Rails project, because who needs that much variety anyway? That project is called Diaspora. That's a picture of a stack of stickers I brought with me, so if you want one, let me know; I've got them up here. Diaspora is distributed social networking software written in Rails, and if the name sounds familiar, you may have heard of it because about a year and a half ago the four people who started it asked for $10,000 on Kickstarter, basically ramen money, to work on it over the summer. They got their $10,000, and they got about $210,000 beyond that, so they started working on it full time, and they've been working on it full time for about a year and a half now.

00:03:35.560 How many people have heard of Diaspora before now? Wow. How many people thought it was dead at this point? You can raise your hand; you can be honest, that's all right. Okay, I think that's pretty reasonable. The guys haven't been very good about talking about what they do, but they've been making a lot of interesting progress.
00:04:01.319 It was started, as I mentioned, by four students from NYU, these guys, and they were inspired to do it because they went to a talk called "Freedom in the Cloud," given by this guy. Anyone heard of this guy before? Okay, yes, this is a free software crowd; that doesn't surprise me. This guy is a lawyer, and as you can see from the picture, he's also a bit of a neckbeard. He actually wrote most of the GPL version 3 license, so that's sort of where he's coming from. He gave this talk called "Freedom in the Cloud" in which he talked about the fact that when we use these free services that are available to us on the web, they're not actually free: we're paying for them with the data we give to these companies. And it's not just our explicit data, like our social connections and our status messages and all that stuff; it's the little bits of data dandruff that we scatter all over the web that these companies pick up and aggregate. In his talk he described this idea of: what are the software tools we need, so that there are alternatives to this type of service, so that people can continue to connect with people online and do stuff online, but not have to participate, if they don't want to, in these data aggregation schemes?

00:05:12.400 One of the things he talked about was that we need a distributed version of social networking software, like a distributed version of Facebook. So these guys were like, okay, sure, doesn't sound too hard. They were looking for a project to work on, and they went back and hacked on it for a few weekends, and then they did the Kickstarter thing, and suddenly everything got serious.

00:05:42.400 The basic idea of a distributed social networking ecosystem is that you have different servers, which we call pods. These circles are each different servers. You can make an account on any of the servers, and then you can make friends on any of the other servers, and when you make a status update, it's copied to your friends on the other servers, and they see it as if you were both participating in the same centralized social network. The advantage of this is that you can choose which pod you trust your data with. There may be pods whose terms of service do allow them to aggregate and sell their data, and there will be other pods whose terms of service do not allow that, and you have the choice of moving your account between pods to find one whose terms of service you agree with. And if you're really paranoid, you can run your own.
00:06:31.479 So when they started working on this, they made a couple of technology choices early on. The first one was Rails, which is kind of a no-brainer, I guess. A couple of them had had an internship where they had done some Rails, and it seemed like a better choice than the Java they had done in their classes. Then they made another technology choice that was a little more interesting: they decided to use MongoDB as their backing data store, rather than something like MySQL or Postgres. Talking to the guys after the fact, there were basically two reasons they did this. The first one is this, and I put this in quotes: a lot of people say this, especially on Twitter, but you can find blogs and everything. There are a lot of people who talk about the fact that social data really isn't suited to a relational database, because it's in some way not relational. We'll come back to that. And the other reason they chose it? Basically, that.

00:07:34.039 I make fun of them a little bit, but I think both of these things are sometimes legitimate reasons to make a technology choice. The first one: if you're looking at a project and you don't have a lot of experience, you go and do some research, you ask some people you know, and this is what they say. There's not a whole lot you can do with that; I think they made the best decision they could at the time. And the second one: this started out as a hack night project, basically, and that's the place where you want to use all these toys. That's the place to take all of those things where you're like, okay, I haven't done this in a real project, but I want to play with it. I don't think any of them were ever expecting it to become as serious as it did. They did this Kickstarter thing, and then all of a sudden they had 500,000 people on a mailing list waiting for invitations to a service they hadn't built yet. That's a lot of expectation.
00:08:27.840 So I want to talk about a couple of different things we found in the process of building this out on MongoDB. There were two major sorts of mismatches that we found. The first was at the data layer: the data that Diaspora generates, social data, is not in fact terribly well suited to a document store. The second was at the mapping layer: we were using MongoMapper, and I'll talk a little bit about our experiences with that. And the third thing I want to talk about is the mechanics of how you actually do a migration between MongoDB and a SQL store. So we'll start with that.

00:09:10.640 Let's talk about what MongoDB is a little bit. How many people here have done a project with Mongo at some point? All right, now keep your hands up: how many people are still doing a project with Mongo? One goes up that wasn't up before. Yeah, okay. Thanks, Steve. Well, then we should go out and have a drink later, I think.
00:09:42.640 So, MongoDB is a document-oriented database, and what that means, essentially, is that rather than storing your data in tables, it stores it in what basically looks like a big JSON blob. I'll show you what that means. Let's say you have a set of relationships that looks kind of like this. This is similar to a project that came through Pivotal that used Mongo, and it was actually a really good fit for Mongo. We had a set of TV shows; TV shows have many seasons; seasons have many episodes; and each episode has many reviews and many cast members. The way this site worked is that someone would come to the page for a TV show explicitly to drill down into the seasons and episodes for that show, so it was important to be able to pull out all of the information for a TV show at one time. There are a number of ways you could think about representing that. In a typical relational store, each of these boxes would be a table, so you're looking at something like a five-table join to get all the information you want, which is not a big deal, really.

00:10:46.000 But another way to think about this data is as a set of nested hashes, which is how you think about it in Mongo. This is an example document for a TV show. And Mongo thinks in these terms: anytime you hear "row" in SQL land, in Mongo land you think "document." So this is a document that represents a single TV show. It's got some metadata up at the top, and then it's got an array of seasons (that's the square brackets), and each season is itself an object: a hash with metadata like the season number, and then a set of episodes; within each episode, metadata and then a couple more sets of things. So it's sets of sets of sets of sets, and it's all in one document, and you can pull back the entire document with one query.
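[Illustration: a minimal sketch of such a document as a Ruby hash. The field names are assumptions, not taken from the talk's slides. Everything about the show lives in the one record, so a single query returns it all.]

```ruby
# A hypothetical TV show document: seasons, episodes, reviews, and cast
# members all nested inside one self-contained record.
show = {
  title:   "Babylon 5",
  seasons: [
    {
      number:   1,
      episodes: [
        {
          number:       1,
          title:        "Midnight on the Firing Line",
          reviews:      [{ stars: 5, text: "A strong start." }],
          cast_members: [{ actor: "Michael O'Hare", character: "Jeffrey Sinclair" }]
        }
      ]
    }
  ]
}
```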
00:11:42.519 So you might think... actually, let me go back to that for a moment. Social data looks a lot like this on the surface. When you come to a social networking site (here's a screenshot of Diaspora), there's a bunch of stuff going on, and there are a bunch of queries we run to get all the stuff on this page. But there's really only one important query: the one that brings back your stream, the one that gets the posts. Posts that have photos, that have likes, that have comments, and so on. So there is a nested structure of data here that looks kind of similar to what we were looking at with the TV shows. It's slightly more complex: users have friends, friends have posts, posts have comments and likes. And actually, that's a typo: each comment has one commenter, and each like has one liker. So relationship-wise, it's not a whole lot more complicated than the TV show example.
00:12:41.720 There's one really important difference here, though. Here's what you might think a document for a user's stream would look like: the name of the user; the user has a set of friends; each friend has a set of posts; each post has some metadata and then a bunch of sets of other stuff. Similar to what we were looking at before. But there's one key way in which the TV show example is quite a bit simpler, and that is that none of those objects reference any of the other objects, except in one direction; those are strictly one-way relationships. Whereas if you look at this, you realize that all of these things are the same object: a user has friends, but a friend may themselves be a user, or they may not, because it's a distributed system (that's a whole layer of complexity I'm skimming over), and in the same way, commenters and likers may also be users.
00:13:49.199 So when you're looking at something like this in a document store, you've got two ways you can approach it. You can either denormalize all of the user data you need into every single document, and then keep it updated every single time it changes, or you can store references to the other objects. In the TV show case, there's really not a whole lot that has that type of complexity: there's no need to store references to other objects, and everything can just be denormalized. Your entire query can be denormalized into one document. But here, there are a lot of links between objects. When you come onto a social networking site, anytime you see something that looks like a name or a picture, you expect to be able to click on it and go see that user: go see their profile, go see their posts. In a TV show application, that doesn't really happen that way. You're on season one, episode one of Babylon 5; you don't expect to be able to jump to season one, episode one of, say, General Hospital. There are no links between the documents, necessarily. But here, there's a lot of linking between the documents, and it's inherent in what social data is, because it's all about who's doing it and what your relationship is to them.
00:15:11.079 So our original document for the stream query looked something like this, as I said, with a name and posts and all that stuff. We basically replaced the information about friends with references to the users representing those people: a reference, rather than denormalizing the data. Mongo uses BSON IDs, which are sort of like GUIDs, rather than auto-increment primary keys (we'll come back to that in a little bit), so you store a user ID rather than storing the user's information. And that's not a big deal; we're used to storing references to things. The problem is that Mongo doesn't really provide a way to do joins, so you end up with code in your application that does them for you. You pull out the stream query, and then you have to do another query to pull out all of the information for the user IDs, for the references that you have. There's no way to do that in one query; you end up doing multiple queries and doing the joins more or less manually in your code to get the information you need.
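[Illustration: a rough sketch of that manual join using the current mongo Ruby driver. The collection and field names are hypothetical, not Diaspora's actual code.]

```ruby
require 'mongo'

client = Mongo::Client.new(['127.0.0.1:27017'], database: 'diaspora_example')
current_user_id = BSON::ObjectId.new  # stand-in for the logged-in user's ID

# First query: the stream document, whose posts reference their
# authors by ID instead of embedding the user data.
stream = client[:streams].find(user_id: current_user_id).first

# Gather every referenced author ID out of the posts...
author_ids = stream['posts'].map { |post| post['author_id'] }.uniq

# ...then run a second query and stitch the results in by hand.
# This is the join a relational database would have done for us.
users_by_id = client[:users].find(_id: { '$in' => author_ids })
                            .each_with_object({}) { |user, acc| acc[user['_id']] = user }

stream['posts'].each { |post| post['author'] = users_by_id[post['author_id']] }
```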
00:16:22.199 It was at this point that we felt like code that was supposed to be under the hood, hidden, was out in the open in our application code, and it was at this point that we realized that that was a sign. The moment you start storing references to other objects in Mongo, that's a sign that your data actually is relational, that there's value to that structure, and that you may be going, at least at first, against what MongoDB has in mind.
00:16:55.120 There are two things I learned from this. The first is that when people say social data isn't relational, that's not actually what they mean. This is what they mean, and they're right: you look at that structure, and you're joining a lot of tables to get all that information. But that is, in many ways, a separate problem. The solution to that problem is denormalization, also known as caching, and you can cache into something like Mongo. People say, "Oh, I'll just use Mongo, and my database will be my cache, and then I can scale." The problem with that is that databases and caches are very different things. They're good at different things, and you really, really don't want to conflate them.

00:17:44.120 The other thing I learned is that when MongoDB tells you it's about documents, you should believe them. When they say "document," they mean, in many ways, something you could print out on a piece of paper and hold: a self-contained piece of information. It may have a lot of internal structure, headings and subheadings and paragraphs and footers, but it doesn't have links to other things; it doesn't have references to other things. It's a self-contained piece of structured data. If your data looks like that, then it's a good use case for Mongo; and if it doesn't, it's probably not.
00:18:28.919 So, I want to talk a little bit about our experiences using MongoMapper, which is actually quite a nicely written piece of software, despite all the problems I'm about to tell you about. I call it sort of faking it: MongoMapper gives you a DataMapper-like interface to Mongo, which means it gives you a relational-esque interface to a document store. And there are a lot of problems with that, because those are conceptually very different ways of storing data. What that means is that you get methods that you think you know how they work, because you know how they work in ActiveRecord, but they work just slightly differently in MongoMapper world. And the more experienced you are with DataMapper and/or ActiveRecord, the bigger the mismatch, because the more you think you know about how stuff should work, the more surprised you'll be when it doesn't work the way you expect.

00:19:26.320 So: fake relational interface, bad. Mongo does have a very nice JavaScript-based map-reduce, which is a very nice way of querying, although you should never run it on your master database, because it'll spike the hell out of your CPU. But there are a couple of problems with using it. One is that it's very low-level, so you end up writing a lot of JavaScript to do things, like counting your objects, that you would think would be pretty straightforward. And it is straightforward in map-reduce land too, but it just feels like a piece of JavaScript that I should not need to write and test. The other thing, which was more of a deal breaker for us, is that it's really much more difficult to test. You can test your JavaScript map-reduce code using Jasmine or some other JavaScript unit testing framework, but since those don't run with the rest of your RSpec tests, it's a little bit of extra overhead.
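[Illustration: a hedged sketch of "a lot of JavaScript to count your objects": counting posts per author with Mongo's map-reduce, via today's mongo gem (the 2011-era driver's method names differed). Collection and field names are assumptions.]

```ruby
# In SQL this is one line:
#   SELECT author_id, COUNT(*) FROM posts GROUP BY author_id;
# With map-reduce, you ship two JavaScript functions to the server instead.
map_fn = <<~JS
  function() { emit(this.author_id, 1); }
JS

reduce_fn = <<~JS
  function(key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) { total += values[i]; }
    return total;
  }
JS

counts = client[:posts].find.map_reduce(map_fn, reduce_fn, out: { inline: 1 })
counts.each { |doc| puts "#{doc['_id']}: #{doc['value']}" }
```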
00:20:29.720 In general, some of the problems we ran into: there's a lot less documentation available for Mongo. We would run into problems like calling create on a MongoMapper class: it would create the object and return true, but sometimes it wouldn't create the ID for you. That would never happen in ActiveRecord land; if something failed to create, and failed to create an ID, it would not return true. This wasn't actually a bug in the software; it's just a mismatch with the way Mongo thinks about IDs, as sort of vaguely optional. But it's hard to find documentation about this kind of stuff. If I were not being charitable, I might suggest that 10gen's business model is to sell consulting services related to MongoDB, so it may or may not be in their best interest to provide good documentation. But actually, I think it's just that there are a lot fewer people using it, so there's a lot less official documentation, and also a lot less unofficial documentation. The way I find out anything in Rails, really, is by looking at some blog post, but there are a lot fewer people using Mongo and MongoMapper, and it's much harder to find information about the things you run into.
00:21:47.159 Both of those are a manifestation, I guess, of having a smaller community: there's often no one in IRC available to answer questions, there are fewer forums, there's less of all kinds of stuff. It also meant that, in more than one case, we were the first people to run into a particular bug, either in MongoMapper or in the driver between Ruby and Mongo. That has never once happened to me working on Rails. I have never once been the first person to run into a bug in ActiveRecord, just because there are so many more people using it: by the time I get into work at 9:00 and upgrade the patch level, someone in some other time zone has already reported it, and I can Google it and find it. But Diaspora was a fairly resource-constrained startup, and they don't really want to be on the bleeding edge of ORM technology.
00:22:47.000 That's not what we want to spend our time doing. Less gem support was a big, big problem for us. We needed to start doing background jobs, and I was like, okay, we'll just use Resque. Oh wait: Resque doesn't support MongoMapper. Or it was Delayed Job we wanted to use at first, and Delayed Job didn't support MongoMapper. We wanted to start processing images; we use Paperclip (I've used it on eight of the last nine Rails projects I've worked on), but it doesn't support MongoMapper either. And there was always a fork, or someone's branch somewhere, that supported it, so it was never a problem of not being able to use things. It was a problem of finding the right fork, and then having your Gemfile get littered with references to particular SHAs on GitHub that we needed to use rather than the standard releases. Bugs didn't get fixed on the branch; sometimes they got fixed on the mainline; and so on and so forth. So the difficulty we had with the ORM bled out into other parts of the Rails application.
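[Illustration: a hypothetical Gemfile fragment showing the kind of SHA-pinning she describes. The repositories and refs are made up.]

```ruby
# Gemfile: pinning community forks to specific commits for
# MongoMapper support, instead of using standard gem releases.
gem 'mongo_mapper'
gem 'paperclip',   git: 'https://github.com/someone/paperclip.git',
                   ref: 'abc1234'  # fork with MongoMapper support
gem 'delayed_job', git: 'https://github.com/someone-else/delayed_job.git',
                   ref: 'def5678'  # bug fixed on mainline, not on this branch
```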
00:23:53.279 And the big problem, which compounds all the problems above it, is that the developers really didn't have a whole lot of experience with Mongo. None of these things is a deal breaker in and of itself, but they were all compounded by that issue.
00:24:09.799 So I want to talk a little bit about why we switched, because the data mismatch is not really a good enough reason on its own. What it comes down to is development friction. Development friction is something we think a lot about at Pivotal: these are the little things you don't really notice, but they slow down your overall velocity, and if you fix them, you can sometimes get big wins. On this project, Mongo was just a constant source of friction. It made everything harder and slowed our velocity down no matter what we were trying to do, because everything in Rails, when it comes down to it, is about data in some way. So everything was a little bit different, and everything was slightly non-standard.

00:24:59.279 As a startup, we can't afford to make choices that slow down our ability to iterate quickly. Our lifeblood as a startup is our ability to move fast. And if you look at all these things, none of them is a huge problem; these are not problems we can't overcome. But over time they significantly slow down our ability to actually make progress. In this picture, you can think of the cruise ship as Facebook or Google or whoever you like; our single competitive advantage is our ability to move quickly, get out of the way, and take advantage of things the big guys are too slow to get hold of. That's the only way to compete against companies that can throw an entire building full of engineers at any problem they want. So anything we do that reduces our ability to do that reduces the likelihood that our company will still be around in a few months.
00:26:01.480 So I put this in big letters, because this was a fairly important learning for us: every non-standard technology choice we make reduces our ability to iterate, and that reduces our ability to compete. There are organizations where that's not the case, where it doesn't matter that much whether you're able to iterate quickly. There are projects I've been on at Pivotal, for example, where we're not there to help them iterate quickly; we're there to help them refactor and reorganize their code and leave it in a better place than we found it. So there are different sets of constraints on different projects, but for a small startup, this is the one that's important. Sometimes non-standard technology choices that reduce that ability to iterate are worth it, and sometimes they're not; it depends on the thing. For instance, a friend of mine, when he saw these slides, said, "Well, Rails is a non-standard technology choice." And I said, okay, yeah, I guess you could say that, depending on where you work. But Rails has significant productivity gains that offset its costs. His example was that it's harder to deploy than, say, a WAR file for a Java application, which in the age of Heroku is arguable; you may or may not have a point there. I think the other productivity gains of Rails offset that. It's just a calculation you need to make, and at Diaspora, I think we should have made that calculation explicitly a lot sooner than we did.

00:27:42.240 So at this point I'd like to take a deep breath. In... and out. This is the view from my in-laws' dock in Marin County, California, and when I have to think about all this stuff, sometimes I like to come out here and sit with the ducks out on the water.
00:28:03.399 Okay. So those are the problems we ran into trying to use Mongo and MongoMapper, and now I want to talk a little bit about the mechanics of how we actually did the migration. There are two things you should know about a large-scale data migration, having done a bunch of these in the last couple of years. The first is that it always takes a really, really long time, so you should start as soon as you feel like it's something you're going to need. The other is that it's really important to get it right the first time, because once you go into production and actually do the migration, it's often very hard to roll back without data loss. So it's really important to get right, and these sorts of things make me really nervous. The way I react to this sort of nervousness is that I write a lot of tests, basically.

00:28:55.279 So I want to talk about how you "test drive" a migration. I'm going to put air quotes around "test drive," because it's not really test-driving in the traditional sense; it's more like hurting, maybe. Here's the process we went through. This is the easy way, which you should use if you can possibly manage it, and it should be fine for 95% of your tables. The first thing we did was test drive the conversion of a single sub-document from its JSON representation to CSV. We used a command-line tool, mongoexport I believe, to dump the data to JSON, and then ran a conversion script to turn the JSON into CSV. Then we test drove the import of that CSV into ActiveRecord. It's much easier to do it in two steps that way than to drive it from the perspective of "first it's in Mongo, and then I want it to be over here in ActiveRecord," because there are going to be different challenges on each side.
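[Illustration: a minimal sketch of that first step, JSON to CSV, assuming a line-delimited mongoexport dump of a posts collection. The column names are illustrative.]

```ruby
require 'json'
require 'csv'

# mongoexport emits one JSON document per line; ObjectIds arrive
# wrapped in extended JSON, e.g. {"$oid": "4e7a..."}, so unwrap them.
COLUMNS = %w[_id author_id text created_at].freeze

CSV.open('posts.csv', 'w') do |csv|
  csv << COLUMNS
  File.foreach('posts.json') do |line|
    doc = JSON.parse(line)
    csv << COLUMNS.map do |col|
      value = doc[col]
      value.is_a?(Hash) && value.key?('$oid') ? value['$oid'] : value
    end
  end
end
```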
00:29:55.919 So then you run that on a copy of the production database; there's an asterisk there that I'll get to in a moment. And then, inevitably, what happens is that you add more test cases, because once you run it on real data, you discover a lot of problems you didn't realize you had. Especially because you're moving from a data store that does not care, or does not care as much, about referential integrity to one that really, really, really cares about it, you will find cases where IDs don't exist, or IDs don't match up, and other exciting things like that. Then you repeat steps three and four until you can do the conversion of that particular table in an acceptable amount of time.
00:30:39.360 There's actually a step zero here that I didn't mention, which is that you need to decide how much downtime is acceptable for your site. When we did this, we were alpha (alpha, I guess you could say), so a day, or even a weekend, of downtime was acceptable. In other situations that won't be the case, and a lot of how you do this is driven by how much downtime you think you can stand. And it actually turned out that we had about two hours of downtime in the end, which was a lot less than we thought.

00:31:14.240 So, that asterisk on step three: sometimes when you do this, you're going to discover, whether because the data is very complicated or because there's a lot of it, that this process, the easy way, is not possible.
00:31:30.519 In that case, there are a couple of ways you can deal with it; we had one or two tables, I believe, where we ran into this problem. The first thing you can do is batch your inserts in ActiveRecord. The straightforward way of doing it is to take the CSV and make an ActiveRecord object out of each row. Originally, we put the transaction around the entire loop; the problem with that, of course, is that everything gets put into memory before it gets transactionally committed at the end, and if you have 600,000 rows, that may be an issue. So you batch it. We found that the ideal batch size depends on the memory footprint of the machine you're on; for us it was somewhere between 10,000 and 20,000. There's a gem called activerecord-import that makes a lot of this easier, which I highly recommend. And we ended up using some LOAD DATA INFILE, which just takes a CSV and puts it directly into the database, bypassing ActiveRecord entirely. The main problem with that is that it's a bit difficult to test, because LOAD DATA INFILE operates outside of ActiveRecord's transactional fixtures, which means you're responsible for cleaning out any data it puts in during your tests, which can be a little bit of a "wait, I need to do what now?"
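[Illustration: a hedged sketch of the batched-insert approach with the activerecord-import gem. The Post model and column names are assumptions.]

```ruby
require 'csv'
require 'activerecord-import'

BATCH_SIZE = 10_000  # the talk suggests 10k-20k, depending on the machine

columns = %i[legacy_id author_id text created_at]

# Import in slices so the whole CSV never sits in memory inside
# one giant transaction; each slice becomes a bulk INSERT.
CSV.foreach('posts.csv', headers: true).each_slice(BATCH_SIZE) do |rows|
  values = rows.map { |row| columns.map { |col| row[col.to_s] } }
  Post.import(columns, values, validate: false)
end
```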
00:32:44.840 So there are some gotchas you're going to run into. The first one is character encoding. Mongo doesn't particularly care; MySQL really, really cares, and so does Postgres. A lot of this was just figuring out the right collation settings and the right character-encoding settings for MySQL, and making sure all the magic comments were at the top of every file. We supported both Ruby 1.8 and 1.9, so we also had some interesting things that happened on one and not the other, and vice versa. It was great.
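[Illustration: roughly, the moving parts she's describing are an encoding magic comment plus explicit MySQL encoding and collation in the connection settings. The exact values here are assumptions, not Diaspora's.]

```ruby
# encoding: utf-8
# ^ the magic comment Ruby 1.9 needs at the top of any file containing
#   non-ASCII string literals (Ruby 1.8 ignores it).

ActiveRecord::Base.establish_connection(
  adapter:   'mysql2',
  database:  'diaspora_production',
  encoding:  'utf8',            # connection character set
  collation: 'utf8_general_ci'  # how MySQL compares and sorts strings
)
```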
00:33:19.720 The other main thing is converting IDs, as I mentioned. Mongo uses these BSON, GUID-looking things, which are strings. The first thought we had was, well, we'll just put them in there and use those as the primary key. Except they're strings, and you don't want a string as a primary key. But you kind of need to have them in there, at least initially, because you need to set up all of the foreign keys into the other tables. So what we did was: on every table, we had an extra column, a legacy ID; we set up the foreign keys using the regular primary keys, by looking things up via the legacy Mongo IDs; and then, when we were done, we just deleted that column.
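[Illustration: a rough sketch of that two-step as a Rails 3-era migration. Table and column names are illustrative.]

```ruby
class RewireCommentForeignKeys < ActiveRecord::Migration
  def up
    # The import kept each row's old BSON ID alongside its new integer PK.
    add_index :posts,    :legacy_id
    add_index :comments, :legacy_post_id

    # Backfill real integer foreign keys by joining on the legacy IDs
    # (MySQL multi-table UPDATE syntax).
    execute <<-SQL
      UPDATE comments
      JOIN posts ON posts.legacy_id = comments.legacy_post_id
      SET comments.post_id = posts.id
    SQL

    # Once every reference is rewired, drop the legacy columns.
    remove_column :comments, :legacy_post_id
    remove_column :posts,    :legacy_id
  end
end
```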
00:34:04.679 The other problem, again, is character encoding: you'll run into more character-encoding problems if you have any kind of international data, and we have a lot of international users on Diaspora, which I guess forced us to bulletproof it quite a bit. And of course, because this is open source, you can have a look at how exactly we did this conversion. That's the code on GitHub, and that's the tag where we have the conversion code. I think we've ripped it out since then, so if you just clone the repo you won't find it; go and look at that tag. And let's see... I think that's it. That's all I have. This is Diaspora; this is me. I have stickers and invites, and I may have time for a few questions, if anyone has one. I know it's kind of early.
00:35:08.320 [Audience question, off-mic.] Yes, so the question was whether we considered Neo4j or other alternative data stores. We're considering them now as a way of caching data, but because of all the overhead of using a non-traditional data store as your primary data store, we never really considered one for the primary. The way we look at it is that we have one set of canonical data, and then we'll probably end up caching it in a number of different ways. At the moment we're using a lot of memcached, but we're actually considering bringing Mongo back for a few things, possibly for caching, but not as a primary data store.

00:35:54.920 Yes? [Audience question, mostly off-mic: "You mentioned that..."]
00:36:34.440 So, if I'm understanding it correctly, the question is: how do you evaluate how standard a technology is, and how it will affect your velocity, without trying it? That's a good question. The way I think about it is that if I haven't tried something, I probably don't want to use it in production as the first thing I do. So generally, how I approach it is: if there's something I'm interested in that I think might be useful, I'll try it on some other project, some side project, and I'll go into it with this idea of, okay, can I use this for X? Does it feel useful? And even then, I try to take into account the fact that the first time I use something, I'm really probably not going to understand it very well. I'm the type of person who needs to do a couple of projects before I start seeing the patterns come out. So for me personally, the way I look at it is: if I haven't used it on a few different projects, I'm probably not going to put it on a project where speed is of the essence, because I feel like that's increasing the risk a little too much. I know other people have somewhat different thresholds for that, but that's where I stand.
00:38:00.440 [Audience question, off-mic.] The question was: what patterns have I seen that would make a project a good fit for Mongo? The TV show project was actually a pretty good fit for Mongo, just because all of the data was in one place and didn't have links between documents, and I believe that project is still on Mongo, as far as I know. We did have one issue where our product owner put something into the backlog that said users should be able to click on an actor's name and see all of the TV shows they've ever been in. We had been storing cast members as basically two strings, the actor's name and the character's name, so in order to compute this, we would have had to go into every single show, into every single season, into every single episode, go through the list of cast members, and do string matching to find the same actor's name. At that point I suggested maybe we should just link to IMDb. That didn't go over very well. There are other ways to deal with it; you can denormalize the data in different ways to get at that problem. But that was one of the signs: oh, people are starting to ask for these links between documents, and this isn't really what Mongo is designed for.
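[Illustration: why that backlog item hurt. Reusing the hypothetical document shape from earlier, every show document has to be fetched and walked on the client, because the actor names are strings buried three levels deep.]

```ruby
def shows_featuring(actor_name, client)
  # No join and no useful index: scan every show document and
  # string-match through seasons -> episodes -> cast_members.
  client[:tv_shows].find.select do |show|
    show['seasons'].any? do |season|
      season['episodes'].any? do |episode|
        episode['cast_members'].any? { |cast| cast['actor'] == actor_name }
      end
    end
  end
end

shows_featuring("Michael O'Hare", client).each { |show| puts show['title'] }
```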
00:39:34.200 [Audience question, off-mic.] I actually think it was mostly a mismatch between the type of data we had and the type of data Mongo is good at, so I don't think it's necessarily a deficiency in Mongo. I do think Mongo is occasionally oversold, probably not by the people who actually build it, but I'm not sure; it's occasionally oversold. A lot of people talk about Mongo as if it's a drop-in replacement for MySQL, but it's really, really different, and it took me a while to figure that out. Mongo is good at very different things than ActiveRecord and SQL in general. SQL is really good at joining things, so if you're in a situation where you need to do these joins, SQL is a great choice, and Mongo is great if you have something where you can essentially denormalize a query into one place. So I think a lot of it was just our inexperience with Mongo, and the smaller community around the software, which meant we fixed a lot more bugs in our own mapping software than we would have otherwise.
00:40:57.640 [Audience question, off-mic.] Why MySQL and not Postgres? We actually do support Postgres now. At the time, I think we just wanted to go to something we had the most familiarity with, and at the time that was MySQL. In the last month or so we've fixed up a bunch of stuff, and now we support Postgres as well, because we're planning on moving our own site over anyway. But I think we're probably going to support both MySQL and Postgres moving forward.
00:41:39.119 [Audience question, off-mic.] Yeah, the question is what level of referencing, if any, would be acceptable in a document database. It's certainly possible to do; even in the TV show project, we did have a few links between things, a few references to things. In the end, I think it's really hard to say. There's a certain amount where you do a couple of joins in your application code and you feel like it's really an exceptional case, a feature that's kind of out there that you're not going to build a lot more of, and I think that's one thing. And then there just comes a tipping point where you're like: okay, I've got all this code in my application that would be one statement in ARel, and it's an entire class that I had to test-drive out. You do that once, and you're like, okay, all right, this is not worth switching for. You do that twice... I think there's a certain tipping point you reach, and you'll know it when you get there. Sorry for the non-answer.

00:42:50.520 Other questions? Okay, thank you very much.