MongoMapper, Mixins, and Migrations - a look at managing data-model complexity


Summarized using AI


Tim Connor • September 29, 2011 • New Orleans, Louisiana • Talk

In the talk "MongoMapper, Mixins, and Migrations," Tim Connor discusses managing data-model complexity with MongoMapper, a DataMapper-pattern library for MongoDB, and how document stores compare with traditional relational database systems (RDBMS). Tim highlights the benefits of a dynamic schema freed from the constraints of schema-based databases, contrasting it with the rigid structure of third normal form in SQL databases.

Key points discussed include:
- Dynamic Schema Flexibility: The DataMapper pattern allows developers to define schemas in code, enabling dynamic adjustments which are particularly useful in rapidly evolving applications.
- Comparison of Data Models: Tim contrasts document-style modeling in MongoMapper with traditional SQL modeling, pointing out how embedded documents reduce the need for the joins characteristic of RDBMS setups.
- Mixins vs. Inheritance: He discusses the significance of Ruby's mixin system in enhancing code reusability and maintainability as opposed to single inheritance, which can complicate Rails applications.
- Challenges with Data Migrations: Tim shares practical approaches to handle migrations within MongoDB, emphasizing the need for careful planning to avoid data inconsistencies, especially when dealing with large datasets. He recommends implementing a strategy that includes transforming records iteratively rather than all at once to prevent performance issues.
- Real-world Experiences: Drawing on his consulting experiences, Tim warns against the pitfalls of using bleeding-edge technologies without a deep understanding, suggesting that the ease of Mongo's schema-less nature can lead to complicated refactoring and maintenance challenges.
- Recommendations: He advocates defining a clear, cohesive schema design upfront and stresses the importance of migrations, suggesting that proper migration strategies can facilitate smoother transitions between data structures as applications grow.

Tim concludes by cautioning developers to weigh the decision of using NoSQL databases against their specific project needs, as the flexibility of dynamic data modeling could lead to complexity if not managed properly. He emphasizes the necessity of understanding the underlying trade-offs when choosing between MongoDB and traditional SQL databases, encouraging experimentation on smaller projects before committing to larger implementations.

MongoMapper, Mixins, and Migrations - a look at managing data-model complexity
Tim Connor • New Orleans, Louisiana • Talk

Date: September 29, 2011
Published: December 13, 2011

An exploration of how the DataMapper pattern used by MongoMapper works very well with key-value stores in general, and exceptionally well with a document store like Mongo, versus its less ideal match with a schema-based data store like an RDBMS. Tim compares the ease of modeling your data this way against the canonical third-normal-form approach to database modeling, which raises the question of composition over inheritance and how Ruby's mixin system plays into that tension. This all leads up to having a data model that is trivial to make excessively dynamic, and challenging to keep sane and consistent, in such a system. Methods of migration and data massage are discussed, with Tim suggesting what has worked best on some real Mongo projects, though the approach would work well with any key-value store.

RubyConf 2011

00:00:16.960 Thank you, everyone who made it to the last talk — all 35 of you. Feel free to move up and heckle; I welcome questions at any point. Obviously my name's Tim Connor, since it's been up there for a while. I run a small consultancy called Cloud City Development. Like everyone else, I'm always hiring, and I live in San Francisco so I know a bunch of people — so if you don't want to work for me, or for some of these other people, I can introduce you. I do a pairing interview and, based on that, sort of say, well, you might be a good fit.
00:00:54.559 But that's not what the talk's about, obviously — it's about MongoMapper, sort of, which is a lot of fun to use, I'll admit, but which I generally advise people against for most projects. Particularly if they think "oh, it's web scale" — which it isn't. Use a Dynamo- or BigTable-style store — Riak is a Dynamo implementation, Cassandra is too — or just write plain log files and then do a MapReduce. "We can't figure out how to scale MySQL" is not a good reason to use Mongo; bad reason. You pay a bleeding-edge tax: you're going to end up with a lot of people who don't know what they're doing, versus how well they would know MySQL, so developers make a lot of mistakes, and the tooling is going to change. There's a whole bunch of reasons — ask Carbon Five, there's one of their guys in here, about the bleeding-edge tax they've paid on a couple of projects. They will gladly tell you.
00:01:49.840 And it does have a schema — Cliff Moon enlightened me about this. There's sort of a hidden schema it uses to determine how big the rows should be, and unfortunately, when you exceed it without realizing it, all of a sudden your gigabytes upon gigabytes upon gigabytes of data all have to be rewritten and resized, and then your website falls over. So I don't count that as web scale. While I was using it on some projects, replication changed twice in a year — to replica sets, and I don't remember what the other change was — which again is that bleeding-edge tax. That's one of the reasons I didn't recommend Postgres for a long time: the replication wasn't quite as out-of-the-box as MySQL's. I mean, we know how MySQL replication works, and it just does work, which is kind of important for building big sites — call me crazy. Oh yeah, and they revamped MapReduce. Of course, if you have a lot of data and you're doing big data, you need to run MapReduce on it, and to have them say "oh, by the way, you can't use it that way anymore, rewrite all your stuff" — again, bleeding-edge taxes. Using the coolest oh-my-god-we-all-have-to-use-it tech sometimes sucks, for a couple of reasons.
00:02:59.200 So why would you use it? Well, 10gen, in their defense, are awesome and responsive. You can jump on IRC or somewhere and they will tell you why something works the way it does, and possibly fix it immediately — that's kind of big. And it's powerful. I mentioned that before on my slide but didn't say what I meant. What do I mean by powerful? Well, not having a schema is huge. Obviously we hear about "no schema" — why is this such a big thing? Because you can dynamically define your schema in the code, and that's important because it leads to a really cool pattern of development — I'll get to that in a second — of which embedded documents are another part, along with what you can do with mixins.
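As a quick illustration of what "defining the schema in the code" looks like with MongoMapper — a minimal sketch, with the model and key names made up for the example:

```ruby
# The schema lives on the model, not in the database: MongoDB itself
# enforces nothing, so these key declarations are the only schema there is.
class Person
  include MongoMapper::Document

  key :name,  String
  key :email, String
  timestamps!          # adds created_at / updated_at keys
end
```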
00:03:47.680 I thought this pattern I'm about to get to — I originally mislabeled it "composition over inheritance," and I was wrong. Brian corrected me and said, "no, that's not composition, that's just mixins, you dumbass." Thank you — he was right. He said it much nicer than that. So mixins, which I'll get back around to, are very important, because inheritance usually kind of sucks when you're building a Rails app. Single table inheritance, I think everyone knows, is a really ugly approach to this problem, but it's your only real choice if you're doing ActiveRecord. MongoMapper has the same thing, called single collection inheritance, and again it's less than ideal.
00:04:30.160 So, as I said, mixins. What mixins enable — well, I would say ActiveRecord is a good fit for SQL, for having a schema stored in the database. That's what the Active Record pattern is for: hey, look at your database, here's the schema, let's create a domain model around it. The DataMapper approach is not such a good fit for ActiveRecord or for SQL — DataMapper was a cool project, but defining your schema in the code and then defining it again in the database doesn't really fit. The dynamic nature of the DataMapper pattern is something NoSQL does fit.
00:05:09.199 Embedded documents, I mentioned. If you don't know about them by now — I think everyone's heard so much about Mongo that maybe everyone in the room already does — they're kind of like the serialized hash from ActiveSupport, but seriously plus-plus. If you're using Postgres, people will say "hey, we have native arrays," which are cool and powerful — you're right — but they're not quite as cool and powerful as embedded documents. Here's what embedded documents do: they let you have a whole object that is embedded in another one. It's like a collection, but it lives on the parent document. Why would you want that when you could just have a relation like in a database? It cuts out a bunch of joins, so if you have a truly contained sub-object there are some serious advantages. All the associations work the same, but you can do stuff like this without even a join: hey, find me all of the people whose address city is Chicago — and that's blindingly fast if you have things indexed right. If you use them a lot, you discover there are cases where it's not really a one-to-many in the usual sense; it is a truly contained object, and it just works a lot better, and it's kind of cool having support for that built into the data store.
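A rough sketch of that embedded-document setup and the join-free query he describes — model and field names are illustrative:

```ruby
class Address
  include MongoMapper::EmbeddedDocument

  key :street, String
  key :city,   String
end

class Person
  include MongoMapper::Document

  key :name, String
  many :addresses   # stored inside each Person document, not in a separate collection
end

# No join needed: Mongo matches on a field of the embedded documents directly.
Person.where('addresses.city' => 'Chicago').all
```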
00:06:31.120 But again, mixins — these sort of get related, you'll see in a bit. So yeah, I know the instance-method/class-method pattern from ActiveSupport is dumb and bad and we hate it, but it's pretty ubiquitous and it is convenient. Everything in MongoMapper is implemented as a plugin, which is just an ActiveSupport concern, which means you have class methods, instance methods, and a configure hook to plug it in. What this lets you start doing is, instead of using single inheritance, you can have a mixin: you can say, hey, let's define addresses and a couple of related fields and dynamically mix them in to any other model, so companies and people can both have addresses. Instead of some weird glommed-together single-inheritance chain, you can say there's a whole bunch of functionality grouped around "has addresses" that we can define as a well-encapsulated module, which — as you know from encapsulation — includes both the data on the object and the behavior. That's something that's honestly really hard to do in a SQL database and ActiveRecord, but it's trivial in MongoMapper, and it's about the coolest thing you can do in MongoMapper.
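Something like the following is presumably what that looks like in practice — a minimal sketch, with the module, model, and method names made up for the example. Since the talk says MongoMapper's plugins are just ActiveSupport concerns, a plain concern is used here:

```ruby
# Both the data (the embedded addresses) and the behaviour travel together
# in one well-encapsulated module that any model can mix in.
module HasAddresses
  extend ActiveSupport::Concern

  included do
    many :addresses              # assumes the Address embedded document above
  end

  def primary_address
    addresses.first
  end
end

class Person
  include MongoMapper::Document
  include HasAddresses
end

class Company
  include MongoMapper::Document
  include HasAddresses
end
```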
00:07:43.840 The downside of this power: you have like 50 ways you can do any domain model. In SQL there's really one true way — if you've been building sites for a while, you know what third normal form is and a good way to abstract things, and there's almost one proper way. With the trade-off of embedded documents versus relations, you end up with a lot of choices. Deciding when to embed is complicated; it'll take you a while to figure out where the trade-offs are. And if there are a lot of trade-offs, that means you're going to get it wrong a lot, and you're going to refactor. And what does lots of refactoring mean when you're dealing with big amounts of data in a non-SQL data store? Well, it means you have a lot of different data versions, and I hate to break it to you, but there are no magic data-validity unicorns if your data exists in six different versions. You can't really code against an undefined data schema; it's hard to write code like "well, if it was written a year ago it probably matches this pattern, but if it was written now it sort of fits like this" — that leads to completely unmaintainable code, trust me, I've seen it. You can do long-running feature switches, where you say "oh hey, if it matches this pattern we have to act this way on it, and if it matches this new structure we have this other implementation" — but if you start having 50 feature switches in your code that all sit there for two years, you have a giant mess where every method takes like six flags to know how to behave.
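The kind of unmaintainable shape-sniffing he's warning about looks roughly like this — a made-up example, not code from the talk:

```ruby
# Anti-pattern: every reader has to know every historical shape of the data.
def shipping_city(person)
  if person['addresses']                  # newest structure: array of embedded docs
    person['addresses'].first && person['addresses'].first['city']
  elsif person['address'].is_a?(Hash)     # last year's structure: single embedded doc
    person['address']['city']
  else                                    # the original flat string field
    person['address_city']
  end
end
```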
00:09:13.680 So there's a solution, which is: hey, maybe we should have our schema design defined in one place, which is the code, and there should be one version of it — which means you're going to have to come up with a way to migrate your MongoMapper data. And please, if you disagree or have a question, shout out; this is a small group, so we can afford a little heckling. Yeah, you actually need migrations. You don't just need them for MongoMapper — the title's misleading — you need them any time you're using Mongo, which really is any time you're using NoSQL. People say, "well, what about Riak?" Same problem. If you don't have a defined schema in your database, you're going to have the issue of your data having different schemas at different points in time, and that's really a no-no when you're trying to figure out what the hell your code is doing.
00:10:10.800 It turns out, though, that's really just the problem of migrating data in general, and even in SQL it's a problem when you have big data. I think Rails has done a bad thing with rake db:migrate, in a way: it's made everyone think this magically happens — that we can just change our schema and everything takes care of itself. If you start dealing with a lot of data, you realize it doesn't work that well. I mean, if you're on exactly the right version of MySQL — 5.1, before they toggled the switch back and forth on whether column renames are fast — it sometimes works. But in reality, if you're transforming your data much at all, you need to take your site down and put up a maintenance page, and that sucks; otherwise you're in a state where the database is either locked up or halfway through transforming a bunch of stuff. Doing large in-band migrations of data is just a horrible pattern that the simple approach to building a vanilla Rails app leads you to: "oh hey, migrations are free, we just do them and they run."
00:11:10.959 Well, I think there's a silver lining that I discovered from having to solve this problem a couple of times with MongoMapper, and it's that you end up doing it right. Honestly, there's one way to do it, and y'all could think about it for about five minutes and come up with what it is. (Whoa — that slide was supposed to be broken down like that; I forgot to delete one.) You have to deploy code that writes to both structures, your old one and your new one. You've got to use finders against the old one, because until you have all of the new structure written, finders against the new one won't find all the records. You get to update the records over time, because you have a lot of data, and if you do one big migration all in one go it's going to bring your site to its knees. When they're all updated, you can switch to using finders against the new structure, because in theory all of your data now looks like one shape that the new finder can find. And then you clean up the old data structure whenever the hell you get around to it — whenever disk space gets expensive or something. Yep?
00:12:23.200 That's a good question — those finders get helpful for that. A little later I'll mention that it is very useful to have a finder for exactly that purpose, to be able to tell what's been updated. If you have an intelligent finder that looks for the old structure, or for the lack of the new one, then you can do a count on it, which is reasonably fast if you have things indexed right, and when that count hits zero, you're done. The later slides get into the processing of that a little. You could put a version field on each record, for example — that's one of the things you can do — but I don't like the version column, because just like the old version column in Rails it doesn't really work with feature development and getting versions out. It's a little better if it's per model, but at most I'd do it per collection.
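For example, if the migration adds a new addresses field, the progress check can be as simple as counting the documents that don't have it yet — a sketch using an illustrative field name, passing a raw Mongo $exists condition through MongoMapper's where:

```ruby
# Zero remaining means every document has been transformed.
Person.where(:addresses => { :$exists => false }).count
```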
00:13:14.639 Yeah, you have to touch each record. You could also, if you really are obsessive, try to drop your old data at the same time, but that means you have to have a finder that finds based on both the new schema and the old schema and do sort of a cross-union-join thing — a cross-engine join in your own data store. Just don't try it; you can do whatever you want, but it's a bad idea. So how do we actually implement this thing?
00:13:44.880 Typically, as Rails people, we're going to say: well, we have a before_save — when we load up the record, in the before_save we transform it into the new version and then we save it. It turns out you don't need to do that, because of a kind of cool trick. It's a combination of how MongoMapper is implemented and the fact that Mongo is a document object store, so you're storing the whole object each time. MongoMapper always uses the setter when it initializes a record — Rails does some different stuff — so if you define an attribute setter, just by loading the object the stored value will be run through that setter. So if your old way of doing it is a single address — there's a bug in this code, points if you can find it — and you decide you're adding multiple addresses, then if you just have an address setter, it will run in the middle here just by loading the record; and then of course you have your bit to make it keep working the old way, and then if you save, you're done. There's no before_save: just because you defined a setter, and because MongoMapper loads all attributes through the setter, all you need is code that says, hey, if I'm loaded, handle it.
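The slide's code isn't reproduced in the transcript (and it deliberately contained a bug for the audience to spot), but the shape of the trick is presumably something like this sketch — field names are made up, and it relies on the talk's claim that MongoMapper pushes every stored attribute through its setter on load:

```ruby
class Person
  include MongoMapper::Document

  many :addresses                  # the new structure

  # Old documents still carry a single embedded `address` hash. Because the
  # stored attribute is pushed through this setter when the document loads,
  # simply loading an old record converts it in memory; saving persists it.
  def address=(old_value)
    self.addresses = [Address.new(old_value)] if old_value
  end

  # Keep the old reader working while both structures are still live.
  def address
    addresses.first
  end
end
```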
00:15:04.720 Which means "migrate" becomes: you just have to touch and save every record in your data store — which is, you know, not a problem with billions of records at all, ever. Really it's every untouched record, because it's a little annoying, if you're dealing with billions and you've already updated half of them over time just by letting the stuff run, to then go through all the billions again. Which is the answer to your earlier question: you really, really, really want a finder for the untransformed records, because you want to be able to say, hey, let's update only half of the billions, not all of them.
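Put together, the whole "migration" can be a batched touch-and-save loop driven by that untransformed-records finder — a sketch under the same assumptions as above, with an arbitrary batch size:

```ruby
# Each save writes the new structure, so the next query naturally
# excludes the records that have already been transformed.
loop do
  batch = Person.where(:addresses => { :$exists => false }).limit(500).all
  break if batch.empty?
  batch.each(&:save!)
end
```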
00:15:42.320 So, the actual migration process: I went through this on a project where I had to build something — unfortunately, sorry guys, I didn't have time to abstract it into an open source thing, it was a client project. You could touch every record in something like a Rails migration and make it work how Rails people expect and be all seamless — which is going to be slow, it's going to kill your deploy, it's going to take a bunch of work to fake, and it's going to have no advantages other than making people feel comfortable, which isn't always such a bonus. Then you say, hey, why don't we do that other hack, which is a rake task called rake migrate-something-or-other that we run out of band. At least it's out of band, so now you can do a deploy and then run it — but it's still going to be slow, you're not going to have good error tracking, you're not going to know what the status is, it's not going to be timed for you automatically, and it's going to be impossible to add additional workers. So after you've implemented that and it doesn't work very well, you say: hey, we all use Resque anyway, or some sort of background job — why don't we just use Resque? That sounds like a perfect solution. You still have a lot of rows and you need a way to track progress, though, so a plain Resque worker isn't the solution either. I ended up finding a pretty good solution in subclassing resque-status and adding some custom logging and error output. Because you could just do the standard thing and shove your errors into Redis — hey, we have Redis and there's an error, why don't we shove it in there?
00:17:16.000 But — oh, wrong slide order, damn it. Oh well. The slide that isn't there, the one I got in the wrong order, says: if you shove all of your errors into Redis, your Redis is going to blow up and fall over if you're talking about billions of records and you ever make one mistake, or you have a little inconsistent data. And personally, I don't like making Redis fall over when I'm counting on it to keep track of which records I'm migrating. So you need to track things a little better: come up with some sort of cool subclass of resque-status. I found it useful to output a log — since, hey, disk space is cheap — just a log file of all the records that errored, or just keep track of the ids.
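In that spirit, such a job might look roughly like this — a sketch, not the speaker's actual code: the class, file, and query names are made up, and the resque-status details (the Resque::Plugins::Status include and its at progress callback) are how that gem is generally used rather than anything shown in the talk:

```ruby
class MigratePeopleJob
  include Resque::Plugins::Status

  def perform
    error_log = File.open('log/migrate_people_errors.log', 'a')
    batch = Person.where(:addresses => { :$exists => false }).limit(1_000).all

    batch.each_with_index do |person, i|
      begin
        person.save!                                              # the touch-and-save transform
      rescue => e
        error_log.puts "#{person.id}\t#{e.class}: #{e.message}"   # errors go to disk, not Redis
      end
      at(i + 1, batch.size, 'migrating people')                   # progress reported via resque-status
    end
  ensure
    error_log.close if error_log
  end
end
```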
00:17:57.039 You're going to hit another problem then. You're like, sweet, okay, we have this cool Resque thing keeping track of what's going on — and then: timeouts. There are some weird timeout issues with long-running cursors in Mongo and MongoMapper. I don't remember exactly, but it might have been a case where it wasn't even accepting the option to disable the timeout, which meant you would have something running for some number of hours and then it would just decide, oh hey, I timed out, too bad. So if you implement this yourself, you may have to invent your own batching process — which isn't that hard — but don't count on MongoMapper's or Mongo's "each without timeout" or whatever to work quite how you expect.
00:18:45.760 But one pass through, I swear to you, is not going to fix all your data. If you have billions of rows and you try to do one pass through, you're going to discover that you forgot a few cases, so you need to be able to do a multi-pass process. Oh, and here's a slide that was supposed to be earlier: a call-out to the thoughtbot people — Hoptoad is another possibility, because hey, there's nothing quite like tracking errors by shoving them into someone else's Mongo instance and seeing which falls over first. [Audience comment.] Yeah, when it works — well, hey, they're on Mongo too, so what can you say?
00:19:28.240 So this is great: now we're not overloading our Redis, and we have one instance running trying to transform billions of records — which, you might guess, could take a while. So you need some way to slice it up. The first answer people come up with is: take a mod of the id, divide the work into ten chunks, and run ten workers against that. (God, why do I have to keep coughing — sorry, I'm getting over a cold.) Ten workers, a hundred, however many — it's not very flexible, but it'll get the job done. There have got to be better ways to do it, though. One of the other cool things about Mongo is that it has find-and-modify, where you can find a record and at the same time atomically set a flag or modify a value. So if you have your version flag, or some "being transformed" bit, you can find a record at the same time as you mark it as being transformed. Given that, you can make sure you're not finding a record that a different worker is already transforming. This lets you spin up N workers — and this is my ideal final solution — with a randomized finder, so at any point each worker takes, say, a thousand records out of the chunk and processes them; and since you're setting a flag, you're reasonably safe that you're not stepping on your own toes, and because of how it works, the worst case is that you end up re-transforming a record if you overlap a little.
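A sketch of that claim step, using findAndModify through the driver collection MongoMapper exposes — the 'migrating' flag and the query fields are made up for the example:

```ruby
# Atomically pick an untransformed, unclaimed document and flag it,
# so no two workers grab the same record.
doc = Person.collection.find_and_modify(
  :query  => { 'addresses' => { '$exists' => false },
               'migrating' => { '$ne' => true } },
  :update => { '$set' => { 'migrating' => true } }
)

Person.find(doc['_id']).save! if doc   # touch-and-save runs the setter transform
```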
00:21:03.280 [Audience comment.] Yeah — I mean, that'll work for Mongo; with something else you're going to have to find your own way of doing it. There are other solutions, but supposedly this is a MongoMapper talk, so that is one cool thing about Mongo: that operation. Because then you can say, hey, add ten more workers when it's off hours, and that allows you to transform a large data set where you only hit it so hard during the hours that matter and at night put a hundred more workers on it — which is actually a pretty cool thing, and probably how we should generally be transforming big data, not trying to pretend it's just the same Rails solution. One of the other options, depending on how big your Redis instance is, is that you could just shove all the ids into a queue, since you're using Redis anyway, and then work off that queue. There will be some interesting timing questions about how many to put in there, and it could get complicated, but it would work in the Riak case, where you don't have find-and-modify. Personally, I like to be safe, so I still have an instance method that can tell whether that flag is set, in case my timing is weird, just so I can skip the work and escape early — because it's really slow to transform billions of records in Ruby-land, and I'd rather not waste time doing it twice.
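The Redis-queue alternative he mentions might look something like this sketch — the queue name and the fields(:_id) projection are illustrative, and the guard simply re-checks the same untransformed condition rather than any flag from the talk:

```ruby
# Seed a work queue with the ids of the untransformed records...
Person.where(:addresses => { :$exists => false }).fields(:_id).all.each do |p|
  Resque.redis.rpush('migrate:person_ids', p.id.to_s)
end

# ...and have each worker pop ids off it, skipping records that have
# already been transformed instead of re-saving them.
while (id = Resque.redis.lpop('migrate:person_ids'))
  person = Person.find(id)
  person.save! if person && person.addresses.empty?
end
```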
00:22:27.039 So now you have awesome migrations. I mean, if you do implement this, the general approach isn't that hard — ping me if you have any questions, because I have done it once or twice. But it's a pain in the ass, because you're still tracking multiple versions of a lot of data and having code that works around it, and it turns out you should just avoid it. I wouldn't have more than one or two big transforms going at once; it starts getting hard to keep track of what's going on, and then something falls over because one of them is interfering with the others. Which means, God forbid, you suddenly have to be careful about your up-front design — about getting the modeling of your data right. The funny thing is that the reason you went with MongoMapper, supposedly, was that you didn't have to worry about the schema — and now, wait, I have to worry about the schema more, because it's going to be painful to transform, so I actually do have to worry about the schema. So I think that's a bit of a not-so-great trade-off. If you're not dealing with giant amounts of data it's not as much of an issue, but then you still don't have the "hey, let's just do a Rails transform" approach.
00:23:39.760 I have projects I would use Mongo on, but some of this "don't worry about it" benefit is totally illusory: you're just pushing cost to a different point in the cycle, which is three months in, when you try to refactor something and discover, oh, this is really hard to keep track of. As I said, talk to the Carbon Five guys — they've moved a couple of projects off Mongo and back to MySQL. When I was at Pivotal, I saw a client there do the exact same thing. (That's the end of the slides, by the way.) Rails people are so fast working against MySQL — we know what we're doing — that I honestly think you generally end up losing productivity by using Mongo. There are some specific cases where yes, this is actually a document, it makes perfect sense, and I really need the flexible schema enough to make it worth it; but generally I've found it's a drain as well as a benefit, so really think twice about it. I still talk too fast and need more slides, so we all get to go drinking early unless there are questions. [Audience:] Does it make sense to use Mongo for just some parts of an application — small pieces of it? Yeah, I've considered it.
00:24:54.960 On a project I have now, if I went back in time I would — despite my aversion to new and shiny — use Node and Mongo for one piece and MySQL for the rest. The problem is that then you get into that dreaded ground of cross-engine joins, and it is really nice to have as much data as possible in one data store, because then you can use whatever that data store's approach is for joining models — or joining tables, as it were — instead of having to get a group of ids from one store, pull it over to another, and select against that. And we know how to tune MySQL pretty well by now, even aggregate queries; we can do some cool stuff with that. Or, if you're using a giant data store like a Dynamo — like Riak — there are ways to work with your data there. I've found that, for me, it's usually not about the data looking a little different; the use case has to be different enough. For example, Redis is great as a data store for semi-persistent or non-persistent data, not necessarily as a primary data store; or you use a BigTable-like, or HBase or something, to roll stuff up and put it back into MySQL so the main app can use it. There are always trade-offs in your data-store choice, and the proponents of each one make it sound like theirs is the be-all end-all. But enough people write about the trade-offs: look at them and ask, does it fit my use case? Not "does it fit my data model," really — because honestly, if you've been writing MySQL apps for a while, you can figure out how to model just about anything in a relational database. And when you don't have experience with whatever the new thing is, you know that the first time you're going to have a lot of stuff to actually fight with. Yep?
00:26:44.480 So do it on a toy project first — experiment with data stores on your own personal project, not necessarily on a client's time if you're a consultant. Or find a client who says, "yes, we want to do this, we're sure" — okay, sweet, I'll learn something new; I won't at all be frustrated when I hit my head against the wall repeatedly. Yeah, it's hard to know the trade-offs if you haven't used it, but at this point Mongo's been around a while, so you can Google and find a lot of people ranting about it in both directions. It's a lot of fun to have a fully dynamic schema that's determined by your code — that can even be determined at run time — but that fun has to be tempered with the costs of the lack of constraint. We talk a lot in Rails about constraints being creative and making us solve the problem right, and Mongo is the other direction: hey, let's have no constraints and do whatever we want. I had one project on it that was a blast, but yeah, it was a nightmare dealing with different data versions — which is why I said, hey, maybe I should give a talk on this. I meant to abstract it into a gem, but with proprietary code you can't always get permission to do that, particularly when you're done with the project and can't remember it perfectly. And I don't know that I'm going to write another Mongo app until I have a client that says, "yes, let's put it in." At this point I would lean more towards Riak, Redis, and MySQL or Postgres.
00:28:10.240 Because — I mean, Mongo does do a lot of cool stuff, but I don't know that it's the answer at terabytes. I know some people who are pulling terabytes by just writing flat log files and then using MapReduce to roll them up, because Mongo's just not holding up for them anymore. Mongo works great when either your index or your data fits in memory; then it is blindingly fast, because it's designed to be — it's sort of an "I can fit in memory" store, so you can run even unindexed queries against small enough data sets and you can't believe how fast they return. You're like, there's not an index on here — how can you do this complicated query against all these sub-objects and just return it right away? It's blazingly fast at that, which is kind of cool, but it doesn't work once you have a terabyte of data, and then I think you need a solution built more around replication at its core, such as a Dynamo sort of thing. Cassandra supposedly works now, but I've watched how long people have been trying to get Cassandra working, so I'm a little nervous about it. And there are a lot of people I've talked to who are smarter than me and know a lot more about databases, and every one of them keeps saying, "well, why don't you use Riak?" So that's kind of why I'm leaning that way. Was there another hand, or does everyone want to go get drinks? Okay, go ahead — but when people leave, then it's done.
00:29:33.840 [Audience:] How about embedded documents — are they equally fast when, for example, you don't have an index? Yeah — I was surprised at how fast queries against embedded documents were when they weren't indexed, and that's what I was saying: I talked to the 10gen guys and they were like, "well, yeah, if your data set's small enough to fit in memory," in a dismissive tone, and then I was like, oh yeah, that makes sense. Mongo does some interesting things with how it loads things into memory when you access a collection, and if it's small enough it's going to run incredibly fast queries against it. So yeah, throw Mongo on your beefy server for small data sets — where "small" is however many gigabytes of RAM you can afford — and it'll work, it'll be great, you won't even have to worry about it. It'll just be like, sweet, I type away at this dumb query and it returns, and I don't have to think about my joins and I don't have any problems — and then your data set grows larger than your memory, and it falls over and blows up in your face.
00:30:29.840 [Audience:] It's not really your data set growing beyond memory, it's your active data set. Active, yes. [Audience:] We have way more memory usage, a way larger index size than RAM, but we have no performance issues, because that's not the active data set — typically only the latest partition is actually active. Yeah — so it's the active data set, not the whole data set. Could everyone hear that? He basically said your active data set is what matters, not your total data set; even their indexes are far larger than their memory, but they're not accessing all of it. Which is a good point: if you fit within that profile, it is sort of nice how fast and easy Mongo is. I think there are some costs and trade-offs there, but I was very impressed at how performant it was for what would be, in MySQL, stupidly hard queries — it's just like, oh, here you go, here are all those records you wanted; like, how are you even doing that? They'd put some thought into it. [Audience:] Couldn't you, for example, start the data set with, I don't know, 100 gigabytes of memory, and then... I didn't quite follow your question. [Audience:] If scanning in-memory objects is fast enough, couldn't you just shard the data among instances? You could.
00:31:59.360 I don't know — he might know more, because it's been three to six months for me. He was asking about just sharding your Mongo. I was burned a couple of times by dealing with Mongo replication, so I've sort of avoided it — and, I mean, sharding is a different problem — but if you're going to the effort to shard, then you could use a data store that works better with large data in the first place. "Large data," in this case, with how big memory is getting, can mean seriously large data; a lot of the websites we build are going to fit fine in Mongo. I personally don't think it's worth the cost of dealing with a fully undefined schema and not knowing what my data is. It is really nice to know that, at least within the constraints the data store is enforcing, I know what the data looks like, because I can write my code against it — you can't really write code against an undefined schema. If you spend the effort to guarantee you know what the data is, it's totally worth it. And if you're hiring junior, intermediate, or even advanced Rails people who haven't touched Mongo, they're often going to be way faster writing code against MySQL, so why pay that cost? I'm not saying don't ever use it — I've used it, it was fun.
00:33:14.080 Fun is a thing that keeps developers productive to some degree, but you don't necessarily want to pay too many costs for it. And there are plenty of shops doing this well — well, John Nunemaker, who wrote MongoMapper, is obviously using it with great success, otherwise he wouldn't have written it and kept maintaining it. That's part of why I liked my MongoMapper experience — my first project with Mongoid kind of hurt; sorry, whoever writes Mongoid, but I didn't like it nearly as much. I think it was the plugin-based, DataMapper-style pattern that's so baked into MongoMapper — which Ripple, the Riak client, was inspired by. I haven't played with Ripple, but it sounds like you should have a similar experience using Riak if you use the Ruby client, Ripple. You don't have find-and-modify, but you have a bunch of other cool stuff too. So I think that's about it, and we can get out of here early. Anyone? Nope? Cool.