Summarized using AI

JRuby and Big Data

Jeremy Hinegardner • September 29, 2011 • New Orleans, Louisiana • Talk

Overview of JRuby and Big Data

In this presentation, Jeremy Hinegardner discusses the intersection of JRuby and big data, particularly focusing on how Ruby developers can leverage Java libraries and frameworks for handling large data sets, specifically through the Hadoop ecosystem. The talk primarily serves as an exploration of what big data is, its characteristics, and the various tools available to address big data challenges.

Key Points Covered:

  • Definition of Big Data:

    • Big data is characterized by its volume, velocity, and variety.
    • Common definitions highlight issues like the inability to process large data sets quickly with traditional tools.
    • Crowdsourced definitions (gathered "Chad Fowler style" by asking other developers) include:
      • "Too much data to fit on a Time Machine backup."
      • CJ's test: if ActiveRecord can't handle it, it's big data.
  • Understanding Requirements:

    • Developers often mistakenly think they need big data solutions. Sometimes, simply understanding the problem and sampling data is sufficient.
    • A statistical example shows that only a small percentage of data is needed for accurate analyses, suggesting a reflective approach before deploying big data strategies.
  • Hadoop Ecosystem Basics:

    • The talk introduces key components of Hadoop including HDFS (Hadoop Distributed File System) for data storage, and MapReduce for data processing.
    • Emphasizes Hadoop’s ability to efficiently manage large datasets through distributed processing.
  • JRuby Integration:

    • Discusses how JRuby can integrate with Hadoop, albeit with challenges such as needing to write Java code.
    • Existing solutions for using JRuby with Hadoop include projects like Radoop, JRuby-on-Hadoop, and Hadoop Papyrus, but they are outdated and not widely supported.
    • Provides insight on using JRuby to write MapReduce jobs and how current tools may need updates for better functionality.
  • Additional Tools:

    • Briefly explores further technologies in the Hadoop ecosystem, namely Avro for data serialization, Zookeeper for service coordination, and HBase for low-latency data access.
    • Emphasizes the role of each component in managing and processing big data efficiently.
  • Practical Examples:

    • Examples include processing Twitter data and the challenges faced when data arrives at high velocities (e.g., 155 million tweets a day). This illustrates the need for scalable solutions.

Conclusion

Jeremy concludes with a call to action, encouraging developers to explore existing JRuby projects for Hadoop integration and contribute to enhancing these tools due to their potential usefulness in the Ruby ecosystem for big data handling. He encourages a proactive approach in experimenting with Hadoop to leverage JRuby’s capabilities in the big data landscape.

JRuby and Big Data
Jeremy Hinegardner • New Orleans, Louisiana • Talk

Date: September 29, 2011
Published: December 12, 2011
Announced: unknown

One of the amazing things that we as Ruby developers benefit from, when using JRuby, is the existing ecosystem of Java libraries and platforms for manipulating Big Data. The most famous, and probably of the most use to Rubyists, is Hadoop and the Hadoop ecosystem of projects. This talk will touch on how we, as developers who use Ruby, may use JRuby within a variety of Hadoop and Hadoop-related projects such as HDFS, MapReduce, HBase, Hive, Zookeeper, etc.

RubyConf 2011

00:00:16.880 hi everybody i'm jeremy uh i'm going to talk to you a little bit about big data in jruby today
00:00:22.160 and this is more of an overview talk we'll get into a couple of smaller details on uh on hadoop stuff when we get down
00:00:28.080 in there but initially this is just sort of an overview of what big data actually is and then
00:00:34.239 some tools and some procedures and stuff that you can go through to to help you if you have a big data problem so
00:00:40.320 first question i have is uh what is big data who's got a definition come on raffle scale okay that is a
00:00:48.879 great definition nick of course has one uh
00:00:54.800 more than i can put on my time machine backup okay that's a good definition of big data um so yeah there's there's lots of
00:01:01.440 definitions for big data uh where there actually are some official definitions you know all right everybody read this
00:01:08.240 it's kind of dry you know: big data, applied to data sets whose size is beyond the ability of commonly used software tools to capture,
00:01:14.240 manage, and process the data within a reasonable elapsed time. okay so what nick said yeah it you know it doesn't
00:01:20.720 fit on a time machine backup there's also another one that gartner has of course they get involved in all
00:01:27.200 sorts of stuff like this um because it's big business and enterprise and all that kind of good stuff so increasing volume velocity and
00:01:34.159 variety of data um i personally you know these are yeah yeah they're dry
00:01:40.000 anybody can think of them uh this is my definition one of them one of them or who's had
00:01:45.920 this happen yeah yeah yeah yeah nagios low disk space you know there you go all that
00:01:51.520 kind of good stuff or the ever popular uh that data processing job the one that takes all day to run you know
00:01:57.759 it's got to crunch all that data yeah we need to do it over um it's wrong and we need to run it every day
00:02:03.439 on yesterday's data by 8:00 am so who has that problem okay yes yes and it takes longer than 24
00:02:10.640 hours to run exactly so i asked you guys and i've also i also did the chad fowler approach
00:02:17.040 and many other people's approaches what is your definition of big data and they kind of broke down into three different areas
00:02:25.440 that uh that i can kind of define one of them is that's a lot of data so i personally
00:02:31.440 like cj's, which is: if active record can't do it, it's big data
00:02:36.959 yeah that's i think that's a good one that's a good one um you know uh jeg2 everyone knows james
00:02:42.720 edward gray awesome guy um he says it's big data when i can't use a when i have to use a specialized
00:02:48.879 database so that's also a decent one um in terms of big data you know
00:02:54.879 different people have different volumes that they think are uh are quite a lot does anyone recognize this image
00:03:01.920 yeah yeah yeah tim you know it uh this was actually generated with ruby so uh how many of you know ara howard or
00:03:08.000 heard of ara howard okay yes yes all those of us in boulder are big fans of ara so um years ago when
00:03:15.280 ara was working at the earth observation group uh this is basically the night lights from uh
00:03:22.800 the lights of earth from space at night so these are you know highly populated areas and all that wonderful great stuff
00:03:29.120 um so they have all this satellite data raw satellite data and some of it cleaned a little bit and
00:03:34.480 i don't know all the processes they go through but someone decided they wanted to buy a lot of it and by a lot i mean
00:03:40.159 all of it uh which was 10 petabytes so who has 10 petabytes of information
00:03:46.879 oh all right all right we have one um in terms in terms of 10 petabytes of information
00:03:53.760 uh data i actually was looking at the backblaze you know they had this great blog post a little while ago about adding more space and doubling the
00:03:59.920 size of their servers and all this kind of great stuff uh they 10 petabytes i think is
00:04:05.200 two-thirds of the total amount of data storage that backblaze currently has in their data center
00:04:10.400 so everyone who backups with backblaze across the entire world two-thirds of it was bought by one
00:04:15.920 company from a volume perspective uh so to get that data to the customer they had five tape robots
00:04:22.479 putting the data on tape for 30 days shipped it ups or fedex or you know
00:04:29.840 courier huh station wagon yes i'm about to use that quote you know which one i'm talking about
00:04:35.280 um and then uh it was $75,000 worth of tape to start with just to write it once and
00:04:40.960 then the customer spent 30 days reading it off of tape so who knows the andrew tannenbaum quote i'm about to use
00:04:47.919 okay all right there's two of you never underestimate the bandwidth of a station wagon going down the highway with a box of tapes in the back
00:04:54.160 okay the latency sucks i mean 60-day latency is not good but the
00:04:59.840 bandwidth is tremendous so if you feel like it somebody calculate how long it would take to do 10 petabytes of information
00:05:06.479 over um your 50 megabit link at home. somebody go run the numbers, i
00:05:11.680 haven't done it yet so that's a large volume of data it wasn't headed there fast but that's
00:05:17.440 another thing um another definition that people kind of came up with when they were talking about when i asked the question of what is big data
00:05:23.600 they said a good chunk of data heads my way really really fast. charlie baker, is he in the
00:05:29.360 room yes charlie okay he said nine terabytes of compressed data coming at me every
00:05:34.400 day not too bad more than processing that fits in ram or yeah lots of data headed your way
00:05:42.560 fast for an example how many of you use twitter okay
00:05:48.240 do you realize that your tweet is more than just 140 characters you know the actual volume of data that's in a tweet as a consumer of
00:05:55.120 someone's tweets is about 2,500 bytes which includes like the tags the location if you have it turned on a lot
00:06:00.880 of your author profile all sorts of lovely information so this is a while ago 155 million
00:06:06.160 tweets a day which is 16
00:06:12.160 gigabytes an hour or four megabytes a second
00:06:17.840 or 33.8 megabits per second. the networking people in here who know volumes, that's the majority of
00:06:24.240 an oc-1 line. so that's a lot of data coming at you very very fast.
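As a quick back-of-the-envelope check on those figures, here is a small Ruby sketch using the roughly 2,500 bytes per tweet and 155 million tweets per day from the talk; the results land in the same ballpark as the numbers he quotes.

    tweets_per_day  = 155_000_000
    bytes_per_tweet = 2_500

    bytes_per_day   = tweets_per_day * bytes_per_tweet
    gb_per_hour     = bytes_per_day / 24.0 / 1_000_000_000
    mb_per_second   = bytes_per_day / 86_400.0 / 1_000_000
    mbit_per_second = mb_per_second * 8

    puts gb_per_hour.round(1)       # ~16.1 GB an hour
    puts mb_per_second.round(1)     # ~4.5 MB a second
    puts mbit_per_second.round(1)   # ~35.9 Mbit/s, the majority of an OC-1 (51.84 Mbit/s)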
00:06:30.160 so how many people have an oc-1 line at least into their data center or wherever they're getting data
00:06:35.680 there we go two yeah not bad so this is all from gnip which is a company that actually
00:06:41.120 uh processes twitter data and you can repurchase it from them and all sorts of good stuff
00:06:46.639 or subscribe to it the other piece that people were talking about we've got large amounts of data coming at you very fast and then what do
00:06:52.720 you need to do with it actually i like nutty com's uh example here he says it's big data when o(n log n)
00:06:58.479 is too slow to process it which is kind of you know when you say hey when a really
00:07:04.960 fast algorithm can't do it then you know it's it's it's big data and one that i have to deal with on occasion
00:07:11.360 to add to this is how long do i need to keep this amount of data around so
00:07:16.479 i've got a lot of data coming at me very fast i have to do a lot of work with it and i
00:07:22.160 have to keep it around for a while maybe process it again later so my personal term for
00:07:29.440 big data is any data that makes me feel uncomfortable
00:07:38.479 or as also you can say how fast and how often can you boil the ocean but never fear help is on the way
00:07:47.360 anyone know basic instructions? yes okay very popular comic i like it a lot um the first help i'm gonna try to give
00:07:55.599 you is you don't need big data so a lot of people think you need big data
00:08:01.759 but you need to understand why you think you need big data uh daxus who is probably not quite here
00:08:07.680 yet uh has a great blog post he's in the state
00:08:12.800 okay good he's in the state um he has a great blog post on the problem with big data as a rant
00:08:18.160 so there's a good problem there's a good chance that when you think you have a lot of data you actually don't need it
00:08:24.240 so based on your situation it's probably okay to throw a lot of it away for example say we have 155 million
00:08:32.000 things i'm not going to say what these things are but say you have 155 million of them
00:08:37.120 how many of those things do you need to make an analysis of the population as
00:08:42.560 a whole with these criteria: one percent error tolerance and 99% confidence
00:08:47.680 anyone want to take a wild guess 10 000 all right well you guys are
00:08:53.680 actually pretty close i was thinking people were going to guess a lot higher. you actually need 16,588 of them to make a
00:09:00.880 a a really good uh analysis of the entire population as a whole
00:09:06.000 so if you had 155 million things that were 2500 bytes apiece it's something like 34 megs of data
00:09:12.480 which is really easy to process in a reasonable amount of time so the one thing i'm going to say is
00:09:17.920 about this is if you you need to understand your problem domain there's a very good
00:09:22.959 chance that if you think you have a big data problem you don't you actually just need to sample it and process the sample
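As a rough illustration of the sampling math behind that 16,588 figure, here is a short Ruby sketch using the standard sample-size formula for a proportion with a finite-population correction; the population of 155 million, the 1% error tolerance, and the 99% confidence level (z of about 2.576) come from the talk, and the worst-case p = 0.5 is an assumption.

    # Sample size needed to estimate a proportion in a finite population.
    def sample_size(population, margin_of_error, z)
      p  = 0.5                                        # worst-case variability
      n0 = (z**2 * p * (1 - p)) / margin_of_error**2  # infinite-population size
      (n0 / (1 + (n0 - 1) / population.to_f)).ceil    # finite-population correction
    end

    puts sample_size(155_000_000, 0.01, 2.576)   # => 16588, matching the number in the talk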
00:09:29.760 so now there are cases so initially you don't need big data to
00:09:34.800 start with assume that but after you've decided that or you've made the decision that that assumption
00:09:40.560 doesn't work and you do need to process a large volume of data in a timely manner maybe
00:09:46.399 continually and store it for a long time there are some other tools out there to help you out and this is where we're
00:09:52.160 going to talk a bit about uh the big happy hadoop family so how many people have used any portion
00:09:59.200 of the hadoop systems at all that's not too bad so anyone feel free to tell me when i get something wrong
00:10:05.279 it's quite possible i will we're going to start with the base piece of the hadoop ecosystem and that's where
00:10:12.560 we're going to store a lot of data and also in a very fault tolerant and linearly scalable manner
00:10:18.560 according to the docs at least so this is hdfs or the hadoop distributed
00:10:24.480 file system so we're going to take for example a file
00:10:29.519 that's the file across the top. anyone remember the computer science way of showing a file as blocks, you know, all that good stuff
00:10:34.800 so we're going to talk about this file that's nine things wide you know nine blocks um in hadoop when we talk about block
00:10:41.120 level stuff everyone on your mac and your linux machines a block of a file is generally about 4k
00:10:46.320 uh in hadoop and hdfs a block is probably 64 or 128 megs so you need to have
00:10:54.160 larger files generally in here so this file is you know maybe a gig couple hundred meg something like that so the way
00:11:00.880 hdfs works to store a file in a you know distributed and fault tolerant manner is it basically takes
00:11:07.600 uh each block and then it spreads it duplicates it and spreads it across different nodes so you've got this file which is
00:11:14.000 basically a whole bunch of different blocks and it's going to split the blocks up and put them on
00:11:19.200 different nodes so each one of these different data nodes is an actual physical machine so you've
00:11:24.240 got your nine block file every block is stored three times across a whole bunch of different systems so you can have uh some of the
00:11:31.360 machines go down and you'll still have the full amount of your data
00:11:36.800 so the question is how does hdfs and jruby fit together
00:11:42.399 um not really so hey i mean hdfs it's a very core system i
00:11:47.519 mean how many people actually write java to write directly to hdfs probably not as many as you think
00:11:52.959 so you could use say something like fuse, or if you wanted to you could use the straight jar, talk to it in jruby,
00:11:59.760 write a file into hdfs and be done with it on that front so mostly i wanted to talk about hdfs just
00:12:05.120 to give you a baseline of saying okay this is a fundamental infrastructure for the hadoop ecosystem
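For the "just use the straight jar" route, here is a rough sketch of writing a file into HDFS from JRuby through the plain Hadoop Java API; it assumes the Hadoop client jars (the 0.20-era API current at the time of the talk) are on the classpath, and the namenode address and file path are made-up examples.

    require 'java'   # run under JRuby with the hadoop jars on the classpath

    java_import 'org.apache.hadoop.conf.Configuration'
    java_import 'org.apache.hadoop.fs.FileSystem'
    java_import 'org.apache.hadoop.fs.Path'

    conf = Configuration.new
    conf.set('fs.default.name', 'hdfs://namenode:9000')   # hypothetical namenode

    fs  = FileSystem.get(conf)
    out = fs.create(Path.new('/user/jeremy/hello.txt'))   # returns an FSDataOutputStream
    out.write_bytes("hello from jruby\n")
    out.close
    fs.close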
00:12:12.240 so next we'll talk about how to process lots of data so how many of you know of
00:12:18.079 mapreduce come on raise your hands all right all right all right how many of you use hadoop mapreduce
00:12:24.160 how many of you use something else what do you guys use
00:12:29.360 riak's mapreduce any others yes mapreduce okay
00:12:35.760 all right there's a there's a couple other um i can't remember their names but there's a couple other actually based on hadoop and other types of
00:12:42.320 things that are mapreduce infrastructures i think there was an old ruby one that was called skynet which i don't know if
00:12:48.000 anyone used um but the basic fundamental premise of mapreduce is you have embarrassingly parallel
00:12:55.519 problems so these are problems where you can take the input you can split it into completely autonomous chunks
00:13:03.440 take those chunks run a process on them map process you take your map and you have basically
00:13:09.920 tuples for each one of the map inputs you're going to do a process on it and you're going to get out your output
00:13:17.360 and then the reduce input is basically all those map results grouped up everyone that has a common key all the values for that common key will
00:13:24.000 be put together in a big group and then you'll run the reduce on it that takes all those individual keys and
00:13:29.440 those lists of values for that map and you know produces something else so it's a it's a trivially
00:13:35.040 it's embarrassingly parallel as opposed to something like weather simulation or uh things where every stage of the
00:13:41.760 system has to interact with a whole bunch of other different things so mapreduce is embarrassingly parallel so remember that when you talk to your
00:13:48.160 to your science guys.
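Before the Hadoop machinery, here is a tiny plain-Ruby sketch of the map, group-by-key, reduce flow he just described; it is purely conceptual (no Hadoop involved), and the word-count input is made up.

    # map: each input record becomes zero or more [key, value] tuples
    lines  = ["big data is big", "data moves fast"]
    mapped = lines.flat_map { |line| line.split.map { |word| [word, 1] } }

    # shuffle/group: every value that shares a key ends up together
    grouped = mapped.group_by(&:first)

    # reduce: fold each key's list of values into one output per key
    counts = grouped.map { |word, pairs| [word, pairs.size] }.to_h
    p counts   # => {"big"=>2, "data"=>2, "is"=>1, "moves"=>1, "fast"=>1}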
00:13:55.440 and the way this works in hadoop is you have your job tracker and you have a bunch of task trackers
00:14:01.440 which are sitting there generally in parallel with your data nodes we've got our little file down here with
00:14:06.959 all those different pieces you submit a job which includes the mapreduce stuff
00:14:12.320 and some other meta information about the job and then the job tracker says hey we're going to take this and we're
00:14:17.600 going to split it up across all the different data nodes each different data node and task node
00:14:23.600 will take it will process a piece of the file that's local to it so this is where hadoop comes in it says
00:14:30.160 oh you've got this really big file it's spread across all these different nodes uh it can take your your actual map and reduce process
00:14:38.160 take the map process and give it to each one of these local file blocks and run it locally so that you're
00:14:43.519 actually moving code to the data versus moving data to the code which is what we typically do in a lot of different
00:14:49.360 things so this is moving code to the data now how does jruby fit into this
00:14:55.360 well it's complicated, there's lots of ins and outs, lots of what-have-yous, you know, this and that, and this can
00:15:00.720 make you, you know, feel... so to see what it looks like, to understand a little bit why this is a bit of a problem, we have to look at how
00:15:07.360 in a general sense how job submission and job running works so in the generic sense when you're
00:15:12.800 submitting a job to hadoop for mapreduce purposes you build a job jar file
00:15:18.800 you submit the jar file to the tracker the job tracker gives the jar to each one of the task trackers
00:15:23.839 and then each one of those task trackers will do the appropriate map and reduce tasks by essentially looking up
00:15:29.199 the class and feeding the data through it and basically there's two points
00:15:34.240 where you actually have access to the entire system one of them is up here when you're packaging up your data
00:15:40.160 and the other one here which is basically a run time so these are the two areas where when you're using jruby in
00:15:45.920 relation to hadoop you have the potential for actually doing something with the mapreduce system now there are
00:15:51.440 a couple of things that are already out there one of them is radoop. if you google radoop right now you won't get
00:15:57.920 this project, there's something a little more recent which has taken the name radoop and it actually has nothing
00:16:03.519 to do with ruby. so in radoop, because you're doing
00:16:08.639 this, you basically have the radoop command line instead of the hadoop command line so that's actually
00:16:15.680 subverting the the bootstrapping process of submitting the job and then you have the the way that you
00:16:21.279 actually implement your map and reduce tasks is in ruby uh with inherit by inheriting
00:16:26.959 from ruby classes which are actually essentially java extensions. so you have the java
00:16:32.160 code that's shipped in the gem and then you inherit from those classes and it sort of bootstraps it that way
00:16:38.480 uh unfortunately it's a little old, it hasn't been updated in a while, so that's one aspect; in fact this
00:16:44.959 probably doesn't even have support from some of the more recent mapreduce apis so then we talk
00:16:50.880 about jruby on hadoop this is another basically follows the exact same approach you have joh instead of the
00:16:57.759 hadoop command line instead of implementing inheriting classes you actually just define methods
00:17:04.079 in you know say it was a top level standalone script you define setup define map define reduce
00:17:09.199 that kind of thing and then those get packaged up together and then at runtime because this again
00:17:15.520 uses, essentially, a java extension in the gem, and that part of the packaging makes sure that jruby and the jruby-on-hadoop
00:17:22.480 stuff is all shipped and pushed off at the same time you again have the exact same thing of a java
00:17:28.000 extension the problem with this is uh it is also it's been a while since it's been updated
00:17:33.600 uh and if you look through the code it actually only supports a small subset of possible inputs and
00:17:39.440 outputs um so then there's another one which is hadoop papyrus uh it uses papyrus we've got to see a
00:17:46.080 pattern here we've got you know packaging and runtime it's basically your interfaces into where jruby could fit in so we've got
00:17:52.799 papyrus and papyrus is actually built on top of j ruby on hadoop so it implements a sort of a dsl
00:18:01.120 style for doing mapreduce programs and again it also hasn't been updated since uh 2010. so
00:18:08.960 the problem is none of these are actually very useful for us as a programmer they're not they're not great
00:18:14.320 they're it actually you know sitting through and looking through all the source code is the only way i figured out how they actually work and actually how to use them
00:18:20.400 um so there's a couple other approaches where jruby can fit in. uh is anyone using cascading? no one? all
00:18:28.000 right oh one person using cascading i haven't actually used it but um this looks like a really a pretty decent possibility it's more of
00:18:34.160 a higher level language um but it's got full jruby support the folks at etsy have provided jruby
00:18:41.120 support for cascading it's a way of like plugging workflows together as a whole data processing api
00:18:46.559 um i've only known about cascading for a few weeks so this is just a reference for everyone
00:18:51.679 else and the other way to do hadoop mapreduce processing is with streaming and this is
00:18:58.799 the way hadoop provides a facility for you to say hey my map and reduce processes are actually just going to take data on
00:19:05.039 standard in and emitted on standard out so that's if you you could actually write your mapreduce programs in you
00:19:10.480 know in a shell or any other program, it doesn't actually have to be java.
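To make that concrete, here is a rough sketch of a streaming-style word count in plain Ruby, reading records on standard in and emitting tab-separated key/value pairs on standard out; the script names are made up, and the job invocation in the trailing comment is an assumption (the streaming jar's exact path depends on the Hadoop install).

    #!/usr/bin/env ruby
    # mapper.rb: one input record per line in, "key \t value" pairs out
    STDIN.each_line do |line|
      line.split.each { |word| puts "#{word}\t1" }
    end

    #!/usr/bin/env ruby
    # reducer.rb: streaming hands the reducer key-sorted lines on stdin
    counts = Hash.new(0)
    STDIN.each_line do |line|
      word, n = line.chomp.split("\t")
      counts[word] += n.to_i
    end
    counts.each { |word, n| puts "#{word}\t#{n}" }

    # submitted with something along the lines of:
    #   hadoop jar hadoop-streaming.jar -input /data/in -output /data/out \
    #     -mapper mapper.rb -reducer reducer.rb -file mapper.rb -file reducer.rb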
00:19:16.320 so of the folks that are using hadoop, how many of you use streaming? one, okay, one uses streaming. um there is
00:19:21.760 a performance hit on this from what i understand and uh mr flip who's part of info chimps
00:19:26.799 uh he has a project called wukong which helps encapsulate all the streaming you know inputs and outputs and putting things
00:19:32.640 all these other stuff together so these are some other approaches and then we have these three other gems um which are
00:19:39.039 starting to do a little bit on the front but actually i want something a bit different i would like to use the normal hadoop
00:19:45.360 command line you know it's there might as well it just needs a jar and you could submit it and everything works and i'd like to actually inherit
00:19:51.760 from the java classes uh that exists the standard way you would do mapreduce we have jruby
00:19:57.039 you could inherit from mapper write up write your map function inherit from reducer write your reduce function and
00:20:02.720 everything would be great unfortunately it's not written uh i'm gonna take a hand at it if i can
00:20:08.799 uh i'm gonna have to work it looks like uh there may be uh something you have to work with in jruby itself to make this
00:20:14.080 happen so i'm gonna try to help out you know the team and see what's gonna happen and we'll go from there but it there's
00:20:20.159 actually just a little tiny piece of uh of linking jruby and
00:20:25.280 java together uh with in particular case for hadoop and we may be able to just have full on
00:20:31.200 straight up jruby: write your application, write your mapreduce jobs in ruby and be done with it which is
00:20:37.200 really cool um who would like to see that yeah all right all right
00:20:42.240 so keep your eyes out see what happens so that's a little bit of the mapreduce so that's how jruby can fit into the
00:20:48.080 whole mapreduce ecosystem so we've got mapreduce we've got the hdfs
00:20:53.360 uh what are those actual files we're going to store in hdfs so what do you guys use the ones that
00:20:59.280 are doing what are your files that you have on your hdfs systems file formats anyway
00:21:07.919 no okay um so we have a lot of data to store um how many of you have heard of avro
00:21:15.840 i actually like this as a container file format. avro itself is not a hadoop project
00:21:21.919 but it's an apache project you have a very rich data structure think document it is sort of a json
00:21:27.919 based and it's a it's a binary format that's very compact it also has protocol buffer thrift like
00:21:35.200 you know network ability but the thing that i really like about
00:21:40.240 it is actually has a container file structure so you're able to to write your records
00:21:45.679 into an avro file in a very compact format there's no code generation involved
00:21:51.120 because the schema that is uh that's defined at the header of the file any client any library that needs to
00:21:57.840 open the file can just read the schema and they'll know the entire structure of the data so it's basically record level data with
00:22:03.120 the schema at the head of the file it's mapreduce friendly in terms of
00:22:08.240 different it's got sub blocks within the file and i'll get into that in a minute and it actually supports compression so
00:22:15.200 uh this you can use outside of out outside of hadoop period i mean it's a standalone
00:22:21.360 library for a container file structure for storing for storing data if you ever have anything that's
00:22:27.039 csv you should probably think about using avro instead another thing is it's language neutral
00:22:33.919 so the avro project itself has implementations in java c c plus
00:22:39.360 plus, c sharp, python, ruby, perl, i think there might even be a little one
00:22:45.360 but they're all completely language independent ways of uh doing this entire the avro library
00:22:51.520 and the container file structure so if you think about the container file structure again we've got our big
00:22:56.960 nine block file the way avro works is we'll just take a look at these uh first two blocks say they're stored on
00:23:03.840 two different data nodes and this is actually how avro takes
00:23:09.200 advantage of being if in addition to being a good container file structure on its own it can also provide
00:23:14.559 significant advantages when you're using it in hdfs um so if you think of these data nodes
00:23:22.000 uh if you think of blocks one and two as the beginning of the avro file, these dark and light stripes are sub
00:23:29.440 blocks within the file. so say we have that task that goes out, it's hitting node one, it needs to open
00:23:36.159 up the file and it needs to be able to go through and work on the data that's just local, but it needs to do it at a record delimiter.
00:23:42.400 uh, it doesn't... in this case for avro files it's not gonna do it at a record delimiter, it's gonna do it at a sub-block-level delimiter
00:23:47.440 so you're gonna it's gonna process along and it's going to process up to the end of the first file block
00:23:52.799 that's not on the local block so the amount of data that has to move between data nodes in hadoop is reduced
00:23:58.159 to just about you know probably less than just a few k and then task two can pick up there
00:24:03.360 from that point and go further on so this is the way that avro works out well in the hdfs system and works well with
00:24:10.000 the task trackers and stuff. if you're using something like a tarball or a straight csv file,
00:24:17.840 a csv file might not work too bad, but especially if you've got a tarball then there's problems with
00:24:24.320 that in terms of splitting the file into different chunks and a lot of people that are doing tarballs in in hadoop they'll actually make sure
00:24:30.880 that their tarball files are already pre-split so every file is already just one block long
00:24:37.679 which kind of defeats the entire purpose. so in terms of just avro as a whole, say i had 5,500 records, all of these
00:24:44.960 maybe around 2,500 bytes or something like that. if you just look at the straight json it's uh it's 12 megs
00:24:52.000 and it's a 2 meg tarball. if you convert it to an avro file it's
00:24:57.840 actually just 3.6 megs and that's with no compression, period. um when you think about it, these are json
00:25:03.919 records so all those keys that are in every single one are duplicated so every single one there's a lot of overhead in just the
00:25:09.279 keys in a json file same way there's a lot of overhead in xml when you've just got the tags so
00:25:14.559 you might look at like 80% of an xml file might just be the tags. well in this case it looks like about 60 to 70 percent of
00:25:22.240 the raw data is actually the json structure it's supposed to be a compact format to begin with but you still have
00:25:27.600 these keys that are in every single instance so just by converting to an avro file which puts the schema at the top of the file
00:25:34.240 and then every record is just binary, then you've reduced 70% of the data volume
00:25:40.880 just by stripping out essentially the keys from your key value pairs and then those individual sub those
00:25:48.159 individual sub blocks in the avro file are individually compressed so you can use snappy compression which
00:25:54.240 is a really nice compression because uh it like lzo are these new styles of
00:25:59.279 compression where it takes less time to read the compressed file off of disk and decompress it; that time right
00:26:06.960 there is actually shorter than if it was uncompressed and on disk. so there's a couple different algorithms, libsnappy, lzo; they don't have the best
00:26:13.760 compression ratios maybe only 50 something percent but you can read them off of disk and
00:26:18.960 decompress them faster than you could read off of disk the uncompressed data and it's pretty cool that way so avro
00:26:26.400 and jruby yes we have happiness with jruby specifically you could use the the jar i mean it's
00:26:32.159 just a jar that's it it's just a library you could also use the ruby implementation itself
00:26:37.279 so depending on the features that you need that i didn't talk about you may want to use the java or the ruby
00:26:42.640 if you just want some basic vanilla stuff so you know there's this is a really good spot for jruby and i would actually
00:26:50.000 even just mri or you know any ruby you want i would take a look at avro if you're storing stuff
00:26:55.279 in files on disk and you're accessing them via ruby i would take a look at avro in any case
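As a rough sketch of what that looks like, here is the Ruby avro gem writing and reading a container file (under JRuby you could just as well drive the Avro Java jar); the schema, record, and file name are made up for illustration, and the gem's API details may vary by version.

    require 'avro'   # gem install avro

    schema = Avro::Schema.parse(<<-JSON)
      { "type": "record", "name": "Tweet",
        "fields": [ { "name": "id",   "type": "long"   },
                    { "name": "text", "type": "string" } ] }
    JSON

    # write: the schema is stored once, in the file header
    File.open('tweets.avro', 'wb') do |file|
      writer = Avro::DataFile::Writer.new(file, Avro::IO::DatumWriter.new(schema), schema)
      writer << { 'id' => 1, 'text' => 'hello from avro' }
      writer.close
    end

    # read: no code generation, the reader picks the schema up from the header
    File.open('tweets.avro', 'rb') do |file|
      Avro::DataFile::Reader.new(file, Avro::IO::DatumReader.new).each { |record| p record }
    end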
00:27:01.520 um one other quick one i'm going to talk about is when you're dealing with lots of data distributed across a whole bunch of
00:27:06.799 machines you will need to be able to coordinate that there's a tool called zookeeper which is
00:27:12.080 uh essentially a set of servers that provide coordination services like group
00:27:17.760 membership so a bunch of people show up and say hey i'm available i can do work that kind of stuff distributed locks you've got a process
00:27:24.480 on one machine that needs to wait for another machine's processing to get done and so it can do some sort of locking
00:27:30.960 sequences so sort of if you uh to make things run in sequence sort of a queue
00:27:36.080 uh or if you want to generate numbers or something like that, so that everyone gets the same, ever-increasing numbers, uh
00:27:44.000 watches you want to say hey this is sort of like a distributed locks where i'm going to wait and watch for
00:27:49.279 something to happen and when it happens i'm going to get notified well that notification could be between machines in the short
00:27:55.360 and sweet zookeeper is a highly available file system it's basically a hierarchical structure of nodes um
00:28:02.240 call them z nodes uh and each node can contain other nodes or key value pairs
00:28:08.000 and that's pretty much the primitives that it has uh so there's interesting stuff you know when uh say i'm
00:28:14.559 p3, i'm going to register: i'm going to you know connect to the zookeeper server
00:28:19.840 cluster, basically cd into app one and register myself by putting a p3 node there, so anyone else who
00:28:26.000 needs, you know, app one services, they can go to zookeeper, they can essentially cd into that node
00:28:32.559 and then say oh list all of the available services for here and so somebody who's unavailable
00:28:38.000 he could just disappear, or somebody, another one, comes online, it can put a node there and appear. so there's a different way for doing
00:28:43.840 distributed coordination so zookeeper and jruby outlook is good
00:28:49.120 zookeeper has a wire protocol, that's how you're going to communicate with it, and twitter has probably the best
00:28:54.320 implementation i've seen of the zookeeper wire protocol so take a look at it if you need distributed coordination
00:29:00.480 a zookeeper itself is just a jar it stands by itself you'll need at least three servers running for uh for your quorum
00:29:06.720 to make sure everything looks good but if you need to do distributed coordination um it's probably a good thing to look at
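Here is a rough sketch of the group-membership pattern he describes, written against the zk gem, one of the Ruby ZooKeeper clients (the talk just points at Twitter's client); the server address, paths, and data are made up, and option names may differ between client versions.

    require 'zk'   # gem install zk

    zk = ZK.new('localhost:2181')

    begin
      zk.create('/app1')                  # parent node for the service group
    rescue ZK::Exceptions::NodeExists
      # some other worker already created it
    end

    # ephemeral node: it disappears automatically if this process dies
    zk.create('/app1/p3', 'host:port', :mode => :ephemeral)

    # any other client can now ask who is currently available
    p zk.children('/app1')   # => ["p3", ...]

    zk.close!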
00:29:14.480 so say you have all this data and we've been talking about it at the block level what if you need essentially record
00:29:21.279 level access so you need low latency access to record level data and in this case uh the hadoop project or the apache
00:29:28.480 project that works here is modeled on bigtable. so how many people use bigtable or something like it? you know there's hbase, cassandra
00:29:36.399 is somewhat like this and that might be the only implementations
00:29:41.520 that are sort of around um but hbase is essentially you can think of it as a really big grid with billions
00:29:46.960 of rows and millions of columns and and every cell is optional essentially um
00:29:52.640 so hbase itself builds upon everything that we've already seen here uh it uses hdfs friendly files so
00:29:58.559 they're all you know striped and appropriately so the different chunks of them can exist on different servers and the task managers can
00:30:04.640 can all talk to the appropriate data pieces and it uses a zookeeper to coordinate everything
00:30:10.880 so hbase and jruby this is actually where jruby shines in
00:30:23.520 terms of the hadoop ecosystem. it's at the top because hbase uh
00:30:23.520 basically uses everything else under the sun you can do mapreduce processing uh in hbase uh you uh basically you can
00:30:29.679 access it via a protocol um and actually one of the really interesting
00:30:34.880 things is hbase ships with jruby. so the hbase shell, or if you think of it as your postgres or mysql
00:30:41.360 shell, if you're going to say create table, all that kind of good stuff, select, do all these different things for hbase, that is actually a customized
00:30:48.240 version of irb which is which is pretty nice hbase itself if you wanted to communicate with it
00:30:54.399 uh from a network perspective uh it supports thrift, an http interface, and also avro rpc
00:31:00.960 uh and there's a couple of different interfaces for it also has the the pure java piece and there's a couple of
00:31:07.600 implementations. i actually implemented uh a jruby extension to talk
00:31:12.799 to hbase called ashby. it seems to work out mostly well.
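For the more vanilla route, here is a rough sketch of talking to HBase from JRuby straight through the stock Java client (the 0.90-era API current at the time); it assumes the HBase jars are on the classpath, and the table name 'tweets' and column family 'd' are made up.

    require 'java'   # run under JRuby with the hbase jars on the classpath

    java_import 'org.apache.hadoop.hbase.HBaseConfiguration'
    java_import 'org.apache.hadoop.hbase.client.HTable'
    java_import 'org.apache.hadoop.hbase.client.Put'
    java_import 'org.apache.hadoop.hbase.client.Get'

    conf  = HBaseConfiguration.create             # reads hbase-site.xml from the classpath
    table = HTable.new(conf, 'tweets')

    put = Put.new('row-1'.to_java_bytes)
    put.add('d'.to_java_bytes, 'text'.to_java_bytes, 'hello hbase'.to_java_bytes)
    table.put(put)

    result = table.get(Get.new('row-1'.to_java_bytes))
    value  = result.get_value('d'.to_java_bytes, 'text'.to_java_bytes)
    puts String.from_java_bytes(value)            # => "hello hbase"
    table.close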
00:31:21.360 and that's essentially the gist of it. i'll stop there for some questions, comments, anything? no?
00:31:29.279 too much data all right so this is where we all need to go talk about this and drink quite a bit
00:31:34.559 so yes
00:31:43.440 okay so dr dr nick is trolling i mean how do you know
00:31:52.559 yeah i mean if you want to call it it's data logistics i'd be like i've got data coming in i want to remove
00:31:58.000 data i want to keep it for so long a lot of that has to do with how you store it i mean if you're going to say as data is coming
00:32:04.480 in i'm going to spool it here i'm going to keep it here for an archive this is a location it's live data and
00:32:09.679 then for basically the data logistics you've got to have a process that says hey this is the time and place where data is
00:32:16.399 going to get removed so that the data meets certain criteria. most of the time it ages off because of time,
00:32:21.679 maybe you know or maybe say you're storing some sort of financial or medical or something like that you may
00:32:27.200 have a specific time frame, maybe it's seven years or eleven years for tax data or something like that. there's criteria that says this data can
00:32:33.279 be gotten rid of at a certain point in time so you'll basically have to decide most of the time it's a business decision um
00:32:41.919 yeah i mean you're gonna have to write code to do that to implement those rules yeah these different systems
00:32:48.240 um you'll probably for say you've got uh you know terabytes petabytes of data stored in hadoop
00:32:53.840 you're probably going to have either a file system structure that things are organized by date and you can just start removing
00:33:00.000 stuff or looking at the time stamps of files or maybe the data from the files
00:33:08.559 so i don't think we're quite dealing with the amount of data you were talking about we've got a whole bunch of things that we've been doing some calculations on
00:33:15.440 and we've been resisting uh the requests of one of our project managers to try and speed things up by
00:33:21.760 saving some of the frequently done calculations as like a cached version of the data somewhere because
00:33:28.000 what's a good guideline for figuring out when you might want to consider an option
00:33:33.279 would you or do you ever want to do that so the question is when should you consider essentially
00:33:38.480 caching complicated values uh for a period of time. calculated, computed values? yeah,
00:33:44.399 computed values um my rough one is when it doesn't satisfy the business needs
00:33:49.440 you know if you're saying okay i have this data processing that has to take place every day processes 24 hours worth of
00:33:55.600 data, or maybe call it a week's worth of data, and has to be done every hour. well, all the data that came in for the week minus
00:34:01.600 the last hour, you know, i will just cache that, and then have a rolling window of the data that pages off, that doesn't apply
00:34:07.120 to the calculation, and you can get rid of it. so yeah, if it can be calculated once, and
00:34:12.399 that output value is going to be used many many many times, why not cache it? that saves you cpu and it costs
00:34:18.960 you you know bytes of data yes yeah for those of us who are just starting to like
00:34:24.399 be interested in learning more about big data do you have any resources you would recommend for like podcasts sure for
00:34:32.320 for for folks just getting interested in the big data and probably hadoop in particular um if you can there's been a couple of
00:34:38.320 conferences this year called strata from o'reilly i don't know if a lot of their just sort of like an overview of
00:34:43.599 technologies and peoples and businesses that are involved in it those may or may not be available online
00:34:49.119 some of them probably are if you want to talk with hadoop in particular cloudera
00:34:54.399 probably has some podcasts and presentations and they actually have a project i think it's called
00:34:59.440 which is essentially a bootstrap of getting a hadoop cluster running in ec2 with a very minimal number of
00:35:05.680 steps. so the one thing is, getting a production system like this
00:35:11.440 put into place can take a while to understand where everything goes how it all fits together there's a lot of moving parts
00:35:17.280 if it was simpler that'd be great, but it's a complex problem and not necessarily a simple solution so
00:35:23.520 yeah that's your question yeah yeah actually i am friends with
00:35:29.359 folks at dell who are building a packaged hadoop solution. okay, so you can call up dell and order a hadoop system, you can
00:35:35.599 call up dell for a hadoop system. yeah uh there are probably other customers, other people that have you know package deals that you can bring
00:35:42.320 stuff up and
00:35:48.960 so have you had any experience with elastic mapreduce? i've had experience with elastic map
00:35:54.880 reduce? no, but i know it exists
00:36:05.040 and actually along those lines uh somebody just put together i think a 30,000 node amazon ec2 cluster
00:36:12.880 for a uh unnamed pharma pharma company to do a calculation and it was basically your super computer
00:36:18.400 that cost like seven thousand dollars an hour or something like that it's very cheap for the amount of calculations
00:36:24.320 well um
00:36:35.599 if you just want to use zookeeper stand alone it's actually pretty simple because it is just a jar um you can just download the zookeeper
00:36:42.480 jar, it comes with some shells. uh basically there's an /etc/zookeeper zoo.cfg, you set up the configuration;
00:36:48.720 the how-to online for zookeeper was pretty good. and you need at least three, otherwise...
00:36:53.839 because it's basically, i think, a paxos quorum, so you have to have at least three so they can vote uh and come up with the
00:36:59.359 right solution. um the one thing about zookeeper: at least the last time i checked on it,
00:37:05.440 it doesn't actually work well in ec2, because zookeeper has extremely tight requirements
00:37:11.200 on i/o latency for writing to disk, because it's always persistent; it always makes sure that everything you
00:37:16.880 do in zookeeper is consistent across the entire quorum. so sometimes your ec2 volumes,
00:37:23.040 your ebs volumes and stuff like that, they don't provide the minimum i/o requirements, so zookeeper can fail. but that may have
00:37:29.839 changed since last time i checked so yeah yes what's the
00:37:49.200 uh from internally using radoop or something like that? um i can't answer... hopefully i can answer
00:37:55.839 the question before the end of the year
00:38:16.839 offline
00:38:30.839 um yeah so the question is is this type of
00:38:37.040 processing mostly used for say offline batch processing or online real-time stuff um
00:38:42.960 depends on your definition of real time
00:39:00.400 for in general i would say uh hadoop in terms of mapreduce jobs are probably
00:39:05.839 not for real-time analysis they're going to be more for offline because just the overhead of spinning up the job and submitting it you know that you're talking seconds in
00:39:12.800 that range and then of course the data you're you're saying streaming to this you need to collect at least enough to
00:39:18.720 fill up a block otherwise it's not efficient um uh hbase you could probably do more
00:39:23.760 because uh all the stuff that it's committing it has you have lower latency access to that
00:39:29.200 data but you also need to know that that data is there uh if you want to do more stream processing of data actually storm which
00:39:35.359 just came out is probably worth looking at there's another one that i can't remember uh another one i can't remember
00:39:41.520 on top of my head but if you need to do if you're if you have a more higher oh flume flume is another one that might be
00:39:46.640 useful uh which is also which is done by cloudera um and those are probably going to be
00:39:51.920 more in terms of if you're going to be stream processing versus, call it offline batch, where offline is
00:39:58.160 maybe minutes, hours, depending on your data collection
00:40:04.000 more questions no all right go to the amp block party look
00:40:10.240 out more there