00:00:16.880
hi everybody i'm jeremy uh i'm going to talk to you a little bit about big data in jruby today
00:00:22.160
and this is more of an overview talk we're getting a couple of smaller details on uh on hadoop stuff when we get down
00:00:28.080
in there but initially this is just sort of an overview of what big data actually is and then
00:00:34.239
some tools and some procedures and stuff that you can go through to to help you if you have a big data problem so
00:00:40.320
first question i have is what is big data — who's got a definition, come on. [audience answer] okay, that is a
00:00:48.879
great definition nick of course has one uh
00:00:54.800
more than i can put on my time machine backup okay that's a good definition of big data um so yeah there's there's lots of
00:01:01.440
definitions for big data uh where there actually are some official definitions you know all right everybody read this
00:01:08.240
it's kind of dry you know — big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture
00:01:14.240
manage and process the data within a reasonable elapsed time — okay so what nick said yeah you know it doesn't
00:01:20.720
fit on a time machine backup there's also another one that gartner has of course they get involved in all
00:01:27.200
sorts of stuff like this um because it's big business and enterprise and all that kind of good stuff so increasing volume velocity and
00:01:34.159
variety of data um i personally you know these are yeah yeah they're dry
00:01:40.000
and nobody can remember them uh this is my definition — one of them — or who's had
00:01:45.920
this happen yeah yeah yeah yeah nagios low disk space you know there you go all that
00:01:51.520
kind of good stuff or the ever popular that data processing job the one that takes all day to run you know
00:01:57.759
it's got to crunch all that data yeah we need to do it over it's wrong and we need to run it every day
00:02:03.439
on yesterday's data by 8 a.m. so who has that problem okay yes yes and it takes longer than 24
00:02:10.640
hours to run exactly so i asked you guys and i've also i also did the chad fowler approach
00:02:17.040
and many other people's approaches what is your definition of big data and they kind of broke down into three different areas
00:02:25.440
that uh that i can kind of define one of them is that's a lot of data so i personally
00:02:31.440
like cj's which is if activerecord can't do it it's big data
00:02:36.959
yeah that's i think that's a good one that's a good one um you know uh jeg2 everyone knows james
00:02:42.720
edward gray awesome guy he says it's big data when i have to use a specialized
00:02:48.879
database so that's also a decent one um in terms of big data you know
00:02:54.879
different people have different volumes that they think are uh are quite a lot does anyone recognize this image
00:03:01.920
yeah yeah yeah tim you know it this was actually generated with ruby so how many of you know ara howard or
00:03:08.000
have heard of ara howard okay yes yes those of us in boulder are big fans of ara so years ago when
00:03:15.280
ara was working at the earth observation group this is basically
00:03:22.800
the lights of earth from space at night so these are you know highly populated areas and all that wonderful great stuff
00:03:29.120
um so they have all this satellite data raw satellite data and some of it cleaned a little bit and
00:03:34.480
i don't know all the processes they go through but someone decided they wanted to buy a lot of it and by a lot i mean
00:03:40.159
all of it which was 10 petabytes so who has 10 petabytes of information
00:03:46.879
oh all right all right we have one um in terms in terms of 10 petabytes of information
00:03:53.760
uh data i actually was looking at the backblaze you know they had this great blog post a little while ago about adding more space and doubling the
00:03:59.920
size of their servers and all this kind of great stuff — 10 petabytes i think is
00:04:05.200
two-thirds of the total amount of data storage that backblaze currently has in their data center
00:04:10.400
so everyone who backs up with backblaze across the entire world two-thirds of it was bought by one
00:04:15.920
company from a volume perspective uh so to get that data to the customer they had five tape robots
00:04:22.479
putting the data on tape for 30 days shipped it ups or fedex or you know
00:04:29.840
courier huh station wagon yes i'm about to use that quote you know which one i'm talking about
00:04:35.280
um and then it was $75,000 worth of tape to start with just to write it once and
00:04:40.960
then the customer spent 30 days reading it off of tape so who knows the andrew tanenbaum quote i'm about to use
00:04:47.919
okay all right there's two of you never underestimate the bandwidth of a station wagon going down the highway with a box of tapes in the back
00:04:54.160
okay the latency sucks i mean 60-day latency is not good but the
00:04:59.840
bandwidth is tremendous so if you feel like it somebody calculate how long it would take to do 10 petabytes of information
00:05:06.479
over your 50 megabit link at home — somebody run the numbers, i haven't done it myself yet
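(a back-of-the-envelope sketch, not from the talk, assuming a steady 50 megabits per second with zero overhead:)

    # rough math: 10 petabytes over a 50 Mbit/s home link
    bytes   = 10 * 10**15            # 10 PB, decimal petabytes
    bits    = bytes * 8
    seconds = bits / (50 * 10**6)    # 50 megabits per second
    years   = seconds / (60.0 * 60 * 24 * 365)
    puts years.round(1)              # => roughly 50 years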
00:05:11.680
so that's a large volume of data — it wasn't headed anywhere fast but that's
00:05:17.440
another thing um another definition that people kind of came up with when they were talking about when i asked the question of what is big data
00:05:23.600
they said a good chunk of data heads my way really really fast — charlie baker is he in the
00:05:29.360
room yes charlie okay he said nine terabytes of compressed data coming at me every
00:05:34.400
day not too bad — more data than fits in ram to process or yeah lots of data headed your way
00:05:42.560
fast for an example how many of you use twitter okay
00:05:48.240
do you realize that your tweet is more than just 140 characters — the actual volume of data that's in a tweet as a consumer of
00:05:55.120
someone's tweets is about 2,500 bytes which includes like the tags the location if you have it turned on a lot
00:06:00.880
of your author profile all sorts of lovely information so this is a while ago 155 million
00:06:06.160
tweets a day which is 16
00:06:12.160
gigabytes an hour or four megabytes a second
00:06:17.840
or 33.8 megabits per second — whoever the networking person in here is knows these volumes that's the majority of
00:06:24.240
an oc-1 line so that's data coming at you very very fast so how many people
00:06:30.160
have an oc-1 line at least into their data center or wherever they're getting data
00:06:35.680
there we go two yeah not bad so this is all from gnip which is a company that actually
00:06:41.120
processes twitter data and you can purchase it from them and all sorts of good stuff
00:06:46.639
or subscribe to it the other piece that people were talking about we've got large amounts of data coming at you very fast and then what do
00:06:52.720
you need to do with it — actually i like nuttycom's example here he says it's big data when O(n log n)
00:07:04.960
is too slow to process it which is kind of you know when you say hey when a really
00:07:04.960
fast algorithm can't do it then you know it's it's it's big data and one that i have to deal with on occasion
00:07:11.360
to add to this is how long do i need to keep this amount of data around so
00:07:16.479
i've got a lot of data coming at me very fast i have to do a lot of work with it and i
00:07:22.160
have to keep it around for a while maybe process it again later so my personal term for
00:07:29.440
big data is any data that makes me feel uncomfortable
00:07:38.479
or as you can also say how fast and how often can you boil the ocean — but never fear help is on the way
00:07:47.360
anyone know basic instructions yes okay very popular comic i like it a lot — the first help i'm gonna try to give
00:07:55.599
you is you don't need big data so a lot of people think you need big data
00:08:01.759
but you need to understand why you think you need big data uh daxus who is probably not quite here
00:08:07.680
yet has a great blog post — is he in the states
00:08:12.800
okay good he's in the states — he has a great blog post sort of a rant on the problem with big data
00:08:18.160
so there's a good chance that when you think you have a lot of data you actually don't need it
00:08:24.240
so based on your situation it's probably okay to throw a lot of it away for example say we have 155 million
00:08:32.000
things i'm not going to say what these things are but say you have 155 million of them
00:08:37.120
how many of those things do you need to make an analysis of the population as
00:08:42.560
a whole with these criteria — one percent error tolerance and 99% confidence
00:08:47.680
anyone want to take a wild guess 10,000 all right well you guys are
00:08:53.680
actually pretty close i was thinking people were going to guess a lot higher — you actually need 16,588 of them to make a
00:09:00.880
really good analysis of the entire population as a whole
00:09:06.000
so if you sample 16,588 things at 2,500 bytes apiece that's something like 34 megs of data
00:09:12.480
which is really easy to process in a reasonable amount of time so the one thing i'm going to say
00:09:17.920
about this is you need to understand your problem domain — there's a very good
00:09:22.959
chance that if you think you have a big data problem you don't you actually just need to sample it and process the sample
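for the curious, that 16,588-ish number falls out of the standard sample-size formula — a quick sketch, not from the slides, using a z-score of about 2.576 for 99% confidence and the worst-case proportion of 0.5:

    # sample size for a 1% margin of error at 99% confidence, worst-case p = 0.5
    z = 2.576                        # approximate z-score for 99% confidence
    e = 0.01                         # 1% margin of error
    p = 0.5                          # worst-case variability
    n = (z**2 * p * (1 - p)) / e**2
    puts n.ceil                      # => 16590 with this z, within a couple of the
                                     #    slide's 16,588; the finite population of
                                     #    155 million barely changes it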
00:09:29.760
so now there are cases so initially you don't need big data to
00:09:34.800
start with assume that but after you've decided that or you've made the decision that that assumption
00:09:40.560
doesn't work and you do need to process a large volume of data in a timely manner maybe
00:09:46.399
continually and store it for a long time there are some other tools out there to help you out and this is where we're
00:09:52.160
going to talk a bit about uh the big happy hadoop family so how many people have used any portion
00:09:59.200
of the hadoop systems at all that's not too bad so anyone feel free to tell me when i get something wrong
00:10:05.279
it's quite possible i will we're going to start with the base piece of the hadoop ecosystem and that's where
00:10:12.560
we're going to store a lot of data and also in a very fault tolerant and linearly scalable manner
00:10:18.560
according to the docs at least so this is hdfs or the hadoop distributed
00:10:24.480
file system so we're going to take for example a file
00:10:29.519
that's the file across the top — anyone remember the computer science way of showing a file as blocks you know all that good stuff
00:10:34.800
so we're going to talk about this file that's nine things wide you know nine blocks um in hadoop when we talk about block
00:10:41.120
level stuff everyone on your mac and your linux machines a block of a file is generally about 4k
00:10:46.320
uh in hadoop and hdfs a block is probably 64 or 128 megs so you need to have
00:10:54.160
larger files generally in here so this file is you know maybe a gig couple hundred meg something like that so the way
00:11:00.880
hdfs works to store a file in a you know distributed and fault tolerant manner is it basically takes
00:11:07.600
each block and duplicates it and spreads it across different nodes so you've got this file which is
00:11:14.000
basically a whole bunch of different blocks and it's going to split the blocks up and put them on
00:11:19.200
different nodes so each one of these different data nodes is an actual physical machine so you've
00:11:24.240
got your nine block file every block is stored three times across a whole bunch of different systems so you can have uh some of the
00:11:31.360
machines go down and you'll still have the full amount of your data
00:11:36.800
so the question is how do hdfs and jruby fit together
00:11:42.399
um not really — i mean hdfs is a very core system i
00:11:47.519
mean how many people actually write java to write directly to hdfs probably not as many as you think
00:11:52.959
so you could use say something like fuse or if you wanted to you could use the straight jar and talk to it from jruby
00:11:59.760
write a file into hdfs and be done with it on that front so mostly i wanted to talk about hdfs just
00:12:05.120
to give you a baseline of saying okay this is a fundamental infrastructure for the hadoop ecosystem
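as a rough sketch of that "just use the straight jar" idea — this assumes the hadoop client jars are already on jruby's classpath, and the namenode address is made up; the config key name also drifted between hadoop versions:

    require 'java'

    java_import 'org.apache.hadoop.conf.Configuration'
    java_import 'org.apache.hadoop.fs.FileSystem'
    java_import 'org.apache.hadoop.fs.Path'

    conf = Configuration.new
    # this key was fs.default.name in the 0.20/1.x era; newer hadoop calls it fs.defaultFS
    conf.set('fs.default.name', 'hdfs://namenode.example.com:8020')

    fs  = FileSystem.get(conf)
    out = fs.create(Path.new('/tmp/hello-from-jruby.txt'))
    out.write('hello, hdfs'.to_java_bytes)
    out.close
    fs.close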
00:12:12.240
so next we'll talk about how to process lots of data so how many of you know of
00:12:18.079
mapreduce come on raise your hands all right all right all right how many of you use hadoop mapreduce
00:12:24.160
how many of you use something else what do you guys use
00:12:29.360
riak's mapreduce any others yes mapreduce okay
00:12:35.760
all right there's a there's a couple other um i can't remember their names but there's a couple other actually based on hadoop and other types of
00:12:42.320
things that are mapreduce infrastructures i think there was an old ruby one that was called skynet which i don't know if
00:12:48.000
anyone used um but the basic fundamental premise of mapreduce is you have embarrassingly parallel
00:12:55.519
problems so these are problems where you can take the input you can split it into completely autonomous chunks
00:13:03.440
take those chunks and run a map process on them — you basically have
00:13:09.920
a tuple for each one of the map inputs you do a process on it and you get out your output
00:13:17.360
and then the reduce input is basically all those map results grouped up — everything that has a common key all the values for that common key will
00:13:24.000
be put together in a big group and then you'll run the reduce on it which takes each individual key and
00:13:29.440
the list of values for that key and you know produces something else so it's trivially
00:13:35.040
it's embarrassingly parallel as opposed to something like weather simulation or things where every stage of the
00:13:41.760
system has to interact with a whole bunch of other things so mapreduce is embarrassingly parallel — remember that when you talk to your scientist friends
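just to make that map → group-by-key → reduce shape concrete, here's a tiny word count in plain ruby — no hadoop involved, purely illustrative:

    lines = ['big data in jruby', 'big happy hadoop family']

    # map: each input produces (key, value) tuples
    tuples = lines.flat_map { |line| line.split.map { |word| [word, 1] } }

    # shuffle: group every value by its common key
    grouped = tuples.group_by(&:first)

    # reduce: each key and its list of values produce something else
    counts = grouped.map { |word, pairs| [word, pairs.map(&:last).inject(0, :+)] }

    p Hash[counts]  # => {"big"=>2, "data"=>1, "in"=>1, "jruby"=>1, "happy"=>1, "hadoop"=>1, "family"=>1}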
00:13:48.160
the way this works in hadoop is
00:14:01.440
you have your job tracker and you have a bunch of task trackers
00:14:01.440
which are sitting there generally in parallel with your data nodes we've got our little file down here with
00:14:06.959
all those different pieces you submit a job which includes the mapreduce stuff
00:14:12.320
and some other meta information about the job and then the job tracker says hey we're going to take this and we're
00:14:17.600
going to split it up across all the different data nodes each different data node and task node
00:14:23.600
will take it will process a piece of the file that's local to it so this is where hadoop comes in it says
00:14:30.160
oh you've got this really big file it's spread across all these different nodes — it can take your actual map and reduce process
00:14:38.160
take the map process and give it to each one of those file blocks and run it locally so that you're
00:14:43.519
actually moving code to the data versus moving data to the code which is what we typically do in a lot of different
00:14:49.360
things so this is moving code to the data now how does jruby fit into this
00:14:55.360
well it's complicated — there's lots of ins and outs lots of what-have-yous — and to
00:15:00.720
understand a little bit of why this is a bit of a problem we have to look at
00:15:07.360
in a general sense how job submission and job running works so in the generic sense when you're
00:15:12.800
submitting a job to hadoop for mapreduce purposes you build a job jar file
00:15:18.800
you submit the jar file to the tracker the job tracker gives the jar to each one of the task trackers
00:15:23.839
and then each one of those task trackers will do the appropriate map and reduce tasks by essentially looking up
00:15:29.199
the class and feeding the data through it and basically there's two points
00:15:34.240
where you actually have access to the entire system one of them is up here when you're packaging up your data
00:15:40.160
and the other one here which is basically a run time so these are the two areas where when you're using jruby in
00:15:45.920
relation to hadoop you have the potential for actually doing something with the mapreduce system now there are
00:15:51.440
a couple of things that are already out there — one of them is radoop if you google radoop right now you won't get
00:15:57.920
this project there's something a little more recent which has taken the name and it actually has nothing
00:16:03.519
to do with ruby so in radoop
00:16:08.639
you basically have the radoop command line instead of the hadoop command line so that's actually
00:16:15.680
subverting the bootstrapping process of submitting the job and then you have the way that you
00:16:21.279
actually implement your map and reduce tasks which is in ruby by inheriting
00:16:26.959
from ruby classes which are essentially java extensions — so you have these java shims the java
00:16:32.160
code that's shipped in the gem and then you inherit from those classes and it sort of bootstraps it that way
00:16:38.480
unfortunately it's a little old it hasn't been updated in a while so that's one aspect in fact this
00:16:44.959
probably doesn't even have support for some of the more recent mapreduce apis so then we talk
00:16:50.880
about jruby on hadoop this basically follows the exact same approach — you have joh instead of the
00:16:57.759
hadoop command line and instead of inheriting from classes you actually just define methods
00:17:04.079
in you know say it was a top level standalone script you define setup define map define reduce
00:17:09.199
that kind of thing and then those get packaged up together and then at runtime because this again
00:17:15.520
uses java shims — essentially a java extension in the gem — and that part of the packaging makes sure that jruby and the jruby-on-hadoop
00:17:22.480
stuff is all shipped and pushed off at the same time you again have the exact same thing of a java
00:17:28.000
extension the problem with this is it also hasn't been updated in a while
00:17:33.600
and if you look through the code it actually only supports a small subset of possible inputs and
00:17:39.440
outputs — so then there's another one which is hadoop papyrus and you've got to see a
00:17:46.080
pattern here we've got packaging and runtime those are basically your interfaces into where jruby could fit in so we've got
00:17:52.799
papyrus and papyrus is actually built on top of jruby-on-hadoop so it implements a sort of dsl
00:18:01.120
style for doing mapreduce programs and again it also hasn't been updated since uh 2010. so
00:18:08.960
the problem is none of these are actually very useful for us as programmers they're not great —
00:18:14.320
actually sitting down and looking through all the source code was the only way i figured out how they actually work and how to use them
00:18:20.400
um so there's a couple other approaches where jruby can fit in — is anyone using cascading no one all
00:18:28.000
right oh one person using cascading — i haven't actually used it but this looks like a pretty decent possibility it's more of
00:18:34.160
a higher level language um but it's got full jruby support the folks at etsy have provided jruby
00:18:41.120
support for cascading it's a way of like plugging workflows together as a whole data processing api
00:18:46.559
um i've only known about cascading for a few weeks so this is just a reference for everyone
00:18:51.679
else and the other way to do hadoop mapreduce processing is with streaming and this is
00:18:58.799
the way hadoop provides a facility for you to say hey my map and reduce processes are actually just going to take data on
00:19:05.039
standard in and emit it on standard out so you could actually write your mapreduce programs in you
00:19:10.480
know a shell script or any other language it doesn't actually have to be java — for the folks that are using hadoop how many
00:19:16.320
of you use streaming one okay one uses streaming
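for a flavor of what streaming looks like, here's a word-count mapper and reducer as a plain ruby sketch — it just reads stdin and writes tab-separated key/value lines the way hadoop streaming expects; in practice these would be two separate files, and the paths in the comment are made up:

    # mapper.rb — emits "word\t1" for every word on stdin
    STDIN.each_line do |line|
      line.split.each { |word| puts "#{word}\t1" }
    end

    # reducer.rb — streaming hands the reducer its input sorted by key,
    # so we just total up consecutive lines with the same word
    current, count = nil, 0
    STDIN.each_line do |line|
      word, n = line.chomp.split("\t")
      if word == current
        count += n.to_i
      else
        puts "#{current}\t#{count}" if current
        current, count = word, n.to_i
      end
    end
    puts "#{current}\t#{count}" if current

    # invoked roughly like (made-up paths):
    #   hadoop jar hadoop-streaming.jar -input /logs -output /counts \
    #     -mapper 'ruby mapper.rb' -reducer 'ruby reducer.rb' -file mapper.rb -file reducer.rb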
00:19:21.760
there is a performance hit on this from what i understand and mrflip who's part of infochimps
00:19:26.799
has a project called wukong which helps encapsulate all the streaming inputs and outputs and puts all
00:19:32.640
those pieces together so these are some other approaches and then we have those three gems which are
00:19:39.039
starting to do a little bit on this front but actually i want something a bit different — i would like to use the normal hadoop
00:19:45.360
command line you know it's there might as well use it it just needs a jar you submit it and everything works and i'd like to actually inherit
00:19:51.760
from the java classes that exist — the standard way you would do mapreduce with jruby
00:19:57.039
you could inherit from mapper write your map function inherit from reducer write your reduce function and
00:20:02.720
everything would be great unfortunately it's not written yet — i'm going to take a crack at it if i can
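purely hypothetically, this is the kind of api being wished for here — it is not a working gem, just a sketch of what subclassing the hadoop classes from jruby might look like if the jruby linkage issue mentioned below were solved:

    # hypothetical — does not run today; the jruby/java subclassing piece is the missing bit
    require 'java'

    java_import 'org.apache.hadoop.mapreduce.Mapper'
    java_import 'org.apache.hadoop.mapreduce.Reducer'
    java_import 'org.apache.hadoop.io.Text'
    java_import 'org.apache.hadoop.io.IntWritable'

    class WordCountMapper < Mapper
      def map(key, value, context)
        value.to_s.split.each do |word|
          context.write(Text.new(word), IntWritable.new(1))
        end
      end
    end

    class WordCountReducer < Reducer
      def reduce(key, values, context)
        total = values.inject(0) { |sum, v| sum + v.get }
        context.write(key, IntWritable.new(total))
      end
    end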
00:20:08.799
it looks like there may be something to work on in jruby itself to make this
00:20:14.080
happen so i'm gonna try to help out you know the team and see what's gonna happen and we'll go from there but it there's
00:20:20.159
actually just a little tiny piece of uh of linking jruby and
00:20:25.280
java together in this particular case for hadoop and we may be able to have full-on
00:20:31.200
straight-up jruby — write your application write your mapreduce jobs in ruby and be done with it — which is
00:20:37.200
really cool um who would like to see that yeah all right all right
00:20:42.240
so keep your eyes out see what happens so that's a little bit of the mapreduce so that's how jruby can fit into the
00:20:48.080
whole mapreduce ecosystem so we've got mapreduce we've got the hdfs
00:20:53.360
uh what are those actual files we're going to store in hdfs so for those of you
00:21:07.919
who are using it what file formats do you have on your hdfs systems anyway
00:21:07.919
no okay um so we have a lot of data to store — how many of you have heard of avro
00:21:15.840
i actually like this as a container file format — avro itself is not a hadoop project
00:21:21.919
but it's an apache project you have a very rich data structure think document it is sort of json
00:21:27.919
based and it's a binary format that's very compact it also has protocol-buffer- and thrift-like
00:21:35.200
you know network ability but the thing that i really like about
00:21:40.240
it is it actually has a container file structure so you're able to write your records
00:21:45.679
into an avro file in a very compact format there's no code generation involved
00:21:51.120
because the schema is defined at the header of the file — any client any library that needs to
00:21:57.840
open the file can just read the schema and they'll know the entire structure of the data so it's basically record level data with
00:22:03.120
the schema at the head of the file it's mapreduce friendly in that
00:22:08.240
it's got sub blocks within the file — i'll get into that in a minute — and it actually supports compression so
00:22:15.200
this you can use outside of hadoop period i mean it's a standalone
00:22:21.360
library for a container file structure for storing data — if you ever have anything that's
00:22:27.039
csv you should probably think about using avro instead another thing is it's language neutral
00:22:33.919
so the avro project itself has implementations in java c c++
00:22:39.360
c# python ruby perl i think there might even be another one
00:22:45.360
but they're all completely language independent implementations of the avro library
00:22:51.520
and the container file structure so if you think about the container file structure again we've got our big
00:22:56.960
nine block file the way avro works is we'll just take a look at these uh first two blocks say they're stored on
00:23:03.840
two different data nodes and this is actually how avro in
00:23:09.200
addition to being a good container file structure on its own can also provide
00:23:14.559
significant advantages when you're using it in hdfs so if you think of these data nodes
00:23:22.000
blocks one and two are the beginning of the avro file and these dark and light stripes are sub
00:23:29.440
blocks within the file so say we have a task that goes out it's hitting node one it needs to open
00:23:36.159
up the file and it needs to be able to go through and work on the data that's local but it needs to do it at a record delimiter —
00:23:42.400
except in this case for avro files it's not going to split at a record delimiter it's going to split at a sub-block delimiter
00:23:47.440
so it's going to process along up to the end of the sub block
00:23:52.799
that extends past the local file block so the amount of data that has to move between data nodes in hadoop is reduced
00:23:58.159
to probably less than a few k and then task two can pick up there
00:24:03.360
from that point and go further on so this is the way that avro works out well in the hdfs system and works well with
00:24:10.000
the task trackers and stuff if you're using something like a tarball or a straight csv file
00:24:10.000
a csv file might not work too badly but especially if you've got a tarball then there's problems with
00:24:17.840
splitting the file into different chunks and a lot of people that are doing tarballs in hadoop will actually make sure
00:24:24.320
that their tarball files are already pre-split so every file is already just one block long
00:24:30.880
which kind of defeats the entire purpose so in terms of just avro as a whole say i had 5,500 records all of these
00:24:44.960
maybe around 2,500 bytes or something like that — if you just look at the straight json it's 12 megs
00:24:52.000
and it's a 2 meg tarball if you convert it to an avro file it's
00:24:57.840
actually just 3.6 megs and that's with no compression period — when you think about it these are json
00:25:03.919
records so all those keys that are in every single one are duplicated so every single one there's a lot of overhead in just the
00:25:09.279
keys in a json file same way there's a lot of overhead in xml when you've just got the tags so
00:25:14.559
you might look at it like 80% of an xml file might just be the tags — well in this case it looks like about 60 to 70 percent of
00:25:22.240
the raw data is actually the json structure it's supposed to be a compact format to begin with but you still have
00:25:27.600
these keys that are in every single instance so just by converting to an avro file which puts the schema at the top of the file
00:25:34.240
and then every record is just binary — you've reduced 70% of the data volume
00:25:40.880
just by stripping out essentially the keys from your key value pairs and then those
00:25:48.159
individual sub blocks in the avro file are individually compressed so you can use snappy compression which
00:25:54.240
is a really nice compression because it like lzo is one of these new styles of
00:25:59.279
compression where it takes less time to read the file off of disk compressed and decompress it — that time right
00:26:06.960
there is actually shorter than if it was uncompressed on disk so there's a couple different algorithms libsnappy lzo they don't have the best
00:26:13.760
compression ratios maybe only 50 something percent but you can read them off of disk and
00:26:18.960
decompress them faster than you could read off of disk the uncompressed data and it's pretty cool that way so avro
00:26:26.400
and jruby — yes we have happiness — with jruby specifically you could use the jar i mean it's
00:26:32.159
just a jar that's it it's just a library you could also use the ruby implementation itself
00:26:37.279
so depending on the features that you need that i didn't talk about you may want to use the java or the ruby
00:26:42.640
if you just want some basic vanilla stuff so you know there's this is a really good spot for jruby and i would actually
00:26:50.000
even just mri or you know any ruby you want i would take a look at avro if you're storing stuff
00:26:55.279
in files on disk and you're accessing them via ruby i would take a look at avro in any case
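here's a rough sketch with the ruby avro gem — the schema and field names are made up, but the writer/reader shape is the container-file api as i understand it, so treat the details as an assumption:

    require 'rubygems'
    require 'avro'

    # a made-up record schema; the real one would describe your data
    schema = Avro::Schema.parse(<<-JSON)
      { "type": "record", "name": "Tweet",
        "fields": [ { "name": "user", "type": "string" },
                    { "name": "text", "type": "string" } ] }
    JSON

    # write: the schema goes in the file header, each record is compact binary
    file   = File.open('tweets.avro', 'wb')
    writer = Avro::DataFile::Writer.new(file, Avro::IO::DatumWriter.new(schema), schema)
    writer << { 'user' => 'jeremy', 'text' => 'big data in jruby' }
    writer.close   # flushes and closes the file

    # read: any client just reads the schema back out of the header
    reader = Avro::DataFile::Reader.new(File.open('tweets.avro', 'rb'), Avro::IO::DatumReader.new)
    reader.each { |record| p record }
    reader.close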
00:27:01.520
um one other quick one i'm going to talk about is when you're dealing with lots of data distributed across a whole bunch of
00:27:06.799
machines you will need to be able to coordinate that there's a tool called zookeeper which is
00:27:12.080
uh essentially a set of servers that provide coordination services like group
00:27:17.760
membership so a bunch of people show up and say hey i'm available i can do work that kind of stuff distributed locks you've got a process
00:27:24.480
on one machine that needs to wait for another machine's processing to get done and so it can do some sort of locking
00:27:30.960
sequences — sort of to make things run in sequence sort of a queue
00:27:36.080
or if you want to generate numbers or something like that so that everyone sees the same increasing sequence of numbers
00:27:44.000
watches — this is sort of like the distributed locks where i'm going to wait and watch for
00:27:49.279
something to happen and when it happens i'm going to get notified and that notification can be between machines — the short
00:27:55.360
and sweet of it is zookeeper is like a highly available file system it's basically a hierarchical structure of nodes
00:28:02.240
call them z nodes uh and each node can contain other nodes or key value pairs
00:28:08.000
and that's pretty much the primitives that it has so there's interesting stuff you can do — say i'm
00:28:14.559
process p3 i'm going to register i'm going to connect to the zookeeper server
00:28:19.840
cluster basically cd into app1 and register myself by putting a p3 node there so anyone else who
00:28:26.000
needs app1's services can go to zookeeper essentially cd into that node
00:28:32.559
and then say oh list all of the available services here — and if somebody becomes unavailable
00:28:38.000
they just disappear or if another one comes online it can put a node there and appear so it's a nice way of doing
00:28:43.840
distributed coordination so zookeeper and jruby outlook is good
00:28:49.120
zookeeper speaks a wire protocol that's how you're going to communicate with it and twitter has probably the best
00:28:54.320
implementation i've seen of the zookeeper wire protocol so take a look at that if you need distributed coordination
00:29:00.480
zookeeper itself is just a jar it stands by itself you'll need at least three servers running for your quorum
00:29:06.720
to make sure everything looks good but if you need to do distributed coordination it's probably a good thing to look at
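as a sketch of what that looks like from ruby — this uses the hash-style api of the zookeeper gem as i remember it, so treat the exact calls as an assumption; the hostnames and paths are made up:

    require 'rubygems'
    require 'zookeeper'

    zk = Zookeeper.new('zk1.example.com:2181')   # made-up host

    # group membership: register myself as an ephemeral node under /app1;
    # it disappears automatically if this process goes away
    zk.create(:path => '/app1', :data => '')
    zk.create(:path => '/app1/p3', :data => 'host:port', :ephemeral => true)

    # anyone who needs app1 services can "cd" into the node and list it
    result = zk.get_children(:path => '/app1')
    p result[:children]    # => ["p3", ...]

    zk.close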
00:29:14.480
so say you have all this data and we've been talking about it at the block level what if you need essentially record
00:29:21.279
level access so you need low latency access to record level data — and in this case the hadoop slash apache
00:29:28.480
project here is hbase which follows google's bigtable model so how many people use bigtable or something like it — there's hbase cassandra
00:29:36.399
is somewhat like this — and those might be the only implementations
00:29:41.520
that are sort of around um but hbase is essentially you can think of it as a really big grid with billions
00:29:46.960
of rows and millions of columns and and every cell is optional essentially um
00:29:52.640
so hbase itself builds upon everything that we've already seen here uh it uses hdfs friendly files so
00:29:58.559
they're all striped appropriately so the different chunks of them can exist on different servers and the task trackers
00:30:04.640
can all talk to the appropriate data pieces and it uses zookeeper to coordinate everything
00:30:10.880
so hbase and jruby this is actually where jruby shines in
00:30:17.039
terms of the hadoop ecosystem — it's at the top because hbase
00:30:23.520
basically uses everything else under the sun you can do mapreduce processing in hbase and basically you can
00:30:29.679
access it via a protocol — and actually one of the really interesting
00:30:34.880
things is hbase ships with jruby — the hbase shell or if you think of it as your postgres or mysql
00:30:41.360
shell where you say create table select all these different things for hbase that is actually a customized
00:30:48.240
version of irb which is pretty nice — hbase itself if you wanted to communicate with it
00:30:54.399
from a network perspective it supports thrift and also avro rpc
00:31:00.960
and there's a couple of different interfaces for it — it also has the pure java piece and there's a couple of
00:31:07.600
implementations out there i actually implemented a jruby extension to talk
00:31:12.799
to hbase called ashby and it seems to work out mostly well
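and since hbase ships the java client in its jars anyway, talking to it from jruby directly looks roughly like this — a sketch against the old HTable-style client api of that era, with a made-up table and column family:

    require 'java'

    java_import 'org.apache.hadoop.hbase.HBaseConfiguration'
    java_import 'org.apache.hadoop.hbase.client.HTable'
    java_import 'org.apache.hadoop.hbase.client.Put'
    java_import 'org.apache.hadoop.hbase.client.Get'
    java_import 'org.apache.hadoop.hbase.util.Bytes'

    conf  = HBaseConfiguration.create
    table = HTable.new(conf, 'tweets')                     # made-up table name

    # write one cell: row key, column family 'd', qualifier 'text'
    put = Put.new(Bytes.to_bytes('row-1'))
    put.add(Bytes.to_bytes('d'), Bytes.to_bytes('text'), Bytes.to_bytes('hello hbase'))
    table.put(put)

    # read it back
    get    = Get.new(Bytes.to_bytes('row-1'))
    result = table.get(get)
    puts Bytes.to_string(result.get_value(Bytes.to_bytes('d'), Bytes.to_bytes('text')))

    table.close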
00:31:21.360
and that's essentially the gist of it — i'll stop there and take some questions comments anything no
00:31:29.279
too much data all right so this is where we all need to go talk about this and drink quite a bit
00:31:34.559
so yes
00:31:43.440
okay so dr. nick is trolling — i mean how do you know [when you can get rid of data]
00:31:52.559
yeah i mean if you want to call it it's data logistics i'd be like i've got data coming in i want to remove
00:31:58.000
data i want to keep it for so long a lot of that has to do with how you store it i mean if you're going to say as data is coming
00:32:04.480
in i'm going to spool it here i'm going to keep it here for an archive this is a location it's live data and
00:32:09.679
then for basically the data logistics you've got to have a process that says hey this is the time and place where data is
00:32:16.399
going to get removed once the data meets certain criteria — most of the time it just ages off because of time
00:32:21.679
maybe you know or maybe say you're storing some sort of financial or medical or something like that you may
00:32:27.200
have a specific time frame maybe it's seven years or eleven years or tax data or something like that there's criteria that says this data can
00:32:33.279
be gotten rid of at a certain point in time so you'll basically have to decide most of the time it's a business decision um
00:32:41.919
yeah i mean you're gonna have to write code to do that to implement those rules yeah these different systems
00:32:48.240
um you'll probably for say you've got uh you know terabytes petabytes of data stored in hadoop
00:32:53.840
you're probably going to have either a file system structure that things are organized by date and you can just start removing
00:33:00.000
stuff or looking at the time stamps of files or maybe the data from the files
00:33:08.559
so i don't think we're quite dealing with the amount of data you were talking about we've got a whole bunch of things that we've been doing some calculations on
00:33:15.440
and we've been resisting uh the requests of one of our project managers to try and speed things up by
00:33:21.760
saving some of the frequently done calculations as like a cached version of the data somewhere because
00:33:28.000
what's a good guideline for figuring out when you might want to consider an option
00:33:33.279
would you or do you ever want to do that so the question is when should you consider essentially
00:33:38.480
caching complicated values for a period of time — computed values yeah
00:33:44.399
computed values — my rough answer is when recomputing it every time doesn't satisfy the business needs
00:33:49.440
you know if you're saying okay i have this data processing that has to take place every day processes 24 hours worth of
00:33:55.600
data or maybe call it a week's worth of data and has to be done every hour — well all the data that came in for the week minus
00:34:01.600
the last hour i would just cache that and then have a rolling window where the data that doesn't apply
00:34:07.120
to the calculation pages off and you can get rid of it so yeah if you can calculate it once and
00:34:12.399
that output value is going to be used many many times why not cache it — that saves you cpu and it costs
00:34:18.960
you you know bytes of data yes yeah for those of us who are just starting to like
00:34:24.399
be interested in learning more about big data do you have any resources you would recommend for like podcasts sure for
00:34:32.320
folks just getting interested in big data and probably hadoop in particular — if you can there's been a couple of
00:34:32.320
conferences this year called strata from o'reilly which are just sort of an overview of
00:34:38.320
technologies and people and businesses that are involved in it those may or may not be available online
00:34:43.599
some of them probably are — for hadoop in particular cloudera
00:34:49.119
probably has some podcasts and presentations and they actually have a project — the name escapes me —
00:34:54.399
which is essentially a bootstrap for getting a hadoop cluster running in ec2 with a very minimal number of
00:34:59.440
steps so the one thing is getting a production system like this
00:35:11.440
put into place can take a while to understand where everything goes how it all fits together there's a lot of moving parts
00:35:17.280
if it was simpler that'd be great but it's a complex problem and not necessarily a simple solution so
00:35:23.520
yeah does that answer your question — yeah yeah actually i am friends with
00:35:29.359
folks at dell who are building a packaged hadoop solution — okay so you can call up dell and order a hadoop system so you can
00:35:35.599
call up dell for a hadoop system — yeah and there are probably other vendors other people that have you know packaged deals where you can bring
00:35:42.320
stuff up and
00:35:48.960
so have you had any experience with elastic mapreduce — have i had experience with elastic map
00:35:54.880
reduce no but i know it exists
00:36:05.040
and actually along those lines somebody just put together i think a 30,000 node amazon ec2 cluster
00:36:12.880
for an unnamed pharma company to do a calculation and it was basically your own supercomputer
00:36:18.400
that cost like seven thousand dollars an hour or something like that it's very cheap for the amount of calculations
00:36:24.320
well um
00:36:35.599
if you just want to use zookeeper standalone it's actually pretty simple because it is just a jar you can just download the zookeeper
00:36:42.480
jar it comes with some shell scripts and basically there's an etc/zookeeper zoo.cfg where you set up the configuration —
00:36:48.720
the how-to online for zookeeper is pretty good — and you need at least three servers otherwise
00:36:53.839
because it's basically — i think it's a paxos quorum — you have to have at least three so they can vote and come up with the
00:36:59.359
right answer — the one thing about zookeeper is that at least the last time i checked on it
00:37:05.440
it doesn't actually work well in ec2 because zookeeper has extremely tight requirements
00:37:11.200
on i/o latency for writing to disk because it's always persistent it always makes sure that everything you
00:37:16.880
do in zookeeper is consistent across the entire quorum so sometimes your ec2 volumes
00:37:23.040
your ebs volumes and stuff like that don't provide the minimum i/o requirements so zookeeper can fail but that may have
00:37:29.839
changed since last time i checked so yeah yes what's the
00:37:49.200
uh — from internally using radoop or something like that — um i can't answer that hopefully i can answer
00:37:55.839
the question before the end of the year
00:38:16.839
offline
00:38:30.839
um yeah so the question is is this type of
00:38:37.040
processing mostly used for say offline batch processing or online real-time stuff um
00:38:42.960
depends on your definition of real time
00:39:00.400
in general i would say hadoop in terms of mapreduce jobs is probably
00:39:05.839
not for real-time analysis it's going to be more for offline because just the overhead of spinning up the job and submitting it — you're talking seconds in
00:39:12.800
that range and then of course for the data you're streaming into this you need to collect at least enough to
00:39:18.720
fill up a block otherwise it's not efficient um uh hbase you could probably do more
00:39:23.760
because with all the stuff that it's committing you have lower latency access to that
00:39:29.200
data but you also need to know that that data is there uh if you want to do more stream processing of data actually storm which
00:39:35.359
just came out is probably worth looking at — there's another one that i can't remember
00:39:41.520
off the top of my head but if you have a more — oh flume flume is another one that might be
00:39:46.640
useful which is also done by cloudera and those are probably going to be
00:39:51.920
more for if you're going to be stream processing versus more call it offline batch where offline is
00:39:58.160
maybe minutes or hours depending on your data collection
00:40:04.000
more questions no all right go to the amp block party look
00:40:10.240
out more there