00:00:18.240
coming so before I get started I want to just uh ask a quick question which is how many people in the audience right
00:00:24.320
now either already use big data or want to learn how to use big data so they can put it on their resume so they can get
00:00:30.320
three times more recruiter emails exactly Perfect all right that's
00:00:35.399
pretty much what I'm going to talk about today how to optimize your recruiter emails uh building data driven products
00:00:41.360
using Ruby so you're probably wondering who is this guy why should I listen to him
00:00:47.039
right he looks like he's about 12 years old his voice hasn't even broken yet so I studied uh computer science and
00:00:53.239
bioinformatics at UCSD before I eventually uh dropped out so I could kind of join the startup scene in San
00:00:58.320
Francisco and right now I'm currently a data scientist at Sharethrough which basically means I'm an engineer who also
00:01:04.360
happens to be good at math and since I live in Silicon Valley they decided to call me a data scientist the company I work for Share-
00:01:11.439
through is a native video advertising platform what that means is we consume
00:01:17.520
large amounts of data about users all over the web so that we can customize ad
00:01:22.560
experiences to hopefully make ads as a whole suck less right which is good because it means I basically get paid to
00:01:29.479
use data to improve my business's bottom line so really you want to listen to me because somebody pays me to do
00:01:35.520
this so I must kind of know what I'm talking about at least so my goal of this talk is to help
00:01:42.040
you answer the following four questions what is a data driven product what does the development cycle look like for a
00:01:47.560
data driven product where does Ruby fit in this new world of data science and how can Ruby be improved to stay
00:01:54.040
relevant in the age of big data but I just want to let you know I'm not going to talk about whether you should use
00:02:00.360
like support vector machines or some new type of regression or whether you want to use principal component analysis this
00:02:06.360
is really about talking about how Ruby fits in and how we build data-driven products as a
00:02:12.080
whole so before I get started I want to give you a couple warnings Ruby is not your only option
00:02:19.440
right the world of big data right now is pretty much a minefield you have so many things to choose from right we have Hive
00:02:25.319
Pig there's R Scala Cascading Python Java right all these tools in this
00:02:30.680
ecosystem right now and really you're not going to use just one of them so in my day-to-day job I actually use a
00:02:36.800
combination of Ruby Python Java R and even a little bit of Scala just to do what I do on a day-to-day basis right
00:02:43.360
the key is it's all about picking the right tool for the right job but since this is a ruby conference I'm really
00:02:48.640
going to talk about the ways I see Ruby fitting in and the places in the data driven product cycle where Ruby really
00:02:55.280
is a good fit so now we've got those warnings out of the way we'll first start with
00:03:01.280
what is a data driven product right a data driven product is really anything
00:03:06.480
that uses data to improve the bottom line of your business right so it could be a standalone product right where the
00:03:12.000
whole company just does data some examples of these are like Boundary Mixpanel right Google to a certain extent
00:03:18.959
but I think it's more interesting to look at the ways you can incorporate like particular data driven products
00:03:24.319
into a larger company offering right examples might be like ad targeting right uh product recommendations if
00:03:30.439
you're Amazon.com right which books you bought uh information aggregation and filtering so if you go on a news website
00:03:36.920
which articles you're likely to want to watch right we can see some examples of these all over the web at the moment
00:03:42.280
right we have GitHub the classic example right this kind of seems simple and
00:03:48.040
trivial when you really think about it what GitHub is doing is they're giving you data and they're letting you use data to understand how you interact with
00:03:54.680
their product and in understanding how you interact with their product they know you're going to become like more
00:03:59.920
attached to the platform as a whole right so using this simple graph I can do things like know when my engineering
00:04:05.680
team is most productive right and if I know when my engineering team is most productive I can make sure I don't schedule meetings during those times at
00:04:13.040
the same time I can check to make sure I'm not burning my engineers out making sure like they're not committing at 1:00
00:04:18.440
in the morning every single Saturday because nobody really wants that right the other classic example is
00:04:24.960
LinkedIn's People You May Know right so social networks realized a long time ago that a user's interaction with their
00:04:30.840
network was highly coupled to how dense their social network was on that given network so in order to help improve that
00:04:37.800
what LinkedIn did is they kind of pioneered this notion of People You May Know right so they use algorithms and they
00:04:43.360
use data based on how you interact with a network in order to target people they think you're likely to want to get to
00:04:51.240
know in doing this they're able to help improve engagement and overall improve the value of their product right because
00:04:57.039
LinkedIn makes money off advertising if users are on their site interacting with their friends they get more ad
00:05:02.080
impressions which means ultimately they make more money and the classic example is
00:05:07.479
advertising right like Google AdWords is probably one of the ultimate data driven products right they use data from your
00:05:13.800
search history they use the actual search term itself all to target advertising to you right and the cool
00:05:19.600
thing is by using this data they're able to create value for Google as a whole which lets them deliver the awesome uh
00:05:27.000
search results that we're so used to on a daily basis right using this data product they're able to make money for
00:05:33.560
Google but it doesn't have to be a product in the traditional sense another great thing you can do with data is
00:05:39.680
improve marketing for your company as a whole right this is a great example Facebook released this graphic back in
00:05:45.240
December of 2010 and they got tons of pickup right all the tech blogs covered it it was on TechCrunch Mashable and like
00:05:50.960
a bunch of people used it as their screen saver right just a simple product like this you wouldn't normally think of it as a product but when you kind of take
00:05:57.440
that step back what it's doing is it's giving Facebook a way to reach users right and if users see how connected
00:06:03.919
Facebook is as a whole they're more likely to want to join the network and most importantly Brands now realize how connected Facebook is because if a brand
00:06:10.680
manager sees this they're going to say like hey maybe I should spend some ad revenue like some of my ad budget on
00:06:17.160
Facebook so that's great you've told me exactly what a data driven product is filled it with lots of buzzwords but I'm
00:06:22.599
an engineer so I want to know how do I actually go about building something so I think building a data
00:06:28.280
driven product kind of comes down to this cycle which has four major steps right you start off with asking the
00:06:33.840
right question there's collecting and cleaning your data then you move on to building the predictive model and
00:06:39.560
finally you get to like publishing your results phase right the important thing to notice about this is it's not a
00:06:44.680
linear flow right it's not waterfall building data driven products is just like building every other product that
00:06:49.880
we're so used to building right like rails web apps we all do the agile thing right it's very much the same with data
00:06:55.879
but it's very hard to get out of that traditional like waterfall straight down model right like most people in a
00:07:01.000
traditional kind of enterprise background doing data stuff using traditional like business intelligence tools Teradata that kind of thing you
00:07:07.599
get in this kind of waterfall mode right you just like give me a question I'm going to answer it same technology all the time but I think it's important to
00:07:13.800
realize that to build good data products you need this cycle right because you might finally get to the published results phase like you print out this
00:07:20.080
graph and then the business says okay that's cool but we've kind of pivoted since then so your graph really has no
00:07:25.520
value now so you need to go back and do it again right same thing happens like you've built the Ultimate model tons of
00:07:31.199
work and then somebody in business development signs a new deal with a third party and all of a sudden you need
00:07:36.680
to integrate that new third-party data source because hey the business needs it we need to put it on our website we need
00:07:41.800
you to put it in your model all of a sudden you're going to have to go from that whole building a model phase back to cleaning and collecting data because
00:07:47.919
until you've done that you can't truly build the model so now that we have kind of the four main phases outlined we're
00:07:54.360
going to like take a step in and look at how we do each one
00:08:00.800
so the first phase is all about asking the right question right this seems really simple seems kind of trivial
00:08:06.120
you're probably wondering why I'm even talking about this right this is a Tech conference I said I was going to talk about Ruby but this is actually one of
00:08:11.400
the hardest phases right if you don't ask the right question no matter how awesome your Tech stack is no matter
00:08:17.039
what technology you pick it won't matter right if you're not answering a question that really helps the business you're
00:08:22.960
not delivering value and the whole point of like engineering and product is delivering value to your business right
00:08:28.520
which means you really need to focus on asking the right question and conveniently the only thing you need to
00:08:33.599
do that is English right you don't need Ruby you don't need python you don't need Java you don't need Hadoop to help
00:08:38.719
ask the right question what you need to do is you need to go out and you need to talk to the business right you need business context you need to know like
00:08:45.200
what makes a business run like what are Partnerships looking like what does the market look like as a whole right all these things are traditionally the
00:08:51.200
hardest things for engineers to do right we actually have to go out and talk to people which is quite challenging I don't personally like doing that that
00:08:56.800
much but we have to do them so that we can kind of get that first phase done so we can use Ruby to kind of help with
00:09:03.360
that do some exploratory stats but really at the end of the day that whole first phase you don't need any programming technology you just need to
00:09:09.000
talk to people so for a personal example right I kind of want to guide this whole process
00:09:15.519
through what I do on a day-to-day basis so the first example that I really have is the marketing department in the
00:09:21.680
business comes to me and they say okay Ryan I want a data dump of the percent of users on publisher X that I've also seen
00:09:27.480
on publisher Y right so I want to see how many users I've seen on Forbes that I later saw on The Awl or Business
00:09:34.720
Insider right I could do that that's simple that's like a very easy question but the problem is what value does that
00:09:40.920
really give them right a data dump is very simple so you kind of you take that step back which is the next box down
00:09:46.519
which is the thing they're really trying to ask the question they really want to know is what is the value of a user on an ad network right because if we can
00:09:53.399
determine the value of a user on an ad network we can better predict our revenue as a whole and we can better
00:09:59.680
gauge how much to charge an advertiser for each impression right but that's almost too big of a question right like
00:10:04.760
now I found this huge theme what do I actually do with that I can't really answer that so you maybe
00:10:10.079
take that one step down right which is in this case what is the supply of a user of a given type right so given a
00:10:16.839
user has been seen on publisher A what's the supply of users who will also be seen on publisher B right and most
00:10:23.800
importantly can we predict that given we've seen a user on one publisher that we'll see a user on another publisher
00:10:30.399
and that's kind of the question we chose to work on so once we have that question formulated we get down to the real code writing
00:10:36.480
phase which is phase two which is data collection and
00:10:41.959
cleaning so this is not very glamorous but you will spend 90% of your time doing data collection and cleaning right
00:10:47.560
no matter what anyone tells you right like in computer science class they're always like focus on math focus
00:10:53.279
on stats that's all great you're going to spend 90% of your time cleaning data data in the real world is very very
00:11:00.200
messy and the thing that really makes a difference between a good data product a good engineer a good data scientist and
00:11:06.279
someone who isn't very good is your ability to deal with and clean data so for example you would start off
00:11:13.320
with something like this right this is what my logs look like that I do most of my analysis on right you can kind of
00:11:19.800
read them but you can already see like there's some missing values we have the intentionally blank HTTP referer that's
00:11:25.880
awesome and the whole thing is just kind of messy right and what I need to do is I need to take this massive data I need to
00:11:32.160
output something that looks like this because I need this two-column CSV so I
00:11:37.839
can input it into my graph algorithms to actually determine what percentage of users are going to cross over to build the
00:11:43.279
product that the business really needs which is a predictive model for how many users are going to be seen across our
00:11:48.720
network right but for you guys how do you get your data right where does data come
00:11:54.519
from right now in the social web most data comes from these four sources right
00:12:01.079
we have server logs right your front-end boxes Rails boxes nginx boxes they're all producing tons and tons of logs you
00:12:07.880
have that you have third-party APIs right everyone loves to collect Twitter data right now everyone loves to collect
00:12:13.040
Facebook data you have web scraping right maybe they don't have an API or it's just a page you want to get some
00:12:19.040
information from so you're going to go out there and you're going to scrape that data and finally we have direct user input right you have like a
00:12:25.160
questionnaire you have a survey something where the user's directly giving you data the important thing to know about these
00:12:31.399
four sources that's very different from what we're used to is they all require programming skills right like none of this data is just conveniently handed to
00:12:39.120
you right like everything requires you to go out and write code to get it right and it could be even worse what happens
00:12:44.839
if someone gives you a PDF right they're like hey I gave you data you're like no you gave me a PDF I can't do anything with that but ultimately you're going to
00:12:51.199
have to pull data from all these sources in order to even start building a real product that helps your
00:12:56.720
business and this is really where Ruby comes in right Ruby has tons of tools to
00:13:02.600
make it possible for you to collect and clean data right this is just a sample of the tools I use on a day-to-day basis
00:13:07.880
right we have Nokogiri if you want to parse XML parse HTML right if you're doing any sort of scraping you're going
00:13:13.199
to spend tons of time using Nokogiri and even if you're dealing with old APIs that still use XML you're going to spend
00:13:18.440
tons of time using Nokogiri we have Savon which everyone loves right it's a SOAP client if you're
00:13:24.079
having to use an API that was built in the '90s and they haven't updated it to REST and you have to use SOAP Savon will make it a lot easier right we
00:13:30.480
have rest-client which makes it really easy to make HTTP requests because you're going to be making a lot of HTTP requests if you want to cross-reference
00:13:36.800
data sources right pretty much every API right now is HTTP-based especially if it's modern using REST so that will come
00:13:43.120
in very handy right we have pdf-reader because you will most likely have somebody come to you and want you to do
00:13:49.639
some analysis on a PDF like hey I have tons of data in a PDF I need you to extract it right Ruby makes that
00:13:56.759
possible and then finally we have Sinatra right like Sinatra is a great way if you just want to like quickly set
00:14:02.040
up a survey right like I could write a survey that I can put up on Amazon Mechanical Turk to like ask somebody their
00:14:07.079
opinion on the election in probably 10 minutes right
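Going back to the scraping side of that list, a minimal sketch using rest-client and Nokogiri might look like this — the URL and the CSS selector are hypothetical stand-ins for whatever page and markup you're actually after:

    require 'rest-client'   # gem install rest-client
    require 'nokogiri'      # gem install nokogiri

    # Fetch a page and parse it; the URL here is just a placeholder.
    html = RestClient.get('https://example.com/articles')
    doc  = Nokogiri::HTML(html)

    # Pull out every headline; the selector is an assumption about the page's markup.
    doc.css('h2.headline a').each do |link|
      puts [link.text.strip, link['href']].join("\t")
    end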
00:14:12.320
and then we also have Twitter right Twitter is a classic example and that's where we're going to look at our first piece of code so like maybe a simple thing you
00:14:19.680
want to do is you want to say what's the word frequency from Hurricane Sandy right and the awesome thing about Ruby
00:14:24.920
is we can pretty much do that in like 11 lines of code given that we've already like configured our TweetStream client
00:14:31.720
right and if you just take a second to read that
00:14:39.120
code what you can see is that it's really very basic right Ruby takes this
00:14:44.399
hard process of cleaning and collecting data and makes it much easier to work with right so here we're already able to
00:14:50.480
do like basically the two hardest things which is one go out and collect the data right we're able to track all the
00:14:56.399
keywords that have the hashtag Sandy in them but it doesn't just do that we're also getting the cleaning phase
00:15:01.480
done right because no algorithm that's going to do interesting things on natural language is going to take like
00:15:07.399
really sentences right most of the models fundamentally want words or n-grams or some sort of representation and we
00:15:13.720
can already get that right Ruby makes it so easy we're able to take that entire sentence split it and then remove
00:15:19.519
blank strings like just like that right which means now we can simply output a comma-separated list of words which
00:15:25.959
would be awesome to input into the next phase in our algorithm
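The slide itself isn't reproduced here, but a rough reconstruction of that kind of script, assuming the tweetstream gem is already configured with your Twitter credentials elsewhere, looks something like this:

    require 'tweetstream'   # assumes TweetStream.configure has already been run

    counts = Hash.new(0)

    TweetStream::Client.new.track('#sandy') do |status|
      # Split the tweet into lowercase words and drop the blanks.
      words = status.text.downcase.split(/\W+/).reject(&:empty?)
      words.each { |word| counts[word] += 1 }
      puts words.join(',')   # the comma-separated list that feeds the next phase
    end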
00:15:31.000
and when we talk about collecting data we can't not talk about Rails right Rails is so easy if you need to collect direct
00:15:37.319
user input right like if you want to use Mechanical Turk right for sentiment analysis anything like that Rails makes
00:15:42.680
that so easy there's an awesome open source project from Twitter called Clockwork Raven right which has gotten a
00:15:48.680
lot of press it's basically a way to submit jobs to Amazon Mechanical Turk get feedback and then refine which users
00:15:55.279
you think do the best job right so that's all written in Rails and there was a funny quote the other day from one of their data scientists he said I was
00:16:01.720
trained as a classical scientist but I spend most of my time writing Rails because what it comes down to is he
00:16:07.160
spends most of his time writing the Frameworks he needs to collect the data so that he can even do the complex
00:16:12.199
analysis that he spent all that time in school learning how to do like that's great but you said you're
00:16:18.040
going to talk about big data and my data is Big Data right everyone right now wants to say their data is bigger than everyone else's and it's kind of yeah we
00:16:26.120
have to get used to like this buzzword and realizing that just because your data is Big Data doesn't mean you can't
00:16:31.199
use Ruby right everyone says well Ruby can't scale it turns out some smart guys created this thing called Hadoop which
00:16:38.199
means you can make Ruby scale right so Hadoop is Java I don't really like writing Java personally so if I can get
00:16:44.519
away with it I'm going to not write Java and for a lot of tasks you don't really need Java in order to write Hadoop
00:16:52.240
jobs so if you remember those log lines from before when there's two of them it's pretty easy to see which pieces of
00:16:57.759
data are missing right we can see that it's pretty easy blank HTTP referer there's a couple missing values but in reality
00:17:04.280
in production if you're dealing with big data your logs basically look like that right it's an indiscernible mass of
00:17:09.520
text there's no way you can go through and manually inspect every single log line to see like hey somebody introduced
00:17:14.600
a bug in the client and all of a sudden my data is not being collected right and that's kind of where
00:17:19.760
Hadoop comes in and that's where we can really use the power of Ruby because Ruby makes it very easy to do simple
00:17:25.520
tasks across humongous clusters of nodes so here's an example from those log
00:17:31.480
lines I happen to know that at a given point there was a bug introduced in our client and lines were being passed back
00:17:37.720
to me without user IDs and without user IDs there's not really much interesting stuff you can do if you're an ad company
00:17:43.320
because you need to track users but in basically three lines of real Ruby plus a header we can basically take 10
00:17:51.520
billion log lines and just output the log lines that have missing user IDs right and that might only be 10,000
00:17:58.440
log lines right so what I can do is I can use Ruby to take this massive data that's basically indiscernible and distill
00:18:04.799
it down into something small right once I have those 10,000 log lines I can probably download those onto my
00:18:10.559
local machine see like what time did the log lines start appearing right what time was the bug introduced what time
00:18:16.280
was the bug solved right one of the best things you can try and do one of the things Ruby is so good at is taking big
00:18:22.000
data and making it small right because you really want to make your data much smaller to make it much easier to work with
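The actual script isn't in the transcript, but a minimal sketch of that kind of streaming filter might look like this — the tab delimiter and the position of the user ID field are assumptions about the log format:

    #!/usr/bin/env ruby
    # Hadoop Streaming pipes each log line to stdin; keep only the broken ones.
    STDIN.each_line do |line|
      fields = line.chomp.split("\t")             # assumption: tab-delimited logs
      puts line if fields[2].to_s.strip.empty?    # assumption: user ID is the third field
    end

You'd ship that as the mapper to the Hadoop Streaming jar with no reducer, and the job's output directory becomes the small file you pull down to your laptop.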
00:18:27.480
so when it comes to Ruby and Hadoop there's kind of three good options at
00:18:33.080
this point there might be more but these are the kind of three mature options that we've experimented with and used
00:18:38.600
right first off you have vanilla Hadoop streaming that's a script I just showed you basically Hadoop will take every
00:18:44.880
line that it sees as input and just output it to standard out and then your Ruby script can just read it in and do whatever you want with it right this
00:18:51.280
gives you ultimate flexibility and power but tons of boilerplate code to write so if your data is serialized you have to
00:18:57.360
deal with all the deserialization right trivial things like I want to do a group by and a count right that becomes hard
00:19:04.240
because all of a sudden you have to write your own Ruby code to do a group by and a count and it's a distributed count so it's very hard conveniently
00:19:11.320
the guys at Infochimps wrote this cool library called Wukong which is an abstraction on top of the vanilla Hadoop
00:19:17.640
streaming which makes it much easier to work with like uh Ruby on Hadoop if you
00:19:23.120
just want to use the streaming and regular Ruby and what they do is they give you this kind of
00:19:29.080
abstraction that makes it very easy to perform classic tuple operations right something like a group by a count a
00:19:35.320
distinct all those kind of things become very easy but streaming is not necessarily the most efficient way to
00:19:42.320
use Big Data it's definitely not the fastest way you can run a Hadoop job
00:19:47.480
really what you want is you want to drop down to Java right you want to drop down to the native libraries and there's cool Java libraries pretty heavily adopted
00:19:54.159
everyone really uses it these days called Cascading Cascading is all Java and though there's tons of verbosity it makes
00:19:59.799
it very easy to do the same kind of tuple operations right group by count those kind of things but the problem is
00:20:05.679
it's Java so I end up writing like 100 lines of constructors like AbstractFactoryFactoryGroupBy like
00:20:13.400
awesome so the guys at Etsy wrote this cool wrapper uh cascading.jruby right
00:20:20.440
and what it lets you do is it lets you write JRuby scripts that end up translating down
00:20:26.760
essentially to JVM bytecode right so really the power of Hadoop the power of
00:20:31.960
big data is JRuby right JRuby is an awesome project it gives you the full power of the JVM and it pretty much lets
00:20:38.200
you do anything with Hadoop you want to do in Ruby syntax right sure you'll take a performance penalty but most of us
00:20:44.200
aren't running on like Facebook and Google scale right we don't have like petabytes and zettabytes we have like maybe
00:20:50.640
terabytes right JRuby is fine for that you can use JRuby to write Pig UDFs right
00:20:55.919
maybe everyone already likes Pig maybe your company already has a Hive cluster you can use JRuby to write Hive UDFs
00:21:01.480
right like custom Hive functions say I want to do something like geocode an IP address right that becomes a lot
00:21:07.280
easier when you can write that using existing Ruby libraries and much like uh
00:21:12.559
much like the way Square rolls everything up before deploying it using their uh their framework you can kind of
00:21:18.559
do the same thing with Hadoop right you just bundle up your JRuby in a jar you just create a big uber jar and you just
00:21:23.640
ship it and Hadoop is fine to run your JRuby and we can really see the power of
00:21:29.039
that with this example so I'll let you read it and see if you can get the gist for what's going
00:21:43.240
on so yeah this script basically is the classic word count example right so to
00:21:48.760
give you a little bit of context the classic word count example if you use the raw Hadoop API I think ends up being
00:21:54.640
a couple hundred lines of code if you use just vanilla Cascading and you really compress it it'll probably look a
00:22:00.480
little bit shorter than this but it's a lot harder to understand right the cool thing here is this is all the Ruby code we're so used to doing right Ruby gives
00:22:06.960
us blocks we can pass them around it gives you anonymous functions things that Java doesn't have things that make it a pain in the ass to write Cascading
00:22:14.360
jobs right I have to create a class just for a custom function this script basically lets us take however much data
00:22:20.200
we want right split it into words and count them right something we all know
00:22:25.840
how to do on the command line JRuby gives you the ability to harness that
00:22:31.039
across a cluster of you know a billion nodes
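The cascading.jruby version from the slide isn't reproduced here, but for comparison, the same word-count logic expressed as plain Hadoop Streaming scripts in nothing but core Ruby is roughly:

    # mapper.rb -- emit "word<TAB>1" for every word that comes in on stdin
    STDIN.each_line do |line|
      line.downcase.split(/\W+/).reject(&:empty?).each { |word| puts "#{word}\t1" }
    end

    # reducer.rb -- Hadoop hands us the mapper output grouped and sorted by key
    counts = Hash.new(0)
    STDIN.each_line do |line|
      word, n = line.chomp.split("\t")
      counts[word] += n.to_i
    end
    counts.each { |word, n| puts "#{word}\t#{n}" }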
00:22:36.400
so Ruby as a whole is a powerful tool for data collection and cleaning right we
00:22:43.600
all know how to use Ruby to clean data in Unix right really the great thing about it is you can use those exact same
00:22:49.760
tools you're used to using every single day to build like data driven products right just because you have big data doesn't mean you can't do the stuff
00:22:55.720
you're so used to doing right you write a script you can run it on Hadoop you want to use JRuby you can harness the
00:23:00.919
full power of Hadoop right which just makes Ruby so powerful when you combine that with the ability to easily collect
00:23:06.120
data using things like Rails it's just such a natural fit for this hardest part of building a data driven product
00:23:12.679
because you will spend 90% of your time doing it so if you can use Ruby for the thing you spend 90% of your time doing
00:23:18.080
you already have a win really right so once it's all cleaned you kind of move on to the next phase right
00:23:25.240
that's the statistical modeling and the prediction that's kind of the glamorous phase right if you go to interview for a
00:23:30.640
job this is what they're going to ask you all the questions about right like what kind of distribution does my data have all that kind of stuff so it's
00:23:37.640
definitely the glamorous part and it turns out really what you're trying to do so in my personal example right I said
00:23:45.279
I need to be able to predict a user's likelihood of being on multiple Publishers right and what that really
00:23:50.799
means is I need a function that takes as input a user ID and a publisher X and I need to Output me the probability that
00:23:57.960
the user will be on publisher Y right so it's pretty basic right that's a pretty simple function
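As a toy illustration of that function, here's a minimal pure-Ruby sketch — it assumes the collection phase has already boiled the logs down to a hash of user ID to the set of publishers that user was seen on, which is a made-up intermediate format, not the real one:

    require 'set'

    # P(seen on publisher `to` | seen on publisher `from`), estimated from counts.
    def crossover_probability(users, from, to)
      seen_on_from = users.values.select { |pubs| pubs.include?(from) }
      return 0.0 if seen_on_from.empty?
      seen_on_from.count { |pubs| pubs.include?(to) }.to_f / seen_on_from.size
    end

    users = {
      'u1' => Set['forbes', 'business_insider'],
      'u2' => Set['forbes'],
      'u3' => Set['business_insider']
    }
    puts crossover_probability(users, 'forbes', 'business_insider')   # => 0.5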
00:24:03.960
but it's going to require me to do some stats right like that's where the kind of data stuff really comes in right I'm gonna have to do some
00:24:10.360
stats but Ruby sucks at statistical computing right like if you go on Stack Overflow and you post a question
00:24:16.000
about stats and Ruby somebody's gonna say don't be an idiot use R but it turns out who cares right like
00:24:24.120
most of your time is going to be spent cleaning and collecting your data anyway so the fact that Ruby might not be the best statistical programming language
00:24:30.799
doesn't really matter the other thing you can always say is have you actually tried running R in production it sucks right
00:24:38.440
like to take code and actually build a real product that your business is going to make money on you need to be able to monitor it you need to be able to deploy
00:24:44.679
it easily you need to know when an exception occurs all those things that Ruby already gives you you don't get when you try and
00:24:50.600
run languages such as R you can do it but it's definitely not the best way and
00:24:55.640
since we're all Rubyists we want to know how we can use Ruby to kind of do the statistical modeling so it turns out
00:25:02.240
while people may say Ruby sucks at Stats there's actually a pretty good selection
00:25:07.320
of libraries that let you do most of the stats you'll need to do to build the kind of products we're talking about
00:25:12.399
right statsample is an awesome library it basically goes through and implements most of the statistics functions you'd
00:25:18.360
ever really need right sure it will maybe lack some obscure genetics algorithm but at that point you should probably be
00:25:25.120
using another language anyway we also have SciRuby right which is a big push kind of an attempt to make Ruby
00:25:31.600
equivalent to SciPy and what it lets you do is that same stuff right it provides you with matrix libraries right all
00:25:37.600
these things that we need to do stats are there in Ruby right like we can use them and then most importantly we have
00:25:43.679
libsvm right so support vector machines are one of the most popular ways to do classification right now and
00:25:50.120
there's a pretty battle-hardened production-ready C implementation of SVMs called libsvm and there's a Ruby
00:25:56.760
binding for it right so if you go online and you Google Ruby libsvm you're going to find a good blog post by uh I think
00:26:03.960
Ilya Grigorik on how to use it and you'll just find tons of documentation right there's no reason we can't use these libraries because most of them are
00:26:10.000
in C there's Ruby wrappers on top of most of these hardcore libraries like libsvm
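As a taste, training and predicting with the Ruby binding looks roughly like this — this assumes the rb-libsvm gem and is adapted from memory of its README, so treat the exact accessors as approximate:

    require 'libsvm'   # gem install rb-libsvm

    parameter = Libsvm::SvmParameter.new
    parameter.cache_size = 1      # in megabytes
    parameter.eps = 0.001
    parameter.c = 10

    # Toy training set: two labeled feature vectors.
    examples = [[1, 0, 1], [-1, 0, -1]].map { |ary| Libsvm::Node.features(ary) }
    labels   = [1, -1]

    problem = Libsvm::Problem.new
    problem.set_examples(labels, examples)

    model = Libsvm::Model.train(problem, parameter)
    puts model.predict(Libsvm::Node.features([1, 1, 1]))   # which class does this look like?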
00:26:15.880
and if you really need to do some obscure stats we have libraries like RSRuby and RinRuby right that basically
00:26:21.720
lets you send Ruby objects over to R so if you need to do some obscure
00:26:26.760
complicated regression right like least angle regression that isn't in Ruby yet but you already have your data in Ruby
00:26:31.880
like from ActiveRecord say right you run a big query you have all these objects you can just pass those over to R and R can do the hard part the little bit
00:26:38.880
of statistical crunching you need and then give it back to you right
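A minimal sketch of that round trip with the rinruby gem, assuming R is installed locally — the regression is just a stand-in for whatever method you actually need, and the data would really come out of your ActiveRecord query:

    require 'rinruby'   # gem install rinruby; drives a local R process

    revenues = [120.0, 135.5, 150.2, 171.8]        # pretend this came from ActiveRecord

    R.assign 'revenues', revenues                  # push the Ruby array into R
    R.eval 'fit <- lm(revenues ~ seq_along(revenues))'
    R.eval 'slope <- coef(fit)[[2]]'
    puts R.pull('slope')                           # pull the result back as a Ruby number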
00:26:43.919
and again the other tool we really have is JRuby right I can't say enough
00:26:49.960
like JRuby really gives you the power to build data products in Ruby right you can harness the full power of the JVM
00:26:55.840
and when it comes to data stuff there's a JVM library for pretty much everything right like there's already a
00:27:00.960
Java library for hard matrix stuff sparse matrices pretty much every type of machine learning algorithm you want to use every kernel density method you
00:27:07.559
want to use they're already implemented in Java and the great thing about JRuby is it lets us write this nice wrapper on
00:27:14.080
the top of that stuff in a language we already all know right we already know how to use Ruby and since
00:27:19.559
the hard part has usually already been written there's no reason we can't put Ruby on top of it right there's no
00:27:24.799
reason we can't make it so that data products are approachable to all of our engineers
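For a flavor of what that wrapping looks like, here's a small sketch run under JRuby against Apache Commons Math — the jar path is hypothetical, and Commons Math is just one example of the JVM libraries you could lean on:

    # Run with JRuby; assumes you've downloaded the Commons Math jar somewhere.
    require 'java'
    require '/path/to/commons-math3-3.6.1.jar'     # hypothetical path
    java_import 'org.apache.commons.math3.stat.regression.SimpleRegression'

    regression = SimpleRegression.new
    [[1, 2.1], [2, 3.9], [3, 6.2]].each { |x, y| regression.add_data(x, y) }

    puts regression.slope        # JRuby exposes getSlope as plain old .slope
    puts regression.intercept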
00:27:30.159
so really although everyone will make fun of you for it there's no reason you can't do statistical modeling
00:27:35.960
in Ruby and once you have that you get to the final stage phase four which is
00:27:42.240
publishing results right this is where you actually make money if you don't get here you haven't really made the business any money which means you have
00:27:47.919
an academic research project not a product and this is really again where I
00:27:54.480
think Ruby is almost the perfect fit right so when it comes to publishing results if you really want to talk about a
00:27:59.919
product right you're going to end up building a web UI or a mobile app right
00:28:05.080
so here's an example that's uh Sharethrough's analytics dashboard right so that whole thing is all Ruby because
00:28:11.840
once you've done all that number crunching all that modeling you end up with actually really small data right and rails is awesome at taking something
00:28:18.679
from MySQL from Postgres and making it so that you can put it to the client right and once it's on the client you
00:28:24.799
have D3 right you have Highcharts all those things in JavaScript that we're all so used to using and we already know
00:28:30.960
how to use we just harness those right like once you have the data distilled once you've cleaned it you've modeled it
00:28:36.880
and you're ready to present it really it's just building a web app or a mobile app right here we have Yelp right
00:28:44.679
kind of the same thing they've done tons of behind-the-scenes coding so that they can give you good recommendations filter
00:28:49.840
out fraud all of those things are already there right we already have that
00:28:55.080
and the reason we have that is Rails right so just because we're talking about building data products and we're talking about big data doesn't mean we
00:29:02.440
can't use rails right when it comes to publishing your results often times you're going to end up building dashboards all those kind of things
00:29:09.039
Rails is great at them
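Just to show how small that last mile can be, here's a hypothetical Sinatra endpoint serving already-aggregated overlap numbers as JSON for a D3 or Highcharts front end — the data structure and publisher names are made up:

    require 'sinatra'   # gem install sinatra
    require 'json'

    # Pretend the modeling phase left us this small, pre-aggregated table.
    OVERLAP = { 'forbes' => { 'business_insider' => 0.18, 'the_awl' => 0.07 } }

    get '/overlap/:publisher' do
      content_type :json
      (OVERLAP[params[:publisher]] || {}).to_json   # chart this directly on the client
    end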
00:29:15.360
so to kind of close my personal example right I said I need to basically
00:29:21.000
predict user overlap so I'm not 100% done with that whole process right I've
00:29:27.200
been having to iterate on those last two phases I had a model people didn't really like how they had to interact with it so we've gone back but
00:29:34.039
we get something cool like this right this is a little bit of eye candy I thought I'd put on the slide for you and what this basically shows is a dense
00:29:41.760
network of publisher overlap right so like dense Publishers in the middle essentially have tons of crossover
00:29:47.679
because every single white line represents a user who you've seen on multiple publishers so kind of the nice thing
00:29:55.519
I'm kind of trying to get across is even though you're building a data product there's no reason we can't use
00:30:01.120
Ruby however I have to note unfortunately I did not generate that graph with Ruby I was a little bit dirty
00:30:07.919
and I used Python and that's because even though I love Ruby and I love the stuff you can
00:30:14.679
do with it to build data products it's not all roses yet right there's definitely things I think we as a
00:30:20.240
community need to improve and come work on to make it so that we get more people talking about doing data things in Ruby
00:30:28.600
and kind of here are the four things I think we really need to work on the first one is kind of a graphing library I know there's been attempts right like
00:30:34.120
there's Rubyvis attempting to port Protovis but there's no real Ruby
00:30:39.320
equivalent to the things people are used to in Python and R right in Python you have matplotlib in R you have ggplot2
00:30:44.440
right if Ruby could just have this centralized adopted graphing library it
00:30:49.760
would go a long way because you spend a lot of time doing graphs when you're doing kind of exploratory
00:30:55.799
analysis and kind of the second thing is a unified Matrix and Vector library
00:31:01.240
right machine learning if any of you guys do it you pretty much know what it comes down to most of the time is just matrix math right like you're trying to
00:31:07.760
do some matrix transformation so that you can get like a matrix that you can
00:31:12.960
work with or you're trying to like fill in missing gaps in a matrix right all that kind of stuff it would be great if Ruby
00:31:19.559
had kind of a more unified centralized matrix library right the guys at SciRuby are working on it I think there's
00:31:25.159
a couple of other libraries out there if you look on GitHub but there's no kind of centralized knowledge which is really what we need
00:31:32.639
right like it's not that Ruby can't do it it's really just that we need to like centralized discussion around it right
00:31:37.760
just like we have that centralized web framework which made it so that everyone started doing web stuff in rails and
00:31:43.279
kind of Rails to a certain extent helped us beat Python for web stuff we kind of
00:31:48.360
need that for Ruby and really it comes down to those last two things which is we just need more publishing right like
00:31:53.720
we need people talking about it more we spend a lot of time in the Ruby community talking about like TDD and OO and all those things are great but it
00:32:00.039
would be awesome if we started to get a little bit more publishing around kind of Ruby and machine learning and how we can really use Ruby and that kind of
00:32:06.679
ties into the last one which is academic buy-in right so like most of the time if you guys are going out to try and hire people to like build data products or do
00:32:13.279
stats or be analysts they're mostly academics right and academics tend to use Python and R simply because there's
00:32:20.440
not a lot of academic buy-in at least in California universities for Ruby if we could just kind of get the academics to
00:32:26.760
accept Ruby and see we can use it to do kind of computer-science-y things because that's what computer science programs are all about it would go a long way
00:32:33.519
because people would come out into the workforce already knowing that they could use
00:32:39.120
Ruby so even though it has its warts I definitely think Ruby plus data equals
00:32:44.480
agile data products right it gives you the ability to iterate really quickly you can harness the power of Hadoop you
00:32:49.639
can collect data really easily and then you can present the data that you've built using Ruby using rails using
00:32:56.279
Sinatra and finally my obligatory we're-hiring slide so if you want to work with
00:33:02.960
me on data things uh feel free to email me or visit that link thank you