Summarized using AI

Machine Learning for Fun and Profit

Chris Nelson • November 01, 2012 • Denver, Colorado • Talk

The talk "Machine Learning for Fun and Profit" presented by Chris Nelson at RubyConf 2012 explores the integration and application of machine learning techniques within Ruby programming, particularly focusing on decision trees. Nelson explains the fundamental principles of machine learning, emphasizing its role in analyzing data to predict outcomes based on established rules.

Key Points Discussed:
- Introduction to Machine Learning: Nelson provides a broad overview of machine learning as a subset of artificial intelligence trained to learn from example data to make predictions.
- Decision Trees Explained: He dives deeper into decision trees and the ID3 algorithm, which helps create these trees based on various decision points derived from the data sets. Nelson emphasizes that decision trees facilitate logical decision-making processes and explains how they function through an intuitive representation akin to a flowchart.
- Case Study: Drawing on a project for recommending home improvements, he explains how rules expressed in Cucumber tables were transformed into decision tree structures. This project illustrated the power of machine learning in providing automated recommendations based on specific homeowner details.
- Entropy and Information Gain: The talk delves into the concepts of entropy and information gain as measures used in the ID3 algorithm to determine the best attribute for splitting the data at each node in the decision tree. This mathematical foundation helps in selecting the optimal path for classification.
- Practical Implementations: Nelson discusses practical implementations of the ID3 algorithm in Ruby, highlighting libraries like ai4r and decisiontree, which assist in building decision trees while correcting for biases that can arise with continuous data attributes.
- Simplification through Decision Tables: Toward the end of the discussion, he summarizes how, in certain cases where rules are already known, simpler implementations such as decision tables may be more appropriate than complex decision tree algorithms. He also underscores the significance of unit tests in ensuring code accuracy when implementing these algorithms.

Conclusions: Nelson concludes the session by reinforcing key takeaways about understanding algorithm strengths and weaknesses. He encourages developers to choose the simplest workable solution for machine learning tasks and highlights the importance of solid testing practices to validate implementation accuracy. His discussion not only enlightens attendees about machine learning concepts but also inspires the application of these techniques in real Ruby programming scenarios, showing that solid unit tests make it safe to swap hand-written rule code for machine-generated implementations.

Machine Learning for Fun and Profit
Chris Nelson • Denver, Colorado • Talk

Date: November 01, 2012
Published: March 19, 2013
Announced: unknown

TDD is a great way to test code, but have you ever wondered if there are ways to leverage the awesome power of computers and help us write better tests? Research in the field of formal verification has shown promising results with tools that analyze programs for logic errors and can even figure out what input values caused those failures. However, until now, none of that research was ever used with Ruby. This talk discusses RubyCorrect, a research project that attempts to apply verification techniques like "extended static checking" and "symbolic execution" to the world of Ruby programs. We look at how these techniques work and how they could potentially improve the kinds of program faults we can detect. Machines that write our tests? So crazy that it just might work!

RubyConf 2012

00:00:15.320 okay think I'll go ahead and get started Um first of all a program note Uh it
00:00:21.039 turns out originally this talk was called machine learning for fun and profit and there are like six other
00:00:26.160 talks that are X for fun and profit So the title for this talk has officially changed to machine learning for insert
00:00:33.680 cliche of your choice here So I'm Chris Nelson I'm with uh Gaslight
00:00:41.520 We do uh web and mobile app development We're also doing a training class in Backbone and CoffeeScript in San
00:00:46.559 Francisco in early December So if you're into that please check us out Uh but what I'm going to be talking about today
00:00:54.000 is machine learning Um let's see if I can focus or narrow that Yeah that
00:01:00.239 center is a lot better Uh so machine learning is a very broad topic and
00:01:06.560 briefly um the way it's defined at least on Wikipedia is a branch of artificial intelligence that has to do
00:01:13.280 with um using a set of examples or a set of data and trying to learn what the
00:01:19.280 rules are and predict outcomes based on it And uh there's a whole bunch of
00:01:24.560 different algorithms to do that And um this talk is really more of a depth talk
00:01:30.080 than a breadth talk So um I'm not going to survey all the different machine
00:01:35.840 learning algorithms that are out there There's there's quite a few I'm really going to drill in pretty specifically on
00:01:42.400 decision trees and even more specifically on a particular algorithm to implement decision trees and go into
00:01:49.759 how it works in detail and hopefully give you guys a comfort level to know when's an appropriate place to use
00:01:56.240 decision trees uh where they do really well where they fall down and how the
00:02:01.840 algorithm actually works So hopefully that's of interest to you guys and I'll also give you some resources at the end
00:02:08.000 so that like the other algorithms that I don't cover if you want to go learn those well there's some resources that I
00:02:13.599 can point you at Uh but that's roughly the plan Um so just uh fair warning
00:02:21.200 there'll be some math Um I myself am not a particularly super mathy guy I
00:02:26.800 actually had to relearn some math that I'd forgotten to get ready for this talk And that's a good thing for me and I'll
00:02:34.480 try to go through it in pretty good detail and I don't think it's uh you know terribly difficult at all but um uh so
00:02:43.440 how this talk came about is uh from an actual project that I was on for a state
00:02:49.200 government and the project was all about recommending uh home improvements to
00:02:55.120 homeowners and uh figuring out what incentive programs they might qualify for
00:03:00.959 Uh so it had a lot of different rules about okay I should replace the heating
00:03:06.959 system if you know I have an old heating system and this type of heating and this
00:03:12.400 particular efficiency rating blah blah blah then I might qualify for an incentive program So a lot of uh rules
00:03:20.560 and uh fortunately when we came to the project they were already expressed to us as cucumber tables So that was a
00:03:27.120 pretty nice situation to be in Actually we already had customer-written cukes to start off One of the uh rare situations
00:03:34.000 that I've had the uh good fortune to be in So uh we started in on the project
00:03:40.799 and uh we had uh rules This is like one of the very simpler rules that we had
00:03:46.159 where uh it's talking about whether I should recommend that the homeowner replace the pool pump And as you can see
00:03:52.799 there um if I don't have a pool obviously that upgrade is not recommended as you might guess Uh if I
00:03:59.120 do have a pool and it has an efficient pump then I don't need to recommend it
00:04:04.400 If I have a pool and it doesn't have an efficient pump well then I should recommend the upgrade Uh so fairly
00:04:10.400 simple and as you might guess it turns into fairly simple code Uh you know I
00:04:15.439 look at the property and if it has a pool and it has an efficient pool pump Oh that's backwards I'm sorry if I have
00:04:21.120 a pool and it doesn't have an efficient pool pump is the way that should actually read I'm sorry about that
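In code, that rule is about as simple as it sounds. A minimal sketch (the property object and method names here are hypothetical, not the talk's actual code):

```ruby
# Recommend the upgrade only when there is a pool without an efficient pump
# (hypothetical names for illustration)
def recommend_pool_pump_upgrade?(property)
  property.has_pool? && !property.efficient_pool_pump?
end
```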
00:04:28.080 this then is a slightly more interesting example Uh this is to do with replacing
00:04:34.560 the lighting system in the property And really this is kind of based on two different things Uh whether I have well
00:04:42.080 what what type of lighting system I currently have in there And then if I don't know how long ago was their
00:04:48.479 lighting system replaced in the property and that turns into slightly
00:04:54.320 more interesting code and that you know I look at the lighting type first and if it's one of those types that I
00:05:00.320 know hey I should recommend an upgrade then I recommend the upgrade If it's not then I fall back and look at uh if the
00:05:06.880 type is don't know I actually look at how long ago it was installed to decide
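Sketched out, that fall-through logic might look something like this (again with hypothetical names; the outcomes follow the table as demoed later in the talk):

```ruby
# Known lighting types decide the outcome directly; for "don't know",
# fall back to how long ago the lighting was installed
def recommend_lighting_upgrade?(property)
  case property.lighting_type
  when 'screw-in incandescent', 'screw-in CFL' then true
  when 'pin-based fluorescent'                 then false
  else # type is "don't know": recommend if replaced more than 10 years ago
    Time.now.year - property.lighting_installed_year > 10
  end
end
```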
00:05:12.240 So again not terribly difficult code to write and I had the cucumber tables to actually make sure that this code was
00:05:18.560 right and that it actually satisfied the rules But I got done with a couple of
00:05:24.080 these and actually my uh guy that I did this project with pointed out that that number is actually wrong I had two down
00:05:30.400 and I had something like 300 to go Uh so a lot of rules to write with unit tests
00:05:36.560 and of course a very aggressive uh schedule to meet So uh a little bit of
00:05:43.039 uh sadness ensued but fortunately I had an ace in my pocket and that is I in
00:05:48.800 fact pair program with the wizard So um I said to the wizard wizard what
00:05:54.240 think you upon our dilemma And uh the wizard stroked his beard for a minute
00:06:01.039 banged his staff loudly on the ground and said none shall pass And then he said
00:06:06.560 to me and also you might want to look at decision trees So when a wizard tells you to do
00:06:11.919 something of course you do it And uh that led me to learn more about decision trees uh which I had either not learned
00:06:19.280 about in CS school or completely forgotten about So I had to dig in and learn and that's how this talk comes
00:06:25.120 about Uh so what are decision trees uh in brief let's go back to that table And
00:06:31.680 basically a decision tree is kind of a tree-based representation of
00:06:36.720 the decisions I would make as I went through and uh kind of
00:06:42.240 implemented this table So if we just look at this table it's pretty obvious what the decision tree for it is going
00:06:48.960 to look like It's going to look something like this And that's off the edge of the screen That's so sad But uh
00:06:55.840 not too bad I can tell you what it's about but basically I look at the type first and if it's one of those types I
00:07:01.680 know the outcome for already I'm done If it's don't know then I have to fall through and look at the last replaced
00:07:08.560 and then make the decision based on that So that's just a tree-based representation of the table we were
00:07:14.800 looking at just a second ago Um but the interesting thing to point out I mean if I look at this as a human it's like
00:07:22.240 really obvious how I should build that tree where I should root that tree and what I should look at first But as a
00:07:29.120 computer I don't really know that And there's actually more than one way I could express the same table as a
00:07:35.440 decision tree I could also express it like this where I look at the last replaced first and then I look at the
00:07:41.440 type It's just that there's a whole lot more decisions to make and it's just kind of silly You know I would never
00:07:46.720 actually choose to implement the logic like this as a human Uh but the important thing to point out is there
00:07:53.039 actually are multiple ways to build a decision tree for the same set of examples or rules Uh so of course um if
00:08:02.240 I really want an algorithm to be able to build those trees for me uh it needs to figure out some way uh to figure out
00:08:09.759 which attribute to look at first And uh it turns out there is an
00:08:15.680 algorithm that's uh good at doing that Um and it is called ID3 It's one of the
00:08:21.599 more popular algorithms for decision trees Uh it stands for, to look this up, Iterative Dichotomiser 3
00:08:28.400 I think that S could also be a Z but uh if you're a collector of weird words uh
00:08:33.519 dichotomize is kind of a cool word It means to uh split or to classify
00:08:40.320 into two parts Um it's up there with Skepsis in my book as far as weird words
00:08:45.839 to collect um it was written by a guy named Ross Quinlan and what it uses to
00:08:52.480 figure out which attribute to uh put at the root of the tree and which attribute
00:08:57.920 to you know recursively look at next as you go down and build that tree Uh it's all about a measure
00:09:05.920 called entropy And uh the way that um I'm defining entropy for the purpose of
00:09:12.560 this talk is really a measure of how much variability I have in a given set
00:09:20.080 Uh so um there's actually a formula that it uses to measure this and we're going
00:09:26.240 to go through and look at this And so this looks a little scary but it's actually fairly simple And we'll go
00:09:32.080 through and break it down and look at some examples And maybe you're you know maybe it's not scary at all to you but
00:09:37.440 for me I you know I had to look at this for a little bit and understand what was going on Uh but basically what this
00:09:42.640 means is if I take the entropy of that of the outcomes in that table
00:09:50.880 uh how to calculate that is I loop over the different possible outcomes and in
00:09:58.160 that set there's only recommended and not recommended Those are the two possible outcomes And then for each one
00:10:05.120 I look at the frequency of that outcome times the log base 2 of the
00:10:10.959 frequency of that outcome, add up those terms, and negate the sum So it's actually fairly straightforward Excuse me I got to get
00:10:17.600 some water so my voice doesn't give out
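A minimal Ruby sketch of the entropy calculation just described (my own illustration, not code from the talk):

```ruby
# Entropy of a set of outcomes: the negated sum of p * log2(p)
# for each distinct outcome, where p is that outcome's frequency
def entropy(outcomes)
  outcomes.group_by(&:itself).values.sum do |group|
    p = group.size.to_f / outcomes.size
    -p * Math.log2(p)
  end
end

entropy(['recommended'] * 5 + ['not recommended'] * 3) # => ~0.954
```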
00:10:25.680 So if we take our table again and we can calculate the entropy for the result of
00:10:31.920 that whole table there
00:10:37.880 Um what we do is we take a look at the frequency of each outcome So if we take
00:10:44.000 a look at recommended we'd show that recommended actually occurs 5/8s of the
00:10:50.000 time So we take like 5/8 times log base 2 of 5/8 and then we add that to the
00:10:55.760 frequency that not recommended occurs which is 3/8 and then we multiply that by log
00:11:00.880 base 2 of 3/8 So we come up with a number and that number is about 0.95 blah blah
00:11:06.959 blah blah blah So that's the entropy of the whole result Um so entropy by itself
00:11:14.480 is is interesting but it doesn't actually tell me which attribute to pick
00:11:19.519 But I wanted to show you entropy so that when I showed you gain it would make sense Uh and gain basically is a measure
00:11:28.399 of the effect on entropy when I choose a given attribute to split up the table
00:11:36.279 by So um gain is obviously based on
00:11:41.760 entropy and the gain for choosing a given attribute So the
00:11:48.560 gain on that table for a given attribute is the original entropy of the outcome
00:11:55.800 0.95 minus a sum over the possible values for that
00:12:03.240 attribute I'm sorry let me say that
00:12:08.800 right So I loop over the possible values for that attribute And then the calculation that I do is the frequency
00:12:16.399 that each value occurs times the entropy of the outcomes
00:12:22.480 for that value And if that doesn't make sense yet that's totally cool
00:12:29.040 because we're going to go through this in a little bit more detail and show exactly what that means But that's what the actual calculation involved is
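Building on the entropy helper above, a hedged Ruby sketch of that gain calculation (the rows-of-hashes representation is my own, not the talk's):

```ruby
# Information gain of splitting rows on one attribute: the entropy of all
# outcomes minus the frequency-weighted entropy within each value's subset
def gain(rows, attribute)
  subsets = rows.group_by { |row| row[attribute] }.values
  split_entropy = subsets.sum { |subset|
    (subset.size.to_f / rows.size) * entropy(subset.map { |row| row[:outcome] })
  }
  entropy(rows.map { |row| row[:outcome] }) - split_entropy
end
```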
00:12:37.120 let's actually take a look at that in a little bit more detail so it makes more sense Um so if we take a look at the um
00:12:44.560 gain for type we basically go through each
00:12:50.399 possible value for type and calculate the entropy of the outcome So um yeah we
00:12:57.360 can read that Okay So in this case it's it's pretty simple Um we take a look at
00:13:02.639 pin-based fluorescent as the first value for that attribute and we see that uh
00:13:07.760 for both of those results it comes up with not recommended So I only have a
00:13:12.959 single value there And if I calculate that it comes up with you know the frequency that not recommended occurs is
00:13:19.920 all the time which is one times log base 2 of one which is zero and then on the
00:13:26.079 other and then it doesn't occur at all in recommended So that's 0 times log base
00:13:31.680 2 of 0 which I think is like infinity or something but it doesn't matter because I'm multiplying it by zero So it's a lot
00:13:37.040 of math to tell me something that's intuitively completely obvious which is that the entropy for something that's
00:13:43.040 all the same is zero There isn't any entropy It's all the same So a lot of
00:13:48.399 math to tell you what's intuitively obvious but since computers don't have intuition we need math instead
00:13:54.959 So uh I do the same thing in the case of screw-in incandescent It's just
00:14:00.320 recommended instead of not recommended and the entropy is still zero The same for screw-in CFL as you
00:14:07.040 can see there And the only entropy I have to work with is over on don't know
00:14:12.240 And for don't know I have recommended appearing half the time and not
00:14:18.240 recommended appearing half the time And if I calculate that out the entropy of
00:14:23.680 that is one So then if I go back and plug that into my original equation for
00:14:29.920 gain that we talked about earlier I basically take the proportion of each of the possible values for lighting type
00:14:37.199 And for the first three it's zero So one quarter of zero is still zero And then for the last one I have oops I'm sorry
00:14:44.480 It should have been one quarter of one there And then it actually comes out to be 0.25 So the gain for lighting type is
00:14:58.959 0.95 minus 0.25 and the total there is 0.7
00:14:58.959 So a higher number for gain is better That means I've reduced the entropy by
00:15:04.399 this amount Make sense so far except for my typo over
00:15:10.680 there All right So again it's probably we we already know this from looking at
00:15:16.000 it but now we have a way for the computer to figure out that it should start the tree at lighting type And if
00:15:23.360 we just you know to verify this the gain for installed on uh it doesn't reduce
00:15:29.760 the entropy very much And the gain for installed on is actually 0.05 if you're
00:15:36.120 interested So it's obvious that lighting type's a lot better So um all this is really neat and cool
00:15:43.040 but how do we make practical use of all this goodness uh so there are a few
00:15:48.320 different implementations of this ID3 algorithm in Ruby And the first one we're going to look at is ai4r ai4r
00:15:56.079 is actually a gem that implements a lot of different um artificial
00:16:01.600 intelligence algorithms uh there's some implementations of neural networks genetic algorithms uh and
00:16:09.360 classifiers which is what we're talking about here And it has an implementation of ID3 And one of the cool things about
00:16:15.759 this gem is it actually will output the code based on the tree that it builds from
00:16:22.079 the set of examples that you pass in So we'll see what that looks like And
00:16:28.079 I'll bump over to my editor now At least I'll try to first I have to
00:16:33.920 break out of that mode Bump over to my editor now
00:16:41.560 And let's see I'll make that sorry make that smaller I don't really need to look
00:16:47.600 at the tree So can everybody read that okay So
00:16:55.959 uh yes I'm on the right example So um in order to uh use AI forr what we're going
00:17:03.680 to do here is we're going to build a data set first and then
00:17:10.600 um data set actually has a convenient method to load a data set from CSV So
00:17:17.839 basically what I have is I have a CSV that's just exactly what was in that table we were looking at earlier And so
00:17:25.039 I just feed that in here and it builds an ID3 classifier for me And then I have
00:17:32.080 a method that takes that decision tree and just evaluates a given uh property
00:17:37.679 based on it And then I just define a set of examples over here so that I can take
00:17:43.120 a look at what that looks like in IRB So let's go do that
00:17:51.400 Now if I get a top
00:18:02.600 here and then I take a look at what I call that replace lighting
00:18:08.559 fixtures rule evaluate and look at my example data and
00:18:15.840 if I use like the pin-based fluorescent example Sure enough it tells me that's not
00:18:21.320 recommended If I look at one of the other types of lighting like screw in
00:18:27.080 CFL tells me that it is recommended If I look at
00:18:33.960 um I don't know old it'll tell me it is recommended So it's it's doing the right
00:18:40.480 thing Um but as well as just being able to evaluate an example uh I thought one
00:18:46.559 of the cool features of this gem is it actually will output that code So let's look at what that looks
00:18:54.200 like And I need to actually get at that decision tree in there And then I can call get
00:19:02.280 rules And it actually will output code for me in Ruby where it's comparing the
00:19:07.520 lighting type That LF is a little cut off there But
00:19:14.200 um so if I actually have an object that implements methods for all those columns
00:19:23.039 that responds to lighting type I could literally take that get rules and do an instance eval on a property object and
00:19:30.160 it would like it would literally be writing my code for me So when I first ran across this and started using this I
00:19:36.480 was uh you know I was pretty blown away Um but uh it turns out there's more to
00:19:42.480 the story So we'll continue on Uh any questions so far
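Pieced together, the ai4r usage demoed above looks roughly like this (a sketch based on my reading of the ai4r gem's DataSet and ID3 classes; the file name and attribute values mirror the talk's example and are not verbatim code):

```ruby
require 'ai4r'

# Load the cucumber-table-style examples from CSV; the last column is the outcome
data_set = Ai4r::Data::DataSet.new.load_csv_with_labels('replace_lighting_fixtures.csv')

# Build the ID3 decision tree from the examples
id3 = Ai4r::Classifiers::ID3.new.build(data_set)

# Evaluate one property's attribute values against the tree
id3.eval(['pin-based fluorescent', 'less than 10 years ago']) # => "not recommended"

# Output Ruby code for the learned rules, suitable for instance_eval on a property
puts id3.get_rules
```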
00:19:50.360 cool So that's ai4r in a nutshell So um yeah we already did
00:19:58.039 that So that's really awesome Uh however there's a rather large caveat with using
00:20:05.760 the ID3 algorithms that currently exist uh and in order to uh understand that
00:20:12.640 let's take a different permutation of our example from a minute ago and let's imagine that we take that last replaced
00:20:20.000 and instead of where we conveniently have uh more than 10 years ago and less
00:20:25.120 than 10 years ago now we actually have uh numeric values for the year it was
00:20:30.720 last replaced on Well let's think about
00:20:35.919 what it's going to come up with now for the gain of last
00:20:41.720 replaced And to to think about that take a look at what it's going to do for gain
00:20:46.880 It's going to take each possible value and look at what the entropy is for the
00:20:52.159 for one of those values And you can probably guess what that's going to do
00:20:58.240 Um but what we're talking about here is something called the entropy bias And
00:21:03.520 the entropy bias refers to the fact that it's going to be biased uh by default
00:21:10.640 toward attributes that have a large set of possible values or um in the in this
00:21:16.799 case really an infinite set of values Uh and how this plays out is the entropy
00:21:21.840 for installed at uh for any of those values 1990 in this case the entropy is going to be
00:21:28.080 zero There isn't any So what it's going to do is it's going to add up all that entropy and end up uh calculating the
00:21:35.039 gain as the total entropy for the set minus zero And that's going to be always
00:21:40.320 the maximum possible gain So it's always going to split on that attribute first
00:21:45.760 even though that's not really predictive of anything That's not really predictive of an outcome So it kind of falls over
00:21:52.000 at that point and is not useful Um which you know made me a little sad but I got
00:21:57.840 over it And how I got over it is I realized uh well I didn't realize I read and did
00:22:03.840 some more research and it turns out there is actually a solution to this problem Uh and the problem basically the
00:22:10.559 entropy bias is really about how do we handle continuous attributes or attributes that are over a continuous
00:22:16.559 range rather than a set of discrete values And what we need to do is we need to
00:22:21.880 discretize those values And it turns out there's a way to do that which um is a
00:22:26.960 little tedious but not particularly complicated Uh and what we need to do uh
00:22:32.559 to be able to handle a continuous value like installed year uh we just sort all
00:22:38.720 those values and then uh build a list of the halfway points between each of those
00:22:45.880 values and then measure the gain from splitting on each of those points
00:22:53.840 And basically we look at the gain from each possible split point and the
00:22:59.120 split point with the highest gain wins And all of a sudden now we have a split point and we can just discretize every
00:23:06.320 row into is it less than the split point or more than the split point So now we like are back in business as far as like
00:23:13.039 the rest of the algorithm goes So um kind of cool Uh unfortunately that ai4r
00:23:19.600 ID3 implementation does not implement this feature
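A minimal sketch of that split-point search, reusing the entropy and gain helpers from earlier (my illustration; assumes rows are hashes with an :outcome key):

```ruby
# Find the best threshold for a continuous attribute: sort the values,
# take the midpoint between each adjacent pair, and keep the midpoint
# whose binary split yields the highest information gain
def best_split_point(rows, attribute)
  values = rows.map { |row| row[attribute] }.uniq.sort
  midpoints = values.each_cons(2).map { |a, b| (a + b) / 2.0 }
  midpoints.max_by { |point|
    discretized = rows.map { |row|
      { attribute => row[attribute] < point, :outcome => row[:outcome] }
    }
    gain(discretized, attribute)
  }
end
```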
00:23:25.840 this algorithm actually comes from the same guy who did ID3 He actually did a later algorithm called
00:23:32.919 C4.5 Uh I don't know why it's called that Maybe C is for continuous but I'm really just making that up Um but one of
00:23:40.559 the cool things that it does have in that algorithm along with some other improvements is what we talked about being able to handle continuous
00:23:46.799 attributes So that's pretty cool Um and it turns out there is an implementation
00:23:52.080 to this in Ruby by a guy named Ilya Grigorik who um he's spoken at some Ruby
00:23:59.039 conferences super smart guy Um if you have a chance to talk to him you should Uh I don't think he's here or at least I
00:24:05.440 didn't see his name giving any talks Um but anyway uh he did a gem called
00:24:10.559 decision tree uh obviously enough conveniently enough and uh it has some nice features One of the features is
00:24:17.039 that it can actually output the tree that it builds graphically So that's pretty cool Uh and it also more
00:24:23.679 importantly maybe it deals with continuous attributes correctly And just sort of as like a side little fun thing
00:24:31.360 it happens to add an entropy method to the array class So as I was prepping for
00:24:37.039 this talk and like doing these entropy calculations I could just like build an array and do well you know array entropy
00:24:42.880 and it would return a value So um I don't know if it's all that practical unless you're giving a talk on entropy
00:24:48.159 but it's still pretty cool Um so let's see that in
00:24:58.919 action and we'll jump out of here
00:25:04.559 typing is
00:25:12.760 hard Oh I should probably start with the code for decision tree example first so
00:25:18.480 it makes sense Um so this is using the decision tree gem Uh it's a little bit
00:25:24.720 more work to set up initially in that I have to take that um two-dimensional
00:25:30.559 array I get back from parsing my CSV myself and I have to pop off the labels
00:25:37.039 and I have to drop the last result in there So the labels in this case are lighting type installed at result Well
00:25:44.720 really only the labels for attributes are the ones I want So I drop result off the end and then I just have lighting
00:25:50.880 type and installed at in that labels array Uh and then the other thing I need
00:25:55.919 to do is munge my data a little bit They're all strings coming back from
00:26:01.200 uh parsing the CSV So I convert the ones that are all numbers into actual numbers
00:26:06.960 Uh it's a little cut off but there's like a two over there at the end So just a little bit of data munging not much And
00:26:14.000 then at that point I'm ready to feed into the decision tree ID3 guy uh the
00:26:20.039 labels the data itself a default value and a config
00:26:26.760 object And that config object is all about telling it what columns to treat
00:26:34.159 as discrete columns and what columns to treat or attributes rather what attributes are discrete and what
00:26:39.919 attributes are continuous If you've ever looked at this gem uh
00:26:46.080 that config thing is like brand new Um that's in like 0.4 they just bumped that
00:26:51.120 gem version Um but so I can tell it that the lighting type's discrete and last replaced is
00:26:57.000 continuous And uh then I set up some example data and I'm in
00:27:02.360 business So now let's see that work I have a
00:27:09.159 tree and tree has a predict method and I again can look at my
00:27:16.799 example data run pin-based fluorescent and it's not
00:27:24.279 recommended screw in CFL is recommended and we can see it's working
00:27:32.480 just as it did before uh so I also told you that there's a
00:27:37.520 feature there Oh yeah So um if we look at array here you can
00:27:43.919 actually get the entropy Oh if I can type So that's kind
00:27:49.520 of cool And then the last cool feature is really that uh
00:27:59.960 tree.graph I dump that to Ruby comp
00:28:06.240 It dumps out a PNG file for me and among other things that'll let me see exactly
00:28:12.480 what it did as far as building that tree So it's split on lighting type just the way it should and then it figured out
00:28:19.440 that based on the set of data that I have the split point is
00:28:24.840 2003.5 So originally I had 10 years ago So it's like really close And if I gave
00:28:30.880 it more examples it would get closer and and eventually exactly get to 10 years ago So um pretty
00:28:44.279 cool Uh yeah I think that's all I really had to say about that implementation decision
00:28:53.080 tree Any questions on that all right
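For reference, the decisiontree gem setup walked through above looks roughly like this (a sketch based on the talk's description; the 0.4-era config hash and CSV munging are as described, but file names and values are illustrative):

```ruby
require 'csv'
require 'decisiontree'

rows = CSV.read('replace_lighting_fixtures.csv')
labels = rows.shift                 # e.g. ['lighting type', 'last replaced', 'result']
labels.pop                          # keep only the attribute labels, drop the outcome
data = rows.map { |r| [r[0], r[1].to_i, r[2]] }  # munge year strings into numbers

tree = DecisionTree::ID3Tree.new(
  labels, data, 'not recommended',
  'lighting type' => :discrete, 'last replaced' => :continuous
)
tree.train

tree.predict(['screw-in CFL', 2010])  # => "recommended"
tree.graph('lighting')                # writes a PNG showing the learned splits
```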
00:28:58.399 Uh so the last section really is about
00:29:03.480 um whether I have rules or whether I have examples and
00:29:08.760 um to show it So decision trees in general they're really good at taking
00:29:14.559 lots of examples and inferring rules from them But uh in our case as we
00:29:21.919 got farther along in this project what we realized is we actually had the rules
00:29:28.919 themselves and we didn't need something uh really as complicated as as the
00:29:34.799 decision tree algorithms we were looking at And to show you what I mean let's look at one of the later rules that we
00:29:40.320 had to implement Oh and that's a little bit cut off but I think it's enough there that I can show what's going on So
00:29:48.240 what we really had for some of these later rules is uh kind of a more simplified just table of um if it's
00:29:55.679 electric resistance um none of those other
00:30:01.120 attributes matter If the type is electric resistance the outcome is not
00:30:07.320 recommended Um the same for some of those examples later on Basically
00:30:12.640 anytime I have a space in that table what that means is that attribute for that row doesn't
00:30:19.159 matter So um out of the box and there might be a
00:30:24.880 way I've talked about this a little bit with other people there might be a way to adapt the decision tree algorithms to
00:30:30.159 be able to do deal with these blank values for attributes in this table but out of the box it doesn't know how to
00:30:35.919 deal with this Uh but what I realized is the decision tree algorithms are really
00:30:41.279 all about trying to figure out what decisions to do first And in a situation
00:30:47.360 like this you can actually use a really simple brute force appro approach where
00:30:53.279 basically you just go down you start with your example and you just step down this table and do compares at each row
00:31:01.039 and the first time you find a match that's the outcome and you're done So um I didn't actually need decision
00:31:09.279 tree quite although you know learning about them and figuring out where to apply them was very cool and I enjoyed
00:31:14.720 it Um I actually had a simpler problem So um I I looked around and there might
00:31:20.399 have been an implementation that did what I wanted already but it was just so simple that I ended up writing a gem to
00:31:26.080 do this myself And I called this decision table because it's not really a decision tree algorithm per se uh and
00:31:33.200 it's really for where you have rules expressed in a tabular format rather than a bunch of examples that you need
00:31:39.519 to learn the rules from And the key point is you already know what order to do the comparisons That's what those
00:31:46.080 algorithms are really figuring out is what order to do the the comparisons And if you already know you can just use a
00:31:53.120 simple brute force approach like I did So we can uh look at this as well
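The brute-force idea fits in a few lines of Ruby. A sketch of the approach (my illustration, not necessarily the actual decision table gem's API; blank cells mean the attribute doesn't matter for that row, and the example values are illustrative):

```ruby
# Walk the rule rows top to bottom; the first row whose non-blank
# cells all match the example decides the outcome
def evaluate(rules, example)
  rules.each do |*conditions, outcome|
    matched = conditions.zip(example).all? { |rule_value, actual|
      rule_value.nil? || rule_value.empty? || rule_value == actual
    }
    return outcome if matched
  end
  nil # no rule matched
end

rules = [
  ['electric resistance', '',          'not recommended'],
  ['gas furnace',         'efficient', 'not recommended'],
  ['gas furnace',         '',          'recommended'],
]
evaluate(rules, ['gas furnace', 'efficient']) # => "not recommended"
```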
00:32:04.640 We'll bop over to uh the decision table example and decision table again can
00:32:13.200 operate from uh well really a two-dimensional array that I'm parsing from
00:32:18.440 CSV and I have a simplified space heating CSV that is basically exactly
00:32:24.960 that table that I looked at with the space heating and I feed that to my decision
00:32:31.200 table rule set And then at that point I'm ready to run my example data through
00:32:36.880 it
00:32:54.080 And uh I have an evaluate method just like I did uh for the first
00:33:00.399 state for our example I guess excuse
00:33:07.640 me Oh yeah sorry I have a different set of examples in this case Um so I think I
00:33:14.000 have a gas furnace efficient and in that case if it's already an efficient gas
00:33:19.080 furnace it won't recommend I replace it If I have an inefficient gas furnace it
00:33:29.000 will If I don't know but it was replaced a long time ago it'll recommend it So uh
00:33:34.640 it's working Oops What happened here sorry about that Uh so it's working
00:33:42.960 just as I expect Um and it's just a simpler brute force kind of approach for
00:33:49.679 when I need that
00:33:56.240 So to kind of summarize some of the things I learned through all this and uh
00:34:01.360 and the first thing is that if you have a situation like this where you have rules that are expressed like this
00:34:08.000 to you from the customer in a tabular format Um don't you know write all the
00:34:13.760 code yourself You don't need to do that Uh there's several approaches that'll work either the um if you have examples
00:34:20.639 and you need to infer the rules um something like ID3 or the decision tree gem is a great way to go Uh if what
00:34:28.320 you really have is just a table of the rules that uh you can use something like decision table and turn that into a DSL
00:34:34.879 essentially Um but what's really really helpful is in our situation we wrote a
00:34:40.960 few of them by hand and then we had the test cases so that when we were ready to replace it with the uh the
00:34:48.440 decision tree algorithm or the decision table uh we were able to verify that it
00:34:53.839 was still working as expected So in this case the unit tests were actually more
00:34:59.200 valuable than the implementation code because they allowed us to replace the implementation code with something
00:35:04.400 better So uh really strong uh case for unit tests Uh the other thing is you
00:35:10.800 really need to understand the algorithms and what they're good for and when they uh fall down Um initially when I started
00:35:17.440 looking at decision trees uh they just sort of looked like a wonderful magic black box that would write my code for
00:35:22.960 me And as I dug in some more it turned out the answer was maybe kind of Um but
00:35:29.119 it's really important to understand how these things work rather than just plugging them in And of course the last
00:35:34.320 thing is the simplest thing that can possibly work is always a good choice So um some resources to look at
00:35:42.480 uh igvita.com is Ilya's website There's a bunch of good blog posts on different
00:35:47.839 other uh AI algorithms to look
00:35:53.400 at and uh he goes into detail on things like Bayesian classification
00:35:59.640 and singular value decomposition I'm I think I'm getting those words wrong but
00:36:04.720 lots of different algorithms that are worth looking at Um there's also a really good uh discussion of ID3 with an
00:36:13.119 implementation in Python And uh last but not least um I actually have all my
00:36:19.520 slides available on GitHub This whole thing is done in uh Reveal.js Uh so um I don't know how much
00:36:28.560 time I have left but looks like yeah I do have a few minutes for questions if there are any
00:36:37.359 So uh using the decision trees can uh can handle noise in the training data
00:36:44.800 like if you have a few incorrect uh results in there will it kind
00:36:51.720 of deal with that if you have enough data or will that cause
00:36:57.680 you know what I'd have to I'd have to give that a try and find out I don't know the answer to that off the top of my
00:37:02.839 head Oh I'm sorry to repeat the question Would it deal with noise in the data uh
00:37:09.119 would it figure out that if you had some values that some rows that were incorrect would it be able to figure
00:37:14.160 that out i think if the val
00:37:22.000 I'm sorry I didn't hear that I think if the values vary and you could tell it you know what was passed away at the end
00:37:28.320 it would learn I think it would I think it would have a better chance if it was a continuous attribute than it was if it
00:37:34.320 was a discrete attribute of of continuing on and working through that situation I don't know what it would do
00:37:40.320 if it was a discrete attribute that it had like a you know an outcome that was wrong
00:37:50.320 I don't know the answer to that I'm sorry
00:37:56.440 Yeah The question was on pruning pruned versus unpruned decision trees and I do not know the answer to
00:38:06.680 that All right Well there are no more questions Thanks a lot