00:00:15.320
Okay, I think I'll go ahead and get started First of all, a program note It
00:00:21.039
turns out originally this talk was called machine learning for fun and profit and there are like six other
00:00:26.160
talks that are X for fun and profit So the title for this talk has officially changed to machine learning for [insert
00:00:33.680
cliché of your choice here] So I'm Chris Nelson I'm with Gaslight
00:00:41.520
We do uh web and mobile app development We're also doing a training class in Backbone and CoffeeScript in San
00:00:46.559
Francisco in early December So if you're into that please check us out Uh but what I'm going to be talking about today
00:00:54.000
is machine learning Um let's see if I can focus or narrow that Yeah that
00:01:00.239
center is a lot better Uh so machine learning is a very broad topic and
00:01:06.560
briefly, the way it's defined at least on Wikipedia is a branch of artificial intelligence that has to do
00:01:13.280
with um using a set of examples or a set of data and trying to learn what the
00:01:19.280
rules are and predict outcomes based on it And uh there's a whole bunch of
00:01:24.560
different algorithms to do that And um this talk is really more of a depth talk
00:01:30.080
than a breadth talk So um I'm not going to survey all the different machine
00:01:35.840
learning algorithms that are out there There's quite a few I'm really going to drill in pretty specifically on
00:01:42.400
decision trees and even more specifically on a particular algorithm to implement decision trees and go into
00:01:49.759
how it works in detail and hopefully give you guys a comfort level to know when's an appropriate place to use
00:01:56.240
decision trees uh where they do really well where they fall down and how the
00:02:01.840
algorithm actually works So hopefully that's of interest to you guys and I'll also give you some resources at the end
00:02:08.000
so that like the other algorithms that I don't cover if you want to go learn those well there's some resources that I
00:02:13.599
can point you at Uh but that's roughly the plan Um so just uh fair warning
00:02:21.200
there'll be some math Um I myself am not a particularly super mathy guy I
00:02:26.800
actually had to relearn some math that I'd forgotten to get ready for this talk And that's a good thing for me and I'll
00:02:34.480
try to go through it in pretty good detail and I don't think it's terribly difficult at all So
00:02:43.440
how this talk came about is uh from an actual project that I was on for a state
00:02:49.200
government and the project was all about recommending uh home improvements to
00:02:55.120
homeowners and uh figuring out what incentive programs they might qualify for
00:03:00.959
Uh so it had a lot of different rules about okay I should replace the heating
00:03:06.959
system if you know I have an old heating system and this type of heating and this
00:03:12.400
particular efficiency rating blah blah blah then I might qualify for an incentive program So a lot of uh rules
00:03:20.560
and uh fortunately when we came to the project they were already expressed to us as cucumber tables So that was a
00:03:27.120
pretty nice situation to be in Actually we already had customer-written cukes to start off One of the rare situations
00:03:34.000
that I've had the uh good fortune to be in So uh we started in on the project
00:03:40.799
and uh we had uh rules This is like one of the very simpler rules that we had
00:03:46.159
where uh it's talking about whether I should recommend that the homeowner replace the pool pump And as you can see
00:03:52.799
there um if I don't have a pool obviously that upgrade is not recommended as you might guess Uh if I
00:03:59.120
do have a pool and it has an efficient pump then I don't need to recommend it
00:04:04.400
If I have a pool and it doesn't have an efficient pump well then I should recommend the upgrade Uh so fairly
00:04:10.400
simple and as you might guess it turns into fairly simple code Uh you know I
00:04:15.439
look at the property and if it has a pool and it has an efficient pool pump Oh that's backwards I'm sorry if I have
00:04:21.120
a pool and it doesn't have an efficient pool pump is the way that should actually read, I'm sorry about that
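To make that concrete, here's a minimal sketch of what that hand-written rule looks like with the correction applied; the Property object and its method names are hypothetical stand-ins, not the project's actual code:

```ruby
# Hypothetical sketch of the hand-written pool pump rule (not the project's
# actual code): recommend only when there's a pool WITHOUT an efficient pump.
class ReplacePoolPumpRule
  def evaluate(property)
    if property.has_pool? && !property.efficient_pool_pump?
      :recommended
    else
      :not_recommended
    end
  end
end
```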
00:04:28.080
So this is a slightly more interesting example This has to do with replacing
00:04:34.560
the lighting system in the property And really this is based on two different things:
00:04:42.080
what type of lighting system I currently have in there And then, if I don't know, how long ago the
00:04:48.479
lighting system was replaced in the property And that turns into slightly
00:04:54.320
more interesting code in that I look at the lighting type first and if it's one of those types that I
00:05:00.320
know I should recommend an upgrade, then I recommend the upgrade If it's not, then I fall back: if the
00:05:06.880
type is don't know, I actually look at how long ago it was installed to decide
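A minimal sketch of that fallback logic, assuming the type values from the table; the method names are hypothetical stand-ins:

```ruby
# Hypothetical sketch of the lighting rule: known types decide immediately,
# and "don't know" falls back to how long ago the lighting was installed.
RECOMMEND_TYPES = ['screw-in incandescent', 'screw-in CFL'].freeze

def replace_lighting_recommended?(property)
  return true if RECOMMEND_TYPES.include?(property.lighting_type)
  if property.lighting_type == "don't know"
    # Assumed threshold: the table's "more than 10 years ago" bucket.
    return property.years_since_lighting_installed > 10
  end
  false # e.g. pin-based fluorescent: not recommended
end
```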
00:05:12.240
So again not terribly difficult code to write and I had the cucumber tables to actually make sure that this code was
00:05:18.560
right and that it actually satisfied the rules But I got done with a couple of
00:05:24.080
these and actually the guy I did this project with pointed out that that number is actually wrong I had two down
00:05:30.400
and I had something like 300 to go Uh so a lot of rules to write with unit tests
00:05:36.560
and of course a very aggressive uh schedule to meet So uh a little bit of
00:05:43.039
uh sadness ensued but fortunately I had an ace in my pocket and that is I in
00:05:48.800
fact pair program with a wizard So I said to the wizard, wizard, what
00:05:54.240
think you upon our dilemma And uh the wizard stroked his beard for a minute
00:06:01.039
banged his staff loudly on the ground and said, none shall pass And then he said
00:06:06.560
to me and also you might want to look at decision trees So when a wizard tells you to do
00:06:11.919
something of course you do it And uh that led me to learn more about decision trees uh which I had either not learned
00:06:19.280
about in CS school or completely forgotten about So I had to dig in and learn and that's how this talk comes
00:06:25.120
about Uh so what are decision trees uh in brief let's go back to that table And
00:06:31.680
basically a decision tree is kind of a tree-based representation of
00:06:36.720
the decisions I would make as I went through and kind of
00:06:42.240
implemented this table So if we just look at this table it's pretty obvious what the decision tree for it is going
00:06:48.960
to look like It's going to look something like this And that's off the edge of the screen That's so sad But uh
00:06:55.840
not too bad I can tell you what it's about but basically I look at the type first and if it's one of those types I
00:07:01.680
know the outcome for already I'm done If it's don't know then I have to fall through and look at the last replaced
00:07:08.560
and then make the decision based on that So that's just a tree-based representation of the table we were
00:07:14.800
looking at just a second ago Um but the interesting thing to point out I mean if I look at this as a human it's like
00:07:22.240
really obvious how I should build that tree where I should root that tree and what I should look at first But as a
00:07:29.120
computer I don't really know that And there's actually more than one way I could express the same table as a
00:07:35.440
decision tree I could also express it like this where I look at the last replaced first and then I look at the
00:07:41.440
type It's just that there's a whole lot more decisions to make and it's just kind of silly You know I would never
00:07:46.720
actually choose to implement the logic like this as a human Uh but the important thing to point out is there
00:07:53.039
actually are multiple ways to build a decision tree for the same set of examples or rules So of course if
00:08:02.240
I really want an algorithm to be able to build those trees for me, it needs some way to figure out
00:08:09.759
which attribute to look at first And uh it turns out there is an
00:08:15.680
algorithm that's uh good at doing that Um and it is called ID3 It's one of the
00:08:21.599
more popular algorithms for decision trees It stands for — I had to look this up — Iterative Dichotomiser
00:08:28.400
I think that S could also be a Z But if you're a collector of weird words,
00:08:33.519
dichotomize is kind of a cool word It means to split or to classify
00:08:40.320
into two parts Um it's up there with Skepsis in my book as far as weird words
00:08:45.839
to collect It was written by a guy named Ross Quinlan, and what it uses to
00:08:52.480
figure out which attribute to uh put at the root of the tree and which attribute
00:08:57.920
to recursively look at next as you go down and build that tree — it's all about a measure
00:09:05.920
called entropy And uh the way that um I'm defining entropy for the purpose of
00:09:12.560
this talk is really a measure of how much variability I have in a given set
00:09:20.080
Uh so um there's actually a formula that it uses to measure this and we're going
00:09:26.240
to go through and look at this And so this looks a little scary but it's actually fairly simple And we'll go
00:09:32.080
through and break it down and look at some examples And maybe you're you know maybe it's not scary at all to you but
00:09:37.440
for me I you know I had to look at this for a little bit and understand what was going on Uh but basically what this
00:09:42.640
means is: if I take the entropy of the outcomes in that table
00:09:50.880
uh how to calculate that is I loop over the different possible outcomes and in
00:09:58.160
that set there's only recommended and not recommended Those are the two possible outcomes And then for each one
00:10:05.120
I look at the frequency of that outcome times the log base 2 of the
00:10:10.959
frequency of that outcome So it's actually fairly straightforward Excuse me I got to get
00:10:17.600
some water so my voice doesn't give out
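For reference, the formula on the slide is the standard entropy definition; a reconstruction, where p_i is the frequency (proportion) of outcome i among the n possible outcomes in set S (the sum is negated because the logs of fractions are negative):

```latex
\mathrm{Entropy}(S) = -\sum_{i=1}^{n} p_i \log_2 p_i
```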
00:10:25.680
So if we take our table again and we can calculate the entropy for the result of
00:10:31.920
that whole table there
00:10:37.880
Um what we do is we take a look at the frequency of each outcome So if we take
00:10:44.000
a look at recommended, we'd see that recommended actually occurs 5/8 of the
00:10:50.000
time So we take like 5/8 times log base 2 of 5/8 and then we add that to the
00:10:55.760
frequency that not recommended occurs, which is 3/8, and then we multiply that by log
00:11:00.880
base 2 of 3/8 So we come up with a number, and that number is about 0.95
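A quick sketch of that calculation in Ruby, just to check the arithmetic (the 5/8 and 3/8 come from the 8-row table):

```ruby
# Entropy of a list of outcomes: sum, over each distinct outcome, of
# -frequency * log2(frequency).
def entropy(outcomes)
  n = outcomes.size.to_f
  outcomes.group_by { |o| o }.values.sum do |group|
    p = group.size / n
    -p * Math.log2(p)
  end
end

outcomes = ['recommended'] * 5 + ['not recommended'] * 3
puts entropy(outcomes) # => 0.9544...
```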
00:11:06.959
blah blah blah So that's the entropy of the whole result Um so entropy by itself
00:11:14.480
is interesting but it doesn't actually tell me which attribute to pick
00:11:19.519
But I wanted to show you entropy so that when I showed you gain it would make sense Uh and gain basically is a measure
00:11:28.399
of the effect on entropy when I choose a given attribute to split up the table
00:11:36.279
by So um gain is obviously based on
00:11:41.760
entropy, and the gain for choosing a given attribute works like this: the
00:11:48.560
gain on that table for a given attribute is the original entropy of the outcome
00:11:55.800
0.95, minus a sum over the possible values for that
00:12:03.240
attribute So I loop over the possible values for that attribute, and
00:12:08.800
the calculation that I do for each value is the frequency that
00:12:16.399
the value occurs times the entropy of the results
00:12:22.480
for that value And if that doesn't make sense yet, that's totally cool
00:12:29.040
because we're going to go through this in a little bit more detail and show exactly what that means But that's the actual calculation involved
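For reference, a reconstruction of the gain formula just described, where S_v is the subset of rows in which attribute A has value v:

```latex
\mathrm{Gain}(S, A) = \mathrm{Entropy}(S)
  - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
```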
00:12:37.120
So let's actually take a look at that in a little more detail so it makes more sense If we take a look at the
00:12:44.560
gain for type we basically go through each
00:12:50.399
possible value for type and calculate the entropy of the outcome So um yeah we
00:12:57.360
can read that Okay So in this case it's pretty simple We take a look at
00:13:02.639
pin-based fluorescent as the first value for that attribute and we see that
00:13:07.760
for both of those results it comes up with not recommended So I only have a
00:13:12.959
single value there And if I calculate that it comes up with you know the frequency that not recommended occurs is
00:13:19.920
all the time, which is one times log base 2 of one, which is zero And then on the
00:13:26.079
other hand, it doesn't occur at all in recommended So that's 0 times log base
00:13:31.680
2 of 0 — which is like negative infinity — but it doesn't matter because I'm multiplying it by zero So it's a lot
00:13:37.040
of math to tell me something that's intuitively completely obvious which is that the entropy for something that's
00:13:43.040
all the same is zero There isn't any entropy It's all the same So a lot of
00:13:48.399
math to tell you what's intuitively obvious but since computers don't have intuition we need math instead
00:13:54.959
So I do the same thing in the case of screw-in incandescent It's just
00:14:00.320
recommended instead of not recommended, and the entropy is still zero The same for screw-in CFL as you
00:14:07.040
can see there And the only entropy I have to work with is over on don't know
00:14:12.240
And for don't know I have recommended appearing half the time and not
00:14:18.240
recommended appearing half the time And if I calculate that out the entropy of
00:14:23.680
that is one So then if I go back and plug that into my original equation for
00:14:29.920
gain that we talked about earlier I basically take the proportion of each of the possible values for lighting type
00:14:37.199
And for the first three it's zero So one quarter of zero is still zero And then for the last one I have oops I'm sorry
00:14:44.480
It should have been one quarter of one there And then it actually comes out to be 0.25 So the gain for lighting type is
00:14:53.800
0.95 minus 0.25 and the total there is 0.7
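To check that arithmetic, here's a sketch of the gain calculation in Ruby, assuming the 8-row table as described (two rows per type, with "don't know" split one recommended, one not):

```ruby
# Gain for the lighting-type attribute: total entropy minus the weighted
# entropy of each subset you get by splitting on that attribute.
entropy = lambda do |outcomes|
  n = outcomes.size.to_f
  outcomes.group_by { |o| o }.values.sum { |g| p = g.size / n; -p * Math.log2(p) }
end

rows = [['pin-based fluorescent', 'not recommended']] * 2 +
       [['screw-in incandescent', 'recommended']] * 2 +
       [['screw-in CFL', 'recommended']] * 2 +
       [["don't know", 'recommended'], ["don't know", 'not recommended']]

total    = entropy.call(rows.map(&:last))           # ~0.954
weighted = rows.group_by(&:first).values.sum do |subset|
  (subset.size / rows.size.to_f) * entropy.call(subset.map(&:last))
end                                                 # 3 * (1/4 * 0) + 1/4 * 1 = 0.25
puts total - weighted                               # => ~0.70
```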
00:14:58.959
So a higher number for gain is better That means I've reduced the entropy by
00:15:04.399
this amount Make sense so far except for my typo over
00:15:10.680
there All right So again, we probably already know this from looking at
00:15:16.000
it but now we have a way for the computer to figure out that it should start the tree at lighting type And if
00:15:23.360
we just want to verify this: the gain for installed on doesn't reduce
00:15:29.760
the entropy very much — the gain for installed on is actually about 0.05 if you're
00:15:36.120
interested So it's obvious that lighting type is a lot better So all this is really neat and cool
00:15:43.040
but how do we make practical use of all this goodness uh so there are a few
00:15:48.320
different implementations of this ID3 algorithm in Ruby And the first one we're going to look at is AI4R
00:15:56.079
AI4R is actually a gem that implements a lot of different artificial
00:16:01.600
intelligence algorithms There are implementations of neural networks, genetic algorithms, and
00:16:09.360
classifiers which is what we're talking about here And it has an implementation of ID3 And one of the cool things about
00:16:15.759
this gem is that it will actually output code based on the tree it builds from
00:16:22.079
the set of examples that you pass in So we'll see what that looks like And
00:16:28.079
I'll bump over to my editor now At least I'll try to first I have to
00:16:33.920
break out of that mode Bump over to my editor now
00:16:41.560
And let's see I'll make that sorry make that smaller I don't really need to look
00:16:47.600
at the tree So can everybody read that okay So
00:16:55.959
yes, I'm on the right example So in order to use AI4R, what we're going
00:17:03.680
to do here is we're going to build a data set first and then
00:17:10.600
um data set actually has a convenient method to load a data set from CSV So
00:17:17.839
basically what I have is I have a CSV that's just exactly what was in that table we were looking at earlier And so
00:17:25.039
I just feed that in here and it builds an ID3 classifier for me And then I have
00:17:32.080
a method that takes that decision tree and just evaluates a given uh property
00:17:37.679
based on it And then I just define a set of examples over here so that I can take
00:17:43.120
a look at what that looks like in IRB So let's go do that
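In outline, the setup looks something like this; a minimal sketch rather than the talk's exact file, with the CSV name and example row as placeholders:

```ruby
# Sketch of AI4R usage: build a data set from CSV, train an ID3 classifier,
# evaluate examples, and dump the generated rules.
require 'ai4r'

data_set = Ai4r::Data::DataSet.new
data_set.load_csv_with_labels 'replace_lighting_fixtures.csv' # placeholder name

tree = Ai4r::Classifiers::ID3.new.build(data_set)

# Attribute values in column order, minus the outcome column:
puts tree.eval(['pin-based fluorescent', 'more than 10 years ago'])
# => "not recommended"

puts tree.get_rules # Ruby if/elsif code reconstructed from the tree
```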
00:17:51.400
Now if I open up IRB
00:18:02.600
here and then I take a look at what I call that replace lighting
00:18:08.559
fixtures rule evaluate and look at my example data and
00:18:15.840
if I use the pin-based fluorescent example, sure enough it tells me that's not
00:18:21.320
recommended If I look at one of the other types of lighting, like screw-in
00:18:27.080
CFL, it tells me that it is recommended If I look at
00:18:33.960
don't know with an old install date, it'll tell me it is recommended So it's doing the right
00:18:40.480
thing Um but as well as just being able to evaluate an example uh I thought one
00:18:46.559
of the cool features of this gem is it actually will output that code So let's look at what that looks
00:18:54.200
like And I need to actually get that decision tree in there And then I can call get
00:19:02.280
rules And it actually will output code for me in Ruby where it's comparing the
00:19:07.520
lighting type That elsif is a little cut off there But
00:19:14.200
um so if I actually have an object that implements methods for all those columns
00:19:23.039
that responds to lighting type I could literally take that get rules and do an instance eval on a property object and
00:19:30.160
it would literally be writing my code for me So when I first ran across this and started using this I
00:19:36.480
was pretty blown away But it turns out there's more to
00:19:42.480
the story So we'll continue on Uh any questions so far
00:19:50.360
cool So that's AI4R in a nutshell So yeah, we already did
00:19:58.039
that So that's really awesome Uh however there's a rather large caveat with using
00:20:05.760
the ID3 algorithms that currently exist uh and in order to uh understand that
00:20:12.640
let's take a different permutation of our example from a minute ago and let's imagine that we take that last replaced
00:20:20.000
and instead of where we conveniently have uh more than 10 years ago and less
00:20:25.120
than 10 years ago now we actually have uh numeric values for the year it was
00:20:30.720
last replaced on Well let's think about
00:20:35.919
what it's going to come up with now for the gain of last
00:20:41.720
replaced And to think about that, take a look at what it's going to do for gain
00:20:46.880
It's going to take each possible value and look at what the entropy is for the
00:20:52.159
for one of those values And you can probably guess what that's going to do
00:20:58.240
Um but what we're talking about here is something called the entropy bias And
00:21:03.520
the entropy bias refers to the fact that it's going to be biased uh by default
00:21:10.640
toward attributes that have a large set of possible values — or, in this
00:21:16.799
case really an infinite set of values Uh and how this plays out is the entropy
00:21:21.840
for installed at — for any of those values, 1990 in this case — is going to be
00:21:28.080
zero There isn't any So what it's going to do is it's going to add up all that entropy and end up uh calculating the
00:21:35.039
gain as the total entropy for the set minus zero And that's going to be always
00:21:40.320
the maximum possible gain So it's always going to split on that attribute first
00:21:45.760
even though that's not really predictive of anything That's not really predictive of an outcome So it kind of falls over
00:21:52.000
at that point and is not useful Um which you know made me a little sad but I got
00:21:57.840
over it And how I got over it is — well, I didn't realize, I read and did
00:22:03.840
some more research, and it turns out there is actually a solution to this problem And the problem — basically the
00:22:10.559
entropy bias is really about how do we handle continuous attributes or attributes that are over a continuous
00:22:16.559
range rather than a set of discrete values And what we need to do is we need to
00:22:21.880
discretize those values And it turns out there's a way to do that which um is a
00:22:26.960
little tedious but not particularly complicated Uh and what we need to do uh
00:22:32.559
to be able to handle a continuous value like installed year uh we just sort all
00:22:38.720
those values and then uh build a list of the halfway points between each of those
00:22:45.880
values, and then measure the gain from splitting at each of those points
00:22:53.840
And basically we look at the gain from each possible split point, and the split point
00:22:59.120
with the highest gain wins All of a sudden now we have a split point and we can just discretize every
00:23:06.320
row into 'is it less than the split point or more than the split point' So now we're back in business as far as
00:23:13.039
the rest of the algorithm goes So, kind of cool
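Here's a sketch of that discretizing step, not taken from either gem, with hypothetical example data:

```ruby
# Discretize a continuous attribute: sort the values, take the halfway points
# between neighbors as candidate splits, and keep the split with the best gain.
entropy = lambda do |outcomes|
  n = outcomes.size.to_f
  outcomes.group_by { |o| o }.values.sum { |g| p = g.size / n; -p * Math.log2(p) }
end

# [installed_year, outcome] pairs -- hypothetical example data
rows = [[1990, 'recommended'], [1998, 'recommended'],
        [2005, 'not recommended'], [2010, 'not recommended']]

years      = rows.map(&:first).sort.uniq
candidates = years.each_cons(2).map { |a, b| (a + b) / 2.0 } # halfway points

total = entropy.call(rows.map(&:last))
best  = candidates.max_by do |split|
  below, above = rows.partition { |year, _| year < split }
  total - [below, above].sum do |side|
    (side.size / rows.size.to_f) * entropy.call(side.map(&:last))
  end
end
puts best # => 2001.5, the split point with the highest gain
```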
00:23:19.600
Unfortunately, the AI4R ID3 implementation does not implement this feature This
00:23:25.840
algorithm actually comes from the same guy who did ID3 He later did an algorithm called
00:23:32.919
C4.5 Uh I don't know why it's called that Maybe C is for continuous but I'm really just making that up Um but one of
00:23:40.559
the cool things that it does have in that algorithm along with some other improvements is what we talked about being able to handle continuous
00:23:46.799
attributes So that's pretty cool And it turns out there is an implementation
00:23:52.080
of this in Ruby by a guy named Ilya Grigorik, who has spoken at some Ruby
00:23:59.039
conferences super smart guy Um if you have a chance to talk to him you should Uh I don't think he's here or at least I
00:24:05.440
didn't see his name giving any talks Um but anyway uh he did a gem called
00:24:10.559
decisiontree, conveniently enough, and it has some nice features One of the features is
00:24:17.039
that it can actually output the tree that it builds graphically So that's pretty cool Uh and it also more
00:24:23.679
importantly maybe it deals with continuous attributes correctly And just sort of as like a side little fun thing
00:24:31.360
it happens to add an entropy method to the array class So as I was prepping for
00:24:37.039
this talk and like doing these entropy calculations I could just like build an array and do well you know array entropy
00:24:42.880
and it would return a value So um I don't know if it's all that practical unless you're giving a talk on entropy
00:24:48.159
but it's still pretty cool Um so let's see that in
00:24:58.919
action and we'll jump out of here
00:25:04.559
typing is
00:25:12.760
hard Oh I should probably start with the code for decision tree example first so
00:25:18.480
it makes sense Um so this is using the decision tree gem Uh it's a little bit
00:25:24.720
more work to set up initially in that I have to take that um two-dimensional
00:25:30.559
array I get back from parsing my CSV myself and I have to pop off the labels
00:25:37.039
and I have to drop the last result in there So the labels in this case are lighting type, installed at, result Well
00:25:44.720
really only the labels for attributes are the ones I want So I drop result off the end and then I just have lighting
00:25:50.880
type and installed at in that labels array And then the other thing I need
00:25:55.919
to do is munge my data a little bit They're all strings coming back from
00:26:01.200
uh parsing the CSV So I convert the ones that are all numbers into actual numbers
00:26:06.960
It's a little cut off but there's like a two over there at the end So just a little bit of data munging, not much And
00:26:14.000
then at that point I'm ready to feed into the decision tree ID3 guy uh the
00:26:20.039
labels the data itself a default value and a config
00:26:26.760
object And that config object is all about telling it what columns to treat
00:26:34.159
as discrete — or attributes, rather: what attributes are discrete and what
00:26:39.919
attributes are continuous If you've ever looked at this gem uh
00:26:46.080
that config thing is brand new That's in like 0.4, they just bumped that
00:26:51.120
gem version But so I can tell it that lighting type is discrete and last replaced is
00:26:57.000
continuous And then I set up some example data and I'm in business
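Put together, the setup looks roughly like this; a sketch assuming the decisiontree gem around version 0.4, with a placeholder CSV name and the config hash as described:

```ruby
# Sketch of decisiontree gem usage: parse the CSV myself, munge the year
# strings into numbers, and mark each attribute as discrete or continuous.
require 'csv'
require 'decisiontree'

rows   = CSV.read('replace_lighting_fixtures.csv') # placeholder name
labels = rows.shift # e.g. ["lighting type", "installed at", "result"]
labels.pop          # drop the outcome label; keep only attribute labels

data = rows.map { |type, year, result| [type, year.to_i, result] }

config = { 'lighting type' => :discrete, 'installed at' => :continuous }
tree = DecisionTree::ID3Tree.new(labels, data, 'not recommended', config)
tree.train

puts tree.predict(['screw-in CFL', 2005]) # => "recommended"
tree.graph('lighting')                    # writes a PNG of the trained tree
```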
00:27:02.360
So now let's see that work I have a
00:27:09.159
tree and tree has a predict method and I again can look at my
00:27:16.799
example data, run pin-based fluorescent, and it's not
00:27:24.279
recommended Screw-in CFL is recommended, and we can see it's working
00:27:32.480
just as it did before So I also — oh, I told you that there's a
00:27:37.520
feature there Oh yeah So um if we look at array here you can
00:27:43.919
actually get the entropy Oh if I can type So that's kind
00:27:49.520
of cool And then the last cool feature is really that uh
00:27:59.960
tree.graph — I dump that to a file —
00:28:06.240
and it dumps out a PNG file for me Among other things, that'll let me see exactly
00:28:12.480
what it did as far as building that tree So it's split on lighting type just the way it should and then it figured out
00:28:19.440
that based on the set of data that I have the split point is
00:28:24.840
2003.5 So originally I had 10 years ago So it's like really close And if I gave
00:28:30.880
it more examples it would get closer and eventually get exactly to 10 years ago So, pretty
00:28:44.279
cool Uh yeah I think that's all I really had to say about that implementation decision
00:28:53.080
tree Any questions on that all right
00:28:58.399
Uh so the last section really is about
00:29:03.480
whether I have rules or whether I have examples And
00:29:08.760
to show what I mean: decision trees in general are really good at taking
00:29:14.559
lots of examples and inferring rules from them But in our case, as we
00:29:21.919
got farther along in this project what we realized is we actually had the rules
00:29:28.919
themselves, and we didn't need something really as complicated as the
00:29:34.799
decision tree algorithms we were looking at And to show you what I mean let's look at one of the later rules that we
00:29:40.320
had to implement Oh and that's a little bit cut off but I think it's enough there that I can show what's going on So
00:29:48.240
what we really had for some of these later rules is kind of a more simplified table: if it's
00:29:55.679
electric resistance, none of those other
00:30:01.120
attributes matter If the type is electric resistance the outcome is not
00:30:07.320
recommended Um the same for some of those examples later on Basically
00:30:12.640
anytime I have a space in that table what that means is that attribute for that row doesn't
00:30:19.159
matter So — and there might be a
00:30:24.880
way, I've talked about this a little bit with other people, to adapt the decision tree algorithms to
00:30:30.159
deal with these blank values for attributes in this table But out of the box it doesn't know how to
00:30:35.919
deal with this Uh but what I realized is the decision tree algorithms are really
00:30:41.279
all about trying to figure out what decisions to do first And in a situation
00:30:47.360
like this you can actually use a really simple brute-force approach where
00:30:53.279
basically you start with your example and you just step down this table and do compares at each row
00:31:01.039
and the first time you find a match, that's the outcome and you're done
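A sketch of that brute-force approach (not the actual decision table gem internals), with a hypothetical cut-down rule table:

```ruby
# Walk the rule rows top to bottom; a blank cell matches anything; the first
# matching row's outcome wins.
def first_match(rules, example)
  rules.each do |conditions, outcome|
    hit = conditions.all? do |attribute, expected|
      expected.nil? || expected.empty? || example[attribute] == expected
    end
    return outcome if hit
  end
  nil
end

rules = [
  [{ 'type' => 'electric resistance', 'efficiency' => '' },            'not recommended'],
  [{ 'type' => 'gas furnace',         'efficiency' => 'efficient' },   'not recommended'],
  [{ 'type' => 'gas furnace',         'efficiency' => 'inefficient' }, 'recommended'],
]

puts first_match(rules, 'type' => 'electric resistance', 'efficiency' => 'efficient')
# => "not recommended" -- the blank efficiency cell is a don't-care
```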
00:31:09.279
So I didn't actually quite need decision trees, although learning about them and figuring out where to apply them was very cool and I enjoyed
00:31:14.720
it I actually had a simpler problem I looked around and there might
00:31:20.399
have been an implementation that did what I wanted already but it was just so simple that I ended up writing a gem to
00:31:26.080
do this myself And I called this decision table because it's not really a decision tree algorithm per se uh and
00:31:33.200
it's really for where you have rules expressed in a tabular format rather than a bunch of examples that you need
00:31:39.519
to learn the rules from And the key point is you already know what order to do the comparisons That's what those
00:31:46.080
algorithms are really figuring out: what order to do the comparisons And if you already know, you can just use a
00:31:53.120
simple brute force approach like I did So we can uh look at this as well
00:32:04.640
We'll bop over to uh the decision table example and decision table again can
00:32:13.200
operate from uh well really a two-dimensional array that I'm parsing from
00:32:18.440
CSV and I have a simplified space heating CSV that is basically exactly
00:32:24.960
that table that I looked at with the space heating and I feed that to my decision
00:32:31.200
table rule set And then at that point I'm ready to run my example data through
00:32:36.880
it
00:32:54.080
And I have an evaluate method just like I did for the first
00:33:00.399
example, I guess Excuse
00:33:07.640
me Oh yeah sorry I have a different set of examples in this case Um so I think I
00:33:14.000
have a gas furnace efficient and in that case if it's already an efficient gas
00:33:19.080
furnace it won't recommend I replace it If I have an inefficient gas furnace it
00:33:29.000
will If I don't know but it was replaced a long time ago it'll recommend it So uh
00:33:34.640
it's working Oops, what happened here, sorry about that So it's working
00:33:42.960
just as I expect And it's just a simpler brute-force kind of approach for
00:33:49.679
when I need that
00:33:56.240
So to kind of summarize some of the things I learned through all this:
00:34:01.360
the first thing is that if you have a situation where you have rules that are expressed like this
00:34:08.000
to you from the customer in a tabular format, don't write all the
00:34:13.760
code yourself You don't need to do that There are several approaches that'll work If you have examples
00:34:20.639
and you need to infer the rules, something like ID3 or the decisiontree gem is a great way to go If what
00:34:28.320
you really have is just a table of the rules, you can use something like decision table and turn that into a DSL
00:34:34.879
essentially But what was really helpful in our situation is that we wrote a
00:34:40.960
few of them by hand and then we had the test cases, so that when we were ready to replace it with the
00:34:48.440
decision tree algorithm or the decision table, we were able to verify that it
00:34:53.839
was still working as expected So in this case the unit tests were actually more
00:34:59.200
valuable than the implementation code because they allowed us to replace the implementation code with something
00:35:04.400
better So, a really strong case for unit tests The other thing is you
00:35:10.800
really need to understand the algorithms, what they're good for, and when they fall down Initially, when I started
00:35:17.440
looking at decision trees uh they just sort of looked like a wonderful magic black box that would write my code for
00:35:22.960
me And as I dug in some more, it turned out the answer was 'maybe, kind of' But
00:35:29.119
it's really important to understand how these things work rather than just plugging them in And of course the last
00:35:34.320
thing is the simplest thing that can possibly work is always a good choice So um some resources to look at
00:35:42.480
igvita.com is Ilya's website There are a bunch of good blog posts on different
00:35:47.839
other uh AI algorithms to look
00:35:53.400
at and he goes into detail on things like Bayesian classification
00:35:59.640
and singular value decomposition, and
00:36:04.720
lots of different algorithms that are worth looking at Um there's also a really good uh discussion of ID3 with an
00:36:13.119
implementation in Python And uh last but not least um I actually have all my
00:36:19.520
slides available on GitHub This whole thing is done in uh Reveal.js Uh so um I don't know how much
00:36:28.560
time I have left but looks like yeah I do have a few minutes for questions if there are any
00:36:37.359
So, using decision trees — can they handle noise in the training data,
00:36:44.800
like if you have a few incorrect results in there, will it kind
00:36:51.720
of deal with that if you have enough data, or will that cause —
00:36:57.680
You know what, I'd have to give that a try and find out I don't know the answer to that off the top of my
00:37:02.839
head Oh, I'm sorry — to repeat the question: would it deal with noise in the data,
00:37:09.119
would it figure out, if you had some rows that were incorrect, would it be able to figure
00:37:14.160
that out
00:37:22.000
I'm sorry, I didn't hear that — I think if the values vary and you could tell it what it was passed at the end
00:37:28.320
it would learn — I think it would have a better chance if it was a continuous attribute than if it
00:37:34.320
was a discrete attribute of continuing on and working through that situation I don't know what it would do
00:37:40.320
if it was a discrete attribute that had an outcome that was wrong
00:37:50.320
I don't know the answer to that I'm sorry
00:37:56.440
Yeah, the question was on pruning — pruned or unpruned decision trees — and I do not know the answer to
00:38:06.680
that All right Well, there are no more questions Thanks a lot