Summarized using AI

Simulating the World with Ruby

Bryan Liles • November 01, 2012 • Denver, Colorado • Talk

In the talk "Simulating the World with Ruby," Bryan Liles explores the intriguing concept of modeling complex systems using the Ruby programming language. He addresses the unconventional choice of Ruby for simulations, typically dominated by languages like R or Mathematica due to their scientific capabilities. Bryan argues that Ruby's expressiveness makes it a suitable candidate for building simulations, as it allows developers to rapidly prototype and iterate on their models without cumbersome compile-link cycles.

Key Points Discussed:

  • Why Use Ruby for Simulations:

    • Ruby is expressive and allows for quick iterations in the coding process.
    • It has a concise syntax that enables modeling complex systems effectively.
  • Building Blocks of Simulations:

    • Models take inputs and produce outputs, and each model is either deterministic or stochastic.
    • Deterministic Models: Always produce the same output for the same inputs (e.g., calculating the hypotenuse using the Pythagorean theorem).
    • Stochastic Models: Include randomness, giving different outcomes each time they are run (e.g., modeling the spread of cooties).
  • Performance Differences Between Environments:

    • Liles discusses the performance of MRI (Matz's Ruby Interpreter) versus JRuby, emphasizing JRuby's ability to harness multi-core processing and faster execution times due to the JVM and its garbage collection mechanisms.
  • Statistical Understanding in Simulations:

    • The importance of understanding common statistics concepts, such as mean, max, min, and standard deviation, in analyzing simulation results.
    • He mentions the availability of Ruby gems that can perform statistical calculations, although he notes that Ruby lacks robust scientific libraries compared to Python.
  • Challenges in Data Management:

    • The talk stresses the limitations of using Active Record for large datasets in simulations and suggests alternatives for handling big data efficiently.
  • Memory Management:

    • The presenter highlights potential memory leaks in simulations when using Ruby, showcasing how careful memory management can drastically improve performance.
  • Visualization and Real-World Applications:

    • Bryan indicates the necessity of data visualization for understanding simulation data, recommending tools that could help illustrate results beyond simple console outputs.
  • Final Thoughts:

    • Liles concludes that while Ruby is not inherently designed for scientific computation, its expressive power can be harnessed to create effective simulations. He encourages the Ruby community to develop more robust libraries for statistical analysis and machine learning.

Conclusion:

Bryan's talk presents Ruby as a viable choice for building simulations, emphasizing quick iterations, expressive syntax, and the importance of thorough statistical comprehension and performance optimization techniques. Ruby's prospects in scientific domains look promising, provided the community continues to enhance its libraries and capabilities.


Date: November 01, 2012
Published: March 19, 2013

You want to model a world. That world has millions of people, who also interact with each other. How would you even start tackling this model in Ruby? I'd like to demonstrate one solution. In this talk, we'll explore this problem from inception. See how the process can evolve from the simple first model to a much more complicated interactive tool. This talk will cover topics such as getting Ruby to do more than one thing at once, sampling discrete probability distributions in constant time, the perils of garbage collection, and a few other surprises.

RubyConf 2012

00:00:14.900 my name is Bryan Liles I'm from Baltimore and I'm here representing Thunderbolt Labs but I'm not talking
00:00:22.439 about Thunderbolt Labs today we're going to talk about Ruby simulators and um
00:00:28.199 to get started with this this is a talk about writing simulators
00:00:34.320 in Ruby now you might say to yourself who would want to write a simulator in Ruby I mean it sounds pretty
00:00:40.860 preposterous Ruby is not known as a great scientific or mathematical
00:00:46.379 language a lot of top Minds who are actually creating this kind of software don't use Ruby or really they don't even
00:00:52.559 use a lot of general purpose programming languages anyways they use things like Mathematica or crazy things like R but
00:00:59.280 you know what we're crazy people so we're going to write a simulator in Ruby or at least talk about writing
00:01:04.440 simulators in Ruby so why Ruby well the first thing this is RubyConf I'm sure everyone in here loves Ruby I love it
00:01:11.880 it's actually one of my favorite languages probably it's like one two with my favorite languages so I like to
00:01:18.840 code in Ruby because you know Ruby is very expressive I have not found anything ever that I've tried to code
00:01:26.700 that I could not actually just sit down and Hammer out in Ruby I'm actually I've done a lot of java and I've done a lot
00:01:32.759 of other languages and sometimes I scratch my head trying to figure out exactly how how would I actually codify
00:01:39.420 this idiom in code another thing I like about Ruby is there's no run compile loop I mean you
00:01:46.020 write the code and you run it and if it breaks you fix it and you run it again
00:01:51.079 there's not a lot of setup you don't there's no Linker there's nothing like that and you don't need I mean if you're
00:01:56.579 using MRI and you just have Ruby on your machine or using jruby you don't need much else to get Ruby up and running
00:02:03.060 so the next thing about Ruby is everything is an object and I really just enjoy this
00:02:10.380 fact that I'm going to model the world in OO and I'm just going to apply my OO hammer everywhere I can because OO is
00:02:17.520 the best way of doing things and I'm just kidding here so let's get into past the introductions
00:02:24.060 and talking about the building blocks of simulations so when you build a when you're building
00:02:29.520 a simulator um what we're actually doing is taking these Concepts called models and we're
00:02:35.819 giving them inputs and they're going to spit out outputs and we're going to use multiple models to actually
00:02:41.239 model or actually build um an effect and we're going to actually reason on the effect that was built so
00:02:48.120 before we talk about that there's some vocabulary words I hope you guys brought pencils and internet so you can actually
00:02:54.060 look up these words the first word is deterministic and I thought the best way to show what like a deterministic model
00:03:01.080 would be by actually writing some Ruby code that has that actually isn't a real model
00:03:06.840 um we've all written code like this what this is is a model of the world and
00:03:12.599 and it has a method answer to life and what makes this deterministic is that no matter how many times you run the answer
00:03:18.480 to life method on world model you're always going to get 42. very simple example here and continuing on models so
00:03:26.819 models can have inputs like I said so in this case we have a triangle class and we want to solve for the hypotenuse and
00:03:32.940 I hope everybody in here knows the Pythagorean theorem so you know hopefully you can check my math
00:03:38.580 so a squared plus b squared yes equals c squared and you notice I'm I'm giving
00:03:44.519 two inputs um the length of a the length of B and I'm using the square root of after you added them up and you will
00:03:51.540 always once again get the same answer so once again this is deterministic
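
The two deterministic models described above can be sketched roughly as follows. The slides are not part of this transcript, so the class and method names (`WorldModel`, `answer_to_life`, `Triangle`, `hypotenuse`) are assumptions based on the spoken description.

```ruby
# Deterministic models: the same inputs always produce the same output.
# Class and method names are assumed from the spoken description.
class WorldModel
  def answer_to_life
    42
  end
end

class Triangle
  def initialize(a, b)
    @a = a
    @b = b
  end

  # Pythagorean theorem: c = sqrt(a^2 + b^2)
  def hypotenuse
    Math.sqrt(@a**2 + @b**2)
  end
end

puts WorldModel.new.answer_to_life
puts Triangle.new(3, 4).hypotenuse
```

Calling either method repeatedly with the same inputs returns the same value, which is what makes these models deterministic.
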
00:03:57.659 so here's an another example um and to talk about so one of the
00:04:03.599 things that I do with with modeling is um we are actually building models of
00:04:09.420 infections and things like that but infections are boring and gross but you
00:04:14.519 know what um I bet when you were small cooties was fun and and it's funny and
00:04:20.160 I'll tell a little story here so I gave this I gave another version of this talk in Belgium where you know everybody speaks Flemish and not English and I put
00:04:28.199 a I put a slide up there it said cooties on it and they said who and actually people were Googling while
00:04:33.720 I was talking to figure out what cooties were you know I'm glad I'm back in the United States where you guys actually know what cooties are so um
00:04:41.160 so once again we're talking about deterministic models models that there's no Randomness in these models you give
00:04:46.440 them input if you give them the same input you will always get the same output and and we're actually talking more about our simulation now so there's
00:04:53.400 a cost for cooties so you know what if this side of the room um actually gets cooties you know it's
00:04:59.639 going to cost like ten dollars to get rid of give everybody up your cootie shots so these are things that we um
00:05:05.520 want to model so um so our models won't be deterministic there's another word
00:05:11.400 another vocabulary word and I hope I spelled this right I'm sure I did it's stochastic and stochastic means that these are
00:05:17.280 models that have a little bit of Randomness in them and I only have one slide on this because
00:05:22.979 um I think I can illustrate this pretty succinctly here so what we have to do is
00:05:29.220 that so if this guy right here in the front row and this other guy here in the front row the guy on the left with the
00:05:34.860 gray sweatshirt on here has cooties what is the percentage chance this guy
00:05:41.400 here that's two seats away from him is going to get cooties I'm sure it's pretty high but according to my model
00:05:46.860 here um we actually have a rand so every time you run this there's actually a chance where you won't get it and
00:05:52.259 there's a chance that you will get it notice the chances one tenth of one percent so it's not very high but he's
00:05:57.360 been looking at him the whole entire time so I'm sure there'll be lots of chances for transmission
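
A minimal sketch of a stochastic model like the one described, assuming the "one tenth of one percent" chance applies to each exposure. The class name `CootiesModel` and its methods are hypothetical, not taken from the slides.

```ruby
# Stochastic model: each exposure has a small random chance of transmission.
# CootiesModel and transmitted? are hypothetical names for illustration.
class CootiesModel
  TRANSMISSION_CHANCE = 0.001 # one tenth of one percent, as quoted in the talk

  def initialize(rng = Random.new)
    @rng = rng
  end

  # Returns true when a single exposure results in transmission.
  def transmitted?
    @rng.rand < TRANSMISSION_CHANCE
  end
end

model = CootiesModel.new
exposures = 100_000
infections = exposures.times.count { model.transmitted? }
puts "#{infections} transmissions in #{exposures} exposures"
```

The count differs from run to run. Passing a seeded generator such as `Random.new(1)` makes a run repeatable, which helps when debugging a model.
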
00:06:04.680 so um now we are experts in models and and I want to say this very very simply
00:06:10.440 um models we just modeled the world what we are doing is just coding what we see and what we know and mathematicians will
00:06:17.039 actually have large amounts of differential equations and they use Mathematica and it takes from what I hear it takes minutes and minutes and
00:06:23.160 minutes to run this but they don't have to be that they don't have to be that hard and like I said Ruby's expressive
00:06:29.039 everyone in this room even if you didn't really understand Ruby per se or you're a ruby like a like
00:06:35.460 a a neophyte you understand what's kind of what's going on here and like I said the expressiveness is the win
00:06:42.479 so let's talk about the Ruby that I like um MRI is it's great
00:06:48.000 um with 1.9.3 and the latest releases of 1.9.3 you have fast run cycles for your tests you have a lot of gems out there but the
00:06:55.740 problem with MRI is I just don't understand its garbage collector um
00:07:01.440 and I just can't get my head around threading and a couple other things so let's talk about jruby so what do I like
00:07:07.740 about jruby it's fast and you know I have to give kudos to the jruby team over there um the 1.7 release
00:07:14.759 it is way faster I mean startup is kind of slow still but you guys know this but when it gets up and humming it cooks and
00:07:22.199 here's some proof because you know we all like to use micro benchmarks to prove um all of our cases here
00:07:28.620 so um actually this slide is borrowed from um actually a slide later in the
00:07:34.080 talk where I actually showed the code of big arrays so what's really what's going on here is um you'll notice that I've
00:07:39.180 actually run it twice and the three so the arrow and the three is actually my prompt the three means I'm using MRI on
00:07:44.460 1.9.3 and the little diamond means I'm using jruby I'm using RVM to switch between rubies and the first one and the
00:07:51.960 first one you'll see that the first line of the run is actually populating I think it's a million arrays and then
00:07:58.919 querying it randomly a million times and the second time is probably populating a million hashes and querying it randomly
00:08:05.819 a million times so you notice up top it's 42 and 141 but notice at the bottom
00:08:10.979 it's 41 and 73. um I you know benchmarks micro benchmarks do lie but come on that's
00:08:17.819 twice as fast almost so I mean that's a big deal and that's why we are actually pursuing jruby for
00:08:24.300 this exercise and I know everybody likes pictures I kind of will explain what this picture means and and later on in the talk but
00:08:31.379 look at the slope of this line This slope of the line is um actually an old version of our simulator and you'll
00:08:37.020 notice and it's actually because our simulator has iterations so these are actually tracking the time of the
00:08:42.060 iterations and notice the slope goes kind of up and there's some outliers so I drew a little um trend line so you can
00:08:47.580 actually see the slope of the line and you notice towards the end I was actually getting into Ruby's garbage collection
00:08:52.740 so times were getting so that's why times are skewed off the line so same thing with jruby first thing you'll
00:09:00.660 notice is that the slope is much slower one thing you'll notice at the on the absolute left part of the graph what
00:09:07.440 that actually is does anyone know what that is take a guess of what that is on the why there's a lot of um dots on the
00:09:14.339 left side and they go up why they're not on the trend line does anyone know what that might be
00:09:19.920 what did you say that is the JIT warming up so notice after the JIT warms up it cooks I mean it
00:09:26.820 really does move quickly and you know having a nice JIT and I'll and I'll be
00:09:32.100 frank with you I have not tried this in Rubinius um no no there's there's real there's
00:09:37.140 technical reasons why I haven't tried this in Rubinius but I think even with Rubinius with a real working JIT I mean
00:09:43.440 we are getting some real performance gains and I'll tell you this code and this code was actually the same exact
00:09:49.620 code just one was running with jruby and one was run with MRI so another thing I like about
00:09:55.019 um jruby is the jvm the jvm is a lot of smart guys over a lot of years writing a
00:10:02.100 lot of neat code I don't understand it I don't understand all the ins and outs of HotSpot I don't understand all the ins and outs of garbage collection I do know
00:10:08.820 that um it uses all the memory you have um
00:10:14.160 so um right here I have one of these newfangled MacBook Pro retinas and I got it with the 16 gigs of memory I actually
00:10:20.640 can I've never in my life said I'm going to run a process on my Mac that will use all the memory on my box I just wrote
00:10:27.420 one so and another thing is um it uses all the cores and not to say that Rubinius
00:10:33.839 and MRI I'm not going to talk about Rubinius anymore because I'm not picking on Rubinius I'm not and I'm not picking on MRI but MRI can kind of use all the
00:10:41.040 cores but um good old um global interpreter lock kind of limits you to one core so let's
00:10:48.540 let's dig into that so whenever you have um things to execute on MRI it kind of
00:10:54.180 looks like this so you got so each one of those orange blocks is a new instruction so what
00:11:00.540 happens when you run Thread.new well not quite what you would hope so what happens is it does actually allow for
00:11:07.680 parallel execution it's just on one core and who here actually has a one core machine that you code on
00:11:14.339 right so it's just a waste of money so with jruby um same thing orange things are the
00:11:20.339 blocks to be or execution and you run Thread.new and hey look you potentially could be run on multiple cores I mean we
00:11:27.779 don't know this because the operating system is smarter than this but the potential is there but you know I don't want to be I don't
00:11:33.779 want to poop all over MRI so um I want to actually give a solution so if you want to run on multiple cores on who
00:11:40.560 knows how to do multiple cores who knows how to break the global interpreter lock in
00:11:45.899 MRI in a C extension well
00:11:51.540 actually no it's it's actually not that hard um so you just have to write a little bit of C
00:11:57.480 um and what and what this function does so this is C this rb_thread_blocking_region the first argument the second
00:12:02.760 argument are the first argument is the name is the actual method and the second one are the what you're going to pass
00:12:08.880 into it so whatever you run and use all the cores that core that code will actually run outside of the global
00:12:14.640 interpreter lock so I mean we I mean there are ruby gems that use this what seeks and use this but you know it's not
00:12:21.060 readily accessible you know we get this for free for easy without having to write C extensions in jruby
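
As a rough illustration of the `Thread.new` point, the sketch below splits CPU-bound work across threads using only the standard library. The function `sum_of_squares` and the chunk boundaries are made up for the example. The same code runs on MRI and JRuby; the difference the talk describes is whether those threads can execute Ruby code on several cores at once.

```ruby
# Split CPU-bound work across threads.
# On MRI the global interpreter lock lets only one thread run Ruby code at a
# time; on JRuby the same threads can be scheduled on several cores.
def sum_of_squares(range)
  range.reduce(0) { |sum, n| sum + n * n }
end

chunks = [1..250_000, 250_001..500_000, 500_001..750_000, 750_001..1_000_000]

threads = chunks.map do |chunk|
  Thread.new { sum_of_squares(chunk) }
end

# Thread#value waits for the thread to finish and returns the block's result.
total = threads.map(&:value).sum
puts total
```

The result is the same on either runtime; only the wall-clock time is expected to differ.
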
00:12:27.060 so enough about jvm so who here statistics who like statistics who know
00:12:32.700 statistics so everybody who has their hand up probably knows more than I do um but you know what I can still share
00:12:39.180 so we have a bunch of numbers and why do we use statistics we use statistics because we want to actually reason about
00:12:44.240 output and data so we have this we have this list of 10 numbers and they actually they are random so and I
00:12:50.700 graphed them using numbers on the Mac and it looks kind of like this and if you look at this you have no idea
00:12:55.920 exactly how these how this data correlates to each other so um the simple things like the simple simple
00:13:01.740 tenets of Statistics are let's look at the mean the mean is the middle value it's not the median not the middle value
00:13:07.980 but the value that would be in the middle of all of them so the mean is .42 and I shouldn't say dot 42 so
00:13:14.579 say 42 hundredths or 43 hundredths and then we look at the max and then we also we look
00:13:19.980 at the min and the most important thing is we look at the standard deviation because what we're curious about is how
00:13:25.200 much so if you're running a so if you're running a stochastic simulation and your numbers are all over the place maybe your
00:13:31.860 model is not tuned correctly so we always look at the standard deviation to make sure that the data that's coming
00:13:36.959 out at least the numbers are similar so maybe there's an accepted amount of error and inside and I'm actually surprised
00:13:44.160 that Ruby doesn't include this but there's a nice gem called descriptive um statistics that you can install and
00:13:50.399 what it allowed you to do and I actually don't don't do it this there's there's actually two ways to do this you can
00:13:55.740 actually install this gem and then you can actually require descriptive_statistics
00:14:01.019 and what it does is it actually extends core extensions and I know we don't like that so um what you can do is actually
00:14:07.980 require descriptive_statistics/safe and you can actually say so I have an array
00:14:13.200 a I can actually go a.extend DescriptiveStatistics and I actually get those methods like standard deviation min max
00:14:20.100 averages and all the things that array does not already include
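
To show the "extend one array instead of patching every Array" idea without depending on the gem, here is a hand-rolled stand-in. `SimpleStats` is a hypothetical module written for this sketch; it is not the gem's implementation, and it computes the population standard deviation.

```ruby
# A hand-rolled stand-in for the statistics methods discussed above.
# SimpleStats is hypothetical; it is not the descriptive_statistics gem.
module SimpleStats
  def mean
    sum.to_f / size
  end

  # Population standard deviation: sqrt of the mean squared distance from the mean.
  def standard_deviation
    m = mean
    Math.sqrt(map { |x| (x - m)**2 }.sum / size)
  end
end

# Extending a single array adds the methods to that object only,
# leaving the core Array class untouched.
samples = [2, 4, 4, 4, 5, 5, 7, 9].extend(SimpleStats)

puts samples.mean
puts samples.min
puts samples.max
puts samples.standard_deviation
```

Because only `samples` is extended, other arrays in the program do not gain `mean` or `standard_deviation`.
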
00:14:25.260 so another neat thing about statistics are distributions and so I was writing some Ruby
00:14:32.100 and first of all this is not Ruby so one thing you're going to learn when you're writing statistics or writing
00:14:38.160 simulations is that Ruby just does not give you everything you need actually this is does anybody know what language
00:14:44.160 this is yeah it is R and you know what you wouldn't normally see it like this you
00:14:49.260 would probably see it like this this would actually given it away really quickly this is R what this does is it
00:14:55.680 generates something that looks like this um what is this does anyone know what
00:15:01.740 this is that's a normal distribution so we'll use normal distribution so
00:15:07.019 normal distribution um I guess the canonical example is the um your professor in college you know someone
00:15:13.199 had to get an A you know the class was hard and someone had to get an A so what he will do is he would actually he would
00:15:18.839 actually redistribute everybody's grades on this bell curve so most people are getting C's and only the top few are
00:15:24.540 getting A's no matter how bad their grades were so what we do is we use distributions to actually model our
00:15:30.420 numbers so they are something that we can expect and another thing we do with my and we're talking about standard deviation
00:15:36.360 earlier so actually you can actually model um I can actually with r draw the standard deviation so I wanted it to be
00:15:43.320 um 0.5 and so it's actually one so I wanted to actually see so if I was actually
00:15:48.660 um examining this this graph for um see what output was I would actually just
00:15:54.360 only look in the gray block and the cool thing about this is that this code right here you can't do this with Ruby right
00:15:59.459 now um there's there's a project out there called Protovis which was actually I
00:16:04.560 don't know if it's still going on um I think so you think they encourage everyone no there's a project called
00:16:10.380 Protovis so Ruby people you know we get projects and we name them Rubyvis and it can actually generate graphs like this
00:16:16.620 but the problem is the people who are doing Proto um vis actually retired that project and created something great
00:16:22.019 called D3.js but we'll talk about that later so um here's another thing so here's
00:16:28.560 more here's actually another way to generate a distribution or generate a graph in um in R and right this this
00:16:36.180 right here is a beta distribution and beta distributions take um two values the two is actually the alpha so if you
00:16:43.440 look on Wikipedia and you look up beta distribution it's going to take input two inputs the two is Alpha the five is
00:16:48.839 Beta And depending on the on the um those two numbers it actually does draw a different graph or it actually does
00:16:56.220 draw different distribution so notice this one actually is more to the left and I don't know all the fancy technical
00:17:01.740 words for that so I don't want to confuse anybody so moving on there's also other types of
00:17:08.819 distributions actually there's a whole list of distributions and this one here is the Weibull and I just I only put this
00:17:13.980 in the slides because I like how Weibull sounds I just want to say Weibull all day long so Weibull with the shape one
00:17:20.880 actually generates the graph looks like this so how is this useful actually you know it um someone told me like a few
00:17:27.900 minutes ago what this is useful for and I already forgot so just know that you can do this
00:17:33.780 so um going back to Ruby because we are talking we are at RubyConf we are talking about doing simulations in Ruby
00:17:41.220 um there's actually a gem out there called distribution and you can gem install distribution and this is how you
00:17:46.500 would use it so remember that graph that I had where it was the beta distribution it kind of went up on the left side and
00:17:51.780 then came and it slipped back down to the right um we can actually generate that distribution in Ruby and it's actually
00:17:57.419 really simple code I just put in I just put in 0.2 and notice I have a 2 and the five there and it actually generates
00:18:03.660 that number 2.73 so what I'm saying is that whenever X on the graph whenever X
00:18:09.539 is 0.2 the value is going to be 2.37 and as you notice I drew a little arrow
00:18:14.880 there to actually show you that so how would we generate a graph like this from Ruby
00:18:19.980 so let's see more code so you're going to notice a little um a little thing about this talk is that I put a lot of
00:18:26.460 code in it and if you I just I just like looking at pretty color code so this is a lot of code in this talk so actually
00:18:33.120 what I'm doing here in in this right here is I'm actually generating an array that that includes all the values of the
00:18:39.000 distribution and I'm actually sampling it 1000 times so because we're using Ruby we have to
00:18:44.940 actually write we have to have to generate our graphs in Ruby and using um state-of-the-art Ruby technology I get a
00:18:50.400 graph that looks like this so remember the pretty um GNU R pretty R graph that you know went up and
00:18:57.660 down um yeah I'm just not getting this actually what this is this is um spark written by I think by Zach Holman at
00:19:04.140 GitHub and in this right here actually was is actually on the console
00:19:09.360 so it's actually ASCII and I just colored it so so this is just the um this is state of the art right here so
00:19:15.240 don't tell anybody this stuff this is I mean this is new stuff right here so once again another distribution and
00:19:22.740 actually right here what we're so what we have right here is actually
00:19:28.020 there's there's a slight type of um so when we have distributions um there's there's the there's the um
00:19:34.740 the PDF which is the distribution that I showed you before but there's also called something called the CDF which is the um
00:19:41.580 the cumulative distribution and what it what a good example of that would be so
00:19:46.620 when a woman a woman goes 40 weeks for having a baby so actually somebody could create um create an actual probability
00:19:52.440 distribution for when a when a lady is going to have a baby what the percentage is but you notice that the graph so if
00:19:58.200 we use our graph from before um notice that right here notice that it goes up and down what a cumulative
00:20:04.860 distribution does it says that you can never really go down so actually as you
00:20:10.020 near your due date the graph will actually go up and once again um using Ruby state-of-the-art technology um I
00:20:17.160 created a graph to show you that any questions about that graph
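
The relationship between a PDF and its CDF can be sketched numerically. This example assumes the Beta distribution with alpha 2 and beta 5 mentioned earlier, whose density has the closed form 30·x·(1 − x)^4; the grid size and variable names are arbitrary choices for illustration, not the speaker's code.

```ruby
# Density of the Beta(2, 5) distribution on [0, 1]: 30 * x * (1 - x)^4.
def beta_2_5_pdf(x)
  30.0 * x * (1.0 - x)**4
end

steps = 100
width = 1.0 / steps

# Evaluate the density at the midpoint of each grid cell.
pdf = (0...steps).map { |i| beta_2_5_pdf((i + 0.5) * width) }

# A running sum of density * width approximates the cumulative distribution.
running = 0.0
cdf = pdf.map { |density| running += density * width }

puts "pdf rises then falls: peak #{pdf.max.round(3)}, last #{pdf.last.round(6)}"
puts "cdf starts near #{cdf.first.round(4)} and ends near #{cdf.last.round(4)}"
```

The density rises and falls, while the running total only ever increases and approaches 1, which is the "never goes down" behavior described above.
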
00:20:23.820 so I mentioned spark earlier and if you're on a Mac and you have Homebrew
00:20:29.160 you can brew install spark um it's actually it's neat you can actually just pass it a list of numbers
00:20:35.280 and it'll just create a graph for you so and here's what I was talking about
00:20:40.620 pregnancies before and little sample code here
00:20:46.080 so another thing I want to talk about is sampling distribution so what in a lot
00:20:51.840 of cases what our stochastic models we're going to actually want to sample a distribution we're going to just want to say I want some kind of random variable
00:20:58.380 out of so my distribution describes some kind of Randomness and I want to actually just pull a variable out there
00:21:04.140 and what I did is when I wrote this gem called vose and what it does is instead of so imagine you're rolling a die and
00:21:10.679 the die is not loaded so what's the percentage so you roll a die and there's a percentage so you have you have six
00:21:16.559 things that you have six outcomes um so what this does is similar but the um the die is loaded not everything is
00:21:23.039 the same so what my Vose alias method will allow you to do is allow you to sample a distribution in in constant
00:21:29.400 time and coming in to find out so when you write web code for years and years and years you tend to not think about things like constant times like I got
00:21:36.299 caching who cares whenever you do things like this there's no caching so what this thing right here does is it
00:21:41.760 actually samples it actually just samples its distribution 100 times but it uses the Vose alias method and notice
00:21:47.100 that it's just a little simple DSL on actually rolling a die that is based on that distribution
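
The gem's DSL is not shown in this transcript, so the following is an independent sketch of the alias method with Vose's table construction, not the gem's API. `AliasSampler` and its method names are hypothetical. Building the tables takes time linear in the number of outcomes; each sample afterwards needs one random index and one random float, regardless of how many outcomes there are.

```ruby
# Alias-method sampler for a discrete distribution (Vose's construction).
# AliasSampler is a hypothetical name; this is not the gem's API.
class AliasSampler
  def initialize(weights, rng = Random.new)
    @rng = rng
    @n = weights.size
    total = weights.sum.to_f
    # Scale the weights so that the average column height is exactly 1.
    scaled = weights.map { |w| w * @n / total }
    @prob = Array.new(@n, 1.0)
    @aliases = Array.new(@n) { |i| i }

    small, large = (0...@n).partition { |i| scaled[i] < 1.0 }

    until small.empty? || large.empty?
      s = small.pop
      l = large.pop
      @prob[s] = scaled[s]
      @aliases[s] = l
      # The large column donates the mass needed to top up column s.
      scaled[l] = (scaled[l] + scaled[s]) - 1.0
      (scaled[l] < 1.0 ? small : large) << l
    end
    # Columns never paired keep probability 1.0 and alias to themselves.
  end

  # Constant time: pick a column, then keep it or take its alias.
  def sample
    i = @rng.rand(@n)
    @rng.rand < @prob[i] ? i : @aliases[i]
  end
end

# A loaded die: face 6 is five times as likely as each other face.
loaded_die = AliasSampler.new([1, 1, 1, 1, 1, 5])
rolls = Array.new(100) { loaded_die.sample + 1 }
p rolls.tally.sort.to_h
```

With these weights, face 6 should come up about half the time, and the share settles closer to one half as the number of rolls grows.
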
00:21:53.400 so let's talk about big data I mean because you know this is actually how I got my talk accepted here I think I put
00:21:58.440 big data in the talk thing so but you know what I really thought that I could
00:22:03.480 come bigger than that so let's talk about huge data and this is actually the new thing that's going to come and um so
00:22:09.840 let me talk about our simulation and you'll notice that I haven't really talked about our simulation because before when I was explaining this I
00:22:14.880 actually just jumped right in and threw a lot of code at people and I'm like I didn't tell people all the building blocks so they could actually understand
00:22:20.820 how awesome I am for writing these kind of simulations so um let's say our simulation has 100 people and the
00:22:27.600 simulation goes for 360 or 3650 iterations which will be for our for
00:22:34.020 example 100 or 10 years so if we do a little bit of multiplication we notice that we have 365 000 actions
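
The multiplication can be written out as a small helper. This assumes one iteration per simulated day and 365-day years, which is how 3,650 iterations corresponds to 10 years; `total_actions` is a hypothetical name.

```ruby
# Number of per-person actions a run produces.
# Assumes one iteration per simulated day and 365-day years.
def total_actions(people, years, iterations_per_year = 365)
  people * years * iterations_per_year
end

puts total_actions(100, 10)      # 100 people over 10 years
puts total_actions(100_000, 100) # a much larger run
```

The first call gives the 365,000 figure from the talk; the second shows how quickly the count reaches the billions as the population and duration grow.
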
00:22:41.220 so because we are big data and we're using active record this is what I thought we would do we would just create
00:22:46.919 an active record class called Observation and every time we every time we created one of those um what's that
00:22:53.880 number um 365 000 actions we'll just create a new observation so you know um that
00:23:00.240 didn't work that well so what the problem is is what if we actually have three billion actions and
00:23:06.600 this is what will happen if we run 100 000 people over a hundred years we can't actually run
00:23:12.840 um active record create three billion times I mean we just can't do it I mean it's slow
00:23:19.440 enough doing it one time so um one of the issues here is now either think so we have all this data
00:23:25.380 and actually um if I actually turn on all the logging out of the system it actually generates
00:23:31.140 over two gigabytes of data in two seconds and just think about that I mean this is something running on this box
00:23:36.480 this is not even it's not even big metal it's and it actually makes my SSD kind of whine
00:23:42.179 so it's actually goes from it actually does 400 megabits for like two seconds and you say this is kind of crazy so we
00:23:48.240 really can't use active record so you know um since this is RubyConf um I figured we would just do the easy
00:23:53.520 thing maybe we'll dump it to and we'll dump it to Couch and before I started I had one problem where I was
00:23:59.039 trying to create a simulation if I dumped it to or dumped it to Couch now I had like five problems trying to figure out does my data
00:24:05.159 actually even there so um so I you know I'm I'm outside of
00:24:10.919 the box thinker so I said I'll think outside the box so anyone familiar with these two databases VoltDB and Druid
00:24:17.640 um so these are like these new um newfangled OLAP all memory databases that have they're like really awesome
00:24:24.240 but the problem is is that um you should look at these and only put these in my slides because I actually want to show I
00:24:29.700 I don't want to hear MySQL Postgres blah blah all the time people need to look at these new these new type of
00:24:35.280 databases that are actually can do real-time data and they can actually do real-time transactions on real-time data
00:24:41.220 I'm talking about if you're an ad serving provider and you're doing let's say you're doing 100 000 Impressions per second these databases can handle it
00:24:49.500 so um so I said you know what um I'm going to be a Luddite here
00:24:54.659 I'm just going to insert it into Postgres like this so actually here's a little um there's just a little Rails
00:24:59.940 thing um you can never never you can't create a thousand Active Record objects
00:25:05.059 easily so what you what I always do is I actually go right down to the um the connection and I execute this myself and
00:25:11.340 I just build these I just build this up myself so the next problem with um simulations
00:25:17.280 is you got to worry about memory management and um once again Rails made me soft um once upon a time I was
00:25:23.280 actually a C and assembler programmer and we actually had to think about memory management but with Ruby you're like nah screw it I'm just gonna I'm
00:25:30.000 just gonna do things like this I'm going to create a billion people and then I'm going to put a billion people in an
00:25:35.460 array because you know what's the worst that can happen so um so I have this listener here and
00:25:42.120 this actually is just the example this listener is an observer so you actually see that um the the new person is
00:25:47.940 actually a callback so when the simulation runs this actually there's actually callbacks so every time a new person is created or born a person
00:25:54.779 gets appended to the people array seems seems perfectly fine to me so this is what happens
00:26:00.720 um the simulator actually runs and then you pass the listener and then you do um
00:26:05.820 you do you actually run the simulator and then you can actually look in the listener and you can inspect the people so what happens if we do this again so
00:26:13.440 let's say because I can't run the simulator one time I actually run a simulator maybe 10 to 15 times so I can actually get a good amount of sample
00:26:19.500 data so if I do this again guess what happens and actually these next few slides are going to show you something
00:26:25.440 of one of another reasons why you should use JRuby over MRI and I'm not here just advocating it but these next few slides
00:26:31.860 are awesome so anybody know what this app is and I'm sorry it's really fuzzy because it was a small image what is it
00:26:37.740 it's VisualVM and what this is actually showing and if you can just look at the look at the on the left side the y-axis
00:26:44.580 and this is actually running on this Mac of how much memory this thing uses um that was actually this is actually
00:26:50.580 only two runs of a simulation I was actually trying to do 10 runs so notice it actually caps off at the first run it
00:26:56.159 hit seven gigs or seven seven gigabytes of memory on the heap and then garbage collection came through
00:27:02.460 and it cleaned up a whole bunch of stuff but the second time it went up to almost 11 or actually went up to 11 gigabytes
00:27:08.220 and after that um so what happens when JRuby runs out of memory is it does something really really cool so if you
00:27:13.980 have four cores in your box JRuby will be like you know what you're out of memory now I'm gonna you so um
00:27:20.760 what it does is um since the garbage collector I believe the garbage collector runs in another thread and it says well you know what that garbage
00:27:26.159 collector is too slow I'm gonna run something in another thread all of a sudden your machine is screaming and all the CPUs are pegged and you can't
00:27:31.620 Ctrl-C the application anymore just have to wait so um let's let's look at this so this
00:27:36.960 is actually the other side this actually is the same image I actually just I couldn't fit it all because it's wide but you notice that um if you look on
00:27:44.039 the bottom there's a lot of GC on the bottom and what's happening there is um the people actually what's happening is
00:27:50.820 it's actually a memory leak and I'm surprised no one called that out so what's happening is I'm actually populating an array
00:27:57.419 and then creating another object but I'm never releasing anything in that array so JRuby's like I'll keep it
00:28:04.740 so what I did is a simple is actually a very simple thing is that after we run the listener after we actually do a
00:28:10.080 simulator run we just call reset and we set we set people to an empty array and we do the same thing and we get
00:28:16.980 something more sane here so notice same code only change was that reset line and
00:28:22.500 notice that what happens is so whenever it runs it just uses less memory and it actually gets rid of all the people that
00:28:27.659 it never uses why persist things that we aren't going to be using it's only used for calculations so just a little just a
00:28:33.600 little reminder this is a reminder for Ruby code we can leak memory like crazy
00:28:38.640 in Ruby code Rails proves it every single day so once again the other side of that
00:28:45.000 graph so um I've been talking for a while and I haven't talked about building a simulator yet
00:28:54.419 oh because I because I passed Dash yes because by default it's um 500 megs yeah
00:29:01.320 that doesn't work yeah it actually if I want to troll myself it yes
00:29:12.000 right you know what and it's not and it's not a fault of the language it's the fault
00:29:18.120 of me the programmer I'm leaking memory that's oh it's okay okay I'm holding
00:29:23.220 memory you know what I'm going to retract what I just said from what Ryan said I am not leaking memory I'm
00:29:28.380 actually holding on stuff that I don't need which actually makes a lot of sense
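A minimal sketch of the listener fix described above, assuming an observer object with a `new_person` callback and a `reset` method. The class names `PeopleListener` and `TinySimulator` are hypothetical stand-ins, not the code from the slides; the point is only that clearing the array between runs lets the garbage collector reclaim the previous run's people.

```ruby
# Hypothetical observer; names are illustrative, not the talk's actual code.
class PeopleListener
  attr_reader :people

  def initialize
    @people = []
  end

  # Callback fired by the simulator whenever a person is created or born.
  def new_person(person)
    @people << person
  end

  # Drop references from the finished run so the GC can reclaim them.
  def reset
    @people = []
  end
end

# Stand-in simulator that does nothing except fire the callback.
class TinySimulator
  def initialize(listener)
    @listener = listener
  end

  def run(population)
    population.times { |i| @listener.new_person("person-#{i}") }
  end
end

listener = PeopleListener.new
sizes = 3.times.map do
  TinySimulator.new(listener).run(1_000)
  size = listener.people.size # use the data for calculations...
  listener.reset              # ...then let it go before the next run
  size
end
```

Without the `reset` call the array would keep growing across runs, which is the "holding on to stuff I don't need" behaviour rather than a true leak.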
00:29:33.960 so um now on to building a simulator so um let's talk about this simulator so in my simulator I have a group of people
00:29:39.779 and I have eight people here and what the simulator does is it actually runs over a period of time so what happens is
00:29:46.320 this girl gets with this guy this girl and then this guy gets this girl this girl gets with this guy this guy gets
00:29:52.020 with this guy because that's how my simulator rolls and we actually um and we actually try to figure out
00:29:57.299 what happens and how cooties are being spread so just to show you a little example of why we're actually doing this
00:30:03.539 I prepared a short video
00:30:11.159 but I didn't want him to think I didn't trust him I didn't know I could catched on the playground we were in love so I didn't
00:30:18.600 think about it all I did is trade Lunchables every year two million kids are infected
00:30:25.980 with cooties
00:30:33.720 and two other kids Cody I just wanted to play tag I never
00:30:39.899 thought I'd be it and the numbers are growing
00:30:46.080 blame myself
00:30:54.260 and even though a vaccine is available
00:31:03.360 children never you may have cooties
00:31:14.000 speak to your kids about cooties Cody speaks to them first
00:31:19.700 what do I do now oh
00:31:29.159 you're getting a little example of my passion for actually solving or actually being able to tackle this epidemic of
00:31:35.760 cooties so back to our simulation so our simulation is actually a big loop you
00:31:40.980 could actually just think about it everything every day we just increment the day and we just run it again
00:31:46.260 so what do we do in each day so there's a there's a group of people that are actually alive in our simulation what we
00:31:53.220 do is we look at each person and we determine hey people who is ready to actually transmit cooties every day
00:32:00.000 or actually who's ready to pair up and transmit cooties so what we do is we find people who are compatible and this is what the simulation does and then we
00:32:06.960 um we group people together and then what we do is we do some complex calculations to see if cooties are
00:32:12.000 actually shared and that's a technical diagram right there so and that's actually all the simulator
00:32:17.700 does and you know what I do have code I'm just wait until later on this afternoon
00:32:23.700 um we're gonna we're actually going to I have an unembargoed version of the simulator that I think I can share with
00:32:29.640 you guys I will put it up on GitHub so you can actually see what a simulator in Ruby looks like
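Until the real code is published, the loop described above can be sketched roughly as follows. Everything here is an assumption made for illustration: the `Person` fields, the `CootieSimulator` name, the random pairing rule, and the single `transmit_chance` probability standing in for the "complex calculations".

```ruby
# Hypothetical person record: id, infection flag, first day they can pair up.
Person = Struct.new(:id, :infected, :ready_on)

class CootieSimulator
  attr_reader :people, :day

  def initialize(people, transmit_chance: 0.25, rng: Random.new)
    @people = people
    @transmit_chance = transmit_chance
    @rng = rng
    @day = 0
  end

  # The simulation is one big loop: advance the day, run it again.
  def run(days)
    days.times { step }
    self
  end

  def step
    @day += 1
    # Who is ready to pair up today?
    ready = @people.select { |p| p.ready_on <= @day }
    # Group people into pairs (random pairing is an assumption).
    ready.shuffle(random: @rng).each_slice(2) do |a, b|
      next if b.nil? # odd one out stays unpaired today
      share_cooties(a, b)
    end
  end

  def infected_count
    @people.count(&:infected)
  end

  private

  # Cooties spread only when exactly one of the pair is infected.
  def share_cooties(a, b)
    return unless a.infected ^ b.infected
    return unless @rng.rand < @transmit_chance
    a.infected = b.infected = true
  end
end

people = Array.new(8) { |i| Person.new(i, i.zero?, 1) }
sim = CootieSimulator.new(people, transmit_chance: 1.0, rng: Random.new(1)).run(10)
```

Passing the random number generator in as `rng:` keeps the stochastic part replaceable, which matters later when the model has to be tested.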
00:32:34.860 um so moving on so before you can write this like I was talking about earlier um
00:32:40.140 Ruby just does not like you putting a thousand or no actually not a thousand a million items in an array it just says
00:32:46.740 you just shouldn't do that and actually you know what we really should not um be putting a million items in an array it's just there's
00:32:53.520 um our computer science classes our data structures classes told us that there's just much better ways of storing things
00:32:59.159 and the same thing with the hash so earlier I was talking about that um benchmark that that I showed the output
00:33:05.340 for this is actually the code to make it simple once again I just populate
00:33:11.159 um an array put a million items in it and then sample and then actually um query it randomly a million times and
00:33:17.039 the same thing with the hash and to recap we notice that JRuby is faster
00:33:22.679 than MRI so um but you know what that doesn't really mean something it doesn't really
00:33:27.960 it's like benchmarks micro benchmarks are bad doesn't really mean anything in the grand scheme of things so you notice
00:33:33.120 that um this is my actually notice I got three up there this is actually a run of and probably the version of the the
00:33:39.120 simulator that I'm going to share um this is actually with a thousand people over 100 years
00:33:44.240 36,500 days and you notice it ran in 129 seconds but if you notice
00:33:50.519 um JRuby ran at about the same time so just because your micro benchmarks are faster
00:33:55.559 does not mean that it will actually double the speed of your app there's just other things going on
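The micro benchmark described above (fill an array and a hash with a million items, then read a million random keys from each) might look something like this. It is a reconstruction from the description, not the slide's code; `lookup_benchmark` is a hypothetical helper name and the timings it prints will vary by machine and Ruby implementation.

```ruby
require 'benchmark'

# Populate an array and a hash with n items, then read n random keys from each.
# Returns elapsed wall-clock seconds for each structure.
def lookup_benchmark(n, rng: Random.new)
  array = Array.new(n) { |i| i }
  hash  = {}
  n.times { |i| hash[i] = i }

  {
    array: Benchmark.realtime { n.times { array[rng.rand(n)] } },
    hash:  Benchmark.realtime { n.times { hash[rng.rand(n)] } }
  }
end

if __FILE__ == $PROGRAM_NAME
  lookup_benchmark(1_000_000).each do |name, seconds|
    puts format('%-5s %.3fs', name, seconds)
  end
end
```

Running the same file under MRI and JRuby gives the kind of comparison shown on the slide, with the caveat from the talk that a faster micro benchmark says little about whole-application run time.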
00:34:01.380 so um next thing I want to do is talk about algorithms in Ruby Ruby is just missing
00:34:07.860 a whole bunch of neat algorithms we don't have a real heap in Ruby in the
00:34:12.899 standard lib we don't have priority queues and actually those are things that we could actually use and let me
00:34:18.240 show you a demonstration of that so every day we have an event or when we have we actually keep a track of events
00:34:24.659 and every and if we put events into my um into this
00:34:30.960 into this array and we pop it off you know that that's that's kind of cool but that's not describing what we want to do
00:34:36.540 actually what we're really doing is having a priority queue so really what I want to do is I want to say add this new
00:34:42.720 event at priority 10 and then the next thing I have to do I don't have to actually go through my array to figure out what I'm doing I actually have to do
00:34:49.080 is say hey event pop off the next item and it will pop off the thing in this case with the lowest priority which will
00:34:55.500 be 10. so that's cool and all so I like to benchmark and I benchmark all the time
00:35:00.839 so here's a cool thing um I actually have an array and then I have an array where I'm inserting at a position and
00:35:07.380 then I have the implementation of this priority queue that comes out of that algorithms gem you know it's nice for
00:35:14.400 the nice DSL and the nice ability to do this but you notice look how much slower it actually is because this
00:35:20.940 implementation of the heap and the priority queue is actually coded in Ruby and this was actually a Ruby Summer of Code project I don't remember who did it
00:35:27.540 but I mean it's a great effort but look we don't in Ruby we just don't have very
00:35:33.060 fast data structures um on the tangent um Python which is another language that some of us love to
00:35:38.700 hate has NumPy they have pandas they have so many nice things Ruby um Python
00:35:44.160 because a lot of scientific communities use it actually puts an effort on making real fast data structures
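To make the priority queue idea concrete, here is a small array-backed binary min-heap in plain Ruby. It is a sketch written for this transcript, not the algorithms gem's implementation or API; `MinPriorityQueue`, `push` and `pop` are hypothetical names. Like the gem's version it is pure Ruby, so it shows the interface rather than solving the speed problem.

```ruby
# Minimal binary min-heap: pop always returns the item with the lowest priority.
class MinPriorityQueue
  def initialize
    @heap = [] # array-backed heap of [priority, item] pairs
  end

  def push(item, priority)
    @heap << [priority, item]
    sift_up(@heap.size - 1)
    self
  end

  def pop
    return nil if @heap.empty?
    top = @heap.first
    last = @heap.pop
    unless @heap.empty?
      @heap[0] = last # move the last leaf to the root, then restore order
      sift_down(0)
    end
    top[1]
  end

  def size
    @heap.size
  end

  def empty?
    @heap.empty?
  end

  private

  def sift_up(i)
    while i > 0
      parent = (i - 1) / 2
      break if @heap[parent][0] <= @heap[i][0]
      @heap[parent], @heap[i] = @heap[i], @heap[parent]
      i = parent
    end
  end

  def sift_down(i)
    loop do
      left = 2 * i + 1
      right = left + 1
      smallest = i
      smallest = left if left < @heap.size && @heap[left][0] < @heap[smallest][0]
      smallest = right if right < @heap.size && @heap[right][0] < @heap[smallest][0]
      break if smallest == i
      @heap[smallest], @heap[i] = @heap[i], @heap[smallest]
      i = smallest
    end
  end
end

events = MinPriorityQueue.new
events.push(:birthday, 30).push(:pair_up, 10).push(:recover, 20)
first = events.pop # => :pair_up (priority 10 is the lowest)
```

Both `push` and `pop` are O(log n), compared with scanning or re-sorting a plain array to find the next event.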
00:35:50.760 so moving on so once you have a simulator another thing you're going to think
00:35:57.060 about is so you're going to build a simulator and even me um supreme coder up here on the stage right now when I
00:36:02.700 build simulators the first time I run it they always have there are they're actually wrong so I don't maybe I don't
00:36:08.700 have the right amount of randomness you know so like I said a model is in and out in a little box here and um what we
00:36:15.960 need to do is we need to be able to train our simulator like train the values but inside of our our simulator to actually
00:36:22.500 turn return the right values because we already observed this or we just know that these to be the right empirical
00:36:28.200 values so um here's a graph here and what this is is
00:36:34.260 um here's what I because I've observed this this is what I expect my data to look like so infections per 1000 people
00:36:40.260 over time should look like the graph the slope of the graph like this so what we would do and and
00:36:46.339 unfortunately I really wish I could share this code with you guys but I'm just going to talk about it um but one
00:36:52.020 of the things and this is one of the value-adds that Thunderbolt is working on um we're actually working on a machine
00:36:57.180 learning project in Ruby um I don't hear a lot of people doing machine learning like actual machine learning in Ruby so
00:37:02.700 what you're doing is actually building ways to train so we can say okay
00:37:07.859 I input this I expected this but I actually got this now what I'm going to do is actually create a large Bayesian
00:37:14.099 network and do things something similar to like what SpamAssassin does or what Google does to your spam and we're actually going to traverse through the
00:37:19.619 network to see is this the right value is this the right value up if this is this value we'll just return this number
00:37:25.500 and so what we're actually doing is the computer is actually learning that whenever it has error it actually looks
00:37:31.260 at a standard deviation of the error and actually uses that to rationalize what the next value possibly could be hey
00:37:36.900 machine learning in Ruby it's not fast but it is machine learning in Ruby and you know if we need to make it faster
00:37:43.320 because we're using JRuby we'll just write it we'll just write it in Java and moving on so the last part about a
00:37:50.520 simulators you got to have them visualizations um these are graphs I showed earlier these graphs are this graph and this
00:37:56.700 graph were created with this kind of R um but we also um use gnuplot you
00:38:03.480 should learn gnuplot to create a graph like this all you would need to do is create a Digraph that looks like this and you just run and you would just run
00:38:10.920 um gnuplot on that and like I was also talking about earlier we definitely
00:38:15.960 are using d3.js I'm not ready to show this code but this lets you know that if you're doing visualizations now and I
00:38:22.140 don't care what platform you are and you're using something where you can use web really have a nice look at d3.js
00:38:28.440 so um coming towards the end of this talk um we learned a lot of lessons
00:38:33.540 um Ruby lets you iterate very quickly but the problem is Ruby is not very
00:38:38.820 quick so what do you do you write the slow parts in C++ and Java which is also the win of having JRuby
00:38:46.280 take advantage of JRuby and the JVM and only polyglots I mean if you're just
00:38:52.500 going to be Ruby Ruby Ruby Ruby Ruby only you're not going to be able to do this you're really going to have to
00:38:58.380 learn more than one language to do this correctly another thing that I want to say is that TDD is hard I don't know if any
00:39:04.440 people here have tried to TDD an implementation of an algorithm that you found on Wikipedia whenever they're
00:39:10.140 using sigmas and alphas and all that stuff that's hard just write it and then
00:39:15.240 write the test second but just test is there don't don't beat yourself over trying to be a good
00:39:21.540 developer just because you are trying to follow some kind of standard whenever all you're really trying to do is
00:39:27.300 implement a proof that's already working you just want to make just write the test second don't even I beat myself over over this constantly
00:39:35.940 um another thing is that uh Ruby lacks stats and science libraries for Ruby to be
00:39:41.220 taken seriously in this kind of space um we just need this so I mean we need the NumPy we need matplotlib that Python
00:39:47.520 has I mean I I I'm not moving this project to Python I'm really dedicated to doing this on Ruby so we will
00:39:54.240 actually we we will try to build what we can and of course we will open source what we can but we have to acknowledge
00:40:00.599 that Ruby is just not great in this space even as a general purpose language so um that's it I plan on going 40
00:40:07.320 minutes 45 minutes and it's been 44 minutes so thank you guys for not doing it all during the talk so that's it
00:40:21.599 so any questions if they're hard I will not answer them I promise you right here so you're saying that we
00:40:27.839 don't have a lot of these libraries and the problem is that we then have to implement them in either C++ or
00:40:33.420 Java or whatever so as we have different runtimes we're going to end up duplicating a lot of work like the
00:40:39.420 advantage of just using Ruby is that generally the VMs are pretty compatible so so how about this if there was a
00:40:45.300 great C extension for algorithms and some statistical stuff that I was doing I probably would not have looked at
00:40:50.760 JRuby and the reason I'm using JRuby is because actually the real simulator's written in Java and because it's just
00:40:57.060 it's just that much faster now and I can at least take advantage of all the numerical stuff that's on the JVM so
00:41:03.180 that's so we can look at the libraries
00:41:09.440 that you have being so much slower is because the code is slow not because
00:41:15.000 no um what'd you say and what Ryan said is it's it's not that Ruby sucks inherently what he's saying
00:41:21.119 is that the reason that it's slow is because it's probably the code that's in the algorithms gem it's just not very good and you know something it's hard
00:41:27.900 for whenever you code for money and like you're not coding for fun to actually sit and replace things whenever whenever
00:41:33.780 you're actually under a deadline so that that's kind of hard so Jesse so um there's been a lot of movement on SciRuby
00:41:39.960 lately um
00:41:48.359 Foundation um so they're working at basically doing the this kind of stuff like giving us
00:41:55.380 these tools so out of SciRuby the distribution gem comes out of SciRuby there's a statistics gem that comes out
00:42:01.680 of SciRuby and what I was talking about Rubyvis is part of the SciRuby what I'm saying is that
00:42:07.440 um I'm glad that and I do know that it's speeding up again it's just that it's not there yet so it's not a complaint
00:42:13.619 it's just saying this is what it is right now we will try something different so
00:42:25.700 so in a stochastic model there's there's a couple ways you can do this so he's asking me how do I test a model that has
00:42:31.800 Randomness in it um there's two easy ways the first easy way is whenever you create the model you
00:42:37.260 whenever you initialize whenever you initialize the object and at the second create an options hash and in there put
00:42:44.760 a key called rand and pass in your own random number generator that can actually only return five so you know
00:42:51.119 what the randomness is always going to be the second way you can do it is you can run it a whole bunch of times and
00:42:56.700 then sample it to see that it's actually within and like if you're using minitest like there's the within is that
00:43:02.220 what it is within you can actually use within and actually determine if it's close enough and actually that's how I use it in the
00:43:08.579 Vose alias method I just I just actually run it 10,000 times and just sample and make sure that it's close enough
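Both testing approaches from this answer can be sketched in a few lines. The `InfectionModel` and `FixedRand` classes and the `:chance` option are hypothetical names made up for the example; only the idea of an options hash with a `:rand` key comes from the talk. In minitest the tolerance check at the end would typically be written with `assert_in_delta`.

```ruby
# Hypothetical stochastic model that accepts its random source via an options hash.
class InfectionModel
  def initialize(options = {})
    @rand   = options.fetch(:rand) { Random.new }
    @chance = options.fetch(:chance, 0.3)
  end

  # True when a single contact transmits cooties.
  def transmit?
    @rand.rand < @chance
  end
end

# Approach 1: a stub generator that always returns the same value,
# so the "randomness" is known in advance.
class FixedRand
  def initialize(value)
    @value = value
  end

  def rand(*)
    @value
  end
end

always = InfectionModel.new(rand: FixedRand.new(0.0))
never  = InfectionModel.new(rand: FixedRand.new(0.99))

# Approach 2: run the real thing many times and check the observed
# rate is close enough to the expected one.
trials   = 10_000
model    = InfectionModel.new(chance: 0.3)
hits     = trials.times.count { model.transmit? }
observed = hits.fdiv(trials)
```

The stub makes individual branches deterministic, while the sampling run checks that the overall distribution is right; the two complement each other.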
00:43:17.180 that sets the random seed to 42. so I always know that I've got the same
00:43:22.800 sequence of numbers coming out that's a that's actually another good way as well sometimes has problems
00:43:29.099 across versions and another reason why I always stray from that because sometimes you'll do that and
00:43:35.819 then you'll call random somewhere else and you don't realize what the next value is and now you're off and that's
00:43:40.859 the only reason why I've been bitten by it before because I wasn't careful
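The pitfall with fixed seeds can be shown in a few lines. This sketch uses a local `Random.new(42)` instance rather than the global generator, and `draws` is a hypothetical helper; the actual numbers produced depend on the Ruby version, so the example compares sequences instead of hard-coding values.

```ruby
# Draw n integers in 0...100 from the given generator.
def draws(rng, n)
  Array.new(n) { rng.rand(100) }
end

first  = draws(Random.new(42), 5)
second = draws(Random.new(42), 5) # same seed, same sequence

rng = Random.new(42)
rng.rand(100)           # one extra call somewhere else in the code...
shifted = draws(rng, 5) # ...and every later value has moved along by one
```

Keeping the seeded generator as an explicit object that is passed to the model, instead of seeding the global one, limits how much unrelated code can advance the sequence.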
00:43:46.020 right so any other questions all right well thank you guys you guys
00:43:51.240 been great
00:44:24.480 thank you