Summarized using AI

There and Back Again

Joshua Ballanco • November 01, 2012 • Denver, Colorado • Talk

In his talk "There and Back Again," Joshua Ballanco narrates his journey of creating a delegation library, which unexpectedly led him to discover and address a bug in Ruby's Garbage Collector. He illustrates the process of benchmarking various delegation strategies in Ruby, emphasizing the importance of measuring performance to make informed design choices.

Key Points:

  • Introduction to Benchmarking: Ballanco sets the stage by encouraging programmers to benchmark algorithms to understand performance implications.
  • Open Source Contribution: He shares tips on how to effectively contribute to open source projects, highlighting the significance of finding and reporting bugs. He describes the "Four Rs" of bug reporting: reproduce, report, reduce, and regress, emphasizing that contributors should actively engage with the software.
  • The Adventure Begins: While benchmarking his delegation library, he experienced a crash in Ruby, which excited him rather than disappointing him, leading him to dig deeper into debugging.
  • Debugging Techniques: He discusses essential tools like GDB (GNU Debugger) and introduces concepts such as memory allocation and garbage collection in Ruby. He highlights various debugging strategies that involve compression of the bug-reporting process to isolate the issues effectively.
  • Finding the Bug: Through a methodical process, he refines his test case down to a minimal code sample that replicates the crash. Using GDB, he uncovers errors related to memory handling and investigates the correlation between allocations and deallocations.
  • Using Advanced Tools: Ballanco transitions to advanced aspects of debugging using tools like Gallo and other environment variables that can assist in managing memory allocation effectively.
  • Conclusion: The talk concludes with a resolution of the bug in Ruby’s Garbage Collector through the application of proper error-checking in the code that manages memory allocation. He reflects on the adventure of debugging and invites further discussion from the audience.

Takeaways:

  • Benchmarking is crucial for making data-driven decisions about performance in programming.
  • Active participation in open source projects can lead to significant improvements in software reliability.
  • The debugging process can be an exploration, requiring patience and systematic investigation to resolve issues effectively. Balanco encourages developers to engage with the Ruby community and contribute to its evolution.

Through sharing this journey, the speaker promotes not just the technical skills needed to debug software, but the broader community and problem-solving approach that defines programming within open source.

There and Back Again
Joshua Ballanco • Denver, Colorado • Talk

Date: November 01, 2012
Published: March 19, 2013
Announced: unknown

Come hear a tale of how one programmer wanted to write a delegation library. Knowing that the wise elders have always stated "it's not true until you measure it", the programmer decided to benchmark the alternatives. But this was no ordinary benchmark. Down paths winding and mysterious, the programmer wielded his trusty "gdb" and became proficient with the weapon they call "libgmalloc". Finally, this sinister benchmark led the programmer straight into the dark heart of the Ruby Garbage Collector! But fear not...this tale has a happy ending.

RubyConf 2012

00:00:15.400 all right did everybody uh make it to Jim wri's great keynote this morning all right good because uh your
00:00:22.160 mind needs to be kind of limber for uh the adventure we're about to set off
00:00:28.720 on okay so uh I titled my talk there and back again or how I set out to Benchmark
00:00:35.200 an algorithm and ended up fixing Ruby because that's basically exactly what I did and I would like to share the story
00:00:42.480 of that adventure with you okay so first of all my name is
00:00:47.840 Joshua I live and work from home in Ankara Turkey which is actually a very
00:00:53.600 lovely place and I work for a company that used to be called ELC but we're actually in
00:01:00.640 the process uh right in the middle of the process of rebranding is Burnside digital so uh if you go home from Ruby
00:01:07.960 conf and want to find out more information Burnside digital is uh the place you want to go
00:01:13.759 looking all right so like any good story this one comes with a warning this is an advanced
00:01:20.119 presentation while I do not expect you to be an expert on the inner workings of Ruby I do hope that you have at least
00:01:27.079 some familiarity with pointers memory allocation memory Pages if you're prone to motion sickness
00:01:34.200 or become nauseous at the S of heximal memory addresses this talk may not be for you and um if you want to take a
00:01:41.200 moment now to check out one of the others I would not be offended um I realized when I proposed this talk that
00:01:47.079 it was a little ambitious uh but I figured what the heck maybe uh maybe somebody else out there has as much fun
00:01:53.439 doing this kind of stuff as I do okay I I do want to make sure that
00:02:00.399 everybody gets a little something out of this talk though so I want to before we jump into the actual adventure story uh
00:02:06.159 do a little bit of talking about um contributing to open source uh Ruby or
00:02:11.840 any open source project really I'm I'm going to be talking in the context of Ruby but this applies uh pretty
00:02:17.519 ubiquitously so I've been hanging out on the Ruby cor mailing list for a while and there's a
00:02:24.160 question that pops up occasionally you get some uh gung-ho programmer discovers Ruby or you know gets really excited
00:02:31.400 wants to contribute how can I contribute to Ruby they say right let me let me at those bugs well one thing that I found
00:02:39.800 is the best way that you can contribute to Ruby or any open source project for that matter use
00:02:46.440 it don't go looking for the bugs use Ruby and the bugs will find
00:02:53.480 you and that's actually a a sort of a more efficient way of going about contributing to a project because after
00:03:00.480 all we write software not to write software we write software to run software so the more you run it the
00:03:06.120 better it is going to be now let's say you've been using Ruby
00:03:12.080 and you happen to stumble across a bug what next well I want to introduce you
00:03:17.440 to the four RS of bug reporting these are things that you should do if you want to be a good open-source
00:03:24.280 citizen so you want to reproduce report reduce and regress
00:03:31.879 okay let's look at those a little closer so this is an often overlooked uh step
00:03:38.599 in in reporting bugs but the first thing that you need to do if you think you found a bug is make sure you found a
00:03:46.120 bug a bug that happens once is not a bug that you can really go about
00:03:51.760 fixing so the first thing you want to do is run the code again and see if it
00:03:58.400 crashes or if whatever the bug happens to be reproduces can you make it occur every time now not every bug is 100%
00:04:06.439 reproducible once you get into things like threading and memory allocation maybe it's only 60% 50% it should be
00:04:13.360 more than zero though if you can't reproduce it you're not going to be able to fix it in the process of reproducing it
00:04:20.680 there are a number of things that you need to be aware of Ruby can be uh influenced by a number of different
00:04:26.280 configuration files an irbr RC a gem RC very ious other RC's the environment
00:04:32.240 that you're running it in the particular Shell Shell configurations environment variables all of these things can have
00:04:37.960 subtle effects on the way that Ruby is running so make sure when you're trying to reproduce this that you pay attention
00:04:44.880 especially uh lately a lot of people have been using rvm and rbn and these
00:04:50.800 also do a lot to the environment that may alter it in such a way that uh it
00:04:56.160 affects whether or not the bug is going to occur the best thing that you can do if you're trying to reproduce a bug is
00:05:02.280 ask a friend get one of your buddies on chat or whatever and say hey I ran this
00:05:07.720 code I think there's a bug can you run it do you see the bug if two people see the bug it's
00:05:13.960 probably a bug okay so once you're sure you have a bug The Next Step report it
00:05:20.240 again something that is uh quite frequently overlooked you know you have newbie programmers come into something
00:05:27.720 like Ruby or a library that out there and they're using it and they say oh that looks like a bug but I don't know
00:05:34.960 somebody else must have found it right I don't need to report it no if you find
00:05:40.800 the bug report the bug duplicate bug reports are not a problem
00:05:46.840 actually H some people might complain about them but honestly duplicate bug reports are a good thing they're a good
00:05:51.919 thing for two reasons one if a particular bug is reported and again and
00:05:58.120 again then that means means that people are obviously encountering this bug this is an important bug this is how open
00:06:04.000 source projects prioritize their fixes the bugs that get reported more often are fixed
00:06:10.440 sooner the other reason duplicate bug reports are good is because not everybody is going to hit the bug the same way and one of the most challenging
00:06:18.080 things about fixing any bug is narrowing down the specific set of criteria that are needed to trigger the bug so if you
00:06:26.080 have multiple bug reports you can look for the commonalities between them to help zero in on what the actual bug
00:06:33.199 is okay so Ruby has a bug tracker bugs. Ruby d.org when you're reporting it make sure
00:06:40.759 you include all of the steps it's really good if you can actually just attach a little script and say run this script
00:06:47.479 you'll see the bug that's great make sure you make a note of what Ruby version you're using it should go
00:06:54.400 without saying that not not every Ruby version has the same set of bugs so if you found bug in say 193 it may already
00:07:02.599 be fixed in 2.0 so make sure that you report the version and if the bug that you're
00:07:08.560 reporting happens to be a crash make sure you attach the crash log
00:07:13.759 now if you're on OS 10 oh and by the way I should mention that this talk presumes you are on OS 10 a number of the tools
00:07:20.240 that we're will'll encounter later are on OS 10 there are equivalance on other operating systems but just so that
00:07:26.879 caveat is out there um so on OS 10 there's this app console. apppp uh if
00:07:32.720 you don't find your crash log in your terminal you can go to console app and there's this little section it's kind of
00:07:38.879 hard to read but uh it says user diagnostic reports and there you will see all the various programs that have
00:07:44.360 crashed and their crash logs and you should see one that says Ruby uncore and then it'll give a date so you just look
00:07:50.039 for the one that corresponds to the date that the crash actually happened and you get the full output of everything that happened there
00:07:57.479 okay next step now those first two steps those are kind of the bare minimum for
00:08:03.199 reporting a bug right make sure it can be reproduced and then actually go and report
00:08:08.720 it now if you want to be a really good open source citizen the best thing you can do is to reduce the
00:08:15.639 bug so if you find if you find a bug running a rails app right and you go and
00:08:21.199 you say well here's a 100K line rails app and I was running it and I did these 30 things and then I found a bug oh okay
00:08:29.840 that's a bug but it's not going to be horribly easy to reproduce right so try
00:08:35.200 as much as you can to eliminate the need to include any extraneous libraries try and zero down on just the code that's
00:08:42.399 crashing simple scripts are the best thing that you can have five or fewer lines I know that sounds really really
00:08:48.880 small but seriously if you can get it down to five lines of Ruby or less that's the ideal case and aside from
00:08:55.920 making it easier for whoever's going to try and fix this bug to understand we'll see a little bit later that some of the
00:09:01.839 tools that you're going to use to try and fix the bug depend on the bug being
00:09:07.240 a small test case and finally if you're feeling
00:09:13.320 really gung-ho and you really want to you know be a really really good open source citizen try and regress it so
00:09:20.560 Ruby has multiple different versions multiple different patch levels of those versions uh you can try all of those and
00:09:27.279 see does it occur on 187 does it occur on 193 does it occur on 20 what patch
00:09:32.600 level of 1 n if you're really ambitious get bisect is an amazing tool that you
00:09:38.839 can actually just set to run automatically and find the exact commit that caused your bug that's really
00:09:44.560 helpful doesn't always work but if it if it does um and finally if you really are
00:09:50.800 ambitious you can also try different build settings try different optimization levels um but that's that's
00:09:56.720 kind of almost uh going a little too far at this point okay I said four there's actually five
00:10:04.640 if you want to be a really good open source citizen try and fix
00:10:10.200 it and so that brings us to our story so a long time ago about six
00:10:17.640 months maybe a little longer in a far away land called turkey uh I was working
00:10:23.480 on a delegation library now yes I know this isn't technically what delegation is we've
00:10:29.279 you've all been lectured on you know the delegation pattern is not what you're doing in Ruby n all right when I say delegation what I mean
00:10:36.279 is I have an object that I'm going to pass to the initializer to a second object and I want methods on that first
00:10:41.880 object to be called when I attempt to call them on the second object right
00:10:47.000 there's a number of ways that you can achieve this in Ruby so there's what I was calling pre-
00:10:53.240 delegation where as soon as I initialize the object I'm going to take all of the methods on the second object and hoist
00:10:59.920 them into the object that I'm creating using defined Singleton
00:11:05.240 method then there's what I was calling post delegation where I don't Hoist the
00:11:10.399 methods until the first time they're called so I rely on method missing the first time you call a method and then
00:11:15.920 after that I'm going to Hoist the method into my object and from then on it'll be called as if it was a method on this
00:11:24.440 object then there's on demand where it's basically just going through method missing every single
00:11:30.839 time and then there's this complicated Beast that um I was actually interested in a delegation library that would allow
00:11:36.680 me to do quick calls to the delegate but also swap out the delegate at runtime so
00:11:42.519 I came up with this crazy scheme where I undf methods when I decide that I want to reset the delegate and all the rest of that right so uh that
00:11:50.560 was uh and then the next step bench market right how do I know which of
00:11:56.360 these strategies to choose well run through a couple of different scenarios find out what the runtime looks like
00:12:03.000 that'll help me make an informed evidence-based decision so that's what I did um now I
00:12:11.639 ran through those kind of quickly I uh put up a couple of uh blog posts on our company blog describing the backstory in
00:12:17.120 a little bit more detail why you want to Benchmark and how I came up with those various
00:12:22.720 tests but as I was running my delegation Benchmark when I got to one of the later
00:12:30.240 benchmarks that was putting a lot of stress on the system all of a sudden bang
00:12:36.920 crash Ruby crashed I found a bug now I don't know about you guys
00:12:44.480 maybe you get annoyed when you find a bug I get excited I think yes all right
00:12:52.760 so okay so going through those four RS right before I reported it even um I I
00:12:58.639 was able to produce it that was not a problem every time I ran that Benchmark it crashed uh before I reported it I
00:13:04.360 wanted to reduce it because this is actually I know you can't read this uh you're not supposed to this is the entire Benchmark that I was running
00:13:10.560 right and and that's a lot of code uh so the first thing I did is uh those blocks at the bottom or all the different
00:13:15.680 benchmarking runs that I was doing it was only the last one uh that was finally causing the crash so I said okay
00:13:22.680 what if I take out all of the Benchmark runs before that right will that last Benchmark run on its own still trigger
00:13:28.480 the crash and yep it did so then I said okay well it's crashing on the first of
00:13:35.519 the methods that I'm benchmarking in that last Benchmark method so let's let's eliminate all the other delegation
00:13:41.399 strategies that I was trying let's just go with this pre- delegation one and see if just running that in that last
00:13:47.160 Benchmark causes a crash and yep it certainly does okay so at this point I don't need
00:13:52.639 the Benchmark library or I suspected I didn't need the Benchmark Library anymore I could just Hoist the code out
00:13:57.720 and run the uh the method that was actually being benchmarked on its own see if it still crashed and yep it
00:14:04.480 did now when I was doing this Benchmark in this inner loop here right so I've got the 10,000 times do this
00:14:10.560 initialization and then 10,000 times call a method right I was actually calling three different methods in The
00:14:15.800 Benchmark that was triggering the crash and I wanted to know do I actually need to call all three of those what if I
00:14:20.880 just call the first one so the first method I called in The Benchmark was the length method so let's see if I do this
00:14:27.720 is it still going to crash nope that didn't crash and this is kind
00:14:33.920 of the art of creating a reduced test case you sort of vector towards the minimum amount of code that will trigger
00:14:40.720 the crash and as you do you'll probably end up eliminating the crash at some point so then you got to backtrack and
00:14:46.399 say okay what can I do differently right so the second method that I called of those three was reverse and I said what
00:14:54.160 if I call just reverse right do I actually need to call all three or is this one enough and yep that one did
00:15:00.519 cause a crash so the next thing I wanted to know is do I actually need to hoist all of
00:15:07.360 the methods from my delegate onto the object that I'm created or is it enough for me to Simply Hoist the reverse
00:15:13.279 method the one that I'm calling right and so I eliminated that iteration through all of the methods and just did
00:15:20.320 that single defined Singleton method with reverse and yep that still crashed so at that point um I was pretty
00:15:28.279 much at the minimum amount of code that I I could have that would still trigger the crash the last thing to do is to
00:15:33.720 play with the number of times of iteration and so you'll notice that I only needed to run the outside Loop a
00:15:39.240 thousand times the inside loop I still needed to run 10,000 times in order to trigger the crash and so now I have my
00:15:46.319 minimally reduced test case that this is the minimum am amount of code that I
00:15:52.040 need to trigger the crash at this point I went and I filed the bug and then I
00:15:57.800 went to sleep because it was probably about 2: a.m. on a Friday night but then I woke up on Saturday and
00:16:03.480 I couldn't stop thinking about it I said I want to fix
00:16:09.040 this all right so before we set off on this adventure
00:16:15.160 we need to pack our bags if you're going to debug Ruby you need Ruby so going at
00:16:20.680 the source code conveniently it's mirrored to GitHub so it's actually really easy for you to get now you don't
00:16:26.519 have to worry with subversion checkout commands and go ahead and build it uh over the years building Ruby has
00:16:33.720 become incredibly easy in fact I do believe you don't even need that first configure step anymore you can just
00:16:39.440 simply type make and it'll run configure for you before you actually uh make um
00:16:45.240 if you want to play around with build settings configure d-el has uh a lot of information about different build
00:16:50.600 settings uh we'll see some of them a little bit later all right so make sure that you've got the Ruby source code and
00:16:56.920 you can build the Ruby source code then study the
00:17:02.240 map again I don't expect you all to know uh Ruby internals but there are a couple of things that are useful to know if
00:17:08.600 you're going to head off and start debugging Ruby um I I do suggest that um
00:17:14.600 I'm forgetting the name of the book but uh the book that uh was discussed in the session yesterday uh Ruby internals uh
00:17:22.559 Ruby under microscope thank you yes uh that sounds like it it would be an excellent resource if this is the kind
00:17:27.760 of thing that you want to get into but at a bare minimum things that you should be familiar with uh inside of the Ruby
00:17:32.960 Source directory you'll find this file include Ruby ruby. this is the public cap API but it's also a lot of
00:17:38.919 convenient methods that you can use when you're attempting to debug things when you look at Ruby Source
00:17:45.720 you're going to see a couple of uh values that are repeated over and over again one is ID and an ID is essentially
00:17:52.600 just a symbol uh then value which is essentially everything else Ruby is is
00:17:59.039 dynamically typed and C is not so Ruby sort of fakes out C by calling
00:18:04.840 everything a value uh then there are a couple of methods good methods to know RB id2 name
00:18:10.960 this actually will take an ID and it will dump it out as a C string it's important that you have a C string
00:18:16.679 because we're going to be in a second uh using GDB and GDB only knows about C
00:18:22.480 strings it can't print a ruby string but it can print a C string so uh RB id2 name turns an ID into a C string so you
00:18:29.039 can print it out in GDB uh likewise RB string value seaster uh is essentially
00:18:34.799 like two string um but it's going to as the name apply implies uh output a C
00:18:40.360 string so that you can view that in GDB and finally as we're going to be looking at some uh back traces VM call
00:18:47.640 method is an important one to pay attention to because this is how methods get called in
00:18:53.400 Ruby okay and finally make sure you bring your Gardener because what an
00:18:59.000 adventure without your Gardener uh and in this case that's mini Ruby so if you build Ruby and pay
00:19:07.280 attention to the build log you'll notice the first thing that happens when you build Ruby is that you actually build mini Ruby mini Ruby is a sort of a
00:19:14.760 bootstrap uh although it's not exactly what you would consider a traditional bootstrap it's actually a very capable
00:19:20.520 Ruby executable just without a lot of the extensions and uh standard Library parts so it's smaller uh it still has
00:19:28.120 the lecture part parser interpreter VM runtime all the core library or um core classes uh so if so so long as your bug
00:19:36.760 is in one of those core Library classes you actually don't need to run Ruby to debug you can run mini Ruby and this is
00:19:43.640 convenient because mini Ruby builds a heck of a lot quicker than Ruby does and if you're going to be you know putting in printf statements and all the rest of
00:19:50.159 that it's nice to be able to rebuild quickly okay so we're ready let's set
00:19:57.760 off the first weapon that you need to get proficient with if you're going to be debugging anything in Sealand is
00:20:05.240 GDB and the way that this works is you simply give GDB the executable to run
00:20:10.720 which in this case is going to be mini Ruby what what happen is GDB will start
00:20:16.760 up and then it'll stop and it'll wait and that's so that you can do any sort
00:20:21.919 of uh breakpoint setting that you need to do before you you uh begin running the the actual program uh enironment
00:20:28.960 variables we we'll see all of this in a second then the arguments that you would normally pass to mini Ruby right so uh
00:20:36.760 if you're going to have a script that exhibits the bug normally you would do
00:20:42.200 mini Ruby and then pass the name of the script on the command line right well because we're inside of GDB instead of
00:20:47.799 passing it on the command line you actually pass it as an argument to the Run command in GDB and that'll pass it
00:20:53.320 through as an argument to mini Ruby and then in this case because we had a crash we wait for for it to
00:20:59.559 crash so here's what that looks like okay so we're starting up GDB on Mini
00:21:04.960 Ruby it's going to give you some introductory information copyright notices you know this is gdbs compiled
00:21:10.799 by blah blah blah none of that really matters and then for a a first go let's
00:21:18.000 just run our program now I I have this tradition I like to name my bug scripts bug. RB so I don't have to remember the
00:21:24.400 name I just always do a run bug. RB and so that's now going to tell us
00:21:30.000 it's going to start the mini Ruby program executing with the argument bug. RB it's loading all of our libraries
00:21:36.159 shared symbols and there is our crash so we crash in tiny free list
00:21:42.840 remove pointer what the heck is that and how the heck did we get there all right well the back Trace
00:21:50.000 command or BT for short will show us the stack Trace at the location of the
00:21:55.480 crash and in this case we're this is actually much longer so I I like I said I tried to fit as much as I could onto
00:22:02.360 these slides without making it crazy so the the output is actually way longer uh if you do this on your own um but if we
00:22:10.000 look backwards through our stack frames a little bit back back back eventually we'll see there is a VM call
00:22:16.360 method and if you've looked at the call Signature for VM call method you're going to see that uh one of the
00:22:22.120 arguments that it takes is uh Rec V which is the the target of the call and
00:22:27.360 so uh G wouldn't it be nice if we could figure out what uh what object we were actually calling a method on right this
00:22:33.200 is just you know trying to figure out which direction we're going right what what happened at the point of the crash
00:22:38.600 so we use our RB string value sear with the memory address of the receiver
00:22:46.279 and crash we got a crash inside of our
00:22:53.799 crash So This Time ruby is telling us that we can't object allocation during
00:22:59.559 garbage collection what well go back to the back Trace that we had before and
00:23:06.120 you'll notice between the method that actually triggered the crash and the VM call method at some point we were
00:23:12.520 triggering a GC lazy Sweep All right so this crash is
00:23:17.600 actually happening inside of the GC routines and if you study some of Ruby's internals you'll know that once you're
00:23:23.919 inside of the GC routines there's not a lot that you can get away with right so you could get around this you
00:23:32.320 know the the RB string value Seer method actually goes in and and looks at the structures and uh you know figures out
00:23:40.400 uh what the name is based on values stored in in different structures and pointers and you could trace that down
00:23:45.640 on your own but uh you know and I did but uh it didn't really actually tell me
00:23:51.919 anything all right so let's go back to the message that we got in our terminal the first time that we saw the crash
00:24:00.320 it said incorrect check some for freed object object was probably modified after being freed set a break point in
00:24:06.679 Malo error break to debug now this is something that os10 does uh Malo error break is a method
00:24:14.159 that doesn't do anything it simply exists as a method for you to set a break point on if you're attempting to
00:24:20.480 debug something that is a error in malok right so so let's go back to GDB start
00:24:27.320 it up again this time before we run the program let's set the break point now you're notice GDB does this weird thing
00:24:33.600 it tells us you're breaking on some method I don't know what this is right it's because the libraries haven't been loaded yet so it says uh should I should
00:24:40.679 I wait and see if this method shows up and you say yeah okay go ahead and then you run the program right and then it
00:24:46.200 says oh yeah there's maloc error break okay so the break point is set and then we're going to run run run and crash in
00:24:54.120 the exact same place so setting that uh breakpoint didn't exactly help
00:25:03.679 anything okay enough with GDB time to turn to some more advanced
00:25:09.520 Weaponry so if you go into the os10 man page for malok you'll find that there
00:25:15.840 are a whole collection of really interesting environment variables that you can set that modify the way that
00:25:21.120 malok works one of those is called Malo stack logging no compact all right it's fun to say what
00:25:28.960 happens when you set Malo stack logging no compact is that every time there is an allocation or a free of memory right so
00:25:37.480 every time memory is allocated or freed the system is going to look at the stack
00:25:43.200 at the point that free or allocation happened and record it to disk and then
00:25:48.880 there's a tool that you can use Malo history you can go in and you can retrieve those stack traces right and
00:25:55.440 this is the history for that memory address for the entire run time of the program now first thing we're going to
00:26:01.159 need to do before we uh set off in GDB again and actually probably could have done this a little bit sooner is go back
00:26:07.120 to that configure make step and make sure that we are making with our optimizations set to zero and our
00:26:13.799 debugging symbols included so that we can actually see all the memory addresses because if you go back and
00:26:19.360 look at those uh back traces before you'll see a number of them it said inlin or you know value not available
00:26:24.559 because of optimizations and stuff like that so recompiling at optimization level Z will allow us to see all of the
00:26:30.480 memory addresses that we need to see okay so start up GDB again before
00:26:37.360 the program runs set the environment variable Malo stack logging no compact equals true right and in GDB it's
00:26:43.360 confusingly set environment instead of Set n because whatever um
00:26:49.520 run and you'll notice as soon as we start running we start getting some logging messages Malik is telling me
00:26:54.679 okay I turned on your stack logging I'm not going to compact the stack logs but your dis is going to fill up really
00:27:01.559 fast and this is why it's so important to have a reduced test case because as soon as we get into these debugging
00:27:07.600 tools they stress the system pretty significantly okay so we let it run and
00:27:14.799 pay attention to the process ID that it reports to us for what we're running because we're going to need that when we
00:27:20.760 go to use Malo history and there's our crash again same crash as always okay so now when we look at the
00:27:27.600 back Trace we going to go and and these methods appear because we're optimization level zero now right before
00:27:33.760 the Ruby x free and vmx free were all in line so we didn't see them so we can actually see the pointer that Ruby is
00:27:40.159 attempting to free when this crashes and so now we can take that pointer we can take the process ID from the screen
00:27:45.679 before we can go Malo history process ID memory address and Bam we get this whole
00:27:52.080 and I mean by the way this is like 20 30 pages of this I just showed the very end of it right every time this memory
00:27:58.480 address was either allocated or freed anywhere in the Ruby running process it
00:28:03.679 records the stack right and and it reports the stack with pipes so you can see that this is the last time that it
00:28:09.760 was allocated in freed you see the last time that it was allocated and freed it was a start main Ruby run node all the
00:28:14.799 rest of that that that and actually if you look the free one actually matches the back Trace that we got for our crash
00:28:20.640 because this is our crash right we're crashing as we're freeing this block now
00:28:26.200 what you can do is you can go back um so at some point there was a problem with an allocation or a free right the the
00:28:32.080 bug that told us that somebody was touching our memory inappropriately right probably after it was freed so we
00:28:38.840 can look through this log and we can look for things like for example
00:28:44.279 um you can make sure that every allocation in free is matched up right so there are different things that Ruby
00:28:50.080 allocates in freeze there are procs and blocks and all the rest of that right so uh you can see here the last allocation
00:28:56.120 was a proc new that alloc it and then a proc free that freed it right so this matches up and actually if you go back
00:29:02.080 through the history um it matches up the whole way back H that didn't actually tell us much
00:29:08.720 of anything it's kind of cool to look at but we need something else we need
00:29:16.120 gallo and you can read the man page for G Malo I suggest you do it's actually kind of entertaining but I think the
00:29:21.880 first line summary is enough to give you an idea says guard malok is an
00:29:27.120 aggressive debugging malok Library let's see how it looks okay so
00:29:33.640 this is your memory you've got pages and normally what Malik is going to do is it's going to give you a little bit of
00:29:39.080 memory from a page and another page and then it's going to go to the next page and so it's allocating your memory and
00:29:44.320 when you free it's going to Mark something as freed and then it's going to allocate some more and then maybe you come along and you want another uh a
00:29:51.159 pointer and it's got its free list of uh pointers that have been freed so it recycles that memory and gives it back
00:29:56.399 to you right this is how Malo normally works this is how your system runs every day Galo does something interesting when
00:30:05.399 Galo allocates memory it allocates it on the edge of a page boundary and then it
00:30:10.960 marks the next page as unusable so what this does is for example if you were working in C and you
00:30:18.519 allocated enough space to have say an array of 10 inss and then you went and attempted to access the 11th member of
00:30:25.480 the array right that's a bug but but normally c will not complain it'll just
00:30:30.760 give you whatever is at that memory address right C doesn't care so what GMA does is to catch those
00:30:37.600 kinds of bugs it ensures that you cannot access the memory after the end of the
00:30:43.600 memory that you've allocated right so the next allocation is going to happen on the next page boundary and then the
00:30:50.399 next page after that is going to be marked off do you see how quickly we're using up memory again this is why it's so
00:30:57.399 important to have a reduced test case G Malo really stresses the system and on top of that the other thing that g malok
00:31:03.960 does that's particularly important for our case is that when you free memory it
00:31:09.080 doesn't return it to the system it marks that page as unusable as well so if you attempt to use memory after you freed it
00:31:15.840 you will crash so the whole point of G malok and the man page explain this explains this
00:31:21.760 right is your program Should Crash It Should Crash immediately where the bug
00:31:27.519 occurs right the reason that we are having such a hard time finding this bug is because
00:31:32.639 somebody allocated memory somebody freed memory somebody touched the memory inappropriately then somebody else allocated the memory somebody else freed
00:31:38.320 the memory and that's where the crash happened right the crash happened all the way over here but the bug was somewhere in the
00:31:44.919 past so G should help us catch that so here's what we do we start up GDB again
00:31:51.720 we have to set our uh Malo stack logging no compact we want to know what happens on all of these allocations we said uh
00:31:58.399 this dial insert libraries environment variable which on OS 10 tells the dynamic loader use this Library not the
00:32:03.760 one you normally use and we tell it to use Li G Malo and then we run and again we're getting all these
00:32:10.519 nice wonderful errors about or warnings about how how much our program is going to suck up all the resources of our
00:32:16.559 machine and now when we crash it's a completely different
00:32:21.720 crash this is something this looks like it could be the location of our bug
00:32:27.919 we're setting a mark on this me which
00:32:33.039 stands for method entry on cfp which is the not current frame pointer Matt
00:32:39.480 corrected me yesterday but I forget it was anyway so we can print out what is
00:32:45.440 cfme and there's the memory address okay so we have a memory address that's been
00:32:50.679 inappropriately touched we have our process ID which we recorded before so now we go in we do Malo history we give
00:32:57.440 the process ID we give the memory location and it tells us the last time that it was freed right so somebody
00:33:03.960 freed this memory before they should have because somebody else is attempting to mark it the GC Mark phase is
00:33:09.039 attempting to mark this memory is still in use but it's already been freed who
00:33:14.120 was it turns out it's that guy free method entry eye now if you go and you look at
00:33:20.960 the source for free method entry ey turns out it's one line should have been three lines because it didn't check to
00:33:29.320 see whether or not the method entry that it was freeing was already marked so all you have to do is add that
00:33:36.720 guard so that we don't free method entries that have already been marked by the GC and the bug goes
00:33:43.600 away all right and that's it got a little bit of time for questions so
00:33:56.480 thanks yes
00:34:05.320 um how did I well because I actually spent a good 48 hours straight looking at almost every method in the stack and
00:34:12.760 uh trying to figure out who's marking the method entry and when are they marking it and when are they freeing it and tracing through all of the methods
00:34:18.800 so no I mean it's a fun adventure um if I told the whole story we'd be here till Nightfall so uh um yeah I I kind of
00:34:27.040 shortcut that that but took a little bit of our what's that did you get blame it to find out who uh yeah the same person that fixed
00:34:34.440 it is the one who broke it so yeah if we were sistic enough to want
00:34:42.119 to reproduce this is the blog post and everything you mentioned before list the Ruby version all that um well not uh
00:34:50.480 that blog post isn't up yet I'll actually be posting my slides yeah I forgot to mention I'll post my slides so
00:34:55.560 that uh because I know the text is kind of small so you can uh go through it at a more leisurely Pace um turns out as
00:35:02.280 far as Ruby version you can actually run the most recent 193 because it didn't get back ported to the most recent patch
00:35:08.800 level so it's only fixed in Ruby 2.0
00:35:16.640 whoops any other questions all right well thank you for
Explore all talks recorded at RubyConf 2012
+46