Summarized using AI

Laser: Static Analysis for Ruby, in Ruby

Michael Edgar • September 29, 2011 • New Orleans, Louisiana • Talk

In this talk, Michael Edgar presents Laser, a tool designed for static analysis of Ruby code, utilizing Ruby itself for implementation. The focus lies on addressing the challenges presented by Ruby's dynamic nature, enabling programmers to perform effective static analysis and error detection.

Key Points Discussed:

  • Static Analysis Definition: Laser aims to find properties of Ruby programs without execution, focusing on optimization and error discovery.
  • Challenges with Ruby: Ruby's dynamic grammar and lack of type annotations complicate static analysis. However, Edgar illustrates that meaningful insights can be drawn from Ruby code with proper techniques.
  • Examples of Analysis: Several analytical questions are tackled, such as identifying code that never runs, unused variables, method effectiveness, potential exceptions raised, and method behaviors related to blocks.
  • Metaprogramming and Flow Analysis: These two aspects are emphasized as major challenges. Metaprogramming complicates the analysis, while flow analysis is vital for accurate results.
  • Flow-Sensitive Analysis: Laser’s approach includes creating a control flow graph (CFG) and using static single assignment form to track variable assignment and usage effectively.
  • Error Detection: Laser is able to identify a variety of errors and potential risks such as unused methods and variables, misused blocks, and incorrect return types.
  • Performance Concerns: While Laser demonstrates valuable static analysis capabilities, performance remains a concern, especially when running extensive codebases due to its interpretive nature.
  • Future Improvements: Edgar outlines challenges ahead, such as enhancing the tool for broader Ruby applications like Rails, and optimizing the control flow graph for better performance.

Conclusion and Takeaways:

Laser represents a significant advancement in Ruby static analysis, supporting programmers by providing deeper insights and bug detection without compromising Ruby's expressiveness. Moving forward, it aims to enhance Ruby's viability in modern software development through improved static analysis integrations.

Laser: Static Analysis for Ruby, in Ruby
Michael Edgar • New Orleans, Louisiana • Talk

Date: September 29, 2011
Published: December 13, 2011
Announced: unknown

What truly makes Ruby special as a language is its focus on expressivity, flexibility, and dynamism. Yet these same properties - and their widespread use in the community - make even straightforward application code difficult to analyze statically in a meaningful way. Laser seeks to change that. As a general-purpose analyzer using traditional compiler techniques, it statically discovers properties of real-world Ruby programs that no existing tool can. This talk is a whirlwind tour of what Laser can do, how it does it, and what it means for a typical Ruby programmer (who doesn't want to litter his or her code with type annotations). Among the questions it attempts to answer: * What code never runs? * What variables are unused? * What code runs, but has no meaningful effect? * What exceptions might a method raise? * Are blocks required, optional, or ignored by a method? * What variables are constant? * What methods get generated (or removed) by loops at load-time? * What types are being used, and where? * What gets added to a class by calling a method like acts_as_list? Most importantly, Laser uses this information to find bugs and tell you about them, in addition to warning you about potential mistakes. It has a clear integration path with YARD and Redcar, as well as a possible future in optimization. On a broader scale, Laser exposes and builds upon the underlying strength and regularity of Ruby and modern Ruby techniques, without restricting Ruby's natural expressivity through static typing.

RubyConf 2011

00:00:17.000 good afternoon everyone uh thank you all for coming uh my name is uh Mike Edgar and today I'm going to be talking about
00:00:22.400 laser uh which is static analysis for Ruby in Ruby um so a little bit about me
00:00:28.160 um y just told you my name um the website is carbon. uh I am not Canadian I just like
00:00:35.040 the name um and Laser which is going to be talking about today was my undergraduate thesis uh Dartmouth
00:00:40.879 College and nowadays I just started as a Google engineer uh so what is laser laser is a
00:00:48.280 uh attempt at uh building real static analysis uh linting for Ruby and it has
00:00:53.760 two goals which are pretty actually quite distinct which is to find errors in your code or potential errors and
00:01:00.199 just general style issues uh a key part of that is that because Ruby has such a complex grammar you need a pretty
00:01:05.840 Advanced tool to find style issues so what's static analysis really about static analysis at its core is
00:01:12.880 finding properties of a program any kind of uh non-trivial properties without running it this has a lot of uses uh
00:01:18.920 biggest one the most uh important one today has been optimization um static analysis is why uh um all of our
00:01:26.840 mainstream programming languages have gotten so fast uh it's very useful for error Discovery this is where it's really becoming uh where it's been more
00:01:33.240 explored these days uh the clang team has an awesome static analyzer for CN C++ code and just for General
00:01:40.360 statistical analysis finding out things about large corpuses of uh code in a given language and static analysis is
00:01:47.479 for the most part quite hard um these are a few of the problems that we might try to solve using static analysis um uh
00:01:54.880 finding Loops finding what constants are in a program doing type inference which has been a you know some focus in Ruby
00:02:00.880 but not too much exceptions um you see this in Java actually it's part of the spec you need to do uh exception
00:02:06.920 inference um dead code elimination per inference these are all examples of static analysis and uh for the most part
00:02:13.560 they're all undecidable uh with you know without any Aid to the compiler um which
00:02:19.800 means that we are doomed to fail in terms of actually uh doing a perfect job
00:02:25.120 of all these things now a lot of you probably wondering why are we talking about static analysis and in Ruby it's
00:02:30.599 an incredibly Dynamic language anything can change at runtime um and of course
00:02:36.040 we have no types anywhere so how do we even know anything but as I said we already aren't going to figure
00:02:41.519 everything out all these problems are already undecidable so the real question and the question that I've been you know
00:02:48.120 focusing on for the last year is what can we do reliably what can we figure out and the answer is actually quite a
00:02:54.640 lot um my first motivating example uh the one that I first came up with about a year ago when I decided to start this
00:03:00.159 project was I wanted to write a tool that given a ruby method or Ruby program
00:03:05.239 find out for a given method whether it requires a block um when you call it
00:03:11.280 whether it will use one optionally well whether it will ignore one completely and when I actually formalized this I
00:03:16.959 found another uh group of methods that actually make no sense which I call block foolish ones those are the ones
00:03:22.879 that will try to yield if you don't give it a block and will not call a block if you give it one um those are an error
00:03:31.360 um so just to make it clear what I mean here's a block required method it doesn't always called The Block um but
00:03:38.680 you cannot call it safely as in you cannot call it without generating exception unless you provide a block um
00:03:45.439 a block ignored method just something that adds I mean yeah something that adds to numbers something that definitely does not use a block no
00:03:51.480 matter what block optional we know and love these These are especially common in our file operations all they're all
00:03:57.120 over the standard Library all of them numerable and 1.9 they're all black optional uh and this is just a quick
00:04:02.840 example um and this is a black foolish one which is just the inverse and this
00:04:08.920 seems incredibly foolish but that's why I call it that but you'll notice the only difference is that we screwed up
00:04:14.599 the uh the order uh of the guard on on a block optional
00:04:20.359 method so these uh are all very hard problems
00:04:27.160 um checking if a method uh determining if a method is block required is equivalent to the halting halting
00:04:32.199 problem and all the other ones are even harder um but the truth they are they
00:04:39.360 are but the truth is and before I go any further Microsoft's done a lot of great work on the halting problem um
00:04:47.759 and and they're all highly tractable um and I knew this from the start because
00:04:53.240 when we as Ruby programmers look at our me look at our methods we all know instantly whether uh which category
00:04:59.039 these methods follow into we can tell it doesn't take a lot of difficult work and that's because we use for one we use
00:05:05.880 Ruby syntax we use very common idioms and it's it's just very simple to
00:05:11.720 actually figure these things out for the mosta for uh for most cases now because
00:05:16.800 these are undecidable and unrecognizable you can always write a method that it can't figure out and that's okay
00:05:24.039 because um how are we going to do this yeah uh no one actually writes yield if
00:05:30.479 uh the halting problem is true for a given thing that just you don't write that code if you do then laser is not
00:05:36.120 the tool for you um and I'm sorry but the truth is real code is tractable for a lot of these problems and that's why
00:05:41.800 we write these tools that's why we're interested so when I started I found that two major challenges uh face Ruby
00:05:49.240 when it comes to this the first thing is metaprogramming metaprogramming is what makes Ruby so fun to use in my opinion
00:05:55.319 it's what makes it you know fascinating it lets you write incredible code but it also makes it very hard to uh for a tool
00:06:01.360 to look at a bit of code and see what it's doing for example a lot of uh active support and whatnot uses uh
00:06:07.440 string class eval just to create methods you know uh Alias method chain for example I don't know if the current
00:06:13.440 implementation uses string eval or defined method but either way it could work and a tool needs to be prepared for
00:06:18.720 both of these if you aren't then you're not going to know what methods are there and you're already as soon as you see a call to one of those methods you're
00:06:24.759 going to have no idea what's going on so that is something that you have to solve
00:06:30.160 and then the other thing is that we need flow analysis and this is something that hasn't really been done uh in a strong
00:06:35.840 General way for Ruby yet um most of our tools just look at the abstract syntax tree you can do flow analysis on a
00:06:41.479 syntax tree it's just a lot more cumbersome and a lot less fun um and as we saw in the in the block usages
00:06:47.960 example you need flow analysis you need to be able to tell which branch is taken in a given context um if you can't then
00:06:54.199 you're done from the beginning so for metaprogramming um I uh I struggled with
00:07:00.400 this for quite a bit I tried to find a way that didn't involve just running the code but the truth is that's pretty much what you got to do um and you may ask is
00:07:08.800 that still static analysis well it is static analysis unless you're as long as your program does something other than
00:07:15.360 create classes and methods um if as soon as you hit IO then um yeah as soon as
00:07:22.560 you hit IO then all of a sudden you know I we're not going to keep running the code we're going to switch to more traditional techniques but uh I note on
00:07:30.120 the slide here is this enough I'm not sure I'm not sure if this is actually enough for a full for example laser I'm
00:07:36.000 going say right now doesn't work on Rails uh probably going to be a while before it does um and I don't and so I
00:07:42.520 don't know if this approach yet is going to work just right for that but the truth is when you think about your programs how many methods do you add to
00:07:49.240 your program after a given point in time um I call that point in time the actual load time of your program up until then
00:07:55.919 you're still constructing classes through actual runtime code you're still adding method but after a certain point of time you
00:08:01.720 aren't creating new methods um method missing and the idiom of creating a method dynamically on that is an
00:08:06.879 interesting challenge one I haven't uh focused on yet I don't think that's actually too tough of a technical problem but in real world code we don't
00:08:13.960 add lots of methods in classes at runtime um and this is uh again for
00:08:19.240 applications testing Frameworks those are going to do some interesting things um but I'm focusing primarily on
00:08:25.520 applications uh so yeah um so the for the the first prong of lasers approach
00:08:32.039 is just run all the code you see in the program until you hit IO and while you do that capture all the global State
00:08:37.919 changes um and it does that using a pretty interesting technique we'll see in a sec but and then there's flow
00:08:44.000 analysis um you can do static analysis that is flow and sensitive um when I started doing research on this there
00:08:49.480 actually was uh an implementation of the type inference algorithm I use which we'll talk about in a sec which was flow
00:08:55.760 insensitive and that had some limitations rubies DW is uh for the most part flow insensitive um and that is
00:09:03.040 part of the reason why you know some people have expressed frustration with it is that it is limited in that sense
00:09:08.720 and flow sensitivity for the most part makes pretty much all analyses better it's just an extra signal in when you are trying to figure something out so
00:09:16.560 flow analysis really is Laser's major contribution um to Ruby and to the uh to the literature on this so far and we do
00:09:23.399 that by trying to build a control flow graph for Ruby coat we then put it in what's called Static single assignment
00:09:28.680 form which uh all the compiler Buffs in the room will know about um I did cons uh consider continuation passing style
00:09:35.360 but I had no idea what that would look like with Ruby so I went with static single assignment form and after we
00:09:40.920 build these graphs we all just do all the analyses on these graphs not everything laser does is on the graph
00:09:46.839 some of it is done before that really trivial stuff like seeing if you Shadow a variable with a block but the interesting stuff happens on the
00:09:53.519 graphs so what's the control flow graph it's just a representation of a program it's a graph of things called basic
00:09:59.680 blocks a basic block is a sequence of instructions which when run always starts at the beginning and always
00:10:05.959 terminates with the last instruction you can't jump into the middle of a basic block and you can't leave early this is a simple control flow
00:10:12.760 graph uh this uh yeah it for some abstract language whatever and as you
00:10:18.720 can see it starts on B1 goes into B2 this is just a simple Loop uh and at the end it exits you can't leave after the
00:10:26.440 assignment to comparison and that's just a fundamental part of basic box box um these are fundamental to how most
00:10:32.200 optimizing compilers work what instructions do we use um now
00:10:37.560 this is something that takes a lot of thought um the rinus people will tell you they put a lot of work into deciding
00:10:42.800 what bik code they use well we put a lot of work into I put uh a lot of thought
00:10:47.959 into uh picking the instruction set to use for laser um and your choices there are going to frame all of your analyses
00:10:55.040 and how much work you have to do to support those analyses lasers instruction set is is uh extremely
00:11:00.320 simple and sometimes that's a bit annoying but here's what it is um in those basic blocks uh which is what we
00:11:06.880 end up turning every single Ruby program into is just these basic blocks pointing to each other you can assign a local
00:11:12.440 variable you can call a method with Dynamic dispatch call Super raise return
00:11:18.880 jump or branch and uh while I don't support all of the Ruby language so far I've managed to P pigeon hole everything
00:11:25.279 into just those instructions um your uh
00:11:30.959 don't do throw yet uh that's uh one of the limitations that I'll be talking about later but yes catch and throw is
00:11:36.279 something I'm still trying to figure out but I do suspect it'll end up turning into a raisik uh situation so first
00:11:43.240 question is probably well there's a whole lot of features that don't seem to make sense in that context uh most obvious one is instance variables um
00:11:49.839 well I turn those into calls to instance variable and set while ignoring overrides um and there's two reasons for
00:11:56.120 that one I didn't want to add another instruction uh just because I didn't want to have to do more work later two
00:12:02.959 uh it forced me to also simultaneously solve the problem of instance variable set with a constant string and um for
00:12:11.120 example calls to that are very commonly generated through metaprogramming and sometimes they're just used to break
00:12:16.519 encapsulation so I wanted to be able to solve that and I figured by just doing this for the short term it would force
00:12:22.079 me to actually solve that problem I do the same thing for class variables as well keywords like Alias or defa or
00:12:28.760 class those are all method calls uh class turns into a lot of method calls uh and branches and whatnot blocks uh
00:12:36.120 create a a constant proc value and stick it in a temporary
00:12:41.279 variable and this doesn't really always work um there are some things that there aren't actually existing method calls
00:12:46.959 for in Ruby so I use a special class called Laser Magic um if see rubby folks
00:12:53.560 will know the Frozen VM object that's pretty much my frozen VM so for example when you set or uh get a global variable
00:13:00.560 it turns into a method called to Laser Magic um and I I mean maybe someday I'll
00:13:06.160 change this for now it's actually worked out so real difficulties um Ruby control
00:13:12.800 flow is complex and unpredictable exceptions come from pretty much every method call at least until we do further
00:13:17.839 analysis and figure out what things can raise what methods so when we first build this control flow graph every method call is going to possibly raise
00:13:23.920 an exception uh and also currently it does not model all control flows yet which is a big problem
00:13:29.680 um and this just results in very ugly graphs so here is the graph you can sort of see what's going on there for that
00:13:36.079 simple while loop um I've got it hooked up to graph viz to automatically spit these out in PNG which is pretty cool
00:13:42.839 but yeah uh and all those little red ones that are solid those are possible exceptions the dotted ones are ones that
00:13:48.680 has determined to not exist and this is just a two nested classes um so luckily uh I I'm the only
00:13:57.759 one who has to look at these while I'm Deb laser uh you don't have to worry about it but the size of them is a problem for Speed and efficiency all
00:14:04.000 right so static single assignment is a special form of a CFG uh every in this every variable is assigned only once
00:14:10.480 local variables and the compiler made Temporaries this requires rewriting the control flow graph renaming variables um
00:14:16.519 in Ruby it involves changing the structures a little bit in the situation of unassigned variables you also need to
00:14:22.360 add a new instruction type but you don't need to know about
00:14:27.639 that um just don't uh so yeah the key point is that it distinguishes variables with the same name that are assigned on
00:14:34.079 different code paths for every variable it calculates it also simultaneously calculates where it's used with respect
00:14:39.639 to flow which gives us the flow sensitivity which makes us powerful and in Ruby while I'm doing this it just so
00:14:45.839 happens that's the perfect time to uh pretty much the only time when you easily figure out when a v uh local
00:14:52.160 variable can be assigned um uh conditionally on One path so that's how
00:14:58.839 I actually surface that information all right so now we're ready to start analyzing stuff but first we
00:15:04.680 need to talk a little bit about how rubby how laser represents the Ruby object model um we pretty I pretty much
00:15:10.120 reimplemented the whole thing in laser there's a laser object a class module proc method um there's big overlap here
00:15:16.279 with rubinius and where I think uh there could be a lot of good work uh either me stealing everything they did or vice
00:15:21.759 versa we'll see how that works out um but an example of this is laser method there's one instance of these for every
00:15:28.120 method in the analyze program it implements the basic meth method functionality and it also has meth uh
00:15:33.920 has methods on the method for answering analytical questions yield type all you do is you type uh whatever method. yield
00:15:41.160 type hit enter and it'll tell you is it a block optional foolish whatever um return type for types for example and
00:15:48.639 what's interesting is because of this we can subclass laser method if we want to add more add more information now by
00:15:54.519 default uh a laser method just uses the built-in uh techniques that I'm going to going to illustrate to figure out the
00:16:00.440 answers to these questions but if you want and I do this for the send method in particular and allocate um I actually
00:16:08.440 subclass laser method and say this is actually a s method and this is really what happens when you call it that
00:16:13.959 should only be necessary for special things that are really low level in Ruby but we'll see all right so let's get some answers
00:16:20.560 the first stumbling block is dynamic dispatch with Dynamic types and yeah so we I decided everything's going to be a
00:16:26.880 method calling Ruby um now I actually have have to uh own up to that and figure out how to handle all these
00:16:32.480 different method calls so I use What's called the cartisian product algorithm this was uh uh the first paper on this I
00:16:39.279 think was in 91 or 92 for small talk and self a guy named adeson hope I'm saying that right and the real question is how
00:16:46.000 do I know what foo's type is here in the case of bar. run XYZ obviously that
00:16:51.040 depends on Bar's type we all know that but it also depends on X Y and Z's type
00:16:56.360 and by type I mean the concrete class here um I don't have any more advanced type system excuse me in laser besides
00:17:03.039 that so here's a quick simple example here we want to find out what uh the
00:17:08.880 bottommost X's type is uh in this horrible little line of code um and you
00:17:14.240 can see here it does depend on both the receiver of go and the argument um if it
00:17:20.039 goes to the double class it'll return an integer if it goes to the string class it'll return a string or if you put a
00:17:26.480 string into double. go it'll get a a but that doesn't happen I lied um so
00:17:32.160 yeah and the answer is to try all of the types the cartisian product of all the argument types um so that means every
00:17:39.160 possible combination of Bar's class X's class y's class all need to be tried and then you Union the result of these all
00:17:44.799 together now if one of these raises its type is empty so that actually handles the that um sort of handles that case uh
00:17:52.880 silently um and you just end up with all the possible actual return types this is really nice because for
00:17:59.440 one it's conservative um it is actually going to get every possible type it may
00:18:04.640 get some extras uh that aren't in the real program but it will get all the correct ones it only requires one pass
00:18:10.360 through the program an important part of this is that uh if we go back to this example every time we discover a new
00:18:16.120 type for X we have to go and recalculate the new parts of the cartisian product that's how you get the one pass part of
00:18:21.919 it uh it does require a finite number of types uh this is not uh I I tried to add
00:18:28.520 a tupal types to the language to laser and uh I was able to add some tupal types to it but um some instances
00:18:36.600 resulted in infinite types which kind of ruins everything um all right so lasers
00:18:43.000 implementation um for every single one of these combinations of the cartisian product it just duplicates the entire
00:18:48.159 control flow graph um the SSA uh the fact that it's SSA means that we're uh
00:18:53.760 flow sensitive um and this may be one of the first times someone's combined a uh
00:18:59.919 a flow sensitive cartisian product algorithm um come up with one I don't
00:19:05.280 really know um but yeah so you end up with this very nice way of uh expressing
00:19:10.400 the question laser method what is your return type for types what it does is it just goes and says well what's the
00:19:16.559 control flow graph for those types give me that's return type uh and then it checks it against expectations and
00:19:22.480 returns it now some interesting stuff happens in the control flow graph but um
00:19:27.760 it's the questions that we ask our classes and our methods they all go through the control flow graphs which lets us isolate all these things to very
00:19:35.080 relatively simple uh old algorithms um I don't use any algorithms in this that
00:19:40.360 are not at least 20 years old uh static single assignment comes from the late 80s out of IBM um control flow graphs go
00:19:47.000 back to like the 50s or 60s um so yeah if all of our analyses use cfgs um and
00:19:53.159 we need to parameterize the cfgs by the argument types and all of our uh all of
00:19:58.840 our analyses can be parameterized by by the argument types for example if we want to ask if something um yields uh we
00:20:06.400 can actually check if that depends on say the uh type of some argument that's a bad example that probably doesn't
00:20:11.520 actually happen too much but yeah that actually gives us more accuracy whenever we want to do an analysis it just makes
00:20:17.280 it a lot slower um only two analyses aren't currently parameterized by that right now and that's yield type because
00:20:22.720 that doesn't really make any sense and uh I forget the other all right so let's find some bugs I already showed you some
00:20:28.240 error disc actually back here when I said check return type against the expectations well well all that does is
00:20:36.440 uh it goes and says all right what is the expectation of uh what do I expect this method to return now you may be
00:20:42.760 thinking well that doesn't make any sense well in my opinion it does because if something's named 2 s I expect it to
00:20:48.080 return a string or a subass of that if it's to R I expect an array or subass of
00:20:53.480 that and if it's bang you really better return a Boolean um and and so these are
00:21:00.240 and so right here we just see it checks against that and adds an error of uh a warning level error it isn't a uh an um
00:21:08.600 a fatal err but yeah there's about 10 different levels of Precedence of errors in laser and so yeah it just tells you
00:21:14.240 hey that Pro that may be a mistake that you did there and it it took 16 lines of
00:21:19.279 code to add this feature to laser um and that's not much of an anomaly the hard part is building the control flow graph
00:21:25.480 uh it is built in pretty much one file which is over 2,500 lines long um and I
00:21:31.440 really got to work on that but that that's really the hard part uh after that you're writing algorithms to work with the general data structure with
00:21:37.960 very simple instructions so the analyses end up quite compact another one is unused methods this one took two days to
00:21:45.159 add reliably and I real and the Insight was just we're already calculating dispatch every time we see a method call
00:21:51.600 we're using this cartisian product algorithm to say okay what methods could this call what are the receiver types
00:21:57.039 and how about we just mark we're doing that why don't we just Mark every method we see uh that's really quite that's
00:22:02.440 pretty cheap compared to the cost of doing all this other stuff and at the end of The Run of laser we'll just see what isn't
00:22:07.880 used um and this pretty you can see right here that involved adding that one line method. Ben used I have it
00:22:13.799 highlighted it's kind of hard to see um and then this to the runner uh of
00:22:19.840 uh like the actual entry point to laser at the very end it just says you know all unused methods unused methods
00:22:25.279 iterates over every class and module every and then then over every method checks if it's had that Ben used flag
00:22:31.720 set on it and then returns it and then it creates a warning um and and because
00:22:38.120 we already model send and public sends uh dispatch um we already account for that that comes out to around 15 to 20
00:22:44.919 lines of code just to add this one feature um unused variables um now
00:22:50.720 there's some easy cases uh that are variables that are assigned but never read um and Ruby already does this it's
00:22:56.039 in the parser which is a little you know a bit of it's a place to do it um well I
00:23:02.400 no no anyway uh but it can't handle even
00:23:08.200 trivially dead code if you write if true or if you know uh something that only returns a string um it can't really uh
00:23:15.600 or yeah it can't tell which branches aren't taken um and also it can't handle Vari variables that are only used uh to
00:23:23.039 compute than unused variables um so for lasers approach we just go through every variable in the control flow graph um
00:23:30.640 find all the transitive uses and something's considered unused if uh if
00:23:35.760 it's not returned uh if not if none of its transitive uses are returned or uh
00:23:41.400 used in an impure method call so if you um add a bunch of uh numbers together and I can prove that they're numbers but
00:23:47.600 you never return any of those numbers then all of them are considered unused this also transparently works for method
00:23:53.159 arguments because those are just local variables uh as far as the control flow graph is concerned by lowering the
00:23:59.559 repres representation so far a lot of things just fall into place multiple assignments for example those are just
00:24:04.840 desugared into single assignments and branches and whatnot so yeah this turns into a dead
00:24:11.080 simple work list algorithm um and it's O of V which is the the number of
00:24:16.320 variables in the SSA control flow graph it's usually linear with the with the code size I haven't uh checked this with
00:24:23.200 my implementation um but for the most part in other languages it's shown to be uh Tri typically linear and yeah this
00:24:31.000 entire algorithm was just 55 lines of code on top of what was already there and we get flow sensitivity and dead
00:24:36.679 code does not affect uh how it sees this and so here's here's just a quick example I cheated and just ripped this
00:24:43.360 from the re the the rpcs for laser but yeah you see this here and it calculates
00:24:49.559 a constant which is just some number in like base five uh and checks it and Che
00:24:54.760 and pro and Laser here figures out one of those branches because it has some basic Purity analysis it can tell that
00:25:01.279 you know turning a number into a string is a pure operation and uh it trims off
00:25:06.720 the lines at seven 8 and 10 I'm sorry it
00:25:12.520 Tri it prunes the El Branch so it knows those lines are dead and you see that the second and third spec check that it
00:25:19.240 also sees that the variable D on line four it calculates it it even assigns it to a variable J later but then J does
00:25:26.240 nothing so it manages to kill all those very cool um I should probably add
00:25:32.399 something so it checks if x is unused because it isn't but I don't have a spec for that all right so block usage this
00:25:38.080 was the whole motiva this is what I really started out to do and obviously it's going to be harder than what I've already shown you um unused variables
00:25:44.000 are cool but I really didn't have a thesis if I could find unused variables in Ruby I mean I did I just would like a
00:25:49.600 better one um so why do we care well for one we can find errors um if you give a
00:25:54.960 block to a method that's never going to use it so you you probably are a bit confused about what you think that method is going to do it's not going to
00:26:01.279 kill you it's not going to you know raise an exception but you probably messed up um that's a silent error that
00:26:06.840 laser can find the other case is when you don't provide a method that very well could use it you may not even notice the first time because like that
00:26:13.240 silly Rand example I gave before it may not raise the first time you do it it may be that only at you know after
00:26:19.679 you've deployed your application oh you hit the edge case where it needs a block so that that that's a big thing finding
00:26:26.960 errors is a big thing and that's what lasers focused on you know and block foolish methods you already you already screwed that up um another idea is that
00:26:34.279 we could autogenerate documentation you know if we get really if I get really good at this then maybe I could hook this up to yard and uh have it
00:26:41.000 automatically spit out um whether it yields or not that'd be pretty cool and
00:26:46.039 maybe someday we can help the the um the compiler writers and the audience that
00:26:51.159 um work so tirelessly to help us all out so as Ruby programmers how do we
00:26:56.840 tell if a method needs its block during its execution when we look at a when we look at a method how do we tell that
00:27:02.240 well if it calls yield um either not in a block given uh type conditional and
00:27:07.960 without catching a local jump error or one of its super types yeah that's when
00:27:13.000 I when I first discussed this idea someone said oh what about catching local jump air well I had to account for that um you do um yeah exactly yeah I
00:27:22.399 mean there's again there's things that it can't do that's proven um but uh the other thing you could introduce the
00:27:27.840 explicit it you know Amper sand block variable in the method signature and then call it you could use procon colon
00:27:33.679 new to get get a hold of it you could then or you could not even call it right
00:27:38.960 there but you could delegate it to uh a method that does call it um and as long
00:27:44.279 as none of these are guarded by block given defined yield any user Define oh
00:27:50.440 yes uh yeah you mean like Alias method and the Alias
00:27:55.640 keyword oh yeah yeah yeah yeah absolutely um yeah and user defined aliases to those methods um but yeah or
00:28:04.120 you could do you can define an alias to block given um and there's also did
00:28:09.240 anyone here know that iterator question mark is a uh is an alias to block given
00:28:14.360 yeah that that's also one uh I did not know that um but it is
00:28:20.240 so um yeah so uh or someone could grab a hold of one of those and do is that not
00:28:25.840 n so there's a whole lot of cases that you could do to construct one of these and they all exist in even in the Ruby
00:28:31.519 core standard library though I don't think uh it um some of those do but the main ones do so we need constant
00:28:37.919 propagation and we have constant propagation um we have the ability to prune those branches as I showed you
00:28:44.000 earlier during the undefined uh the uh not sorry the unused variable case so
00:28:50.080 what we do is uh there's a little bit of uh uh support in the control flow graph Builder to make this easier to analyze
00:28:56.440 but for the most part we just copy the parameterized uh CFG twice there's no reason I mean there are reasons to
00:29:02.559 parameterize this but it's uh on for the most part it's not worth it and it would save a lot of memory and time um on one
00:29:08.880 we assume there's no block we pass a fake in Block argument to the constant propagation system um and on the other
00:29:16.320 we assume we tell it there is a block we tell it always returned true for Block given and all those things pass in a
00:29:21.799 dummy laser proc and then we track the aliases looking for attempted calls through by aliases here I mean Alias is
00:29:29.080 through the method so if you capture proc. new into X and then set y to X or
00:29:34.720 call a method with it it's a very conservative analysis but it tries to make sure it can see if you've possibly
00:29:41.000 called that block um if we see delegation if you call a method with that as the block we recursively analyze
00:29:48.279 uh at that point we then ask that method hey I'm passed you my block are you going to use it because I need to
00:29:54.120 know and this whole thing was 79 lines of code um there's uh 19 lines of a
00:30:00.320 conservative Alias analysis I mentioned it here because I don't use it for anything else yet um but and it handles
00:30:06.080 almost any uh real Ruby code you throw at it um again there's always cases that it won't be able to handle including
00:30:11.799 real world cases I'm sure um but this was one I mean yeah this is turning unrecognizable that's harder than the
00:30:17.559 halting problem um Ruby's not as hard to analyze as we've been led to believe um
00:30:23.519 and a big part of this is the syntax and the idioms for example block given uh everyone uses block given um or everyone
00:30:31.320 and uh that that's just one example yield by being an actual keyword makes it a lot easier for the control flow
00:30:37.720 graph Builder to to create separate code paths for that to make it easier for later
00:30:43.519 analyses um so yeah now simulation we talked about this at the at the start I said that we were just going to run all
00:30:49.399 the top level code um cfgs are representations of a program they're clearly executable so when laser sees
00:30:56.159 top level code it runs it uh and it gives up after it runs for either a long time either because you wrote an infinite Loop before while you're
00:31:03.039 defining classes or um or just that you wrote really long code and Laser doesn't want to waste your time um or when it
00:31:10.399 hits IO okay we turning you off um when it gives up it switches to all the
00:31:15.679 analyses that I've already told you even for the top level code so this is literally interpreting
00:31:21.760 Ruby in Ruby and you're probably thinking that must be really really really slow um but you're wrong cuz it's
00:31:27.720 way way slower uh I've got a I've got a bug actually in
00:31:33.320 the GitHub uh repository for someone who just like wrote a top level Loop to like
00:31:38.440 100,000 and it ran for like 20 minutes and they just like gave up um and that that is a problem I'll talk about it in
00:31:44.600 a sec but uh but yeah it's really slow uh luckily it only uses it again defining classes not uh does not take
00:31:52.159 that long that graph looked big but it's actually not that many instructions and this gives us some benefits while we're
00:31:58.120 doing this it tracks which edges of the CFG are actually taken whenever something's simulated uh and during
00:32:03.279 constant propagation it checks to see what things are clearly never run this uh makes analysis later faster and it
00:32:09.639 could uh possibly optimize startup time someday um there's something called the theory of uh partial evaluation which
00:32:15.519 you may want to look up um it's way too headyy to talk about now but it's all about this idea of partially running a
00:32:22.320 program and figuring out what things we can uh ahead of time ignore and a large part of the overhead of loading 300
00:32:30.000 classes into a ruby program there's tons of checks in there that can be completely all lighted so that's another
00:32:35.519 possible path for this there's not enough time to keep talking about different analyses there's some other cool things I hit most of the good stuff
00:32:41.919 um but it's time to talk for a bit about the challenges that laser faces uh the biggest challenge is that
00:32:48.080 uh there's a bust factor of one and I have a job now and I don't write Ruby um so it's maybe a little less than one um
00:32:57.320 and there's also a bit of technical debt that makes this a bit harder for other people to jump in um I mean I'm proud
00:33:03.919 that I got an academic project actually uh the source out there that doesn't usually happen um but that said there's
00:33:10.159 uh there's good corners and there's bad Corners um and you know I mentioned the control photograph Builder that's 2500
00:33:15.519 lines and it actually needs to grow um so that whole thing needs to be
00:33:20.760 modularized and made better and some of the internal apis I showed you some code Snippets before and I'm not terribly
00:33:26.120 proud of how they look because they could actually be cleaned up quite a bit to be more clear the arguments could be better and that also makes it harder for
00:33:32.559 people to jump in it's hard to analyze a control flow graph when there's so many different things that you can do with it and it isn't really
00:33:39.159 clear also laser understands Ruby it does not know c um I've been developing it using uh actually 1
00:33:46.760 193 uh and that's mainly because I've fixed about a dozen Ripper bugs that are actually now going to be in 193 making
00:33:53.159 it good enough to actually yeah
00:33:58.440 um I actually yeah uh laser does use Ripper it's good enough to actually you know power a
00:34:05.960 um sort of Ruby implementation I think it I think it's finally good enough that it could actually power Ruby implementation um and so right now it
00:34:13.679 only works with Ruby 1.9.3 my continuous build server just can't actually run laser yet um until I get 193 on it um
00:34:22.079 but yeah thing is lots of MRI is in see um it's laser does awesome when it sees just Ruby calls but eventually you're
00:34:28.000 going to get to C calls if you're using uh um yarv or MRI and the current solution is uh to have stubs for all of
00:34:35.760 these uh and not all because I don't have that much time but for lots of these methods with annotations and yeah
00:34:42.280 there are this is just a few lines from string. RB in my in the laser repository
00:34:49.200 and there's there there's a couple thousand lines of those and that's just not going to work uh the um and the
00:34:55.240 obvious solution is to use rubinus libraries I really really like this and I think this is really where it's got to go um I haven't done it yet there's uh
00:35:02.839 they need a little bit of extra support to analyze well because rubinus does use well a lot of the type checks that
00:35:08.960 happen in uh normal C Ruby um they expose in real Ruby and those are going
00:35:14.040 to need some extra help from laser to really to understand as accurately as it needs to to analyze these appropriately
00:35:20.599 there's no technical barriers to this though uh it really is you know a week or two of work and we could I I could
00:35:27.040 drop in you know 95% of rubinus libraries and um drastically improve
00:35:34.160 both the state I mean get rid of all these and I mean these aren't even completely correct
00:35:43.760 yes Diamond Back Ruby I believe they're using o camel but uh yeah uh I'll talk
00:35:53.119 about yeah well I did I I the thing is I mean that was for static I mean that's a good that's a very good idea um they
00:36:00.319 that was I didn't even think of that talk to me after I like that all right but yeah so uh there's also some good
00:36:06.400 news which is that uh a lot of the things that laser could do um I mean for example it already finds guaranteed
00:36:12.240 rises in your methods uh for example if you try to call a method that doesn't exist and it can actually prove that um
00:36:17.640 incorrect area type errors stuff like that um and I figured I'd put only one meme in this entire thing but it doesn't
00:36:24.560 tell you uh and some of that is uh for one like some method calls are supposed to always raise if you called the method
00:36:30.880 raise I'm not going to yell at you for always raising in that method because you called raise or you may write a
00:36:37.160 rapper function that does some formatting for a string then calls raise well I shouldn't yell at you for calling that um I haven't figured out exactly
00:36:43.800 what that hero stick is supposed to be um and it's not going to be just as good as did you call Rays um aity errors
00:36:51.400 should be reported uh I don't actually think that would be that much work I just haven't gotten around to it but
00:36:56.520 that is something that does work um the other thing is laser is pretty slow um we'll never get rubyist to Value
00:37:03.560 static analysis if it's slow um the compiler writers aren't aren't going to like it if they have to add if um none
00:37:10.880 of their users want to use the static analysis that can make things faster because it turns their build uh to a
00:37:16.160 couple hours um it's like it's it's a lot like testing people don't run their tests if they take half an hour to run
00:37:23.599 uh laser isn't that bad I mean uh I haven't tested I haven't had it it isn't
00:37:29.040 good enough to test on enormous code bases yet um but I suspect it is too slow um and a good a good amount of the
00:37:36.160 low hanging fruit has been picked all the algorithms used are um at most o of N squared uh there's only one that's
00:37:43.520 over squared um and that one can't really be made too much better um so it's not they aren't really algorithmic
00:37:49.319 problems I'm using algorithms that much smarter people came up with 20 years ago um so there are a few things the CFG
00:37:55.640 encoding well because I use so few instru CS that could be made a lot more compact that could improve things a lot
00:38:01.280 um it would also mean analyses could be made smarter and avoid doing extra work um I only have I I say laser is Ruby in
00:38:08.280 Ruby there are two parts of it right now that are not in Ruby um one is Ripper um
00:38:13.760 and that may actually change soon I've been working on a new parsing thing but we'll see uh and the other is that I
00:38:19.839 replaced the basic block class I talked about basic blocks earlier that got Rewritten in C++ because it just made
00:38:24.920 things so much faster it can it is still still fallback in Ruby but yeah um the
00:38:30.680 other thing is that a lot of this can be cached on disk the whole standard library for example say I drop in all of rubin's libraries well all of that stuff
00:38:38.240 can be analyzed once and just saved on disk um serialize I try the main problem
00:38:43.640 right now is I use too many uh hashes with default procs and those can't be serialized um so yeah the one m and and
00:38:52.520 so everything I've talked about so far they aren't really technical issues um the main thing right now is is that the
00:38:57.920 control flow graph is not conservative now conservative we hear that mostly uh about um you know Mr garbage collector
00:39:04.359 and being conservative is a bad thing there we all you know rubinius has the precise garbage collector well that's because garbage collection is not
00:39:10.400 undecidable um and building a accurate control flow a precise control flow graph is um unless you have a very bad
00:39:17.920 language um a non-t complete one um so yeah there's three categories of solutions precise ones which is rubinius
00:39:24.760 uh garbage collector and there's conservative things like the typical optimizing compiler CFG and the rfgc and
00:39:31.319 then there's wrong things which is what we have um it's true there's there's no
00:39:36.920 wiggle room there are things that laser will be just wrong about um and that sucks uh and the main problem is and
00:39:45.720 what it comes down to is that not every possible control flow flow through a ruby program is represented by edges in the graph um the easy solution is to add
00:39:52.880 an edge from every block to every basic block but that makes it useless and slower so we're not going to go with that
00:39:58.280 uh and the real solution is just to actually figure out just uh how to actually model these uh these
00:40:05.760 problems um as minimally as possible currently addresses catch and throw this was brought up earlier um right now
00:40:13.000 laser does model exceptions uh quite I'm pretty sure we don't do I don't do retry
00:40:18.400 but uh I haven't done that yet but it does do all the other uh uh other stuff with exceptions really well it even
00:40:23.680 handles weird cases like next and through loops and all that stuff um um and ensure but catch and throw adds
00:40:29.560 another level to the equation because any method can uh also catch or throw
00:40:34.599 and I don't know if that's going to be too hard or not but fibers and continuations not too worried about that
00:40:40.119 just yet but those are control flow constructs that have to be considered um prop cases uh bind yeah uh bindings make
00:40:48.400 things horrible um because you yeah I mean you could pass a uh uh just Callum
00:40:54.359 method with a block and that method could change a local variable well a control flow graap needs to be able to see when it local variables change or
00:40:59.720 things are going to get messed up that may not be solvable um we'll see luckily
00:41:05.079 most people don't do such a horrible thing um and the big part of laser is that I need you guys to not do the worst
00:41:10.960 things in the world um the I mean uh um
00:41:16.119 none of the compiler writers are happy because they are laughing because they have to support the worst things in the world um when you uh call a method with
00:41:25.160 with a block and they have to be ready for the fact that you're going to change all the local variables in that scope um
00:41:31.040 and that inhibits enormous optimizations that could be made um so I mean this control flow graph is
00:41:37.560 is isn't conservative yet that's not the worst thing in the world because I'm not a compiler I'm not going to generate the wrong program for you uh
00:41:45.520 but I so yeah I mean we want static I want static analysis to succeed that's
00:41:51.880 uh the more spous warnings you get or the lack of warnings that could be given that depends on accuracy that needs to
00:41:57.760 be improved the other thing is the compiler writers do want these do want these analyses I'll put words in their mouth and say that you know a lot of
00:42:03.920 these things and there's and the things I've talked about mostly so far they're not that big a deal for compiler writers but there's a lot more that could be
00:42:09.800 done Escape analyses Alias analyses um and we all want those things to be made
00:42:15.920 that much faster so yeah lasers had a lot of successes it's the first for into deep static analysis for Ruby it does a
00:42:22.280 lot more than I've even talked about today I didn't even talk about the style stuff uh it can do some pretty cool style things and it'll write your code
00:42:27.720 to fix them too um and I've just uh I'm working on a new par a new parsing
00:42:33.960 library that exposes the full par tree including white spaces comments all that stuff so that it can be even better at
00:42:40.359 that and so that it can uh know exact column numbers and give Clank style error reporting that's another big thing
00:42:46.960 um it can it can it models exception frequency and type inference could be a little bit better at that with respect
00:42:52.200 to rescue Clauses but um it does successfully model that it handle it can see when you have aity airs all that
00:42:58.280 stuff too um does Conant propagation and Branch printing it may not help programs today but that's okay so laser is a year
00:43:05.599 old uh but extremely young I truly believe it is vital to Ruby's growth I think this is a huge opportunity and I
00:43:11.680 think this is really what ruby needs uh in many ways to help stay relevant uh in a world where people are starting to
00:43:17.559 look for other tools um I think that this can really make things uh make Ruby a much more attractive to have this sort
00:43:24.559 of support uh in a year or two I could could see see laser actually supporting rails um it would need an extra plugin
00:43:31.640 uh not going to lie on that one but it could it could absolutely do that um how much time we
00:43:38.400 got my phone oh yeah I turned off my phone because it people keep kept tweeting me 254 how much time that oh
00:43:46.960 some guy tweeted while he was here that's okay thanks all right so here's just a quick
00:43:52.200 thing here let's take a look at uh V uh temp. RB um got two classes here and I'm this
00:43:59.319 is this shows off just a couple of features uh not too many but
00:44:05.240 um but yeah so I mean I I don't know if people are laughing at the horrible code or people
00:44:11.200 tweeting but um but yeah so yeah this is something that uh I I mean I wanted to find an easy way to I could have used an
00:44:17.800 if or something but this um yeah it uses gets. 2i to pick something out of there
00:44:23.040 and so recv is going to be either a foo a bar hello or nil because you could
00:44:28.839 return 10 or something like that um and then it calls do stuff with three so if
00:44:34.240 it goes to Foo it'll return three values if it goes to um bar return three values
00:44:41.640 we know it's not going to return two because we're good where we programmers and we know that three is an integer so there's nothing wrong with this code so
00:44:48.720 let's run and uh most of this time it actually is taking right now uh in my defense is
00:44:54.559 it loading the standard library and analyzing it uh because as I said I'm not cashing that yet uh but it should
00:45:00.319 only take about five more seconds 10 seconds oh there we go okay so there a couple things uh yeah um there's a blank
00:45:06.920 line at the end there's a little bit of dead code I showed you yeah you saw the dead code um oh string can be wrapped in
00:45:13.760 double my bad um I didn't even do that on purpose uh so now let's make this
00:45:21.400 well let's remove one h no we'll make an extra one so yeah now we're updating our
00:45:26.480 bar AP uh to return four things and uh the line below that's consuming that API has not
00:45:32.920 been updated it isn't prepared to have a fourth value and it's only unpacking that tupal into three things um so let's
00:45:40.400 see if laser how's it doing when you say F do youan aut fixes yes yes those the the
00:45:48.359 style thing I mean yeah it only even tries to fix style things but yeah there
00:45:53.760 we go um see that second to last line there uh r h s value being discarded um
00:45:59.119 that's it seeing that um we have the potential here to have uh four things on
00:46:05.640 the right hand side and only unpacking into three things you can see that it's analyzing all of the possible types and
00:46:11.160 let's see if it hopefully works when I do this and if it only return two which means one of those things is going to be
00:46:17.079 nil two good things one thing is that it'll see that it's nil and if you accidentally try to call something on it
00:46:22.720 then it'll notice the exception which as I said it won't tell you about um and it'll also let you know that that left
00:46:29.040 hand side is never assigned defaults to know very cool yes your com
00:46:34.960 about what uh I mean yeah I mean it's pretty it's
00:46:46.520 pretty it's extremely minimal but I like it so uh no but the good thing is that
00:46:52.680 there's already I'm sorry there you go I like that uh no but most importantly is that
00:46:59.920 laser does allow both uh tool wide settings for which things you do want to be warned about um and it also allows uh
00:47:08.480 it doesn't have file wide settings yet but it does have line settings you can add a comment to a line and say ignore
00:47:13.680 in this case uh let's see it didn't print the string that you
00:47:19.880 actually specify but I'm probably over time anyway
00:47:29.280 um because as I said it doesn't tell you what uh remember the the scumbag steep thing yeah uh that's part of how it
00:47:35.880 needs to it needs to be better at uh the other thing is that I mean because the
00:47:41.119 type analysis is cons type analysis is conservative it's going to pick up on things that aren't even in the code and
00:47:48.079 that's part of an example where it may actually have picked up a type that never will happen so I mean that yeah um
00:47:55.880 it doesn't even tell you about guaranteed exceptions it finds yet the possible exceptions that's going to be
00:48:01.359 more interesting to figure out where that goes but yeah all right so I I'm I think that's pretty much it uh any
00:48:07.280 questions
00:48:19.760 yes okay cool
00:48:34.760 cool schemes had a lot of great work on it the one one of the most fascinating projects is a a project called Stalin um
00:48:41.800 which is called that because it brutally optimizes uh and it has very slow compilation times but it generates
00:48:47.599 incredibly fast scheme but anyway yeah next anyone I thought I saw a couple other hands
00:48:53.240 yes uh re oh yeah yeah I mean part of the reason I got interested in this
00:48:58.319 because I was interested in code quality tools and yeah there's things like reek there's Rudy there's all those different
00:49:03.680 tools for uh for that sort of thing um I think that I mean for for things like
00:49:09.119 finding coat smells and stuff like that I think that laser could be a big benefit to that a big part of what it does I mean this finds you know if you
00:49:15.040 do assignment in an if it'll let you know about that um I think there's a a
00:49:20.440 lot of um what's the business word Synergy there we go a lot of synergy for a lot
00:49:26.440 of projects out there with what I've been doing so any else all right cool
00:49:31.839 thank you
Explore all talks recorded at RubyConf 2011
+55