Summarized using AI

Your App is not a Black Box

Josh Kalderimis • November 01, 2012 • Denver, Colorado • Talk

In the talk titled "Your App is not a Black Box," Josh Kalderimis from the Travis CI team emphasizes the importance of understanding what happens inside an application rather than viewing it as a mysterious entity. Kalderimis discusses how Travis CI utilizes distributed continuous integration to manage thousands of tests daily and highlights the tools and strategies they developed to achieve visibility and transparency in their services.

Key Points:

  • Awareness over Ignorance: The main theme is that developers should not treat their applications as black boxes; knowing what goes on inside enables optimization and improved performance.
  • Tools and Services: He discusses various logging and monitoring tools used in Travis, including NewRelic, Librato, Graphite, and custom-built internal tools.
  • Logging Best Practices: Kalderimis stresses the need for concise logging that is readable and usable, advocating for single-line logs that convey essential information without verbosity.
  • Error Tracking: Importance of using exception notification services like Sentry and Haystack to track application errors effectively rather than manually sifting through logs.
  • Metric Collection: The talk delves into metrics collection methodologies using ActiveSupport Notifications, stressing that metrics are to an application what unit tests are to code.
  • Aggregating and Visualizing Data: Discusses leveraging services to aggregate and visualize metrics data, illustrating how Travis CI utilizes tools like Papertrail and metrics aggregation to monitor application performance.
  • Building Internal Tools: Kalderimis encourages the development of internal tools for better team management and service monitoring, showcasing a tool they created for monitoring events and service hooks in the Travis CI system.

Concluding Takeaways:

  • Applications need to be monitored and optimized continuously through proper logging, error tracking, and metrics.
  • Continuous integration tools like Travis CI can provide invaluable insights when utilized with a strategy that emphasizes visibility.
  • Utilizing existing services and building internal tools has a substantial positive impact on managing application performance and reliability.

Your App is not a Black Box
Josh Kalderimis • Denver, Colorado • Talk

Date: November 01, 2012
Published: March 19, 2013
Announced: unknown

This talk is part story, part code, and part mustache.

Travis CI is a distributed continuous integration system running over 7,000 tests daily. For us to get a true insight into what is going on behind the scenes we have had to come a long way by integrating and building both tools and libraries so that Travis and its many parts are not just a black box of information.

Reading logs and using NewRelic is not new, but far from enough, especially when it comes to apps which are composed of many smaller apps, like Travis. How do you track and visualize requests being processed by multiple services? How do you silence verbose logs while not losing the core of the message? And how do you aggregate, visualize and share metrics?

A lot of how we track, manage, and visualize what is going on in Travis has been created as a set of internal tools and by using a range of awesome services, but core to a lot of this is ActiveSupport Notifications and Travis Instrumentation.

This session will give insight to how Travis is composed and connected, as well as shedding light on how simple it can be to gain more visibility into even a complex, distributed system like Travis, as well as your applications too.

RubyConf 2012

00:00:15.599 hello hello i guess it's time for us to start it's
00:00:21.840 uh the 4 30 mark a couple of people coming in hi guys
00:00:27.039 awesome uh welcome my name is josh
00:00:32.320 or if you want to be really formal josh calderemas you can find me on github or the twitters or you can just call me mr
00:00:38.960 calderemas and find a theater uh this is me with anthony he's a sexy guy
00:00:44.879 uh i'm from the travis team but before we get started you know it's the afternoon you've just had some
00:00:50.800 donuts and coffee could you all just do me the favor of standing for a second you have everyone everyone in the back
00:00:56.879 with a laptop yep yup everyone in that section yeah perfect okay
00:01:02.559 so i want you to stay standing if you are logging anything from one of your apps
00:01:08.720 so it could be rails it could be an agent if you're using logging
00:01:14.320 i'd expect this there's one person that sat down that's a shame um
00:01:20.400 i want you to stay standing if you're using new relic in any shape or form
00:01:26.400 okay i want you to stay standing if you're using custom metrics for example librato or
00:01:34.479 graphite statsd no
00:01:40.560 sorry okay now i want you to stay standing
00:01:45.840 if you're using custom logging custom new relic custom metrics and a dashboard
00:01:51.040 tying them all together boom
00:01:56.479 boom goes the dynamite thank you guys okay and the people that were standing up you can just feel good
00:02:02.479 and maybe i'll give you a sticker man just feel good yeah okay now you can sit
00:02:07.520 okay so two small things before i get my my proper talk started uh i want you to meet the travis team
00:02:14.400 uh this is technically our official outfit we wear a onesie you join the travis company
00:02:20.239 and we give you a onesie
00:02:27.040 i have a feeling my microphone's gonna fall down there we go this is perfect for coding and you can
00:02:33.760 buy them online for a good price of 120 we'll even put a travis logo on the back
00:02:39.680 we used to wear shite shirts and awful moustaches don't tell sven i said that
00:02:45.120 beautiful moustaches and we used to present together in them but
00:02:51.200 now this is a side note uh is there any one from engine yard here
00:02:56.800 okay well thanks to the beauty of engine yard they're actually sponsoring piotr sanaki
00:03:02.480 drogues on twitter he did a lot of the rails engines work and was working on sprout core so he is now working full
00:03:08.319 time on travis open source so i i just want to give a huge thanks to them
00:03:13.760 because they're helping push travis and the open source testing part of it ford
00:03:18.959 so thank you engineered for being in other talks uh
00:03:26.239 oh it does all the way up and there's a summer version
00:03:34.159 so if you know anything about travis travis is a crowd-funded company we started
00:03:40.319 roughly in january putting up a crowdfunding page we raised about 125
00:03:45.599 000 usd and from that it was from about 780 individual donations from people in
00:03:52.239 the audience here and from at home 29 companies also joined in to help us
00:03:58.000 get going you know everyone from engineer to heroku to bendy works
00:04:04.480 to soundcloud and this is only a fraction of them so these are the people that have helped make travis possible
00:04:11.360 so you know we we want to give a lot of love and thanks to anyone here who has donated anyone who's donator who's
00:04:17.280 outside of the room as well but a big thing for us and we haven't actually released it
00:04:22.560 officially but thanks to the beautiful people at sticker meal they have or
00:04:28.160 they've in the past created a whole lot of stickers for us so if you donated you would have got these beautiful stickers
00:04:34.720 in the post but thanks to david who works at sticker mule and his pretty dog
00:04:40.560 we now have an official mascot we have a travis and we have a whole lot of travis
00:04:46.880 stickers so if you want to travel stick later come and get me and i've got one for you
00:04:52.960 but back to the talk so as i said my name is mr calderemis i come from this very far out place
00:04:58.479 called new zealand we're known for hobbits and orcs
00:05:04.240 cable cars beautiful views i'm from wellington to be exact beautiful nightlife amazing
00:05:10.800 coffee and the most fantastic amusement rides ever and i've also been on one of those it's
00:05:17.680 one of those small planes where you can see out the front of the cockpit and as you're coming in you're like oh great i
00:05:23.199 can see the runway i can see a hill i can see a runway and then and then even while it's landing all you can see is a
00:05:29.280 hill up to the last second uh the last four years i've been living in this beautiful place called amsterdam
00:05:36.160 where it's known for its huge amount of bikes it's amazing architecture
00:05:41.280 it's night life it's recreational activities
00:05:47.440 but now i'm technically a traveling nomad i've got no official home i'm in
00:05:52.560 between san francisco new zealand san francisco berlin i travel with a ukulele
00:05:58.160 i've got a nice little coffee kit and today i'm here to talk to you about fun things like logging and metrics and
00:06:06.240 monitoring it's part theory it's part code it's part internal tools and it's part
00:06:12.240 moustache but you'll notice that the talk title in the
00:06:18.160 in the conference planner was about the black box and and i had a huge issue trying to find a
00:06:25.440 good talk title having a good talk title is what people look for and i was at the american bookstore in
00:06:32.960 the netherlands it was actually just before i was going to sao paulo and i was like you know josh you don't know
00:06:38.319 anything about sao paulo and you're going to sao paulo and this was the day before going to sao paulo so i had to
00:06:44.319 run here to get myself a brazil book so i actually knew what was going on and i came across this book and it's
00:06:52.080 it's not how good you are it's how good you want to be and it had these really nice little
00:06:57.120 lessons in there for example fail fail again falben all of these things that we're taught as programmers and when
00:07:02.880 even starting a business like the person who doesn't make mistakes is unlikely to make anything
00:07:08.720 or if you can't solve a problem it's because you're playing by the rules you know nice little lessons so i thought to
00:07:14.800 myself since i'm doing a talk about metrics and what's going on inside your app really
00:07:21.199 it's not how good your app is it's how good you want it to be what i want you to take away is that
00:07:27.440 your app when you deploy it wherever it is it's not just a black box you need to
00:07:32.639 know what's going on inside of it knowing what goes on inside of it gives you the power to optimize and to make
00:07:39.759 things better it's not about the war we're doing a million you know web requests a second
00:07:46.319 that is a very very uh circle jerk kind of metric there we want
00:07:52.400 something we need to know really what's going on inside it and we've got a whole bunch of tools and libraries available
00:07:57.680 to us everything from exception notifier to librato we should be using these tools and even
00:08:04.240 better you should be building internal tools internal tools will allow you to
00:08:10.879 mold the data and allow you to really abstract what you need and on top of that services are your friends don't
00:08:17.280 build things when things are available for you to use it doesn't matter if it costs 100 bucks a month these services
00:08:23.280 are worth their weight in but anyhow i come from this this nice
00:08:28.560 little crowdfunded company called travis and we run your tests for you you might have seen a little status image and you
00:08:36.080 know across github we run those tests it's free it's open source it's full of ferries and unicorns we've got another
00:08:43.200 piece of infrastructure and we are for builds what rubygems is for libraries
00:08:48.959 we have over 25 000 open source projects on us we do 10 000 test runs daily
00:08:56.399 over the last year and a half we've done 707 000 build requests on github
00:09:02.160 totaling 1.9 million total test runs and we support 11 different languages
00:09:08.800 everything from ruby to php to python to go i think there's eight different versions of ruby three different
00:09:14.560 versions of php and python we've got it if we don't have it tell us and we'll
00:09:20.480 add it and if you see this it's the sign that it's on travis
00:09:26.160 it was started for fun the whole idea was for it to be instant live and modern and
00:09:31.440 hackable and as any rails app it started kind of like this it's just a simple rails app
00:09:37.920 that would run on heroku it would use rescue and it would talk to a dedicated box we'd use redis and pusher and a nice
00:09:45.200 pretty database and it's you know it's about getting things out there this
00:09:50.640 was simple and this did the job it did the job when you had five projects but not when you've got 25 000.
00:09:56.800 so it kind of grew to this this is a simplistic view of our architecture
00:10:02.640 we have travis listener which listens from github q stuff to rabbit that then goes to travis hub which will
00:10:08.959 process the requests send it to the database and buy a pusher the user will of course talk to travis
00:10:14.959 fire traverse ci which is the web part and then all of the jobs go to travis worker which uses virtualbox
00:10:21.680 and i will just state that i want to stab virtualbox in the eye but
00:10:26.880 we've got 10 of these boxes that we manage and this works this is how we have scaled
00:10:34.079 as a side note we're splitting travis hub into three bits so now when you look
00:10:39.120 at this it's not four different bits it's actually you know seven different bits and on top of that
00:10:44.560 we've got other bits behind the scenes we use everything from redis to rabbit
00:10:49.600 to virtualbox jruby and mri we're using an assortment of everything to make this
00:10:56.000 job possible and it's kind of built around a service orientated architecture
00:11:02.399 but we kind of call it the mini soa version because we are sharing services
00:11:08.560 and when we say services i'm saying we share reddises and databases between these little apps you
00:11:15.279 generally shouldn't be doing this but anyhow we'll get to this right now we have a total of 10
00:11:21.920 deployable apps and on top of that we have four different environments
00:11:27.279 we have org staging and production and we've got a whole separate setup for pro
00:11:32.560 just to keep everything a little bit more secure and also because org is sponsored we don't want
00:11:38.399 that to leach over the pro so we've got four different environments totally 40 different deployments
00:11:45.279 i mean that's a lot so really measuring and monitoring is a challenge but even more so it's more
00:11:51.440 important to get it right we need to know what's going on inside of travis so we can optimize and make it better for
00:11:58.240 you the user also i used to play canoe polo i'd like us to get to know each other
00:12:05.680 so let's talk about logging so logging is important everyone here is
00:12:11.519 doing it and they might not be knowing it but you're doing logging it's useful but
00:12:16.880 this is the standard rails logging we've got this is what seven eight different lines
00:12:24.000 just for your production log this is your default in production what we really want is something simple
00:12:30.399 we want something concise and it should be easily readable it should be one line
00:12:36.959 if you look at this the rails log standard rails log we can actually break it down we've got the request type the
00:12:43.839 url the format the action status total duration view and database
00:12:49.200 those are all there but there are over eight different lines why not something like this
00:12:55.680 this is all we need and even better it's on one line so this is what we use across our web
00:13:02.079 applications in production it keeps it nice and simple it's keyword we can search for it and it means that we don't
00:13:07.279 have super verbose logging going on and you can find this available it's from matthias one of the travis guys
00:13:14.399 it's under roid rage log rage and you can just add it to your rails app and you should have
00:13:20.240 better logging just for production but what about non-web apps that's not going to work for a non-web app it's
00:13:26.320 very rails specific so when you're creating a ruby agent or
00:13:31.760 ruby tool you want to think about how you use logging logging has everything
00:13:36.800 from debug info and error you know simply put you want to debug
00:13:42.639 lots you want to info a bit and you want to error those exceptions it's not a huge it's not a huge thing to grasp
00:13:50.240 really you want to be logging all the things now if we think about travis law travis
00:13:56.240 hub just for a second here this is one of our components we're doing about
00:14:01.680 100 requests per minute using rabbitmq and doing processing via
00:14:07.519 threads and subscriptions it has things like build finished notifications and it's doing background
00:14:13.839 processing and threads and this is what we have you'll see that on the well down here
00:14:21.120 these are all related log messages and why they are is that you can see that the build idea is the same
00:14:26.959 so you need to think about context when you're logging an easy way to group them if it be tagging if it be putting an
00:14:33.680 identifier or a uuid what you want is any log message that happens in your processing needs to be easily
00:14:41.440 grouped and attributed to a previous log message if that's applicable
00:14:46.720 and what you can see is it's a very simple kind of context we've got the type
00:14:53.279 debugger info the class the method and any extra information and the model and its id
00:15:00.639 and for an error it's just even simpler error class and error message
00:15:05.760 it allows us to then go you know just detail that and to grab and we will get a live streaming of
00:15:11.680 anything to do with that build id but on travis work it's a little bit different so that was the part that's
00:15:18.000 actually processing the jobs which is managing the virtual boxes that's got five vms per worker which is effectively
00:15:25.199 five jobs per worker at any one time what the goal is so every time you look
00:15:30.639 at logging you should have a specific goal in mind and for our goal for logging we wanted to easily be able to
00:15:37.120 see what was going on and we needed it to be able to do it live because we don't want to disturb the application we just want to tail
00:15:44.320 so in here the context is the worker and you'll see on the left hand side
00:15:50.880 on the left hand side it says ruby 2 and then it's got each different class and then what's going on so we can even see
00:15:57.600 in a nice clear human readable way about to run about to perform change directory
00:16:02.880 export etc we know what commands it's running as well we can see it's doing a bundle install on the bottom
00:16:09.199 being able to see that makes it much easier to just tap in and watch the build and make sure nothing's exploding
00:16:17.920 it's a much more verbose form of info compared to the rails app which is usually doing it on a singular request
00:16:24.880 but it's a different goal so don't you don't want to treat every app the same
00:16:31.360 and as i said like everything's grouped appropriately making it easier to read and then all we need to do is do a nice
00:16:37.680 little tail grab for ruby 2 and we can see everything going on with that worker
00:16:42.720 and even better we can improve it further we can consolidate all our logging
00:16:48.079 for example you could async it to a single server you could use something like graylog 2
00:16:53.519 which is built on ruby and java uses elasticsearch self-hosted so you need to look after that
00:16:59.199 or you can use something like paper trail hosted and just pure awesome and for us now we've got all of our web
00:17:06.079 stuff or our hub stuff going on in paper trail nice and easy to see we've also got search at the bottom and then you
00:17:12.319 can go back in the time frame nice and easily it's got excellent stuff like archiving notifications and web hooks when it
00:17:19.199 detects a certain stream so now you don't have to build an alert in yourself you can get to have it detect certain
00:17:25.199 errors and then maybe notify you on campfire and of course we love peter trail
00:17:31.600 oh i also used to play nick bottom in a midsummer's night dream
00:17:37.919 thank you in high school so exceptions boom
00:17:44.559 so exceptions you never want to check logs for errors
00:17:49.679 that is no fun and we're beyond that when we're doing exception logging we want to
00:17:56.559 use an app a service or a library for this if you're not doing this already there's a million available and just to
00:18:02.559 clarify really quickly we've got something as basic as exception notifier a very old rails plugin would send you
00:18:08.240 an email each time you could do something like century which is self install or hosted and this
00:18:15.679 is built on python if i believe right and it's very nice interface and it's just
00:18:21.760 simple you've also got urbit which is a self install and written in ruby you can host
00:18:27.840 that on heroku as well for us we use haystack so haystack is built by the guys at
00:18:34.080 github sadly it's not open source but this is what you get so this is an example of an internal
00:18:41.679 tool that they've seen the need for and the difference that you'll see here is that there's a nice pretty graph and
00:18:48.640 visualizations and context to be able to see what's going on with your errors so you've even got the most common errors
00:18:54.240 and then the latest exceptions along with a nice little metric of how many exceptions there are
00:19:00.240 and of course you get the normal breakdown of what's going wrong so you can easily add extra bits to here and
00:19:06.640 you can have the back trace a lot of people see errors as a to-do list
00:19:13.360 me personally i don't i don't see errors as a to-do list i see errors as a natural part of your app errors are
00:19:19.520 going to happen it's you know i'm not saying ignore them but don't fret over them they're not the end of
00:19:25.360 the world the best error exception notifier is your customer your user
00:19:31.200 that's that's how i find it personally if an error goes wrong and travis we know something's bad when someone tweets
00:19:36.400 saying hey sync isn't working respond what monitor twitter and email
00:19:42.799 like support emails twitter oh this is my little sister so
00:19:50.480 that was very unexpected even for me um i i should actually start just littering
00:19:56.559 them everywhere and i won't even know where i'll just get someone else to do it next time so exception monitoring is really
00:20:02.559 important but don't see it as the first thing that you need to do every morning it is a good thing to get rid of them
00:20:08.400 though so do watch the big numbers uh and my little sister
00:20:13.679 so let's talk about metrics who here uses new relic again just a
00:20:18.720 show of hands awesome who here is using new relic and not
00:20:23.919 for a web application one person
00:20:29.280 maybe two over there was it was two sold okay so this is what you'd normally
00:20:35.919 see in new relic you know pretty graphs a nice little aptx score throughput
00:20:41.360 awesome it's really super powerful but there's something that you might not
00:20:47.200 notice here this is not for a web application new relic is bendable around your agent
00:20:54.240 apps anything that's doing background processing it might be background jobs might be a whole separate app just
00:20:59.679 during amqp subscriptions this here is all of our amqp subscriptions and we can see which ones in particular are very
00:21:06.320 slow or in this case the most time consuming and for us to do that we just built a
00:21:11.760 little wrapper it's a little bit of handy meta programming but we attach we tell a new relic to attach its
00:21:19.679 its proxy and controller notification so we can then get a trace for that method and then what we get out of that
00:21:27.039 is the beauty of new relic you get all of the database breakups all of the
00:21:32.480 web external breakups as well and then you also get to dive into each of those requests and see what's going
00:21:39.280 on it's really powerful if it i will say though it's not incredibly intuitive to
00:21:45.360 add because they call it controller notification and people here will probably think of a web controller as
00:21:51.200 well a web controller that's what i associate controller notification with but for us
00:21:56.799 controller notification was used for an amqp subscription which was a different way of hooking into it nonetheless it's
00:22:04.480 really powerful so you can get working with it straight away because if you're using new rail sorry heroku you get the
00:22:11.360 standard for free you can just shove your app on roku and then you get all of this pretty stuff
00:22:17.760 so as i said not just for new relic it's for all apps i love popcorn
00:22:23.440 we're really getting to know each other here is great so let's talk about metrics
00:22:29.360 so metrics are awesome okay hands down awesome
00:22:34.400 i would like to introduce you to matthias our chief metrics officer and his pioneer and i when i was preparing
00:22:41.679 this talk i've given it a few times now and i was like matthias how do i explain to the audience how awesome metrics are
00:22:48.640 and he said very abstractly metrics are to a running app as unit tests are to
00:22:54.159 your code it's like that's beautiful that even i have trouble understanding that tell me again
00:23:01.039 well it represents a certain part of your app by measuring how long it takes and how often it occurs at a given point
00:23:07.039 of time it's as simple as that i've even attributed him to 2012
00:23:13.200 just in case you ever reused this effectively metrics is attached to a
00:23:18.320 method call it's about how long it took but it doesn't have to be attached to a
00:23:23.440 method call but it's attached to something that's happening in your code an event and for us we use it on the method calls
00:23:30.000 a lot so for example there are four parts to metrics if i if i start there you've got collecting
00:23:37.039 sending aggregating and visualizing those are the four main parts of metrics
00:23:43.760 and if we talk about collecting for a second i want to introduce you to active support notifications
00:23:50.960 if you don't already know it you're using it every single day in your rails app it's littered everywhere it's even
00:23:56.720 how logging works there's no logger dot print or logger dot info that happens via active support notifications
00:24:06.480 the greatest thing is it's a very simple publish and subscriber or observable
00:24:11.840 kind of pattern you publish events you instrument some code or you sub and then you subscribe to
00:24:18.080 those events or instrumentations you can subscribe to one event type or
00:24:23.600 multiple using regex it's really simple you publish one-time events
00:24:29.679 or you can yield a block and time it or you can just publish an event using instrument and send some extra info
00:24:37.120 i won't go fully into active support notifications but i would highly recommend that you buy
00:24:43.440 this book from joe's avaleem it is absolutely hands down amazing will give
00:24:48.640 you a very thorough insight into what's happening inside of rails and also that particular topic of notifications
00:24:55.520 so i'm just going to refer to active support notifications as asn for now but let's look at how we could use it
00:25:02.400 that's all we need to do we can just start littering our code with publishing and campfire and urls
00:25:09.039 and we can start passing information about this was called we can go instrument
00:25:14.880 okay so they're pretty much the same or we can instrument and it will do a timing as well
00:25:21.200 nice and simple we use it to publish three events in travis now this is all over our code we
00:25:28.880 published a received completed and failed
00:25:33.919 you'll kind of you'll see a pattern similar to this you'll send a message and we say oh
00:25:40.080 someone fired off a uh campfire message or notification we do
00:25:46.400 some processing and then we publish that it completed successfully or if there is an error we publish that failed and we
00:25:52.799 re-raise it's nothing particularly special but because this is a whole lot of craft
00:25:58.080 code ah completed sorry josh there we go
00:26:04.080 because of the craft we use a little bit of meta programming and now you just go instrument and send message
00:26:09.600 and all we're doing is emitting a whole lot of events these events are completely metrics agnostic so like step
00:26:16.799 one is that you can even add this to your library or gem but you can add this all over your code knowing that it's not
00:26:22.240 going to take up extra cpu it's not going to make your app slower it's going to just start emitting events and then
00:26:27.360 you can start thinking about how do you want to use those events it's very general purpose
00:26:33.120 and it's all about the subscriber this is the beauty oh i love german food
00:26:41.120 so let's think about sending and aggregating for a second statsd
00:26:47.120 that's no js using udp is one of the methods that we can use we can start aggregating and sending if or sending
00:26:52.880 and have it aggregate it uses simple udp messages like this where you'd say this
00:26:58.080 is one counter increment by one please or we can do a timing like 320
00:27:03.760 milliseconds and you can even do this from your console you can set up stat steve and
00:27:09.679 then just do this using netcat nice and simple you've got a nice little ruby gem for it
00:27:16.240 you can just set up the host in the port increment do some timings boom done
00:27:21.279 and it's only 155 lines of code which is a very small gem it's got some tests in there
00:27:27.600 but this is what we use we use a combination of metrics d which is by eric from paper trail
00:27:35.279 and it's based on ruby and event machine it's coupled together with the metrics
00:27:41.360 gem now the metrics gem is very similar again you're not doing anything mind blowing here we set up counters and we
00:27:47.440 get to increment and we can do some timings and then we flush that every x amount of
00:27:53.600 seconds so we flush it every 60 seconds and then what metrics the gem does is it
00:27:58.799 part aggregates it aggregates within that process and then it sends it and it now what it
00:28:05.039 does is it sends it to a reporter so this is quite cool because now you can send to different types of reporters
00:28:12.159 and in this case we can send it to our logs and we can actually see what's happening in our logs for each of these
00:28:18.399 metrics so let's let's subscribe to something here
00:28:23.600 what we're going to do is we're going to add a started app so we know how long it took we'll continue to use this instrument
00:28:30.480 we're going to also do the started and finished app on the completed and then we subscribe
00:28:37.520 we just subscribe to anything using that namespace travis at the top and then if it's if there is a finished
00:28:43.760 stat then we send the time it took and if not we just mark it that this
00:28:49.120 method was called so it's nice and simple this is all you need to subscribe your
00:28:55.120 entire you know code base you litter it with events you do a nice little subscriber
00:29:01.279 and now what we do is we start the logger and this is what we get out we get everything from now you'll see
00:29:07.440 that at the top it's active record and we've got a timer we've got how many times it happened the one minute rate 15
00:29:14.559 minute rate means max standard deviations everything this is just for one process
00:29:20.240 so now what you can do is if we think about it this is what we've
00:29:25.279 just set up so we use paper trail this is travis in particular travis uses active support
00:29:30.480 notifications we use metrics and then it goes to paper trail so all of our metrics are going to paper trail
00:29:36.960 it's a very quick way to get started and it means it's very visual as well so it gives you a little bit of debug time
00:29:44.320 but it's the thing that comes after it that's really important because what we're doing now is we're emitting events
00:29:49.919 we're aggregating we're sending awesome now we need to start visualizing we need to start making use of these effects
00:29:57.039 so there's graphite and graphite is a self-hosted product so
00:30:02.159 you're going to need to run this yourself and it gives you very thorough graphs it's not the prettiest interface but
00:30:08.880 sometimes it isn't about the interface even though it actually is that's why i have a mac and
00:30:15.039 then there's librato uh librato we are in love with we get pretty graphs like
00:30:20.240 this that allows us to do absolutely everything and create our own dashboards this here is actually
00:30:26.080 a set of our errors we have everything from http errors on the top left to amqp
00:30:32.080 messages we've got log entries and also notifications and if we look at here this gives us a
00:30:38.720 breakdown of every type of notification we're sending the top green bar the big peaks are actually pushing notifications
00:30:45.279 so you can see that we're sending a crap load and the cool thing is that you can embed
00:30:50.320 these graphs into your campfire including now live graphs so you can actually share a graph with
00:30:56.159 your team and say let's monitor it live together and start commenting this is actually one of the graphs that
00:31:02.240 we blogged about a while ago what is what you'll see here is that we deployed some code and then boom we had a lot of
00:31:09.679 errors and then we fixed the bug quickly and everything peaked off you know we still had some high peaks and then we fixed
00:31:16.799 another error and they just all went away so as you can see you could use something
00:31:23.200 like you know urbit or get sentry or haystack but you can also use
00:31:30.000 librato and your own metrics and monitoring as well even better is to sayo
00:31:37.279 it gives you a nice handy dashboard that links into your librato or graphite metrics now actually the original fork
00:31:44.159 of it or the original version of taseo which is on uh obscurity at the bottom there the
00:31:50.240 github that only works with graphite at the moment but if you use mathias or roid rages fork this can also
00:31:58.159 work with librato so you're getting a nice little dashboard and it's also an easy dashboard that you can note your
00:32:04.399 graphs you'll see at the red one amqp ready you can see that we've got 17 000 18 000 messages this is another error
00:32:12.320 and we can see nice and quick and simply that is a big warning sign we need to do something about it
00:32:19.360 so if we put everything together in our holy circle of metrics
00:32:24.480 it goes into paper trail we need a web hook to pull it out of paper trail and then it goes to librato but you know we
00:32:30.960 can do a lot better here so we've got a lot of aggregating and sending it's a little bit silly
00:32:37.679 so we can use the metrics d reporter along with metrics d and now we get something like this
00:32:45.519 okay we already cut one piece out of it or we can go even one more step
00:32:51.440 and we can use the labrado metrics reporter and then this is all you need
00:32:57.200 librato will do the aggregating the sending and the visualizing for you
00:33:02.799 i used a scuba diver crayfish fun fact so the next part is like you're using
00:33:09.279 all these tools you're doing logging properly you're doing exception tracking you've got metrics
00:33:16.000 but the third part to all of this is the internal tools so i want to show you travis admin it
00:33:22.720 was a little app that we've built it's open source as well you know all of travis is open source
00:33:28.399 and this is a app built on sinatra along with bootstrap by the beautiful
00:33:33.840 constantine hazard i call him constantine house i would urge you all to do that because i don't
00:33:40.080 think he likes it that much i love you constantine in case you're watching um
00:33:46.159 so you this is part of the app it's composed of three four different bits this one here allows us to view the
00:33:51.679 service hooks set up for each of the repos what we do is you would put in a user on the or the user and a repo on
00:33:58.640 the right hand side and it checks the database for their oauth token and then we can check the hooks set up for that user we can
00:34:05.279 confirm that the hook is set up correctly we can also disable pull requests if we need to or maybe we can
00:34:11.200 enable some extra things to look out for so you'll see pull requests just down here
00:34:17.440 we've also got the gh web console gh is a gem created by constantine which
00:34:22.960 is a nice way of interacting with the github api it's different from octocat this one uses the headers a lot to
00:34:30.000 traverse and it's a bit more low level it's very low level but what we can do is we can put the user and the
00:34:37.760 the command that we're going for like user repos and we get a nice formatted
00:34:43.040 json list along with the headers uh on the right hand side so we can see
00:34:48.800 what our rate limit is and what was the scope of that request so it's a nice way of seeing what's going on with our
00:34:55.359 github requests because we do a lot of them but the crown of all of this is the
00:35:00.960 event monitor so as i showed before this was our infrastructure it's it's not
00:35:06.480 even part of it it's just half of our infrastructure and the dear old nicolas cage joined our
00:35:13.680 team and we tasked him to create a way to unify all of our notifications
00:35:19.520 so if we have all of this information going on how can we kind of see the path that a request takes comes in by the
00:35:26.240 listener it goes to hub it goes to the worker and then it comes back to hub how do we visualize that
00:35:32.240 so he built an app for us built on ember js and using redis pub sub
00:35:37.680 and we've got something like this now you might not see it very easily but i'll turn on the live version in a
00:35:43.920 second but what we have on the left hand side is an event list all of the events that happen and the master event so in
00:35:50.000 this case you'll see that a job finished and then we see all of the events that happened
00:35:56.880 after that that might be a push a notification was sent an irc notification was sent database was
00:36:03.520 updated so we can see all of the sub events and then we can see the payload attached to each of these so now you can
00:36:09.440 unify and actually see live what's going on this just goes to redis it expires
00:36:14.560 quickly and we can watch it live and then we can pause it if need be it uses travis no uh instrumentation a
00:36:22.160 custom thing that we built for doing all of these hooks along with active support notifications but
00:36:27.839 what i'll do
00:36:41.359 so this should be travis happening right now
00:36:49.520 and wow it's very quiet i know there we go and you can see that it's updating
00:36:55.119 accounts we can kind of see the payloads that are going on we can see the [ __ ] ruby driver just
00:37:00.400 started and all the stuff that started with that so it allows us to see not what's going on in a logging kind of
00:37:07.200 way but what's going on in a rich json logging kind of way we can kind of see
00:37:12.720 all of the events that are happening after each other now right now if you wanted to go use
00:37:17.920 this it's completely open source
00:37:24.000 if you want to use this it it's completely open source but i will admit that there's a fair bit of
00:37:30.000 code that is kind of bound to how we saw the problem i'd love to have other people try this out and get involved and
00:37:36.400 see how we could take this to the next step it's a lot of the code is located in
00:37:42.400 travis support which is one of our repos on github it's completely open source
00:37:48.960 um i'd like to just finish off very quickly to say that
00:37:54.720 the whole point of this talk is to not just deploy your app to heroku or to
00:38:00.079 capistra or using capistrano to a dedicated server you need to know what's going on inside your app
00:38:05.760 knowing what's going on inside your app will allow you to find those errors to make things faster and to find out what features your users are using in fact
00:38:13.119 the one of the best ways you can use metrics is start attaching it to features start seeing the paths that
00:38:19.760 people use and what they're doing and how often something is used if you just spent two weeks working on a feature
00:38:25.280 that no one uses is that a good return on investment don't treat your app like a black box
00:38:31.760 don't just deploy it and forget about it and just look at those nice little circle jerk stats actually look into
00:38:37.280 what's going on the knowledge will help you improve don't look at those big war numbers
00:38:43.920 use services all of these services are available and they're amazing build internal tools don't build them
00:38:50.960 straight away but over time figure out what is it that you need or your team needs to make it
00:38:56.720 easier for you to administer the product you don't need to build it in your current normal application build it as a
00:39:02.000 side app that's why we have travis admin that stuff does not belong in travis itself it's just another tool that would read
00:39:08.480 the database and all the external services services are your friends i think i've said services of your friends enough
00:39:15.520 you can find us on irc and you know on github and on the twitters
00:39:21.760 if you'd like a pretty sticker come talk to me you are awesome
00:39:27.119 thank you
00:39:36.400 and any questions
00:39:43.760 this one there's two i don't think we have microphones so you any just yell it out
00:39:49.680 travis pro the council bill uh email support travelci.com
00:39:55.599 we've got a blog post with our pricing and all of the plans um we are freely giving out the invites
00:40:02.240 we're just controlling that slowly while while we get those pages up and running um
00:40:07.839 yeah email us we'll get you on there
00:40:22.079 in terms of
00:40:30.480 i guess the class declaration of like saying instrument this method because it's our way of pulling it out of the
00:40:36.319 code we don't have to have it sitting with the method underneath you can actually eval that into the classes in a
00:40:41.680 different area later on and then you can group them together but we find that it's the least intrusive
00:40:47.599 way we're not actually worried about instrumenting too much in fact the more you instrument the better it's about how
00:40:54.000 you kind of use that statistics like the main things that you should probably look at like start easy
00:40:59.520 is how long do maybe some of my notifiers take if i'm posting to twitter
00:41:04.640 or facebook how long do they take how many of them error out maybe you can look at features as well like it's a
00:41:10.400 great way to find out what your users are doing within your system if you have an amazing time tracking app and no one
00:41:17.440 ever uses the schedule function then maybe you need to think about how you rework that it can help you with
00:41:23.839 your a b testing metrics is very abstract it's not just about code but it's about everything that can kind of
00:41:28.880 go on within your app but we generally feel that every time we add
00:41:33.920 a feature now we make sure that we add a metric so we can track what's going on with it as
00:41:46.839 well you can't because you only have http for heroku we can't run metrics d
00:41:52.480 on there or statsd we use at the moment the web hook with paper trail but we'll soon be changing
00:41:59.839 over to using the librato notifier reporter so now it will just be the app
00:42:05.200 will use the metrics gem and then we'll go straight to librato librato's building and better alerting it's kind
00:42:10.880 of the one thing i didn't cover in this talk because it's a little bit hard to throw in everything including the kitchen sink
00:42:16.960 is the the holy thing that mixes everything you know connects it all up is alerting
00:42:22.960 so the final stage in all this is that when you've got all these metrics you've got all this uh the measurements and the
00:42:29.119 monitoring you need to be alerted when something hits peak now teseo takes it to a certain degree you're getting nice
00:42:35.680 colorings out of it but what you really want is campfire or pager duty or you
00:42:41.040 know someone to come out and get you from the restaurant or you know you need a way to find out when something's going
00:42:46.240 wrong it's the first step this step takes you to knowing what's going on inside the app then you need to kind of
00:42:52.640 add the alerting to notify you when something's going a bit astray
00:43:02.240 no other questions oh well thank you very much if um you've got anything else just come talk to me after thank you
00:43:42.319 you
Explore all talks recorded at RubyConf 2012
+46