Your App is not a Black Box

RubyConf 2012

Bookmark

Play on YouTube

Edit

#continuous-integration

Your App is not a Black Box

Josh Kalderimis • November 01, 2012 • Denver, CO • Talk

In the talk titled "Your App is not a Black Box," Josh Kalderimis from the Travis CI team emphasizes the importance of understanding what happens inside an application rather than viewing it as a mysterious entity. Kalderimis discusses how Travis CI utilizes distributed continuous integration to manage thousands of tests daily and highlights the tools and strategies they developed to achieve visibility and transparency in their services.

Key Points:

Awareness over Ignorance: The main theme is that developers should not treat their applications as black boxes; knowing what goes on inside enables optimization and improved performance.
Tools and Services: He discusses various logging and monitoring tools used in Travis, including NewRelic, Librato, Graphite, and custom-built internal tools.
Logging Best Practices: Kalderimis stresses the need for concise logging that is readable and usable, advocating for single-line logs that convey essential information without verbosity.
Error Tracking: Importance of using exception notification services like Sentry and Haystack to track application errors effectively rather than manually sifting through logs.
Metric Collection: The talk delves into metrics collection methodologies using ActiveSupport Notifications, stressing that metrics are to an application what unit tests are to code.
Aggregating and Visualizing Data: Discusses leveraging services to aggregate and visualize metrics data, illustrating how Travis CI utilizes tools like Papertrail and metrics aggregation to monitor application performance.
Building Internal Tools: Kalderimis encourages the development of internal tools for better team management and service monitoring, showcasing a tool they created for monitoring events and service hooks in the Travis CI system.

Concluding Takeaways:

Applications need to be monitored and optimized continuously through proper logging, error tracking, and metrics.
Continuous integration tools like Travis CI can provide invaluable insights when utilized with a strategy that emphasizes visibility.
Utilizing existing services and building internal tools has a substantial positive impact on managing application performance and reliability.

Your App is not a Black Box
Josh Kalderimis • Denver, CO • Talk

Date: November 01, 2012
Published: Tue, 19 Mar 2013 00:00:00 +0000
Announced: unknown

This talk is part story, part code, and part mustache.

Travis CI is a distributed continuous integration system running over 7,000 tests daily. For us to get a true insight into what is going on behind the scenes we have had to come a long way by integrating and building both tools and libraries so that Travis and its many parts are not just a black box of information.

Reading logs and using NewRelic is not new, but far from enough, especially when it comes to apps which are composed of many smaller apps, like Travis. How do you track and visualize requests being processed by multiple services? How do you silence verbose logs while not losing the core of the message? And how do you aggregate, visualize and share metrics?

A lot of how we track, manage, and visualize what is going on in Travis has been created as a set of internal tools and by using a range of awesome services, but core to a lot of this is ActiveSupport Notifications and Travis Instrumentation.

This session will give insight to how Travis is composed and connected, as well as shedding light on how simple it can be to gain more visibility into even a complex, distributed system like Travis, as well as your applications too.

RubyConf 2012

00:00:15.599 hello hello i guess it's time for us to start it's

00:00:21.840 uh the 4 30 mark a couple of people coming in hi guys

00:00:27.039 awesome uh welcome my name is josh

00:00:32.320 or if you want to be really formal josh calderemas you can find me on github or the twitters or you can just call me mr

00:00:38.960 calderemas and find a theater uh this is me with anthony he's a sexy guy

00:00:44.879 uh i'm from the travis team but before we get started you know it's the afternoon you've just had some

00:00:50.800 donuts and coffee could you all just do me the favor of standing for a second you have everyone everyone in the back

00:00:56.879 with a laptop yep yup everyone in that section yeah perfect okay

00:01:02.559 so i want you to stay standing if you are logging anything from one of your apps

00:01:08.720 so it could be rails it could be an agent if you're using logging

00:01:14.320 i'd expect this there's one person that sat down that's a shame um

00:01:20.400 i want you to stay standing if you're using new relic in any shape or form

00:01:26.400 okay i want you to stay standing if you're using custom metrics for example librato or

00:01:34.479 graphite statsd no

00:01:40.560 sorry okay now i want you to stay standing

00:01:45.840 if you're using custom logging custom new relic custom metrics and a dashboard

00:01:51.040 tying them all together boom

00:01:56.479 boom goes the dynamite thank you guys okay and the people that were standing up you can just feel good

00:02:02.479 and maybe i'll give you a sticker man just feel good yeah okay now you can sit

00:02:07.520 okay so two small things before i get my my proper talk started uh i want you to meet the travis team

00:02:14.400 uh this is technically our official outfit we wear a onesie you join the travis company

00:02:20.239 and we give you a onesie

00:02:27.040 i have a feeling my microphone's gonna fall down there we go this is perfect for coding and you can

00:02:33.760 buy them online for a good price of 120 we'll even put a travis logo on the back

00:02:39.680 we used to wear shite shirts and awful moustaches don't tell sven i said that

00:02:45.120 beautiful moustaches and we used to present together in them but

00:02:51.200 now this is a side note uh is there any one from engine yard here

00:02:56.800 okay well thanks to the beauty of engine yard they're actually sponsoring piotr sanaki

00:03:02.480 drogues on twitter he did a lot of the rails engines work and was working on sprout core so he is now working full

00:03:08.319 time on travis open source so i i just want to give a huge thanks to them

00:03:13.760 because they're helping push travis and the open source testing part of it ford

00:03:18.959 so thank you engineered for being in other talks uh

00:03:26.239 oh it does all the way up and there's a summer version

00:03:34.159 so if you know anything about travis travis is a crowd-funded company we started

00:03:40.319 roughly in january putting up a crowdfunding page we raised about 125

00:03:45.599 000 usd and from that it was from about 780 individual donations from people in

00:03:52.239 the audience here and from at home 29 companies also joined in to help us

00:03:58.000 get going you know everyone from engineer to heroku to bendy works

00:04:04.480 to soundcloud and this is only a fraction of them so these are the people that have helped make travis possible

00:04:11.360 so you know we we want to give a lot of love and thanks to anyone here who has donated anyone who's donator who's

00:04:17.280 outside of the room as well but a big thing for us and we haven't actually released it

00:04:22.560 officially but thanks to the beautiful people at sticker meal they have or

00:04:28.160 they've in the past created a whole lot of stickers for us so if you donated you would have got these beautiful stickers

00:04:34.720 in the post but thanks to david who works at sticker mule and his pretty dog

00:04:40.560 we now have an official mascot we have a travis and we have a whole lot of travis

00:04:46.880 stickers so if you want to travel stick later come and get me and i've got one for you

00:04:52.960 but back to the talk so as i said my name is mr calderemis i come from this very far out place

00:04:58.479 called new zealand we're known for hobbits and orcs

00:05:04.240 cable cars beautiful views i'm from wellington to be exact beautiful nightlife amazing

00:05:10.800 coffee and the most fantastic amusement rides ever and i've also been on one of those it's

00:05:17.680 one of those small planes where you can see out the front of the cockpit and as you're coming in you're like oh great i

00:05:23.199 can see the runway i can see a hill i can see a runway and then and then even while it's landing all you can see is a

00:05:29.280 hill up to the last second uh the last four years i've been living in this beautiful place called amsterdam

00:05:36.160 where it's known for its huge amount of bikes it's amazing architecture

00:05:41.280 it's night life it's recreational activities

00:05:47.440 but now i'm technically a traveling nomad i've got no official home i'm in

00:05:52.560 between san francisco new zealand san francisco berlin i travel with a ukulele

00:05:58.160 i've got a nice little coffee kit and today i'm here to talk to you about fun things like logging and metrics and

00:06:06.240 monitoring it's part theory it's part code it's part internal tools and it's part

00:06:12.240 moustache but you'll notice that the talk title in the

00:06:18.160 in the conference planner was about the black box and and i had a huge issue trying to find a

00:06:25.440 good talk title having a good talk title is what people look for and i was at the american bookstore in

00:06:32.960 the netherlands it was actually just before i was going to sao paulo and i was like you know josh you don't know

00:06:38.319 anything about sao paulo and you're going to sao paulo and this was the day before going to sao paulo so i had to

00:06:44.319 run here to get myself a brazil book so i actually knew what was going on and i came across this book and it's

00:06:52.080 it's not how good you are it's how good you want to be and it had these really nice little

00:06:57.120 lessons in there for example fail fail again falben all of these things that we're taught as programmers and when

00:07:02.880 even starting a business like the person who doesn't make mistakes is unlikely to make anything

00:07:08.720 or if you can't solve a problem it's because you're playing by the rules you know nice little lessons so i thought to

00:07:14.800 myself since i'm doing a talk about metrics and what's going on inside your app really

00:07:21.199 it's not how good your app is it's how good you want it to be what i want you to take away is that

00:07:27.440 your app when you deploy it wherever it is it's not just a black box you need to

00:07:32.639 know what's going on inside of it knowing what goes on inside of it gives you the power to optimize and to make

00:07:39.759 things better it's not about the war we're doing a million you know web requests a second

00:07:46.319 that is a very very uh circle jerk kind of metric there we want

00:07:52.400 something we need to know really what's going on inside it and we've got a whole bunch of tools and libraries available

00:07:57.680 to us everything from exception notifier to librato we should be using these tools and even

00:08:04.240 better you should be building internal tools internal tools will allow you to

00:08:10.879 mold the data and allow you to really abstract what you need and on top of that services are your friends don't

00:08:17.280 build things when things are available for you to use it doesn't matter if it costs 100 bucks a month these services

00:08:23.280 are worth their weight in but anyhow i come from this this nice

00:08:28.560 little crowdfunded company called travis and we run your tests for you you might have seen a little status image and you

00:08:36.080 know across github we run those tests it's free it's open source it's full of ferries and unicorns we've got another

00:08:43.200 piece of infrastructure and we are for builds what rubygems is for libraries

00:08:48.959 we have over 25 000 open source projects on us we do 10 000 test runs daily

00:08:56.399 over the last year and a half we've done 707 000 build requests on github

00:09:02.160 totaling 1.9 million total test runs and we support 11 different languages

00:09:08.800 everything from ruby to php to python to go i think there's eight different versions of ruby three different

00:09:14.560 versions of php and python we've got it if we don't have it tell us and we'll

00:09:20.480 add it and if you see this it's the sign that it's on travis

00:09:26.160 it was started for fun the whole idea was for it to be instant live and modern and

00:09:31.440 hackable and as any rails app it started kind of like this it's just a simple rails app

00:09:37.920 that would run on heroku it would use rescue and it would talk to a dedicated box we'd use redis and pusher and a nice

00:09:45.200 pretty database and it's you know it's about getting things out there this

00:09:50.640 was simple and this did the job it did the job when you had five projects but not when you've got 25 000.

00:09:56.800 so it kind of grew to this this is a simplistic view of our architecture

00:10:02.640 we have travis listener which listens from github q stuff to rabbit that then goes to travis hub which will

00:10:08.959 process the requests send it to the database and buy a pusher the user will of course talk to travis

00:10:14.959 fire traverse ci which is the web part and then all of the jobs go to travis worker which uses virtualbox

00:10:21.680 and i will just state that i want to stab virtualbox in the eye but

00:10:26.880 we've got 10 of these boxes that we manage and this works this is how we have scaled

00:10:34.079 as a side note we're splitting travis hub into three bits so now when you look

00:10:39.120 at this it's not four different bits it's actually you know seven different bits and on top of that

00:10:44.560 we've got other bits behind the scenes we use everything from redis to rabbit

00:10:49.600 to virtualbox jruby and mri we're using an assortment of everything to make this

00:10:56.000 job possible and it's kind of built around a service orientated architecture

00:11:02.399 but we kind of call it the mini soa version because we are sharing services

00:11:08.560 and when we say services i'm saying we share reddises and databases between these little apps you

00:11:15.279 generally shouldn't be doing this but anyhow we'll get to this right now we have a total of 10

00:11:21.920 deployable apps and on top of that we have four different environments

00:11:27.279 we have org staging and production and we've got a whole separate setup for pro

00:11:32.560 just to keep everything a little bit more secure and also because org is sponsored we don't want

00:11:38.399 that to leach over the pro so we've got four different environments totally 40 different deployments

00:11:45.279 i mean that's a lot so really measuring and monitoring is a challenge but even more so it's more

00:11:51.440 important to get it right we need to know what's going on inside of travis so we can optimize and make it better for

00:11:58.240 you the user also i used to play canoe polo i'd like us to get to know each other

00:12:05.680 so let's talk about logging so logging is important everyone here is

00:12:11.519 doing it and they might not be knowing it but you're doing logging it's useful but

00:12:16.880 this is the standard rails logging we've got this is what seven eight different lines

00:12:24.000 just for your production log this is your default in production what we really want is something simple

00:12:30.399 we want something concise and it should be easily readable it should be one line

00:12:36.959 if you look at this the rails log standard rails log we can actually break it down we've got the request type the

00:12:43.839 url the format the action status total duration view and database

00:12:49.200 those are all there but there are over eight different lines why not something like this

00:12:55.680 this is all we need and even better it's on one line so this is what we use across our web

00:13:02.079 applications in production it keeps it nice and simple it's keyword we can search for it and it means that we don't

00:13:07.279 have super verbose logging going on and you can find this available it's from matthias one of the travis guys

00:13:14.399 it's under roid rage log rage and you can just add it to your rails app and you should have

00:13:20.240 better logging just for production but what about non-web apps that's not going to work for a non-web app it's

00:13:26.320 very rails specific so when you're creating a ruby agent or

00:13:31.760 ruby tool you want to think about how you use logging logging has everything

00:13:36.800 from debug info and error you know simply put you want to debug

00:13:42.639 lots you want to info a bit and you want to error those exceptions it's not a huge it's not a huge thing to grasp

00:13:50.240 really you want to be logging all the things now if we think about travis law travis

00:13:56.240 hub just for a second here this is one of our components we're doing about

00:14:01.680 100 requests per minute using rabbitmq and doing processing via

00:14:07.519 threads and subscriptions it has things like build finished notifications and it's doing background

00:14:13.839 processing and threads and this is what we have you'll see that on the well down here

00:14:21.120 these are all related log messages and why they are is that you can see that the build idea is the same

00:14:26.959 so you need to think about context when you're logging an easy way to group them if it be tagging if it be putting an

00:14:33.680 identifier or a uuid what you want is any log message that happens in your processing needs to be easily

00:14:41.440 grouped and attributed to a previous log message if that's applicable

00:14:46.720 and what you can see is it's a very simple kind of context we've got the type

00:14:53.279 debugger info the class the method and any extra information and the model and its id

00:15:00.639 and for an error it's just even simpler error class and error message

00:15:05.760 it allows us to then go you know just detail that and to grab and we will get a live streaming of

00:15:11.680 anything to do with that build id but on travis work it's a little bit different so that was the part that's

00:15:18.000 actually processing the jobs which is managing the virtual boxes that's got five vms per worker which is effectively

00:15:25.199 five jobs per worker at any one time what the goal is so every time you look

00:15:30.639 at logging you should have a specific goal in mind and for our goal for logging we wanted to easily be able to

00:15:37.120 see what was going on and we needed it to be able to do it live because we don't want to disturb the application we just want to tail

00:15:44.320 so in here the context is the worker and you'll see on the left hand side

00:15:50.880 on the left hand side it says ruby 2 and then it's got each different class and then what's going on so we can even see

00:15:57.600 in a nice clear human readable way about to run about to perform change directory

00:16:02.880 export etc we know what commands it's running as well we can see it's doing a bundle install on the bottom

00:16:09.199 being able to see that makes it much easier to just tap in and watch the build and make sure nothing's exploding

00:16:17.920 it's a much more verbose form of info compared to the rails app which is usually doing it on a singular request

00:16:24.880 but it's a different goal so don't you don't want to treat every app the same

00:16:31.360 and as i said like everything's grouped appropriately making it easier to read and then all we need to do is do a nice

00:16:37.680 little tail grab for ruby 2 and we can see everything going on with that worker

00:16:42.720 and even better we can improve it further we can consolidate all our logging

00:16:48.079 for example you could async it to a single server you could use something like graylog 2

00:16:53.519 which is built on ruby and java uses elasticsearch self-hosted so you need to look after that

00:16:59.199 or you can use something like paper trail hosted and just pure awesome and for us now we've got all of our web

00:17:06.079 stuff or our hub stuff going on in paper trail nice and easy to see we've also got search at the bottom and then you

00:17:12.319 can go back in the time frame nice and easily it's got excellent stuff like archiving notifications and web hooks when it

00:17:19.199 detects a certain stream so now you don't have to build an alert in yourself you can get to have it detect certain

00:17:25.199 errors and then maybe notify you on campfire and of course we love peter trail

00:17:31.600 oh i also used to play nick bottom in a midsummer's night dream

00:17:37.919 thank you in high school so exceptions boom

00:17:44.559 so exceptions you never want to check logs for errors

00:17:49.679 that is no fun and we're beyond that when we're doing exception logging we want to

00:17:56.559 use an app a service or a library for this if you're not doing this already there's a million available and just to

00:18:02.559 clarify really quickly we've got something as basic as exception notifier a very old rails plugin would send you

00:18:08.240 an email each time you could do something like century which is self install or hosted and this

00:18:15.679 is built on python if i believe right and it's very nice interface and it's just

00:18:21.760 simple you've also got urbit which is a self install and written in ruby you can host

00:18:27.840 that on heroku as well for us we use haystack so haystack is built by the guys at

00:18:34.080 github sadly it's not open source but this is what you get so this is an example of an internal

00:18:41.679 tool that they've seen the need for and the difference that you'll see here is that there's a nice pretty graph and

00:18:48.640 visualizations and context to be able to see what's going on with your errors so you've even got the most common errors

00:18:54.240 and then the latest exceptions along with a nice little metric of how many exceptions there are

00:19:00.240 and of course you get the normal breakdown of what's going wrong so you can easily add extra bits to here and

00:19:06.640 you can have the back trace a lot of people see errors as a to-do list

00:19:13.360 me personally i don't i don't see errors as a to-do list i see errors as a natural part of your app errors are

00:19:19.520 going to happen it's you know i'm not saying ignore them but don't fret over them they're not the end of

00:19:25.360 the world the best error exception notifier is your customer your user

00:19:31.200 that's that's how i find it personally if an error goes wrong and travis we know something's bad when someone tweets

00:19:36.400 saying hey sync isn't working respond what monitor twitter and email

00:19:42.799 like support emails twitter oh this is my little sister so

00:19:50.480 that was very unexpected even for me um i i should actually start just littering

00:19:56.559 them everywhere and i won't even know where i'll just get someone else to do it next time so exception monitoring is really

00:20:02.559 important but don't see it as the first thing that you need to do every morning it is a good thing to get rid of them

00:20:08.400 though so do watch the big numbers uh and my little sister

00:20:13.679 so let's talk about metrics who here uses new relic again just a

00:20:18.720 show of hands awesome who here is using new relic and not

00:20:23.919 for a web application one person

00:20:29.280 maybe two over there was it was two sold okay so this is what you'd normally

00:20:35.919 see in new relic you know pretty graphs a nice little aptx score throughput

00:20:41.360 awesome it's really super powerful but there's something that you might not

00:20:47.200 notice here this is not for a web application new relic is bendable around your agent

00:20:54.240 apps anything that's doing background processing it might be background jobs might be a whole separate app just

00:20:59.679 during amqp subscriptions this here is all of our amqp subscriptions and we can see which ones in particular are very

00:21:06.320 slow or in this case the most time consuming and for us to do that we just built a

00:21:11.760 little wrapper it's a little bit of handy meta programming but we attach we tell a new relic to attach its

00:21:19.679 its proxy and controller notification so we can then get a trace for that method and then what we get out of that

00:21:27.039 is the beauty of new relic you get all of the database breakups all of the

00:21:32.480 web external breakups as well and then you also get to dive into each of those requests and see what's going

00:21:39.280 on it's really powerful if it i will say though it's not incredibly intuitive to

00:21:45.360 add because they call it controller notification and people here will probably think of a web controller as

00:21:51.200 well a web controller that's what i associate controller notification with but for us

00:21:56.799 controller notification was used for an amqp subscription which was a different way of hooking into it nonetheless it's

00:22:04.480 really powerful so you can get working with it straight away because if you're using new rail sorry heroku you get the

00:22:11.360 standard for free you can just shove your app on roku and then you get all of this pretty stuff

00:22:17.760 so as i said not just for new relic it's for all apps i love popcorn

00:22:23.440 we're really getting to know each other here is great so let's talk about metrics

00:22:29.360 so metrics are awesome okay hands down awesome

00:22:34.400 i would like to introduce you to matthias our chief metrics officer and his pioneer and i when i was preparing

00:22:41.679 this talk i've given it a few times now and i was like matthias how do i explain to the audience how awesome metrics are

00:22:48.640 and he said very abstractly metrics are to a running app as unit tests are to

00:22:54.159 your code it's like that's beautiful that even i have trouble understanding that tell me again

00:23:01.039 well it represents a certain part of your app by measuring how long it takes and how often it occurs at a given point

00:23:07.039 of time it's as simple as that i've even attributed him to 2012

00:23:13.200 just in case you ever reused this effectively metrics is attached to a

00:23:18.320 method call it's about how long it took but it doesn't have to be attached to a

00:23:23.440 method call but it's attached to something that's happening in your code an event and for us we use it on the method calls

00:23:30.000 a lot so for example there are four parts to metrics if i if i start there you've got collecting

00:23:37.039 sending aggregating and visualizing those are the four main parts of metrics

00:23:43.760 and if we talk about collecting for a second i want to introduce you to active support notifications

00:23:50.960 if you don't already know it you're using it every single day in your rails app it's littered everywhere it's even

00:23:56.720 how logging works there's no logger dot print or logger dot info that happens via active support notifications

00:24:06.480 the greatest thing is it's a very simple publish and subscriber or observable

00:24:11.840 kind of pattern you publish events you instrument some code or you sub and then you subscribe to

00:24:18.080 those events or instrumentations you can subscribe to one event type or

00:24:23.600 multiple using regex it's really simple you publish one-time events

00:24:29.679 or you can yield a block and time it or you can just publish an event using instrument and send some extra info

00:24:37.120 i won't go fully into active support notifications but i would highly recommend that you buy

00:24:43.440 this book from joe's avaleem it is absolutely hands down amazing will give

00:24:48.640 you a very thorough insight into what's happening inside of rails and also that particular topic of notifications

00:24:55.520 so i'm just going to refer to active support notifications as asn for now but let's look at how we could use it

00:25:02.400 that's all we need to do we can just start littering our code with publishing and campfire and urls

00:25:09.039 and we can start passing information about this was called we can go instrument

00:25:14.880 okay so they're pretty much the same or we can instrument and it will do a timing as well

00:25:21.200 nice and simple we use it to publish three events in travis now this is all over our code we

00:25:28.880 published a received completed and failed

00:25:33.919 you'll kind of you'll see a pattern similar to this you'll send a message and we say oh

00:25:40.080 someone fired off a uh campfire message or notification we do

00:25:46.400 some processing and then we publish that it completed successfully or if there is an error we publish that failed and we

00:25:52.799 re-raise it's nothing particularly special but because this is a whole lot of craft

00:25:58.080 code ah completed sorry josh there we go

00:26:04.080 because of the craft we use a little bit of meta programming and now you just go instrument and send message

00:26:09.600 and all we're doing is emitting a whole lot of events these events are completely metrics agnostic so like step

00:26:16.799 one is that you can even add this to your library or gem but you can add this all over your code knowing that it's not

00:26:22.240 going to take up extra cpu it's not going to make your app slower it's going to just start emitting events and then

00:26:27.360 you can start thinking about how do you want to use those events it's very general purpose

00:26:33.120 and it's all about the subscriber this is the beauty oh i love german food

00:26:41.120 so let's think about sending and aggregating for a second statsd

00:26:47.120 that's no js using udp is one of the methods that we can use we can start aggregating and sending if or sending

00:26:52.880 and have it aggregate it uses simple udp messages like this where you'd say this

00:26:58.080 is one counter increment by one please or we can do a timing like 320

00:27:03.760 milliseconds and you can even do this from your console you can set up stat steve and

00:27:09.679 then just do this using netcat nice and simple you've got a nice little ruby gem for it

00:27:16.240 you can just set up the host in the port increment do some timings boom done

00:27:21.279 and it's only 155 lines of code which is a very small gem it's got some tests in there

00:27:27.600 but this is what we use we use a combination of metrics d which is by eric from paper trail

00:27:35.279 and it's based on ruby and event machine it's coupled together with the metrics

00:27:41.360 gem now the metrics gem is very similar again you're not doing anything mind blowing here we set up counters and we

00:27:47.440 get to increment and we can do some timings and then we flush that every x amount of

00:27:53.600 seconds so we flush it every 60 seconds and then what metrics the gem does is it

00:27:58.799 part aggregates it aggregates within that process and then it sends it and it now what it

00:28:05.039 does is it sends it to a reporter so this is quite cool because now you can send to different types of reporters

00:28:12.159 and in this case we can send it to our logs and we can actually see what's happening in our logs for each of these

00:28:18.399 metrics so let's let's subscribe to something here

00:28:23.600 what we're going to do is we're going to add a started app so we know how long it took we'll continue to use this instrument

00:28:30.480 we're going to also do the started and finished app on the completed and then we subscribe

00:28:37.520 we just subscribe to anything using that namespace travis at the top and then if it's if there is a finished

00:28:43.760 stat then we send the time it took and if not we just mark it that this

00:28:49.120 method was called so it's nice and simple this is all you need to subscribe your

00:28:55.120 entire you know code base you litter it with events you do a nice little subscriber

00:29:01.279 and now what we do is we start the logger and this is what we get out we get everything from now you'll see

00:29:07.440 that at the top it's active record and we've got a timer we've got how many times it happened the one minute rate 15

00:29:14.559 minute rate means max standard deviations everything this is just for one process

00:29:20.240 so now what you can do is if we think about it this is what we've

00:29:25.279 just set up so we use paper trail this is travis in particular travis uses active support

00:29:30.480 notifications we use metrics and then it goes to paper trail so all of our metrics are going to paper trail

00:29:36.960 it's a very quick way to get started and it means it's very visual as well so it gives you a little bit of debug time

00:29:44.320 but it's the thing that comes after it that's really important because what we're doing now is we're emitting events

00:29:49.919 we're aggregating we're sending awesome now we need to start visualizing we need to start making use of these effects

00:29:57.039 so there's graphite and graphite is a self-hosted product so

00:30:02.159 you're going to need to run this yourself and it gives you very thorough graphs it's not the prettiest interface but

00:30:08.880 sometimes it isn't about the interface even though it actually is that's why i have a mac and

00:30:15.039 then there's librato uh librato we are in love with we get pretty graphs like

00:30:20.240 this that allows us to do absolutely everything and create our own dashboards this here is actually

00:30:26.080 a set of our errors we have everything from http errors on the top left to amqp

00:30:32.080 messages we've got log entries and also notifications and if we look at here this gives us a

00:30:38.720 breakdown of every type of notification we're sending the top green bar the big peaks are actually pushing notifications

00:30:45.279 so you can see that we're sending a crap load and the cool thing is that you can embed

00:30:50.320 these graphs into your campfire including now live graphs so you can actually share a graph with

00:30:56.159 your team and say let's monitor it live together and start commenting this is actually one of the graphs that

00:31:02.240 we blogged about a while ago what is what you'll see here is that we deployed some code and then boom we had a lot of

00:31:09.679 errors and then we fixed the bug quickly and everything peaked off you know we still had some high peaks and then we fixed

00:31:16.799 another error and they just all went away so as you can see you could use something

00:31:23.200 like you know urbit or get sentry or haystack but you can also use

00:31:30.000 librato and your own metrics and monitoring as well even better is to sayo

00:31:37.279 it gives you a nice handy dashboard that links into your librato or graphite metrics now actually the original fork

00:31:44.159 of it or the original version of taseo which is on uh obscurity at the bottom there the

00:31:50.240 github that only works with graphite at the moment but if you use mathias or roid rages fork this can also

00:31:58.159 work with librato so you're getting a nice little dashboard and it's also an easy dashboard that you can note your

00:32:04.399 graphs you'll see at the red one amqp ready you can see that we've got 17 000 18 000 messages this is another error

00:32:12.320 and we can see nice and quick and simply that is a big warning sign we need to do something about it

00:32:19.360 so if we put everything together in our holy circle of metrics

00:32:24.480 it goes into paper trail we need a web hook to pull it out of paper trail and then it goes to librato but you know we

00:32:30.960 can do a lot better here so we've got a lot of aggregating and sending it's a little bit silly

00:32:37.679 so we can use the metrics d reporter along with metrics d and now we get something like this

00:32:45.519 okay we already cut one piece out of it or we can go even one more step

00:32:51.440 and we can use the labrado metrics reporter and then this is all you need

00:32:57.200 librato will do the aggregating the sending and the visualizing for you

00:33:02.799 i used a scuba diver crayfish fun fact so the next part is like you're using

00:33:09.279 all these tools you're doing logging properly you're doing exception tracking you've got metrics

00:33:16.000 but the third part to all of this is the internal tools so i want to show you travis admin it

00:33:22.720 was a little app that we've built it's open source as well you know all of travis is open source

00:33:28.399 and this is a app built on sinatra along with bootstrap by the beautiful

00:33:33.840 constantine hazard i call him constantine house i would urge you all to do that because i don't

00:33:40.080 think he likes it that much i love you constantine in case you're watching um

00:33:46.159 so you this is part of the app it's composed of three four different bits this one here allows us to view the

00:33:51.679 service hooks set up for each of the repos what we do is you would put in a user on the or the user and a repo on

00:33:58.640 the right hand side and it checks the database for their oauth token and then we can check the hooks set up for that user we can

00:34:05.279 confirm that the hook is set up correctly we can also disable pull requests if we need to or maybe we can

00:34:11.200 enable some extra things to look out for so you'll see pull requests just down here

00:34:17.440 we've also got the gh web console gh is a gem created by constantine which

00:34:22.960 is a nice way of interacting with the github api it's different from octocat this one uses the headers a lot to

00:34:30.000 traverse and it's a bit more low level it's very low level but what we can do is we can put the user and the

00:34:37.760 the command that we're going for like user repos and we get a nice formatted

00:34:43.040 json list along with the headers uh on the right hand side so we can see

00:34:48.800 what our rate limit is and what was the scope of that request so it's a nice way of seeing what's going on with our

00:34:55.359 github requests because we do a lot of them but the crown of all of this is the

00:35:00.960 event monitor so as i showed before this was our infrastructure it's it's not

00:35:06.480 even part of it it's just half of our infrastructure and the dear old nicolas cage joined our

00:35:13.680 team and we tasked him to create a way to unify all of our notifications

00:35:19.520 so if we have all of this information going on how can we kind of see the path that a request takes comes in by the

00:35:26.240 listener it goes to hub it goes to the worker and then it comes back to hub how do we visualize that

00:35:32.240 so he built an app for us built on ember js and using redis pub sub

00:35:37.680 and we've got something like this now you might not see it very easily but i'll turn on the live version in a

00:35:43.920 second but what we have on the left hand side is an event list all of the events that happen and the master event so in

00:35:50.000 this case you'll see that a job finished and then we see all of the events that happened

00:35:56.880 after that that might be a push a notification was sent an irc notification was sent database was

00:36:03.520 updated so we can see all of the sub events and then we can see the payload attached to each of these so now you can

00:36:09.440 unify and actually see live what's going on this just goes to redis it expires

00:36:14.560 quickly and we can watch it live and then we can pause it if need be it uses travis no uh instrumentation a

00:36:22.160 custom thing that we built for doing all of these hooks along with active support notifications but

00:36:27.839 what i'll do

00:36:41.359 so this should be travis happening right now

00:36:49.520 and wow it's very quiet i know there we go and you can see that it's updating

00:36:55.119 accounts we can kind of see the payloads that are going on we can see the [ __ ] ruby driver just

00:37:00.400 started and all the stuff that started with that so it allows us to see not what's going on in a logging kind of

00:37:07.200 way but what's going on in a rich json logging kind of way we can kind of see

00:37:12.720 all of the events that are happening after each other now right now if you wanted to go use

00:37:17.920 this it's completely open source

00:37:24.000 if you want to use this it it's completely open source but i will admit that there's a fair bit of

00:37:30.000 code that is kind of bound to how we saw the problem i'd love to have other people try this out and get involved and

00:37:36.400 see how we could take this to the next step it's a lot of the code is located in

00:37:42.400 travis support which is one of our repos on github it's completely open source

00:37:48.960 um i'd like to just finish off very quickly to say that

00:37:54.720 the whole point of this talk is to not just deploy your app to heroku or to

00:38:00.079 capistra or using capistrano to a dedicated server you need to know what's going on inside your app

00:38:05.760 knowing what's going on inside your app will allow you to find those errors to make things faster and to find out what features your users are using in fact

00:38:13.119 the one of the best ways you can use metrics is start attaching it to features start seeing the paths that

00:38:19.760 people use and what they're doing and how often something is used if you just spent two weeks working on a feature

00:38:25.280 that no one uses is that a good return on investment don't treat your app like a black box

00:38:31.760 don't just deploy it and forget about it and just look at those nice little circle jerk stats actually look into

00:38:37.280 what's going on the knowledge will help you improve don't look at those big war numbers

00:38:43.920 use services all of these services are available and they're amazing build internal tools don't build them

00:38:50.960 straight away but over time figure out what is it that you need or your team needs to make it

00:38:56.720 easier for you to administer the product you don't need to build it in your current normal application build it as a

00:39:02.000 side app that's why we have travis admin that stuff does not belong in travis itself it's just another tool that would read

00:39:08.480 the database and all the external services services are your friends i think i've said services of your friends enough

00:39:15.520 you can find us on irc and you know on github and on the twitters

00:39:21.760 if you'd like a pretty sticker come talk to me you are awesome

00:39:27.119 thank you

00:39:36.400 and any questions

00:39:43.760 this one there's two i don't think we have microphones so you any just yell it out

00:39:49.680 travis pro the council bill uh email support travelci.com

00:39:55.599 we've got a blog post with our pricing and all of the plans um we are freely giving out the invites

00:40:02.240 we're just controlling that slowly while while we get those pages up and running um

00:40:07.839 yeah email us we'll get you on there

00:40:22.079 in terms of

00:40:30.480 i guess the class declaration of like saying instrument this method because it's our way of pulling it out of the

00:40:36.319 code we don't have to have it sitting with the method underneath you can actually eval that into the classes in a

00:40:41.680 different area later on and then you can group them together but we find that it's the least intrusive

00:40:47.599 way we're not actually worried about instrumenting too much in fact the more you instrument the better it's about how

00:40:54.000 you kind of use that statistics like the main things that you should probably look at like start easy

00:40:59.520 is how long do maybe some of my notifiers take if i'm posting to twitter

00:41:04.640 or facebook how long do they take how many of them error out maybe you can look at features as well like it's a

00:41:10.400 great way to find out what your users are doing within your system if you have an amazing time tracking app and no one

00:41:17.440 ever uses the schedule function then maybe you need to think about how you rework that it can help you with

00:41:23.839 your a b testing metrics is very abstract it's not just about code but it's about everything that can kind of

00:41:28.880 go on within your app but we generally feel that every time we add

00:41:33.920 a feature now we make sure that we add a metric so we can track what's going on with it as

00:41:46.839 well you can't because you only have http for heroku we can't run metrics d

00:41:52.480 on there or statsd we use at the moment the web hook with paper trail but we'll soon be changing

00:41:59.839 over to using the librato notifier reporter so now it will just be the app

00:42:05.200 will use the metrics gem and then we'll go straight to librato librato's building and better alerting it's kind

00:42:10.880 of the one thing i didn't cover in this talk because it's a little bit hard to throw in everything including the kitchen sink

00:42:16.960 is the the holy thing that mixes everything you know connects it all up is alerting

00:42:22.960 so the final stage in all this is that when you've got all these metrics you've got all this uh the measurements and the

00:42:29.119 monitoring you need to be alerted when something hits peak now teseo takes it to a certain degree you're getting nice

00:42:35.680 colorings out of it but what you really want is campfire or pager duty or you

00:42:41.040 know someone to come out and get you from the restaurant or you know you need a way to find out when something's going

00:42:46.240 wrong it's the first step this step takes you to knowing what's going on inside the app then you need to kind of

00:42:52.640 add the alerting to notify you when something's going a bit astray

00:43:02.240 no other questions oh well thank you very much if um you've got anything else just come talk to me after thank you

00:43:42.319 you

Josh Kalderimis

@joshk

Explore all talks recorded at RubyConf 2012

+45

RubyConf 2012