00:00:15.599
hello hello i guess it's time for us to start it's
00:00:21.840
uh the 4 30 mark a couple of people coming in hi guys
00:00:27.039
awesome uh welcome my name is josh
00:00:32.320
or if you want to be really formal josh calderemas you can find me on github or the twitters or you can just call me mr
00:00:38.960
calderemas and find a theater uh this is me with anthony he's a sexy guy
00:00:44.879
uh i'm from the travis team but before we get started you know it's the afternoon you've just had some
00:00:50.800
donuts and coffee could you all just do me the favor of standing for a second you have everyone everyone in the back
00:00:56.879
with a laptop yep yup everyone in that section yeah perfect okay
00:01:02.559
so i want you to stay standing if you are logging anything from one of your apps
00:01:08.720
so it could be rails it could be an agent if you're using logging
00:01:14.320
i'd expect this there's one person that sat down that's a shame um
00:01:20.400
i want you to stay standing if you're using new relic in any shape or form
00:01:26.400
okay i want you to stay standing if you're using custom metrics for example librato or
00:01:34.479
graphite statsd no
00:01:40.560
sorry okay now i want you to stay standing
00:01:45.840
if you're using custom logging custom new relic custom metrics and a dashboard
00:01:51.040
tying them all together boom
00:01:56.479
boom goes the dynamite thank you guys okay and the people that were standing up you can just feel good
00:02:02.479
and maybe i'll give you a sticker man just feel good yeah okay now you can sit
00:02:07.520
okay so two small things before i get my my proper talk started uh i want you to meet the travis team
00:02:14.400
uh this is technically our official outfit we wear a onesie you join the travis company
00:02:20.239
and we give you a onesie
00:02:27.040
i have a feeling my microphone's gonna fall down there we go this is perfect for coding and you can
00:02:33.760
buy them online for a good price of 120 we'll even put a travis logo on the back
00:02:39.680
we used to wear shite shirts and awful moustaches don't tell sven i said that
00:02:45.120
beautiful moustaches and we used to present together in them but
00:02:51.200
now this is a side note uh is there any one from engine yard here
00:02:56.800
okay well thanks to the beauty of engine yard they're actually sponsoring piotr sanaki
00:03:02.480
drogues on twitter he did a lot of the rails engines work and was working on sprout core so he is now working full
00:03:08.319
time on travis open source so i i just want to give a huge thanks to them
00:03:13.760
because they're helping push travis and the open source testing part of it ford
00:03:18.959
so thank you engineered for being in other talks uh
00:03:26.239
oh it does all the way up and there's a summer version
00:03:34.159
so if you know anything about travis travis is a crowd-funded company we started
00:03:40.319
roughly in january putting up a crowdfunding page we raised about 125
00:03:45.599
000 usd and from that it was from about 780 individual donations from people in
00:03:52.239
the audience here and from at home 29 companies also joined in to help us
00:03:58.000
get going you know everyone from engineer to heroku to bendy works
00:04:04.480
to soundcloud and this is only a fraction of them so these are the people that have helped make travis possible
00:04:11.360
so you know we we want to give a lot of love and thanks to anyone here who has donated anyone who's donator who's
00:04:17.280
outside of the room as well but a big thing for us and we haven't actually released it
00:04:22.560
officially but thanks to the beautiful people at sticker meal they have or
00:04:28.160
they've in the past created a whole lot of stickers for us so if you donated you would have got these beautiful stickers
00:04:34.720
in the post but thanks to david who works at sticker mule and his pretty dog
00:04:40.560
we now have an official mascot we have a travis and we have a whole lot of travis
00:04:46.880
stickers so if you want to travel stick later come and get me and i've got one for you
00:04:52.960
but back to the talk so as i said my name is mr calderemis i come from this very far out place
00:04:58.479
called new zealand we're known for hobbits and orcs
00:05:04.240
cable cars beautiful views i'm from wellington to be exact beautiful nightlife amazing
00:05:10.800
coffee and the most fantastic amusement rides ever and i've also been on one of those it's
00:05:17.680
one of those small planes where you can see out the front of the cockpit and as you're coming in you're like oh great i
00:05:23.199
can see the runway i can see a hill i can see a runway and then and then even while it's landing all you can see is a
00:05:29.280
hill up to the last second uh the last four years i've been living in this beautiful place called amsterdam
00:05:36.160
where it's known for its huge amount of bikes it's amazing architecture
00:05:41.280
it's night life it's recreational activities
00:05:47.440
but now i'm technically a traveling nomad i've got no official home i'm in
00:05:52.560
between san francisco new zealand san francisco berlin i travel with a ukulele
00:05:58.160
i've got a nice little coffee kit and today i'm here to talk to you about fun things like logging and metrics and
00:06:06.240
monitoring it's part theory it's part code it's part internal tools and it's part
00:06:12.240
moustache but you'll notice that the talk title in the
00:06:18.160
in the conference planner was about the black box and and i had a huge issue trying to find a
00:06:25.440
good talk title having a good talk title is what people look for and i was at the american bookstore in
00:06:32.960
the netherlands it was actually just before i was going to sao paulo and i was like you know josh you don't know
00:06:38.319
anything about sao paulo and you're going to sao paulo and this was the day before going to sao paulo so i had to
00:06:44.319
run here to get myself a brazil book so i actually knew what was going on and i came across this book and it's
00:06:52.080
it's not how good you are it's how good you want to be and it had these really nice little
00:06:57.120
lessons in there for example fail fail again falben all of these things that we're taught as programmers and when
00:07:02.880
even starting a business like the person who doesn't make mistakes is unlikely to make anything
00:07:08.720
or if you can't solve a problem it's because you're playing by the rules you know nice little lessons so i thought to
00:07:14.800
myself since i'm doing a talk about metrics and what's going on inside your app really
00:07:21.199
it's not how good your app is it's how good you want it to be what i want you to take away is that
00:07:27.440
your app when you deploy it wherever it is it's not just a black box you need to
00:07:32.639
know what's going on inside of it knowing what goes on inside of it gives you the power to optimize and to make
00:07:39.759
things better it's not about the war we're doing a million you know web requests a second
00:07:46.319
that is a very very uh circle jerk kind of metric there we want
00:07:52.400
something we need to know really what's going on inside it and we've got a whole bunch of tools and libraries available
00:07:57.680
to us everything from exception notifier to librato we should be using these tools and even
00:08:04.240
better you should be building internal tools internal tools will allow you to
00:08:10.879
mold the data and allow you to really abstract what you need and on top of that services are your friends don't
00:08:17.280
build things when things are available for you to use it doesn't matter if it costs 100 bucks a month these services
00:08:23.280
are worth their weight in but anyhow i come from this this nice
00:08:28.560
little crowdfunded company called travis and we run your tests for you you might have seen a little status image and you
00:08:36.080
know across github we run those tests it's free it's open source it's full of ferries and unicorns we've got another
00:08:43.200
piece of infrastructure and we are for builds what rubygems is for libraries
00:08:48.959
we have over 25 000 open source projects on us we do 10 000 test runs daily
00:08:56.399
over the last year and a half we've done 707 000 build requests on github
00:09:02.160
totaling 1.9 million total test runs and we support 11 different languages
00:09:08.800
everything from ruby to php to python to go i think there's eight different versions of ruby three different
00:09:14.560
versions of php and python we've got it if we don't have it tell us and we'll
00:09:20.480
add it and if you see this it's the sign that it's on travis
00:09:26.160
it was started for fun the whole idea was for it to be instant live and modern and
00:09:31.440
hackable and as any rails app it started kind of like this it's just a simple rails app
00:09:37.920
that would run on heroku it would use rescue and it would talk to a dedicated box we'd use redis and pusher and a nice
00:09:45.200
pretty database and it's you know it's about getting things out there this
00:09:50.640
was simple and this did the job it did the job when you had five projects but not when you've got 25 000.
00:09:56.800
so it kind of grew to this this is a simplistic view of our architecture
00:10:02.640
we have travis listener which listens from github q stuff to rabbit that then goes to travis hub which will
00:10:08.959
process the requests send it to the database and buy a pusher the user will of course talk to travis
00:10:14.959
fire traverse ci which is the web part and then all of the jobs go to travis worker which uses virtualbox
00:10:21.680
and i will just state that i want to stab virtualbox in the eye but
00:10:26.880
we've got 10 of these boxes that we manage and this works this is how we have scaled
00:10:34.079
as a side note we're splitting travis hub into three bits so now when you look
00:10:39.120
at this it's not four different bits it's actually you know seven different bits and on top of that
00:10:44.560
we've got other bits behind the scenes we use everything from redis to rabbit
00:10:49.600
to virtualbox jruby and mri we're using an assortment of everything to make this
00:10:56.000
job possible and it's kind of built around a service orientated architecture
00:11:02.399
but we kind of call it the mini soa version because we are sharing services
00:11:08.560
and when we say services i'm saying we share reddises and databases between these little apps you
00:11:15.279
generally shouldn't be doing this but anyhow we'll get to this right now we have a total of 10
00:11:21.920
deployable apps and on top of that we have four different environments
00:11:27.279
we have org staging and production and we've got a whole separate setup for pro
00:11:32.560
just to keep everything a little bit more secure and also because org is sponsored we don't want
00:11:38.399
that to leach over the pro so we've got four different environments totally 40 different deployments
00:11:45.279
i mean that's a lot so really measuring and monitoring is a challenge but even more so it's more
00:11:51.440
important to get it right we need to know what's going on inside of travis so we can optimize and make it better for
00:11:58.240
you the user also i used to play canoe polo i'd like us to get to know each other
00:12:05.680
so let's talk about logging so logging is important everyone here is
00:12:11.519
doing it and they might not be knowing it but you're doing logging it's useful but
00:12:16.880
this is the standard rails logging we've got this is what seven eight different lines
00:12:24.000
just for your production log this is your default in production what we really want is something simple
00:12:30.399
we want something concise and it should be easily readable it should be one line
00:12:36.959
if you look at this the rails log standard rails log we can actually break it down we've got the request type the
00:12:43.839
url the format the action status total duration view and database
00:12:49.200
those are all there but there are over eight different lines why not something like this
00:12:55.680
this is all we need and even better it's on one line so this is what we use across our web
00:13:02.079
applications in production it keeps it nice and simple it's keyword we can search for it and it means that we don't
00:13:07.279
have super verbose logging going on and you can find this available it's from matthias one of the travis guys
00:13:14.399
it's under roid rage log rage and you can just add it to your rails app and you should have
00:13:20.240
better logging just for production but what about non-web apps that's not going to work for a non-web app it's
00:13:26.320
very rails specific so when you're creating a ruby agent or
00:13:31.760
ruby tool you want to think about how you use logging logging has everything
00:13:36.800
from debug info and error you know simply put you want to debug
00:13:42.639
lots you want to info a bit and you want to error those exceptions it's not a huge it's not a huge thing to grasp
00:13:50.240
really you want to be logging all the things now if we think about travis law travis
00:13:56.240
hub just for a second here this is one of our components we're doing about
00:14:01.680
100 requests per minute using rabbitmq and doing processing via
00:14:07.519
threads and subscriptions it has things like build finished notifications and it's doing background
00:14:13.839
processing and threads and this is what we have you'll see that on the well down here
00:14:21.120
these are all related log messages and why they are is that you can see that the build idea is the same
00:14:26.959
so you need to think about context when you're logging an easy way to group them if it be tagging if it be putting an
00:14:33.680
identifier or a uuid what you want is any log message that happens in your processing needs to be easily
00:14:41.440
grouped and attributed to a previous log message if that's applicable
00:14:46.720
and what you can see is it's a very simple kind of context we've got the type
00:14:53.279
debugger info the class the method and any extra information and the model and its id
00:15:00.639
and for an error it's just even simpler error class and error message
00:15:05.760
it allows us to then go you know just detail that and to grab and we will get a live streaming of
00:15:11.680
anything to do with that build id but on travis work it's a little bit different so that was the part that's
00:15:18.000
actually processing the jobs which is managing the virtual boxes that's got five vms per worker which is effectively
00:15:25.199
five jobs per worker at any one time what the goal is so every time you look
00:15:30.639
at logging you should have a specific goal in mind and for our goal for logging we wanted to easily be able to
00:15:37.120
see what was going on and we needed it to be able to do it live because we don't want to disturb the application we just want to tail
00:15:44.320
so in here the context is the worker and you'll see on the left hand side
00:15:50.880
on the left hand side it says ruby 2 and then it's got each different class and then what's going on so we can even see
00:15:57.600
in a nice clear human readable way about to run about to perform change directory
00:16:02.880
export etc we know what commands it's running as well we can see it's doing a bundle install on the bottom
00:16:09.199
being able to see that makes it much easier to just tap in and watch the build and make sure nothing's exploding
00:16:17.920
it's a much more verbose form of info compared to the rails app which is usually doing it on a singular request
00:16:24.880
but it's a different goal so don't you don't want to treat every app the same
00:16:31.360
and as i said like everything's grouped appropriately making it easier to read and then all we need to do is do a nice
00:16:37.680
little tail grab for ruby 2 and we can see everything going on with that worker
00:16:42.720
and even better we can improve it further we can consolidate all our logging
00:16:48.079
for example you could async it to a single server you could use something like graylog 2
00:16:53.519
which is built on ruby and java uses elasticsearch self-hosted so you need to look after that
00:16:59.199
or you can use something like paper trail hosted and just pure awesome and for us now we've got all of our web
00:17:06.079
stuff or our hub stuff going on in paper trail nice and easy to see we've also got search at the bottom and then you
00:17:12.319
can go back in the time frame nice and easily it's got excellent stuff like archiving notifications and web hooks when it
00:17:19.199
detects a certain stream so now you don't have to build an alert in yourself you can get to have it detect certain
00:17:25.199
errors and then maybe notify you on campfire and of course we love peter trail
00:17:31.600
oh i also used to play nick bottom in a midsummer's night dream
00:17:37.919
thank you in high school so exceptions boom
00:17:44.559
so exceptions you never want to check logs for errors
00:17:49.679
that is no fun and we're beyond that when we're doing exception logging we want to
00:17:56.559
use an app a service or a library for this if you're not doing this already there's a million available and just to
00:18:02.559
clarify really quickly we've got something as basic as exception notifier a very old rails plugin would send you
00:18:08.240
an email each time you could do something like century which is self install or hosted and this
00:18:15.679
is built on python if i believe right and it's very nice interface and it's just
00:18:21.760
simple you've also got urbit which is a self install and written in ruby you can host
00:18:27.840
that on heroku as well for us we use haystack so haystack is built by the guys at
00:18:34.080
github sadly it's not open source but this is what you get so this is an example of an internal
00:18:41.679
tool that they've seen the need for and the difference that you'll see here is that there's a nice pretty graph and
00:18:48.640
visualizations and context to be able to see what's going on with your errors so you've even got the most common errors
00:18:54.240
and then the latest exceptions along with a nice little metric of how many exceptions there are
00:19:00.240
and of course you get the normal breakdown of what's going wrong so you can easily add extra bits to here and
00:19:06.640
you can have the back trace a lot of people see errors as a to-do list
00:19:13.360
me personally i don't i don't see errors as a to-do list i see errors as a natural part of your app errors are
00:19:19.520
going to happen it's you know i'm not saying ignore them but don't fret over them they're not the end of
00:19:25.360
the world the best error exception notifier is your customer your user
00:19:31.200
that's that's how i find it personally if an error goes wrong and travis we know something's bad when someone tweets
00:19:36.400
saying hey sync isn't working respond what monitor twitter and email
00:19:42.799
like support emails twitter oh this is my little sister so
00:19:50.480
that was very unexpected even for me um i i should actually start just littering
00:19:56.559
them everywhere and i won't even know where i'll just get someone else to do it next time so exception monitoring is really
00:20:02.559
important but don't see it as the first thing that you need to do every morning it is a good thing to get rid of them
00:20:08.400
though so do watch the big numbers uh and my little sister
00:20:13.679
so let's talk about metrics who here uses new relic again just a
00:20:18.720
show of hands awesome who here is using new relic and not
00:20:23.919
for a web application one person
00:20:29.280
maybe two over there was it was two sold okay so this is what you'd normally
00:20:35.919
see in new relic you know pretty graphs a nice little aptx score throughput
00:20:41.360
awesome it's really super powerful but there's something that you might not
00:20:47.200
notice here this is not for a web application new relic is bendable around your agent
00:20:54.240
apps anything that's doing background processing it might be background jobs might be a whole separate app just
00:20:59.679
during amqp subscriptions this here is all of our amqp subscriptions and we can see which ones in particular are very
00:21:06.320
slow or in this case the most time consuming and for us to do that we just built a
00:21:11.760
little wrapper it's a little bit of handy meta programming but we attach we tell a new relic to attach its
00:21:19.679
its proxy and controller notification so we can then get a trace for that method and then what we get out of that
00:21:27.039
is the beauty of new relic you get all of the database breakups all of the
00:21:32.480
web external breakups as well and then you also get to dive into each of those requests and see what's going
00:21:39.280
on it's really powerful if it i will say though it's not incredibly intuitive to
00:21:45.360
add because they call it controller notification and people here will probably think of a web controller as
00:21:51.200
well a web controller that's what i associate controller notification with but for us
00:21:56.799
controller notification was used for an amqp subscription which was a different way of hooking into it nonetheless it's
00:22:04.480
really powerful so you can get working with it straight away because if you're using new rail sorry heroku you get the
00:22:11.360
standard for free you can just shove your app on roku and then you get all of this pretty stuff
00:22:17.760
so as i said not just for new relic it's for all apps i love popcorn
00:22:23.440
we're really getting to know each other here is great so let's talk about metrics
00:22:29.360
so metrics are awesome okay hands down awesome
00:22:34.400
i would like to introduce you to matthias our chief metrics officer and his pioneer and i when i was preparing
00:22:41.679
this talk i've given it a few times now and i was like matthias how do i explain to the audience how awesome metrics are
00:22:48.640
and he said very abstractly metrics are to a running app as unit tests are to
00:22:54.159
your code it's like that's beautiful that even i have trouble understanding that tell me again
00:23:01.039
well it represents a certain part of your app by measuring how long it takes and how often it occurs at a given point
00:23:07.039
of time it's as simple as that i've even attributed him to 2012
00:23:13.200
just in case you ever reused this effectively metrics is attached to a
00:23:18.320
method call it's about how long it took but it doesn't have to be attached to a
00:23:23.440
method call but it's attached to something that's happening in your code an event and for us we use it on the method calls
00:23:30.000
a lot so for example there are four parts to metrics if i if i start there you've got collecting
00:23:37.039
sending aggregating and visualizing those are the four main parts of metrics
00:23:43.760
and if we talk about collecting for a second i want to introduce you to active support notifications
00:23:50.960
if you don't already know it you're using it every single day in your rails app it's littered everywhere it's even
00:23:56.720
how logging works there's no logger dot print or logger dot info that happens via active support notifications
00:24:06.480
the greatest thing is it's a very simple publish and subscriber or observable
00:24:11.840
kind of pattern you publish events you instrument some code or you sub and then you subscribe to
00:24:18.080
those events or instrumentations you can subscribe to one event type or
00:24:23.600
multiple using regex it's really simple you publish one-time events
00:24:29.679
or you can yield a block and time it or you can just publish an event using instrument and send some extra info
00:24:37.120
i won't go fully into active support notifications but i would highly recommend that you buy
00:24:43.440
this book from joe's avaleem it is absolutely hands down amazing will give
00:24:48.640
you a very thorough insight into what's happening inside of rails and also that particular topic of notifications
00:24:55.520
so i'm just going to refer to active support notifications as asn for now but let's look at how we could use it
00:25:02.400
that's all we need to do we can just start littering our code with publishing and campfire and urls
00:25:09.039
and we can start passing information about this was called we can go instrument
00:25:14.880
okay so they're pretty much the same or we can instrument and it will do a timing as well
00:25:21.200
nice and simple we use it to publish three events in travis now this is all over our code we
00:25:28.880
published a received completed and failed
00:25:33.919
you'll kind of you'll see a pattern similar to this you'll send a message and we say oh
00:25:40.080
someone fired off a uh campfire message or notification we do
00:25:46.400
some processing and then we publish that it completed successfully or if there is an error we publish that failed and we
00:25:52.799
re-raise it's nothing particularly special but because this is a whole lot of craft
00:25:58.080
code ah completed sorry josh there we go
00:26:04.080
because of the craft we use a little bit of meta programming and now you just go instrument and send message
00:26:09.600
and all we're doing is emitting a whole lot of events these events are completely metrics agnostic so like step
00:26:16.799
one is that you can even add this to your library or gem but you can add this all over your code knowing that it's not
00:26:22.240
going to take up extra cpu it's not going to make your app slower it's going to just start emitting events and then
00:26:27.360
you can start thinking about how do you want to use those events it's very general purpose
00:26:33.120
and it's all about the subscriber this is the beauty oh i love german food
00:26:41.120
so let's think about sending and aggregating for a second statsd
00:26:47.120
that's no js using udp is one of the methods that we can use we can start aggregating and sending if or sending
00:26:52.880
and have it aggregate it uses simple udp messages like this where you'd say this
00:26:58.080
is one counter increment by one please or we can do a timing like 320
00:27:03.760
milliseconds and you can even do this from your console you can set up stat steve and
00:27:09.679
then just do this using netcat nice and simple you've got a nice little ruby gem for it
00:27:16.240
you can just set up the host in the port increment do some timings boom done
00:27:21.279
and it's only 155 lines of code which is a very small gem it's got some tests in there
00:27:27.600
but this is what we use we use a combination of metrics d which is by eric from paper trail
00:27:35.279
and it's based on ruby and event machine it's coupled together with the metrics
00:27:41.360
gem now the metrics gem is very similar again you're not doing anything mind blowing here we set up counters and we
00:27:47.440
get to increment and we can do some timings and then we flush that every x amount of
00:27:53.600
seconds so we flush it every 60 seconds and then what metrics the gem does is it
00:27:58.799
part aggregates it aggregates within that process and then it sends it and it now what it
00:28:05.039
does is it sends it to a reporter so this is quite cool because now you can send to different types of reporters
00:28:12.159
and in this case we can send it to our logs and we can actually see what's happening in our logs for each of these
00:28:18.399
metrics so let's let's subscribe to something here
00:28:23.600
what we're going to do is we're going to add a started app so we know how long it took we'll continue to use this instrument
00:28:30.480
we're going to also do the started and finished app on the completed and then we subscribe
00:28:37.520
we just subscribe to anything using that namespace travis at the top and then if it's if there is a finished
00:28:43.760
stat then we send the time it took and if not we just mark it that this
00:28:49.120
method was called so it's nice and simple this is all you need to subscribe your
00:28:55.120
entire you know code base you litter it with events you do a nice little subscriber
00:29:01.279
and now what we do is we start the logger and this is what we get out we get everything from now you'll see
00:29:07.440
that at the top it's active record and we've got a timer we've got how many times it happened the one minute rate 15
00:29:14.559
minute rate means max standard deviations everything this is just for one process
00:29:20.240
so now what you can do is if we think about it this is what we've
00:29:25.279
just set up so we use paper trail this is travis in particular travis uses active support
00:29:30.480
notifications we use metrics and then it goes to paper trail so all of our metrics are going to paper trail
00:29:36.960
it's a very quick way to get started and it means it's very visual as well so it gives you a little bit of debug time
00:29:44.320
but it's the thing that comes after it that's really important because what we're doing now is we're emitting events
00:29:49.919
we're aggregating we're sending awesome now we need to start visualizing we need to start making use of these effects
00:29:57.039
so there's graphite and graphite is a self-hosted product so
00:30:02.159
you're going to need to run this yourself and it gives you very thorough graphs it's not the prettiest interface but
00:30:08.880
sometimes it isn't about the interface even though it actually is that's why i have a mac and
00:30:15.039
then there's librato uh librato we are in love with we get pretty graphs like
00:30:20.240
this that allows us to do absolutely everything and create our own dashboards this here is actually
00:30:26.080
a set of our errors we have everything from http errors on the top left to amqp
00:30:32.080
messages we've got log entries and also notifications and if we look at here this gives us a
00:30:38.720
breakdown of every type of notification we're sending the top green bar the big peaks are actually pushing notifications
00:30:45.279
so you can see that we're sending a crap load and the cool thing is that you can embed
00:30:50.320
these graphs into your campfire including now live graphs so you can actually share a graph with
00:30:56.159
your team and say let's monitor it live together and start commenting this is actually one of the graphs that
00:31:02.240
we blogged about a while ago what is what you'll see here is that we deployed some code and then boom we had a lot of
00:31:09.679
errors and then we fixed the bug quickly and everything peaked off you know we still had some high peaks and then we fixed
00:31:16.799
another error and they just all went away so as you can see you could use something
00:31:23.200
like you know urbit or get sentry or haystack but you can also use
00:31:30.000
librato and your own metrics and monitoring as well even better is to sayo
00:31:37.279
it gives you a nice handy dashboard that links into your librato or graphite metrics now actually the original fork
00:31:44.159
of it or the original version of taseo which is on uh obscurity at the bottom there the
00:31:50.240
github that only works with graphite at the moment but if you use mathias or roid rages fork this can also
00:31:58.159
work with librato so you're getting a nice little dashboard and it's also an easy dashboard that you can note your
00:32:04.399
graphs you'll see at the red one amqp ready you can see that we've got 17 000 18 000 messages this is another error
00:32:12.320
and we can see nice and quick and simply that is a big warning sign we need to do something about it
00:32:19.360
so if we put everything together in our holy circle of metrics
00:32:24.480
it goes into paper trail we need a web hook to pull it out of paper trail and then it goes to librato but you know we
00:32:30.960
can do a lot better here so we've got a lot of aggregating and sending it's a little bit silly
00:32:37.679
so we can use the metrics d reporter along with metrics d and now we get something like this
00:32:45.519
okay we already cut one piece out of it or we can go even one more step
00:32:51.440
and we can use the labrado metrics reporter and then this is all you need
00:32:57.200
librato will do the aggregating the sending and the visualizing for you
00:33:02.799
i used a scuba diver crayfish fun fact so the next part is like you're using
00:33:09.279
all these tools you're doing logging properly you're doing exception tracking you've got metrics
00:33:16.000
but the third part to all of this is the internal tools so i want to show you travis admin it
00:33:22.720
was a little app that we've built it's open source as well you know all of travis is open source
00:33:28.399
and this is a app built on sinatra along with bootstrap by the beautiful
00:33:33.840
constantine hazard i call him constantine house i would urge you all to do that because i don't
00:33:40.080
think he likes it that much i love you constantine in case you're watching um
00:33:46.159
so you this is part of the app it's composed of three four different bits this one here allows us to view the
00:33:51.679
service hooks set up for each of the repos what we do is you would put in a user on the or the user and a repo on
00:33:58.640
the right hand side and it checks the database for their oauth token and then we can check the hooks set up for that user we can
00:34:05.279
confirm that the hook is set up correctly we can also disable pull requests if we need to or maybe we can
00:34:11.200
enable some extra things to look out for so you'll see pull requests just down here
00:34:17.440
we've also got the gh web console gh is a gem created by constantine which
00:34:22.960
is a nice way of interacting with the github api it's different from octocat this one uses the headers a lot to
00:34:30.000
traverse and it's a bit more low level it's very low level but what we can do is we can put the user and the
00:34:37.760
the command that we're going for like user repos and we get a nice formatted
00:34:43.040
json list along with the headers uh on the right hand side so we can see
00:34:48.800
what our rate limit is and what was the scope of that request so it's a nice way of seeing what's going on with our
00:34:55.359
github requests because we do a lot of them but the crown of all of this is the
00:35:00.960
event monitor so as i showed before this was our infrastructure it's it's not
00:35:06.480
even part of it it's just half of our infrastructure and the dear old nicolas cage joined our
00:35:13.680
team and we tasked him to create a way to unify all of our notifications
00:35:19.520
so if we have all of this information going on how can we kind of see the path that a request takes comes in by the
00:35:26.240
listener it goes to hub it goes to the worker and then it comes back to hub how do we visualize that
00:35:32.240
so he built an app for us built on ember js and using redis pub sub
00:35:37.680
and we've got something like this now you might not see it very easily but i'll turn on the live version in a
00:35:43.920
second but what we have on the left hand side is an event list all of the events that happen and the master event so in
00:35:50.000
this case you'll see that a job finished and then we see all of the events that happened
00:35:56.880
after that that might be a push a notification was sent an irc notification was sent database was
00:36:03.520
updated so we can see all of the sub events and then we can see the payload attached to each of these so now you can
00:36:09.440
unify and actually see live what's going on this just goes to redis it expires
00:36:14.560
quickly and we can watch it live and then we can pause it if need be it uses travis no uh instrumentation a
00:36:22.160
custom thing that we built for doing all of these hooks along with active support notifications but
00:36:27.839
what i'll do
00:36:41.359
so this should be travis happening right now
00:36:49.520
and wow it's very quiet i know there we go and you can see that it's updating
00:36:55.119
accounts we can kind of see the payloads that are going on we can see the [ __ ] ruby driver just
00:37:00.400
started and all the stuff that started with that so it allows us to see not what's going on in a logging kind of
00:37:07.200
way but what's going on in a rich json logging kind of way we can kind of see
00:37:12.720
all of the events that are happening after each other now right now if you wanted to go use
00:37:17.920
this it's completely open source
00:37:24.000
if you want to use this it it's completely open source but i will admit that there's a fair bit of
00:37:30.000
code that is kind of bound to how we saw the problem i'd love to have other people try this out and get involved and
00:37:36.400
see how we could take this to the next step it's a lot of the code is located in
00:37:42.400
travis support which is one of our repos on github it's completely open source
00:37:48.960
um i'd like to just finish off very quickly to say that
00:37:54.720
the whole point of this talk is to not just deploy your app to heroku or to
00:38:00.079
capistra or using capistrano to a dedicated server you need to know what's going on inside your app
00:38:05.760
knowing what's going on inside your app will allow you to find those errors to make things faster and to find out what features your users are using in fact
00:38:13.119
the one of the best ways you can use metrics is start attaching it to features start seeing the paths that
00:38:19.760
people use and what they're doing and how often something is used if you just spent two weeks working on a feature
00:38:25.280
that no one uses is that a good return on investment don't treat your app like a black box
00:38:31.760
don't just deploy it and forget about it and just look at those nice little circle jerk stats actually look into
00:38:37.280
what's going on the knowledge will help you improve don't look at those big war numbers
00:38:43.920
use services all of these services are available and they're amazing build internal tools don't build them
00:38:50.960
straight away but over time figure out what is it that you need or your team needs to make it
00:38:56.720
easier for you to administer the product you don't need to build it in your current normal application build it as a
00:39:02.000
side app that's why we have travis admin that stuff does not belong in travis itself it's just another tool that would read
00:39:08.480
the database and all the external services services are your friends i think i've said services of your friends enough
00:39:15.520
you can find us on irc and you know on github and on the twitters
00:39:21.760
if you'd like a pretty sticker come talk to me you are awesome
00:39:27.119
thank you
00:39:36.400
and any questions
00:39:43.760
this one there's two i don't think we have microphones so you any just yell it out
00:39:49.680
travis pro the council bill uh email support travelci.com
00:39:55.599
we've got a blog post with our pricing and all of the plans um we are freely giving out the invites
00:40:02.240
we're just controlling that slowly while while we get those pages up and running um
00:40:07.839
yeah email us we'll get you on there
00:40:22.079
in terms of
00:40:30.480
i guess the class declaration of like saying instrument this method because it's our way of pulling it out of the
00:40:36.319
code we don't have to have it sitting with the method underneath you can actually eval that into the classes in a
00:40:41.680
different area later on and then you can group them together but we find that it's the least intrusive
00:40:47.599
way we're not actually worried about instrumenting too much in fact the more you instrument the better it's about how
00:40:54.000
you kind of use that statistics like the main things that you should probably look at like start easy
00:40:59.520
is how long do maybe some of my notifiers take if i'm posting to twitter
00:41:04.640
or facebook how long do they take how many of them error out maybe you can look at features as well like it's a
00:41:10.400
great way to find out what your users are doing within your system if you have an amazing time tracking app and no one
00:41:17.440
ever uses the schedule function then maybe you need to think about how you rework that it can help you with
00:41:23.839
your a b testing metrics is very abstract it's not just about code but it's about everything that can kind of
00:41:28.880
go on within your app but we generally feel that every time we add
00:41:33.920
a feature now we make sure that we add a metric so we can track what's going on with it as
00:41:46.839
well you can't because you only have http for heroku we can't run metrics d
00:41:52.480
on there or statsd we use at the moment the web hook with paper trail but we'll soon be changing
00:41:59.839
over to using the librato notifier reporter so now it will just be the app
00:42:05.200
will use the metrics gem and then we'll go straight to librato librato's building and better alerting it's kind
00:42:10.880
of the one thing i didn't cover in this talk because it's a little bit hard to throw in everything including the kitchen sink
00:42:16.960
is the the holy thing that mixes everything you know connects it all up is alerting
00:42:22.960
so the final stage in all this is that when you've got all these metrics you've got all this uh the measurements and the
00:42:29.119
monitoring you need to be alerted when something hits peak now teseo takes it to a certain degree you're getting nice
00:42:35.680
colorings out of it but what you really want is campfire or pager duty or you
00:42:41.040
know someone to come out and get you from the restaurant or you know you need a way to find out when something's going
00:42:46.240
wrong it's the first step this step takes you to knowing what's going on inside the app then you need to kind of
00:42:52.640
add the alerting to notify you when something's going a bit astray
00:43:02.240
no other questions oh well thank you very much if um you've got anything else just come talk to me after thank you
00:43:42.319
you