Quantcast

From the category archives:

Tools

Meanwhile: More Meta-Metrics

by allspaw on October 5, 2009

Like all sane web organizations, we gather metrics about our infrastructure and applications. As many metrics as we can, as often as we can. These metrics, given the right context, helps us figure out all sorts of things about our application, infrastructure, processes, and business. Things such as…

What:

…did we do before (historical trending, etc)
…is going on right now? (troubleshooting, health, etc.)
…is coming down the road (capacity planning, new feature adoption, etc.)
…can we do to make things better (business intelligence, user-behavior, etc.)

All of which, of course, should be considered mandatory in order to help your business increase its awesome. Yay metrics!

Some time ago, Matthias wrote great a blog post about some of the metrics that can reasonably profile the effectiveness of web operations, taken from the ITIL primer, VisibleOps.

In my opinion, there’s nothing on that list of things that isn’t valuable, as long as the cost of gathering those metrics isn’t too behaviorally, technically, or organizationally expensive. The topics included in that list of metrics and the context they live in is fodder for many, many blog posts.

But in the category of historical trending, I’m more and more fascinated by gathering what I’ll call “meta-metrics”, which is data about how you respond to the changes your system is experiencing.

One of the best examples of this is gathering information about operational disruptions. Collecting information about how many times your on-call rotation was alerted/paged/woken-up, during what times, and for what service(s) can be enlightening to say the least.  We’ve been tracking the volume of alerts a lot closer recently, and even with the level of automation we’ve got at Flickr, it’s still something you have to keep on top of, especially if you’re always finding new things to measure and alert on.

Now ideally, you have an alerting system that only communicates conditions that need resolvable action by a human. Which means every alert is critically important, and you’re not ignoring or dismissing any pages for any reasons that sound like “oh, that’s ok, that cluster always does that…it’ll clear up, I’ll just acknowledge the page so I can shut up nagios.” In other words, our goal is to have a zero-noise alerting system. Which means that all alerts are actionable, not ignorable, and require a human to troubleshoot or fix. Over time, you push as much of this work as you can to the robots. In the meantime, save humans for the yet-to-be-automated work, or the stuff that isn’t easily captured by robots.

Why is this important to us? I may be stating the obvious, but it’s because interrupting humans with alerts that don’t require action has a mental and physical context switching cost (especially if the guy on-call was sleeping), and it increases the likelihood of missing a truly critical page in a slew of non-critical ones.

Of course in the reality of evolving and growing web applications, even if we could reach a 100% noise-free alerting system, it’s impossible to sustain for any extended period of time, because your application, usage, and failure modes are constantly changing. So in the meantime, knowing how your alerts affect the team is a worthwhile thing to do for us. In fact, I think it’s so important that it’s worth collecting and displaying next to the rest of your metrics, and exposing these metrics to the entire dev and ops groups.

Something like this: (made-up numbers)

Tracking Critical Alerts

Tracking Critical Alerts

Gathering up info about these alerts should give us a better perspective on where we can improve. So, things like:

  • How many critical alerts are sent on a daily/hourly/weekly basis?
  • What does a time histogram of the alerts look like? Do you get more or less alerts during nighttime or non-peak hours?
  • How much (if any) correlation is there between critical alerts and:

- code deploys?
- software upgrades?
- feature launches?
- open API abuse?

  • What does a breakdown of the alerts look like, in terms of: host type, service type, and frequency of each in a given time period?

and maybe the most important ones:

  • How many of those alerts aren’t actually critical or demand human attention?
  • How many of them always self-recover?
  • How many (and which) don’t matter in their role context (like, a single node in a load-balanced cluster) and could be turned into an aggregate check?

We’ve built our own stuff to track and analyze these things. My question to the community is: I’m not aware of any open-source tool that is dedicated to analyzing these metrics. Do they exist? Nagios obviously has host/hostgroup/cluster warning and critical histories, and those can be crunched to find critical alert statistics, but I’m not aware of any comprehensive crunching. Of course, until I find one, we’re just building our own.

Thoughts, lazyweb?

{ 6 comments }

Domas makes something more useful than I bet most would think: http://mituzas.lt/2009/06/26/uncache/

{ 0 comments }

Slides for Velocity Talk 2009

June 23, 2009

UPDATE: blip.tv has the video of the talk as well, below. Jeez I have some major bed-head.
That was a blast! I had never done a ‘duet’ talk before. Here are the slides:
10+ Deploys Per Day: Dev and Ops Cooperation at Flickr
…and the video of it is here:

Read the full article →

Slides from Web2.0 Expo 2009. (and somethin else interestin’)

April 3, 2009

That was a pretty good time. Saw lots of good and wicked smaht people, and I got a lot of great questions after my talk. The slides are up on slideshare, and here are the PDF slides.
Operational Efficiency Hacks Web20 Expo2009
View more presentations from John Allspaw.

UPDATE: Gil Raphaelli has posted his python bindings he [...]

Read the full article →

Speaking at Web2.0 Expo 2009

February 19, 2009

Looks like I’m gonna talk about even more nerdy things at the Web2.0 Expo in April.

You don’t have to wait for a recession to tighten up your operations. Squeezing more oomph out of your servers (or instances!) is always a good thing, and streamlining how you handle site issues is too. We’ll will talk about [...]

Read the full article →

Web Ops Visualizations Group on Flickr

December 16, 2008

Like lots of operations people, we’re quite addicted to data pr0n here at Flickr. We’ve got graphs for pretty much everything, and add graphs all of the time. We’ve blogged about some of how and why we do it.
One thing we’re in the habit of is screenshotting these graphs when things go wrong, right, or [...]

Read the full article →

Code Swarm for Config Management

October 21, 2008

Gil Raphaelli, one of the guys on our Flickr Ops team, put together a Code Swarm animation for the configuration/deployment management tool we use at Flickr to manage our infrastructure. Myles Grant did this for our bug reporting system as well. Check it out:

Our automated config management system is called Gemstone, but conceptually you can [...]

Read the full article →

Why we use GraphicsMagick

September 2, 2008

Speed: http://www.graphicsmagick.org/www/BENCHMARKS.html
Also, it looks like the GM devs are working on getting OpenMP (parallelism) put into GM processing, which will be a huge boom for multicore boxes. Yay!

Read the full article →

Tool update: WTF is inside filesystem cache ?

March 27, 2008

Awhile back, I said I’d love to have a tool that would allow me to peek inside filesystem cache and tell me what files (or pages of files) are inside. Well Peter Zaitsev points to the fincore tool, which comes pretty damn close: you give it a file, and it will tell you which pages [...]

Read the full article →

WebOps Communication Tools

March 10, 2008

After seeing Jesse’s great post on Radar (never knew about FreeConferenceCall, very cool!) about the quick and easy webops event communications, I thought I might put a post together on some of what we’re using at Flickr to keep track of things ops-related.
Production Changes/Immediate Issues
We have our configuration management schemes wrapped up in version control, [...]

Read the full article →