Kitchen Soap

Meanwhile: More Meta-Metrics

Like all sane web organizations, we gather metrics about our infrastructure and applications. As many metrics as we can, as often as we can. These metrics, given the right context, help us figure out all sorts of things about our application, infrastructure, processes, and business. Things such as…

What:

…did we do before? (historical trending, etc.)
…is going on right now? (troubleshooting, health, etc.)
…is coming down the road? (capacity planning, new feature adoption, etc.)
…can we do to make things better? (business intelligence, user-behavior, etc.)

All of which, of course, should be considered mandatory in order to help your business increase its awesome. Yay metrics!

Some time ago, Matthias wrote a great blog post about some of the metrics that can reasonably profile the effectiveness of web operations, taken from the ITIL primer, VisibleOps.

In my opinion, there’s nothing on that list that isn’t valuable, as long as gathering those metrics isn’t too expensive behaviorally, technically, or organizationally. The topics included in that list of metrics, and the context they live in, are fodder for many, many blog posts.

But in the category of historical trending, I’m more and more fascinated by gathering what I’ll call “meta-metrics”, which is data about how you respond to the changes your system is experiencing.

One of the best examples of this is gathering information about operational disruptions. Collecting information about how many times your on-call rotation was alerted/paged/woken-up, during what times, and for what service(s) can be enlightening to say the least.  We’ve been tracking the volume of alerts a lot closer recently, and even with the level of automation we’ve got at Flickr, it’s still something you have to keep on top of, especially if you’re always finding new things to measure and alert on.
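As a concrete sketch of what keeping on top of this can look like, here’s a small script that pulls per-service and per-hour page counts out of a Nagios log. It assumes the stock nagios.log notification line layout ([epoch] SERVICE NOTIFICATION: contact;host;service;STATE;command;output); if your monitoring logs differently, only the regex needs to change.

```python
#!/usr/bin/env python3
"""Count critical pages per service and per hour-of-day from nagios.log."""
import re
import sys
from collections import Counter
from datetime import datetime

# Assumed line layout:
# [epoch] SERVICE NOTIFICATION: contact;host;service;STATE;command;output
LINE = re.compile(r'^\[(\d+)\] SERVICE NOTIFICATION: [^;]+;([^;]+);([^;]+);([^;]+);')

def summarize(path, states=('CRITICAL',)):
    per_service, per_hour = Counter(), Counter()
    with open(path) as log:
        for line in log:
            m = LINE.match(line)
            if not m:
                continue  # host notifications, state changes, comments, etc.
            epoch, host, service, state = m.groups()
            if state not in states:
                continue
            per_service['%s/%s' % (host, service)] += 1
            per_hour[datetime.fromtimestamp(int(epoch)).hour] += 1
    return per_service, per_hour

if __name__ == '__main__':
    per_service, per_hour = summarize(sys.argv[1])
    print("Noisiest host/service pairs:")
    for svc, n in per_service.most_common(10):
        print("  %5d  %s" % (n, svc))
    print("Pages by hour of day (who got woken up?):")
    for hour in range(24):
        print("  %02d:00  %s" % (hour, '#' * per_hour[hour]))
```

Even a crude hour-of-day histogram like this makes it obvious whether your pages are clustering at 3am or during the workday.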

Now ideally, you have an alerting system that only communicates conditions that need resolvable action by a human. Which means every alert is critically important, and you’re not ignoring or dismissing any pages for any reasons that sound like “oh, that’s ok, that cluster always does that…it’ll clear up, I’ll just acknowledge the page so I can shut up nagios.” In other words, our goal is to have a zero-noise alerting system. Which means that all alerts are actionable, not ignorable, and require a human to troubleshoot or fix. Over time, you push as much of this work as you can to the robots. In the meantime, save humans for the yet-to-be-automated work, or the stuff that isn’t easily captured by robots.
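“Push it to the robots” can be as mundane as a Nagios event handler that tries the obvious fix before the page ever goes out. A rough sketch, assuming the standard event-handler macros ($SERVICESTATE$, $SERVICESTATETYPE$, $SERVICEATTEMPT$); the restart command is a placeholder for whatever your real remediation is.

```python
#!/usr/bin/env python3
"""Sketch of a Nagios service event handler: attempt an automatic fix while
the check is still in a SOFT state, so a human only gets paged if it sticks.
Wired up with something like:
  command_line  /usr/local/bin/handle_service.py $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
"""
import subprocess
import sys

def main(state, state_type, attempt):
    # Only act on CRITICAL while Nagios is still retrying (SOFT); once the
    # state goes HARD, the notification fires and a human takes over.
    if state == 'CRITICAL' and state_type == 'SOFT' and int(attempt) >= 2:
        subprocess.call(['/sbin/service', 'httpd', 'restart'])  # placeholder fix

if __name__ == '__main__':
    main(*sys.argv[1:4])
```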

Why is this important to us? I may be stating the obvious, but it’s because interrupting humans with alerts that don’t require action has a mental and physical context switching cost (especially if the guy on-call was sleeping), and it increases the likelihood of missing a truly critical page in a slew of non-critical ones.

Of course, in the reality of evolving and growing web applications, even if we could reach a 100% noise-free alerting system, we couldn’t sustain it for any extended period of time, because your application, usage, and failure modes are constantly changing. So in the meantime, knowing how your alerts affect the team is a worthwhile thing for us to track. In fact, I think it’s so important that it’s worth collecting and displaying next to the rest of your metrics, and exposing these metrics to the entire dev and ops groups.
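Getting those counts in front of everyone can be as simple as shipping them into whatever already draws your graphs. A minimal sketch, assuming a Graphite-style carbon listener on port 2003 (substitute gmetric or whatever your collection layer happens to be); the hostname and metric names are made up.

```python
#!/usr/bin/env python3
"""Push alert counts into the same metrics system as everything else, so they
show up on the dashboards the dev and ops groups already look at."""
import socket
import time

CARBON_HOST = 'graphite.example.com'  # hypothetical host
CARBON_PORT = 2003

def send_metric(path, value, timestamp=None):
    """Write one datapoint using carbon's plaintext protocol: 'path value ts\n'."""
    ts = int(timestamp or time.time())
    payload = '%s %s %d\n' % (path, value, ts)
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(payload.encode('ascii'))

if __name__ == '__main__':
    # e.g. fed by the nagios.log summarizer above, once a day from cron
    send_metric('ops.alerts.critical.count', 42)
    send_metric('ops.alerts.critical.after_hours', 7)
```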

Something like this (made-up numbers):

[Chart: Tracking Critical Alerts]

Gathering up info about these alerts should give us a better perspective on where we can improve. For example, can we correlate the alerts with things like:

– code deploys?
– software upgrades?
– feature launches?
– open API abuse?

and maybe the most important ones:

We’ve built our own stuff to track and analyze these things. My question to the community: I’m not aware of any open-source tool dedicated to analyzing these metrics. Does one exist? Nagios obviously has host/hostgroup/cluster warning and critical histories, and those can be crunched to produce critical-alert statistics, but I’m not aware of anything that does that crunching comprehensively. Of course, until I find one, we’ll keep building our own.
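For what it’s worth, the deploy-correlation question above doesn’t need much machinery to get a first answer. Here’s a hedged sketch: given alert timestamps and deploy timestamps (epoch seconds, from whatever records your pushes), count the critical alerts that fire within a window after each deploy. The 30-minute window and the sample numbers are made up.

```python
#!/usr/bin/env python3
"""Count critical alerts landing within a window after each code deploy."""
from bisect import bisect_left, bisect_right

def alerts_near_deploys(alert_times, deploy_times, window=30 * 60):
    """Return {deploy_ts: number of alerts in [deploy_ts, deploy_ts + window]}."""
    alert_times = sorted(alert_times)
    return {
        d: bisect_right(alert_times, d + window) - bisect_left(alert_times, d)
        for d in sorted(deploy_times)
    }

if __name__ == '__main__':
    deploys = [1000000, 1010000, 1050000]           # made-up epoch seconds
    alerts = [1000300, 1000900, 1049000, 1051200]   # made-up epoch seconds
    for deploy, n in sorted(alerts_near_deploys(alerts, deploys).items()):
        print("deploy at %d -> %d critical alerts in the next 30 minutes" % (deploy, n))
```

The same shape of question works for software upgrades, feature launches, or open API abuse; only the list of event timestamps changes.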

Thoughts, lazyweb?
