An Open Letter To Monitoring/Metrics/Alerting Companies

I’d like to open up a dialogue with companies who are selling X-As-A-Service products that are focused on assisting operations and development teams in tracking the health and performance of their software systems.

Note: It’s likely my suggestions below are already understood and embraced by many companies. I know a number of them that are paying attention to all of the areas I’d want them to, and/or are careful not to make claims about their product that aren’t genuine.

Anomaly detection is important. It can’t be overlooked. We as a discipline need to pay attention to it, and continually get better at it.

But for the companies whose value-add selling points are:

  • “our product will tell you when things are going wrong” and/or
  • “our product will automatically fix things when it finds something is wrong”

the implication is that these things will somehow relieve the engineer of having to think about or act on those activities, so they can focus on more ‘important’ things. “Well-designed automation will keep people from having to do tedious work,” says the cartoon-like salesman.

Please stop doing this. It’s a lie in the form of marketing material, and it’s a huge boondoggle that distracts us from what we should actually be working on: augmenting and assisting people in solving problems.

Anomaly detection in software is, and always will be, an unsolved problem. Your company will not solve it. Your software will not solve it. Our people will improvise around it and adapt their work to cope with the fact that we will not always know what and how something is wrong at the exact time we need to know.

My suggestion is to first acknowledge this (that detecting anomalies perfectly, at the right time, is not possible) when you talk to potential customers. Want my business? Say this up front, so we can then move on to talking about how your software will assist my team of expert humans, who will always be smarter than your code.

In other words, your monitoring software should take the Tony Stark approach, not the WOPR/HAL9000 approach.

These are things I’d like to know about how you thought about your product:

  • Tell me about how you used qualitative research in developing your product.
  • Tell me about how you observed actual engineers in their natural habitat, in the real world, as they detected and responded to anomalies that arose.
  • Show me your findings from when you had actual UX/UI professionals consider carefully how the interfaces of your product should be designed.
  • Demonstrate to me the people designing your product have actually been on-call and have experience with the scenario where they needed to understand what the hell was going on, had no idea where to start looking, all under time and consequence pressure.
  • Show me that the people building your product take as a first design principle that outages and other “untoward” events are handled not by a lone engineer, but more often than not by a team of engineers, each with their own expertise and focus of attention. Successful response depends not just on anomaly detection, but on how the team shares the observations they are making with each other in order to come up with actions to take.


Stop thinking you’re trying to solve a troubleshooting problem; you’re not.


The world you’re trying to sell to is in the business of dynamic fault management. This means that quite often you can’t just take a component out of service and investigate what’s wrong with it. It means that diagnosis involves testing hypotheses that could actually make things a lot worse than they already are. It means that the phases of responding to issues have overlapping concerns, all at the same time. Things like:

  • I don’t know what is going on.
  • I have a guess about what is going on, but I’m not sure, and I don’t know how to confirm it.
  • Because of what Sue and Alice said, and what I see, I think what is going on is X.
  • Since we think X is happening, I think we should do Y.
  • Is there a chance that Y will make things worse?
  • If we don’t know what’s happening with N, can we do M so things don’t get worse, or we can buy time to figure out what to do about N?
  • Do we think this thing (that we have no clue about) is changing for the better or the worse?
  • etc.

Instead of telling me about how your software will solve problems, show me you’re trying to build a product that is going to join my team as an awesome team member, because I’m going to think about using/buying your service in the same way I think about hiring.

Sincerely,

John Allspaw


A Mature Role for Automation: Part I

(Part 1 of 2 posts)
I’ve been percolating on this post for a long time. Thanks very much to Mark Burgess for reviewing early drafts of it.

One of the ideas that permeates our field of web operations is that we can’t have enough automation. You’ll see experience with “building automation” on almost every job description, and many post-mortem transcriptions around the world have remediation items that state that more automation needs to be in place to prevent similar incidents.

“AUTOMATE ALL THE THINGS!” the meme says.

But the where, when, and how to design, implement, and operate automation is not as straightforward as “AUTOMATE ALL THE THINGS!”

I’d like to explore this concept that everything that could be automated should be automated, and I’d like to take a stab at putting context around the reasons why we might think this is a good idea. I’d also like to give some background on the research of how automation is typically approached, the reasoning behind various levels of automation, and most importantly: the spectrum of downsides of automation done poorly or haphazardly.

(Since it’s related, I have to include an obligatory link to Github’s public postmortem on issues they found with their automated database failover, and some follow-up posts that are well worth reading.)

In a recent post, Mathias Meyer gives some great pointers on this topic, and strongly hints at something I also agree with, which is that we should not let learnings from other safety-related fields (aviation, combat, surgery, etc.) go to waste, because there are decades of thinking and exploration there. This is part of my plan for exploring automation.

Frankly, I think that we as a field could have a more mature relationship with automation. Not unlike the relationship humans have with fire: a cautious but extremely useful one, not without risks.

I’ve never done a true “series” of blog posts before, but I think this topic deserves one. There’s simply too much in this exploration to have in a single post.

What this means: There will not be, nor do I think should there ever be, a tl;dr for a mature role of automation, other than: its value is extremely context-specific, domain-specific, and implementation-specific.

If I’m successful with this series of posts, I will convince you to at least investigate your own intuition about automation, and get you to bring the same “constant sense of unease” that you have with making change in production systems to how you design, implement, and reason about it. In order to do this, I’m going to reference a good number of articles that will branch out into greater detail than any single blog post could shed light on.

Bluntly, I’m hoping to use some logic, research, science, and evidence to approach these sorts of questions:

  1. What do we mean when we say “automation”? What do those other fields mean when they say it?
  2. What do we expect to gain from using automation? What problem(s) does it solve?
  3. Why do we reach for it so quickly sometimes, so blindly sometimes, as the tool to cure all evils?
  4. What are the (gasp!) possible downsides, concerns, or limitations of automation?
  5. And finally – given the potential benefits and concerns with automation, what does a mature role and perspective for automation look like in web engineering?

Given that I’m going to mention limitations of automation, I want to be absolutely clear, I am not against automation. On the contrary, I am for it.

Or rather, I am for: designing and implementing automation while keeping an eye on both its limitations and benefits.

So what limitations could there be? The story of automation (at least in web operations) is one of triumphant victory. The reason that we feel good and confident about reaching for automation is almost certainly due to the perceived benefits we’ve received when we’ve done it in the past.

Canonical example: engineer deploys to production by running a series of commands by hand, to each server one by one. Man that’s tedious and error-prone, right? Now we’ve automated the whole process, it can run on its own, and we can spend our time on more fun and challenging things.
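To make the canonical example concrete, here is a minimal sketch of what “automating the whole process” often looks like at its simplest: a script that loops over the servers and runs the same commands a human used to type by hand. The hostnames, paths, and commands are hypothetical, just to show the shape of it.

```python
#!/usr/bin/env python3
"""A deliberately simple deploy loop: the kind of script that replaces
running the same commands by hand on each server, one at a time.
Hostnames, paths, and commands here are made up for illustration."""
import subprocess
import sys

HOSTS = ["web01.example.com", "web02.example.com", "web03.example.com"]

def deploy(rev):
    cmds = [
        f"cd /var/www/app && git fetch --all && git checkout {rev}",
        "sudo /etc/init.d/apache2 graceful",
    ]
    for host in HOSTS:
        for cmd in cmds:
            print(f"==> {host}: {cmd}")
            # ssh to the host and run the command; bail out on the first failure
            result = subprocess.run(["ssh", host, cmd])
            if result.returncode != 0:
                sys.exit(f"deploy failed on {host}, aborting")

if __name__ == "__main__":
    deploy(sys.argv[1])
```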

This is a prevailing perspective, and a reasonable one.

Of course we can’t ditch the approach of automation, even if we wanted to.  Strictly speaking, almost every use of a computer is to some extent using “automation”, even if we are doing things “by hand.” Which brings me to…

Definitions and Foundations

I’d like to point at the term itself, because I think it’s used in a number of different contexts to mean different things. If we’re to look at it closely, I’d like to at least clarify what I (and others who have researched the topic quite well) mean by the term “automation”. The word comes from the Greek: auto, meaning ‘self’, and matos, meaning ‘willing’, which implies something is acting on its own accord.

Some modern definitions:

“Automation is defined as the technology concerned with the application of complex mechanical, electronic, and computer based systems in the operations and control of production.” – Raouf (1988)

‘Automation’ as used in the ATA Human Factors Task Force report in 1989 refers to…”a system or method in which many of the processes of production are automatically controlled or performed by self-operating machines, electronic devices, etc.” – Billings (1991)

“We define automation as the execution by a machine agent (usually a computer) of a function that was previously carried out by a human.” – Parasuraman (1997)

I’ll add to that somewhat broad definition functions that have never been carried out by a human. Namely, processes and tasks that could never be performed by a human, by exploiting the resources available in a modern computer. The recording and display of computations per second, for example.

To help clarify my use of the term:

  • Automation is not just about provisioning and configuration management. Although this is maybe the most popular context in which the term is used, it’s almost certainly not the only place for automation.
  • It’s also not simply the result of programming what were previously performed as manual tasks.
  • It can mean enforcing predefined or dynamic limits on operational tasks, automated or manual.
  • It can mean surfacing, displaying, and analyzing metrics from tasks and actions.
  • It can mean making decisions and possibly taking action on observed states in a system.

Some familiar examples of these facets of automation:

  • MySQL’s max_connections and Apache’s MaxClients directives: these are upper bounds intended to prevent high workloads from causing damage.
  • Nagios (or any monitoring system, for that matter): these perform checks on values and states at rates and intervals only a computer could manage, and can also take action on those states in order to course-correct a process (as with Event Handlers; a sketch of one follows below)
  • Deployment tools and configuration management tools (like Deployinator, as well as Chef/Puppet/CFEngine, etc.)
  • Provisioning tools (golden-image or package-install based)
  • Any collection or display of metrics (StatsD, Ganglia, Graphite, etc.)

Which is basically…well, everything, in some form or another in web operations. 🙂
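As a small illustration of that last facet (and of the Event Handlers mentioned above), here’s a rough sketch of the kind of script a Nagios-style event handler might be: the monitoring system runs it when a check changes state, and it decides whether to attempt a simple self-healing action on its own. The service name and restart command are hypothetical; this is the shape of the pattern, not a drop-in handler.

```python
#!/usr/bin/env python3
"""Sketch of a Nagios-style event handler: the monitoring system calls this
when a service check changes state, passing the state along as arguments,
and the script decides whether to attempt a simple self-healing action.
The service and restart command here are hypothetical."""
import subprocess
import sys

def main(state, state_type, attempt):
    # Only act on a confirmed problem: a HARD CRITICAL state, not a SOFT one
    # that may clear on the next check.
    if state == "CRITICAL" and state_type == "HARD":
        print(f"attempt {attempt}: trying an automatic restart of the service")
        subprocess.run(["sudo", "/etc/init.d/some-service", "restart"])
    else:
        # Anything else: do nothing and leave it to the humans (and the pager).
        print(f"state={state} ({state_type}), no automatic action taken")

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2], sys.argv[3])
```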

Domains To Learn From

In many of the papers found in Human Factors and Resilience Engineering, and in blog posts that talk about the limitations of automation, the discussion is set in the context of aviation. And what a great context that is! You have dramatic consequences (people die) and you have a plethora of articles and research to choose from. The volume of research done on automation in the cockpit is large due to the drama (people die, big explosions, etc.), so no surprise there.

Except the difference is, in the cockpit, human and machine elements have a different context. There are mechanical actions that the operator can and needs to do during takeoff and landing. They physically step on pedals, push levers and buttons, and watch dials and gauges at various points during takeoff and landing. Automation in that context is, frankly, much more evolved, and the contrast (and implicit contract) between man and machine is much more stark than in the context of web infrastructures. Display layouts, power-assisted controls…we should be so lucky to have attention like that paid to our working environment in web operations! (but also, cheers to people not dying when the site goes down, amirite?)

My point is that while we discuss the pros, cons, and considerations for designing automation to help us in web operations, we have to be clear that we are not aviation, and that our discussion should reflect that while still trying to glean information from that field’s use of it.

We ought to understand also that when we are designing tasks, automation is but one approach (albeit a complex one) we can take, and that it can be implemented in a wide spectrum of ways. This also means that if we decide in some cases not to automate something (gasp!), or to step back from full automation for good reason, we shouldn’t feel bad about it or like we’ve failed. Ray Kurzweil and the nutjobs who think the “singularity” is coming RealSoonNow™ won’t be impressed, but then again you’ve got work to do.

So Why Do We Want to Use Automation?

Historically, automation is used for:

  • Precision
  • Stability
  • Speed

Which sounds like a pretty good argument for it, right? Who wants to be less precise, less stable, or slower? Not I, says the Ops guy. So using automation at work seems like a no-brainer.  But is it really just as simple as that?

Some common motivations for automation are:

  • Reduce or eliminate human error
  • Reduce the human’s workload, specifically by ridding humans of boring and tedious tasks so they can tackle the more difficult ones
  • Bring stability to a system
  • Reduce fatigue on humans

No article about automation would be complete without pointing first at Lisanne Bainbridge’s 1983 paper, “The Ironies of Automation”. I would call it the modern canonical work on the topic. Any self-respecting engineer should read it. While its prose is somewhat dated, the value is still very real and pertinent.

What she says, in a nutshell, is that there are at least two ironies with automation, from the traditional view of it. The premise reflects a gut intuition that pervades many fields of engineering, and one that I think should be questioned:

The basic view is that the human operator is unreliable and inefficient, and therefore should be eliminated from the system.

Roger that. This supports the idea of taking humans out of the loop (because they are unreliable and inefficient) and replacing them with automated processes.

The first irony is:

Designer errors [in automation] can be a major source of operating problems.

This means that the designers of automation make decisions about how it will work based on how they envision the context in which it will be used. There is a very real possibility that the designer hasn’t imagined (or can’t imagine) every scenario and situation the automation and the human will find themselves in, and therefore can’t account for it in the design.

Let’s re-read the statement: “This supports the idea of taking humans out of the loop (because they are unreliable and inefficient) and replacing them with automated processes.”…which are designed by humans, who are assumed to be unrelia…oh, wait.

The second irony is:

The designer [of the automation], who tries to eliminate the operator, still leaves the operator to do the tasks which the designer cannot think how to automate.

Which is to say that because the designers of automation can’t fully automate the human “out” of everything in a task, the human is left to cope with what’s left after the automated parts. Which by definition are the more complex bits. So the proposed benefit of relieving humans of cognitive workload isn’t exactly realized.

There are some more generalizations that Bainbridge makes, paraphrased by James Reason in Managing The Risks of Organizational Accidents:

  • In highly automated systems, the task of the human operator is to monitor the systems to ensure that the ‘automatics’ are working as they should. But it’s well known that even the best motivated people have trouble maintaining vigilance for long periods of time. They are thus ill-suited to watch out for these very rare abnormal conditions.
  • Skills need to be practiced continuously in order to preserve them. Yet an automatic system that fails only very occasionally denies the human operator the opportunity to practice the skills that will be called upon in an emergency. Thus, operators can become deskilled in just those abilities that justify their (supposedly) marginalized existence.
  • And ‘Perhaps the final irony is that it is the most successful automated systems with rare need for manual intervention which may need the greatest investment in operator training.’

Bainbridge’s exploration of the ironies and costs of automation brings a much more balanced view of the topic, IMHO. It also points to something that I don’t believe is apparent to our community, which is that automation isn’t an all-or-nothing proposition. It’s easy to bucket things that humans do and things that machines do, and while the two do meet from time to time in different contexts, it’s simpler to think of their abilities apart from each other.

Viewing automation instead on a spectrum of contexts can break this oversimplification, which I think can help us gain a glimpse into what a more mature perspective towards automation could look like.

Levels Of Automation

It would seem automation design needs to be done with the context of its use in mind. Another fundamental work in the research of automation is the so-called “Levels Of Automation”. In their seminal 1978 report, “Human And Computer Control of Undersea Teleoperators”, Sheridan and Verplank lay out the landscape for where automation exists along the human-machine relationship (Table 8.2 in the original and most excellent vintage typewritten engineering paper).

Automation Level and Description:

  1. The computer offers no assistance: the human must take all decisions and actions.
  2. The computer offers a complete set of decision/action alternatives, or
  3. …narrows the selection down to a few, or
  4. …suggests one alternative, and
  5. …executes that suggestion if the human approves, or
  6. …allows the human a restricted time to veto before automatic execution, or
  7. …executes automatically, then necessarily informs the human, and
  8. …informs the human only if asked, or
  9. …informs the human after execution if it, the computer, decides to.
  10. The computer decides everything and acts autonomously, ignoring the human.


This was extended later in Parasuraman, Sheridan, and Wickens (2000) “A Model for Types and Levels of Human Interaction with Automation” to include four stages of information processing within which each level of automation may exist:

  1. Information Acquisition. The first stage involves the acquisition, registration, and positioning of multiple information sources, similar to humans’ initial sensory processing.
  2. Information Analysis. The second stage refers to conscious perception, selective attention, cognition, and the manipulation of processed information, such as in the Baddeley model of information processing.
  3. Decision and Action Selection. Next, automation can make decisions based on information acquisition, analysis and integration.
  4. Action Implementation. Finally, automation may execute forms of action.

Viewing the above 10 Levels of Automation (LOA) as a spectrum within each of those four stages allows for a way of discerning where and how much automation could (or should) be implemented, in the context of performance and cost of actions. This feels to me like a step towards making mature decisions about the role of automation in different contexts.

Here is an example of these stages and the LOA in each of them, suggested for Air Traffic Control activities:

Endsley (1999) also came up with a similar paradigm of stages of automation, in “Level of automation effects on performance, situation awareness and workload in a dynamic control task”

What are examples of viewing LOA in the context of web operations and engineering?

At Etsy, we’ve made decisions (sometimes implicitly) about the levels of automation in various tasks and tooling:

  • Deployinator: assisted by automated processes, humans trigger application code deploys to production. The when and what is human-centered. The how is computer-centered.
  • Chef: humans decide on some details in recipes (this configuration file in this place), computers decide on others (use 85% of total RAM for memcached, other logic in templates; see the sketch after this list), and the computer decides on automatic deployment (a random 10-minute splay for Chef client runs). Mostly, humans provide the what, and computers decide the when and how.
  • Database Schema changes: assisted by automated processes, humans trigger the what and when, computer handles the how.
  • Event handling: some Nagios alerts trigger simple self-healing attempts upon some (not all) alertable events. Human decides what and how. Computer decides when.
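To make the Chef example above a bit more concrete, the “computer decides the how” part is often a small piece of logic like the following, which derives memcached’s memory limit from whatever RAM the box actually reports rather than hard-coding a number. Real Chef attributes and templates are Ruby/ERB; this is only a rough Python rendering of the same idea, and the helper names are made up.

```python
#!/usr/bin/env python3
"""Sketch of 'the computer decides the how': derive memcached's memory limit
from the RAM the machine actually has, instead of hard-coding it per box.
(In Chef this logic would live in an attribute file or ERB template; this is
a Python rendering of the same idea, with made-up helper names.)"""

def total_ram_mb():
    # Read total memory from /proc/meminfo (Linux-specific).
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) // 1024
    raise RuntimeError("could not determine total RAM")

def memcached_limit_mb(fraction=0.85):
    # The humans chose the policy (85% of RAM); the computer works out the
    # actual number for each box it runs on.
    return int(total_ram_mb() * fraction)

if __name__ == "__main__":
    # e.g. feed this into memcached's startup arguments
    print(f"-m {memcached_limit_mb()}")
```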

I suspect that in many organizations, the four stages of automation (from Parasuraman, Sheridan, and Wickens) line up something like this, with regards to the breakdown in human or computer function allocation:

Information Acquisition
  • Largely computer-driven for application and infra metrics (think Graphite/Ganglia/NewRelic/Boundary/etc.)
  • Some higher-level human-driven data acquisition (think UX testing and observation/focus groups/etc.)
Information Analysis
  • Some computer-driven for application and infra (think Holt-Winters, CEP, A/B testing results, deductive reasoning about metrics, etc.)
  • Some human-driven analysis (think BI/behavioral/funnel correlations, inductive reasoning about metrics, etc.)
Decision and Action Selection
  • Some computer-driven for application and infra (think event handlers, fault tolerance and protection methods, CI, etc.)
  • Some human-driven (think some deployments, core network or storage changes deemed risky, etc.)
Action Implementation
  • Some computer-driven for application and infra (think event handlers, some config mgmt implementations, scheduled jobs with feed-back and feed-forward loops, etc.)
  • Some human-driven (think some deployments, feature ramp-ups, coordinated multi-team actions, etc.)


Trust

In many cases, what level of automation is appropriate and in which context is informed by the level of trust that operators have in the automation to be successful.

Do you trust an iPhone’s ability to auto-correct your spelling enough to blindly accept all suggestions? I suspect no one would, and the iPhone auto-correct designers know this, because they’ve given the human veto power over the suggestion by putting an “x” next to it. (following automation level 5, above)
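Applied to operations tooling, a level-5 interaction looks something like the sketch below: the automation suggests one specific action and executes it only if the human explicitly approves. The proposed action and command are hypothetical; the point is only the shape of “executes that suggestion if the human approves.”

```python
#!/usr/bin/env python3
"""Sketch of Sheridan/Verplank level 5: the computer suggests one action and
executes it only if the human approves. The action shown is hypothetical."""
import subprocess

def propose_and_maybe_execute(description, command):
    print(f"Suggested action: {description}")
    print(f"  would run: {' '.join(command)}")
    answer = input("Approve? [y/N] ").strip().lower()
    if answer == "y":
        subprocess.run(command)
    else:
        # The human vetoed it; the automation does nothing further.
        print("Not executed.")

if __name__ == "__main__":
    propose_and_maybe_execute(
        "restart the stuck worker process on web03",
        ["ssh", "web03.example.com", "sudo /etc/init.d/worker restart"],
    )
```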

Do you trust a GPS routing system enough to follow it without question? Let’s hope not. Given that there is context missing, such as stop signs, red lights, pedestrians, and other dynamic phenomena going on in traffic, GPS automobile routing may be a good example of keeping the LOA at level 4 and below, and even then only for the “Information Acquisition” and “Information Analysis” stages, leaving the “Decision and Action Selection” and “Action Implementation” stages to the human, who can recognize the more complex context.

In “Trust in Automation: Designing for Appropriate Reliance”, John D. Lee and Katrina A. See investigate the concerns surrounding trusting automation, including organizational issues, cultural issues, and context that can influence how automation is designed and implemented. They outline a concern that I think should be familiar to anyone who has had experiences (good or bad) with automation (emphasis mine):

As automation becomes more prevalent, poor partnerships between people and automation will become increasingly costly and catastrophic. Such flawed partnerships between automation and people can be described in terms of misuse and disuse of automation. (Parasuraman & Riley, 1997).

Misuse refers to the failures that occur when people inadvertently violate critical assumptions and rely on automation inappropriately, whereas disuse signifies failures that occur when people reject the capabilities of automation.

Misuse and disuse are two examples of inappropriate reliance on automation that can compromise safety and profitability.

They discuss methods on making automation trustable:

  • Design for appropriate trust, not greater trust.
  • Show the past performance of the automation.
  • Show the process and algorithms of the automation by revealing intermediate results in a way that is comprehensible to the operators.
  • Simplify the algorithms and operation of the automation to make it more understandable.
  • Show the purpose of the automation, design basis, and range of applications in a way that relates to the users’ goals.
  • Train operators regarding its expected reliability, the mechanisms governing its behavior, and its intended use.
  • Carefully evaluate any anthropomorphizing of the automation, such as using speech to create a synthetic conversational partner, to ensure appropriate trust.

Adam Jacob, in a private email thread with me and some others, had some very insightful things to say on the topic:

The practical application of the ironies isn’t that you should/should not automate a given task, it’s answering the questions of “When is it safe to automate?”, perhaps followed by “How do I make it safe?”. We often jump directly to “automation is awesome”, which is an answer to a different question.

[if you were to ask]…”how do you draw the line between what is and isn’t appropriate?”, I come up with a couple of things:

  • The purpose of automation is to serve a need – for most of us, it’s a business need. For others, it’s a human-critical one (do not crash planes full of people regularly due to foreseeable pilot error.)
  • Recognize the need you are serving – it’s not good for its own sake, and different needs call for different levels of automation effort.
  • The implementers of that automation have a moral imperative to create automation that is serviceable, instrumented, and documented.
  • The users of automation have an imperative to ensure that the supervisors understand the system in detail, and can recover from failures.

I think Adam is putting this eloquently, and I think it’s an indication that we as a field are moving towards a more mature perspective on the subject.

There is a growing notion amongst those who study the history, ironies, limitations, and advantages of automation that an evolved perspective on the human-machine relationship may look a lot like human-human relationships. Specifically, the characteristics that govern groups of humans that are engaged in ‘joint activity’ could also be seen as ways that automation could interact.

Collaboration, communication, and cooperation are some of the hallmarks of teamwork amongst people. In “Ten Challenges for Making Automation a ‘Team Player’ in Joint Human-Agent Activity” David Woods, Gary Klein, Jeffrey M. Bradshaw, Robert R. Hoffman, and Paul J. Feltovich make a case for how such a relationship might exist. I wrote briefly a little while ago about the ideas that this paper rests on, in this post here about how people work together.

Here are these ten challenges the authors say we face, where ‘agents’ = humans and machines/automated processes designed by humans:

  • Basic Compact – Challenge 1: To be a team player, an intelligent agent must fulfill the requirements of a Basic Compact to engage in common-grounding activities.
  • Adequate models – Challenge 2: To be an effective team player, intelligent agents must be able to adequately model the other participants’ intentions and actions vis-à-vis the joint activity’s state and evolution—for example, are they having trouble? Are they on a standard path proceeding smoothly? What impasses have arisen? How have others adapted to disruptions to the plan?
  • Predictability – Challenge 3: Human-agent team members must be mutually predictable.
  • Directability – Challenge 4: Agents must be directable.
  • Revealing status and intentions – Challenge 5: Agents must be able to make pertinent aspects of their status and intentions obvious to their teammates.
  • Interpreting signals – Challenge 6: Agents must be able to observe and interpret pertinent signals of status and intentions.
  • Goal negotiation – Challenge 7: Agents must be able to engage in goal negotiation.
  • Collaboration – Challenge 8: Support technologies for planning and autonomy must enable a collaborative approach.
  • Attention management – Challenge 9: Agents must be able to participate in managing attention.
  • Cost control – Challenge 10: All team members must help control the costs of coordinated activity.

I do recognize these to be traits and characteristics of high-performing human teams. Think of the best teams in many contexts (engineering, sports, political, etc.) and these certainly show up. Can humans and machines work together just as well? Maybe we’ll find out over the next ten years. 🙂

“The question is no longer whether one or another function can be automated, but, rather, whether it should be.” – Wiener & Curry (1980)
“…and in what ways it should be automated.” – John Allspaw (right now, in response to Wiener & Curry’s quote above)

Meanwhile: More Meta-Metrics

Like all sane web organizations, we gather metrics about our infrastructure and applications. As many metrics as we can, as often as we can. These metrics, given the right context, help us figure out all sorts of things about our application, infrastructure, processes, and business. Things such as…

What:

…did we do before? (historical trending, etc.)
…is going on right now? (troubleshooting, health, etc.)
…is coming down the road (capacity planning, new feature adoption, etc.)
…can we do to make things better (business intelligence, user-behavior, etc.)

All of which, of course, should be considered mandatory in order to help your business increase its awesome. Yay metrics!

Some time ago, Matthias wrote a great blog post about some of the metrics that can reasonably profile the effectiveness of web operations, taken from the ITIL primer, VisibleOps.

In my opinion, there’s nothing on that list that isn’t valuable, as long as the cost of gathering those metrics isn’t too behaviorally, technically, or organizationally expensive. The topics included in that list of metrics, and the contexts they live in, are fodder for many, many blog posts.

But in the category of historical trending, I’m more and more fascinated by gathering what I’ll call “meta-metrics”, which is data about how you respond to the changes your system is experiencing.

One of the best examples of this is gathering information about operational disruptions. Collecting information about how many times your on-call rotation was alerted/paged/woken-up, during what times, and for what service(s) can be enlightening to say the least.  We’ve been tracking the volume of alerts a lot closer recently, and even with the level of automation we’ve got at Flickr, it’s still something you have to keep on top of, especially if you’re always finding new things to measure and alert on.

Now ideally, you have an alerting system that only communicates conditions that need resolvable action by a human. Which means every alert is critically important, and you’re not ignoring or dismissing any pages for any reasons that sound like “oh, that’s ok, that cluster always does that…it’ll clear up, I’ll just acknowledge the page so I can shut up nagios.” In other words, our goal is to have a zero-noise alerting system. Which means that all alerts are actionable, not ignorable, and require a human to troubleshoot or fix. Over time, you push as much of this work as you can to the robots. In the meantime, save humans for the yet-to-be-automated work, or the stuff that isn’t easily captured by robots.

Why is this important to us? I may be stating the obvious, but it’s because interrupting humans with alerts that don’t require action has a mental and physical context switching cost (especially if the guy on-call was sleeping), and it increases the likelihood of missing a truly critical page in a slew of non-critical ones.

Of course, in the reality of evolving and growing web applications, even if we could reach a 100% noise-free alerting system, it’s impossible to sustain for any extended period of time, because your application, usage, and failure modes are constantly changing. So in the meantime, knowing how your alerts affect the team is a worthwhile thing for us to do. In fact, I think it’s so important that it’s worth collecting and displaying next to the rest of your metrics, and exposing these metrics to the entire dev and ops groups.

Something like this: (made-up numbers)

Tracking Critical Alerts

Gathering up info about these alerts should give us a better perspective on where we can improve. So, things like:

  • How many critical alerts are sent on a daily/hourly/weekly basis?
  • What does a time histogram of the alerts look like? Do you get more or fewer alerts during nighttime or non-peak hours?
  • How much (if any) correlation is there between critical alerts and:

– code deploys?
– software upgrades?
– feature launches?
– open API abuse?

  • What does a breakdown of the alerts look like, in terms of: host type, service type, and frequency of each in a given time period?

and maybe the most important ones:

  • How many of those alerts aren’t actually critical or demand human attention?
  • How many of them always self-recover?
  • How many (and which) don’t matter in their role context (like, a single node in a load-balanced cluster) and could be turned into an aggregate check?

We’ve built our own stuff to track and analyze these things. My question to the community: is there an open-source tool dedicated to analyzing these kinds of metrics? I’m not aware of one. Nagios obviously has host/hostgroup/cluster warning and critical histories, and those can be crunched to find critical alert statistics, but I haven’t seen any comprehensive crunching. Of course, until I find one, we’re just building our own.
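For what it’s worth, the crunching I have in mind doesn’t need to be fancy. Below is a minimal sketch that assumes you’ve already exported your alert history into a CSV of (timestamp, host, service, state) rows, which is not a format Nagios hands you directly, and computes a few of the numbers from the list above: critical alerts per day, an hour-of-day histogram, and the noisiest host/service pairs.

```python
#!/usr/bin/env python3
"""Minimal alert-history crunching: critical alerts per day, an hour-of-day
histogram, and the noisiest host/service pairs. Assumes the alert history
has already been exported to a CSV with columns: timestamp (ISO 8601),
host, service, state."""
import csv
import sys
from collections import Counter
from datetime import datetime

def crunch(path):
    per_day = Counter()
    per_hour = Counter()
    per_service = Counter()
    with open(path) as f:
        for row in csv.DictReader(f):
            if row["state"] != "CRITICAL":
                continue
            ts = datetime.fromisoformat(row["timestamp"])
            per_day[ts.date()] += 1
            per_hour[ts.hour] += 1
            per_service[(row["host"], row["service"])] += 1

    print("critical alerts per day:")
    for day in sorted(per_day):
        print(f"  {day}  {per_day[day]}")

    print("hour-of-day histogram:")
    for hour in range(24):
        print(f"  {hour:02d}:00  {'#' * per_hour[hour]}")

    print("noisiest host/service pairs:")
    for (host, service), count in per_service.most_common(10):
        print(f"  {count:4d}  {host} / {service}")

if __name__ == "__main__":
    crunch(sys.argv[1])
```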

Thoughts, lazyweb?

Slides from Web2.0 Expo 2009. (and somethin else interestin’)

That was a pretty good time. Saw lots of good and wicked smaht people, and I got a lot of great questions after my talk. The slides are up on slideshare, and here are the PDF slides.

UPDATE: Gil Raphaelli has posted his python bindings he wrote for our libyahoo2 use in our Ops IM Bot.

There was something that I left out of my slides, mostly because I didn’t want to distract from the main topic, which was optimization and efficiencies.

While I used our image processing capacity at Flickr as an example of how compilers and hardware can have some significant influence on how fast or efficient you can run, I had wondered what the Magical Cloud™ would do with these differences.

So I took the tests I ran on our own machines and ran them on Small, Medium, Large, Extra Large, and Extra Large(High) instances of EC2, to see. The results were a bit surprising to me, but I’m sure not surprising to anyone who uses EC2 with any significant amount of CPU demand.

For the testing, I have a script that does some super simple image resizing with GraphicsMagick. It splits a DSLR photo into 6 different sizes, much in the same way that we do at Flickr for the real world. It does that resizing on about 7 different files, and I timed them all. This is with the most recent version of GraphicsMagick, 1.3.5, with the awesome OpenMP bits in it.
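The actual script isn’t included here, but it’s roughly this shape: for each source photo, call gm convert once per target size and time the whole run. The sizes and filename pattern below are placeholders, not the real Flickr sizes.

```python
#!/usr/bin/env python3
"""Rough shape of the resize benchmark: resize each source photo to several
target sizes with GraphicsMagick ('gm convert') and time the whole run.
The sizes and filename pattern here are placeholders, not Flickr's real ones."""
import glob
import subprocess
import time

SIZES = ["75x75", "100x100", "240x240", "500x500", "1024x1024", "2048x2048"]

def resize_all(pattern="*.jpg"):
    start = time.time()
    for src in glob.glob(pattern):
        for size in SIZES:
            dst = f"{src}.{size}.jpg"
            # GraphicsMagick does the actual work; any OpenMP parallelism
            # happens inside the gm process itself.
            subprocess.run(["gm", "convert", src, "-resize", size, dst], check=True)
    print(f"total: {time.time() - start:.2f}s")

if __name__ == "__main__":
    resize_all()
```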

Here is the slide of the tests run on different (increasingly faster) dedicated machines:

Faster Image Processing Hardware

and here is the slide that I didn’t include, of the EC2 timings of the same test:

Image Processing on EC2

Now I’m not suggesting that the two graphs should look similar, or that EC2 should be faster. I’m well aware of the shift in perspective when deploying capacity within the cloud versus within your own data center. So I’m not surprised that the fastest test results are on the order of 2x slower on EC2. Application logic, feature designs (synchronous versus asynchronous image processing, for example) can take care of these differences and could be a welcome trade-off in having to run your own machines.

What I am surprised about is the variation (or lack thereof) in all but the small instances. After I took a closer look at vmstat and top, I realized that the small instances consistently saw about 50-60% of their CPU stolen, the mediums almost always saw zero stolen, and the Large and Extra Larges saw up to 35% CPU stolen during the jobs.

So, interesting.

Speaking at Web2.0 Expo 2009

Looks like I’m gonna talk about even more nerdy things at the Web2.0 Expo in April.

You don’t have to wait for a recession to tighten up your operations. Squeezing more oomph out of your servers (or instances!) is always a good thing, and streamlining how you handle site issues is too. We’ll talk about what we’ve been doing at Flickr to get more out of less from both our machines and our humans.

Capacity Hacks: diagonal scaling, tuning opportunities, and some other stupid performance tricks.

Ops “runbook” Hacks: Server and process self-healing, application-level measurement, ops communication tools, and some worst-case scenario tricks to have in your back pocket.

Web Ops Visualizations Group on Flickr

Like lots of operations people, we’re quite addicted to data pr0n here at Flickr. We’ve got graphs for pretty much everything, and add graphs all of the time. We’ve blogged about some of how and why we do it.

One thing we’re in the habit of is screenshotting these graphs when things go wrong, right, or indifferent, and adding them to a group on Flickr. I’ve decided to make a public group for these sort of screenshots, for anyone to contribute to:

http://flickr.com/groups/webopsviz/

Before posting anything here, you should think about whether you want everyone in the world to see what you’ve got. I’ve made a quick FAQ on the group’s page, but I’ll repeat it here:

Q: What is this?
A: This group is for sharing visualizations of web operations metrics. For the most part, this means graphs of systems and application metrics, from software like ganglia, cacti, hyperic, etc.

Q: Who gets to see this?
A: This is a semi-public group, so don’t post anything you don’t want others to see.
For now, it’ll be for members-only to post and view. Ideally, I think it’d be great to share some of these things publicly.

Q: What’s interesting to post here?
A: Spikes, dips, patterns. Things with colors. Shiny things. Donuts. Ponies.

Q: My company will fire me if I show our metrics!
A: Don’t be dense and post your pageview, revenue, or other super-secret stuff that you think would be sensitive. Your mileage may vary.

So: you’ve got something to brag about? How many requests per second can your awesome new solid-state-disk database do? You got spikes? Post them!

Code Swarm for Config Management

Gil Raphaelli, one of the guys on our Flickr Ops team, put together a Code Swarm animation for the configuration/deployment management tool we use at Flickr to manage our infrastructure. Myles Grant did this for our bug reporting system as well. Check it out:

Our automated config management system is called Gemstone, but conceptually you can think of it as a pretty extensible SystemImager/Puppet/cfengine-style system. In the animation, the dots are changes made by the ops person shown.  The legend is:

transforms: which clusters should have which packages, files, actionable scripts, etc.
raw: the actual files, like apache/memcached/squid configs, which get munged depending on what cluster they might be in
conf: which boxes/clusters are subsets or supersets of which clusters
code: ops-written tools/utilities
misc: stuff that doesn’t fit into the above. 🙂