
I guess I’m late on getting to this, but How Complex Systems Fail by Richard Cook is excellent.

Let me start with this: I don’t think I can overstate how right-on this paper is, with respect to the challenges, solutions, observations, and concerns involved with operating a medium to large web infrastructure. I found this via @benjaminblack, and I agree with him 100%: this should be considered required reading for anyone in our industry. I’m not sure if Cook ever thought that his paper would apply to web infrastructure, but I think it can and does. Please take 30 minutes right now and read it. :)

There are a number of salient points in the paper that I’d like to comment on. Again, this is through the lens of failures of complex systems as it pertains to web operations:

7) Post-accident attribution to a ‘root cause’ is fundamentally wrong.

I’m going to guess that this portion may be viewed as controversial in the prevailing webops wisdom, where post-mortems are for sure necessary, but whose content may or may not be effective in preventing similar types of failure. I do value the process of a post-mortem, because I think the human element of understanding complex failures is important and doing whatever you can to put in place safety is good, modulo what is said in section #16 of the paper. I believe that even a rudimentary process of “5 Whys” has value. But at the same time, I also think that there is something in the spirit of this paragraph, which is that there is a danger in standing behind a single underlying cause when there are systemic failures involved. Doing this can lead to the false belief that you’ve got this mode covered, you’ve found the silver bullet that made the whole mountain crumble, and jeez what a relief because that will never bite us again.

14) Change introduces new forms of failure.

I totally agree with this point. However, I often see this as a rallying point for operations teams to say “No!” to change, when instead they should be working alongside development (and product owners) with a goal of reducing the risk of failure associated with each change. I do not believe that ‘release early, release often’ in and of itself can reduce that risk. I believe that the real (and only) way to do this is both technical and cultural. But I’ve spoken about this before.

16) Safety is a characteristic of systems and not of their components

Emphasis on “Safety cannot be purchased or manufactured; it is not a feature that is separate from the other components of the system.” Real safety comes from smart people doing smart things to the entire shebang, not the individual guts.

and I think the point I love the most, with all of my heart:

18) Failure free operations require experience with failure.

Fear is a strong emotion. I believe it can be used as a strong motivator for ensuring safety in the face of constant change, instead of a reason to push back on the very idea of change. Embrace fear of outages and degradation. Use it to guide your architecture, your code, your infrastructure. So lean into it.

There are a lot of great points in the paper, and I could go on, but you get the idea.

{ 4 comments }

The minimum cycle of questions you should be asking yourself. As brought up by @debuggist and @benjaminblack.

What you might want to ask yourself before you deploy changes to production?

{ 1 comment }

Meanwhile: More Meta-Metrics

by allspaw on October 5, 2009

Like all sane web organizations, we gather metrics about our infrastructure and applications. As many metrics as we can, as often as we can. These metrics, given the right context, help us figure out all sorts of things about our application, infrastructure, processes, and business. Things such as…

What:

…did we do before (historical trending, etc)
…is going on right now? (troubleshooting, health, etc.)
…is coming down the road (capacity planning, new feature adoption, etc.)
…can we do to make things better (business intelligence, user-behavior, etc.)

All of which, of course, should be considered mandatory in order to help your business increase its awesome. Yay metrics!

Some time ago, Matthias wrote a great blog post about some of the metrics that can reasonably profile the effectiveness of web operations, taken from the ITIL primer, VisibleOps.

In my opinion, there’s nothing on that list of things that isn’t valuable, as long as the cost of gathering those metrics isn’t too behaviorally, technically, or organizationally expensive. The topics included in that list of metrics and the context they live in are fodder for many, many blog posts.

But in the category of historical trending, I’m more and more fascinated by gathering what I’ll call “meta-metrics”, which is data about how you respond to the changes your system is experiencing.

One of the best examples of this is gathering information about operational disruptions. Collecting information about how many times your on-call rotation was alerted/paged/woken-up, during what times, and for what service(s) can be enlightening to say the least.  We’ve been tracking the volume of alerts a lot closer recently, and even with the level of automation we’ve got at Flickr, it’s still something you have to keep on top of, especially if you’re always finding new things to measure and alert on.

Now ideally, you have an alerting system that only communicates conditions that require action by a human. Which means every alert is critically important, and you’re not ignoring or dismissing any pages for any reasons that sound like “oh, that’s ok, that cluster always does that…it’ll clear up, I’ll just acknowledge the page so I can shut up nagios.” In other words, our goal is to have a zero-noise alerting system. Which means that all alerts are actionable, not ignorable, and require a human to troubleshoot or fix. Over time, you push as much of this work as you can to the robots. In the meantime, save humans for the yet-to-be-automated work, or the stuff that isn’t easily captured by robots.

Why is this important to us? I may be stating the obvious, but it’s because interrupting humans with alerts that don’t require action has a mental and physical context switching cost (especially if the guy on-call was sleeping), and it increases the likelihood of missing a truly critical page in a slew of non-critical ones.

Of course in the reality of evolving and growing web applications, even if we could reach a 100% noise-free alerting system, it’s impossible to sustain for any extended period of time, because your application, usage, and failure modes are constantly changing. So in the meantime, knowing how your alerts affect the team is a worthwhile thing to do for us. In fact, I think it’s so important that it’s worth collecting and displaying next to the rest of your metrics, and exposing these metrics to the entire dev and ops groups.

Something like this: (made-up numbers)

Tracking Critical Alerts

Gathering up info about these alerts should give us a better perspective on where we can improve. So, things like:

  • How many critical alerts are sent on a daily/hourly/weekly basis?
  • What does a time histogram of the alerts look like? Do you get more or fewer alerts during nighttime or non-peak hours?
  • How much (if any) correlation is there between critical alerts and:

- code deploys?
- software upgrades?
- feature launches?
- open API abuse?

  • What does a breakdown of the alerts look like, in terms of: host type, service type, and frequency of each in a given time period?
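To make those questions a bit more concrete, here is a minimal sketch of the kind of crunching I mean. Everything in it is an assumption for illustration: the record layout, the sample alerts and deploy times, the function names, and the 30-minute window are all made up, and "fired shortly after a deploy" is only a crude stand-in for real correlation.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical alert records: (when it fired, host type, service).
ALERTS = [
    (datetime(2009, 10, 1, 3, 12), "db", "mysql-replication"),
    (datetime(2009, 10, 1, 3, 40), "web", "http"),
    (datetime(2009, 10, 1, 14, 5), "web", "http"),
    (datetime(2009, 10, 2, 3, 55), "db", "mysql-replication"),
    (datetime(2009, 10, 2, 16, 20), "web", "http"),
]

# Hypothetical code deploy times.
DEPLOYS = [datetime(2009, 10, 1, 14, 0), datetime(2009, 10, 2, 11, 0)]


def alerts_by_hour(alerts):
    """Count alerts per hour of day (0-23), across all days."""
    return Counter(when.hour for when, _, _ in alerts)


def alerts_by_service(alerts):
    """Count alerts per (host type, service) pair."""
    return Counter((host_type, service) for _, host_type, service in alerts)


def alerts_near_deploys(alerts, deploys, window=timedelta(minutes=30)):
    """Return alerts that fired within `window` after any deploy."""
    return [
        alert for alert in alerts
        if any(timedelta(0) <= alert[0] - deploy <= window for deploy in deploys)
    ]


if __name__ == "__main__":
    print("by hour:", sorted(alerts_by_hour(ALERTS).items()))
    print("by service:", alerts_by_service(ALERTS).most_common())
    print("near deploys:", alerts_near_deploys(ALERTS, DEPLOYS))
```

The same counting approach works for software upgrades or feature launches: swap in a different list of event times and compare the "near" count against what you would expect by chance.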

and maybe the most important ones:

  • How many of those alerts aren’t actually critical or demand human attention?
  • How many of them always self-recover?
  • How many (and which) don’t matter in their role context (like, a single node in a load-balanced cluster) and could be turned into an aggregate check?
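A rough sketch of how those last three questions could be answered, assuming you already record whether a human had to act on each alert. The record layout, service names, function names, and thresholds here are hypothetical, not what we actually run.

```python
from collections import defaultdict

# Hypothetical incident records: (service, did a human have to act?).
INCIDENTS = [
    ("memcached", False),
    ("memcached", False),
    ("memcached", False),
    ("mysql-replication", True),
    ("http", False),
    ("http", True),
]


def self_recovery_rate(incidents):
    """Fraction of alerts per service that cleared without human action."""
    totals = defaultdict(int)
    self_recovered = defaultdict(int)
    for service, human_acted in incidents:
        totals[service] += 1
        if not human_acted:
            self_recovered[service] += 1
    return {service: self_recovered[service] / totals[service] for service in totals}


def noise_candidates(incidents, threshold=1.0):
    """Services whose alerts self-recover at least `threshold` of the time."""
    rates = self_recovery_rate(incidents)
    return sorted(service for service, rate in rates.items() if rate >= threshold)


def cluster_is_critical(node_up, max_down_fraction=0.25):
    """Aggregate check: critical only when too much of the cluster is down.

    `node_up` maps node name -> True if healthy. An empty cluster is
    treated as critical, since there is nothing left to serve traffic.
    """
    if not node_up:
        return True
    down = sum(1 for up in node_up.values() if not up)
    return down / len(node_up) > max_down_fraction


if __name__ == "__main__":
    print(self_recovery_rate(INCIDENTS))
    print("review these alerts:", noise_candidates(INCIDENTS))
    print(cluster_is_critical({"web1": True, "web2": False, "web3": True, "web4": True}))
```

Anything that shows up in `noise_candidates` is a candidate for being demoted from a page to a graph, and `cluster_is_critical` shows the idea behind replacing per-node pages with one aggregate check.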

We’ve built our own stuff to track and analyze these things. My question to the community: I’m not aware of any open-source tool that is dedicated to analyzing these metrics. Does one exist? Nagios obviously has host/hostgroup/cluster warning and critical histories, and those can be crunched to find critical alert statistics, but I’m not aware of any comprehensive crunching. Of course, until I find one, we’re just building our own.
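For the Nagios history part, the crunching can start very small. This sketch is not our tool; it assumes the commonly seen `SERVICE ALERT` log line layout (`[epoch] SERVICE ALERT: host;service;state;state_type;attempt;output`), so check it against your own nagios.log before trusting it. The sample log lines, host names, and the `hard_criticals` function are invented for the example.

```python
import re
from collections import Counter

# Assumed layout of a Nagios service alert log line; verify against your logs.
SERVICE_ALERT = re.compile(
    r"^\[(?P<epoch>\d+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>[^;]+);(?P<attempt>\d+);(?P<output>.*)$"
)

# Made-up sample log content.
SAMPLE_LOG = """\
[1254700800] SERVICE ALERT: web12;HTTP;CRITICAL;HARD;3;Socket timeout after 10 seconds
[1254700860] SERVICE ALERT: web12;HTTP;OK;HARD;3;HTTP OK
[1254704400] SERVICE ALERT: db3;MySQL;CRITICAL;SOFT;1;Too many connections
[1254708000] SERVICE ALERT: web12;HTTP;CRITICAL;HARD;3;Socket timeout after 10 seconds
[1254708100] CURRENT HOST STATE: web12;UP;HARD;1;PING OK
"""


def hard_criticals(log_text):
    """Count HARD CRITICAL service alerts per (host, service) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = SERVICE_ALERT.match(line)
        if match and match["state"] == "CRITICAL" and match["state_type"] == "HARD":
            counts[(match["host"], match["service"])] += 1
    return counts


if __name__ == "__main__":
    for (host, service), count in hard_criticals(SAMPLE_LOG).most_common():
        print(f"{host} {service}: {count}")
```

SOFT states are skipped on the assumption that they are retries that have not yet turned into something a human hears about; if your notifications are configured differently, counting notification lines instead may be closer to what actually paged someone.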

Thoughts, lazyweb?

{ 6 comments }

I think I’ve said before somewhere that working in the field of web operations prepared me somewhat for being a parent. I thought the other day that I should write down some of this reasoning, because it’s pretty often that I’m reminded of similarities:

High availability

Having redundant infrastructure is WebOps 101. For my kids’ most prized possessions, their sleeping ‘loveys’, there is no reason to have a SPOF, under any circumstances. We have at least 4 backups for each on any trip that we go on, as well as a couple of trusted stuffed animals who might meet unfortunate fates.

Capacity planning

This applies to both disposable diapers (a.k.a. consumable capacity) and episodes of the few TV shows we allow them to watch, on the TiVo. My daughter, at 3 and a half, knows every detail from every one of the 49 episodes of The Backyardigans. Having some of them on iPods and iPhones can make a 6 hour drive to L.A. feel like 4, not 12.

Documentation

Since I’m already used to writing down observations and techniques learned ‘in the field’, I was totally prepared:

Allspaw Baby Soothing Method, v1

and in case I ever forgot what my most successful swaddling method was:


Architecture and design

It’s unfortunate that I was so sleep-deprived that I never got a photo of the RadioShack remote-control truck that I turned into a cam-driven Moses basket automatic rocker mechanism. But you understand what I’m talking about.

There is one other thing that I learned from working at Flickr which turned out to be useful new parent advice: expect the unexpected, and never rely on past behaviors as an indication of what can happen in the future. They’re kids, not applications. :)

{ 2 comments }

Wow, how did I miss this until now? In June, some smart people gathered in Barcelona for the First Workshop on Automated Control for Datacenters and Clouds (ACDC09) and jeez it looked like it was a good time, from a glance at the program.

One of the cooler papers is “Automatic exploration of datacenter performance regimes” in which the smart folks over at the RAD Lab at UCB tackle the idea of:

  1. Gathering up real usage metrics in production
  2. Taking that data to feed a resource allocation (“auto-scaling”) controller

The bits about coming up with an exploration policy are where the juicy stuff comes in, building in safety factors driven by external SLAs. You should read the whole thing to see how thoughtful their method was, which includes taking into account effects such as cold ramping, which you almost never see accounted for in simulated situations. Rock on, RAD Lab: this is the stuff that brings the academia smarts to the real world. Kudos.

FYI: I’m not just saying the paper is cool because they cite my book as a resource in it. :)

{ 10 comments }

I’ve said before that I’ve always been a huge fan of SystemImager, for super simple imaging. It has some shortcomings for config management, but those are solved with things like Chef or Puppet.

With all of the great things being talked about surrounding ‘Automated Infrastructure’, I’ll point to something insanely cool: 1,190 nodes installed from bare metal to all done in 15 minutes.

That’s One Thousand One Hundred and Ninety nodes. Completely installed in: Fifteen. Fucking. Minutes.

{ 7 comments }

SLAs, clouds, and whatnot

by allspaw on July 16, 2009

Excellent. Good work, Ben:

ah, the mighty service level agreement! the tooth and claw by which the wily customer brings the vendor to heel. get the SLA right and you, the customer, can sit back and relax, safe in the knowledge that should there be an outage, you are covered. your business is protected from harm by the warm, experienced embrace of a big, stable telco. pinch me, i must be dreaming.

go read the whole thing.

{ 0 comments }

Domas makes something more useful than I bet most would think: http://mituzas.lt/2009/06/26/uncache/

{ 0 comments }

UPDATE: blip.tv has the video of the talk as well, below. Jeez I have some major bed-head.

That was a blast! I had never done a ‘duet’ talk before. Here are the slides:

…and the video of it is here:

{ 12 comments }

Annoying To Me.

by allspaw on May 22, 2009

I can’t tell you how ripped I get when people say things like this:

“cloud computing means getting rid of ops”

If by “ops” you mean “people in data centers racking servers, installing OSes, running cables, replacing broken hardware, etc.” then sure, cloud computing aims to relieve you of those burdens. If you really think ‘ops’ is just that, then you really should put down your Nick Carr book and pay attention to the real world for a change.

The reality is, if your ops team is spending a lot of time doing that, then you’re either:

  1. Too big to use someone *else’s* cloud, because you basically have your own (Yahoo, Amazon, Google, etc.)
  2. Stuck in 1999.

If you deal with any of these things:

  • handling site issues/incidents
  • building and maintaining tools to monitor and gather systems and application-level metrics
  • programming the ability to adapt infrastructure to changing system or application-level conditions (usage, failure, degradation, etc.)
  • implementing and maintaining deployment systems (code, config management, etc.)
  • capacity planning (no, really)

then you’re doing “ops”, by my definition. In some environments, these things are done by “developers”. But my definition says those devs are performing ops functions.

Cloud computing isn’t going to make ‘ops’ go away; it’s relieving ops (and dev) of a bunch of pain-in-the-ass things so they can focus on the real work needed. Namely: your application.

Last I checked, clouds don’t perform the tasks listed above, because those things (done right) are application-specific. And while cloud computing enables (in an excellent way) the efficient resource allocation (or de-allocation) for an application, it doesn’t get rid of the need to do the above things.

{ 8 comments }