From the category archives:

WebOps

I guess I’m late on getting to this, but How Complex Systems Fail by Richard Cook is excellent.

Let me start with this: I don’t think I can overstate how right-on this paper is, with respect to the challenges, solutions, observations, and concerns involved with operating a medium to large web infrastructure. I found this via @benjaminblack, and I agree with him 100%: this should be considered required reading for anyone in our industry. I’m not sure if Cook ever thought that his paper would apply to web infrastructure, but I think it can and does. Please take 30 minutes right now and read it. :)

There are a number of salient points in the paper that I’d like to comment on. Again, this is through the lens of failures of complex systems as it pertains to web operations:

7) Post-accident attribution to a ‘root cause’ is fundamentally wrong.

I’m going to guess that this portion may be viewed as controversial in the prevailing webops wisdom, where post-mortems are for sure necessary, but their content may or may not be effective in preventing similar types of failure. I do value the process of a post-mortem, because I think the human element of understanding complex failures is important, and doing whatever you can to put safety in place is good, modulo what is said in section #16 of the paper. I believe that even a rudimentary process of “5 Whys” has value. But at the same time, I also think there is something in the spirit of this paragraph: there is a danger in standing behind a single underlying cause when there are systemic failures involved. Doing this can lead to the false belief that you’ve got this failure mode covered, you’ve found the silver bullet that made the whole mountain crumble, and jeez what a relief because that will never bite us again.

14) Change introduces new forms of failure.

I totally agree with this point. However, I often see this as a rallying point for operations teams to say “No!” to change, when instead they should be working alongside development (and product owners) with a goal of reducing the risk of failure associated with each change. I do not believe that ‘release early, release often’ in and of itself can reduce that risk. I believe that the real (and only) way to do this is both technical and cultural. But I’ve spoken about this before.

16) Safety is a characteristic of systems and not of their components

Emphasis on “Safety cannot be purchased or manufactured; it is not a feature that is separate from the other components of the system.” Real safety comes from smart people doing smart things to the entire shebang, not the individual guts.

and I think the point I love the most, with all of my heart:

18) Failure free operations require experience with failure.

Fear is a strong emotion. I believe it can be used as a strong motivator for ensuring safety in the face of constant change, instead of a reason to push back on the very idea of change. Embrace fear of outages and degradation. Use it to guide your architecture, your code, your infrastructure. So lean into it.

There are a lot of great points in the paper, and I could go on, but you get the idea.

{ 3 comments }

Meanwhile: More Meta-Metrics

by allspaw on October 5, 2009

Like all sane web organizations, we gather metrics about our infrastructure and applications. As many metrics as we can, as often as we can. These metrics, given the right context, help us figure out all sorts of things about our application, infrastructure, processes, and business. Things such as…

What:

…did we do before (historical trending, etc)
…is going on right now? (troubleshooting, health, etc.)
…is coming down the road (capacity planning, new feature adoption, etc.)
…can we do to make things better (business intelligence, user-behavior, etc.)

All of which, of course, should be considered mandatory in order to help your business increase its awesome. Yay metrics!

Some time ago, Matthias wrote a great blog post about some of the metrics that can reasonably profile the effectiveness of web operations, taken from the ITIL primer, VisibleOps.

In my opinion, there’s nothing on that list of things that isn’t valuable, as long as gathering those metrics isn’t too expensive behaviorally, technically, or organizationally. The topics included in that list of metrics and the context they live in are fodder for many, many blog posts.

But in the category of historical trending, I’m more and more fascinated by gathering what I’ll call “meta-metrics”, which is data about how you respond to the changes your system is experiencing.

One of the best examples of this is gathering information about operational disruptions. Collecting information about how many times your on-call rotation was alerted/paged/woken-up, during what times, and for what service(s) can be enlightening to say the least. We’ve been tracking the volume of alerts much more closely recently, and even with the level of automation we’ve got at Flickr, it’s still something you have to keep on top of, especially if you’re always finding new things to measure and alert on.

Now ideally, you have an alerting system that only communicates conditions that require action by a human. Which means every alert is critically important, and you’re not ignoring or dismissing any pages for reasons that sound like “oh, that’s ok, that cluster always does that…it’ll clear up, I’ll just acknowledge the page so I can shut up Nagios.” In other words, our goal is a zero-noise alerting system: all alerts are actionable, not ignorable, and require a human to troubleshoot or fix. Over time, you push as much of this work as you can to the robots. In the meantime, save humans for the yet-to-be-automated work, or the stuff that isn’t easily captured by robots.

Why is this important to us? I may be stating the obvious, but it’s because interrupting humans with alerts that don’t require action has a mental and physical context switching cost (especially if the guy on-call was sleeping), and it increases the likelihood of missing a truly critical page in a slew of non-critical ones.

Of course, in the reality of evolving and growing web applications, even if we could reach a 100% noise-free alerting system, it would be impossible to sustain for any extended period of time, because your application, usage, and failure modes are constantly changing. So in the meantime, knowing how your alerts affect the team is worthwhile for us. In fact, I think it’s so important that it’s worth collecting these numbers, displaying them next to the rest of your metrics, and exposing them to the entire dev and ops groups.

Something like this: (made-up numbers)

Tracking Critical Alerts

Gathering up info about these alerts should give us a better perspective on where we can improve. So, things like:

  • How many critical alerts are sent on a daily/hourly/weekly basis?
  • What does a time histogram of the alerts look like? Do you get more or fewer alerts during nighttime or non-peak hours?
  • How much (if any) correlation is there between critical alerts and:
      - code deploys?
      - software upgrades?
      - feature launches?
      - open API abuse?
  • What does a breakdown of the alerts look like, in terms of: host type, service type, and frequency of each in a given time period?
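
As a rough illustration of the questions above, here is a minimal Python sketch. Everything in it is an assumption for the sake of the example: the record layout (timestamp, host type, service), the made-up sample data, the function names, and the 30-minute window used as a crude stand-in for "correlated with a deploy". It is not the tooling we run at Flickr.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical alert records: (timestamp, host_type, service). Made-up data.
ALERTS = [
    (datetime(2009, 10, 1, 3, 12), "db", "mysql_replication"),
    (datetime(2009, 10, 1, 3, 40), "db", "mysql_replication"),
    (datetime(2009, 10, 1, 14, 5), "web", "http"),
    (datetime(2009, 10, 2, 14, 20), "web", "http"),
    (datetime(2009, 10, 2, 22, 1), "cache", "memcached"),
]

# Hypothetical code deploy times, also made up.
DEPLOYS = [datetime(2009, 10, 1, 14, 0), datetime(2009, 10, 2, 9, 30)]


def alerts_per_hour(alerts):
    """Time histogram: number of alerts keyed by hour of day (0-23)."""
    return Counter(ts.hour for ts, _, _ in alerts)


def breakdown(alerts):
    """Number of alerts per (host_type, service) pair."""
    return Counter((host_type, service) for _, host_type, service in alerts)


def alerts_near_deploys(alerts, deploys, window=timedelta(minutes=30)):
    """Alerts that fired within `window` after any deploy.

    This is only a proximity count, not a statistical correlation.
    """
    return [
        alert for alert in alerts
        if any(timedelta(0) <= alert[0] - deploy <= window for deploy in deploys)
    ]


if __name__ == "__main__":
    print(sorted(alerts_per_hour(ALERTS).items()))
    print(breakdown(ALERTS).most_common())
    print(len(alerts_near_deploys(ALERTS, DEPLOYS)), "alert(s) shortly after a deploy")
```

The same proximity count could be pointed at software upgrades or feature launches by swapping in a different list of timestamps.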

and maybe the most important ones:

  • How many of those alerts aren’t actually critical or don’t demand human attention?
  • How many of them always self-recover?
  • How many (and which) don’t matter in their role context (like, a single node in a load-balanced cluster) and could be turned into an aggregate check?
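
A minimal sketch of how the last two questions might be approached, in Python. The event stream format, the "ACK" event standing in for "a human did something", the host names, and the 25% threshold for the aggregate check are all hypothetical choices made for illustration.

```python
from collections import Counter

# Hypothetical event stream, oldest first: (host, service, event).
# "CRITICAL" and "OK" are state changes; "ACK" means a human responded.
EVENTS = [
    ("www12", "http", "CRITICAL"),
    ("www12", "http", "OK"),        # recovered with nobody touching it
    ("db3", "mysql", "CRITICAL"),
    ("db3", "mysql", "ACK"),
    ("db3", "mysql", "OK"),         # recovered after a human stepped in
    ("www12", "http", "CRITICAL"),
    ("www12", "http", "OK"),
]


def self_recovery_counts(events):
    """Return (total, self_recovered) Counters keyed by (host, service)."""
    total, self_recovered = Counter(), Counter()
    open_alerts = {}  # (host, service) -> True once a human has ACKed
    for host, service, event in events:
        key = (host, service)
        if event == "CRITICAL" and key not in open_alerts:
            open_alerts[key] = False
            total[key] += 1
        elif event == "ACK" and key in open_alerts:
            open_alerts[key] = True
        elif event == "OK" and key in open_alerts:
            if not open_alerts.pop(key):
                self_recovered[key] += 1
    return total, self_recovered


def cluster_is_critical(node_states, max_down_fraction=0.25):
    """Aggregate check: page only if too much of the cluster is down."""
    if not node_states:
        return False
    down = sum(1 for state in node_states.values() if state == "CRITICAL")
    return down / len(node_states) > max_down_fraction
```

A check whose alerts always show up in `self_recovered` is a candidate for retuning, and a per-node check on a load-balanced cluster is a candidate for something like `cluster_is_critical`, which stays quiet when a single node out of many drops out.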

We’ve built our own stuff to track and analyze these things. My question to the community is: I’m not aware of any open-source tool that is dedicated to analyzing these metrics. Does one exist? Nagios obviously has host/hostgroup/cluster warning and critical histories, and those can be crunched to find critical alert statistics, but I’m not aware of anything that does comprehensive crunching. Of course, until I find one, we’re just building our own.
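
To give an idea of what crunching those histories could look like, here is a small Python sketch. It assumes log lines shaped like `[epoch] SERVICE ALERT: host;service;state;state_type;attempt;output`, which is my understanding of the stock Nagios log format; check it against your own logs. The sample lines, host names, and function names are made up, and it only counts HARD CRITICAL service alerts.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Assumed line shape:
# [epoch] SERVICE ALERT: host;service;state;state_type;attempt;output
SERVICE_ALERT = re.compile(
    r"^\[(?P<epoch>\d+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>[^;]+);(?P<attempt>\d+);(?P<output>.*)$"
)

# Made-up sample lines for illustration.
SAMPLE_LOG = """\
[1254700800] SERVICE ALERT: www12;HTTP;CRITICAL;SOFT;1;Connection refused
[1254700860] SERVICE ALERT: www12;HTTP;CRITICAL;HARD;3;Connection refused
[1254701100] SERVICE ALERT: www12;HTTP;OK;HARD;3;HTTP OK
[1254704400] SERVICE ALERT: db3;MySQL;CRITICAL;HARD;3;Too many connections
[1254704500] HOST ALERT: db3;DOWN;HARD;1;PING CRITICAL
"""


def critical_alerts(lines):
    """Yield (utc_datetime, host, service) for each HARD CRITICAL service alert."""
    for line in lines:
        match = SERVICE_ALERT.match(line.strip())
        if match and match["state"] == "CRITICAL" and match["state_type"] == "HARD":
            ts = datetime.fromtimestamp(int(match["epoch"]), tz=timezone.utc)
            yield ts, match["host"], match["service"]


def count_by_service(lines):
    """Number of HARD CRITICAL alerts per (host, service)."""
    return Counter((host, service) for _, host, service in critical_alerts(lines))


if __name__ == "__main__":
    for key, count in count_by_service(SAMPLE_LOG.splitlines()).most_common():
        print(key, count)
```

SOFT states are skipped here on the assumption that only HARD states actually page someone; adjust that to match how your notifications are configured.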

Thoughts, lazyweb?

{ 6 comments }

Slides for Velocity Talk 2009

June 23, 2009

UPDATE: blip.tv has the video of the talk as well, below. Jeez I have some major bed-head.
That was a blast! I had never done a ‘duet’ talk before. Here are the slides:
10+ Deploys Per Day: Dev and Ops Cooperation at Flickr
…and the video of it is here:

Read the full article →

Slides from Web2.0 Expo 2009. (and somethin else interestin’)

April 3, 2009

That was a pretty good time. Saw lots of good and wicked smaht people, and I got a lot of great questions after my talk. The slides are up on slideshare, and here are the PDF slides.
Operational Efficiency Hacks Web20 Expo2009

UPDATE: Gil Raphaelli has posted his python bindings he [...]

Read the full article →

Some Things We Did Today

March 5, 2009

Moving one of our eight photoserving farms from hardware Layer7 URL hash balancing (expensive, has limits) to L4 DSR balancing with CARP (cheap and simple) and figuring out how to juggle 18,000 requests/second while we do it.
Built yet some more automated query analysis reporting (with some yummy MySQLProxy)
Added yet another aggregated graph of database queries, [...]

Read the full article →

2009 Velocity Conference submissions are open!

November 20, 2008

The CFP for next year’s Velocity Conference is up now, so all you ops and performance ninjas submit your ideas for talks.
I’m lucky enough to be on the program committee this year, and I think the conference is a huge opportunity to spread the ops love on all kinds of topics. There’s a list on [...]

Read the full article →

Code Swarm for Config Management

October 21, 2008

Gil Raphaelli, one of the guys on our Flickr Ops team, put together a Code Swarm animation for the configuration/deployment management tool we use at Flickr to manage our infrastructure. Myles Grant did this for our bug reporting system as well. Check it out:

Our automated config management system is called Gemstone, but conceptually you can [...]

Read the full article →

More back-of-envelope-math…

September 18, 2008

Via kottke: some good examples of doing rough math in your head, causing you to guess about assumptions all along the way.
IMHO, being able to do this is one of the things that makes a good web ops person. The examples might be “useless”, but the process is invaluable.

Read the full article →

Internet-Scale Efficiency

September 16, 2008

James Hamilton’s excellent LADIS 2008 presentation has lots of great stuff in it about internet scale bits. Cool stats.

Read the full article →

Slides from Velocity

June 25, 2008

Here are the slides from my talk at the Velocity Conference.

Read the full article →