Reflections on the 6th Resilience Engineering Symposium

I just spent the last week in Lisbon, Portugal at the Resilience Engineering Symposium. Zoran Perkov and I were invited to speak on the topic of software operations and resilience in the financial trading and Internet services worlds, to an audience of practitioners and researchers from all around the globe, in a myriad of industries.

My hope was to start a dialogue about the connections we’ve seen (and to hopefully explore more) between practices and industries, and to catch theories about resilience up to what’s actually happening in these “pressurized and consequential”1 worlds.

I thought I’d put down some of my notes, highlights and takeaways here.

  • In order to look at how resilience gets “engineered” (if that is actually a thing) we have to look at adaptations that people make in the work that they do, to fill in the gaps that show up as a result of the incompleteness of designs, tools, and prescribed practices. We have to do this with a “low commitment to concepts”2 because otherwise we run the risk of starting with a model (OODA? four cornerstones of resilience? swiss cheese? situation awareness? etc.) and then finding data to fill in those buckets. Which can happen unfortunately quite easily, and also: is not actually science.

 

  • While I had understood this before the symposium, I’m now even clearer on it: resilience is not the same as fault-tolerance or “graceful degradation.” Instead, it’s something more, akin to what Woods calls “graceful extensibility.”

 

  • The other researchers and practitioners in ‘safety-critical’ industries were very interested in what approaches such as continuous deployment/delivery might look like in their fields. They saw these as a set of evolutions away from waterfall that Internet software has made, ones that allow it to be flexible and adaptive in the face of uncertainty about how the high-level system of users, providers, customers, operations, performance, etc. will behave in production. This was their reflection, not my words in their mouths, and I really couldn’t agree more. Validating!

 

  • While financial trading systems and Internet software have some striking similarities, the differences are stark. Zoran and I are both jealous of each other’s worlds in different ways. Also: Zoran can quickly scare the shit out of an audience filled with pension and retirement plans. 🙂

 

  • The lines between words (phases?) such as: design-implementation-operations are blurred in worlds where adaptive cycles take place, largely because feedback loops are the focus (or source?) of the cycles.

 

  • We still have a lot to do in “software operations”3 in that we may be quite good at focusing on and discussing software development and practices, alongside the computer science concepts that influence those things, but we’re not yet good at exploring what we can find out about our field through the lenses of social science and cognitive psychology. I would like to change that, because I think we haven’t gone far enough in being introspective on those fronts. I think we might currently only be flirting with those areas. Dropping a Conway’s Law here and a cognitive bias there is a good start. But we need to consider that we might not actually know what the hell we’re talking about (yet!). However, I’m optimistic on this front, because our community has both curiosity and a seemingly boundless ability to debate esoteric topics with each other. Now if we can only stop doing it in 140 characters at a time… 🙂

 

  • The term “devops” definitely has analogues in other industries. At the very least, the term brought vigorous nodding as I explained it. Woods used the phrase “throw it over the wall” and it resonated quite strongly with many folks from diverse fields. People from aviation, maritime, patient safety…they all could easily give a story that was analogous to “worked fine in dev, ops problem now” in their worlds. Again, validating.

 

  • There is no Resilience Engineering (or Cognitive Systems Engineering or Systems Safety for that matter) without real dialogue about real practice in the world. In other words, there is no such thing as purely academic here. Every “academic” here viewed their “laboratories” as cockpits, operating rooms and ERs, control rooms in mission control and nuclear plants, on the bridges of massive ships. I’m left thinking that for the most part, this community abhors the fluorescent-lighted environments of universities. They run toward potential explosions, not away from them. Frankly, I think our field of software has a much larger population of the stereotype of the “out-of-touch” computer scientist whose ideas in papers never see the light of production traffic. (hat tip to Kyle for doing the work to do real-world research on what were previously known as academic theories!)

 


1 Richard Cook’s words.

2 David Woods’ words. I now know how important this is when connecting theory to practice. More on this topic in a different post!

3 This is what I’m now calling what used to be known as “WebOps” or what some refer to as ‘devops’ to reflect that there is more to software services that are delivered via the Internet than just the web, and I’d like to update my language a bit.

Owning Attention (Considerations for Alert Design)

In the past month or two, I’ve spoken on the topic of alert design. There’s a video of me giving the talk (at Monitorama), but I thought I’d try to post on the topic and material as well.

The topic of alerts, and of “alert design” as a deliberate and purposeful activity, has been on my mind.

In my experience, and from asking many people in engineering and operations (at least in the web and financial trading domains), nothing spikes blood pressure like the topic of alerts. The caricature of the sysadmin waking up to a buzzing pager or phone is what comes to mind.

The costs of not paying attention to how your organization views and treats what comes of this behavior in operational teams (developers and systems folks included) are, I think, both largely invisible and much higher than most people assume. It may be clear that what we’re talking about here is a signal:noise ratio, but it goes way beyond that. The cognitive cost for an engineer to attend to an alert (a fundamentally interrupting event, by design) is akin to the cost of a software developer losing their “flow”; context switching is expensive. Expensive from a financial standpoint, from a productivity perspective, and, I’ll argue, from a career-development one as well.

Here are some (likely melodramatic) assertions:

  • Alert numbness and fatigue are a blight on our industry. Because we can alert on basically anything, and we can argue that anything could be a harbinger of things that could drastically affect our business, we generally put an alert on everything we get our hands on.
  • Knowing something has happened almost always trumps not knowing something happened, with sometimes not much effort put into whether the “something” is important with respect to the context it’s happened in.
  • Computers deciding what is important to alert on is, and will always be, brittle. Meaning: alerts and their criteria originate in the author’s mind, which may or may not be in the same place as the mind of whoever receives the alert in the future. In other words: we all write documentation and procedures that make sense to us when we write them. They never survive much of the future, because the worlds that refer to them change. Example: corporate wiki pages are commonly referred to as the place where “documentation goes to die”. Alerts are no different.

Therefore, I’d love to get a much deeper and broader conversation about alert design in our domain. Because I’ll say that it’s not the technology that sucks, it’s our use of it. Consider the possibility that you don’t have a Nagios problem, you have an alert design problem.

Down and In

As the years go by and we see the continued decline of storage prices and the explosion of accessible processing power, we have an ever-expanding ability to zoom in deeply on the ways servers and services talk to each other and process information.

We can zoom in on the relationships and behaviors of seemingly disparate pieces of data, and we can discover and detect disruptions or anomalies in sometimes surprising places. This is interesting, for sure.

But it is also woefully incomplete if we are to make any progress in technical operations.

Up and Out

It is incomplete because as we zoom out of that high-resolution metrics collection and analysis tooling, what we find is a much-ignored environment which includes one of the most powerful context-sensitive and incredibly adaptive anomaly detection and response agents in the world: humans.

Do we have anomaly detection problems? Certainly. One can argue (I will) that we will always have them, for many reasons. (One of those reasons is the Law Of Stretched Systems, but that is for a different post.)

What I’m interested in is not how software can be used to detect anomalies automatically,
(well, I’m interested, but I don’t doubt that we all will continue to get better at it)

…it is how people navigate this boundary between themselves and the machines they work with. The boundary between humans and machines, as we observe our use of tools, is a focus in and of itself. If we have any hope of making progress in monitoring complex systems, we must take this boundary into account.

As an aside, some more bullet points:

  1. We don’t use a single tool to gain insight into the architectures we build. And we will not, much to the dismay of many monitoring-as-a-service business models. (“A single pane of glass?! Where do I sign?!”)
  2. Teams of people are the norm, which means that communication and coordination become as important (if not more important) than surfacing anomalies themselves.
  3. We bring our biases, expectations, trust, and perceptions to the table when it comes to monitoring and response. No tool or piece of automation will ever change that.
  4. Understanding the breakdowns at these boundaries between people and machines should be a part of how we approach the design of tools. Organizational behavior beats technology at every turn.

Less Code, More Social Science

When we look at Boyd’s OODA loop, we see “observe” and “orient” as critical pieces. Note that these are not Unix commands, they are human activities.

So writing code to tell computers what to look at is quite different from making sure that the code’s human supervisors are equipped or aided in what to look at when an alert goes off. Figuring out how people make sense of what is actually going on at a given point (in diagnosis? in planning? in response to an outage? in control?) is just plain hard.

A step that Don Norman (and other folks known in the world of ergonomics and human factors) have been tugging at for a couple of decades is to first attempt to understand how people consume, adapt to, work around, and make use of tools under “normal” operating conditions. Once that’s done, it’s suggested, then we can try to understand how people make sense of their world under high-tempo or escalating scenarios (during an outage, for example) when the signals they receive can sometimes be disorienting as things escalate.

Questions

  • Who has ever gotten an alert and ignored it? (/me looks at alert, says “oh, it’ll probably recover, no need to look further”)
  • How many alerts were received in the past week that were not actionable? (no human action was required)
  • How many alerts were received in the past week as a result of known work being done (expected) but alerts were not silenced during that period?
  • How many alerts were received as a result of a previously silenced alert (because work was being done) that was mistakenly un-silenced?
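
One low-effort way to start answering questions like these is to tally whatever alert history you already have. Here’s a minimal sketch, assuming a hypothetical CSV export of alert events with timestamp, alert_name, and actionable columns (the filename and columns are stand-ins for whatever your alerting system can dump, not any particular tool’s format):

```python
import csv
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical export of alert events; the columns are assumptions:
# timestamp (ISO 8601), alert_name, actionable ("yes"/"no")
ALERT_LOG = "alert_events.csv"

def summarize_last_week(path=ALERT_LOG):
    cutoff = datetime.now() - timedelta(days=7)
    total = 0
    not_actionable = Counter()

    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if datetime.fromisoformat(row["timestamp"]) < cutoff:
                continue
            total += 1
            if row["actionable"].strip().lower() == "no":
                not_actionable[row["alert_name"]] += 1

    noisy = sum(not_actionable.values())
    print(f"{total} alerts in the past week, {noisy} not actionable")
    for name, count in not_actionable.most_common(10):
        print(f"  {count:4d}  {name}")

if __name__ == "__main__":
    summarize_last_week()
```

Even a rough tally like this makes the signal:noise conversation concrete: the ten noisiest non-actionable alerts are usually where the design work should start.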

Here are some quotes from engineers who have found themselves in interesting situations related to alerts:

“The whole place just lit up. I mean, all the lights came on. So instead of being able to tell you what went wrong, the lights were absolutely no help at all.”
– Comment by one space controller in mission control after the Apollo 12 spacecraft was struck by lightning (Murray and Cox 1990).

 

“I would have liked to have thrown away the alarm panel. It wasn’t giving us any useful information.”
– Comment by one operator at the Three Mile Island nuclear power plant to the official inquiry following the TMI accident (Kemeny 1979).

 

“When the alarm kept going off then we kept shutting it [the device] off [and on] and when the alarm would go off [again], we’d shut it off.”
“… so I just reset it [a device control] to a higher temperature. So I kinda fooled it [the alarm]…”
– Physicians explaining how they respond to a nuisance alarm on a computerized operating room device (Cook, Potter, Woods and McDonald 1991).

 

“A [computer] program alarm could be triggered by trivial problems that could be ignored altogether. Or it could be triggered by problems that called for an immediate abort [of the lunar landing]. How to decide which was which? It wasn’t enough to memorize what the program alarm numbers stood for, because even within a single number the alarm might signify many different things.

 

“We wrote ourselves little rules like: ‘If this alarm happens and it only happens once, don’t worry about it. If it happens repeatedly, but other indicators are okay, don’t worry about it.'” And of course, if some alarms happen even once, or if other alarms happen repeatedly and the other indicators are not okay, then they should get the LEM [lunar module] the hell out of there.
– Response to discovery of a set of computer alarms linked to the astronauts displays shortly before the Apollo 11 mission (Murray and Cox 1990).

 

“1202.” (Astronaut announcing that an alarm buzzer and light had gone off and the code 1202 was indicated on the computer display.)
“What’s a 1202?”
“1202, what’s that?”
“12…1202 alarm.”
– Mission control dialog as the LEM descended to the moon during Apollo 11 (Murray and Cox 1990).

 

“I know exactly what it [an alarm] is–it’s because the patient has been, hasn’t taken enough breaths or–I’m not sure exactly why.”
– Physician explaining one alarm on a computerized operating room device that commonly occurred at a particular stage of surgery (Cook et al. 1991).

These quotes are from the excellent paper The Alarm Problem and Directed Attention in Dynamic Fault Management (Woods, 1995).

David Woods writes at great length on the topic and gives great insight into what alerts and alarms essentially are: directed attention. For those of us operating systems that are beyond our full understanding at any given point and from any given perspective, he shines a light on the core of the alarm problem: there is always context sensitivity to alerts, and in many ways the author/designer of an alert hasn’t imagined (can’t imagine!) how the receiver of the alert will interpret it.

For example: he points to signal detection theory as a framework for thinking about alert/alarm criteria. That is to say, there is always a relationship between true “signal” and “noise” and the trade-offs inherent in choosing the alerting criteria (sometimes, but not always, viewed as a simple threshold) can be thought of like this:

Signal Detection Theory

In other words, there are four possible outcomes, which reflect how sensitive the alerting criteria can be:

 

SDT outcomes

So this is a tough one, and it points out that getting a good (forget about perfect!) signal-to-noise ratio is hard. Too sensitive, and you’ll get too many false alarms. Not sensitive enough, and you’ll miss something.
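
To make the four outcomes concrete, here’s a minimal sketch of the signal-detection framing applied to a simple threshold alert. The metric values, the threshold choices, and the after-the-fact judgment of what was “truly bad” are all toy stand-ins for illustration:

```python
from collections import Counter

def sdt_outcome(alert_fired: bool, truly_bad: bool) -> str:
    """Classify one alerting decision into the four SDT outcomes."""
    if alert_fired and truly_bad:
        return "hit"
    if alert_fired and not truly_bad:
        return "false alarm"
    if not alert_fired and truly_bad:
        return "miss"
    return "correct rejection"

def evaluate_threshold(samples, threshold):
    """Tally outcomes for a simple 'value > threshold' alerting criterion.

    samples: list of (observed_value, truly_bad) pairs, where truly_bad is
    the after-the-fact judgment of whether something was actually wrong.
    """
    return Counter(sdt_outcome(value > threshold, truly_bad)
                   for value, truly_bad in samples)

# Toy data: (observed latency in ms, was something actually wrong?)
samples = [(120, False), (480, True), (310, False), (900, True), (260, False)]

for threshold in (200, 400, 800):
    print(threshold, dict(evaluate_threshold(samples, threshold)))
```

Sliding the threshold around on even this toy data shows the trade-off: lower it and the false alarms pile up, raise it and the misses appear. There’s no setting that buys you hits for free.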

I’ll say that because of this, we generally err on the side of too many false alarms. For fear of missing something (or the embarrassment of it being known that you missed something going wrong with your systems!) we will crank up the sensitivity.

But in doing so, we essentially ignore the detrimental effect of the false alarms on our engineers and organizations. Underlying the false alarms are not just limitations in the alerting algorithms themselves, but the conditions and factors that the alert systems cannot detect or interpret.

An often-given example of this manifests at the Cincinnati Airport. A riverbank leading up to a particular runway there triggers a threshold in ground proximity warning systems (in-cockpit alerts) because the system can’t detect that it’s going to plateau at the runway. Pilots familiar with this particular runway at this particular airport ignore the alerts.

Once more, with feeling: the pilots, who are flying massive cylinders of metal containing many humans, ignore a Ground Proximity Warning alert.

When we talk about how the receiver of an alert will behave, we begin to uncover the context sensitivity of an alert.

How can we take into account how someone might react when they are woken up by an alert we’ve designed? Will they shake their head, wondering what it’s all about? Are we helping them understand what might be going on, or hindering them by including only the bare minimum of data?

What about the engineer who gets an alert in a sea of alerts, while an outage is ongoing? How much attention will they give one amongst a hundred?

Something that might affect our behavior when we get an alert is the amount of trust that we have in the alert: is it telling us something we should believe? Should we drop everything we’re doing in order to pay attention to it? If not, why not?

As an example of this, take the Ground Proximity Warning System I mentioned above. Turns out that in many studies across a number of years, a majority of pilots delay reacting to a GPWS alarm, not just in Cincinnati. Why? Because they take time to validate that the alarm is actually legitimate by looking out the window. This is enough of a problem that the FAA has coined this phenomenon “delayed GPWS response syndrome”.

Trust in automation: it’s a thing that might be worth thinking closely about.

Two Views

“The critical point is that the challenge of fault management lies in sorting through an avalanche of raw data — a data overload problem. This is in contrast to the view that the performance bottleneck is the difficulty of picking up subtle early indications of a fault against the background of a quiescent monitored process.” (Woods, 1995)

The next time you set up an alert in your system, consider how you’re thinking the receiver of that alert will take it. Do you believe that your alert will save the day, providing information for someone to head off catastrophe before it’s too late? Or will it be likely discarded as noise amongst a sea of alerts as someone struggles to understand an outage?

“Information is not a scarce resource, attention is.” – Herb Simon

Herb Simon has mentioned this in many pieces of his writing, as David Woods and Emily Patterson remark in Can We Ever Escape From Data Overload? A Cognitive Systems Diagnosis. Thus far we’ve captured that designing alerts is hard, even if we only invest effort in capturing signal, never mind providing context. Woods talks a bit more about directed attention, and about a paradox:

“Note the paradox at the heart of directed attention. Given that the supervisory agent is loaded by various other task related demands, how does one interpret information about the potential need to switch attentional focus without interrupting or interfering with the tasks or lines of reasoning already under attentional control. We can state this paradox in another way: how can one skillfully ignore a signal that should not shift attention within the current context, without first processing it — in which case it hasn’t been ignored.”

So Where Is “Design”?

“It is the expertise of the human operator that makes it possible to adapt the  performance of the joint system, in real time, to unexpected events and disturbances. Every working day, across the whole spectrum of human enterprise, a large number of near-misses are prevented from turning into accidents only because human operators intervene.

The system should therefore be designed so that human adaptation is enhanced.”

(emphasis mine) – Erik Hollnagel, Expertise and Technology: Cognition &  Human-Computer Cooperation, 1995

What if, instead of thinking about alerts and alert design as tasks befitting the mental model of a subordinate or otherwise dumb messenger delivering news to us, we viewed alerting systems as a partner? What does the world look like if we design alerting systems to cooperate with us?

If trust in alerting systems is such a big deal, as it is with the GPWS and with alert numbness, what can we learn from how humans learn to trust each other, and how can we let that influence our design decisions?

In other words: how can we design alerts that support our efforts to confirm their legitimacy, or that match our expectations of when an alert will fire? Is context sensitivity part of this?
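
As one concrete direction, an alert could carry enough context for its receiver to judge its legitimacy without starting from scratch. Here’s a minimal sketch of what such a payload might hold; the fields, names, and render format are hypothetical, not any particular monitoring tool’s API:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ContextualAlert:
    """An alert payload designed to support confirmation, not just interruption."""
    name: str
    summary: str                           # what was actually detected
    why_it_fired: str                      # the criterion, in plain language
    recent_changes: List[str] = field(default_factory=list)  # deploys, config pushes
    related_graphs: List[str] = field(default_factory=list)  # links the receiver can check
    runbook_url: str = ""
    recent_false_alarm_rate: float = 0.0   # how often this alert cried wolf lately

    def render(self) -> str:
        lines = [
            f"[{self.name}] {self.summary}",
            f"Fired because: {self.why_it_fired}",
            f"False-alarm rate over the last 30 days: {self.recent_false_alarm_rate:.0%}",
        ]
        if self.recent_changes:
            lines.append("Recent changes: " + "; ".join(self.recent_changes))
        if self.related_graphs:
            lines.append("Check: " + ", ".join(self.related_graphs))
        if self.runbook_url:
            lines.append(f"Runbook: {self.runbook_url}")
        return "\n".join(lines)
```

Even the false-alarm-rate line is a small gesture toward the trust question: it tells the receiver how much this particular alert has earned the right to interrupt them.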

This is the type of partnership and thinking that I’m interested in. 🙂

MTTR is more important than MTBF (for most types of F)

This week I gave a talk at QCon SF about development and operations cooperation at Etsy and Flickr.  It’s a refresh of talks I’ve given in the past, with more detail about how it’s going at Etsy. (It’s going excellently 🙂 )

There’s a bunch of topics in the presentation slides, all centered around roles, responsibilities, and intersection points of domain expertise commonly found in development and operations teams. One of the not-groundbreaking ideas that I’m finally getting down is something that should be evident for anyone practicing or interested in ‘continuous deployment’:

Being able to recover quickly from failure is more important than having failures less often.

This has what should be an obvious caveat: some types of failures shouldn’t ever happen, and not all failures/degradations/outages are the same. (like failures resulting in accidental data loss, for example)

Put another way:

MTTR is more important than MTBF

(for most types of F)

(Edited: I did say originally “MTTR > MTBF”)
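
For anyone who wants to put actual numbers behind the slogan, both metrics fall out of the same incident history. A minimal sketch, assuming a hypothetical list of (start, end) outage windows pulled from an incident log:

```python
from datetime import datetime

# Hypothetical outage windows: (start, end) pairs from an incident log.
outages = [
    (datetime(2010, 11, 1, 14, 0), datetime(2010, 11, 1, 14, 12)),
    (datetime(2010, 11, 9, 3, 30), datetime(2010, 11, 9, 3, 38)),
    (datetime(2010, 11, 20, 22, 5), datetime(2010, 11, 20, 22, 50)),
]

def mttr_hours(outages):
    """Mean Time To Repair: average outage duration."""
    durations = [(end - start).total_seconds() for start, end in outages]
    return sum(durations) / len(durations) / 3600

def mtbf_hours(outages):
    """Mean Time Between Failures: average gap between the end of one
    outage and the start of the next."""
    gaps = [(outages[i + 1][0] - outages[i][1]).total_seconds()
            for i in range(len(outages) - 1)]
    return sum(gaps) / len(gaps) / 3600

print(f"MTTR: {mttr_hours(outages):.2f} hours")
print(f"MTBF: {mtbf_hours(outages):.1f} hours")
```

The arithmetic isn’t the point, of course; the point is which of the two numbers you choose to spend your engineering effort on.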

What I’m definitely not saying is that failure should be an acceptable condition. I’m positing that since failure will happen, it’s just as important (or in some cases more important) to spend time and energy on your response to failure than trying to prevent it. I agree with Hammond, when he said:

If you think you can prevent failure, then you aren’t developing your ability to respond.

In a complete steal of Artur Bergman‘s material, an example in the slides of the talk is of the Jeep versus Rolls Royce:

Jeep versus Rolls

Artur has a Jeep, and he’s right when he says that for the most part, Jeeps are built to optimize Mean-Time-To-Repair, not built with the classical approach to automotive engineering, which is to optimize Mean-Time-Between-Failures. This is likely because Jeep owners have been beating the shit out of their vehicles for decades, and every now and again, they expect that abuse to break something. Jeep designers know this, which is why it’s so damn easy to repair. Nuts and bolts are easy to reach, tools are included when you buy the thing, and if you haven’t seen the video of Army personnel disassembling and reassembling a Jeep in under 4 minutes, you’re missing out.

The Rolls Royce, on the other hand, likely doesn’t have such adventurous owners, and when it does break down, it’s a fine and acceptable thing for the car to be out of service for a long and expensive repair by the manufacturer.

We as web operations folks want our architectures to be optimized for MTTR, not for MTBF. I think the reasons should be obvious, and the fact that practices like:

  • Dark launching
  • Percentage-based production A/B rollouts
  • Feature flags

are becoming commonplace should confirm that this approach has legs.
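
For anyone who hasn’t seen these practices up close, the core mechanism is tiny. Here’s a minimal sketch of a percentage-based rollout behind a feature flag; the flag names and config shape are made up for illustration, not Etsy’s or Flickr’s actual implementation:

```python
import hashlib

# Hypothetical flag configuration; in practice this would live somewhere
# that can be changed without a deploy.
FEATURE_FLAGS = {
    "new_search_backend": {"enabled": True, "rollout_percent": 5},
    "dark_launch_activity_feed": {"enabled": True, "rollout_percent": 100},
}

def is_enabled(flag_name: str, user_id: str) -> bool:
    """Return True if this user falls inside the flag's rollout percentage.

    Hashing the (flag, user) pair gives each user a stable bucket from 0-99,
    so ramping 1% -> 5% -> 100% only ever adds users; nobody flaps in and out.
    """
    flag = FEATURE_FLAGS.get(flag_name)
    if not flag or not flag["enabled"]:
        return False
    digest = hashlib.sha1(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < flag["rollout_percent"]

# Usage: guard the risky code path, and ramp by editing rollout_percent.
if is_enabled("new_search_backend", user_id="user_12345"):
    pass  # new code path
else:
    pass  # existing code path
```

The operational payoff is exactly the MTTR argument above: turning the flag off is a recovery action measured in seconds, with no rollback deploy required.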

The slides from QConSF are here:

Go or No-Go: Operability and Contingency Planning (Surge)

Last month I had the honor of speaking at the Surge Conference in Baltimore, put together by OmniTI.

It was a most excellent conference, and the expertise levels were ridiculously high. I count myself lucky to be considered in the same league as the rest of the presenters. I did give a keynote talk, but I haven’t uploaded those slides yet. The talk I gave on the second day of the conference was about how we plan for feature launches at Etsy, which follows a pattern similar to the one we had at Flickr.

So, here are the slides for that talk:

Slides from Web2.0 Expo 2009. (and somethin else interestin’)

That was a pretty good time. Saw lots of good and wicked smaht people, and I got a lot of great questions after my talk. The slides are up on slideshare, and here are the PDF slides.

UPDATE: Gil Raphaelli has posted his python bindings he wrote for our libyahoo2 use in our Ops IM Bot.

There was something that I left out of my slides, mostly because I didn’t want to distract from the main topic, which was optimization and efficiencies.

While I used our image processing capacity at Flickr as an example of how compilers and hardware can have some significant influence on how fast or efficient you can run, I had wondered what the Magical Cloud™ would do with these differences.

So I took the tests I ran on our own machines and ran them on Small, Medium, Large, Extra Large, and Extra Large(High) instances of EC2, to see. The results were a bit surprising to me, but I’m sure not surprising to anyone who uses EC2 with any significant amount of CPU demand.

For the testing, I have a script that does some super simple image resizing with GraphicsMagick. It splits a DSLR photo into 6 different sizes, much in the same way that we do at Flickr for the real world. It does that resizing on about 7 different files, and I timed them all. This is with the most recent version of GraphicsMagick, 1.3.5, with the awesome OpenMP bits in it.
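
The script itself isn’t in the slides, but the shape of the test is roughly this. A hedged reconstruction (the output sizes, filenames, and naming scheme are placeholders, not Flickr’s actual pipeline), shelling out to GraphicsMagick’s gm convert:

```python
import subprocess
import time

# Placeholder output widths, in the spirit of "6 different sizes";
# the real set of sizes is Flickr-specific and not reproduced here.
SIZES = [75, 100, 240, 500, 1024, 1600]
PHOTOS = [f"photo_{i}.jpg" for i in range(7)]  # 7 test files, names made up

def resize_all(photos=PHOTOS, sizes=SIZES):
    """Resize each photo into every size with `gm convert` and time the whole run."""
    start = time.time()
    for src in photos:
        base = src.rsplit(".", 1)[0]
        for width in sizes:
            subprocess.run(
                ["gm", "convert", src, "-resize", f"{width}x{width}",
                 f"{base}_{width}.jpg"],
                check=True,
            )
    return time.time() - start

if __name__ == "__main__":
    print(f"total resize time: {resize_all():.2f}s")
```

Run identically on dedicated hardware and on each EC2 instance type, a timing harness like this is enough to surface the differences shown in the two graphs below.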

Here is the slide of the tests run on different (increasingly faster) dedicated machines:

Faster Image Processing Hardware

and here is the slide that I didn’t include, of the EC2 timings of the same test:

Image Processing on EC2

Now I’m not suggesting that the two graphs should look similar, or that EC2 should be faster. I’m well aware of the shift in perspective when deploying capacity within the cloud versus within your own data center. So I’m not surprised that the fastest test results are on the order of 2x slower on EC2. Application logic and feature design (synchronous versus asynchronous image processing, for example) can absorb these differences, and that could be a welcome trade-off against having to run your own machines.

What I am surprised about is the variation (or lack thereof) across all but the small instances. After I took a closer look at vmstat and top, I realized that the small instances consistently saw about 50-60% of their CPU stolen, the mediums almost always saw zero stolen, and the Larges and Extra Larges saw up to 35% CPU stolen during the jobs.
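
For anyone who wants to watch for the same thing, steal time shows up as the st column in vmstat’s CPU section (on kernel and procps versions recent enough to report it). A small sketch that samples it and reports the average; the sample count is arbitrary:

```python
import subprocess

def average_steal(samples: int = 10) -> float:
    """Run `vmstat 1 <samples>` and average the 'st' (CPU steal) column."""
    out = subprocess.run(
        ["vmstat", "1", str(samples)],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()

    # The second line of output is the column header row; find 'st' in it.
    st_index = out[1].split().index("st")

    # Skip the first data row: it reports averages since boot, not current load.
    values = []
    for line in out[3:]:
        fields = line.split()
        if len(fields) > st_index and fields[st_index].isdigit():
            values.append(int(fields[st_index]))
    return sum(values) / len(values) if values else 0.0

if __name__ == "__main__":
    print(f"average %st over the sample window: {average_steal():.1f}")
```

Watching that number while the jobs run makes the kind of difference described above easy to spot.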

So, interesting.

Speaking at Web2.0 Expo 2009

Looks like I’m gonna talk about even more nerdy things at the Web2.0 Expo in April.

You don’t have to wait for a recession to tighten up your operations. Squeezing more oomph out of your servers (or instances!) is always a good thing, and streamlining how you handle site issues is too. We’ll talk about what we’ve been doing at Flickr to get more out of less, from both our machines and our humans.

Capacity Hacks: diagonal scaling, tuning opportunities, and some other stupid performance tricks.

Ops “runbook” Hacks: Server and process self-healing, application-level measurement, ops communication tools, and some worst-case scenario tricks to have in your back pocket.

2009 Velocity Conference submissions are open!

The CFP for next year’s Velocity Conference is up now, so all you ops and performance ninjas, submit your ideas for talks.

I’m lucky enough to be on the program committee this year, and I think the conference is a huge opportunity to spread the ops love on all kinds of topics. There’s a list on the O’Reilly page to get you thinking about what might make for a good submission:

– How to tie web performance and operations to the bottom line
– Real-world incident management – getting “tight like a pit crew”
– Making websites as fast and reliable as desktop apps
– Networking, DNS, and load balancing
– Profiling’s not just on the backend: JavaScript, CSS, and the network
– Managing web services – flaming disasters you survived and lessons learned
– The intersection between performance and design
– Wicked cool (and actionable) metrics
– Ads, ads, ads – the performance killer?
– Troubleshooting in production
– How to scale and be fast on the social web
– Capacity planning and load testing
– Establishing performance and operations best practices within your organization
– Configuration management best (and worst) tools and practices
– Monitoring and instrumentation: Open Source, as a service, commercially supported solutions
– Using multiple CDNs to improve customer experience and reduce cost

Think for a minute: Do you have a bunch of sweet ops hacks that you’re really proud of? Do you and your dev teams collaborate on making things easy to manage? Do you face unique challenges that others don’t which ops folks can learn from?

If so, don’t be lame: submit a proposal!