Reflections on the 6th Resilience Engineering Symposium

I just spent the last week in Lisbon, Portugal at the Resilience Engineering Symposium. Zoran Perkov and I were invited to speak on the topic of software operations and resilience in the financial trading and Internet services worlds, to an audience of practitioners and researchers from all around the globe, in a myriad of industries.

My hope was to start a dialogue about the connections we’ve seen (and to hopefully explore more) between practices and industries, and to catch theories about resilience up to what’s actually happening in these “pressurized and consequential”1 worlds.

I thought I’d put down some of my notes, highlights and takeaways here.

  • In order to look at how resilience gets “engineered” (if that is actually a thing) we have to look at adaptations that people make in the work that they do, to fill in the gaps that show up as a result of the incompleteness of designs, tools, and prescribed practices. We have to do this with a “low commitment to concepts”2 because otherwise we run the risk of starting with a model (OODA? four cornerstones of resilience? swiss cheese? situation awareness? etc.) and then finding data to fill in those buckets. Which can happen unfortunately quite easily, and also: is not actually science.

 

  • While I had understood this before the symposium, I’m now even clearer on it: resilience is not the same as fault-tolerance or “graceful degradation.” Instead, it’s something more, akin to what Woods calls “graceful extensibility.”

 

  • The other researchers and practitioners in ‘safety-critical’ industries were very interested in what approaches such as continuous deployment/delivery might look like in their fields. They saw it as a set of evolutions from waterfall that Internet software has made, evolutions that allow it to be flexible and adaptive in the face of uncertainty about how the high-level system of users, providers, customers, operations, performance, etc. will behave in production. This was their reflection, not my words in their mouths, and I really couldn’t agree more. Validating!

 

  • While financial trading systems and Internet software have some striking similarities, the differences are stark. Zoran and I are both jealous of each other’s worlds in different ways. Also: Zoran can quickly scare the shit out of an audience filled with pension and retirement plans. 🙂

 

  • The lines between words (phases?) such as: design-implementation-operations are blurred in worlds where adaptive cycles take place, largely because feedback loops are the focus (or source?) of the cycles.

 

  • We still have a lot to do in “software operations”3 in that we may be quite good at focusing on and discussing software development and its practices, alongside the computer science concepts that influence those things, but we’re not yet good at exploring what we can find out about our field through the lenses of social science and cognitive psychology. I would like to change that, because I think we haven’t gone far enough in being introspective on those fronts. I think we might currently only be flirting with those areas. Dropping a Conway’s Law here and a cognitive bias there is a good start, but we need to consider that we might not actually know what the hell we’re talking about (yet!). However, I’m optimistic on this front, because our community has both curiosity and a seemingly boundless ability to debate esoteric topics with each other. Now if we can only stop doing it in 140 characters at a time… 🙂

 

  • The term “devops” definitely has analogues in other industries. At the very least, the term brought vigorous nodding as I explained it. Woods used the phrase “throw it over the wall” and it resonated quite strongly with many folks from diverse fields. People from aviation, maritime, patient safety…they all could easily give a story that was analogous to “worked fine in dev, ops problem now” in their worlds. Again, validating.

 

  • There is no Resilience Engineering (or Cognitive Systems Engineering or Systems Safety, for that matter) without real dialogue about real practice in the world. In other words, there is no such thing as purely academic here. Every “academic” here viewed their “laboratories” as cockpits, operating rooms and ERs, control rooms in mission control and nuclear plants, and the bridges of massive ships. I’m left thinking that, for the most part, this community abhors the fluorescent-lit environments of universities. They run toward potential explosions, not away from them. Frankly, I think our field of software has a much larger population of the stereotypical “out-of-touch” computer scientist whose ideas in papers never see the light of production traffic. (Hat tip to Kyle for doing the work of real-world research on what were previously known as academic theories!)

 


 

1 Richard Cook’s words.

2 David Woods’ words. I now know how important this is when connecting theory to practice. More on this topic in a different post!

3 This is what I’m now calling what used to be known as “WebOps” or what some refer to as ‘devops’ to reflect that there is more to software services that are delivered via the Internet than just the web, and I’d like to update my language a bit.

The Infinite Hows (or, the Dangers Of The Five Whys)

(this is also posted on O’Reilly’s Radar blog. Much thanks to Daniel Schauenberg, Morgan Evans, and Steven Shorrock for feedback on this)

Before I begin this post, let me say that this is intended to be a critique of the Five Whys method, not a criticism of the people who are in favor of using it.

This critique I present is hardly original; most of this post is inspired by Todd Conklin, Sidney Dekker, and Nancy Leveson.

The concept of post-hoc explanation (or “postmortems” as they’re commonly known) has, at this point, taken hold in the web engineering and operations domain. I’d love to think that the concepts that we’ve taken from the New View on ‘human error’ are becoming more widely known and that people are looking to explore their own narratives through those lenses.

I think that this is good, because my intent has always been (might always be) to help translate concepts from one domain to another. In order to do this effectively, we need to know also what to discard (or at least inspect critically) from those other domains.

The Five Whys is one such approach that I think we should discard.

This post explains my reasoning for discarding it, and how using it has the potential to be harmful, not helpful, to an organization. Here’s how I intend on doing this: I’m first going to talk about what I think are deficiencies in the approach, suggest an alternative, and then ask you to simply try the alternative yourself.

Here is the “bottom line, up front” gist of my assertions:

“Why?” is the wrong question.

In order to learn (which should be the goal of any retrospective or post-hoc investigation) you want multiple and diverse perspectives. You get these by asking people for their own narratives. Effectively, you’re asking “how?”

Asking “why?” too easily gets you to an answer to the question “who?” (which in almost every case is irrelevant) or “takes you to the ‘mysterious’ incentives and motivations people bring into the workplace.”

Asking “how?” gets you to describe (at least some) of the conditions that allowed an event to take place, and provides rich operational data.

Asking a chain of “why?” assumes too much about the questioner’s choices, and assumes too much about each answer you get. At best, it locks you into a causal chain, which is not how the world actually works. This is a construction that ignores a huge amount of complexity in an event, and it’s the complexity that we want to explore if we have any hope of learning anything.

But It’s A Great Way To Get People Started!

The most compelling argument for using the Five Whys is that it’s a good first step towards doing real “root cause analysis” – my response to that is twofold:

  1. “Root Cause Analysis*” isn’t what you should be doing anyway, and
  2. It’s only a good “first step” because it’s easy to explain and understand, which makes it easy to socialize. The issue with this is that the concepts that the Five Whys depend on are not only faulty, but can be dangerous for an organization to embrace.

If the goal is learning (and it should be), then we should be confident that our method of retrospective learning brings to light data that can be turned into actionable information. The issue with the Five Whys is that it’s tunnel-visioned into a linear and simplistic explanation of how work gets done and events transpire. This narrowing can be incredibly problematic.

In the best case, it can lead an organization to think they’re improving on something (or preventing future occurrences of events) when they’re not.

In the worst case, it can re-affirm a faulty worldview of causal simplification and set up a structure where individuals don’t feel safe in giving their narratives, because either they weren’t asked the right “why?” question or because the answer to a given question pointed to ‘human error’ or individual attributes as causal.

Let’s take an example. In my tutorials at the Velocity Conference in New York, I used an often-repeated straw man to illustrate this:

[Slide: a straw-man Five Whys causal chain, ending at an individual’s attributes]

This is the example of the Five Whys found in the Web Operations book, as well.

This causal chain effectively ends with a person’s individual attributes, not with a description of the multiple conditions that allow an event like this to happen. Let’s look into some of the answers…

“Why did the server fail? Because an obscure subsystem was used in the wrong way.”

This answer is dependent on the outcome. We know that it was used in the “wrong” way only because we’ve connected it to the resulting failure. In other words, we as “investigators” have the benefit of hindsight. We can easily judge the usage of the server because we know the outcome. If we were to go back in time and ask the engineer(s) who were using it: “Do you think that you’re doing this right?” they would answer: yes, they are. We want to know what various influences brought them to think that, which simply won’t fit into the answer of “why?”

The answer also limits the next question that we’d ask. There isn’t any room in the dialogue to discuss things such as the potential to use a server in the wrong way and it not result in failure, or what ‘wrong’ means in this context. Can the server only be used in two ways – the ‘right’ way or the ‘wrong’ way? And does success (or, the absence of a failure) dictate which of those ways it was used? We don’t get to these crucial questions.

“Why was it used in the wrong way? The engineer who used it didn’t know how to use it properly.”

This answer is effectively a tautology, and includes a post-hoc judgement. It doesn’t tell us anything about how the engineer did use the system, which would provide a rich source of operational data, especially for engineers who might be expected to work with the system in the future. Is it really just about this one engineer? Or is it possibly about the environment (tools, dashboards, controls, tests, etc.) that the engineer is working in? If it’s the latter, how does that get captured in the Five Whys?

So what do we find in this chain we have constructed above? We find:

  • an engineer with faulty (or at least incomplete) knowledge
  • insufficient indoctrination of engineers
  • a manager who fouls things up by not being thorough enough in the training of new engineers (indeed: we can make a post-hoc judgement about her beliefs)

If this is to be taken as an example of the Five Whys, then as an engineer or engineering manager, I might not look forward to it, since it focuses on our individual attributes and doesn’t tell us much about the event other than the platitude that training (and convincing people about training) is important.

These are largely answers about “who?” not descriptions of what conditions existed. In other words, by asking “why?” in this way, we’re using failures to explain failures, which isn’t helpful.

If we ask: “Why did a particular server fail?” we can get any number of answers, but one of those answers will be used as the primary way of getting at the next “why?” step. We’ll also lose out on a huge amount of important detail, because remember: you only get one question before the next step.

If instead, we were to ask the engineers how they went about implementing some new code (or ‘subsystem’), we might hear a number of things, like maybe:

  • the approach(es) they took when writing the code
  • what ways they gained confidence (tests, code reviews, etc.) that the code was going to work in the way they expected it before it was deployed
  • what (if any) history of success or failure have they had with similar pieces of code?
  • what trade-offs they made or managed in the design of the new function?
  • how they judged the scope of the project
  • how much (and in what ways) they experienced time pressure for the project
  • the list can go on, if you’re willing to ask more and they’re willing to give more

Rather than judging people for not doing what they should have done, the new view presents tools for explaining why people did what they did. Human error becomes a starting point, not a conclusion. (Dekker, 2009)

When we ask “how?”, we’re asking for a narrative. A story.

In these stories, we get to understand how people work. By going with the “engineer was deficient, needs training, manager needs to be told to train” approach, we might not have a place to ask questions aimed at recommendations for the future, such as:

  • What might we put in place so that it’s very difficult to put that code into production accidentally?
  • What sources of confidence for engineers could we augment?

As part of those stories, we’re looking to understand people’s local rationality. When it comes to decisions and actions, we want to know how it made sense for someone to do what they did. And make no mistake: they thought what they were doing made sense. Otherwise, they wouldn’t have done it.


Again, this thought isn’t original to me. Local rationality (or, as Herb Simon called it, “bounded rationality”) is something that sits firmly atop some decades of cognitive science.

These stories we’re looking for contain details that we can pull on and ask more about, which is critical as a facilitator of a post-mortem debriefing, because people don’t always know what details are important. As you’ll see later in this post, reality doesn’t work like a DVR; you can’t pause, rewind and fast-forward at will along a singular and objective axis, picking up all of the pieces along the way, acting like CSI. Memories are faulty and perspectives are limited, so a different approach is necessary.

Not just “how”

In order to get at these narratives, you need to dig for second stories. Asking “why?” will only get you first stories. These are not only insufficient answers, they can be very damaging to an organization, depending on the context. As a refresher…

From Behind Human Error here’s the difference between “first” and “second” stories of human error:

First Stories | Second Stories
Human error is seen as cause of failure | Human error is seen as the effect of systemic vulnerabilities deeper inside the organization
Saying what people should have done is a satisfying way to describe failure | Saying what people should have done doesn’t explain why it made sense for them to do what they did
Telling people to be more careful will make the problem go away | Only by constantly seeking out its vulnerabilities can organizations enhance safety

 

Now, read again the straw-man example of the Five Whys above. The questions that we ask frame the answers that we will get in the form of first stories. When we ask more and better questions (such as “how?”) we have a chance at getting at second stories.

You might wonder: how did I get from the Five Whys to the topic of ‘human error’? Because once ‘human error’ is a candidate to reach for as a cause (and it will be, because it’s a simple and potentially satisfying answer to “why?”), then you will undoubtedly use it.

At the beginning of my tutorial in New York, I asked the audience this question:

[Slide: the question posed to the audience]

At the beginning of the talk, a large number of people said yes, this is correct. Steven Shorrock (who is speaking at Velocity next week in Barcelona on this exact topic) has written a great article on this way of thinking: If It Weren’t For The People. By the end of my talk, I was able to convince them that this is also the wrong focus of a post-mortem description.

This idea accompanies the Five Whys more often than not, and there are two things that I’d like to shine some light on about it:

Myth of the “human or technical failure” dichotomy

This is dualistic thinking, and I don’t have much to add to this other than what Dekker has said about it (Dekker, 2006):

“Was the accident caused by mechanical failure or by human error? It is a stock question in the immediate aftermath of a mishap. Indeed, it seems such a simple, innocent question. To many it is a normal question to ask: If you have had an accident, it makes sense to find out what broke. The question, however, embodies a particular understanding of how accidents occur, and it risks confining our causal analysis to that understanding. It lodges us into a fixed interpretative repertoire. Escaping from this repertoire may be difficult. It sets out the questions we ask, provides the leads we pursue and the clues we examine, and determines the conclusions we will eventually draw.”

Myth: during a retrospective investigation, something is waiting to be “found”

I’ll cut to the chase: there is nothing waiting to be found, or “revealed.” These “causes” that we think we’re “finding”? We’re constructing them, not finding them. We’re constructing them because we are the ones choosing where (and when) to start asking questions, and where/when to stop asking them. We’ve “found” a root cause when we stop looking. And in many cases, we’ll get lazy and just chalk it up to “human error.”

As Erik Hollnagel has said (Hollnagel, 2009, p. 85):

“In accident investigation, as in most other human endeavours, we fall prey to the What-You-Look-For-Is-What-You-Find or WYLFIWYF principle. This is a simple recognition of the fact that assumptions about what we are going to see (What-You-Look-For), to a large extent will determine what we actually find (What-You-Find).”

More to the point: “What-You-Look-For-Is-What-You-Fix”

We think there is something like the cause of a mishap (sometimes we call it the root cause, or primary cause), and if we look in the rubble hard enough, we will find it there. The reality is that there is no such thing as the cause, or primary cause or root cause. Cause is something we construct, not find. And how we construct causes depends on the accident model that we believe in. (Dekker, 2006)

Nancy Leveson comments on this idea in her excellent book Engineering a Safer World (p. 20):

Subjectivity in Selecting Events

The selection of events to include in an event chain is dependent on the stopping rule used to determine how far back the sequence of explanatory events goes. Although the first event in the chain is often labeled the ‘initiating event’ or ‘root cause’ the selection of an initiating event is arbitrary and previous events could always be added.

Sometimes the initiating event is selected (the backward chaining stops) because it represents a type of event that is familiar and thus acceptable as an explanation for the accident or it is a deviation from a standard [166]. In other cases, the initiating event or root cause is chosen because it is the first event in the backward chain for which it is felt that something can be done for correction.

The backward chaining may also stop because the causal path disappears due to lack of information. Rasmussen suggests that a practical explanation for why actions by operators actively involved in the dynamic flow of events are so often identified as the cause of an accident is the difficulty in continuing the backtracking “through” a human [166].

A final reason why a “root cause” may be selected is that it is politically acceptable as the identified cause. Other events or explanations may be excluded or not examined in depth because they raise issues that are embarrassing to the organization or its contractors or are politically unacceptable.

Learning is the goal. Any prevention depends on that learning.

So if not the Five Whys, then what should you do? What method should you take?

I’d like to suggest an alternative, which is to first accept the idea that you have to actively seek out and protect the stories from bias (and judgement) when you ask people “how?”-style questions. Then you can:

  • Ask people for their story without any replay of data that would supposedly ‘refresh’ their memory
  • Tell their story back to them and confirm you got their narrative correct
  • Identify critical junctures
  • Progressively probe and re-build how the world looked to people inside of the situation at each juncture.

As a starting point for those probing questions, we can look to Gary Klein and Sidney Dekker for the types of questions you can ask instead of “why?”…

Debriefing Facilitation Prompts

(from The Field Guide To Understanding Human Error, by Sidney Dekker)

At each juncture in the sequence of events (if that is how you want to structure this part of the accident story), you want to get to know:

  • Which cues were observed (what did he or she notice/see or did not notice what he or she had expected to notice?)
  • What knowledge was used to deal with the situation? Did participants have any experience with similar situations that was useful in dealing with this one?
  • What expectations did participants have about how things were going to develop, and what options did they think they have to influence the course of events?
  • How did other influences (operational or organizational) help determine how they interpreted the situation and how they would act?

Here are some questions Gary Klein and his researchers typically ask to find out how the situation looked to people on the inside at each of the critical junctures:

Cues
  • What were you seeing?
  • What were you focused on?
  • What were you expecting to happen?

Interpretation
  • If you had to describe the situation to your colleague at that point, what would you have told them?

Errors
  • What mistakes (for example in interpretation) were likely at this point?

Previous knowledge/experience
  • Were you reminded of any previous experience?
  • Did this situation fit a standard scenario?
  • Were you trained to deal with this situation?
  • Were there any rules that applied clearly here?
  • Did any other sources of knowledge suggest what to do?

Goals
  • What were you trying to achieve?
  • Were there multiple goals at the same time?
  • Was there time pressure or other limitations on what you could do?

Taking Action
  • How did you judge you could influence the course of events?
  • Did you discuss or mentally imagine a number of options, or did you know straight away what to do?

Outcome
  • Did the outcome fit your expectation?
  • Did you have to update your assessment of the situation?

Communications
  • What communication medium(s) did you prefer to use? (phone, chat, email, video conference, etc.)
  • Did you make use of more than one communication channel at once?

Help
  • Did you ask anyone for help?
  • What signal brought you to ask for support or assistance?
  • Were you able to contact the people you needed to contact?

For the tutorials I did at Velocity, I made a one-pager of these: http://bit.ly/DebriefingPrompts


Try It

I have tried to outline some of my reasoning on why using the Five Whys approach is suboptimal, and I’ve given an alternative. I’ll do one better and link you to the tutorials that I gave in New York in October, which I think dig deeper into these concepts. They’re in four parts, 45 minutes each.

Part I – Introduction and the scientific basis for post-hoc retrospective pitfalls and learning

Part II – The language of debriefings, causality, case studies, teams coping with complexity

Part III – Dynamic fault management, debriefing prompts, gathering and contextualizing data, constructing causes

Part IV – Taylorism, normal work, ‘root cause’ of software bugs in cars, Q&A

My request is that the next time you would do a Five Whys, you instead ask “how?” or the variations of the questions I posted above. If you think you get more operational data from a Five Whys and are happy with it, rock on.

If you’re more interested in this alternative and the fundamentals behind it, then there are a number of sources you can look to. You could do a lot worse than starting with Sidney Dekker’s Field Guide To Understanding Human Error.

An Explanation

For those readers who think I’m being unnecessarily harsh on the Five Whys approach, I think it’s worthwhile to explain why I feel so strongly about this.

Retrospective understanding of accidents and events is important because how we make sense of the past greatly and almost invisibly influences our future. At some point in the not-so-distant past, the domain of web engineering was about selling books online and making a directory of the web. These organizations and the individuals who built them quickly gave way to organizations that now build cars, spacecraft, trains, aircraft, medical monitoring devices…the list goes on…simply because software development and distributed systems architectures are at the core of modern life.

The software worlds and the non-software worlds have collided and will continue to do so. More and more “life-critical” equipment and products rely on software and even the Internet.

Those domains have had varied success in retrospective understanding of surprising events, to say the least. Investigative approaches that are firmly based on causal oversimplification and the “Bad Apple Theory” of deficient individual attributes (like the Five Whys) have been shown not only to be unhelpful, but to make learning objectively harder, not easier. As a result, people who have made mistakes or been involved in accidents have been fired, banned from their profession, and thrown in jail for some of the very things that you could find in a Five Whys.

I sometimes feel nervous that these oversimplifications will still be around when my daughter and son are older. If they were to make a mistake, would they be blamed as a cause? I strongly believe that we can leave these old ways behind us and do much better.

My goal is not to vilify an approach, but to state explicitly that if the world is to become safer, then we have to eschew this simplicity; it will only get better if we embrace the complexity, not ignore it.

 

Epilogue: The Longer Version For Those Who Have The Stomach For Complexity Theory

The Five Whys approach follows a Newtonian-Cartesian worldview. This is a worldview that is seductively satisfying and compellingly simple. But it’s also false in the world we live in.

What do I mean by this?

There are five ways in which the Five Whys sits firmly in a Newtonian-Cartesian worldview that we should eschew when it comes to learning from past events. This is a Cliff Notes version of “The complexity of failure: Implications of complexity theory for safety investigations”:

First, it is reductionist. The narrative built by the Five Whys sits on the idea that if you can construct a causal chain, then you’ll have something to work with. In other words: to understand the system, you pull it apart into its constituent parts. Know how the parts interact, and you know the system.

Second, it assumes what Dekker has called “cause-effect symmetry” (Dekker, complexity of failure):

“In the Newtonian vision of the world, everything that happens has a definitive, identifiable cause and a definitive effect. There is symmetry between cause and effect (they are equal but opposite). The determination of the ‘‘cause’’ or ‘‘causes’’ is of course seen as the most important function of accident investigation, but assumes that physical effects can be traced back to physical causes (or a chain of causes-effects) (Leveson, 2002). The assumption that effects cannot occur without specific causes influences legal reasoning in the wake of accidents too. For example, to raise a question of negligence in an accident, harm must be caused by the negligent action (GAIN, 2004). Assumptions about cause-effect symmetry can be seen in what is known as the outcome bias (Fischhoff, 1975). The worse the consequences, the more any preceding acts are seen as blameworthy (Hugh and Dekker, 2009).”

John Carroll (Carroll, 1995) called this “root cause seduction”:

The identification of a root cause means that the analysis has found the source of the event and so everyone can focus on fixing the problem. This satisfies people’s need to avoid ambiguous situations in which one lacks essential information to make a decision (Frisch & Baron, 1988) or experiences a salient knowledge gap (Loewenstein, 1993). The seductiveness of singular root causes may also feed into, and be supported by, the general tendency to be overconfident about how much we know (Fischhoff, Slovic, & Lichtenstein, 1977).

That last bit about a tendency to be overconfident about how much we know (in this context, how much we know about the past) is a strong piece of research put forth by Baruch Fischhoff, who originally researched what we now understand to be the Hindsight Bias. Not surprisingly, Fischhoff’s doctoral thesis advisor was Daniel Kahneman (you’ve likely heard of him as the author of Thinking, Fast and Slow), whose research in cognitive biases and heuristics everyone should be at least vaguely familiar with.

The third issue with this worldview (supported by the idea of the Five Whys, and something that follows logically from the earlier points) is that outcomes are treated as foreseeable if you know the initial conditions and the rules that govern the system. The reason you would even construct a serial causal chain like this is that you believe the system’s behavior is predictable in exactly that way.

The fourth issue is the treatment of time as reversible: we can’t look to a causal chain as something that you can fast-forward and rewind, no matter how attractively simple that seems. This is because the socio-technical systems that we work on and work in are complex and dynamic in nature. Deterministic behavior (or, at least, predictability) is something that we look for in software; in complex systems this is a foolhardy search, because emergence is a property of this complexity.

And finally, there is an underlying assumption that complete knowledge is attainable. In other words: we only have to try hard enough to understand exactly what happened. The issue with this is that success and failure have many contributing causes, and there is no comprehensive and objective account. The best that you can do is to probe people’s perspectives at juncture points in the investigation. It is not possible to understand past events in any way that can be considered comprehensive.

Dekker (Dekker, 2011):

As soon as an outcome has happened, whatever past events can be said to have led up to it, undergo a whole range of transformations (Fischhoff and Beyth, 1975; Hugh and Dekker, 2009). Take the idea that it is a sequence of events that precedes an accident. Who makes the selection of the ‘‘events’’ and on the basis of what? The very act of separating important or contributory events from unimportant ones is an act of construction, of the creation of a story, not the reconstruction of a story that was already there, ready to be uncovered. Any sequence of events or list of contributory or causal factors already smuggles a whole array of selection mechanisms and criteria into the supposed ‘‘re’’construction. There is no objective way of doing this—all these choices are affected, more or less tacitly, by the analyst’s background, preferences, experiences, biases, beliefs and purposes. ‘‘Events’’ are themselves defined and delimited by the stories with which the analyst configures them, and are impossible to imagine outside this selective, exclusionary, narrative fore-structure (Cronon, 1992).

Here is a thought exercise: what if we were to try to use the Five Whys for finding the “root cause” of a success?

Why didn’t we have failure X today?

Now this question gets a lot more difficult to have one answer. This is because things go right for many reasons, and not all of them obvious. We can spend all day writing down reasons why we didn’t have failure X today, and if we’re committed, we can keep going.

So if success requires “multiple contributing conditions, each necessary but only jointly sufficient” to happen, then how is it that failure requires only one? The Five Whys, as it’s commonly presented as an approach to improvement (or: learning?), will lead us to believe that not only is just one condition sufficient, but that that condition is a canonical one, to the exclusion of all others.

* RCA, or “Root Cause Analysis” can also easily turn into “Retrospective Cover of Ass”

References

Carroll, J. S. (1995). Incident Reviews in High-Hazard Industries: Sense Making and Learning Under Ambiguity and Accountability. Organization & Environment, 9(2), 175–197. doi:10.1177/108602669500900203

Dekker, S. (2004). Ten questions about human error: A new view of human factors and system safety. Mahwah, N.J: Lawrence Erlbaum.

Dekker, S., Cilliers, P., & Hofmeyr, J.-H. (2011). The complexity of failure: Implications of complexity theory for safety investigations. Safety Science, 49(6), 939–945. doi:10.1016/j.ssci.2011.01.008

Hollnagel, E. (2009). The ETTO principle: Efficiency-thoroughness trade-off: Why things that go right sometimes go wrong. Burlington, VT: Ashgate.

Leveson, N. (2012). Engineering a Safer World. MIT Press.

 

 

Availability: Nuance As A Service

Something has struck me as funny recently about the traditional notion of availability of web applications. With respect to its relationship to revenue, to infrastructure and application behavior, and to fault protection and tolerance, I’m thinking it may be time for a broader adjustment to the industry’s perception of the topic.

These nuances in the definition and effects of availability aren’t groundbreaking. They’ve been spoken about before, but for some reason I’m not yet convinced that they’re widely known or understood.

Impact On Business

What is laid out here is something that’s been parroted for decades: downtime costs companies money and loses them value. Generally speaking, this is obviously correct, and by all means you should strive to design and operate your site with high availability and fault tolerance in mind.

But underneath the binary idea that uptime = good and downtime = bad, the reality is that there’s a lot more detail that deserves exploring.

This irritatingly-designed site has a post about a common equation to help those who are arithmetically challenged:

LOST REVENUE = (GR/TH) x I x H
GR = gross yearly revenue
TH = total yearly business hours
I = percentage impact
H = number of hours of outage
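
Plugged into code, the calculation is trivially simple, which is part of the problem. Here’s a minimal sketch of the same formula, with entirely hypothetical numbers:

```python
def blunt_lost_revenue(gross_yearly_revenue, total_yearly_hours, impact_pct, outage_hours):
    """The blunt estimate: LOST REVENUE = (GR / TH) x I x H."""
    return (gross_yearly_revenue / total_yearly_hours) * impact_pct * outage_hours

# Hypothetical example: a $100M/year business, "open" 24x365, with a 2-hour
# outage assumed to affect 100% of revenue-generating traffic.
loss = blunt_lost_revenue(
    gross_yearly_revenue=100_000_000,  # GR
    total_yearly_hours=24 * 365,       # TH
    impact_pct=1.0,                    # I
    outage_hours=2,                    # H
)
print(f"Blunt estimate of lost revenue: ${loss:,.0f}")  # ~$22,831
```

Note that it treats every hour of the year as interchangeable, which is exactly the bluntness I take issue with below.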

In my mind, this is an unnecessarily blunt measure. I see the intention behind this approach; it’s not meant to be anywhere close to accurate. But modern web operations is now a field where gathering metrics in the hundreds of thousands per second is becoming commonplace, fault-tolerance/protection is a thing we do increasingly well, and graceful degradation techniques are the norm.

In other words: there are a lot more considerations than outage minutes = lost revenue, even if you did have a decent way to calculate it (which, you don’t). Companies selling monitoring and provisioning services will want you to subscribe to this notion.

We can do better than this blunt measure, and I think it’s worth digging in a bit deeper.

“Loss”

Thought experiment: if Amazon.com has a full and global outage for 30 minutes, how much revenue did it “lose”? Using the above rough equation, you can certainly come up with a number, let’s say N million dollars. But how accurate is N, really? Discussions that surround revenue loss are normally designed to motivate organizations to invest in availability efforts, so N only needs to be big and scary enough to provide that motivation. So let’s just say that goal has been achieved: you’re convinced! Availability is important, and you’re a firm believer that You Own Your Own Availability.

Outside of the “let this big number N convince you to invest in availability efforts” I have some questions that surround N:

  • How many potential customers did Amazon.com lose forever, during that outage? Meaning: they tried to get to Amazon.com, with some nonzero intent/probability of buying something, found it to be offline, and will never return there again, for reasons of impatience, loss of confidence, the fact that it was an impulse-to-buy click whose time has passed, etc.
  • How much revenue did Amazon lose during that 30-minute window, versus how much revenue did it simply postpone while it was down, only to be collected later? In other words: upon finding the site down, users will return sometime later to do what they originally intended, which may or may not include buying something or participating in some other valuable activity.
  • How much did that 30 minutes of downtime affect the strength of the Amazon brand, in a way that could be viewed as revenue-affecting? Meaning: are users and potential users now swayed to having less confidence in Amazon because they came to the site only to be disappointed that it’s down, enough to consider alternatives the next time they would attempt to go to the site in the future?

I don’t know the answers to these questions about Amazon, but I do know that at Etsy, those answers depend on some variables:

  • the type of outage or degradation (more on that in a minute),
  • the time of day/week/year
  • how we actually calculate/forecast how those metrics would have behaved during the outage

So, let’s crack those open a bit, and see what might be inside…

Temporal Concerns

Not all time periods can be considered equal when it comes to availability, and the idea of lost revenue. For commerce sites (or really any site whose usage varies with some seasonality) this is hopefully glaringly obvious. In other words:

X minutes of full downtime during the peak hour of the peak day of the year can be worlds apart from Y minutes of full downtime during the lowest hour of the lowest day of the year, traffic-wise.

Take for example a full outage that happens during the peak period of the peak day of the year, and contrast it with one that happens during a lower-traffic period of the year. Let’s say that this graph of purchases shows those two 24-hour periods, indicating when the outages happen:

[Graph: “A Tale of Two Outages” – purchase volume for the two 24-hour periods, with each outage window marked]

The impact time of the outage during the lower-traffic day is actually longer than the one on the peak day, affecting the precious Nines math by a decent margin. And yet: which outage would you rather have, if you had to have one of those? 🙂
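
One rough way to make that contrast concrete is to weight each outage by the traffic (or purchases) that would have happened during its window, rather than counting raw minutes. A minimal sketch, with made-up hourly purchase volumes:

```python
def weighted_impact(hourly_volume, outage_start_hour, outage_hours):
    """Sum the volume that falls inside the outage window (a crude proxy for impact)."""
    return sum(hourly_volume[outage_start_hour : outage_start_hour + outage_hours])

# Hypothetical purchase volumes per hour for a peak day and a slow day.
peak_day = [50, 40, 30, 30, 40, 60, 90, 150, 250, 400, 600, 800,
            900, 850, 700, 600, 500, 450, 400, 350, 250, 150, 100, 70]
slow_day = [10, 8, 6, 6, 8, 12, 18, 30, 50, 80, 120, 160,
            180, 170, 140, 120, 100, 90, 80, 70, 50, 30, 20, 14]

# A 1-hour outage at the peak hour of the peak day vs. a 2-hour outage during
# a quiet stretch of the slow day: the longer outage hurts the Nines more,
# but touches far fewer would-be purchases.
print(weighted_impact(peak_day, outage_start_hour=12, outage_hours=1))  # 900
print(weighted_impact(slow_day, outage_start_hour=2, outage_hours=2))   # 12
```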

Another temporal concern is the distribution and volume of degradation across time: the same total amount of impact, spread thinly enough across the day, approaches what looks like perfect uptime as the length of each individual outage approaches zero.

Dig, if you will, these two outage profiles, across a 24-hour period. The first one has many small outages across the day:

[Graph: outage profile with many small outages spread across a 24-hour period]

and the other has the same amount of impact time, in a single go:

[Graph: outage profile with the same total impact time concentrated in a single outage]

So here we have the same amount of time, but spread out throughout the day. Hopefully, folks will think a bit more beyond the clear “they’re both bad! don’t have outages!” and could investigate how they could be different. Some considerations in this simplified example:

  • Hour of day. Note that the single large outage is “earlier” in the day. Maybe this will affect EU or other non-US users more broadly, depending on the timezone of the original graph. Do EU users have a different expectation or tolerance for outages in a US-based company’s website?
  • Which outage scenario has a greater effect on the user population? If the ‘normal’ behavior is to “get in, buy your thing, and get out” quickly, I could see the many-small-outages profile being preferable to the single large one. If the status quo is some mix of searching, browsing, favoriting/sharing, and then purchasing, I could see the singular constrained outage being preferable.

Regardless, this underscores the idea that not all outages are created equal with respect to impact timing.

Performance

Loss of “availability” can also be seen as an extreme loss of performance. At a particular threshold, given the type of feedback to the user (a fast-failed 404 or browser error, versus a hanging white page and spinning “loading…”) the severity of an event being slow can effectively be the same as a full outage.

Some concerns/thought exercises around this:

  • Where is this latency threshold for your site, for the functionality that is critical for the business?
  • Is this threshold a cliff, or is it a continuous/predictable relationship between performance and abandonment?
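
One way to fold this into measurement is to count requests that exceed such a threshold as effectively failed when computing availability. A minimal sketch (the threshold and the sample data below are hypothetical):

```python
def effective_availability(latencies_ms, errored, latency_threshold_ms=3000):
    """Treat errored *or* too-slow requests as unavailable from the user's perspective."""
    failed = sum(
        1 for latency, is_error in zip(latencies_ms, errored)
        if is_error or latency > latency_threshold_ms
    )
    return 1.0 - (failed / len(latencies_ms))

# Hypothetical sample: mostly fast requests, two hangs, one hard error.
latencies = [120, 95, 110, 4800, 130, 10500, 105, 98]
errors = [False, False, False, False, False, False, True, False]
print(f"{effective_availability(latencies, errors):.3f}")  # 0.625
```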

There’s been much more work on performance’s effects on revenue than on availability’s. The Velocity Conference in 2009 brought the first real production-scale numbers (in the form of a Bing/Google joint presentation, as well as Shopzilla and Mozilla talks) behind how performance affects businesses, and if you haven’t read about it, please do.

Graceful Degradation

Will Amazon (or Etsy) lose sales if all or a portion of its functionality is gone (or sufficiently slow) for a period of time? Almost certainly. But that question is somewhat boring without further detail.

In many cases, modern web sites don’t simply live in an “everything works perfectly” or “nothing works at all” boolean world. (To be sure, neither does the Internet as a whole.) Instead, fault-tolerance and resilience approaches allow features and operations to degrade under a spectrum of failure conditions. Many companies build their applications to have in-flight fault tolerance that degrades the experience in the face of singular failures, as well as making use of “feature flags” (Martin and Jez call them “feature toggles”) which allow specific features to be shut off if they’re causing problems.

I’m hoping that most organizations are familiar with this approach at this point. Just because user registration is broken at the moment, you don’t want to prevent already logged-in users from using the otherwise healthy site, do you? 🙂
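
In code, the feature-flag side of this often boils down to a simple guard around non-critical functionality. Here’s a sketch with made-up names (this isn’t any particular flag library’s API):

```python
# A trivial in-memory flag store; in practice this would be backed by config
# pushes, a database, or a dedicated flag service.
FEATURE_FLAGS = {
    "favoriting": True,
    "user_registration": False,  # flipped off while that subsystem misbehaves
}

def feature_enabled(name: str) -> bool:
    return FEATURE_FLAGS.get(name, False)

def render_listing_page(listing: dict) -> dict:
    page = {"listing": listing}
    # Degrade gracefully: hide the feature rather than erroring the whole page.
    page["show_favorite_button"] = feature_enabled("favoriting")
    return page
```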

But these graceful degradation approaches further complicate the notion of availability, as well as its impact on the business as a whole.

For example: if Etsy’s favoriting feature is not working (because the site’s architecture allows it to fail gracefully without affecting other critical functionality), but checkout is working fine…what is the result? Certainly you might pause before marking down your blunt Nines record.

You might also think: “so what? as long as people can buy things, then favoriting listings on the site shouldn’t be considered in scope of availability.”

But consider these possibilities:

  • What if Favoriting listings was a significant driver of conversions?
  • If Favoriting was a behavior that led to conversions at a rate of X%, what value should X be before ‘availability’ ought to be influenced by such a degradation?
  • What if Favoriting was technically working, but was severely degraded (see above) in performance?

Availability can be a useful metric, but when abused as a silver bullet to inform or even dictate architectural, business priority, and product decisions, there’s a real danger of oversimplifying what are really nuanced concerns.

Bounce-Back and Postponement

As I mentioned above, for sites that have an established community or brand, outages (even full ones) don’t mark an instantaneous amount of ‘lost’ revenue or activity. A nonzero amount of it is simply postponed. This is an area that I think could use a lot more data and research in the industry, much in the same way that the latency/conversion relationship has been investigated.

The over-simplified scenario involves something that looks like this. Instead of the blunt math of “X minutes of downtime = Y dollars of lost revenue”, we can be a bit more accurate if we try just a bit harder. The red is the outage:

[Graph: forecasted versus actual purchases across the day, with the outage period shown in red]

 

So we have some more detail, which is that if we can make a reasonable forecast of what purchases would have done during the time of the outage, then we can make a better-informed estimate of the purchases “lost” during that time period.
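
A minimal sketch of that better-informed estimate, assuming you have a per-minute forecast of purchases and the recorded actuals for the outage window (all of the numbers below are made up):

```python
def estimated_lost_purchases(forecast, actual):
    """Sum forecast-minus-actual over the outage window; minutes where actuals beat
    the forecast are clamped to zero so they don't mask in-window loss."""
    return sum(max(f - a, 0) for f, a in zip(forecast, actual))

# Per-minute purchase counts for a hypothetical 10-minute outage window.
forecast_during_outage = [42, 40, 44, 45, 43, 41, 40, 44, 46, 45]
actual_during_outage   = [41, 38,  0,  0,  0,  0,  0,  2, 30, 44]
print(estimated_lost_purchases(forecast_during_outage, actual_during_outage))  # 275
```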

But is that actually the case?

What we see at Etsy is something different, a bit more like this:

[Graph: post-outage activity exceeding the forecast – a bounce-back spike once the site returns]

Clearly this is an oversimplification, but I think the general behavior comes across. When a site comes back from a full outage, there is an increase in activity as users who were stalled or paused in their behavior by the outage resume it. My assumption is that many organizations see this behavior, but it’s just not being talked about publicly.

The hypothesis that needs more real-world data to support (or deny) it is that, depending on:
  • Position of the outage in the daily traffic profile (start-end)
  • Position of the outage in the yearly season

the bounce-back volume will vary in a reasonably predictable fashion. Namely, as the length of the outage grows, the amount of bounce-back volume shrinks:

[Graph: bounce-back volume shrinking as the length of the outage grows]

What this line of thinking doesn’t capture is how many of those users postponed their activity not until immediately after the outage, but maybe until the next day, because they needed to leave their computer for a meeting at work, or to leave work and commute home.

Intention isn’t entirely straightforward to figure out, but in the cases where you have a ‘fail-over’ page that many CDNs will provide when the origin servers aren’t available, you can get some more detail about what requests (add to cart? submit payment?) came in during that time.

Regardless, availability and its effect on business metrics isn’t as simple as service providers and monitoring-as-a-service companies would have you believe. To be sure, a good amount of this investigation will vary wildly from company to company, but I think it’s well worth taking a look into.

 

Fundamental: Stress-Strain Curves In Web Engineering

I make it no secret that my background is in mechanical engineering. I still miss those days of explicit and dynamic finite element analysis, when I worked for the VNTSC on vehicle crashworthiness studies for the NHTSA.

What was there not to like? Things like cars and airbags and seatbelts and dummies that get crushed, sheared, cracked, and busted in every way, all made of different materials: steel, glass, rubber, even flesh (cadaver studies)…it was awesome.

I’ve made some analogies from the world of statics and dynamics to the world of web operations before (Part I and Part II), and it still sticks in my mind as a fundamental mental model in my every day work: resources that have adaptive capacities have a fundamental relationship between stress and strain. Which is to say, in most systems we encounter, as demand for a given resource increases, the strain on the system (and therefore the adaptive capacity) under load also changes, and in most cases increases.

What do I mean by “resource”? Well, from the materials science world, this is generally a component characterized by its material properties. The textbook example is a bar of metal, being stretched.

À la:

In this traditional case, the “system” is simply a beam or a linkage or a load-bearing something.

But in my extension/abuse of the analogy, simple resources in the first order could be imagined as:

  •    CPU
  •    Memory
  •    Disk I/O
  •    Disk consumption
  •    Network bandwidth

To extend it further (and more realistically, because these resources almost never experience work in isolation of each other) you could think of the resource under load to be any combination of these things. And the system under load may be a webserver. Or a database. Or a caching server.

Captain Obvious says: welcome to the underlying facts-on-the-ground of capacity planning and monitoring. 🙂

To me, this leads to some more questions:

    • What does this relationship look like, between stress and strain?
      • Does it fail immediately, as if it was brittle?
      • Or does it “bend”, proportionally, (as in: request rate versus latency) for some period before failure?
      • If the latter, is the relationship linear, or exponential, or something else entirely?
    • Was this relationship known before the design of the system, and therefore taken into account?
      • Which is to say: what approaches are we using most in predicting this relationship between stress and strain:
        • Extrapolated experimental data from synthetic load testing?
        • Previous real-world production data from similarly-designed systems?
        • Percentage rampups of production load?
        • A few cherry-picked reports on HackerNews combined with hope and caffeine?
    • Will we be able to detect when the adaptive capacity of this system is nearing damage or failure?
    • If we can, what are we planning on doing when we reach those inflections?

The more confidence we have about this relationship between stress and strain, the more prepared we are for the system’s failures and successes.
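
As one small illustration of building that confidence: given synthetic load-test data, you can check whether latency is still growing roughly linearly with request rate or whether the curve has begun to bend. A rough sketch, with made-up measurements:

```python
# Hypothetical load-test results: requests/sec vs. median latency (ms).
rates = [100, 200, 300, 400, 500, 600, 700, 800]
latencies = [45, 48, 52, 57, 66, 85, 140, 310]

def growth_ratios(ys):
    """Ratios of successive latency increases: roughly constant means roughly
    linear; a climbing ratio suggests the knee of the curve is approaching."""
    deltas = [b - a for a, b in zip(ys, ys[1:])]
    return [round(d2 / d1, 2) for d1, d2 in zip(deltas, deltas[1:]) if d1 > 0]

print(growth_ratios(latencies))
# [1.33, 1.25, 1.8, 2.11, 2.89, 3.09] -- the later ratios show the relationship
# bending sharply, well before requests start failing outright.
```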

Now, the analogy of this fundamental concept doesn’t end here. What if the “system” under varying load is an organization? What if it’s your development and operations team? Viewed on a longer scale than a web request, this can be seen as a defining characteristic of a team’s adaptive capacities.

David Woods and John Wreathall discuss this analogy they’ve made in “Stress-Strain Plots as a Basis for Assessing System Resilience”. They describe how they are mapping the state space of a stress-strain plot to an organization’s adaptive capacities and resilience:

Following the conventions of stress-strain plots in material sciences, the y-axis is the stress axis. We will here label the y-axis as the demand axis (D) and the basic unit of analysis is how the organization responds to an increase in D relative to a base level of D (Figure 1). The x-axis captures how the material stretches when placed under a given load or a change in load. In the extension to organizations, the x-axis captures how the organization stretches to handle an increase in demands (S relative to some base).

In the first region – which we will term the uniform response region – the organization has developed plans, procedures, training, personnel and related operational resources that can stretch uniformly as demand varies in this region. This is the on-plan performance area or what Woods (2006) referred to as the competence envelope.

As you can imagine, the fun begins in the part of the relationship above the uniform region. In materials science, this is where plastic deformation begins; it’s the point on the curve at which a resource/component’s structure deforms under the increased stress and can no longer rebound back to its original position. It’s essentially damaged, or its shape is permanently changed in the given context.

They go on to say that in the organizational stress-strain analogy:

In the second region non-uniform stretching begins; in other words, ‘gaps’ begin to appear in the ability to maintain safe and effective production (as defined within the competence envelope) as the change in demands exceeds the ability of the organization to adapt within the competence envelope. At this point, the demands exceed the limit of the first order adaptations built into the plan-ful operation of the system in question. To avoid an accumulation of gaps that would lead to a system failure, active steps are needed to compensate for the gaps or to extend the ability of the system to stretch in response to increasing demands. These local adaptations are provided by people and groups as they actively adjust strategies and recruit resources so that the system can continue to stretch. We term this the ‘extra’ region (or more compactly, the x-region) as compensation requires extra work, extra resources, and new (extra) strategies.

So this is a good general description in Human Factors Researcher language, but what is an example of this non-uniform or plastic deformation in our world of web engineering? I see a few examples.

  • In distributed systems, at the point at which the volume of data and the request (or change) rate of that data are beyond the ability of individual nodes to cope, and a wholesale rehash or fundamental redistribution is necessary. For example, in a typical OneMasterManySlaves approach to database architecture, when the rate of change on the master passes the point where, no matter how many slaves you add (to relieve read load on the master), the data will continually be stale. Common solutions to this inflection point are functional partitioning of the data into smaller clusters, or federating the data amongst shards. In another example, it could be that in a Dynamo-influenced datastore, the N, W, and R knobs need adjusting to adapt to the rate, or the individual nodes’ resources need to be changed (see the small quorum sketch after this list).
  • In Ops teams, when individuals start to discover and compensate for brittleness in the architecture. A common sign of this happening is when alerting thresholds or approaches (active versus passive, aggregate versus individual, etc.) no longer provide the detection needed within an acceptable signal:noise envelope. This compensation can be largely invisible, growing until it’s too late and burnout has settled in.
  • The limits of an underlying technology (or the particular use case for it) are starting to show. An example of this is a single-process server. Low traffic rates pose no problem for software that can only run on a single CPU core; it can adapt to small bursts to a certain extent, and there’s a simple solution to this non-multicore situation: add more servers. However, at some point, the work needed to replace the single-core software with multicore-ready software drops below the amount of work needed to maintain and grow an army of single-process servers. This is especially true in terms of computing efficiency, as in dollars per calculation.
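
For the Dynamo-influenced example above, those N, W, and R knobs trade consistency against availability through a simple quorum relationship. A tiny sketch of the arithmetic (not tied to any particular datastore’s API):

```python
def quorum_properties(n: int, w: int, r: int) -> dict:
    """R + W > N means read and write quorums overlap, so reads see the latest
    acknowledged write; lower W or R trades that away for availability/latency."""
    return {
        "overlapping_quorums": r + w > n,
        "replica_failures_tolerated_for_writes": n - w,
        "replica_failures_tolerated_for_reads": n - r,
    }

print(quorum_properties(n=3, w=2, r=2))  # overlapping quorums, tolerates 1 failed replica
print(quorum_properties(n=3, w=1, r=1))  # faster and more available, but reads can be stale
```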

In other words, the ways a design or team once adapted are no longer valid in this new region of the stress-strain relationship. Successful organizations re-group and increase their ability to adapt to this new present case of demands, and invest in new capacities.

For the general case, the exhaustion of capacity to adapt as demands grow is represented by the movement to a failure point. This second phase is represented by the slope and distance to the failure point (the downswing portion of the x-region curve). Rapid collapse is one kind of brittleness; more resilient systems can anticipate the eventual decline or recognize that capacity is becoming exhausted and recruit additional resources and methods for adaptation or switch to a re-structured mode of operations (Figures 2 and 3). Gracefully degrading systems can defer movement toward a failure point by continuing to act to add extra adaptive capacity.

In effect, resilient organizations recognize the need for these new strategies early on in the non-uniform phase, before failure becomes imminent. This, in my view, is the difference between a team who has ingrained into their perspective what it means to be operationally ready, and those who have not. At an individual level, this is what I would consider to be one of the many characteristics that define a “senior” (or, rather a mature) engineer.

This is the money quote, emphasis is mine:

Recognizing that this has occurred (or is about to occur) leads people in these various roles to actively adapt to make up for the non-uniform stretching (or to signal the need for active adaptation to others). They inject new resources, tactics, and strategies to stretch adaptive capacity beyond the base built into on-plan behavior. People are the usual source of these gap-filling adaptations and these people are often inventive in finding ways to adapt when they have experience with particular gaps (Cook et al., 2000). Experienced people generally anticipate the need for these gap-filling adaptations to forestall or to be prepared for upcoming events (Klein et al., 2005; Woods and Hollnagel, 2006), though they may have to adapt reactively on some occasions after the consequences of gaps have begun to appear. (The critical role of anticipation was missed in some early work that noticed the importance of resilient performance, e.g., Wildavsky, 1988.)

This behavior leads to the extension of the non-uniform space into new uniform spaces, as the team injects new adaptive capacities.

There is a lot more in this particular paper that Woods and Wreathall cover, including:

  • Calibration – How engineering leaders and teams view themselves and their situation, along the demand-strain curve. Do they underestimate or overestimate how close they are to failure points or active adaptations that are indicative of “drift” towards failure?
  • Costs of Continual Adaptation in the X-Region – As the compensations for cracks and gaps in the system’s armor increase, so does the cost. At some point, the cost of restructuring the technology or the teams becomes lower than the continual making-up-for-the-gaps that is happening. (A back-of-the-envelope sketch of this crossover follows this list.)
  • The Law of Stretched Systems – “As an organization is successful, more may be demanded of it (‘faster, better, cheaper’ pressures) pushing the organization to handle demands that will exceed its uniform range. In part this relationship is captured in the Law of Stretched Systems (Woods and Hollnagel, 2006) – with new capabilities, effective leaders will adapt to exploit the new margins by demanding higher tempos, greater efficiency, new levels of performance, and more complex ways of work.”
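
To make that cost crossover concrete, here’s a back-of-the-envelope sketch; the numbers and function names are my own invention, not from the chapter.

```python
# Purely illustrative arithmetic for the "costs of continual adaptation"
# point; the figures below are made up for this sketch.

def cumulative_compensation_cost(hours_per_month: float, months: int) -> float:
    # Engineer-hours spent each month compensating for gaps and brittleness.
    return hours_per_month * months

def breakeven_months(restructure_hours: float, hours_per_month: float) -> float:
    # Beyond this horizon, the one-time restructuring is the cheaper option.
    return restructure_hours / hours_per_month

compensation = 40.0   # hours/month of gap-filling toil (invented)
restructure = 480.0   # one-time hours to re-architect or re-team (invented)

print(breakeven_months(restructure, compensation))        # 12.0 months
print(cumulative_compensation_cost(compensation, 24))     # 960 hours over two years
```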

Overall, I think Woods and Wreathall hit the nail on the head for me. Of course, as with all analogies, this mapping of resilience and adaptive capacity onto stress-strain curves has limits, and they are careful to point those out as well.

My suggestion of course is for you to read the whole chapter. It may or may not be useful for you, but it sure is to me. I mean, I embrace the concept so much that I got it printed on a coffee mug, and I’m thinking of making an Etsy Engineering t-shirt as well. 🙂

Resilience Engineering Part II: Lenses

(this is part 2 of a series: here is part 1)

One of the challenges of building and operating complex systems is that it’s difficult to talk about one facet or component of them without bleeding the conversation into other related concerns. That’s the funky thing about complex systems and systems thinking: components come together to behave in different (sometimes surprising) ways that they never would on their own, in isolation. Everything always connects to everything else, so it’s always tempting to see connections and want to include them in discussion. I suspect James Urquhart feels the same.

So one helpful bit (I’ve found) is Erik Hollnagel’s Four Cornerstones of Resilience. I’ve been using it as a lens with which to discuss and organize my thoughts on…well, everything. I think I’ve included them in every talk, presentation, or maybe even every conversation I’ve had with anyone for the last year and a half. I’m planning on annoying every audience for the foreseeable future by including them, because I can’t seem to shake how helpful they are as a lens into viewing resilience as a concept.

Four Cornerstones of Resilience

The greatest part about the four cornerstones is that they act as a simplification device for discussion. And simplifications don’t come easily when talking about complex systems. The other bit that I like is that they make it straightforward to see how the activities and challenges in each cornerstone relate to the others.

For example: learning is traditionally punctuated by Post-Mortems, and what (hopefully) comes out of PMs? Remediation items. Tasks and facts that can aid:

  • monitoring (example: “we need to adjust an alerting threshold or mechanism to be more appropriate for detecting anomalies”; a tiny sketch of this kind of fix follows this list),
  • anticipation (example: “we didn’t see this particular failure scenario coming before, so let’s update our knowledge on how it came about.”),
  • response (example: “we weren’t able to troubleshoot this issue as quickly as we’d like because communicating during the outage was noisy/difficult, so let’s fix that.”)
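
As a concrete (if simplified) illustration of the first kind of remediation item, here’s a sketch of swapping a hard-coded alert threshold for one derived from recent history. The function and the sample numbers are invented for illustration, not a real monitoring config.

```python
from statistics import mean, stdev

def adaptive_threshold(recent_samples, k=3.0):
    # Alert when the metric exceeds mean + k * stddev of a recent window,
    # rather than a fixed number that goes stale as traffic patterns change.
    return mean(recent_samples) + k * stdev(recent_samples)

recent_error_rates = [0.20, 0.30, 0.25, 0.40, 0.35, 0.30]  # errors/sec, made up
threshold = adaptive_threshold(recent_error_rates)

current = 1.2
if current > threshold:
    print(f"ALERT: error rate {current:.2f} exceeds adaptive threshold {threshold:.2f}")
```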

I’ll leave it as an exercise to imagine how anticipation can then affect monitoring, response, and learning. The point here is that each of the four cornerstones can affect the others in all sorts of ways, and those relationships ought to be explored.

I do think it’s helpful when looking at these pieces to understand that they can exist on a large time window as well as a small one. You might be tempted to view them in the context of infrastructure attributes in outage scenarios; this would be a mistake, because it narrows the perspective to what you can immediately influence. Instead, I think there’s a lot of value in looking at the cornerstones as a lens on the larger picture.

Of course going into each one of these in detail isn’t going to happen in a single epic too-long blog post, but I thought I’d mention a couple of things that I currently think of when I’ve got this perspective in mind.

Anticipation

This is knowing what to expect, and dealing with the potential for fundamental surprises in systems. This involves what Westrum and Adamski called “Requisite Imagination”, the ability to explore the possibilities of failure (and success!) in a given system. The process of imagining scenarios in the future is a worthwhile one, and I certainly think a fun one. The skill of anticipation is one area where engineers can illustrate just how creative they are. Cheers to the engineers who can envision multiple futures and sort them based on likelihood. Drinks all around to those engineers who can take that further and explain the rationale for their sorted likelihood ratings. Whereas monitoring deals in the now, anticipation deals in the future.

At Etsy we have a couple of tools that help with anticipation:

  • Architectural Reviews These are meetings, open to all of engineering, that we hold when there’s a new pattern being used or a new type of technology being introduced, to walk through what it is and why. We gather up the people proposing the idea, and then spend time shooting holes in it, with the goal of making the solution stronger than it might have been on its own. We also entertain what we’d do if things didn’t go according to plan with the idea. We take adopting new technologies very seriously, so this doesn’t happen very often.
  • Go or No-Go Meetings (a.k.a. Operability Reviews) These are where we gather up representative folks (at least someone from Support, Community, Product, and obviously Engineering) to discuss some fundamentals on a public-facing change, and walk through any contingencies that might need to happen. Trick is – in order to get contingencies as part of the discussion, you have to name the circumstances where they’d come up.
  • GameDay Exercises These are exercises where we validate our confidence in production by causing as many horrible things as we can to components while they’re in production. Even asking whether a GameDay is possible sparks enough conversation to be useful, and burning pieces to the ground to see how the system behaves when they do is always a useful task. We want no unique snowflakes, so being able to stand it up as fast as it can burn down is fun for the whole family.

But anticipation isn’t just about thinking along the lines of “what could possibly go wrong?” (although that is always a decent start). It’s also about the organization, and how a team behaves when interacting with the machines. Recognizing when your adaptive capacity is failing is key to anticipation. David Woods has collected some patterns of anticipation worth exploring, many of which relate to a system’s adaptive capacity:

  • Recognize when adaptive capacity is failing – Example: Can you detect when your team’s ability to respond to outages degrades?
  • Recognize the threat of exhausting buffers or reserves – Example: Can you tell when your tolerances for faults are breached? When your team’s workload prevents proactive planning from getting done?
  • Recognize when to shift priorities across goal trade-offs – Example: Can you tell when you’re going to have to switch away from greenfield development and focus on cleaning up legacy infra?
  • Able to make perspective shifts and contrast diverse perspectives that go beyond their nominal position – Example: Can Operations understand the goals of Development, and vice-versa, and support them in the future?
  • Able to navigate interdependencies across roles, activities, and levels – Example: Can you foresee what’s going to be needed from different groups (Finance, Support, Facilities, Development, Ops, Product, etc.) and who in those teams needs to be kept up-to-date with ongoing events?
  • Recognize the need to learn new ways to adapt – Example: Will you know when it’s time to include new items in training incoming engineers, as failure scenarios and ways of working change in the organization and infrastructure?

I’m fascinated by the skill of anticipation, frankly. I spoke at Velocity Europe in Berlin last year on the topic.

Monitoring

This is knowing what to look for, and dealing with the critical in systems. Not just the mechanics of servers and networks and applications, but monitoring in the organizational sense. Anomaly detection and metrics collection and alerting are obviously part of this, and should be familiar to anyone expecting their web application to be operable.

But in addition to this, we’re talking as well about meta-metrics on the operations and activities of both infrastructure and staff.

Things like:

  • How might a team measure its cognitive load during an outage, in order to detect when it is drifting?
  • Are there any gaps that appear in a team’s coordinative or collaborative abilities, over time?
  • Can the organization detect when there are goal conflicts (example: accelerating production schedules in the face of creeping scope) quickly enough to make them explicit and do something about them?
  • What leading or lagging indicators could you use to gauge whether or not the performance demand of a team is beyond what could be deemed “normal” for the size and scale it has? (A toy sketch of one such indicator follows this list.)
  • How might you tell if a team is becoming complacent with respect to safety, when incidents decrease? (“We’re fine! Look, we haven’t had an outage for months!”)
  • How can you confirm that engineers are being ramped up adequately to be productive and adaptive on a team?
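
As a toy illustration of the leading/lagging-indicators question above: trend the average number of pages per on-call engineer per week. Everything here (the data, the threshold for “unsustainable”) is invented; the hard part in practice is deciding what “normal” looks like, not computing the number.

```python
from collections import defaultdict

# (week, engineer, pages received) -- made-up data for illustration.
pages = [
    ("W01", "alice", 4), ("W01", "bob", 6),
    ("W02", "alice", 9), ("W02", "bob", 11),
    ("W03", "alice", 15), ("W03", "bob", 14),
]

totals = defaultdict(int)
engineers = defaultdict(set)
for week, engineer, count in pages:
    totals[week] += count
    engineers[week].add(engineer)

for week in sorted(totals):
    avg = totals[week] / len(engineers[week])
    note = "  <-- creeping toward unsustainable?" if avg > 10 else ""
    print(f"{week}: {avg:.1f} pages per on-call engineer{note}")
```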

Response

This is knowing what to do, and dealing with the actual in systems. Whether or not you’ve anticipated a perturbation or disturbance, as long as you can detect it, then you have something to respond to. How do you respond? Page the on-call engineer? Are you the on-call engineer? Response is fundamental to working in web operations, and differential diagnosis is just as applicable to troubleshooting complex systems as it is in medicine.

Pitfalls in responding to surprising behaviors in complex systems have exotic and novel characteristics. They are the things that make Post-Mortem meetings dramatic; they can often include stories of surprising turns of attention, focus, and results that make troubleshooting more of a mystery than anything else. Dietrich Dörner, in his 1980 article “On The Difficulties People Have In Dealing With Complexity”, gave some characteristics of response in escalating scenarios. These might sound familiar to anyone who has experienced team troubleshooting during an outage:

…[people] tend to neglect how processes develop over time (awareness of rates) versus assessing how things are in the moment.
…[people] have difficulty in dealing with exponential developments (hard to imagine how fast things can change, or accelerate)
…[people] tend to think in causal series (A, therefore B), as opposed to causal nets (A, therefore B and C, therefore D and E, etc.)

I was lucky enough to talk a bit more in detail about Resilient Response In Complex Systems at QCon in London this past year.

Learning

This is knowing what has happened, and dealing with the factual in systems. Everyone wants to believe that their team or group or company has the ability to learn, right? A hallmark of good engineering is empirical observation that results in future behavior changes. Like I mentioned above, this is the place where Post-Mortems usually come into play. At this point I think our field ought to be familiar with Post-Mortem meetings and their general structure and goal: to glean as much information as possible about an incident, an outage, a surprising result, a mistake, etc., and to spread those observations far and wide within the organization in order to prevent them from happening in the future.

I’m obviously a huge fan of Post-Mortems and what they can do to improve an organization’s behavior and performance. But a lesser-known tool for learning is the “near-miss” opportunity we see in normal, everyday work. An engineer performs an action, and realizes later that it was wrong or somehow produced a surprising result. When those happen, we can hold them up high, for all to see and learn from. Did they cause damage? No, that’s why they “missed.”

One of the godfathers of cognitive engineering, James Reason, said that “near-miss” events are excellent learning opportunities for organizations, because they:

  1. Can act like safety “vaccines” for an organization, because they are just a little bit of failure that doesn’t really hurt.
  2. Happen much more often than actual systemic failures, so they provide a lot more data on latent failures.
  3. Are a powerful reminder of hazards, which helps keep the “constant sense of unease” that is needed to provide resilience in a system.

I’ll add that encouraging engineers to share the details of their near-misses has a positive side effect on the culture of the organization. At Etsy, you will see (from time to time) an email to the whole team from an engineer that has the form:

Dear Everybody,

This morning I went to do X, so I did Y. Boy was that a bad idea! Not only did it not do what I thought it was going to, but also it almost brought the site down because of Z, which was a surprise to me. So whatever you do, don’t make the same mistake I did. In the meantime, I’m going to see what I can do to prevent that from happening.

Love,

Joe Engineer

For one, it provides the confirmation that anyone, at any time, no matter their seniority level, can make a mistake or act on faulty assumptions. The other benefit is that it sends the message that admitting to making a mistake is acceptable and encouraged, and that people should feel safe in admitting to these sometimes embarrassing events.

This last point is so powerful that it’s hard to overstate. It’s related to encouraging a Just Culture, something that I wrote about recently over at Code As Craft, Etsy’s Engineering blog.

The last bit I wanted to mention about learning is purposefully not incident-related. One of the basic tenets of Resilience Engineering is that safety is not the absence of incidents and failures; it’s the presence of actions, behaviors, and culture (all along the lines of the four cornerstones above) that causes an organization to be safe. Learning only from failures means that the surface area to learn from is not all that large. To be clear: most organizations see successes much, much more often than they do failures.

One such focus might be changes. If, across 100 production deploys, you had 9 change-related outages, which should you learn from? Should you be satisfied to look at those nine, have postmortems, and then move forward, safe in the idea that you’ve rid yourself of the badness? Or should you also look at the 91 deploys, and gather some hypotheses about why they ended up OK? You can learn from 9 events, or from 91. The argument here is that you’ll be safer by learning from both.

So in addition to learning from why things go wrong, we ought to learn just as much from why things go right. Why didn’t your site go down today? Why aren’t you having an outage right now? This is likely due to a huge number of influences and reasons, all worth exploring.

….

Ok, so I lied. I didn’t expect this to be such a long post. I do find the four cornerstones to be a good lens with which to think and speak about resilience and complex systems. They’re as much a part of my vocabulary now as OODA and CAP are. 🙂

Each necessary, but only jointly sufficient

I thought it might be worth digging in a bit deeper on something that I mentioned in the Advanced Postmortem Fu talk I gave at last year’s Velocity conference.

For complex socio-technical systems (web engineering and operations) there is a myth that deserves to be busted, and that is the assumption that for outages and accidents there is a single unifying event that triggers a chain of events leading to the outage.

This is actually a fallacy, because for complex systems:

there is no root cause.

This isn’t entirely intuitive, because it goes against our nature as engineers. We like to simplify complex problems so we can work on them in a reductionist fashion. We want there to be a single root cause for an accident or an outage, because if we can identify that, we’ve identified the bug that we need to fix. Fix that bug, and we’ve prevented this issue from happening in the future, right?

There’s also a strong tendency in causal analysis (especially in our field, IMHO) to find the single place where a human touched something, and point to that as the “root” cause. Those dirty, stupid humans. That way, we can put the singular onus on “lack of training”, or the infamously terrible label “human error.” This, of course, isn’t a winning approach either.

But, you might ask, what about the “Five Whys” method of root cause analysis? Starting with the outcome and working backwards towards an originally triggering event along a linear chain feels intuitive, which is why it’s so popular. Plus, those Toyota guys know what they’re talking about. But it also falls prey to the same issue with regard to assumptions surrounding complex failures.

As this excellent post in Workplace Psychology rightly points out, the Five Whys has limitations:

An assumption underlying 5 Whys is that each presenting symptom has only one sufficient cause. This is not always the case and a 5 Whys analysis may not reveal jointly sufficient causes that explain a symptom.

There are some other limitations of the Five Whys method outlined there, such as it not being an idempotent process, but the point I want to make here is that linear mental models of causality don’t capture what is needed to improve the safety of a system.

Generally speaking, linear chain-of-events approaches are akin to viewing the past as a line-up of dominoes, and reality with complex systems simply doesn’t work like that. Looking at an accident this way ignores surrounding circumstances in favor of a cherry-picked list of events, validates hindsight and outcome bias, and focuses too much on components and not enough on the interconnectedness of components.

During stressful times (like outages) the people involved with response, troubleshooting, and recovery also often misremember events as they happened, sometimes unconsciously neglecting critical facts and the timing of observations, assumptions, etc. This can obviously affect the results of using a linear accident investigation model like the Five Whys.

However, identifying a singular root cause and a linear chain that stems from it makes things very easy to explain, understand, and document. It can help us feel confident that we’re going to prevent future occurrences of the issue, because there’s just one thing to fix: the root cause.

Even the eminent cognitive engineering expert James Reason’s epidemiological (the “Swiss Cheese”) model exhibits some of these limitations. While it does help capture multiple contributing causes, the mechanism is still linear, which can encourage people to think that if they were only to remove one of the causes (or fix a ‘barrier’ to a cause in the chain) then they’d be protected in the future.

I will, however, point out that having an open and mature process of investigating causality, using any model, is a good thing for an organization, and the Five Whys can help kick-off the critical thinking needed. So I’m not specifically knocking the Five Whys as a practice with no value, just that it’s limited in its ability to identify items that can help bring resilience to a system.

Again, this tendency to look for a single root cause for fundamentally surprising (and usually negative) events like outages is ubiquitous, and hard to shake. When we’re stressed for technical, cultural, or even organizationally political reasons, we can feel pressure to get to resolution on an outage quickly. And when there’s pressure to understand and resolve a (perceived) negative event quickly, we reach for oversimplification. Some typical reasons for this are:

  • Management wants an answer to why it happened quickly, and they might even look for a reason to punish someone for it. When there’s a single root cause, it’s straightforward to pin it on “the guy who wasn’t paying attention” or “is incompetent”
  • The engineers involved with designing/building/operating/maintaining the infrastructure touching the outage are uncomfortable with the topic of failure or mistakes, so the reaction is to get the investigation over with. This encourages oversimplification of the causes and remediation.
  • The failure is just too damn complex to keep in one’s head. Hindsight bias encourages counter-factual thinking (“…if only we paid attention, we could have seen this coming!” or “…we should have known better!”) which pushes us into thinking the cause is simple.

So if there’s no singular root cause, what is there?

I agree with Richard Cook’s assertion that failures in complex systems require multiple contributing causes, each necessary but only jointly sufficient.

Hollnagel, Woods, Dekker and Cook point out in this introduction to Resilience Engineering:

Accidents emerge from a confluence of conditions and occurrences that are usually associated with the pursuit of success, but in this combination—each necessary but only jointly sufficient—able to trigger failure instead.
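
To make “each necessary but only jointly sufficient” concrete, here’s a toy sketch of my own (not from the paper): an outage that only occurs when several otherwise-tolerable conditions line up. Remove any one of them and the outage doesn’t happen, so each is necessary, yet none of them alone deserves the title of “root cause.”

```python
# Hypothetical contributing conditions; each is survivable on its own.
contributing_conditions = {
    "deploy_changed_query_plan": True,
    "cache_hit_rate_degraded": True,
    "marketing_campaign_traffic_spike": True,
    "replica_lag_above_threshold": True,
}

def outage_occurs(conditions):
    # The failure emerges only from the combination.
    return all(conditions.values())

print(outage_occurs(contributing_conditions))  # True: all present together

for name in contributing_conditions:
    counterfactual = {**contributing_conditions, name: False}
    # Each condition is "necessary": take any one away and there's no outage.
    print(f"without {name}: outage = {outage_occurs(counterfactual)}")
```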

Frankly, I think that this tendency to look for singular root causes also comes from how deeply entrenched modern science and engineering are in the tenets of reductionism. So I blame Newton and Descartes. But that’s for another post. 🙂

Because complex systems have emergent behaviors, not resultant ones, it can be put another way:

Finding the root cause of a failure is like finding a root cause of a success.

So what does that leave us with? If there’s no single root cause, how should we approach investigating outages, degradations, and failures in a way that can help us prevent, detect, and respond to such issues in the future?

The answer is not straightforward. In order to truly learn from outages and failures, systemic approaches are needed, and there are a couple of them mentioned below. Regardless of the implementation, most systemic models recognize these things:

  • …that complex systems involve not only technology but organizational (social, cultural) influences, and those deserve equal (if not more) attention in investigation
  • …that fundamentally surprising results come from behaviors that are emergent. This means they can and do come from components interacting in ways that cannot be predicted.
  • …that nonlinear behaviors should be expected. A small perturbation can result in catastrophically large and cascading failures.
  • …human performance and variability are not intrinsically coupled with causes. Terms like “situational awareness” and “crew resource management” are blunt concepts that can mask the reasons why it made sense for someone to act in a way that they did with regards to a contributing cause of a failure.
  • …diversity of components and complexity in a system can augment the resilience of a system, not simply bring about vulnerabilities.

For the real nerdy details, Zahid H. Qureshi’s A Review of Accident Modelling Approaches for Complex Socio-Technical Systems covers the basics of the current thinking on systemic accident models: Hollnagel’s FRAM, Leveson’s STAMP, and Rasmussen’s framework are all worth reading about.

Also appropriate for further geeking out on failure and learning:

Hollnagel’s talk, On How (Not) To Learn From Accidents

Dekker’s wonderful Field Guide To Understanding Human Error

So the next time you read or hear a report with a singular root cause, alarms should go off in your head. In the same way that you shouldn’t ever have root cause “human error”, if you only have a single root cause, you haven’t dug deep enough. 🙂

 

Fault Tolerance and Protection

In yet another post where I point to a paper written from the perspective of another field of engineering about a topic that I think is inherently mappable to the web engineering world, I’ll at least give a summary. 🙂

Every time someone on-call gets an alert, they should always be thinking along these lines:

  • Does this really require me to wake up from sleeping or pause this movie I’m watching, to fix?
  • Can this really not wait until the morning, during office hours?

If the answer is yes to those, then excellent: the machines alerted a human to something that only a human could ever diagnose or fix. There was nothing that the software could have done to rectify the situation. Paging a human was justified.

But for those situations where the answer was “no” to those questions, one might (or should, anyway) think about bolstering the system’s “fault tolerance” or “fault protection.” But how many folks grok the full details of what that means? Does it mean self-healing? Does it mean isolation of errors or unexpected behaviors that fall outside the bounds of normal operating circumstances? Or does it mean both, and if so, how should we approach building this tolerance and protection? The Wikipedia definitions for “fault tolerant systems” and “fault tolerant design” are a very good start for educating yourself on the concepts, but they’re reasonably general in scope.
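
To put those two questions into (toy) code: the sketch below isn’t a real paging system, just the routing decision the questions imply. Anything the software could safely fix itself shouldn’t page anyone, anything that can wait should become a ticket, and only the remainder justifies waking a human.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    needs_human_judgment: bool   # could software safely remediate this itself?
    can_wait_until_morning: bool

def route(alert: Alert) -> str:
    if not alert.needs_human_judgment:
        return "auto-remediate"              # restart, failover, shed load, etc.
    if alert.can_wait_until_morning:
        return "open a ticket for office hours"
    return "page the on-call engineer"       # waking a human is justified

print(route(Alert("disk 80% full on a log host", needs_human_judgment=False, can_wait_until_morning=True)))
print(route(Alert("primary database unreachable", needs_human_judgment=True, can_wait_until_morning=False)))
```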

The fact is, designing web systems to be truly fault-tolerant and protective is hard. These are questions that can’t be answered solely within infrastructural bounds; fault-tolerance isn’t selective in its tiering, and it has to be thought of from layer 1 of the network all the way up to the browser.

Now, not every web startup is lucky enough to hire someone from NASA’s Jet Propulsion Lab, who has written software for space vehicles, but we managed to convince Greg Horvath to leave there and join Etsy. He pointed me to an excellent paper, by Robert D. Rasmussen, called “GN&C Fault Protection Fundamentals” and thankfully, it’s a lot less about Guidance, Navigation, and Control and more about fault tolerance and protection strategies, concerns, and implementations.

Some of those concerns, from the paper:

  • Do not separate fault protection from normal operation of the same functions.
  • Strive for function preservation, not just fault protection.
  • Test systems, not fault protection; test behavior, not reflexes.
  • Cleanly establish a delineation of mainline control functions from transcendent issues.
  • Solve problems locally, if possible; explicitly manage broader impacts, if not.
  • Respond to the situation as it is, not as it is hoped to be.
  • Distinguish fault diagnosis from fault response initiation.
  • Follow the path of least regret.
  • Take the analysis of all contingencies to their logical conclusion.
  • Never underestimate the value of operational flexibility.
  • Allow for all reasonable possibilities — even the implausible ones.

The last idea there points to having “requisite imagination” to explore, as fully as possible, the question “What could possibly go wrong?”, which is really just another manifestation of one of the four cornerstones of Resilience Engineering: Anticipation. But that’s a topic for another post.

Here’s Rasmussen’s paper; please go and read it. If you don’t, you’re totally missing out and not keeping up!

Training Organizational Resilience in Escalating Situations

This little ramble of thoughts is related to my upcoming talk at Velocity, but I know I’ll never get to this part at the conference, so I figured I’d post about it here.

Building resilience from a systems point of view means (amongst other things) understanding how your organization deals with failure and unexpected situations. Generally this means having development and operations teams that can work well together under pressure, with fluctuating amounts of uncertainty, bringing their own domain expertise to the table when it matters.

This is what drives some of my favorite Ops candidate interview questions. Knowing Unix commands, network architectures, database behaviors, and scripting languages is obviously required, but comprises only one facet of the gig. The real mettle comes from being able to easily zoom in and out of the whole system under scrutiny, splitting up troubleshooting responsibilities amongst your team (and trusting their results), and differentiating red-herring symptoms from truly related ones. It also comes from things like:

  • Staying away from distracting conversation during the outage response. Nothing kills a TTR like unrelated talk in IRC or a conf call.
  • Trusting your information. This is where the UI challenges of dashboard design can make or break an outage response. “Are those units milli, or mega?”
  • Balancing too much communication and too little amongst team members. Troubleshooting outage verbosity is a fickle mistress.
  • Stomping actions. OneThingAtATime™ methods aren’t easy to stick to, especially when things escalate.
  • Keeping outage fatigue at bay, and recognizing when brains are melting and need to take a break.

To make matters worse, determining causality can be tenuous at best when you’re working with complex systems, so being able to recognize when a failure has a single root cause (hint: with the big outages – almost never) and when it has multiple contributing causes is a skill that isn’t easily gained without seeing a lot of action in the past.

So it’s not a surprise that working well within a team under stressful scenarios is something other fields try to train people for.  Trauma surgeons, FBI agents, military teams, air traffic control, etc. all have drills, exercises, and simulations for teaching these skills, but they are all done within the context of what those escalating situations look like in their specific fields.

So this brings a question that has come up before in my circles:

Can this sort of organizational resilience be taught, within the context of web operations?

GameDay exercises could certainly be one avenue for testing and training team-based outage response, but most of the focus there (at least those discussed publicly by companies who hold GameDay exercises) is testing the infrastructure and application-level components, and even then under controlled conditions and relatively narrow failure modes.

So the confidence-building value of GameDay drills lies elsewhere; they don’t really exercise the cognitive load that real-world failures, like the spectacular Amazon AWS outage recently, can produce on the humans (i.e., the troubleshooting dev and ops teams).

But! Some smart folks have been thinking about this question, at a higher-level:

Is it possible to construct non-contextual and generic drills that can train competencies for this sort of on-the-fly, making-sense-of-unfamiliar-failure-modes, and sometimes disorienting troubleshooting?

From Lund University in Sweden, there’s an excellent article on building organizational resilience in escalating situations, which I believe resulted in a chapter in the Resilience Engineering in Practice book, and which also references another excellent article by David Woods and Emily Patterson called How Unexpected Events Produce An Escalation Of Cognitive And Coordinative Demands.

The parts I want to highlight here are best practices for designing scenarios meant to train these skills. If you’re looking to design a good drill meant to educate and/or train Ops and Devs on what cognitive muscles to develop for handling large-scale outages, this is a pretty damn good list (quoted from both of those sources above):

  • Try to force people beyond their learned roles and routines. The scenario can contain problems that are not solvable within those roles or routines, and forces people to step out of those roles and routines.
  • Contain a number of hidden goals, at various times during the scenario, that people could pursue (e.g. different ways of escaping the situation or de-escalating it), but that they have to vocalize and articulate in order to begin to achieve them (as they cannot do so by themselves).
  • Include potential actions of which the consequences are both important and difficult to foresee (and that might significantly influence people’s ability to control the problem in the near future). This can force people into pro-active thinking and articulation of their expectations of what might happen.
  • Be able to trap people into locking onto one solution that everybody is fixedly working towards. This can be done by garden-pathing: making the escalating problem look initially (with strong cues) like something the crew could already be familiar with, but then letting it depart (with much weaker cues) to see whether the crew is caught on the garden path and lets the situation escalate.
  • Or the scenario, by creating so much cognitive noise in terms of new warnings and events, should be able to trip people into thematic vagabonding—the tendency to redirect attention and change diagnosis with each incoming data piece, which results in a fragmentation of problem-solving.

Think that such a scenario could be constructed?

I want to think so, but of course nothing teaches like the hindsight of a real production outage, eh? 🙂