Architectural Folk Models

I’m going to post the contents of a gist I wrote (2 years ago?!), because Theo is right, some gists are better as posts. The context for this was a debate on Twitter (which, as always, is about as elegant and pleasing to read as a turtle trying to breakdance). 

Summing up contextual influence on systems architecture

1. Monolithic applications and architectures can vary in their monolith-ness. This is an under-specified description.

2. Microservice applications and architectures can vary in their micro-ness. This is an under-specified description.

3. Both microservices and monolithic architectures have both benefits and disadvantages that are contextual.

4. Successful organizations will exploit those benefits while working around any weaknesses.

5. The success of a business is a large influence on how those benefits are exploited and on the implementation and costs of any workarounds.

6. All benefits and workarounds are context-sensitive, meaning that they are both technically and socially constructed by the organization that navigates them.

7. Path dependency is a thing. History matters, and it manifests in an organization’s architectural decisions and how they evolve.

8. Patterns exist to inform practice, not dictate it. Zealous adherence to an architectural pattern brings peril when it is to the exclusion of cultural and social context in actual practice.

9. Architectural patterns will expand, contract, evolve, and change in multiple ways to fit the trade-offs that an organization perceives it has to make, at the time they make them.

Much has been said about this, including some more by me, since then, but apparently it is not a dead topic and I figured I should grab it off of the gist system. 🙂

In the end, I consider architectural patterns to be folk models. Meaning that in popular dialogue, they tend to:

Substitute one label for another, rather than decompose a large construct into more measurable specifics (when I say ‘microservice’ to you, how can we be sure your understanding of the term is the same as mine without being more specific?)

…be immune to falsification (how do I look at an architecture and decide when it’s no longer a monolith in a way that is universally true?)

…and easily get over-generalized to situations they were never meant to speak about (when we talk about microservices, how do I know when we are no longer talking about technical specifications and have started talking about organizational design?)

Much thanks to Hollnagel and Dekker for introducing me to the concept of folk models.

The Infinite Hows (or, the Dangers Of The Five Whys)

(this is also posted on O’Reilly’s Radar blog. Much thanks to Daniel Schauenberg, Morgan Evans, and Steven Shorrock for feedback on this)

Before I begin this post, let me say that this is intended to be a critique of the Five Whys method, not a criticism of the people who are in favor of using it.

This critique I present is hardly original; most of this post is inspired by Todd Conklin, Sidney Dekker, and Nancy Leveson.

The concept of post-hoc explanation (or “postmortems” as they’re commonly known) has, at this point, taken hold in the web engineering and operations domain. I’d love to think that the concepts that we’ve taken from the New View on ‘human error’ are becoming more widely known and that people are looking to explore their own narratives through those lenses.

I think that this is good, because my intent has always been (might always be) to help translate concepts from one domain to another. In order to do this effectively, we also need to know what to discard (or at least inspect critically) from those other domains.

The Five Whys is one such approach that I think we should discard.

This post explains my reasoning for discarding it, and how using it has the potential to be harmful, not helpful, to an organization. Here’s how I intend to do this: I’m first going to talk about what I think are deficiencies in the approach, suggest an alternative, and then ask you to simply try the alternative yourself.

Here is the “bottom line, up front” gist of my assertions:

“Why?” is the wrong question.

In order to learn (which should be the goal of any retrospective or post-hoc investigation) you want multiple and diverse perspectives. You get these by asking people for their own narratives. Effectively, you’re asking “how?”

Asking “why?” too easily gets you to an answer to the question “who?” (which in almost every case is irrelevant) or “takes you to the ‘mysterious’ incentives and motivations people bring into the workplace.”

Asking “how?” gets you to describe (at least some) of the conditions that allowed an event to take place, and provides rich operational data.

Asking a chain of “why?” assumes too much about the questioner’s choices, and assumes too much about each answer you get. At best, it locks you into a causal chain, which is not how the world actually works. This is a construction that ignores a huge amount of complexity in an event, and it’s the complexity that we want to explore if we have any hope of learning anything.

But It’s A Great Way To Get People Started!

The most compelling argument for using the Five Whys is that it’s a good first step towards doing real “root cause analysis” – my response to that is twofold:

  1. “Root Cause Analysis*” isn’t what you should be doing anyway, and
  2. It’s only a good “first step” because it’s easy to explain and understand, which makes it easy to socialize. The issue with this is that the concepts that the Five Whys depend on are not only faulty, but can be dangerous for an organization to embrace.

If the goal is learning (and it should be), then the method of retrospective learning we use should give us confidence that it brings to light data that can be turned into actionable information. The issue with the Five Whys is that it’s tunnel-visioned into a linear and simplistic explanation of how work gets done and events transpire. This narrowing can be incredibly problematic.

In the best case, it can lead an organization to think they’re improving on something (or preventing future occurrences of events) when they’re not.

In the worst case, it can re-affirm a faulty worldview of causal simplification and set up a structure where individuals don’t feel safe in giving their narratives, because either they weren’t asked the right “why?” question or because the answer to a given question pointed to ‘human error’ or individual attributes as causal.

Let’s take an example. In my tutorials at the Velocity Conference in New York, I used an often-repeated straw man to illustrate this:

[Slide: the often-repeated straw-man Five Whys example]

This is the example of the Five Whys found in the Web Operations book, as well.

This causal chain effectively ends with a person’s individual attributes, not with a description of the multiple conditions that allow an event like this to happen. Let’s look into some of the answers…

“Why did the server fail? Because an obscure subsystem was used in the wrong way.”

This answer is dependent on the outcome. We know that it was used in the “wrong” way only because we’ve connected it to the resulting failure. In other words, we as “investigators” have the benefit of hindsight. We can easily judge the usage of the server because we know the outcome. If we were to go back in time and ask the engineer(s) who were using it: “Do you think that you’re doing this right?” they would answer: yes, they are. We want to know what the various influences were that brought them to think that, which simply won’t fit into an answer to “why?”

The answer also limits the next question that we’d ask. There isn’t any room in the dialogue to discuss things such as the potential to use a server in the wrong way and it not result in failure, or what ‘wrong’ means in this context. Can the server only be used in two ways – the ‘right’ way or the ‘wrong’ way? And does success (or, the absence of a failure) dictate which of those ways it was used? We don’t get to these crucial questions.

“Why was it used in the wrong way? The engineer who used it didn’t know how to use it properly.”

This answer is effectively a tautology, and includes a post-hoc judgement. It doesn’t tell us anything about how the engineer did use the system, which would provide a rich source of operational data, especially for engineers who might be expected to work with the system in the future. Is it really just about this one engineer? Or is it possibly about the environment (tools, dashboards, controls, tests, etc.) that the engineer is working in? If it’s the latter, how does that get captured in the Five Whys?

So what do we find in this chain we have constructed above? We find:

  • an engineer with faulty (or at least incomplete) knowledge
  • insufficient indoctrination of engineers
  • a manager who fouls things up by not being thorough enough in the training of new engineers (indeed: we can make a post-hoc judgement about her beliefs)

If this is to be taken as an example of the Five Whys, then as an engineer or engineering manager, I might not look forward to it, since it focuses on our individual attributes and doesn’t tell us much about the event other than the platitude that training (and convincing people about training) is important.

These are largely answers about “who?” not descriptions of what conditions existed. In other words, by asking “why?” in this way, we’re using failures to explain failures, which isn’t helpful.

If we ask: “Why did a particular server fail?” we can get any number of answers, but one of those answers will be used as the primary way of getting at the next “why?” step. We’ll also lose out on a huge amount of important detail, because remember: you only get one question (and one answer) before moving on to the next step.

If instead, we were to ask the engineers how they went about implementing some new code (or ‘subsystem’), we might hear a number of things, like maybe:

  • the approach(es) they took when writing the code
  • what ways they gained confidence (tests, code reviews, etc.) that the code was going to work in the way they expected it before it was deployed
  • what (if any) history of success or failure they have had with similar pieces of code
  • what trade-offs they made or managed in the design of the new function
  • how they judged the scope of the project
  • how much (and in what ways) they experienced time pressure for the project
  • the list can go on, if you’re willing to ask more and they’re willing to give more

Rather than judging people for not doing what they should have done, the new view presents tools for explaining why people did what they did. Human error becomes a starting point, not a conclusion. (Dekker, 2009)

When we ask “how?”, we’re asking for a narrative. A story.

In these stories, we get to understand how people work. By going with the “engineer was deficient, needs training, manager needs to be told to train” approach, we might not have a place to ask questions aimed at recommendations for the future, such as:

  • What might we put in place so that it’s very difficult to put that code into production accidentally?
  • What sources of confidence for engineers could we augment?
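
To make the first of those questions concrete: a guardrail-style recommendation can be as small as a pre-deploy check that asks for the usual sources of confidence before a change reaches production. The sketch below (in Python) is hypothetical; the check names and the change metadata are invented for illustration and are not part of any real deploy pipeline.

    # A hypothetical sketch of a guardrail-style remediation: refuse a production
    # deploy unless the change carries the usual sources of confidence.
    # The check names and the "change" metadata are invented for illustration.
    import sys

    REQUIRED_CONFIDENCE = ("reviewed", "tests_passed", "verified_on_staging")

    def guard_production_deploy(change):
        """Exit loudly if any source of confidence is missing for this change."""
        missing = [check for check in REQUIRED_CONFIDENCE if not change.get(check)]
        if missing:
            sys.exit("Refusing production deploy; missing: " + ", ".join(missing))
        print("All confidence checks passed; proceeding with deploy.")

    if __name__ == "__main__":
        # In a real pipeline this metadata would be gathered automatically.
        guard_production_deploy({
            "reviewed": True,
            "tests_passed": True,
            "verified_on_staging": False,  # the condition the guard will catch
        })

The point is not the specific checks; it’s that the recommendation targets the conditions of work rather than the person.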

As part of those stories, we’re looking to understand people’s local rationality. When it comes to decisions and actions, we want to know how it made sense for someone to do what they did. And make no mistake: they thought what they were doing made sense. Otherwise, they wouldn’t have done it.

[Slide: asking “why?” versus asking “how?”]

Again, I’m not original with this thought. Local rationality (or as Herb Simon called it, “bounded rationality”) is something that sits firmly atop some decades of cognitive science.

These stories we’re looking for contain details that we can pull on and ask more about, which is critical as a facilitator of a post-mortem debriefing, because people don’t always know what details are important. As you’ll see later in this post, reality doesn’t work like a DVR; you can’t pause, rewind and fast-forward at will along a singular and objective axis, picking up all of the pieces along the way, acting like CSI. Memories are faulty and perspectives are limited, so a different approach is necessary.

Not just “how”

In order to get at these narratives, you need to dig for second stories. Asking “why?” will only get you first stories. These are not only insufficient answers; they can be very damaging to an organization, depending on the context. As a refresher…

From Behind Human Error here’s the difference between “first” and “second” stories of human error:

First Stories vs. Second Stories:

  • First story: Human error is seen as the cause of failure. Second story: Human error is seen as the effect of systemic vulnerabilities deeper inside the organization.
  • First story: Saying what people should have done is a satisfying way to describe failure. Second story: Saying what people should have done doesn’t explain why it made sense for them to do what they did.
  • First story: Telling people to be more careful will make the problem go away. Second story: Only by constantly seeking out its vulnerabilities can organizations enhance safety.


Now, read again the straw-man example of the Five Whys above. The questions that we ask frame the answers that we will get in the form of first stories. When we ask more and better questions (such as “how?”) we have a chance at getting at second stories.

You might wonder: how did I get from the Five Whys to the topic of ‘human error’? Because once ‘human error’ is a candidate to reach for as a cause (and it will be, because it’s a simple and potentially satisfying answer to “why?”), then you will undoubtedly use it.

At the beginning of my tutorial in New York, I asked the audience this question:

[Slide: “Is this right?”]

At the beginning of the talk, a large number of people said yes, this is correct. By the end of the talk, I was able to convince them that this is the wrong focus for a post-mortem description. Steven Shorrock (who is speaking at Velocity next week in Barcelona on this exact topic) has written a great article on this way of thinking: If It Weren’t For The People.

This idea accompanies the Five Whys more often than not, and there are two things that I’d like to shine some light on about it:

Myth of the “human or technical failure” dichotomy

This is dualistic thinking, and I don’t have much to add to this other than what Dekker has said about it (Dekker, 2006):

“Was the accident caused by mechanical failure or by human error? It is a stock question in the immediate aftermath of a mishap. Indeed, it seems such a simple, innocent question. To many it is a normal question to ask: If you have had an accident, it makes sense to find out what broke. The question, however, embodies a particular understanding of how accidents occur, and it risks confining our causal analysis to that understanding. It lodges us into a fixed interpretative repertoire. Escaping from this repertoire may be difficult. It sets out the questions we ask, provides the leads we pursue and the clues we examine, and determines the conclusions we will eventually draw.”

Myth: during a retrospective investigation, something is waiting to be “found”

I’ll cut to the chase: there is nothing waiting to be found, or “revealed.” These “causes” that we’re thinking we’re “finding”? We’re constructing them, not finding them. We’re constructing them because we are the ones that are choosing where (and when) to start asking questions, and where/when to stop asking the questions. We’ve “found” a root cause when we stop looking. And in many cases, we’ll get lazy and just chalk it up to “human error.”

As Erik Hollnagel has said (Hollnagel, 2009, p. 85):

“In accident investigation, as in most other human endeavours, we fall prey to the What-You-Look-For-Is-What-You-Find or WYLFIWYF principle. This is a simple recognition of the fact that assumptions about what we are going to see (What-You-Look-For), to a large extent will determine what we actually find (What-You-Find).”

More to the point: “What-You-Look-For-Is-What-You-Fix”

We think there is something like the cause of a mishap (sometimes we call it the root cause, or primary cause), and if we look in the rubble hard enough, we will find it there. The reality is that there is no such thing as the cause, or primary cause, or root cause. Cause is something we construct, not find. And how we construct causes depends on the accident model that we believe in. (Dekker, 2006)

Nancy Leveson comments on this idea in her excellent book Engineering a Safer World (p. 20):

Subjectivity in Selecting Events

The selection of events to include in an event chain is dependent on the stopping rule used to determine how far back the sequence of explanatory events goes. Although the first event in the chain is often labeled the ‘initiating event’ or ‘root cause’ the selection of an initiating event is arbitrary and previous events could always be added.

Sometimes the initiating event is selected (the backward chaining stops) because it represents a type of event that is familiar and thus acceptable as an explanation for the accident or it is a deviation from a standard [166]. In other cases, the initiating event or root cause is chosen because it is the first event in the backward chain for which it is felt that something can be done for correction.

The backward chaining may also stop because the causal path disappears due to lack of information. Rasmussen suggests that a practical explanation for why actions by operators actively involved in the dynamic flow of events are so often identified as the cause of an accident is the difficulty in continuing the backtracking “through” a human [166].

A final reason why a “root cause” may be selected is that it is politically acceptable as the identified cause. Other events or explanations may be excluded or not examined in depth because they raise issues that are embarrassing to the organization or its contractors or are politically unacceptable.

Learning is the goal. Any prevention depends on that learning.

So if not the Five Whys, then what should you do? What method should you take?

I’d like to suggest an alternative, which is to first accept the idea that you have to actively seek out and protect the stories from bias (and judgement) when you ask people “how?”-style questions. Then you can:

  • Ask people for their story without any replay of data that would supposedly ‘refresh’ their memory
  • Tell their story back to them and confirm you got their narrative correct
  • Identify critical junctures
  • Progressively probe and re-build how the world looked to people inside of the situation at each juncture.

As a starting point for those probing questions, we can look to Gary Klein and Sidney Dekker for the types of questions you can ask instead of “why?”…

Debriefing Facilitation Prompts

(from The Field Guide To Understanding Human Error, by Sidney Dekker)

At each juncture in the sequence of events (if that is how you want to structure this part of the accident story), you want to get to know:

  • Which cues were observed? (What did he or she notice or see, or not notice that he or she had expected to notice?)
  • What knowledge was used to deal with the situation? Did participants have any experience with similar situations that was useful in dealing with this one?
  • What expectations did participants have about how things were going to develop, and what options did they think they had to influence the course of events?
  • How did other influences (operational or organizational) help determine how they interpreted the situation and how they would act?

Here are some questions Gary Klein and his researchers typically ask to find out how the situation looked to people on the inside at each of the critical junctures:

Cues
  • What were you seeing?
  • What were you focused on?
  • What were you expecting to happen?

Interpretation
  • If you had to describe the situation to your colleague at that point, what would you have told them?

Errors
  • What mistakes (for example, in interpretation) were likely at this point?

Previous knowledge/experience
  • Were you reminded of any previous experience?
  • Did this situation fit a standard scenario?
  • Were you trained to deal with this situation?
  • Were there any rules that applied clearly here?
  • Did any other sources of knowledge suggest what to do?

Goals
  • What were you trying to achieve?
  • Were there multiple goals at the same time?
  • Was there time pressure or other limitations on what you could do?

Taking Action
  • How did you judge you could influence the course of events?
  • Did you discuss or mentally imagine a number of options, or did you know straight away what to do?

Outcome
  • Did the outcome fit your expectation?
  • Did you have to update your assessment of the situation?

Communications
  • What communication medium(s) did you prefer to use? (phone, chat, email, video conference, etc.)
  • Did you make use of more than one communication channel at once?

Help
  • Did you ask anyone for help?
  • What signal brought you to ask for support or assistance?
  • Were you able to contact the people you needed to contact?

For the tutorials I did at Velocity, I made a one-pager of these: http://bit.ly/DebriefingPrompts

[Image: the debriefing prompts one-pager]
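
If a one-pager isn’t handy, the same prompts can live wherever your team keeps its debriefing notes or tooling. Here is a minimal sketch in Python (my own arrangement of a subset of the prompts above, not anything canonical from Dekker or Klein) that prints a per-juncture checklist for a facilitator:

    # A minimal sketch of a facilitator's checklist, grouped per critical juncture.
    # The grouping is my own arrangement of a subset of the prompts in this post,
    # not a canonical structure from Dekker or Klein; trim or extend it to taste.
    DEBRIEFING_PROMPTS = {
        "Cues": [
            "What were you seeing?",
            "What were you focused on?",
            "What were you expecting to happen?",
        ],
        "Knowledge/experience": [
            "Were you reminded of any previous experience?",
            "Did this situation fit a standard scenario?",
        ],
        "Goals": [
            "What were you trying to achieve?",
            "Were there multiple goals at the same time?",
        ],
        "Taking action": [
            "How did you judge you could influence the course of events?",
        ],
        "Help": [
            "Did you ask anyone for help?",
            "What signal brought you to ask for support or assistance?",
        ],
    }

    def print_checklist(juncture):
        """Print the prompts for one critical juncture in the event timeline."""
        print("--- Juncture: %s ---" % juncture)
        for category, questions in DEBRIEFING_PROMPTS.items():
            print(category + ":")
            for question in questions:
                print("  [ ] " + question)

    if __name__ == "__main__":
        # Hypothetical junctures; in practice they come out of the debriefing itself.
        for juncture in ("first alert received", "decision to roll back"):
            print_checklist(juncture)

However you store them, the prompts only work if the facilitator treats them as openings for narrative, not as a form to be filled in.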

Try It

I have tried to outline some of my reasoning on why using the Five Whys approach is suboptimal, and I’ve given an alternative. I’ll do one better and link you to the tutorials that I gave in New York in October, which I think dig deeper into these concepts. The tutorial is in four parts, 45 minutes each.

Part I – Introduction and the scientific basis for post-hoc retrospective pitfalls and learning

Part II – The language of debriefings, causality, case studies, teams coping with complexity

Part III – Dynamic fault management, debriefing prompts, gathering and contextualizing data, constructing causes

Part IV – Taylorism, normal work, ‘root cause’ of software bugs in cars, Q&A

My request is that the next time you would do a Five Whys, you instead ask “how?” or the variations of the questions I posted above. If you think you get more operational data from a Five Whys and are happy with it, rock on.

If you’re more interested in this alternative and the fundamentals behind it, then there are a number of sources you can look to. You could do a lot worse than starting with Sidney Dekker’s Field Guide To Understanding Human Error.

An Explanation

For those readers who think I’m being unnecessarily harsh on the Five Whys approach, I think it’s worthwhile to explain why I feel so strongly about this.

Retrospective understanding of accidents and events is important because how we make sense of the past greatly and almost invisibly influences our future. At some point in the not-so-distant past, the domain of web engineering was about selling books online and making a directory of the web. These organizations and the individuals who built them quickly gave way to organizations that now build cars, spacecraft, trains, aircraft, medical monitoring devices…the list goes on…simply because software development and distributed systems architectures are at the core of modern life.

The software worlds and the non-software worlds have collided and will continue to do so. More and more “life-critical” equipment and products rely on software and even the Internet.

Those domains have had varied success in retrospective understanding of surprising events, to say the least. Investigative approaches that are firmly based on causal oversimplification and the “Bad Apple Theory” of deficient individual attributes (like the Five Whys) have been shown not only to be unhelpful, but to objectively make learning harder, not easier. As a result, people who have made mistakes or been involved in accidents have been fired, banned from their profession, and thrown in jail for some of the very things that you could find in a Five Whys.

I sometimes feel nervous that these oversimplifications will still be around when my daughter and son are older. If they were to make a mistake, would they be blamed as a cause? I strongly believe that we can leave these old ways behind us and do much better.

My goal is not to vilify an approach, but to state explicitly that if the world is to become safer, then we have to eschew this simplicity; it will only get better if we embrace the complexity, not ignore it.


Epilogue: The Longer Version For Those Who Have The Stomach For Complexity Theory

The Five Whys approach follows a Newtonian-Cartesian worldview. This is a worldview that is seductively satisfying and compellingly simple. But it’s also false in the world we live in.

What do I mean by this?

There are five ways in which the Five Whys sits firmly in a Newtonian-Cartesian worldview that we should eschew when it comes to learning from past events. This is a CliffsNotes version of “The complexity of failure: Implications of complexity theory for safety investigations” (Dekker, Cilliers, & Hofmeyr, 2011):

First, it is reductionist. The narrative built by the Five Whys sits on the idea that if you can construct a causal chain, then you’ll have something to work with. In other words: to understand the system, you pull it apart into its constituent parts. Know how the parts interact, and you know the system.

Second, it assumes what Dekker has called “cause-effect symmetry” (Dekker, complexity of failure):

“In the Newtonian vision of the world, everything that happens has a definitive, identifiable cause and a definitive effect. There is symmetry between cause and effect (they are equal but opposite). The determination of the ‘cause’ or ‘causes’ is of course seen as the most important function of accident investigation, but assumes that physical effects can be traced back to physical causes (or a chain of causes-effects) (Leveson, 2002). The assumption that effects cannot occur without specific causes influences legal reasoning in the wake of accidents too. For example, to raise a question of negligence in an accident, harm must be caused by the negligent action (GAIN, 2004). Assumptions about cause-effect symmetry can be seen in what is known as the outcome bias (Fischhoff, 1975). The worse the consequences, the more any preceding acts are seen as blameworthy (Hugh and Dekker, 2009).”

John Carroll (Carroll, 1995) called this “root cause seduction”:

The identification of a root cause means that the analysis has found the source of the event and so everyone can focus on fixing the problem. This satisfies people’s need to avoid ambiguous situations in which one lacks essential information to make a decision (Frisch & Baron, 1988) or experiences a salient knowledge gap (Loewenstein, 1993). The seductiveness of singular root causes may also feed into, and be supported by, the general tendency to be overconfident about how much we know (Fischhoff, Slovic, & Lichtenstein, 1977).

That last bit about a tendency to be overconfident about how much we know (in this context, how much we know about the past) is a strong piece of research put forth by Baruch Fischhoff, who originally researched what we now understand to be the Hindsight Bias. Not surprisingly, Fischhoff’s doctoral thesis advisor was Daniel Kahneman (you’ve likely heard of him as the author of Thinking, Fast and Slow), whose research in cognitive biases and heuristics everyone should at least be vaguely familiar with.

The third issue with this worldview, supported by the idea of the Five Whys and following logically from the earlier points, is that it treats outcomes as foreseeable if you know the initial conditions and the rules that govern the system. The reason you would even construct a serial causal chain like this is because you believe that, given those initial conditions and rules, the outcome could have been predicted (and therefore prevented).

The fourth part of this is the assumption that time is reversible; that we can look to a causal chain as something you can fast-forward and rewind at will, no matter how attractively simple that seems. We can’t, because the socio-technical systems that we work on and work in are complex in nature, and are dynamic. Deterministic behavior (or, at least, predictability) is something that we look for in software; in complex systems this is a foolhardy search, because emergence is a property of this complexity.

And finally, there is an underlying assumption that complete knowledge is attainable. In other words: we only have to try hard enough to understand exactly what happened. The issue with this is that success and failure have many contributing causes, and there is no comprehensive and objective account. The best that you can do is to probe people’s perspectives at juncture points in the investigation. It is not possible to understand past events in any way that can be considered comprehensive.

Dekker (Dekker, 2011):

As soon as an outcome has happened, whatever past events can be said to have led up to it, undergo a whole range of transformations (Fischhoff and Beyth, 1975; Hugh and Dekker, 2009). Take the idea that it is a sequence of events that precedes an accident. Who makes the selection of the “events” and on the basis of what? The very act of separating important or contributory events from unimportant ones is an act of construction, of the creation of a story, not the reconstruction of a story that was already there, ready to be uncovered. Any sequence of events or list of contributory or causal factors already smuggles a whole array of selection mechanisms and criteria into the supposed “re”construction. There is no objective way of doing this—all these choices are affected, more or less tacitly, by the analyst’s background, preferences, experiences, biases, beliefs and purposes. “Events” are themselves defined and delimited by the stories with which the analyst configures them, and are impossible to imagine outside this selective, exclusionary, narrative fore-structure (Cronon, 1992).

Here is a thought exercise: what if we were to try to use the Five Whys for finding the “root cause” of a success?

Why didn’t we have failure X today?

Now this question gets a lot more difficult to have one answer. This is because things go right for many reasons, and not all of them obvious. We can spend all day writing down reasons why we didn’t have failure X today, and if we’re committed, we can keep going.

So if success requires “multiple contributing conditions, each necessary but only jointly sufficient” to happen, then how is it that failure only requires just one? The Five Whys, as it’s commonly presented as an approach to improvement (or: learning?), will lead us to believe that not only is just one condition sufficient, but that condition is a canonical one, to the exclusion of all others.

* RCA, or “Root Cause Analysis” can also easily turn into “Retrospective Cover of Ass”

References

Carroll, J. S. (1995). Incident Reviews in High-Hazard Industries: Sense Making and Learning Under Ambiguity and Accountability. Organization & Environment, 9(2), 175–197. doi:10.1177/108602669500900203

Dekker, S. (2004). Ten questions about human error: A new view of human factors and system safety. Mahwah, N.J: Lawrence Erlbaum.

Dekker, S., Cilliers, P., & Hofmeyr, J.-H. (2011). The complexity of failure: Implications of complexity theory for safety investigations. Safety Science, 49(6), 939–945. doi:10.1016/j.ssci.2011.01.008

Hollnagel, E. (2009). The ETTO principle: Efficiency-thoroughness trade-off: Why things that go right sometimes go wrong. Burlington, VT: Ashgate.

Leveson, N. (2012). Engineering a Safer World. MIT Press.


Paradigm Check Point: Prefacing Debriefings

I’m a firm believer in restating values, goals, and perspectives at the beginning of every group debriefing (e.g. “postmortem meetings”) in order to bring new folks up to speed on how we view the process and what the purpose of the debriefing is.

When I came upon a similar baselining dialogue from another domain, I thought I’d share…

[Image: excerpt from the Facilitated Learning Analysis Implementation Guide]

  • Risk is in everything we do. Short of never doing anything, there is no way to avoid all risk or ever to be 100% safe.
  • How employees (at any level) perceive, anticipate, interpret, and react to risk is systematically connected to conditions associated with the design, systems, features, and culture of the workplace.
  • “Risk does not exist ‘out there,’ independent of our minds and culture, waiting to be measured. Human beings have invented the concept of ‘risk’ to help them understand and cope with the dangers and the uncertainties of life. Although these dangers are real, there is no such thing as a ‘real risk’ or ‘objective risk.’”*
  • The best definition of “safety” is: the reasonableness of risk. It is a feeling. It is not an absolute. It is personal and contextual and will vary between people even within identical situations.
  • While safety is an essential business practice, our agency does not exist to be safe or to protect our employees. We exist to accomplish a mission as efficiently as possible, knowing that many activities we choose to perform are inherently hazardous (for example, deployment, data migration, code commits, on-call response, editing configurations, and even powering on a device on the network).
  • Mistakes, errors, and lapses are normal and inevitable human behaviors. So are optimism and fatalism. So are taking shortcuts to save time and effort. So are under- and over-estimating risk. In spite of this, our work systems are generally designed for the optimal worker, not the normal one.
  • Essentially every risk mitigation (every safety precaution) carries some level of “cost” to production or compromise to efficiency. One of the most obvious is the cost of training. Employees at all levels (administrators, safety advisors, system designers, and front-line employees) are continuously, and often subconsciously, estimating, balancing, optimizing, managing, and accepting these subtle and nuanced tradeoffs between safety and production.
  • All successful systems, organizations, and individuals will trend toward efficiency over thoroughness (production over protection) over time until something happens (usually an accident or a close call) that changes their perception of risk. This creativity and drive for efficiency is what makes people, businesses and agencies successful.
  • Our natural intuition (our common sense) is to let outcomes draw the line between success and failure and to base safety programs on outcomes. This is shortsighted and eventually dangerous. Using the science of risk management is more potent and robust. Importantly, Risk Management is wholly concerned with managing risks, not outcomes. Risk management is counterintuitive.
  • Employees directly involved in the event did not expect that the accident was going to happen. They expected a positive outcome. If this is not the case, then you’re not dealing with an accident.

*Paul Slovic, as quoted in Daniel Kahneman, Thinking, Fast and Slow (Farrar, Straus and Giroux, 2011), p. 141.
The above is excerpted from the Facilitated Learning Analysis Implementation Guide, US Forest Service, Wildland Fire Operations.

Learning from Failure at Etsy

(This was originally posted on Code As Craft, Etsy’s engineering blog. I’m re-posting it here because it still resonates strongly as I prepare to teach a ‘postmortem facilitator’ course internally at Etsy.)

Last week, Owen Thomas wrote a flattering article over at Business Insider on how we handle errors and mistakes at Etsy. I thought I might give some detail on how that actually happens, and why.

Anyone who’s worked with technology at any scale is familiar with failure. Failure cares not about the architecture designs you slave over, the code you write and review, or the alerts and metrics you meticulously pore through.

So: failure happens. This is a foregone conclusion when working with complex systems. But what about those failures that have resulted due to the actions (or lack of action, in some cases) of individuals? What do you do with those careless humans who caused everyone to have a bad day?

Maybe they should be fired.

Or maybe they need to be prevented from touching the dangerous bits again.

Or maybe they need more training.

This is the traditional view of “human error”, which focuses on the characteristics of the individuals involved. It’s what Sidney Dekker calls the “Bad Apple Theory” – get rid of the bad apples, and you’ll get rid of the human error. Seems simple, right?

We don’t take this traditional view at Etsy. We instead want to view mistakes, errors, slips, lapses, etc. with a perspective of learning. Having blameless Post-Mortems on outages and accidents is part of that.

A Blameless Post-Mortem

What does it mean to have a ‘blameless’ Post-Mortem?
Does it mean everyone gets off the hook for making mistakes? No.

Well, maybe. It depends on what “gets off the hook” means. Let me explain.

Having a Just Culture means that you’re making effort to balance safety and accountability. It means that by investigating mistakes in a way that focuses on the situational aspects of a failure’s mechanism and the decision-making process of individuals proximate to the failure, an organization can come out safer than it would normally be if it had simply punished the actors involved as a remediation.

Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account of:

  • what actions they took at what time,
  • what effects they observed,
  • expectations they had,
  • assumptions they had made,
  • and their understanding of the timeline of events as they occurred.

…and that they can give this detailed account without fear of punishment or retribution.
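
To give a sense of what that detailed account can look like in practice, here is a rough sketch in Python (my own illustration, not a prescribed Etsy format) of the kind of timeline entry a debriefing might capture for each action:

    # A rough sketch (my own illustration, not a prescribed format) of the detail
    # a blameless debriefing tries to capture for each action in the timeline.
    from dataclasses import dataclass
    from datetime import datetime
    from typing import List

    @dataclass
    class TimelineEntry:
        when: datetime                # what action was taken, at what time
        action: str
        observed_effects: List[str]   # what effects they observed
        expectations: List[str]       # what they expected to happen
        assumptions: List[str]        # assumptions they had made at the time
        notes: str = ""               # their understanding of events as they occurred

    # A hypothetical entry, for illustration only:
    entry = TimelineEntry(
        when=datetime(2012, 5, 22, 14, 5),
        action="Deployed the config change enabling the new subsystem",
        observed_effects=["Dashboards looked normal for about ten minutes"],
        expectations=["Traffic would shift gradually to the new code path"],
        assumptions=["Behavior on staging would match behavior in production"],
    )
    print(entry.action)

The structure matters far less than the fact that the engineer supplies these details voluntarily, in their own words, and without fear of how they will be used.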

Why shouldn’t they be punished or reprimanded? Because an engineer who thinks they’re going to be reprimanded is disincentivized to give the details necessary to get an understanding of the mechanism, pathology, and operation of the failure. This lack of understanding of how the accident occurred all but guarantees that it will repeat. If not with the original engineer, then with another one in the future.

We believe that this detail is paramount to improving safety at Etsy.

If we go with “blame” as the predominant approach, then we’re implicitly accepting that deterrence is how organizations become safer. This is founded in the belief that individuals, not situations, cause errors. It’s also aligned with the idea that there has to be some fear that not doing one’s job correctly could lead to punishment. Because the fear of punishment will motivate people to act correctly in the future. Right?

This cycle of name/blame/shame can be looked at like this:

  1. Engineer takes action and contributes to a failure or incident.
  2. Engineer is punished, shamed, blamed, or retrained.
  3. Reduced trust between engineers on the ground (the “sharp end”) and management (the “blunt end”) looking for someone to scapegoat
  4. Engineers become silent on details about actions/situations/observations, resulting in “Cover-Your-Ass” engineering (from fear of punishment)
  5. Management becomes less aware and informed on how work is being performed day to day, and engineers become less educated on lurking or latent conditions for failure due to silence mentioned in #4, above
  6. Errors more likely, latent conditions can’t be identified due to #5, above
  7. Repeat from step 1

We need to avoid this cycle. We want the engineer who has made an error to give details about why (either explicitly or implicitly) they did what they did; why the action made sense to them at the time. This is paramount to understanding the pathology of the failure. The action made sense to the person at the time they took it, because if it hadn’t made sense to them at the time, they wouldn’t have taken the action in the first place.

The base fundamental here is something Erik Hollnagel has said:

We must strive to understand that accidents don’t happen because people gamble and lose.
Accidents happen because the person believes that:
…what is about to happen is not possible,
…or what is about to happen has no connection to what they are doing,
…or that the possibility of getting the intended outcome is well worth whatever risk there is.

A Second Story

This idea of digging deeper into the circumstance and environment that an engineer found themselves in is called looking for the “Second Story”. In Post-Mortem meetings, we want to find Second Stories to help understand what went wrong.

From Behind Human Error here’s the difference between “first” and “second” stories of human error:

First Stories vs. Second Stories:

  • First story: Human error is seen as the cause of failure. Second story: Human error is seen as the effect of systemic vulnerabilities deeper inside the organization.
  • First story: Saying what people should have done is a satisfying way to describe failure. Second story: Saying what people should have done doesn’t explain why it made sense for them to do what they did.
  • First story: Telling people to be more careful will make the problem go away. Second story: Only by constantly seeking out its vulnerabilities can organizations enhance safety.

Allowing Engineers to Own Their Own Stories

A funny thing happens when engineers make mistakes and feel safe when giving details about it: they are not only willing to be held accountable, they are also enthusiastic in helping the rest of the company avoid the same error in the future. They are, after all, the most expert in their own error. They ought to be heavily involved in coming up with remediation items.

So technically, engineers are not at all “off the hook” with a blameless PostMortem process. They are very much on the hook for helping Etsy become safer and more resilient, in the end. And lo and behold: most engineers I know find this idea of making things better for others a worthwhile exercise.

So what do we do to enable a “Just Culture” at Etsy?

  • We encourage learning by having these blameless Post-Mortems on outages and accidents.
  • The goal is to understand how an accident could have happened, in order to better equip ourselves to prevent it from happening in the future.
  • We seek out Second Stories, gather details from multiple perspectives on failures, and we don’t punish people for making mistakes.
  • Instead of punishing engineers, we instead give them the requisite authority to improve safety by allowing them to give detailed accounts of their contributions to failures.
  • We enable and encourage people who do make mistakes to be the experts on educating the rest of the organization how not to make them in the future.
  • We accept that there is always a discretionary space where humans can decide to take action or not, and that the judgement of those decisions lies in hindsight.
  • We accept that the Hindsight Bias will continue to cloud our assessment of past events, and work hard to eliminate it.
  • We accept that the Fundamental Attribution Error is also difficult to escape, so we focus on the environment and circumstances people are working in when investigating accidents.
  • We strive to make sure that the blunt end of the organization understands how work is actually getting done (as opposed to how they imagine (or hope) it’s getting done, via Gantt charts and procedures) on the sharp end.
  • The sharp end is relied upon to inform the organization where the line is between appropriate and inappropriate behavior. This isn’t something that the blunt end can come up with on its own.

Failure happens. In order to understand how failures happen, we first have to understand our reactions to failure.

One option is to assume the single cause is incompetence and scream at engineers to make them “pay attention!” or “be more careful!”

Another option is to take a hard look at how the accident actually happened, treat the engineers involved with respect, and learn from the event.

That’s why we have blameless Post-Mortems at Etsy, and why we’re looking to create a Just Culture here.

On Being A Senior Engineer

I think that there’s a lot of institutional knowledge in our field, especially about what makes for a productive engineer. But while there are a good number of books in the management field about “expert” roles and responsibilities of non-technical individual contributors, I don’t see too many modern books or posts that might shed light directly on what makes for a good senior engineer. One notable exception is of course Kate Matsudaira, who has been posting quite a lot recently about the cultural sides of engineering.

Yet at the same time, a good number of successful engineers whom I have known all remember the mentor who taught them what it meant to be “senior”.

I do, however, agree 100% with my friend Theo’s words about being “senior” in his chapter of the Web Operations book by O’Reilly:

“Generation X (and even more so generation Y) are cultures of immediate gratification. I’ve worked with a staggering number of engineers that expect the “career path” to take them to the highest ranks of the engineering group inside 5 years just because they are smart. This is simply impossible in the staggering numbers I’ve witnessed. Not everyone can be senior. If, after five years, you are senior, are you at the peak of your game? After five more years will you not have accrued more invaluable experience? What then? “Super engineer”? Five more years? “Super-duper engineer.” I blame the youth of our discipline for this affliction. The truth is that there are very few engineers that have been in the field of web operations for fifteen years. Given the dynamics of our industry many elected to move on to managerial positions or risk an entrepreneurial run at things.”

He’s right: this field of web operations is still quite young. So we can’t be surprised when people who have a title of ‘senior’ exhibit unsurprisingly immature behavior, both technical and non-technical. If you haven’t read Theo’s chapter, I suggest you do.

Having said that, what does it actually mean to be ‘senior’ in this discipline? I certainly have an opinion of what it means, given that I’m charged with hiring, supporting, and retaining engineers who are deemed to be senior. This notion that there is a bar to be passed in terms of career development is a good one, but I’d also add that these criteria exist on a spectrum, as opposed to a simple list of check-boxes. You don’t wake up one day and find that you are “senior” just because your title reflects that upon a promotion. Senior engineers don’t know everything. They’re not perfect in their technical knowledge, and they’re OK with that.

In order not to confuse titles with expectations that are fuzzy, sometimes I’ll refer to engineering maturity.

Meaning: I expect a “senior” engineer to be a mature engineer.

I’m going to gloss over the part where one could simply list the technical areas in which a mature engineer should have some level of mastery or understanding (such as “Networking”, “Filesystems”, “Algorithms”, etc.) and instead highlight the personal characteristics that in my mind give me indication that someone can influence an organization or a business positively in the domain of engineering.

Over on Quora, someone once asked me “What are the attributes (other than technical ability/experience) that makes a great VP of Technical Operations?”. The list of attributes that I mentioned in the answer came with the understanding that they are perpetual aspirations of my own. This post is similar to that answer.

I might first argue that senior engineers in web development and operations have the same characteristics as senior engineers in other fields of engineering (mechanical, electrical, chemical, etc.) in which case The Unwritten Laws of Engineering are applicable. Again, if you haven’t read this, please go do so. It was originally written in 1944, published by the American Society of Mechanical Engineers. A good excerpt from the book is here.

While the book’s structure and prose still have a dated feel (“…refrain from using profanity in the workplace…” or “…men should pay particular attention to shaving habits and the trimming of beards and mustaches…”), it gives a good outline of the non-technical expectations, responsibilities, and inner workings of an engineering organization with respect to how both managers and mature engineers might behave.

Obligatory Pithy Characteristics of Mature Engineers

All posts that attempt to give insight into aspirational characteristics must have an over-abundance of bullet points, and the field of engineering has a fair share of them. Therefore, I’m going to give you some: some mine, and some pulled from various sources, many from the Unwritten Laws mentioned above.

Mature engineers seek out constructive criticism of their designs.

Every successful engineer I’ve met, upon finishing up a design or getting ready for a project, will continually ask their peers questions along the lines of:

  • “What could I be missing?”
  • “How will this not work?”
  • “Will you please shoot as many holes as possible into my thinking on this?”
  • “Even if it’s technically sound, is it understandable enough for the rest of the organization to operate, troubleshoot, and extend it?”

This is because they know that nothing they make will ever only be in their hands, and that good peer review is what makes better design decisions. As it’s been said elsewhere, they “beg for the bad news.”

Mature engineers understand the non-technical areas of how they are perceived.

Being able to write a Bloom Filter in Erlang, or write multi-threaded C in your sleep is insufficient. None of that matters if no one wants to work with you. Mature engineers know that no matter how complete, elegant, or superior their designs are, it won’t matter if no one wants to work alongside them because they are assholes. Condescension, belittling, narcissism, and ego-boosting behavior send the message to other engineers (maybe tacitly) to stay away. Part of being happy in engineering comes from enjoying the company of the people you work with while designing and building things. An engineer who is quick to call someone a moron is someone destined to stunt his or her career.

This also means that mature engineers have self-awareness when it comes to their communication. This isn’t to say that every mature engineer communicates perfectly, only that they have some notion about where they could be better, and continually ask for a gut-check from peers and managers on how they’re doing. They aim to be assertive, not passive or aggressive in how they get their ideas across.

I’ve mentioned it elsewhere, but I must emphasize the point more: the degree to which other people want to work with you is a direct indication of how successful you’ll be in your career as an engineer. Be the engineer that everyone wants to work with.

Now this isn’t to say that you should shy away from giving (or getting) constructive criticism on the work produced by engineering (as opposed to the engineer personally), for fear of pissing someone off. There’s a difference between calling someone a moron and pointing out faults in their code or product. In a conversation with Theo, he pointed out another possible area where our field may grow up:

“We as an industry need to (of course) refrain from critiques of human character and condition, but not shy away from critiques of work product. We need to get tougher skin and be able to receive critique through a lens that attempts to eliminate personal focus.

There will be assholes, they should be shunned. But the attitude that someone’s code is their baby should come to an end. Code doesn’t have feelings, doesn’t develop complexes and certainly doesn’t exhibit the most important trait (the ability to reproduce) of that which carries for your genetic strains.”

See also below #2 and #10 in The Ten Commandments of Egoless Programming.

I think this has a corollary from the Unwritten Laws (emphasis mine):

Be careful about whom you mark for copies of letters, memos, etc., when the interests of other departments are involved.

A lot of mischief has been caused by young people broadcasting memorandum containing damaging or embarrassing statements. Of course it is sometimes difficult for a novice to recognize the “dynamite” in such a document but, in general, it is apt to cause trouble if it steps too heavily upon someone’s toes or reveals a serious shortcoming on anybody’s part. If it has wide distribution or if it concerns manufacturing or customer difficulties, you’d better get the boss to approve it before it goes out unless you’re very sure of your ground.

This of course underscores the dated feel of the book, but in the modern era, I still believe the main point to be true. Nothing indicates that you have a lack of perspective and awareness like a poorly thought out and nonconstructive tweet that slings venomous insults. It’s a junior engineer mistake to toss insults about a piece of complex technology in 140 characters.

I certainly (much like Christopher Brown mentioned in his keynote at Velocity London) pay attention to those sorts of public remarks when I come across them so that I can note who I would reconsider hiring if they ever applied to work at Etsy.

Mature engineers do not shy away from making estimates, and are always trying to get better at it.

From the Unwritten Laws:

Promises, schedules, and estimates are necessary and important instruments in a well-ordered business. Many engineers fail to realize this, or habitually try to dodge the irksome responsibility for making commitments. You must make promises based upon your own estimates for the part of the job for which you are responsible, together with estimates obtained from contributing departments for their parts. No one should be allowed to avoid the issue by the old formula, “I can’t give a promise because it depends upon so many uncertain factors.”

Avoiding responsibility for estimates is another way of saying, “I’m not ready to be relied upon for building critical pieces of infrastructure.” All businesses rely on estimates, and all engineers working on a project are involved in Joint Activity, which means that they have a responsibility to others to make themselves interpredictable. In general, mature engineers are comfortable with working within some nonzero amount of uncertainty and risk.

Mature engineers have an innate sense of anticipation, even if they don’t know they do.

This code looks good, I’m proud of myself. I’ve asked other people to review it, and I’ve taken their feedback. Now: how long will it last before it’s rewritten? Once it’s in production, how will its execution affect resource usage? How much do I expect CPU/memory/disk/network usage to increase or decrease? Will others be able to understand this code? Am I making it as easy as I can for others to extend or introspect this work?

Mature engineers understand that not all of their projects are filled with rockstar-on-stage work.

However menial and trivial your early assignments may appear, give them your best effort.

Getting things done means doing things you might not be interested in. No matter how sexy a project is, there are always boring tasks. Tedious tasks. Tasks that a less mature engineer may deem beneath their dignity or their job title. My good friend Kellan Elliott-McCrea (Etsy’s CTO) had this to say about it:

“Sometimes the saving grace of a tedious task is their simplicity and maturity manifests in finishing them quickly and moving on. Sometimes tasks are tedious because they require extreme discipline and malleable attention span. It’s an odd phenomena that the most tedious tasks, only to be carried out by the most senior engineers, can also be the most terrifying.”

Mature engineers lift the skills and expertise of those around them.

They recognize that at some point, their individual contribution and potential cannot be exercised singularly. They recognize that there is only so much that can be produced by a single person, and the world’s best engineering feats are executed by teams, not singularly brilliant and lone engineers. Tom Limoncelli makes this point quite well in his post.

At Etsy we call this a “generosity of spirit.” Generosity of spirit is one of our core engineering values, but also a primary responsibility of our Staff Engineer position, a career-level position. These engineers spend the time to make sure that more junior or new engineers unfamiliar with the tech or processes we have not only understand what they are doing, but also why they are doing it. “Teaching to fish” is a mandatory skill at this level, and that requires having both patience and a perspective of investment in the rest of the organization.

Therefore instead of: “OK, move over, lemme just do it for you”, it’s instead: “Ok, let’s work on this together. I can show you how I’m writing/troubleshooting/etc. Then, you do it so I can be sure you know why/how we’re doing it this way, etc.”

Related: see below about getting credit.

Mature engineers make their trade-offs explicit when making judgements and decisions.

They realize all engineering decisions, implementations, and designs exist within a spectrum; we do not live in a binary world. They can quickly point out contexts where one successful approach or solution could work and where it could not. They know that one cannot be both efficient and thorough at the same time (The ETTO Principle), that most projects engineers work on exist on an axis of optimality and brittleness, and whether the problems they are solving are acute or chronic.

They know that they work within a spectrum of ideal and non-ideal, and are OK with that. They are comfortable with it because they strive to make the ideal and non-ideal in a design explicit. Later on in the lifecycle of a design, when the original design is not scaling anymore or needs to be replaced or rewritten, they can look back not with a perspective of how short-sighted those earlier decisions were, but instead say “yep, we made it this far with it and knew we’d have to extend or change it at some point. Looks like that time is now, let’s get to work!” rather than responding with a cranky-pants, passive-aggressive, Hindsight Bias-filled remark full of counterfactuals (e.g., “those idiots didn’t do it right the first time!”, “they cut corners!”, “I TOLD them this wouldn’t work!”).

Many pithy quotes exist that shine light on this notion of trade-offs, and mature engineers know that there are limits to any philosophy-laden quotes (including the ones I’m writing here):

  • “Premature optimization is the root of all evil.” – a very abused maxim, and I’ve written about it before. A corollary to that might be (taken from here) ‘Understanding what is and isn’t “premature” is what separates senior engineers from junior engineers.’
  • “Right tool for the job” – another abused one. The intention here is reasonable: who wants to use a tool that isn’t appropriate? But a rarer perspective is that this can be detrimental when taken to the extreme. A carpenter doesn’t arm himself with every variation and size of hammer that is available, even though he may encounter hammering tasks that could be ideally handled by each one. Why? Because lugging around (and maintaining) a gazillion hammers incurs a cost. As such, decisions on this axis have trade-offs.

The tl;dr on trade-offs is that everyone cuts corners, in every project. Immature engineers discover them in hindsight, disgusted. Mature engineers spell them out at the outset of a project, accept them, and recognize them as part of good engineering.

(Related: Your Code May Be Elegant, But Mine Fucking Works)

Mature engineers don’t practice CYAE (“Cover Your Ass Engineering”)

The scenario where someone will stand on ceremony as an excuse for not attempting to understand how his or her code (or infrastructure) could be touched by other parts of the system or business is a losing proposition. Covering your ass sends the implicit message that you are someone willing to throw others (on your team? in your company? in your community?) under the proverbial bus at the mere hint that your work had any flaw. Mature engineers stand up and accept the responsibility given to them. If they find they don’t have the requisite authority to be held accountable for their work, they seek out ways to rectify that.

An example of CYAE is “It’s not my fault. They broke it, they used it wrong. I built it to spec, I can’t be held responsible for their mistakes or improper specification.”

Mature engineers are empathetic.

In complex projects, there are usually a number of stakeholders. In any project, the designers, product managers, operations engineers, developers, and business development folks all have goals and perspectives, and mature engineers realize that those goals and views may be different. They understand this so that they can navigate effectively in the work that they do. Being empathetic in this sense means having the ability to view the project from another person’s perspective and to take that into consideration in your own work.

Goal conflicts are inherent in all engineering work, and complaining about them (instead of embracing them as requirements for success) is a sign of a less mature engineer.

They don’t make empty complaints.

Instead, they express judgements based on empirical evidence and bring with those judgements options for solving the problem which they’ve identified. A great manager of mine said to never go to your boss with a complaint about anything without at least one (ideally more than one) suggestion for a solution. Even demonstrating that you’ve tried working the problem on your own and came up empty-handed is better than an empty complaint.

Mature engineers are aware of cognitive biases

This isn’t to say that every mature engineer needs to have a degree in psychology, but cognitive biases can limit the growth of an engineer’s career at a certain point. Even if they’re not aware of the details of how these biases appear or how they can be guarded against, most mature engineers I know have a level of self-awareness to at least recognize that they (like everyone) are susceptible to them.

Culturally, engineers work day-to-day with empirical evidence and research. Basically: show me the data. The issue with cognitive biases is that we can be blissfully unaware of when we are interpreting data with our own brains in ways that defy that empirical evidence, and this can have a surprising effect on how we get work done and work on teams.

A great list of them exists on Wikipedia, but some of the ones that I’ve seen engineers (including myself) fall prey to are:

  • Self-Serving Bias – basically: if something is good, it’s probably because of something I did or thought of. If it’s bad, it’s probably the doing of someone else.
  • Fundamental Attribution Error – basically: the bad results that someone else got from his work must have something to do with how he is, personally (stupid, clumsy, sloppy, etc.) whereas if I get bad results, it’s because of the context that I was in, the pressure I was under, the situation I was in, etc.
  • Hindsight Bias – (it is said that this is the most-studied phenomenon in the history of modern psychology) basically: after an untoward or negative event (a severe bug, an outage, etc.) “I knew it all along!”. It is the very strong tendency to view the past more simply than it was in reality. You can tell there is Hindsight Bias going on when descriptions involve counterfactuals, or “…they should have…”, or “…how did they not see that, it’s so obvious!”.
  • Outcome Bias – like above, this comes up after a surprising or negative event. If the event was very damaging, expensive to clean up, or severe, then the decisions or actions that contributed to that event are judged to be very stupid, reckless, or negligent. The judgement is proportional to how severe the event was.
  • Planning Fallacy – (related to the point about making estimates under uncertainty, above) basically: being overly optimistic when forecasting the time a particular project will take.

There are plenty of others, all of which I find personally fascinating; I can get lost in learning more about them. Highly suggested reading, if you’re at all interested in learning about how you might be limiting your own effectiveness.

The Ten Commandments of Egoless Programming

Appropriate, even if old…I’ve seen it referenced as coming from The Psychology of Computer Programming, written in 1971, but I don’t actually see it in the text. Regardless, here are The Ten Commandments of Egoless Programming, found on @wyattdanger‘s blog post on receiving advice from his dad:

  1. Understand and accept that you will make mistakes. The point is to find them early, before they make it into production. Fortunately, except for the few of us developing rocket guidance software at JPL, mistakes are rarely fatal in our industry. We can, and should, learn, laugh, and move on.
  2. You are not your code. Remember that the entire point of a review is to find problems, and problems will be found. Don’t take it personally when one is uncovered. (Allspaw note – related: see below, #10, and the points Theo made above.)
  3. No matter how much “karate” you know, someone else will always know more. Such an individual can teach you some new moves if you ask. Seek and accept input from others, especially when you think it’s not needed.
  4. Don’t rewrite code without consultation. There’s a fine line between “fixing code” and “rewriting code.” Know the difference, and pursue stylistic changes within the framework of a code review, not as a lone enforcer.
  5. Treat people who know less than you with respect, deference, and patience. Non-technical people who deal with developers on a regular basis almost universally hold the opinion that we are prima donnas at best and crybabies at worst. Don’t reinforce this stereotype with anger and impatience.
  6. The only constant in the world is change. Be open to it and accept it with a smile. Look at each change to your requirements, platform, or tool as a new challenge, rather than some serious inconvenience to be fought.
  7. The only true authority stems from knowledge, not from position. Knowledge engenders authority, and authority engenders respect – so if you want respect in an egoless environment, cultivate knowledge.
  8. Fight for what you believe, but gracefully accept defeat. Understand that sometimes your ideas will be overruled. Even if you are right, don’t take revenge or say “I told you so.” Never make your dearly departed idea a martyr or rallying cry.
  9. Don’t be “the coder in the corner.” Don’t be the person in the dark office emerging only for soda. The coder in the corner is out of sight, out of touch, and out of control. This person has no voice in an open, collaborative environment. Get involved in conversations, and be a participant in your office community.
  10. Critique code instead of people – be kind to the coder, not to the code. As much as possible, make all of your comments positive and oriented to improving the code. Relate comments to local standards, program specs, increased performance, etc.

Novices versus Experts

Now, I generally don’t follow knowledge acquisition as a research topic too closely, but I do believe it’s hard to get away from when talking about the evolving nature of a discipline. One interesting breakdown comes from a paper by Dreyfus and Dreyfus called “A Five Stage Model of the Mental Activities Involved in Directed Skill Acquisition”, which lays out characteristics of various levels of expertise:

Novice
  • Rigid adherence to rules or plans
  • Little situational perception
  • No (or limited) discretionary judgment
Advanced Beginner
  • Guidelines for action based on attributes and aspects, which are all equal and separate
  • Limited situational perception
Competent
  • Conscious deliberate planning
  • Standardized and routine procedures
Proficient
  • Sees situations holistically rather than as aspects
  • Perceives deviations from normal patterns
  • Uses maxims for guidance, whose meanings are contextual
Expert
  • No longer relies on rules, guidelines or maxims
  • Intuitive grasp of situations
  • Analytic approach used only in novel situations

The paper goes on to state:

Novices operate from an explicit rules- and knowledge-based perspective. They are deliberate and analytical, and therefore slower to take action; they decide or choose.

(which means that novices are deeply subject to local rationality)

Experts operate from a mature, holistic, well-tried understanding, intuitively and without conscious deliberation. This is a function of experience. They do not see problems as one thing and solutions as another; they act.

(which means that experts are context driven)

I don’t necessarily subscribe to the notion of such dry lines being drawn between skill levels, because I think there is a lot more granularity to expertise, and many more facets of it, than those outlined above, but I think it’s helpful to be aware of these (unfortunately over-simplified) categories.

Dirty secret: mature engineers know the importance of (sometimes irrational) feelings people have. (gasp!)

How people feel about technologies, technical decisions, and technical directions is just as important (if not more) than the facts about the details. Mature engineers know this, and adjust accordingly. Again, being empathetic can help you understand how another person on your team feels about a technical decision, even if they themselves don’t have an easy time articulating why they feel that way.

People’s confidence in software, architectures, or patterns is heavily influenced by past experience, and can result in positive or negative reactions to using them. Used to work at a mod_perl shop that had a lot of mystifying outages? Then don’t be surprised to feel reluctant to use it at a different company, even if the supporting expertise and use cases are entirely different. All you remember is that mod_perl = major headaches, so you’re going to be wary of using it in any context again.

Mature engineers understand this phenomenon when making a case to use technology that carries baggage, even if that baggage is irrational. Convincing a group to use tools and patterns that they aren’t comfortable with isn’t a straightforward task. The “right tool for the job” maxim also has (sometimes unquantifiable) comfort as a parameter.

For an illustration of how people’s emotions drive technical decisions and opinions, read any flame war about anything, ever.

“It is amazing what you can accomplish if you do not care who gets credit.”

This quote is commonly attributed to Harry S. Truman, but it looks like it might have first been said by a Jesuit priest in a different form. In any case, this is another indication you’re working with a mature engineer: they hold the success of the project much higher than the potential praise they may get personally for working on it. The attribution of praise or credit can be the source of such dysfunction in an engineering-driven organization, and I believe it’s because it’s largely invisible.

The notion is liberating, and once understood and internalized, a world of progress and innovative thinking can flourish, because the engineer isn’t overly concerned with the personal liability of equating the work to their own career success.

Not The End

I’m at the moment blessed to work with a number of mature engineers here at Etsy, and it’s quite humbling. We are indeed a young field, and while I think we can learn a great deal from other fields of engineering on this topic, I also think we have an advantage. The web is inextricably tied to the notion of publishing and sharing information, globally. We need to continue pointing out what it means to be a “senior” and “mature” engineer if we have a hope of progressing the field into a true discipline.

Many thanks to members of the Etsy Operations team, Mike Brittain, Kellan Elliott-McCrea, Marc Hedlund, and Theo Schlossnagle for reviewing drafts of this post. They all make me a more mature engineer.

Convincing management that cooperation and collaboration was worth it

While searching around for something else, I came across this note I sent in late 2009 to the executive leadership of Yahoo’s Engineering organization. This was when I was leaving Flickr to work at Etsy. My intent in sending it was to be open with the rest of Yahoo about how things worked at Flickr, and why. I did this in the hope that other Yahoo properties could learn from that team’s process and culture, which we worked really hard at building and keeping.

The idea that Development and Operations could:

  • Share responsibility/accountability for availability and performance
  • Have an equal seat at the table when it came to application and infrastructure design, architecture, and emergency response
  • Build and maintain a deferential culture to each other when it came to domain expertise
  • Cultivate equanimity when it came to emergency response and post-mortem meetings

…wasn’t evenly distributed across other Yahoo properties, from my limited perspective.

But I knew (still know) lots of incredible engineers at Yahoo who weren’t being supported as well as they could be by their upper management. So sending this letter was driven by wanting to help their situation. Don’t get me wrong, not everything was rainbows and flowers at Flickr, but we certainly had a lot more of them than other Yahoo groups.

When I re-read this, I’m reminded that when I came to Etsy, I wasn’t entirely sure that any of these approaches would work in the Etsy Engineering environment. The engineering staff at Etsy was a lot larger than Flickr’s and continuous deployment was in its infancy when I got there. I can now happily report that 2 years later, these concepts not only solidified at Etsy, they evolved to accommodate a lot more than what challenged us at Flickr. I couldn’t be happier about how it’s turned out.

I’ll note that there’s nothing groundbreaking in this note I sent, and nothing that I hadn’t said publicly in a presentation or two around the same time.

This is the note I sent to the three layers of management above me in my org at Yahoo:

Subject: Why Flickr went from 73rd most popular Y! property in 2005 to the 6th, 5 years later.

Below are my thoughts about some of the reasons why Flickr has had success, from an Operations Engineering manager’s point of view.

When I say everyone below, I mean all of the groups and sub-groups within the Flickr property: Product, Customer Care, Development, Service Engineering, Abuse and Advocacy, Design, and Community Management.

Here are at least some of the reasons we had success:

  • Product included and respected everyone’s thoughts, in almost every feature and choice.
  • Everyone owned availability of the site, not just Ops.
  • Community management and customer service were involved early and often. In everything. If they weren’t, it was an oversight taken seriously, and would be fixed.
  • Development and Operations had zero divide when it came to availability and performance. No, really. They worked in concert, involving each other in their own affairs when it mattered, and trusting each other every step of the way. This culture was taught, not born.
  • I have never viewed Flickr Operations as firefighters, and have never considered Flickr Dev Engineering to be arsonists. (I have heard this analogy elsewhere in Yahoo.) The two teams are 100% equal partners, with absolute transparency. If anything, we had a problem with too much deference given between the two teams.
  • The site was able to evolve, change, and grow as fast as it needed to, as long as it was made safe to do so. To be specific: code and config deploys. When it wasn’t safe, we slowed down, and everyone was fine with that happening, knowing that the goal was to return to fast-as-we-need-to-be. See above about everyone owning availability.
  • Developers were able to see their work almost instantly in production. Institutionalized fear of degradation and outage ensured that changes were as safe as they needed to be. Developers and Ops engineers knew intuitively that the safety net you have is the one that you have built for yourself. When changes are small and frequent, the causes of degradation or outage due to code deploys are exceptionally transparent to all involved. (Re-read above about everyone owning availability.)
  • We never deployed “early and often” because:
    • it was a trend,
    • we wanted to brag,
    • or we thought we were better than anyone. (We did it because it was right for Flickr to do so.)
  • Everyone was made aware of any launches that had risks associated with them, and we worked on lists of things that could possibly go wrong and what we would do in the event they did go wrong. Sometimes we missed things and had to think quickly, but those times were rare with new feature launches.
  • Flickr Ops had always had the “go or no-go” decision, as did other groups who could vote with respect to their preparedness. A significant part of my job was working towards saying “go”, not “no-go”. In fact, almost all of it.

Examples: the most boring (anti-climactic, from an operational perspective) launches ever

  • Flickr Video: I actually held the launch back by some hours until we could rectify a networking issue that I thought posed a risk to post-launch traffic. Other than that, it was a switch in the application that was turned from off to on. The feature’s code had been on prod servers for months in beta. See ‘Dark Launches’ below.
  • Homepage redesign: An unprecedented amount of activity data being pulled onto the logged-in homepage, and an order of magnitude increase in the number of calls to backend databases. Why was it boring? Because it was dark launched 10 days earlier. The actual launch was a flip of the ‘on’ switch.
  • People In Photos (aka, ‘people tagging’): Because the feature required data that we didn’t actually have yet, we couldn’t exactly dark launch it. It was a feature that had to be turned on, or off. Because of this, Flickr’s Architect wrote out a list of all of the parts of the feature that could cause load-related issues, what the likelihood of each was, how to turn those parts of the feature off, what customer care effects it might have, and what contingencies would probably require some community management involvement.

Dark Launches

When we already have the data on the backend needed to display a new feature, we ‘dark launch’ it, meaning that the code makes all of the back-end calls (i.e., the calls that bring load-related risk to the deploy) and simply throws the data away, never showing it to the user. We can then safely increase or decrease the percentage of traffic making those calls, since we never risk the user experience by showing users a new feature and then having to take it away because of load issues.
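To make the mechanics concrete, here is a minimal sketch of what a dark launch gate could look like; the flag names, the bucketing scheme, and the PHP itself are my own illustration, not Flickr’s actual code:

    <?php
    // Dark launch: make the risky backend calls for a percentage of requests,
    // then throw the results away so users never see the feature.
    $cfg = ['darklaunch_activity_feed_pct' => 10]; // ramp up/down to adjust backend load

    function in_dark_launch(array $cfg, string $feature, int $userId): bool
    {
        $pct = $cfg["darklaunch_{$feature}_pct"] ?? 0;
        return ($userId % 100) < $pct; // stable per-user bucketing
    }

    function fetch_activity_feed(int $userId): array
    {
        // Stand-in for the expensive backend calls that carry the load-related risk.
        return ['items' => range(1, 10), 'user' => $userId];
    }

    $userId = 4242;
    if (in_dark_launch($cfg, 'activity_feed', $userId)) {
        fetch_activity_feed($userId); // exercise the backend, then discard the result
    }
    echo "the page renders exactly as it did before the dark launch\n";

The only knob anyone has to touch during a ramp is the percentage in the config, which keeps the change small, visible, and easy to reverse.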

Dark launching increases everyone’s confidence almost to the point of apathy, as far as fear of load-related issues is concerned. I have no idea how many code deploys were made to production on any given day over the past 5 years (although I could find it on a graph easily), because for the most part I don’t care: changes made to production this way have such a low chance of causing issues. When they have caused issues, everyone on the Flickr staff can find, on a webpage, when the change was made, who made the change, and exactly (line-by-line) what the change was.

In cases where we had confidence in the resource consumption of a feature, but not 100% confidence in its functionality, the feature was turned on for staff only. I’d say that about 95% of the features we launched in those 5 years were turned on for staff long before they were turned on for the entire Flickr population. When we still didn’t feel 100% confident, we slowly ramped up the percentage of Flickr members who could see and use the new feature.
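The visible side of a rollout can be gated the same way. Here is a sketch of a config-driven check that shows a feature to staff first and then to a growing percentage of members; again, the names are made up for illustration rather than taken from the real codebase:

    <?php
    // Gate a visible feature: staff first, then ramp up a percentage of members.
    $cfg = [
        'feature_newthing_staff_only' => 1, // start with staff
        'feature_newthing_pct'        => 0, // raise this once confidence grows
    ];

    function sees_feature(array $cfg, string $feature, int $userId, bool $isStaff): bool
    {
        if (!empty($cfg["feature_{$feature}_staff_only"])) {
            return $isStaff;
        }
        $pct = $cfg["feature_{$feature}_pct"] ?? 0;
        return ($userId % 100) < $pct;
    }

    var_dump(sees_feature($cfg, 'newthing', 4242, true));  // bool(true): staff sees it
    var_dump(sees_feature($cfg, 'newthing', 4242, false)); // bool(false): members don't, yet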

Config Flags

We have many pieces of Flickr that are encapsulated as ‘feature’ flags, which look as simple as: $cfg[disable_feature_video] = 0; This allows the site to be much more resilient to specific failures. If we have any degradation within a certain feature, in many cases we can simply turn that feature off instead of taking the entire site down. These ‘flags’ have, in the past, been prioritized through conversations with Product, so there is an easy choice to make if something goes wrong and site uptime comes into conflict with feature uptime.
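As an illustration of how such a flag can shed a single feature instead of the whole site, here is a small sketch; the function name and the fallback behavior are my own assumptions, not the actual Flickr implementation:

    <?php
    // Degrade one feature gracefully rather than taking the entire site down.
    $cfg = ['disable_feature_video' => 0]; // flip to 1 during a video-related degradation

    function feature_enabled(array $cfg, string $feature): bool
    {
        return empty($cfg["disable_feature_{$feature}"]);
    }

    if (feature_enabled($cfg, 'video')) {
        echo "render the video player\n";                  // normal path
    } else {
        echo "show a 'video is temporarily off' notice\n"; // shed the feature, keep the site up
    }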

This is an extremely important point: Dark Launches and Config Flags were concepts and tools created by Flickr Development, not Flickr Operations, even though the end result of each points toward a typical Operations goal: stability and availability. This is a key distinction. These are initiatives made by Engineering leadership because devs feel protective of the availability of the site, respectful of Operations’ responsibilities, and because it’s just plain good engineering.

If Flickr Operations had built these tools and approaches to keeping the site stable, I do not believe we would have had the same amount of success.

There is more on this topic here: http://code.flickr.com/blog/2009/12/02/flipping-out/

Summary

Flickr Operations is in an enviable position in that they don’t have to convince anyone in the Flickr property that:

    1. Operations has ‘go or no-go’ decision-making power, along with every other subgroup.
    2. Spending time, effort, and money to ensure stable feature launches before they launch is the rule, not the exception.
    3. Continuous Deployment is better for the availability of the site.
    4. Flickr Operations should be involved as early as possible in the development phase of any project.

These things are taken for granted. Any other way would simply feel weird.

I have no idea if posting this letter helps anyone other than myself, but there you go.

Systems Engineering: A great definition.

Ben Rockwood said something last December about the re-emergence of the Systems Engineer and I agree with him, 100%.

NASA Systems Engineering Handbook, 2007

To add to that, I’d like to quote the excellent NASA Systems Engineering handbook’s introduction. The emphasis is mine:

Systems engineering is a methodical, disciplined approach for the design, realization, technical management, operations, and retirement of a system. A “system” is a construct or collection of different elements that together produce results not obtainable by the elements alone. The elements, or parts, can include people, hardware, software, facilities, policies, and documents; that is, all things required to produce system-level results. The results include system-level qualities, properties, characteristics, functions, behavior, and performance. The value added by the system as a whole, beyond that contributed independently by the parts, is primarily created by the relationship among the parts; that is, how they are interconnected. It is a way of looking at the “big picture” when making technical decisions. It is a way of achieving stakeholder functional, physical, and operational performance requirements in the intended use environment over the planned life of the systems. In other words, systems engineering is a logical way of thinking.

Systems engineering is the art and science of developing an operable system capable of meeting requirements within often opposed constraints. Systems engineering is a holistic, integrative discipline, wherein the contributions of structural engineers, electrical engineers, mechanism designers, power engineers, human factors engineers, and many more disciplines are evaluated and balanced, one against another, to produce a coherent whole that is not dominated by the perspective of a single discipline.

Systems engineering seeks a safe and balanced design in the face of opposing interests and multiple, sometimes conflicting constraints. The systems engineer must develop the skill and instinct for identifying and focusing efforts on assessments to optimize the overall design and not favor one system/subsystem at the expense of another. The art is in knowing when and where to probe. Personnel with these skills are usually tagged as “systems engineers.” They may have other titles—lead systems engineer, technical manager, chief engineer— but for this document, we will use the term systems engineer.

The exact role and responsibility of the systems engineer may change from project to project depending on the size and complexity of the project and from phase to phase of the life cycle. For large projects, there may be one or more systems engineers. For small projects, sometimes the project manager may perform these practices. But, whoever assumes those responsibilities, the systems engineering functions must be performed. The actual assignment of the roles and responsibilities of the named systems engineer may also therefore vary. The lead systems engineer ensures that the system technically fulfills the defined needs and requirements and that a proper systems engineering approach is being followed. The systems engineer oversees the project’s systems engineering activities as performed by the technical team and directs, communicates, monitors, and coordinates tasks. The systems engineer reviews and evaluates the technical aspects of the project to ensure that the systems/subsystems engineering processes are functioning properly and evolves the system from concept to product. The entire technical team is involved in the systems engineering process.

I would imagine that every successful organization understands this concept of systems engineering, but I don’t think I’ve ever seen it put so well.

NASA’s engineers have both common and conflicting goals, just like we do in web operations. They weigh trade-offs in efficiency and thoroughness, and wade into the constraints of better, cheaper, faster, and hopefully: more resilient.

This re-emergence of the systems engineering (or “full-stack” engineering) notion is excellent and exciting to me, and I’m hoping that when everyone in our field hears “DevOps” (and/or *Ops, as Theo says it), what they take it to mean is this systems engineering view.


Training Organizational Resilience in Escalating Situations

This little ramble of thoughts is related to my upcoming talk at Velocity, but I know I’ll never get to this part at the conference, so I figured I’d post about it here.

Building resilience from a systems point of view means (amongst other things) understanding how your organization deals with failure and unexpected situations. Generally this means having development and operations teams that can work well together under pressure, with fluctuating amounts of uncertainty, bringing their own domain expertise to the table when it matters.

This is what drives some of my favorite Ops candidate interview questions. Knowing Unix commands, network architectures, database behaviors, and scripting languages is obviously required, but comprises only one facet of the gig. The real mettle comes from being able to easily zoom in and out of the whole system under scrutiny, splitting up troubleshooting responsibilities amongst your team (and trusting their results), and differentiating red herring symptoms from truly related ones. It also comes from things like:

  • Staying away from distracting conversation during the outage response. Nothing kills a TTR like unrelated talk in IRC or a conf call.
  • Trusting your information. This is where the UI challenges of dashboard design can make or break an outage response. “Are those units milli, or mega?”
  • Balancing too much communication and too little amongst team members. Troubleshooting outage verbosity is a fickle mistress.
  • Avoiding stomping on each other’s actions. OneThingAtATime™ methods aren’t easy to stick to, especially when things escalate.
  • Keeping outage fatigue at bay, and recognizing when brains are melting and need to take a break.

To make matters worse, determining causality can be tenuous at best when you’re working with complex systems, so being able to recognize when a failure has a single root cause (hint: with the big outages – almost never) and when it has multiple contributing causes is a skill that isn’t easily gained without seeing a lot of action in the past.

So it’s not a surprise that working well within a team under stressful scenarios is something other fields try to train people for.  Trauma surgeons, FBI agents, military teams, air traffic control, etc. all have drills, exercises, and simulations for teaching these skills, but they are all done within the context of what those escalating situations look like in their specific fields.

So this brings a question that has come up before in my circles:

Can this sort of organizational resilience be taught, within the context of web operations?

GameDay exercises could certainly be one avenue for testing and training team-based outage response, but most of the focus there (at least those discussed publicly by companies who hold GameDay exercises) is testing the infrastructure and application-level components, and even then under controlled conditions and relatively narrow failure modes.

So the confidence-building value of GameDay drills lies elsewhere; they don’t really exercise the cognitive load that real-world failures (like the recent spectacular Amazon AWS outage) can put on the humans, i.e., the troubleshooting dev and ops teams.

But! Some smart folks have been thinking about this question, at a higher-level:

Is it possible to construct non-contextual and generic drills that can train competencies for this sort of on-the-fly, making-sense-of-unfamiliar-failure-modes, and sometimes disorienting troubleshooting?

At Lund University in Sweden, there’s an excellent article on building organizational resilience in escalating situations, which I believe resulted in a chapter in the Resilience Engineering in Practice book; it also references another excellent article by David Woods and Emily Patterson called How Unexpected Events Produce An Escalation Of Cognitive And Coordinative Demands.

The parts I want to highlight here are best practices for designing scenarios meant to train these skills. If you’re looking to design a good drill meant to educate and/or train Ops and Devs on what cognitive muscles to develop for handling large-scale outages, this is a pretty damn good list (quoted from both of those sources above):

  • Try to force people beyond their learned roles and routines. The scenario can contain problems that are not solvable within those roles or routines, and forces people to step out of those roles and routines.
  • Contain a number of hidden goals, at various times during the scenario, that people could pursue (e.g. different ways of escaping the situation or de-escalating it), but that they have to vocalize and articulate in order to begin to achieve them (as they cannot do so by themselves).
  • Include potential actions of which the consequences are both important and difficult to foresee (and that might significantly influence people’s ability to control the problem in the near future). This can force people into pro-active thinking and articulation of their expectations of what might happen.
  • Be able to trap people in locking onto one solution that everybody is fixedly working towards. This can be done by garden-pathing; making the escalating problem look initially (with strong cues) like something the crew is already familiar with, but then letting it depart (with much weaker cues) to see whether the crew is caught on the garden path and lets the situation escalate.
  • Or the scenario, by creating so much cognitive noise in terms of new warnings and events, should be able to trip people into thematic vagabonding—the tendency to redirect attention and change diagnosis with each incoming data piece, which results in a fragmentation of problem-solving.

Think that such a scenario could be constructed?

I want to think so, but of course nothing teaches like the hindsight of a real production outage, eh? 🙂

Resilience Engineering: Part I

I’ve been drafting this post for a really long time. Like most posts, it’s largely for me to get some thoughts down. It’s also very related to the topic I’ll be talking about at Velocity later this year.

When I gave a keynote talk at the Surge Conference last year, I talked about how our field of web engineering is still young, and would do very well to pay attention to other fields of engineering, since I suspect that we have a lot to learn from them. Contrary to popular belief, concepts such as fault tolerance, redundancy of components, sacrificial parts, automatic safety mechanisms, and capacity planning weren’t invented with the web. As it turns out, some of those ideas have been studied and put into practice in other fields for decades, if not centuries.

Systems engineering, control theory, reliability engineering…the list goes on for where we should be looking for influences, and other folks have noticed this as well. As our field recognizes the value of taking a “systems” (the C. West Churchman definition, not the computer software definition) view on building and managing infrastructures with a “Full Stack Programmer” perspective, we should pull our heads out of our echo chamber every now and again, because we can gain so much from lessons learned elsewhere.

Last year, I was lucky to convince Dr. Richard Cook to let us include his article “How Complex Systems Fail” in Web Operations. Some months before, I had seen the article and began to poke around Dr. Cook’s research areas: human error, cognitive systems engineering, safety, and a relatively new multi-discipline area known as Resilience Engineering.

What I found was nothing less than exhilarating and inspirational, and it’s hard for me to not consider this research mandatory reading for anyone involved with building or designing socio-technical systems. (Hint: in web operations, we all are.) Frankly, I haven’t been this excited since I saw Jimmy Page in a restaurant once in the mid-90s. Even though Dr. Cook (and others in his field, like Erik Hollnagel, David Woods, and Sidney Dekker) historically have written about and researched resilience in the context of aviation, space transportation, healthcare, and manufacturing, their findings strike me as incredibly appropriate to web operations and development.

Except, of course, accidents in our field don’t actually harm or kill people. But they almost always involve humans, machines, high stress, and high expectations.

Some of the concepts in resilience engineering run contrary to the typical (or stereotypical) perspectives that I’ve found in operations management, and that’s what I find so fascinating. I’m especially interested in organizational resilience, and the realization that safety in systems develops not in spite of us messy humans, but because of us.

For example:

Historical approaches taken towards improving “safety” in production might not be best

Conventional wisdom might have you believe that the systems we build are basically safe, and that all they need is protection from unreliable humans. This logically stems from the myth that all outages/degradations occur as the result of a change gone wrong, and I suspect this idea also comes from Root Cause Analysis write-ups ending with “human error” at the bottom of the page. But Dekker, Woods, and others in Behind Human Error suggest that listing human error as a root cause isn’t where you should end, it’s where you should start your investigation. Getting behind what led to a ‘human error’ is where the good stuff happens, but unless you’ve got a safe political climate (i.e., no one is going to get punished or fired for making mistakes) you’ll never get at how and why the error was made. Which means that you will ignore one of the largest opportunities to make your system (and organization) more efficient and resilient in the face of incidents. Mismatches, slips, lapses, and violations…each one of those types of error can lead to different ways of improving. And of course, working out the motivations and intentions of people who have made errors isn’t straightforward, especially engineers who might not have enough humility to admit to making an error in the first place.

Root Cause Analysis can be easily misinterpreted and abused

The idea that failures in complex systems can literally have a singular ‘root’ cause, as if failures are the result of linear steps in time, is just incorrect. Not only is it almost always incorrect, but in practice that perspective can be harmful to an organization: it allows management and others to feel better about improving safety when they’re not, because the solution(s) can be viewed as simple and singular fixes (in reality, they’re not). James Reason’s pioneering book Human Error is enlightening on these points, to say the least. In reality (and I am as guilty of this as anyone) there are motivations to reduce complex failures to singular/linear models, tipping the scales on what Hollnagel refers to as an ETTO, or Efficiency-Thoroughness Trade-Off, which I think will sound familiar to anyone working in a web startup. Because why spend extra time digging to find details of that human error-causing outage, when you have work to do? Plus, if you linger too long in that postmortem meeting, people are going to feel even worse about making a mistake, and that’s just cruel, right? 🙂

Postmortems or accident investigations are not the only way an organization can improve “safety”

Only looking at failures to guide your designs, tools, and processes drastically minimizes your ability to improve, Hollnagel says. Instead of looking only at the things that go wrong, looking at the things that go right is a better strategy to improve resiliency. Personally, I think that engineering teams who practice continuous deployment intuitively understand this. Small and frequent changes made to production by a growing number of developers reflect a particular culture of safety, whether the teams know it or not. It requires what Hollnagel refers to as a “constant sense of unease”, and this awareness of failure is what helps bridge that stereotypical development and operations divide.

Resilience should be a 4th management objective, alongside Better/Faster/Cheaper

The definition goes like this:

Resilience is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions. Since resilience is about being able to function, rather than being impervious to failure, there is no conflict between productivity and safety.

This sounds like one of those commonsense ideas, right? In an extremely self-serving way, I find some validation in that definition that optimizing for MTTR is better than optimizing for MTBF. My gut says that this shouldn’t be shocking or a revelation; it’s what mature engineering is all about.

Safety might not come from the sources you think it comes from

“…so safety isn’t about the absence of something…that you need to count errors or monitor violations, and tabulate incidents and try to make those things go away…..it’s about the presence of something. But the presence of what? When we find that things go right under difficult circumstances, it’s mostly because of people’s adaptive capacity; their ability to recognize, adapt to, and absorb changes and disruptions, some of which might fall outside of what the system is designed or trained to handle.”

– Sidney Dekker

My plan is to post more about these topics, because there are just too many ideas to explain in a single go. Apparently, Ashgate Publishing has owned this space, with a whole series of books. The newest one, Resilience Engineering in Practice, is in my bag, and I can’t put it down. Examples of these ideas in real-world scenarios (hospital and medical ops, power plants, air traffic control, financial services) are juicy with details, and the chapter “Lessons from the Hudson” goes into excellent detail about the trade-offs that go on in the mind of someone in high-stress failure scenarios, like Chesley Sullenberger.

I’ll end on this decent introduction to some of the ideas, which includes the above quote from Sidney Dekker. There’s some distracting camera work, but the ideas get across:

MTTR is more important than MTBF (for most types of F)

This week I gave a talk at QCon SF about development and operations cooperation at Etsy and Flickr.  It’s a refresh of talks I’ve given in the past, with more detail about how it’s going at Etsy. (It’s going excellently 🙂 )

There’s a bunch of topics in the presentation slides, all centered around roles, responsibilities, and intersection points of domain expertise commonly found in development and operations teams. One of the not-groundbreaking ideas that I’m finally getting down is something that should be evident for anyone practicing or interested in ‘continuous deployment’:

Being able to recover quickly from failure is more important than having failures less often.

This has what should be an obvious caveat: some types of failures shouldn’t ever happen, and not all failures/degradations/outages are the same (failures resulting in accidental data loss, for example).

Put another way:

MTTR is more important than MTBF

(for most types of F)

(Edited: I did say originally “MTTR > MTBF”)
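To make the arithmetic behind that claim concrete (these numbers are made up for illustration, not taken from the talk): steady-state availability is roughly MTBF / (MTBF + MTTR), so a service that fails four times as often but recovers in minutes can still come out ahead of one that fails rarely but stays down for hours.

    <?php
    // Back-of-the-envelope availability: MTBF / (MTBF + MTTR). Illustrative numbers only.
    function availability(float $mtbfHours, float $mttrHours): float
    {
        return $mtbfHours / ($mtbfHours + $mttrHours);
    }

    // Fails rarely (roughly every 6 weeks) but takes 4 hours to recover:
    printf("rarely-but-slowly:  %.3f%%\n", 100 * availability(1000, 4));   // ~99.602%

    // Fails 4x as often (roughly every 10 days) but recovers in 15 minutes:
    printf("often-but-quickly:  %.3f%%\n", 100 * availability(250, 0.25)); // ~99.900%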

What I’m definitely not saying is that failure should be an acceptable condition. I’m positing that since failure will happen, it’s just as important (or in some cases more important) to spend time and energy on your response to failure as on trying to prevent it. I agree with Hammond, when he said:

If you think you can prevent failure, then you aren’t developing your ability to respond.

In a complete steal of Artur Bergman‘s material, an example in the slides of the talk is of the Jeep versus Rolls Royce:

Artur has a Jeep, and he’s right when he says that, for the most part, Jeeps are built to optimize Mean-Time-To-Repair, not with the classical approach to automotive engineering, which is to optimize Mean-Time-Between-Failures. This is likely because Jeep owners have been beating the shit out of their vehicles for decades, and every now and again they expect that abuse to break something. Jeep designers know this, which is why it’s so damn easy to repair. Nuts and bolts are easy to reach, tools are included when you buy the thing, and if you haven’t seen the video of Army personnel disassembling and reassembling a Jeep in under 4 minutes, you’re missing out.

The Rolls Royce, on the other hand, likely doesn’t have such adventurous owners, and when it does break down, it’s a fine and acceptable thing for the car to be out of service for a long and expensive fix by the manufacturer.

We as web operations folks want our architectures to be optimized for MTTR, not for MTBF. I think the reasons should be obvious, and the fact that practices like:

  • Dark launching
  • Percentage-based production A/B rollouts
  • Feature flags

are becoming commonplace should verify this approach as having legs.

The slides from QConSF are here: