The Infinite Hows (or, the Dangers Of The Five Whys)

(this is also posted on O’Reilly’s Radar blog. Much thanks to Daniel Schauenberg, Morgan Evans, and Steven Shorrock for feedback on this)

Before I begin this post, let me say that this is intended to be a critique of the Five Whys method, not a criticism of the people who are in favor of using it.

This critique I present is hardly original; most of this post is inspired by Todd Conklin, Sidney Dekker, and Nancy Leveson.

The concept of post-hoc explanation (or “postmortems” as they’re commonly known) has, at this point, taken hold in the web engineering and operations domain. I’d love to think that the concepts that we’ve taken from the New View on ‘human error’ are becoming more widely known and that people are looking to explore their own narratives through those lenses.

I think that this is good, because my intent has always been (and might always be) to help translate concepts from one domain to another. In order to do this effectively, we also need to know what to discard (or at least inspect critically) from those other domains.

The Five Whys is one such approach that I think we should discard.

This post explains my reasoning for discarding it, and how using it has the potential to be harmful, not helpful, to an organization. Here’s how I intend to do that: I’ll first talk about what I think are deficiencies in the approach, then suggest an alternative, and then ask you to simply try the alternative yourself.

Here is the “bottom line, up front” gist of my assertions:

“Why?” is the wrong question.

In order to learn (which should be the goal of any retrospective or post-hoc investigation) you want multiple and diverse perspectives. You get these by asking people for their own narratives. Effectively, you’re asking “how?”

Asking “why?” too easily gets you to an answer to the question “who?” (which in almost every case is irrelevant), or takes you to the “mysterious” incentives and motivations people bring into the workplace.

Asking “how?” gets you to describe (at least some of) the conditions that allowed an event to take place, and provides rich operational data.

Asking a chain of “why?” assumes too much about the questioner’s choices, and assumes too much about each answer you get. At best, it locks you into a causal chain, which is not how the world actually works. This is a construction that ignores a huge amount of complexity in an event, and it’s the complexity that we want to explore if we have any hope of learning anything.

But It’s A Great Way To Get People Started!

The most compelling argument for using the Five Whys is that it’s a good first step towards doing real “root cause analysis” – my response to that is twofold:

  1. “Root Cause Analysis*” isn’t what you should be doing anyway, and
  2. It’s only a good “first step” because it’s easy to explain and understand, which makes it easy to socialize. The issue with this is that the concepts that the Five Whys depend on are not only faulty, but can be dangerous for an organization to embrace.

If the goal is learning (and it should be), then we should be confident that whatever retrospective method we use brings to light data that can be turned into actionable information. The issue with the Five Whys is that it’s tunnel-visioned into a linear and simplistic explanation of how work gets done and events transpire. This narrowing can be incredibly problematic.

In the best case, it can lead an organization to think they’re improving on something (or preventing future occurrences of events) when they’re not.

In the worst case, it can re-affirm a faulty worldview of causal simplification and set up a structure where individuals don’t feel safe in giving their narratives, because either they weren’t asked the right “why?” question or because the answer to a given question pointed to ‘human error’ or individual attributes as causal.

Let’s take an example. In my tutorials at the Velocity Conference in New York, I used an often-repeated straw man to illustrate this:

[Slide: the Five Whys straw-man example]

This is the example of the Five Whys found in the Web Operations book, as well.

This causal chain effectively ends with a person’s individual attributes, not with a description of the multiple conditions that allow an event like this to happen. Let’s look into some of the answers…

“Why did the server fail? Because an obscure subsystem was used in the wrong way.”

This answer is dependent on the outcome. We know that it was used in the “wrong” way only because we’ve connected it to the resulting failure. In other words, we as “investigators” have the benefit of hindsight. We can easily judge the usage of the server because we know the outcome. If we were to go back in time and ask the engineer(s) who were using it, “Do you think that you’re doing this right?”, they would answer: yes, they are. We want to know what the various influences were that brought them to think that, which simply won’t fit into the answer to “why?”

The answer also limits the next question that we’d ask. There isn’t any room in the dialogue to discuss things such as the potential to use a server in the “wrong” way without it resulting in failure, or what ‘wrong’ even means in this context. Can the server only be used in two ways – the ‘right’ way or the ‘wrong’ way? And does success (or the absence of a failure) dictate which of those ways it was used? We don’t get to these crucial questions.

“Why was it used in the wrong way? The engineer who used it didn’t know how to use it properly.”

This answer is effectively a tautology, and includes a post-hoc judgement. It doesn’t tell us anything about how the engineer did use the system, which would provide a rich source of operational data, especially for engineers who might be expected to work with the system in the future. Is it really just about this one engineer? Or is it possibly about the environment (tools, dashboards, controls, tests, etc.) that the engineer is working in? If it’s the latter, how does that get captured in the Five Whys?

So what do we find in this chain we have constructed above? We find:

  • an engineer with faulty (or at least incomplete) knowledge
  • insufficient indoctrination of engineers
  • a manager who fouls things up by not being thorough enough in the training of new engineers (indeed: we can make a post-hoc judgement about her beliefs)

If this is to be taken as an example of the Five Whys, then as an engineer or engineering manager, I might not look forward to it, since it focuses on our individual attributes and doesn’t tell us much about the event other than the platitude that training (and convincing people about training) is important.

These are largely answers about “who?” not descriptions of what conditions existed. In other words, by asking “why?” in this way, we’re using failures to explain failures, which isn’t helpful.

If we ask: “Why did a particular server fail?” we can get any number of answers, but one of those answers will be used as the primary way of getting at the next “why?” step. We’ll also lose out on a huge amount of important detail, because remember: you only get one question before the next step.

If instead, we were to ask the engineers how they went about implementing some new code (or ‘subsystem’), we might hear a number of things, like maybe:

  • the approach(es) they took when writing the code
  • the ways they gained confidence (tests, code reviews, etc.) that the code was going to work in the way they expected before it was deployed
  • what (if any) history of success or failure they have had with similar pieces of code
  • what trade-offs they made or managed in the design of the new function
  • how they judged the scope of the project
  • how much (and in what ways) they experienced time pressure on the project
  • the list can go on, if you’re willing to ask more and they’re willing to give more

Rather than judging people for not doing what they should have done, the new view presents tools for explaining why people did what they did. Human error becomes a starting point, not a conclusion. (Dekker, 2009)

When we ask “how?”, we’re asking for a narrative. A story.

In these stories, we get to understand how people work. By going with the “engineer was deficient, needs training, manager needs to be told to train” approach, we might not have a place to ask questions aimed at recommendations for the future, such as:

  • What might we put in place so that it’s very difficult to put that code into production accidentally?
  • What sources of confidence for engineers could we augment?
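To make the first of those questions concrete: one entirely hypothetical answer might be a deploy wrapper that refuses to ship to production without a deliberate, explicit confirmation. The sketch below is just that, a sketch under assumptions I’m making up for illustration; the flag name and the `run_deploy` stand-in aren’t from any real tool.

```python
#!/usr/bin/env python3
"""Hypothetical guardrail: make it hard to deploy to production by accident.

This is a sketch of *one possible* answer to "what might we put in place?",
not a description of any real tool. `run_deploy` is a stand-in for whatever
actually ships the code in your environment.
"""
import argparse
import subprocess
import sys


def run_deploy(environment: str) -> int:
    # Stand-in for the real deploy step (replace with your own tooling).
    return subprocess.call(["echo", f"deploying to {environment}"])


def main() -> int:
    parser = argparse.ArgumentParser(description="Deploy with a production guardrail.")
    parser.add_argument("environment", choices=["staging", "production"])
    parser.add_argument(
        "--i-really-mean-production",
        action="store_true",
        help="Required, in addition to naming the environment, for production deploys.",
    )
    args = parser.parse_args()

    if args.environment == "production":
        if not args.i_really_mean_production:
            print("Refusing: production deploys require --i-really-mean-production.")
            return 1
        if input("Type 'production' to confirm: ").strip() != "production":
            print("Confirmation did not match; aborting.")
            return 1

    return run_deploy(args.environment)


if __name__ == "__main__":
    sys.exit(main())
```

The specifics don’t matter; the point is that “how?” questions tend to surface this kind of condition-changing countermeasure rather than “be more careful” remediations.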

As part of those stories, we’re looking to understand people’s local rationality. When it comes to decisions and actions, we want to know how it made sense for someone to do what they did. And make no mistake: they thought what they were doing made sense. Otherwise, they wouldn’t have done it.

[Slide: asking “why?” versus asking “how?”]

Again, this thought isn’t original to me. Local rationality (or, as Herb Simon called it, “bounded rationality”) is something that sits firmly atop some decades of cognitive science.

These stories we’re looking for contain details that we can pull on and ask more about, which is critical as a facilitator of a post-mortem debriefing, because people don’t always know what details are important. As you’ll see later in this post, reality doesn’t work like a DVR; you can’t pause, rewind and fast-forward at will along a singular and objective axis, picking up all of the pieces along the way, acting like CSI. Memories are faulty and perspectives are limited, so a different approach is necessary.

Not just “how”

In order to get at these narratives, you need to dig for second stories. Asking “why?” will get you answers in the form of first stories. These are not only insufficient answers, they can be very damaging to an organization, depending on the context. As a refresher…

From Behind Human Error here’s the difference between “first” and “second” stories of human error:

  • First story: Human error is seen as the cause of failure. Second story: Human error is seen as the effect of systemic vulnerabilities deeper inside the organization.
  • First story: Saying what people should have done is a satisfying way to describe failure. Second story: Saying what people should have done doesn’t explain why it made sense for them to do what they did.
  • First story: Telling people to be more careful will make the problem go away. Second story: Only by constantly seeking out its vulnerabilities can organizations enhance safety.

 

Now, read again the straw-man example of the Five Whys above. The questions that we ask frame the answers that we will get in the form of first stories. When we ask more and better questions (such as “how?”) we have a chance at getting at second stories.

You might wonder: how did I get from the Five Whys to the topic of ‘human error’? Because once ‘human error’ is a candidate to reach for as a cause (and it will be, because it’s a simple and potentially satisfying answer to “why?”), then you will undoubtedly use it.

At the beginning of my tutorial in New York, I asked the audience this question:

[Slide: the question posed to the audience]

At the beginning of the talk, a large number of people said yes, this is correct. By the end of my talk, I was able to convince them that this is also the wrong focus for a post-mortem description. Steven Shorrock (who is speaking at Velocity next week in Barcelona on this exact topic) has written a great article on this way of thinking: If It Weren’t For The People.

This idea accompanies the Five Whys more often than not, and there are two things that I’d like to shine some light on about it:

Myth of the “human or technical failure” dichotomy

This is dualistic thinking, and I don’t have much to add to this other than what Dekker has said about it (Dekker, 2006):

“Was the accident caused by mechanical failure or by human error? It is a stock question in the immediate aftermath of a mishap. Indeed, it seems such a simple, innocent question. To many it is a normal question to ask: If you have had an accident, it makes sense to find out what broke. The question, however, embodies a particular understanding of how accidents occur, and it risks confining our causal analysis to that understanding. It lodges us into a fixed interpretative repertoire. Escaping from this repertoire may be difficult. It sets out the questions we ask, provides the leads we pursue and the clues we examine, and determines the conclusions we will eventually draw.”

Myth: during a retrospective investigation, something is waiting to be “found”

I’ll cut to the chase: there is nothing waiting to be found, or “revealed.” These “causes” that we’re thinking we’re “finding”? We’re constructing them, not finding them. We’re constructing them because we are the ones that are choosing where (and when) to start asking questions, and where/when to stop asking the questions. We’ve “found” a root cause when we stop looking. And in many cases, we’ll get lazy and just chalk it up to “human error.”

As Erik Hollnagel has said (Hollnagel, 2009, p. 85):

“In accident investigation, as in most other human endeavours, we fall prey to the What-You-Look-For-Is-What-You-Find or WYLFIWYF principle. This is a simple recognition of the fact that assumptions about what we are going to see (What-You-Look-For), to a large extent will determine what we actually find (What-You-Find).”

More to the point: “What-You-Look-For-Is-What-You-Fix”

We think there is something like the cause of a mishap (sometimes we call it the root cause, or primary cause), and if we look in the rubble hard enough, we will find it there. The reality is that there is no such thing as the cause, or primary cause, or root cause. Cause is something we construct, not find. And how we construct causes depends on the accident model that we believe in. (Dekker, 2006)

Nancy Leveson comments on this idea in her excellent book Engineering a Safer World (p. 20):

Subjectivity in Selecting Events

The selection of events to include in an event chain is dependent on the stopping rule used to determine how far back the sequence of explanatory events goes. Although the first event in the chain is often labeled the ‘initiating event’ or ‘root cause’, the selection of an initiating event is arbitrary and previous events could always be added.

Sometimes the initiating event is selected (the backward chaining stops) because it represents a type of event that is familiar and thus acceptable as an explanation for the accident or it is a deviation from a standard [166]. In other cases, the initiating event or root cause is chosen because it is the first event in the backward chain for which it is felt that something can be done for correction.

The backward chaining may also stop because the causal path disappears due to lack of information. Rasmussen suggests that a practical explanation for why actions by operators actively involved in the dynamic flow of events are so often identified as the cause of an accident is the difficulty in continuing the backtracking “through” a human [166].

A final reason why a “root cause” may be selected is that it is politically acceptable as the identified cause. Other events or explanations may be excluded or not examined in depth because they raise issues that are embarrassing to the organization or its contractors or are politically unacceptable.

Learning is the goal. Any prevention depends on that learning.

So if not the Five Whys, then what should you do? What method should you take?

I’d like to suggest an alternative, which is to first accept the idea that you have to actively seek out and protect the stories from bias (and judgement) when you ask people “how?”-style questions. Then you can:

  • Ask people for their story without any replay of data that would supposedly ‘refresh’ their memory
  • Tell their story back to them and confirm you got their narrative correct
  • Identify critical junctures
  • Progressively probe and re-build how the world looked to people inside of the situation at each juncture.
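If it helps to see the shape of those steps, here is a minimal sketch of how a facilitator might capture the results in structured notes. This is an assumption-laden illustration, not a prescribed format; the field names, and the idea of one record per participant, are my own.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Juncture:
    """One critical juncture in the narrative, probed from the participant's
    point of view at that time (not with the benefit of hindsight)."""
    description: str
    cues_observed: List[str] = field(default_factory=list)   # what were you seeing / focused on?
    interpretation: str = ""                                  # how did the situation look to you then?
    knowledge_used: List[str] = field(default_factory=list)  # prior experience, rules, similar cases
    goals: List[str] = field(default_factory=list)           # possibly several, possibly in tension
    actions_considered: List[str] = field(default_factory=list)


@dataclass
class DebriefingRecord:
    """Structured notes from one participant's 'how?'-style debriefing."""
    participant: str
    narrative_as_told: str              # step 1: their story, with no data replayed at them
    narrative_confirmed: bool = False   # step 2: told back to them and confirmed as theirs
    junctures: List[Juncture] = field(default_factory=list)  # steps 3 and 4
```

Nothing about this structure is essential; the point is only that each juncture gets its own set of perspective-seeking prompts rather than a single “why?”.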

As a starting point for those probing questions, we can look to Gary Klein and Sidney Dekker for the types of questions you can ask instead of “why?”…

Debriefing Facilitation Prompts

(from The Field Guide To Understanding Human Error, by Sidney Dekker)

At each juncture in the sequence of events (if that is how you want to structure this part of the accident story), you want to get to know:

  • Which cues were observed (what did he or she notice/see or did not notice what he or she had expected to notice?)
  • What knowledge was used to deal with the situation? Did participants have any experience with similar situations that was useful in dealing with this one?
  • What expectations did participants have about how things were going to develop, and what options did they think they had to influence the course of events?
  • How did other influences (operational or organizational) help determine how they interpreted the situation and how they would act?

Here are some questions Gary Klein and his researchers typically ask to find out how the situation looked to people on the inside at each of the critical junctures:

Cues
  • What were you seeing?
  • What were you focused on?
  • What were you expecting to happen?

Interpretation
  • If you had to describe the situation to your colleague at that point, what would you have told them?

Errors
  • What mistakes (for example, in interpretation) were likely at this point?

Previous knowledge/experience
  • Were you reminded of any previous experience?
  • Did this situation fit a standard scenario?
  • Were you trained to deal with this situation?
  • Were there any rules that applied clearly here?
  • Did any other sources of knowledge suggest what to do?

Goals
  • What were you trying to achieve?
  • Were there multiple goals at the same time?
  • Was there time pressure or other limitations on what you could do?

Taking Action
  • How did you judge you could influence the course of events?
  • Did you discuss or mentally imagine a number of options, or did you know straight away what to do?

Outcome
  • Did the outcome fit your expectation?
  • Did you have to update your assessment of the situation?

Communications
  • What communication medium(s) did you prefer to use? (phone, chat, email, video conference, etc.)
  • Did you make use of more than one communication channel at once?

Help
  • Did you ask anyone for help?
  • What signal brought you to ask for support or assistance?
  • Were you able to contact the people you needed to contact?

For the tutorials I did at Velocity, I made a one-pager of these: http://bit.ly/DebriefingPrompts


Try It

I have tried to outline some of my reasoning on why using the Five Whys approach is suboptimal, and I’ve given an alternative. I’ll do one better and link you to the tutorials that I gave in New York in October, which dig deeper into these concepts. They’re in four parts, 45 minutes each.

Part I – Introduction and the scientific basis for post-hoc retrospective pitfalls and learning

Part II – The language of debriefings, causality, case studies, teams coping with complexity

Part III – Dynamic fault management, debriefing prompts, gathering and contextualizing data, constructing causes

Part IV – Taylorism, normal work, ‘root cause’ of software bugs in cars, Q&A

My request is that the next time you would do a Five Whys, you instead ask “how?” or the variations of the questions I posted above. If you think you get more operational data from a Five Whys and are happy with it, rock on.

If you’re more interested in this alternative and the fundamentals behind it, then there are a number of sources you can look to. You could do a lot worse than starting with Sidney Dekker’s Field Guide To Understanding Human Error.

An Explanation

For those readers who think I’m unnecessarily harsh on the Five Whys approach, I think it’s worthwhile to explain why I feel so strongly about this.

Retrospective understanding of accidents and events is important because how we make sense of the past greatly and almost invisibly influences our future. At some point in the not-so-distant past, the domain of web engineering was about selling books online and making a directory of the web. These organizations and the individuals who built them quickly gave way to organizations that now build cars, spacecraft, trains, aircraft, medical monitoring devices…the list goes on…simply because software development and distributed systems architectures are at the core of modern life.

The software worlds and the non-software worlds have collided and will continue to do so. More and more “life-critical” equipment and products rely on software and even the Internet.

Those domains have had varied success in retrospective understanding of surprising events, to say the least. Investigative approaches that are firmly based on causal oversimplification and the “Bad Apple Theory” of deficient individual attributes (like the Five Whys) have been shown not only to be unhelpful, but to make learning objectively harder, not easier. As a result, people who have made mistakes or been involved in accidents have been fired, banned from their profession, and thrown in jail for some of the very things that you could find in a Five Whys.

I sometimes feel nervous that these oversimplifications will still be around when my daughter and son are older. If they were to make a mistake, would they be blamed as a cause? I strongly believe that we can leave these old ways behind us and do much better.

My goal is not to vilify an approach, but to state explicitly that if the world is to become safer, then we have to eschew this simplicity; it will only get better if we embrace the complexity, not ignore it.

 

Epilogue: The Longer Version For Those Who Have The Stomach For Complexity Theory

The Five Whys approach follows a Newtonian-Cartesian worldview. This is a worldview that is seductively satisfying and compellingly simple. But it’s also false in the world we live in.

What do I mean by this?

There are five ways in which the Five Whys sits firmly in a Newtonian-Cartesian worldview that we should eschew when it comes to learning from past events. This is a CliffsNotes version of “The complexity of failure: Implications of complexity theory for safety investigations”:

First, it is reductionist. The narrative built by the Five Whys sits on the idea that if you can construct a causal chain, then you’ll have something to work with. In other words: to understand the system, you pull it apart into its constituent parts. Know how the parts interact, and you know the system.

Second, it assumes what Dekker has called “cause-effect symmetry” (Dekker, complexity of failure):

“In the Newtonian vision of the world, everything that happens has a definitive, identifiable cause and a definitive effect. There is symmetry between cause and effect (they are equal but opposite). The determination of the ‘‘cause’’ or ‘‘causes’’ is of course seen as the most important function of accident investigation, but assumes that physical effects can be traced back to physical causes (or a chain of causes-effects) (Leveson, 2002). The assumption that effects cannot occur without specific causes influences legal reasoning in the wake of accidents too. For example, to raise a question of negligence in an accident, harm must be caused by the negligent action (GAIN, 2004). Assumptions about cause-effect symmetry can be seen in what is known as the outcome bias (Fischhoff, 1975). The worse the consequences, the more any preceding acts are seen as blameworthy (Hugh and Dekker, 2009).”

John Carroll (Carroll, 1995) called this “root cause seduction”:

The identification of a root cause means that the analysis has found the source of the event and so everyone can focus on fixing the problem. This satisfies people’s need to avoid ambiguous situations in which one lacks essential information to make a decision (Frisch & Baron, 1988) or experiences a salient knowledge gap (Loewenstein, 1993). The seductiveness of singular root causes may also feed into, and be supported by, the general tendency to be overconfident about how much we know (Fischhoff, Slovic, & Lichtenstein, 1977).

That last bit, about a tendency to be overconfident about how much we know (in this context, how much we know about the past), rests on a strong body of research put forth by Baruch Fischhoff, who originally researched what we now understand to be the Hindsight Bias. Not surprisingly, Fischhoff’s doctoral thesis advisor was Daniel Kahneman (you’ve likely heard of him as the author of Thinking, Fast and Slow), whose research in cognitive biases and heuristics everyone should at least be vaguely familiar with.

The third issue with this worldview, supported by the idea of the Five Whys and following logically from the earlier points, is the assumption that outcomes are foreseeable if you know the initial conditions and the rules that govern the system. The only reason you would even construct a serial causal chain like this is that you assume that, given those conditions and rules, the outcome could have been predicted (and therefore prevented).

The fourth part of this worldview is the assumption that time is reversible. But we can’t treat a causal chain as something that you can fast-forward and rewind, no matter how attractively simple that seems. This is because the socio-technical systems that we work on and work in are complex and dynamic in nature. Deterministic behavior (or, at least, predictability) is something that we look for in software; in complex systems this is a foolhardy search, because emergence is a property of that complexity.

And finally, there is an underlying assumption that complete knowledge is attainable. In other words: we only have to try hard enough to understand exactly what happened. The issue with this is that success and failure have many contributing causes, and there is no comprehensive and objective account. The best that you can do is to probe people’s perspectives at juncture points in the investigation. It is not possible to understand past events in any way that can be considered comprehensive.

Dekker (Dekker, 2011):

As soon as an outcome has happened, whatever past events can be said to have led up to it, undergo a whole range of transformations (Fischhoff and Beyth, 1975; Hugh and Dekker, 2009). Take the idea that it is a sequence of events that precedes an accident. Who makes the selection of the ‘‘events’’ and on the basis of what? The very act of separating important or contributory events from unimportant ones is an act of construction, of the creation of a story, not the reconstruction of a story that was already there, ready to be uncovered. Any sequence of events or list of contributory or causal factors already smuggles a whole array of selection mechanisms and criteria into the supposed ‘‘re’’construction. There is no objective way of doing this—all these choices are affected, more or less tacitly, by the analyst’s background, preferences, experiences, biases, beliefs and purposes. ‘‘Events’’ are themselves defined and delimited by the stories with which the analyst configures them, and are impossible to imagine outside this selective, exclusionary, narrative fore-structure (Cronon, 1992).

Here is a thought exercise: what if we were to try to use the Five Whys for finding the “root cause” of a success?

Why didn’t we have failure X today?

Now this question becomes a lot harder to answer with a single cause. This is because things go right for many reasons, and not all of them are obvious. We can spend all day writing down reasons why we didn’t have failure X today, and if we’re committed, we can keep going.

So if success requires “multiple contributing conditions, each necessary but only jointly sufficient” to happen, then how is it that failure requires just one? The Five Whys, as it’s commonly presented as an approach to improvement (or: learning?), will lead us to believe that not only is just one condition sufficient, but that condition is a canonical one, to the exclusion of all others.

* RCA, or “Root Cause Analysis” can also easily turn into “Retrospective Cover of Ass”

References

Carroll, J. S. (1995). Incident Reviews in High-Hazard Industries: Sense Making and Learning Under Ambiguity and Accountability. Organization & Environment, 9(2), 175–197. doi:10.1177/108602669500900203

Dekker, S. (2004). Ten questions about human error: A new view of human factors and system safety. Mahwah, N.J: Lawrence Erlbaum.

Dekker, S., Cilliers, P., & Hofmeyr, J.-H. (2011). The complexity of failure: Implications of complexity theory for safety investigations. Safety Science, 49(6), 939–945. doi:10.1016/j.ssci.2011.01.008

Hollnagel, E. (2009). The ETTO principle: Efficiency-thoroughness trade-off: Why things that go right sometimes go wrong. Burlington, VT: Ashgate.

Leveson, N. (2012). Engineering a Safer World. MIT Press.

 

 

Translations Between Domains: David Woods

One of the reasons I’ve continued to be more and more interested in Human Factors and Safety Science is that I found myself without many answers to the questions I have had in my career. Questions surrounding how organizations work, how people think and work with computers, how decisions get made under uncertainty, and how people cope with increasing amounts of complexity.

As a result, my journey took me deep into a world where I immediately saw connections — between concepts found in other high-tempo, high-consequence domains and my own world of software engineering and operations. One of the first connections was in Richard Cook’s How Complex Systems Fail, and it struck me so deeply I insisted that it get reprinted (with additions by Richard) into O’Reilly’s Web Operations book.

I simply cannot un-see these connections now, and the field of study keeps me going deeper. So deep that I felt I needed to get a degree. My goal with getting a degree in the topic is not just to satisfy my own curiosity, but also to explore these topics in sufficient depth to feel credible in thinking about them critically.

In software, the concept and sometimes inadvertent practice of “cargo cult engineering” is well known. I’m hoping to avoid that in my own translation(s) of what’s been found in human factors, safety science, and cognitive systems engineering, as they looked into domains like aviation, patient safety, or power plant operations. Instead, I’m looking to truly understand that work in order to know what to focus on in my own research as well as to understand how my domain is either similar (and in what ways?) or different (and in what ways?)

For example, just a hint of what sorts of questions I have been mulling over:

  • How does the concept of “normalization of deviance” manifest in web engineering? How does it relate to our concept of ‘technical debt’?
  • What organizational dynamics might be in play when it comes to learning from “successes” and “failures”?
  • What methods of inquiry can we use to better design interfaces that have functionality and safety and diagnosis support as their core? Or, are those goals in conflict? If so, how?
  • How can we design alerts to reduce noise and increase signal in a way that takes into account the context of the intended receiver of the alert? In other words, how can we teach alerts to know about us, instead of the other way around?
  • The Internet (including its technical, political, and cultural structures) has non-zero amounts of diversity, interdependence, connectedness, and adaptation, which by many measures constitutes a complex system.
  • How do successful organizations navigate trade-offs when it comes to decisions that may have unexpected consequences?

I’ve done my best to point my domain at some of these connections as I understand them, and the Velocity Conference has been one of the ways I’ve hoped to bring people “over the bridge” from Safety Science, Human Factors, and Cognitive Systems Engineering into software engineering and operations as it exists as a practice on Internet-connected resources. If you haven’t seen Dr. Richard Cook’s 2012 and 2013 keynotes, or Dr. Johan Bergstrom’s keynote, stop what you’re doing right now and watch them.

I’m willing to bet you’ll see connections immediately…



David Woods is one of the pioneers in these fields, and continues to be a huge influence on the way that I think about our domain and my own research (my thesis project relies heavily on some of his previous work), and I can’t be happier that he’s speaking at Velocity in New York, which is coming up soon. (Pssst: if you register for it here, you can use the code “JOHN20” for a 20% discount.)

I have posted before (and likely will again) about a paper Woods contributed to, Common Ground and Coordination in Joint Activity (Klein, Feltovich, Bradshaw, & Woods, 2005), which in my mind might as well be considered the best explanation of what “devops” means to me, and what makes successful teams work. If you haven’t read it, do it now.

 

Dynamic Fault Management and Anomaly Response

I thought about listing all of Woods’ work that I’ve seen connections in thus far, but then I realized that if I wasn’t careful, I’d be writing a literature review and not a blog post. 🙂 Also, I have thesis work to do. So for now, I’d like to point only at two concepts that struck me as absolutely critical to the day-to-day of many readers of this blog, dynamic fault management and anomaly response.

Woods sheds some light on these topics in Joint Cognitive Systems: Patterns in Cognitive Systems Engineering. Pay particular attention to the characteristics of these phenomena:

“In anomaly response, there is some underlying process, an engineered or physiological process which will be referred to as the monitored process, whose state changes over time. Faults disturb the functions that go on in the monitored process and generate the demand for practitioners to act to compensate for these disturbances in order to maintain process integrity—what is sometimes referred to as “safing” activities. In parallel, practitioners carry out diagnostic activities to determine the source of the disturbances in order to correct the underlying problem.

Anomaly response situations frequently involve time pressure, multiple interacting goals, high consequences of failure, and multiple interleaved tasks (Woods, 1988; 1994). Typical examples of fields of practice where dynamic fault management occurs include flight deck operations in commercial aviation (Abbott, 1990), control of space systems (Patterson et al., 1999; Mark, 2002), anesthetic management under surgery (Gaba et al., 1987), terrestrial process control (Roth, Woods & Pople, 1992), and response to natural disasters.” (Woods & Hollnagel, 2006, p.71)

Now look down at the distributed systems you’re designing and operating.

Look at the “runbooks” and postmortem notes that you have written in the hopes that they can help guide teams as they try to untangle the sometimes very confusing scenarios that outages can bring.

Does “safing” ring familiar to you?

Do you recognize managing “multiple interleaved tasks” under “time pressure” and “high consequences of failure”?

I think it’s safe to say that almost every Velocity Conference attendee would see connections here.

In How Unexpected Events Produce An Escalation Of Cognitive And Coordinative Demands (Woods & Patterson, 1999), he introduces the concept of escalation, in terms of anomaly response:

The concept of escalation captures a dynamic relationship between the cascade of effects that follows from an event and the demands for cognitive and collaborative work that escalate in response (Woods, 1994). An event triggers the evolution of multiple interrelated dynamics.

  • There is a cascade of effects in the monitored process. A fault produces a time series of disturbances along lines of functional and physical coupling in the process (e.g., Abbott, 1990). These disturbances produce a cascade of multiple changes in the data available about the state of the underlying process, for example, the avalanche of alarms following a fault in process control applications (Reiersen, Marshall, & Baker, 1988).
  • Demands for cognitive activity increase as the problem cascades. More knowledge potentially needs to be brought to bear. There is more to monitor. There is a changing set of data to integrate into a coherent assessment. Candidate hypotheses need to be generated and evaluated. Assessments may need to be revised as new data come in. Actions to protect the integrity and safety of systems need to be identified, carried out, and monitored for success. Existing plans need to be modified or new plans formulated to cope with the consequences of anomalies. Contingencies need to be considered in this process. All these multiple threads challenge control of attention and require practitioners to juggle more tasks.
  • Demands for coordination increase as the problem cascades. As the cognitive activities escalate, the demand for coordination across people and across people and machines rises. Knowledge may reside in different people or different parts of the operational system. Specialized knowledge and expertise from other parties may need to be brought into the problem-solving process. Multiple parties may have to coordinate to implement activities aimed at gaining information to aid diagnosis or to protect the monitored process. The trouble in the underlying process requires informing and updating others – those whose scope of responsibility may be affected by the anomaly, those who may be able to support recovery, or those who may be affected by the consequences the anomaly could or does produce.
  • The cascade and escalation is a dynamic process. A variety of complicating factors can occur, which move situations beyond canonical, textbook forms. The concept of escalation captures this movement from canonical to nonroutine to exceptional. The tempo of operations increases following the recognition of a triggering event and is synchronized by temporal landmarks that represent irreversible decision points.

When I read…

“These disturbances produce a cascade of multiple changes in the data available about the state of the underlying process, for example, the avalanche of alarms following a fault in process control applications” 

I think of many large-scale outages and multi-day recovery activities, like this one that you all might remember (AWS EBS/RDS outage, 2011).

When I read…

“Existing plans need to be modified or new plans formulated to cope with the consequences of anomalies. Contingencies need to be considered in this process. All these multiple threads challenge control of attention and require practitioners to juggle more tasks.” 

I think of many outage response scenarios I have been in with multiple teams (network, storage, database, security, etc.) gathering data from the multiple places they are expert in, at the same time making sense of that data as normal or abnormal signals.

When I read…

“Multiple parties may have to coordinate to implement activities aimed at gaining information to aid diagnosis or to protect the monitored process.”

I think of these two particular outages, and how in the fog of ambiguous signals coming in during diagnosis of an issue, there is a “divide and conquer” effort distributed throughout differing domain expertise (database, network, various software layers, hardware, etc.) that aims to split the search space of diagnosis, while at the same time keeping each other up-to-date on what pathologies have been eliminated as possibilities, what new data can be used to form hypotheses about what’s going on, etc.

I will post more on the topic of anomaly response in detail (and more of Woods’ work) in another post.

In the meantime, I urge you to take a look at David Woods’ writings, and look for connections in your own work. Below is a talk David gave at IBM’s Almaden Research Center, called “Creating Safety By Engineering Resilience”:

David D. Woods, Creating Safety by Engineering Resilience from jspaw on Vimeo.

References

Hollnagel, E., & Woods, D. D. (1983). Cognitive systems engineering: New wine in new bottles. International Journal of Man-Machine Studies, 18(6), 583–600.

Klein, G., Feltovich, P. J., Bradshaw, J. M., & Woods, D. D. (2005). Common ground and coordination in joint activity. Organizational Simulation, 139–184.

Woods, D. D. (1995). The alarm problem and directed attention in dynamic fault management. Ergonomics. doi:10.1080/00140139508925274

Woods, D. D., & Hollnagel, E. (2006). Joint cognitive systems: Patterns in cognitive systems engineering. Boca Raton: CRC/Taylor & Francis.

Woods, D. D., & Patterson, E. S. (1999). How Unexpected Events Produce An Escalation Of Cognitive And Coordinative Demands. Stress, 1–13.

Woods, D. D., Patterson, E. S., & Roth, E. M. (2002). Can We Ever Escape from Data Overload? A Cognitive Systems Diagnosis. Cognition, Technology & Work, 4(1), 22–36. doi:10.1007/s101110200002

Teaching Engineering As A Social Science

Below is a piece written by Edward Wenk, Jr., which originally appeared in PRISM, the magazine of the American Society for Engineering Education (Volume 6, No. 4, December 1996).

While I think that there’s much more to this than what Wenk points to as ‘social science’, I agree wholeheartedly with his ideas. I might even say that he didn’t go far enough in his recommendations.

Enjoy. 🙂

 

Edward Wenk, Jr.

Teaching Engineering as a Social Science

Today’s public engages in a love affair with technology, yet it consistently ignores the engineering at technology’s core. This paradox is reinforced by the relatively few engineers in leadership positions. Corporations, which used to have many engineers on their boards of directors, today are composed mainly of M.B.A.s and lawyers. Few engineers hold public office or even run for office. Engineers seldom break into headlines except when serious accidents are attributed to faulty design.

While there are many theories on this lack of visibility, from inadequate public relations to inadequate public schools, we may have overlooked the real problem: Perhaps people aren’t looking at engineers because engineers aren’t looking at people.

If engineering is to be practiced as a profession, and not just a technical craft, engineers must learn to harmonize natural sciences with human values and social organization. To do this we must begin to look at engineering as a social science and to teach, practice, and present engineering in this context.

To many in the profession, looking at teaching engineering as a social science is anathema. But consider the multiple and profound connections of engineering to people.

Technology in Everyday Life

The work of engineers touches almost everyone every day through food production, housing, transportation, communications, military security, energy supply, water supply, waste disposal, environmental management, health care, even education and entertainment. Technology is more than hardware and silicon chips.

In propelling change and altering our belief systems and culture, technology has joined religion, tradition, and family in the scope of its influence. Its enhancements of human muscle and human mind are self-evident. But technology is also a social amplifier. It stretches the range, volume, and speed of communications. It inflates appetites for consumer goods and creature comforts. It tends to concentrate wealth and power, and to increase the disparity of rich and poor. In the competition for scarce resources, it breeds conflicts.

In social psychological terms, it alters our perceptions of space. Events anywhere on the globe now have immediate repercussions everywhere, with a portfolio of tragedies that ignite feelings of helplessness. Technology has also skewed our perception of time, nourishing a desire for speed and instant gratification and ignoring longer-term impacts.

Engineering and Government

All technologies generate unintended consequences. Many are dangerous enough to life, health, property, and environment that the public has demanded protection by the government.

Although legitimate debates erupt on the size of government, its cardinal role is demonstrated in an election year when every faction seeks control. No wonder vested interests lobby aggressively and make political campaign contributions.

Whatever that struggle, engineers have generally opted out. Engineers tend to believe that the best government is the least government, which is consistent with goals of economy and efficiency that steer many engineering decisions without regard for social issues and consequences.

Problems at the Undergraduate Level

By both inclination and preparation, many engineers approach the real world as though it were uninhabited. Undergraduates who choose an engineering career often see it as escape from blue-collar family legacies by obtaining the social prestige that comes with belonging to a profession. Others love machines. Few, however, are attracted to engineering because of an interest in people or a commitment to public service. On the contrary, most are uncomfortable with the ambiguities of human behavior, its absence of predictable cause and effect, its lack of control, and with the demands for direct encounters with the public.

Part of this discomfort originates in engineering departments, which are often isolated from arts, humanities, and social sciences classrooms by campus geography as well as by disparate bodies of scholarly knowledge and cultures. Although most engineering departments require students to take some nontechnical courses, students often select these on the basis of hearsay, academic ease, or course instruction, not in terms of preparation for life or for citizenship.

Faculty attitudes don’t help. Many faculty members enter teaching immediately after obtaining their doctorates, their intellect sharply honed by a research specialty. Then they continue in that groove because of standard academic reward systems for tenure and promotion. Many never enter a professional practice that entails the human equation.

We can’t expect instant changes in engineering education. A start, however, would be to recognize that engineering is more than manipulation of intricate signs and symbols. The social context is not someone else’s business. Adopting this mindset requires a change in attitudes. Consider these axioms:

  • Technology is not just hardware; it is a social process.
  • All technologies generate side effects that engineers should try to anticipate and to protect against.
  • The most strenuous challenge lies in synthesis of technical, social, economic, environmental, political, and legal processes.
  • For engineers to fulfill a noblesse oblige to society, their objectivity must not be defined by conditions of employment, as, for example, in dealing with an employer’s tradeoffs of safety for cost.

In a complex, interdependent, and sometimes chaotic world, engineering practice must continue to excel in problem solving and creative synthesis. But today we should also emphasize social responsibility and commitment to social progress. With so many initiatives having potentially unintended consequences, engineers need to examine how to serve as counselors to the public in answering questions of “What if?” They would thus add sensitive, future-oriented guidance to the extraordinary power of technology to serve important social purposes.

In academic preparation, most engineering students miss exposure to the principles of social and economic justice and human rights, and to the importance of biological, emotional, and spiritual needs. They miss Shakespeare’s illumination of human nature – the lust for power and wealth and its corrosive effects on the psyche, and the role of character in shaping ethics that influence professional practice. And they miss models of moral vision to face future temptations.

Engineering’s social detachment is also marked by a lack of teaching about the safety margins that accommodate uncertainties in engineering theories, design assumptions, product use and abuse, and so on. These safety margins shape practice with social responsibility to minimize potential harm to people or property. Our students can learn important lessons from the history of safety margins, especially of failures, yet most use safety protocols without knowledge of that history and without an understanding of risk and its abatement. Can we expect a railroad systems designer obsessed with safety signals to understand that sleep deprivation is even more likely to cause accidents? No, not if the systems designer lacks knowledge of this relatively common problem.

Safety margins are a protection against some unintended consequences. Unless engineers appreciate human participation in technology and the role of human character in performance, they are unable to deal with demons that undermine the intended benefits.

Case Studies in Socio-Technology

Working for the legislative and executive branches of U.S. government since the 1950s, I have had a ringside seat from which to view many of the events and trends that come from the connections between engineering and people. Following are a few of those cases.

Submarine Design

The first nuclear submarine, USS Nautilus, was taken on its deep submergence trial on February 28, 1955. The sub’s power plant had been successfully tested in a full-scale mock-up and in a shallow dive, but the hull had not been subject to the intense hydrostatic pressure at operating depth. The hull was unprecedented in diameter, in materials, and in special joints connecting cylinders of different diameter. Although it was designed with complex shell theory and confirmed by laboratory tests of scale models, proof of performance was still necessary at sea.

During the trial, the sub was taken stepwise to its operating depth while evaluating strains. I had been responsible for the design equations, for the model tests, and for supervising the test at sea, so it was gratifying to find the hull performed as predicted.

While the nuclear power plant and novel hull were significant engineering achievements, the most important development occurred much earlier on the floor of the U.S. Congress. That was where the concept of nuclear propulsion was sold to a Congressional committee by Admiral Hyman Rickover, an electrical engineer. Previously rejected by a conservative Navy, passage of the proposal took an electrical engineer who understood how Constitutional power was shared and how to exercise the right of petition. By this initiative, Rickover opened the door to civilian nuclear power that accounts for 20 percent of our electrical generation, perhaps 50 percent in France. If he had failed, and if the Nautilus pressure hull had failed, nuclear power would have been set back by a decade.

Space Telecommunications

Immediately after the 1957 Soviet surprise of Sputnik, engineers and scientists recognized that global orbits required all nations to reserve special radio channels for telecommunications with spacecraft. Implementation required the sanctity of a treaty, preparation of which demanded more than the talents of radio specialists; it engaged politicians, space lawyers, and foreign policy analysts. As science and technology advisor to Congress, I evaluated the treaty draft for technical validity and for consistency with U.S. foreign policy.

The treaty recognized that the airwaves were a common property resource, and that the virtuosity of communications engineering was limited without an administrative protocol to safeguard integrity of transmissions. This case demonstrated that all technological systems have three major components — hardware or communications equipment; software or operating instructions (in terms of frequency assignments); and peopleware, the organizations that write and implement the instructions.

National Policy for the Oceans

Another case concerned a national priority to explore the oceans and to identify U.S. rights and responsibilities in the exploitation and conservation of ocean resources. This issue, surfacing in 1966, was driven by new technological capabilities for fishing, offshore oil development, mining of mineral nodules on the ocean floor, and maritime shipment of oil in supertankers that if spilled could contaminate valuable inshore waters. Also at issue was the safety of those who sailed and fished.

This issue had a significant history. During the late 1950s, the U.S. Government was downsizing oceanographic research that initially had been sponsored during World War II. This was done without strong objection, partly because marine issues lacked coherent policy or high-level policy leadership and strong constituent advocacy.

Oceanographers, however, wanting to sustain levels of research funding, prompted a study by the National Academy of Sciences (NAS). Using the report’s findings, which documented the importance of oceanographic research, NAS lobbied Congress with great success, triggering a flurry of bills dramatized by such titles as “National Oceanographic Program.”

But what was overlooked was the ultimate purpose of such research to serve human needs and wants, to synchronize independent activities of major agencies, to encourage public/private partnerships, and to provide political leadership. During the 1960s, in the role of Congressional advisor, I proposed a broad “strategy and coordination machinery” centered in the Office of the President, the nation’s systems manager. The result was the Marine Resources and Engineering Development Act, passed by Congress and signed into law by President Johnson in 1966.

The shift in bill title reveals the transformation from ocean sciences to socially relevant technology, with engineering playing a key role. The legislation thus embraced the potential of marine resources and the steps for both development and protection. By emphasizing policy, ocean activities were elevated to a higher national priority.

Exxon Valdez

Just after midnight on March 24, 1989, the tanker Exxon Valdez, loaded with 50 million gallons of Alaska crude oil, fetched up on Bligh Reef in Prince William Sound and spilled its guts. For five hours, oil surged from the torn bottom at an incredible rate of 1,000 gallons per second. Attention quickly focused on the enormity of environmental damage and on blunders of the ship operators. The captain had a history of alcohol abuse, but was in his cabin at impact. There was much finger-pointing as people questioned how the accident could happen during a routine run on a clear night. Answers were sought by the National Transportation Safety Board and by a state of Alaska commission to which I was appointed. That blame game still continues in the courts.

The commission was instructed to clarify what happened, why, and how to keep it from happening again. But even the commission was not immune to the political blame game. While I wanted to look beyond the ship’s bridge and search for other, perhaps more systemic problems, the commission chair blocked me from raising those issues. Despite my repeated requests for time at the regularly scheduled sessions, I was not allowed to speak. The chair, a former official having tanker safety responsibilities in Alaska, had a different agenda and would only let the commission focus largely on cleanup rather than prevention. Fortunately, I did get to have my say by signing up as a witness and using that forum to express my views and concerns.

The Exxon Valdez proved to be an archetype of avoidable risk. Whatever the weakness in the engineered hardware, the accident was largely due to internal cultures of large corporations obsessed with the bottom line and determined to get their way, a U.S. Coast Guard vulnerable to political tampering and unable to realize its own ethic, a shipping system infected with a virus of tradition, and a cast of characters lulled into complacency that defeated efforts at prevention.

Lessons

These examples of technological delivery systems have unexpected commonalities. Space telecommunications and sea preservation and exploitation were well beyond the purview of just those engineers and scientists working on the projects; they involved national policy and required interaction between engineers, scientists, users, and policymakers. The Exxon Valdez disaster showed what happens when these groups do not work together. No matter how conscientious a ship designer is about safety, it is necessary to anticipate the weaknesses of fallibility and the darker side of self-centered, short-term ambition.

Recommendations

Many will argue that the engineering curriculum is so overloaded that the only source of socio-technical enrichment is a fifth year. Assuming that step is unrealistic, what can we do?

  • The hodgepodge of nonengineering courses could be structured to provide an integrated foundation in liberal arts.
  • Teaching at the upper division could be problem- rather than discipline-oriented, with examples from practice that integrate nontechnical parameters.
  • Teaching could employ the case method often used in law, architecture, and business.
  • Students could be encouraged to learn about the world around them by reading good newspapers and nonengineering journals.
  • Engineering students could be encouraged to join such extracurricular activities as debating or political clubs that engage students from across the campus.

As we strengthen engineering’s potential to contribute to society, we can market this attribute to women and minority students who often seek socially minded careers but believe that engineering is exclusively a technical pursuit.

For practitioners of the future, something radically new needs to be offered in schools of engineering. Otherwise, engineers will continue to be left out.

Paradigm Check Point: Prefacing Debriefings

I’m a firm believer in restating values, goals, and perspectives at the beginning of every group debriefing (e.g. “postmortem meetings”) in order to bring new folks up to speed on how we view the process and what the purpose of the debriefing is.

When I came upon a similar baselining dialogue from another domain, I thought I’d share…


  • Risk is in everything we do. Short of never doing anything, there is no way to avoid all risk or ever to be 100% safe.
  • How employees (at any level) perceive, anticipate, interpret, and react to risk is systematically connected to conditions associated with the design, systems, features, and culture of the workplace.
  • “Risk does not exist ‘out there,’ independent of our minds and culture, waiting to be measured. Human beings have invented the concept of ‘risk’ to help them understand and cope with the dangers and the uncertainties of life. Although these dangers are real, there is no such thing as a ‘real risk’ or ‘objective risk.’”*
  • The best definition of “safety” is: the reasonableness of risk. It is a feeling. It is not an absolute. It is personal and contextual and will vary between people even within identical situations.
  • While safety is an essential business practice, our agency does not exist to be safe or to protect our employees. We exist to accomplish a mission as efficiently as possible–knowing that many activities we choose to perform are inherently hazardous (for example, deployment, data migration, code commits, on-call response, editing configurations, and even powering on a device on the network).
  • Mistakes, errors, and lapses are normal and inevitable human behaviors. So are optimism and fatalism. So are taking shortcuts to save time and effort. So are under- and over-estimating risk. In spite of this, our work systems are generally designed for the optimal worker, not the normal one.
  • Essentially every risk mitigation (every safety precaution) carries some level of “cost” to production or compromise to efficiency. One of the most obvious is the cost of training. Employees at all levels (administrators, safety advisors, system designers, and front-line employees) are continuously–and often subconsciously–estimating, balancing, optimizing, managing, and accepting these subtle and nuanced tradeoffs between safety and production.
  • All successful systems, organizations, and individuals will trend toward efficiency over thoroughness (production over protection) over time until something happens (usually an accident or a close call) that changes their perception of risk. This creativity and drive for efficiency is what makes people, businesses and agencies successful.
  • Our natural intuition (our common sense) is to let outcomes draw the line between success and failure and to base safety programs on outcomes. This is shortsighted and eventually dangerous. Using the science of risk management is more potent and robust. Importantly, Risk Management is wholly concerned with managing risks, not outcomes. Risk management is counterintuitive.
  • Employees directly involved in the event did not expect that the accident was going to happen. They expected a positive outcome. If this is not the case, then you’re not dealing with an accident.

*Paul Slovic, as quoted in Daniel Kahneman, Thinking, Fast and Slow (Farrar, Straus and Giroux, 2011), p. 141.
The above is excerpted from the Facilitated Learning Analysis Implementation Guide, US Forest Service, Wildland Fire Operations.

High Tempo, High Consequence

A Time to Remember

I want you to think back to a time when you found yourself in an emergency situation at work.

Maybe it was diagnosing and trying to recover from a site outage.
Maybe it was when you were confronting the uncertain possibility of critical data loss.
Maybe it was when you and your team were responding to a targeted and malicious attack.

Maybe it was a time when you, maybe even milliseconds after you triggered some action (maybe even just hit “enter” after a command), realized that you had just made a terrible mistake and inadvertently kicked off destruction that cannot be undone.

Maybe it was a shocking discovery that something bad (silent data corruption, for example) had been happening for a long time and no one knew.

Maybe it was a time when silence descended upon your team as they tried to understand what was happening to the site, and to the business. A time when you didn’t even know what was going on, let alone hypothesizing about how to fix it.

Think back to the time when you had to actively hold back the fears of what the news headlines or management or your board of directors were going to say about it when it was over, because you had a job to do and worrying about those things wouldn’t bring the site back up.

Think back to a time when, after you’d resolved an outage and the dust had settled, your adrenaline turned its focus to amplifying the fear that you and your team would have no idea when it would happen again, because you were still uncertain how it happened in the first place.

I’ve been working in web operations for over 15 years, and I can describe in excruciating detail examples of many of those situations. Many of my peers can tell stories like those, and often do. I’m willing to bet that you too, dear reader, will find these to be familiar feelings.

Those moments are real.
The cortisol coursing through your body during those times was real.
The effect of time pressure is real.

The problems that show up when you have to make critical decisions based on incredibly incomplete information are real.

The issues that show up when having to communicate effectively across multiple team members, sometimes separated by time (as people enter and exit the response to an outage) as well as distance (connected through chat or audio/video conferencing) are all real.

The issues when coordinating who is going to do what, and when they’re going to do it, and confirming that whatever they did went well enough for someone else to do their part next, etc. are all real.

And they all are real regardless of the outcomes of the scenarios.

Comparisons

Those moments do happen in other domains. Other domains like healthcare, where nurses work in neonatal intensive care units.

Like infantry in battle.
Like ground control in a mission control organization.
Like a regional railway control center.
Like a trauma surgeon in an operating room.
Like an air traffic controller.
Like a pilot, just flying.
Like a wildland firefighting hotshot crew.
Like a ship crew.

Like a software engineer working in a high-frequency trading company.

All of those domains (and many others) have these in common:

  • They need to make decisions and take action under time pressure and with incomplete information, and when the results have just as much potential to make things worse as they do to make things better.
  • They have to communicate a lot of information and coordinate actions between teams and team members in the shortest time possible, while also not missing critical details.
  • They all work in areas where small changes can bring about large results whose potential for surprising everyone is quite high.
  • They all work in organizations whose cultural, social, hierarchical, and decision-making norms are influenced by past successes and failures, many of which manifest in these high-tempo scenarios.

But: do the people in those domains experience those moments differently?

In other words: does a nurse or air traffic controller’s experience in those real moments differ from ours, because lives are at stake?

Do they experience more stress? Different stress?
Do they navigate alerts, alarms, and computers in more prudent or better ways than we do?
Do they have more problems with communications and coordinating response amongst multiple team members?
Are they measurably more careful in their work because of the stakes?

Are all of their decisions perfectly clear, or do they have to navigate ambiguity sometimes, just like we do?
Because there are lives to protect, is their decision-making in high-tempo scenarios different? Better?

My assertion is that high-tempo/high-consequence scenarios in the domain of Internet engineering and operations do indeed have similarities with those other domains, and that understanding all of those dynamics, pitfalls, learning opportunities, etc. is critical for the future.

All of the future.

Do these scenarios yield the same results, organizationally, in those domains as they do in web engineering and operations? Likely not. But I will add that unless we attempt to understand those similarities and differences, we’re not going to know what to learn from, and what to discard.

Hrm. Really?

Because how can we compare something like the Site Reliability Engineering team’s experience at Google.com to something like the experience of an air traffic control crew landing airplanes at Heathrow?

I have two responses to this question.

The first is that we’re assuming that the potential severity of the consequence influences the way people (and teams of people) think, act, and behave under those conditions. Research on how people behave under uncertain conditions and escalating scenarios does indeed have generalizable findings across many domains.

The second is that in trivializing the comparison to loss of life versus non-loss of life, we can underestimate the n-order effects that the Internet can have on geopolitical, economic, and other areas that are further away from servers and network cables. We would be too reductionist in our thinking. The Internet is not just about photos of cats. It bolsters elections in emerging democracies, revolutions, and a whole host of other things that prove to be life-critical.

A View From Not Too Far Away

At the Velocity Conference in 2012, Dr. Richard Cook (an anesthesiologist and one of the most forward-thinking men I know in these areas), was interviewed after his keynote by Mac Slocum, from O’Reilly.

Mac, hoping to contrast Cook’s normal audience to that of Velocity’s, asked about whether or not he saw crossover from the “safety-critical” domains to that of web operations:

Cook: “Anytime you find a world in which you have high consequences, high tempo, time pressure, and lots of complexity, semantic complexity, underlying deep complexity, and people are called upon to manage that you’re going to have these kinds of issues arise. And the general model that we have is one for systems, not for specific instances of systems. So I kind of expected that it would work…”

Mac: “…obviously failure in the health care world is different than failure in the [web operations] world. What is the right way to address failure, the appropriate way to address failure? Because obviously you shouldn’t have people in this space who are assigning the same level of importance to failure as you would?”

Cook: “You really think so?”

Mac: “Well, if a computer goes down, that’s one thing.”

Cook: “If you lose $300 to $400 million dollars, don’t you think that would buy a lot of vaccines?”

Mac: “[laughs] well, that’s true.”

Cook: “Look, the fact that it appears to be dramatic because we’re in the operating room or the intensive care unit doesn’t change the importance of what people are doing. That’s a consequence of being close to and seeing things in a dramatic fashion. But what’s happening here? This is the lifeblood of commerce. This is the core of the economic engine that we’re now experiencing. You think that’s not important?”

Mac: “So it’s ok then, to assign deep importance to this work?”

Cook: “Yeah, I think the big question will be whether or not we are actually able to conclude that the importance of healthcare measures up to the importance of web ops, not the other way around.”

Richard further mentioned in his keynote last year at New York’s Velocity that:

“…web applications have a tendency to become business critical applications, and business-critical applications have a tendency to become safety-critical systems.”

And yes, software bugs have killed people.

When I began my studies at Lund University, I was joined by practitioners in many of those domains: air traffic control, aviation, wildland fire, child welfare services, mining, oil and gas industry, submarine safety, and maritime accident investigation.

I will admit that at the first learning lab of my course, I mentioned that I felt like a bit of an outsider (or at least a cheater, getting away with failures that don’t kill people), and one of my classmates responded:

“John, why do you think that understanding these scenarios and potentially improving upon them has anything to do with body count? Do you think that our organizations are influenced more by body count than commercial and economic influences? Complex failures don’t care about how many dollars or bodies you will lose – they are equal opportunists.”

I now understand this.

So don’t be fooled into thinking that those human moments at the beginning of this post are any different in other domains, or that our responsibility to understand complex system failures is less important in web engineering and operations than it is elsewhere.

Counterfactual Thinking, Rules, and The Knight Capital Accident

In between reading copious amounts of indignation surrounding whatever is suboptimal about healthcare.gov, you may or may not have noticed the SEC statement regarding the Knight Capital accident that took place in 2012.

This Release No. 70694 is a document that contains many details about the accident, and you can read what looks, on the surface, like an in-depth analysis of what went wrong and how best to prevent such an accident from happening in the future.

You may believe this document can serve as a ‘post-mortem’ narrative. It cannot, and should not.

Any ‘after-action’ or ‘postmortem’ document (in my domain of web operations and engineering) has two main goals:

  1. To provide an explanation of how an event happened, as the organization (including those closest to the work) best understands it.
  2. To produce artifacts (recommendations, remediations, etc.) aimed at both prevention and the improvement of detection and response approaches to aid in handling similar events in the future.

You need #1 in order to work on #2. If you don’t understand how the event unfolded, you can’t make gains towards prevention in the future.

The purpose of this post is to outline how the release is not something that can or should be used for explanation or prevention.

The Release No. 70694 document does not address either of those concerns in any meaningful way.

What it does address, however, is exactly what a regulatory body is tasked to do in the wake of a known outcome: contrast how an organization was or was not in compliance with the rules that the body has put in place. Nothing more, nothing less. In this area, the document is concise and focused.

You can be forgiven for thinking that the document could serve as an explanation, because you can find some technical details in it. It looks a little bit like a timeline. What is interesting is not what details are covered, but what details are not covered, including the organizational sensemaking that is part of every complex systems failure.

If you are looking for a real postmortem of the Knight Capital accident in this post, you’re going to be disappointed. At the end of this post, I will certainly attempt to list some questions that I might pose if I was facilitating a debriefing of the event, but no real investigation can happen without the individuals closest to the work involved in the discussion.

However, I’d like to write up a bit about why it should not be viewed as what is traditionally known (at least in the web operations and engineering community) as a postmortem report. Because frankly I think that is more important than the specific event itself.

But before I do that, it’s necessary to unpack a few concepts related to learning in a retrospective way, as in a postmortem…

Counterfactuals

Learning from events in the past (both successful and unsuccessful) puts us into a funny position as humans. In a process that is genuinely interested in learning from events, we have to reconcile our need to understand with the reality that we will never get a complete picture of what has happened in the past. Regulatory bodies such as the SEC (lucky for them) don’t have to get a complete picture in order to do their job. They have only to point out the gap between how “work is prescribed” versus how “work is being done” (or, as Richard Cook has put it, “the system as imagined” versus “the system as found”).

In many circumstances (as in the case of the SEC release), this means pointing out the things that people and organizations didn’t do in the time preceding an event. This is usually done by using “counterfactuals”, which literally means “counter to the facts.”

In the language of my domain, using counterfactuals in the process of explanation and prevention is an anti-pattern, and I’ll explain why.

One of the potential pitfalls of postmortem reports (and debriefings) is that the language we use can cloud our opportunities to learn what took place and the context people (and machines!) found themselves in. Sidney Dekker says this about using counterfactuals:

“They make you spend your time talking about a reality that did not happen (but if it had happened, the mishap would not have happened).” (Dekker, 2006, p. 39)

What are examples of counterfactuals? In ordinary language, they look like:

  • “they shouldn’t have…”
  • “they could have…”
  • “they failed to…”
  • “if only they had…!”

Why are these statements woefully inappropriate for aiding explanation of what happened? Because stating what you think should have happened doesn’t explain people’s (or an organization’s) behavior. Counterfactuals serve as a massive distraction, because they bring sharply into focus what didn’t happen, when what is required for explanation is to understand why people did what they did.

People do what makes sense to them, given their focus, their goals, and what they perceive to be their environment. This is known as the local rationality principle, and it is required in order to tease out second stories, which in turn is required for learning from failure. People’s local rationality is influenced by many dynamics, and I can imagine some of these things might feel familiar to any engineers who operate in high-tempo organizations:

  • Multiple conflicting goals
    • E.g., “Deploy the new stuff, and do it quickly because our competitors may beat us! Also: take care of all of the details while you do it quickly, because one small mistake could make for a big deal!”
  • Multiple targets of attention
    • E.g., “When you deploy the new stuff, make sure you’re looking at the logs. And ignore the errors that are normally there, so you can focus on the right ones to pay attention to. Oh, and the dashboard graph of errors…pay attention to that. And the deployment process. And the system resources on each node as you deploy to them. And the network bandwidth. Also: remember, we have to get this done quickly.”

David Woods put counterfactual thinking in context with how people actually work:

“After-the-fact, based on knowledge of outcome, outsiders can identify “critical” decisions and actions that, if different, would have averted the negative outcome. Since these “critical” points are so clear to you with the benefit of hindsight, you could be tempted to think they should have been equally clear and obvious to the people involved in the incident. These people’s failure to see what is obvious now to you seems inexplicable and therefore irrational or even perverse. In fact, what seems to be irrational behavior in hindsight turns out to be quite reasonable from the point of view of the demands practitioners face and the resources they can bring to bear.” (Woods, 2010)

Dekker concurs:

“You construct a referent world from outside the accident sequence, based on data you now have access to, based on facts you now know to be true. The problem is that these after-the-fact-worlds may have very little relevance to the circumstances of the accident sequence. They do not explain the observed behavior. You have substituted your own world for the one that surrounded the people in question.” (Dekker, 2004, p.33)

“Saying what people failed to do, or implying what they could or should have done to prevent the mishap, has no role in understanding human error.”  (Dekker, 2004, p.43)

The engineers and managers at Knight Capital did not set out that morning of August 1, 2012 to lose $460 million. If they did, we’d be talking about sabotage and not human error. They did, however, set out to perform some work successfully (in this case, roll out what they needed to participate in the Retail Liquidity Program.)

If you haven’t picked up on it already, the use of counterfactuals is a manifestation of one of the most studied cognitive biases in modern psychology: The Hindsight Bias. I will leave it as an exercise to the reader to dig into that.

Outcome Bias

Cognitive biases are the greatest pitfalls in explaining surprising outcomes. The weird cousin of The Hindsight Bias is Outcome Bias. In a nutshell, it says that we are biased to “judge a past decision by its ultimate outcome instead of based on the quality of the decision at the time it was made, given what was known at that time.” (Outcome Bias, 2013)

In other words, we can be tricked into thinking that if the result of an accident is truly awful (like people dying, something crashing, or, say, losing $460 million in 20 minutes) then the decisions that led up to that outcome must have been reeeeeealllllllyyyy bad. Right?

This is a myth debunked by a few decades of social science, but it remains persistent. No decision maker has omniscience about results, so the severity of the outcome cannot be seen to be proportional to the quality of thought that went into the decisions or actions that led up to the result. Why we have this bias to begin with is yet another topic that we can explore another time.

But a possible indication that you are susceptible to The Outcome Bias is a quick thought exercise on results: if Knight Capital had lost only $1,000 (or less), would you think them to be more or less prudent in their preventative measures than in the case of $460 million?

If you’re into sports, maybe this can help shed light on The Outcome Bias.

Procedures 

Operators (within complex systems, at least) have procedures and rules to help them achieve their goals safely. They come in many forms: checklists, guidelines, playbooks, laws, etc. There is a distinction between procedures and rules, but they have similarities when it comes to gaining understanding of the past.

First let’s talk about procedures. In the aftermath of an accident, we can (and will, in the SEC release) see many calls for “they didn’t follow procedures!” or “they didn’t even have a checklist!” This sort of statement can nicely serve as a counterfactual.

What is important to recognize is that procedures are but one resource people use to do work. If we only worked by following every rule and procedure we’ve written for ourselves, to the letter, then I suspect society would come to a halt. As an aside, “work-to-rule” is a tactic that labor organizations have used to demonstrate how onerous rules and procedures can rob people of their adaptive capacities, and thereby bring business to an effective standstill.

Some more thought exercises on procedures:

  • How easy might it be to go to your corporate wiki or intranet to find a procedure (or a step within a procedure) that was once relevant, but no longer is?
  • Do you think you can find a procedure somewhere in your group that isn’t specific enough to address every context you might use it in?
  • Can you find steps in existing procedures that feel safe to skip, especially if you’re under time pressure to get something done?
  • Part of the legal terms of using Microsoft Office is that you read and understand the End User License Agreement. You did that before checking “I agree”, right? Or did you violate that legal agreement?! (don’t worry, I won’t tell anyone)

Procedures are important for a number of reasons. They serve as institutional knowledge and guidelines for safe work. But, like wikis, they make sense to the authors of the procedure the day they wrote it. They are written to take into account all of the scenarios and contexts that the author can imagine.

But since that imagination is limited, many procedures that are thought to ensure safety are context-sensitive and they require interpretation, modification, and adaptation.

There are multiple issues with procedures as they are navigated by people who do real work. Stealing from Dekker again:

  1. “First, a mismatch between procedures and practice is not unique to accident sequences. Not following procedures does not necessarily lead to trouble, and safe outcomes may be preceded by just as (relatively) many procedural deviations as those that precede accidents (Woods et al., 1994; Snook, 2000). This turns any “findings” about accidents being preceded by procedural violation into mere tautologies…”
  2. “Second, real work takes place in a context of limited resources and multiple goals and pressures.” 
  3. “Third, some of the safest complex, dynamic work not only occurs despite the procedures—such as aircraft line maintenance—but without procedures altogether.” The long-studied High Reliability Organizations have examples (in domains such as naval aircraft carrier operations and nuclear power generation) where procedures are eschewed, and instead replaced by less static forms of learning from practice:

    “there were no books on the integration of this new hardware into existing routines and no other place to practice it but at sea. Moreover, little of the process was written down, so that the ship in operation is the only reliable manual.” Work is “neither standardized across ships nor, in fact, written down systematically and formally anywhere.” Yet naval aircraft carriers—with inherent high-risk operations—have a remarkable safety record, like other so-called high reliability organizations (Rochlin et al., 1987; Weick, 1990; Rochlin, 1999).

  4. “Fourth, procedure-following can be antithetical to safety.”  – Consider the case of the 1949 US Mann Gulch disaster, where the firefighters who perished were the ones sticking to the organizational mandate to carry their tools everywhere. Or Swissair Flight 111, when the captain and co-pilot disagreed on whether or not to follow the prescribed checklist for an emergency landing. While they argued, the plane crashed. (Dekker, 2003)

Anyone operating in high-tempo and high-consequence environments recognizes both the utility and the brittleness of a procedure, no matter how much thought went into it.

Let’s keep this idea in mind as we walk through the SEC release below.

 

Violation of Rules != Explanation

 

Now let’s talk about rules. The SEC’s job (in a nutshell) is to design, maintain, and enforce regulations of practice for various types of financially-driven organizations in the United States. Note that they are not charged with explaining or preventing events. Prevention may or may not result from their work on regulations, but prevention demands much more than abiding by rules.

Rules and regulations are similar to procedures in that they are written with deliberate but ultimately interpretable intention. Judges and juries help interpret different contexts as they relate to a given rule, law, or regulation. Rules are good for a number of reasons that are beyond the scope of this (now lengthy) post.

If we think about regulations in the context of causality, however, we can get into trouble.

Because we can find ourselves in uncertain contexts that have some of the dynamics that I listed above (multiple conflicting goals and targets of attention), regulations (even when we are acutely aware of them) pose some issues. In the Man-Made Disasters Model, Nick Pidgeon lays some of this out for us:

“Uncertainty may also arise about how to deal with formal violations of safety regulations. Violations might occur because regulations are ambiguous, in conflict with other goals such as the needs of production, or thought to be outdated because of technological advance. Alternatively safety waivers may be in operation, allowing relaxation of regulations under certain circumstances (as also occurred in the ‘Challenger’ case; see Vaughan, 1996).” (Pidgeon, 2000)

Rules and regulations need to allow for interpretation; otherwise, they would be brittle in enforcement. Therefore, some vagueness and flexibility in rules is desirable. We’ll see, however, how this vagueness can be exploited for enforcement at the expense of learning.

Back to the statement

Once more: the SEC document cannot be viewed as a canonical description of what happened with Knight Capital on August 1, 2012.

It can, however, be viewed as a comprehensive account of the exchange and trading regulations the SEC deems were violated by the organization. This is its purpose. My goal here is not to critique the SEC release for its purpose; it is to reveal how it cannot be seen to aid either explanation or prevention of the event, and so should not be used for that.

Before we walk through (at least parts of) the document, it’s worth noting that there is no objective accident investigative body for electronic trading systems. In aviation, there is a regulatory body (the FAA) and an investigative body (the NTSB), and there are significant differences between the two, charter-wise and operations-wise. There exists no such independent investigative body analogous to the NTSB in Knight Capital’s industry. There is only the SEC.

The Release

I’ll have comments in italics, in blue, and I’ll talk about the highlighted pieces. After getting feedback from many colleagues, I decided to keep the length here for people to dig into, because I think it’s important to understand. If you make it through this, you deserve cake.

If you want to skip the annotated and butchered SEC statement, you can just go to the summary.

I.

The Securities and Exchange Commission (the “Commission”) deems it appropriate and in the public interest that public administrative and cease-and-desist proceedings be, and hereby are, instituted pursuant to Sections 15(b) and 21C of the Securities Exchange Act of 1934 (the “Exchange Act”) against Knight Capital Americas LLC (“Knight” or “Respondent”).

II.

In anticipation of the institution of these proceedings, Respondent has submitted an Offer of Settlement (the “Offer”), which the Commission has determined to accept. Solely for the purpose of these proceedings and any other proceedings by or on behalf of the Commission, or to which the Commission is a party, and without admitting or denying the findings herein, except as to the Commission’s jurisdiction over it and the subject matter of these proceedings, which are admitted, Respondent consents to the entry of this Order Instituting Administrative and Cease-and-Desist Proceedings, Pursuant to Sections 15(b) and 21C of the Securities Exchange Act of 1934, Making Findings, and Imposing Remedial Sanctions and a Cease-and-Desist Order (“Order”), as set forth below:

Note: This means that Knight doesn’t have to agree or disagree with any of the statements in the document. This is expected. If it were intended to be a postmortem doc, then there would be a lot more covered here in addition to listing violations of regulations.

III.

On the basis of this Order and Respondent’s Offer, the Commission finds that:

INTRODUCTION

1. On August 1, 2012, Knight Capital Americas LLC (“Knight”) experienced a significant error in the operation of its automated routing system for equity orders, known as SMARS. While processing 212 small retail orders that Knight had received from its customers, SMARS routed millions of orders into the market over a 45-minute period, and obtained over 4 million executions in 154 stocks for more than 397 million shares. By the time that Knight stopped sending the orders, Knight had assumed a net long position in 80 stocks of approximately $3.5 billion and a net short position in 74 stocks of approximately $3.15 billion. Ultimately, Knight lost over $460 million from these unwanted positions. The subject of these proceedings is Knight’s violation of a Commission rule that requires brokers or dealers to have controls and procedures in place reasonably designed to limit the risks associated with their access to the markets, including the risks associated with automated systems and the possibility of these types of errors.

Note: Again, the purpose of the doc is to point out where Knight violated rules. It is not:

  • a description of the multiple trade-offs that engineering at Knight made or considered when designing fault-tolerance in their systems, or
  • how Knight as an organization evolved over time to focus on evolving some procedures and not others, or
  • how engineers anticipated and prepared for deploying support for the new RLP effort on Aug 1, 2012.

To equate any of those things with violation of a rule is a cognitive leap that we should stay very far away from.

It’s worth mentioning here that the document only focuses on failures, and makes no mention of successes. How Knight succeeded during diagnosis and response is unknown to us, so a rich source of data isn’t available. Because of this, we cannot pretend that the document gives an explanation.

2. Automated trading is an increasingly important component of the national market system. Automated trading typically occurs through or by brokers or dealers that have direct access to the national securities exchanges and other trading centers. Retail and institutional investors alike rely on these brokers, and their technology and systems, to access the markets.

3. Although automated technology brings benefits to investors, including increased execution speed and some decreased costs, automated trading also amplifies certain risks. As market participants increasingly rely on computers to make order routing and execution decisions, it is essential that compliance and risk management functions at brokers or dealers keep pace. In the absence of appropriate controls, the speed with which automated trading systems enter orders into the marketplace can turn an otherwise manageable error into an extreme event with potentially wide-spread impact.

Note: The sharp contrast between our ability to create complex and valuable automation and our ability to reason about, influence, control, and understand it in even ‘normal’ operating conditions (forget about time-pressured emergency diagnosis of a problem) is something I (and many others over the decades) have written about. The key phrase here is “keep pace”, and it’s difficult for me to argue with. This may be the most valuable statement in the document with regards to safety and the use of automation.

4. Prudent technology risk management has, at its core, quality assurance, continuous improvement, controlled testing and user acceptance, process measuring, management and control, regular and rigorous review for compliance with applicable rules and regulations and a strong and independent audit process. To ensure these basic features are present and incorporated into day-to-day operations, brokers or dealers must invest appropriate resources in their technology, compliance, and supervisory infrastructures. Recent events and Commission enforcement actions have demonstrated that this investment must be supported by an equally strong commitment to prioritize technology governance with a view toward preventing, wherever possible, software malfunctions, system errors and failures, outages or other contingencies and, when such issues arise, ensuring a prompt, effective, and risk-mitigating response. The failure by, or unwillingness of, a firm to do so can have potentially catastrophic consequences for the firm, its customers, their counterparties, investors and the marketplace.

Note: Here we have the first value statement we see in the document. It states what is “prudent” in risk management. This is reasonable for the SEC to state in a generic high-level way, given its charge: to interpret regulations. This sets the stage for showing contrast between what happened, and what the rules are, which comes later.

If this were a postmortem doc, this word should be a red flag that immediately sets your face on fire. Stating what is “prudent” is essentially imposing standards onto history. It is a declaration of what a standard of good practice looks like. The SEC does not mention Knight Capital as not prudent specifically, but they don’t have to. This is the model on which the rest of the document rests. Stating what standards of good practice look like in a document that is looked to for explanation is an anti-pattern. In aviation, this might be analogous to saying that a pilot lacked “good airmanship” and pointing at it as a cause. The phrases “must invest appropriate resources” and “equally strong” above are both non-binary and context-sensitive. What is appropriate and equally strong gets to be defined by…whom?

  • What is “prudent”?
  • The description only says prudence demands prevention of errors, outages, and malfunctions “wherever possible.” How will you know where prevention is not possible? And following that – it would appear that you can be prudent and still not prevent errors and malfunctions.
  • Please ensure a “prompt, effective, and risk-mitigating response.” In other words: fix it correctly and fix it quickly. It’s so simple!

 

5. The Commission adopted Exchange Act Rule 15c3-52 in November 2010 to require that brokers or dealers, as gatekeepers to the financial markets, “appropriately control the risks associated with market access, so as not to jeopardize their own financial condition, that of other market participants, the integrity of trading on the securities markets, and the stability of the financial system.”

Note: It’s true, this is what the rule says. What is deemed  “appropriate”, it would seem, is dependent on the outcome. Had an accident? It was not appropriate control. Didn’t have an accident? It must be appropriate control. This would mean that Knight Capital did have appropriate controls the day before the accident. Outcome bias reigns supreme here.

6. Subsection (b) of Rule 15c3-5 requires brokers or dealers with market access to “establish, document, and maintain a system of risk management controls and supervisory procedures reasonably designed to manage the financial, regulatory, and other risks” of having market access. The rule addresses a range of market access arrangements, including customers directing their own trading while using a broker’s market participant identifications, brokers trading for their customers as agents, and a broker-dealer’s trading activities that place its own capital at risk. Subsection (b) also requires a broker or dealer to preserve a copy of its supervisory procedures and a written description of its risk management controls as part of its books and records.

Note: The rule says, basically: “have a document about controls and risks.” It doesn’t say anything about an organization’s ability to adapt them as time and technology progresses, only that at some point they were written down and shared with the right parties.

7. Subsection (c) of Rule 15c3-5 identifies specific required elements of a broker or dealer’s risk management controls and supervisory procedures. A broker or dealer must have systematic financial risk management controls and supervisory procedures that are reasonably designed to prevent the entry of erroneous orders and orders that exceed pre-set credit and capital thresholds in the aggregate for each customer and the broker or dealer. In addition, a broker or dealer must have regulatory risk management controls and supervisory procedures that are reasonably designed to ensure compliance with all regulatory requirements.

Note: This is the first of many instances of the phrase “reasonably designed” in the document. As with the word ‘appropriate’, how something is defined to be “reasonably designed” is dependent on the outcome of that design. This robs both the design and the engineer of the nuanced details that make for resilient systems. Modern technology doesn’t work or not-work. It breaks and fails in surprising (sometimes shocking) ways that were not imagined by its designers, which means that “reason” plays only a part of its quality.

Right now, every (non-malicious) engineer around the world is designing and building systems that they believe are “reasonably designed.” If they didn’t think so, they wouldn’t be finished until they did.

Some of those systems will fail. Most will not. Many of them will fail in ways that are safe and anticipated. Some will not, and will surprise everyone.

Systems Safety researcher Erik Hollnagel has had related thoughts:

We must strive to understand that accidents don’t happen because people gamble and lose.

Accidents happen because the person believes that:
…what is about to happen is not possible,
…or what is about to happen has no connection to what they are doing,
…or that the possibility of getting the intended outcome is well worth whatever risk there is.

8. Subsection (e) of Rule 15c3-5 requires brokers or dealers with market access to establish, document, and maintain a system for regularly reviewing the effectiveness of their risk management controls and supervisory procedures. This sub-section also requires that the Chief Executive Officer (“CEO”) review and certify that the controls and procedures comply with subsections (b) and (c) of the rule. These requirements are intended to assure compliance on an ongoing basis, in part by charging senior management with responsibility to regularly review and certify the effectiveness of the controls.

Note: This takes into consideration that systems are not indeed static, and it implies that they need to evolve over time. This is important to remember for some notes later on.

9. Beginning no later than July 14, 2011, and continuing through at least August 1, 2012, Knight’s system of risk management controls and supervisory procedures was not reasonably designed to manage the risk of its market access. In addition, Knight’s internal reviews were inadequate, its annual CEO certification for 2012 was defective, and its written description of its risk management controls was insufficient. Accordingly, Knight violated Rule 15c3-5. In particular:

  1. Knight did not have controls reasonably designed to prevent the entry of erroneous orders at a point immediately prior to the submission of orders to the market by one of Knight’s equity order routers, as required under Rule 15c3-5(c)(1)(ii);
  2. Knight did not have controls reasonably designed to prevent it from entering orders for equity securities that exceeded pre-set capital thresholds for the firm, in the aggregate, as required under Rule 15c3-5(c)(1)(i). In particular, Knight failed to link accounts to firm-wide capital thresholds, and Knight relied on financial risk controls that were not capable of preventing the entry of orders;
  3. Knight did not have an adequate written description of its risk management controls as part of its books and records in a manner consistent with Rule 17a-4(e)(7) of the Exchange Act, as required by Rule 15c3-5(b);
  4. Knight also violated the requirements of Rule 15c3-5(b) because Knight did not have technology governance controls and supervisory procedures sufficient to ensure the orderly deployment of new code or to prevent the activation of code no longer intended for use in Knight’s current operations but left on its servers that were accessing the market; and Knight did not have controls and supervisory procedures reasonably designed to guide employees’ responses to significant technological and compliance incidents;
  5. Knight did not adequately review its business activity in connection with its market access to assure the overall effectiveness of its risk management controls and supervisory procedures, as required by Rule 15c3-5(e)(1); and
  6. Knight’s 2012 annual CEO certification was defective because it did not certify that Knight’s risk management controls and supervisory procedures complied with paragraphs (b) and (c) of Rule 15c3-5, as required by Rule 15c3-5(e)(2).

Note: It’s a counterfactual party! The question remains: are conditions sufficient, reasonably designed, or adequate if they don’t result in an accident like this one? Which comes first: these characterizations, or the accident? Knight Capital did believe these things were sufficient, reasonably designed, and adequate. Otherwise, they would have addressed them. One question necessary to answer for prevention is: “What were the sources of confidence that Knight Capital drew upon as they designed their systems?” Because improvement lies there.

10. As a result of these failures, Knight did not have a system of risk management controls and supervisory procedures reasonably designed to manage the financial, regulatory, and other risks of market access on August 1, 2012, when it experienced a significant operational failure that affected SMARS, one of the primary systems Knight uses to send orders to the market. While Knight’s technology staff worked to identify and resolve the issue, Knight remained connected to the markets and continued to send orders in certain listed securities. Knight’s failures resulted in it accumulating an unintended multi-billion dollar portfolio of securities in approximately forty-five minutes on August 1 and, ultimately, Knight lost more than $460 million, experienced net capital problems, and violated Rules 200(g) and 203(b) of Regulation SHO.

A. Respondent

FACTS

11. Knight Capital Americas LLC (“Knight”) is a U.S.-based broker-dealer and a wholly-owned subsidiary of KCG Holdings, Inc. Knight was owned by Knight Capital Group, Inc. until July 1, 2013, when that entity and GETCO Holding Company, LLC combined to form KCG Holdings, Inc. Knight is registered with the Commission pursuant to Section 15 of the Exchange Act and is a Financial Industry Regulatory Authority (“FINRA”) member. Knight has its principal business operations in Jersey City, New Jersey. Throughout 2011 and 2012, Knight’s aggregate trading (both for itself and for its customers) generally represented approximately ten percent of all trading in listed U.S. equity securities. SMARS generally represented approximately one percent or more of all trading in listed U.S. equity securities.

B. August 1, 2012 and Related Events

Preparation for NYSE Retail Liquidity Program

12. To enable its customers’ participation in the Retail Liquidity Program (“RLP”) at the New York Stock Exchange, which was scheduled to commence on August 1, 2012, Knight made a number of changes to its systems and software code related to its order handling processes. These changes included developing and deploying new software code in SMARS. SMARS is an automated, high speed, algorithmic router that sends orders into the market for execution. A core function of SMARS is to receive orders passed from other components of Knight’s trading platform (“parent” orders) and then, as needed based on the available liquidity, send one or more representative (or “child”) orders to external venues for execution.

13. Upon deployment, the new RLP code in SMARS was intended to replace unused code in the relevant portion of the order router. This unused code previously had been used for functionality called “Power Peg,” which Knight had discontinued using many years earlier. Despite the lack of use, the Power Peg functionality remained present and callable at the time of the RLP deployment. The new RLP code also repurposed a flag that was formerly used to activate the Power Peg code. Knight intended to delete the Power Peg code so that when this flag was set to “yes,” the new RLP functionality—rather than Power Peg—would be engaged.

Note: Noting the intention is important in gaining understanding, because it shows effort to get into the mindset of the individual or groups involved in the work. If this introspection continued throughout the document, it would get a little closer to something like a postmortem.

Raise your hand if you can definitively state all of the active and inactive code execution paths in your application right now. Right.
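
For concreteness, here is a minimal sketch (in Python, with entirely hypothetical names; the SEC release includes no source code) of how a repurposed flag can mean two different things across a partially deployed fleet:

    # A sketch only: these names are hypothetical, and the SEC release contains no
    # source code. The point is that a repurposed flag means different things
    # depending on which build of the router a given server happens to be running.

    def standard_route(order):
        return "standard routing for " + order

    def rlp_route(order):
        return "RLP routing for " + order

    def power_peg_route(order):
        # Retired functionality that was never deleted and is still callable.
        return "POWER PEG routing for " + order

    def route_order(order, flag_enabled, build):
        """On the new build the flag means 'use RLP'; on the old build the very
        same flag value still means 'use Power Peg'."""
        if not flag_enabled:
            return standard_route(order)
        return rlp_route(order) if build == "new" else power_peg_route(order)

    # Seven servers got the new build; one did not.
    for build in ["new"] * 7 + ["old"]:
        print(route_order("parent-order-123", flag_enabled=True, build=build))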

14. When Knight used the Power Peg code previously, as child orders were executed, a cumulative quantity function counted the number of shares of the parent order that had been executed. This feature instructed the code to stop routing child orders after the parent order had been filled completely. In 2003, Knight ceased using the Power Peg functionality. In 2005, Knight moved the tracking of cumulative shares function in the Power Peg code to an earlier point in the SMARS code sequence. Knight did not retest the Power Peg code after moving the cumulative quantity function to determine whether Power Peg would still function correctly if called.

Note: On the surface, this looks like some technical meat to bite into. There is some detail surrounding a fault-tolerance guardrail here, something to fail “closed” in the presence of specific criteria. What’s missing? Any dialogue about why the function was moved from one place (in Power Peg) to another (earlier in SMARS). This is important, because in my experience, engineers don’t make that sort of effort without motivation. If that motivation was explored, then we’d get a better sense of where the organization drew its confidence from, previous to the accident. This helps us understand their local rationality. But: we don’t get that from this document.
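
Purely as an illustration, here is the shape of that kind of cumulative-quantity guard, and of what happens on a code path that no longer includes it (again, hypothetical names, not Knight’s code):

    # Hypothetical illustration, not Knight's code: a cumulative-quantity guard stops
    # the router once the parent order is filled. A dormant code path that no longer
    # passes through the guard keeps generating child orders regardless of fills.

    def send_child_orders(parent_qty, fills, guard_in_path=True):
        """Return how many child orders get sent for one parent order.
        `fills` is the stream of executed share counts reported back."""
        executed = 0
        child_orders = 0
        for fill in fills:
            if guard_in_path and executed >= parent_qty:
                break  # parent order filled: stop routing child orders
            child_orders += 1
            executed += fill
        return child_orders

    fills = [50, 50, 50, 50, 50]
    print(send_child_orders(100, fills))         # guard present: 2 child orders
    print(send_child_orders(100, fills, False))  # guard bypassed: 5 child orders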

15. Beginning on July 27, 2012, Knight deployed the new RLP code in SMARS in stages by placing it on a limited number of servers in SMARS on successive days. During the deployment of the new code, however, one of Knight’s technicians did not copy the new code to one of the eight SMARS computer servers. Knight did not have a second technician review this deployment and no one at Knight realized that the Power Peg code had not been removed from the eighth server, nor the new RLP code added. Knight had no written procedures that required such a review.

Note: Code and deployment review is a fine thing to have. But is it sufficient? Dr. Nancy Leveson explained, when she was invited to speak at the SEC’s “Technology Roundtable” in October of last year, that in 1992 she chaired a committee to review the code that was deployed on the Space Shuttle. She said that NASA was spending $100 million a year to maintain the code, was employing the smartest engineers in the world, and gaps of concern were still found. She repeated that there is no such thing as perfect software, no matter how much effort an individual or organization makes to produce such a thing.

Do written procedures requiring a review of code or deployment guarantee safety? Of course not. But ensuring safety isn’t what the SEC is expected to do in this document. Again: they are only pointing out the differences between regulation and practice.
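
For what it’s worth, one kind of automated cross-check that teams sometimes reach for after an event like this looks roughly like the sketch below. The names are hypothetical, and like any safeguard it only catches what its author imagined:

    # Hypothetical sketch of a fleet-consistency check: compare the build identifier
    # reported by each server before trusting a flag that assumes the new code is
    # everywhere.

    def fleet_is_consistent(reported_builds, expected):
        """Given {server: build_id}, return (ok, list of servers on the wrong build)."""
        stragglers = [srv for srv, build in sorted(reported_builds.items())
                      if build != expected]
        return (len(stragglers) == 0, stragglers)

    reported = {"smars-%d" % i: "rlp-2012-07-27" for i in range(1, 8)}
    reported["smars-8"] = "power-peg-era"  # the server the deploy never reached
    ok, stragglers = fleet_is_consistent(reported, "rlp-2012-07-27")
    print(ok, stragglers)  # False ['smars-8']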

Events of August 1, 2012

16. On August 1, Knight received orders from broker-dealers whose customers were eligible to participate in the RLP. The seven servers that received the new code processed these orders correctly. However, orders sent with the repurposed flag to the eighth server triggered the defective Power Peg code still present on that server. As a result, this server began sending child orders to certain trading centers for execution. Because the cumulative quantity function had been moved, this server continuously sent child orders, in rapid sequence, for each incoming parent order without regard to the number of share executions Knight had already received from trading centers. Although one part of Knight’s order handling system recognized that the parent orders had been filled, this information was not communicated to SMARS.

Note: So the guardrail/fail-closed mechanism wasn’t in the same place it was before, and the eighth server was allowed to continue on. As Leveson said in her testimony: ” It’s not necessarily just individual component failure. In a lot of these accidents each individual component worked exactly the way it was expected to work. It surprised everyone in the interactions among the components.”

17. The consequences of the failures were substantial. For the 212 incoming parent orders that were processed by the defective Power Peg code, SMARS sent millions of child orders, resulting in 4 million executions in 154 stocks for more than 397 million shares in approximately 45 minutes. Knight inadvertently assumed an approximately $3.5 billion net long position in 80 stocks and an approximately $3.15 billion net short position in 74 stocks. Ultimately, Knight realized a $460 million loss on these positions.

Note: Just in case you forgot, this accident was sooooo bad. These numbers are so big. Keep that in mind, dear reader, because I want you to remember that when you think about the engineer who thought he had deployed the code to the eighth server.

18. The millions of erroneous executions influenced share prices during the 45 minute period. For example, for 75 of the stocks, Knight’s executions comprised more than 20 percent of the trading volume and contributed to price moves of greater than five percent. As to 37 of those stocks, the price moved by greater than ten percent, and Knight’s executions constituted more than 50 percent of the trading volume. These share price movements affected other market participants, with some participants receiving less favorable prices than they would have in the absence of these executions and others receiving more favorable prices.

BNET Reject E-mail Messages

19. On August 1, Knight also received orders eligible for the RLP but that were designated for pre-market trading. SMARS processed these orders and, beginning at approximately 8:01 a.m. ET, an internal system at Knight generated automated e-mail messages (called “BNET rejects”) that referenced SMARS and identified an error described as “Power Peg disabled.” Knight’s system sent 97 of these e-mail messages to a group of Knight personnel before the 9:30 a.m. market open. Knight did not design these types of messages to be system alerts, and Knight personnel generally did not review them when they were received. However, these messages were sent in real time, were caused by the code deployment failure, and provided Knight with a potential opportunity to identify and fix the coding issue prior to the market open. These notifications were not acted upon before the market opened and were not used to diagnose the problem after the open.

Note: Translated, this says that system-generated warnings that were sent via email weren’t noticed. Getting signals sent by automated systems (synchronously, as in “alerts,” or asynchronously, as in “email”) to perfectly detect or prevent anomalies is not a solved problem. Show me an outage, any outage, and I’ll show you warning signs that humans didn’t pick up on. The document doesn’t give any detail on why those types of messages were sent via email (as opposed to paging-style alerts), what the distribution list was for them, how those messages get generated, or any other details.

Is the number of emails (97 of them) important? 97 sounds like a lot, doesn’t it? If it had been one, and not 97, would the paragraph read differently? What if there had been 10,000 messages sent?

How many engineers right now are receiving alerts on their phone (forget about emails) that they will glance at and think that they are part of the normal levels of noise in the system, because thresholds and error handling are not always precisely tuned?

C. Controls and Supervisory Procedures

SMARS

20. Knight had a number of controls in place prior to the point that orders reached SMARS. In particular, Knight’s customer interface, internal order management system, and system for internally executing customer orders all contained controls concerning the prevention of the entry of erroneous orders.

21. However, Knight did not have adequate controls in SMARS to prevent the entry of erroneous orders. For example, Knight did not have sufficient controls to monitor the output from SMARS, such as a control to compare orders leaving SMARS with those that entered it. Knight also did not have procedures in place to halt SMARS’s operations in response to its own aberrant activity. Knight had a control that capped the limit price on a parent order, and therefore related child orders, at 9.5 percent below the National Best Bid (for sell orders) or above the National Best Offer (for buy orders) for the stock at the time that SMARS had received the parent order. However, this control would not prevent the entry of erroneous orders in circumstances in which the National Best Bid or Offer moved by less than 9.5 percent. Further, it did not apply to orders—such as the 212 orders described above—that Knight received before the market open and intended to send to participate in the opening auction at the primary listing exchange for the stock.

Note: Anomaly detection and error-handling criteria have two origins: the imagination of their authors and the history of surprises that have been encountered already. A significant number of thresholds, guardrails, and alerts in any technical organization are put in place only after it’s realized that they are needed. Some of these realizations come from negative events like outages, data loss, etc. and some of them come from “near-misses” or explicit re-anticipation activated by feedback that comes from real-world operation.

Even then, real-world observations don’t always produce new safeguards. How many successful trades had Knight Capital seen in its lifetime while that control allowed “the entry of erroneous orders in circumstances in which the National Best Bid or Offer moved by less than 9.5 percent”? How many successful Shuttle launches saw degradation in O-ring integrity before the Challenger accident? This ‘normalization of deviance’ (Vaughan, 2009) phenomenon is to be expected in all socio-technical organizations. Financial trading systems are no exception. History matters.
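
For illustration, here is a minimal sketch of what a price-collar control like the one described in paragraph 21 might look like. The function and field names are my own assumptions, not Knight’s implementation; the point is only to show what such a check can and cannot see.

```python
# A hedged sketch of a 9.5 percent price collar, as described by the SEC.
# Names and structure are illustrative; this is not Knight's actual code.

COLLAR = 0.095  # 9.5 percent

def collar_limit_price(side, limit_price, national_best_bid, national_best_offer):
    """Cap a parent order's limit price relative to the NBBO at order receipt."""
    if side == "sell":
        # Sells may not be priced more than 9.5% below the National Best Bid.
        return max(limit_price, national_best_bid * (1 - COLLAR))
    if side == "buy":
        # Buys may not be priced more than 9.5% above the National Best Offer.
        return min(limit_price, national_best_offer * (1 + COLLAR))
    raise ValueError(f"unknown side: {side!r}")

# The collar constrains *price*, not *volume*: it says nothing about how many
# child orders or shares go out the door while prices move less than 9.5 percent.
print(collar_limit_price("buy", 10.02, national_best_bid=10.00, national_best_offer=10.01))
```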

Capital Thresholds

Note: Nothing in this section had much value in explanation or prevention.

Code Development and Deployment

26. Knight did not have written code development and deployment procedures for SMARS (although other groups at Knight had written procedures), and Knight did not require a second technician to review code deployment in SMARS. Knight also did not have a written protocol concerning the accessing of unused code on its production servers, such as a protocol requiring the testing of any such code after it had been accessed to ensure that the code still functioned properly.

Note: Again, does a review guarantee safety? Does testing prevent malfunction?

Incident Response

27. On August 1, Knight did not have supervisory procedures concerning incident response. More specifically, Knight did not have supervisory procedures to guide its relevant personnel when significant issues developed. On August 1, Knight relied primarily on its technology team to attempt to identify and address the SMARS problem in a live trading environment. Knight’s system continued to send millions of child orders while its personnel attempted to identify the source of the problem. In one of its attempts to address the problem, Knight uninstalled the new RLP code from the seven servers where it had been deployed correctly. This action worsened the problem, causing additional incoming parent orders to activate the Power Peg code that was present on those servers, similar to what had already occurred on the eighth server.

Note: I would like to think that most engineering organizations that are tasked with troubleshooting issues in production systems understand that diagnosis isn’t something you can prescribe. Successful incident response in escalating scenarios is something that comes from real-world practice, not a document. Improvisation and intuition play a significant role in this, which obviously cannot be written down beforehand.

Thought exercise: you just deployed new code to production. You become aware of an issue. Would it be surprising if one of the ways you attempt to rectify the scenario is to roll back to the last known working version? The SEC release implies that it would be.

D. Compliance Reviews and Written Description of Controls

Note: I’m skipping some sections here as it’s just more about compliance. 

Post-Compliance Date Reviews

32. Knight conducted periodic reviews pursuant to the WSPs. As explained above, the WSPs assigned various tasks to be performed by SCG staff in consultation with the pertinent business and technology units, with a senior member of the pertinent business unit reviewing and approving that work. These reviews did not consider whether Knight needed controls to limit the risk that SMARS could malfunction, nor did these reviews consider whether Knight needed controls concerning code deployment or unused code residing on servers. Before undertaking any evaluation of Knight’s controls, SCG, along with business and technology staff, had to spend significant time and effort identifying the missing content and correcting the inaccuracies in the written description.

33. Several previous events presented an opportunity for Knight to review the adequacy of its controls in their entirety. For example, in October 2011, Knight used test data to perform a weekend disaster recovery test. After the test concluded, Knight’s LMM desk mistakenly continued to use the test data to generate automated quotes when trading began that Monday morning. Knight experienced a nearly $7.5 million loss as a result of this event. Knight responded to the event by limiting the operation of the system to market hours, changing the control so that this system would stop providing quotes after receiving an execution, and adding an item to a disaster recovery checklist that required a check of the test data. Knight did not broadly consider whether it had sufficient controls to prevent the entry of erroneous orders, regardless of the specific system that sent the orders or the particular reason for that system’s error. Knight also did not have a mechanism to test whether their systems were relying on stale data.

Note: That we might be able to cherry-pick opportunities in the past where signs of doomsday could have (or should have) been seen and heeded is consistent with textbook definitions of The Hindsight Bias. How organizations learn is influenced by the social and cultural dynamics of their internal structures. Again, Diane Vaughan’s writings are a place we can look to for exploring how path dependency can get us into surprising places. But again: it is not the SEC’s job to speak to that.
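
As a thought experiment only: a stale-data guard of the kind described as missing in paragraph 33 might look something like the minimal sketch below. The names, fields, threshold, and dates are all invented for illustration; nothing here reflects Knight’s actual systems.

```python
# Hypothetical sketch of a stale-data guard. Everything here (names, threshold,
# dates) is invented; it only illustrates the kind of check paragraph 33 says
# was missing.

from datetime import datetime, timedelta, timezone

MAX_QUOTE_AGE = timedelta(seconds=5)

def is_stale(quote_time, now=None):
    """Return True if a quote is older than we are willing to quote or trade on."""
    now = now or datetime.now(timezone.utc)
    return (now - quote_time) > MAX_QUOTE_AGE

# A quote generated during a weekend disaster-recovery test would be days old
# by Monday's open; a guard like this would refuse to act on it.
saturday_test_quote = datetime(2011, 10, 15, 14, 0, tzinfo=timezone.utc)
monday_open = datetime(2011, 10, 17, 13, 30, tzinfo=timezone.utc)
print(is_stale(saturday_test_quote, now=monday_open))  # True
```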

E. CEO Certification

34. In March 2012, Knight’s CEO signed a certification concerning Rule 15c3-5. The certification did not state that Knight’s controls and procedures complied with the rule. Instead, the certification stated that Knight had in place “processes” to comply with the rule. This drafting error was not intentional, the CEO did not notice the error, and the CEO believed at the time that he was certifying that Knight’s controls and procedures complied with the rule.

Note: This is possibly the only hint at local rationality in the document. 

F. Collateral Consequences

35. There were collateral consequences as a result of the August 1 event, including significant net capital problems. In addition, many of the millions of orders that SMARS sent on August 1 were short sale orders. Knight did not mark these orders as short sales, as required by Rule 200(g) of Regulation SHO. Similarly, Rule 203(b) of Regulation SHO prohibits a broker or dealer from accepting a short sale order in an equity security from another person, or effecting a short sale in an equity security for its own account, unless it has borrowed the security, entered into a bona-fide arrangement to borrow the security, or has reasonable grounds to believe that the security can be borrowed so that it can be delivered on the date delivery is due (known as the “locate” requirement), and has documented compliance with this requirement. Knight did not obtain a “locate” in connection with Knight’s unintended orders and did not document compliance with the requirement with respect to Knight’s unintended orders.

VIOLATIONS
A. Market Access Rule: Section 15(c)(3) of the Exchange Act and Rule 15c3-5

Note: I’m going to skip a bit because it’s not much more than a restating of the rules that the SEC deemed were broken…

Accordingly, pursuant to Sections 15(b) and 21C of the Exchange Act, it is hereby ORDERED that:

A. Respondent Knight cease and desist from committing or causing any violations and any future violations of Section 15(c)(3) of the Exchange Act and Rule 15c3-5 thereunder, and Rules 200(g) and 203(b) of Regulation SHO.

Note: Translated – you must stop immediately all of the things that violate rules that say you must “reasonably design” things. So don’t unreasonably design things anymore. 

The SEC document does what it needs to do: walk through the regulations that they think were violated, and talk about the settlement agreement. Knight Capital doesn’t have to admit they did anything wrong or suboptimal, and the SEC gets to tell them what to do next. That is, roughly:

  1. Hire a consultant that helps them not unreasonably design things anymore, and document that.
  2. Pay $12M to the SEC.
Under no circumstances should you take this document to be an explanation of the event or how to prevent future ones like it.

What questions remain unanswered?

Like I mentioned before, this SEC release doesn’t help explain how the event came to be, or make any effort towards prevention other than requiring Knight Capital to pay a settlement, hire a consultant, and write new procedures that can predict the future. I do not know anyone at Knight Capital (or at the SEC, for that matter) so it’s very unlikely that I’ll gain any more awareness of accident details than you will, my dear reader.

But I can put down a few questions that I might ask if I were facilitating the debriefing of the accident, which could possibly help with gaining a systems-thinking perspective on explanation. Real prevention is left as an exercise for the readers who also work at Knight Capital.

  •  The engineer who deployed the new code to support the RLP integration had confidence that all servers (not just seven of the eight) received the new code. What gave him that confidence? Was it a dashboard? Reliance on an alert? Some other sort of feedback from the deployment process?
  • The BNET Reject E-mail Messages: Have they ever been sent before? Do the recipients of them trust their validity? What is the background on their delivery being via email, versus synchronous alerting? Do they provide enough context in their content to give an engineer sufficient criteria to act on?
  • What were the signals that the responding team used to indicate that a roll-back of the code on the seven servers was a potential repairing action?
  • Did the team that was responding to the issue have solid and clear communication channels? Was it textual chat, in person, or over voice or video conference?
  • Did the team have to improvise any new tooling to be used in the diagnosis or response?
  • What metrics did the team use to guide their actions? Were they infrastructural (such as latency, network, or CPU graphs?) or market-related data (trades, positions, etc.) or a mixture?
  • What indications were there to raise awareness that the eighth server didn’t receive the latest code? Was it a checksum or versioning? Was it logs from a deployment tool? Was it differences in the server metrics of the eighth server? (A sketch of one such verification check follows this list.)
  • As the new code was rolled out: what was the team focused on? What were they seeing?
  • As they recognized there was an issue: did the symptoms look like something they had seen before?
  • As the event unfolded: did the responding team discuss what to do, or did single actors take action?
  • Regarding non-technical teams: were they involved with directing the response?
  • Many, many more questions remain that presumably (hopefully) Knight Capital has asked and answered themselves.
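
To make the checksum/versioning question above concrete, here is a minimal, hypothetical sketch of a post-deploy verification step. The host names, file path, and use of ssh/sha256sum are invented for illustration and say nothing about what tooling Knight actually had; a check like this doesn’t replace the debriefing questions, it’s just one example of the kind of feedback that can shape an engineer’s confidence that a deploy landed everywhere.

```python
# Hypothetical post-deploy verification sketch: hash the deployed artifact on
# every server and flag any host whose copy differs from the rest of the fleet.
# Hosts, paths, and tooling are illustrative assumptions only.

import subprocess
from collections import Counter

HOSTS = [f"smars{n:02d}.example.internal" for n in range(1, 9)]  # eight servers
ARTIFACT = "/opt/smars/current/router.jar"                       # invented path

def artifact_hash(host):
    """Return the SHA-256 digest of the deployed artifact on one host."""
    result = subprocess.run(
        ["ssh", host, "sha256sum", ARTIFACT],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()[0]

def hosts_out_of_sync(hosts):
    """Compare every host's digest against the most common one in the fleet."""
    digests = {host: artifact_hash(host) for host in hosts}
    expected, _count = Counter(digests.values()).most_common(1)[0]
    return [host for host, digest in digests.items() if digest != expected]

if __name__ == "__main__":
    stragglers = hosts_out_of_sync(HOSTS)
    if stragglers:
        print("WARNING: these hosts do not match the rest of the fleet:", stragglers)
```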

The Second Victim

What about the engineer who deployed the code…the one who had his hands on the actual work being done? How is he doing? Is he receiving support from his peers and colleagues? Or was he fired? The financial trading world does not exactly have a reputation for empathy, and given that no voice is given to the people closest to the work (such as this engineer) to inform the story, I can imagine that symptoms consistent with traumatic stress are likely.

Some safety-critical domains have put together structured programs to offer support to individuals who are involved with high-tempo and high-consequence work. Aviation and air traffic control have seen good success with CISM (Critical Incident Stress Management), and it has been embraced by organizations around the world.

As web operations and financial trading systems become more and more complex, we will continue to be surprised by outcomes of what looks like “normal” work. If we do not make an effort to support those who navigate this complexity on a daily basis, we will not like the results.

Summary

  1. The SEC does not have responsibility for investigation with the goals of explanation or prevention of adverse events. Their focus is regulation.
  2. Absent a real investigation that eschews counterfactuals, puts procedures and rules into context, and encourages a narrative that holds paramount the voices of those closest to the work, we cannot draw any substantial conclusions. All we are left with is armchair accident investigation, rife with indignation.

So please don’t use the SEC Release No. 70694 as a post-mortem document, because it is not.

References

Dekker, S. (2003). Failure to adapt or adaptations that fail: contrasting models on procedures and safety. Applied Ergonomics, 34(3), 233–238. doi:10.1016/S0003-6870(03)00031-0
Dekker, S. (2006). The Field Guide to Understanding Human Error. Ashgate Publishing, Ltd.
Outcome Bias. (n.d.). In Wikipedia. Retrieved October 28, 2013, from https://en.wikipedia.org/wiki/Outcome_bias
Pidgeon, N., & O’Leary, M. (2000). Man-made disasters: why technology and organizations (sometimes) fail. Safety Science, 34(1), 15–30.
Vaughan, D. (2009). The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA. University of Chicago Press.
Woods, D. D., Dekker, S., Cook, R., Johannesen, L., & Sarter, N. (2010). Behind Human Error (2nd ed.). Farnham: Ashgate Pub Co.
Weick, K. E. (1993). The collapse of sensemaking in organizations. Administrative Science Quarterly, 38, 628–652.

Learning from Failure at Etsy

(This was originally posted on Code As Craft, Etsy’s engineering blog. I’m re-posting it here because it still resonates strongly as I prepare to teach a ‘postmortem facilitator’ course internally at Etsy.)

Last week, Owen Thomas wrote a flattering article over at Business Insider on how we handle errors and mistakes at Etsy. I thought I might give some detail on how that actually happens, and why.

Anyone who’s worked with technology at any scale is familiar with failure. Failure cares not about the architecture designs you slave over, the code you write and review, or the alerts and metrics you meticulously pore through.

So: failure happens. This is a foregone conclusion when working with complex systems. But what about those failures that have resulted from the actions (or lack of action, in some cases) of individuals? What do you do with those careless humans who caused everyone to have a bad day?

Maybe they should be fired.

Or maybe they need to be prevented from touching the dangerous bits again.

Or maybe they need more training.

This is the traditional view of “human error”, which focuses on the characteristics of the individuals involved. It’s what Sidney Dekker calls the “Bad Apple Theory” – get rid of the bad apples, and you’ll get rid of the human error. Seems simple, right?

We don’t take this traditional view at Etsy. We instead want to view mistakes, errors, slips, lapses, etc. with a perspective of learning. Having blameless Post-Mortems on outages and accidents is part of that.

A Blameless Post-Mortem

What does it mean to have a ‘blameless’ Post-Mortem?
Does it mean everyone gets off the hook for making mistakes? No.

Well, maybe. It depends on what “gets off the hook” means. Let me explain.

Having a Just Culture means that you’re making effort to balance safety and accountability. It means that by investigating mistakes in a way that focuses on the situational aspects of a failure’s mechanism and the decision-making process of individuals proximate to the failure, an organization can come out safer than it would normally be if it had simply punished the actors involved as a remediation.

Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account of:

  • what actions they took at what time,
  • what effects they observed,
  • expectations they had,
  • assumptions they had made,
  • and their understanding of the timeline of events as they occurred.

…and that they can give this detailed account without fear of punishment or retribution.
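
One lightweight way to capture such an account, offered purely as a sketch of structure (not a prescription, and not Etsy’s actual tooling), is a simple timeline record that keeps those elements together:

```python
# A sketch of one way to structure a contributor's account for a blameless
# debriefing. Field names are illustrative; the value is in keeping actions,
# observations, expectations, and assumptions together, in the engineer's words.

from dataclasses import dataclass, field

@dataclass
class TimelineEntry:
    time: str                        # e.g. "14:07 local"
    actor: str                       # used for narrative, never for blame
    action: str                      # what they did
    observed: str                    # what effects they saw
    expectation: str = ""            # what they expected to happen
    assumptions: list = field(default_factory=list)

entry = TimelineEntry(
    time="14:07",
    actor="on-call engineer",
    action="restarted the search indexer on the primary node",
    observed="queue depth kept climbing instead of draining",
    expectation="a restart would clear the stuck worker",
    assumptions=["the stuck worker was the cause of the backlog"],
)
```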

Why shouldn’t they be punished or reprimanded? Because an engineer who thinks they’re going to be reprimanded is disincentivized to give the details necessary to get an understanding of the mechanism, pathology, and operation of the failure. This lack of understanding of how the accident occurred all but guarantees that it will repeat, if not with the original engineer, then with another one in the future.

We believe that this detail is paramount to improving safety at Etsy.

If we go with “blame” as the predominant approach, then we’re implicitly accepting that deterrence is how organizations become safer. This is founded in the belief that individuals, not situations, cause errors. It’s also aligned with the idea there has to be some fear that not doing one’s job correctly could lead to punishment. Because the fear of punishment will motivate people to act correctly in the future. Right?

This cycle of name/blame/shame can be looked at like this:

  1. Engineer takes action and contributes to a failure or incident.
  2. Engineer is punished, shamed, blamed, or retrained.
  3. Reduced trust between engineers on the ground (the “sharp end”) and management (the “blunt end”) looking for someone to scapegoat
  4. Engineers become silent on details about actions/situations/observations, resulting in “Cover-Your-Ass” engineering (from fear of punishment)
  5. Management becomes less aware and informed on how work is being performed day to day, and engineers become less educated on lurking or latent conditions for failure due to silence mentioned in #4, above
  6. Errors more likely, latent conditions can’t be identified due to #5, above
  7. Repeat from step 1

We need to avoid this cycle. We want the engineer who has made an error to give details about why (either explicitly or implicitly) they did what they did; why the action made sense to them at the time. This is paramount to understanding the pathology of the failure. The action made sense to the person at the time they took it, because if it hadn’t made sense to them at the time, they wouldn’t have taken the action in the first place.

The fundamental idea here is something Erik Hollnagel has said:

We must strive to understand that accidents don’t happen because people gamble and lose.
Accidents happen because the person believes that:
…what is about to happen is not possible,
…or what is about to happen has no connection to what they are doing,
…or that the possibility of getting the intended outcome is well worth whatever risk there is.

A Second Story

This idea of digging deeper into the circumstance and environment that an engineer found themselves in is called looking for the “Second Story”. In Post-Mortem meetings, we want to find Second Stories to help understand what went wrong.

From Behind Human Error here’s the difference between “first” and “second” stories of human error:

  • First story: Human error is seen as the cause of failure. Second story: Human error is seen as the effect of systemic vulnerabilities deeper inside the organization.
  • First story: Saying what people should have done is a satisfying way to describe failure. Second story: Saying what people should have done doesn’t explain why it made sense for them to do what they did.
  • First story: Telling people to be more careful will make the problem go away. Second story: Only by constantly seeking out its vulnerabilities can organizations enhance safety.

Allowing Engineers to Own Their Own Stories

A funny thing happens when engineers make mistakes and feel safe giving details about them: they are not only willing to be held accountable, they are also enthusiastic about helping the rest of the company avoid the same error in the future. They are, after all, the most expert in their own error. They ought to be heavily involved in coming up with remediation items.

So technically, engineers are not at all “off the hook” with a blameless Post-Mortem process. They are very much on the hook for helping Etsy become safer and more resilient, in the end. And lo and behold: most engineers I know find this idea of making things better for others a worthwhile exercise.

So what do we do to enable a “Just Culture” at Etsy?

  • We encourage learning by having these blameless Post-Mortems on outages and accidents.
  • The goal is to understand how an accident could have happened, in order to better equip ourselves to prevent it from happening in the future.
  • We seek out Second Stories, gather details from multiple perspectives on failures, and we don’t punish people for making mistakes.
  • Instead of punishing engineers, we instead give them the requisite authority to improve safety by allowing them to give detailed accounts of their contributions to failures.
  • We enable and encourage people who do make mistakes to be the experts on educating the rest of the organization how not to make them in the future.
  • We accept that there is always a discretionary space where humans can decide to take action or not, and that the judgement of those decisions lies in hindsight.
  • We accept that the Hindsight Bias will continue to cloud our assessment of past events, and work hard to eliminate it.
  • We accept that the Fundamental Attribution Error is also difficult to escape, so we focus on the environment and circumstances people are working in when investigating accidents.
  • We strive to make sure that the blunt end of the organization understands how work is actually getting done (as opposed to how they imagine (or hope) it’s getting done, via Gantt charts and procedures) on the sharp end.
  • The sharp end is relied upon to inform the organization where the line is between appropriate and inappropriate behavior. This isn’t something that the blunt end can come up with on its own.

Failure happens. In order to understand how failures happen, we first have to understand our reactions to failure.

One option is to assume the single cause is incompetence and scream at engineers to make them “pay attention!” or “be more careful!”

Another option is to take a hard look at how the accident actually happened, treat the engineers involved with respect, and learn from the event.

That’s why we have blameless Post-Mortems at Etsy, and why we’re looking to create a Just Culture here.

A Mature Role for Automation: Part II

(Courtney Nash’s excellent post on this topic inadvertently pushed me to finally finish this – give it a read)

In the last post on this topic, I hoped to lay the foundation for what a mature role for automation might look like in web operations, and to bring some considerations to the decision-making process involved in making automation part of a design. Like Richard mentioned in his excellent comment on that post, this is essentially a very high-level overview of the past 30 years of research into the effects, benefits, and ironies of automation.

I also hoped in that post to challenge people to investigate their assumptions about automation.

Namely:

  • when will automation be appropriate,
  • what problems could it help solve, and
  • how should it be designed in order to augment and complement (not simply replace) human adaptive and processing capacities.

The last point is what I’d like to explore further here. Dr. Cook also pointed out that I had skipped over entirely the concept of task allocation as an approach that didn’t end up as intended. I’m planning on exploring that a bit in this post.

But first: what is responsible for the impulse to automate that can grab us so strongly as engineers?

Is it simply the disgust we feel when we find (often in hindsight) a human-driven process that made a mistake (maybe one that contributed to an outage) that is presumed impossible for a machine to make?

It turns out that there are a number of automation ‘philosophies’, some of which you might recognize as familiar.

Philosophies and Approaches

One: The Left-Over Principle

One common way to think of automation is to gather up all of the tasks, and sort them into things that can be automated, and things that can’t be. Even the godfather of Human Factors, Alphonse Chapanis, said that it was reasonable to “mechanize everything that can be mechanized” (here). The main idea here is efficiency. Functions that cannot be assigned to machines are left for humans to carry out. This is known as the ‘Left-Over’ Principle.

David Woods and Erik Hollnagel have a response to this early incarnation of the “automate all the things!” approach in Joint Cognitive Systems: Foundations of Cognitive Systems Engineering, which is (emphasis mine):

“The proviso of this argument is, however, that we should mechanise everything that can be mechanised, only in the sense that it can be guaranteed that the automation or mechanisation always will work correctly and not suddenly require operator intervention or support. Full automation should therefore be attempted only when it is possible to anticipate every possible condition and contingency. Such cases are unfortunately few and far between, and the available empirical evidence may lead to doubts whether they exist at all.


Without the proviso, the left-over principle implies a rather cavalier view of humans since it fails to include any explicit assumptions about their capabilities or limitations – other than the sanguine hope that the humans in the system are capable of doing what must be done. Implicitly this means that humans are treated as extremely flexible and powerful machines, which at any time far surpass what technological artefacts can do. Since the determination of what is left over reflects what technology cannot do rather than what people can do, the inevitable result is that humans are faced with two sets of tasks. One set comprises tasks that are either too infrequent or too expensive to automate. This will often include trivial tasks such as loading material onto a conveyor belt, sweeping the floor, or assembling products in small batches, i.e., tasks where the cost of automation is higher than the benefit. The other set comprises tasks that are too complex, too rare or too irregular to automate. This may include tasks that designers are unable to analyse or even imagine. Needless to say that may easily leave the human operator in an unenviable position.”

So to reiterate, the Left-Over Principle basically says that the things that are “left over” after automating as much as you can are either:

  1. Too “simple” to automate (economically, the benefit of automating isn’t worth the expense of automating it) because the operation is too infrequent, OR
  2. Too “difficult” to automate; the operation is too rare or irregular, and too complex to automate.

One critique of the Left-Over Principle is what Bainbridge points to in her second irony, which I mentioned in the last post: the tasks that are “left over” after trying to automate everything that can be automated are precisely the ones you couldn’t figure out how to automate effectively (because they are too complicated, or too infrequent to be worth it), and those are the tasks you then hand back to the human to deal with.

So hold on: I thought we were trying to make humans lives easier, not more difficult?

Giving all of the easy bits to the machine and the difficult bits to the human also has a side effect of amplifying the workload on humans in terms of cognitive load and vigilance. (It turns out that it’s relatively trivial to write code that can do a boatload of complex things quite fast.) There’s usually little consideration given to whether or not the human could effectively perform these remaining non-automated tasks in a way that will benefit the overall system, including the automated tasks.

This approach also assumes that the tasks that are now automated can be done in isolation from the tasks that can’t be, which is almost never the case. When only humans are working on tasks, even with other humans, they can proceed at their own rate individually or as a group. When humans and computers work together, the pace is set by the automated part, so the human needs to keep up with the computer. This underscores the importance of designing automation in the context of humans and computers working jointly. Together. As a team, if you will.

We’ll revisit this idea later, but the idea that automation should place high priority and focus on the human-machine collaboration instead of their individual capacities is a main theme in the area of Joint Cognitive Systems, and one that I personally agree with.

The Left-Over Principle

Parasuraman, Sheridan, and Wickens (2000) had this to say about the Left-Over Principle (emphasis mine):

“This approach therefore defines the human operator’s roles and responsibilities in terms of the automation. Designers automate every subsystem that leads to an economic benefit for that subsystem and leave the operator to manage the rest. Technical capability or low cost are valid reasons for automation, given that there is no detrimental impact on human performance in the resulting whole system, but this is not always the case. The sum of subsystem optimizations does not typically lead to whole system optimization.”

Two: The “Compensatory” Principle

Another familiar approach (or justification) for automating processes rests on the idea that you should exploit the strengths of both humans and machines differently. The basic premise is: give the machines the tasks that they are good at, and the humans the things that they are good at.

This is called the Compensatory Principle, based on the idea that humans and machines can compensate for each others’ weaknesses. It’s also known as functional allocation, task allocation, comparison allocation, or the MABA-MABA (“Men Are Better At-Machines Are Better At”) approach.

Historically, functional allocation has been most embodied by “Fitts’ List”, which comes from a report in 1951, “Human Engineering For An Effective Air Navigation and Traffic-Control System” written by Paul Fitts and others.

Fitts’ List, which is essentially the original MABA-MABA list, juxtaposes human with machine capabilities to be used as a guide in automation design to help decide who (human or machine) does what.

Here is Fitts’ List:

 Humans appear to surpass present-day machines with respect to the following:

  • Ability to detect small amounts of visual or acoustic energy.
  • Ability to perceive patterns of light or sound.
  • Ability to improvise and use flexible procedures.
  • Ability to store very large amounts of information for long periods and to recall relevant facts at the appropriate time.
  • Ability to reason inductively.
  • Ability to exercise judgment.

Modern-day machines (then, in the 1950s) appear to surpass humans with respect to the following:

  • Ability to respond quickly to control signals and to apply great forces smoothly and precisely.
  • Ability to perform repetitive, routine tasks.
  • Ability to store information briefly and then to erase it completely.
  • Ability to reason deductively, including computational ability.
  • Ability to handle highly complex operations, i.e., to do many different things at once.

This approach is intuitive for a number of reasons. It at least recognizes that when it comes to a certain category of tasks, humans are much superior to computers and software.

Erik Hollnagel summarized the Fitts’ List in Human Factors for Engineers:

Summary of the Fitts List

It does a good job of looking like a guide; it’s essentially an IF-THEN conditional on where to use automation.

So what’s not to like about this approach?

While this is a reasonable way to look at the situation, it has some well-explored difficulties that make it essentially unworkable as a practical rationale.

Criticisms of the Compensatory Principle

There are a number of strong criticisms of this approach (or argument) for putting automation in place. The one I agree with most is that the work we do in engineering is never as decomposable as the list would imply. For much (if not all) of the work we do, you can’t simply say “I have a lot of analysis to do over huge amounts of data, so I’ll let the computer do that part, because that’s what it’s good at. Then it can present me the results and I can make judgements over them.”

The systems we build have enough complexity in them that we can’t simply put tasks into these boxes or categories, because the cost of moving between them becomes extremely high; so high, in fact, that the MABA-MABA approach, as it stands, is pretty useless as a design guide. The world we’ve built around ourselves simply doesn’t fall neatly into these buckets; we move dynamically between judging and processing and calculating and reasoning and filtering and improvising.

Hollnagel unpacks it more eloquently in Joint Cognitive Systems: Foundations of Cognitive Systems Engineering:

“The compensatory approach requires that the situation characteristics can be described adequately a priori, and that the variability of human (and technological) capabilities will be minimal and perfectly predictable.”


“Yet function allocation cannot be achieved simply by substituting human functions by technology, nor vice versa, because of fundamental differences between how humans and machines function and because functions depend on each other in ways that are more complex than a mechanical decomposition can account for. Since humans and machines do not merely interact but act together, automation design should be based on principles of coagency.”


David Woods refers to Norbert’s Contrast (from Norbert Wiener’s 1950 book The Human Use of Human Beings):

Norbert’s Contrast

Artificial agents are literal minded and disconnected from the world, while human agents are context sensitive and have a stake in outcomes. 

With this perspective, we can see that the work can’t simply be decomposed between computers and humans based on what each does well.

Maybe, just maybe, there’s hope in a third approach: what if we were to imagine humans and machines as partners? How might we view the relationship between humans and computers through a different lens, one of cooperation?

That’s for the next post. 🙂

Owning Attention (Considerations for Alert Design)

In the past month or two, I’ve spoken on the topic of alert design. There’s a video of me giving the talk (at Monitorama, as well), but I thought I’d try to write the topic and material up here too.

The topic of alerts, and of “alert design” as a deliberate and purposeful activity, has been on my mind.

In my experience, and from asking many people in engineering and operations (at least in the web and financial trading domains), nothing spikes blood pressure like the topic of alerts. The caricature of the sysadmin waking up to a buzzing pager or phone is what comes to mind.

The costs of not paying attention to how your organization views and treats alerting behavior in operational teams (developers and systems folks included) are, I think, both largely invisible and much higher than most people realize. It may be clear that what we’re talking about here is a signal-to-noise ratio, but it goes way beyond that. The cognitive cost for an engineer to attend to an alert (a fundamentally interrupting event, by design) is akin to the cost of a software developer losing their “flow”; context switching is expensive. Expensive from a financial standpoint, from a productivity perspective, and, I’ll argue, from a career-development view.

Here are some (likely melodramatic) assertions:

  • Alert numbness and fatigue is a blight on our industry. Because we can alert on basically anything, and we can argue that anything could be a harbinger of things that could drastically affect our business, we generally put an alert on everything we get our hands on.
  • Knowing something has happened almost always trumps not knowing something happened, sometimes with little effort put into asking whether the “something” is important in the context it happened in.
  • Computers deciding what is important to alert on is and will always be brittle. Meaning: alerts and their criteria originate in the author’s mind, which may or may not be in the same place as the receiver of the alert in the future. In other words: we all write documentation and procedures that make sense to us when we write them. They never survive too much of the future, because our worlds that refer to them change. Example: corporate wiki pages are commonly referred to as the place where “documentation goes to die”. Alerts are no different.

Therefore, I’d love to get a much deeper and broader conversation about alert design in our domain. Because I’ll say that it’s not the technology that sucks, it’s our use of it. Consider the possibility that you don’t have a Nagios problem, you have an alert design problem.

Down and In

As the years go by and we see storage prices continue to decline and accessible processing power continue to explode, we have an ever-expanding ability to zoom in deeply on the ways servers and services talk to each other and process information.

We can zoom in on the relationships and behaviors of seemingly disparate pieces of data, and we can discover and detect disruptions or anomalies in sometimes surprising places. This is interesting, for sure.

But it is also woefully incomplete if we are to make any progress in technical operations.

Up and Out

It is incomplete because as we zoom out from that high-resolution metrics collection and analysis tooling, what we find is a much-ignored environment, one which includes one of the most powerful context-sensitive and incredibly adaptive anomaly detection and response agents in the world: humans.

Do we have anomaly detection problems? Certainly. One can argue (I will) that we will always have them, for many reasons. (One of those reasons is the Law Of Stretched Systems, but that is for a different post.)

What I’m interested in is not how software can be used to detect anomalies automatically,
(well, I’m interested, but I don’t doubt that we all will continue to get better at it)

…it is how people navigate this boundary between themselves and the machines they work with. The boundary between humans and machines, as we observe our use of tools, is a focus in and of itself. If we have any hope of making progress in monitoring complex systems, we must take this boundary into account.

As an aside, some more bullet points:

  1. We don’t use a single tool to gain insight into the architectures we build. And we will not, much to the dismay of many monitoring-as-a-service business models. (“A single pane of glass?! Where do I sign?!”)
  2. Teams of people are the norm, which means that communication and coordination become as important (if not more important) than surfacing anomalies themselves.
  3. We bring our biases, expectations, trust, and perceptions to the table when it comes to monitoring and response. No tool or piece of automation will ever change that.
  4. Understanding the breakdowns at these boundaries between people and machines should be a part of how we approach the design of tools. Organizational behavior beats technology at every turn.

Less Code, More Social Science

When we look at Boyd’s OODA loop, we see “observe” and “orient” as critical pieces. Note that these are not Unix commands, they are human activities.

So writing code to tell computers what to look at is quite different from making sure that the code’s human supervisors are equipped or aided in knowing what to look at when an alert goes off. Figuring out how people make sense of what is actually going on at a given point (in diagnosis? in planning? in response to an outage? in control?) is just plain hard.

A step that Don Norman (and other folks known in the world of ergonomics and human factors) have been tugging at for a couple of decades is to first attempt to understand how people consume, adapt to, work around, and make use of tools under “normal” operating conditions. Once that’s done, it’s suggested, then we can try to understand how people make sense of their world under high-tempo or escalating scenarios (during an outage, for example) when the signals they receive can sometimes be disorienting as things escalate.

Questions

  • Who has ever gotten an alert and ignored it? (/me looks at alert, says “oh, it’ll probably recover, no need to look further”)
  • How many alerts were received in the past week that were not actionable (i.e., no human action was required)? (A rough way to tally this is sketched after this list.)
  • How many alerts were received in the past week as a result of known work being done (expected) but alerts were not silenced during that period?
  • How many alerts were received as a result of a previously silenced alert (because work was being done) that was mistakenly un-silenced?
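
If you wanted to answer those questions with numbers rather than hunches, a minimal sketch over a hypothetical alert log might look like the following; the log format and outcome labels are invented for illustration.

```python
# Hypothetical sketch: tally a week of alerts by what happened after they fired.
# The log structure and labels are invented; the point is measuring, not guessing.

from collections import Counter

alert_log = [
    {"alert": "disk_usage_warn",  "outcome": "ignored"},        # glanced at, no action taken
    {"alert": "api_error_rate",   "outcome": "acted_on"},       # required human action
    {"alert": "deploy_cpu_spike", "outcome": "expected_work"},  # known maintenance, not silenced
    {"alert": "disk_usage_warn",  "outcome": "ignored"},
]

counts = Counter(entry["outcome"] for entry in alert_log)
total = sum(counts.values())
print(f"actionable: {counts['acted_on']}/{total}")
print(f"noise (ignored or expected work): {counts['ignored'] + counts['expected_work']}/{total}")
```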

Here are some quotes from engineers who have found themselves in interesting situations related to alerts:

“The whole place just lit up. I mean, all the lights came on. So instead of being able to tell you what went wrong, the lights were absolutely no help at all.”
– Comment by one space controller in mission control after the Apollo 12 spacecraft was struck by lightning (Murray and Cox 1990).


“I would have liked to have thrown away the alarm panel. It wasn’t giving us any useful information.”
– Comment by one operator at the Three Mile Island nuclear power plant to the official inquiry following the TMI accident (Kemeny 1979).


“When the alarm kept going off then we kept shutting it [the device] off [and on] and when the alarm would go off [again], we’d shut it off.”
“… so I just reset it [a device control] to a higher temperature. So I kinda fooled it [the alarm]…”
– Physicians explaining how they respond to a nuisance alarm on a computerized operating room device (Cook, Potter, Woods and McDonald 1991).


“A [computer] program alarm could be triggered by trivial problems that could be ignored altogether. Or it could be triggered by problems that called for an immediate abort [of the lunar landing]. How to decide which was which? It wasn’t enough to memorize what the program alarm numbers stood for, because even within a single number the alarm might signify many different things.


“We wrote ourselves little rules like: ‘If this alarm happens and it only happens once, don’t worry about it. If it happens repeatedly, but other indicators are okay, don’t worry about it.'” And of course, if some alarms happen even once, or if other alarms happen repeatedly and the other indicators are not okay, then they should get the LEM [lunar module] the hell out of there.
– Response to discovery of a set of computer alarms linked to the astronauts displays shortly before the Apollo 11 mission (Murray and Cox 1990).


“1202.” (Astronaut announcing that an alarm buzzer and light had gone off and the code 1202 was indicated on the computer display.)
“What’s a 1202?”
“1202, what’s that?”
“12…1202 alarm.”
– Mission control dialog as the LEM descended to the moon during Apollo 11 (Murray and Cox 1990).


“I know exactly what it [an alarm] is–it’s because the patient has been, hasn’t taken enough breaths or–I’m not sure exactly why.”
– Physician explaining one alarm on a computerized operating room device that commonly occurred at a particular stage of surgery (Cook et al. 1991).

These quotes are from the excellent paper The Alarm Problem and Directed Attention in Dynamic Fault Management (Woods, 1995).

David Woods writes at great length on the topic and gives great insight into what alerts and alarms essentially are: directed attention. As operators of systems that are beyond our full understanding at any given point and perspective, he shines light on the core of the alarm problem: there is always context sensitivity to alerts, and in many ways the author/designer of the alert hasn’t imagined (and can’t imagine!) how the receiver of the alert will interpret it.

For example: he points to signal detection theory as a framework for thinking about alert/alarm criteria. That is to say, there is always a relationship between true “signal” and “noise” and the trade-offs inherent in choosing the alerting criteria (sometimes, but not always, viewed as a simple threshold) can be thought of like this:

Signal Detection Theory

In other words, there are four possible outcomes, reflecting how sensitive the alerting criteria are:


SDT outcomes

So this is a tough one, and it points out that getting a good (forget about perfect!) signal-to-noise ratio is hard. Too sensitive, and you’ll get too many false alarms. Not sensitive enough, and you’ll miss something.
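
To make those four outcomes concrete, here is a small worked sketch, using synthetic data and arbitrary thresholds, of how moving a single alert threshold trades hits against misses and false alarms:

```python
# Worked sketch of signal detection outcomes for a simple threshold alert.
# The observations are synthetic and the thresholds arbitrary -- illustration only.

def classify(latency_ms, truly_broken, threshold_ms):
    alerted = latency_ms > threshold_ms
    if alerted and truly_broken:
        return "hit"
    if alerted and not truly_broken:
        return "false alarm"
    if not alerted and truly_broken:
        return "miss"
    return "correct rejection"

# (observed latency in ms, whether something was actually wrong)
observations = [(120, False), (480, False), (510, True), (900, True), (430, False), (610, True)]

for threshold in (400, 700):
    outcomes = [classify(lat, broken, threshold) for lat, broken in observations]
    print(threshold, {o: outcomes.count(o) for o in set(outcomes)})

# Lowering the threshold catches more real problems (more hits, fewer misses),
# but at the cost of more false alarms; raising it does the opposite.
```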

I’ll say that because of this, we generally err on the side of too many false alarms. For fear of missing something (or the embarrassment of it being known that you missed something going wrong with your systems!) we will crank up the sensitivity.

But in doing so, we essentially ignore the detrimental effect of the false alarms on our engineers and organizations. Underlying the false alarms are not just limitations in the alerting algorithms themselves, but the conditions and factors that the alert systems cannot detect or interpret.

An often-given example of this manifests at the Cincinnati Airport. A riverbank leading up to a particular runway there triggers a threshold in ground proximity warning systems (in-cockpit alerts) because the system can’t detect that it’s going to plateau at the runway. Pilots familiar with this particular runway at this particular airport ignore the alerts.

Once more, with feeling: the pilots, who are flying massive cylinders of metal containing many humans, ignore a Ground Proximity Warning alert.

When we talk about how the receiver of an alert will behave, we begin to uncover the context sensitivity of an alert.

How can we take into account how someone might react when they are woken up by an alert we’ve designed? Will they shake their head, wondering what it’s all about? Are we helping them understand what might be going on, or hindering them by including only the bare minimum of data?

What about the engineer who gets an alert in a sea of alerts, while an outage is ongoing? How much attention will they give one amongst a hundred?

Something that might affect our behavior when we get an alert is the amount of trust that we have in the alert: is it telling us something we should believe? Should we drop everything we’re doing in order to pay attention to it? If not, why not?

As an example of this, take the Ground Proximity Warning System I mentioned above. Turns out that in many studies across a number of years, a majority of pilots delay reacting to a GPWS alarm, not just in Cincinnati. Why? Because they take time to validate that the alarm is actually legitimate by looking out the window. This is enough of a problem that the FAA has coined this phenomenon “delayed GPWS response syndrome”.

Trust in automation: it’s a thing that might be worth thinking closely about.

Two Views

“The critical point is that the challenge of fault management lies in sorting through an avalanche of raw data — a data overload problem. This is in contrast to the view that the performance bottleneck is the difficulty of picking up subtle early indications of a fault against the background of a quiescent monitored process.” (Woods, 1995)

The next time you set up an alert in your system, consider how you’re thinking the receiver of that alert will take it. Do you believe that your alert will save the day, providing information for someone to head off catastrophe before it’s too late? Or will it be likely discarded as noise amongst a sea of alerts as someone struggles to understand an outage?

“Information is not a scarce resource, attention is.” – Herb Simon

Herb Simon has mentioned this in many pieces of his writing, as David Woods and Emily Patterson remark in Can We Ever Escape From Data Overload? A Cognitive Systems Diagnosis. Thus far we’ve captured that designing alerts is hard, even if we only invest effort in capturing signal (never mind providing context). Woods talks a bit more about directed attention, about a paradox:

“Note the paradox at the heart of directed attention. Given that the supervisory agent is loaded by various other task related demands, how does one interpret information about the potential need to switch attentional focus without interrupting or interfering with the tasks or lines of reasoning already under attentional control. We can state this paradox in another way: how can one skillfully ignore a signal that should not shift attention within the current context, without first processing it — in which case it hasn’t been ignored.”

So Where Is “Design”?

“It is the expertise of the human operator that makes it possible to adapt the  performance of the joint system, in real time, to unexpected events and disturbances. Every working day, across the whole spectrum of human enterprise, a large number of near-misses are prevented from turning into accidents only because human operators intervene.

The system should therefore be designed so that human adaptation is enhanced.”

(emphasis mine) – Erik Hollnagel, Expertise and Technology: Cognition &  Human-Computer Cooperation, 1995

Instead of thinking about alerts and alert design as tasks built on the mental model of a subordinate or otherwise dumb messenger delivering news to us:

What if we viewed alerting systems as a partner? What does the world look like if we designed alerting systems to cooperate with us?
If trust in alerting systems is such a big deal, as it is with the GPWS and alert numbness, what can we learn from how humans learn to trust each other, and let that influence our design decisions?

In other words: how can we design alerts that support our efforts to confirm their legitimacy, or our expectations when an alert will fire? Is context-sensitivity part of this?

This is the type of partnership and thinking that I’m interested in. 🙂

Prevention versus Governance versus Adaptive Capacities

The other day I posted about the intersections of Systems Safety and web operations and engineering.

One of the largest proponents of bringing a systems thinking perspective to safety (specifically ‘software safety’) is Dr. Nancy Leveson, who has been in that field (really a multidisciplinary field) for at least a couple of decades. She’s the author of a super book, Engineering a Safer World (free download), which discusses this very concept.

I also mentioned the firming up (still in the public comment timeframe) of REG-SCI, which puts into place as regulation (not just a recommendation or suggestion) the ARP (Automation Review Policy) that public trading markets must comply with.

Without commenting too much on REG-SCI itself (I have opinions on that, which I can post about at a later date), I wanted to point to a Technology Roundtable that the SEC held last October, where Dr. Leveson was invited to speak on the notion and concepts of “safe” software systems. This laid the groundwork that (presumably) went into Regulation SCI.

I clipped out her testimony; it’s about 20 minutes long, but very much worth a watch. She touches on a number of topics, but brings plain language to what organizations (both for-profit and regulatory groups, like the SEC) can expect with respect to introducing an increasing amount of technology to ‘solve’ stability issues in complex systems:

Nancy Leveson, SEC Technology Roundtable, 10/2/2012 (video from jspaw on Vimeo).

Regulation SCI is aimed primarily at national securities and trading exchanges, and the regulation itself is almost 400 pages long. Even if the intention is to prevent calamities such as the Flash Crash, the BATS IPO event, and the Knight Capital incident…is regulation the best (or only) way to make our systems safer?