Reflections on the 6th Resilience Engineering Symposium

I just spent the last week in Lisbon, Portugal at the Resilience Engineering Symposium. Zoran Perkov and I were invited to speak on the topic of software operations and resilience in the financial trading and Internet services worlds, to an audience of practitioners and researchers from all around the globe, in a myriad of industries.

My hope was to start a dialogue about the connections we’ve seen (and to hopefully explore more) between practices and industries, and to catch theories about resilience up to what’s actually happening in these “pressurized and consequential”1 worlds.

I thought I’d put down some of my notes, highlights and takeaways here.

  • In order to look at how resilience gets “engineered” (if that is actually a thing), we have to look at the adaptations that people make in the work that they do, to fill in the gaps that show up as a result of the incompleteness of designs, tools, and prescribed practices. We have to do this with a “low commitment to concepts”2, because otherwise we run the risk of starting with a model (OODA? four cornerstones of resilience? swiss cheese? situation awareness? etc.) and then finding data to fill in those buckets. That can unfortunately happen quite easily, and it is not actually science.

 

  • While I had understood this before the symposium, I’m now even clearer on it: resilience is not the same as fault-tolerance or “graceful degradation.” Instead, it’s something more, akin to what Woods calls “graceful extensibility.”

 

  • The other researchers and practitioners in ‘safety-critical’ industries were very interested in what approaches such as continuous deployment/delivery might look like in their fields. They saw these as a set of evolutions from waterfall that Internet software has made, allowing it to be flexible and adaptive in the face of uncertainty about how the high-level system of users, providers, customers, operations, performance, etc. will behave in production. This was their reflection, not my words in their mouths, and I really couldn’t agree more. Validating!

 

  • While financial trading systems and Internet software have some striking similarities, the differences are stark. Zoran and I are both jealous of each other’s worlds in different ways. Also: Zoran can quickly scare the shit out of an audience full of people with pension and retirement plans. :)

 

  • The lines between words (phases?) such as: design-implementation-operations are blurred in worlds where adaptive cycles take place, largely because feedback loops are the focus (or source?) of the cycles.

 

  • We still have a lot to do in “software operations”3 in that we may be quite good at focusing on and discussing software development and practices, alongside the computer science concepts that influence those things, but we’re not yet good at exploring what we can learn about our field through the lenses of social science and cognitive psychology. I would like to change that, because I think we haven’t gone far enough in being introspective on those fronts. I think we might currently only be flirting with those areas. Dropping a Conway’s Law here and a cognitive bias there is a good start, but we need to consider that we might not actually know what the hell we’re talking about (yet!). However, I’m optimistic on this front, because our community has both curiosity and a seemingly boundless ability to debate esoteric topics with each other. Now if we can only stop doing it 140 characters at a time… :)

 

  • The term “devops” definitely has analogues in other industries. At the very least, the term brought vigorous nodding as I explained it. Woods used the phrase “throw it over the wall” and it resonated quite strongly with many folks from diverse fields. People from aviation, maritime, patient safety…they all could easily give a story that was analogous to “worked fine in dev, ops problem now” in their worlds. Again, validating.

 

  • There is no Resilience Engineering (or Cognitive Systems Engineering or Systems Safety for that matter) without real dialogue about real practice in the world. In other words, there is no such thing as purely academic here. Every “academic” here viewed their “laboratories” as cockpits, operating rooms and ERs, control rooms in mission control and nuclear plants, on the bridges of massive ships. I’m left thinking that for the most part, this community abhors the fluorescent-lighted environments of universities. They run toward potential explosions, not away from them. Frankly, I think our field of software has a much larger population of the stereotype of the “out-of-touch” computer scientist whose ideas in papers never see the light of production traffic. (hat tip to Kyle for doing the work to do real-world research on what were previously known as academic theories!)

 


 

1 Richard Cook’s words.

2 David Woods’ words. I now know how important this is when connecting theory to practice. More on this topic in a different post!

3 This is what I’m now calling what used to be known as “WebOps” or what some refer to as ‘devops’ to reflect that there is more to software services that are delivered via the Internet than just the web, and I’d like to update my language a bit.

Some Principles of Human-Centered Computing

From Perspectives On Cognitive Task Analysis: Historical Origins and Modern Communities of Practice
(emphasis mine)

The Aretha Franklin Principle: Do not devalue the human to justify the machine. Do not criticize the machine to rationalize the human. Advocate the human–machine system to amplify both.
The Sacagawea Principle: Human-centered computational tools need to support active organization of information, active search for information, active exploration of information, reflection on the meaning of information, and evaluation and choice among action sequence alternatives.
The Lewis and Clark Principle: The human user of the guidance needs to be shown the guidance in a way that is organized in terms of their major goals. Information needed for each particular goal should be shown in a meaningful form, and should allow the human to directly comprehend the major decisions associated with each goal.
The Envisioned World Principle: The introduction of new technology, including appropriately human-centered technology, will bring about changes in environmental constraints (i.e., features of the sociotechnical system or the context of practice). Even though the domain constraints may remain unchanged, and even if cognitive constraints are leveraged and amplified, changes to the environmental constraints will impact the work.
The Fort Knox Principle: The knowledge and skills of proficient workers are gold. They must be elicited and preserved, but the gold must not simply be stored and safeguarded. It must be disseminated and used within the organization when needed.
The Pleasure Principle: Good tools provide a feeling of direct engagement. They simultaneously provide a feeling of flow and challenge.
The Janus Principle: Human-centered systems do not force a separation between learning and performance. They integrate them.
The Mirror–Mirror Principle: Every participant in a complex cognitive system will form a model of the other participant agents, as well as a model of the controlled process and its environment.
The Moving Target Principle: The sociotechnical workplace is constantly changing, and constant change in environmental constraints may entail constant change in cognitive constraints, even if domain constraints remain constant.

 

An Open Letter To Monitoring/Metrics/Alerting Companies

I’d like to open up a dialogue with companies who are selling X-As-A-Service products that are focused on assisting operations and development teams in tracking the health and performance of their software systems.

Note: It’s likely my suggestions below are understood and embraced by many companies already. I know a number of them who are paying attention to all areas I would want them to, and/or make sure they’re not making claims about their product that aren’t genuine. 

Anomaly detection is important. It can’t be overlooked. We as a discipline need to pay attention to it, and continually get better at it.

But for the companies who rely on your value-add selling point(s) as:

  • “our product will tell you when things are going wrong” and/or
  • “our product will automatically fix things when it finds something is wrong”

the implication is these things will somehow relieve the engineer from thinking or doing anything about those activities, so they can focus on more ‘important’ things. “Well-designed automation will keep people from having to do tedious work”, the cartoon-like salesman says.

Please stop doing this. It’s a lie in the form of marketing material and it’s a huge boondoggle that distracts us away from focusing on what we should work on, which is to augment and assist people in solving problems.

Anomaly detection in software is, and always will be, an unsolved problem. Your company will not solve it. Your software will not solve it. Our people will improvise around it and adapt their work to cope with the fact that we will not always know what and how something is wrong at the exact time we need to know.

My suggestion is to first acknowledge this (that detecting anomalies perfectly, at the right time, is not possible) when you talk to potential customers. Want my business? Say this up front, so we can then move on to talking about how your software will assist my team of expert humans, who will always be smarter than your code.

In other words, your monitoring software should take the Tony Stark approach, not the WOPR/HAL9000 approach.

These are things I’d like to know about how you thought about your product:

  • Tell me about how you used qualitative research in developing your product.
  • Tell me about how you observed actual engineers in their natural habitat, in the real world, as they detected and responded to anomalies that arose.
  • Show me your findings from when you had actual UX/UI professionals consider carefully how the interfaces of your product should be designed.
  • Demonstrate to me that the people designing your product have actually been on-call and have experience with the scenario where they needed to understand what the hell was going on, had no idea where to start looking, all under time and consequence pressure.
  • Show me that the people building your product take as a first design principle that outages and other “untoward” events are handled not by a lone engineer, but more often than not by a team of engineers, each with their own expertise and focus of attention. Successful response depends not just on anomaly detection, but on how the team shares the observations they are making with each other in order to come up with actions to take.

 

Stop thinking you’re trying to solve a troubleshooting problem; you’re not.

 

The world you’re trying to sell to is in the business of dynamic fault management. This means that quite often you can’t just take a component out of service and investigate what’s wrong with it. It means diagnosis involves testing hypotheses that could actually make things a lot worse than they already are. It means that phases of responding to issues have overlapping concerns, all at the same time. Things like:

  • I don’t know what is going on.
  • I have a guess about what is going on, but I’m not sure, and I don’t know how to confirm it.
  • Because of what Sue and Alice said, and what I see, I think what is going on is X.
  • Since we think X is happening, I think we should do Y.
  • Is there a chance that Y will make things worse?
  • If we don’t know what’s happening with N, can we do M so things don’t get worse, or we can buy time to figure out what to do about N?
  • Do we think this thing (that we have no clue about) is changing for the better or the worse?
  • etc.

Instead of telling me about how your software will solve problems, show me you’re trying to build a product that is going to join my team as an awesome team member, because I’m going to think about using/buying your service in the same way I think about hiring.

Sincerely,

John Allspaw

 

Stress, Strain, and Reminders

This is a photo of the backside of the T-shirt for the operations engineering team at Etsy:


This diagram might not come as a surprise to those who know that I come from a mechanical engineering background. But I also wanted to have this on the T-shirt as a reminder (maybe just to myself, but hopefully those on the team) that organizations (or groups within them) can experience stresses and strains just like materials do.

About the time that I was thinking about the T-shirt, I came across “Stress-Strain Plots As a Basis For Assessing System Resilience” (Woods, D. D., & Wreathall, J., 2008).

One of the largest questions in my mind then (well, even before then, since then, and still) was: how do engineers’ particular environments and their familiarity with their tools allow them to adapt and learn? If I could explore that question, then I might have some hope of answering, in the words of Eduardo Salas: “How can you turn a team of experts into an expert team?” (link)

In the paper, Woods and Wreathall explore the very familiar stress-strain diagram found in any materials science textbook of the last century or so. They look to it as an analogy for organizations, and as an illustration that groups of people and organizations have different “state spaces” in which they adapt.

In “uniform” or normal stretching, there is what they describe as the competence envelope, and past that there are the more interesting “extra regions” where teams have to reconfigure, improvise, and make trade-offs in uncertain conditions. This topic is so interesting to me that I decided to do a master’s thesis on it.

Here’s the thing: no work in complex systems can be prescribed. Which means it can’t be codified, and it can’t be proceduralized. Instead, rules and procedures and code are the scaffolding upon which operators, designers, engineers adapt, in order to be successful.

Sometimes these adaptations bring efficiencies. Sometimes they bring costs. Sometimes they bring surprises. Sometimes they bring more needs to adapt. But one thing is certain: they don’t bring the system back to some well-known equilibrium of ‘stable’ – complex systems don’t work that way.

But you don’t have to read my interpretation of the paper; just go and read it. :)

The last (and potentially just as important) reminder for me in the diagram is that all analogies have limits, and this one is no exception. When we use analogies and don’t acknowledge their limitations we can get into trouble. But that’s for a different post on a different day.

The Infinite Hows (or, the Dangers Of The Five Whys)

(this is also posted on O’Reilly’s Radar blog. Much thanks to Daniel Schauenberg, Morgan Evans, and Steven Shorrock for feedback on this)

Before I begin this post, let me say that this is intended to be a critique of the Five Whys method, not a criticism of the people who are in favor of using it.

This critique I present is hardly original; most of this post is inspired by Todd Conklin, Sidney Dekker, and Nancy Leveson.

The concept of post-hoc explanation (or “postmortems” as they’re commonly known) has, at this point, taken hold in the web engineering and operations domain. I’d love to think that the concepts that we’ve taken from the New View on ‘human error’ are becoming more widely known and that people are looking to explore their own narratives through those lenses.

I think that this is good, because my intent has always been (might always be) to help translate concepts from one domain to another. In order to do this effectively, we need to know also what to discard (or at least inspect critically) from those other domains.

The Five Whys is such an approach that I think we should discard.

This post explains my reasoning for discarding it, and how using it has the potential to be harmful, not helpful, to an organization. Here’s how I intend on doing this: I’m first going to talk about what I think are deficiencies in the approach, suggest an alternative, and then ask you to simply try the alternative yourself.

Here is the “bottom line, up front” gist of my assertions:

“Why?” is the wrong question.

In order to learn (which should be the goal of any retrospective or post-hoc investigation), you want multiple and diverse perspectives. You get these by asking people for their own narratives. Effectively, you’re asking “how?”

Asking “why?” too easily gets you to an answer to the question “who?” (which in almost every case is irrelevant), or takes you to the ‘mysterious’ incentives and motivations people bring into the workplace.

Asking “how?” gets you to describe (at least some) of the conditions that allowed an event to take place, and provides rich operational data.

Asking a chain of “why?” assumes too much about the questioner’s choices, and assumes too much about each answer you get. At best, it locks you into a causal chain, which is not how the world actually works. This is a construction that ignores a huge amount of complexity in an event, and it’s the complexity that we want to explore if we have any hope of learning anything.

But It’s A Great Way To Get People Started!

The most compelling argument for using the Five Whys is that it’s a good first step towards doing real “root cause analysis” – my response to that is twofold:

  1. “Root Cause Analysis” isn’t what you should be doing anyway, and
  2. It’s only a good “first step” because it’s easy to explain and understand, which makes it easy to socialize. The issue with this is that the concepts that the Five Whys depend on are not only faulty, but can be dangerous for an organization to embrace.

If the goal is learning (and it should be), then any method of retrospective learning should give us confidence that it brings to light data that can be turned into actionable information. The issue with the Five Whys is that it’s tunnel-visioned into a linear and simplistic explanation of how work gets done and events transpire. This narrowing can be incredibly problematic.

In the best case, it can lead an organization to think they’re improving on something (or preventing future occurrences of events) when they’re not.

In the worst case, it can re-affirm a faulty worldview of causal simplification and set up a structure where individuals don’t feel safe in giving their narratives, because either they weren’t asked the right “why?” question or because the answer a question elicited pointed to ‘human error’ or individual attributes as causal.

Let’s take an example. From my tutorials at the Velocity Conference in New York, I used an often-repeated straw man to illustrate this:

(Slide: an example Five Whys causal chain)

This is the example of the Five Whys found in the Web Operations book, as well.

This causal chain effectively ends with a person’s individual attributes, not with a description of the multiple conditions that allow an event like this to happen. Let’s look into some of the answers…

“Why did the server fail? Because an obscure subsystem was used in the wrong way.”

This answer is dependent on the outcome. We know that it was used in the “wrong” way only because we’ve connected it to the resulting failure. In other words, we as “investigators” have the benefit of hindsight. We can easily judge the usage of the server because we know the outcome. If we were to go back in time and ask the engineer(s) who were using it: “Do you think that you’re doing this right?” they would answer: yes, they are. We want to know what influences brought them to think that, which simply won’t fit into an answer to “why?”

The answer also limits the next question that we’d ask. There isn’t any room in the dialogue to discuss things such as the potential to use a server in the wrong way and it not result in failure, or what ‘wrong’ means in this context. Can the server only be used in two ways – the ‘right’ way or the ‘wrong’ way? And does success (or, the absence of a failure) dictate which of those ways it was used? We don’t get to these crucial questions.

“Why was it used in the wrong way? The engineer who used it didn’t know how to use it properly.”

This answer is effectively a tautology, and includes a post-hoc judgement. It doesn’t tell us anything about how the engineer did use the system, which would provide a rich source of operational data, especially for engineers who might be expected to work with the system in the future. Is it really just about this one engineer? Or is it possibly about the environment (tools, dashboards, controls, tests, etc.) that the engineer is working in? If it’s the latter, how does that get captured in the Five Whys?

So what do we find in this chain we have constructed above? We find:

  • an engineer with faulty (or at least incomplete) knowledge
  • insufficient indoctrination of engineers
  • a manager who fouls things up by not being thorough enough in the training of new engineers (indeed: we can make a post-hoc judgement about her beliefs)

If this is to be taken as an example of the Five Whys, then as an engineer or engineering manager, I might not look forward to it, since it focuses on our individual attributes and doesn’t tell us much about the event other than the platitude that training (and convincing people about training) is important.

These are largely answers about “who?” not descriptions of what conditions existed. In other words, by asking “why?” in this way, we’re using failures to explain failures, which isn’t helpful.

If we ask: “Why did a particular server fail?” we can get any number of answers, but one of those answers will be used as the primary way of getting at the next “why?” step. We’ll also lose out on a huge amount of important detail, because remember: you only get one question before the next step.

If instead, we were to ask the engineers how they went about implementing some new code (or ‘subsystem’), we might hear a number of things, like maybe:

  • the approach(es) they took when writing the code
  • the ways they gained confidence (tests, code reviews, etc.) that the code was going to work as they expected before it was deployed
  • what (if any) history of success or failure they have had with similar pieces of code
  • what trade-offs they made or managed in the design of the new function
  • how they judged the scope of the project
  • how much (and in what ways) they experienced time pressure for the project
  • the list can go on, if you’re willing to ask more and they’re willing to give more

Rather than judging people for not doing what they should have done, the new view presents tools for explaining why people did what they did. Human error becomes a starting point, not a conclusion. (Dekker, 2009)

When we ask “how?”, we’re asking for a narrative. A story.

In these stories, we get to understand how people work. By going with the “engineer was deficient, needs training, manager needs to be told to train” approach, we might not have a place to ask questions aimed at recommendations for the future, such as:

  • What might we put in place so that it’s very difficult to put that code into production accidentally?
  • What sources of confidence for engineers could we augment?

As part of those stories, we’re looking to understand people’s local rationality. When it comes to decisions and actions, we want to know how it made sense for someone to do what they did. And make no mistake: they thought what they were doing made sense. Otherwise, they wouldn’t have done it.


Again, I’m not original with this thought. Local rationality (or as Herb Simon called it, “bounded rationality”) is something that sits firmly atop some decades of cognitive science.

These stories we’re looking for contain details that we can pull on and ask more about, which is critical as a facilitator of a post-mortem debriefing, because people don’t always know what details are important. As you’ll see later in this post, reality doesn’t work like a DVR; you can’t pause, rewind and fast-forward at will along a singular and objective axis, picking up all of the pieces along the way, acting like CSI. Memories are faulty and perspectives are limited, so a different approach is necessary.

Not just “how”

In order to get at these narratives, you need to dig for second stories. Asking “why?” will get you answers in the form of first stories. These are not only insufficient answers; they can be very damaging to an organization, depending on the context. As a refresher…

From Behind Human Error here’s the difference between “first” and “second” stories of human error:

First story: Human error is seen as the cause of failure.
Second story: Human error is seen as the effect of systemic vulnerabilities deeper inside the organization.

First story: Saying what people should have done is a satisfying way to describe failure.
Second story: Saying what people should have done doesn’t explain why it made sense for them to do what they did.

First story: Telling people to be more careful will make the problem go away.
Second story: Only by constantly seeking out its vulnerabilities can organizations enhance safety.

 

Now, read again the straw-man example of the Five Whys above. The questions that we ask frame the answers that we will get in the form of first stories. When we ask more and better questions (such as “how?”) we have a chance at getting at second stories.

You might wonder: how did I get from the Five Whys to the topic of ‘human error’? Because once ‘human error’ is a candidate to reach for as a cause (and it will be, because it’s a simple and potentially satisfying answer to “why?”), you will undoubtedly use it.

At the beginning of my tutorial in New York, I asked the audience this question:

(Slide: a statement asserting that failures are due to either human error or technical error, with the question “Is this right?”)

At the beginning of the talk, a large number of people said yes, this is correct. Steven Shorrock (who is speaking at Velocity next week in Barcelona on this exact topic) has written a great article on this way of thinking: If It Weren’t For The People. By the end of my talk, I was able to convince them that this is also the wrong focus of a post-mortem description.

This idea accompanies the Five Whys more often than not, and there are two things that I’d like to shine some light on about it:

Myth of the “human or technical failure” dichotomy

This is dualistic thinking, and I don’t have much to add to this other than what Dekker has said about it (Dekker, 2006):

“Was the accident caused by mechanical failure or by human error? It is a stock question in the immediate aftermath of a mishap. Indeed, it seems such a simple, innocent question. To many it is a normal question to ask: If you have had an accident, it makes sense to find out what broke. The question, however, embodies a particular understanding of how accidents occur, and it risks confining our causal analysis to that understanding. It lodges us into a fixed interpretative repertoire. Escaping from this repertoire may be difficult. It sets out the questions we ask, provides the leads we pursue and the clues we examine, and determines the conclusions we will eventually draw.”

Myth: during a retrospective investigation, something is waiting to be “found”

I’ll cut to the chase: there is nothing waiting to be found, or “revealed.” These “causes” that we’re thinking we’re “finding”? We’re constructing them, not finding them. We’re constructing them because we are the ones that are choosing where (and when) to start asking questions, and where/when to stop asking the questions. We’ve “found” a root cause when we stop looking. And in many cases, we’ll get lazy and just chalk it up to “human error.”

As Erik Hollnagel has said (Hollnagel, 2009, p. 85):

“In accident investigation, as in most other human endeavours, we fall prey to the What-You-Look-For-Is-What-You-Find or WYLFIWYF principle. This is a simple recognition of the fact that assumptions about what we are going to see (What-You-Look-For), to a large extent will determine what we actually find (What-You-Find).”

More to the point: “What-You-Look-For-Is-What-You-Fix”

We think there is something like the cause of a mishap (sometimes we call it the root cause, or primary cause), and if we look in the rubble hard enough, we will find it there. The reality is that there is no such thing as the cause, or primary cause, or root cause. Cause is something we construct, not find. And how we construct causes depends on the accident model that we believe in. (Dekker, 2006)

Nancy Leveson comments on this idea in her excellent book Engineering a Safer World (p. 20):

Subjectivity in Selecting Events

The selection of events to include in an event chain is dependent on the stopping rule used to determine how far back the sequence of explanatory events goes. Although the first event in the chain is often labeled the ‘initiating event’ or ‘root cause,’ the selection of an initiating event is arbitrary and previous events could always be added.

Sometimes the initiating event is selected (the backward chaining stops) because it represents a type of event that is familiar and thus acceptable as an explanation for the accident or it is a deviation from a standard [166]. In other cases, the initiating event or root cause is chosen because it is the first event in the backward chain for which it is felt that something can be done for correction.

The backward chaining may also stop because the causal path disappears due to lack of information. Rasmussen suggests that a practical explanation for why actions by operators actively involved in the dynamic flow of events are so often identified as the cause of an accident is the difficulty in continuing the backtracking “through” a human [166].

A final reason why a “root cause” may be selected is that it is politically acceptable as the identified cause. Other events or explanations may be excluded or not examined in depth because they raise issues that are embarrassing to the organization or its contractors or are politically unacceptable.

Learning is the goal. Any prevention depends on that learning.

So if not the Five Whys, then what should you do? What method should you take?

I’d like to suggest an alternative, which is to first accept the idea that you have to actively seek out and protect the stories from bias (and judgement) when you ask people “how?”-style questions. Then you can:

  • Ask people for their story without any replay of data that would supposedly ‘refresh’ their memory
  • Tell their story back to them and confirm you got their narrative correct
  • Identify critical junctures
  • Progressively probe and re-build how the world looked to people inside of the situation at each juncture.

As a starting point for those probing questions, we can look to Gary Klein and Sidney Dekker for the types of questions you can ask instead of “why?”…

Debriefing Facilitation Prompts

(from The Field Guide To Understanding Human Error, by Sidney Dekker)

At each juncture in the sequence of events (if that is how you want to structure this part of the accident story), you want to get to know:

  • Which cues were observed (what did he or she notice/see or did not notice what he or she had expected to notice?)
  • What knowledge was used to deal with the situation? Did participants have any experience with similar situations that was useful in dealing with this one?
  • What expectations did participants have about how things were going to develop, and what options did they think they had to influence the course of events?
  • How did other influences (operational or organizational) help determine how they interpreted the situation and how they would act?

Here are some questions Gary Klein and his researchers typically ask to find out how the situation looked to people on the inside at each of the critical junctures:

Cues

  • What were you seeing?
  • What were you focused on?
  • What were you expecting to happen?

Interpretation

  • If you had to describe the situation to your colleague at that point, what would you have told them?

Errors

  • What mistakes (for example, in interpretation) were likely at this point?

Previous knowledge/experience

  • Were you reminded of any previous experience?
  • Did this situation fit a standard scenario?
  • Were you trained to deal with this situation?
  • Were there any rules that applied clearly here?
  • Did any other sources of knowledge suggest what to do?

Goals

  • What were you trying to achieve?
  • Were there multiple goals at the same time?
  • Was there time pressure or other limitations on what you could do?

Taking Action

  • How did you judge you could influence the course of events?
  • Did you discuss or mentally imagine a number of options, or did you know straight away what to do?

Outcome

  • Did the outcome fit your expectation?
  • Did you have to update your assessment of the situation?

Communications

  • What communication medium(s) did you prefer to use? (phone, chat, email, video conference, etc.)
  • Did you make use of more than one communication channel at once?

Help

  • Did you ask anyone for help?
  • What signal brought you to ask for support or assistance?
  • Were you able to contact the people you needed to contact?

For the tutorials I did at Velocity, I made a one-pager of these: http://bit.ly/DebriefingPrompts

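Prompts like these also lend themselves to lightweight tooling. Below is a minimal, hypothetical sketch (the `PROMPTS` dictionary abbreviates the categories above, and the `worksheet` helper is my own invention, not anything from Dekker or Klein) that renders a per-juncture worksheet a facilitator could paste into a postmortem document:

```python
# Hypothetical sketch: the debriefing prompt categories as data, plus a
# helper that renders one block of prompts per critical juncture.
# Prompt wording is abbreviated; see Dekker's Field Guide for the full set.
PROMPTS = {
    "Cues": ["What were you seeing?", "What were you focused on?"],
    "Interpretation": ["How would you have described the situation to a colleague?"],
    "Previous knowledge": ["Were you reminded of any previous experience?"],
    "Goals": ["What were you trying to achieve?"],
    "Taking action": ["How did you judge you could influence events?"],
}

def worksheet(junctures):
    """Render the full prompt list once for each critical juncture."""
    lines = []
    for juncture in junctures:
        lines.append(f"== Juncture: {juncture} ==")
        for category, questions in PROMPTS.items():
            lines.append(f"{category}:")
            lines.extend(f"  - {q}" for q in questions)
    return "\n".join(lines)

print(worksheet(["first alert fired", "failover decision"]))
```

The point of a structure like this is only that the facilitator asks the same "how?"-style questions at every juncture, rather than improvising "why?" chains.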

Try It

I have tried to outline some of my reasoning on why using the Five Whys approach is suboptimal, and I’ve given an alternative. I’ll do one better and link you to the tutorials that I gave in New York in October, which I think dig deeper into these concepts. They’re in four parts, 45 minutes each.

Part I – Introduction and the scientific basis for post-hoc retrospective pitfalls and learning

Part II – The language of debriefings, causality, case studies, teams coping with complexity

Part III – Dynamic fault management, debriefing prompts, gathering and contextualizing data, constructing causes

Part IV – Taylorism, normal work, ‘root cause’ of software bugs in cars, Q&A

My request is that the next time you would do a Five Whys, you instead ask “how?” or the variations of the questions I posted above. If you think you get more operational data from a Five Whys and are happy with it, rock on.

If you’re more interested in this alternative and the fundamentals behind it, then there are a number of sources you can look to. You could do a lot worse than starting with Sidney Dekker’s Field Guide To Understanding Human Error.

An Explanation

For those readers who think I’m too unnecessarily harsh on the Five Whys approach, I think it’s worthwhile to explain why I feel so strongly about this.

Retrospective understanding of accidents and events is important because how we make sense of the past greatly and almost invisibly influences our future. At some point in the not-so-distant past, the domain of web engineering was about selling books online and making a directory of the web. These organizations and the individuals who built them quickly gave way to organizations that now build cars, spacecraft, trains, aircraft, medical monitoring devices…the list goes on…simply because software development and distributed systems architectures are at the core of modern life.

The software worlds and the non-software worlds have collided and will continue to do so. More and more “life-critical” equipment and products rely on software and even the Internet.

Those domains have had varied success in retrospective understanding of surprising events, to say the least. Investigative approaches that are firmly based on causal oversimplification and the “Bad Apple Theory” of deficient individual attributes (like the Five Whys) have been shown not only to be unhelpful, but to objectively make learning harder, not easier. As a result, people who have made mistakes or been involved in accidents have been fired, banned from their profession, and thrown in jail for some of the very things that you could find in a Five Whys.

I sometimes feel nervous that these oversimplifications will still be around when my daughter and son are older. If they were to make a mistake, would they be blamed as a cause? I strongly believe that we can leave these old ways behind us and do much better.

My goal is not to vilify an approach, but to state explicitly that if the world is to become safer, then we have to eschew this simplicity; it will only get better if we embrace the complexity, not ignore it.

 

Epilogue: The Longer Version For Those Who Have The Stomach For Complexity Theory

The Five Whys approach follows a Newtonian-Cartesian worldview. This is a worldview that is seductively satisfying and compellingly simple. But it’s also false in the world we live in.

What do I mean by this?

There are five ways in which the Five Whys firmly sits in a Newtonian-Cartesian worldview that we should eschew when it comes to learning from past events. This is a CliffsNotes version of “The complexity of failure: Implications of complexity theory for safety investigations” (Dekker, Cilliers, & Hofmeyr, 2011):

First, it is reductionist. The narrative built by the Five Whys sits on the idea that if you can construct a causal chain, then you’ll have something to work with. In other words: to understand the system, you pull it apart into its constituent parts. Know how the parts interact, and you know the system.

Second, it assumes what Dekker has called “cause-effect symmetry” (Dekker, complexity of failure):

“In the Newtonian vision of the world, everything that happens has a definitive, identifiable cause and a definitive effect. There is symmetry between cause and effect (they are equal but opposite). The determination of the ‘‘cause’’ or ‘‘causes’’ is of course seen as the most important function of accident investigation, but assumes that physical effects can be traced back to physical causes (or a chain of causes-effects) (Leveson, 2002). The assumption that effects cannot occur without specific causes influences legal reasoning in the wake of accidents too. For example, to raise a question of negligence in an accident, harm must be caused by the negligent action (GAIN, 2004). Assumptions about cause-effect symmetry can be seen in what is known as the outcome bias (Fischhoff, 1975). The worse the consequences, the more any preceding acts are seen as blameworthy (Hugh and Dekker, 2009).”

John Carroll (Carroll, 1995) called this “root cause seduction”:

The identification of a root cause means that the analysis has found the source of the event and so everyone can focus on fixing the problem. This satisfies people’s need to avoid ambiguous situations in which one lacks essential information to make a decision (Frisch & Baron, 1988) or experiences a salient knowledge gap (Loewenstein, 1993). The seductiveness of singular root causes may also feed into, and be supported by, the general tendency to be overconfident about how much we know (Fischhoff, Slovic, & Lichtenstein, 1977).

That last bit about a tendency to be overconfident about how much we know (in this context, how much we know about the past) is a strong piece of research put forth by Baruch Fischhoff, who originally researched what we now understand to be the Hindsight Bias. Not surprisingly, Fischhoff’s doctoral thesis advisor was Daniel Kahneman (you’ve likely heard of him as the author of Thinking, Fast and Slow), whose research in cognitive biases and heuristics everyone should at least be vaguely familiar with.

The third issue with this worldview, supported by the idea of the Five Whys and something that follows logically from the earlier points, is that outcomes are foreseeable if you know the initial conditions and the rules that govern the system. The reason that you would even construct a serial causal chain like this is the belief that, armed with those conditions and rules, you could have predicted the outcome, and so can now prevent its recurrence.

The fourth assumption of this worldview is that time is reversible: that a causal chain is something you can fast-forward and rewind at will. We can’t treat it that way, no matter how attractively simple that seems, because the socio-technical systems that we work on and work in are complex and dynamic in nature. Deterministic behavior (or, at least, predictability) is something that we look for in software; in complex systems this is a foolhardy search, because emergence is a property of this complexity.

And finally, there is an underlying assumption that complete knowledge is attainable. In other words: we only have to try hard enough to understand exactly what happened. The issue with this is that success and failure have many contributing causes, and there is no comprehensive and objective account. The best that you can do is to probe people’s perspectives at juncture points in the investigation. It is not possible to understand past events in any way that can be considered comprehensive.

Dekker (Dekker, 2011):

As soon as an outcome has happened, whatever past events can be said to have led up to it, undergo a whole range of transformations (Fischhoff and Beyth, 1975; Hugh and Dekker, 2009). Take the idea that it is a sequence of events that precedes an accident. Who makes the selection of the ‘‘events’’ and on the basis of what? The very act of separating important or contributory events from unimportant ones is an act of construction, of the creation of a story, not the reconstruction of a story that was already there, ready to be uncovered. Any sequence of events or list of contributory or causal factors already smuggles a whole array of selection mechanisms and criteria into the supposed ‘‘re’’construction. There is no objective way of doing this—all these choices are affected, more or less tacitly, by the analyst’s background, preferences, experiences, biases, beliefs and purposes. ‘‘Events’’ are themselves defined and delimited by the stories with which the analyst configures them, and are impossible to imagine outside this selective, exclusionary, narrative fore-structure (Cronon, 1992).

Here is a thought exercise: what if we were to try to use the Five Whys for finding the “root cause” of a success?

Why didn’t we have failure X today?

Now this question is a lot more difficult to answer with a single cause. This is because things go right for many reasons, and not all of them are obvious. We can spend all day writing down reasons why we didn’t have failure X today, and if we’re committed, we can keep going.

So if success requires “multiple contributing conditions, each necessary but only jointly sufficient” to happen, then how is it that failure requires just one? The Five Whys, as it’s commonly presented as an approach to improvement (or: learning?), will lead us to believe that not only is just one condition sufficient, but that that condition is a canonical one, to the exclusion of all others.
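To make that asymmetry concrete, here is a toy sketch (the event names and the causal graph are entirely invented for illustration) contrasting a single serial “why” chain with a walk over all jointly contributing conditions:

```python
# A toy causal graph: each event has multiple jointly contributing conditions.
# All names here are invented; no real incident is being modeled.
GRAPH = {
    "outage": ["deploy", "traffic spike"],
    "deploy": ["config change", "review gap"],
    "traffic spike": ["marketing launch"],
    "config change": [],
    "review gap": [],
    "marketing launch": [],
}

def five_whys(event, graph, depth=5):
    """Follow the first 'why' at each step -- a single serial chain.
    Every branch not taken is silently discarded."""
    chain = [event]
    for _ in range(depth):
        parents = graph.get(chain[-1], [])
        if not parents:
            break
        chain.append(parents[0])
    return chain

def all_contributors(event, graph):
    """Walk every branch: the full set of jointly contributing conditions."""
    seen, stack = set(), [event]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen

print(five_whys("outage", GRAPH))               # → ['outage', 'deploy', 'config change']
print(sorted(all_contributors("outage", GRAPH)))  # all six conditions that co-occurred
```

The serial chain surfaces half the graph and crowns one leaf as “root”; which leaf it crowns depends entirely on the order in which the whys happened to be asked.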

* RCA, or “Root Cause Analysis” can also easily turn into “Retrospective Cover of Ass”

References

Carroll, J. S. (1995). Incident Reviews in High-Hazard Industries: Sense Making and Learning Under Ambiguity and Accountability. Organization & Environment, 9(2), 175–197. doi:10.1177/108602669500900203

Dekker, S. (2004). Ten questions about human error: A new view of human factors and system safety. Mahwah, N.J: Lawrence Erlbaum.

Dekker, S., Cilliers, P., & Hofmeyr, J.-H. (2011). The complexity of failure: Implications of complexity theory for safety investigations. Safety Science, 49(6), 939–945. doi:10.1016/j.ssci.2011.01.008

Hollnagel, E. (2009). The ETTO principle: Efficiency-thoroughness trade-off: Why things that go right sometimes go wrong. Burlington, VT: Ashgate.

Leveson, N. (2012). Engineering a safer world. Cambridge, MA: MIT Press.

 

 

Translations Between Domains: David Woods

One of the reasons I’ve continued to be more and more interested in Human Factors and Safety Science is that I found myself without many answers to the questions I have had in my career. Questions surrounding how organizations work, how people think and work with computers, how decisions get made under uncertainty, and how people cope with increasing amounts of complexity.

As a result, my journey took me deep into a world where I immediately saw connections — between concepts found in other high-tempo, high-consequence domains and my own world of software engineering and operations. One of the first connections was in Richard Cook’s How Complex Systems Fail, and it struck me so deeply I insisted that it get reprinted (with additions by Richard) into O’Reilly’s Web Operations book.

I simply cannot un-see these connections now, and the field of study keeps me going deeper. So deep that I felt I needed to get a degree. My goal with getting a degree in the topic is not just to satisfy my own curiosity, but also to explore these topics in sufficient depth to feel credible in thinking about them critically.

In software, the concept and sometimes inadvertent practice of “cargo cult engineering” is well known. I’m hoping to avoid that in my own translation(s) of what’s been found in human factors, safety science, and cognitive systems engineering as they have looked into domains like aviation, patient safety, and power plant operations. Instead, I’m looking to truly understand that work in order to know what to focus on in my own research, as well as to understand how my domain is either similar (and in what ways?) or different (and in what ways?).

For example, just a hint of what sorts of questions I have been mulling over:

  • How does the concept of “normalization of deviance” manifest in web engineering? How does it relate to our concept of ‘technical debt’?
  • What organizational dynamics might be in play when it comes to learning from “successes” and “failures”?
  • What methods of inquiry can we use to better design interfaces that have functionality and safety and diagnosis support as their core? Or, are those goals in conflict? If so, how?
  • How can we design alerts to reduce noise and increase signal in a way that takes into account the context of the intended receiver of the alert? In other words, how can we teach alerts to know about us, instead of the other way around?
  • The Internet (including its technical, political, and cultural structures) has non-zero amounts of diversity, interdependence, connectedness, and adaptation, which by many measures constitutes a complex system.
  • How do successful organizations navigate trade-offs when it comes to decisions that may have unexpected consequences?
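As one illustration of the alert-design question above, here is a hypothetical sketch (the `Receiver` fields and the routing rules are invented for illustration, not a description of any real alerting system) of an alert that consults the receiver’s current context before paging, instead of broadcasting identically to everyone:

```python
from dataclasses import dataclass

@dataclass
class Receiver:
    name: str
    on_call: bool
    expertise: set        # subsystems this person knows well
    active_incident: bool  # already heads-down on an incident?

def should_page(alert_subsystem, severity, receiver):
    """Context-aware delivery: page only when this person is the right
    target *right now*.  All thresholds here are invented."""
    if not receiver.on_call:
        # Off-call people only hear about critical issues in their specialty.
        return severity == "critical" and alert_subsystem in receiver.expertise
    if receiver.active_incident and severity != "critical":
        # Don't fragment attention during an ongoing response.
        return False
    return True

team = [
    Receiver("a", on_call=True, expertise={"db"}, active_incident=True),
    Receiver("b", on_call=False, expertise={"network"}, active_incident=False),
]
print([r.name for r in team if should_page("network", "critical", r)])  # → ['a', 'b']
print([r.name for r in team if should_page("network", "warning", r)])   # → []
```

Even a toy like this inverts the usual arrangement: the alert is taught something about its receivers, rather than every receiver being taught to triage every alert.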

I’ve done my best to point my domain at some of these connections as I understand them, and the Velocity Conference has been one of the ways I’ve hoped to bring people “over the bridge” from Safety Science, Human Factors, and Cognitive Systems Engineering into software engineering and operations as it exists as a practice on Internet-connected resources. If you haven’t seen Dr. Richard Cook’s 2012 and 2013 keynotes, or Dr. Johan Bergstrom’s keynote, stop what you’re doing right now and watch them.

I’m willing to bet you’ll see connections immediately…



David Woods is one of the pioneers in these fields, and continues to be a huge influence on the way that I think about our domain and my own research (my thesis project relies heavily on some of his previous work), and I couldn’t be happier that he’s speaking at Velocity in New York, which is coming up soon. (Pssst: if you register for it here, you can use the code “JOHN20” for a 20% discount.)

I have posted before (and likely will again) about a paper Woods contributed to, Common Ground and Coordination in Joint Activity (Klein, Feltovich, Bradshaw, & Woods, 2005) which in my mind might as well be considered the best explanation on what “devops” means to me, and what makes successful teams work. If you haven’t read it, do it now.

 

Dynamic Fault Management and Anomaly Response

I thought about listing all of Woods’ work that I’ve seen connections in thus far, but then I realized that if I wasn’t careful, I’d be writing a literature review and not a blog post. :) Also, I have thesis work to do. So for now, I’d like to point only at two concepts that struck me as absolutely critical to the day-to-day of many readers of this blog, dynamic fault management and anomaly response.

Woods sheds some light on these topics in Joint Cognitive Systems: Patterns in Cognitive Systems Engineering. Pay particular attention to the characteristics of these phenomena:

“In anomaly response, there is some underlying process, an engineered or physiological process which will be referred to as the monitored process, whose state changes over time. Faults disturb the functions that go on in the monitored process and generate the demand for practitioners to act to compensate for these disturbances in order to maintain process integrity—what is sometimes referred to as “safing” activities. In parallel, practitioners carry out diagnostic activities to determine the source of the disturbances in order to correct the underlying problem.

Anomaly response situations frequently involve time pressure, multiple interacting goals, high consequences of failure, and multiple interleaved tasks (Woods, 1988; 1994). Typical examples of fields of practice where dynamic fault management occurs include flight deck operations in commercial aviation (Abbott, 1990), control of space systems (Patterson et al., 1999; Mark, 2002), anesthetic management under surgery (Gaba et al., 1987), terrestrial process control (Roth, Woods & Pople, 1992), and response to natural disasters.” (Woods & Hollnagel, 2006, p.71)

Now look down at the distributed systems you’re designing and operating.

Look at the “runbooks” and postmortem notes that you have written in the hopes that they can help guide teams as they try to untangle the sometimes very confusing scenarios that outages can bring.

Does “safing” ring familiar to you?

Do you recognize managing “multiple interleaved tasks” under “time pressure” and “high consequences of failure”?

I think it’s safe to say that almost every Velocity Conference attendee would see connections here.

In How Unexpected Events Produce An Escalation Of Cognitive And Coordinative Demands (Woods & Patterson, 1999), he introduces the concept of escalation, in terms of anomaly response:

The concept of escalation captures a dynamic relationship between the cascade of effects that follows from an event and the demands for cognitive and collaborative work that escalate in response (Woods, 1994). An event triggers the evolution of multiple interrelated dynamics.

  • There is a cascade of effects in the monitored process. A fault produces a time series of disturbances along lines of functional and physical coupling in the process (e.g., Abbott, 1990). These disturbances produce a cascade of multiple changes in the data available about the state of the underlying process, for example, the avalanche of alarms following a fault in process control applications (Reiersen, Marshall, & Baker, 1988).
  • Demands for cognitive activity increase as the problem cascades. More knowledge potentially needs to be brought to bear. There is more to monitor. There is a changing set of data to integrate into a coherent assessment. Candidate hypotheses need to be generated and evaluated. Assessments may need to be revised as new data come in. Actions to protect the integrity and safety of systems need to be identified, carried out, and monitored for success. Existing plans need to be modified or new plans formulated to cope with the consequences of anomalies. Contingencies need to be considered in this process. All these multiple threads challenge control of attention and require practitioners to juggle more tasks.
  • Demands for coordination increase as the problem cascades. As the cognitive activities escalate, the demand for coordination across people and across people and machines rises. Knowledge may reside in different people or different parts of the operational system. Specialized knowledge and expertise from other parties may need to be brought into the problem-solving process. Multiple parties may have to coordinate to implement activities aimed at gaining information to aid diagnosis or to protect the monitored process. The trouble in the underlying process requires informing and updating others – those whose scope of responsibility may be affected by the anomaly, those who may be able to support recovery, or those who may be affected by the consequences the anomaly could or does produce.
  • The cascade and escalation is a dynamic process. A variety of complicating factors can occur, which move situations beyond canonical, textbook forms. The concept of escalation captures this movement from canonical to nonroutine to exceptional. The tempo of operations increases following the recognition of a triggering event and is synchronized by temporal landmarks that represent irreversible decision points.

When I read…

“These disturbances produce a cascade of multiple changes in the data available about the state of the underlying process, for example, the avalanche of alarms following a fault in process control applications” 

I think of many large-scale outages and multi-day recovery activities, like this one that you all might remember (AWS EBS/RDS outage, 2011).

When I read…

“Existing plans need to be modified or new plans formulated to cope with the consequences of anomalies. Contingencies need to be considered in this process. All these multiple threads challenge control of attention and require practitioners to juggle more tasks.” 

I think of many outage response scenarios I have been in with multiple teams (network, storage, database, security, etc.) gathering data from the multiple places they are expert in, at the same time making sense of that data as normal or abnormal signals.

When I read…

“Multiple parties may have to coordinate to implement activities aimed at gaining information to aid diagnosis or to protect the monitored process.”

I think of these two particular outages, and how in the fog of ambiguous signals coming in during diagnosis of an issue, there is a “divide and conquer” effort distributed throughout differing domain expertise (database, network, various software layers, hardware, etc.) that aims to split the search space of diagnosis, while at the same time keeping each other up-to-date on what pathologies have been eliminated as possibilities, what new data can be used to form hypotheses about what’s going on, etc.

I will post more on the topic of anomaly response in detail (and more of Woods’ work) in another post.

In the meantime, I urge you to take a look at David Woods’ writings, and look for connections in your own work. Below is a talk David gave at IBM’s Almaden Research Center, called “Creating Safety By Engineering Resilience”:

David D. Woods, Creating Safety by Engineering Resilience from jspaw on Vimeo.

References

Hollnagel, E., & Woods, D. D. (1983). Cognitive systems engineering: New wine in new bottles. International Journal of Man-Machine Studies, 18(6), 583–600.

Klein, G., Feltovich, P. J., Bradshaw, J. M., & Woods, D. D. (2005). Common ground and coordination in joint activity. Organizational Simulation, 139–184.

Woods, D. D. (1995). The alarm problem and directed attention in dynamic fault management. Ergonomics. doi:10.1080/00140139508925274

Woods, D. D., & Hollnagel, E. (2006). Joint cognitive systems : patterns in cognitive systems engineering. Boca Raton : CRC/Taylor & Francis.

Woods, D. D., & Patterson, E. S. (1999). How Unexpected Events Produce An Escalation Of Cognitive And Coordinative Demands. Stress, 1–13.

Woods, D. D., Patterson, E. S., & Roth, E. M. (2002). Can We Ever Escape from Data Overload? A Cognitive Systems Diagnosis. Cognition, Technology & Work, 4(1), 22–36. doi:10.1007/s101110200002

Teaching Engineering As A Social Science

Below is a piece written by Edward Wenk, Jr., which originally appeared in PRISM, the magazine of the American Society for Engineering Education (Volume 6, No. 4, December 1996).

While I think that there’s much more to this than what Wenk points to as ‘social science’, I agree wholeheartedly with his ideas. I might even say that he didn’t go far enough in his recommendations.

Enjoy. :)

 

Edward Wenk, Jr.

Teaching Engineering as a Social Science

Today’s public engages in a love affair with technology, yet it consistently ignores the engineering at technology’s core. This paradox is reinforced by the relatively few engineers in leadership positions. Corporations, which used to have many engineers on their boards of directors, today are composed mainly of M.B.A.s and lawyers. Few engineers hold public office or even run for office. Engineers seldom break into headlines except when serious accidents are attributed to faulty design.

While there are many theories on this lack of visibility, from inadequate public relations to inadequate public schools, we may have overlooked the real problem: Perhaps people aren’t looking at engineers because engineers aren’t looking at people.

If engineering is to be practiced as a profession, and not just a technical craft, engineers must learn to harmonize natural sciences with human values and social organization. To do this we must begin to look at engineering as a social science and to teach, practice, and present engineering in this context.

To many in the profession, looking at teaching engineering as a social science is anathema. But consider the multiple and profound connections of engineering to people.

Technology in Everyday Life

The work of engineers touches almost everyone every day through food production, housing, transportation, communications, military security, energy supply, water supply, waste disposal, environmental management, health care, even education and entertainment. Technology is more than hardware and silicon chips.

In propelling change and altering our belief systems and culture, technology has joined religion, tradition, and family in the scope of its influence. Its enhancements of human muscle and human mind are self-evident. But technology is also a social amplifier. It stretches the range, volume, and speed of communications. It inflates appetites for consumer goods and creature comforts. It tends to concentrate wealth and power, and to increase the disparity of rich and poor. In the competition for scarce resources, it breeds conflicts.

In social psychological terms, it alters our perceptions of space. Events anywhere on the globe now have immediate repercussions everywhere, with a portfolio of tragedies that ignite feelings of helplessness. Technology has also skewed our perception of time, nourishing a desire for speed and instant gratification and ignoring longer-term impacts.

Engineering and Government

All technologies generate unintended consequences. Many are dangerous enough to life, health, property, and environment that the public has demanded protection by the government.

Although legitimate debates erupt on the size of government, its cardinal role is demonstrated in an election year when every faction seeks control. No wonder vested interests lobby aggressively and make political campaign contributions.

Whatever that struggle, engineers have generally opted out. Engineers tend to believe that the best government is the least government, which is consistent with goals of economy and efficiency that steer many engineering decisions without regard for social issues and consequences.

Problems at the Undergraduate Level

By both inclination and preparation, many engineers approach the real world as though it were uninhabited. Undergraduates who choose an engineering career often see it as an escape from blue-collar family legacies by obtaining the social prestige that comes with belonging to a profession. Others love machines. Few, however, are attracted to engineering because of an interest in people or a commitment to public service. On the contrary, most are uncomfortable with the ambiguities of human behavior, its absence of predictable cause and effect, its lack of control, and with the demands for direct encounters with the public.

Part of this discomfort originates in engineering departments, which are often isolated from arts, humanities, and social sciences classrooms by campus geography as well as by disparate bodies of scholarly knowledge and cultures. Although most engineering departments require students to take some nontechnical courses, students often select these on the basis of hearsay, academic ease, or course instruction, not in terms of preparation for life or for citizenship.

Faculty attitudes don’t help. Many faculty members enter teaching immediately after obtaining their doctorates, their intellect sharply honed by a research specialty. Then they continue in that groove because of standard academic reward systems for tenure and promotion. Many never enter a professional practice that entails the human equation.

We can’t expect instant changes in engineering education. A start, however, would be to recognize that engineering is more than manipulation of intricate signs and symbols. The social context is not someone else’s business. Adopting this mindset requires a change in attitudes. Consider these axioms:

  • Technology is not just hardware; it is a social process.
  • All technologies generate side effects that engineers should try to anticipate and to protect against.
  • The most strenuous challenge lies in synthesis of technical, social, economic, environmental, political, and legal processes.
  • For engineers to fulfill a noblesse oblige to society, their objectivity must not be defined by conditions of employment, as, for example, in an employer’s tradeoffs of safety for cost.

In a complex, interdependent, and sometimes chaotic world, engineering practice must continue to excel in problem solving and creative synthesis. But today we should also emphasize social responsibility and commitment to social progress. With so many initiatives having potentially unintended consequences, engineers need to examine how to serve as counselors to the public in answering questions of “What if?” They would thus add sensitive, future-oriented guidance to the extraordinary power of technology to serve important social purposes.

In academic preparation, most engineering students miss exposure to the principles of social and economic justice and human rights, and to the importance of biological, emotional, and spiritual needs. They miss Shakespeare’s illumination of human nature – the lust for power and wealth and its corrosive effects on the psyche, and the role of character in shaping ethics that influence professional practice. And they miss models of moral vision to face future temptations.

Engineering’s social detachment is also marked by a lack of teaching about the safety margins that accommodate uncertainties in engineering theories, design assumptions, product use and abuse, and so on. These safety margins shape practice with social responsibility to minimize potential harm to people or property. Our students can learn important lessons from the history of safety margins, especially of failures, yet most use safety protocols without knowledge of that history and without an understanding of risk and its abatement. Can we expect a railroad systems designer obsessed with safety signals to understand that sleep deprivation is even more likely to cause accidents? No, not if the systems designer lacks knowledge of this relatively common problem.

Safety margins are a protection against some unintended consequences. Unless engineers appreciate human participation in technology and the role of human character in performance, they are unable to deal with demons that undermine the intended benefits.

Case Studies in Socio-Technology

Working for the legislative and executive branches of the U.S. government since the 1950s, I have had a ringside seat from which to view many of the events and trends that come from the connections between engineering and people. Following are a few of those cases.

Submarine Design

The first nuclear submarine, USS Nautilus, was taken on its deep submergence trial on February 28, 1955. The sub’s power plant had been successfully tested in a full-scale mock-up and in a shallow dive, but the hull had not been subjected to the intense hydrostatic pressure at operating depth. The hull was unprecedented in diameter, in materials, and in the special joints connecting cylinders of different diameter. Although it was designed with complex shell theory and confirmed by laboratory tests of scale models, proof of performance was still necessary at sea.

During the trial, the sub was taken stepwise to its operating depth while evaluating strains. I had been responsible for the design equations, for the model tests, and for supervising the test at sea, so it was gratifying to find the hull performed as predicted.

While the nuclear power plant and novel hull were significant engineering achievements, the most important development occurred much earlier, on the floor of the U.S. Congress. That was where the concept of nuclear propulsion was sold to a Congressional committee by Admiral Hyman Rickover, an electrical engineer. The proposal had previously been rejected by a conservative Navy; its passage took an electrical engineer who understood how Constitutional power was shared and how to exercise the right of petition. By this initiative, Rickover opened the door to civilian nuclear power, which accounts for 20 percent of our electrical generation, and perhaps 50 percent in France. If he had failed, or if the Nautilus pressure hull had failed, nuclear power would have been set back by a decade.

Space Telecommunications

Immediately after the 1957 Soviet surprise of Sputnik, engineers and scientists recognized that global orbits required all nations to reserve special radio channels for telecommunications with spacecraft. Implementation required the sanctity of a treaty, preparation of which demanded more than the talents of radio specialists; it engaged politicians, space lawyers, and foreign policy analysts. As science and technology advisor to Congress, I evaluated the treaty draft for technical validity and for consistency with U.S. foreign policy.

The treaty recognized that the airwaves were a common property resource, and that the virtuosity of communications engineering was limited without an administrative protocol to safeguard integrity of transmissions. This case demonstrated that all technological systems have three major components — hardware or communications equipment; software or operating instructions (in terms of frequency assignments); and peopleware, the organizations that write and implement the instructions.

National Policy for the Oceans

Another case concerned a national priority to explore the oceans and to identify U.S. rights and responsibilities in the exploitation and conservation of ocean resources. This issue, surfacing in 1966, was driven by new technological capabilities for fishing, offshore oil development, mining of mineral nodules on the ocean floor, and maritime shipment of oil in supertankers that if spilled could contaminate valuable inshore waters. Also at issue was the safety of those who sailed and fished.

This issue had a significant history. During the late 1950s, the U.S. government was downsizing oceanographic research that had initially been sponsored during World War II. This was done without strong objection, partly because marine issues lacked coherent policy, high-level policy leadership, and strong constituent advocacy.

Oceanographers, however, wanting to sustain levels of research funding, prompted a study by the National Academy of Sciences (NAS). Using the report’s findings, which documented the importance of oceanographic research, NAS lobbied Congress with great success, triggering a flurry of bills dramatized by such titles as “National Oceanographic Program.”

But what was overlooked was the ultimate purpose of such research to serve human needs and wants, to synchronize independent activities of major agencies, to encourage public/private partnerships, and to provide political leadership. During the 1960s, in the role of Congressional advisor, I proposed a broad “strategy and coordination machinery” centered in the Office of the President, the nation’s systems manager. The result was the Marine Resources and Engineering Development Act, passed by Congress and signed into law by President Johnson in 1966.

The shift in bill title reveals the transformation from ocean sciences to socially relevant technology, with engineering playing a key role. The legislation thus embraced the potential of marine resources and the steps for both development and protection. By emphasizing policy, ocean activities were elevated to a higher national priority.

Exxon Valdez

Just after midnight on March 24, 1989, the tanker Exxon Valdez, loaded with 50 million gallons of Alaska crude oil, fetched up on Bligh Reef in Prince William Sound and spilled its guts. For five hours, oil surged from the torn bottom at an incredible rate of 1,000 gallons per second. Attention quickly focused on the enormity of the environmental damage and on the blunders of the ship operators. The captain had a history of alcohol abuse, but was in his cabin at impact. There was much finger-pointing as people questioned how the accident could happen during a routine run on a clear night. Answers were sought by the National Transportation Safety Board and by a state of Alaska commission to which I was appointed. That blame game still continues in the courts.

The commission was instructed to clarify what happened, why, and how to keep it from happening again. But even the commission was not immune to the political blame game. While I wanted to look beyond the ship’s bridge and search for other, perhaps more systemic problems, the commission chair blocked me from raising those issues. Despite my repeated requests for time at the regularly scheduled sessions, I was not allowed to speak. The chair, a former official having tanker safety responsibilities in Alaska, had a different agenda and would only let the commission focus largely on cleanup rather than prevention. Fortunately, I did get to have my say by signing up as a witness and using that forum to express my views and concerns.

The Exxon Valdez proved to be an archetype of avoidable risk. Whatever the weakness in the engineered hardware, the accident was largely due to internal cultures of large corporations obsessed with the bottom line and determined to get their way, a U.S. Coast Guard vulnerable to political tampering and unable to realize its own ethic, a shipping system infected with a virus of tradition, and a cast of characters lulled into complacency that defeated efforts at prevention.

Lessons

These examples of technological delivery systems have unexpected commonalities. Space telecommunications and sea preservation and exploitation were well beyond the purview of just those engineers and scientists working on the projects; they involved national policy and required interaction between engineers, scientists, users, and policymakers. The Exxon Valdez disaster showed what happens when these groups do not work together. No matter how conscientious a ship designer is about safety, it is necessary to anticipate human fallibility and the darker side of self-centered, short-term ambition.

Recommendations

Many will argue that the engineering curriculum is so overloaded that the only source of socio-technical enrichment is a fifth year. Assuming that step is unrealistic, what can we do?

  • The hodgepodge of nonengineering courses could be structured to provide an integrated foundation in the liberal arts.
  • Teaching at the upper division could be problem- rather than discipline-oriented, with examples from practice that integrate nontechnical parameters.
  • Teaching could employ the case method often used in law, architecture, and business.
  • Students could be encouraged to learn about the world around them by reading good newspapers and nonengineering journals.
  • Engineering students could be encouraged to join such extracurricular activities as debating or political clubs that engage students from across the campus.

As we strengthen engineering’s potential to contribute to society, we can market this attribute to women and minority students who often seek socially minded careers and believe that engineering is exclusively a technical pursuit.

For practitioners of the future, something radically new needs to be offered in schools of engineering. Otherwise, engineers will continue to be left out.

Engineering’s Relationship To Science

One of the things that I hoped to get across in my post about perspectives on mature engineering was the subtle idea that engineering’s relationship to science is not straightforward.

My first caveat is that I am not a language expert, but I do respect it as a potentially deadly weapon. I hope it’s not too controversial to state that doing science is not the same as doing engineering. I’d like to further state that the difference, in some part, lies in the discretionary space that engineers have in both the design and operation of their creations. Science alone doesn’t care about our intentions, while engineering cares very much about our intentions.

A fellow alumnus of the master’s program I’m in, Martin Downey, did his thesis on a fascinating topic, “Is There More To Engineering Than Applied Science?”, in which he asks the question:

“Does the belief that engineering is an applied science help engineers understand their profession and its practice?”

Martin graciously let me quote chapter 6 of his thesis here, on the application of heuristics, which are essentially rules of thumb used to make decisions under some amount of uncertainty and ambiguity. And you can hopefully agree that such decision-making is at the core of engineering as a discipline, yes?

My own research aims to look deep into this discretionary space as well. Closing the gap between how we think work gets done and how it actually gets done is in my crosshairs. At the moment, my thesis looks to explore (my proposal is not yet approved, so I don’t want to speak too soon) how Internet engineers attempt (in many cases, using heuristics) to make sense of complex and sometimes disorienting scenarios (like an outage with cascading failures that can sometimes defy the imagination) and work as a team to untangle those scenarios. So Downey’s thesis is pretty relevant to me. :)

Martin’s chapter is below…

CHAPTER 6: THE APPLICATION OF HEURISTICS?

The Engineering Method

Billy Vaughn Koen* describes a heuristic-based system of reasoning used by engineers which marries the theoretical and practical aspects of engineering (Koen, 1985, 2003). Koen’s view takes a radically skeptical standpoint towards engineering knowledge (be it ‘scientific’ or otherwise), by which all knowledge is fallible and is better considered as heuristic, or rule of thumb. Koen (1985, p. 6) defines the engineer not in terms of the artefacts he produces, but rather as someone who applies the engineering method, which he describes as ‘the strategy for causing the best change in a poorly understood or uncertain situation within the available resources’ (Koen, 1985, p. 5). Koen argues engineering consists of the application of heuristics, rather than ‘science’ and ‘reason’. A heuristic, by Koen’s definition, is ‘anything that provides a plausible aid or direction in the solution of a problem, but is in the final analysis unjustified, incapable of justification, and fallible’ (Koen, 1985, p. 16). Koen (1985) provides four characteristics that aid in identifying heuristics (p. 17):

  • ‘A heuristic does not guarantee a solution
  • It may contradict other heuristics
  • It reduces the search time in solving a problem
  • Its acceptance depends on the immediate context instead of an absolute standard.’
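To ground these four characteristics in the web-operations world this blog usually lives in: retrying a failed call with exponential backoff is a classic engineering heuristic in Koen’s sense. It does not guarantee a solution (attempts are capped), it can contradict other heuristics (such as “fail fast”), it reduces the time spent searching for a working strategy, and its acceptance depends entirely on context. A minimal sketch, entirely my own construction (the function and parameter names are illustrative, not from any particular library):

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.1):
    """Retry a flaky operation with exponential backoff and jitter.

    A heuristic in Koen's sense: it does not guarantee success,
    it can contradict a 'fail fast' heuristic, it reduces the time
    spent searching for a working strategy, and whether it is
    acceptable depends on context (a batch job can tolerate long
    waits; a user-facing request usually cannot).
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # heuristic exhausted; no solution guaranteed
            # Delay grows exponentially; jitter avoids thundering herds.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

Notice that nothing in the sketch is “true” in a scientific sense: the cap of five attempts, the doubling, and the jitter range are all plausible aids that work well enough within available resources.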

He contends that the epistemology of engineering is entirely based on heuristics, which contrasts starkly with the idea that it is simply the application of ‘hard science’:

Engineering has no hint of the absolute, the deterministic, the guaranteed, the true. Instead it fairly reeks of the uncertain, the provisional and the doubtful. The engineer instinctively recognizes this and calls his ad hoc method “doing the best you can with what you’ve got,” “finding a seat-of-the-pants solution,” or just “muddling through”. (Koen, 1985, p. 23).

 

State of the Art

Koen (1985) uses the term ‘sota’ (‘state of the art’) to denote a specific set of heuristics that are considered to be best practice, at a given time (p.23). The sota will change and evolve due to changes to the technological or social context, and the sota will vary depending on the field of engineering and by geo-political context. What is considered as sota in a rapidly industrializing nation such as China will be different from that in a developed western democracy.

It is impossible for engineering in any sense to be considered ‘value-free’** due to the overriding influence of context, which sets it apart from ‘science’. Koen (1985) emphasizes the primacy of context in determining the response to an engineering problem; the role of the engineer is to determine the response appropriate to the context. To the engineer there is no absolute solution; at the core of practice is selecting adequate solutions given the time and resources available. Koen proposes his Rule of Engineering:

Do what you think represents best practice at the time you must decide, and only this rule must be present (Koen, 1985, p. 42).

Koen characterizes engineering as something altogether different from ‘applied science’. Indeed he provides the following heuristic:

Heuristic: Apply science when appropriate (Koen, 1985, p. 65).

He highlights the tendency for ‘some authors […] with limited technical training’ to become mesmerized by the ‘extensive and productive use made of science by engineers’, and elevate the use of science from its status as just one of the many heuristics used by engineers. He states that ‘the thesis that engineering is applied science fails because scientific knowledge has not always been available and is not always available now, and because, even if available, is not always appropriate for use’ (Koen, 1985, p. 63).

 

The Best Solution

Koen’s position points towards a practical, pragmatic, experience-based epistemology, flexible and adaptable. Koen’s definition of ‘best’ is highly contingent: something can be the best outcome within available resources without necessarily being any good in a universal, objective sense. Koen gives the example of judging whether a Mustang or a Mercedes is the better car. Although the Mercedes may objectively be the better car, the Mustang could be considered the best solution to the given problem statement and its constraints (Koen, 1985, p. 10). Koen’s viewpoint takes ‘scientific knowledge’ as provisional, and judges it in terms of its utility in arriving at an engineering solution in the context of other available heuristics.

Koen’s discussion of how the engineer arrives at a ‘best’ solution involves trading off the utility characteristics which are to a large extent incommensurable and negotiable – engineering judgement prevails, and it is the ability to achieve a solution under constraint that lies at the heart of the engineering approach to problem solving:

Theoretically […] best for an engineer is the result of manipulating a model of society’s perceived reality, including additional subjective considerations known only to the engineer constructing the model. In essence, the engineer creates what he thinks an informed society would want based on his knowledge of what an uninformed society thinks it wants (Koen, 1985, p. 12).

 

Trade-Offs Under Constraint?

On the face of it, Koen’s approach to arriving at the best solution under constraint sounds rather similar to Erik Hollnagel’s ETTO Principle (Hollnagel, 2009); however, any similarity is superficial, as Koen and Hollnagel appear to hold very different philosophical positions. Hollnagel takes an abstract view that human action balances two commensurate criteria: being efficient or being thorough. Hollnagel proposes a principle where trade-offs are made between efficiency and thoroughness under conditions of limited time and resources, which he terms ETTO (Efficiency-Thoroughness Trade-Off) (Hollnagel, 2009, p. 16). He suggests that people ‘routinely make a choice between being effective and being thorough, since it is rarely possible to be both at the same time’ (Hollnagel, 2009, p. 15). Using the analogy of a set of scales, Hollnagel proposes that successful performance requires that efficiency and thoroughness be balanced. Excessive thoroughness leads to failure as actions are performed too late or exhaust available resources; excessive efficiency leads to failure through taking action that is either inappropriate or at the expense of safety. An excess of either will tip the scales towards failure (Hollnagel, 2009, p. 14).
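A toy illustration in code, entirely my own construction and not from Hollnagel: a deploy-time verification step can be tuned along the efficiency–thoroughness axis. Spending the whole time budget on checks delays the deploy; skipping checks buys speed at the price of assurance. All names here are hypothetical:

```python
def verify_deploy(checks, time_budget):
    """Run as many verification checks as the time budget allows.

    Each check is a (name, cost, run_fn) tuple. Running every check
    is maximally thorough but may blow the budget; skipping checks
    is efficient but risky -- an efficiency-thoroughness trade-off.
    """
    spent = 0.0
    ran, skipped = [], []
    for name, cost, run in checks:
        if spent + cost <= time_budget:
            run()          # thoroughness: actually perform the check
            spent += cost
            ran.append(name)
        else:
            skipped.append(name)  # efficiency wins; assurance is deferred
    return ran, skipped
```

The point of the sketch is that neither extreme is “correct”: where to set `time_budget` is exactly the kind of judgment the two authors theorize about differently.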

Hollnagel (2009) defines the ETTO fallacy in administrative decision making as the expectation that people will be ‘efficient and thorough at the same time – or rather to be thorough when in hindsight it was wrong to be efficient’ (p. 68). He redefines safety as the ‘ability to succeed under varying conditions’ (p. 100), and proposes that making an efficiency-thoroughness trade-off is never wrong in itself. Although Hollnagel does state that ETTOs are ‘normal and necessary’, there is an undercurrent of scientific positivism running through his book. In essence, the approximations used in ETTOs are, in his view, driven by time and resource pressures; uncertainty is a result of insufficient time and information. Putting time and resource considerations to one side, the inference is that greater thoroughness would be an effective barrier to failure: the right answer is out there if we care to be thorough enough in our actions. This, superficially, is not unlike Reason’s discussion of ‘skill based violations’ (Reason, 2008, pp. 51-52). Indeed, Hollnagel suggests (Hollnagel, 2009, pp. 141-142) that for a system to be efficient and resilient, ETTOs must be balanced by TETOs (Thoroughness-Efficiency Trade-Offs): having thoroughness in the present allows for efficiency in the future.

 

There Are No Right Answers, Only Best Answers

The engineering method as defined by Koen (recall: ‘the strategy for causing the best change in a poorly understood or uncertain situation within the available resources’ (Koen, 1985, p. 5)) superficially bears the hallmarks of an ETTO. However, Koen would argue that there is ‘no one right answer out there’, and that in effect ‘all is heuristic’ – science is essentially a succession of approximations (Koen, 2003). Hollnagel’s ETTO Principle, understood on a superficial level, is unhelpful in understanding how safety is generated in an engineering context. It relies on hindsight and outcome knowledge, and simply asks at each critical decision point (which is itself only defined with hindsight) ‘where could the engineer have been more thorough?’, on the basis that being more thorough would have brought them closer to the ‘right answer’.

If you accept, as Koen would assert, that there is no ‘right answer’, only the ‘best’ answer, then any assessment of engineering accountability reduces to a discussion of whether the engineer used a set of heuristics that were considered at the time (and place) of the decision to be ‘state of the art’, in the context of the constraints of the engineering problem faced. This ethical discussion goes beyond the agency of the individual engineer or engineering team insofar as the constraints imposed (time, materials, budget, weight…) may mean that the best is not good enough. The ‘wisdom’ to know when a problem is over-constrained and the power to change the constraints need to go hand-in-hand. This decision is confounded by the tendency for the most successful systems to be optimised at the boundary of failure: too conservative and failure will come from being uncompetitive (too heavy, too expensive, too late…); too ambitious and you may discover where the boundary between successful operation and functional failure lies.

 

And Why is All This Important…?

The view that engineering is based on the application of heuristics in the face of uncertainty provides a useful framework in which engineers can consider risk and the limitations of the methods used to assess system safety. The appearance (illusion?) of scientific rigour can blind engineers to the limitations in the ability of engineering models and abstractions to represent real systems. Overconfidence or blind acceptance of the approaches to risk management leaves the engineer open to censure for presenting society with the impression that the models used are somehow precise and comprehensive. Koen’s way of defining the Engineering Method promotes a modest epistemology: an acceptance of the fallibility of the methods used by engineers, and a healthy scepticism about what constitutes ‘scientifically proven fact’, can paradoxically enhance safety. A modest approach encourages us to err on the side of caution and think more critically about the weaknesses in our models of risk.

* Emeritus Professor of Mechanical Engineering at University of Texas at Austin.
** ‘Value-free’ in this context refers to the ideal of the Scientific Method: remaining purely objective, without ‘contaminating’ scientific inquiry with value judgements.

 

References

Hollnagel, E. (2009). The ETTO principle: efficiency-thoroughness trade-off : why things that go right sometimes go wrong. Farnham, UK: Ashgate.

Koen, B. V. (1985). Definition of the engineering method. Washington, DC: American Society for Engineering Education.
Koen, B. V. (2003). Discussion of the method: conducting the engineer’s approach to problem solving. New York, NY: Oxford University Press.

Reason, J. T. (2008). The human contribution: unsafe acts, accidents and heroic recoveries. Farnham, UK: Ashgate.

Paradigm Check Point: Prefacing Debriefings

I’m a firm believer in restating values, goals, and perspectives at the beginning of every group debriefing (e.g. “postmortem meetings”) in order to bring new folks up to speed on how we view the process and what the purpose of the debriefing is.

When I came upon a similar baselining dialogue from another domain, I thought I’d share…


  • Risk is in everything we do. Short of never doing anything, there is no way to avoid all risk or ever to be 100% safe.
  • How employees (at any level) perceive, anticipate, interpret, and react to risk is systematically connected to conditions associated with the design, systems, features, and culture of the workplace.
  • “Risk does not exist ‘out there,’ independent of our minds and culture, waiting to be measured. Human beings have invented the concept of ‘risk’ to help them understand and cope with the dangers and the uncertainties of life. Although these dangers are real, there is no such thing as a ‘real risk’ or ‘objective risk.’”*
  • The best definition of “safety” is: the reasonableness of risk. It is a feeling. It is not an absolute. It is personal and contextual and will vary between people even within identical situations.
  • While safety is an essential business practice, our agency does not exist to be safe or to protect our employees. We exist to accomplish a mission as efficiently as possible, knowing that many activities we choose to perform are inherently hazardous (for example, deployment, data migration, code commits, on-call response, editing configurations, and even powering on a device on the network).
  • Mistakes, errors, and lapses are normal and inevitable human behaviors. So are optimism and fatalism. So are taking shortcuts to save time and effort. So are under- and over-estimating risk. In spite of this, our work systems are generally designed for the optimal worker, not the normal one.
  • Essentially every risk mitigation (every safety precaution) carries some level of “cost” to production or compromise to efficiency. One of the most obvious is the cost of training. Employees at all levels (administrators, safety advisors, system designers, and front-line employees) are continuously, and often subconsciously, estimating, balancing, optimizing, managing, and accepting these subtle and nuanced tradeoffs between safety and production.
  • All successful systems, organizations, and individuals will trend toward efficiency over thoroughness (production over protection) over time until something happens (usually an accident or a close call) that changes their perception of risk. This creativity and drive for efficiency is what makes people, businesses and agencies successful.
  • Our natural intuition (our common sense) is to let outcomes draw the line between success and failure and to base safety programs on outcomes. This is shortsighted and eventually dangerous. Using the science of risk management is more potent and robust. Importantly, Risk Management is wholly concerned with managing risks, not outcomes. Risk management is counterintuitive.
  • Employees directly involved in the event did not expect that the accident was going to happen. They expected a positive outcome. If this is not the case, then you’re not dealing with an accident.

*Paul Slovic, as quoted in Daniel Kahneman, Thinking, Fast and Slow (Farrar, Straus and Giroux, 2011), p. 141.
The above is excerpted from the Facilitated Learning Analysis Implementation Guide, US Forest Service, Wildland Fire Operations.

High Tempo, High Consequence

A Time to Remember

I want you to think back to a time when you found yourself in an emergency situation at work.

Maybe it was diagnosing and trying to recover from a site outage.
Maybe it was when you were confronting the uncertain possibility of critical data loss.
Maybe it was when you and your team were responding to a targeted and malicious attack.

Maybe it was a time when, milliseconds after you triggered some action (maybe just after hitting “enter” on a command), you realized you had made a terrible mistake and kicked off destruction that cannot be undone.

Maybe it was a shocking discovery that something bad (silent data corruption, for example) had been happening for a long time without anyone knowing.

Maybe it was a time when silence descended upon your team as they tried to understand what was happening to the site, and the business. A time when you didn’t even know what was going on, never mind hypothesizing how to go about fixing it.

Think back to the time when you had to actively hold back the fears of what the news headlines or management or your board of directors were going to say about it when it was over, because you have a job to do and worrying about those things wouldn’t bring the site back up.

Think back to a time when after you’ve resolved an outage and the dust has settled, your adrenaline turns its focus to amplifying the fear that you and your team will have no idea when that will happen again in the future because you’re still uncertain how it happened in the first place.

I’ve been working in web operations for over 15 years, and I can describe in excruciating detail examples of many of those situations. Many of my peers can tell stories like those, and often do. I’m willing to bet that you too, dear reader, will find these to be familiar feelings.

Those moments are real.
The cortisol coursing through your body during those times was real.
The effect of time pressure is real.

The problems that show up when you have to make critical decisions based on incredibly incomplete information are real.

The issues that show up when having to communicate effectively across multiple team members, sometimes separated by time (as people enter and exit the response to an outage) as well as distance (connected through chat or audio/video conferencing) are all real.

The issues when coordinating who is going to do what, and when they’re going to do it, and confirming that whatever they did went well enough for someone else to do their part next, etc. are all real.

And they all are real regardless of the outcomes of the scenarios.

Comparisons

Those moments do happen in other domains. Other domains like healthcare, where nurses work in neonatal intensive care units.

Like infantry in battle.
Like ground control in a mission control organization.
Like a regional railway control center.
Like a trauma surgeon in an operating room.
Like an air traffic controller.
Like a pilot, just flying.
Like a wildland firefighting hotshot crew.
Like a ship crew.

Like a software engineer working in a high-frequency trading company.

All of those domains (and many others) have these in common:

  • They need to make decisions and take action under time pressure and with incomplete information, when the results have just as much potential to make things worse as to make things better.
  • They have to communicate a lot of information and coordinate actions between teams and team members in the shortest time possible, while also not missing critical details.
  • They all work in areas where small changes can bring about large results whose potential for surprising everyone is quite high.
  • They all work in organizations whose cultural, social, hierarchical, and decision-making norms are influenced by past successes and failures, many of which manifest in these high-tempo scenarios.

But: do the people in those domains experience those moments differently?

In other words: does a nurse or air traffic controller’s experience in those real moments differ from ours, because lives are at stake?

Do they experience more stress? Different stress?
Do they navigate alerts, alarms, and computers in more prudent or better ways than we do?
Do they have more problems with communications and coordinating response amongst multiple team members?
Are they measurably more careful in their work because of the stakes?

Are all of their decisions perfectly clear, or do they have to navigate ambiguity sometimes, just like we do?
Because there are lives to protect, is their decision-making in high-tempo scenarios different? Better?

My assertion is that high-tempo/high-consequence scenarios in the domain of Internet engineering and operations do indeed have similarities with those other domains, and that understanding all of those dynamics, pitfalls, learning opportunities, etc. is critical for the future.

All of the future.

Do these scenarios yield the same results, organizationally, in those domains as they do in web engineering and operations? Likely not. But I will add that unless we attempt to understand those similarities and differences, we’re not going to know what to learn from, and what to discard.

Hrm. Really?

Because how can we compare something like the Site Reliability Engineering team’s experience at Google.com to something like the air traffic control crew’s experience landing airplanes at Heathrow?

I have two responses to this question.

The first is that we’re assuming that the potential severity of the consequence influences the way people (and teams of people) think, act, and behave under those conditions. Research on how people behave under uncertain conditions and escalating scenarios does indeed produce generalizable findings across many domains.

The second is that in trivializing the comparison to loss of life versus non-loss of life, we can underestimate the n-order effects that the Internet can have on geopolitical, economic, and other areas that are further away from servers and network cables. We would be too reductionist in our thinking. The Internet is not just about photos of cats. It bolsters elections in emerging democracies, revolutions, and a whole host of other things that prove to be life-critical.

A View From Not Too Far Away

At the Velocity Conference in 2012, Dr. Richard Cook (an anesthesiologist and one of the most forward-thinking men I know in these areas), was interviewed after his keynote by Mac Slocum, from O’Reilly.

Mac, hoping to contrast Cook’s usual audience with Velocity’s, asked whether he saw crossover from the “safety-critical” domains to that of web operations:

Cook: “Anytime you find a world in which you have high consequences, high tempo, time pressure, and lots of complexity, semantic complexity, underlying deep complexity, and people are called upon to manage that you’re going to have these kinds of issues arise. And the general model that we have is one for systems, not for specific instances of systems. So I kind of expected that it would work…”

Mac: “…obviously failure in the health care world is different than failure in the [web operations] world. What is the right way to address failure, the appropriate way to address failure? Because obviously you shouldn’t have people in this space who are assigning the same level of importance to failure as you would?”

Cook: “You really think so?”

Mac: “Well, if a computer goes down, that’s one thing.”

Cook: “If you lose $300 to $400 million dollars, don’t you think that would buy a lot of vaccines?”

Mac: “[laughs] well, that’s true.”

Cook: “Look, the fact that it appears to be dramatic because we’re in the operating room or the intensive care unit doesn’t change the importance of what people are doing. That’s a consequence of being close to and seeing things in a dramatic fashion. But what’s happening here? This is the lifeblood of commerce. This is the core of the economic engine that we’re now experiencing. You think that’s not important?”

Mac: “So it’s ok then, to assign deep importance to this work?”

Cook: “Yeah, I think the big question will be whether or not we are actually able to conclude that healthcare’s importance measures up to the importance of web ops, not the other way around.”

Richard further mentioned in his keynote last year at New York’s Velocity that:

“…web applications have a tendency to become business critical applications, and business-critical applications have a tendency to become safety-critical systems.”

And yes, software bugs have killed people.

When I began my studies at Lund University, I was joined by practitioners in many of those domains: air traffic control, aviation, wildland fire, child welfare services, mining, oil and gas industry, submarine safety, and maritime accident investigation.

I will admit that at the first learning lab of my course, I mentioned that I felt like a bit of an outsider (or at least a cheater, getting away with failures that don’t kill people), and one of my classmates responded:

“John, why do you think that understanding these scenarios and potentially improving upon them has anything to do with body count? Do you think that our organizations are influenced more by body count than commercial and economic influences? Complex failures don’t care about how many dollars or bodies you will lose – they are equal opportunists.”

I now understand this.

So don’t be fooled into thinking that those human moments at the beginning of this post are any different in other domains, or that our responsibility to understand complex system failures is less important in web engineering and operations than it is elsewhere.