Paradigm Check Point: Prefacing Debriefings

I’m a firm believer in restating values, goals, and perspectives at the beginning of every group debriefing (e.g. “postmortem meetings”) in order to bring new folks up to speed on how we view the process and what the purpose of the debriefing is.

When I came upon a similar baselining dialogue from another domain, I thought I’d share…

  • Risk is in everything we do. Short of never doing anything, there is no way to avoid all risk or ever to be 100% safe.
  • How employees (at any level) perceive, anticipate, interpret, and react to risk is systematically connected to conditions associated with the design, systems, features, and culture of the workplace.
  • “Risk does not exist “out there,” independent of our minds and culture, waiting to be measured. Human beings have invented the concept of “risk” to help them understand and cope with the dangers and the uncertainties of life. Although these dangers are real, there is no such thing as a “real risk” or “objective risk.””*
  • The best definition of “safety” is: the reasonableness of risk. It is a feeling. It is not an absolute. It is personal and contextual and will vary between people even within identical situations.
  • While safety is an essential business practice, our agency does not exist to be safe or to protect our employees. We exist to accomplish a mission as efficiently as possible–knowing that many activities we choose to perform are inherently hazardous (for example, deployment, data migration, code commits, on-call response, editing configurations, and even powering on a device on the network).
  • Mistakes, errors, and lapses are normal and inevitable human behaviors. So are optimism and fatalism. So are taking shortcuts to save time and effort. So are under- and over-estimating risk. In spite of this, our work systems are generally designed for the optimal worker, not the normal one.
  • Essentially every risk mitigation (every safety precaution) carries some level of “cost” to production or compromise to efficiency. One of the most obvious is the cost of training. Employees at all levels (administrators, safety advisors, system designers, and front-line employees) are continuously–and often subconsciously– estimating, balancing, optimizing, managing, and accepting these subtle and nuanced tradeoffs between safety and production.
  • All successful systems, organizations, and individuals will trend toward efficiency over thoroughness (production over protection) over time until something happens (usually an accident or a close call) that changes their perception of risk. This creativity and drive for efficiency is what makes people, businesses and agencies successful.
  • Our natural intuition (our common sense) is to let outcomes draw the line between success and failure and to base safety programs on outcomes. This is shortsighted and eventually dangerous. Using the science of risk management is more potent and robust. Importantly, Risk Management is wholly concerned with managing risks, not outcomes. Risk management is counterintuitive.
  • Employees directly involved in the event did not expect that the accident was going to happen. They expected a positive outcome. If this is not the case, then you’re not dealing with an accident.

*Paul Slovic, as quoted in Daniel Kahneman, Thinking Fast and Slow (Farrar, Straus and Giroux, 2011), p141.
The above is excerpted from the Facilitated Learning Analysis Implementation Guide, US Forest Service, Wildland Fire Operations.

High Tempo, High Consequence

A Time to Remember

I want you to think back to a time when you found yourself in an emergency situation at work.

Maybe it was diagnosing and trying to recover from a site outage.
Maybe it was when you were confronting the uncertain possibility of critical data loss.
Maybe it was when you and your team were responding to a targeted and malicious attack.

Maybe it was a time when, mere milliseconds after you triggered some action (maybe you just hit “enter” after a command), you realized that you had just made a terrible mistake and inadvertently kicked off irreparable destruction.

Maybe it was a shocking discovery that something bad (silent data corruption, for example) has been happening for a long time and no one knew it was happening.

Maybe it was a time when silence descended upon your team as they tried to understand what was happening to the site, and the business. The time when you didn’t even know what was going on, forget about hypothesizing how to go about fixing it.

Think back to the time when you had to actively hold back the fears of what the news headlines or management or your board of directors were going to say about it when it was over, because you had a job to do and worrying about those things wouldn’t bring the site back up.

Think back to a time when after you’ve resolved an outage and the dust has settled, your adrenaline turns its focus to amplifying the fear that you and your team will have no idea when that will happen again in the future because you’re still uncertain how it happened in the first place.

I’ve been working in web operations for over 15 years, and I can describe in excruciating detail examples of many of those situations. Many of my peers can tell stories like those, and often do. I’m willing to bet that you too, dear reader, will find these to be familiar feelings.

Those moments are real.
The cortisol coursing through your body during those times was real.
The effect of time pressure is real.

The problems that show up when you have to make critical decisions based on incredibly incomplete information are real.

The issues that show up when having to communicate effectively across multiple team members, sometimes separated by time (as people enter and exit the response to an outage) as well as distance (connected through chat or audio/video conferencing) are all real.

The issues when coordinating who is going to do what, and when they’re going to do it, and confirming that whatever they did went well enough for someone else to do their part next, etc. are all real.

And they all are real regardless of the outcomes of the scenarios.


Those moments do happen in other domains. Other domains like healthcare, where nurses work in neonatal intensive care units.

Like infantry in battle.
Like ground control in a mission control organization.
Like a regional railway control center.
Like a trauma surgeon in an operating room.
Like an air traffic controller.
Like a pilot, just flying.
Like a wildland firefighting hotshot crew.
Like a ship crew.

Like a software engineer working in a high-frequency trading company.

All of those domains (and many others) have these in common:

  • They need to make decisions and take action under time pressure and with incomplete information, and when the results have just as much potential to make things worse as they do to make things better.
  • They have to communicate a lot of information and coordinate actions between teams and team members in the shortest time possible, while also not missing critical details.
  • They all work in areas where small changes can bring out large results whose potential for surprising everyone is quite high.
  • They all work in organizations whose cultural, social, hierarchical, and decision-making norms are influenced by past successes and failures, many of which manifest in these high-tempo scenarios.

But: do the people in those domains experience those moments differently?

In other words: does a nurse or air traffic controller’s experience in those real moments differ from ours, because lives are at stake?

Do they experience more stress? Different stress?
Do they navigate alerts, alarms, and computers in more prudent or better ways than we do?
Do they have more problems with communications and coordinating response amongst multiple team members?
Are they measurably more careful in their work because of the stakes?

Are all of their decisions perfectly clear, or do they have to navigate ambiguity sometimes, just like we do?
Because there are lives to protect, is their decision-making in high-tempo scenarios different? Better?

My assertion is that high-tempo/high-consequence scenarios in the domain of Internet engineering and operations do indeed have similarities with those other domains, and that understanding all of those dynamics, pitfalls, learning opportunities, etc. is critical for the future.

All of the future.

Do these scenarios yield the same results, organizationally, in those domains as they do in web engineering and operations? Likely not. But I will add that unless we attempt to understand those similarities and differences, we’re not going to know what to learn from, and what to discard.

Hrm. Really?

Because how can we compare something like the Site Reliability Engineer team’s experience at to something like the air traffic control crew experience landing airplanes at Heathrow?

I have two responses to this question.

The first is that we’re assuming that the potential severity of the consequence influences the way people (and teams of people) think, act, and behave under those conditions. Research on how people behave under uncertain conditions and escalating scenarios does indeed have generalizable findings across many domains.

The second is that in trivializing the comparison to loss of life versus non-loss of life, we can underestimate the n-order effects that the Internet can have on geopolitical, economic, and other areas that are further away from servers and network cables. We would be too reductionist in our thinking. The Internet is not just about photos of cats. It bolsters elections in emerging democracies, revolutions, and a whole host of other things that prove to be life-critical.

A View From Not Too Far Away

At the Velocity Conference in 2012, Dr. Richard Cook (an anesthesiologist and one of the most forward-thinking men I know in these areas) was interviewed after his keynote by Mac Slocum of O’Reilly.

Mac, hoping to contrast Cook’s normal audience to that of Velocity’s, asked about whether or not he saw crossover from the “safety-critical” domains to that of web operations:

Cook: “Anytime you find a world in which you have high consequences, high tempo, time pressure, and lots of complexity, semantic complexity, underlying deep complexity, and people are called upon to manage that you’re going to have these kinds of issues arise. And the general model that we have is one for systems, not for specific instances of systems. So I kind of expected that it would work…”

Mac: “…obviously failure in the health care world is different than failure in the [web operations] world. What is the right way to address failure, the appropriate way to address failure? Because obviously you shouldn’t have people in this space who are assigning the same level of importance to failure as you would?”

Cook: “You really think so?”

Mac: “Well, if a computer goes down, that’s one thing.”

Cook: “If you lose $300 to $400 million dollars, don’t you think that would buy a lot of vaccines?”

Mac: “[laughs] well, that’s true.”

Cook: “Look, the fact that it appears to be dramatic because we’re in the operating room or the intensive care unit doesn’t change the importance of what people are doing. That’s a consequence of being close to and seeing things in a dramatic fashion. But what’s happening here? This is the lifeblood of commerce. This is the core of the economic engine that we’re now experiencing. You think that’s not important?”

Mac: “So it’s ok then, to assign deep importance to this work?”

Cook: “Yeah, I think the big question will be whether or not we are actually able to conclude the healthcare importance measures up to the importance of web ops, not the other way around.”

Richard further mentioned in his keynote last year at New York’s Velocity that:

“…web applications have a tendency to become business critical applications, and business-critical applications have a tendency to become safety-critical systems.”

And yes, software bugs have killed people.

When I began my studies at Lund University, I was joined by practitioners in many of those domains: air traffic control, aviation, wildland fire, child welfare services, mining, oil and gas industry, submarine safety, and maritime accident investigation.

I will admit at the first learning lab of my course, I mentioned that I felt like a bit of an outsider (or at least a cheater in getting away with failures that don’t kill people) and one of my classmates responded:

“John, why do you think that understanding these scenarios and potentially improving upon them has anything to do with body count? Do you think that our organizations are influenced more by body count than commercial and economic influences? Complex failures don’t care about how many dollars or bodies you will lose – they are equal opportunists.”

I now understand this.

So don’t be fooled into thinking that those human moments at the beginning of this post are any different in other domains, or that our responsibility to understand complex system failures is less important in web engineering and operations than it is elsewhere.

Counterfactual Thinking, Rules, and The Knight Capital Accident

In between reading copious amounts of indignation surrounding whatever is suboptimal about, you may or may not have noticed the SEC statement regarding the Knight Capital accident that took place in 2012.

This Release No. 70694 is a document that contains many details about the accident, and you can read what looks like on the surface to be an in-depth analysis of what went wrong and how best to prevent such an accident from happening in the future.

You may believe this document can serve as a ‘post-mortem’ narrative. It cannot, and should not.

Any ‘after-action’ or ‘postmortem’ document (in my domain of web operations and engineering) has two main goals:

  1. To provide an explanation of how an event happened, as the organization (including those closest to the work) best understands it.
  2. To produce artifacts (recommendations, remediations, etc.) aimed at both prevention and the improvement of detection and response approaches to aid in handling similar events in the future.

You need #1 in order to work on #2. If you don’t understand how the event unfolded, you can’t make gains towards prevention in the future.

The purpose of this post is to outline how the release is not something that can or should be used for explanation or prevention.

The Release No. 70694 document does not address either of those concerns in any meaningful way.

What it does address, however, is exactly what a regulatory body is tasked to do in the wake of a known outcome: contrast how an organization was or was not in compliance with the rules that the body has put in place. Nothing more, nothing less. In this area, the document is concise and focused.

You can be forgiven for thinking that the document could serve as an explanation, because you can find some technical details in it. It looks a little bit like a timeline. What is interesting is not what details are covered, but what details are not covered, including the organizational sensemaking that is part of every complex systems failure.

If you are looking for a real postmortem of the Knight Capital accident in this post, you’re going to be disappointed. At the end of this post, I will certainly attempt to list some questions that I might pose if I was facilitating a debriefing of the event, but no real investigation can happen without the individuals closest to the work involved in the discussion.

However, I’d like to write up a bit about why it should not be viewed as what is traditionally known (at least in the web operations and engineering community) as a postmortem report. Because frankly I think that is more important than the specific event itself.

But before I do that, it’s necessary to unpack a few concepts related to learning in a retrospective way, as in a postmortem…


Learning from events in the past (both successful and unsuccessful) puts us into a funny position as humans. In a process that is genuinely interested in learning from events, we have to reconcile our need to understand with the reality that we will never get a complete picture of what has happened in the past. Regulatory bodies such as the SEC (lucky for them) don’t have to get a complete picture in order to do their job. They have only to point out the gap between how “work is prescribed” versus “work is being done” (or what Richard Cook has called “the system as imagined” versus “the system as found”).

In many circumstances (as in the case of the SEC release), what this means is to point out the things that people and organizations didn’t do in the time preceding an event. This is usually done by using “counterfactuals”, which means literally “counter the facts.”

In the language of my domain, using counterfactuals in the process of explanation and prevention is an anti-pattern, and I’ll explain why.

One of the potential pitfalls of postmortem reports (and debriefings) is that the language we use can cloud our opportunities to learn what took place and the context people (and machines!) found themselves in. Sidney Dekker says this about using counterfactuals:

“They make you spend your time talking about a reality that did not happen (but if it had happened, the mishap would not have happened).” (Dekker, 2006, p. 39)

What are examples of counterfactuals? In ordinary language, they look like:

  • “they shouldn’t have…”
  • “they could have…”
  • “they failed to…”
  • “if only they had…!”

Why are these statements woefully inappropriate for aiding explanation of what happened? Because stating what you think should have happened doesn’t explain people’s (or an organization’s) behavior. Counterfactuals serve as a massive distraction, because they bring sharply into focus what didn’t happen, when what is required for explanation is to understand why people did what they did.
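If you want a cheap way to notice this language in your own postmortem drafts, a few lines of Python can flag it for a facilitator to revisit. This is only a minimal sketch: the function name `find_counterfactuals` and the pattern list are illustrative assumptions seeded from the phrases listed above, not an established tool. A match only means "look at this sentence again"; it is not a verdict on the sentence.

```python
import re

# Hypothetical pattern list, seeded from the example phrases above.
# Extend it to suit your own team's writing habits.
COUNTERFACTUAL_PATTERNS = [
    r"\bshould(?:n't|n’t| not)? have\b",
    r"\bcould(?:n't|n’t| not)? have\b",
    r"\bfailed to\b",
    r"\bif only\b",
]

_COMPILED = [re.compile(p, re.IGNORECASE) for p in COUNTERFACTUAL_PATTERNS]


def find_counterfactuals(text):
    """Return (line_number, matched_phrase, line) tuples for each hit."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for pattern in _COMPILED:
            for match in pattern.finditer(line):
                hits.append((number, match.group(0), line.strip()))
    return hits


if __name__ == "__main__":
    # Made-up draft text, purely for illustration.
    draft = (
        "The on-call engineer restarted the queue workers at 14:02.\n"
        "They should have checked the replica lag first.\n"
        "If only the alert had fired earlier!\n"
    )
    for number, phrase, line in find_counterfactuals(draft):
        print(f"line {number}: '{phrase}' -> {line}")
```

For each flagged sentence, the useful follow-up is to ask what the person saw and was trying to do at that moment, and to rewrite the sentence to say that instead.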

People do what makes sense to them, given their focus, their goals, and what they perceive to be their environment. This is known as the local rationality principle, and it is required in order to tease out second stories, which in turn is required for learning from failure. People’s local rationality is influenced by many dynamics, and I can imagine some of these things might feel familiar to any engineers who operate in high-tempo organizations:

  • Multiple conflicting goals
    • E.g., “Deploy the new stuff, and do it quickly because our competitors may beat us! Also: take care of all of the details while you do it quickly, because one small mistake could make for a big deal!”
  • Multiple targets of attention
    • E.g., “When you deploy the new stuff, make sure you’re looking at the logs. And ignore the errors that are normally there, so you can focus on the right ones to pay attention to. Oh, and the dashboard graph of errors…pay attention to that. And the deployment process. And the system resources on each node as you deploy to them. And the network bandwidth. Also: remember, we have to get this done quickly.”

David Woods put counterfactual thinking in context with how people actually work:

“After-the-fact, based on knowledge of outcome, outsiders can identify “critical” decisions and actions that, if different, would have averted the negative outcome. Since these “critical” points are so clear to you with the benefit of hindsight, you could be tempted to think they should have been equally clear and obvious to the people involved in the incident. These people’s failure to see what is obvious now to you seems inexplicable and therefore irrational or even perverse. In fact, what seems to be irrational behavior in hindsight turns out to be quite reasonable from the point of view of the demands practitioners face and the resources they can bring to bear.” (Woods, 2010)

Dekker concurs:

“You construct a referent world from outside the accident sequence, based on data you now have access to, based on facts you now know to be true. The problem is that these after-the-fact-worlds may have very little relevance to the circumstances of the accident sequence. They do not explain the observed behavior. You have substituted your own world for the one that surrounded the people in question.” (Dekker, 2004, p.33)

“Saying what people failed to do, or implying what they could or should have done to prevent the mishap, has no role in understanding human error.”  (Dekker, 2004, p.43)

The engineers and managers at Knight Capital did not set out that morning of August 1, 2012 to lose $460 million. If they did, we’d be talking about sabotage and not human error. They did, however, set out to perform some work successfully (in this case, roll out what they needed to participate in the Retail Liquidity Program.)

If you haven’t picked up on it already, the use of counterfactuals is a manifestation of one of the most studied cognitive biases in modern psychology: The Hindsight Bias. I will leave it as an exercise to the reader to dig into that.

Outcome Bias

Cognitive biases are the greatest pitfalls in explaining surprising outcomes. The weird cousin of The Hindsight Bias is Outcome Bias. In a nutshell, it says that we are biased to “judge a past decision by its ultimate outcome instead of based on the quality of the decision at the time it was made, given what was known at that time.” (Outcome Bias, 2013)

In other words, we can be tricked into thinking that if the result of an accident is truly awful (like people dying, something crashing, or, say, losing $460 million in 45 minutes) then the decisions that led up to that outcome must have been reeeeeealllllllyyyy bad. Right?

This is a myth debunked by a few decades of social science, but it remains persistent. No decision maker has omniscience about results, so the severity of the outcome cannot be seen to be proportional to the quality of thought that went into the decisions or actions that led up to the result. Why we have this bias to begin with is yet another topic that we can explore another time.

But a possible indication that you are susceptible to The Outcome Bias is a quick thought exercise on results: if Knight Capital lost only $1,000 (or less) would you think them to be more or less prudent in their preventative measures than in the case of $460 million?

If you’re into sports, maybe this can help shed light on The Outcome Bias.


Operators (within complex systems, at least) have procedures and rules to help them achieve their goals safely. They come in many forms: checklists, guidelines, playbooks, laws, etc. There is a distinction between procedures and rules, but they have similarities when it comes to gaining understanding of the past.

First let’s talk about procedures. In the aftermath of an accident, we can (and will, in the SEC release) see many calls for “they didn’t follow procedures!” or “they didn’t even have a checklist!” This sort of statement can nicely serve as a counterfactual.

What is important to recognize is that procedures are but one resource people use to do work. If we only worked by following every rule and procedure we’ve written for ourselves, by the letter, then I suspect society would come to a halt. As an aside, “work-to-rule” is a tactic that labor organizations have used to demonstrate how onerous rules and procedures can rob people of their adaptive capacities, and therefore bring business to an effective standstill.

A few more thought exercises on procedures:

  • How easy might it be to go to your corporate wiki or intranet to find a procedure (or a step within a procedure) that was once relevant, but no longer is?
  • Do you think you can find a procedure somewhere in your group that isn’t specific enough to address every context you might use it in?
  • Can you find steps in existing procedures that feel safe to skip, especially if you’re under time pressure to get something done?
  • Part of the legal terms of using Microsoft Office is that you read and understand the End User License Agreement. You did that before checking “I agree”, right? Or did you violate that legal agreement?! (don’t worry, I won’t tell anyone)

Procedures are important for a number of reasons. They serve as institutional knowledge and guidelines for safe work. But, like wikis, they make sense to the authors of the procedure the day they wrote it. They are written to take into account all of the scenarios and contexts that the author can imagine.

But since that imagination is limited, many procedures that are thought to ensure safety are context-sensitive and they require interpretation, modification, and adaptation.

There are multiple issues with procedures as they are navigated by people who do real work. Stealing from Dekker again:

  1. “First, a mismatch between procedures and practice is not unique to accident sequences. Not following procedures does not necessarily lead to trouble, and safe outcomes may be preceded by just as (relatively) many procedural deviations as those that precede accidents (Woods et al., 1994; Snook, 2000) This turns any “findings” about accidents being preceded by procedural violation into mere tautologies…” 
  2. “Second, real work takes place in a context of limited resources and multiple goals and pressures.” 
  3. “Third, some of the safest complex, dynamic work not only occurs despite the procedures—such as aircraft line maintenance—but without procedures altogether.” The long-studied High Reliability Organizations have examples (in domains such as naval aircraft carrier operations and nuclear power generation) where procedures are eschewed, and instead replaced by less static forms of learning from practice:

    “there were no books on the integration of this new hardware into existing routines and no other place to practice it but at sea. Moreover, little of the process was written down, so that the ship in operation is the only reliable manual”. Work is “neither standardized across ships nor, in fact, written down systematically and formally anywhere”. Yet naval aircraft carriers—with inherent high-risk operations—have a remarkable safety record, like other so-called high reliability organizations (Rochlin et al., 1987; Weick, 1990; Rochlin, 1999).

  4. “Fourth, procedure-following can be antithetical to safety.” – Consider the case of the 1949 US Mann Gulch disaster, where the firefighters who perished were the ones sticking to the organizational mandate to carry their tools everywhere. Or Swissair Flight 111, where the captain and co-pilot disagreed on whether or not to follow the prescribed checklist for an emergency landing. While they argued, the plane crashed. (Dekker, 2003)

Anyone operating in high-tempo and high-consequence environments recognizes both the utility and the brittleness of a procedure, no matter how much thought went into it.

Let’s keep this idea in mind as we walk through the SEC release below.


Violation of Rules != Explanation


Now let’s talk about rules. The SEC’s job (in a nutshell) is to design, maintain, and enforce regulations of practice for various types of financially-driven organizations in the United States. Note that they are not charged with explaining or preventing events. Preventing may or may not result from their work in regulations, but prevention demands much more than abiding by rules.

Rules and regulations are similar to procedures in that they are written with deliberate but ultimately interpretable intention. Judges and juries help interpret different contexts as they relate to a given rule, law, or regulation. Rules are good for a number of reasons that are beyond the scope of this (now lengthy) post.

If we think about regulations in the context of causality, however, we can get into trouble.

Because we can find ourselves in uncertain contexts that have some of the dynamics that I listed above (multiple conflicting goals and targets of attention), regulations (even when we are acutely aware of them) pose some issues. In the Man-Made Disasters Model, Nick Pidgeon lays some of this out for us:

“Uncertainty may also arise about how to deal with formal violations of safety regulations. Violations might occur because regulations are ambiguous, in conflict with other goals such as the needs of production, or thought to be outdated because of technological advance. Alternatively safety waivers may be in operation, allowing relaxation of regulations under certain circumstances (as also occurred in the ‘Challenger’ case; see Vaughan, 1996).” (Pidgeon, 2000)

Rules and regulations need to allow for interpretation; otherwise they would be brittle in enforcement. Some vagueness and flexibility in rules is therefore desirable. We’ll see how this vagueness can be exploited for enforcement, however, at the expense of learning.

Back to the statement

Once more: the SEC document cannot be viewed as a canonical description of what happened with Knight Capital on August 1, 2012.

It can, however, be viewed as a comprehensive account of the exchange and trading regulations the SEC deems were violated by the organization. This is its purpose. My goal here is not to critique the SEC release for its purpose; it is to show that it cannot aid either explanation or prevention of the event, and so should not be used for that.

Before we walk through (at least parts of) the document, it’s worth noting that there is no objective accident investigative body that exists for electronic trading systems. In aviation, there is a regulatory body (the FAA) and an investigative body (the NTSB), and there are significant differences between the two, charter-wise and operations-wise. There exists no such independent investigative body analogous to the NTSB in Knight Capital’s industry. There is only the SEC.

The Release

I’ll have comments in italics, in blue and talk about the highlighted pieces. After getting feedback from many colleagues, I decided to keep the length here for people to dig into, because I think it’s important to understand. If you make it through this, you deserve cake.

If you want to skip the annotated and butchered SEC statement, you can just go to the summary.


The Securities and Exchange Commission (the “Commission”) deems it appropriate and in the public interest that public administrative and cease-and-desist proceedings be, and hereby are, instituted pursuant to Sections 15(b) and 21C of the Securities Exchange Act of 1934 (the “Exchange Act”) against Knight Capital Americas LLC (“Knight” or “Respondent”).


In anticipation of the institution of these proceedings, Respondent has submitted an Offer of Settlement (the “Offer”), which the Commission has determined to accept. Solely for the purpose of these proceedings and any other proceedings by or on behalf of the Commission, or to which the Commission is a party, and without admitting or denying the findings herein, except as to the Commission’s jurisdiction over it and the subject matter of these proceedings, which are admitted, Respondent consents to the entry of this Order Instituting Administrative and Cease-and-Desist Proceedings, Pursuant to Sections 15(b) and 21C of the Securities Exchange Act of 1934, Making Findings, and Imposing Remedial Sanctions and a Cease-and-Desist Order (“Order”), as set forth below:

Note: This means that Knight doesn’t have to agree or disagree with any of the statements in the document. This is expected. If it was intended to be a postmortem doc, then there would be a lot more covered here in addition to listing violations of regulations.


On the basis of this Order and Respondent’s Offer, the Commission finds that:


1. On August 1, 2012, Knight Capital Americas LLC (“Knight”) experienced a significant error in the operation of its automated routing system for equity orders, known as SMARS. While processing 212 small retail orders that Knight had received from its customers, SMARS routed millions of orders into the market over a 45-minute period, and obtained over 4 million executions in 154 stocks for more than 397 million shares. By the time that Knight stopped sending the orders, Knight had assumed a net long position in 80 stocks of approximately $3.5 billion and a net short position in 74 stocks of approximately $3.15 billion. Ultimately, Knight lost over $460 million from these unwanted positions. The subject of these proceedings is Knight’s violation of a Commission rule that requires brokers or dealers to have controls and procedures in place reasonably designed to limit the risks associated with their access to the markets, including the risks associated with automated systems and the possibility of these types of errors.

Note: Again, the purpose of the doc is to point out where Knight violated rules. It is not:

  • a description of the multiple trade-offs that engineering at Knight made or considered when designing fault-tolerance in their systems, or
  • an account of how Knight as an organization evolved over time to focus on developing some procedures and not others, or
  • an account of what engineers anticipated as they prepared to deploy support for the new RLP effort on Aug 1, 2012.

To equate any of those things with violation of a rule is a cognitive leap that we should stay very far away from.

It’s worth mentioning here that the document only focuses on failures, and makes no mention of successes. How Knight succeeded during diagnosis and response is unknown to us, so a rich source of data isn’t available. Because of this, we cannot pretend that the document gives an explanation.

2. Automated trading is an increasingly important component of the national market system. Automated trading typically occurs through or by brokers or dealers that have direct access to the national securities exchanges and other trading centers. Retail and institutional investors alike rely on these brokers, and their technology and systems, to access the markets.

3. Although automated technology brings benefits to investors, including increased execution speed and some decreased costs, automated trading also amplifies certain risks. As market participants increasingly rely on computers to make order routing and execution decisions, it is essential that compliance and risk management functions at brokers or dealers keep pace. In the absence of appropriate controls, the speed with which automated trading systems enter orders into the marketplace can turn an otherwise manageable error into an extreme event with potentially wide-spread impact.

Note: The sharp contrast between our ability to create complex and valuable automation and our ability to reason about, influence, control, and understand it in even ‘normal’ operating conditions (forget about time-pressured emergency diagnosis of a problem) is something I (and many others over the decades) have written about. The key phrase here is “keep pace”, and it’s difficult for me to argue with. This may be the most valuable statement in the document with regard to safety and the use of automation.

4. Prudent technology risk management has, at its core, quality assurance, continuous improvement, controlled testing and user acceptance, process measuring, management and control, regular and rigorous review for compliance with applicable rules and regulations and a strong and independent audit process. To ensure these basic features are present and incorporated into day-to-day operations, brokers or dealers must invest appropriate resources in their technology, compliance, and supervisory infrastructures. Recent events and Commission enforcement actions have demonstrated that this investment must be supported by an equally strong commitment to prioritize technology governance with a view toward preventing, wherever possible, software malfunctions, system errors and failures, outages or other contingencies and, when such issues arise, ensuring a prompt, effective, and risk-mitigating response. The failure by, or unwillingness of, a firm to do so can have potentially catastrophic consequences for the firm, its customers, their counterparties, investors and the marketplace.

Note: Here we have the first value statement we see in the document. It states what is “prudent” in risk management. This is reasonable for the SEC to state in a generic high-level way, given its charge: to interpret regulations. This sets the stage for showing contrast between what happened, and what the rules are, which comes later.

If this was a postmortem doc, this word should be a red flag that immediately sets your face on fire. Stating what is “prudent” is essentially imposing standards onto history. It is a declaration of what a standard of good practice looks like. The SEC does not mention Knight Capital as not prudent specifically, but they don’t have to. This is the model on which the rest of the document rests. Stating what standards of good practice look like in a document that is looked to for explanation is an anti-pattern. In aviation, this might be analogous to saying that a pilot lacked “good airmanship” and pointing at it as a cause.

The phrases “must invest appropriate resources” and “equally strong” above are both non-binary and context-sensitive. What is appropriate and equally strong gets to be defined by…whom?

  • What is “prudent”?
  • The description only says prudence demands prevention of errors, outages, and malfunctions “wherever possible.” How will you know where prevention is not possible? And following that – it would appear that you can be prudent and still not prevent errors and malfunctions.
  • Please ensure a “prompt, effective, and risk-mitigating response.” In other words: fix it correctly and fix it quickly. It’s so simple!


5. The Commission adopted Exchange Act Rule 15c3-5 in November 2010 to require that brokers or dealers, as gatekeepers to the financial markets, “appropriately control the risks associated with market access, so as not to jeopardize their own financial condition, that of other market participants, the integrity of trading on the securities markets, and the stability of the financial system.”

Note: It’s true, this is what the rule says. What is deemed “appropriate”, it would seem, is dependent on the outcome. Had an accident? It was not appropriate control. Didn’t have an accident? It must be appropriate control. This would mean that Knight Capital did have appropriate controls the day before the accident. Outcome bias reigns supreme here.

6. Subsection (b) of Rule 15c3-5 requires brokers or dealers with market access to “establish, document, and maintain a system of risk management controls and supervisory procedures reasonably designed to manage the financial, regulatory, and other risks” of having market access. The rule addresses a range of market access arrangements, including customers directing their own trading while using a broker’s market participant identifications, brokers trading for their customers as agents, and a broker-dealer’s trading activities that place its own capital at risk. Subsection (b) also requires a broker or dealer to preserve a copy of its supervisory procedures and a written description of its risk management controls as part of its books and records.

Note: The rule says, basically: “have a document about controls and risks”. It doesn’t say anything about an organization’s ability to adapt them as time and technology progress, only that at some point they were written down and shared with the right parties.

7. Subsection (c) of Rule 15c3-5 identifies specific required elements of a broker or dealer’s risk management controls and supervisory procedures. A broker or dealer must have systematic financial risk management controls and supervisory procedures that are reasonably designed to prevent the entry of erroneous orders and orders that exceed pre-set credit and capital thresholds in the aggregate for each customer and the broker or dealer. In addition, a broker or dealer must have regulatory risk management controls and supervisory procedures that are reasonably designed to ensure compliance with all regulatory requirements.

Note: This is the first of many instances of the phrase “reasonably designed” in the document. As with the word ‘appropriate’, how something is defined to be “reasonably designed” is dependent on the outcome of that design. This robs both the design and the engineer of the nuanced details that make for resilient systems. Modern technology doesn’t work or not-work. It breaks and fails in surprising (sometimes shocking) ways that were not imagined by its designers, which means that “reason” plays only a part of its quality.

Right now, all over the world, every (non-malicious) engineer is designing and building systems that they believe are “reasonably designed.” If they didn’t think so, they wouldn’t consider the work finished until they did.

Some of those systems will fail. Most will not. Many of them will fail in ways that are safe and anticipated. Some will not, and will surprise everyone.

Systems Safety researcher Erik Hollnagel has had related thoughts:

We must strive to understand that accidents don’t happen because people gamble and lose.

Accidents happen because the person believes that:
…what is about to happen is not possible,
…or what is about to happen has no connection to what they are doing,
…or that the possibility of getting the intended outcome is well worth whatever risk there is.

8. Subsection (e) of Rule 15c3-5 requires brokers or dealers with market access to establish, document, and maintain a system for regularly reviewing the effectiveness of their risk management controls and supervisory procedures. This sub-section also requires that the Chief Executive Officer (“CEO”) review and certify that the controls and procedures comply with subsections (b) and (c) of the rule. These requirements are intended to assure compliance on an ongoing basis, in part by charging senior management with responsibility to regularly review and certify the effectiveness of the controls.

Note: This takes into consideration that systems are indeed not static, and it implies that they need to evolve over time. This is important to remember for some notes later on.

9. Beginning no later than July 14, 2011, and continuing through at least August 1, 2012, Knight’s system of risk management controls and supervisory procedures was not reasonably designed to manage the risk of its market access. In addition, Knight’s internal reviews were inadequate, its annual CEO certification for 2012 was defective, and its written description of its risk management controls was insufficient. Accordingly, Knight violated Rule 15c3-5. In particular:

  1. Knight did not have controls reasonably designed to prevent the entry of erroneous orders at a point immediately prior to the submission of orders to the market by one of Knight’s equity order routers, as required under Rule 15c3-5(c)(1)(ii);
  2. Knight did not have controls reasonably designed to prevent it from entering orders for equity securities that exceeded pre-set capital thresholds for the firm, in the aggregate, as required under Rule 15c3-5(c)(1)(i). In particular, Knight failed to link accounts to firm-wide capital thresholds, and Knight relied on financial risk controls that were not capable of preventing the entry of orders;
  3. Knight did not have an adequate written description of its risk management controls as part of its books and records in a manner consistent with Rule 17a-4(e)(7) of the Exchange Act, as required by Rule 15c3-5(b);
  4. Knight also violated the requirements of Rule 15c3-5(b) because Knight did not have technology governance controls and supervisory procedures sufficient to ensure the orderly deployment of new code or to prevent the activation of code no longer intended for use in Knight’s current operations but left on its servers that were accessing the market; and Knight did not have controls and supervisory procedures reasonably designed to guide employees’ responses to significant technological and compliance incidents;
  5. Knight did not adequately review its business activity in connection with its market access to assure the overall effectiveness of its risk management controls and supervisory procedures, as required by Rule 15c3-5(e)(1); and
  6. Knight’s 2012 annual CEO certification was defective because it did not certify that Knight’s risk management controls and supervisory procedures complied with paragraphs (b) and (c) of Rule 15c3-5, as required by Rule 15c3-5(e)(2).

Note: It’s a counterfactual party! The question remains: are conditions sufficient, reasonably designed, or adequate if they don’t result in an accident like this one? Which comes first: these characterizations, or the accident? Knight Capital did believe these things were sufficient, reasonably designed, and adequate enough. Otherwise, they would have addressed them. One question necessary to answer for prevention is: “What were the sources of confidence that Knight Capital drew upon as they designed their systems?” Because improvement lies there.

10. As a result of these failures, Knight did not have a system of risk management controls and supervisory procedures reasonably designed to manage the financial, regulatory, and other risks of market access on August 1, 2012, when it experienced a significant operational failure that affected SMARS, one of the primary systems Knight uses to send orders to the market. While Knight’s technology staff worked to identify and resolve the issue, Knight remained connected to the markets and continued to send orders in certain listed securities. Knight’s failures resulted in it accumulating an unintended multi-billion dollar portfolio of securities in approximately forty-five minutes on August 1 and, ultimately, Knight lost more than $460 million, experienced net capital problems, and violated Rules 200(g) and 203(b) of Regulation SHO.

A. Respondent


11. Knight Capital Americas LLC (“Knight”) is a U.S.-based broker-dealer and a wholly-owned subsidiary of KCG Holdings, Inc. Knight was owned by Knight Capital Group, Inc. until July 1, 2013, when that entity and GETCO Holding Company, LLC combined to form KCG Holdings, Inc. Knight is registered with the Commission pursuant to Section 15 of the Exchange Act and is a Financial Industry Regulatory Authority (“FINRA”) member. Knight has its principal business operations in Jersey City, New Jersey. Throughout 2011 and 2012, Knight’s aggregate trading (both for itself and for its customers) generally represented approximately ten percent of all trading in listed U.S. equity securities. SMARS generally represented approximately one percent or more of all trading in listed U.S. equity securities.

B. August 1, 2012 and Related Events

Preparation for NYSE Retail Liquidity Program

12. To enable its customers’ participation in the Retail Liquidity Program (“RLP”) at the New York Stock Exchange, which was scheduled to commence on August 1, 2012, Knight made a number of changes to its systems and software code related to its order handling processes. These changes included developing and deploying new software code in SMARS. SMARS is an automated, high speed, algorithmic router that sends orders into the market for execution. A core function of SMARS is to receive orders passed from other components of Knight’s trading platform (“parent” orders) and then, as needed based on the available liquidity, send one or more representative (or “child”) orders to external venues for execution.

13. Upon deployment, the new RLP code in SMARS was intended to replace unused code in the relevant portion of the order router. This unused code previously had been used for functionality called “Power Peg,” which Knight had discontinued using many years earlier. Despite the lack of use, the Power Peg functionality remained present and callable at the time of the RLP deployment. The new RLP code also repurposed a flag that was formerly used to activate the Power Peg code. Knight intended to delete the Power Peg code so that when this flag was set to “yes,” the new RLP functionality—rather than Power Peg—would be engaged.

Note: Noting the intention is important in gaining understanding, because it shows effort to get into the mindset of the individual or groups involved in the work. If this introspection continued throughout the document, it would get a little closer to something like a postmortem.

Raise your hand if you can definitively state all of the active and inactive code execution paths in your application right now. Right.

14. When Knight used the Power Peg code previously, as child orders were executed, a cumulative quantity function counted the number of shares of the parent order that had been executed. This feature instructed the code to stop routing child orders after the parent order had been filled completely. In 2003, Knight ceased using the Power Peg functionality. In 2005, Knight moved the tracking of cumulative shares function in the Power Peg code to an earlier point in the SMARS code sequence. Knight did not retest the Power Peg code after moving the cumulative quantity function to determine whether Power Peg would still function correctly if called.

Note: On the surface, this looks like some technical meat to bite into. There is some detail surrounding a fault-tolerance guardrail here, something to fail “closed” in the presence of specific criteria. What’s missing? Any dialogue about why the function was moved from one place (in Power Peg) to another (earlier in SMARS). This is important because, in my experience, engineers don’t put effort into that sort of thing without motivation. If that motivation was explored, then we’d get a better sense of where the organization drew its confidence from prior to the accident. This helps us understand their local rationality. But: we don’t get that from this document.

15. Beginning on July 27, 2012, Knight deployed the new RLP code in SMARS in stages by placing it on a limited number of servers in SMARS on successive days. During the deployment of the new code, however, one of Knight’s technicians did not copy the new code to one of the eight SMARS computer servers. Knight did not have a second technician review this deployment and no one at Knight realized that the Power Peg code had not been removed from the eighth server, nor the new RLP code added. Knight had no written procedures that required such a review.

Note: Code and deployment review is a fine thing to have. But is it sufficient? Dr. Nancy Leveson explained when she was invited to speak at the SEC’s “Technology Roundtable” in October of last year that in 1992, she chaired a committee to review the code that was deployed on the Space Shuttle. She said that NASA was spending $100 million a year to maintain the code, was employing the smartest engineers in the world, and there were still found to be gaps of concern. She repeats that there is no such thing as perfect software, no matter how much effort an individual or organization makes to produce such a thing.

Do written procedures requiring a review of code or deployment guarantee safety? Of course not. But ensuring safety isn’t what the SEC is expected to do in this document. Again: they are only pointing out the differences between regulation and practice.
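One kind of feedback a deployment process could give is a comparison of what each server reports it is running against what was intended. The sketch below is purely illustrative: the server names, the build labels, and the `find_mismatched_servers` helper are all assumptions, not anything described in the SEC document. A check like this narrows one gap; it does not guarantee safety.

```python
from typing import Dict, List


def find_mismatched_servers(expected: str, deployed: Dict[str, str]) -> List[str]:
    """Return the servers whose reported build differs from the expected one."""
    return sorted(host for host, build in deployed.items() if build != expected)


# Hypothetical fleet: seven servers report the new build, one still reports the old one.
fleet = {f"smars-{n}": "rlp-build" for n in range(1, 8)}
fleet["smars-8"] = "previous-build"

mismatched = find_mismatched_servers("rlp-build", fleet)
print(mismatched)
```

Whether anyone looks at the result, and what they do when it is non-empty, is exactly the kind of detail the SEC document leaves out.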

Events of August 1, 2012

16. On August 1, Knight received orders from broker-dealers whose customers were eligible to participate in the RLP. The seven servers that received the new code processed these orders correctly. However, orders sent with the repurposed flag to the eighth server triggered the defective Power Peg code still present on that server. As a result, this server began sending child orders to certain trading centers for execution. Because the cumulative quantity function had been moved, this server continuously sent child orders, in rapid sequence, for each incoming parent order without regard to the number of share executions Knight had already received from trading centers. Although one part of Knight’s order handling system recognized that the parent orders had been filled, this information was not communicated to SMARS.

Note: So the guardrail/fail-closed mechanism wasn’t in the same place it was before, and the eighth server was allowed to continue on. As Leveson said in her testimony: “It’s not necessarily just individual component failure. In a lot of these accidents each individual component worked exactly the way it was expected to work. It surprised everyone in the interactions among the components.”
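A toy model of the mechanism described in paragraphs 14 and 16 may help. Everything here is an assumption made for illustration (the function name, the fixed child size, the artificial cap that stands in for the 45 minutes); the real SMARS code is not public. It shows only how a router that stops on a filled-quantity check behaves once that check is no longer consulted on a given path.

```python
def route_children(parent_qty: int, child_qty: int, check_filled: bool,
                   max_children: int = 1000) -> int:
    """Send child orders for one parent order and return how many were sent.

    check_filled=True models a path that consults the cumulative-quantity count.
    check_filled=False models a path where that count is no longer consulted.
    max_children is an artificial cap so that the toy loop terminates.
    """
    filled = 0
    sent = 0
    while sent < max_children:
        if check_filled and filled >= parent_qty:
            break  # parent order completely filled: stop routing
        sent += 1
        filled += child_qty  # simplifying assumption: every child executes in full
    return sent


with_check = route_children(parent_qty=500, child_qty=100, check_filled=True)
without_check = route_children(parent_qty=500, child_qty=100, check_filled=False)
print(with_check, without_check)
```

In the second call nothing is broken in the usual sense: the loop does what it was written to do, and it stops only because of the cap.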

17. The consequences of the failures were substantial. For the 212 incoming parent orders that were processed by the defective Power Peg code, SMARS sent millions of child orders, resulting in 4 million executions in 154 stocks for more than 397 million shares in approximately 45 minutes. Knight inadvertently assumed an approximately $3.5 billion net long position in 80 stocks and an approximately $3.15 billion net short position in 74 stocks. Ultimately, Knight realized a $460 million loss on these positions.

Note: Just in case you forgot, this accident was sooooo bad. These numbers are so big. Keep that in mind, dear reader, because I want you to remember that when you think about the engineer who thought he had deployed the code to the eighth server.

18. The millions of erroneous executions influenced share prices during the 45 minute period. For example, for 75 of the stocks, Knight’s executions comprised more than 20 percent of the trading volume and contributed to price moves of greater than five percent. As to 37 of those stocks, the price moved by greater than ten percent, and Knight’s executions constituted more than 50 percent of the trading volume. These share price movements affected other market participants, with some participants receiving less favorable prices than they would have in the absence of these executions and others receiving more favorable prices.

BNET Reject E-mail Messages

19. On August 1, Knight also received orders eligible for the RLP but that were designated for pre-market trading. SMARS processed these orders and, beginning at approximately 8:01 a.m. ET, an internal system at Knight generated automated e-mail messages (called “BNET rejects”) that referenced SMARS and identified an error described as “Power Peg disabled.” Knight’s system sent 97 of these e-mail messages to a group of Knight personnel before the 9:30 a.m. market open. Knight did not design these types of messages to be system alerts, and Knight personnel generally did not review them when they were received. However, these messages were sent in real time, were caused by the code deployment failure, and provided Knight with a potential opportunity to identify and fix the coding issue prior to the market open. These notifications were not acted upon before the market opened and were not used to diagnose the problem after the open.

Note: Translated, this says that system-generated warnings/alerts sent via email weren’t noticed. Using signals from automated systems (synchronous, as in “alerts”, or asynchronous, as in “email”) to perfectly detect or prevent anomalies is not a solved problem. Show me an outage, any outage, and I’ll show you warning signs that humans didn’t pick up on. The document doesn’t give any detail on why those types of messages were sent via email (as opposed to paging-style alerts), what the distribution list was for them, how those messages get generated, or any other details.

Is the number of the emails (97 of them) important? 97 sounds like a lot, doesn’t it? If it was one, and not 97, would the paragraph read differently? What if there were 10,000 messages sent? 

How many engineers right now are receiving alerts on their phone (forget about emails) that they will glance at and think that they are part of the normal levels of noise in the system, because thresholds and error handling are not always precisely tuned?
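To make the question about the count concrete, here is a small sketch of one common way repeated messages get summarized: collapsing them by signature and counting. The `summarize` helper, the message sources, and the routine mail in the inbox are assumptions; only the “Power Peg disabled” text and the figure of 97 come from the document.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Message = Tuple[str, str]  # (source, text)


def summarize(messages: Iterable[Message]) -> List[Tuple[Message, int]]:
    """Collapse messages into distinct signatures with counts, most frequent first."""
    return Counter(messages).most_common()


# Hypothetical inbox: 97 copies of one reject message among unrelated routine mail.
inbox = [("SMARS", "Power Peg disabled")] * 97 + [("batch", "nightly report ready")] * 3
summary = summarize(inbox)
print(summary)
```

A summary like this makes a burst visible, but it cannot say whether 97 is unusual for that signature. That depends on what the people receiving it have learned to treat as normal.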

C. Controls and Supervisory Procedures


20. Knight had a number of controls in place prior to the point that orders reached SMARS. In particular, Knight’s customer interface, internal order management system, and system for internally executing customer orders all contained controls concerning the prevention of the entry of erroneous orders.

21. However, Knight did not have adequate controls in SMARS to prevent the entry of erroneous orders. For example, Knight did not have sufficient controls to monitor the output from SMARS, such as a control to compare orders leaving SMARS with those that entered it. Knight also did not have procedures in place to halt SMARS’s operations in response to its own aberrant activity. Knight had a control that capped the limit price on a parent order, and therefore related child orders, at 9.5 percent below the National Best Bid (for sell orders) or above the National Best Offer (for buy orders) for the stock at the time that SMARS had received the parent order. However, this control would not prevent the entry of erroneous orders in circumstances in which the National Best Bid or Offer moved by less than 9.5 percent. Further, it did not apply to orders—such as the 212 orders described above—that Knight received before the market open and intended to send to participate in the opening auction at the primary listing exchange for the stock.

Note: Anomaly detection and error-handling criteria have two origins: the imagination of their authors and the history of surprises that have been encountered already. A significant number of thresholds, guardrails, and alerts in any technical organization are put in place only after it’s realized that they are needed. Some of these realizations come from negative events like outages, data loss, etc. and some of them come from “near-misses” or explicit re-anticipation activated by feedback that comes from real-world operation.

Even then, real-world observations don’t always produce new safeguards. How many successful trades had Knight Capital seen in its lifetime while that control allowed “the entry of erroneous orders in circumstances in which the National Best Bid or Offer moved by less than 9.5 percent”? How many successful Shuttle launches saw degradation in O-ring integrity before the Challenger accident? This ‘normalization of deviance’ (Vaughan, 1997) phenomenon is to be expected in all socio-technical organizations. Financial trading systems are no exception. History matters.
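The arithmetic of the price control in paragraph 21 can be sketched as below. The function name and the example quote are assumptions; only the 9.5 percent band and its direction come from the document. The check looks at price alone, so it has nothing to say about how many orders are sent.

```python
BAND = 0.095  # 9.5 percent, from paragraph 21


def capped_limit_price(side: str, best_bid: float, best_offer: float) -> float:
    """Return the limit-price cap for a parent order, given the quote at receipt.

    Sell orders are capped 9.5% below the best bid; buy orders 9.5% above the best offer.
    """
    if side == "sell":
        return round(best_bid * (1 - BAND), 2)
    if side == "buy":
        return round(best_offer * (1 + BAND), 2)
    raise ValueError(f"unknown side: {side!r}")


# Hypothetical quote: bid 10.00, offer 10.02.
buy_cap = capped_limit_price("buy", 10.00, 10.02)
sell_cap = capped_limit_price("sell", 10.00, 10.02)
print(buy_cap, sell_cap)
```

Any number of orders priced inside that band would pass this control, which is the circumstance the SEC paragraph describes.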

Capital Thresholds

Note: Nothing in this section had much value in explanation or prevention.

Code Development and Deployment

26. Knight did not have written code development and deployment procedures for SMARS (although other groups at Knight had written procedures), and Knight did not require a second technician to review code deployment in SMARS. Knight also did not have a written protocol concerning the accessing of unused code on its production servers, such as a protocol requiring the testing of any such code after it had been accessed to ensure that the code still functioned properly.

Note: Again, does a review guarantee safety? Does testing prevent malfunction?

Incident Response

27. On August 1, Knight did not have supervisory procedures concerning incident response. More specifically, Knight did not have supervisory procedures to guide its relevant personnel when significant issues developed. On August 1, Knight relied primarily on its technology team to attempt to identify and address the SMARS problem in a live trading environment. Knight’s system continued to send millions of child orders while its personnel attempted to identify the source of the problem. In one of its attempts to address the problem, Knight uninstalled the new RLP code from the seven servers where it had been deployed correctly. This action worsened the problem, causing additional incoming parent orders to activate the Power Peg code that was present on those servers, similar to what had already occurred on the eighth server.

Note: I would like to think that most engineering organizations that are tasked with troubleshooting issues in production systems understand that diagnosis isn’t something you can prescribe. Successful incident response in escalating scenarios is something that comes from real-world practice, not a document. Improvisation and intuition play a significant role in this, and they obviously cannot be written down beforehand.

Thought exercise: you just deployed new code to production. You become aware of an issue. Would it be surprising if one of the ways you attempt to rectify the scenario is to roll back to the last known working version? The SEC release implies that it would be.
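The interaction in paragraph 27 can be modeled in a few lines. This is a sketch under stated assumptions: the build labels and the `engaged_path` helper are hypothetical. The only facts taken from the document are that the old build still contained Power Peg, that the new build replaced it with RLP, and that incoming orders carried the repurposed flag.

```python
from typing import Dict


def engaged_path(build: str, flag_set: bool) -> str:
    """Return which functionality the repurposed flag engages on a given build."""
    if not flag_set:
        return "none"
    return "RLP" if build == "new" else "Power Peg"


def paths_by_server(builds: Dict[int, str], flag_set: bool = True) -> Dict[int, str]:
    """Map each server number to the code path its build engages."""
    return {server: engaged_path(build, flag_set) for server, build in builds.items()}


# After the staged deployment: servers 1-7 have the new build, server 8 does not.
after_deploy = {n: "new" for n in range(1, 8)}
after_deploy[8] = "old"

# Rolling the seven servers back while orders still carry the flag.
after_rollback = {n: "old" for n in range(1, 9)}

print(paths_by_server(after_deploy))
print(paths_by_server(after_rollback))
```

Seen this way, rolling back is a reasonable move that happens to interact badly with a flag whose meaning depends on which build receives it.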

D. Compliance Reviews and Written Description of Controls

Note: I’m skipping some sections here as it’s just more about compliance. 

Post-Compliance Date Reviews

32. Knight conducted periodic reviews pursuant to the WSPs. As explained above, the WSPs assigned various tasks to be performed by SCG staff in consultation with the pertinent business and technology units, with a senior member of the pertinent business unit reviewing and approving that work. These reviews did not consider whether Knight needed controls to limit the risk that SMARS could malfunction, nor did these reviews consider whether Knight needed controls concerning code deployment or unused code residing on servers. Before undertaking any evaluation of Knight’s controls, SCG, along with business and technology staff, had to spend significant time and effort identifying the missing content and correcting the inaccuracies in the written description.

33. Several previous events presented an opportunity for Knight to review the adequacy of its controls in their entirety. For example, in October 2011, Knight used test data to perform a weekend disaster recovery test. After the test concluded, Knight’s LMM desk mistakenly continued to use the test data to generate automated quotes when trading began that Monday morning. Knight experienced a nearly $7.5 million loss as a result of this event. Knight responded to the event by limiting the operation of the system to market hours, changing the control so that this system would stop providing quotes after receiving an execution, and adding an item to a disaster recovery checklist that required a check of the test data. Knight did not broadly consider whether it had sufficient controls to prevent the entry of erroneous orders, regardless of the specific system that sent the orders or the particular reason for that system’s error. Knight also did not have a mechanism to test whether their systems were relying on stale data.

Note: That we might be able to cherry-pick opportunities in the past where signs of doomsday could have (or should have) been seen and heeded is consistent with textbook definitions of hindsight bias. How organizations learn is influenced by the social and cultural dynamics of their internal structures. Again, Diane Vaughan’s writings are a place we can look to for exploring how path dependency can get us into surprising places. But again: it is not the SEC’s job to speak to that.

E. CEO Certification

34. In March 2012, Knight’s CEO signed a certification concerning Rule 15c3-5. The certification did not state that Knight’s controls and procedures complied with the rule. Instead, the certification stated that Knight had in place “processes” to comply with the rule. This drafting error was not intentional, the CEO did not notice the error, and the CEO believed at the time that he was certifying that Knight’s controls and procedures complied with the rule.

Note: This is possibly the only hint at local rationality in the document. 

F. Collateral Consequences

35. There were collateral consequences as a result of the August 1 event, including significant net capital problems. In addition, many of the millions of orders that SMARS sent on August 1 were short sale orders. Knight did not mark these orders as short sales, as required by Rule 200(g) of Regulation SHO. Similarly, Rule 203(b) of Regulation SHO prohibits a broker or dealer from accepting a short sale order in an equity security from another person, or effecting a short sale in an equity security for its own account, unless it has borrowed the security, entered into a bona-fide arrangement to borrow the security, or has reasonable grounds to believe that the security can be borrowed so that it can be delivered on the date delivery is due (known as the “locate” requirement), and has documented compliance with this requirement. Knight did not obtain a “locate” in connection with Knight’s unintended orders and did not document compliance with the requirement with respect to Knight’s unintended orders.

A. Market Access Rule: Section 15(c)(3) of the Exchange Act and Rule 15c3-5

Note: I’m going to skip a bit because it’s not much more than a restating of rules that the SEC deemed were broken….

Accordingly, pursuant to Sections 15(b) and 21C of the Exchange Act, it is hereby ORDERED that:

A. Respondent Knight cease and desist from committing or causing any violations and any future violations of Section 15(c)(3) of the Exchange Act and Rule 15c3-5 thereunder, and Rules 200(g) and 203(b) of Regulation SHO.

Note: Translated – you must stop immediately all of the things that violate rules that say you must “reasonably design” things. So don’t unreasonably design things anymore. 

The SEC document does what it needs to do: walk through the regulations that they think were violated, and talk about the settlement agreement. Knight Capital doesn’t have to admit they did anything wrong or suboptimal, and the SEC gets to tell them what to do next. That is, roughly:

  1. Hire a consultant that helps them not unreasonably design things anymore, and document that.
  2. Pay $12M to the SEC.

Under no circumstances should you take this document to be an explanation of the event or how to prevent future ones like it.

What questions remain unanswered?

Like I mentioned before, this SEC release doesn’t help explain how or why the event came to be, or make any effort towards prevention other than requiring Knight Capital to pay a settlement, hire a consultant, and write new procedures that can predict the future. I do not know anyone at Knight Capital (or at the SEC for that matter) so it’s very unlikely that I’ll gain any more awareness of accident details than you will, my dear reader.

But I can put down a few questions that I might ask if I were facilitating the debriefing of the accident, which could possibly help with gaining a systems-thinking perspective on explanation. Real prevention is left as an exercise for the readers who also work at Knight Capital.

  • The engineer who deployed the new code to support the RLP integration had confidence that all servers (not just seven of the eight) received the new code. What gave him that confidence? Was it a dashboard? Reliance on an alert? Some other sort of feedback from the deployment process?
  • The BNET Reject E-mail Messages: Have they ever been sent before? Do the recipients of them trust their validity? What is the background on their delivery being via email, versus synchronous alerting? Do they provide enough context in their content to give an engineer sufficient criteria to act on?
  • What were the signals that the responding team used to indicate that a roll-back of the code on the seven servers was a potential repairing action?
  • Did the team that was responding to the issue have solid and clear communication channels? Was it textual chat, in-person, or over voice or video conference?
  • Did the team have to improvise any new tooling to be used in the diagnosis or response?
  • What metrics did the team use to guide their actions? Were they infrastructural (such as latency, network, or CPU graphs?) or market-related data (trades, positions, etc.) or a mixture?
  • What indications were there to raise awareness that the eighth server didn’t receive the latest code? Was it a checksum or versioning? Was it logs of a deployment tool? Was it differences in the server metrics of the eighth server?
  • As the new code was rolled out: what was the team focused on? What were they seeing?
  • As they recognized there was an issue: did the symptoms look like something they had seen before?
  • As the event unfolded: did the responding team discuss what to do, or did single actors take action?
  • Regarding non-technical teams: were they involved with directing the response?
  • Many, many more questions remain that presumably (hopefully) Knight Capital has asked and answered themselves.
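
To make the checksum question above a little more concrete, here is a minimal sketch of what one such signal could look like. Nothing in the SEC release says how Knight's deployment tooling worked, so everything here is an assumption for illustration: the host names, the artifact contents, and the `find_drifted_hosts` helper are all hypothetical.

```python
import hashlib


def artifact_checksum(artifact: bytes) -> str:
    """Return the SHA-256 hex digest of a deployed artifact's bytes."""
    return hashlib.sha256(artifact).hexdigest()


def find_drifted_hosts(deployed: dict[str, bytes], expected: bytes) -> list[str]:
    """List hosts whose deployed artifact does not match the expected build."""
    want = artifact_checksum(expected)
    return sorted(
        host
        for host, artifact in deployed.items()
        if artifact_checksum(artifact) != want
    )


# Hypothetical fleet: eight hosts, one of which still carries the old build.
new_build = b"new build with RLP support"
old_build = b"previous build"
fleet = {f"host{i}": new_build for i in range(1, 8)}
fleet["host8"] = old_build

drifted = find_drifted_hosts(fleet, new_build)
print(drifted)  # ['host8']
```

A check like this only answers "are the bytes the same everywhere?". Whether anyone would have seen its output, trusted it, and acted on it is exactly the kind of thing the questions above are trying to get at.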

The Second Victim

What about the engineer who deployed the code…the one who had his hands on the actual work being done? How is he doing? Is he receiving support from his peers and colleagues? Or was he fired? The financial trading world does not exactly have a reputation for empathy, and given that there is no voice given to the people closest to the work (such as this engineer) informing the story, I can imagine that symptoms consistent with traumatic stress are likely.

Some safety-critical domains have put together structured programs to offer support to individuals who are involved with high-tempo and high-consequence work. Aviation and air traffic control have seen good success with CISM (Critical Incident Stress Management) and it’s been embraced by organizations around the world.

As web operations and financial trading systems become more and more complex, we will continue to be surprised by outcomes of what looks like “normal” work. If we do not make an effort to support those who navigate this complexity on a daily basis, we will not like the results.


  1. The SEC does not have responsibility for investigation with the goals of explanation or prevention of adverse events. Their focus is regulation.
  2. Absent a real investigation that eschews counterfactuals, puts procedures and rules into context, and encourages a narrative that holds paramount the voices of those closest to the work: we cannot draw any substantial conclusions. This means armchair accident investigation rife with indignation.

So please don’t use the SEC Release No. 70694 as a post-mortem document, because it is not.


Dekker, S. (2003). Failure to adapt or adaptations that fail: contrasting models on procedures and safety. Applied Ergonomics, 34(3), 233–238. doi:10.1016/S0003-6870(03)00031-0
Dekker, S. (2006). The Field Guide to Understanding Human Error. Ashgate Publishing, Ltd.
Outcome Bias. (n.d.). In Wikipedia. Retrieved October 28, 2013, from
Pidgeon, N., & O’Leary, M. (2000). Man-made disasters: why technology and organizations (sometimes) fail. Safety Science, 34(1), 15–30.
Vaughan, D. (2009). The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA. University of Chicago Press.
Woods, D. D., Dekker, S., Cook, R., Johannesen, L., & Sarter, N. (2010). Behind Human Error (2nd ed.). Farnham: Ashgate Pub Co.
Weick, K. E. (1993). The collapse of sensemaking in organizations. Administrative Science Quarterly, 38, 628–652.

Learning from Failure at Etsy

(This was originally posted on Code As Craft, Etsy’s engineering blog. I’m re-posting it here because it still resonates strongly as I prepare to teach a ‘postmortem facilitator’ course internally at Etsy.)

Last week, Owen Thomas wrote a flattering article over at Business Insider on how we handle errors and mistakes at Etsy. I thought I might give some detail on how that actually happens, and why.

Anyone who’s worked with technology at any scale is familiar with failure. Failure cares not about the architecture designs you slave over, the code you write and review, or the alerts and metrics you meticulously pore through.

So: failure happens. This is a foregone conclusion when working with complex systems. But what about those failures that have resulted from the actions (or lack of action, in some cases) of individuals? What do you do with those careless humans who caused everyone to have a bad day?

Maybe they should be fired.

Or maybe they need to be prevented from touching the dangerous bits again.

Or maybe they need more training.

This is the traditional view of “human error”, which focuses on the characteristics of the individuals involved. It’s what Sidney Dekker calls the “Bad Apple Theory” – get rid of the bad apples, and you’ll get rid of the human error. Seems simple, right?

We don’t take this traditional view at Etsy. We instead want to view mistakes, errors, slips, lapses, etc. with a perspective of learning. Having blameless Post-Mortems on outages and accidents is part of that.

A Blameless Post-Mortem

What does it mean to have a ‘blameless’ Post-Mortem?
Does it mean everyone gets off the hook for making mistakes? No.

Well, maybe. It depends on what “gets off the hook” means. Let me explain.

Having a Just Culture means that you’re making an effort to balance safety and accountability. It means that by investigating mistakes in a way that focuses on the situational aspects of a failure’s mechanism and the decision-making process of individuals proximate to the failure, an organization can come out safer than it would normally be if it had simply punished the actors involved as a remediation.

Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account of:

  • what actions they took at what time,
  • what effects they observed,
  • expectations they had,
  • assumptions they had made,
  • and their understanding of the timeline of events as they occurred.

…and that they can give this detailed account without fear of punishment or retribution.

Why shouldn’t they be punished or reprimanded? Because an engineer who thinks they’re going to be reprimanded is disincentivized to give the details necessary to get an understanding of the mechanism, pathology, and operation of the failure. This lack of understanding of how the accident occurred all but guarantees that it will repeat. If not with the original engineer, then with another one in the future.

We believe that this detail is paramount to improving safety at Etsy.

If we go with “blame” as the predominant approach, then we’re implicitly accepting that deterrence is how organizations become safer. This is founded in the belief that individuals, not situations, cause errors. It’s also aligned with the idea there has to be some fear that not doing one’s job correctly could lead to punishment. Because the fear of punishment will motivate people to act correctly in the future. Right?

This cycle of name/blame/shame can be looked at like this:

  1. Engineer takes action and contributes to a failure or incident.
  2. Engineer is punished, shamed, blamed, or retrained.
  3. Reduced trust between engineers on the ground (the “sharp end”) and management (the “blunt end”) looking for someone to scapegoat
  4. Engineers become silent on details about actions/situations/observations, resulting in “Cover-Your-Ass” engineering (from fear of punishment)
  5. Management becomes less aware and informed on how work is being performed day to day, and engineers become less educated on lurking or latent conditions for failure due to silence mentioned in #4, above
  6. Errors become more likely, and latent conditions can’t be identified, due to #5, above
  7. Repeat from step 1

We need to avoid this cycle. We want the engineer who has made an error to give details about why (either explicitly or implicitly) they did what they did; why the action made sense to them at the time. This is paramount to understanding the pathology of the failure. The action made sense to the person at the time they took it, because if it hadn’t made sense to them at the time, they wouldn’t have taken the action in the first place.

The fundamental idea here is something Erik Hollnagel has said:

We must strive to understand that accidents don’t happen because people gamble and lose.
Accidents happen because the person believes that:
…what is about to happen is not possible,
…or what is about to happen has no connection to what they are doing,
…or that the possibility of getting the intended outcome is well worth whatever risk there is.

A Second Story

This idea of digging deeper into the circumstance and environment that an engineer found themselves in is called looking for the “Second Story”. In Post-Mortem meetings, we want to find Second Stories to help understand what went wrong.

From Behind Human Error, here’s the difference between “first” and “second” stories of human error:

First Stories | Second Stories
Human error is seen as cause of failure | Human error is seen as the effect of systemic vulnerabilities deeper inside the organization
Saying what people should have done is a satisfying way to describe failure | Saying what people should have done doesn’t explain why it made sense for them to do what they did
Telling people to be more careful will make the problem go away | Only by constantly seeking out its vulnerabilities can organizations enhance safety

Allowing Engineers to Own Their Own Stories

A funny thing happens when engineers make mistakes and feel safe when giving details about them: they are not only willing to be held accountable, they are also enthusiastic in helping the rest of the company avoid the same error in the future. They are, after all, the most expert in their own error. They ought to be heavily involved in coming up with remediation items.

So technically, engineers are not at all “off the hook” with a blameless Post-Mortem process. They are very much on the hook for helping Etsy become safer and more resilient, in the end. And lo and behold: most engineers I know find this idea of making things better for others a worthwhile exercise.

So what do we do to enable a “Just Culture” at Etsy?

  • We encourage learning by having these blameless Post-Mortems on outages and accidents.
  • The goal is to understand how an accident could have happened, in order to better equip ourselves to prevent it from happening in the future.
  • We seek out Second Stories, gather details from multiple perspectives on failures, and we don’t punish people for making mistakes.
  • Instead of punishing engineers, we give them the requisite authority to improve safety by allowing them to give detailed accounts of their contributions to failures.
  • We enable and encourage people who do make mistakes to be the experts on educating the rest of the organization on how not to make them in the future.
  • We accept that there is always a discretionary space where humans can decide to take actions or not, and that the judgement of those decisions lies in hindsight.
  • We accept that the Hindsight Bias will continue to cloud our assessment of past events, and work hard to eliminate it.
  • We accept that the Fundamental Attribution Error is also difficult to escape, so we focus on the environment and circumstances people are working in when investigating accidents.
  • We strive to make sure that the blunt end of the organization understands how work is actually getting done (as opposed to how they imagine (or hope) it’s getting done, via Gantt charts and procedures) on the sharp end.
  • The sharp end is relied upon to inform the organization where the line is between appropriate and inappropriate behavior. This isn’t something that the blunt end can come up with on its own.

Failure happens. In order to understand how failures happen, we first have to understand our reactions to failure.

One option is to assume the single cause is incompetence and scream at engineers to make them “pay attention!” or “be more careful!”

Another option is to take a hard look at how the accident actually happened, treat the engineers involved with respect, and learn from the event.

That’s why we have blameless Post-Mortems at Etsy, and why we’re looking to create a Just Culture here.

A Mature Role for Automation: Part II

(Courtney Nash’s excellent post on this topic inadvertently pushed me to finally finish this – give it a read)

In the last post on this topic, I hoped to lay the foundation for what a mature role for automation might look like in web operations, and to bring some considerations to the decision-making involved in making automation part of a design. Like Richard mentioned in his excellent comment to that post, this is essentially a very high level overview of the past 30 years of research into the effects, benefits, and ironies of automation.

I also hoped in that post to challenge people to investigate their assumptions about automation.


  • when will automation be appropriate,
  • what problems could it help solve, and
  • how should it be designed in order to augment and complement (not simply replace) human adaptive and processing capacities.

The last point is what I’d like to explore further here. Dr. Cook also pointed out that I had skipped over entirely the concept of task allocation as an approach that didn’t end up as intended. I’m planning on exploring that a bit in this post.

But first: what is responsible for the impulse to automate that can grab us so strongly as engineers?

Is it simply the disgust we feel when we find (often in hindsight) a human-driven process that made a mistake (maybe one that contributed to an outage) that is presumed impossible for a machine to make?

It turns out that there are a number of automation ‘philosophies’, some of which you might recognize as familiar.

Philosophies and Approaches

One: The Left-Over Principle

One common way to think of automation is to gather up all of the tasks, and sort them into things that can be automated, and things that can’t be. Even the godfather of Human Factors, Alphonse Chapanis, said that it was reasonable to “mechanize everything that can be mechanized”. The main idea here is efficiency. Functions that cannot be assigned to machines are left for humans to carry out. This is known as the ‘Left-Over’ Principle.

David Woods and Erik Hollnagel have a response to this early incarnation of the “automate all the things!” approach, in Joint Cognitive Systems: Foundations of Cognitive Systems Engineering, which is (emphasis mine):

“The proviso of this argument is, however, that we should mechanise everything that can be mechanised, only in the sense that it can be guaranteed that the automation or mechanisation always will work correctly and not suddenly require operator intervention or support. Full automation should therefore be attempted only when it is possible to anticipate every possible condition and contingency. Such cases are unfortunately few and far between, and the available empirical evidence may lead to doubts whether they exist at all.


Without the proviso, the left-over principle implies a rather cavalier view of humans since it fails to include any explicit assumptions about their capabilities or limitations – other than the sanguine hope that the humans in the system are capable of doing what must be done. Implicitly this means that humans are treated as extremely flexible and powerful machines, which at any time far surpass what technological artefacts can do. Since the determination of what is left over reflects what technology cannot do rather than what people can do, the inevitable result is that humans are faced with two sets of tasks. One set comprises tasks that are either too infrequent or too expensive to automate. This will often include trivial tasks such as loading material onto a conveyor belt, sweeping the floor, or assembling products in small batches, i.e., tasks where the cost of automation is higher than the benefit. The other set comprises tasks that are too complex, too rare or too irregular to automate. This may include tasks that designers are unable to analyse or even imagine. Needless to say that may easily leave the human operator in an unenviable position.”

So to reiterate, the Left-Over Principle basically says that the things that are “left over” after automating as much as you can are either:

  1. Too “simple” to automate (economically, the benefit of automating isn’t worth the expense of automating it) because the operation is too infrequent, OR
  2. Too “difficult” to automate; the operation is too rare or irregular, and too complex to automate.
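
Taken literally, the sorting step described above might look like the sketch below. This is only an illustration of the principle as stated here, not anything from the cited authors: the `Task` fields, the cost and benefit numbers, and the example tasks are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    feasible: bool   # can anyone work out how to automate it at all?
    benefit: float   # hypothetical value of automating it (arbitrary units)
    cost: float      # hypothetical cost of automating it (same units)


def apply_left_over_principle(tasks: list[Task]) -> tuple[list[str], list[str]]:
    """Automate whatever is feasible and worth it; humans get the rest."""
    automated, left_over = [], []
    for task in tasks:
        if task.feasible and task.benefit > task.cost:
            automated.append(task.name)
        else:
            left_over.append(task.name)
    return automated, left_over


tasks = [
    Task("rotate logs", feasible=True, benefit=10.0, cost=2.0),
    Task("replace a failed disk once a year", feasible=True, benefit=1.0, cost=5.0),
    Task("diagnose a novel outage", feasible=False, benefit=50.0, cost=0.0),
]
automated, left_over = apply_left_over_principle(tasks)
print(automated)  # ['rotate logs']
print(left_over)  # ['replace a failed disk once a year', 'diagnose a novel outage']
```

Notice that nothing in the function asks what the human is actually good at, or how the two lists depend on each other.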

One critique of the Left-Over Principle is what Bainbridge points to in her second irony, which I mentioned in the last post. The tasks that are “left over” after automating everything that can be automated are the ones you can’t figure out how to automate effectively (because they are too complicated, or too infrequent to be worth it), and those are the ones you then give back to the human to deal with.

So hold on: I thought we were trying to make humans’ lives easier, not more difficult?

Giving all of the easy bits to the machine and the difficult bits to the human also has a side effect of amplifying the workload on humans in terms of cognitive load and vigilance. (It turns out that it’s relatively trivial to write code that can do a boatload of complex things quite fast.) There’s usually little consideration given to whether or not the human could effectively perform these remaining non-automated tasks in a way that will benefit the overall system, including the automated tasks.

This approach also assumes that the tasks that are now automated can be done in isolation from the tasks that can’t be, which is almost never the case. When only humans are working on tasks, even with other humans, they can stride at their own rate individually or as a group. When humans and computers work together, the pace is set by the automated part, so the human needs to keep up with the computer. This underscores the importance of considering automation in the context of humans and computers working jointly. Together. As a team, if you will.

We’ll revisit this idea later, but the idea that automation should place high priority and focus on the human-machine collaboration instead of their individual capacities is a main theme in the area of Joint Cognitive Systems, and one that I personally agree with.

[Figure: The Left-Over Principle]

Parasuraman, Sheridan, and Wickens (2000) had this to say about the Left-Over Principle (emphasis mine):

“This approach therefore defines the human operator’s roles and responsibilities in terms of the automation. Designers automate every subsystem that leads to an economic benefit for that subsystem and leave the operator to manage the rest. Technical capability or low cost are valid reasons for automation, given that there is no detrimental impact on human performance in the resulting whole system, but this is not always the case. The sum of subsystem optimizations does not typically lead to whole system optimization.”

Two: The “Compensatory” Principle

Another familiar approach (or justification) for automating processes rests on the idea that you should exploit the strengths of both humans and machines differently. The basic premise is: give the machines the tasks that they are good at, and the humans the things that they are good at.

This is called the Compensatory Principle, based on the idea that humans and machines can compensate for each other’s weaknesses. It’s also known as functional allocation, task allocation, comparison allocation, or the MABA-MABA (“Men Are Better At-Machines Are Better At”) approach.

Historically, functional allocation has been most embodied by “Fitts’ List”, which comes from a report in 1951, “Human Engineering For An Effective Air Navigation and Traffic-Control System” written by Paul Fitts and others.

Fitts’ List, which is essentially the original MABA-MABA list, juxtaposes human with machine capabilities to be used as a guide in automation design to help decide who (humans or machines) does what.

Here is Fitts’ List:

 Humans appear to surpass present-day machines with respect to the following:

  • Ability to detect small amounts of visual or acoustic energy.
  • Ability to perceive patterns of light or sound.
  • Ability to improvise and use flexible procedures.
  • Ability to store very large amounts of information for long periods and to recall relevant facts at the appropriate time.
  • Ability to reason inductively.
  • Ability to exercise judgment.

Modern-day machines (then, in the 1950s) appear to surpass humans with respect to the following:

  • Ability to respond quickly to control signals and to apply great forces smoothly and precisely.
  • Ability to perform repetitive, routine tasks.
  • Ability to store information briefly and then to erase it completely.
  • Ability to reason deductively, including computational ability.
  • Ability to handle highly complex operations, i.e., to do many different things at once.

This approach is intuitive for a number of reasons. It at least recognizes that when it comes to a certain category of tasks, humans are much superior to computers and software.

Erik Hollnagel summarized the Fitts’ List in Human Factors for Engineers:

[Figure: Summary of the Fitts List]

It does a good job of looking like a guide; it’s essentially an IF-THEN conditional on where to use automation.
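
Read that literally and you get something like the lookup below. This is a hypothetical sketch, not anything Fitts or Hollnagel wrote: the capability labels are my own shorthand for the list items above, and the `allocate` function is made up for illustration.

```python
# Shorthand labels for the items in Fitts' List (paraphrased, not quoted).
HUMANS_BETTER_AT = {
    "detection",
    "pattern perception",
    "improvisation",
    "long-term recall",
    "inductive reasoning",
    "judgment",
}
MACHINES_BETTER_AT = {
    "fast response",
    "repetition",
    "short-term storage",
    "deductive reasoning",
    "parallel operations",
}


def allocate(capability: str) -> str:
    """IF the capability is on a list THEN assign the work to that side."""
    if capability in MACHINES_BETTER_AT:
        return "machine"
    if capability in HUMANS_BETTER_AT:
        return "human"
    return "unclear"


print(allocate("repetition"))            # machine
print(allocate("judgment"))              # human
print(allocate("diagnosing an outage"))  # unclear
```

Anything that does not map onto a single entry in one of the two sets falls through to "unclear".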

So what’s not to like about this approach?

While this is a reasonable way to look at the situation, it does have some well-explored difficulties which make it basically impossible as a practical rationale.

Criticisms of the Compensatory Principle

There are a number of strong criticisms of this approach or argument for putting automation in place. The argument that I agree with most is that the work we do in engineering is never as decomposable as the list would imply. You can’t simply say “I have a lot of data analysis to do over huge amounts of data, so I’ll let the computer do that part, because that’s what it’s good at. Then it can present me the results and I can make judgements over them.” for much (if not all) of the work we do.

The systems we build have enough complexity in them that we can’t simply put tasks into these boxes or categories, because then the cost of moving between them becomes extremely high. So high that the MABA-MABA approach, as it stands, is pretty useless as a design guide. The world we’ve built around ourselves simply doesn’t fit neatly into these buckets; we move dynamically between judging and processing and calculating and reasoning and filtering and improvising.

Hollnagel unpacks it more eloquently in Joint Cognitive Systems: Foundations of Cognitive Systems Engineering:

“The compensatory approach requires that the situation characteristics can be described adequately a priori, and that the variability of human (and technological) capabilities will be minimal and perfectly predictable.”


“Yet function allocation cannot be achieved simply by substituting human functions by technology, nor vice versa, because of fundamental differences between how humans and machines function and because functions depend on each other in ways that are more complex than a mechanical decomposition can account for. Since humans and machines do not merely interact but act together, automation design should be based on principles of coagency.”


David Woods refers to Norbert’s Contrast (from Norbert Wiener’s 1950 The Human Use of Human Beings):

Norbert’s Contrast

Artificial agents are literal minded and disconnected from the world, while human agents are context sensitive and have a stake in outcomes. 

With this perspective, we can see how the work isn’t necessarily decomposable between computers and humans simply based on what each does well.

Maybe, just maybe: there’s hope in a third approach? If we were to imagine humans and machines as partners? How might we view the relationship between humans and computers through a different lens of cooperation?

That’s for the next post. 🙂

Owning Attention (Considerations for Alert Design)

In the past month or two, I’ve spoken on the topic of alert design. There’s a video of me giving the talk (at Monitorama, as well), but I thought I’d try to post on the topic and material as well.

The topic of alerts, and of “alert design” seen as a deliberate and purposeful thing to do, has been on my mind.

In my experience, and from asking many people in engineering and operations (at least in the web and financial trading domains), nothing spikes blood pressure like the topic of alerts. The caricature of the sysadmin waking up to a buzzing pager or phone is what comes to mind.

I think the costs of not paying attention to how your organization views or treats what comes of this behavior in operational teams (developers and systems folks included) are both largely invisible and much higher than most people think. It may be clear that what we’re talking about here is a signal:noise ratio, but it goes way beyond that. The cognitive cost for an engineer of attending to an alert (a fundamentally interrupting event by design) is akin to the cost of a software developer losing their “flow”; context switching is expensive. Expensive from a financial standpoint, a productivity perspective, and, I’ll argue, a career development view.

Here are some (likely melodramatic) assertions:

  • Alert numbness and fatigue are a blight on our industry. Because we can alert on basically anything, and we can argue that anything could be a harbinger of things that could drastically affect our business, we generally put an alert on everything we get our hands on.
  • Knowing something has happened almost always trumps not knowing something happened, with sometimes not much effort put into whether the “something” is important with respect to the context it’s happened in.
  • Computers deciding what is important to alert on is and will always be brittle. Meaning: alerts and their criteria originate in the author’s mind, which may or may not be in the same place as the receiver of the alert in the future. In other words: we all write documentation and procedures that make sense to us when we write them. They never survive too much of the future, because our worlds that refer to them change. Example: corporate wiki pages are commonly referred to as the place where “documentation goes to die”. Alerts are no different.

Therefore, I’d love to get a much deeper and broader conversation about alert design in our domain. Because I’ll say that it’s not the technology that sucks, it’s our use of it. Consider the possibility that you don’t have a Nagios problem, you have an alert design problem.

Down and In

As the years go by and we see the continued decline of storage prices and the explosion of accessible processing power, we have an ever-expanding ability to zoom in deeply on the ways servers and services talk to each other and process information.

We can zoom in on the relationships and behaviors of seemingly disparate pieces of data, and we can discover and detect disruptions or anomalies in sometimes surprising places. This is interesting, for sure.

But it is also woefully incomplete if we are to make any progress in technical operations.

Up and Out

It is incomplete because as we zoom out from all that high-resolution metrics collection and analysis tooling, what we find is a much-ignored environment which includes one of the most powerful, context-sensitive, and incredibly adaptive anomaly detection and response agents in the world: humans.

Do we have anomaly detection problems? Certainly. One can argue (I will) that we will always have them, for many reasons. (One of those reasons is the Law Of Stretched Systems, but that is for a different post.)

What I’m interested in is not how software can be used to detect anomalies automatically,
(well, I’m interested, but I don’t doubt that we all will continue to get better at it)

…it is how people navigate this boundary between themselves and the machines they work with. The boundary between humans and machines, as we observe our use of tools, is a focus in and of itself. If we have any hope of making progress in monitoring complex systems, we must take this boundary into account.

As an aside, some more bullet points:

  1. We don’t use a single tool to gain insight into the architectures we build. And we will not, much to the dismay of many monitoring-as-a-service business models. (“A single pane of glass?! Where do I sign?!”)
  2. Teams of people are the norm, which means that communication and coordination become as important as (if not more important than) surfacing anomalies themselves.
  3. We bring our biases, expectations, trust, and perceptions to the table when it comes to monitoring and response. No tool or piece of automation will ever change that.
  4. Understanding the breakdowns at these boundaries between people and machines should be a part of how we approach the design of tools. Organizational behavior beats technology at every turn.

Less Code, More Social Science

When we look at Boyd’s OODA loop, we see “observe” and “orient” as critical pieces. Note that these are not Unix commands, they are human activities.

So writing code to tell computers what to look at is quite different from making sure that the code’s human supervisors are equipped or aided in what to look at when an alert goes off. Figuring out how people make sense of what is actually going on at a given point (in diagnosis? in planning? in response to an outage? in control?) is just plain hard.

A step that Don Norman (and other folks known in the world of ergonomics and human factors) have been tugging at for a couple of decades is to first attempt to understand how people consume, adapt to, work around, and make use of tools under “normal” operating conditions. Once that’s done, it’s suggested, then we can try to understand how people make sense of their world under high-tempo or escalating scenarios (during an outage, for example) when the signals they receive can sometimes be disorienting as things escalate.


  • Who has ever gotten an alert and ignored it? (/me looks at alert, says “oh, it’ll probably recover, no need to look further”)
  • How many alerts were received in the past week that were not actionable? (no human action was required)
  • How many alerts were received in the past week as a result of known work being done (expected) but alerts were not silenced during that period?
  • How many alerts were received as a result of a previously silenced alert (because work was being done) that was mistakenly un-silenced?
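One way to start answering the first three of these questions is to tally them from whatever alert history you have. What follows is a minimal sketch, assuming a hypothetical record format: the `Alert` fields, the `audit` function, and the sample week are all invented for illustration and do not come from any particular alerting tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical fields; adapt to whatever your alert history exports.
    name: str
    acknowledged: bool        # did a human look at it at all?
    action_taken: bool        # did it require human action?
    during_known_work: bool   # fired while expected work was under way


def audit(alerts):
    """Tally a batch of alerts against the questions above."""
    tally = Counter()
    for alert in alerts:
        tally["total"] += 1
        if not alert.acknowledged:
            tally["ignored"] += 1
        if not alert.action_taken:
            tally["not_actionable"] += 1
        if alert.during_known_work:
            tally["during_known_work"] += 1
    return tally


# Invented sample data for one week.
week = [
    Alert("disk_full", True, True, False),
    Alert("cpu_high", False, False, False),
    Alert("replica_lag", True, False, True),
    Alert("cpu_high", False, False, False),
]
report = audit(week)
print(dict(report))
```

The counts themselves are less interesting than the conversations they start: an alert that is routinely ignored is telling you something about trust, not just about thresholds.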

Here are some quotes from engineers who have found themselves in interesting situations related to alerts:

“The whole place just lit up. I mean, all the lights came on. So instead of being able to tell you what went wrong, the lights were absolutely no help at all.”
– Comment by one space controller in mission control after the Apollo 12 spacecraft was struck by lightning (Murray and Cox 1990).


“I would have liked to have thrown away the alarm panel. It wasn’t giving us any useful information.”
– Comment by one operator at the Three Mile Island nuclear power plant to the official inquiry following the TMI accident (Kemeny 1979).


“When the alarm kept going off then we kept shutting it [the device] off [and on] and when the alarm would go off [again], we’d shut it off.”
“… so I just reset it [a device control] to a higher temperature. So I kinda fooled it [the alarm]…”
– Physicians explaining how they respond to a nuisance alarm on a computerized operating room device (Cook, Potter, Woods and McDonald 1991).


“A [computer] program alarm could be triggered by trivial problems that could be ignored altogether. Or it could be triggered by problems that called for an immediate abort [of the lunar landing]. How to decide which was which? It wasn’t enough to memorize what the program alarm numbers stood for, because even within a single number the alarm might signify many different things.


“We wrote ourselves little rules like: ‘If this alarm happens and it only happens once, don’t worry about it. If it happens repeatedly, but other indicators are okay, don’t worry about it.'” And of course, if some alarms happen even once, or if other alarms happen repeatedly and the other indicators are not okay, then they should get the LEM [lunar module] the hell out of there.
– Response to discovery of a set of computer alarms linked to the astronauts’ displays shortly before the Apollo 11 mission (Murray and Cox 1990).


“1202.” (Astronaut announcing that an alarm buzzer and light had gone off and the code 1202 was indicated on the computer display.)
“What’s a 1202?”
“1202, what’s that?”
“12…1202 alarm.”
– Mission control dialog as the LEM descended to the moon during Apollo 11 (Murray and Cox 1990).


“I know exactly what it [an alarm] is–it’s because the patient has been, hasn’t taken enough breaths or–I’m not sure exactly why.”
– Physician explaining one alarm on a computerized operating room device that commonly occurred at a particular stage of surgery (Cook et al. 1991).

These quotes are from the excellent paper The Alarm Problem and Directed Attention in Dynamic Fault Management (Woods, 1995).

David Woods writes at great length on the topic and gives great insight into what alerts and alarms essentially are: directed attention. Writing for operators of systems that are beyond our full understanding at any given point and perspective, he shines light on the core of the alarm problem: there is always context sensitivity to alerts, and in many ways the author/designer of the alert hasn’t imagined (can’t imagine!) how the receiver of the alert will interpret it.

For example: he points to signal detection theory as a framework for thinking about alert/alarm criteria. That is to say, there is always a relationship between true “signal” and “noise” and the trade-offs inherent in choosing the alerting criteria (sometimes, but not always, viewed as a simple threshold) can be thought of like this:

Signal Detection Theory

In other words, there are four outcomes that are possible that reflect how sensitive the alerting criteria can be:


SDT outcomes

So this is a tough one, and points out that getting a good (forget about perfect!) signal-to-noise ratio is hard. Too sensitive, and you’ll get too many false alarms. Not sensitive enough, and you’ll miss something.
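The four outcomes in signal detection theory are usually named hit, miss, false alarm, and correct rejection. As a rough sketch of the trade-off, assuming the simplest case of a single threshold on a single metric, the snippet below counts each outcome at two different sensitivities. The function names and the sample data are invented for illustration.

```python
def classify(alert_fired, real_problem):
    """Return which of the four signal detection outcomes occurred."""
    if alert_fired and real_problem:
        return "hit"
    if alert_fired and not real_problem:
        return "false alarm"
    if not alert_fired and real_problem:
        return "miss"
    return "correct rejection"


def outcomes(samples, threshold):
    """Count outcomes for (metric_value, real_problem) pairs at a threshold."""
    counts = {"hit": 0, "false alarm": 0, "miss": 0, "correct rejection": 0}
    for value, real_problem in samples:
        counts[classify(value >= threshold, real_problem)] += 1
    return counts


# Invented data: (metric value, whether something was really wrong).
samples = [(95, True), (88, True), (72, True), (85, False),
           (78, False), (60, False), (55, False), (40, False)]

sensitive = outcomes(samples, threshold=70)  # low threshold: fires often
relaxed = outcomes(samples, threshold=90)    # high threshold: fires rarely
print(sensitive)
print(relaxed)
```

With this made-up data, the sensitive threshold catches every real problem at the price of two false alarms, while the relaxed one raises no false alarms but misses two real problems. Neither setting makes the trade-off go away.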

I’ll say that because of this, we generally err on the side of too many false alarms. For fear of missing something (or the embarrassment of it being known that you missed something going wrong with your systems!) we will crank up the sensitivity.

But in doing so, we essentially ignore the detrimental effect of the false alarms on our engineers and organizations. Underlying the false alarms are not just limitations in the alerting algorithms themselves, but the conditions and factors that the alert systems cannot detect or interpret.

An often-given example of this manifests at the Cincinnati Airport. A riverbank leading up to a particular runway there triggers a threshold in ground proximity warning systems (in-cockpit alerts) because the system can’t detect that it’s going to plateau at the runway. Pilots familiar with this particular runway at this particular airport ignore the alerts.

Once more, with feeling: the pilots, who are flying massive cylinders of metal containing many humans, ignore a Ground Proximity Warning alert.

When we talk about how the receiver of an alert will behave, we begin to uncover the context sensitivity of an alert.

How can we take into account how someone might react when they are woken up by an alert we’ve designed? Will they shake their head, wondering what it’s all about? Are we helping them understand what might be going on, or hindering them by including only the bare minimum of data?

What about the engineer who gets an alert in a sea of alerts, while an outage is ongoing? How much attention will they give one amongst a hundred?

Something that might affect our behavior when we get an alert is the amount of trust that we have in the alert: is it telling us something we should believe? Should we drop everything we’re doing in order to pay attention to it? If not, why not?

As an example of this, take the Ground Proximity Warning System I mentioned above. Turns out that in many studies across a number of years, a majority of pilots delay reacting to a GPWS alarm, not just in Cincinnati. Why? Because they take time to validate that the alarm is actually legitimate by looking out the window. This is enough of a problem that the FAA has coined this phenomenon “delayed GPWS response syndrome“.

Trust in automation: it’s a thing that might be worth thinking closely about.

Two Views

“The critical point is that the challenge of fault management lies in sorting through an avalanche of raw data — a data overload problem. This is in contrast to the view that the performance bottleneck is the difficulty of picking up subtle early indications of a fault against the background of a quiescent monitored process.” (Woods, 1995)

The next time you set up an alert in your system, consider how you think the receiver of that alert will take it. Do you believe that your alert will save the day, providing information for someone to head off catastrophe before it’s too late? Or will it likely be discarded as noise amongst a sea of alerts as someone struggles to understand an outage?

“Information is not a scarce resource, attention is.” – Herb Simon

Herb Simon has mentioned this in many pieces of his writing, as David Woods and Emily Patterson remark in Can We Ever Escape From Data Overload? A Cognitive Systems Diagnosis.

Thus far we’ve captured that designing alerts is hard, even if we only invest effort in capturing signal, never mind providing context. Woods talks a bit more about directed attention, and about a paradox:

“Note the paradox at the heart of directed attention. Given that the supervisory agent is loaded by various other task related demands, how does one interpret information about the potential need to switch attentional focus without interrupting or interfering with the tasks or lines of reasoning already under attentional control. We can state this paradox in another way: how can one skillfully ignore a signal that should not shift attention within the current context, without first processing it — in which case it hasn’t been ignored.”

So Where Is “Design”?

“It is the expertise of the human operator that makes it possible to adapt the  performance of the joint system, in real time, to unexpected events and disturbances. Every working day, across the whole spectrum of human enterprise, a large number of near-misses are prevented from turning into accidents only because human operators intervene.

The system should therefore be designed so that human adaptation is enhanced.”

(emphasis mine) – Erik Hollnagel, Expertise and Technology: Cognition &  Human-Computer Cooperation, 1995

What if, instead of thinking about alerts and alert design in a way that underscores the mental model of a subordinate or otherwise dumb messenger delivering news to us, we viewed alerting systems as a partner? What does the world look like if we designed alerting systems to cooperate with us?
If trust in alerting systems is such a big deal, as it is with the GPWS and alert numbness, what can we learn from how humans learn to trust each other, and how can we let that influence our design decisions?

In other words: how can we design alerts that support our efforts to confirm their legitimacy, or our expectations of when an alert will fire? Is context-sensitivity part of this?

This is the type of partnership and thinking that I’m interested in. 🙂

Prevention versus Governance versus Adaptive Capacities

The other day I posted about the intersections of Systems Safety and web operations and engineering.

One of the largest proponents of bringing a systems thinking perspective to safety (specifically ‘software safety’) is Dr. Nancy Leveson, who has been in that field (really a multidisciplinary field) for at least a couple of decades. She’s the author of a super book, Engineering a Safer World (free download) that discusses this very concept.

I also mentioned the firming up (still in the public comment timeframe) of REG-SCI, which turns the ARP (automation review policy) that public trading markets must comply with into regulation (not just a recommendation or suggestion).

Without commenting too much on REG-SCI itself (I have opinions on that which I can post about at a later date), I wanted to point to a Technology Roundtable that the SEC held last October, where it invited Dr. Leveson to speak on the notion and concepts of “safe” software systems. This (presumably) laid the groundwork that went into Regulation SCI.

I clipped out her testimony, it’s about 20 minutes long, but very much worth a watch. She touches on a number of topics, but brings plain language to what organizations (both for-profit and regulatory groups, like the SEC) can expect with respect to introducing an increasing amount of technology to ‘solve’ stability issues in complex systems:

Nancy Leveson, SEC Technology Roundtable 10/2/2012 from jspaw on Vimeo.

Regulation SCI is aimed towards national securities and trading exchanges primarily. And the regulation itself is almost 400 pages long. Even if the intention is to prevent the sort of calamities such as the Flash Crash, the BATS IPO event, and the Knight Capital incident…is regulation the best (or only) way to make our systems safer?


Always a Student: Operations and Systems Safety

Anyone who has known me well knows that I’m generally not satisfied with skimming the surface of a topic that I feel excited about. So to them it wouldn’t be a surprise that I’m now working on (yes, while I’m still at Etsy!) a master’s degree.

Since January I’ve been working with an incredible group as part of the Master’s Degree Program in Human Factors and Systems Safety at Lund University. This program was initially started by Sidney Dekker, and now is directed by the wicked smart Johan Bergström, whose works I’ve tweeted about before. As a matter of fact, I was able to convince JB to keynote this year’s Velocity Conference in Santa Clara next month on the topic of risk, and I can’t be more excited for it.

So what am I all gaga about in this program?

To begin with, I’ve been a huge proponent of learning from other fields of engineering. In particular, how other domains perceive failures: with foresight, in hindsight, how they aim to prevent them, detect them, recover from them, and learn from them.

The Velocity Conference (and Surge, for that matter) are always filled with narratives of success and failure, and for that I’m grateful.

But I think for me it goes deeper than that.

We’re now in a world where the US State Department calls Twitter to request that database maintenance be postponed because of political events in the Middle East that could benefit from it being available. It’s all but given at this point that Facebook has had an enormous effect on global discourse on wide-ranging topics, many people pointing to its effects on the Arab Spring.

As we speak, REG-SCI is open for public comment from the SEC. Inside that piece of work are attempts to shore up safeguards and preventative measures that exchanges may have to employ to make themselves less vulnerable to perturbations and disturbances that can result in large-scale trading surprises like those that came with the Flash Crash, the BATS IPO event, and the Knight Capital incident.

And yes, here at Etsy we have been re-imagining how commerce is being done on a global scale as well. 🙂

  • How do we design our systems to be resilient? Are the traditional approaches still working? How will we know when they stop working?
  • How can we view the “systems” in that sentence to include the socio-technical relationship that organizations have to their service? Their employees? Their investors? The public?
  • How does the political, regulatory, or commercial environment that our services expect to live in affect their operation? Do they influence the ‘safety’ of those systems?
  • How do we manage the inevitable trade-offs that we experience when we move from a startup with a “Minimum Viable Product” to a globally-relied-upon service that is expected to always be on?
  • What are the various ways we can anticipate, monitor, respond to, and learn from our failures and our successes? 

All of these questions could be seen as technical in nature, but I’d argue that’s too simplistic. I’m interested in that beautiful and insane boundary between humans and machines, and how that relationship is intertwined in the increasingly complex systems we build over and over again.

My classmates in the program are from the US Forestry Service, air traffic control training facilities and towers, Australian mining safety, maritime accident investigation firms, healthcare and some airline pilots as well. They all have worked in high-tempo, high-consequence environments, and I’m finding even more overlap in thinking with them than I ever thought I would.

The notion that the web and internet infrastructures of tomorrow are heavily influenced by the failures of yesterday riddles me with concern and curiosity. Given that I’m now addicted to the critical thinking that many smart folks have been giving the topic for a couple of decades now, I figured that I’m not going to half-ass it, and will lean into it as hard as I can.

So expect more writing on the topics of complex systems, human factors, systems safety, Just Culture, and related meanderings, because next year I’ve got a thesis to write. 🙂

Availability: Nuance As A Service

Something that has struck me funny recently surrounds the traditional notion of availability of web applications. With respect to its relationship to revenue, to infrastructure and application behavior, and to fault protection and tolerance, I’m thinking it may be time to give the industry’s perception of the topic a broader upgrade.

These nuances in the definition and effects of availability aren’t groundbreaking. They’ve been spoken about before, but for some reason I’m not yet convinced that they’re widely known or understood.

Impact On Business

What is laid out here in this article is something that’s been parroted for decades: downtime costs companies money and value. Generally speaking, this is obviously correct, and by all means you should strive to design and operate your site with high availability and fault tolerance in mind.

But underneath the binary idea that uptime = good and downtime = bad, the reality is that there’s a lot more detail that deserves exploring.

This irritatingly-designed site has a post about a common equation to help those that are arithmetically challenged:

GR = gross yearly revenue
TH = total yearly business hours
I = percentage impact
H = number of hours of outage
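Only the variable definitions are listed here, so the sketch below assumes they combine in the form these variables usually take: estimated loss = (GR / TH) × I × H. That form is an assumption on my part, and the function name and the sample figures are invented for illustration.

```python
def blunt_outage_cost(gross_yearly_revenue, total_yearly_hours,
                      impact_fraction, outage_hours):
    """Blunt estimate: average hourly revenue x impact x outage length.

    Assumed form: (GR / TH) * I * H, with I given as a fraction (0.0 to 1.0).
    """
    revenue_per_hour = gross_yearly_revenue / total_yearly_hours
    return revenue_per_hour * impact_fraction * outage_hours


# Invented example: $87.6M a year, open all 8760 hours, full outage, 30 minutes.
estimate = blunt_outage_cost(87_600_000, 8760, 1.0, 0.5)
print(estimate)
```

Note what the arithmetic does: it treats every hour of the year as interchangeable and every dollar not earned during the outage as gone for good.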

In my mind, this is an unnecessarily blunt measure. I see the intention behind this approach, because it’s not meant to be anywhere close to being accurate. But modern web operations is now a field where gathering metrics in the hundreds of thousands per second is becoming more common-place, fault-tolerance/protection is a thing we do increasingly well, and graceful degradation techniques are the norm.

In other words: there are a lot more considerations than outage minutes = lost revenue, even if you did have a decent way to calculate it (which you don’t). Companies selling monitoring and provisioning services will want you to subscribe to this notion.

We can do better than this blunt measure, and I thought it’s worth digging in a bit deeper.


Thought experiment: if Amazon has a full and global outage for 30 minutes, how much revenue did it “lose”? Using the above rough equation, you can certainly come up with a number, let’s say N million dollars. But how accurate is N, really? Discussions that surround revenue loss are normally designed to motivate organizations to invest in availability efforts, so N only needs to be big and scary enough to provide that motivation. So let’s just say that goal has been achieved: you’re convinced! Availability is important, and you’re a firm believer that You Own Your Own Availability.

Outside of the “let this big number N convince you to invest in availability efforts” I have some questions that surround N:

  • How many potential customers did Amazon lose forever, during that outage? Meaning: they tried to get to Amazon, with some nonzero intent/probability of buying something, found it to be offline, and will never return there again, for reasons of impatience, loss of confidence, the fact that it was an impulse-to-buy click whose time has passed, etc.
  • How much revenue did Amazon lose during that 30 minute window, versus the revenue that it simply postponed when it was down, only to be executed later? In other words: upon finding the site down, users will return sometime later to do what they originally intended, which may or may not include buying something or participating in some other valuable activity.
  • How much did that 30 minutes of downtime affect the strength of the Amazon brand, in a way that could be viewed as revenue-affecting? Meaning: are users and potential users now swayed to having less confidence in Amazon because they came to the site only to be disappointed that it’s down, enough to consider alternatives the next time they would attempt to go to the site in the future?

I don’t know the answers to these questions about Amazon, but I do know that at Etsy, those answers depend on some variables:

  • the type of outage or degradation (more on that in a minute),
  • the time of day/week/year
  • how we actually calculate/forecast how those metrics would have behaved during the outage

So, let’s crack those open a bit, and see what might be inside…

Temporal Concerns

Not all time periods can be considered equal when it comes to availability, and the idea of lost revenue. For commerce sites (or really any site whose usage varies with some seasonality) this is hopefully glaringly obvious. In other words:

X minutes of full downtime during the peak hour of the peak day of the year can be worlds apart from Y minutes of full downtime during the lowest hour of the lowest day of the year, traffic-wise.

Take for example a full outage that happens during a period of the peak day of the year, and contrast it with one that happens during a lower-period of the year. Let’s say that this graph of purchases is of those 24-hour periods, indicating when the outages happen:

A Tale of Two Outages

The impact time of the outage during the lower-traffic day is actually longer than that of the peak day, affecting the precious Nines math by a decent margin. But yet: which outage would you rather have, if you had to have one of those? 🙂
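To make the contrast concrete, here is a minimal sketch that reports both the Nines-style availability and the purchases that fell inside the outage window. Everything in it is invented for illustration: the function names, the flat traffic profiles, and the numbers. It also assumes, simplistically, that every purchase expected during the window is lost.

```python
def outage_report(purchases_per_minute, start, end):
    """Compare availability with purchases missed during minutes [start, end).

    purchases_per_minute: expected purchases for each minute of one day.
    """
    minutes_down = end - start
    availability = 1 - minutes_down / len(purchases_per_minute)
    missed = sum(purchases_per_minute[start:end])
    return availability, missed


def flat_day(rate):
    """Hypothetical helper: a 1440-minute day with constant traffic."""
    return [rate] * 1440


# Invented traffic: the peak day sees 50 purchases a minute, the slow day 2.
peak_day = flat_day(50)
slow_day = flat_day(2)

peak = outage_report(peak_day, start=600, end=620)  # 20 minutes down
slow = outage_report(slow_day, start=600, end=660)  # 60 minutes down
print(peak)
print(slow)
```

In this made-up case the slow-day outage is three times longer and looks worse in the availability column, yet it covers 120 expected purchases against 1000 for the shorter peak-day outage.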

Another temporal concern is: across space and time, the distribution and volume of any level of degradation could be viewed as perfect uptime as the length of the outage approaches zero.

Dig, if you will, these two outage profiles, across a 24-hour period. The first one has many small outages across the day:

[Figure: many small outages spread across a 24-hour period]

and the other has the same amount of impact time, in a single go:

[Figure: a single outage with the same total impact time]

So here we have the same amount of time, but spread out throughout the day. Hopefully, folks will think a bit more beyond the clear “they’re both bad! don’t have outages!” and could investigate how they could be different. Some considerations in this simplified example:

  • Hour of day. Note that the single large outage is “earlier” in the day. Maybe this will affect EU or other non-US users more broadly, depending on the timezone of the original graph. Do EU users have a different expectation or tolerance for outages in a US-based company’s website?
  • Which outage scenario has a greater effect on the user population: if the ‘normal’ behavior is “get in, buy your thing, and get out” quickly, I could see the many-small-outages being preferable to the single large one. If the status quo is some mix of searching, browsing, favoriting/sharing, and then purchase, I could see the singular constrained outage being preferable.

Regardless, this underscores the idea that not all outages are created equal with respect to impact timing.


Loss of “availability” can also be seen as an extreme loss of performance. At a particular threshold, given the type of feedback to the user (a fast-failed 404 or browser error, versus a hanging white page and spinning “loading…”) the severity of an event being slow can effectively be the same as a full outage.

Some concerns/thought exercises around this:

  • Where is this latency threshold for your site, for the functionality that is critical for the business?
  • Is this threshold a cliff, or is it a continuous/predictable relationship between performance and abandonment?

There’s been much more work on performance’s effects on revenue than availability. The Velocity Conference in 2009 brought the first real production-scale numbers (in the form of a Bing/Google joint presentation as well as Shopzilla and Mozilla talks) behind how performance affects businesses, and if you haven’t read about it, please do.

Graceful Degradation

Will Amazon (or Etsy) lose sales if all or a portion of its functionality is gone (or sufficiently slow) for a period of time? Almost certainly. But that question is somewhat boring without further detail.

In many cases, modern web sites don’t simply live in an “everything works perfectly” or “nothing works at all” boolean world. (To be sure, neither does the Internet as a whole.) Instead, fault-tolerance and resilience approaches allow for features and operations to degrade under a spectrum of failure conditions. Many companies build their applications to have both in-flight fault tolerance to degrade the experience in the face of singular failures, as well as making use of “feature flags” (Martin and Jez call them “feature toggles“) which allow for specific features to be shut off if they’re causing problems.

I’m hoping that most organizations are familiar with this approach at this point. Just because user registration is broken at the moment, you don’t want to prevent  already logged-in users from using the otherwise healthy site, do you? 🙂
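As a rough illustration of both ideas together, here is a minimal sketch of a page that degrades one feature at a time. The flag store, the function names, and the failing backend are all hypothetical stand-ins, not a description of how Etsy or anyone else implements this.

```python
# Hypothetical in-memory flag store; real systems usually load flags from config.
FEATURE_FLAGS = {"favorites": True, "registration": False}


def feature_enabled(name):
    """Unknown features default to off."""
    return FEATURE_FLAGS.get(name, False)


def fetch_favorites(user_id):
    # Stand-in for a backend call that is currently failing.
    raise TimeoutError("favorites backend unavailable")


def render_listing_page(user_id):
    """Build the page, dropping optional pieces rather than failing outright."""
    page = {"listing": "ok", "checkout": "ok"}
    if feature_enabled("favorites"):
        try:
            page["favorites"] = fetch_favorites(user_id)
        except (TimeoutError, ConnectionError):
            # In-flight fault tolerance: degrade this one feature only.
            page["favorites"] = None
    if not feature_enabled("registration"):
        # Feature flag: registration has been switched off deliberately.
        page["registration"] = "temporarily unavailable"
    return page


page = render_listing_page(user_id=42)
print(page)
```

The page still renders, and checkout still works, with favorites missing and registration switched off. Whether that counts as "available" is exactly the question.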

But these graceful degradation approaches further complicate the notion of availability, as well as its impact on the business as a whole.

For example: if Etsy’s favoriting feature is not working (because the site’s architecture allows it to gracefully fail without affecting other critical functionality), but checkout is working fine…what is the result? Certainly you might pause before marking down your blunt Nines record.

You might also think: “so what? as long as people can buy things, then favoriting listings on the site shouldn’t be considered in scope of availability.”

But consider these possibilities:

  • What if Favoriting listings was a significant driver of conversions?
  • If Favoriting was a behavior that led to conversions at a rate of X%, what value should X be before ‘availability’ ought to be influenced by such a degradation?
  • What if Favoriting was technically working, but was severely degraded (see above) in performance?

Availability can be a useful metric, but when abused as a silver bullet to inform or even dictate architectural, business priority, and product decisions, there’s a real danger of oversimplifying what are really nuanced concerns.

Bounce-Back and Postponement

As I mentioned above, what is more likely for sites that have an established community or brand is that outages (even full ones) don’t mark an instantaneous amount of ‘lost’ revenue or activity. A nonzero amount of that activity is simply postponed. This is the area that I think could use a lot more data and research in the industry, much in the same way that the latency/conversion relationship has been investigated.

The over-simplified scenario involves something that looks like this. Instead of the blunt math of “X minutes of downtime = Y dollars of lost revenue”, we can be a bit more accurate, if we tried just a bit harder. The red is the outage:



So we have some more detail, which is that if we can make a reasonable forecast about what purchases would have done during the time of the outage, then we could make a better-informed estimate of purchases “lost” during that time period.

But is that actually the case?

What we see at Etsy is something different, a bit more like this:

[Figure: purchases over time, with an outage followed by a surge in activity once the site returns]

Clearly this is an oversimplification, but I think the general behavior comes across. When a site comes back from a full outage, there is an increase in the amount of activity as users who were stalled/paused in their behavior by the outage resume. My assumption is that many organizations see this behavior, but it’s just not being talked about publicly.
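If you have a forecast to compare against, the difference between "lost" and "postponed" can be estimated rather than assumed. The sketch below is one simple way to do that, not a description of how Etsy calculates it: the function name, the interval-based series, and the numbers are invented for illustration.

```python
def net_lost_purchases(forecast, actual, outage_start, outage_end):
    """Estimate purchases lost versus merely postponed.

    forecast, actual: per-interval purchase counts of equal length.
    The outage spans intervals [outage_start, outage_end).
    """
    # Shortfall: what the forecast expected during the outage but did not get.
    shortfall = sum(f - a for f, a in
                    zip(forecast[outage_start:outage_end],
                        actual[outage_start:outage_end]))
    # Bounce-back: activity above forecast after the site returns.
    bounce_back = sum(max(a - f, 0) for f, a in
                      zip(forecast[outage_end:], actual[outage_end:]))
    return shortfall, bounce_back, shortfall - bounce_back


# Invented numbers: six intervals, full outage during intervals 2 and 3.
forecast = [100, 100, 100, 100, 100, 100]
actual = [100, 100, 0, 0, 150, 120]

shortfall, bounce_back, net = net_lost_purchases(forecast, actual, 2, 4)
print(shortfall, bounce_back, net)
```

Here the blunt math would report 200 purchases lost, while the forecast comparison suggests that 70 of them came back after recovery. The estimate is only as good as the forecast, and it does not see anyone who returns the next day.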
What needs more real-world data is support for (or against) the hypothesis that, depending on:
  • Position of the outage in the daily traffic profile (start-end)
  • Position of the outage in the yearly season

the bounce-back volume will vary in a reasonably predictable fashion. Namely, as the length of the outage grows, the amount of bounce-back volume shrinks:

[Figure: bounce-back volume shrinking as outage length grows]

What this line of thinking doesn’t capture is how many of those users postponed their activity not until immediately after the outage, but maybe until the next day, because they needed to leave their computer for a meeting at work, or were leaving work to commute home.

Intention isn’t entirely straightforward to figure out, but in the cases where you have a ‘fail-over’ page that many CDNs will provide when the origin servers aren’t available, you can get some more detail about what requests (add to cart? submit payment?) came in during that time.

Regardless, availability and its effect on business metrics isn’t as easy as service providers and monitoring-as-a-service companies will have you believe. To be sure, a good amount of this investigation will vary wildly from company to company, but I think it’s well worth taking a look into.