We are too much accustomed to attribute to a single cause that which is the product of several, and the majority of our controversies come from that.
In between reading copious amounts of indignation surrounding whatever is suboptimal about healthcare.gov, you may or may not have noticed the SEC statement regarding the Knight Capital accident that took place in 2012.
Release No. 70694 contains many details about the accident, and you can read what looks on the surface like an in-depth analysis of what went wrong and how best to prevent such an accident from happening in the future.
You may believe this document can serve as a ‘post-mortem’ narrative. It cannot, and should not.
Any ‘after-action’ or ‘postmortem’ document (in my domain of web operations and engineering) has two main goals:
- To provide an explanation of how an event happened, as the organization (including those closest to the work) best understands it.
- To produce artifacts (recommendations, remediations, etc.) aimed at both prevention and the improvement of detection and response approaches to aid in handling similar events in the future.
You need #1 in order to work on #2. If you don’t understand how the event unfolded, you can’t make gains towards prevention in the future.
The purpose of this post is to outline how the release is not something that can or should be used for explanation or prevention.
The Release No. 70694 document does not address either of those concerns in any meaningful way.
What it does address, however, is exactly what a regulatory body is tasked to do in the wake of a known outcome: contrast how an organization was or was not in compliance with the rules that the body has put in place. Nothing more, nothing less. In this area, the document is concise and focused.
You can be forgiven for thinking that the document could serve as an explanation, because you can find some technical details in it. It looks a little bit like a timeline. What is interesting is not what details are covered, but what details are not covered, including the organizational sensemaking that is part of every complex systems failure.
If you are looking for a real postmortem of the Knight Capital accident in this post, you’re going to be disappointed. At the end of this post, I will attempt to list some questions that I might pose if I were facilitating a debriefing of the event, but no real investigation can happen without the individuals closest to the work involved in the discussion.
However, I’d like to write up a bit about why it should not be viewed as what is traditionally known (at least in the web operations and engineering community) as a postmortem report. Because frankly I think that is more important than the specific event itself.
But before I do that, it’s necessary to unpack a few concepts related to learning in a retrospective way, as in a postmortem…
Learning from events in the past (both successful and unsuccessful) puts us into a funny position as humans. In a process that is genuinely interested in learning from events, we have to reconcile our need to understand with the reality that we will never get a complete picture of what has happened in the past. Regulatory bodies such as the SEC (lucky for them) don’t have to get a complete picture in order to do their job. They have only to point out the gap between how “work is prescribed” versus “work is being done” (or what Richard Cook has called “the system as imagined” versus “the system as found”).
In many circumstances (as in the case of the SEC release), what this means is to point out the things that people and organizations didn’t do in the time preceding an event. This is usually done by using “counterfactuals”, which means literally “counter the facts.”
In the language of my domain, using counterfactuals in the process of explanation and prevention is an anti-pattern, and I’ll explain why.
One of the potential pitfalls of postmortem reports (and debriefings) is that the language we use can cloud our opportunities to learn what took place and the context people (and machines!) found themselves in. Sidney Dekker says this about using counterfactuals:
“They make you spend your time talking about a reality that did not happen (but if it had happened, the mishap would not have happened).” (Dekker, 2006, p. 39)
What are examples of counterfactuals? In ordinary language, they look like:
- “they shouldn’t have…”
- “they could have…”
- “they failed to…”
- “if only they had…!”
Why are these statements woefully inappropriate for aiding explanation of what happened? Because stating what you think should have happened doesn’t explain people’s (or an organization’s) behavior. Counterfactuals serve as a massive distraction, because they bring sharply into focus what didn’t happen, when what is required for explanation is to understand why people did what they did.
People do what makes sense to them, given their focus, their goals, and what they perceive to be their environment. This is known as the local rationality principle, and it is required in order to tease out second stories, which in turn is required for learning from failure. People’s local rationality is influenced by many dynamics, and I can imagine some of these things might feel familiar to any engineers who operate in high-tempo organizations:
Multiple conflicting goals
E.g., “Deploy the new stuff, and do it quickly because our competitors may beat us! Also: take care of all of the details while you do it quickly, because one small mistake could make for a big deal!”
Multiple targets of attention
E.g., “When you deploy the new stuff, make sure you’re looking at the logs. And ignore the errors that are normally there, so you can focus on the right ones to pay attention to. Oh, and the dashboard graph of errors…pay attention to that. And the deployment process. And the system resources on each node as you deploy to them. And the network bandwidth. Also: remember, we have to get this done quickly.”
David Woods put counterfactual thinking in context with how people actually work:
“After-the-fact, based on knowledge of outcome, outsiders can identify “critical” decisions and actions that, if different, would have averted the negative outcome. Since these “critical” points are so clear to you with the benefit of hindsight, you could be tempted to think they should have been equally clear and obvious to the people involved in the incident. These people’s failure to see what is obvious now to you seems inexplicable and therefore irrational or even perverse. In fact, what seems to be irrational behavior in hindsight turns out to be quite reasonable from the point of view of the demands practitioners face and the resources they can bring to bear.” (Woods, 2010)
“You construct a referent world from outside the accident sequence, based on data you now have access to, based on facts you now know to be true. The problem is that these after-the-fact-worlds may have very little relevance to the circumstances of the accident sequence. They do not explain the observed behavior. You have substituted your own world for the one that surrounded the people in question.” (Dekker, 2004, p.33)
“Saying what people failed to do, or implying what they could or should have done to prevent the mishap, has no role in understanding human error.” (Dekker, 2004, p.43)
The engineers and managers at Knight Capital did not set out that morning of August 1, 2012 to lose $460 million. If they had, we’d be talking about sabotage and not human error. They did, however, set out to perform some work successfully (in this case, roll out what they needed to participate in the Retail Liquidity Program).
If you haven’t picked up on it already, the use of counterfactuals is a manifestation of one of the most studied cognitive biases in modern psychology: The Hindsight Bias. I will leave it as an exercise to the reader to dig into that.
Cognitive biases are the greatest pitfalls in explaining surprising outcomes. The weird cousin of The Hindsight Bias is Outcome Bias. In a nutshell, it says that we are biased to “judge a past decision by its ultimate outcome instead of based on the quality of the decision at the time it was made, given what was known at that time.” (Outcome Bias, 2013)
In other words, we can be tricked into thinking that if the result of an accident is truly awful (like people dying, something crashing, or, say, losing $460 million in 20 minutes) then the decisions that led up to that outcome must have been reeeeeealllllllyyyy bad. Right?
This is a myth debunked by a few decades of social science, but it remains persistent. No decision maker has omniscience about results, so the severity of the outcome cannot be seen to be proportional to the quality of thought that went into the decisions or actions that led up to the result. Why we have this bias to begin with is yet another topic that we can explore another time.
But a possible indication that you are susceptible to The Outcome Bias is a quick thought exercise on results: if Knight Capital lost only $1,000 (or less) would you think them to be more or less prudent in their preventative measures than in the case of $460 million?
If you’re into sports, maybe this can help shed light on The Outcome Bias.
Operators (within complex systems, at least) have procedures and rules to help them achieve their goals safely. They come in many forms: checklists, guidelines, playbooks, laws, etc. There is a distinction between procedures and rules, but they have similarities when it comes to gaining understanding of the past.
First let’s talk about procedures. In the aftermath of an accident, we can (and will, in the SEC release) see many calls for “they didn’t follow procedures!” or “they didn’t even have a checklist!” This sort of statement can nicely serve as a counterfactual.
What is important to recognize is that procedures are only one resource people use to do work. If we only worked by following every rule and procedure we’ve written for ourselves, by the letter, then I suspect society would come to a halt. As an aside, “work-to-rule” is a tactic that labor organizations have used to demonstrate how onerous rules and procedures can rob people of their adaptive capacities, and therefore bring business to an effective standstill.
Some more thought exercises on procedures:
- How easy might it be to go to your corporate wiki or intranet to find a procedure (or a step within a procedure) that was once relevant, but no longer is?
- Do you think you can find a procedure somewhere in your group that isn’t specific enough to address every context you might use it in?
- Can you find steps in existing procedures that feel safe to skip, especially if you’re under time pressure to get something done?
- Part of the legal terms of using Microsoft Office is that you read and understand the End User License Agreement. You did that before checking “I agree”, right? Or did you violate that legal agreement?! (don’t worry, I won’t tell anyone)
Procedures are important for a number of reasons. They serve as institutional knowledge and guidelines for safe work. But, like wikis, they make sense to the authors of the procedure the day they wrote it. They are written to take into account all of the scenarios and contexts that the author can imagine.
But since that imagination is limited, many procedures that are thought to ensure safety are context-sensitive and they require interpretation, modification, and adaptation.
There are multiple issues with procedures as they are navigated by people who do real work. Stealing from Dekker again:
- “First, a mismatch between procedures and practice is not unique to accident sequences. Not following procedures does not necessarily lead to trouble, and safe outcomes may be preceded by just as (relatively) many procedural deviations as those that precede accidents (Woods et al., 1994; Snook, 2000) This turns any “findings” about accidents being preceded by procedural violation into mere tautologies…”
- “Second, real work takes place in a context of limited resources and multiple goals and pressures.”
- “Third, some of the safest complex, dynamic work not only occurs despite the procedures—such as aircraft line maintenance—but without procedures altogether.” The long-studied High Reliability Organizations have examples (in domains such as naval aircraft carrier operations and nuclear power generation) where procedures are eschewed, and instead replaced by less static forms of learning from practice:
‘‘there were no books on the integration of this new hardware into existing routines and no other place to practice it but at sea. Moreover, little of the process was written down, so that the ship in operation is the only reliable manual’’. Work is ‘‘neither standardized across ships nor, in fact, written down systematically and formally anywhere’’. Yet naval aircraft carriers—with inherent high-risk operations—have a remarkable safety record, like other so-called high reliability organizations (Rochlin et al., 1987; Weick, 1990; Rochlin, 1999).
- “Fourth, procedure-following can be antithetical to safety.” – Consider the case of the 1949 US Mann Gulch disaster where firefighters who perished were the ones sticking to the organizational mandate to carry their tools everywhere. Or Swissair Flight 111, when the captain and co-pilot of an aircraft disagreed on whether or not to follow the prescribed checklist for an emergency landing. While they argued, the plane crashed. (Dekker, 2003)
Anyone operating in high-tempo and high-consequence environments recognizes both the utility and the brittleness of a procedure, no matter how much thought went into it.
Let’s keep this idea in mind as we walk through the SEC release below.
Violation of Rules != Explanation
Now let’s talk about rules. The SEC’s job (in a nutshell) is to design, maintain, and enforce regulations of practice for various types of financially-driven organizations in the United States. Note that they are not charged with explaining or preventing events. Preventing may or may not result from their work in regulations, but prevention demands much more than abiding by rules.
Rules and regulations are similar to procedures in that they are written with deliberate but ultimately interpretable intention. Judges and juries help interpret different contexts as they relate to a given rule, law, or regulation. Rules are good for a number of reasons that are beyond the scope of this (now lengthy) post.
If we think about regulations in the context of causality, however, we can get into trouble.
Because we can find ourselves in uncertain contexts that have some of the dynamics that I listed above (multiple conflicting goals and targets of attention), regulations pose some issues even when we are acutely aware of them. In the Man-Made Disasters Model, Nick Pidgeon lays some of this out for us:
“Uncertainty may also arise about how to deal with formal violations of safety regulations. Violations might occur because regulations are ambiguous, in conflict with other goals such as the needs of production, or thought to be outdated because of technological advance. Alternatively safety waivers may be in operation, allowing relaxation of regulations under certain circumstances (as also occurred in the `Challenger’ case; see Vaughan, 1996).” (Pidgeon, 2000)
Rules and regulations need to allow for interpretation, otherwise they would be brittle in enforcement. Vagueness and flexibility in rules are therefore desirable. We’ll see how this vagueness can be exploited for enforcement, however, at the expense of learning.
Back to the statement
Once more: the SEC document cannot be viewed as a canonical description of what happened with Knight Capital on August 1, 2012.
It can, however, be viewed as a comprehensive account of the exchange and trading regulations the SEC deems were violated by the organization. This is its purpose. My goal here is not to critique the SEC release for its purpose; it is to reveal how it cannot be seen to aid either explanation or prevention of the event, and so should not be used for that.
Before we walk through (at least parts of) the document, it’s worth noting that there is no objective accident investigative body that exists for electronic trading systems. In aviation, there is a regulatory body (the FAA) and an investigative body (the NTSB), and there are significant differences between the two, charter-wise and operations-wise. There exists no such independent investigative body analogous to the NTSB in Knight Capital’s industry. There is only the SEC.
I’ll have comments in italics, in blue and talk about the highlighted pieces. After getting feedback from many colleagues, I decided to keep the length here for people to dig into, because I think it’s important to understand. If you make it through this, you deserve cake.
The Securities and Exchange Commission (the “Commission”) deems it appropriate and in the public interest that public administrative and cease-and-desist proceedings be, and hereby are, instituted pursuant to Sections 15(b) and 21C of the Securities Exchange Act of 1934 (the “Exchange Act”) against Knight Capital Americas LLC (“Knight” or “Respondent”).
In anticipation of the institution of these proceedings, Respondent has submitted an Offer of Settlement (the “Offer”), which the Commission has determined to accept. Solely for the purpose of these proceedings and any other proceedings by or on behalf of the Commission, or to which the Commission is a party, and without admitting or denying the findings herein, except as to the Commission’s jurisdiction over it and the subject matter of these proceedings, which are admitted, Respondent consents to the entry of this Order Instituting Administrative and Cease-and-Desist Proceedings, Pursuant to Sections 15(b) and 21C of the Securities Exchange Act of 1934, Making Findings, and Imposing Remedial Sanctions and a Cease-and-Desist Order (“Order”), as set forth below:
Note: This means that Knight doesn’t have to agree or disagree with any of the statements in the document. This is expected. If it was intended to be a postmortem doc, then there would be a lot more covered here in addition to listing violations of regulations.
1. On August 1, 2012, Knight Capital Americas LLC (“Knight”) experienced a significant error in the operation of its automated routing system for equity orders, known as SMARS. While processing 212 small retail orders that Knight had received from its customers, SMARS routed millions of orders into the market over a 45-minute period, and obtained over 4 million executions in 154 stocks for more than 397 million shares. By the time that Knight stopped sending the orders, Knight had assumed a net long position in 80 stocks of approximately $3.5 billion and a net short position in 74 stocks of approximately $3.15 billion. Ultimately, Knight lost over $460 million from these unwanted positions. The subject of these proceedings is Knight’s violation of a Commission rule that requires brokers or dealers to have controls and procedures in place reasonably designed to limit the risks associated with their access to the markets, including the risks associated with automated systems and the possibility of these types of errors.
Note: Again, the purpose of the doc is to point out where Knight violated rules. It is not:
- a description of the multiple trade-offs that engineering at Knight made or considered when designing fault-tolerance in their systems, or
- how Knight as an organization evolved over time to focus on evolving some procedures and not others, or
- what engineers anticipated as they prepared to deploy support for the new RLP effort on Aug 1, 2012.
To equate any of those things with violation of a rule is a cognitive leap that we should stay very far away from.
It’s worth mentioning here that the document only focuses on failures, and makes no mention of successes. How Knight succeeded during diagnosis and response is unknown to us, so a rich source of data isn’t available. Because of this, we cannot pretend that the document gives an explanation.
2. Automated trading is an increasingly important component of the national market system. Automated trading typically occurs through or by brokers or dealers that have direct access to the national securities exchanges and other trading centers. Retail and institutional investors alike rely on these brokers, and their technology and systems, to access the markets.
3. Although automated technology brings benefits to investors, including increased execution speed and some decreased costs, automated trading also amplifies certain risks. As market participants increasingly rely on computers to make order routing and execution decisions, it is essential that compliance and risk management functions at brokers or dealers keep pace. In the absence of appropriate controls, the speed with which automated trading systems enter orders into the marketplace can turn an otherwise manageable error into an extreme event with potentially wide-spread impact.
Note: The sharp contrast between our ability to create complex and valuable automation and our ability to reason about, influence, control, and understand it in even ‘normal’ operating conditions (forget about time-pressured emergency diagnosis of a problem) is something I (and many others over the decades) have written about. The key phrase here is “keep pace”, and it’s difficult for me to argue with. This may be the most valuable statement in the document with regards to safety and the use of automation.
4. Prudent technology risk management has, at its core, quality assurance, continuous improvement, controlled testing and user acceptance, process measuring, management and control, regular and rigorous review for compliance with applicable rules and regulations and a strong and independent audit process. To ensure these basic features are present and incorporated into day-to-day operations, brokers or dealers must invest appropriate resources in their technology, compliance, and supervisory infrastructures. Recent events and Commission enforcement actions have demonstrated that this investment must be supported by an equally strong commitment to prioritize technology governance with a view toward preventing, wherever possible, software malfunctions, system errors and failures, outages or other contingencies and, when such issues arise, ensuring a prompt, effective, and risk-mitigating response. The failure by, or unwillingness of, a firm to do so can have potentially catastrophic consequences for the firm, its customers, their counterparties, investors and the marketplace.
Note: Here we have the first value statement we see in the document. It states what is “prudent” in risk management. This is reasonable for the SEC to state in a generic high-level way, given its charge: to interpret regulations. This sets the stage for showing contrast between what happened, and what the rules are, which comes later.
If this were a postmortem doc, this word should be a red flag that immediately sets your face on fire. Stating what is “prudent” is essentially imposing standards onto history. It is a declaration of what a standard of good practice looks like. The SEC does not mention Knight Capital as not prudent specifically, but they don’t have to. This is the model on which the rest of the document rests. Stating what standards of good practice look like in a document that is looked to for explanation is an anti-pattern. In aviation, this might be analogous to saying that a pilot lacked “good airmanship” and pointing at it as a cause.
The phrases “must invest appropriate resources” and “equally strong” above are both non-binary and context-sensitive. What is appropriate and equally strong gets to be defined by…whom?
- What is “prudent”?
- The description only says prudence demands prevention of errors, outages, and malfunctions “wherever possible.” How will you know where prevention is not possible? And following that – it would appear that you can be prudent and still not prevent errors and malfunctions.
- Please ensure a “prompt, effective, and risk-mitigating response.” In other words: fix it correctly and fix it quickly. It’s so simple!
5. The Commission adopted Exchange Act Rule 15c3-5 in November 2010 to require that brokers or dealers, as gatekeepers to the financial markets, “appropriately control the risks associated with market access, so as not to jeopardize their own financial condition, that of other market participants, the integrity of trading on the securities markets, and the stability of the financial system.”
Note: It’s true, this is what the rule says. What is deemed “appropriate”, it would seem, is dependent on the outcome. Had an accident? It was not appropriate control. Didn’t have an accident? It must be appropriate control. This would mean that Knight Capital did have appropriate controls the day before the accident. Outcome bias reigns supreme here.
6. Subsection (b) of Rule 15c3-5 requires brokers or dealers with market access to “establish, document, and maintain a system of risk management controls and supervisory procedures reasonably designed to manage the financial, regulatory, and other risks” of having market access. The rule addresses a range of market access arrangements, including customers directing their own trading while using a broker’s market participant identifications, brokers trading for their customers as agents, and a broker-dealer’s trading activities that place its own capital at risk. Subsection (b) also requires a broker or dealer to preserve a copy of its supervisory procedures and a written description of its risk management controls as part of its books and records.
Note: The rule says, basically: “have a document about controls and risks”. It doesn’t say anything about an organization’s ability to adapt them as time and technology progress, only that at some point they were written down and shared with the right parties.
7. Subsection (c) of Rule 15c3-5 identifies specific required elements of a broker or dealer’s risk management controls and supervisory procedures. A broker or dealer must have systematic financial risk management controls and supervisory procedures that are reasonably designed to prevent the entry of erroneous orders and orders that exceed pre-set credit and capital thresholds in the aggregate for each customer and the broker or dealer. In addition, a broker or dealer must have regulatory risk management controls and supervisory procedures that are reasonably designed to ensure compliance with all regulatory requirements.
Note: This is the first of many instances of the phrase “reasonably designed” in the document. As with the word ‘appropriate’, how something is defined to be “reasonably designed” is dependent on the outcome of that design. This robs both the design and the engineer of the nuanced details that make for resilient systems. Modern technology doesn’t work or not-work. It breaks and fails in surprising (sometimes shocking) ways that were not imagined by its designers, which means that “reason” accounts for only part of its quality.
Right now, all over the world, every (non-malicious) engineer is designing and building systems that they believe are “reasonably designed.” If they didn’t think those systems were reasonably designed, they wouldn’t be finished with them until they did.
Some of those systems will fail. Most will not. Many of them will fail in ways that are safe and anticipated. Some will not, and will surprise everyone.
Systems Safety researcher Erik Hollnagel has had related thoughts:
We must strive to understand that accidents don’t happen because people gamble and lose.
Accidents happen because the person believes that:
…what is about to happen is not possible,
…or what is about to happen has no connection to what they are doing,
…or that the possibility of getting the intended outcome is well worth whatever risk there is.
8. Subsection (e) of Rule 15c3-5 requires brokers or dealers with market access to establish, document, and maintain a system for regularly reviewing the effectiveness of their risk management controls and supervisory procedures. This sub-section also requires that the Chief Executive Officer (“CEO”) review and certify that the controls and procedures comply with subsections (b) and (c) of the rule. These requirements are intended to assure compliance on an ongoing basis, in part by charging senior management with responsibility to regularly review and certify the effectiveness of the controls.
Note: This takes into consideration that systems are indeed not static, and it implies that they need to evolve over time. This is important to remember for some notes later on.
9. Beginning no later than July 14, 2011, and continuing through at least August 1, 2012, Knight’s system of risk management controls and supervisory procedures was not reasonably designed to manage the risk of its market access. In addition, Knight’s internal reviews were inadequate, its annual CEO certification for 2012 was defective, and its written description of its risk management controls was insufficient. Accordingly, Knight violated Rule 15c3-5. In particular:
- Knight did not have controls reasonably designed to prevent the entry of erroneous orders at a point immediately prior to the submission of orders to the market by one of Knight’s equity order routers, as required under Rule 15c3-5(c)(1)(ii);
- Knight did not have controls reasonably designed to prevent it from entering orders for equity securities that exceeded pre-set capital thresholds for the firm, in the aggregate, as required under Rule 15c3-5(c)(1)(i). In particular, Knight failed to link accounts to firm-wide capital thresholds, and Knight relied on financial risk controls that were not capable of preventing the entry of orders;
- Knight did not have an adequate written description of its risk management controls as part of its books and records in a manner consistent with Rule 17a-4(e)(7) of the Exchange Act, as required by Rule 15c3-5(b);
- Knight also violated the requirements of Rule 15c3-5(b) because Knight did not have technology governance controls and supervisory procedures sufficient to ensure the orderly deployment of new code or to prevent the activation of code no longer intended for use in Knight’s current operations but left on its servers that were accessing the market; and Knight did not have controls and supervisory procedures reasonably designed to guide employees’ responses to significant technological and compliance incidents;
- Knight did not adequately review its business activity in connection with its market access to assure the overall effectiveness of its risk management controls and supervisory procedures, as required by Rule 15c3-5(e)(1); and
- Knight’s 2012 annual CEO certification was defective because it did not certify that Knight’s risk management controls and supervisory procedures complied with paragraphs (b) and (c) of Rule 15c3-5, as required by Rule 15c3-5(e)(2).
Note: It’s a counterfactual party! The question remains: are conditions sufficient, reasonably designed, or adequate if they don’t result in an accident like this one? Which comes first: these characterizations, or the accident? Knight Capital did believe these things were sufficient, reasonably designed, and adequate enough. Otherwise, they would have addressed them. One question necessary to answer for prevention is: “What were the sources of confidence that Knight Capital drew upon as they designed their systems?” Because improvement lies there.
10. As a result of these failures, Knight did not have a system of risk management controls and supervisory procedures reasonably designed to manage the financial, regulatory, and other risks of market access on August 1, 2012, when it experienced a significant operational failure that affected SMARS, one of the primary systems Knight uses to send orders to the market. While Knight’s technology staff worked to identify and resolve the issue, Knight remained connected to the markets and continued to send orders in certain listed securities. Knight’s failures resulted in it accumulating an unintended multi-billion dollar portfolio of securities in approximately forty-five minutes on August 1 and, ultimately, Knight lost more than $460 million, experienced net capital problems, and violated Rules 200(g) and 203(b) of Regulation SHO.
11. Knight Capital Americas LLC (“Knight”) is a U.S.-based broker-dealer and a wholly-owned subsidiary of KCG Holdings, Inc. Knight was owned by Knight Capital Group, Inc. until July 1, 2013, when that entity and GETCO Holding Company, LLC combined to form KCG Holdings, Inc. Knight is registered with the Commission pursuant to Section 15 of the Exchange Act and is a Financial Industry Regulatory Authority (“FINRA”) member. Knight has its principal business operations in Jersey City, New Jersey. Throughout 2011 and 2012, Knight’s aggregate trading (both for itself and for its customers) generally represented approximately ten percent of all trading in listed U.S. equity securities. SMARS generally represented approximately one percent or more of all trading in listed U.S. equity securities.
B. August 1, 2012 and Related Events
Preparation for NYSE Retail Liquidity Program
12. To enable its customers’ participation in the Retail Liquidity Program (“RLP”) at the New York Stock Exchange, which was scheduled to commence on August 1, 2012, Knight made a number of changes to its systems and software code related to its order handling processes. These changes included developing and deploying new software code in SMARS. SMARS is an automated, high speed, algorithmic router that sends orders into the market for execution. A core function of SMARS is to receive orders passed from other components of Knight’s trading platform (“parent” orders) and then, as needed based on the available liquidity, send one or more representative (or “child”) orders to external venues for execution.
13. Upon deployment, the new RLP code in SMARS was intended to replace unused code in the relevant portion of the order router. This unused code previously had been used for functionality called “Power Peg,” which Knight had discontinued using many years earlier. Despite the lack of use, the Power Peg functionality remained present and callable at the time of the RLP deployment. The new RLP code also repurposed a flag that was formerly used to activate the Power Peg code. Knight intended to delete the Power Peg code so that when this flag was set to “yes,” the new RLP functionality—rather than Power Peg—would be engaged.
Note: Noting the intention is important in gaining understanding, because it shows effort to get into the mindset of the individual or groups involved in the work. If this introspection continued throughout the document, it would get a little closer to something like a postmortem.
Raise your hand if you can definitively state all of the active and inactive code execution paths in your application right now. Right.
14. When Knight used the Power Peg code previously, as child orders were executed, a cumulative quantity function counted the number of shares of the parent order that had been executed. This feature instructed the code to stop routing child orders after the parent order had been filled completely. In 2003, Knight ceased using the Power Peg functionality. In 2005, Knight moved the tracking of cumulative shares function in the Power Peg code to an earlier point in the SMARS code sequence. Knight did not retest the Power Peg code after moving the cumulative quantity function to determine whether Power Peg would still function correctly if called.
Note: On the surface, this looks like some technical meat to bite into. There is some detail surrounding a fault-tolerance guardrail here, something to fail “closed” in the presence of specific criteria. What’s missing? Any dialogue about why the function was moved from one place (in Power Peg) to another (earlier in SMARS). This is important, because in my experience, engineers don’t make an effort in that sort of thing without motivation. If that motivation was explored, then we’d get a better sense of where the organization drew its confidence from, previous to the accident. This helps us understand their local rationality. But: we don’t get that from this document.
15. Beginning on July 27, 2012, Knight deployed the new RLP code in SMARS in stages by placing it on a limited number of servers in SMARS on successive days. During the deployment of the new code, however, one of Knight’s technicians did not copy the new code to one of the eight SMARS computer servers. Knight did not have a second technician review this deployment and no one at Knight realized that the Power Peg code had not been removed from the eighth server, nor the new RLP code added. Knight had no written procedures that required such a review.
Note: Code and deployment review is a fine thing to have. But is it sufficient? Dr. Nancy Leveson explained when she was invited to speak at the SEC’s “Technology Roundtable” in October of last year that in 1992, she chaired a committee to review the code that was deployed on the Space Shuttle. She said that NASA was spending $100 million a year to maintain the code, was employing the smartest engineers in the world, and there were still found to be gaps of concern. She repeats that there is no such thing as perfect software, no matter how much effort an individual or organization makes to produce such a thing.
Do written procedures requiring a review of code or deployment guarantee safety? Of course not. But ensuring safety isn’t what the SEC is expected to do in this document. Again: they are only pointing out the differences between regulation and practice.
Events of August 1, 2012
16. On August 1, Knight received orders from broker-dealers whose customers were eligible to participate in the RLP. The seven servers that received the new code processed these orders correctly. However, orders sent with the repurposed flag to the eighth server triggered the defective Power Peg code still present on that server. As a result, this server began sending child orders to certain trading centers for execution. Because the cumulative quantity function had been moved, this server continuously sent child orders, in rapid sequence, for each incoming parent order without regard to the number of share executions Knight had already received from trading centers. Although one part of Knight’s order handling system recognized that the parent orders had been filled, this information was not communicated to SMARS.
Note: So the guardrail/fail-closed mechanism wasn’t in the same place it was before, and the eighth server was allowed to continue on. As Leveson said in her testimony: “It’s not necessarily just individual component failure. In a lot of these accidents each individual component worked exactly the way it was expected to work. It surprised everyone in the interactions among the components.”
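To make the mechanism in paragraph 16 a little more concrete, here is a minimal Python sketch of a router that stops sending child orders once the parent is filled, next to one where the fill count is never consulted. Everything in it (`ParentOrder`, `route_with_guard`, `route_without_guard`, the fixed child size, the iteration cap) is a hypothetical illustration and not Knight’s code; the release doesn’t describe the implementation at anything like this level.

```python
from dataclasses import dataclass


@dataclass
class ParentOrder:
    symbol: str
    quantity: int      # shares the customer actually wants
    filled: int = 0    # cumulative shares executed so far


def route_with_guard(parent: ParentOrder, child_size: int) -> int:
    """Send child orders until the parent order is completely filled."""
    children_sent = 0
    while parent.filled < parent.quantity:
        size = min(child_size, parent.quantity - parent.filled)
        parent.filled += size  # assumption: every child executes in full
        children_sent += 1
    return children_sent


def route_without_guard(parent: ParentOrder, child_size: int, max_iterations: int) -> int:
    """Same routing loop, but the cumulative fill count is never checked.

    max_iterations exists only so this sketch terminates; nothing inside
    the loop would ever stop it.
    """
    children_sent = 0
    for _ in range(max_iterations):
        parent.filled += child_size
        children_sent += 1
    return children_sent


guarded = ParentOrder("XYZ", quantity=1_000)
unguarded = ParentOrder("XYZ", quantity=1_000)

print(route_with_guard(guarded, child_size=100), guarded.filled)
print(route_without_guard(unguarded, child_size=100, max_iterations=50), unguarded.filled)
```

Each function does exactly what it was written to do. The surprise only shows up when the second one is called in a context where someone still assumes the first one’s guard is present, which is Leveson’s point about interactions among components.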
17. The consequences of the failures were substantial. For the 212 incoming parent orders that were processed by the defective Power Peg code, SMARS sent millions of child orders, resulting in 4 million executions in 154 stocks for more than 397 million shares in approximately 45 minutes. Knight inadvertently assumed an approximately $3.5 billion net long position in 80 stocks and an approximately $3.15 billion net short position in 74 stocks. Ultimately, Knight realized a $460 million loss on these positions.
Note: Just in case you forgot, this accident was sooooo bad. These numbers are so big. Keep that in mind, dear reader, because I want you to remember that when you think about the engineer who thought he had deployed the code to the eighth server.
18. The millions of erroneous executions influenced share prices during the 45 minute period. For example, for 75 of the stocks, Knight’s executions comprised more than 20 percent of the trading volume and contributed to price moves of greater than five percent. As to 37 of those stocks, the price moved by greater than ten percent, and Knight’s executions constituted more than 50 percent of the trading volume. These share price movements affected other market participants, with some participants receiving less favorable prices than they would have in the absence of these executions and others receiving more favorable prices.
BNET Reject E-mail Messages
19. On August 1, Knight also received orders eligible for the RLP but that were designated for pre-market trading. SMARS processed these orders and, beginning at approximately 8:01 a.m. ET, an internal system at Knight generated automated e-mail messages (called “BNET rejects”) that referenced SMARS and identified an error described as “Power Peg disabled.” Knight’s system sent 97 of these e-mail messages to a group of Knight personnel before the 9:30 a.m. market open. Knight did not design these types of messages to be system alerts, and Knight personnel generally did not review them when they were received. However, these messages were sent in real time, were caused by the code deployment failure, and provided Knight with a potential opportunity to identify and fix the coding issue prior to the market open. These notifications were not acted upon before the market opened and were not used to diagnose the problem after the open.
Note: Translated, this says that systems-generated warnings/alerts that were sent via email weren’t noticed. Perfectly detecting or preventing anomalies with signals sent by automated systems (synchronously, as in “alerts”, or asynchronously, as in “email”) is not a solved problem. Show me an outage, any outage, and I’ll show you warning signs that humans didn’t pick up on. The document doesn’t give any detail on why those types of messages were sent via email (as opposed to paging-style alerts), what the distribution list was for them, how those messages get generated, or any other details.
Is the number of the emails (97 of them) important? 97 sounds like a lot, doesn’t it? If it was one, and not 97, would the paragraph read differently? What if there were 10,000 messages sent?
How many engineers right now are receiving alerts on their phone (forget about emails) that they will glance at and think that they are part of the normal levels of noise in the system, because thresholds and error handling are not always precisely tuned?
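The question about whether 97 is “a lot” can be put in code. Below is a small Python sketch, under the assumption that a count only means something relative to what recipients normally see. The function name `is_unusual`, the factor of 3, and both message histories are made up for illustration; the release says nothing about how often BNET rejects were normally sent.

```python
from statistics import median


def is_unusual(count: int, typical_counts: list[int], factor: float = 3.0) -> bool:
    """Flag a count only if it exceeds `factor` times the median of past counts."""
    baseline = median(typical_counts)
    return count > factor * baseline


# Hypothetical histories of messages received before the open on prior days.
quiet_history = [0, 1, 0, 2, 1]
noisy_history = [80, 120, 95, 110, 90]

print(is_unusual(97, quiet_history))      # compared against a quiet mailbox
print(is_unusual(97, noisy_history))      # compared against a noisy one
print(is_unusual(10_000, noisy_history))  # the "what if" case from above
```

The same 97 messages stand out against one history and disappear into the other. Without knowing which history Knight’s recipients were living in, the number in the release tells us very little.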
C. Controls and Supervisory Procedures
20. Knight had a number of controls in place prior to the point that orders reached SMARS. In particular, Knight’s customer interface, internal order management system, and system for internally executing customer orders all contained controls concerning the prevention of the entry of erroneous orders.
21. However, Knight did not have adequate controls in SMARS to prevent the entry of erroneous orders. For example, Knight did not have sufficient controls to monitor the output from SMARS, such as a control to compare orders leaving SMARS with those that entered it. Knight also did not have procedures in place to halt SMARS’s operations in response to its own aberrant activity. Knight had a control that capped the limit price on a parent order, and therefore related child orders, at 9.5 percent below the National Best Bid (for sell orders) or above the National Best Offer (for buy orders) for the stock at the time that SMARS had received the parent order. However, this control would not prevent the entry of erroneous orders in circumstances in which the National Best Bid or Offer moved by less than 9.5 percent. Further, it did not apply to orders—such as the 212 orders described above—that Knight received before the market open and intended to send to participate in the opening auction at the primary listing exchange for the stock.
Note: Anomaly detection and error-handling criteria have two origins: the imagination of their authors and the history of surprises that have been encountered already. A significant number of thresholds, guardrails, and alerts in any technical organization are put in place only after it’s realized that they are needed. Some of these realizations come from negative events like outages, data loss, etc. and some of them come from “near-misses” or explicit re-anticipation activated by feedback that comes from real-world operation.
Even then, real-world observations don’t always produce new safeguards. How many successful trades had Knight Capital seen in its lifetime while that control allowed “the entry of erroneous orders in circumstances in which the National Best Bid or Offer moved by less than 9.5 percent”? How many successful Shuttle launches saw degradation in O-ring integrity before the Challenger accident? This ‘normalization of deviance’ (Vaughan, 1997) phenomenon is to be expected in all socio-technical organizations. Financial trading systems are no exception. History matters.
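As an illustration of a guardrail shaped by what its authors imagined, here is a minimal Python sketch of the price control described in paragraph 21. The 9.5 percent figure comes from the release; the function names, the prices, and the way the cap is applied are my own assumptions, not a description of how SMARS implemented it.

```python
COLLAR = 0.095  # 9.5 percent, the figure quoted in paragraph 21


def capped_limit_price(side: str, limit: float, best_bid: float, best_offer: float) -> float:
    """Clamp a limit price to the collar around the NBBO seen at order receipt."""
    if side == "buy":
        return min(limit, best_offer * (1 + COLLAR))
    if side == "sell":
        return max(limit, best_bid * (1 - COLLAR))
    raise ValueError(f"unknown side: {side!r}")


def within_collar(side: str, price: float, best_bid: float, best_offer: float) -> bool:
    """True when the collar would leave the price untouched."""
    return capped_limit_price(side, price, best_bid, best_offer) == price


# Hypothetical quote: bid 10.00, offer 10.02.
print(within_collar("buy", 12.00, 10.00, 10.02))  # a wildly priced order

# A thousand unintended orders at a perfectly ordinary price.
unintended = [("buy", 10.02)] * 1_000
print(all(within_collar(side, px, 10.00, 10.02) for side, px in unintended))
```

The collar anticipates one kind of surprise, a badly priced order, and does that job. It has no opinion about the same normally priced order being sent over and over, because nothing in its design considers quantity or repetition.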
Note: Nothing in this section had much value in explanation or prevention.
Code Development and Deployment
26. Knight did not have written code development and deployment procedures for SMARS (although other groups at Knight had written procedures), and Knight did not require a second technician to review code deployment in SMARS. Knight also did not have a written protocol concerning the accessing of unused code on its production servers, such as a protocol requiring the testing of any such code after it had been accessed to ensure that the code still functioned properly.
Note: Again, does a review guarantee safety? Does testing prevent malfunction?
27. On August 1, Knight did not have supervisory procedures concerning incident response. More specifically, Knight did not have supervisory procedures to guide its relevant personnel when significant issues developed. On August 1, Knight relied primarily on its technology team to attempt to identify and address the SMARS problem in a live trading environment. Knight’s system continued to send millions of child orders while its personnel attempted to identify the source of the problem. In one of its attempts to address the problem, Knight uninstalled the new RLP code from the seven servers where it had been deployed correctly. This action worsened the problem, causing additional incoming parent orders to activate the Power Peg code that was present on those servers, similar to what had already occurred on the eighth server.
Note: I would like to think that most engineering organizations that are tasked with troubleshooting issues in production systems understand that diagnosis isn’t something you can prescribe. Successful incident response in escalating scenarios is something that comes from real-world practice, not a document. Improvisation and intuition play a significant role in this, which obviously cannot be written down beforehand.
Thought exercise: you just deployed new code to production. You become aware of an issue. Would it be surprising if one of the ways you attempt to rectify the scenario is to roll back to the last known working version? The SEC release implies that it would be.
D. Compliance Reviews and Written Description of Controls
Note: I’m skipping some sections here as it’s just more about compliance.
Post-Compliance Date Reviews
32. Knight conducted periodic reviews pursuant to the WSPs. As explained above, the WSPs assigned various tasks to be performed by SCG staff in consultation with the pertinent business and technology units, with a senior member of the pertinent business unit reviewing and approving that work. These reviews did not consider whether Knight needed controls to limit the risk that SMARS could malfunction, nor did these reviews consider whether Knight needed controls concerning code deployment or unused code residing on servers. Before undertaking any evaluation of Knight’s controls, SCG, along with business and technology staff, had to spend significant time and effort identifying the missing content and correcting the inaccuracies in the written description.
33. Several previous events presented an opportunity for Knight to review the adequacy of its controls in their entirety. For example, in October 2011, Knight used test data to perform a weekend disaster recovery test. After the test concluded, Knight’s LMM desk mistakenly continued to use the test data to generate automated quotes when trading began that Monday morning. Knight experienced a nearly $7.5 million loss as a result of this event. Knight responded to the event by limiting the operation of the system to market hours, changing the control so that this system would stop providing quotes after receiving an execution, and adding an item to a disaster recovery checklist that required a check of the test data. Knight did not broadly consider whether it had sufficient controls to prevent the entry of erroneous orders, regardless of the specific system that sent the orders or the particular reason for that system’s error. Knight also did not have a mechanism to test whether their systems were relying on stale data.
Note: That we might be able to cherry-pick opportunities in the past where signs of doomsday could have (or should have) been seen and heeded is consistent with textbook definitions of The Hindsight Bias. How organizations learn is influenced by the social and cultural dynamics of their internal structures. Again, Diane Vaughan’s writings are a place we can look to for exploring how path dependency can get us into surprising places. But again: it is not the SEC’s job to speak to that.
E. CEO Certification
34. In March 2012, Knight’s CEO signed a certification concerning Rule 15c3-5. The certification did not state that Knight’s controls and procedures complied with the rule. Instead, the certification stated that Knight had in place “processes” to comply with the rule. This drafting error was not intentional, the CEO did not notice the error, and the CEO believed at the time that he was certifying that Knight’s controls and procedures complied with the rule.
Note: This is possibly the only hint at local rationality in the document.
F. Collateral Consequences
35. There were collateral consequences as a result of the August 1 event, including significant net capital problems. In addition, many of the millions of orders that SMARS sent on August 1 were short sale orders. Knight did not mark these orders as short sales, as required by Rule 200(g) of Regulation SHO. Similarly, Rule 203(b) of Regulation SHO prohibits a broker or dealer from accepting a short sale order in an equity security from another person, or effecting a short sale in an equity security for its own account, unless it has borrowed the security, entered into a bona-fide arrangement to borrow the security, or has reasonable grounds to believe that the security can be borrowed so that it can be delivered on the date delivery is due (known as the “locate” requirement), and has documented compliance with this requirement. Knight did not obtain a “locate” in connection with Knight’s unintended orders and did not document compliance with the requirement with respect to Knight’s unintended orders.
A. Market Access Rule: Section 15(c)(3) of the Exchange Act and Rule 15c3-5
Note: I’m going to skip a bit because it’s not much more than a restating of rules that the SEC deemed were broken….
Accordingly, pursuant to Sections 15(b) and 21C of the Exchange Act, it is hereby ORDERED that:
A. Respondent Knight cease and desist from committing or causing any violations and any future violations of Section 15(c)(3) of the Exchange Act and Rule 15c3-5 thereunder, and Rules 200(g) and 203(b) of Regulation SHO.
Note: Translated – you must stop immediately all of the things that violate rules that say you must “reasonably design” things. So don’t unreasonably design things anymore.
The SEC document does what it needs to do: walk through the regulations that they think were violated, and talk about the settlement agreement. Knight Capital doesn’t have to admit they did anything wrong or suboptimal, and the SEC gets to tell them what to do next. That is, roughly:
- Hire a consultant that helps them not unreasonably design things anymore, and document that.
- Pay $12M to the SEC.
Like I mentioned before, this SEC release doesn’t help explain how the event came to be, or make any effort towards prevention other than require Knight Capital to pay a settlement, hire a consultant, and write new procedures that can predict the future. I do not know anyone at Knight Capital (or at the SEC for that matter) so it’s very unlikely that I’ll gain any more awareness of accident details than you will, my dear reader.
But I can put down a few questions that I might ask if I was facilitating the debriefing of the accident, which could possibly help with gaining a systems-thinking perspective on explanation. Real prevention is left as an exercise for the readers who also work at Knight Capital.
- The engineer who deployed the new code to support the RLP integration had confidence that all servers (not just seven of the eight) received the new code. What gave him that confidence? Was it a dashboard? Reliance on an alert? Some other sort of feedback from the deployment process?
- The BNET Reject E-mail Messages: Have they ever been sent before? Do the recipients of them trust their validity? What is the background on their delivery being via email, versus synchronous alerting? Do they provide enough context in their content to give an engineer sufficient criteria to act on?
- What were the signals that the responding team used to indicate that a roll-back of the code on the seven servers was a potential repairing action?
- Did the team that was responding to the issue have solid and clear communication channels? Was it textual chat, in-person, or over voice or video conference?
- Did the team have to improvise any new tooling to be used in the diagnosis or response?
- What metrics did the team use to guide their actions? Were they infrastructural (such as latency, network, or CPU graphs?) or market-related data (trades, positions, etc.) or a mixture?
- What indications were there to raise awareness that the eighth server didn’t receive the latest code? Was it a checksum or versioning? Was it logs of a deployment tool? Was it differences in the server metrics of the eighth server?
- As the new code was rolled out: what was the team focused on? What were they seeing?
- As they recognized there was an issue: did the symptoms look like something they had seen before?
- As the event unfolded: did the responding team discuss what to do, or did single actors take action?
- Regarding non-technical teams: were they involved with directing the response?
- Many many more questions remain, that presumably (hopefully) Knight Capital has asked and answered themselves.
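One of the questions above asks whether a checksum or version comparison was among the signals available. As a sketch of what that kind of feedback from a deployment process could look like, the Python below hashes the artifact found on each server and reports any server that differs from the majority. The server names, the byte strings standing in for deployed code, and the functions `group_by_digest` and `outliers` are all hypothetical; in practice the contents would have to be fetched from each machine, which is assumed away here. This is not a claim that such a check existed or would have prevented the accident.

```python
import hashlib
from collections import defaultdict


def group_by_digest(artifacts: dict[str, bytes]) -> dict[str, list[str]]:
    """Group server names by the SHA-256 digest of what is deployed on them."""
    groups: dict[str, list[str]] = defaultdict(list)
    for server, content in artifacts.items():
        groups[hashlib.sha256(content).hexdigest()].append(server)
    return dict(groups)


def outliers(artifacts: dict[str, bytes]) -> list[str]:
    """Return servers whose artifact differs from the most common one.

    On a tie, the first group encountered is treated as the majority.
    """
    groups = group_by_digest(artifacts)
    if len(groups) <= 1:
        return []
    majority = max(groups.values(), key=len)
    return sorted(
        server
        for servers in groups.values()
        if servers is not majority
        for server in servers
    )


# Hypothetical fleet: seven servers with the new build, one with the old.
fleet = {f"smars-{n:02d}": b"rlp-build" for n in range(1, 8)}
fleet["smars-08"] = b"power-peg-build"

print(outliers(fleet))
```

Even with a check like this in place, the debriefing questions still stand: who would have seen its output, when, and would they have trusted it?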
The Second Victim
What about the engineer who deployed the code…the one who had his hands on the actual work being done? How is he doing? Is he receiving support from his peers and colleagues? Or was he fired? The financial trading world does not exactly have a reputation for empathy, and given that there is no voice given to the people closest to the work (such as this engineer) informing the story, I can imagine that symptoms consistent with traumatic stress are likely.
Some safety-critical domains have put together structured programs to offer support to individuals who are involved with high-tempo and high-consequence work. Aviation and air traffic control have seen good success with CISM (Critical Incident Stress Management) and it’s been embraced by organizations around the world.
As web operations and financial trading systems become more and more complex, we will continue to be surprised by outcomes of what looks like “normal” work. If we do not make effort to support those who navigate this complexity on a daily basis, we will not like the results.
- The SEC does not have responsibility for investigation with the goals of explanation or prevention of adverse events. Their focus is regulation.
- Absent a real investigation that eschews counterfactuals, puts procedures and rules into context, and encourages a narrative that holds paramount the voices of those closest to the work: we cannot draw any substantial conclusions. This means armchair accident investigation rife with indignation.
So please don’t use the SEC Release No. 70694 as a post-mortem document, because it is not.
Dekker, S. (2003). Failure to adapt or adaptations that fail: contrasting models on procedures and safety. Applied Ergonomics, 34(3), 233–238. doi:10.1016/S0003-6870(03)00031-0
Dekker, S. (2006). The Field Guide to Understanding Human Error. Ashgate Publishing, Ltd.
Outcome Bias. (n.d.). In Wikipedia. Retrieved October 28, 2013, from https://en.wikipedia.org/wiki/Outcome_bias
Pidgeon, N., & O’Leary, M. (2000). Man-made disasters: why technology and organizations (sometimes) fail. Safety Science, 34(1), 15–30.
Vaughan, D. (2009). The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA. University of Chicago Press.
Woods, D. D., Dekker, S., Cook, R., Johannesen, L., & Sarter, N. (2010). Behind Human Error (2nd ed.). Farnham: Ashgate Pub Co.
Weick, K. E. (1993). The collapse of sensemaking in organizations. Administrative Science Quarterly, 38, 628–652.
(This was originally posted on Code As Craft, Etsy’s engineering blog. I’m re-posting it here because it still resonates strongly as I prepare to teach a ‘postmortem facilitator’ course internally at Etsy.)
Last week, Owen Thomas wrote a flattering article over at Business Insider on how we handle errors and mistakes at Etsy. I thought I might give some detail on how that actually happens, and why.
Anyone who’s worked with technology at any scale is familiar with failure. Failure cares not about the architecture designs you slave over, the code you write and review, or the alerts and metrics you meticulously pore through.
So: failure happens. This is a foregone conclusion when working with complex systems. But what about those failures that have resulted due to the actions (or lack of action, in some cases) of individuals? What do you do with those careless humans who caused everyone to have a bad day?
Maybe they should be fired.
Or maybe they need to be prevented from touching the dangerous bits again.
Or maybe they need more training.
This is the traditional view of “human error”, which focuses on the characteristics of the individuals involved. It’s what Sidney Dekker calls the “Bad Apple Theory” – get rid of the bad apples, and you’ll get rid of the human error. Seems simple, right?
We don’t take this traditional view at Etsy. We instead want to view mistakes, errors, slips, lapses, etc. with a perspective of learning. Having blameless Post-Mortems on outages and accidents is part of that.
A Blameless Post-Mortem
What does it mean to have a ‘blameless’ Post-Mortem?
Does it mean everyone gets off the hook for making mistakes? No.
Well, maybe. It depends on what “gets off the hook” means. Let me explain.
Having a Just Culture means that you’re making effort to balance safety and accountability. It means that by investigating mistakes in a way that focuses on the situational aspects of a failure’s mechanism and the decision-making process of individuals proximate to the failure, an organization can come out safer than it would normally be if it had simply punished the actors involved as a remediation.
Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account of:
- what actions they took at what time,
- what effects they observed,
- expectations they had,
- assumptions they had made,
- and their understanding of the timeline of events as they occurred.
…and that they can give this detailed account without fear of punishment or retribution.
Why shouldn’t they be punished or reprimanded? Because an engineer who thinks they’re going to be reprimanded is disincentivized to give the details necessary to get an understanding of the mechanism, pathology, and operation of the failure. This lack of understanding of how the accident occurred all but guarantees that it will repeat. If not with the original engineer, then with another one in the future.
We believe that this detail is paramount to improving safety at Etsy.
If we go with “blame” as the predominant approach, then we’re implicitly accepting that deterrence is how organizations become safer. This is founded in the belief that individuals, not situations, cause errors. It’s also aligned with the idea there has to be some fear that not doing one’s job correctly could lead to punishment. Because the fear of punishment will motivate people to act correctly in the future. Right?
This cycle of name/blame/shame can be looked at like this:
1. Engineer takes action and contributes to a failure or incident.
2. Engineer is punished, shamed, blamed, or retrained.
3. Reduced trust between engineers on the ground (the “sharp end”) and management (the “blunt end”) looking for someone to scapegoat
4. Engineers become silent on details about actions/situations/observations, resulting in “Cover-Your-Ass” engineering (from fear of punishment)
5. Management becomes less aware and informed on how work is being performed day to day, and engineers become less educated on lurking or latent conditions for failure due to silence mentioned in #4, above
6. Errors more likely, latent conditions can’t be identified due to #5, above
7. Repeat from step 1
We need to avoid this cycle. We want the engineer who has made an error to give details about why (either explicitly or implicitly) he or she did what they did; why the action made sense to them at the time. This is paramount to understanding the pathology of the failure. The action made sense to the person at the time they took it, because if it hadn’t made sense to them at the time, they wouldn’t have taken the action in the first place.
The base fundamental here is something Erik Hollnagel has said:
We must strive to understand that accidents don’t happen because people gamble and lose.
Accidents happen because the person believes that:
…what is about to happen is not possible,
…or what is about to happen has no connection to what they are doing,
…or that the possibility of getting the intended outcome is well worth whatever risk there is.
A Second Story
This idea of digging deeper into the circumstance and environment that an engineer found themselves in is called looking for the “Second Story”. In Post-Mortem meetings, we want to find Second Stories to help understand what went wrong.
From Behind Human Error, here’s the difference between “first” and “second” stories of human error:
|First Stories|Second Stories|
|---|---|
|Human error is seen as cause of failure|Human error is seen as the effect of systemic vulnerabilities deeper inside the organization|
|Saying what people should have done is a satisfying way to describe failure|Saying what people should have done doesn’t explain why it made sense for them to do what they did|
|Telling people to be more careful will make the problem go away|Only by constantly seeking out its vulnerabilities can organizations enhance safety|
Allowing Engineers to Own Their Own Stories
A funny thing happens when engineers make mistakes and feel safe when giving details about it: they are not only willing to be held accountable, they are also enthusiastic in helping the rest of the company avoid the same error in the future. They are, after all, the most expert in their own error. They ought to be heavily involved in coming up with remediation items.
So technically, engineers are not at all “off the hook” with a blameless Post-Mortem process. They are very much on the hook for helping Etsy become safer and more resilient, in the end. And lo and behold: most engineers I know find this idea of making things better for others a worthwhile exercise.
So what do we do to enable a “Just Culture” at Etsy?
- We encourage learning by having these blameless Post-Mortems on outages and accidents.
- The goal is to understand how an accident could have happened, in order to better equip ourselves to prevent it from happening in the future.
- We seek out Second Stories, gather details from multiple perspectives on failures, and we don’t punish people for making mistakes.
- Instead of punishing engineers, we give them the requisite authority to improve safety by allowing them to give detailed accounts of their contributions to failures.
- We enable and encourage people who do make mistakes to be the experts on educating the rest of the organization on how not to make them in the future.
- We accept that there is always a discretionary space where humans can decide to take actions or not, and that the judgement of those decisions lies in hindsight.
- We accept that the Hindsight Bias will continue to cloud our assessment of past events, and work hard to eliminate it.
- We accept that the Fundamental Attribution Error is also difficult to escape, so we focus on the environment and circumstances people are working in when investigating accidents.
- We strive to make sure that the blunt end of the organization understands how work is actually getting done (as opposed to how they imagine (or hope) it’s getting done, via Gantt charts and procedures) on the sharp end.
- The sharp end is relied upon to inform the organization where the line is between appropriate and inappropriate behavior. This isn’t something that the blunt end can come up with on its own.
Failure happens. In order to understand how failures happen, we first have to understand our reactions to failure.
One option is to assume the single cause is incompetence and scream at engineers to make them “pay attention!” or “be more careful!”
Another option is to take a hard look at how the accident actually happened, treat the engineers involved with respect, and learn from the event.
That’s why we have blameless Post-Mortems at Etsy, and why we’re looking to create a Just Culture here.
(Courtney Nash’s excellent post on this topic inadvertently pushed me to finally finish this – give it a read)
In the last post on this topic, I hoped to lay the foundation for what a mature role for automation might look like in web operations, and bring considerations to the decision-making process involved with considering automation as part of a design. Like Richard mentioned in his excellent comment to that post, this is essentially a very high level overview about the past 30 years of research into the effects, benefits, and ironies of automation.
I also hoped in that post to challenge people to investigate their assumptions about automation:
- when will automation be appropriate,
- what problems could it help solve, and
- how should it be designed in order to augment and complement (not simply replace) human adaptive and processing capacities.
The last point is what I’d like to explore further here. Dr. Cook also pointed out that I had skipped over entirely the concept of task allocation as an approach that didn’t end up as intended. I’m planning on exploring that a bit in this post.
But first: what is responsible for the impulse to automate that can grab us so strongly as engineers?
Is it simply the disgust we feel when we find (often in hindsight) a human-driven process that made a mistake (maybe one that contributed to an outage) that is presumed impossible for a machine to make?
It turns out that there are a number of automation ‘philosophies’, some of which you might recognize as familiar.
Philosophies and Approaches
One: The Left-Over Principle
One common way to think of automation is to gather up all of the tasks, and sort them into things that can be automated, and things that can’t be. Even the godfather of Human Factors, Alphonse Chapanis, said that it was reasonable to “mechanize everything that can be mechanized” (here). The main idea here is efficiency. Functions that cannot be assigned to machines are left for humans to carry out. This is known as the ‘Left-Over’ Principle.
David Woods and Erik Hollnagel have a response to this early incarnation of the “automate all the things!” approach, in Joint Cognitive Systems: Foundations of Cognitive Systems Engineering, which is (emphasis mine):
“The proviso of this argument is, however, that we should mechanise everything that can be mechanised, only in the sense that it can be guaranteed that the automation or mechanisation always will work correctly and not suddenly require operator intervention or support. Full automation should therefore be attempted only when it is possible to anticipate every possible condition and contingency. Such cases are unfortunately few and far between, and the available empirical evidence may lead to doubts whether they exist at all.
Without the proviso, the left-over principle implies a rather cavalier view of humans since it fails to include any explicit assumptions about their capabilities or limitations – other than the sanguine hope that the humans in the system are capable of doing what must be done. Implicitly this means that humans are treated as extremely flexible and powerful machines, which at any time far surpass what technological artefacts can do. Since the determination of what is left over reflects what technology cannot do rather than what people can do, the inevitable result is that humans are faced with two sets of tasks. One set comprises tasks that are either too infrequent or too expensive to automate. This will often include trivial tasks such as loading material onto a conveyor belt, sweeping the floor, or assembling products in small batches, i.e., tasks where the cost of automation is higher than the benefit. The other set comprises tasks that are too complex, too rare or too irregular to automate. This may include tasks that designers are unable to analyse or even imagine. Needless to say that may easily leave the human operator in an unenviable position.”
So to reiterate, the Left-Over Principle basically says that the things that are “left over” after automating as much as you can are either:
- Too “simple” to automate (economically, the benefit of automating isn’t worth the expense of automating it) because the operation is too infrequent, OR
- Too “difficult” to automate; the operation is too rare or irregular, and too complex to automate.
One critique of the Left-Over Principle is what Bainbridge points to in her second irony that I mentioned in the last post. The tasks that are “left over” after you automate everything you can are the ones you couldn’t figure out how to automate effectively (because they are too complicated, or too infrequent to be worth it), and those are the ones you then give back to the human to deal with.
So hold on: I thought we were trying to make humans’ lives easier, not more difficult?
Giving all of the easy bits to the machine and the difficult bits to the human also has a side effect of amplifying the workload on humans in terms of cognitive load and vigilance. (It turns out that it’s relatively trivial to write code that can do a boatload of complex things quite fast.) There’s usually little consideration given to whether or not the human could effectively perform these remaining non-automated tasks in a way that will benefit the overall system, including the automated tasks.
This approach also assumes that the tasks that are now automated can be done in isolation from the tasks that can’t be, which is almost never the case. When only humans are working on tasks, even with other humans, they can proceed at their own rate individually or as a group. When humans and computers work together, the pace is set by the automated part, so the human needs to keep up with the computer. This underscores the importance of considering automation in the context of humans and computers working jointly. Together. As a team, if you will.
We’ll revisit this idea later, but the idea that automation should place high priority and focus on the human-machine collaboration instead of their individual capacities is a main theme in the area of Joint Cognitive Systems, and one that I personally agree with.
Parasuraman, Sheridan, and Wickens (2000) had this to say about the Left-Over Principle (emphasis mine):
“This approach therefore defines the human operator’s roles and responsibilities in terms of the automation. Designers automate every subsystem that leads to an economic benefit for that subsystem and leave the operator to manage the rest. Technical capability or low cost are valid reasons for automation, given that there is no detrimental impact on human performance in the resulting whole system, but this is not always the case. The sum of subsystem optimizations does not typically lead to whole system optimization.”
Two: The “Compensatory” Principle
Another familiar approach (or justification) for automating processes rests on the idea that you should exploit the strengths of both humans and machines differently. The basic premise is: give the machines the tasks that they are good at, and the humans the things that they are good at.
This is called the Compensatory Principle, based on the idea that humans and machines can compensate for each other’s weaknesses. It’s also known as functional allocation, task allocation, comparison allocation, or the MABA-MABA (“Men Are Better At-Machines Are Better At”) approach.
Historically, functional allocation has been most embodied by “Fitts’ List”, which comes from a report in 1951, “Human Engineering For An Effective Air Navigation and Traffic-Control System” written by Paul Fitts and others.
Fitts’ List, which is essentially the original MABA-MABA list, juxtaposes human with machine capabilities to be used as a guide in automation design to help decide who (human or machine) does what.
Here is Fitts’ List:
Humans appear to surpass present-day machines with respect to the following:
- Ability to detect small amounts of visual or acoustic energy.
- Ability to perceive patterns of light or sound.
- Ability to improvise and use flexible procedures.
- Ability to store very large amounts of information for long periods and to recall relevant facts at the appropriate time.
- Ability to reason inductively.
- Ability to exercise judgment.
Modern-day machines (then, in the 1950s) appear to surpass humans with respect to the following:
- Ability to respond quickly to control signals and to apply great forces smoothly and precisely.
- Ability to perform repetitive, routine tasks.
- Ability to store information briefly and then to erase it completely.
- Ability to reason deductively, including computational ability.
- Ability to handle highly complex operations, i.e., to do many different things at once.
This approach is intuitive for a number of reasons. It at least recognizes that when it comes to a certain category of tasks, humans are much superior to computers and software.
Erik Hollnagel summarized the Fitts’ List in Human Factors for Engineers:
It does a good job of looking like a guide; it’s essentially an IF-THEN conditional on where to use automation.
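To make that IF-THEN shape concrete, here is a toy Python sketch of the list read as a lookup. The capability labels are shortened paraphrases of the Fitts’ List entries above, and the `allocate` function is a hypothetical name invented for illustration; nothing like it appears in the 1951 report.

```python
# Shortened paraphrases of the two halves of Fitts' List (labels are illustrative).
HUMANS_BETTER_AT = {
    "detect faint signals",
    "perceive patterns",
    "improvise",
    "long-term recall",
    "inductive reasoning",
    "judgment",
}
MACHINES_BETTER_AT = {
    "fast precise response",
    "repetitive tasks",
    "short-term storage",
    "deductive reasoning",
    "many things at once",
}


def allocate(capability: str) -> str:
    """Naive MABA-MABA allocation: IF the capability is on a list, THEN assign it."""
    if capability in MACHINES_BETTER_AT:
        return "machine"
    if capability in HUMANS_BETTER_AT:
        return "human"
    # The list has no answer for work that does not fit one of its categories.
    raise ValueError(f"no allocation rule for: {capability!r}")
```

Notice that the function only works if every task arrives already labeled with exactly one capability.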
So what’s not to like about this approach?
While this is a reasonable way to look at the situation, it has some well-explored difficulties that make it basically impossible to use as a practical rationale.
Criticisms of the Compensatory Principle
There are a number of strong criticisms of this approach to, or argument for, putting automation in place. The one I agree with most is that the work we do in engineering is never as decomposable as the list would imply. You can’t simply say “I have a lot of data analysis to do over huge amounts of data, so I’ll let the computer do that part, because that’s what it’s good at. Then it can present me the results and I can make judgements over them.” for much (if not all) of the work we do.
The systems we build have enough complexity in them that we can’t simply put tasks into these boxes or categories, because then the cost of moving between them becomes extremely high. So high that the MABA-MABA approach, as it stands, is pretty useless as a design guide. The world we’ve built around ourselves simply doesn’t fit neatly into these buckets; we move dynamically between judging and processing and calculating and reasoning and filtering and improvising.
Hollnagel unpacks it more eloquently in Joint Cognitive Systems: Foundations of Cognitive Systems Engineering:
“The compensatory approach requires that the situation characteristics can be described adequately a priori, and that the variability of human (and technological) capabilities will be minimal and perfectly predictable.”
“Yet function allocation cannot be achieved simply by substituting human functions by technology, nor vice versa, because of fundamental differences between how humans and machines function and because functions depend on each other in ways that are more complex than a mechanical decomposition can account for. Since humans and machines do not merely interact but act together, automation design should be based on principles of coagency.”
David Woods refers to Norbert’s Contrast (from Norbert Wiener’s 1950 The Human Use of Human Beings):
Artificial agents are literal minded and disconnected from the world, while human agents are context sensitive and have a stake in outcomes.
With this perspective, we can see how the work isn’t necessarily decomposable between computers and humans simply based on what each does well.
Maybe, just maybe: there’s hope in a third approach? If we were to imagine humans and machines as partners? How might we view the relationship between humans and computers through a different lens of cooperation?
That’s for the next post.
The topic of alerts and “alert design”, seen as a deliberate and purposeful thing to do, has been on my mind.
In my experience, and from asking many people in engineering and operations (at least in the web and financial trading domains), nothing spikes blood pressure like the topic of alerts. The caricature of the sysadmin waking up to a buzzing pager or phone is what comes to mind.
I think the costs of not paying attention to how your organization views or treats what comes of this behavior in operational teams (developers and systems folks included) are both largely invisible and much higher than most people think. It may be clear that what we’re talking about here is a signal:noise ratio, but it goes way beyond that. The cognitive cost to an engineer of attending to an alert (a fundamentally interrupting event by design) is akin to the cost of a software developer losing their “flow”; context switching is expensive. Expensive from a financial standpoint, a productivity perspective, and, I’ll argue, a career development view.
Here are some (likely melodramatic) assertions:
- Alert numbness and fatigue are a blight on our industry. Because we can alert on basically anything, and we can argue that anything could be a harbinger of things that could drastically affect our business, we generally put an alert on everything we get our hands on.
- Knowing something has happened almost always trumps not knowing something happened, with sometimes not much effort put into whether the “something” is important with respect to the context it’s happened in.
- Computers deciding what is important to alert on is and will always be brittle. Meaning: alerts and their criteria originate in the author’s mind, which may or may not be in the same place as the receiver of the alert in the future. In other words: we all write documentation and procedures that make sense to us when we write them. They never survive too much of the future, because our worlds that refer to them change. Example: corporate wiki pages are commonly referred to as the place where “documentation goes to die”. Alerts are no different.
Therefore, I’d love to get a much deeper and broader conversation going about alert design in our domain. Because I’ll say that it’s not the technology that sucks, it’s our use of it. Consider the possibility that you don’t have a Nagios problem, you have an alert design problem.
Down and In
As the years go by and we see the continued decline of storage prices and the explosion of accessible processing power, we have an ever-expanding ability to zoom in deeply on the ways servers and services talk to each other and process information.
We can zoom in on the relationships and behaviors of seemingly disparate pieces of data, and we can discover and detect disruptions or anomalies in sometimes surprising places. This is interesting, for sure.
But it is also woefully incomplete if we are to make any progress in technical operations.
Up and Out
It is incomplete because as we zoom out from that high-resolution metrics collection and analysis tooling, what we find is a much-ignored environment which includes one of the most powerful context-sensitive and incredibly adaptive anomaly detection and response agents in the world: humans.
Do we have anomaly detection problems? Certainly. One can argue (I will) that we will always have them, for many reasons. (One of those reasons is the Law Of Stretched Systems, but that is for a different post.)
What I’m interested in is not how software can be used to detect anomalies automatically,
(well, I’m interested, but I don’t doubt that we all will continue to get better at it)
…it is how people navigate this boundary between themselves and the machines they work with. The boundary between humans and machines, as we observe our use of tools, is a focus in and of itself. If we have any hope of making progress in monitoring complex systems, we must take this boundary into account.
As an aside, some more bullet points:
- We don’t use a single tool to gain insight into the architectures we build. And we will not, much to the dismay of many monitoring-as-a-service business models. (“A single pane of glass?! Where do I sign?!”)
- Teams of people are the norm, which means that communication and coordination become as important as (if not more important than) surfacing anomalies themselves.
- We bring our biases, expectations, trust, and perceptions to the table when it comes to monitoring and response. No tool or piece of automation will ever change that.
- Understanding the breakdowns at these boundaries between people and machines should be a part of how we approach the design of tools. Organizational behavior beats technology at every turn.
Less Code, More Social Science
When we look at Boyd’s OODA loop, we see “observe” and “orient” as critical pieces. Note that these are not Unix commands, they are human activities.
So writing code to tell computers what to look at is quite different than making sure that the code’s human supervisors are equipped or aided in what to look at when an alert goes off. Figuring out how people make sense of what is actually going on at a given point (in diagnosis? in planning? in response to an outage? in control?) is just plain hard.
A step that Don Norman (and other folks known in the world of ergonomics and human factors) have been tugging at for a couple of decades is to first attempt to understand how people consume, adapt to, work around, and make use of tools under “normal” operating conditions. Once that’s done, it’s suggested, then we can try to understand how people make sense of their world under high-tempo or escalating scenarios (during an outage, for example) when the signals they receive can sometimes be disorienting as things escalate.
- Who has ever gotten an alert and ignored it? (/me looks at alert, says “oh, it’ll probably recover, no need to look further”)
- How many alerts were received in the past week that were not actionable? (no human action was required)
- How many alerts were received in the past week as a result of known work being done (expected) but alerts were not silenced during that period?
- How many alerts were received as a result of a previously silenced alert (because work was being done) that was mistakenly un-silenced?
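One way to start answering questions like these is to tally them from whatever alert history you already keep. The following is a minimal Python sketch under loud assumptions: the `AlertRecord` fields and the `audit` function are hypothetical names made up for illustration, and your paging or monitoring tool will almost certainly record this information differently, if it records it at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRecord:
    """One alert from the past week (all field names are hypothetical)."""
    name: str
    acknowledged: bool        # did anyone look into it, or was it ignored?
    action_required: bool     # did a human actually need to do something?
    during_known_work: bool   # did it fire during expected, planned work?


def audit(records):
    """Count how many alerts fall into each of the uncomfortable categories."""
    return {
        "total": len(records),
        "ignored": sum(1 for r in records if not r.acknowledged),
        "not_actionable": sum(1 for r in records if not r.action_required),
        "during_known_work": sum(1 for r in records if r.during_known_work),
    }


# A made-up week of alerts, purely for illustration.
EXAMPLE_WEEK = [
    AlertRecord("disk_space", acknowledged=True, action_required=True, during_known_work=False),
    AlertRecord("cpu_load", acknowledged=False, action_required=False, during_known_work=False),
    AlertRecord("db_failover", acknowledged=True, action_required=False, during_known_work=True),
]

if __name__ == "__main__":
    print(audit(EXAMPLE_WEEK))
```

The counts themselves matter less than the conversation they start about why each category is as large as it is.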
Here are some quotes from engineers who have found themselves in interesting situations related to alerts:
“The whole place just lit up. I mean, all the lights came on. So instead of being able to tell you what went wrong, the lights were absolutely no help at all.”
– Comment by one space controller in mission control after the Apollo 12 spacecraft was struck by lightning (Murray and Cox 1990).
“I would have liked to have thrown away the alarm panel. It wasn’t giving us any useful information.”
– Comment by one operator at the Three Mile Island nuclear power plant to the official inquiry following the TMI accident (Kemeny 1979).
“When the alarm kept going off then we kept shutting it [the device] off [and on] and when the alarm would go off [again], we’d shut it off.”
“… so I just reset it [a device control] to a higher temperature. So I kinda fooled it [the alarm]…”
– Physicians explaining how they respond to a nuisance alarm on a computerized operating room device (Cook, Potter, Woods and McDonald 1991).
“A [computer] program alarm could be triggered by trivial problems that could be ignored altogether. Or it could be triggered by problems that called for an immediate abort [of the lunar landing]. How to decide which was which? It wasn’t enough to memorize what the program alarm numbers stood for, because even within a single number the alarm might signify many different things.
“We wrote ourselves little rules like: ‘If this alarm happens and it only happens once, don’t worry about it. If it happens repeatedly, but other indicators are okay, don’t worry about it.'” And of course, if some alarms happen even once, or if other alarms happen repeatedly and the other indicators are not okay, then they should get the LEM [lunar module] the hell out of there.
– Response to discovery of a set of computer alarms linked to the astronauts displays shortly before the Apollo 11 mission (Murray and Cox 1990).
“1202.” (Astronaut announcing that an alarm buzzer and light had gone off and the code 1202 was indicated on the computer display.)
“What’s a 1202?”
“1202, what’s that?”
– Mission control dialog as the LEM descended to the moon during Apollo 11 (Murray and Cox 1990).
“I know exactly what it [an alarm] is–it’s because the patient has been, hasn’t taken enough breaths or–I’m not sure exactly why.”
– Physician explaining one alarm on a computerized operating room device that commonly occurred at a particular stage of surgery (Cook et al. 1991).
These quotes are from the excellent paper The Alarm Problem and Directed Attention in Dynamic Fault Management (Woods, 1995).
David Woods writes at great length on the topic and gives great insight into what alerts and alarms essentially are: directed attention. For those of us operating systems that are beyond our full understanding at any given point and perspective, he shines light on the core of the alarm problem: that there is always context sensitivity to alerts, and in many ways the author/designer of the alert hasn’t imagined (can’t imagine!) how the receiver of the alert will interpret it.
For example: he points to signal detection theory as a framework for thinking about alert/alarm criteria. That is to say, there is always a relationship between true “signal” and “noise” and the trade-offs inherent in choosing the alerting criteria (sometimes, but not always, viewed as a simple threshold) can be thought of like this:
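In other words, there are four possible outcomes, and how sensitive the alerting criteria are determines the mix you get: a hit (the alert fires and something is wrong), a miss (something is wrong and no alert fires), a false alarm (the alert fires and nothing is wrong), and a correct rejection (nothing is wrong and no alert fires).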
So this is a tough one, and points out that getting good (forget about perfect!) signal-to-noise ratio is hard. Too sensitive, you’ll get too many false alarms. Not sensitive enough, and you’ll miss something.
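To make the trade-off concrete, here is a small Python sketch of the four signal detection outcomes (hit, miss, false alarm, correct rejection) for a simple threshold alert. The sample data, the threshold values, and the function names are all invented for illustration; real alerting criteria are rarely a single threshold.

```python
def classify(alert_fired: bool, problem_present: bool) -> str:
    """Name the signal detection outcome for one observation."""
    if alert_fired and problem_present:
        return "hit"
    if alert_fired and not problem_present:
        return "false_alarm"
    if not alert_fired and problem_present:
        return "miss"
    return "correct_rejection"


def tally_outcomes(samples, threshold):
    """Count outcomes when the alert fires for any value >= threshold.

    samples: iterable of (metric_value, problem_present) pairs.
    """
    counts = {"hit": 0, "false_alarm": 0, "miss": 0, "correct_rejection": 0}
    for value, problem_present in samples:
        counts[classify(value >= threshold, problem_present)] += 1
    return counts


# Made-up observations: (metric value, was there really a problem?)
SAMPLES = [
    (35, False), (48, False), (55, False), (62, True),
    (71, False), (78, True), (90, True), (95, True),
]

if __name__ == "__main__":
    for threshold in (50, 80):
        print(threshold, tally_outcomes(SAMPLES, threshold))
```

With this made-up data, the lower threshold catches every real problem at the price of two false alarms, while the higher threshold removes the false alarms and misses two real problems. No threshold on this data gets both to zero, because healthy and unhealthy values overlap.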
I’ll say that because of this, we generally err on the side of too many false alarms. For fear of missing something (or the embarrassment of it being known that you missed something going wrong with your systems!) we will crank up the sensitivity.
But in doing so, we essentially ignore the detrimental effect of the false alarms on our engineers and organizations. Underlying the false alarms are not just limitations in the alerting algorithms themselves, but the conditions and factors that the alert systems cannot detect or interpret.
An often-given example of this manifests at the Cincinnati Airport. A riverbank leading up to a particular runway there triggers a threshold in ground proximity warning systems (in-cockpit alerts) because the system can’t detect that it’s going to plateau at the runway. Pilots familiar with this particular runway at this particular airport ignore the alerts.
Once more, with feeling: the pilots, who are flying massive cylinders of metal containing many humans, ignore a Ground Proximity Warning alert.
When we talk about how the receiver of an alert will behave, we begin to uncover the context sensitivity of an alert.
How can we take into account how someone might react when they are woken up by an alert we’ve designed? Will they shake their head, wondering what it’s all about? Are we helping them understand what might be going on, or hindering them by including only the bare minimum of data?
What about the engineer who gets an alert in a sea of alerts, while an outage is ongoing? How much attention will they give one amongst a hundred?
Something that might affect our behavior when we get an alert is the amount of trust that we have in the alert: is it telling us something we should believe? Should we drop everything we’re doing in order to pay attention to it? If not, why not?
As an example of this, take the Ground Proximity Warning System I mentioned above. Turns out that in many studies across a number of years, a majority of pilots delay reacting to a GPWS alarm, not just in Cincinnati. Why? Because they take time to validate that the alarm is actually legitimate by looking out the window. This is enough of a problem that the FAA has coined this phenomenon “delayed GPWS response syndrome”.
Trust in automation: it’s a thing that might be worth thinking closely about.
“The critical point is that the challenge of fault management lies in sorting through an avalanche of raw data — a data overload problem. This is in contrast to the view that the performance bottleneck is the difficulty of picking up subtle early indications of a fault against the background of a quiescent monitored process.” (Woods, 1995)
The next time you set up an alert in your system, consider how you think the receiver of that alert will take it. Do you believe that your alert will save the day, providing information for someone to head off catastrophe before it’s too late? Or will it likely be discarded as noise amongst a sea of alerts as someone struggles to understand an outage?
“Information is not a scarce resource, attention is.” – Herb Simon
Herb Simon has mentioned this in many pieces of his writing, as David Woods and Emily Patterson remark in Can We Ever Escape From Data Overload, A Cognitive Systems Diagnosis. Thus far we’ve captured that designing alerts is hard even if we only invest effort in capturing signal, never mind providing context. Woods talks a bit more about directed attention, and about a paradox:
“Note the paradox at the heart of directed attention. Given that the supervisory agent is loaded by various other task related demands, how does one interpret information about the potential need to switch attentional focus without interrupting or interfering with the tasks or lines of reasoning already under attentional control. We can state this paradox in another way: how can one skillfully ignore a signal that should not shift attention within the current context, without first processing it — in which case it hasn’t been ignored.”
So Where Is “Design”?
“It is the expertise of the human operator that makes it possible to adapt the performance of the joint system, in real time, to unexpected events and disturbances. Every working day, across the whole spectrum of human enterprise, a large number of near-misses are prevented from turning into accidents only because human operators intervene.
The system should therefore be designed so that human adaptation is enhanced.”
(emphasis mine) – Erik Hollnagel, Expertise and Technology: Cognition & Human-Computer Cooperation, 1995
What if, instead of thinking about alerts and alert design in terms of a mental model of a subordinate or otherwise dumb messenger delivering news to us, we viewed alerting systems as a partner? What does the world look like if we designed alerting systems to cooperate with us?
If trust in alerting systems is such a big deal, as it is with the GPWS and alert numbness, what can we learn from how humans learn to trust each other, and let that influence our design decisions?
In other words: how can we design alerts that support our efforts to confirm their legitimacy, or our expectations of when an alert will fire? Is context-sensitivity part of this?
This is the type of partnership and thinking that I’m interested in.
The other day I posted about the intersections of Systems Safety and web operations and engineering.
One of the largest proponents of bringing a systems thinking perspective to safety (specifically ‘software safety’) is Dr. Nancy Leveson, who has been in that field (really a multidisciplinary field) for at least a couple of decades. She’s the author of a super book, Engineering a Safer World (free download) that discusses this very concept.
I also mentioned the firming up (still in the public comment timeframe) of REG-SCI, which turns the ARP (automation review policy) that public trading markets must comply with into regulation (not just a recommendation or suggestion).
Without commenting too much on REG-SCI itself (I have opinions on that which I can post about at a later date), I wanted to point to a Technology Roundtable that the SEC held last October, to which it invited Dr. Leveson to speak on the notion and concepts of “safe” software systems. This (presumably) laid the groundwork that went into Regulation SCI.
I clipped out her testimony; it’s about 20 minutes long, but very much worth a watch. She touches on a number of topics, but brings plain language to what organizations (both for-profit and regulatory groups, like the SEC) can expect with respect to introducing an increasing amount of technology to ‘solve’ stability issues in complex systems:
Regulation SCI is aimed towards national securities and trading exchanges primarily. And the regulation itself is almost 400 pages long. Even if the intention is to prevent the sort of calamities such as the Flash Crash, the BATS IPO event, and the Knight Capital incident…is regulation the best (or only) way to make our systems safer?
Anyone who has known me well knows that I’m generally not satisfied with skimming the surface of a topic that I feel excited about. So to them it wouldn’t be a surprise that I’m now working on (yes, while I’m still at Etsy!) a master’s degree.
Since January I’ve been working with an incredible group as part of the Master’s Degree Program in Human Factors and Systems Safety at Lund University. This program was initially started by Sidney Dekker, and now is directed by the wicked smart Johan Bergström, whose works I’ve tweeted about before. As a matter of fact, I was able to convince JB to keynote this year’s Velocity Conference in Santa Clara next month on the topic of risk, and I can’t be more excited for it.
So what am I all gaga about in this program?
To begin with, I’ve been a huge proponent of learning from other fields of engineering. In particular, how other domains perceive failures; with foresight, in hindsight, how they aim to prevent them, detect them, recover from them, and learn from them.
But I think for me it goes deeper than that.
We’re now in a world where the US State Department calls Twitter to request that database maintenance be postponed because of political events in the Middle East that could benefit from it being available. It’s all but given at this point that Facebook has had an enormous effect on global discourse on wide-ranging topics, many people pointing to its effects on the Arab Spring.
As we speak, REG-SCI is open for public comment from the SEC. Inside that piece of work are attempts to shore up safeguards and preventative measures that exchanges may have to employ to make themselves less vulnerable to perturbations and disturbances that can result in large-scale trading surprises like those that came with the Flash Crash, the BATS IPO event, and the Knight Capital incident.
And yes, here at Etsy we have been re-imagining how commerce is being done on a global scale as well.
- How do we design our systems to be resilient? Are the traditional approaches still working? How will we know when they stop working?
- How can we view the “systems” in that sentence to include the socio-technical relationship that organizations have to their service? Their employees? Their investors? The public?
- How does the political, regulatory, or commercial environment that our services expect to live in affect their operation? Do they influence the ‘safety’ of those systems?
- How do we manage the inevitable trade-offs that we experience when we move from a startup with a “Minimum Viable Product” to a globally-relied-upon service that is expected to always be on?
- What are the various ways we can anticipate, monitor, respond to, and learn from our failures and our successes?
All of these questions could be seen as technical in nature, but I’d argue that’s too simplistic. I’m interested in that beautiful and insane boundary between humans and machines, and how that relationship is intertwined in the increasingly complex systems we build over and over again.
My classmates in the program are from the US Forestry Service, air traffic control training facilities and towers, Australian mining safety, maritime accident investigation firms, healthcare and some airline pilots as well. They all have worked in high-tempo, high-consequence environments, and I’m finding even more overlap in thinking with them than I ever thought I would.
The notion that the web and internet infrastructures of tomorrow are heavily influenced by the failures of yesterday riddles me with concern and curiosity. Given that I’m now addicted to the critical thinking that many smart folks have been giving the topic for a couple of decades now, I figured that I’m not going to half-ass it, and will instead lean into it as hard as I can.
So expect more writing on the topics of complex systems, human factors, systems safety, Just Culture, and related meanderings, because next year I’ve got a thesis to write.
Something that has struck me funny recently surrounds the traditional notion of availability of web applications. With respect to its relationship to revenue, to infrastructure and application behavior, and fault protection and tolerance, I’m thinking it may be time to get a broader adjustment to the industry’s perception on the topic.
These nuances in the definition and effects of availability aren’t groundbreaking. They’ve been spoken about before, but for some reason I’m not yet convinced that they’re widely known or understood.
Impact On Business
What is laid out here in this article is something that’s been parroted for decades: downtime costs companies money, and lost value. Generally speaking, this is obviously correct, and by all means you should strive to design and operate your site with high availability and fault tolerance in mind.
But underneath the binary idea that uptime = good and downtime = bad, the reality is that there’s a lot more detail that deserves exploring.
This irritatingly-designed site has a post about a common equation to help those that are arithmetically challenged:
LOST REVENUE = (GR/TH) x I x H
GR = gross yearly revenue
TH = total yearly business hours
I = percentage impact
H = number of hours of outage
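To make the arithmetic concrete, here is a minimal Python sketch of that equation. The function name, the parameter names, and the revenue figures are all made up for illustration; nothing here comes from the site that published the formula.

```python
def blunt_lost_revenue(gross_yearly_revenue, total_yearly_hours,
                       impact_fraction, outage_hours):
    """LOST REVENUE = (GR / TH) x I x H, exactly as the blunt equation states."""
    if total_yearly_hours <= 0:
        raise ValueError("total_yearly_hours must be positive")
    # Average revenue per business hour, spread evenly over the year.
    revenue_per_hour = gross_yearly_revenue / total_yearly_hours
    return revenue_per_hour * impact_fraction * outage_hours


# Hypothetical numbers: $87.6M per year, open 24x365 (8760 hours),
# a full (100% impact) outage lasting 30 minutes.
estimate = blunt_lost_revenue(87_600_000, 8760, 1.0, 0.5)
print(estimate)  # 5000.0
```

Note what the sketch assumes: every hour of the year is worth the same, and every dollar not collected during the outage is gone for good.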
In my mind, this is an unnecessarily blunt measure. I see the intention behind this approach, because it’s not meant to be anywhere close to being accurate. But modern web operations is now a field where gathering metrics in the hundreds of thousands per second is becoming more common-place, fault-tolerance/protection is a thing we do increasingly well, and graceful degradation techniques are the norm.
In other words: there are a lot more considerations than outage minutes = lost revenue, even if you did have a decent way to calculate it (which, you don’t). Companies selling monitoring and provisioning services will want you to subscribe to this notion.
We can do better than this blunt measure, and I thought it’s worth digging in a bit deeper.
Thought experiment: if Amazon.com has a full and global outage for 30 minutes, how much revenue did it “lose”? Using the above rough equation, you can certainly come up with a number, let’s say N million dollars. But how accurate is N, really? Discussions that surround revenue loss are normally designed to motivate organizations to invest in availability efforts, so N only needs to be big and scary enough to provide that motivation. So let’s just say that goal has been achieved: you’re convinced! Availability is important, and you’re a firm believer that You Own Your Own Availability.
Outside of the “let this big number N convince you to invest in availability efforts” I have some questions that surround N:
- How many potential customers did Amazon.com lose forever, during that outage? Meaning: they tried to get to Amazon.com, with some nonzero intent/probability of buying something, found it to be offline, and will never return there again, for reasons of impatience, loss of confidence, the fact that it was an impulse-to-buy click whose time has passed, etc.
- How much revenue did Amazon lose during that 30 minute window, versus how much revenue it simply postponed when it was down, only to be executed later? In other words: upon finding the site down, they’ll return sometime later to do what they originally intended, which may or may not include buying something or participating in some other valuable activity.
- How much did that 30 minutes of downtime affect the strength of the Amazon brand, in a way that could be viewed as revenue-affecting? Meaning: are users and potential users now swayed to having less confidence in Amazon because they came to the site only to be disappointed that it’s down, enough to consider alternatives the next time they would attempt to go to the site in the future?
I don’t know the answers to these questions about Amazon, but I do know that at Etsy, those answers depend on some variables:
- the type of outage or degradation (more on that in a minute),
- the time of day/week/year
- how we actually calculate/forecast how those metrics would have behaved during the outage
So, let’s crack those open a bit, and see what might be inside…
Not all time periods can be considered equal when it comes to availability, and the idea of lost revenue. For commerce sites (or really any site whose usage varies with some seasonality) this is hopefully glaringly obvious. In other words:
X minutes of full downtime during the peak hour of the peak day of the year can be worlds apart from Y minutes of full downtime during the lowest hour of the lowest day of the year, traffic-wise.
Take for example a full outage that happens during a period of the peak day of the year, and contrast it with one that happens during a lower-period of the year. Let’s say that this graph of purchases is of those 24-hour periods, indicating when the outages happen:
The impact time of the outage during the lower-traffic day is actually longer than that of the peak day, affecting the precious Nines math by a decent margin. But yet: which outage would you rather have, if you had to have one of those?
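A small Python sketch of that comparison, under loudly stated assumptions: the per-minute purchase rates are invented, the traffic profiles are flat within each day for simplicity, and the helper names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def availability_percent(outage_minutes, period_minutes=MINUTES_PER_DAY):
    """Classic 'Nines' math: the share of the period that was not an outage."""
    return 100.0 * (1 - outage_minutes / period_minutes)


def purchases_missed(purchases_per_minute, start_minute, outage_minutes):
    """Sum the purchases we would have expected inside the outage window."""
    window = purchases_per_minute[start_minute:start_minute + outage_minutes]
    return sum(window)


# Hypothetical profiles: a peak day at 50 purchases/minute,
# and a quiet day at 5 purchases/minute.
peak_day = [50] * MINUTES_PER_DAY
quiet_day = [5] * MINUTES_PER_DAY

# A 20-minute outage on the peak day versus a 60-minute outage on the quiet day.
print(availability_percent(20), purchases_missed(peak_day, 600, 20))
print(availability_percent(60), purchases_missed(quiet_day, 600, 60))
```

With these made-up numbers, the quiet-day outage looks worse by the Nines (it is three times longer), while the peak-day outage covers more than three times as many expected purchases.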
Another temporal concern is the distribution and volume of degradation across space and time: any level of degradation could be viewed as perfect uptime as the length of the outage approaches zero.
Dig, if you will, these two outage profiles, across a 24-hour period. The first one has many small outages across the day:
and the other has the same amount of impact time, in a single go:
So here we have the same amount of time, but spread out throughout the day. Hopefully, folks will think a bit more beyond the clear “they’re both bad! don’t have outages!” and could investigate how they could be different. Some considerations in this simplified example:
- Hour of day. Note that the single large outage is “earlier” in the day. Maybe this will affect EU or other non-US users more broadly, depending on the timezone of the original graph. Do EU users have a different expectation or tolerance for outages in a US-based company’s website?
- Which outage scenario has a greater effect on the user population: if the ‘normal’ behavior is “get in, buy your thing, and get out” quickly, I could see the many-small-outages being preferable to the single large one. If the status quo is some mix of searching, browsing, favoriting/sharing, and then purchase, I could see the singular constrained outage being preferable.
Regardless, this underscores the idea that not all outages are created equal with respect to impact timing.
Loss of “availability” can also be seen as an extreme loss of performance. At a particular threshold, given the type of feedback to the user (a fast-failed 404 or browser error, versus a hanging white page and spinning “loading…”) the severity of an event being slow can effectively be the same as a full outage.
Some concerns/thought exercises around this:
- Where is this latency threshold for your site, for the functionality that is critical for the business?
- Is this threshold a cliff, or is it a continuous/predictable relationship between performance and abandonment?
There’s been much more work on performance’s effects on revenue than availability. The Velocity Conference in 2009 brought the first real production-scale numbers (in the form of a Bing/Google joint presentation as well as Shopzilla and Mozilla talks) behind how performance affects businesses, and if you haven’t read about it, please do.
Will Amazon (or Etsy) lose sales if all or a portion of its functionality is gone (or sufficiently slow) for a period of time? Almost certainly. But that question is somewhat boring without further detail.
In many cases, modern web sites don’t simply live in an “everything works perfectly” or “nothing works at all” boolean world. (To be sure, neither does the Internet as a whole.) Instead, fault-tolerance and resilience approaches allow for features and operations to degrade under a spectrum of failure conditions. Many companies build their applications to have in-flight fault tolerance that degrades the experience in the face of singular failures, and also make use of “feature flags” (Martin and Jez call them “feature toggles”), which allow for specific features to be shut off if they’re causing problems.
I’m hoping that most organizations are familiar with this approach at this point. Just because user registration is broken at the moment, you don’t want to prevent already logged-in users from using the otherwise healthy site, do you?
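For anyone who hasn’t seen the pattern, here is a minimal Python sketch of both ideas together: a feature flag that can shut a feature off, and in-flight degradation when that feature’s backend fails. Everything in it (the flag names, `fetch_favorites`, the page structure) is hypothetical and is not a description of how Etsy or anyone else actually implements this.

```python
# Hypothetical flag store; in practice this would live in config, not code.
FEATURE_FLAGS = {"favorites": True, "registration": True}


def is_enabled(flag):
    """Unknown flags default to off."""
    return FEATURE_FLAGS.get(flag, False)


def fetch_favorites(user_id):
    """Stand-in for a backend call; here it always fails, to simulate trouble."""
    raise ConnectionError("favorites backend unavailable")


def render_listing_page(user_id):
    """Build the page, letting favorites fail without taking checkout down."""
    page = {"listing": "ok", "checkout": "ok", "favorites": None}
    if is_enabled("favorites"):
        try:
            page["favorites"] = fetch_favorites(user_id)
        except ConnectionError:
            # In-flight degradation: the page renders without favorites.
            page["favorites"] = None
    return page


print(render_listing_page(user_id=1))
```

Flipping `FEATURE_FLAGS["favorites"]` to `False` skips the backend call entirely, which is the “shut it off if it’s causing problems” half of the approach.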
But these graceful degradation approaches further complicate the notion of availability, as well as its impact on the business as a whole.
For example: if Etsy’s favoriting feature is not working (because the site’s architecture allows it to gracefully fail without affecting other critical functionality), but checkout is working fine…what is the result? Certainly you might pause before marking down your blunt Nines record.
You might also think: “so what? as long as people can buy things, then favoriting listings on the site shouldn’t be considered in scope of availability.”
But consider these possibilities:
- What if Favoriting listings was a significant driver of conversions?
- If Favoriting was a behavior that led to conversions at a rate of X%, what value should X be before ‘availability’ ought to be influenced by such a degradation?
- What if Favoriting was technically working, but was severely degraded (see above) in performance?
Availability can be a useful metric, but when abused as a silver bullet to inform or even dictate architectural, business priority, and product decisions, there’s a real danger of oversimplifying what are really nuanced concerns.
Bounce-Back and Postponement
As I mentioned above, for sites that have an established community or brand, it is more likely that outages (even full ones) don’t mark an instantaneous amount of ‘lost’ revenue or activity. For a nonzero amount, they’re simply postponed. This is the area that I think could use a lot more data and research in the industry, much in the same way that the latency/conversion relationship has been investigated.
The over-simplified scenario involves something that looks like this. Instead of the blunt math of “X minutes of downtime = Y dollars of lost revenue”, we can be a bit more accurate, if we tried just a bit harder. The red is the outage:
So we have some more detail, which is that if we can make a reasonable forecast about what purchases would have done during the time of the outage, then we could make a better-informed estimate of purchases “lost” during that time period.
But is that actually the case?
What we see at Etsy is something different, a bit more like this:
Depending on:
- Position of the outage in the daily traffic profile (start-end)
- Position of the outage in the yearly season
the bounce-back volume will vary in a reasonably predictable fashion. Namely, as the length of the outage grows, the amount of bounce-back volume shrinks:
What this line of thinking doesn’t capture is how many of those users postponed their activity not for immediately after the outage, but maybe the next day because they needed to leave their computer for a meeting at work, or leaving work to commute home?
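As a rough Python sketch of the accounting described above: given a forecast of what purchases would have done and what actually happened, subtract the bounce-back surplus from the shortfall during the outage. This assumes a trustworthy per-interval forecast exists, which is the hard part; the function name and the numbers are invented for illustration and this is not Etsy’s actual method.

```python
def outage_accounting(forecast, actual, outage_start, outage_end, recovery_end):
    """Return (shortfall, bounce_back, net_loss) in purchases.

    forecast/actual are per-interval purchase counts; the outage covers
    intervals [outage_start, outage_end) and the recovery window covers
    [outage_end, recovery_end).
    """
    # Purchases we expected but did not see while the site was down.
    shortfall = sum(
        f - a for f, a in zip(forecast[outage_start:outage_end],
                              actual[outage_start:outage_end])
    )
    # Purchases above forecast right after the outage: postponed, not lost.
    bounce_back = sum(
        max(a - f, 0) for f, a in zip(forecast[outage_end:recovery_end],
                                      actual[outage_end:recovery_end])
    )
    return shortfall, bounce_back, shortfall - bounce_back


# Hypothetical series: 100 purchases per interval forecast, a two-interval
# full outage, then above-forecast volume as people come back.
forecast = [100, 100, 100, 100, 100, 100]
actual = [100, 0, 0, 130, 120, 100]
print(outage_accounting(forecast, actual, 1, 3, 6))  # (200, 50, 150)
```

The recovery window here is short on purpose. Anything postponed to the next day or later falls outside it and gets counted as “lost”, which is exactly the gap this line of thinking leaves open.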
Intention isn’t entirely straightforward to figure out, but in the cases where you have a ‘fail-over’ page that many CDNs will provide when the origin servers aren’t available, you can get some more detail about what requests (add to cart? submit payment?) came in during that time.
Regardless, availability and its effect on business metrics isn’t as easy as service providers and monitoring-as-a-service companies will have you believe. To be sure, a good amount of this investigation will vary wildly from company to company, but I think it’s well worth taking a look into.
I think that there’s a lot of institutional knowledge in our field, especially about what makes for a productive engineer. But while there are a good many books in the management field about “expert” roles and responsibilities of non-technical individual contributors, I don’t see too many modern books or posts that might shed light directly on what makes for a good senior engineer. One notable exception is of course Kate Matsudaira, who has been posting quite a good deal recently about the cultural sides of engineering.
Yet at the same time, a good lot of successful engineers whom I have known all remember the mentor who taught them what it meant to be “senior”.
I do, however, agree 100% with my friend Theo’s words about being “senior” in his chapter of the Web Operations book by O’Reilly:
“Generation X (and even more so generation Y) are cultures of immediate gratification. I’ve worked with a staggering number of engineers that expect the “career path” to take them to the highest ranks of the engineering group inside 5 years just because they are smart. This is simply impossible in the staggering numbers I’ve witnessed. Not everyone can be senior. If, after five years, you are senior, are you at the peak of your game? After five more years will you not have accrued more invaluable experience? What then? “Super engineer”? Five more years? “Super-duper engineer.” I blame the youth of our discipline for this affliction. The truth is that there are very few engineers that have been in the field of web operations for fifteen years. Given the dynamics of our industry many elected to move on to managerial positions or risk an entrepreneurial run at things.”
He’s right: this field of web operations is still quite young. So we can’t be surprised when people who have a title of ‘senior’ exhibit unsurprisingly immature behavior, both technical and non-technical. If you haven’t read Theo’s chapter, I suggest you do.
Having said that, what does it actually mean to be ‘senior’ in this discipline? I certainly have an opinion of what it means, given that I’m charged with hiring, supporting, and retaining engineers who are deemed to be senior. This notion that there is a bar to be passed in terms of career development is a good one, but I’d also add that these criteria exist on a spectrum, as opposed to a simple list of check-boxes. You don’t wake up one day and you are “senior” just because your title reflects that upon a promotion. Senior engineers don’t know everything. They’re not perfect in their technical knowledge, and they’re OK with that.
In order not to confuse titles with expectations that are fuzzy, sometimes I’ll refer to engineering maturity.
Meaning: I expect a “senior” engineer to be a mature engineer.
I’m going to gloss over the part where one could simply list the technical areas in which a mature engineer should have some level of mastery or understanding (such as “Networking”, “Filesystems”, “Algorithms”, etc.) and instead highlight the personal characteristics that in my mind give me indication that someone can influence an organization or a business positively in the domain of engineering.
Over on Quora, someone once asked me “What are the attributes (other than technical ability/experience) that makes a great VP of Technical Operations?”. The list of attributes that I mentioned in the answer came with the understanding that they are perpetual aspirations of my own. This post is similar to that answer.
I might first argue that senior engineers in web development and operations have the same characteristics as senior engineers in other fields of engineering (mechanical, electrical, chemical, etc.) in which case The Unwritten Laws of Engineering are applicable. Again, if you haven’t read this, please go do so. It was originally written in 1944, published by the American Society of Mechanical Engineers. A good excerpt from the book is here.
While the book’s structure and prose still have a dated feel (“…refrain from using profanity in the workplace…” or “…men should pay particular attention to shaving habits and the trimming of beards and mustaches…”), it gives a good outline of the non-technical expectations, responsibilities, and inner workings of an engineering organization with respect to how both managers and mature engineers might behave.
Obligatory Pithy Characteristics of Mature Engineers
All posts that attempt to give insight into aspirational characteristics must have an over-abundance of bullet points, and the field of engineering has a fair share of them. Therefore, I’m going to give you some, some of them mine and some pulled from various sources, many from the Unwritten Laws mentioned above.
Mature engineers seek out constructive criticism of their designs.
Every successful engineer I’ve met, upon finishing up a design or getting ready for a project, will continually ask their peers questions along the lines of:
- “What could I be missing?”
- “How will this not work?”
- “Will you please shoot as many holes as possible into my thinking on this?”
- “Even if it’s technically sound, is it understandable enough for the rest of the organization to operate, troubleshoot, and extend it?”
This is because they know that nothing they make will ever only be in their hands, and that good peer review is what makes better design decisions. As it’s been said elsewhere, they “beg for the bad news.”
Mature engineers understand the non-technical areas of how they are perceived.
Being able to write a Bloom Filter in Erlang, or write multi-threaded C in your sleep is insufficient. None of that matters if no one wants to work with you. Mature engineers know that no matter how complete, elegant, or superior their designs are, it won’t matter if no one wants to work alongside them because they are assholes. Condescension, belittling, narcissism, and ego-boosting behavior send the message to other engineers (maybe tacitly) to stay away. Part of being happy in engineering comes from enjoying the company of the people you work with while designing and building things. An engineer who is quick to call someone a moron is someone destined to stunt his or her career.
This also means that mature engineers have self-awareness when it comes to their communication. This isn’t to say that every mature engineer communicates perfectly, only that they have some notion about where they could be better, and continually ask for a gut-check from peers and managers on how they’re doing. They aim to be assertive, not passive or aggressive in how they get their ideas across.
I’ve mentioned it elsewhere, but I must emphasize the point more: the degree to which other people want to work with you is a direct indication on how successful you’ll be in your career as an engineer. Be the engineer that everyone wants to work with.
Now this isn’t to say that you should shy away from giving (or getting) constructive criticism on the work produced by engineering (as opposed to the engineer personally), for fear of pissing someone off. There’s a difference between calling someone a moron and pointing out faults in their code or product. In a conversation with Theo, he pointed out another possible area where our field may grow up:
“We as an industry need to (of course) refrain from critiques of human character and condition, but not shy away from critiques of work product. We need to get tougher skin and be able to receive critique through a lens that attempts to eliminate personal focus.
There will be assholes, they should be shunned. But the attitude that someone’s code is their baby should come to an end. Code doesn’t have feelings, doesn’t develop complexes and certainly doesn’t exhibit the most important trait (the ability to reproduce) of that which carries for your genetic strains.”
See also below #2 and #10 in The Ten Commandments of Egoless Programming.
I think this has a corollary from the Unwritten Laws (emphasis mine):
Be careful about whom you mark for copies of letters, memos, etc., when the interests of other departments are involved.
A lot of mischief has been caused by young people broadcasting memorandum containing damaging or embarrassing statements. Of course it is sometimes difficult for a novice to recognize the “dynamite” in such a document but, in general, it is apt to cause trouble if it steps too heavily upon someone’s toes or reveals a serious shortcoming on anybody’s part. If it has wide distribution or if it concerns manufacturing or customer difficulties, you’d better get the boss to approve it before it goes out unless you’re very sure of your ground.
This of course underscores the dated feel of the book, but in the modern era, I still believe the main point to be true. Nothing indicates that you have a lack of perspective and awareness like a poorly thought out and nonconstructive tweet that slings venomous insults. It’s a junior engineer mistake to toss insults about a piece of complex technology in 140 characters.
I certainly (much like Christopher Brown mentioned in his keynote at Velocity London) pay attention to those sorts of public remarks when I come across them so that I can note who I would reconsider hiring if they ever applied to work at Etsy.
Mature engineers do not shy away from making estimates, and are always trying to get better at it.
From the Unwritten Laws:
Promises, schedules, and estimates are necessary and important instruments in a well-ordered business. Many engineers fail to realize this, or habitually try to dodge the irksome responsibility for making commitments. You must make promises based upon your own estimates for the part of the job for which you are responsible, together with estimates obtained from contributing departments for their parts. No one should be allowed to avoid the issue by the old formula, “I can’t give a promise because it depends upon so many uncertain factors.”
Avoiding responsibility for estimates is another way of saying, “I’m not ready to be relied upon for building critical pieces of infrastructure.” All businesses rely on estimates, and all engineers working on a project are involved in Joint Activity, which means that they have a responsibility to others to make themselves interpredictable. In general, mature engineers are comfortable with working within some nonzero amount of uncertainty and risk.
Mature engineers have an innate sense of anticipation, even if they don’t know they do.
This code looks good, I’m proud of myself. I’ve asked other people to review it, and I’ve taken their feedback. Now: how long will it last before it’s rewritten? Once it’s in production, how will its execution affect resource usage? How much do I expect CPU/memory/disk/network to increase or decrease? Will others be able to understand this code? Am I making it as easy as I can for others to extend or introspect this work?
Mature engineers understand that not all of their projects are filled with rockstar-on-stage work.
However menial and trivial your early assignments may appear, give them your best effort.
Getting things done means doing things you might not be interested in. No matter how sexy a project is, there are always boring tasks. Tedious tasks. Tasks that a less mature engineer may deem beneath their dignity or their job title. My good friend Kellan Elliot-McCrea (Etsy’s CTO) had this to say about it:
“Sometimes the saving grace of a tedious task is their simplicity and maturity manifests in finishing them quickly and moving on. Sometimes tasks are tedious because they require extreme discipline and malleable attention span. It’s an odd phenomena that the most tedious tasks, only to be carried out by the most senior engineers, can also be the most terrifying.”
Mature engineers lift the skills and expertise of those around them.
They recognize that at some point, their individual contribution and potential cannot be exercised singularly. They recognize that there is only so much that can be produced by a single person, and the world’s best engineering feats are executed by teams, not singularly brilliant and lone engineers. Tom Limoncelli makes this point quite well in his post.
At Etsy we call this a “generosity of spirit.” Generosity of spirit is one of our core engineering values, but also a primary responsibility of our Staff Engineer position, a career-level position. These engineers spend the time to make sure that more junior or new engineers unfamiliar with the tech or processes we have not only understand what they are doing, but also why they are doing it. “Teaching to fish” is a mandatory skill at this level, and that requires having both patience and a perspective of investment in the rest of the organization.
Therefore instead of: “OK, move over, lemme just do it for you”, it’s instead: “Ok, let’s work on this together. I can show you how I’m writing/troubleshooting/etc. Then, you do it so I can be sure you know why/how we’re doing it this way, etc.”
Related: see below about getting credit.
Mature engineers make their trade-offs explicit when making judgements and decisions.
They realize all engineering decisions, implementations, and designs exist within a spectrum; we do not live in a binary world. They can quickly point out contexts where one successful approach or solution could work and where it could not. They know that one cannot be both efficient and thorough at the same time (The ETTO Principle), that most projects engineers work on exist on an axis of optimality and brittleness, and whether the problems they are solving are acute or chronic.
They know that they work within a spectrum of ideal and non-ideal, and are OK with that. They are comfortable with it because they strive to make the ideal and non-ideal in a design explicit. Later on in the lifecycle of a design, when the original design is not scaling anymore or needs to be replaced or rewritten, they can look back not with a perspective of how short-sighted those earlier decisions were, but instead say “yep, we made it this far with it and knew we’d have to extend or change it at some point. Looks like that time is now, let’s get to work!” instead of responding with a cranky-pants, passive-aggressive Hindsight Bias-filled remark with counterfactuals (e.g., “those idiots didn’t do it right the first time!”, “they cut corners!”, “I TOLD them this wouldn’t work!”).
Many pithy quotes exist that shine light on this notion of trade-offs, and mature engineers know that there are limits to any philosophy-laden quotes (including the ones I’m writing here):
- “Premature optimization is the root of all evil.” – a very abused maxim, and I’ve written about it before. A corollary to that might be (taken from here) ‘Understanding what is and isn’t “premature” is what separates senior engineers from junior engineers.’
- “Right tool for the job” – another abused one. The intention here is reasonable: who wants to use a tool that isn’t appropriate? But a rare perspective is that this can be detrimental when taken to the extreme. A carpenter doesn’t arm himself with every variation and size of hammer that is available, even though he may encounter hammering tasks that could be ideally handled by each one. Why? Because lugging around (and maintaining) a gazillion hammers incurs a cost. As such, decisions on this axis have trade-offs.
The tl;dr on trade-offs is that everyone cuts corners, in every project. Immature engineers discover them in hindsight, disgusted. Mature engineers spell them out at the onset of a project, accept them and recognize them as part of good engineering.
Mature engineers don’t practice CYAE (“Cover Your Ass Engineering”)
The scenario where someone will stand on ceremony as an excuse for not attempting to understand how his or her code (or infrastructure) could be touched by other parts of the system or business is a losing proposition. Covering your ass sends the implicit message that you are someone willing to throw others (on your team? in your company? in your community?) under the proverbial bus at the mere hint that your work had any flaw. Mature engineers stand up and accept the responsibility given to them. If they find they don’t have the requisite authority to be held accountable for their work, they seek out ways to rectify that.
An example of CYAE is “It’s not my fault. They broke it, they used it wrong. I built it to spec, I can’t be held responsible for their mistakes or improper specification.”
Mature engineers are empathetic.
In complex projects, there are usually a number of stakeholders. In any project, the designers, product managers, operations engineers, developers, and business development folks all have goals and perspectives, and mature engineers realize that those goals and views may be different. They understand this so that they can navigate effectively in the work that they do. Being empathetic in this sense means having the ability to view the project from another person’s perspective and to take that into consideration in your own work.
Goal conflicts are inherent in all engineering work, and complaining about them (instead of embracing them as requirements for success) is a sign of a less mature engineer.
They don’t make empty complaints.
Instead, they express judgements based on empirical evidence and bring with those judgements options for solving the problem which they’ve identified. A great manager of mine said to never go to your boss with a complaint about anything without at least one (ideally more than one) suggestion for a solution. Even demonstrating that you’ve tried working the problem on your own and came up empty-handed is better than an empty complaint.
Mature engineers are aware of cognitive biases
This isn’t to say that every mature engineer needs to have a degree in psychology, but cognitive biases are what can limit the growth of an engineer’s career at a certain point. Even if they’re not aware of the details of how they appear or how these biases can be guarded against, most mature engineers I know have a level of self-awareness to at least recognize they (like everyone) are susceptible to them.
Culturally, engineers work day-to-day with empirical evidence and research. Basically: show me the data. The issue with cognitive biases is that we can be blissfully unaware of when we are interpreting data with our own brains in ways that defy empirical data, and that this can have a surprising effect on how we get work done and work on teams.
A great list of them exists on Wikipedia, but some of the ones that I’ve seen engineers (including myself) fall prey to are:
- Self-Serving Bias – basically: if something is good, it’s probably because of something I did or thought of. If it’s bad, it’s probably the doing of someone else.
- Fundamental Attribution Error – basically: the bad results that someone else got from his work must have something to do with how he is, personally (stupid, clumsy, sloppy, etc.) whereas if I get bad results, it’s because of the context that I was in, the pressure I was under, the situation I was in, etc.
- Hindsight Bias – (it is said that this is the most-studied phenomenon in the history of modern psychology) basically: after an untoward or negative event (a severe bug, an outage, etc.) “I knew it all along!”. It is the very strong tendency to view the past more simply than it was in reality. You can tell there is Hindsight Bias going on when descriptions involve counterfactuals, or “…they should have…”, or “…how did they not see that, it’s so obvious!”.
- Outcome Bias – like above, this comes up after a surprising or negative event. If the event was very damaging, expensive to clean up, or severe, then the decisions or actions that contributed to that event are judged to be very stupid, reckless, or negligent. The judgement is proportional to how severe the event was.
- Planning Fallacy – (related to the point about making estimates under uncertainty, above) basically: being overly optimistic when forecasting the time a particular project will take.
There are plenty of others, all of which I find personally fascinating and I can get lost in learning more about them. Highly suggested reading, if you’re at all interested in learning about how you might be limiting your own effectiveness.
The Ten Commandments of Egoless Programming
Appropriate, even if old…I’ve seen it referenced as coming from The Psychology of Computer Programming, written in 1971, but I don’t actually see it in the text. Regardless, here are The Ten Commandments of Egoless Programming, found on @wyattdanger‘s blog post on receiving advice from his dad:
- Understand and accept that you will make mistakes. The point is to find them early, before they make it into production. Fortunately, except for the few of us developing rocket guidance software at JPL, mistakes are rarely fatal in our industry. We can, and should, learn, laugh, and move on.
- You are not your code. Remember that the entire point of a review is to find problems, and problems will be found. Don’t take it personally when one is uncovered. (Allspaw note – related: see below, number #10, and the points Theo made above.)
- No matter how much “karate” you know, someone else will always know more. Such an individual can teach you some new moves if you ask. Seek and accept input from others, especially when you think it’s not needed.
- Don’t rewrite code without consultation. There’s a fine line between “fixing code” and “rewriting code.” Know the difference, and pursue stylistic changes within the framework of a code review, not as a lone enforcer.
- Treat people who know less than you with respect, deference, and patience. Non-technical people who deal with developers on a regular basis almost universally hold the opinion that we are prima donnas at best and crybabies at worst. Don’t reinforce this stereotype with anger and impatience.
- The only constant in the world is change. Be open to it and accept it with a smile. Look at each change to your requirements, platform, or tool as a new challenge, rather than some serious inconvenience to be fought.
- The only true authority stems from knowledge, not from position. Knowledge engenders authority, and authority engenders respect – so if you want respect in an egoless environment, cultivate knowledge.
- Fight for what you believe, but gracefully accept defeat. Understand that sometimes your ideas will be overruled. Even if you are right, don’t take revenge or say “I told you so.” Never make your dearly departed idea a martyr or rallying cry.
- Don’t be “the coder in the corner.” Don’t be the person in the dark office emerging only for soda. The coder in the corner is out of sight, out of touch, and out of control. This person has no voice in an open, collaborative environment. Get involved in conversations, and be a participant in your office community.
- Critique code instead of people – be kind to the coder, not to the code. As much as possible, make all of your comments positive and oriented to improving the code. Relate comments to local standards, program specs, increased performance, etc.
Novices versus Experts
Now I generally don’t follow too much on knowledge acquisition as a research topic, but I do believe it’s hard to get away from when talking about the evolving nature of a discipline. One bit of interesting breakdown comes from a paper from Dreyfus and Dreyfus called “A Five Stage Model of the Mental Activities Involved in Directed Skill Acquisition” which has laid out characteristics of various levels of expertise:
The paper goes on to state:
Novices operate from an explicit rules and knowledge-based perspective. They are deliberate and analytical, and therefore slower to take action, they decide or choose.
(which means that novices are deeply subject to local rationality)
Experts operate from a mature, holistic well-tried understanding, intuitively and without conscious deliberation. This is a function of experience. They do not see problems as one thing and solutions as another, they act.
(which means that experts are context driven)
I don’t necessarily subscribe to the notion of such dry lines being drawn between skill levels, because I think that there is a lot more granularity and facets of expertise than just those outlined above, but I think it’s helpful to be aware of the unfortunately over-simplified categories.
Dirty secret: mature engineers know the importance of (sometimes irrational) feelings people have. (gasp!)
How people feel about technologies, technical decisions, and technical directions is just as important as (if not more important than) the facts about the details. Mature engineers know this, and adjust accordingly. Again, being empathetic can help you understand how another person on your team feels about a technical decision, even if they themselves don’t have an easy time articulating why they feel that way.
People’s confidence in software, architectures, or patterns is heavily influenced by past experience, and can result in positive or negative reactions to using them. Used to work at a mod_perl shop that had a lot of mystifying outages? Then you can’t be surprised to feel reluctant to use it in a different company, even if the supporting expertise and use cases are entirely different. All you remember is that mod_perl = major headaches, so you’re going to be wary of using it in any context again.
Mature engineers understand this phenomenon when making a case to use technology that carries baggage, even if it’s irrational. Convincing a group to use tools and patterns that they aren’t comfortable with isn’t a straightforward task. The “right tool for the job” maxim also has (sometimes unquantifiable) comfortability as a parameter.
For an illustration of how people’s emotions drive technical decisions and opinions, read any flame war about anything, ever.
“It is amazing what you can accomplish if you do not care who gets credit.”
This quote is commonly attributed to Harry S. Truman, but it looks like it might have first been said by a Jesuit priest in a different form. In any case, this is another indication you’re working with a mature engineer: they hold the success of the project much higher than the potential praise they may get personally for working on it. The attribution of praise or credit can be the source of such dysfunction in an engineering-driven organization, and I believe it’s because it’s largely invisible.
The notion is liberating, and once understood and internalized, a world of progress and innovative thinking can flourish, because the engineer isn’t overly concerned with the personal liability of equating the work to their own career success.
Not The End
I’m at the moment blessed to work with a number of mature engineers here at Etsy, and it’s quite humbling. We are indeed a young field, and while I think we can learn a great deal from other fields of engineering on this topic, I also think we have an advantage. The web is inextricably tied to the notion of publishing and sharing information, globally. We need to continue pointing out what it means to be a “senior” and “mature” engineer if we have a hope of progressing the field into a true discipline.
Many thanks to members of the Etsy Operations team, Mike Brittain, Kellan Elliott-McCrea, Marc Hedlund, and Theo Schlossnagle for reviewing drafts of this post. They all make me a more mature engineer.
(Part 1 of 2 posts)
I’ve been percolating on this post for a long time. Thanks very much to Mark Burgess for reviewing early drafts of it.
One of the ideas that permeates our field of web operations is that we can’t have enough automation. You’ll see experience with “building automation” on almost every job description, and many post-mortem transcriptions around the world have remediation items that state that more automation needs to be in place to prevent similar incidents.
“AUTOMATE ALL THE THINGS!” the meme says.
But the where, when, and how to design, implement, and operate automation is not as straightforward as “AUTOMATE ALL THE THINGS!”
I’d like to explore this concept that everything that could be automated should be automated, and I’d like to take a stab at putting context around the reasons why we might think this is a good idea. I’d also like to give some background on the research of how automation is typically approached, the reasoning behind various levels of automation, and most importantly: the spectrum of downsides of automation done poorly or haphazardly.
(Since it’s related, I have to include an obligatory link to Github’s public postmortem on issues they found with their automated database failover, and some follow-up posts that are well worth reading.)
In a recent post by Mathias Meyer he gives some great pointers on this topic, and strongly hints at something I also agree with, which is that we should not let learnings from other safety-related fields (aviation, combat, surgery, etc.) go to waste, because there are some decades of thinking and exploration there. This is part of my plan for exploring automation.
Frankly, I think that we as a field could have a more mature relationship with automation. Not unlike the relationship humans have with fire: a cautious but extremely useful one, not without risks.
I’ve never done a true “series” of blog posts before, but I think this topic deserves one. There’s simply too much in this exploration to have in a single post.
What this means: There will not be, nor do I think should there ever be, a tl;dr for a mature role of automation, other than: its value is extremely context-specific, domain-specific, and implementation-specific.
If I’m successful with this series of posts, I will convince you to at least investigate your own intuition about automation, and get you to bring the same “constant sense of unease” that you have with making change in production systems to how you design, implement, and reason about it. In order to do this, I’m going to reference a good number of articles that will branch out into greater detail than any single blog post could shed light on.
Bluntly, I’m hoping to use some logic, research, science, and evidence to approach these sorts of questions:
- What do we mean when we say “automation”? What do those other fields mean when they say it?
- What do we expect to gain from using automation? What problem(s) does it solve?
- Why do we reach for it so quickly sometimes, so blindly sometimes, as the tool to cure all evils?
- What are the (gasp!) possible downsides, concerns, or limitations of automation?
- And finally – given the potential benefits and concerns with automation, what does a mature role and perspective for automation look like in web engineering?
Given that I’m going to mention limitations of automation, I want to be absolutely clear, I am not against automation. On the contrary, I am for it.
Or rather, I am for: designing and implementing automation while keeping an eye on both its limitations and benefits.
So what limitations could there be? The story of automation (at least in web operations) is one of triumphant victory. The reason that we feel good and confident about reaching for automation is almost certainly due to the perceived benefits we’ve received when we’ve done it in the past.
Canonical example: engineer deploys to production by running a series of commands by hand, to each server one by one. Man that’s tedious and error-prone, right? Now we’ve automated the whole process, it can run on its own, and we can spend our time on more fun and challenging things.
This is a prevailing perspective, and a reasonable one.
Of course we can’t ditch the approach of automation, even if we wanted to. Strictly speaking, almost every use of a computer is to some extent using “automation”, even if we are doing things “by hand.” Which brings me to…
Definitions and Foundations
I’d like to point at the term itself, because I think it’s used in a number of different contexts to mean different things. If we’re to look at it closely, I’d like to at least clarify what I (and others who have researched the topic quite well) mean by the term “automation”. The word comes from the Greek: auto, meaning ‘self’, and matos, meaning ‘willing’, which implies something is acting on its own accord.
Some modern definitions:
“Automation is defined as the technology concerned with the application of complex mechanical, electronic, and computer based systems in the operations and control of production.” – Raouf (1988)
‘Automation’ as used in the ATA Human Factors Task Force report in 1989 refers to…“a system or method in which many of the processes of production are automatically controlled or performed by self-operating machines, electronic devices, etc.” – Billings (1991)
“We define automation as the execution by a machine agent (usually a computer) of a function that was previously carried out by a human.” – Parasuraman (1997)
I’ll add to that somewhat broad definition functions that have never been carried out by a human. Namely, processes and tasks that could never be performed by a human, by exploiting the resources available in a modern computer. The recording and display of computations per second, for example.
To help clarify my use of the term:
- Automation is not just about provisioning and configuration management. Although this is maybe the most popular context in which the term is used, it’s almost certainly not the only place for automation.
- It’s also not simply the result of programming what were previously performed as manual tasks.
- It can mean enforcing predefined or dynamic limits on operational tasks, automated or manual.
- It can mean surfacing, displaying, and analyzing metrics from tasks and actions.
- It can mean making decisions and possibly taking action on observed states in a system.
Some familiar examples of these facets of automation:
- MySQL max_connections and Apache’s MaxClients directives: these are upper bounds intended to prevent high workloads from causing damage.
- Nagios (or any monitoring system for that matter): these perform checks on values and states at rates and intervals only a computer could perform, and can also take action on those states in order to course-correct a process (as with Event Handlers)
- Deployment tools and configuration management tools (like Deployinator, as well as Chef/Puppet/CFEngine, etc.)
- Provisioning tools (golden-image or package-install based)
- Any collection or display of metrics (StatsD, Ganglia, Graphite, etc.)
Which is basically…well, everything, in some form or another in web operations.
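Two of those facets, a predefined limit and a check that can take action on an observed state, are small enough to sketch. What follows is only an illustration in Python, not how MySQL, Apache, or Nagios implement anything; the names `BoundedPool`, `LimitExceeded`, and `check_and_handle` are hypothetical.

```python
class LimitExceeded(Exception):
    """Raised when a request would push the pool past its upper bound."""


class BoundedPool:
    """Hypothetical stand-in for a limit in the spirit of max_connections."""

    def __init__(self, max_clients):
        self.max_clients = max_clients
        self.active = 0

    def acquire(self):
        # Refuse new work rather than let load grow without bound.
        if self.active >= self.max_clients:
            raise LimitExceeded(f"limit of {self.max_clients} reached")
        self.active += 1

    def release(self):
        if self.active > 0:
            self.active -= 1


def check_and_handle(value, threshold, handler):
    """Compare an observed value to a threshold and call handler if it is exceeded.

    Returns "OK" or "CRITICAL" so that a human (or another tool) can see the result.
    """
    if value > threshold:
        handler(value)
        return "CRITICAL"
    return "OK"
```

Even in something this small, a designer has already decided what the limit is, what happens when it is hit, and who finds out about it.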
Domains To Learn From
In many of the papers found in Human Factors and Resilience Engineering, and in blog posts that generally talk about limitations of automation, it’s done in the context of aviation. And what a great context that is! You have dramatic consequences (people die) and you have a plethora of articles and research to choose from. The volume of research done on automation in the cockpit is large due to the drama (people die, big explosions, etc.) so no surprise there.
Except the difference is, in the cockpit, human and machine elements have a different context. There are mechanical actions that the operator can and needs to do during takeoff and landing. They physically step on pedals, push levers and buttons, watch dials and gauges in various points during takeoff and landing. Automation in that context is, frankly, much more evolved there, and the contrast (and implicit contract) there between man and machine is much more stark than in the context of web infrastructures. Display layouts, power-assisted controls…we should be so lucky to have attention like that paid to our working environment in web operations! (but also, cheers to people not dying when the site goes down, amirite?)
My point is that while we discuss the pros, cons, and considerations for designing automation to help us in web operations, we have to be clear that we are not aviation, and that our discussion should reflect that while still trying to glean information from that field’s use of it.
We ought to understand also that when we are designing tasks, automation is but one (albeit a complex one) approach we can take, and that it can be implemented in a wide spectrum of ways. This also means that if we decide in some cases to not automate something (gasp!) or to step back from full automation for good reason, we shouldn’t feel bad or failed about it. Ray Kurzweil and the nutjobs that think the “singularity” is coming RealSoonNow™ won’t be impressed, but then again you’ve got work to do.
So Why Do We Want to Use Automation?
Historically, automation is used for:
- Precision
- Stability
- Speed
Which sounds like a pretty good argument for it, right? Who wants to be less precise, less stable, or slower? Not I, says the Ops guy. So using automation at work seems like a no-brainer. But is it really just as simple as that?
Some common motivations for automation are:
- Reduce or eliminate human error
- Reduction of the human’s workload. Specifically, ridding humans of boring and tedious tasks so they can tackle the more difficult ones
- Bring stability to a system
- Reduce fatigue on humans
No article about automation would be complete without pointing first at Lisanne Bainbridge’s 1983 paper, “The Ironies of Automation”. I would put her work here as modern canonical on the topic. Any self-respecting engineer should read it. While its prose is somewhat dated, the value is still very real and pertinent.
What she says, in a nutshell, is that there are at least two ironies with automation, from the traditional view of it. The premise reflects a gut intuition that pervades many fields of engineering, and one that I think should be questioned:
The basic view is that the human operator is unreliable and inefficient, and therefore should be eliminated from the system.
Roger that. This supports the idea to take humans out of the loop (because they are unreliable and inefficient) and replace them with automated processes.
The first irony is:
Designer errors [in automation] can be a major source of operating problems.
This means that the designers of automation make decisions about how it will work based on how they envision the context in which it will be used. There is a very real possibility that the designer hasn’t imagined (or, can’t imagine) every scenario and situation the automation and human will find themselves in, and so therefore can’t account for it in the design.
Let’s re-read the statement: “This supports the idea to take humans out of the loop (because they are unreliable and inefficient) and replace them with automated processes.”…which are designed by humans, who are assumed to be unrelia…oh, wait.
The second irony is:
The designer [of the automation], who tries to eliminate the operator, still leaves the operator to do the tasks which the designer cannot think how to automate.
Which is to say that because the designers of automation can’t fully automate the human “out” of everything in a task, the human is left to cope with what’s left after the automated parts. Which by definition are the more complex bits. So the proposed benefit of relieving humans of cognitive workload isn’t exactly realized.
There are some more generalizations that Bainbridge makes, paraphrased by James Reason in Managing The Risks of Organizational Accidents:
- In highly automated systems, the task of the human operator is to monitor the systems to ensure that the ‘automatics’ are working as they should. But it’s well known that even the best motivated people have trouble maintaining vigilance for long periods of time. They are thus ill-suited to watch out for these very rare abnormal conditions.
- Skills need to be practiced continuously in order to preserve them. Yet an automatic system that fails only very occasionally denies the human operator the opportunity to practice the skills that will be called upon in an emergency. Thus, operators can become deskilled in just those abilities that justify their (supposedly) marginalized existence.
- And ‘Perhaps the final irony is that it is the most successful automated systems with rare need for manual intervention which may need the greatest investment in operator training.’
Bainbridge’s exploration of ironies and costs of automation brings a much more balanced view of the topic, IMHO. It also points to something that I don’t believe is apparent to our community, which is that automation isn’t an all-or-nothing proposition. It’s easy to bucket things that humans do, and things that machines do, and while the two do meet from time to time in different contexts, it’s simpler to think of their abilities apart from each other.
Viewing automation instead on a spectrum of contexts can break this oversimplification, which I think can help us gain a glimpse into what a more mature perspective towards automation could look like.
Levels Of Automation
It would seem automation design needs to be done with the context of its use in mind. Another fundamental work in the research of automation is the so-called “Levels Of Automation”. In their seminal 1978 paper “Human And Computer Control of Undersea Teleoperators”, Sheridan and Verplank lay out the landscape for where automation exists along the human-machine relationship (Table 8.2 in the original and most excellent vintage typewritten engineering paper):
|Automation Level||Automation Description|
|1||The computer offers no assistance: human must take all decisions and actions.|
|2||The computer offers a complete set of decision/action alternatives, or|
|3||…narrows the selection down to a few, or|
|4||…suggests one alternative, and|
|5||…executes that suggestion if the human approves, or|
|6||…allows the human a restricted time to veto before automatic execution, or|
|7||…executes automatically, then necessarily informs humans, and|
|8||…informs the human only if asked, or|
|9||…informs him after execution if it, the computer, decides to.|
|10||The computer decides everything and acts autonomously, ignoring the human.|
This was extended later in Parasuraman, Sheridan, and Wickens (2000) “A Model for Types and Levels of Human Interaction with Automation” to include four stages of information processing within which each level of automation may exist:
- Information Acquisition. The first stage involves the acquisition, registration, and position of multiple information sources similar to that of humans’ initial sensory processing.
- Information Analysis. The second stage refers to conscious perception, selective attention, cognition, and the manipulation of processed information such as in the Baddeley model of information processing
- Decision and Action Selection. Next, automation can make decisions based on information acquisition, analysis and integration.
- Action Implementation. Finally, automation may execute forms of action.
Viewing the above 10 Levels of Automation (LOA) as a spectrum within each of those four stages allows for a way of discerning where and how much automation could (or should) be implemented, in the context of performance and cost of actions. This feels to me like a step towards making mature decisions about the role of automation in different contexts.
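One way to write that idea down is as a small data structure: pick a level for each of the four stages, then ask where a human still has to act. The sketch below is a hedged illustration in Python. The member names of `Level` are shorthand invented for this sketch, `needs_human_first` and `stages_needing_human` are hypothetical helpers, and `example_profile` describes a made-up tool rather than anything from the papers.

```python
from enum import Enum, IntEnum


class Stage(Enum):
    """The four stages from Parasuraman, Sheridan, and Wickens (2000)."""
    INFORMATION_ACQUISITION = 1
    INFORMATION_ANALYSIS = 2
    DECISION_AND_ACTION_SELECTION = 3
    ACTION_IMPLEMENTATION = 4


class Level(IntEnum):
    """The ten levels of automation; member names are shorthand for this sketch."""
    NO_ASSISTANCE = 1
    OFFERS_ALL_ALTERNATIVES = 2
    NARROWS_ALTERNATIVES = 3
    SUGGESTS_ONE = 4
    EXECUTES_IF_APPROVED = 5
    VETO_WINDOW = 6
    EXECUTES_THEN_INFORMS = 7
    INFORMS_IF_ASKED = 8
    INFORMS_IF_IT_DECIDES = 9
    FULLY_AUTONOMOUS = 10


def needs_human_first(level):
    """Levels 1-5: nothing is executed unless a human acts or approves."""
    return level <= Level.EXECUTES_IF_APPROVED


def stages_needing_human(profile):
    """Return the stages (in stage order) where a human must act or approve."""
    return [stage for stage in Stage if needs_human_first(profile[stage])]


# A made-up profile for a made-up tool, purely for illustration.
example_profile = {
    Stage.INFORMATION_ACQUISITION: Level.FULLY_AUTONOMOUS,
    Stage.INFORMATION_ANALYSIS: Level.EXECUTES_THEN_INFORMS,
    Stage.DECISION_AND_ACTION_SELECTION: Level.SUGGESTS_ONE,
    Stage.ACTION_IMPLEMENTATION: Level.EXECUTES_IF_APPROVED,
}
```

For this profile, `stages_needing_human(example_profile)` returns the last two stages: the computer gathers and analyzes on its own, but a human still chooses and approves the action.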
Here is an example of these stages and the LOA in each of them, suggested for Air Traffic Control activities:
Endsley (1999) also came up with a similar paradigm of stages of automation, in “Level of automation effects on performance, situation awareness and workload in a dynamic control task”
What are examples of viewing LOA in the context of web operations and engineering?
At Etsy, we’ve made decisions (sometimes implicitly) about the levels of automation in various tasks and tooling:
- Deployinator: assisted by automated processes, humans trigger application code deploys to production. The when and what is human-centered. The how is computer-centered.
- Chef: humans decide on some details in recipes (this configuration file in this place), computers decide on others (use 85% of total RAM for memcached, other logic in templates), and computers decide on automatic deployment (random 10 minute splay for Chef client runs). Mostly, humans provide the what, and computers decide the when and how.
- Database Schema changes: assisted by automated processes, humans trigger the what and when, computer handles the how.
- Event handling: some Nagios alerts trigger simple self-healing attempts upon some (not all) alertable events. Human decides what and how. Computer decides when.
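The event handling case can be sketched to show where the human and computer parts sit. This is a minimal Python illustration under assumptions, not Etsy's tooling or the Nagios event handler API: `handle_alert`, its parameters, and the bounded-retry policy are all hypothetical. The human writes the remediation (the what and how); the code decides when to run it and when to give up and ask for help.

```python
def handle_alert(check, remediate, notify_human, max_attempts=2):
    """Try a human-authored remediation a bounded number of times.

    check: callable returning True when the service is healthy.
    remediate: callable written by a human (the "what" and "how").
    notify_human: callable invoked with the attempt count when
        self-healing has not worked.
    Returns the number of remediation attempts made.
    """
    attempts = 0
    # The computer decides "when": only while unhealthy, and only so many times.
    while not check() and attempts < max_attempts:
        remediate()
        attempts += 1
    # Hand the problem back to a person if it is still broken.
    if not check():
        notify_human(attempts)
    return attempts
```

The bound on attempts is the design choice worth noticing: it keeps the self-healing simple and leaves the unfamiliar cases to a person.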
I suspect that in many organizations, the four stages of automation (from Parasuraman, Sheridan, and Wickens) line up something like this, with regards to the breakdown in human or computer function allocation:
|Decision and Action Selection||
In many cases, what level of automation is appropriate and in which context is informed by the level of trust that operators have in the automation to be successful.
Do you trust an iPhone’s ability to auto-correct your spelling enough to blindly accept all suggestions? I suspect no one would, and the iPhone auto-correct designers know this because they’ve given the human the veto power of the suggestion by putting an “x” next to them. (following automation level 5, above)
Do you trust a GPS routing system enough to follow it without question? Let’s hope not. Given that there is context missing, such as stop signs, red lights, pedestrians, and other dynamic phenomena going on in traffic, GPS automobile routing may be a good example of keeping the LOA at level 4 and below, and even then only sticking to the “Information Acquisition” and “Information Analysis” stages, and keeping the “Decision and Action Selection” and “Action Implementation” stages to the human who can recognize the more complex context.
In “Trust in Automation: Designing for Appropriate Reliance”, John D. Lee and Katrina A. See investigate the concerns surrounding trusting automation, including organizational issues, cultural issues, and context that can influence how automation is designed and implemented. They outline a concern that I think should be familiar to anyone who has had experiences (good or bad) with automation (emphasis mine):
As automation becomes more prevalent, poor partnerships between people and automation will become increasingly costly and catastrophic. Such flawed partnerships between automation and people can be described in terms of misuse and disuse of automation (Parasuraman & Riley, 1997).
Misuse refers to the failures that occur when people inadvertently violate critical assumptions and rely on automation inappropriately, whereas disuse signifies failures that occur when people reject the capabilities of automation.
Misuse and disuse are two examples of inappropriate reliance on automation that can compromise safety and profitability.
They discuss methods on making automation trustable:
- Design for appropriate trust, not greater trust.
- Show the past performance of the automation.
- Show the process and algorithms of the automation by revealing intermediate results in a way that is comprehensible to the operators.
- Simplify the algorithms and operation of the automation to make it more understandable.
- Show the purpose of the automation, design basis, and range of applications in a way that relates to the users’ goals.
- Train operators regarding its expected reliability, the mechanisms governing its behavior, and its intended use.
- Carefully evaluate any anthropomorphizing of the automation, such as using speech to create a synthetic conversational partner, to ensure appropriate trust.
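Of these, "show the past performance of the automation" is the easiest to sketch. The Python below is a minimal illustration under assumptions, not something from the paper: `TrackRecord` is a hypothetical helper that keeps the most recent outcomes of an automated action so that an operator can see how it has actually behaved.

```python
from collections import deque


class TrackRecord:
    """Keeps the most recent outcomes of an automated action (hypothetical helper)."""

    def __init__(self, window=100):
        # Only the last `window` outcomes are kept; older ones fall off.
        self.outcomes = deque(maxlen=window)

    def record(self, succeeded, detail=""):
        self.outcomes.append((bool(succeeded), detail))

    def success_rate(self):
        # None rather than 0.0 or 1.0 when there is no history to report.
        if not self.outcomes:
            return None
        wins = sum(1 for ok, _ in self.outcomes if ok)
        return wins / len(self.outcomes)

    def recent_failures(self):
        return [detail for ok, detail in self.outcomes if not ok]
```

Returning `None` for an empty history is deliberate: an automation with no track record should not look like one that has never failed, which fits the advice to design for appropriate trust rather than greater trust.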
Adam Jacob, in a private email thread with me and some others, had some very insightful things to say on the topic:
The practical application of the ironies isn’t that you should/should not automate a given task, it’s answering the questions of “When is it safe to automate?”, perhaps followed by “How do I make it safe?”. We often jump directly to “automation is awesome”, which is an answer to a different question.
[if you were to ask]…”how do you draw the line between what is and isn’t appropriate?”, I come up with a couple of things:
- The purpose of automation is to serve a need – for most of us, it’s a business need. For others, it’s a human-critical one (do not crash planes full of people regularly due to foreseeable pilot error.)
- Recognize the need you are serving – it’s not good for its own sake, and different needs call for different levels of automation effort.
- The implementers of that automation have a moral imperative to create automation that is serviceable, instrumented, and documented.
- The users of automation have an imperative to ensure that the supervisors understand the system in detail, and can recover from
I think Adam is putting this eloquently, and I think it’s an indication that we as a field are moving towards a more mature perspective on the subject.
There is a growing notion amongst those who study the history, ironies, limitations, and advantages of automation that an evolved perspective on the human-machine relationship may look a lot like human-human relationships. Specifically, the characteristics that govern groups of humans that are engaged in ‘joint activity’ could also be seen as ways that automation could interact.
Collaboration, communication, and cooperation are some of the hallmarks of teamwork amongst people. In “Ten Challenges for Making Automation a ‘Team Player’ in Joint Human-Agent Activity” David Woods, Gary Klein, Jeffrey M. Bradshaw, Robert R. Hoffman, and Paul J. Feltovich make a case for how such a relationship might exist. I wrote briefly a little while ago about the ideas that this paper rests on, in this post here about how people work together.
Here are these ten challenges the authors say we face, where ‘agents’ = humans and machines/automated processes designed by humans:
- Basic Compact – Challenge 1: To be a team player, an intelligent agent must fulfill the requirements of a Basic Compact to engage in common-grounding activities.
- Adequate models – Challenge 2: To be an effective team player, intelligent agents must be able to adequately model the other participants’ intentions and actions vis-à-vis the joint activity’s state and evolution—for example, are they having trouble? Are they on a standard path proceeding smoothly? What impasses have arisen? How have others adapted to disruptions to the plan?
- Predictability – Challenge 3: Human-agent team members must be mutually predictable.
- Directability – Challenge 4: Agents must be directable.
- Revealing status and intentions – Challenge 5: Agents must be able to make pertinent aspects of their status and intentions obvious to their teammates.
- Interpreting signals – Challenge 6: Agents must be able to observe and interpret pertinent signals of status and intentions.
- Goal negotiation – Challenge 7: Agents must be able to engage in goal negotiation.
- Collaboration – Challenge 8: Support technologies for planning and autonomy must enable a collaborative approach.
- Attention management – Challenge 9: Agents must be able to participate in managing attention.
- Cost control – Challenge 10: All team members must help control the costs of coordinated activity.
I do recognize these to be traits and characteristics of high-performing human teams. Think of the best teams in many contexts (engineering, sports, political, etc.) and these certainly show up. Can humans and machines work together just as well? Maybe we’ll find out over the next ten years.
“The question is no longer whether one or another function can be automated, but, rather, whether it should be.” – Wiener & Curry (1980)
“…and in what ways it should be automated.” – John Allspaw (right now, in response to Wiener & Curry’s quote above)