In between reading copious amounts of indignation surrounding whatever is suboptimal about healthcare.gov, you may or may not have noticed the SEC statement regarding the Knight Capital accident that took place in 2012.
This Release No. 70694 is a document that contains many details about the accident, and you can read what looks on the surface like an in-depth analysis of what went wrong and how best to prevent such an accident from happening in the future.
You may believe this document can serve as a ‘post-mortem’ narrative. It cannot, and should not.
Any ‘after-action’ or ‘postmortem’ document (in my domain of web operations and engineering) has two main goals:
- To provide an explanation of how an event happened, as the organization (including those closest to the work) best understands it.
- To produce artifacts (recommendations, remediations, etc.) aimed at both prevention and the improvement of detection and response approaches to aid in handling similar events in the future.
You need the first in order to work on the second. If you don’t understand how the event unfolded, you can’t make gains towards prevention in the future.
The purpose of this post is to outline how the release is not something that can or should be used for explanation or prevention.
The Release No. 70694 document does not address either of those concerns in any meaningful way.
What it does address, however, is exactly what a regulatory body is tasked to do in the wake of a known outcome: contrast how an organization was or was not in compliance with the rules that the body has put in place. Nothing more, nothing less. In this area, the document is concise and focused.
You can be forgiven for thinking that the document could serve as an explanation, because you can find some technical details in it. It looks a little bit like a timeline. What is interesting is not what details are covered, but what details are not covered, including the organizational sensemaking that is part of every complex systems failure.
If you are looking for a real postmortem of the Knight Capital accident in this post, you’re going to be disappointed. At the end of this post, I will certainly attempt to list some questions that I might pose if I were facilitating a debriefing of the event, but no real investigation can happen without the individuals closest to the work involved in the discussion.
However, I’d like to write up a bit about why it should not be viewed as what is traditionally known (at least in the web operations and engineering community) as a postmortem report. Because frankly I think that is more important than the specific event itself.
But before I do that, it’s necessary to unpack a few concepts related to learning in a retrospective way, as in a postmortem…
Learning from events in the past (both successful and unsuccessful) puts us into a funny position as humans. In a process that is genuinely interested in learning from events, we have to reconcile our need to understand with the reality that we will never get a complete picture of what has happened in the past. Regulatory bodies such as the SEC (lucky for them) don’t have to get a complete picture in order to do their job. They have only to point out the gap between how “work is prescribed” versus “work is being done” (or, as Richard Cook has put it, “the system as imagined” versus “the system as found”).
In many circumstances (as in the case of the SEC release), what this means is to point out the things that people and organizations didn’t do in the time preceding an event. This is usually done by using “counterfactuals”, which means literally “counter to the facts.”
In the language of my domain, using counterfactuals in the process of explanation and prevention is an anti-pattern, and I’ll explain why.
One of the potential pitfalls of postmortem reports (and debriefings) is that the language we use can cloud our opportunities to learn what took place and the context people (and machines!) found themselves in. Sidney Dekker says this about using counterfactuals:
“They make you spend your time talking about a reality that did not happen (but if it had happened, the mishap would not have happened).” (Dekker, 2006, p. 39)
What are examples of counterfactuals? In ordinary language, they look like:
- “they shouldn’t have…”
- “they could have…”
- “they failed to…”
- “if only they had…!”
Why are these statements woefully inappropriate for aiding explanation of what happened? Because stating what you think should have happened doesn’t explain people’s (or an organization’s) behavior. Counterfactuals serve as a massive distraction, because they bring sharply into focus what didn’t happen, when what is required for explanation is to understand why people did what they did.
People do what makes sense to them, given their focus, their goals, and what they perceive to be their environment. This is known as the local rationality principle, and it is required in order to tease out second stories, which in turn are required for learning from failure. People’s local rationality is influenced by many dynamics, and I can imagine some of these things might feel familiar to any engineers who operate in high-tempo organizations:
Multiple conflicting goals
E.g., “Deploy the new stuff, and do it quickly because our competitors may beat us! Also: take care of all of the details while you do it quickly, because one small mistake could make for a big deal!”
Multiple targets of attention
E.g., “When you deploy the new stuff, make sure you’re looking at the logs. And ignore the errors that are normally there, so you can focus on the right ones to pay attention to. Oh, and the dashboard graph of errors…pay attention to that. And the deployment process. And the system resources on each node as you deploy to them. And the network bandwidth. Also: remember, we have to get this done quickly.”
David Woods put counterfactual thinking in context with how people actually work:
“After-the-fact, based on knowledge of outcome, outsiders can identify “critical” decisions and actions that, if different, would have averted the negative outcome. Since these “critical” points are so clear to you with the benefit of hindsight, you could be tempted to think they should have been equally clear and obvious to the people involved in the incident. These people’s failure to see what is obvious now to you seems inexplicable and therefore irrational or even perverse. In fact, what seems to be irrational behavior in hindsight turns out to be quite reasonable from the point of view of the demands practitioners face and the resources they can bring to bear.” (Woods, 2010)
“You construct a referent world from outside the accident sequence, based on data you now have access to, based on facts you now know to be true. The problem is that these after-the-fact-worlds may have very little relevance to the circumstances of the accident sequence. They do not explain the observed behavior. You have substituted your own world for the one that surrounded the people in question.” (Dekker, 2004, p.33)
“Saying what people failed to do, or implying what they could or should have done to prevent the mishap, has no role in understanding human error.” (Dekker, 2004, p.43)
The engineers and managers at Knight Capital did not set out that morning of August 1, 2012 to lose $460 million. If they did, we’d be talking about sabotage and not human error. They did, however, set out to perform some work successfully (in this case, roll out what they needed to participate in the Retail Liquidity Program.)
If you haven’t picked up on it already, the use of counterfactuals is a manifestation of one of the most studied cognitive biases in modern psychology: The Hindsight Bias. I will leave it as an exercise to the reader to dig into that.
Cognitive biases are the greatest pitfalls in explaining surprising outcomes. The weird cousin of The Hindsight Bias is Outcome Bias. In a nutshell, it says that we are biased to “judge a past decision by its ultimate outcome instead of based on the quality of the decision at the time it was made, given what was known at that time.” (Outcome Bias, 2013)
In other words, we can be tricked into thinking that if the result of an accident is truly awful (like people dying, something crashing, or, say, losing $460 million in 20 minutes) then the decisions that led up to that outcome must have been reeeeeealllllllyyyy bad. Right?
This is a myth debunked by a few decades of social science, but it remains persistent. No decision maker has omniscience about results, so the severity of the outcome cannot be seen to be proportional to the quality of thought that went into the decisions or actions that led up to the result. Why we have this bias to begin with is yet another topic that we can explore another time.
But a quick thought exercise on results can indicate whether you are susceptible to The Outcome Bias: if Knight Capital had lost only $1,000 (or less), would you think them to be more or less prudent in their preventative measures than in the case of $460 million?
If you’re into sports, maybe this can help shed light on The Outcome Bias.
Operators (within complex systems, at least) have procedures and rules to help them achieve their goals safely. They come in many forms: checklists, guidelines, playbooks, laws, etc. There is a distinction between procedures and rules, but they have similarities when it comes to gaining understanding of the past.
First let’s talk about procedures. In the aftermath of an accident, we can (and will, in the SEC release) see many cries of “they didn’t follow procedures!” or “they didn’t even have a checklist!” This sort of statement can nicely serve as a counterfactual.
What is important to recognize is that procedures are but one resource people use to do work. If we only worked by following every rule and procedure we’ve written for ourselves, by the letter, then I suspect society would come to a halt. As an aside, “work-to-rule” is a tactic that labor organizations have used to demonstrate that onerous rules and procedures can rob people of their adaptive capacities, and therefore bring business to an effective standstill.
Some more thought exercises on procedures:
- How easy might it be to go to your corporate wiki or intranet to find a procedure (or a step within a procedure) that was once relevant, but no longer is?
- Do you think you can find a procedure somewhere in your group that isn’t specific enough to address every context you might use it in?
- Can you find steps in existing procedures that feel safe to skip, especially if you’re under time pressure to get something done?
- Part of the legal terms of using Microsoft Office is that you read and understand the End User License Agreement. You did that before checking “I agree”, right? Or did you violate that legal agreement?! (don’t worry, I won’t tell anyone)
Procedures are important for a number of reasons. They serve as institutional knowledge and guidelines for safe work. But, like wikis, they make sense to the authors of the procedure the day they wrote it. They are written to take into account all of the scenarios and contexts that the author can imagine.
But since that imagination is limited, many procedures that are thought to ensure safety are context-sensitive and they require interpretation, modification, and adaptation.
There are multiple issues with procedures as they are navigated by people who do real work. Stealing from Dekker again:
- “First, a mismatch between procedures and practice is not unique to accident sequences. Not following procedures does not necessarily lead to trouble, and safe outcomes may be preceded by just as (relatively) many procedural deviations as those that precede accidents (Woods et al., 1994; Snook, 2000). This turns any “findings” about accidents being preceded by procedural violation into mere tautologies…”
- “Second, real work takes place in a context of limited resources and multiple goals and pressures.”
- “Third, some of the safest complex, dynamic work not only occurs despite the procedures–such as aircraft line maintenance–but without procedures altogether.” The long-studied High Reliability Organizations have examples (in domains such as naval aircraft carrier operations and nuclear power generation) where procedures are eschewed, and instead replaced by less static forms of learning from practice:
“there were no books on the integration of this new hardware into existing routines and no other place to practice it but at sea. Moreover, little of the process was written down, so that the ship in operation is the only reliable manual”. Work is “neither standardized across ships nor, in fact, written down systematically and formally anywhere”. Yet naval aircraft carriers–with inherent high-risk operations–have a remarkable safety record, like other so-called high reliability organizations (Rochlin et al., 1987; Weick, 1990; Rochlin, 1999).
- “Fourth, procedure-following can be antithetical to safety.” – Consider the case of the 1949 US Mann Gulch disaster, where the firefighters who perished were the ones sticking to the organizational mandate to carry their tools everywhere. Or Swissair Flight 111, where the captain and co-pilot disagreed on whether or not to follow the prescribed checklist for an emergency landing. While they argued, the plane crashed. (Dekker, 2003)
Anyone operating in high-tempo and high-consequence environments recognizes both the utility and the brittleness of a procedure, no matter how much thought went into it.
Let’s keep this idea in mind as we walk through the SEC release below.
Violation of Rules != Explanation
Now let’s talk about rules. The SEC’s job (in a nutshell) is to design, maintain, and enforce regulations of practice for various types of financially-driven organizations in the United States. Note that they are not charged with explaining or preventing events. Prevention may or may not result from their work on regulations, but prevention demands much more than abiding by rules.
Rules and regulations are similar to procedures in that they are written with deliberate but ultimately interpretable intention. Judges and juries help interpret different contexts as they relate to a given rule, law, or regulation. Rules are good for a number of reasons that are beyond the scope of this (now lengthy) post.
If we think about regulations in the context of causality, however, we can get into trouble.
Because we can find ourselves in uncertain contexts that have some of the dynamics that I listed above (multiple conflicting goals and targets of attention), regulations pose some issues, even when we are acutely aware of them. In the Man-Made Disasters Model, Nick Pidgeon lays some of this out for us:
“Uncertainty may also arise about how to deal with formal violations of safety regulations. Violations might occur because regulations are ambiguous, in conflict with other goals such as the needs of production, or thought to be outdated because of technological advance. Alternatively safety waivers may be in operation, allowing relaxation of regulations under certain circumstances (as also occurred in the ‘Challenger’ case; see Vaughan, 1996).” (Pidgeon, 2000)
Rules and regulations need to allow for interpretation, otherwise they would be brittle in enforcement. Vagueness and flexibility in rules are therefore desirable. We’ll see how this vagueness can be exploited for enforcement, however, at the expense of learning.
Back to the statement
Once more: the SEC document cannot be viewed as a canonical description of what happened with Knight Capital on August 1, 2012.
It can, however, be viewed as a comprehensive account of the exchange and trading regulations the SEC deems were violated by the organization. This is its purpose. My goal here is not to critique the SEC release for its purpose; it is to reveal how it cannot be seen to aid either explanation or prevention of the event, and so should not be used for that.
Before we walk through (at least parts of) the document, it’s worth noting that there is no objective accident investigative body that exists for electronic trading systems. In aviation, there is a regulatory body (the FAA) and an investigative body (the NTSB), and there are significant differences between the two, charter-wise and operations-wise. There exists no such independent investigative body analogous to the NTSB in Knight Capital’s industry. There is only the SEC.
I’ll add my comments as notes (each marked “Note:”) after the passages they discuss. After getting feedback from many colleagues, I decided to keep the length here for people to dig into, because I think it’s important to understand. If you make it through this, you deserve cake.
The Securities and Exchange Commission (the “Commission”) deems it appropriate and in the public interest that public administrative and cease-and-desist proceedings be, and hereby are, instituted pursuant to Sections 15(b) and 21C of the Securities Exchange Act of 1934 (the “Exchange Act”) against Knight Capital Americas LLC (“Knight” or “Respondent”).
In anticipation of the institution of these proceedings, Respondent has submitted an Offer of Settlement (the “Offer”), which the Commission has determined to accept. Solely for the purpose of these proceedings and any other proceedings by or on behalf of the Commission, or to which the Commission is a party, and without admitting or denying the findings herein, except as to the Commission’s jurisdiction over it and the subject matter of these proceedings, which are admitted, Respondent consents to the entry of this Order Instituting Administrative and Cease-and-Desist Proceedings, Pursuant to Sections 15(b) and 21C of the Securities Exchange Act of 1934, Making Findings, and Imposing Remedial Sanctions and a Cease-and-Desist Order (“Order”), as set forth below:
Note: This means that Knight doesn’t have to agree or disagree with any of the statements in the document. This is expected. If it were intended to be a postmortem doc, then there would be a lot more covered here in addition to listing violations of regulations.
1. On August 1, 2012, Knight Capital Americas LLC (“Knight”) experienced a significant error in the operation of its automated routing system for equity orders, known as SMARS. While processing 212 small retail orders that Knight had received from its customers, SMARS routed millions of orders into the market over a 45-minute period, and obtained over 4 million executions in 154 stocks for more than 397 million shares. By the time that Knight stopped sending the orders, Knight had assumed a net long position in 80 stocks of approximately $3.5 billion and a net short position in 74 stocks of approximately $3.15 billion. Ultimately, Knight lost over $460 million from these unwanted positions. The subject of these proceedings is Knight’s violation of a Commission rule that requires brokers or dealers to have controls and procedures in place reasonably designed to limit the risks associated with their access to the markets, including the risks associated with automated systems and the possibility of these types of errors.
Note: Again, the purpose of the doc is to point out where Knight violated rules. It is not:
- a description of the multiple trade-offs that engineering at Knight made or considered when designing fault-tolerance in their systems, or
- how Knight as an organization evolved over time to focus on evolving some procedures and not others, or
- what engineers anticipated as they prepared to deploy support for the new RLP effort on Aug 1, 2012.
To equate any of those things with violation of a rule is a cognitive leap that we should stay very far away from.
It’s worth mentioning here that the document only focuses on failures, and makes no mention of successes. How Knight succeeded during diagnosis and response is unknown to us, so a rich source of data isn’t available. Because of this, we cannot pretend that the document gives an explanation.
2. Automated trading is an increasingly important component of the national market system. Automated trading typically occurs through or by brokers or dealers that have direct access to the national securities exchanges and other trading centers. Retail and institutional investors alike rely on these brokers, and their technology and systems, to access the markets.
3. Although automated technology brings benefits to investors, including increased execution speed and some decreased costs, automated trading also amplifies certain risks. As market participants increasingly rely on computers to make order routing and execution decisions, it is essential that compliance and risk management functions at brokers or dealers keep pace. In the absence of appropriate controls, the speed with which automated trading systems enter orders into the marketplace can turn an otherwise manageable error into an extreme event with potentially wide-spread impact.
Note: The sharp contrast between our ability to create complex and valuable automation and our ability to reason about, influence, control, and understand it in even ‘normal’ operating conditions (forget about time-pressured emergency diagnosis of a problem) is something I (and many others over the decades) have written about. The key phrase here is “keep pace”, and it’s difficult for me to argue with. This may be the most valuable statement in the document with regard to safety and the use of automation.
4. Prudent technology risk management has, at its core, quality assurance, continuous improvement, controlled testing and user acceptance, process measuring, management and control, regular and rigorous review for compliance with applicable rules and regulations and a strong and independent audit process. To ensure these basic features are present and incorporated into day-to-day operations, brokers or dealers must invest appropriate resources in their technology, compliance, and supervisory infrastructures. Recent events and Commission enforcement actions have demonstrated that this investment must be supported by an equally strong commitment to prioritize technology governance with a view toward preventing, wherever possible, software malfunctions, system errors and failures, outages or other contingencies and, when such issues arise, ensuring a prompt, effective, and risk-mitigating response. The failure by, or unwillingness of, a firm to do so can have potentially catastrophic consequences for the firm, its customers, their counterparties, investors and the marketplace.
Note: Here we have the first value statement we see in the document. It states what is “prudent” in risk management. This is reasonable for the SEC to state in a generic high-level way, given its charge: to interpret regulations. This sets the stage for showing contrast between what happened, and what the rules are, which comes later.
If this were a postmortem doc, this word should be a red flag that immediately sets your face on fire. Stating what is “prudent” is essentially imposing standards onto history. It is a declaration of what a standard of good practice looks like. The SEC does not mention Knight Capital as not prudent specifically, but they don’t have to. This is the model on which the rest of the document rests. Stating what standards of good practice look like in a document that is looked to for explanation is an anti-pattern. In aviation, this might be analogous to saying that a pilot lacked “good airmanship” and pointing at it as a cause.
The phrases “must invest appropriate resources” and “equally strong” above are both non-binary and context-sensitive. What is appropriate and equally strong gets to be defined by…whom?
- What is “prudent”?
- The description only says prudence demands prevention of errors, outages, and malfunctions “wherever possible.” How will you know where prevention is not possible? And following that – it would appear that you can be prudent and still not prevent errors and malfunctions.
- Please ensure a “prompt, effective, and risk-mitigating response.” In other words: fix it correctly and fix it quickly. It’s so simple!
5. The Commission adopted Exchange Act Rule 15c3-5 in November 2010 to require that brokers or dealers, as gatekeepers to the financial markets, “appropriately control the risks associated with market access, so as not to jeopardize their own financial condition, that of other market participants, the integrity of trading on the securities markets, and the stability of the financial system.”
Note: It’s true, this is what the rule says. What is deemed “appropriate”, it would seem, is dependent on the outcome. Had an accident? It was not appropriate control. Didn’t have an accident? It must be appropriate control. This would mean that Knight Capital did have appropriate controls the day before the accident. Outcome bias reigns supreme here.
6. Subsection (b) of Rule 15c3-5 requires brokers or dealers with market access to “establish, document, and maintain a system of risk management controls and supervisory procedures reasonably designed to manage the financial, regulatory, and other risks” of having market access. The rule addresses a range of market access arrangements, including customers directing their own trading while using a broker’s market participant identifications, brokers trading for their customers as agents, and a broker-dealer’s trading activities that place its own capital at risk. Subsection (b) also requires a broker or dealer to preserve a copy of its supervisory procedures and a written description of its risk management controls as part of its books and records.
Note: The rule says, basically: “have a document about controls and risks”. It doesn’t say anything about an organization’s ability to adapt them as time and technology progress, only that at some point they were written down and shared with the right parties.
7. Subsection (c) of Rule 15c3-5 identifies specific required elements of a broker or dealer’s risk management controls and supervisory procedures. A broker or dealer must have systematic financial risk management controls and supervisory procedures that are reasonably designed to prevent the entry of erroneous orders and orders that exceed pre-set credit and capital thresholds in the aggregate for each customer and the broker or dealer. In addition, a broker or dealer must have regulatory risk management controls and supervisory procedures that are reasonably designed to ensure compliance with all regulatory requirements.
Note: This is the first of many instances of the phrase “reasonably designed” in the document. As with the word ‘appropriate’, how something is defined to be “reasonably designed” is dependent on the outcome of that design. This robs both the design and the engineer of the nuanced details that make for resilient systems. Modern technology doesn’t work or not-work. It breaks and fails in surprising (sometimes shocking) ways that were not imagined by its designers, which means that “reason” plays only a part in its quality.
Right now, all over the world, every (non-malicious) engineer is designing and building systems that they believe are “reasonably designed.” If they didn’t think the systems were reasonably designed, they wouldn’t be finished with them until they did.
Some of those systems will fail. Most will not. Many of them will fail in ways that are safe and anticipated. Some will not, and surprise everyone.
Systems Safety researcher Erik Hollnagel has had related thoughts:
We must strive to understand that accidents don’t happen because people gamble and lose.
Accidents happen because the person believes that:
…what is about to happen is not possible,
…or what is about to happen has no connection to what they are doing,
…or that the possibility of getting the intended outcome is well worth whatever risk there is.
8. Subsection (e) of Rule 15c3-5 requires brokers or dealers with market access to establish, document, and maintain a system for regularly reviewing the effectiveness of their risk management controls and supervisory procedures. This sub-section also requires that the Chief Executive Officer (“CEO”) review and certify that the controls and procedures comply with subsections (b) and (c) of the rule. These requirements are intended to assure compliance on an ongoing basis, in part by charging senior management with responsibility to regularly review and certify the effectiveness of the controls.
Note: This takes into consideration that systems are indeed not static, and it implies that they need to evolve over time. This is important to remember for some notes later on.
9. Beginning no later than July 14, 2011, and continuing through at least August 1, 2012, Knight’s system of risk management controls and supervisory procedures was not reasonably designed to manage the risk of its market access. In addition, Knight’s internal reviews were inadequate, its annual CEO certification for 2012 was defective, and its written description of its risk management controls was insufficient. Accordingly, Knight violated Rule 15c3-5. In particular:
- Knight did not have controls reasonably designed to prevent the entry of erroneous orders at a point immediately prior to the submission of orders to the market by one of Knight’s equity order routers, as required under Rule 15c3-5(c)(1)(ii);
- Knight did not have controls reasonably designed to prevent it from entering orders for equity securities that exceeded pre-set capital thresholds for the firm, in the aggregate, as required under Rule 15c3-5(c)(1)(i). In particular, Knight failed to link accounts to firm-wide capital thresholds, and Knight relied on financial risk controls that were not capable of preventing the entry of orders;
- Knight did not have an adequate written description of its risk management controls as part of its books and records in a manner consistent with Rule 17a-4(e)(7) of the Exchange Act, as required by Rule 15c3-5(b);
- Knight also violated the requirements of Rule 15c3-5(b) because Knight did not have technology governance controls and supervisory procedures sufficient to ensure the orderly deployment of new code or to prevent the activation of code no longer intended for use in Knight’s current operations but left on its servers that were accessing the market; and Knight did not have controls and supervisory procedures reasonably designed to guide employees’ responses to significant technological and compliance incidents;
- Knight did not adequately review its business activity in connection with its market access to assure the overall effectiveness of its risk management controls and supervisory procedures, as required by Rule 15c3-5(e)(1); and
- Knight’s 2012 annual CEO certification was defective because it did not certify that Knight’s risk management controls and supervisory procedures complied with paragraphs (b) and (c) of Rule 15c3-5, as required by Rule 15c3-5(e)(2).
Note: It’s a counterfactual party! The question remains: are conditions sufficient, reasonably designed, or adequate if they don’t result in an accident like this one? Which comes first: these characterizations, or the accident? Knight Capital did believe these things were sufficient, reasonably designed, and adequate enough. Otherwise, they would have addressed them. One question necessary to answer for prevention is: “What were the sources of confidence that Knight Capital drew upon as they designed their systems?” Because improvement lies there.
10. As a result of these failures, Knight did not have a system of risk management controls and supervisory procedures reasonably designed to manage the financial, regulatory, and other risks of market access on August 1, 2012, when it experienced a significant operational failure that affected SMARS, one of the primary systems Knight uses to send orders to the market. While Knight’s technology staff worked to identify and resolve the issue, Knight remained connected to the markets and continued to send orders in certain listed securities. Knight’s failures resulted in it accumulating an unintended multi-billion dollar portfolio of securities in approximately forty-five minutes on August 1 and, ultimately, Knight lost more than $460 million, experienced net capital problems, and violated Rules 200(g) and 203(b) of Regulation SHO.
11. Knight Capital Americas LLC (“Knight”) is a U.S.-based broker-dealer and a wholly-owned subsidiary of KCG Holdings, Inc. Knight was owned by Knight Capital Group, Inc. until July 1, 2013, when that entity and GETCO Holding Company, LLC combined to form KCG Holdings, Inc. Knight is registered with the Commission pursuant to Section 15 of the Exchange Act and is a Financial Industry Regulatory Authority (“FINRA”) member. Knight has its principal business operations in Jersey City, New Jersey. Throughout 2011 and 2012, Knight’s aggregate trading (both for itself and for its customers) generally represented approximately ten percent of all trading in listed U.S. equity securities. SMARS generally represented approximately one percent or more of all trading in listed U.S. equity securities.
B. August 1, 2012 and Related Events
Preparation for NYSE Retail Liquidity Program
12. To enable its customers’ participation in the Retail Liquidity Program (“RLP”) at the New York Stock Exchange, which was scheduled to commence on August 1, 2012, Knight made a number of changes to its systems and software code related to its order handling processes. These changes included developing and deploying new software code in SMARS. SMARS is an automated, high speed, algorithmic router that sends orders into the market for execution. A core function of SMARS is to receive orders passed from other components of Knight’s trading platform (“parent” orders) and then, as needed based on the available liquidity, send one or more representative (or “child”) orders to external venues for execution.
13. Upon deployment, the new RLP code in SMARS was intended to replace unused code in the relevant portion of the order router. This unused code previously had been used for functionality called “Power Peg,” which Knight had discontinued using many years earlier. Despite the lack of use, the Power Peg functionality remained present and callable at the time of the RLP deployment. The new RLP code also repurposed a flag that was formerly used to activate the Power Peg code. Knight intended to delete the Power Peg code so that when this flag was set to “yes,” the new RLP functionality–rather than Power Peg–would be engaged.
Note: Noting the intention is important in gaining understanding, because it shows effort to get into the mindset of the individual or groups involved in the work. If this introspection continued throughout the document, it would get a little closer to something like a postmortem.
Raise your hand if you can definitively state all of the active and inactive code execution paths in your application right now. Right.
14. When Knight used the Power Peg code previously, as child orders were executed, a cumulative quantity function counted the number of shares of the parent order that had been executed. This feature instructed the code to stop routing child orders after the parent order had been filled completely. In 2003, Knight ceased using the Power Peg functionality. In 2005, Knight moved the tracking of cumulative shares function in the Power Peg code to an earlier point in the SMARS code sequence. Knight did not retest the Power Peg code after moving the cumulative quantity function to determine whether Power Peg would still function correctly if called.
Note: On the surface, this looks like some technical meat to bite into. There is some detail surrounding a fault-tolerance guardrail here, something to fail “closed” in the presence of specific criteria. What’s missing? Any discussion of why the function was moved from one place (in Power Peg) to another (earlier in SMARS). This is important because, in my experience, engineers don’t put effort into that sort of thing without motivation. If that motivation were explored, we’d get a better sense of where the organization drew its confidence from prior to the accident. This helps us understand their local rationality. But we don’t get that from this document.
15. Beginning on July 27, 2012, Knight deployed the new RLP code in SMARS in stages by placing it on a limited number of servers in SMARS on successive days. During the deployment of the new code, however, one of Knight’s technicians did not copy the new code to one of the eight SMARS computer servers. Knight did not have a second technician review this deployment and no one at Knight realized that the Power Peg code had not been removed from the eighth server, nor the new RLP code added. Knight had no written procedures that required such a review.
Note: Code and deployment review is a fine thing to have. But is it sufficient? Dr. Nancy Leveson explained when she was invited to speak at the SEC’s “Technology Roundtable” in October of last year that in 1992, she chaired a committee to review the code that was deployed on the Space Shuttle. She said that NASA was spending $100 million a year to maintain the code, was employing the smartest engineers in the world, and there were still found to be gaps of concern. She repeats that there is no such thing as perfect software, no matter how much effort an individual or organization makes to produce such a thing.
Do written procedures requiring a review of code or deployment guarantee safety? Of course not. But ensuring safety isn’t what the SEC is expected to do in this document. Again: they are only pointing out the differences between regulation and practice.
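To make "feedback from the deployment process" a little more concrete, here is a minimal sketch of one possible signal: comparing a hash of what each server is actually running. Everything in it is hypothetical (the server names, the artifact bytes, the `find_divergent_servers` helper); the SEC release says nothing about how Knight deployed or verified code.

```python
import hashlib
from collections import Counter


def artifact_digest(artifact: bytes) -> str:
    """Hash the deployed artifact so builds can be compared cheaply."""
    return hashlib.sha256(artifact).hexdigest()


def find_divergent_servers(deployed: dict[str, bytes]) -> list[str]:
    """Return the servers whose artifact differs from the most common one.

    Hypothetical helper: it only reports disagreement, it cannot know
    which version is the intended one.
    """
    digests = {server: artifact_digest(blob) for server, blob in deployed.items()}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(s for s, d in digests.items() if d != majority_digest)


# Invented fleet: seven servers with the new build, one left behind.
deployed = {f"smars-{i}": b"new RLP build" for i in range(1, 8)}
deployed["smars-8"] = b"old build with Power Peg"
print(find_divergent_servers(deployed))  # ['smars-8']
```

A check like this is one more source of confidence, not a guarantee. It only helps if someone runs it, trusts it, and knows what to do when it disagrees with their expectations.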
Events of August 1, 2012
16. On August 1, Knight received orders from broker-dealers whose customers were eligible to participate in the RLP. The seven servers that received the new code processed these orders correctly. However, orders sent with the repurposed flag to the eighth server triggered the defective Power Peg code still present on that server. As a result, this server began sending child orders to certain trading centers for execution. Because the cumulative quantity function had been moved, this server continuously sent child orders, in rapid sequence, for each incoming parent order without regard to the number of share executions Knight had already received from trading centers. Although one part of Knight’s order handling system recognized that the parent orders had been filled, this information was not communicated to SMARS.
Note: So the guardrail/fail-closed mechanism wasn’t in the same place it was before, and the eighth server was allowed to continue on. As Leveson said in her testimony: “It’s not necessarily just individual component failure. In a lot of these accidents each individual component worked exactly the way it was expected to work. It surprised everyone in the interactions among the components.”
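A toy model can illustrate the kind of interaction described in paragraph 16. It is a sketch under stated assumptions, not Knight's code: the function name, the order sizes, and the `max_children` cap (which exists only so the demo terminates) are all invented. It shows how a routing loop behaves when the filled-share count it depends on is no longer updated where the loop can see it.

```python
def route_children(parent_qty: int, child_qty: int, *,
                   tracks_fills: bool, max_children: int = 1000) -> int:
    """Return how many child orders are sent for one parent order.

    tracks_fills=True models a router that counts executed shares and
    stops once the parent order is filled. tracks_fills=False models a
    code path where that count is never updated in the loop's view.
    max_children is a demo-only cap so the second case terminates.
    """
    filled = 0
    sent = 0
    while filled < parent_qty and sent < max_children:
        sent += 1  # send one child order (assume it executes in full)
        if tracks_fills:
            filled += child_qty  # cumulative quantity visible to the loop
    return sent


print(route_children(1000, 100, tracks_fills=True))   # 10
print(route_children(1000, 100, tracks_fills=False))  # 1000 (demo cap only)
```

Each piece here does what it was written to do. The surprise comes from the combination, which is Leveson's point.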
17. The consequences of the failures were substantial. For the 212 incoming parent orders that were processed by the defective Power Peg code, SMARS sent millions of child orders, resulting in 4 million executions in 154 stocks for more than 397 million shares in approximately 45 minutes. Knight inadvertently assumed an approximately $3.5 billion net long position in 80 stocks and an approximately $3.15 billion net short position in 74 stocks. Ultimately, Knight realized a $460 million loss on these positions.
Note: Just in case you forgot, this accident was sooooo bad. These numbers are so big. Keep that in mind, dear reader, because I want you to remember that when you think about the engineer who thought he had deployed the code to the eighth server.
18. The millions of erroneous executions influenced share prices during the 45 minute period. For example, for 75 of the stocks, Knight’s executions comprised more than 20 percent of the trading volume and contributed to price moves of greater than five percent. As to 37 of those stocks, the price moved by greater than ten percent, and Knight’s executions constituted more than 50 percent of the trading volume. These share price movements affected other market participants, with some participants receiving less favorable prices than they would have in the absence of these executions and others receiving more favorable prices.
BNET Reject E-mail Messages
19. On August 1, Knight also received orders eligible for the RLP but that were designated for pre-market trading. SMARS processed these orders and, beginning at approximately 8:01 a.m. ET, an internal system at Knight generated automated e-mail messages (called “BNET rejects”) that referenced SMARS and identified an error described as “Power Peg disabled.” Knight’s system sent 97 of these e-mail messages to a group of Knight personnel before the 9:30 a.m. market open. Knight did not design these types of messages to be system alerts, and Knight personnel generally did not review them when they were received. However, these messages were sent in real time, were caused by the code deployment failure, and provided Knight with a potential opportunity to identify and fix the coding issue prior to the market open. These notifications were not acted upon before the market opened and were not used to diagnose the problem after the open.
Note: Translated, this says that system-generated warnings sent via email weren’t noticed. Using signals from automated systems (synchronously, as in “alerts”, or asynchronously, as in “email”) to perfectly detect or prevent anomalies is not a solved problem. Show me an outage, any outage, and I’ll show you warning signs that humans didn’t pick up on. The document doesn’t give any detail on why those types of messages were sent via email (as opposed to paging-style alerts), what the distribution list was for them, how those messages get generated, or any other details.
Is the number of the emails (97 of them) important? 97 sounds like a lot, doesn’t it? If it was one, and not 97, would the paragraph read differently? What if there were 10,000 messages sent?
How many engineers right now are receiving alerts on their phone (forget about emails) that they will glance at and think that they are part of the normal levels of noise in the system, because thresholds and error handling are not always precisely tuned?
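One way to think about the noise problem is to separate messages that have been seen before from ones that have not. The sketch below is an illustration built on assumptions, not a description of Knight's BNET system: apart from the “Power Peg disabled” text quoted in the release, the message strings, the baseline set, and the `summarize_new_signatures` helper are invented for the example.

```python
from collections import Counter


def summarize_new_signatures(messages: list[str], known: set[str]) -> dict[str, int]:
    """Count messages whose text is not in the baseline of known signatures."""
    counts = Counter(messages)
    return {sig: n for sig, n in counts.items() if sig not in known}


# Hypothetical baseline of message texts the team is used to seeing.
known = {"order rejected: symbol halted", "order rejected: bad lot size"}

# Hypothetical inbox: familiar noise plus the 97 messages from paragraph 19.
inbox = ["order rejected: symbol halted"] * 40 + ["Power Peg disabled"] * 97
print(summarize_new_signatures(inbox, known))  # {'Power Peg disabled': 97}
```

Even this assumes that someone maintains the baseline and reads the summary before the market opens, which is the hard part.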
C. Controls and Supervisory Procedures
20. Knight had a number of controls in place prior to the point that orders reached SMARS. In particular, Knight’s customer interface, internal order management system, and system for internally executing customer orders all contained controls concerning the prevention of the entry of erroneous orders.
21. However, Knight did not have adequate controls in SMARS to prevent the entry of erroneous orders. For example, Knight did not have sufficient controls to monitor the output from SMARS, such as a control to compare orders leaving SMARS with those that entered it. Knight also did not have procedures in place to halt SMARS’s operations in response to its own aberrant activity. Knight had a control that capped the limit price on a parent order, and therefore related child orders, at 9.5 percent below the National Best Bid (for sell orders) or above the National Best Offer (for buy orders) for the stock at the time that SMARS had received the parent order. However, this control would not prevent the entry of erroneous orders in circumstances in which the National Best Bid or Offer moved by less than 9.5 percent. Further, it did not apply to orders–such as the 212 orders described above–that Knight received before the market open and intended to send to participate in the opening auction at the primary listing exchange for the stock.
Note: Anomaly detection and error-handling criteria have two origins: the imagination of their authors and the history of surprises that have been encountered already. A significant number of thresholds, guardrails, and alerts in any technical organization are put in place only after it’s realized that they are needed. Some of these realizations come from negative events like outages, data loss, etc. and some of them come from “near-misses” or explicit re-anticipation activated by feedback that comes from real-world operation.
Even then, real-world observations don’t always produce new safeguards. How many successful trades had Knight Capital seen in its lifetime while that control allowed “the entry of erroneous orders in circumstances in which the National Best Bid or Offer moved by less than 9.5 percent”? How many successful Shuttle launches saw degradation in O-ring integrity before the Challenger accident? This ‘normalization of deviance’ (Vaughan, 1997) phenomenon is to be expected in all socio-technical organizations. Financial trading systems are no exception. History matters.
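The price cap in paragraph 21 can be sketched in a few lines. This is my reading of the release, not Knight's implementation: the `capped_limit_price` function, the quote values, and the way the cap is applied are assumptions. Only the 9.5 percent figure and the bid/offer sides come from the text.

```python
from decimal import Decimal

CAP = Decimal("0.095")  # the 9.5 percent band described in paragraph 21


def capped_limit_price(side: str, requested: Decimal,
                       nbb: Decimal, nbo: Decimal) -> Decimal:
    """Clamp a limit price to the band around the quote at order receipt.

    Sells may not be priced more than 9.5 percent below the National Best
    Bid; buys may not be priced more than 9.5 percent above the National
    Best Offer. Hypothetical helper for illustration only.
    """
    if side == "sell":
        return max(requested, nbb * (1 - CAP))
    if side == "buy":
        return min(requested, nbo * (1 + CAP))
    raise ValueError(f"unknown side: {side!r}")


nbb, nbo = Decimal("99.90"), Decimal("100.00")  # invented quote
print(capped_limit_price("buy", Decimal("150.00"), nbb, nbo))
print(capped_limit_price("buy", Decimal("105.00"), nbb, nbo))
```

With these invented quotes, the first buy is pulled back to 109.5 and the second passes through at 105 untouched. The control bounds the price of an order. It says nothing about whether the order should exist at all, and years of orders that stayed inside the band would have looked like evidence that the control was working.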
Note: Nothing in this section had much value in explanation or prevention.
Code Development and Deployment
26. Knight did not have written code development and deployment procedures for SMARS (although other groups at Knight had written procedures), and Knight did not require a second technician to review code deployment in SMARS. Knight also did not have a written protocol concerning the accessing of unused code on its production servers, such as a protocol requiring the testing of any such code after it had been accessed to ensure that the code still functioned properly.
Note: Again, does a review guarantee safety? Does testing prevent malfunction?
27. On August 1, Knight did not have supervisory procedures concerning incident response. More specifically, Knight did not have supervisory procedures to guide its relevant personnel when significant issues developed. On August 1, Knight relied primarily on its technology team to attempt to identify and address the SMARS problem in a live trading environment. Knight’s system continued to send millions of child orders while its personnel attempted to identify the source of the problem. In one of its attempts to address the problem, Knight uninstalled the new RLP code from the seven servers where it had been deployed correctly. This action worsened the problem, causing additional incoming parent orders to activate the Power Peg code that was present on those servers, similar to what had already occurred on the eighth server.
Note: I would like to think that most engineering organizations that are tasked with troubleshooting issues in production systems understand that diagnosis isn’t something you can prescribe. Successful incident response in escalating scenarios is something that comes from real-world practice, not a document. Improvisation and intuition play a significant role in this, which obviously cannot be written down beforehand.
Thought exercise: you just deployed new code to production. You become aware of an issue. Would it be surprising if one of the ways you attempt to rectify the scenario is to roll back to the last known working version? The SEC release implies that it would be.
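A toy model shows why that rollback could make things worse while incoming orders still carried the repurposed flag. The build labels and the `handle_order` function are hypothetical; the only things taken from the release are the eight servers, the shared flag, and the fact that the old build still contained Power Peg.

```python
def handle_order(server_build: str, flag_set: bool) -> str:
    """Which code path a server takes for an order, in this toy model."""
    if not flag_set:
        return "normal routing"
    # The same flag means different things depending on the build.
    return {"new": "RLP", "old": "Power Peg"}[server_build]


fleet = ["new"] * 7 + ["old"]   # seven updated servers, one missed
before = [handle_order(build, True) for build in fleet]

rolled_back = ["old"] * 8       # new code uninstalled from the seven
after = [handle_order(build, True) for build in rolled_back]

print(before.count("Power Peg"), after.count("Power Peg"))  # 1 8
```

Rolling back restored the last known working code, but not the last known working inputs. From inside the incident, with the cause still unknown, that distinction would have been very hard to see.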
D. Compliance Reviews and Written Description of Controls
Note: I’m skipping some sections here as it’s just more about compliance.
Post-Compliance Date Reviews
32. Knight conducted periodic reviews pursuant to the WSPs. As explained above, the WSPs assigned various tasks to be performed by SCG staff in consultation with the pertinent business and technology units, with a senior member of the pertinent business unit reviewing and approving that work. These reviews did not consider whether Knight needed controls to limit the risk that SMARS could malfunction, nor did these reviews consider whether Knight needed controls concerning code deployment or unused code residing on servers. Before undertaking any evaluation of Knight’s controls, SCG, along with business and technology staff, had to spend significant time and effort identifying the missing content and correcting the inaccuracies in the written description.
33. Several previous events presented an opportunity for Knight to review the adequacy of its controls in their entirety. For example, in October 2011, Knight used test data to perform a weekend disaster recovery test. After the test concluded, Knight’s LMM desk mistakenly continued to use the test data to generate automated quotes when trading began that Monday morning. Knight experienced a nearly $7.5 million loss as a result of this event. Knight responded to the event by limiting the operation of the system to market hours, changing the control so that this system would stop providing quotes after receiving an execution, and adding an item to a disaster recovery checklist that required a check of the test data. Knight did not broadly consider whether it had sufficient controls to prevent the entry of erroneous orders, regardless of the specific system that sent the orders or the particular reason for that system’s error. Knight also did not have a mechanism to test whether their systems were relying on stale data.
Note: That we might be able to cherry-pick opportunities in the past where signs of doomsday could have (or should have) been seen and heeded is consistent with textbook definitions of hindsight bias. How organizations learn is influenced by the social and cultural dynamics of their internal structures. Again, Diane Vaughan’s writings are a place we can look to for exploring how path dependency can get us into surprising places. But again: it is not the SEC’s job to speak to that.
E. CEO Certification
34. In March 2012, Knight’s CEO signed a certification concerning Rule 15c3-5. The certification did not state that Knight’s controls and procedures complied with the rule. Instead, the certification stated that Knight had in place “processes” to comply with the rule. This drafting error was not intentional, the CEO did not notice the error, and the CEO believed at the time that he was certifying that Knight’s controls and procedures complied with the rule.
Note: This is possibly the only hint at local rationality in the document.
F. Collateral Consequences
35. There were collateral consequences as a result of the August 1 event, including significant net capital problems. In addition, many of the millions of orders that SMARS sent on August 1 were short sale orders. Knight did not mark these orders as short sales, as required by Rule 200(g) of Regulation SHO. Similarly, Rule 203(b) of Regulation SHO prohibits a broker or dealer from accepting a short sale order in an equity security from another person, or effecting a short sale in an equity security for its own account, unless it has borrowed the security, entered into a bona-fide arrangement to borrow the security, or has reasonable grounds to believe that the security can be borrowed so that it can be delivered on the date delivery is due (known as the “locate” requirement), and has documented compliance with this requirement. Knight did not obtain a “locate” in connection with Knight’s unintended orders and did not document compliance with the requirement with respect to Knight’s unintended orders.
A. Market Access Rule: Section 15(c)(3) of the Exchange Act and Rule 15c3-5
Note: I’m going to skip a bit because it’s not much more than a restating of rules that the SEC deemed were broken….
Accordingly, pursuant to Sections 15(b) and 21C of the Exchange Act, it is hereby ORDERED that:
A. Respondent Knight cease and desist from committing or causing any violations and any future violations of Section 15(c)(3) of the Exchange Act and Rule 15c3-5 thereunder, and Rules 200(g) and 203(b) of Regulation SHO.
Note: Translated – you must stop immediately all of the things that violate rules that say you must “reasonably design” things. So don’t unreasonably design things anymore.
The SEC document does what it needs to do: walk through the regulations that they think were violated, and talk about the settlement agreement. Knight Capital doesn’t have to admit they did anything wrong or suboptimal, and the SEC gets to tell them what to do next. That is, roughly:
- Hire a consultant that helps them not unreasonably design things anymore, and document that.
- Pay $12M to the SEC.
Like I mentioned before, this SEC release doesn’t help explain how the event came to be, or make any effort towards prevention other than requiring Knight Capital to pay a settlement, hire a consultant, and write new procedures that can predict the future. I do not know anyone at Knight Capital (or at the SEC for that matter) so it’s very unlikely that I’ll gain any more awareness of accident details than you will, my dear reader.
But I can put down a few questions that I might ask if I were facilitating the debriefing of the accident, which could possibly help with gaining a systems-thinking perspective on explanation. Real prevention is left as an exercise for the readers who also work at Knight Capital.
- The engineer who deployed the new code to support the RLP integration had confidence that all servers (not just seven of the eight) received the new code. What gave him that confidence? Was it a dashboard? Reliance on an alert? Some other sort of feedback from the deployment process?
- The BNET Reject E-mail Messages: Have they ever been sent before? Do the recipients of them trust their validity? What is the background on their delivery being via email, versus synchronous alerting? Do they provide enough context in their content to give an engineer sufficient criteria to act on?
- What were the signals that the responding team used to indicate that a roll-back of the code on the seven servers was a potential repairing action?
- Did the team that was responding to the issue have solid and clear communication channels? Was it textual chat, in-person, or over voice or video conference?
- Did the team have to improvise any new tooling to be used in the diagnosis or response?
- What metrics did the team use to guide their actions? Were they infrastructural (such as latency, network, or CPU graphs), market-related (trades, positions, etc.), or a mixture?
- What indications were there to raise awareness that the eighth server didn’t receive the latest code? Was it a checksum or versioning? Was it logs of a deployment tool? Was it differences in the server metrics of the eighth server?
- As the new code was rolled out: what was the team focused on? What were they seeing?
- As they recognized there was an issue: did the symptoms look like something they had seen before?
- As the event unfolded: did the responding team discuss what to do, or did single actors take action?
- Regarding non-technical teams: were they involved with directing the response?
- Many, many more questions remain that presumably (hopefully) Knight Capital has asked and answered themselves.
The Second Victim
What about the engineer who deployed the code…the one who had his hands on the actual work being done? How is he doing? Is he receiving support from his peers and colleagues? Or was he fired? The financial trading world does not exactly have a reputation for empathy, and given that there is no voice given to the people closest to the work (such as this engineer) informing the story, I can imagine that symptoms consistent with traumatic stress are likely.
Some safety-critical domains have put together structured programs to offer support to individuals who are involved with high-tempo and high-consequence work. Aviation and air traffic control have seen good success with CISM (Critical Incident Stress Management) and it’s been embraced by organizations around the world.
As web operations and financial trading systems become more and more complex, we will continue to be surprised by outcomes of what looks like “normal” work. If we do not make effort to support those who navigate this complexity on a daily basis, we will not like the results.
- The SEC does not have responsibility for investigation with the goals of explanation or prevention of adverse events. Their focus is regulation.
- Absent a real investigation that eschews counterfactuals, puts procedures and rules into context, and encourages a narrative that holds paramount the voices of those closest to the work, we cannot draw any substantial conclusions. What we are left with is armchair accident investigation, rife with indignation.
So please don’t use the SEC Release No. 70694 as a post-mortem document, because it is not.
Dekker, S. (2003). Failure to adapt or adaptations that fail: contrasting models on procedures and safety. Applied Ergonomics, 34(3), 233–238. doi:10.1016/S0003-6870(03)00031-0
Dekker, S. (2006). The Field Guide to Understanding Human Error. Ashgate Publishing, Ltd.
Outcome Bias. (n.d.). In Wikipedia. Retrieved October 28, 2013, from https://en.wikipedia.org/wiki/Outcome_bias
Pidgeon, N., & O’Leary, M. (2000). Man-made disasters: why technology and organizations (sometimes) fail. Safety Science, 34(1), 15–30.
Vaughan, D. (2009). The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA. University of Chicago Press.
Woods, D. D., Dekker, S., Cook, R., Johannesen, L., & Sarter, N. (2010). Behind Human Error (2nd ed.). Farnham: Ashgate Pub Co.
Weick, K. E. (1993). The collapse of sensemaking in organizations. Administrative Science Quarterly, 38, 628–652.