One of the reasons I’ve continued to be more and more interested in Human Factors and Safety Science is that I found myself without many answers to the questions I have had in my career. Questions surrounding how organizations work, how people think and work with computers, how decisions get made under uncertainty, and how people cope with increasing amounts of complexity.
As a result, my journey took me deep into a world where I immediately saw connections between concepts found in other high-tempo, high-consequence domains and my own world of software engineering and operations. One of the first connections was in Richard Cook’s How Complex Systems Fail, and it struck me so deeply I insisted that it get reprinted (with additions by Richard) into O’Reilly’s Web Operations book.
I simply cannot un-see these connections now, and the field of study keeps me going deeper. So deep that I felt I needed to get a degree. My goal with getting a degree in the topic is not just to satisfy my own curiosity, but also to explore these topics in sufficient depth to feel credible in thinking about them critically.
In software, the concept and sometimes inadvertent practice of “cargo cult engineering” is well known. I’m hoping to avoid that in my own translation(s) of what’s been found in human factors, safety science, and cognitive systems engineering, as they looked into domains like aviation, patient safety, or power plant operations. Instead, I’m looking to truly understand that work in order to know what to focus on in my own research as well as to understand how my domain is either similar (and in what ways?) or different (and in what ways?).
For example, just a hint of what sorts of questions I have been mulling over:
- How does the concept of “normalization of deviance” manifest in web engineering? How does it relate to our concept of ‘technical debt’?
- What organizational dynamics might be in play when it comes to learning from “successes” and “failures”?
- What methods of inquiry can we use to better design interfaces that have functionality and safety and diagnosis support as their core? Or, are those goals in conflict? If so, how?
- How can we design alerts to reduce noise and increase signal in a way that takes into account the context of the intended receiver of the alert? In other words, how can we teach alerts to know about us, instead of the other way around?
- The Internet (including its technical, political, and cultural structures) has non-zero amounts of diversity, interdependence, connectedness, and adaptation, which by many measures makes it a complex system.
- How do successful organizations navigate trade-offs when it comes to decisions that may have unexpected consequences?
I’ve done my best to point my domain at some of these connections as I understand them, and the Velocity Conference has been one of the ways I’ve hoped to bring people “over the bridge” from Safety Science, Human Factors, and Cognitive Systems Engineering into software engineering and operations as it exists as a practice on Internet-connected resources. If you haven’t seen Dr. Richard Cook’s 2012 and 2013 keynotes, or Dr. Johan Bergstrom’s keynote, stop what you’re doing right now and watch them.
I’m willing to bet you’ll see connections immediately…
David Woods is one of the pioneers in these fields and continues to be a huge influence on the way that I think about our domain and my own research (my thesis project relies heavily on some of his previous work). I couldn’t be happier that he’s speaking at Velocity in New York, which is coming up soon. (Pssst: if you register for it here, you can use the code “JOHN20” for a 20% discount.)
I have posted before (and likely will again) about a paper Woods contributed to, Common Ground and Coordination in Joint Activity (Klein, Feltovich, Bradshaw, & Woods, 2005), which in my mind might as well be considered the best explanation of what “devops” means to me, and what makes successful teams work. If you haven’t read it, do it now.
Dynamic Fault Management and Anomaly Response
I thought about listing all of Woods’ work that I’ve seen connections in thus far, but then I realized that if I wasn’t careful, I’d be writing a literature review and not a blog post. 🙂 Also, I have thesis work to do. So for now, I’d like to point only at two concepts that struck me as absolutely critical to the day-to-day of many readers of this blog: dynamic fault management and anomaly response.
Woods sheds some light on these topics in Joint Cognitive Systems: Patterns in Cognitive Systems Engineering. Pay particular attention to the characteristics of these phenomena:
“In anomaly response, there is some underlying process, an engineered or physiological process which will be referred to as the monitored process, whose state changes over time. Faults disturb the functions that go on in the monitored process and generate the demand for practitioners to act to compensate for these disturbances in order to maintain process integrity—what is sometimes referred to as “safing” activities. In parallel, practitioners carry out diagnostic activities to determine the source of the disturbances in order to correct the underlying problem.
Anomaly response situations frequently involve time pressure, multiple interacting goals, high consequences of failure, and multiple interleaved tasks (Woods, 1988; 1994). Typical examples of fields of practice where dynamic fault management occurs include flight deck operations in commercial aviation (Abbott, 1990), control of space systems (Patterson et al., 1999; Mark, 2002), anesthetic management under surgery (Gaba et al., 1987), terrestrial process control (Roth, Woods & Pople, 1992), and response to natural disasters.” (Woods & Hollnagel, 2006, p.71)
Now look down at the distributed systems you’re designing and operating.
Look at the “runbooks” and postmortem notes that you have written in the hopes that they can help guide teams as they try to untangle the sometimes very confusing scenarios that outages can bring.
Does “safing” ring familiar to you?
Do you recognize managing “multiple interleaved tasks” under “time pressure” and “high consequences of failure”?
I think it’s safe to say that almost every Velocity Conference attendee would see connections here.
In How Unexpected Events Produce An Escalation Of Cognitive And Coordinative Demands (Woods & Patterson, 1999), Woods and Patterson introduce the concept of escalation in terms of anomaly response:
The concept of escalation captures a dynamic relationship between the cascade of effects that follows from an event and the demands for cognitive and collaborative work that escalate in response (Woods, 1994). An event triggers the evolution of multiple interrelated dynamics.
- There is a cascade of effects in the monitored process. A fault produces a time series of disturbances along lines of functional and physical coupling in the process (e.g., Abbott, 1990). These disturbances produce a cascade of multiple changes in the data available about the state of the underlying process, for example, the avalanche of alarms following a fault in process control applications (Reiersen, Marshall, & Baker, 1988).
- Demands for cognitive activity increase as the problem cascades. More knowledge potentially needs to be brought to bear. There is more to monitor. There is a changing set of data to integrate into a coherent assessment. Candidate hypotheses need to be generated and evaluated. Assessments may need to be revised as new data come in. Actions to protect the integrity and safety of systems need to be identified, carried out, and monitored for success. Existing plans need to be modified or new plans formulated to cope with the consequences of anomalies. Contingencies need to be considered in this process. All these multiple threads challenge control of attention and require practitioners to juggle more tasks.
- Demands for coordination increase as the problem cascades. As the cognitive activities escalate, the demand for coordination across people and across people and machines rises. Knowledge may reside in different people or different parts of the operational system. Specialized knowledge and expertise from other parties may need to be brought into the problem-solving process. Multiple parties may have to coordinate to implement activities aimed at gaining information to aid diagnosis or to protect the monitored process. The trouble in the underlying process requires informing and updating others – those whose scope of responsibility may be affected by the anomaly, those who may be able to support recovery, or those who may be affected by the consequences the anomaly could or does produce.
- The cascade and escalation is a dynamic process. A variety of complicating factors can occur, which move situations beyond canonical, textbook forms. The concept of escalation captures this movement from canonical to nonroutine to exceptional. The tempo of operations increases following the recognition of a triggering event and is synchronized by temporal landmarks that represent irreversible decision points.
When I read…
“These disturbances produce a cascade of multiple changes in the data available about the state of the underlying process, for example, the avalanche of alarms following a fault in process control applications”
I think of many large-scale outages and multi-day recovery activities, like this one that you all might remember (AWS EBS/RDS outage, 2011).
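To make that “avalanche” a little more concrete, here is a minimal sketch of how a single fault might fan out into many alarms along lines of coupling. Everything in it is an assumption made for illustration: the component names, the `DEPENDENTS` topology, and the `alarm_cascade` function are hypothetical, and real disturbances rarely propagate this neatly or one hop per time step.

```python
from collections import deque

# Hypothetical service topology (an assumption, not a real system): each key
# is a component, each value lists the components that depend on it directly.
DEPENDENTS = {
    "storage": ["database", "object-cache"],
    "database": ["api"],
    "object-cache": ["api"],
    "api": ["web", "mobile-gateway"],
    "web": [],
    "mobile-gateway": [],
}


def alarm_cascade(dependents, faulty):
    """Return (step, component) pairs in the order alarms would fire.

    Breadth-first traversal: step 0 is the original fault, and step n holds
    the components first disturbed n hops downstream of it.
    """
    seen = {faulty}
    queue = deque([(0, faulty)])
    alarms = []
    while queue:
        step, component = queue.popleft()
        alarms.append((step, component))
        for downstream in dependents.get(component, []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append((step + 1, downstream))
    return alarms


if __name__ == "__main__":
    for step, component in alarm_cascade(DEPENDENTS, "storage"):
        print(f"t={step}: alarm on {component}")
```

In this toy topology, one fault in `storage` raises alarms on all six components. The person on call has to work backwards from that pile of alarms to the single originating fault, which is exactly where the cognitive demands start to escalate.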
When I read…
“Existing plans need to be modified or new plans formulated to cope with the consequences of anomalies. Contingencies need to be considered in this process. All these multiple threads challenge control of attention and require practitioners to juggle more tasks.”
I think of many outage response scenarios I have been in with multiple teams (network, storage, database, security, etc.) gathering data from the multiple places they are expert in, at the same time making sense of that data as normal or abnormal signals.
When I read…
“Multiple parties may have to coordinate to implement activities aimed at gaining information to aid diagnosis or to protect the monitored process.”
I think of these two particular outages, and how in the fog of ambiguous signals coming in during diagnosis of an issue, there is a “divide and conquer” effort distributed throughout differing domain expertise (database, network, various software layers, hardware, etc.) that aims to split the search space of diagnosis, while at the same time keeping each other up-to-date on what pathologies have been eliminated as possibilities, what new data can be used to form hypotheses about what’s going on, etc.
I will post more on the topic of anomaly response in detail (and more of Woods’ work) in another post.
In the meantime, I urge you to take a look at David Woods’ writings, and look for connections in your own work. One place to start is a talk David gave at IBM’s Almaden Research Center, called “Creating Safety By Engineering Resilience”.
Hollnagel, E., & Woods, D. D. (1983). Cognitive systems engineering: New wine in new bottles. International Journal of Man-Machine Studies, 18(6), 583–600.
Klein, G., Feltovich, P. J., Bradshaw, J. M., & Woods, D. D. (2005). Common ground and coordination in joint activity. Organizational Simulation, 139–184.
Woods, D. D. (1995). The alarm problem and directed attention in dynamic fault management. Ergonomics. doi:10.1080/00140139508925274
Woods, D. D., & Hollnagel, E. (2006). Joint cognitive systems: Patterns in cognitive systems engineering. Boca Raton: CRC/Taylor & Francis.
Woods, D. D., & Patterson, E. S. (1999). How Unexpected Events Produce An Escalation Of Cognitive And Coordinative Demands. Stress, 1–13.
Woods, D. D., Patterson, E. S., & Roth, E. M. (2002). Can We Ever Escape from Data Overload? A Cognitive Systems Diagnosis. Cognition, Technology & Work, 4(1), 22–36. doi:10.1007/s101110200002