This little ramble of thoughts is related to my upcoming talk at Velocity, but I know I’ll never get to this part at the conference, so I figured I’d post about it here.
Building resilience from a systems point of view means (amongst other things) understanding how your organization deals with failure and unexpected situations. Generally this means having development and operations teams that can work well together under pressure, with fluctuating amounts of uncertainty, bringing their own domain expertise to the table when it matters.
This is what drives some of my favorite Ops candidate interview questions. Knowing Unix commands, network architectures, database behaviors, and scripting languages is obviously required, but that knowledge comprises only one facet of the gig. The real mettle comes from being able to easily zoom in and out of the whole system under scrutiny, splitting up troubleshooting responsibilities amongst your team (and trusting their results), and differentiating red herring symptoms from truly related ones. It also comes from things like:
- Staying away from distracting conversation during the outage response. Nothing kills a TTR like unrelated talk in IRC or a conf call.
- Trusting your information. This is where the UI challenges of dashboard design can make or break an outage response. “Are those units milli, or mega?”
- Balancing too much communication and too little amongst team members. Troubleshooting outage verbosity is a fickle mistress.
- Stomping actions. OneThingAtATime™ methods aren’t easy to stick to, especially when things escalate.
- Keeping outage fatigue at bay, and recognizing when brains are melting and need to take a break.
To make matters worse, determining causality can be tenuous at best when you’re working with complex systems, so being able to recognize when a failure has a single root cause (hint: with the big outages – almost never) and when it has multiple contributing causes is a skill that isn’t easily gained without seeing a lot of action in the past.
So it’s not a surprise that working well within a team under stressful scenarios is something other fields try to train people for. Trauma surgeons, FBI agents, military teams, air traffic control, etc. all have drills, exercises, and simulations for teaching these skills, but they are all done within the context of what those escalating situations look like in their specific fields.
So this brings up a question that has come up before in my circles:
Can this sort of organizational resilience be taught, within the context of web operations?
GameDay exercises could certainly be one avenue for testing and training team-based outage response, but most of the focus there (at least those discussed publicly by companies who hold GameDay exercises) is testing the infrastructure and application-level components, and even then under controlled conditions and relatively narrow failure modes.
So the confidence-building value of GameDay drills lies elsewhere, and they don’t really exercise the cognitive load that real-world failures, like the spectacular Amazon AWS outage recently, can place on the humans involved (i.e. the troubleshooting dev and ops teams).
But! Some smart folks have been thinking about this question, at a higher-level:
Is it possible to construct non-contextual and generic drills that can train competencies for this sort of on-the-fly, making-sense-of-unfamiliar-failure-modes, and sometimes disorienting troubleshooting?
From Lund University in Sweden, there’s an excellent article on building organizational resilience in escalating situations, which I believe resulted in a chapter in the Resilience Engineering in Practice book. It also references another excellent article by David Woods and Emily Patterson called How Unexpected Events Produce An Escalation Of Cognitive And Coordinative Demands.
The parts I want to highlight here are best practices for designing scenarios meant to train these skills. If you’re looking to design a good drill meant to educate and/or train Ops and Devs on what cognitive muscles to develop for handling large-scale outages, this is a pretty damn good list (quoted from both of those sources above):
- Try to force people beyond their learned roles and routines. The scenario can contain problems that are not solvable within those roles or routines, and forces people to step out of those roles and routines.
- Contain a number of hidden goals, at various times during the scenario, that people could pursue (e.g. different ways of escaping the situation or de-escalating it), but that they have to vocalize and articulate in order to begin to achieve them (as they cannot do so by themselves).
- Include potential actions of which the consequences are both important and difficult to foresee (and that might significantly influence people’s ability to control the problem in the near future). This can force people into pro-active thinking and articulation of their expectations of what might happen.
- Be able to trap people in locking onto one solution that everybody is fixedly working towards. This can be done by garden-pathing; making the escalating problem look initially (with strong cues) like something the crew could already be familiar with, but then letting it depart (with much weaker cues) to see whether the crew is caught on the garden path and lets the situation escalate.
- Or the scenario, by creating so much cognitive noise in terms of new warnings and events, should be able to trip people into thematic vagabonding—the tendency to redirect attention and change diagnosis with each incoming data piece, which results in a fragmentation of problem-solving.
Think that such a scenario could be constructed?
I want to think so, but of course nothing teaches like the hindsight of a real production outage, eh? 🙂
I see these scenarios all the time in customer emergencies (I’m a consultant). Cutting through chaos after the customer has already been working for three days and is sleep deprived and stressed beyond belief, etc etc. It is very difficult for me to keep things straight myself, much less teach others how to do that too and control the customer at the same time and get to the bottom of things 🙂 It is surely fun. I don’t know how anyone could recreate such a scenario in the classroom. But practicing on a cadaver is probably a reasonable way to prep for surgery.
Thank you. Well written and very real in my everyday activities. Will certainly share this article with others in my org and at least get their minds seeded. Seed, Time and Harvest.
Would something like this http://www.crisiscompany.com/Pages/trainingpackages.html work, tailored for training people to respond to workplace events?
Thanks for your contribution to the reflections happening around the world about Organizational Resilience. While you look at the IT side, we operate on the physical side of Organization Resilience. There is no doubt that our respective disciplines will link more and more in the future. The American Society for Industrial Security (ASIS) and ISO are researching the matter with their Organizational Resilience Maturity Model Standard Committee (http://www.asisonline.org/guidelines/committees/spc.4_std.htm) headed by Dr. Mark H. Siegel.
My colleague John Gargett (firstname.lastname@example.org) is a member of the committee and he has developed a method to implement Organization Resilience called R-SEC. John’s method considers what he calls the 4 Ts (Teams, Techniques, Technology and Training) and connects them with a net-centric approach that recognizes the role of technology, social and human networks in the response to an incident. You may find his paper of interest. It is on the ASIS site at http://www.asisonline.org/guidelines/committees/docs/R-SEC_and_Organizational_Resilience.pdf
What is a TTR?
I am really happy to read that you like the ideas that we have developed within a research project (focusing on organizational resilience in escalating situations) at Lund University! We presented an additional paper focusing on more epistemological ideas around the issues of team training in complex environments at the last Resilience Symposium in June 2011. Please check out my blog: johanniklas.blogspot.com for the full reference, and for other related references in which the ideas are more elaborated. Don’t hesitate to contact me for further thoughts, questions, or ideas.