This little ramble of thoughts are related to my talk at Velocity coming up, but I know I’ll never get to this part at the conference, so I figured I’d post about it here.
Building resilience from a systems point of view means (amongst other things) understanding how your organization deals with failure and unexpected situations. Generally this means having a development and operations teams that can work well together under pressure, with fluctuating amounts of uncertainty, bringing their own domain expertise to the table when it matters.
This is what drives some of my favorite Ops candidate interview questions. Knowing Unix commands, network architectures, database behaviors, and scripting languages are obviously required, but comprise only one facet of the gig. The real mettle comes from being able easily zoom in and out of the whole system under scrutiny, splitting up troubleshooting responsibilities amongst your team (and trusting their results) and differentiating red herring symptoms from truly related ones. It also comes from things like:
- Staying away from distracting conversation during the outage response. Nothing kills a TTR like unrelated talk in IRC or a conf call.
- Trusting your information. This is where the UI challenges of dashboard design can make or break an outage response. “Are those units milli, or mega?”
- Balancing too much communication and too little amongst team members. Troubleshooting outage verbosity is a fickle mistress.
- Stomping actions. OneThingAtATime™ methods aren’t easy to stick to, especially when things escalate.
- Keeping outage fatigue at bay, and recognizing when brains are melting and need to take a break.
To make matters worse, determining causality can be tenuous at best when you’re working with complex systems, so being able to recognize when a failure has a single root cause (hint: with the big outages – almost never) and when it has multiple contributing causes is a skill that isn’t easily gained without seeing a lot of action in the past.
So it’s not a surprise that working well within a team under stressful scenarios is something other fields try to train people for. Trauma surgeons, FBI agents, military teams, air traffic control, etc. all have drills, exercises, and simulations for teaching these skills, but they are all done within the context of what those escalating situations look like in their specific fields.
So this brings a question that has come up before in my circles:
Can this sort of organizational resilience be taught, within the context of web operations?
GameDay exercises could certainly be one avenue for testing and training team-based outage response, but most of the focus there (at least those discussed publicly by companies who hold GameDay exercises) is testing the infrastructure and application-level components, and even then under controlled conditions and relatively narrow failure modes.
So the confidence-building value of GameDay drills lie elsewhere, and don’t really exercise the cognitive load that real-world failures can produce on the humans (i.e. the troubleshooting dev and ops teams) like the spectacular Amazon AWS outage recently.
But! Some smart folks have been thinking about this question, at a higher-level:
Is it possible to construct non-contextual and generic drills that can train competencies for this sort of on-the-fly, making-sense-of-unfamiliar-failure-modes, and sometimes disorienting troubleshooting?
At the Lund University in Sweden, there’s an excellent article on building organizational resilience in escalating situations, which I believe resulted in a chapter in the Resilience Engineering in Practice book, and also references another excellent article by David Woods and Emily Patterson called How Unexpected Events Produce An Escalation Of Cognitive And Coordinative Demands.
The parts I want to highlight here are best practices for designing scenarios meant to train these skills. If you’re looking to design a good drill meant to educate and/or train Ops and Devs on what cognitive muscles to develop for handling large-scale outages, this is a pretty damn good list (quoted from both of those sources above):
- Try to force people beyond their learned roles and routines. The scenario can contain problems that are not solvable within those roles or routines, and forces people to step out of those roles and routines.
- Contain a number of hidden goals, at various times during the scenario, that people could pursue (e.g. different ways of escaping the situation or de-escalating it), but that they have to vocalize and articulate in order to begin to achieve them (as they cannot do so by themselves).
- Include potential actions of which the consequences are both important and difficult to foresee (and that might significantly influence people’s ability to control the problem in the near future). This can force people into pro-active thinking and articulation of their expectations of what might happen.
- Be able to trap people in locking onto one solution that everybody is fixedly working towards. This can be done by garden-pathing; making the escalating problem look initially (with strong cues) like something the crew could already familiar with, but then letting it depart (with much weaker cues) to see whether the crew is caught on the garden path and lets the situation escalate.
- Or the scenario, by creating so much cognitive noise in terms of new warnings and events, should be able to trip people into thematic vagabonding—the tendency to redirect attention and change diagnosis with each incoming data piece, which results in a fragmentation of problem-solving.
Think that such a scenario could be constructed?
I want to think so, but of course nothing teaches like the hindsight of a real production outage, eh?