Resilience Engineering: Part I

I’ve been drafting this post for a really long time. Like most posts, it’s largely for me to get some thoughts down. It’s also very related to the topic I’ll be talking about at Velocity later this year.

When I gave a keynote talk at the Surge Conference last year, I talked about how our field of web engineering is still young, and would do very well to pay attention to other fields of engineering, since I suspect that we have a lot to learn from them. Contrary to popular belief, concepts such as fault tolerance, redundancy of components, sacrificial parts, automatic safety mechanisms, and capacity planning weren’t invented with the web. As it turns out, some of those ideas have been studied and put into practice in other fields for decades, if not centuries.

Systems engineering, control theory, reliability engineering…the list of fields we should be looking to for influence goes on, and other folks have noticed this as well. As our field recognizes the value of taking a “systems” view (the C. West Churchman definition, not the computer software one) of building and managing infrastructures with a “Full Stack Programmer” perspective, we should pull our heads out of our echo chamber every now and again, because we can gain so much from lessons learned elsewhere.

Last year, I was lucky to convince Dr. Richard Cook to let us include his article “How Complex Systems Fail” in Web Operations. Some months before, I had seen the article and began to poke around Dr. Cook’s research areas: human error, cognitive systems engineering, safety, and a relatively new multi-discipline area known as Resilience Engineering.

What I found was nothing less than exhilarating and inspirational, and it’s hard for me to not consider this research mandatory reading for anyone involved with building or designing socio-technical systems. (Hint: we all do, in web operations) Frankly, I haven’t been this excited since I saw Jimmy Page in a restaurant once in the mid-90s. Even though Dr. Cook (and others in his field, like Erik Hollnagel, David Woods, and Sidney Dekker) historically have written and researched resilience in the context of aviation, space transportation, healthcare and manufacturing, their findings strike me as incredibly appropriate to web operations and development.

Except, of course, accidents in our field don’t actually harm or kill people. But they almost always involve humans, machines, high stress, and high expectations.

Some of the concepts in resilience engineering run contrary to the typical (or stereotypical) perspectives I’ve found in operations management, and that’s what I find so fascinating. I’m especially interested in organizational resilience, and in the realization that safety in systems develops not in spite of us messy humans, but because of us.

For example:

Historical approaches taken towards improving “safety” in production might not be best

Conventional wisdom might have you believe that the systems we build are basically safe, and that all they need is protection from unreliable humans. This logically stems from the myth that all outages and degradations are the result of a change gone wrong, and I suspect it also comes from Root Cause Analysis write-ups ending with “human error” at the bottom of the page. But Dekker, Woods, and others in Behind Human Error suggest that listing human error as a root cause isn’t where you should end your investigation; it’s where you should start it. Getting behind what led to a ‘human error’ is where the good stuff happens, but unless you’ve got a safe political climate (i.e., no one is going to get punished or fired for making mistakes) you’ll never get at how and why the error was made, which means you’ll ignore one of the largest opportunities to make your system (and organization) more efficient and resilient in the face of incidents. Mismatches, slips, lapses, and violations…each of those types of error points to a different way of improving. And of course, working out the motivations and intentions of people who have made errors isn’t straightforward, especially when they’re engineers who might not have enough humility to admit to making an error in the first place.

Root Cause Analysis can be easily misinterpreted and abused

The idea that failures in complex systems can literally have a singular ‘root’ cause, as if failures were the result of linear steps in time, is almost always wrong. Worse, in practice that perspective can be harmful to an organization, because it lets management and others feel better about improving safety when they aren’t: the solutions get framed as simple, singular fixes when in reality they’re not. James Reason’s pioneering book Human Error is enlightening on these points, to say the least. In reality (and I am as guilty of this as anyone) there are motivations to reduce complex failures to singular, linear models, tipping the scales of what Hollnagel refers to as an ETTO, or Efficiency-Thoroughness Trade-Off, which I think will sound familiar to anyone working in a web startup. Because why spend extra time digging into the details of that human-error-caused outage when you have work to do? Plus, if you linger too long in that postmortem meeting, people are going to feel even worse about making a mistake, and that’s just cruel, right? 🙂

Postmortems and accident investigations are not the only way an organization can improve “safety”

Hollnagel argues that looking only at failures to guide your designs, tools, and processes drastically limits your ability to improve. A better strategy for improving resiliency is to look at the things that go right, not just the things that go wrong. Personally, I think engineering teams who practice continuous deployment intuitively understand this. Teams that make small, frequent changes to production, with a growing number of developers deploying, are subscribing to a particular culture of safety, whether they know it or not. It requires what Hollnagel refers to as a “constant sense of unease,” and that awareness of failure is what helps bridge the stereotypical divide between development and operations.

Resilience should be a 4th management objective, alongside Better/Faster/Cheaper

The definition goes like this:

Resilience is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions. Since resilience is about being able to function, rather than being impervious to failure, there is no conflict between productivity and safety.

This sounds like one of those commonsense ideas, right? In an extremely self-serving way, I find some validation in that definition that optimizing for MTTR (mean time to recover) is better than optimizing for MTBF (mean time between failures). My gut says that this shouldn’t be shocking or a revelation; it’s what mature engineering is all about.
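To make the contrast between those two metrics concrete, here is a minimal sketch of how each is typically computed from an incident log. It’s in Python, the `Incident` record, field names, and timestamps are all made up for illustration, and MTBF definitions vary in practice (some include repair time in the interval); nothing here comes from the resilience engineering literature itself.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class Incident:
    """A hypothetical incident record: when impact began and when service was restored."""
    started: datetime
    resolved: datetime

def mttr_hours(incidents: List[Incident]) -> float:
    """Mean Time To Recover: average duration of an incident, in hours."""
    durations = [(i.resolved - i.started).total_seconds() for i in incidents]
    return sum(durations) / len(durations) / 3600.0

def mtbf_hours(incidents: List[Incident]) -> float:
    """Mean Time Between Failures: average gap from one recovery to the next failure, in hours."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [(nxt.started - prev.resolved).total_seconds()
            for prev, nxt in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps) / 3600.0

# Illustrative, fabricated data: three short outages over a few weeks.
incidents = [
    Incident(datetime(2011, 5, 1, 3, 0), datetime(2011, 5, 1, 3, 20)),
    Incident(datetime(2011, 5, 9, 14, 0), datetime(2011, 5, 9, 14, 10)),
    Incident(datetime(2011, 5, 20, 22, 0), datetime(2011, 5, 20, 22, 5)),
]

# Optimizing for MTTR means shrinking the recovery durations;
# optimizing for MTBF means stretching the gaps between failures.
print(f"MTTR: {mttr_hours(incidents):.2f} hours")
print(f"MTBF: {mtbf_hours(incidents):.1f} hours")
```

The arithmetic isn’t the point, of course; the point is that a team organized around shrinking those recovery durations is investing in its adaptive capacity, rather than in the impossible goal of never failing.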

Safety might not come from the sources you think it comes from

“…so safety isn’t about the absence of something…that you need to count errors or monitor violations, and tabulate incidents and try to make those things go away…it’s about the presence of something. But the presence of what? When we find that things go right under difficult circumstances, it’s mostly because of people’s adaptive capacity; their ability to recognize, adapt to, and absorb changes and disruptions, some of which might fall outside of what the system is designed or trained to handle.”

– Sidney Dekker

My plan is to post more about these topics, because there are just too many ideas to explain in a single go. Apparently, Ashgate Publishing has owned this space, with a whole series of books. The newest one, Resilience Engineering in Practice, is in my bag, and I can’t put it down. Examples of these ideas in real-world scenarios (hospital and medical ops, power plants, air traffic control, financial services) are juicy with details, and the chapter “Lessons from the Hudson” goes into excellent detail about the trade-offs that go on in the mind of someone in high-stress failure scenarios, like Chesley Sullenberger.

I’ll end with this decent introduction to some of these ideas from Sidney Dekker, which includes the quote above. There’s some distracting camera work, but the ideas get across:
