Resilience Engineering: Part I

I’ve been drafting this post for a really long time. Like most of my posts, it’s largely a way for me to get some thoughts down. It’s also closely related to the topic I’ll be talking about at Velocity later this year.

When I gave a keynote talk at the Surge Conference last year, I talked about how our field of web engineering is still young, and would do very well to pay attention to other fields of engineering, since I suspect that we have a lot to learn from them. Contrary to popular belief, concepts such as fault tolerance, redundancy of components, sacrificial parts, automatic safety mechanisms, and capacity planning weren’t invented with the web. As it turns out, some of those ideas have been studied and put into practice in other fields for decades, if not centuries.

Systems engineering, control theory, reliability engineering…the list of fields we should be looking to for influence goes on, and other folks have noticed this as well. As our field recognizes the value of taking a “systems” (the C. West Churchman definition, not the computer software definition) view of building and managing infrastructures with a “Full Stack Programmer” perspective, we should pull our heads out of our echo chamber every now and again, because we can gain so much from lessons learned elsewhere.

Last year, I was lucky to convince Dr. Richard Cook to let us include his article “How Complex Systems Fail” in Web Operations. Some months before, I had seen the article and began to poke around Dr. Cook’s research areas: human error, cognitive systems engineering, safety, and a relatively new multi-discipline area known as Resilience Engineering.

What I found was nothing less than exhilarating and inspirational, and it’s hard for me not to consider this research mandatory reading for anyone involved with building or designing socio-technical systems. (Hint: in web operations, we all are.) Frankly, I haven’t been this excited since I saw Jimmy Page in a restaurant once in the mid-90s. Even though Dr. Cook (and others in his field, like Erik Hollnagel, David Woods, and Sidney Dekker) have historically written about and researched resilience in the context of aviation, space transportation, healthcare, and manufacturing, their findings strike me as incredibly appropriate to web operations and development.

Except, of course, accidents in our field don’t actually harm or kill people. But they almost always involve humans, machines, high stress, and high expectations.

Some of the concepts in resilience engineering run contrary to the typical (or stereotypical) perspectives that I’ve found in operations management, and that’s what I find so fascinating. I’m especially interested in organizational resilience, and the realization that safety in systems develops not in spite of us messy humans, but because of us.

For example:

Historical approaches to improving “safety” in production might not be the best

Conventional wisdom might have you believe that the systems we build are basically safe, and that all they need is protection from unreliable humans. This logically stems from the myth that all outages and degradations occur as the result of a change gone wrong, and I suspect the idea also comes from Root Cause Analysis write-ups ending with “human error” at the bottom of the page. But Dekker, Woods, and others in Behind Human Error suggest that listing human error as a root cause isn’t where you should end; it’s where you should start your investigation. Getting behind what led to a ‘human error’ is where the good stuff happens, but unless you’ve got a safe political climate (i.e., no one is going to get punished or fired for making mistakes), you’ll never get at how and why the error was made. That means you will ignore one of the largest opportunities to make your system (and organization) more efficient and resilient in the face of incidents. Mismatches, slips, lapses, and violations: each of those types of error points to a different way of improving. And of course, working out the motivations and intentions of people who have made errors isn’t straightforward, especially when those people are engineers who might not have enough humility to admit to making an error in the first place.
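To make that concrete, here is a minimal sketch in Python of a postmortem record that treats the type of error as the start of an investigation rather than its conclusion. The categories and follow-up questions are hypothetical illustrations of my own choosing, not a taxonomy prescribed by Dekker or Woods:

    # Hypothetical sketch: "human error" as the starting point of an
    # investigation, not its conclusion.
    from dataclasses import dataclass, field

    # Each type of error suggests a different kind of follow-up. These mappings
    # are illustrative assumptions, not anything prescribed by the research.
    FOLLOW_UP = {
        "slip":      "Was the tooling easy to mis-operate under time pressure?",
        "lapse":     "Was the operator relying on memory where a checklist or automation could help?",
        "mismatch":  "Did the operator's mental model differ from how the system actually behaves?",
        "violation": "Why did the documented procedure seem slower or worse than the workaround?",
    }

    @dataclass
    class PostmortemRecord:
        summary: str
        error_type: str                              # one of FOLLOW_UP's keys
        contributing_factors: list = field(default_factory=list)

        def next_question(self) -> str:
            """Return the investigative question suggested by the error type."""
            return FOLLOW_UP.get(self.error_type, "Keep digging.")

    # Example: an outage where the operator's mental model didn't match reality.
    record = PostmortemRecord(
        summary="Deploy script pointed at the wrong cluster",
        error_type="mismatch",
        contributing_factors=["ambiguous hostnames", "on-call fatigue", "no dry-run mode"],
    )
    print(record.next_question())

The only point of the sketch is that the error type and contributing factors are where the investigation begins; a single “root cause: human error” field would end it.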

Root Cause Analysis can be easily misinterpreted and abused

The idea that failures in complex systems can literally have a singular ‘root’ cause, as if failures are the result of linear steps in time, is almost always incorrect. Worse, in practice that perspective can be harmful to an organization, because it allows management and others to feel better about improving safety when they’re not, since the solution(s) can be viewed as simple and singular fixes (in reality, they’re not). James Reason’s pioneering book Human Error is enlightening on these points, to say the least. In reality (and I am as guilty of this as anyone) there are motivations to reduce complex failures to singular/linear models, tipping the scales on what Hollnagel refers to as an ETTO, or Efficiency-Thoroughness Trade-Off, which I think will sound familiar to anyone working in a web startup. Because why spend extra time digging into the details of that human-error-caused outage when you have work to do? Plus, if you linger too long in that postmortem meeting, people are going to feel even worse about making a mistake, and that’s just cruel, right? 🙂

Postmortems and accident investigations are not the only way an organization can improve “safety”

Only looking at failures to guide your designs, tools, and processes drastically limits your ability to improve, Hollnagel says. Instead of looking only at the things that go wrong, looking at the things that go right is a better strategy for improving resilience. Personally, I think that engineering teams who practice continuous deployment intuitively understand this. Making small and frequent changes to production, with a growing number of developers doing the deploying, reflects a particular culture of safety, whether the team knows it or not. It requires what Hollnagel refers to as a “constant sense of unease”, and that awareness of failure is what helps bridge the stereotypical divide between development and operations.

Resilience should be a 4th management objective, alongside Better/Faster/Cheaper

The definition goes like this:

Resilience is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions. Since resilience is about being able to function, rather than being impervious to failure, there is no conflict between productivity and safety.

This sounds like one of those commonsense ideas, right? In an extremely self-serving way, I find some validation in that definition that optimizing for MTTR (mean time to recover) is better than optimizing for MTBF (mean time between failures). My gut says that this shouldn’t be shocking or a revelation; it’s what mature engineering is all about.
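As a rough illustration of the difference, here is a small sketch with made-up incident data; the numbers and the 720-hour window are assumptions for the example, not a recommendation on how to measure either metric:

    # Hypothetical sketch: the same incident history scored by MTTR and MTBF.
    # Timestamps are made-up hours since the start of a quarter.
    incidents = [
        {"start": 100.0, "resolved": 100.2},   # short outage, fast recovery
        {"start": 340.0, "resolved": 340.1},
        {"start": 610.0, "resolved": 610.3},
    ]
    observation_window = 720.0  # hours in the observation period (assumed)

    downtime = sum(i["resolved"] - i["start"] for i in incidents)
    mttr = downtime / len(incidents)                          # mean time to recover
    mtbf = (observation_window - downtime) / len(incidents)   # mean time between failures

    print(f"MTTR: {mttr:.2f} h, MTBF: {mtbf:.1f} h")
    # Optimizing for MTBF asks "how do we fail less often?";
    # optimizing for MTTR asks "how quickly do we adapt and recover when we do fail?"

The same incident history produces both numbers, but chasing MTBF pushes you toward preventing change, while chasing MTTR pushes you toward being able to function through and recover from disturbance, which is what the definition above is getting at.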

Safety might not come from the sources you think it comes from

“…so safety isn’t about the absence of something…that you need to count errors or monitor violations, and tabulate incidents and try to make those things go away…it’s about the presence of something. But the presence of what? When we find that things go right under difficult circumstances, it’s mostly because of people’s adaptive capacity; their ability to recognize, adapt to, and absorb changes and disruptions, some of which might fall outside of what the system is designed or trained to handle.”

– Sidney Dekker

My plan is to post more about these topics, because there are just too many ideas to explain in a single go. Apparently, Ashgate Publishing has owned this space, with a whole series of books. The newest one, Resilience Engineering in Practice, is in my bag, and I can’t put it down. The examples of these ideas in real-world scenarios (hospital and medical operations, power plants, air traffic control, financial services) are juicy with details, and the chapter “Lessons from the Hudson” goes into excellent detail about the trade-offs that go on in the mind of someone like Chesley Sullenberger during a high-stress failure scenario.

I’ll end with this decent introduction to some of the ideas, from Sidney Dekker, which includes the quote above. There’s some distracting camera work, but the ideas come across:

Comments

  1. Chris Kelly

    Great post! I’ve been meandering down this route for some time as well and have yet to spend the time to put the pieces together somewhere. “Human Error” is a great resource, and (if you haven’t already), I highly recommend adding Charles Perrow’s “Normal Accidents” to your list.

  2. Nick Gall

    John, Great post indeed. I’ve been trying to weave resilience thinking into enterprise architecture for the past year or so. And I’ve come across resilience engineering in my travels. One concept I find deeply useful in thinking about how to architect or engineer for resilience is to distinguish between “front loop” (aka normal) operations and “back loop” (aka disrupted, reorganizing) operations. This comes from a long line of research in ecological resilience called Panarchy, led in large part by Buzz Holling. If you’re interested in my initial attempt to hybridize enterprise architecture and panarchy, check out my research report: http://bit.ly/ex1fJy . A shorter blog post on the topic is here: http://bit.ly/gRDdNN .

  3. Ramin K

    It was great to come in this morning and see your essay after discussing a lot of the same ideas last night. I’m definitely looking forward to more in this vein.

  4. William Louth

    This ties in somewhat with our recent initiative for greater software self- and cost-awareness and learning (adaptation), in which software is designed and developed to adjust to its environment and (changing) behavior. This is definitely an important element of “that thing” (see video) that needs to be added, and continuously: codifying learning and safety control.

    http://williamlouth.wordpress.com/2011/03/12/cloudconnect-apm-is-dead-long-live-apm/

    Automated Performance Management starts with Software’s Self Observation
    http://opencore.jinspired.com/?p=2709

  5. Lance Carlson

    Very interesting video at the end, but the camera work was so distracting that I had to watch it a second time to really absorb everything.

  6. Ernest Mueller

    I recently heard a lot of parallel ideas brought up in a keynote at a security conference (http://vimeo.com/17822489). They talked about the new “Rugged Manifesto,” which aims to stand for resilience, reliability, security, and the other “ilities” – http://www.ruggedsoftware.org/.

    The example I liked the most from the keynote was the “slump test,” a common test for concrete strength, and the fact that there is just no really good test or standard for software reliability.

  7. George Chiesa

    Hear, hear!

    I mean, listen, listen, very carefully.

  8. Hailay

    I am really impressed by this groundbreaking view of ‘safety’, but I still need more examples to fully grasp the idea of resilience. I want to publish a paper about human and organizational factors in the oil and gas industries, and I want to do it from the point of view of Resilience Engineering.
