Resilience Engineering: Part I

I’ve been drafting this post for a really long time. Like most posts, it’s largely for me to get some thoughts down. It’s also very related to the topic I’ll be talking about at Velocity later this year.

When I gave a keynote talk at the Surge Conference last year, I talked about how our field of web engineering is still young, and would do very well to pay attention to other fields of engineering, since I suspect that we have a lot to learn from them. Contrary to popular belief, concepts such as fault tolerance, redundancy of components, sacrificial parts, automatic safety mechanisms, and capacity planning weren’t invented with the web. As it turns out, some of those ideas have been studied and put into practice in other fields for decades, if not centuries.

Systems engineering, control theory, reliability engineering…the list goes on for where we should be looking for influences, and other folks have noticed this as well. As our field recognizes the value of taking a “systems” (the C. West Churchman definition, not the computer software definition) view on building and managing infrastructures with a “Full Stack Programmer” perspective, we should pull our heads out of our echo chamber every now and again, because we can gain so much from lessons learned elsewhere.

Last year, I was lucky to convince Dr. Richard Cook to let us include his article “How Complex Systems Fail” in Web Operations. Some months before, I had seen the article and began to poke around Dr. Cook’s research areas: human error, cognitive systems engineering, safety, and a relatively new multi-discipline area known as Resilience Engineering.

What I found was nothing less than exhilarating and inspirational, and it’s hard for me to not consider this research mandatory reading for anyone involved with building or designing socio-technical systems. (Hint: we all do, in web operations) Frankly, I haven’t been this excited since I saw Jimmy Page in a restaurant once in the mid-90s. Even though Dr. Cook (and others in his field, like Erik Hollnagel, David Woods, and Sidney Dekker) historically have written and researched resilience in the context of aviation, space transportation, healthcare and manufacturing, their findings strike me as incredibly appropriate to web operations and development.

Except, of course, accidents in our field don’t actually harm or kill people. But they almost always involve humans, machines, high stress, and high expectations.

Some of the concepts in resilience engineering run contrary to the typical (or stereotypical) perspectives that I’ve found in operations management, and that’s what I find so fascinating. I’m especially interested in organizational resilience, and the realization that safety in systems develops not in spite of us messy humans, but because of it.

For example:

Historical approaches taken towards improving “safety” in production might not be best

Conventional wisdom might have you believe that the systems we build are basically safe, and that all they need is protection from unreliable humans. This logically stems from the myth that all outages/degradations occur as the result of a change gone wrong, and I suspect this idea also comes from Root Cause Analysis write-ups ending with “human error” at the bottom of the page. But Dekker, Woods, and others in Behind Human Error suggest that listing human error as a root cause isn’t where you should end, it’s where you should start your investigation. Getting behind what led to a ‘human error’ is where the good stuff happens, but unless you’ve got a safe political climate (i.e., no one is going to get punished or fired for making mistakes) you’ll never get at how and why the error was made. Which means that you will ignore one of the largest opportunities to make your system (and organization) more efficient and resilient in the face of incidents. Mismatches, slips, lapses, and violations…each one of those types of error can lead to different ways of improving. And of course, working out the motivations and intentions of people who have made errors isn’t straightforward, especially engineers who might not have enough humility to admit to making an error in the first place.

Root Cause Analysis can be easily misinterpreted and abused

The idea that failures in complex systems can literally have a singular ‘root’ cause, as if failures are the result of linear steps in time, is just incorrect. Not only is it almost always incorrect, but in practice that perspective can be harmful to an organization because it allows management and others to feel better about improving safety, when they’re not, because the solution(s) can be viewed as simple and singular fixes (in reality, they’re not). James Reason’s pioneering book Human Error is enlightening on these points, to say the least. In reality (and I am guilty of this as anyone) there are motivations to reduce complex failures to singular/linear models, tipping the scales on what Hollnagel refers to as an ETTO, or Efficiency-Thoroughness Trade-Off, which I think will sound familiar to anyone working in a web startup. Because why spend extra time digging to find details of that human error-causing outage, when you have work to do? Plus, if you linger too long in that postmortem meeting, people are going to feel even worse about making a mistake, and that’s just cruel, right? 🙂

PostMortems or accident investigations is not the only way an organization can improve “safety”

Only looking at failures to guide your designs, tools, and processes drastically minimizes your ability to improve, Hollnagel says. Instead of looking at the things that go wrong, looking at the things that go right is a better strategy to improve resiliency. Personally, I think that engineering teams who practice continuous deployment intuitively understand this. Small and frequent changes made to production by a growing number of developers ascribe to a particular culture of safety, whether they know it or not. It requires what Hollnagel refers to as a “constant sense of unease”, and awareness of failure is what helps bridge that stereotypical development and operations divide.

Resilience should be a 4th management objective, alongside Better/Faster/Cheaper

The definition goes like this:

Resilience is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions. Since resilience is about being able to function, rather than being impervious to failure, there is no conflict between productivity and safety.

This sounds like one of those commonsense ideas, right? In an extremely self-serving way, I find some validation in that definition that optimizing for MTTR is better than optimizing for MTBF. My gut says that this shouldn’t be shocking or a revelation; it’s what mature engineering is all about.

Safety might not come from the sources you think it comes from

“…so safety isn’t about the absence of something…that you need to count errors or monitor violations, and tabulate incidents and try to make those things go away…’s about the presence of something. But the presence of what? When we find that things go right under difficult circumstances, it’s mostly because of people’s adaptive capacity; their ability to recognize, adapt to, and absorb changes and disruptions, some of which might fall outside of what the system is designed or trained to handle.”

– Sidney Dekker

My plan is to post more about these topics, because there are just too many ideas to explain in a single go. Apparently, Ashgate Publishing has owned this space, with a whole series of books. The newest one, Resilience Engineering in Practice, is in my bag, and I can’t put it down. Examples of these ideas in real-world scenarios (hospital and medical ops, power plants, air traffic control, financial services) are juicy with details, and the chapter “Lessons from the Hudson” goes into excellent detail about the trade-offs that go on in the mind of someone in high-stress failure scenarios, like Chesley Sullenberger.

I’ll end on this decent introduction to some of the ideas that includes the above quote, from Sidney Dekker. There’s some distracting camera work, but the ideas get across:

The new book: Web Operations

Web Operations
At the Velocity Conference last year, I was talking to Mike Loukides from O’Reilly about the topics being presented and how it was so great to see such successful veterans of the field come out from behind the curtain and share their experiences. Mike said that there was interest in doing a book on the (obviously) broad subject of web operations, in a format similar to the Beautiful books that O’Reilly has in their Theory in Practice series.

Needless to say, I jumped at the chance to help out. Over the following months, Jesse Robbins and I wrangled a bunch of topics we thought were integral to the field and authors who could cover them. It’s in the final stages of being published as we speak, but I think it came out pretty damn good.

These folks cranked out great chapters while still doing their day jobs, and it shows. It’s a great collection of war stories, advice, and hard-earned lessons.

Here is a list of the chapters:

“The Web Ops Career Path” Theo Schlossnagle
“Cloud Computing At Picnik: Lessons Learned” Justin Huff
“Infrastructure and Application Metrics” Matt Massie and myself
“Continuous Deployment” Eric Ries
“Infrastructure as Code” Adam Jacob
“Monitoring” Patrick Debois
“How Complex Systems Fail” Dr. Richard Cook
“Community Management and Web Operations” Heather Champ
“Dealing With Unexpected Traffic Spikes” Brian Moon
“Dev and Ops Cooperation and Collaboration” Paul Hammond
“How Your Visitors Feel: User-Facing Metrics” Alistair Croll and Sean Power
“Relational Database Strategy and Tactics For The Web” Baron Schwartz
“The Art and Science of Postmortems” Jacob Loomis
“Managing Web Storage” Anoop Nagwani
“Nonrelational Datastores” Eric Florenzano
“Agile Infrastructure” Andrew Clay Shafer
“Things That Go Bump In The Night (And How To Sleep Through Them)” Michael Christian

Royalties from the sales will go to the national 826 Valencia organization, which is dedicated to supporting students ages 6 to 18 with their writing skills. They do this by offering free drop-in tutoring at eight different locations around the country, as well as special events, student publishing, and scholarships.

I wrote a book about common sense.

The Art of Capacity Planning: Scaling Web Resources

Whew. That took longer than I thought.

Todd Hoff over at the High Scalability blog has an email interview with me about a book that I wrote, called “The Art of Capacity Planning: Scaling Web Resources“.  I’m still just happy that I got it done at all, seeing how it was due the same week that my son was born.

This book happened because of a LOT of people. You know who you are, because you’re in the Acknowledgements. 🙂

Four chapters of the new book on RoughCuts…

So now there’s chapters 1-4 on Safari RoughCuts. Which means if you don’t mind shelling out the dough, you can take a look at what I’ve been getting up early for every day for the past few months. The working title is “The Art of Capacity Planning” and it’s meant to be a no-nonsense description of the capacity planning process and considerations for web operations.

I still have two chapters to go before it’s all finished, but if you’re nice enough to take a look at what I’ve got thus far, I’d appreciate any feedback. I’m sure there could be typos and some graphs misaligned, but such is life with “drafts”. 🙂