I guess I’m late in getting to this, but How Complex Systems Fail by Richard Cook is excellent.
Let me start with this: I don’t think I can overstate how right-on this paper is, with respect to the challenges, solutions, observations, and concerns involved with operating a medium to large web infrastructure. I found this via @benjaminblack, and I agree with him 100%: this should be considered required reading for anyone in our industry. I’m not sure if Cook ever thought that his paper would apply to web infrastructure, but I think it can and does. Please take 30 minutes right now and read it. 🙂
There are a number of salient points in the paper that I’d like to comment on. Again, this is through the lens of failures of complex systems as it pertains to web operations:
7) Post-accident attribution of accidents to a ‘root cause’ is fundamentally wrong.
I’m going to guess that this portion may be viewed as controversial against the prevailing webops wisdom, where post-mortems are certainly seen as necessary, even if their content may or may not be effective in preventing similar types of failure. I do value the process of a post-mortem: the human element of understanding complex failures is important, and putting in place whatever safety you can is good, modulo what is said in section #16 of the paper.
I believe that even a rudimentary process of “5 Whys” has value. (Update: I did when I first wrote this. Now, I do not.) At the same time, I think the spirit of this paragraph holds: there is a danger in standing behind a single underlying cause when systemic failures are involved. Doing so can lead to the false belief that you’ve got this mode covered, that you’ve found the silver bullet that made the whole mountain crumble, and jeez, what a relief, because that will never bite us again.
14) Change introduces new forms of failure.
I totally agree with this point. However, I often see this as a rallying point for operations teams to say “No!” to change, when instead they should be working alongside development (and product owners) with a goal of reducing the risk of failure associated with each change. I do not believe that ‘release early, release often’ in and of itself can reduce that risk. I believe that the real (and only) way to do this is both technical and cultural. But I’ve spoken about this before.
16) Safety is a characteristic of systems and not of their components.
Emphasis on “Safety cannot be purchased or manufactured; it is not a feature that is separate from the other components of the system.” Real safety comes from smart people doing smart things to the entire shebang, not the individual guts.
and I think the point I love the most, with all of my heart:
18) Failure-free operations require experience with failure.
Fear is a powerful emotion. I believe it can be used as a motivator for ensuring safety in the face of constant change, instead of a reason to push back on the very idea of change. Embrace fear of outages and degradation, and use it to guide your architecture, your code, and your infrastructure. Lean into it.
There are a lot of great points in the paper, and I could go on, but you get the idea.