
At the Velocity Conference last year, I was talking to Mike Loukides from O’Reilly about the topics being presented and how it was so great to see such successful veterans of the field come out from behind the curtain and share their experiences. Mike said that there was interest in doing a book on the (obviously) broad subject of web operations, in a format similar to the Beautiful books that O’Reilly has in their Theory in Practice series.
Needless to say, I jumped at the chance to help out. Over the following months, Jesse Robbins and I wrangled a bunch of topics we thought were integral to the field and authors who could cover them. The book is in the final stages of being published as we speak, and I think it came out pretty damn good.
These folks cranked out great chapters while still doing their day jobs, and it shows. It’s a great collection of war stories, advice, and hard-earned lessons.
Here is a list of the chapters:
“The Web Ops Career Path” Theo Schlossnagle
“Cloud Computing At Picnik: Lessons Learned” Justin Huff
“Infrastructure and Application Metrics” Matt Massie and myself
“Continuous Deployment” Eric Ries
“Infrastructure as Code” Adam Jacob
“Monitoring” Patrick Debois
“How Complex Systems Fail” Dr. Richard Cook
“Community Management and Web Operations” Heather Champ
“Dealing With Unexpected Traffic Spikes” Brian Moon
“Dev and Ops Cooperation and Collaboration” Paul Hammond
“How Your Visitors Feel: User-Facing Metrics” Alistair Croll and Sean Power
“Relational Database Strategy and Tactics For The Web” Baron Schwartz
“The Art and Science of Postmortems” Jacob Loomis
“Managing Web Storage” Anoop Nagwani
“Nonrelational Datastores” Eric Florenzano
“Agile Infrastructure” Andrew Clay Shafer
“Things That Go Bump In The Night (And How To Sleep Through Them)” Michael Christian
Royalties from the sales will go to the national 826 Valencia organization, which is dedicated to supporting students ages 6 to 18 with their writing skills. They do this by offering free drop-in tutoring at eight different locations around the country, as well as special events, student publishing, and scholarships.
I guess I’m late on getting to this, but How Complex Systems Fail by Richard Cook is excellent.
Let me start with this: I don’t think I can overstate how right-on this paper is, with respect to the challenges, solutions, observations, and concerns involved with operating a medium to large web infrastructure. I found this via @benjaminblack, and I agree with him 100%: this should be considered required reading for anyone in our industry. I’m not sure if Cook ever thought that his paper would apply to web infrastructure, but I think it can and does. Please take 30 minutes right now and read it.
There are a number of salient points in the paper that I’d like to comment on. Again, this is through the lens of failures of complex systems as it pertains to web operations:
7) Post-accident attribution to a ‘root cause’ is fundamentally wrong.
I’m going to guess that this point may be seen as controversial given the prevailing webops wisdom, where post-mortems are for sure necessary, but where their content may or may not be effective in preventing similar types of failure. I do value the process of a post-mortem, because I think the human element of understanding complex failures is important, and doing whatever you can to put safety in place is good, modulo what is said in section #16 of the paper. I believe that even a rudimentary process of “5 Whys” has value. But at the same time, I also think there is something to the spirit of this paragraph: there is a danger in standing behind a single underlying cause when there are systemic failures involved. Doing this can lead to the false belief that you’ve got this failure mode covered, that you’ve found the silver bullet that made the whole mountain crumble, and jeez, what a relief, because that will never bite us again.
14) Change introduces new forms of failure.
I totally agree with this point. However, I often see this as a rallying point for operations teams to say “No!” to change, when instead they should be working alongside development (and product owners) with a goal of reducing the risk of failure associated with each change. I do not believe that ‘release early, release often’ in and of itself can reduce that risk. I believe that the real (and only) way to do this is both technical and cultural. But I’ve spoken about this before.
16) Safety is a characteristic of systems and not of their components.
Emphasis on “Safety cannot be purchased or manufactured; it is not a feature that is separate from the other components of the system.” Real safety comes from smart people doing smart things to the entire shebang, not the individual guts.
And, I think, the point I love the most, with all of my heart:
18) Failure free operations require experience with failure.
Fear is a strong emotion. I believe it can be used as a strong motivator for ensuring safety in the face of constant change, instead of a reason to push back on the very idea of change. Embrace fear of outages and degradation. Use it to guide your architecture, your code, your infrastructure. So lean into it.
There are a lot of great points in the paper, and I could go on, but you get the idea.