UPDATE, 10/17/2017: This post hasn’t aged well, and needs some patching. The title should be “TTR is more important than TBF (for most types of F)”. Why? Because taking the statistical mean of TTR or TBF makes absolutely no sense whatsoever. Incidents and events simply are not comparable in that way, and even if they were, the time an event starts is infinitely negotiable.
Present Allspaw tells past Allspaw: you were wrong, buddy.
This week I gave a talk at QCon SF about development and operations cooperation at Etsy and Flickr. It’s a refresh of talks I’ve given in the past, with more detail about how it’s going at Etsy. (It’s going excellently 🙂 )
There’s a bunch of topics in the presentation slides, all centered around roles, responsibilities, and intersection points of domain expertise commonly found in development and operations teams. One of the not-groundbreaking ideas that I’m finally getting down is something that should be evident for anyone practicing or interested in ‘continuous deployment’:
Being able to recover quickly from failure is more important than having failures less often.
This has what should be an obvious caveat: some types of failures shouldn’t ever happen, and not all failures/degradations/outages are the same (failures resulting in accidental data loss, for example).
Put another way:
MTTR is more important than MTBF
(for most types of F)
(Edited: I originally said “MTTR > MTBF”)
What I’m definitely not saying is that failure should be an acceptable condition. I’m positing that since failure will happen, it’s just as important (or in some cases more important) to spend time and energy on your response to failure as on trying to prevent it. I agree with Hammond when he said:
If you think you can prevent failure, then you aren’t developing your ability to respond.
In a complete steal of Artur Bergman‘s material, an example in the slides of the talk is of the Jeep versus Rolls Royce:
Artur has a Jeep, and he’s right when he says that, for the most part, Jeeps are built to optimize Mean-Time-To-Repair, not following the classical approach to automotive engineering, which is to optimize Mean-Time-Between-Failures. This is likely because Jeep owners have been beating the shit out of their vehicles for decades, and every now and again they expect that abuse to break something. Jeep designers know this, which is why the vehicle is so damn easy to repair. Nuts and bolts are easy to reach, tools are included when you buy the thing, and if you haven’t seen the video of Army personnel disassembling and reassembling a Jeep in under four minutes, you’re missing out.
The Rolls-Royce, on the other hand, likely doesn’t have such adventurous owners, and when it does break down, it’s a fine and acceptable thing for the car to be out of service for a long and expensive repair by the manufacturer.
We as web operations folks want our architectures optimized for MTTR, not for MTBF. I think the reasons should be obvious, and the fact that practices like:
- Dark launching
- Percentage-based production A/B rollouts
- Feature flags
are becoming commonplace should confirm that this approach has legs.
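To make the last two items on that list concrete, here’s a minimal sketch of a feature flag with a percentage-based rollout. Everything here (the `FLAGS` config shape, the function names) is a hypothetical illustration, not Etsy’s or Flickr’s actual implementation. The key MTTR-friendly property is that each user is hashed into a stable bucket, so a rollout can be widened gradually, and rolled back to 0% instantly, with a config change alone:

```python
import hashlib

# Hypothetical flag config: each flag has a kill switch and a rollout
# percentage that can be changed without deploying new code.
FLAGS = {
    "new_checkout": {"enabled": True, "rollout_percent": 10},
}

def bucket(user_id: str) -> int:
    """Map a user id deterministically into one of 100 buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def is_enabled(flag_name: str, user_id: str) -> bool:
    """Return True if the flag is on for this user at the current rollout."""
    flag = FLAGS.get(flag_name)
    if not flag or not flag["enabled"]:
        return False
    # A user sees the feature only if their bucket falls under the
    # configured percentage; the same user always gets the same answer.
    return bucket(user_id) < flag["rollout_percent"]
```

Because the bucketing is deterministic, widening `rollout_percent` from 10 to 50 only adds users; nobody flaps in and out of the experiment, and flipping `enabled` to `False` is an immediate, low-risk “repair.”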
The slides from QConSF are here: