I just spent the last week in Lisbon, Portugal at the Resilience Engineering Symposium. Zoran Perkov and I were invited to speak on the topic of software operations and resilience in the financial trading and Internet services worlds, to an audience of practitioners and researchers from all around the globe, in a myriad of industries.
My hope was to start a dialogue about the connections we’ve seen (and to hopefully explore more) between practices and industries, and to catch theories about resilience up to what’s actually happening in these “pressurized and consequential”1 worlds.
I thought I’d put down some of my notes, highlights and takeaways here.
- In order to look at how resilience gets “engineered” (if that is actually a thing) we have to look at adaptations that people make in the work that they do, to fill in the gaps that show up as a result of the incompleteness of designs, tools, and prescribed practices. We have to do this with a “low commitment to concepts”2 because otherwise we run the risk of starting with a model (OODA? four cornerstones of resilience? swiss cheese? situation awareness? etc.) and then finding data to fill in those buckets. Which can happen unfortunately quite easily, and also: is not actually science.
- While I had understood this before the symposium, I’m now even clearer on it: resilience is not the same as fault-tolerance or “graceful degradation.” Instead, it’s something more, akin to what Woods calls “graceful extensibility.”
- The other researchers and practitioners in ‘safety-critical’ industries were very interested in what approaches such as continuous deployment/delivery might look like in their fields. They saw it as a set of evolutions from waterfall that Internet software has made that allows it to be flexible and adaptive in the face of uncertainty about how the high-level system of users, providers, customers, operations, performance, etc. will behave in production. This was their reflection, not my words in their mouths, and I really couldn’t agree more. Validating!
- While financial trading systems and Internet software have some striking similarities, the differences are stark. Zoran and I are both jealous of each other’s worlds in different ways. Also: Zoran can quickly scare the shit out of an audience filled with pension and retirement plans. 🙂
- The lines between words (phases?) such as: design-implementation-operations are blurred in worlds where adaptive cycles take place, largely because feedback loops are the focus (or source?) of the cycles.
- We still have a lot to do in “software operations”3 in that we may be quite good at focusing on and discussing software development and practices, alongside the computer science concepts that influence those things, but we’re not yet good at exploring what we can find about our field through the lenses of social science and cognitive psychology. I would like to change that, because I think we haven’t gone far enough in being introspective on those fronts. I think we might only be flirting with those areas currently. Dropping a Conway’s Law here and a cognitive bias there is a good start. But we need to consider that we might not actually know what the hell we’re talking about (yet!). However, I’m optimistic on this front, because our community has both curiosity and a seemingly boundless ability to debate esoteric topics with each other. Now if we can only stop doing it in 140 characters at a time… 🙂
- The term “devops” definitely has analogues in other industries. At the very least, the term brought vigorous nodding as I explained it. Woods used the phrase “throw it over the wall” and it resonated quite strongly with many folks from diverse fields. People from aviation, maritime, patient safety…they all could easily give a story that was analogous to “worked fine in dev, ops problem now” in their worlds. Again, validating.
- There is no Resilience Engineering (or Cognitive Systems Engineering or Systems Safety for that matter) without real dialogue about real practice in the world. In other words, there is no such thing as purely academic here. Every “academic” here viewed their “laboratories” as cockpits, operating rooms and ERs, control rooms in mission control and nuclear plants, on the bridges of massive ships. I’m left thinking that for the most part, this community abhors the fluorescent-lighted environments of universities. They run toward potential explosions, not away from them. Frankly, I think our field of software has a much larger population of the stereotype of the “out-of-touch” computer scientist whose ideas in papers never see the light of production traffic. (hat tip to Kyle for doing the work to do real-world research on what were previously known as academic theories!)
1 Richard Cook’s words.
2 David Woods’ words. I now know how important this is when connecting theory to practice. More on this topic in a different post!
3 This is what I’m now calling what used to be known as “WebOps” or what some refer to as ‘devops’ to reflect that there is more to software services that are delivered via the Internet than just the web, and I’d like to update my language a bit.
Very interesting notes! Are there any publications (books, papers) you’d recommend on the topic of resiliency engineering?
You reflected, in the very interesting posts about a “mature role for
automation”, on the possibility of machines and humans working as team
members or partners. One thoughtful commentary noted the use of
autonomic systems, and I think there is an interesting link to be made
through that word to other large scale systems.
Specifically, Taiichi Ohno’s work “Beyond Large Scale Production” sets
out a thesis for autonomic systems. This approach, together with a
concept that individuals and teams could come up with better solutions
than the present methodology are formative principles of the Toyota
Production System (whose 5 Whys I think you may have misunderstood).
There is a nice story in Jeffrey Liker’s book about Toyota about how, when
a senior Japanese Toyota executive was visiting the first Toyota joint
venture at NUMMI in the States, he was shocked to hear that the
production line had not been stopped in several days. The point was, the
production line, if it had not been stopped to get to the bottom of
issues, was hiding problems and waste which would have been better
explored by the production teams so that improvements could be made. In
other words, Woods’ “graceful extensibility” had to be seen to be part
of the every day work of the plant.
The “Machine That Changed The World” is, in my view, a story that still
hasn’t been understood well outside of automotive engineering. Most
particularly the role of vital ongoing human creativity seems to have
Graceful degradation was a big topic at Lisbon. I added the Lisbon conference and others, including sorting by CFP dates (since you like to do talks), to my list of devops conferences at devopsconferences.org.