Reflections on the 6th Resilience Engineering Symposium

I just spent the last week in Lisbon, Portugal at the Resilience Engineering Symposium. Zoran Perkov and I were invited to speak on the topic of software operations and resilience in the financial trading and Internet services worlds, to an audience of practitioners and researchers from all around the globe, in a myriad of industries....
Continue reading...

Counterfactual Thinking, Rules, and The Knight Capital Accident

In between reading copious amounts of indignation surrounding whatever is suboptimal about healthcare.gov, you may or may not have noticed the SEC statement regarding the Knight Capital accident that took place in 2012. This Release No. 70694 is a document that contains many details about the accident, and you can read what looks like on the surface...
Continue reading...

Learning from Failure at Etsy

(This was originally posted on Code As Craft, Etsy’s engineering blog. I’m re-posting it here because it still resonates strongly as I prepare to teach a ‘postmortem facilitator’s course internally at Etsy.) Last week, Owen Thomas wrote a flattering article over at Business Insider on how we handle errors and mistakes at Etsy. I thought...
Continue reading...

Owning Attention (Considerations for Alert Design)

In the past month or two, I’ve spoken on the topic of alert design. There’s a video of my giving the talk (at Monitorama, as well), but I thought I’d try to post on the topic and material as well. The topic of alerts and “alert design” as seen as a deliberate and purposeful thing...
Continue reading...