Counterfactual Thinking, Rules, and The Knight Capital Accident

In between reading copious amounts of indignation surrounding whatever is suboptimal about healthcare.gov, you may or may not have noticed the SEC statement regarding the Knight Capital accident that took place in 2012. This Release No. 70694 is a document that contains many details about the accident, and you can read what looks like on the surface
Continue reading…

Learning from Failure at Etsy

(This was originally posted on Code As Craft, Etsy’s engineering blog. I’m re-posting it here because it still resonates strongly as I prepare to teach a ‘postmortem facilitator’s course internally at Etsy.) Last week, Owen Thomas wrote a flattering article over at Business Insider on how we handle errors and mistakes at Etsy. I thought
Continue reading…

Owning Attention (Considerations for Alert Design)

In the past month or two, I’ve spoken on the topic of alert design. There’s a video of my giving the talk (at Monitorama, as well), but I thought I’d try to post on the topic and material as well. The topic of alerts and “alert design” as seen as a deliberate and purposeful thing
Continue reading…

Prevention versus Governance versus Adaptive Capacities

The other day I posted about the intersections of Systems Safety and web operations and engineering. One of the largest proponents of bringing a systems thinking perspective to safety (specifically ‘software safety’) is Dr. Nancy Leveson, who has been in that field (really a multidisciplinary field) for at least a couple of decades. She’s the
Continue reading…

A Mature Role for Automation: Part I

I’ve been percolating on this post for a long time. Thanks very much to Mark Burgess for reviewing early drafts of it. One of the ideas that permeates our field of web operations is that we can’t have enough automation. You’ll see experience with “building automation” on almost every job description, and many post-mortem transcriptions
Continue reading…