Learning from Failure at Etsy

(This was originally posted on Code As Craft, Etsy’s engineering blog. I’m re-posting it here because it still resonates strongly as I prepare to teach a ‘postmortem facilitator’ course internally at Etsy.)

Last week, Owen Thomas wrote a flattering article over at Business Insider on how we handle errors and mistakes at Etsy. I thought I might give some detail on how that actually happens, and why.

Anyone who’s worked with technology at any scale is familiar with failure. Failure cares not about the architecture designs you slave over, the code you write and review, or the alerts and metrics you meticulously pore through.

So: failure happens. This is a foregone conclusion when working with complex systems. But what about those failures that result from the actions (or lack of action, in some cases) of individuals? What do you do with those careless humans who caused everyone to have a bad day?

Maybe they should be fired.

Or maybe they need to be prevented from touching the dangerous bits again.

Or maybe they need more training.

This is the traditional view of “human error”, which focuses on the characteristics of the individuals involved. It’s what Sidney Dekker calls the “Bad Apple Theory” – get rid of the bad apples, and you’ll get rid of the human error. Seems simple, right?

We don’t take this traditional view at Etsy. We instead want to view mistakes, errors, slips, lapses, etc. with a perspective of learning. Having blameless Post-Mortems on outages and accidents is part of that.

A Blameless Post-Mortem

What does it mean to have a ‘blameless’ Post-Mortem?
Does it mean everyone gets off the hook for making mistakes? No.

Well, maybe. It depends on what “gets off the hook” means. Let me explain.

Having a Just Culture means that you’re making an effort to balance safety and accountability. It means that by investigating mistakes in a way that focuses on the situational aspects of a failure’s mechanism and the decision-making process of individuals proximate to the failure, an organization can come out safer than it would normally be if it had simply punished the actors involved as a remediation.

Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account of:

  • what actions they took at what time,
  • what effects they observed,
  • expectations they had,
  • assumptions they had made,
  • and their understanding of the timeline of events as they occurred.

…and that they can give this detailed account without fear of punishment or retribution.
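
For example, a hypothetical account (invented here for illustration, not from any real Etsy outage) might read: “At 14:02 I ran the schema migration, expecting it to finish in seconds as it had in staging. At 14:05 I noticed elevated error rates on the web tier, and assumed they were unrelated because the migration had reported no errors. My understanding at the time was that the two systems were independent.”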

Why shouldn’t they be punished or reprimanded? Because an engineer who thinks they’re going to be reprimanded is disincentivized to give the details necessary to get an understanding of the mechanism, pathology, and operation of the failure. This lack of understanding of how the accident occurred all but guarantees that it will repeat, if not with the original engineer, then with another one in the future.

We believe that this detail is paramount to improving safety at Etsy.

If we go with “blame” as the predominant approach, then we’re implicitly accepting that deterrence is how organizations become safer. This is founded in the belief that individuals, not situations, cause errors. It’s also aligned with the idea that there has to be some fear that not doing one’s job correctly could lead to punishment, because the fear of punishment will motivate people to act correctly in the future. Right?

This cycle of name/blame/shame can be looked at like this:

  1. Engineer takes action and contributes to a failure or incident.
  2. Engineer is punished, shamed, blamed, or retrained.
  3. Trust is reduced between engineers on the ground (the “sharp end”) and management (the “blunt end”) looking for someone to scapegoat.
  4. Engineers become silent on details about actions/situations/observations, resulting in “Cover-Your-Ass” engineering (from fear of punishment).
  5. Management becomes less aware of and informed about how work is being performed day to day, and engineers become less educated on lurking or latent conditions for failure, due to the silence mentioned in #4, above.
  6. Errors become more likely, and latent conditions can’t be identified, due to #5, above.
  7. Repeat from step 1.

We need to avoid this cycle. We want the engineer who has made an error to give details about why (either explicitly or implicitly) they did what they did; why the action made sense to them at the time. This is paramount to understanding the pathology of the failure. The action made sense to the person at the time they took it, because if it hadn’t made sense to them at the time, they wouldn’t have taken it in the first place.

The fundamental point here is something Erik Hollnagel has said:

We must strive to understand that accidents don’t happen because people gamble and lose.
Accidents happen because the person believes that:
…what is about to happen is not possible,
…or what is about to happen has no connection to what they are doing,
…or that the possibility of getting the intended outcome is well worth whatever risk there is.

A Second Story

This idea of digging deeper into the circumstance and environment that an engineer found themselves in is called looking for the “Second Story”. In Post-Mortem meetings, we want to find Second Stories to help understand what went wrong.

From Behind Human Error, here’s the difference between “first” and “second” stories of human error:

  • First Stories: Human error is seen as cause of failure.
    Second Stories: Human error is seen as the effect of systemic vulnerabilities deeper inside the organization.
  • First Stories: Saying what people should have done is a satisfying way to describe failure.
    Second Stories: Saying what people should have done doesn’t explain why it made sense for them to do what they did.
  • First Stories: Telling people to be more careful will make the problem go away.
    Second Stories: Only by constantly seeking out its vulnerabilities can organizations enhance safety.

Allowing Engineers to Own Their Own Stories

A funny thing happens when engineers make mistakes and feel safe when giving details about them: they are not only willing to be held accountable, they are also enthusiastic about helping the rest of the company avoid the same error in the future. They are, after all, the most expert in their own error. They ought to be heavily involved in coming up with remediation items.

So technically, engineers are not at all “off the hook” with a blameless Post-Mortem process. They are very much on the hook for helping Etsy become safer and more resilient, in the end. And lo and behold: most engineers I know find this idea of making things better for others a worthwhile exercise.

So what do we do to enable a “Just Culture” at Etsy?

  • We encourage learning by having these blameless Post-Mortems on outages and accidents.
  • The goal is to understand how an accident could have happened, in order to better equip ourselves to prevent it from happening in the future.
  • We seek out Second Stories, gather details from multiple perspectives on failures, and we don’t punish people for making mistakes.
  • Instead of punishing engineers, we instead give them the requisite authority to improve safety by allowing them to give detailed accounts of their contributions to failures.
  • We enable and encourage people who do make mistakes to be the experts on educating the rest of the organization how not to make them in the future.
  • We accept that there is always a discretionary space where humans can decide to take action or not, and that the judgement of those decisions lies in hindsight.
  • We accept that the Hindsight Bias will continue to cloud our assessment of past events, and work hard to eliminate it.
  • We accept that the Fundamental Attribution Error is also difficult to escape, so we focus on the environment and circumstances people are working in when investigating accidents.
  • We strive to make sure that the blunt end of the organization understands how work is actually getting done (as opposed to how they imagine (or hope) it’s getting done, via Gantt charts and procedures) on the sharp end.
  • The sharp end is relied upon to inform the organization where the line is between appropriate and inappropriate behavior. This isn’t something that the blunt end can come up with on its own.

Failure happens. In order to understand how failures happen, we first have to understand our reactions to failure.

One option is to assume the single cause is incompetence and scream at engineers to make them “pay attention!” or “be more careful!”

Another option is to take a hard look at how the accident actually happened, treat the engineers involved with respect, and learn from the event.

That’s why we have blameless Post-Mortems at Etsy, and why we’re looking to create a Just Culture here.

Comments

  1. kris

    Since I am retired now and cannot work there… I want my son to grow up and work there.

  2. Fred

    I’ve worked in software/hardware engineering for over three decades, and participated in countless post-mortems. While I’ve often tried to support my engineers and techs by repeating things like, “Folks who never make a mistake probably aren’t producing anything”, after reading this article, I am now horribly embarrassed and humbled to admit I’ve sometimes been guilty of subscribing to the Bad Apple Theory. This article is *brilliant*. You’ve taught an old dog a very valuable new trick, and I fully intend to put it into practice. I suspect you’ve just made me a better engineer. Thank you.

  3. Alberto

    Great tactics, this Just Culture, from my point of view. But given that failure is the norm, particularly when you are building something new; that people tend to build up their experience because we don’t really recall what really happened a few hours after taking the wrong decision; that people usually tend to repeat the same mistake at least twice before acknowledging they may be wrong; and that a situation never really repeats itself exactly in the future, I’m wondering:
    1) how do you balance failure in executing a successful working model with innovation (a new working model)?
    2) under which considerations is a deed or a decision given the failure status?
    3) who is scanning for failures?
    4) how do you make sure that a flawless process is not leading Etsy to a certain, planned, unforeseen cliff?
    Thanks in advance.

  4. Matthew Tippett

    In the software organizations that I manage, I have a standard “Root Cause Analysis” (RCA) template that we follow. It has standard sections – Narrative, Timeline, Analysis (modified Ishikawa, etc.), Corrective Actions. It is intended to be as objective and impersonal as possible. Interestingly, most of the RCA templates that I have seen tend to make it an onerous and painful experience.

    When a recent regrettable mistake occurred, we went through the RCA over the course of a few hours. Once the feeling of it being a witch hunt and blame game had been put to rest, the engineers began to really explore what went wrong, how it went wrong, and what led to it occurring.

    In the end we walked away with a couple of key fixes for the issue, plus a whole collection of other corrective actions that really helped the engineers understand how they can directly contribute to improving their delivery and reducing cringe moments when something goes wrong.

  5. Todd Troxell

    Excellent perspective – many could learn from this.

  6. Ajinkya

    Learning from failures is a valuable lesson; it’s no different here.
