Fault Tolerance and Protection

In yet another post where I point to a paper written from the perspective of another field of engineering about a topic that I think is inherently mappable to the web engineering world, I’ll at least give a summary. :)

Every time someone on-call gets an alert, they should always be thinking along these lines:

  • Does this really require me to wake up from sleeping or pause this movie I’m watching, to fix?
  • Can this really not wait until the morning, during office hours?

If the answer is yes to those, then excellent: the machines alerted a human to something that only a human could ever diagnose or fix. There was nothing that the software could have done to rectify the situation. Paging a human was justified.

But for those situations where the answer was “no” to those questions, one might (or should, anyway) think of bolstering your system’s “fault tolerance” or “fault protection.” But how many folks grok the full details of what that means?  Does it mean self-healing? Does it mean isolation of errors or unexpected behaviors that fall outside the bounds of normal operating circumstances? Or does it mean both and if so how should we approach building this tolerance and protection? The Wikipedia definitions for “fault tolerant systems” and “fault tolerant design” are a very good start on educating yourself on the concepts, but they’re reasonably general in scope.

The fact is, designing web systems to be truly fault-tolerant and protective is hard. These are questions that can’t be answered solely within infrastructural bounds; fault-tolerance isn’t selective in its tiering, it has to be thought of from layer 1 of the network all the way to the browser.

Now, not every web startup is lucky enough to hire someone from NASA’s Jet Propulsion Lab, who has written software for space vehicles, but we managed to convince Greg Horvath to leave there and join Etsy. He pointed me to an excellent paper, by Robert D. Rasmussen, called “GN&C Fault Protection Fundamentals” and thankfully, it’s a lot less about Guidance, Navigation, and Control and more about fault tolerance and protection strategies, concerns, and implementations.

Some of those concerns, from the paper:

  • Do not separate fault protection from normal operation of the same functions.
  • Strive for function preservation, not just fault protection.
  • Test systems, not fault protection; test behavior, not reflexes.
  • Cleanly establish a delineation of mainline control functions from transcendent issues.
  • Solve problems locally, if possible; explicitly manage broader impacts, if not.
  • Respond to the situation as it is, not as it is hoped to be.
  • Distinguish fault diagnosis from fault response initiation.
  • Follow the path of least regret.
  • Take the analysis of all contingencies to their logical conclusion.
  • Never underestimate the value of operational flexibility.
  • Allow for all reasonable possibilities — even the implausible ones.

The last idea there points to having “requisite imagination” to explore as fully as possible, the question “What could possibly go wrong?”, which is really just another manifestation of one of the four cornerstones of Resilience Engineering, which is: “Anticipation”. But that’s a topic for another post.

Here’s Rasmussen’s paper, please go and read it. If you don’t, you’re totally missing out and not keeping up!

I like making things go! At the moment, I'm SVP of Infrastructure and Operations at Etsy, and I'm currently pursuing a Master's degree in Human Factors and Systems Safety at Lund University.

4 comments

  1. John Wilis   •  

    As Aways.. Great post…

    One minor point and I know you already know this.. In a perfect world that call should never be a human’s decision. The notification system should always strive to make that judgement. I used to call the “Do you want to be woken up” flag in the event.

    Thanks
    John Willis
    DTO Solutions
    a.k.a. @botchagalupe

  2. Dinesh K   •  

    Important point and really a helpful post…….specially the concluded points from the paper.

    By following those points proactively really increase the level of fault tolerance and unnecessary alerts or notifications (or call) can be reduced.

    I am also agree with the John Wills, as at some instances taking decision whether the calls or notifications are critical or not, depends on till which extent it can have impact on the system.

    Waiting for the next post…..” What could possibly go wrong?”

    Thanks for sharing,
    Dinesh

  3. Kyle Kirby   •  

    write a summary supporting protection and email it to me, becaus the article is extremely one sided in favor of tolerance. Thanks.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>