An Open Letter To Monitoring/Metrics/Alerting Companies

I’d like to open up a dialogue with companies who are selling X-As-A-Service products that are focused on assisting operations and development teams in tracking the health and performance of their software systems.

Note: It’s likely my suggestions below are understood and embraced by many companies already. I know a number of them who are paying attention to all areas I would want them to, and/or make sure they’re not making claims about their product that aren’t genuine. 

Anomaly detection is important. It can’t be overlooked. We as a discipline need to pay attention to it, and continually get better at it.

But for the companies whose value-add selling point(s) are:

  • “our product will tell you when things are going wrong” and/or
  • “our product will automatically fix things when it finds something is wrong”

the implication is these things will somehow relieve the engineer from thinking or doing anything about those activities, so they can focus on more ‘important’ things. “Well-designed automation will keep people from having to do tedious work”, the cartoon-like salesman says.

Please stop doing this. It’s a lie in the form of marketing material and it’s a huge boondoggle that distracts us away from focusing on what we should work on, which is to augment and assist people in solving problems.

Anomaly detection in software is, and always will be, an unsolved problem. Your company will not solve it. Your software will not solve it. Our people will improvise around it and adapt their work to cope with the fact that we will not always know what and how something is wrong at the exact time we need to know.

My suggestion is to first acknowledge this (that detecting anomalies perfectly, at the right time, is not possible) when you talk to potential customers. Want my business? Say this up front, so we can then move on to talking about how your software will assist my team of expert humans who will always be smarter than your code.

In other words, your monitoring software should take the Tony Stark approach, not the WOPR/HAL9000 approach.

These are things I’d like to know about how you thought about your product:

  • Tell me about how you used qualitative research in developing your product.
  • Tell me about how you observed actual engineers in their natural habitat, in the real world, as they detected and responded to anomalies that arose.
  • Show me your findings from when you had actual UX/UI professionals consider carefully how the interfaces of your product should be designed.
  • Demonstrate to me the people designing your product have actually been on-call and have experience with the scenario where they needed to understand what the hell was going on, had no idea where to start looking, all under time and consequence pressure.
  • Show me the people who are building your product take as a first design principle that outages and other “untoward” events are handled not by a lone engineer, but more often than not by a team of engineers all with their different expertise and focus of attention. Successful response depends not just on anomaly detection, but on how the team shares the observations they are making amongst each other in order to come up with actions to take.


Stop thinking you’re trying to solve a troubleshooting problem; you’re not.


The world you’re trying to sell to is in the business of dynamic fault management. This means that quite often you can’t just take a component out of service and investigate what’s wrong with it. It means diagnosis involves testing hypotheses that could actually make things a lot worse than they already are. This means that phases of responding to issues have overlapping concerns all at the same time. Things like:

  • I don’t know what is going on.
  • I have a guess about what is going on, but I’m not sure, and I don’t know how to confirm it.
  • Because of what Sue and Alice said, and what I see, I think what is going on is X.
  • Since we think X is happening, I think we should do Y.
  • Is there a chance that Y will make things worse?
  • If we don’t know what’s happening with N, can we do M so things don’t get worse, or we can buy time to figure out what to do about N?
  • Do we think this thing (that we have no clue about) is changing for the better or the worse?
  • etc.

Instead of telling me about how your software will solve problems, show me you’re trying to build a product that is going to join my team as an awesome team member, because I’m going to think about using/buying your service in the same way I think about hiring.


John Allspaw


Availability: Nuance As A Service

Something that has struck me funny recently surrounds the traditional notion of availability of web applications. With respect to its relationship to revenue, to infrastructure and application behavior, and fault protection and tolerance, I’m thinking it may be time to get a broader upgrade adjustment to the industry’s perception on the topic.

These nuances in the definition and effects of availability aren’t groundbreaking. They’ve been spoken about before, but for some reason I’m not yet convinced that they’re widely known or understood.

Impact On Business

What is laid out here in this article is something that’s been parroted for decades: downtime costs companies money, and lost value. Generally speaking, this is obviously correct, and by all means you should strive to design and operate your site with high availability and fault tolerance in mind.

But underneath the binary idea that uptime = good and downtime = bad, the reality is that there’s a lot more detail that deserves exploring.

This irritatingly-designed site has a post about a common equation to help those that are arithmetically challenged:

GR = gross yearly revenue
TH = total yearly business hours
I = percentage impact
H = number of hours of outage
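
The equation itself does not appear in the text above, only its variables. A minimal sketch, assuming the commonly cited form (lost revenue = (GR / TH) × I × H) and using made-up numbers, shows how blunt the measure is. The function name and the example values are hypothetical.

```python
def estimated_lost_revenue(gross_yearly_revenue, total_yearly_business_hours,
                           percentage_impact, outage_hours):
    """Blunt estimate: average hourly revenue x fraction affected x outage hours.

    Assumed form of the equation: (GR / TH) * I * H.
    percentage_impact is expressed as a fraction (0.5 means 50%).
    """
    if total_yearly_business_hours <= 0:
        raise ValueError("total_yearly_business_hours must be positive")
    hourly_revenue = gross_yearly_revenue / total_yearly_business_hours
    return hourly_revenue * percentage_impact * outage_hours


# Hypothetical numbers, for illustration only.
loss = estimated_lost_revenue(
    gross_yearly_revenue=8_760_000,     # GR
    total_yearly_business_hours=8_760,  # TH (24 x 365)
    percentage_impact=0.5,              # I
    outage_hours=2,                     # H
)
print(loss)  # 1000.0
```

Note that every hour of the year is treated as worth the same amount, which is exactly the assumption the rest of this post questions.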

In my mind, this is an unnecessarily blunt measure. I see the intention behind this approach, because it’s not meant to be anywhere close to being accurate. But modern web operations is now a field where gathering metrics in the hundreds of thousands per second is becoming more common-place, fault-tolerance/protection is a thing we do increasingly well, and graceful degradation techniques are the norm.

In other words: there are a lot more considerations than outage minutes = lost revenue, even if you did have a decent way to calculate it (which, you don’t). Companies selling monitoring and provisioning services will want you to subscribe to this notion.

We can do better than this blunt measure, and I thought it’s worth digging in a bit deeper.


Thought experiment: if Amazon has a full and global outage for 30 minutes, how much revenue did it “lose”? Using the above rough equation, you can certainly come up with a number, let’s say N million dollars. But how accurate is N, really? Discussions that surround revenue loss are normally designed to motivate organizations to invest in availability efforts, so N only needs to be big and scary enough to provide that motivation. So let’s just say that goal has been achieved: you’re convinced! Availability is important, and you’re a firm believer that You Own Your Own Availability.

Outside of the “let this big number N convince you to invest in availability efforts” I have some questions that surround N:

  • How many potential customers did Amazon lose forever, during that outage? Meaning: they tried to get to Amazon, with some nonzero intent/probability of buying something, found it to be offline, and will never return there again, for reasons of impatience, loss of confidence, the fact that it was an impulse-to-buy click whose time has passed, etc.
  • How much revenue did Amazon lose during that 30 minute window, versus the revenue that was simply postponed while it was down, only to be executed later? In other words: upon finding the site down, users return sometime later to do what they originally intended, which may or may not include buying something or participating in some other valuable activity.
  • How much did that 30 minutes of downtime affect the strength of the Amazon brand, in a way that could be viewed as revenue-affecting? Meaning: are users and potential users now swayed to having less confidence in Amazon because they came to the site only to be disappointed that it’s down, enough to consider alternatives the next time they would attempt to go to the site in the future?

I don’t know the answers to these questions about Amazon, but I do know that at Etsy, those answers depend on some variables:

  • the type of outage or degradation (more on that in a minute),
  • the time of day/week/year
  • how we actually calculate/forecast how those metrics would have behaved during the outage

So, let’s crack those open a bit, and see what might be inside…

Temporal Concerns

Not all time periods can be considered equal when it comes to availability, and the idea of lost revenue. For commerce sites (or really any site whose usage varies with some seasonality) this is hopefully glaringly obvious. In other words:

X minutes of full downtime during the peak hour of the peak day of the year can be worlds apart from Y minutes of full downtime during the lowest hour of the lowest day of the year, traffic-wise.

Take for example a full outage that happens during a period of the peak day of the year, and contrast it with one that happens during a lower-period of the year. Let’s say that this graph of purchases is of those 24-hour periods, indicating when the outages happen:

A Tale of Two Outages

The impact time of the outage during the lower-traffic day is actually longer than the peak day, affecting the precious Nines math by a decent margin. But yet: which outage would you rather have, if you had to have one of those? 🙂
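
To make the comparison concrete, here is a small sketch that weights an outage by the purchases expected during its window rather than by its length alone. The hourly profiles, the outage windows, and the function name are all hypothetical; they are not Etsy data.

```python
def purchases_exposed(hourly_purchases, start_hour, duration_hours):
    """Sum the expected purchases that fall inside an outage window.

    hourly_purchases: expected purchases for each hour of one day (24 values).
    start_hour, duration_hours: outage window in (fractional) hours.
    Assumes the window does not cross midnight.
    """
    end_hour = start_hour + duration_hours
    exposed = 0.0
    for hour, purchases in enumerate(hourly_purchases):
        # Portion of this hour that overlaps the outage window.
        overlap = min(end_hour, hour + 1) - max(start_hour, hour)
        if overlap > 0:
            exposed += purchases * overlap
    return exposed


# Hypothetical profiles: a peak day and a quiet day with the same shape.
peak_day = [100] * 8 + [1000] * 8 + [400] * 8
quiet_day = [10] * 8 + [100] * 8 + [40] * 8

short_peak_outage = purchases_exposed(peak_day, start_hour=12, duration_hours=0.5)
long_quiet_outage = purchases_exposed(quiet_day, start_hour=3, duration_hours=2)
print(short_peak_outage, long_quiet_outage)  # 500.0 20.0
```

In this made-up case the outage that is four times longer exposes a small fraction of the purchases, even though it does more damage to the Nines math.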

Another temporal concern is: across space and time, the distribution and volume of any level of degradation could be viewed as perfect uptime as the length of the outage approaches zero.

Dig, if you will, these two outage profiles, across a 24-hour period. The first one has many small outages across the day:

[Figure: many small outages spread across a 24-hour period]

and the other has the same amount of impact time, in a single go:

[Figure: a single outage with the same total impact time]

So here we have the same amount of time, but spread out throughout the day. Hopefully, folks will think a bit more beyond the clear “they’re both bad! don’t have outages!” and could investigate how they could be different. Some considerations in this simplified example:

  • Hour of day. Note that the single large outage is “earlier” in the day. Maybe this will affect EU or other non-US users more broadly, depending on the timezone of the original graph. Do EU users have a different expectation or tolerance for outages in a US-based company’s website?
  • Which outage scenario has a greater effect on the user population: if the ‘normal’ behavior is “get in, buy your thing, and get out” quickly, I could see the many-small-outages more preferable to the single large one. If the status quo is some mix of searching, browsing, favoriting/sharing, and then purchase, I could see the singular constrained outage being preferable.

Regardless, this underscores the idea that not all outages are created equal with respect to impact timing.


Loss of “availability” can also be seen as an extreme loss of performance. At a particular threshold, given the type of feedback to the user (a fast-failed 404 or browser error, versus a hanging white page and spinning “loading…”) the severity of an event being slow can effectively be the same as a full outage.

Some concerns/thought exercises around this:

  • Where is this latency threshold for your site, for the functionality that is critical for the business?
  • Is this threshold a cliff, or is it a continuous/predictable relationship between performance and abandonment?

There’s been much more work on performance’s effects on revenue than availability. The Velocity Conference in 2009 brought the first real production-scale numbers (in the form of a Bing/Google joint presentation as well as Shopzilla and Mozilla talks) behind how performance affects businesses, and if you haven’t read about it, please do.

Graceful Degradation

Will Amazon (or Etsy) lose sales if all or a portion of its functionality is gone (or sufficiently slow) for a period of time? Almost certainly. But that question is somewhat boring without further detail.

In many cases, modern web sites don’t simply live in an “everything works perfectly” or “nothing works at all” boolean world. (To be sure, neither does the Internet as a whole.) Instead, fault-tolerance and resilience approaches allow for features and operations to degrade under a spectrum of failure conditions. Many companies build their applications to have both in-flight fault tolerance to degrade the experience in the face of singular failures, as well as making use of “feature flags” (Martin and Jez call them “feature toggles”) which allow for specific features to be shut off if they’re causing problems.

I’m hoping that most organizations are familiar with this approach at this point. Just because user registration is broken at the moment, you don’t want to prevent already logged-in users from using the otherwise healthy site, do you? 🙂
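
As a rough illustration of both ideas (a flag that switches a feature off, and in-flight tolerance of a failing dependency), here is a minimal sketch. The flag names, the `fetch_favorites` stand-in, and the page structure are hypothetical and are not taken from any real codebase.

```python
# Hypothetical flag store; in practice this might live in configuration.
FEATURE_FLAGS = {"favorites": True, "checkout": True}


class FavoritesUnavailable(Exception):
    """Raised when the (hypothetical) favorites backend cannot be reached."""


def fetch_favorites(user_id):
    # Stand-in for a backend call; here it always fails to simulate an outage.
    raise FavoritesUnavailable("favorites backend is down")


def render_listing_page(user_id, listing):
    """Build the page, degrading the favorites feature instead of failing."""
    page = {"listing": listing, "checkout_enabled": FEATURE_FLAGS["checkout"]}
    if FEATURE_FLAGS["favorites"]:
        try:
            page["favorites"] = fetch_favorites(user_id)
        except FavoritesUnavailable:
            # In-flight degradation: the page still renders without favorites.
            page["favorites"] = None
    else:
        # Feature flag is off: skip the call entirely.
        page["favorites"] = None
    return page


page = render_listing_page(user_id=1, listing="hand-made mug")
print(page["checkout_enabled"], page["favorites"])  # True None
```

The page still renders and checkout stays enabled while favorites are unavailable, which is the situation the rest of this section asks how to count.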

But these graceful degradation approaches further complicate the notion of availability, as well as its impact on the business as a whole.

For example: if Etsy’s favoriting feature is not working (because the site’s architecture allows it to gracefully fail without affecting other critical functionality), but checkout is working fine…what is the result? Certainly you might pause before marking down your blunt Nines record.

You might also think: “so what? as long as people can buy things, then favoriting listings on the site shouldn’t be considered in scope of availability.”

But consider these possibilities:

  • What if Favoriting listings was a significant driver of conversions?
  • If Favoriting was a behavior that led to conversions at a rate of X%, what value should X be before ‘availability’ ought to be influenced by such a degradation?
  • What if Favoriting was technically working, but was severely degraded (see above) in performance?

Availability can be a useful metric, but when abused as a silver bullet to inform or even dictate architectural, business priority, and product decisions, there’s a real danger of oversimplifying what are really nuanced concerns.

Bounce-Back and Postponement

As I mentioned above, it is more likely that for sites with an established community or brand, outages (even full ones) don’t mark an instantaneous amount of ‘lost’ revenue or activity. A nonzero amount of that activity is simply postponed. This is the area that I think could use a lot more data and research in the industry, much in the same way that the latency/conversion relationship has been investigated.

The over-simplified scenario involves something that looks like this. Instead of the blunt math of “X minutes of downtime = Y dollars of lost revenue”, we can be a bit more accurate, if we tried just a bit harder. The red is the outage:



So we have some more detail, which is that if we can make a reasonable forecast about what purchases would have done during the time of the outage, then we could make a better-informed estimate of purchases “lost” during that time period.

But is that actually the case?

What we see at Etsy is something different, a bit more like this:

[Figure: purchases before, during, and after a full outage]

Clearly this is an oversimplification, but I think the general behavior comes across. When a site comes back from a full outage, there is an increase in the amount of activity as users who were stalled/paused in their behavior by the outage resume. My assumption is that many organizations see this behavior, but it’s just not being talked about publicly.
What needs more real-world data is support for (or against) the hypothesis that, depending on:
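
One way to put numbers on this is to compare a forecast against what actually happened, both during and after the outage. The sketch below is only that, a sketch: the interval data is invented, the function name is hypothetical, and it assumes that anything above forecast after the outage is postponed activity, which will not always be true.

```python
def outage_accounting(forecast, actual, outage_start, outage_end):
    """Split an outage's effect into shortfall, bounce-back, and net loss.

    forecast, actual: purchases per interval (lists of equal length).
    outage_start, outage_end: interval indexes of the outage, [start, end).
    """
    if len(forecast) != len(actual):
        raise ValueError("forecast and actual must have the same length")
    # Purchases the forecast expected during the outage but that did not happen.
    shortfall = sum(
        f - a
        for f, a in zip(forecast[outage_start:outage_end],
                        actual[outage_start:outage_end])
    )
    # Purchases above forecast after the outage, assumed to be postponed ones.
    bounce_back = sum(
        max(a - f, 0)
        for f, a in zip(forecast[outage_end:], actual[outage_end:])
    )
    return {
        "shortfall": shortfall,
        "bounce_back": bounce_back,
        "net_lost": shortfall - bounce_back,
    }


# Hypothetical purchases per interval; the outage covers intervals 1 and 2.
forecast = [100, 100, 100, 100, 100, 100]
actual = [100, 0, 0, 150, 120, 100]
result = outage_accounting(forecast, actual, outage_start=1, outage_end=3)
print(result)  # {'shortfall': 200, 'bounce_back': 70, 'net_lost': 130}
```

With these invented numbers the blunt math would report 200 purchases lost, while the forecast comparison suggests something closer to 130.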
  • Position of the outage in the daily traffic profile (start-end)
  • Position of the outage in the yearly season

the bounce-back volume will vary in a reasonably predictable fashion. Namely, as the length of the outage grows, the amount of bounce-back volume shrinks:

[Figure: bounce-back volume versus outage length]

What this line of thinking doesn’t capture is how many of those users postponed their activity not until immediately after the outage, but maybe until the next day, because they needed to leave their computer for a meeting at work, or were leaving work to commute home.

Intention isn’t entirely straightforward to figure out, but in the cases where you have a ‘fail-over’ page that many CDNs will provide when the origin servers aren’t available, you can get some more detail about what requests (add to cart? submit payment?) came in during that time.

Regardless, availability and its effect on business metrics isn’t as easy as service providers and monitoring-as-a-service companies will have you believe. To be sure, a good amount of this investigation will vary wildly from company to company, but I think it’s well worth taking a look into.


On Being A Senior Engineer

I think that there’s a lot of institutional knowledge in our field, especially about what makes for a productive engineer. But while there are a good many books in the management field about “expert” roles and responsibilities of non-technical individual contributors, I don’t see too many modern books or posts that might shed light directly on what makes for a good senior engineer. One notable exception is of course Kate Matsudaira, who has been posting quite a good deal recently about the cultural sides of engineering.

Yet at the same time, a good lot of successful engineers whom I have known all remember the mentor who taught them what it meant to be “senior”.

I do, however, agree 100% with my friend Theo’s words about being “senior” in his chapter of the Web Operations book by O’Reilly:

“Generation X (and even more so generation Y) are cultures of immediate gratification. I’ve worked with a staggering number of engineers that expect the “career path” to take them to the highest ranks of the engineering group inside 5 years just because they are smart. This is simply impossible in the staggering numbers I’ve witnessed. Not everyone can be senior. If, after five years, you are senior, are you at the peak of your game? After five more years will you not have accrued more invaluable experience? What then? “Super engineer”? Five more years? “Super-duper engineer.” I blame the youth of our discipline for this affliction. The truth is that there are very few engineers that have been in the field of web operations for fifteen years. Given the dynamics of our industry many elected to move on to managerial positions or risk an entrepreneurial run at things.”

He’s right: this field of web operations is still quite young. So we can’t be surprised when people who have a title of ‘senior’ exhibit unsurprisingly immature behavior, both technical and non-technical. If you haven’t read Theo’s chapter, I suggest you do.

Having said that, what does it actually mean to be ‘senior’ in this discipline? I certainly have an opinion of what it means, given that I’m charged with hiring, supporting, and retaining engineers who are deemed to be senior. This notion that there is a bar to be passed in terms of career development is a good one, but I’d also add that these criteria exist on a spectrum, as opposed to a simple list of check-boxes. You don’t wake up one day and you are “senior” just because your title reflects that upon a promotion. Senior engineers don’t know everything. They’re not perfect in their technical knowledge, and they’re OK with that.

In order not to confuse titles with expectations that are fuzzy, sometimes I’ll refer to engineering maturity.

Meaning: I expect a “senior” engineer to be a mature engineer.

I’m going to gloss over the part where one could simply list the technical areas in which a mature engineer should have some level of mastery or understanding (such as “Networking”, “Filesystems”, “Algorithms”, etc.) and instead highlight the personal characteristics that in my mind give me indication that someone can influence an organization or a business positively in the domain of engineering.

Over on Quora, someone once asked me “What are the attributes (other than technical ability/experience) that makes a great VP of Technical Operations?”. The list of attributes that I mentioned in the answer came with the understanding that they are perpetual aspirations of my own. This post is similar to that answer.

I might first argue that senior engineers in web development and operations have the same characteristics as senior engineers in other fields of engineering (mechanical, electrical, chemical, etc.) in which case The Unwritten Laws of Engineering are applicable. Again, if you haven’t read this, please go do so. It was originally written in 1944, published by the American Society of Mechanical Engineers. A good excerpt from the book is here.

While the book’s structure and prose still have a dated feel (“…refrain from using profanity in the workplace…” or “…men should pay particular attention to shaving habits and the trimming of beards and mustaches…”), it gives a good outline of the non-technical expectations, responsibilities, and inner workings of an engineering organization with respect to how both managers and mature engineers might behave.

Obligatory Pithy Characteristics of Mature Engineers

All posts that attempt to give insight into aspirational characteristics must have an over-abundance of bullet points, and the field of engineering has a fair share of them. Therefore, I’m going to give you some, some of them mine and some pulled from various sources, many from the Unwritten Laws mentioned above.

Mature engineers seek out constructive criticism of their designs.

Every successful engineer I’ve met, upon finishing up a design or getting ready for a project, will continually ask their peers questions along the lines of:

  • “What could I be missing?”
  • “How will this not work?”
  • “Will you please shoot as many holes as possible into my thinking on this?”
  • “Even if it’s technically sound, is it understandable enough for the rest of the organization to operate, troubleshoot, and extend it?”

This is because they know that nothing they make will ever only be in their hands, and that good peer review is what makes better design decisions. As it’s been said elsewhere, they “beg for the bad news.”

Mature engineers understand the non-technical areas of how they are perceived.

Being able to write a Bloom Filter in Erlang, or write multi-threaded C in your sleep is insufficient. None of that matters if no one wants to work with you. Mature engineers know that no matter how complete, elegant, or superior their designs are, it won’t matter if no one wants to work alongside them because they are assholes. Condescension, belittling, narcissism, and ego-boosting behavior send the message to other engineers (maybe tacitly) to stay away. Part of being happy in engineering comes from enjoying the company of the people you work with while designing and building things. An engineer who is quick to call someone a moron is someone destined to stunt his or her career.

This also means that mature engineers have self-awareness when it comes to their communication. This isn’t to say that every mature engineer communicates perfectly, only that they have some notion about where they could be better, and continually ask for a gut-check from peers and managers on how they’re doing. They aim to be assertive, not passive or aggressive in how they get their ideas across.

I’ve mentioned it elsewhere, but I must emphasize the point more: the degree to which other people want to work with you is a direct indication of how successful you’ll be in your career as an engineer. Be the engineer that everyone wants to work with.

Now this isn’t to say that you should shy away from giving (or getting) constructive criticism on the work produced by engineering (as opposed to the engineer personally), for fear of pissing someone off. There’s a difference between calling someone a moron and pointing out faults in their code or product. In a conversation with Theo, he pointed out another possible area where our field may grow up:

“We as an industry need to (of course) refrain from critiques of human character and condition, but not shy away from critiques of work product. We need to get tougher skin and be able to receive critique through a lens that attempts to eliminate personal focus.

There will be assholes, they should be shunned. But the attitude that someone’s code is their baby should come to an end. Code doesn’t have feelings, doesn’t develop complexes and certainly doesn’t exhibit the most important trait (the ability to reproduce) of that which carries for your genetic strains.”

See also below #2 and #10 in The Ten Commandments of Egoless Programming.

I think this has a corollary from the Unwritten Laws (emphasis mine):

Be careful about whom you mark for copies of letters, memos, etc., when the interests of other departments are involved.

A lot of mischief has been caused by young people broadcasting memorandum containing damaging or embarrassing statements. Of course it is sometimes difficult for a novice to recognize the “dynamite” in such a document but, in general, it is apt to cause trouble if it steps too heavily upon someone’s toes or reveals a serious shortcoming on anybody’s part. If it has wide distribution or if it concerns manufacturing or customer difficulties, you’d better get the boss to approve it before it goes out unless you’re very sure of your ground.

This of course underscores the dated feel of the book, but in the modern era, I still believe the main point to be true. Nothing indicates that you have a lack of perspective and awareness like a poorly thought out and nonconstructive tweet that slings venomous insults. It’s a junior engineer mistake to toss insults about a piece of complex technology in 140 characters.

I certainly (much like Christopher Brown mentioned in his keynote at Velocity London) pay attention to those sorts of public remarks when I come across them so that I can note who I would reconsider hiring if they ever applied to work at Etsy.

Mature engineers do not shy away from making estimates, and are always trying to get better at it.

From the Unwritten Laws:

Promises, schedules, and estimates are necessary and important instruments in a well-ordered business. Many engineers fail to realize this, or habitually try to dodge the irksome responsibility for making commitments. You must make promises based upon your own estimates for the part of the job for which you are responsible, together with estimates obtained from contributing departments for their parts. No one should be allowed to avoid the issue by the old formula, “I can’t give a promise because it depends upon so many uncertain factors.”

Avoiding responsibility for estimates is another way of saying, “I’m not ready to be relied upon for building critical pieces of infrastructure.” All businesses rely on estimates, and all engineers working on a project are involved in Joint Activity, which means that they have a responsibility to others to make themselves interpredictable. In general, mature engineers are comfortable with working within some nonzero amount of uncertainty and risk.

Mature engineers have an innate sense of anticipation, even if they don’t know they do.

This code looks good, I’m proud of myself. I’ve asked other people to review it, and I’ve taken their feedback. Now: how long will it last before it’s rewritten? Once it’s in production, how will its execution affect resource usage? How much do I expect CPU/memory/disk/network to increase or decrease? Will others be able to understand this code? Am I making it as easy as I can for others to extend or introspect this work?

Mature engineers understand that not all of their projects are filled with rockstar-on-stage work.

However menial and trivial your early assignments may appear, give them your best effort.

Getting things done means doing things you might not be interested in. No matter how sexy a project is, there are always boring tasks. Tedious tasks. Tasks that a less mature engineer may deem beneath their dignity or their job title. My good friend Kellan Elliot-McCrea (Etsy’s CTO) had this to say about it:

“Sometimes the saving grace of a tedious task is their simplicity and maturity manifests in finishing them quickly and moving on. Sometimes tasks are tedious because they require extreme discipline and malleable attention span. It’s an odd phenomena that the most tedious tasks, only to be carried out by the most senior engineers, can also be the most terrifying.”

Mature engineers lift the skills and expertise of those around them.

They recognize that at some point, their individual contribution and potential cannot be exercised singularly. They recognize that there is only so much that can be produced by a single person, and the world’s best engineering feats are executed by teams, not singularly brilliant and lone engineers. Tom Limoncelli makes this point quite well in his post.

At Etsy we call this a “generosity of spirit.” Generosity of spirit is one of our core engineering values, but also a primary responsibility of our Staff Engineer position, a career-level position. These engineers spend the time to make sure that more junior or new engineers unfamiliar with the tech or processes we have not only understand what they are doing, but also why they are doing it. “Teaching to fish” is a mandatory skill at this level, and that requires having both patience and a perspective of investment in the rest of the organization.

Therefore instead of: “OK, move over, lemme just do it for you”, it’s instead: “Ok, let’s work on this together. I can show you how I’m writing/troubleshooting/etc. Then, you do it so I can be sure you know why/how we’re doing it this way, etc.”

Related: see below about getting credit.

Mature engineers make their trade-offs explicit when making judgements and decisions.

They realize all engineering decisions, implementations, and designs exist within a spectrum; we do not live in a binary world. They can quickly point out contexts where one successful approach or solution could work and where it could not. They know that one cannot be both efficient and thorough at the same time (The ETTO Principle), that most projects engineers work on exist on an axis of optimality and brittleness, and that it matters whether the problems they are solving are acute or chronic.

They know that they work within a spectrum of ideal and non-ideal, and are OK with that. They are comfortable with it because they strive to make the ideal and non-ideal in a design explicit. Later on in the lifecycle of a design, when the original design is not scaling anymore or needs to be replaced or rewritten, they can look back not with a perspective of how short-sighted those earlier decisions were, but instead say “yep, we made it this far with it and knew we’d have to extend or change it at some point. Looks like that time is now, let’s get to work!” instead of responding with a cranky-pants, passive-aggressive Hindsight Bias-filled remark with counterfactuals (e.g. “those idiots didn’t do it right the first time!”, “they cut corners!”, “I TOLD them this wouldn’t work!”).

Many pithy quotes exist that shine light on this notion of trade-offs, and mature engineers know that there are limits to any philosophy-laden quotes (including the ones I’m writing here):

  • “Premature optimization is the root of all evil.” – a very abused maxim, and I’ve written about it before. A corollary to that might be (taken from here) ‘Understanding what is and isn’t “premature” is what separates senior engineers from junior engineers.’
  • “Right tool for the job” – another abused one. The intention here is reasonable: who wants to use a tool that isn’t appropriate? But a rare perspective is that this can be detrimental when taken to the extreme. A carpenter doesn’t arm himself with every variation and size of hammer that is available, even though he may encounter hammering tasks that could be ideally handled by each one. Why? Because lugging around (and maintaining) a gazillion hammers incurs a cost. As such, decisions on this axis have trade-offs.

The tl;dr on trade-offs is that everyone cuts corners, in every project. Immature engineers discover them in hindsight, disgusted. Mature engineers spell them out at the onset of a project, accept them and recognize them as part of good engineering.

(Related: Your Code May Be Elegant, But Mine Fucking Works)

Mature engineers don’t practice CYAE (“Cover Your Ass Engineering”)

The scenario where someone will stand on ceremony as an excuse for not attempting to understand how his or her code (or infrastructure) could be touched by other parts of the system or business is a losing proposition. Covering your ass sends the implicit message that you are someone willing to throw others (on your team? in your company? in your community?) under the proverbial bus at the mere hint that your work had any flaw. Mature engineers stand up and accept the responsibility given to them. If they find they don’t have the requisite authority to be held accountable for their work, they seek out ways to rectify that.

An example of CYAE is “It’s not my fault. They broke it, they used it wrong. I built it to spec, I can’t be held responsible for their mistakes or improper specification.”

Mature engineers are empathetic.

In complex projects, there are usually a number of stakeholders. In any project, the designers, product managers, operations engineers, developers, and business development folks all have goals and perspectives, and mature engineers realize that those goals and views may be different. They understand this so that they can navigate effectively in the work that they do. Being empathetic in this sense means having the ability to view the project from another person’s perspective and to take that into consideration in your own work.

Goal conflicts are inherent in all engineering work, and complaining about them (instead of embracing them as requirements for success) is a sign of a less mature engineer.

They don’t make empty complaints.

Instead, they express judgements based on empirical evidence and bring with those judgements options for solving the problem which they’ve identified. A great manager of mine said to never go to your boss with a complaint about anything without at least one (ideally more than one) suggestion for a solution. Even demonstrating that you’ve tried working the problem on your own and came up empty-handed is better than an empty complaint.

Mature engineers are aware of cognitive biases

This isn’t to say that every mature engineer needs to have a degree in psychology, but cognitive biases are what can limit the growth of an engineer’s career at a certain point. Even if they’re not aware of the details of how they appear or how these biases can be guarded against, most mature engineers I know have a level of self-awareness to at least recognize they (like everyone) are susceptible to them.

Culturally, engineers work day-to-day with empirical evidence and research. Basically: show me the data. The issue with cognitive biases is that we can be blissfully unaware of when we are interpreting data in ways that defy the empirical evidence, and this can have a surprising effect on how we get work done and work on teams.

A great list of them exists on Wikipedia, but some of the ones that I’ve seen engineers (including myself) fall prey to are:

  • Self-Serving Bias – basically: if something is good, it’s probably because of something I did or thought of. If it’s bad, it’s probably the doing of someone else.
  • Fundamental Attribution Error – basically: the bad results that someone else got from his work must have something to do with how he is, personally (stupid, clumsy, sloppy, etc.) whereas if I get bad results, it’s because of the context that I was in, the pressure I was under, the situation I was in, etc.
  • Hindsight Bias – (it is said that this is the most-studied phenomenon in the history of modern psychology) basically: after an untoward or negative event (a severe bug, an outage, etc.) “I knew it all along!”. It is the very strong tendency to view the past more simply than it was in reality. You can tell there is Hindsight Bias going on when descriptions involve counterfactuals, or “…they should have…”, or “…how did they not see that, it’s so obvious!”.
  • Outcome Bias – like above, this comes up after a surprising or negative event. If the event was very damaging, expensive to clean up, or severe, then the decisions or actions that contributed to that event are judged to be very stupid, reckless, or negligent. The judgement is proportional to how severe the event was.
  • Planning Fallacy – (related to the point about making estimates under uncertainty, above) basically: being overly optimistic when forecasting the time a particular project will take.

There are plenty of others, all of which I find personally fascinating and I can get lost in learning more about them. Highly suggested reading, if you’re at all interested in learning about how you might be limiting your own effectiveness.

The Ten Commandments of Egoless Programming

Appropriate, even if old…I’ve seen it referenced as coming from The Psychology of Computer Programming, written in 1971, but I don’t actually see it in the text. Regardless, here are The Ten Commandments of Egoless Programming, found on @wyattdanger’s blog post on receiving advice from his dad:

  1. Understand and accept that you will make mistakes. The point is to find them early, before they make it into production. Fortunately, except for the few of us developing rocket guidance software at JPL, mistakes are rarely fatal in our industry. We can, and should, learn, laugh, and move on.
  2. You are not your code. Remember that the entire point of a review is to find problems, and problems will be found. Don’t take it personally when one is uncovered. (Allspaw note – related: see #10 below, and the points Theo made above.)
  3. No matter how much “karate” you know, someone else will always know more. Such an individual can teach you some new moves if you ask. Seek and accept input from others, especially when you think it’s not needed.
  4. Don’t rewrite code without consultation. There’s a fine line between “fixing code” and “rewriting code.” Know the difference, and pursue stylistic changes within the framework of a code review, not as a lone enforcer.
  5. Treat people who know less than you with respect, deference, and patience. Non-technical people who deal with developers on a regular basis almost universally hold the opinion that we are prima donnas at best and crybabies at worst. Don’t reinforce this stereotype with anger and impatience.
  6. The only constant in the world is change. Be open to it and accept it with a smile. Look at each change to your requirements, platform, or tool as a new challenge, rather than some serious inconvenience to be fought.
  7. The only true authority stems from knowledge, not from position. Knowledge engenders authority, and authority engenders respect – so if you want respect in an egoless environment, cultivate knowledge.
  8. Fight for what you believe, but gracefully accept defeat. Understand that sometimes your ideas will be overruled. Even if you are right, don’t take revenge or say “I told you so.” Never make your dearly departed idea a martyr or rallying cry.
  9. Don’t be “the coder in the corner.” Don’t be the person in the dark office emerging only for soda. The coder in the corner is out of sight, out of touch, and out of control. This person has no voice in an open, collaborative environment. Get involved in conversations, and be a participant in your office community.
  10. Critique code instead of people – be kind to the coder, not to the code. As much as possible, make all of your comments positive and oriented to improving the code. Relate comments to local standards, program specs, increased performance, etc.

Novices versus Experts

Now I generally don’t follow too much on knowledge acquisition as a research topic, but I do believe it’s hard to get away from when talking about the evolving nature of a discipline. One bit of interesting breakdown comes from a paper from Dreyfus and Dreyfus called “A Five Stage Model of the Mental Activities Involved in Directed Skill Acquisition” which has laid out characteristics of various levels of expertise:

Novice
  • Rigid adherence to rules or plans
  • Little situational perception
  • No (or limited) discretionary judgment
Advanced Beginner
  • Guidelines for action based on attributes and aspects, which are all equal and separate
  • Limited situational perception
Competent
  • Conscious deliberate planning
  • Standardized and routine procedures
Proficient
  • Sees situations holistically rather than as aspects
  • Perceives deviations from normal patterns
  • Uses maxims for guidance, whose meanings are contextual
Expert
  • No longer relies on rules, guidelines or maxims
  • Intuitive grasp of situations
  • Analytic approach used only in novel situations

The paper goes on to state:

Novices operate from an explicit rules and knowledge-based perspective. They are deliberate and analytical, and therefore slower to take action, they decide or choose.

(which means that novices are deeply subject to local rationality)

Experts operate from a mature, holistic well-tried understanding, intuitively and without conscious deliberation. This is a function of experience. They do not see problems as one thing and solutions as another, they act.

(which means that experts are context driven)

I don’t necessarily subscribe to the notion of such dry lines being drawn between skill levels, because I think that there is a lot more granularity and facets of expertise than just those outlined above, but I think it’s helpful to be aware of the unfortunately over-simplified categories.

Dirty secret: mature engineers know the importance of (sometimes irrational) feelings people have. (gasp!)

How people feel about technologies, technical decisions, and technical directions is just as important (if not more) than the facts about the details. Mature engineers know this, and adjust accordingly. Again, being empathetic can help you understand how another person on your team feels about a technical decision, even if they themselves don’t have an easy time articulating why they feel that way.

People’s confidence in software, architectures, or patterns is heavily influenced by past experience, and can result in positive or negative reactions to using them. Used to work at a mod_perl shop that had a lot of mystifying outages? Then you can’t be surprised to feel reluctant to use it in a different company, even if the supporting expertise and use cases are entirely different. All you remember is that mod_perl = major headaches, so you’re going to be wary of using it in any context again.

Mature engineers understand this phenomenon when making a case to use technology that carries baggage, even if it’s irrational. Convincing a group to use tools and patterns that they aren’t comfortable with isn’t a straightforward task. The “right tool for the job” maxim also has (sometimes unquantifiable) comfortability as a parameter.

For an illustration of how people’s emotions drive technical decisions and opinions, read any flame war about anything, ever.

“It is amazing what you can accomplish if you do not care who gets credit.”

This quote is commonly attributed to Harry S. Truman, but it looks like it might have first been said by a Jesuit priest in a different form. In any case, this is another indication you’re working with a mature engineer: they hold the success of the project much higher than the potential praise they may get personally for working on it. The attribution of praise or credit can be the source of such dysfunction in an engineering-driven organization, and I believe it’s because it’s largely invisible.

The notion is liberating, and once understood and internalized, a world of progress and innovative thinking can flourish, because the engineer isn’t overly concerned with the personal liability of equating the work to their own career success.

Not The End

I’m at the moment blessed to work with a number of mature engineers here at Etsy, and it’s quite humbling. We are indeed a young field, and while I think we can learn a great deal from other fields of engineering on this topic, I also think we have an advantage. The web is inextricably tied to the notion of publishing and sharing information, globally. We need to continue pointing out what it means to be a “senior” and “mature” engineer if we have a hope of progressing the field into a true discipline.

Many thanks to members of the Etsy Operations team, Mike Brittain, Kellan Elliott-McCrea, Marc Hedlund, and Theo Schlossnagle for reviewing drafts of this post. They all make me a more mature engineer.

A Mature Role for Automation: Part I

(Part 1 of 2 posts)
I’ve been percolating on this post for a long time. Thanks very much to Mark Burgess for reviewing early drafts of it.

One of the ideas that permeates our field of web operations is that we can’t have enough automation. You’ll see experience with “building automation” on almost every job description, and many post-mortem transcriptions around the world have remediation items that state that more automation needs to be in place to prevent similar incidents.

“AUTOMATE ALL THE THINGS!” the meme says.

But the where, when, and how to design, implement, and operate automation is not as straightforward as “AUTOMATE ALL THE THINGS!”

I’d like to explore this concept that everything that could be automated should be automated, and I’d like to take a stab at putting context around the reasons why we might think this is a good idea. I’d also like to give some background on the research of how automation is typically approached, the reasoning behind various levels of automation, and most importantly: the spectrum of downsides of automation done poorly or haphazardly.

(Since it’s related, I have to include an obligatory link to Github’s public postmortem on issues they found with their automated database failover, and some follow-up posts that are well worth reading.)

In a recent post by Mathias Meyer he gives some great pointers on this topic, and strongly hints at something I also agree with, which is that we should not let learnings from other safety-related fields (aviation, combat, surgery, etc.) go to waste, because there are some decades of thinking and exploration there. This is part of my plan for exploring automation.

Frankly, I think that we as a field could have a more mature relationship with automation. Not unlike the relationship humans have with fire: a cautious but extremely useful one, not without risks.

I’ve never done a true “series” of blog posts before, but I think this topic deserves one. There’s simply too much in this exploration to have in a single post.

What this means: There will not be, nor do I think should there ever be, a tl;dr for a mature role of automation, other than: its value is extremely context-specific, domain-specific, and implementation-specific.

If I’m successful with this series of posts, I will convince you to at least investigate your own intuition about automation, and get you to bring the same “constant sense of unease” that you have with making change in production systems to how you design, implement, and reason about it. In order to do this, I’m going to reference a good number of articles that will branch out into greater detail than any single blog post could shed light on.

Bluntly, I’m hoping to use some logic, research, science, and evidence to approach these sorts of questions:

  1. What do we mean when we say “automation”? What do those other fields mean when they say it?
  2. What do we expect to gain from using automation? What problem(s) does it solve?
  3. Why do we reach for it so quickly sometimes, so blindly sometimes, as the tool to cure all evils?
  4. What are the (gasp!) possible downsides, concerns, or limitations of automation?
  5. And finally – given the potential benefits and concerns with automation, what does a mature role and perspective for automation look like in web engineering?

Given that I’m going to mention limitations of automation, I want to be absolutely clear, I am not against automation. On the contrary, I am for it.

Or rather, I am for: designing and implementing automation while keeping an eye on both its limitations and benefits.

So what limitations could there be? The story of automation (at least in web operations) is one of triumphant victory. The reason that we feel good and confident about reaching for automation is almost certainly due to the perceived benefits we’ve received when we’ve done it in the past.

Canonical example: engineer deploys to production by running a series of commands by hand, to each server one by one. Man that’s tedious and error-prone, right? Now we’ve automated the whole process, it can run on its own, and we can spend our time on more fun and challenging things.

This is a prevailing perspective, and a reasonable one.

Of course we can’t ditch the approach of automation, even if we wanted to.  Strictly speaking, almost every use of a computer is to some extent using “automation”, even if we are doing things “by hand.” Which brings me to…

Definitions and Foundations

I’d like to point at the term itself, because I think it’s used in a number of different contexts to mean different things. If we’re to look at it closely, I’d like to at least clarify what I (and others who have researched the topic quite well) mean by the term “automation”. The word comes from the Greek: auto, meaning ‘self’, and matos, meaning ‘willing’, which implies something is acting on its own accord.

Some modern definitions:

“Automation is defined as the technology concerned with the application of complex mechanical, electronic, and computer based systems in the operations and control of production.” – Raouf (1988)

‘Automation’ as used in the ATA Human Factors Task Force report in 1989 refers to… “a system or method in which many of the processes of production are automatically controlled or performed by self-operating machines, electronic devices, etc.” – Billings (1991)

“We define automation as the execution by a machine agent (usually a computer) of a function that was previously carried out by a human.” – Parasuraman (1997)

I’ll add to that somewhat broad definition functions that have never been carried out by a human. Namely, processes and tasks that could never be performed by a human, by exploiting the resources available in a modern computer. The recording and display of computations per second, for example.

To help clarify my use of the term:

  • Automation is not just about provisioning and configuration management. Although this is maybe the most popular context in which the term is used, it’s almost certainly not the only place for automation.
  • It’s also not simply the result of programming what were previously performed as manual tasks.
  • It can mean enforcing predefined or dynamic limits on operational tasks, automated or manual.
  • It can mean surfacing, displaying, and analyzing metrics from tasks and actions.
  • It can mean making decisions and possibly taking action on observed states in a system.

Some familiar examples of these facets of automation:

  • MySQL max_connections and Apache’s MaxClients directives: these are upper bounds intended to prevent high workloads from causing damage.
  • Nagios (or any monitoring system for that matter): these perform checks on values and states at rates and intervals only a computer could perform, and can also take action on those states in order to course-correct a process (as with Event Handlers)
  • Deployment tools and configuration management tools (like Deployinator, as well as Chef/Puppet/CFEngine, etc.)
  • Provisioning tools (golden-image or package-install based)
  • Any collection or display of metrics (StatsD, Ganglia, Graphite, etc.)

Which is basically…well, everything, in some form or another in web operations. 🙂
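To make the first of those examples concrete, here is a minimal sketch of a limit acting as automation: an upper bound that refuses new work rather than letting load cause damage. It is only an illustration of the idea, not how MySQL or Apache implement their directives; the `ConnectionLimiter` class and its method names are hypothetical.

```python
import threading


class ConnectionLimiter:
    """Hypothetical sketch: refuse new work once a fixed upper bound is reached."""

    def __init__(self, max_connections):
        self.max_connections = max_connections
        self._active = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Return True if a slot was granted, False if the limit is reached."""
        with self._lock:
            if self._active >= self.max_connections:
                return False
            self._active += 1
            return True

    def release(self):
        """Give a slot back; the count never drops below zero."""
        with self._lock:
            if self._active > 0:
                self._active -= 1


limiter = ConnectionLimiter(max_connections=2)
results = [limiter.try_acquire() for _ in range(3)]
print(results)  # [True, True, False]
```

The limit itself makes no judgement about why the load is high; it simply enforces a bound that a human picked ahead of time.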

Domains To Learn From

In many of the papers found in Human Factors and Resilience Engineering, and in blog posts that generally talk about limitations of automation, it’s done in the context of aviation. And what a great context that is! You have dramatic consequences (people die) and you have a plethora of articles and research to choose from. The volume of research done on automation in the cockpit is large due to the drama (people die, big explosions, etc.) so no surprise there.

Except the difference is, in the cockpit, human and machine elements have a different context. There are mechanical actions that the operator can and needs to do during takeoff and landing. They physically step on pedals, push levers and buttons, and watch dials and gauges at various points during takeoff and landing. Automation in that context is, frankly, much more evolved, and the contrast (and implicit contract) between man and machine is much more stark than in the context of web infrastructures. Display layouts, power-assisted controls…we should be so lucky to have attention like that paid to our working environment in web operations! (but also, cheers to people not dying when the site goes down, amirite?)

My point is that while we discuss the pros, cons, and considerations for designing automation to help us in web operations, we have to be clear that we are not aviation, and that our discussion should reflect that while still trying to glean information from that field’s use of it.

We ought to understand also that when we are designing tasks, automation is but one (albeit a complex one) approach we can take, and that it can be implemented in a wide spectrum of ways. This also means that if we decide in some cases to not automate something (gasp!) or to step back from full automation for good reason, we shouldn’t feel bad or failed about it. Ray Kurzweil and the nutjobs that think the “singularity” is coming RealSoonNow™ won’t be impressed, but then again you’ve got work to do.

So Why Do We Want to Use Automation?

Historically, automation is used for:

  • Precision
  • Stability
  • Speed

Which sounds like a pretty good argument for it, right? Who wants to be less precise, less stable, or slower? Not I, says the Ops guy. So using automation at work seems like a no-brainer.  But is it really just as simple as that?

Some common motivations for automation are:

  • Reduce or eliminate human error
  • Reduction of the human’s workload. Specifically, ridding humans of boring and tedious tasks so they can tackle the more difficult ones
  • Bring stability to a system
  • Reduce fatigue on humans

No article about automation would be complete without pointing first at Lisanne Bainbridge’s 1983 paper, “The Ironies of Automation”. I would put her work here as modern canonical on the topic. Any self-respecting engineer should read it. While its prose is somewhat dated, the value is still very real and pertinent.

What she says, in a nutshell, is that there are at least two ironies with automation, from the traditional view of it. The premise reflects a gut intuition that pervades many fields of engineering, and one that I think should be questioned:

The basic view is that the human operator is unreliable and inefficient, and therefore should be eliminated from the system.

Roger that. This supports the idea to take humans out of the loop (because they are unreliable and inefficient) and replace them with automated processes.

The first irony is:

Designer errors [in automation] can be a major source of operating problems.

This means that the designers of automation make decisions about how it will work based on how they envision the context in which it will be used. There is a very real possibility that the designer hasn’t imagined (or, can’t imagine) every scenario and situation the automation and human will find themselves in, and so therefore can’t account for it in the design.

Let’s re-read the statement: “This supports the idea to take humans out of the loop (because they are unreliable and inefficient) and replace them with automated processes.”…which are designed by humans, who are assumed to be unrelia…oh, wait.

The second irony is:

The designer [of the automation], who tries to eliminate the operator, still leaves the operator to do the tasks which the designer cannot think how to automate.

Which is to say that because the designers of automation can’t fully automate the human “out” of everything in a task, the human is left to cope with what’s left after the automated parts. Which by definition are the more complex bits. So the proposed benefit of relieving humans of cognitive workload isn’t exactly realized.

There are some more generalizations that Bainbridge makes, paraphrased by James Reason in Managing The Risks of Organizational Accidents:

  • In highly automated systems, the task of the human operator is to monitor the systems to ensure that the ‘automatics’ are working as they should. But it’s well known that even the best motivated people have trouble maintaining vigilance for long periods of time. They are thus ill-suited to watch out for these very rare abnormal conditions.
  • Skills need to be practiced continuously in order to preserve them. Yet an automatic system that fails only very occasionally denies the human operator the opportunity to practice the skills that will be called upon in an emergency. Thus, operators can become deskilled in just those abilities that justify their (supposedly) marginalized existence.
  • And ‘Perhaps the final irony is that it is the most successful automated systems with rare need for manual intervention which may need the greatest investment in operator training.’

Bainbridge’s exploration of ironies and costs of automation brings a much more balanced view of the topic, IMHO. It also points to something that I don’t believe is apparent to our community, which is that automation isn’t an all-or-nothing proposition. It’s easy to bucket things that humans do, and things that machines do, and while the two do meet from time to time in different contexts, it’s simpler to think of their abilities apart from each other.

Viewing automation instead on a spectrum of contexts can break this oversimplification, which I think can help us gain a glimpse into what a more mature perspective towards automation could look like.

Levels Of Automation

It would seem automation design needs to be done with the context of its use in mind. Another fundamental work in the research of automation is the so-called “Levels Of Automation”. In their seminal 1978 paper “Human And Computer Control of Undersea Teleoperators”, Sheridan and Verplank lay out the landscape for where automation exists along the human-machine relationship (Table 8.2 in the original, a most excellent vintage typewritten engineering paper):

Automation Level   Automation Description
 1   The computer offers no assistance: human must take all decisions and actions.
 2   The computer offers a complete set of decision/action alternatives, or
 3   …narrows the selection down to a few, or
 4   …suggests one alternative, and
 5   …executes that suggestion if the human approves, or
 6   …allows the human a restricted time to veto before automatic execution, or
 7   …executes automatically, then necessarily informs humans, and
 8   …informs the human only if asked, or
 9   …informs him after execution if it, the computer, decides to.
 10  The computer decides everything and acts autonomously, ignoring the human.
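One way to make the table above easier to reason about is to write it down as data. The following is a rough sketch only: the member names are shorthand invented for this example (they are not Sheridan and Verplank’s terms), and the two helper functions encode one reading of the table, namely that the computer first executes anything at level 5 and that the human loses the ability to stop an action after level 6.

```python
from enum import IntEnum


class LOA(IntEnum):
    """The ten levels from the table; names are shorthand invented for this sketch."""
    NO_ASSISTANCE = 1
    OFFERS_ALL_ALTERNATIVES = 2
    NARROWS_ALTERNATIVES = 3
    SUGGESTS_ONE = 4
    EXECUTES_IF_APPROVED = 5
    VETO_WINDOW = 6
    EXECUTES_THEN_INFORMS = 7
    INFORMS_IF_ASKED = 8
    INFORMS_IF_IT_DECIDES = 9
    FULLY_AUTONOMOUS = 10


def who_executes(level):
    """Return who carries out the action: up to level 4 the computer only advises."""
    return "human" if level <= LOA.SUGGESTS_ONE else "computer"


def human_can_block(level):
    """True if the human can still stop the action before it happens (levels 1-6)."""
    return level <= LOA.VETO_WINDOW


for level in (LOA.SUGGESTS_ONE, LOA.VETO_WINDOW, LOA.EXECUTES_THEN_INFORMS):
    print(level.value, who_executes(level), human_can_block(level))
```

Written this way, the choice of level becomes an explicit design decision rather than something a tool happens to do.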


This was extended later in Parasuraman, Sheridan, and Wickens (2000) “A Model for Types and Levels of Human Interaction with Automation” to include four stages of information processing within which each level of automation may exist:

  1. Information Acquisition. The first stage involves the acquisition, registration, and position of multiple information sources similar to that of humans’ initial sensory processing.
  2. Information Analysis.  The second stage refers to conscious perception, selective attention, cognition, and the manipulation of processed information such as in the Baddeley model of information processing
  3. Decision and Action Selection. Next, automation can make decisions based on information acquisition, analysis and integration.
  4. Action Implementation. Finally, automation may execute forms of action.

Viewing the above 10 Levels of Automation (LOA) as a spectrum within each of those four stages allows for a way of discerning where and how much automation could (or should) be implemented, in the context of performance and cost of actions. This feels to me like a step towards making mature decisions about the role of automation in different contexts.
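As a sketch of what “a level within each stage” could look like in practice, the snippet below records one level (1 to 10) per stage for a single task. Everything here is an assumption made for illustration: the `AutomationProfile` class, the `human_led_stages` helper, and the numbers in the example are invented and do not describe any real system.

```python
from dataclasses import dataclass

# The four stages named by Parasuraman, Sheridan, and Wickens (2000).
STAGES = (
    "Information Acquisition",
    "Information Analysis",
    "Decision and Action Selection",
    "Action Implementation",
)


@dataclass(frozen=True)
class AutomationProfile:
    """Hypothetical record: one level of automation (1-10) per stage for a task."""
    task: str
    levels: dict

    def __post_init__(self):
        # Every stage must be present and carry a level between 1 and 10.
        for stage in STAGES:
            level = self.levels.get(stage)
            if level is None or not 1 <= level <= 10:
                raise ValueError(f"{self.task}: need a level 1-10 for {stage!r}")

    def human_led_stages(self):
        """Stages at level 4 or below, where the computer at most suggests."""
        return [stage for stage in STAGES if self.levels[stage] <= 4]


# Illustrative numbers only.
example = AutomationProfile(
    task="example deploy task",
    levels={
        "Information Acquisition": 7,
        "Information Analysis": 7,
        "Decision and Action Selection": 1,
        "Action Implementation": 5,
    },
)
print(example.human_led_stages())  # ['Decision and Action Selection']
```

The value of writing a profile down is less in the numbers themselves than in forcing the conversation about which stages a team wants to keep in human hands.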

Here is an example of these stages and the LOA in each of them, suggested for Air Traffic Control activities:

Endsley (1999) also came up with a similar paradigm of stages of automation, in “Level of automation effects on performance, situation awareness and workload in a dynamic control task”

What are examples of viewing LOA in the context of web operations and engineering?

At Etsy, we’ve made decisions (sometimes implicitly) about the levels of automation in various tasks and tooling:

  • Deployinator: assisted by automated processes, humans trigger application code deploys to production. The when and what is human-centered. The how is computer-centered.
  • Chef: humans decide on some details in recipes (this configuration file in this place), computers decide on others (use 85% of total RAM for memcached, other logic in templates), and computer decides on automatic deployment (random 10 minute splay for Chef client runs). Mostly, humans provide the what, and computers decide the when and how.
  • Database Schema changes: assisted by automated processes, humans trigger the what and when, computer handles the how.
  • Event handling: some Nagios alerts trigger simple self-healing attempts upon some (not all) alertable events. Human decides what and how. Computer decides when.
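The event handling item can be sketched in a few lines. This is not Nagios’ event handler interface; `handle_alert`, the `remediations` table, and the `max_attempts` cap are hypothetical, and the cap in particular is an assumption added here to show one way of handing control back to a person. The shape matches the description above: humans write the what and how ahead of time for some alerts only, and the computer decides when to run it.

```python
def handle_alert(alert_name, remediations, attempts, max_attempts=2):
    """Decide what to do when an alert fires.

    remediations: dict mapping alert name -> callable written by a human
                  (the "what" and "how"). Alerts missing from the dict are
                  never auto-remediated.
    attempts:     dict counting how often each alert was auto-handled.
    Returns "remediated" or "page_human".
    """
    action = remediations.get(alert_name)
    if action is None or attempts.get(alert_name, 0) >= max_attempts:
        return "page_human"
    attempts[alert_name] = attempts.get(alert_name, 0) + 1
    action()
    return "remediated"


restarts = []
remediations = {"worker_down": lambda: restarts.append("restart worker")}
attempts = {}

outcomes = [handle_alert("worker_down", remediations, attempts) for _ in range(3)]
outcomes.append(handle_alert("disk_full", remediations, attempts))
print(outcomes)  # ['remediated', 'remediated', 'page_human', 'page_human']
```

Note that the interesting design decisions live outside the function: which alerts get an entry in the table at all, and how many automatic attempts are tolerable before a human should look.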

I suspect that in many organizations, the four stages of automation (from Parasuraman, Sheridan, and Wickens) line up something like this, with regards to the breakdown in human or computer function allocation:

Information Acquisition
  • Largely computer-driven for application and infra metrics (think Graphite/Ganglia/NewRelic/Boundary/etc.)
  • Some higher-level human-driven data acquisition (think UX testing and observation/focus groups/etc.)
Information Analysis
  • Some computer-driven for application and infra (think Holt-Winters, CEP, A/B testing results, deductive reasoning about metrics, etc.)
  • Some human-driven analysis (think BI/behavioral/funnel correlations, inductive reasoning about metrics, etc.)
Decision and Action Selection
  • Some computer-driven for application and infra (think event handlers, fault tolerance and protection methods, CI, etc.)
  • Some human-driven (think some deployments, core network or storage changes deemed risky, etc.)
Action Implementation
  • Some computer-driven for application and infra (think event handlers, some config mgmt implementations, scheduled jobs with feed-back and feed-forward loops, etc.)
  • Some human-driven (think some deployments, feature ramp-ups, coordinated multi-team actions, etc.)



In many cases, what level of automation is appropriate and in which context is informed by the level of trust that operators have in the automation to be successful.

Do you trust an iPhone’s ability to auto-correct your spelling enough to blindly accept all suggestions? I suspect no one would, and the iPhone auto-correct designers know this because they’ve given the human veto power over each suggestion by putting an “x” next to it. (following automation level 5, above)

Do you trust a GPS routing system enough to follow it without question? Let’s hope not. Given that there is context missing, such as stop signs, red lights, pedestrians, and other dynamic phenomena going on in traffic, GPS automobile routing may be a good example of keeping the LOA at level 4 and below, and even then only for the “Information Acquisition” and “Information Analysis” stages, leaving the “Decision and Action Selection” and “Action Implementation” stages to the human who can recognize the more complex context.

In “Trust in Automation: Designing for Appropriate Reliance”, John D. Lee and Katrina A. See investigate the concerns surrounding trusting automation, including organizational issues, cultural issues, and context that can influence how automation is designed and implemented. They outline a concern that I think should be familiar to anyone who has had experiences (good or bad) with automation (emphasis mine):

As automation becomes more prevalent, poor partnerships between people and automation will become increasingly costly and catastrophic. Such flawed partnerships between automation and people can be described in terms of misuse and disuse of automation. (Parasuraman & Riley, 1997).

Misuse refers to the failures that occur when people inadvertently violate critical assumptions and rely on automation inappropriately, whereas disuse signifies failures that occur when people reject the capabilities of automation.

Misuse and disuse are two examples of inappropriate reliance on automation that can compromise safety and profitability.

They discuss methods for making automation trustable:

  • Design for appropriate trust, not greater trust.
  • Show the past performance of the automation.
  • Show the process and algorithms of the automation by revealing intermediate results in a way that is comprehensible to the operators.
  • Simplify the algorithms and operation of the automation to make it more understandable.
  • Show the purpose of the automation, design basis, and range of applications in a way that relates to the users’ goals.
  • Train operators regarding its expected reliability, the mechanisms governing its behavior, and its intended use.
  • Carefully evaluate any anthropomorphizing of the automation, such as using speech to create a synthetic conversational partner, to ensure appropriate trust.

Adam Jacob, in a private email thread with me and some others, had some very insightful things to say on the topic:

The practical application of the ironies isn’t that you should/should not automate a given task, it’s answering the questions of “When is it safe to automate?”, perhaps followed by “How do I make it safe?”. We often jump directly to “automation is awesome”, which is an answer to a different question.

[if you were to ask]…”how do you draw the line between what is and isn’t appropriate?”, I come up with a couple of things:

  • The purpose of automation is to serve a need – for most of us, it’s a business need. For others, it’s a human-critical one (do not crash planes full of people regularly due to foreseeable pilot error.)
  • Recognize the need you are serving – it’s not good for its own sake, and different needs call for different levels of automation effort.
  • The implementers of that automation have a moral imperative to create automation that is serviceable, instrumented, and documented.
  • The users of automation have an imperative to ensure that the supervisors understand the system in detail, and can recover from

I think Adam is putting this eloquently, and I think it’s an indication that we as a field are moving towards a more mature perspective on the subject.

There is a growing notion amongst those who study the history, ironies, limitations, and advantages of automation that an evolved perspective on the human-machine relationship may look a lot like human-human relationships. Specifically, the characteristics that govern groups of humans that are engaged in ‘joint activity’ could also be seen as ways that automation could interact.

Collaboration, communication, and cooperation are some of the hallmarks of teamwork amongst people. In “Ten Challenges for Making Automation a ‘Team Player’ in Joint Human-Agent Activity” David Woods, Gary Klein, Jeffrey M. Bradshaw, Robert R. Hoffman, and Paul J. Feltovich make a case for how such a relationship might exist. I wrote briefly a little while ago about the ideas that this paper rests on, in this post here about how people work together.

Here are these ten challenges the authors say we face, where ‘agents’ = humans and machines/automated processes designed by humans:

  • Basic Compact – Challenge 1: To be a team player, an intelligent agent must fulfill the requirements of a Basic Compact to engage in common-grounding activities.
  • Adequate models – Challenge 2: To be an effective team player, intelligent agents must be able to adequately model the other participants’ intentions and actions vis-à-vis the joint activity’s state and evolution—for example, are they having trouble? Are they on a standard path proceeding smoothly? What impasses have arisen? How have others adapted to disruptions to the plan?
  • Predictability – Challenge 3: Human-agent team members must be mutually predictable.
  • Directability – Challenge 4: Agents must be directable.
  • Revealing status and intentions – Challenge 5: Agents must be able to make pertinent aspects of their status and intentions obvious to their teammates.
  • Interpreting signals – Challenge 6: Agents must be able to observe and interpret pertinent signals of status and intentions.
  • Goal negotiation – Challenge 7: Agents must be able to engage in goal negotiation.
  • Collaboration – Challenge 8: Support technologies for planning and autonomy must enable a collaborative approach.
  • Attention management – Challenge 9: Agents must be able to participate in managing attention.
  • Cost control – Challenge 10: All team members must help control the costs of coordinated activity.

I do recognize these to be traits and characteristics of high-performing human teams. Think of the best teams in many contexts (engineering, sports, political, etc.) and these certainly show up. Can humans and machines work together just as well? Maybe we’ll find out over the next ten years. 🙂

“The question is no longer whether one or another function can be automated, but, rather, whether it should be.” – Wiener & Curry (1980)
“…and in what ways it should be automated.” – John Allspaw (right now, in response to Wiener & Curry’s quote above)

Fundamental: Stress-Strain Curves In Web Engineering

I make it no secret that my background is in mechanical engineering. I still miss those days of explicit and dynamic finite element analysis, when I worked for the VNTSC, working on vehicle crashworthiness studies for the NHTSA.

What was there not to like? Things like cars and airbags and seatbelts and dummies that get crushed, sheared, cracked, busted in every way, all made of different materials: steel, glass, rubber, even flesh (cadaver studies)…it was awesome.

I’ve made some analogies from the world of statics and dynamics to the world of web operations before (Part I and Part II), and it still sticks in my mind as a fundamental mental model in my every day work: resources that have adaptive capacities have a fundamental relationship between stress and strain. Which is to say, in most systems we encounter, as demand for a given resource increases, the strain on the system (and therefore the adaptive capacity) under load also changes, and in most cases increases.

What do I mean by “resource”? Well, from the materials science world, this is generally a component characterized by its material properties. The textbook example is a bar of metal, being stretched.

À la:

In this traditional case, the “system” is simply a beam or a linkage or a load-bearing something.

But in my extension/abuse of the analogy, simple resources in the first order could be imagined as:

  •    CPU
  •    Memory
  •    Disk I/O
  •    Disk consumption
  •    Network bandwidth

To extend it further (and more realistically, because these resources almost never experience work in isolation of each other) you could think of the resource under load to be any combination of these things. And the system under load may be a webserver. Or a database. Or a caching server.

Captain Obvious says: welcome to the underlying facts-on-the-ground of capacity planning and monitoring. 🙂

To me, this leads to some more questions:

    • What does this relationship look like, between stress and strain?
      • Does it fail immediately, as if it was brittle?
      • Or does it “bend”, proportionally, (as in: request rate versus latency) for some period before failure?
      • If the latter, is the relationship linear, or exponential, or something else entirely?
    • Was this relationship known before the design of the system, and therefore taken into account?
      • Which is to say: what approaches are we using most in predicting this relationship between stress and strain:
        • Extrapolated experimental data from synthetic load testing?
        • Previous real-world production data from similarly-designed systems?
        • Percentage rampups of production load?
        • A few cherry-picked reports on HackerNews combined with hope and caffeine?
    • Will we be able to detect when the adaptive capacity of this system is nearing damage or failure?
    • If we can, what are we planning on doing when we reach those inflections?

The more confidence we have about this relationship between stress and strain, the more prepared we are for the system’s failures and successes.
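
Here is one way the “does it bend proportionally, and where does it stop?” question could be sketched in code. It fits a straight line to the lowest-load samples of a request-rate versus latency curve and reports the first sample that rises well above that line. The function names, the choice of four baseline samples, the 25% tolerance, and the sample numbers are all assumptions for illustration, not measurements from any real system.

```python
def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept).

    Assumes at least two points with distinct x values.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def yield_point(samples, baseline_count=4, tolerance=0.25):
    """Return the first (load, latency) sample that exceeds the linear
    baseline by more than `tolerance`, or None if the curve stays linear."""
    samples = sorted(samples)
    slope, intercept = fit_line(samples[:baseline_count])
    for load, latency in samples[baseline_count:]:
        expected = slope * load + intercept
        if latency > expected * (1 + tolerance):
            return load, latency
    return None


if __name__ == "__main__":
    # (requests per second, latency in ms): made-up load test results
    load_test = [(100, 20), (200, 22), (300, 24), (400, 26),
                 (500, 28), (600, 31), (700, 45), (800, 90)]
    print("latency leaves the linear region at:", yield_point(load_test))
```

With these made-up numbers the linear region ends at 700 requests per second. A real curve is noisier, and which tolerance counts as “nearing damage” is exactly the kind of judgment the questions above are asking about.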

Now, the analogy of this fundamental concept doesn’t end here. What if the “system” under varying load is an organization? What if it’s your development and operations team? Viewed on a longer scale than a web request, this can be seen as a defining characteristic of a team’s adaptive capacities.

David Woods and John Wreathall discuss this analogy they’ve made in “Stress-Strain Plots as a Basis for Assessing System Resilience”. They describe how they are mapping the state space of a stress-strain plot to an organization’s adaptive capacities and resilience:

Following the conventions of stress-strain plots in material sciences, the y-axis is the stress axis. We will here label the y-axis as the demand axis (D) and the basic unit of analysis is how the organization responds to an increase in D relative to a base level of D (Figure 1). The x-axis captures how the material stretches when placed under a given load or a change in load. In the extension to organizations, the x-axis captures how the organization stretches to handle an increase in demands (S relative to some base).

In the first region – which we will term the uniform response region – the organization has developed plans, procedures, training, personnel and related operational resources that can stretch uniformly as demand varies in this region. This is the on-plan performance area or what Woods (2006) referred to as the competence envelope.

As you can imagine, the fun begins in the part of the relationship above the uniform region. In materials science, this is where plastic deformation begins; it’s the point on the curve at which a resource/component’s structure deforms under the increased stress and can no longer rebound back to its original position. It’s essentially damaged, or its shape is permanently changed in the given context.

They go on to say that in the organizational stress-strain analogy:

In the second region non-uniform stretching begins; in other words, ‘gaps’ begin to appear in the ability to maintain safe and effective production (as defined within the competence envelope) as the change in demands exceeds the ability of the organization to adapt within the competence envelope. At this point, the demands exceed the limit of the first order adaptations built into the plan-ful operation of the system in question. To avoid an accumulation of gaps that would lead to a system failure, active steps are needed to compensate for the gaps or to extend the ability of the system to stretch in response to increasing demands. These local adaptations are provided by people and groups as they actively adjust strategies and recruit resources so that the system can continue to stretch. We term this the ‘extra’ region (or more compactly, the x-region) as compensation requires extra work, extra resources, and new (extra) strategies.

So this is a good general description in Human Factors Researcher language, but what is an example of this non-uniform or plastic deformation in our world of web engineering? I see a few examples.

  • In distributed systems, at the point at which the volume of data and the request (or change) rate of the data are beyond the ability of individual nodes to cope, and a wholesale rehash or fundamental redistribution is necessary. For example, in a typical OneMasterManySlaves approach to database architecture, this is when the rate of change on the master passes the point where, no matter how many slaves you add (to relieve read load on the master), the data will continually be stale. Common solutions to this inflection point are functional partitioning of the data into smaller clusters, or federating the data amongst shards. In another example, it could be that in a Dynamo-influenced datastore, the N, W, and R knobs need adjusting to adapt to the rate, or the individual nodes’ resources need to be changed.
  • In Ops teams, when individuals start to discover and compensate for brittleness in the architecture. A common sign of this happening is when alerting thresholds or approaches (active versus passive, aggregate versus individual, etc.) no longer provide the detection needed within an acceptable signal:noise envelope. This compensation can be largely invisible, growing until it’s too late and burnout has settled in.
  • When the limits of an underlying technology (or the particular use case for it) start to show. An example of this is a single-process server. Low traffic rates pose no problem for software that can only run on a single CPU core; it can adapt to small bursts to a certain extent, and there’s a simple solution to this non-multicore situation: add more servers. However, at some point, the work needed to replace the single-core software with multicore-ready software drops below the amount of work needed to maintain and grow an army of single-process servers. This is especially true in terms of computing efficiency, as in dollars per calculation.
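
For the Dynamo-influenced example in the first bullet, the trade-off behind the N, W, and R knobs can be sketched in a few lines. This uses the simple quorum model, where a read is sure to overlap the latest acknowledged write only when R + W > N; real datastores add things like sloppy quorums and hinted handoff that this ignores. The function and key names are my own assumptions.

```python
def quorum_properties(n: int, w: int, r: int) -> dict:
    """Describe what an (N, W, R) setting buys you in the simple quorum model.

    n: replicas per key, w: acks required for a write, r: replies required
    for a read.
    """
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must each be between 1 and n")
    return {
        # Every read set shares at least one replica with every write set.
        "quorums_overlap": r + w > n,
        # Replicas that can be down while writes / reads still succeed.
        "write_failures_tolerated": n - w,
        "read_failures_tolerated": n - r,
    }


if __name__ == "__main__":
    for setting in [(3, 2, 2), (3, 1, 1), (3, 3, 1)]:
        print(setting, quorum_properties(*setting))
```

Turning W down lets writes keep up with a higher rate of change, at the price of reads that may miss the latest write, which is the same staleness inflection point in a different costume.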

In other words, the ways a design or team once adapted are no longer valid in this new region of the stress-strain relationship. Successful organizations re-group and increase their ability to adapt to this new present case of demands, and invest in new capacities.

For the general case, the exhaustion of capacity to adapt as demands grow is represented by the movement to a failure point. This second phase is represented by the slope and distance to the failure point (the downswing portion of the x-region curve). Rapid collapse is one kind of brittleness; more resilient systems can anticipate the eventual decline or recognize that capacity is becoming exhausted and recruit additional resources and methods for adaptation or switch to a re-structured mode of operations (Figures 2 and 3). Gracefully degrading systems can defer movement toward a failure point by continuing to act to add extra adaptive capacity.

In effect, resilient organizations recognize the need for these new strategies early on in the non-uniform phase, before failure becomes imminent. This, in my view, is the difference between a team who has ingrained into their perspective what it means to be operationally ready, and those who have not. At an individual level, this is what I would consider to be one of the many characteristics that define a “senior” (or, rather a mature) engineer.

This is the money quote, emphasis is mine:

Recognizing that this has occurred (or is about to occur) leads people in these various roles to actively adapt to make up for the non-uniform stretching (or to signal the need for active adaptation to others). They inject new resources, tactics, and strategies to stretch adaptive capacity beyond the base built into on-plan behavior. People are the usual source of these gap-filling adaptations and these people are often inventive in finding ways to adapt when they have experience with particular gaps (Cook et al., 2000). Experienced people generally anticipate the need for these gap-filling adaptations to forestall or to be prepared for upcoming events (Klein et al., 2005; Woods and Hollnagel, 2006), though they may have to adapt reactively on some occasions after the consequences of gaps have begun to appear. (The critical role of anticipation was missed in some early work that noticed the importance of resilient performance, e.g., Wildavsky, 1988.)

This behavior leads to the extension of the non-uniform space into new uniform spaces, as the team injects new adaptive capacities:

There is a lot more in this particular paper that Woods and Wreathall cover, including:

  • Calibration – How engineering leaders and teams view themselves and their situation, along the demand-strain curve. Do they underestimate or overestimate how close they are to failure points or active adaptations that are indicative of “drift” towards failure?
  • Costs of Continual Adaption in the X-Region – As the compensations for cracks and gaps in the system’s armor increase, so does the cost. At some point, the cost of restructuring the technology or the teams becomes lower than the continual making-up-for-the-gaps that are happening.
  • The Law of Stretched Systems – “As an organization is successful, more may be demanded of it (‘faster, better, cheaper’ pressures) pushing the organization to handle demands that will exceed its uniform range. In part this relationship is captured in the Law of Stretched Systems (Woods and Hollnagel, 2006) – with new capabilities, effective leaders will adapt to exploit the new margins by demanding higher tempos, greater efficiency, new levels of performance, and more complex ways of work.”

Overall, I think Woods and Wreathall hit the nail on the head for me.  Of course, as with all analogies, this mapping of resilience and adaptive capacity to stress-strain curves has limits and they are clear on pointing those out as well.

My suggestion of course is for you to read the whole chapter. It may or may not be useful for you, but it sure is to me. I mean, I embrace the concept so much that I got it printed on a coffee mug, and I’m thinking of making an Etsy Engineering t-shirt as well. 🙂

The Devil’s In The Details

I’m a firm believer that context is everything, and that it’s needed in every constructive conversation we want to have as engineers.

As a nascent (but adorable) engineering field, we discuss (in blogs, books, meetups, conferences, etc.) success and failure in a number of areas, including the ways in which we work. We don’t just build complex systems, we are a complex system. In order to get our work done, we have to successfully bring together people and skills from diverse backgrounds. When we reach large-scale, we have to enlist deep and diverse domain expertise across our staff.

But sometimes, we can get frustrated or bogged-down in the details of these interactions between groups and individuals. We can feel like the blunt end (management) doesn’t understand the sharp end (practitioners), or we can feel as though one group doesn’t understand the goals, concerns, or tradeoffs of another, or we simply aren’t doing a good enough job of enabling people to have constructive conversations.

As usual, we’re not the only people who might be interested in how people work together, especially in combination with machines. There’s a great chapter in the 2005 issue of Organizational Simulation (link to journal) that outlines the concept of a “Basic Compact” that people have with each other when engaged in joint activity. From Common Ground and Coordination In Joint Activity:

People engage in joint activity for many reasons: because of necessity (neither party, alone, has the required skills or resources), enrichment (while each party could accomplish the task, they believe that adding complementary points of view will create a richer product), coercion (the boss assigns a group to carry out an assignment), efficiency (the parties working together can do the job faster or with fewer resources), resilience (the different perspectives and knowledge broaden the exploration of possibilities and cross check to detect and recover from errors) or even collegiality (the team members enjoy working together).

We propose that joint activity requires a “Basic Compact” that constitutes a level of commitment for all parties to support the process of coordination. The Basic Compact is an agreement (usually tacit) to participate in the joint activity and to carry out the required coordination responsibilities. Members of a relay team enter into a Basic Compact by virtue of their being on the team; people who are angrily arguing with each other are committed to a Basic Compact as long as they want the argument to continue.

That first reason why people engage in joint activity, necessity (neither party, alone, has the required skills or resources), points to why, even as successful companies develop and employ generalists, there is an advantage to having people with differing domain expertise come together. An example of this might be “development” and “operations”, or “design” and “product”, or “finance” and “business development”, or “public relations” and “community”, etc.

Regardless, two of the authors are responsible for some of the best writing on Cognitive Systems Engineering and Naturalistic Decision Making: Dr. David Woods and Gary Klein, respectively. There are a lot of great bits in here: costs of coordination, spoken and assumed behaviors, as well as everyone’s favorite topic in engineering, automation and its effects on engineering behavior. Come on, what’s not to like in this paper?

The chapter is here as a PDF, so you best get it for your weekend reading! 🙂


Fault Tolerance and Protection

In yet another post where I point to a paper written from the perspective of another field of engineering about a topic that I think is inherently mappable to the web engineering world, I’ll at least give a summary. 🙂

Every time someone on-call gets an alert, they should always be thinking along these lines:

  • Does this really require me to wake up from sleeping or pause this movie I’m watching, to fix?
  • Can this really not wait until the morning, during office hours?

If the answer is yes to those, then excellent: the machines alerted a human to something that only a human could ever diagnose or fix. There was nothing that the software could have done to rectify the situation. Paging a human was justified.

But for those situations where the answer was “no” to those questions, one might (or should, anyway) think of bolstering your system’s “fault tolerance” or “fault protection.” But how many folks grok the full details of what that means?  Does it mean self-healing? Does it mean isolation of errors or unexpected behaviors that fall outside the bounds of normal operating circumstances? Or does it mean both and if so how should we approach building this tolerance and protection? The Wikipedia definitions for “fault tolerant systems” and “fault tolerant design” are a very good start on educating yourself on the concepts, but they’re reasonably general in scope.
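
As one narrow reading of “self-healing”, here is a sketch of an event handler that makes a bounded number of remediation attempts and only then pages a human. It is not any particular monitoring tool’s API: `check`, `remediate`, and `page_human` are hypothetical callables you would supply, and the default of two attempts is an arbitrary assumption.

```python
from typing import Callable


def handle_alert(check: Callable[[], bool],
                 remediate: Callable[[], None],
                 page_human: Callable[[str], None],
                 max_attempts: int = 2) -> str:
    """Try a bounded number of self-healing attempts before paging.

    Returns "healthy", "self-healed", or "paged".
    """
    if check():
        return "healthy"
    for _ in range(max_attempts):
        remediate()
        if check():
            return "self-healed"
    # The software has run out of ideas; this is now a job for a person.
    page_human(f"still failing after {max_attempts} remediation attempts")
    return "paged"


if __name__ == "__main__":
    # A stand-in service that never recovers, so the human gets paged.
    outcome = handle_alert(check=lambda: False,
                           remediate=lambda: print("restarting service"),
                           page_human=lambda msg: print("PAGE:", msg))
    print("outcome:", outcome)
```

The bound matters as much as the remediation: an event handler that retries forever is hiding a fault from the people who need to know about it.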

The fact is, designing web systems to be truly fault-tolerant and protective is hard. These are questions that can’t be answered solely within infrastructural bounds; fault-tolerance isn’t selective in its tiering, it has to be thought of from layer 1 of the network all the way to the browser.

Now, not every web startup is lucky enough to hire someone from NASA’s Jet Propulsion Lab, who has written software for space vehicles, but we managed to convince Greg Horvath to leave there and join Etsy. He pointed me to an excellent paper, by Robert D. Rasmussen, called “GN&C Fault Protection Fundamentals” and thankfully, it’s a lot less about Guidance, Navigation, and Control and more about fault tolerance and protection strategies, concerns, and implementations.

Some of those concerns, from the paper:

  • Do not separate fault protection from normal operation of the same functions.
  • Strive for function preservation, not just fault protection.
  • Test systems, not fault protection; test behavior, not reflexes.
  • Cleanly establish a delineation of mainline control functions from transcendent issues.
  • Solve problems locally, if possible; explicitly manage broader impacts, if not.
  • Respond to the situation as it is, not as it is hoped to be.
  • Distinguish fault diagnosis from fault response initiation.
  • Follow the path of least regret.
  • Take the analysis of all contingencies to their logical conclusion.
  • Never underestimate the value of operational flexibility.
  • Allow for all reasonable possibilities — even the implausible ones.

The last idea there points to having “requisite imagination” to explore as fully as possible, the question “What could possibly go wrong?”, which is really just another manifestation of one of the four cornerstones of Resilience Engineering, which is: “Anticipation”. But that’s a topic for another post.

Here’s Rasmussen’s paper, please go and read it. If you don’t, you’re totally missing out and not keeping up!

Systems Engineering: A great definition.

Ben Rockwood said something last December about the re-emergence of the Systems Engineer and I agree with him, 100%.

NASA Systems Engineering Handbook, 2007

To add to that, I’d like to quote the excellent NASA Systems Engineering handbook’s introduction. The emphasis is mine:

Systems engineering is a methodical, disciplined approach for the design, realization, technical management, operations, and retirement of a system. A “system” is a construct or collection of different elements that together produce results not obtainable by the elements alone. The elements, or parts, can include people, hardware, software, facilities, policies, and documents; that is, all things required to produce system-level results. The results include system-level qualities, properties, characteristics, functions, behavior, and performance. The value added by the system as a whole, beyond that contributed independently by the parts, is primarily created by the relationship among the parts; that is, how they are interconnected. It is a way of looking at the “big picture” when making technical decisions. It is a way of achieving stakeholder functional, physical, and operational performance requirements in the intended use environment over the planned life of the systems. In other words, systems engineering is a logical way of thinking.

Systems engineering is the art and science of developing an operable system capable of meeting requirements within often opposed constraints. Systems engineering is a holistic, integrative discipline, wherein the contributions of structural engineers, electrical engineers, mechanism designers, power engineers, human factors engineers, and many more disciplines are evaluated and balanced, one against another, to produce a coherent whole that is not dominated by the perspective of a single discipline.

Systems engineering seeks a safe and balanced design in the face of opposing interests and multiple, sometimes conflicting constraints. The systems engineer must develop the skill and instinct for identifying and focusing efforts on assessments to optimize the overall design and not favor one system/subsystem at the expense of another. The art is in knowing when and where to probe. Personnel with these skills are usually tagged as “systems engineers.” They may have other titles—lead systems engineer, technical manager, chief engineer— but for this document, we will use the term systems engineer.

The exact role and responsibility of the systems engineer may change from project to project depending on the size and complexity of the project and from phase to phase of the life cycle. For large projects, there may be one or more systems engineers. For small projects, sometimes the project manager may perform these practices. But, whoever assumes those responsibilities, the systems engineering functions must be performed. The actual assignment of the roles and responsibilities of the named systems engineer may also therefore vary. The lead systems engineer ensures that the system technically fulfills the defined needs and requirements and that a proper systems engineering approach is being followed. The systems engineer oversees the project’s systems engineering activities as performed by the technical team and directs, communicates, monitors, and coordinates tasks. The systems engineer reviews and evaluates the technical aspects of the project to ensure that the systems/subsystems engineering processes are functioning properly and evolves the system from concept to product. The entire technical team is involved in the systems engineering process.

I would imagine that successful organizations understand this concept of systems engineering, but I don’t think I’ve ever seen it put so well.

NASA’s engineers have both common and conflicting goals, just like we do in web operations. They weigh trade-offs in efficiency and thoroughness, and wade into the constraints of better, cheaper, faster, and hopefully: more resilient.

This re-emergence of the systems engineering (or “full-stack” engineering) notion is excellent and exciting to me, and I’m hoping that when people in our field say “DevOps” (or *Ops, as Theo puts it), what they mean is taking a systems engineering view.


Training Organizational Resilience in Escalating Situations

This little ramble of thoughts is related to my upcoming talk at Velocity, but I know I’ll never get to this part at the conference, so I figured I’d post about it here.

Building resilience from a systems point of view means (amongst other things) understanding how your organization deals with failure and unexpected situations. Generally this means having development and operations teams that can work well together under pressure, with fluctuating amounts of uncertainty, bringing their own domain expertise to the table when it matters.

This is what drives some of my favorite Ops candidate interview questions. Knowledge of Unix commands, network architectures, database behaviors, and scripting languages is obviously required, but it comprises only one facet of the gig. The real mettle comes from being able to easily zoom in and out of the whole system under scrutiny, splitting up troubleshooting responsibilities amongst your team (and trusting their results) and differentiating red herring symptoms from truly related ones. It also comes from things like:

  • Staying away from distracting conversation during the outage response. Nothing kills a TTR like unrelated talk in IRC or a conf call.
  • Trusting your information. This is where the UI challenges of dashboard design can make or break an outage response. “Are those units milli, or mega?”
  • Balancing too much communication and too little amongst team members. Troubleshooting outage verbosity is a fickle mistress.
  • Stomping actions. OneThingAtATime™ methods aren’t easy to stick to, especially when things escalate.
  • Keeping outage fatigue at bay, and recognizing when brains are melting and need to take a break.

To make matters worse, determining causality can be tenuous at best when you’re working with complex systems, so being able to recognize when a failure has a single root cause (hint: with the big outages – almost never) and when it has multiple contributing causes is a skill that isn’t easily gained without seeing a lot of action in the past.

So it’s not a surprise that working well within a team under stressful scenarios is something other fields try to train people for.  Trauma surgeons, FBI agents, military teams, air traffic control, etc. all have drills, exercises, and simulations for teaching these skills, but they are all done within the context of what those escalating situations look like in their specific fields.

So this brings up a question that has come up before in my circles:

Can this sort of organizational resilience be taught, within the context of web operations?

GameDay exercises could certainly be one avenue for testing and training team-based outage response, but most of the focus there (at least those discussed publicly by companies who hold GameDay exercises) is testing the infrastructure and application-level components, and even then under controlled conditions and relatively narrow failure modes.

So the confidence-building value of GameDay drills lies elsewhere. They don’t really exercise the cognitive load that real-world failures, like the spectacular Amazon AWS outage recently, can place on the humans (i.e. the troubleshooting dev and ops teams).

But! Some smart folks have been thinking about this question, at a higher-level:

Is it possible to construct non-contextual and generic drills that can train competencies for this sort of on-the-fly, making-sense-of-unfamiliar-failure-modes, and sometimes disorienting troubleshooting?

Out of Lund University in Sweden, there’s an excellent article on building organizational resilience in escalating situations, which I believe resulted in a chapter in the Resilience Engineering in Practice book. It also references another excellent article by David Woods and Emily Patterson called How Unexpected Events Produce An Escalation Of Cognitive And Coordinative Demands.

The parts I want to highlight here are best practices for designing scenarios meant to train these skills. If you’re looking to design a good drill meant to educate and/or train Ops and Devs on what cognitive muscles to develop for handling large-scale outages, this is a pretty damn good list (quoted from both of those sources above):

  • Try to force people beyond their learned roles and routines. The scenario can contain problems that are not solvable within those roles or routines, and forces people to step out of those roles and routines.
  • Contain a number of hidden goals, at various times during the scenario, that people could pursue (e.g. different ways of escaping the situation or de-escalating it), but that they have to vocalize and articulate in order to begin to achieve them (as they cannot do so by themselves).
  • Include potential actions of which the consequences are both important and difficult to foresee (and that might significantly influence people’s ability to control the problem in the near future). This can force people into pro-active thinking and articulation of their expectations of what might happen.
  • Be able to trap people in locking onto one solution that everybody is fixedly working towards. This can be done by garden-pathing; making the escalating problem look initially (with strong cues) like something the crew could already be familiar with, but then letting it depart (with much weaker cues) to see whether the crew is caught on the garden path and lets the situation escalate.
  • Or the scenario, by creating so much cognitive noise in terms of new warnings and events, should be able to trip people into thematic vagabonding—the tendency to redirect attention and change diagnosis with each incoming data piece, which results in a fragmentation of problem-solving.

Think that such a scenario could be constructed?

I want to think so, but of course nothing teaches like the hindsight of a real production outage, eh? 🙂

Resilience Engineering: Part I

I’ve been drafting this post for a really long time. Like most posts, it’s largely for me to get some thoughts down. It’s also very related to the topic I’ll be talking about at Velocity later this year.

When I gave a keynote talk at the Surge Conference last year, I talked about how our field of web engineering is still young, and would do very well to pay attention to other fields of engineering, since I suspect that we have a lot to learn from them. Contrary to popular belief, concepts such as fault tolerance, redundancy of components, sacrificial parts, automatic safety mechanisms, and capacity planning weren’t invented with the web. As it turns out, some of those ideas have been studied and put into practice in other fields for decades, if not centuries.

Systems engineering, control theory, reliability engineering…the list goes on for where we should be looking for influences, and other folks have noticed this as well. As our field recognizes the value of taking a “systems” (the C. West Churchman definition, not the computer software definition) view on building and managing infrastructures with a “Full Stack Programmer” perspective, we should pull our heads out of our echo chamber every now and again, because we can gain so much from lessons learned elsewhere.

Last year, I was lucky to convince Dr. Richard Cook to let us include his article “How Complex Systems Fail” in Web Operations. Some months before, I had seen the article and began to poke around Dr. Cook’s research areas: human error, cognitive systems engineering, safety, and a relatively new multi-discipline area known as Resilience Engineering.

What I found was nothing less than exhilarating and inspirational, and it’s hard for me to not consider this research mandatory reading for anyone involved with building or designing socio-technical systems. (Hint: we all are, in web operations.) Frankly, I haven’t been this excited since I saw Jimmy Page in a restaurant once in the mid-90s. Even though Dr. Cook (and others in his field, like Erik Hollnagel, David Woods, and Sidney Dekker) have historically written about and researched resilience in the context of aviation, space transportation, healthcare and manufacturing, their findings strike me as incredibly appropriate to web operations and development.

Except, of course, accidents in our field don’t actually harm or kill people. But they almost always involve humans, machines, high stress, and high expectations.

Some of the concepts in resilience engineering run contrary to the typical (or stereotypical) perspectives that I’ve found in operations management, and that’s what I find so fascinating. I’m especially interested in organizational resilience, and the realization that safety in systems develops not in spite of us messy humans, but because of us.

For example:

Historical approaches taken towards improving “safety” in production might not be best

Conventional wisdom might have you believe that the systems we build are basically safe, and that all they need is protection from unreliable humans. This logically stems from the myth that all outages/degradations occur as the result of a change gone wrong, and I suspect this idea also comes from Root Cause Analysis write-ups ending with “human error” at the bottom of the page. But Dekker, Woods, and others in Behind Human Error suggest that listing human error as a root cause isn’t where you should end, it’s where you should start your investigation. Getting behind what led to a ‘human error’ is where the good stuff happens, but unless you’ve got a safe political climate (i.e., no one is going to get punished or fired for making mistakes) you’ll never get at how and why the error was made. Which means that you will ignore one of the largest opportunities to make your system (and organization) more efficient and resilient in the face of incidents. Mismatches, slips, lapses, and violations…each one of those types of error can lead to different ways of improving. And of course, working out the motivations and intentions of people who have made errors isn’t straightforward, especially with engineers who might not have enough humility to admit to making an error in the first place.

Root Cause Analysis can be easily misinterpreted and abused

The idea that failures in complex systems can literally have a singular ‘root’ cause, as if failures are the result of linear steps in time, is just incorrect. Not only is it almost always incorrect, but in practice that perspective can be harmful to an organization because it allows management and others to feel better about improving safety, when they’re not, because the solution(s) can be viewed as simple and singular fixes (in reality, they’re not). James Reason’s pioneering book Human Error is enlightening on these points, to say the least. In reality (and I am as guilty of this as anyone) there are motivations to reduce complex failures to singular/linear models, tipping the scales on what Hollnagel refers to as an ETTO, or Efficiency-Thoroughness Trade-Off, which I think will sound familiar to anyone working in a web startup. Because why spend extra time digging to find details of that human error-causing outage, when you have work to do? Plus, if you linger too long in that postmortem meeting, people are going to feel even worse about making a mistake, and that’s just cruel, right? 🙂

Postmortems or accident investigations are not the only way an organization can improve “safety”

Only looking at failures to guide your designs, tools, and processes drastically minimizes your ability to improve, Hollnagel says. Instead of looking at the things that go wrong, looking at the things that go right is a better strategy to improve resiliency. Personally, I think that engineering teams who practice continuous deployment intuitively understand this. Teams where a growing number of developers make small and frequent changes to production subscribe to a particular culture of safety, whether they know it or not. It requires what Hollnagel refers to as a “constant sense of unease”, and awareness of failure is what helps bridge that stereotypical development and operations divide.

Resilience should be a 4th management objective, alongside Better/Faster/Cheaper

The definition goes like this:

Resilience is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions. Since resilience is about being able to function, rather than being impervious to failure, there is no conflict between productivity and safety.

This sounds like one of those commonsense ideas, right? In an extremely self-serving way, I find some validation in that definition that optimizing for MTTR is better than optimizing for MTBF. My gut says that this shouldn’t be shocking or a revelation; it’s what mature engineering is all about.
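To put rough numbers on the MTTR/MTBF comparison, here is a minimal sketch. It assumes the common steady-state availability formula, MTBF / (MTBF + MTTR), which is not part of Hollnagel’s definition above, and the hour figures are hypothetical values picked only for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up.

    Assumes the usual formula MTBF / (MTBF + MTTR).
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers, for illustration only.
baseline = availability(mtbf_hours=720, mttr_hours=4)
faster_recovery = availability(mtbf_hours=720, mttr_hours=2)   # halve MTTR
fewer_failures = availability(mtbf_hours=1440, mttr_hours=4)   # double MTBF

print(f"baseline:        {baseline:.5f}")
print(f"halve MTTR:      {faster_recovery:.5f}")
print(f"double MTBF:     {fewer_failures:.5f}")
```

Under this formula, halving MTTR and doubling MTBF produce the same availability, so the arithmetic alone doesn’t favor either one. The case for focusing on MTTR rests on the resilience argument: recovering faster is about being able to function through failure, rather than trying to be impervious to it.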

Safety might not come from the sources you think it comes from

“…so safety isn’t about the absence of something…that you need to count errors or monitor violations, and tabulate incidents and try to make those things go away…it’s about the presence of something. But the presence of what? When we find that things go right under difficult circumstances, it’s mostly because of people’s adaptive capacity; their ability to recognize, adapt to, and absorb changes and disruptions, some of which might fall outside of what the system is designed or trained to handle.”

– Sidney Dekker

My plan is to post more about these topics, because there are just too many ideas to explain in a single go. Apparently, Ashgate Publishing has owned this space, with a whole series of books. The newest one, Resilience Engineering in Practice, is in my bag, and I can’t put it down. Examples of these ideas in real-world scenarios (hospital and medical ops, power plants, air traffic control, financial services) are juicy with details, and the chapter “Lessons from the Hudson” goes into excellent detail about the trade-offs that go on in the mind of someone in high-stress failure scenarios, like Chesley Sullenberger.

I’ll end on this decent introduction to some of the ideas that includes the above quote, from Sidney Dekker. There’s some distracting camera work, but the ideas get across: