Owning Attention (Considerations for Alert Design)

Posted in: Cognitive Systems Engineering, Human Factors, Systems Safety, Talks

In the past month or two, I’ve spoken on the topic of alert design. There’s a video of me giving the talk (at Monitorama), but I thought I’d post on the topic and material here as well.

The topic of alerts, and of “alert design” as a deliberate and purposeful thing to do, has been on my mind.

In my experience, and from asking many people in engineering and operations (at least in the web and financial trading domains), nothing spikes blood pressure like the topic of alerts. The caricature of the sysadmin waking up to a buzzing pager or phone is what comes to mind.

The costs of not paying attention to how your organization views or treats alerts, and what they do to operational teams (developers and systems folks included), are I think both largely invisible and much higher than most people realize. It may be clear that what we’re talking about here is a signal-to-noise ratio, but it goes way beyond that. The cognitive cost for an engineer to attend to an alert (a fundamentally interrupting event, by design) is akin to the cost of a software developer losing their “flow”; context switching is expensive. Expensive from a financial standpoint, from a productivity perspective, and, I’ll argue, from a career-development view.

Here are some (likely melodramatic) assertions:

  • Alert numbness and fatigue are a blight on our industry. Because we can alert on basically anything, and we can argue that anything could be a harbinger of things that could drastically affect our business, we generally put an alert on everything we get our hands on.
  • Knowing that something has happened almost always trumps not knowing, often with little effort put into deciding whether the “something” is important in the context it happened in.
  • Having computers decide what is important to alert on is, and will always be, brittle. Meaning: alerts and their criteria originate in the author’s mind, which may or may not be in the same place as the receiver of the alert in the future. In other words: we all write documentation and procedures that make sense to us when we write them. They never survive much of the future, because the worlds that refer to them change. Example: corporate wiki pages are commonly referred to as the place where “documentation goes to die”. Alerts are no different.

Therefore, I’d love to get a much deeper and broader conversation about alert design in our domain. Because I’ll say that it’s not the technology that sucks, it’s our use of it. Consider the possibility that you don’t have a Nagios problem, you have an alert design problem.

Down and In

As the years go by, with the continued decline of storage prices and the explosion of accessible processing power, we have an ever-expanding ability to zoom in deeply on the ways servers and services talk to each other and process information.

We can zoom in on the relationships and behaviors of seemingly disparate pieces of data, and we can discover and detect disruptions or anomalies in sometimes surprising places. This is interesting, for sure.

But it is also woefully incomplete if we are to make any progress in technical operations.

Up and Out

It is incomplete because as we zoom out from that high-resolution metrics collection and analysis tooling, what we find is a much-ignored environment that includes one of the most powerful, context-sensitive, and incredibly adaptive anomaly detection and response agents in the world: humans.

Do we have anomaly detection problems? Certainly. One can argue (I will) that we will always have them, for many reasons. (One of those reasons is the Law Of Stretched Systems, but that is for a different post.)

What I’m interested in is not how software can be used to detect anomalies automatically,
(well, I’m interested, but I don’t doubt that we all will continue to get better at it)

…it is how people navigate this boundary between themselves and the machines they work with. The boundary between humans and machines, as we observe our use of tools, is a focus in and of itself. If we have any hope of making progress in monitoring complex systems, we must take this boundary into account.

As an aside, some more bullet points:

  1. We don’t use a single tool to gain insight into the architectures we build. And we will not, much to the dismay of many monitoring-as-a-service business models. (“A single pane of glass?! Where do I sign?!”)
  2. Teams of people are the norm, which means that communication and coordination become as important (if not more important) than surfacing anomalies themselves.
  3. We bring our biases, expectations, trust, and perceptions to the table when it comes to monitoring and response. No tool or piece of automation will ever change that.
  4. Understanding the breakdowns at these boundaries between people and machines should be a part of how we approach the design of tools. Organizational behavior beats technology at every turn.

Less Code, More Social Science

When we look at Boyd’s OODA loop, we see “observe” and “orient” as critical pieces. Note that these are not Unix commands, they are human activities.

So writing code to tell computers what to look at is quite different from making sure that the code’s human supervisors are equipped for, and aided in, knowing what to look at when an alert goes off. Figuring out how people make sense of what is actually going on at a given point (in diagnosis? in planning? in response to an outage? in control?) is just plain hard.

A step that Don Norman (and other folks known in the world of ergonomics and human factors) have been tugging at for a couple of decades is to first attempt to understand how people consume, adapt to, work around, and make use of tools under “normal” operating conditions. Once that’s done, it’s suggested, then we can try to understand how people make sense of their world under high-tempo or escalating scenarios (during an outage, for example) when the signals they receive can sometimes be disorienting as things escalate.

Questions

  • Who has ever gotten an alert and ignored it? (/me looks at alert, says “oh, it’ll probably recover, no need to look further”)
  • How many alerts were received in the past week that were not actionable? (no human action was required)
  • How many alerts were received in the past week as a result of known work being done (expected) but alerts were not silenced during that period?
  • How many alerts were received as a result of a previously silenced alert (because work was being done) that was mistakenly un-silenced?
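One way to turn the questions above from rhetorical ones into something measurable is to tally a week of paging history by disposition. Here’s a minimal sketch; the record fields (“actionable”, “during_known_work”, “silenced”) are hypothetical, not from any particular alerting tool:

```python
from collections import Counter

# Hypothetical alert records pulled from a week of paging history.
# The field names are illustrative only.
alerts = [
    {"check": "disk_space",  "actionable": True,  "during_known_work": False, "silenced": False},
    {"check": "cpu_high",    "actionable": False, "during_known_work": False, "silenced": False},
    {"check": "deploy_500s", "actionable": False, "during_known_work": True,  "silenced": False},
]

tally = Counter()
for alert in alerts:
    if not alert["actionable"]:
        tally["not actionable"] += 1
    if alert["during_known_work"] and not alert["silenced"]:
        tally["expected work, not silenced"] += 1

total = len(alerts)
for label, count in tally.items():
    print(f"{label}: {count}/{total} ({count / total:.0%})")
```

Even a crude tally like this, run weekly, gives you a trend line for how much of your paging load is noise.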

Here are some quotes from engineers who have found themselves in interesting situations related to alerts:

“The whole place just lit up. I mean, all the lights came on. So instead of being able to tell you what went wrong, the lights were absolutely no help at all.”
– Comment by one space controller in mission control after the Apollo 12 spacecraft was struck by lightning (Murray and Cox 1990).

 

“I would have liked to have thrown away the alarm panel. It wasn’t giving us any useful information.”
– Comment by one operator at the Three Mile Island nuclear power plant to the official inquiry following the TMI accident (Kemeny 1979).

 

“When the alarm kept going off then we kept shutting it [the device] off [and on] and when the alarm would go off [again], we’d shut it off.”
“… so I just reset it [a device control] to a higher temperature. So I kinda fooled it [the alarm]…”
– Physicians explaining how they respond to a nuisance alarm on a computerized operating room device (Cook, Potter, Woods and McDonald 1991).

 

“A [computer] program alarm could be triggered by trivial problems that could be ignored altogether. Or it could be triggered by problems that called for an immediate abort [of the lunar landing]. How to decide which was which? It wasn’t enough to memorize what the program alarm numbers stood for, because even within a single number the alarm might signify many different things.

 

“We wrote ourselves little rules like: ‘If this alarm happens and it only happens once, don’t worry about it. If it happens repeatedly, but other indicators are okay, don’t worry about it.'” And of course, if some alarms happen even once, or if other alarms happen repeatedly and the other indicators are not okay, then they should get the LEM [lunar module] the hell out of there.
– Response to discovery of a set of computer alarms linked to the astronauts displays shortly before the Apollo 11 mission (Murray and Cox 1990).

 

“1202.” (Astronaut announcing that an alarm buzzer and light had gone off and the code 1202 was indicated on the computer display.)
“What’s a 1202?”
“1202, what’s that?”
“12…1202 alarm.”
– Mission control dialog as the LEM descended to the moon during Apollo 11 (Murray and Cox 1990).

 

“I know exactly what it [an alarm] is–it’s because the patient has been, hasn’t taken enough breaths or–I’m not sure exactly why.”
– Physician explaining one alarm on a computerized operating room device that commonly occurred at a particular stage of surgery (Cook et al. 1991).

These quotes are from the excellent paper The Alarm Problem and Directed Attention in Dynamic Fault Management (Woods, 1995).

David Woods writes at great length on the topic and gives great insight into what alerts and alarms essentially are: directed attention. As operators of systems that are beyond our full understanding at any given point and perspective, he shines light on the core of the alarm problem: there is always context sensitivity to alerts, and in many ways the author/designer of the alert hasn’t imagined (and can’t imagine!) how the receiver of the alert will interpret it.

For example: he points to signal detection theory as a framework for thinking about alert/alarm criteria. That is to say, there is always a relationship between true “signal” and “noise”, and the trade-offs inherent in choosing the alerting criteria (sometimes, but not always, viewed as a simple threshold) can be thought of like this:

[Figure: Signal Detection Theory: the relationship between signal, noise, and the alerting criterion]

In other words, there are four possible outcomes, reflecting how sensitive the alerting criteria are:

 

[Figure: The four SDT outcomes: hit, miss, false alarm, and correct rejection]

So this is a tough one, and it points out that getting a good (forget about perfect!) signal-to-noise ratio is hard. Too sensitive, and you’ll get too many false alarms. Not sensitive enough, and you’ll miss something.
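To make that trade-off concrete, here is a small sketch (entirely synthetic data, not any particular monitoring tool) that sweeps a simple threshold over a noisy metric and counts the four outcomes at each setting:

```python
import random

random.seed(42)

# Synthetic metric samples: mostly quiet noise, plus a handful of genuine faults.
quiet  = [random.gauss(50, 10) for _ in range(1000)]   # no fault present
faults = [random.gauss(80, 10) for _ in range(50)]     # fault present

def sdt_outcomes(threshold):
    """Count hits, misses, false alarms, and correct rejections for one threshold."""
    hits        = sum(1 for x in faults if x >= threshold)
    misses      = len(faults) - hits
    false_alarm = sum(1 for x in quiet if x >= threshold)
    correct_rej = len(quiet) - false_alarm
    return hits, misses, false_alarm, correct_rej

for threshold in (60, 70, 80):
    h, m, fa, cr = sdt_outcomes(threshold)
    print(f"threshold={threshold}: hits={h} misses={m} "
          f"false_alarms={fa} correct_rejections={cr}")
```

Lower the threshold and you buy hits at the price of false alarms; raise it and you trade false alarms for misses. There is no setting that eliminates both.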

I’ll say that because of this, we generally err on the side of too many false alarms. For fear of missing something (or the embarrassment of it being known that you missed something going wrong with your systems!) we will crank up the sensitivity.

But in doing so, we essentially ignore the detrimental effect of the false alarms on our engineers and organizations. Underlying the false alarms are not just limitations in the alerting algorithms themselves, but the conditions and factors that the alert systems cannot detect or interpret.

An often-given example of this manifests at the Cincinnati Airport. A riverbank leading up to a particular runway there triggers a threshold in ground proximity warning systems (in-cockpit alerts) because the system can’t detect that it’s going to plateau at the runway. Pilots familiar with this particular runway at this particular airport ignore the alerts.

Once more, with feeling: the pilots, who are flying massive cylinders of metal containing many humans ignore a Ground Proximity Warning alert.

When we talk about how the receiver of an alert will behave, we begin to uncover the context sensitivity of an alert.

How can we take into account how someone might react when they are woken up by an alert we’ve designed? Will they shake their head, wondering what it’s all about? Are we helping them understand what might be going on, or hindering them by including only the bare minimum of data?

What about the engineer who gets an alert in a sea of alerts, while an outage is ongoing? How much attention will they give one amongst a hundred?

Something that might affect our behavior when we get an alert is the amount of trust that we have in the alert: is it telling us something we should believe? Should we drop everything we’re doing in order to pay attention to it? If not, why not?

As an example of this, take the Ground Proximity Warning System I mentioned above. Turns out that in many studies across a number of years, a majority of pilots delay reacting to a GPWS alarm, not just in Cincinnati. Why? Because they take time to validate that the alarm is actually legitimate by looking out the window. This is enough of a problem that the FAA has coined this phenomenon “delayed GPWS response syndrome“.

Trust in automation: it’s a thing that might be worth thinking closely about.

Two Views

“The critical point is that the challenge of fault management lies in sorting through an avalanche of raw data — a data overload problem. This is in contrast to the view that the performance bottleneck is the difficulty of picking up subtle early indications of a fault against the background of a quiescent monitored process.” (Woods, 1995)

The next time you set up an alert in your system, consider how you’re thinking the receiver of that alert will take it. Do you believe that your alert will save the day, providing information for someone to head off catastrophe before it’s too late? Or will it be likely discarded as noise amongst a sea of alerts as someone struggles to understand an outage?

“Information is not a scarce resource, attention is.” – Herb Simon

Herb Simon has mentioned this in many pieces of his writing, as David Woods and Emily Patterson remark in Can We Ever Escape From Data Overload? A Cognitive Systems Diagnosis. Thus far we’ve established that designing alerts is hard, even if we only invest effort in capturing signal, never mind providing context. Woods talks a bit more about directed attention, and about a paradox:

“Note the paradox at the heart of directed attention. Given that the supervisory agent is loaded by various other task related demands, how does one interpret information about the potential need to switch attentional focus without interrupting or interfering with the tasks or lines of reasoning already under attentional control. We can state this paradox in another way: how can one skillfully ignore a signal that should not shift attention within the current context, without first processing it — in which case it hasn’t been ignored.”

So Where Is “Design”?

“It is the expertise of the human operator that makes it possible to adapt the  performance of the joint system, in real time, to unexpected events and disturbances. Every working day, across the whole spectrum of human enterprise, a large number of near-misses are prevented from turning into accidents only because human operators intervene.

The system should therefore be designed so that human adaptation is enhanced.”

(emphasis mine) – Erik Hollnagel, Expertise and Technology: Cognition &  Human-Computer Cooperation, 1995

What if we stopped thinking about alerts and alert design as tasks that reinforce the mental model of a subordinate, or otherwise dumb, messenger delivering news to us?

What if we viewed alerting systems as a partner? What does the world look like if we designed alerting systems to cooperate with us?
If trust in alerting systems is such a big deal, as it is with the GPWS and alert numbness,  what can we learn from how humans learn to trust each other, and let that influence our design decisions?

In other words: how can we design alerts that support our efforts to confirm their legitimacy, or our expectations when an alert will fire? Is context-sensitivity part of this?

This is the type of partnership and thinking that I’m interested in. 🙂

Prevention versus Governance versus Adaptive Capacities

Posted in: Cognitive Systems Engineering, Complex Systems, Human Factors, Systems Safety

The other day I posted about the intersections of Systems Safety and web operations and engineering.

One of the largest proponents of bringing a systems thinking perspective to safety (specifically ‘software safety’) is Dr. Nancy Leveson, who has been in that field (really a multidisciplinary field) for at least a couple of decades. She’s the author of a super book, Engineering a Safer World (free download) that discusses this very concept.

I also mentioned the firming up (still in the public comment period) of REG-SCI, which puts into place as regulation (not just a recommendation or suggestion) the ARP (Automation Review Policy) that public trading markets must comply with.

Without commenting too much on REG-SCI itself (I have opinions on that which I can post about at a later date), I wanted to point to a Technology Roundtable that the SEC held last October, at which Dr. Leveson was invited to speak on the notion and concepts of “safe” software systems. This (presumably) laid some of the groundwork that went into Regulation SCI.

I clipped out her testimony; it’s about 20 minutes long, but very much worth a watch. She touches on a number of topics, but brings plain language to what organizations (both for-profit companies and regulatory groups, like the SEC) can expect with respect to introducing an increasing amount of technology to ‘solve’ stability issues in complex systems:

[Video: Nancy Leveson, SEC Technology Roundtable, 10/2/2012, from jspaw on Vimeo]

Regulation SCI is aimed primarily at national securities and trading exchanges, and the regulation itself is almost 400 pages long. Even if the intention is to prevent calamities such as the Flash Crash, the BATS IPO event, and the Knight Capital incident…is regulation the best (or only) way to make our systems safer?

 

Always a Student: Operations and Systems Safety

Posted in: Complex Systems, Human Factors, Systems Safety

Anyone who has known me well knows that I’m generally not satisfied with skimming the surface of a topic that I feel excited about. So to them, it wouldn’t be a surprise that I’m now working on (yes, while I’m still at Etsy!) a master’s degree.

Since January, I’ve been working with an incredible group as part of the master’s degree program in Human Factors and Systems Safety at Lund University.

This program was initially started by Sidney Dekker, and now is directed by the wicked smart Johan Bergström, whose works I’ve tweeted about before. As a matter of fact, I was able to convince JB to keynote this year’s Velocity Conference in Santa Clara next month on the topic of risk, and I can’t be more excited for it.

So why am I all gaga about this program?

To begin with, I’ve been a huge proponent of learning from other fields of engineering. In particular, how other domains perceive failures, with foresight and in hindsight: how they aim to prevent them, detect them, recover from them, and learn from them.

The Velocity Conference (and Surge, for that matter) is always filled with narratives of success and failure, and for that I’m grateful.

But I think for me it goes deeper than that.

We’re now in a world where the US State Department calls Twitter to request that their database maintenance be postponed because of political events in the Middle East that could benefit from it being available. It’s also all but given at this point that Facebook has had an enormous effect on global discourse on wide-ranging topics, many people pointing to its effects on the Arab Spring.

As we speak, REG-SCI is open for public comment from the SEC. Inside that piece of work is an attempt to shore up safeguards and preventative measures that exchanges may have to employ to make themselves less vulnerable to the perturbations and disturbances that can produce the next large-scale trading surprises, like those that came with the Flash Crash, the BATS IPO event, and the Knight Capital incident.

And yes, here at Etsy we have been re-imagining how commerce is being done on a global scale as well. 🙂

  • How do we design our systems to be resilient? Are the traditional approaches still working? How will we know when they stop working?
  • How can we view the “systems” in that sentence to include the socio-technical relationship that organizations have to their service? Their employees? Their investors? The public?
  • How does the political, regulatory, or commercial environment that our services expect to live in affect their operation? Do they influence the ‘safety’ of those systems?
  • How do we manage the inevitable trade-offs that we experience when we move from a startup with a “Minimum Viable Product” to a globally-relied-upon service that is expected to always be on?
  • What are the various ways we can anticipate, monitor, respond to, and learn from our failures and our successes? 

All of these questions could be seen as technical in nature, but I’d argue that’s too simplistic. I’m interested in that beautiful and insane boundary between humans and machines, and how that relationship is intertwined in the increasingly complex systems we build over and over again.

My classmates in the program are from the US Forestry Service, air traffic control training facilities and towers, Australian mining safety, maritime accident investigation firms, healthcare and some airline pilots as well. They all have worked in high-tempo, high-consequence environments, and I’m finding even more overlap in thinking with them than I ever thought I would.

The notion that the web and internet infrastructures of tomorrow are heavily influenced by the failures of yesterday riddles me with concern and curiosity. Given that I’m now addicted to the critical thinking that many smart folks have been giving the topic for a couple of decades, I figured I’m not going to half-ass it; I’m going to lean into it as hard as I can.

So expect more writing on the topics of complex systems, human factors, systems safety, Just Culture, and related meanderings, because next year I’ve got a thesis to write. 🙂

Availability: Nuance As A Service

Posted in: Complex Systems, Random, Resilience, WebOps

Something has struck me as funny recently about the traditional notion of availability of web applications. With respect to its relationship to revenue, to infrastructure and application behavior, and to fault protection and tolerance, I’m thinking it may be time for a broader adjustment of the industry’s perception of the topic.

These nuances in the definition and effects of availability aren’t groundbreaking. They’ve been spoken about before, but for some reason I’m not yet convinced that they’re widely known or understood.

Impact On Business

What is laid out in this article is something that’s been parroted for decades: downtime costs companies money and value. Generally speaking, this is obviously correct, and by all means you should strive to design and operate your site with high availability and fault tolerance in mind.

But underneath the binary idea that uptime = good and downtime = bad, the reality is that there’s a lot more detail that deserves exploring.

This irritatingly-designed site has a post about a common equation to help those who are arithmetically challenged:

LOST REVENUE = (GR/TH) x I x H
GR = gross yearly revenue
TH = total yearly business hours
I = percentage impact
H = number of hours of outage
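Written as code, the blunt formula is a one-liner; the numbers below are made up purely for illustration:

```python
def lost_revenue(gross_yearly_revenue, total_yearly_business_hours,
                 percentage_impact, outage_hours):
    """The blunt formula: (GR / TH) x I x H."""
    return (gross_yearly_revenue / total_yearly_business_hours
            ) * percentage_impact * outage_hours

# Made-up example: $50M/year, 24x365 operation, 100% impact, a 30-minute outage.
print(lost_revenue(50_000_000, 24 * 365, 1.0, 0.5))  # ~2854 dollars
```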

In my mind, this is an unnecessarily blunt measure. I see the intention behind this approach, even though it’s not meant to be anywhere close to accurate. But modern web operations is now a field where gathering metrics in the hundreds of thousands per second is becoming more commonplace, fault tolerance and protection are things we do increasingly well, and graceful degradation techniques are the norm.

In other words: there are a lot more considerations than outage minutes = lost revenue, even if you did have a decent way to calculate it (which, you don’t). Companies selling monitoring and provisioning services will want you to subscribe to this notion.

We can do better than this blunt measure, and I thought it’s worth digging in a bit deeper.

“Loss”

Thought experiment: if Amazon.com has a full and global outage for 30 minutes, how much revenue did it “lose”? Using the above rough equation, you can certainly come up with a number, let’s say N million dollars. But how accurate is N, really? Discussions that surround revenue loss are normally designed to motivate organizations to invest in availability efforts, so N only needs to be big and scary enough to provide that motivation. So let’s just say that goal has been achieved: you’re convinced! Availability is important, and you’re a firm believer that You Own Your Own Availability.

Outside of the “let this big number N convince you to invest in availability efforts” I have some questions that surround N:

  • How many potential customers did Amazon.com lose forever, during that outage? Meaning: they tried to get to Amazon.com, with some nonzero intent/probability of buying something, found it to be offline, and will never return there again, for reasons of impatience, loss of confidence, the fact that it was an impulse-to-buy click whose time has passed, etc.
  • How much revenue did Amazon lose during that 30-minute window, versus how much revenue it simply postponed while it was down, to be captured later? In other words: upon finding the site down, users will return sometime later to do what they originally intended, which may or may not include buying something or participating in some other valuable activity.
  • How much did that 30 minutes of downtime affect the strength of the Amazon brand, in a way that could be viewed as revenue-affecting? Meaning: are users and potential users now swayed to having less confidence in Amazon because they came to the site only to be disappointed that it’s down, enough to consider alternatives the next time they would attempt to go to the site in the future?

I don’t know the answers to these questions about Amazon, but I do know that at Etsy, those answers depend on some variables:

  • the type of outage or degradation (more on that in a minute),
  • the time of day/week/year
  • how we actually calculate/forecast how those metrics would have behaved during the outage

So, let’s crack those open a bit, and see what might be inside…

Temporal Concerns

Not all time periods can be considered equal when it comes to availability, and the idea of lost revenue. For commerce sites (or really any site whose usage varies with some seasonality) this is hopefully glaringly obvious. In other words:

X minutes of full downtime during the peak hour of the peak day of the year can be worlds apart from Y minutes of full downtime during the lowest hour of the lowest day of the year, traffic-wise.

Take for example a full outage that happens during the peak day of the year, and contrast it with one that happens during a lower-traffic period of the year. Let’s say that this graph of purchases covers those two 24-hour periods, indicating when the outages happen:

[Figure: A Tale of Two Outages: purchase volume over two 24-hour periods, with the outage windows marked]

The impact time of the outage during the lower-traffic day is actually longer than on the peak day, affecting the precious Nines math by a decent margin. But yet: which outage would you rather have, if you had to have one of them? 🙂
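One way to get past raw outage minutes is to weight each minute of downtime by the activity you expected in that minute. A minimal sketch, with invented traffic profiles:

```python
def weighted_impact(outage_minutes, expected_per_minute):
    """Sum the expected activity that fell inside the outage window,
    rather than just counting the minutes themselves."""
    return sum(expected_per_minute[m] for m in outage_minutes)

# Invented per-minute purchase forecasts for two 1440-minute days.
peak_day = [100] * 1440   # peak day of the year: heavy traffic
slow_day = [10] * 1440    # quiet day: light traffic

# 45 minutes down on the slow day vs. 30 minutes down on the peak day.
print(weighted_impact(range(600, 645), slow_day))   # 450 expected purchases affected
print(weighted_impact(range(600, 630), peak_day))   # 3000 expected purchases affected
```

The longer outage on the quiet day still touches far less expected activity than the shorter outage on the peak day, which is exactly what the raw Nines math hides.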

Another temporal concern: how the impact is distributed across time matters. Taken to the extreme, any level of degradation looks like perfect uptime as the length of each individual outage approaches zero.

Dig, if you will, these two outage profiles, across a 24-hour period. The first one has many small outages across the day:

[Figure: An outage profile with many small outages spread across the day]

and the other has the same amount of impact time, in a single go:

[Figure: An outage profile with the same total impact time, in a single contiguous outage]

So here we have the same total amount of impact time, but spread out across the day. Hopefully folks will think a bit beyond the obvious “they’re both bad! don’t have outages!” and investigate how the two could differ. Some considerations in this simplified example:

  • Hour of day. Note that the single large outage is “earlier” in the day. Maybe this will affect EU or other non-US users more broadly, depending on the timezone of the original graph. Do EU users have a different expectation or tolerance for outages in a US-based company’s website?
  • Which outage scenario has a greater effect on the user population? If the ‘normal’ behavior is to “get in, buy your thing, and get out” quickly, I could see the many-small-outages profile being preferable to the single large one. If the status quo is some mix of searching, browsing, favoriting/sharing, and then purchasing, I could see the single constrained outage being preferable.

Regardless, this underscores the idea that not all outages are created equal with respect to impact timing.

Performance

Loss of “availability” can also be seen as an extreme loss of performance. Past a particular threshold, and depending on the type of feedback to the user (a fast-failed 404 or browser error, versus a hanging white page and a spinning “loading…”), the severity of an event being slow can effectively be the same as a full outage.

Some concerns/thought exercises around this:

  • Where is this latency threshold for your site, for the functionality that is critical for the business?
  • Is this threshold a cliff, or is it a continuous/predictable relationship between performance and abandonment?

There’s been much more work on performance’s effects on revenue than on availability’s. The Velocity Conference in 2009 brought the first real production-scale numbers (in the form of a Bing/Google joint presentation, as well as Shopzilla and Mozilla talks) behind how performance affects businesses, and if you haven’t read about it, please do.
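As a thought-experiment sketch of treating extreme slowness as unavailability: classify any request that crosses an assumed abandonment threshold as “effectively down” and fold that into the availability number. The latencies and threshold below are invented:

```python
# Invented request latencies (milliseconds) for a business-critical endpoint.
latencies_ms = [120, 150, 9000, 180, 12000, 160, 140, 15000, 130, 170]

ABANDONMENT_THRESHOLD_MS = 8000  # assumed point at which users give up

def effective_availability(latencies, threshold_ms):
    """Fraction of requests served fast enough to count as 'up'."""
    ok = sum(1 for ms in latencies if ms < threshold_ms)
    return ok / len(latencies)

print(f"{effective_availability(latencies_ms, ABANDONMENT_THRESHOLD_MS):.0%}")  # 70%
```

Whether that threshold is a hard cliff or a gradual slope is exactly the question posed above.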

Graceful Degradation

Will Amazon (or Etsy) lose sales if all or a portion of its functionality is gone (or sufficiently slow) for a period of time? Almost certainly. But that question is somewhat boring without further detail.

In many cases, modern web sites don’t simply live in an “everything works perfectly” or “nothing works at all” boolean world. (To be sure, neither does the Internet as a whole.) Instead, fault-tolerance and resilience approaches allow features and operations to degrade under a spectrum of failure conditions. Many companies build their applications to have both in-flight fault tolerance that degrades the experience in the face of singular failures, and “feature flags” (Martin and Jez call them “feature toggles“) which allow specific features to be shut off if they’re causing problems.
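A minimal sketch of the feature-flag idea; the flag store and feature names here are hypothetical, not any specific framework:

```python
# Hypothetical in-memory flag store. Real systems back this with configuration
# that can be flipped at runtime, without a deploy.
FEATURE_FLAGS = {
    "favoriting": False,   # flipped off because the feature is misbehaving
    "checkout": True,
}

def feature_enabled(name):
    return FEATURE_FLAGS.get(name, False)

def render_listing(listing_id):
    page = {"listing": listing_id, "buy_button": feature_enabled("checkout")}
    # Degrade gracefully: hide the broken feature instead of failing the whole page.
    page["favorite_button"] = feature_enabled("favoriting")
    return page

print(render_listing(42))  # checkout still works even with favoriting turned off
```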

I’m hoping that most organizations are familiar with this approach at this point. Just because user registration is broken at the moment, you don’t want to prevent  already logged-in users from using the otherwise healthy site, do you? 🙂

But these graceful degradation approaches further complicate the notion of availability, as well as its impact on the business as a whole.

For example: if Etsy’s favoriting feature is not working (because the site’s architecture allows it to fail gracefully without affecting other critical functionality), but checkout is working fine…what is the result? Certainly you might pause before marking down your blunt Nines record.

You might also think: “so what? as long as people can buy things, then favoriting listings on the site shouldn’t be considered in scope of availability.”

But consider these possibilities:

  • What if Favoriting listings was a significant driver of conversions?
  • If Favoriting was a behavior that led to conversions at a rate of X%, what value should X be before ‘availability’ ought to be influenced by such a degradation?
  • What if Favoriting was technically working, but was severely degraded (see above) in performance?

Availability can be a useful metric, but when abused as a silver bullet to inform or even dictate architectural, business priority, and product decisions, there’s a real danger of oversimplifying what are really nuanced concerns.

Bounce-Back and Postponement

As I mentioned above, for sites that have an established community or brand, outages (even full ones) don’t mark an instantaneous amount of ‘lost’ revenue or activity. Some nonzero amount of it is simply postponed. This is the area that I think could use a lot more data and research in the industry, much in the same way that the latency/conversion relationship has been investigated.

The over-simplified scenario looks something like this. Instead of the blunt math of “X minutes of downtime = Y dollars of lost revenue”, we can be a bit more accurate if we try just a bit harder. The red is the outage:

[Figure: Forecast versus actual purchases, with the outage window shown in red]

 

So we have some more detail: if we can make a reasonable forecast of what purchases would have done during the time of the outage, then we can make a better-informed estimate of the purchases “lost” during that period.
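As a rough sketch of that kind of better-informed estimate (the numbers are invented, and the forecast here is just a hard-coded list rather than a real forecasting method):

```python
# Invented per-interval purchase counts around an outage window.
forecast = [100, 105, 110, 108, 112, 109, 107]   # what we expected purchases to do
actual   = [ 98, 104,   0,   0,   5, 106, 108]   # the outage spans intervals 2-4

outage_intervals = [2, 3, 4]
estimated_loss = sum(forecast[i] - actual[i] for i in outage_intervals)
print(f"forecast-based 'lost' purchases: {estimated_loss}")   # 325
```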

But is that actually the case?

What we see at Etsy is something different, a bit more like this:

[Figure: Purchases during and after an outage, showing a bounce-back in activity once the site recovers]

Clearly this is an oversimplification, but I think the general behavior comes across. When a site comes back from a full outage, there is an increase in activity as users who were stalled or paused in their behavior by the outage resume. My assumption is that many organizations see this behavior, but it’s just not talked about publicly.
What needs more real-world data is support for (or refutation of) the hypothesis that, depending on:
  • Position of the outage in the daily traffic profile (start-end)
  • Position of the outage in the yearly season

the bounce-back volume will vary in a reasonably predictable fashion. Namely, as the length of the outage grows, the amount of bounce-back volume shrinks:

[Figure: Hypothesized relationship: bounce-back volume shrinks as the length of the outage grows]

What this line of thinking doesn’t capture is how many of those users postponed their activity not until immediately after the outage, but until the next day, because they needed to leave their computer for a meeting at work or to commute home.

Intention isn’t entirely straightforward to figure out, but in the cases where you have a ‘fail-over’ page that many CDNs will provide when the origin servers aren’t available, you can get some more detail about what requests (add to cart? submit payment?) came in during that time.

Regardless, availability and its effect on business metrics isn’t as simple as service providers and monitoring-as-a-service companies would have you believe. To be sure, a good amount of this investigation will vary wildly from company to company, but I think it’s well worth taking a look into.

 

On Being A Senior Engineer

Posted in: Culture, Etsy, Human Factors, Random, WebOps

UPDATE: I’ve added a short section on the topic of sponsorship.

I think that there’s a lot of institutional knowledge in our field, especially about what makes for a productive engineer. But while there are a good deal of books in the management field about “expert” roles and responsibilities of non-technical individual contributors, I don’t see too many modern books or posts that might shed light directly on what makes for a good senior engineer. One notable exception is of course Kate Matsudaira, who has been posting quite a good deal recently about the cultural sides of engineering.

Yet at the same time, a good lot of successful engineers whom I have known all remember the mentor who taught them what it meant to be “senior”.

I do, however, agree 100% with my friend Theo’s words about being “senior” in his chapter of the Web Operations book by O’Reilly:

Generation X (and even more so generation Y) are cultures of immediate gratification. I’ve worked with a staggering number of engineers that expect the “career path” to take them to the highest ranks of the engineering group inside 5 years just because they are smart. This is simply impossible in the staggering numbers I’ve witnessed. Not everyone can be senior. If, after five years, you are senior, are you at the peak of your game? After five more years will you not have accrued more invaluable experience? What then? “Super engineer”? Five more years? “Super-duper engineer.” I blame the youth of our discipline for this affliction. The truth is that there are very few engineers that have been in the field of web operations for fifteen years. Given the dynamics of our industry many elected to move on to managerial positions or risk an entrepreneurial run at things.

He’s right: this field of web operations is still quite young. So we can’t be surprised when people who have a title of ‘senior’ exhibit unsurprisingly immature behavior, both technical and non-technical. If you haven’t read Theo’s chapter, I suggest you do.

Having said that, what does it actually mean to be ‘senior’ in this discipline? I certainly have an opinion of what it means, given that I’m charged with hiring, supporting, and retaining engineers who are deemed to be senior. This notion that there is a bar to be passed in terms of career development is a good one, but I’d also add that these criteria exist on a spectrum, as opposed to a simple list of check-boxes. You don’t wake up one day and become “senior” just because your title says so upon a promotion. Senior engineers don’t know everything. They’re not perfect in their technical knowledge, and they’re OK with that.

In order not to confuse titles with expectations that are fuzzy, sometimes I’ll refer to engineering maturity.

Meaning: I expect a “senior” engineer to be a mature engineer.

I’m going to gloss over the part where one could simply list the technical areas in which a mature engineer should have some level of mastery or understanding (such as “Networking”, “Filesystems”, “Algorithms”, etc.) and instead highlight the personal characteristics that in my mind give me indication that someone can influence an organization or a business positively in the domain of engineering.

Over on Quora, someone once asked me “What are the attributes (other than technical ability/experience) that makes a great VP of Technical Operations?”. The list of attributes that I mentioned in the answer came with the understanding that they are perpetual aspirations of my own. This post is similar to that answer.

I might first argue that senior engineers in web development and operations have the same characteristics as senior engineers in other fields of engineering (mechanical, electrical, chemical, etc.) in which case The Unwritten Laws of Engineering are applicable. Again, if you haven’t read this, please go do so. It was originally written in 1944, published by the American Society of Mechanical Engineers. A good excerpt from the book is here.

While the book’s structure and prose still has a dated feel (“…refrain from using profanity in the workplace…” or “…men should pay particular attention to shaving habits and the trimming of beards and mustaches…”), it gives a good outline of the non-technical expectations, responsibilities, and inner workings of an engineering organization with respect to how both managers and mature engineers might behave.

Obligatory Pithy Characteristics of Mature Engineers

All posts that attempt to give insight to aspirational characteristics must have an over-abundance of bullet points, and the field of engineering has a fair share of them. Therefore, I’m going to give you some, some mine and some pulled from various sources, many from the Unwritten Laws mentioned above.

Mature engineers seek out constructive criticism of their designs.

Every successful engineer I’ve met, upon finishing up a design or getting ready for a project, will continually ask their peers questions along the lines of:

  • “What could I be missing?”
  • “How will this not work?”
  • “Will you please shoot as many holes as possible into my thinking on this?”
  • “Even if it’s technically sound, is it understandable enough for the rest of the organization to operate, troubleshoot, and extend it?”

This is because they know that nothing they make will ever only be in their hands, and that good peer review is what makes better design decisions. As it’s been said elsewhere, they “beg for the bad news.”

Mature engineers understand the non-technical areas of how they are perceived.

Being able to write a Bloom Filter in Erlang, or write multi-threaded C in your sleep is insufficient. None of that matters if no one wants to work with you. Mature engineers know that no matter how complete, elegant, or superior their designs are, it won’t matter if no one wants to work alongside them because they are assholes. Condescension, belittling, narcissism, and ego-boosting behavior send the message to other engineers (maybe tacitly) to stay away. Part of being happy in engineering comes from enjoying the company of the people you work with while designing and building things. An engineer who is quick to call someone a moron is someone destined to stunt his or her career.

This also means that mature engineers have self-awareness when it comes to their communication. This isn’t to say that every mature engineer communicates perfectly, only that they have some notion about where they could be better, and continually ask for a gut-check from peers and managers on how they’re doing. They aim to be assertive, not passive or aggressive in how they get their ideas across.

I’ve mentioned it elsewhere, but I must emphasize the point more: the degree to which other people want to work with you is a direct indication of how successful you’ll be in your career as an engineer. Be the engineer that everyone wants to work with.

Now this isn’t to say that you should shy away from giving (or getting) constructive criticism on the work produced by engineering (as opposed to the engineer personally), for fear of pissing someone off. There’s a difference between calling someone a moron and pointing out faults in their code or product. In a conversation with Theo, he pointed out another possible area where our field may grow up:

We as an industry need to (of course) refrain from critiques of human character and condition, but not shy away from critiques of work product. We need to get tougher skin and be able to receive critique through a lens that attempts to eliminate personal focus. There will be assholes, they should be shunned. But the attitude that someone’s code is their baby should come to an end. Code doesn’t have feelings, doesn’t develop complexes and certainly doesn’t exhibit the most important trait (the ability to reproduce) of that which carries for your genetic strains.

See also below #2 and #10 in The Ten Commandments of Egoless Programming.

I think this has a corollary from the Unwritten Laws (emphasis mine):

Be careful about whom you mark for copies of letters, memos, etc., when the interests of other departments are involved. A lot of mischief has been caused by young people broadcasting memorandum containing damaging or embarrassing statements.

Of course it is sometimes difficult for a novice to recognize the “dynamite” in such a document but, in general, it is apt to cause trouble if it steps too heavily upon someone’s toes or reveals a serious shortcoming on anybody’s part. If it has wide distribution or if it concerns manufacturing or customer difficulties, you’d better get the boss to approve it before it goes out unless you’re very sure of your ground.

This, of course, underscores the dated feel of the book, but in the modern era, I still believe the main point to be true. Nothing indicates that you have a lack of perspective and awareness like a poorly thought out and nonconstructive tweet that slings venomous insults. It’s a junior engineer mistake to toss insults about a piece of complex technology in 140 characters.

I certainly (much like Christopher Brown mentioned in his keynote at Velocity London) pay attention to those sorts of public remarks when I come across them so that I can note who I would reconsider hiring if they ever applied to work at Etsy.

Mature engineers do not shy away from making estimates and are always trying to get better at it.

From the Unwritten Laws:

Promises, schedules, and estimates are necessary and important instruments in a well-ordered business. Many engineers fail to realize this, or habitually try to dodge the irksome responsibility for making commitments. You must make promises based upon your own estimates for the part of the job for which you are responsible, together with estimates obtained from contributing departments for their parts. No one should be allowed to avoid the issue by the old formula, “I can’t give a promise because it depends upon so many uncertain factors.”

Avoiding responsibility for estimates is another way of saying, “I’m not ready to be relied upon for building critical pieces of infrastructure.” All businesses rely on estimates, and all engineers working on a project are involved in Joint Activity, which means that they have a responsibility to others to make themselves interpredictable. In general, mature engineers are comfortable with working within some nonzero amount of uncertainty and risk.

Mature engineers have an innate sense of anticipation, even if they don’t know they do.

This code looks good, I’m proud of myself. I’ve asked other people to review it, and I’ve taken their feedback. Now: how long will it last before it’s rewritten? Once it’s in production, how will its execution affect resource usage? How much do I expect CPU/memory/disk/network to increase or decrease? Will others be able to understand this code? Am I making it as easy as I can for others to extend or introspect this work?

Mature engineers understand that not all of their projects are filled with rockstar-on-stage work.

However menial and trivial your early assignments may appear, give them your best effort.

Getting things done means doing things you might not be interested in. No matter how exciting or appealing a project is, there are always boring tasks. Tedious tasks. Tasks that a less mature engineer may deem beneath their dignity or their job title. My good friend Kellan Elliot-McCrea (Etsy’s CTO) had this to say about it:

Sometimes the saving grace of a tedious task is their simplicity and maturity manifests in finishing them quickly and moving on. Sometimes tasks are tedious because they require extreme discipline and malleable attention span. It’s an odd phenomena that the most tedious tasks, only to be carried out by the most senior engineers, can also be the most terrifying.

Mature engineers lift the skills and expertise of those around them.

They recognize that at some point, their individual contribution and potential cannot be exercised singularly. They recognize that there is only so much that can be produced by a single person, and the world’s best engineering feats are executed by teams, not singularly brilliant and lone engineers. Tom Limoncelli makes this point quite well in his post.

At Etsy we call this a “generosity of spirit.” Generosity of spirit is one of our core engineering values, but also a primary responsibility of our Staff Engineer position, a career-level position. These engineers spend the time to make sure that more junior or new engineers unfamiliar with the tech or processes we have not only understand what they are doing, but also why they are doing it. “Teaching to fish” is a mandatory skill at this level, and that requires having both patience and a perspective of investment in the rest of the organization.

Therefore instead of: “OK, move over, lemme just do it for you”, it’s instead: “Ok, let’s work on this together. I can show you how I’m writing/troubleshooting/etc. Then, you do it so I can be sure you know why/how we’re doing it this way, etc.”

Related: see below about getting credit.

Mature engineers understand the difference between mentorship and sponsorship, and develop a habit of the latter.

Engineers who find that the visibility of their own work is increasing acknowledge that a fundamental part of wielding influence in their local community (both inside and outside their organization) is developing and maintaining awareness of opportunities to sponsor those around them who would benefit. It is not a secret that the tech industry is seriously challenged when it comes to supporting underrepresented and/or marginalized groups.

Developing this as a habit takes effort, but the benefits are multi-fold; the engineer sharpens their critical-thinking skills (“oh, what we’re talking about in this meeting would be a great opportunity for $NAME to work on…”) and the sponsored engineer has opportunities that they otherwise might not.

Lara Hogan’s article on this is a must-read on this topic:

When privileged people begin to see the systems of bias and privilege, their first instinct typically is to mentor those who haven’t benefited from the same privilege. This is understandable—they want to help those who are marginalized grow, get promoted, or become better engineers, to help balance out the inequity that pervades our industry.

But at its core, this instinct to mentor plays into the idea that those who are marginalized aren’t already skilled enough, smart enough, or ready for more responsibility or leadership.

What members of underrepresented groups in tech often need most is opportunity and visibility, not advice. They have to work extremely hard and be extremely good at what they do to combat the systemic privilege and unconscious bias at play in our work environments. They are consistently under-promoted and under-compensated for this work, even though it’s excellent work.

Lara goes on to give examples to demonstrate what this can look like:

These are real-life examples of sponsorship that I’ve seen work:

  • suggesting someone who could be a good lead on a new project based on their experience in this codebase, solving these kinds of problems, or past demonstration of effectiveness getting work out the door on time
  • suggesting someone be a postmortem facilitator, or another type of visible leader in a meeting where others are learning
  • suggesting someone who could write a new blog post for the engineering blog about their recent project, approach to a tricky problem, or solution that other companies could learn from
  • suggesting someone to give a talk at a company or team meeting in which they demonstrate their work
  • forwarding their email summary of a project to a different group of people than the original audience, underscoring why it was interesting or what you learned from it
  • asking someone’s manager if you can share feedback about some of their excellent work you witnessed
  • mentioning or sharing someone’s work in Slack that you thought was helpful, interesting, etc.
  • citing an interesting thing you learned from someone recently to a large group of influential folks

Mature engineers make their trade-offs explicit when making judgments and decisions.

They realize all engineering decisions, implementations, and designs exist within a spectrum; we do not live in a binary world. They can quickly point out contexts where one successful approach or solution could work and where it could not. They know that one cannot be both efficient and thorough at the same time (The ETTO Principle), that most projects engineers work on exist on an axis of optimality and brittleness, and that the problems they are solving can be acute or chronic.

They know that they work within a spectrum of ideal and non-ideal, and are OK with that. They are comfortable with it because they strive to make the ideal and non-ideal in a design explicit. Later on in the lifecycle of a design, when the original design is not scaling anymore or needs to be replaced or rewritten, they can look back not with a perspective of how short-sighted those earlier decisions were, but instead say “yep, we made it this far with it and knew we’d have to extend or change it at some point. Looks like that time is now, let’s get to work!” rather than responding with a cranky-pants, passive-aggressive, Hindsight-Bias-filled remark full of counterfactuals (e.g., “those idiots didn’t do it right the first time!”, “they cut corners!”, “I TOLD them this wouldn’t work!”)

Many pithy quotes exist that shine light on this notion of trade-offs, and mature engineers know that there are limits to any philosophy-laden quotes (including the ones I’m writing here):

  • “Premature optimization is the root of all evil.” – a very abused maxim, and I’ve written about it before. A corollary to that might be (taken from here) ‘Understanding what is and isn’t “premature” is what separates senior engineers from junior engineers.’
  • “Right tool for the job” – another abused one. The intention here is reasonable: who wants to use a tool that isn’t appropriate? But a rare perspective is that this can be detrimental when taken to the extreme. A carpenter doesn’t arm himself with every variation and size of hammer that is available, even though he may encounter hammering tasks that could be ideally handled by each one. Why? Because lugging around (and maintaining) a gazillion hammers incurs a cost. As such, decisions on this axis have trade-offs.

The tl;dr on trade-offs is that everyone cuts corners, in every project. Immature engineers discover them in hindsight, disgusted. Mature engineers spell them out at the outset of a project, accept them, and recognize them as part of good engineering.

(Related: Your Code May Be Elegant, But Mine Fucking Works)

Mature engineers don’t practice CYAE (“Cover Your Ass Engineering”)

The scenario where someone will stand on ceremony as an excuse for not attempting to understand how his or her code (or infrastructure) could be touched by other parts of the system or business is a losing proposition. Covering your ass sends the implicit message that you are someone willing to throw others (on your team? in your company? in your community?) under the proverbial bus at the mere hint that your work had any flaw. Mature engineers stand up and accept the responsibility given to them. If they find they don’t have the requisite authority to be held accountable for their work, they seek out ways to rectify that.

An example of CYAE is “It’s not my fault. They broke it, they used it wrong. I built it to spec, I can’t be held responsible for their mistakes or improper specification.”

Mature engineers are empathetic.

In complex projects, there are usually a number of stakeholders. In any project, the designers, product managers, operations engineers, developers, and business development folks all have goals and perspectives, and mature engineers realize that those goals and views may be different. They understand this so that they can navigate effectively in the work that they do. Being empathetic in this sense means having the ability to view the project from another person’s perspective and to take that into consideration in your own work.

Goal conflicts are inherent in all engineering work, and complaining about them (instead of embracing them as requirements for success) is a sign of a less mature engineer.

They don’t make empty complaints.

Instead, they express judgments based on empirical evidence and bring with those judgments options for solving the problem which they’ve identified. A great manager of mine said to never go to your boss with a complaint about anything without at least one (ideally more than one) suggestion for a solution. Even demonstrating that you’ve tried working the problem on your own and came up empty-handed is better than an empty complaint.

Mature engineers are aware of cognitive biases

This isn’t to say that every mature engineer needs to have a degree in psychology, but cognitive biases are what can limit the growth of an engineer’s career at a certain point. Even if they’re not aware of the details of how they appear or how these biases can be guarded against, most mature engineers I know have a level of self-awareness to at least recognize they (like everyone) are susceptible to them.

Culturally, engineers work day-to-day with empirical evidence and research. Basically: show me the data. The issue with cognitive biases is that we can be blissfully unaware of when we are interpreting data with our own brains in ways that defy empirical evidence, and that can have a surprising effect on how we get work done and work on teams.

A great list of them exists on Wikipedia, but some of the ones that I’ve seen engineers (including myself) fall prey to are:

  • Self-Serving Bias – basically: if something is good, it’s probably because of something I did or thought of. If it’s bad, it’s probably the doing of someone else.
  • Fundamental Attribution Error – basically: the bad results that someone else got from his work must have something to do with how he is, personally (stupid, clumsy, sloppy, etc.) whereas if I get bad results, it’s because of the context that I was in, the pressure I was under, the situation I was in, etc.
  • Hindsight Bias – (it is said that this is the most-studied phenomenon in the history of modern psychology) basically: after an untoward or negative event (a severe bug, an outage, etc.) “I knew it all along!”. It is the very strong tendency to view the past more simply than it was in reality. You can tell there is Hindsight Bias going on when descriptions involve counterfactuals, or “…they should have…”, or “…how did they not see that it’s so obvious!”.
  • Outcome Bias – like above, this comes up after a surprising or negative event. If the event was very damaging, expensive to clean up, or severe, then the decisions or actions that contributed to that event are judged to be very stupid, reckless, or negligent. The judgment is proportional to how severe the event was.
  • Planning Fallacy – (related to the point about making estimates under uncertainty, above) basically: being more optimistic about forecasting the time a particular project will take.

There are plenty of others, all of which I find personally fascinating; I can get lost in learning more about them. Highly suggested reading, if you’re at all interested in learning about how you might be limiting your own effectiveness.

The Ten Commandments of Egoless Programming

Appropriate, even if old…I’ve seen it referenced as coming from The Psychology of Computer Programming, written in 1971, but I don’t actually see it in the text. Regardless, here are The Ten Commandments of Egoless Programming, found on @wyattdanger‘s blog post on receiving advice from his dad:

  1. Understand and accept that you will make mistakes. The point is to find them early before they make it into production. Fortunately, except for the few of us developing rocket guidance software at JPL, mistakes are rarely fatal in our industry. We can, and should, learn, laugh, and move on.
  2. You are not your code. Remember that the entire point of a review is to find problems, and problems will be found. Don’t take it personally when one is uncovered. (Allspaw note – related: see #10 below, and the points Theo made above.)
  3. No matter how much “karate” you know, someone else will always know more. Such an individual can teach you some new moves if you ask. Seek and accept input from others, especially when you think it’s not needed.
  4. Don’t rewrite code without consultation. There’s a fine line between “fixing code” and “rewriting code.” Know the difference, and pursue stylistic changes within the framework of a code review, not as a lone enforcer.
  5. Treat people who know less than you with respect, deference, and patience. Non-technical people who deal with developers on a regular basis almost universally hold the opinion that we are prima donnas at best and crybabies at worst. Don’t reinforce this stereotype with anger and impatience.
  6. The only constant in the world is change. Be open to it and accept it with a smile. Look at each change to your requirements, platform, or tool as a new challenge, rather than some serious inconvenience to be fought.
  7. The only true authority stems from knowledge, not from position. Knowledge engenders authority, and authority engenders respect – so if you want respect in an egoless environment, cultivate knowledge.
  8. Fight for what you believe, but gracefully accept defeat. Understand that sometimes your ideas will be overruled. Even if you are right, don’t take revenge or say “I told you so.” Never make your dearly departed idea a martyr or rallying cry.
  9. Don’t be “the coder in the corner.” Don’t be the person in the dark office emerging only for soda. The coder in the corner is out of sight, out of touch, and out of control. This person has no voice in an open, collaborative environment. Get involved in conversations, and be a participant in your office community.
  10. Critique code instead of people – be kind to the coder, not to the code. As much as possible, make all of your comments positive and oriented to improving the code. Relate comments to local standards, program specs, increased performance, etc.

Novices versus Experts

Now I generally don’t follow knowledge acquisition as a research topic too closely, but I do believe it’s hard to get away from when talking about the evolving nature of a discipline. One bit of interesting breakdown comes from a paper by Dreyfus and Dreyfus called “A Five-Stage Model of the Mental Activities Involved in Directed Skill Acquisition”, which lays out characteristics of various levels of expertise:

Novice
  • Rigid adherence to rules or plans
  • Little situational perception
  • No (or limited) discretionary judgment
Advanced Beginner
  • Guidelines for action based on attributes and aspects, which are all equal and separate
  • Limited situational perception
Competent
  • Conscious deliberate planning
  • Standardized and routine procedures
Proficient
  • Sees situations holistically rather than as aspects
  • Perceives deviations from normal patterns
  • Uses maxims for guidance, whose meanings are contextual
Expert
  • No longer relies on rules, guidelines or maxims
  • Intuitive grasp of situations
  • Analytic approach used only in novel situations

 

The paper goes on to state:

Novices operate from an explicit rules- and knowledge-based perspective. They are deliberate and analytical, and therefore slower to take action, they decide or choose.

(which means that novices are deeply subject to local rationality)

Experts operate from a mature, holistic well-tried understanding, intuitively and without conscious deliberation. This is a function of experience. They do not see problems as one thing and solutions as another, they act.

(which means that experts are context driven)

I don’t necessarily subscribe to the notion of such dry lines being drawn between skill levels, because I think there is a lot more granularity to expertise, and many more facets of it, than those outlined above, but I think it’s helpful to be aware of these (unfortunately over-simplified) categories.

Dirty secret: mature engineers know the importance of (sometimes irrational) feelings people have. (gasp!)

How people feel about technologies, technical decisions, and technical directions is just as important (if not more) than the facts about the details. Mature engineers know this and adjust accordingly. Again, being empathetic can help you understand how another person on your team feels about a technical decision, even if they themselves don’t have an easy time articulating why they feel that way.

People’s confidence in software, architectures, or patterns is heavily influenced by past experience and can result in positive or negative reactions to using them. Used to work at a mod_perl shop that had a lot of mystifying outages? Then you can’t be surprised to feel reluctant to use it in a different company, even if the supporting expertise and use cases are entirely different. All you remember is that mod_perl = major headaches, so you’re going to be wary of using it in any context again.

Mature engineers understand this phenomenon when making a case to use technology that carries baggage, even if it’s irrational. Convincing a group to use tools and patterns that they aren’t comfortable with isn’t a straightforward task. The “right tool for the job” maxim also has (sometimes unquantifiable) comfortability as a parameter.

For an illustration of how people’s emotions drive technical decisions and opinions, read any flame war about anything, ever.

“It is amazing what you can accomplish if you do not care who gets credit.”

This quote is commonly attributed to Harry S. Truman, but it looks like it might have first been said by a Jesuit priest in a different form. In any case, this is another indication you’re working with a mature engineer: they hold the success of the project much higher than the potential praise they may get personally for working on it. The attribution of praise or credit can be the source of such dysfunction in an engineering-driven organization, and I believe it’s because it’s largely invisible.

The notion is liberating, and once understood and internalized, a world of progress and innovative thinking can flourish, because the engineer isn’t overly concerned with the personal liability of equating the work to their own career success.

Not The End

I’m at the moment blessed to work with a number of mature engineers here at Etsy, and it’s quite humbling. We are indeed a young field, and while I think we can learn a great deal from other fields of engineering on this topic, I also think we have an advantage. The web is inextricably tied to the notion of publishing and sharing information, globally. We need to continue pointing out what it means to be a “senior” and “mature” engineer if we have a hope of progressing the field into a true discipline.

Many thanks to members of the Etsy Operations team, Mike Brittain, Kellan Elliott-McCrea, Marc Hedlund, and Theo Schlossnagle for reviewing drafts of this post. They all make me a more mature engineer.

A Mature Role for Automation: Part I

Posted 30 CommentsPosted in Cognitive Systems Engineering, Complex Systems, Human Factors, Tools, WebOps

(Part 1 of 2 posts)
I’ve been percolating on this post for a long time. Thanks very much to Mark Burgess for reviewing early drafts of it.

One of the ideas that permeates our field of web operations is that we can’t have enough automation. You’ll see experience with “building automation” on almost every job description, and many post-mortem transcriptions around the world have remediation items that state that more automation needs to be in place to prevent similar incidents.

“AUTOMATE ALL THE THINGS!” the meme says.

But the where, when, and how to design, implement, and operate automation is not as straightforward as “AUTOMATE ALL THE THINGS!”

I’d like to explore this concept that everything that could be automated should be automated, and I’d like to take a stab at putting context around the reasons why we might think this is a good idea. I’d also like to give some background on the research of how automation is typically approached, the reasoning behind various levels of automation, and most importantly: the spectrum of downsides of automation done poorly or haphazardly.

(Since it’s related, I have to include an obligatory link to Github’s public postmortem on issues they found with their automated database failover, and some follow-up posts that are well worth reading.)

In a recent post by Mathias Meyer he gives some great pointers on this topic, and strongly hints at something I also agree with, which is that we should not let learnings from other safety-related fields (aviation, combat, surgery, etc.) go to waste, because there are some decades of thinking and exploration there. This is part of my plan for exploring automation.

Frankly, I think that we as a field could have a more mature relationship with automation. Not unlike the relationship humans have with fire: a cautious but extremely useful one, not without risks.

I’ve never done a true “series” of blog posts before, but I think this topic deserves one. There’s simply too much in this exploration to have in a single post.

What this means: There will not be, nor do I think should there ever be, a tl;dr for a mature role of automation, other than: its value is extremely context-specific, domain-specific, and implementation-specific.

If I’m successful with this series of posts, I will convince you to at least investigate your own intuition about automation, and get you to bring the same “constant sense of unease” that you have with making change in production systems to how you design, implement, and reason about it. In order to do this, I’m going to reference a good number of articles that will branch out into greater detail than any single blog post could shed light on.

Bluntly, I’m hoping to use some logic, research, science, and evidence to approach these sorts of questions:

  1. What do we mean when we say “automation”? What do those other fields mean when they say it?
  2. What do we expect to gain from using automation? What problem(s) does it solve?
  3. Why do we reach for it so quickly sometimes, so blindly sometimes, as the tool to cure all evils?
  4. What are the (gasp!) possible downsides, concerns, or limitations of automation?
  5. And finally – given the potential benefits and concerns with automation, what does a mature role and perspective for automation look like in web engineering?

Given that I’m going to mention limitations of automation, I want to be absolutely clear: I am not against automation. On the contrary, I am for it.

Or rather, I am for: designing and implementing automation while keeping an eye on both its limitations and benefits.

So what limitations could there be? The story of automation (at least in web operations) is one of triumphant victory. The reason that we feel good and confident about reaching for automation is almost certainly due to the perceived benefits we’ve received when we’ve done it in the past.

Canonical example: an engineer deploys to production by running a series of commands by hand, on each server, one by one. Man, that’s tedious and error-prone, right? Now that we’ve automated the whole process, it can run on its own, and we can spend our time on more fun and challenging things.

This is a prevailing perspective, and a reasonable one.

Of course we can’t ditch the approach of automation, even if we wanted to.  Strictly speaking, almost every use of a computer is to some extent using “automation”, even if we are doing things “by hand.” Which brings me to…

Definitions and Foundations

I’d like to point at the term itself, because I think it’s used in a number of different contexts to mean different things. If we’re to look at it closely, I’d like to at least clarify what I (and others who have researched the topic quite well) mean by the term “automation”. The word comes from the Greek: auto, meaning ‘self’, and matos, meaning ‘willing’, which implies something acting of its own accord.

Some modern definitions:

“Automation is defined as the technology concerned with the application of complex mechanical, electronic, and computer based systems in the operations and control of production.” – Raouf (1988)

‘Automation’ as used in the ATA Human Factors Task Force report in 1989 refers to…”a system or method in which many of the processes of production are automatically controlled or performed by self-operating machines, electronic devices, etc.” – Billings (1991)

“We define automation as the execution by a machine agent (usually a computer) of a function that was previously carried out by a human.” – Parasuraman (1997)

I’ll add to that somewhat broad definition functions that have never been carried out by a human. Namely, processes and tasks that could never be performed by a human, by exploiting the resources available in a modern computer. The recording and display of computations per second, for example.

To help clarify my use of the term:

  • Automation is not just about provisioning and configuration management. Although this is maybe the most popular context in which the term is used, it’s almost certainly not the only place for automation.
  • It’s also not simply the result of programming what were previously performed as manual tasks.
  • It can mean enforcing predefined or dynamic limits on operational tasks, automated or manual.
  • It can mean surfacing, displaying, and analyzing metrics from tasks and actions.
  • It can mean making decisions and possibly taking action on observed states in a system.

Some familiar examples of these facets of automation:

  • MySQL max_connections and Apache’s MaxClients directives: these are upper bounds intended to prevent high workloads from causing damage.
  • Nagios (or any monitoring system for that matter): these perform checks on values and states at rates and intervals only a computer could, and can also take action on those states in order to course-correct a process (as with Event Handlers).
  • Deployment tools and configuration management tools (like Deployinator, as well as Chef/Puppet/CFEngine, etc.)
  • Provisioning tools (golden-image or package-install based)
  • Any collection or display of metrics (StatsD, Ganglia, Graphite, etc.)

Which is basically…well, everything, in some form or another in web operations. 🙂
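
To make one of those facets concrete, here is a toy sketch (my own illustration, not code from any of the tools above) of the “upper bound” idea behind max_connections and MaxClients – an explicit, predefined limit on concurrent work:

    import threading

    class ConnectionLimiter:
        """Toy upper bound on concurrent work, in the spirit of MySQL's
        max_connections or Apache's MaxClients: a predefined limit meant to
        keep high workloads from causing damage."""

        def __init__(self, max_connections=150):
            self._slots = threading.BoundedSemaphore(max_connections)

        def handle(self, request, worker):
            # Refuse work beyond the limit rather than letting everything degrade.
            if not self._slots.acquire(blocking=False):
                return "503 Service Unavailable: connection limit reached"
            try:
                return worker(request)
            finally:
                self._slots.release()

Nobody would list that class under “automation” on a job description, but it is automation in the sense above: a decision made ahead of time by a human, enforced by the computer on its own.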

Domains To Learn From

In many of the papers found in Human Factors and Resilience Engineering, and in blog posts that generally talk about the limitations of automation, the discussion happens in the context of aviation. And what a great context that is! The consequences are dramatic (people die, big explosions, etc.), so it’s no surprise that there is a plethora of articles and a large volume of research on automation in the cockpit to choose from.

Except the difference is that in the cockpit, the human and machine elements have a different context. There are mechanical actions that the operator can and needs to perform during takeoff and landing. They physically step on pedals, push levers and buttons, and watch dials and gauges at various points during takeoff and landing. Automation in that context is, frankly, much more evolved, and the contrast (and implicit contract) between man and machine is much more stark than in the context of web infrastructures. Display layouts, power-assisted controls…we should be so lucky to have attention like that paid to our working environment in web operations! (but also, cheers to people not dying when the site goes down, amirite?)

My point is that while we discuss the pros, cons, and considerations for designing automation to help us in web operations, we have to be clear that we are not aviation, and that our discussion should reflect that while still trying to glean information from that field’s use of it.

We ought to understand also that when we are designing tasks, automation is but one approach (albeit a complex one) we can take, and that it can be implemented in a wide spectrum of ways. This also means that if we decide in some cases not to automate something (gasp!) or to step back from full automation for good reason, we shouldn’t feel bad about it, or like we’ve failed. Ray Kurzweil and the nutjobs that think the “singularity” is coming RealSoonNow™ won’t be impressed, but then again you’ve got work to do.

So Why Do We Want to Use Automation?

Historically, automation is used for:

  • Precision
  • Stability
  • Speed

Which sounds like a pretty good argument for it, right? Who wants to be less precise, less stable, or slower? Not I, says the Ops guy. So using automation at work seems like a no-brainer.  But is it really just as simple as that?

Some common motivations for automation are:

  • Reduce or eliminate human error
  • Reduction of the human’s workload. Specifically, ridding humans of boring and tedious tasks so they can tackle the more difficult ones
  • Bring stability to a system
  • Reduce fatigue on humans

No article about automation would be complete without pointing first at Lisanne Bainbridge’s 1983 paper, “The Ironies of Automation”. I would put her work as the modern canon on the topic. Any self-respecting engineer should read it. While its prose is somewhat dated, the value is still very real and pertinent.

What she says, in a nutshell, is that there are at least two ironies with automation, from the traditional view of it. The premise reflects a gut intuition that pervades many fields of engineering, and one that I think should be questioned:

The basic view is that the human operator is unreliable and inefficient, and therefore should be eliminated from the system.

Roger that. This supports the idea to take humans out of the loop (because they are unreliable and inefficient) and replace them with automated processes.

The first irony is:

Designer errors [in automation] can be a major source of operating problems.

This means that the designers of automation make decisions about how it will work based on how they envision the context in which it will be used. There is a very real possibility that the designer hasn’t imagined (or can’t imagine) every scenario and situation the automation and human will find themselves in, and therefore can’t account for it in the design.

Let’s re-read the statement: “This supports the idea to take humans out of the loop (because they are unreliable and inefficient) and replace them with automated processes.”…which are designed by humans, who are assumed to be unrelia…oh, wait.

The second irony is:

The designer [of the automation], who tries to eliminate the operator, still leaves the operator to do the tasks which the designer cannot think how to automate.

Which is to say that because the designers of automation can’t fully automate the human “out” of everything in a task, the human is left to cope with what’s left after the automated parts. Which by definition are the more complex bits. So the proposed benefit of relieving humans of cognitive workload isn’t exactly realized.

There are some more generalizations that Bainbridge makes, paraphrased by James Reason in Managing The Risks of Organizational Accidents:

  • In highly automated systems, the task of the human operator is to monitor the systems to ensure that the ‘automatics’ are working as they should. But it’s well known that even the best motivated people have trouble maintaining vigilance for long periods of time. They are thus ill-suited to watch out for these very rare abnormal conditions.
  • Skills need to be practiced continuously in order to preserve them. Yet an automatic system that fails only very occasionally denies the human operator the opportunity to practice the skills that will be called upon in an emergency. Thus, operators can become deskilled in just those abilities that justify their (supposedly) marginalized existence.
  • And ‘Perhaps the final irony is that it is the most successful automated systems with rare need for manual intervention which may need the greatest investment in operator training.’

Bainbridge’s exploration of the ironies and costs of automation brings a much more balanced view of the topic, IMHO. It also points to something that I don’t believe is apparent to our community, which is that automation isn’t an all-or-nothing proposition. It’s easy to bucket things that humans do and things that machines do, and while the two do meet from time to time in different contexts, it’s simpler to think of their abilities apart from each other.

Viewing automation instead on a spectrum of contexts can break this oversimplification, which I think can help us gain a glimpse into what a more mature perspective towards automation could look like.

Levels Of Automation

It would seem automation design needs to be done with the context of its use in mind. Another fundamental work in the research of automation is the so-called “Levels Of Automation”. In their seminal 1978 paper “Human And Computer Control of Undersea Teleoperators”, Sheridan and Verplank lay out the landscape for where automation exists along the human-machine relationship (Table 8.2 in the original and most excellent vintage 1978 typewritten engineering paper).

Automation Level – Description:

  1. The computer offers no assistance: human must take all decisions and actions.
  2. The computer offers a complete set of decision/action alternatives, or
  3. …narrows the selection down to a few, or
  4. …suggests one alternative, and
  5. …executes that suggestion if the human approves, or
  6. …allows the human a restricted time to veto before automatic execution, or
  7. …executes automatically, then necessarily informs humans, and
  8. …informs the human only if asked, or
  9. …informs him after execution if it, the computer, decides to.
  10. The computer decides everything and acts autonomously, ignoring the human.

This was extended later in Parasuraman, Sheridan, and Wickens (2000) “A Model for Types and Levels of Human Interaction with Automation” to include four stages of information processing within which each level of automation may exist:

  1. Information Acquisition. The first stage involves the acquisition, registration, and position of multiple information sources similar to that of humans’ initial sensory processing.
  2. Information Analysis. The second stage refers to conscious perception, selective attention, cognition, and the manipulation of processed information, such as in the Baddeley model of information processing.
  3. Decision and Action Selection. Next, automation can make decisions based on information acquisition, analysis and integration.
  4. Action Implementation. Finally, automation may execute forms of action.

Viewing the above 10 Levels of Automation (LOA) as a spectrum within each of those four stages allows for a way of discerning where and how much automation could (or should) be implemented, in the context of performance and cost of actions. This feels to me like a step towards making mature decisions about the role of automation in different contexts.
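
As a thought experiment (mine, not anything from the papers above, and not a real deploy tool), here is what configuring the middle of that spectrum might look like for the Action Implementation stage of some operational task:

    import time

    def run_action(action, level, approved=False, veto_window_secs=30, notify=print):
        """Toy dispatcher for the middle of the Sheridan/Verplank spectrum.
        'action' is a callable that performs the work; 'level' is the level of
        automation configured for this particular task."""
        if level == 5:
            # Executes the suggestion only if the human approves.
            if approved:
                return action()
            notify("Suggested action was not approved; doing nothing.")
        elif level == 6:
            # Allows the human a restricted time to veto before execution.
            notify("Executing in %ds unless vetoed..." % veto_window_secs)
            time.sleep(veto_window_secs)  # in real life: poll for a veto signal
            return action()
        elif level == 7:
            # Executes automatically, then necessarily informs the human.
            result = action()
            notify("Executed automatically, result: %r" % (result,))
            return result
        else:
            raise ValueError("levels 1-4 and 8-10 left as an exercise")

Even a toy like this forces the design question into the open: for a given task, which level are you actually comfortable with, and why?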

Here is an example of these stages and the LOA in each of them, suggested for Air Traffic Control activities:

Endsley (1999) also came up with a similar paradigm of stages of automation, in “Level of automation effects on performance, situation awareness and workload in a dynamic control task”.

What are examples of viewing LOA in the context of web operations and engineering?

At Etsy, we’ve made decisions (sometimes implicitly) about the levels of automation in various tasks and tooling:

  • Deployinator: assisted by automated processes, humans trigger application code deploys to production. The when and what is human-centered. The how is computer-centered.
  • Chef: humans decide on some details in recipes (this configuration file in this place), computers decide on others (use 85% of total RAM for memcached, other logic in templates), and computer decides on automatic deployment (random 10 minute splay for Chef client runs). Mostly, humans provide the what, and computers decide the when and how.
  • Database Schema changes: assisted by automated processes, humans trigger the what and when, computer handles the how.
  • Event handling: some Nagios alerts trigger simple self-healing attempts upon some (not all) alertable events. Human decides what and how. Computer decides when. (A sketch of this shape follows below.)
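
A purely hypothetical sketch of that last pattern (the general shape of a Nagios-style event handler, not Etsy’s actual tooling): the human decided the what and the how ahead of time, and the computer decides the when.

    import subprocess

    def handle_service_down(service="memcached", max_attempts=1):
        """Toy Nagios-style event handler: attempt one simple self-healing
        action (a restart), and hand the problem to a human if it doesn't take."""
        for attempt in range(max_attempts):
            subprocess.call(["service", service, "restart"])
            if service_is_up(service):
                return "%s recovered on attempt %d" % (service, attempt + 1)
        page_a_human("%s is down and an automated restart did not help" % service)

    def service_is_up(service):
        # Placeholder check; a real handler would query the service directly.
        return subprocess.call(["service", service, "status"]) == 0

    def page_a_human(message):
        # Placeholder escalation; a real handler would go through the normal
        # alerting path.
        print(message)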

I suspect that in many organizations, the four stages of automation (from Parasuraman, Sheridan, and Wickens) line up something like this, with regards to the breakdown in human or computer function allocation:

Information Acquisition
  • Largely computer-driven for application and infra metrics (think Graphite/Ganglia/NewRelic/Boundary/etc.)
  • Some higher-level human-driven data acquisition (think UX testing and observation/focus groups/etc.)
Information Analysis
  • Some computer-driven for application and infra (think Holt-Winters, CEP, A/B testing results, deductive reasoning about metrics, etc.)
  • Some human-driven analysis (think BI/behavioral/funnel correlations, inductive reasoning about metrics, etc.)
Decision and Action Selection
  • Some computer-driven for application and infra (think event handlers, fault tolerance and protection methods, CI, etc.)
  • Some human-driven (think some deployments, core network or storage changes deemed risky, etc.)
Action Implementation
  • Some computer-driven for application and infra (think event handlers, some config mgmt implementations, scheduled jobs with feed-back and feed-forward loops, etc.)
  • Some human-driven (think some deployments, feature ramp-ups, coordinated multi-team actions, etc.)

 

Trust

In many cases, what level of automation is appropriate and in which context is informed by the level of trust that operators have in the automation to be successful.

Do you trust an iPhone’s ability to auto-correct your spelling enough to blindly accept all suggestions? I suspect no one would, and the iPhone auto-correct designers know this because they’ve given the human veto power over each suggestion by putting an “x” next to it. (following automation level 5, above)

Do you trust a GPS routing system enough to follow it without question? Let’s hope not. Given that there is context missing, such as stop signs, red lights, pedestrians, and other dynamic phenomena going on in traffic, GPS automobile routing may be a good example of keeping the LOA at level 4 and below, and even then only sticking to the “Information Acquisition” and “Information Analysis” stages, and leaving the “Decision and Action Selection” and “Action Implementation” stages to the human, who can recognize the more complex context.

In “Trust in Automation: Designing for Appropriate Reliance”, John D. Lee and Katrina A. See investigate the concerns surrounding trusting automation, including organizational issues, cultural issues, and context that can influence how automation is designed and implemented. They outline a concern that I think should be familiar to anyone who has had experiences (good or bad) with automation (emphasis mine):

As automation becomes more prevalent, poor partnerships between people and automation will become increasingly costly and catastrophic. Such flawed partnerships between automation and people can be described in terms of misuse and disuse of automation. (Parasuraman & Riley, 1997).

Misuse refers to the failures that occur when people inadvertently violate critical assumptions and rely on automation inappropriately, whereas disuse signifies failures that occur when people reject the capabilities of automation.

Misuse and disuse are two examples of inappropriate reliance on automation that can compromise safety and profitability.

They discuss methods on making automation trustable:

  • Design for appropriate trust, not greater trust.
  • Show the past performance of the automation. (A toy sketch of this idea follows this list.)
  • Show the process and algorithms of the automation by revealing intermediate results in a way that is comprehensible to the operators.
  • Simplify the algorithms and operation of the automation to make it more understandable.
  • Show the purpose of the automation, design basis, and range of applications in a way that relates to the users’ goals.
  • Train operators regarding its expected reliability, the mechanisms governing its behavior, and its intended use.
  • Carefully evaluate any anthropomorphizing of the automation, such as using speech to create a synthetic conversational partner, to ensure appropriate trust.
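
As a toy illustration of the “show the past performance” suggestion above (my own sketch, not anything from Lee and See), imagine automation that surfaces its own track record before it acts, so an operator’s trust can be calibrated rather than assumed:

    import json

    def run_with_track_record(name, action, history_file="automation_history.json"):
        """Toy version of 'show the past performance of the automation':
        print how often this automated action has succeeded before running it
        again, and record the outcome of this run for next time."""
        try:
            with open(history_file) as f:
                history = json.load(f)
        except FileNotFoundError:
            history = {}

        runs = history.get(name, [])
        if runs:
            print("%s: %d past runs, %.0f%% succeeded"
                  % (name, len(runs), 100.0 * sum(runs) / len(runs)))
        else:
            print("%s: no track record yet; treat with appropriate suspicion" % name)

        succeeded = False
        try:
            action()
            succeeded = True
        finally:
            history.setdefault(name, []).append(succeeded)
            with open(history_file, "w") as f:
                json.dump(history, f)
        return succeeded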

Adam Jacob, in a private email thread with me and some others, had some very insightful things to say on the topic:

The practical application of the ironies isn’t that you should/should not automate a given task, it’s answering the questions of “When is it safe to automate?”, perhaps followed by “How do I make it safe?”. We often jump directly to “automation is awesome”, which is an answer to a different question.

[if you were to ask]…”how do you draw the line between what is and isn’t appropriate?”, I come up with a couple of things:

  • The purpose of automation is to serve a need – for most of us, it’s a business need. For others, it’s a human-critical one (do not crash planes full of people regularly due to foreseeable pilot error.)
  • Recognize the need you are serving – it’s not good for its own sake, and different needs call for different levels of automation effort.
  • The implementers of that automation have a moral imperative to create automation that is serviceable, instrumented, and documented.
  • The users of automation have an imperative to ensure that the supervisors understand the system in detail, and can recover from failures.

I think Adam is putting this eloquently, and I think it’s an indication that we as a field are moving towards a more mature perspective on the subject.

There is a growing notion amongst those who study the history, ironies, limitations, and advantages of automation that an evolved perspective on the human-machine relationship may look a lot like human-human relationships. Specifically, the characteristics that govern groups of humans that are engaged in ‘joint activity’ could also be seen as ways that automation could interact.

Collaboration, communication, and cooperation are some of the hallmarks of teamwork amongst people. In “Ten Challenges for Making Automation a ‘Team Player’ in Joint Human-Agent Activity” David Woods, Gary Klein, Jeffrey M. Bradshaw, Robert R. Hoffman, and Paul J. Feltovich make a case for how such a relationship might exist. I wrote briefly a little while ago about the ideas that this paper rests on, in this post here about how people work together.

Here are the ten challenges the authors say we face, where ‘agents’ means both humans and the machines/automated processes designed by humans:

  • Basic Compact – Challenge 1: To be a team player, an intelligent agent must fulfill the requirements of a Basic Compact to engage in common-grounding activities.
  • Adequate models – Challenge 2: To be an effective team player, intelligent agents must be able to adequately model the other participants’ intentions and actions vis-à-vis the joint activity’s state and evolution—for example, are they having trouble? Are they on a standard path proceeding smoothly? What impasses have arisen? How have others adapted to disruptions to the plan?
  • Predictability – Challenge 3: Human-agent team members must be mutually predictable.
  • Directability – Challenge 4: Agents must be directable.
  • Revealing status and intentions – Challenge 5: Agents must be able to make pertinent aspects of their status and intentions obvious to their teammates.
  • Interpreting signals – Challenge 6: Agents must be able to observe and interpret pertinent signals of status and intentions.
  • Goal negotiation – Challenge 7: Agents must be able to engage in goal negotiation.
  • Collaboration – Challenge 8: Support technologies for planning and autonomy must enable a collaborative approach.
  • Attention management – Challenge 9: Agents must be able to participate in managing attention.
  • Cost control – Challenge 10: All team members must help control the costs of coordinated activity.

I do recognize these to be traits and characteristics of high-performing human teams. Think of the best teams in many contexts (engineering, sports, political, etc.) and these certainly show up. Can humans and machines work together just as well? Maybe we’ll find out over the next ten years. 🙂

“The question is no longer whether one or another function can be automated, but, rather, whether it should be.” – Wiener & Curry (1980)
“…and in what ways it should be automated.” – John Allspaw (right now, in response to Wiener & Curry’s quote above)

Fundamental: Stress-Strain Curves In Web Engineering

Posted 6 CommentsPosted in Cognitive Systems Engineering, Complex Systems, Human Factors, Resilience, WebOps

I make it no secret that my background is in mechanical engineering. I still miss those days of explicit and dynamic finite element analysis, when I worked at the VNTSC on vehicle crashworthiness studies for the NHTSA.

What was there not to like? Things like cars and airbags and seatbelts and dummies that get crushed, sheared, cracked, and busted in every way, all made of different materials: steel, glass, rubber, even flesh (cadaver studies)…it was awesome.

I’ve made some analogies from the world of statics and dynamics to the world of web operations before (Part I and Part II), and it still sticks in my mind as a fundamental mental model in my everyday work: resources that have adaptive capacities have a fundamental relationship between stress and strain. Which is to say, in most systems we encounter, as demand for a given resource increases, the strain on the system (and therefore on its adaptive capacity) under load also changes, and in most cases increases.

What do I mean by “resource”? Well, from the materials science world, this is generally a component characterized by its material properties. The textbook example is a bar of metal, being stretched.

À la:

In this traditional case, the “system” is simply a beam or a linkage or a load-bearing something.
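
For reference, the uniform (elastic) region of that textbook curve is just Hooke’s law (standard materials science, not anything specific to the organizational analogy that follows):

    \sigma = E \, \varepsilon

where sigma is the stress (load per unit area), epsilon is the strain (fractional elongation), and E is Young’s modulus of the material. Past the yield point that proportionality breaks down and deformation becomes plastic, which is exactly the part of the curve the rest of this analogy cares about.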

But in my extension/abuse of the analogy, simple resources in the first order could be imagined as:

  •    CPU
  •    Memory
  •    Disk I/O
  •    Disk consumption
  •    Network bandwidth

To extend it further (and more realistically, because these resources almost never experience work in isolation from each other) you could think of the resource under load as any combination of these things. And the system under load may be a webserver. Or a database. Or a caching server.

Captain Obvious says: welcome to the underlying facts-on-the-ground of capacity planning and monitoring. 🙂

To me, this leads to some more questions:

    • What does this relationship look like, between stress and strain?
      • Does it fail immediately, as if it was brittle?
      • Or does it “bend”, proportionally, (as in: request rate versus latency) for some period before failure?
      • If the latter, is the relationship linear, or exponential, or something else entirely?
    • Was this relationship known before the design of the system, and therefore taken into account?
      • Which is to say: what approaches are we using most in predicting this relationship between stress and strain:
        • Extrapolated experimental data from synthetic load testing?
        • Previous real-world production data from similarly-designed systems?
        • Percentage rampups of production load?
        • A few cherry-picked reports on HackerNews combined with hope and caffeine?
    • Will we be able to detect when the adaptive capacity of this system is nearing damage or failure?
    • If we can, what are we planning on doing when we reach those inflections?

The more confidence we have about this relationship between stress and strain, the more prepared we are for the system’s failures and successes.
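
One crude way to probe that relationship with load-test data (a toy sketch of my own, not a recommendation of any particular tool): fit the early, “proportional” portion of the samples and flag where observed latency starts departing from it.

    def find_knee(samples, baseline_points=3, tolerance=1.5):
        """samples: list of (request_rate, latency) pairs from a load test,
        sorted by request rate. Estimate the proportional ('elastic') slope
        from the first few points, then return the first rate at which latency
        exceeds that linear prediction by more than `tolerance` times."""
        baseline = samples[:baseline_points]
        slope = sum(lat / rate for rate, lat in baseline) / len(baseline)
        for rate, latency in samples[baseline_points:]:
            if latency > slope * rate * tolerance:
                return rate  # the demand level where strain stops being proportional
        return None  # no knee observed in this data

    # Example: p95 latency (ms) measured at increasing request rates (req/s)
    data = [(100, 20), (200, 41), (300, 62), (400, 85), (500, 160), (600, 400)]
    print(find_knee(data))  # -> 500, where latency departs from the linear trend

With real data you would want something less naive, but even this captures the question being asked: where does strain stop being proportional to stress?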

Now, the analogy of this fundamental concept doesn’t end here. What if the “system” under varying load is an organization? What if it’s your development and operations team? Viewed on a longer scale than a web request, this can be seen as a defining characteristic of a team’s adaptive capacities.

David Woods and John Wreathall discuss this analogy they’ve made in “Stress-Strain Plots as a Basis for Assessing System Resilience”. They describe how they are mapping the state space of a stress-strain plot to an organization’s adaptive capacities and resilience:

Following the conventions of stress-strain plots in material sciences, the y-axis is the stress axis. We will here label the y-axis as the demand axis (D) and the basic unit of analysis is how the organization responds to an increase in D relative to a base level of D (Figure 1). The x-axis captures how the material stretches when placed under a given load or a change in load. In the extension to organizations, the x-axis captures how the organization stretches to handle an increase in demands (S relative to some base).

In the first region – which we will term the uniform response region – the organization has developed plans, procedures, training, personnel and related operational resources that can stretch uniformly as demand varies in this region. This is the on-plan performance area or what Woods (2006) referred to as the competence envelope.

As you can imagine, the fun begins in the part of the relationship above the uniform region. In materials science, this is where plastic deformation begins; it’s the point on the curve at which a resource/component’s structure deforms under the increased stress and can no longer rebound back to its original position. It’s essentially damaged, or its shape is permanently changed in the given context.

They go on to say that in the organizational stress-strain analogy:

In the second region non-uniform stretching begins; in other words, ‘gaps’ begin to appear in the ability to maintain safe and effective production (as defined within the competence envelope) as the change in demands exceeds the ability of the organization to adapt within the competence envelope. At this point, the demands exceed the limit of the first order adaptations built into the plan-ful operation of the system in question. To avoid an accumulation of gaps that would lead to a system failure, active steps are needed to compensate for the gaps or to extend the ability of the system to stretch in response to increasing demands. These local adaptations are provided by people and groups as they actively adjust strategies and recruit resources so that the system can continue to stretch. We term this the ‘extra’ region (or more compactly, the x-region) as compensation requires extra work, extra resources, and new (extra) strategies.

So this is a good general description in Human Factors Researcher language, but what is an example of this non-uniform or plastic deformation in our world of web engineering? I see a few examples.

  • In distributed systems, at the point at which the volume of data and the request (or change) rate of the data is beyond the ability of individual nodes to cope, and a wholesale rehash or fundamental redistribution is necessary. For example, in a typical OneMasterManySlaves approach to database architecture, when the rate of change on the master passes the point where, no matter how many slaves you add (to relieve read load on the master), the data will continually be stale. Common solutions to this inflection point are functional partitioning of the data into smaller clusters, or federating the data amongst shards. In another example, it could be that in a Dynamo-influenced datastore, the N, W, and R knobs need adjusting to adapt to the rate, or the individual nodes’ resources need to be changed.
  • In Ops teams, when individuals start to discover and compensate for brittleness in the architecture. A common sign of this happening is when alerting thresholds or approaches (active versus passive, aggregate versus individual, etc.) no longer provide the detection needed within an acceptable signal:noise envelope. This compensation can be largely invisible, growing until it’s too late and burnout has settled in.
  • The limits of an underlying technology (or the particular use case for it) are starting to show. An example of this is a single-process server. Low traffic rates pose no problem for software that can only run on a single CPU core; it can adapt to small bursts to a certain extent, and there’s a simple solution to this non-multicore situation: add more servers. However, at some point, the work needed to replace the single-core software with multicore-ready software drops below the amount of work needed to maintain and grow an army of single-process servers. This is especially true in terms of computing efficiency, as in dollars per calculation.

In other words, the ways a design or team once adapted are no longer valid in this new region of the stress-strain relationship. Successful organizations re-group and increase their ability to adapt to this new present case of demands, and invest in new capacities.

For the general case, the exhaustion of capacity to adapt as demands grow is represented by the movement to a failure point. This second phase is represented by the slope and distance to the failure point (the downswing portion of the x-region curve). Rapid collapse is one kind of brittleness; more resilient systems can anticipate the eventual decline or recognize that capacity is becoming exhausted and recruit additional resources and methods for adaptation or switch to a re-structured mode of operations (Figures 2 and 3). Gracefully degrading systems can defer movement toward a failure point by continuing to act to add extra adaptive capacity.

In effect, resilient organizations recognize the need for these new strategies early on in the non-uniform phase, before failure becomes imminent. This, in my view, is the difference between a team who has ingrained into their perspective what it means to be operationally ready, and those who have not. At an individual level, this is what I would consider to be one of the many characteristics that define a “senior” (or, rather a mature) engineer.

This is the money quote, emphasis is mine:

Recognizing that this has occurred (or is about to occur) leads people in these various roles to actively adapt to make up for the non-uniform stretching (or to signal the need for active adaptation to others). They inject new resources, tactics, and strategies to stretch adaptive capacity beyond the base built into on-plan behavior. People are the usual source of these gap-filling adaptations and these people are often inventive in finding ways to adapt when they have experience with particular gaps (Cook et al., 2000). Experienced people generally anticipate the need for these gap-filling adaptations to forestall or to be prepared for upcoming events (Klein et al., 2005; Woods and Hollnagel, 2006), though they may have to adapt reactively on some occasions after the consequences of gaps have begun to appear. (The critical role of anticipation was missed in some early work that noticed the importance of resilient performance, e.g., Wildavsky, 1988.)

This behavior leads to the extension of the non-uniform space into new uniform spaces, as the team injects new adaptive capacities:

There is a lot more in this particular paper that Woods and Wreathall cover, including:

  • Calibration – How engineering leaders and teams view themselves and their situation, along the demand-strain curve. Do they underestimate or overestimate how close they are to failure points or active adaptations that are indicative of “drift” towards failure?
  • Costs of Continual Adaptation in the X-Region – As the compensations for cracks and gaps in the system’s armor increase, so does the cost. At some point, the cost of restructuring the technology or the teams becomes lower than the continual making-up-for-the-gaps that is happening.
  • The Law of Stretched Systems – “As an organization is successful, more may be demanded of it (‘faster, better, cheaper’ pressures) pushing the organization to handle demands that will exceed its uniform range. In part this relationship is captured in the Law of Stretched Systems (Woods and Hollnagel, 2006) – with new capabilities, effective leaders will adapt to exploit the new margins by demanding higher tempos, greater efficiency, new levels of performance, and more complex ways of work.”

Overall, I think Woods and Wreathall hit the nail on the head for me. Of course, as with all analogies, this mapping of resilience and adaptive capacity to stress-strain curves has limits, and they are clear in pointing those out as well.

My suggestion of course is for you to read the whole chapter. It may or may not be useful for you, but it sure is to me. I mean, I embrace the concept so much that I got it printed on a coffee mug, and I’m thinking of making an Etsy Engineering t-shirt as well. 🙂

Human Factors and Web Engineering’s Intersection

Posted 4 CommentsPosted in Cognitive Systems Engineering, Human Factors

Given my recent (and apparently insatiable) appetite for studying the contexts, interface(s), and success and failure modes between man and machine, it’s not a surprise that I’ve been flying head-on into the field of Human Factors. Sub-disciplines include Cognitive Engineering and Human-Computer Interaction (HCI).

It would appear to me that there isn’t one facet of the field of web engineering that can’t be informed by the results of human factors research, and that this research ought to be part of anyone’s education in Operations at the very least. How we make decisions, how we operate under multiple goal conflicts (think time pressure and outage escalation), concerns for designing controls and displays, even organizational resilience: all of these have foundations in Human Factors, and I think they have just as much to do with our field as Computer Science and Distributed Systems do.

So when I opened “Human Factors for Engineers”, I found that the Preface by Stuart Arnold really captured my own impressions upon discovering this field some years ago, so I’m going to quote the whole thing here:

Preface, Human Factors for Engineers

 

I am not a human factors expert. Nor can I even claim to be an able practitioner. I am an engineer who, somewhat belatedly in my career, has developed a purposeful interest in human factors.

 

This drive did not come from a desire for broader knowledge, though human factors as a subject is surely compelling enough to merit attention by the inquiring mind. It came from a progressive recognition that, without an appreciation of human factors, I fall short of being the rounded professional engineer I have always sought to be.

 

I started my career as a microwave engineer. Swept along by the time and tide of an ever-changing technical and business scene, I evolved into a software engineer. But as more time passed, and as knowledge, experience and skills grew, my identity as an engineer began to attract a new descriptor. With no intent, I had metamorphosed into a systems engineer.

 

There was no definitive rite of passage to reach this state. It was an accumulation of conditioning project experiences that turned theory into practice, and knowledge into wisdom. Throughout, I cloaked myself in a misguided security that my engineering world was largely confined to the deterministic, to functions that could be objectively enshrined in equations, to laws of physics that described with rigour the entities I had to deal with. To me, it was a world largely of the inanimate.

 

Much of this experience was against a backdrop of technical miracles; buoyant trade and strong supplier influence; where unfettered technological ingenuity drove the market, and users were expected to adapt to meet the advances that science and engineering threw at them. It was an era destined not to last.

 

Of course, from my world of microwave systems, of software systems and of maturing systems engineering, I had heard of human factors. Yet it seemed a world apart, scarcely an engineering-related discipline. People, teams and social systems were an unfortunate, if unavoidable, dimension of the environment my technical creations operated in.

 

Nor was I alone in these views. Recently asked by the IEE to write a personal view on systems and engineering, I light-heartedly, but truthfully, explained that I have been paid to drink in bars across the world with some of the best systems engineers around. This international debating arena is where I honed a cutting edge to my systems engineering. Above all others, I was struck by one thing: an engineering legacy of thinking about systems as being entirely composed of the inanimate.

 

Most engineers still view humans as an adjunct to equipment, blind even to their saving compensation for residual design shortcomings. As an engineering point of view for analyzing or synthesising solutions, this equipment-centred view has validity–but only once humans have been eliminated as contributing elements within some defined boundary of creativity. It is a view that can preempt a trade-off in which human characteristics could have contributed to a more effective solution; where a person or team, rather than inanimate elements, would on balance be a better, alternative contributor to overall system properties.

 

As elements within a boundary of creativity, operators bring intelligence to systems; the potential for real-time response to specific circumstance; adaptation to changing or unforeseen need; a natural optimisation of services delivered; and, when society requires people to integrally exercise control of a system, they legitimise system functionality. In the fields of transportation, medicine, defence and finance the evidence abounds.

 

Nevertheless, humans are seen to exhibit a distinctive and perplexing range of ‘implementation constraints’. We all have first hand familiarity with the intrinsic limitations of humans. Indeed, the public might be forgiven for assuming that system failure is synonymous with human weaknesses – with driver or pilot error, with disregard for procedure, even with corporate mendacity.

 

But where truly does the accountability for such failure lie: with the fallible operator; the misjudgements in allocation of functional responsibility in equipment-centred designs; the failed analysis of emergent complexity in human–equipment interaction? Rightly, maturing public awareness and legal enquiry focuses evermore on these last two causes. At their heart lies a crucial relationship – a mutual understanding –  between engineers and human factors specialists.

 

Both of these groups of professionals must therefore be open to, and capable of  performing, trade-off between the functionality and behaviour of multiple, candidate engineering implementations and humans.

 

This trade-off is not of simple alternatives, functional like for functional like. The respective complexities, characteristics and behaviours of the animate and inanimate do not offer such symmetry. It requires a crafted, intimate blending of humans with engineered artefacts–human-technology cooperation rather than human–technology interaction; a proactive and enriching fusion rather than a reactive accommodation of the dissimilar.

 

Looking outward from a system-of-interest there lies an operational environment, composed notionally of an aggregation of systems, each potentially comprising humans with a stake in the services delivered by the system-of-interest. As external system actors, their distinctive and complex characteristics compound and progressively emerge to form group and social phenomena. Amongst their ranks one finds, individually and communally, the system beneficiaries. Their needs dominate the definition of required system services and ultimately they are the arbiters of solution acceptability.

 

Essentially, the well-integrated system is the product of the well-integrated team, in which all members empathise and contribute to an holistic view throughout the system life cycle. The text that follows may thus be seen as a catalyst for multi-disciplinary, integrated teamwork; for co-operating engineers and human factors professionals to develop a mutual understanding and a mutual recognition of their respective contributions.

 

In this manner, they combine to address two primary concerns. One is an equitable synthesis of overall system properties from palpably dissimilar candidate elements, with attendant concerns for micro-scale usability in the interfaces between operator and inanimate equipment. The other addresses the macro-scale operation of this combination of system elements to form an optimised socio-technical work system that delivers agreed services into its environment of use.

 

The ensuing chapters should dispel any misguided perception in the engineering mind that human factors is a discipline of heuristics. It is a science of structured and disciplined analysis of humans – as individuals, as teams, and as a society. True, the organic complexity of humans encourages a more experimental stance in order to gain understanding, and system life cycle models need accordingly to accommodate a greater degree of empiricism. No bad thing, many would say, as engineers strive to fathom the intricacies of networked architectures, emergent risks of new technology and unprecedented level of complexity.

 

The identification of pressing need and its timely fulfillment lie at the heart of today’s market-led approach to applying technology to business opportunity. Applying these criteria to this book, the need is clear-cut, the timing is ideal, and this delivered solution speaks to the engineering community in a relevant and meaningful way. Its contributors are to be complimented on reaching out to their engineering colleagues in this manner.

 

So I commend it to each and every engineer who seeks to place his or her particular engineering contribution in a wider, richer and more relevant context. Your engineering leads to systems that are created by humans, for the benefit of humans. Typically they draw on the capabilities of humans, and they will certainly need to respond to the needs and constraints of humans. Without an appreciation of human factors, your endeavors may well fall short of the mark, however good the technical insight and creativity that lie behind them.


Beyond this, I also commend this text to human factors specialists who, by assimilating its presentation of familiar knowledge, may better appreciate how to meaningfully communicate with the diversity of engineering team members with whom they need to interact.


It’s almost as if he was convinced that cooperation and communication between disciplines were not only warranted, but needed if problems are ever to have scalable and efficient solutions. 🙂

Resilience Engineering Part II: Lenses

Posted 4 CommentsPosted in Complex Systems, Resilience

(this is part 2 of a series: here is part 1)

One of the challenges of building and operating complex systems is that it’s difficult to talk about one facet or component of them without bleeding the conversation into other related concerns. That’s the funky thing about complex systems and systems thinking: components come together to behave in different (sometimes surprising) ways that they never would on their own, in isolation. Everything always connects to everything else, so it’s always tempting to see connections and want to include them in discussion. I suspect James Urquhart feels the same.

So one helpful bit (I’ve found) is Erik Hollnagel’s Four Cornerstones of Resilience. I’ve been using it as a lens with which to discuss and organize my thoughts on…well, everything. I think I’ve included them in every talk, presentation, or maybe even every conversation I’ve had with anyone for the last year and a half. I’m planning on annoying every audience for the foreseeable future by including them, because I can’t seem to shake how helpful they are as a lens for viewing resilience as a concept.

Four Cornerstones of Resilience

The greatest part about the four cornerstones is that they’re a simplification device for discussion. And simplifications don’t come easily when talking about complex systems. The other bit that I like is that they make it straightforward to see how the activities and challenges in each cornerstone relate to the others.

For example: learning is traditionally punctuated by Post-Mortems, and what (hopefully) comes out of PMs? Remediation items. Tasks and facts that can aid:

  • monitoring (example: “we need to adjust an alerting threshold or mechanism to be more appropriate for detecting anomalies”; see the sketch after this list)
  • anticipation (example: “we didn’t see this particular failure scenario coming before, so let’s update our knowledge on how it came about.”)
  • response (example: “we weren’t able to troubleshoot this issue as quickly as we’d like because communicating during the outage was noisy/difficult, so let’s fix that.”)
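A quick aside on that first item: “make the threshold more appropriate for detecting anomalies” often means replacing a fixed, hand-picked number with something derived from the metric’s own recent history. Here’s a minimal sketch of that idea in Python; it’s not tied to any particular monitoring product, and the class name, window size, and sigma count are all made up for illustration:

```python
from collections import deque
from statistics import mean, stdev

class RollingThresholdCheck:
    """Flag a metric value as anomalous when it strays far from its own
    recent history, instead of comparing it to a static threshold."""

    def __init__(self, window=60, sigmas=3.0):
        self.samples = deque(maxlen=window)  # e.g. the last 60 one-minute samples
        self.sigmas = sigmas                 # how far from "normal" is alert-worthy

    def observe(self, value):
        """Record a new sample; return True if it looks anomalous."""
        anomalous = False
        if len(self.samples) >= 10:          # need some history before judging
            mu, sd = mean(self.samples), stdev(self.samples)
            anomalous = sd > 0 and abs(value - mu) > self.sigmas * sd
        self.samples.append(value)
        return anomalous

# Hypothetical usage: feed it a per-minute error rate instead of
# alerting on a hard-coded "errors > 100" rule.
check = RollingThresholdCheck(window=60, sigmas=3.0)
if check.observe(42):
    print("page the on-call: error rate is unusual for this service")
```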

I’ll leave it as an exercise to imagine how anticipation can then affect monitoring, response, and learning. The point here is that each of the four cornerstones can affect the others in all sorts of ways, and those relationships ought to be explored.

I do think it’s helpful when looking at these pieces to understand that they can exist on a large time window as well as a small one. You might be tempted to view them in the context of infrastructure attributes in outage scenarios; this would be a mistake, because it narrows the perspective to what you can immediately influence. Instead, I think there’s a lot of value in looking at the cornerstones as a lens on the larger picture.

Of course going into each one of these in detail isn’t going to happen in a single epic too-long blog post, but I thought I’d mention a couple of things that I currently think of when I’ve got this perspective in mind.

Anticipation

This is knowing what to expect, and dealing with the potential for fundamental surprises in systems. This involves what Westrum and Adamski called “Requisite Imagination”, the ability to explore the possibilities of failure (and success!) in a given system. The process of imagining scenarios in the future is a worthwhile one, and I certainly think a fun one. The skill of anticipation is one area where engineers can illustrate just how creative they are. Cheers to the engineers who can envision multiple futures and sort them based on likelihood. Drinks all around to those engineers who can take that further and explain the rationale for their sorted likelihood ratings. Whereas monitoring deals in the now, anticipation deals in the future.
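To make that a little more concrete (with a toy example, not a recommendation of a formal method): even a crude, written-down ranking of imagined failure scenarios, with the rationale attached, beats likelihoods that live only in someone’s head. The scenarios, numbers, and field names below are entirely invented:

```python
# A toy "requisite imagination" exercise: enumerate imagined failure scenarios,
# guess at likelihood and impact, and keep the rationale for the guesses.
scenarios = [
    {"future": "payments provider API slows to a crawl during a sale",
     "likelihood": 0.30, "impact": 8,
     "rationale": "has happened twice at peak traffic in the past year"},
    {"future": "schema migration locks a hot table",
     "likelihood": 0.15, "impact": 6,
     "rationale": "our largest tables are now big enough for this to hurt"},
    {"future": "primary datacenter loses power",
     "likelihood": 0.02, "impact": 10,
     "rationale": "rare, but we have no tested failover plan"},
]

# Sort by expected pain, then talk about *why* the estimates look this way.
for s in sorted(scenarios, key=lambda s: s["likelihood"] * s["impact"], reverse=True):
    print(f'{s["likelihood"] * s["impact"]:4.1f}  {s["future"]}  ({s["rationale"]})')
```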

At Etsy we have a couple of tools that help with anticipation:

  • Architectural Reviews: These are meetings, open to all of engineering, that we hold to discuss a new pattern or a new type of technology being introduced, and why it’s needed. We gather up the people proposing the idea, and then spend time shooting holes in it with the goal of making the solution stronger than it would have been on its own. We also entertain what we’d do if the idea didn’t go according to plan. We take adopting new technologies very seriously, so this doesn’t happen very often.
  • Go or No-Go Meetings (a.k.a. Operability Reviews): These are where we gather up representative folks (at least someone from Support, Community, Product, and obviously Engineering) to discuss some fundamentals of a public-facing change, and walk through any contingencies that might need to happen. The trick is: in order to get contingencies as part of the discussion, you have to name the circumstances under which they’d come up.
  • GameDay Exercises: These are exercises where we validate our confidence in production by causing as many horrible things as we can to happen to components while they’re in production. Even asking if a GameDay is possible sparks enough conversation to be useful, and burning a piece to the ground to see how the system behaves when it does is always a useful task. We want no unique snowflakes, so being able to stand it up as fast as it can burn down is fun for the whole family. (A sketch of the shape of one such exercise follows this list.)
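To be clear, the sketch below isn’t our GameDay tooling; it’s just the shape such an exercise can take, with hypothetical host names, service name, and health-check URL. The script is the least interesting part. The value is in announcing the exercise, running it, and watching what the humans and the system do next:

```python
#!/usr/bin/env python
"""A deliberately simple GameDay-style exercise: stop a service on one host
in a pool, then check whether the pool as a whole keeps serving traffic."""

import subprocess
import time
import urllib.request

POOL = ["web01.example.com", "web02.example.com", "web03.example.com"]
VICTIM = POOL[0]                      # the host we'll break on purpose
HEALTH_URL = "https://www.example.com/health"

def stop_service(host):
    # In a real exercise, this is announced loudly and scheduled in advance.
    subprocess.run(["ssh", host, "sudo", "service", "myapp", "stop"], check=True)

def pool_is_healthy():
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    print(f"GameDay: stopping myapp on {VICTIM}")
    stop_service(VICTIM)
    time.sleep(30)                    # give load balancers / failover a moment
    print("pool still healthy:", pool_is_healthy())
```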

But anticipation isn’t just about thinking along the lines of “what could possibly go wrong?” (although that is always a decent start). It’s also about the organization, and how a team behaves when interacting with the machines. Recognizing when your adaptive capacity is failing is key to anticipation. David Woods has collected some patterns of anticipation worth exploring, many of which relate to a system’s adaptive capacity:

  • Recognize when adaptive capacity is failing. Example: Can you detect when your team’s ability to respond to outages degrades?
  • Recognize the threat of exhausting buffers or reserves. Example: Can you tell when your tolerances for faults are breached? When your team’s workload prevents proactive planning from getting done?
  • Recognize when to shift priorities across goal trade-offs. Example: Can you tell when you’re going to have to switch from greenfield development and focus on cleaning up legacy infra?
  • Able to make perspective shifts and contrast diverse perspectives that go beyond their nominal position. Example: Can Operations understand the goals of Development, and vice versa, and support them in the future?
  • Able to navigate interdependencies across roles, activities, and levels. Example: Can you foresee what’s going to be needed from different groups (Finance, Support, Facilities, Development, Ops, Product, etc.) and who in those teams needs to be kept up-to-date with ongoing events?
  • Recognize the need to learn new ways to adapt. Example: Will you know when it’s time to include new items in training incoming engineers, as failure scenarios and ways of working change in the organization and infrastructure?

I’m fascinated by the skill of anticipation, frankly. I spoke at Velocity Europe in Berlin last year on the topic.

Monitoring

This is knowing what to look for, and dealing with the critical in systems. Not just the mechanics of servers and networks and applications, but monitoring in the organizational sense. Anomaly detection and metrics collection and alerting are obviously part of this, and should be familiar to anyone expecting their web application to be operable.

But in addition to this, we’re talking as well about meta-metrics on the operations and activities of both infrastructure and staff.

Things like:

  • How might a team measure its cognitive load during an outage, in order to detect when it is drifting?
  • Are there any gaps that appear in a team’s coordinative or collaborative abilities, over time?
  • Can the organization detect when there are goal conflicts (example: accelerating production schedules in the face of creeping scope) quickly enough to make them explicit and do something about them?
  • What leading or lagging indicators could you use to gauge whether the performance demands on a team are beyond what could be deemed “normal” for its size and scale? (A small sketch of one such indicator follows this list.)
  • How might you tell if a team is becoming complacent with respect to safety, when incidents decrease? (“We’re fine! Look, we haven’t had an outage for months!”)
  • How can you confirm that engineers are being ramped up adequately to be productive and adaptive in a team?
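One small, concrete version of the “leading or lagging indicators” question above: watch how long it takes the team to acknowledge and resolve pages, month over month, and look at the trend rather than any single incident. The records and field names below are invented; the point is only that these meta-metrics can usually be computed from data a paging system already has:

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records, as might be exported from a paging system.
incidents = [
    {"opened": "2013-05-01T03:12", "acked": "2013-05-01T03:19", "resolved": "2013-05-01T04:02"},
    {"opened": "2013-05-09T14:40", "acked": "2013-05-09T14:44", "resolved": "2013-05-09T15:10"},
    {"opened": "2013-05-20T22:05", "acked": "2013-05-20T22:31", "resolved": "2013-05-21T00:12"},
]

FMT = "%Y-%m-%dT%H:%M"

def minutes_between(start, end):
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds() / 60

mtta = mean(minutes_between(i["opened"], i["acked"]) for i in incidents)
mttr = mean(minutes_between(i["opened"], i["resolved"]) for i in incidents)
print(f"mean time to acknowledge: {mtta:.0f} min, mean time to resolve: {mttr:.0f} min")
# Plotted month over month, a creeping MTTA can say more about the team's
# workload and capacity than it does about any single outage.
```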

Response

This is knowing what to do, and dealing with the actual in systems. Whether or not you’ve anticipated a perturbation or disturbance, as long as you can detect it, then you have something to respond to. How do you respond? Page the on-call engineer? Are you the on-call engineer? Response is fundamental to working in web operations, and differential diagnosis is just as applicable to troubleshooting complex systems as it is to medicine.
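For a flavor of what a differential-diagnosis-style response can look like when written down (a toy sketch, not a claim about any particular team’s runbook): keep an explicit list of candidate causes, and let evidence rule them out, rather than anchoring on the first plausible explanation:

```python
# Toy differential diagnosis: candidate causes and the evidence that would
# rule each one out. The hypotheses and checks here are invented.
hypotheses = {
    "bad deploy":        "did error rates change at the time of the last deploy?",
    "database overload": "are query latencies and connection counts elevated?",
    "network partition": "are errors concentrated in one datacenter or zone?",
    "cache stampede":    "did cache hit rates drop just before the errors began?",
}

ruled_out = set()

def record_check(hypothesis, still_plausible):
    """Note the result of a check; narrow the field as evidence comes in."""
    if not still_plausible:
        ruled_out.add(hypothesis)
    remaining = [h for h in hypotheses if h not in ruled_out]
    print(f"checked '{hypothesis}'; still plausible: {remaining}")

record_check("bad deploy", still_plausible=False)          # deploy was hours earlier
record_check("network partition", still_plausible=False)   # errors seen everywhere
# ...and so on, until the remaining hypotheses are worth acting on.
```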

Pitfalls in responding to surprising behaviors in complex systems have exotic and novel characteristics. They are the things that make Post-Mortem meetings dramatic; they can often include stories of surprising turns of attention, focus, and results that make troubleshooting more of a mystery than anything. Dietrich Dörner, in his 1980 article “On The Difficulties People Have In Dealing With Complexity”, gave some characteristics of response in escalating scenarios. These might sound familiar to anyone who has experienced team troubleshooting during an outage:

…[people] tend to neglect how processes develop over time (awareness of rates) versus assessing how things are in the moment.
…[people] have difficulty in dealing with exponential developments (hard to imagine how fast things can change, or accelerate)
…[people] tend to think in causal series (A, therefore B), as opposed to causal nets (A, therefore B and C, therefore D and E, etc.)

I was lucky enough to talk a bit more in detail about Resilient Response In Complex Systems at QCon in London this past year.

Learning

This is knowing what has happened, and dealing with the factual in systems. Everyone wants to believe that their team or group or company has the ability to learn, right? A hallmark of good engineering is empirical observation that results in future behavior changes. Like I mentioned above, this is the place where Post-Mortems usually come into play. At this point I think our field ought to be familiar with Post-Mortem meetings and their general structure and goal: to glean as much information as possible about an incident, an outage, a surprising result, a mistake, etc., and spread those observations far and wide within the organization in order to prevent those events from happening in the future.

I’m obviously a huge fan of Post-Mortems and what they can do to improve an organization’s behavior and performance. But a lesser-known tool for learning is the “near miss”: opportunities we see in normal, everyday work. An engineer performs an action, and realizes later that it was wrong or somehow produced a surprising result. When those happen, we can hold them up high, for all to see and learn from. Did they cause damage? No, that’s why they “missed.”

One of the godfathers of cognitive engineering, James Reason, said that “near-miss” events are excellent learning opportunities for organizations, because they:

  1. They can act like safety “vaccines” for an organization, because they are just a little bit of failure that doesn’t really hurt.
  2. They happen much more often than actual systemic failures, so they provide a lot more data on latent failures.
  3. They are a powerful reminder of hazards, and so help maintain the “constant sense of unease” that is needed to provide resilience in a system.

I’ll add that encouraging engineers to share the details of their near-misses has a positive side effect on the culture of the organization. At Etsy, you will see (from time to time) an email to the whole team from an engineer that has the form:

Dear Everybody,

This morning I went to do X, so I did Y. Boy was that a bad idea! Not only did it not do what I thought it was going to, but also it almost brought the site down because of Z, which was a surprise to me. So whatever you do, don’t make the same mistake I did. In the meantime, I’m going to see what I can do to prevent that from happening.

Love,

Joe Engineer

For one, it provides the confirmation that anyone, at any time, no matter their seniority level, can make a mistake or act on faulty assumptions. The other benefit is that it sends the message that admitting to making a mistake is acceptable and encouraged, and that people should feel safe in admitting to these sometimes embarrassing events.

This last point is so powerful that it’s hard to emphasize it enough. It’s related to encouraging a Just Culture, something that I wrote about recently over at Code As Craft, Etsy’s engineering blog.

The last bit I wanted to mention about learning is purposefully not incident-related. One of the basic tenets of Resilience Engineering is that safety is not the absence of incidents and failures; it’s the presence of actions, behaviors, and culture (all along the lines of the four cornerstones above) that causes an organization to be safe. Learning only from failures means that the surface area to learn from is not all that large: most organizations see successes much, much more often than they see failures.

One such focus might be changes. If, across 100 production deploys, you had 9 change-related outages, which should you learn from? Should you be satisfied to look at those nine, hold post-mortems, and then move forward, safe in the idea that you’ve rid yourself of the badness? Or should you also look at the 91 deploys that went fine, and gather some hypotheses about why they ended up OK? You can learn from 9 events, or from 91. The argument here is that you’ll be safer by learning from both.
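As a toy illustration of that 9-versus-91 point (the deploy log below is fabricated): the mechanics of looking at successes can be as simple as putting a sample of them into the same review queue as the failures, and asking the same questions of both.

```python
import random

# A fabricated deploy log: 100 deploys, 9 of which were linked to outages.
deploys = [{"id": n, "caused_outage": n % 11 == 0} for n in range(1, 101)]

failures = [d for d in deploys if d["caused_outage"]]
successes = [d for d in deploys if not d["caused_outage"]]

# Review every failure, plus a same-sized sample of the deploys that went fine,
# and ask both groups the same question: why did it go the way it did?
review_queue = failures + random.sample(successes, k=len(failures))
print(f"reviewing {len(failures)} failures and {len(failures)} quiet successes "
      f"out of {len(deploys)} deploys")
```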

So in addition to learning from why things go wrong, we ought to learn just as much from why things go right. Why didn’t your site go down today? Why aren’t you having an outage right now? The answer is likely a huge number of influences and reasons, all worth exploring.

….

Ok, so I lied. I didn’t expect this to be such a long post. I do find the four cornerstones to be a good lens with which to think and speak about resilience and complex systems. They’re as much a part of my vocabulary now as OODA and CAP are. 🙂

The Devil’s In The Details

Posted 8 CommentsPosted in Cognitive Systems Engineering, Complex Systems, Naturalistic Decision Making, WebOps

I’m a firm believer that context is everything, and that it’s needed in every constructive conversation we want to have as engineers.

As a nascent (but adorable) engineering field, we discuss (in blogs, books, meetups, conferences, etc.) success and failure in a number of areas, including the ways in which we work. We don’t just build complex systems, we are a complex system. In order to get our work done, we have to successfully bring together people and skills from diverse backgrounds. When we reach large scale, we have to enlist deep and diverse domain expertise across our staff.

But sometimes, we can get frustrated or bogged-down in the details of these interactions between groups and individuals. We can feel like the blunt end (management) doesn’t understand the sharp end (practitioners), or we can feel as though one group doesn’t understand the goals, concerns, or tradeoffs of another, or we simply aren’t doing a good enough job of enabling people to have constructive conversations.

As usual, we’re not the only people who might be interested in how people work together, especially in combination with machines. There’s a great chapter in the 2005 volume Organizational Simulation that outlines the concept of a “Basic Compact” that people have with each other when engaged in joint activity. From Common Ground and Coordination In Joint Activity:

People engage in joint activity for many reasons: because of necessity (neither party, alone, has the required skills or resources), enrichment (while each party could accomplish the task, they believe that adding complementary points of view will create a richer product), coercion (the boss assigns a group to carry out an assignment), efficiency (the parties working together can do the job faster or with fewer resources), resilience (the different perspectives and knowledge broaden the exploration of possibilities and cross check to detect and recover from errors) or even collegiality (the team members enjoy working together).

We propose that joint activity requires a “Basic Compact” that constitutes a level of commitment for all parties to support the process of coordination. The Basic Compact is an agreement (usually tacit) to participate in the joint activity and to carry out the required coordination responsibilities. Members of a relay team enter into a Basic Compact by virtue of their being on the team; people who are angrily arguing with each other are committed to a Basic Compact as long as they want the argument to continue.

That first reason people engage in joint activity, necessity (neither party, alone, has the required skills or resources), points to why, even as successful companies develop and employ generalists, there is still an advantage to bringing people with differing domain expertise together. An example of this might be “development” and “operations”, or “design” and “product”, or “finance” and “business development”, or “public relations” and “community”, etc.

Regardless, two of the authors are responsible for some of the best writing on Cognitive Systems Engineering and Naturalistic Decision Making: Dr. David Woods and Gary Klein, respectively. There are a lot of great bits in here: the costs of coordination, spoken and assumed behaviors, and everyone’s favorite topic in engineering, automation and its effects on engineering behavior. Come on, what’s not to like in this paper?

The chapter is here as a PDF, so you’d best get it for your weekend reading! 🙂