My Next Step

Posted in Uncategorized

After leaving my CTO role at Etsy this past May, I took the summer off to spend time with my family, enjoy New York summertime, and give my mind time to refresh and recharge.

I thought long and hard about what I wanted to do next; the last time I took a new job was in 2009.

Since I had been an instigator in forming the SNAFUCatchers consortium (technical name: Resilient Business-Critical Software Operations), I helped put together the Stella Report. So it may come as no surprise that I’m going to continue bridging the world of software engineering with cognitive systems engineering, human factors, systems safety, and resilience engineering.

Along with Dr. Richard Cook and Dr. David Woods, I’m launching a company we’re calling Adaptive Capacity Labs, and it’s focused on helping companies (via consulting and training) build their own capacity to go beyond the typical “template-driven” postmortem process and treat post-incident review as the powerful insight lens it can be.

I could not be more excited. Not only do I believe we can help businesses in real meaningful ways, I sincerely believe that doing this work is good for the industry as a whole. I also could not be happier to work alongside David and Richard.

It’s time we start taking human performance seriously.

Invited article in IEEE Software – Technical Debt: Challenges and Perspectives

Posted in Cognitive Systems Engineering, Complex Systems, Uncategorized

Earlier this year, I was asked to contribute to an article in IEEE Software, entitled “Technical Debt: Challenges and Perspectives.”

I can’t post the entire article here, but I can post the accepted text of my part of it.

Misusing the Metaphor

John Allspaw

All technical disciplines (not just software development) require different shorthand terms or labels to represent complicated concepts. Metaphors are a popular type of shorthand; in software development, technical debt is quite a curious one that’s long overdue for revisiting. When Ward Cunningham originally coined the term “technical debt,” he used it as a metaphor to capture explicit decisions to write software such that it allows short-term progress to be “repaid” in the long term, through refactoring.

Since this coinage, however, the term has taken on many shapes and forms. Although many people continue to use the term in the way that Cunningham intended, many more use it to represent some amount of code that’s discovered (sometime later) to have faults that can, with the benefit of hindsight, be attributed to taking “shortcuts.” In this case, we can say that such code results not from explicit decisions but from the normal, everyday tradeoff judgments developers make.

If we write software to reflect our current understanding of the problem it’s intended to solve, the environment it will live in, and the set of potential conditions in which it’s expected to execute, we’ll make these judgment calls. Some of these judgments will be accurate enough to be called “correct”; some won’t. In other words, technical debt has come to mean a type of counterfactual judgment about code quality and can generate a koan-like question:

can you have technical debt without knowing you have it?

My main argument isn’t that technical debt’s definition has morphed over time; many people have already made that observation. Instead, I believe that engineers have used the term to represent a different (and perhaps even more unsettling) phenomenon: a type of debt that can’t be recognized at the time of the code’s creation. They’ve used the term “technical debt” simply because it’s the closest descriptive label they’ve had, not because it’s the same as what Cunningham meant. This phenomenon has no countermeasure like refactoring that can be applied in anticipation, because it’s invisible until an anomaly reveals its presence.

We need a new term to describe this, because it’s not “technical debt” in the classic sense.
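Stepping outside the article text for a moment, the contrast can be made concrete with a toy sketch (everything below is hypothetical, invented purely for illustration): in Cunningham’s original sense, the shortcut is an explicit, recorded decision, and the “repayment” happens later through refactoring.

```python
# Deliberate shortcut: hard-code the two currencies we launch with,
# knowing we'll need a real lookup table soon. This is "debt" in
# Cunningham's sense -- an explicit, recorded decision.
def to_cents(amount, currency):
    # TODO(repay): replace with a per-currency minor-unit table
    if currency in ("USD", "EUR"):
        return round(amount * 100)
    raise ValueError(f"unsupported currency: {currency}")


# The "repayment" via refactoring, once the pressure is off: the
# assumption that every currency has two decimal places is removed.
MINOR_UNITS = {"USD": 2, "EUR": 2, "JPY": 0, "KWD": 3}


def to_minor_units(amount, currency):
    if currency not in MINOR_UNITS:
        raise ValueError(f"unsupported currency: {currency}")
    return round(amount * 10 ** MINOR_UNITS[currency])
```

The second phenomenon described above, by contrast, leaves no `TODO` marker to find: the faulty assumption is invisible until an anomaly reveals it.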

The cliff-hanger is that since I wrote the article, I’ve been working with some folks on what that new term could be. Stay tuned. 🙂

UPDATE: the new term is “dark debt” – via the Stella Report.


For citation purposes:

B. Stopford, K. Wallace and J. Allspaw, “Technical Debt: Challenges and Perspectives,” in IEEE Software, vol. 34, no. 4, pp. 79-81, 2017.
doi: 10.1109/MS.2017.99
keywords: {software engineering;software development;Best practices;Computer architecture;Information technology;Parallel processing;Software engineering;Software measurement;Ben Stopford;John Allspaw;Ken Wallace;software development;software engineering;technical debt},

Multiple Perspectives On Technical Problems and Solutions

Posted in Architecture, Complex Systems, Culture, Etsy

Over the years, a number of people have asked about the details surrounding Etsy’s architecture review process.

In this post, I’d like to focus on the architecture review working group’s role in facilitating dialogue about technology decision-making. Part of this is really just about working groups in general (pros, cons, formats, etc.) and another part of it relies on the general philosophy that Dan McKinley lays out in his post/talk, so I’d consider reading those first and then coming back here.

Fundamental: engineering decision-making is a socially constructed activity

But first, we need to start with a bit of grounding theory. In 1985, Billy Vaughn Koen (now professor emeritus at the University of Texas) introduced the idea of the “Engineering Method,” which I’ve written about before. The key idea is his observation that engineers are defined not by what they produce, but by how they do their work. He describes the Engineering Method succinctly, saying that it is:

“the strategy for causing the best change in a poorly understood or uncertain situation within the available resources”

Note the normative terms best and poorly.

In other words, engineering (as an activity) does not have “correct” solutions to problems. As an aside, if you’re looking for correct solutions to problems, I’d suggest that you go work in a different field (like mathematics); engineering will likely frustrate you.

I wholeheartedly agree with this idea, and I’d take it a bit further and say that successful engineering teams find solutions to problems largely through dialogue. By this I mean, they engage in various forms of:

  • Describing the problem they believe needs solving. This may or may not be straightforward to explain or describe, so a back-and-forth is usually needed for a group to get a full grasp of it.
  • Generating hypotheses about whether or not the problem(s) being described need to be solved in more or less complete or exhaustive ways. Some problem descriptions might be really narrow, or really broad, some problems don’t have to be fully “solved” all at once, etc. Will the problem exist in perpetuity?
  • Evaluating options for solutions. What are the pros and cons? Can a group that needs to support a solution sustain or maintain said solution? Will some solutions have an “expiration date”? What possible unintended consequences could result in upstream/downstream systems that depend on this solution?
  • Building confidence in implementing a given solution. How uncertain (and in what ways) is the group in how the solution may play out in positive or negative terms?
  • Etc.

I realize this should be obvious, but I include this perspective here because I’m continually surprised how difficult this is to believe and understand when the topic is companies choosing to use a particular piece of technology (framework, language, architecture, etc.) or not.

Once you can grok the concept that engineering decisions are constructed socially, then you can understand how I believe the set and setting of an “architecture review” is critical.

The Concept and The Intention

When Kellan and I first came to Etsy in 2010, we put in place a process whereby an engineer (or a group) looking to implement something new (different from what we had already been doing or using) would present the problem they were trying to solve and why they believed existing solutions at Etsy wouldn’t be ideal to solve it. We called this an architecture review.

We called these “something new” things departures. Again, Dan’s post/talk goes over this, but the basic idea is that there are costs (many of which can be both hidden and surprising on many fronts) for making departures that attempt to solve a local problem without thinking about the global effects they can have on an organization.

Departures can include things such as:

  • writing a function or feature in a language that isn’t in wide usage at the company already
  • redesigning a function or feature (even in a language already widely-used)
  • introducing a pattern or architecture not already in use
  • using new server software for storing, caching, or retrieving data not already being used
  • essentially, anything that had the potential to surprise and challenge the team when it (eventually) broke

Those bullets above are pretty fuzzy, I’m sure you noticed. So how did you know when you should ask for an architecture review? Ask your coworkers. If you’re debating back and forth whether your thing needs one, then you likely do.

So what was this architecture review meeting all about? It was pretty simple, really. It was just a presentation by the engineer(s) looking for feedback on a solution they came up with, and Kellan and I would ask questions. The meeting was open to anyone in the company who wanted to attend. The hope was that many engineers would attend, actually. The intent here was to help the engineers describe their problem as thoroughly as they can, and by asking questions, we could help draw out any tacit assumptions they made in thinking through the problem. In a nutshell: it was a critical-thinking exercise, and we wanted engineers to want to do this, because it was ideally about providing feedback.

The gist of it could be described like this, from the point of view of an engineer wanting feedback:

“Hey everybody – check this out: I believe I have to solve problem X, and I’ve thought about using A, B, and C to do it, but they all don’t seem ideal. I think I have to go with departure Y as the solution. Please try to talk me out of this.”

This approach leads to some really good prompts for the dialogue I mentioned above. Prompts such as:

  • “Is it possible I don’t even have to solve problem X?”
  • “Am I missing something with A, B, or C? Is it possible that they’ll work well enough?”
  • “If I go with Y, what do we need to do to feel confident about supporting Y as the solution for this type of problem?”

At a high level, we were reasonably successful with this approach in those early days, in that we were able to get engineers to come, talk about the judgments and assumptions they were making, and entertain questions and suggestions about the details of both their understanding of the problem and potential solutions. Even though it began with mostly Kellan and me asking questions, we actively encouraged others to as well. Slowly, they did, and it became a really strong source of confidence for the team, collectively.

There were some really surprising results by doing these reviews. More than once, someone in the room would recognize the problem being presented, and relay that they had also wrestled with and solved an almost identical problem in the past in a different part of the codebase. With a few minor changes, they said, it could work for the problem at hand, instead of reinventing some new stuff. Great!

One time, an engineer walked through a reasonably complicated diagram outlining how they were going to collect, summarize, store, and act on data to power a new feature. Their solution involved not only putting a new language into the critical path of the application but introducing a new (and at the time relatively immature) datastore to the stack as well. After a few questions back and forth, it became clear that the data they needed already existed and they were only one SQL query and one cron job away from implementing the new feature.
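For a sense of scale, the kind of replacement solution that emerged might look something like this sketch (the schema and names here are entirely hypothetical, not Etsy’s actual ones): a single aggregation query over data the application already writes, run periodically by cron, instead of a new language plus a new datastore in the critical path.

```python
import sqlite3

# Hypothetical stand-in for the already-running application database.
# In production, cron would invoke this script on a schedule, e.g.:
#   */15 * * * * /usr/bin/python summarize_favorites.py


def summarize(conn):
    # Materialize one aggregation over existing data into a small
    # summary table that the new feature can read directly.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS listing_fav_counts (
               listing_id INTEGER PRIMARY KEY,
               fav_count  INTEGER NOT NULL
           )"""
    )
    conn.execute("DELETE FROM listing_fav_counts")
    conn.execute(
        """INSERT INTO listing_fav_counts (listing_id, fav_count)
           SELECT listing_id, COUNT(*)
           FROM favorites
           GROUP BY listing_id"""
    )
    conn.commit()
```

The point of the anecdote isn’t the query itself; it’s that the dialogue surfaced an existing, boring path before a new stack got built.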

Those sort of outcomes didn’t happen often. But when they did, it was quite satisfying.

Other times, an engineer would present, dialogue would ensue, and it would become clear that from multiple perspectives, going with the departure (the new and novel suggestion) appeared to be the best route.

Overall, I’d say that we got a lot of benefits from this early architecture review practice. Engineers starting out in their careers got good practice presenting their work and their thinking, and veteran engineers had an opportunity to share knowledge and expertise they otherwise wouldn’t have had. Like I mentioned, sometimes engineers got to save a lot of time and headache by going a perhaps simpler route. From a leadership perspective, my confidence in the org’s ability to talk constructively about design increased.

How the multiple perspective dialogue evolved

However, there was one problem: when you’re a CTO and an SVP, you can’t be surprised when people come to meetings when you invite them. I often wondered if there were questions and opinions that weren’t being said because of the power dynamic in the room. I knew that unless a critical mass of the engineers in the room demonstrated the ethos of the practice (that is, the creation and support of a critical-thinking dialogue meant to help engineers, both individually and organizationally) then there would be a good chance it would devolve into a sort of hierarchical “American Idol”-style panel of judgement and heavy-handed dictation of policy.

The idea, of course, was that better decision-making came from people showing up and feeling comfortable asking questions that could potentially be seen (in a different environment) as “dumb” or naive, and from the presenter(s) hearing critique as comments on the design, not as comments on their skills. This also meant that the origin of the design (what problem it was intended to solve at the time, what people’s concerns were at the time, etc.) could be recorded.

The more the organization grew, the harder it became for a single person (even the CTO or an SVP) to sustain this approach and the assumptions that engineers brought with them into architecture reviews.

The beginning of a particular kind of working group

So, an Architecture Review Working Group, or “ARWG,” was developed. The main idea was that such a group could keep the practice going without senior engineering leadership bringing an authoritarian flavor to the meetings, and could continually model the behavior the practice needed in order to encourage the multiple perspectives that departures require.

A small group of 4 or 5 people was formed. The original engineers came from product, infrastructure, and operations teams, but engineers from other teams joined later, and at some point some members rotated in or out.

The group’s charter was basically the same, at a high level, as what we intended with those early architecture reviews: provide a stable and welcoming environment where engineers can openly describe how they’re thinking about solving problems with potentially new and novel technology and patterns. This environment needs to support questions and comments from the rest of the organization about trade-offs, assumptions, constraints, pressures, maintenance, and the onus that comes with departures. And finally, the group documents how decisions around departures were arrived at, so that a sort of anthropological artifact exists to inform future decisions and dialogues.

You might be thinking: “but where and when does a decision get made?”

The ARWG’s role was not to make a decision.

The ARWG’s role was to create and sustain the conditions where a dialogue can take place, with as many perspectives on both the problem and solution as can be had within a given period of time, and then to facilitate a discussion around a decision to be made.

At this point I’d like to make a semantic distinction between “dialogue” and “discussion,” and I’m going to pull from the previous post, where I suggested the book “Dialogue: The Art Of Thinking Together”:

Dialogue is about exploring the nature of choice. To choose is to select among alternatives. Dialogue is about evoking insight, which is a way of reordering our knowledge— particularly the taken-for-granted assumptions that people bring to the table.

Discussion is about making a decision. Unlike dialogue, which seeks to open possibilities and see new options, discussion seeks closure and completion. The word decide means “to resolve difficulties by cutting through them.” Its roots literally mean to “murder the alternative.”

The ARWG’s role is to facilitate dialogue first, then discussion. The key idea is to shed as much light via “open and curious minds” on the departure (problem and solution) before then getting into an evaluation phase of options. The dialogue is intended to bring attention to the departure’s details (and assumed necessity) first, and only then can a discussion about the merits of different options take place.

In my experience, when an architecture review brings attention to a problem and proposed solutions from multiple perspectives, decisions become less controversial. When a decision appears to be obvious to a broad group (“Question: should we (or should we not) take backups of critical databases? Decision: Yes.”) how a decision gets made almost disappears.

It’s only when there isn’t universal agreement about a decision (or even if a decision is necessary) that the how, who, and when a decision gets made becomes important to know. The idea of an architecture review is to expose the problem space and proposed departure ideas to dialogue in a broad enough way that confusion about them can be reduced as much as possible. Less confusion about the topic(s) can help reduce uncertainty and/or anxiety about a solution.

Now, some pitfalls exist here:

  • Engineers (both presenting and participating as the audience) need to understand that the purpose of the architecture review is to develop better outcomes. That’s it. It’s not to showcase their technical prowess.
  • If nothing but a focus on “but who makes the ultimate decision?” shows up, this is a signal that critique and feedback (again, on both the problem and the solutions) isn’t really wanted, and that engineers think their departure ideas should get a pass from critique for whatever reason. Asking about those reasons is useful.
  • Without a strong and continual emphasis on “critique the code, not the coder” this approach can (and will, I can guarantee it) devolve into episodes of defensiveness on multiple fronts. First and foremost, engineers who are looking for feedback on their ideas of a departure need to see it as part of their role as a mature engineer.

Sometimes, you might find an engineer who is so incredibly enthusiastic about the solution they’ve developed for a given problem that they treat the presentation as a “sales pitch” more than an invitation for feedback. The good news is that this is relatively straightforward to detect when it happens. The bad news is that it means the purpose of the review isn’t universally clear.

Other times, you might find a group of engineers responsible for developing a solution seeing themselves as separate from the group responsible for maintaining it. This authority-responsibility “double bind” reveals itself in even the least siloed organizations. In this case, congratulations! The architecture review was able to bring potential elephants-in-the-room to the table.

In almost every case, no matter the result of an architecture review, there will always be lingering shades of doubt in people’s minds about whether taking a departure (or not) was a good decision. These lingering shades are normal.

Welcome to engineering, where the solving of problems boils down to a “strategy for causing the best change in a poorly understood or uncertain situation within the available resources.”

While I cannot state that taking this approach is an airtight way of always making the best decisions when it comes to technical departures, I can say this: without getting multiple perspectives from different groups on a technical departure, as this approach provides, you’re all but guaranteeing suboptimal decisions.

So the next time you are so certain that a particular new way of doing something is better and the rest of your organization should get behind your idea, I would tell you this:

“Excellent, it sounds like you have a hypothesis! We are gonna do an architecture review. If it’s as obvious a solution as you think it is, it should be easy for the rest of the org to come to the same conclusion, and that will make implementing and maintaining it that much easier. If it has some downsides that aren’t apparent, we will at least have a chance to tease those out!”

Book Suggestion: “Dialogue: The Art Of Thinking Together”

Posted in Culture
Dig this. It’s a diagram from a great book called “Dialogue: The Art Of Thinking Together”

I’m reading a book that was suggested to me by the Director of the Office of Learning in the US Forest Service as “required reading” for any modern organization that intends to learn: Dialogue: The Art Of Thinking Together.

As a teaser, William Isaacs makes a very good case for seeing discussion as distinct from dialogue.

He gives one way of thinking about the two:

Many of us believe that truth emerges after we take two conflicting ideas and somehow smash them together. The resulting spark is supposed to shed light on the situation. But more often than not, what actually happens is that one party simply beats the other down. A discussion attempts to get people to choose one of two alternatives. A dialogue helps to surface the alternatives and lay them side by side, so that they can all be seen in context.

Discussion is not always without merit though. Done well, it provides the benefit of breaking things into parts in order to understand them more clearly. A “skillful discussion” seeks to find some order among the particles while they are still “hot.” It involves the art of putting oneself in another’s shoes, of seeing the world the way she sees it. In skillful discussion, we inquire into the reasons behind someone’s position and the thinking and the evidence that support it. As this kind of discussion progresses, it can lead to a dialectic, the productive antagonism of two points of view. A dialectic pits different ideas against one another and then makes space for new views to emerge out of both.

Later on in the same chapter is this nugget:

Discussion is about making a decision. Unlike dialogue, which seeks to open possibilities and see new options, discussion seeks closure and completion. The word decide means “to resolve difficulties by cutting through them.” Its roots literally mean to “murder the alternative.”

Dialogue is about exploring the nature of choice. To choose is to select among alternatives. Dialogue is about evoking insight, which is a way of reordering our knowledge— particularly the taken-for-granted assumptions that people bring to the table.

I’m not all the way through the book, but it’s quite good thus far. 🙂

Abstract As A Verb

Posted in Uncategorized

The New Stack has an interview with me on various topics here.

I think the following part of the interview gets at what I think is an under-investigated bit of language and meaning:

TNS: At the same time, I imagine that you’ve abstracted a lot of the supporting infrastructure away from the engineer. They don’t have to worry about the particular configuration of the supporting stack?

JA: Yes and no. And I think it really is a common expectation — that abstracting away. The difference is, are you abstracting away so that you truly can say “I don’t have to worry about this”? Or are you abstracting away because you’re aware of those guts, but want to focus your attention right now in this area. That is what we’re looking for.

Post-mortem debriefings every day are littered with the artifacts of people insisting, the second before an outage, that “I don’t have to care about that.”

If “abstracting away” is nothing for you but a euphemism for “Not my job,” “I don’t care about that,” or “I’m not interested in that,” I think Etsy might not be the place for you. Because when things break, when things don’t behave the way they’re expected to, you can’t hold up your arms and say “Not my problem.” That’s what I could call “covering your ass” engineering, and it may work at other companies, but it doesn’t work here.

And the ironic part is that we find, in reality, engineers are more than willing to want to know. I’ve never heard an engineer not wanting to know more about networking. I’ve never heard an engineer wanting to say “You know what, I don’t want to care about database schema design.” And so if the reality is that people do care, then it’s kind of a waste of time to pretend that we’re “abstracting away”. Because you’re going to not care up until the absolute second you do, and when you do, that’s all you want to care about.

Architectural Folk Models

Posted in Architecture, Culture

I’m going to post the contents of a gist I wrote (2 years ago?!), because Theo is right, some gists are better as posts. The context for this was a debate on Twitter (which, as always, is about as elegant and pleasing to read as a turtle trying to breakdance). 

Summing up contextual influence on systems architecture

1. Monolithic applications and architectures can vary in their monolith-ness. This is an under-specified description.

2. Microservice applications and architectures can vary in their micro-ness. This is an under-specified description.

3. Both microservices and monolithic architectures have both benefits and disadvantages that are contextual.

4. Successful organizations will exploit those benefits while working around any weaknesses.

5. Success of a business is a large influence on the exploitation of benefits and implementation and costs of workarounds.

6. All benefits and workarounds are context-sensitive. Meaning that they are both technically and socially constructed by the organization that navigates them.

7. Path dependency is a thing. History matters and manifests in these architectural decisions and evolution in an organization.

8. Patterns exist to inform practice, not dictate it. Zealous adherence to an architectural pattern brings peril when it is to the exclusion of cultural and social context in actual practice.

9. Architectural patterns will expand, contract, evolve, and change in multiple ways to fit the trade-offs that an organization perceives it has to make, at the time they make them.

Much has been said about this, including some more by me, since then, but apparently it is not a dead topic and I figured I should grab it off of the gist system. 🙂

In the end, I consider architectural patterns to be folk models. Meaning that in popular dialogue, they tend to:

  • substitute one label for another, rather than decomposing a large construct into more measurable specifics (when I say “microservice” to you, how can we be sure your understanding of the term is the same as mine without being more specific?)

  • are immune to falsification (how do I look at an architecture and decide when it’s no longer a monolith, in a way that is universally true?)

  • easily get over-generalized to situations they were never meant to speak about (when we talk about microservices, how do I know when we are no longer talking about technical specifications and have started talking about organizational design?)

Much thanks to Hollnagel and Dekker for introducing me to the concept of folk models.

Reflections on the 6th Resilience Engineering Symposium

Posted in Cognitive Systems Engineering, Complex Systems, Resilience, Systems Safety, Talks

I just spent the last week in Lisbon, Portugal at the Resilience Engineering Symposium. Zoran Perkov and I were invited to speak on the topic of software operations and resilience in the financial trading and Internet services worlds, to an audience of practitioners and researchers from all around the globe, in a myriad of industries.

My hope was to start a dialogue about the connections we’ve seen (and to hopefully explore more) between practices and industries, and to catch theories about resilience up to what’s actually happening in these “pressurized and consequential”1 worlds.

I thought I’d put down some of my notes, highlights and takeaways here.

  • In order to look at how resilience gets “engineered” (if that is actually a thing) we have to look at adaptations that people make in the work that they do, to fill in the gaps that show up as a result of the incompleteness of designs, tools, and prescribed practices. We have to do this with a “low commitment to concepts”2 because otherwise we run the risk of starting with a model (OODA? four cornerstones of resilience? swiss cheese? situation awareness? etc.) and then finding data to fill in those buckets. Which can happen unfortunately quite easily, and also: is not actually science.


  • While I had understood this before the symposium, I’m now even clearer on it: resilience is not the same as fault-tolerance or “graceful degradation.” Instead, it’s something more, akin to what Woods calls “graceful extensibility.”


  • The other researchers and practitioners in ‘safety-critical’ industries were very interested in what approaches such as continuous deployment/delivery might look like in their fields. They saw them as a set of evolutions from waterfall that Internet software has made, allowing it to be flexible and adaptive in the face of uncertainty about how the high-level system of users, providers, customers, operations, performance, etc. will behave in production. This was their reflection, not my words in their mouths, and I really couldn’t agree more. Validating!


  • While financial trading systems and Internet software have some striking similarities, the differences are stark. Zoran and I are both jealous of each other’s worlds in different ways. Also: Zoran can quickly scare the shit out of an audience filled with pension and retirement plans. 🙂


  • The lines between words (phases?) such as: design-implementation-operations are blurred in worlds where adaptive cycles take place, largely because feedback loops are the focus (or source?) of the cycles.


  • We still have a lot to do in “software operations”3 in that we may be quite good at focusing on and discussing software development and practices, alongside the computer science concepts that influence those things, but we’re not yet good at exploring what we can find about our field through the lenses of social science and cognitive psychology. I would like to change that, because I think we haven’t gone far enough in being introspective on those fronts. I think we might currently only be flirting with those areas. Dropping a Conway’s Law here and a cognitive bias there is a good start. But we need to consider that we might not actually know what the hell we’re talking about (yet!). However, I’m optimistic on this front, because our community has both curiosity and a seemingly boundless ability to debate esoteric topics with each other. Now if we can only stop doing it in 140 characters at a time… 🙂


  • The term “devops” definitely has analogues in other industries. At the very least, the term brought vigorous nodding as I explained it. Woods used the phrase “throw it over the wall” and it resonated quite strongly with many folks from diverse fields. People from aviation, maritime, patient safety…they all could easily give a story that was analogous to “worked fine in dev, ops problem now” in their worlds. Again, validating.


  • There is no Resilience Engineering (or Cognitive Systems Engineering or Systems Safety, for that matter) without real dialogue about real practice in the world. In other words, there is no such thing as purely academic here. Every “academic” here viewed their “laboratories” as cockpits, operating rooms and ERs, control rooms in mission control and nuclear plants, and the bridges of massive ships. I’m left thinking that for the most part, this community abhors the fluorescent-lighted environments of universities. They run toward potential explosions, not away from them. Frankly, I think our field of software has a much larger population of the stereotypical “out-of-touch” computer scientist whose ideas in papers never see the light of production traffic. (hat tip to Kyle for doing the work of real-world research on what were previously known as academic theories!)



1 Richard Cook’s words.

2 David Woods’ words. I now know how important this is when connecting theory to practice. More on this topic in a different post!

3 This is what I’m now calling what used to be known as “WebOps” or what some refer to as ‘devops’ to reflect that there is more to software services that are delivered via the Internet than just the web, and I’d like to update my language a bit.

Some Principles of Human-Centered Computing

Posted 2 CommentsPosted in Cognitive Systems Engineering, Complex Systems

From Perspectives On Cognitive Task Analysis: Historical Origins and Modern Communities of Practice
(emphasis mine)

The Aretha Franklin Principle Do not devalue the human to justify the machine. Do not criticize the machine to rationalize the human. Advocate the human–machine system to amplify both.
The Sacagawea Principle Human-centered computational tools need to support active organization of information, active search for information, active exploration of information, reflection on the meaning of information, and evaluation and choice among action sequence alternatives.
The Lewis and Clark Principle The human user of the guidance needs to be shown the guidance in a way that is organized in terms of their major goals. Information needed for each particular goal should be shown in a meaningful form and should allow the human to directly comprehend the major decisions associated with each goal.
The Envisioned World Principle The introduction of new technology, including appropriately human-centered technology, will bring about changes in environmental constraints (i.e., features of the sociotechnical system or the context of practice). Even though the domain constraints may remain unchanged, and even if cognitive constraints are leveraged and amplified, changes to the environmental constraints will impact the work.
The Fort Knox Principle The knowledge and skills of proficient workers is gold. It must be elicited and preserved, but the gold must not simply be stored and safeguarded. It must be disseminated and used within the organization when needed.
The Pleasure Principle Good tools provide a feeling of direct engagement. They simultaneously provide a feeling of flow and challenge.
The Janus Principle Human-centered systems do not force a separation between learning and performance. They integrate them.
The Mirror–Mirror Principle Every participant in a complex cognitive system will form a model of the other participant agents as well as a model of the controlled process and its environment.
The Moving Target Principle The sociotechnical workplace is constantly changing, and constant change in environmental constraints may entail constant change in cognitive constraints, even if domain constraints remain constant.


An Open Letter To Monitoring/Metrics/Alerting Companies

Posted 15 CommentsPosted in Cognitive Systems Engineering, Tools, WebOps

I’d like to open up a dialogue with companies who are selling X-As-A-Service products that are focused on assisting operations and development teams in tracking the health and performance of their software systems.

Note: It’s likely my suggestions below are understood and embraced by many companies already. I know a number of them who are paying attention to all the areas I would want them to, and/or making sure they’re not making claims about their product that aren’t genuine.

Anomaly detection is important. It can’t be overlooked. We as a discipline need to pay attention to it, and continually get better at it.

But for the companies who rely on your value-add selling point(s) as:

  • “our product will tell you when things are going wrong” and/or
  • “our product will automatically fix things when it finds something is wrong”

the implication is that these things will somehow relieve engineers from thinking or doing anything about those activities, so they can focus on more ‘important’ things. “Well-designed automation will keep people from having to do tedious work,” the cartoon-like salesman says.

Please stop doing this. It’s a lie in the form of marketing material, and it’s a huge boondoggle that distracts us from focusing on what we should work on, which is to augment and assist people in solving problems.

Anomaly detection in software is, and always will be, an unsolved problem. Your company will not solve it. Your software will not solve it. Our people will improvise around it and adapt their work to cope with the fact that we will not always know what and how something is wrong at the exact time we need to know.
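To make that concrete, here’s a toy sketch of my own (not any vendor’s method, and deliberately naive): a rolling z-score detector that flags any point more than three standard deviations from a trailing window. It catches one obvious spike, and then that very spike contaminates the window statistics, blinding the detector to everything that follows — a small taste of why “our product will tell you when things are going wrong” is a harder promise than it sounds.

```python
# Toy z-score "anomaly detector" -- illustrative only, not production code.
from statistics import mean, stdev

def zscore_flags(series, window=10, threshold=3.0):
    """Flag points more than `threshold` stdevs from the trailing window."""
    flags = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu, sigma = mean(hist), stdev(hist)
        z = (series[i] - mu) / sigma if sigma > 0 else 0.0
        flags.append((i, z, abs(z) > threshold))
    return flags

# Steady traffic, then a genuine spike, then a routine ramp-up:
series = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100]
series += [180]                  # a real anomaly: flagged
series += [110, 120, 130, 140]   # the spike now inflates the window's
                                 # stdev, so nothing here gets flagged
```

Note what happens: the one event the detector catches immediately degrades its ability to catch the next one. A human on-call engineer handles this kind of context shift without thinking about it; static statistics do not.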

My suggestion is to first acknowledge this (that detecting anomalies perfectly, at the right time, is not possible) when you talk to potential customers. Want my business? Say this up front, so we can then move on to talking about how your software will assist my team of expert humans who will always be smarter than your code.

In other words, your monitoring software should take the Tony Stark approach, not the WOPR/HAL9000 approach.

These are things I’d like to know about how you thought about your product:

  • Tell me about how you used qualitative research in developing your product.
  • Tell me about how you observed actual engineers in their natural habitat, in the real world, as they detected and responded to anomalies that arose.
  • Show me your findings from when you had actual UX/UI professionals consider carefully how the interfaces of your product should be designed.
  • Demonstrate to me that the people designing your product have actually been on-call and have experience with the scenario where they needed to understand what the hell was going on, had no idea where to start looking, all under time and consequence pressure.
  • Show me that the people building your product take as a first design principle that outages and other “untoward” events are handled not by a lone engineer, but more often than not by a team of engineers, each with their own expertise and focus of attention. Successful response depends not just on anomaly detection, but on how the team shares the observations they are making with each other in order to come up with actions to take.


Stop thinking you’re trying to solve a troubleshooting problem; you’re not.


The world you’re trying to sell to is in the business of dynamic fault management. This means that quite often you can’t just take a component out of service and investigate what’s wrong with it. It means diagnosis involves testing hypotheses that could actually make things a lot worse than they already are. It means that phases of responding to issues have overlapping concerns all at the same time. Things like:

  • I don’t know what is going on.
  • I have a guess about what is going on, but I’m not sure, and I don’t know how to confirm it.
  • Because of what Sue and Alice said, and what I see, I think what is going on is X.
  • Since we think X is happening, I think we should do Y.
  • Is there a chance that Y will make things worse?
  • If we don’t know what’s happening with N, can we do M so things don’t get worse, or we can buy time to figure out what to do about N?
  • Do we think this thing (that we have no clue about) is changing for the better or the worse?
  • etc.

Instead of telling me about how your software will solve problems, show me you’re trying to build a product that is going to join my team as an awesome team member, because I’m going to think about using/buying your service in the same way I think about hiring.


John Allspaw


Stress, Strain, and Reminders

Posted 4 CommentsPosted in Complex Systems, Etsy

This is a photo of the backside of the T-shirt for the operations engineering team at Etsy:


This diagram might not come as a surprise to those who know that I come from a mechanical engineering background. But I also wanted to have this on the T-shirt as a reminder (maybe just to myself, but hopefully to those on the team) that organizations (or groups within them) can experience stresses and strains just like materials do.

About the time that I was thinking about the T-shirt, I had come across “Stress-Strain Plots As a Basis For Assessing System Resilience” (Woods, D. D., & Wreathall, J., 2008).

One of the largest questions in my mind then (well, even before then, since then, and still) was: how do engineers’ particular environments and familiarity with their tools allow them to adapt and learn? If I could explore that question, then I might have some hope of answering, in the words of Eduardo Salas: “How can you turn a team of experts into an expert team?” (link)

In the paper, Woods and Wreathall explore the very familiar stress-strain diagram found in any materials science textbook of the last ten decades or so. They look to it as an analogy for organizations, and as an illustration that groups of people in organizations have different “state spaces” in which they adapt.

In “uniform” or normal stretching there is what they describe as the competence envelope; past that, there are the more interesting “extra regions” where teams have to reconfigure, improvise, and make trade-offs in uncertain conditions. This topic is so interesting to me that I decided to do a master’s thesis on it.
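As a toy illustration of the analogy (mine, not from the Woods and Wreathall paper), the stress-strain curve can be modeled as a piecewise function: a linear “uniform” region where Hooke’s law (stress = modulus × strain) holds, and a flatter region past the yield point where additional load buys much less capacity. The numbers below are made up for illustration.

```python
# Idealized piecewise stress-strain model -- my own sketch of the analogy.
def stress(strain, modulus=200.0, yield_strain=0.002, hardening=20.0):
    """Stress as a function of strain for an idealized material.

    Below yield_strain the response is linear (Hooke's law); beyond it,
    extra strain buys far less stress capacity -- the analogue of a team
    stretched past its competence envelope, improvising under load.
    """
    if strain <= yield_strain:
        return modulus * strain
    return modulus * yield_strain + hardening * (strain - yield_strain)
```

The interesting part of the organizational analogy lives past the yield point: the material (or team) still carries load there, but only by deforming in ways that don’t fully spring back.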

Here’s the thing: no work in complex systems can be prescribed. That means it can’t be codified, and it can’t be proceduralized. Instead, rules, procedures, and code are the scaffolding upon which operators, designers, and engineers adapt in order to be successful.

Sometimes these adaptations bring efficiencies. Sometimes they bring costs. Sometimes they bring surprises. Sometimes they bring more needs to adapt. But one thing is certain: they don’t bring the system back to some well-known equilibrium of ‘stable’ – complex systems don’t work that way.

But you don’t have to read my interpretation of the paper, you should just go and read it. 🙂

The last (and potentially just as important) reminder for me in the diagram is that all analogies have limits, and this one is no exception. When we use analogies and don’t acknowledge their limitations we can get into trouble. But that’s for a different post on a different day.