Kitchen Soap

Over the years, a number of people have asked about the details surrounding Etsy’s architecture review process.

In this post, I’d like to focus on the architecture review working group’s role in facilitating dialogue about technology decision-making. Part of this is really just about working groups in general (pros, cons, formats, etc.), and another part relies on the general philosophy that Dan McKinley lays out in his post/talk, so I’d suggest reading those first and then coming back here.

Fundamental: engineering decision-making is a socially constructed activity

But first, we need to start with a bit of grounding theory. In 1985, Billy Vaughn Koen (now professor emeritus at the University of Texas) introduced the idea of the “Engineering Method,” which I’ve written about before. The key idea is his observation that engineers are defined not by what they produce, but by how they do their work. He describes the Engineering Method succinctly, saying that it is:

“the strategy for causing the best change in a poorly understood or uncertain situation within the available resources”

Note the normative terms “best” and “poorly.”

In other words, engineering (as an activity) does not have “correct” solutions to problems. As an aside, if you’re looking for correct solutions to problems, I’d suggest that you go work in a different field (like mathematics); engineering will likely frustrate you.

I wholeheartedly agree with this idea, and I’d take it a bit further and say that successful engineering teams find solutions to problems largely through dialogue. By this I mean that they engage in various forms of:

Describing the problem they believe needs solving. This may or may not be straightforward to explain or describe, so a back-and-forth is usually needed for a group to get a full grasp of it.
Generating hypotheses about whether or not the problem(s) being described need to be solved in more or less complete or exhaustive ways. Some problem descriptions might be really narrow or really broad; some problems don’t have to be fully “solved” all at once, etc. Will the problem exist in perpetuity?
Evaluating options for solutions. What are the pros and cons? Can a group that needs to support a solution sustain or maintain said solution? Will some solutions have an “expiration date”? What possible unintended consequences could result in upstream/downstream systems that depend on this solution?
Building confidence in implementing a given solution. How uncertain (and in what ways) is the group about how the solution may play out in positive or negative terms?
Etc.

I realize this should be obvious, but I include this perspective here because I’m continually surprised by how difficult this is to believe and understand when the topic is companies choosing to use a particular piece of technology (framework, language, architecture, etc.) or not.

Once you grok the concept that engineering decisions are constructed socially, you can understand why I believe the set and setting of an “architecture review” is critical.

The Concept and The Intention

When Kellan and I first came to Etsy in 2010, we put in place a process whereby an engineer (or a group of engineers) looking to implement something new (different from what we had already been doing or using) would present the problem they were trying to solve and why they believed existing solutions at Etsy wouldn’t be ideal for solving it. We called this an architecture review.

We called these “something new” things departures. Again, Dan’s post/talk goes over this, but the basic idea is that there are costs (many of which can be both hidden and surprising on many fronts) to making departures that attempt to solve a local problem without thinking about the global effects they can have on an organization.

Departures can include things such as:

writing a function or feature in a language that isn’t in wide usage at the company already
redesigning a function or feature (even in a language already widely-used)
introducing a pattern or architecture not already in use
using new server software for storing, caching, or retrieving data not already being used
essentially, anything that had the potential to surprise and challenge the team when it (eventually) broke

Those bullets above are pretty fuzzy, I’m sure you noticed. So how did you know when you should ask for an architecture review? Ask your coworkers. If you’re debating back and forth whether your thing needs one, then you likely do.

So what was this architecture review meeting all about? It was pretty simple, really. It was just a presentation by the engineer(s) looking for feedback on a solution they came up with, and Kellan and I would ask questions. The meeting was open to anyone in the company who wanted to attend; the hope, actually, was that many engineers would. The intent here was to help the engineers describe their problem as thoroughly as they could, and by asking questions, we could help draw out any tacit assumptions they made in thinking through the problem. In a nutshell: it was a critical-thinking exercise, and we wanted engineers to want to do this, because it was ideally about providing feedback.

The gist of it could be described like this, from the point of view of an engineer wanting feedback:

“Hey everybody – check this out: I believe I have to solve problem X, and I’ve thought about using A, B, and C to do it, but they all don’t seem ideal. I think I have to go with departure Y as the solution. Please try to talk me out of this.”

This approach leads to some really good prompts for the dialogue I mentioned above. Prompts such as:

“Is it possible I don’t even have to solve problem X?”
“Am I missing something with A, B, or C? Is it possible that they’ll work well enough?”
“If I go with Y, what do we need to do to feel confident about supporting Y as the solution for this type of problem?”

At a high level, we were reasonably successful with this approach in those early days, in that we were able to get engineers to come, talk about the judgements and assumptions they were making, and entertain questions and suggestions about the details of both their understanding of the problem and potential solutions. Even though it began with mostly Kellan and me asking questions, we actively encouraged others to ask as well. Slowly, they did, and it became a really strong source of confidence for the team, collectively.

There were some really surprising results by doing these reviews. More than once, someone in the room would recognize the problem being presented, and relay that they had also wrestled with and solved an almost identical problem in the past in a different part of the codebase. With a few minor changes, they said, it could work for the problem at hand, instead of reinventing some new stuff. Great!

One time, an engineer walked through a reasonably complicated diagram outlining how they were going to collect, summarize, store, and act on data to power a new feature. Their solution involved not only putting a new language into the critical path of the application but introducing a new (and at the time relatively immature) datastore to the stack as well. After a few questions back and forth, it became clear that the data they needed already existed and they were only one SQL query and one cron job away from implementing the new feature.

Those sorts of outcomes didn’t happen often. But when they did, it was quite satisfying.

Other times, an engineer would present, dialogue would ensue, and it would become clear that from multiple perspectives, going with the departure (the new and novel suggestion) appeared to be the best route.

Overall, I’d say that we got a lot of benefits from this early architecture review practice. Engineers starting out in their careers got good practice presenting their work and their thinking; veteran engineers had an opportunity to share knowledge or expertise they otherwise wouldn’t have. As I mentioned, sometimes engineers got to save a lot of time and headache by going a perhaps simpler route. From a leadership perspective, my confidence in the org’s ability to talk constructively about design increased.

How the multiple perspective dialogue evolved

However, there was one problem: when you’re a CTO and an SVP, you can’t be surprised when people come to meetings you invite them to. I often wondered if there were questions and opinions that weren’t being voiced because of the power dynamic in the room. I knew that unless a critical mass of the engineers in the room demonstrated the ethos of the practice (that is, the creation and support of a critical-thinking dialogue meant to help engineers, both individually and organizationally), there would be a good chance it would devolve into a sort of hierarchical “American Idol”-style panel of judgement and heavy-handed dictation of policy.

The idea, of course, was that better decision-making came from people showing up and feeling comfortable asking questions that could potentially be seen (in a different environment) as “dumb” or naive, and from the presenter(s) hearing critique as comments on the design, not as comments on their skills. This meant that the origin of the design (what problem it was intended to solve at the time, what people’s concerns were at the time, etc.) could be recorded.

The more the organization grew, the harder it became for a single person (even the CTO or an SVP) to sustain this approach and the assumptions that engineers brought with them into architecture reviews.

The beginning of a particular kind of working group

So, an Architecture Review Working Group, or “ARWG,” was developed. The main idea was that such a group could keep the practice going without senior engineering leadership bringing an authoritarian flavor to the meetings, and could continually model the behavior needed to encourage the multiple perspectives that departures called for.

A small group of 4 or 5 people was formed. The original members came from product, infrastructure, and operations teams, but engineers from other teams joined later. Over time, some members rotated in or out.

The group’s charter was basically the same, at a high level, as how we intended those early architecture reviews: provide a stable and welcoming environment where engineers can openly describe how they’re thinking about solving problems with potentially new and novel technology and patterns. This environment needs to support questions and comments from the rest of the organization about trade-offs, assumptions, constraints, pressures, maintenance, and the onus that comes with departures. Finally, the group documents how decisions around departures were arrived at, so that a sort of anthropological artifact exists to inform future decisions and dialogues.

You might be thinking: “but where and when does a decision get made?”

The ARWG’s role was not to make a decision.

The ARWG’s role was to create and sustain the conditions where a dialogue can take place, with as many perspectives on both the problem and solution as can be had within a given period of time, and then to facilitate a discussion around a decision to be made.

At this point I’d like to make a semantic distinction between “dialogue” and “discussion,” and I’m going to pull from my previous blog post, where I suggested the book “Dialogue: The Art Of Thinking Together”:

Dialogue is about exploring the nature of choice. To choose is to select among alternatives. Dialogue is about evoking insight, which is a way of reordering our knowledge, particularly the taken-for-granted assumptions that people bring to the table.

Discussion is about making a decision. Unlike dialogue, which seeks to open possibilities and see new options, discussion seeks closure and completion. The word decide means “to resolve difficulties by cutting through them.” Its roots literally mean to “murder the alternative.”

The ARWG’s role is to facilitate dialogue first, then discussion. The key idea is to shed as much light via “open and curious minds” on the departure (problem and solution) before then getting into an evaluation phase of options. The dialogue is intended to bring attention to the departure’s details (and assumed necessity) first, and only then can a discussion about the merits of different options take place.

In my experience, when an architecture review brings attention to a problem and proposed solutions from multiple perspectives, decisions become less controversial. When a decision appears to be obvious to a broad group (“Question: should we (or should we not) take backups of critical databases? Decision: Yes.”), the question of how a decision gets made almost disappears.

It’s only when there isn’t universal agreement about a decision (or even if a decision is necessary) that the how, who, and when a decision gets made becomes important to know. The idea of an architecture review is to expose the problem space and proposed departure ideas to dialogue in a broad enough way that confusion about them can be reduced as much as possible. Less confusion about the topic(s) can help reduce uncertainty and/or anxiety about a solution.

Now, some pitfalls exist here:

Engineers (both presenting and participating as audience) need to understand the purpose of the architecture review is to develop better outcomes. That’s it. It’s not to showcase their technical prowess.
If nothing but a focus on “but who makes the ultimate decision?” shows up, this is a signal that critique and feedback (again, on both the problem and the solutions) are not really wanted, and that engineers think their departure ideas should get a pass from critique for whatever reason. Asking about those reasons is useful.
Without a strong and continual emphasis on “critique the code, not the coder” this approach can (and will, I can guarantee it) devolve into episodes of defensiveness on multiple fronts. First and foremost, engineers who are looking for feedback on their ideas of a departure need to see it as part of their role as a mature engineer.

Sometimes, you might find an engineer who is so incredibly enthusiastic about the solution they’ve developed for a given problem that they begin to treat the presentation as a “sales pitch” rather than as a way to get feedback. The good news is that this is relatively straightforward to detect when it happens. The bad news is that it signals the purpose of the review isn’t universally clear.

Other times, you might find a group of engineers responsible for developing a solution seeing themselves as different from a group responsible for maintaining the solution. This authority-responsibility “double bind” reveals itself in even the least siloed organizations. In this case, congratulations! The architecture review was able to bring potential elephants-in-the-room to the table.

In almost every case, no matter what the result of an architecture review is, there will always be lingering shades of doubt in people’s minds about whether taking a departure (or not) was a good decision. These lingering shades are normal.

Welcome to engineering, where the solving of problems boils down to a “strategy for causing the best change in a poorly understood or uncertain situation within the available resources.”

While I cannot state that taking this approach is an airtight way of always making the best decisions when it comes to technical departures, I can say this: without some way of getting multiple perspectives from different groups on a technical departure (such as this approach), you’re all but guaranteeing suboptimal decisions.

So the next time you are so certain that a particular new way of doing something is better and the rest of your organization should get behind your idea, I would tell you this:

“Excellent, it sounds like you have a hypothesis! We are gonna do an architecture review. If it’s as obvious a solution as you think it is, it should be easy for the rest of the org to come to the same conclusion, and that will make implementing and maintaining it that much easier. If it has some downsides that aren’t apparent, we will at least have a chance to tease those out!”


Multiple Perspectives On Technical Problems and Solutions
