All articles by admin

 

My Next Step

After leaving my CTO role at Etsy this past May, I took the summer off to spend time with my family, enjoy New York summertime, and give my mind time to refresh and recharge. I thought long and hard about what I wanted to do next; the last time I took a new job was...
Continue reading...  

Invited article in IEEE Software – Technical Debt: Challenges and Perspectives

Earlier this year, I was asked to contribute to an article in IEEE Software, entitled “Technical Debt: Challenges and Perspectives.” I can’t post the entire article, but I can post the accepted text of my part of it here. Misusing the Metaphor John Allspaw All technical disciplines (not just software development) require different...
Continue reading...  

Multiple Perspectives On Technical Problems and Solutions

Over the years, a number of people have asked about the details surrounding Etsy’s architecture review process. In this post, I’d like to focus on the architecture review working group’s role in facilitating dialogue about technology decision-making. Part of this is really just about working groups in general (pros, cons, formats, etc.) and another part...
Continue reading...  

Book Suggestion: “Dialogue: The Art Of Thinking Together”

I’m reading a book that was suggested to me by the Director of the Office of Learning in the US Forest Service as “required reading” for any modern organization that intends to learn – Dialogue: The Art Of Thinking Together. As a teaser, William Isaacs makes a very good case for considering discussion to be seen as...
Continue reading...  

Abstract As A Verb

The New Stack has an interview with me on various topics here. I think the following part of the interview gets at what I think is an under-investigated bit of language and meaning: TNS: At the same time, I imagine that you’ve abstracted a lot of the supporting infrastructure away from the engineer. They don’t have...
Continue reading...  

Architectural Folk Models

I’m going to post the contents of a gist I wrote (2 years ago?!), because Theo is right, some gists are better as posts. The context for this was a debate on Twitter (which, as always, is about as elegant and pleasing to read as a turtle trying to breakdance).  Summing up contextual influence on systems architecture...
Continue reading...  

Reflections on the 6th Resilience Engineering Symposium

I just spent the last week in Lisbon, Portugal at the Resilience Engineering Symposium. Zoran Perkov and I were invited to speak on the topic of software operations and resilience in the financial trading and Internet services worlds, to an audience of practitioners and researchers from all around the globe, in a myriad of industries....
Continue reading...  

Some Principles of Human-Centered Computing

From Perspectives On Cognitive Task Analysis: Historical Origins and Modern Communities of Practice (emphasis mine) The Aretha Franklin Principle Do not devalue the human to justify the machine. Do not criticize the machine to rationalize the human. Advocate the human—machine system to amplify both. The Sacagawea Principle Human-centered computational tools need to support active organization of...
Continue reading...  

An Open Letter To Monitoring/Metrics/Alerting Companies

I’d like to open up a dialogue with companies who are selling X-As-A-Service products that are focused on assisting operations and development teams in tracking the health and performance of their software systems. Note: It’s likely my suggestions below are understood and embraced by many companies already. I know a number of them who are...
Continue reading...  

Stress, Strain, and Reminders

This is a photo of the backside of the T-shirt for the operations engineering team  at Etsy: This diagram might not come as a surprise to those who know that I come from a mechanical engineering background. But I also wanted to have this on the T-shirt as a reminder (maybe just to myself, but...
Continue reading...  

The Infinite Hows (or, the Dangers Of The Five Whys)

(this is also posted on O’Reilly’s Radar blog. Much thanks to Daniel Schauenberg, Morgan Evans, and Steven Shorrock for feedback on this) Before I begin this post, let me say that this is intended to be a critique of the Five Whys method, not a criticism of the people who are in favor of using...
Continue reading...  

Translations Between Domains: David Woods

One of the reasons I’ve continued to be more and more interested in Human Factors and Safety Science is that I found myself without many answers to the questions I have had in my career. Questions surrounding how organizations work, how people think and work with computers, how decisions get made under uncertainty, and how...
Continue reading...  

Teaching Engineering As A Social Science

Below is a piece written by Edward Wenk, Jr., which originally appeared in PRISM, the magazine for the American Society for Engineering Education (Publication Volume 6. No. 4. December 1996.) While I think that there’s much more than what Wenk points to as ‘social science’ – I agree wholeheartedly with his ideas. I might even say...
Continue reading...  

Engineering’s Relationship To Science

One of the things that I hoped to get across in my post about perspectives on mature engineering was the subtle idea that engineering’s relationship to science is not straightforward. My first caveat is that I am not a language expert, but I do respect it as a potential deadly weapon. I do hope that...
Continue reading...  

Paradigm Check Point: Prefacing Debriefings

I’m a firm believer in restating values, goals, and perspectives at the beginning of every group debriefing (e.g. “postmortem meetings”) in order to bring new folks up to speed on how we view the process and what the purpose of the debriefing is. When I came upon a similar baselining dialogue from another domain, I...
Continue reading...  

High Tempo, High Consequence

A Time to Remember I want you to think back to a time when you found yourself in an emergency situation at work. Maybe it was diagnosing and trying to recover from a site outage. Maybe it was when you were confronting the uncertain possibility of critical data loss. Maybe it was when you and...
Continue reading...  

We are too much accustomed to attribute to a singl…

We are too much accustomed to attribute to a single cause that which is the product of several, and the majority of our controversies come from that....
Continue reading...  

Counterfactual Thinking, Rules, and The Knight Capital Accident

In between reading copious amounts of indignation surrounding whatever is suboptimal about healthcare.gov, you may or may not have noticed the SEC statement regarding the Knight Capital accident that took place in 2012. This Release No. 70694 is a document that contains many details about the accident, and you can read what looks like on the surface...
Continue reading...  

Learning from Failure at Etsy

(This was originally posted on Code As Craft, Etsy’s engineering blog. I’m re-posting it here because it still resonates strongly as I prepare to teach a ‘postmortem facilitator’s course internally at Etsy.) Last week, Owen Thomas wrote a flattering article over at Business Insider on how we handle errors and mistakes at Etsy. I thought...
Continue reading...  

A Mature Role for Automation: Part II

(Courtney Nash’s excellent post on this topic inadvertently pushed me to finally finish this – give it a read) In the last post on this topic, I hoped to lay the foundation for what a mature role for automation might look like in web operations, and bring considerations to the decision-making process involved with considering...
Continue reading...  

Owning Attention (Considerations for Alert Design)

In the past month or two, I’ve spoken on the topic of alert design. There’s a video of my giving the talk (at Monitorama, as well), but I thought I’d try to post on the topic and material as well. The topic of alerts and “alert design” as seen as a deliberate and purposeful thing...
Continue reading...  

Prevention versus Governance versus Adaptive Capacities

The other day I posted about the intersections of Systems Safety and web operations and engineering. One of the largest proponents of bringing a systems thinking perspective to safety (specifically ‘software safety’) is Dr. Nancy Leveson, who has been in that field (really a multidisciplinary field) for at least a couple of decades. She’s the...
Continue reading...  

Always a Student: Operations and Systems Safety

Anyone who has known me well knows that I’m generally not satisfied with skimming the surface of a topic that I feel excited about. So to them, it wouldn’t be a surprise that I’m now working on (yes, while I’m still at Etsy!) a master’s degree. Since January, I’ve been working with an incredible group...
Continue reading...  

Availability: Nuance As A Service

Something that has struck me funny recently surrounds the traditional notion of availability of web applications. With respect to its relationship to revenue, to infrastructure and application behavior, and fault protection and tolerance, I’m thinking it may be time to get a broader upgrade adjustment to the industry’s perception on the topic. These nuances in the...
Continue reading...  

On Being A Senior Engineer

UPDATE: I’ve added a short section on the topic of sponsorship. I think that there’s a lot of institutional knowledge in our field, especially about what makes for a productive engineer. But while there are a good number of books in the management field about “expert” roles and responsibilities of non-technical individual contributors, I don’t...
Continue reading...  

A Mature Role for Automation: Part I

(Part 1 of 2 posts) I’ve been percolating on this post for a long time. Thanks very much to Mark Burgess for reviewing early drafts of it. One of the ideas that permeates our field of web operations is that we can’t have enough automation. You’ll see experience with “building automation” on almost every job...
Continue reading...  

Fundamental: Stress-Strain Curves In Web Engineering

I make it no secret that my background is in mechanical engineering. I still miss those days of explicit and dynamic finite element analysis, when I worked for the VNTSC, working on vehicle crashworthiness studies for the NHTSA. What was there not to like? Things like cars and airbags and seatbelts and dummies and that...
Continue reading...  

Human Factors and Web Engineering’s Intersection

Given my recent (and apparently insatiable) appetite for studying the contexts, interface(s), and success and failure modes between man and machine, it’s not a surprise that I’ve been flying head-on into the field of Human Factors. Sub-disciplines include Cognitive Engineering and Human-Computer Interaction (HCI). It would appear to me that there isn’t one facet of the field of...
Continue reading...  

Resilience Engineering Part II: Lenses

(this is part 2 of a series: here is part 1) One of the challenges of building and operating complex systems is that it’s difficult to talk about one facet or component of them without bleeding the conversation into other related concerns. That’s the funky thing about complex systems and systems thinking: components come together...
Continue reading...  

The Devil’s In The Details

I’m a firm believer that context is everything, and that it’s needed in every constructive conversation we want to have as engineers. As a nascent (but adorable) engineering field, we discuss (in blogs, books, meetups, conferences, etc.) success and failure in a number of areas, including the ways in which we work. We don’t just...
Continue reading...  

Each necessary, but only jointly sufficient

I thought it might be worth digging in a bit deeper on something that I mentioned in the Advanced Postmortem Fu talk I gave at last year’s Velocity conference. For complex socio-technical systems (web engineering and operations) there is a myth that deserves to be busted, and that is the assumption that for outages and...
Continue reading...  

Convincing management that cooperation and collaboration was worth it

While searching around for something else, I came across this note I sent in late 2009 to the executive leadership of Yahoo’s Engineering organization. This was when I was leaving Flickr to work at Etsy. My intent in sending it was to be open to the rest of Yahoo about how things worked at...
Continue reading...  

Fault Tolerance and Protection

In yet another post where I point to a paper written from the perspective of another field of engineering about a topic that I think is inherently mappable to the web engineering world, I’ll at least give a summary. 🙂 Every time someone on-call gets an alert, they should always be thinking along these lines:...
Continue reading...  

Systems Engineering: A great definition.

Ben Rockwood said something last December about the re-emergence of the Systems Engineer and I agree with him, 100%. To add to that, I’d like to quote the excellent NASA Systems Engineering handbook’s introduction. The emphasis is mine: Systems engineering is a methodical, disciplined approach for the design, realization, technical management, operations, and retirement of...
Continue reading...  

Training Organizational Resilience in Escalating Situations

This little ramble of thoughts is related to my upcoming talk at Velocity, but I know I’ll never get to this part at the conference, so I figured I’d post about it here. Building resilience from a systems point of view means (amongst other things) understanding how your organization deals with failure and unexpected...
Continue reading...  

Resilience Engineering: Part I

I’ve been drafting this post for a really long time. Like most posts, it’s largely for me to get some thoughts down. It’s also very related to the topic I’ll be talking about at Velocity later this year. When I gave a keynote talk at the Surge Conference last year, I talked about how our...
Continue reading...  

Etsy’s Chef Repo, 2010

Etsy’s Chef Repo, 2010 from jspaw on Vimeo. Delicious InfoViz courtesy of Gource....
Continue reading...  

MTTR is more important than MTBF (for most types of F)

UPDATE, 10/17/2017: This post hasn’t aged well, and needs some patching. The title should be “TTR is more important than TBF (for most types of F)” Why? Because taking the statistical mean of TTR or TBF makes absolutely no sense, whatsoever. Incidents and events simply are not comparable in that way, and even if they were, the time...
Continue reading...  

Go or No-Go: Operability and Contingency Planning (Surge)

Last month I had the honor of speaking at the Surge Conference in Baltimore, put together by OmniTI. It was a most excellent conference, and the expertise levels were ridiculously high. I count myself lucky to be considered in the same league as the rest of the presenters. I did give a Keynote talk, and I...
Continue reading...  

Nagios alerts on the iPhone – deleting boatloads

Protip: if you’re getting Nagios alerts on an iPhone, and you have your contact set as:  xxx-xxx-xxxx@txt.att.net, you’ll get messages from a ‘sender’ that looks like: “1 (410) 000-173”. This is not someone in Maryland, it’s a special address so that AT&T can route a reply back to the sender if need be. The side...
Continue reading...  

Ops Meta-Metrics: Velocity 2010 Slides

As expected, Velocity was excellent this year. What an awesome time to be in this field. Caveat for those who didn’t see/hear my talk: the graphs and numbers in the slides are, for the most part, made up. But they’re also in line with what I’ve seen at Flickr and Etsy. Ops Meta-Metrics: The Currency...
Continue reading...  

Some WebOps Interview Questions

It can be difficult to evaluate web ops candidates, for a couple of different reasons. One is that the breadth of knowledge needed for the field can be pretty wide, so spending too much time on any particular technical area can be a waste of time. Another reason is that it can be difficult to...
Continue reading...  

The new book: Web Operations

At the Velocity Conference last year, I was talking to Mike Loukides from O’Reilly about the topics being presented and how it was so great to see such successful veterans of the field come out from behind the curtain and share their experiences. Mike said that there was interest in doing a book on the...
Continue reading...  

We’re hiring ops folks at Etsy!

We’re hiring web ops engineers at Etsy.  Here’s the gist of it…. Responsibilities Building and maintaining Etsy’s infrastructure, from installed iron to production Taking part in a 24×7 on-call rotation Tightly cooperating and collaborating with development, product, community and customer care Requirements Experience with configuration management systems and concepts (Chef, Puppet, Cfengine, etc.) Experience in...
Continue reading...  

Pigz – parallel gzip OMG

Pigz is basically parallel gzip, to take advantage of multiple cores.  When you’ve got massive files, this can be a pretty big advantage, especially when you’ve got lots of cores sitting around. Taking a 418m squid access log file, on a dual-quad Nehalem L5520  with HyperThreading turned on: [jallspaw@server01 ~]$ ls -lh daemon.log.2; time gzip...
Continue reading...  
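
For readers who want a feel for why splitting gzip work across cores helps, here is a minimal Python sketch of the general idea. It is not how pigz itself is implemented: it simply compresses fixed-size chunks concurrently and concatenates the resulting gzip members, which standard gzip readers accept as one stream. The function name `parallel_gzip`, the chunk size, and the worker count are all hypothetical choices made for illustration.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

# Hypothetical chunk size for this sketch; not a value taken from pigz.
CHUNK_SIZE = 128 * 1024


def parallel_gzip(data: bytes, workers: int = 4, chunk_size: int = CHUNK_SIZE) -> bytes:
    """Compress `data` as a sequence of independent gzip members.

    Each chunk is compressed on a worker thread, and the members are
    joined in their original order. Concatenated gzip members form a
    valid gzip stream, so ordinary gzip tools can decompress the result.
    """
    # Split the input into chunks; keep one empty chunk for empty input
    # so the output is still a valid gzip stream.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so members line up with chunks.
        members = list(pool.map(gzip.compress, chunks))
    return b"".join(members)


if __name__ == "__main__":
    # Synthetic, highly repetitive "log" data, purely for demonstration.
    sample = b"GET /index.html 200\n" * 50_000
    packed = parallel_gzip(sample)
    assert gzip.decompress(packed) == sample
    print(len(sample), "->", len(packed), "bytes")
```

Compressing chunks independently costs a little compression ratio, since each chunk starts without the history of the previous one; that trade-off is the price of the parallelism in this simplified version.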

Agile Executive Podcast

Yesterday I was on a podcast with Andrew Shafer and Michael Coté, and we talked about development and operations cooperation. I rambled a bit, like I tend to do. Andrew brought up something that’s disturbing, and I’ve seen elsewhere, which is that after seeing our presentation last year at Velocity, some folks decided that we...
Continue reading...  

Need some FUDforum consulting done

I’ve been helping out a friend for some years with running a decent-size discussion forum. It’s running on a little (512mb of RAM) dedicated server and it’s outgrown the box it’s on. It needs to move to a new machine, which is all ready to take it. Problem is, it’s in a twisty-maze of dependencies....
Continue reading...  

Deployment is just a part of dev/ops cooperation, not the whole thing

Dev/Ops is what some people are calling the renewed cross-interest in development and operations collaboration. Hammond and I spoke about it, and there was even a conference in Europe dedicated to it. While I do think that there’s still a lot more that is to be discussed around this idea of cooperation and mixing of...
Continue reading...  

The epicenter of the web, and NYC

One of my apprehensions in moving to New York from San Francisco was a common concern: why would I move from the ‘epicenter’ of the web to a place where it’s not? There’s been lots written about startup hub cities, and innovative web metro areas, but the fact of the matter is that New York...
Continue reading...  

From one door to another

Last week I gave 2 months’ notice – I’ll be leaving Flickr in January. When Stew and Cat asked me to join Flickr in January of 2005, I felt like it was time to go and do something different, so I said yes. Five years (and four billion photos) later, it’s again time to go...
Continue reading...