admin

I like making things go! At the moment, I'm SVP of Infrastructure and Operations at Etsy, and I'm currently pursuing a Master's degree in Human Factors and Systems Safety at Lund University.

All articles by admin

 

Teaching Engineering As A Social Science

Below is a piece written by Edward Wenk, Jr., which originally appeared in PRISM, the magazine for the American Society for Engineering Education (Publication Volume 6, No. 4, December 1996). While I think that there’s much more than what Wenk points to as ‘social science’ – I agree wholeheartedly with his ideas. I might even say...

Engineering’s Relationship To Science

One of the things that I hoped to get across in my post about perspectives on mature engineering was the subtle idea that engineering’s relationship to science is not straightforward. My first caveat is that I am not a language expert, but I do respect it as a potential deadly weapon. I do hope that...

Paradigm Check Point: Prefacing Debriefings

I’m a firm believer in restating values, goals, and perspectives at the beginning of every group debriefing (e.g. “postmortem meetings”) in order to bring new folks up to speed on how we view the process and what the purpose of the debriefing is. When I came upon a similar baselining dialogue from another domain, I...

High Tempo, High Consequence

A Time to Remember: I want you to think back to a time when you found yourself in an emergency situation at work. Maybe it was diagnosing and trying to recover from a site outage. Maybe it was when you were confronting the uncertain possibility of critical data loss. Maybe it was when you and...

We are too much accustomed to attribute to a singl…

We are too much accustomed to attribute to a single cause that which is the product of several, and the majority of our controversies come from that....

Counterfactual Thinking, Rules, and The Knight Capital Accident

In between reading copious amounts of indignation surrounding whatever is suboptimal about healthcare.gov, you may or may not have noticed the SEC statement regarding the Knight Capital accident that took place in 2012. This Release No. 70694 is a document that contains many details about the accident, and you can read what looks like on the surface...

Learning from Failure at Etsy

(This was originally posted on Code As Craft, Etsy’s engineering blog. I’m re-posting it here because it still resonates strongly as I prepare to teach a ‘postmortem facilitator’s course internally at Etsy.) Last week, Owen Thomas wrote a flattering article over at Business Insider on how we handle errors and mistakes at Etsy. I thought...

A Mature Role for Automation: Part II

(Courtney Nash’s excellent post on this topic inadvertently pushed me to finally finish this – give it a read) In the last post on this topic, I hoped to lay the foundation for what a mature role for automation might look like in web operations, and bring considerations to the decision-making process involved with considering...

Owning Attention (Considerations for Alert Design)

In the past month or two, I’ve spoken on the topic of alert design. There’s a video of my giving the talk (at Monitorama, as well), but I thought I’d try to post on the topic and material as well. The topic of alerts and “alert design” as seen as a deliberate and purposeful thing...

Prevention versus Governance versus Adaptive Capacities

The other day I posted about the intersections of Systems Safety and web operations and engineering. One of the largest proponents of bringing a systems thinking perspective to safety (specifically ‘software safety’) is Dr. Nancy Leveson, who has been in that field (really a multidisciplinary field) for at least a couple of decades. She’s the...

Always a Student: Operations and Systems Safety

Anyone who has known me well knows that I’m generally not satisfied with skimming the surface of a topic that I feel excited about. So to them it wouldn’t be a surprise that I’m now working on (yes, while I’m still at Etsy!) a master’s degree. Since January I’ve been working with an incredible group...

Availability: Nuance As A Service

Something that has struck me funny recently surrounds the traditional notion of availability of web applications. With respect to its relationship to revenue, to infrastructure and application behavior, and fault protection and tolerance, I’m thinking it may be time to get a broader upgrade adjustment to the industry’s perception on the topic. These nuances in the...

On Being A Senior Engineer

I think that there’s a lot of institutional knowledge in our field, especially about what makes for a productive engineer. But while there are a good number of books in the management field about “expert” roles and responsibilities of non-technical individual contributors, I don’t see too many modern books or posts that might shed light...

A Mature Role for Automation: Part I

I’ve been percolating on this post for a long time. Thanks very much to Mark Burgess for reviewing early drafts of it. One of the ideas that permeates our field of web operations is that we can’t have enough automation. You’ll see experience with “building automation” on almost every job description, and many post-mortem transcriptions...

Fundamental: Stress-Strain Curves In Web Engineering

I make it no secret that my background is in mechanical engineering. I still miss those days of explicit and dynamic finite element analysis, when I worked for the VNTSC, working on vehicle crashworthiness studies for the NHTSA. What was there not to like? Things like cars and airbags and seatbelts and dummies and that...

Human Factors and Web Engineering’s Intersection

Given my recent (and apparently insatiable) appetite for studying the contexts, interface(s), and success and failure modes between man and machine, it’s not a surprise that I’ve been flying head-on into the field of Human Factors. Sub-disciplines include Cognitive Engineering and Human-Computer Interaction (HCI). It would appear to me that there isn’t one facet of the field of...

Resilience Engineering Part II: Lenses

(this is part 2 of a series: here is part 1) One of the challenges of building and operating complex systems is that it’s difficult to talk about one facet or component of them without bleeding the conversation into other related concerns. That’s the funky thing about complex systems and systems thinking: components come together...

The Devil’s In The Details

I’m a firm believer that context is everything, and that it’s needed in every constructive conversation we want to have as engineers. As a nascent (but adorable) engineering field, we discuss (in blogs, books, meetups, conferences, etc.) success and failure in a number of areas, including the ways in which we work. We don’t just...

Each necessary, but only jointly sufficient

I thought it might be worth digging in a bit deeper on something that I mentioned in the Advanced Postmortem Fu talk I gave at last year’s Velocity conference. For complex socio-technical systems (web engineering and operations) there is a myth that deserves to be busted, and that is the assumption that for outages and...

Convincing management that cooperation and collaboration was worth it

While searching around for something else, I came across this note I sent in late 2009 to the executive leadership of Yahoo’s Engineering organization. This was when I was leaving Flickr to work at Etsy. My intent in sending it was to be open to the rest of Yahoo about how things worked at...

Fault Tolerance and Protection

In yet another post where I point to a paper written from the perspective of another field of engineering about a topic that I think is inherently mappable to the web engineering world, I’ll at least give a summary. Every time someone on-call gets an alert, they should always be thinking along these lines: Does...

Systems Engineering: A great definition.

Ben Rockwood said something last December about the re-emergence of the Systems Engineer and I agree with him, 100%. To add to that, I’d like to quote the excellent NASA Systems Engineering handbook’s introduction. The emphasis is mine: Systems engineering is a methodical, disciplined approach for the design, realization, technical management, operations, and retirement of...

Training Organizational Resilience in Escalating Situations

This little ramble of thoughts is related to my talk at Velocity coming up, but I know I’ll never get to this part at the conference, so I figured I’d post about it here. Building resilience from a systems point of view means (amongst other things) understanding how your organization deals with failure and unexpected...

Resilience Engineering: Part I

I’ve been drafting this post for a really long time. Like most posts, it’s largely for me to get some thoughts down. It’s also very related to the topic I’ll be talking about at Velocity later this year. When I gave a keynote talk at the Surge Conference last year, I talked about how our...

Etsy’s Chef Repo, 2010

Etsy’s Chef Repo, 2010 from jspaw on Vimeo. Delicious InfoViz courtesy of Gource....

MTTR is more important than MTBF (for most types of F)

This week I gave a talk at QCon SF about development and operations cooperation at Etsy and Flickr. It’s a refresh of talks I’ve given in the past, with more detail about how it’s going at Etsy. (It’s going excellently.) There’s a bunch of topics in the presentation slides, all centered around roles, responsibilities,...

Go or No-Go: Operability and Contingency Planning (Surge)

Last month I had the honor of speaking at the Surge Conference in Baltimore, put together by OmniTI. It was a most excellent conference, and the expertise levels were ridiculously high. I count myself lucky to be considered in the same league as the rest of the presenters. I did give a Keynote talk, and I...

Nagios alerts on the iPhone – deleting boatloads

Protip: if you’re getting Nagios alerts on an iPhone, and you have your contact set as: xxx-xxx-xxxx@txt.att.net, you’ll get messages from a ‘sender’ that looks like: “1 (410) 000-173”. This is not someone in Maryland, it’s a special address so that AT&T can route a reply back to the sender if need be. The side...

Ops Meta-Metrics: Velocity 2010 Slides

As expected, Velocity was excellent this year. What an awesome time to be in this field. Caveat for those who didn’t see/hear my talk: the graphs and numbers in the slides are, for the most part, made up. But they’re also in line with what I’ve seen at Flickr and Etsy. Ops Meta-Metrics: The Currency...

Some WebOps Interview Questions

It can be difficult to evaluate web ops candidates, for a couple of different reasons. One is that the breadth of knowledge needed for the field can be pretty wide, so spending too much time on any particular technical area can be a waste of time. Another reason is that it can be difficult to...

The new book: Web Operations

At the Velocity Conference last year, I was talking to Mike Loukides from O’Reilly about the topics being presented and how it was so great to see such successful veterans of the field come out from behind the curtain and share their experiences. Mike said that there was interest in doing a book on the...

We’re hiring ops folks at Etsy!

We’re hiring web ops engineers at Etsy. Here’s the gist of it. Responsibilities: building and maintaining Etsy’s infrastructure, from installed iron to production; taking part in a 24×7 on-call rotation; tightly cooperating and collaborating with development, product, community and customer care. Requirements: experience with configuration management systems and concepts (Chef, Puppet, Cfengine, etc.); experience in...

Pigz – parallel gzip OMG

Pigz is basically parallel gzip, to take advantage of multiple cores.  When you’ve got massive files, this can be a pretty big advantage, especially when you’ve got lots of cores sitting around. Taking a 418m squid access log file, on a dual-quad Nehalem L5520  with HyperThreading turned on: [jallspaw@server01 ~]$ ls -lh daemon.log.2; time gzip...

Agile Executive Podcast

Yesterday I was on a podcast with Andrew Shafer and Michael Coté, and we talked about development and operations cooperation. I rambled a bit, like I tend to do. Andrew brought up something that’s disturbing, and I’ve seen elsewhere, which is that after seeing our presentation last year at Velocity, some folks decided that we...

Need some FUDforum consulting done

I’ve been helping out a friend for some years with running a decent-size discussion forum. It’s running on a little (512mb of RAM) dedicated server and it’s outgrown the box it’s on. It needs to move to a new machine, which is all ready to take it. Problem is, it’s in a twisty-maze of dependencies....

Deployment is just a part of dev/ops cooperation, not the whole thing

Dev/Ops is what some people are calling the renewed cross-interest in development and operations collaboration. Hammond and I spoke about it, and there was even a conference in Europe dedicated to it. While I do think that there’s still a lot more that is to be discussed around this idea of cooperation and mixing of...

The epicenter of the web, and NYC

One of my apprehensions in moving to New York from San Francisco was a common concern: why would I move from the ‘epicenter’ of the web to a place where it’s not? There’s been lots written about startup hub cities, and innovative web metro areas, but the fact of the matter is that New York...

From one door to another

Last week I gave two months’ notice – I’ll be leaving Flickr in January. When Stew and Cat asked me to join Flickr in January of 2005, I felt like it was time to go and do something different, so I said yes. Five years (and four billion photos) later, it’s again time to go...

How Complex Systems Fail: A WebOps Perspective

I guess I’m late on getting to this, but How Complex Systems Fail by Richard Cook is excellent. Let me start with this: I don’t think I can overstate how right-on this paper is, with respect to the challenges, solutions, observations, and concerns involved with operating a medium to large web infrastructure. I found this...

When you deploy: your internal monologue

The minimum cycle of questions you should be asking yourself. As brought up by @debuggist and @benjaminblack....

Meanwhile: More Meta-Metrics

Like all sane web organizations, we gather metrics about our infrastructure and applications. As many metrics as we can, as often as we can. These metrics, given the right context, help us figure out all sorts of things about our application, infrastructure, processes, and business. Things such as… What: …did we do before (historical trending,...

WebOps: Good prep for becoming a new parent?

I think I’ve said before somewhere that working in the field of web operations prepared me somewhat for being a parent. I thought the other day that I should write down some of this reasoning, because it’s pretty often that I’m reminded of similarities: High availability: Having redundant infrastructure is WebOps 101. For my kids’...

Automated Control paper by the RAD Lab folks

Wow, how did I miss this until now? In June, some smart people gathered in Barcelona for the First Workshop on Automated Control for Datacenters and Clouds (ACDC09) and jeez it looked like it was a good time, from a glance at the program. One of the cooler papers is “Automatic exploration of datacenter performance regimes”...

Extreme Automated Infrastructure

I’ve said it before that I’ve always been a huge fan of SystemImager, for super simple imaging. It has some shortcomings for config management, but those are solved with things like Chef or Puppet. With all of the great things being talked about surrounding ‘Automated Infrastructure’, I’ll point to something insanely cool: 1,190 nodes installed...

SLAs, clouds, and whatnot

Excellent. Good work, Ben: ah, the mighty service level agreement! the tooth and claw by which the wily customer brings the vendor to heel. get the SLA right and you, the customer, can sit back and relax, safe in the knowledge that should there be an outage, you are covered. your business is protected from...

Uncaching bits in filesystem cache

Domas makes something more useful than I bet most would think: http://mituzas.lt/2009/06/26/uncache/...

Slides for Velocity Talk 2009

UPDATE: blip.tv has the video of the talk as well, below. Jeez I have some major bed-head. That was a blast! I had never done a ‘duet’ talk before. Here are the slides: 10+ Deploys Per Day: Dev and Ops Cooperation at Flickr …and the video of it is here:...

Annoying To Me.

I can’t tell you how ripped I get when people say things like this: “cloud computing means getting rid of ops” If by “ops” you mean “people in data centers racking servers, installing OSes, running cables, replacing broken hardware, etc.” then sure, cloud computing aims to relieve you of those burdens. If you really think...

Context and Operational Metrics

I really don’t think it can be overestimated how important context can be when it comes to troubleshooting or evaluating the health of an infrastructure. When starting to troubleshoot a complex problem, web ops 101 “best practices” usually start with asking at least these questions: When did this problem start? What changes, if any, (software,...

Mechanical Analogies To Web Stuff, Part 2.

This is a ramble continued from before, which means it’s mostly a blog post for me, but maybe others might find it interesting. The last time I made an analogy between back-end web architectures and mechanical structures, I blathered on about what are basically structural limitations of individual components in a physical device, and how...