
From the category archives:

WebOps

Web Operations
At the Velocity Conference last year, I was talking to Mike Loukides from O’Reilly about the topics being presented and how it was so great to see such successful veterans of the field come out from behind the curtain and share their experiences. Mike said that there was interest in doing a book on the (obviously) broad subject of web operations, in a format similar to the Beautiful books that O’Reilly has in their Theory in Practice series.

Needless to say, I jumped at the chance to help out. Over the following months, Jesse Robbins and I wrangled a bunch of topics we thought were integral to the field and authors who could cover them. It’s in the final stages of being published as we speak, but I think it came out pretty damn good.

These folks cranked out great chapters while still doing their day jobs, and it shows. It’s a great collection of war stories, advice, and hard-earned lessons.

Here is a list of the chapters:

“The Web Ops Career Path” Theo Schlossnagle
“Cloud Computing At Picnik: Lessons Learned” Justin Huff
“Infrastructure and Application Metrics” Matt Massie and myself
“Continuous Deployment” Eric Ries
“Infrastructure as Code” Adam Jacob
“Monitoring” Patrick Debois
“How Complex Systems Fail” Dr. Richard Cook
“Community Management and Web Operations” Heather Champ
“Dealing With Unexpected Traffic Spikes” Brian Moon
“Dev and Ops Cooperation and Collaboration” Paul Hammond
“How Your Visitors Feel: User-Facing Metrics” Alistair Croll and Sean Power
“Relational Database Strategy and Tactics For The Web” Baron Schwartz
“The Art and Science of Postmortems” Jacob Loomis
“Managing Web Storage” Anoop Nagwani
“Nonrelational Datastores” Eric Florenzano
“Agile Infrastructure” Andrew Clay Shafer
“Things That Go Bump In The Night (And How To Sleep Through Them)” Michael Christian

Royalties from the sales will go to the national 826 Valencia organization, which is dedicated to supporting students ages 6 to 18 with their writing skills. They do this by offering free drop-in tutoring at eight different locations around the country, as well as special events, student publishing, and scholarships.


I guess I’m late on getting to this, but How Complex Systems Fail by Richard Cook is excellent.

Let me start with this: I don’t think I can overstate how right-on this paper is, with respect to the challenges, solutions, observations, and concerns involved with operating a medium to large web infrastructure. I found this via @benjaminblack, and I agree with him 100%: this should be considered required reading for anyone in our industry. I’m not sure if Cook ever thought that his paper would apply to web infrastructure, but I think it can and does. Please take 30 minutes right now and read it. :)

There are a number of salient points in the paper that I’d like to comment on. Again, this is through the lens of failures of complex systems as it pertains to web operations:

7) Post-accident attribution to a ‘root cause’ is fundamentally wrong.

I’m going to guess that this portion may be viewed as controversial in the prevailing webops wisdom, where post-mortems are for sure necessary, but whose content may or may not be effective in preventing similar types of failure. I do value the process of a post-mortem, because I think the human element of understanding complex failures is important, and doing whatever you can to put safety in place is good, modulo what is said in section #16 of the paper. I believe that even a rudimentary process of “5 Whys” has value. But at the same time, I also think there is something to the spirit of this paragraph: there is a danger in standing behind a single underlying cause when there are systemic failures involved. Doing this can lead to the false belief that you’ve got this mode covered, that you’ve found the silver bullet that made the whole mountain crumble, and jeez what a relief because that will never bite us again.

14) Change introduces new forms of failure.

I totally agree with this point. However, I often see this as a rallying point for operations teams to say “No!” to change, when instead they should be working alongside development (and product owners) with a goal of reducing the risk of failure associated with each change. I do not believe that ‘release early, release often’ in and of itself can reduce that risk. I believe that the real (and only) way to do this is both technical and cultural. But I’ve spoken about this before.

16) Safety is a characteristic of systems and not of their components

Emphasis on “Safety cannot be purchased or manufactured; it is not a feature that is separate from the other components of the system.” Real safety comes from smart people doing smart things to the entire shebang, not the individual guts.

And I think this is the point I love the most, with all of my heart:

18) Failure free operations require experience with failure.

Fear is a strong emotion. I believe it can be used as a strong motivator for ensuring safety in the face of constant change, instead of a reason to push back on the very idea of change. Embrace fear of outages and degradation. Use it to guide your architecture, your code, your infrastructure. So lean into it.

There are a lot of great points in the paper, and I could go on, but you get the idea.


Meanwhile: More Meta-Metrics

October 5, 2009

Like all sane web organizations, we gather metrics about our infrastructure and applications. As many metrics as we can, as often as we can. These metrics, given the right context, help us figure out all sorts of things about our application, infrastructure, processes, and business. Things such as…
What:
…did we do before (historical trending, etc)
…is going [...]


Slides for Velocity Talk 2009

June 23, 2009

UPDATE: blip.tv has the video of the talk as well, below. Jeez I have some major bed-head.
That was a blast! I had never done a ‘duet’ talk before. Here are the slides:
10+ Deploys Per Day: Dev and Ops Cooperation at Flickr
…and the video of it is here:


Slides from Web2.0 Expo 2009. (and somethin else interestin’)

April 3, 2009

That was a pretty good time. Saw lots of good and wicked smaht people, and I got a lot of great questions after my talk. The slides are up on slideshare, and here are the PDF slides.
Operational Efficiency Hacks Web20 Expo2009
View more presentations from John Allspaw.

UPDATE: Gil Raphaelli has posted his python bindings he [...]


Some Things We Did Today

March 5, 2009

Moving one of our eight photoserving farms from hardware Layer7 URL hash balancing (expensive, has limits) to L4 DSR balancing with CARP (cheap and simple) and figuring out how to juggle 18,000 requests/second while we do it.
Built yet more automated query analysis reporting (with some yummy MySQLProxy)
Added yet another aggregated graph of database queries, [...]


2009 Velocity Conference submissions are open!

November 20, 2008

The CFP for next year’s Velocity Conference is up now, so all you ops and performance ninjas submit your ideas for talks.
I’m lucky enough to be on the program committee this year, and I think the conference is a huge opportunity to spread the ops love on all kinds of topics. There’s a list on [...]


Code Swarm for Config Management

October 21, 2008

Gil Raphaelli, one of the guys on our Flickr Ops team, put together a Code Swarm animation for the configuration/deployment management tool we use at Flickr to manage our infrastructure. Myles Grant did this for our bug reporting system as well. Check it out:

Our automated config management system is called Gemstone, but conceptually you can [...]


More back-of-envelope-math…

September 18, 2008

Via kottke: some good examples of doing rough math in your head, causing you to guess about assumptions all along the way.
IMHO, being able to do this is one of the things that makes a good web ops person. The examples might be “useless”, but the process is invaluable.
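To make the process concrete, here is a minimal sketch of that kind of rough math in Python. The only real number in it is the 18,000 requests/second figure from the "Some Things We Did Today" post. The per-server capacity of 1,500 requests/second and the 50% headroom are made-up assumptions for illustration, not measurements, and the function names are hypothetical.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def requests_per_day(requests_per_second: float) -> float:
    """Scale a sustained per-second rate up to a full day."""
    return requests_per_second * SECONDS_PER_DAY


def servers_needed(peak_rps: float, per_server_rps: float, headroom: float = 0.5) -> int:
    """Rough server count if each box is only run at `headroom` of its capacity.

    per_server_rps and headroom are guesses you state up front and revise later.
    """
    usable_rps = per_server_rps * headroom
    return math.ceil(peak_rps / usable_rps)


if __name__ == "__main__":
    rate = 18_000  # requests/second, from the post above
    print(f"{requests_per_day(rate):,.0f} requests/day if sustained")
    # 1,500 req/s per server is a hypothetical capacity, not a real measurement.
    print(servers_needed(rate, per_server_rps=1_500), "servers at 50% headroom")
```

The answer matters less than the habit: every input is a guess you had to say out loud, which is exactly the part worth practicing.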


Internet-Scale Efficiency

September 16, 2008

James Hamilton’s excellent LADIS 2008 presentation has lots of great stuff in it about internet scale bits. Cool stats.
