While searching around for something else, I came across this note I sent in late 2009 to the executive leadership of Yahoo’s Engineering organization. This was when I was leaving Flickr to work at Etsy. My intent on sending it was to be open to the rest of Yahoo about what how things worked at Flickr, and why. I did this in the hope that other Yahoo properties could learn from that team’s process and culture, which we worked really hard at building and keeping.
The idea that Development and Operations could:
- Share responsibility/accountability for availability and performance
- Have an equal seat at the table when it came to application and infrastructure design, architecture, and emergency response
- Build and maintain a deferential culture to each other when it came to domain expertise
- Cultivate equanimity when it came to emergency response and post-mortem meetings
…wasn’t evenly distributed across other Yahoo properties, from my limited perspective.
But I knew (still know) lots of incredible engineers at Yahoo that weren’t being supported as they could be by their upper management. So sending this letter was driven by wanting to help their situation. Don’t get me wrong, not everything was rainbows and flowers at Flickr, but we certainly had a lot more of them than other Yahoo groups.
When I re-read this, I’m reminded that when I came to Etsy, I wasn’t entirely sure that any of these approaches would work in the Etsy Engineering environment. The engineering staff at Etsy was a lot larger than Flickr’s and continuous deployment was in its infancy when I got there. I can now happily report that 2 years later, these concepts not only solidified at Etsy, they evolved to accommodate a lot more than what challenged us at Flickr. I couldn’t be happier about how it’s turned out.
I’ll note that there’s nothing groundbreaking in this note I sent, and nothing that I hadn’t said publicly in a presentation or two around the same time.
This is the note I sent to the three layers of management above me in my org at Yahoo:
Subject: Why Flickr went from 73rd most popular Y! property in 2005 to the 6th, 5 years later.
Below are my thoughts about some of the reasons why Flickr has had success, from an Operations Engineering manager’s point of view.
When I say everyone below, I mean all of the groups and sub-groups within the Flickr property: Product, Customer Care, Development, Service Engineering, Abuse and Advocacy, Design, and Community Management.
Here are at least some of the reasons we had success:
- Product included and respected everyone’s thoughts, in almost every feature and choice.
- Everyone owned availability of the site, not just Ops.
- Community management and customer service were involved early and often. In everything. If they weren’t, it was an oversight taken seriously, and would be fixed.
- Development and Operations had zero divide when it came to availability and performance. No, really. They worked in concert, involving each other in their own affairs when it mattered, and trusting each other every step of the way. This culture was taught, not born.
- I have never viewed Flickr Operations as firefighters, and have never considered Flickr Dev Engineering to be arsonists. (I have heard this analogy elsewhere in Yahoo.) The two teams are 100% equal partners, with absolute transparency. If anything, we had a problem with too much deference given between the two teams.
- The site was able to evolve, change, and grow as fast as needed to be as long as it was made safe to do so. To be specific: code and config deploys. When it wasn’t safe, we slowed, and everyone was fine with that happening, knowing that the goal was to return to fast-as-we-need-to-be. See above about everyone owning availability.
- Developers were able to see their work almost instantly in production. Institutionalized fear of degradation and outage ensured that changes were as safe as they needed to be. Developers and Ops engineers knew intuitively that the safety net you have is the one that you have built for yourself. When changes are small and frequent, the causes of degradation or outage due to code deploys are exceptionally transparent to all involved. (Re-read above about everyone owning availability.)
- We never deployed “early and often” because it was:
- a trend,
- we wanted to brag,
- or because we think we’re better than anyone. (We did it because it was right for Flickr to do so.)
- Everyone was made aware of any launches that had risks associated with it, and we worked on lists of things that could possibly go wrong, and what we would do in the event they did go wrong. Sometimes we missed things, and we had to think quickly, but those times were rare with new feature launches.
- Flickr Ops had always had the “go or no-go” decision, as did other groups who could vote with respect to their preparedness. A significant part of my job was working towards saying “go”, not “no-go”. In fact, almost all of it.
Examples: the most boring (anti-climatic, from an operational perspective) launches ever
- Flickr Video: I actually held the launch back by some hours until we could rectify a networking issue that I thought posed a risk to post-launch traffic. Other than that, it was a switch in the application that was turned from off to on. The feature’s code had been on prod servers for months in beta. See ‘dark launch’
- Homepage redesign: Unprecedented amount of activity data being pulled onto the logged-in homepage, order of magnitude increase in the number of calls to backend databases. Why was it boring? Because it was dark launched 10 days earlier. The actual launch was a flip of the ‘on’ switch
- People In Photos (aka, ‘people tagging’): Because the feature required data that we didn’t actually have yet, we couldn’t exactly dark launch it. It was a feature that had to be turned on, or off. Because of this, Flickr’s Architect wrote out a list of all of the parts of the feature that could cause load-related issues, what the likelihood of each was, how to turn those parts of the feature off, what custome care affect it might have, and what contingencies would probably require some community management involvement.
When we already have the data on the backend needed to display for a new feature, we would ‘dark launch’, meaning that the code would make all of the back-end calls (i.e. the calls that bring load-related risk to the deploy) and simply throw the data away, not showing it to the user. We could then increase or decrease the percentage of traffic who made those calls in safety, since we never risked the user experience by showing them a new feature and then having to take it away because of load issues.
This increases everyone’s confidence almost to the point of apathy, as far as fear of load-related issues are concerned. I have no idea how many code deploys there were made to production on any given day in the past 5 years (although I could find it on a graph easily), because for the most part I don’t care, because those changes made in production have such a low chance of causing issues. When they have caused issues, everyone on the Flickr staff can find on a webpage when the change was made, who made the change, and exactly (line-by-line) what the change was.
In the case where we had confidence in the resource consumption of a feature, but not 100% confidence in functionality, the feature was turned on for staff only. I’d say that about 95% of the features we launched in those 5 years were turned on for staff long before they were turned on for the entire Flickr population. When we still didn’t feel 100% confident, we ramped up the percentage of Flickr members who could see and use the new feature slowly.
We have many pieces of Flickr that are encapsulated as ‘feature’ flags, which look as simple as: $cfg[disable_feature_video] = 0; this allows the site to be much more resilient to specific failures. If we have any degradation within a certain feature, we can simply turn that feature off in many cases, instead of taking the entire site down. These ‘flags’ have, in the past, been prioritized with conversations with Product, so there is an easy choice to make if something goes wrong and site uptime becomes opposed to feature uptime.
This is an extremely important point: Dark Launches and Config Flags, were concepts and tools created by Flickr Development, not Flickr Operations, even though the end-result of each points toward a typical Operations goal: stability and availability. This is a key distinction. These are initiatives made by Engineering leadership because devs feel protective of the availability of the site, respectful of Operations responsibilities, and just plain good engineering.
If the Flickr Operations had built these tools and approaches to keeping the site stable, I do not believe we would have the same amount of success.
There is more on this topic here: http://code.flickr.com/blog/2009/12/02/flipping-out/
Flickr Operations is in an enviable position in that they don’t have to convince anyone in the Flickr property that:
- Operations has ‘go or no-go’ decision-making power, along with every other subgroup.
- Spending time, effort, and money to ensure stable feature launches before they launch is the rule, not the exception.
- Continuous Deployment is better for the availability of the site
- Flickr Operations should be involved as early as possible in the development phase of any project
These things are taken for granted. Any other way would simply feel weird.
I have no idea if posting this letter helps anyone other than myself, but there you go.