<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Kitchen Soap &#187; Flickr</title>
	<atom:link href="http://www.kitchensoap.com/category/flickr/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.kitchensoap.com</link>
	<description>Thoughts on capacity planning and web operations.</description>
	<lastBuildDate>Tue, 17 Jan 2012 17:57:33 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Convincing management that cooperation and collaboration was worth it</title>
		<link>http://www.kitchensoap.com/2012/01/05/convincing-management-that-cooperation-and-collaboration-was-worth-it/</link>
		<comments>http://www.kitchensoap.com/2012/01/05/convincing-management-that-cooperation-and-collaboration-was-worth-it/#comments</comments>
		<pubDate>Thu, 05 Jan 2012 15:35:10 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Culture]]></category>
		<category><![CDATA[Flickr]]></category>
		<category><![CDATA[Random]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=8760</guid>
		<description><![CDATA[While searching around for something else, I came across this note I sent in late 2009 to the executive leadership of Yahoo&#8217;s Engineering organization. This was when I was leaving Flickr to work at Etsy. My intent in sending it was to be open to the rest of Yahoo about how things worked at [...]]]></description>
			<content:encoded><![CDATA[<p>While searching around for something else, I came across this note I sent in late 2009 to the executive leadership of Yahoo&#8217;s Engineering organization. This was when I was leaving Flickr to work at Etsy. My intent in sending it was to be open to the rest of Yahoo about how things worked at Flickr, and why. I did this in the hope that other Yahoo properties could learn from that team&#8217;s process and culture, which we worked really hard at building and keeping.</p>
<p>The idea that Development and Operations could:</p>
<ul>
<li>Share responsibility/accountability for availability and performance</li>
<li>Have an equal seat at the table when it came to application and infrastructure design, architecture, and emergency response</li>
<li>Build and maintain a culture of deference to each other when it came to domain expertise</li>
<li>Cultivate equanimity when it came to emergency response and post-mortem meetings</li>
</ul>
<div>
<p>&#8230;wasn&#8217;t evenly distributed across other Yahoo properties, from my limited perspective.</p>
<p>But I knew (still know) lots of incredible engineers at Yahoo who weren&#8217;t being supported as they could be by their upper management. So sending this letter was driven by wanting to help their situation. Don&#8217;t get me wrong, not everything was rainbows and flowers at Flickr, but we certainly had a lot more of them than other Yahoo groups.</p>
<p>When I re-read this, I&#8217;m reminded that when I came to Etsy, I wasn&#8217;t entirely sure that any of these approaches would work in the Etsy Engineering environment. The engineering staff at Etsy was a lot larger than Flickr&#8217;s and continuous deployment was in its infancy when I got there. I can now happily report that 2 years later, these concepts not only solidified at Etsy, they evolved to accommodate a <em><strong>lot</strong></em> more than what challenged us at Flickr. I couldn&#8217;t be happier about how it&#8217;s turned out.</p>
<p>I&#8217;ll note that there&#8217;s nothing groundbreaking in this note I sent, and nothing that I hadn&#8217;t said publicly in a presentation or two around the same time.</p>
<p>This is the note I sent to the three layers of management above me in my org at Yahoo:</p>
<blockquote>
<h3>Subject: Why Flickr went from 73rd most popular Y! property in 2005 to the 6th, 5 years later.</h3>
<p>Below are my thoughts about some of the reasons why Flickr has had success, from an Operations Engineering manager&#8217;s point of view.</p>
<p>When I say <em>everyone </em>below, I mean all of the groups and sub-groups within the Flickr property: <strong>Product</strong>, <strong>Customer Care</strong>, <strong>Development</strong>, <strong>Service Engineering</strong>, <strong>Abuse and Advocacy</strong>, <strong>Design</strong>, and <strong>Community Management</strong>.</p>
<h3>Here are at least some of the reasons we had success:</h3>
<ul>
<li>Product included and respected everyone&#8217;s thoughts, in almost every feature and choice.</li>
<li><em>Everyone</em> owned availability of the site, not just Ops.</li>
<li>Community management and customer service were involved <strong>early</strong> and <strong>often</strong>. In <em>everything</em>. If they weren&#8217;t, it was an oversight taken seriously, and would be fixed.</li>
<li>Development and Operations had <strong>zero</strong> divide when it came to availability and performance. No, really. They worked in concert, involving each other in their own affairs when it mattered, and trusting each other every step of the way. This culture was taught, not born.</li>
<li>I have <em>never</em> viewed Flickr Operations as <strong><em>firefighters</em></strong>, and have never considered Flickr Dev Engineering to be <strong><em>arsonists</em></strong>. (I have heard this analogy elsewhere in Yahoo.) The two teams are 100% equal partners, with absolute transparency. If anything, we had a problem with too much deference given between the two teams.</li>
<li>The site was able to evolve, change, and grow as fast as it needed to, as long as it was made safe to do so. To be specific: code and config deploys. When it wasn&#8217;t safe, we slowed down, and everyone was fine with that happening, knowing that the goal was to return to <em>fast-as-we-need-to-be</em>. See above about everyone owning availability.</li>
<li>Developers were able to see their work almost instantly in production. Institutionalized fear of degradation and outage ensured that changes were as safe as they needed to be. Developers and Ops engineers knew intuitively that the safety net you have is the one that you have built for yourself. When changes are small and frequent, the causes of degradation or outage due to code deploys are exceptionally transparent to all involved. (Re-read above about everyone owning availability.)</li>
<li>We never deployed &#8220;early and often&#8221; because:
<ul>
<li>it was a trend,</li>
<li>we wanted to brag,</li>
<li>or we thought we were better than anyone. (We did it because it was right for Flickr to do so.)</li>
</ul>
</li>
<li>Everyone was made aware of any launches that had risks associated with them, and we worked on lists of things that could possibly go wrong, and what we would do in the event they did go wrong. Sometimes we missed things, and we had to think quickly, but those times were rare with new feature launches.</li>
<li>Flickr Ops <em>always</em> had the &#8220;go or no-go&#8221; decision, as did other groups who could vote with respect to their preparedness. A significant part of my job was working towards saying &#8220;go&#8221;, not &#8220;no-go&#8221;. In fact, almost all of it.</li>
</ul>
<h4>Examples: the most boring (anticlimactic, from an operational perspective) launches ever</h4>
<ul>
<li><strong>Flickr Video</strong>: I actually held the launch back by some hours until we could rectify a networking issue that I thought posed a risk to post-launch traffic. Other than that, it was a switch in the application that was turned from off to on. The feature&#8217;s code had been on prod servers for months in beta. See &#8216;dark launch&#8217;.</li>
<li><strong>Homepage redesign</strong>: Unprecedented amount of activity data being pulled onto the logged-in homepage, order of magnitude increase in the number of calls to backend databases. Why was it boring? Because it was dark launched 10 days earlier. The actual launch was a flip of the &#8216;on&#8217; switch.</li>
<li><strong>People In Photos (aka, &#8216;people tagging&#8217;)</strong>: Because the feature required data that we didn&#8217;t actually have yet, we couldn&#8217;t exactly dark launch it. It was a feature that had to be turned on, or off. Because of this, Flickr&#8217;s Architect wrote out a list of all of the parts of the feature that could cause load-related issues, what the likelihood of each was, how to turn those parts of the feature off, what customer care effect it might have, and what contingencies would probably require some community management involvement.</li>
</ul>
<h4>Dark Launches</h4>
<p>When we already had the data on the backend needed to display a new feature, we would &#8216;dark launch&#8217; it, meaning that the code would make all of the back-end calls (i.e. the calls that bring load-related risk to the deploy) and simply throw the data away, not showing it to the user. We could then safely increase or decrease the percentage of traffic that made those calls, since we never risked the user experience by showing users a new feature and then having to take it away because of load issues.</p>
<p>This increases <em>everyone&#8217;s</em> confidence almost to the point of apathy, as far as fear of load-related issues is concerned. I have no idea how many code deploys were made to production on any given day in the past 5 years (although I could find it on a graph easily), because for the most part I don&#8217;t care: those changes made in production have such a low chance of causing issues. When they have caused issues, everyone on the Flickr staff can find on a webpage <strong><em>when</em></strong> the change was made, <strong><em>who</em></strong> made the change, and exactly (line-by-line) <strong><em>what</em></strong> the change was.</p>
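<p>A minimal sketch of the dark-launch idea described above, written in Python rather than Flickr&#8217;s own code. The function names, the 10% default, and the hash-based bucketing are all assumptions made for illustration; the only thing taken from the text is the shape of the technique: make the real back-end calls for a chosen percentage of traffic, then throw the result away.</p>

```python
import hashlib

DARK_LAUNCH_PERCENT = 10  # hypothetical dial: share of requests that exercise the new calls


def in_dark_launch(request_id: str, percent: int = DARK_LAUNCH_PERCENT) -> bool:
    """Place a request into one of 100 stable buckets and check it against the dial."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket in range(percent)


def fetch_activity_feed(user_id: str) -> list:
    """Hypothetical stand-in for the expensive back-end calls of the new feature."""
    return [f"activity for {user_id}"]


def render_homepage(user_id: str, request_id: str, percent: int = DARK_LAUNCH_PERCENT) -> str:
    if in_dark_launch(request_id, percent):
        try:
            fetch_activity_feed(user_id)  # generate the load, then discard the result
        except Exception:
            pass  # a dark-launched call must never break the page the user sees
    return f"homepage for {user_id}"  # the visible output is the same either way
```

<p>Turning the percentage up or down changes only the back-end load; the page a user sees does not depend on it.</p>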
<p>In the case where we had confidence in the resource consumption of a feature, but not 100% confidence in functionality, the feature was turned on for staff only. I&#8217;d say that about 95% of the features we launched in those 5 years were turned on for staff long before they were turned on for the entire Flickr population. When we still didn&#8217;t feel 100% confident, we ramped up the percentage of Flickr members who could see and use the new feature slowly.</p>
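<p>The staff-first, then percentage-based rollout can be sketched in the same hedged way. The staff ids and the CRC-based bucketing below are hypothetical; the point is only that a stable bucket per member means anyone who sees the feature at a low percentage keeps seeing it as the percentage is ramped up.</p>

```python
import zlib

STAFF_IDS = {"staff-1", "staff-2"}  # hypothetical staff account ids


def can_see_feature(user_id: str, rollout_percent: int, staff_ids=STAFF_IDS) -> bool:
    """Staff always see the feature; members see it once their bucket is included."""
    if user_id in staff_ids:
        return True
    bucket = zlib.crc32(user_id.encode("utf-8")) % 100  # stable bucket from 0 to 99
    return bucket in range(rollout_percent)
```

<p>With <code>rollout_percent</code> at 0 the feature is staff-only; at 100 it is on for the entire population.</p>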
<h4>Config Flags</h4>
<p>We have many pieces of Flickr that are encapsulated as &#8216;feature&#8217; flags, which look as simple as <code>$cfg['disable_feature_video'] = 0;</code>. This allows the site to be much more resilient to specific failures. If we have any degradation within a certain feature, we can simply turn that feature off in many cases, instead of taking the entire site down. These &#8216;flags&#8217; have, in the past, been prioritized in conversations with Product, so there is an easy choice to make if something goes wrong and site uptime becomes opposed to feature uptime.</p>
<p>This is an extremely important point: Dark Launches and Config Flags were concepts and tools created by Flickr Development, not Flickr Operations, even though the end result of each points toward a typical Operations goal: stability and availability. This is a key distinction. These are initiatives made by Engineering leadership because devs feel protective of the availability of the site, are respectful of Operations&#8217; responsibilities, and consider them just plain good engineering.</p>
<p>If Flickr Operations had built these tools and approaches to keeping the site stable, I do not believe we would have had the same amount of success.</p>
<p>There is more on this topic here: <a href="http://code.flickr.com/blog/2009/12/02/flipping-out/" target="_blank">http://code.flickr.com/blog/2009/12/02/flipping-out/</a></p>
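<p>As a rough illustration of how such a flag gets consulted at render time, here is a Python sketch (not the PHP Flickr ran). Only the <code>disable_feature_video</code> name comes from the text; the second flag, the helper, and the page sections are made up for the example.</p>

```python
cfg = {
    "disable_feature_video": 0,           # flag named in the text: 0 means the feature is on
    "disable_feature_people_tagging": 1,  # hypothetical second flag, currently switched off
}


def feature_enabled(name: str, config: dict = cfg) -> bool:
    """A feature is on unless its disable flag is set; a missing flag counts as on."""
    return not config.get(f"disable_feature_{name}", 0)


def render_photo_page(config: dict = cfg) -> list:
    sections = ["photo", "comments"]  # the core page is always served
    if feature_enabled("video", config):
        sections.append("video player")
    if feature_enabled("people_tagging", config):
        sections.append("people in photo")
    return sections
```

<p>Setting a flag to 1 drops one section from the page while everything else keeps serving, which is the site-uptime-over-feature-uptime choice the flags exist for.</p>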
<h4>Summary</h4>
<p>Flickr Operations is in an enviable position in that they don&#8217;t have to convince anyone in the Flickr property that:</p>
<ol>
<li>Operations has &#8216;go or no-go&#8217; decision-making power, along with every other subgroup.</li>
<li>Spending time, effort, and money to ensure stable feature launches <em>before they launch</em> is the rule, not the exception.</li>
<li>Continuous Deployment is better for the availability of the site.</li>
<li>Flickr Operations should be involved as early as possible in the development phase of any project.</li>
</ol>
<p>These things are taken for granted. Any other way would simply feel weird.</p></blockquote>
<p>I have no idea if posting this letter helps anyone other than myself, but there you go.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2012/01/05/convincing-management-that-cooperation-and-collaboration-was-worth-it/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>MTTR is more important than MTBF (for most types of F)</title>
		<link>http://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/</link>
		<comments>http://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/#comments</comments>
		<pubDate>Sun, 07 Nov 2010 18:27:23 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Culture]]></category>
		<category><![CDATA[Etsy]]></category>
		<category><![CDATA[Flickr]]></category>
		<category><![CDATA[Slides]]></category>
		<category><![CDATA[Talks]]></category>
		<category><![CDATA[WebOps]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=508</guid>
		<description><![CDATA[This week I gave a talk at QCon SF about development and operations cooperation at Etsy and Flickr.  It&#8217;s a refresh of talks I&#8217;ve given in the past, with more detail about how it&#8217;s going at Etsy. (It&#8217;s going excellently ) There&#8217;s a bunch of topics in the presentation slides, all centered around roles, responsibilities, [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>This week I gave a talk at QCon SF about <a href="http://www.slideshare.net/jallspaw/dev-and-ops-collaboration-and-awareness-at-etsy-and-flickr" target="_blank">development and operations cooperation at Etsy and Flickr</a>.  It&#8217;s a refresh of talks I&#8217;ve given in the past, with more detail about how it&#8217;s going at Etsy. (It&#8217;s going excellently <img src='http://www.kitchensoap.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  )</p>
<p>There&#8217;s a bunch of topics in the presentation slides, all centered around roles, responsibilities, and intersection points of domain expertise commonly found in development and operations teams. One of the not-groundbreaking ideas that I&#8217;m finally getting down is something that should be evident for anyone practicing or interested in &#8216;continuous deployment&#8217;:</p>
<p style="padding-left: 30px;">Being able to recover quickly from failure is more important than having failures less often.</p>
<p>This has what should be an obvious caveat: some types of failures shouldn&#8217;t ever happen (failures resulting in accidental data <em>loss</em>, for example), and not all failures/degradations/outages are the same.</p>
<p>Put another way:</p>
<blockquote>
<h1><strong>MTTR is more important than MTBF </strong></h1>
<p><strong><em>(for most types of F)</em></strong></p></blockquote>
<p>(Edited: I did say originally &#8220;MTTR &gt; MTBF&#8221;)</p>
<p>What I&#8217;m definitely <strong>not</strong> saying is that failure should be an acceptable condition. I&#8217;m positing that since failure <em>will</em> happen, it&#8217;s just as important (or in some cases <em>more</em> important) to spend time and energy on your response to failure as on trying to prevent it. I agree with <a href="http://twitter.com/ph" target="_blank">Hammond</a>, when he said:</p>
<blockquote><p>If you think you can prevent failure, then you aren&#8217;t developing your ability to respond.</p></blockquote>
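<p>To put rough numbers on the trade-off, here is a small Python sketch using the standard steady-state availability formula, MTBF / (MTBF + MTTR). The formula is textbook reliability arithmetic rather than something from the talk, and the hours below are invented for illustration.</p>

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical service: fails every 500 hours on average, takes 2 hours to repair.
baseline = availability(mtbf_hours=500, mttr_hours=2)

# Two ways to improve it: fail half as often, or recover twice as fast.
doubled_mtbf = availability(mtbf_hours=1000, mttr_hours=2)
halved_mttr = availability(mtbf_hours=500, mttr_hours=1)

print(f"baseline     {baseline:.6f}")
print(f"doubled MTBF {doubled_mtbf:.6f}")
print(f"halved MTTR  {halved_mttr:.6f}")
```

<p>Under this model, halving the repair time buys the same availability as doubling the time between failures, and recovering faster is often the cheaper of the two to engineer.</p>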
<p>In a complete steal of <a href="http://radar.oreilly.com/artur/" target="_blank">Artur Bergman</a>&#8217;s material, an example in the slides of the talk is of the Jeep versus Rolls Royce:</p>
<p><a href="http://www.kitchensoap.com/wp-content/uploads/2010/11/Screen-shot-2010-11-07-at-1.08.39-PM.png"><img class="alignleft size-medium wp-image-517" title="Jeep versus Rolls" src="http://www.kitchensoap.com/wp-content/uploads/2010/11/Screen-shot-2010-11-07-at-1.08.39-PM-300x225.png" alt="Jeep versus Rolls" width="300" height="225" /></a> Artur has a Jeep, and he&#8217;s right when he says that for the most part, Jeeps are built to optimize Mean-Time-To-Repair, not Mean-Time-Between-Failures, which is what the classical approach to automotive engineering optimizes. This is likely because Jeep owners have been beating the shit out of their vehicles for decades, and every now and again, they expect that abuse to break something. Jeep designers know this, which is why a Jeep is so damn easy to repair. Nuts and bolts are easy to reach, tools are included when you buy the thing, and if you haven&#8217;t seen the video of <a href="http://www.youtube.com/watch?v=lgwF8mdQwlw" target="_blank">Army personnel disassembling and reassembling a Jeep in under 4 minutes</a>, you&#8217;re missing out.</p>
<p>The Rolls Royce, on the other hand, likely doesn&#8217;t have such adventurous owners, and when it does break down, it&#8217;s a fine and acceptable thing for the car to be out of service for a long and expensive repair by the manufacturer.</p>
<p>We as web operations folks want our architectures to be built optimized for MTTR, not for MTBF. I think that the reasons should be obvious, and the fact that practices like:</p>
<ul>
<li>Dark launching</li>
<li>Percentage-based production A/B rollouts</li>
<li><a href="http://code.flickr.com/blog/2009/12/02/flipping-out/" target="_blank">Feature flags </a></li>
</ul>
<p>are becoming commonplace should verify this approach as having legs.</p>
<p>The slides from QConSF are here:</p>
<div style="width:425px" id="__ss_5695138"> <strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/jallspaw/dev-and-ops-collaboration-and-awareness-at-etsy-and-flickr" title="Dev and Ops Collaboration and Awareness at Etsy and Flickr" target="_blank">Dev and Ops Collaboration and Awareness at Etsy and Flickr</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/5695138" width="425" height="355" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
<div style="padding:5px 0 12px"> View more <a href="http://www.slideshare.net/" target="_blank">presentations</a> from <a href="http://www.slideshare.net/jallspaw" target="_blank">John Allspaw</a> </div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Slides from Web2.0 Expo 2009. (and somethin else interestin&#8217;)</title>
		<link>http://www.kitchensoap.com/2009/04/03/slides-from-web20-expo-2009-and-somethin-else-interestin/</link>
		<comments>http://www.kitchensoap.com/2009/04/03/slides-from-web20-expo-2009-and-somethin-else-interestin/#comments</comments>
		<pubDate>Fri, 03 Apr 2009 21:21:40 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Flickr]]></category>
		<category><![CDATA[Slides]]></category>
		<category><![CDATA[Talks]]></category>
		<category><![CDATA[Tools]]></category>
		<category><![CDATA[WebOps]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=115</guid>
		<description><![CDATA[That was a pretty good time. Saw lots of good and wicked smaht people, and I got a lot of great questions after my talk. The slides are up on slideshare, and here are the PDF slides. Operational Efficiency Hacks Web20 Expo2009 View more presentations from John Allspaw. UPDATE: Gil Raphaelli has posted his python [...]]]></description>
			<content:encoded><![CDATA[<p>That was a pretty good time. Saw lots of good and wicked smaht people, and I got a lot of great questions after my talk. The slides are up on <a href="http://www.slideshare.net/jallspaw/operational-efficiency-hacks-web20-expo2009" target="_blank">slideshare</a>, and here are the <a title="Operational Efficiency Hacks Web 2.0 Expo 2009" href="http://kitchensoap.com/talks/OpsHacksWeb20Expo2009-Notes.pdf" target="_blank">PDF slides</a>.</p>
<div style="width:425px;text-align:left" id="__ss_1245887"><a style="font:14px Helvetica,Arial,Sans-serif;display:block;margin:12px 0 3px 0;text-decoration:underline;" href="http://www.slideshare.net/jallspaw/operational-efficiency-hacks-web20-expo2009?type=presentation" title="Operational Efficiency Hacks Web20 Expo2009">Operational Efficiency Hacks Web20 Expo2009</a><object style="margin:0px" width="425" height="355"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=opshacksweb20expo2009-090403152449-phpapp02&#038;stripped_title=operational-efficiency-hacks-web20-expo2009" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=opshacksweb20expo2009-090403152449-phpapp02&#038;stripped_title=operational-efficiency-hacks-web20-expo2009" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"></embed></object>
<div style="font-size:11px;font-family:tahoma,arial;height:26px;padding-top:2px;">View more <a style="text-decoration:underline;" href="http://www.slideshare.net/">presentations</a> from <a style="text-decoration:underline;" href="http://www.slideshare.net/jallspaw">John Allspaw</a>.</div>
</div>
<p><strong><em>UPDATE:</em></strong> Gil Raphaelli has <a href="http://g.raphaelli.com/2009/4/2/libyahoo2-python-bindings" target="_blank">posted</a> the Python bindings he wrote for libyahoo2, which we use in our Ops IM Bot.</p>
<p>There <em>was</em> something that I left out of my slides, mostly because I didn&#8217;t want to distract from the main topic, which was optimization and efficiencies.</p>
<p>While I used our image processing capacity at Flickr as an example of how compilers and hardware can have some significant influence on how fast or efficient you can run, I had wondered what the Magical Cloud™ would do with these differences.</p>
<p>So I took the tests I ran on our own machines and ran them on Small, Medium, Large, Extra Large, and Extra Large(High) instances of EC2, to see. The results were a bit surprising to me, but I&#8217;m sure not surprising to anyone who uses EC2 with any significant amount of CPU demand.</p>
<p>For the testing, I have a script that does some super simple image resizing with GraphicsMagick. It resizes a DSLR photo into 6 different sizes, much in the same way that we do at Flickr for the real world. It does that resizing on about 7 different files, and I timed them all. This is with the most recent version of GraphicsMagick, 1.3.5, with the awesome OpenMP bits in it.</p>
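<p>The original script isn&#8217;t shown, so what follows is only a guess at its shape, sketched in Python: resize each file to a handful of widths with <code>gm convert -resize</code> and time the whole batch. The six widths and the output naming are assumptions; the post only says there were 6 sizes.</p>

```python
import subprocess
import time

SIZES = [75, 100, 240, 500, 1024, 1600]  # hypothetical widths; the post only says "6 different sizes"


def resize_command(src: str, width: int) -> list:
    """Build a GraphicsMagick command that fits the image inside a width x width box."""
    dst = f"{src}.{width}.jpg"  # hypothetical output naming
    return ["gm", "convert", src, "-resize", f"{width}x{width}", dst]


def time_resizes(paths, sizes=SIZES, run=subprocess.run) -> float:
    """Resize every file to every size and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    for path in paths:
        for width in sizes:
            run(resize_command(path, width), check=True)
    return time.perf_counter() - start
```

<p>Calling <code>time_resizes(["photo1.jpg", "photo2.jpg"])</code> on a machine with GraphicsMagick installed would give one timing per hardware or instance type to compare.</p>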
<p>Here is the slide of the tests run on different (increasingly faster) dedicated machines:</p>
<p style="text-align: center;"><img class="size-medium wp-image-117 aligncenter" title="Faster Image Processing Hardware" src="http://www.kitchensoap.com/wp-content/uploads/2009/04/gm-hardware2-300x213.png" alt="Faster Image Processing Hardware" width="300" height="213" /></p>
<p>and here is the slide that I <em>didn&#8217;t</em> include, of the EC2 timings of the same test:</p>
<p style="text-align: center;"><img class="size-medium wp-image-118 aligncenter" title="Image Processing on EC2" src="http://www.kitchensoap.com/wp-content/uploads/2009/04/gm-ec2-300x213.png" alt="Image Processing on EC2" width="300" height="213" /></p>
<p>Now I&#8217;m not suggesting that the two graphs <strong><em>should</em></strong> look similar, or that EC2 <em>should</em> be faster. I&#8217;m well aware of the shift in perspective when deploying capacity within the cloud versus within your own data center. So I&#8217;m not surprised that the fastest test results are on the order of 2x slower on EC2. Application logic and feature design (synchronous versus asynchronous image processing, for example) can take care of these differences, and the slowdown could be a welcome trade-off against having to run your own machines.</p>
<p>What I am surprised about is the variation (or lack thereof) of all but the small instances. After I took a closer look at vmstat and top, I realized that the small instances consistently saw about 50-60% <a href="http://help.rightscale.com/cgi-bin/rightscale.cfg/php/enduser/std_adp.php?p_faqid=28" target="_blank">CPU stolen</a> from them, the mediums almost always saw zero stolen, and the Large and ExtraLarges saw up to 35% CPU stolen from them during the jobs.</p>
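<p>For anyone wanting to watch steal time without eyeballing vmstat&#8217;s <code>st</code> column, here is a hedged Python sketch that computes it from two samples of the aggregate <code>cpu</code> line in Linux&#8217;s <code>/proc/stat</code>, where steal is the eighth counter. The sample values in the usage note are made up, not measurements from these tests.</p>

```python
def parse_cpu_line(line: str) -> list:
    """Return the counters from the aggregate 'cpu' line of /proc/stat as integers."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in fields[1:]]


def steal_percent(before: str, after: str) -> float:
    """Percentage of CPU time stolen by the hypervisor between two samples.

    Assumes a kernel that reports at least eight counters:
    user, nice, system, idle, iowait, irq, softirq, steal.
    """
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    deltas = [new - old for old, new in zip(first, second)]
    total = sum(deltas[:8])  # guest time is already included in user, so stop at steal
    if total == 0:
        return 0.0
    return 100.0 * deltas[7] / total
```

<p>For example, <code>steal_percent("cpu 100 0 50 800 10 0 0 40 0 0", "cpu 200 0 100 1000 20 0 0 640 0 0")</code> covers an interval of 960 ticks, 600 of which were stolen.</p>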
<p>So, interesting.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2009/04/03/slides-from-web20-expo-2009-and-somethin-else-interestin/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Web Ops Visualizations Group on Flickr</title>
		<link>http://www.kitchensoap.com/2008/12/16/web-ops-visualizations-group-on-flickr/</link>
		<comments>http://www.kitchensoap.com/2008/12/16/web-ops-visualizations-group-on-flickr/#comments</comments>
		<pubDate>Tue, 16 Dec 2008 18:19:10 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Flickr]]></category>
		<category><![CDATA[Tools]]></category>
		<category><![CDATA[Web Ops]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=79</guid>
		<description><![CDATA[Like lots of operations people, we&#8217;re quite addicted to data pr0n here at Flickr. We&#8217;ve got graphs for pretty much everything, and add graphs all of the time. We&#8217;ve blogged about some of how and why we do it. One thing we&#8217;re in the habit of is screenshotting these graphs when things go wrong, right, [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>Like lots of operations people, we&#8217;re quite addicted to data pr0n here at Flickr. We&#8217;ve got graphs for pretty much everything, and add graphs all of the time. We&#8217;ve <a href="http://code.flickr.com/blog/2008/10/27/counting-timing/" target="_blank">blogged</a> <a href="http://code.flickr.com/blog/2008/10/13/flickr-digs-ganglia/" target="_blank">about</a> some of how and why we do it.</p>
<p>One thing we&#8217;re in the habit of is screenshotting these graphs when things go wrong, right, or indifferent, and adding them to a group on Flickr. I&#8217;ve decided to make a public group for these sorts of screenshots, for anyone to contribute to:</p>
<p style="text-align: center;"><a href="http://flickr.com/groups/webopsviz/" target="_blank">http://flickr.com/groups/webopsviz/</a></p>
<p>Before posting anything here, think about whether you want everyone in the world to see what you&#8217;ve got. I&#8217;ve made a quick FAQ on the group&#8217;s page, but I&#8217;ll repeat it here:</p>
<blockquote><p><strong>Q: What is this?</strong><br />
A: This group is for sharing visualizations of web operations metrics. For the most part, this means graphs of systems and application metrics, from software like ganglia, cacti, hyperic, etc.</p>
<p><strong>Q: Who gets to see this?</strong><br />
A: This is a semi-public group, so don&#8217;t post anything you don&#8217;t want others to see.<br />
For now, it&#8217;ll be for members-only to post and view.  Ideally, I think it&#8217;d be great to share some of these things publicly.</p>
<p><strong>Q: What&#8217;s interesting to post here?</strong><br />
A: Spikes, dips, patterns. Things with colors. Shiny things. Donuts. Ponies.</p>
<p><strong>Q: My company will fire me if I show our metrics!</strong><br />
A: Don&#8217;t be dense: don&#8217;t post your pageview, revenue, or other super-secret stuff that you think would be sensitive. Your mileage may vary.</p></blockquote>
<p>So: you&#8217;ve got something to brag about? How many requests per second can your awesome new solid-state-disk database do? You got spikes? Post them!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2008/12/16/web-ops-visualizations-group-on-flickr/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Slides from Velocity</title>
		<link>http://www.kitchensoap.com/2008/06/25/slides-from-velocity/</link>
		<comments>http://www.kitchensoap.com/2008/06/25/slides-from-velocity/#comments</comments>
		<pubDate>Wed, 25 Jun 2008 13:41:23 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Flickr]]></category>
		<category><![CDATA[Slides]]></category>
		<category><![CDATA[Talks]]></category>
		<category><![CDATA[WebOps]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=48</guid>
		<description><![CDATA[Here are the slides from my talk at the Velocity Conference.]]></description>
			<content:encoded><![CDATA[<p></p><p><a title="Capacity Management for Web Operations" href="http://www.slideshare.net/jallspaw/velocity2008-capacity-management1-484676" target="_blank">Here</a> are the slides from my talk at the Velocity Conference.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2008/06/25/slides-from-velocity/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Squid patch for making &#8220;time&#8221; stats more meaningful.</title>
		<link>http://www.kitchensoap.com/2008/05/22/squid-patch-for-making-time-stats-more-meaningful/</link>
		<comments>http://www.kitchensoap.com/2008/05/22/squid-patch-for-making-time-stats-more-meaningful/#comments</comments>
		<pubDate>Thu, 22 May 2008 18:40:44 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Caching]]></category>
		<category><![CDATA[Flickr]]></category>
		<category><![CDATA[WebOps]]></category>
		<category><![CDATA[webops squid]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=43</guid>
		<description><![CDATA[Thanks to Mark, squid&#8217;s got a patch I&#8217;ve been wanting for a gazillion years: time-to-serve statistics that don&#8217;t include the client&#8217;s location http://www.squid-cache.org/bugs/show_bug.cgi?id=2345 Normally, squid&#8217;s kept statistics that included the &#8220;time&#8221; to serve an object, whether it be a HIT, MISS, NEAR HIT, etc. The clock starts for this time when the first headers are [...]]]></description>
			<content:encoded><![CDATA[<p>Thanks to <a href="http://mnot.net/blog" target="_blank">Mark</a>, squid&#8217;s got a patch I&#8217;ve been wanting for a gazillion years: time-to-serve statistics that don&#8217;t include the client&#8217;s location.</p>
<blockquote><p><a href="http://www.squid-cache.org/bugs/show_bug.cgi?id=2345" target="_blank">http://www.squid-cache.org/bugs/show_bug.cgi?id=2345</a></p></blockquote>
<p>Normally, squid&#8217;s kept statistics that included the &#8220;time&#8221; to serve an object, whether it be a HIT, MISS, NEAR HIT, etc. The clock for this time starts when the first headers received from the client are validated as a legit squid request, but then doesn&#8217;t stop until the client has every last bit of the response.</p>
<p>What this means is that if you have servers in the US and your traffic pattern follows the NY/SF pattern (peaks from around 9am-4pm) and your overseas traffic (i.e. clients really far from your boxes) has a pattern the inverse of that, then you might see &#8216;time-to-serve&#8217; in squid look <em>worse</em> during your lowest traffic. Which is confusing, to say the least. <img src='http://www.kitchensoap.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>This patch changes the stopwatch to start at the same time (when squid&#8217;s received headers from the client) but <em>stop</em> when squid&#8217;s preparing the headers for the response. This measures ONLY the time that squid had the object in its hands, for a hit or a miss, which IMHO is a much better measure of how squid is actually performing with the hardware&#8217;s resources.</p>
<p>Yay! Thanks Mark.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2008/05/22/squid-patch-for-making-time-stats-more-meaningful/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Flickr&#8217;s hiring a dba.</title>
		<link>http://www.kitchensoap.com/2008/01/30/flickrs-hiring-a-dba/</link>
		<comments>http://www.kitchensoap.com/2008/01/30/flickrs-hiring-a-dba/#comments</comments>
		<pubDate>Thu, 31 Jan 2008 05:03:37 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Flickr]]></category>
		<category><![CDATA[WebOps]]></category>
		<category><![CDATA[mysql]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/2008/01/30/flickrs-hiring-a-dba/</guid>
		<description><![CDATA[(Only hardworking supernerds should apply) We&#8217;re looking for an experienced and motivated MySQL DBA to help make things go at Flickr. Stuff you&#8217;ll do: • Work with engineers on performance tuning, query optimization, index tuning. • Monitor databases for problems and to diagnose where those problems are. • Work with developers and operations to maintain [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>(Only hardworking <strong>supernerds</strong> should apply)</p>
<p>We&#8217;re looking for an experienced and motivated MySQL DBA to help make things go at Flickr.</p>
<p>Stuff you&#8217;ll do:<br />
• Work with engineers on performance tuning, query optimization, index tuning.<br />
• Monitor databases for problems and diagnose where those problems are.<br />
• Work with developers and operations to maintain a scalable, reliable, and robust database environment.<br />
• Build database tools and scripts to automate where possible.<br />
• Support MySQL databases for production and development.<br />
• Provide 24&#215;7 escalated on-call support on a pager rotation.</p>
<p>Smarts and experience you&#8217;ll need:<br />
• 3-4+ years MySQL experience.<br />
• 2+ years of experience as a MySQL DBA in a high traffic, transactional environment.<br />
• 2+ years working in a LAMP environment, particularly PHP/MySQL.<br />
• Proficient with database performance strategies.<br />
• Proficient in tuning MySQL processes and queries.<br />
• Experience in administration of InnoDB.<br />
• Experience with MySQL replication, both Master-Slave and Master-Master.<br />
• Ability to work cooperatively with software engineers and system administrators.<br />
• Excellent communication skills.<br />
• Exceptional problem-solving expertise and attention to detail.<br />
• BS in Computer Science or equivalent.</p>
<p>Super Nerdy Bonus Points For:<br />
• Experience with Data Sharding and federated architectures.<br />
• Experience with multi-datacenter MySQL replication.<br />
• Experience working in a social media environment.</p>
<p>OK? Now, <a href="mailto:iwantajob@kitchensoap.com?subject=MySQL%20DBA%20gig%20at%20Flickr">send me your resume</a>!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2008/01/30/flickrs-hiring-a-dba/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Making a site faster by removing machines</title>
		<link>http://www.kitchensoap.com/2007/08/20/making-a-site-faster-by-removing-machines/</link>
		<comments>http://www.kitchensoap.com/2007/08/20/making-a-site-faster-by-removing-machines/#comments</comments>
		<pubDate>Mon, 20 Aug 2007 16:21:38 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Capacity Planning]]></category>
		<category><![CDATA[Flickr]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/2007/08/20/making-a-site-faster-by-removing-machines/</guid>
		<description><![CDATA[(well, not really) A little while ago, in one of our clusters we replaced a boatload of PowerEdge 1425 webserver-class boxes with a much smaller number of HP DL145 G3 quad-core boxes, getting the same amount of oomph from 1/3 the boxes. Not too bad.]]></description>
			<content:encoded><![CDATA[<p></p><p><em>(well, not really)</em></p>
<p>A little while ago, in one of our clusters we replaced a boatload of PowerEdge 1425 webserver-class boxes with a much smaller number of HP DL145 G3 quad-core boxes, getting the same amount of oomph from 1/3 the boxes.  Not too bad.</p>
<p><img src="http://www.kitchensoap.com/wp-content/uploads/2007/08/quads.png" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2007/08/20/making-a-site-faster-by-removing-machines/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Varnish and the state of web caching</title>
		<link>http://www.kitchensoap.com/2006/12/16/varnish-and-the-state-of-web-caching/</link>
		<comments>http://www.kitchensoap.com/2006/12/16/varnish-and-the-state-of-web-caching/#comments</comments>
		<pubDate>Sat, 16 Dec 2006 17:21:31 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Caching]]></category>
		<category><![CDATA[Flickr]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/2006/12/16/varnish-and-the-state-of-web-caching/</guid>
		<description><![CDATA[So there&#8217;s lots of excitement around Varnish, which is a caching proxy that is built to be first and foremost a reverse-proxy, as opposed to squid, which does both forward and reverse. Acceleration (reverse-proxying) is obviously important to us at Flickr, as we use squid extensively. I&#8217;m hoping to do some testing with Varnish once [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>So there&#8217;s lots of excitement around <a title="varnish" target="_blank" href="http://www.varnish-cache.org/">Varnish</a>, which is a caching proxy that is built to be first and foremost a reverse-proxy, as opposed to <a title="squid" target="_blank" href="http://squid-cache.org">squid</a>, which does both forward and reverse. Acceleration (reverse-proxying) is obviously important to us at <a target="_blank" href="http://flickr.com">Flickr</a>, as we use squid extensively.</p>
<p><span id="more-9"></span></p>
<p>I&#8217;m hoping to do some testing with Varnish once it&#8217;s stable and has the ability to manage a constantly full cache. I emailed with <a target="_blank" href="http://people.freebsd.org/~phk/">Poul-Henning Kamp</a> (one of the main developers), and he says that object replacement/eviction is indeed on the roadmap, so we shall see.</p>
<p>From what I can tell, Varnish sounds a <em>little</em> like the COSS filesystem that squid can use, in that it uses one big file to store objects in. In Varnish, this is mmap&#8217;d into the process and the kernel does all of the disk work. Since replacement/eviction isn&#8217;t done yet, I&#8217;m not sure if the mechanism will be &#8220;cyclical&#8221; like COSS, but however it ends up working, it&#8217;ll probably see some big performance increases compared to the standard &#8216;nested directories&#8217; way that <em>aufs</em> does things in squid currently.</p>
<p>Woohoo!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2006/12/16/varnish-and-the-state-of-web-caching/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Hats and beards</title>
		<link>http://www.kitchensoap.com/2006/12/12/hats-and-beards/</link>
		<comments>http://www.kitchensoap.com/2006/12/12/hats-and-beards/#comments</comments>
		<pubDate>Tue, 12 Dec 2006 23:34:03 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Flickr]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/2006/12/12/hats-and-beards/</guid>
		<description><![CDATA[http://flickr.com/photos/allspaw/311471361/]]></description>
			<content:encoded><![CDATA[<p></p><p><a href="http://flickr.com/photos/allspaw/311471361/">http://flickr.com/photos/allspaw/311471361/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2006/12/12/hats-and-beards/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

