<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Kitchen Soap &#187; allspaw</title>
	<atom:link href="http://www.kitchensoap.com/author/admin/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.kitchensoap.com</link>
	<description>Thoughts on capacity planning and web operations.</description>
	<lastBuildDate>Tue, 17 Jan 2012 17:57:33 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Convincing management that cooperation and collaboration was worth it</title>
		<link>http://www.kitchensoap.com/2012/01/05/convincing-management-that-cooperation-and-collaboration-was-worth-it/</link>
		<comments>http://www.kitchensoap.com/2012/01/05/convincing-management-that-cooperation-and-collaboration-was-worth-it/#comments</comments>
		<pubDate>Thu, 05 Jan 2012 15:35:10 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Culture]]></category>
		<category><![CDATA[Flickr]]></category>
		<category><![CDATA[Random]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=8760</guid>
		<description><![CDATA[While searching around for something else, I came across this note I sent in late 2009 to the executive leadership of Yahoo&#8217;s Engineering organization. This was when I was leaving Flickr to work at Etsy. My intent in sending it was to be open to the rest of Yahoo about how things worked at [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>While searching around for something else, I came across this note I sent in late 2009 to the executive leadership of Yahoo&#8217;s Engineering organization. This was when I was leaving Flickr to work at Etsy. My intent in sending it was to be open to the rest of Yahoo about how things worked at Flickr, and why. I did this in the hope that other Yahoo properties could learn from that team&#8217;s process and culture, which we worked really hard at building and keeping.</p>
<p>The idea that Development and Operations could:</p>
<ul>
<li>Share responsibility/accountability for availability and performance</li>
<li>Have an equal seat at the table when it came to application and infrastructure design, architecture, and emergency response</li>
<li>Build and maintain a culture of deference to each other when it came to domain expertise</li>
<li>Cultivate equanimity when it came to emergency response and post-mortem meetings</li>
</ul>
<div>
<p>&#8230;wasn&#8217;t evenly distributed across other Yahoo properties, from my limited perspective.</p>
<p>But I knew (still know) lots of incredible engineers at Yahoo who weren&#8217;t being supported as well as they could be by their upper management. So sending this letter was driven by wanting to help their situation. Don&#8217;t get me wrong, not everything was rainbows and flowers at Flickr, but we certainly had a lot more of them than other Yahoo groups.</p>
<p>When I re-read this, I&#8217;m reminded that when I came to Etsy, I wasn&#8217;t entirely sure that any of these approaches would work in the Etsy Engineering environment. The engineering staff at Etsy was a lot larger than Flickr&#8217;s and continuous deployment was in its infancy when I got there. I can now happily report that 2 years later, these concepts not only solidified at Etsy, they evolved to accommodate a <em><strong>lot</strong></em> more than what challenged us at Flickr. I couldn&#8217;t be happier about how it&#8217;s turned out.</p>
<p>I&#8217;ll note that there&#8217;s nothing groundbreaking in this note I sent, and nothing that I hadn&#8217;t said publicly in a presentation or two around the same time.</p>
<p>This is the note I sent to the three layers of management above me in my org at Yahoo:</p>
<blockquote>
<h3>Subject: Why Flickr went from 73rd most popular Y! property in 2005 to the 6th, 5 years later.</h3>
<p>Below are my thoughts about some of the reasons why Flickr has had success, from an Operations Engineering manager&#8217;s point of view.</p>
<p>When I say <em>everyone </em>below, I mean all of the groups and sub-groups within the Flickr property: <strong>Product</strong>, <strong>Customer Care</strong>, <strong>Development</strong>, <strong>Service Engineering</strong>, <strong>Abuse and Advocacy</strong>, <strong>Design</strong>, and <strong>Community Management</strong>.</p>
<h3>Here are at least some of the reasons we had success:</h3>
<ul>
<ul>
<li>Product included and respected everyone&#8217;s thoughts, in almost every feature and choice.</li>
<li><em>Everyone</em> owned availability of the site, not just Ops.</li>
<li>Community management and customer service were involved <strong>early</strong> and <strong>often</strong>. In <em>everything</em>. If they weren&#8217;t, it was an oversight taken seriously, and would be fixed.</li>
<li>Development and Operations had <strong>zero</strong> divide when it came to availability and performance. No, really. They worked in concert, involving each other in their own affairs when it mattered, and trusting each other every step of the way. This culture was taught, not born.</li>
<li>I have <em>never</em> viewed Flickr Operations as <strong><em>firefighters</em></strong>, and have never considered Flickr Dev Engineering to be <strong><em>arsonists</em></strong>. (I have heard this analogy elsewhere in Yahoo.) The two teams are 100% equal partners, with absolute transparency. If anything, we had a problem with too much deference given between the two teams.</li>
<li>The site was able to evolve, change, and grow as fast as it needed to, as long as it was made safe to do so. To be specific: code and config deploys. When it wasn&#8217;t safe, we slowed, and everyone was fine with that happening, knowing that the goal was to return to <em>fast-as-we-need-to-be</em>. See above about everyone owning availability.</li>
<li>Developers were able to see their work almost instantly in production. Institutionalized fear of degradation and outage ensured that changes were as safe as they needed to be. Developers and Ops engineers knew intuitively that the safety net you have is the one that you have built for yourself. When changes are small and frequent, the causes of degradation or outage due to code deploys are exceptionally transparent to all involved. (Re-read above about everyone owning availability.)</li>
<li>We never deployed &#8220;early and often&#8221; because:
<ul>
<li>it was a trend,</li>
<li>we wanted to brag,</li>
<li>or we thought we were better than anyone. (We did it because it was right for Flickr to do so.)</li>
</ul>
</li>
<li>Everyone was made aware of any launches that had risks associated with them, and we worked on lists of things that could possibly go wrong, and what we would do in the event they did go wrong. Sometimes we missed things, and we had to think quickly, but those times were rare with new feature launches.</li>
<li>Flickr Ops had <em>always</em> had the &#8220;go or no-go&#8221; decision, as did other groups who could vote with respect to their preparedness. A significant part of my job was working towards saying &#8220;go&#8221;, not &#8220;no-go&#8221;. In fact, almost all of it.</li>
</ul>
</ul>
<h4>Examples: the most boring (anti-climactic, from an operational perspective) launches ever</h4>
<ul>
<ul>
<li><strong>Flickr Video</strong>: I actually held the launch back by some hours until we could rectify a networking issue that I thought posed a risk to post-launch traffic. Other than that, it was a switch in the application that was turned from off to on. The feature&#8217;s code had been on prod servers for months in beta. See &#8216;dark launch&#8217;</li>
<li><strong>Homepage redesign</strong>: Unprecedented amount of activity data being pulled onto the logged-in homepage, order of magnitude increase in the number of calls to backend databases. Why was it boring? Because it was dark launched 10 days earlier. The actual launch was a flip of the &#8216;on&#8217; switch</li>
<li><strong>People In Photos (aka, &#8216;people tagging&#8217;)</strong>: Because the feature required data that we didn&#8217;t actually have yet, we couldn&#8217;t exactly dark launch it. It was a feature that had to be turned on, or off. Because of this, Flickr&#8217;s Architect wrote out a list of all of the parts of the feature that could cause load-related issues, what the likelihood of each was, how to turn those parts of the feature off, what effect it might have on customer care, and what contingencies would probably require some community management involvement.</li>
</ul>
</ul>
<h4>Dark Launches</h4>
<p>When we already had the data on the backend needed to display a new feature, we would &#8216;dark launch&#8217; it, meaning that the code would make all of the back-end calls (i.e. the calls that bring load-related risk to the deploy) and simply throw the data away, not showing it to the user. We could then increase or decrease the percentage of traffic that made those calls in safety, since we never risked the user experience by showing them a new feature and then having to take it away because of load issues.</p>
<p>This increases <em>everyone&#8217;s</em> confidence almost to the point of apathy, as far as fear of load-related issues is concerned. I have no idea how many code deploys were made to production on any given day in the past 5 years (although I could find it on a graph easily), because for the most part I don&#8217;t care, because those changes made in production have such a low chance of causing issues. When they have caused issues, everyone on the Flickr staff can find on a webpage <strong><em>when</em></strong> the change was made, <strong><em>who</em></strong> made the change, and exactly (line-by-line) <strong><em>what</em></strong> the change was.</p>
<p>In the case where we had confidence in the resource consumption of a feature, but not 100% confidence in functionality, the feature was turned on for staff only. I&#8217;d say that about 95% of the features we launched in those 5 years were turned on for staff long before they were turned on for the entire Flickr population. When we still didn&#8217;t feel 100% confident, we ramped up the percentage of Flickr members who could see and use the new feature slowly.</p>
<h4>Config Flags</h4>
<p>We have many pieces of Flickr that are encapsulated as &#8216;feature&#8217; flags, which look as simple as <code>$cfg['disable_feature_video'] = 0;</code>. This allows the site to be much more resilient to specific failures. If we have any degradation within a certain feature, we can simply turn that feature off in many cases, instead of taking the entire site down. These &#8216;flags&#8217; have, in the past, been prioritized through conversations with Product, so there is an easy choice to make if something goes wrong and site uptime becomes opposed to feature uptime.</p>
<p>This is an extremely important point: Dark Launches and Config Flags were concepts and tools created by Flickr Development, not Flickr Operations, even though the end result of each points toward a typical Operations goal: stability and availability. This is a key distinction. These are initiatives made by Engineering leadership because devs feel protective of the availability of the site and respectful of Operations&#8217; responsibilities, and because it&#8217;s just plain good engineering.</p>
<p>If Flickr Operations had built these tools and approaches to keeping the site stable, I do not believe we would have had the same amount of success.</p>
<p>There is more on this topic here: <a href="http://code.flickr.com/blog/2009/12/02/flipping-out/" target="_blank">http://code.flickr.com/blog/2009/12/02/flipping-out/</a></p>
<h4>Summary</h4>
<p>Flickr Operations is in an enviable position in that they don&#8217;t have to convince anyone in the Flickr property that:</p>
<ul>
<ul>
<ol>
<li>Operations has &#8216;go or no-go&#8217; decision-making power, along with every other subgroup.</li>
<li>Spending time, effort, and money to ensure stable feature launches <em>before they launch </em>is the rule, not the exception<em>.</em></li>
<li>Continuous Deployment is better for the availability of the site</li>
<li>Flickr Operations should be involved as early as possible in the development phase of any project</li>
</ol>
</ul>
</ul>
<p>These things are taken for granted. Any other way would simply feel weird.</p></blockquote>
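<p>To make the mechanics in that note a little more concrete, here&#8217;s a rough sketch of how a kill-switch flag, a dark launch, and a percentage ramp can fit together. It&#8217;s written in Python for brevity, it is not Flickr&#8217;s code, and every name in it apart from <code>disable_feature_video</code> (the other flags, <code>in_rollout</code>, <code>fetch_activity</code>, <code>render_homepage</code>, and the CRC-based bucketing) is an assumption made up for illustration.</p>

```python
import zlib

# Hypothetical flag table, loosely modeled on the disable_feature_video example.
cfg = {
    "disable_feature_video": 0,  # kill switch: set to 1 to turn the feature off
    "activity_dark_pct": 10,     # percent of users whose requests make the new backend calls
    "activity_visible_pct": 0,   # percent of non-staff users who see the result (0 = staff only)
}

# Counter so the load generated by dark-launched calls can be observed.
stats = {"backend_calls": 0}


def in_rollout(user_id, percent):
    """Deterministically place a user in one of 100 buckets; the first `percent` are in."""
    bucket = zlib.crc32(str(user_id).encode("utf-8")) % 100
    return bucket < percent


def fetch_activity(user_id):
    """Stand-in for the expensive backend calls the new feature needs."""
    stats["backend_calls"] += 1
    return ["activity for user %d" % user_id]


def render_homepage(user_id, is_staff=False):
    page = {"user": user_id}

    # Config flag: a degraded feature can be switched off without taking the site down.
    if not cfg["disable_feature_video"]:
        page["video"] = True

    # Staff always see the new feature; members are ramped in by percentage.
    visible = is_staff or in_rollout(user_id, cfg["activity_visible_pct"])

    if visible or in_rollout(user_id, cfg["activity_dark_pct"]):
        data = fetch_activity(user_id)  # the backend load is real either way
        if visible:
            page["activity"] = data
        # Otherwise this is the dark launch case: the data is simply thrown away.

    return page
```

<p>In a setup like this, ramping is only a config change: raise <code>activity_dark_pct</code> while watching the graphs, and once the load looks boring, start raising <code>activity_visible_pct</code>. Hashing the user ID (rather than picking randomly per request) keeps a given member consistently in or out of the rollout.</p>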
<p>I have no idea if posting this letter helps anyone other than myself, but there you go.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2012/01/05/convincing-management-that-cooperation-and-collaboration-was-worth-it/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Fault Tolerance and Protection</title>
		<link>http://www.kitchensoap.com/2011/09/08/fault-tolerance-and-protection/</link>
		<comments>http://www.kitchensoap.com/2011/09/08/fault-tolerance-and-protection/#comments</comments>
		<pubDate>Thu, 08 Sep 2011 11:17:16 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Resilience]]></category>
		<category><![CDATA[WebOps]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=7193</guid>
		<description><![CDATA[In yet another post where I point to a paper written from the perspective of another field of engineering about a topic that I think is inherently mappable to the web engineering world, I&#8217;ll at least give a summary. Every time someone on-call gets an alert, they should always be thinking along these lines: Does [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>In yet another post where I point to a paper written from the perspective of another field of engineering about a topic that I think is inherently mappable to the web engineering world, I&#8217;ll at least give a summary. <img src='http://www.kitchensoap.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Every time someone on-call gets an alert, they should always be thinking along these lines:</p>
<ul>
<li>Does this <em>really </em>require me to wake up from sleeping or pause this movie I&#8217;m watching, to fix?</li>
<li>Can this <em>really </em>not wait until the morning, during office hours?</li>
</ul>
<p>If the answer is yes to those, then excellent: the machines alerted a human to something that only a human could ever diagnose or fix. There was nothing that the software could have done to rectify the situation. Paging a human was justified.</p>
<p>But for those situations where the answer was &#8220;no&#8221; to those questions, one might (or should, anyway) think of bolstering the system&#8217;s &#8220;fault tolerance&#8221; or &#8220;fault protection.&#8221; But how many folks grok the full details of what that means? Does it mean self-healing? Does it mean isolation of errors or unexpected behaviors that fall outside the bounds of normal operating circumstances? Or does it mean both, and if so, how should we approach building this tolerance and protection? The Wikipedia definitions for &#8220;<a href="http://en.wikipedia.org/wiki/Fault-tolerant_system" target="_blank">fault tolerant systems</a>&#8221; and &#8220;<a href="http://en.wikipedia.org/wiki/Fault-tolerant_design" target="_blank">fault tolerant design</a>&#8221; are a very good start on educating yourself on the concepts, but they&#8217;re fairly general in scope.</p>
<p>The fact is, designing web systems to be truly fault-tolerant and protective is <em>hard. </em>These are questions that can&#8217;t be answered solely within infrastructural bounds. Fault tolerance isn&#8217;t selective in its tiering: it has to be thought of from layer 1 of the network all the way to the browser.</p>
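<p>As a rough illustration of the two on-call questions above, here&#8217;s a small Python sketch of an alert handler that only pages a human when software couldn&#8217;t fix the problem and the problem can&#8217;t wait. The alert fields (<code>check</code>, <code>can_wait</code>), the <code>triage</code> function, and the remediation table are all hypothetical; this isn&#8217;t taken from any particular monitoring system.</p>

```python
def triage(alert, remediations):
    """Return what should happen to an alert: 'auto-remediated', 'ticket', or 'page'.

    `alert` is a dict with a 'check' name and an optional 'can_wait' flag.
    `remediations` maps check names to callables that return True when the fix worked.
    """
    fix = remediations.get(alert["check"])
    if fix is not None:
        try:
            if fix(alert):
                # The software rectified the situation; nobody needs waking up.
                return "auto-remediated"
        except Exception:
            # A broken fix must not hide the alert; fall through to the human paths.
            pass

    if alert.get("can_wait", False):
        # It really can wait until the morning, during office hours.
        return "ticket"

    # Only a human can diagnose or fix this, and it can't wait: paging is justified.
    return "page"
```

<p>The useful part isn&#8217;t the code so much as the default: an alert that nobody has classified still pages, so the automation can only reduce pages for cases someone has actually thought through.</p>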
<p>Now, not every web startup is lucky enough to hire someone from <a href="http://www.jpl.nasa.gov/">NASA&#8217;s Jet Propulsion Lab</a>, who has written software for space vehicles, but we managed to convince Greg Horvath to leave there and join Etsy. He pointed me to an excellent paper, by <a href="https://pub-lib.jpl.nasa.gov/docushare/dsweb/Get/Document-316/08-031+GN%26C+Fault+Protection+Fundamentals.pdf">Robert D. Rasmussen, called &#8220;GN&amp;C Fault Protection Fundamentals&#8221;</a> and thankfully, it&#8217;s a lot less about Guidance, Navigation, and Control and more about fault tolerance and protection strategies, concerns, and implementations.</p>
<p>Some of those concerns, from the paper:</p>
<blockquote>
<ul>
<li>Do not separate fault protection from normal operation of the same functions.</li>
<li>Strive for function <em>preservation</em>, not just fault <em>protection</em>.</li>
<li>Test systems, not fault protection; test behavior, not reflexes.</li>
<li>Cleanly establish a delineation of mainline control functions from transcendent issues.</li>
<li>Solve problems locally, if possible; explicitly manage broader impacts, if not.</li>
<li>Respond to the situation as it is, not as it is hoped to be.</li>
<li>Distinguish fault diagnosis from fault response initiation.</li>
<li>Follow the path of least regret.</li>
<li>Take the analysis of all contingencies to their logical conclusion.</li>
<li>Never underestimate the value of operational flexibility.</li>
<li>Allow for all reasonable possibilities — even the implausible ones.</li>
</ul>
</blockquote>
<p>The last idea there points to having &#8220;requisite imagination&#8221; to explore, as fully as possible, the question &#8220;What could possibly go wrong?&#8221;, which is really just another manifestation of one of the four cornerstones of Resilience Engineering: &#8220;Anticipation&#8221;. But that&#8217;s a topic for another post.</p>
<p>Here&#8217;s Rasmussen&#8217;s paper, <a href="https://pub-lib.jpl.nasa.gov/docushare/dsweb/Get/Document-316/08-031+GN%26C+Fault+Protection+Fundamentals.pdf">please go and read it</a>. If you don&#8217;t, you&#8217;re totally missing out and not keeping up!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2011/09/08/fault-tolerance-and-protection/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Systems Engineering: A great definition.</title>
		<link>http://www.kitchensoap.com/2011/07/18/systems-engineering-great-definition/</link>
		<comments>http://www.kitchensoap.com/2011/07/18/systems-engineering-great-definition/#comments</comments>
		<pubDate>Mon, 18 Jul 2011 11:46:57 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Culture]]></category>
		<category><![CDATA[Random]]></category>
		<category><![CDATA[WebOps]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=6175</guid>
		<description><![CDATA[Ben Rockwood said something last December about the re-emergence of the Systems Engineer and I agree with him, 100%. To add to that, I&#8217;d like to quote the excellent NASA Systems Engineering handbook&#8217;s introduction. The emphasis is mine: Systems engineering is a methodical, disciplined approach for the design, realization, technical management, operations, and retirement of [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>Ben Rockwood said <a href="http://cuddletech.com/blog/?p=150" target="_blank">something last December</a> about the re-emergence of the Systems Engineer and I agree with him, 100%.</p>
<div id="attachment_6366" class="wp-caption alignright" style="width: 231px">
	<a href="http://education.ksc.nasa.gov/esmdspacegrant/Documents/NASA%20SP-2007-6105%20Rev%201%20Final%2031Dec2007.pdf"><img class="size-medium wp-image-6366" title="NASA Systems Engineering Handbook" src="http://www.kitchensoap.com/wp-content/uploads/2011/07/Screen-shot-2011-07-18-at-7.36.22-AM-231x300.png" alt="NASA Systems Engineering Handbook" width="231" height="300" /></a>
	<p class="wp-caption-text">NASA Systems Engineering Handbook, 2007</p>
</div>
<p>To add to that, I&#8217;d like to quote the excellent NASA Systems Engineering handbook&#8217;s introduction. The emphasis is mine:</p>
<blockquote><p>Systems engineering is a methodical, disciplined approach for the design, realization, technical management, operations, and retirement of a system. A “system” is a construct or collection of different elements that together produce results not obtainable by the elements alone. The elements, or parts, can include <strong>people, hardware, software, facilities, policies, and documents; that is, all things required to produce system-level results.</strong> The results include system-level qualities, properties, characteristics, functions, behavior, and performance. The value added by the system as a whole, beyond that contributed independently by the parts, is primarily created by the relationship among the parts; that is, how they are interconnected. It is a way of looking at the “big picture” when making technical decisions. It is a way of achieving stakeholder functional, physical, and operational performance requirements in the intended use environment over the planned life of the systems. <strong><em>In other words, systems engineering is a logical way of thinking.</em></strong></p>
<p>Systems engineering is the art and science of developing an operable system capable of meeting requirements within often opposed constraints. <strong><em>Systems engineering is a holistic, integrative discipline, wherein the contributions of structural engineers, electrical engineers, mechanism designers, power engineers, human factors engineers, and many more disciplines are evaluated and balanced, one against another, to produce a coherent whole that is not dominated by the perspective of a single discipline.</em></strong></p>
<p>Systems engineering seeks a safe and balanced design in the face of opposing interests and multiple, sometimes conflicting constraints. The systems engineer must develop the skill and instinct for identifying and focusing efforts on assessments to optimize the overall design and not favor one system/subsystem at the expense of another. The art is in knowing when and where to probe. Personnel with these skills are usually tagged as “systems engineers.” They may have other titles—lead systems engineer, technical manager, chief engineer— but for this document, we will use the term <strong><em>systems engineer</em></strong>.</p>
<p>The exact role and responsibility of the systems engineer may change from project to project depending on the size and complexity of the project and from phase to phase of the life cycle. For large projects, there may be one or more systems engineers. For small projects, sometimes the project manager may perform these practices. But, whoever assumes those responsibilities, the systems engineering functions must be performed. The actual assignment of the roles and responsibilities of the named systems engineer may also therefore vary. The lead systems engineer ensures that the system technically fulfills the defined needs and requirements and that a proper systems engineering approach is being followed. The systems engineer oversees the project’s systems engineering activities as performed by the technical team and directs, communicates, monitors, and coordinates tasks. The systems engineer reviews and evaluates the technical aspects of the project to ensure that the systems/subsystems engineering processes are functioning properly and evolves the system from concept to product. <strong><em>The entire technical team is involved in the systems engineering process.</em></strong></p></blockquote>
<p>I would imagine that any successful organization understands this concept of systems engineering, but I don&#8217;t think I&#8217;ve ever seen it put so well.</p>
<p>NASA&#8217;s engineers have both common and conflicting goals, just like we do in web operations. They weigh trade-offs in efficiency and thoroughness, and wade into the constraints of better, cheaper, faster, and hopefully: more <a title="Resilience Engineering Part I" href="http://www.kitchensoap.com/2011/04/07/resilience-engineering-part-i/" target="_blank">resilient</a>.</p>
<p>This re-emergence of the systems engineering (or &#8220;full-stack&#8221; engineering) notion is excellent and exciting to me, and I&#8217;m hoping that when people in our field hear &#8220;DevOps&#8221; (and/or, as Theo says, <a href="http://www.youtube.com/watch?v=y0mHo7SMCQk" target="_blank">*Ops</a>), what they understand it to mean is taking a <em><strong>systems engineering</strong></em> view.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2011/07/18/systems-engineering-great-definition/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Training Organizational Resilience in Escalating Situations</title>
		<link>http://www.kitchensoap.com/2011/05/10/training-organizational-resilience-in-escalating-situations/</link>
		<comments>http://www.kitchensoap.com/2011/05/10/training-organizational-resilience-in-escalating-situations/#comments</comments>
		<pubDate>Tue, 10 May 2011 14:40:22 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Culture]]></category>
		<category><![CDATA[Resilience]]></category>
		<category><![CDATA[WebOps]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=3509</guid>
		<description><![CDATA[This little ramble of thoughts is related to my upcoming talk at Velocity, but I know I&#8217;ll never get to this part at the conference, so I figured I&#8217;d post about it here. Building resilience from a systems point of view means (amongst other things) understanding how your organization deals with failure and unexpected [...]]]></description>
			<content:encoded><![CDATA[<p></p><p><em>This little ramble of thoughts is related to my upcoming <a href="http://velocityconf.com/velocity2011/public/schedule/detail/19766" target="_blank">talk at Velocity</a>, but I know I&#8217;ll never get to this part at the conference, so I figured I&#8217;d post about it here.</em></p>
<p>Building resilience from a systems point of view means (amongst other things) understanding how your <em>organization</em> deals with failure and unexpected situations. Generally this means having development and operations teams that can work well together under pressure, with fluctuating amounts of uncertainty, bringing their own domain expertise to the table when it matters.</p>
<p>This is what drives some of my favorite <a href="http://www.kitchensoap.com/2010/05/26/some-webops-interview-questions/" target="_blank">Ops candidate interview questions</a>. Knowing Unix commands, network architectures, database behaviors, and scripting languages is obviously required, but comprises only one facet of the gig. The <em><strong>real mettle</strong></em> comes from being able to easily zoom in and out of the whole system under scrutiny, splitting up troubleshooting responsibilities amongst your team (and trusting their results), and differentiating red herring symptoms from truly related ones. It also comes from things like:</p>
<ul>
<li>Staying away from distracting conversation during the outage response. Nothing kills a TTR like unrelated talk in IRC or a conf call.</li>
<li>Trusting your information. This is where the UI challenges of dashboard design can make or break an outage response. &#8220;Are those units <em>milli, or mega?&#8221;</em></li>
<li>Balancing too much communication and too little amongst team members. Troubleshooting outage verbosity is a fickle mistress.</li>
<li>Stomping actions. OneThingAtATime™ methods aren&#8217;t easy to stick to, especially when things escalate.</li>
<li>Keeping outage fatigue at bay, and recognizing when brains are melting and need to take a break.</li>
</ul>
<p>To make matters worse, determining causality can be tenuous at best when you&#8217;re working with complex systems, so being able to recognize when a failure has a single root cause (hint: with the big outages &#8211; almost never) and when it has multiple contributing causes is a skill that isn&#8217;t easily gained without seeing a lot of action in the past.</p>
<p>So it&#8217;s not a surprise that working well within a team under stressful scenarios is something other fields try to train people for.  Trauma surgeons, FBI agents, military teams, air traffic control, etc. all have drills, exercises, and simulations for teaching these skills, but they are all done within the context of what those escalating situations look like in their specific fields.</p>
<p>So this brings a question that has come up before in my circles:</p>
<blockquote><p><em>Can this sort of organizational resilience be <strong>taught</strong>, within the context of web operations? </em></p></blockquote>
<p>GameDay exercises could certainly be one avenue for testing and training team-based outage response, but most of the focus there (at least those discussed publicly by companies who hold GameDay exercises) is testing the infrastructure and application-level components, and even then under controlled conditions and relatively narrow failure modes.</p>
<p>So the confidence-building value of GameDay drills lies elsewhere; they don&#8217;t really exercise the cognitive load that real-world failures, like the recent <a href="http://aws.amazon.com/message/65648/" target="_blank">spectacular Amazon AWS outage</a>, can put on the humans (i.e. the troubleshooting dev and ops teams).</p>
<p>But! Some smart folks have been thinking about this question, at a higher-level:</p>
<blockquote><p><em>Is it possible to construct non-contextual and generic drills that can train competencies for this sort of on-the-fly, making-sense-of-unfamiliar-failure-modes, and sometimes disorienting troubleshooting?</em></p></blockquote>
<p>At Lund University in Sweden, there&#8217;s an excellent article on <a href="http://www.leonardo.lth.se/research/organizational_resilience_in_escalating_situations/" target="_blank">building organizational resilience in escalating situations</a>, which I believe resulted in a chapter in the <a href="http://www.amazon.com/Resilience-Engineering-Practice-Ashgate-Studies/dp/1409410358/" target="_blank">Resilience Engineering in Practice</a> book, and also references another excellent article by David Woods and Emily Patterson called <em><a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.8626&amp;rank=1" target="_blank">How Unexpected Events Produce An Escalation Of Cognitive And Coordinative Demands</a>.</em></p>
<p>The parts I want to highlight here are best practices for designing scenarios meant to train these skills. If you&#8217;re looking to design a good drill meant to educate and/or train Ops and Devs on what cognitive muscles to develop for handling large-scale outages, this is a pretty damn good list (quoted from both of those sources above):</p>
<blockquote>
<ul>
<li>Try to force people beyond their learned  roles and routines. The scenario can contain problems that are not  solvable within those roles or routines, and forces people to step out  of those roles and routines.</li>
<li>Contain a number of hidden goals,  at various times during the scenario, that people could pursue (e.g.  different ways of escaping the situation or de-escalating it), but that  they have to vocalize and articulate in order to begin to achieve them  (as they cannot do so by themselves).</li>
<li>Include potential actions  of which the consequences are both important and difficult to foresee  (and that might significantly influence people’s ability to control the  problem in the near future). This can force people into pro-active  thinking and articulation of their expectations of what might happen.</li>
<li>Be  able to trap people in locking onto one solution that everybody is  fixedly working towards. This can be done by garden-pathing; making the  escalating problem look initially (with strong cues) like something the  crew could already be familiar with, but then letting it depart (with much  weaker cues) to see whether the crew is caught on the garden path and  lets the situation escalate.</li>
<li>Or the scenario,  by creating so much cognitive noise in terms of new warnings and  events, should be able to trip people into thematic vagabonding—the  tendency to redirect attention and change diagnosis with each incoming data piece, which results in a fragmentation of problem-solving.</li>
</ul>
</blockquote>
<p>Think that such a scenario could be constructed?</p>
<p>I want to think so, but of course nothing teaches like the hindsight of a real production outage, eh? <img src='http://www.kitchensoap.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2011/05/10/training-organizational-resilience-in-escalating-situations/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Resilience Engineering: Part I</title>
		<link>http://www.kitchensoap.com/2011/04/07/resilience-engineering-part-i/</link>
		<comments>http://www.kitchensoap.com/2011/04/07/resilience-engineering-part-i/#comments</comments>
		<pubDate>Thu, 07 Apr 2011 14:03:25 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Books]]></category>
		<category><![CDATA[Culture]]></category>
		<category><![CDATA[WebOps]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=1733</guid>
		<description><![CDATA[I&#8217;ve been drafting this post for a really long time. Like most posts, it&#8217;s largely for me to get some thoughts down. It&#8217;s also very related to the topic I&#8217;ll be talking about at Velocity later this year. When I gave a keynote talk at the Surge Conference last year, I talked about how our [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>I&#8217;ve been drafting this post for a really long time. Like most posts, it&#8217;s largely for me to get some thoughts down. It&#8217;s also very related to the topic I&#8217;ll be <a href="http://velocityconf.com/velocity2011/public/schedule/detail/19766" target="_blank">talking about at Velocity</a> later this year.</p>
<p>When I gave a keynote talk at the <a href="http://omniti.com/surge/2010/speakers/john-allspaw" target="_blank">Surge Conference last year</a>, I talked about how our field of web engineering is still young, and would do very well to pay attention to other fields of engineering, since I suspect that we have a lot to learn from them. Contrary to popular belief, concepts such as fault tolerance, redundancy of components, sacrificial parts, automatic safety mechanisms, and capacity planning weren&#8217;t invented with the web. As it turns out, some of those ideas have been studied and put into practice in other fields for decades, if not centuries.</p>
<p>Systems engineering, control theory, reliability engineering&#8230;the list goes on for where we should be looking for influences, and other folks have <a href="http://cuddletech.com/blog/?p=150" target="_blank">noticed this as well</a>. As our field recognizes the value of taking a &#8220;systems&#8221; (the <a href="http://en.wikipedia.org/wiki/C._West_Churchman" target="_blank">C. West Churchman</a> definition, not the computer software definition) view on building and managing infrastructures with a &#8220;<a href="http://www.facebook.com/notes/facebook-engineering/the-full-stack-part-i/461505383919" target="_blank">Full Stack Programmer</a>&#8221; perspective, we should pull our heads out of our echo chamber every now and again, because we can gain so much from lessons learned elsewhere.</p>
<p>Last year, I was lucky to convince <a href="http://www.ctlab.org/Cook.cfm" target="_blank">Dr. Richard Cook</a> to let us include his article &#8220;<a href="http://www.kitchensoap.com/2009/11/12/how-complex-systems-fail-a-webops-perspective/" target="_blank">How Complex Systems Fail</a>&#8221; in <em><a href="http://oreilly.com/catalog/0636920000136" target="_blank">Web Operations</a></em>. Some months before, I had seen the article and began to poke around Dr. Cook&#8217;s research areas: human error, cognitive systems engineering, safety, and a relatively new multi-discipline area known as <strong>Resilience Engineering</strong>.</p>
<p>What I found was nothing less than exhilarating and inspirational, and it&#8217;s hard for me to not consider this research mandatory reading for anyone involved with building or designing <a href="http://en.wikipedia.org/wiki/Sociotechnical_systems" target="_blank">socio-technical systems</a>. (<em>Hint: we all are, in web operations.</em>) Frankly, I haven&#8217;t been this excited since I saw Jimmy Page in a restaurant once in the mid-90s. Even though Dr. Cook (and others in his field, like <a href="http://www.ida.liu.se/~eriho/" target="_blank">Erik Hollnagel</a>, <a href="http://www-iwse.eng.ohio-state.edu/biosketch_DWoods.cfm" target="_blank">David Woods</a>, and <a href="http://www.griffith.edu.au/professional-page/sidney-dekker" target="_blank">Sidney Dekker</a>) historically have written and researched resilience in the context of aviation, space transportation, healthcare and manufacturing, their findings strike me as incredibly appropriate to web operations and development.</p>
<p>Except, of course, accidents in our field don&#8217;t actually harm or kill people. But they almost always involve humans, machines, high stress, and high expectations.</p>
<p>Some of the concepts in resilience engineering run contrary to the typical (or stereotypical) perspectives that I&#8217;ve found in operations management, and that&#8217;s what I find so fascinating. I&#8217;m especially interested in <strong>organizational</strong> resilience, and the realization that safety in systems develops not in <em>spite</em> of us messy humans, but <em>because </em>of us.</p>
<p>For example:</p>
<p><strong>Historical approaches taken towards improving &#8220;safety&#8221; in production might not be best<br />
</strong></p>
<p>Conventional wisdom might have you believe that the systems we build are basically safe, and that all they need is protection from unreliable humans. This logically stems from the myth that all outages/degradations occur as the result of a change gone wrong, and I suspect this idea also comes from Root Cause Analysis write-ups ending with &#8220;human error&#8221; at the bottom of the page. But Dekker, Woods, and others in <em><a href="http://www.amazon.com/Behind-Human-Error-David-Woods/dp/0754678342" target="_blank">Behind Human Error</a> </em>suggest that listing human error as a root cause isn&#8217;t where you should <em>end, </em>it&#8217;s where you should <em>start</em> your investigation. Getting behind what led to a &#8216;human error&#8217; is where the good stuff happens, but unless you&#8217;ve got a safe political climate (i.e., no one is going to get punished or fired for making mistakes) you&#8217;ll never get at how and why the error was made. Which means that you will ignore one of the largest opportunities to make your system (and organization) more efficient and resilient in the face of incidents. Mismatches, slips, lapses, and violations&#8230;each one of those types of error can lead to different ways of improving. And of course, working out the motivations and intentions of people who have made errors isn&#8217;t straightforward, especially engineers who might not have enough humility to admit to making an error in the first place.</p>
<p><strong>Root Cause Analysis can be easily misinterpreted and abused<br />
</strong></p>
<p>The idea that failures in complex systems can literally have a singular &#8216;root&#8217; cause, as if failures are the result of linear steps in time, is just incorrect. Not only is it almost always incorrect, but in practice that perspective can be harmful to an organization because it allows management and others to feel better about improving safety, when they&#8217;re not, because the solution(s) can be viewed as simple and singular fixes (in reality, they&#8217;re not). James Reason&#8217;s pioneering book <a href="http://www.amazon.com/Human-Error-James-Reason/dp/0521314194"><em>Human Error</em></a> is enlightening on these points, to say the least. In reality (and I am as guilty of this as anyone) there are motivations to reduce complex failures to singular/linear models, tipping the scales on what Hollnagel refers to as an ETTO, or <a href="http://www.namahn.com/resources/interview/erik-hollnagel-birds-do-it" target="_blank">Efficiency-Thoroughness Trade-Off</a>, which I think will sound familiar to anyone working in a web startup. Because why spend extra time digging to find details of that human error-causing outage, when you have work to do? Plus, if you linger too long in that postmortem meeting, people are going to feel even worse about making a mistake, and that&#8217;s just cruel, right? <img src='http://www.kitchensoap.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p><strong>PostMortems or accident investigations are <em>not </em>the only way an organization can improve &#8220;safety&#8221;</strong></p>
<p>Only looking at failures to guide your designs, tools, and processes drastically minimizes your ability to improve, Hollnagel says. Instead of looking at the things that go <em>wrong</em>, looking at the things that go <em>right</em> is a better strategy to improve resiliency. Personally, I think that engineering teams who practice continuous deployment intuitively understand this. Teams in which a growing number of developers make small and frequent changes to production subscribe to a particular culture of safety, whether they know it or not. It requires what Hollnagel refers to as a &#8220;constant sense of unease&#8221;, and awareness of failure is what helps bridge that stereotypical development and operations divide.</p>
<p><strong>Resilience should be a 4th management objective, alongside Better/Faster/Cheaper</strong></p>
<p>The definition goes like this:</p>
<blockquote><p>Resilience is the intrinsic ability of a system to adjust its  functioning prior to, during, or following changes and disturbances, so  that it can sustain required operations under both expected and  unexpected conditions. Since resilience is about being able to function,  rather than being impervious to failure, there is no conflict between  productivity and safety.</p></blockquote>
<p>This sounds like one of those commonsense ideas, right? In an extremely self-serving way, I find some validation in that definition that <a href="http://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/" target="_blank">optimizing for MTTR is better than optimizing for MTBF</a>. My gut says that this shouldn&#8217;t be shocking or a revelation; it&#8217;s what mature engineering is all about.</p>
<p><strong>Safety might not come from the sources you think it comes from</strong></p>
<blockquote><p>&#8220;&#8230;so safety isn&#8217;t about the <em>absence</em> of something&#8230;that you need to count errors or monitor violations, and tabulate incidents and try to make those things go away&#8230;..it&#8217;s about the <em>presence</em> of something. But the presence of what? When we find that things go right under difficult circumstances, it&#8217;s mostly because of people&#8217;s <em>adaptive capacity</em>; their ability to recognize, adapt to, and absorb changes and disruptions, some of which might fall outside of what the system is designed or trained to handle.&#8221;</p>
<p>- Sidney Dekker</p></blockquote>
<p>My plan is to post more about these topics, because there are just too many ideas to explain in a single go. Apparently, Ashgate Publishing has owned this space, with a <a href="http://www.ashgate.com/default.aspx?page=2415" target="_blank">whole series of books</a>. The newest one, <a href="http://www.amazon.com/Resilience-Engineering-Practice-Ashgate-Studies/dp/1409410358/" target="_blank"><em>Resilience Engineering in Practice</em></a>, is in my bag, and I can&#8217;t put it down. Examples of these ideas in real-world scenarios (hospital and medical ops, power plants, air traffic control, financial services) are juicy with details, and the chapter &#8220;Lessons from the Hudson&#8221; goes into excellent detail about the trade-offs that go on in the mind of someone in high-stress failure scenarios, like <a href="http://en.wikipedia.org/wiki/Chesley_Sullenberger" target="_blank">Chesley Sullenberger</a>.</p>
<p>I&#8217;ll end on this decent introduction to some of the ideas that includes the above quote, from Sidney Dekker. There&#8217;s some distracting camera work, but the ideas get across:<br />
<iframe title="YouTube video player" width="480" height="390" src="http://www.youtube.com/embed/mVt9nIf9VJw" frameborder="0" allowfullscreen></iframe></p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2011/04/07/resilience-engineering-part-i/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Etsy&#8217;s Chef Repo, 2010</title>
		<link>http://www.kitchensoap.com/2010/12/31/etsys-chef-repo-2010/</link>
		<comments>http://www.kitchensoap.com/2010/12/31/etsys-chef-repo-2010/#comments</comments>
		<pubDate>Fri, 31 Dec 2010 20:26:12 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Etsy]]></category>
		<category><![CDATA[WebOps]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=529</guid>
		<description><![CDATA[Etsy&#8217;s Chef Repo, 2010 from jspaw on Vimeo. Delicious InfoViz courtesy of Gource.]]></description>
			<content:encoded><![CDATA[<p></p><p><iframe src="http://player.vimeo.com/video/18330382" width="600" height="450" frameborder="0"></iframe>
<p><a href="http://vimeo.com/18330382">Etsy&#8217;s Chef Repo, 2010</a> from <a href="http://vimeo.com/jspaw">jspaw</a> on <a href="http://vimeo.com">Vimeo</a>.</p>
<p>Delicious InfoViz courtesy of <a title="Gource" href="http://code.google.com/p/gource/" target="_blank">Gource</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2010/12/31/etsys-chef-repo-2010/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>MTTR is more important than MTBF (for most types of F)</title>
		<link>http://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/</link>
		<comments>http://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/#comments</comments>
		<pubDate>Sun, 07 Nov 2010 18:27:23 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Culture]]></category>
		<category><![CDATA[Etsy]]></category>
		<category><![CDATA[Flickr]]></category>
		<category><![CDATA[Slides]]></category>
		<category><![CDATA[Talks]]></category>
		<category><![CDATA[WebOps]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=508</guid>
		<description><![CDATA[This week I gave a talk at QCon SF about development and operations cooperation at Etsy and Flickr.  It&#8217;s a refresh of talks I&#8217;ve given in the past, with more detail about how it&#8217;s going at Etsy. (It&#8217;s going excellently ) There&#8217;s a bunch of topics in the presentation slides, all centered around roles, responsibilities, [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>This week I gave a talk at QCon SF about <a href="http://www.slideshare.net/jallspaw/dev-and-ops-collaboration-and-awareness-at-etsy-and-flickr" target="_blank">development and operations cooperation at Etsy and Flickr</a>.  It&#8217;s a refresh of talks I&#8217;ve given in the past, with more detail about how it&#8217;s going at Etsy. (It&#8217;s going excellently <img src='http://www.kitchensoap.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  )</p>
<p>There&#8217;s a bunch of topics in the presentation slides, all centered around roles, responsibilities, and intersection points of domain expertise commonly found in development and operations teams. One of the not-groundbreaking ideas that I&#8217;m finally getting down is something that should be evident for anyone practicing or interested in &#8216;continuous deployment&#8217;:</p>
<p style="padding-left: 30px;">Being able to recover quickly from failure is more important than having failures less often.</p>
<p>This has what should be an obvious caveat: some types of failures shouldn&#8217;t ever happen (failures resulting in accidental data <em>loss</em>, for example), and not all failures/degradations/outages are the same.</p>
<p>Put another way:</p>
<blockquote>
<h1><strong>MTTR is more important than MTBF </strong></h1>
<p><strong><em>(for most types of F)</em></strong></p></blockquote>
<p>(Edited: I did say originally &#8220;MTTR &gt; MTBF&#8221;)</p>
<p>What I&#8217;m definitely <strong>not</strong> saying is that failure should be an acceptable condition. I&#8217;m positing that since failure <em>will</em> happen, it&#8217;s just as important (or in some cases <em>more</em> important) to spend time and energy on your response to failure as on trying to prevent it. I agree with <a href="http://twitter.com/ph" target="_blank">Hammond</a>, when he said:</p>
<blockquote><p>If you think you can prevent failure, then you aren&#8217;t developing your ability to respond.</p></blockquote>
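<p>A minimal sketch of the arithmetic behind that claim, assuming the textbook steady-state availability formula (MTBF divided by MTBF plus MTTR). The formula and the numbers are illustrative assumptions, not figures from the talk:</p>

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: one failure every 500 hours, 2 hours to recover.
baseline = availability(500, 2)

# Two ways to spend the same engineering effort (both numbers are made up):
fewer_failures = availability(1000, 2)   # double the time between failures
faster_recovery = availability(500, 1)   # halve the time to repair

if __name__ == "__main__":
    print(f"baseline:        {baseline:.5f}")
    print(f"double MTBF:     {fewer_failures:.5f}")
    print(f"halve MTTR:      {faster_recovery:.5f}")
```

<p>Under this simple model, halving MTTR buys the same availability as doubling MTBF, so the question becomes which of the two is cheaper to achieve in practice.</p>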
<p>In a complete steal of <a href="http://radar.oreilly.com/artur/" target="_blank">Artur Bergman</a>&#8216;s material, an example in the slides of the talk is of the Jeep versus Rolls Royce:</p>
<p><a href="http://www.kitchensoap.com/wp-content/uploads/2010/11/Screen-shot-2010-11-07-at-1.08.39-PM.png"><img class="alignleft size-medium wp-image-517" title="Jeep versus Rolls" src="http://www.kitchensoap.com/wp-content/uploads/2010/11/Screen-shot-2010-11-07-at-1.08.39-PM-300x225.png" alt="Jeep versus Rolls" width="300" height="225" /></a> Artur has a Jeep, and he&#8217;s right when he says that for the most part, Jeeps are built to optimize Mean-Time-To-Repair, rather than following the classical approach to automotive engineering, which is to optimize Mean-Time-Between-Failures. This is likely because Jeep owners have been beating the shit out of their vehicles for decades, and every now and again, they expect that abuse to break something. Jeep designers know this, which is why Jeeps are so damn easy to repair. Nuts and bolts are easy to reach, tools are included when you buy the thing, and if you haven&#8217;t seen the video of <a href="http://www.youtube.com/watch?v=lgwF8mdQwlw" target="_blank">Army personnel disassembling and reassembling a Jeep in under 4 minutes</a>, you&#8217;re missing out.</p>
<p>The Rolls Royce, on the other hand, likely doesn&#8217;t have such adventurous owners, and when it does break down, it&#8217;s a fine and acceptable thing for the car to be out of service for a long and expensive repair by the manufacturer.</p>
<p>We as web operations folks want our architectures to be optimized for MTTR, not for MTBF. I think that the reasons should be obvious, and the fact that practices like:</p>
<ul>
<li>Dark launching</li>
<li>Percentage-based production A/B rollouts</li>
<li><a href="http://code.flickr.com/blog/2009/12/02/flipping-out/" target="_blank">Feature flags </a></li>
</ul>
<p>are becoming commonplace should verify this approach as having legs.</p>
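<p>To make the list above concrete, here is a rough sketch of a percentage-based feature flag. It is not the Flickr or Etsy implementation; the flag names, the <code>ROLLOUT_PERCENT</code> table, and the hashing scheme are all hypothetical and chosen only for illustration:</p>

```python
import hashlib

# Hypothetical flag table: flag name mapped to the percentage of users who see it.
ROLLOUT_PERCENT = {
    "new_search": 10,      # ramping up: roughly 10% of users
    "photo_uploader": 0,   # code is deployed, but switched off for everyone
    "old_checkout": 100,   # fully on
}


def bucket_for(flag, user_id):
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag, user_id):
    """True if this user falls inside the flag's rollout percentage."""
    percent = ROLLOUT_PERCENT.get(flag, 0)  # unknown flags default to off
    return bucket_for(flag, user_id) < percent


if __name__ == "__main__":
    users = range(10_000)
    enabled = sum(is_enabled("new_search", u) for u in users)
    print(f"new_search enabled for {enabled} of {len(users)} users")
```

<p>The MTTR angle is that recovering from a bad launch means setting one number back to 0, which is a much faster repair than building and shipping a rollback.</p>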
<p>The slides from QConSF are here:</p>
<div style="width:425px" id="__ss_5695138"> <strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/jallspaw/dev-and-ops-collaboration-and-awareness-at-etsy-and-flickr" title="Dev and Ops Collaboration and Awareness at Etsy and Flickr" target="_blank">Dev and Ops Collaboration and Awareness at Etsy and Flickr</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/5695138" width="425" height="355" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
<div style="padding:5px 0 12px"> View more <a href="http://www.slideshare.net/" target="_blank">presentations</a> from <a href="http://www.slideshare.net/jallspaw" target="_blank">John Allspaw</a> </div>
</p></div>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Go or No-Go: Operability and Contingency Planning (Surge)</title>
		<link>http://www.kitchensoap.com/2010/11/03/go-or-no-go-operability-and-contingency-planning-surge/</link>
		<comments>http://www.kitchensoap.com/2010/11/03/go-or-no-go-operability-and-contingency-planning-surge/#comments</comments>
		<pubDate>Wed, 03 Nov 2010 17:35:33 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Etsy]]></category>
		<category><![CDATA[Slides]]></category>
		<category><![CDATA[Talks]]></category>
		<category><![CDATA[WebOps]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=510</guid>
		<description><![CDATA[Last month I had the honor of speaking at the Surge Conference in Baltimore, put together by OmniTI. It was a most excellent conference, and the expertise levels were ridiculously high. I count myself lucky to be considered in the same league as the rest of the presenters. I did give a Keynote talk, and I [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>Last month I had the honor of speaking at the <a href="http://omniti.com/surge/2010/speakers/john-allspaw" target="_blank">Surge Conference</a> in Baltimore, put together by <a href="http://omniti.com/" target="_blank">OmniTI</a>.</p>
<p>It was a most excellent conference, and the expertise levels were ridiculously high. I count myself lucky to be considered in the same league as the rest of the presenters. I did give a Keynote talk, and I haven&#8217;t uploaded those slides yet. The talk I gave on the second day of the conference was about how we plan for feature launches at <a href="http://www.etsy.com" target="_blank">Etsy</a>, which follows a similar pattern we had at <a href="http://www.flickr.com" target="_blank">Flickr</a>.</p>
<p>So, here are the slides for that <a href="http://omniti.com/surge/2010/speakers/john-allspaw" target="_blank">talk</a>:</p>
<div style="width:425px" id="__ss_5657590"> <strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/jallspaw/go-or-nogo-operability-and-contingency-planning-at-etsycom" title="Go or No-Go: Operability and Contingency Planning at Etsy.com" target="_blank">Go or No-Go: Operability and Contingency Planning at Etsy.com</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/5657590" width="425" height="355" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
<div style="padding:5px 0 12px"> View more <a href="http://www.slideshare.net/" target="_blank">presentations</a> from <a href="http://www.slideshare.net/jallspaw" target="_blank">John Allspaw</a> </div>
</p></div>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2010/11/03/go-or-no-go-operability-and-contingency-planning-surge/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Nagios alerts on the iPhone &#8211; deleting boatloads</title>
		<link>http://www.kitchensoap.com/2010/10/27/nagios-alerts-on-the-iphone-deleting-boatloads/</link>
		<comments>http://www.kitchensoap.com/2010/10/27/nagios-alerts-on-the-iphone-deleting-boatloads/#comments</comments>
		<pubDate>Wed, 27 Oct 2010 14:35:30 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[WebOps]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=501</guid>
		<description><![CDATA[Protip: if you&#8217;re getting Nagios alerts on an iPhone, and you have your contact set as:  xxx-xxx-xxxx@txt.att.net, you&#8217;ll get messages from a &#8216;sender&#8217; that looks like: &#8220;1 (410) 000-173&#8221;. This is not someone in Maryland, it&#8217;s a special address so that AT&#38;T can route a reply back to the sender if need be. The side [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>Protip: if you&#8217;re getting Nagios alerts on an iPhone, and you have your contact set as:  <em>xxx-xxx-xxxx@txt.att.net</em>, you&#8217;ll get messages from a &#8216;sender&#8217; that looks like: &#8220;1 (410) 000-173&#8221;. This is not someone in Maryland, it&#8217;s a special address so that AT&amp;T can route a reply back to the sender if need be.</p>
<p>The side effect of this is when/if you get a boatload of alerts (which can happen in cascading failure scenarios where you don&#8217;t have any Nagios <a href="http://nagios.sourceforge.net/docs/3_0/dependencies.html" target="_blank">dependencies</a> or <a href="http://nagios.sourceforge.net/docs/3_0/eventhandlers.html" target="_blank">event handlers</a> set up) you&#8217;re gonna have to spend a proportional boatload of time swiping and deleting those alerts one by one.</p>
<p>This, of course, is a major bummer. <img src='http://www.kitchensoap.com/wp-includes/images/smilies/icon_sad.gif' alt=':(' class='wp-smiley' /> </p>
<p>A solution is to set your contact info in nagios instead to <em>xxx-xxx-xxxx@mms.att.net</em>, which will properly set a &#8220;from&#8221; address on your iPhone, so when it comes time to delete the boatload of messages, you can do it in a single &#8216;delete conversation&#8217; swipe.</p>
<p style="padding-left: 30px;"><strong>Caveat:</strong> If you do this (set to mms.att.net, instead of txt.att.net) you&#8217;ll lose the ability to reply to a Nagios alert. This presumably will affect those smart folks who have set up the ability to acknowledge an alert from their phone via a reply/procmail mechanism.</p>
<p>Bonus protip: make it so that you don&#8217;t <strong><em>ever</em></strong> get boatloads of Nagios alerts at once.  That will help, too.</p>
<p>Implied bonus protip: <a href="http://nagios.sourceforge.net/docs/3_0/extcommands.html" target="_blank">event handlers</a> and <a href="http://nagios.sourceforge.net/docs/3_0/dependencies.html" target="_blank">dependencies</a> are the sign of an evolved ops organization. They&#8217;re not too difficult to set up, and you&#8217;ll feel joy after you do!</p>
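<p>For anyone who hasn&#8217;t used dependencies, the idea can be sketched in a few lines of Python. This is not Nagios syntax and not how Nagios implements it (Nagios expresses dependencies in its own object configuration); the host names and the <code>PARENT</code> map are made up for illustration:</p>

```python
# Hypothetical topology: each host maps to the host it depends on (None = top level).
PARENT = {
    "core-switch": None,
    "db1": "core-switch",
    "web1": "core-switch",
    "web2": "core-switch",
}


def has_down_ancestor(host, down):
    """True if any host on the dependency path above `host` is itself down."""
    parent = PARENT.get(host)
    while parent is not None:
        if parent in down:
            return True
        parent = PARENT.get(parent)
    return False


def hosts_to_notify(down):
    """Alert only on down hosts whose failure is not explained by a down ancestor."""
    return sorted(h for h in down if not has_down_ancestor(h, down))


if __name__ == "__main__":
    down_now = {"core-switch", "db1", "web1", "web2"}
    # One page for the switch instead of four pages for everything behind it.
    print(hosts_to_notify(down_now))
```

<p>When the switch dies, everything behind it looks dead too, but only the switch is worth a page. That is the difference between one alert and a boatload.</p>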
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2010/10/27/nagios-alerts-on-the-iphone-deleting-boatloads/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Ops Meta-Metrics: Velocity 2010 Slides</title>
		<link>http://www.kitchensoap.com/2010/06/24/ops-meta-metrics-velocity-2010-slides/</link>
		<comments>http://www.kitchensoap.com/2010/06/24/ops-meta-metrics-velocity-2010-slides/#comments</comments>
		<pubDate>Fri, 25 Jun 2010 04:05:16 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=483</guid>
		<description><![CDATA[As expected, Velocity was excellent this year. What an awesome time to be in this field. Caveat for those who didn&#8217;t see/hear my talk: the graphs and numbers in the slides are, for the most part, made up. But they&#8217;re also in line with what I&#8217;ve seen at Flickr and Etsy. Ops Meta-Metrics: The Currency [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>As expected, Velocity was excellent this year. What an awesome time to be in this field.</p>
<p>Caveat for those who didn&#8217;t see/hear my talk: the graphs and numbers in the slides are, for the most part, made up. But they&#8217;re also in line with what I&#8217;ve seen at Flickr and Etsy.</p>
<div style="width:425px" id="__ss_4609305"> <strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/jallspaw/ops-metametrics-the-currency-you-pay-for-change" title="Ops Meta-Metrics: The Currency You Pay For Change" target="_blank">Ops Meta-Metrics: The Currency You Pay For Change</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/4609305" width="425" height="355" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
<div style="padding:5px 0 12px"> View more <a href="http://www.slideshare.net/" target="_blank">presentations</a> from <a href="http://www.slideshare.net/jallspaw" target="_blank">John Allspaw</a> </div>
</p></div>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2010/06/24/ops-meta-metrics-velocity-2010-slides/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

