UPDATE: blip.tv has the video of the talk as well, below. Jeez I have some major bed-head.

That was a blast! I had never done a ‘duet’ talk before. Here are the slides:

…and the video of it is here:

{ 9 comments }

Annoying To Me.

by allspaw on May 22, 2009

I can’t tell you how ripped I get when people say things like this:

“cloud computing means getting rid of ops”

If by “ops” you mean “people in data centers racking servers, installing OSes, running cables, replacing broken hardware, etc.” then sure, cloud computing aims to relieve you of those burdens. If you really think ‘ops’ is just that, then you really should put down your Nick Carr book and pay attention to the real world for a change.

The reality is, if your ops team is spending a lot of time doing that, then you’re either:

  1. Too big to use someone *else’s* cloud, because you basically have your own (Yahoo, Amazon, Google, etc.)
  2. Stuck in 1999.

If you deal with any of these things:

  • handling site issues/incidents
  • building and maintaining tools to monitor and gather systems and application-level metrics
  • programming the ability to adapt infrastructure to changing system or application-level conditions (usage, failure, degradation, etc.)
  • implementing and maintaining deployment systems (code, config management, etc.)
  • capacity planning (no, really)

then you’re doing “ops”, by my definition. In some environments, these things are done by “developers”. But my definition says those devs are performing ops functions.

Cloud computing isn’t going to make ‘ops’ go away. It’s relieving ops (and dev) of a bunch of pain-in-the-ass things so they can focus on the real work needed. Namely: your application.

Last I checked, clouds don’t perform the tasks listed above, because those things (done right) are application-specific. And while cloud computing enables (in an excellent way) the efficient resource allocation (or de-allocation) for an application, it doesn’t get rid of the need to do the above things.

{ 8 comments }

I really don’t think it can be overstated how important context is when it comes to troubleshooting or evaluating the health of an infrastructure. When starting to troubleshoot a complex problem, web ops 101 “best practices” usually start with asking at least these questions:

  1. When did this problem start?
  2. What changes, if any, (software, hardware, usage, environmental, etc.) were made just previous to the start of the problem?

The context surrounding these problem events is pretty damn critical to figuring out what the hell is going on.
Most monitoring systems are based around the idea that you want to know if a particular metric is above (or sometimes below) a certain threshold, and have ‘warning’ or ‘critical’ values that represent what is going bad or is already bad. When these alarms go off, knowing how and when they got there is really important to your troubleshooting approach. This context is paramount in figuring out where to spend your time and focus.

For example: an alarm goes off because a monitor has detected that some metric has reached a critical state. Something that goes critical instantly can be quite different than something that edged into critical after being in a warning state for some time.

Check it out:

Monitored metric passing thru warning and critical thresholds.


Almost instantaneous critical, no time spent in warning.


For this discussion, the actual metric here isn’t that important. It could be CPU on a webserver, it could be latency on a cache hit or miss on memcached/squid/varnish/etc, or it could be network bandwidth on a rack switch.  The values you set for warning and critical are normally informed by how long the system can tolerate being in warning mode and, given ‘normal’ failure modes, should allow enough wall-clock time for recovery actions to take place before the metric reaches critical.

Most people would approach these two scenarios quite differently, because of the context that time lends to the issue.
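To make the difference between those two graphs concrete, here is a minimal Python sketch that reports how long a metric sat in warning before it first went critical. The function name, the sample format (timestamp in seconds, value), and the threshold numbers are all made up for illustration; no particular monitoring system works exactly this way.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, metric value)


def seconds_in_warning(samples: Sequence[Sample], warning: float,
                       critical: float) -> Optional[float]:
    """Return how long the metric stayed at or above `warning` before its
    first critical sample, or None if it never went critical."""
    warning_since = None
    for ts, value in samples:
        if value >= critical:
            # No warning streak means it jumped straight into critical.
            return 0.0 if warning_since is None else ts - warning_since
        if value >= warning:
            if warning_since is None:
                warning_since = ts
        else:
            warning_since = None  # back to normal, so reset the streak
    return None


# Hypothetical samples: one metric creeps up, the other jumps.
creeping = [(0, 50), (60, 72), (120, 80), (180, 88), (240, 95)]
sudden = [(0, 50), (60, 52), (120, 97)]
print(seconds_in_warning(creeping, warning=70, critical=90))
print(seconds_in_warning(sudden, warning=70, critical=90))
```

A long stretch in warning hints at something gradual (growth, a leak, a filling disk), while zero time in warning points at a sudden event like a deploy or a failure.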

In the book, I give an example of how valuable this context is in troubleshooting interconnected systems. When metrics from different clusters or systems are laid right next to each other, significant changes in usage can be put into the right context. Cascading failures can be pretty hard to track down to begin with. Tracking them down without the big picture of the system is impossible. That graph you’re using for troubleshooting: is it showing you a cause, or symptom?

Because context is so important, I’m a huge fan of overlaying higher-level application statistics with lower-level systems ones. This guy has a great example of it over on the Web Ops Visualization group pool:

He’s not just measuring the webserver CPU, he’s also measuring the ratio of requests per second to total CPU. This is context that can be hugely valuable. If any of the underlying resources change (faster CPUs, more caching on the back-end, application optimizations, etc.) he’ll be able to tell quickly how much benefit he’ll gain (or lose) by tracking this bit.
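A ratio like that is cheap to compute yourself. Here is a rough Python sketch, assuming you already have paired samples of requests per second and CPU percent for the same time slots; the function names and the sample numbers are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def requests_per_cpu_point(rps: Sequence[float],
                           cpu_pct: Sequence[float]) -> List[float]:
    """Requests/sec served per percentage point of CPU, per sample.
    Samples with zero CPU are skipped to avoid dividing by zero."""
    return [r / c for r, c in zip(rps, cpu_pct) if c > 0]


def efficiency_change(before: Sequence[float], after: Sequence[float]) -> float:
    """Percent change in the mean ratio between two sets of samples."""
    b, a = mean(before), mean(after)
    return (a - b) / b * 100.0


# Hypothetical samples taken before and after a hardware or code change.
before = requests_per_cpu_point([400, 500], [40, 50])
after = requests_per_cpu_point([600, 750], [40, 50])
print(efficiency_change(before, after))
```

Plotting the ratio next to raw CPU is the point: raw CPU alone can’t tell you whether a change bought you anything.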

At the Velocity Summit, Theo mentioned that since OmniTI started throwing metrics for all their clients into Reconnoiter, they almost always plot their business metrics on top of their system metrics, because why the hell not? Even if there’s no immediate correlation, it gives their system statistics the context needed for the bigger picture, which is:

How is my infrastructure actually enabling my business?

I’ll say that gathering metrics is pretty key to running a tight ship, but seeing them in context is invaluable.


{ 3 comments }

This is a ramble continued from before, which means it’s mostly a blog post for me, but maybe others might find it interesting.

The last time I made an analogy between back-end web architectures and mechanical structures, I blathered on about what are basically structural limitations of individual components in a physical device, and how it’s somewhat analogous to the mechanics of a website’s infrastructure. For example, just like the tie rods, bumpers, and frame on your car, webservers show some amount of “strain” (i.e. resource usage, like CPU) when they are loaded up with “stress” (i.e., requests.) Mechanical components have limited ability to withstand pressure, as do web tier components. It’s exactly those load characteristics and ultimate limitations that drive capacity planning.

Spring and Damper, Loaded Under Stress


But anyone who works with web architectures knows that individual components are only a part of the whole story of how much “load” a particular system can take, before it degrades (gracefully or not) or fails (gracefully or not.)

In a car, bolts connect struts and shocks together, which can deform non-linearly, car bumpers press on frames which can squeeze firewalls closer to engine blocks, and a myriad of other inter-connected influences happen. The car will drive, crash, and even sit idle all according to those stress-strain relationships and can be characterized by springs, dampers, and the material properties that the components are made up of.

Check out this finite-element simulation of a pickup truck crashing into a rigid wall:

Pickup Truck Crash Simulation

In the same way, web architectures can have storage layers, application layers, and all sorts of resources, each with their own “stress/strain” characteristics, and limits. Caching layers have limits that don’t just touch response time; they affect the origin servers or databases that they are caching for.  Buffers all around can fill and empty. Network bandwidth expands and contracts, hopefully within its limits. And in most typical setups, webservers can push and pull on almost everything. The list goes on and on, up and down the stack.

My brain filled with this analogy sorta looks like this:

Interconnected Components (car crash and web architecture)

Thinking about web architectures like this helps me visualize the whole system, and gives context to the dynamics of the whole thing.

Now, at least one part of this analogy doesn’t play out very well. While mechanical systems can experience nonlinear and very dynamic change (in the case of a car crash, for example) the properties of their components aren’t easily changed in the same way that web architectural bits can.

With websites, the introduction of change (for example, a bad database query) can affect (in a bad way) the entire system, not just the component(s) that saw the change. Add a handful of milliseconds to a query that’s made often, and you’re now holding page requests up longer. The same thing applies to optimizations as well. Break that shitty query into two small fast ones, and watch how usage can change all over the system pretty quickly. Databases respond a bit faster, pages get built quicker, which means users click on more links, etc. This second-order effect of optimization is probably pretty familiar to those of us running sites of decent scale.
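A back-of-the-envelope way to see the first-order effect of a slower query is Little’s Law (average concurrency = arrival rate × time in the system). The Python sketch below assumes the queries run one after another while the page is being built, and every number in it is made up.

```python
def extra_busy_workers(pages_per_sec: float, queries_per_page: float,
                       added_ms_per_query: float) -> float:
    """Average number of additional page requests in flight at once when
    each query gets slower by `added_ms_per_query` milliseconds.

    Uses Little's Law: concurrency = arrival rate x time in the system.
    Assumes queries are issued serially during page generation.
    """
    added_seconds_per_page = queries_per_page * added_ms_per_query / 1000.0
    return pages_per_sec * added_seconds_per_page


# Hypothetical site: 500 pages/sec, 3 runs of the query per page, 5 ms slower.
print(extra_busy_workers(500, 3, 5))
```

With those made-up numbers, that is roughly seven or eight more busy webserver processes on average, before any of the second-order effects kick in.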

Back to our mechanical analogy. Imagine if you could magically change the tensile properties of the steel in your car’s engine block. Small stresses or strains within the engine then might not add up in the same way, and that will affect the heat it generates, putting more load on your cooling system. Or your pistons might not deform those few microns that they used to at 4500 RPM. And so on and so on. Introducing small changes can balloon into large ones pretty quickly. Insert something corny here about holistic and systemic interconnectedness, Butterfly effects, etc.

Now consider a pit crew that could, remotely and instantly, change the density of the rubber in the tires on a race car. While it’s racing.  Even better: imagine if the car could detect the conditions on the road, conditions within the car, conditions with the driver, and could adjust the material properties of the tires, the fuel mixture, or the stiffness of each part of the suspension, all automatically. While it’s racing. That would be insanely cool, to say the least.

Of course, on the web side of this analogy, these changes happen all the time to individual components. Since we do make a decent amount of changes to flickr.com on a daily basis (see bottom of this page), we can sometimes see some dramatic results of those changes in the same way. Squeeze some more speed with a recompile or upgrade, extend the cache expiry time of an object, or tighten up a slowish database query, and all of a sudden your whole system’s performance can look very different. So with web architectures, you can actually change the components while you’re racing.

This game of ‘follow-the-bottleneck’ is in many ways inevitable and unavoidable, but that’s ok. It’s yet another motivator for capacity planning and future development. One of our constantly moving goals is to automate, in whatever way we can, the absorption of each component’s pushes and pulls on the entire system.

In the web operations world, there’s a term for being able to make those instantaneous changes while you’re racing: automated infrastructure.

But that’s another topic entirely. :)


{ 3 comments }

That was a pretty good time. Saw lots of good and wicked smaht people, and I got a lot of great questions after my talk. The slides are up on slideshare, and here are the PDF slides.

UPDATE: Gil Raphaelli has posted the Python bindings he wrote for libyahoo2, which we use in our Ops IM Bot.

There was something that I left out of my slides, mostly because I didn’t want to distract from the main topic, which was optimization and efficiencies.

While I used our image processing capacity at Flickr as an example of how compilers and hardware can have some significant influence on how fast or efficient you can run, I had wondered what the Magical Cloud™ would do with these differences.

So I took the tests I ran on our own machines and ran them on Small, Medium, Large, Extra Large, and Extra Large(High) instances of EC2, to see. The results were a bit surprising to me, but I’m sure not surprising to anyone who uses EC2 with any significant amount of CPU demand.

For the testing, I have a script that does some super simple image resizing with GraphicsMagick. It splits a DSLR photo into 6 different sizes, much in the same way that we do at Flickr for the real world. It does that resizing on about 7 different files, and I timed them all. This is with the most recent version of GraphicsMagick, 1.3.5, with the awesome OpenMP bits in it.
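For the curious, a test like that can be sketched in a few lines of Python. This is not the actual script: the six box sizes and the file naming are invented (the post only says “6 different sizes”), and it assumes the `gm` binary is on the PATH.

```python
import subprocess
import time
from pathlib import Path
from typing import List

# Hypothetical bounding boxes, in pixels. Not Flickr's real sizes.
SIZES = [75, 100, 240, 500, 1024, 1600]


def resize_command(src: Path, size: int, out_dir: Path) -> List[str]:
    """Build a `gm convert` call that fits `src` inside a size x size box."""
    out = out_dir / f"{src.stem}_{size}{src.suffix}"
    return ["gm", "convert", str(src), "-resize", f"{size}x{size}", str(out)]


def time_resizes(src: Path, out_dir: Path, run=subprocess.run) -> float:
    """Resize one photo into every size and return elapsed wall-clock seconds.
    `run` is injectable so the timing loop can be exercised without gm."""
    start = time.perf_counter()
    for size in SIZES:
        run(resize_command(src, size, out_dir), check=True)
    return time.perf_counter() - start
```

Calling `time_resizes(Path("photo.jpg"), Path("out"))` for each test file (with the output directory already created) and summing the results gives a single number to compare across machines or instance types.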

Here is the slide of the tests run on different (increasingly faster) dedicated machines:

Faster Image Processing Hardware

and here is the slide that I didn’t include, of the EC2 timings of the same test:

Image Processing on EC2

Now I’m not suggesting that the two graphs should look similar, or that EC2 should be faster. I’m well aware of the shift in perspective when deploying capacity within the cloud versus within your own data center. So I’m not surprised that the fastest test results are on the order of 2x slower on EC2. Application logic and feature designs (synchronous versus asynchronous image processing, for example) can take care of these differences, which could be a welcome trade-off against having to run your own machines.

What I am surprised about is the variation (or lack thereof) of all but the small instances. After I took a closer look at vmstat and top, I realized that the small instances consistently saw about 50-60% CPU stolen from them, the mediums almost always saw zero stolen, and the Large and ExtraLarges saw up to 35% CPU stolen from them during the jobs.
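If you’d rather not eyeball vmstat, the same steal figure can be derived from two readings of the aggregate `cpu` line in `/proc/stat` on Linux. This is a minimal Python sketch, assuming a kernel new enough to report the steal column (the eighth counter on that line); the helper names are my own.

```python
from typing import List


def parse_cpu_line(line: str) -> List[int]:
    """Parse the aggregate 'cpu' line from /proc/stat into integer counters:
    user, nice, system, idle, iowait, irq, softirq, steal, ..."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]


def steal_percent(before: str, after: str) -> float:
    """Percent of CPU time stolen by the hypervisor between two readings."""
    deltas = [a - b for b, a in zip(parse_cpu_line(before), parse_cpu_line(after))]
    if len(deltas) < 8:
        raise ValueError("this kernel does not report a steal column")
    # Sum user..steal only: guest time is already counted inside user/nice.
    total = sum(deltas[:8])
    return 100.0 * deltas[7] / total if total else 0.0


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the first line of /proc/stat, which is the aggregate cpu line."""
    with open(path) as f:
        return f.readline()
```

Take one `read_cpu_line()` before the job and one after, then pass both to `steal_percent()` to get the average over the whole run.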

So, interesting.

{ 6 comments }

People have wondered why I chose not to include any real amount of material in my book about the mathematical topics related to capacity planning, like queueing theory.

There are already many other excellent books that dig into the math behind Little’s Law, M/M/1 queues, and Poisson arrival processes. These concepts do indeed detail the behavior of capacity models, and at any given point in the lifetime of a web request, describe what the hell is going on. The best part about that math is that it can be applied to almost any computer system responding to requests, without regard to hardware configurations, operating system settings, or really any details about the application(s) you’re trying to model.

Requests come in, they get processed, might get tossed around within the back-end architecture, and then get spit back out as a response. Great! The math can model that.
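For a taste of what that math looks like, here is a small Python sketch of the textbook M/M/1 results (one server, Poisson arrivals, exponentially distributed service times) plus a check of Little’s Law. The formulas are the standard ones; the rates are made up.

```python
from typing import Tuple


def mm1(arrival_rate: float, service_rate: float) -> Tuple[float, float, float]:
    """Steady-state M/M/1 figures for arrival rate lambda and service rate mu
    (both in requests/sec). Returns (utilization, mean requests in the
    system, mean seconds each request spends in the system)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrivals must be slower than service")
    rho = arrival_rate / service_rate                # utilization
    mean_in_system = rho / (1.0 - rho)               # L
    mean_time = 1.0 / (service_rate - arrival_rate)  # W
    return rho, mean_in_system, mean_time


# Hypothetical server: 80 requests/sec arriving, able to serve 100/sec.
rho, L, W = mm1(80.0, 100.0)
# Little's Law says L should equal lambda * W.
print(rho, L, W, abs(L - 80.0 * W) < 1e-9)
```

Note how little it needs to know: two rates and nothing about hardware, OS settings, or the application, which is both the appeal and the limitation.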

But I didn’t cover those queuing fundamentals in my book for a bunch of reasons:

  1. Other books spend a good deal of ink going over it. I didn’t write a book to replace any of Neil Gunther’s or Daniel Menasce’s publications. I wrote a book meant to be useful for developers and operations people working on growing websites like Flickr. Not banks. Not mainframe applications. Not HPC clusters. Websites.
  2. When it comes to sizing infrastructure, I don’t like spending a lot of time on modeling and simulation, where that math has been historically used. Why? Because there are too many variables (and changes in those variables) in both the application and usage of the application to warrant any time developing a model. By the time I could have a model built that I had any confidence in, it wouldn’t be accurate enough to be worth anything. Instead, I wanted to focus on the fundamentals of recording, observing, and correlating both system and business metrics. The part where you tie systems statistics to business metrics (“What does 50% CPU actually do for my business?”) is a spot that I found lacking, and I believe that context has to be there in order to get the larger picture of your capacity.
  3. There was already enough material to cover about forecasting, metrics collection, and deployment.

Don’t get me wrong, the math is interesting. Hell, I spent almost 5 years of my career building models and simulations of nonlinear structural dynamics (vehicle crashworthiness research). But I didn’t see the need to have that material in my book.

{ 2 comments }

Some Things We Did Today

by allspaw on March 5, 2009

  • Moving one of our eight photoserving farms from hardware Layer7 URL hash balancing (expensive, has limits) to L4 DSR balancing with CARP (cheap and simple) and figuring out how to juggle 18,000 requests/second while we do it.
  • Built yet more automated query analysis reporting (with some yummy MySQLProxy)
  • Added yet another aggregated graph of database queries, broken down by type and cluster
  • Bunch of cfg mgmt changes (polishing up IO scheduling and filesystem tunings in a 2nd datacenter, more caching of search results)
  • Review of the higher priority to-dos in the Ops open bug queue (only 155 open! :) )
  • Finding new capacity ceilings for the image processing, given some recent optimizations

{ 4 comments }

Speaking at Web2.0 Expo 2009

by allspaw on February 19, 2009

Looks like I’m gonna talk about even more nerdy things at the Web2.0 Expo in April.

You don’t have to wait for a recession to tighten up your operations. Squeezing more oomph out of your servers (or instances!) is always a good thing, and streamlining how you handle site issues is too. We’ll talk about what we’ve been doing at Flickr to get more out of less from both our machines and our humans.

Capacity Hacks: diagonal scaling, tuning opportunities, and some other stupid performance tricks.

Ops “runbook” Hacks: Server and process self-healing, application-level measurement, ops communication tools, and some worst-case scenario tricks to have in your back pocket.

{ 2 comments }

I don’t blog much, and when I do, the posts are pretty short and to the point. This post is different: feel free to put it into the “ramble” category.
I’m really just posting it here for myself as a thought exercise.

Some years ago, while drawing a network map for the site I was working on at the time, something struck me about how our tiered architecture looked. My background before getting into this computer business was in mechanical engineering, so it’s not surprising that when looking at these webservers, databases, caching servers, storage, I imagined: mechanical devices.

Bear with me for a bit. When I did crashworthiness research for the gov’t, we ran mathematical models of car crashes on big honkin’ parallel computers. These simulations ran with the math and concepts that govern structural dynamics. Like, Newton stuff. In the simpler ‘rigid-body’ simulations, you have things like bumpers, frames, doors, engine blocks, and tires that are reduced to hunks of mass connected with springs and dampers, and when connected all together, can paint a pretty accurate picture of what happens when a vehicle slams into a rigid wall. In the more complex models, 3-dimensional elements were given the material properties of steel, rubber, glass, etc. so that forces (and strains due to those forces) could be used to calculate how crunched those components get during a crash.

In either case, you’ve got components that get strained based on the load placed on them, and the relationship between load and strain can be measured and used to predict future behavior.  Sound familiar?

Here’s what a stress-strain diagram for steel looks like:

As load increases, the material is strained. Up to a point (roughly, point 3 on the diagram), it can bounce back to pre-load conditions. MechEng dorks call this elastic deformation. After that point, it stays put after it’s strained, which is called plastic deformation. At some point, the steel fails completely and breaks in two.

Now here’s what a webserver looks like when load (busy Apache processes) increases:

CPU vs. Apache Procs

As the rate of hits increases, the CPU usage increases. (well, duh)

So, I’m saying (sort of) webservers can be imagined as components that have limited tolerances for load, and behave (for the most part) predictably as that load increases. Just like springs, dampers, and other mechanical components that make up a dynamically loaded  structure. Do webservers have plastic parts of their stress/strain curve? Nah, they work fine right up until the point that CPU hits 100%, and while there might be swapping or other badness happening for sustained load > 95%, with most multicore machines, my experience is that they can mostly recover when load comes down.
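If the relationship really is that predictable, a straight-line fit gets you a rough ceiling. Here is a minimal Python sketch using ordinary least squares; the sample points are invented, and the linear assumption only holds inside the range you’ve actually observed.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def load_at_cpu(slope: float, intercept: float, cpu_limit: float) -> float:
    """Load (busy Apache processes) at which the fitted line hits cpu_limit."""
    return (cpu_limit - intercept) / slope


# Hypothetical samples: busy Apache processes vs. CPU percent.
busy_procs = [10, 20, 30, 40]
cpu_pct = [12, 22, 32, 42]
slope, intercept = fit_line(busy_procs, cpu_pct)
print(load_at_cpu(slope, intercept, cpu_limit=90))
```

Picking a limit below 100% leaves some headroom, since the interesting badness tends to start before the CPU is fully pegged.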

This isn’t exactly an accurate analogy. There are all sorts of ways in which mechanical force is different from webserver traffic. Even if it were the same, it can’t be applied cleanly to the example of databases, which are a bit more complex in how they behave. Not to mention that it’s a pretty broad stroke. We’re ignoring all of the small connecting stuff that can make a big difference, like rivets, tie rods, bolts, welds (or: NICs, cables, switching, bus speeds, etc.).

But for an old gearhead like myself, visualizing all our servers in terms of forces and strains is intuitive. Whether it’s springs and dampers connected to masses, materials bending under load, or even the tachometer of a running engine…anything that can help with the holistic view is good, IMHO. :)

If you’ve read this far, then I’m impressed that you have such a high tolerance for long-winded babbling.

{ 5 comments }

Like lots of operations people, we’re quite addicted to data pr0n here at Flickr. We’ve got graphs for pretty much everything, and add graphs all of the time. We’ve blogged about some of how and why we do it.

One thing we’re in the habit of is screenshotting these graphs when things go wrong, right, or indifferent, and adding them to a group on Flickr. I’ve decided to make a public group for these sorts of screenshots, for anyone to contribute to:

http://flickr.com/groups/webopsviz/

Before posting anything here, think about whether you want everyone in the world to see what you’ve got. I’ve made a quick FAQ on the group’s page, but I’ll repeat it here:

Q: What is this?
A: This group is for sharing visualizations of web operations metrics. For the most part, this means graphs of systems and application metrics, from software like ganglia, cacti, hyperic, etc.

Q: Who gets to see this?
A: This is a semi-public group, so don’t post anything you don’t want others to see.
For now, it’ll be for members-only to post and view. Ideally, I think it’d be great to share some of these things publicly.

Q: What’s interesting to post here?
A: Spikes, dips, patterns. Things with colors. Shiny things. Donuts. Ponies.

Q: My company will fire me if I show our metrics!
A: Don’t be dense: don’t post your pageviews, revenue, or other super-secret stuff that you think would be sensitive. Your mileage may vary.

So: you’ve got something to brag about? How many requests per second can your awesome new solid-state-disk database do? You got spikes? Post them!

{ 0 comments }