Knowing when you can fail is mandatory.

“Do you know when your database layer will fall over and die? At how many QPS (queries per second) will your application fall prey to slowness, corruption, replication issues, or other sorts of badness?”

I asked that question of the audience when giving a talk on capacity planning at the MySQL conference last year, and I looked out upon only a few nodding heads. I think that finding out the limitations of your current architecture and hardware is not only important, it should be considered mandatory. After the talk, a lot of people came up and asked about ways of doing this, and I told them to use production traffic to find out.

Yep: with production traffic.

It sounds scary to lots of people who run their Ops by the book. But the fact is that no amount of simulation, benchmarking, or modeling will tell you more about when things will fall over than testing in production. Folks like Neil Gunther might disagree with me on this, but I will argue that most social web applications just aren’t able to be modeled accurately with queueing theory and simulation. Not accurately enough to base a specific hardware plan on, anyway.

I’m obviously not suggesting that you haphazardly throw live traffic at a box (or a cluster, or a datacenter) until it dies a thrashing, swapping death, dropping transactions on the floor. I’m suggesting that you build into your architecture a safe and easy way to segment live traffic, so you can discover the effects of that traffic on the specific hardware and software configuration you’ve got running.

And then do it again each time the application or hardware changes in any significant way.

Isn’t that one of the reasons why load balancing was invented? 🙂
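To make the idea of segmenting live traffic a bit more concrete, here is a minimal sketch in Python of ramping one box's share of traffic until it stops looking healthy. Everything named here is an assumption for illustration: `set_weight` stands in for whatever call your load balancer exposes to change a server's weight, `read_latency_ms` stands in for whatever health metric you trust, and the latency limit is a number you would pick yourself. It is not how any particular load balancer works.

```python
def ramp_until_unhealthy(set_weight, read_latency_ms, limit_ms, max_weight=20):
    """Raise one box's load balancer weight step by step.

    set_weight(w)      -- hypothetical hook: tell the load balancer to give
                          the test box weight w (0 means no traffic)
    read_latency_ms()  -- hypothetical hook: current latency seen on the box
    limit_ms           -- latency above which we call the box unhealthy

    Returns the highest weight at which the box stayed within limit_ms.
    On the first unhealthy reading the box is moved back to that weight.
    """
    last_good = 0
    for weight in range(1, max_weight + 1):
        set_weight(weight)
        if read_latency_ms() > limit_ms:
            # Back off right away so real users are not left on a sick box.
            set_weight(last_good)
            return last_good
        last_good = weight
    return last_good
```

In real use you would wait between steps long enough for the metric to settle, and you would probably watch more than one metric. The point is only that the ramp is gradual and the back-off is automatic.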

Some things we think about in flickrland:

  1. How many webserver requests/sec does it take (given “normal” usage) to bring user CPU up 10%? How many to make the machine die?
  2. How much disk IOwait, CPU, or INSERTs/UPDATEs/DELETEs can you take on your database server before replication slave lag becomes an issue?
  3. How close are all of your switches to running out of network goodness?
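For the first question, one rough way to get a number is to fit a straight line through (requests/sec, user CPU %) samples taken from a box serving production traffic. The sketch below is Python and assumes the relationship is roughly linear over the range you measured, which is only an assumption. The function names and the sample values in the usage note are made up for illustration. The fit tells you about the range you observed. It does not tell you where the machine dies, which you still have to find with real traffic.

```python
def fit_cpu_vs_rps(samples):
    """Least-squares fit of cpu_percent = slope * rps + intercept.

    samples -- list of (requests_per_sec, user_cpu_percent) pairs, with at
               least two different request rates.
    Returns (slope, intercept).
    """
    n = len(samples)
    mean_rps = sum(rps for rps, _ in samples) / n
    mean_cpu = sum(cpu for _, cpu in samples) / n
    sxx = sum((rps - mean_rps) ** 2 for rps, _ in samples)
    sxy = sum((rps - mean_rps) * (cpu - mean_cpu) for rps, cpu in samples)
    slope = sxy / sxx
    return slope, mean_cpu - slope * mean_rps


def rps_per_cpu_increase(slope, percent=10.0):
    """Extra requests/sec needed to raise user CPU by `percent` points."""
    return percent / slope
```

With made-up samples such as `[(100, 15), (200, 25), (300, 35)]`, the fit gives a slope of about 0.1 CPU points per request/sec, so roughly 100 more requests/sec for each 10% of user CPU.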

1 Comment

  1. Paul Holbrook

    I couldn’t agree more. (Well, perhaps I could if I nodded vigorously.)

    When I worked at (96-01), we never found a way that came even close to simulating what production traffic did to our web farm. The most important number we knew for our web servers was that tip-over point.

    We actually kept track of two things: the red line point, if you will – which at the time was about 20,000 hits per minute/server – and what a box could really do. Back before CNN had load balancers, we would put multiple IPs onto a box – load it up – to see what it could do. And we were very quick to move traffic to other boxes when a box started to fail.

    Later, when I moved to EarthLink, I was unpleasantly surprised to find out that the engineers did not know these numbers for the various servers. Knowing that number makes planning easier, and it makes justifying new purchases easier: “See this graph of our traffic? And see how it compares to our capacity?”
