
Knowing when you can fail is mandatory.

“Do you know when your database layer will fall over and die? At how many QPS (queries per second) will your application fall prey to slowness, corruption, replication issues, or other sorts of badness?”

I asked that question of the audience when giving a talk on capacity planning at the MySQL conference last year, and I looked out upon only a few nodding heads. I think that finding out the limitations of your current architecture and hardware is not only important, it should be considered mandatory. After the talk, a lot of people came up and asked about ways of doing this, and I told them: use production traffic to find out.

Yep: with production traffic.

It sounds scary to lots of people who run their Ops by the book. But the fact is that no amount of simulation, benchmarking, or modeling will tell you more about when things will fall over than testing in production. Folks like Neil Gunther might disagree with me on this, but I’ll argue that most social web applications just can’t be modeled accurately with queueing theory and simulation. Not accurately enough to base a specific hardware plan on, anyway.

I’m obviously not suggesting that you haphazardly throw live traffic at a box (or a cluster, or a datacenter) until it dies a thrashing, swapping death, dropping transactions on the floor. I’m suggesting that you build into your architecture a way to segment live traffic safely and easily, so you can discover the effects of that traffic on the specific hardware and software configuration you’ve got running.
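Concretely, that segmentation can be as simple as a weight knob on your load balancer. Here’s a minimal sketch in Python of a ramp that gradually shifts more live traffic onto one box and backs off when it starts to hurt. The `set_weight` and `error_rate` hooks are hypothetical stand-ins for whatever your balancer’s API and your metrics collection actually expose:

```python
import time

STEP = 5             # extra weight per ramp step
START_WEIGHT = 10    # safe baseline share of traffic
MAX_WEIGHT = 100     # stop once the box is taking a full share
ERROR_BUDGET = 0.01  # back off if the error rate climbs past 1%

def set_weight(server, weight):
    # Hypothetical hook: tell the load balancer how much traffic
    # to send this server relative to its peers.
    print(f"setting {server} weight to {weight}")

def error_rate(server):
    # Hypothetical hook: pull the server's current error rate
    # from your metrics collection.
    return 0.0

def ramp(server):
    """Shift progressively more live traffic onto one server,
    retreating the moment it starts misbehaving, and report the
    weight at which it degraded."""
    weight = START_WEIGHT
    while weight <= MAX_WEIGHT:
        set_weight(server, weight)
        time.sleep(300)  # let the traffic and the metrics settle
        if error_rate(server) > ERROR_BUDGET:
            set_weight(server, START_WEIGHT)  # retreat to the baseline
            return weight  # this is your degradation point
        weight += STEP
    return MAX_WEIGHT  # took a full share without flinching
```

In real life you’d watch more than the error rate (replication lag, response time, swap activity), but the shape is the same: small steps, real traffic, and an automatic retreat.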

And then do it again each time the application or hardware changes in any significant way.

Isn’t that one of the reasons why load balancing was invented? 🙂

Some questions we think about in flickrland:

  1. How many webserver requests/sec does it take (given “normal” usage) to bring user CPU up 10%? How many to make the machine die? (A back-of-the-envelope version of this one is sketched below.)
  2. How much disk IOwait, CPU, or INSERT/UPDATE/DELETE load can your database server take before replication slave lag becomes an issue?
  3. How close are all of your switches to running out of network goodness?
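For that first question, once you’ve measured a couple of (requests/sec, user CPU) points under real traffic, the ceiling is a straight-line extrapolation. A rough sketch follows; the numbers are made up, and it assumes CPU grows roughly linearly with request rate over the range you care about, which your own measurements should confirm or refute:

```python
# Two measured points of (requests/sec, user CPU %) -- made-up numbers,
# substitute what your own boxes report under live traffic.
low_qps, low_cpu = 200, 25.0
high_qps, high_cpu = 400, 45.0

# Assume roughly linear CPU growth between the measured points.
cpu_per_req = (high_cpu - low_cpu) / (high_qps - low_qps)
idle_cpu = low_cpu - cpu_per_req * low_qps  # overhead at zero traffic

CEILING = 80.0  # plan to 80% CPU, leaving headroom for spikes
max_qps = (CEILING - idle_cpu) / cpu_per_req
print(f"~{max_qps:.0f} requests/sec before hitting {CEILING}% user CPU")
```

With those example numbers the answer comes out to roughly 750 requests/sec per box, which is exactly the kind of figure you want in hand before the traffic shows up, not after.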