- Moving one of our eight photo-serving farms from hardware Layer 7 URL-hash balancing (expensive, and it has its limits) to L4 DSR balancing with CARP (cheap and simple), and figuring out how to juggle 18,000 requests/second while we do it.
- Built yet more automated query-analysis reporting (with some yummy MySQL Proxy)
- Added yet another aggregated graph of database queries, broken down by type and cluster
- Bunch of cfg mgmt changes (polishing up IO scheduling and filesystem tunings in a 2nd datacenter, more caching of search results)
- Review of the higher-priority to-dos in the Ops open bug queue (only 155 open!)
- Finding new capacity ceilings for the image processing, given some recent optimizations
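The CARP bit in that first bullet is worth a sketch. Here's a minimal, hypothetical version in Python of the highest-score-wins idea behind CARP-style request routing. Real CARP implementations use a specific 32-bit hash and per-member load factors; the MD5 stand-in and the backend names here are assumptions for illustration only:

```python
import hashlib

# Hypothetical backend pool; real CARP members also carry load factors.
BACKENDS = ["photo1", "photo2", "photo3", "photo4"]

def carp_score(backend: str, key: str) -> int:
    """Combine the backend's identity with the request key into one score.
    (MD5 is a stand-in for CARP's actual 32-bit hash function.)"""
    digest = hashlib.md5((backend + key).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def pick_backend(key: str) -> str:
    """Highest score wins. The nice property: adding or removing one
    backend only remaps the keys whose top score was on that backend,
    instead of reshuffling everything the way modulo hashing does."""
    return max(BACKENDS, key=lambda b: carp_score(b, key))
```

Since the choice is a pure function of URL and pool membership, every L4 balancer node computes the same answer with no shared state, which is what makes it cheap and simple compared to the hardware Layer 7 gear.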
Looks like I’m gonna talk about even more nerdy things at the Web2.0 Expo in April.
You don’t have to wait for a recession to tighten up your operations. Squeezing more oomph out of your servers (or instances!) is always a good thing, and so is streamlining how you handle site issues. We’ll talk about what we’ve been doing at Flickr to get more out of less, from both our machines and our humans.
Capacity Hacks: diagonal scaling, tuning opportunities, and some other stupid performance tricks.
Ops “runbook” Hacks: Server and process self-healing, application-level measurement, ops communication tools, and some worst-case scenario tricks to have in your back pocket.
Whew. That took longer than I thought.
Todd Hoff over at the High Scalability blog has an email interview with me about the book I wrote, “The Art of Capacity Planning: Scaling Web Resources”. I’m still just happy that I got it done at all, seeing as it was due the same week my son was born.
This book happened because of a LOT of people. You know who you are, because you’re in the Acknowledgements.
James Hamilton’s excellent LADIS 2008 presentation has lots of great stuff in it about internet-scale infrastructure. Cool stats.
So now chapters 1–4 are up on Safari Rough Cuts. Which means if you don’t mind shelling out the dough, you can take a look at what I’ve been getting up early for every day for the past few months. The working title is “The Art of Capacity Planning” and it’s meant to be a no-nonsense description of the capacity planning process and considerations for web operations.
I still have two chapters to go before it’s all finished, but if you’re nice enough to take a look at what I’ve got thus far, I’d appreciate any feedback. I’m sure there are typos and some misaligned graphs, but such is life with “drafts”.
Here they are.
Dear users of S3, EC2, and other ‘utility’ computing stuffs:
Here’s a crude and completely oversimplified evolution of infrastructure needs of a growing website, with an assumption:
Have you ‘outgrown’ your original use of utility computing, for whatever reason? If so, what was the reason? Financial? Technical?
Why I’m asking:
I’m in the process of writing a book on the topic of capacity planning for web architectures, so I’m interested in what you’ve got to say.
I’m probably very late to this party, but I just discovered Dashboard Spy. Given the amount of “data porn” that folks in webops look at on a daily basis, this sort of stuff is pretty damn interesting.
I’m especially loving the current trend of developing ‘business’ dashboards, since they can fit in quite nicely with infrastructure statistics. Quite often when I need to make capacity justifications, I pull forecasts from both the higher-level metrics (e.g. photos uploaded) and the lower-level metrics (e.g. disk space consumed by photos) and have to marry those two bits together.
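As a toy illustration of marrying those two bits, here's a hedged sketch in Python. Every number in it is made up for the example; the point is the shape of the calculation: measure the ratio tying a business metric to an infrastructure metric, then push the business forecast through that ratio to get a capacity requirement.

```python
# Hypothetical recent measurements (assumed numbers, not real Flickr data):
photos_per_day = 1_200_000   # higher-level metric: photos uploaded daily
disk_gb_per_day = 850.0      # lower-level metric: disk consumed daily (GB)

# The glue between the two: storage cost per unit of business activity.
gb_per_photo = disk_gb_per_day / photos_per_day

# Say the business side forecasts 30% upload growth next quarter;
# translate that directly into a storage requirement.
forecast_photos_per_day = photos_per_day * 1.30
forecast_disk_gb_per_day = forecast_photos_per_day * gb_per_photo

days_in_quarter = 90
quarter_disk_tb = forecast_disk_gb_per_day * days_in_quarter / 1024
print(round(quarter_disk_tb, 1))  # → 97.1
```

That final number (about 97 TB for the quarter, under these invented inputs) is the kind of figure a capacity justification can hang on, because it connects a forecast the business side believes in to hardware you have to buy.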
In fact, I love that stuff so much that I’m writing a book about it.