Context and Operational Metrics

I really don’t think you can overstate how important context is when it comes to troubleshooting or evaluating the health of an infrastructure. When starting to troubleshoot a complex problem, web ops 101 “best practices” usually start with asking at least these questions: When did this problem start? What changes, if any, (software,...

Mechanical Analogies To Web Stuff, Part 2.

This is a ramble continued from before, which means it’s mostly a blog post for me, but others might find it interesting. The last time I made an analogy between back-end web architectures and mechanical structures, I blathered on about what are basically structural limitations of individual components in a physical device, and how...

Why I didn’t include queueing math in my book.

Some people have wondered why I chose not to include any real amount of material in my book on the mathematical topics related to capacity planning, like queueing theory. There are already many other excellent books that dig into the math behind Little’s Law, M/M/1 queues, and Poisson arrival processes. These concepts do indeed detail...

Some Things We Did Today

Moving one of our eight photoserving farms from hardware Layer 7 URL hash balancing (expensive, has limits) to L4 DSR balancing with CARP (cheap and simple), and figuring out how to juggle 18,000 requests/second while we do it. Built yet more automated query analysis reporting (with some yummy MySQLProxy). Added yet another aggregated graph of...