So chapters 1-4 are now up on Safari RoughCuts, which means that if you don’t mind shelling out the dough, you can take a look at what I’ve been getting up early for every day for the past few months. The working title is “The Art of Capacity Planning” and it’s meant to be a no-nonsense description of the capacity planning process and considerations for web operations.

I still have two chapters to go before it’s all finished, but if you’re nice enough to take a look at what I’ve got thus far, I’d appreciate any feedback. I’m sure there could be typos and some graphs misaligned, but such is life with “drafts”. :)

{ 0 comments }

Thanks to Mark, squid’s got a patch I’ve been wanting for a gazillion years: time-to-serve statistics that don’t include the client’s location.

http://www.squid-cache.org/bugs/show_bug.cgi?id=2345

Normally, squid’s kept statistics that included the “time” to serve an object, whether it be a HIT, MISS, NEAR HIT, etc. The clock for this time starts when squid receives the first headers from the client and validates them as a legit squid request, but it doesn’t stop until the client has every last bit of the response.

What this means is that if you have servers in the US, your traffic follows the NY/SF pattern (peaks from around 9am-4pm), and your overseas traffic (i.e. clients really far from your boxes) follows the inverse of that pattern, then you might see ‘time-to-serve’ in squid get worse during your lowest traffic. Which is confusing, to say the least. :)

This patch changes the stopwatch to start at the same time (when squid’s received headers from the client) but stop when squid’s preparing the headers for the response. This measures ONLY the time that squid had the object in its hands, for a hit or a miss, which IMHO is a much better measure of how squid is actually performing with the hardware’s resources.
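To make the difference concrete, here’s a minimal Python sketch of the two stopwatches. This is not squid’s code: the `RequestTimeline` class, its field names, and the timings are all made up for illustration, assuming the same hit served to a nearby client and to a far-away one.

```python
from dataclasses import dataclass


@dataclass
class RequestTimeline:
    """Timestamps in seconds for one request (hypothetical field names)."""
    headers_received: float     # squid has the client's request headers
    reply_headers_ready: float  # squid is preparing the response headers
    last_byte_delivered: float  # client has every last bit of the response


def old_time_to_serve(t: RequestTimeline) -> float:
    # Old behaviour: the clock runs until the client has the whole response,
    # so the client's distance from the box is baked into the number.
    return t.last_byte_delivered - t.headers_received


def patched_time_to_serve(t: RequestTimeline) -> float:
    # Patched behaviour: the clock stops when squid prepares the reply
    # headers, so only the time squid held the object is counted.
    return t.reply_headers_ready - t.headers_received


# Same object, same work done by squid, two very different clients.
nearby = RequestTimeline(0.0, 0.002, 0.010)
far_away = RequestTimeline(0.0, 0.002, 0.450)

for name, t in (("nearby", nearby), ("far away", far_away)):
    print(name, old_time_to_serve(t), patched_time_to_serve(t))
```

With the old measurement the far-away client makes squid look slower, even though squid did exactly the same work for both requests. The patched number is identical for the two.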

Yay! Thanks Mark.

{ 0 comments }

Slides from Web 2.0 Expo 2008

by allspaw on April 29, 2008

Here they are.

{ 4 comments }

A while back, I said I’d love to have a tool that would let me peek inside the filesystem cache and tell me what files (or pages of files) are in there. Well, Peter Zaitsev points to the fincore tool, which comes pretty damn close: you give it a file, and it will tell you which pages of that file are in core memory.

Rock. Thanks, David Plonka.

{ 2 comments }

WebOps Communication Tools

by allspaw on March 10, 2008

After seeing Jesse’s great post on Radar (never knew about FreeConferenceCall, very cool!) about the quick and easy webops event communications, I thought I might put a post together on some of what we’re using at Flickr to keep track of things ops-related.

Production Changes/Immediate Issues

We have our configuration management schemes wrapped up in version control, so we can track changes there easily. But sometimes we make changes to production machines that aren’t configuration changes, and those still need to be communicated to everyone on the team.

We have an IM bot that everyone has as a contact. When we make any changes, we simply send the bot an IM, which gets parroted to everyone on the team as it happens. It’s also logged in a text file with the IM name and timestamp, which we serve as a webpage wrapped up in the nice YUI bits for easy sorting. If you’re offline, you’ll get the bot’s messages when you log on again.
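For the curious, the parrot-and-log part of a bot like that can be sketched in a few lines of Python. This is only an illustration, not our actual bot: the function names, the tab-separated log format, and the hostname in the example are assumptions, and the IM plumbing itself is left out.

```python
from datetime import datetime, timezone


def format_entry(sender, message, when):
    """One log line: timestamp, IM name, message (tab-separated)."""
    # Flatten newlines so every change stays on a single, sortable line.
    flat = " ".join(message.splitlines())
    return f"{when.isoformat()}\t{sender}\t{flat}"


def record_change(sender, message, team, log_path, when=None):
    """Append the change to the log and return what to parrot to whom."""
    when = when or datetime.now(timezone.utc)
    line = format_entry(sender, message, when)
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(line + "\n")
    # A real bot would hand these to its IM client; here we just return them.
    return {member: line for member in team}
```

Something like `record_change("allspaw", "pulling www12 out of the pool", team, "ops.log")` would then append one line to the log and return the same line for each team member. Serving the log file as a sortable webpage is a separate step.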

Examples of this would be taking boxes in and out of a load-balanced pool, restarting apache or squid or MySQL, or even one-off A/B testing of any temporary (kernel or application) parameters that may or may not raise a red flag.

Ongoing Work

Because we like to keep logs, and because we have some guys on the team working remotely, we keep a running commentary on an internal IRC server. We use this for basic day-to-day work. Running it under screen means that we have weeks, months, and even years of ops conversations about what we’ve done in the scrollback, and in the IRC logs.

Code Deployment

Of course, Cal and the other devs have a system in place that logs all new code deploys with a username, a timestamp, and a diff to the version control bits that shows what changed in that deploy, which makes it easy to correlate system and application metrics with changes made to the codebase. (thank you)

{ 5 comments }

Dear users of S3, EC2, and other ‘utility’ computing stuffs:

Here’s a crude and completely oversimplified evolution of infrastructure needs of a growing website, with an assumption:

Evolution of web infrastructure

Have you ‘outgrown’ your original use of utility computing, for whatever reason? If so, what was the reason? Financial? Technical?

Why I’m asking:

I’m in the process of writing a book on the topic of capacity planning for web architectures, so I’m interested in what you’ve got to say.

{ 13 comments }

Datacenter Operating Systems

by allspaw on February 20, 2008

I’m probably late in getting to this, but seeing the article in the WSJ about the RAD project made me stop to take a look. It appears to be a collection of different projects, all relating to infrastructure deployment/management and various research topics surrounding it. Looks cool so far.

{ 0 comments }

Loving Dashboard Spy.

by allspaw on February 17, 2008

I’m probably very late to this party, but I just discovered Dashboard Spy. Given the amount of “data porn” that folks in webops look at on a daily basis, this sort of stuff is pretty damn interesting.

I’m especially loving the current trend of developing ‘business’ dashboards, since they can fit in quite nicely with infrastructure statistics. Quite often when I need to make capacity justifications, I pull forecasts from both the higher-level metrics (e.g. photos uploaded) and the lower-level metrics (e.g. disk space consumed by photos) and have to marry those two bits together.
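As a rough illustration of marrying those two bits together, here’s a small Python sketch. It fits a straight line to cumulative photo counts, works out disk consumed per photo, and turns that into days until the disk is full. The numbers are made up (not real Flickr data), the function names are hypothetical, and a straight-line fit is just the simplest assumption to start from.

```python
def linear_fit(ys):
    """Least-squares slope and intercept for ys observed on days 0..n-1."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(total_photos, disk_used_gb, disk_capacity_gb):
    """Combine a business metric and a system metric into one forecast."""
    # Higher-level metric: how fast are photos arriving?
    photos_per_day, _ = linear_fit(total_photos)
    # Marry the two: how much disk does each photo cost us?
    gb_per_photo = (disk_used_gb[-1] - disk_used_gb[0]) / (
        total_photos[-1] - total_photos[0]
    )
    # Lower-level forecast: disk consumed per day, then days of headroom.
    gb_per_day = photos_per_day * gb_per_photo
    return (disk_capacity_gb - disk_used_gb[-1]) / gb_per_day


# Made-up daily samples: cumulative photos and cumulative disk used (GB).
photos = [1000, 2000, 3000, 4000, 5000]
disk_gb = [2.0, 4.0, 6.0, 8.0, 10.0]
print(days_until_full(photos, disk_gb, disk_capacity_gb=20.0))
```

The useful part is that the forecast is driven by the business number: if the upload rate is expected to change, you can swap in a different `photos_per_day` and see what it does to the disk headroom.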

In fact, I love that stuff so much that I’m writing a book about it. :)

{ 4 comments }

Flickr’s hiring a DBA.

by allspaw on January 30, 2008

(Only hardworking supernerds should apply)

We’re looking for an experienced and motivated MySQL DBA to help make things go at Flickr.

Stuff you’ll do:
• Work with engineers on performance tuning, query optimization, index tuning.
• Monitor databases for problems and diagnose where those problems are.
• Work with developers and operations to maintain a scalable, reliable, and robust database environment.
• Build database tools and scripts to automate where possible.
• Support MySQL databases for production and development.
• Provide 24×7 escalated on-call support on a pager rotation.

Smarts and experience you’ll need:
• 3-4+ years MySQL experience.
• 2+ years of experience as a MySQL DBA in a high traffic, transactional environment.
• 2+ years working in a LAMP environment, particularly PHP/MySQL.
• Proficient with database performance strategies.
• Proficient tuning MySQL processes and queries.
• Experience in administration of InnoDB.
• Experience with MySQL Replication, with both Master-Slave and Master-Master replication.
• Ability to work cooperatively with software engineers and system administrators.
• Excellent communication skills.
• Exceptional problem-solving expertise and attention to detail.
• BS in Computer Science or equivalent.

Super Nerdy Bonus Points For:
• Experience with Data Sharding and federated architectures.
• Experience with multi-datacenter MySQL replication.
• Experience working in a social media environment.

OK? Now, send me your resume!

{ 4 comments }

Speaking at Web 2.0 Expo 2008

by allspaw on January 3, 2008

I’m gonna give a talk on capacity planning for web operations at the Web 2.0 Expo in April. Wondering if I should submit the same sort of talk for the Velocity conference in June. Don’t want to be redundant or anything.

Web 2.0 Expo San Francisco 2008

{ 4 comments }