After seeing Jesse’s great post on Radar (never knew about FreeConferenceCall, very cool!) about quick and easy communication during webops events, I thought I might put a post together on some of what we’re using at Flickr to keep track of things ops-related.
Production Changes/Immediate Issues
We keep our configuration management wrapped up in version control, so changes there are easy to track. But sometimes we make changes to production machines that aren’t configuration changes, and those still need to be communicated to everyone on the team.
We have an IM bot that everyone on the team has as a contact. When we make a change, we simply send the bot an IM, which gets parroted to everyone on the team as it happens. Each message is also logged to a text file with the IM name and a timestamp, and we serve that log as a webpage, wrapped up in some nice YUI bits for easy sorting. If you’re offline, you’ll get the bot’s messages when you log on again.
Examples would be taking boxes in and out of a load-balanced pool, restarting Apache, Squid, or MySQL, or even one-off A/B testing of temporary (kernel or application) parameters that may or may not raise a red flag.
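To make the flow concrete, here is a rough Python sketch of the relay-and-log behaviour described above. It is not our actual bot: the class name `OpsBot`, its methods, the tab-separated log format, and the choice not to echo a message back to its sender are all assumptions for illustration, and the IM transport is replaced with in-memory lists.

```python
from datetime import datetime, timezone


class OpsBot:
    """Hypothetical relay: logs each message and fans it out to the team."""

    def __init__(self, team, log_file):
        self.team = list(team)
        self.online = set()
        # Messages held for members who are currently offline.
        self.pending = {name: [] for name in self.team}
        # Stand-in for real IM sends, so the sketch runs without a network.
        self.delivered = {name: [] for name in self.team}
        self.log_file = log_file

    def receive(self, sender, text, now=None):
        """Log one change notice and relay it to everyone else on the team."""
        now = now or datetime.now(timezone.utc)
        # Assumed log format: timestamp, IM name, message, tab-separated.
        line = f"{now.isoformat(timespec='seconds')}\t{sender}\t{text}"
        self.log_file.write(line + "\n")
        for name in self.team:
            if name == sender:
                continue  # assumption: the sender does not need an echo
            if name in self.online:
                self.delivered[name].append(line)
            else:
                self.pending[name].append(line)
        return line

    def log_on(self, name):
        """Mark a member online and flush whatever they missed."""
        self.online.add(name)
        self.delivered[name].extend(self.pending[name])
        self.pending[name].clear()

    def log_off(self, name):
        self.online.discard(name)
```

Because the log is one line per message with fixed fields, turning it into a sortable table on a webpage is just a matter of splitting each line on tabs.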
Because we like to keep logs, and because some of the team works remotely, we keep a running commentary on an internal IRC server, which we use for basic day-to-day work. Running the client under screen means that we have weeks, months, and even years of ops conversations about what we’ve done in the scrollback and in the IRC logs.
Of course, Cal and the other devs have a system in place that logs every new code deploy with a username, a timestamp, and a diff against version control showing what changed in that deploy, which makes it easy to correlate system and application metrics with changes made to the codebase. (Thank you!)
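As an illustration of that kind of correlation, here is a small Python sketch that, given a deploy log and the time a metric went sideways, lists the deploys that landed shortly beforehand. This is not the devs’ actual system: the `Deploy` record, its field names, the `deploys_before` helper, and the 30-minute default window are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Deploy(NamedTuple):
    when: datetime   # timestamp of the deploy
    user: str        # who pushed it
    diff_ref: str    # hypothetical pointer to the version-control diff


def deploys_before(deploys, event_time, window=timedelta(minutes=30)):
    """Return deploys that landed within `window` before event_time, newest first."""
    start = event_time - window
    hits = [d for d in deploys if start <= d.when <= event_time]
    return sorted(hits, key=lambda d: d.when, reverse=True)
```

Calling `deploys_before(log, spike_time)` with the time of a spike on a graph gives a short list of usernames and diffs to look at first.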