Dev/Ops is what some people are calling the renewed cross-interest in development and operations collaboration. Hammond and I spoke about it, and there was even a conference in Europe dedicated to it. While I do think there’s still a lot more to discuss around this idea of cooperation and mixing of approaches, this is a Very Good Thing™.
As one of what Andrew has called ‘boundary objects’, deployment of new code has been a rallying point for the devops crowd, and I think that’s great. Deployment is definitely one of the places where the rubber meets the road. In some organizations, deployment of new code can be one of the most stressful and divisive parts of their work. People get fired or quit because of the emotional baggage that can come with an event that, in the worst case, is nothing more than a planned outage disguised as progress, followed by a finger-pointing session. Some groups have such dysfunction that they might as well just not even deploy the code. Just skip that part, head into a conference room, and fight bareknuckle. Toxic would be the nice way of describing those environments.
So it’s no wonder that a lot of the emphasis in this growing “devops” community is on deployment. Whether it’s providing confidence in changes with rigorous testing, deploying small changes often, dark launching, feature flags, or building a one-button deploy system – any effort to reduce the risk of change should be considered mandatory, IMHO.
But at the same time, deployment is only a part of what really makes a great environment for development and operations to collaborate. Really. It’s not just about developers collaborating on deployment and releases. It’s about both teams understanding each other’s responsibilities after code is deployed to production, and collaborating along the areas of their expertise in a way that’s constructive.
Good Operations teams already write code, just not usually user-facing code. They spend a good deal of their time writing code to gather information from the infrastructure and act on it with short, medium, or long-term goals, usually aimed at performance and availability.
I’ll say that things like:
- metrics collection
- monitoring and associated thresholds
- load-feedback behavior
- fault tolerance
should also be considered boundary objects between development and ops.
This is some of what I mean by that:
Metrics collection
I’ve said this before, but context is absolutely everything. Application-level or feature-level metrics are what give the missing context to in-the-box resource usage like CPU, disk, memory, or network. At Flickr, the ops group maintains a number of different platforms for gathering metrics, like ganglia. To make it easy to add metrics, some of our backend applications will just write a temp file with key value pairs that we want to have squirted into ganglia.
Ganglia’s gmetric cron job will pick that file up every minute, with the key as the metric name, and the value as, well, the value.
This means that all developers have to do is drop that file into an expected location and it will do the right thing. No tickets for making a new metric, no need for writing yet another script to gather a single metric, no need to understand the intricacies of whatever metrics collection system you have.
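To make the drop-file idea concrete, here is a minimal Python sketch of both sides. It is not Flickr’s actual code: the one-pair-per-line, whitespace-separated format and the function names are assumptions made for illustration. The only real tool referenced is ganglia’s `gmetric` command, and the sketch only builds its argument list rather than running it.

```python
import os
import tempfile


def write_metrics(path, metrics):
    """Developer side: atomically write one 'key value' pair per line.

    The file format is an assumption for this sketch.
    """
    directory = os.path.dirname(path) or "."
    # Write to a temp file in the same directory, then rename, so the
    # collector never reads a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        for key, value in metrics.items():
            handle.write(f"{key} {value}\n")
    os.replace(tmp_path, path)


def read_metrics(path):
    """Collector side: parse the drop file into {metric name: float}."""
    metrics = {}
    with open(path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            key, raw_value = parts
            try:
                metrics[key] = float(raw_value)
            except ValueError:
                continue  # skip values that are not numeric
    return metrics


def gmetric_command(name, value):
    """Build (but do not run) a gmetric invocation for one metric."""
    return ["gmetric", "--name", name, "--value", str(value), "--type", "float"]
```

A cron job could call `read_metrics` once a minute and hand each `gmetric_command(...)` result to `subprocess.run`. The atomic rename is a design choice of this sketch, there to keep the collector from seeing partial writes.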
That’s an example of technical collaboration between the two groups. The missing piece is the cultural bit: the developer communicating their motivation behind getting these in-app metrics gathered and put on a graph. This gives the metric context, and might give ops some ideas on how they could use the metric for monitoring, capacity, or other purposes.
Monitoring and associated thresholds
Involving development in designing your monitoring system can help provide a great perspective on failure modes. Peer code reviews are common in software development, so why shouldn’t monitors be reviewed? It’s still code, and it’s going to provide your humans (and maybe machines) with the data needed for the system to fail gracefully, heal itself, or inform developers about their constraints when building new things. Your monitoring system is just like your code in that it should always be evolving, alongside your growth.
Remember all the raves about Google Analytics adding “intelligence” and alerts? Having some notion of thresholds isn’t just for people answering pages from nagios, it’s for everyone. How else can you gauge your expectations and guide future modifications to your code with respect to resource usage?
Load feedback behavior
Like a lot of smart web infrastructures, we’ve built an offline tasks system, which will asynchronously run jobs on our data that don’t have to be real-time. If you haven’t read Myles’ post on it, you really should. It’s a huge part of our strategy to avoid pretty common scalability pitfalls.
Anyway, these tasks, which can be relatively hard on the databases (which is one of the reasons why we do them asynchronously in the first place), have some built-in feedback mechanisms: they’ll check if there’s an unreasonably high number of concurrent MySQL connections, or if the database shard master-master pair doesn’t have both servers in production, or otherwise detect that what they’re trying to do on the database is too harsh at the moment. Whether it’s because current live traffic is high or because of a loss of redundancy, the offline task system will stop what it’s doing and re-queue it for later. This is a great (and safe) way of schmearing out heavy loads over a longer time period, reducing their risk.
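A rough sketch of that feedback check, in Python. Everything here is hypothetical: the `ShardStatus` fields, the connection threshold of 200, and the function names are invented for illustration, and a real system would fill the status in from the database itself.

```python
from collections import deque
from dataclasses import dataclass

# Assumed threshold for "unreasonably high"; tune for your own shards.
MAX_CONNECTIONS = 200


@dataclass
class ShardStatus:
    concurrent_connections: int
    masters_in_production: int  # 0, 1, or 2 for a master-master pair


def shard_can_take_work(status, max_connections=MAX_CONNECTIONS):
    """Return True only if the shard looks healthy enough for offline work."""
    if status.concurrent_connections > max_connections:
        return False  # live traffic is already heavy
    if status.masters_in_production < 2:
        return False  # redundancy is lost, so don't add load
    return True


def run_or_requeue(task, status, queue):
    """Run the task now, or put it back on the queue for later.

    Returns True if the task ran, False if it was re-queued.
    """
    if not shard_can_take_work(status):
        queue.append(task)
        return False
    task()
    return True
```

The point of the shape is that backing off is the default whenever a check fails, so a loss of redundancy or a traffic spike automatically sheds the optional work first.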
Throw in some metrics collection about the size of those queues, and monitor alerts to do something for low or high-water mark thresholds, and then you’re cookin’ with gas.
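The water-mark idea can be sketched in a few lines of Python. The function name, the labels it returns, and the numbers in the usage note are all made up for this example; what a monitor actually does on "low" or "high" is up to you.

```python
def classify_queue_depth(depth, low_water, high_water):
    """Classify a queue depth against low and high-water marks.

    Returns "high" at or above the high-water mark, "low" at or below
    the low-water mark, and "ok" in between.
    """
    if low_water >= high_water:
        raise ValueError("low_water must be below high_water")
    if depth >= high_water:
        return "high"
    if depth <= low_water:
        return "low"
    return "ok"
```

With hypothetical marks of 10 and 10,000, a depth of 25,000 would classify as "high" (tasks are backing up, maybe page someone), while a depth of 0 would classify as "low" (maybe the thing feeding the queue has stopped).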
Through the magic of apache notes, developers can send extremely useful bits from within php code to the access and error logs. At Flickr, we’ve got some pretty simple notes set to help track things down when there are issues. For example, when I load the page for my photostream, the log line looks something like:
www394 123.456.789.012 5555 173663 [14/Dec/2009:04:08:21 +0000] "GET /photos/allspaw HTTP/1.1" - 200 18233 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3" - -
where 5555 is my user id. Since php knows you’re logged in when you view a certain page, there’s no reason why we shouldn’t just log that in the request, so if there are any user-specific issues, it’s not a needle in a haystack.
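Once the user id is in the log line, finding one user’s requests is a trivial filter. Here’s a small Python sketch that assumes the field order of the sample line above (host, client IP, then user id, separated by whitespace); the function names are made up, and a different log format would need different field positions.

```python
def user_id_of(line):
    """Return the third whitespace-separated field, assumed to be the user id."""
    fields = line.split()
    if len(fields) < 3:
        return None  # not a line in the expected format
    return fields[2]


def lines_for_user(lines, user_id):
    """Keep only the log lines whose user id field matches user_id."""
    wanted = str(user_id)
    return [line for line in lines if user_id_of(line) == wanted]
```

Calling `lines_for_user(open_log_file, 5555)` on an iterable of log lines would return just the requests made by that user, which is the whole point of putting the id in the line in the first place.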
Another example is API requests. We’ll log the api key making the call along with the authenticated user id, even in POST requests. Being able to trace a bullet through the entire request and response via logs is obviously handy. Putting user ids, API methods, and API key specific info into log lines is hugely helpful when troubleshooting issues, especially if you’re running one of the most popular APIs on the web.
Fault tolerance
Ross blogged about how we do feature flipping last week. He goes over how important (and awesome) this is to our development process, but another one of the advantages of this approach is how it affects operations.
This is an example of development taking an active role not only in deployment, but also in putting in the time and effort to operationalize features and pieces of code, so that in cases of degradation or failure, these individual pieces can be forced to fail gracefully. Our talk at Velocity last year went over some of this, but it’s still one of the reasons why we can push code thousands of times a year and still have an extremely low MTTR whenever there’s an issue.
New code causing degradation? There’s an app for that! (it’s called a feature flag)
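For anyone who hasn’t seen one, a feature flag can be as small as this Python sketch. It is a generic illustration, not the mechanism from Ross’s post: the flag name, the fallback message, and the `search_backend` callable are all hypothetical.

```python
# Hypothetical message shown when the feature is switched off.
FALLBACK_MESSAGE = "Search is temporarily unavailable."


def is_enabled(flags, name):
    """Unknown flags count as off, so a missing entry fails safe."""
    return bool(flags.get(name, False))


def render_search(flags, query, search_backend):
    """Serve search results, or a graceful fallback if the flag is off."""
    if not is_enabled(flags, "photo_search"):
        return FALLBACK_MESSAGE
    return search_backend(query)
```

If the search backend starts hurting the site, flipping `photo_search` to `False` in the flags dict turns the feature off without a code deploy, and the rest of the page keeps working.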
Anyway, my point is that deployment is only a small part of how development and operations should collaborate and communicate. In fact, dev and ops is only the most obvious starting point for getting along and working together on problems.
Product and community management also have important boundary objects with operations, but that’s for another blog post.