Some WebOps Interview Questions

It can be difficult to evaluate web ops candidates, for a couple of different reasons. One is that the breadth of knowledge the field demands is pretty wide, so digging too deep into any single technical area can eat up the whole interview. Another is that it can be difficult to gauge how collaborative someone’s demeanor is in an interview. Collaboration is a requirement at Etsy. :)

So in addition to the standard technical questions, I like to ask high-level questions where the answers can zoom in and out of a larger picture within the operations context.

  • Diagram the current architecture you’re responsible for, and point out where it’s not scalable or fault-tolerant.
  • What are some examples of how you might scale a read-heavy application? Why?
  • What are some examples of how you might scale a write-heavy application? Why?
  • Tell me how code gets deployed in your current gig, from developer’s brain to production.
  • Tell the story of the best-run outage you’ve been a part of, in as much detail as you can. What made it “good”?
  • Tell the story of the worst-run outage you’ve been a part of, in as much detail as you can. What made it “bad”?
  • What is the purpose of a post-mortem meeting?
  • How do you handle (and feel about) making changes (code/schema/network/etc.) in your current environment?

These are purposefully open-ended questions meant to dig into what’s important to you as someone responsible for the performance and availability of a growing website. This is just a snippet of what we normally ask, in addition to my (and Jesse’s) favorite interview question.

So: maybe you should take a look at the type of ops engineers we’re looking for, and apply? :)

We’re hiring ops folks at Etsy!

We’re hiring web ops engineers at Etsy. Here’s the gist of it…

Responsibilities

  • Building and maintaining Etsy’s infrastructure, from installed iron to production
  • Taking part in a 24×7 on-call rotation
  • Tightly cooperating and collaborating with development, product, community and customer care

Requirements

  • Experience with configuration management systems and concepts (Chef, Puppet, Cfengine, etc.)
  • Experience in systems programming and general scripting tasks (perl, bash, php, python, etc.)
  • Experience with high-volume web applications with social components
  • Experience with multi-datacenter architectures, global fault tolerance, and CDNs
  • Experience with fault-tolerant replication strategies
  • Experience with mission-critical search and real-time database architectures (Solr, Lucene, MySQL, MongoDB, Postgres, etc.)
  • Experience working with customizing network management systems and monitoring tools (Nagios, Ganglia, Graphite, Cacti, etc.)
  • Strong understanding of web application architecture, including TCP/IP and HTTP, and caching strategies at all layers
  • Experience supporting software engineers, their development environments, and their code repository (Subversion), including code deployment to production
  • Enterprise experience with internal core systems, such as but not limited to DNS, LDAP, NTP
  • Experience with data center management, including strong knowledge of power, space, and cooling issues
  • Experience with credit card gateways and PCI compliance issues
  • Excellent communication skills, both written and verbal

Bonus

  • Experience in a “continuous deployment” environment
  • Experience in social networking or community-generated content
  • Experience with managing the infrastructure for a growing open API
  • Database query optimization
  • Hands-on network security tasks, including VPNs/firewalls configuration
  • Network experience with BGP, EIGRP, OSPF, VLAN, PVLAN, Spanning-Tree, MSTI
  • Knowledge of programming languages such as Python, PHP, Java, Ruby

This is a great place to work. We work on real problems, and there’s plenty of juicy technology to sink your teeth into.

Pigz – parallel gzip OMG

Pigz is basically parallel gzip, to take advantage of multiple cores.  When you’ve got massive files, this can be a pretty big advantage, especially when you’ve got lots of cores sitting around.

Taking a 418M squid access log file, on a dual-quad Nehalem L5520 with HyperThreading turned on:

[jallspaw@server01 ~]$ ls -lh daemon.log.2; time gzip ./daemon.log.2 ; ls -lh ./daemon.log.2.gz
-rw-r----- 1 jallspaw jallspaw 418M Apr  2 19:18 daemon.log.2

real    0m12.398s
user    0m12.107s
sys     0m0.288s
-rw-r----- 1 jallspaw jallspaw 45M Apr  2 19:18 ./daemon.log.2.gz

…now gunzipping it:

[jallspaw@server01 ~]$ ls -lh daemon.log.2.gz; time gunzip ./daemon.log.2.gz ; ls -lh ./daemon.log.2
-rw-r----- 1 jallspaw jallspaw 45M Apr  2 19:18 daemon.log.2.gz

real    0m3.245s
user    0m2.693s
sys     0m0.552s
-rw-r----- 1 jallspaw jallspaw 418M Apr  2 19:18 ./daemon.log.2

htop looks like this when this is happening:

1 CPU core, 418mb file gzipped in 12.3 sec

(Note the 15 freeloading/lazy cores sitting around watching their friend, core #10, sweating)

…now pigz’ing it:

[jallspaw@server01 ~]$ ls -lh daemon.log.2; time ./pigz-2.1.6/pigz ./daemon.log.2 ; ls -lh ./daemon.log.2.gz
-rw-r----- 1 jallspaw jallspaw 418M Apr  2 19:18 daemon.log.2

real    0m1.569s
user    0m23.092s
sys     0m0.422s
-rw-r----- 1 jallspaw jallspaw 45M Apr  2 19:18 ./daemon.log.2.gz

…now unpigz’ing it:

[jallspaw@server01 ~]$ ls -lh daemon.log.2.gz; time ./pigz-2.1.6/unpigz ./daemon.log.2.gz ; ls -lh ./daemon.log.2
-rw-r----- 1 jallspaw jallspaw 45M Apr  2 19:18 daemon.log.2.gz

real    0m1.456s
user    0m1.861s
sys     0m0.867s
-rw-r----- 1 jallspaw jallspaw 418M Apr  2 19:18 ./daemon.log.2

and htop looks like this when it’s happening:

16 CPU cores, 418mb pigz'd in 1.5sec

Which do you like better?
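If you want to try it, pigz is a drop-in replacement for gzip: same flags, same file format, so stock gunzip can read its output (and unpigz can read gzip's). A quick sketch, falling back to plain gzip in case pigz isn't on your box:

```shell
# use pigz if it's on the PATH, otherwise plain gzip (same format, one core)
GZ=$(command -v pigz || command -v gzip)

# make a throwaway file and compress it in place
seq 1 100000 > /tmp/biglog
"$GZ" -f -9 /tmp/biglog        # produces /tmp/biglog.gz, removes the original

# the output is ordinary gzip, so stock gunzip reads it back fine
gunzip -c /tmp/biglog.gz | wc -l    # 100000 lines back out
```

For big archives, `tar cf - somedir | pigz > somedir.tar.gz` is where the extra cores really pay off.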

Agile Executive Podcast

Yesterday I was on a podcast with Andrew Shafer and Michael Coté, and we talked about development and operations cooperation. I rambled a bit, like I tend to do.

Andrew brought up something disturbing that I’ve seen elsewhere too: after seeing our presentation last year at Velocity, some folks decided that we had somehow endorsed pushing your code whenever you want and letting the ‘ops guys’ deal with whatever comes as a result. That isn’t at all what we suggested, and it runs pretty much against the ideas of cooperation and communication between dev and ops teams. I talk a bit about this in the podcast.

You have to prove that pushing whenever you want is an OK (safe, secure, etc.) thing to do. The minute you can’t prove it and you decide to continue that way anyway… IMHO, you’re doing it wrong. :)

Need some FUDforum consulting done

I’ve been helping out a friend for some years with running a decent-size discussion forum. It’s running on a little dedicated server (512MB of RAM) and it’s outgrown the box it’s on. It needs to move to a new machine, which is all ready to take it.

Problem is, it’s in a twisty maze of dependencies. It’s running FUDforum 2.6.4RC1, on MySQL 3.23, on Red Hat 9 (!). It needs to somehow get backed up, moved, and upgraded to the latest FUDforum (3.0.0) and MySQL 5 on the new machine.

It’s not 100% straightforward, needs someone who’s done this before, and someone who isn’t me, because of the new job and all.

If you know someone who can help out, please email me where my email address is jallspaw which is located on a server whose domain name is yahoo.com.

Thanks!

UPDATE: I found a guy, and he’s great with FUDforum. Excellent! Thanks to all who emailed!

Deployment is just a part of dev/ops cooperation, not the whole thing

Dev/Ops is what some people are calling the renewed cross-interest in development and operations collaboration. Hammond and I spoke about it, and there was even a conference in Europe dedicated to it. While I do think there’s still a lot more to be discussed around this idea of cooperation and mixing of approaches, this is a Very Good Thing™.

In what Andrew has called ‘boundary objects’, deployment of new code has been a rallying point for the devops crowd, and I think that’s great. Deployment is definitely one of the places where the rubber meets the road. In some organizations, deploying new code can be the single most stressful and divisive part of the work. People get fired or quit because of the emotional baggage that comes with an event that, in the worst case, is nothing more than a planned outage disguised as progress, followed by a finger-pointing session. Some groups have such dysfunction that they might as well just not even deploy the code. Just skip that part, head into a conference room, and fight bareknuckle. Toxic would be the nice way of describing those environments.

So it’s no wonder that a lot of the emphasis in this growing “devops” community is on deployment. Whether it’s providing confidence in changes with rigorous testing, deploying small changes often, dark launching, feature flags, or building a one-button deploy system – any effort to reduce the risk of change should be considered mandatory, IMHO.

But at the same time, deployment is only part of what really makes a great environment for development and operations to collaborate. Really. It’s not just about collaborating on deployment and releases. It’s about both teams understanding each other’s responsibilities after code is deployed to production, and collaborating along the lines of their expertise in a way that’s constructive.

Good Operations teams already write code, just not usually user-facing code. They spend a good deal of their time writing code to gather information from the infrastructure and act on it with short, medium, or long-term goals, usually aimed at performance and availability.

I’ll say that things like:

  • metrics collection
  • monitoring and associated thresholds
  • load-feedback behavior
  • instrumentation
  • fault tolerance

should also be considered boundary objects between development and ops.

This is some of what I mean by that:

Metrics collection

I’ve said this before, but context is absolutely everything. Application-level or feature-level metrics are what give the missing context to in-the-box resource usage like CPU, disk, memory, or network. At Flickr, the ops group maintains a number of different platforms for gathering metrics, like ganglia. To make it easy to add metrics, some of our backend applications will just write a temp file with the key-value pairs we want squirted into ganglia. Like:

image_processed=30

image_processing_time=5

and ganglia’s gmetric cron job will pick that up every minute, with the key as the metric name and the value as, well, the value.

This means that all developers have to do is drop that file into an expected location and it will do the right thing. No tickets for making a new metric, no need for writing yet another script to gather a single metric, no need to understand the intricacies of whatever metrics collection system you have.
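A minimal sketch of that handshake (the file path and gmetric flags here are illustrative, not Flickr's actual setup; the gmetric call is echoed rather than executed so the sketch runs without ganglia installed):

```shell
# app side: drop key=value pairs into an agreed-upon location
cat > /tmp/app_metrics <<'EOF'
image_processed=30
image_processing_time=5
EOF

# cron side, once a minute: turn each pair into a gmetric call
# (echo dropped for real use, so the command actually runs)
while IFS='=' read -r name value; do
    echo gmetric --name "$name" --value "$value" --type uint32
done < /tmp/app_metrics
```

The contract is just "one key=value per line in a known place," which is why adding a metric costs a developer one line of code and zero tickets.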

That’s an example of technical collaboration between the two groups. The missing piece is the cultural bit: the developer communicating the motivation behind gathering these in-app metrics and putting them on a graph. That gives the metric context, and might give ops some ideas on how to use it for monitoring, capacity planning, or other purposes.

Monitoring

Involving development in designing your monitoring system can provide a great perspective on failure modes. Peer code reviews are common in software development, so why shouldn’t monitors be reviewed? A monitor is still code, and it’s going to provide your humans (and maybe machines) with the data needed to fail gracefully, self-heal, or tell developers what constraints they face when building new things. Your monitoring system is just like your code in that it should always be evolving, alongside your growth.

Remember all the raves about Google Analytics adding “intelligence” and alerts? Having some notion of thresholds isn’t just for people answering pages from nagios, it’s for everyone. How else can you gauge your expectations and guide future modifications to your code with respect to resource usage?
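Since monitors are code, they can be sketched and reviewed like code. Here's a toy Nagios-style check to make that concrete (the metric and thresholds are invented): the plugin contract is simply "print one status line, exit 0/1/2 for OK/WARNING/CRITICAL."

```shell
# toy Nagios-style plugin: the exit code is the alert state
# 0 = OK, 1 = WARNING, 2 = CRITICAL  (thresholds here are made up)
check_queue_depth() {
    depth=$1
    if [ "$depth" -gt 5000 ]; then
        echo "CRITICAL - task queue depth is $depth"
        return 2
    elif [ "$depth" -gt 1000 ]; then
        echo "WARNING - task queue depth is $depth"
        return 1
    fi
    echo "OK - task queue depth is $depth"
    return 0
}

check_queue_depth 250    # prints "OK - task queue depth is 250"
```

The thresholds are exactly the kind of thing a developer should be in the room for: they know what "too deep" means for their queue better than anyone.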

Load feedback behavior

Like a lot of smart web infrastructures, we’ve built an offline tasks system, which asynchronously runs jobs on our data that don’t have to be real-time. If you haven’t read Myles’ post on it, you really should. It’s a huge part of our strategy for avoiding pretty common scalability pitfalls.

Anyway, these tasks, which can be relatively hard on the databases (one of the reasons we run them asynchronously in the first place), have some built-in feedback mechanisms: they’ll check whether there’s an unreasonably high number of concurrent MySQL connections, or whether a database shard’s master-master pair doesn’t have both servers in production, or can otherwise detect that what they’re trying to do to the database is too harsh at the moment. Whether it’s because live traffic is high or because of a loss of redundancy, the offline task system will stop what it’s doing and re-queue the work for later. This is a great (and safe) way of schmearing heavy loads out over a longer time period, reducing their risk.

Throw in some metrics collection on the size of those queues, plus monitor alerts that do something at low- or high-water-mark thresholds, and you’re cookin’ with gas.
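In spirit, the feedback check is tiny: look at a load signal before doing the work, and requeue instead of run when it's too high. A sketch (the threshold, task names, and the MySQL query in the comment are all invented, and the connection count is stubbed with a variable so it runs anywhere):

```shell
MAX_CONNECTIONS=400

current_connections() {
    # in production this would ask the shard something like:
    #   mysql -NBe "SHOW GLOBAL STATUS LIKE 'Threads_connected'" | awk '{print $2}'
    # stubbed here so the sketch runs without a database
    echo "${FAKE_CONNECTIONS:-50}"
}

run_or_requeue() {
    task=$1
    if [ "$(current_connections)" -gt "$MAX_CONNECTIONS" ]; then
        echo "requeueing $task for later"   # the DB is busy: back off
    else
        echo "running $task now"
    fi
}

run_or_requeue resize_photo_1234    # prints "running resize_photo_1234 now"
FAKE_CONNECTIONS=900
run_or_requeue resize_photo_1234    # prints "requeueing resize_photo_1234 for later"
```

The important design choice is that the *task runner* backs off, rather than the database having to defend itself.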

Instrumentation

Through the magic of Apache notes, developers can send extremely useful bits from within PHP code to the access and error logs. At Flickr, we’ve got some pretty simple notes set up to help track things down when there are issues. For example, when I load the page for my photostream, the log line looks something like:

www394 123.456.789.012 5555 173663 [14/Dec/2009:04:08:21 +0000] “GET /photos/allspaw HTTP/1.1” – 200 18233 “-” “Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3” – –

where 5555 is my user id. Since PHP knows who’s logged in when a page is viewed, there’s no reason not to log that with the request; then, if there are any user-specific issues, it’s not a needle in a haystack.

Another example is API requests. We log the API key making the call along with the authenticated user id, even on POST requests. Being able to trace a bullet through the entire request and response via logs is obviously handy. Putting user ids, API methods, and API-key-specific info into log lines is hugely helpful when troubleshooting issues, especially if you’re running one of the most popular APIs on the web.
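With the user id sitting in a fixed field of every log line, pulling one user's (or one API key's) traffic out of the haystack becomes a one-liner. A sketch using fake log lines in roughly the format shown above:

```shell
# two fake log lines shaped like the sample above (field 3 = user id note)
cat > /tmp/access.log <<'EOF'
www394 123.456.789.012 5555 173663 [14/Dec/2009:04:08:21 +0000] "GET /photos/allspaw HTTP/1.1" - 200 18233
www211 123.456.789.013 7777 173664 [14/Dec/2009:04:08:22 +0000] "GET /photos/someone HTTP/1.1" - 200 9125
EOF

# everything user 5555 did, in one pass
awk '$3 == "5555"' /tmp/access.log    # prints only the allspaw line
```

The same trick works for API keys or any other note, as long as the field position is consistent across the fleet.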

Fault Tolerance

Ross blogged about how we do feature flipping last week. He goes over how important (and awesome) this is to our development process, but another one of the advantages of this approach is how it affects operations.

This is an example of development taking an active role not only in deployment, but in operationalizing features and pieces of code so that, in cases of degradation or failure, individual pieces can be made to fail gracefully. Our talk at Velocity last year covered some of this, but it’s still one of the reasons why we can push code thousands of times a year and still have an extremely low MTTR whenever there’s an issue.

New code causing degradation? There’s an app for that! (it’s called a feature flag)
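The operational shape of a feature flag is just a cheap config check in front of the risky code path, so a degrading feature can be switched off in seconds, without a deploy. A shell-flavored sketch (Flickr's real flags live in PHP config; the file and flag names here are invented):

```shell
# invented flags file: flip a feature off by editing one line, no deploy needed
cat > /tmp/flags.conf <<'EOF'
photo_search=on
related_photos=off
EOF

flag_enabled() {
    grep -q "^$1=on$" /tmp/flags.conf
}

if flag_enabled photo_search; then
    echo "serving search results"
else
    echo "search is resting -- showing a friendly fallback"
fi
```

When the search backend starts degrading, ops (or anyone) flips `photo_search=off` and the site fails gracefully around the feature instead of falling over with it.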

Anyway, my point is that deployment is only a small part of how development and operations should collaborate and communicate. In fact, it’s just the most obvious starting point for getting along and working together on problems.

Product and community management also have important boundary objects with operations as well, but that’s for another blog post. :)


From one door to another

Last week I gave 2 month’s notice – I’ll be leaving Flickr in January.

When Stew and Cat asked me to join Flickr in January of 2005, I felt like it was time to go and do something different, so I said yes.

Five years (and four billion photos) later, it’s again time to go and do something different. It’s hard for me to describe what a blast this has been. Our goal was to kick ass, and I think we did. Flickr has served as the backdrop of some of the largest changes in my life, and the work I’ve done there is essentially tied to those events in my memory.

During my time here at Flickr, in addition to building, scaling, evolving, and generally being as loud and fast as we could possibly be with the original Ludicorp team, I had the absolute privilege to hire and work in the trenches with some of the greatest people on the web. I also had the chance to work with some of the smartest people at Yahoo, whom I’ll continue to have relationships with even after I leave. Yahoo has treated me well, and I’ve learned more here than at any other company.

The reason I stayed here for five years wasn’t for the accolades (or the vesting). It was because I worked with people who care about building something that people care about.

This also happens to be the same reason why I chose my next step: Etsy. They care, and it shows.

I still have a little more time here at Flickr to rock a bit more, but I’m excited to work with my friend Chad again on something that matters. I’ll be running the Ops group there, where they’ve already got superstars.

Chad wrote some more about it here.