Yesterday I accidentally overwrote the disk image of a running Xen server, a production machine. I didn't notice for two hours because the services on the machine kept running; they didn't need to touch the disk. In fact, the only reason I noticed was that I happened to need to do the same thing again, pulled the command from my shell history, and spotted the mistake: a one-character typo.
Recovery from this event required rebuilding the machine. Fortunately, we've been working hard on infrastructure automation and have a set of Puppet recipes for completely building out a server, plus an automated deploy process that loads the required application code and tests it. And since it's all on Xen and the old server was still running, we just built up another machine while the old, corrupted server trundled along.
In about an hour, a new machine was up and running. We had been meaning to put up redundant copies of this particular service; in fact, that's what I was working on when I caused the problem in the first place. We took this opportunity to put the new machine behind the load balancer and moved the DNS entry. Overnight the traffic to the damaged machine tailed off considerably, and the new machine is now handling most of the load. I'll build up another copy on another host machine this morning so that we have the redundancy we need.
Lessons learned:
- Automate everything. More automation would have prevented the mistake I made, and the automation we had saved us. Puppet and other tools like it are the only way to manage infrastructure. (A sketch of what a recipe like that looks like follows this list.)
- Virtualization gives you incredible flexibility. I love the ability to put up new machines as needed without having to manage the hardware tasks at the same time.
- Put things behind load balancers and have multiple copies of services running, even if you don't need the capacity. If we'd done this earlier, my mistake would have been a non-event: we'd have just pulled the damaged server out of rotation and gone on with life. (There's a sketch of how that can be expressed in Puppet after the list, too.)
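To give a flavor of the first lesson, here's a minimal sketch of what a Puppet recipe for a server role can look like. It is not our actual manifest; the class, package, and file names are placeholders and the real recipes cover a lot more (users, monitoring, the application deploy), but the idea is that the whole machine is described in code that Puppet can apply to a fresh Xen guest.

```puppet
# Hypothetical example, not our real manifests: a role class that describes
# a server completely, so a new Xen guest can be rebuilt from scratch.
class roles::app_server {

  # The packages the service needs.
  package { 'nginx':
    ensure => installed,
  }

  # Configuration is part of the recipe, not something hand-edited on the box.
  file { '/etc/nginx/sites-enabled/app.conf':
    ensure  => file,
    source  => 'puppet:///modules/roles/app.conf',
    require => Package['nginx'],
    notify  => Service['nginx'],
  }

  # Keep the service running and restart it when the config changes.
  service { 'nginx':
    ensure => running,
    enable => true,
  }
}
```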
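The third lesson can be captured the same way. This is only a sketch, assuming HAProxy as the load balancer and an ERB template that holds the backend list; the point is that pulling a damaged server out of rotation becomes a one-line change to a file under version control rather than an emergency.

```puppet
# Hypothetical sketch: the load balancer's backend pool managed by Puppet.
# Removing a broken server from rotation is a one-line edit to the template.
class roles::load_balancer {

  package { 'haproxy':
    ensure => installed,
  }

  # The backend list lives in a template checked into version control.
  file { '/etc/haproxy/haproxy.cfg':
    ensure  => file,
    content => template('roles/haproxy.cfg.erb'),
    require => Package['haproxy'],
    notify  => Service['haproxy'],
  }

  service { 'haproxy':
    ensure => running,
    enable => true,
  }
}
```

Both sketches use only core Puppet resources (package, file, service), so they would apply to any freshly built guest without extra modules.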
The end result is that Kynetx suffered no downtime, but we did have some tense moments. Changes we're making to the infrastructure will improve our chances of achieving the same results with less stress and sweat.