It’s that time of the year when Amazon’s big Token Ring Network (AWS) goes down. I don’t subscribe to the notion that there is a magic bullet for consistent, highly available, partition-tolerant systems (in fact, Eric Brewer’s CAP Theorem says you can’t guarantee all three at once). I’m not here to explore how to design these systems, though, but rather what to do when things go wrong.
I work at 2600hz, a VoIP infrastructure company, and one thing you may not know about phones is that people grow unnaturally attached to their equipment. Change an icon, move a button or alter a ringtone and you’ll hear about Armageddon in no time. It’s no surprise, then, that we’ve had to over-engineer our infrastructure to prevent any semblance of downtime. Here are three of the lessons I’ve learned about dealing with big problems in real time:
1. Don’t Panic
When you build systems using homegrown tools, you have no one to yell at. This can be a terrible feeling, or an extremely empowering one. Chances are, if you built it, you can fix it, so adopting this mindset from the get-go will help a lot.
When I was working at a large telecom provider and the network went down, the first thing that happened was a ritual summoning of the responsible parties to a conference call. The bigger your company, the harder this is to do, but the idea is always the same: get the people who can effect change together as quickly as possible. The sooner you’re working on the problem, the sooner you can fix it.
2. The Blame Game is Lame
Deciding who is to blame for the disruption to your business is an exercise in futility while the impact of the error is still being felt. People who feel like they’re dumb tend to think like they’re dumb, so empowering your administrators is the best thing to do in a crisis. Worry about how to prevent the problem during your post-mortem; stopping mid-fix to cast aspersions on someone else’s work will only hurt your time to resolution. Fix, don’t point fingers.
3. Post-Mortem means Beating a Dead Horse
Envy and jealousy are useless emotions unless they motivate positive change. Finger-pointing and guilt-assignment are parallel traits that have no place except in inducing corrective action. In the case of a disaster, the only thing worse than the disaster itself is the idea of repeating that disaster in perpetuity. This happens not because the individuals tasked with preventing these disasters have failed (that much is obvious), but because the weight of the incident was not assigned properly and so was not felt properly.
When performing a post-mortem on a disaster, identify how it can be prevented in the future, discuss the ramifications of not fixing the underlying issues, and assign stakeholders who are responsible for fixing them. Then hold these individuals accountable.
No one wants to live the nightmare of the recurring bug that “can’t be fixed,” but that’s the fate we’re all doomed to if we can’t own problems and introduce tactful solutions.
So remember: when problems arise, don’t panic, don’t blame anyone, and do a thorough post-mortem. A fast time to resolution means getting the key members of the recovery team together and working hard. You can’t work hard while you’re upset, so mitigating the emotional toll of an outage is as important as mitigating the infrastructure damage.
TL;DR: In an outage, don’t freak out, fix the stuff before blaming people, BUT MAKE SURE IT GETS FIXED.