Disasters happen. When they do, are you ready to handle them with grace? In general, people get good at handling events that they experience regularly, but high-risk disasters are managed so that they don't happen often. It's not every day that power goes out at the primary data center, but when it does, you want to be sure that your automatic failover actually works. You want to be sure your backups actually work.
However, testing disaster preparedness often takes too much time and creates little organizational value. The insights and practice gained from the experience may never be put to the test. Because each exercise takes so much effort, you may not get an opportunity to explore all of the most likely disaster scenarios.
So the question is: what's a lightweight way to simulate disasters that still provides insights while taking less time? At the suggestion of the fine consultants at Applied Trust, we decided to run a tabletop disaster recovery session. Chris Rossi had some suggestions on what to do, and I did a bit of research and experimentation.
Planning and setup for tabletop disaster recovery
I decided to add a little fun to the event by giving it a Dungeons and Dragons theme. To prepare for the event:
- I created Google Docs for everyone on the team (one per person) so I could review their notes after the fact.
- I came up with 4 scenarios, though we only covered 3 of them in the hour we had allotted for the event. The scenarios ranged from simple and regularly occurring to more obscure.
- For each scenario, I wrote down the symptom, cause, and a few bits of evidence to help me keep my story straight.
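For anyone who would rather keep these notes in a structured form, here is a minimal sketch of one scenario card in Python. This is not how I actually kept my notes; the `Scenario` class, its field names, and the `answer` helper are all hypothetical, and the cause and evidence in the example card are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Scenario:
    """One tabletop scenario card, kept private by the Disaster Master."""

    symptom: str  # what the team is told at the start
    cause: str    # the hidden root cause the team is trying to find
    # Maps a check the team might ask about to the finding they receive.
    evidence: Dict[str, str] = field(default_factory=dict)

    def answer(self, check: str) -> str:
        """Return the prepared finding for a check, or the default reply."""
        name = check.strip()
        finding = self.evidence.get(name.lower())
        if finding is not None:
            return finding
        return f"You searched for {name} and you found nothing."


# Hypothetical example card: the cause and evidence are made up.
scenario = Scenario(
    symptom=(
        "You get an alert from New Relic saying the number of errors "
        "has gone above our alerting threshold"
    ),
    cause="A single client is flooding the site with uncommon requests",
    evidence={
        "access log": "Uncommon requests are coming from a single IP address.",
    },
)

print(scenario.answer("access log"))
print(scenario.answer("disk space"))
```

Writing the evidence down as question-and-finding pairs ahead of time is what keeps the story straight: anything not on the card gets the same "you found nothing" reply.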
The flow of each scenario was:
- As the "Disaster Master," I would state a condition like "You get an alert from New Relic saying the number of errors has gone above our alerting threshold."
- Everyone would take a minute or two to write notes on everything they could think of to check in order to understand the problem.
- As the typing subsided, people would start asking me questions.
- I tried to keep my responses positive where possible, but it was amazing how often the answer was "You searched for X and you found nothing."
- If I sensed that people were getting frustrated, I would give out a clue. I realized that in a real incident we often just look at a screen of data and spot a hint (e.g. uncommon requests coming from a single IP address) that leads us to pick up a whole thread of evidence. I tried to provide that kind of experience, though it was difficult to do well.
- I had pre-written phrases from different role-playing games to spice up the answers where appropriate, e.g. "Grepping the log file returned zero bytes of text and 1 orc who stole your food."
- We would then debrief on each scenario for a bit, talking about nuances of the questions and answers, discussing which systems people do or don't know well, and making notes on what we want to do better in the future (removing single points of failure, improving documentation, etc.).
Benefits of Tabletop Disaster Master scenarios
What benefits did we find from this practice?
- We ran through 3 scenarios in under an hour, and the whole team got to participate (whether by suggesting ideas or by hearing thought processes from others).
- We identified deficiencies in our technology stack, documentation, knowledge, and processes that we've prioritized and addressed.
- People enjoyed the experience, and everyone was placed in control of each scenario, gaining experience that will be valuable if they ever face the issue in real life.
- One particularly helpful area of knowledge sharing was how to get in touch with people. We use email, HipChat, and Hangouts every day, but in an out-of-hours emergency not everyone was sure where to find the contact list with phone numbers.