“Everyone’s got a plan until they get punched in the mouth.” I often think of that quote, from the boxer Mike Tyson (who ought to know), whenever I consider the current state of disaster-recovery planning in U.S. enterprises.
Sure, most large organizations have disaster recovery plans in place. But many of these plans have a way of failing when an actual disaster strikes – when they get that punch in the mouth. So how do you make sure a disaster plan is meaningful and that it will work when needed? The answer is simple: disaster plans must be tested. Unfortunately, that’s something too many businesses fail to do until they’re flat on their backs with the ref standing above them, counting them out.
Practice Makes Perfect
Best practice calls for testing and adjusting disaster recovery plans at least once a year. Some organizations do it twice a year and a handful do it quarterly. Regardless of frequency, the goal of such a test is to make sure that following the plan leads to a working, production-ready recovery of designated systems and services. The test should also measure how long it takes to achieve the various recovery milestones included in the plan.
These measurements often serve as the impetus for making changes in the plan to better meet the business’s needs. Key elements to measure and track during testing include:
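- Recovery Point Objective (RPO): The point in time in the past to which systems and services will recover, to begin moving forward. Anything created or changed after that point is lost, so the RPO sets the maximum tolerable data loss.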
- Recovery Time Objective (RTO): The maximum time that may elapse, following a disaster, before the organization resumes “up and running” status. This is an interval measured in days, hours, and minutes (dd:hh:mm) rather than a specific date or time.
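To make these two measurements concrete, here is a minimal Python sketch of how a test team might compare what a test actually achieved against its targets. The function names, the timestamps, and the target values are all hypothetical, chosen only for illustration; a real plan would take them from its own test log.

```python
from datetime import datetime, timedelta


def achieved_rpo(disaster_at: datetime, last_recoverable_at: datetime) -> timedelta:
    """Data-loss window: time between the last recoverable state and the disaster."""
    return disaster_at - last_recoverable_at


def achieved_rto(disaster_at: datetime, restored_at: datetime) -> timedelta:
    """Downtime: time between the disaster and the return to 'up and running'."""
    return restored_at - disaster_at


def meets_objective(achieved: timedelta, objective: timedelta) -> bool:
    """An objective is met when the achieved interval does not exceed it."""
    return achieved <= objective


def format_interval(interval: timedelta) -> str:
    """Render an interval as days, hours, and minutes."""
    total_minutes = int(interval.total_seconds() // 60)
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return f"{days}d {hours:02d}h {minutes:02d}m"


if __name__ == "__main__":
    # Hypothetical timestamps from a single test run.
    disaster = datetime(2024, 3, 2, 9, 0)
    last_backup = datetime(2024, 3, 2, 3, 0)
    restored = datetime(2024, 3, 2, 21, 30)

    # Hypothetical targets taken from the plan.
    rpo_target = timedelta(hours=4)
    rto_target = timedelta(hours=24)

    rpo = achieved_rpo(disaster, last_backup)
    rto = achieved_rto(disaster, restored)
    print("RPO", format_interval(rpo), "met:", meets_objective(rpo, rpo_target))
    print("RTO", format_interval(rto), "met:", meets_objective(rto, rto_target))
```

In this made-up run, the recovery finishes well inside the RTO target but loses six hours of data against a four-hour RPO target, which is exactly the kind of gap a test is meant to surface.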
Pulling the Plug
Some DR experts argue that the only real way to test a disaster recovery plan is to actually pull the plug on the systems and networks to be recovered, and then to use the plan to enact a real, live recovery. This presents several risks and challenges, however – not least that the testing period itself will leave the organization unable to conduct normal business. Those that adopt this approach often hedge their bets in a number of ways.
First, they might declare a “virtual failure” and leave production systems alone, while proceeding with a full enactment of the disaster recovery plan. Second, they might switch over from production facilities to backup facilities while still shadowing all activities and transactions on production systems to use as a fallback if the switch-over itself fails. Many experts believe this is an essential part of testing, because it stresses real systems and people in real-world scenarios.
Several important factors must be considered before pulling the plug for a test. First, executives and stakeholders must agree that the learning and experience gained from the exercise are worth the potential business or opportunity costs that may result while the test is underway. Second, it takes careful timing and a willing and dedicated DR team to perform the test. To mitigate harm to the business, the test is usually run during a major holiday (for those not in retail, of course) or during scheduled downtime.
Because completing the test can involve long hours and days, including weekends and/or holidays, many organizations set up A and B teams (or more teams in environments where testing is conducted more frequently). That way, only one team is asked to give up weekends or holidays to conduct any given test, while the other team(s) are idle.
There’s More to Recovery Than Systems and Services
There’s one more vital key to testing a DR plan that’s far too often overlooked or under-emphasized. Stakeholders from key business functions and departments should participate in the test, so they can use (or attempt to use) the restored systems and services to do real work.
During the aftermath of an actual disaster, those teams will need to know where to report, how to get into the facility, and how to use telephone, messaging, email, and other information systems to do their normal jobs. A surprising number of disaster recovery tests fail not because the systems and services aren’t online and ready to use but because the people who use those systems and services:
- Don’t know where to go to report for work, or how to log into alternate remote systems to do their work;
- Don’t know how to contact their co-workers, colleagues, and business partners to do their jobs, or how to send and receive emails, IMs, or text messages to communicate with them; or
- Can’t access key corporate or organizational information assets or functions, such as the employee directory and corporate departments (HR, payroll, administration, and so forth).
It’s not enough to just be ready to get back to work – the exclusive focus of too many DR plans. It’s also essential to get down to work on the recovered systems, using recovered services, and performing the everyday tasks that make businesses and organizations run.
The various failures and oversights that occur during testing should be carefully documented and used to adjust both what goes into the DR plan and what gets tested on the next iteration. That’s where the real value of DR plan testing lies. For a deeper conversation about what your organization needs to survive and thrive in the aftermath of a disaster, reach out.