Put aside your latest list of one-alarm fires for a few hours and spend some time planning your disaster recovery response.

Back to Work

Despite many predictions outside (and even some inside) the IT profession, Y2K wasn’t a disaster. Those of you who sold short on disaster futures, congratulations. But this doesn’t mean disasters won’t strike. It needn’t be an earth-killing meteor or global plague; a localized dislocation that directly affects your company’s information systems is enough. That could range from a broad-based local problem to one that affects only your company. I don’t mean a disk failure or a controller going out; I’m talking about the building burning down, getting flooded, or something equally dramatic in which your entire system ends up toast. Regardless of the breadth of the problem, your concern (after your home and family, of course) will be your systems and what you’re going to do about them.

Windows 2000 Server has a suite of recovery tools that includes the Advanced Options menu, the tried-and-true Emergency Repair Disk (ERD), and the Recovery Console. There’s also disk mirroring, RAID, and, of course, the high-end clustering option. While these services are welcome indeed, none is an end in itself. Disaster recovery depends on these and other features, but it is largely a logistical issue, not a technical one.

The first step in getting a handle on disaster recovery is to have someone in your organization with authority put a value on the information in your system. A retail flower shop is going to have a completely different valuation of its information than a financial institution. Some companies count only the cost of rebuilding the systems and restoring the information to resume business, while others also count the lost opportunity costs associated with a down system.

If You’re Small, It’s Simple

For a small business with a standalone system or a very small network, a reasonable disaster recovery plan can be pretty straightforward. It could simply be a complete hardware and software inventory: a shopping list that includes everything necessary to receive a backup. You wouldn’t necessarily need to replace all of the software; much of it would be restored from the backup. If disaster strikes, you take the list to a supplier and begin the tedious task of rebuilding your system to the point where it can accept the backup tape or CD that you have religiously been creating and storing offsite. This is down and dirty, but it works, with two caveats: You’ll lose any data created after the most recent backup you still have on hand, and it can take considerable time to rebuild the system.
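As a rough illustration of that shopping list, here is a minimal Python sketch. The `Item` fields, the `shopping_list` helper, and every inventory entry are hypothetical, made up for this example; the point is only that flagging what the backup restores leaves you with the list of things to buy first.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str
    kind: str        # "hardware" or "software"
    quantity: int
    on_backup: bool  # True if the backup restores it, so no purchase is needed


def shopping_list(inventory: list[Item]) -> list[str]:
    """Return what must be bought before the system can accept the backup."""
    return [f"{item.quantity} x {item.name}" for item in inventory if not item.on_backup]


# Hypothetical entries for a small office; replace them with your own inventory.
inventory = [
    Item("server with tape drive", "hardware", 1, on_backup=False),
    Item("workstation", "hardware", 4, on_backup=False),
    Item("server operating system media", "software", 1, on_backup=False),
    Item("accounting package", "software", 1, on_backup=True),
]

for line in shopping_list(inventory):
    print(line)
```

Keeping a printed copy of the output offsite with the backups means the list survives the same disaster the systems don’t.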

A retail shop’s problems will also be structural—perhaps rebuilding or finding a new location. But a standalone service business can be up and running within a day, with forwarded phones, if the equipment is readily available from a local supplier.

Obviously, this gets more complicated as you add workstations and other servers to the scenario. What matters with this type of plan isn’t that it’s exhaustively detailed or entirely comprehensive; it’s that you have a plan at all. When a crisis is upon you, you need to have steps laid out to get you out of the situation, and these steps need to have been decided upon and written out when things were calm. That’s the only way to ensure you’ll accomplish what you need to in the most efficient manner possible.

At the other end of the scale is the company that can’t be down, period. This complicates the plan and adds astronomical costs; however, in many cases these high costs are still less than the lost opportunity costs of an information system failure. An organization that can’t be down must build and maintain complete parallel systems running concurrently, with periodic or even real-time data transfers to the backup system. Very few organizations have this type of requirement because the cost is prohibitive. Most large organizations fall somewhere in the middle to upper end of this scale.

These organizations must follow a systematic process to build a workable written and published plan. The plan must be detailed enough to allow staff—not necessarily those who wrote the plan—to get an information system up, running, and available within the time specified.

Determine the Scope

The first step is to determine the boundaries of what is considered the information system. Is it only the mainframes, or does it include departmental servers as well? You might include 100 percent of the users or just a key 25 percent, or perhaps just certain classes of users, say, accounting. This isn’t a technical process; it’s a management decision that needs to be made before the plan is developed. However, to help the decision process, knowledgeable IT staff like you should present some choices based on what you know to be strategic to the business. When you’ve determined the scope, you’ll use it to develop the options available to get the system up and running, and to return the entire system to an acceptable level of availability.

Once the scope is decided upon, you can create a disaster plan. For example, the objective may be that the mainframes and departmental servers must be functional within 48 hours, with 95 percent of class A users connected and 25 percent of class B users attached. Within 72 hours, 100 percent of all users must have system access. Obviously, the real numbers will be determined by management using a cost-benefit analysis. Regardless, the objectives should be things that can be clearly measured, so that you can calculate the “lost opportunity” cost. An objective that says “most users will have access to the systems” can’t be measured, making it useless. If system downtime costs your company $2 million a day and a loss of more than $6 million is unacceptable, then the system must be up within three days, which keeps losses at or under $6 million.
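The arithmetic behind that last example, and the difference between a measurable objective and a vague one, can be sketched in a few lines of Python. This is only an illustration: the function and class names are made up, the dollar figures and the 48-hour, 95 percent target come from the example above, and the measured values in the usage lines are hypothetical.

```python
from dataclasses import dataclass


def max_downtime_days(cost_per_day: float, max_acceptable_loss: float) -> float:
    """Longest outage, in days, that keeps losses at or below the limit."""
    if cost_per_day <= 0:
        raise ValueError("cost_per_day must be positive")
    return max_acceptable_loss / cost_per_day


@dataclass
class Objective:
    """A measurable target: a share of one user class online by a deadline."""
    user_class: str
    deadline_hours: float
    required_fraction: float  # 0.95 means 95 percent

    def is_met(self, hours_elapsed: float, connected: int, total: int) -> bool:
        if total <= 0:
            raise ValueError("total must be positive")
        # Met only if measured inside the deadline with enough users connected.
        on_time = hours_elapsed <= self.deadline_hours
        return on_time and connected / total >= self.required_fraction


# Figures from the example: $2 million per day, $6 million ceiling.
print(max_downtime_days(2_000_000, 6_000_000))  # 3.0

# Hypothetical drill result: 192 of 200 class A users back online after 40 hours.
class_a = Objective("A", deadline_hours=48, required_fraction=0.95)
print(class_a.is_met(hours_elapsed=40, connected=192, total=200))  # True
```

An objective like “most users will have access” has no deadline or fraction to plug in, which is exactly why it can’t be checked.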

This brings up another point. A disaster recovery plan is a living document. Generally, the lost opportunity costs grow rather than diminish over time in an organization dependent upon technology. In addition, you need to keep an eye on equipment costs to determine the expense of rebuilding the systems, and the compatibility of the current software with new machines. In a large company, this alone can be one person’s reason for being.

Hot Site Service

One alternative to the cost of having a complete redundant system waiting in the wings is to subscribe to a hot site service bureau. This is a service organization that maintains standby backup equipment for several customers in different geographical locations. Each customer shares in the expense of maintaining the equipment, thereby spreading the cost. Another advantage of a hot site service bureau is that you can stage live disaster recovery drills to test the equipment and your procedures.

Regardless of whether you use a hot site service bureau or maintain your own remote backup location, you also have to consider user access to the other site. In addition to equipment redundancy, you need to build a data network backup system so that users can gain access to the new system. Again, this can be privately built or subscribed to through a service provider.

Another component I haven’t addressed here is a voice network backup plan. You’ll want to consider this as carefully as your hardware and software backup plan. Where will your calls be forwarded and how?

Additional Information

MCP Magazine covered the topic of Windows NT, SQL Server, SMS, and SNA disaster recovery in the September 1998 issue, "Prepare for the Worst: What You Need To Know Before Bad Stuff Happens."

The Disaster Recovery Journal at www.drj.com provides sample disaster recovery plans and requests for proposals, along with ongoing editorial on the subject. You'll also find information for the "disaster recovery newbie." You'll have to register for free access, then await a password.

Most large service firms offer recovery and protection services or hot site service. If you purchase your systems from a particular vendor, such as Compaq, IBM, or Dell, check into their offerings.

Avoidance Is Critical

One final thought. As critical as a thorough disaster recovery plan is to the business, the other important component is a disaster avoidance plan. Your security plans should include non-technical measures such as securing power patch panels, running automatic virus scanning, locking down workstations and servers, and other obvious but often overlooked items.

The biggest problem with most disaster recovery plans in most companies is simply that there isn’t one. As a support professional, take the time to think through what you’d have to do to completely rebuild your information system in a different location. After your heart starts beating again, let management know of your concerns in writing. Your role as a technical support or design professional is to outline the implications of each system failure and then present them in a way that helps management make a cost evaluation. You can then use that to develop a plan that addresses the judgments upper management has made. In other words, it’s a classic CYA situation—with a professional approach, of course.
