
"It is impossible for ideas to
compete in the marketplace if no forum for
their presentation is provided or available."
Thomas Mann, 1896
The Business Forum
Journal
BUSINESS RESUMPTION PLANNING
Justification, Implementation
& Testing
By Dr. Paul H. Rosenthal
Modern organizations have a large variety
of operational and managerial functions whose continuous operations are
critical to the organization's continuing viability. Business Resumption
Planning (BRP) involves arranging for emergency operations of these critical
business functions and for resource recovery planning of these functions
following a natural or man-made disaster. Business Resumption Plans are needed
for all such organizational units, including data centers, information systems
(IS) supported functions, and those organizational functions which are
performed manually. The widespread lack of a BRP for many data centers and for
most non-IS related operational and managerial functions is based on two
mistaken beliefs:
1. that the chance that a disaster will
occur is so remote that the responsible managers and executives need not
consider BRP as an essential part of their jobs, and
2. that, over the long term, the cost of
a workable BRP exceeds the total costs incurred when a low probability
disaster occurs.
These beliefs are
based on the incorrect use of insurance-based probability concepts
instead of prudent person risk assessment approaches. This paper will,
therefore, present both the scope and procedures for developing and testing a
BRP as well as a detailed discussion of justifying a BRP based on the prudent
person approach. It is organized into the following sections:
Development of the BRP field
Contents of a typical BRP
Phases in developing a typical BRP
Prudent Person justification
methodology and case study
Functions of BRP Teams
Data Center Backup Architectures
Manual Systems Backup Approaches
Overview of Business Resumption Planning
Recent experience in
the information systems contingency planning area has demonstrated that
disasters to quality business offices/facilities occur approximately once
every hundred years. A disaster probability of 1% per year appears to be the
proper basis of individual facility based business resumption planning. This
is actually a very high probability when the prudent business person realizes
that the loss of a business facility containing critical functions can destroy
the company. The advent of critical data processing applications, as discussed
in Andrews [1], first brought this exposure to the attention of most
executives.
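The 1% annual figure can be translated into a probability over a longer planning horizon. The sketch below assumes, purely for illustration, that each year is independent and that the annual probability stays constant; neither assumption is made in the paper, and the function name is hypothetical.

```python
def prob_at_least_one_disaster(annual_prob: float, years: int) -> float:
    """Chance of one or more disasters in `years` years.

    Assumes independent years with a constant annual probability
    (an illustrative assumption, not a claim from the paper).
    """
    if not 0.0 <= annual_prob <= 1.0:
        raise ValueError("annual_prob must be between 0 and 1")
    if years < 0:
        raise ValueError("years must be non-negative")
    return 1.0 - (1.0 - annual_prob) ** years


if __name__ == "__main__":
    # Print the cumulative probability for a few planning horizons.
    for horizon in (1, 10, 25, 50):
        p = prob_at_least_one_disaster(0.01, horizon)
        print(f"{horizon:>2} years: {p:.1%}")
```

Under these assumptions, the chance of at least one disaster over a ten year horizon is a little under 10%, which is one way to see why a 1% annual probability is not negligible for a single critical facility.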
Development of the BRP Field
Prior to the
development of computer based information systems that were critical to an
organization's day-to-day operations, Business Resumption Planning consisted
primarily of insurance programs, life-safety oriented building evacuation
plans, and mutual aid agreements for batch processing resources between data
centers. By the late 1970s, computer based systems had evolved from batch back
office and accounting functions to online systems supporting critical business
functions such as airline reservations systems and bank teller support
applications.
To meet this new
critical dependence on information systems, the Comptroller of the Currency
issued a banking circular entitled "Contingency Planning for EDP
Support" on May 26, 1983 to the national banks that it audited. The
circular required that the banks prepare Contingency Plans for recovering
their critical IS functions. This circular started the modern IS contingency
planning support industry. The IS contingency planning support industry
consists of two areas: consultants and packaged plans to assist an
organization in developing its BRP, and shared commercial backup data
centers for use following a disaster.
Several organizations
sell an IS contingency plan development methodology including planning
procedures and a sample plan available on a diskette. The sample plan is
tailored, often with consulting help from the vendor, to the organization's
individual requirements. The sample plans are often of great value to
organizations that have not had prior experience. However, their use often
leads to long, wordy plans that are not read by potential emergency operations
and recovery participants.
Two large and numerous
smaller organizations offer data center operations facilities containing
operating computers (a hot site), fully equipped space for installing several
additional computers (a cold site), and space for clerical and support
personnel. Scheduled use of the hot site computer(s) is only for testing by
the eighty to a hundred clients of the hot/cold site facility. These
commercial backup sites are extremely valuable after a localized disaster that
makes a data center unavailable, but are subject to prior or multiple
occupancy following a widespread disaster affecting many facilities.
The availability of
these support services and facilities has made data center disaster planning
easily affordable for all but the largest companies.
By the late 1980s,
external auditors were demanding that all organizations with critical IS
applications have contingency plans that included not just their data center
but its users as well. The term Business Resumption Planning started to be
used as IS contingency planning methodologies were expanded to cover all
critical functions (automated as well as manual) of an organization. However,
the availability of commercial data center backup facilities makes Business
Resumption Planning easier for automated functions than for manual functions.
In fact, some business functions based on manual processing can develop
economically justifiable backup procedures only by automating their critical
functions.
Contents of a Typical BRP
The typical business
resumption plan contains three types of information: backup resource
arrangements; BRP procedures for notification, activation, mobilization, and
emergency operations; and listings of equipment and other resources at all
facilities. It requires two very different formats: one for the information
needed before a disaster and one for the information needed after it.
A formal BRP is
needed for orientation of employees who will be involved in activating the
plan and performing emergency operations. Orientation material is also needed
for all employees in both the life safety and BRP areas. Detailed emergency
operations plans are also needed both before a disaster for use during
testing, and after a disaster for use during emergency operations. The type of
information needed includes:
Backup resource arrangements:
The initial and most critical activity in creating a realistic BRP is
arranging for backup resources and data for use after a disaster. Data centers
with critical online applications either use commercial backup centers or
utilize dual data center architectures. IS users and manual processing
organizations also need to have backup space and resources available. Most
organizations have space and equipment occupied by non-critical processing
activities (such as development organizations, conference rooms, exhibit
space, and executive offices) that can be utilized in an emergency, given
sufficient planning.
The lack of copies of paper-based data, forms, and procedures is the most
common weakness in BRP arrangements. Data centers routinely send copies of all
data files to off-site storage daily. Manual processing organizations must
also store copies of all critical data off-site, frequently as microfilm. The
loss of desk-top papers, diskettes, and Rolodexes can cripple organizations.
BRP procedures:
BRP procedures documentation for use before a disaster supports employee
orientation, BRP participant training, operational and simulation testing, and
auditing activities. The typical contents of a BRP follow.
Applicability
Distribution list (Controlled)
Organization structure (Team definitions)
Notification trees (Business and home numbers)
Activation and mobilization procedures
Backup resources (Locations and functions)
Emergency operations policies (For each business function's teams)
Resource recovery policies
Testing and training policies
Appendixes (copies of material for use following a disaster)
BRP activities:
BRP activities documentation for use following a disaster should be concise
and consist primarily of tables and charts. Lengthy statements of policy and
detailed procedures are not read and are typically ignored during an
emergency. Effective BRP documentation for use following a disaster must
contain:
These site specific wallet size cards are distributed to all employees. They
should contain both life-safety and BRP
information such as assembly locations, emergency operations policy, and numbers to call for
information.
These fold over pocket size cards contain a management call-tree, key assembly
and backup resource addresses and
telephone numbers, and emergency team contact information.
Detailed team operations procedures are normally needed only when multiple
locations performing the same
business function exist. Such locations require detailed procedures since they are normally not
staffed with the senior personnel needed to adapt policy level directives to the specifics of a
particular emergency. This type of procedure is lengthy and difficult to prepare since it must
anticipate various types and levels of disasters.
BRP reference
information:
Reference information is often included in the
appendix of the BRP as well as being bound
separately for ease of use following a disaster.
These fold over pocket size cards contain the team member call-tree, an
activity checklist, and backup resource
information. Typical teams include: policy, emergency operations center, facility management, site
recovery, backup data center operations, logistics, off-site storage coordination, floor
wardens, assembly site coordinators, public/employee communications, telecommunications, etc.
These tabulations
contain such information as: lists of all resources by location including
replacement information and backup resource locations; where all personnel are
to report during emergency operations; and where to forward data and materials
from off-site storage.
The best BRP
documentation for use following a disaster that the author has observed
consisted of approximately a dozen reference cards, several books of resource
tabulations, and two well-equipped emergency operations centers (EOCs).
Justifying and Developing a Business Resumption Plan
The funding, security,
planning, and testing phases of developing a BRP are presented in this section,
followed by a simulated conversation presenting a detailed analysis of the
prudent person justification methodology as applied to the funding phase. A more
detailed discussion of the justification approach can be found in Waldman [10].
Phases in Developing a Typical BRP
Building a quality
Business Resumption Plan is a lengthy process involving many persons and
disciplines. Most organizations first build an IS data center plan, then a BRP
for each critical data center user, and finally attempt a BRP for their critical
manual processes. The following four phases apply to each of these functional
areas.
Phase I - Funding
The initial step in any
BRP program is that of obtaining the substantial funding normally required. This
requires convincing the Board of Directors of the reality of a possible disaster
and its probable impact on the ability of the organization to survive.
A detailed quantitative
approach to risk analysis is widely used. The approach is popular with
government and large decentralized industrial firms with major consulting
budgets (see Wong [11]). It determines the probability of various man-made and
natural disasters occurring and their impact on key business functions. The
results present probability estimates that are difficult to translate into
risk-cost-protection decisions. Additionally, they are actuarially based and do
not apply to business decisions involving a single site or resource.
The
board of directors of most firms responds far better to a fiduciary
responsibility based risk analysis. A list of risks to which the firm's
facilities and personnel are exposed is presented, and a case study approach is
used to demonstrate realistic risk exposure in Waldman [10]. Estimates are made
of the financial impact on various business functions, computer related and
non-computer related, of a loss in resource capability. When the impacts include
financial or service level losses that can affect the firm's survival, then the
board members' fiduciary responsibility requires a prudent level of protection
and recovery capability. Funding for an adequate BRP is then made available,
often as a priority project.
Phase II
- Disaster Prevention
Following
initial funding, the next step in the BRP program is to determine the possible
extent of exposure to natural and man-made disasters of critical resources,
including facilities, data, and personnel. Procedures must then be implemented
to minimize the probability of such disasters occurring.
Physical
security planning primarily involves access controls, fire and water protection,
earthquake and storm hardening, and critical records security. Most firms have a
physical security program in place covering these areas prior to the
implementation of a BRP program. The next step in the BRP is, therefore, simply
an assessment of the program, and improvement if necessary. The author's
experience indicates that the critical records area, particularly for
non-computerized files, frequently is a major weak point.
Data
security and protection programs are not as widespread as physical security
programs. Few firms have high quality data-oriented security programs involving
off-site backup storage of critical paper-based financial and personnel
records. Protection in this area frequently requires a major effort.
Phase III - BRP Planning
Disaster
planning, as illustrated in Rohde and Haskett [6], is often initiated by the
organization's data center, as it first implements applications critical to the
day-to-day operations of the organization. The selling of data processing
oriented disaster planning to the Board of Directors often alerts them to the
risk presented by the non-computerized portions of the firm's operations, and a
total disaster recovery planning effort is initiated.
A
team effort is the best approach to creating an organization's initial BRP. The
team should include at least a person experienced in BRP architectures and plan
development; a person with long term responsibility for developing and
maintaining the plan; and an influential manager with in-depth knowledge of the
organization, its operations, and its people. The team should move through the
development cycle backwards and then forward.
Step
1 - for the various major resources of each business unit, determine the
potential recovery architectures available, their costs, and the recovery
periods they offer.
Step
2 - perform a risk analysis determining which resources are truly critical to the
organization's survival. For those resources, determine their desired recovery
periods and the most practical recovery architecture approach for each.
Step
3 - present a business resumption policy to the Board of Directors balancing
risk, costs, and service levels.
Step
4 - create a detailed design of the authorized recovery architectures and assist
each business unit in creating a business resumption policy and architecture.
Step
5 - assist each business unit in designating a BRP coordinator,
assigning a planning team, and developing and testing its
plan.
Phase IV
- BRP Testing
Desk-top
walk through - Before any detailed testing, key stakeholders in each business
function's BRP are convened in a conference room, and a detailed review is
performed of the plan. Many small events are described and the participants are
asked to state how the plan would guide their reactions. The events should
require utilization of: major backup resources, emergency operations approaches,
and all emergency response teams. Following this step, operations and simulation
tests are scheduled.
Operational
testing - Few organizations operationally test the complete disaster reaction
cycle of: activation, life-safety, damage assessment, mobilization, emergency
operations using off-site files and backup resources, and recovery planning.
Only the data processing emergency operations area can normally be tested
without involving a substantial number of persons during business hours. The
scope of most operational tests therefore includes: a semi-annual off-hour call
to the manager of data center operations, assembly of the backup site operations
team, acquisition of backup materials from an off-site location, travel to a
backup hot/cold site, installation of systems and applications software, loading
production data, and systems test of several critical applications.
Simulation
testing - Simulation is the most feasible approach for testing the decision
making aspects of disaster reaction activities, see Rosenthal [8 & 9]. The
use of simulation exercises for testing a BRP has been spreading slowly over the
last decade. Unlike their military counterpart, war games that use computer
driven scenarios to perform very realistic exercises, BRP exercises are paper
and pencil simulations. Teams are placed at tables representing their backup
locations, and the description of an evolving disaster is presented. The teams
communicate using backup communication resources or forms, make decisions, and
everyone pretends that what is ordered actually happens. Debriefings and
evaluation studies follow to correct any flaws in the BRP.
A
scenario for use in simulation testing of a BRP must fulfill several objectives.
Objective
I - Be solvable for a majority of the business functions participating, using
existing plans and backup resources.
Except
under very unusual political conditions, a major failure during a simulation
test is not a suitable motivator for improvement in disaster planning, or for
acquiring additional funding for backup resources. A primarily positive
experience, however, appears to be a powerful motivator to obtain additional
funding to complete planning and backup resource acquisition. In fact, the
scheduling of a simulation is often the easiest way to motivate organizations to
update their staffing and contact lists.
Proper
planning of a scenario includes the review of each participating organization's
BRP to determine if they can perform emergency operations at an acceptable
level. If they cannot, a discussion with top management is appropriate, and
warnings to the deficient organization's management are always proper (i.e.,
there should be no major surprises or disappointments).
Objective
II - Represent a realistic risk.
A
fire, flood, earthquake (in California), or bomb is normally the basis for a
scenario. A detailed knowledge of the buildings, area, and emergency services
involved is always necessary. The disaster and its effects over several days or
weeks have to be described, so the participation of facility and security
personnel is required.
Objective
III -
Be capable of being partitioned into practical time steps.
Simulation
exercises with four to six time steps are the most practical. Each time step
must meet the following criteria:
a)
The external and internal environment should change in terms of both the
evolution of the events causing the disaster, and in terms of emergency and
recovery efforts (e.g., new information given to participants and new actions
required).
b)
Each team should have some significant action they must accomplish (e.g., a
decision, announcement, report to management).
c)
Time allowed for the time step should be sufficient but not generous (normally
30-60 minutes).
The
period simulated becomes longer with each time step in the simulation. The
simulations performed to date indicate that the initial period simulated is
often one to four hours, while the final period simulated is often several days
to a week.
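The time step criteria above lend themselves to a simple checklist. The following is a minimal sketch, assuming a scenario planner records each step's exercise time and the length of the period it simulates; the class, field, and function names are hypothetical, and the example schedule is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class TimeStep:
    exercise_minutes: int    # wall-clock time given to the teams
    simulated_hours: float   # length of the disaster period this step represents


def check_schedule(steps: list[TimeStep]) -> list[str]:
    """Return a list of problems; an empty list means the schedule fits the guidelines."""
    problems = []
    # Four to six time steps are described as the most practical.
    if not 4 <= len(steps) <= 6:
        problems.append(f"expected 4 to 6 time steps, found {len(steps)}")
    # Each step should normally be allowed 30-60 minutes.
    for i, step in enumerate(steps, start=1):
        if not 30 <= step.exercise_minutes <= 60:
            problems.append(f"step {i}: {step.exercise_minutes} minutes is outside 30-60")
    # The period simulated should become longer with each step.
    for i in range(1, len(steps)):
        if steps[i].simulated_hours <= steps[i - 1].simulated_hours:
            problems.append(f"step {i + 1}: simulated period does not grow")
    return problems


# Hypothetical schedule: simulated periods grow from 2 hours to one week.
example = [
    TimeStep(45, 2),
    TimeStep(45, 8),
    TimeStep(60, 24),
    TimeStep(60, 72),
    TimeStep(45, 168),
]
print(check_schedule(example))
```

A check like this covers only the mechanical criteria; whether each team has a significant action to accomplish in every step still has to be judged by the scenario designers.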
Objective
IV - Be self documenting
Messages
and plans produced during the simulation exercise should be rigidly formatted
and documented, so that there is a detailed record of all events and actions.
This documentation, together with the umpires' and evaluators' checklists, is
necessary for later analysis.
Most
simulation exercises are very successful in that they force personnel to learn
the BRP while working together, and find flaws and inconsistencies in emergency
operations and recovery policies and plans.
Prudent Person Justification
Methodology
There
is a substantial methodological and financial difference between data center and
manual business functions risk management decisions using 'prudent business' and
traditional 'probability' based methodologies. There is also a vast difference
between risk management of a data center's processing of the record keeping
applications of the 1960's through the early 1980's, and the risk management of
the critical on-line operational applications of the late 1980's through the
1990's. The hypothesis of Waldman's thesis [10] is that the probability based
approach, created for the record keeping applications of the early years of
Information Systems (IS), is no longer appropriate for the mission critical
applications of the 1990's, and has never been appropriate for critical manual
systems.
-
The prudent person methodology is based on executives eliminating those
alternatives that risk the short and long term viability of the firm, and
analysts then selecting the lowest cost alternative that provides acceptable
recovery times. The prudent person duty of care test requires that an officer
make a "reasonable investigation and honestly believe that their decision
is in the best interest of the corporation", see Metzger et al. [4].
-
The probability based approach is based on analysts multiplying the potential
loss experienced following a disaster by the probability of it occurring, and
comparing that to the cost of backup alternatives. As an example, the IBM
approach as defined in Wong [11], states that "For each system ... the
expected frequency of occurrence per annum, P, as well as the loss incurred, V,
are calculated ...and the exposure, E, per annum is then evaluated from the
values of P and C". This approach often exposes the organization to
unacceptable losses when a low probability disaster actually occurs.
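The difference between the two methodologies can be sketched in a few lines. The example below is an illustration only: the alternatives and all of their figures are hypothetical, and reducing "risk to the viability of the firm" to a single maximum tolerable loss is a simplifying assumption, since in the prudent person approach that judgment belongs to the executives.

```python
from dataclasses import dataclass


@dataclass
class Alternative:
    name: str
    annual_cost: float        # yearly cost of the backup arrangement
    loss_if_disaster: float   # estimated loss V if a disaster occurs
    recovery_days: float      # expected maximum down time


def probability_based_choice(alts, annual_prob):
    """Pick the lowest annual cost plus expected exposure E = P * V."""
    return min(alts, key=lambda a: a.annual_cost + annual_prob * a.loss_if_disaster)


def prudent_person_choice(alts, max_tolerable_loss, max_recovery_days):
    """Drop alternatives that threaten viability, then pick the cheapest that remains."""
    acceptable = [
        a for a in alts
        if a.loss_if_disaster <= max_tolerable_loss
        and a.recovery_days <= max_recovery_days
    ]
    if not acceptable:
        raise ValueError("no alternative meets the viability and recovery limits")
    return min(acceptable, key=lambda a: a.annual_cost)


# Hypothetical figures for illustration only.
alternatives = [
    Alternative("no backup", 10_000, 20_000_000, 90),
    Alternative("hot site", 250_000, 500_000, 3),
    Alternative("dual centers", 750_000, 0, 0.25),
]

print(probability_based_choice(alternatives, annual_prob=0.01).name)
print(prudent_person_choice(alternatives, 5_000_000, 7).name)
```

With these invented numbers the probability based calculation favors having no backup, because the expected annual exposure is small, while the prudent person filter removes that option before costs are compared.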
A
BRP Justification Example
A
comparison of these approaches is illustrated in the simulated conversation that
follows. It is extracted from a recent speech by the authors to the San Fernando
Valley Association of Contingency Planners. The illustration involves a
simulated discussion between an executive, an insurance oriented financial
analyst (Mr. Probability) and a management systems oriented MIS planner (Mr.
Application).
The discussion is as follows.
Mr.
Executive Speaks
"Mr.
Probability, I understand you do not agree with Mr. Application's request for
a budget
increase of $500,000 per year to implement a contingency planning program for
our computer and data communication systems, as well as for semi-annual testing
using a commercially available data center back-up site. Why do you feel the
expenditures are financially unjustified?"
Mr.
Probability Speaks
"Mr.
Executive, I have contacted the proposed data center backup site vendor and
requested his experience over the last decade, concerning how often a data
center suffers a disaster. They have found that each of their backup centers,
with subscription levels of approximately 100, is used approximately once a
year for other than reactive testing or planned conversions. I am therefore
stating that a data center disaster is a once in a hundred year
occurrence."
"Secondly,
Mr. Application's request indicates that the old data center, which was replaced
ten years ago because of security and size considerations, is still functional as
a backup site capable of processing our production operations within a week
following emergency ordering of equipment.
As
he states, it can be populated with computers etc. within a week and the CIS
department could have our production systems current in two weeks. The backup
data center he wishes to subscribe to, plus our testing expenses, will cost
$50,000 per month and would permit him to have our production systems current
in two days. We are therefore computing that $500,000 per year is worth 10 days
of operational losses.
"Thirdly,
Mr. Application indicates that direct losses due to non-current production
systems are $3,000,000 per day. This figure was arrived at by accounting based
on lost contributions to overhead and profit of three days' sales.
"Lastly,
if we multiply the $3,000,000 by 10 days we get $30,000,000 additional potential
loss in a disaster if we do not subscribe for the commercial backup center.
However, when we divide this figure by 100, the probability of loss per year, we
get a value of the commercial backup center of $300,000 per year. This net loss
of $200,000 per year is obviously a bad investment of capital."
Mr.
Executive Speaks
"Mr.
Application, if your figures show such a bad investment, why have you proposed
the contract?"
Mr.
Application Speaks
"Mr.
Probability does not understand the implications of ten days down time, now that
our production applications are online. It is no longer a question of lost sales
but of retaining customers' business after the ten day outage. I have had
extensive conversations with marketing and customer service management. They
feel that half our customers can tolerate a 10 day wait and will continue to buy
from us. However, marketing believes it will take several years to win back the
other half of our current customers. Their optimistic figure of an average of a
year to resell a customer will give us an additional loss of $18,000,000
($3,000,000 times one-half times 240 days). This will give us a total loss of
$48,000,000. If we then divide by 100 we get a return of $480,000 per year on an
investment of $500,000, i.e., a break-even situation.
"However,
this type of argument, including the division by 100, is irrelevant. The problem
is that the data center could be destroyed tomorrow, or next year, or not for 200
years. If it is destroyed and we did not take prudent steps to safeguard this
critical business resource, we will all lose our jobs and be subject to
stockholder suits.
"The
loss of approximately $48,000,000 is almost half of our annual profits.
Additionally
the loss of half our customers and the reduced service level for the remainder
will ruin our reputation for service that is our future. We may never recover,
let alone gain back the business we lost in a year on average.
"I
therefore recommend that, as a prudent business executive, you must protect such
a critical resource as our data center and give us the $500,000 per year
additional budget which is an addition of 2% to our total budget."
This
simulated discussion illustrates how subtle aspects in the presentation of
information can significantly impact decision making, particularly when
uncertainty is involved. Throughout the literature on decision making, there is
substantial evidence to suggest that intuitions about risk routinely deviate
from rationality, because executives do not typically appreciate the nature of
uncertainty. This simulated discussion illustrates that deciding how much
safety is worth is very difficult.
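The expected-value arithmetic used by both speakers in the conversation can be reproduced as a short sketch. All figures are taken as quoted in the dialogue; in particular, the $18,000,000 customer win-back loss is used as stated by Mr. Application rather than recomputed, and the function and constant names are hypothetical.

```python
ANNUAL_DISASTER_ODDS = 100        # "once in a hundred year occurrence"
ANNUAL_BACKUP_COST = 500_000      # requested budget increase per year


def expected_annual_value(loss_avoided: int) -> int:
    """Expected yearly benefit of the backup site: loss avoided divided by 100."""
    return loss_avoided // ANNUAL_DISASTER_ODDS


# Mr. Probability: $3,000,000 per day times 10 days of avoided outage.
direct_loss = 3_000_000 * 10
# Mr. Application adds the $18,000,000 customer win-back loss he quotes.
total_loss = direct_loss + 18_000_000

for speaker, loss in (("Mr. Probability", direct_loss), ("Mr. Application", total_loss)):
    benefit = expected_annual_value(loss)
    print(f"{speaker}: benefit ${benefit:,} vs cost ${ANNUAL_BACKUP_COST:,} "
          f"(net ${benefit - ANNUAL_BACKUP_COST:,})")
```

The same division by 100 yields $300,000 or $480,000 per year depending on which losses are counted, which is part of why Mr. Application argues that the calculation should not drive the decision.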
A
BRP Justification Case Study
This
BRP justification case study presents both the probability based method and the
recommended methodology based on prudent business person concepts. The
recommended approach justifies the cost of data center contingency planning
based on the total cost and impact to the organization when their information
technology resources are unavailable. More details on this case study are
available in Waldman [10].
Risk
Analysis
This
example is based on a consulting study, performed in the early 1990's, of
a wholesale distributor's data center contingency planning project. The
planning project determined that the proposed security and data protection plan
would leave their data center vulnerable to natural and man-made disasters, such
as fire, earthquake, utility interruption, and strikes. The objective of the
planning study was to recommend an architecture that provided for continued
processing of their critical applications. These included order processing,
inventory management, accounts receivable, and payroll.
Alternative Scenarios
Four alternative BRP
architecture scenarios were available. These were: a dual center approach, use
of a vendor hot/cold site, use of a current company facility as a cold site, and
continuation of their current approach with no backup resources.
-
The dual center
approach involves the construction of a new secure data center facility, as well
as the splitting of processing between the current and the new center. The
typical approach for this type of architecture is discussed in detail in the
article by Rosenthal [7], entitled "The Emerging Enterprise System
Architecture". The expected maximum down time after a disaster using this
approach is several hours.
-
The vendor hot/cold
site approach involves subscribing to a commercial recovery center with
compatible mainframe systems. The expected maximum down time after a disaster is
several days.
-
The cold site approach
involves the equipping of a current warehouse facility with all environmental
and communication facilities required to quickly install a duplicate of their
current data center. The expected maximum down time after a disaster is several
weeks.
-
A continuation of
their current no backup resource approach will lead to an expected maximum down
time, after a disaster, of several months.
Forecasted Annual Expenses
Table 1, Annual Scenario
Expenses, presents the then-estimated annual expenses of each scenario. The
total annual information technology budget for the organization approximates
$12,000,000. The dual data center alternative with down-time of several hours
represents approximately 6%, and the vendor hot/cold site alternative with
down-time of several days represents approximately 2% of the annual IT budget.
Table 1: Annual Scenario Expenses

                             | Dual Centers | Vendor Hot/Cold Site | Own Cold Site | No Backup
Data Center                  |              |                      |               |
  Annual Vendor Fees         |              | $150,000             |               |
  Site Preparation (1)       | $175,000     |                      | $100,000      |
Telecommunications           |              |                      |               |
  Initial Installation (2)   | $15,000      | $15,000              | $15,000       |
  Annual Cost of Lines       | $40,000      | $60,000              | $40,000       |
Personnel                    |              |                      |               |
  Duplicate Operations Staff | $500,000     |                      |               |
  Testing at Other Site      | $5,000       | $5,000               |               |
  Simulation Testing         | $6,000       | $6,000               | $6,000        | $6,000
  Plan Maintenance           | $20,000      | $20,000              | $12,000       | $8,000
TOTALS                       | $761,000     | $256,000             | $173,000      | $14,000

(1) 7 year amortization (2) 5 year amortization
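The budget percentages quoted for the scenarios can be checked against the scenario totals. This is a minimal sketch that uses only the TOTALS figures from Table 1 and the approximate $12,000,000 IT budget stated in the text; the function and variable names are hypothetical.

```python
IT_BUDGET = 12_000_000  # approximate annual IT budget stated in the text

# Scenario totals as given in the TOTALS entries of Table 1.
scenario_totals = {
    "Dual Centers": 761_000,
    "Vendor Hot/Cold Site": 256_000,
    "Own Cold Site": 173_000,
    "No Backup": 14_000,
}


def budget_share(annual_cost: int, budget: int = IT_BUDGET) -> float:
    """Annual scenario cost as a percentage of the IT budget."""
    return 100.0 * annual_cost / budget


for name, total in scenario_totals.items():
    print(f"{name:<22} ${total:>9,}  {budget_share(total):4.1f}% of IT budget")
```

The dual center and vendor hot/cold site shares come out near the approximately 6% and 2% cited in the text.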
Step 1: Forecasted Annual Losses
The following analysis presents the losses incurred by the organization
during recovery of normal information technology processing, as well as the
losses incurred in winning back any lost customers and reestablishing its
service level reputation.
Step
2: Estimated Order Retention Rates
Figure
1, Projected Order Retention Rates, was derived from interviews with key
marketing and management personnel. As a wholesaler, the organization believes
its customers would switch to alternative suppliers for new orders within four
days. This would be caused by the lack of inventory information and the delays
in picking and shipping caused by reversion to a slow manual operation and a
shortage of trained personnel. They also estimate that reorders of proprietary
items, representing 65% of reorders, would continue, but reorders of generally
available items would stop over a three week period. Figure 1 is derived in the
spreadsheet included in Appendix A of Waldman [10].

Step
3: Estimated Order Rates
Figure
2, Projected Order Recovery Rate After A Disaster, presents the estimated rates
of recovery of orders following an interruption due to lack of IT capability
after a disaster. The key marketing and management personnel believe the firm
could recover approximately 5% of its former order rate per month after return
of full IT capability. However, recovery from the vendor hot/cold scenario is
slower, since all lost orders were new orders. Figure 2 is also derived in the
spreadsheet included in Appendix A of Waldman [10].
Step 4: Forecast Economic Impacts
Table 2, Economic Analysis of Scenarios, computes the impact of the scenarios
under both the Probability and the Fiduciary Responsibility approaches. The
analysis shown in Figure 2 yields the estimated weeks of lost sales shown as the
first data line in Table 2. From a fiduciary responsibility view, the scenarios
have the following impact.
Fiduciary Responsibility Analysis Approach
The use of the recommended fiduciary responsibility approach is illustrated
below for each of the defined scenarios.
Figure 2: Projected Order Recovery Rates

Dual Centers
There are no losses associated with the dual center approach, since the impact
of a disaster is no different from that of a routine interruption due to
hardware, software, telecommunication, or utility failures. The cost of this
alternative approximates 6% of the total IS budget, and 1.2% of the firm's
operating expenses. The firm, as is typical of most distributors, did not
believe the cost of dual data centers was worth saving a day's down-time.
Vendor Hot/Cold Site
Losses from a disaster using this alternative will be approximately $500,000.
This represents, as shown on the third data line of Table 2, about 1% of annual
profits. This level of loss is acceptable, given the minimal probability of a
disaster to the data center. The cost of approximately $250,000 for this
alternative is 2% of data center costs. This is the alternative selected by the
firm, an action typical of most business organizations.
Own Cold Site
This approach would lead to an estimated loss of approximately $15,000,000,
which represents 40% of annual profits. This has a major impact on the company.
Management thought this type of loss would cause the board to totally replace
the management of the firm, and might result in selling it to a competitor.
This level of loss was simply not acceptable.
No Backup
This approach would lead to an estimated loss of approximately $27,000,000.
This represents a loss of about 70% of annual profits, which would be a
disastrous impact on the company. The board would immediately have to sell the
firm, or cease operations. This level of loss was totally unacceptable.
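The loss figures used in the fiduciary analysis follow from the Value Added Analysis block of Table 2. A minimal sketch of that arithmetic is shown here, assuming a 52-week year (which reproduces the table's $1,201,923 value added per week); the variable and function names are illustrative. Because the weeks of lost sales in Table 2 are rounded to two decimals, the computed losses for the two middle scenarios differ slightly from the table's dollar figures.

```python
# Inputs from the Value Added Analysis block of Table 2.
ANNUAL_SALES = 250_000_000
VALUE_ADDED_PCT = 0.25
PROFIT_PCT = 0.15
WEEKS_PER_YEAR = 52  # assumption; reproduces the table's weekly figure

value_added_per_week = ANNUAL_SALES * VALUE_ADDED_PCT / WEEKS_PER_YEAR
annual_profit = ANNUAL_SALES * PROFIT_PCT

# Weeks of lost sales per scenario (first data line of Table 2).
WEEKS_OF_LOST_SALES = {
    "Dual Centers": 0.00,
    "Vendor Hot/Cold Site": 0.41,
    "Own Cold Site": 12.21,
    "No Backup": 22.18,
}


def disaster_loss(weeks_lost):
    """Value added lost after a disaster, in dollars."""
    return weeks_lost * value_added_per_week


def share_of_profit(loss):
    """Loss expressed as a percentage of annual profit."""
    return 100.0 * loss / annual_profit


for name, weeks in WEEKS_OF_LOST_SALES.items():
    loss = disaster_loss(weeks)
    print(f"{name:22s} ${loss:>13,.0f}  {share_of_profit(loss):5.1f}% of profit")
```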
Probability Analysis Approach
The data for a probability analysis of the distributor's data center
contingency planning is included in Table 2.
Table 2: Economic Analysis of Scenarios

                              Dual          Vendor Hot/   Own Cold      No
                              Centers       Cold Site     Site          Backup
Long Term Order Loss
  Weeks of Lost Sales         0.00          0.41          12.21         22.18
  Value Added after Disaster  $0            $491,587      $14,680,288   $26,658,654
  Percentage of Profit        0%            1%            39%           71%
  Annual Expenses (Table 1)   $761,000      $256,000      $173,000      $14,000

Value Added Analysis (common to all scenarios)
  Average Annual Sales        $250,000,000
  Value Added Percentage      25%
  Annual Value Added          $62,500,000
  Value Added per Week        $1,201,923
  Profit Percentage           15%
  Annual Profit               $37,500,000

Probability Based Analysis
  Long Term Order Loss        $0            $491,587      $14,680,288   $26,658,654
  Averted Loss                $26,658,654   $26,167,067   $11,978,365   $0
  Probability of Disaster     0.01          0.01          0.01          0.01
  Annual Averted Loss         $266,587      $261,671      $119,784      $0
  Annual Expenses             $761,000      $256,000      $173,000      $14,000
  Return on Investment (ROI)  -65.0%        2.2%          -30.8%        -100.0%
The result of a typical analysis of the data follows.
Dual Centers
This alternative averts all losses, since recovery takes only a matter of
hours or shifts. Using the typical one percent probability, the averted annual
cost is approximately $270,000. This potential gain is balanced against the
annual additional expenses of approximately $760,000, a negative ROI of almost
65%. This alternative would be considered impractical for firms with this level
of down-time sensitivity.
Vendor Hot/Cold Site
The annual averted loss and the annual expenses of this alternative are
approximately equal. Therefore, this alternative is a break-even option under
the insurance-based probability analysis approach. From a study of the
literature, it appears that many other firms have also reached this conclusion.
The popularity of this contingency planning alternative, as shown by the
success of many firms in the backup site business, appears to be based on the
conclusion that fiduciary duties can be met without it costing anything (i.e.,
a break-even investment with a low initial cost).
Own Cold Site
This alternative clearly leads to a significant negative ROI for this firm.
Firms using the probability approach normally do not authorize this
alternative, except when an old data center is available, thereby eliminating
initial costs and creating a break-even situation.
No Backup
This alternative involves almost no expenditure, but leaves the organization
open to potentially disastrous losses from losing its data center. This
alternative's popularity is probably based on the belief that the data center
is well protected, and therefore will not be destroyed, as well as the reality
that the manual information systems of the organization are in this same
condition, and just as critical to the organization's operations.
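The probability-based rows of Table 2 can be reproduced with a short calculation. The sketch below assumes that the averted loss is measured against the No Backup loss and that ROI is (annual averted loss - annual expenses) / annual expenses; the paper does not state these formulas explicitly, but they match the published values. All names are illustrative.

```python
# Long term order loss and annual expenses per scenario, from Table 2.
SCENARIOS = {
    "Dual Centers": (0, 761_000),
    "Vendor Hot/Cold Site": (491_587, 256_000),
    "Own Cold Site": (14_680_288, 173_000),
    "No Backup": (26_658_654, 14_000),
}

WORST_CASE_LOSS = 26_658_654   # loss with no backup at all
P_DISASTER = 0.01              # "typical one percent probability"


def probability_roi(long_term_loss, annual_expenses,
                    worst_case=WORST_CASE_LOSS, p=P_DISASTER):
    """Return (annual averted loss, ROI) under the assumed formulas."""
    averted_loss = worst_case - long_term_loss
    annual_averted = averted_loss * p
    roi = (annual_averted - annual_expenses) / annual_expenses
    return annual_averted, roi


results = {name: probability_roi(loss, cost)
           for name, (loss, cost) in SCENARIOS.items()}

for name, (annual_averted, roi) in results.items():
    print(f"{name:22s} averted ${annual_averted:>10,.0f}/yr  ROI {roi:7.1%}")
```

Only the vendor hot/cold site comes out at or above break-even, which is the conclusion the discussion above draws from the table.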
Summary of BRP Case Study
This case study is not unusual, in that the two methods result in the same
recommended decision. The fiduciary responsibility approach normally leads to
selection of an acceptable backup plan, while the probability approach, as
described in Ozier [5], may sometimes lead to the high-risk, no-backup approach.
"Threat
events having a low-frequency, high-impact risk ... may have a low probability
of loss that encourages management to take risks unduly."
This concern about possible high-risk approaches is also illustrated by the
case study described in Engemann & Miller [2, p. 143].
"Finally,
management felt that qualitative factors related to the marketplace reaction to
a severe loss that resulted from inadequate contingency plans had to be factored
into the analysis, even if such losses were eventually covered by
insurance."
Contents of a Usable Business Resumption Plan
The critical elements in a usable BRP are the team organization and procedures
needed to efficiently move to the backup location and resume productive work,
and the backup facilities and equipment that can actually be used to perform the
critical business functions affected by the disaster (Rosenthal and Himel [8]).
Functions of BRP Teams
Most organizations with mature business resumption plans have a three-tier BRP
team organization structure including:
- Top tier: Policy Group
- Second tier: Disaster Management Team (DMT)
- Third tier: Emergency Response Teams (ERT)
The
top tier Policy Group consists of upper-level executives who are available for
approving major DMT decisions involving customer service impact, major
expenditures or major potential liabilities. For example, after the Bay Area
earthquake a major bank opened their branches the next day without power and
full cleanup and repairs. The ability to provide much needed cash to customers
was deemed more important than the potential for accidents or robberies.
The
middle tier DMT includes representatives of key departments and functions
involved in life-safety and business contingency planning. The following table
lists the functional organizations often represented on a DMT.
Selecting the chairperson of the DMT is often a difficult and politically
sensitive decision. The pressure to appoint a senior executive should be
resisted. Senior executives belong in the Policy Group among their peers. The
chair of the DMT, and therefore the coordinator of the Emergency Operations
Center (EOC), should be an extremely knowledgeable peer of the other members of
the DMT. The chair should not, however, be associated with any ERT. The chair is
frequently the supervisor of the Project Head, Business Continuity Planning.
The
third tier consists of a large number of Emergency Response Teams (ERT). For
example, the data processing area might have specialized logistics, backup data
center operations, network operations, and user support ERTs. The safety area
might include a dozen or more ERTs with first aid and evacuation
responsibilities, each headed by a floor warden.

Functions of a Policy
Group
The responsibility of the Policy Group is to authorize out-of-the-ordinary
expenditures required by emergency operations, as well as to set policies
primarily impacting stockholders and the public.
Its members must take the time to carefully consider the long-term impact of
the operational decisions being made by the DMT and the ERTs. Therefore, the
team is made up of a variety of company executives, including legal, public
relations, human relations, and financial experts, and is normally the only
team not staffed completely by personnel with primarily day-to-day operations
responsibilities.
Functions of a Disaster
Management Team (DMT)
During a disaster the DMT has three primary functions:
- Coordinating the efforts of emergency response teams to assure the safety of
personnel and to minimize the damage to their facilities following a disaster.
A life-safety DMT is normally organized for every major facility or campus.
- Planning and coordinating emergency operations and restoration of normal
operations following a disaster. A business continuity DMT is normally
responsible for a total business unit, frequently involving multiple and
widespread facilities.
The DMT performs its functions from the organization's Emergency Operations
Centers (EOCs). EOCs observed by the author are of two basic structural types:
the single conference room approach, and the dual room approach.
Conference Room EOC Approach
The most common and least expensive approach is the converted conference room.
Large conference rooms at two or more widely separated locations are converted
to EOCs. Furnishings and equipment required include:
- Telephone consoles for each participant, including an EOC rotary line, a
dedicated incoming line for each function, and a line for outgoing calls.
- TVs and radios to monitor news and public announcements.
- White boards, tack boards, and flip charts.
- Facility maps and area maps with medical and emergency service facilities
identified.
- Multiple radios with multiple channels for use in communicating with emergency
response teams and the outside world. At least one of the EOCs should house a
portable satellite communication unit.
- Room power connected to the building's emergency power system.
- Food, water, and rest facilities for primary and alternate DMT members.
California
firms often have Los Angeles and San Francisco EOCs and DMTs because of the
possibility of an area wide disaster due to a major earthquake. Other areas of
the world may not need this much separation between locations.
Dual Room EOC Approach
The dual room EOC approach provides contiguous space for both the Policy Group
and the DMT. A glass wall between the two rooms permits the Policy Group to
monitor DMT activities and observe status boards and displays. Parallel decision
making is enhanced, permitting continuous emergency operations control while
significant policy decisions are being made.
The
dual room EOC is normally used by organizations with frequent operational
emergencies, such as utilities exposed to power outages or pipeline breaks. The
EOC is used for both operational emergencies and for disasters affecting
non-operational facilities and personnel. A second conference room type EOC is
also normally available at a site remote from the primary EOC.
EOC testing involves two functions: a periodic walk-through of all equipment by
the Project Head, Business Continuity Planning, and periodic DMT simulations
performed in the EOC.

Functions
of Emergency Response Teams (ERT)
The
activities of emergency response teams following an emergency must be closely
coordinated and adapt swiftly to the type of disaster and its evolving impacts.
Emergency response teams can include such areas as: policy, emergency operations
center management (DMT), facility acquisition and management, site and equipment
recovery, backup data center operations, logistics and transportation, off-site
storage coordination, floor wardens, assembly site coordinators, public/employee
communications, telecommunications, finance and insurance, etc.
Staffing
these teams is a significant problem. Each of the functions of the team
(including team leadership and around the clock coverage) must have a primary
and backup person assigned.
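The staffing rule above (a primary and a backup person for every team function) lends itself to a simple roster check. The sketch below is one hypothetical way to record assignments and list the gaps; the `Assignment` structure, the function names, and the sample roster are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Assignment:
    """One team function with its primary and backup person (hypothetical)."""
    function: str
    primary: Optional[str] = None
    backup: Optional[str] = None


def staffing_gaps(assignments):
    """Return (function, problem) pairs for every incomplete assignment."""
    gaps = []
    for a in assignments:
        if not a.primary:
            gaps.append((a.function, "no primary assigned"))
        if not a.backup:
            gaps.append((a.function, "no backup assigned"))
        elif a.backup == a.primary:
            gaps.append((a.function, "backup is the same person as primary"))
    return gaps


# Sample roster with invented names.
roster = [
    Assignment("team leadership", primary="A. Lee", backup="B. Cruz"),
    Assignment("night shift coverage", primary="C. Diaz"),
    Assignment("logistics", primary="D. Kim", backup="D. Kim"),
]

gaps = staffing_gaps(roster)
for function, problem in gaps:
    print(f"{function}: {problem}")
```

A check of this kind could be run whenever the plan is maintained, so that staff turnover does not silently leave a function uncovered.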
Work locations must be assigned, and intra-team and inter-team communications
planning must be assured. Does your plan really define to whom the
responsibility for handling each problem has been assigned? Some typical
questions follow.
- Who has the authority to declare a disaster and authorize expenditure of funds?
- Who decides what to tell BRP team members, other employees, customers, and the
media?
- Is there an inventory of available space and equipment?
- Have all business functions been prioritized so that the facility acquisition
and management team can quickly assign space to displaced organizations?
- Are the teams staffed and led by persons with the day-to-day operations
knowledge required for effective emergency operations?
- Are all sites stocked with emergency food, water, medical supplies, and other
equipment needed following a disaster?
- Are realistic life-safety and assembly drills periodically conducted at all
sites?
- Are there adequate security arrangements for damaged or evacuated sites?
- Is there a HELP desk planned with sufficient telephone capacity to properly
forward calls from media, employees, family of employees, BRP team members,
customers, and suppliers?
- Are the auditors assuring that up-to-date copies of all critical records and
data are stored off-site at a secure facility?
- Do you really know what your insurance coverage is for damage, injuries, and
business interruption?
- Is there an organization responsible for assuring that all business functions
and locations have developed a realistic BRP and are adequately testing both the
operational and management aspects of the plan?
The determination of emergency team functions and reporting structure depends
on individual firm and site characteristics. The team structure described is
typical of a large operational facility housing several clerical organizations
and a major data center with a distant commercial backup data center.
Operations
Center BRP Teams
This
team evaluates the extent of damage to the facility and informs the DMT of the
estimated time required to rebuild the damaged facility. The team then assumes
the responsibility for restoring the current facility or creating a new
facility.
This team consists of personnel and public relations staff with responsibility
for collecting information on the status of operations, facilities, and
personnel and communicating relevant information to the media, employees,
customers, and vendors.
This
team is made up of representatives from each of the functions occupying the
damaged facility as well as members from each data processing application
support group impacted by the disaster.
Their
role is to schedule and coordinate initial and continuing emergency operations.
Responsibilities
include providing emergency cash and payments, physical security at damaged and
backup sites, commuting & lodging support, handling insurance claims, and
keeping records of emergency costs & expenditures.
Operations
Center Life/Safety Teams
These
teams are responsible for personnel evacuation or lodging following a localized
or area-wide disaster. They often include:
Staffed by volunteer employees trained in first aid and in evacuation methods.
Responsible for coordinating evacuation or lodging of occupants in a specific
floor or area, as well as performing first aid and communicating with the EOC.
Facility Management Team
Staffed
by physical plant operations personnel. Responsible for operating or shutting
down the facility after a disaster.
Physical Security Team
Responsible
for maintaining security at damaged and at temporary locations.
Information
Systems Emergency Operations Teams and Positions
These
types of teams are responsible for business resumption of critical functions
occupying the impacted facility. The data center emergency operations teams
described are also typical of the type of teams and positions often needed by
other functions occupying a typical operations center.
IS BRP Coordinator
Responsible for coordinating the IS recovery and supervising all other IS BRP
teams. The coordinator is normally a member of the DMT and is located in the EOC.
IS Backup Center Operations Team
Responsibilities include computer, data communications, and peripheral
operations; establishing the data processing schedule during catch-up;
disseminating processing output; and providing the Operations Coordination Team
with timely status reports.
IS Logistics & Supplies Team
Responsibilities
include transportation, courier, shipping & receiving, and library &
warehousing during emergency operations. This includes retrieval of data,
software, and documentation from off-site locations.
IS Operations Support Team
This
multi-discipline team's responsibility is to support emergency IS operations.
Staff includes technical (systems software), applications development, and data
& voice communications support professional personnel.
IS Specialized Resources Operation Teams
These teams interface with or operate sites with specialized IS resources, such
as page printers, micrographics, and check sorters.
Data
Center Backup Architectures
An
organization's IS architecture must assure near continuous availability of both
data centers and telecommunication networks. Both internal and external
resources are available to offer the backup resources needed to assure the high
availability required by most business resumption policies.
Data
Center Backup Approaches
There are three major approaches to Information Systems (IS) architectures for
protecting critical IS processing from interruptions or disasters. They include
the use of a commercial backup data center, the use of multiple in-house data
centers, and the distribution of processing to multiple user locations
(Rosenthal [7]).
Commercial
backup data centers offer facilities that permit reactivation of critical
processing within 24-36 hours using their hot site, and reactivation of
non-critical processing within 1-2 weeks using their cold site. Organizations
with a single data center that can tolerate this type of delay find the use of a
commercial backup site both cost effective and practical.
Organizations
with a small number (normally two to four) of large decentralized data center
locations can often use, within 12-24 hours, development and non-critical
processing capacity as backup hot site resources. Rapid upgrading of equipment
can be implemented in place of a cold site.
ELECTRONIC ARCHIVAL: A HIGH PROTECTION BACKUP ARCHITECTURE

A
typical architecture for dual data centers using electronic archival is shown in
Figure 3. The production data center normally will contain an online and an
information center (MIS/DSS) system. The backup (development) center would then
contain the development system and space to quickly add an additional system.
Recovery after a disaster or major interruption at the production center
consists of posting today's transactions from the log tape and activating
communication lines terminating at the backup (development) center.
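The recovery step described above, posting the day's transactions from the log onto the most recent copy held at the backup center, can be sketched as a simple log replay. This is an illustration only: the record layout (an account key with a signed amount), the function name, and the sample data are assumptions and are not taken from the paper.

```python
def post_log(backup_copy, log_records):
    """Rebuild the current state by posting logged transactions, in order,
    onto the last backup copy. The backup copy itself is left untouched."""
    state = dict(backup_copy)  # work on a copy so the archive stays intact
    for record in log_records:
        key = record["account"]
        state[key] = state.get(key, 0) + record["amount"]
    return state


# Hypothetical data: last night's archived balances and today's log.
last_backup = {"A-100": 500, "B-200": 1200}
todays_log = [
    {"account": "A-100", "amount": -125},
    {"account": "C-300", "amount": 300},
    {"account": "A-100", "amount": 50},
]

recovered = post_log(last_backup, todays_log)
print(recovered)
```

The order of the log matters in general; here the postings are additive, so the sketch simply applies them in the sequence recorded.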
The problem in using multiple in-house data centers to back up each other is
maintaining compatible configurations and systems software versions. Very rigid
centralized control of data center configurations and standards is required.
Many
organizations have dozens to hundreds of similar function facilities. When a
data center suffers a disaster, the total facility that it supports is normally
also affected. The BRP policy is frequently to shut the facility until repaired,
and transfer operations to neighboring locations.
Figure 3: TYPICAL DUAL DATA CENTER BRP ARCHITECTURE

Telecommunication
Network Backup Approaches
Historically, many
organizations have leased voice grade multi-drop telephone lines to support an
individual application's data communication requirements. Implementing BRP for
networks of multi-drop data lines is often performed by adding an additional
drop at the backup data center to each line. When this approach is infeasible
because of the distance to the backup center, the dial backup capability of
their modems is used to connect both to their data center in the event of a line
outage and to the backup center in the event of a disaster.
The recent availability of inexpensive multiplexers and concentrators, and of a
wide variety of cost-effective high speed lines, has increased the use of trunk
connections linking multiple user locations to their data center. Multiple user
locations are now being interconnected to data centers through a high speed
backbone network that requires a high level of protection from interruptions
and disasters. There are two major approaches to assuring high levels of
availability for these backbone telecommunication networks. They include
building redundancy into the network and/or using switched digital circuits
from a common carrier.
High speed trunk
oriented data networks based on regional or major site controllers should be
configured to include route redundancy. The redundancy is valuable, not just for
BRP purposes, but also to handle anticipated load variations and to permit
maintenance of equipment and circuits without interrupting service.
All of the commercial backup data centers have switched circuit capability for
connecting the backup center to a customer's regional or site communication
controllers. In under an hour, several common carriers can reconfigure a
client's network, switching the client data center out of the network and the
backup center into the network.
An example of a network using dial backup, network redundancy, and switched
broadband circuits is shown in Figure 4. Remote sites are connected to regional
concentrators with multiple routes to the data center. These concentrators also
have switched broadband capability to connect to the backup center for BRP
purposes. Sites or terminals close to the data center have voice grade dial
backup capability to reach the backup center.
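The layered network protection just described can be summarized as a small decision rule: use any working route to the data center in normal operation, and switch to the backup center over switched broadband or dial backup when the data center itself is lost. The sketch below is a hypothetical illustration of that rule; the site records, route names, and function name are invented.

```python
def choose_connection(site, data_center_up, route_up):
    """Pick a (destination, link) pair for one site.

    site:      dict with 'routes' (ordered list of route names) and the
               booleans 'switched_broadband' and 'dial_backup'
    route_up:  dict mapping a route name to True when that route is working
    """
    if data_center_up:
        # Normal operation: route redundancy covers the loss of one route.
        for route in site["routes"]:
            if route_up.get(route, False):
                return ("data center", route)
        return (None, None)  # cut off until some route is restored
    # Disaster at the data center: connect to the backup center instead.
    if site.get("switched_broadband"):
        return ("backup center", "switched broadband")
    if site.get("dial_backup"):
        return ("backup center", "dial backup")
    return (None, None)


# Hypothetical sites: a regional concentrator and a site near the data center.
regional = {"routes": ["route A", "route B"],
            "switched_broadband": True, "dial_backup": False}
local = {"routes": ["local line"],
         "switched_broadband": False, "dial_backup": True}
status = {"route A": False, "route B": True, "local line": True}

print(choose_connection(regional, True, status))
print(choose_connection(regional, False, status))
print(choose_connection(local, False, status))
```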
The economics of implementing this type of backbone architecture as part of a
BRP program are very favorable. Broadband digital links are highly reliable and
are starting to be priced at rates highly competitive with multi-drop voice
grade lines. Many firms have achieved slight reductions in cost by consolidating
their various application oriented networks while simultaneously adding
redundancy and/or switched capability to meet BRP requirements.
Manual System Backup
Approaches
Backup methods for manual records/information systems tend either to be
expensive and to utilize specialized equipment, or to be not very safe. This
problem may explain, but it does not excuse, the lack of effective BRP
arrangements for most critical manual systems. The following types of backup
methods are only representative of the multitude of architectures available
when creative managers are faced with the executive demand for a realistic BRP
for all critical business functions. These various backup methods can be
categorized according to whether the manual processing will continue to be
performed on paper or by using other media (primarily micrographic or IT image
systems).
Paper-Based Processing Backup Alternatives
Paper-based processing seldom survives a quality business process
reengineering (BPR) study. However, a BPR is seldom performed unless extensive
automation has already occurred in that business unit. Therefore, the following
approaches are the most common result of a demand for a BRP.
Only paper records currently in use are removed from the file room or safe.
This approach gives good protection during non-working hours. However, paper
records are seldom removed and returned individually, because of the
inefficiencies involved. Also, in the event of a fire, earthquake, bomb scare,
etc., staff do not return current records to the secure area and, in fact,
seldom close the rooms or safes. These approaches give only fair protection, and
should not pass audit when the records are critical to the survival of the
organization.
Few business processes do not update the majority of records accessed. This
approach, therefore, is seldom used. It is, however, very effective and safe
when feasible.
This type of processing involves the use of non-computer based storage for
processing media. The most common types of media are microfilm/microfiche and
image mass storage systems.
Micrographic media for use in processing are very common when most activity
is requests for information, and all actions generate new records that can be
filmed and archived. This approach, when applicable, is very effective and safe.
Image systems are normally used for the same type of applications as
micrographic media. They can, however, automatically index new transactions
affecting a master record. This permits their use in more applications than
micrographic systems. This approach, when applicable, is also very effective
and safe.
CONCLUSION
Business resumption planning should be an integrated portion of a total
security program. The security program should cover physical security of
facilities and equipment, data security of automated files and manual records,
protection of all levels of personnel, and business resumption planning.
Business resumption planning needs to be an integral part of doing business. For
example, IBM internal policy, as stated in their Corporate Disaster Recovery
Planning Standard (Policy Number 209), directs all operating and staff units of
the company to develop plans for any emergency that results in either a
significant loss of assets or revenue flow, or renders the organization unit
unable to meet customer commitments or protect the interests of stockholders and
employees.
Executives of all organizations have a fiduciary responsibility to take
prudent steps to assure the survival of their
organization following a natural or man-made disaster. Providing the necessary
funds and leadership for a quality business
resumption planning program for all critical business functions, both IS and
manually oriented, is a key portion of that responsibility.
BIBLIOGRAPHY
1. Andrews, W.C. "Contingency Planning for Physical Disasters",
Journal of Systems Management, 41:7, 28-32,
July 1990.
A short but comprehensive description of the why and how of justifying and
producing a data center BRP.
2. Engemann, Kurt J., and Holmes E. Miller. "Operations Risk Management
at a Major Bank", Interfaces, 22:6, 140-49, November-December 1992.
Presents a decision analysis framework for making risk management decisions.
3. Lamond, B.J. "An Auditing Approach to Disaster Recovery",
Internal Auditor, 47:5, 38-48, October 1990.
A survey of the DRP preparation cycle including an introduction to
operational testing and plan maintenance.
4. Metzger, Michael B., et al. Business Law and the Regulatory Environment:
Concepts and Cases. Chicago: Richard D.
Irwin, Inc.: 867-69, 1995.
This book defines the duty of care and fiduciary responsibility of officers and
directors of corporations. It states that the Model Business Corporation Act
requires officers to act in good faith, and with the care a prudent person, in a
like position, would exercise under similar circumstances, as well as in a
manner they reasonably believe to be in the best interest of the corporation.
5. Ozier, Will. "Issues in Quantitative vs. Qualitative Risk
Analysis," Managing IT/IT Solutions. Delran: Datapro
Information Services Group, report 6055 (1994): 1-7.
A detailed comparison of the quantitative (probability) and qualitative
(fiduciary responsibility) approaches and
their impact on managerial decisions.
6. Rohde, R. and Haskett, J. "Disaster Recovery Planning for Academic
Computing Centers", Communications of
the ACM, 33:652-657, 1990.
A step by step description of producing a BRP for a university data center.
7. Rosenthal, P. "The Emerging Enterprise Systems Architecture",
Journal of Systems Management, 45:2;16-21,
February 1994.
8. Rosenthal, P. and Himel, B. "Business Resumption Planning: Exercising
Your Emergency Response Teams",
Computers & Security, 10:497-514, 1991.
A detailed description of the simulation testing of a data center disaster
plan, including a complete script of an actual exercise.
9. Rosenthal, P, and Sheiniuk, G. "Exercising the Business Disaster
Team", Journal of Systems Management,
38:4;12-16 & 38-42, 1993.
A detailed description of the simulation testing of a business continuity and
life-safety disaster plan, including a complete script of an actual exercise.
10. Waldman, Jan I. A Methodology for Justification of Business Resumption
Planning Based on Fiduciary Responsibility
Considerations, Unpublished masters thesis, California State University, Los
Angeles, 1995.
A detailed description, with examples, of the use of the prudent person BRP
justification approach.
11. Wong, K. K. Risk Analysis and Control - A Guide for DP Managers, Hayden
Book Company Inc., 1997.
The classic presentation of the quantitative approach to risk analysis.
Contains a description of the statistical, IBM, and NCC [National Computing
Center] approaches to risk evaluation, as well as a good description of risk
control.
Paul H. Rosenthal, PhD is a Fellow of The Business Forum Institute and Professor
of Information Systems at California State University, Los Angeles. Dr. Paul
Rosenthal has for many years taught a wide variety of courses encompassing
information systems technology, business management, political economy, and
systems audit and assessment. He is recognized as one of the leading experts in
Business Continuity and Disaster Recovery Planning.
Paul received a Bachelor's degree in Education and a Master's degree in Applied
Mathematics from Temple University, an MBA from UCLA, and a DBA from USC. Prior
to joining CalstateLA, he spent more than thirty-five years in industry as a
professional, both as a manager and as a consultant. His recent research
interests involve business continuity management, IS/IT education assessment,
IS/IT Infrastructure Planning, and advanced Technology Systems Assessment.
The Business
Forum Beverly Hills, California, United States of America
© Copyright The Business Forum Institute, 1982 - 2015. All rights reserved.