Exercising the Disaster Management Team
Author: Professor Paul H. Rosenthal, PhD
Abstract

Disaster simulation exercises are used to test the staffing, management, and decision making of both the computer and non-computer related aspects of an organization's business continuity and life-safety plans. Special simulation methods must be used to exercise the Disaster Management Team and their Emergency Operations Center. A proven approach to designing and conducting this type of simulation is presented, including a full script from an actual simulation exercise.

Keywords:
During the 1980s and 1990s, contingency planning evolved from data center backup planning to business resumption planning (BRP) and, recently, to business continuity management (BCM). Business resumption planning involved arranging for emergency business and data center operations and recovery planning following a disaster. The growth of commercial backup data centers has made inexpensive data center contingency resources available to all but the largest organizations. These commercial hot/cold backup sites are available for testing critical applications, so that by the late 1980s most organizations had fully tested data center contingency plans. However, the vast majority of data center users had only a vague idea of how they might operate during a disaster that destroyed the data center or their personnel-oriented operating locations.
Data processing management has gradually persuaded its users that they need a BCM plan that integrates with the data center plans. It has also persuaded many of them that the data center and the users' business continuity plans need to be tested simultaneously through disaster simulation exercises similar to those described in this paper and by Rosenthal and Himel [2]. Unless these BCM plans are periodically tested, however, they are seldom usable operationally. Plans that were initially viable become obsolete very quickly unless periodic tests force every department and work group to maintain, off-site, up-to-date contact lists; files and processing resources; communications resources; and current programs, procedures, and forms.

Disaster simulation exercises are now widely used to exercise the staffing, procedures, and resources for both the computer and non-computer related aspects of an organization's business continuity management and life-safety plans. The scenarios and simulation methods normally used for these exercises are designed to test the various functional teams charged with recovering business operations while assuring the safety of personnel and facilities. The disaster management team (DMT), charged with coordinating the actions of the functional teams during notification, mobilization, activation, emergency operations, and recovery, is normally involved in these simulations only as an observer. This lack of involvement results from the need to use highly structured scenarios that include the expected decisions of the management team. Separate simulation exercises, as described in this paper, are therefore normally required to test the preparation of the DMT and the configuration of its Emergency Operations Center (EOC).
Functions of Operational and Simulation Testing

There are two primary activities involved in testing a BCM plan and the DMT:

Operational Testing: Performing critical computer and non-computer related tasks using backup resources and facilities.

Simulation Testing: Performing the notification, mobilization, activation, emergency operations, and recovery phases of a BCM plan based on a typical disaster.

The methodologies for operational testing are well known. Organizations with available backup resources normally test their emergency operations capability adequately. However, the use of simulation for testing the managerial aspects of a BCM plan is rare, and the methods for planning and conducting such simulations are poorly understood. This paper therefore presents a proven methodology for BCM plan simulation that has been used in several simulations conducted in the Los Angeles basin. In addition to the methodology, a script is included from an exercise conducted by a major Los Angeles financial firm.

Business Continuity Management Planning Life Cycle

Rothberg [6] defines a disaster as "...any event that causes significant disruption to operations, thereby threatening the business' survival." Business resumption planning (BRP), the newest term for disaster recovery planning, can be conceptually divided into three major phases: prevention, planning, and testing. Figure 1 lists the major life cycle tasks needed to protect against such disasters. Exercising the disaster management team (DMT), the subject of this paper, is the last step in the BRP life cycle. It is usually the least performed activity in that life cycle.
An excellent example of a BCM plan for a university data center can be found in Rohde and Haskett [5]. They, as do most other authors, stop short of the testing phase. Without both periodic operational testing (performing critical business functions using backup resources) and simulation testing (exercising the decision making portions of the plan), a plan quickly becomes unusable.

Phase I- Prevention

The first steps in a BCM plan procedure are to determine the possible extent of exposure to a disaster, and then to minimize the probability of a disaster occurring. The initial step in any BCM plan procedure is obtaining the substantial funding normally required. This requires selling the Board of Directors on the reality of a possible disaster and its impact on the ability of the organization to survive. According to Yetter [8], an inadequate understanding of the potential threats and their possible impact is often the weak link in disaster security and recovery programs. His paper is an excellent presentation of the quantitative approach to threat evaluation. The results of such a quantitative study are used to rate the potential severity of each hazard as a guide for prevention and recovery spending. The detailed quantitative approach to risk analysis is popular with government and large industrial firms with major consulting budgets. The board of directors of most firms, however, responds better to a fiduciary responsibility analysis. A list of risks to which the firm's facilities and personnel are exposed is presented, and a case study approach is used to demonstrate realistic risk exposure. Estimates are made of the financial impact on various business functions, computer related and non-computer related, of a loss in resource capability. When the impacts include financial or service level losses that can affect the firm's survival, the board members' fiduciary responsibility requires a prudent level of protection and recovery capability.
Funding for an adequate BRP is then made available, often as a priority project. Physical security planning primarily involves access controls, fire and water protection, earthquake and storm hardening, and critical records security. Most firms have a physical security program in place covering these areas prior to the implementation of a BRP program. The second step in the BRP is therefore simply an assessment of the program, and improvement if necessary. The author's experience indicates that the critical records area, particularly for non-computerized files, is frequently the major weak point. Data security and protection programs are not as widespread as physical security programs. Few firms have high quality data oriented security programs, particularly in the personal computer area and for non-financial & personnel manual records involving off-site backup of critical records. This area frequently requires a major effort.

Phase II- Planning

The disaster planning process outlined in Figure 1 is often initiated by the data center, as it implements applications critical to the day-to-day operations of the organization. Selling data processing oriented disaster planning to the board often alerts it to the risk presented by the non-computerized portions of the firm's operations and, as discussed in Orr [4], a total recovery planning effort is initiated. A good overview of the BRP process can be found in Janulaitis [3].

Phase III- Testing

Desk top walk through - Prior to any detailed testing, key stakeholders in the BRP are convened in a conference room, and a detailed review of the plan is performed. Many small events are described and the participants are asked to state how the plan would guide their reactions. The events should require utilization of major backup resources, emergency operations approaches, and all emergency response teams. Following this step, operational and simulation tests are scheduled.
Operational testing - Few organizations operationally test the complete disaster reaction cycle of activation, life-safety, damage assessment, mobilization, emergency operations using off-site files and backup resources, and recovery planning. Only the data processing emergency operations area can be tested without involving a substantial number of persons during business hours. The scope of most operational tests therefore includes: a semi-annual off-hour call to the manager of data center operations, assembly of the backup site operations team, acquisition of backup materials from an off-site location, travel to a backup hot/cold site, installation of systems and applications software, loading of production data, and a systems test of several critical applications.

Simulation testing - Simulation is the most feasible approach for testing the decision making aspects of disaster reaction activities. The use of simulation exercises for BRP has been spreading slowly over the last decade. Unlike their military counterparts, war games that use computer driven scenarios to perform very realistic exercises, BRP exercises are paper and pencil simulations. Teams are placed at tables representing their backup locations, and the description of an evolving disaster is presented. The teams communicate using backup communication resources or forms and make decisions, and everyone pretends that what is ordered actually happens. Debriefings and evaluation studies follow to correct any flaws in the BRP. Most simulation exercises are very successful in that they force personnel to learn the BRP while working together and to find flaws and inconsistencies in policies and plans.

The remainder of this paper presents a detailed methodology and the disaster scenario used recently by a major Los Angeles firm for performing such a simulation exercise for their Disaster Management Team.
Details of a similar approach for the simulation testing of Emergency Response Teams representing business functions or operational activities can be found in Himel and Rosenthal [2].

Functions of BRP Teams

Most organizations with mature business resumption plans have a three tier BRP organization structure (for an example see Coleman [1]), including:
The top tier Policy Group consists of upper-level executives who are available for approving major DMT decisions involving customer service impact, major expenditures, or major potential liabilities. For example, after the Bay Area earthquake a major bank opened its branches the next day without power and without full cleanup and repairs. The ability to provide much needed cash to customers was deemed more important than the potential for accidents or robberies.

The middle tier DMT includes representatives of key departments and functions involved in life-safety and business contingency planning. Figure 2 lists the functional organizations often represented on a DMT. Selecting the chairperson of the DMT is often a difficult and politically sensitive decision. The pressure to appoint a senior executive should be resisted. Senior executives belong in the Policy Group among their peers. The chair of the DMT, and therefore the coordinator of the EOC, should be an extremely knowledgeable peer of the other members of the DMT. The chair should not, however, be associated with any Emergency Response Team (ERT). The chair is frequently the supervisor of the Project Head, Business Continuity Planning.
The third tier is made up of a large number of Emergency Response Teams (ERTs). For example, the data processing area might have specialized logistics, backup data center operations, network operations, and user support ERTs. The safety area might include a dozen or more ERTs with first aid and evacuation responsibilities, each headed by a floor warden.

Periodic Testing of your BRP

Every six months your plans should be operationally tested using your backup facilities and offsite storage resources. Every year the management aspects of your plan should be simulation tested. These two activities assure the currency of your plan and the readiness of your staff. The remainder of this paper discusses the planning of simulation tests for 2nd-tier disaster management teams.

Functions of a Disaster Management Team (DMT)

During a disaster the DMT has two primary functions:

Life-Safety Management: Coordinating the efforts of emergency response teams to assure the safety of personnel and to minimize the damage to their facilities following a disaster. A life-safety DMT is normally organized for every major facility or campus.

Business Continuity Planning: Planning and coordinating emergency operations and restoration of normal operations following a disaster. A business continuity DMT is normally responsible for a total business unit, frequently involving multiple and widespread facilities.

A combined life-safety and business continuity simulation test is feasible for organizations with a single facility or campus. However, for organizations with multiple facilities, separate life-safety tests for each location plus a separate integrated business resumption test are normally performed.

The Emergency Operations Center (EOC)

The EOC of the firm that used the scenario presented in this paper illustrates the most common and least expensive approach, using a converted conference room.
Large conference rooms at two or more widely separated locations are permanently converted to EOCs. Furnishings and equipment required include:
California firms often have Los Angeles and San Francisco EOCs because of the possibility of an area wide disaster due to a major earthquake. Other areas of the world may not need this much separation between locations.

A dual room EOC approach is also used, normally by organizations with frequent operational emergencies, such as utilities exposed to power outages or pipeline breaks. It involves two rooms, one for management and one for operations personnel, with a glass wall between them. The EOC is used both for operational emergencies and for disasters affecting non-operational facilities and personnel. A second conference room type EOC is also normally available at a site remote from the primary EOC.

EOC testing involves two functions: a periodic walk-through of all equipment by the Project Head- Business Continuity Management Planning, and periodic BCM plan simulations in the EOC.

Designing a DMT Simulation Scenario

Proper planning of a scenario requires a detailed knowledge of the risk exposures and business continuity plans for all impacted facilities and organizations. As discussed in Rosenthal and Himel [2], a scenario should:
The simulation scenario which follows was derived from a State of California earthquake planning scenario [7]. It is based on a major earthquake occurring at the southern edge of the Los Angeles basin.
The scenario work-sheets have been edited to delete material relating to the firm's specific facilities and business operations. The scenario and solutions therefore do not represent the full set of responses that were expected from the organization.

Administering a DMT Simulation Exercise

As discussed in Rosenthal and Himel [2], operational simulation tests of emergency response teams are evaluated following the simulation. DMT simulations are of most value, however, when an evaluation and redirection period occurs at the close of each scenario time period. This improves the learning experience and assures consistency between the following time periods' scenarios and DMT planning. Simulation exercises of single emergency response teams can also be handled in the same manner.
External communications to the Policy Group and to the emergency response teams can be handled in two ways:
Most DMT members prefer the second alternative since it does not expose their mistakes to persons who work for them. The DMT normally needs 45 to 60 minutes to generate a solution to the first scenario stage and review it with the administration team, reducing to 20 to 30 minutes for the final time step.

Evaluating the Simulation

Following the simulation exercise, the Test Administration Team, with the Project Head- Business Continuity Planning, should plan to spend at least a half day evaluating the impacted BRP policies and procedures as well as each DMT member's knowledge. Brief individual briefings should then be held with each DMT member and their alternates, and action plans prepared to correct any deficiencies. The Project Head- Business Continuity Planning must then monitor the implementation of the action plans in preparation for the following year's DMT simulation exercise.

Conclusions

At the disaster management team exercises that I have observed, the participants indicated that the review of policies and procedures and the lessons learned were extremely valuable. They were also surprised at the number of omissions and inconsistencies found in their life safety and business resumption plans. The primary value of a DMT simulation exercise is the realization by management that the extensive testing conducted for the emergency response teams had little impact on the preparation of the Disaster Policy and Disaster Management Teams. They realize how important it is that simulation exercises similar to the one described in this paper be conducted for the Disaster Management Team every few years. A Desk Top Walk Through for the Disaster Policy Team should also be conducted periodically.

References
Exhibit 1A: Scenario 1 Announcement

Simulated Time: 3:00 pm, Wednesday

Earthquake Magnitude 7.0 - 7.4, Major Quake: major destruction within 5-10 miles, significant destruction within 10-15 miles, major damage within 15-20 miles.
What would you do? What are your plans?

Exhibit 1B: Report from Project Head, BRP
Simulated Time: 3:30 pm, Wednesday

The following information is based on initial radio reports. A magnitude 7.0 - 7.4 earthquake has occurred on the Newport-Inglewood fault, centered in the Long Beach area.
Bridges destroyed, broad fissures in the ground, underground pipelines completely out of service, earth slumps and land slips in soft ground. Few masonry structures standing, many well built wooden structures destroyed, great damage in specially designed (high rise) buildings.
Ground badly cracked, shifted sand & mud, landslides from steep slopes. Some well-built wooden structures destroyed, most masonry and frame structures destroyed, severe damage in specially designed (high rise) buildings.
Damage considerable in specially designed (high rise) buildings, damage great in substantial buildings with partial collapse, and many buildings shifted off foundations. High rise buildings will lose substantial glass above 10 stories and almost all glass above 20 stories, with violent shifting of contents in upper floors.

Exhibit 1C: Report from the Damage Assessment Team

Simulated Time: 4:00 pm, Wednesday
Exhibit 2: Scenario 2 Announcement
Simulated Time: 8:00 pm, Wednesday
Exhibit 3: Scenario 3 Announcement
Simulated Time: 11:00 pm, Wednesday
Exhibit 4: Scenario 4 Announcement
Simulated Time: 8:00 am, Thursday (Next Day)
Exhibit 5: Scenario 5 Announcement
Simulated Time: 6:00 pm, Thursday (Next Day)
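
A test administrator preparing a similar exercise may find it useful to lay the exhibit schedule out as data, so that the simulated time elapsed at each stage is explicit. The following is only a minimal sketch in Python, not part of the original exercise materials. The exhibit labels and simulated times are taken from the exhibits above; the names DAY_ZERO, STAGES, and build_timeline are hypothetical, and the calendar date is an arbitrary placeholder, since the exhibits specify only "Wednesday" and "Thursday".

```python
from datetime import datetime, timedelta

# Arbitrary placeholder date (an assumption): the exhibits give only
# "Wednesday" and "Thursday", so any Wednesday can serve as day zero.
DAY_ZERO = datetime(2025, 1, 1)  # a Wednesday

# (exhibit label, day offset from Wednesday, hour, minute), from the exhibits.
STAGES = [
    ("Exhibit 1A: Scenario 1 Announcement", 0, 15, 0),
    ("Exhibit 1B: Report from Project Head, BRP", 0, 15, 30),
    ("Exhibit 1C: Report from the Damage Assessment Team", 0, 16, 0),
    ("Exhibit 2: Scenario 2 Announcement", 0, 20, 0),
    ("Exhibit 3: Scenario 3 Announcement", 0, 23, 0),
    ("Exhibit 4: Scenario 4 Announcement", 1, 8, 0),
    ("Exhibit 5: Scenario 5 Announcement", 1, 18, 0),
]


def build_timeline(stages=STAGES, day_zero=DAY_ZERO):
    """Return (label, simulated time, hours since the first announcement)."""
    times = [
        day_zero + timedelta(days=day, hours=hour, minutes=minute)
        for _, day, hour, minute in stages
    ]
    start = times[0]
    return [
        (label, when, (when - start).total_seconds() / 3600)
        for (label, *_), when in zip(stages, times)
    ]


if __name__ == "__main__":
    for label, when, elapsed in build_timeline():
        print(f"{when:%a %I:%M %p}  +{elapsed:4.1f} h  {label}")
```

Listing the stages this way shows how the scenario compresses roughly 27 simulated hours into five decision points, with the intervals widening as the simulated disaster moves from immediate response toward recovery.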