Chapter 6: The Five Nines Concept

Organizations that want to maximize the availability of their systems and data may take extraordinary measures to minimize or eliminate data loss. The goal is to minimize the downtime of mission-critical processes. If employees cannot perform their regular duties, the organization is in jeopardy of losing revenue.

Organizations measure availability by percentage of uptime. This chapter begins by explaining the concept of five nines. Many industries must maintain the highest availability standards because downtime can literally mean the difference between life and death.

This chapter discusses various approaches that organizations can take to help meet their availability goals. Redundancy provides backup and includes extra components for computers or network systems to ensure the systems remain available. Redundant components can include hardware such as disk drives, servers, switches, and routers or software such as operating systems, applications, and databases. The chapter also discusses resiliency, the ability of a server, network, or data center to recover quickly and continue operation.

Organizations must be prepared to respond to an incident by establishing procedures that they follow after an event occurs. The chapter concludes with a discussion of disaster recovery and business continuity planning which are both critical in maintaining availability to an organization’s resources.

What Does Five Nines Mean?

Five nines means that systems and services are available 99.999% of the time. It also means that total planned and unplanned downtime is less than 5.26 minutes per year. The chart in the figure compares the downtime for various availability percentages.
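The downtime values in such a chart follow directly from the availability percentage. A minimal Python sketch of the calculation (the only assumption is an average 365.25-day year):

  # Maximum annual downtime allowed by common availability targets.
  MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap years

  for target in ["99%", "99.9%", "99.99%", "99.999%"]:
      availability = float(target.rstrip("%")) / 100
      downtime = MINUTES_PER_YEAR * (1 - availability)
      print(f"{target:>8} uptime allows {downtime:8.2f} minutes of downtime per year")

Running the sketch confirms the five nines figure: 99.999% availability allows about 5.26 minutes of downtime per year, while 99% availability allows more than 87 hours.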

High availability refers to a system or component that is continuously operational for a given length of time. To help ensure high availability:

  • Eliminate single points of failure
  • Design for reliability
  • Detect failures as they occur

Sustaining high availability at the five nines standard can increase costs and consume significant resources. The increased costs are due to the purchase of additional hardware such as servers and components. As an organization adds components, the configuration becomes more complex. Unfortunately, increased configuration complexity increases the risk factors. The more moving parts involved, the higher the likelihood of failed components.


Environments that Require Five Nines

Although sustaining high availability may be too expensive for some organizations, several environments require five nines.

  • The finance industry needs to maintain high availability for continuous trading, compliance, and customer trust. Click here to read about the four-hour outage on the New York Stock Exchange in 2015.
  • Healthcare facilities require high availability to provide around-the-clock care for patients. Click here to read about the average costs incurred for data center downtime in the healthcare industry.
  • The public safety industry includes agencies that provide security and services to a community, state, or nation. Click here to read about a network outage at the U.S. Pentagon Police Agency.
  • The retail industry depends on efficient supply chains and the delivery of products to customers. Disruption can be devastating, especially during peak demand times such as holidays.
  • The public expects the news media industry to report on events as they happen. The news cycle now runs around the clock, 24/7.

Threats to Availability

The following threats pose a high risk to data and information availability:

  • An unauthorized user successfully penetrates and compromises an organization’s primary database
  • A successful DoS attack significantly affects operations
  • An organization suffers a significant loss of confidential data
  • A mission-critical application goes down
  • A compromise of the Admin or root user occurs
  • The detection of a cross-site scripting attack or an illegal file server share
  • The defacement of an organization’s website impacts public relations
  • A severe storm such as a hurricane or tornado
  • A catastrophic event such as a terrorist attack, building bombing, or building fire
  • Long-term utility or service provider outage
  • Water damage as the result of flooding or sprinkler failure

Categorizing the impact level for each threat helps an organization realize the dollar impact of a threat. Click the threat categories in the figure to see an example of each.


Designing a High Availability System

High availability incorporates three major principles to achieve the goal of uninterrupted access to data and services:

  1. Elimination or reduction of single points of failure
  2. System resiliency
  3. Fault tolerance

It is important to understand the ways to address a single point of failure. A single point of failure can be a central router or switch, a network service, or even a highly skilled member of the IT staff. The key is that the loss of a system, process, or person can have a very disruptive impact on the entire operation, so the goal is to have processes, resources, and components that reduce single points of failure. High availability clusters are one way to provide redundancy. These clusters consist of a group of computers that have access to the same shared storage and have identical network configurations. All servers take part in processing a service simultaneously, so from the outside the server group looks like one device. If a server within the cluster fails, the other servers continue to process the same service as the failed device.

System resiliency refers to the capability to maintain availability of data and operational processing despite attacks or disruptive events. Generally, this requires redundant systems, in terms of both power and processing, so that should one system fail, the other can take over operations without any break in service. System resiliency is more than hardening devices; it requires that both data and services be available even when under attack.


Fault tolerance enables a system to continue to operate if one or more components fail. Data mirroring is one example of fault tolerance. Should a "fault" occur, causing disruption in a device such as a disk controller, the mirrored system provides the requested data with no apparent interruption in service to the user.

Asset Identification

An organization needs to know what hardware and software are present as a prerequisite to knowing what the configuration parameters need to be. Asset management includes a complete inventory of hardware and software.

This means that the organization needs to know all of the components that can be subject to security risks, including:

  • Every hardware system
  • Every operating system
  • Every hardware network device
  • Every network device operating system
  • Every software application
  • All firmware
  • All language runtime environments
  • All individual libraries

An organization may choose an automated solution to keep track of assets. An administrator should investigate any changed configuration because it may mean that the configuration is not up-to-date. It can also mean that unauthorized changes are happening.
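As a sketch of what such an automated solution might look like, the short Python example below compares a scanned inventory against a known-good baseline and flags differences for investigation. The asset names and fields are hypothetical:

  from dataclasses import dataclass

  @dataclass(frozen=True)
  class Asset:
      name: str        # hostname or product name
      kind: str        # hardware, OS, application, firmware, library
      version: str     # installed or configured version

  # Hypothetical known-good baseline and a fresh inventory scan.
  baseline = {
      Asset("web01", "OS", "Ubuntu 22.04"),
      Asset("web01", "application", "nginx 1.24"),
  }
  scanned = {
      Asset("web01", "OS", "Ubuntu 22.04"),
      Asset("web01", "application", "nginx 1.25"),  # changed since baseline
  }

  # A difference may mean the baseline is out of date, or that an
  # unauthorized change has occurred; either way, investigate it.
  for asset in scanned - baseline:
      print("Changed or new asset:", asset)
  for asset in baseline - scanned:
      print("Missing from scan:", asset)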


Asset Classification

Asset classification assigns all resources of an organization into a group based on common characteristics. An organization should apply an asset classification system to documents, data records, data files, and disks. The most critical information needs to receive the highest level of protection and may even require special handling.

An organization can adopt a labeling system according to how valuable, how sensitive, and how critical the information is. Complete the following steps to identify and classify the assets of an organization:

  1. Determine the proper asset identification category.
  2. Establish asset accountability by identifying the owner for all information assets and application software.
  3. Determine the criteria for classification.
  4. Implement a classification schema.

For example, the U.S. government uses sensitivity to classify data as follows: top secret, secret, confidential, public trust, and unclassified.

Asset Standardization

Asset management manages the lifecycle and inventory of technology assets, including devices and software. As part of an IT asset management system, an organization specifies the acceptable IT assets that meet its objectives. This practice effectively reduces the number of different asset types. For example, an organization will only install applications that meet its guidelines. When administrators eliminate applications that do not meet the guidelines, they effectively increase security.

Asset standards identify specific hardware and software products that the organization uses and supports. When a failure occurs, prompt action helps to maintain both access and security. If an organization does not standardize its hardware selection, personnel may need to scramble to find a replacement component. Non-standard environments require more expertise to manage, and they increase the cost of maintenance contracts and inventory. Click here to read about how the military shifted to standards-based hardware for its communications.

Threat Identification

The United States Computer Emergency Readiness Team (US-CERT) and the U.S. Department of Homeland Security sponsor a dictionary of Common Vulnerabilities and Exposures (CVE). Each CVE entry contains a standard identifier number, a brief description, and references to related vulnerability reports and advisories. The MITRE Corporation maintains the CVE List and its public website.

Threat identification begins with the process of creating a CVE Identifier for publicly known cybersecurity vulnerabilities. Each CVE Identifier includes the following:

  • The CVE Identifier number
  • A brief description of the security vulnerability
  • Any important references

Click here to learn more about CVE Identifiers.
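The three fields above map naturally onto a small record type. A minimal Python sketch, populated with the well-known Log4j vulnerability as a sample entry:

  from dataclasses import dataclass, field

  @dataclass
  class CveRecord:
      cve_id: str                     # the CVE Identifier number
      description: str                # brief description of the vulnerability
      references: list[str] = field(default_factory=list)  # related advisories

  entry = CveRecord(
      cve_id="CVE-2021-44228",
      description="Remote code execution in Apache Log4j2 via JNDI lookups.",
      references=["https://nvd.nist.gov/vuln/detail/CVE-2021-44228"],
  )
  print(entry.cve_id, "-", entry.description)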

Risk Analysis

Risk analysis is the process of analyzing the dangers posed by natural and human-caused events to the assets of an organization.

A user performs an asset identification to help determine which assets to protect. A risk analysis has four goals:

  • Identify assets and their value
  • Identify vulnerabilities and threats
  • Quantify the probability and impact of the identified threats
  • Balance the impact of the threat against the cost of the countermeasure

There are two approaches to risk analysis.

Quantitative Risk Analysis

A quantitative analysis assigns numbers to the risk analysis process (Figure 1). The asset value is the replacement cost of the asset. The value of an asset can also be measured by the income gained through its use. The exposure factor (EF) is a subjective value, expressed as a percentage, that represents the portion of the asset's value lost due to a particular threat. If a total loss occurs, the EF equals 1.0 (100%). In the quantitative example, the server has an asset value of $15,000. When the server fails, a total loss occurs (the EF equals 1.0). The asset value of $15,000 multiplied by the exposure factor of 1.0 results in a single loss expectancy (SLE) of $15,000.

The annualized rate of occurrence (ARO) is the probability that a loss will occur during the year (also expressed as a percentage). An ARO can be greater than 100% if a loss can occur more than once a year.

The annual loss expectancy (ALE) is the product of the SLE and the ARO (ALE = SLE × ARO). The ALE gives management some guidance on what it should spend to protect the asset.
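A minimal Python sketch of these two formulas, using the server example from Figure 1 (the ARO of 0.10 is an assumed value for illustration):

  def single_loss_expectancy(asset_value: float, exposure_factor: float) -> float:
      """SLE = asset value (AV) x exposure factor (EF)."""
      return asset_value * exposure_factor

  def annual_loss_expectancy(sle: float, aro: float) -> float:
      """ALE = SLE x annualized rate of occurrence (ARO)."""
      return sle * aro

  av = 15_000   # replacement cost of the server
  ef = 1.0      # a server failure is a total loss
  aro = 0.10    # assumed: one failure expected every ten years

  sle = single_loss_expectancy(av, ef)    # $15,000
  ale = annual_loss_expectancy(sle, aro)  # $1,500 per year
  print(f"SLE = ${sle:,.0f}, ALE = ${ale:,.0f} per year")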

Qualitative Risk Analysis

Qualitative risk analysis uses opinions and scenarios. Figure 2 provides an example of a table used in qualitative risk analysis, which plots the likelihood of a threat against its impact. For example, the threat of a server failure may be likely, but its impact may only be marginal.

A team evaluates each threat to an asset and plots it in the table. The team ranks the results and uses them as a guide, and may decide to act only on the threats that fall within the red zone.

The numbers used in the table do not directly relate to any aspect of the analysis. For example, a catastrophic impact of 4 is not twice as bad as a marginal impact of 2. This method is subjective in nature.
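Even so, a qualitative matrix is easy to express in code. In the sketch below, the scale labels and the zone thresholds are assumptions chosen for illustration; an organization would define its own:

  # Assumed ordinal scales; the numbers are ranks, not measurements.
  LIKELIHOOD = {"unlikely": 1, "possible": 2, "likely": 3, "almost certain": 4}
  IMPACT = {"negligible": 1, "marginal": 2, "critical": 3, "catastrophic": 4}

  def risk_zone(likelihood: str, impact: str) -> str:
      """Place a threat in a zone; the thresholds here are assumptions."""
      score = LIKELIHOOD[likelihood] * IMPACT[impact]
      return "red" if score >= 9 else "yellow" if score >= 4 else "green"

  # The example from the text: a likely server failure with marginal impact.
  print(risk_zone("likely", "marginal"))   # yellow - monitor, no immediate action
  print(risk_zone("likely", "critical"))   # red - act on this threat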


Mitigation

Mitigation involves reducing the severity of a loss or the likelihood of the loss occurring. Many technical controls mitigate risk, including authentication systems, file permissions, and firewalls. Organizations and security professionals must understand that risk mitigation can have both positive and negative impacts on the organization. Good risk mitigation finds a balance between the negative impact of countermeasures and controls and the benefit of risk reduction. There are four common ways to reduce risk:

  • Accept the risk and periodically re-assess
  • Reduce the risk by implementing controls
  • Avoid the risk by totally changing the approach
  • Transfer the risk to a third party

Accepting the risk is a short-term strategy, and it necessitates creating contingency plans for that risk. People and organizations have to accept risk on a daily basis. Modern methodologies reduce risk by developing software incrementally and providing regular updates and patches to address vulnerabilities and misconfigurations.

Outsourcing services, purchasing insurance, or purchasing maintenance contracts are all examples of risk transfer. Hiring specialists to perform critical tasks to reduce risk can be a good decision and yield greater results with less long term investment. A good risk mitigation plan can include two or more strategies.


Activity - Perform an Asset Risk Analysis


Layering

Defense in depth will not provide an impenetrable cyber shield, but it will help an organization minimize risk by keeping it one step ahead of cyber criminals.


If there is only one defense in place to protect data and information, cyber criminals have only to get around that single defense. To make sure data and information remains available, an organization must create different layers of protection.

A layered approach provides the most comprehensive protection. If cyber criminals penetrate one layer, they still have to contend with several more layers with each layer being more complicated than the previous.

Layering creates a barrier of multiple defenses that work together to prevent attacks. For example, an organization might store its top secret documents on a server in a building surrounded by an electronic fence.

Limiting

Limiting access to data and information reduces the possibility of a threat. An organization should restrict access so that users only have the level of access required to do their job. For example, the people in the marketing department do not need access to payroll records to perform their jobs.

Technology-based solutions such as using file permissions are one way to limit access; an organization should also implement procedural measures. A procedure should be in place that prohibits an employee from removing sensitive documents from the premises.

Diversity

If all of the protected layers were the same, it would not be very difficult for cyber criminals to conduct a successful attack. Therefore, the layers must be different. If cyber criminals penetrate one layer, the same technique will not work on all of the other layers. Breaching one layer of security does not compromise the whole system. An organization may use different encryption algorithms or authentication systems to protect data in different states.

To accomplish the goal of diversity, organizations can use security products manufactured by different companies for multifactor authentication. For example, the server containing the top secret documents is in a locked room that requires a swipe card from one company and biometric authentication supplied by another company.

Obscurity

Obscuring information can also protect data and information. An organization should not reveal any information that cyber criminals can use to figure out what version of the operating system a server is running or the type of equipment it uses. For example, error messages should not contain any details that cyber criminals could use to determine what vulnerabilities are present. Concealing certain types of information makes it more difficult for cyber criminals to attack a system.

Simplicity

Complexity does not necessarily guarantee security. If an organization implements complex systems that are hard to understand and troubleshoot, it may actually backfire. If employees do not understand how to configure a complex solution properly, it may make it just as easy for cyber criminals to compromise those systems. To maintain availability, a security solution should be simple from the inside, but complex on the outside.

Activity - Identify the Layer of Defense


Single Points of Failure

A single point of failure is a critical element of an organization's operations; other operations rely on it, and its failure halts them. A single point of failure can be a special piece of hardware, a process, a specific piece of data, or even an essential utility. Single points of failure are the weak links in the chain that can disrupt the organization's operations. Generally, the solution to a single point of failure is to modify the critical operation so that it does not rely on a single element. The organization can also build redundant components into the critical operation to take over the process should one of these points fail.

N+1 Redundancy

N+1 redundancy ensures system availability in the event of a component failure. Components (N) need to have at least one backup component (+1). For example, a car has four tires (N) and a spare tire in the trunk in case of a flat (+1).

In a data center, N+1 redundancy means that the system design can withstand the loss of a component. The N refers to the many different components that make up the data center, including servers, power supplies, switches, and routers. The +1 is the additional component or system standing by, ready to go if needed.

An example of N+1 redundancy in a data center is a power generator that comes online when something happens to the main power source. Although an N+1 system contains redundant equipment, it is not a fully redundant system.
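If component failures are assumed to be independent, the benefit of the +1 component can be quantified: n redundant copies of a component with availability A fail together only with probability (1 − A)^n. A minimal Python sketch under that (strong) independence assumption:

  def parallel_availability(availability: float, copies: int) -> float:
      """Availability of redundant copies of a component, assuming
      failures are independent (rarely exactly true in practice)."""
      return 1 - (1 - availability) ** copies

  a = 0.99  # a single power supply that is available 99% of the time
  print(f"N alone: {parallel_availability(a, 1):.5%}")  # 99.00000%
  print(f"N+1:     {parallel_availability(a, 2):.5%}")  # 99.99000%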

RAID

A redundant array of independent disks (RAID) combines multiple physical hard drives into a single logical unit to provide data redundancy and improve performance. RAID takes data that is normally stored on a single disk and spreads it out among several drives. If any single disk is lost, the user can recover data from the other disks where the data also resides.

RAID can also increase the speed of data retrieval, because multiple drives retrieve requested data faster than a single disk working alone.

A RAID solution can be either hardware-based or software-based. A hardware-based solution requires a specialized hardware controller on the system that contains the RAID drives. The following terms describe how RAID stores data on the various disks:

  • Parity - Detects data errors and provides the information needed to reconstruct lost data (see the sketch after this list).
  • Striping - Writes data across multiple drives.
  • Mirroring - Stores duplicate data on a second drive.
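Parity is typically computed with XOR, which is what makes reconstruction of a single lost stripe possible. A simplified Python sketch with three data stripes and one parity stripe (real RAID 5 distributes parity across all drives; this fixed layout is only illustrative):

  # Three data stripes and their XOR parity (simplified fixed layout).
  d0, d1, d2 = b"\x0f", b"\xf0", b"\xaa"
  parity = bytes(a ^ b ^ c for a, b, c in zip(d0, d1, d2))

  # If the drive holding d1 fails, XOR the survivors with the parity:
  rebuilt = bytes(a ^ c ^ p for a, c, p in zip(d0, d2, parity))
  assert rebuilt == d1
  print("rebuilt stripe:", rebuilt.hex())  # f0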

There are several levels of RAID available as shown in the figure.


Click here to view a RAID level tutorial that explains RAID technology.

Spanning Tree

Redundancy increases the availability of the infrastructure by protecting the network from a single point of failure, such as a failed network cable or a failed switch. However, when designers build physical redundancy into a network, loops and duplicate frames can occur. Loops and duplicate frames have severe consequences for a switched network.

Spanning Tree Protocol (STP) addresses these issues. The basic function of STP is to prevent loops on a network when switches interconnect via multiple paths. STP ensures that redundant physical links are loop-free. It ensures that there is only one logical path between all destinations on the network. STP intentionally blocks redundant paths that could cause a loop.

Blocking the redundant paths is critical to preventing loops on the network. The physical paths still exist to provide redundancy, but STP disables these paths to prevent the loops from occurring. If a network cable or switch fails, STP recalculates the paths and unblocks the necessary ports to allow the redundant path to become active.
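Conceptually, STP elects a root bridge and keeps only the best path from each switch to the root, blocking every other redundant link. The toy Python sketch below is not the real protocol (which exchanges BPDUs and weighs port costs), but it illustrates building a loop-free tree on a three-switch topology like the one in the figure:

  from collections import deque

  # Hypothetical topology matching the figure: switches S1, S2, S3.
  links = {("S1", "S2"), ("S1", "S3"), ("S2", "S3")}
  neighbors = {}
  for a, b in links:
      neighbors.setdefault(a, set()).add(b)
      neighbors.setdefault(b, set()).add(a)

  root = min(neighbors)  # stand-in for electing the lowest bridge ID
  tree, seen, queue = set(), {root}, deque([root])
  while queue:           # breadth-first search builds a loop-free tree
      switch = queue.popleft()
      for nb in sorted(neighbors[switch]):
          if nb not in seen:
              seen.add(nb)
              tree.add(tuple(sorted((switch, nb))))
              queue.append(nb)

  blocked = {tuple(sorted(link)) for link in links} - tree
  print("forwarding:", sorted(tree))    # [('S1', 'S2'), ('S1', 'S3')]
  print("blocked:", sorted(blocked))    # [('S2', 'S3')] - the redundant path

With these three switches, the redundant link between S2 and S3 ends up blocked, matching the recalculation behavior described for the figure.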

Click Play in the figure to view STP when a failure occurs:

  • PC1 sends a broadcast out onto the network.
  • The trunk link between S2 and S1 fails, resulting in disruption of the original path.
  • S2 unblocks the previously blocked port for Trunk2 and allows the broadcast traffic to traverse the alternate path around the network, permitting communication to continue.
  • If the link between S2 and S1 comes back up, STP again blocks the link between S2 and S3.

Router Redundancy

The default gateway is typically the router that provides devices access to the rest of the network or to the Internet. If there is only one router serving as the default gateway, it is a single point of failure. The organization can choose to install an additional standby router.

In Figure 1, the forwarding router and the standby router use a redundancy protocol to determine which router should take the active role in forwarding traffic. Each router is configured with a physical IP address and a virtual router IP address. End devices use the virtual IP address as the default gateway. The forwarding router is listening for traffic addressed to 192.0.2.100. The forwarding router and the standby router use their physical IP addresses to send periodic messages. The purpose of these messages is to make sure both are still online and available. If the standby router no longer receives these periodic messages from the forwarding router, the standby router will assume the forwarding role, as shown in Figure 2.


The ability of a network to dynamically recover from the failure of a device acting as a default gateway is known as first-hop redundancy.
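The failover decision described above reduces to two timers: how often hello messages arrive, and how long the standby router waits before taking over. A simplified Python sketch of the standby router's logic (the timer values are assumptions; actual defaults vary by protocol):

  import time

  HELLO_INTERVAL = 3.0  # assumed seconds between hello messages
  HOLD_TIME = 10.0      # assumed window before the standby takes over

  class StandbyRouter:
      def __init__(self) -> None:
          self.role = "standby"
          self.last_hello = time.monotonic()

      def on_hello_received(self) -> None:
          """Called each time a hello arrives from the forwarding router."""
          self.last_hello = time.monotonic()

      def tick(self) -> None:
          """Periodic check: assume the forwarding role if hellos stop."""
          if self.role == "standby" and time.monotonic() - self.last_hello > HOLD_TIME:
              self.role = "active"  # begin answering for the virtual IP address

  router = StandbyRouter()
  router.tick()
  print(router.role)  # "standby" while hellos keep arriving on time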

Router Redundancy Options

The following list defines the options available for router redundancy based on the protocol that defines communication between network devices:
  • Hot Standby Router Protocol (HSRP) - HSRP provides high network availability by providing first-hop routing redundancy. A group of routers use HSRP for selecting an active device and a standby device. In a group of device interfaces, the active device is the device that routes packets; the standby device is the device that takes over when the active device fails. The function of the HSRP standby router is to monitor the operational status of the HSRP group and to quickly assume packet-forwarding responsibility if the active router fails.
  • Virtual Router Redundancy Protocol (VRRP) - A VRRP router runs the VRRP protocol in conjunction with one or more other routers attached to a LAN. In a VRRP configuration, the elected router is the virtual router master, and the other routers act as backups, in case the virtual router master fails.
  • Gateway Load Balancing Protocol (GLBP) - GLBP protects data traffic from a failed router or circuit, like HSRP and VRRP, while also allowing load balancing (also called load sharing) between a group of redundant routers.

Location Redundancy

An organization may need to consider location redundancy depending on its needs. The following outlines three forms of location redundancy.

Synchronous Replication
  • Synchronizes both locations in real time
  • Requires high bandwidth
  • Locations must be close together to reduce latency
Asynchronous Replication
  • Not synchronized in real time, but close to it
  • Requires less bandwidth
  • Sites can be farther apart because latency is less of an issue
Point-in-Time Replication
  • Updates the backup data location periodically
  • Most conservative with bandwidth because it does not require a constant connection
The right balance between cost and availability determines the appropriate choice for an organization.

Resilient Design

Resiliency refers to the methods and configurations used to make a system or network tolerant of failure. For example, a network can have redundant links between switches running STP. Although STP does provide an alternate path through the network if a link fails, the switchover may not be immediate if the configuration is not optimal.

Routing protocols also provide resiliency, but fine-tuning can improve the switchover so that network users do not notice. Administrators should investigate non-default settings in a test network to see if they can improve network recovery times.

Resilient design is more than just adding redundancy. It is critical to understand the business needs of the organization, and then incorporate redundancy to create a resilient network.

Application Resilience

Application resilience is an application's ability to react to problems in one of its components while still functioning. Downtime can result from application errors, infrastructure failures, data corruption, equipment failures, and human error. An administrator will also eventually need to shut down applications for patching, version upgrades, or to deploy new features.

Many organizations try to balance the cost of achieving application infrastructure resiliency against the cost of losing customers or business due to an application failure. Application high availability is complex and costly. The figure shows three availability solutions to address application resilience. As the availability factor of each solution increases, the complexity and cost also increase.

IOS Resilience

The Internetwork Operating System (IOS) for Cisco routers and switches includes a resilient configuration feature. It allows for faster recovery if someone maliciously or unintentionally reformats flash memory or erases the startup configuration file. The feature maintains a secure working copy of the router IOS image file and a copy of the running configuration file. The user cannot remove these secure files, which are also known as the primary bootset.

The commands shown in the figure secure the IOS image and running configuration file.

Preparation

Incident response consists of the procedures that an organization follows after an event occurs that is outside the normal range. A data breach releases information to an untrusted environment and can occur as the result of an accidental or intentional act. A data breach occurs anytime an unauthorized person copies, transmits, views, steals, or accesses sensitive information.

When an incident occurs, the organization must know how to respond. An organization needs to develop an incident response plan and put together a Computer Security Incident Response Team (CSIRT) to manage the response. The team performs the following functions:
  • Maintains the incident response plan
  • Ensures its members are knowledgeable about the plan
  • Tests the plan
  • Gets management’s approval of the plan
The CSIRT can be an established group within the organization or an ad hoc one. The team follows a set of predetermined steps to make sure that their approach is uniform and that they do not skip any steps. National CSIRTs oversee incident handling for a country.

Detection and Analysis

Detection starts when someone discovers the incident. Organizations can purchase the most sophisticated detection systems; however, if administrators do not review the logs and monitor alerts, these systems are worthless. Proper detection determines how the incident occurred, what data it involved, and what systems it involved. Notification of the breach goes out to senior management and to the managers responsible for the data and systems, so that they can be involved in remediation and repair. Detection and analysis includes the following:
  • Alerts and notifications
  • Monitoring and follow-up
Incident analysis helps to identify the source, extent, impact, and details of a data breach. The organization may need to decide if it needs to call in a team of experts to conduct the forensics investigation.

Containment, Eradication, and Recovery

Containment efforts include the immediate actions performed such as disconnecting a system from the network to stop the information leak.

After identifying the breach, the organization needs to contain and eradicate it. This may require additional downtime for systems. The recovery stage includes the actions that the organization needs to take in order to resolve the breach and restore the systems involved. After remediation, the organization restores all systems to the state they were in before the breach.

Post-Incident Follow-Up

After restoring all operations to a normal state, the organization should look at the cause of the incident and ask the following questions:
  • What actions will prevent the incident from reoccurring?
  • What preventive measures need strengthening?
  • How can it improve system monitoring?
  • How can it minimize downtime during the containment, eradication, and recovery phases?
  • How can management minimize the impact to the business?
A look at the lessons learned can help the organization to better prepare by improving upon its incident response plan.

Activity - Order the Incident Response Phases


Network Admission Control

The purpose of Network Admission Control (NAC) is to allow only authorized users with compliant systems to access the network. A compliant system meets all of the policy requirements of the organization. For example, a laptop that is part of a home wireless network may not be able to connect remotely to the corporate network. NAC evaluates an incoming device against the policies of the network, quarantines systems that do not comply, and manages the remediation of noncompliant systems.

A NAC framework can use the existing network infrastructure and third-party software to enforce security policy compliance for all endpoints. Alternatively, a NAC appliance controls network access, evaluates compliance, and enforces security policy. Common NAC system checks include the following (a minimal sketch follows this list):
  1. Updated virus detection
  2. Operating systems patches and updates
  3. Complex password enforcement
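A posture check of this kind reduces to a series of boolean compliance tests. The Python sketch below is purely illustrative; the check names, thresholds, and password rules are assumptions:

  from datetime import date, timedelta

  def is_compliant(av_signature_date: date, missing_patches: int,
                   password: str) -> bool:
      """Hypothetical posture checks mirroring the three items above."""
      av_current = date.today() - av_signature_date <= timedelta(days=7)
      fully_patched = missing_patches == 0
      complex_password = (len(password) >= 12
                          and any(c.isdigit() for c in password)
                          and any(c.isupper() for c in password))
      return av_current and fully_patched and complex_password

  # A noncompliant endpoint would be quarantined for remediation.
  print(is_compliant(date.today(), 0, "Str0ngPassphrase"))  # True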

Intrusion Detection Systems

Intrusion Detection Systems (IDSs) passively monitor the traffic on a network. The figure shows that an IDS-enabled device copies the traffic stream and analyzes the copied traffic rather than the actual forwarded packets. Working offline, it compares the captured traffic stream with known malicious signatures, similar to software that checks for viruses. Working offline means several things:
  • IDS works passively
  • IDS device is physically positioned in the network so that traffic must be mirrored in order to reach it
  • Network traffic does not pass through the IDS unless it is mirrored
Passive means that the IDS monitors and reports on traffic. It does not take any action. This is the definition of operating in promiscuous mode.

The advantage of operating with a copy of the traffic is that the IDS does not negatively affect the packet flow of the forwarded traffic. The disadvantage of operating on a copy of the traffic is that the IDS cannot stop malicious single-packet attacks from reaching the target before responding to the attack. An IDS often requires assistance from other networking devices, such as routers and firewalls, to respond to an attack.
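At its simplest, signature-based detection compares a copy of the traffic against known malicious byte patterns. A toy Python sketch (the signatures are invented; production rule sets are far richer):

  # Invented byte-pattern signatures; production rule sets are far richer.
  SIGNATURES = {
      b"/etc/passwd": "possible path traversal attempt",
      b"<script>": "possible cross-site scripting payload",
  }

  def inspect(copied_packet: bytes) -> list[str]:
      """Passively analyze a copy of the traffic; nothing is ever blocked."""
      return [name for pattern, name in SIGNATURES.items()
              if pattern in copied_packet]

  for alert in inspect(b"GET /../../etc/passwd HTTP/1.1"):
      print("ALERT (report only):", alert)  # an IPS, by contrast, would drop it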

A better solution is to use a device that can immediately detect and stop an attack. An Intrusion Prevention System (IPS) performs this function.

Intrusion Prevention Systems

An IPS builds upon IDS technology. However, an IPS device operates in inline mode. This means that all incoming and outgoing traffic must flow through it for processing. As shown in the figure, an IPS does not allow packets to enter the trusted side of the network unless it has analyzed the packets. It can detect and immediately address a network problem.

An IPS monitors network traffic. It analyzes the contents and the payload of the packets for more sophisticated embedded attacks that might include malicious data. Some systems use a blend of detection technologies, including signature-based, profile-based, and protocol analysis-based intrusion detection. This deeper analysis enables the IPS to identify, stop, and block attacks that would pass through a traditional firewall device. When a packet comes in through an interface on an IPS, the outbound or trusted interface does not receive that packet until the IPS analyzes the packet.

The advantage of operating in inline mode is that the IPS can stop single-packet attacks from reaching the target system. The disadvantage is that a poorly configured IPS can negatively affect the packet flow of the forwarded traffic.

The biggest difference between IDS and IPS is that an IPS responds immediately and does not allow any malicious traffic to pass, whereas an IDS allows malicious traffic to pass before addressing the problem.


NetFlow and IPFIX

NetFlow is a Cisco IOS technology that provides statistics on packets flowing through a Cisco router or multilayer switch. NetFlow is the standard for collecting operational data from networks. The Internet Engineering Task Force (IETF) used Cisco’s NetFlow Version 9 as the basis for IP Flow Information Export (IPFIX).

IPFIX is a standard format for exporting router-based information about network traffic flows to data collection devices. IPFIX works on routers and management applications that support the protocol. Network managers can export network traffic information from a router and use this information to optimize network performance.

Applications that support IPFIX can display statistics from any router that supports the standard. Collecting, storing, and analyzing the aggregated information provided by IPFIX supported devices provides the following benefits:
  • Secures the network against internal and external threats
  • Troubleshoots network failures quickly and precisely
  • Analyzes network flows for capacity planning
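Under the hood, a flow record is an aggregation of packets that share the same key fields, classically the 5-tuple of source and destination address, source and destination port, and protocol. A minimal Python sketch of that aggregation step (the field names are illustrative, not the IPFIX wire format):

  from collections import defaultdict

  # Packets as (src_ip, dst_ip, src_port, dst_port, protocol, bytes).
  packets = [
      ("10.0.0.5", "192.0.2.80", 51000, 443, "TCP", 1500),
      ("10.0.0.5", "192.0.2.80", 51000, 443, "TCP", 900),
      ("10.0.0.7", "192.0.2.53", 53011, 53, "UDP", 80),
  ]

  # Aggregate packets into flows keyed by the 5-tuple.
  flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
  for *key, size in packets:
      flows[tuple(key)]["packets"] += 1
      flows[tuple(key)]["bytes"] += size

  for key, stats in flows.items():
      print(key, stats)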

Advanced Threat Intelligence

With the right information, advanced threat intelligence can help organizations detect attacks during one of the stages of the cyberattack, and sometimes even before an attack begins.

Organizations may be able to detect indicators of attack in their logs and system reports, including the following security alerts:
  • Account lockouts
  • All database events
  • Asset creation and deletion
  • Configuration modification to systems
Advanced threat intelligence is a type of event or profile data that can contribute to security monitoring and response. As cyber criminals become more sophisticated, it is important to understand malware maneuvers. With improved visibility into attack methodologies, an organization can respond more quickly to incidents.

Types of Disasters

It is critical to keep an organization functioning when a disaster occurs. A disaster includes any natural or human-caused event that damages assets or property and impairs the organization's ability to continue operating.

Natural Disasters

Natural disasters differ depending on location. Some of these events are difficult to predict. Natural disasters fall into the following categories:
  • Geological disasters include earthquakes, landslides, volcanoes, and tsunamis
  • Meteorological disasters include hurricanes, tornadoes, snow storms, lightning, and hail
  • Health disasters include widespread illnesses, quarantines, and pandemics
  • Miscellaneous disasters include fires, floods, solar storms, and avalanches
Human-caused Disasters

Human-caused disasters involve people or organizations and fall into the following categories:
  • Labor events include strikes, walkouts, and slowdowns
  • Social-political events include vandalism, blockades, protests, sabotage, terrorism, and war
  • Materials events include hazardous spills and fires
  • Utilities disruptions include power failures, communication outages, fuel shortages, and radioactive fallout
Click here to see satellite photos of Japan before and after the 2011 Earthquake and Tsunami.

Disaster Recovery Plan

An organization puts its disaster recovery plan (DRP) into action while the disaster is ongoing and employees are scrambling to ensure critical systems are online. The DRP includes the activities the organization takes to assess, salvage, repair, and restore damaged facilities or assets.

To create the DRP, answer the following questions:
  • Who is responsible for this process?
  • What does the individual need to perform the process?
  • Where does the individual perform this process?
  • What is the process?
  • Why is the process critical?
A DRP needs to identify which processes in the organization are the most critical. During the recovery process, the organization restores its mission critical systems first.

Implementing Disaster Recovery Controls

Disaster recovery controls minimize the effects of a disaster to ensure that resources and business processes can resume operation.

There are three types of IT disaster recovery controls:
  • Preventative measures include controls that prevent a disaster from occurring. These measures seek to identify risks.
  • Detective measures include controls that discover unwanted events. These measures uncover new potential threats.
  • Corrective measures include controls that restore the system after a disaster or an event.

Need for Business Continuity

Business continuity is one of the most important concepts in computer security. Even though companies do whatever they can to prevent disasters and loss of data, it is impossible to predict every possible scenario. It is important for companies to have plans in place that ensure business continuity regardless of what may occur. A business continuity plan is broader than a DRP because it includes moving critical systems to another location while repair of the original facility is under way. Personnel continue to perform all business processes in an alternate manner until normal operations resume.

Availability ensures that the resources required to keep the organization going will continue to be available to the personnel and the systems that rely on them.

Business Continuity Considerations

Business continuity controls are more than just backing up data and providing redundant hardware. Organizations need employees to properly configure and operate systems. Data is useless until it provides information. An organization should look at the following:
  • Getting the right people to the right places
  • Documenting configurations
  • Establishing alternate communications channels for both voice and data
  • Providing power
  • Identifying all dependencies for applications and processes so that they are properly understood
  • Understanding how to carry out automated tasks manually

Business Continuity Best Practices

As shown in the figure, the National Institute of Standards and Technology (NIST) developed the following best practices:

  1. Write a policy that provides guidance to develop the business continuity plan and assigns roles to carry out the tasks.
  2. Identify critical systems and processes and prioritize them based on necessity.
  3. Identify vulnerabilities and threats, and calculate risks.
  4. Identify and implement controls and countermeasures to reduce risk.
  5. Devise methods to bring back critical systems quickly.
  6. Write procedures to keep the organization functioning in a chaotic state.
  7. Test the plan.
  8. Update the plan regularly.




