Organizations that want to maximize the availability of their systems and data may take extraordinary measures to minimize or eliminate data loss. The goal is to minimize the downtime of mission-critical processes. If employees cannot perform their regular duties, the organization is in jeopardy of losing revenue.
Organizations measure availability as a percentage of uptime. This chapter begins by explaining the concept of five nines. Many industries must maintain the highest availability standards because downtime might literally mean the difference between life and death.
This chapter discusses various approaches that organizations can take to help meet their availability goals. Redundancy provides backup and includes extra components for computers or network systems to ensure the systems remain available. Redundant components can include hardware such as disk drives, servers, switches, and routers or software such as operating systems, applications, and databases. The chapter also discusses resiliency, the ability of a server, network, or data center to recover quickly and continue operation.
Organizations must be prepared to respond to an incident by establishing procedures that they follow after an event occurs. The chapter concludes with a discussion of disaster recovery and business continuity planning, both of which are critical to maintaining the availability of an organization's resources.
What Does Five Nines Mean?
Five nines means that systems and services are available 99.999% of the time. It also means that both planned and unplanned downtime amount to less than 5.26 minutes per year. The chart in the figure compares the downtime allowed at various availability percentages.
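The relationship between an availability percentage and allowed downtime is straightforward arithmetic. The following Python sketch (not part of the original material) reproduces the comparison in the figure:

```python
# Allowed downtime per year for common availability levels.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

for availability in (99.0, 99.9, 99.99, 99.999):
    downtime = MINUTES_PER_YEAR * (1 - availability / 100)
    print(f"{availability:>7}% uptime -> {downtime:8.2f} minutes of downtime per year")
```

At 99.999%, the result is approximately 5.26 minutes per year, the figure quoted above.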
High availability refers to a system or component that is continuously operational for a given length of time. To help ensure high availability:
- Eliminate single points of failure
- Design for reliability
- Detect failures as they occur
Sustaining high availability to the five nines standard can increase costs and consume significant resources. The increased costs come from the purchase of additional hardware such as servers and components. As an organization adds components, configuration complexity increases. Unfortunately, increased configuration complexity increases risk: the more moving parts involved, the higher the likelihood of a failed component.
Environments that Require Five Nines
Although sustaining high availability may be too costly for some organizations, several environments require five nines.
- The finance industry needs to maintain high availability for continuous trading, compliance, and customer trust. For example, the New York Stock Exchange suffered a roughly four-hour trading outage in 2015.
- Healthcare facilities require high availability to provide around-the-clock care for patients, and downtime in healthcare data centers is especially costly.
- The public safety industry includes agencies that provide security and services to a community, state, or nation; even the police agency that protects the U.S. Pentagon has experienced a disruptive network outage.
- The retail industry depends on efficient supply chains and the delivery of products to customers. Disruption can be devastating, especially during peak demand times such as holidays.
- The public expects the news media to report events as they happen; the news cycle now runs around the clock, 24 hours a day, 7 days a week.
Threats to Availability
The following threats pose a high risk to data and information availability:
- An unauthorized user successfully penetrates and compromises an organization’s primary database
- A successful DoS attack significantly affects operations
- An organization suffers a significant loss of confidential data
- A mission-critical application goes down
- A compromise of the Admin or root user occurs
- The detection of a cross-site scripting attack or an unauthorized file server share
- The defacement of an organization’s website impacts public relations
- A severe storm such as a hurricane or tornado
- A catastrophic event such as a terrorist attack, building bombing, or building fire
- Long-term utility or service provider outage
- Water damage as the result of flooding or sprinkler failure
Categorizing the impact level of each threat helps an organization estimate the dollar impact of that threat. The figure shows an example of each threat category.
Designing a High Availability System
High availability incorporates three major principles to achieve the goal of uninterrupted access to data and services:
- Elimination or reduction of single points of failure
- System resiliency
- Fault tolerance
It is important to understand the ways to address a single point of failure. A single point of failure can be a central router or switch, a network service, or even a highly skilled member of the IT staff. The key point is that the loss of that system, process, or person can severely disrupt the entire operation, so the organization needs processes, resources, and components that reduce single points of failure. High availability clusters are one way to provide redundancy. These clusters consist of a group of computers that have access to the same shared storage and have identical network configurations. All servers take part in processing a service simultaneously, so from the outside, the server group looks like one device. If a server within the cluster fails, the other servers continue to process the same service as the failed device.
System resiliency refers to the capability to maintain the availability of data and operational processing despite attacks or disruptive events. Generally, this requires redundant systems, in terms of both power and processing, so that should one system fail, the other can take over operations without any break in service. System resiliency is more than hardening devices; it requires that both data and services remain available even while under attack.
Fault tolerance enables a system to continue to operate if one or more components fail. Data mirroring is one example of fault tolerance. Should a "fault" occur, causing disruption in a device such as a disk controller, the mirrored system provides the requested data with no apparent interruption in service to the user.
Asset Identification
An organization needs to know what hardware and software are present before it can determine what the configuration parameters need to be. Asset management includes a complete inventory of hardware and software.
This means that the organization needs to know all of the components that can be subject to security risks, including:
- Every hardware system
- Every operating system
- Every hardware network device
- Every network device operating system
- Every software application
- All firmware
- All language runtime environments
- All individual libraries
An organization may choose an automated solution to keep track of assets. An administrator should investigate any configuration change, because a change may mean that the recorded inventory is out of date or that unauthorized changes are occurring.
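As a simple illustration of automated tracking, the following Python sketch (an assumption-level example, not a full asset management system) collects a basic inventory record for one host and hashes it so that a change in the fingerprint flags configuration drift:

```python
# Minimal single-host inventory sketch using only the standard library.
import hashlib
import json
import platform
import socket

def collect_asset_record() -> dict:
    """Gather a simple hardware/software record for this host."""
    return {
        "hostname": socket.gethostname(),
        "os": platform.system(),
        "os_release": platform.release(),
        "architecture": platform.machine(),
        "python_runtime": platform.python_version(),
    }

record = collect_asset_record()
fingerprint = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
print(json.dumps(record, indent=2))
print("Configuration fingerprint:", fingerprint)
# Comparing this fingerprint against a stored baseline reveals configuration
# changes that an administrator should investigate.
```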
Asset Classification
Asset classification assigns all resources of an organization into a group based on common characteristics. An organization should apply an asset classification system to documents, data records, data files, and disks. The most critical information needs to receive the highest level of protection and may even require special handling.
An organization can adopt a labeling system according to how valuable, how sensitive, and how critical the information is. Complete the following steps to identify and classify the assets of an organization:
- Determine the proper asset identification category.
- Establish asset accountability by identifying the owner for all information assets and application software.
- Determine the criteria for classification.
- Implement a classification schema.
For example, the U.S. government uses sensitivity to classify data as follows: top secret; secret; confidential; public trust; and unclassified.
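A classification schema can be represented very simply in code. The sketch below is hypothetical; the handling rules are illustrative examples, not requirements from any standard:

```python
# Hypothetical labeling schema mapping classification levels to handling rules.
CLASSIFICATION_SCHEMA = {
    "top secret":   {"rank": 5, "handling": "special handling, strict need-to-know"},
    "secret":       {"rank": 4, "handling": "encrypted at rest and in transit"},
    "confidential": {"rank": 3, "handling": "access limited to authorized staff"},
    "public trust": {"rank": 2, "handling": "background-checked personnel only"},
    "unclassified": {"rank": 1, "handling": "no special handling required"},
}

def handling_for(label: str) -> str:
    """Return the handling rule for a labeled asset."""
    return CLASSIFICATION_SCHEMA[label.lower()]["handling"]

print(handling_for("Secret"))  # -> encrypted at rest and in transit
```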
Asset Standardization
Asset management manages the lifecycle and inventory of technology assets, including devices and software. As part of an IT asset management system, an organization specifies the acceptable IT assets that meet its objectives. This practice effectively reduces the number of different asset types in use. For example, an organization will only install applications that meet its guidelines. When administrators eliminate applications that do not meet the guidelines, they effectively increase security.
Asset standards identify the specific hardware and software products that the organization uses and supports. When a failure occurs, prompt action helps to maintain both access and security. If an organization does not standardize its hardware selection, personnel may need to scramble to find a replacement component. Non-standard environments require more expertise to manage, and they increase the cost of maintenance contracts and inventory. For example, the U.S. military has shifted toward standards-based hardware for its communications systems.
Threat Identification
The United States Computer Emergency Readiness Team (US-CERT) and the U.S. Department of Homeland Security sponsor the Common Vulnerabilities and Exposures (CVE) dictionary. Each CVE entry contains a standard identifier number, a brief description, and references to related vulnerability reports and advisories. The MITRE Corporation maintains the CVE List and its public website.
Threat identification begins with the process of creating a CVE Identifier for publicly known cybersecurity vulnerabilities. Each CVE Identifier includes the following:
- The CVE Identifier number
- A brief description of the security vulnerability
- Any important references
The MITRE CVE website provides more information about CVE Identifiers.
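As a structural illustration only, the three elements listed above can be modeled as a small data structure. The identifier, description, and reference below are placeholders, not a real CVE entry:

```python
# Sketch of a CVE record's three elements; all values are placeholders.
from dataclasses import dataclass, field

@dataclass
class CveRecord:
    cve_id: str                                            # the CVE Identifier number
    description: str                                       # brief description of the vulnerability
    references: list[str] = field(default_factory=list)    # any important references

example = CveRecord(
    cve_id="CVE-2099-0001",                                # placeholder identifier
    description="Example vulnerability description (placeholder).",
    references=["https://example.com/advisory"],
)
print(example.cve_id, "-", example.description)
```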
Risk Analysis
Risk analysis is the process of analyzing the dangers posed by natural and human-caused events to the assets of an organization.
An organization performs asset identification to help determine which assets to protect. A risk analysis has four goals:
- Identify assets and their value
- Identify vulnerabilities and threats
- Quantify the probability and impact of the identified threats
- Balance the impact of the threat against the cost of the countermeasure
There are two approaches to risk analysis.
Quantitative Risk Analysis
A quantitative analysis assigns numbers to the risk analysis process (Figure 1). The asset value (AV) is the replacement cost of the asset; the value of an asset can also be measured by the income gained through use of the asset. The exposure factor (EF) is a subjective value, expressed as a percentage, that represents the portion of the asset's value lost due to a particular threat. If a total loss occurs, the EF equals 1.0 (100%). The single loss expectancy (SLE) is the asset value multiplied by the exposure factor. In the quantitative example, the server has an asset value of $15,000, and when the server fails, a total loss occurs (the EF equals 1.0), so the SLE is $15,000 x 1.0 = $15,000.
The annualized rate of occurrence (ARO) is the probability that a loss will occur during the year (also expressed as a percentage). An ARO can be greater than 100% if a loss can occur more than once a year.
The annual loss expectancy (ALE) is the SLE multiplied by the ARO (ALE = SLE x ARO). The ALE gives management guidance on how much it is reasonable to spend each year to protect the asset.
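A minimal worked example of these formulas, using the $15,000 server from the text (the ARO value is an assumption chosen only to illustrate the calculation):

```python
# Quantitative risk analysis: SLE = AV x EF, ALE = SLE x ARO.
asset_value = 15_000        # AV: replacement cost of the server, in dollars
exposure_factor = 1.0       # EF: total loss when the server fails
annualized_rate = 0.10      # ARO: assumed 10% chance of a failure in a given year

single_loss_expectancy = asset_value * exposure_factor             # $15,000
annual_loss_expectancy = single_loss_expectancy * annualized_rate  # $1,500 per year

print(f"SLE = ${single_loss_expectancy:,.2f}")
print(f"ALE = ${annual_loss_expectancy:,.2f}")
# Management should generally not spend more per year protecting the asset
# than the ALE suggests it stands to lose.
```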
Qualitative Risk Analysis
Qualitative risk analysis uses opinions and scenarios. Figure 2 provides an example of a table used in qualitative risk analysis, which plots the likelihood of a threat against its impact. For example, the threat of a server failure may be likely, but its impact may be only marginal.
A team evaluates each threat to an asset and plots it in the table. The team ranks the results and uses them as a guide; for example, it may decide to act only on threats that fall within the red zone.
The numbers used in the table do not directly relate to any aspect of the analysis. For example, a catastrophic impact of 4 is not twice as bad as a marginal impact of 2. This method is subjective in nature.
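The sketch below models this kind of likelihood-versus-impact matrix. The numeric scales and the "red zone" threshold are illustrative assumptions, not values defined in the text:

```python
# Qualitative risk matrix: score = likelihood rank x impact rank.
LIKELIHOOD = {"unlikely": 1, "possible": 2, "likely": 3, "almost certain": 4}
IMPACT = {"negligible": 1, "marginal": 2, "critical": 3, "catastrophic": 4}
RED_ZONE_THRESHOLD = 9   # assumed cutoff for "act on this threat"

threats = {
    "server failure": ("likely", "marginal"),
    "prolonged power outage": ("likely", "catastrophic"),
    "DoS attack": ("possible", "critical"),
}

for name, (likelihood, impact) in threats.items():
    score = LIKELIHOOD[likelihood] * IMPACT[impact]
    zone = "RED" if score >= RED_ZONE_THRESHOLD else "monitor"
    print(f"{name:24s} likelihood={likelihood:8s} impact={impact:13s} -> {zone}")
# The scores only rank threats relative to each other; as noted above, a
# catastrophic impact of 4 is not "twice as bad" as a marginal impact of 2.
```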
Mitigation
Mitigation involves reducing the severity of a loss or the likelihood of the loss occurring. Many technical controls mitigate risk, including authentication systems, file permissions, and firewalls. Organizations and security professionals must understand that risk mitigation can have both positive and negative impacts on the organization. Good risk mitigation finds a balance between the negative impact of countermeasures and controls and the benefit of risk reduction. There are four common ways to reduce risk:
- Accept the risk and periodically re-assess
- Reduce the risk by implementing controls
- Avoid the risk by totally changing the approach
- Transfer the risk to a third party
A short-term strategy is to accept the risk, which necessitates creating contingency plans for it. People and organizations have to accept risk on a daily basis. Modern methodologies reduce risk by developing software incrementally and providing regular updates and patches to address vulnerabilities and misconfigurations.
Outsourcing services, purchasing insurance, and purchasing maintenance contracts are all examples of risk transfer. Hiring specialists to perform critical tasks can reduce risk and yield better results with less long-term investment. A good risk mitigation plan can combine two or more strategies.
Activity - Perform an Asset Risk Analysis
Layering
If there is only one defense in place to protect data and information, cyber criminals only have to get around that single defense. To make sure data and information remain available, an organization must create different layers of protection.
Layering creates a barrier of multiple defenses that work together to prevent attacks. For example, an organization might store its top secret documents on a server in a building surrounded by an electronic fence.
A layered approach provides the most comprehensive protection. If cyber criminals penetrate one layer, they still have to contend with several more, each more complicated than the previous one.
Defense in depth will not provide an impenetrable cyber shield, but it will help an organization minimize risk by keeping it one step ahead of cyber criminals.
Limiting
Limiting access to data and information reduces the possibility of a threat. An organization should restrict access so that users only have the level of access required to do their job. For example, the people in the marketing department do not need access to payroll records to perform their jobs.
Technology-based solutions, such as file permissions, are one way to limit access; an organization should also implement procedural measures. For example, a procedure should prohibit employees from removing sensitive documents from the premises.
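A minimal sketch of a technology-based limit: restrict a sensitive file so that only its owner can read or write it. The file name is hypothetical and is created here just so the example runs:

```python
# Limit access to a sensitive file to its owner only (mode 600).
import os
import stat

payroll_file = "payroll_records.csv"          # hypothetical sensitive file
with open(payroll_file, "w") as f:
    f.write("employee,salary\n")              # placeholder contents

os.chmod(payroll_file, stat.S_IRUSR | stat.S_IWUSR)   # owner read/write; no group/other access

print(payroll_file, stat.filemode(os.stat(payroll_file).st_mode))   # e.g. -rw-------
```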
Diversity
If all of the protected layers were the same, it would not be very difficult for cyber criminals to conduct a successful attack. Therefore, the layers must be different. If cyber criminals penetrate one layer, the same technique will not work on all of the other layers. Breaching one layer of security does not compromise the whole system. An organization may use different encryption algorithms or authentication systems to protect data in different states.
To accomplish the goal of diversity, organizations can use security products manufactured by different companies for multifactor authentication. For example, the server containing the top secret documents is in a locked room that requires a swipe card from one company and biometric authentication supplied by another company.
Obscurity
Obscuring information can also protect data and information. An organization should not reveal any information that cyber criminals can use to figure out what version of the operating system a server is running or the type of equipment it uses. For example, error messages should not contain any details that cyber criminals could use to determine what vulnerabilities are present. Concealing certain types of information makes it more difficult for cyber criminals to attack a system.
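The sketch below illustrates the error-message point with a hypothetical request handler (the handler and its failure are invented for the example): details are written to an internal log, while the client sees only a generic message and an opaque reference number:

```python
# Log the detailed error internally; return a generic message to the client.
import logging
import uuid

logging.basicConfig(filename="app_internal.log", level=logging.ERROR)

def handle_request(request: dict) -> dict:
    try:
        return {"status": 200, "body": process(request)}
    except Exception:
        incident_id = uuid.uuid4().hex[:8]
        # Full details (stack trace, versions, paths) stay in the internal log.
        logging.exception("Request failed (incident %s)", incident_id)
        # The client response reveals nothing about software versions or vulnerabilities.
        return {"status": 500, "body": f"An error occurred. Reference: {incident_id}"}

def process(request: dict) -> str:
    # Hypothetical back-end step that fails with revealing details.
    raise RuntimeError("database driver 9.2.1 failed on host db-01")

print(handle_request({"path": "/report"}))
```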
Simplicity
Complexity does not guarantee security. If an organization implements complex systems that are hard to understand and troubleshoot, the effort may actually backfire. If employees do not understand how to configure a complex solution properly, it can be just as easy for cyber criminals to compromise those systems. To maintain availability, a security solution should be simple from the inside but complex on the outside.
Activity - Identify the Layer of Defense
Single Points of Failure
A single point of failure is any component or operation whose failure halts a critical operation that other operations depend on. A single point of failure can be a special piece of hardware, a process, a specific piece of data, or even an essential utility. Single points of failure are the weak links in the chain that can disrupt the organization's operations. Generally, the solution to a single point of failure is to modify the critical operation so that it does not rely on a single element. The organization can also build redundant components into the critical operation to take over the process should one of these points fail.
N+1 Redundancy
N+1 redundancy ensures system availability in the event of a component failure. Components (N) need to have at least one backup component (+1). For example, a car has four tires (N) and a spare tire in the trunk in case of a flat (+1).
In a data center, N+1 redundancy means that the system design can withstand the loss of a component. The N refers to many different components that make up the data center including servers, power supplies, switches, and routers. The +1 is the additional component or system that is standing by ready to go if needed.
An example of N+1 redundancy in a data center is a power generator that comes online when something happens to the main power source. Although an N+1 system contains redundant equipment, it is not a fully redundant system.
RAID
A redundant array of independent disks (RAID) combines multiple physical hard drives into a single logical unit to provide data redundancy and improve performance. RAID takes data that is normally stored on a single disk and spreads it out among several drives. With redundant RAID levels, if any single disk fails, the data can be recovered from the other disks where it also resides.
RAID can also increase the speed of data retrieval, because reading from multiple drives is faster than relying on a single disk to do all of the work.
A RAID solution can be either hardware-based or software-based. A hardware-based solution requires a specialized hardware controller on the system that contains the RAID drives. The following terms describe how RAID stores data on the various disks:
- Parity - Detects data errors and provides the information needed to rebuild lost data.
- Striping - Writes data across multiple drives.
- Mirroring - Stores duplicate data on a second drive.
There are several levels of RAID available as shown in the figure.
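As a quick illustration of how the level chosen affects capacity, the following sketch (example drive count and size; standard capacity formulas for these levels) compares usable space:

```python
# Usable capacity for a few common RAID levels, given identical drives.
def usable_capacity(level: int, drives: int, size_tb: float) -> float:
    if level == 0:                      # striping only: no redundancy, full capacity
        return drives * size_tb
    if level == 1:                      # mirroring: all drives hold identical copies
        return size_tb
    if level == 5:                      # striping with single parity: lose one drive's worth
        return (drives - 1) * size_tb
    if level == 6:                      # striping with double parity: lose two drives' worth
        return (drives - 2) * size_tb
    raise ValueError(f"RAID {level} not covered in this sketch")

for level in (0, 1, 5, 6):
    print(f"RAID {level}: {usable_capacity(level, drives=4, size_tb=2.0):.1f} TB usable")
```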
Spanning Tree
Redundancy increases the availability of the infrastructure by protecting the network from a single point of failure, such as a failed network cable or a failed switch. When designers build physical redundancy into a network, loops and duplicate frames can occur. Loops and duplicate frames have severe consequences for a switched network.
Spanning Tree Protocol (STP) addresses these issues. The basic function of STP is to prevent loops on a network when switches interconnect via multiple paths. STP ensures that redundant physical links are loop-free. It ensures that there is only one logical path between all destinations on the network. STP intentionally blocks redundant paths that could cause a loop.
Blocking the redundant paths is critical to preventing loops on the network. The physical paths still exist to provide redundancy, but STP disables these paths to prevent the loops from occurring. If a network cable or switch fails, STP recalculates the paths and unblocks the necessary ports to allow the redundant path to become active.
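The following sketch is a simplified model of this idea, not the actual STP algorithm (which elects a root bridge and compares path costs using BPDUs): starting from a chosen root switch, it keeps one loop-free set of links and reports the redundant links that would be blocked:

```python
# Simplified loop-free topology: keep one path to each switch, block the rest.
from collections import deque

links = {("S1", "S2"), ("S1", "S3"), ("S2", "S3")}   # a redundant triangle of switches
root = "S1"

active, visited, queue = set(), {root}, deque([root])
while queue:                                          # breadth-first search from the root
    switch = queue.popleft()
    for a, b in sorted(links):
        if switch in (a, b):
            neighbor = b if switch == a else a
            if neighbor not in visited:
                visited.add(neighbor)
                active.add((a, b))
                queue.append(neighbor)

print("Active links :", sorted(active))              # [('S1', 'S2'), ('S1', 'S3')]
print("Blocked links:", sorted(links - active))      # [('S2', 'S3')]
# Removing a failed link from `links` and re-running the search unblocks the
# redundant path, which is what STP recalculation accomplishes.
```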
The figure illustrates how STP responds when a failure occurs:
- PC1 sends a broadcast out onto the network.
- The trunk link between S2 and S1 fails, resulting in disruption of the original path.
- S2 unblocks the previously blocked port for Trunk2 and allows the broadcast traffic to traverse the alternate path around the network, permitting communication to continue.
- If the link between S2 and S1 comes back up, STP again blocks the link between S2 and S3.
Router Redundancy
The default gateway router is often a single point of failure; first-hop redundancy protocols allow a standby router to take over packet forwarding if the active gateway fails.
Router Redundancy Options
- Hot Standby Router Protocol (HSRP) - HSRP provides high network availability by providing first-hop routing redundancy. A group of routers use HSRP for selecting an active device and a standby device. In a group of device interfaces, the active device is the device that routes packets; the standby device is the device that takes over when the active device fails. The function of the HSRP standby router is to monitor the operational status of the HSRP group and to quickly assume packet-forwarding responsibility if the active router fails.
- Virtual Router Redundancy Protocol (VRRP) - A VRRP router runs the VRRP protocol in conjunction with one or more other routers attached to a LAN. In a VRRP configuration, the elected router is the virtual router master, and the other routers act as backups, in case the virtual router master fails.
- Gateway Load Balancing Protocol (GLBP) - GLBP protects data traffic from a failed router or circuit, like HSRP and VRRP, while also allowing load balancing (also called load sharing) between a group of redundant routers.
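The sketch below is a simplified illustration of priority-based first-hop redundancy, not actual HSRP, VRRP, or GLBP behavior (which involves hello timers, virtual MAC addresses, and multicast messages): the highest-priority reachable router becomes active, and a standby takes over when it fails:

```python
# Simplified active/standby election by priority for a redundant gateway group.
routers = [
    {"name": "R1", "priority": 110, "reachable": True},
    {"name": "R2", "priority": 100, "reachable": True},
]

def elect_active(group):
    """Pick the highest-priority reachable router to own the virtual gateway."""
    candidates = [r for r in group if r["reachable"]]
    return max(candidates, key=lambda r: r["priority"]) if candidates else None

print("Active router:", elect_active(routers)["name"])    # R1

routers[0]["reachable"] = False                            # simulate failure of R1
print("After failure:", elect_active(routers)["name"])     # R2 takes over forwarding
```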
Location Redundancy
An organization can also replicate data to a redundant location. The following approaches differ in how closely the backup location tracks the primary one.
Synchronous replication:
- Synchronizes both locations in real time
- Requires high bandwidth
- Locations must be close together to reduce latency
Asynchronous replication:
- Not synchronized in real time, but close to it
- Requires less bandwidth
- Sites can be farther apart because latency is less of an issue
Point-in-time replication:
- Updates the backup data location periodically
- Most bandwidth conservative because it does not require a constant connection
Resilient Design
Application Resilience
IOS Resilience
Preparation
During the preparation phase, the incident response team:
- Maintains the incident response plan
- Ensures its members are knowledgeable about the plan
- Tests the plan
- Gets management’s approval of the plan
Detection and Analysis
Organizations detect and analyze incidents through:
- Alerts and notifications
- Monitoring and follow-up
Containment, Eradication, and Recovery
After identifying a threat, the team works to contain it, eradicate it, and then recover any affected systems and data.
Post-Incident Follow-Up
After resolving an incident, the organization asks the following questions:
- What actions will prevent the incident from reoccurring?
- What preventive measures need strengthening?
- How can the organization improve system monitoring?
- How can the organization minimize downtime during the containment, eradication, and recovery phases?
- How can management minimize the impact to the business?
Activity - Order the Incident Response Phases
Network Admission Control
Network admission control (NAC) permits only authorized and compliant systems to connect to the network. Compliance checks can include:
- Updated virus detection
- Operating systems patches and updates
- Complex password enforcement
Intrusion Detection Systems
An intrusion detection system (IDS) monitors traffic for signs of attack and has the following characteristics:
- An IDS works passively
- The IDS device is physically positioned in the network so that traffic must be mirrored to reach it
- Network traffic does not pass through the IDS unless it is mirrored
Intrusion Prevention Systems
Unlike an IDS, an intrusion prevention system (IPS) sits inline in the traffic path, so it can block or drop malicious traffic in real time in addition to detecting it.
NetFlow and IPFIX
NetFlow, and its IETF-standardized counterpart IPFIX, export summary records of network traffic flows. Organizations use this flow data to:
- Secure the network against internal and external threats
- Troubleshoot network failures quickly and precisely
- Analyze network flows for capacity planning
Advanced Threat Intelligence
Advanced threat intelligence helps an organization detect attacks as they unfold by monitoring security alert data such as:
- Account lockouts
- All database events
- Asset creation and deletion
- Configuration modification to systems
Types of Disasters
Natural disasters fall into the following categories:
- Geological disasters include earthquakes, landslides, volcanoes, and tsunamis
- Meteorological disasters include hurricanes, tornadoes, snow storms, lightning, and hail
- Health disasters include widespread illnesses, quarantines, and pandemics
- Miscellaneous disasters include fires, floods, solar storms, and avalanches
Human-caused disasters fall into the following categories:
- Labor events include strikes, walkouts, and slowdowns
- Social-political events include vandalism, blockades, protests, sabotage, terrorism, and war
- Materials events include hazardous spills and fires
- Utilities disruptions include power failures, communication outages, fuel shortages, and radioactive fallout
Disaster Recovery Plan
A disaster recovery plan (DRP) documents, for each critical process, the answers to the following questions:
- Who is responsible for this process?
- What does the individual need to perform the process?
- Where does the individual perform this process?
- What is the process?
- Why is the process critical?
Implementing Disaster Recovery Controls
Disaster recovery controls minimize the effects of a disaster and fall into three categories:
- Preventive measures include controls that prevent a disaster from occurring. These measures seek to identify risks.
- Detective measures include controls that discover unwanted events. These measures uncover new potential threats.
- Corrective measures include controls that restore the system after a disaster or an event.
Need for Business Continuity
Even with strong preventive controls, an organization cannot predict every possible disaster, so it needs plans that allow critical operations to continue or resume quickly when one occurs.
Business Continuity Considerations
A business continuity plan should address the following considerations:
- Getting the right people to the right places
- Documenting configurations
- Establishing alternate communications channels for both voice and data
- Providing power
- Identifying all dependencies for applications and processes so that they are properly understood
- Understanding how to carry out automated tasks manually
Business Continuity Best Practices
The following best practices help an organization develop and maintain its business continuity plan:
- Write a policy that provides guidance for developing the business continuity plan and assigns roles to carry out the tasks.
- Identify critical systems and processes and prioritize them based on necessity.
- Identify vulnerabilities, threats, and calculate risks.
- Identify and implement controls and countermeasures to reduce risk.
- Devise methods to bring back critical systems quickly.
- Write procedures to keep the organization functioning in a chaotic state.
- Test the plan.
- Update the plan regularly.