Lecture 13. Fault Tolerance Networked vs. Distributed Operating Systems. Fault-Tolerance.
Lecture 13 Fault Tolerance Networked vs. Distributed Operating Systems
Fault-Tolerance
Fault-tolerance or graceful degradation is the property that enables a system (often computer-based) to continue operating properly in the event of the failure of (or one or more faults within) some of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system in which even a small failure can cause total breakdown. Fault-tolerance is particularly sought-after in high-availability or life-critical systems.
Fault-tolerance is not just a property of individual machines; it may also characterise the rules by which they interact. For example, the Transmission Control Protocol (TCP) is designed to allow reliable two-way communication in a packet-switched network, even in the presence of communications links which are imperfect or overloaded. It does this by requiring the endpoints of the communication to expect packet loss, duplication, reordering and corruption, so that these conditions do not damage data integrity, and only reduce throughput by a proportional amount.
http://en.wikipedia.org/wiki/Fault-tolerant_system
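The receiver-side idea described for TCP can be illustrated with a small sketch. This is not TCP itself: real TCP numbers bytes rather than segments, uses a 16-bit checksum, and relies on acknowledgements and retransmission timers. The names make_segment and receive are hypothetical, and CRC32 is used here only as a stand-in checksum.

```python
import zlib


def make_segment(seq, payload):
    """Build a (seq, payload, checksum) tuple. CRC32 is a stand-in checksum."""
    return (seq, payload, zlib.crc32(payload))


def receive(segments):
    """Reassemble data while tolerating corruption, duplication and reordering.

    Returns (data, missing): the in-order data delivered so far, and the
    sequence numbers a sender would still have to retransmit.
    """
    accepted = {}
    for seq, payload, checksum in segments:
        if zlib.crc32(payload) != checksum:
            continue  # corrupted: discard, which turns corruption into loss
        accepted.setdefault(seq, payload)  # duplicate: keep the first copy only

    if not accepted:
        return b"", []

    highest = max(accepted)
    missing = [s for s in range(highest + 1) if s not in accepted]

    # Deliver only the contiguous prefix, in sequence order.
    data = bytearray()
    for s in range(highest + 1):
        if s not in accepted:
            break
        data += accepted[s]
    return bytes(data), missing


# Segments arrive reordered, duplicated, and with one corrupted payload.
arrived = [
    make_segment(2, b"erant"),
    make_segment(0, b"fault"),
    make_segment(0, b"fault"),            # duplicate
    (1, b"-tal", zlib.crc32(b"-tol")),    # payload damaged in transit
]
first_try = receive(arrived)
after_retransmit = receive(arrived + [make_segment(1, b"-tol")])
print(first_try)
print(after_retransmit)
```

In this sketch the damaged segment costs one retransmission but never reaches the application, which is the sense in which faults reduce throughput without damaging data integrity.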
Fault Tolerant Computer Systems
Most fault-tolerant computer systems are designed to handle several possible kinds of failure, including:
• hardware-related faults such as hard disk failures, input or output device failures, or other temporary or permanent failures;
• software bugs and errors;
• interface errors between the hardware and software, including driver failures;
• operator errors, such as erroneous keystrokes, bad command sequences, or installing unexpected software;
• physical damage or other flaws introduced to the system from an outside source.
[Figure: a conceptual design of a segregated-component fault-tolerant computer]
http://en.wikipedia.org/wiki/Fault-tolerant_computer_systems
RAID
RAID, an acronym for Redundant Array of Inexpensive Disks or Redundant Array of Independent Disks, is a technology that achieves high levels of storage reliability from low-cost, less reliable PC-class disk drives by arranging the devices into arrays for redundancy. All implementations of RAID except RAID 0 are examples of fault-tolerant storage that uses data redundancy.
• RAID 0 (striped disks) distributes data across multiple disks in a way that improves speed. If one disk fails, however, all of the data on the array is lost, as there is neither parity nor mirroring; that is, RAID 0 is not redundant.
• RAID 1 mirrors the contents of the disks, making a form of 1:1 real-time backup. The contents of each disk in the array are identical to those of every other disk in the array. A RAID 1 array requires a minimum of two drives.
• RAID 3 or 4 (striped disks with dedicated parity) combines three or more disks in a way that protects data against the loss of any one disk. Fault tolerance is achieved by adding an extra disk to the array, which is dedicated to storing parity information; the overall capacity of the array is reduced by one disk.
• RAID 5 (striped set with distributed or interleaved parity) requires three or more disks. The array keeps operating as long as all drives but one are present: a single drive failure does not destroy the array, although the failed drive must be replaced. After a drive failure, subsequent reads are calculated from the distributed parity, so the failure is masked from the end user.
• RAID 6 (striped disks with dual parity) combines four or more disks in a way that protects data against the loss of any two disks.
• RAID 1+0 (or 10) is a mirrored data set (RAID 1) which is then striped (RAID 0), hence the "1+0" name. A RAID 1+0 array requires a minimum of four drives: two mirrored drives to hold half of the striped data, plus another two mirrored drives for the other half.
• RAID 0+1 (or 01) is a striped data set (RAID 0) which is then mirrored (RAID 1). A RAID 0+1 array requires a minimum of four drives: two to hold the striped data, plus another two to mirror the first pair.
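The single-parity levels above (RAID 3, 4 and 5) are commonly implemented with bytewise XOR. The following is a minimal sketch of that idea under simplifying assumptions: each "disk" is a short byte string holding one block, and xor_blocks is a hypothetical helper, not part of any RAID tool.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data disks, each holding one 4-byte block of a stripe.
data_disks = [b"ABCD", b"EFGH", b"IJKL"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data_disks)

# Simulate losing disk 1: XOR of the survivors and the parity block
# yields the missing block, because x ^ x cancels to zero.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_disks[1])
```

With only one parity block per stripe, a second simultaneous failure leaves too little information to rebuild, which is why RAID 6 adds a second, independent parity.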
Characteristics of Fault Tolerance
The basic characteristics of fault tolerance require:
• No single point of failure
• No single point of repair
• Fault isolation to the failing component
• Fault containment to prevent propagation of the failure
• Availability of reversion modes
In addition, fault tolerant systems are characterized in terms of both planned service outages and unplanned service outages. These are usually measured at the application level and not just at a hardware level. The figure of merit is called availability and is expressed as a percentage. For example, a five nines system would statistically provide 99.999% availability.
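An availability percentage is easier to judge when converted into allowed downtime. The sketch below does that conversion, assuming a 365-day year; the function name downtime_minutes_per_year is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent):
    """Expected downtime per year implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for nines in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(nines)
    print(f"{nines}% available: about {minutes:.2f} minutes of downtime per year")
```

Under this assumption, five nines works out to roughly 5.26 minutes of downtime per year, while 99.9% allows about 8.76 hours.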
No Single Point of Failure
Spare components address the first fundamental characteristic of fault-tolerance in three ways:
• Replication: providing multiple identical instances of the same system or subsystem, directing tasks or requests to all of them in parallel, and choosing the correct result on the basis of a quorum.
• Redundancy: providing multiple identical instances of the same system and switching to one of the remaining instances in case of a failure (failover).
• Diversity: providing multiple different implementations of the same specification, and using them like replicated systems to cope with errors in a specific implementation.
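The three approaches can be sketched in a few lines. This is an illustrative toy, not a production design: the "instances" are plain Python callables, and the helper names vote and failover are hypothetical.

```python
from collections import Counter


def vote(results):
    """Replication: return the value that a strict majority of replicas agree on."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among replicas")
    return value


def failover(instances, request):
    """Redundancy: try each instance in turn until one of them succeeds."""
    last_error = None
    for instance in instances:
        try:
            return instance(request)
        except Exception as error:  # treat any exception as an instance failure
            last_error = error
    raise RuntimeError("all instances failed") from last_error


def good(x):
    return x * x


def faulty(x):
    return x * x + 1  # silently returns a wrong answer


def crashed(x):
    raise ConnectionError("instance is down")


# Replication: all replicas run the request, the quorum outvotes the faulty one.
replicated = vote([replica(7) for replica in (good, faulty, good)])

# Redundancy: the first instance is down, so the request fails over to the second.
failed_over = failover([crashed, good], 7)

# Diversity: independently written implementations of the same specification.
diverse = vote([impl(7) for impl in (good, lambda x: x ** 2, lambda x: sum([x] * x))])

print(replicated, failed_over, diverse)
```

Note the difference in what each approach tolerates: failover only helps when a failure is detectable (here, an exception), whereas voting can also mask an instance that returns a wrong answer without crashing.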
No Single Point of Repair
If a system experiences a failure, it must continue to operate without interruption during the repair process. Fault-tolerant servers go beyond high availability to offer "continuous availability". Such servers are designed to guarantee an availability of 99.999%, that is, on average about 5 minutes of unplanned interruption per year, including the time necessary for repairs, updates, and general maintenance.
http://www.nec-computers.com/page.asp?id=222
Fault isolation to the failing component
When a failure occurs, the system must be able to isolate the failure to the offending component. This requires the addition of dedicated failure detection mechanisms that exist only for the purpose of fault isolation.
Cisco IOS XR Software is the first fully modular, fully distributed internetwork operating system built on a microkernel-based, memory-protected architecture that strictly segments all operating system components, from device drivers and file systems to management interfaces and routing protocols, helping to ensure complete process separation and fault isolation.
[Figure: Cisco IOS XR Software Architecture]
http://www.cisco.com/en/US/products/ps5763/products_white_paper09186a008022da42.shtml
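Process separation of the kind described above can be imitated at small scale by running each component in its own operating system process and checking its exit status. This sketch is only an analogy, not how Cisco IOS XR works; the component names and the run_components helper are hypothetical, and one component is made to fail on purpose.

```python
import subprocess
import sys

# Each hypothetical component is a tiny Python program run in its own process.
COMPONENTS = {
    "routing": "print('routing ok')",
    "driver": "raise RuntimeError('simulated driver fault')",
    "management": "print('management ok')",
}


def run_components(components):
    """Run every component in a separate process and report which ones failed."""
    failed = []
    for name, source in components.items():
        result = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
        )
        # The exit status acts as the failure detection mechanism: a crash in
        # one process cannot corrupt the memory of the supervisor or its peers.
        if result.returncode != 0:
            failed.append(name)
    return failed


failed = run_components(COMPONENTS)
print("failed components:", failed)
```

Because every component has its own address space, the supervisor can name the offending component and keep the others running.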
Fault containment
Some failure mechanisms can cause a system to fail by propagating the failure to the rest of the system.
From a computing perspective, containers are objects that can "hold" a collection of other objects or entities. The Cisco hierarchical Containment Model can reflect the real-world topology of the network that is being modelled, in a physical, logical or business-oriented sense.
• Fault extensions for fault management and root cause analysis.
• Configuration extensions for configuration management and policy assurance.
• Accounting extensions for the accounting aspect of network management.
• Performance extensions for monitoring and analyzing a network's performance.
• Security extensions for managing a network's security.
http://www.ciscosystems.com/en/US/docs/wireless/cw4mw/MWFM2.0.1/FaultMedTop/04active.html