Chapter 1. Introduction
Fault-Tolerance • Reliability : Continuity of Service • Availability : Readiness for Usage • Safety : Avoidance of Catastrophic Consequences on the Environment • Security : Prevention of Unauthorized Access and/or Handling of Information
Fault-Tolerance (2) • Fault-Tolerance : To provide service despite the presence of faults in the system • Fault-Prevention : To prevent faults from occurring or getting introduced into the system
Fault-Tolerance (3) • Manual Maintenance in case of System Failures? • Unacceptability of the real-time delays caused by manual repairs • Inaccessibility of systems for manual repairs • Excessively high costs of lost time and maintenance
Basic Concepts and Definitions • System : An Identifiable Mechanism that Maintains a Pattern of Behavior at an Interface between the System and its Environment • Internal State and External State (Behavior) • Specification : The expected or correct behavior of a system (Complete, Consistent, Correct)
Basic Concepts and Definitions (2) • Failure : When the behavior of the system first deviates from that required by its specification. • Error : The part of the system state which is liable to lead to subsequent failure. • Fault : The cause of an error.
Basic Concepts and Definitions (3) • Faults : • Transient Faults vs. Permanent Faults • Design Faults vs. Operational Faults • Fault Tolerance : The behavior of the system, despite the failure of some of its components, is consistent with its specification.
Phases in Fault Tolerance • Error Detection • Damage Confinement • Error Recovery • Fault Treatment and Continued System Service
Error Detection • Error Detection ? • Why Not Fault Detection or Failure Detection? • Use “Check” to Detect Errors • Replication Check • Timing Check • Structural and Coding Check • Reasonableness Check • Diagnostics Check
Error Detection (2) • Replication Check : Replicate some components of the system and compare or vote on their results. • Timing Check : Use a time-out if the specification of a component includes timing constraints.
Error Detection (3) • Structural and Coding Checks : Check that the structure of the data is as it should be; Coding : Extra bits are added to the data bits. • Reasonableness Checks : Determine whether the state of some object in the system is reasonable (e.g., a range check).
Error Detection (4) • Diagnostics Checks : Use special input values with known output values.
Damage Confinement and Assessment • Damage Assessment : The flow of information between different components of the system is examined. • Damage Confinement : Fire Walls - No information flow takes place across the walls.
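A timing check can be sketched as a wrapper that measures how long a component takes and compares that against its deadline. This is a minimal illustration only: `timing_check`, `component`, and `deadline_s` are hypothetical names, and the sketch only notices a late answer after the fact; it cannot interrupt a component that never returns.

```python
import time


def timing_check(component, deadline_s, *args):
    """Run component(*args) and report whether it met its deadline.

    Returns (result, on_time). A late result is flagged as an error,
    but a component that hangs forever is not interrupted here.
    """
    start = time.monotonic()
    result = component(*args)
    elapsed = time.monotonic() - start
    return result, elapsed <= deadline_s
```

A real watchdog would normally run the component in a separate thread or process so that a missing answer can also be detected.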
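The coding and reasonableness checks above can be illustrated with two very small examples: a single even-parity bit as the "extra bits", and a range check. The function names and the choice of even parity are assumptions made for this sketch, not something the slides specify.

```python
def range_check(value, low, high):
    """Reasonableness check: is the value inside its plausible range?"""
    return low <= value <= high


def add_parity(bits):
    """Coding check: append one even-parity bit to a list of 0/1 data bits."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """True if the word (data bits plus parity bit) has an even number of 1s."""
    return sum(word) % 2 == 0
```

A single parity bit detects any odd number of flipped bits but cannot say which bit is wrong, so it supports detection only, not correction.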
Error Recovery • Error Recovery : Remove the erroneous state • Backward Recovery : Checkpointing & Rollback • Forward Recovery : Make the state error-free by taking the necessary corrective actions.
Fault Treatment and Continued Service • Transient error : handled by error recovery. Permanent error : handled by what? • Fault Location : Identify the faulty component. System Repair : Bypass the faulty component. Dynamic System Reconfiguration. (Using Redundancy)
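Backward recovery can be sketched as follows: save a copy of the state at a checkpoint, apply a step, and restore the saved copy if an error check fails. `Recoverable`, `run_step`, and the `is_valid` predicate are hypothetical names for this illustration; a real system would also have to checkpoint to stable storage and handle steps that raise exceptions.

```python
import copy


class Recoverable:
    """Holds a state dict and supports checkpoint/rollback (backward recovery)."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Save a private copy of the current (assumed error-free) state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Discard the current state and restore the last checkpoint.
        self.state = copy.deepcopy(self._checkpoint)


def run_step(system, step, is_valid):
    """Apply step in place; keep the result only if it passes the error check."""
    step(system.state)
    if is_valid(system.state):
        system.checkpoint()
        return True
    system.rollback()
    return False
```

Forward recovery would instead repair the erroneous state directly, which requires knowing what the damage looks like; rollback needs no such knowledge.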
Overview of Hardware Fault Tolerance • Triple Modular Redundancy (TMR) • What if two units fail, or voting element fails? • Synchronization problem ? • No error detection or recovery ? [Figure: TMR block diagram. The input feeds three modules (M); their outputs go to a voter (V), which produces the output.]
Overview of Hardware Fault Tolerance (2) • Dynamic Redundancy • Several units but with only one operating at a time • If a fault is detected, the faulty unit is switched out. • Cold-standby system vs. Hot-standby system • DR vs. TMR : DR relies on failure detection, and the faulty unit is removed.
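The TMR arrangement can be sketched in software as three callables fed the same input and a majority voter over their outputs. `tmr_vote` and `tmr` are hypothetical names, and the sketch assumes module outputs are hashable values that can be compared for exact equality.

```python
from collections import Counter


def tmr_vote(outputs):
    """Return the majority value of three module outputs, or raise if none exists."""
    value, count = Counter(outputs).most_common(1)[0]
    if count >= 2:
        return value
    raise RuntimeError("no majority among module outputs")


def tmr(modules, x):
    """Feed the same input to all three modules and vote on their outputs."""
    return tmr_vote([m(x) for m in modules])
```

The sketch also shows the limits raised on the slide: one faulty module is masked, but if two modules fail in the same way the voter returns the wrong value, and the voter itself is a single point of failure.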
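Dynamic redundancy can be sketched as a list of units tried one at a time, where a unit is switched out as soon as its output fails an error check. `run_with_standby` and the `check` predicate are hypothetical; treating an exception as a detected fault is an assumption of this sketch.

```python
def run_with_standby(units, x, check):
    """Try units in order; switch out any unit whose output fails the check.

    Returns (result, index_of_unit_used). Raises if every unit is faulty.
    """
    for index, unit in enumerate(units):
        try:
            result = unit(x)
        except Exception:
            continue  # a crash counts as a detected fault; try the next unit
        if check(x, result):
            return result, index
    raise RuntimeError("all units failed")
```

Unlike TMR, nothing is masked here: the scheme is only as good as the check that detects the fault.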
Overview of Hardware Fault Tolerance (3) • Dynamic Redundancy : P1 : P2 ==> P3 P1, P2 ==> P4 P1,P2,P3
Overview of Hardware Fault Tolerance (4) • Coding : Detectability/Correctability of a Code. • Hamming Distance : The minimum number of bit positions in which any two words in the code differ. d = C + D + 1 (with D >= C) C : # of bit errors the code can correct D : # of bit errors the code can detect
Overview of Hardware Fault Tolerance (5) • Hamming Code : C1 C2 D1 C3 D2 D3 D4 C1 = D1 + D2 + D4 C2 = D1 + D3 + D4 C3 = D2 + D3 + D4 (+ is addition modulo 2) • Hamming Distance = ? • 1 bit error can be detected and corrected.
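The distance of a code and the trade-off d = C + D + 1 can be checked numerically. The sketch below represents code words as equal-length strings of "0" and "1" (an assumption of this example) and lists the (C, D) pairs allowed by a given distance, under the usual convention that D is at least C.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of bit positions in which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def code_distance(codewords):
    """Minimum pairwise Hamming distance over all words of the code."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


def capabilities(d):
    """All (C, D) pairs with C <= D that satisfy d = C + D + 1."""
    return [(c, d - 1 - c) for c in range(d) if c <= d - 1 - c]
```

For the 3-bit repetition code {000, 111}, `code_distance` gives 3, so the code can either detect two bit errors without correcting, or correct one.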
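The check-bit equations on this slide can be turned into a small encoder and single-error corrector. The bit layout (C1 C2 D1 C3 D2 D3 D4) and the three equations come from the slide; reading the recomputed checks as a binary position number (the syndrome) is the standard decoding technique for this layout and is added here as an illustration. Function names are hypothetical.

```python
def hamming_encode(d1, d2, d3, d4):
    """Encode 4 data bits as the 7-bit word C1 C2 D1 C3 D2 D3 D4."""
    c1 = d1 ^ d2 ^ d4
    c2 = d1 ^ d3 ^ d4
    c3 = d2 ^ d3 ^ d4
    return [c1, c2, d1, c3, d2, d3, d4]


def hamming_correct(word):
    """Return (corrected_word, error_position).

    Positions are numbered 1..7; position 0 means no error was seen.
    Only a single flipped bit is corrected reliably.
    """
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]  # recheck C1 over positions 1, 3, 5, 7
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]  # recheck C2 over positions 2, 3, 6, 7
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]  # recheck C3 over positions 4, 5, 6, 7
    position = s1 + 2 * s2 + 4 * s3
    if position:
        w[position - 1] ^= 1  # flip the bit the syndrome points at
    return w, position
```

For example, the data bits 1, 0, 1, 1 encode to 0110011; flipping any one of the seven bits and calling `hamming_correct` restores the original word and reports the flipped position.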
Overview of Hardware Fault Tolerance (6) • Cyclic Redundancy Codes (CRC) : Data/A --> (Data+R)/A What about an error that is evenly divisible by A? • Berger Code : Count the # of 0s and append the count. 10011010 --> 10011010100
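Both codes on this slide can be sketched over bit strings. For the CRC, `generator` plays the role of A and the appended remainder plays the role of R; the division is modulo 2 (XOR), and the generator is assumed to start with "1" and have at least two bits. For the Berger code, the unpadded count reproduces the slide's example; the optional `width` parameter is an addition of this sketch, since a fixed-width count field is what a receiver would normally need. All function names are hypothetical.

```python
def mod2_remainder(bits, generator):
    """Remainder of the modulo-2 (XOR) division bits / generator."""
    work = [int(b) for b in bits]
    gen = [int(b) for b in generator]
    for i in range(len(work) - len(gen) + 1):
        if work[i]:
            for j, g in enumerate(gen):
                work[i + j] ^= g
    r = len(gen) - 1  # number of remainder bits
    return "".join(str(b) for b in work[len(work) - r:])


def crc_encode(data, generator):
    """Append R so that the whole word is divisible by the generator."""
    r = len(generator) - 1
    return data + mod2_remainder(data + "0" * r, generator)


def crc_ok(word, generator):
    """True if the received word leaves a zero remainder."""
    return "1" not in mod2_remainder(word, generator)


def berger_encode(data, width=None):
    """Append the binary count of 0s in data (zero-padded if width is given)."""
    zeros = data.count("0")
    check = format(zeros, "b") if width is None else format(zeros, f"0{width}b")
    return data + check


def berger_ok(word, data_len):
    """True if the stored count matches the number of 0s in the data part."""
    data, check = word[:data_len], word[data_len:]
    return data.count("0") == int(check, 2)
```

With generator 1011, `crc_encode("11010011101100", "1011")` appends the remainder 100. On the slide's question: an error pattern that is itself a multiple of A leaves the remainder at zero, so `crc_ok` still accepts the corrupted word and the error goes undetected. `berger_encode("10011010")` returns `10011010100`, matching the slide.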