Fault Tolerance • Motivation: Systems need to be much more reliable than their components • Use Redundancy: Extra items that can be used to make up for failures • Types of Redundancy: • Hardware • Software • Time • Information
Fault-Tolerant Scheduling • Fault Tolerance: The ability of a system to suffer component failures and still function adequately • Fault-Tolerant Scheduling: Save enough time in a schedule that the system can still function despite a certain number of processor failures
FT-Scheduling: Model • System Model • Multiprocessor system • Each processor has its own memory • Tasks are preloaded into assigned processors • Task Model • Tasks are independent of one another • Schedules are created ahead of time
Basic Idea • Preassign backup copies, called ghosts. • Assign ghosts to the processors along with the primary copies • A ghost and a primary copy of the same task can’t be assigned to the same processor • For each processor, all the primaries and a particular subset of the ghost copies assigned to it should be feasibly schedulable on that processor
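The two constraints above can be checked mechanically. The following is a minimal sketch, not the scheduling algorithm itself: the task costs, the `frame` length, and the use of a simple load sum as a stand-in for a real schedulability test are all assumptions made for illustration.

```python
def ghosts_separated(primary, ghost):
    """True if no task has its primary and its ghost on the same processor.

    primary, ghost: dicts mapping task name -> processor id (hypothetical layout).
    """
    return all(primary[t] != ghost[t] for t in primary)


def survives_failure(costs, primary, ghost, failed, frame):
    """True if every surviving processor can fit its primaries plus the
    ghosts activated by the failure of processor `failed` within `frame`.

    Assumption: "feasibly schedulable" is approximated by total load <= frame.
    """
    load = {}
    for task, cost in costs.items():
        # A task whose primary was on the failed processor runs as its ghost.
        host = ghost[task] if primary[task] == failed else primary[task]
        if host == failed:
            return False  # primary and ghost shared the failed processor
        load[host] = load.get(host, 0) + cost
    return all(total <= frame for total in load.values())


costs = {"a": 2, "b": 3, "c": 1}
primary = {"a": 0, "b": 1, "c": 2}
ghost = {"a": 1, "b": 2, "c": 0}
print(ghosts_separated(primary, ghost))
print(all(survives_failure(costs, primary, ghost, p, frame=5) for p in (0, 1, 2)))
```

Only the ghosts of the failed processor's tasks are counted, which is why a processor can host more ghosts than it could ever run at once.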
Requirements • Two main variations: • Current and future iterations of the task have to be saved if a processor fails • Only future iterations need to be saved; the current iteration can be discarded
Forward and Backward Masking • Forward Masking: Mask the output of failed units without significant loss of time • Backward Masking: After detecting an error, try to fix it by recomputing or some other means
Failure Types • Permanent: The fault is incurable • Transient: The unit is faulty for some time, following which it starts functioning correctly again • Intermittent: Frequently cycles between a faulty and a non-faulty state
Faults and Errors • A fault is some physical defect or malfunction • An error is a manifestation of a fault • Latency: • Fault Latency: Time between occurrence of a fault and its manifestation as an error • Error Latency: Time between the generation of an error and its being caught by the system
Hardware Failure Recovery • If transient, it may be enough to wait for the fault to go away and then reinvoke the computation • If permanent, reassign the tasks to other, functional, processors
Faults: Output Characteristics • Stuck-at: A line is stuck at 0 or 1. • Dead: No output (e.g., high-impedance state) • Arbitrary: The output changes with time
Factors Affecting HW F-Rate • Temperature • Radiation • Power surges • Mechanical shocks • HW failure rate often follows the “bathtub” curve
Some Terminology • Fail-safe Systems: Systems which end up in a “safe” state upon failure • Example: All traffic lights turning red in an intersection • Fail-stop Systems: Systems that stop producing output when they fail
Example of HW Redundancy • Triple-Modular Redundancy (TMR): • Three units run the same algorithm in parallel • Their outputs are voted on and the majority is picked as the output of the TMR cluster • Can forward-mask up to one processor failure
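A minimal sketch of the TMR voter, together with the cluster reliability under two assumptions that the slide does not state: the voter itself never fails, and the three units fail independently with the same reliability r. Under those assumptions the cluster works when at least two units work, which gives 3r^2 - 2r^3.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value produced by at least two of the three units,
    or None if all three disagree (no majority exists)."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    return value if count >= 2 else None


def tmr_reliability(r):
    """Probability that at least 2 of 3 independent units work,
    assuming a perfect voter: 3r^2(1-r) + r^3 = 3r^2 - 2r^3."""
    return 3 * r**2 - 2 * r**3


print(tmr_vote(42, 42, 17))   # one faulty unit is masked
print(tmr_vote(1, 2, 3))      # no majority
print(tmr_reliability(0.9))
```

Note that TMR only helps when r > 0.5; below that, the cluster is less reliable than a single unit.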
Mathematical Background • Basic laws of probability • Density and distribution functions • Notion of stochastic independence • Expectation, variance, etc. • Memoryless distribution • Markov chains • Steady-state & transient solutions • Bayes’s Law
Hardware FT • N-Modular Redundancy (NMR) • Basic structure • Variations • Reliability evaluation • Independent failures • Correlated failures • Voter: • Bit-by-bit comparison • Median • Formalized majority • Generalized k-plurality
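For the independent-failure case, NMR reliability reduces to a binomial sum. The sketch below assumes a perfect voter and N identical units with reliability r, and also shows a median voter, which suits numeric outputs that may differ slightly between correct units. The function names are illustrative only.

```python
from math import comb
from statistics import median


def nmr_reliability(n, r):
    """Probability that a strict majority of n independent units work
    (perfect voter assumed)."""
    need = n // 2 + 1
    return sum(comb(n, k) * r**k * (1 - r) ** (n - k) for k in range(need, n + 1))


def median_vote(outputs):
    """Median voter: with an odd number of units, the result is a good value
    as long as fewer than half of the outputs are faulty."""
    return median(outputs)


print(nmr_reliability(3, 0.9))
print(nmr_reliability(5, 0.9))
print(median_vote([10.1, 10.0, 57.3]))  # the outlier 57.3 is masked
```

Correlated failures break the binomial model, so these figures should be read as an upper bound on what real hardware achieves.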
Exploiting Appln Semantics • Acceptance Test: Specify a range outside which the output is tagged as faulty (or at least suspicious) • No acceptance test is perfect: • Sensitivity: Probability of catching an incorrect output • Specificity: Probability that a correct output is not flagged as wrong • Specificity = 1 - False Positive Probability
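A minimal sketch of a range-based acceptance test and of how its sensitivity and specificity could be estimated from labelled outputs. Specificity is taken here as 1 minus the false-positive probability, as in the last bullet. The data layout (pairs of value and ground truth) and the range limits are assumptions for illustration.

```python
def acceptance_test(value, low, high):
    """Accept the output only if it lies inside the plausible range."""
    return low <= value <= high


def sensitivity_specificity(outputs, low, high):
    """outputs: list of (value, is_actually_correct) pairs.

    Returns (sensitivity, specificity). Assumes the list contains at least
    one correct and one incorrect output, otherwise a division by zero occurs.
    """
    caught = missed = passed = false_alarm = 0
    for value, correct in outputs:
        flagged = not acceptance_test(value, low, high)
        if not correct:
            if flagged:
                caught += 1       # incorrect output, detected
            else:
                missed += 1       # incorrect output that looks plausible
        elif flagged:
            false_alarm += 1      # correct output, wrongly flagged
        else:
            passed += 1
    return caught / (caught + missed), passed / (passed + false_alarm)


samples = [(5, True), (50, True), (-5, True), (120, False), (60, False), (200, False)]
print(sensitivity_specificity(samples, low=0, high=100))
```

Narrowing the range raises sensitivity at the cost of specificity, which is the trade-off behind "no acceptance test is perfect".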
Checkpointing • Store partial results in a safe place • When failure occurs, roll back to the latest checkpoint and restart • Issues: • Checkpoint positioning • Implementation • Kernel level • Application level • Correctness: Can be a problem in distributed systems
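The roll-back-and-restart cycle can be illustrated at application level. This is a toy sketch: the in-memory `saved` copy stands in for stable storage, and the single injected fault (`fail_once_at`) is a hypothetical transient failure used only to exercise the rollback path.

```python
import copy


def run(items, every, fail_once_at=None):
    """Sum `items`, checkpointing every `every` steps.

    Returns (total, number_of_rollbacks).
    """
    state = {"i": 0, "total": 0}
    saved = copy.deepcopy(state)            # initial checkpoint
    rollbacks = 0
    while state["i"] < len(items):
        if state["i"] == fail_once_at:
            fail_once_at = None             # the fault is transient: it hits once
            state = copy.deepcopy(saved)    # roll back to the latest checkpoint
            rollbacks += 1
            continue
        state["total"] += items[state["i"]]
        state["i"] += 1
        if state["i"] % every == 0:
            saved = copy.deepcopy(state)    # store partial results
    return state["total"], rollbacks


print(run(list(range(1, 11)), every=3, fail_once_at=7))
```

Only the work done since the last checkpoint is repeated, so the result matches a failure-free run.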
Terminology • Checkpointing Overhead: The part of the checkpointing activity that is not hidden from the application • Checkpointing Latency: Time from when a checkpoint starts being taken until it is stored in non-volatile storage.
Reducing Chkptg Overhead • Buffer checkpoint writes • Don’t checkpoint “dead” variables: • Never used again by the program, or • Next operation with respect to the variable is a write • Problem is how to identify dead variables • Don’t checkpoint read-only stuff, like code
Reducing Chkptg Latency • Consider compressing the checkpoint. Usefulness of this approach depends on: • Extent of the compression possible • Work required to execute the compression algorithm
Optimization of Chkptg • Objective in general-purpose systems is usually to minimize the expected execution time • Objective in real-time systems is to maximize the probability of meeting task deadlines • Need a mathematical model to determine this • Generally, we place checkpoints approximately equidistant from each other and just determine the optimal number of them
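One simple mathematical model for the general-purpose objective (minimum expected execution time) is sketched below. Its assumptions are not from the slides: failures arrive as a Poisson process with rate `lam`, a failure sends the task back to the start of the current segment at no extra cost, and each of the n equal segments ends with a checkpoint of cost `c`. Under those assumptions the expected time to complete a segment of length s is (e^(lam*s) - 1) / lam.

```python
from math import expm1


def expected_time(T, n, c, lam):
    """Expected completion time of a task of length T split into n equal
    segments, each followed by a checkpoint costing c."""
    segment = T / n + c
    return n * expm1(lam * segment) / lam


def best_count(T, c, lam, max_n=1000):
    """Number of segments (1..max_n) that minimizes the expected time."""
    return min(range(1, max_n + 1), key=lambda n: expected_time(T, n, c, lam))


T, c, lam = 100.0, 1.0, 0.01
n = best_count(T, c, lam)
print(n, expected_time(T, n, c, lam), expected_time(T, 1, c, lam))
```

A real-time version would replace the objective with the probability of finishing before the deadline, which generally leads to a different optimum.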
Distributed Checkpointing • Ordering of Events: • Easy to do if there’s just one thread • If there are multiple threads: • Events in the same thread are trivial to order • Event A in thread X is said to precede Event B in thread Y if there is some communication from X after event A that arrives at Y before event B • Given two events A and B in separate threads, • A could precede B • B could precede A • They could be concurrent
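One common way to decide which of the three cases holds is a vector clock. The slides do not name this technique; it is swapped in here as an illustration, and the `Thread` class and `compare` helper are hypothetical names.

```python
class Thread:
    """A thread that stamps each of its events with a vector clock."""

    def __init__(self, index, n_threads):
        self.index = index
        self.clock = [0] * n_threads

    def event(self):
        """Local event: advance this thread's own component."""
        self.clock[self.index] += 1
        return list(self.clock)

    def send(self):
        """Sending is an event; the returned stamp travels with the message."""
        return self.event()

    def receive(self, stamp):
        """Merge the sender's knowledge, then record the receive event."""
        self.clock = [max(a, b) for a, b in zip(self.clock, stamp)]
        return self.event()


def compare(a, b):
    """Classify two event stamps from different threads."""
    a_le_b = all(x <= y for x, y in zip(a, b))
    b_le_a = all(y <= x for x, y in zip(a, b))
    if a_le_b and not b_le_a:
        return "A precedes B"
    if b_le_a and not a_le_b:
        return "B precedes A"
    return "concurrent"


x, y = Thread(0, 2), Thread(1, 2)
a = x.event()            # event A in thread X
c = y.event()            # unrelated event in thread Y
b = y.receive(x.send())  # X communicates after A; B happens after arrival
print(compare(a, b), "|", compare(a, c))
```

Here A precedes B because of the message from X to Y, while A and the earlier event in Y are concurrent because no communication links them.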
Distributed Checkpointing • Domino Effect: An uncontrolled cascade of rollbacks can roll the entire system back to the starting state • To avoid the domino effect, we can coordinate the checkpointing • Tightly synchronize the checkpoints in all processors • Koo-Toueg algorithm
Checkptg with Clock Sync • Assume the clock skew is bounded at d and minimum message delivery time is f • Each processor: • Takes a local checkpoint at some specified time, t • Following its checkpoint, it does not send out any messages until it is sure that any such message will be received only after the recipient has itself checkpointed; i.e., until t+f+d
Koo-Toueg Algorithm • A processor that wants to checkpoint, • Does so, locally • Tells all processors which have communicated with it the last message (timestamp or message number) received from them • If these processors don’t have a checkpoint recording the transmission of this message, they take a checkpoint • This can result in a surge of checkpointing activity visible at the non-volatile storage
Software Fault Tolerance • It is practically impossible to produce a large piece of software that is bug-free • E.g., Even the space shuttle flew with several potentially disastrous bugs despite extensive testing • Single-version Fault Tolerance • Multi-version Fault Tolerance
Fault Models • Reasonably trustworthy hardware fault models exist • Many software fault models exist in the literature, but not one can be fully trusted to represent reality
Single-Version FT • Wrappers: Code “wrapped around” the software that checks for consistency and correctness • Software Rejuvenation: Reboot the machine reasonably frequently • Use data diversity: Sometimes an algorithm may fail on some data but not if these data are subjected to minor perturbations
Multi-version FT • Very, very expensive • Two basic approaches • N-version programming • Recovery Blocks
N-Version Programming (NVP) • Theoretically appealing, but hard to make effective • Basic Idea: • Have N independent teams of programmers develop applications independently • Run them in parallel and vote on them • If the versions fail independently of one another, the voted output will be highly reliable
Failure Diversity • Effectiveness hinges on whether faults in the versions are statistically independent of one another • Forces against truly independent failures: • Common programming “culture” • Common specifications • Common algorithms • Common software/hardware platforms
Failure Diversity • Incidental Diversity • Prohibit interaction between teams of programmers working on different versions and hope they produce independently failing versions • Forced Diversity • Diverse specifications • Diverse programming languages • Diverse development tools and compilers • Cognitively diverse teams: Probably not realistic
Experimental Results • Experiments suggest that correlated failures do occur at a much higher rate than would be the case if failures in the versions were stochastically independent • Example: Study conducted by Brilliant, Knight, and Leveson at UVa and UCI • 27 students writing code for anti-missile application • 93 correlated failures observed: if true independence had existed, we’d have expected about 5
Recovery Blocks • Also uses multiple versions • Only one version is active at any time • If the output of this version fails an acceptance test, another version is activated
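A minimal sketch of the recovery-block control flow. The versions here are pure functions, so there is no state to restore between attempts; a real recovery block would also roll the system state back before activating the next version. The square-root example and the function names are hypothetical.

```python
import math


def recovery_block(versions, acceptance_test, x):
    """Try each version in order; return the first result that passes the test."""
    for version in versions:
        try:
            result = version(x)
        except Exception:
            continue                      # a crash counts as a failed version
        if acceptance_test(x, result):
            return result
    raise RuntimeError("every version failed the acceptance test")


def buggy_sqrt(x):
    return x / 2                          # deliberately wrong for most inputs


def plausible_root(x, r):
    """Acceptance test: r must be non-negative and square back to x."""
    return r >= 0 and abs(r * r - x) < 1e-6


print(recovery_block([buggy_sqrt, math.sqrt], plausible_root, 9.0))
```

Unlike N-version programming, only one version runs at a time, so the cost of the alternates is paid only when the primary's output is rejected.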
Byzantine Failures • The worst failure mode known • Original Motivating Problem (~1978): • A sensor needs to disseminate its output to a set of processors. How can we ensure that, • If the sensor is functioning correctly: All functional processors obtain the correct sensor reading • If the sensor is malfunctioning: All functional processors agree on the sensor reading
Byzantine Generals Problem • Some divisions of the Byzantine Army are besieging a city. They must all coordinate their attacks (or coordinate their retreat) to avoid disaster • The overall commander communicates with his divisional commanders by means of confidential messengers. A messenger is trustworthy and doesn’t alter the message, and the message can be read only by its intended recipient
Byz Generals Problem (contd.) • If the C-in-C is loyal • He sends consistent orders to the subordinate generals • All loyal subordinates must obey his order • If the C-in-C is a traitor • All loyal subordinate generals must agree on some default action (e.g., running away)
Impossibility with 3 Generals • Suppose there are 2 divisions, A and B. • Commander-in-chief is a traitor and sends message to Com(A) saying “Attack!” and to Com(B) saying “Retreat!” • Com(A) sends a messenger to Com(B), saying “The boss told me to attack!” • Com(B) receives: • Direct order from the C-in-C saying “Retreat” • Message from Com(A) saying “I was ordered to attack”
Byz. Generals Problem (contd.) • Com(B)’s dilemma: • Either the C-in-C or Com(A) is a traitor: it is impossible to know which • Further communication with Com(A) won’t add any useful information • Not possible to ensure that if Com(A) and Com(B) are both loyal, they both agree on the same action • With 3 generals, the problem cannot be solved if even one of them may be a traitor
Byz. Generals Problem (contd.) • Central Result: To reach agreement with a total of N participants with up to m traitors, we must have N > 3m
Byzantine Generals Algorithm • Byz(0) // no-failure algorithm • C-in-C sends his order to every subordinate • The subordinate uses the order he receives, or the default if he receives no order
Byz(m) // For up to m traitors (failures) • (1) C-in-C sends order to every subordinate, G_i: let this be received as v_i • (2) G_i acts as the C-in-C in a Byz(m-1) algorithm to circulate this order to his colleagues • (3) For each (i,j) such that i!=j, let w_(i,j) be the order that G_i got from G_j in step 2 or the default if no message was received. G_i calculates the majority of the orders {v_i, w_(i,j)} and uses it as the correct order to follow
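The three steps of Byz(m) can be simulated with a short recursive function. This is a sketch under stated assumptions: generals are numbered, the default order is "retreat", and a traitor's behaviour is the hypothetical rule in `send` (flip the order for odd-numbered recipients). Real traitors may behave arbitrarily.

```python
from collections import Counter

DEFAULT = "retreat"


def majority(values):
    """Strict majority of the orders, or the default if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else DEFAULT


def send(sender, receiver, order, traitors):
    """What `receiver` hears. Assumed traitor rule: lie to odd-numbered receivers."""
    if sender in traitors and receiver % 2 == 1:
        return "attack" if order == "retreat" else "retreat"
    return order


def byz(m, commander, subordinates, order, traitors):
    """Return {G_i: order that G_i settles on} for this instance."""
    # Step 1: the commander sends its order; G_i receives it as v_i.
    v = {g: send(commander, g, order, traitors) for g in subordinates}
    if m == 0:
        return v
    # Step 2: each G_j acts as commander in Byz(m-1) to circulate v_j.
    relayed = {
        j: byz(m - 1, j, [g for g in subordinates if g != j], v[j], traitors)
        for j in subordinates
    }
    # Step 3: G_i takes the majority of v_i and every w_(i,j), j != i.
    return {
        i: majority([v[i]] + [relayed[j][i] for j in subordinates if j != i])
        for i in subordinates
    }


print(byz(1, 0, [1, 2, 3], "attack", traitors={3}))  # loyal C-in-C, N = 4
print(byz(1, 0, [1, 2, 3], "attack", traitors={0}))  # traitorous C-in-C, N = 4
print(byz(1, 0, [1, 2], "attack", traitors={2}))     # N = 3: too few generals
```

With N = 4 and one traitor, the loyal subordinates obey a loyal C-in-C and agree with each other under a traitorous one. With N = 3, loyal general 1 ends up on the default despite a loyal C-in-C ordering an attack, which illustrates why N > 3m is required.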