Fault Tolerance
Fault tolerance terminology • “dependability” - extent to which reliance can justifiably be placed on service. • General concept • “reliability” - continuity of service • metric: mean time between failures (MTBF) • “availability” - readiness for usage • “safety” - avoidance of catastrophic effects on environment • “security” - resistance to unauthorized access.
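Faults, errors, failures • “fault” - component malfunction • “error” - system state is wrong • “failure” - system departs from specification • Chain: fault → error → failure
[Figure: a system and its components inside an environment; a fault arises in a component, a failure is what the environment sees from the system.]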
Coping with faults • Reduce/eliminate faults in components. • Fault tolerance • Prevent faults from becoming failures • usually through redundancy.
Types of faults (fault models) Fault tolerance algorithms dependent on fault models. • “Crash fault” or “stop fault” - faulty component stops responding. No incorrect state changes in component. • “Timing fault” - response is too early or late. • “Byzantine fault” - arbitrary behavior. Can be considered adversarial (imagine worst case).
The agreement problem • Processors may fail • … so, use multiple processors • … but then, processors may disagree, causing failures. • Need a principled approach to distributed agreement
Example: AFTI 16 (from J. Rushby) • “Advanced Fighter Technology Integration” F16 • Triple-redundant digital flight-control system (DFCS) with analog backup • DFCS design was “asynchronous” • processors ran independently • sample sensor, evaluate control law, send command to actuator • actuator averages or selects from commands • General Dynamics felt synchronization would introduce a single point of failure.
AFTI 16 problems • Processors can get widely varying sensor readings because of timing differences • Reconfiguration can cause sudden changes in control (“thumps”). • Need to allow wide range of “plausible values” before declaring a processor “bad” • Bad sensor reading drags average down • Sensor finally crosses threshold and is called “bad” • Average suddenly snaps back when sensor is excluded.
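The drag-then-snap behaviour in the last three bullets can be illustrated with a toy sketch. Everything in it is hypothetical (the `voted_output` name, the median-based plausibility test, the numbers); it is not the AFTI F16 selection logic, only an illustration of how a wide plausibility window lets one drifting reading pull the average before the reading is finally excluded.

```python
from statistics import mean, median

def voted_output(readings, threshold):
    """Average the readings that lie within `threshold` of the median.

    Hypothetical plausibility test: a reading is declared "bad" only once
    it is more than `threshold` away from the median of all readings.
    """
    mid = median(readings)
    plausible = [r for r in readings if abs(r - mid) <= threshold]
    return mean(plausible)

# Two healthy sensors read 10.0; the third drifts low by 0, 1, 2, ... units.
outputs = [voted_output([10.0, 10.0, 10.0 - drift], threshold=3.0)
           for drift in range(6)]
```

In this run the output sinks step by step to 9.0 while the drifting sensor is still accepted, then jumps back to 10.0 as soon as the sensor is excluded, which is the kind of discontinuity the slide calls a "thump".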
AFTI 16 problems (cont) • Processor states can diverge rapidly • especially when different processors go into different control modes. • Design complexity • 70% of application code was for redundancy management • Control laws had to be modified to ramp changes in and out smoothly
AFTI 16 flight test, Flight 36 • “Departure” from control laws for 3 seconds • acceleration exceeded -4g, then +7g • Angle of attack went to -10 degrees, then +20 degrees • Aircraft rolled 360 degrees • Cause: side air probe cut out at high angle of attack • Analysis showed this would cause complete failure of DFCS for several areas of flight envelope
AFTI 16 flight 44 • Each channel declared the others failed • asynchronous operation, timing skew, sensor noise • analog backup not selected • simultaneous failure of two channels not anticipated • Aircraft flown home on a single digital channel (not designed for this) • There were no hardware failures.
AFTI 16 Analysis (NASA) • Nearly all failure indications were design oversights related to asynchronous operation • Failures due to lack of understanding of interactions among • Air data system • redundancy management software • flight control laws (decision points, thumps, ramp-in/out) • Moral of the story: Reliability through redundancy is a lot harder than it looks.
Distributed consensus • Goal: multiple processors agree on something in the presence of various kinds of faults and errors • Intellectually difficult • Algorithms are tricky • Proofs are subtle • Sensitive to assumptions • Synchronous vs. asynchronous • Communication mechanism • Fault models • Many papers written
Synchronous vs. asynchronous • Synchronous: Processors run in lock-step • Hard to implement - model may be unrealistic • Requires clock synchronization. • Consensus is easier • Asynchronous: Processors run at arbitrary speed • Easier to implement - model is conservative • In most models, consensus problem is provably unsolvable.
Synchronous vs. asynchronous • Semi-synchronous • Bounds on how far out-of-sync processors can get • Model is fairly realistic • Consensus is almost as easy as synchronous
Fault models • Goal: Make claims such as: “the system will continue to function if any single processor stops.” • More conservative fault models: • Fault tolerance is harder • But, if successful, stronger claims can be made • Fewer assumptions = simpler FMEA, easier “certification” • A lot of models have been proposed.
Process fault models • “Stopping fault” - process stops sending messages • does not restart • does not send wrong messages • liberal (easy) model • “Byzantine fault” - process behaves arbitrarily • Name comes from cute “Byzantine generals” metaphor • May send arbitrary messages, enter arbitrary states • Equivalent to “evil” behavior, for our purposes
Synchronous agreement with stopping faults • Multiple processes want to “agree” on a value • Applications • sensor readings among redundant processors • decide what time it is • decide which of a group of processors are broken and should be removed from system.
Synchronous agreement - properties • Each process starts with an initial value, processes end with a decision value. • Agreement: all good processes decide on the same value. • Validity: if all processes start with the same value, that value is the final decision value. • Termination: All good processes eventually decide.
Flood set algorithm • Assumption: There is a dedicated link between each pair of processes • No more than f processes can stop • Each process has an initial value v • Each process accumulates a set W of all the values it has ever seen. • On each round, every process sends its W set to every other process • Every process sets W to the union of the old value and all the new values coming in from others.
Flood set • After f+1 rounds, every process looks at W. • If W has only one value, choose that value. • Else, choose 0 (a predetermined default).
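The flood set steps can be sketched as a small round-by-round simulation. This is a minimal sketch under stated assumptions: the function name `flood_set`, the `crashes` map (process id to the round in which it stops and the recipients it still reaches in that round), and the integer process ids are all illustrative and not part of the slides. It runs f+1 rounds, the number the correctness argument relies on.

```python
def flood_set(initial, f, crashes=None, default=0):
    """Simulate FloodSet for f+1 synchronous rounds.

    initial: {process id: initial value}
    crashes: {process id: (round, set of recipients reached before stopping)}
             (hypothetical way of describing a stopping pattern)
    Returns {surviving process id: decision value}.
    """
    crashes = crashes or {}
    W = {p: {v} for p, v in initial.items()}   # values each process has seen
    alive = set(initial)
    for rnd in range(1, f + 2):
        inbox = {p: set() for p in alive}
        stopping = set()
        for sender in alive:
            if sender in crashes and crashes[sender][0] == rnd:
                # A stopping process reaches only some recipients, then halts.
                stopping.add(sender)
                recipients = crashes[sender][1]
            else:
                recipients = alive - {sender}
            for r in recipients:
                if r in inbox:
                    inbox[r] |= W[sender]
        alive -= stopping
        for p in alive:
            W[p] |= inbox[p]                   # union of old W and new values
    # Decide: the single value in W, or the predetermined default.
    return {p: (next(iter(W[p])) if len(W[p]) == 1 else default)
            for p in alive}

# 3 processes, f = 1: process 3 holds "B" and stops in round 1
# after reaching process 1 only.
decisions = flood_set({1: "A", 2: "A", 3: "B"}, f=1, crashes={3: (1, {1})})
```

With this stopping pattern the two survivors disagree after round 1 ({A,B} versus {A}) but hold the same W after round 2, so both fall back to the default.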
Flood set correctness • In f+1 rounds, there must be at least one round in which no processes stop • At most f processes can stop, and processes cannot stop more than once. • If no process stops in round r, W will be the same in all good processes in subsequent rounds. • All good processes successfully send all values in W to all other good processes, so all processes will have same W after the round. • After this, nothing can get added to any W sets, so it doesn’t matter whether more processes stop.
Flood set correctness • So, after f+1 rounds, all non-stopped processes have same W sets • If W has only one value, all processes pick this value. • Else all processes pick 0 (the default).
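Flood set example • 3 processes, 1 fault, default value = 0 • Initial values: A, A, B • The process holding B dies in round 1 after sending W to one of the other processes but not the other • W sets for the two surviving processes are the same after round 2 • Choose default because |W| > 1
Process                      W in round 0   W in round 1   W in round 2   final
holds B, dies in round 1     {B}            -              -              -
survivor that received B     {A}            {A,B}          {A,B}          0
survivor that did not        {A}            {A}            {A,B}          0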
Flood set efficiency • O((f+1) n^2) messages • f+1 rounds • n processes send n messages per round • O((f+1) n^3) values are sent (each message may have a set of up to n values)
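As a quick arithmetic check of these bounds, the sketch below evaluates them for concrete n and f. The function name `flood_set_cost` is illustrative, and the figures are the slide's upper bounds (n messages per process per round, up to n values per message), not exact counts.

```python
def flood_set_cost(n, f):
    """Upper bounds from the slide for n processes tolerating f stops."""
    rounds = f + 1
    messages = rounds * n * n      # n processes send n messages per round
    values = messages * n          # each message carries at most n values
    return rounds, messages, values

small = flood_set_cost(3, 1)       # the 3-process, 1-fault example
larger = flood_set_cost(10, 3)
```

For 3 processes and 1 fault this gives 2 rounds, at most 18 messages and at most 54 values; for 10 processes and 3 faults, 4 rounds, 400 messages and 4000 values.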
Optimized flood set • Note: If W has more than one element, process doesn’t need to know what is in it. • Idea: Every process sends only first two distinct values. • Every process sends its initial value on first round • If process receives a different value, it sends it out on next round • Correctness proof: run Flood and OptFlood in parallel • same initial values, stopping pattern • W sets have more than one value iff OptFlood process gets two values.
OptFlood efficiency • 2n^2 messages • n processes send at most two messages to n other processes. • O(n^2) values are sent
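A sketch of the optimized variant, under the same illustrative conventions as a plain flood set simulation: `opt_flood_set`, the `crashes` map (process id to stopping round and recipients still reached) and the integer process ids are assumptions, not the author's code. Each process broadcasts its initial value in the first round and forwards only the first different value it hears; it also counts its own broadcasts so the "at most two" claim can be checked.

```python
def opt_flood_set(initial, f, crashes=None, default=0):
    """Simulate OptFlood for f+1 rounds; returns (decisions, broadcasts)."""
    crashes = crashes or {}
    seen = {p: [v] for p, v in initial.items()}  # at most two distinct values
    outbox = dict(initial)                       # value to broadcast this round
    broadcasts = {p: 0 for p in initial}
    alive = set(initial)
    for rnd in range(1, f + 2):
        inbox = {p: [] for p in alive}
        stopping = set()
        for sender in sorted(alive):
            recipients = alive - {sender}
            if sender in crashes and crashes[sender][0] == rnd:
                stopping.add(sender)
                recipients = crashes[sender][1]  # partial send, then stop
            if sender in outbox:
                broadcasts[sender] += 1
                for r in recipients:
                    if r in inbox:
                        inbox[r].append(outbox[sender])
        alive -= stopping
        outbox = {}
        for p in alive:
            for v in inbox[p]:
                if v not in seen[p] and len(seen[p]) < 2:
                    seen[p].append(v)
                    outbox[p] = v    # forward the second distinct value once
    decisions = {p: (seen[p][0] if len(seen[p]) == 1 else default)
                 for p in alive}
    return decisions, broadcasts

# Process 3 holds "B" and stops in round 1 after reaching process 1 only.
decisions, broadcasts = opt_flood_set({1: "A", 2: "A", 3: "B"}, f=1,
                                      crashes={3: (1, {1})})
```

On this stopping pattern the survivors decide the default 0, as plain flood set would, and no process broadcasts more than twice.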
Byzantine agreement • Goal: non-faulty processes should agree on a value. • E.g., message received • e.g., sensor value • Faults may cause arbitrary behavior • arbitrary values communicated • different values communicated to different receivers • Advantage: reduces fault analysis • Disadvantage: hard or impossible to do.
Byzantine agreement properties • Agreement: All good processes agree on a value. • Validity: If the source of the value was non-faulty, the agreed-upon value is the value the source sent.
Asynchronous agreement • Asynchronous model: • Message transmission takes arbitrary time. • Processes run at arbitrary speeds. • Theorem: There is no algorithm that reaches agreement in an asynchronous model with even one faulty process (the impossibility already holds for a single stopping fault, so it holds for Byzantine faults too) • Fine print: Details of conditions, communication • This is one of the most important results about distributed systems.
Synchronous agreement • Synchronous model: Processes can communicate in a sequence of rounds. All processes complete a round before next round begins. • The agreement problem is solvable in this model. • Theorem: Tolerating k Byzantine faults requires > 3k processes. • So “Triple modular redundancy” can’t handle Byzantine faults. • Practical case: 1 Byzantine fault, 4 processes. • Assumes full connectivity (connections between each pair of processors).
Synchronous agreement with one fault • Single transmitter communicates value to all processes. • Round 0: Transmitter sends value to n-1 receivers. • Values are sent correctly if transmitter is not faulty. • Round 1: Each receiver sends value to n-2 other receivers. • Receivers record all values separately. • Intuition: receivers compare notes on what transmitter told them. • Each receiver chooses the majority value of all values it received. • If no majority, use pre-arranged default value.
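The two rounds can be sketched for one transmitter and three receivers. This is a minimal sketch: the names `one_fault_agreement`, `round0` and `relay_lies` are illustrative, and the inputs are modeled on the deck's three worked examples. Which receiver gets the odd value in the first scenario is an assumption.

```python
from collections import Counter

def majority(values, default=0):
    """Strict majority of `values`, or the pre-arranged default."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else default

def one_fault_agreement(round0, relay_lies=None, default=0):
    """round0: {receiver: value the transmitter sent it in round 0}.
    relay_lies: {(sender, receiver): bogus value} for a faulty receiver
                (hypothetical way to inject a receiver fault).
    Returns {receiver: decision}.
    """
    relay_lies = relay_lies or {}
    decisions = {}
    for r in round0:
        heard = [round0[r]]                      # round 0: from transmitter
        for s in round0:                         # round 1: from other receivers
            if s != r:
                heard.append(relay_lies.get((s, r), round0[s]))
        decisions[r] = majority(heard, default)
    return decisions

# Faulty transmitter sends 1, 1, 2: receivers still agree on 1.
ex1 = one_fault_agreement({"P1": 1, "P2": 1, "P3": 2})
# Faulty transmitter sends 1, 2, 3: no majority, everyone uses default 0.
ex2 = one_fault_agreement({"P1": 1, "P2": 2, "P3": 3})
# Good transmitter sends 1; faulty receiver P1 relays bogus values.
ex3 = one_fault_agreement({"P1": 1, "P2": 1, "P3": 1},
                          relay_lies={("P1", "P2"): 2, ("P1", "P3"): 3})
```

In the third scenario only the decisions of P2 and P3 matter, since P1 is the faulty process and its result is not required to be correct.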
Example 1 - faulty transmitter • Round 0: faulty xmtr sends varying results to rcvrs: 1, 1, 2 (these are the round 0 values) • Round 1: rcvrs exchange values (reliably) • Finally, receivers take majority of all answers
Rcvr   values held after round 1   consensus
P1     1, 1, 2                     1
P2     1, 1, 2                     1
P3     1, 1, 2                     1
Example 2 - faulty transmitter • Round 0: faulty xmtr sends varying results to rcvrs: 1, 2, 3 (these are the round 0 values) • Round 1: rcvrs exchange values (reliably) • There is no majority, so rcvrs use default (0)
Rcvr   values held after round 1   consensus
P1     1, 2, 3                     0
P2     1, 2, 3                     0
P3     1, 2, 3                     0
Example 3 - faulty receiver • Round 0: xmtr sends 1, 1, 1 to rcvrs (these are the round 0 values) • Round 1: Process 1 sends bogus values (2 to P2, 3 to P3) • Majority computes correct values for processes 2, 3 • Process 1 is broken, so result is not required to be correct
Rcvr          values held after round 1   consensus
P1 (faulty)   1, 1, 1                     5 (bogus)
P2            1, 2, 1                     1
P3            1, 3, 1                     1
General case • Previous algorithm can be generalized to handle more Byzantine faults. • General results: tolerating k faults requires k+1 rounds (counting the initial transmission, so k rounds of relaying) and 3k+1 processors • Number of messages grows exponentially with number of rounds • Intuition: “Pn said that Pn-1 said that ... p1 said that p0 said that the value was x” • There are exponentially many chains pn ... p0.
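To make the growth concrete, the sketch below counts messages per round. It assumes the generalization relays every "said that" chain to every process not already on the chain, as in the classic oral-messages formulation; the slides do not spell this out, and the name `messages_per_round` is illustrative.

```python
def messages_per_round(n, k):
    """Messages sent in rounds 0..k with n processes (one is the transmitter).

    Assumption: in round j each chain of j+1 distinct processes is extended
    to every process not yet on it, giving (n-1)(n-2)...(n-1-j) messages.
    """
    counts = []
    chains = 1
    for rnd in range(k + 1):
        chains *= n - 1 - rnd
        counts.append(chains)
    return counts

one_fault = messages_per_round(4, 1)    # 4 processes, 1 Byzantine fault
two_faults = messages_per_round(7, 2)   # 7 processes, 2 Byzantine faults
```

Under this assumption the one-fault case needs 3 + 6 = 9 messages, while two faults with 7 processes already need 6 + 30 + 120 = 156.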
Hybrid Byzantine agreement • Idea: Free bonus reliability with the purchase of Byzantine agreement. • Handles Byzantine faults, plus some simpler faults • Symmetric fault: process sends same wrong value to everyone. • Nonmalicious fault: process sends a recognizable error value. • Advantages: • If processors have these faults, we can tolerate more faulty processors • These faults are more probable than true Byzantine faults - so this increases reliability
Hybrid Byzantine agreement • Modify previous algorithm by adding special error value “E”. • Nonmalicious faults send E value (other faults may send E, also). • Majority algorithm first removes E values. • Theorem: Algorithm reaches agreement if • n > 2a + 2s + b + r • a = Byzantine, s = symmetric, b = nonmalicious, r = number of rounds (excluding first transmission). • Previous case: a=1, s=0, b=0, r=1, so n > 3 • With 6 processors, can deal with 1 Byzantine + 2 nonmalicious faults. • or 1 Byzantine and 1 symmetric • ... but just 1 Byzantine in previous algorithm
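The bound and the modified majority step can be checked with a short sketch. The names `tolerates` and `hybrid_majority` are illustrative. The function encodes only the slide's inequality, not any further side conditions from the underlying theorem, and returning the default when every value is E is an assumption.

```python
from collections import Counter

E = "E"  # the special error value sent by nonmalicious faults

def tolerates(n, a, s, b, r):
    """Slide's condition: n > 2a + 2s + b + r."""
    return n > 2 * a + 2 * s + b + r

def hybrid_majority(values, default=0):
    """Majority vote after first removing E values."""
    kept = [v for v in values if v != E]
    if not kept:
        return default   # assumption: nothing usable left, use the default
    value, count = Counter(kept).most_common(1)[0]
    return value if count > len(kept) / 2 else default

previous_case = tolerates(4, a=1, s=0, b=0, r=1)        # n > 3
six_nonmalicious = tolerates(6, a=1, s=0, b=2, r=1)     # n > 5
six_symmetric = tolerates(6, a=1, s=1, b=0, r=1)        # n > 5
vote = hybrid_majority([1, E, 1, 2])
```

The three checks reproduce the slide's cases: 4 processors for one Byzantine fault, and 6 processors for one Byzantine fault plus either two nonmalicious faults or one symmetric fault.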
Variations • Synchronous communication is difficult • Compromise between synchronous and asynchronous: real-time constraints. • “Authentication” - agreement can be made less costly by using digital signatures • transmitter digitally signs messages • processes can’t lie about who said what. • can handle any number of faults (in synchronous model). • May assume different network connectivity • Some links in network missing
Summary • Fault tolerance is tricky. Redundancy does not necessarily buy reliability. • Byzantine models can account for unforeseen fault types. • Byzantine agreement is impossible in some models. • There exist practical algorithms for Byzantine agreement if synchronous communication is available. • There are deep theoretical results in this area.