Fault Tolerance

Fault tolerance terminology • “dependability” - extent to which reliance can justifiably be placed on service. • General concept • “reliability” - continuity of service • metric: mean time between failures (MTBF) • “availability” - readiness for usage • “safety” - avoidance of catastrophic effects on environment • “security” - resistance to unauthorized access.

Faults, errors, failures • “fault” - component malfunction • “error” - system state is wrong • “failure” - system departs from specification • Causal chain: fault → error → failure

[Figure: a system, its components, and its environment, with a component fault leading to a system failure.]

Coping with faults • Reduce/eliminate faults in components. • Fault tolerance • Prevent faults from becoming failures • usually through redundancy.

Types of faults (fault models) • Fault tolerance algorithms depend on the fault model. • “Crash fault” or “stop fault” - faulty component stops responding. No incorrect state changes in component. • “Timing fault” - response is too early or late. • “Byzantine fault” - arbitrary behavior. Can be considered adversarial (imagine worst case).

The agreement problem • Processors may fail • … so, use multiple processors • … but then, processors may disagree, causing failures. • Need a principled approach to distributed agreement

Example: AFTI 16 (from J. Rushby) • “Advanced Fighter Technology Integration” F-16 • Triple-redundant digital flight-control system (DFCS) with analog backup • DFCS design was “asynchronous” • processors ran independently • sample sensor, evaluate control law, send command to actuator • actuator averages or selects from commands • General Dynamics felt synchronization would introduce a single point of failure.

AFTI 16 problems • Processors can get widely varying sensor readings because of timing differences • Reconfiguration can cause sudden changes in control (“thumps”). • Need to allow wide range of “plausible values” before declaring a processor “bad” • Bad sensor reading drags average down • Sensor finally crosses threshold and is called “bad” • average suddenly snaps back when sensor is excluded.
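The "thump" described above can be illustrated numerically. The following is only a sketch under stated assumptions: the readings, the threshold of 4.0, and the use of the median as the plausibility reference are hypothetical and are not taken from the AFTI design.

```python
from statistics import median

def vote(readings, threshold):
    """Average the readings that lie within `threshold` of the median.

    Hypothetical voter: the median serves as the plausibility reference.
    """
    reference = median(readings)
    accepted = [r for r in readings if abs(r - reference) <= threshold]
    return sum(accepted) / len(accepted)

THRESHOLD = 4.0  # hypothetical "plausible value" window

# Two healthy channels hold 10.0 while the third drifts downward.
outputs = []
for drift in [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]:
    readings = [10.0, 10.0, 10.0 - drift]
    outputs.append(round(vote(readings, THRESHOLD), 2))

print(outputs)  # [10.0, 9.67, 9.33, 9.0, 8.67, 10.0]
```

The output sinks gradually while the drifting channel is still considered plausible, then jumps back to 10.0 in a single step once that channel is excluded.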

AFTI 16 problems (cont) • Processor states can diverge rapidly • especially when different processors go into different control modes. • Design complexity • 70% of application code was for redundancy management • Control laws had to be modified to ramp changes in and out smoothly

AFTI 16 flight test, Flight 36 • “Departure” from control laws for 3 seconds • acceleration exceeded -4g, then +7g • Angle of attack went to -10 degrees, then +20 degrees • Aircraft rolled 360 degrees • Cause: side air probe cut out at high angle of attack • Analysis showed this would cause complete failure of DFCS for several areas of flight envelope

AFTI 16 flight 44 • Each channel declared the others failed • asynchronous operation, timing skew, sensor noise • analog backup not selected • simultaneous failure of two channels not anticipated • Aircraft flown home on a single digital channel (not designed for this) • There were no hardware failures.

AFTI 16 Analysis (NASA) • Nearly all failure indications were design oversights related to asynchronous operation • Failures due to lack of understanding of interactions among • Air data system • redundancy management software • flight control laws (decision points, thumps, ramp-in/out) • Moral of the story: Reliability through redundancy is a lot harder than it looks.

Distributed consensus • Goal: multiple processors agree on something in the presence of various kinds of faults and errors • Intellectually difficult • Algorithms are tricky • Proofs are subtle • Sensitive to assumptions • Synchronous vs. asynchronous • Communication mechanism • Fault models • Many papers written

Synchronous vs. asynchronous • Synchronous: Processors run in lock-step • Hard to implement - model may be unrealistic • Requires clock synchronization. • Consensus is easier • Asynchronous: Processors run at arbitrary speed • Easier to implement - model is conservative • In most models, consensus problem is provably unsolvable.

Synchronous vs. asynchronous • Semi-synchronous • Bounds on how far out-of-sync processors can get • Model is fairly realistic • Consensus is almost as easy as synchronous

Fault models • Goal: Make claims such as: “the system will continue to function if any single processor stops.” • More conservative fault models: • Fault tolerance is harder • But, if successful, stronger claims can be made • Fewer assumptions = simpler FMEA, easier “certification” • A lot of models have been proposed.

Process fault models • “Stopping fault” - process stops sending messages • does not restart • does not send wrong messages • liberal (easy) model • “Byzantine fault” - process behaves arbitrarily • Name comes from cute “Byzantine generals” metaphor • May send arbitrary messages, enter arbitrary states • Equivalent to “evil” behavior, for our purposes

Synchronous agreement with stopping faults • Multiple processes want to “agree” on a value • Applications • sensor readings among redundant processors • decide what time it is • decide which of a group of processors are broken and should be removed from system.

Synchronous agreement - properties • Each process starts with an initial value; processes end with a decision value. • Agreement: all good processes decide on the same value. • Validity: if all processes start with the same value, that value is the final decision value. • Termination: All good processes eventually decide.

Flood set algorithm • Assumption: There is a dedicated link between each pair of processes • No more than f processes can stop • Each process has an initial value v • Each process accumulates a set W of all the values it has ever seen. • On each round, every process sends its W set to every other process • Every process sets W to the union of the old value and all the new values coming in from others.

Flood set • After f+1 rounds, every process looks at W. • If W has only one value, choose that value. • Else, choose 0 (a predetermined default).

Flood set correctness • In f+1 rounds, there must be at least one round in which no processes stop • At most f processes can stop, and processes cannot stop more than once. • If no process stops in round r, W will be the same in all good processes in subsequent rounds. • All good processes successfully send all values in W to all other good processes, so all processes will have same W after the round. • After this, nothing can get added to any W sets, so it doesn’t matter whether more processes stop.

Flood set correctness • So, after f+1 rounds, all non-stopped processes have same W sets • If W has only one value, all processes pick this value. • Else all processes pick 0 (the predetermined default).
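The flood set rounds and decision rule can be sketched as a small simulation. This is a minimal sketch, not a reference implementation: it runs f+1 rounds and falls back to the default 0, and the function name `flood_set`, the process names, and the `crashes` argument (the round in which a process stops and the recipients it still reaches in that round) are assumptions made for illustration.

```python
def flood_set(initial, f, crashes=None, default=0):
    """Simulate flood set for f+1 synchronous rounds.

    initial: dict mapping process name -> initial value
    crashes: dict mapping process name -> (round, recipients reached
             in that round before the process stops)
    Returns a dict mapping each surviving process to its decision.
    """
    crashes = crashes or {}
    W = {p: {v} for p, v in initial.items()}
    alive = set(initial)
    for rnd in range(1, f + 2):
        inbox = {p: set() for p in alive}
        for sender in sorted(alive):
            if sender in crashes and crashes[sender][0] == rnd:
                recipients = crashes[sender][1]   # partial send, then stop
            else:
                recipients = alive - {sender}
            for r in recipients:
                if r in inbox:
                    inbox[r] |= W[sender]
        # Processes that stopped in this round take no further part.
        alive -= {p for p, (r, _) in crashes.items() if r == rnd}
        for p in alive:
            W[p] |= inbox[p]
    return {p: (next(iter(W[p])) if len(W[p]) == 1 else default)
            for p in sorted(alive)}

# P1 holds B and stops in round 1 after reaching only P2.
decisions = flood_set({"P1": "B", "P2": "A", "P3": "A"}, f=1,
                      crashes={"P1": (1, {"P2"})})
print(decisions)  # {'P2': 0, 'P3': 0}
```

In this run P2 holds {A, B} and P3 holds {A} after the first round, so stopping there would let them decide differently; the second round is what brings the two W sets together.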

Flood set example • 3 processes, 1 fault, default value = 0 • W in round 0: {A}, {A}, {B} • One process dies during round 1 after sending its W to one of the other two processes but not to the other • W in round 1: {A,B} and {A} for the two survivors • W in round 2: {A,B} and {A,B}, so the W sets of the survivors are the same • Final: both survivors choose the default 0 because |W| > 1

Flood set efficiency • O((f+1) n^2) messages • f+1 rounds • n processes send n messages per round • O((f+1) n^3) values are sent (each message may have a set of up to n values)

Optimized flood set • Note: If W has more than one element, process doesn’t need to know what is in it. • Idea: Every process sends only first two distinct values. • Every process sends its initial value on first round • If process receives a different value, it sends it out on next round • Correctness proof: run Flood and OptFlood in parallel • same initial values, stopping pattern • W sets have more than one value iff OptFlood process gets two values.

OptFlood efficiency • 2n^2 messages • n processes send at most two messages to n other processes. • O(n^2) values are sent
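The optimized variant can be sketched in the same simulation style, counting messages along the way. This is an illustrative sketch only: the name `opt_flood_set` and the `crashes` argument are assumptions, and the run length of f+1 rounds is carried over from flood set.

```python
def opt_flood_set(initial, f, crashes=None, default=0):
    """Each process broadcasts at most two values: its initial value,
    and the first different value it receives.

    crashes: dict mapping process name -> (round, recipients reached
             in that round before the process stops)
    Returns (decisions of surviving processes, messages sent).
    """
    crashes = crashes or {}
    seen = {p: [v] for p, v in initial.items()}  # at most two distinct values
    outgoing = dict(initial)                     # value to broadcast next round
    alive = set(initial)
    messages = 0
    for rnd in range(1, f + 2):
        inbox = {p: [] for p in alive}
        for sender in sorted(alive):
            if sender not in outgoing:
                continue                         # nothing new to report
            value = outgoing.pop(sender)
            if sender in crashes and crashes[sender][0] == rnd:
                recipients = crashes[sender][1]
            else:
                recipients = alive - {sender}
            for r in recipients:
                if r in inbox:
                    inbox[r].append(value)
                    messages += 1
        alive -= {p for p, (r, _) in crashes.items() if r == rnd}
        for p in alive:
            for value in inbox[p]:
                if len(seen[p]) < 2 and value not in seen[p]:
                    seen[p].append(value)
                    outgoing[p] = value          # forward the second value once
    decisions = {p: (seen[p][0] if len(seen[p]) == 1 else default)
                 for p in sorted(alive)}
    return decisions, messages

decisions, messages = opt_flood_set({"P1": "B", "P2": "A", "P3": "A"}, f=1,
                                    crashes={"P1": (1, {"P2"})})
print(decisions, messages)  # {'P2': 0, 'P3': 0} 6
```

With n = 3 the run uses 6 messages, within the 2n^2 = 18 bound, and each message carries a single value rather than a whole set.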

Byzantine agreement • Goal: non-faulty processes should agree on a value. • E.g., message received • e.g., sensor value • Faults may cause arbitrary behavior • arbitrary values communicated • different values communicated to different receivers • Advantage: reduces fault analysis • Disadvantage: hard or impossible to do.

Byzantine agreement properties • Agreement: All good processes agree on a value. • Validity: If the source of the value was non-faulty, the agreed-upon value is the same as the source's value.

Asynchronous agreement • Asynchronous model: • Message transmission takes arbitrary time. • Processes run at arbitrary speeds. • Theorem: There is no deterministic algorithm that reaches agreement in an asynchronous model if even one process can fail. This holds already for a single stopping fault, and therefore also for a single Byzantine fault. • Fine print: Details of conditions, communication • This is one of the most important results about distributed systems.

Synchronous agreement • Synchronous model: Processes can communicate in a sequence of rounds. All processes complete a round before next round begins. • The agreement problem is solvable in this model. • Theorem: Tolerating k Byzantine faults requires > 3k processes. • So “Triple modular redundancy” can’t handle Byzantine faults. • Practical case: 1 Byzantine fault, 4 processes. • Assumes full connectivity (connections between each pair of processors).

Synchronous agreement with one fault • Single transmitter communicates value to all processes. • Round 0: Transmitter sends value to n-1 receivers. • Values are sent correctly if transmitter is not faulty. • Round 1: Each receiver sends value to n-2 other receivers. • Receivers record all values separately. • Intuition: receivers compare notes on what transmitter told them. • Each receiver chooses the majority value of all values it received. • If no majority, use pre-arranged default value.

Example 1 - faulty transmitter • Round 0: faulty xmtr sends varying results to rcvrs: 1 to P1, 1 to P2, 2 to P3. • Round 1: rcvrs exchange their round 0 values (reliably). • Each of P1, P2, P3 now holds 1 (P1's value), 1 (P2's value), 2 (P3's value). • Finally, receivers take the majority of all answers: consensus is 1 at P1, P2, and P3.

Example 2 - faulty transmitter • Round 0: faulty xmtr sends varying results to rcvrs: 1 to P1, 2 to P2, 3 to P3. • Round 1: rcvrs exchange their round 0 values (reliably). • Each of P1, P2, P3 now holds 1 (P1's value), 2 (P2's value), 3 (P3's value). • There is no majority, so rcvrs use the default: consensus is 0 at P1, P2, and P3.

Example 3 - faulty receiver • Round 0: xmtr sends 1 to P1, P2, and P3. • Round 1: process 1 is broken and sends bogus values: 2 to P2 and 3 to P3. P2 and P3 relay 1 correctly. • P2 holds 2 (from P1), 1 (own), 1 (from P3); P3 holds 3 (from P1), 1 (from P2), 1 (own). • Majority computes the correct value 1 for processes 2 and 3. • P1 holds 1, 1, 1 but reports 5; process 1 is broken, so its result is not required to be correct.
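The three examples can be replayed with a short simulation of the one-fault algorithm. This is a sketch under assumptions: the names `majority` and `one_fault_agreement` and the `relay` hook used to model a faulty receiver are hypothetical, and the default value 0 is taken from the second example.

```python
from collections import Counter

def majority(values, default=0):
    """Return the value held by a strict majority, else the default."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else default

def one_fault_agreement(round0, relay=None, default=0):
    """round0: dict mapping receiver -> value the transmitter sent it.
    relay: optional function (sender, recipient, value) -> value actually
           relayed in round 1; used to model a faulty receiver.
    Returns the value each receiver decides on.
    """
    relay = relay or (lambda sender, recipient, value: value)
    decisions = {}
    for p in round0:
        held = [round0[p]] + [relay(q, p, round0[q]) for q in round0 if q != p]
        decisions[p] = majority(held, default)
    return decisions

# Example 1: faulty transmitter sends 1, 1, 2.
ex1 = one_fault_agreement({"P1": 1, "P2": 1, "P3": 2})
# Example 2: faulty transmitter sends 1, 2, 3, so there is no majority.
ex2 = one_fault_agreement({"P1": 1, "P2": 2, "P3": 3})
# Example 3: transmitter is correct; P1 relays bogus values 2 and 3.
bogus = {"P2": 2, "P3": 3}
ex3 = one_fault_agreement(
    {"P1": 1, "P2": 1, "P3": 1},
    relay=lambda s, r, v: bogus[r] if s == "P1" else v)

print(ex1)  # {'P1': 1, 'P2': 1, 'P3': 1}
print(ex2)  # {'P1': 0, 'P2': 0, 'P3': 0}
print(ex3["P2"], ex3["P3"])  # 1 1
```

In the third run only the decisions of P2 and P3 matter, since P1 is the faulty process and may report anything.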

General case • Previous algorithm can be generalized to handle more Byzantine faults. • General results: k faults require k+1 rounds (counting the transmitter's round 0, as in the one-fault case, which uses rounds 0 and 1) and 3k+1 processors • Number of messages grows exponentially with number of rounds • Intuition: “Pn said that Pn-1 said that ... P1 said that P0 said that the value was x” • There are exponentially many chains Pn ... P0.
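The growth in the number of chains can be made concrete with a small count. This sketch assumes that no process appears twice in a chain and that P0 is the transmitter; the function name `relay_chains` is hypothetical, and the choice of k+1 hops for k faults is an assumption that matches the one-fault case (rounds 0 and 1).

```python
from math import perm

def relay_chains(n, hops):
    """Count chains P0 -> P1 -> ... of `hops` messages among n processes,
    assuming no process appears twice and P0 is the fixed transmitter.
    """
    return perm(n - 1, hops)  # (n-1) * (n-2) * ... * (n-hops)

# n = 3k+1 processes and k+1 hops for k = 1, 2, 3 faults (assumed setup).
for k in (1, 2, 3):
    print(k, relay_chains(3 * k + 1, k + 1))
# 1 6
# 2 120
# 3 3024
```

For one fault this gives the 6 relayed values of round 1 (3 receivers, each telling 2 others); the count rises steeply as faults and rounds are added.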

Hybrid Byzantine agreement • Idea: Free bonus reliability with the purchase of Byzantine agreement. • Handles Byzantine faults, plus some simpler faults • Symmetric fault: process sends same wrong value to everyone. • Nonmalicious fault: process sends a recognizable error value. • Advantages: • If processors have these faults, we can tolerate more faulty processors • These faults are more probable than true Byzantine faults - so this increases reliability

Hybrid Byzantine agreement • Modify previous algorithm by adding special error value “E”. • Nonmalicious faults send E value (other faults may send E, also). • Majority algorithm first removes E values. • Theorem: Algorithm reaches agreement if • n > 2a + 2s + b + r • a = Byzantine, s = symmetric, b = nonmalicious, r = number of rounds (excluding first transmission). • Previous case: a=1, s=0, b=0, r=1, so n > 3 • With 6 processors, can deal with 1 Byzantine + 2 nonmalicious faults. • or 1 Byzantine and 1 symmetric • ... but just 1 Byzantine in previous algorithm
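The bound and the modified majority step can be checked with a few lines. This is a sketch only: `tolerates` evaluates the inequality exactly as stated above and models no other side conditions of the theorem, and `hybrid_majority` assumes the default is used when every received value is E, which the slides do not specify. Both names are hypothetical.

```python
from collections import Counter

E = "E"  # the special error value

def tolerates(n, a, s, b, r):
    """Evaluate n > 2a + 2s + b + r (a = Byzantine, s = symmetric,
    b = nonmalicious, r = rounds excluding the first transmission)."""
    return n > 2 * a + 2 * s + b + r

def hybrid_majority(values, default=0):
    """Majority vote after first removing E values.

    Assumption: fall back to the default if nothing but E was received.
    """
    kept = [v for v in values if v != E]
    if not kept:
        return default
    value, count = Counter(kept).most_common(1)[0]
    return value if count > len(kept) / 2 else default

print(tolerates(4, a=1, s=0, b=0, r=1))  # True: the previous case, n > 3
print(tolerates(6, a=1, s=0, b=2, r=1))  # True: 1 Byzantine + 2 nonmalicious
print(tolerates(6, a=1, s=1, b=0, r=1))  # True: 1 Byzantine + 1 symmetric
print(tolerates(3, a=1, s=0, b=0, r=1))  # False: three processes are too few
print(hybrid_majority([1, E, 1, 2]))     # 1
```

Discarding E before voting is what lets a nonmalicious fault cost only one process in the bound, where a Byzantine or symmetric fault costs two.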

Variations • Synchronous communication is difficult • Compromise between synchronous and asynchronous: real-time constraints. • “Authentication” - agreement can be made less costly by using digital signatures • transmitter digitally signs messages • processes can’t lie about who said what. • can handle any number of faults (in synchronous model). • May assume different network connectivity • Some links in network missing

Summary • Fault tolerance is tricky. Redundancy does not necessarily buy reliability. • Byzantine models can account for unforeseen fault types. • Byzantine agreement is impossible in some models. • There exist practical algorithms for Byzantine agreement if synchronous communication is available. • There are deep theoretical results in this area.
