Unreliable Failure Detectors for Reliable Distributed Systems • Tushar Deepak Chandra, Sam Toueg • Presentation for EECS454: Lawrence Leinweber
Two-Army Problem • Unreliable Channel • Can’t Guarantee Correct Communication • Last Message May be Lost
Byzantine Generals Problem (1) • Unreliable Processors (Traitors) • Report Incorrect Values (Troop Levels) • [Diagram: generals exchanging reported troop levels]
Byzantine Generals Problem (2) • Loyal Generals Need to Verify Reports • Use Reports as Votes on Correct Values • That’s About It with the Color Diagrams • [Diagram: vectors of reported values collected by each general]
Distributed System • System of Processors • Connected In a Network • Running Independently • Solving Problems Together
Types of Failure • Unreliable Communication Channels • Processors Crash or Create Mischief • Synchronizing Processors • Atomic Broadcast • Problems Agreeing On Results • Consensus
Scope of This Solution • Processors Can Crash • Crashed Processors Never Recover • Processors are Not Malicious • Reliable Communication Channels • Asynchronous • Synchronize After a Finite Number of Steps • At Least One Processor is Correct • Every Down Processor is Detected By at Least One Up Processor • At Least One Up Processor is Detected By All Up Processors
Failure Detectors • Attached to Each Processor • Determine the Crash State of Some Processors • Processors Communicate Crash State Information • Imperfect • Suspect Processors Crashed • Slow Processors Might Become “Unsuspected” • Cause Host Processor to Abandon Other Processors
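To make the "imperfect" behaviour concrete, here is a minimal sketch of one common way to build such a detector: heartbeats with a timeout. The slides do not prescribe an implementation, so the class name `HeartbeatDetector`, its methods, and the timeout-doubling rule are all assumptions made for illustration.

```python
class HeartbeatDetector:
    """Timeout-based failure detector attached to one host processor (illustrative)."""

    def __init__(self, peers, timeout):
        self.timeout = timeout
        self.last_heard = {p: 0.0 for p in peers}  # time of last heartbeat per peer
        self.suspected = set()

    def heartbeat(self, peer, now):
        """Record a heartbeat; a slow peer that was suspected becomes unsuspected."""
        self.last_heard[peer] = now
        if peer in self.suspected:
            self.suspected.discard(peer)
            self.timeout *= 2  # assumed back-off: be more patient after a mistake

    def check(self, now):
        """Suspect every peer that has been silent for longer than the timeout."""
        for peer, seen in self.last_heard.items():
            if now - seen > self.timeout:
                self.suspected.add(peer)
        return set(self.suspected)


detector = HeartbeatDetector(peers=["q", "r"], timeout=1.0)
detector.heartbeat("q", now=0.5)
detector.heartbeat("r", now=0.5)
first = detector.check(now=2.0)   # both peers have been silent for 1.5 > 1.0
detector.heartbeat("q", now=2.1)  # q was only slow, not crashed
second = detector.check(now=2.5)
```

In this run both peers are suspected at first, and `q` is later unsuspected when its heartbeat arrives, while `r` stays suspected.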
Completeness & Accuracy • Completeness • Down Processors are Abandoned • Accuracy • Up Processors are Not Abandoned
Function Definitions • abandons(p, q, t) • Processor p Abandons Processor q at Time t • isDown(q, t) • Processor q is Really Down at Time t
Completeness • Strong Completeness • Every Down Processor is Abandoned by Every Up Processor Eventually • ∀p, ∀q, ∃t0, ∀t > t0: isDown(q, t) ⇒ abandons(p, q, t) • Weak Completeness • Every Down Processor is Abandoned by At Least One Up Processor Eventually • ∀q, ∃p, ∃t0, ∀t > t0: isDown(q, t) ⇒ abandons(p, q, t) • (p Ranges Over Up Processors)
Accuracy • Strong Accuracy (Perpetual/Eventual) • No Up Processor is Abandoned by Any Processor, Ever/Eventually • Perpetual: ∀p, ∀q, ∀t: ¬isDown(q, t) ⇒ ¬abandons(p, q, t) • Eventual: ∀p, ∀q, ∃t0, ∀t > t0: ¬isDown(q, t) ⇒ ¬abandons(p, q, t) • Weak Accuracy (Perpetual/Eventual) • At Least One Up Processor is Not Abandoned by Any Processor, Ever/Eventually • Perpetual: ∃q, ∀p, ∀t: ¬isDown(q, t) ∧ ¬abandons(p, q, t) • Eventual: ∃q, ∃t0, ∀p, ∀t > t0: ¬isDown(q, t) ∧ ¬abandons(p, q, t)
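The completeness and accuracy properties can be checked mechanically on a recorded run. The sketch below is only an approximation: the real properties talk about infinite runs, so "eventually" is read here as "on some non-empty suffix of a finite trace". The trace layout (`crash`, `suspects`, `horizon`) and all function names are assumptions made for this example.

```python
def is_down(crash, q, t):
    """crash maps each processor to its crash time, or None if it never crashes."""
    return crash[q] is not None and t >= crash[q]


def eventually_always(pred, horizon):
    """True if pred(t) holds for all t in some non-empty suffix t0 < t <= horizon."""
    return any(all(pred(t) for t in range(t0 + 1, horizon + 1))
               for t0 in range(horizon))


def strong_completeness(crash, suspects, horizon):
    """Every down processor ends up abandoned by every up processor."""
    up = [p for p in crash if crash[p] is None]
    return all(
        eventually_always(
            lambda t: not is_down(crash, q, t) or q in suspects[p][t], horizon)
        for p in up for q in crash)


def strong_accuracy(crash, suspects, horizon):
    """Perpetual form: nobody is abandoned while still up."""
    up = [p for p in crash if crash[p] is None]
    return all(is_down(crash, q, t) or q not in suspects[p][t]
               for p in up for q in crash for t in range(horizon + 1))


def weak_accuracy(crash, suspects, horizon):
    """Perpetual form: some up processor is never abandoned by any up processor."""
    up = [p for p in crash if crash[p] is None]
    return any(all(q not in suspects[p][t] for p in up for t in range(horizon + 1))
               for q in up)


# Example trace: c crashes at time 2; b wrongly abandons a at time 1.
crash = {"a": None, "b": None, "c": 2}
horizon = 4
suspects = {  # suspects[p][t] = set abandoned by p at time t
    "a": [set(), set(), set(), {"c"}, {"c"}],
    "b": [set(), {"a"}, set(), set(), {"c"}],
}
results = (strong_completeness(crash, suspects, horizon),
           strong_accuracy(crash, suspects, horizon),
           weak_accuracy(crash, suspects, horizon))
```

On this trace the detector is strongly complete and weakly accurate (nobody ever abandons `b`), but not strongly accurate, because `b` abandons `a` while `a` is up.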
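Classes of Failure Detectors • 8 Combinations of Completeness and Accuracy • Strong Completeness: P (Perfect), S (Strong), ◇P (Eventually Perfect), ◇S (Eventually Strong) • Weak Completeness: Q, W (Weak), ◇Q, ◇W (Eventually Weak)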
Reducibility (Emulation) • Some Classes are More Powerful Than Others • Strong Complete Can Emulate Weak Complete • Some Classes Can Emulate Others Using an Algorithm: • Up Processors Share Lists of Abandoned Processors, Exclude Themselves • Abandoned by One Becomes Abandoned by All • Weak Complete Can Emulate Strong Complete
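The sharing step described above can be sketched as a small simulation. It assumes a single synchronous exchange in which every up processor's list reaches every up processor; the function name `gossip_round` and the data layout are illustrative, not taken from the slides.

```python
def gossip_round(local_suspects, outputs, up):
    """One exchange: every up processor broadcasts its local list of abandoned processors.

    Each receiver merges the sender's list into its own output and removes the
    sender itself, since a processor that just sent a message is evidently up.
    """
    for sender in up:
        for receiver in up:
            outputs[receiver] = (outputs[receiver] | local_suspects[sender]) - {sender}
    return outputs


up = ["a", "b"]  # c has crashed
# Weak completeness: only a has noticed that c is down; a also wrongly suspects b.
local_suspects = {"a": {"b", "c"}, "b": set()}
outputs = gossip_round(local_suspects, {"a": set(), "b": set()}, up)
```

After the exchange both up processors abandon `c`, so the emulated detector is strongly complete, and the wrong suspicion of `b` is cleared once `b`'s own message is processed.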
Completeness Classes Are Equivalent • 4 Distinct Accuracy Classes
Relationship of Accuracy Classes • Perpetual is More Powerful Than Eventual • Perpetual: ∀t • Eventual: ∃t0, ∀t > t0 • Strong is More Powerful Than Weak • Strong: ∀q • Weak: ∃q
Relationship of Failure Detector Classes • P is Most Powerful; ◇S is Least Powerful
The Consensus Problem • Processors Reach Agreement on a Value • Termination: All Up Processors Eventually Decide • Agreement: All Agree to Same Value • Integrity: Decision is Final • Validity: A Proposed Value is Chosen • If They Can Agree on One Thing, They Can Agree on Anything • Algorithms for S and ◇S Detectors • At Least One Up Processor Using S Detectors • A Majority of Up Processors Using ◇S Detectors
Algorithm for S Detectors • S Detectors – At Least One Up Processor is Not Abandoned by Any Up Processor Ever • Collect Proposed Values from Each Processor • or the News That the Process Crashed • Collect Other Processors’ Knowledge of Proposed Values • Discard Values not Known to All • Pick (Consistently) a Value from Known Values • All Processors Get Phase 1 & 2 Information from the Processor That is Never Abandoned
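The three phases can be illustrated with a heavily simplified simulation. It assumes that crashes happen only before the run starts, that detectors never make mistakes, and that all exchanges are synchronous; the actual algorithm tolerates crashes mid-run and false suspicions, which this sketch does not model. The function name `s_consensus` and the use of `None` for "no value, processor crashed" are assumptions.

```python
def s_consensus(proposals, crashed=frozenset()):
    """Return {up processor: decided value} for a simplified three-phase run."""
    procs = sorted(proposals)
    up = [p for p in procs if p not in crashed]

    # Phase 1: collect proposed values (None stands for "crashed, no value").
    known = {p: {q: None for q in procs} for p in up}
    for p in up:
        known[p][p] = proposals[p]
    for _ in range(len(procs) - 1):
        snapshot = {p: dict(vec) for p, vec in known.items()}
        for p in up:
            for q in up:
                for k in procs:
                    if known[p][k] is None:
                        known[p][k] = snapshot[q][k]  # learn what q knew

    # Phase 2: keep only the values known to all up processors.
    agreed = [k for k in procs if all(known[q][k] is not None for q in up)]

    # Phase 3: pick consistently, here the value of the first agreed processor.
    return {p: known[p][agreed[0]] for p in up}


all_up = s_consensus({"a": 5, "b": 7, "c": 9})
a_crashed = s_consensus({"a": 5, "b": 7, "c": 9}, crashed={"a"})
```

With everyone up, all processors decide the value proposed by `a`; if `a` crashed before proposing, the remaining processors decide the value proposed by `b`.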
Algorithm for ◇S Detectors • Rotating Coordinator • Each Processor Takes Its Turn • Tries to Make Decision • If the Processor is Up and is Not Abandoned by Any Up Processor, the Decision is Made
Each Round of ◇S Algorithm • At Least One Up Processor is Not Abandoned by Any Up Processor Eventually • All Processors Send Value and the Round Number to Coordinator • Coordinator Waits for a Majority and Sends the Value with the Latest Round Number to All Processors • Each Processor Indicates If It Abandoned Coordinator • Coordinator Waits for a Majority, If No Processor Abandoned Coordinator, the Value is Decided • Repeat Until Coordinator is Not Abandoned Eventually
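One round of the rotating-coordinator scheme can be sketched as follows. This is a synchronous simulation under stated assumptions: the coordinator hears from every up processor (which must include a majority), and the set of processors whose detectors abandon the coordinator is passed in as `abandoning`. The function name `run_round` and the `(value, stamp)` state layout are illustrative.

```python
def run_round(names, state, rnd, abandoning):
    """Simulate one round; return the decided value, or None if no decision.

    names:      all processor names, in coordinator rotation order
    state:      {name: (value, stamp)} for processors that are still up
    abandoning: up processors whose detector abandons the coordinator this round
    """
    majority = len(names) // 2 + 1
    coord = names[(rnd - 1) % len(names)]
    if coord not in state or len(state) < majority:
        return None  # crashed coordinator or no majority: no decision this round

    # Phases 1-2: coordinator gathers values and proposes the most recently stamped.
    proposal, _ = max(state.values(), key=lambda vs: vs[1])

    # Phase 3: processors that trust the coordinator adopt the proposal and ack.
    acks = 0
    for name in state:
        if name == coord or name not in abandoning:
            state[name] = (proposal, rnd)
            acks += 1

    # Phase 4: the value is decided only if nobody abandoned the coordinator.
    return proposal if acks == len(state) else None


names = ["a", "b", "c"]
state = {"a": ("x", 0), "b": ("y", 0), "c": ("z", 0)}
round1 = run_round(names, state, 1, abandoning={"b"})  # b abandons coordinator a
round2 = run_round(names, state, 2, abandoning=set())  # coordinator b is trusted
```

Round 1 fails because `b` abandons coordinator `a`, but `a` and `c` have already adopted `a`'s value with stamp 1. In round 2 coordinator `b` therefore proposes that same value, which is what keeps a decision consistent across rounds.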
Atomic Broadcast • All Processors Receive the Same Messages in the Same Order • Atomic Broadcast is Equivalent to Consensus • Each Can Be Reduced to the Other • Solution to Consensus Applies to Atomic Broadcast
Atomic Broadcast Reduces to Consensus • Atomic Broadcast Can Be Implemented Using a Consensus Algorithm • Each Processor Proposes a Message • Consensus is Used to Decide Which Message is Recognized as the Next Atomically Broadcast Message
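The reduction can be sketched as a loop of consensus instances, one per delivered message. The `consensus` function below is a hypothetical stand-in that simply picks the smallest proposal; a real system would run one of the detector-based algorithms instead. The sketch also assumes every message eventually reaches every processor, so a processor may deliver a chosen message it had not yet received itself.

```python
def consensus(proposals):
    """Hypothetical stand-in for a consensus algorithm: picks one proposed value."""
    return min(proposals.values())


def atomic_broadcast(pending):
    """pending: {processor: set of messages received but not yet delivered}."""
    delivered = {p: [] for p in pending}
    while any(pending.values()):
        # Each processor with pending messages proposes one of them.
        proposals = {p: min(msgs) for p, msgs in pending.items() if msgs}
        chosen = consensus(proposals)  # the next atomically broadcast message
        for p in pending:
            delivered[p].append(chosen)
            pending[p].discard(chosen)
    return delivered


delivered = atomic_broadcast({"a": {"m2", "m1"}, "b": {"m3"}, "c": {"m1", "m3"}})
```

Although the three processors start with different pending sets, each ends with the same delivery sequence.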
Consensus Reduces to Atomic Broadcast • Consensus Can Be Implemented Using An Atomic Broadcast Algorithm • To Decide a Value, a Process Atomically Broadcasts It • Go to Lunch Early
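The reverse reduction is short enough to sketch directly: if every processor atomically broadcasts its proposal, all processors see the same delivery order and can decide on the first delivered value. The delivery order below is an assumed example, and `decide` is an illustrative name.

```python
def decide(delivery_order):
    """Every processor decides the first value in the (identical) delivery order."""
    return delivery_order[0]


delivery = ["v_b", "v_a", "v_c"]  # assumed: the same sequence at every processor
decisions = {p: decide(delivery) for p in ("a", "b", "c")}
```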
Summary • Reliable Distributed Systems • Unreliable Failure Detectors • Relationship of Detector Classes • Algorithms for Consensus • Equivalence with Atomic Broadcast