On the Cost of Fault-Tolerant Consensus When There are no Faults Idit Keidar and Sergio Rajsbaum PODC 2002 Tutorial
About This Tutorial • Preliminary version in SIGACT News and MIT Tech Report, June 2001 • More polished lower bound proof to appear in IPL • New version of the tutorial in preparation • The talk includes only a subset of references, sorry • We include some food for thought Any suggestions are welcome!
Consensus Each process has an input and should decide an output such that: • Agreement: correct processes’ decisions are the same • Validity: the decision is the input of some process • Termination: eventually all correct processes decide There are at least two possible input values, 0 and 1
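The three properties can be read as predicates over a finished run. The following is a minimal sketch under stated assumptions: the function name `check_consensus` and the dictionary encoding of inputs and decisions are hypothetical, and termination is only approximated by checking that every correct process has decided by the end of the recorded run.

```python
from typing import Dict, Hashable, Optional, Set


def check_consensus(inputs: Dict[int, Hashable],
                    decisions: Dict[int, Optional[Hashable]],
                    correct: Set[int]) -> Dict[str, bool]:
    """Check one finished run against the three consensus properties.

    inputs:    process id -> input value
    decisions: process id -> decided value, or None if it never decided
    correct:   ids of the processes that did not crash
    """
    # Decisions made by correct processes (None means "did not decide").
    correct_decisions = {decisions[p] for p in correct
                         if decisions.get(p) is not None}
    # Agreement: correct processes' decisions are the same.
    agreement = len(correct_decisions) <= 1
    # Validity: every decision is the input of some process.
    validity = all(d in inputs.values()
                   for d in decisions.values() if d is not None)
    # Termination (finite-run proxy): every correct process has decided.
    termination = all(decisions.get(p) is not None for p in correct)
    return {"agreement": agreement,
            "validity": validity,
            "termination": termination}
```

Note that a crashed process is allowed to decide differently here, which is exactly the gap that uniform agreement closes.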
Basic Model • Message passing • Channels between every pair of processes • Crash failures • t<n potential failures out of n>1 processes • No message loss among correct processes
How Long Does It Take to Solve Consensus? Depends on the timing model: • Message delays • Processing times • Clocks • And on the metric used: • Worst case • Average • etc
The Rest of This Tutorial • Part I: Realistic timing model and metric • Part II: Upper bounds • Part III: Lower bounds • Part IV: New directions and extensions
Asynchronous Model • Unbounded message delay and processor speed • Consensus impossible even for t=1 [FLP85]
Synchronous Model • Algorithm runs in synchronous rounds: • send messages to any set of processes, • receive messages from previous round, • do local processing (possibly decide, halt) • If process i crashes in a round, then any subset of the messages i sends in this round can be lost
Synchronous Consensus • 1 round with no failures • Consider a run with f failures (f<t) • Processes can decide in f+1 rounds [Lamport, Fischer 82; Dolev, Reischuk, Strong 90] (early-deciding) • In this talk deciding ≠ halting • Halting takes min(f+2,t+1) rounds [Dolev, Reischuk, Strong 90]
The Middle Ground Many real networks are neither synchronous nor asynchronous • During long stable periods, delays and processing times are bounded • Like synchronous model • Some unstable periods • Like asynchronous model
Partial Synchrony Model [Dwork, Lynch, Stockmeyer 88] • Processes have clocks with bounded drift • D, upper bound on message delay • r, upper bound on processing time • GST, global stabilization time • Until GST, unstable: bounds do not hold • After GST, stable: bounds hold • GST unknown
Partial Synchrony in Practice • For D, r, choose bounds that hold with high probability • Stability forever? • We assume that once stable remains stable • In practice, has to last “long enough” for given algorithm to terminate • A commonly used model that alternates between stable and unstable times: Timed Asynchronous Model [Cristian, Fetzer 98]
Consensus with Partial Synchrony • Unbounded running time by [FLP85], because the model can be asynchronous for unbounded time • Solvable iff t < n/2 [DLS88]
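One way to build intuition for the t < n/2 threshold: the algorithms later in this tutorial wait for messages from n-t processes, and any two such sets must overlap for agreement to be preserved. The check below is a small sketch of that counting argument only, not the [DLS88] proof; the function name `quorums_intersect` is hypothetical.

```python
def quorums_intersect(n: int, t: int) -> bool:
    """Return True if any two sets of n - t processes must share a member."""
    # Two sets of size n - t drawn from n processes overlap in at least
    # 2 * (n - t) - n processes (pigeonhole), which is positive iff t < n/2.
    return 2 * (n - t) - n > 0
```

For example, with n = 5 and t = 2 any two sets of 3 processes share at least one member, while with n = 4 and t = 2 two disjoint sets of 2 processes exist.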
In a Practical System Can we say more than: consensus will be solved eventually?
Performance Metric Number of rounds in well-behaved runs • Well-behaved: • No failures • Stable from the beginning • Motivation: common case
The Rest of This Tutorial • Part II: best known algorithms decide in 2 rounds in well-behaved runs • 2D time (with delay bound D, 0 processing time) • Part III: this is the best possible • Part IV: new directions and extensions
Part II: Algorithms, and the Failure Detector Abstraction II.a Failure Detectors and Partial Synchrony II.b Algorithms
Time-Free Algorithms • We describe the algorithms using failure detector abstraction [Chandra, Toueg 96] • Goal: abstract away time, get simpler algorithms
Unreliable Failure Detectors [Chandra, Toueg 96] • Each process has local failure detector oracle • Typically outputs list of processes suspected to have crashed at any given time • Unreliable: failure detector output can be arbitrary for unbounded (finite) prefix of run
Performance of Failure Detector Based Consensus Algorithms • Implement a failure detector in the partial synchrony model • Design an algorithm for the failure detector • Analyze the performance in well-behaved runs of the combined algorithm
A Natural Failure Detector Implementation in Partial Synchrony Model • Implement failure detector using timeouts: • When expecting a message from a process i, wait D + r + clock skew before suspecting i • In well-behaved runs, D, r always hold, hence no false suspicions
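The timeout rule can be sketched as a small bookkeeping class. This is a minimal illustration under stated assumptions: the class name `TimeoutDetector`, its method names, and the explicit `skew` parameter are hypothetical, and local time is passed in by the caller rather than read from a real clock.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class TimeoutDetector:
    D: float      # upper bound on message delay
    r: float      # upper bound on processing time
    skew: float   # assumed bound on clock skew
    deadlines: Dict[int, float] = field(default_factory=dict)
    suspected: Set[int] = field(default_factory=set)

    def expect(self, i: int, now: float) -> None:
        """Start waiting for a message from process i at local time `now`."""
        self.deadlines[i] = now + self.D + self.r + self.skew

    def received(self, i: int) -> None:
        """A message from i arrived: stop waiting and clear any suspicion."""
        self.deadlines.pop(i, None)
        self.suspected.discard(i)

    def tick(self, now: float) -> Set[int]:
        """Suspect every process whose deadline has passed; return suspects."""
        for i, deadline in self.deadlines.items():
            if now > deadline:
                self.suspected.add(i)
        return set(self.suspected)
```

If the bounds D and r hold from the start, an expected message always arrives before its deadline, so `tick` never adds a correct process to the suspect set.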
The resulting failure detector is <>P - Eventually Perfect • Strong Completeness: From some point on, every faulty process is suspected by every correct process • Eventual Strong Accuracy: From some point on, every correct process is not suspected* *holds in all runs
Weakest Failure Detectors for Consensus • <>S - Eventually Strong • Strong Completeness • Eventual Weak Accuracy: From some point on, some correct process is not suspected • <>W - Leader • Outputs one trusted process • From some point, all correct processes trust the same correct process
Relationships among Failure Detector Classes • <>S is a subset of <>P • <>S is strictly weaker than <>P • <>S ~ <>W [Chandra, Hadzilacos, Toueg 96] Food for thought: What is the weakest timing model where <>S and/or <>W are implementable but <>P is not?
Note on the Power of Consensus • Consensus cannot implement <>P, interactive consistency, atomic commit, … • So its “universality”, in the sense of • wait-free objects in shared memory [Herlihy 93] • state machine replication [Lamport 78; Schneider 90] does not cover sensitivity to failures, timing, etc.
A Natural <>W Implementation • Use <>P implementation • Output lowest id non-suspected process In well-behaved runs: process 1 always trusted
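The leader rule is a one-liner once a suspect list is available. A minimal sketch, assuming process ids are comparable integers and the suspect set comes from some <>P implementation; the function name `trusted_leader` is hypothetical.

```python
from typing import Iterable, Optional, Set


def trusted_leader(process_ids: Iterable[int],
                   suspected: Set[int]) -> Optional[int]:
    """Output the lowest-id process that is not currently suspected."""
    candidates = [p for p in process_ids if p not in suspected]
    # None is returned only if every process is suspected.
    return min(candidates) if candidates else None
```

With an empty suspect set, as in well-behaved runs, the output is always process 1.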
Other Failure Detector Implementations • Message efficient <>S implementation [Larrea, Fernández, Arévalo 00] • QoS tradeoffs between accuracy and completeness [Chen, Toueg, Aguilera 00] • Leader Election [Aguilera, Delporte, Fauconnier, Toueg 01] • Adaptive <>P [Fetzer, Raynal, Tronel 01] Food for thought: When is building <>P more costly than <>S or <>W?
Part II: Algorithms, and the Failure Detector Abstraction II.a Failure Detectors and Partial Synchrony II.b Algorithms
Algorithms that Take 2 Rounds in Well-Behaved Runs • <>S-based [Schiper 97; Hurfin, Raynal 99; Mostefaoui, Raynal 99] • <>W-based for t < n/3 [Mostefaoui, Raynal 00] • <>W-based for t < n/2 [Dutta, Guerraoui 01] • Paxos (optimized version) [Lamport 89; 96] • Leader-based (<>W) • Also tolerates omissions, crash recoveries • COReL - Atomic Broadcast [Keidar, Dolev 96] • Group membership based (<>P)
Of This Laundry List, We Present Two Algorithms • <>S-based [MR99] • Paxos
<>S-based Consensus [MR99]
val ← input v; est ← null
for r = 1, 2, … do
    coord ← (r mod n) + 1
    if I am coord, then send (r, val) to all
    wait for ( (r, val) from coord OR suspect coord )
    if receive val from coord then est ← val
    send (r, est) to all
    wait for (r, est) from n-t
    if any non-null est received then val ← est
    if all ests have same v then send (“decide”, v) to all; return(v)
od
• Upon receive (“decide”, v), forward to all, return(v)
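To see why the algorithm decides after two communication steps when nothing goes wrong, the first round can be traced in a few lines. This is a hedged sketch, not the [MR99] algorithm itself: it covers only round r = 1 of a run with no failures and no suspicions, takes the coordinator rule literally from the listing, and uses a hypothetical function name `well_behaved_round`.

```python
from typing import Dict, Hashable, Tuple


def well_behaved_round(inputs: Dict[int, Hashable]
                       ) -> Tuple[Dict[int, Hashable], int]:
    """Trace round r = 1 with no failures and no suspicions.

    inputs maps process ids 1..n to input values.
    Returns (decisions, number of communication steps used).
    """
    n = len(inputs)
    r = 1
    coord = (r % n) + 1          # coordinator rule as written in the listing
    steps = 0

    # Step 1: the coordinator sends (r, val) to all. Nobody suspects it,
    # so every process receives the message and sets est to that value.
    est = {p: inputs[coord] for p in inputs}
    steps += 1

    # Step 2: every process sends (r, est) to all. With no failures each
    # process receives the est of every process (at least n - t of them).
    received = [est[p] for p in inputs]
    steps += 1

    decisions = {}
    for p in inputs:
        # All ests carry the same value, so every process decides it.
        if all(e == received[0] for e in received):
            decisions[p] = received[0]
    return decisions, steps
```

Every process ends up deciding the coordinator's value after exactly two steps, which matches the 2-round cost claimed for well-behaved runs.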
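In Well-Behaved Runs [Figure: message diagram for processes 1, 2, …, n. Step 1: the coordinator sends (1, v1) to all. Step 2: every process sets est = v1 and sends (1, v1) to all. All processes then decide v1.]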
In Case of Omissions The algorithm can block in case of transient message omissions, waiting for a specific round message that will not arrive
Paxos [Lamport 88; 96; 01] • Uses <>W failure detector • Phase 1: prepare • A process who trusts itself tries to become leader • Chooses largest unique (using ids) ballot number • Learns outcome of all smaller ballots • Phase 2: accept • Leader gets majority to accept a value associated with his ballot number • A value accepted by a majority can be decided
Paxos - Variables • Type Rank • totally ordered set with minimum element r0 • Variables: Rank BallotNum, initially r0; Rank AcceptNum, initially r0; Value ∪ {^} AcceptVal, initially ^ (where ^ stands for “no value accepted yet”)
Paxos Phase I: Prepare
• Periodically, until decision is reached do:
    if leader (by <>W) then
        BallotNum ← (unique rank > BallotNum)
        send (“prepare”, rank) to all
• Upon receive (“prepare”, rank) from i
    if rank > BallotNum then
        BallotNum ← rank
        send (“ack”, rank, AcceptNum, AcceptVal) to i
Paxos Phase II: Accept
Upon receive (“ack”, BallotNum, b, val) from n-t
    if all vals = ^ then myVal = initial value
    else myVal = received val with highest b
    send (“accept”, BallotNum, myVal) to all
Upon receive (“accept”, b, v) with b ≥ BallotNum
    AcceptNum ← b; AcceptVal ← v
    send (“accept”, b, v) to all (first time only)
Paxos – Deciding
Upon receive (“accept”, b, v) from n-t
    decide v
    periodically send (“decide”, v) to all
Upon receive (“decide”, v)
    decide v
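The per-process handlers of Phases I and II can be sketched as a small state object. This is an illustration under stated assumptions, not Lamport's specification: ranks are encoded as integers with `R0 = 0` as the minimum, `None` plays the role of ^, the guard on an accept message is assumed to be b ≥ BallotNum as in standard Paxos, and the "first time only" rebroadcast and the counting of n-t messages are left to the caller. The names `PaxosState` and `choose_value` are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple

R0 = 0  # assumed integer encoding of the minimum rank r0


@dataclass
class PaxosState:
    ballot_num: int = R0
    accept_num: int = R0
    accept_val: Optional[Any] = None   # None plays the role of ^

    def on_prepare(self, rank: int) -> Optional[Tuple[str, int, int, Any]]:
        """Phase I: ack a prepare only if its rank exceeds BallotNum."""
        if rank > self.ballot_num:
            self.ballot_num = rank
            return ("ack", rank, self.accept_num, self.accept_val)
        return None  # stale prepare: no reply

    def on_accept(self, b: int, v: Any) -> Optional[Tuple[str, int, Any]]:
        """Phase II: accept (b, v) unless a higher ballot was acked."""
        if b >= self.ballot_num:
            self.accept_num, self.accept_val = b, v
            return ("accept", b, v)  # message to forward to all
        return None


def choose_value(acks: List[Tuple[int, Any]], initial: Any) -> Any:
    """Leader's rule once n - t acks (b, val) for its ballot have arrived."""
    reported = [(b, val) for b, val in acks if val is not None]
    if not reported:
        return initial  # all vals are ^: propose own initial value
    # Otherwise adopt the value accepted under the highest ballot b.
    return max(reported, key=lambda bv: bv[0])[1]
```

Adopting the value with the highest reported ballot is what keeps a new leader from overriding a value that a majority may already have accepted.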
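In Well-Behaved Runs [Figure: message diagram for processes 1, 2, …, n. Process 1 sends (“prepare”, 1) to all; the processes reply (“ack”, 1, r0, ^); process 1 sends (“accept”, 1, v1) to all; every process forwards (“accept”, 1, v1) to all; all decide v1.] Our <>W implementation always trusts process 1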
Optimization • Allow process 1 (only!) to skip Phase 1 • use rank r0 • propose its own initial value • Takes 2 rounds in well-behaved runs • Takes 2 rounds for repeated invocations with the same leader
What About Omissions? • Does not block in case of a lost message • Phase I can start with new rank even if previous attempts never ended • But constant omissions can violate liveness • Specify conditional liveness: If n-t correct processes including the leader can communicate with each other then they eventually decide
Upper Bounds From Part II We saw that there are algorithms that take 2 rounds to decide in well-behaved runs • <>S-based, <>W-based, Paxos, COReL • Presented two of them.
Why are there no 1-Round Algorithms? There is a lower bound of 2 rounds in well-behaved executions • Similar bounds shown in [Dwork, Skeen 83; Lamport 00] • We will show that the bound follows from a similar bound on Uniform Consensus in the synchronous model
Uniform Consensus • Uniform agreement: decision of every two processes is the same Recall: with consensus, only correct processes have to agree
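From Consensus to Uniform Consensus In partial synchrony model, any algorithm A for consensus solves uniform consensus [Guerraoui 95] Proof: Assume by contradiction that A does not solve uniform consensus • In some run, p and q decide differently, and p fails • Before stabilization, q cannot distinguish this run from one in which p is non-faulty but slow, and wakes up after q decides • In that run two correct processes decide differently, contradicting agreement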
Synchronous Uniform Consensus Every algorithm has a well-behaved run that takes 2 rounds to decide • More generally, it has a run with f failures (f<t-1), that takes at least f+2 rounds to decide [Charron-Bost, Schiper 00; KR 01] • as opposed to f+1 for consensus
A Simple Proof of the Uniform Consensus Synchronous Lower Bound [Keidar, Rajsbaum 01] To Appear in IPL
States • State = list of processes’ local states • Given a fixed deterministic algorithm, the state at the end of a run is determined by the initial values and the environment actions • Environment actions: failures, message loss • A state can be denoted as x . E1 . E2 . E3, where x is a state and the Ei are environment actions
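The notation x . E1 . E2 . E3 can be read as folding a deterministic step function over a sequence of environment actions. The sketch below is a toy illustration only: it assumes a hypothetical full-information algorithm in which every process sends everything it knows each round, models an environment action as the set of (sender, receiver) messages lost in that round, and uses 0-based process indices. The names `step` and `extend` are hypothetical.

```python
from functools import reduce
from typing import FrozenSet, Tuple

State = Tuple[FrozenSet[int], ...]   # one local state (known values) per process
Env = FrozenSet[Tuple[int, int]]     # (sender, receiver) messages lost in a round


def step(x: State, e: Env) -> State:
    """Apply one round of the toy algorithm under environment action e."""
    n = len(x)
    return tuple(
        # Process q keeps its own knowledge and adds whatever it receives
        # from every sender s whose message to q was not lost.
        frozenset().union(*(x[s] for s in range(n)
                            if s == q or (s, q) not in e))
        for q in range(n)
    )


def extend(x: State, *envs: Env) -> State:
    """Compute x . E1 . E2 . ... for the given environment actions."""
    return reduce(step, envs, x)
```

Because `step` is deterministic, the final state depends only on the initial state and the sequence of environment actions, which is the property the lower bound proof relies on.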