On the Cost of Fault-Tolerant Consensus When There Are No Faults
Idit Keidar and Sergio Rajsbaum
PODC 2002 Tutorial
About This Tutorial
• Preliminary version in SIGACT News and MIT Tech Report, June 2001
• More polished lower bound proof to appear in IPL
• New version of the tutorial in preparation
• The talk includes only a subset of references, sorry
• We include some food for thought
Any suggestions are welcome!
Consensus
Each process has an input and must decide an output such that:
• Agreement: correct processes' decisions are the same
• Validity: the decision is the input of some process
• Termination: eventually all correct processes decide
There are at least two possible input values, 0 and 1.
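As a concrete reading of this specification, here is a small sketch (not part of the original tutorial) that checks the three properties over a finished run; the per-process record of input, decision, and correctness flag is an assumed bookkeeping format used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ProcessRecord:
    input: int                 # this process's input (0 or 1)
    decision: Optional[int]    # None if the process never decided
    correct: bool              # True if the process did not crash

def check_consensus(run: List[ProcessRecord]) -> bool:
    """Check Agreement, Validity, and Termination on a finished run."""
    correct = [p for p in run if p.correct]
    decided = [p.decision for p in run if p.decision is not None]
    # Agreement: all correct processes that decided chose the same value
    agreement = len({p.decision for p in correct if p.decision is not None}) <= 1
    # Validity: every decision is some process's input
    validity = all(d in {p.input for p in run} for d in decided)
    # Termination: every correct process eventually decided
    termination = all(p.decision is not None for p in correct)
    return agreement and validity and termination

# Example: three processes, one crashes before deciding
print(check_consensus([
    ProcessRecord(0, 1, True),
    ProcessRecord(1, 1, True),
    ProcessRecord(1, None, False),
]))  # True
```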
Basic Model
• Message passing
• Channels between every pair of processes
• Crash failures
• t < n potential failures out of n > 1 processes
• No message loss among correct processes
How Long Does It Take to Solve Consensus?
Depends on the timing model:
• message delays
• processing times
• clocks
And on the metric used:
• worst case
• average
• etc.
The Rest of This Tutorial
• Part I: Realistic timing model and metric
• Part II: Upper bounds
• Part III: Lower bounds
• Part IV: New directions and extensions
Asynchronous Model
• Unbounded message delay and processing time
Consensus is impossible even for t = 1 [FLP85]
Synchronous Model
• The algorithm runs in synchronous rounds; in each round, a process can:
  • send messages to any set of processes,
  • receive the messages sent in the previous round,
  • do local processing (possibly decide, halt)
• If process i crashes in a round, then any subset of the messages i sends in that round can be lost
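To make the round structure concrete, here is a toy single-machine sketch of this model (my illustration, not from the tutorial); the `step` callback and the coin-flip message loss in the crashing round are assumptions chosen only to mirror the rules above.

```python
import random
from typing import Any, Callable, Dict, List

# A process step: given (pid, round, messages received from the previous round),
# it returns the messages to send this round as {destination_pid: message}.
Step = Callable[[int, int, List[Any]], Dict[int, Any]]

def run_synchronous(n: int, rounds: int, step: Step,
                    crash_round: Dict[int, int] = None) -> Dict[int, List[Any]]:
    """Toy executor for the synchronous round model.  If process i crashes in
    round r, an arbitrary subset of the messages it sends in that round is
    lost, and it sends nothing afterwards."""
    crash_round = crash_round or {}
    inbox: Dict[int, List[Any]] = {i: [] for i in range(n)}
    for r in range(1, rounds + 1):
        next_inbox: Dict[int, List[Any]] = {i: [] for i in range(n)}
        for i in range(n):
            cr = crash_round.get(i, rounds + 1)
            if cr < r:
                continue                     # crashed in an earlier round
            for dest, msg in step(i, r, inbox[i]).items():
                if cr == r and random.random() < 0.5:
                    continue                 # this message is lost in the crash
                next_inbox[dest].append((i, msg))
        inbox = next_inbox                   # delivered before the next round
    return inbox

# Example: every process broadcasts its id each round; process 2 crashes in round 1.
final = run_synchronous(3, 2, lambda i, r, msgs: {d: i for d in range(3)},
                        crash_round={2: 1})
```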
Synchronous Consensus
1 round with no failures (f = 0)
• Consider a run with f failures (f < t)
• Processes can decide in f+1 rounds [Lamport, Fischer 82; Dolev, Reischuk, Strong 90] (early deciding)
• In this talk we count rounds until deciding; halting takes min(f+2, t+1) rounds [Dolev, Reischuk, Strong 90]
The Middle Ground
Many real networks are neither synchronous nor asynchronous:
• during long stable periods, delays and processing times are bounded
  • like the synchronous model
• there are some unstable periods
  • like the asynchronous model
Partial Synchrony Model [Dwork, Lynch, Stockmeyer 88]
• Processes have clocks with bounded drift
• D: upper bound on message delay
• r: upper bound on processing time
• GST: global stabilization time
  • until GST, the run is unstable: the bounds need not hold
  • after GST, the run is stable: the bounds hold
  • GST is unknown
Partial Synchrony in Practice
• For D and r, choose bounds that hold with high probability
• Stability forever?
  • We assume that once the system becomes stable, it remains stable
  • In practice, stability has to last "long enough" for the given algorithm to terminate
• A commonly used model that alternates between stable and unstable periods: the Timed Asynchronous Model [Cristian, Fetzer 98]
Consensus with Partial Synchrony
• Unbounded running time, by [FLP85], because the model can be asynchronous for an unbounded time
• Solvable iff t < n/2 [DLS88]
In a Practical System
Can we say more than "consensus will be solved eventually"?
Performance Metric
Number of rounds in well-behaved runs
• Well-behaved:
  • no failures
  • stable from the beginning
• Motivation: the common case
The Rest of This Tutorial
• Part II: the best known algorithms decide in 2 rounds in well-behaved runs
  • 2D time (with delay bound D, 0 processing time)
• Part III: this is the best possible
• Part IV: new directions and extensions
Part II: Algorithms, and the Failure Detector Abstraction
II.a Failure Detectors and Partial Synchrony
II.b Algorithms
Time-Free Algorithms
• We describe the algorithms using the failure detector abstraction [Chandra, Toueg 96]
• Goal: abstract away time, get simpler algorithms
Unreliable Failure Detectors [Chandra, Toueg 96]
• Each process has a local failure detector oracle
• Typically, it outputs a list of processes suspected to have crashed at any given time
• Unreliable: the failure detector output can be arbitrary for an unbounded (but finite) prefix of the run
Performance of Failure Detector Based Consensus Algorithms
• Implement a failure detector in the partial synchrony model
• Design an algorithm for the failure detector
• Analyze the performance of the combined algorithm in well-behaved runs
A Natural Failure Detector Implementation in the Partial Synchrony Model
• Implement the failure detector using timeouts:
  • when expecting a message from process i, wait D + r + clock skew before suspecting i
• In well-behaved runs, D and r always hold, hence there are no false suspicions
The resulting failure detector is <>P - Eventually Perfect
• Strong Completeness: from some point on, every faulty process is suspected by every correct process
• Eventual Strong Accuracy: from some point on, every correct process is not suspected*
* holds in all runs
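A minimal sketch of this timeout-based construction (my illustration, not from the tutorial); it assumes every process is expected to send some message, e.g. a heartbeat, at least every r time units, and the class and method names are made up for the example.

```python
import time

class TimeoutFailureDetector:
    """Sketch of a timeout-based eventually-perfect (<>P) detector.
    Suspect process i if nothing has arrived from it for D + r + skew."""
    def __init__(self, processes, D, r, skew):
        self.timeout = D + r + skew
        now = time.monotonic()
        self.last_heard = {p: now for p in processes}

    def on_message(self, sender):
        # Any message from `sender` refutes the suspicion.
        self.last_heard[sender] = time.monotonic()

    def suspects(self):
        now = time.monotonic()
        return {p for p, t in self.last_heard.items()
                if now - t > self.timeout}
```

In well-behaved runs the bounds hold, so `suspects()` stays empty; once a process crashes and stops sending, every correct process's timer eventually fires, which gives completeness.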
Weakest Failure Detectors for Consensus
• <>S - Eventually Strong
  • Strong Completeness
  • Eventual Weak Accuracy: from some point on, some correct process is not suspected
• <>W - Leader
  • Outputs one trusted process
  • From some point on, all correct processes trust the same correct process
Relationships among Failure Detector Classes
• <>S is a subset of <>P
• <>S is strictly weaker than <>P
• <>S ~ <>W [Chandra, Hadzilacos, Toueg 96]
Food for thought: what is the weakest timing model where <>S and/or <>W are implementable but <>P is not?
Note on the Power of Consensus
• Consensus cannot implement <>P, interactive consistency, atomic commit, …
• So its "universality", in the sense of
  • wait-free objects in shared memory [Herlihy 93]
  • state machine replication [Lamport 78; Schneider 90]
  does not cover sensitivity to failures, timing, etc.
A Natural <>W Implementation
• Use a <>P implementation
• Output the lowest-id non-suspected process
In well-behaved runs, process 1 is always trusted
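The <>W output is then a one-liner over the <>P suspect list; a small self-contained sketch (names are illustrative):

```python
def trusted_leader(processes, suspects):
    """<>W from <>P: trust the lowest-id process that is not suspected."""
    alive = [p for p in sorted(processes) if p not in suspects]
    return alive[0] if alive else None

# In a well-behaved run nothing is suspected, so process 1 is always trusted:
print(trusted_leader({1, 2, 3}, set()))  # 1
```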
Other Failure Detector Implementations
• Message-efficient <>S implementation [Larrea, Fernández, Arévalo 00]
• QoS tradeoffs between accuracy and completeness [Chen, Toueg, Aguilera 00]
• Leader election [Aguilera, Delporte, Fauconnier, Toueg 01]
• Adaptive <>P [Fetzer, Raynal, Tronel 01]
Food for thought: when is building <>P more costly than <>S or <>W?
Part II: Algorithms, and the Failure Detector Abstraction
II.a Failure Detectors and Partial Synchrony
II.b Algorithms
Algorithms that Take 2 Rounds in Well-Behaved Runs
• <>S-based [Schiper 97; Hurfin, Raynal 99; Mostefaoui, Raynal 99]
• <>W-based for t < n/3 [Mostefaoui, Raynal 00]
• <>W-based for t < n/2 [Dutta, Guerraoui 01]
• Paxos (optimized version) [Lamport 89; 96]
  • leader-based (<>W)
  • also tolerates omissions, crash recoveries
• COReL - Atomic Broadcast [Keidar, Dolev 96]
  • group membership based (<>P)
Of This Laundry List, We Present Two Algorithms
• <>S-based [MR99]
• Paxos
<>S-based Consensus [MR99]
val ← input v; est ← null
for r = 1, 2, … do
    coord ← (r mod n) + 1
    /* phase 1 */
    if I am coord then send (r, val) to all
    wait for ( (r, val) from coord  OR  suspect coord )
    if received val from coord then est ← val else est ← null
    /* phase 2 */
    send (r, est) to all
    wait for (r, est) from n−t processes
    if any non-null est received then val ← est
    if all received ests have the same non-null value v then
        send ("decide", v) to all; return(v)
od
Upon receive ("decide", v): forward it to all; return(v)
In Well-Behaved Runs
[Message diagram: coordinator 1 sends (1, v1) to all; every process sets est = v1 and sends (1, v1); after these two rounds all processes decide v1.]
In Case of Omissions
The algorithm can block in case of transient message omissions, waiting for a specific round's message that will never arrive
Paxos [Lamport 88; 96; 01]
• Uses a <>W failure detector
• Phase 1: prepare
  • A process that trusts itself tries to become leader
  • It chooses a new, unique ballot number (uniqueness via process ids)
  • It learns the outcome of all smaller ballots
• Phase 2: accept
  • The leader gets a majority to accept a value associated with its ballot number
  • A value accepted by a majority can be decided
Paxos - Variables
• Type Rank: a totally ordered set with minimum element r0
• Variables:
  Rank BallotNum, initially r0
  Rank AcceptNum, initially r0
  Value ∪ {⊥} AcceptVal, initially ⊥
Paxos Phase I: Prepare
• Periodically, until a decision is reached, do:
    if leader (by <>W) then
        BallotNum ← (a unique rank > BallotNum)
        send ("prepare", BallotNum) to all
• Upon receive ("prepare", rank) from i:
    if rank > BallotNum then
        BallotNum ← rank
        send ("ack", rank, AcceptNum, AcceptVal) to i
Paxos Phase II: Accept
• Upon receive ("ack", BallotNum, b, val) from n−t processes:
    if all received vals = ⊥ then myVal ← my initial value
    else myVal ← the received val with the highest b
    send ("accept", BallotNum, myVal) to all
• Upon receive ("accept", b, v) with b ≥ BallotNum:
    AcceptNum ← b; AcceptVal ← v
    send ("accept", b, v) to all (first time only)
Paxos - Deciding
• Upon receive ("accept", b, v) from n−t processes:
    decide v
    periodically send ("decide", v) to all
• Upon receive ("decide", v):
    decide v
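For concreteness, here is a compact sketch of the per-process state and message handlers from the last three slides (my Python rendering, not the tutorial's code; representing ranks as (round, pid) pairs, using None for ⊥, and the handler names are assumptions):

```python
from dataclasses import dataclass, field

R0 = (0, 0)   # minimum rank; ranks are (round, pid) pairs, hence unique

@dataclass
class PaxosProcess:
    pid: int
    n: int                          # total number of processes
    t: int                          # failure bound; quorums have n - t members
    initial_value: object
    ballot_num: tuple = R0          # BallotNum
    accept_num: tuple = R0          # AcceptNum
    accept_val: object = None       # AcceptVal; None plays the role of ⊥
    forwarded: set = field(default_factory=set)

    # Phase 1, leader side: pick a unique higher rank and broadcast "prepare".
    def prepare(self, round_no: int):
        self.ballot_num = (round_no, self.pid)
        return ("prepare", self.ballot_num)              # send to all

    # Phase 1, acceptor side.
    def on_prepare(self, rank):
        if rank > self.ballot_num:
            self.ballot_num = rank
            return ("ack", rank, self.accept_num, self.accept_val)  # to leader
        return None

    # Phase 2, leader side: called with the (AcceptNum, AcceptVal) pairs from
    # n - t "ack" messages carrying our own ballot_num.
    def on_acks(self, acks):
        non_null = [(b, v) for b, v in acks if v is not None]
        my_val = (self.initial_value if not non_null
                  else max(non_null, key=lambda bv: bv[0])[1])
        return ("accept", self.ballot_num, my_val)       # send to all

    # Phase 2, acceptor side.
    def on_accept(self, b, v):
        if b >= self.ballot_num:
            self.accept_num, self.accept_val = b, v
            if b not in self.forwarded:                  # forward first time only
                self.forwarded.add(b)
                return ("accept", b, v)                  # send to all
        return None
    # A process decides v once it receives ("accept", b, v) from n - t processes.
```

In a well-behaved run this plays out exactly as in the next slide's diagram: one prepare/ack exchange followed by one accept round.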
In Well-Behaved Runs
[Message diagram: our <>W implementation always trusts process 1, which sends ("prepare", 1) to all; each process replies ("ack", 1, r0, ⊥); process 1 then sends ("accept", 1, v1) to all; processes forward ("accept", 1, v1) and decide v1.]
Optimization
• Allow process 1 (only!) to skip Phase 1:
  • use rank r0
  • propose its own initial value
• Takes 2 rounds in well-behaved runs
• Takes 2 rounds for repeated invocations with the same leader
What About Omissions?
• Does not block in case of a lost message
  • Phase I can start with a new rank even if previous attempts never ended
• But constant omissions can violate liveness
• So we specify conditional liveness: if n−t correct processes, including the leader, can communicate with each other, then they eventually decide
Upper Bounds From Part II
We saw that there are algorithms that take 2 rounds to decide in well-behaved runs:
• <>S-based, <>W-based, Paxos, COReL
• We presented two of them
Why are there no 1-Round Algorithms?
There is a lower bound of 2 rounds in well-behaved executions
• Similar bounds are shown in [Dwork, Skeen 83; Lamport 00]
• We will show that the bound follows from a similar bound on Uniform Consensus in the synchronous model
Uniform Consensus
• Uniform Agreement: the decisions of every two processes are the same
Recall: with consensus, only correct processes have to agree
From Consensus to Uniform Consensus
In the partial synchrony model, any algorithm A for consensus solves uniform consensus [Guerraoui 95]
Proof: Assume, by contradiction, that A does not solve uniform consensus:
• in some run, p and q decide differently, and p fails
• but p may in fact be non-faulty: it may merely be slow and wake up after q decides
• that run is indistinguishable from the original, so two correct processes decide differently, contradicting agreement
Synchronous Uniform Consensus
Every algorithm has a well-behaved run that takes 2 rounds to decide
• More generally, it has a run with f failures (f < t−1) that takes at least f+2 rounds to decide [Charron-Bost, Schiper 00; KR 01]
  • as opposed to f+1 for consensus
A Simple Proof of the Uniform Consensus Synchronous Lower Bound
[Keidar, Rajsbaum 01], to appear in IPL
States
• State = list of the processes' local states
• Given a fixed deterministic algorithm, the state at the end of a run is determined by the initial values and the environment actions
  • failures, message loss
• It can therefore be denoted as x . E1 . E2 . E3, where x is a state and the Ei are environment actions