Lecture 4 Introduction to Principles of Distributed Computing

Lecture 4 Introduction to Principles of Distributed Computing. Sergio Rajsbaum, Math Institute, UNAM, Mexico

Lecture 4 Consensus in partially synchronous systems, and failure detectors • Part I: Realistic timing model and metric • Part II: Failure detectors, algorithms • Part III: this is the best possible • Part IV: New directions and extensions

CONSENSUS: A Fundamental Abstraction. Each process has an input and should decide an output such that: • Agreement: correct processes’ decisions are the same • Validity: the decision is the input of one process • Termination: eventually all correct processes decide. There are at least two possible input values, 0 and 1. All possible vectors over the input values V.

The lecture in a nutshell • Consensus solvability depends on how long connectivity is preserved by a particular model • In synchronous systems it is solvable, in asynchronous systems it is not. What about intermediate, more realistic models? [Figure: initial states X0, states after one round L(X0), states after 2 rounds L^2(X0); labels: connectivity preserved, connectivity destroyed]

Basic Model • Message passing (essentially equivalent to read/write shared memory model) • Channels between every pair of processes • Crash failures: t < n potential failures out of n > 1 processes • No message loss among correct processes

Is consensus solvable? If so, how long does it take to solve it? • It depends on what exactly the model is • But what is a realistic model? • And what are the common scenarios within the model? The nature of a distributed system is to include complex combinations of failures and delays

How Fast Can We Solve Consensus? Depends on the timing model: • Message delays • Processing times • Clocks • And on the metric used: • Worst case • Average • etc

The Rest of This Lecture • Part I: Realistic timing model and metric • Part II: Upper bounds • Part III: this is the best possible • Part IV: New directions and extensions

Part I: Realistic Timing Model

First two simple models

Asynchronous Model • Unbounded message delay, unbounded processor speed • Consensus impossible even for t=1 [FLP85]

Synchronous Model • Algorithm runs in synchronous rounds. In each round, a process can: • send messages to any set of processes, • receive messages from previous round, • do local processing (possibly decide, halt) • If process i crashes in a round, then any subset of the messages i sends in this round can be lost

Synchronous Consensus • In a run with f failures (f<t) • Processes can decide in f+1 rounds [Lamport, Fischer 82; Dolev, Reischuk, Strong 90] (early-deciding) • 1 round with no failures • In this talk: deciding (halting takes min(f+2, t+1) rounds [Dolev, Reischuk, Strong 90])

The Middle Ground Many real networks are neither synchronous nor asynchronous • During long stable periods, delays and processing times are bounded • Like synchronous model • Some unstable periods • Like asynchronous model

Partial Synchrony Model [Dwork, Lynch, Stockmeyer 88] • Processes have clocks (with bounded drift) • D, upper bound on message delay • r, upper bound on processing time • GST, global stabilization time • Until GST, unstable: bounds do not hold • After GST, stable: bounds hold • GST unknown

Partial Synchrony in Practice • For D, r, choose bounds that hold with high probability • Stability forever? • We assume that once stable remains stable • In practice, has to last “long enough” for given algorithm to terminate • A commonly used model that alternates between stable and unstable times: Timed Asynchronous Model [Cristian, Fetzer 98]

Consensus with Partial Synchrony • Solvable • Requires t < n/2 [DLS88] • Running time is unbounded, by [FLP85], because the model can be asynchronous for an unbounded time
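One way to build intuition for the t < n/2 requirement: algorithms in this model typically wait for messages from n-t processes, and any two such sets must share a process so that two groups cannot decide independently. The brute-force check below is only an illustration of that counting argument, not a proof of the [DLS88] bound; the function name is hypothetical.

```python
from itertools import combinations

def quorums_always_intersect(n: int, t: int) -> bool:
    """Return True if every two sets of n - t processes share a member."""
    quorums = [set(q) for q in combinations(range(n), n - t)]
    # Two quorums of size n - t can be disjoint only when 2 * (n - t) <= n.
    return all(bool(a & b) for a in quorums for b in quorums)

if __name__ == "__main__":
    for n, t in [(3, 1), (4, 2), (5, 2)]:
        print(n, t, quorums_always_intersect(n, t))
```

For small n the result coincides with the condition 2t < n, which is why a majority of correct processes is the natural threshold here.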

Exercise • Prove that consensus is not solvable in the partially synchronous model if t ≥ n/2 • Prove that, even if t < n/2, the running time of any algorithm that solves it is unbounded

In a Practical System Can we say more than: consensus will be solved eventually ?

Performance Metric: number of rounds in well-behaved runs • Well-behaved: • No failures • Stable from the beginning • Motivation: common case

The Rest of This Lecture • Part II: best known algorithms decide in 2 rounds in well-behaved runs • 2D time (with delay bound D, 0 processing time) • Part III: this is the best possible • Part IV: new directions and extensions

Part II: Algorithms, and the Failure Detector Abstraction II.a Failure Detectors and Partial Synchrony II.b Algorithms

Time-Free Algorithms • Goal: abstract away time, get simpler algorithms • We describe the algorithms using failure detector abstraction [Chandra, Toueg 96]

Unreliable Failure Detectors [Chandra, Toueg 96] • Each process has local failure detector oracle • Typically outputs list of processes suspected to have crashed at any given time • Unreliable: failure detector output can be arbitrary for unbounded (finite) prefix of run

Performance of Failure Detector Based Consensus Algorithms • Implement a failure detector in the partial synchrony model • Design an algorithm for the failure detector • Analyze the performance in well-behaved runs of the combined algorithm

A Natural Failure Detector Implementation in Partial Synchrony Model • Implement failure detector using timeouts: • When expecting a message from a process i, wait D + r + clock skew before suspecting i • In well-behaved runs, the bounds D and r always hold, hence there are no false suspicions
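The timeout rule above can be sketched as follows. This is a minimal illustration, assuming that every process sends periodic messages and that the caller supplies the current local clock value; the class and method names are hypothetical and not taken from the lecture.

```python
class TimeoutFailureDetector:
    """Suspect a process once nothing was heard from it for D + r + skew."""

    def __init__(self, processes, delay_bound, processing_bound, clock_skew):
        # Timeout from the slide: D + r + clock skew.
        self.timeout = delay_bound + processing_bound + clock_skew
        # Assumption: every process counts as "heard from" at time 0.
        self.last_heard = {p: 0.0 for p in processes}

    def on_message(self, sender, now):
        """Record that a message from sender arrived at local time now."""
        self.last_heard[sender] = now

    def suspects(self, now):
        """Return the set of processes whose timeout has expired."""
        return {p for p, heard in self.last_heard.items()
                if now - heard > self.timeout}

if __name__ == "__main__":
    fd = TimeoutFailureDetector([1, 2, 3], delay_bound=1.0,
                                processing_bound=0.5, clock_skew=0.1)
    fd.on_message(1, now=1.0)
    fd.on_message(2, now=1.0)
    print(fd.suspects(now=2.0))
```

A suspicion is withdrawn as soon as a late message arrives, which is what allows the output to be wrong during unstable periods and accurate once the bounds hold.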

The resulting failure detector is <>P - Eventually Perfect • Strong Completeness: From some point on, every faulty process is suspected by every correct process • Eventual Strong Accuracy: From some point on, every correct process is not suspected

Weakest Failure Detectors for Consensus • <>S - Eventually Strong • Strong Completeness • Eventual Weak Accuracy: From some point on, some correct process is not suspected • W - Leader • Outputs one trusted process • From some point, all correct processes trust the same correct process

A Simple W Implementation • Use <>P implementation • Output the non-suspected process with the lowest id • In well-behaved runs: process 1 is always trusted
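A minimal sketch of this rule, assuming the set of currently suspected processes is provided by some <>P implementation; the function name and the None result for "everyone is suspected" are illustrative choices, not part of the lecture.

```python
def leader(process_ids, suspected):
    """Return the lowest-id process that is not currently suspected."""
    trusted = [p for p in sorted(process_ids) if p not in suspected]
    # Assumption: return None if every process is suspected.
    return trusted[0] if trusted else None

if __name__ == "__main__":
    print(leader([1, 2, 3], suspected=set()))  # no suspicions
    print(leader([1, 2, 3], suspected={1}))    # process 1 suspected
```

Once the underlying detector stops making mistakes, all correct processes compute the same suspected set and therefore the same leader.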

Exercise • Write the algorithm code for this failure detector W, and prove it is correct

Relationships among Failure Detector Classes • <>P is a subset of <>S: every <>P failure detector also satisfies the <>S properties • <>S is strictly weaker than <>P • <>S ~ W [Chandra, Hadzilacos, Toueg 96] Food for thought: What is the weakest timing model where <>S and/or W are implementable but <>P is not?

Relationships among Failure Detector Classes: Recent Results. Partial answer: in PODC’03, Aguilera et al. present a system S with synchronous processes in which: • any number of processes may crash, and • only the output links of an unknown correct process are eventually timely (all other links can be asynchronous and/or lossy). <>P is not implementable in S, but W is. This gives a new proof that <>S is strictly weaker than <>P

Note on the Power of Consensus • Consensus cannot implement <>P, interactive consistency, atomic commit, … • So its “universality”, in the sense of • wait-free objects in shared memory [Herlihy 93] • state machine replication [Lamport 78; Schneider 90] does not cover sensitivity to failures, timing, etc.

Other Failure Detector Implementations Food for thought: When is building <>P more costly than <>S or W? Partial answer: Aguilera et al. (PODC’03) observe that • any implementation of <>P (even in a perfectly synchronous system) requires all alive processes to send messages forever, while W can be implemented such that eventually only the leader sends messages

Other Failure Detector Implementations • Message efficient <>S implementation [Larrea, Fernández, Arévalo 00] • QoS tradeoffs between accuracy and completeness [Chen, Toueg, Aguilera 00] • Leader Election [Aguilera, Delporte, Fauconnier, Toueg 01] • Adaptive <>P [Fetzer, Raynal, Tronel 01]

Part II: Algorithms, and the Failure Detector Abstraction II.a Failure Detectors and Partial Synchrony II.b Algorithms

Algorithms that Take 2 Rounds in Well-Behaved Runs • <>S-based [Schiper 97; Hurfin, Raynal 99; Mostefaoui, Raynal 99] • W-based for t < n/3 [Mostefaoui, Raynal 00] • W-based for t < n/2 [Dutta, Guerraoui 01] • Paxos (optimized version) [Lamport 89; 96] • Leader-based (W) • Also tolerates omissions, crash recoveries • COReL - Atomic Broadcast [Keidar, Dolev 96] • Group membership based (<>P)

Of This Laundry List, We Present Two Algorithms • <>S-based [MR99] • Paxos

<>S-based Consensus [MR99]
• val ← input v; est ← null
  for r = 1, 2, … do
      coord ← (r mod n) + 1
      if I am coord, then send (r, val) to all                  /* step 1 */
      wait for ( (r, val) from coord OR suspect coord (by <>S) )
      if receive val from coord then est ← val else est ← null
      send (r, est) to all                                      /* step 2 */
      wait for (r, est) from n-t processes
      if any non-null est received then val ← est
      if all ests have same v then send (“decide”, v) to all; return(v)
  od
• Upon receive (“decide”, v), forward to all, return(v)
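The round structure can be explored with a small lockstep simulation. This is a deliberately restricted sketch, not the full algorithm: it assumes no crashes, assumes all messages are delivered, and models a false suspicion as "every process suspects the coordinator before its message arrives". The coordinator is computed literally as (r mod n) + 1 from the pseudocode; the function name and the `suspected_rounds` parameter are hypothetical, and inputs are assumed not to be None.

```python
def run_mr99(inputs, suspected_rounds=frozenset(), max_rounds=10):
    """Simulate crash-free rounds; return (decided value, deciding round)."""
    n = len(inputs)
    val = list(inputs)  # val[i] belongs to process i + 1
    for r in range(1, max_rounds + 1):
        coord = (r % n) + 1
        # Step 1: coordinator broadcasts val; a suspected coordinator
        # leaves every estimate null (None).
        if r in suspected_rounds:
            est = [None] * n
        else:
            est = [val[coord - 1]] * n
        # Step 2: everyone exchanges estimates and adopts a non-null one.
        non_null = [e for e in est if e is not None]
        if non_null:
            val = [non_null[0]] * n
        # Decide when all collected estimates carry the same non-null value.
        if non_null and len(set(est)) == 1:
            return est[0], r
    return None, max_rounds

if __name__ == "__main__":
    print(run_mr99(["a", "b", "c"]))
    print(run_mr99(["a", "b", "c"], suspected_rounds={1}))
```

With no suspicions the decision is reached in the first round, that is, after the two communication steps counted in well-behaved runs; each falsely suspected coordinator costs one additional round.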

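In Well-Behaved Runs [Figure: message diagram for processes 1, 2, …, n. Step 1: the coordinator sends (1, v1) to all. Step 2: every process sets est = v1 and sends (1, v1) to all. All processes then decide v1.]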

In Case of Omissions The algorithm can block in case of transient message omissions, waiting for a specific round message that will not arrive

Paxos [Lamport 88; 96; 01] • Uses W failure detector • Phase 1: prepare • A process who trusts itself tries to become leader • Chooses largest unique (using ids) ballot number • Learns outcome of all smaller ballots • Phase 2: accept • Leader proposes a value with his ballot number. • Leader gets majority to accept his proposal. • A value accepted by a majority can be decided

Paxos - Variables • Type Rank • totally ordered set with minimum element r0 • Variables: • Rank BallotNum, initially r0 • Rank AcceptNum, initially r0 • Value ∪ {^} AcceptVal, initially ^
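One common way to obtain a totally ordered Rank type with unique values per process is a (counter, process id) pair compared lexicographically. The sketch below assumes process ids start at 1, so that (0, 0) can serve as the minimum element r0; `R0` and `next_rank` are illustrative names, not part of the lecture.

```python
R0 = (0, 0)  # assumed minimum rank; no process has id 0

def next_rank(ballot_num, my_id):
    """Return a rank larger than ballot_num that only my_id can produce."""
    counter, _ = ballot_num
    # Python compares tuples lexicographically, so a larger counter wins
    # and the process id breaks ties between concurrent leaders.
    return (counter + 1, my_id)

if __name__ == "__main__":
    r1 = next_rank(R0, my_id=3)
    r2 = next_rank(r1, my_id=2)
    print(r1, r2, r2 > r1 > R0)
```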

Paxos Phase I: Prepare
• Periodically, until decision is reached do:
      if leader (by W) then
          BallotNum ← (unique rank > BallotNum)
          send (“prepare”, rank) to all
• Upon receive (“prepare”, rank) from i
      if rank > BallotNum then
          BallotNum ← rank
          send (“ack”, rank, AcceptNum, AcceptVal) to i

Paxos Phase II: Accept
Upon receive (“ack”, BallotNum, b, val) from n-t
      if all vals = ^ then myVal = initial value
      else myVal = received val with highest b
      send (“accept”, BallotNum, myVal) to all /* proposal */
Upon receive (“accept”, b, v) with b ≥ BallotNum
      AcceptNum ← b; AcceptVal ← v /* accept proposal */
      send (“accept”, b, v) to all (first time only)

Paxos – Deciding
Upon receive (“accept”, b, v) from n-t
      decide v
      periodically send (“decide”, v) to all
Upon receive (“decide”, v)
      decide v
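The local handlers from the Paxos slides can be sketched as follows. This is a simplified illustration under stated assumptions: ranks are (counter, id) tuples with minimum (0, 0), None stands for the null value ^, and networking, retransmission and the counting of n-t messages are left out. The names `Acceptor` and `choose_value` are hypothetical.

```python
R0 = (0, 0)   # assumed minimum rank
NULL = None   # stands for the null value "^"

class Acceptor:
    """Per-process state: BallotNum, AcceptNum, AcceptVal."""

    def __init__(self):
        self.ballot_num = R0
        self.accept_num = R0
        self.accept_val = NULL

    def on_prepare(self, rank):
        """Phase 1: acknowledge only a rank larger than any seen so far."""
        if rank > self.ballot_num:
            self.ballot_num = rank
            return ("ack", rank, self.accept_num, self.accept_val)
        return None  # stale prepare is ignored

    def on_accept(self, b, v):
        """Phase 2: accept unless a larger rank was acknowledged."""
        if b >= self.ballot_num:
            self.accept_num, self.accept_val = b, v
            return ("accept", b, v)
        return None

def choose_value(acks, initial_value):
    """Leader rule: take the value accepted with the highest rank, if any."""
    accepted = [(b, val) for (_, _, b, val) in acks if val is not NULL]
    if not accepted:
        return initial_value
    return max(accepted, key=lambda pair: pair[0])[1]

if __name__ == "__main__":
    accs = [Acceptor() for _ in range(3)]
    acks = [a.on_prepare((1, 1)) for a in accs]
    value = choose_value(acks, "x")
    print([a.on_accept((1, 1), value) for a in accs])
```

The point of `choose_value` is that a later leader with a higher rank re-proposes any value that may already have been accepted by a majority, instead of its own input.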

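In Well-Behaved Runs [Figure: message diagram for processes 1, 2, …, n. Process 1 sends (“prepare”, 1) to all; processes reply (“ack”, 1, r0, ^); process 1 sends (“accept”, 1, v1) to all; processes send (“accept”, 1, v1) to all; all decide v1.] Our W implementation always trusts process 1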

Optimization • Allow process 1 (only!) to skip Phase 1 • use rank r0 • propose its own initial value • Takes 2 rounds in well-behaved runs • Takes 2 rounds for repeated invocations with the same leader

What About Message Loss? • Does not block in case of a lost message • Phase I can start with new rank even if previous attempts never ended • But constant omissions can violate liveness • Specify conditional liveness: If n-t correct processes including the leader can communicate with each other then they eventually decide

Synchronous Consensus • In a run with f failures (f<t) • Processes can decide in f+1 rounds • And no less! [Lamport, Fischer 82; Dolev, Reischuk, Strong 90] (early-deciding) • 1 round with no failures • In this talk: deciding (halting takes min(f+2, t+1) rounds [Dolev, Reischuk, Strong 90])
