
Principles of Reliable Distributed Systems Lecture 8: Paxos



Presentation Transcript


  1. Principles of Reliable Distributed SystemsLecture 8: Paxos Spring 2008 Prof. Idit Keidar

  2. Material • Paxos Made SimpleLeslie LamportACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001) 18-25.

  3. Issues in the Real World I/III • Problem: Sometimes messages take longer than expected • Solution 1: Use longer timeouts • Slow • Solution 2: Assume asynchrony • Impossible - FLP • Solution 3: Assume eventual synchrony or unreliable failure detectors • See last week – MR Algorithm

  4. Reminder: MR in “Normal” (Failure-Free, Suspicion-Free) Runs [Diagram: the round-1 coordinator sends (1, v1) to processes 1..n; all set est = v1 and echo (1, v1); after (decide, v1), all decide v1.]

  5. On MR’s Performance • The algorithm can take unbounded time • What if no failures occur? • Is this inevitable? • Can we say more than “decision is reached eventually” ?

  6. Performance Metric Number of communication steps in well-behavedruns • Well-behaved: • No failures • Stable (synchronous) from the beginning • With failure detector: no false suspicions • Motivation: common case

  7. MR’s Running Time in Well-Behaved Runs • In round 1: • Coord is correct, not suspected by any process • All processes decide at the end of phase two • Decision in two communication steps • Halting (stopping) takes three steps • How much in synchronous model? • 2 Rounds for decision in Uniform Consensus • No performance penalty for indulgence!

  8. Back to Last Week’s Example • Example network: • 99% of packets arrive within 10 µsec • Upper bound of 1000 µsec on message latency • Now we can choose a timeout of 10 µsec, without violating safety! • Most of the time, the algorithm will be just as fast as a synchronous uniform consensus algorithm • We did pay a price in resilience, though

  9. Issues in the Real World II/III • Problem: Sometimes messages are lost • Solution 1: Use retransmissions • In case of transient partitions, a huge backlog can build up – catching up may take forever • More congestion, long message delays for extensive periods • Solution 2: Allow message loss • Impossible - 2 Generals • Solution 3: Assume eventually reliable links • That’s what we’ll do today

  10. Issues in the Real World III/III • Problem: Processes may crash and later recover (aka crash-recovery model) • Solution 1: Store information on stable storage (disk) and retrieve it upon recovery • What happens to messages arriving when they’re down? • See previous slide

  11. MR and Unreliable Links • From MR Algorithm Phase II: wait for (r, est) from n-t processes • Transient message loss violates liveness • What if we move to the next round in case we can’t get n-t responses for too long? • Notice the next line in MR: if any non-⊥ value e received then val ← e

  12. What If MR Didn’t Wait … [Diagram: process 1 waits, receives (1, v1) from a majority, and decides v1; a process that does not wait moves to round 2 with est = ⊥, proposes (2, v2), and will decide v2 – no waiting means val is never changed to v1, violating agreement.]

  13. What Do We Want? • Do not get stuck in a round (like MR does) • Move on upon timeout • Move on upon hearing that others moved on • But, before proposing a decision value, a new leader must learn any possibly decided value (must check with a majority)

  14. Paxos: Main Principles • Use “leader election” module • If you think you’re leader, you can start a new “ballot” • Paxos name for a round • Always join the newest ballot you hear about • Leave old ballots in the middle if you need to • Two phases: • First learn outcomes of previous ballots from a majority • Then propose a new value, and get a majority to endorse it

  15. Leader Election Failure Detector • Ω – Leader • Outputs one trusted process • From some point, all correct processes trust the same correct process • Can easily implement ◊S • Is the weakest failure detector for consensus [Chandra, Hadzilacos, Toueg 96]

  16. Ω Implementations • Easiest: use a ◊P implementation • In the eventual synchrony model • Output the lowest-id non-suspected process • Ω is implementable also in some situations where ◊P isn’t • Optimizations possible • Choose “best connected”, strongest, etc.
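The simple Ω rule above can be sketched in a few lines of Python (an illustrative sketch, not from the lecture): given the suspicion set maintained by a ◊P implementation, trust the lowest-id process that is not currently suspected.

```python
def omega_leader(all_ids, suspected):
    """Trust the lowest-id non-suspected process (None if all are suspected)."""
    candidates = sorted(pid for pid in all_ids if pid not in suspected)
    return candidates[0] if candidates else None
```

Once ◊P stops falsely suspecting correct processes, every correct process computes the same correct leader, which is exactly the Ω property.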

  17. Paxos: The Practicality • Overcomes message loss without retransmitting entire message history • Tolerates crash and recovery • Does not rotate through dead coordinators • Used in replicated file systems • Frangipani – DEC, late ’90s • Nowadays Microsoft

  18. The Part-Time Parliament[Lamport 88,98,01] Recent archaeological discoveries on the island of Paxos reveal that the parliament functioned despite the peripatetic propensity of its part-time legislators. The legislators maintained consistent copies of the parliamentary record, despite their frequent forays from the chamber and the forgetfulness of their messengers. The Paxon parliament’s protocol provides a new way of implementing the state-machine approach to the design of distributed systems.

  19. Annotation of TOCS 98 Paper • This submission was recently discovered behind a filing cabinet in the TOCS editorial office. • …the author is currently doing field work in the Greek isles and cannot be reached … • The author appears to be an archeologist with only a passing interest in computer science. • This is unfortunate; even though the obscure ancient Paxon civilization he describes is of little interest to most computer scientists, its legislative system is an excellent model for how to implement a distributed computer system in an asynchronous environment.

  20. The Setting • The data (ledger) is replicated at n processes (legislators) • Operations (decrees) should be invoked (recorded) at each replica (ledger) in the same order • Processes (legislators) can fail (leave the parliament) • At least a majority of processes (legislators) must be up (present in the parliament) in order to make progress (pass decrees) • Why majority?

  21. Eventually Reliable Links • There is a time after which every message sent by a correct process to a correct process eventually arrives • Old messages are not retransmitted • Usual failure-detector-based algorithms (like MR) do not work • Homework question

  22. The Paxos Atomic Broadcast Algorithm • Leader based: each process has an estimate of who is the current leader • To order an operation, a process sends it to its current leader • The leader sequences the operation and launches a Consensus algorithm (Synod) to fix the agreement

  23. The (Synod) Consensus Algorithm • Solves non-terminating consensus in an asynchronous system • Or consensus in a partial synchrony system • Or consensus using an Ω failure detector • Overcomes transient crashes & recoveries and message loss • Can be modeled as just message loss

  24. The Consensus Algorithm Structure • Two phases • Leader contacts a majority in each phase • There may be multiple concurrent leaders • Ballots distinguish among values proposed by different leaders • Unique, locally monotonically increasing • Correspond to rounds of ◊S-based algorithms [MR] • Processes respond only to leader with highest ballot seen so far

  25. Ballot Numbers • Pairs ⟨num, process id⟩ • ⟨n1, p1⟩ > ⟨n2, p2⟩ iff • n1 > n2 • or n1 = n2 and p1 > p2 • Leader p chooses a unique, locally monotonically increasing ballot number • If the latest known ballot is ⟨n, q⟩, then p chooses ⟨n+1, p⟩
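The lexicographic order on ballot pairs described above maps directly onto tuple comparison; this small Python sketch (function name is illustrative) shows the order and the rule for choosing the next ballot.

```python
def next_ballot(latest, my_id):
    """If the latest known ballot is (n, q), process my_id picks (n+1, my_id)."""
    n, _ = latest
    return (n + 1, my_id)

# Python tuples compare lexicographically, matching the slide's order:
assert (3, 1) > (2, 9)   # higher num wins
assert (2, 5) > (2, 3)   # equal num: tie broken by process id
```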

  26. The Two Phases of Paxos • Phase 1: prepare • If trust yourself by Ω (believe you are the leader) • Choose a new unique ballot number • Learn outcome of all smaller ballots from a majority • Phase 2: accept • Leader proposes a value with its ballot number • Leader gets a majority to accept its proposal • A value accepted by a majority can be decided

  27. Paxos - Variables • BallotNumi, initially ⟨0,0⟩: latest ballot pi took part in (phase 1) • AcceptNumi, initially ⟨0,0⟩: latest ballot pi accepted a value in (phase 2) • AcceptVali, initially ⊥: latest accepted value (phase 2)

  28. Phase I: Prepare - Leader • Periodically, until a decision is reached do: if leader (by Ω) then BallotNum ← ⟨BallotNum.num + 1, myId⟩; send (“prepare”, BallotNum) to all • Goal: contact other processes, ask them to join this ballot, and get information about possible past decisions

  29. Phase I: Prepare - Cohort • Upon receive (“prepare”, bal) from i: if bal ≥ BallotNum then BallotNum ← bal; send (“ack”, bal, AcceptNum, AcceptVal) to i • “This is a higher ballot than my current, I better join it” • The “ack” is a promise not to accept ballots smaller than bal in the future, and tells the leader the latest accepted value and the ballot it was accepted in

  30. Phase II: Accept - Leader • Upon receive (“ack”, BallotNum, b, val) from n-t: if all vals = ⊥ then myVal ← initial value, else myVal ← received val with highest b; send (“accept”, BallotNum, myVal) to all /* proposal */ • “The value accepted in the highest ballot might have been decided, I better propose this value”
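The leader’s value-selection rule above can be sketched as follows (an illustrative sketch, with ⊥ modeled as None): adopt the value accepted in the highest ballot among the acks, or fall back to the leader’s own initial value when no ack carries a value.

```python
def choose_value(acks, initial_value):
    """acks: (accept_num, accept_val) pairs from n-t processes; None = bottom."""
    accepted = [(b, v) for (b, v) in acks if v is not None]
    if not accepted:
        return initial_value                        # all vals are bottom
    return max(accepted, key=lambda bv: bv[0])[1]   # val with highest ballot
```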

  31. Phase II: Accept - Cohort • Upon receive (“accept”, b, v): if b ≥ BallotNum then AcceptNum ← b; AcceptVal ← v /* accept proposal */; send (“accept”, b, v) to all (first time only) • “This is not from an old ballot”
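The two cohort handlers (slides 29 and 31) fit naturally into one small state machine. This Python sketch (class and method names are illustrative, not from the lecture) keeps the three variables of slide 27 and returns the message to send, or None when the ballot is stale.

```python
class Cohort:
    def __init__(self):
        self.ballot_num = (0, 0)   # latest ballot joined (phase 1)
        self.accept_num = (0, 0)   # latest ballot a value was accepted in
        self.accept_val = None     # latest accepted value (None = bottom)

    def on_prepare(self, bal):
        """Join bal if it is not old; the ack doubles as a promise."""
        if bal >= self.ballot_num:
            self.ballot_num = bal
            return ("ack", bal, self.accept_num, self.accept_val)
        return None                # ignore a stale prepare

    def on_accept(self, b, v):
        """Accept the proposal unless it is from an old ballot."""
        if b >= self.ballot_num:
            self.accept_num, self.accept_val = b, v
            return ("accept", b, v)  # echoed to all (first time only)
        return None
```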

  32. Paxos – Deciding • Upon receive (“accept”, b, v) from n-t: decide v; periodically send (“decide”, v) to all • Upon receive (“decide”, v): decide v • Why don’t we ever “return”?

  33. In Failure-Free Synchronous Runs [Diagram: a simple Ω implementation always trusts process 1; process 1 sends (“prepare”, ⟨1,1⟩) to all, each process replies (“ack”, ⟨1,1⟩, ⟨0,0⟩, ⊥), the leader sends (“accept”, ⟨1,1⟩, v1) to all, the accepts are echoed, and all decide v1.]

  34. Correctness: Agreement • Follows from Lemma 1: If a proposal (“accept”, b, v) is sent by a majority, then for every sent proposal (“accept”, b’, v’) with b’ > b, it holds that v’ = v.

  35. Proving Agreement Using Lemma 1 • Let v be a decided value. The first process that decides v receives n-t accept messages for v with some ballot b, i.e., (“accept”, b, v) is sent by a majority. • No other value is sent with an “accept” message with the same b. Why? • Let (“accept”, b1, v1) be the proposal with the lowest ballot number (b1) sent by n-t • By Lemma 1, v1 is the only possible decision value

  36. To Prove Lemma 1 • Use Lemma 2 (invariant): If a proposal (“accept”, b, v) is sent, then there is a set S consisting of a majority such that either • no p ∈ S accepts a proposal ranked less than b (all vals = ⊥), or • v is the value of the highest-ranked proposal among proposals ranked less than b accepted by processes in S (myVal = received val with highest b).

  37. What Makes Lemma 2 Hold • A process accepts a proposal numbered b only if it has not responded to a prepare request having a number greater than b • The “ack” response to “prepare” is a promise not to accept lower-ballot proposals in the future • The leader uses “ack” messages from a majority in choosing the proposed value

  38. Termination • Assume no loss for a moment • Once there is one correct leader – • It eventually chooses the highest ballot number • No other process becomes a leader with a higher ballot • All correct processes “ack” its prepare message and “accept” its accept message and decide

  39. What About Message Loss? • Does not block in case of a lost message • Phase 1 can start with new rank even if previous attempts never ended • Conditional liveness: If n-t correct processes including the leader can communicate with each other then they eventually decide • Holds with eventually reliable links

  40. Performance? [Diagram: the same prepare/ack/accept flow as slide 33.] • Why is Phase 1 needed? • 4 communication steps in well-behaved runs, compared to 2 for MR

  41. Optimization • Allow process 1 (only!) to skip Phase 1 • Initialize BallotNum to ⟨1,1⟩ • Propose its own initial value • 2 steps in failure-free synchronous runs • Like MR • 2 steps for repeated invocations with the same leader • Common case

  42. Atomic Broadcast by Running A Sequence of Consensus Instances

  43. The Setting • Data is replicated at n servers • Operations are initiated by clients • Operations need to be performed at all correct servers in the same order • State-machine replication

  44. Client-Server Interaction • Leader-based: each process (client/server) has an estimate of who is the current leader • A client sends a request to its current leader • The leader launches the Paxos consensus algorithm to agree upon the order of the request • The leader sends the response to the client

  45. Failure-Free Message Flow [Diagram: client C sends a request to server S1; S1 runs Phase 1 (“prepare” to S1…Sn, “ack” replies) and then Phase 2 (“accept”), after which the response is returned to C.]

  46. Observation • In Phase 1, no consensus values are sent: • Leader chooses largest unique ballot number • Gets a majority to “vote” for this ballot number • Learns the outcome of all smaller ballots from this majority • In Phase 2, leader proposes either its own initial value or latest value it learned in Phase 1

  47. Message Flow: Take 2 [Diagram: the Phase 1 (“prepare”/“ack”) exchange is moved off the request path; the client’s request triggers only Phase 2 (“accept”) before the response.]

  48. Optimization • Run Phase 1 only when the leader changes • Phase 1 is called “view change” or “recovery mode” • Phase 2 is the “normal mode” • Each message includes BallotNum (from the last Phase 1) and ReqNum • e.g., ReqNum = 7 when we’re trying to agree what the 7th operation to invoke on the state machine should be • Respond only to messages with the “right” BallotNum

  49. Paxos Atomic Broadcast: Normal Mode • Upon receive (“request”, v) from client: if I am not the leader then forward to leader, else /* propose v as request number n */ ReqNum ← ReqNum + 1; send (“accept”, BallotNum, ReqNum, v) to all • Upon receive (“accept”, b, n, v) with b = BallotNum: /* accept proposal for request number n */ AcceptNum[n] ← b; AcceptVal[n] ← v; send (“accept”, b, n, v) to all (first time only)
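The leader’s normal-mode step above amounts to stamping each client request with the next sequence number under the current ballot. A minimal sketch (message sending is abstracted as a return value; names are illustrative):

```python
class NormalModeLeader:
    def __init__(self, ballot_num):
        self.ballot_num = ballot_num  # fixed since the last Phase 1
        self.req_num = 0

    def on_request(self, v):
        """Propose v as the next request number under the current ballot."""
        self.req_num += 1
        return ("accept", self.ballot_num, self.req_num, v)
```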

  50. Recovery Mode • The new leader must learn the outcome of all the pending requests that have smaller BallotNums • The “ack” messages include AcceptNums and AcceptVals of all pending requests • For all pending requests, the leader sends “accept” messages • What if there are holes? • e.g., leader learns of request number 13 and not of 12 • fill in the gaps with dummy “do nothing” requests
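The gap-filling step above can be sketched as follows (an illustrative sketch: request numbers are assumed to start at 1, and NO_OP stands for the dummy “do nothing” request).

```python
NO_OP = "do nothing"

def fill_holes(known_reqs):
    """known_reqs: {req_num: value}. Fill every gap up to the highest
    known request number with a dummy no-op request."""
    if not known_reqs:
        return {}
    filled = dict(known_reqs)
    for n in range(1, max(known_reqs) + 1):
        filled.setdefault(n, NO_OP)
    return filled
```

For example, a new leader that learned of request 13 but not 12 would propose the no-op for slot 12, so the replicated state machine can advance past the hole.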
