190 likes | 646 Views
Paxos. Lamport the archeologist and the “Part-time Parliament” of Paxos : The Part-time Parliament, TOCS 1998 Paxos Made Simple, ACM SIGACT News 2001. Paxos Made Live, PODC 2007 Paxos Made Moderately Complex, (Cornell) 2011. …….
E N D
Paxos • Lamport the archeologist and the “Part-time Parliament” of Paxos: • The Part-time Parliament, TOCS 1998 • Paxos Made Simple, ACM SIGACT News 2001. • Paxos Made Live, PODC 2007 • Paxos Made Moderately Complex, (Cornell) 2011. • …….. CS 271
The Paxos Atomic Broadcast AlgorithmThanks to Idit Keidar for slides • Asynchronous system with crash failures. • Leader based: each process has an estimate of who is the current leader • To order an operation, a process sends it to current leader • The leader sequences the operation and launches a Consensus algorithm to fix the agreement CS 271
The Consensus Algorithm Structure • Two phases • Leader contacts a majority in each phase • There may be multiple concurrent leaders • Ballots distinguish among values proposed by different leaders • Unique, locally monotonically increasing • Processes respond only to leader with highest ballot seen so far CS 271
Ballot Numbers • Pairs num, process id • n1, p1 > n2, p2 • If n1 > n2 • Or n1=n2 and p1 > p2 • Leader p chooses a unique, locally monotonically increasing ballot number • If latest known ballot is n, q then • p chooses n+1, p CS 271
The Two Phases of Paxos • Phase 1: prepare • If you believe you are the leader • Choose new unique ballot number • Learn outcome of all smaller ballots from majority • Phase 2: accept • Leaderproposes a value with its ballot number • Leader gets majority to acceptits proposal • A value accepted by a majority can be decided CS 271
Paxos - Variables BallotNumi, initially 0,0 Latest ballot pi took part in (phase 1) AcceptNumi, initially 0,0 Latest ballot piaccepted a value in (phase 2) AcceptVali, initially ^ Latest accepted value (phase 2) CS 271
Phase I: Prepare - Leader • Periodically, until decision is reached do: ifleaderthen BallotNum BallotNum.num+1, myId send(“prepare”, BallotNum) to all • Goal: contact other processes, ask them to join this ballot, and get information about possible past decisions CS 271
Phase I: Prepare - Cohort • Upon receive (“prepare”, bal) from i ifbal BallotNum then BallotNum bal send (“ack”, bal, AcceptNum, AcceptVal) to i This is a higher ballot than my current, I better join it This is a promise not to accept ballots smaller than bal in the future Tell the leader about my latest accepted value and what ballot it was accepted in CS 271
Phase II: Accept - Leader Upon receive (“ack”, BallotNum, b, val) from majority if all vals = ^ then myVal = initial value else myVal = received val with highest b send (“accept”, BallotNum, myVal) to all /* proposal */ The value accepted in the highest ballot might have been decided, I better propose this value CS 271
Phase II: Accept - Cohort Upon receive (“accept”, b, v) ifb BallotNum then AcceptNum b; AcceptVal v /* accept proposal */ send (“accept”, b, v) to all(first time only) This is not from an old ballot CS 271
Paxos – Deciding Upon receive(“accept”, b, v) from n-t decide v periodically send (“decide”, v) to all Upon receive (“decide”, v) decide v CS 271
In Failure-Free Execution (“prepare”, 1,1) (“accept”, 1,1 ,v1) 1 1 1 1 1 2 2 2 . . . . . . . . . (“ack”, 1,1, 0,0,^) n n n (“accept”, 1,1,v1) decide v1 CS 271
Performance? Why is this phase needed? (“prepare”, 1,1) (“accept”, 1,1 ,v1) 1 1 1 1 1 2 2 2 . . . . . . . . . (“ack”, 1,1, 0,0,^) n n n (“accept”, 1,1,v1) CS 271
Failure-Free Execution C C request response S1 S1 S1 S1 S1 S2 S2 S2 . . . . . . (“prepare”) . . . (“ack”) (“accept”) Sn Sn Sn Phase 1 Phase 2 CS 271
Observation • In Phase 1, no consensus values are sent: • Leader chooses largest unique ballot number • Gets a majority to “vote” for this ballot number • Learns the outcome of all smaller ballots • In Phase 2, leader proposes its own initial value or latest value it learned in Phase 1 CS 271
Failure free execution C C request response S1 S1 S1 S1 S1 S1 S2 S2 S2 (“prepare”) . . . (“ack”) . . . . . . (“accept”) Sn Sn Sn Phase 1 Phase 2 CS 271
Optimization • Run Phase 1 only when the leader changes • Phase 1 is called “view change” or “recovery mode” • Phase 2 is the “normal mode” • Each message includes BallotNum (from the last Phase 1) and ReqNum • Respond only to messages with the “right” BallotNum CS 271
Paxos Atomic Broadcast: Normal Mode Upon receive(“request”, v) from client if (I am not the leader) then forward to leader else /* propose v as request number n */ ReqNum ReqNum +1; send (“accept”, BallotNum , ReqNum, v) to all Upon receive(“accept”, b, n, v) with b = BallotNum /* accept proposal for request number n */ AcceptNum[n] b; AcceptVal[n] v send (“accept”, b, n, v) to all(first time only) CS 271
Recovery Mode • The new leader must learn the outcome of all the pending requests that have smaller BallotNums • The “ack” messages include AcceptNums and AcceptVals of all pending requests • For all pending requests, the leader sends “accept” messages • What if there are holes? • e.g., leader learns of request number 13 and not of 12 • fill in the gaps with dummy “do nothing” requests CS 271