Consensus Algorithms Willem Visser RW334
Why do we need consensus? • Distributed Databases • Need to know whether the others committed or aborted a transaction to avoid inconsistency • Also, to agree on the order of transaction log entries to ensure eventual consistency • The logged actions may still have to be performed, but at least you know everyone agrees on what they should be • Leader Elections • Many, many more applications
So what is the problem? • FAILURES! • Network failures • Node Failures • Two types of Node failures • Fail-Stop • Once a node fails it stops • Fail-Recover • A failed node can recover at some later stage
And it gets worse! • ‘Impossibility of Distributed Consensus with One Faulty Process’ • Fischer, Lynch and Paterson, 1985 • Consensus is possible in a synchronous system • Wait one time step, and any node that has not responded is dead • Asynchronous system • Impossible to tell whether a node is dead or just taking a long time to respond
The 2 Phase Commit Protocol • Phase 1 • The coordinator sends a Request-to-Prepare message to each participant • Coordinator waits for all participants to vote • Each participant • votes Prepared if it’s ready to commit • may vote No for any reason • may delay voting indefinitely • Phase 2 • If Coordinator receives Prepared from all participants • Sends Commit to all • Otherwise sends Abort to all • Participants reply Done
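The two phases above can be sketched in a few lines of Python. This is a minimal illustration, not a full implementation: the names `Vote` and `two_phase_commit` are hypothetical, each participant is modelled as a function that returns its vote, and a vote of `None` stands in for a participant that never replies (assumed here to be treated as a timeout).

```python
from enum import Enum


class Vote(Enum):
    PREPARED = "Prepared"
    NO = "No"


def two_phase_commit(participants):
    """Run one 2PC round.

    participants: dict mapping a name to a function that returns a Vote,
    or None if the participant never answers (assumed timeout).
    Returns the coordinator's decision and the collected votes.
    """
    # Phase 1: send Request-to-Prepare to everyone and collect the votes.
    votes = {name: vote_fn() for name, vote_fn in participants.items()}

    # Phase 2: Commit only if every single participant voted Prepared.
    if all(vote is Vote.PREPARED for vote in votes.values()):
        decision = "Commit"
    else:
        decision = "Abort"
    return decision, votes


if __name__ == "__main__":
    decision, _ = two_phase_commit({
        "db1": lambda: Vote.PREPARED,
        "db2": lambda: Vote.NO,
    })
    print(decision)
```

A single No vote, or a single missing vote, is enough to abort the whole transaction.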
Commit • Coordinator sends Request-to-Prepare • Participant replies Prepared • Coordinator sends Commit • Participant replies Done
Abort • Coordinator sends Request-to-Prepare • Participant replies No • Coordinator sends Abort • Participant replies Done
Performance • In the absence of failures, 2PC requires 3 rounds of messages before every participant knows the decision • Request-to-prepare • Votes • Decision • Done messages are just for bookkeeping • they don’t affect response time • they can be batched
Uncertainty • Before it votes, a participant can abort unilaterally • After a participant votes Prepared and before it receives the coordinator’s decision, it is uncertain. It can’t unilaterally commit or abort during its uncertainty period. • The uncertainty period runs from the moment the participant sends Prepared until it receives the coordinator’s Commit (or Abort) • The coordinator is never uncertain • If a participant fails or is disconnected from the coordinator while it’s uncertain, at recovery it must find out the decision
The problems • Blocking: if something goes wrong, you must wait for the failed node to recover before continuing • Failure Handling: what to do if a Coordinator or Participant times out waiting for a message • A participant times out waiting for the coordinator’s Request-to-Prepare • It decides to abort • The coordinator times out waiting for a participant’s vote • It decides to abort • A participant that voted Prepared times out waiting for the coordinator’s decision • It’s blocked • Use a termination protocol to decide what to do • Naïve termination protocol: wait until the coordinator recovers • The coordinator times out waiting for Done • It must resolicit the Done messages, so that it can forget the decision
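The four timeout cases above amount to a small lookup table. The sketch below is only an illustration of that table; the function name `on_timeout` and the string labels are hypothetical.

```python
# (who timed out, what it was waiting for) -> what it does next
TIMEOUT_ACTIONS = {
    ("participant", "Request-to-Prepare"): "abort",
    ("coordinator", "vote"): "abort",
    ("participant", "decision"): "blocked: run termination protocol",
    ("coordinator", "Done"): "resolicit Done",
}


def on_timeout(role, waiting_for):
    """Return the action for a node that timed out waiting for a message."""
    try:
        return TIMEOUT_ACTIONS[(role, waiting_for)]
    except KeyError:
        raise ValueError(f"no timeout rule for {role} waiting for {waiting_for}")


if __name__ == "__main__":
    for (role, msg) in TIMEOUT_ACTIONS:
        print(role, "waiting for", msg, "->", on_timeout(role, msg))
```

Only one of the four cases blocks: the participant that has already voted Prepared.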
Logging 2PC State Changes • Logging may be eager • meaning it’s flushed to disk before the next Send Message • Or it may be lazy, meaning it need not be flushed before the next message is sent • Coordinator: Log Start2PC (eager), then send Request-to-Prepare • Participant: Log prepared (eager), then send Prepared • Coordinator: Log commit (eager), then send Commit • Participant: Log commit (eager), then send Done • Coordinator: Log commit (lazy)
Coordinator Recovery • If the coordinator fails and later recovers, it must know the decision. It must therefore log • the fact that it began T’s 2PC protocol, including the list of participants, and • Commit or Abort, before sending Commit or Abort to any participant (so it knows whether to commit or abort after it recovers). • If the coordinator fails and recovers, it resends the decision to participants from whom it doesn’t remember getting Done • If the participant forgot the transaction, it replies Done • The coordinator should therefore log Done after it has received them all.
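A rough sketch of what the coordinator does with its log after a restart. The record layout and the name `coordinator_recover` are hypothetical. Two assumptions are labelled in the code: Done is logged only once all participants have replied, so after a crash the decision is resent to every participant; and a transaction with a start record but no logged decision is aborted, since no decision can have been sent yet.

```python
def coordinator_recover(log):
    """Decide what to resend after a coordinator restart.

    log: list of records in write order, each one of
      ("start", txn, [participants])
      ("decision", txn, "Commit" or "Abort")
      ("done", txn)            # written after ALL Done replies arrived
    Returns {txn: (decision, participants_to_contact)}.
    """
    started, decided, finished = {}, {}, set()
    for record in log:
        kind, txn = record[0], record[1]
        if kind == "start":
            started[txn] = record[2]
        elif kind == "decision":
            decided[txn] = record[2]
        elif kind == "done":
            finished.add(txn)

    resend = {}
    for txn, participants in started.items():
        if txn in finished:
            continue  # everyone acknowledged; the decision can be forgotten
        # Assumption: no logged decision means none was sent, so abort is safe.
        decision = decided.get(txn, "Abort")
        resend[txn] = (decision, participants)
    return resend


if __name__ == "__main__":
    log = [
        ("start", "T1", ["p1", "p2"]),
        ("decision", "T1", "Commit"),
        ("start", "T2", ["p1"]),
    ]
    print(coordinator_recover(log))
```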
Participant Recovery • If a participant P fails and later recovers, it first performs centralized recovery (Restart) • For each distributed transaction T that was active at the time of failure • If P is not uncertain about T, then it unilaterally aborts T • If P is uncertain, it runs the termination protocol (which may leave P blocked) • To ensure it can tell whether it’s uncertain, P must log its vote before sending it to the coordinator • To avoid becoming totally blocked due to one blocked transaction, P should reacquire T’s locks during Restart and allow Restart to finish before T is resolved.
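The participant's restart rule can be sketched the same way. The log format and the name `participant_recover` are hypothetical; the only thing taken from the slide is the rule itself: a logged Prepared vote means the participant is uncertain, anything else can be aborted unilaterally.

```python
def participant_recover(log, active_txns):
    """Decide what a restarted participant does with each active transaction.

    log: list of (kind, txn) records; kind "prepared" means the vote was
         logged before it was sent to the coordinator.
    active_txns: transactions that were unresolved at the time of failure.
    """
    voted_prepared = {txn for kind, txn in log if kind == "prepared"}
    actions = {}
    for txn in active_txns:
        if txn in voted_prepared:
            # Uncertain: keep the locks and ask around for the decision.
            actions[txn] = "reacquire locks, run termination protocol"
        else:
            # Never voted Prepared, so aborting unilaterally is safe.
            actions[txn] = "abort"
    return actions


if __name__ == "__main__":
    print(participant_recover([("prepared", "T1")], ["T1", "T2"]))
```

Because the vote is logged before it is sent, a missing log record proves the coordinator never received Prepared from this participant.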
Heuristic Commit • Suppose a participant recovers, but the termination protocol leaves T blocked. • Operator can guess whether to commit or abort • Must detect wrong guesses when coordinator recovers • Must run compensations for wrong guesses • Heuristic commit • If T is blocked, the local resource manager (actually, transaction manager) guesses • At coordinator recovery, the transaction managers jointly detect wrong guesses.
The Main Issue with 2PC • Once the Coordinator sends Commit, each Participant commits without considering the other participants • If the Coordinator and all participants that finished committing go down, the rest do not know the state of the system • Everyone that knew is now dead • Cannot just abort, since the commit might have completed at some participants and cannot be rolled back • Also cannot commit, since the original decision might have been to abort
3 Phase Commit (3PC) • Phase 1 as in 2PC • Coordinator sends Request-to-Prepare • Participants reply Prepared or No • Phase 2 is now split into two • First, once it has received Prepared from all participants, the coordinator sends Ready-to-Commit • Then it sends the Commit message • The reason for the extra step is to let all the Participants know what the decision is, so that in case of failure everyone knows it and the state can be recovered
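A minimal sketch of the coordinator's side of 3PC in the failure-free case, to show where the extra step sits. The function name `three_phase_commit` is hypothetical, phase 1 uses the 2PC message name Request-to-Prepare, and timeouts and the termination protocol are deliberately left out.

```python
def three_phase_commit(votes):
    """Return (decision, messages the coordinator broadcasts, in order).

    votes: dict mapping participant name to "Prepared" or "No".
    """
    sent = ["Request-to-Prepare"]          # phase 1, same as 2PC

    if not all(vote == "Prepared" for vote in votes.values()):
        sent.append("Abort")
        return "Abort", sent

    # Extra step: tell everyone that the decision WILL be commit,
    # before anyone is allowed to actually commit.
    sent.append("Ready-to-Commit")
    sent.append("Commit")
    return "Commit", sent


if __name__ == "__main__":
    print(three_phase_commit({"p1": "Prepared", "p2": "Prepared"}))
    print(three_phase_commit({"p1": "Prepared", "p2": "No"}))
```

No participant commits before every participant has at least been sent Ready-to-Commit, which is what lets the survivors reconstruct the decision.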
3PC Failure Handling • If coordinator times out before receiving Prepared from all participants, it decides to abort. • Coordinator ignores participants that don’t ack its Ready-to-Commit. • Participants that voted Prepared and timed out waiting for Ready-to-Commit or Commit use the termination protocol. • The termination protocol is where the complexity lies. (E.g. see [Bernstein, Hadzilacos, Goodman 87], Section 7.4)
3PC can still fail • Network partition failure • All the participants that received Ready-to-Commit are on one side • All the rest are on the other side • Recovery will take place on both sides • One side will commit • Other side will abort • When the network merges back, you have an inconsistent state
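The partition scenario can be replayed with a toy model. The termination rule used here is a simplifying assumption, not the full protocol from the literature: a group of participants that can still talk to each other commits if any of them has seen Ready-to-Commit and aborts otherwise. The names `terminate` and `partition_outcomes` are hypothetical.

```python
def terminate(states):
    """Assumed, simplified termination rule for one connected group.

    states: list of per-participant states, "ready-to-commit" or "prepared".
    """
    if any(state == "ready-to-commit" for state in states):
        return "Commit"
    return "Abort"


def partition_outcomes(side_a, side_b):
    """Each side of the partition runs recovery on its own."""
    return terminate(side_a), terminate(side_b)


if __name__ == "__main__":
    # Everyone who received Ready-to-Commit ended up on side A.
    side_a = ["ready-to-commit", "ready-to-commit"]
    side_b = ["prepared"]
    print(partition_outcomes(side_a, side_b))
```

The two sides reach different decisions because neither can see the other's state.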
2PC versus 3PC • FLP implies that in an asynchronous system with even one faulty process you cannot guarantee both safety and liveness! • Liveness • 2PC can block • 3PC will always make progress • Safety • 2PC is safe • 3PC is safe-ish • as seen in the network partitioning case, one can get the wrong result
Paxos! • Safety always, and Liveness too, but the latter only in good conditions, i.e. when processes behave synchronously • Leslie Lamport • Written in 1990, but only published in 1998 • “Paxos Made Simple” paper in 2001 describes it so that mere mortals have a chance to understand it too
Paxos Core Idea • Majority Voting • Nugget • Any two majority groups must share at least 1 member • Thus if a failure leaves one majority group incapable of a decision, the shared member will convey the information to the next majority • Proposals are also ordered, so that one knows which proposal should be considered
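The nugget can be checked by brute force for a small cluster. The sketch below enumerates every majority group of a node set and confirms that each pair overlaps; the function names are hypothetical.

```python
from itertools import combinations


def majorities(nodes):
    """All subsets of nodes that contain more than half of them."""
    need = len(nodes) // 2 + 1
    return [
        set(group)
        for size in range(need, len(nodes) + 1)
        for group in combinations(nodes, size)
    ]


def all_majorities_intersect(nodes):
    """True if every pair of majority groups shares at least one member."""
    groups = majorities(nodes)
    return all(a & b for a, b in combinations(groups, 2))


if __name__ == "__main__":
    print(all_majorities_intersect(["A", "B", "C", "D", "E"]))
```

The overlap follows from counting: two groups that each hold more than half of the nodes cannot be disjoint.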
Paxos Continued • Paxos can tolerate • lost messages, delayed messages, repeated messages, and messages delivered out of order. • In fact it can work if nearly half of the nodes fail to reply • 2F+1 Nodes can tolerate F failures • It will reach consensus if there is a single leader for long enough that the leader can talk to a majority of processes twice. • Any process, including leaders, can fail and restart; in fact all processes can fail at the same time and the algorithm is still safe. • There can be more than one leader at a time. • Used in Google’s Chubby system; ZooKeeper uses the closely related Zab protocol, etc.
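A compact sketch of single-value Paxos showing the 2F+1 arithmetic and the two conversations a leader has with a majority. It is an in-memory illustration under strong simplifications: no real network, no persistence, and a failed node is modelled as `None`. The names `Acceptor`, `propose` and `max_failures` are hypothetical.

```python
def max_failures(n_nodes):
    """With 2F+1 nodes, F failures can be tolerated."""
    return (n_nodes - 1) // 2


class Acceptor:
    def __init__(self):
        self.promised = 0     # highest proposal number promised so far
        self.accepted = None  # (number, value) of the last accepted proposal

    def prepare(self, n):
        """First conversation: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        """Second conversation: accept unless a higher number was promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Try to get a value chosen; None entries are unreachable nodes.

    Returns the chosen value, or None if no majority answered.
    """
    majority = len(acceptors) // 2 + 1
    alive = [a for a in acceptors if a is not None]

    # Talk to a majority the first time: collect promises.
    replies = [a.prepare(n) for a in alive]
    promises = [prev for ok, prev in replies if ok]
    if len(promises) < majority:
        return None

    # If any acceptor already accepted a value, adopt the one with the
    # highest proposal number instead of our own value.
    earlier = [p for p in promises if p is not None]
    if earlier:
        value = max(earlier, key=lambda p: p[0])[1]

    # Talk to a majority the second time: ask them to accept.
    acks = sum(1 for a in alive if a.accept(n, value))
    return value if acks >= majority else None


if __name__ == "__main__":
    nodes = [Acceptor() for _ in range(5)]
    print(max_failures(5))
    print(propose(nodes[:3] + [None, None], 1, "x"))
    print(propose([None, None] + nodes[2:], 2, "y"))
```

In the usage example the second leader reaches a different majority, but the shared node reports the earlier value, so the second leader adopts that value rather than its own.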