
Lecture IX: Coordination And Agreement




  1. Lecture IX: Coordination And Agreement CMPT 401 Summer 2007 Dr. Alexandra Fedorova

  2. A Replicated Service [diagram: two clients access a replicated service over the network; writes (W) go to the master, which propagates them to the slaves via data replication; reads (R) can be served by any replica]

  3. A Need For Coordination And Agreement [diagram: when the master fails, the remaining servers must coordinate the election of a new master and must agree on which server takes over]

  4. Roadmap • Today we will discuss protocols for coordination and agreement • This is a difficult problem because of failures and the lack of a bound on message delay • We will begin with a strong set of assumptions (few failures), and then we will relax those assumptions • We will look at several problems requiring coordination and agreement: distributed mutual exclusion, election • We will finally learn that in an asynchronous distributed system it is impossible to guarantee consensus

  5. Distributed Mutual Exclusion (DMTX) • Similar to a local mutual exclusion problem • Processes in a distributed system share a resource • Only one process can access a resource at a time • Examples: • File sharing • Sharing a bank account • Updating a shared database

  6. Assumptions and Requirements • An asynchronous system • Processes do not fail • Message delivery is reliable (exactly once) • Protocol requirements: Safety: At most one process may execute in the critical section at a time Liveness: Requests to enter and exit the critical section eventually succeed Fairness: Requests to enter the critical section are granted in the order in which they were received

  7. Evaluation Criteria of DMTX Algorithms • Bandwidth consumed • proportional to the number of messages sent in each entry and exit operation • Client delay • the delay incurred by a process at each entry and exit operation • System throughput • the rate at which processes can access the critical section (number of accesses per unit of time)

  8. DMTX Algorithms • We will consider the following algorithms: • Central server algorithm • Ring-based algorithm • An algorithm based on voting

  9. The Central Server Algorithm

  10. The Central Server Algorithm • Performance: • Entering a critical section takes two messages (a request message followed by a grant message) • System throughput is limited by the synchronization delay at the server: the time between the release message to the server and the grant message to the next client • Fault tolerance: • Does not tolerate failures • What if the client holding the token fails?
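As an illustration (not part of the original slides), here is a minimal Python sketch of the central server's state machine: a single token, a FIFO queue of waiting clients, and grant-on-release. Class and method names are hypothetical.

```python
from collections import deque

class CentralLockServer:
    """Central server for distributed mutual exclusion: it owns one token,
    grants it to one client at a time, and queues all other requests."""

    def __init__(self):
        self.holder = None    # client currently in the critical section
        self.queue = deque()  # waiting clients, in request order (fairness)

    def request(self, client):
        """Client asks to enter the CS; returns True if granted immediately."""
        if self.holder is None:
            self.holder = client   # grant message sent back right away
            return True
        self.queue.append(client)  # hold the request until a release arrives
        return False

    def release(self, client):
        """Client exits the CS; the token passes to the next waiter, if any."""
        assert self.holder == client, "only the holder may release"
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder         # who receives the grant next (or None)
```

Usage mirrors the two-message entry described above: a `request` followed by a grant, with the synchronization delay hidden between `release` and the next grant.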

  11. A Ring-Based Algorithm

  12. A Ring-Based Algorithm (cont.) • Processes are arranged in a ring • There is a communication channel from process pi to process p(i+1) mod N • They continuously pass the mutual exclusion token around the ring • A process that does not need to enter the critical section (CS) passes the token along • A process that needs to enter the CS retains the token; once it exits the CS, it resumes passing the token • No fault tolerance • Excessive bandwidth consumption
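To make the token circulation concrete, here is an illustrative sketch (names are my own, not from the lecture) that simulates the ring for a fixed number of hops and records the order in which requesting processes enter the CS:

```python
def ring_mutex(n, wants_cs, hops):
    """Simulate token passing around a ring of n processes.
    wants_cs: set of process ids that each want the CS once.
    hops: how many times the token is handed along.
    Returns the order in which processes entered the CS."""
    token, order, pending = 0, [], set(wants_cs)
    for _ in range(hops):
        if token in pending:      # holder needs the CS: enter, exit, move on
            order.append(token)
            pending.discard(token)
        token = (token + 1) % n   # otherwise (and afterwards) pass it along
    return order
```

Note that the token keeps moving even when nobody wants the CS, which is exactly the "excessive bandwidth consumption" the slide points out.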

  13. Maekawa’s Voting Algorithm • To enter a critical section a process must receive permission from a subset of its peers • Processes are organized into voting sets • Each process is a member of M voting sets • All voting sets are of equal size (for fairness)

  14. Maekawa’s Voting Algorithm • Intersection of voting sets guarantees mutual exclusion: any two voting sets share at least one member, and a member votes for only one request at a time • To avoid deadlock, requests to enter the critical section must be ordered [diagram: overlapping voting sets containing p1–p4]
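One common way to build such voting sets (an illustration of the intersection property, not necessarily the construction used in the lecture) is to place the N processes on a √N × √N grid and let each process's voting set be its row plus its column. Any two row-plus-column sets intersect:

```python
from math import isqrt

def grid_voting_sets(n):
    """Voting set of process p = its row plus its column on a k x k grid
    (k = sqrt(n)). Any two such sets share at least one member, which is
    what prevents two processes from collecting all their votes at once."""
    k = isqrt(n)
    assert k * k == n, "sketch assumes n is a perfect square"
    sets = []
    for p in range(n):
        row, col = divmod(p, k)
        members = ({row * k + c for c in range(k)}   # p's row
                   | {r * k + col for r in range(k)})  # p's column
        sets.append(members)
    return sets
```

All sets have the same size, 2√N − 1, matching the fairness requirement on the slide.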

  15. Elections • Election algorithms are used when a unique process must be chosen to play a particular role: • Master in a master-slave replication system • Central server in the DMTX protocol • We will look at the bully election algorithm • The bully algorithm tolerates failstop failures • But it works only in a synchronous system with reliable messaging

  16. The Bully Election Algorithm • All processes are assigned identifiers • The system always elects a coordinator with the highest identifier: • Each process must know all processes with higher identifiers than its own • Three types of messages: • election – a process begins an election • answer – a process acknowledges the election message • coordinator – an announcement of the identity of the elected process

  17. The Bully Election Algorithm (cont.) • Initiation of election: • Process p1 detects that the existing coordinator p4 has crashed and initiates the election • p1 sends an election message to all processes with higher identifiers than its own [diagram: p1 sends election messages to p2, p3, and p4]

  18. The Bully Election Algorithm (cont.) • What happens if there are no further crashes: • p2 and p3 receive the election message from p1, send back the answer message to p1, and begin their own elections • p3 sends answer to p2 • p3 receives no answer message from p4, so after a timeout it elects itself as the leader (knowing it has the highest ID among the live processes) and sends the coordinator message [diagram: election, answer, and coordinator messages among p1–p4]

  19. The Bully Election Algorithm (cont.) • What happens if p3 also crashes after sending the answer message but before sending the coordinator message? • In that case, p2 will time out while waiting for the coordinator message and will start a new election [diagram: p1 starts an election; p2 and p3 answer; p3 crashes before announcing itself; p2 times out and re-runs the election]

  20. The Bully Election Algorithm (summary) • The algorithm does not require a central server • Does not require knowing identities of all the processes • Does require knowing identities of processes with higher IDs • Survives crashes • Assumes a synchronous system (relies on timeouts)
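The control flow of the slides above can be sketched as follows. This is an illustrative simulation of a crash-free election round (timeouts are modeled as "no process with that id is in the live set"), not the message-level protocol:

```python
def bully_election(initiator, alive):
    """Return the coordinator elected when `initiator` starts an election.
    alive: set of live process ids; higher id wins ("bully").
    A process with no live higher-id peers times out and elects itself."""
    higher = [p for p in alive if p > initiator]
    if not higher:
        return initiator  # nobody higher answered: announce self as coordinator
    # every higher live process answers and starts its own election in turn;
    # following any one of them leads to the highest live id
    return bully_election(min(higher), alive)
```

For the scenario on slide 18: p1 starts an election while p4 is crashed, and p3, the highest live process, wins.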

  21. Consensus in Asynchronous Systems With Failures • The algorithms we’ve covered have limitations: • They either tolerate only limited failures (failstop) • Or assume a synchronous system • Consensus is impossible to guarantee in an asynchronous system with failures • Next we will see why…

  22. Consensus • All processes agree on the same value (or set of values) • When do you need consensus? • Leader (master) election • Mutual exclusion • Transactions involving multiple parties (banking) • We will look at several variants of the consensus problem: • Consensus • Byzantine generals • Interactive consistency

  23. System Model • There is a set of processes Pi • There is a set of values {v0, …, vN-1} proposed by the processes • Each process Pi decides on a value di • di belongs to the set {v0, …, vN-1} • Assumptions: • Synchronous system (for now) • Failstop failures • Byzantine failures • Reliable channels

  24. Consensus Algorithm [diagram: Step 1 Propose – P1, P2, P3 propose v1, v2, v3; Step 2 Decide – each Pi decides on di] Courtesy of Jeff Chase, Duke University

  25. Consensus (C) • Each Pi selects di from {v0, …, vN-1} • All Pi select the same vk (make the same decision): di = vk Courtesy of Jeff Chase, Duke University

  26. Conditions for Consensus • Termination: All correct processes eventually decide. • Agreement: All correct processes select the same di. • Integrity: If all correct processes propose the same v, then di = v

  27. Byzantine Generals Problem (BG) • Two types of generals: a commander (leader) and subordinates (lieutenants) • The commander proposes an action vleader • The subordinates must agree: di = vleader for every correct subordinate Courtesy of Jeff Chase, Duke University

  28. Conditions for Consensus • Termination: All correct processes eventually decide. • Agreement: All correct processes select the same di. • Integrity: If the commander is correct, then all correct processes decide on the value that the commander proposed

  29. Interactive Consistency (IC) • Each Pi proposes a value vi • Pi selects di = [v0, …, vN-1], a vector reflecting the values proposed by all correct participants • All Pi must decide on the same vector

  30. Conditions for Consensus • Termination: All correct processes eventually decide. • Agreement: The decision vector of all correct processes is the same • Integrity: If Pi is correct then all correct processes decide on vi as the ith component of their vector

  31. Equivalence of IC and BG • We will show that BG is equivalent to IC • If there is a solution to one, there is a solution to the other • Notation: • BGi(j, v) returns the decision value of pi when the commander pj proposed v • ICi(v1, v2, …, vN)[j] returns the jth value in the decision vector of pi in the solution to IC, where {v1, v2, …, vN} are the values that the processes proposed • Our goal is to find a solution to IC given a solution to BG

  32. Equivalence of IC and BG • We run the BG problem N times • Each time a different commander pj proposes a value v • Recall that in IC each process proposes a value • After each run of the BG problem we record BGi(j, v) for all i – that is, what each process decided when pj proposed v • Similarity with IC: we record what each pi decided for vector position j • We need to record decisions for N vector positions, so we run the problem N times

  33. Equivalence of IC and BG [diagram: Initialization – every process starts with an empty decision vector [?, ?, ?]. Run #1: p0 proposes v0; every process records d0 in position 0. Run #2: p1 proposes v1; every process records d1 in position 1. Run #3: p2 proposes v2; every process records d2 in position 2. After the N runs every correct process holds the same vector [d0, d1, d2]]

  34. Consensus in a Synchronous System Without Failures • Each process pi proposes a decision value vi • All proposed vi are sent around, so that each process knows all proposed vi • Once all processes have received all proposed values, they apply the same function to them, such as minimum(v1, v2, …, vN) • Each process pi sets di = minimum(v1, v2, …, vN) • Consensus is reached • What if processes fail? Can the other processes still reach agreement?
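The failure-free case above fits in a few lines. This sketch (my own naming) simulates the single all-to-all round and shows that every process, applying the same deterministic function to the same received set, reaches the same decision:

```python
def one_round_consensus(proposals):
    """proposals: {process_id: proposed value}. Simulates one all-to-all
    exchange round; each process then decides min() over everything it
    received. Returns {process_id: decision}."""
    received = list(proposals.values())  # every process receives all values
    return {p: min(received) for p in proposals}
```

Any deterministic function of the full multiset (minimum, maximum, majority) would do; minimum is the one named on the slide.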

  35. Consensus in a Synchronous System With Failstop Failures • We assume that at most f out of N processes fail • To reach consensus despite f failures, we extend the algorithm to take f+1 rounds • In round 1, each process pi sends its proposed vi to all other processes and receives the values proposed by others • In each subsequent round, pi sends any values it has not sent before and receives new values • The algorithm terminates after f+1 rounds • Let’s see why it works…
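A simplified simulation of the f+1-round algorithm (illustrative only; it models a crash as "sends nothing from that round on", whereas the proof on the next slide also allows a crash partway through a round's sends):

```python
def rounds_consensus(values, crash_round, f):
    """f+1-round flooding consensus under failstop failures.
    values: {pid: proposed value}; crash_round: {pid: round it crashes in}.
    Each round, every live process broadcasts every value it knows.
    Surviving processes decide min() over their known set."""
    known = {p: {v} for p, v in values.items()}
    procs = set(values)
    for rnd in range(1, f + 2):  # rounds 1 .. f+1
        live = procs - {p for p, r in crash_round.items() if r <= rnd}
        msgs = {p: set(known[p]) for p in live}   # snapshot before delivery
        for sender in live:
            for receiver in live:
                known[receiver] |= msgs[sender]
    return {p: min(known[p]) for p in procs - set(crash_round)}
```

With f = 1 and one crash, the two correct processes end the f+1 = 2 rounds with identical decision values, as the agreement condition requires.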

  36. Consensus in a Synchronous System With Failstop Failures: Proof • We prove correctness by contradiction • Suppose some correct process pi possesses a value that another correct process pj does not possess • This can only have happened because some other process pk sent that value to pi but crashed before sending it to pj • The crash must have happened in round f+1 (the last round); otherwise, pi would have forwarded that value to pj in a later round • But then how did pj fail to receive that value in any of the previous rounds? • Only if in every round there was a crash: some process received the value and sent it to some processes, but crashed before sending it to pj • But this implies that there must have been f+1 crashes, one per round • This is a contradiction: we assumed at most f failures

  37. Consensus in a Synchronous System: Discussion • Can this algorithm withstand other types of failures – omission failures, Byzantine failures? • It cannot: for example, processes separated by a network partition form groups, and each group can agree on a separate value • Let us look at consensus in the presence of Byzantine failures

  38. Consensus in a Synchronous System With Byzantine Failures • Byzantine failure: a process can send an arbitrary value v to another process • Byzantine generals: the commander tells one lieutenant that v = A and another lieutenant that v = B • We will show that consensus is impossible with only 3 generals • Pease et al. generalized this to the impossibility of consensus with N ≤ 3f, where f of the N generals are faulty

  39. BG: Impossibility With Three Generals [diagram: Scenario 1 – the correct commander p1 sends v to both lieutenants, but faulty p3 relays “1 says u” to p2; Scenario 2 – the faulty commander p1 sends w to p2 and x to p3, and each lieutenant relays what it received; faulty processes are shown shaded; “3:1:u” means “3 says 1 says u”] • Scenario 1: p2 must decide v (by the integrity condition) • But p2 cannot distinguish between Scenario 1 and Scenario 2, so it will decide w in Scenario 2 • By symmetry, p3 will decide x in Scenario 2 • p2 and p3 will have reached different decisions

  40. Solution With Four Byzantine Generals • We can reach consensus if there are 4 generals and at most 1 is faulty • Intuition: who is telling the truth? Majority rules – the correct processes outvote the faulty one

  41. Solution With Four Byzantine Generals [diagram: the commander p1 sends its value to p2, p3, and p4, who then relay it to each other; “2:1:v” means “2 says 1 says v”; faulty processes are shown shaded] • Round 1: The commander sends v to all other generals • Round 2: All generals exchange the values they received from the commander • The decision is made based on the majority

  42. Solution With Four Byzantine Generals (faulty lieutenant) [diagram: the commander p1 sends v to p2, p3, and p4; faulty p3 relays u to p2 and w to p4] • p2 receives {v, v, u} and decides v • p4 receives {v, v, w} and decides v

  43. Solution With Four Byzantine Generals (faulty commander) [diagram: the commander p1 sends u to p2, w to p3, and v to p4; the lieutenants relay what they received] • p2 receives {u, w, v} and decides NULL • p3 receives {w, u, v} and decides NULL • p4 receives {u, v, w} and decides NULL • The result generalizes to systems with N ≥ 3f + 1, where N is the number of processes and f is the number of faulty processes
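The majority decision rule used by a lieutenant on slides 42 and 43 can be sketched as follows (an illustration in my own naming). Each lieutenant looks at the value it heard directly from the commander plus the values the other lieutenants relayed, and decides on a strict majority, or NULL when there is none:

```python
from collections import Counter

def lieutenant_decision(reports):
    """reports: the commander's value as heard directly, plus the values the
    other lieutenants say the commander sent them. A strict majority wins;
    no strict majority means no decision (None, i.e. NULL)."""
    value, n = Counter(reports).most_common(1)[0]
    return value if n > len(reports) // 2 else None
```

This reproduces both slides: with a faulty lieutenant the two honest copies of v outvote the lie, while a faulty commander who sends three different values leaves every lieutenant without a majority.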

  44. Consensus in an Asynchronous System • In the algorithms we’ve looked at, consensus has been reached using several rounds of communication • The systems were synchronous, so each round always terminated • If a process had not received a message from another process in a given round, it could assume that the process was faulty • In an asynchronous system this assumption cannot be made! • Fischer-Lynch-Paterson (1985): No consensus can be guaranteed in an asynchronous communication system in the presence of even a single crash failure • Intuition: a “failed” process may just be slow, and can rise from the dead at exactly the wrong time

  45. Consensus in Practice • Real distributed systems are by and large asynchronous • How do they operate if consensus cannot be reached? • Fault masking: assume that failed processes always recover, and define a way to reintegrate them into the group. • If you haven’t heard from a process, just keep waiting… • A round terminates when every expected message is received. • Failure detectors: construct a failure detector that can determine if a process has failed. • A round terminates when every expected message is received, or the failure detector reports that its sender has failed.

  46. Fault Masking • In a distributed system, a recovered node’s state must also be consistent with the states of other nodes. • Transaction processing systems record state to persistent storage, so they can recover after crash and continue as normal • What if a node has crashed before important state has been recorded on disk? • A functioning node may need to respond to a peer’s recovery. • rebuild the state of the recovering node, and/or • discard local state, and/or • abort/restart operations/interactions in progress • e.g., two-phase commit protocol

  47. Failure Detectors • First problem: how to detect that a member has failed? • pings, timeouts, beacons, heartbeats • recovery notifications • Is the failure detector accurate? – Does it report only processes that have actually failed? • Is the failure detector live? – Are there bounds on failure detection time? • In an asynchronous system, it is impossible for a failure detector to be both accurate and live
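A timeout-based heartbeat detector of the kind described above, as an illustrative sketch (class name and timeout value are my own). It is live but not accurate: a slow process that misses the deadline is wrongly suspected, which is exactly the trade-off the slide states:

```python
import time

class TimeoutFailureDetector:
    """Suspect a process once its last heartbeat is older than the timeout.
    The clock is injectable so behavior can be tested deterministically."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout     # chosen from observed latencies (a tunable)
        self.clock = clock
        self.last_heard = {}       # pid -> time of last heartbeat

    def heartbeat(self, pid):
        self.last_heard[pid] = self.clock()

    def suspected(self, pid):
        last = self.last_heard.get(pid)
        return last is None or self.clock() - last > self.timeout
```

A process never heard from is suspected immediately; one heard from recently is trusted until the timeout elapses.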

  48. Failure Detectors in Real Systems • Use a failure detector that is live but not accurate. • Assume bounded processing delays and delivery times. • Timeout with multiple retries detects failure accurately with high probability. Tune it to observed latencies. • If a “failed” site turns out to be alive, then restore it or kill it (fencing, fail-silent). • What do we assume about communication failures? • How much pinging is enough? • Tune parameters for your system – can you predict how your system will behave under pressure? • That’s why distributed system engineers often participate in multi-day support calls… • What about network partitions? • Processes form two independent groups, reach consensus independently. Rely on quorum.

  49. Summary • Coordination and agreement are essential in real distributed systems • Real distributed systems are asynchronous • Consensus cannot be reached in an asynchronous distributed system • Nevertheless, people still build useful distributed systems that rely on consensus • Fault recovery and masking are used as mechanisms for helping processes reach consensus • Popular fault masking and recovery techniques are transactions and replication – the topics of the next few lectures
