
CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Explore fault-tolerant shared memory simulations and algorithms for register simulations in a distributed system.



Presentation Transcript


  1. Set 17: Fault-Tolerant Register Simulations CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS Fall 2011 Prof. Jennifer Welch

  2. Fault-Tolerant Shared Memory Simulations • Previous algorithms implemented a shared variable on top of message passing, assuming no failures. • What if some processors might crash? • Can we still provide a shared read/write variable on top of message passing? • Yes, even in an asynchronous system, if we have enough nonfaulty processors. • First, we must specify a failure-prone shared memory.

  3. Specification of f-Resilient Shared Memory • Inputs are invocations on the shared object. • Outputs are responses of the shared object. • A sequence of inputs and outputs is allowable iff there is a partitioning of proc. indices into "faulty" and "nonfaulty" such that: • Correct Interaction: each proc. alternates invocations and matching responses • Nonfaulty Liveness: every invocation by a nonfaulty proc. has a matching response • Extended Linearizability: linearizability holds for all the completed operations and some subset of the pending operations (some ops might never complete)

  4. Assumptions for Algorithm • Each read/write variable ("register") to be simulated has • one reader and • one writer (Next topic will be to build more powerful variables out of these.) • There are n procs. which are cooperating to simulate a collection of such variables • Underlying communication system is asynchronous message passing • n > 2f (fewer than half the processors can crash)

  5. Main Ideas of Algorithm • Each simulated register has a replica stored at each of the n procs., not just at the designated reader and writer of that register. • Use the redundant storage to provide fault-tolerance. • Describe algorithm just for one simulated register; use a separate copy of the same algorithm in parallel for each simulated register.

  6. Writing the Simulated Register • generate the next sequence number • send a message with the value and the sequence number to all the procs. • each recipient updates its local copy of the register • wait to get back an ack from > n/2 procs. • waiting is safe since n - f > n/2, so enough nonfaulty procs. will reply • do the ack for the write

  7. Reading the Simulated Register • send a request to all the procs. • each recipient sends back current value of its replica • wait to get reply from > n/2 procs. • return value associated with largest sequence number
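The write and read steps of the simulated register can be sketched in a few lines of Python. This is a minimal single-process sketch, not the algorithm text from the course: the class name `QuorumRegister`, the `responders` parameter (the set of procs. whose acks or replies arrive before the operation finishes), and raising an exception where a real asynchronous algorithm would simply keep waiting are all assumptions made for illustration.

```python
class QuorumRegister:
    """Single-writer, single-reader register replicated at n procs. (illustrative only)."""

    def __init__(self, n, initial=0):
        self.n = n
        # Each replica stores a (sequence number, value) pair.
        self.replicas = [(0, initial)] * n
        self.seq = 0  # the writer's local sequence number

    def _require_majority(self, responders):
        # An operation finishes only after hearing from more than n/2 procs.
        if 2 * len(responders) <= self.n:
            raise ValueError("need replies from more than n/2 procs.")

    def write(self, value, responders):
        self._require_majority(responders)
        self.seq += 1
        for i in responders:
            # A recipient adopts the new pair only if it is newer than its copy.
            if self.seq > self.replicas[i][0]:
                self.replicas[i] = (self.seq, value)
        return "ack"

    def read(self, responders):
        self._require_majority(responders)
        # Return the value carrying the largest sequence number among the replies.
        return max((self.replicas[i] for i in responders), key=lambda p: p[0])[1]


reg = QuorumRegister(5)
reg.write(1, responders={0, 1, 2})   # procs. 3 and 4 are slow or crashed
print(reg.read(responders={2, 3, 4}))  # proc. 2 is in both majorities, so this prints 1
```

The read sees the latest write only because the two responder sets overlap; the sketch does not model the link-status mechanism needed to cope with message reordering.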

  8. Key Idea for Correctness • Each read should return the value of "the most recent" write. • Each read or write communicates with > n/2 procs., so the set of procs. participating in operation O1 is guaranteed to intersect with the set of procs. participating in any other operation O2.
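The intersection claim follows from counting: two sets of more than n/2 procs. have more than n members in total, so they cannot be disjoint. As a sanity check, the following sketch verifies this by brute force for small n. The function name `quorums_intersect` and its `min_size` parameter are hypothetical, chosen only for this illustration.

```python
from itertools import combinations


def quorums_intersect(n, min_size):
    """True if every two subsets of {0..n-1} with at least min_size members overlap."""
    quorums = [set(c)
               for size in range(min_size, n + 1)
               for c in combinations(range(n), size)]
    # A nonempty intersection is truthy, an empty one is falsy.
    return all(a & b for a in quorums for b in quorums)


# Strict majorities always overlap.
print(all(quorums_intersect(n, n // 2 + 1) for n in range(1, 8)))  # True
# Sets of exactly n/2 procs. need not overlap, e.g. {0, 1} and {2, 3} when n = 4.
print(quorums_intersect(4, 2))  # False
```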

  9. But What About Asynchrony? • The underlying communication system is asynchronous: • a message on behalf of one operation could be overtaken by a message on behalf of a later operation. • Avoid such problems by adding an extra mechanism to the algorithm: • reader and writer keep track of "status" of each link • don't send a msg on a link until the ack for the previous msg has been received

  10. Outline of Correctness Proof Interesting part is proving (extended) linearizability. • Let ts(W) = sequence number of W • Let ts(R) = sequence number of write that R reads from • Let O1 → O2 denote that O1 finishes before O2 starts Key lemmas: • If W1 → W2, then ts(W1) < ts(W2) • If W → R, then ts(W) ≤ ts(R) • If R → W, then ts(R) < ts(W) • If R1 → R2, then ts(R1) ≤ ts(R2)

  11. Matching Lower Bound on Resiliency Theorem (10.22): No simulation of a 1-reader, 1-writer read/write linearizable register using n procs and asynchronous message passing can tolerate f ≥ n/2 crash failures. Proof: Suppose in contradiction there is an algorithm A that tolerates f = n/2 crashes and simulates a 1-reader, 1-writer linearizable register on top of asynchronous message passing.

  12. Lower Bound Proof • Partition procs into two sets, S0 and S1, each of size f. • Let α0 be an admissible exec. of A s.t. • initial value of simulated register is 0 • all procs. in S1 crash initially • proc. p0 in S0 invokes write(1) at time 0 and no other operations are invoked. • the write completes at some time t0 without any proc. in S0 receiving a message from any proc. in S1: must happen since A is supposed to tolerate f failures.

  13. [Figure: execution α0, showing the processor sets S0 (containing p0) and S1; crashed processors are marked X.]

  14. Lower Bound Proof • Let α1 be an admissible exec. of A s.t. • initial value of simulated register is 0 • all procs. in S0 crash initially • proc. p1 in S1 invokes a read at time t0+1 and no other operations are invoked. • the read completes at some time t1 without any proc. in S1 receiving a message from any proc. in S0: must happen since A is supposed to tolerate f failures • the read returns 0: must be since A guarantees linearizability

  15. [Figure: execution α1, showing the processor sets S0 and S1 (containing p1); crashed processors are marked X.]

  16. Lower Bound Proof • Now create a third admissible execution by "merging" the views of procs in S0 from α0 and the views of procs in S1 from α1: • messages that go between S0 and S1 are delayed so that they don't arrive until after time t1. • The merged execution is not linearizable, since read(0) follows write(1). Contradiction.

  17. [Figure: executions α0 and α1 side by side with the merged execution; in the merged execution, messages between S0 (containing p0) and S1 (containing p1) are delayed until after t1.]

  18. Lower Bound Diagram for n = 2 [Figure: timeline with times 0, t0, t0+1, t1. In α0, p1 is crashed and p0 performs write(1); in α1, p0 is crashed and p1 performs read(0); in the merged execution, p0's write(1) is followed by p1's read(0).]

  19. Simulating R/W Registers Using R/W Registers • The previous algorithm showed how to simulate a 1-reader, 1-writer register on top of message passing. • How can we get more powerful (flexible) registers, i.e., with • more readers • more writers? • We'll start with a warm-up: • simulate a multi-valued register using binary-valued registers • 1-reader and 1-writer

  20. Wait-Free Register Simulations • Asynchronous model • Linearizable shared registers • Wait-free • tolerate any number of crash failures • We want to simulate one kind of (n-1)-resilient shared memory with another kind of (n-1)-resilient memory • recall earlier definition of f-resilient shared memory • recall earlier definition of one kind of communication system simulating another

  21. Alternative Definition of Wait-Free Simulation • Alternative definition for the wait-free shared memory case: • The failure-free version of one (SM) communication system simulates the failure-free version of the other, and • for any prefix of an admissible execution of the simulation algorithm in which pi has a pending operation, there is an extension in which the operation completes and only pi takes steps. • Equivalent to previous definition, sometimes more convenient.

  22. Proving Linearizability • We've seen one approach: • explicitly construct a permutation and prove that it has the desired properties • Alternative approach: • identify a time point for each operation, between invocation and response: linearization points • Linearization points give the permutation • Obviously real-time order is preserved • Just need to show that legality holds

  23. Overview of Register Simulations [Figure: chain of simulations, from single-reader single-writer binary-valued, to single-reader single-writer multi-valued, to multi-reader single-writer multi-valued, to multi-reader multi-writer multi-valued.]

  24. Multi-Valued From Binary • Some ideas… • Use a different binary register to store each bit of the multi-valued register being simulated • Read algorithm is to read all the binary registers and return the resulting value • Write algorithm is to write the new bits in some order • Difficulties arise if the reader overlaps a slow write and sees some new bits and some old bits

  25. A Unary Approach • Suppose the simulated register is to take on the values {0,…,K-1}. • Use an array of K binary registers, B[0..K-1] • represent value v by having B[v] = 1 and the other entries 0 • Read algorithm: read B[0], B[1],…, until finding the first 1; return the index • Write algorithm: zero out the old entry of B and set the new entry
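A minimal sequential sketch of the unary representation follows. The class name `UnaryRegister` and the writer's private `old` field are assumptions; the slide does not say in which order the writer sets the new entry and clears the old one, so this sketch sets first and clears second, and it skips the clear when the value is unchanged.

```python
class UnaryRegister:
    """Simulated register over {0..K-1}, stored as K binary registers (illustrative only)."""

    def __init__(self, k, initial=0):
        self.B = [0] * k
        self.B[initial] = 1
        self.old = initial  # the writer remembers which entry currently holds the 1

    def write(self, v):
        self.B[v] = 1            # set the new entry
        if self.old != v:
            self.B[self.old] = 0  # zero out the old entry
        self.old = v

    def read(self):
        # Scan B[0], B[1], ... and return the index of the first 1.
        for i, bit in enumerate(self.B):
            if bit == 1:
                return i
        raise RuntimeError("scanned all of B without finding a 1")


reg = UnaryRegister(4, initial=3)
reg.write(1)
print(reg.read())  # 1
```

Run one operation at a time like this, the scheme behaves as a register; the trouble only appears when a read overlaps a write.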

  26. Problems with Unary Approach • OK if reads and writes don't overlap. • If they do, have to worry about • reader never finding a 1 in B • new-old inversion: writer writes 1, then 2, but reader reads 2, then 1. • Counter-example execution on next slide • since binary registers are linearizable, we just mark the linearization points of the reads and writes on the binary registers

  27. Counter-Example • Initially B[0] = B[1] = B[2] = 0 and B[3] = 1 • Writer: write 1 (write 1 to B[1], write 0 to B[3]), then write 2 (write 1 to B[2], write 0 to B[1]) • Reader: read 2 (read 0 from B[0], read 0 from B[1], read 1 from B[2]), then read 1 (read 0 from B[0], read 1 from B[1])
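The counter-example can be replayed step by step. In the sketch below, the function `replay` and its schedule string are hypothetical helpers: each `W` performs the writer's next low-level write and each `R` performs the reader's next low-level read, with B starting as B[0] = B[1] = B[2] = 0 and B[3] = 1. The particular interleaving is one ordering consistent with the low-level operations listed for the counter-example.

```python
def replay(schedule):
    """Run the unary algorithm under a given interleaving and return the values read."""
    B = [0, 0, 0, 1]
    low_level_writes = [(1, 1), (3, 0),   # high-level write 1: set B[1], clear B[3]
                        (2, 1), (1, 0)]   # high-level write 2: set B[2], clear B[1]
    returned = []
    i = 0  # index the reader will examine next
    for who in schedule:
        if who == "W":
            index, bit = low_level_writes.pop(0)
            B[index] = bit
        else:
            if B[i] == 1:
                returned.append(i)  # first 1 found: the high-level read returns i
                i = 0               # the next high-level read restarts the scan
            else:
                i += 1
    return returned


# The reader scans B[0], B[1] before write 1 starts, sees B[2] after write 2 has
# set it, then starts a second read that finds B[1] before write 2 clears it.
print(replay("RRWWWRRRW"))  # [2, 1]: new-old inversion
# With no overlap between reads and writes, the order is as expected.
print(replay("WWRRWWRRR"))  # [1, 2]
```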

  28. Corrected Multi-Valued Algorithm • To prevent "falling off the edge" of the end of B without finding a 1, the write algorithm only clears (sets to 0) entries that are smaller than the entry that is set (to 1) • To prevent new-old inversions, the read algorithm scans up to find the first 1, and then scans down to make sure the lower entries are still 0. • returns the smallest value associated with a 1 entry in B that is observed during the downward scan

  29. Multi-Valued Construction [Figure: the reader alg. and writer alg. sit between the reader and writer and the binary registers B[0], …, B[K-1]; the reader alg. issues low-level reads and the writer alg. issues low-level writes of 0/1.]

  30. Algorithm is Wait-Free • Algorithm for writer does not involve any waiting: just do at most K (low-level) writes • Algorithm for reader does not involve any waiting: just do at most 2K-1 (low-level) reads.
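The corrected read and write algorithms, together with the step bounds just stated, can be sketched as follows. The class name `CorrectedRegister` and the two counters are assumptions added to make the bounds observable; the sketch runs operations one at a time, so it illustrates the scans and the step counts rather than behavior under concurrency.

```python
class CorrectedRegister:
    """Multi-valued register from K binary registers, with up/down scans (illustrative only)."""

    def __init__(self, k, initial=0):
        self.B = [0] * k
        self.B[initial] = 1
        self.low_level_reads = 0
        self.low_level_writes = 0

    def _read_bit(self, i):
        self.low_level_reads += 1
        return self.B[i]

    def _write_bit(self, i, bit):
        self.low_level_writes += 1
        self.B[i] = bit

    def write(self, v):
        self._write_bit(v, 1)
        # Clear only the entries below v, so a 1 always remains at or above v.
        for i in range(v - 1, -1, -1):
            self._write_bit(i, 0)

    def read(self):
        i = 0
        while self._read_bit(i) == 0:   # upward scan to the first 1
            i += 1
        v = i
        for j in range(i - 1, -1, -1):  # downward scan back to B[0]
            if self._read_bit(j) == 1:
                v = j                   # remember the smallest index seen holding a 1
        return v


K = 4
reg = CorrectedRegister(K, initial=3)
reg.write(2)
print(reg.read())           # 2
print(reg.low_level_reads)  # 5: three reads going up, two coming down
```

The worst case for a read is returning K-1 after K reads on the way up and K-1 on the way down, which gives the 2K-1 bound; a write of K-1 performs K low-level writes.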

  31. Algorithm Ensures Linearizability • Describe an ordering of the (high-level) operations that is obviously legal (by the definition of the ordering) • Then show that it respects real-time ordering of non-overlapping operations. • Fix any admissible execution of the algorithm. • Fix any linearization of the low-level operations (on the binary registers) • exists since the execution is admissible, which implies the underlying communication system (the binary registers) behaves properly (is linearizable)

  32. Reads-From Relations • Low-level read r on a binary register B[v] reads from low-level write w on the register if w is the latest write to B[v] that precedes r in the linearization of the low-level operations. • High-level read R on the simulated multi-valued register reads from high-level write W on the register if R returns v and W contains the low-level write that R's last read of B[v] reads from.

  33. Reads-From Diagram [Figure: high-level write 1 (write 1 to B[1], write 0 to B[0]) and high-level read 1 (read 0 from B[0], read 1 from B[1], read 0 from B[0]), with arrows marking the low-level reads-from relationships and the high-level reads-from relationship.]

  34. Construct Permutation • Place all (high-level) writes in the order in which they occur • no concurrent writes • Consider each (high-level) read in the order in which they occur • no concurrent reads • Suppose read R reads from write W. Place R immediately before the write that follows W in the permutation.
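The construction is mechanical enough to write down directly. In this sketch, `build_permutation` is a hypothetical helper; writes are given in execution order, each read is given (also in execution order) with the index of the write it reads from, and the initial value is modeled as a write at index 0, which is an assumption of the sketch.

```python
def build_permutation(writes, reads):
    """Order writes as they occur and place each read just before the write after its source.

    writes: labels of the high-level writes, in execution order.
    reads:  (label, index of the write read from) pairs, in execution order.
    """
    # readers_of[w] collects, in execution order, the reads that read from write w.
    readers_of = [[] for _ in writes]
    for label, w in reads:
        readers_of[w].append(label)
    permutation = []
    for w, label in enumerate(writes):
        permutation.append(label)
        permutation.extend(readers_of[w])  # these land immediately before write w+1
    return permutation


writes = ["W0(init)", "W1", "W2"]
reads = [("R1", 0), ("R2", 2), ("R3", 2)]
print(build_permutation(writes, reads))
# ['W0(init)', 'R1', 'W1', 'W2', 'R2', 'R3']
```

Every read ends up after the write it reads from and before any later write, which is why legality holds by construction.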

  35. Correctness of Permutation • Permutation is legal by construction • each read is placed after the write that it reads from • Why does it preserve order of non-overlapping operations? • two writes: by construction • a read that precedes a write in the execution: OK, since the read cannot read from a later write.

  36. Correctness of Permutation Lemma (10.1): Suppose • (high-level) read R returns v • R reads B[u], with u < v, during its upward scan • this read of B[u] reads from a (low-level) write contained in high-level write W1 Then R reads from a write that follows W1.

  37. Figure for Lemma 10.1 [Figure: a high-level read v that reads 0 from B[u] during its upward scan (u < v) and reads 1 from B[v] at the top of the upward scan or during the downward scan; high-level writes "write v" (containing write 1 to B[v]) and "write w" (containing write 1 to B[w] and write 0 to B[u]); low-level and high-level reads-from relationships are drawn, and the depicted situation is marked "can't happen".]

  38. Correctness of Permutation • Two cases remain to show that real-time order of non-overlapping operations is preserved: • a write that precedes a read in the execution • two reads • Both cases are proved by contradiction, by exhibiting a situation that violates Lemma 10.1.

  39. Multi-Reader from Single-Reader • First consider a simple idea: • Use a different single-reader register for each reader (Val[1],…,Val[n]). • n is number of readers • Write algorithm: write the new value in each of the single-reader registers • Read algorithm: read your own single-reader register and return that value

  40. Counter-Example • Suppose 0 is the initial value of the multi-reader register and n = 2. • Writer pw: write 1 (write 1 to Val[1], then write 1 to Val[2]) • Reader p1: read 1 (read 1 from Val[1]) • Reader p2: read 0 (read 0 from Val[2]) • p1's read returns the new value before p2's read returns the old one: new-old inversion
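The inversion is easy to reproduce by interleaving the writer's two low-level writes with the two reads. The function name `naive_multi_reader_trace` is hypothetical, and the dictionary `val` stands in for the single-reader registers Val[1] and Val[2]; this is only a replay of the scenario, not a general implementation.

```python
def naive_multi_reader_trace():
    """Replay the simple per-reader-register idea with n = 2 and initial value 0."""
    val = {1: 0, 2: 0}
    trace = []
    val[1] = 1                     # pw: first low-level write of "write 1"
    trace.append(("p1", val[1]))   # p1 reads its own register and returns 1
    trace.append(("p2", val[2]))   # p2 starts afterwards, reads its own register, returns 0
    val[2] = 1                     # pw: second low-level write, too late for p2
    return trace


print(naive_multi_reader_trace())  # [('p1', 1), ('p2', 0)]
```

The later read returns the older value, so no linearization of the two reads and the write exists.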

  41. New Idea for Correct Algorithm • Have the readers in the multi-reader algorithm write some information to the single-reader registers to prevent new-old inversions on the simulated register. • This is provably necessary…

  42. Readers Must Write Theorem (10.3): In any wait-free simulation of a multi-reader single-writer register from single-reader single-writer registers, at least one reader must write. Proof: Suppose in contradiction there is an algorithm in which readers never write.

  43. Readers Must Write • pw is the writer, p1 and p2 are the readers • initial value of simulated register is 0 • S1 is the set of single-reader registers that are read by p1 • S2 is the set of single-reader registers that are read by p2

  44. Readers Must Write • Consider execution in which pw writes 1 to the simulated register. • The write algorithm performs a series of writes, w1,…,wk, to the single-reader registers. • Each wj is a write to a register in either S1 or S2. • Let v_j^i be the value that would be returned if pi were to do a read immediately after wj

  45. Readers Must Write [Figure: pw's write 1 performs the low-level writes w1, …, wj, wj+1, …, wk; a read by pi immediately after wj returns v_j^i.]

  46. Readers Must Write • For each reader (p1 and p2), there is a point when the writes w1, …, wk cause the value of the simulated register, as it would be observed by that reader, to "switch" from 0 (old) to 1 (new). • For p1: • v_1^1 = v_2^1 = … = v_{a-1}^1 = 0 • v_a^1 = … = v_k^1 = 1 • For p2: • v_1^2 = v_2^2 = … = v_{b-1}^2 = 0 • v_b^2 = … = v_k^2 = 1 • a cannot equal b!

  47. Readers Must Write • Why must a and b be different? • a marks the point when p1's view of the simulated register's current value changes from old to new. So wa must write to a register in S1. • Similarly, wb must write to a register in S2. • W.l.o.g., assume a < b.

  48. Readers Must Write [Figure: pw's write 1 performs the low-level writes w1, …, wa, wa+1, …, wk; immediately after wa, p1 reads v_a^1 = 1 and then p2 reads v_a^2 = 0: not linearizable!]

  49. Readers Must Write • Where did we use the assumption in this proof that readers don't write? • The writer doing the slow write of 1 is oblivious to whether any readers are concurrently reading. • The readers are oblivious to each other.

  50. Corrected Multi-Reader Algorithm • As part of the algorithm for the read on the simulated register, announce the value to be returned. • Before deciding what value to return, check what values have been returned by previous reads and don't pick anything earlier. • Need timestamps to be able to determine relative age of returned values. • Reader pi uses row i of a matrix to report its most recently returned value to all the other readers (remember, we only have single-reader variables at our disposal)
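One way the announce-and-check idea could look is sketched below. This is an illustration under stated assumptions, not the algorithm as given in the course: the names `MultiReaderRegister`, `val`, `report`, and `write_steps` are hypothetical, readers are indexed from 0, every stored item is a (timestamp, value) pair, and `report[i][j]` stands for a single-reader register written by reader i and read only by reader j.

```python
class MultiReaderRegister:
    """Multi-reader register from single-reader registers, readers announce (illustrative only)."""

    def __init__(self, n, initial=0):
        self.n = n                      # number of readers
        self.val = [(0, initial)] * n   # val[i]: written by the writer, read by reader i
        self.report = [[(0, initial)] * n for _ in range(n)]  # row i: written by reader i
        self.seq = 0                    # the writer's timestamp counter

    def write_steps(self, v):
        """Generator performing one low-level write per step, so a write can be paused."""
        self.seq += 1
        for i in range(self.n):
            self.val[i] = (self.seq, v)
            yield

    def write(self, v):
        for _ in self.write_steps(v):
            pass

    def read(self, i):
        # Look at the writer's register for reader i and at what each reader reported to i.
        candidates = [self.val[i]] + [self.report[j][i] for j in range(self.n)]
        best = max(candidates, key=lambda pair: pair[0])  # newest timestamp wins
        for j in range(self.n):
            self.report[i][j] = best    # announce the value about to be returned
        return best[1]


reg = MultiReaderRegister(2)
slow_write = reg.write_steps(1)
next(slow_write)     # the writer has updated val[0] only
print(reg.read(0))   # 1
print(reg.read(1))   # 1: reader 0's report prevents the new-old inversion
```

In the replayed scenario the second reader has not yet been written to by the writer, yet it still returns the new value because the first reader announced it in its row of the matrix.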
