Dynamic atomic storage without consensus Aguilera, Keidar, Malkhi, Shraer, J. ACM 58, 2, 2011 Sarai Duek
The Problem
• Implement an atomic read/write register in a dynamic system, supporting three operations:
• Read
• Write
• Reconfig
The Problem What is atomicity?
The Problem Atomicity is when each operation appears to occur at some point between its invocation and response. [Figure: timeline with two writes (W) and two reads (R)]
The Problem Atomicity is when each operation appears to occur at some point between its invocation and response. What is liveness?
The Problem Atomicity is when each operation appears to occur at some point between its invocation and response. Liveness is a guarantee that the system will make progress under some conditions (e.g. as long as a majority of the processes do not crash).
The Problem t-resilient R/W storage guarantees progress if at most t processes crash. For an n-process system, it is well known that t-resilient R/W storage exists when t < n/2, and does not exist when t ≥ n/2. [Figure: a write (W) and a read (R) accessing processes P0 to P3]
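The t < n/2 bound can be made concrete with a little arithmetic. The sketch below is only an illustration; the helper names `max_tolerated_crashes` and `majority_size` are hypothetical and do not come from the slides.

```python
def max_tolerated_crashes(n: int) -> int:
    """Largest t that satisfies t < n/2 for an n-process system."""
    return (n - 1) // 2


def majority_size(n: int) -> int:
    """Smallest number of processes that forms a strict majority of n."""
    return n // 2 + 1


for n in (3, 4, 5):
    t = max_tolerated_crashes(n)
    # Even after t crashes, the surviving processes still form a majority.
    assert n - t >= majority_size(n)
    print(f"n={n}: tolerates t={t} crashes, majority size {majority_size(n)}")
```

Note that going from 3 to 4 processes does not raise the number of tolerated crashes, because the majority size grows as well.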
The Problem In a dynamic system the set of processes, and with it the majority, can change. Liveness is maintained through the reconfig operation. [Figure: processes P0 to P3; reconfig1(+,4) adds P4]
The Problem: the model
• Unknown and unbounded universe of processes ∏.
• Asynchronous, reliable communication channels between each pair of processes.
• Processes can be added, removed, crash or halt.
[Figure: processes p1, p2, …, p9, …]
The Problem: liveness conditions
• The crashed processes, together with those whose removal is pending, are a minority of the current view and of any pending future view.
• No new reconfig operations are invoked for "sufficiently long", so that the operations already started can complete.
A view is a set of changes. Changes lead to a new configuration of processes.
[Figure: processes p0 to p5]
The Problem
• MWMR: any process can write and read.
• Written values are unique: (val, pid, ts).
• Every process in the system knows the initial view.
• By convention, reconfig(Init) completes by time 0.
• Members of a view w store information about the current view.
Definitions:
• Changes: {Remove, Add}, written (-, i) and (+, i) in the examples
• View: a set of changes
• For a view w: w.remove is the removal set, w.join is the join set, and w.members = w.join \ w.remove
• V(t): the union of all sets c such that a reconfig(c) completes by time t; Init = V(0)
• P(t): the set of pending changes at time t
• F(t): the set of processes that have crashed by time t
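The view definitions above translate almost directly into set operations. The following is a minimal sketch, assuming a change is encoded as a tuple `("+", i)` or `("-", i)` and a view as a `frozenset` of such tuples; this encoding is an assumption made for illustration, not the slides' notation.

```python
from typing import FrozenSet, Set, Tuple

Change = Tuple[str, int]   # ("+", i) adds process i, ("-", i) removes it
View = FrozenSet[Change]   # a view is a set of changes


def join(w: View) -> Set[int]:
    """w.join: every process that some change in w adds."""
    return {p for (op, p) in w if op == "+"}


def remove(w: View) -> Set[int]:
    """w.remove: every process that some change in w removes."""
    return {p for (op, p) in w if op == "-"}


def members(w: View) -> Set[int]:
    """w.members = w.join \\ w.remove."""
    return join(w) - remove(w)


init: View = frozenset({("+", 0), ("+", 1), ("+", 2), ("+", 3)})
w: View = init | {("+", 4), ("-", 1)}
print(sorted(members(w)))  # [0, 2, 3, 4]
```

Because a view only ever grows, a removed process stays removed: adding `("+", 1)` again to `w` would not bring process 1 back into `members(w)`.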
The Problem: Dynamic Service Liveness
If at every time t in the execution, fewer than |V(t).members|/2 processes out of V(t).members ∪ P(t).join are in F(t) ∪ P(t).remove, and the number of different changes proposed in the execution is finite, then the following hold:
• Eventually, the enable operations event occurs at every active process that was added by a complete reconfig operation.
• Every operation invoked at an active process eventually completes.
The Problem: Dynamic Service Liveness. At every time t in the execution, fewer than |V(t).members|/2 processes out of V(t).members ∪ P(t).join are in F(t) ∪ P(t).remove. [Figure: processes p0 to p10 grouped into V(t), P(t).join, P(t).remove and F(t)]
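The counting condition can be checked mechanically for a single instant t. The sketch below assumes the four sets are given as plain Python sets of process ids; the function name `liveness_condition_holds` and the sample values are hypothetical.

```python
def liveness_condition_holds(v_members, p_join, p_remove, crashed):
    """Check the Dynamic Service Liveness counting condition at one time t.

    v_members: V(t).members    p_join:  P(t).join
    p_remove:  P(t).remove     crashed: F(t)
    """
    candidates = v_members | p_join
    unavailable = candidates & (crashed | p_remove)
    return len(unavailable) < len(v_members) / 2


v_members = {0, 1, 2, 3, 4}
print(liveness_condition_holds(v_members, {5}, {1}, {3}))      # True: 2 < 2.5
print(liveness_condition_holds(v_members, {5}, {1}, {3, 4}))   # False: 3 >= 2.5
```

The bound is taken relative to |V(t).members| even though processes that are still joining are counted among the candidates.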
The algorithm outline
Basic phases:
Read phase
• send a request to all processes
• each recipient sends back the current value of its replica
• wait for a majority to reply
• return the value associated with the largest sequence number
Write phase
• generate the next sequence number
• send a message with the value and the sequence number to all processes
• each recipient updates its replica and sends an ack
• the writer waits for a majority of acks
Phases with reconfiguration:
Read phase
• read configuration information
• if a new view was discovered, restart the read phase in the new view
• send a request to all processes
• each recipient sends back the current value of its replica
• wait for a majority to reply
• return the value associated with the largest sequence number
Write phase
• generate the next sequence number
• send a message with the value and the sequence number to all processes
• each recipient updates its replica and sends an ack
• the writer waits for a majority of acks
• read configuration information
• if a new view was discovered, restart the read phase in the new view (followed by a write phase again)
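The basic read and write phases can be sketched as a small in-memory simulation for a fixed view. Everything here is an illustrative assumption: the `Replica` class, the use of `random.sample` to stand in for "whichever majority answers first", the read phase that precedes a write, and the write-back at the end of a read. Views, reconfiguration and concurrent writers are left out.

```python
import random


class Replica:
    """One process's copy of the register: a value tagged with a sequence number."""

    def __init__(self):
        self.seq, self.value = 0, None

    def on_write(self, seq, value):
        if seq > self.seq:              # keep only the newest value
            self.seq, self.value = seq, value
        return "ack"

    def on_read(self):
        return (self.seq, self.value)


def some_majority(replicas):
    # Models "send to all, wait for a majority": an arbitrary majority answers.
    return random.sample(replicas, len(replicas) // 2 + 1)


def read_phase(replicas):
    replies = [r.on_read() for r in some_majority(replicas)]
    return max(replies, key=lambda reply: reply[0])   # largest sequence number


def write_phase(replicas, seq, value):
    acks = [r.on_write(seq, value) for r in some_majority(replicas)]
    assert len(acks) > len(replicas) // 2


def write(replicas, value):
    seq, _ = read_phase(replicas)          # learn the latest sequence number
    write_phase(replicas, seq + 1, value)  # store the value under the next one


def read(replicas):
    seq, value = read_phase(replicas)
    write_phase(replicas, seq, value)      # write back before returning
    return value


replicas = [Replica() for _ in range(5)]
write(replicas, "a")
write(replicas, "b")
print(read(replicas))  # b
```

The read always returns "b" even though each phase touches a random majority, because any two majorities of the same set of replicas intersect.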
The algorithm outline: Reconfiguration
• write information about the new view to a quorum of the old one
• execute the read and write phases, starting in the old view.
Weak object
Allows a fixed set of processes P to use two operations:
• arrive_i(c)
• query_i()
Arrive and query obey the following semantics:
• Integrity
• Validity
• Monotonicity of queries
• Non-empty common intersection
• Termination
Weak object: the weak object algorithm
Each process p_i in P has a value field p_i.val. The field is SWMR: only p_i can use p_i.val.write(c), but all processes can use p_i.val.read().
operation arrive_i(c):
  if collect() = ∅ then p_i.val.write(c)
  return OK
operation query_i():
  C1 ← collect()
  if C1 = ∅ then return ∅
  C2 ← collect()
  return C2
procedure collect():
  C ← ∅
  foreach p_j ∈ P
    c ← p_j.val.read()
    if c ≠ ⊥ then C ← C ∪ {c}
  return C
Weak object [Figure: arrive(v1) at P0 and arrive(v2) at P5, among processes P0 to P5; the collected set starts as C = { }]
Weak object [Figure: query() collecting over processes P0 to P5, where P0 holds v1 and P5 holds v2; C grows from { } to {v1} to {v1, v2}]
Weak object [Figure: timelines of query_a and query_b with their collect results: { }, {a}, {b} and {a, b}]
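The weak object pseudocode maps onto a few lines of Python. This is a sequential sketch only: the class name `WeakObject` is hypothetical, `None` plays the role of ⊥, and a plain dictionary stands in for the SWMR registers, so none of the interleavings shown in the figures can occur.

```python
class WeakObject:
    """Sequential sketch of the weak object over a fixed set of processes P."""

    def __init__(self, process_ids):
        # One register per process; None plays the role of the empty value.
        self.val = {p: None for p in process_ids}

    def collect(self):
        # Read every register once and gather the non-empty values.
        return {c for c in self.val.values() if c is not None}

    def arrive(self, i, c):
        # Process i writes c only if its collect saw no value at all.
        if not self.collect():
            self.val[i] = c
        return "OK"

    def query(self, i):
        c1 = self.collect()
        if not c1:
            return set()
        return self.collect()   # second collect, as in the pseudocode


obj = WeakObject(range(6))
print(obj.query(2))             # set()
obj.arrive(0, "v1")
obj.arrive(5, "v2")             # not stored: the collect already sees v1
print(obj.query(2))             # {'v1'}
```

In a concurrent run two arrive operations can both see an empty collect and both write, which is how a query can return a set such as {v1, v2}.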
The algorithm
operation read_i():
  pickNewTS_i ← FALSE
  newView ← Traverse(∅, ⊥)
  NotifyQ(newView)
  return v_i^max
operation write_i(v):
  pickNewTS_i ← TRUE
  newView ← Traverse(∅, v)
  NotifyQ(newView)
  return OK
operation reconfig_i(cng):
  pickNewTS_i ← FALSE
  newView ← Traverse(cng, ⊥)
  NotifyQ(newView)
  return OK
procedure NotifyQ(w):
  if did not receive {NOTIFY, w} then
    send {NOTIFY, w} to w.members
  wait for {NOTIFY, w} from a majority of w.members
The algorithm
Traverse is used to look for the next view, considering all the changes suggested so far.
procedure Traverse(cng, v):
  desiredView ← curView_i ∪ cng
  Front ← {curView_i}
  do
    s ← min{|ℓ| : ℓ ∈ Front}
    w ← any ℓ ∈ Front s.t. |ℓ| = s
    if i ∉ w.members then halt_i
    if w ≠ desiredView then
      arrive_i(w, desiredView \ w)
    ChangeSets ← ReadInView(w)
    if ChangeSets ≠ ∅ then
      Front ← Front \ {w}
      foreach c ∈ ChangeSets
        desiredView ← desiredView ∪ c
        Front ← Front ∪ {w ∪ c}
    else ChangeSets ← WriteInView(w, v)
  while ChangeSets ≠ ∅
  curView_i ← desiredView
  return desiredView
The algorithm [Figure: Traverse starts from the Init view]
The algorithm [Figure: example run of Traverse over a graph of views from the Init view through V1 to V6. Labels include InitView ∪ {(+,3), (+,5), (-,1), (+,4), (+,7)} and the change sets {(+,3)}, {(+,5)}, {(+,7)}, {(-,1), (+,4)}, {(+,3), (+,5)}, {(+,3), (-,1), (+,4)} and {(+,5), (-,1), (+,4)}. Legend: initial Front; Front after iterations 1, 4 and 6; edge returned from ReadInView; edge updated by p_i]
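The control flow of Traverse can be followed in a single-process sketch. The simplifications are substantial and are all assumptions made for illustration: the per-view weak objects are collapsed into one dictionary `proposals` (view to set of change sets), ReadInView and WriteInView are reduced to a query on that dictionary, no messages are sent and no register value is stored. Views are encoded as frozensets of `("+", i)` or `("-", i)` tuples.

```python
def members(w):
    """w.members = w.join \\ w.remove for a view encoded as a set of (op, pid)."""
    return {p for (op, p) in w if op == "+"} - {p for (op, p) in w if op == "-"}


def traverse(i, cur_view, cng, proposals):
    """Sequential sketch of Traverse run by process i; returns the final view."""
    desired = cur_view | cng
    front = {cur_view}
    while True:
        w = min(front, key=len)                   # a smallest view in Front
        if i not in members(w):
            raise RuntimeError("halt: process was removed")
        if w != desired and not proposals.get(w):
            # arrive_i(w, desired \ w): stored only if nothing was proposed yet
            proposals.setdefault(w, set()).add(desired - w)
        change_sets = set(proposals.get(w, set()))        # stands in for ReadInView
        if change_sets:
            front.remove(w)
            for c in change_sets:
                desired = desired | c
                front.add(w | c)
        else:
            change_sets = set(proposals.get(w, set()))    # stands in for WriteInView
        if not change_sets:
            return desired


init = frozenset({("+", 0), ("+", 1), ("+", 2)})
# Another process already proposed adding process 3 in the initial view.
proposals = {init: {frozenset({("+", 3)})}}
final = traverse(0, init, frozenset({("+", 4)}), proposals)
print(sorted(members(final)))  # [0, 1, 2, 3, 4]
```

In this run process 0 first follows the existing proposal {(+,3)}, then proposes its own change {(+,4)} in the view Init ∪ {(+,3)}, and stops at the view that contains both changes.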
The algorithm
procedure ReadInView(w):
  ChangeSets ← query_i(w)
  ContactQ(R, w.members)
  return ChangeSets
procedure WriteInView(w, v):
  if pickNewTS_i then
    (pickNewTS_i, v_i^max, ts_i^max) ← (FALSE, v, (ts_i^max.num + 1, i))
  ContactQ(W, w.members)
  ChangeSets ← query_i(w)
  return ChangeSets
Procedure ContactQ sends a write-request including v_i^max and ts_i^max when writing to a quorum, and a read-request when reading from a quorum.
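The timestamp update in WriteInView pairs a counter with the writer's id. A minimal sketch follows, assuming timestamps are `(num, pid)` tuples ordered lexicographically; the slides show the pair but do not state the ordering, so that part is an assumption, as is the helper name `pick_new_timestamp`.

```python
def pick_new_timestamp(ts_max, i):
    """Return (ts_max.num + 1, i), the timestamp process i attaches to a new write."""
    num, _pid = ts_max
    return (num + 1, i)


ts_max = (3, 2)
new_ts = pick_new_timestamp(ts_max, 5)
print(new_ts)  # (4, 5)

# Python compares tuples lexicographically: first num, then the process id.
assert new_ts > ts_max
# Two writers that start from the same ts_max still get distinct timestamps.
assert pick_new_timestamp(ts_max, 1) != pick_new_timestamp(ts_max, 5)
```

Including the process id is what keeps written values unique when two processes pick the same counter value.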
Established views
The unique sequence of established views E is constructed as follows:
• the first view in E is the initial view Init
• if w is in E, then the next view after w in E is w' = w ∪ c, where c is an element chosen arbitrarily from the intersection of all sets C ≠ ∅ returned by some query(w) operation in the execution.
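The construction of E can be replayed over a recorded execution. The sketch below assumes the execution is summarized as a dictionary `query_results` mapping each view to the list of sets returned by query(w) operations on it; that summary, the function name and the deterministic tie-break are all assumptions. It relies on the "non-empty common intersection" property of the weak object to guarantee that a choice exists.

```python
def established_views(init, query_results):
    """Build the sequence E from the sets C returned by query(w) operations."""
    sequence = [init]
    w = init
    while True:
        non_empty = [set(C) for C in query_results.get(w, []) if C]
        if not non_empty:
            return sequence                  # no query(w) returned a non-empty set
        common = set.intersection(*non_empty)
        # "Chosen arbitrarily": a deterministic pick keeps the sketch repeatable.
        c = min(common, key=sorted)
        assert not c <= w, "a change set must add something new to the view"
        w = w | c
        sequence.append(w)


init = frozenset({("+", 0), ("+", 1), ("+", 2)})
a = frozenset({("+", 3)})
b = frozenset({("+", 4)})
query_results = {
    init: [{a}, {a, b}],      # two queries on Init; both contain a
    init | a: [{b}],
}
E = established_views(init, query_results)
print(len(E))  # 3
```

Here only `a` lies in the intersection of the results on Init, so the second established view is Init ∪ a, and the third adds `b`.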