Dynamic Atomic Storage Without Consensus Alex Shraer (Technion) Joint work with: Marcos K. Aguilera (MSR), Idit Keidar (Technion), Dahlia Malkhi (MSR)
The Goal • Reliable replicated storage • Using unreliable components • Asynchrony: tolerate unpredictable network delays
Designing an Asynchronous Replicated System • State machine replication (e.g., Paxos) • Any object • Impossible in asynchronous systems • Atomic R/W Register [Attiya, Bar-Noy, Dolev 95] • Simple object: read(), write(v) • Possible in asynchronous systems • Atomic (linearizable) • Liveness: if #failures < #servers/2, then every operation invoked on a correct server eventually completes.
Breaking the Minority Barrier • Our first contribution: first "black box" definition (in terms of user interface) • Over a long period of time, #failures < #servers/2 is not good enough • Reconfiguration! • Increasing resilience by changing the set of servers • Example: 3 failures out of 5 • Semantics of Reconfigurable R/W register: • Atomic (linearizable) • Liveness: ?
Reconfigurable Register: User Interface • read() (returns a value) • write(value) (returns OK) • reconfig(c) (returns OK) • c is a set of changes (relative to the current config.) • Each change is either (Add, pid) or (Remove, pid) • Example: c = {+C, +E, –D} • Only processes that were successfully added can invoke ops • Universe of processes (servers): • Unknown, unbounded, possibly infinite • At any given time, only a finite number has been added
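To make the change-set notation concrete, here is a minimal Python sketch of applying c = {+C, +E, –D} to a configuration. Modeling a configuration as a plain set of process ids is a simplification, and the function name `apply_changes`, the `("+", pid)` / `("-", pid)` encoding, and the starting configuration {A, B, D} are assumptions made for illustration, not part of the talk.

```python
from typing import FrozenSet, Iterable, Tuple

# Hypothetical encoding: ("+", pid) stands for (Add, pid), ("-", pid) for (Remove, pid).
Change = Tuple[str, str]


def apply_changes(config: FrozenSet[str], changes: Iterable[Change]) -> FrozenSet[str]:
    """Return the configuration obtained by applying a set of changes."""
    members = set(config)
    for op, pid in changes:
        if op == "+":
            members.add(pid)
        elif op == "-":
            members.discard(pid)
        else:
            raise ValueError(f"unknown change type: {op!r}")
    return frozenset(members)


# Assumed starting configuration; the slide only gives the change set.
current = frozenset({"A", "B", "D"})
c = [("+", "C"), ("+", "E"), ("-", "D")]
print(sorted(apply_changes(current, c)))  # ['A', 'B', 'C', 'E']
```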
Definitions • Current(t) – servers in the system at time t • the “current configuration” • AddPending(t) – servers whose Add is pending at t • RemovePending(t) – servers whose Remove is pending at t • Faulty(t) – servers that have crashed by t • pi is active in an execution if • During the execution, pi does not crash • Some process invokes reconfig adding pi • No process invokes reconfig removing pi
Dynamic System Liveness • Static system: operations complete if #failures < #servers/2 • What should this be in a dynamic system? • Try #1: for every t, a minority of Current(t) is in Faulty(t) • What if processes crash while others are removed? No operation is guaranteed to complete in the new configuration! • Try #2: for every t, a minority of Current(t) is in Faulty(t) ∪ RemovePending(t) [Figure: servers A, B, C; reconfig({–A}) is invoked and returns OK]
Adding Servers • Q: At time t0, who can crash from {A, B, ..., G}? • A: a minority of {A, B, ..., E}, and in addition, • in this scenario G can crash • in a different scenario F can crash • Simple condition: any 2 servers can fail (fewer than |Current(t)|/2) [Figure: timeline of servers A through G up to time t0, with reconfig({+G}) and reconfig({+F}) invocations and their OK responses]
Dynamic Service Liveness • If the number of reconfigs invoked in the execution is finite, and at every time t in the execution, fewer than |Current(t)|/2 processes out of Current(t) ∪ AddPending(t) are in Faulty(t) ∪ RemovePending(t) • Then: • Eventually, every active process that was successfully added can invoke operations • Every operation invoked by an active process eventually completes
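The failure condition above can be written as a small predicate over the four sets at a single time t. This is only a sketch: the function name and the sample sets are assumptions chosen for illustration, and the real condition must hold at every time t of the execution, not just at one instant.

```python
from typing import FrozenSet


def liveness_condition_holds(
    current: FrozenSet[str],
    add_pending: FrozenSet[str],
    remove_pending: FrozenSet[str],
    faulty: FrozenSet[str],
) -> bool:
    """Check the condition at one time t: fewer than |Current(t)|/2 processes
    of Current(t) ∪ AddPending(t) are in Faulty(t) ∪ RemovePending(t)."""
    considered = current | add_pending
    unavailable = (faulty | remove_pending) & considered
    return len(unavailable) < len(current) / 2


# Assumed snapshot: five current servers, two with a pending Add.
current = frozenset("ABCDE")
add_pending = frozenset("FG")

# Two failures (one of them a server being added): 2 < 5/2, condition holds.
print(liveness_condition_holds(current, add_pending, frozenset(), frozenset("AG")))   # True
# Three failures: 3 >= 5/2, condition is violated.
print(liveness_condition_holds(current, add_pending, frozenset(), frozenset("ABG")))  # False
```

Note that the bound is taken from |Current(t)| alone, while the counted processes range over Current(t) ∪ AddPending(t), which is why a pending addition does not raise the number of tolerated failures.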
Reconfigurable Solutions • Many previous solutions, all of which use consensus (or similar): • State machine replication (Paxos): use the state machine to agree on the set of servers • Virtual Synchrony based solutions, e.g., [Yeger-Lotem, Keidar, Dolev 97] • R/W register + reconfiguration service [Lynch, Shvartsman 97], [Englert, Shvartsman 00] • Rambo [Lynch, Shvartsman 02] • Rambo II [Gilbert, Lynch, Shvartsman 03] • Long Lived Rambo [Georgiou, Musial, Shvartsman 04] • Slide callouts on these solutions: membership service stronger than consensus (equivalent to P); one designated “reconfigurer”; consensus to agree on next configuration • Is consensus really necessary? • Our second contribution: Consensus is NOT needed! • DynaStore: an algorithm for a completely asynchronous system
“Old” and “New” Configurations • A reconfiguration transfers the state from a majority of the old config. to a majority of the new config. • What if there are concurrent reconfigurations? • Suppose that the initial configuration is {A, B, C, D} • A invokes reconfig({+E}); C invokes reconfig({–D}) • A writes to {A, D, E}, a majority of {A, B, C, D, E} • C reads from {B, C}, a majority of {A, B, C} • No intersection: atomicity is violated! • Simple solution: consensus on the sequence of configurations • But how can we do this without consensus?
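The counterexample above can be checked mechanically. The following sketch uses the sets from the slide; the helper name `is_majority` is an assumption introduced for illustration.

```python
from typing import FrozenSet


def is_majority(quorum: FrozenSet[str], config: FrozenSet[str]) -> bool:
    """True if quorum is a subset of config containing more than half of it."""
    return quorum <= config and len(quorum) > len(config) / 2


# Configurations each process believes is the new one.
config_seen_by_a = frozenset("ABCDE")  # {A, B, C, D} after +E
config_seen_by_c = frozenset("ABC")    # {A, B, C, D} after -D

write_quorum = frozenset("ADE")  # where A writes
read_quorum = frozenset("BC")    # where C reads

assert is_majority(write_quorum, config_seen_by_a)  # 3 of 5
assert is_majority(read_quorum, config_seen_by_c)   # 2 of 3

# Both are valid majorities, yet they share no server,
# so C's read can miss the value that A wrote.
print(write_quorum & read_quorum)  # frozenset()
```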
The approach in DynaStore • For each configuration c, we use a (weak) snapshot nextConfig(c) to store the next configuration • (Weak) snapshot objects are (easily) implemented in an asynchronous environment • Processes update nextConfig(c) to suggest the next configuration after c (concurrent updates possible) • Sequence of Established Configurations (simplified): • The initial configuration is established • If c is established, then the first snapshot update to nextConfig(c), which is included in every scan of nextConfig(c), is the next established configuration after c
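The definition of established configurations can be illustrated with a sequential, in-memory toy. This is not the asynchronous weak snapshot implementation: the class `ToyWeakSnapshot`, its `first_update` method, and the function `established_sequence` are hypothetical names, and the toy exposes which update came first purely for illustration (a process running the protocol only sees scan results). The sample data is an assumed scenario in which both C1 and C2 are proposed after C0.

```python
from typing import Dict, FrozenSet, List, Optional

Config = FrozenSet[str]


class ToyWeakSnapshot:
    """Sequential stand-in for nextConfig(c): remembers updates in arrival order."""

    def __init__(self) -> None:
        self._updates: List[Config] = []

    def update(self, proposal: Config) -> None:
        if proposal not in self._updates:
            self._updates.append(proposal)

    def scan(self) -> FrozenSet[Config]:
        # In this toy, every scan taken after the first update includes it.
        return frozenset(self._updates)

    def first_update(self) -> Optional[Config]:
        # Illustration only: not observable by processes in the real protocol.
        return self._updates[0] if self._updates else None


def established_sequence(initial: Config, next_config: Dict[Config, ToyWeakSnapshot]) -> List[Config]:
    """Follow the first update of each nextConfig object, starting from the initial config."""
    sequence = [initial]
    while True:
        snapshot = next_config.get(sequence[-1])
        nxt = snapshot.first_update() if snapshot else None
        if nxt is None:
            return sequence
        sequence.append(nxt)


C0, C1 = frozenset("ABCD"), frozenset("ABCDE")
C2, C3 = frozenset("ABC"), frozenset("ABCE")
next_config = {c: ToyWeakSnapshot() for c in (C0, C1, C2, C3)}
next_config[C0].update(C1)  # first proposal after C0
next_config[C0].update(C2)  # concurrent proposal, arrives second
next_config[C1].update(C3)
next_config[C2].update(C3)

print([sorted(c) for c in established_sequence(C0, next_config)])
```

In this assumed run the established sequence is C0, C1, C3; C2 appears in scans of nextConfig(C0) but is not established.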
Transferring the State • A scan of nextConfig(c) returns a set of configs that follow c • If c is established, one config in the returned set is the next established config after c • Scanning nextConfig for each returned config returns a further set, etc. This creates a DAG of configurations • This DAG contains the sequence of established configs • A reconfiguration transfers state along all paths in the DAG • This guarantees that state is transferred along the sequence of established configurations
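The traversal described above can be sketched as a graph walk over scan results. This is a simplified, sequential illustration: the `scans` dictionary stands in for the results of scanning nextConfig(c), the function names are assumptions, and the sample DAG (C0 leading to C1 and C2, both leading to C3) is an assumed scenario with two concurrent reconfigurations. The actual protocol interleaves scans with quorum reads and writes, which are omitted here.

```python
from typing import Dict, FrozenSet, List, Set

Config = FrozenSet[str]


def reachable_configs(start: Config, scans: Dict[Config, Set[Config]]) -> Set[Config]:
    """Collect every configuration on any path from start in the DAG of scan results."""
    visited: Set[Config] = set()
    frontier: List[Config] = [start]
    while frontier:
        config = frontier.pop()
        if config in visited:
            continue
        visited.add(config)
        frontier.extend(scans.get(config, set()))
    return visited


def final_configs(visited: Set[Config], scans: Dict[Config, Set[Config]]) -> Set[Config]:
    """Configurations with no successor: where the transferred state is finally written."""
    return {c for c in visited if not scans.get(c)}


C0, C1 = frozenset("ABCD"), frozenset("ABCDE")
C2, C3 = frozenset("ABC"), frozenset("ABCE")
scans = {C0: {C1, C2}, C1: {C3}, C2: {C3}, C3: set()}

to_read = reachable_configs(C0, scans)    # read from a majority of each of these
to_write = final_configs(to_read, scans)  # write the latest value to a majority here
print(len(to_read), [sorted(c) for c in to_write])  # 4 [['A', 'B', 'C', 'E']]
```

Because every path is covered, the walk necessarily covers the path of established configurations, whichever of the concurrent proposals turns out to be established.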
Example • Suppose that the initial configuration is C0 = {A, B, C, D} • A invokes reconfig({+E}); C invokes reconfig({–D}) • A updates nextConfig(C0) to C1 = {A, B, C, D, E} • A scans nextConfig(C0) to check for concurrent updates. The scan returns {C1}, i.e., no concurrent updates detected • C1 is the next established config after C0 • A’s state transfer: • Read from a maj. of C0 and a maj. of C1 • Write the latest value found to a maj. of C1
Example • Suppose that the initial configuration is C0 = {A, B, C, D}, with C1 = {A, B, C, D, E} • A invokes reconfig({+E}); C invokes reconfig({–D}) • Concurrently, C updates nextConfig(C0) to C2 = {A, B, C} and scans it. The scan returns {C1, C2}, implying that A’s update was concurrent • C updates nextConfig(C1) and nextConfig(C2) to C3 = {A, B, C, E}. No concurrent updates detected • C3 is an established configuration • C’s state transfer: • Read from a maj. of each config on every path found from C0 to C3 • Write the latest value found to a maj. of C3
Example • Suppose that the initial configuration is C0 = {A, B, C, D}, with C1 = {A, B, C, D, E}, C2 = {A, B, C}, and C3 = {A, B, C, E} • A invokes reconfig({+E}); C invokes reconfig({–D}) • A invokes a write(newValue) operation in C1 • In this scenario, DynaStore guarantees: • Either C’s state transfer finds newValue in C1, or A’s write op discovers C3 and ends after writing newValue to a maj. of C3 • Read operations also traverse the DAG, and will find newValue on the path of established configurations, intersecting the write
Conclusions • First “black box” definition of dynamic R/W register • In terms of events visible to user • A natural failure model – resilience changes dynamically • Possibly useful for specifying other dynamic problems • DynaStore: first asynch. dynamic storage protocol • Implements a Reconfigurable Atomic MWMR register • In a completely asynchronous system (consensus impossible) • Proves that R/W storage is really easier than consensus (not only in a static system)