RAMBO: A Reconfigurable Atomic Memory Service for Dynamic Networks
Nancy Lynch, MIT; Alex Shvartsman, U. Conn.
DISC 2002, October 29, 2002
Goal
• An algorithm to implement atomic read/write shared memory in a dynamic network setting.
• Participants may join, leave, or fail during computation.
• Mobile networks, peer-to-peer networks.
• High availability, low latency.
• Atomicity for all patterns of asynchrony and change.
• Good performance under reasonable limits on asynchrony and change.
• Applications:
  • Battle data for teams of soldiers in a military operation.
  • Game data for players in a multiplayer game.
Approach: Dynamic Quorums
• Objects are replicated at several network locations.
• To accommodate small, transient changes:
  • Uses quorum configurations: members, read-quorums, write-quorums.
  • Maintains atomicity during stable situations.
  • Allows concurrency.
• To handle larger, more permanent changes:
  • Reconfigure.
  • Maintains atomicity across configuration changes.
  • Any configuration can be installed at any time.
  • Reconfigure concurrently with reads/writes; no heavyweight view change.
RAMBO
• RAMBO: Reconfigurable Atomic Memory for Basic Objects (dynamic atomic read/write shared memory).
• Global service specification: [diagram: RAMBO]
• Algorithm:
  • Reads and writes objects.
  • Chooses new configurations, notifies members.
  • Identifies, garbage-collects obsolete configurations.
  • All concurrently.
RAMBO algorithm structure
[Diagram labels: RAMBO, Recon, Net]
• Main algorithm + reconfiguration service
• Loosely coupled
• Recon service:
  • Provides the main algorithm with a consistent sequence of configurations.
• Main algorithm:
  • Handles reading, writing.
  • Receives, disseminates new configuration information; no formal installation.
  • Garbage-collects old configurations.
  • Reads/writes may use several configurations.
Main algorithm: Reads/writes
• Uses two-phase strategy [Attiya, Bar-Noy, Dolev 96]:
  • Phase 1: Collect object values from read-quorums of active configurations.
  • Phase 2: Propagate latest value to write-quorums of active configurations.
• Operations may execute concurrently.
• Quorum intersection properties guarantee atomicity.
• Our communication mechanism:
  • Background gossiping.
  • Terminate by fixed-point condition, involving a quorum from each active configuration.
Removing old configurations
• Main algorithm removes old configurations by garbage-collecting them in the background.
• Two-phase garbage-collection procedure:
  • First phase:
    • Inform write-quorum of old configuration about the new configuration.
    • Collect object values from read-quorum of the old configuration.
  • Second phase:
    • Propagate the latest value to a write-quorum of the new configuration.
• Garbage-collection concurrent with reads/writes.
• Implemented using gossiping and fixed points.
Implementation of Recon
[Diagram labels: Recon, Consensus, Net]
• Uses distributed consensus to determine successive configurations 1, 2, 3, …
• Members of old configuration propose new configuration.
• Proposals reconciled using consensus.
• Consensus is a heavyweight mechanism, but:
  • Used only for reconfigurations, which are infrequent.
  • Does not delay read/write operations.
Implementation of consensus
[Diagram: Consensus service with init(v) inputs and decide(v) output]
• Use a version of the Paxos algorithm [Lamport 89, 98, 02].
• Agreement, validity guaranteed absolutely.
• Termination guaranteed if/when underlying system stabilizes.
Models and analysis
• I/O automaton models.
• Prove atomicity for arbitrary patterns of asynchrony and change.
• Analyze performance conditionally, based on failure and timing assumptions.
• Reads and writes take time at most 8d, where d bounds the message delay, under reasonable “steady-state” assumptions.
Other approaches
• Use consensus to agree on total ordering of operations: [Lamport 89…]
  • Not resilient to transient failures.
  • Termination of reads/writes depends on termination of consensus.
• Totally-ordered broadcast over group communication: [Amir, Dolev, Melliar-Smith, Moser 94], [Keidar, Dolev 96]
  • View formation takes a long time, delays reads/writes.
  • One change may trigger view formation.
• Dynamic quorums over group communication: [De Prisco, et al, 99]
  • New view must satisfy intersection requirements.
• Single reconfigurer: [Lynch, Shvartsman 97], [Englert, Shvartsman 00]
Outline of talk
1. Introduction
2. Reconfigurable Atomic Memory (RAMBO) specification
3. Reconfiguration service (Recon) specification
4. Implementation of RAMBO using Recon
5. Proof of atomicity
6. Implementation of Recon
7. Conditional performance results
8. Conclusions
2. RAMBO Service Specification
• I, infinite set of participants’ locations
• X, set of objects
• C, configuration identifiers
• External actions for each i and x:
  • Inputs: $\text{join}_{x,i}$, $\text{read}_{x,i}$, $\text{write}(v)_{x,i}$, $\text{recon}(c,c')_{x,i}$
  • Outputs: $\text{join-ack}_{x,i}$, $\text{read-ack}(v)_{x,i}$, …, $\text{report}(c)_{x,i}$
  • Ignore joins in this talk.
• Behavior:
  • Assuming basic well-formedness conditions, RAMBO guarantees atomicity.
  • Liveness replaced by latency bounds.
Atomicity
• AKA linearizability
• Definition: Each operation appears to occur at some point between its invocation and response.
• Sufficient condition: For each object x, all the read and write operations for x can be partially ordered by $\prec$, so that:
  • $\prec$ is consistent with the order of invocations and responses: there are no operations $\pi_1$, $\pi_2$ such that $\pi_1$ completes before $\pi_2$ starts, yet $\pi_2 \prec \pi_1$.
  • All write operations are ordered with respect to each other and with respect to all the reads.
  • Every read returns the value of the last write preceding it in $\prec$.
Implementing RAMBO
• Composition of separate service for each x.
• RAMBO (for x) uses separate Recon service (for x):
[Diagram labels: read, write; RAMBO; new-config; Recon; recon; Net]
3. Recon Service Specification
• External actions for each i:
  • Inputs: $\text{recon}(c,c')_i$
  • Outputs: $\text{recon-ack}_i$, $\text{report}(c)_i$, $\text{new-config}(c,k)_i$
  • And some joining actions (ignore)
• Behavior:
  • Assuming well-formedness, Recon produces consistent configuration identifiers at participating locations:
    • Agreement: Two configs never assigned to same k.
    • Validity: Any announced new-config was previously requested by someone.
    • No duplication: No configuration is assigned to more than one k.
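The three Recon guarantees can be read as checks over a recorded trace. The sketch below is illustrative only: `check_recon_trace` and its trace format are hypothetical, and it simplifies validity by ignoring the relative order of requests and announcements.

```python
def check_recon_trace(requests, announcements):
    """Check a recorded trace against the three Recon properties.

    requests:      iterable of (c, c_new) pairs, one per recon(c, c') input.
    announcements: iterable of (c, k) pairs, one per new-config(c, k) output.
    Returns the sorted names of violated properties (empty list if none).
    ASSUMPTION: event ordering is ignored, so validity only checks that the
    configuration was requested somewhere in the trace.
    """
    requested = {c_new for _, c_new in requests}
    by_index = {}    # k -> configuration announced for k
    by_config = {}   # configuration -> index it was announced for
    violations = set()
    for c, k in announcements:
        if by_index.setdefault(k, c) != c:
            violations.add("agreement")        # two configs for the same k
        if c not in requested:
            violations.add("validity")         # announced but never requested
        if by_config.setdefault(c, k) != k:
            violations.add("no duplication")   # same config at two indices
    return sorted(violations)
```

Repeated announcements of the same (c, k) pair at different locations are allowed; only conflicting ones are flagged.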
4. Implementing RAMBO using Recon
• Recon
  • Chooses configurations.
  • Tells members of the previous and new configuration.
  • Informs Reader-Writer components (new-config).
• Reader-Writer
  • Conducts read and write operations.
  • Two-phased quorum-based algorithm.
  • Uses all current configurations.
  • Garbage-collects obsolete configurations.
Static Reader-Writer protocol
• Quorum configuration for I:
  • read-quorums, write-quorums, two collections of subsets of I
  • For any R in read-quorums, W in write-quorums, $R \cap W \neq \emptyset$.
• Replicate the object x at all locations in I.
• At each i in I, keep:
  • value
  • tag, consisting of (sequence number, location)
• Read, Write use two phases:
  • Phase 1: Read (value, tag) from a read-quorum
  • Phase 2: Write (value, tag) to a write-quorum
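The intersection requirement on a configuration can be checked mechanically. This is a minimal sketch: the function names are hypothetical, and the majority construction is just one familiar way to obtain intersecting quorums, not something the talk prescribes.

```python
from itertools import combinations, product

def quorums_intersect(read_quorums, write_quorums):
    """True if every read-quorum shares a location with every write-quorum."""
    return all(set(r) & set(w) for r, w in product(read_quorums, write_quorums))

def majorities(locations):
    """Hypothetical helper: all minimal majority subsets of the locations."""
    locations = sorted(locations)
    size = len(locations) // 2 + 1
    return [set(q) for q in combinations(locations, size)]
```

For example, `quorums_intersect(majorities({1, 2, 3}), majorities({1, 2, 3}))` holds because any two majorities of the same set overlap.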
Static Reader-Writer protocol
• Write at location i:
  • Phase 1:
    • Read (value, tag) from a read-quorum.
    • Determine largest seq-number among the tags that are read.
    • Choose new-tag := (larger sequence-number, i).
  • Phase 2:
    • Propagate (new-value, new-tag) to a write-quorum.
• Read at location i:
  • Phase 1:
    • Read (value, tag) from a read-quorum.
    • Determine largest (value, tag) among those read.
  • Phase 2:
    • Propagate this (value, tag) to a write-quorum.
    • Return value.
• Highly concurrent.
• Quorum intersection implies atomicity.
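The two phases of the static protocol can be sketched as a sequential, in-memory simulation. The `Replica` class and function signatures are hypothetical; there is no network or concurrency here, and the caller chooses which quorums respond.

```python
class Replica:
    """One location's copy of the object: a value and a (seq, location) tag."""
    def __init__(self):
        self.value, self.tag = None, (0, 0)

    def query(self):
        return self.value, self.tag

    def update(self, value, tag):
        # Adopt the incoming pair only if its tag is lexicographically larger.
        if tag > self.tag:
            self.value, self.tag = value, tag

def write(replicas, read_quorum, write_quorum, i, new_value):
    # Phase 1: find the largest sequence number held by a read-quorum.
    max_seq = max(replicas[j].query()[1][0] for j in read_quorum)
    new_tag = (max_seq + 1, i)
    # Phase 2: propagate the new value and tag to a write-quorum.
    for j in write_quorum:
        replicas[j].update(new_value, new_tag)
    return new_tag

def read(replicas, read_quorum, write_quorum):
    # Phase 1: pick the (value, tag) pair with the largest tag.
    value, tag = max((replicas[j].query() for j in read_quorum),
                     key=lambda pair: pair[1])
    # Phase 2: propagate that pair to a write-quorum before returning.
    for j in write_quorum:
        replicas[j].update(value, tag)
    return value
```

The read's second phase is what lets a later read, using any other read-quorum, see a tag at least as large.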
Extend to dynamic setting
• Any member of current configuration can propose a new configuration.
• Recon produces consistent configurations.
• Reader-Writer processes run two-phase static quorum-based algorithm, using all current configurations.
• Uses gossip and fixed-point tests.
• When Recon provides new configuration, Reader-Writer doesn’t abort reads/writes in progress, but does extra work to access additional processes needed for new quorums.
Configurations and Config Maps
• Configuration c
  • members(c): “owners” of the data in configuration c
  • read-quorums(c)
  • write-quorums(c)
• Configuration map cm
  • Sequence of configurations cm(k)
  • Each entry can be defined, undefined ($\bot$), or garbage-collected (±)
[Figure: a configuration map drawn as a row of entries (±, c, …), with region labels GC’d, Defined, Mixed, Undefined]
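Configuration maps
[Figure: example configuration maps, one per row. Early rows hold c0, then c0 c1, then c0 c1 c2 … ck; later rows replace the leading entries with ± one at a time (± c1 c2 … ck; ± ± c2 … ck; ± ± ± c3 … ck); the last row is a run of ± entries followed by defined configurations c.]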
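One way to picture how such a map evolves is as a list indexed by k. In this sketch the names `install`, `collect`, and `show` are hypothetical, and the ASCII strings `"_|_"` and `"+-"` stand in for the undefined and garbage-collected symbols.

```python
UNDEFINED = "_|_"   # stand-in for the "undefined" entry
GCED = "+-"         # stand-in for the "garbage-collected" entry

def install(cmap, k, name):
    """Return a copy of cmap with entry k set to name, if k was undefined."""
    cmap = list(cmap) + [UNDEFINED] * (k + 1 - len(cmap))
    if cmap[k] == UNDEFINED:
        cmap[k] = name
    return cmap

def collect(cmap, k):
    """Return a copy of cmap with entry k marked garbage-collected."""
    cmap = list(cmap)
    cmap[k] = GCED
    return cmap

def show(cmap):
    """Render the map as one row of entries."""
    return " ".join(cmap)
```

Installing c0, c1, c2 and then collecting index 0 yields the rows `c0`, `c0 c1`, `c0 c1 c2`, and `+- c1 c2`.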
Reader-Writer state
• world
• value, tag
• cmap
• pnum1, counts phases of locally-initiated operations
• pnum2, records latest known phase numbers for all locations
• op-record, keeps track of the status of a current locally-initiated read/write operation
  • Includes op.cmap, consisting of consecutive configs.
• gc-record, keeps track of the status of a current locally-initiated garbage-collection operation
Reader-Writer protocol
• One kind of message, gossiped nondeterministically.
• Message <W, v, t, cm, ns, nr> from i to j, where:
  • W is i’s world
  • v, t are i’s value and tag
  • cm is i’s cmap
  • ns is i’s phase number, pnum1
  • nr is the latest phase number i knows for j, pnum2(j)
• (ns, nr) used to identify “fresh” messages.
• Key actions are taken when “enough” information has been gathered (fixed point).
When <W, v, t, cm, ns, nr> arrives from j:
• world := world $\cup$ W
• if t > tag then (value, tag) := (v, t)
• cmap := update(cmap, cm)
  • Updates cmap with newer information in cm.
• pnum2(j) := max(pnum2(j), ns)
• gc-record: If message is “fresh”, record the sender.
• op-record: If message is “fresh”:
  • Record the sender.
  • Extend op.cmap with newly-discovered configurations.
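The receive step can be sketched as a single function over a small state record. Several details here are assumptions rather than something the slides state: entries are merged pointwise under the order undefined < defined < garbage-collected, a message counts as "fresh" when `nr >= pnum1`, and the gc-record and op.cmap updates are omitted. `Node`, `receive`, and the ASCII stand-ins are hypothetical names.

```python
from dataclasses import dataclass, field

UNDEFINED, GCED = "_|_", "+-"   # ASCII stand-ins for the two special entries

def rank(entry):
    # ASSUMPTION: information order is undefined < defined < garbage-collected.
    return 0 if entry == UNDEFINED else 2 if entry == GCED else 1

def update(cmap, other):
    """Pointwise merge that keeps the more informative entry at each index."""
    n = max(len(cmap), len(other))
    pad = lambda m: list(m) + [UNDEFINED] * (n - len(m))
    return [y if rank(y) > rank(x) else x for x, y in zip(pad(cmap), pad(other))]

@dataclass
class Node:
    world: set = field(default_factory=set)
    value: object = None
    tag: tuple = (0, 0)
    cmap: list = field(default_factory=list)
    pnum1: int = 0
    pnum2: dict = field(default_factory=dict)
    op_acc: set = field(default_factory=set)   # senders recorded for the current phase

def receive(node, j, msg):
    W, v, t, cm, ns, nr = msg
    node.world |= set(W)
    if t > node.tag:
        node.value, node.tag = v, t
    node.cmap = update(node.cmap, cm)
    node.pnum2[j] = max(node.pnum2.get(j, 0), ns)
    # ASSUMPTION: "fresh" means the sender already knows our current phase number.
    if nr >= node.pnum1:
        node.op_acc.add(j)
```

A stale message still contributes world, tag, and cmap information; it is only excluded from the set of senders counted toward the current phase.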
Processing reads and writes
• Reads and Writes perform Query and Propagation phases using known configurations, stored in op.cmap.
• Query phase: Obtains fresh value, tag, cmap information from read-quorums.
• Propagation phase: Propagates up-to-date (value, tag) to write-quorums; obtains fresh cmap information from write-quorums.
• Both phases: Extend op.cmap with newly-discovered configurations; new configurations are also used in the phase.
• Each phase ends with a fixed point, after hearing from quorums of all the configurations currently in op.cmap.
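The fixed-point test that ends a phase might look like the sketch below. It follows the slide's description (read-quorums for the query phase, write-quorums for the propagation phase); the dictionary layout and the name `phase_done` are hypothetical.

```python
def phase_done(op_cmap, acc, phase):
    """True once some quorum of every configuration in op_cmap has replied.

    op_cmap: list of configurations, each a dict with "read_quorums" and
             "write_quorums" (lists of sets of locations).
    acc:     set of locations that have sent a fresh message this phase.
    phase:   "query" (needs read-quorums) or "prop" (needs write-quorums).
    """
    key = "read_quorums" if phase == "query" else "write_quorums"
    return all(any(q <= acc for q in config[key]) for config in op_cmap)
```

Because op.cmap can grow while a phase is running, the test has to be re-evaluated whenever a new configuration is added, not only when a new reply arrives.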
Garbage collection
• A process can try to GC config k when its cmap looks like: ± … ± $c_k$ $c_{k+1}$ …
• Phase 1:
  • Informs a write-quorum of $c_k$ about $c_{k+1}$.
  • Collects latest (value, tag) from a read-quorum of $c_k$.
• Phase 2:
  • Propagates (value, tag) to a write-quorum of $c_{k+1}$.
• Set cmap(k) to ±.
• GC operates concurrently with reads and writes.
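The precondition and the two GC phases can be sketched as a sequential simulation. The data layout is hypothetical: each replica is a dict with `"value"`, `"tag"`, and a `"known"` set of configuration names, and each configuration supplies one read-quorum and one write-quorum that are assumed to respond.

```python
GCED, UNDEFINED = "+-", "_|_"   # ASCII stand-ins for the two special entries

def can_gc(cmap, k):
    """cmap must look like: GC'd entries, then defined entries at k and k+1."""
    defined = lambda e: e not in (GCED, UNDEFINED)
    return (k + 1 < len(cmap)
            and all(e == GCED for e in cmap[:k])
            and defined(cmap[k]) and defined(cmap[k + 1]))

def garbage_collect(cmap, k, quorums, replicas):
    """Run both GC phases for index k and return the updated cmap."""
    assert can_gc(cmap, k)
    old, new = quorums[cmap[k]], quorums[cmap[k + 1]]
    # Phase 1a: tell a write-quorum of the old config about the new one.
    for j in old["write"]:
        replicas[j]["known"].add(cmap[k + 1])
    # Phase 1b: collect the latest (value, tag) from a read-quorum of the old config.
    latest = max((replicas[j] for j in old["read"]), key=lambda r: r["tag"])
    value, tag = latest["value"], latest["tag"]
    # Phase 2: propagate it to a write-quorum of the new config.
    for j in new["write"]:
        if tag > replicas[j]["tag"]:
            replicas[j]["value"], replicas[j]["tag"] = value, tag
    return cmap[:k] + [GCED] + cmap[k + 1:]
```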
5. Proof of Atomicity
• Atomicity holds for:
  • arbitrary patterns of asynchrony,
  • arbitrary crash-failures and message loss,
  • arbitrary joins.
• Proof: Construct partial order $\prec$ of read and write operations satisfying:
  • $\prec$ is consistent with the order of invocations and responses.
  • All write operations are ordered with respect to each other and with respect to all the reads.
  • Every read returns the value of the last write preceding it in $\prec$.
• Let $\prec$ be the lexicographic order on the operations’ tags, and order the write with tag t before all reads with tag t.
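The tag-based order is easy to state in code. This sketch sorts operations by tag with a write placed before reads of the same tag, then checks the "last preceding write" condition; the record format and function names are hypothetical, and sorting picks one linear extension of the partial order (reads with equal tags are left in arbitrary relative order).

```python
def order_key(op):
    """Sort key: lexicographic tag order, with a write before reads of its tag."""
    return (op["tag"], 0 if op["kind"] == "write" else 1)

def reads_return_last_write(ops, initial=None):
    """Check that each read returns the value of the last write ordered before it."""
    last = initial
    for op in sorted(ops, key=order_key):
        if op["kind"] == "write":
            last = op["value"]
        elif op["value"] != last:
            return False
    return True
```

Python compares tuples lexicographically, so a tag (sequence number, location) such as (1, 2) sorts before (2, 1).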
Showing consistency
• Lemma 1: Tags of GC operations are nondecreasing with respect to the configuration index.
  • Proof: GC is done sequentially.
• Lemma 2: If the first GC of config k completes before a read/write operation $\pi$ begins, then the tag of the GC is less than or equal to the tag of $\pi$ ($<$ if $\pi$ is a write).
• Lemma 3: If $\pi_1$ and $\pi_2$ are two read/write operations and $\pi_1$ completes before $\pi_2$ begins, then the tag of $\pi_1$ is less than or equal to the tag of $\pi_2$ ($<$ if $\pi_2$ is a write).
Proof of Lemma 3
• Assume $\pi_1$ and $\pi_2$ are two read/write operations and $\pi_1$ completes before $\pi_2$ begins.
• Each phase uses consecutive configurations.
• Case 1: prop-cmap($\pi_1$) and query-cmap($\pi_2$) share a configuration c.
  • Quorum intersection for c yields the tag inequality.
• Case 2: All configs in prop-cmap($\pi_1$) are less than all those in query-cmap($\pi_2$).
  • The tag inequality follows from a chain of tag inequalities, following a chain of GC operations for the intervening configurations. Uses Lemmas 1 and 2.
• Case 3: All configs in prop-cmap($\pi_1$) are greater than all those in query-cmap($\pi_2$).
  • Impossible.
6. Implementing Recon
• Recon algorithm uses (static) consensus services to determine configurations 1, 2, 3, …
• Cons(k,c): Used to determine config k, if config k-1 is c.
• Consensus is used only for reconfigurations, does not delay read and write operations.
[Diagram labels: recon, recon-ack; Recon, Consensus, Net]
Implementing Recon
• Simple: no atomicity issues.
• Members of old configuration may propose a new configuration; proposals reconciled using consensus.
• recon(c,c’): Request for reconfiguration from c to c’. If c is the (k-1)st configuration (and is current), then send init message to members and invoke Cons(k,c) with initial value c’.
• Receive an init message: Participate in consensus.
• decide(c’): Tell Reader-Writer the new configuration; send config message to members of c’.
• Receipt of config message: Tell Reader-Writer the new configuration.
• Consensus implemented using Paxos Synod algorithm.
7. Latency Analysis
Consider a subset of timed executions:
• Gossip occurs:
  • Periodically, and
  • At certain key times:
    • At beginning of operation phase.
    • Just after receiving a message from someone with a new phase number.
    • Just after certain join and reconfiguration events.
• Perform local steps immediately.
• Reliable message delivery, bounded delay.
• Normal timing for consensus services.
Additional assumptions
• e-Configuration-viability for time parameter e
  • A read-quorum and a write-quorum of configuration k remain alive, until at least time e after configuration k+1 is “installed” (decided upon by all non-failed members of configuration k).
• e-Reconfiguration-spacing
  • $\text{recon}(c,*)_i$ occurs at least e time after $\text{report}(c)_i$.
• e-Join-connectivity
  • If i and j join by time t then they learn about each other by time t+e.
Latency results
• Reconfiguration:
  • 13d, if $\text{recon}(c,c')_i$ occurs and no members of c subsequently fail.
• Garbage-collection of $c_k$ by process i:
  • 4d, if process i, a read-quorum and a write-quorum of $c_k$, and a write-quorum of $c_{k+1}$ do not fail.
• Read or write operation by process i in a “stable” system:
  • 4d, if no reconfigurations occur, and process i’s cmap is “up-to-date”.
• Learning about configurations:
  • If i and j are “old enough” and don’t fail, then information from i is conveyed to j within time 2d.
Latency results
• Garbage-collection, in executions with 6d-reconfiguration-spacing and 5d-configuration-viability:
  • If report(c) occurs at i and i does not fail, then any non-failed process that is old enough learns about c and garbage-collects any older configuration within time 6d.
• Read and write operations, in executions with 12d-reconfiguration-spacing and 11d-configuration-viability:
  • 8d, for an operation managed by a process that is old enough and does not fail.
8. Conclusions
• RAMBO algorithm
  • Composed of R/W algorithm, Recon service, Consensus.
  • Atomicity in all executions.
  • Good latency bounds:
    • For reading, writing, garbage-collection.
    • Under assumptions about timing, joins, failures, and rate of reconfiguration.
Algorithmic innovations
• Dynamic configurations:
  • Members can be changed dynamically.
  • Any current member may request reconfiguration.
  • Arbitrary configurations can be installed; no intersection requirements.
• Loosely-coupled reconfiguration:
  • Concurrent reading, writing, reconfiguration.
  • Reads/writes can use several configurations; can complete during reconfiguration.
• Efficient “steady-state”:
  • Assuming bounded delays, infrequent reconfiguration, and periodic gossip, read and write operations complete in time 8d.
Comparison with other approaches
• Using consensus to agree on a total ordering of operations:
  • We use consensus only for the configurations.
  • Consensus termination impacts only reconfiguration latency, not read and write latency.
• Group communication:
  • Our reads/writes work during “new view” establishment.
• Dynamic quorum configurations over group communication:
  • We allow arbitrary new configurations; no intersection requirements.
• Single reconfigurer approaches:
  • We allow multiple reconfigurers.
  • We uncouple introduction of new configurations and garbage-collection of old configurations.
Current and future work
• LAN implementation [Musial, Shvartsman]
• More analysis:
  • “Normal behavior” starting from some point.
  • Tradeoff between configuration-viability and garbage-collection rate.
• Algorithmic improvements and additions:
  • Concurrent garbage-collection [Gilbert]
  • Reducing communication.
  • Better join protocol, explicit “leave” protocol.
  • Early return of read values.
  • Backup strategies for when configuration-viability fails.
  • Choosing good configurations.
• Extensions to other data types?