170 likes | 191 Views
Flexible Update Propagation for Weakly Consistent Replication. Karin Petersen, Mike K. Spreitzer, Douglas B. Terry, Marvin M. Theimer and Alan J. Demers Presented by: Ryan Huebsch CS294-4 P2P Systems – 10/13/03. Outline. Anti-Entropy Goals Data Structures Ordering The Algorithm
E N D
Flexible Update Propagation for Weakly Consistent Replication Karin Petersen, Mike K. Spreitzer, Douglas B. Terry, Marvin M. Theimer and Alan J. Demers Presented by: Ryan Huebsch CS294-4 P2P Systems – 10/13/03
Outline • Anti-Entropy • Goals • Data Structures • Ordering • The Algorithm • Creation and Retirement • Discussion • Performance • P2P discussion/questions
Anti-Entropy • Entropy - a process of degradation or running down or a trend to disorder. • Bring 2 replicas up-to-date • Three Major Design Decisions • Pairwise communication between replicas • Exchange of update operations • Ordered propagation of operations
Goals • Support for arbitrary communication topologies • Operation over low-bandwidth networks • Incremental progress • Eventual consistency • Efficient storage management • Light-weight management of dynamic replica sets • Arbitrary policy choices
Data Structures • Replica: • Database • Write Log • Server: • Clock • V, O • CSN, OSN Database Committed (< CSN) Truncated (< OSN) Log A B C A B C V O Truncated Log Highest A.Clockfor server Athat is in log Highest A.Clock for server A that has been truncated …
Orderings • Prefix Property • If R has write Wi that was accepted by server X, it has all writes X accepted before Wi • Stable (Committed Order) • Decided by primary replica • Assigns the final CSN, which is < infinity • New CSN is propagated to nodes • Accept Order • Partial order of all writes accepted by a particular server • Accept stamp – logical or real-time clock
Orderings, continued • Causal-Accept Order • Accept-stamp is a logical clock • Clock is advanced when a write is received (through anti-entropy) that has a higher accept-stamp. • Provides better chances of a node seeing the same database from different servers • If they have the same writes, even if uncommitted, will be same order
The Algorithm (Quick Version) • R is being updated by S • S retrieves R.V and R.CSN • STEP 1: Decide if a full transfer is needed • IF (S.OSN > R.CSN) THEN [If S does have enough log] Rollback S’s database to the state corresponding to S.O [Remove all writes that S has a log for] OutputDatabase(S.DB) OutputVector(S.O) OutputOSN(S.OSN)[R now has the same database and truncated the write log to the same point as S]END
The Algorithm, continued • Step 2: Bring R up-to-date with remaining committed writes • IF R.CSN < S.CSN THEN[If R is missing committed writes] w = first write after CSNWHILE (w) DO IF w.accept-stamp <= R.V(w.server-id) THEN [Check R’s vector to see if it has the write]OutputCommitNotification(w) ELSE OutputWrite(w) END w = next commited write in S.log ENDEND
The Algorithm, continued • Step 3: Bring R up-to-date with remaining uncommitted writes • w = first tentative write in S.logWHILE (w) DO IF R.V(w.server-id) < w.accept-stamp THEN[Check R’s vector to see if has the write] OutputWrite(w) END w = next write in S.logEND • Step 4: Finish Up • OutputCSN(S.CSN)OutputVector(S.V)
Creation and Retirement • Treated just like a write (elegant) • Si is trying to join via server Sx • Sx creates a new write • <infinity, Tk,i, Sk> • Si is server id, <Tk,i, Sk> • Si sets clock to Tk,i + 1 • Notice the new server id is globally unique, recursive, and could be long • The write is propagated to other nodes through anti-entropy
Creation and Retirement, continued • Server S is updating server R • Server S.V has an entry for server Si (<Tk,i, Sk>), while R does not. • 2 Cases: • R has not seen the creation of Si • Then R.V(Sk) < Tk,i • S has not seen the retirement of Si • Then R.V(Sk) >= Tk,i • Why? Creation/Deletion is recorded as a normal write, thus the prefix property will hold. • Recursive naming helps too, if Sk retired, can still trace back and decide the proper state. This is explained as the virtual CompleteV in the paper.
Discussion, continued • Most properties are not special in themselves, the combination is novel • Different decisions are mostly independent • Ideas can be applied to other systems (other than Bayou) • Security • Use certificates to insure user can make update • Not much detail given • Used later on as an excuse for high overheads • Lots of policy decisions to be made • When to reconcile, with whom, when to truncate log
Performance • 1316 bytes of update overhead • 520 bytes for certificate • Network transfer most significant cost
Performance, continued • Hard to know if the numbers are good, nothing to compare them to • Would have been nice to see a larger deployment and measure propagation delay, consistency, etc.
P2P? • Is Anti-Entropy applicable to P2P systems? • Review the goals… arbitrary topology, low b/w, aggressive storage management… • There is a centralized component (the serializer)… is this okay? • Can it handle failures/churn? • Security, what happens if there is a faulty node?