1. Consistency without concurrency control in large, dynamic systems Marc Shapiro, INRIA & LIP6
Nuno Preguiça, Universidade Nova de Lisboa
Mihai Leția, ENS Lyon
Two things:
• no concurrency control
• large, dynamic systems
2. 2 Consistency without concurrency control Object x, operation f(x)
propose f(x1)
eventually replay f(x2), f(x3), ...
If f || g commute: converges safely without concurrency control
Commutative Replicated Data Type (CRDT): designed for commutative operations
Not same order at 1 and 2?
OK if
• concurrent f and g commute
Assuming causal delivery
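To make the convergence argument concrete, here is a minimal sketch (my illustration, not code from the talk or paper) of a grow-only set replica: because concurrent add operations commute, replicas that replay the same operations in different orders reach the same state without any concurrency control.

```python
# Minimal sketch (illustration only): a grow-only set replica.
# Concurrent add() operations commute, so replicas that replay the same
# operations in different orders converge without concurrency control.

class GSetReplica:
    def __init__(self):
        self.elements = set()

    def add(self, x):            # propose locally ...
        self.elements.add(x)
        return ("add", x)        # ... and ship the operation to other replicas

    def replay(self, op):        # eventually replay remote operations
        kind, x = op
        if kind == "add":
            self.elements.add(x)

r1, r2 = GSetReplica(), GSetReplica()
f = r1.add("f")                  # f(x1) proposed at replica 1
g = r2.add("g")                  # g(x2) proposed at replica 2
r1.replay(g)                     # the two replicas see f and g in different orders
r2.replay(f)
assert r1.elements == r2.elements   # converged: {"f", "g"}
```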
3. 3 A sequence CRDT Treedoc = sequence of elements:
insert-at-pos, delete
Commutative when concurrent
Minimise overhead
Scalable
A commutative replicated data type for cooperative editing, ICDCS 2009
Focus today:
Garbage collection
vs. scale
I will just skim the surface of the Treedoc design
Refer to the ICDCS paper for the details
4. 4 Commutative updates insert adds a leaf ==> non-destructive, TIDs don't change
Thanks to non-destructive updates and immutable TIDs: concurrent updates commute
Efficient: data structures and TID lengths logarithmic *if balanced*
Ignoring lots of details, e.g. concurrent inserts at the same position (see paper)
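A minimal sketch of the non-destructive update idea, under simplifying assumptions (binary tree, single site, no disambiguators; not the full Treedoc algorithm): a TID is the path from the root, an insert only adds a fresh leaf, and a delete only marks a tombstone, so existing TIDs never change and concurrent updates can be replayed in any order.

```python
# Sketch of the Treedoc idea under simplifying assumptions (binary tree, no
# disambiguators, single site). A TID is the path from the root; insert adds a
# fresh leaf, so existing TIDs never change; delete only marks a tombstone.
# The document is the infix traversal of the non-deleted nodes.

class Node:
    def __init__(self, atom):
        self.atom, self.deleted = atom, False
        self.left = self.right = None

def insert(root, tid, atom):
    """Insert a new leaf at path `tid` (a sequence of 0/1 bits)."""
    node = root
    for bit in tid[:-1]:
        node = node.left if bit == 0 else node.right
    leaf = Node(atom)
    if tid[-1] == 0:
        node.left = leaf
    else:
        node.right = leaf

def delete(root, tid):
    """Non-destructive delete: keep the node as a tombstone."""
    node = root
    for bit in tid:
        node = node.left if bit == 0 else node.right
    node.deleted = True

def contents(node):
    if node is None:
        return []
    me = [] if node.deleted else [node.atom]
    return contents(node.left) + me + contents(node.right)

root = Node("b")
insert(root, [0], "a")      # TID [0]: before the root atom
insert(root, [1], "c")      # TID [1]: after the root atom
delete(root, [0])
print(contents(root))       # ['b', 'c']
```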
5. 5 Wikipedia GWB page: space overhead GWB: most edited page
Edits translated into Treedoc inserts/deletes
• Tree unbalanced, long TIDs, lots of tombstones: not logarithmic
6. 6 Rebalance In this example rebalancing is not spectacular.
Imagine a deep unbalanced tree with lots of tombstones: large effect.
Why rebalance:
• Unbalanced tree costs time, space
• Long TIDs
• Tombstone overhead
7. 7 Rebalance Invalidates TIDs:
Frame of reference = epoch
Requires agreement
Pervasive!
e.g. Vector Clocks
TID changed: R was e, now 10
Pervasive problem:
• asynchronous updates ==> old data structures
• see cleaning up Vector Clocks
(Background colour indicates epoch)
8. 8 Rebalance in large, dynamic systems Rebalance requires consensus
Consensus requires small, stable membership
Large communities?!
Dynamic scenarios?!
Solution: two tiers
Core: rebalancing (and updates)
Nebula: updates (no rebalancing)
Migration protocol
Core: controls rebalancing
9. 9 Core Nebula Group membership
Small, stable
Rebalance:
Unanimous agreement (2-phase commit)
All core sites in same epoch
Migrate core to nebula: just leave group
10. 10 Core Nebula Migrate from nebula to core: migrate to core epoch + join group
11. 11 Catch-up protocol summary Here is the basic insight behind the migration protocol
• Replay core's updates: N now in same state as C before rebalance
• Replay rebalance: *ignoring concurrent N updates*, has same effect as in C ==> same TIDs, same epoch
• Transform buffer: now ready to be replayed in C (in different order, but that's OK since they commute)
• N now in C state, can join the core or remain in nebula
Furthermore, updates are idempotent (multiple catch-ups cause no harm)
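A schematic sketch of the catch-up steps listed above; the names and the `transform` argument are illustrative placeholders, not the actual protocol interface. A nebula replica N buffers its own old-epoch updates, replays the core's updates and the rebalance, then re-expresses its buffered updates in the new epoch.

```python
# Schematic sketch of the catch-up steps (illustrative names, not the paper's API).

def catch_up(nebula, core_updates, rebalance, transform):
    buffered = list(nebula.pending_updates)      # N's own old-epoch updates

    for op in core_updates:                      # 1. replay core's updates:
        nebula.replay(op)                        #    N reaches core's pre-rebalance state

    nebula.replay(rebalance)                     # 2. replay rebalance, ignoring N's own
                                                 #    concurrent updates ==> same TIDs, same epoch

    new_epoch_ops = [transform(op, rebalance)    # 3. transform buffered ops so the core
                     for op in buffered]         #    can replay them in the new epoch

    return new_epoch_ops                         # 4. send to core; N may join the core
                                                 #    or remain in the nebula
```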
12. 12 Catch-up protocol del in old epoch; ins + ins in old epoch
rebalance starts new epoch
13. 13 Catch-up protocol del in old epoch; ins + ins in old epoch
rebalance starts new epoch
14. 14 Catch-up protocol white = old epoch
pink = new epoch
15. 15 Catch-up protocol del uttered in old epoch ==> can send to S
S replays del; now up to date with Core
send rebalance
S replays rebalance; intervening ins move; S now in new epoch
16. 16 Catch-up protocol ins arguments transformed to new epoch
Core replays ins
17. 17 Summary CRDT:
Convergence ensured
Design for commutativity
GC cannot be ignored
Requires commitment
Pervasive issue
Large-scale commitment:
Core / Nebula
To synchronise: catch-up + migration
CRDT = non-locking synchronisation in weak memory model
18. 18 Future work More CRDTs
Understanding CRDTs: what invariants can be CRDTized
Approximations of CRDTs
Data types for consistent cloud computing without concurrency control
19. 19
20. 20 Replicated sequence Illustrating with a simple shared editing scenario. Applies to any type of sequence.
21. 21 State of the art Serialising updates
Single, global execution order
Lock-step: Poor user experience
Doesn't scale
Operational Transformation
Local execution orders
Modify arguments to take into account concurrent operations scheduled before
Weak theory, hidden complexity
Insight: design the data type to be commutative
Serialisation is slow and occurs on the critical path of applications. Atom identifiers must have the same meaning on initial and replay sites: lock-step! or OT.
One approach would be to increase coarseness, e.g. synchronise per line instead of per character. This may be OK for some applications, but NOT for concurrent editing. If the edit atom is the line (for instance), the user suffers artefacts at line boundaries, e.g. can't easily delete an end-of-line, can't easily split a line.
22. 22 Commutative Replicated Data Type (CRDT) Assuming:
All concurrent operations commute
Non-concurrent operations execute in happens-before order
All operations eventually execute at every replica
Then replicas eventually converge to correct value
Design data types to support commutative operations
23. 23 Concurrent inserts Exceptions to binary tree: disambiguator
Concurrent inserts ordered by disambiguator
Path = ⟨ [0|1], disambiguator? ⟩* (disambiguator derived from the site-ID or a unique operation ID; see the alternatives below)
Alternatives:
site identifier of initiator: short, but delete leaves a tombstone
or: Unique ID of operation: long, immediate delete
When an operation has been executed at all sites, it is *stable*.
Stability detection: Golding's algorithm. Must hear from all sites (directly or indirectly).
Node is stable when all concurrent inserts at that node are stable.
Once a node is stable: if there was a single conc. insert, remove all exceptions.
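A hypothetical illustration of the disambiguator idea (simplified: it only shows the deterministic tie-break, not the full exception structure): concurrent inserts at the same tree position each carry a disambiguator, and every replica orders them by that disambiguator, whatever the order of arrival.

```python
# Hypothetical illustration (not the paper's data structure): concurrent inserts
# at the same position carry (position bit, disambiguator); replicas break the
# tie by comparing disambiguators, so all replicas order the atoms identically.

concurrent_inserts = [
    {"pos_bit": 1, "disambiguator": "site-B", "atom": "y"},
    {"pos_bit": 1, "disambiguator": "site-A", "atom": "x"},
]

# Deterministic tie-break: same result at every replica, whatever the arrival order.
ordered = sorted(concurrent_inserts,
                 key=lambda n: (n["pos_bit"], n["disambiguator"]))
print([n["atom"] for n in ordered])   # ['x', 'y'] at every replica
```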
24. 24 Causal ordering Vector clocks:
Number of messages received from each site
Causal ordering
Filter duplicate messages
Efficient but grow indefinitely
Treedoc
TID encodes causal order
Duplicates idempotent
Approximate Vector Clocks + Treedoc
Approximate vector clocks:
fixed size (scalability)
incomplete filter (duplicates ignored by Treedoc)
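One possible realisation of a fixed-size, incomplete duplicate filter in the spirit of this slide; this is an assumption for illustration, not necessarily the construction used with Treedoc. Untracked sites cannot be checked, so their messages are always replayed; a missed duplicate is harmless because Treedoc operations are idempotent.

```python
# Hypothetical sketch of a fixed-size, incomplete duplicate filter (assumption,
# not the actual construction): track counters for at most MAX_SITES sites; for
# untracked sites we cannot decide, so we accept the message. A missed duplicate
# is harmless because Treedoc operations are idempotent.

MAX_SITES = 4

class ApproxClock:
    def __init__(self):
        self.seen = {}                       # site -> highest counter seen

    def accept(self, site, counter):
        """Return True if the message should be replayed."""
        if site in self.seen:
            if counter <= self.seen[site]:
                return False                 # definitely a duplicate
            self.seen[site] = counter
            return True
        if len(self.seen) < MAX_SITES:
            self.seen[site] = counter        # start tracking this site
        return True                          # possibly a duplicate: replay anyway

clk = ApproxClock()
print(clk.accept("site-A", 1))   # True: first message from site-A
print(clk.accept("site-A", 1))   # False: exact duplicate detected
```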
25. 25 Rebalance requires commitment Commutativity of update || rebalance
Updates win
Rebalance does not impact update performance
Rebalance: unanimous agreement
Standard 2- or 3-phase commit
Initiator is coordinator
Other site: If concurrent update: “Not OK”
Off critical path!
Rebalance changes frame of reference ==> requires unanimous consensus
To avoid synchronisation in critical path: updates win, rebalance loses
Any site can initiate
Standard 2- or 3-phase commit
Off critical path: no concurrency control (in critical path)
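A minimal sketch of the "updates win" vote, with illustrative names rather than the actual protocol messages: a core site votes OK only if it has no concurrent update pending, and the rebalance commits only on a unanimous vote, collected by a standard two-phase commit kept off the update critical path.

```python
# Minimal sketch of the "updates win" rule (illustrative names): rebalance
# commits only if every core site votes OK, i.e. has no concurrent update.

class CoreSite:
    def __init__(self, name, pending_updates=0):
        self.name, self.pending_updates = name, pending_updates

    def vote(self):
        return "OK" if self.pending_updates == 0 else "Not OK"

def try_rebalance(core_sites):
    votes = [s.vote() for s in core_sites]          # phase 1: collect votes
    if all(v == "OK" for v in votes):               # unanimity required
        return "commit rebalance"                   # phase 2: commit
    return "abort rebalance (updates win)"          # phase 2: abort

print(try_rebalance([CoreSite("A"), CoreSite("B")]))                     # commit
print(try_rebalance([CoreSite("A"), CoreSite("B", pending_updates=1)]))  # abort
```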
26. 26 Experiments Estimate overheads, compare design alternatives:
Atom granularity: word vs. line
Disambiguator: siteID+tombstone vs. unique ID+immediate delete
Are trees unbalanced?
Effect of rebalance, heuristics
Replay update history of CVS, SVN, Wiki repositories
27. 27 Implementation alternatives Disambiguator: next slide
Atom: character, word, line, paragraph
Fine-grain: structural overhead
Coarse-grain: “delete” artefacts
Data structure:
Tree: no TID storage, no GC interior nodes
vs. { (TID, atom) }: flexible, GC
Mixed
Arity: binary vs. 256-ary
28. 28 Disambiguator design Disambiguator options:
1 byte, no GC until rebalance
or 10 bytes, immediate GC (if leaf)
Stored in every node
Intuitively, which do you think is best? I thought that the most important thing was to avoid overhead. A node is small, keep it small, but a few extra nodes are no big deal. In fact, deletes are so frequent that option 2 (unique ID, immediate delete) is superior.
29. 29 LaTeX files LaTeX files, in a collaborative group. Probably average size for a scientific publication (10-15 output pages). Relatively short history: each new paper starts its history from scratch.
30. 30 LaTeX / line Notice how, even in this collaborative environment, deletes are numerous. Partially an artefact of line granularity.
31. 31 Summary: % deleted nodes (mean) Again, deletes are frequent.
Conclusion: deletes must be efficient space-wise.
32. 32 Atom granularity and deletes Every update ==> delete atom + insert
The amount of deletes is only partially an artefact of atom granularity
33. 33 Wikipedia GWB benchmark en.wikipedia.org/George_W_Bush
150 kB text
42,000 revisions: most frequently revised
Biggest revision: 100 kB
Benchmark data
Treedoc node = paragraph
First 15,000 revisions = 350,000 updates
Biggest revision < 2 s; average: 0.24 s/revision
Rebalance every 1,000 revisions
256-ary tree
Show that GC is an important problem, which cannot be ignored
Stopped at 15,000 revisions because of limitations of the download programme
34. 34 Wikipedia GWB page Even more than SVN: lots of deletes
Wikipedia: delete/recreate war. This results in a LOT of tombstones. Conclusion: it's very important to deal efficiently with deletes.
35. 35 Time per operation Rebalance is essential to efficiency (10× improvement)
If the tree were well balanced we would expect logarithmic growth. Instead it is super-linear. Obviously the tree is not balanced.
GWB, paragraph granularity, 256-ary tree
36. 36 Size, frequency of TIDs • Size: Each child has an 8-bit ID relative to its parent. All IDs ≤ 127 are to the left of the parent, IDs ≥ 128 to its right.
• If the tree were balanced, TID size should grow logarithmically. In fact it grows linearly.
• Why does the curve flatten out? Mystified: deletes create lots of empty room? Or a bug?
• Frequency: Spikes <== "insert to the right of parent", "insert to the right of predecessor". Explains why the tree is unbalanced.
GWB, paragraph granularity, 256-ary tree
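A small illustration of the 8-bit child-ID encoding described in the first bullet (assuming the ≤ 127 / ≥ 128 reading above): each TID component is one byte relative to its parent, and the byte value determines on which side of the parent the child is ordered.

```python
# Illustration of the 8-bit child-ID encoding in a 256-ary node: values up to
# 127 denote a child ordered to the left of its parent, values from 128 a child
# ordered to its right.

def side(child_id: int) -> str:
    assert 0 <= child_id <= 255          # 8-bit ID relative to the parent
    return "left of parent" if child_id <= 127 else "right of parent"

print(side(5))      # left of parent
print(side(200))    # right of parent
```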
37. 37 flat vs. WOOT vs. Treedoc
38. 38 Summary: garbage collection Efficiency, GC are important
Tree + re-balance
Requires commitment (move off critical path)
Pervasive issue
Large-scale commitment:
Core: commitment
Nebula: asynchronous updates
Occasional synchronisation: migration
39. 39 Summary: CRDT CRDT:
Convergence ensured
Design for commutativity
Techniques for commutativity:
Partial order
Non-destructive updates
Identifiers don't depend on concurrent activity
Consensus: off critical path
40. 40 Commutativity: genuine vs. precedence Commutativity: ∀S, u, v : S.u.v = S.v.u
Genuine: both u and v take effect
“Addition, subtraction commutative”
Non-destructive updates
Precedence: only one of u, v ultimately takes effect
“File writes commute under Last-Writer-Wins”
Destructive updates
Commutative: For all states S, and all operations u and v such that S.u and S.v are states, S.u.v and S.v.u are both states, and are equivalent.
(Equivalence and state are defined in an application-dependent manner.)
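A small sketch of precedence commutativity with a Last-Writer-Wins register (my illustration): applying two concurrent writes in either order yields the same state, but only the write with the larger timestamp ultimately takes effect.

```python
# Illustration of precedence commutativity: with a Last-Writer-Wins register,
# concurrent writes commute, because only the write with the larger timestamp
# ultimately survives (the other never takes effect).

class LWWRegister:
    def __init__(self):
        self.value, self.timestamp = None, -1

    def write(self, value, timestamp):
        if timestamp > self.timestamp:       # the "later" write wins
            self.value, self.timestamp = value, timestamp

u = ("u-value", 1)
v = ("v-value", 2)

a, b = LWWRegister(), LWWRegister()
a.write(*u); a.write(*v)                     # S.u.v
b.write(*v); b.write(*u)                     # S.v.u
assert a.value == b.value == "v-value"       # same state, but u never took effect
```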
41. 41 Future work Integrate with real editor
(Single-document) Transactions
Generalisation
Characterise invariants
More data types: set, multilog, others?
Mixed data types: improve
cf. persistent data structures
When consensus required:
Move off critical path
Speculation + conflict resolution
Other data types:
• Trivial: set with add, delete