1. Consistency without concurrency control in large, dynamic systems Marc Shapiro, INRIA & LIP6
Nuno Preguiça, Universidade Nova de Lisboa
Mihai Leția, ENS Lyon
Two things:
• no concurrency control
• large, dynamic systems
2. 2 Consistency without concurrency control Object x, operation f(x)
propose f(x1)
eventually replay f(x2), f(x3), ...
If f || g commute: converges safely without concurrency control
Commutative Replicated Data Type (CRDT): designed for commutative operations
Not same order at 1 and 2?
OK if
• concurrent f and g commute
Assuming causal delivery
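To make the convergence argument concrete, here is a minimal sketch (my illustration, not code from the talk or paper) of a grow-only set replica: because concurrent add operations commute, replicas that replay the same operations in different orders reach the same state without any concurrency control.

```python
# Minimal sketch (illustration only): a grow-only set replica.
# Concurrent add() operations commute, so replicas that replay the same
# operations in different orders converge without concurrency control.

class GSetReplica:
    def __init__(self):
        self.elements = set()

    def add(self, x):            # propose locally ...
        self.elements.add(x)
        return ("add", x)        # ... and ship the operation to other replicas

    def replay(self, op):        # eventually replay remote operations
        kind, x = op
        if kind == "add":
            self.elements.add(x)

r1, r2 = GSetReplica(), GSetReplica()
f = r1.add("f")                  # f(x1) proposed at replica 1
g = r2.add("g")                  # g(x2) proposed at replica 2
r1.replay(g)                     # the two replicas see f and g in different orders
r2.replay(f)
assert r1.elements == r2.elements   # converged: {"f", "g"}
```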
3. 3 A sequence CRDT Treedoc = sequence of elements:
insert-at-pos, delete
Commutative when concurrent
Minimise overhead
Scalable
A commutative replicated data type for cooperative editing, ICDCS 2009
Focus today:
Garbage collection
vs. scale
I will just skim the surface of the Treedoc design
Refer to the ICDCS paper for the details
4. 4 Commutative updates insert adds a leaf ==> non-destructive, TIDs don't change
Thanks to non-destructive updates and immutable TIDs: concurrent updates commute
Efficient: data structures and TID lengths logarithmic *if balanced*
Ignoring lots of details, e.g. concurrent inserts at the same position (see paper)
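A minimal sketch of the non-destructive update idea, under simplifying assumptions (binary tree, single site, no disambiguators; not the full Treedoc algorithm): a TID is the path from the root, an insert only adds a fresh leaf, and a delete only marks a tombstone, so existing TIDs never change and concurrent updates can be replayed in any order.

```python
# Sketch of the Treedoc idea under simplifying assumptions (binary tree, no
# disambiguators, single site). A TID is the path from the root; insert adds a
# fresh leaf, so existing TIDs never change; delete only marks a tombstone.
# The document is the infix traversal of the non-deleted nodes.

class Node:
    def __init__(self, atom):
        self.atom, self.deleted = atom, False
        self.left = self.right = None

def insert(root, tid, atom):
    """Insert a new leaf at path `tid` (a sequence of 0/1 bits)."""
    node = root
    for bit in tid[:-1]:
        node = node.left if bit == 0 else node.right
    leaf = Node(atom)
    if tid[-1] == 0:
        node.left = leaf
    else:
        node.right = leaf

def delete(root, tid):
    """Non-destructive delete: keep the node as a tombstone."""
    node = root
    for bit in tid:
        node = node.left if bit == 0 else node.right
    node.deleted = True

def contents(node):
    if node is None:
        return []
    me = [] if node.deleted else [node.atom]
    return contents(node.left) + me + contents(node.right)

root = Node("b")
insert(root, [0], "a")      # TID [0]: before the root atom
insert(root, [1], "c")      # TID [1]: after the root atom
delete(root, [0])
print(contents(root))       # ['b', 'c']
```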
5. 5 Wikipedia GWB page: space overhead GWB: most edited page
Edits translated into Treedoc inserts/deletes
• Tree unbalanced, long TIDs, lots of tombstones: not logarithmic
6. 6 Rebalance In this example rebalancing is not spectacular.
Imagine a deep unbalanced tree with lots of tombstones: large effect.
Why rebalance:
• Unbalanced tree costs time, space
• Long TIDs
• Tombstone overhead
7. 7 Rebalance Invalidates TIDs:
Frame of reference = epoch
Requires agreement
Pervasive!
e.g. Vector Clocks
TID changed: R was e, now 10
Pervasive problem:
• asynchronous updates ==> old data structures
• see cleaning up Vector Clocks
(Background colour indicates epoch)
8. 8 Rebalance in large, dynamic systems Rebalance requires consensus
Consensus requires small, stable membership
Large communities?!
Dynamic scenarios?!
Solution: two tiers
Core: rebalancing (and updates)
Nebula: updates (no rebalancing)
Migration protocol
Core: controls rebalancing
9. 9 Core Nebula Group membership
Small, stable
Rebalance:
Unanimous agreement (2-phase commit)
All core sites in same epoch
Migrate core to nebula: just leave group
10. 10 Core Nebula Migrate from nebula to core: migrate to core epoch + join group
11. 11 Catch-up protocol summary Here is the basic insight behind the migration protocol
• Replay core's updates: N now in same state as C before rebalance
• Replay rebalance: *ignoring concurrent N updates*, has same effect as in C ==> same TIDs, same epoch
• Transform buffer: now ready to be replayed in C (in different order, but that's OK since they commute)
• N now in C state, can join the core or remain in nebula
Furthermore, updates are idempotent (multiple catch-ups cause no harm)
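A schematic sketch of the catch-up steps listed above; the names and the `transform` argument are illustrative placeholders, not the actual protocol interface. A nebula replica N buffers its own old-epoch updates, replays the core's updates and the rebalance, then re-expresses its buffered updates in the new epoch.

```python
# Schematic sketch of the catch-up steps (illustrative names, not the paper's API).

def catch_up(nebula, core_updates, rebalance, transform):
    buffered = list(nebula.pending_updates)      # N's own old-epoch updates

    for op in core_updates:                      # 1. replay core's updates:
        nebula.replay(op)                        #    N reaches core's pre-rebalance state

    nebula.replay(rebalance)                     # 2. replay rebalance, ignoring N's own
                                                 #    concurrent updates ==> same TIDs, same epoch

    new_epoch_ops = [transform(op, rebalance)    # 3. transform buffered ops so the core
                     for op in buffered]         #    can replay them in the new epoch

    return new_epoch_ops                         # 4. send to core; N may join the core
                                                 #    or remain in the nebula
```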
12. 12 Catch-up protocol del in old epoch; ins + ins in old epoch
rebalance starts new epoch
13. 13 Catch-up protocol del in old epoch; ins + ins in old epoch
rebalance starts new epoch
14. 14 Catch-up protocol white = old epoch
pink = new epoch
15. 15 Catch-up protocol del uttered in old epoch ==> can send to S
S replays del; now up to date with Core
send rebalance
S replays rebalance; intervening ins move; S now in new epoch
16. 16 Catch-up protocol ins arguments transformed to new epoch
Core replays ins
17. 17 Summary CRDT:
Convergence ensured
Design for commutativity
GC cannot be ignored
Requires commitment
Pervasive issue
Large-scale commitment:
Core / Nebula
To synchronise: catch-up + migration
CRDT = non-locking synchronisation in weak memory model
18. 18 Future work More CRDTs
Understanding CRDTs: what invariants can be CRDTized
Approximations of CRDTs
Data types for consistent cloud computing without concurrency control
19. 19
20. 20 Replicated sequence Illustrating with a simple shared editing scenario. Applies to any type of sequence.
21. 21 State of the art Serialising updates
Single, global execution order
Lock-step: Poor user experience
Doesn't scale
Operational Transformation
Local execution orders
Modify arguments to take into account concurrent operations scheduled before
Weak theory, hidden complexity
Insight: design the data type to be commutative
Serialisation is slow and occurs on the critical path of applications. Atom identifiers must have the same meaning on initial and replay sites: lock-step! or OT.
One approach would be to increase coarseness, e.g. synchronise per line instead of per character. This may be OK for some applications, but NOT for concurrent editing. If the edit atom is the line (for instance), the user suffers artefacts at line boundaries, e.g. can't easily delete an end-of-line, can't easily split a line.
22. 22 Commutative Replicated Data Type (CRDT) Assuming:
All concurrent operations commute
Non-concurrent operations execute in happens-before order
All operations eventually execute at every replica
Then replicas eventually converge to correct value
Design data types to support commutative operations
23. 23 Concurrent inserts Exceptions to binary tree: disambiguator
Concurrent inserts ordered by disambiguator
Path = ⟨ [0|1], disambiguator? ⟩* (disambiguator derived from the site-ID or a unique operation ID; see the alternatives below)
Alternatives:
site identifier of initiator: short, but delete leaves a tombstone
or: Unique ID of operation: long, immediate delete
When an operation has been executed at all sites, it is *stable*.
Stability detection: Golding's algorithm. Must hear from all sites (directly or indirectly).
Node is stable when all concurrent inserts at that node are stable.
Once a node is stable: if there was a single conc. insert, remove all exceptions.
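A hypothetical illustration of the disambiguator idea (simplified: it only shows the deterministic tie-break, not the full exception structure): concurrent inserts at the same tree position each carry a disambiguator, and every replica orders them by that disambiguator, whatever the order of arrival.

```python
# Hypothetical illustration (not the paper's data structure): concurrent inserts
# at the same position carry (position bit, disambiguator); replicas break the
# tie by comparing disambiguators, so all replicas order the atoms identically.

concurrent_inserts = [
    {"pos_bit": 1, "disambiguator": "site-B", "atom": "y"},
    {"pos_bit": 1, "disambiguator": "site-A", "atom": "x"},
]

# Deterministic tie-break: same result at every replica, whatever the arrival order.
ordered = sorted(concurrent_inserts,
                 key=lambda n: (n["pos_bit"], n["disambiguator"]))
print([n["atom"] for n in ordered])   # ['x', 'y'] at every replica
```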
24. 24 Causal ordering Vector clocks:
Number of messages received from each site
Causal ordering
Filter duplicate messages
Efficient but grow indefinitely
Treedoc
TID encodes causal order
Duplicates idempotent
Approximate Vector Clocks + Treedoc
Approximate vector clocks:
fixed size (scalability)
incomplete filter (duplicates ignored by Treedoc)
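One possible realisation of a fixed-size, incomplete duplicate filter in the spirit of this slide; this is an assumption for illustration, not necessarily the construction used with Treedoc. Untracked sites cannot be checked, so their messages are always replayed; a missed duplicate is harmless because Treedoc operations are idempotent.

```python
# Hypothetical sketch of a fixed-size, incomplete duplicate filter (assumption,
# not the actual construction): track counters for at most MAX_SITES sites; for
# untracked sites we cannot decide, so we accept the message. A missed duplicate
# is harmless because Treedoc operations are idempotent.

MAX_SITES = 4

class ApproxClock:
    def __init__(self):
        self.seen = {}                       # site -> highest counter seen

    def accept(self, site, counter):
        """Return True if the message should be replayed."""
        if site in self.seen:
            if counter <= self.seen[site]:
                return False                 # definitely a duplicate
            self.seen[site] = counter
            return True
        if len(self.seen) < MAX_SITES:
            self.seen[site] = counter        # start tracking this site
        return True                          # possibly a duplicate: replay anyway

clk = ApproxClock()
print(clk.accept("site-A", 1))   # True: first message from site-A
print(clk.accept("site-A", 1))   # False: exact duplicate detected
```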
25. 25 Rebalance requires commitment Commutativity of update || rebalance
Updates win
Rebalance does not impact update performance
Rebalance: unanimous agreement
Standard 2- or 3-phase commit
Initiator is coordinator
Other site: If concurrent update: “Not OK”
Off critical path!
Rebalance changes frame of reference ==> requires unanimous consensus
To avoid synchronisation in critical path: updates win, rebalance loses
Any site can initiate
Standard 2- or 3-phase commit
Off critical path: no concurrency control (in critical path)
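A minimal sketch of the "updates win" vote, with illustrative names rather than the actual protocol messages: a core site votes OK only if it has no concurrent update pending, and the rebalance commits only on a unanimous vote, collected by a standard two-phase commit kept off the update critical path.

```python
# Minimal sketch of the "updates win" rule (illustrative names): rebalance
# commits only if every core site votes OK, i.e. has no concurrent update.

class CoreSite:
    def __init__(self, name, pending_updates=0):
        self.name, self.pending_updates = name, pending_updates

    def vote(self):
        return "OK" if self.pending_updates == 0 else "Not OK"

def try_rebalance(core_sites):
    votes = [s.vote() for s in core_sites]          # phase 1: collect votes
    if all(v == "OK" for v in votes):               # unanimity required
        return "commit rebalance"                   # phase 2: commit
    return "abort rebalance (updates win)"          # phase 2: abort

print(try_rebalance([CoreSite("A"), CoreSite("B")]))                     # commit
print(try_rebalance([CoreSite("A"), CoreSite("B", pending_updates=1)]))  # abort
```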
26. 26 Experiments Estimate overheads, compare design alternatives:
Atom granularity: word vs. line
Disambiguator: siteID+tombstone vs. unique ID+immediate delete
Are trees unbalanced?
Effect of rebalance, heuristics
Replay update history of CVS, SVN, Wiki repositories
27. 27 Implementation alternatives Disambiguator: next slide
Atom: character, word, line, paragraph
Fine-grain: structural overhead
Coarse-grain: “delete” artefacts
Data structure:
Tree: no TID storage, no GC interior nodes
vs. { (TID, atom) }: flexible, GC
Mixed
Arity: binary vs. 256-ary
28. 28 Disambiguator design Disambiguator options:
1 byte, no GC until rebalance
or 10 bytes, immediate GC (if leaf)
Stored in every node
Intuitively, which do you think is best? I thought that the most important thing was to avoid overhead. A node is small, keep it small, but a few extra nodes are no big deal. In fact, deletes are so frequent that option 2 (unique ID, immediate delete) is superior.
29. 29 LaTeX files LaTeX files, in a collaborative group. Probably average size for a scientific publication (10-15 output pages). Relatively short history: each new paper starts its history from scratch.
30. 30 LaTeX / line Notice how, even in this collaborative environment, deletes are numerous. Partially an artefact of line granularity.
31. 31 Summary: % deleted nodes (mean) Again, deletes are frequent.
Conclusion: deletes must be efficient space-wise.
32. 32 Atom granularity and deletes Every update ==> delete atom + insert
The amount of deletes is only partially an artefact of atom granularity
33. 33 Wikipedia GWB benchmark en.wikipedia.org/George_W_Bush
150 kB text
42,000 revisions: most frequently revised
Biggest revision: 100 kB
Benchmark data
Treedoc node = paragraph
First 15,000 revisions = 350,000 updates
Biggest revision < 2 s; average: 0.24 s/revision
Rebalance every 1,000 revisions
256-ary tree
Show that GC is an important problem, which cannot be ignored
Stopped at 15,000 revisions because of limitations of the download programme
34. 34 Wikipedia GWB page Even more than SVN: lots of deletes
Wikipedia: delete/recreate war. This results in a LOT of tombstones. Conclusion: it's very important to deal efficiently with deletes.
35. 35 Time per operation Rebalance is essential to efficiency (10× improvement)
If the tree were well balanced we would expect logarithmic growth. Instead it is super-linear. Obviously the tree is not balanced.
GWB, paragraph granularity, 256-ary tree
36. 36 Size, frequency of TIDs • Size: Each child has an 8-bit ID relative to its parent. All IDs ≤ 127 are to the left of the parent, IDs ≥ 128 to its right.
• If the tree were balanced, TID size should grow logarithmically. In fact it grows linearly.
• Why does the curve flatten out? Mystified: deletes create lots of empty room? Or a bug?
• Frequency: Spikes <== "insert to the right of parent", "insert to the right of predecessor". Explains why the tree is unbalanced.
GWB, paragraph granularity, 256-ary tree
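A small illustration of the 8-bit child-ID encoding described in the first bullet (assuming the ≤ 127 / ≥ 128 reading above): each TID component is one byte relative to its parent, and the byte value determines on which side of the parent the child is ordered.

```python
# Illustration of the 8-bit child-ID encoding in a 256-ary node: values up to
# 127 denote a child ordered to the left of its parent, values from 128 a child
# ordered to its right.

def side(child_id: int) -> str:
    assert 0 <= child_id <= 255          # 8-bit ID relative to the parent
    return "left of parent" if child_id <= 127 else "right of parent"

print(side(5))      # left of parent
print(side(200))    # right of parent
```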
37. 37 flat vs. WOOT vs. Treedoc
38. 38 Summary: garbage collection Efficiency, GC are important
Tree + re-balance
Requires commitment (move off critical path)
Pervasive issue
Large-scale commitment:
Core: commitment
Nebula: asynchronous updates
Occasional synchronisation: migration
39. 39 Summary: CRDT CRDT:
Convergence ensured
Design for commutativity
Techniques for commutativity:
Partial order
Non-destructive updates
Identifiers don't depend on concurrent activity
Consensus: off critical path
40. 40 Commutativity: genuine vs. precedence Commutativity: ∀S, u, v : S.u.v = S.v.u
Genuine: both u and v take effect
“Addition, subtraction commutative”
Non-destructive updates
Precedence: only one of u, v ultimately takes effect
“File writes commute under Last-Writer-Wins”
Destructive updates
Commutative: For all states S, and all operations u and v such that S.u and S.v are states, S.u.v and S.v.u are both states, and are equivalent.
(Equivalence and state are defined in an application-dependent manner.)
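A small sketch of precedence commutativity with a Last-Writer-Wins register (my illustration): applying two concurrent writes in either order yields the same state, but only the write with the larger timestamp ultimately takes effect.

```python
# Illustration of precedence commutativity: with a Last-Writer-Wins register,
# concurrent writes commute, because only the write with the larger timestamp
# ultimately survives (the other never takes effect).

class LWWRegister:
    def __init__(self):
        self.value, self.timestamp = None, -1

    def write(self, value, timestamp):
        if timestamp > self.timestamp:       # the "later" write wins
            self.value, self.timestamp = value, timestamp

u = ("u-value", 1)
v = ("v-value", 2)

a, b = LWWRegister(), LWWRegister()
a.write(*u); a.write(*v)                     # S.u.v
b.write(*v); b.write(*u)                     # S.v.u
assert a.value == b.value == "v-value"       # same state, but u never took effect
```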
41. 41 Future work Integrate with real editor
(Single-document) Transactions
Generalisation
Characterise invariants
More data types: set, multilog, others?
Mixed data types: improve
cf. persistent data structures
When consensus required:
Move off critical path
Speculation + conflict resolution
Other data types:
• Trivial: set with add, delete