1 / 40

What’s the Difference? Efficient Set Reconciliation without Prior Context

What’s the Difference? Efficient Set Reconciliation without Prior Context. Frank Uyeda University of California, San Diego David Eppstein , Michael T. Goodrich & George Varghese. Motivation. Distributed applications often need to compare remote state . R1. R2. Partition Heals.

trudy
Download Presentation

What’s the Difference? Efficient Set Reconciliation without Prior Context

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. What’s the Difference?Efficient Set Reconciliation without Prior Context Frank Uyeda University of California, San Diego David Eppstein, Michael T. Goodrich & George Varghese

  2. Motivation • Distributed applications often need to compare remote state. R1 R2 Partition Heals Must solve the Set-Difference Problem!

  3. What is the Set-Difference problem? Host 1 Host 2 • What objects are unique to host 1? • What objects are unique to host 2? A B E F A C D F

  4. Example 1: Data Synchronization Host 1 Host 2 • Identify missing data blocks • Transfer blocks to synchronize sets C D A B E F A C D F B E

  5. Example 2: Data De-duplication Host 1 Host 2 • Identify all unique blocks. • Replace duplicate data with pointers A B E F A C D F

  6. Set-Difference Solutions • Trade a sorted list of objects. • O(n) communication, O(n log n) computation • Approximate Solutions: • Approximate Reconciliation Tree (Byers) • O(n) communication, O(n log n) computation • Polynomial Encodings (Minsky & Trachtenberg) • Let “d” be the size of the difference • O(d) communication, O(dn+d3) computation • Invertible Bloom Filter • O(d) communication, O(n+d) computation

  7. Difference Digests • Efficiently solves the set-difference problem. • Consists of two data structures: • Invertible Bloom Filter (IBF) • Efficiently computes the set difference. • Needs the size of the difference • Strata Estimator • Approximates the size of the set difference. • Uses IBF’s as a building block.

  8. Invertible Bloom Filters (IBF) Host 1 Host 2 • Encode local object identifiers into an IBF. A B E F A C D F IBF 1 IBF 2

  9. IBF Data Structure • Array of IBF cells • For a set difference of size, d, require αd cells (α> 1) • Each ID is assigned to many IBF cells • Each IBF cell contains:

  10. IBF Encode B C A Assign ID to many cells All hosts use the same hash functions Hash1 Hash2 Hash3 idSum⊕A hashSum⊕ H(A) count++ idSum⊕A hashSum⊕H(A) count++ idSum⊕ A hashSum⊕ H(A) count++ IBF: “Add” ID to cell Not O(n), like Bloom Filters! αd

  11. Invertible Bloom Filters (IBF) Host 1 Host 2 • Trade IBF’s with remote host A B E F A C D F IBF 1 IBF 2

  12. Invertible Bloom Filters (IBF) Host 1 Host 2 • “Subtract” IBF structures • Produces a new IBF containing only unique objects A B E F A C D F IBF 2 IBF 1 IBF (2 - 1)

  13. IBF Subtract

  14. Timeout for Intuition • After subtraction, all elements common to both sets have disappeared. Why? • Any common element (e.g W) is assigned to same cells on both hosts (assume same hash functions on both sides) • On subtraction, W XOR W = 0. Thus, W vanishes. • While elements in set difference remain, they may be randomly mixed  need a decode procedure.

  15. Invertible Bloom Filters (IBF) Host 1 Host 2 • Decode resulting IBF • Recover object identifiers from IBF structure. A B E F A C D F Host 1 Host 2 IBF 2 C D B E IBF 1 IBF (2 - 1)

  16. IBF Decode H(V ⊕ X ⊕ Z) ≠ H(V) ⊕ H(X) ⊕ H(Z) Test for Purity: H( idSum ) H( idSum ) = hashSum H(V) = H(V)

  17. IBF Decode

  18. IBF Decode

  19. IBF Decode

  20. How many IBF cells? Overhead to decode at >99% Hash Cnt 3 Hash Cnt 4 Space Overhead Small Diffs: 1.4x – 2.3x Large Differences: 1.25x - 1.4x Set Difference

  21. How many hash functions? • 1 hash function produces many pure cells initially but nothing to undo when an element is removed. C A B

  22. How many hash functions? • 1 hash function produces many pure cells initially but nothing to undo when an element is removed. • Many (say 10) hash functions: too many collisions. C C C B B C B A A A B A

  23. How many hash functions? • 1 hash function produces many pure cells initially but nothing to undo when an element is removed. • Many (say 10) hash functions: too many collisions. • We find by experiment that 3 or 4 hash functions works well. Is there some theoretical reason? C C B C A A A B B

  24. Theory • Let d= difference size, k = # hash functions. • Theorem 1: With (k + 1) d cells, failure probability falls exponentially. • For k = 3, implies a 4x tax on storage, a bit weak. • [Goodrich,Mitzenmacher]: Failure is equivalent to finding a 2-core (loop) in a random hypergraph • Theorem 2:With ckd, cells, failure probability falls exponentially • c4= 1.3x tax, agrees with experiments

  25. How many IBF cells? Overhead to decode at >99% Hash Cnt 3 Hash Cnt 4 Space Overhead Large Differences: 1.25x - 1.4x Set Difference

  26. Connection to Coding • Mystery: IBF decode similar to peeling procedure used to decode Tornado codes. Why? • Explanation: Set Difference is equivalent to coding with insert-delete channels • Intuition: Given a code for set A, send codewords only to B. Think of B’s set as a corrupted form of A’s. • Reduction: If code can correct D insertions/deletions, then B can recover A and the set difference. • Reed Solomon <---> Polynomial Methods • LDPC (Tornado) <---> Difference Digest

  27. Difference Digests • Consists of two data structures: • Invertible Bloom Filter (IBF) • Efficiently computes the set difference. • Needs the size of the difference • Strata Estimator • Approximates the size of the set difference. • Uses IBF’s as a building block.

  28. Strata Estimator Estimator B C A 1/16 • Divide keys into partitions of containing ~1/2k • Encode each partition into an IBF of fixed size • log(n) IBF’s of ~80 cells each IBF 4 ~1/8 IBF 3 ~1/4 Consistent Partitioning IBF 2 ~1/2 IBF 1

  29. Strata Estimator Estimator 1 Estimator 2 • Attempt to subtract & decode IBF’s at each level. • If level k decodes, then return:2kx (the number of ID’s recovered) … … IBF 4 IBF 4 4x IBF 3 IBF 3 Host 1 Host 2 IBF 2 IBF 2 Decode IBF 1 IBF 1

  30. Strata Estimator Estimator 1 Estimator 2 What about the other strata? • Attempt to subtract & decode IBF’s at each level. • If level k decodes, then return:2kx (the number of ID’s recovered) … … IBF 4 IBF 4 4x IBF 3 IBF 3 Decode Host 1 Host 2 IBF 2 IBF 2 IBF 1 IBF 1

  31. Strata Estimator Estimator 1 Estimator 2 • Observation: Extra partitions hold useful data • Sum elements from all decoded strata & return:2(k-1) x (the number of ID’s recovered) … … … Decode IBF 4 IBF 4 2x Decode IBF 3 IBF 3 Decode Host 1 Host 1 Host 1 Host 2 Host 2 Host 2 IBF 2 IBF 2 IBF 1 IBF 1

  32. Estimation Accuracy Average Estimation Error (15.3 KBytes) Relative Error in Estimation (%) Min-Wise good for large differences. Strata good for small differences. Set Difference

  33. Hybrid Estimator • Combine Strata and Min-Wise Estimators. • Use IBF Stratas for small differences. • Use Min-Wise for large differences. Hybrid Strata … Min-Wise IBF 4 IBF 3 IBF 3 IBF 2 IBF 2 IBF 1 IBF 1

  34. Hybrid Estimator Accuracy Average Estimation Error (15.3 KBytes) Converges with Min-wise for large differences Relative Error in Estimation (%) Hybrid matches Strata for small differences. Set Difference

  35. Application: KeyDiff Service • Promising Applications: • File Synchronization • P2P file sharing • Failure Recovery Application Application Add( key ) Remove( key ) Diff( host1, host2 ) Key Service Key Service Application Key Service

  36. Difference Digests Summary • Strata & Hybrid Estimators • Estimate the size of the Set Difference. • For 100K sets, 15KB estimator has <15% error • O(log n) communication, O(log n) computation. • Invertible Bloom Filter • Identifies all ID’s in the Set Difference. • 16 to 28 Bytes per ID in Set Difference. • O(d) communication, O(n+d) computation. • Implemented in KeyDiff Service

  37. Conclusions: Got Diffs? • New randomized algorithm (difference digests) for set difference or insertion/deletion coding • Could it be useful for your system? Need: • Large but roughly equal size sets • Small set differences (less than 10% of set size)

  38. Extra Slides

  39. Comparison to Logs • IBF work with no prior context. • Logs work with prior context, BUT • Redundant information when sync’ing with multiple parties. • Logging must be built into system for each write. • Logging add overhead at runtime. • Logging requires non-volatile storage. • Often not present in network devices. • IBF’s may out-perform logs when: • Synchronizing multiple parties • Synchronizations happen infrequently

More Related