Efficient Exact Set-Similarity Joins

Efficient Exact Set-Similarity Joins. Arvind Arasu, Venkatesh Ganti, Raghav Kaushik. DMX Group, Microsoft Research. Data Cleaning. String Similarity Join. Reference Table. String Similarity (Self) Join. Strings → Sets [CGK ’06].

Presentation Transcript


  1. Efficient Exact Set-Similarity Joins. Arvind Arasu, Venkatesh Ganti, Raghav Kaushik. DMX Group, Microsoft Research

  2. Data Cleaning Set-Similarity Joins

  7. String Similarity Join Reference Table Set-Similarity Joins

  8. String Similarity (Self) Join Set-Similarity Joins

  9. Strings → Sets [CGK ’06]. 2-grams: microsoft → {mi, ic, cr, ro, os, so, of, ft}; mcrosoft → {mc, cr, ro, os, so, of, ft}. (edit distance ≤ 1) → (Δ ≤ 4)
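
The mapping on this slide can be checked with a minimal sketch, assuming the 2-grams are all length-2 substrings collected into a plain set (no padding characters, no multiplicities). The helper name `qgrams` is hypothetical and not taken from the slides.

```python
def qgrams(s, q=2):
    """Return the set of all length-q substrings (q-grams) of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

a = qgrams("microsoft")
b = qgrams("mcrosoft")

# Symmetric difference: the 2-grams that occur in exactly one of the two sets.
delta = a ^ b
print(sorted(delta), len(delta))
```

Here the two strings are one edit apart and their 2-gram sets differ in three elements, which is within the slide's bound of Δ ≤ 4. The intuition is that a single edit touches at most two 2-grams in each string.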

  10. Strings → Sets. String Sim Join (edit distance ≤ 1) over R and S; example pair: microsoft, mcrosoft

  11. Strings → Sets. Tokenize R and S → Set Sim Join (Δ ≤ 4) → Post-Process; example pair: microsoft, mcrosoft
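
The three stages on this slide (tokenize, set-similarity join, post-process) could be wired together roughly as follows. This is a sketch under assumptions: the set-similarity join is a naive nested loop standing in for the algorithms of the talk, the post-process step is taken to be an exact edit-distance check, and the input strings other than microsoft and mcrosoft are illustrative guesses. All function names are hypothetical.

```python
from itertools import product

def qgrams(s, q=2):
    """Tokenize: the set of all length-q substrings of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def string_sim_join(R, S, max_edits=1, max_delta=4):
    """Tokenize, join on |r Δ s| <= max_delta, then verify edit distance."""
    tokens_r = {r: qgrams(r) for r in R}
    tokens_s = {s: qgrams(s) for s in S}
    # Set-similarity join (naive stand-in): keeps every true match,
    # but may also keep pairs that fail the string predicate.
    candidates = [(r, s) for r, s in product(R, S)
                  if len(tokens_r[r] ^ tokens_s[s]) <= max_delta]
    # Post-process: discard candidates that are not within max_edits.
    return [(r, s) for r, s in candidates if edit_distance(r, s) <= max_edits]

R = ["microsoft", "boeing", "logisoft"]   # illustrative inputs
S = ["mcrosoft", "lgisoft"]
matches = string_sim_join(R, S)
print(matches)
```

The default `max_delta=4` is the threshold the slide pairs with edit distance ≤ 1 on 2-grams; other settings would need their own threshold.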

  12. String → Set: Advantages • Generalizes to many string similarity functions • Powerful primitive • Sets ≈ Relations • Leverage relational data processing • [CGK ’06]

  13. Contributions • New algorithms for set-similarity joins • Exact answers • Performance guarantees • Outperform previous exact algorithms by orders of magnitude • Exact answers are important for operators

  14. Outline • Introduction • Algorithms • Experiments • Conclusion Set-Similarity Joins

  15. { mc, cr, ro, os, so, of, ft } { bo, oe, ei, in, ng } { … } { lg, gi, is, so, of, ft } { … } { lo, og, gi, is, so, of, ft } { … } { … } { … } { mi, ic, cr, ro, os, so, of, ft } { … } { … } { … } { … } R S

  16. Intersection size ≥ 5 { mc, cr, ro, os, so, of, ft } { bo, oe, ei, in, ng } { … } { lg, gi, is, so, of, ft } { … } { lo, og, gi, is, so, of, ft } { … } { … } { … } { mi, ic, cr, ro, os, so, of, ft } { … } { … } { … } { … } R S

  19. Intersection size ≥ 5. Highlighted pair: { mc, cr, ro, os, so, of, ft }, { mi, ic, cr, ro, os, so, of, ft }. R and S as on slide 16.

  20. Intersection size ≥ 5. Highlighted pairs: { mc, cr, ro, os, so, of, ft }, { mi, ic, cr, ro, os, so, of, ft }; { lg, gi, is, so, of, ft }, { lo, og, gi, is, so, of, ft }. R and S as on slide 16.
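
The two pairs singled out on this slide can be reproduced with a brute-force check. The sketch below assumes a self-join over the five fully written-out sets (the elided `{ … }` sets are left out) and uses hypothetical labels for them.

```python
from itertools import combinations

# The five sets that are spelled out on the slides, keyed by an arbitrary label.
sets = {
    "mc": {"mc", "cr", "ro", "os", "so", "of", "ft"},
    "bo": {"bo", "oe", "ei", "in", "ng"},
    "lg": {"lg", "gi", "is", "so", "of", "ft"},
    "lo": {"lo", "og", "gi", "is", "so", "of", "ft"},
    "mi": {"mi", "ic", "cr", "ro", "os", "so", "of", "ft"},
}

def overlap_self_join(sets, threshold):
    """Return every unordered pair of labels whose sets share >= threshold elements."""
    return [(a, b)
            for (a, sa), (b, sb) in combinations(sets.items(), 2)
            if len(sa & sb) >= threshold]

pairs = overlap_self_join(sets, threshold=5)
print(pairs)
```

Only the mc/mi pair (six shared 2-grams) and the lg/lo pair (five shared 2-grams) reach the threshold; every other pair shares at most the three 2-grams so, of, ft.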

  21. Sim ( ri , sj) ≥ θ r1 s1 { mc, cr, ro, os, so, of, ft } { bo, oe, ei, in, ng } r2 s2 { … } { lg, gi, is, so, of, ft } r3 s3 { … } { lo, og, gi, is, so, of, ft } { … } { … } { … } { mi, ic, cr, ro, os, so, of, ft } { … } { … } rn sm { … } { … } R S

  22. Sim ( ri , sj) ≥ θ r1 s1 { mc, cr, ro, os, so, of, ft } { bo, oe, ei, in, ng } r2 s2 { … } { lg, gi, is, so, of, ft } r3 s3 { … } { lo, og, gi, is, so, of, ft } { … } { … } Large { … } { mi, ic, cr, ro, os, so, of, ft } { … } { … } rn sm { … } { … } R S

  23. Set-Similarity Join: Symmetric Difference • Input: R: r1, r2, …, rn (n sets); S: s1, s2, …, sm (m sets) • Output: all pairs (ri, sj) such that |ri Δ sj| ≤ k • Running example: k = 4
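
As a reference point for this definition, the join can be written as a nested loop that tests every pair. This is only a baseline sketch that compares all n·m pairs, not one of the algorithms from the talk; the function name and the integer example sets are hypothetical.

```python
def symdiff_join(R, S, k):
    """Return index pairs (i, j) with |R[i] Δ S[j]| <= k, by testing every pair."""
    out = []
    for i, r in enumerate(R):
        for j, s in enumerate(S):
            # |r Δ s| = |r| + |s| - 2 * |r ∩ s|
            if len(r) + len(s) - 2 * len(r & s) <= k:
                out.append((i, j))
    return out

R = [{1, 2, 3, 4}, {10, 11}]
S = [{1, 2, 3, 5, 6}, {10, 11, 12, 13, 14, 15, 16}]
print(symdiff_join(R, S, k=4))
```

The identity in the comment also shows why an intersection-size threshold and a symmetric-difference threshold are interchangeable once the set sizes are known.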

  24. Alternate Set Representation s = { 4, 10, 13, 24, 29, 35, 41, 46, 48 } Set-Similarity Joins

  25. Alternate Set Representation 1 25 50 s = { 4, 10, 13, 24, 29, 35, 41, 46, 48 } Set-Similarity Joins
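
One way to read the number line on this slide is as a 0/1 vector over the domain {1, …, 50}, with a 1 at each position that belongs to the set. The sketch below assumes that reading and stores the vector as a Python integer bitmask; the second set `r` is a made-up neighbour of `s`, and the function names are hypothetical.

```python
def to_bits(s, n=50):
    """Encode a subset of {1, ..., n} as an integer whose bit (x - 1) is set for each x in s."""
    bits = 0
    for x in s:
        if not 1 <= x <= n:
            raise ValueError(f"element {x} is outside the domain 1..{n}")
        bits |= 1 << (x - 1)
    return bits

def hamming(a, b):
    """Number of positions in which two bit vectors differ."""
    return bin(a ^ b).count("1")

s = {4, 10, 13, 24, 29, 35, 41, 46, 48}   # the set from the slide
r = {4, 10, 13, 24, 30, 35, 41, 46, 48}   # hypothetical: 29 replaced by 30

print(hamming(to_bits(r), to_bits(s)), len(r ^ s))
```

Under this encoding the Hamming distance between two vectors equals the size of the symmetric difference of the two sets, so the condition |r Δ s| ≤ k becomes a Hamming-distance condition.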

  29. Enumeration r s |r Δs| ≤ 4 Set-Similarity Joins

  31. Enumeration Errors r s |r Δs| ≤ 4 Set-Similarity Joins

  32. Enumeration r s 1 2 3 4 5 |r Δs| ≤ 4 Set-Similarity Joins

  33. Enumeration: Signature Generation s Sig (s ) { , , , , } Set-Similarity Joins

  34. Enumeration: Signature Generation s Sig (s ) { , , , , } Hash32() { 0x4f72ba91, 0x29c8af10, 0x594b2c17, 0xa3b0e20f, 0xdd21f32a} Set-Similarity Joins
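
A possible reading of this slide is that the domain is cut into five contiguous blocks, the set is projected onto each block, and each (block, projection) pair is hashed to a 32-bit value. The sketch below follows that reading. The slide does not say which function `Hash32()` is, so `zlib.crc32` is used purely as a stand-in, and the hash values it produces will not match those on the slide. The domain size of 50 and the function name are also assumptions.

```python
import zlib

def signatures(s, n=50, parts=5):
    """Hash the projection of s onto each of `parts` contiguous blocks of 1..n."""
    width = -(-n // parts)                      # block width, rounded up
    sigs = set()
    for p in range(parts):
        lo, hi = p * width + 1, min((p + 1) * width, n)
        projection = tuple(sorted(x for x in s if lo <= x <= hi))
        # Include the block index so equal projections in different blocks differ.
        key = repr((p, projection)).encode("utf-8")
        sigs.add(zlib.crc32(key))               # unsigned 32-bit stand-in for Hash32()
    return sigs

s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
sig_s = signatures(s)
print([hex(h) for h in sorted(sig_s)])
```

Hashing keeps every signature at a fixed width, which makes signatures easy to store and compare; a hash collision can only add an extra candidate pair, never remove a true one.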

  35. Property of Signatures r s 1 2 3 4 5 |r Δ s| ≤ 4 ⇒ Sig(r) ∩ Sig(s) ≠ Φ

  36. Enumeration: Algorithm • Generate signatures for each ri, sj • Enumerate (ri, sj) s.t. Sig(ri) ∩ Sig(sj) ≠ Φ • Output those satisfying |ri Δ sj| ≤ 4
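
The three steps above can be sketched as follows, assuming signatures are the projections of a set onto k + 1 contiguous blocks of the domain (so that two sets differing in at most k elements must agree on at least one block). Signatures are left unhashed here for readability, and an in-memory dictionary stands in for whatever join machinery a real system would use. The function names and the example sets are hypothetical.

```python
from collections import defaultdict

def signatures(s, n, parts):
    """One signature per block: (block index, elements of s inside that block)."""
    width = -(-n // parts)                      # block width, rounded up
    return {(p, tuple(sorted(x for x in s
                             if p * width < x <= min((p + 1) * width, n))))
            for p in range(parts)}

def signature_join(R, S, n=50, k=4):
    parts = k + 1
    # Step 1: generate signatures for every set in S and index them.
    index = defaultdict(list)
    for j, s in enumerate(S):
        for sig in signatures(s, n, parts):
            index[sig].append(j)
    # Step 2: enumerate the pairs that share at least one signature.
    candidates = set()
    for i, r in enumerate(R):
        for sig in signatures(r, n, parts):
            for j in index.get(sig, ()):
                candidates.add((i, j))
    # Step 3: output only the candidates that satisfy the join predicate.
    return sorted((i, j) for i, j in candidates if len(R[i] ^ S[j]) <= k)

R = [{4, 10, 13, 24, 30, 35, 41, 46, 48}, {1, 2, 3, 7, 8, 9, 20, 30, 40}]
S = [{4, 10, 13, 24, 29, 35, 41, 46, 48}, {1, 2, 3}]
print(signature_join(R, S))
```

Step 3 is what keeps the result exact: signatures only narrow down which pairs are examined, and every reported pair is verified against the original sets.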

  37. Enumeration: Sig(r2) ∩ Sig(s1) ≠ Φ. r1 Sig (r1) Sig (s1) s1 r2 Sig (r2) Sig (s2) s2 r3 Sig (r3) Sig (s3) s3 r4 Sig (r4) Sig (s4) s4 r5 Sig (r5) Sig (s5) s5

  38. Enumeration: Sig(r2) ∩ Sig(s1) ≠ Φ. r1 Sig (r1) Sig (s1) s1 r2 Sig (r2) Sig (s2) s2 r3 Sig (r3) Sig (s3) s3 r4 Sig (r4) Sig (s4) s4 r5 Sig (r5) Sig (s5) s5

  39. Enumeration: Sig(r2) ∩ Sig(s1) ≠ Φ. r1 Sig (r1) Sig (s1) s1 r2 Sig (r2) Sig (s2) s2 r3 Sig (r3) Sig (s3) s3 r4 Sig (r4) Sig (s4) s4 r5 Sig (r5) Sig (s5) s5 False positive candidate pairs Output

  40. Plan (bottom-up): R (Id, Elem) and S (Id, Elem) → Gen Signatures → R’ (Id, Sig) and S’ (Id, Sig) → join on R.Sig = S.Sig → δ R.Id, S.Id → Post-Process each R.Id, S.Id
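
The operators on this slide map naturally onto a relational engine. The sketch below runs the signature join and the duplicate elimination (δ) in an in-memory SQLite database and does signature generation and post-processing in Python. It assumes block-projection signatures over a domain of 1..50 with k = 4; the table names `Rp` and `Sp` (for R’ and S’), the text encoding of a signature, and the example rows are all made up for illustration.

```python
import sqlite3
from collections import defaultdict

N, K = 50, 4          # assumed domain size and the running threshold k = 4
PARTS = K + 1

def gen_signatures(rows):
    """(Id, Elem) rows -> (sets by Id, list of (Id, Sig) rows)."""
    sets = defaultdict(set)
    for id_, elem in rows:
        sets[id_].add(elem)
    width = -(-N // PARTS)
    sig_rows = []
    for id_, s in sets.items():
        for p in range(PARTS):
            lo, hi = p * width + 1, min((p + 1) * width, N)
            proj = sorted(x for x in s if lo <= x <= hi)
            sig_rows.append((id_, f"{p}:{','.join(map(str, proj))}"))
    return sets, sig_rows

def to_rows(id_, elems):
    return [(id_, e) for e in elems]

R = to_rows(1, [4, 10, 13, 24, 30, 35, 41, 46, 48]) + to_rows(2, [1, 2, 3])
S = to_rows(1, [4, 10, 13, 24, 29, 35, 41, 46, 48]) + to_rows(2, [5, 6, 7])

r_sets, r_sigs = gen_signatures(R)
s_sets, s_sigs = gen_signatures(S)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Rp (Id INTEGER, Sig TEXT)")
conn.execute("CREATE TABLE Sp (Id INTEGER, Sig TEXT)")
conn.executemany("INSERT INTO Rp VALUES (?, ?)", r_sigs)
conn.executemany("INSERT INTO Sp VALUES (?, ?)", s_sigs)

# Equi-join on the signature column, then duplicate elimination on the Id pair.
candidates = conn.execute(
    "SELECT DISTINCT Rp.Id AS rid, Sp.Id AS sid "
    "FROM Rp JOIN Sp ON Rp.Sig = Sp.Sig ORDER BY rid, sid"
).fetchall()

# Post-process: verify each candidate pair against the original sets.
result = [(i, j) for i, j in candidates if len(r_sets[i] ^ s_sets[j]) <= K]
print(candidates, result)
```

In this example the pair (2, 2) survives the join because both sets are empty in four of the five blocks, yet the two sets differ in six elements, so the post-process step removes it.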

  41. No False Positive Candidate Pair r s 1 2 3 4 5 |r Δs| = 5 Set-Similarity Joins

  42. False Positive Candidate Pair s1 s2 1 2 3 4 5 |r Δs| = 5 Set-Similarity Joins
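
The contrast between the last two slides can be illustrated with two pairs that both have |r Δ s| = 5. The sketch assumes signatures are projections onto five contiguous blocks of 1..50; both variants of `s` are constructed for illustration and the function names are hypothetical.

```python
def block_projections(s, n=50, parts=5):
    """Project s onto each of `parts` contiguous blocks of 1..n."""
    width = -(-n // parts)
    return [frozenset(x for x in s if p * width < x <= min((p + 1) * width, n))
            for p in range(parts)]

def is_candidate(r, s, n=50, parts=5):
    """True if r and s agree on at least one block, i.e. share a signature."""
    return any(a == b for a, b in zip(block_projections(r, n, parts),
                                      block_projections(s, n, parts)))

s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
spread = s | {1, 11, 21, 31, 50}     # one extra element in every block
clustered = s | {1, 2, 3, 5, 6}      # all five extra elements in the first block

for r in (spread, clustered):
    print(len(r ^ s), is_candidate(r, s))
```

Both pairs fail the predicate |r Δ s| ≤ 4. When the five differences fall in five different blocks, no signature is shared and the pair is never examined. When they cluster in one block, the other four blocks still match, so the pair becomes a candidate and has to be discarded during post-processing.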

  43. Enumeration: Performance k = 4 Set-Similarity Joins

  44. Enumeration: Performance k = 4 Ideal Performance Set-Similarity Joins

  45. Enumeration r s |r Δs| ≤ 4 Set-Similarity Joins

  46. Enumeration r s 1 2 3 4 5 6 |r Δs| ≤ 4 Set-Similarity Joins

  47. Enumeration: Signature Generation 1 2 3 4 5 6 s1 Set-Similarity Joins
