Efficient Exact Set-Similarity Joins
Arvind Arasu, Venkatesh Ganti, Raghav Kaushik
DMX Group, Microsoft Research
Data Cleaning Set-Similarity Joins
String Similarity Join: Reference Table
String Similarity (Self) Join
Strings → Sets [CGK ’06]
microsoft → 2-grams → {mi, ic, cr, ro, os, so, of, ft}
mcrosoft → 2-grams → {mc, cr, ro, os, so, of, ft}
(edit distance ≤ 1) → (Δ ≤ 4)
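The tokenization on this slide can be reproduced in a few lines. This is a minimal sketch; the helper name `qgrams` is made up for illustration and does not come from the talk.

```python
def qgrams(s, q=2):
    """Return the set of all length-q substrings (q-grams) of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

a = qgrams("microsoft")
b = qgrams("mcrosoft")

# Symmetric difference: 2-grams that occur in exactly one of the two sets.
delta = a ^ b
print(sorted(delta))  # ['ic', 'mc', 'mi']
print(len(delta))     # 3
```

For this pair the symmetric difference has size 3, which is within the Δ ≤ 4 bound the slide associates with edit distance ≤ 1.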
Strings → Sets
[Diagram: String Sim Join (edit distance ≤ 1) applied directly to string tables R and S; example strings: microsoft, mcrosoft]
Strings → Sets
[Diagram: R and S are each passed through Tokenize, then Set Sim Join (Δ ≤ 4), then Post-Process; example strings: microsoft, mcrosoft]
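The Tokenize → Set Sim Join → Post-Process pipeline might look roughly like the sketch below. The set-similarity join is a brute-force stand-in rather than the algorithm from the talk, and the function names and sample strings (including `softmicro`) are hypothetical.

```python
from itertools import product

def qgrams(s, q=2):
    """Set of all length-q substrings of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def edit_distance(a, b):
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def string_sim_join(R, S, max_edits=1, k=4):
    # Tokenize: map every string to its 2-gram set.
    r_sets = {r: qgrams(r) for r in R}
    s_sets = {s: qgrams(s) for s in S}
    # Set Sim Join (brute-force stand-in): keep pairs with |r Δ s| <= k.
    candidates = [(r, s) for r, s in product(R, S)
                  if len(r_sets[r] ^ s_sets[s]) <= k]
    # Post-Process: verify the original string predicate on each candidate.
    results = [(r, s) for r, s in candidates
               if edit_distance(r, s) <= max_edits]
    return candidates, results

R = ["microsoft", "logisoft"]
S = ["mcrosoft", "lgisoft", "boeing", "softmicro"]
candidates, results = string_sim_join(R, S)
print(candidates)
print(results)
```

Here `("microsoft", "softmicro")` passes the set join, since the two 2-gram sets differ only in `os` and `tm`, but it is dropped by the edit-distance check. That is why the post-processing step is needed.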
String → Set: Advantages
• Generalizes to many string similarity functions
• Powerful primitive
• Sets ≈ Relations
• Leverage relational data processing
• [CGK ’06]
Contributions
• New algorithms for set-similarity joins
• Exact answers
• Performance guarantees
• Outperform previous exact algorithms by orders of magnitude
Exact answers are important for operators
Outline
• Introduction
• Algorithms
• Experiments
• Conclusion
[Example, built up over several slides: R and S are collections of 2-gram sets, among them
{ mc, cr, ro, os, so, of, ft }, { bo, oe, ei, in, ng }, { lg, gi, is, so, of, ft }, { lo, og, gi, is, so, of, ft }, { mi, ic, cr, ro, os, so, of, ft }, { … }]
Join condition: intersection size ≥ 5
Output pairs:
{ mc, cr, ro, os, so, of, ft } with { mi, ic, cr, ro, os, so, of, ft }
{ lg, gi, is, so, of, ft } with { lo, og, gi, is, so, of, ft }
Sim(ri, sj) ≥ θ
[Same example in general form: R = r1, r2, r3, …, rn and S = s1, s2, s3, …, sm, annotated "Large"]
Set-Similarity Join: Symmetric Difference
• Input:
• R: r1, r2, …, rn (n sets)
• S: s1, s2, …, sm (m sets)
• Output: all pairs (ri, sj) such that |ri Δ sj| ≤ k
Running example: k = 4
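As a baseline for the definition above, here is a brute-force sketch that compares every pair. It is only meant to pin down the semantics; its n × m comparisons are what the algorithms in the talk are designed to avoid. The function name and the assignment of the example sets to R and S are assumptions.

```python
def ssjoin_naive(R, S, k):
    """All index pairs (i, j) with |R[i] Δ S[j]| <= k, by exhaustive comparison."""
    return [(i, j)
            for i, r in enumerate(R)
            for j, s in enumerate(S)
            if len(r ^ s) <= k]

# 2-gram sets from the earlier example slides (placement in R vs. S is assumed).
R = [{"mc", "cr", "ro", "os", "so", "of", "ft"},
     {"lg", "gi", "is", "so", "of", "ft"}]
S = [{"bo", "oe", "ei", "in", "ng"},
     {"lo", "og", "gi", "is", "so", "of", "ft"},
     {"mi", "ic", "cr", "ro", "os", "so", "of", "ft"}]

pairs = ssjoin_naive(R, S, k=4)
print(pairs)  # [(0, 2), (1, 1)]

# Symmetric difference and intersection size are related by
# |r Δ s| = |r| + |s| - 2 |r ∩ s|.
r, s = R[0], S[2]
assert len(r ^ s) == len(r) + len(s) - 2 * len(r & s)
```

Both output pairs have intersection size at least 5, matching the earlier example, and symmetric difference 3.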
Alternate Set Representation
s = { 4, 10, 13, 24, 29, 35, 41, 46, 48 }
[Figure: the elements of s marked as positions on an axis running from 1 to 50, with ticks at 1, 25 and 50]
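One way to read this representation is as a 0/1 characteristic vector over the domain 1..50. Under that reading, which is an assumption about the figure, the symmetric difference of two sets equals the Hamming distance between their vectors. The second set `r` below is made up for illustration.

```python
DOMAIN = 50  # elements are integers in 1..50, matching the axis on the slide

def to_bits(s, n=DOMAIN):
    """Characteristic vector of s over {1, ..., n} as a list of 0/1 values."""
    return [1 if x in s else 0 for x in range(1, n + 1)]

def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))

s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
r = {4, 10, 13, 24, 30, 35, 41, 46}  # hypothetical second set

print(hamming(to_bits(r), to_bits(s)))  # 3
print(len(r ^ s))                       # 3
```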
Enumeration
[Figure, built up over several slides: r and s with |r Δ s| ≤ 4; the positions where they differ are labeled "Errors", and the domain is split into parts numbered 1 to 5]
Enumeration: Signature Generation
[Figure: s is mapped to Sig(s), a set of five signatures shown graphically; Hash32() then maps them to { 0x4f72ba91, 0x29c8af10, 0x594b2c17, 0xa3b0e20f, 0xdd21f32a }]
Property of Signatures
|r Δ s| ≤ 4 ⇒ Sig(r) ∩ Sig(s) ≠ ∅
[Figure: r and s over parts 1 to 5]
Enumeration: Algorithm
• Generate signatures for each ri, sj
• Enumerate (ri, sj) s.t. Sig(ri) ∩ Sig(sj) ≠ ∅
• Output those satisfying |ri Δ sj| ≤ 4
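The three steps above can be sketched as follows. The sketch assumes, based on the five numbered parts in the figures, that a signature is one part of the domain together with the elements of the set that fall into it. With k = 4 and five parts, at most four parts can contain a difference, so at least one part (and hence one signature) is shared. Function names, equal-width parts over 1..50 and the sample sets are all illustrative assumptions; the slide's Hash32 step is omitted and tuples are used as signatures directly.

```python
from collections import defaultdict

def signatures(s, n=50, parts=5):
    """One signature per part: (part index, elements of s inside that part)."""
    width = n // parts  # assumes parts divides n
    return {(p, frozenset(x for x in s if p * width < x <= (p + 1) * width))
            for p in range(parts)}

def ssjoin(R, S, k=4, n=50):
    parts = k + 1
    # Step 1: generate signatures and index the S side by signature.
    index = defaultdict(list)
    for j, s in enumerate(S):
        for sig in signatures(s, n, parts):
            index[sig].append(j)
    # Step 2: enumerate pairs that share at least one signature.
    candidates = set()
    for i, r in enumerate(R):
        for sig in signatures(r, n, parts):
            for j in index.get(sig, ()):
                candidates.add((i, j))
    # Step 3: keep only candidates that satisfy the join predicate.
    return sorted((i, j) for i, j in candidates if len(R[i] ^ S[j]) <= k)

S = [{4, 10, 13, 24, 29, 35, 41, 46, 48}]
R = [{4, 10, 13, 24, 30, 35, 41, 46},  # differs from S[0] in 29, 30, 48
     {1, 2, 3}]                        # far from S[0]
print(ssjoin(R, S))  # [(0, 0)]
```

Because every qualifying pair shares a signature, the verification in step 3 only removes false positives; it never has to recover missed pairs.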
Enumeration
[Figure, built up over several slides: r1 … r5 with signature sets Sig(r1) … Sig(r5) matched against Sig(s1) … Sig(s5) of s1 … s5; for example Sig(r2) ∩ Sig(s1) ≠ ∅. False positive candidate pairs are separated from the Output.]
[Query plan, bottom to top:
R (Id, Elem), S (Id, Elem)
→ Gen Signatures on each input, producing R’ (Id, Sig) and S’ (Id, Sig)
→ join on R.Sig = S.Sig
→ δ R.Id, S.Id
→ Post-Process each R.Id, S.Id]
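The middle of this plan, an equi-join on signatures followed by duplicate elimination on (R.Id, S.Id), maps directly onto SQL. The sketch below uses Python's standard `sqlite3` module; the table names `Rsig` and `Ssig` stand in for R’ and S’, and the signature values are placeholders, not real signatures.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Rsig (Id INTEGER, Sig TEXT);  -- stands in for R'(Id, Sig)
    CREATE TABLE Ssig (Id INTEGER, Sig TEXT);  -- stands in for S'(Id, Sig)
""")
# Placeholder signatures: set 1 in R shares two signatures with set 10 in S.
con.executemany("INSERT INTO Rsig VALUES (?, ?)", [(1, "a"), (1, "b"), (2, "c")])
con.executemany("INSERT INTO Ssig VALUES (?, ?)", [(10, "a"), (10, "b"), (20, "d")])

# Join on R.Sig = S.Sig, then eliminate duplicate (R.Id, S.Id) pairs.
pairs = con.execute("""
    SELECT DISTINCT R.Id, S.Id
    FROM Rsig AS R JOIN Ssig AS S ON R.Sig = S.Sig
    ORDER BY R.Id, S.Id
""").fetchall()
print(pairs)  # [(1, 10)]
con.close()
```

Each surviving pair would then go to the post-processing step, which checks the actual similarity predicate.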
No False Positive Candidate Pair
[Figure: r and s over parts 1 to 5, with |r Δ s| = 5]
False Positive Candidate Pair
[Figure: s1 and s2 over parts 1 to 5, with |r Δ s| = 5]
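The contrast between these two slides can be reproduced with a small example, again assuming five equal-width parts over 1..50 and per-part signatures. The concrete sets are made up. Both pairs have symmetric difference 5, so neither belongs in the output for k = 4, but only one of them becomes a candidate.

```python
def signatures(s, n=50, parts=5):
    """One signature per part: (part index, elements of s inside that part)."""
    width = n // parts
    return {(p, frozenset(x for x in s if p * width < x <= (p + 1) * width))
            for p in range(parts)}

s = {1, 2, 11, 12, 21, 22, 31, 32, 41, 42}

# Five differences spread over all five parts: no part agrees with s.
r_spread = {2, 12, 22, 32, 42}
# Five differences packed into parts 1 to 3: parts 4 and 5 still agree with s.
r_packed = {22, 31, 32, 41, 42}

for name, r in [("spread", r_spread), ("packed", r_packed)]:
    shared = signatures(r) & signatures(s)
    print(name, len(r ^ s), "candidate" if shared else "not a candidate")
```

The packed case is a false positive candidate: it shares signatures with `s` and is rejected only when the predicate is verified.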
Enumeration: Performance (k = 4)
[Plot: measured performance compared against a curve labeled "Ideal Performance"; axes and data points not recoverable]
Enumeration
[Figure: r and s with |r Δ s| ≤ 4, the domain now split into parts numbered 1 to 6]
Enumeration: Signature Generation
[Figure, built up over several slides: s1 over parts 1 to 6]
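The slides here do not spell out how signatures are built from six parts, so the following is only one plausible reading. With k = 4 and six parts, at most four parts can contain a difference, so at least two parts agree. A signature can therefore be a pair of parts together with the set's elements in both. All names, the part width and the second set `r` are assumptions.

```python
from itertools import combinations
from math import ceil

def pair_signatures(s, n=50, parts=6, k=4):
    """One signature per choice of (parts - k) parts: the chosen part indices
    plus the elements of s that fall into each chosen part."""
    width = ceil(n / parts)  # the last part may be only partly used
    proj = [frozenset(x for x in s if (x - 1) // width == p)
            for p in range(parts)]
    keep = parts - k
    return {(combo, tuple(proj[p] for p in combo))
            for combo in combinations(range(parts), keep)}

s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
r = {4, 11, 13, 24, 30, 35, 41, 46, 48}  # hypothetical; differs in 10, 11, 29, 30

sig_s, sig_r = pair_signatures(s), pair_signatures(r)
print(len(sig_s))          # 15 signatures, one per pair of parts
print(len(r ^ s))          # 4
print(len(sig_s & sig_r))  # 6 shared signatures
```

Requiring two parts to agree is a stricter filter than requiring one, at the cost of more signatures per set (15 instead of 5 in this sketch).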