250 likes | 447 Views
Group Linkage. Byung -Won On (Penn State Univ.) Nick Koudas (Univ. of Toronto) Dongwon Lee (Penn State Univ.) Divesh Srivastava (AT&T Labs – Research). ICDE 2007. Outline. Introduction Matching Bipartite Graph Group Linkage Bipartite matching Pre-processing step to speed up
E N D
Group Linkage Byung-Won On (Penn State Univ.) Nick Koudas (Univ. of Toronto) Dongwon Lee (Penn State Univ.) DiveshSrivastava (AT&T Labs – Research) ICDE 2007
Outline • Introduction • Matching • Bipartite Graph • Group Linkage • Bipartite matching • Pre-processing step to speed up • Greedy matching • Heuristic measure • Experiment & Result • Conclusion
Introduction • Poor quality data in databases • Transcription errors • Lack of standards for recording • Poor database design • How to identify whether two entities are approximately the same? • Group linkage problem Ex: “J.Ullman” “J.D.Ullman” “Ullman, Jeffrey”
Ex : Paper1 K.L.Hsueh DBLP Paper2 Peter Pan Paper3 Davy Jones Paper4 ACM Lily Hsueh Paper5 Group : each author Records : a list of citations per author Implement Group Linkage Problem
Matching • Matching:A matching in a graph G is a set of non-loop edges with no shared endpoints • Maximum matching: A matching that contains the largest possible number of edges.
Bipartite Graph • Bipartite Graph:A graph is bipartite if V is the union of two disjoint independent sets called partite sets of G Bipartite matching
Group Linkage(1) • Jaccard similarity measure between two sets s1 and s2 • Records from the two groups can be put into matching when they are identical.
Group Linkage(3) Register Allocation & Spilling via graph coloring r11 r12 r13 r14 K.L.Hsueh g1 Lily.Hsueh g2 r21 r22 r23 r24 r25 Register Allocation and Spilling via graph coloring Similar records normalize Group similarity ,each
Bipartite Matching • Record-level similarity measure • [5] S.Chaudhuri, V.Ganti, and R. Kaushik. “A primitive Operator for Similarity Joins in Data Cleaning”. In IEEE ICED, 2006 • Maximum weight bipartite matching (BM) • [10] S. Guha, N.Koudas, A. Marathe, and D. Srivastava. “Merging the Results of Approximate Match Operations”. In VLCB, pages 636-647, 2004. • Applying this strategy for every pair of groups is infeasible. pre-processing step • Greedy matching • Heuristic measure
Greedy Matching(1) • S1: For each record ri ∈ g1, find a record rj ∈ g2with the highest record-level similarity among those with sim() ≥ ρ. • S2: Same as S1 r11 r12 r13 r14 g1 g2 r21 r22 r23 r24 r25 May not be a matching!
Greedy Matching(2) • Upper and lower bounds to BMsim,ρ r11 r12 r13 r14 g1 g2 r21 r22 r23 r24 r25
Greedy Matching(2) • is bounded • Only when , the more expensive computation would be needed.
Heuristic Measure • In practice that pairs of groups with a high value of will share at least one record with a high record-level similarity. • Simpler and faster measure
Implementation • Implemented UBsim,ρ, LBsim,ρ, and MAXsim,ρin SQL.(We only discuss UB) • Notation:
Experiment • Real data sets: Data sets from ACM and DBLP citation digital libraries. • R1: uniform data sets • R1a—average # of citations: left=41, right=25 • R1b—average # of citations: left=40, right=55 • R2: skewed data sets • R2DB —average # of citations: left=30, right=9 • R2AI —average # of citations: left=31, right=10 • R2Net —average # of citations: left=22, right=6
Experiment • Synthetic data sets: • S1a and S1b: same as R1a, but dummy authors are injected to the right • S1a: # of citations1/3 • S1b: # of citations3 • S2: using “dbgen” tool to generate dummy authors with varying levels of errors and inserted it to the right data set.
Experiment • Evaluation Metrics—average recall • if a2 is included in the top-k answer window for a1, then recall becomes 1, and 0 otherwise • Compared Methods • A(k1)|B(k2). • Step1: A, window size k1 • Step2: B, window size k2 • Microsoft SQL Server 2000 on Pentium III 3GHZ/512MB machine
Results • uniform data set : R1 real data set
Results • S1 and S2 synthetic data sets BM outperforms JA by 16-17% JA incorrect select dummy authors JA and BM are directly applied to S2
Results • R2 real data set UB outperform MAX in recall UB MAX Pre-processing using: UB MAX
Results • Record-level similarity measure :cosine similarity with TF/IDF weighting. • Running time against R2 (in sec)
Results • Window size
Conclusion • Proposed a bipartite matching based group similarity measure to solve group linkage problem. • Proved upper and lower bounds of BM can be used for speed-up. • BM is more robust group similarity measure than others