Group Linkage

Group Linkage Byung-Won On (Penn State Univ.) Nick Koudas (Univ. of Toronto) Dongwon Lee (Penn State Univ.) DiveshSrivastava (AT&T Labs – Research) ICDE 2007

Outline • Introduction • Matching • Bipartite Graph • Group Linkage • Bipartite matching • Pre-processing step to speed up • Greedy matching • Heuristic measure • Experiment & Result • Conclusion

Introduction • Poor quality data in databases • Transcription errors • Lack of standards for recording • Poor database design • How to identify whether two entities are approximately the same? • Group linkage problem Ex: “J.Ullman” “J.D.Ullman” “Ullman, Jeffrey”

Ex : Paper1 K.L.Hsueh DBLP Paper2 Peter Pan Paper3 Davy Jones Paper4 ACM Lily Hsueh Paper5 Group : each author Records : a list of citations per author Implement  Group Linkage Problem

Matching • Matching:A matching in a graph G is a set of non-loop edges with no shared endpoints • Maximum matching: A matching that contains the largest possible number of edges.

Bipartite Graph • Bipartite Graph:A graph is bipartite if V is the union of two disjoint independent sets called partite sets of G Bipartite matching

Group Linkage(1) • Jaccard similarity measure between two sets s1 and s2 • Records from the two groups can be put into matching when they are identical.

Group Linkage(2)

Group Linkage(3) Register Allocation & Spilling via graph coloring r11 r12 r13 r14 K.L.Hsueh g1 Lily.Hsueh g2 r21 r22 r23 r24 r25 Register Allocation and Spilling via graph coloring Similar records normalize Group similarity ,each

Bipartite Matching • Record-level similarity measure • [5] S.Chaudhuri, V.Ganti, and R. Kaushik. “A primitive Operator for Similarity Joins in Data Cleaning”. In IEEE ICED, 2006 • Maximum weight bipartite matching (BM) • [10] S. Guha, N.Koudas, A. Marathe, and D. Srivastava. “Merging the Results of Approximate Match Operations”. In VLCB, pages 636-647, 2004. • Applying this strategy for every pair of groups is infeasible. pre-processing step • Greedy matching • Heuristic measure

Greedy Matching(1) • S1: For each record ri ∈ g1, find a record rj ∈ g2with the highest record-level similarity among those with sim() ≥ ρ. • S2: Same as S1 r11 r12 r13 r14 g1 g2 r21 r22 r23 r24 r25 May not be a matching!

Greedy Matching(2) • Upper and lower bounds to BMsim,ρ r11 r12 r13 r14 g1 g2 r21 r22 r23 r24 r25

Greedy Matching(2) • is bounded • Only when , the more expensive computation would be needed.

Heuristic Measure • In practice that pairs of groups with a high value of will share at least one record with a high record-level similarity. • Simpler and faster measure

Implementation • Implemented UBsim,ρ, LBsim,ρ, and MAXsim,ρin SQL.(We only discuss UB) • Notation:

Experiment • Real data sets: Data sets from ACM and DBLP citation digital libraries. • R1: uniform data sets • R1a—average # of citations: left=41, right=25 • R1b—average # of citations: left=40, right=55 • R2: skewed data sets • R2DB —average # of citations: left=30, right=9 • R2AI —average # of citations: left=31, right=10 • R2Net —average # of citations: left=22, right=6

Experiment • Synthetic data sets: • S1a and S1b: same as R1a, but dummy authors are injected to the right • S1a: # of citations1/3 • S1b: # of citations3 • S2: using “dbgen” tool to generate dummy authors with varying levels of errors and inserted it to the right data set.

Experiment • Evaluation Metrics—average recall • if a2 is included in the top-k answer window for a1, then recall becomes 1, and 0 otherwise • Compared Methods • A(k1)|B(k2). • Step1: A, window size k1 • Step2: B, window size k2 • Microsoft SQL Server 2000 on Pentium III 3GHZ/512MB machine

Results • uniform data set : R1 real data set

Results • S1 and S2 synthetic data sets BM outperforms JA by 16-17% JA incorrect select dummy authors JA and BM are directly applied to S2

Results • R2 real data set UB outperform MAX in recall UB MAX Pre-processing using: UB MAX

Results • Record-level similarity measure :cosine similarity with TF/IDF weighting. • Running time against R2 (in sec)

Results • Window size

Conclusion • Proposed a bipartite matching based group similarity measure to solve group linkage problem. • Proved upper and lower bounds of BM can be used for speed-up. • BM is more robust group similarity measure than others

Group Linkage

Group Linkage

Presentation Transcript

Linkage Mapping

Linkage Mapping

LOGISTIC LINKAGE GROUP

Sex Linkage

Linkage Institutions

Linkage

Linkage Institutions

Joint Linkage and Linkage Disequilibrium Mapping

Linkage

Sex Linkage

Linkage

Sex Linkage

Linkage Disequilibrium

Linkage

LINKAGE ANALYSIS

Gene Linkage

Linkage, Sex linkage, Pedigrees

BeiJing Linkage Logistics WHO IS LINKAGE

Sex Linkage

Linkage Analysis

Linkage

Linkage Institutions