Record Linkage with Uniqueness Constraints and Erroneous Values • Zhang Xiaojian • November 26, 2010 • WAMDM Group Meeting
Data integration process • Schema matching [E. Rahm, VLDBJ'01] • Data exchange [R. Fagin, TODS'05] • Duplicate detection, also called record linkage [A. K. Elmagarmid, TKDE'07] or entity resolution [Tech Report, Stanford] • Data fusion [Felix, WWW'06], [Felix, ACMC'08], [X. Dong, VLDB'09] • Application1, Application2 • Uncertainty • Cleaned Data • [Figure: six data sources, each labeled s, feeding the integration process]
Contents • Motivation • Problem definition • Solution • Experimental results • Conclusions • Problems raised by the paper
Motivation • [Figure: sources s1, s2, s3, and s4 are integrated into Cleaned Data, which is queried through a Search Box]
Current Solution • Current two-step solution • Step 1: Record linkage: link records that are likely to refer to the same real-world entity [A. K. Elmagarmid, TKDE'07], [W. Winkler, Tech Report'06] • Step 2: Data fusion: merge the linked records and decide the correct values for each resulting entity in the presence of conflicts [J. Bleiholder et al., ACM Computing Surveys'08] • Uniqueness constraint: many real-world entities have a unique value for an attribute, e.g., website (IP), phone, Facebook account • The co-existence of conflicts and duplicates makes the problem hard to solve
Limitations of Current Solution • Example entities, written as (names) (phones) (addresses): • (Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way) • (Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.) • Assume that Phone and Address satisfy uniqueness constraints • Erroneous values may prevent correct matching • Current solutions may fall short when uniqueness constraints exist • [Figure annotation: (PHONE) 9400 missing]
Contents • Motivation • Problem definition • Solution • Experimental results • Conclusions and Future work
Problem Definition • Input: a set of records provided by a set of independent data sources; a set of (hard or soft) uniqueness constraints • Output: the real-world entities and, for each (hard or soft) uniqueness attribute of each entity, its true value
Concepts • Entity and attribute • Value vs. representations, e.g., the value New York City has the representations New York City, NYC, and N.Y.C • Constraint • Uniqueness constraint (hard constraint): D ↔ A, e.g., Business Name, Business Phone, Business Address • Soft uniqueness constraint (soft constraint): D ↔ A with the two directions annotated 1−p1 and 1−p2, e.g., Business Phone (p1 = 30%, p2 = 10%), where p1 is the upper bound on the probability of an entity having multiple values for A and p2 is the upper bound on the probability of a value of A being shared by multiple entities • Special case: key attribute • [Figure: the example entities (Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way) and (Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.), with links labeled 1−p1 and 1−p2]
Contents • Motivation • Problem definition • Solution • Experimental results • Conclusions and Future work
K-Partite Graph Encoding • Example entities: (Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way) and (Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.) • Record from source S1: Microsofe Corp., xxx-1255, 1 Microsoft Way • [Figure: the S1 record encoded as nodes N1 (Microsofe Corp.), P1 (xxx-1255), and A1 (1 Microsoft Way), connected by edges labeled s(1)]
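The encoding can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the record layout, the function name `encode_kpartite`, and the second record (from a hypothetical source s2) are assumptions made for the example. Each distinct value representation becomes a node in its attribute's partition, and an edge between two nodes is labeled with the set of sources whose records contain both values.

```python
from collections import defaultdict
from itertools import combinations

def encode_kpartite(records, attributes):
    """Return (nodes per attribute, source-labeled edges between attributes)."""
    nodes = {attr: set() for attr in attributes}
    edges = defaultdict(set)  # ((attr_a, value_a), (attr_b, value_b)) -> sources
    for source, values in records:
        # Keep the attribute order fixed so edge keys are consistent.
        present = [(a, values[a]) for a in attributes if values.get(a) is not None]
        for attr, value in present:
            nodes[attr].add(value)
        for u, v in combinations(present, 2):
            edges[(u, v)].add(source)
    return nodes, dict(edges)

records = [
    # Record of source s1, as on the slide.
    ("s1", {"name": "Microsofe Corp.", "phone": "xxx-1255",
            "address": "1 Microsoft Way"}),
    # Hypothetical record of a second source, added only for illustration.
    ("s2", {"name": "Microsoft Corp.", "phone": "xxx-1255",
            "address": "1 Microsoft Way"}),
]
nodes, edges = encode_kpartite(records, ["name", "phone", "address"])
phone_address = edges[(("phone", "xxx-1255"), ("address", "1 Microsoft Way"))]
```

Here `phone_address` holds both sources, because both records pair the same phone with the same address, while each name node is linked to the phone by only one source.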
Encoding of the ideal solution • Pre-processing for the k-partite graph: clustering within every partition (subset) • [Figure: k-partite graph with name nodes N1-N4 for Microsofe Corp., Microsoft Corp., MS Corp., and Macrosoft Inc.; phone nodes P1-P4 for xxx-1255, xxx-2255, xxx-9400, and xxx-0500; address nodes A1-A3 for 1 Microsoft Way, 2 Sylvan Way, and 2 Sylvan W.]
Clustering with Hard Constraint • Clustering the whole graph G(S) • [Figure: the k-partite graph over name nodes N1-N4, phone nodes P1-P4, and address nodes A1-A3, partitioned into clusters C1-C4]
Clustering w.r.t. hard constraint • An ideal clustering should meet two requirements: high cohesion within each cluster and low correlation between different clusters • Objective function for the "best" clustering: the Davies-Bouldin index [Davies and Bouldin, TPAMI'79] • The goal is to minimize the Davies-Bouldin index: min Φ, with Φ = Avg_i max_{j≠i} (d(Ci,Ci) + d(Cj,Cj)) / d(Ci,Cj) • d(Ci,Ci) corresponds to the complement of cohesion • d(Ci,Cj) corresponds to the complement of correlation • [Figure: two clusters, each with high cohesion, and low correlation between them]
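As a rough illustration of the objective, the sketch below computes the standard Davies-Bouldin index: for each cluster, take the worst ratio of summed intra-cluster distances to inter-cluster distance, then average over clusters. The function name `davies_bouldin`, the lookup table, and its distance values are toy assumptions, not numbers from the paper.

```python
def davies_bouldin(clusters, dist):
    """Average over clusters i of max over j != i of
    (dist(i, i) + dist(j, j)) / dist(i, j). Lower is better."""
    n = len(clusters)
    if n < 2:
        raise ValueError("the index needs at least two clusters")
    total = 0.0
    for i in range(n):
        total += max(
            (dist(clusters[i], clusters[i]) + dist(clusters[j], clusters[j]))
            / dist(clusters[i], clusters[j])
            for j in range(n)
            if j != i
        )
    return total / n

# Toy cluster distances (assumed values, for illustration only).
table = {("C1", "C1"): 0.1, ("C4", "C4"): 0.0, ("C1", "C4"): 0.8}

def lookup(a, b):
    return table[tuple(sorted((a, b)))]

phi = davies_bouldin(["C1", "C4"], lookup)
```

With these toy values both clusters contribute the ratio 0.1 / 0.8, so `phi` is about 0.125. Tight clusters (small intra-cluster distance) that lie far apart (large inter-cluster distance) drive the index down.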
Computing cluster distance • The cluster distance function d(Ci,Cj) combines two components • dS is the similarity distance, which measures similarity between value representations of the same attribute • dA is the association distance, which measures association between value representations of different attributes • The key is how to calculate dS and dA when computing the cluster distance
Similarity Distance • How to get the similarity distance • Pairwise name similarities among the three names in C1 (Microsoft Corp., Microsofe Corp., MS Corp.): 0.95, 0.65, 0.65; between these names and Macrosoft Inc. (N4) in C4: 0.7, 0.7, 0.4; between the addresses 2 Sylvan Way and 2 Sylvan W.: 0.9 • Within the same cluster: d1S(C1,C1) = 1 − (0.95+0.65+0.65)/3 = 0.25 (name); d2S(C1,C1) = 0 (phone); d3S(C1,C1) = 0 (address); dS(C1,C1) = (0.25+0+0)/3 = 0.083 • Across different clusters: d1S(C1,C4) = 1 − (0.7+0.7+0.4)/3 = 0.4 (name); d2S(C1,C4) = 1 − 0 = 1 (phone); d3S(C1,C4) = 1 − 0 = 1 (address); dS(C1,C4) = (0.4+1+1)/3 = 0.8
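The arithmetic on this slide can be checked with a short Python sketch. It assumes, based only on the numbers shown, that the per-attribute distance is one minus the mean pairwise similarity, that an attribute with no pairs to compare contributes 0, and that the three attributes are averaged with equal weight. The function names are illustrative.

```python
def attribute_distance(similarities):
    """1 - mean pairwise similarity; 0 when there are no pairs to compare."""
    if not similarities:
        return 0.0
    return 1.0 - sum(similarities) / len(similarities)

def similarity_distance(per_attribute_similarities):
    """Unweighted average of the per-attribute distances."""
    parts = [attribute_distance(s) for s in per_attribute_similarities]
    return sum(parts) / len(parts)

# Within C1: three name pairs; a single phone and a single address (no pairs).
within_c1 = similarity_distance([[0.95, 0.65, 0.65], [], []])

# C1 vs. C4: three cross name pairs, one phone pair, two address pairs,
# with similarity 0 for the phone and address pairs as on the slide.
across_c1_c4 = similarity_distance([[0.7, 0.7, 0.4], [0.0], [0.0, 0.0]])
```

Rounded to three decimals, `within_c1` and `across_c1_c4` come out as 0.083 and 0.8, matching dS(C1,C1) and dS(C1,C4) above.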
Association Distance • How to get the association distance • Within the same cluster: d1,2A(C1,C1) = 1 − 7/9 = 0.22; d1,3A(C1,C1) = 1 − 8/9 = 0.11; d2,3A(C1,C1) = 1 − 7/8 = 0.125; dA(C1,C1) = (0.22+0.11+0.125)/3 = 0.153 • Across different clusters: d1,2A(C1,C4) = 1 − max(1/10, 0/10) = 0.9; d1,3A(C1,C4) = 0.9; d2,3A(C1,C4) = 1; dA(C1,C4) = (0.9+0.9+1)/3 = 0.93 • [Figure: clusters C1 and C4 in the k-partite graph over name nodes N1-N4, phone nodes P1 and P4, and address nodes A1-A3; each edge is labeled with the sources that support it, e.g., s(1-9), s(10), s(2-5), s(7-8)]
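The numbers on this slide can be reproduced exactly with Python's `fractions` module. The sketch only mirrors the arithmetic shown: each attribute-pair distance is taken as one minus the largest supporting fraction, and the three attribute pairs are averaged with equal weight. How the fractions themselves are counted from the sources is not spelled out on the slide, so they are entered here as given; the function names are illustrative.

```python
from fractions import Fraction

def pair_distance(*support_fractions):
    """1 minus the largest fraction of supporting sources."""
    return 1 - max(support_fractions)

def association_distance(pair_distances):
    """Unweighted average over the attribute pairs."""
    return sum(pair_distances) / len(pair_distances)

# Within C1: name-phone 7/9, name-address 8/9, phone-address 7/8.
within_c1 = association_distance([
    pair_distance(Fraction(7, 9)),
    pair_distance(Fraction(8, 9)),
    pair_distance(Fraction(7, 8)),
])

# C1 vs. C4: name-phone from max(1/10, 0/10); the slide lists the other
# two pair distances (0.9 and 1) directly, so they are used as given.
across_c1_c4 = association_distance([
    pair_distance(Fraction(1, 10), Fraction(0, 10)),
    Fraction(9, 10),
    Fraction(1),
])
```

`within_c1` is exactly 11/72 (about 0.153) and `across_c1_c4` is exactly 14/15 (about 0.93), in line with the rounded values on the slide.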
Greedy Algorithm: CLUSTER • Obtaining the optimal clustering is intractable [T. F. Gonzalez, 82], [J. Šíma et al., 06] • Algorithm CLUSTER • Step 1: Initialization: cluster value representations according to their similarity distance and association distance • Step 2: Adjustment: move each node to the cluster that minimizes the Davies-Bouldin (DB) index • Step 3: Convergence checking: stop if Step 2 does not change the clustering result; otherwise, repeat Step 2
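Steps 2 and 3 can be sketched as a simple local search. This is an illustrative simplification, not the paper's CLUSTER: the initial clustering is passed in rather than computed (Step 1 is skipped), cluster distance is taken as the mean pairwise distance between members, and toy 1-D points stand in for value representations. All function names are assumptions.

```python
from itertools import combinations

def cluster_distance(a, b, dist):
    """Mean pairwise distance; within one cluster, over distinct pairs."""
    pairs = list(combinations(a, 2)) if a is b else [(x, y) for x in a for y in b]
    return sum(dist(x, y) for x, y in pairs) / len(pairs) if pairs else 0.0

def db_index(clusters, dist):
    live = [c for c in clusters if c]
    if len(live) < 2:
        return float("inf")  # never prefer collapsing into one cluster
    total = 0.0
    for i, ci in enumerate(live):
        total += max(
            (cluster_distance(ci, ci, dist) + cluster_distance(cj, cj, dist))
            / max(cluster_distance(ci, cj, dist), 1e-9)
            for j, cj in enumerate(live) if j != i
        )
    return total / len(live)

def greedy_cluster(initial, dist, max_rounds=100):
    clusters = [list(c) for c in initial]
    for _ in range(max_rounds):
        changed = False
        for node in [n for c in clusters for n in c]:  # snapshot of the nodes
            src = next(i for i, c in enumerate(clusters) if node in c)
            best, best_score = src, db_index(clusters, dist)
            for dst in range(len(clusters)):
                if dst == src:
                    continue
                # Try the move, score it, then undo it.
                clusters[src].remove(node)
                clusters[dst].append(node)
                score = db_index(clusters, dist)
                clusters[dst].remove(node)
                clusters[src].append(node)
                if score < best_score:
                    best, best_score = dst, score
            if best != src:  # Step 2: apply the best improving move
                clusters[src].remove(node)
                clusters[best].append(node)
                changed = True
        if not changed:  # Step 3: converged
            break
    return [c for c in clusters if c]

result = greedy_cluster([[0, 1, 2, 10], [11, 12]], lambda x, y: abs(x - y))
```

On this toy input the misplaced point 10 moves to the second cluster and no further move lowers the index, so the search stops with the groups {0, 1, 2} and {10, 11, 12}. Like any greedy local search, it can stop at a local optimum on other inputs.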
[Figure: CLUSTER on the running example, with name nodes N1-N4, phone nodes P1-P4, address nodes A1-A3, and clusters C1-C4; the steps are annotated with the values Φ = 0.94, 0.93, 0.71, 0.92, 1.15, 1.16, 0.89, 0.71, and 0.45]
Matching w.r.t. Soft Constraints • The next step is to find the best matching between the key attribute and the soft uniqueness attributes • How to match? • Graph transform: each cluster becomes one node (name clusters NC1 and NC4; phone clusters PC1-PC4; address clusters AC1 and AC4), and each edge is weighted by the number of sources that support it • [Figure: transformed graph with edge labels 7 s(1-5,7,8), 9 s(1-9), 1 s(6), 1 s(10), 5 s(1-5), 9 s(1-9), 8 s(1-8), and 1 s(10)]
Matching w.r.t. Soft Constraint • Goals: maximize the sum of the weights w(e) of the selected edges; minimize the gap Gap(N) for each node • How to balance these two goals? Define a score function that balances w(e) and Gap(N) • The "best" matching maximizes the score function • Greedy algorithm: MATCH • Getting Gap(N) and w(u,v) • [Figure: node N1 connected to P1, P2, and P3 by edges labeled 9 (s2-s10), 1 (s1), and 7 (s4-s10)]
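For intuition about greedy edge selection, here is a deliberately simplified Python sketch. It is not MATCH: it ignores Gap(N) and the score function (which the slide does not spell out) and enforces a strict one-to-one matching, i.e., the hard-constraint special case. The function name and the edge weights are illustrative assumptions.

```python
def greedy_matching(weights):
    """Pick edges in decreasing weight order, skipping any edge whose
    endpoint is already matched (one-to-one matching)."""
    matched_left, matched_right, selected = set(), set(), []
    for (u, v), w in sorted(weights.items(), key=lambda item: -item[1]):
        if u not in matched_left and v not in matched_right:
            selected.append((u, v, w))
            matched_left.add(u)
            matched_right.add(v)
    return selected

# Illustrative weights: number of sources supporting each name-phone edge.
weights = {
    ("N1", "P1"): 1, ("N1", "P2"): 7, ("N1", "P3"): 9, ("N1", "P4"): 10,
    ("N2", "P2"): 3, ("N2", "P4"): 8,
}
selected = greedy_matching(weights)
total = sum(w for _, _, w in selected)
```

On these weights the greedy pass selects N1-P4 (10) and then N2-P2 (3), for a total of 13, even though N1-P3 with N2-P4 would total 17. Pure weight-greedy selection can therefore be suboptimal, which is one reason to score candidate edges with more than their weight alone.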
Continue the example • [Figure: four candidate solutions (Solution 1 to Solution 4) over name nodes N1 and N2 and phone nodes P1-P4, built by greedily selecting edges] • Edge labels shown in each panel: 1 (s1), 3 (s3-s5), 7 (s4-s10), 8 (s2-s9), 9 (s2-s10), 10 (s1-s10) • Values listed under Solutions 1 and 2: Gap(N1) = 9, Gap(N1) = 3, Gap(N2) = 5, Gap(N2) = 0, Gap(P1) = 0, Gap(P2) = 4, Gap(P2) = 4, Gap(P4) = 2; w(N1,P1) = 1, w(N1,P2) = 7, w(N2,P2) = 3, w(N2,P4) = 8 • Values listed under Solutions 3 and 4: Gap(N1) = 0, Gap(N2) = 0, Gap(N1) = 1, Gap(N2) = 0, Gap(P4) = 2, Gap(P4) = 2, Gap(P3) = 0, Gap(P4) = 2; w(N1,P4) = 10, w(N2,P2) = 8, w(N1,P3) = 9, w(N2,P2) = 8
Contents • Motivation • Problem definition • Solution • Experimental results • Conclusions and Future work
Experiment Settings • Dataset I: business listings for two zip codes (07035, 07715) from multiple sources
Experiment Settings • Implementation • MATCH + CLUSTER • LINK: linkage only • FUSE: data fusion only • LINKFUSE: first LINK, then FUSE • Gold standard: obtained by manual checking • Measures: precision, recall, F-measure
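The three measures can be computed from sets of predicted and gold-standard items, for example (name, phone) matches. The sketch below uses the usual definitions, with the F-measure taken as the harmonic mean of precision and recall; the function name and the example pairs are hypothetical and are not results from the experiments.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F-measure of a predicted set against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical name-phone matches, for illustration only.
gold = {
    ("Microsoft Corp.", "xxx-1255"),
    ("Microsoft Corp.", "xxx-9400"),
    ("Macrosoft Inc.", "xxx-0500"),
}
predicted = {
    ("Microsoft Corp.", "xxx-1255"),
    ("Macrosoft Inc.", "xxx-9400"),
}
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

In this toy case one of the two predicted matches is correct (precision 0.5) and one of the three gold matches is found (recall 1/3), giving an F-measure of 0.4.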
Accuracy • [Figures: accuracy results for zip code 07035: Matching (NAME-PHONE), Matching (NAME-ADDRESS), Clustering (NAME); and for zip code 07715: Matching (NAME-PHONE), Matching (NAME-ADDRESS), Clustering (NAME)]
Conclusions • In the real world, we need to resolve duplicates and conflicts at the same time • We reduce the problem to a k-partite graph clustering and matching problem • This combines linkage and fusion • Experiments show high efficiency and scalability