Record Linkage with Uniqueness Constraints and Erroneous Values

Record Linkage with Uniqueness Constraints and Erroneous Values • Zhang Xiaojian • 26 November 2010 • WAMDM Group Meeting

Data integration process • Schema matching (E. Rahm, VLDBJ01) • Duplicate detection and record linkage (A.K.E, TKDE07) • Entity resolution (Tech Report, Stanford) • Data fusion (Felix, WWW06; Felix, ACMC08; X. Dong, VLDB09) • Data exchange (R. Fagin, TODS05) • Output: cleaned data, with uncertainty • Applications: Application1, Application2 • [Figure: pipeline from multiple sources s to cleaned data]

Contents • Motivation • Problem definition • Solution • Experimental results • Conclusions • Open problems arising from the paper

Motivation • [Figure: sources s1, s2, s3, s4 are integrated into cleaned data, which serves a search box]

Current Solution • Current two-step solution • Step 1: Record Linkage • Link records that are likely to refer to the same real-world entity • [A.K. Elmagarmid, TKDE’07], [W. Winkler, Tech Report’06] • Step 2: Data Fusion • Merge the linked records and decide the correct values for each result entity in the presence of conflicts [J. Bleiholder et al., ACM Computing Surveys08] • Uniqueness constraint • Many real-world entities have a unique value for an attribute, e.g., Website (IP), Phone, Facebook account • The co-existence of conflicts and duplicates makes the problem hard to solve

Limitations of Current Solution • Example entities (name; phone; address): (Microsoft Corp., Microsofe Corp., MS Corp.) (XXX-1255, xxx-9400) (1 Microsoft Way) and (Macrosoft Inc.) (XXX-0500) (2 Sylvan Way, 2 Sylvan W.) • Assume that Phone and Address satisfy uniqueness constraints • Erroneous values may prevent correct matching • Current solutions may fall short when uniqueness constraints exist, e.g., (PHONE) 9400 missing

Contents • Motivation • Problem definition • Solution • Experimental results • Conclusions and Future work

Problem Definition • Input • A set of records provided by a set of independent data sources • A set of (hard or soft) uniqueness constraints • Output • The real-world entities • The true value of each (hard or soft) uniqueness attribute of each entity

Concepts • Entity and Attribute • Value vs. Representations (e.g., the value New York City has the representations New York City, NYC, N.Y.C) • Constraint • Uniqueness constraint (hard constraint): D ↔ A • E.g., Business Name, Business Phone, Business Address • Soft uniqueness constraint (soft constraint): D ↔ A, with the arrow annotated by 1-p1 and 1-p2 • E.g., Business Phone (p1=30%, p2=10%), where p1 is the upper bound on the probability of an entity having multiple values for A and p2 is the upper bound on the probability of a value of A being shared by multiple entities • Special case: key attribute • Example entities (name; phone; address): (Macrosoft Inc.) (XXX-0500) (2 Sylvan Way, 2 Sylvan W.) and (Microsoft Corp., Microsofe Corp., MS Corp.) (XXX-1255, xxx-9400) (1 Microsoft Way)
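One way to make the two bounds p1 and p2 concrete is to measure the corresponding fractions on a set of (entity, value) pairs for a single attribute A. The sketch below assumes the bounds can be read as empirical fractions; the helper name `soft_uniqueness_rates` and the toy pairs are hypothetical and do not come from the slides.

```python
from collections import defaultdict

def soft_uniqueness_rates(pairs):
    """pairs: non-empty iterable of (entity, value) for one attribute A.

    Returns (fraction of entities with multiple values,
             fraction of values shared by multiple entities).
    """
    values_of = defaultdict(set)    # entity -> values of A
    entities_of = defaultdict(set)  # value of A -> entities
    for entity, value in pairs:
        values_of[entity].add(value)
        entities_of[value].add(entity)
    multi_valued = sum(len(v) > 1 for v in values_of.values()) / len(values_of)
    shared = sum(len(e) > 1 for e in entities_of.values()) / len(entities_of)
    return multi_valued, shared

# Hypothetical toy data: e1 has two phones, and p3 is shared by e2 and e3.
pairs = [("e1", "p1"), ("e1", "p2"), ("e2", "p3"), ("e3", "p3"), ("e4", "p4")]
multi_valued, shared = soft_uniqueness_rates(pairs)

# Compare against the example bounds p1 = 30% and p2 = 10%.
within_p1 = multi_valued <= 0.30
within_p2 = shared <= 0.10
```

On this toy data one of four entities is multi-valued and one of four values is shared, so the first bound holds and the second does not.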

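K-Partite Graph Encoding • Example entities (name; phone; address): (Microsoft Corp., Microsofe Corp., MS Corp.) (XXX-1255, xxx-9400) (1 Microsoft Way) and (Macrosoft Inc.) (XXX-0500) (2 Sylvan Way, 2 Sylvan W.) • Source S1 provides the record (Microsofe Corp., xxx-1255, 1 Microsoft Way) • [Figure: the record is encoded as nodes N1 (Microsofe Corp.), P1 (xxx-1255), and A1 (1 Microsoft Way), connected by edges labeled s(1)]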
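The encoding can be sketched as follows, assuming each record is a (source id, attribute-to-representation) pair and that an edge joins two value representations whenever some record contains both, labeled with the supporting sources. The function name `encode_k_partite` and the data layout are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations

def encode_k_partite(records):
    """records: iterable of (source_id, {attribute: value representation})."""
    nodes = defaultdict(set)  # attribute -> value representations (one partite set each)
    edges = defaultdict(set)  # frozenset of two (attribute, value) nodes -> supporting sources
    for source_id, values in records:
        for attr, rep in values.items():
            nodes[attr].add(rep)
        # Connect every pair of representations that co-occur in this record.
        for node_a, node_b in combinations(sorted(values.items()), 2):
            edges[frozenset([node_a, node_b])].add(source_id)
    return nodes, edges

# The record provided by source S1 in the slide.
records = [
    (1, {"name": "Microsofe Corp.", "phone": "xxx-1255", "address": "1 Microsoft Way"}),
]
nodes, edges = encode_k_partite(records)
name_phone = edges[frozenset([("name", "Microsofe Corp."), ("phone", "xxx-1255")])]
```

With k = 3 attributes, a single record contributes three nodes and three edges, each supported by source 1.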

Encoding of the ideal solution • Pre-processing for the K-partite graph • Clustering in every partite (subset) • [Figure: name nodes N1–N4 (Microsofe Corp., Microsoft Corp., MS Corp., Macrosoft Inc.), phone nodes P1–P4 (xxx-9400, xxx-1255, xxx-2255, xxx-0500), and address nodes A1–A3 (1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.)]

Clustering with Hard Constraint • Clustering the whole graph G(S) • [Figure: name nodes N1–N4 (Microsoft Corp., MS Corp., Microsofe Corp., Macrosoft Inc.), phone nodes P1–P4 (xxx-9400, xxx-1255, xxx-0500, xxx-2255), and address nodes A1–A3 (1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.) grouped into clusters C1–C4]

Clustering w.r.t. hard constraint • An ideal clustering should meet two requirements • High cohesion within each cluster • Low correlation between different clusters • Objective function for getting the “best” clustering • Use the Davies-Bouldin index Φ [Davies and Bouldin, TPAMI79] • The goal is to minimize the Davies-Bouldin index: min Φ, where Φ = avg_i max_{j≠i} (d(Ci,Ci) + d(Cj,Cj)) / d(Ci,Cj) • d(Ci,Ci) corresponds to the complement of cohesion • d(Ci,Cj) corresponds to the complement of correlation
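As a rough illustration of the objective, the sketch below evaluates a Davies-Bouldin style index from a matrix of cluster distances. It assumes the usual form of the index, that is, the average over clusters of the worst ratio (d(Ci,Ci) + d(Cj,Cj)) / d(Ci,Cj); the distance values are made up for the example and are not taken from the slides.

```python
def davies_bouldin(d):
    """d[i][j]: distance between clusters i and j; d[i][i]: intra-cluster distance.

    Assumes at least two clusters and non-zero distances between distinct clusters.
    """
    n = len(d)
    total = 0.0
    for i in range(n):
        # Worst-case ratio of combined intra-cluster spread to inter-cluster separation.
        total += max((d[i][i] + d[j][j]) / d[i][j] for j in range(n) if j != i)
    return total / n

# Hypothetical distances for two clusters: tight clusters that are far apart.
d = [[0.1, 0.8],
     [0.8, 0.2]]
phi = davies_bouldin(d)  # (0.3 / 0.8 + 0.3 / 0.8) / 2 = 0.375
```

A lower value indicates tighter clusters that are better separated, which is why the objective is minimized.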

Computing cluster distance • Cluster distance function d(Ci,Cj) • dS is the similarity distance, measuring similarity between value representations of the same attribute • dA is the association distance, measuring association between value representations of different attributes • The key is how to calculate dS and dA for computing the cluster distance

Similarity Distance • How to get the similarity distance dS
• Within the same cluster (C1, C1)
  d1S(C1,C1) = 1 − (0.95+0.65+0.65)/3 = 0.25 (name)
  d2S(C1,C1) = 0 (phone)
  d3S(C1,C1) = 0 (address)
  dS(C1,C1) = (0.25+0+0)/3 = 0.083
• Between different clusters (C1, C4)
  d1S(C1,C4) = 1 − (0.7+0.7+0.4)/3 = 0.4 (name)
  d2S(C1,C4) = 1 − 0 = 1 (phone)
  d3S(C1,C4) = 1 − 0 = 1 (address)
  dS(C1,C4) = (0.4+1+1)/3 = 0.8
• [Figure: clusters C1 and C4 with name nodes N1–N4, phone nodes P1 and P4, and address nodes A1–A3; edges are labeled with pairwise similarities]
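The arithmetic on this slide can be reproduced with a short sketch. It assumes that each per-attribute distance is one minus the mean of the listed pairwise similarities, that an attribute with no pairs to compare contributes 0, and that the overall distance is the mean over the three attributes; the function names are illustrative.

```python
def attribute_distance(similarities):
    """1 minus the mean pairwise similarity; 0 when there are no pairs to compare."""
    if not similarities:
        return 0.0
    return 1.0 - sum(similarities) / len(similarities)

def similarity_distance(per_attribute_similarities):
    """Mean of the per-attribute distances (name, phone, address)."""
    distances = [attribute_distance(s) for s in per_attribute_similarities]
    return sum(distances) / len(distances)

# Within C1: three name similarities; phone and address have nothing to compare.
within = similarity_distance([[0.95, 0.65, 0.65], [], []])       # about 0.083

# Between C1 and C4: name similarities, and similarity 0 for phone and address.
between = similarity_distance([[0.7, 0.7, 0.4], [0.0], [0.0]])   # about 0.8
```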

Association Distance • How to get the association distance dA
• Within the same cluster (C1, C1)
  d1,2A(C1,C1) = 1 − 7/9 = 0.22
  d1,3A(C1,C1) = 1 − 8/9 = 0.11
  d2,3A(C1,C1) = 1 − 7/8 = 0.125
  dA(C1,C1) = (0.22+0.11+0.125)/3 = 0.153
• Between different clusters (C1, C4)
  d1,2A(C1,C4) = 1 − max(1/10, 0/10) = 0.9
  d1,3A(C1,C4) = 0.9
  d2,3A(C1,C4) = 1
  dA(C1,C4) = (0.9+0.9+1)/3 = 0.93
• [Figure: clusters C1 and C4 with name nodes N1–N4, phone nodes P1 and P4, and address nodes A1–A3 (1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.); edges are labeled with their supporting sources, e.g. S(1-9), S(10), s(1-5,7,8)]
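The same numbers can be checked with exact fractions. The sketch assumes each attribute-pair distance is one minus a support ratio and that the overall distance is the mean over the three attribute pairs. The slide gives the ratio only for some pairs; the ratios 1/10 and 0 used for d1,3A(C1,C4) and d2,3A(C1,C4) are inferred from the stated distances 0.9 and 1.

```python
from fractions import Fraction

def association_distance(support_ratios):
    """Mean over attribute pairs of (1 - support ratio)."""
    return sum(1 - r for r in support_ratios) / len(support_ratios)

# Within C1: ratios 7/9 (name-phone), 8/9 (name-address), 7/8 (phone-address).
within = association_distance([Fraction(7, 9), Fraction(8, 9), Fraction(7, 8)])

# Between C1 and C4: max(1/10, 0/10) for name-phone; the other two are inferred.
between = association_distance(
    [max(Fraction(1, 10), Fraction(0, 10)), Fraction(1, 10), Fraction(0)]
)

within_rounded = round(float(within), 3)    # 0.153
between_rounded = round(float(between), 2)  # 0.93
```

Using exact fractions shows that 0.153 on the slide comes from the unrounded terms (11/72), not from the rounded values 0.22 and 0.11.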

Greedy Algorithm: CLUSTER • Obtaining the optimal clustering is intractable • [T.F. Gonzalez, 82], [J. Simal et al., 06] • Algorithm: CLUSTER • Step 1: Initialization • Cluster value representations according to their similarity distance and association distance • Step 2: Adjustment • For each node, move it to the cluster that minimizes the Davies-Bouldin (DB) index • Step 3: Convergence checking • Stop if Step 2 does not change the clustering result; otherwise, repeat Step 2
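The adjustment and convergence steps can be sketched as a simple local search. This is a minimal illustration, not the CLUSTER algorithm itself: it takes the initial clustering as given, uses the mean pairwise node distance as a stand-in for the cluster distance built from dS and dA, and treats a clustering with fewer than two non-empty clusters as infinitely bad. All function names and the toy distance matrix are assumptions.

```python
from itertools import product

def cluster_distance(dist, a, b):
    """Mean node distance between clusters a and b (a == b gives the intra-cluster distance)."""
    pairs = [(u, v) for u, v in product(a, b) if u != v]
    return sum(dist[u][v] for u, v in pairs) / len(pairs) if pairs else 0.0

def db_index(dist, clusters):
    clusters = [c for c in clusters if c]
    if len(clusters) < 2:
        return float("inf")
    total = 0.0
    for i, ci in enumerate(clusters):
        total += max(
            (cluster_distance(dist, ci, ci) + cluster_distance(dist, cj, cj))
            / max(cluster_distance(dist, ci, cj), 1e-9)  # guard against zero separation
            for j, cj in enumerate(clusters) if j != i
        )
    return total / len(clusters)

def greedy_cluster(dist, initial_clusters):
    clusters = [set(c) for c in initial_clusters]
    changed = True
    while changed:                       # Step 3: repeat until nothing moves
        changed = False
        for node in sorted(set().union(*clusters)):
            home = next(i for i, c in enumerate(clusters) if node in c)
            best, best_phi = home, db_index(dist, clusters)
            for target in range(len(clusters)):
                if target == home:
                    continue
                # Step 2: tentatively move the node and score the result.
                clusters[home].discard(node)
                clusters[target].add(node)
                phi = db_index(dist, clusters)
                clusters[target].discard(node)
                clusters[home].add(node)
                if phi < best_phi - 1e-12:
                    best, best_phi = target, phi
            if best != home:
                clusters[home].discard(node)
                clusters[best].add(node)
                changed = True
    return [c for c in clusters if c]

# Hypothetical distances: nodes 0 and 1 are close, as are nodes 2 and 3.
dist = [[0.0, 0.1, 0.9, 0.9],
        [0.1, 0.0, 0.9, 0.9],
        [0.9, 0.9, 0.0, 0.1],
        [0.9, 0.9, 0.1, 0.0]]
result = greedy_cluster(dist, [{0, 1, 2}, {3}])
```

Because a node moves only when the index strictly decreases, the loop terminates; starting from the poor initial clustering above, node 2 is moved next to node 3.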

[Figure: CLUSTER on the running example, with clusters C1–C4 over name nodes N1–N4 (Microsoft Corp., Microsofe Corp., MS Corp., Macrosoft Inc.), phone nodes P1–P4 (xxx-0500, xxx-9400, xxx-1255, xxx-2255), and address nodes A1–A3 (1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.); Φ values shown in the figure: 0.94, 0.93, 0.71, 0.92, 1.15, 1.16, 0.89, 0.71, 0.45]

Matching w.r.t. Soft Constraints • Next step is to find the best matching between the key attribute and the soft uniqueness attributes • How to match? • [Figure: graph transform of the clustered graph. Name clusters NC1 and NC4, phone clusters PC1–PC4 (xxx-2255, xxx-9400, xxx-1255, xxx-0500), and address clusters AC1 and AC4 become nodes; each edge is weighted by the number of sources that support it, e.g. 9 for S(1-9), 8 for S(1-8), 7 for s(1-5,7,8), 5 for s(1-5), 1 for S(6), 1 for S(10)]

Matching w.r.t. Soft Constraint • Goals • Maximize the sum of the weights w(e) of the selected edges • Minimize the gap Gap(N) of each node • How to balance these two goals? Define a score function that combines w(e) and Gap(N) • Getting the “best” matching • Maximize the score function • Greedy algorithm: MATCH • Computing Gap(N) and w(u,v) • [Figure: node N1 with edges of weight 9 (s2-s10), 1 (s1), and 7 (s4-s10) to phone nodes P1, P2, P3]

Continue the example • [Figure: four candidate solutions matching name nodes N1, N2 to phone nodes P1–P4, with edges selected greedily; edge weights and supporting sources: 3 (s3-s5), 9 (s2-s10), 1 (s1), 8 (s2-s9), 10 (s1-s10), 7 (s4-s10)]
• Solution 1: Gap(N1) = 9, Gap(N2) = 5, Gap(P1) = 0, Gap(P2) = 4; w(N1,P1) = 1, w(N2,P2) = 3
• Solution 2: Gap(N1) = 3, Gap(N2) = 0, Gap(P2) = 4, Gap(P4) = 2; w(N1,P2) = 7, w(N2,P4) = 8
• Solution 3: Gap(N1) = 0, Gap(N2) = 0, Gap(P4) = 2, Gap(P4) = 2; w(N1,P4) = 10, w(N2,P2) = 8
• Solution 4: Gap(N1) = 1, Gap(N2) = 0, Gap(P3) = 0, Gap(P4) = 2; w(N1,P3) = 9, w(N2,P2) = 8
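To see why edge weight alone may not be enough, here is a sketch of a plain greedy selection that looks only at the first goal. It is not the MATCH algorithm: Gap(N) and the score function are not specified in these slides and are left out, and a one-to-one restriction is imposed as a simplification. The weights are the w(·,·) values quoted in the example; for w(N2,P2), which is listed as both 3 and 8, the value 3 from Solution 1 is used.

```python
def greedy_one_to_one(edges):
    """edges: dict mapping (left, right) -> weight. Returns the selected edges."""
    selected = {}
    used_left, used_right = set(), set()
    # Visit edges from heaviest to lightest and keep those whose endpoints are both free.
    for (left, right), weight in sorted(edges.items(), key=lambda kv: -kv[1]):
        if left not in used_left and right not in used_right:
            selected[(left, right)] = weight
            used_left.add(left)
            used_right.add(right)
    return selected

edges = {
    ("N1", "P1"): 1, ("N1", "P2"): 7, ("N1", "P3"): 9, ("N1", "P4"): 10,
    ("N2", "P2"): 3, ("N2", "P4"): 8,
}
matching = greedy_one_to_one(edges)
total = sum(matching.values())
```

The weight-only greedy choice picks (N1, P4) and then (N2, P2) for a total of 13, whereas selecting (N1, P3) and (N2, P4) would total 17, so a purely weight-driven greedy pass can miss a better overall matching.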

Experiment Settings • Dataset I • Business listings for two zip codes (07035, 07715) from multiple sources

Experiment Settings • Implementation • MATCH + CLUSTER • LINK: linkage only • FUSE: data fusion only • LINKFUSE: LINK first, then FUSE • Gold standard: obtained by manual checking • Measures: Precision/Recall/F-measure
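The three measures can be computed against a gold standard as in the sketch below, which uses the standard definitions over sets of predicted and gold items (for example, matched name-phone pairs). The function name and the toy pairs are hypothetical and are not results from the experiments.

```python
def precision_recall_f1(predicted, gold):
    """Standard precision, recall, and F-measure over two collections of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical matchings: one of three predicted pairs appears in the gold standard.
predicted = {("N1", "P1"), ("N2", "P2"), ("N3", "P3")}
gold = {("N1", "P1"), ("N2", "P4")}
precision, recall, f_measure = precision_recall_f1(predicted, gold)
```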

Accuracy • [Figures: for each zip code (07035 and 07715), results for Matching (NAME-PHONE), Matching (NAME-ADDRESS), and Clustering (NAME)]

Efficiency and Scalability

Conclusions • In the real world, duplicates and conflicts need to be resolved at the same time • We reduce the problem to a k-partite graph clustering and matching problem • The approach combines linkage and fusion • Experiments show high efficiency and scalability

Thank You!
