Linking Records with Erroneous Values

Songtao Guo, Xin Luna Dong, Divesh Srivastava, and Remi Zajac • AT&T Labs

Motivation • [Figure: listings from multiple sources (s) are integrated into cleaned data that serves a search box]

Motivation • Which type of listings are these? • A: the same business • B: different businesses sharing the same phone number • C: different businesses, only one of which is correctly associated with the given phone number

Current Solution • Uniqueness constraint • Each real-world entity has a unique value for the attribute, e.g., phone or address • The data may not satisfy the constraint • Erroneous values • A small number of exceptions • Current two-step solution • Step 1: Record linkage • Link records that are likely to refer to the same real-world entity [A. K. Elmagarmid, TKDE'07], [W. Winkler, Tech Report'06] • Step 2: Data fusion • Decide the correct values in the presence of conflicts [J. Bleiholder et al., ACM Computing Surveys]

Limitations of Current Solution • Erroneous values may prevent correct matching • Traditional techniques may fall short when exceptions to the uniqueness constraints exist • Locally resolving conflicts for linked records may overlook important global evidence • [Figure: example linkage result with check and cross marks: (Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way) and (Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.)]

Our Solution • Perform linkage and fusion simultaneously • Incorrect values can be identified from the beginning, which improves linkage • Make global decisions • Consider the sources that associate a pair of values in the same record, which improves fusion • Allow a small number of violations to capture possible exceptions in the real world

Road Map • Motivation and overview • Problem definition • Solution • Evaluations on YP data • Conclusions

Problem Input • A set of independent data sources, each providing a set of records • A set of (soft) uniqueness constraints • Uniqueness constraint (hard constraint): • Business Name, Business Phone, Business Address • Soft uniqueness constraint (soft constraint): • Business Phone • [Figure labels: 1-p1, 1-p2]

Problem Output • Real-world entities • For each (soft) uniqueness attribute of each entity • True value (if any) • Various representations of each true value • Example: (Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way) and (Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.)

K-Partite Graph Encoding • [Figure: a k-partite graph with one node per distinct value: name nodes N1-N4 for Microsofe Corp., Microsoft Corp., MS Corp., and Macrosoft Inc.; phone nodes P1-P4 for xxx-1255, xxx-9400, xxx-2255, and xxx-0500; address nodes A1-A3 for 1 Microsoft Way, 2 Sylvan Way, and 2 Sylvan W. Edges connect values of different attributes and are labeled with the sources that provide them, e.g. s(1-2) or s(1-5,7,8).]
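
A minimal Python sketch of such an encoding, assuming each record is a (source, name, phone, address) tuple and that an edge links two values that co-occur in one record. The records below are illustrative rather than the slide's data, and `build_kpartite` is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import combinations

ATTRS = ("name", "phone", "address")

def build_kpartite(records):
    """Map each pair of co-occurring values to the set of sources providing it."""
    edges = defaultdict(set)
    for source, *values in records:
        # One node per (attribute, value); skip missing values.
        nodes = [(a, v) for a, v in zip(ATTRS, values) if v is not None]
        # Values in one record belong to different attributes,
        # so every edge crosses two partitions of the graph.
        for u, v in combinations(nodes, 2):
            edges[(u, v)].add(source)
    return dict(edges)

# Illustrative records (assumption, not taken from the slides).
records = [
    ("s1", "Microsoft Corp.", "xxx-1255", "1 Microsoft Way"),
    ("s2", "Microsoft Corp.", "xxx-1255", "1 Microsoft Way"),
    ("s2", "MS Corp.", "xxx-1255", None),
    ("s3", "Macrosoft Inc.", "xxx-0500", "2 Sylvan Way"),
]
graph = build_kpartite(records)
edge = (("name", "Microsoft Corp."), ("phone", "xxx-1255"))
print(sorted(graph[edge]))
```

Keeping the set of supporting sources on each edge, rather than only a count, leaves room to ask later which sources agree on a given pair of values.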

Solution Encoding • Clustering problem & Matching problem • [Figure: the k-partite graph (names N1-N4, phones P1-P4, addresses A1-A3) annotated with a solution]

Solution Encoding with Hard Constraint • Clustering problem • [Figure: the k-partite graph (names N1-N4, phones P1-P4, addresses A1-A3) partitioned into clusters C1-C4]

Road Map • Motivation and overview • Problem definition • Solution • Clustering w.r.t. hard constraint • Matching w.r.t. soft constraint • Evaluations on YP data • Conclusions

Clustering w.r.t. Hard Constraints • Ideal clustering: • High cohesion within each cluster • Low correlation between different clusters • Objective function • Davies-Bouldin index (minimization) • Distance is the average of • similarity distance • association distance • [Figure: clusters C1 and C4 over names N1-N4, phones P1 (xxx-1255) and P4 (xxx-0500), and addresses A1-A3 (1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.)]
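
The slide names the Davies-Bouldin index without spelling out a formula. The sketch below assumes the standard form: for each cluster, take the worst ratio of (own intra-cluster distance + other intra-cluster distance) to the inter-cluster distance, then average over clusters. The C1 numbers come from the worked examples on the next two slides; the intra-cluster distance for C4 is a hypothetical value, and `db_index` is an illustrative name.

```python
def db_index(intra, inter):
    """Davies-Bouldin style index (lower is better).

    intra[i]      -- distance within cluster i, d(Ci, Ci)
    inter[(i, j)] -- distance between clusters i and j, keyed with i < j
    """
    clusters = sorted(intra)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i in clusters:
        # Worst-case overlap of cluster i with any other cluster.
        total += max(
            (intra[i] + intra[j]) / inter[(min(i, j), max(i, j))]
            for j in clusters
            if j != i
        )
    return total / len(clusters)

# d(C1,C1) and d(C1,C4) average the similarity and association distances;
# the value for C4 is hypothetical.
intra = {"C1": (0.083 + 0.153) / 2, "C4": 0.1}
inter = {("C1", "C4"): (0.8 + 0.93) / 2}
phi = db_index(intra, inter)
print(round(phi, 3))
```

The sketch assumes every inter-cluster distance is positive; a zero distance between two clusters would need separate handling.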

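Similarity Distance • Similarity of values • Defined for each attribute • Within C1: $d_S^1(C_1,C_1) = 1 - (0.95+0.65+0.65)/3 = 0.25$ (name) • $d_S^2(C_1,C_1) = 0$ (phone) • $d_S^3(C_1,C_1) = 0$ (address) • $d_S(C_1,C_1) = (0.25+0+0)/3 = 0.083$ • Between C1 and C4: $d_S^1(C_1,C_4) = 1 - (0.7+0.7+0.4)/3 = 0.4$ • $d_S^2(C_1,C_4) = 1 - 0 = 1$ • $d_S^3(C_1,C_4) = 1 - 0 = 1$ • $d_S(C_1,C_4) = (0.4+1+1)/3 = 0.8$ • [Figure: clusters C1 and C4 over names N1-N4 (MS Corp., Microsoft Corp., Macrosoft Inc., Microsofe Corp.), phones P1 (xxx-1255) and P4 (xxx-0500), and addresses A1-A3 (1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.); edges carry pairwise similarities: 0.95, 0.65, 0.65, 0.7, 0.7, 0.4 between names, 0 between phones, and 0, 0, 0.9 between addresses]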
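
The arithmetic on this slide can be reproduced with a short Python sketch. It assumes that a per-attribute distance is one minus the mean pairwise similarity, and that an attribute with no value pairs (such as the single phone in C1) contributes a distance of 0, which matches the slide's numbers. The function names are illustrative.

```python
def attr_distance(similarities):
    """1 minus the mean pairwise similarity; 0 when there are no pairs."""
    if not similarities:
        return 0.0
    return 1.0 - sum(similarities) / len(similarities)

def similarity_distance(per_attribute):
    """Average the per-attribute distances (name, phone, address)."""
    return sum(per_attribute) / len(per_attribute)

# Within C1: three name pairs, a single phone, a single address.
within_c1 = similarity_distance([
    attr_distance([0.95, 0.65, 0.65]),  # name
    attr_distance([]),                  # phone
    attr_distance([]),                  # address
])

# Between C1 and C4: name pairs 0.7, 0.7, 0.4; phones and addresses dissimilar.
c1_vs_c4 = similarity_distance([
    attr_distance([0.7, 0.7, 0.4]),     # name
    attr_distance([0.0]),               # phone
    attr_distance([0.0, 0.0]),          # address
])

print(round(within_c1, 3), round(c1_vs_c4, 3))  # 0.083 0.8
```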

Association Distance • Association by edges • Defined for each pair of attributes • Within C1: 9 sources (S1-S8, S10) mention (N1,N2,N3,P1) and 7 sources (S1-S5, S7, S8) support (N1,N2,N3)-P1, so $d_A^{1,2}(C_1,C_1) = 1 - 7/9 = 0.22$ • $d_A^{1,3}(C_1,C_1) = 1 - 8/9 = 0.11$ • $d_A^{2,3}(C_1,C_1) = 1 - 7/8 = 0.125$ • $d_A(C_1,C_1) = (0.22+0.11+0.125)/3 = 0.153$ • Between C1 and C4: 10 sources (S1-S10) mention (N1,N2,N3,N4) or (P1,P4), 1 source (S10) supports (N1,N2,N3)-P4, and there is no connection between (N4,P1), so $d_A^{1,2}(C_1,C_4) = 1 - \max(1/10, 0/10) = 0.9$ • $d_A^{1,3}(C_1,C_4) = 0.9$ • $d_A^{2,3}(C_1,C_4) = 1$ • $d_A(C_1,C_4) = (0.9+0.9+1)/3 = 0.93$ • [Figure: clusters C1 and C4 over names N1-N4, phones P1 (xxx-1255) and P4 (xxx-0500), and addresses A1-A3 (1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.), with edges labeled by supporting sources, e.g. s(1-2), S(3-5), S(1-9), S(10)]
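
A small Python sketch reproducing the numbers on this slide. It assumes that a within-cluster association distance is one minus the fraction of mentioning sources that support the association, and that a cross-cluster distance is one minus the largest such fraction over the two directions. The function names are illustrative, and the values 0.9 and 1 for the name-address and phone-address pairs between C1 and C4 are taken from the slide as given.

```python
def within_association(supporting, mentioning):
    """1 minus the share of mentioning sources that support the association."""
    return 1.0 - supporting / mentioning

def cross_association(fractions):
    """1 minus the strongest support fraction across the two clusters."""
    return 1.0 - max(fractions)

# Within C1: name-phone, name-address, phone-address.
d12 = within_association(7, 9)
d13 = within_association(8, 9)
d23 = within_association(7, 8)
d_within = (d12 + d13 + d23) / 3

# Between C1 and C4: 1 of 10 sources links C1 names to P4, none link N4 to P1.
d12_cross = cross_association([1 / 10, 0 / 10])
d_cross = (d12_cross + 0.9 + 1.0) / 3

print(round(d_within, 3), round(d_cross, 3))  # 0.153 0.933
```

The slide's 0.153 comes from the unrounded terms; summing the rounded values 0.22, 0.11, and 0.125 gives 0.152.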

Greedy Algorithm • Obtaining the optimal clustering is intractable • [T. F. Gonzalez, 82], [J. Sima et al., 06] • Hill-climbing approximation: CLUSTER • Step 1: Initialization • Cluster value representations by their similarity, then use majority voting to associate clusters • Step 2: Adjustment • Move each node to the cluster that minimizes the DB index • Step 3: Convergence checking • Terminate if Step 2 does not change the clustering result; otherwise, repeat Step 2 • The algorithm converges
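
The adjustment and convergence steps follow a generic hill-climbing pattern, sketched below in Python. This is not the CLUSTER algorithm itself: the initialization by similarity and majority voting is omitted, the set of cluster labels is fixed in advance, and `toy_score` is a hypothetical stand-in for the DB index. All names are assumptions for illustration.

```python
def hill_climb(assignment, clusters, score):
    """Greedy local search: move one item at a time to its best cluster.

    assignment -- dict item -> cluster label (initial clustering)
    clusters   -- iterable of allowed cluster labels
    score      -- callable(assignment) -> float, lower is better
    """
    assignment = dict(assignment)
    clusters = list(clusters)
    improved = True
    while improved:  # stop once a full pass changes nothing
        improved = False
        for item in list(assignment):
            best_label = assignment[item]
            best_score = score(assignment)
            for label in clusters:
                trial = {**assignment, item: label}
                trial_score = score(trial)
                if trial_score < best_score:  # accept strict improvements only
                    best_label, best_score = label, trial_score
            if best_label != assignment[item]:
                assignment[item] = best_label
                improved = True
    return assignment

# Toy objective (assumption): total pairwise distance inside clusters.
points = {"a": 0.0, "b": 0.1, "c": 5.0, "d": 5.2}

def toy_score(assignment):
    items = list(assignment)
    return sum(
        abs(points[x] - points[y])
        for i, x in enumerate(items)
        for y in items[i + 1:]
        if assignment[x] == assignment[y]
    )

start = {"a": 0, "b": 1, "c": 1, "d": 0}
result = hill_climb(start, clusters=[0, 1], score=toy_score)
print(result)
```

Because a move is accepted only when it strictly lowers the score and there are finitely many assignments, the loop must terminate, which mirrors the convergence claim on the slide.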

[Figure: the k-partite graph (names N1-N4, phones P1-P4, addresses A1-A3) grouped into clusters C1-C4 and annotated with DB-index values Φ=0.94, Φ=0.93, Φ=1.16, Φ=0.89, Φ=0.71, Φ=0.45]

Road Map • Motivation and overview • Problem definition • Solution • Clustering w.r.t. hard constraint • Matching w.r.t. soft constraint • Evaluations on YP data • Conclusions

Matching w.r.t. Soft Constraints • Next? Matching problem • How to match? • [Figure: GRAPH TRANSFORM step. The value-level graph (names N1-N4, phones P1-P4, addresses A1-A3) is collapsed into a cluster-level graph with nodes NC1, NC4, PC1-PC4, AC1, and AC4. Each edge carries a weight next to its supporting sources, e.g. 7 for s(1-5,7,8), 9 for S(1-9), and 1 for S(10).]

Matching w.r.t. Soft Constraint • Intuitions • Largest sum of weights • Smallest gap • How to balance these two goals? • Optimization problem • Maximize • Subject to • Two-phase greedy algorithm: MATCH • [Figure: three candidate solutions (Solution 1, Solution 2, Solution 3) for matching node N against P1, P2, and P3, with edge weights 10 (s1-s10), 9 (s2-s10), and 1 (s1); the panels show Gap(N) = 9, Gap(N) = 1, and Gap(N) = 0]
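
The tension between the two intuitions can be seen in a short Python sketch. It assumes that Gap(N) is the difference between the largest and smallest weight among the edges selected for N, which is consistent with the values 9, 1, and 0 in the figure but is not stated on the slide. The assignment of the weights 10, 9, and 1 to P1, P2, and P3 is also an assumption.

```python
# Edge weights from node N to each phone (assignment to P1-P3 is assumed).
weights = {"P1": 10, "P2": 9, "P3": 1}

def total_weight(selected):
    """Sum of weights of the selected edges (larger is better)."""
    return sum(weights[p] for p in selected)

def gap(selected):
    """Spread between the strongest and weakest selected edge (smaller is better)."""
    chosen = [weights[p] for p in selected]
    return max(chosen) - min(chosen)

candidates = [["P1"], ["P1", "P2"], ["P1", "P2", "P3"]]
summary = [(c, total_weight(c), gap(c)) for c in candidates]
for selected, total, spread in summary:
    print(selected, "total weight:", total, "gap:", spread)
```

Adding P3 raises the total weight only from 19 to 20 while widening the gap from 1 to 9, which is the kind of trade-off the optimization problem has to balance.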

Road Map • Motivation and overview • Problem definition • Solution • Evaluations on YP data • Conclusions

Experiment Settings • Dataset I • Business listings for two zip codes (07035, Lincoln Park, NJ; 07715, Belmar, NJ) from multiple sources

Experiment Settings • Implementation • MATCH (invoking CLUSTER first) • LINK: record linkage only • FUSE: data fusion only • LINKFUSE: first LINK, then FUSE • Gold standard: built by manual checking • Measures: precision, recall, F-measure
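
For reference, the three measures can be computed as in the Python sketch below. It assumes the balanced F-measure (F1) and compares a set of predicted name-phone pairs against a gold-standard set; the pairs shown are illustrative, not results from the experiments.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of matched pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative name-phone matches (assumption, not experimental data).
predicted = {
    ("Microsoft Corp.", "xxx-1255"),
    ("MS Corp.", "xxx-1255"),
    ("Macrosoft Inc.", "xxx-1255"),
}
gold = {
    ("Microsoft Corp.", "xxx-1255"),
    ("MS Corp.", "xxx-1255"),
    ("Microsofe Corp.", "xxx-1255"),
    ("Macrosoft Inc.", "xxx-0500"),
}
p, r, f = precision_recall_f1(predicted, gold)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.5 0.571
```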

Accuracy • MATCH achieves the highest F-measure in most cases • Improves on LINK by 11% on name-phone matching and by 20% on name clustering • LINK vs. FUSE vs. LINKFUSE • LINK: high recall in matching • FUSE: high precision in matching, high precision in name clustering • LINKFUSE: only slightly better than FUSE in matching and similar to LINK in clustering • [Figure panels: 07035 Matching (NAME-PHONE), 07035 Matching (NAME-ADDRESS), 07035 Clustering (NAME), 07715 Matching (NAME-PHONE), 07715 Matching (NAME-ADDRESS), 07715 Clustering (NAME)]

Efficiency and Scalability • Dataset II • Entire listing: 40+ million records • Hadoop-based linkage framework • Fuzzy self-join using Hadoop • Partition records into strongly connected components • Efficiency • Linear growth of execution time
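
The partitioning step can be illustrated on a single machine with a union-find sketch in Python. This is not the Hadoop implementation: it assumes records are numbered 0..n-1, that the fuzzy self-join has already produced pairs of similar records, and that similarity is symmetric, so it computes ordinary connected components. `connected_components` is a hypothetical name.

```python
def connected_components(n, pairs):
    """Group record ids 0..n-1 into components linked by similar pairs."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Illustrative output of a fuzzy self-join over six records.
parts = connected_components(6, [(0, 1), (1, 2), (4, 5)])
print(parts)  # [[0, 1, 2], [3], [4, 5]]
```

Each component can then be processed independently, which is what makes this kind of partitioning attractive for a distributed setting.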

Conclusions • In the real world, we need to resolve duplicates and conflicts at the same time • We reduce the problem to a k-partite graph clustering and matching problem • Combine linkage and fusion • Apply them globally • Experiments show high accuracy and scalability

Thank You!
