Linking Records with Erroneous Values
Songtao Guo, Xin Luna Dong, Divesh Srivastava, and Remi Zajac
AT&T Labs
Motivation
[Figure: records from multiple sources (s) go through integration into cleaned data behind a search box]
Motivation
Which type of listing are they?
• A: the same business
• B: different businesses sharing the same phone#
• C: different businesses, only one correctly associated with the given phone#
Current Solution
• Uniqueness constraint
  • Each real-world entity has a unique value for the attribute, e.g., phone, address
  • The data may not satisfy the constraint
    • Erroneous values
    • Small number of exceptions
• Current two-step solution
  • Step 1: Record linkage
    • Link records that are likely to refer to the same real-world entity [A.K. Elmagarmid, TKDE'07], [W. Winkler, Tech Report'06]
  • Step 2: Data fusion
    • Decide the correct values in the presence of conflicts [J. Bleiholder et al., ACM Computing Surveys]
Limitations of Current Solution
[Figure: two example entities. (Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way) and (Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.)]
• Erroneous values may prevent correct matching
• Traditional techniques may fall short when exceptions to the uniqueness constraints exist
• Locally resolving conflicts for linked records may overlook important global evidence
Our Solution
• Perform linkage and fusion simultaneously
  • Incorrect values can be identified from the beginning, which improves linkage
• Make global decisions
  • Consider the sources that associate a pair of values in the same record, which improves fusion
• Allow a small number of violations to capture possible exceptions in the real world
Road Map • Motivation and overview • Problem definition • Solution • Evaluations on YP data • Conclusions
Problem Input
• A set of independent data sources, each providing a set of records
• A set of (soft) uniqueness constraints
  • Uniqueness constraint (hard constraint): Business Name, Business Phone, Business Address
  • Soft uniqueness constraint (soft constraint): Business Phone
[Figure labels: 1-p1, 1-p2]
Problem Output
• Real-world entities
• For each (soft) uniqueness attribute of each entity
  • True value (if any)
  • Various representations of each true value
[Figure: example output. (Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.) and (Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way)]
K-Partite Graph Encoding
[Figure: k-partite graph with name nodes N1-N4 (values: Microsofe Corp., Microsoft Corp., MS Corp., Macrosoft Inc.), phone nodes P1-P4 (values: xxx-9400, xxx-1255, xxx-2255, xxx-0500), and address nodes A1-A3 (values: 1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.). Edges between values of different attributes are labeled with source sets such as s(1-2), s(2-6), and s(1-5,7,8).]
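One way to picture this encoding is the minimal Python sketch below. It assumes each record is a (source, name, phone, address) tuple; the `build_graph` helper and the sample records are hypothetical and do not come from the slides. Each distinct attribute value becomes a node, and each edge between values of two different attributes carries the set of sources that provide both values in the same record.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(records):
    """Map each pair of co-occurring attribute values to its supporting sources."""
    edges = defaultdict(set)
    for source, *values in records:
        # Tag every value with its attribute index so the graph stays k-partite.
        nodes = [(attr, val) for attr, val in enumerate(values) if val is not None]
        for u, v in combinations(nodes, 2):
            edges[(u, v)].add(source)
    return dict(edges)


# Hypothetical records: (source, name, phone, address)
records = [
    ("s1", "Microsoft Corp.", "xxx-1255", "1 Microsoft Way"),
    ("s2", "Microsoft Corp.", "xxx-1255", "1 Microsoft Way"),
    ("s2", "Macrosoft Inc.", "xxx-0500", "2 Sylvan Way"),
]
graph = build_graph(records)
name_phone = ((0, "Microsoft Corp."), (1, "xxx-1255"))
print(sorted(graph[name_phone]))
```

Storing source sets rather than counts keeps enough information to compute both "mention" and "support" counts later.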
Solution Encoding
[Figure: the same graph over name nodes N1-N4, phone nodes P1-P4, and address nodes A1-A3; the solution is encoded as a clustering problem and a matching problem]
Solution Encoding with Hard Constraint
[Figure: the same graph over N1-N4, P1-P4, and A1-A3, partitioned into clusters C1-C4; with hard constraints the solution is a clustering problem]
Road Map • Motivation and overview • Problem definition • Solution • Clustering w.r.t. hard constraint • Matching w.r.t. soft constraint • Evaluations on YP data • Conclusions
Clustering w.r.t. Hard Constraints
• Ideal clustering
  • High cohesion within each cluster
  • Low correlation between different clusters
• Objective function
  • Davies-Bouldin index (minimization)
  • Distance: average of
    • similarity distance
    • association distance
[Figure: example clusters C1 and C4 over name nodes N1-N4, phone nodes P1 (xxx-1255) and P4 (xxx-0500), and address nodes A1-A3 (1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.)]
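The slide names the Davies-Bouldin index without spelling out its formula. The sketch below uses the textbook definition (for each cluster, the worst ratio of summed intra-cluster distances to inter-cluster distance, averaged over clusters) and should be read as an assumption about the objective, not as the authors' exact formulation. The function name `db_index`, the input layout, and the intra-cluster value for C4 are hypothetical.

```python
def db_index(intra, inter):
    """Textbook Davies-Bouldin index; lower is better.

    intra[i] is the distance d(Ci, Ci) within cluster i.
    inter[(i, j)] is the distance d(Ci, Cj) between clusters i and j,
    stored once per unordered pair. Needs at least two clusters and
    strictly positive inter-cluster distances.
    """
    clusters = sorted(intra)
    total = 0.0
    for i in clusters:
        ratios = []
        for j in clusters:
            if i == j:
                continue
            key = (i, j) if (i, j) in inter else (j, i)
            ratios.append((intra[i] + intra[j]) / inter[key])
        total += max(ratios)  # worst-separated neighbour of cluster i
    return total / len(clusters)


# Distances as averages of similarity and association distance.
# The C1 numbers reuse values from the worked example in these slides;
# the intra-cluster value for C4 is a made-up placeholder.
intra = {"C1": (0.083 + 0.153) / 2, "C4": 0.05}
inter = {("C1", "C4"): (0.8 + 0.93) / 2}
print(db_index(intra, inter))
```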
Similarity Distance
• Similarity of values
• Defined for each attribute
[Figure: pairwise similarities between values, e.g., 0.95, 0.65, 0.65 among the names in C1 and 0.7, 0.7, 0.4 between the names of C1 and C4]
Within C1:
$d^S_1(C_1,C_1) = 1 - (0.95+0.65+0.65)/3 = 0.25$ (name)
$d^S_2(C_1,C_1) = 0$ (phone)
$d^S_3(C_1,C_1) = 0$ (address)
$d^S(C_1,C_1) = (0.25+0+0)/3 = 0.083$
Between C1 and C4:
$d^S_1(C_1,C_4) = 1 - (0.7+0.7+0.4)/3 = 0.4$
$d^S_2(C_1,C_4) = 1 - 0 = 1$
$d^S_3(C_1,C_4) = 1 - 0 = 1$
$d^S(C_1,C_4) = (0.4+1+1)/3 = 0.8$
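The arithmetic on this slide can be checked with a short Python sketch. It assumes the per-attribute distance is one minus the mean pairwise similarity and that an attribute with no value pairs to compare contributes distance 0, which matches the numbers shown; the function name `similarity_distance` is hypothetical.

```python
def similarity_distance(pair_sims_per_attribute):
    """Average over attributes of (1 - mean pairwise similarity)."""
    per_attribute = []
    for sims in pair_sims_per_attribute:
        if sims:
            per_attribute.append(1 - sum(sims) / len(sims))
        else:
            # Assumption: nothing to compare means distance 0.
            per_attribute.append(0.0)
    return sum(per_attribute) / len(per_attribute)


# Within C1: three name pairs; phone and address have no pairs to compare.
intra_c1 = similarity_distance([[0.95, 0.65, 0.65], [], []])

# Between C1 and C4: three name pairs; phone and address similarities are 0.
inter_c1_c4 = similarity_distance([[0.7, 0.7, 0.4], [0.0], [0.0, 0.0]])

print(round(intra_c1, 3), round(inter_c1_c4, 3))
```

Rounded to three decimals, the two results agree with the slide's 0.083 and 0.8.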
Association Distance
• Association by edges
• Defined for each pair of attributes
[Figure annotations on the example graph:]
• 10 sources (S1-S10) mention (N1,N2,N3,N4) (P1,P4)
• 9 sources (S1-S8,S10) mention (N1,N2,N3,P1)
• 7 sources (S1-S5,S7,S8) support (N1,N2,N3)-P1
• 1 source (S10) supports (N1,N2,N3)-P4
• No connection between (N4,P1)
Within C1:
$d^A_{1,2}(C_1,C_1) = 1 - 7/9 = 0.22$
$d^A_{1,3}(C_1,C_1) = 1 - 8/9 = 0.11$
$d^A_{2,3}(C_1,C_1) = 1 - 7/8 = 0.125$
$d^A(C_1,C_1) = (0.22+0.11+0.125)/3 = 0.153$
Between C1 and C4:
$d^A_{1,2}(C_1,C_4) = 1 - \max(1/10, 0/10) = 0.9$
$d^A_{1,3}(C_1,C_4) = 0.9$
$d^A_{2,3}(C_1,C_4) = 1$
$d^A(C_1,C_4) = (0.9+0.9+1)/3 = 0.93$
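The numbers on this slide are consistent with reading each pairwise association distance as one minus the ratio of supporting sources to mentioning sources, taking the best ratio when several candidate associations exist. That reading is inferred from the slide's annotations and is an assumption; the helper name `association_distance` is hypothetical.

```python
def association_distance(supports, mentions):
    """1 minus the best support ratio among the candidate associations."""
    return 1 - max(s / mentions for s in supports)


# Within C1, using the source counts read off the slide.
intra = [
    association_distance([7], 9),  # name-phone
    association_distance([8], 9),  # name-address
    association_distance([7], 8),  # phone-address
]
d_intra = sum(intra) / len(intra)

# Between C1 and C4.
inter = [
    association_distance([1, 0], 10),  # name-phone: max(1/10, 0/10)
    0.9,  # name-address, value taken directly from the slide
    1.0,  # phone-address, value taken directly from the slide
]
d_inter = sum(inter) / len(inter)

print(round(d_intra, 3), round(d_inter, 2))
```

Rounded, the results agree with the slide's 0.153 and 0.93.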
Greedy Algorithm
• Obtaining the optimal clustering is intractable [T.F. Gonzales, 82], [J. Simal et al., 06]
• Hill-climbing approximation: CLUSTER
  • Step 1: Initialization
    • Cluster value representations by their similarity, then use majority voting to associate clusters
  • Step 2: Adjustment
    • Move each node to the cluster that minimizes the DB index
  • Step 3: Convergence checking
    • Terminate if Step 2 does not change the clustering result; otherwise, repeat Step 2
• The algorithm converges
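The adjustment and convergence steps can be sketched as a generic hill climb. This is not the authors' CLUSTER implementation: the initialization step is skipped, the `hill_climb` and `spread` names are hypothetical, and a toy objective (within-cluster squared spread of 1-D points) stands in for the DB index.

```python
def hill_climb(assignment, cluster_ids, objective):
    """Repeatedly move single nodes to the cluster that lowers the objective."""
    assignment = dict(assignment)
    changed = True
    while changed:  # Step 3: stop once a full pass changes nothing
        changed = False
        for node in sorted(assignment):  # Step 2: adjust each node in turn
            best_cluster = assignment[node]
            best_score = objective(assignment)
            for cluster in cluster_ids:
                if cluster == assignment[node]:
                    continue
                trial = dict(assignment)
                trial[node] = cluster
                score = objective(trial)
                if score < best_score:  # accept strict improvements only
                    best_cluster, best_score = cluster, score
            if best_cluster != assignment[node]:
                assignment[node] = best_cluster
                changed = True
    return assignment


# Toy stand-in for the DB index: within-cluster squared spread of 1-D points.
points = {"a": 0.0, "b": 0.1, "c": 5.0, "d": 5.2}


def spread(assignment):
    total = 0.0
    for cluster in set(assignment.values()):
        members = [points[n] for n, c in assignment.items() if c == cluster]
        mean = sum(members) / len(members)
        total += sum((x - mean) ** 2 for x in members)
    return total


start = {"a": 0, "b": 1, "c": 0, "d": 1}
result = hill_climb(start, [0, 1], spread)
print(result)
```

Because a move is accepted only when it strictly lowers the objective and the number of possible assignments is finite, the loop always terminates, which mirrors the slide's convergence claim.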
[Figure: CLUSTER applied to the running example (name nodes N1-N4, phone nodes P1-P4, address nodes A1-A3), ending with clusters C1-C4; DB index values shown: Φ=0.94, Φ=0.93, Φ=1.16, Φ=0.89, Φ=0.71, Φ=0.45]
Road Map • Motivation and overview • Problem definition • Solution • Clustering w.r.t. hard constraint • Matching w.r.t. soft constraint • Evaluations on YP data • Conclusions
Matching w.r.t. Soft Constraints
• Next? Matching problem
• How to match?
[Figure: GRAPH TRANSFORM. The value-level graph (N1-N4, P1-P4, A1-A3) is transformed into a cluster-level graph over name clusters NC1 and NC4, phone clusters PC1-PC4, and address clusters AC1 and AC4. Edges carry the number of supporting sources, e.g., 9 for S(1-9), 7 for s(1-5,7,8), 5 for s(1-5), 1 for S(10).]
Matching w.r.t. Soft Constraint
• Intuitions
  • Largest sum of weights
  • Smallest gap
  • How to balance these two goals?
• Optimization problem
  • Maximize
  • Subject to
• Two-phase greedy algorithm: MATCH
[Figure: Solutions 1-3 for a name node N with edges to P1, P2, P3 weighted 10 (s1-s10), 9 (s2-s10), and 1 (s1); the Gap(N) values shown are 9, 1, and 0]
Road Map • Motivation and overview • Problem definition • Solution • Evaluations on YP data • Conclusions
Experiment Settings
• Dataset I
  • Business listings for two zip codes (07035: Lincoln Park, NJ; 07715: Belmar, NJ) from multiple sources
Experiment Settings
• Implementation
  • MATCH (invoking CLUSTER first)
  • LINK: record linkage only
  • FUSE: data fusion only
  • LINKFUSE: first LINK, then FUSE
• Gold standard: obtained by manual checking
• Measures: precision, recall, F-measure
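For reference, the three measures can be computed as in the sketch below, treating a result as a set of predicted pairs (for example name-phone matches) compared against the gold standard. The helper name and the sample pairs are hypothetical, and F-measure is taken here to be the balanced F1 score, which the slides do not state explicitly.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and balanced F-measure of predicted pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical name-phone matches.
predicted = {("A", "1"), ("B", "2"), ("C", "3")}
gold = {("A", "1"), ("B", "2"), ("D", "4"), ("E", "5")}
p, r, f = precision_recall_f1(predicted, gold)
print(p, r, f)
```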
Accuracy
• MATCH achieves the highest F-measure in most cases
  • Improves on LINK by 11% on name-phone matching and by 20% on name clustering
• LINK vs. FUSE vs. LINKFUSE
  • LINK: high recall in matching
  • FUSE: high precision in matching, high precision in name clustering
  • LINKFUSE: only slightly better than FUSE in matching and similar to LINK in clustering
[Figure panels: 07035 Matching (NAME-PHONE), 07035 Matching (NAME-ADDRESS), 07035 Clustering (NAME), 07715 Matching (NAME-PHONE), 07715 Matching (NAME-ADDRESS), 07715 Clustering (NAME)]
Efficiency and Scalability
• Dataset II
  • Entire listing: 40+M records
• Hadoop-based linkage framework
  • Fuzzy self-join using Hadoop
  • Partition records into strongly connected components
• Efficiency
  • Linear growth
  • Execution time
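The partitioning step can be illustrated on a single machine with a union-find structure. This is only a sketch under stated assumptions: it treats the links produced by the fuzzy self-join as undirected, so it computes plain connected components, and it does not reproduce the Hadoop-based framework. The function name and the sample links are hypothetical.

```python
def connected_components(num_records, links):
    """Group record indices that are connected through similarity links."""
    parent = list(range(num_records))

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for record in range(num_records):
        groups.setdefault(find(record), []).append(record)
    return sorted(groups.values())


# Hypothetical links from a fuzzy self-join over five records.
components = connected_components(5, [(0, 1), (1, 2), (3, 4)])
print(components)
```

Each component can then be processed independently, which is what makes this kind of partitioning attractive for large inputs.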
Conclusions
• In the real world, we need to resolve duplicates and conflicts at the same time
• We reduce the problem to a k-partite graph clustering and matching problem
  • Combine linkage and fusion
  • Apply them in a global fashion
• Experiments show high accuracy and scalability