
Linking Records with Erroneous Values

Songtao Guo, Xin Luna Dong, Divesh Srivastava, and Remi Zajac (AT&T Labs)


Presentation Transcript


  1. Linking Records with Erroneous Values. Songtao Guo, Xin Luna Dong, Divesh Srivastava, and Remi Zajac, AT&T Labs

  2. Motivation
     [Figure: six sources (each labeled "s") feed an "integration" step; labels "Cleaned Data" and "Search Box"]

  3. Motivation
     Which type of listing are they?
     • A: the same business
     • B: different businesses sharing the same phone#
     • C: different businesses, only one correctly associated with the given phone#

  4. Current Solution
     • Uniqueness constraint
       • Each real-world entity has a unique value, e.g., phone, address
       • The data may not satisfy the constraint
         • Erroneous values
         • Small number of exceptions
     • Current two-step solution
       • Step 1: Record Linkage
         • Link records that are likely to refer to the same real-world entity [A. K. Elmagarmid, TKDE’07], [W. Winkler, Tech Report’06]
       • Step 2: Data Fusion
         • Decide the correct values in the presence of conflicts [J. Bleiholder et al., ACM Computing Surveys]

  5. Limitations of Current Solution
     Example records:
       (Microsoft Corp., Microsofe Corp., MS Corp.) (XXX-1255, xxx-9400) (1 Microsoft Way)
       (Macrosoft Inc.) (XXX-0500) (2 Sylvan Way, 2 Sylvan W.)
     • Erroneous values may prevent correct matching
     • Traditional techniques may fall short when exceptions to the uniqueness constraints exist
     • Locally resolving conflicts for linked records may overlook important global evidence

  6. Our Solution
     • Perform linkage and fusion simultaneously
       • Incorrect values can be identified from the beginning, which improves linkage
     • Make global decisions
       • Consider the sources that associate a pair of values in the same record, which improves fusion
     • Allow a small number of violations to capture possible exceptions in the real world

  7. Road Map • Motivation and overview • Problem definition • Solution • Evaluations on YP data • Conclusions

  8. Problem Input
     • A set of independent data sources, each providing a set of records
     • A set of (soft) uniqueness constraints
       • Uniqueness constraint (hard constraint):
         • Business Name, Business Phone, Business Address
       • Soft uniqueness constraint (soft constraint):
         • Business Phone (annotated on the slide with 1-p1 and 1-p2)

  9. Problem Output
     • Real-world entities
     • For each (soft) uniqueness attribute of each entity
       • True value (if any)
       • Various representations of each true value
     Example output:
       (Macrosoft Inc.) (XXX-0500) (2 Sylvan Way, 2 Sylvan W.)
       (Microsoft Corp., Microsofe Corp., MS Corp.) (XXX-1255, xxx-9400) (1 Microsoft Way)

  10. K-Partite Graph Encoding
      [Figure: k-partite graph. Name nodes N1-N4 (Microsoft Corp., Microsofe Corp., MS Corp., Macrosoft Inc.), phone nodes P1-P4 (xxx-1255, xxx-9400, xxx-2255, xxx-0500), and address nodes A1-A3 (1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.). Each edge is labeled with a set of sources, e.g., S(1-9), S(2-10), S(7-8).]

  11. Solution Encoding
      [Figure: the same graph over name nodes N1-N4, phone nodes P1-P4, and address nodes A1-A3]
      Clustering problem & Matching problem

  12. Solution Encoding with Hard Constraint
      [Figure: the same graph with its nodes grouped into clusters C1-C4]
      Clustering problem

  13. Road Map • Motivation and overview • Problem definition • Solution • Clustering w.r.t. hard constraint • Matching w.r.t. soft constraint • Evaluations on YP data • Conclusions

  14. Clustering w.r.t. Hard Constraints
      • Ideal clustering:
        • high cohesion within each cluster
        • low correlation between different clusters
      • Objective function
        • Davies-Bouldin index (minimization)
        • Distance is the average of
          • similarity distance
          • association distance
      [Figure: example graph with clusters C1 (containing P1 = xxx-1255) and C4 (containing P4 = xxx-0500) highlighted]
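
Slide 14 names the Davies-Bouldin index as the objective, but the transcript does not preserve the formula. The sketch below assumes the textbook form of the index (for each cluster, take the worst ratio of summed intra-cluster distances to inter-cluster distance, then average over clusters). The function names and all numbers are hypothetical and are not taken from the slides.

```python
def combined_distance(similarity_distance, association_distance):
    """Average of the two distance components, as slide 14 describes."""
    return (similarity_distance + association_distance) / 2


def davies_bouldin(intra, inter):
    """Textbook Davies-Bouldin index (assumed form, lower is better).

    intra[i]    -- distance within cluster i
    inter[i][j] -- distance between clusters i and j (must be > 0 for i != j)
    Requires at least two clusters.
    """
    k = len(intra)
    worst_ratios = []
    for i in range(k):
        ratios = [(intra[i] + intra[j]) / inter[i][j] for j in range(k) if j != i]
        worst_ratios.append(max(ratios))
    return sum(worst_ratios) / k


# Hypothetical two-cluster example (values are made up for illustration).
intra = [0.1, 0.2]
inter = [[0.0, 0.9],
         [0.9, 0.0]]
phi = davies_bouldin(intra, inter)
print(round(phi, 3))
```

Tight clusters that are far apart give small ratios, so minimizing the index favors high cohesion and low correlation between clusters.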

  15. Similarity Distance
      • Similarity of values
      • Defined for each attribute
      Pairwise similarities shown in the figure: 0.95, 0.65, 0.65 among the names in C1; 0.7, 0.7, 0.4 between the names in C1 and the name in C4; 0 between the phones P1 and P4; 0.9 between the addresses 2 Sylvan Way and 2 Sylvan W.
      Within C1:
        d_S^1(C1,C1) = 1 - (0.95 + 0.65 + 0.65)/3 = 0.25   (name)
        d_S^2(C1,C1) = 0                                    (phone)
        d_S^3(C1,C1) = 0                                    (address)
        d_S(C1,C1)   = (0.25 + 0 + 0)/3 = 0.083
      Between C1 and C4:
        d_S^1(C1,C4) = 1 - (0.7 + 0.7 + 0.4)/3 = 0.4        (name)
        d_S^2(C1,C4) = 1 - 0 = 1                            (phone)
        d_S^3(C1,C4) = 1 - 0 = 1                            (address)
        d_S(C1,C4)   = (0.4 + 1 + 1)/3 = 0.8
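
The arithmetic on slide 15 can be reproduced with a short sketch. The function names and the convention that an attribute with no value pairs to compare has distance 0 are assumptions made for illustration; only the similarity values come from the slide.

```python
def attribute_distance(pair_similarities):
    """1 minus the mean pairwise similarity for one attribute.

    Assumption: with no pairs to compare (a single value in the cluster),
    the distance is 0, which matches the phone and address rows for C1.
    """
    if not pair_similarities:
        return 0.0
    return 1.0 - sum(pair_similarities) / len(pair_similarities)


def similarity_distance(per_attribute_distances):
    """Average the per-attribute distances (name, phone, address)."""
    return sum(per_attribute_distances) / len(per_attribute_distances)


# Within C1: three name pairs, one phone value, one address value.
d_c1_c1 = similarity_distance([
    attribute_distance([0.95, 0.65, 0.65]),  # name
    attribute_distance([]),                  # phone
    attribute_distance([]),                  # address
])

# Between C1 and C4: name similarities from the slide, phone and address similarity 0.
d_c1_c4 = similarity_distance([
    attribute_distance([0.7, 0.7, 0.4]),     # name
    attribute_distance([0.0]),               # phone
    attribute_distance([0.0]),               # address
])

print(round(d_c1_c1, 3), round(d_c1_c4, 3))
```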

  16. Association Distance
      • Association by edges
      • Defined for each pair of attributes
      Within C1:
        9 sources (S1-S8, S10) mention (N1,N2,N3,P1); 7 sources (S1-S5, S7, S8) support (N1,N2,N3)-P1
        d_A^{1,2}(C1,C1) = 1 - 7/9 = 0.22
        d_A^{1,3}(C1,C1) = 1 - 8/9 = 0.11
        d_A^{2,3}(C1,C1) = 1 - 7/8 = 0.125
        d_A(C1,C1)       = (0.22 + 0.11 + 0.125)/3 = 0.153
      Between C1 and C4:
        10 sources (S1-S10) mention (N1,N2,N3,N4), (P1,P4); 1 source (S10) supports (N1,N2,N3)-P4; no connection between (N4,P1)
        d_A^{1,2}(C1,C4) = 1 - max(1/10, 0/10) = 0.9
        d_A^{1,3}(C1,C4) = 0.9
        d_A^{2,3}(C1,C4) = 1
        d_A(C1,C4)       = (0.9 + 0.9 + 1)/3 = 0.93
      [Figure: example graph with source-labeled edges and clusters C1 and C4 highlighted]
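
The source counts on slide 16 can be turned into the stated distances with the sketch below. The two helper functions are an assumed reading of the slide's arithmetic (one minus the fraction of mentioning sources that support the association, taking the better-supported direction across clusters); the names are hypothetical.

```python
def intra_association_distance(supporting, mentioning):
    """1 minus the share of mentioning sources that support the association."""
    return 1.0 - supporting / mentioning


def inter_association_distance(cross_supports, mentioning):
    """1 minus the best-supported cross-cluster association."""
    return 1.0 - max(cross_supports) / mentioning


def mean(values):
    return sum(values) / len(values)


# Within C1: counts read off the slide for name-phone, name-address, phone-address.
d_a_c1_c1 = mean([
    intra_association_distance(7, 9),
    intra_association_distance(8, 9),
    intra_association_distance(7, 8),
])

# Between C1 and C4: the name-phone term is derived from counts (1 and 0 of 10);
# the slide gives the other two terms only as values (0.9 and 1).
d_a_c1_c4 = mean([
    inter_association_distance([1, 0], 10),
    0.9,
    1.0,
])

print(round(d_a_c1_c1, 3), round(d_a_c1_c4, 2))
```

Using unrounded intermediate terms gives 0.153 for the distance within C1, which matches the slide.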

  17. Greedy Algorithm
      • Obtaining the optimal clustering is intractable
        • [T. F. Gonzalez, 82], [J. Simal et al., 06]
      • Hill-climbing approximation: CLUSTER
        • Step 1: Initialization
          • Cluster value representations by their similarity; use majority voting to associate clusters
        • Step 2: Adjustment
          • Move each node to the cluster that minimizes the DB index
        • Step 3: Convergence checking
          • Terminate if Step 2 does not change the clustering result; otherwise, repeat Step 2
      • The algorithm converges
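
The adjust-and-check loop of slide 17 can be sketched as generic hill climbing. This is not the authors' CLUSTER implementation: the objective here is a stand-in (within-cluster squared scatter of points on a line) rather than the Davies-Bouldin index, and `positions`, `within_cluster_scatter`, and `hill_climb` are hypothetical names.

```python
# Toy data: four nodes on a line; "b" starts in the wrong cluster.
positions = {"a": 0.0, "b": 0.1, "c": 5.0, "d": 5.1}


def within_cluster_scatter(assignment):
    """Stand-in objective: sum of squared deviations from each cluster mean."""
    clusters = {}
    for node, cluster in assignment.items():
        clusters.setdefault(cluster, []).append(positions[node])
    total = 0.0
    for values in clusters.values():
        center = sum(values) / len(values)
        total += sum((v - center) ** 2 for v in values)
    return total


def hill_climb(assignment, cluster_ids, objective):
    """Move one node at a time to the cluster that lowers the objective most.

    Only strictly improving moves are accepted, so the loop terminates.
    """
    assignment = dict(assignment)
    current = objective(assignment)
    changed = True
    while changed:                      # Step 3: repeat until nothing moves
        changed = False
        for node in list(assignment):   # Step 2: adjust each node
            best_cluster, best_score = assignment[node], current
            for cluster in cluster_ids:
                if cluster == assignment[node]:
                    continue
                trial = dict(assignment)
                trial[node] = cluster
                score = objective(trial)
                if score < best_score:
                    best_cluster, best_score = cluster, score
            if best_cluster != assignment[node]:
                assignment[node] = best_cluster
                current = best_score
                changed = True
    return assignment


initial = {"a": 0, "b": 1, "c": 1, "d": 1}   # Step 1 would produce this start
result = hill_climb(initial, [0, 1], within_cluster_scatter)
print(result)
```

Because every accepted move strictly lowers the objective and there are finitely many assignments, the loop must stop, which mirrors the convergence claim on the slide.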

  18. [Figure: the example graph grouped into clusters C1-C4, annotated with the values Φ=0.94, Φ=0.93, Φ=1.16, Φ=0.89, Φ=0.71, Φ=0.45]

  19. Road Map • Motivation and overview • Problem definition • Solution • Clustering w.r.t. hard constraint • Matching w.r.t. soft constraint • Evaluations on YP data • Conclusions

  20. Matching w.r.t. Soft Constraints
      • Next? Matching problem
      • How to match?
      [Figure: "GRAPH TRANSFORM" from the value-level graph (N1-N4, P1-P4, A1-A3) to a cluster-level graph with name clusters NC1 and NC4, phone clusters PC1-PC4, and address clusters AC1 and AC4. Each cluster-level edge carries a weight and a source set: 7 S(1-5,7,8), 1 S(10), 9 S(1-9), 1 S(6), 5 S(1-5), 9 S(1-9), 8 S(1-8), 1 S(10).]

  21. Matching w.r.t. Soft Constraint
      • Intuitions
        • Largest sum of weights
        • Smallest gap
        • How to balance these two goals?
      • Optimization problem
        • Maximize
        • Subject to
      • Two-phase greedy algorithm: MATCH
      [Figure: three candidate solutions (Solution 1, Solution 2, Solution 3) for a node N linked to P1, P2, and P3, with edge weights 10 (s1-s10), 9 (s2-s10), and 1 (s1); the labels Gap(N) = 9, Gap(N) = 1, and Gap(N) = 0 appear on the slide]
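
The two intuitions on slide 21 pull in opposite directions, which a small sketch makes concrete. The transcript does not preserve the definition of Gap or the optimization formula, so the definition used here (largest minus smallest weight among the selected edges of a node) is an inferred assumption: it is one reading that reproduces the three Gap values on the slide. The assignment of weights to P1, P2, and P3 is also assumed.

```python
# Assumed edge weights from node N (number of supporting sources).
weights = {"P1": 10, "P2": 9, "P3": 1}


def total_weight(selected):
    """Sum of the weights of the selected edges."""
    return sum(weights[p] for p in selected)


def gap(selected):
    """Assumed definition: spread between the heaviest and lightest selected edge."""
    chosen = [weights[p] for p in selected]
    return max(chosen) - min(chosen)


candidates = [["P1"], ["P1", "P2"], ["P1", "P2", "P3"]]
for selected in candidates:
    print(selected, "sum =", total_weight(selected), "gap =", gap(selected))
```

Selecting more edges raises the sum of weights but widens the gap, so a matching has to trade one goal against the other.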

  22. Road Map • Motivation and overview • Problem definition • Solution • Evaluations on YP data • Conclusions

  23. Experiment Settings
      • Dataset I
        • Business listings for two zip codes (07035, Lincoln Park, NJ; 07715, Belmar, NJ) from multiple sources

  24. Experiment Settings
      • Implementation
        • MATCH (invoking CLUSTER first)
        • LINK: record linkage only
        • FUSE: data fusion only
        • LINKFUSE: first LINK, then FUSE
      • Gold standard: built by manual checking
      • Measures: Precision/Recall/F-measure
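
For reference, the three measures on slide 24 can be computed as below. This uses the standard set-based definitions, with F-measure taken as the harmonic mean of precision and recall (an assumption, since the slide does not say which variant was used). The example pairs are hypothetical, not taken from the experiments.

```python
def precision_recall_f1(predicted, gold):
    """Standard set-based precision, recall, and F1 over predicted vs. gold items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical name-phone matches.
predicted_pairs = {("N1", "P1"), ("N2", "P1"), ("N4", "P1")}
gold_pairs = {("N1", "P1"), ("N2", "P1"), ("N3", "P1"), ("N4", "P4")}

precision, recall, f1 = precision_recall_f1(predicted_pairs, gold_pairs)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```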

  25. Accuracy
      • MATCH achieves the highest F-measure in most cases
        • Improves on LINK by 11% on name-phone matching and by 20% on name clustering
      • LINK vs. FUSE vs. LINKFUSE
        • LINK: high recall in matching
        • FUSE: high precision in matching, high precision in name clustering
        • LINKFUSE: only slightly better than FUSE in matching and similar to LINK in clustering
      [Figure panels: 07035 Matching (NAME-PHONE), 07035 Matching (NAME-ADDRESS), 07035 Clustering (NAME), 07715 Matching (NAME-PHONE), 07715 Matching (NAME-ADDRESS), 07715 Clustering (NAME)]

  26. Efficiency and Scalability
      • Dataset II
        • Entire listing: 40+M records
      • Hadoop-based linkage framework
        • Fuzzy self-join using Hadoop
        • Partition records into strongly connected components
      • Efficiency
        • Linear growth
        • Execution time
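
The partitioning step on slide 26 can be illustrated on a single machine with union-find. This is not the Hadoop framework from the talk. It assumes the fuzzy self-join has already produced pairs of similar records and treats similarity as symmetric, so the components are ordinary connected components. The record ids and function name are hypothetical.

```python
def connected_components(records, similar_pairs):
    """Group records that are linked, directly or transitively, by similar pairs."""
    parent = {record: record for record in records}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for record in records:
        groups.setdefault(find(record), []).append(record)
    return sorted(sorted(group) for group in groups.values())


records = ["r1", "r2", "r3", "r4", "r5"]
similar_pairs = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
components = connected_components(records, similar_pairs)
print(components)
```

Each component can then be clustered and matched independently, which is what makes the partitioning useful for scaling.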

  27. Conclusions
      • In the real world, we need to resolve duplicates and conflicts at the same time
      • We reduce the problem to a k-partite graph clustering and matching problem
        • Combine linkage and fusion
        • Apply them in a global fashion
      • Experiments show high accuracy and scalability

  28. Thank You!
