1 / 26

Modeling and Querying Possible Repairs in Duplicate Detection

Modeling and Querying Possible Repairs in Duplicate Detection. George Beskales Mohamed A. Soliman Ihab F. Ilyas Shai Ben-David. Data Cleaning. Real-world data is dirty Examples: Syntactical Errors: e.g., Micrsoft Heterogeneous Formats: e.g., Phone number formats

zofia
Download Presentation

Modeling and Querying Possible Repairs in Duplicate Detection

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Modeling and Querying Possible Repairs in Duplicate Detection George Beskales Mohamed A. Soliman Ihab F. Ilyas Shai Ben-David

  2. Data Cleaning • Real-world data is dirty • Examples: • Syntactical Errors: e.g., Micrsoft • Heterogeneous Formats: e.g., Phone number formats • Missing Values (Incomplete data) • Violation of Integrity Constraints: e.g., FDs/INDs • Duplicate Records • Data Cleaning is the process of improving data qualityby removing errors, inconsistencies and anomalies of data George Beskales

  3. Duplicate Elimination Process Unclean Relation P1 P2 Compute Pairwise Similarity 0.9 Cluster Records P6 0.3 0.5 P5 P3 P4 1.0 P1 P2 Clean Relation C1 Merge Clusters P6 C3 P5 P3 P4 C2 George Beskales

  4. Probabilistic Data Cleaning Probabilistic Results Queries Queries Results RDBMS Uncertainty and Cleaning-aware RDBMS Deterministic Database Uncertain Database Single Clean Instance Single Clean Instance Single Clean Instance Multiple Possible Clean Instances Deterministic Duplicate Elimination Probabilistic Duplicate Elimination Unclean Database Unclean Database (a) One-Shot Cleaning (b) Probabilistic Cleaning George Beskales

  5. Motivation Probabilistic Data Cleaning: • Avoid deterministic resolution of conflicts during data cleaning • Enrich query results by considering all possible cleaning instances • Allows specifying query-time cleaning requirements George Beskales

  6. Probabilistic Results Queries Uncertainty and Cleaning-aware RDBMS Uncertain Database Single Clean Instance Single Clean Instance Multiple Possible Clean Instances Outline • Generating and Storing the Possible Clean Instances • Querying the Clean Instances • Experimental Evaluation • Generating and Storing the Possible Clean Instances • Querying the Clean Instances • Experimental Evaluation • Generating and Storing the Possible Clean Instances • Querying the Clean Instances • Experimental Evaluation Probabilistic Results Queries Uncertainty and Cleaning-aware RDBMS Uncertain Database Single Clean Instance Single Clean Instance Multiple Possible Clean Instances Probabilistic Duplicate Elimination Probabilistic Duplicate Elimination Unclean Database George Beskales

  7. Uncertainty in Duplicate Detection • Uncertain Duplicates: Determining what records should be clustered is uncertain due to noisy similarity measurements • A possible repairof a relation is a clustering (partitioning) of the unclean relation Person Possible Repairs Uncertain Clustering George Beskales

  8. Challenges • The space of all possible clusterings (repairs) is exponentially large • How to efficiently generate, store and query the possible repairs? George Beskales

  9. Possible repairs given by any fixed parameterized clustering algorithm 2 Possible repairs given by any fixed hierarchical clustering algorithm 3 Link-based algorithms Hierarchical cut clustering algorithm … The Space of Possible Repairs All possible repairs 1 George Beskales

  10. Hierarchical Clustering Algorithms • Records are clustered in a form of a hierarchy: all singletons are at leaves, and one cluster is at root • Example: Linkage-based Clustering Algorithm Distance Threshold () {r1,r2},{r3,r4,r5} Pair-wise Distance r1 r2 r3 r4 r5 George Beskales

  11. Uncertain Hierarchical Clustering • Hierarchical clustering algorithms can be modified to accept uncertain parameters • The number of generated possible repairs is linear {r1,r2},{r3,r4,r5} Possible Thresholds [l, u] {r1,r2},{r3},{r4,r5} {r1},{r2},{r3},{r4,r5} {r1},{r2},{r3},{r4},{r5} r1 r2 r3 r4 r5 George Beskales

  12. Probabilities of Repairs Clustering 1 Clustering 2 Clustering 3 ~U(0,10) 0 1 Pr = 0.1 1  3 Pr = 0.2 3 10 Pr = 0.7 George Beskales

  13. Representing the Possible Repairs Clustering 1 Clustering 2 Clustering 3 0 1 Pr = 0.1 1  3 Pr = 0.2 3 10 Pr = 0.7 U-clean Relation PersonC George Beskales

  14. Probabilistic Results Queries Uncertainty and Cleaning-aware RDBMS Outline • Generating and Storing the Possible Clean Instances • Querying the Clean Instances • Experimental Evaluation • Generating and Storing the Possible Clean Instances • Querying the Clean Instances • Experimental Evaluation • Generating and Storing the Possible Clean Instances • Querying the Clean Instances • Experimental Evaluation Probabilistic Results Queries Uncertainty and Cleaning-aware RDBMS Uncertain Database Single Clean Instance Single Clean Instance Multiple Possible Clean Instances Probabilistic Duplicate Elimination Unclean Database George Beskales

  15. Queries over U-Clean Relations • We adopt the possible worlds semanticsto define queries on U-clean relations U-Clean Relation(s) RC Implementation U-Clean Relation Q(RC) Instances Representation Possible Clean Instances X1,X2,…,Xn Possible Clean Instances Q(X1), Q(X2),…,Q(Xn) Definition George Beskales

  16. Example: Selection Query PersonC SELECT ID, Income FROM Personc WHERE Income>35k George Beskales

  17. Example: Projection Query PersonC SELECT DISTINCT Income FROM Personc George Beskales

  18. Example: Join Query Vehiclec Personc SELECT Income, Price FROM Personc , Vehiclec WHERE Income/10 >= Price George Beskales

  19. Aggregation Queries Personc SELECT Sum(Income) FROM Personc X1 X2 X3 Pr = 0.7 Sum=109k Pr = 0.2 Pr = 0.1 Sum=156k Sum=187k George Beskales

  20. Other Meta-Queries • Obtaining the most probable clean instance • Obtaining the -certain clusters • Obtaining a clean instance corresponding to a specific parameters of the clustering algorithms • Obtaining the probability of clustering a set of records together George Beskales

  21. Outline • Generating and Storing the Possible Clean Instances • Querying the Clean Instances • Experimental Evaluation George Beskales

  22. Experimental Evaluation • A prototype as an extension of PostgreSQL • Synthetic data generator provided by Febrl (freely extensible biomedical record linkage) • Two hierarchical algorithms: • Single-Linkage (S.L.) • Nearest-neighbor based clustering algorithm (N.N.) [Chaudhuri et al., ICDE’05] George Beskales

  23. Experimental Evaluation George Beskales

  24. Experimental Evaluation George Beskales

  25. Experimental Evaluation George Beskales

  26. Conclusion • We allow representing and querying multiple possible clean instances • We modified hierarchical clustering algorithms to allow generating multiple repairs • We compactly store the possible repairs by keeping the lineage information of clusters in special attributes • New (probabilistic) query types can be issued against the population of possible repairs George Beskales

More Related