260 likes | 571 Views
Modeling and Querying Possible Repairs in Duplicate Detection. George Beskales Mohamed A. Soliman Ihab F. Ilyas Shai Ben-David. Data Cleaning. Real-world data is dirty Examples: Syntactical Errors: e.g., Micrsoft Heterogeneous Formats: e.g., Phone number formats
E N D
Modeling and Querying Possible Repairs in Duplicate Detection George Beskales Mohamed A. Soliman Ihab F. Ilyas Shai Ben-David
Data Cleaning • Real-world data is dirty • Examples: • Syntactical Errors: e.g., Micrsoft • Heterogeneous Formats: e.g., Phone number formats • Missing Values (Incomplete data) • Violation of Integrity Constraints: e.g., FDs/INDs • Duplicate Records • Data Cleaning is the process of improving data qualityby removing errors, inconsistencies and anomalies of data George Beskales
Duplicate Elimination Process Unclean Relation P1 P2 Compute Pairwise Similarity 0.9 Cluster Records P6 0.3 0.5 P5 P3 P4 1.0 P1 P2 Clean Relation C1 Merge Clusters P6 C3 P5 P3 P4 C2 George Beskales
Probabilistic Data Cleaning Probabilistic Results Queries Queries Results RDBMS Uncertainty and Cleaning-aware RDBMS Deterministic Database Uncertain Database Single Clean Instance Single Clean Instance Single Clean Instance Multiple Possible Clean Instances Deterministic Duplicate Elimination Probabilistic Duplicate Elimination Unclean Database Unclean Database (a) One-Shot Cleaning (b) Probabilistic Cleaning George Beskales
Motivation Probabilistic Data Cleaning: • Avoid deterministic resolution of conflicts during data cleaning • Enrich query results by considering all possible cleaning instances • Allows specifying query-time cleaning requirements George Beskales
Probabilistic Results Queries Uncertainty and Cleaning-aware RDBMS Uncertain Database Single Clean Instance Single Clean Instance Multiple Possible Clean Instances Outline • Generating and Storing the Possible Clean Instances • Querying the Clean Instances • Experimental Evaluation • Generating and Storing the Possible Clean Instances • Querying the Clean Instances • Experimental Evaluation • Generating and Storing the Possible Clean Instances • Querying the Clean Instances • Experimental Evaluation Probabilistic Results Queries Uncertainty and Cleaning-aware RDBMS Uncertain Database Single Clean Instance Single Clean Instance Multiple Possible Clean Instances Probabilistic Duplicate Elimination Probabilistic Duplicate Elimination Unclean Database George Beskales
Uncertainty in Duplicate Detection • Uncertain Duplicates: Determining what records should be clustered is uncertain due to noisy similarity measurements • A possible repairof a relation is a clustering (partitioning) of the unclean relation Person Possible Repairs Uncertain Clustering George Beskales
Challenges • The space of all possible clusterings (repairs) is exponentially large • How to efficiently generate, store and query the possible repairs? George Beskales
Possible repairs given by any fixed parameterized clustering algorithm 2 Possible repairs given by any fixed hierarchical clustering algorithm 3 Link-based algorithms Hierarchical cut clustering algorithm … The Space of Possible Repairs All possible repairs 1 George Beskales
Hierarchical Clustering Algorithms • Records are clustered in a form of a hierarchy: all singletons are at leaves, and one cluster is at root • Example: Linkage-based Clustering Algorithm Distance Threshold () {r1,r2},{r3,r4,r5} Pair-wise Distance r1 r2 r3 r4 r5 George Beskales
Uncertain Hierarchical Clustering • Hierarchical clustering algorithms can be modified to accept uncertain parameters • The number of generated possible repairs is linear {r1,r2},{r3,r4,r5} Possible Thresholds [l, u] {r1,r2},{r3},{r4,r5} {r1},{r2},{r3},{r4,r5} {r1},{r2},{r3},{r4},{r5} r1 r2 r3 r4 r5 George Beskales
Probabilities of Repairs Clustering 1 Clustering 2 Clustering 3 ~U(0,10) 0 1 Pr = 0.1 1 3 Pr = 0.2 3 10 Pr = 0.7 George Beskales
Representing the Possible Repairs Clustering 1 Clustering 2 Clustering 3 0 1 Pr = 0.1 1 3 Pr = 0.2 3 10 Pr = 0.7 U-clean Relation PersonC George Beskales
Probabilistic Results Queries Uncertainty and Cleaning-aware RDBMS Outline • Generating and Storing the Possible Clean Instances • Querying the Clean Instances • Experimental Evaluation • Generating and Storing the Possible Clean Instances • Querying the Clean Instances • Experimental Evaluation • Generating and Storing the Possible Clean Instances • Querying the Clean Instances • Experimental Evaluation Probabilistic Results Queries Uncertainty and Cleaning-aware RDBMS Uncertain Database Single Clean Instance Single Clean Instance Multiple Possible Clean Instances Probabilistic Duplicate Elimination Unclean Database George Beskales
Queries over U-Clean Relations • We adopt the possible worlds semanticsto define queries on U-clean relations U-Clean Relation(s) RC Implementation U-Clean Relation Q(RC) Instances Representation Possible Clean Instances X1,X2,…,Xn Possible Clean Instances Q(X1), Q(X2),…,Q(Xn) Definition George Beskales
Example: Selection Query PersonC SELECT ID, Income FROM Personc WHERE Income>35k George Beskales
Example: Projection Query PersonC SELECT DISTINCT Income FROM Personc George Beskales
Example: Join Query Vehiclec Personc SELECT Income, Price FROM Personc , Vehiclec WHERE Income/10 >= Price George Beskales
Aggregation Queries Personc SELECT Sum(Income) FROM Personc X1 X2 X3 Pr = 0.7 Sum=109k Pr = 0.2 Pr = 0.1 Sum=156k Sum=187k George Beskales
Other Meta-Queries • Obtaining the most probable clean instance • Obtaining the -certain clusters • Obtaining a clean instance corresponding to a specific parameters of the clustering algorithms • Obtaining the probability of clustering a set of records together George Beskales
Outline • Generating and Storing the Possible Clean Instances • Querying the Clean Instances • Experimental Evaluation George Beskales
Experimental Evaluation • A prototype as an extension of PostgreSQL • Synthetic data generator provided by Febrl (freely extensible biomedical record linkage) • Two hierarchical algorithms: • Single-Linkage (S.L.) • Nearest-neighbor based clustering algorithm (N.N.) [Chaudhuri et al., ICDE’05] George Beskales
Experimental Evaluation George Beskales
Experimental Evaluation George Beskales
Experimental Evaluation George Beskales
Conclusion • We allow representing and querying multiple possible clean instances • We modified hierarchical clustering algorithms to allow generating multiple repairs • We compactly store the possible repairs by keeping the lineage information of clusters in special attributes • New (probabilistic) query types can be issued against the population of possible repairs George Beskales