
Data Cleaning: Strategies, Challenges, and Solutions

Explore constraint-based data cleaning in database systems, addressing dirty data issues, costs, and data quality problems, with a focus on approaches like ETL, Potter’s Wheel, and AJAX. Learn about provenance tracking for data lineage and error analysis.



  1. COMP3311: Database Systems. Data Quality and Data Cleaning (modified from the slides of Prof. Wenfei Fan)

  2. Data Quality and Data Cleaning. Constraint-based Data Cleaning • Data cleaning: overview • Repair framework • Cost model • Repair of FDs and INDs: traditional dependencies • Equivalence class • Repair algorithms

  3. Constraint-based Data Cleaning • Data cleaning: overview (we focus on relational data cleaning) • Repair framework • Cost model • Repair of FDs and INDs • Equivalence class • Repair algorithms

  4. Data in real life is often dirty: errors, conflicts and inconsistencies • "When the federal Auditor-General's Office reported that the Health Insurance Commission had allowed up to 500,000 dead people to retain active Medicare numbers, it managed to put the complex issue of data quality into layman's terms." (The Australian, 2005) • In a database of 500,000 customers, 120,000 records become invalid within a year through death, divorce, marriage, or moves • Typical data error rates in industry run 1%-5%, and up to 30%

  5. Dirty data is costly • Poor data costs US companies $600 billion annually • Wrong price data in retail databases costs US customers $2.5 billion each year • 30%-80% of the development time in a data integration project is spent on data cleaning Hence the need for (semi-)automated methods to clean data!

  6. The data quality problem • Quality dimensions: accuracy, completeness, timeliness, uniqueness, consistency • A data quality problem occurs when the data does not mean what you think it should mean • Does the data meet its specifications? • Are we interpreting it properly? • Is the data suitable for the task at hand?

  7. Approaches to cleaning data • ETL (Extraction, Transformation, Loading): Potter’s Wheel • Workflow: event, condition and action rules to handle errors • Record matching: merge/purge, record deduplication • Data provenance / lineage tracing: tracking the origin of errors • Minimal repair by tuple insertion/deletion • tuple-deletion model • delete + insert model • Minimal repair by value modification • A semi-automated process: suggest repairs, but domain experts are still needed

  8. ETL (Extraction, Transformation, Loading) • Extract data (DB interfaces, web-page fetching, parsing) • Validate data (ensure constraints) • Transform data (e.g., addresses, phone numbers) • Load data Transformation rules are manually designed for a specific domain (e.g., address data), guided by profiling sample data and the types of errors found • Low-level programs: difficult to write, difficult to maintain • Hard to check whether the rules themselves are dirty or not

  9. Potter’s Wheel • Interactive system to explore data transformations for ETL • Scalable spreadsheet interface • Visualized transform sequences • Structured domain definitions: syntax and irregularity • Domain definition language • Searches for the best domain definition for a column • Discrepancy detection "Potter's Wheel: An Interactive Data Cleaning System", V. Raman and J.M. Hellerstein, Proc. VLDB 2001, pp. 381-390

  10. Transforms in Potter’s Wheel. A Format transform with the rule '(.*), (.*)' to '\2 \1' normalizes names to "First Last" form: Stewart,Bob -> Bob Stewart; Anna Davis -> Anna Davis; Dole,Jerry -> Jerry Dole; Joan Marsh -> Joan Marsh. Further transforms, such as a Split at ' ' into first-/last-name columns followed by two Merges, restructure the columns while leaving the cleaned values intact.
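As a rough illustration (our own sketch, not Potter's Wheel itself), the Format step above can be mimicked with Python's re module; the pattern is the slide's '(.*), (.*)' rule, with the space after the comma made optional so that it also matches "Stewart,Bob":

```python
import re

# Format transform: rewrite '(.*), (.*)' to '\2 \1', turning "Last,First"
# into "First Last". Values without a comma do not match and pass through.
def format_transform(value: str) -> str:
    return re.sub(r'(.*), ?(.*)', r'\2 \1', value)

names = ["Stewart,Bob", "Anna Davis", "Dole,Jerry", "Joan Marsh"]
print([format_transform(n) for n in names])
# ['Bob Stewart', 'Anna Davis', 'Jerry Dole', 'Joan Marsh']
```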

  11. AJAX • A prototype system for eliminating duplicates • A declarative language to model transformations as operators: • Mapping (value transformation) • Matching (finding similar pairs) • Clustering (finding sets of similar records) • Merging (duplicate elimination) "AJAX: An Extensible Data Cleaning Tool", H. Galhardas, D. Florescu, D. Shasha and E. Simon, Proc. SIGMOD 2000, p. 590

  12. Provenance / Lineage Tracing • Keep track of the processes used to create data • Coarse-grained: the processing of a table • Fine-grained: the processing of a record • Record a graph of the data-transformation steps • Used for analysis, debugging and feedback loops: tracking the origin of errors, and determining confidence in the data based on its original source When transformations are expressed as queries/updates: • How is provenance passed through database queries? • How can we automatically record provenance when we update databases? Intractable even for extremely simple relational queries!

  13. Two kinds of provenance. Given the table composer(name, born, period) with rows (J.S. Bach, 1685, baroque), (G.F. Handel, 1685, baroque), (W.A. Mozart, 1756, classical), consider the queries SELECT name, born FROM composer and SELECT name, born FROM composer WHERE born < (SELECT AVG(born) FROM composer) For an output row such as (J.S. Bach, 1685) we can ask: Why is this element in the output? Where does this element come from?

  14. Why and Where • Why-provenance of an output tuple d: • the set of all witnesses for d • a witness for d is a minimal set of source tuples which "proves" that d exists in the output • for positive queries: a set of tuples in the source whose deletion causes d to disappear • Where-provenance of an output tuple d: • the set of all source locations whose contents are copied to d Its effectiveness in data cleaning remains to be verified
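As a toy illustration of the distinction (our own sketch, not an actual provenance system), consider the projection query SELECT name, born FROM composer: each output tuple has a single witness, namely the source tuple it was projected from, and each output value is copied from exactly one source location. The tuple ids c1-c3 are invented for the example:

```python
# Why- vs. where-provenance for: SELECT name, born FROM composer
# Source tuples carry invented ids so provenance can point back at them.
composer = [
    {"id": "c1", "name": "J.S. Bach",   "born": 1685, "period": "baroque"},
    {"id": "c2", "name": "G.F. Handel", "born": 1685, "period": "baroque"},
    {"id": "c3", "name": "W.A. Mozart", "born": 1756, "period": "classical"},
]

for t in composer:
    out = {"name": t["name"], "born": t["born"]}
    why = [{t["id"]}]                    # witnesses: {t} alone proves `out` exists
    where = {"name": (t["id"], "name"),  # each output value is copied
             "born": (t["id"], "born")}  # from one source cell
    print(out, "why:", why, "where:", where)
```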

  15. Data cleaning • Errors found in practice: • Syntactic: a value not in the corresponding domain or range, e.g., name = 1.23, age = 250 • Semantic: a value representing a real-world entity that differs from the true value of the entity, e.g., "CIA found WMD in Iraq" (hard to detect and fix) • Given a database D: • Detection: is the data in D consistent? • Cleaning: if not, how do we find a database D' such that D' contains the information of D and is clean? • Central question: what model should be used to specify consistency?

  16. Constraint Repair • Constraints serve as data quality rules: errors and inconsistencies in the data are detected as violations of the constraints Why integrity constraints? • They capture a fundamental part of the semantics of the data, allowing us to detect inconsistencies and errors • Reasoning techniques can remove redundant rules and identify "dirty" rules • Profiling algorithms: automated discovery of dependencies (data quality rules) However: • constraint repair is typically expensive (intractable) • new kinds of dependencies need to be developed, beyond functional/inclusion dependencies • reconcile and repair: can efficient algorithms suggest fixes?

  17. Data inconsistency in integrated data • Databases help maintain data integrity constraints... • ...but integrating data from different databases (e.g., Billing, Sales, Maintenance) exposes inconsistencies • In some cases, the user has to clean it up. Can the DBMS help? • What to do? • automatic detection • automatic repair

  18. Example: a networking service provider • The billing and maintenance departments have separate databases (with tables such as Cust and Equip in Billing, CustSites and Devices in Maintenance) • each is internally consistent, yet contains errors • Goal: reconcile and improve the data quality of, for example, the integrated customer and billing data

  19. Service provider example, continued.
  Billing Dept., cust(phno, name, street, city, state, zip):
    t0: 949-1212 | Alice Smith | 17 bridge   | midville | az | 05211
    t1: 555-8145 | Bob Jones   | 5 valley rd | centre   | ny | 10012
    t2: 555-8195 | Bob Jones   | 5 valley rd | centre   | nj | 10012
    t3: 212-6040 | Carol Blake | 9 mountain  | davis    | ca | 07912
    t4: 949-1212 | Ali Stith   | 27 bridge   | midville | az | 05211
  Maintenance Dept., equip(phno, serno, eqmfct, eqmodel, instdate):
    t5: 949-1212 | AC13006 | AC | XE5000 | Mar-02
    t6: 555-8145 | L55001  | LU | ze400  | Jan-03
    t7: 555-8195 | L55011  | LU | ze400  | Mar-03
    t8: 555-8195 | AC22350 | AC | XE5000 | Feb-99
    t9: 949-2212 | L32400  | LU | ze300  | Oct-01

  20. Constraints and Violations (1) • Consider inclusion and functional dependencies • A functional dependency (FD) says that the values of some fields determine the values of others: cust[zip] -> cust[state] and cust[phno] -> cust[name, street, city, state, zip] • In the cust table above, t1 and t2 violate cust[zip] -> cust[state]: they share zip 10012 but have states ny and nj (likewise, t0 and t4 violate the phno FD: the same phno 949-1212 maps to different names and streets)
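Detecting FD violations is mechanical: group the tuples on the left-hand-side attributes and flag any group that disagrees on the right-hand side. A minimal sketch (our own code, over a subset of the cust columns, using the slide's tuple ids):

```python
from collections import defaultdict

# Find groups of tuples violating the FD lhs -> rhs: group on the
# lhs projection and flag groups with more than one rhs projection.
def fd_violations(rows, lhs, rhs):
    groups = defaultdict(list)
    for t in rows:
        groups[tuple(t[a] for a in lhs)].append(t)
    return [[t["tid"] for t in g] for g in groups.values()
            if len({tuple(t[a] for a in rhs) for t in g}) > 1]

cust = [
    {"tid": "t0", "phno": "949-1212", "name": "Alice Smith", "state": "az", "zip": "05211"},
    {"tid": "t1", "phno": "555-8145", "name": "Bob Jones",   "state": "ny", "zip": "10012"},
    {"tid": "t2", "phno": "555-8195", "name": "Bob Jones",   "state": "nj", "zip": "10012"},
    {"tid": "t4", "phno": "949-1212", "name": "Ali Stith",   "state": "az", "zip": "05211"},
]
print(fd_violations(cust, ["zip"], ["state"]))  # [['t1', 't2']]
print(fd_violations(cust, ["phno"], ["name"]))  # [['t0', 't4']]
```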

  21. Constraints and Violations (2) • An inclusion dependency (IND) says that the values of some fields must appear in some others: equip[phno] ⊆ cust[phno] • t9 (949-2212 | L32400 | LU | ze300 | Oct-01) violates this IND: its phno 949-2212 appears in no cust tuple
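IND violations are even easier to spot: collect the target projections into a set and flag source tuples whose projection is missing. Again a minimal sketch with invented row dictionaries:

```python
# Tuples of src violating the IND src[src_attrs] ⊆ dst[dst_attrs].
def ind_violations(src, src_attrs, dst, dst_attrs):
    present = {tuple(t[a] for a in dst_attrs) for t in dst}
    return [t["tid"] for t in src
            if tuple(t[a] for a in src_attrs) not in present]

equip = [{"tid": "t5", "phno": "949-1212"},
         {"tid": "t9", "phno": "949-2212"}]   # t9 has no matching customer
cust  = [{"tid": "t0", "phno": "949-1212"},
         {"tid": "t1", "phno": "555-8145"}]
print(ind_violations(equip, ["phno"], cust, ["phno"]))  # ['t9']
```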

  22. Constraint Repair • Find a "repair" to suggest • A repair is a database which does not violate the constraints • A good repair is also similar to the original database Example, with the FD R[B] -> R[C] over R(A, B, C). R: t1 = (a1, b1, c2), t2 = (a2, b1, c1), where t1 and t2 agree on B but disagree on C. "Good" repair (one value modified): (a1, b1, c2), (a2, b1, c2). "Bad" repair (two values modified, and the tuples collapse): (a2, b1, c1), (a2, b1, c1)

  23. Data Cleaning Based on Constraint Repair • Data cleaning: overview • Repair framework • System model • Repair of FDs and INDs • Equivalence class • Repair algorithms

  24. Integrity constraints. Consider relations R1, R2, where X, Y are lists of attributes • Functional dependencies (FDs): R1[X] -> R1[Y]. An instance D of R1 satisfies the FD iff any two tuples that agree on the X attributes also agree on the Y attributes: ∀t1, t2 ∈ D ((∀A ∈ X. t1.A = t2.A) → (∀B ∈ Y. t1.B = t2.B)) • Inclusion dependencies (INDs): R1[X] ⊆ R2[Y]. An instance (D1, D2) of (R1, R2) satisfies the IND iff for any tuple t1 in D1 there exists t2 in D2 such that t1[X] = t2[Y]: ∀t1 ∈ D1 ∃t2 ∈ D2. t1[X] = t2[Y] These are the most common constraints found in practice

  25. Problem Statement. Input: a relational database D and a set C of integrity constraints (FDs, INDs). Question: find a "good" repair D' of D • Repair: D' satisfies C • "Good": D' is "close" to the original data in D • changes are minimal: what metric (referred to as cost) should we use? • changes: value modification and tuple insertion, to avoid loss of information (tuple deletion can be expressed via modification) We want to compute D' efficiently, as a suggested fix for the user

  26. Repair Model • For the duration of constraint repair, each input tuple is uniquely identified: t1, t2, ... • The value of attribute A of tuple t in the input database is D(t,A) • The output of the algorithm is a repaired database D', so D'(t,A) is the value of t.A in the current proposed repair • Example: with D: t1 = (a1, b1, c2), t2 = (a2, b1, c1), t3 = (a2, b1, c1) and D': t1 = (a1, b1, c2), t2 = (a2, b1, c1), t3 = (a2, b1, c2), we have D = D' everywhere except D(t3,C) = c1 <> c2 = D'(t3,C)

  27. "Smith" "Jones" Distant: "a" "b" "949-2212" "949-1212" Close: "1.99" "2.0" "GM" "General Motors" weight(t) weight(t.A) dist(v,v') : [0,1) Cost Model Components Distance Weight Confidence placed by the user in the accuracy of a tuple or attribute. COMP3311 Database Systems

  28. Intuition: Tuple/Attribute Weights • A simple model of data reliability • Example: the billing department may be more reliable about address information (weight = 2) but less reliable about equipment (weight = 1) • Example 2: a table of zip codes downloaded from the post-office web site: weight = 100 • In the absence of other information, all weights default to 1 In the cust table, e.g., t0 and t1 carry weight 2, while t2, t3 and t4 carry weight 1

  29. Attribute-Level Cost Model • If D(t.A) = v and D'(t.A) = v', then Cost(t.A) = dist(v, v') * weight(t.A) • Cost(D') is the sum of Cost(t.A) over all changed tuples t and attributes A • Example, for R: t1 = (a1, b1, c2), t2 = (a2, b1, c1): modifying t2.C from c1 to c2 yields a repair of cost 1/2 (a single +1/2 change), whereas deleting both tuples and inserting a resolved one (if we model delete as dist(x, null)) yields a repair of cost 9 (nine +1 changes)
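A sketch of computing Cost(D') for a proposed repair (our own encoding: databases as dicts from tuple-id to row, weights keyed by (tuple-id, attribute)). To mirror the slide's numbers we assume dist returns 1/2 for any change:

```python
# cost(t.A) = dist(v, v') * weight(t.A), summed over every changed cell.
def dist(v, v2):
    return 0.0 if v == v2 else 0.5   # assumed: dist(c1, c2) = 1/2

def repair_cost(D, D_repaired, weights=None):
    total = 0.0
    for tid, row in D.items():
        for attr, v in row.items():
            w = (weights or {}).get((tid, attr), 1.0)  # weights default to 1
            total += dist(v, D_repaired[tid][attr]) * w
    return total

D      = {"t1": {"A": "a1", "B": "b1", "C": "c2"},
          "t2": {"A": "a2", "B": "b1", "C": "c1"}}
repair = {"t1": {"A": "a1", "B": "b1", "C": "c2"},
          "t2": {"A": "a2", "B": "b1", "C": "c2"}}   # modify t2.C only
print(repair_cost(D, repair))  # 0.5: the cheap repair from the slide
```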

  30. Problem and complexity. Input: a relational database D and a set C of integrity constraints. Question: find a repair D' of D such that cost(D') is minimal. Complexity: finding an optimal repair is NP-complete in the size of the database • intractable even for a small, constant number of FDs alone • intractable even for a small, constant number of INDs alone

  31. Heuristic approach to value-based repair • In light of the complexity results, we turn to heuristic approaches. However, most have problems. • Straightforward constraint-by-constraint repair algorithms (fixing individual constraints one by one) can fail to terminate. Consider R1(A, B), R2(B, C) with • FD: R1[A] -> R1[B] • IND: R2[B] ⊆ R1[B] • D(R1) = {(1, 2), (1, 3)}, D(R2) = {(2, 1), (3, 4)} Repairing the FD (e.g., changing B = 3 to 2 in R1) breaks the IND, since 3 ∈ R2[B] no longer appears in R1[B]; repairing the IND (e.g., changing some R1 tuple's B back to 3) re-violates the FD, and so on forever. • Also, the user must stay involved, since any single decision can be wrong. • One approach: equivalence-class-based repair.

  32. Equivalence Classes • An equivalence class is a set of "cells", each identified by a tuple t and an attribute A • An equivalence class eq has a target value targ(eq), drawn from the values in eq, to be assigned after all the equivalence classes are found • Example: in cust, t0 and t4 share phno 949-1212, so the FD cust[phno] -> cust[name, ...] puts the cells (t0, name) and (t4, name) into one class EQ1; targ(EQ1) = "Alice Smith" or "Ali Stith"

  33. Equivalence Classes, Continued • In the repair, give each member of the equivalence class the target value: ∀(t,A) ∈ eq, D'(t,A) = targ(eq) • The target value is chosen from the set of values associated with eq in the input: {D(ti,Ai)} • A given eq class has an easily computed cost w.r.t. a particular target value. For example, if weights are all 1, we might have • Cost({"A","A","B"}, "A") = dist("B","A") = 1 • Cost({"A","A","B"}, "B") = 2 * dist("A","B") = 2 This separates • the decision of which attribute values need to be equivalent from • the decision of exactly what value targ(eq) should be assigned
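Choosing targ(eq) is then a small optimization: try each value occurring in the class and keep the cheapest. A sketch under the slide's simplification (unit weights, dist in {0, 1}):

```python
# Pick the target value for an equivalence class: each cell that must
# change contributes cost 1, so the cheapest target is the one that
# disagrees with the fewest cells.
def pick_target(values):
    return min(set(values), key=lambda cand: sum(v != cand for v in values))

print(pick_target(["A", "A", "B"]))  # 'A' (cost 1) beats 'B' (cost 2)
```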

  34. Resolving FD violations • To resolve a tuple t violating the FD R[A] -> R[B], we compute the set V of tuples from R that match t on A, and union the equivalence classes of the B attributes among all tuples in V, i.e., we change the right-hand side • Example: for cust[phno] -> cust[name, street, city, state, zip], t0 and t4 match on phno, so their name cells merge into EQ1, their street cells into EQ2, and so on • Changing the left-hand side to an existing value may not resolve the violation • Changing the left-hand side to a new value is arbitrary, and loses, e.g., keys
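Merging classes as violations are resolved is the classic union-find problem. A minimal sketch, with cells encoded as (tuple-id, attribute) pairs of our own invention:

```python
# Union-find over cells (tuple-id, attribute): merging the classes of two
# cells records that they must receive the same target value in the repair.
parent = {}

def find(cell):
    parent.setdefault(cell, cell)
    while parent[cell] != cell:
        parent[cell] = parent[parent[cell]]  # path halving
        cell = parent[cell]
    return cell

def union(c1, c2):
    parent[find(c1)] = find(c2)

# t0 and t4 agree on phno, so the FD merges their name cells (and likewise
# their street, city, state and zip cells).
union(("t0", "name"), ("t4", "name"))
print(find(("t0", "name")) == find(("t4", "name")))  # True
```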

  35. Resolving IND violations • To resolve a tuple t violating an IND R[A] ⊆ S[B], we pick a tuple s from S and union the equivalence classes of t.A and s.B for each corresponding pair of attributes A in A and B in B • Example: t9 violates equip[phno] ⊆ cust[phno]; picking, say, s = t0 puts (t9, phno) and (t0, phno) into one equivalence class (EQ1)

  36. Data Cleaning Based on Constraint Repair • Data cleaning: overview • Repair framework • System model • Repair of FDs and INDs • Equivalence class • Repair algorithms "A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification", P. Bohannon, W. Fan, M. Flaster and R. Rastogi, Proc. SIGMOD 2005

  37. Unresolved Sets • Each constraint C has a set of tuples unres(C) • Invariant: any tuple that may violate C is in unres(C) • For FDs and INDs, unres(C) draws its tuples from a single table, e.g., unres(FD1) and unres(FD2) over Cust, and unres(equip[phno] ⊆ cust[phno]) over Equip, which initially contains t9 (949-2212 | L32400 | LU | ze300 | ...)

  38. Algorithmic Framework • pick t, C s.t. t ∈ unres(C) • resolve t with respect to C • remove t from unres(C) • maintain the affected unres sets • repeat until all unres sets are empty To resolve an IND, pick an existing tuple (or create a new tuple t') to cover t, and update the equivalence classes according to the IND
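The loop itself is short; all the work hides in the resolve step and in the bookkeeping. A skeletal sketch (names and signatures are our own; resolve stands for the FD/IND resolution of the previous slides and returns the tuples it newly puts under suspicion, per constraint):

```python
def repair(unres, resolve):
    """unres: constraint -> set of possibly violating tuples."""
    while any(unres.values()):
        C = next(c for c, ts in unres.items() if ts)  # pick a constraint...
        t = unres[C].pop()                            # ...and a tuple in unres(C)
        affected = resolve(t, C)                      # merge equivalence classes, etc.
        for C2, ts in affected.items():               # maintain the other unres sets
            unres[C2].update(ts)

# Trivial run: one constraint, and resolving raises no new suspicions.
repair({"FD1": {"t1"}}, lambda t, C: {})
```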

  39. Algorithmic Framework (continued) • Resolving t w.r.t. IND1 removes t from unres(IND1) • The covering tuple t' is added to unres(FD1): the equivalence-class update causes a target value to change, possibly causing new FD violations • As before: pick t, C s.t. t ∈ unres(C); resolve t with respect to C; remove t from unres(C); maintain the affected unres sets; repeat until all unres sets are empty

  40. Comments on the Algorithmic Framework • The framework guarantees termination and correctness (D' satisfies all constraints) • Termination (in polynomial time) follows from the decreasing number of equivalence classes • Correctness follows from the invariant: all violating tuples are in unres, and unres is empty at termination • Of course, a solution with a single equivalence class would not be very satisfying • It is up to the cost metric to help find better solutions

  41. Algorithms • The performance and quality of the repair depend partly on the implementation of step 1, pick t, C s.t. ...: • Algorithm GreedyRepair uses PickGreedy: return (t, C) such that the cost of resolving t w.r.t. C is minimal • Algorithm GreedyFD uses PickGreedyFD: return (t, FD) such that the cost of resolving t w.r.t. the FD is minimal; if no FD pair exists, fall back to PickGreedy()
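A sketch of the two pick strategies, assuming a resolution_cost oracle (our invention) that returns the cost of the cheapest way to resolve a given (tuple, constraint) pair:

```python
# PickGreedy: over all unresolved (tuple, constraint) pairs, take the
# pair whose cheapest resolution costs least.
def pick_greedy(unres, resolution_cost):
    pairs = [(t, C) for C, ts in unres.items() for t in ts]
    return min(pairs, key=lambda p: resolution_cost(*p))

# PickGreedyFD: the same, but restricted to FDs whenever any FD pair
# remains unresolved.
def pick_greedy_fd(unres, resolution_cost, is_fd):
    fds = {C: ts for C, ts in unres.items() if is_fd(C) and ts}
    return pick_greedy(fds or unres, resolution_cost)

print(pick_greedy({"FD1": {"t1", "t2"}},
                  lambda t, C: {"t1": 3, "t2": 1}[t]))  # ('t2', 'FD1')
```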

  42. Summary and review • Review questions: • What is data cleaning? Why bother? • What are the main approaches to cleaning data? What are their pros and cons? • Why constraint repair? What are its major challenges? • Given a database D and a set C of FDs and INDs, can you always find a repair D' of D? • A new cost model for attribute-based conflict repair (weight * dist) • The complexity of constraint-based data cleaning (NP-hard) • A family of equivalence-class-based heuristic repair techniques: greedy techniques, extensions, inserts • Research issues: • more expressive constraints • XML data cleaning • using provenance to set weights
