270 likes | 417 Views
Floris Geerts ( University of Antwerp ) Giansalvatore Mecca, Donatello Santoro ( Università della Basilicata ) Paolo Papotti ( Qatar Computing Research Institute ). Overview. Motivations and Goals. Semantics. Experimental Results. Overview. Motivations and Goals. Semantics.
E N D
FlorisGeerts (University of Antwerp) Giansalvatore Mecca, DonatelloSantoro (Universitàdella Basilicata) Paolo Papotti (Qatar Computing Research Institute)
Overview • Motivations and Goals • Semantics • Experimental Results
Overview • Motivations and Goals • Semantics • Experimental Results
Data Cleaning as Data Repairing • Data Cleaning is a very general term • standardization, entity resolution, data fusion… • We consider one of its facets: data repairing Database Table R Data Repairing Constraints C over R
A Motivating Example PREFERRED VALUES INTERACTION! fd1. Cust: SSN, Name → Phone fd2. Cust: SSN, Name → CC# fd3. Treat: SSN → Salary ? cfd4. Treat: Insur[‘Abx’] → Tr[‘Dental’] cfd5. IF Treat:Insur[‘Abx’] THEN Cust: City[‘SF’] er6. IF Cust.SSN = MD.SSN, Cust.Phone = MD.Phone → TAKE Name, Street, City from MD
Many Techniques 6 FD - ID Cond FD Cond ID Matching D Edit Rules CurrencyConfidenceMaster Data Cost-MinimalityCertain FixesSampling How to put everything together? Language of Constraints Choose Preferred Values Repair Selection
Many Techniques 7 FD - ID Cond FD Cond ID Matching D Edit Rules CurrencyConfidenceMaster Data Problem 1: missing semantics and missing repair algorithm Cost-MinimalityCertain FixesSampling FD - ID Language of Constraints Choose Preferred Values Repair Selection
Many Techniques 8 FD - ID Cond FD Cond ID Matching D Edit Rules CurrencyConfidenceMaster Data Problem 2: missing formalism to handle user-specified preference rules Cost-MinimalityCertain FixesSampling FD - ID Language of Constraints Choose Preferred Values Repair Selection
Many Techniques 9 FD - ID Cond FD Cond ID Matching D Edit Rules CurrencyConfidenceMaster Data Problem 3: no DBMS-based scalable implementations Cost-MinimalityCertain FixesSampling FD - ID Language of Constraints Choose Preferred Values Repair Selection
System Comparison [1] Bohannon SIGMOD ’05 [2] Cong VLDB ‘07 [3] Kolahi ICDT ’09 [4] Fan VLDB ’10 [5] Beskales VLDB ’10
Our Goal [1] Bohannon SIGMOD ’05 [2] Cong VLDB ‘07 [3] Kolahi ICDT ’09 [4] Fan VLDB ’10 [5] Beskales VLDB ’10
Overview • Motivations and Goals • Semantics • Experimental Results
e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n, ph’, c’, cc’) → ph = ph’ e2. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n, ph’, c’, cc’) → cc = cc’ Cleaning EGDs Σ: Cleaning EGDs e3. Treat(ssn, sal, i, t, d) , Treat(ssn, sal’, i’, t’, d’) → sal = sal’ e4. Treat(ssn, s, ins, t, d) , ins =‘Abx’→ tr =‘Dental’ e5. Cust(ssn, n, ph, c , cc ) , Treat(ssn, s, ins, t, d) , ins =‘Abx’→ c =‘SF’ e6. Cust(ssn, n, ph, c , cc ) , Master(ssn, n’, ph, s, c’) , → n = n’, c = c’
Semantics • Partial Order • Cell Groups • Upgrades • LLUNs
The Partial Order • Standard preferencerules • NULLS CONST • source values are preferredto target values • Orderingattribute Π(Cust.Phone) = {dom(Cust.PhConf), ≥} Π(Treat.Salary) = {dom(Treat.Date), ≥time} • No order Π(Cust.CC#) = ∅
Cell Groups Partialorderover cellgroups e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n’, ph’, c’, cc’) → ph = ph’ • We model partial repair by a Cell-Group • set of target cells that must be changed together • we do not disrupt this equality in the following • Carryalsolineageinformation wechange the value of alloccurrences g1 = <123 → {t1.phone, t2.phone}> justifications occurrences value e2. Cust(s, n, ph) , Master(n, ph’) → ph= ph’ g2 = <999→ {t1.phone, t2.phone}by {tm.ph}>
Upgrades upgrade RepairN upgrade Repair 1 <999→ {t2.ph, t3.ph} by {tm.ph}> Starting database J <123→ {t2.ph, t3.ph} >
LLUNs • There are cases in whichwedon’thaveanyclearstrategy to remove a violation • LLUNs • a new class of symbols • placeholders used to mark conflicts • not only an unknown value but rather a “hypervalue” • the opposite of a NULL, since it upgrades constants L0 e3. Cust(ssn, n, ph, cc ) , Cust(ssn, n, ph’, cc’) → cc = cc’
Scenarios and Solutions Cleaning Scenario • S: source schema, T: target schema, Σ: set of cleaning EGDs • Π: the partial order specification • Π is the way to specify when a value is preferrable to another • Solution: Given C, an instance I of S, and an instance J of T compute an instance J’ such that: • it is a repair, i.e., “I and J’ satisfy Σ” • and “J’ is an upgradeof J according to Π”
The Chase Algorithm • To compute solutions: • Chase algorithm for chasingegds e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n’, ph’, c’, cc’) → ph = ph’ Repair2(backward) Repair3(backward) Repair1 (forward) <L1→ {t2.ssn} > <L2→ {t3.ssn} > <123→ {t2.ph, t3.ph} >
A Few Results • Given a cleaning scenario C, and instances <I, J> • C always has a solution for <I, J> • The chase always terminates (it never fails) • The chase computes all minimal solutions • The number of minimal solutions is exponential in the size of J
Overview • Motivations and Goals • Semantics • Experimental Results
Chase Tree Our goal: to make thisscalable ! Differentorders of application givedifferentresults J e1, b2 e0, b1 e0, f e1, f e0, b2 e1, b1 R1 R2 R3 R10 R11 R12 e1, b1 e0, b1 e1, f e1, b2 e0, f e0, b2 R4 R5 R6 R13 R14 R15 the e0-e1 sequence the e1-e0 sequence
ScalabilityTechniques • Chase implementationbased on equivalenceclasses • Delta Databases • a representationsystem for chasetrees • Costmanagers • pluggablestrategies to prune the chasetree
Scalability Llunatic-FR-S1 Llunatic-FR-S10 Llunatic-FR-S50 Llunatic-FR-S1-FO LLUNATICis the first scalable DBMS-based chasealgorithm for data repairing Hospital Customers sec. Customers 100K 250K 400K550K700K850K1M
Min. Cost Quality of Repairs Llunatic-FR-S1 Llunatic-FR-S10 Llunatic-FR-S1-FO Sampling-500 Vertex Cover Hospital 5k 10k 25k