1 / 27

Floris Geerts ( University of Antwerp )

Floris Geerts ( University of Antwerp ) Giansalvatore Mecca, Donatello Santoro ( Università della Basilicata ) Paolo Papotti ( Qatar Computing Research Institute ). Overview. Motivations and Goals. Semantics. Experimental Results. Overview. Motivations and Goals. Semantics.

linnea
Download Presentation

Floris Geerts ( University of Antwerp )

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. FlorisGeerts (University of Antwerp) Giansalvatore Mecca, DonatelloSantoro (Universitàdella Basilicata) Paolo Papotti (Qatar Computing Research Institute)

  2. Overview • Motivations and Goals • Semantics • Experimental Results

  3. Overview • Motivations and Goals • Semantics • Experimental Results

  4. Data Cleaning as Data Repairing • Data Cleaning is a very general term • standardization, entity resolution, data fusion… • We consider one of its facets: data repairing Database Table R Data Repairing Constraints C over R

  5. A Motivating Example PREFERRED VALUES INTERACTION! fd1. Cust: SSN, Name → Phone fd2. Cust: SSN, Name → CC# fd3. Treat: SSN → Salary ? cfd4. Treat: Insur[‘Abx’] → Tr[‘Dental’] cfd5. IF Treat:Insur[‘Abx’] THEN Cust: City[‘SF’] er6. IF Cust.SSN = MD.SSN, Cust.Phone = MD.Phone → TAKE Name, Street, City from MD

  6. Many Techniques 6 FD - ID Cond FD Cond ID Matching D Edit Rules CurrencyConfidenceMaster Data Cost-MinimalityCertain FixesSampling How to put everything together? Language of Constraints Choose Preferred Values Repair Selection

  7. Many Techniques 7 FD - ID Cond FD Cond ID Matching D Edit Rules CurrencyConfidenceMaster Data Problem 1: missing semantics and missing repair algorithm Cost-MinimalityCertain FixesSampling FD - ID Language of Constraints Choose Preferred Values Repair Selection

  8. Many Techniques 8 FD - ID Cond FD Cond ID Matching D Edit Rules CurrencyConfidenceMaster Data Problem 2: missing formalism to handle user-specified preference rules Cost-MinimalityCertain FixesSampling FD - ID Language of Constraints Choose Preferred Values Repair Selection

  9. Many Techniques 9 FD - ID Cond FD Cond ID Matching D Edit Rules CurrencyConfidenceMaster Data Problem 3: no DBMS-based scalable implementations Cost-MinimalityCertain FixesSampling FD - ID Language of Constraints Choose Preferred Values Repair Selection

  10. System Comparison [1] Bohannon SIGMOD ’05 [2] Cong VLDB ‘07 [3] Kolahi ICDT ’09 [4] Fan VLDB ’10 [5] Beskales VLDB ’10

  11. Our Goal [1] Bohannon SIGMOD ’05 [2] Cong VLDB ‘07 [3] Kolahi ICDT ’09 [4] Fan VLDB ’10 [5] Beskales VLDB ’10

  12. Overview • Motivations and Goals • Semantics • Experimental Results

  13. e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n, ph’, c’, cc’) → ph = ph’ e2. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n, ph’, c’, cc’) → cc = cc’ Cleaning EGDs Σ: Cleaning EGDs e3. Treat(ssn, sal, i, t, d) , Treat(ssn, sal’, i’, t’, d’) → sal = sal’ e4. Treat(ssn, s, ins, t, d) , ins =‘Abx’→ tr =‘Dental’ e5. Cust(ssn, n, ph, c , cc ) , Treat(ssn, s, ins, t, d) , ins =‘Abx’→ c =‘SF’ e6. Cust(ssn, n, ph, c , cc ) , Master(ssn, n’, ph, s, c’) , → n = n’, c = c’

  14. Semantics • Partial Order • Cell Groups • Upgrades • LLUNs

  15. The Partial Order • Standard preferencerules • NULLS CONST • source values are preferredto target values • Orderingattribute Π(Cust.Phone) = {dom(Cust.PhConf), ≥} Π(Treat.Salary) = {dom(Treat.Date), ≥time} • No order Π(Cust.CC#) = ∅

  16. Cell Groups Partialorderover cellgroups e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n’, ph’, c’, cc’) → ph = ph’ • We model partial repair by a Cell-Group • set of target cells that must be changed together • we do not disrupt this equality in the following • Carryalsolineageinformation wechange the value of alloccurrences g1 = <123 → {t1.phone, t2.phone}> justifications occurrences value e2. Cust(s, n, ph) , Master(n, ph’) → ph= ph’ g2 = <999→ {t1.phone, t2.phone}by {tm.ph}>

  17. Upgrades upgrade RepairN upgrade Repair 1 <999→ {t2.ph, t3.ph} by {tm.ph}> Starting database J <123→ {t2.ph, t3.ph} >

  18. LLUNs • There are cases in whichwedon’thaveanyclearstrategy to remove a violation • LLUNs • a new class of symbols • placeholders used to mark conflicts • not only an unknown value but rather a “hypervalue” • the opposite of a NULL, since it upgrades constants L0 e3. Cust(ssn, n, ph, cc ) , Cust(ssn, n, ph’, cc’) → cc = cc’

  19. Scenarios and Solutions Cleaning Scenario • S: source schema, T: target schema, Σ: set of cleaning EGDs • Π: the partial order specification • Π is the way to specify when a value is preferrable to another • Solution: Given C, an instance I of S, and an instance J of T compute an instance J’ such that: • it is a repair, i.e., “I and J’ satisfy Σ” • and “J’ is an upgradeof J according to Π”

  20. The Chase Algorithm • To compute solutions: • Chase algorithm for chasingegds e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n’, ph’, c’, cc’) → ph = ph’ Repair2(backward) Repair3(backward) Repair1 (forward) <L1→ {t2.ssn} > <L2→ {t3.ssn} > <123→ {t2.ph, t3.ph} >

  21. A Few Results • Given a cleaning scenario C, and instances <I, J> • C always has a solution for <I, J> • The chase always terminates (it never fails) • The chase computes all minimal solutions • The number of minimal solutions is exponential in the size of J

  22. Overview • Motivations and Goals • Semantics • Experimental Results

  23. Chase Tree Our goal: to make thisscalable ! Differentorders of application givedifferentresults J e1, b2 e0, b1 e0, f e1, f e0, b2 e1, b1 R1 R2 R3 R10 R11 R12 e1, b1 e0, b1 e1, f e1, b2 e0, f e0, b2 R4 R5 R6 R13 R14 R15 the e0-e1 sequence the e1-e0 sequence

  24. ScalabilityTechniques • Chase implementationbased on equivalenceclasses • Delta Databases • a representationsystem for chasetrees • Costmanagers • pluggablestrategies to prune the chasetree

  25. Scalability Llunatic-FR-S1 Llunatic-FR-S10 Llunatic-FR-S50 Llunatic-FR-S1-FO LLUNATICis the first scalable DBMS-based chasealgorithm for data repairing Hospital Customers sec. Customers 100K 250K 400K550K700K850K1M

  26. Min. Cost Quality of Repairs Llunatic-FR-S1 Llunatic-FR-S10 Llunatic-FR-S1-FO Sampling-500 Vertex Cover Hospital 5k 10k 25k

  27. That’s all Folks!

More Related