330 likes | 396 Views
Achieving Data Quality with AJAX. (first version of AJAX designed and developed at INRIA Rocquencourt, France ). Existing technology. Ad-hoc programs written in a programming language like C or Java or using an RDBMS proprietary language Programs difficult to optimize and maintain
E N D
Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)
Existing technology • Ad-hoc programs written in a programming language like C or Java or using an RDBMS proprietary language • Programs difficult to optimize and maintain • RDBMS mechanisms for guaranteeing integrity constraints • Do not address important data instance problems • Data transformation scripts using an ETL (Extraction-Transformation-Loading)or data quality tool
App. Domain 1 App. Domain 2 App. Domain 3 Problems of data quality solutions (1) Data cleaning transformations ... The semantics of some data transformations is defined in terms of their implementation algorithms
Clean data Rejected data Cleaning process Dirty Data Problems of data quality solutions (2) There is a lack of interactive facilities to tune a data cleaning application program
Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year) Authors(authorKey, name) Events(eventKey, name) PubsAuthors(pubKey, authorKey) Data Cleaning & Transformation Motivating example (1) DirtyData(paper:String)
Authors Publications DQua | Dallan Quass AGup | Ashish Gupta JWid | Jennifer Widom ….. QGMW96| Making Views Self-Maintainable for Data Warehousing |PDIS| null | null | null | null | Miami Beach | Florida, USA | 1996 Events PubsAuthors PDIS | Conference on Parallel and Distributed Information Systems Data Cleaning & Transformation QGMW96 | DQua QGMW96 | AGup …. DirtyData [1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996 [2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-maintianable for data warehousing, PDIS’95 Motivating example (2)
Modeling a data quality process Authors Duplicate Elimination DirtyTitles... DirtyEvents DirtyAuthors • A data quality process is • modeled by a directed acyclic • graph of data transformations Extraction Standardization Cities Tags Formatting DirtyData
AJAX features • An extensible data quality framework • Logical operators as extensions of relational algebra • Physical execution algorithms • A declarative language for logical operators • SQL extension • A debugger facility for tuning a data cleaning program application • Based on a mechanism of exceptions
AJAX features • An extensible data quality framework • Logical operators as extensions of relational algebra • Physical execution algorithms • A declarative language for logical operators • SQL extension • A debugger facility for tuning a data cleaning program application • Based on a mechanism of exceptions
View Cluster Match Map Merge Apply Logical level: parametric operators • View: arbitrary SQL query • Map: iterator-based one-to-many mapping with arbitrary user-defined functions • Match: iterator-based approximate join • Cluster: uses an arbitrary clustering function • Merge: extends SQL group-by with user-defined aggregate functions • Apply: executes an arbitrary user-defined algorithm
Authors Duplicate Elimination DirtyTitles... DirtyAuthors Extraction Standardization Cities Tags Formatting DirtyData Logical level
Physical level Authors Authors Merge Java Scan Duplicate Elimination Cluster TC Match NL DirtyTitles... DirtyTitles... DirtyAuthors DirtyAuthors Extraction Map Java Scan Standardization Map Java Scan Cities Tags Cities Tags Formatting Map SQL Scan DirtyData DirtyData Logical level
Match • Input: 2relations • Finds data records that correspond to the same real object • Calls distance functions for comparing field values and computing the distance between input tuples • Output: 1 relation containing matching tuples and possibly 1 or 2 relations containing non-matching tuples
Example Authors Merge Cluster MatchAuthors Match DirtyAuthors Duplicate Elimination
Example Authors CREATE MATCH MatchDirtyAuthors FROM DirtyAuthors da1, DirtyAuthors da2 LET distance = editDistance(da1.name, da2.name) WHERE distance < maxDist INTO MatchAuthors Merge Cluster MatchAuthors Match DirtyAuthors Duplicate Elimination
Authors Merge Cluster Match DirtyAuthors Duplicate Elimination Example CREATE MATCH MatchDirtyAuthors FROM DirtyAuthors da1, DirtyAuthors da2 LET distance = editDistance(da1.name, da2.name) WHERE distance < maxDist INTO MatchAuthors Input: DirtyAuthors(authorKey, name) 861|johann christoph freytag 822|jc freytag 819|j freytag 814|j-c freytag Output: MatchAuthors(authorKey1, authorKey2, name1, name2) 861|822|johann christoph freytag| jc freytag 822|814|jc freytag|j-c freytag ... MatchAuthors
Implementation of the match operator s1 S1,s2 S2 (s1, s2) is a match if editDistance (s1, s2) < maxDist
S1 S2 ... Nested loop editDistance • Very expensive evaluation when handling large amounts of data • Need alternative execution algorithms for the same logical specification
A database solution CREATE TABLE MatchAuthors AS SELECT authorKey1, authorKey2, distance FROM (SELECT a1.authorKey authorKey1, a2.authorKey authorKey2, editDistance (a1.name, a2.name) distance FROM DirtyAuthors a1, DirtyAuthors a2) WHERE distance < maxDist; • No optimization supported for a Cartesian product with external function calls
Window scanning S n
S Window scanning n
S Window scanning n • May loose some matches
length- 1 length length length + 1 editDistance String distance filtering S1 S2 John Smit John Smith Jogn Smith John Smithe maxDist = 1
Annotation-based optimization • The user specifies types of optimization • The system suggests which algorithm to use Ex: CREATE MATCHING MatchDirtyAuthors FROM DirtyAuthors da1, DirtyAuthors da2 LET dist = editDistance(da1.name, da2.name) WHERE dist < maxDist % distance-filtering: map= length; dist = abs % INTO MatchAuthors
AJAX features • An extensible data quality framework • Logical operators as extensions of relational algebra • Physical execution algorithms • A declarative language for logical operators • SQL extension • A debugger facility for tuning a data cleaning program application • Based on a mechanism of exceptions
Declarative specification DEFINE FUNCTIONS AS Choose.uniqueString(OBJECT[]) RETURN STRING THROWS CiteSeerException Generate.generateId(INTEGER) RETURN STRING Normal.removeCitationTags(STRING) RETURN STRING (600) DEFINE ALGORITHMS AS TransitiveClosure SourceClustering(STRING) DEFINE INPUT DATA FLOWS AS TABLE DirtyData (paper STRING (400) ); TABLE City (city STRING (80), citysyn STRING (80) ) KEY city,citysyn; DEFINE TRANSFORMATIONS AS CREATE MAPPING mapKeDiDa FROM DirtyData Dd LET keyKdd = generateId(1) {SELECT keyKdd AS paperKey, Dd.paper AS paper KEY paperKey CONSTRAINT NOT NULL mapKeDiDa.paper }
Graph of data transformations DEFINE FUNCTIONS AS Choose.uniqueString(OBJECT[]) RETURN STRING THROWS CiteSeerException Generate.generateId(INTEGER) RETURN STRING Normal.removeCitationTags(STRING) RETURN STRING (600) DEFINE ALGORITHMS AS TransitiveClosure SourceClustering(STRING) DEFINE INPUT DATA FLOWS AS TABLE DirtyData (paper STRING (400) ); TABLE City (city STRING (80), citysyn STRING (80) ) KEY city,citysyn; DEFINE TRANSFORMATIONS AS CREATE MAPPING mapKeDiDa FROM DirtyData Dd LET keyKdd = generateId(1) {SELECT keyKdd AS paperKey, Dd.paper AS paper KEY paperKey CONSTRAINT NOT NULL mapKeDiDa.paper } Declarative specification
AJAX features • An extensible data quality framework • Logical operators as extensions of relational algebra • Physical execution algorithms • A declarative language for logical operators • SQL extension • A debugger facility for tuning a data cleaning program application • Based on a mechanism of exceptions
Management of exceptions • Problem: to mark tuples not handledby the cleaning criteria of an operator • Solution: to specify the generation of exception tupleswithin a logical operator • exceptions are thrown by external functions • output constraints are violated
Debugger facility • Supports the (backward and forward) data derivation of tuples wrt an operator to debug exceptions • Supports the interactive data modification and, in the future, the incremental execution of logical operators
References • Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, Cristian-Augustin Saita: “Declarative Data Cleaning: Language, Model, and Algorithms”. VLDB 2001: 371-380