160 likes | 330 Views
Data Cleaning and Transformation – Instance conflict resolution. Helena Galhardas DEI IST
E N D
Data Cleaning and Transformation – Instance conflict resolution Helena Galhardas DEI IST (based on the slides: “A Survey of Data Quality Issues in Cooperative Information Systems”, Carlo Batini, Tiziana Catarci, Monica Scannapieco, 23rd International Conference on Conceptual Modelling (ER 2004))
Instance conflicts resolution & merging • Occur due to poor data quality • Errors exist when collecting or entering data • Sources are not updated • Three types: • representation conflicts, e.g. dollar vs. euro • key equivalence conflicts, i.e. same real world objects with different identifiers • attribute value conflicts, i.e. instances corresponding to same real world objects and sharing an equivalent key, differ on other attributes
Example Attribute conflicts Key conflicts
Techniques • At design time or query time, even if actual conflicts occur at query time • Key conflicts require object identification techniques • There are several techniques for dealing with attribute conflicts
General approach for solving attribute-level conflicts • Declaratively specify how to deal with conflicts: • Set of conflict resolution functions • E.g., count(), min(), max(), random(), etc • Set of strategies to deal with conflicts • Correspond to diff. Tolerance degrees • Query model that takes into account posible conflicts directly • Use SQL directly or extensions to it.
Conflict resolution and result merging: list of proposals • SQL-based conflict resolution (Naumann et al 2002) • Aurora (Ozsu et al 1999) • Fan et al 2001 • FraSQL-based conflict resolution (Sattler et al 2003) • FusionPlex (Motro et al 2004) • OOra (Lim et al 1998) • DaQuinCIS (Scannapieco 2004) • (Wang et al 2000) • ConQuer (Fuxman 2005) • (Dayal 1985)
Conflict resolution and result merging: list of proposals • SQL-based conflict resolution (Naumann et al 2002) • Aurora (Ozsu et al 1999) • Fan et al 2001 • FraSQL-based conflict resolution (Sattler et al 2003) • FusionPlex (Motro et al 2004) • OOra (Lim et al 1998) • DaQuinCIS (Scannapieco 2004) • (Wang et al 2000) • ConQuer (Fuxman 2005) • (Dayal 1985)
Declarative (SQL-based) data merging with conflicts resolution [Naumann 2002] • Problem: Merging data in an integrated database or in a query against multiple sources • Solution: • proposal of a set of resolution functions (e.g. MAX, MIN, RANDOM; GROUP, AVG, LONGEST) • declarative merging of relational data sources by common queries through: • Grouping queries • Join queries (divide et impera) • nested joining (one at a time, hierarchical)
Declarative (SQL-based) data merging with conflicts resolution [Naumann 2002] • Problem: Merging data in an integrated database or in a query against multiple sources • Solution: • proposal of a set of resolution functions (e.g. MAX, MIN, RANDOM; GROUP, AVG, LONGEST) • declarative merging of relational data sources by common queries through: • Grouping queries • Join queries(divide et impera) • nested joining (one at a time, hierarchical)
Grouping queries SELECT isbn, LONGEST(title), MIN(price) FROM books GROUP BY isbn If multiple sources involved with potential duplicates: SELECT books.isbn,LONGEST(books.title), MIN(books.price) FROM ( SELECT * FROM books1 UNION SELECT * FROM books2) AS books GROUP BY books.isbn Advantages: good performance Inconvenient: user-defined aggregate functions not supported by most RDBMS
Join queries (1) • Dismantle a query result that merges multiple sources into different parts, resolve conflicts within each part, and union them to a final result. • Partitions the union of the two sources into : • The intersection of the two: Query 1 SELECT books1.isbn, LONGEST(books1.title, books2.title), MIN(books1.price, books2.price) FROM books1, books2 WHERE books1.isbn = books2.isbn
Join queries (2) • The part of books 1 that does not intersect with books2 (Query 2a or 2b) and vice-versa Query 2a SELECT isbn, title, price FROM books1 WHERE books1.isbn NOT IN (SELECT books1.isbn, FROM books1, books2 WHERE books1.isbn = books2.isbn) ) Advantages: Use of built-in scalar functions available by RDBMS Query 2b SELECT isbn, title, price FROM books1 WHERE books1.isbn NOT IN (SELECT books2.isbn, FROM books2 ) Inconvs: Length and complexity of queries
Advantages and inconvenients • Pros: the proposal considers the usage of SQL, thus exploiting capabilities of existing database systems, rather than proposing proprietary solutions • Cons: limited applicability of conflict resolution functions that often need to be domain-specific
References • “Data Quality: Concepts, Methodologies and Techniques”, C. Batini and M. Scannapieco, Springer-Verlag, 2006. • “Declarative data merging with conflict resolution”, F. Naumann and M. Haeussler, IQ2002.