A Principled Approach to Data Integration and Reconciliation in Data Warehousing

Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Daniele Nardi, Riccardo Rosati
Presented by Alan Wessman

Introduction • Problem: Acquire data from a set of sources for a particular application • Typical architecture: wrappers and mediators • Core problem: specify and implement mediators • Paper focus: Data warehouses

Data Warehouse Integration • Most sources internal to organization • Need global corporate view of data • Conceptual model defines sources and data warehouse (local-as-view) • Three levels of architecture • Conceptual: Global model • Logical: Query specifications for sources and warehouse • Physical: Wrappers and mediators implementing query specifications

Architecture
[Figure: the conceptual model, Source 1, Source 2, and the data warehouse, related by queries q1 through q7]

Specifying Logical Schemas • For each table of source S, create an adorned query • Head: Table name, # columns • Body: Content of table (query over conceptual model) • Adornment: • Domains (data types) of columns • Key attributes

Adorned Query: Example
[Figure: conceptual model with Source 1 and Source 2; currency domains Lira, Yen, Euro]
Halibut(Date, Price) <- Menu(Date, ‘Halibut’, Price) | Price :: Lira, Date :: JulianDate
Swordfish(Date, Price) <- Menu(Date, ‘Swordfish’, Price) | Price :: Lira, Date :: JulianDate
SushiMenu(TunaPrice, SquidPrice, Date) <- Menu(Date, ‘Tuna’, TunaPrice), Menu(Date, ‘Squid’, SquidPrice) | TunaPrice :: Yen, SquidPrice :: Yen, Date :: JulianDate
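One way to hold an adorned query in memory is sketched below. This is only an illustration of the head / body / adornment split described on the slides; the class names Atom and AdornedQuery and their fields are assumptions made for this sketch, not notation from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    """One atom of a query body, e.g. Menu(Date, 'Halibut', Price)."""
    relation: str
    args: tuple  # variable names, or constants written with quotes


@dataclass
class AdornedQuery:
    """Head, body, and adornment of one source or warehouse table."""
    head: str                                    # table name
    head_vars: tuple                             # one variable per column
    body: tuple                                  # atoms over the conceptual model
    domains: dict = field(default_factory=dict)  # adornment: variable -> domain
    key: tuple = ()                              # adornment: key attributes

    def arity(self):
        """Number of columns of the table."""
        return len(self.head_vars)

    def unadorned_vars(self):
        """Head variables that have no domain annotation yet."""
        return [v for v in self.head_vars if v not in self.domains]

    def render(self):
        """Print the query in the slide notation: head <- body | adornment."""
        head = f"{self.head}({', '.join(self.head_vars)})"
        body = ", ".join(f"{a.relation}({', '.join(a.args)})" for a in self.body)
        adorn = ", ".join(f"{v} :: {d}" for v, d in self.domains.items())
        return f"{head} <- {body} | {adorn}"


halibut = AdornedQuery(
    head="Halibut",
    head_vars=("Date", "Price"),
    body=(Atom("Menu", ("Date", "'Halibut'", "Price")),),
    domains={"Price": "Lira", "Date": "JulianDate"},
)
```

Calling halibut.render() reproduces the first example line above (with straight quotes), and unadorned_vars() gives a quick check that every column has a domain.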

Query Consistency Let Q be an adorned query and B its body. Let M be the conceptual model. • B is inconsistent wrt M if for every interpretation of M, evaluation of B is empty • Q is inconsistent wrt M if either B is inconsistent or the annotations are inconsistent • Inference techniques exist for checking query consistency

Interschema Correspondences • Specify how data in different schemas relates • Non-materialized relational tables (computed on-demand) • Like adorned query but annotations identify helper programs • Reusable by other correspondences

Interschema Correspondences Three types of correspondence • Conversion • How data from one source is converted into data fitting a different schema • Matching • How data from different sources matches • Reconciliation • How data from different sources is reconciled to become data in the warehouse

Conversion Correspondence
How data from one source is converted into data fitting a different schema
convert([x], [y]) <- conj(x, y, z) through program(x, y, z)
• conj: Conjunctive query, specifies when conversion applies
• program: Program that performs the conversion
• x: Input tuple of values satisfying conditions for x in conj
• y: Output tuple of values satisfying conditions for y in conj
• z: Additional parameters required by program
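A minimal sketch of a conversion correspondence follows. It assumes that conj can be modeled as a predicate over the input tuple x and the parameters z, and that program computes y from x and z; the class and function names are hypothetical. The Lira-to-Euro example mirrors the currency domains of the menu example, using the fixed rate of 1936.27 lire per euro.

```python
class ConversionCorrespondence:
    """convert([x], [y]) <- conj(x, y, z) through program(x, y, z)."""

    def __init__(self, conj, program):
        self.conj = conj        # assumption: predicate (x, z) -> bool
        self.program = program  # assumption: function (x, z) -> y

    def convert(self, x, z=()):
        """Return the output tuple y, or None when the conversion does not apply."""
        if not self.conj(x, z):
            return None
        return self.program(x, z)


LIRE_PER_EURO = 1936.27  # fixed conversion rate, passed to the program as z


def price_is_convertible(x, z):
    """conj: the conversion applies to non-negative prices and a positive rate."""
    _date, price = x
    (rate,) = z
    return price >= 0 and rate > 0


def lira_to_euro(x, z):
    """program: keep the date, divide the price by the rate."""
    date, price = x
    (rate,) = z
    return (date, round(price / rate, 2))


lira_menu_to_euro = ConversionCorrespondence(price_is_convertible, lira_to_euro)
```

For instance, lira_menu_to_euro.convert((2451545, 19362.7), (LIRE_PER_EURO,)) yields the same Julian date paired with a price of 10.0 euro, while a negative price yields None because conj fails.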

Matching Correspondence
How data from different sources matches
match([x1], …, [xk]) <- conj(x1, …, xk, z) through program(x1, …, xk, z)
• Differs from Conversion Correspondence in use of k tuples that may be matched
• program returns true if the k tuples match
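A matching correspondence can be sketched the same way, with the program returning a boolean over k tuples. The names below are hypothetical, and the matching rule (same date, dish names equal after trimming and lowercasing) is an assumption chosen only to make the example concrete.

```python
class MatchingCorrespondence:
    """match([x1], ..., [xk]) <- conj(x1, ..., xk, z) through program(x1, ..., xk, z)."""

    def __init__(self, conj, program):
        self.conj = conj        # assumption: predicate (tuples, z) -> bool
        self.program = program  # assumption: predicate (tuples, z) -> bool

    def match(self, *tuples, z=()):
        """True when the correspondence applies and the program accepts the tuples."""
        return bool(self.conj(tuples, z)) and bool(self.program(tuples, z))


def all_have_dish_name(tuples, z):
    """conj: every tuple is a (date, dish name) pair."""
    return all(len(t) == 2 and isinstance(t[1], str) for t in tuples)


def same_dish_same_day(tuples, z):
    """program: all k tuples agree on date and normalized dish name."""
    keys = {(date, dish.strip().lower()) for date, dish in tuples}
    return len(keys) == 1


dish_match = MatchingCorrespondence(all_have_dish_name, same_dish_same_day)
```

Because match accepts any number of tuples, the same object covers k = 2 and larger k without changes.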

Reconciliation Correspondence
How data from different sources is reconciled to the warehouse
reconcile([x1], …, [xk], [z]) <- conj(x1, …, xk, z, w) through program(x1, …, xk, z, w)
• z: Data warehouse tuple; result of reconciliation
• w: Additional parameters (like z in previous slides)
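A reconciliation correspondence produces one warehouse tuple z from k source tuples. The sketch below assumes a simple policy (average the prices reported for the same date); that policy, the class name, and the use of w as a rounding parameter are all illustrative assumptions rather than anything prescribed by the paper.

```python
class ReconciliationCorrespondence:
    """reconcile([x1], ..., [xk], [z]) <- conj(x1, ..., xk, z, w) through program(...)."""

    def __init__(self, conj, program):
        self.conj = conj        # assumption: predicate (tuples, w) -> bool
        self.program = program  # assumption: function (tuples, w) -> z

    def reconcile(self, *tuples, w=()):
        """Return the warehouse tuple z, or None when reconciliation does not apply."""
        if not self.conj(tuples, w):
            return None
        return self.program(tuples, w)


def same_date(tuples, w):
    """conj: all source tuples describe the same date."""
    return len({date for date, _price in tuples}) == 1


def average_price(tuples, w):
    """program: one warehouse tuple carrying the mean of the source prices."""
    (decimals,) = w
    date = tuples[0][0]
    mean = sum(price for _date, price in tuples) / len(tuples)
    return (date, round(mean, decimals))


price_reconciler = ReconciliationCorrespondence(same_date, average_price)
```

Other policies (prefer one source, take the most recent value) would only change the program function; the correspondence wrapper stays the same.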

Reusing Correspondences
• Only reuse if previously defined
• Example 1
match([x], [y]) <- convert1([x], [z]), convert2([y], [z]), conj(x, y, z, w) through none
• Example 2
reconcile([x], [y], [z]) <- convert1([x], [w1]), convert2([y], [w2]), match1([w1], [w2]), convert3([w1], [z]), conj(x, y, z, w) through none
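Example 1 can be read as: two tuples match when both convert to the same intermediate value z, so no separate program is needed ("through none"). A small sketch of that composition follows, assuming one source stores dates as ISO strings and the other as proleptic Gregorian ordinals; these representations and the function bodies are hypothetical, and the extra conj condition is omitted for brevity.

```python
from datetime import date


def convert1(x):
    """Previously defined conversion: ISO date string -> common date value."""
    (iso_string,) = x
    return (date.fromisoformat(iso_string),)


def convert2(y):
    """Previously defined conversion: ordinal day number -> common date value."""
    (ordinal,) = y
    return (date.fromordinal(ordinal),)


def match(x, y):
    """match([x], [y]) <- convert1([x], [z]), convert2([y], [z]) through none.

    The tuples match exactly when both conversions produce the same z.
    """
    return convert1(x) == convert2(y)
```

The matching logic is derived entirely from the two conversions, which is the point of reuse: changing how a source encodes dates only requires updating its conversion.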

Specifying Mediators
Aim: Specify for each relation in warehouse how the tuples should be constructed from the sources
Task: Materialize a new relation T in the warehouse
Steps:
• Specify T as an adorned query q <- q’ | c1, …, cn
• Look for a rewriting of q in terms of queries q1, …, qs corresponding to materialized views in the warehouse
• Look for a rewriting of (what remains of q) in terms of queries corresponding to tables in the sources and the conversion, matching, and reconciliation correspondences
Resulting query is specification for the mediator for T

Computing the Rewriting
• Rewriting typically needs to merge results of several queries
• Produce set of merging clauses
Form:
merging tuple-spec_1 and … and tuple-spec_n
such that matching-condition
into tuple-spec_t1 and … and tuple-spec_tm
• Generates template; designer specifies “such that” and “into” parts, or writes custom merging clauses
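The shape of a merging clause can be sketched as a small function: take one tuple from each input relation, keep the combinations that satisfy the "such that" condition, and emit the tuples built by the "into" part. This is a naive nested-loop illustration under assumed names (merge, such_that, into), not the rewriting algorithm from the paper.

```python
from itertools import product


def merge(relations, such_that, into):
    """Evaluate one merging clause over a list of input relations.

    relations: list of lists of tuples (one list per tuple-spec)
    such_that: predicate over one tuple from each relation
    into:      function returning the list of output tuples for a combination
    """
    merged = []
    for combination in product(*relations):
        if such_that(*combination):
            merged.extend(into(*combination))
    return merged


# Hypothetical data: (date, price) tuples from two sources, already in one currency.
source1 = [(1, 10.0), (2, 8.0)]
source2 = [(1, 9.5), (3, 7.0)]

lowest_price = merge(
    [source1, source2],
    such_that=lambda a, b: a[0] == b[0],            # designer-supplied matching condition
    into=lambda a, b: [(a[0], min(a[1], b[1]))],    # designer-supplied output tuple
)
```

In this sketch the two lambdas play the role of the parts the designer fills in; the loop around them corresponds to the generated template.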

Conclusion • Start with conceptual model and several types of correspondences • Query rewriting algorithm generates mediator specifications • Designer fills in any remaining details • No empirical results
