A Principled Approach to Data Integration and Reconciliation in Data Warehousing
Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Daniele Nardi, Riccardo Rosati
Presented by Alan Wessman
Introduction
• Problem: Acquire data from a set of sources for a particular application
• Typical architecture: wrappers and mediators
• Core problem: specify and implement mediators
• Paper focus: Data warehouses
Data Warehouse Integration
• Most sources internal to organization
• Need global corporate view of data
• Conceptual model defines sources and data warehouse (local-as-view)
• Three levels of architecture
  • Conceptual: Global model
  • Logical: Query specifications for sources and warehouse
  • Physical: Wrappers and mediators implementing query specifications
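Architecture
[Figure: the Conceptual Model linked to Source 1, Source 2, and the Data Warehouse by queries q1 to q7; the edge labels in the figure are "q1, q2", "q3, q4, q5", and "q6, q7"]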
Specifying Logical Schemas
• For each table of source S, create an adorned query
  • Head: Table name, number of columns
  • Body: Content of table (query over conceptual model)
  • Adornment:
    • Domains (data types) of columns
    • Key attributes
Adorned Query: Example
[Figure: Conceptual Model, Source 1, Source 2; currency labels Lira, Yen, Euro]
Halibut(Date, Price) <- Menu(Date, 'Halibut', Price) | Price :: Lira, Date :: JulianDate
Swordfish(Date, Price) <- Menu(Date, 'Swordfish', Price) | Price :: Lira, Date :: JulianDate
SushiMenu(TunaPrice, SquidPrice, Date) <- Menu(Date, 'Tuna', TunaPrice), Menu(Date, 'Squid', SquidPrice) | TunaPrice :: Yen, SquidPrice :: Yen, Date :: JulianDate
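The Halibut query from this slide can be mimicked in a few lines of Python. This is only an illustrative sketch, not the paper's formalism: the `AdornedQuery` class, its field names, and the rows in `MENU` are all hypothetical, and the body is written as an ordinary Python function rather than a query over a real conceptual model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical extension of the conceptual relation Menu(Date, Dish, Price).
# The rows are made-up illustration data, not taken from the paper.
MENU = [
    (2451545, "Halibut", 30000),
    (2451545, "Swordfish", 42000),
    (2451546, "Halibut", 31000),
]


@dataclass
class AdornedQuery:
    head: str                     # table name
    columns: Tuple[str, ...]      # head variables, in column order
    body: Callable[[list], list]  # stand-in for the query over the conceptual model
    adornment: Dict[str, str]     # column -> domain annotation

    def evaluate(self, menu):
        """Return the tuples the source table is expected to contain."""
        return self.body(menu)


# Halibut(Date, Price) <- Menu(Date, 'Halibut', Price) | Price :: Lira, Date :: JulianDate
halibut = AdornedQuery(
    head="Halibut",
    columns=("Date", "Price"),
    body=lambda menu: [(d, p) for (d, dish, p) in menu if dish == "Halibut"],
    adornment={"Price": "Lira", "Date": "JulianDate"},
)

rows = halibut.evaluate(MENU)
print(rows)
```

Keeping the adornment separate from the body mirrors the slide: the body says which tuples belong to the table, while the adornment only records how each column is represented.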
Query Consistency
Let Q be an adorned query and B its body. Let M be the conceptual model.
• B is inconsistent wrt M if, for every interpretation of M, the evaluation of B is empty
• Q is inconsistent wrt M if either B is inconsistent or the annotations are inconsistent
• Inference techniques exist for checking query consistency
Interschema Correspondences
• Specify how data in different schemas relates
• Non-materialized relational tables (computed on-demand)
• Like adorned query but annotations identify helper programs
• Reusable by other correspondences
Interschema Correspondences
Three types of correspondence
• Conversion
  • How data from one source is converted into data fitting a different schema
• Matching
  • How data from different sources matches
• Reconciliation
  • How data from different sources is reconciled to become data in the warehouse
Conversion Correspondence
How data from one source is converted into data fitting a different schema
convert([x], [y]) <- conj(x, y, z) through program(x, y, z)
• conj: Conjunctive query, specifies when conversion applies
• program: Program that performs the conversion
• x: Input tuple of values satisfying conditions for x in conj
• y: Output tuple of values satisfying conditions for y in conj
• z: Additional parameters required by program
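A conversion correspondence can be pictured as a guard plus a helper program. The sketch below is an assumption-laden illustration, not the paper's implementation: `applies` stands in for conj, `lira_to_euro` stands in for program, and the additional parameter z is taken to be the exchange rate. The rate used is the fixed official lira/euro conversion rate.

```python
LIRE_PER_EURO = 1936.27  # fixed official conversion rate, passed as the extra parameter z


def applies(x):
    """Stand-in for conj: the conversion applies to non-negative Lira prices."""
    _date, price_lira = x
    return price_lira >= 0


def lira_to_euro(x, rate):
    """Stand-in for program: maps the input tuple x to the output tuple y."""
    date, price_lira = x
    return (date, round(price_lira / rate, 2))


def convert(x, rate=LIRE_PER_EURO):
    """Return the converted tuple y, or None when the correspondence does not apply."""
    if not applies(x):
        return None
    return lira_to_euro(x, rate)


print(convert((2451545, 193627)))
```

Returning `None` when the guard fails is a design choice of this sketch; it keeps "the conversion does not apply" distinct from any converted value.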
Matching Correspondence
How data from different sources matches
match([x1], …, [xk]) <- conj(x1, …, xk, z) through program(x1, …, xk, z)
• Differs from Conversion Correspondence in use of k tuples that may be matched
• program returns true if the k tuples match
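For k = 2, a matching correspondence amounts to a boolean test over pairs of tuples. The following is a minimal sketch under stated assumptions: the tuple layouts follow the Halibut and SushiMenu tables from the earlier example, the "same date" criterion and the function names are hypothetical, and the conj guard is reduced to a trivial check.

```python
def conj(x1, x2):
    """Stand-in for conj: both tuples must have the expected number of columns."""
    return len(x1) == 2 and len(x2) == 3


def same_day(x1, x2):
    """Stand-in for program: True if both tuples describe the same date."""
    date1, _price = x1           # Halibut(Date, Price)
    _tuna, _squid, date2 = x2    # SushiMenu(TunaPrice, SquidPrice, Date)
    return date1 == date2


def match(xs1, xs2, program=same_day):
    """Return every pair of tuples for which the correspondence holds."""
    return [(x1, x2) for x1 in xs1 for x2 in xs2
            if conj(x1, x2) and program(x1, x2)]


halibut_rows = [(2451545, 30000), (2451546, 31000)]  # made-up data
sushi_rows = [(900, 700, 2451546)]                   # made-up data
pairs = match(halibut_rows, sushi_rows)
print(pairs)
```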
Reconciliation Correspondence
How data from different sources is reconciled to the warehouse
reconcile([x1], …, [xk], [z]) <- conj(x1, …, xk, z, w) through program(x1, …, xk, z, w)
• z: Data warehouse tuple; result of reconciliation
• w: Additional parameters (like z in previous slides)
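A reconciliation program takes the k source tuples and produces the warehouse tuple z. The sketch below assumes k = 2 and a purely hypothetical policy (prefer the first source's price, fall back to the second); the paper does not prescribe this policy, and the tuple layout is invented for illustration.

```python
def reconcile(x1, x2):
    """Reconcile two (date, price_or_None) tuples into one warehouse tuple z.

    Returns None when the guard (stand-in for conj) fails.
    """
    date1, price1 = x1
    date2, price2 = x2
    if date1 != date2:  # stand-in for conj: tuples must describe the same date
        return None
    # Stand-in for program: hypothetical "first source wins" policy.
    price = price1 if price1 is not None else price2
    return (date1, price)


print(reconcile((2451545, None), (2451545, 15.5)))
```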
Reusing Correspondences
• Only reuse if previously defined
• Example 1
  match([x], [y]) <- convert1([x], [z]), convert2([y], [z]), conj(x, y, z, w) through none
• Example 2
  reconcile([x], [y], [z]) <- convert1([x], [w1]), convert2([y], [w2]), match1([w1], [w2]), convert3([w1], [z]), conj(x, y, z, w) through none
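Example 2 can be read as plain function composition: because the correspondence is declared "through none", it needs no helper program of its own and relies entirely on correspondences defined earlier. The sketch below follows that shape; the concrete conversions (Lira and euro cents to euros), the matching tolerance, and the tuple layouts are assumptions made for illustration only.

```python
LIRE_PER_EURO = 1936.27  # fixed official conversion rate


def convert1(x):
    """Source 1 tuple (date, price in Lira) -> common form (date, price in euros)."""
    date, lira = x
    return (date, round(lira / LIRE_PER_EURO, 2))


def convert2(y):
    """Source 2 tuple (date, price in euro cents) -> common form (date, price in euros)."""
    date, cents = y
    return (date, cents / 100)


def match1(w1, w2, tolerance=0.01):
    """True if both converted tuples have the same date and nearly equal prices."""
    return w1[0] == w2[0] and abs(w1[1] - w2[1]) <= tolerance


def convert3(w1):
    """Common form -> warehouse tuple (the identity in this sketch)."""
    return w1


def reconcile(x, y):
    """Composed correspondence: no helper program of its own ("through none")."""
    w1, w2 = convert1(x), convert2(y)
    if not match1(w1, w2):
        return None
    return convert3(w1)


print(reconcile((2451545, 193627), (2451545, 10000)))
```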
Specifying Mediators
Aim: Specify for each relation in warehouse how the tuples should be constructed from the sources
Task: Materialize a new relation T in the warehouse
Steps:
• Specify T as an adorned query q <- q’ | c1, …, cn
• Look for a rewriting of q in terms of queries q1, …, qs corresponding to materialized views in the warehouse
• Look for a rewriting of (what remains of q) in terms of queries corresponding to tables in the sources and the conversion, matching, and reconciliation correspondences
Resulting query is specification for the mediator for T
Computing the Rewriting
• Rewriting typically needs to merge results of several queries
• Produce set of merging clauses
  Form:
    merging tuple-spec_1 and … and tuple-spec_n
    such that matching-condition
    into tuple-spec_t1 and … and tuple-spec_tm
• Generates template; designer specifies “such that” and “into” parts, or writes custom merging clauses
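The merging clause form can be approximated as a filtered join in which the designer supplies the "such that" and "into" parts. The sketch below handles only two input queries and uses made-up data; the function name `merge` and its parameters are hypothetical and are not part of the paper's algorithm.

```python
def merge(results1, results2, such_that, into):
    """Apply one merging clause to the results of two queries.

    such_that: matching condition over a pair of tuples.
    into: builds the list of output tuples for a matching pair.
    """
    merged = []
    for t1 in results1:
        for t2 in results2:
            if such_that(t1, t2):
                merged.extend(into(t1, t2))
    return merged


# Made-up query results: (date, price in euros)
halibut = [(2451545, 15.5), (2451546, 16.0)]
swordfish = [(2451545, 21.0), (2451547, 22.0)]

daily = merge(
    halibut,
    swordfish,
    such_that=lambda h, s: h[0] == s[0],      # same date
    into=lambda h, s: [(h[0], h[1], s[1])],   # (date, halibut price, swordfish price)
)
print(daily)
```

Passing the two designer-specified parts as functions reflects the slide's point that the algorithm generates the template while the designer fills in the conditions.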
Conclusion
• Start with conceptual model and several types of correspondences
• Query rewriting algorithm generates mediator specifications
• Designer fills in any remaining details
• No empirical results