360 likes | 515 Views
Tracing the Lineage of View Data in a Warehousing Environment. Seminar : “Digital Information Curation“ Winterterm 2005/2006. Sini š a Avramovi ć avramovi@inf.uni-konstanz.de. Outline. - The Lineage Problem :
E N D
Tracing the Lineage of View Data in a Warehousing Environment Seminar : “Digital Information Curation“ Winterterm 2005/2006 Siniša Avramović avramovi@inf.uni-konstanz.de
Outline - The Lineage Problem : Introduction and Definition • Motivating Examples • Basic Assumptions and Definitions • Derivation Tracing Algorithms for SPJ and ASPJ Views
Outline II • Derivation Tracing in a DWH Environment * handling intermediate results (Materialization vs. Recomputing) * The Multisource Problem * Example :DWH with Derivation Tracing Support - Discussion
Introduction • Data Warehouse : integration of data from multiple, heterogeneous, distributed, and possibly very large data sources. • Queries posed to the DW : - for analytical purposes (OLAP,OLAM) - often of high complexity (join, aggregation) • Materialized Views : intermediate results of query processing stored in DW for query efficiency improvement.
Introduction II • In general : view definition - mapping from base data to view data
Introduction III • In general : view definition - mapping from base data to view data View definition
Introduction II • View Data Lineage : inverse mapping from view of data item to the base relation data that produced it View data lineage * In this presentation, we deal with SPJ and ASPJ views
Introduction III • Unfortunately : inverse mapping not as straightforward as view computing. Here not only view definition needed, but also some additional information required. • Data Warehousing Environment introduces some additional challenges to the lineage problem. Questions to be Answered : * how to trace lineage in a distributed database environment * how to deal with inaccessible or inconsistent sources
Motivating Examples California πstore_name, item_name, num_sold σ state=‘ca‘ item(item_id,item_name,category); store(store_id,store_name,city,state); sales(store_id,item_id,price,num_sold); CREATE VIEW California AS SELECT store.store_name,item.item_name, sales.num_sold FROM store,item,sales WHERE sales.store_id = store.store_id AND sales.item_id = item.item_id AND store.sate = “CA“
Motivating Examples california view sales table item table store table
Motivating Examples california view sales table item table store table
Motivating Examples california view sales table item table store table
Motivating Examples california view sales table item table store table
Motivating Examples california view sales table item table store table
Basic Assumptions • Class of views defined over base relations using relation algebra operation (σ,π, ,⍺) considered • SPJ and ASPJ Views Set Semantics considered Basic Definitions Derivation of view tupels : tupels of base relations which produce the View tupel. Those tupel said to contribute to t* * t : materialized view tupel
Basic Definitions: Tupel Derivation for an Operator California πstore_name, item_name, num_sold σ state=‘ca‘ T1 T2 T3 - OP : any relational operator(σ,π, ⋈ ,⍺) over T1,T2,T3 • T=Op(T1,..,T3); • t∊T ; t‘ s Derivation in T1…T3 : Op-1〈T1,..,T3〉(t)=〈T1*,…. T3*〉 ; T1*,…. T3* maximal subsets of T1,…,T3 such that: a) the derivation tupel sets Ti derive exactly t b) Each tuple in derivation in fact contribute something to t
Basic Definitions: Tupel Derivation for Views Derivation of view tuple set T contains all tuples that contribute to any view tuple in set T Any View V with complex definition can be broken into intermediate views, and compute tuples derivation recursively by tracing the hierarchy of intermediate views.
SPJ View Derivation Tracing • Derivation tracing queries queries written for specific view definition, such that if applied to database D, it return t‘s derivation in D. Given database D, with base relations R1,…Rm; view definition v over R1,…Rm and tuple t∊ v(R1,…Rm ); TQtvis the derivation tracing for t and v if : TQtv(D)=v-1D(t) * v-1D(t) : t‘s derivation over D according to v
SPJ View Derivation Tracing • SPJ canonical form : using a sequence of algebraic transformations- all SPJ views can be transformed into the form : ∏A (σC (R1⋈ …. ⋈Rm)) SplitOperator : given table T with schema T , Split breaks T into list of tables, each table in the list is projection onto A⊆T
SPJ View Derivation Tracing California View Store Table Item Table Sales Table
SPJ View Derivation Tracing Given Database D, with RelationsR1,…Rm , and : v(D)= ∏A (σC (R1⋈ …. ⋈ Rm)). Derivation of t : For T⊆v(D) :
SPJ View Derivation Tracing : Optimization • Optimization by pushing selection conditions below the join • operator.
ASPJ View Derivation Tracing Clothing Additional
ASPJ View Derivation Tracing • Problems with ASPJ Views : - most ASPJ Views not traceable without storing additional information (intermediate results). - no simple canonical form, since selection operators cannot be pushed above or below aggregation operators . Basic Idea : ASPJ canonicalform transform general ASPJ views into sequences of ⍺,π,σ, ⋈ by commuting some SPJ operators ASPJ Segment. Each segment include aggregation operators . View defined by one ASPJ segment – one-level ASPJ view.
ASPJ View Derivation Tracing • Given a one–level ASPJ view V : and t∊T; t‘s derivation :
Recursive Derivation Tracing Algorithm for Multi-Level ASPJ Views • Problem : multiple queries required for tracing computations. • Idea : - transfer the view into ASPJ canonical form • Divide into set of ASPJ segments • Define intermediate views for each segment
Recursive Derivation Tracing Algorithm for Multi-Level ASPJ Views • Problem : multiple queries required for tracing computations. • Idea : • Trace recursively trough hierarchy of intermediate Views top-down list • Use at each level a one-level ASPJ query to compute derivations • Concatenate local results to form derivation of whole view tuple list
Derivation Tracing in a Warehousing Environment • Until now, always assumed that all base relations and intermediate results accessible for tracing computations. • By looking at DWH as integration of heterogeneous and distributed data sources, following problems may arise: - Efficiency problem : querying remote sources with join and aggregation operation, recomputing intermediate results for each query very inefficient in the distributed environment. - Consistence problem : view refreshing and view maintenance problems - Legacy sources : Views defined on inaccessible , legacy sources not traceable
Derivation Tracing in a Warehousing Environment • Solutions to the intermediate aggregation results problem: The “Recomputation Approach“: - Recompute the intermediate results for each query. - No permanent extra storage, tracing process takes much longer, especially in distributed environments. The “Materialized View“ Approach • Maintain auxiliary materialized views with intermediate results in the DWH. • Less computation required but .. • extra storage and maintenance costs
Storing Derivation Views in a Multi-Source DWH • Various strategies for storing auxiliary views : simple extreme solution : store all base relations or store less information with higher tracing computation cost. Goal : intermediate scheme with low tracing cost and modest extra storage and maintenance costs. Idea : Derivation Views • break down view definition again to ASPJ segments. • Only view defined by lowest segments directly over base relations • Those views can be computed by a simple selection with split operator
Storing Derivation Views in a Multi-Source DWH All_clothing DV_All_clothing
Warehousing System Supporting Derivation Tracing I • - Auxiliary view AllClothing maintained for tracing tuples from • Clothing • AllClothing records the intermediate aggregation results • to trace tuples from AllClothing, derivation view • DV_AllClothing is maintained.
Tracing the Lineage of View Data in a Warehousing Environment Questions ?