470 likes | 623 Views
Noget helt andet…. Platon vil gerne være vært (i Århus) for et BIT møde i efteråret SOA eller MDM Fint for mig, men hvad siger i ? Platon inviterer alle til www.bi2006.dk 7-8 juni Special pris for BIT medlemmer: 2995 kr. Tilmelding via Jørgen Davidsen, jda@platon.net.
E N D
Noget helt andet… • Platon vil gerne være vært (i Århus) for et BIT møde i efteråret • SOA eller MDM • Fint for mig, men hvad siger i ? • Platon inviterer alle til www.bi2006.dk • 7-8 juni • Special pris for BIT medlemmer: 2995 kr. • Tilmelding via Jørgen Davidsen, jda@platon.net
Lineage Tracing in DataWarehouses Torben Bach Pedersen Based on work by Yingwei Cui and Jennifer Widom Stanford University Database Group
Motivation: Data Warehousing Data Warehouse Wow?! Lucrative Fields Theory $320K Databases $8800K Databases $8800K Networks $800K Courses Enrollments Students Source 1 Source 2 Source 3
Data Warehouse Lucrative Fields Oh, I see... Theory $320K Databases $8800K Database 1800 Networks $800K Lineage Tracer Courses Enrollments Students CS154 Theory CS154 Joe Ann BS$1K CS145 Databases CS145 Ted Bob MS $1K CS244 Networks CS244 Bob Jane Web $5K CS245 Databases CS145 Ann Joe BS$1K CS245 Jane Ted Web $5K … … … … … Source 1 Source 2 Source 3
The Data Lineage Problem • Data warehouses integrate data from multiple sourcesfor analysis and mining • Data lineage: given data item o in the warehouse, which data items in the sources were used to derive o? • Sometimes called “drill-through” in industry • “Drill-through” often limited
Challenges • Warehouse of relational views over relational sources • What is a good formal definition for lineage? • How do we trace data lineage for arbitrary views? • How do we make it efficient? • Warehouse defined by graph of data transformations • No fixed, well-defined relational operators • Large transformation sequences and graphs
Outline of Talk • Part 1: Lineage tracing for relational views • Part 2: Lineage tracing for general data transformations
Part 1: Lineage Tracing for Relational Views • Declarative definition of data lineage • Lineage tracing algorithms • Using auxiliary views for efficient lineage tracing • Experimental results (small sample)
Views We Consider • Relational algebra • Arbitrary use of aggregation • Set semantics • Also in thesis • Set operators • Bag semantics s, p, V a a p a s , , R S T
8 b 8 b 0 8 b 0 b 6 8 b 6 8 b 6 b 0 b 6 Simple Lineage Example V = a (s(RS)) Y,sum(Z) X >Z X Y R a 3 8 b T X Y Z U X Y Z V Y sum a 2 s 3 a 3 a 2 a 2 8 b 0 X >Z Y,sum(Z) Y Z S 8 b 0 b 6 8 b 6 8 b 9 a 2 8 b 6 b 0 b 9 select Y,sum(Z) from R natural join S where X>Z group by Y b 6
Unary relational operators definition took a long time Lineage for Relational Operators (s, p, a) R R* op t • Lineage oftaccording to op is the maximal subset R* R such that • (1) op(R*) = {t} - output of R* through op is t • (2) t* R*: op({t*}) - op used on t* is nonempty
8 b 6 8 b 6 Lineage for Relational Operators • Example 1 – the two conditions ensure that only tuples contributing to t are included in lineage R X Y Z X Y Z a 2 3 a 2 3 s 8 b 0 8 b 0 X >Z 8 b 6 8 b 9 8 b 6 • Lineage oftaccording to op is the maximal subset R* R such that • (1) op(R*) = {t} • (2) t* R*: op({t*})
8 b 0 b 6 8 b 6 Lineage for Relational Operators • Example 2 –”maximal” requirement ensures that (8,b,0) tuple in included in (b,6) lineage X Y Z R Y sum a 2 a 3 a 2 Y,sum(Z) 8 b 0 b 6 8 b 6 • Lineage oftaccording to op is the maximal subset R* R such that • (1) op(R*) = {t} • (2) t* R*: op({t*})
* R1 op * R2 Lineage for Relational Operators • N-ary relational operators ( ,,) – lineage unique R1 R2 • Lineage oftaccording to op is the maximal subsets Ri* Ri for i = 1..n such that • (1) op(R1*, …, Rn*) = {t} • (2) ti* Ri*: op(R1, …, {ti*}, …, Rn)
* R1 R1 U V t op op 2 1 * R2 U* R2 Lineage for Relational Views • Lineage of a tuple set is union of lineage of each tuple in the set • Lineage for views is defined recursively => naive, but inefficient, algorithm (need to recompute/store all intermediate results) Lineage of t is R1*, R2*
a(p(s(E1 …En))) • Each segment Lineage Tracing • Convert view into segmented normal form (SPJ+agg) • Generate one tracing query for each segment • Apply tracing queries recursively • # non-top a+ 1 • Proof: lineage result is unaffected by normalization and segment-level tracing
V = a (s(RS)) Y,sum(Z) X >Z TQ =Split(s(RS)) R,S X >Z Y=b 8 b b 6 0 b 6 b R*={(8,b)}, S*={(b,0),(b,6)} Tracing Query for One Segment R X Y a 3 8 b V Y sum s a X >Z Y,sum(Z) a 2 Y Z S b 6 a 2 0 b b 9 6 b Split = ”unjoin” – project over R+S schemas
V = a (a (s(RS))T)) W, avg(sum) Y,sum(Z) X >Z 8 8 b b s a b 6 a b b 0 0 q 6 b b 6 6 b b q q TQ = Split (s(UT)) TQ = Split (s(RS)) R*={(8,b)}, S*={(b,0),(b,6)}, T*={(b,q)} 1 U,T W=q 2 R,S X >Z Y=b Recursive Tracing Procedure R X Y a 3 8 b Y sum U s a a 2 Y Z S V W avg b 6 a 2 a p 4 b 0 Y W T q 6 b 9 a p b 6 b p b q
Making It Efficient • Source accesses are usually expensive or impossible • Need some intermediate results for lineage tracing • Store auxiliary views at the warehouse • Reduce or eliminate source accesses • Reduce recomputation of intermediate results
a(p(s(R1 …Rn))) Auxiliary Views • There are many possible auxiliary views • For single-segment views • Identified 10 possible auxiliary view schemes • Studied performance tradeoffs • For arbitrary views • Hard optimization problem • Exhaustive and heuristic algorithms • Performance study
Single Segment Schemes • Store nothing (NO) • Store Base Tables (BT) • Store Lineage Views (LV) • Store Split Lineage Tables (SLT) • Store Partial Base Tables (PBT) • Store Base Table Projections (BP) • Store Lineage View Projections (LP) • Self-maintainable variations: LV-S, SLT-S, PBT-S
Auxiliary Views: Performance Tradeoffs • Always improve lineage tracing • Must be maintained when sources change • Can also help with maintenance of original user views
Auxiliary View Schemes for Single-Segment Views • Parameters: • 3-way SPJ view • sources: 10MB each • disk: 1Mbps • network: 50kbps • 1000 operations • q/u ratio = 4 • Measurements: • tracing time • maintenance time
Data Warehouse T6 T5 T4 T2 T1 T3 Source 1 Source 2 Source 3 Part 2: Transformation Graphs • Lineage definition • Tracing algorithms • Combining transformations for lineage tracing • Experimental results (tiny sample)
Transformation Example id name price valid id cust date prod-list 1 imac 1200 10/1/98- 1 A 2/8/99 1(10),2(10) 2 vaio 2400 6/1/98-9/1/99 2 C 4/5/99 2(5),3(10) 2 C 4/5/99 2(5),3(10) name avg3 Q4 2 vaio 1800 9/2/99- 3 D 6/1/99 1(20),2(10) palm2K 6K 4 B 8/6/991(10),3(5) 3 palm 500 2/1/98-7/1/98 4 B 8/6/991(10),3(5) 3 palm 400 7/2/98-9/1/99 5 D 10/8/99 1(5),3(10) 5 D 10/8/99 1(5),3(10) 6 B 12/1/99 2(10),3(10) 3 palm 300 9/2/99- 6 B 12/1/99 2(10),3(10) “join” pivot projection projection selection selection split 3 palm 400 7/2/98-9/1/99 3 palm 300 9/2/99- Order T1 T3 T4 T5 T6 T7 SalesJump Product T2 palm 2K 6K
? Lineage for General Transformations • A transformation can be an arbitrary program T • select … from … where … • main(int argc, char** argv) {…} • sed “s/string1/string2/g” … • One extreme: relational operators • Another extreme: we know nothing about T • Middle ground: based on transformation properties
Transformation Properties • Transformation classes • Additional properties • Transformation subclasses • Schema information • Provided inverse or tracing procedure
dispatcher I: T(I) = T({i}) i I T*(o) ={i | oT({i})} Transformation Classes Produces 0 or more output items per input item Applying T on complete set is the same as on each input item separately
id cust date prod-list id cust date pid quant 1 A 2/8/99 1 10 1 A 2/8/99 1(10),2(10) 1 A 2/8/99 2 10 2 C 4/5/99 2(5),3(10) : : : 3 D 6/1/99 1(20),2(10) 5 D 10/8/991 5 4 B 8/6/991(10),3(5) 5 D 10/8/99 3 10 5 D 10/8/99 1(5),3(10) 6 B 12/1/99 2 10 6 B 12/1/99 2(10),3(10) 6 B 12/1/99 3 10 5 D 10/8/991 5 5 D 10/8/99 3 10 5 D 10/8/99 1(5),3(10) Dispatcher Example O1 Order T1 5 D 10/8/99 3 10 5 D 10/8/99 1(5),3(10) A non-relational operator, but a typical dispatcher
dispatcher aggregator I: T(I) = T({i}) I and T(I)={o1…on}: unique partition I1..In of I s.t. T(Ik) = {ok} i I T*(o) ={i | oT({i})} T*(ok) = Ik Transformation Classes
name Q1 Q2 Q3 Q4 imac 12K 24K 12K 6K vaio 24K 12K 24K 18K palm0K 4K 2K 6K 2 palm 4/5/99 400 10 2 palm 4/5/99 400 10 3 imac 6/1/991200 20 3 vaio 6/1/99 2400 10 T4 4 imac 8/6/99 1200 10 4 palm 8/6/99 400 5 4 palm 8/6/99 400 5 4 palm 8/6/99 400 5 palm0K 4K 2K 6K 5 imac 10/8/991200 5 5 palm 10/8/99 300 10 5 palm 10/8/99 300 10 5 palm 10/8/99 300 10 6 vaio 12/1/99 1800 10 6 palm 12/1/99 300 10 6 palm 12/1/99 300 10 6 palm 12/1/99 300 10 Aggregator Example O3 oid name date price quant 1 imac 2/8/99 1200 10 1 vaio 2/8/99 2400 10 2 vaio 4/5/99 2400 5 O4 2 palm 4/5/99 400 10 palm0K 4K 2K 6K T4 computes quarterly sales per product by ”pivoting” Again, a non-relational operator, but a typical aggregator
dispatcher aggregator black-box I: T(I) = T({i}) I and T(I)={o1…on}: unique partition I1..In of I s.t. T(Ik) = {ok} All others i I T*(o) = I T*(o) ={i | oT({i})} T*(ok) = Ik Transformation Classes
Transformation Classes • Most transformations are dispatchers, aggregators, or their compositions • A transformation can be both dispatcher and aggregator • Proof: Lineage definitions are then equivalent • Transformations can be relational operators • Lineage definitions same as relational definitions
Transformation Properties • Transformation classes • Additional properties • Transformation subclasses • Schema information • Provided inverse or tracing procedure
Transformation Subclasses • Permit more efficient lineage tracing • Filter is a special dispatcher • Each input data item produces itself or nothing • Context-free aggregator • Whether two input data items are in the same partition is independent of other items • Key-preserving aggregator • Any subset of an input partition always produces the same output key
Tracing Example: Aggregators • Consider T(I) = {o1…on} • Tracing the lineage of o for aggregator • Partition input I into I1…In such that T(Ik) = {ok} • Return Ik such that T(Ik) = {o} • Tracing the lineage of o for context-free aggregator • Partition input I into I1…In such that |T(Ik)| = 1 • Return Ik such that T(Ik) = {o} • 2^n versus n^2 running time !
Schema Information • Input schema A=(A1…An) and key Akey • Output schema B=(B1…Bn) and key Bkey • Schema mappings: f(A) B and A g(B) • Transformations with special schema mappings • Forward key-map: f(A) Bkey • Backward key-map: Akey g(B) • Backward total-map: A g(B) • More efficient tracing for these
name Q1 Q2 Q3 Q4 imac 12K 24K 12K 6K vaio 24K 12K 24K 18K palm0K 4K 2K 6K 2 palm 4/5/99 400 10 T4 4 palm 8/6/99 400 5 5 palm 10/8/99 300 10 6 palm 12/1/99 300 10 Tracing Example: Forward Key-Maps O3 O4 oid name date price quant 1 imac 2/8/99 1200 10 1 vaio 2/8/99 2400 10 2 vaio 4/5/99 2400 5 palm0K 4K 2K 6K 2 palm 4/5/99 400 10 3 imac 6/1/991200 20 3 vaio 6/1/99 2400 10 4 imac 8/6/99 1200 10 4 palm 8/6/99 400 5 5 imac 10/8/991200 5 5 palm 10/8/99 300 10 6 vaio 12/1/99 1800 10 6 palm 12/1/99 300 10 ”name” is carried over as key - trace of ”palm” is easy : the O3 tuples with name = ’palm’
Other Properties • Transformation author provides Tracing Procedure • Provided Transformation Inverse T –1 • If T is an aggregator, then o’s lineage is T –1({o}) • Not always true for dispatchers or black-boxes
black-box aggregator dispatcher provided tracing-proc. or inverse Property Hierarchy ANY context-free aggr. key-preserving aggr. forward key-map backward key-map total-map filter
Summary of Our Approach for One Transformation • Properties are provided with transformations • Specified by the transformation author • Declared in prepackaged transformations • Derived using recent techniques [Clio01, RB01] • The best property of a transformation is selected based on the hierarchy • The tracing procedure using the best property is called at tracing time • Indexing techniques
T2 T3 Tn O I T1 Transformation Sequences • Naive algorithm traces backwards one transformation at a time • Need all intermediate results • Poor performance for long sequences
T2 T3 Tn O I T1 T’ Tn O I Transformation Sequences • Combine transformations and trace as one • Reduces number of intermediate results • By combining judiciously • Reduces tracing cost • Doesn’t lose accuracy