1 / 47

Noget helt andet…

Noget helt andet…. Platon vil gerne være vært (i Århus) for et BIT møde i efteråret SOA eller MDM Fint for mig, men hvad siger i ? Platon inviterer alle til www.bi2006.dk 7-8 juni Special pris for BIT medlemmer: 2995 kr. Tilmelding via Jørgen Davidsen, jda@platon.net.

verena
Download Presentation

Noget helt andet…

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Noget helt andet… • Platon vil gerne være vært (i Århus) for et BIT møde i efteråret • SOA eller MDM • Fint for mig, men hvad siger i ? • Platon inviterer alle til www.bi2006.dk • 7-8 juni • Special pris for BIT medlemmer: 2995 kr. • Tilmelding via Jørgen Davidsen, jda@platon.net

  2. Lineage Tracing in DataWarehouses Torben Bach Pedersen Based on work by Yingwei Cui and Jennifer Widom Stanford University Database Group

  3. Motivation: Data Warehousing Data Warehouse Wow?! Lucrative Fields Theory $320K Databases $8800K Databases $8800K Networks $800K Courses Enrollments Students Source 1 Source 2 Source 3

  4. Data Warehouse Lucrative Fields Oh, I see... Theory $320K Databases $8800K Database 1800 Networks $800K Lineage Tracer Courses Enrollments Students CS154 Theory CS154 Joe Ann BS$1K CS145 Databases CS145 Ted Bob MS $1K CS244 Networks CS244 Bob Jane Web $5K CS245 Databases CS145 Ann Joe BS$1K CS245 Jane Ted Web $5K … … … … … Source 1 Source 2 Source 3

  5. The Data Lineage Problem • Data warehouses integrate data from multiple sourcesfor analysis and mining • Data lineage: given data item o in the warehouse, which data items in the sources were used to derive o? • Sometimes called “drill-through” in industry • “Drill-through” often limited

  6. Challenges • Warehouse of relational views over relational sources • What is a good formal definition for lineage? • How do we trace data lineage for arbitrary views? • How do we make it efficient? • Warehouse defined by graph of data transformations • No fixed, well-defined relational operators • Large transformation sequences and graphs

  7. Outline of Talk • Part 1: Lineage tracing for relational views • Part 2: Lineage tracing for general data transformations

  8. Part 1: Lineage Tracing for Relational Views • Declarative definition of data lineage • Lineage tracing algorithms • Using auxiliary views for efficient lineage tracing • Experimental results (small sample)

  9. Views We Consider • Relational algebra • Arbitrary use of aggregation • Set semantics • Also in thesis • Set operators • Bag semantics s, p, V a a p a s , ,  R S T

  10. 8 b 8 b 0 8 b 0 b 6 8 b 6 8 b 6 b 0 b 6 Simple Lineage Example V = a (s(RS)) Y,sum(Z) X >Z X Y R a 3 8 b T X Y Z U X Y Z V Y sum a 2 s 3 a 3 a 2 a 2 8 b 0 X >Z Y,sum(Z) Y Z S 8 b 0 b 6 8 b 6 8 b 9 a 2 8 b 6 b 0 b 9 select Y,sum(Z) from R natural join S where X>Z group by Y b 6

  11. Unary relational operators definition took a long time Lineage for Relational Operators (s, p, a) R R* op t • Lineage oftaccording to op is the maximal subset R* R such that • (1) op(R*) = {t} - output of R* through op is t • (2) t* R*: op({t*})  - op used on t* is nonempty

  12. 8 b 6 8 b 6 Lineage for Relational Operators • Example 1 – the two conditions ensure that only tuples contributing to t are included in lineage R X Y Z X Y Z a 2 3 a 2 3 s 8 b 0 8 b 0 X >Z 8 b 6 8 b 9 8 b 6 • Lineage oftaccording to op is the maximal subset R* R such that • (1) op(R*) = {t} • (2) t* R*: op({t*}) 

  13. 8 b 0 b 6 8 b 6 Lineage for Relational Operators • Example 2 –”maximal” requirement ensures that (8,b,0) tuple in included in (b,6) lineage X Y Z R Y sum a 2 a 3 a 2 Y,sum(Z) 8 b 0 b 6 8 b 6 • Lineage oftaccording to op is the maximal subset R* R such that • (1) op(R*) = {t} • (2) t* R*: op({t*}) 

  14. * R1 op * R2 Lineage for Relational Operators • N-ary relational operators ( ,,) – lineage unique R1 R2 • Lineage oftaccording to op is the maximal subsets Ri* Ri for i = 1..n such that • (1) op(R1*, …, Rn*) = {t} • (2) ti* Ri*: op(R1, …, {ti*}, …, Rn) 

  15. * R1 R1 U V t op op 2 1 * R2 U* R2 Lineage for Relational Views • Lineage of a tuple set is union of lineage of each tuple in the set • Lineage for views is defined recursively => naive, but inefficient, algorithm (need to recompute/store all intermediate results) Lineage of t is R1*, R2*

  16. a(p(s(E1 …En))) • Each segment Lineage Tracing • Convert view into segmented normal form (SPJ+agg) • Generate one tracing query for each segment • Apply tracing queries recursively • # non-top a+ 1 • Proof: lineage result is unaffected by normalization and segment-level tracing

  17. V = a (s(RS)) Y,sum(Z) X >Z TQ =Split(s(RS)) R,S X >Z Y=b 8 b b 6 0 b 6 b R*={(8,b)}, S*={(b,0),(b,6)} Tracing Query for One Segment R X Y a 3 8 b V Y sum s a X >Z Y,sum(Z) a 2 Y Z S b 6 a 2 0 b b 9 6 b Split = ”unjoin” – project over R+S schemas

  18. V = a (a (s(RS))T)) W, avg(sum) Y,sum(Z) X >Z 8 8 b b s a b 6 a b b 0 0 q 6 b b 6 6 b b q q TQ = Split (s(UT)) TQ = Split (s(RS)) R*={(8,b)}, S*={(b,0),(b,6)}, T*={(b,q)} 1 U,T W=q 2 R,S X >Z  Y=b Recursive Tracing Procedure R X Y a 3 8 b Y sum U s a a 2 Y Z S V W avg b 6 a 2 a p 4 b 0 Y W T q 6 b 9 a p b 6 b p b q

  19. Making It Efficient • Source accesses are usually expensive or impossible • Need some intermediate results for lineage tracing • Store auxiliary views at the warehouse • Reduce or eliminate source accesses • Reduce recomputation of intermediate results

  20. Aux View Example

  21. Aux View Example

  22. a(p(s(R1 …Rn))) Auxiliary Views • There are many possible auxiliary views • For single-segment views • Identified 10 possible auxiliary view schemes • Studied performance tradeoffs • For arbitrary views • Hard optimization problem • Exhaustive and heuristic algorithms • Performance study

  23. Single Segment Schemes • Store nothing (NO) • Store Base Tables (BT) • Store Lineage Views (LV) • Store Split Lineage Tables (SLT) • Store Partial Base Tables (PBT) • Store Base Table Projections (BP) • Store Lineage View Projections (LP) • Self-maintainable variations: LV-S, SLT-S, PBT-S

  24. Auxiliary Views: Performance Tradeoffs • Always improve lineage tracing • Must be maintained when sources change • Can also help with maintenance of original user views

  25. Auxiliary View Schemes for Single-Segment Views • Parameters: • 3-way SPJ view • sources: 10MB each • disk: 1Mbps • network: 50kbps • 1000 operations • q/u ratio = 4 • Measurements: • tracing time • maintenance time

  26. Auxiliary View Selection Algorithms for Arbitrary Views

  27. Data Warehouse T6 T5 T4 T2 T1 T3 Source 1 Source 2 Source 3 Part 2: Transformation Graphs • Lineage definition • Tracing algorithms • Combining transformations for lineage tracing • Experimental results (tiny sample)

  28. Transformation Example id name price valid id cust date prod-list 1 imac 1200 10/1/98- 1 A 2/8/99 1(10),2(10) 2 vaio 2400 6/1/98-9/1/99 2 C 4/5/99 2(5),3(10) 2 C 4/5/99 2(5),3(10) name avg3 Q4 2 vaio 1800 9/2/99- 3 D 6/1/99 1(20),2(10) palm2K 6K 4 B 8/6/991(10),3(5) 3 palm 500 2/1/98-7/1/98 4 B 8/6/991(10),3(5) 3 palm 400 7/2/98-9/1/99 5 D 10/8/99 1(5),3(10) 5 D 10/8/99 1(5),3(10) 6 B 12/1/99 2(10),3(10) 3 palm 300 9/2/99- 6 B 12/1/99 2(10),3(10) “join” pivot projection projection selection selection split 3 palm 400 7/2/98-9/1/99 3 palm 300 9/2/99- Order T1 T3 T4 T5 T6 T7 SalesJump Product T2 palm 2K 6K

  29. ? Lineage for General Transformations • A transformation can be an arbitrary program T • select … from … where … • main(int argc, char** argv) {…} • sed “s/string1/string2/g” … • One extreme: relational operators • Another extreme: we know nothing about T • Middle ground: based on transformation properties

  30. Transformation Properties • Transformation classes • Additional properties • Transformation subclasses • Schema information • Provided inverse or tracing procedure

  31. dispatcher I: T(I) = T({i}) i  I T*(o) ={i | oT({i})} Transformation Classes Produces 0 or more output items per input item Applying T on complete set is the same as on each input item separately

  32. id cust date prod-list id cust date pid quant 1 A 2/8/99 1 10 1 A 2/8/99 1(10),2(10) 1 A 2/8/99 2 10 2 C 4/5/99 2(5),3(10) : : : 3 D 6/1/99 1(20),2(10) 5 D 10/8/991 5 4 B 8/6/991(10),3(5) 5 D 10/8/99 3 10 5 D 10/8/99 1(5),3(10) 6 B 12/1/99 2 10 6 B 12/1/99 2(10),3(10) 6 B 12/1/99 3 10 5 D 10/8/991 5 5 D 10/8/99 3 10 5 D 10/8/99 1(5),3(10) Dispatcher Example O1 Order T1 5 D 10/8/99 3 10 5 D 10/8/99 1(5),3(10) A non-relational operator, but a typical dispatcher

  33. dispatcher aggregator I: T(I) = T({i}) I and T(I)={o1…on}:  unique partition I1..In of I s.t. T(Ik) = {ok} i  I T*(o) ={i | oT({i})} T*(ok) = Ik Transformation Classes

  34. name Q1 Q2 Q3 Q4 imac 12K 24K 12K 6K vaio 24K 12K 24K 18K palm0K 4K 2K 6K 2 palm 4/5/99 400 10 2 palm 4/5/99 400 10 3 imac 6/1/991200 20 3 vaio 6/1/99 2400 10 T4 4 imac 8/6/99 1200 10 4 palm 8/6/99 400 5 4 palm 8/6/99 400 5 4 palm 8/6/99 400 5 palm0K 4K 2K 6K 5 imac 10/8/991200 5 5 palm 10/8/99 300 10 5 palm 10/8/99 300 10 5 palm 10/8/99 300 10 6 vaio 12/1/99 1800 10 6 palm 12/1/99 300 10 6 palm 12/1/99 300 10 6 palm 12/1/99 300 10 Aggregator Example O3 oid name date price quant 1 imac 2/8/99 1200 10 1 vaio 2/8/99 2400 10 2 vaio 4/5/99 2400 5 O4 2 palm 4/5/99 400 10 palm0K 4K 2K 6K T4 computes quarterly sales per product by ”pivoting” Again, a non-relational operator, but a typical aggregator

  35. dispatcher aggregator black-box I: T(I) = T({i}) I and T(I)={o1…on}:  unique partition I1..In of I s.t. T(Ik) = {ok} All others i  I T*(o) = I T*(o) ={i | oT({i})} T*(ok) = Ik Transformation Classes

  36. Transformation Classes • Most transformations are dispatchers, aggregators, or their compositions • A transformation can be both dispatcher and aggregator • Proof: Lineage definitions are then equivalent • Transformations can be relational operators • Lineage definitions same as relational definitions

  37. Transformation Properties • Transformation classes • Additional properties • Transformation subclasses • Schema information • Provided inverse or tracing procedure

  38. Transformation Subclasses • Permit more efficient lineage tracing • Filter is a special dispatcher • Each input data item produces itself or nothing • Context-free aggregator • Whether two input data items are in the same partition is independent of other items • Key-preserving aggregator • Any subset of an input partition always produces the same output key

  39. Tracing Example: Aggregators • Consider T(I) = {o1…on} • Tracing the lineage of o for aggregator • Partition input I into I1…In such that T(Ik) = {ok} • Return Ik such that T(Ik) = {o} • Tracing the lineage of o for context-free aggregator • Partition input I into I1…In such that |T(Ik)| = 1 • Return Ik such that T(Ik) = {o} • 2^n versus n^2 running time !

  40. Schema Information • Input schema A=(A1…An) and key Akey • Output schema B=(B1…Bn) and key Bkey • Schema mappings: f(A)  B and A  g(B) • Transformations with special schema mappings • Forward key-map: f(A)  Bkey • Backward key-map: Akey  g(B) • Backward total-map: A  g(B) • More efficient tracing for these

  41. name Q1 Q2 Q3 Q4 imac 12K 24K 12K 6K vaio 24K 12K 24K 18K palm0K 4K 2K 6K 2 palm 4/5/99 400 10 T4 4 palm 8/6/99 400 5 5 palm 10/8/99 300 10 6 palm 12/1/99 300 10 Tracing Example: Forward Key-Maps O3 O4 oid name date price quant 1 imac 2/8/99 1200 10 1 vaio 2/8/99 2400 10 2 vaio 4/5/99 2400 5 palm0K 4K 2K 6K 2 palm 4/5/99 400 10 3 imac 6/1/991200 20 3 vaio 6/1/99 2400 10 4 imac 8/6/99 1200 10 4 palm 8/6/99 400 5 5 imac 10/8/991200 5 5 palm 10/8/99 300 10 6 vaio 12/1/99 1800 10 6 palm 12/1/99 300 10 ”name” is carried over as key - trace of ”palm” is easy : the O3 tuples with name = ’palm’

  42. Other Properties • Transformation author provides Tracing Procedure • Provided Transformation Inverse T –1 • If T is an aggregator, then o’s lineage is T –1({o}) • Not always true for dispatchers or black-boxes

  43. Tracing Procedures

  44. black-box aggregator dispatcher provided tracing-proc. or inverse Property Hierarchy ANY context-free aggr. key-preserving aggr. forward key-map backward key-map total-map filter

  45. Summary of Our Approach for One Transformation • Properties are provided with transformations • Specified by the transformation author • Declared in prepackaged transformations • Derived using recent techniques [Clio01, RB01] • The best property of a transformation is selected based on the hierarchy • The tracing procedure using the best property is called at tracing time • Indexing techniques

  46. T2 T3 Tn O I T1 Transformation Sequences • Naive algorithm traces backwards one transformation at a time • Need all intermediate results • Poor performance for long sequences

  47. T2 T3 Tn O I T1 T’ Tn O I Transformation Sequences • Combine transformations and trace as one • Reduces number of intermediate results • By combining judiciously • Reduces tracing cost • Doesn’t lose accuracy

More Related