1 / 39

Data Provenance in ETL Scenarios

Data Provenance in ETL Scenarios. Panos Vassiliadis University of Ioannina (joint work with Alkis Simitsis, IBM Almaden Research Center, Timos Sellis and Dimitrios Skoutas, NTUA & ICCS). Outline. Introduction Conceptual Level Logical Level Physical Level Provenance &ETL. Outline.

tobit
Download Presentation

Data Provenance in ETL Scenarios

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Provenance in ETL Scenarios Panos Vassiliadis University of Ioannina (joint work with Alkis Simitsis, IBM Almaden Research Center, Timos Sellis and Dimitrios Skoutas, NTUA & ICCS)

  2. Outline • Introduction • Conceptual Level • Logical Level • Physical Level • Provenance &ETL PrOPr 2007

  3. Outline • Introduction • Conceptual Level • Logical Level • Physical Level • Provenance &ETL PrOPr 2007

  4. Data Warehouse Environment PrOPr 2007

  5. Extract-Transform-Load (ETL) PrOPr 2007

  6. ETL: importance • ETL and Data Cleaning tools cost • 30% of effort and expenses in the budget of the DW • 55% of the total costs of DW runtime • 80% of the development time in a DW project • ETL market: a multi-million market • IBM paid $1.1 billion dollars for Ascential • ETL tools in the market • software packages • in-house development • No standard, no common model • most vendors implement a core set of operators and provide GUI to create a data flow PrOPr 2007

  7. Fundamental research question • Now: currently, ETL designers work directly at the physical level (typically, via libraries of physical-level templates) • Challenge: can we design ETL flows as declaratively as possible? • Detail independence: • no care for the algorithmic choices • no care about the order of the transformations • (hopefully) no care for the details of the inter-attribute mappings PrOPr 2007

  8. Now: DW Physical templates Involved data stores + Physical scenario Engine PrOPr 2007

  9. Vision: DW Schema mappings ETL tool DW Conceptual to logical mapping Conceptual to logical mapper Physical templates Involved data stores + Logical templates Logical scenario Optimizer Physical scenario Physical templates Physical scenario Engine Engine PrOPr 2007

  10. Detail independence DW Schema mappings ETL tool Automate (as much as possible) Conceptual: the details of the inter-attribute mappings Logical: the order of the transformations Physical: the algorithmic choices Conceptual to logical mapping Conceptual to logical mapper Logical templates Logical scenario Optimizer Physical templates Physical scenario Engine PrOPr 2007

  11. Outline • Introduction • Conceptual Level • Logical Level • Physical Level • Provenance &ETL PrOPr 2007

  12. Conceptual Model: first attempts PrOPr 2007

  13. Conceptual Model: The Data Mapping Diagram • Extension of UML to handle inter-attribute mappings PrOPr 2007

  14. Conceptual Model: The Data Mapping Diagram • Aggregating computes the quarterly sales for each product. PrOPr 2007

  15. Conceptual Model: Skoutas’ annotations • Application vocabulary • Datastore mappings • Datastore annotation PrOPr 2007

  16. The class hierarchy Definition for class DS1_Products Conceptual Model: Skoutas’ annotations PrOPr 2007

  17. Outline • Introduction • Conceptual Level • Logical Level • Physical Level • Provenance &ETL PrOPr 2007

  18. Logical Model DSA DS.PS1.PKEY, LOOKUP_PS.SKEY, SOURCE DS.PSNEW2 DS.PSNEW2.PKEY, DS.PSOLD2.PKEY QTY,COST SOURCE COST DATE DS.PS2 AddAttr2 γ DIFF2 A2EDate SK2 $2€ DS.PSOLD2 rejected rejected rejected rejected Log Log Log Log DS.PS2.PKEY, LOOKUP_PS.SKEY, SOURCE DS.PSNEW1 DS.PSNEW1.PKEY, DS.PSOLD1.PKEY COST DATE=SYSDATE PKEY,DATE DS.PS1 U PK NotNULL AddDate SK1 DIFF1 DS.PSOLD1 rejected rejected rejected Log Log Log PKEY, DAY MIN(COST) S2.PARTS DW.PARTS FTP2 V1 Aggregate1 PKEY, MONTH AVG(COST) DW.PARTSUPP.DATE, DAY S1.PARTS TIME FTP1 V2  Aggregate2 Sources DW PrOPr 2007

  19. Logical Model • Main question: What information should we put inside a metadatarepository to be able to answer questions like: • what is the architecture of my DW back stage? • which attributes/tables are involved in the population of an attribute? • what part of the scenario is affected if we delete an attribute? PrOPr 2007

  20. Architecture Graph DSA DS.PS1.PKEY, LOOKUP_PS.SKEY, SOURCE DS.PSNEW2 DS.PSNEW2.PKEY, DS.PSOLD2.PKEY QTY,COST COST DATE SOURCE DS.PS2 AddAttr2 γ A2EDate $2€ DIFF2 SK2 DS.PSOLD2 rejected rejected rejected rejected Log Log Log Log DS.PS2.PKEY, LOOKUP_PS.SKEY, SOURCE DS.PSNEW1 DS.PSNEW1.PKEY, DS.PSOLD1.PKEY COST DATE=SYSDATE PKEY,DATE DS.PS1 U PK NotNULL AddDate SK1 DIFF1 DS.PSOLD1 rejected rejected rejected Log Log Log PKEY, DAY MIN(COST) S2.PARTS DW.PARTS FTP2 V1 Aggregate1 PKEY, MONTH AVG(COST) DW.PARTSUPP.DATE, DAY S1.PARTS TIME FTP1 V2  Aggregate2 Sources DW PrOPr 2007

  21. Architecture Graph Example 2 PrOPr 2007

  22. Architecture Graph Example 2 PrOPr 2007

  23. Optimization • Execution order… which is the proper execution order? PrOPr 2007

  24. Optimization • Execution order… order equivalence? SK,f1,f2 orSK,f2,f1 or ... ? PrOPr 2007

  25. Logical Optimization • Can we push selection early enough? • Can we aggregate before $2€ takes place? PrOPr 2007

  26. Outline • Introduction • Conceptual Level • Logical Level • Physical Level • Provenance &ETL PrOPr 2007

  27. “identify the best possible physical implementation for a given logical ETL workflow” Logical to Physical DW Schema mappings ETL tool Conceptual to logical mapper Conceptual to logical mapping Logical templates Logical scenario Optimizer Physical templates Physical scenario Engine PrOPr 2007

  28. Problem formulation • Given a logical-level ETL workflow GL • Compute a physical-level ETL workflow GP • Such that • the semantics of the workflow do not change • all constraints are met • the cost is minimal PrOPr 2007

  29. Solution • We model the problem of finding the physical implementation of an ETL process as a state-space search problem. • States. A state is a graph GPthat represents a physical-level ETL workflow. • The initial state G0Pis produced after the random assignment of physical implementations to logical activities w.r.t. preconditions and constraints. • Transitions. Given a state GP, a new state GP’is generated by replacing the implementation of a physical activity aPof GPwith another valid implementation for the same activity. • Extension: introduction of asorter activity (at the physical-level)as a new node inthe graph. • Sorter introduction • Intentionally introduce sorters to reduce execution & resumption costs PrOPr 2007

  30. Sorters: impact • We intentionally introduce orderings, (via appropriate physical-level sorteractivities) towards obtaining physical plans of lower cost. • Semantics: unaffected • Price to pay: • cost of sorting the stream of processed data • Gain: • it is possible to employ order-aware algorithms that significantly reduce processing cost • It is possible to amortize the cost over activities that utilize common useful orderings PrOPr 2007

  31. Sorter gains Cost(G) = 100.000+10.000 +3*[5.000*log2(5.000)+5.000] = 309.316 Ifsorter SA,B is addedtoV: Cost(G’) = 100.000+10.000 +2*5.000+[5.000*log2(5.000)+5.000] = 247.877 • Without order • cost(σi) = n • costSO(γ) = n*log2(n)+n • With appropriateorder • cost(σi) = seli * n • costSO(γ) = n PrOPr 2007

  32. Interesting orders A asc A desc {A,B, [A,B]} PrOPr 2007

  33. Outline • Introduction • Conceptual Level • Logical Level • Physical Level • Provenance &ETL PrOPr 2007

  34. A principled architecture for ETL DW Schema mappings ETL tool WHY Conceptual to logical mapper Conceptual to logical mapping Logical templates Logical scenario WHAT Optimizer Physical templates Physical scenario HOW Engine PrOPr 2007

  35. Logical Model: Questions revisited What information should we put inside a metadata repository to be able to answer questions like: • what is the architecture of my DW back stage? • it is described as the Architecture Graph • which attributes/tables are involved in the population of an attribute? • what part of the scenario is affected if we delete an attribute? • follow the appropriate pathin the Architecture Graph PrOPr 2007

  36. Fundamental questions on provenance & ETL • Why do we have a certain record in the DW? • Because there is a process (described by the Architecture Graph at the logical level + the conceptual model) that produces this kind of tuples • Where did this record come from in my DW? • Hard! If there is a way to derive an “inverse” workflow that links the DW tuples to their sources you can answer it. • Not always possible: transformations are not invertible, and a DW is supposed to progressively summarize data… • Widom’s work on record lineage… PrOPr 2007

  37. Fundamental questions on provenance & ETL • How are updates to the sources managed? • (update takes place at the source, DW+data marts must be updated) • Done, although in a tedious way: log sniffing, mainly. Also, “diff” comparison of extracted snapshots • When errors are discovered during the ETL process, how are they handled? • (update takes place at the data staging area, sources must be updated) • Too hard to “back-fuse” data into the sources, both for political and workload issues. Currently, this is not automated. PrOPr 2007

  38. Fundamental questions on provenance & ETL • What happens if there are updates to the schema of the involved data sources? • Currently this is not automated, although the automation of the task is part of the detail independence vision • What happens if we must update the workflow structure and semantics? • Nothing is versioned back – still, not really any user requests for this to be supported • What is the equivalent of citations in ETL? • … nothing really … PrOPr 2007

  39. Thank you! PrOPr 2007

More Related