
Manish Kumar Anand, maanand@ucdavis.edu. Eighth Biennial Ptolemy Miniconference, Berkeley, California

A Provenance Framework to Capture, Store, Query, and Browse Data Lineage in Kepler. Manish Kumar Anand maanand@ucdavis.edu Eighth Biennial Ptolemy Miniconference Berkeley, California. Scientific workflow system. Scientific Workflows. Discoveries achieved via complex computations


Presentation Transcript


  1. A Provenance Framework to Capture, Store, Query, and Browse Data Lineage in Kepler Manish Kumar Anand maanand@ucdavis.edu Eighth Biennial Ptolemy Miniconference Berkeley, California

  2. Scientific Workflows • Discoveries achieved via complex computations • Workflows replacing traditional scripting approaches • Enable automation, reproducibility, sharing, provenance [Figure: a Perl script shown alongside a scientific workflow system]

  3. Provenance • A record of processes, inputs/outputs, dependencies • Supports reproducibility, interpretation, verification [Figure: workflow execution graph from inputs to outputs through the stages AlignWarp, Reslice, Softmean, Slicer, Convert]

  4. Outline • Capturing Provenance • Storing Provenance • Querying Provenance • Browsing Provenance

  5. Conventional Provenance Models • Assumptions: - Data is atomic - Invocations consume all inputs and produce new outputs - Every output depends on all inputs • Records: inputs/outputs of invocations, e.g. input(a, s1), output(a, s2), input(b, s2), input(c, s2), … • Infers: data dependency and invocation dependency [Figure: workflow execution graph with its data dependency and invocation dependency views]
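The inference step on this slide can be sketched in a few lines of Python. This is a minimal illustration under the slide's stated assumptions (atomic data, every output depends on all inputs of its invocation), not the Kepler implementation. The function name `infer_dependencies` and the tuple layout are hypothetical; only the four recorded facts come from the slide.

```python
from collections import defaultdict

# Recorded facts in the slide's notation: input(a, s1) means invocation a read s1.
inputs = [("a", "s1"), ("b", "s2"), ("c", "s2")]
outputs = [("a", "s2")]


def infer_dependencies(inputs, outputs):
    """Infer data and invocation dependencies under the conventional assumptions."""
    read_by = defaultdict(set)  # invocation -> data items it consumed
    for invocation, item in inputs:
        read_by[invocation].add(item)
    # data item -> invocation that produced it
    produced_by = {item: invocation for invocation, item in outputs}

    # Every output depends on all inputs of the invocation that produced it.
    data_deps = {(out, src) for inv, out in outputs for src in read_by[inv]}
    # An invocation depends on the invocation that produced one of its inputs.
    inv_deps = {
        (inv, produced_by[item]) for inv, item in inputs if item in produced_by
    }
    return data_deps, inv_deps


data_deps, inv_deps = infer_dependencies(inputs, outputs)
```

Here `data_deps` pairs each output with the data it was derived from, and `inv_deps` pairs each invocation with its upstream invocation. Because the model treats every output as depending on every input, it cannot express pass-through data or partial dependencies.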

  6. Challenges in Modeling Provenance Many scientific workflow systems also support: • Both data “transformers” and “pass-through” • Processes with different dependency patterns • Structured data (XML) • Models of provenance must consider these factors [Figure: three example dependency patterns, panels (a), (b), and (c), over data items s1 to s5 and invocation a]

  7. Unified Provenance Model

  8. Efficient Provenance Representation • Instead of storing each version • Only store a single combined version • Along with a set of updates (Δ's) • Updates and dependencies represented as annotations • Condensed: Δa = {ins(5,a), dep(5,2), del(3,a)} • Expanded: Δa = {ins(5,a), dep(5,2), del(3,a), ins(6,a), dep(5,3), dep(5,4), dep(6,2), dep(6,3), dep(6,4)} [Figure: a tree over nodes 1 to 6 annotated with +a and -a, shown in condensed and expanded form]

  9. Expanding and Condensing Traces [Figure: the same tree over nodes 1 to 6 with +a and -a annotations, shown in expanded and condensed form]

  10. Trace Views • Expanded trace to condensed trace: using a postorder (i.e., bottom-up, left-to-right) traversal, remove annotations from a node n: (i) dep(n,c) if dep(n,p) and child(p,c) (ii) dep(n,d) if child(p,n) and dep(p,d) (iii) ins(n,x) if child(p,n) and ins(p,x) (iv) del(n,y) if child(p,n) and del(p,y); also remove invocation order annotations implied according to rules (3-8) • Condensed trace to expanded trace: uses three distinct preorder (i.e., top-down, left-to-right) traversals - Pass 1: rules (1-2) and rules (3-5): infers insertion and deletion annotations, and infers invocation order from nodes and parent-child relationships - Pass 2: rules (6-8): infers remaining invocation precedence relationships - Pass 3: rules (9-10): expands dependency sets and propagates dependencies to child nodes [Figure: nested Images collections S1 to S6 holding AnatomyImage, ReslicedImage, and AtlasImage items with Image, Header, RefImage, WarpParamSet, AtlasGraphic, and AtlasSlice children, processed by alignwarp, reslicewarp, softmean, slicer, and convert]
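Rules (i) to (iv) can be tried out on the small example from the "Efficient Provenance Representation" slide. The sketch below is an illustration only: it computes the closure of the rules directly rather than using the postorder and preorder traversals described on the slide, and it ignores the invocation order rules, which the transcript does not spell out. The tree shape in `children` is an assumption read off the example annotations, and the function names are hypothetical.

```python
# Assumed tree shape (parent -> children) for the six-node example.
children = {1: [2, 5], 2: [3, 4], 5: [6]}
parent = {c: p for p, cs in children.items() for c in cs}


def descendants(node):
    """All nodes strictly below `node` in the tree."""
    found = []
    for child in children.get(node, []):
        found.append(child)
        found.extend(descendants(child))
    return found


def expand(annotations):
    """Add every annotation implied by rules (i)-(iv)."""
    expanded = set(annotations)
    for kind, node, target in annotations:
        for n in [node] + descendants(node):  # rules (ii), (iii), (iv)
            if kind == "dep":
                for t in [target] + descendants(target):  # rule (i)
                    expanded.add(("dep", n, t))
            else:
                expanded.add((kind, n, target))
    return expanded


def condense(annotations):
    """Drop every annotation that rules (i)-(iv) can re-derive."""
    full = expand(annotations)

    def implied(kind, node, target):
        p = parent.get(node)
        if p is not None and (kind, p, target) in full:
            return True  # inherited from the parent node
        if kind == "dep":
            q = parent.get(target)
            return q is not None and ("dep", node, q) in full
        return False

    return {a for a in full if not implied(*a)}


condensed = {("ins", 5, "a"), ("dep", 5, 2), ("del", 3, "a")}
expanded = expand(condensed)
```

With the assumed tree, `expanded` holds the nine annotations listed for the expanded form, and `condense(expanded)` returns the three annotations of the condensed form.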

  11. Outline • Capturing Provenance • Storing Provenance • Querying Provenance • Browsing Provenance

  12. Storage Strategies • Store immediate and transitive dependencies • Faster query execution • Reduction techniques • Represent dependencies in reduced form • Use a standard relational DBMS and minimize storage size, update time, and query time

  13. Storage Strategies • 5 storage strategies • NC: Naive Collapsed • NE: Naive Expanded • SE: Simple Expanded • RE: Reduced Expanded • RC: Reduced Collapsed • Compare: storage size, update time, query time [Figure: NC is labeled with the collapsed trace, NE with the expanded trace, and SE with the expanded trace and transitive dependencies; reduction algorithms lead to RE (reduced expanded trace, transitive dependencies) and RC (reduced collapsed trace, transitive dependencies)]

  14. Analysis of Storage Strategies • SE: Worst storage size and update time • RC: Very expensive query time • RE: Recommended storage strategy [Figure: three plots over traces comparing the strategies: update time (s), storage size (thousands of cells), and query time (s)]

  15. Outline • Capturing Provenance • Storing Provenance • Querying Provenance • Browsing Provenance

  16. Querying Provenance can be Expensive • Queries are often recursive • Complex to formulate • Expensive to evaluate • Standard querying approaches • Tied to storage representation • Query language expertise • Need to query across structures, lineage, or both • (Q) Select lineage path that derived all children of AtlasImage created by slicer • How to express provenance queries easily and execute them efficiently? [Figure: lineage and structures of the example trace: nested Images collections S1 to S6 with AnatomyImage, ReslicedImage, and AtlasImage items, processed by alignwarp:1, reslicewarp:1, softmean:1, slicer:1, and convert:1]

  17. To Express this Query … SQL (e.g., transitive dependencies) SQL (stored procedures)
      create procedure depc(in runId_in varchar(255), in nodeId_in Integer)
      begin
        DECLARE finished integer default 0;
        …
        declare cur_1 cursor for
          select depNodeId from dependency
          where runId=runId_in and itemNodeId=nodeId_tmp;
        set nodeId_tmp = nodeId_in;
        set depCnt = (select count(*) from dependency
                      where runId=runId_in and itemNodeId=nodeId_tmp);
        if (depCnt is not null) then
          open cur_1;
          get_cur_1: loop
            fetch cur_1 into depNodeId_tmp;
            if finished then leave get_cur_1; end if;
            insert into depcT (nodeId) values(depNodeId_tmp);
          end LOOP get_cur_1;
          close cur_1;
          set cnt = 1;
          while (cnt <= depCnt) do
            set nodeId_tmp = (select nodeId from depcT where no=cnt);
            set row_limit = (select count(*) from dependency
                             where itemnodeId=nodeId_tmp and runId=runId_in);
            set row_cnt =0;
            open cur_1;
            get_cur_1: loop
              fetch cur_1 into depNodeId_tmp;
              set flag = (select 1 from depcT where nodeId = depNodeId_tmp);
              if (flag is null) then
                insert into depcT (nodeId) values(depNodeId_tmp);
              end if;
              if (row_cnt > row_limit) then leave get_cur_1; end if;
              set row_cnt = row_cnt + 1;
              … …

      select t.runId, t2.nodeId, t.nodeId as depNodeId
      from (
        select d1.runId, d1.pDep, d1.nodeId
        from dependency d1
        where runId=runId_in
        union
        select p1.runId, p1.fromPointer as pDep, d2.nodeId
        from dependency d2, depSubsetPointer p1
        where p1.runId=runId_in and d2.runId=runId_in and d2.pDep=p1.toPointer
      ) as t, depMinMaxPointer p2, (
        select t.runId, r1.nodeId, t.pDep
        from (
          select dc1.runId, dc1.pDepC, dc1.pDep
          from depCdepPointer dc1
          where runId=runId_in
          union
          select p1.runId, p1.fromPointer as pDepC, dc2.pDep
          from depCdepPointer dc2, depCSubsetPointer p1
          where p1.runId=runId_in and dc2.runId=runId_in and dc2.pDepC=p1.toPointer
        ) as t, depCMinMaxPointer p2, runCollData r1, runItemProv rp1
        where p2.runId = runId_in and r1.runId=runId_in and rp1.runId=runId_in
          and r1.nodeId=nodeId_in and r1.pointer=rp1.pointer
          and rp1.pDep = p2.fromPointer and t.pDepC=p2.toPointer
          and t.pDep BETWEEN p2.depMin AND p2.depMax
        union
        … …
      • Hard for domain scientists (… and SQL experts) • Optimization depends on SQL engine [He et al. SIGMOD 08] • Need for higher-level provenance query language
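For comparison, the recursion that the stored procedure above spells out with cursors and a temporary table is, at its core, a worklist traversal over immediate dependencies. The Python sketch below shows that idea in isolation. It is not a translation of the procedure: it assumes the dependencies fit in an in-memory mapping rather than the `dependency` table, and the node names are hypothetical.

```python
from collections import deque


def transitive_dependencies(dependency, node_id):
    """Return every node reachable from node_id via immediate dependencies."""
    seen = set()
    queue = deque([node_id])
    while queue:
        current = queue.popleft()
        for dep in dependency.get(current, ()):
            if dep not in seen:  # visit each dependency only once
                seen.add(dep)
                queue.append(dep)
    return seen


# Hypothetical immediate dependencies: item -> items it directly depends on.
dependency = {"d": ["c"], "c": ["a", "b"], "b": ["a"]}
closure = transitive_dependencies(dependency, "d")
```

The traversal itself is short; the difficulty the slide points to lies in expressing it in SQL against a particular storage schema.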

  18. QLP Constructs First Provenance Challenge Queries Formulated in QLP

  19. Querying Multiple Dimensions • (Q) Select lineage path that derived all children of AtlasImage created by slicer 1. Obtain structures from @in and @out version operators 2. Apply XPath expressions to structure 3. Apply lineage queries to each resulting node • Q in QLP: * derived //AtlasImage/* @out slicer [Figure: the lineage and structure dimensions of the example trace; the structure @out slicer is Images (1) containing AtlasImage (15) containing AtlasSlice (18), so //AtlasImage/* selects node 18 and the lineage query becomes * derived 18]
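The three evaluation steps can be mimicked with Python's standard library to show how the structural and lineage parts of the query compose. This is a rough sketch, not QLP or its engine. The XML document mirrors the structure the slide shows for `@out slicer`; the lineage edges in `derived_from` and the node names `h1` to `h3` are hypothetical, since the transcript does not preserve the actual lineage graph.

```python
import xml.etree.ElementTree as ET

# Step 1: the structure output by slicer (the "@out slicer" version).
out_slicer = ET.fromstring(
    '<Images id="1"><AtlasImage id="15"><AtlasSlice id="18"/></AtlasImage></Images>'
)

# Hypothetical immediate lineage edges: node -> nodes it was derived from.
derived_from = {"18": ["h1", "h2"], "h1": ["h3"]}


def lineage(node_id):
    """Collect all (derived, source) edges on paths leading to node_id."""
    edges, stack = set(), [node_id]
    while stack:
        current = stack.pop()
        for source in derived_from.get(current, []):
            if (current, source) not in edges:
                edges.add((current, source))
                stack.append(source)
    return edges


# Step 2: apply the structural expression //AtlasImage/* to that version.
selected = [node.get("id") for node in out_slicer.findall(".//AtlasImage/*")]

# Step 3: apply the lineage query to each selected node.
result = {node_id: lineage(node_id) for node_id in selected}
```

The structural step narrows the trace to node 18, and only then is the recursive lineage step evaluated, once per selected node.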

  20. Outline • Capturing Provenance • Storing Provenance • Querying Provenance • Browsing Provenance

  21. Provenance Browser • Browse different views of a trace • Data dependencies, collection structure, actor invocations • Move “forward” and “backward” through execution

  22. Collection History • Collection and invocation view • Incrementally step through execution history

  23. Conclusion • Capture • Supports nested data collections, explicit data dependency, update semantics • Storage • Reduce update time, storage size and query time • Query • A high-level provenance query language (QLP) • Query structures with lineage graphs • Formulate queries easily and concisely • Browse/Visualize • Provenance Browser, a visualization tool to view and navigate across provenance views

  24. References • M. K. Anand, S. Bowers, T. McPhillips, B. Ludäscher. Exploring Scientific Workflow Provenance using Hybrid Queries over Nested Data and Lineage Graphs. SSDBM 2009 • M. K. Anand, S. Bowers, T. McPhillips, B. Ludäscher. Efficient Provenance Storage over Nested Data Collections. EDBT 2009 • S. Bowers, T. McPhillips, S. Riddle, M. K. Anand, B. Ludäscher. Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life. IPAW 2008
