Learn about the Provenance Framework in Kepler, which records and tracks provenance data for different computational models and supports applications such as recreating results, recovering workflows, and debugging and explaining results.
Provenance Framework in Kepler • Norbert Podhorszki, Ilkay Altintas • Contributors: S. Bowers, B. Ludäscher, T. McPhillips (UC Davis), O. Barney (U Utah), E. Jäger-Frank (SDSC)
Outline • Provenance? What is it? • Framework in Kepler to record provenance data • RWS: A provenance model suitable for Kepler's different computational models. • Possible Applications of Provenance
What to track and why • Do we need some tracking of what is happening? • Recreate results and rebuild workflows using the evolution information (see repeatable experiments) • Associate the workflow with the results it produced • Create links between generated data in different runs, and compare different runs • Recover from a system failure • Checkpoint a workflow • Debug and explain results (via lineage tracing, …) • Smart Reruns • Avoid re-generating the same data all the time
Model of Provenance • Core feature • capture the processing history (trace) leading to a data product • Model of Computation (MoC) • Well-defined in terms of input/output relations and the (partial) order of actions • MoC(Program, Input) → Output • DAG, SDF, DDF, PN, etc. • Different ways of specification • see Ptolemy-related papers, the Kahn-MacQueen paper, etc. • give abstract/high-level pseudo code • In practice it is defined through the implementation of the execution system (including the scheduling); in Kepler/Ptolemy this is the Director • There are legal (possible) runs under a given MoC
Model of Provenance • Model of Provenance (MoP) • The starting point is a MoC and its particular implementation • Observables: e.g. a single fired(x, A, y), or reads, writes and actions separately • Trace: recorded assertions (about observable events) during a legal run • A MoP is a MoC, except that “legal run” is replaced with “legal trace” • There is a default MoP for a MoC: the total trace of all observable events • Turing machine: moves of the head, data read and written • A MoP may add other information or omit some (“T = R − I + M”) • Trace = Run − Ignored things + Modelled additional things • M: Add real timestamps of actions, execution host information • I: Omit the input for each action if this can be inferred unambiguously later (DAG) • What to add or omit depends on the application of the trace
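The "T = R − I + M" rule can be illustrated with a small sketch. This is not Kepler code: the `Event` type, the `make_trace` function and the annotation callback are hypothetical names chosen for the example, and the run is assumed to be a simple list of observable events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One observable event of a run (hypothetical structure)."""
    kind: str    # e.g. "r" (read) or "w" (write)
    actor: str
    token: str


def make_trace(run, ignored_kinds, annotate):
    """Trace = Run - Ignored things + Modelled additional things.

    run           -- iterable of Event objects (R)
    ignored_kinds -- event kinds the MoP chooses to omit (I)
    annotate      -- callback returning extra modelled info per event (M)
    """
    trace = []
    for ev in run:
        if ev.kind in ignored_kinds:
            continue  # I: omitted because it can be inferred later
        trace.append((ev, annotate(ev)))  # M: e.g. timestamp, host
    return trace
```

For a DAG workflow, for example, `ignored_kinds={"r"}` would drop the reads, since the inputs of each action can be inferred from the graph, while `annotate` could attach a timestamp or the execution host.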
MoP Examples • DAG workflow • Record: Output data generated by the actions • Inference: Execution of actions and inputs to them can be inferred from the DAG itself • Smart-rerun • Record: Output of an action and the parameters for that action should be recorded • Inference: If an action’s parameter is not changed and actions on which this action depends (inferred from the workflow graph) are also unchanged, the action’s output will be the same in a future run. • Kitchen definition • A MoP is “good” if it can handle the intended questions & use cases.
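The smart-rerun inference can be sketched as a short recursive check over the workflow graph. The function name `stale_actors` and the dictionary representation of the DAG are assumptions made for this illustration, not part of Kepler.

```python
def stale_actors(deps, changed_params):
    """Return the actors whose output must be regenerated.

    deps           -- dict: actor -> list of upstream actors (a DAG);
                      every actor is assumed to appear as a key
    changed_params -- set of actors whose parameters were changed
    An actor is stale if its own parameters changed or if any actor
    it depends on is stale; all other outputs can be reused.
    """
    memo = {}

    def stale(actor):
        if actor not in memo:
            memo[actor] = actor in changed_params or any(
                stale(up) for up in deps.get(actor, ())
            )
        return memo[actor]

    return {actor for actor in deps if stale(actor)}
```

With `deps = {"load": [], "filter": ["load"], "plot": ["filter"], "stats": ["load"]}` and a changed parameter on `filter`, only `filter` and `plot` need to run again; the recorded outputs of `load` and `stats` can be reused.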
MoP Examples • Kepler: Streaming actors • Stateful actors • An output depends on all inputs in the past, e.g. AddSubtract • Stateless actors • An output depends only on inputs read in the current firing, e.g. Expression, RecordAssembler • Non-conformist actors • Filter, Running average, Daily average (some of the past inputs) • How do you determine correctly which inputs a given output depends on?
MoP Examples • Kepler: Data dependent routing (branches and loops) • The firing history of the actors cannot be inferred from the static workflow graph • Something should be recorded (e.g. firings)
RWS: Read − Write − State-reset • [Figure: reads (r … r) and writes (w … w) of actor A along a time axis, grouped into firings, with a state-reset s!] • Reads and writes alone: r, r … r, w, w, … w, r, … r, w, … w, … ??? • What about actor state? What about “real” dependencies? • State-reset event s defines when the actor “cuts off” dependencies • a semantic notion, known to the actor [developer] (or part of a higher-order scheme) • r, r … r, w, w, … w, s!, r, … r, w, … w, … • Reference: IPAW’06, Bowers et al.
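How a state-reset "cuts off" dependencies can be shown with a minimal sketch. It assumes a single actor's trace given as a list of `(event, token)` pairs and takes each written token to depend on every token read since the last `s`; the function name and data layout are hypothetical.

```python
def dependencies(trace):
    """Map each written token to the tokens it may depend on.

    trace -- list of (event, token) pairs for one actor, where event
             is "r" (read), "w" (write) or "s" (state-reset; its token
             is ignored).
    """
    deps = {}
    window = []  # tokens read since the last state-reset
    for event, token in trace:
        if event == "r":
            window.append(token)
        elif event == "w":
            deps[token] = list(window)
        elif event == "s":
            window.clear()  # later outputs no longer depend on earlier reads
        else:
            raise ValueError(f"unknown event: {event!r}")
    return deps
```

Without any `s` event the window never empties, which corresponds to the stateful case where an output depends on all inputs in the past.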
RWS trace of some actors • Stateless actor (r+ w+ s)*: r … r w … w s r … r w … w s … • Stateful actor (r+ w+)* • Simple filter actor (condition depends only on the current token) (r w? s)*: either it emits a token or not • Daily average of hourly measurements ((r w)^24 s)* • Generally: an RWS firing is defined in terms of r and w events • r+ w+ defines one RWS firing (most Kepler actors behave similarly) • More general: definition of the RWS firing round • (r+ w+)* s: dependencies among several firings • …
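Because these trace shapes are regular expressions over the alphabet {r, w, s}, a recorded event sequence can be checked against them directly. The sketch below uses Python's `re` module; the pattern names and the `conforms` helper are assumptions made for illustration.

```python
import re

# Trace patterns from the slide, written without spaces.
PATTERNS = {
    "stateless": r"(r+w+s)*",
    "stateful": r"(r+w+)*",
    "filter": r"(rw?s)*",
    "daily_average": r"((rw){24}s)*",
}


def conforms(kind, events):
    """Check whether a sequence of 'r'/'w'/'s' events matches a pattern."""
    return re.fullmatch(PATTERNS[kind], "".join(events)) is not None
```

For example, `conforms("stateless", "rrwsrws")` holds, while `conforms("stateless", "rrw")` does not because the firing is never closed by a state-reset.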
Provenance Framework in Kepler • Modeled as a separate concern in the system • Optional drag and drop feature • Listen to execution and save information (customizable): • Context: who, what, where, and when that is associated with the run • Input data and its associated metadata • Workflow outputs and intermediate data products • Workflow definition (entities, parameters, connections): a specification of what exists in the workflow and can have a context of its own • Information about the workflow evolution -- workflow trail
Kepler System Architecture • [Figure: architecture diagram showing Authentication GUI, …Kepler GUI Extensions…, Vergil, Documentation, Provenance Recorder, Smart Re-run / Failure Recovery, SMS, Kepler Object Manager, Type System Ext, Actor&Data SEARCH, Kepler Core Extensions, Ptolemy] • Source: IPAW’06, Altintas et al.
Kepler Provenance Recorder (IPAW’06, Altintas et al.) • Parametric and customizable • Different report formats • Variable levels of verbosity • all, some, medium, on error • Multiple cache destinations • Saves information on • User name, Date, Run, etc.
Implementation details • The Provenance Recorder • Extends the Ptolemy AbstractSettableAttribute • Listens to the Director for • Changes in the workflow graph • Initialization, workflow execution and stop • Actor firing • Listens to all IOPorts for • Token emissions on output ports to record output data • That is, we could say it is a Ptolemy Provenance Framework
Implementation details • Builds an internal representation of the workflow graph • Ptolemy’s DirectedGraph • Nodes: IOPorts, Edges: port connections • Used for • Recording workflow structure (dependencies among ports) • Subscribing at all ports (listening for input/output)
Implementation of RWS in Kepler • Data model, i.e. the observables in all MoC implementations in Kepler • Port-actor relationship • portTable(Port, Actor, type) • type is “a” for atomic and “c” for composite actors (transparent) • Token-object relationship • tokenTable(Token, Object) • Object-value relationship • objectTable(Object, Value, Type) • Type is currently not recorded • RWS trace • traceTable(Port, Event, Token, FiringCounter) • Event: r for read, w for write or s for state-reset
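The four relations can be pictured as ordinary database tables. The sketch below is only an illustration using Python's built-in `sqlite3` module: the slides do not say how Kepler stores these relations, and the column names, the in-memory database and the `writer_of` query are assumptions.

```python
import sqlite3

# Table names follow the slide; column names are illustrative.
SCHEMA = """
CREATE TABLE portTable   (port TEXT PRIMARY KEY, actor TEXT, actor_type TEXT);
CREATE TABLE tokenTable  (token TEXT PRIMARY KEY, object_id TEXT);
CREATE TABLE objectTable (object_id TEXT PRIMARY KEY, value TEXT, value_type TEXT);
CREATE TABLE traceTable  (port TEXT, event TEXT, token TEXT, firing INTEGER);
"""


def open_store():
    """Create an empty in-memory provenance store."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def writer_of(db, token):
    """Return the actor that wrote a token, or None if no write is recorded."""
    row = db.execute(
        "SELECT p.actor FROM traceTable t "
        "JOIN portTable p ON p.port = t.port "
        "WHERE t.token = ? AND t.event = 'w'",
        (token,),
    ).fetchone()
    return row[0] if row else None
```

Joining `traceTable` with `portTable` in this way is one route to lineage questions such as "which actor produced this token?".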
Extending the framework • Initialization (initialize()) • Framework traverses the workflow graph (ports and connections) • RWS: generates specific data structures (port, actor and connection details) • Just before start (validate()) • Framework registers its event listeners • RWS: registers an additional listener for TokenGetEvent
Extending the framework • When workflow is modified (changeExecuted()) • Framework traverses the workflow graph (ports and connections) • RWS: re-generate data structures • During execution when an event occurs • TokenSendEvent() and TokenGetEvent() listeners are extended to generate RWS trace events
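The listener extension can be pictured with a language-neutral sketch. The class below is not the Ptolemy or Kepler API: `RWSRecorder` and its method names are hypothetical stand-ins for the token-get and token-send callbacks, and the rows mirror traceTable(Port, Event, Token, FiringCounter). Where the state-reset event s is emitted is not specified on these slides, so it is left out.

```python
class RWSRecorder:
    """Collects RWS trace rows from (hypothetical) execution callbacks."""

    def __init__(self):
        self.trace = []   # rows of (port, event, token, firing_counter)
        self.firing = {}  # actor -> current firing counter

    def actor_fired(self, actor):
        """Called when an actor starts a new firing."""
        self.firing[actor] = self.firing.get(actor, 0) + 1

    def token_got(self, actor, port, token):
        """Called when an actor reads a token from an input port."""
        self.trace.append((port, "r", token, self.firing.get(actor, 0)))

    def token_sent(self, actor, port, token):
        """Called when an actor writes a token to an output port."""
        self.trace.append((port, "w", token, self.firing.get(actor, 0)))
```

A single firing of an actor `A` that reads `t1` and writes `t2` would then leave two rows in `trace`, both tagged with firing counter 1.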
Possible applications of Provenance • Smart-rerun • Monitoring/debugging of a workflow • see LiDAR poster today by Efrat Jäger-Frank • Answering questions about processing history and data • Participated in the First Provenance Challenge with Kepler-RWS • http://twiki.ipaw.info/bin/view/Challenge/RWS • Reporting/documentation of workflows and data products • “Generate my publication”
Acknowledgement • RWS model • Shawn Bowers and Timothy McPhillips, UC Davis • Formalization of the MoPs • Bertram Ludäscher, UC Davis • Kepler Provenance Framework implementation • Oscar Barney, Univ. of Utah, Salt Lake City • Efrat Jäger-Frank, SDSC, San Diego
References • RWS model • S. Bowers, T. McPhillips, B. Ludäscher, S. Cohen and S. B. Davidson. A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows. Intl. Provenance and Annotation Workshop (IPAW), Chicago, 2006 • B. Ludäscher, N. Podhorszki, I. Altintas, S. Bowers, T. McPhillips. From Computation Models to Models of Provenance and the RWS Model. To appear in 2007 in Journal of Concurrency and Computation: Practice and Experience • Provenance framework • I. Altintas, O. Barney, E. Jäger-Frank. Provenance Collection Support in the Kepler Scientific Workflow System. Intl. Provenance and Annotation Workshop (IPAW), Chicago, 2006