REDUX – automatic capture, efficient storage
Roger S. Barga, Microsoft Research (MSR)
Luciano Digiampietri, University of Campinas, São Paulo, Brazil
Considerations
What information needs to be captured?
- Which version of BLAST did I use?
- What codes (activities) did I invoke to get this result, and what were the parameters?
- What data transformations did I use to get this result?
- What machine was used to perform the alignment?
- Were any steps skipped in this experiment, or were any shims inserted?
- Did the experiment design differ between these two results? If so, where?
- Are there any branches in the workflow that have not been explored?
Additional Issues to Consider…
- Result of a provenance query is an executable workflow
- It may not be possible to rerun an experiment, either to validate or to recreate a result, because the original workflow is lost (activities have been updated)
- Allow the user to control what is shared/exposed – one size doesn't fit all
- Provenance storage costs can quickly grow out of hand…
Implementation
- Extended the enactment engine of WinOE to automatically capture the steps during execution that lead to a result
  - Provenance capture is automatic and transparent
- A multilayer model for representing result provenance
  - Abstract Workflow
  - Service Instantiation
  - Data Instantiation
  - Runtime
- Store provenance in an RDBMS (SQL Server), utilizing previous traces to significantly reduce storage costs
  - Current query interface is SQL; eventually a forms-based interface
- Version and lock the executables
  - Updating any activity changes the workflow version number, resulting in a new version
  - The user can rerun an experiment by invoking the workflow through the fully specified reference found in the provenance record
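The versioning rule above (any activity update yields a new workflow version, and old versions stay addressable for reruns) can be sketched as follows. This is a minimal illustration, not the WinOE implementation: `Activity`, `WorkflowRegistry`, and the byte strings standing in for locked executables are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """Hypothetical stand-in for a versioned, locked executable."""
    name: str
    executable: bytes


class WorkflowRegistry:
    """Keeps every workflow version so that old runs stay re-runnable."""

    def __init__(self):
        self._versions = {}  # (workflow name, version) -> tuple of Activity

    def latest_version(self, name):
        return max((v for (n, v) in self._versions if n == name), default=0)

    def register(self, name, activities):
        """Return the version number; bump it only if an activity changed."""
        activities = tuple(activities)
        latest = self.latest_version(name)
        if latest and self._versions[(name, latest)] == activities:
            return latest
        self._versions[(name, latest + 1)] = activities
        return latest + 1

    def resolve(self, name, version):
        """Look up the exact activities named by a (name, version) reference."""
        return self._versions[(name, version)]


registry = WorkflowRegistry()
v1 = registry.register("align", [Activity("blast", b"build A")])
v2 = registry.register("align", [Activity("blast", b"build B")])  # updated activity
v2_again = registry.register("align", [Activity("blast", b"build B")])  # unchanged
```

A provenance record that stores the pair `("align", v1)` can then resolve the original activities even after the workflow has moved on to `v2`.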
Provenance Queries – Query 1
Provenance queries 1, 4, 5, 7, 8 and 9
Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed, etc.
Returns ExecutableWorkflowId (process), ExecutionId (id of the specific execution of the process), EventId (event where the data was produced) and ExecutableWorkflow_ExecutableActivityId (activity that produced the data) of the processes that generated the Atlas X Graphic.
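A lineage query of this kind can be sketched with a recursive SQL query that walks backward from the final data product. The sketch below uses SQLite (through Python's `sqlite3` module) in place of SQL Server, and the two tables `produced` and `consumed`, their columns, and all activity and data names are assumptions made for illustration; they are not the REDUX schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema: one row per event that produced a data item,
-- and one row per data item an event consumed.
CREATE TABLE produced (event_id INTEGER PRIMARY KEY, activity TEXT, output TEXT);
CREATE TABLE consumed (event_id INTEGER, input TEXT);
INSERT INTO produced VALUES
  (1, 'warp',    'warped_image'),
  (2, 'average', 'averaged_atlas'),
  (3, 'slice',   'atlas_x_slice'),
  (4, 'convert', 'atlas_x_graphic');
INSERT INTO consumed VALUES
  (1, 'new_brain_image'),
  (2, 'warped_image'),
  (3, 'averaged_atlas'),
  (4, 'atlas_x_slice');
""")

# Start at the event that produced the target, then repeatedly follow
# each consumed input back to the event that produced it.
LINEAGE = """
WITH RECURSIVE lineage(event_id, activity, output) AS (
    SELECT event_id, activity, output FROM produced WHERE output = ?
    UNION
    SELECT p.event_id, p.activity, p.output
    FROM lineage AS l
    JOIN consumed AS c ON c.event_id = l.event_id
    JOIN produced AS p ON p.output = c.input
)
SELECT event_id, activity, output FROM lineage ORDER BY event_id
"""

rows = conn.execute(LINEAGE, ("atlas_x_graphic",)).fetchall()
for event_id, activity, output in rows:
    print(event_id, activity, output)
```

Inputs with no producing event, such as `new_brain_image` here, end the walk; they are the raw inputs of the experiment.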
Provenance Queries – Query 7a
A user has run the workflow twice, in the second instance replacing each procedure (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs. The exact level of detail in the difference that is detected by a system is up to each participant.
Our layered model allows the detection of differences in several ways.
Provenance Queries – Query 7b
The Workflow Model layer captures information about the instances of the activities and the links among the ports (or activity interfaces). At this layer, our model supports provenance queries such as: which activities are used by the second workflow (Workflow 2) but not by the first (Workflow 1)?
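At the workflow-model layer, the comparison for Query 7b reduces to a set difference over activity names. The sketch below is only an illustration of that idea: `earlier_stage` is a placeholder for the unchanged part of the workflow, and the function name is hypothetical.

```python
def activities_only_in(second, first):
    """Activities that appear in `second` but not in `first`, sorted by name."""
    return sorted(set(second) - set(first))


# "earlier_stage" is a placeholder for the stages both runs have in common.
workflow_1 = ["earlier_stage", "convert"]
workflow_2 = ["earlier_stage", "pgmtoppm", "pnmtojpeg"]

added = activities_only_in(workflow_2, workflow_1)
removed = activities_only_in(workflow_1, workflow_2)
print("only in workflow 2:", added)
print("only in workflow 1:", removed)
```

Running the difference in both directions shows the replacement as a pair: `convert` dropped, `pgmtoppm` and `pnmtojpeg` added.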
Provenance Queries – Query 7c
The Runtime level contains information about the execution of the workflow (produced data, timestamps, activities invoked, etc.). Here the model allows queries about produced data, data flow (see Q2 and Q3), date/time, etc. One example query that illustrates the difference between two workflows at this level: what data was produced by the second workflow (workflow 2) that was not produced by the first (workflow 1)?
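Since the query interface is SQL, the runtime-level comparison can be sketched as an `EXCEPT` between the data produced by two executions. The example uses SQLite rather than SQL Server, and the `runtime_event` table, its columns, and the file names are assumptions introduced for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical runtime-level table: what each execution produced.
CREATE TABLE runtime_event (execution_id INTEGER, activity TEXT, data_name TEXT);
INSERT INTO runtime_event VALUES
  (1, 'earlier_stage', 'atlas_x.pgm'),
  (1, 'convert',       'atlas_x.gif'),
  (2, 'earlier_stage', 'atlas_x.pgm'),
  (2, 'pgmtoppm',      'atlas_x.ppm'),
  (2, 'pnmtojpeg',     'atlas_x.jpg');
""")

# Data names produced by the second execution, minus those of the first.
ONLY_IN_SECOND = """
SELECT data_name FROM runtime_event WHERE execution_id = :second
EXCEPT
SELECT data_name FROM runtime_event WHERE execution_id = :first
ORDER BY data_name
"""

new_data = [row[0] for row in
            conn.execute(ONLY_IN_SECOND, {"first": 1, "second": 2})]
print(new_data)
```

Comparing by name is a simplification; a real store might compare content hashes or typed data identifiers instead.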
Efficiently Storing Provenance Data
For Provenance Query 7, the two workflows share more than 99% of the provenance data (space) and 46% of the database tuples.
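One way to picture how two traces can share tuples is a store that keeps each distinct record once and lets every trace reference it. The sketch below uses content hashing for this; it is an assumed technique chosen for illustration, not the delta/edit mechanism REDUX uses, and `DedupStore` and its record fields are hypothetical. It does not attempt to reproduce the percentages quoted above.

```python
import hashlib
import json


class DedupStore:
    """Stores each distinct provenance record once; traces hold references."""

    def __init__(self):
        self.tuples = {}  # digest -> record
        self.traces = {}  # run id -> list of digests

    def add_trace(self, run_id, records):
        """Store a trace and return how many records were actually new."""
        new, digests = 0, []
        for record in records:
            payload = json.dumps(record, sort_keys=True).encode()
            key = hashlib.sha256(payload).hexdigest()
            if key not in self.tuples:
                self.tuples[key] = record
                new += 1
            digests.append(key)
        self.traces[run_id] = digests
        return new

    def shared_fraction(self, run_a, run_b):
        """Fraction of the distinct tuples of both runs that they share."""
        a, b = set(self.traces[run_a]), set(self.traces[run_b])
        return len(a & b) / len(a | b)


store = DedupStore()
common = [{"activity": "earlier_stage", "step": 1},
          {"activity": "earlier_stage", "step": 2}]
new_1 = store.add_trace("run1", common + [{"activity": "convert", "step": 3}])
new_2 = store.add_trace("run2", common + [{"activity": "pgmtoppm", "step": 3},
                                          {"activity": "pnmtojpeg", "step": 4}])
print(new_1, new_2, len(store.tuples), store.shared_fraction("run1", "run2"))
```

The second trace adds only the records that differ from the first, which is the effect the slide describes: the cost of a rerun is proportional to what changed, not to the size of the trace.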
To Sum Up…
- Extended Windows Workflow Foundation
  - Transparently capture the execution trace leading to a result
- A layered provenance model
- Relational database (SQL Server) as provenance store
  - Store provenance as delta/edit over existing traces
- Initial query facility built over this provenance data
- Unique aspects of our system
  - Result of a provenance query is an executable workflow
  - Coupled code versioning to provenance collection
- An open (and interesting) data management challenge