VisTrails Second Provenance Challenge Tommy Ellkvist David Koop Juliana Freire Joint work with: Erik Andersen, Steven P. Callahan, Emanuele Santos, Carlos E. Scheidegger, Cláudio Silva, and Huy T. Vo
Outline • VisTrails Introduction • VisTrails Demo • Provenance Model and API • Challenge Results • Issues and Future Work
VisTrails • Comprehensive provenance infrastructure for computational tasks • Support for exploratory tasks such as visualization and data mining • Workflows are iteratively refined as users generate and test hypotheses • New change-based provenance model • Uniformly captures data and workflow provenance
Change-based Provenance • Provenance is stored as a tree of actions (e.g., add module, add connection)
Provenance: Storing Actions • Each change writes new actions to the tree
<action id="27" prevId="26" user="dakoop" date="2007-06-20">
  <add what="module" objectId="12">
    <module id="12" name="vtkProperty" cache="1">
      <location id="17" x="-7.0" y="97.0"/>
    </module>
  </add>
  <add what="connection" objectId="13">
    <connection id="13">
      <port type="source" moduleId="10"/>
      <port type="destination" moduleId="12"/>
    </connection>
  </add>
</action>
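A minimal sketch of how a workflow version could be rebuilt from such an action tree: follow prevId links from the requested action back to the root, then replay the actions oldest-first. This is not the VisTrails implementation. The <vistrail> root element, the use of prevId="0" for the root, action 26, and the module name upstreamModule are all assumptions made for illustration, and only "add" actions are handled because those are the only ones shown on the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical log: action 27 follows the slide; the <vistrail> wrapper,
# action 26, and the module name "upstreamModule" are invented for this sketch.
LOG = """
<vistrail>
  <action id="26" prevId="0" user="dakoop" date="2007-06-20">
    <add what="module" objectId="10">
      <module id="10" name="upstreamModule" cache="1"/>
    </add>
  </action>
  <action id="27" prevId="26" user="dakoop" date="2007-06-20">
    <add what="module" objectId="12">
      <module id="12" name="vtkProperty" cache="1"/>
    </add>
    <add what="connection" objectId="13">
      <connection id="13">
        <port type="source" moduleId="10"/>
        <port type="destination" moduleId="12"/>
      </connection>
    </add>
  </action>
</vistrail>
"""


def materialize(log_xml, version_id):
    """Rebuild the workflow at version_id by replaying its chain of actions."""
    root = ET.fromstring(log_xml.strip())
    actions = {a.get("id"): a for a in root.findall("action")}

    # Walk from the requested version back towards the root of the tree.
    chain = []
    current = version_id
    while current in actions:
        chain.append(actions[current])
        current = actions[current].get("prevId")

    modules, connections = {}, {}
    # Replay oldest-first so later actions build on earlier ones.
    for action in reversed(chain):
        for add in action.findall("add"):
            if add.get("what") == "module":
                module = add.find("module")
                modules[module.get("id")] = module.get("name")
            elif add.get("what") == "connection":
                conn = add.find("connection")
                ports = {p.get("type"): p.get("moduleId")
                         for p in conn.findall("port")}
                connections[conn.get("id")] = (ports["source"],
                                               ports["destination"])
    return modules, connections


modules, connections = materialize(LOG, "27")
```

Asking for version "26" instead returns only the first module and no connections, which is the point of the change-based model: every node in the tree is a retrievable workflow version.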
Change-based Provenance • Data provenance: where does a specific data product come from? • Workflow evolution: how has workflow structure changed over time? • Treat workflow versions as data: store provenance of workflows
VisTrails Provenance • Normalized information: no redundancy! • Each layer provides more specific information but refers to parent layers • Layers: Workflow Evolution, Workflow, Execution • Extensible storage options • Support for both relational and XML • Flexible annotation framework: users can specify application-specific provenance information
Provenance for Reproducibility and Beyond • Infrastructure for querying and reusing provenance • Query workflows by example • Create workflows by analogy • Collaborative exploration • Scalable derivation of data products
Supporting Different Provenance Backends • VisTrails has powerful tools to query and reuse provenance information • There are many powerful workflow systems that produce such information • Problem: How to integrate different provenance backends? • Our approach: A mediation-based approach to provenance interoperability
Mediator Architecture Mapping from global schema to data source specific schema
Mediated Provenance Mapping from general model to engine-specific model
Combining Provenance • Establish model • Produce an API for this model • Wrap provenance access for each system so that queries become native over their provenance data
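The three steps above can be sketched as follows. This is only an illustration of the wrapper idea, not the code used for the challenge: the class names, the snake_case method name, and the routing of requests by an identifier prefix such as "vt3" or "myg1" are assumptions, and a dictionary stands in for each system's native store (where a real wrapper would issue XPath or SPARQL).

```python
from abc import ABC, abstractmethod


class ProvenanceWrapper(ABC):
    """One wrapper per system; turns an API call into a native query."""

    prefix = ""

    @abstractmethod
    def get_executed_modules(self, wf_exec_id):
        ...


class InMemoryWrapper(ProvenanceWrapper):
    """Stand-in backend: a dict lookup replaces the native query."""

    def __init__(self, prefix, executions):
        self.prefix = prefix
        self.executions = executions

    def get_executed_modules(self, wf_exec_id):
        # Qualify local ids with the system prefix so results from
        # different systems can be combined without clashes.
        return [f"{self.prefix}:{m}"
                for m in self.executions.get(wf_exec_id, [])]


class Mediator:
    """Routes each request to the wrapper that owns the identifier."""

    def __init__(self, wrappers):
        self.wrappers = {w.prefix: w for w in wrappers}

    def get_executed_modules(self, qualified_id):
        prefix, _, local_id = qualified_id.partition(":")
        return self.wrappers[prefix].get_executed_modules(local_id)


# Hypothetical execution ids and module ids, for illustration only.
mediator = Mediator([
    InMemoryWrapper("vt3", {"run1": ["0", "1"]}),
    InMemoryWrapper("myg1", {"runA": ["align_warp1"]}),
])
vt_modules = mediator.get_executed_modules("vt3:run1")
```

The caller never sees which backend answered; adding another system means adding one more wrapper, with the model and API left unchanged.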
Provenance Model • Follows the layered architecture • Versions map to workflows • Workflows are modeled as graphs • Parameters capture module state • User-defined annotations are available at each layer of the model • Module Definition stores information about the computational pieces
Provenance API • Implements common access queries and operations over the provenance model • Examples: getParent(module), getChildren(module), getUpstream(module), getDownstream(module), getAnnotations(module | workflow | …), getDataItems(module_exec), getParameters(module), getVersion(time), getExecutedModules(workflow), getConnection(data_item), getPorts(connection), findModulesByParameter(search_params), findModulesByAnnotation(search_params), findExecutionsByAnnotation(search_params), findVersionsByModules(search_params)
Provenance API Example: getExecutedModules(wf_exec)

VisTrails (XPath)
    def getExecutedModules(self, wf_exec):
        newdataitems = []
        q = '//exec[@id="' + wf_exec.pid.key + '"]/@moduleId'
        dataitems = self.logcontext.xpathEval(q)

Pasoa (XPath)
    def getExecutedModules(self, wf_exec):
        q = "//ps:relationshipPAssertion[ps:localPAssertionId='" + wf_exec.pid.key + "']/ps:relation"
        dataitems = self.context.xpathEval(q)

Taverna (SPARQL)
    def getExecutedModules(self, wf_exec):
        " "
        q = ''' SELECT ?mi FROM <''' + self.path + '''> WHERE { <''' + wf_exec.pid.key + '''> <http://www.mygrid.org.uk/provenance#runsProcess> ?mi } '''
        return self.processQueryAsList(q, pModuleInstance)
Provenance API Results • Implemented queries for each system and a combination of all three • Annotation issues for a couple of queries • Example: Query 1 Results
vt3:4 --> vt3:7
vt3:1 --> vt3:4
vt3:0 --> vt3:1
pas2:http://relation.org/softmean --> vt3:0
myg1:urn:www.mygrid.org.uk/process#reslice1 --> pas2:http://relation.org/softmean
myg1:urn:www.mygrid.org.uk/process#reslice2 --> pas2:http://relation.org/softmean
myg1:urn:www.mygrid.org.uk/process#reslice3 --> pas2:http://relation.org/softmean
myg1:urn:www.mygrid.org.uk/process#reslice4 --> pas2:http://relation.org/softmean
myg1:urn:www.mygrid.org.uk/process#align_warp1 --> myg1:urn:www.mygrid.org.uk/process#reslice1
myg1:urn:www.mygrid.org.uk/process#align_warp2 --> myg1:urn:www.mygrid.org.uk/process#reslice2
myg1:urn:www.mygrid.org.uk/process#align_warp3 --> myg1:urn:www.mygrid.org.uk/process#reslice3
myg1:urn:www.mygrid.org.uk/process#align_warp4 --> myg1:urn:www.mygrid.org.uk/process#reslice4
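To show how a combined result like this can be consumed, here is a small sketch that parses the Query 1 edge list and collects everything upstream of vt3:7, crossing the three systems. Reading "a --> b" as "a feeds b" is an assumption based on the order of the workflow stages, and the helper names (parse_edges, upstream) are made up for this example rather than taken from the Provenance API.

```python
from collections import Counter, defaultdict, deque

# Edge list copied from the Query 1 results.
RESULTS = """\
vt3:4 --> vt3:7
vt3:1 --> vt3:4
vt3:0 --> vt3:1
pas2:http://relation.org/softmean --> vt3:0
myg1:urn:www.mygrid.org.uk/process#reslice1 --> pas2:http://relation.org/softmean
myg1:urn:www.mygrid.org.uk/process#reslice2 --> pas2:http://relation.org/softmean
myg1:urn:www.mygrid.org.uk/process#reslice3 --> pas2:http://relation.org/softmean
myg1:urn:www.mygrid.org.uk/process#reslice4 --> pas2:http://relation.org/softmean
myg1:urn:www.mygrid.org.uk/process#align_warp1 --> myg1:urn:www.mygrid.org.uk/process#reslice1
myg1:urn:www.mygrid.org.uk/process#align_warp2 --> myg1:urn:www.mygrid.org.uk/process#reslice2
myg1:urn:www.mygrid.org.uk/process#align_warp3 --> myg1:urn:www.mygrid.org.uk/process#reslice3
myg1:urn:www.mygrid.org.uk/process#align_warp4 --> myg1:urn:www.mygrid.org.uk/process#reslice4
"""


def parse_edges(text):
    """Turn 'a --> b' lines into (a, b) pairs."""
    edges = []
    for line in text.splitlines():
        if "-->" in line:
            src, dst = (part.strip() for part in line.split("-->"))
            edges.append((src, dst))
    return edges


def upstream(edges, target):
    """Breadth-first walk over reversed edges: all ancestors of target."""
    parents = defaultdict(list)
    for src, dst in edges:
        parents[dst].append(src)
    seen, queue = set(), deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


edges = parse_edges(RESULTS)
ancestors = upstream(edges, "vt3:7")
# The text before the first ':' identifies the system that owns the node.
per_system = Counter(node.split(":", 1)[0] for node in ancestors)
```

Here per_system counts how many upstream steps each system contributed, which makes the cross-system hand-offs (myg1 to pas2 to vt3) visible in one pass.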
Provenance API Integration • Developed the VisTrails Provenance Query Language for the first challenge • Plan to integrate API with query language • Plan to integrate query language with VisTrails interfaces
Interoperability Issues • Uniquely identifying intermediate results • Intermediate file names were not specified and varied • Tracing ids is difficult for users; this should be transparent • A common query language should use concepts familiar to users • Mediator vs. Warehousing approach
Performance Issues • Redundant information can make queries inefficient • What is the best storage backend: RDBMS vs. XML database? • What is the best data model: XML vs. Relational vs. RDF? • Need good benchmarks: large data!
Mediated Provenance (diagram) • User queries go to the Prov API, which exposes the General Provenance Model • Mapping from generic provenance model into the models of different systems • One wrapper per system: Taverna, Pasoa, …
Mediator Architecture (diagram) • User SQL/ODBC queries go to the Mediator, which exposes the Global Schema • Mapping from global schema into source schemas • One wrapper per Data Source