Was Derived From UTPB: A Benchmark for Scientific Workflow Provenance Storage and Querying Systems Artem Chebotko Joint work with E. De Hoyos, C. Gomez, A. Kashlev, X. Lian, and C. Reilly Department of Computer Science University of Texas - Pan American 6th IEEE International Workshop on Scientific Workflows, June 24, 2012
Provenance in eScience • Metadata that captures history of an experiment • Problem diagnosis • Result interpretation • Experiment reproducibility • Scientific Workflow Community Provenance Challenges • 2006: understanding and sharing information about provenance representations and capabilities • 2006: interoperability of different provenance systems • 2009: evaluating various aspects of OPM • 2010: showcase OPM in the context of novel applications • Open Provenance Model • W3C Provenance Working Group UTPB – University of Texas Provenance Benchmark
SWFMS and Provenance • Support provenance collection • Use proprietary or third-party systems to manage provenance • Differ in provenance models, provenance vocabularies, inference support, and query languages • Taverna • Kepler • View • VisTrails • Pegasus • Swift • Galaxy • Triana • OPMProv • Karma • RDFProv • etc.
Provenance Management Requirements • Non-functional • Data storage and querying efficiency and scalability • Inference soundness and completeness • Functional • Support of a particular provenance model, provenance vocabulary, query type, inference feature, visualization and analysis • No standard way to evaluate provenance systems with respect to these requirements
Provenance System Benchmarking Challenges • Well-documented and easy-to-understand datasets • Provenance data in a range of sizes • Provenance data with predefined inferred results that are known to be correct and complete • Test queries • Performance metrics • Result interpretation • Existing empirical studies of provenance systems use ad-hoc benchmarks or benchmarks developed in other research domains (see the paper for details)
Our Contributions • University of Texas Provenance Benchmark (UTPB) • http://faculty.utpa.edu/chebotkoa/utpb/ • Focus on scalability and inference • Flexible data generator • 27 provenance templates • 3 virtual workflows • 3 workflow execution scenarios • 3 provenance vocabularies • 27 test queries in 11 categories • 5 performance metrics
Talk Outline • University of Texas Provenance Benchmark • UTPB Architecture • Provenance Templates • Provenance Generation • UTPB Queries • Performance Metrics • Interpretation of Benchmark Results • Summary and Future Work
UTPB Architecture [figure: UTPB architecture diagram]
Provenance Templates
Provenance Templates • A provenance template is a document that serializes provenance of one workflow execution according to a particular provenance model and a provenance vocabulary. • Provenance templates make the benchmark extensible and thus adaptable to the changing requirements of the field. • UTPB currently supports: • 1 provenance model (OPM) • 3 virtual workflows • 3 provenance vocabularies (OPMV, OPMO, OPMX) • 3 workflow execution scenarios • 1 x 3 x 3 x 3 = 27 provenance templates
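The count of 27 templates is the Cartesian product of the four dimensions listed above. A minimal sketch of that enumeration follows; the workflow names come from the virtual workflow slides, while the short scenario labels and the function name `template_keys` are illustrative assumptions, not identifiers used by the UTPB generator.

```python
from itertools import product

# Dimension values taken from the slides; the scenario labels are
# illustrative shorthand, not names used by the actual UTPB generator.
MODELS = ["OPM"]
WORKFLOWS = ["DatabaseExperiment", "JeansManufacturing", "FrenchPressCoffee"]
VOCABULARIES = ["OPMV", "OPMO", "OPMX"]
SCENARIOS = ["success", "error", "success_with_inferences"]


def template_keys():
    """Return one (model, workflow, vocabulary, scenario) tuple per template."""
    return list(product(MODELS, WORKFLOWS, VOCABULARIES, SCENARIOS))


templates = template_keys()
print(len(templates))  # 1 x 3 x 3 x 3 = 27
```

Adding a new vocabulary or workflow to one of the lists grows the template set multiplicatively, which is one way to read the extensibility claim on this slide.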
Virtual Workflow 1 • Database Experiment • Processes: 7 • Artifacts: 14 • Accounts: 2 • Agents: 1
Virtual Workflow 2 • Jeans Manufacturing • Processes: 13 • Artifacts: 18 • Accounts: 3 • Agents: 2 • Several processes use and generate the same artifacts and are “executed” in parallel
Virtual Workflow 3 • French Press Coffee • Processes: 15 • Artifacts: 15 • Accounts: 4 • Agents: 0 • Several branches with multiple processes are “executed” in parallel • Several processes trigger each other without the record of using or generating artifacts
Provenance Vocabularies • Almost every existing scientific workflow management system defines its own proprietary model for provenance • Each model is serialized in some format, such as RDF, XML, or relational data, according to one or more predefined vocabularies or schemas. • Open Provenance Model (OPM) – a layer of interoperability • OPM Vocabulary • OPM Ontology • OPM XML Schema
Workflow Execution Scenarios • successful execution • incomplete execution with an error • successful execution with materialized provenance inferences
Provenance Generation
Provenance Generation (OPMV)
# Named graph: http://cs.panam.edu/utpb#opmGraph_C0_T0
@prefix opmv: <http://purl.org/net/opmv/ns#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix utpb: <http://cs.panam.edu/utpb#> .
utpb:account_black_C0_T0 rdf:type <http://www.w3.org/2004/03/trix/rdfg-1/Graph> .
utpb:cuttingMachine_C0_T0 rdf:type opmv:Artifact .
utpb:denim_C0_T0 rdfs:label "blue" .
utpb:andrey_C0_T0 rdf:type opmv:Agent .
utpb:cutDenim_C0_T0 opmv:used utpb:cuttingMachine_C0_T0, utpb:cuttingPattern_C0_T0, utpb:denim_C0_T0 .
utpb:denimParts_C0_T0 opmv:wasGeneratedBy utpb:cutDenim_C0_T0 .
# Default graph
<http://cs.panam.edu/utpb#opmGraph_C0_T0> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
Provenance Generation (OPMO)
# Named graph: http://cs.panam.edu/utpb#opmGraph_C0_T0
@prefix opmo: <http://openprovenance.org/model/opmo#> .
@prefix opmv: <http://purl.org/net/opmv/ns#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix utpb: <http://cs.panam.edu/utpb#> .
utpb:account_black_C0_T0 rdf:type opmo:Account .
utpb:cuttingMachine_C0_T0 rdf:type opmv:Artifact .
utpb:propertyDenim_C0_T0 opmo:key utpb:keyDenimType_C0_T0 ;
    opmo:value "blue" .
utpb:andrey_C0_T0 rdf:type opmv:Agent .
utpb:used1_C0_T0 rdf:type opmo:Used ;
    opmo:effect utpb:cutDenim_C0_T0 ;
    opmo:cause utpb:cuttingMachine_C0_T0 ;
    opmo:role utpb:roleMachine_C0_T0 ;
    opmo:pname utpb:_used1 ;
    opmo:account utpb:account_black_C0_T0 .
utpb:wgb1_C0_T0 rdf:type opmo:WasGeneratedBy ;
    opmo:effect utpb:cutDenim_C0_T0 ;
    opmo:cause utpb:denimParts_C0_T0 ;
    opmo:role utpb:roleDenim_C0_T0 ;
    opmo:pname utpb:_wgb1 ;
    opmo:account utpb:account_black_C0_T0 .
# Default graph
<http://cs.panam.edu/utpb#opmGraph_C0_T0> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
Provenance Generation (OPMX)
<utpb xmlns="http://openprovenance.org/model/opmx#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <dictionary>
    <opmGraph id="opmGraph_C0_T0">
  </dictionary>
  <opmGraph id="opmGraph_C0_T0">
    <accounts>
      <account id="account_black"/>
    </accounts>
    <artifacts>
      <artifact id="cuttingMachine">
        <account ref="account_black"/>
        <annotation>
          <property key="value"> <value>laser</value></property>
          <property key="label"> <value>Cutting machine</value></property>
        </annotation>
      </artifact>
    </artifacts>
    <agents>
      <agent id="andrey"><account ref="account_black"/></agent>
    </agents>
    <dependencies>
      <used id="used1">
        <effect ref="cutDenim"/>
        <role id="roleMachine1" value="machine"/>
        <cause ref="cuttingMachine"/>
        <account ref="account_black"/>
      </used>
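To illustrate how the XML serialization can be consumed, the sketch below extracts `used` dependencies with Python's standard `xml.etree.ElementTree`. The slide excerpt is truncated, so the fragment here is a hand-completed, well-formed stand-in modeled on it; the real documents produced by UTPB may nest or name elements differently, and the function name `used_edges` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URI as shown in the slide excerpt.
NS = {"opmx": "http://openprovenance.org/model/opmx#"}

# Hand-completed fragment modeled on the slide (assumption: the actual
# UTPB output may be structured differently).
OPMX_FRAGMENT = """
<opmGraph xmlns="http://openprovenance.org/model/opmx#" id="opmGraph_C0_T0">
  <dependencies>
    <used id="used1">
      <effect ref="cutDenim"/>
      <role id="roleMachine1" value="machine"/>
      <cause ref="cuttingMachine"/>
      <account ref="account_black"/>
    </used>
  </dependencies>
</opmGraph>
"""


def used_edges(xml_text):
    """Return (process, role, artifact) triples for every <used> element."""
    root = ET.fromstring(xml_text)
    edges = []
    for used in root.findall("./opmx:dependencies/opmx:used", NS):
        effect = used.find("opmx:effect", NS).get("ref")
        cause = used.find("opmx:cause", NS).get("ref")
        role = used.find("opmx:role", NS).get("value")
        edges.append((effect, role, cause))
    return edges


print(used_edges(OPMX_FRAGMENT))
```

In a `used` edge the effect is the process and the cause is the artifact it consumed, which matches the OPMV statement that `cutDenim` used `cuttingMachine`.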
UTPB Queries
UTPB Queries • 27 Queries • 11 Categories • Graphs • Dependencies • Artifacts • Processes • Accounts • Agents • Roles • Values • Cross-Graph Queries • Inferences • Application-Specific
UTPB Queries (OPMV)
effectArtifact                causeArtifact
---------------------------------------------------------
utpb:denimParts_C0_T0         utpb:denim_C0_T0
utpb:rawJeans_C0_T0           utpb:denimParts_C0_T0
utpb:rawJeans_C0_T0           utpb:sewingThread_C0_T0
utpb:washedJeans_C0_T0        utpb:rawJeans_C0_T0
utpb:inspectedJeans_C0_T0     utpb:washedJeans_C0_T0
utpb:buttonedJeans_C0_T0      utpb:inspectedJeans_C0_T0
utpb:buttonedJeans_C0_T0      utpb:buttons_C0_T0
utpb:qualityJeans_C0_T0       utpb:buttonedJeans_C0_T0
utpb:jeans_C0_T0              utpb:qualityJeans_C0_T0
utpb:jeans_C0_T0              utpb:labels_C0_T0
utpb:inspectedJeans_C0_T0     utpb:washedJeans_C0_T0
utpb:qualityJeans_C0_T0       utpb:buttonedJeans_C0_T0
UTPB Queries (OPMO)
effectArtifact                causeArtifact
---------------------------------------------------------
utpb:denimParts_C0_T0         utpb:denim_C0_T0
utpb:rawJeans_C0_T0           utpb:denimParts_C0_T0
utpb:rawJeans_C0_T0           utpb:sewingThread_C0_T0
utpb:washedJeans_C0_T0        utpb:rawJeans_C0_T0
utpb:inspectedJeans_C0_T0     utpb:washedJeans_C0_T0
utpb:buttonedJeans_C0_T0      utpb:inspectedJeans_C0_T0
utpb:buttonedJeans_C0_T0      utpb:buttons_C0_T0
utpb:qualityJeans_C0_T0       utpb:buttonedJeans_C0_T0
utpb:jeans_C0_T0              utpb:qualityJeans_C0_T0
utpb:jeans_C0_T0              utpb:labels_C0_T0
utpb:inspectedJeans_C0_T0     utpb:washedJeans_C0_T0
utpb:qualityJeans_C0_T0       utpb:buttonedJeans_C0_T0
UTPB Queries (OPMX)
<result xmlns="http://openprovenance.org/model/opmx#" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <wasDerivedFrom> <effect ref="denimParts_C0_T0"/> <cause ref="denim_C0_T0"/> </wasDerivedFrom>
  <wasDerivedFrom> <effect ref="rawJeans_C0_T0"/> <cause ref="denimParts_C0_T0"/> </wasDerivedFrom>
  <wasDerivedFrom> <effect ref="rawJeans_C0_T0"/> <cause ref="sewingThread_C0_T0"/> </wasDerivedFrom>
  <wasDerivedFrom> <effect ref="washedJeans_C0_T0"/> <cause ref="rawJeans_C0_T0"/> </wasDerivedFrom>
  …
  <wasDerivedFrom> <effect ref="inspectedJeans_C0_T0"/> <cause ref="washedJeans_C0_T0"/> </wasDerivedFrom>
  <wasDerivedFrom> <effect ref="qualityJeans_C0_T0"/> <cause ref="buttonedJeans_C0_T0"/> </wasDerivedFrom>
</result>
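The result listings above contain direct wasDerivedFrom pairs. As a sketch of what an inference-style query could compute from them, the code below derives every artifact that `jeans` transitively depends on, assuming multi-step derivation is treated as the transitive closure of the direct edges. Whether a specific UTPB query does exactly this is an assumption; the pairs are copied from the OPMV/OPMO listings with the `_C0_T0` suffix dropped and the two repeated rows omitted, and the function name `ancestors` is hypothetical.

```python
from collections import defaultdict

# Direct wasDerivedFrom pairs (effect, cause) copied from the query results,
# without the _C0_T0 suffix and without the two repeated rows.
DIRECT = [
    ("denimParts", "denim"),
    ("rawJeans", "denimParts"),
    ("rawJeans", "sewingThread"),
    ("washedJeans", "rawJeans"),
    ("inspectedJeans", "washedJeans"),
    ("buttonedJeans", "inspectedJeans"),
    ("buttonedJeans", "buttons"),
    ("qualityJeans", "buttonedJeans"),
    ("jeans", "qualityJeans"),
    ("jeans", "labels"),
]


def ancestors(artifact, pairs):
    """Return all artifacts reachable from `artifact` via wasDerivedFrom edges."""
    causes = defaultdict(set)
    for effect, cause in pairs:
        causes[effect].add(cause)
    seen = set()
    stack = [artifact]
    while stack:  # iterative depth-first traversal
        current = stack.pop()
        for cause in causes[current]:
            if cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return seen


print(sorted(ancestors("jeans", DIRECT)))
```

A system that materializes such inferences at load time trades repository size and loading time for faster query response, which is the kind of trade-off the benchmark metrics are meant to expose.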
Performance Metrics
Performance Metrics • Data loading time • Repository size • Query response time • Query soundness • Query completeness
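The slide names the five metrics without giving formulas. One plausible way to quantify the last three, sketched below, is to time a query callable and to compare its result set against the predefined correct answers: soundness as the fraction of returned answers that are correct, completeness as the fraction of correct answers that were returned. These ratio definitions and the function names are assumptions for illustration, not definitions taken from the benchmark.

```python
import time


def soundness(returned, expected):
    """Fraction of returned answers that are correct (1.0 if nothing returned)."""
    returned, expected = set(returned), set(expected)
    if not returned:
        return 1.0
    return len(returned & expected) / len(returned)


def completeness(returned, expected):
    """Fraction of expected answers that were returned (1.0 if none expected)."""
    returned, expected = set(returned), set(expected)
    if not expected:
        return 1.0
    return len(returned & expected) / len(expected)


def timed(run_query):
    """Run a zero-argument callable; return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = run_query()
    return result, time.perf_counter() - start


# Hypothetical example: one wrong answer returned, one correct answer missed.
expected = {("denimParts", "denim"), ("rawJeans", "denimParts"),
            ("rawJeans", "sewingThread")}
returned, seconds = timed(lambda: {("denimParts", "denim"),
                                   ("rawJeans", "denimParts"),
                                   ("jeans", "denim")})
print(soundness(returned, expected), completeness(returned, expected))
```

Data loading time can be measured with the same `timed` helper around the load call, while repository size has to be read from the storage backend itself.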
Interpretation of Benchmark Results
Interpretation of Benchmark Results • Comparison across datasets of varying sizes • Comparison using a fixed dataset • Comparison across data serialized with different vocabularies (e.g., OPMV vs. OPMO) • Comparison across data managed using different technologies (e.g., RDF vs. XML) • Comparison across data of different provenance models (e.g., OPM vs. PROV-DM) – in the future
Summary and Future Work
Summary and Future Work • UTPB: A first formal benchmark for scientific workflow provenance management systems • Extensible with new provenance templates • Flexible data generation • Large selection of test queries • Well-defined performance metrics • Future work • Benchmarking existing systems using UTPB • Extending UTPB (functional requirements, PROV-DM, new metrics such as query expressiveness)
THANK YOU! Questions? • UTPB website: • http://faculty.utpa.edu/chebotkoa/utpb/ • My contact information: • Artem Chebotko, Department of Computer Science, University of Texas – Pan American • chebotkoa@utpa.edu • http://www.cs.panam.edu/~artem