
Validation of E-Science Experiments using a Provenance-based Approach

Sylvia Wong, Simon Miles, Weijian Fang, Paul Groth and Luc Moreau, University of Southampton, UK.


Presentation Transcript


  1. Validation of E-Science Experiments using a Provenance-based Approach Sylvia Wong, Simon Miles, Weijian Fang, Paul Groth and Luc Moreau University of Southampton, UK

  2. Overview • E-Science experiment validation • Bioinformatics scenario • Provenance-based validation architecture • Evaluation results

  3. E-Science Experiments • Large-scale computations for conducting scientific research • Multiple distributed services on the Grid • Workflow validation is part of the scientific process: scientists verify the correctness of their own experiments and review the correctness of their peers' work

  4. Static Validation • Operates on workflow source code • Checks whether the workflow satisfies some properties before it is run • Examples: type inference, escape analysis, concurrency analysis, graph-based partitioning • The workflow script may not be accessible, or may be expressed in a language not supported by the analysis tool

  5. Dynamic Validation • Verifies that data values satisfy constraints during execution • Examples: interface matching, runtime type checking • Cannot assume services will perform validation • Interfaces may be under-specified • In bioinformatics, biological sequences are commonly specified as strings in interfaces
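To illustrate why a string-typed interface is under-specified, here is a minimal Python sketch of a runtime check performed by the caller before a service is invoked. The function names and the alphabet check are assumptions made for illustration; they are not part of the presented system.

```python
# Hypothetical runtime check: the interface only declares "string",
# so the caller verifies that the value really looks like a protein sequence.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard one-letter codes


def is_protein_sequence(value: str) -> bool:
    """Return True if value is a non-empty string of standard amino-acid codes."""
    return bool(value) and set(value.upper()) <= AMINO_ACIDS


def invoke_with_check(service, sequence: str):
    """Reject values that satisfy the declared type (str) but not the intended one."""
    if not is_protein_sequence(sequence):
        raise ValueError("input is a string but not a protein sequence")
    return service(sequence)
```

Note that a nucleotide string such as "ACGT" still passes this check, because its letters are also valid amino-acid codes. A purely syntactic runtime check cannot tell the two apart, which is one reason to validate against semantic type descriptions instead.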

  6. Provenance-based Validation • Allows for validation of experiments after execution • Third parties may want to verify that the results obtained were computed correctly according to some criteria • These criteria may not be known when the experiment was designed or run • Important because science progresses (and models evolve!)

  7. Bioinformatics Scenario • A biologist has a set of proteins, for each of which he/she wishes to determine a particular biological property

  8. Experiment Services • Design experiment (abstract plan) • For each step in the plan, decide on the concrete service to use • Each service may be designed by the biologist or adopted from the work of another biologist • For each service there is a description of that service stating: what the service does, what type of data it analyses (its inputs), and what type of results it produces (its outputs) • All the descriptions are stored in a registry • [Figure: Services A, B and C, and a Registry entry "Description of Service A" with fields Function, Inputs, Outputs]

  9. Performing Experiment • Performs experiment • Details of experimental process documented in a provenance store • Each service documenting its own execution • [Figure: Services A, B, C, D and E and the Provenance Store]

  10. Questions • Did I perform each service on the type of data that the service was intended to analyse? • Were the inputs and outputs of each activity compatible? • Did the services I used actually fulfil my high level plan?

  11. Answering the Questions • Using the documentation in the Provenance Store, we can reconstruct the process that led to each result • Along with the high level plan and the descriptions in the registry we have all the information required to answer the questions
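The reconstruction step can be sketched as a backward trace over the recorded invocations. The record layout below (a list of dictionaries with `service`, `inputs` and `outputs` keys) is a simplified assumption for illustration and is not the actual p-assertion format of the provenance store.

```python
# Hypothetical, simplified records: each one documents a single service invocation.
records = [
    {"service": "A", "inputs": ["d0"], "outputs": ["d1"]},
    {"service": "B", "inputs": ["d1"], "outputs": ["d2"]},
    {"service": "C", "inputs": ["d2"], "outputs": ["d3"]},
    {"service": "X", "inputs": ["e0"], "outputs": ["e1"]},  # unrelated experiment
]


def trace(result, records):
    """Return the services that led to `result`, earliest first."""
    # Index each data item by the invocation that produced it.
    produced_by = {out: rec for rec in records for out in rec["outputs"]}
    ordered, visited = [], set()

    def visit(data_id):
        rec = produced_by.get(data_id)
        if rec is None or id(rec) in visited:
            return  # raw input data, or invocation already handled
        visited.add(id(rec))
        for parent in rec["inputs"]:
            visit(parent)  # walk back to whatever produced each input
        ordered.append(rec["service"])

    visit(result)
    return ordered
```

For example, `trace("d3", records)` walks back from the final result through C, B and A and ignores the unrelated invocation of X.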

  12. Q1: Were the inputs and outputs compatible? • Retrieve each pair of services performed in an experiment, where one service’s output is the other’s input (Provenance Store) • Retrieve descriptions for each service (Registry) • Compare the output type of the first service with the input type of the second • [Figure: Services A and B in the Provenance Store; Registry entries "Description of Service A" and "Description of Service B", each with fields Function, Inputs, Outputs]
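A minimal sketch of the Q1 check, assuming the service pairs have already been recovered from the provenance store and the descriptions from the registry. The dictionary layout and the type names are hypothetical, and the comparison here is exact matching only.

```python
# Hypothetical registry descriptions (type names are made up for illustration).
registry = {
    "A": {"inputs": ["ProteinSequence"], "outputs": ["EncodedSequence"]},
    "B": {"inputs": ["EncodedSequence"], "outputs": ["CompressedData"]},
    "C": {"inputs": ["NucleotideSequence"], "outputs": ["Report"]},
}


def check_pairs(pairs, registry):
    """Return the (producer, consumer) pairs whose types do not match exactly."""
    mismatches = []
    for producer, consumer in pairs:
        out_types = set(registry[producer]["outputs"])
        in_types = set(registry[consumer]["inputs"])
        if not out_types & in_types:  # no shared type name at all
            mismatches.append((producer, consumer))
    return mismatches
```

With these descriptions, `check_pairs([("A", "B"), ("B", "C")], registry)` reports only the pair `("B", "C")` as incompatible.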

  13. Q2: Did the experiment follow the plan? • Retrieve documentation of experiment that led to a result (Provenance Store) • Retrieve procedure descriptions (Registry) • Compare procedure function to planned activity • [Figure: Service A in the Provenance Store; Registry entry "Description of Service A" with fields Function, Inputs, Outputs]
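The Q2 comparison can be sketched in the same style. The plan, the service names and the function labels below are all hypothetical, and the sketch assumes a linear plan with one service per planned activity.

```python
# Hypothetical high-level plan and registry "Function" entries.
plan = ["Encode", "Compress", "Measure"]
registry_functions = {"A": "Encode", "B": "Compress", "C": "Measure", "D": "Align"}


def follows_plan(executed_services, plan, functions):
    """True if the executed services perform exactly the planned activities, in order."""
    performed = [functions[service] for service in executed_services]
    return performed == plan
```

Here `follows_plan(["A", "B", "C"], plan, registry_functions)` holds, while substituting service D for B does not, because D's advertised function differs from the planned activity.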

  14. Ontological Reasoning • High-level activity may be described in a more general way than the service which performs it • Also, one service’s input may be a generalisation of the preceding service’s output • Therefore, exact matching of types may produce a false negative: the biologist will wrongly be told the experiment was invalid • By using an ontology, describing how types are related, we can reason about types and determine whether they are truly compatible • Example from the ontology: Compression Algorithm is a generalisation of PPMZ
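The idea of replacing exact matching with subsumption can be sketched with a tiny hand-written hierarchy. This is a plain-Python simplification, not the OWL and Jena based reasoning used by the validator described in the talk; apart from the PPMZ and Compression Algorithm relation, the entries are hypothetical.

```python
# Hypothetical ontology fragment: each type maps to its direct generalisation.
PARENT = {
    "PPMZ": "Compression Algorithm",
    "GZIP": "Compression Algorithm",       # assumed sibling, for illustration only
    "Compression Algorithm": "Algorithm",  # assumed top-level type
}


def is_a(specific, general, parent=PARENT):
    """True if `specific` equals `general` or is a transitive specialisation of it."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = parent.get(current)  # climb one level; None at the root
    return False
```

A plan step described as "Compression Algorithm" is then satisfied by a service whose function is "PPMZ", since `is_a("PPMZ", "Compression Algorithm")` holds, whereas exact string comparison would report a mismatch. The relation is directional: `is_a("Compression Algorithm", "PPMZ")` is false.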

  15. Architecture • [Figure: service providers provide services and advertise them in the Registry; the Workflow Enactment Engine invokes the services; execution is documented in the Provenance Store; the validator queries to get advertisements and p-assertions]

  16. Testing • Workflow - protein compressibility • Provenance store – PASOA (pasoa.org) • Registry – Grimoires (grimoires.org) • Validator – Java, Jena 2.1 • Ontology in OWL, based on myGrid bioinformatics ontology

  17. Performance Evaluation • Potentially, a large number of experiments are performed • Evaluate whether our approach can scale with the size of the provenance store • Measure the time to validate an experiment as the number of recorded experiments increases

  18. Performance • [Figure panels: input/output type validation; plan validation]

  19. Summary • Provenance-based validation of workflow executions • Validation of experiments after execution • Previously unknown criteria • Third party validation • Tested with a sample bioinformatics experiment • Evaluation shows framework scales well with increasing data store size
