
Provenance challenge --- myGrid

Provenance challenge --- myGrid. David De Roure, University of Southampton; Jun Zhao, Carole Goble and Daniele Turi, University of Manchester. Outline: short team introduction, workflow implementation, provenance schema and storage, provenance queries, suggestions, reflection, acknowledgement.


Presentation Transcript


  1. Provenance challenge --- myGrid David De Roure University of Southampton Jun Zhao, Carole Goble and Daniele Turi University of Manchester

  2. Outline • Short team introduction • Workflow implementation • Provenance schema and storage • Provenance queries • Suggestions • Reflection • Acknowledgement

  3. Provenance Challenge Overview Given an abstract workflow • Implement this workflow in your system • Collect provenance from runs of this workflow • Present the implemented workflow and collected provenance • Answer a list of provenance questions and present these answers

  4. Taverna and myGrid • A UK e-Science project to build middleware for in silico experiments by individual life scientists, stuck in under-resourced labs, who use other people’s applications. • Sequence analysis, microarray analysis, proteomics, chemoinformatics, image processing, rendering Dilbert cartoons. [Slide background: numbered nucleotide sequence listing, positions 12181 to 12481]

  5. Scufl • Data links • Control links: limited support • Failure tolerance: retry and alternative services • Implicit iterations: cross/dot iterations • Nested workflows • Semantic metadata annotations
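Editorial sketch of the "cross/dot iterations" bullet on slide 5, assuming the usual reading: a cross iteration applies a process to every combination of items from its input lists, while a dot iteration pairs items by position. The helper names (`cross_iterate`, `dot_iterate`, `align`) and the equal-length check are assumptions of this sketch, not Taverna API.

```python
from itertools import product


def cross_iterate(func, xs, ys):
    """Apply func to every combination of items (cross product)."""
    return [func(x, y) for x, y in product(xs, ys)]


def dot_iterate(func, xs, ys):
    """Apply func to items paired by position (dot product)."""
    # Assumption of this sketch: positional pairing needs equal-length lists.
    if len(xs) != len(ys):
        raise ValueError("dot iteration needs lists of equal length")
    return [func(x, y) for x, y in zip(xs, ys)]


def align(image, reference):
    """Hypothetical stand-in for a workflow process with two input ports."""
    return f"{image}~{reference}"


images = ["img1", "img2"]
references = ["refA", "refB"]

cross_result = cross_iterate(align, images, references)  # 4 invocations
dot_result = dot_iterate(align, images, references)      # 2 invocations
```

With two lists of two items each, the cross strategy invokes the process four times and the dot strategy twice, which is why the choice of strategy changes how many process runs end up in the provenance record.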

  6. What has to be done • Design the workflow using Scufl in Taverna • Build services (Web services, Soaplab services, local java, or beanshell scripts) to implement each process • Gather and process the real data products

  7. Doing it properly • Wrap each procedure as a service • Process the real data as a real experiment • Use iterations, nested workflow or interactive workflows supported by Taverna • Real examples: • Chimatica (http://www.chimatica.co.uk/) supports high throughput workflows using Taverna 1.X • MIAS-Grid (http://www.mias-irc.net/) uses myGrid to build medical image processing workflows

  8. What we actually did • Realize each procedure as a Beanshell script, to avoid real service implementation and deployment • Pass pseudo data products rather than real image data products • But keep the metadata about data products along with provenance to answer semantic questions

  9. Implemented Scufl workflow in Taverna

  10. Provenance schema • Four aspects • Workflow provenance • Data provenance • Organization provenance • Knowledge provenance • Provenance ontology • RDFS • OWL-lite

  11. Provenance Pyramid Model [Figure: pyramid with four levels: Knowledge Level, Data Level, Organization Level, Workflow Level. Node labels in the figure: WSDL, Genomic Project, serviceInvocation1, serviceInvocation2, data1, data2, data3, data4, similarData]

  12. [Figure: provenance schema diagram] • Workflow provenance: Process (e.g. BLAST @ NCBI), ProcessRun (e.g. web service invocation of BLAST @ NCBI), ProcessIteration, Workflow, Workflow run; properties runsProcess, hasProcesses, iteration, executesProcessRun, runsWorkflow, workflowOutput, hasInput • Organisation provenance: Organisation, Experimenter; properties belongsTo, createdBy, launchedBy • Data/knowledge provenance: Knowledge statements (e.g. similar_sequence_to), Data, Atomic Data (isA Data), Data Collection (isA Data); properties derivedFrom, containsData

  13. Workflow provenance ontology

  14. Data provenance ontology

  15. Organization & Knowledge provenance ontology • userPredicate • Semantic concept about a data product or a service, e.g. nucleotide_sequence • Semantic (knowledge) relationships between two data products, e.g. similar_sequence_to

  16. Collected & stored provenance • LSIDs used to identify: • data, workflows, workflow runs • LSIDs are names of graphs • Named RDF graphs • retrieve whole workflow runs • implementation in • Sesame2 native store • scalable • alpha release (bugs) • NG4J (Jena + MySQL) • scalability issues • Future implementations: Oracle and Boca
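Editorial sketch of the idea on slide 16 that an LSID names an RDF graph, so fetching one graph returns a whole workflow run. This is a plain in-memory stand-in, not the Sesame2 or NG4J API. The `NamedGraphStore` class and the `urn:lsid:example.org:...` identifiers are hypothetical, and the direction of the `runsWorkflow` and `workflowOutput` properties is assumed.

```python
from collections import defaultdict


class NamedGraphStore:
    """Toy quad store: each graph name maps to a set of triples."""

    def __init__(self):
        self._graphs = defaultdict(set)

    def add(self, graph_name, subject, predicate, obj):
        """Record one triple inside the graph called graph_name."""
        self._graphs[graph_name].add((subject, predicate, obj))

    def graph(self, graph_name):
        """Return a copy of all triples stored under graph_name."""
        return set(self._graphs.get(graph_name, set()))


# Hypothetical LSIDs, following the urn:lsid:authority:namespace:object form.
WORKFLOW = "urn:lsid:example.org:workflow:challenge"
RUN_1 = "urn:lsid:example.org:workflowrun:1"
RUN_2 = "urn:lsid:example.org:workflowrun:2"
ATLAS_X = "urn:lsid:example.org:data:atlas-x"

store = NamedGraphStore()
store.add(RUN_1, RUN_1, "runsWorkflow", WORKFLOW)
store.add(RUN_1, RUN_1, "workflowOutput", ATLAS_X)
store.add(RUN_2, RUN_2, "runsWorkflow", WORKFLOW)

# Retrieving the graph named by the run's LSID yields the whole run.
run_1_triples = store.graph(RUN_1)
```

Keeping one graph per run means a run can be retrieved or deleted as a unit without scanning triples that belong to other runs.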

  17. Answer matrix (process provenance, data provenance) • 1. Find the process that led to d0 (Atlas X Graphic). • 2. Find the process that led to d0 (Atlas X Graphic), excluding everything prior to d1 (the averaging of images with softmean). • 3. Find the Stage 3, 4 and 5 details of the process that led to d0 (Atlas X Graphic). • 4. Find all invocations of procedure align_warp using p0 (a twelfth order nonlinear 1365 parameter model). • 5. Find all Atlas Graphic images outputted from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095. Abstract form: find all the d0 that are derived from d1 where value(d1) = 4095. • 6. Find all output averaged images of softmean, where the warped images taken as input were align_warped using a twelfth order nonlinear 1365 parameter model. Abstract form: find all the d0 that are derived from d1 where derivedFrom(d1) = d2.
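Editorial sketch of the abstract form of query 5 ("find all the d0 that are derived from d1 where value(d1) = 4095"), assuming provenance is available as (subject, predicate, object) triples with a `derivedFrom` predicate. The data item names and the `values` table are invented for illustration; a real deployment would run this as a query against the RDF store rather than in Python.

```python
def derived_from_closure(triples, start):
    """Return every item reachable from start by following derivedFrom links."""
    parents = {}
    for subject, predicate, obj in triples:
        if predicate == "derivedFrom":
            parents.setdefault(subject, set()).add(obj)

    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def outputs_derived_from_value(triples, values, outputs, wanted):
    """Keep each output d0 that has some ancestor d1 with values[d1] == wanted."""
    return [
        d0
        for d0 in outputs
        if any(values.get(d1) == wanted for d1 in derived_from_closure(triples, d0))
    ]


# Hypothetical provenance for one run; names are illustrative only.
triples = [
    ("atlas_x_graphic", "derivedFrom", "atlas_x_slice"),
    ("atlas_x_slice", "derivedFrom", "atlas_image"),
    ("atlas_image", "derivedFrom", "warped_1"),
    ("atlas_image", "derivedFrom", "warped_2"),
    ("warped_1", "derivedFrom", "anatomy_header_1"),
    ("warped_2", "derivedFrom", "anatomy_header_2"),
]
# Assumed "global maximum" entries of the two input headers.
values = {"anatomy_header_1": 4095, "anatomy_header_2": 1000}

matches = outputs_derived_from_value(triples, values, ["atlas_x_graphic"], 4095)
```

The traversal tracks visited nodes, so shared ancestors are expanded only once.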

  18. Answer matrix (provenance across runs, knowledge provenance) • 7. A user has run the workflow twice, in the second instance replacing each procedure (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs. • 8. Find the outputs of align_warp where the inputs are annotated with center=UChicago. • 9. Find all the graphical atlas sets that have metadata annotation studyModality with values speech, visual or audio, and return all other annotations to these files.
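Editorial sketch of query 7, assuming each run can be reduced to the list of procedure names it executed. Comparing only names is a simplification chosen for this sketch; the slides do not say how the comparison was implemented. The names reslice and slicer come from the challenge workflow and are not named on these slides.

```python
def diff_runs(run_a, run_b):
    """Report procedures that appear in only one of the two runs."""
    a, b = set(run_a), set(run_b)
    return {
        "only_in_first": sorted(a - b),
        "only_in_second": sorted(b - a),
    }


# Procedure names per run; reslice and slicer are assumed from the
# challenge workflow rather than taken from these slides.
first_run = ["align_warp", "reslice", "softmean", "slicer", "convert"]
second_run = ["align_warp", "reslice", "softmean", "slicer", "pgmtoppm", "pnmtojpeg"]

difference = diff_runs(first_run, second_run)
```

A fuller comparison would also diff inputs, outputs and parameters of the procedures the two runs share, since identical procedure names do not guarantee identical behaviour.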

  19. Suggested Workflow Variants Implicit iterations

  20. Suggested Workflow Variants Nested workflow runs

  21. Suggested Workflow Variants User interactions

  22. Suggested Queries • Compare, merge and union provenance from different workflow runs • Explain why different outputs were produced in repeated workflow runs • Replay a workflow run

  23. Categorisation of queries Four levels: 1. queries to support the provenance browser 2. semantic queries 3. integration queries 4. pre-canned queries to support provenance usage scenarios.

  24. Live systems • Taverna: http://taverna.sourceforge.net • Provenance plugin and browser beta release: bundled with the Taverna release 1.4. • Provenance ontology: http://cvs.mygrid.org.uk/cgi-bin/viewcvs.cgi/mygrid/miasgrid/rdf-provenance/etc/ontology/ • System requirements: • Windows, Linux, Mac • Java 5.0 • MySQL database (optional)

  25. Reflection • A systematic provenance query framework is needed • Separate data and provenance metadata • Better storage scalability • Avoid archiving duplicate data products • A consensus on provenance models

  26. Acknowledgement • The myGrid Taverna team: Tom Oinn, Stuart Owen, Stian Soiland, David Withers, Katy Wolstencroft and June Finch • Daniele Turi: provenance plugin • Matthew Gamble: Taverna provenance browser • Chris Wroe from the original myGrid project
