Taverna the story from up-above. Antoon Goderis The University of Manchester, UK. http://www.mygrid.org.uk/taverna http://www.omii.ac.uk. DART workshop, Brisbane, Australia, 14 December 2006. Overview. The situation in –omics Creating new biology using Taverna Taverna Key traits
Taverna: the story from up-above. Antoon Goderis, The University of Manchester, UK. http://www.mygrid.org.uk/taverna http://www.omii.ac.uk DART workshop, Brisbane, Australia, 14 December 2006
Overview • The situation in –omics • Creating new biology using Taverna • Taverna • Key traits • Features on the OMII roadmap • Including today’s release
Open environment: Data, Data, Data. National Center for Biotechnology Information (USA) EBI Tokyo, Japan Cambridge, UK SeqHound SRS
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt 12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt 12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct 12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt 12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt 12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt 12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg 12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga 12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc 12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa 12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
The situation in {genomics, transcriptomics, proteomics, metabolomics, ...} • Lots of data • Lots of parameters to choose • An analysis takes a long time • The analysis services are unreliable • Lots of analysis steps • Need to record and explain your steps
Enter workflows • Lots of data [high throughput] • Lots of parameters to choose [best practice] • An analysis takes a long time [long running] • The analysis services are unreliable [fault tolerance] • Lots of analysis steps [data and control flow] • Need to record and explain your steps [provenance]
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt 12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt 12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct 12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt 12421 taggtgactt gcctgttttt ttttaattgg Workflow-based middleware
myGrid • myGrid http://www.mygrid.org.uk • UK e-Science pilot project since 2001 • Part of the Open Middleware Infrastructure Institute UK • Builds middleware for life scientists that enables them to undertake in silico experiments and share those experiments and their results. • Aimed at individual scientists, in under-resourced labs, who use other people’s applications. • Open source. • Workflows & Semantic Technologies for metadata management. • Data flows. Ad hoc & exploratory
Overview • The situation in -omics • Creating new biology using Taverna • Taverna • Key traits • Features on the OMII roadmap • Including today’s release
Phenotype Genotype 200 ? Phenotypic response investigated using microarray in form of expressed genes or evidence provided through QTL mapping Genes captured in microarray experiment and present in QTL region Microarray + QTL [Andy Brass, Steve Kemp, Paul Fisher, 2006]
Key: A – Retrieve genes in QTL region B – Annotate genes with external database Ids C – Cross-reference Ids with KEGG gene ids D – Retrieve microarray data from MaxD database E – For each KEGG gene get the pathways it’s involved in F – For each pathway get a description of what it does G – For each KEGG gene get a description of what it does [Andy Brass, Steve Kemp, Paul Fisher, 2006]
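The data flow in the key above can be sketched as a chain of lookups. This is only an illustration: the dictionaries stand in for the remote services, every identifier and description in them is made up, and steps D (microarray retrieval from MaxD) and G (gene descriptions) are left out for brevity.

```python
# Toy stand-ins for remote services. All identifiers and values are hypothetical.
QTL_GENES = {"qtl_region_1": ["geneA", "geneB"]}             # step A
EXTERNAL_IDS = {"geneA": "ext:1", "geneB": "ext:2"}          # step B
KEGG_IDS = {"ext:1": "kegg:g1", "ext:2": "kegg:g2"}          # step C
PATHWAYS = {"kegg:g1": ["path:p1"], "kegg:g2": ["path:p1", "path:p2"]}  # step E
PATHWAY_DESCRIPTIONS = {"path:p1": "toy pathway one", "path:p2": "toy pathway two"}  # step F


def pathways_for_qtl(region):
    """Map a QTL region to the pathways its genes take part in."""
    genes = QTL_GENES[region]                       # A: genes in the QTL region
    external = [EXTERNAL_IDS[g] for g in genes]     # B: annotate with external ids
    kegg = [KEGG_IDS[e] for e in external]          # C: cross-reference to KEGG ids
    result = {}
    for kegg_id in kegg:                            # E: pathways per KEGG gene
        for pathway in PATHWAYS[kegg_id]:
            entry = result.setdefault(
                pathway,
                {"description": PATHWAY_DESCRIPTIONS[pathway], "genes": []},  # F
            )
            entry["genes"].append(kegg_id)
    return result


if __name__ == "__main__":
    for pathway, info in pathways_for_qtl("qtl_region_1").items():
        print(pathway, info)
```

In a real workflow each lookup would be a service call, which is why the fault-tolerance and provenance features matter.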
Result • Captured the pathways returned by QTL and Microarray workflows over the MaxD microarray database • Identified a pathway whose correlating gene (Daxx) is believed to play a role in trypanosomiasis resistance. • Manual analysis of the microarray and QTL data had failed to identify this gene as a candidate. [Andy Brass, Steve Kemp, Paul Fisher, 2006]
Trichuris muris (mouse whipworm) infection • Identified the biological pathways involved in sex dependence in the mouse model, previously believed to be involved in the ability of mice to expel the parasite. • Manual experimentation: two-year study of candidate genes, processes unidentified • Workflows: the workflow from the trypanosomiasis cattle experiment was reused without change. • Analysis of the data by a biologist found the processes in a couple of days. [Joanne Pennock, Paul Fisher, 2006]
Changing scientific practice • Systematic and comprehensive automation. • Eliminated user bias and premature filtering of datasets and results, which lead to single-sided, expert-driven hypotheses • Dry people hypothesise, wet people validate. • “make sense of this data” -> “does this make sense?” • Workflow factories. • Different dataset, different result • Accurate provenance.
Overview • The situation in -omics • Creating new biology using Taverna • Taverna • Key traits • Features on the OMII roadmap • Including today’s release
User Uptake • ~25000 downloads • Systems biology • Proteomics • Gene/protein annotation • Microarray data analysis • Medical image analysis • Heart simulations • High throughput screening • Phenotypical studies • Plants, Mouse, Human • Astronomy • Dilbert Cartoons
Finding and Sharing Tools 3rd Party Applications and Portals Taverna Workbench myExperiment DAS Utopia Feta Workflow Enactor Clients Workflow enactor Service Management LSIDs Provenance log Metadata Default Data Store Custom Store Results Management KAVE BAKLAVA
3000+ services • Open domain services and resources, Third party. • Enforce NO common data model. • No common typing, Missing metadata. • Soaplab • InstantSoap
User Interaction • Allows a workflow to call out to an expert human user • E.g. used to embed the Artemis annotation editor within an otherwise automated genome annotation pipeline [University of Bergen]
Tools, Tools, Tools Pedro Annotation tool Feta Search tool
Capture and Curation Effort Ontology and Annotation Curation Team Franck Tanoh and Katy Wolstencroft Community Scientists Community Service Providers
Workflow enactor • Taverna Workbench Application • Scufl Model (Simple Conceptual Unified Flow Language): nested workflows, automatic iterations, best-guess data type handling • Workflow Execution • Shielding & extensible plug-ins: one Processor per service type (BioMOBY, BioMART, SeqHound, plain Web Service, Soaplab, local Java app, WF Enactor, WSRF, Beanshell)
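The "automatic iterations" trait can be pictured as follows: when a processor that expects a single item is handed a list, the enactor calls it once per item. The sketch below is a simplified, single-input illustration in Python and is not Taverna's actual implementation; `invoke` and `to_upper` are hypothetical names.

```python
def invoke(processor, value):
    """Call `processor` on `value`; if `value` is a (possibly nested) list,
    iterate over it and keep the list structure in the result."""
    if isinstance(value, list):
        return [invoke(processor, item) for item in value]
    return processor(value)


def to_upper(sequence):
    """Hypothetical processor that works on one sequence string at a time."""
    return sequence.upper()


if __name__ == "__main__":
    print(invoke(to_upper, "acat"))            # single item
    print(invoke(to_upper, ["acat", "ttct"]))  # list: one call per item
```

Handling several list inputs at once needs a rule for combining them, which this sketch deliberately leaves out.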
Service incompatibility [Duncan Hull, myGrid; Khalid Belhajjame, ISPIDER] • Fix up the services to be compatible, or… • Shims: libraries of adapters. • Automated data type matching using reasoning over a mismatch and service ontology
Shim identification • Mismatch detection
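A shim is a small adapter placed between two services whose output and input formats do not line up. The sketch below assumes a hypothetical mismatch (one service emits a FASTA record, the next expects a bare sequence string) and a hand-written shim registry; the type labels and function names are illustrative, and the ontology-based matching mentioned above is not modelled.

```python
def fasta_to_raw_sequence(fasta):
    """Shim: drop FASTA header lines and join the sequence lines."""
    lines = [line.strip() for line in fasta.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


# Hypothetical shim library, keyed by (produced type, expected type).
SHIMS = {("fasta", "raw_sequence"): fasta_to_raw_sequence}


def connect(value, produced_type, expected_type):
    """Pass `value` downstream, inserting a shim when the types differ."""
    if produced_type == expected_type:
        return value
    shim = SHIMS.get((produced_type, expected_type))
    if shim is None:
        raise TypeError(f"no shim from {produced_type} to {expected_type}")
    return shim(value)


if __name__ == "__main__":
    print(connect(">seq1 demo\nacat\nttct\n", "fasta", "raw_sequence"))
```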
Service failure? • Most services are owned by other people • No control over service failure • Some are research level • Workflows are only as good as the services they connect. • Notify failures • Instigate retries • Set criticality • Substitute services
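The four coping strategies (notify, retry, criticality, substitution) can be combined in one small wrapper. This is a minimal sketch under assumed names: `call_with_fallback`, `always_down` and `backup_service` are hypothetical and do not correspond to Taverna's configuration options.

```python
def call_with_fallback(services, payload, retries=2, critical=True, notify=print):
    """Try each service in order, retrying each up to `retries` times.

    Every failure is reported through `notify`. If all services fail,
    a critical step raises, while a non-critical step returns None.
    """
    for service in services:
        for attempt in range(1, retries + 1):
            try:
                return service(payload)
            except Exception as error:
                notify(f"{service.__name__} failed (attempt {attempt}/{retries}): {error}")
    if critical:
        raise RuntimeError("all services failed for a critical step")
    return None


def always_down(payload):
    """Hypothetical third-party service that is unavailable."""
    raise ConnectionError("service down")


def backup_service(payload):
    """Hypothetical substitute service."""
    return payload.upper()


if __name__ == "__main__":
    print(call_with_fallback([always_down, backup_service], "acgt"))
```

Marking a step as non-critical lets the rest of the workflow carry on with a missing value instead of aborting.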
[Figure: RDF provenance graph. Data generated by services/workflows (urn:data1, urn:data2, urn:data12, urn:data:3, urn:data:f1, urn:data:f2, and hits urn:hit1…, urn:hit2…., urn:hit5…, urn:hit8…., urn:hit10….., urn:hit50…..) and service invocations (urn:BlastNInvocation3, urn:compareinvocation3, urn:invocation5) are linked by properties ([instanceOf], [input], [output], [contains], [hasHits], [similar_sequence_to], [performsTask], [directlyDerivedFrom], [distantlyDerivedFrom], [type], [hasName]) to concepts (SwissProt_seq, Sequence_hit, Blast_report, DatumCollection, LSDatum) and services (Find similar sequence). Other labels: Missed sequence, New sequence, literals.]
Provenance Collection • Observes events from the workflow engine • Populates an RDF triple store with information from these events • Browse interface • Simple browser replicates Taverna’s existing result and status browser • Graphical browser • ProQA Query API [Zhao et al 07 provenance challenge paper]
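The collection step can be sketched as an observer that turns each workflow event into triples. The sketch uses a plain Python set instead of a real RDF triple store, and the event shape (`on_invocation` with input and output lists) is an assumption; only the URNs and property names are borrowed from the slide.

```python
class ProvenanceLog:
    """Toy stand-in for a triple store fed by workflow engine events."""

    def __init__(self):
        self.triples = set()

    def on_invocation(self, invocation, inputs, outputs):
        """Record one service invocation as (subject, property, object) triples."""
        for item in inputs:
            self.triples.add((invocation, "input", item))
        for item in outputs:
            self.triples.add((invocation, "output", item))
            for source in inputs:
                self.triples.add((item, "directlyDerivedFrom", source))

    def query(self, subject=None, predicate=None, obj=None):
        """Return matching triples, sorted; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


if __name__ == "__main__":
    log = ProvenanceLog()
    log.on_invocation("urn:BlastNInvocation3", ["urn:data1"], ["urn:data2"])
    print(log.query(predicate="directlyDerivedFrom"))
```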
Provenance Tracking • From which Ensembl gene does pathway mmu004620 come?
Workflows over Results Automatically backtrack through the data provenance graph Entrez dF dF dF dF Pathway_id KEGG_id Uniprot Ensembl_gene_id
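Backtracking through the provenance graph amounts to following derived-from links from a result back to the inputs it came from. In the sketch below the graph is a plain dictionary, the order of the chain is an assumption, and every identifier except the pathway id mmu004620 is a made-up placeholder.

```python
# Hypothetical derived-from links: each item maps to the items it was derived from.
DERIVED_FROM = {
    "pathway_id:mmu004620": ["kegg_id:example"],
    "kegg_id:example": ["entrez:example"],
    "entrez:example": ["ensembl_gene_id:example"],
}


def backtrack(item, wanted_prefix, derived_from):
    """Walk derived-from links from `item` and collect every ancestor
    (or the item itself) whose identifier starts with `wanted_prefix`."""
    stack, seen, found = [item], set(), []
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cycles and repeated visits
        seen.add(current)
        if current.startswith(wanted_prefix):
            found.append(current)
        stack.extend(derived_from.get(current, []))
    return sorted(found)


if __name__ == "__main__":
    print(backtrack("pathway_id:mmu004620", "ensembl_gene_id:", DERIVED_FROM))
```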
Overview • The situation in -omics • Creating new biology using Taverna • Taverna • Key traits • Features on the OMII roadmap • Including today’s release
myGrid Alliance Source-forge community Ingest OMII-UK Release myGrid Release myGrid Pre-release Evaluation Software Engineering Quality & Test OMII Software Engineering Quality & Test Software Engineering XP Prioritise & Plan Applications & Professional Services Production Conservatives Early adopters Pioneers Early adopters Pioneers Pioneers
Who are the OMII Users? Different scientific/research domains End Users Different activities Application Developers Increasing variation in requirements with the scientific domain. Service and Middleware Developers Middleware Deployers Systems Administrators
Taverna is now part of OMII-UK • Taverna 1.5 – Today! • Taverna 1.6 • myExperiment
Taverna 1.5 • Integrated provenance • Raven release mechanism to simplify updates for the user • +/- 300 semantic annotations for core services • Patterns for using proxies for bulk data transactions • Redeveloped plug-in and enactor framework, improved iteration events, data management
Taverna 1.5 • Integrated provenance • Raven release mechanism to simplify updates for the user • +/- 300 semantic annotations for core services
Example of an annotated workflow: This workflow takes in Entrez gene ids, then adds the string "ncbi-geneid:" to the start of each gene id. These gene ids are then cross-referenced to KEGG gene ids. Each KEGG gene id is then sent to the KEGG pathway database and its relevant pathways are returned.
• Add_ncbi_to_string: Beanshell script, need to ask Paul for more details. Input: Output:
• Kegg_gene_ids_all_species (bconv): converts external IDs to KEGG IDs [mapping]. Input string: external ID, e.g. NCBI ID [Genebank_GI]. Return: KEGG gene ID [KEGG_record_id]
• Get_pathways_by_genes: searches all pathways which include all the given genes [Searching]. Input: list of KEGG gene ids [KEGG_gene_id]. Output: list of pathway_id for the specified KEGG gene ids
• Merge_pathways: Stringlist, Concatenated
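The first step of the annotated workflow, adding the string "ncbi-geneid:" to the start of each gene id, is simple enough to sketch. The original step is a Beanshell script whose code is not shown here, so the Python function below is only an illustration of the described behaviour; the function name and the sample ids are made up.

```python
def add_ncbi_prefix(gene_ids):
    """Prefix each Entrez gene id with "ncbi-geneid:" so it can be
    cross-referenced to KEGG gene ids in the next step."""
    return [f"ncbi-geneid:{gene_id}" for gene_id in gene_ids]


if __name__ == "__main__":
    # Placeholder ids, not real Entrez records.
    print(add_ncbi_prefix(["12345", "678"]))
```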
Taverna 1.6 • Due out Summer 2007 • Revised enactment core • Native support for long running workflows • Data proxy to deal with bulk data transactions • Improved service discovery and provenance management
Obtaining Taverna • Taverna is available under the LGPL from our project site on Sourceforge.net • http://taverna.sourceforge.net • Win32, Solaris / Linux and OS X • Includes an online and downloadable user manual, examples, etc. • Support via project mailing lists
Conclusions • See plans for Taverna 2.0 on the myGrid wiki • Taverna development is user-driven • Please keep in touch and tell us what you would like to see via the myGrid mailing lists: Taverna Users, Taverna Hackers • Taverna http://taverna.sourceforge.net • myGrid http://www.mygrid.org.uk • OMII-UK http://www.omii.ac.uk
Acknowledgements • Phase1 myGrid researchers, Phase2 OMII-UK, myGrid Research Team • Peter Li, Paul Fisher, Andy Brass, Robert Stevens, Mark Wilkinson • EPSRC, Wellcome Foundation, EU