The myGrid Project Professor Chris Greenhalgh University of Nottingham
Open Source Upper Middleware for Bioinformatics • (Web) Service-based architecture • Targeted at Tool Developers, Bioinformaticians and Service Providers
Sites: Newcastle, Sheffield, Manchester, Nottingham, Hinxton, Southampton
Philosophy • Openness • open source • open world of services • open to wider eScience context • open to user feedback • open to third party metadata • Collection of components for assembly • Pick and mix
Data-intensive bioinformatics
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE    116    116       BINDS PEP (BY SIMILARITY).
FT   CONFLICT    374    374       S -> A (IN REF. 3).
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
Use Scenarios
Graves' Disease
• Autoimmune disease of the thyroid
• Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle
• Discover all you can about a gene
• Annotation pipelines and gene expression analysis
• Services from Japan, Hong Kong, various sites in UK
Williams-Beuren Syndrome
• Microdeletion of 1.55 Mbases on Chromosome 7
• Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
• Characterise an unknown gene
• Annotation pipelines and gene expression analysis
• Services from USA, Japan, various sites in UK
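[Figure: physical map of the Williams-Beuren Syndrome microdeletion region at 7q11.23 (~1.5 Mb) on chromosome 7 (~155 Mb). It shows the flanking repeat blocks A, B and C in centromeric, middle and telomeric copies (A-cen, B-cen, C-cen, C-mid, B-mid, A-mid, B-tel, A-tel, C-tel), the BAC clones CTA-315H11 and CTB-51J22 either side of a gap, the extent of patient deletions (WBS, SVAS), and genes in the region including FKBP6, FZD9, BAZ1B, BCL7B, TBL2, WBSCR14, WBSCR18, WBSCR21, WBSCR22, STX1A, CLDN3, CLDN4, ELN, LIMK1, WBSCR1/E1f4H, WBSCR5/LAB, RFC2, CYLN2, GTF2IRD1, GTF2I, NCF1, GTF2IRD2, POM121, NOLR1, STAG3 and PMS2L, plus the pseudogene copies GTF2IRD2P, FKBP6T, GTF2IP and NCF1P.]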
Manually filling a genomic gap • Numerous web-based services (e.g. BLAST, RepeatMasker) • Cutting and pasting • Large number of steps • Frequently repeated – info now rapidly added to public databases • Don’t always get results • Time consuming • Huge amount of interrelated data is produced – handled in lab book and files saved to local hard drive • Mundane • Much knowledge remains undocumented ∴ Bioinformatician does the analysis
WBS Workflows
[Figure: workflow diagram starting from a query nucleotide sequence. Legend: pink = outputs/inputs of a service; purple = tailor-made services; green = EMBOSS Soaplab services; yellow = Manchester Soaplab services; grey = unknowns.]
Inputs and intermediate data: GenBank accession no., URL inc. GB identifier, GenBank entry, nucleotide seq (Fasta), amino acid translation, ORFs
Services shown in the diagram:
• Seqret • Sort for appropriate sequences only
• RepeatMasker: repetitive elements
• ncbiBlastWrapper: Blastn vs nr, est databases; tblastn vs nr, est, est_mouse, est_human databases; Blastp vs nr
• GenScan: coding sequence
• prettyseq: translation/sequence file, good for records and publications
• sixpack: 6 ORFs
• transeq: amino acid translation
• restrict: restriction enzyme map
• cpgreport: CpG island locations and %
• epestfind: identifies PEST seq
• pscan: identifies FingerPRINTS
• pepstats: MW, length, charge, pI, etc
• pepcoil: predicts coiled-coil regions
• Pepwindow? Octanol?: hydrophobic regions
• SignalP, TargetP, PSORTII: predict cellular location
• InterPro, PFAM, Prosite, Smart: identify functional and structural domains/motifs
Workflow approach: in-silico experiments Williams-Beuren Syndrome • Manually: takes two days (+) including analysis • Workflow: now takes 30 mins to produce results and half a day for analysis • Manually: analysis is done as the experiment is performed • Workflow: analysis is done at the end of the experiment • Therefore need good result co-ordination for back-tracking
(e-)Scientists… • …Experiment • Can workflow be used as an experimental method? • How many times has this experiment been run? • …Analyze • How do we manage the results to draw conclusions from them? • How reliable are these results? • …Collaborate • Can we share workflows, results, metadata etc? • …Publish • Can we link to these workflows and results from our papers? • …Review • Can I find, comprehend and review your work? • How was that result derived?
myGrid Service Stack
Users: Bioinformaticians, Tool Providers, Service Providers
• Applications: Workbench, Taverna, Talisman, Web Portal
• Gateway
• Core services: Personalisation; Registries; Service and Workflow Discovery; Provenance; Event Notification; Ontologies; Ontology Mgt; Views; Metadata Mgt; myGrid Information Repository; OGSA-DQP Distributed Query Processor; FreeFluo Workflow Enactment Engine
• Web Service (Grid Service) communication fabric
• External services: SoapLab; GowLab; Native Web Services; AMBIT Text Extraction Service; Legacy apps
FreeFluo Features • Control flow, iteration and data flow • Data sets and nested flows • Configurable failure handling • Incorporated Life Science Id resolution • Provenance and status reporting • Type and data management • Plug-ins • User notification • Data entry wizard • Libraries of SHIM services • Libraries of workflows
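To make the data-flow idea concrete, here is a minimal Python sketch of how a workflow of dependent steps might be enacted in dependency order. It is not FreeFluo's implementation or API: the `enact` function, the step names and the toy string operations are all assumptions invented for illustration, and it ignores iteration, failure handling and provenance.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def enact(workflow, inputs):
    """Run each step once all the values it depends on are available.

    workflow maps a step name to (function, [names of upstream values]).
    inputs maps workflow input names to their values.
    """
    # Only dependencies that are themselves steps belong in the graph;
    # workflow inputs are already available in `values`.
    graph = {
        name: [dep for dep in deps if dep in workflow]
        for name, (_, deps) in workflow.items()
    }
    values = dict(inputs)
    for name in TopologicalSorter(graph).static_order():
        func, deps = workflow[name]
        values[name] = func(*(values[dep] for dep in deps))
    return values


# Hypothetical toy steps standing in for real services.
workflow = {
    # Replace lower-case (soft-masked) bases with N.
    "masked": (lambda seq: "".join("N" if c.islower() else c for c in seq),
               ["sequence"]),
    "length": (len, ["masked"]),
    "n_count": (lambda seq: seq.count("N"), ["masked"]),
    "summary": (lambda length, n: f"{n}/{length} bases masked",
                ["length", "n_count"]),
}

results = enact(workflow, {"sequence": "ACGTacgtAC"})
print(results["summary"])  # prints "4/10 bases masked"
```

Keeping every intermediate value in `results` is deliberate: it is the raw material for the provenance and status reporting listed on the slide.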
Domain Services • Native WSDL Web services • DDBJ, NCBI BLAST, PathPort, BioMOBY • Wrapped legacy services • SoapLab • GowLab • Web pages as web services • One button wrapping • Leveraged the EMBOSS Suite • ~159 services • Lots of them and lots of redundant services • The joys of firewalls and licensing
For each application • CreateJob • Run • WaitFor • GetResults • Destroy
EBI Support agreed to support Soaplab services as core business • http://industry.ebi.ac.uk/soaplab/
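The per-application job lifecycle on this slide (CreateJob, Run, WaitFor, GetResults, Destroy) can be sketched in Python as follows. This is only an illustration under stated assumptions: `FakeAnalysisService` is a hypothetical in-memory stand-in, not the Soaplab client API, and its method names simply mirror the five operations named on the slide.

```python
import itertools


class FakeAnalysisService:
    """Hypothetical in-memory stand-in for one wrapped application.

    This is NOT the Soaplab API; it only mirrors the five lifecycle steps.
    """

    def __init__(self, application):
        self._application = application  # callable playing the legacy app
        self._jobs = {}                  # job id -> job state
        self._ids = itertools.count(1)

    def create_job(self, inputs):
        job_id = f"job-{next(self._ids)}"
        self._jobs[job_id] = {"inputs": inputs, "status": "CREATED",
                              "results": None}
        return job_id

    def run(self, job_id):
        job = self._jobs[job_id]
        job["results"] = self._application(**job["inputs"])
        job["status"] = "COMPLETED"

    def wait_for(self, job_id):
        # A remote service would block or poll here; this stub has already
        # finished inside run(), so it just reports the status.
        return self._jobs[job_id]["status"]

    def get_results(self, job_id):
        return self._jobs[job_id]["results"]

    def destroy(self, job_id):
        del self._jobs[job_id]


def run_once(service, **inputs):
    """Drive one job through the whole lifecycle and always clean up."""
    job_id = service.create_job(inputs)
    try:
        service.run(job_id)
        if service.wait_for(job_id) != "COMPLETED":
            raise RuntimeError(f"job {job_id} did not complete")
        return service.get_results(job_id)
    finally:
        service.destroy(job_id)


# Toy "application": report the length of a sequence.
service = FakeAnalysisService(lambda sequence: {"length": len(sequence)})
result = run_once(service, sequence="ACGT")
print(result)  # prints {'length': 4}
```

Putting `destroy` in a `finally` clause means server-side job state is released even when a run fails, which matters when the same five calls are repeated for every wrapped application.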
Two+ Paths
Core functionality
• Services – Soaplab and Gowlab
• Workflow enactment engine – Freefluo
• Workflow workbench – Taverna
• Data integration – OGSA-DQP
• Information model & management
Innovative work
• Service and workflow registration
• Semantic discovery
• Provenance management
• Text mining
In between
• Event notification
• Gateway
Drilling Down: myGrid and Semantics • Workflow and service discovery • Prior to and during enactment • Semantic registration • Workflow assembly • Semantic service typing of inputs and outputs • Provenance of workflows and other entities • Experimental metadata glue • Use of RDF, RDFS, DAML+OIL/OWL • Instance store, ontology server, reasoner • Materialised vs at point of delivery reasoning. • myGrid Information Model
Provenance (1)
Levels: • Organisation level provenance • Process level provenance • Data/knowledge level provenance
Entities shown: Person, Organisation, Project, Experiment design, Workflow design, Workflow run, Service, Process, Event, Data item
Relationships shown:
• runBy, e.g. BLAST @ NCBI
• componentProcess, e.g. web service invocation of BLAST @ NCBI
• componentEvent, e.g. completion of a web service invocation at 12.04pm
• partOf, instanceOf
• knowledge statements, e.g. similar protein sequence to, run for
• data derivation, e.g. output data derived from input data
User can add templates to each workflow process to determine links between data items.
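Data-derivation links of the kind described on this slide ("output data derived from input data") are what make back-tracking from a result possible. A minimal Python sketch follows, assuming the links are available as a simple mapping. The item names are made up for illustration and this is not the myGrid provenance store or its schema.

```python
# Each entry records "this item was derived from these items".
# All item names are hypothetical.
derived_from = {
    "masked_sequence": ["nucleotide_sequence"],
    "blast_report": ["masked_sequence"],
    "filtered_report": ["blast_report"],
}


def lineage(item, links):
    """Return every item that `item` was directly or indirectly derived from."""
    seen = []
    stack = list(links.get(item, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(links.get(current, []))
    return seen


print(lineage("filtered_report", derived_from))
# prints ['blast_report', 'masked_sequence', 'nucleotide_sequence']
```

Walking the same links in the opposite direction would answer the reverse question: which results are affected if an input turns out to be wrong.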
[Figure: RDF provenance graph centred on a BLAST report. Data items are identified by LSIDs (urn:lsid:taverna:datathing:13, urn:lsid:taverna:datathing:15). One is a nucleotide sequence (gi|19747251|gb|AC005089.3, Homo sapiens BAC clone CTA-315H11 from 7, complete sequence); another is a BLAST report whose top hits are AC005089.3 (score 831) and AC073846.6 (score 815), followed by hits scoring 46.1 or 44.1. Other nodes: project, organisation, group, person, experiment definition, workflow definition, workflow invocation, service description, service invocation. Edge labels: masked_sequence_of, part_of, author, works_for, invocation_of, BLAST_Report, similar_sequences_to, run_for, run_during, described_by, created_by, filtered_version_of, rdf:type.]
RDF Rules
A: Relationship BLAST report has with other items in the repository
B: Other classes of information related to BLAST report
Information Model v2 Bioinformatics middleware – domain neutral • Resources and Identifiers • People, teams and organizations • Representing the e-science process • Experimental methods for e-science • Scientific data and the life-science identifier • Types • Identifier Types • Values and Documents • Provenance information • Annotation and Argumentation In the middle of deployment
LSIDs http://www.i3c.org/wgr/ta/resources/lsid/docs/ • LSID provides a uniform naming scheme. • LSID Resolver guarantees to resolve to same data object. • LSID Authority dishes them out. • Also returns metadata of object. • Used throughout myGrid as an object naming device. • myGrid Repository acts as an LSID Authority • LSID allows universal access to results for collaboration, as well as for review. • RDF+LSID explains the context of results, and provides guidance for further investigations. I3C / IBM / EBI proposal for a Life Science Identifier Pioneered by myGrid
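An LSID is a URN of the form `urn:lsid:<authority>:<namespace>:<object>` with an optional trailing `:<revision>`. A minimal Python sketch of splitting one into its parts is shown here, using an identifier of the kind that appears in the provenance example (`urn:lsid:taverna:datathing:13`). The `LSID` tuple and `parse_lsid` function are hypothetical helpers, not part of any myGrid or LSID library, and the sketch does no resolution or character-level validation.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text):
    """Split an LSID string into authority, namespace, object and revision."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if not all(parts[2:5]):
        raise ValueError(f"LSID has an empty component: {text!r}")
    # The revision is optional; treat a missing or empty one as None.
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(parts[2], parts[3], parts[4], revision)


lsid = parse_lsid("urn:lsid:taverna:datathing:13")
print(lsid.authority, lsid.namespace, lsid.object_id)
# prints: taverna datathing 13
```

The authority component is what a resolver would use to locate the service that hands back the data object and its metadata.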
In a nutshell
Pre-Prototype • Experimental • Web-based • Requirements gathering
Prototype 1 • Demo at ISMB 2003 • Architectural workout • All services represented • NetBeans workbench • API-based integration • Info Repository oriented • XML-based process provenance • Workflow enactment engine
Prototype 2 • Second generation services • Reworked information model • Open information management • Life Science Identifiers • RDF based provenance • Taverna workbench • Web-based portal
Full paper and demo at ISMB 2004 • GSK deployment • Real biology
To Dos • Improve results management • Deployment of mIR • Portal for finding workflows, launching & monitoring workflows, launching Taverna, browsing results • Deploying publicly accessible semantic registry • Reinstate service discovery during enactment • Large scale data throughput workflow engine • Event notification on services • Using provenance graphs for impact analysis • Hiding LSIDs • Lexicons for concept names • Hardening semantic discovery • Ambient Text • Er..Security • Etc… • “myGrid in a box”
Ongoing/Future Activities • myGrid-in-a-box • Technical follow-ons • Best practice (6) and OMII (Freefluo, Taverna, Event notification) bids • Research follow-ons • Semantic Grids, Data Grids, Workflow, Provenance services • PhD students • Science follow-ons • Life Sciences: ISPIDER, e-Fungi • Clinical: PsyGrid, CLEF-II • PhD students • Networking • Link-up with BIRN/SEEK/GEON (SDSC) & SCEC/GriPhyN (ISI, USC)
Wrap Up • Managed the transition from generic middleware development to practical, day-to-day useful services • Real users (plural) fundamental to that • End-to-end support for an entire scenario • A broad view of the e-Science process • Show stoppers for practical adoption are not sexy technical show stoppers • Can I incorporate my favourite service? • Can I manage the results? • Tapping into (de facto) standards and communities to leverage others' results and tools – LSID, Haystack, Pedro… • http://www.mygrid.org.uk
Acknowledgements myGrid is an EPSRC funded UK eScience Program Pilot Project Particular thanks to the other members of the Taverna project, http://taverna.sf.net
myGrid People
Core • Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizauskas, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe.
Users • Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle, UK • Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
Postgraduates • Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire
Industrial • Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM) • Robin McEntire (GSK)
Collaborators • Keith Decker