This project, funded with £3.4 million from the UK e-Science Pilot Project, aimed to develop data-intensive bioinformatics tools and services using the Semantic Web. It involved collaborations between the University of Manchester, Newcastle University, University of Sheffield, University of Nottingham, EMBL-EBI Hinxton, and University of Southampton. The project focused on developing tools for workflow management, service discovery, provenance tracking, ontologies, and metadata management.
myGrid and the Semantic Web Phillip Lord School of Computer Science University of Manchester
myGrid: eScience and Bioinformatics • Oct 2001 – April 2005. • £3.4 million. • UK e-Science Pilot Project. • £0.4 million studentships.
Sites: Newcastle, Sheffield, Manchester, Nottingham, Hinxton, Southampton.
Data (Type) Intensive Bioinformatics
ID   MURA_BACSU STANDARD; PRT; 429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE 116 116 BINDS PEP (BY SIMILARITY).
FT   CONFLICT 374 374 S -> A (IN REF. 3).
SQ   SEQUENCE 429 AA; 46016 MW; 02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
[Architecture diagram. Labels: Work bench; Taverna; Talisman; Web Portal; Applications; Gateway; Bioinformaticians; Personalisation; Registries; Service and Workflow Discovery; Provenance; Event Notification; Ontologies; Ontology Mgt; Views; Metadata Mgt; myGrid Information Repository; Core services; Tool Providers; OGSA-DQP Distributed Query Processor; FreeFluo Workflow Enactment Engine; Web Service (Grid Service) communication fabric; Service Providers; External services; SoapLab; GowLab; Native Web Services; AMBIT Text Extraction Service; Legacy apps]
Thin Semantics

CPGREPORT of CDS1|>CDS2|strand_1 from 1 to 129
Sequence Begin End Score CpG %CG CG/GC
CDS1|>CDS2|strand_1 5109 58 9 64.8 1.12

PRETTYSEQ of CDS1|>CDS2|strand_1 from 1 to 129
    ---------|---------|---------|---------|---------|---------|
  1 atgacggacactgctggtcgctgtggcttcctcctacgcgttcggtcactcctgcacatg 60
  1 M T D T A G R C G F L L R V R S L L H M 20
    ---------|---------|---------|---------|---------|---------|
 61 tccgcagtagtggtgctctcggggaccccctcgccaccccacaataccgctcaccacatg 120
 21 S A V V V L S G T P S P P H N T A H H M 40
    ---------
121 gccaaacag 129
 41 A K Q 43

########################################
# Program: restrict
# Rundate: Thu Jul 15 16:32:30 2004
# Report_format: table
# Report_file: /scratch/emboss_interfaces/a/unknown/Projects/default/Data/out1089905549241
########################################
Start  End  Enzyme_name  Restriction_site  5prime  3prime  5primerev  3primerev
    4    8  TspGWI       ACGGA                 19      17  .          .
    9   15  TspRI        CASTGNN               15       6  .          .
   14   19  BtsI         GCAGTG                 8       6  .          .
   25   28  CviJI        RGCY                  26      26  .          .
   30   33  MnlI         CCTC                  40      39  .          .
   36   41  MluI         ACGCGT                36      40  .          .
#---------------------------------------
#---------------------------------------
Semantic Discovery with Feta Query-ontology – discovering workflows and services described in the registry by building a query in Taverna. A common ontology is used to annotate and query. (Planning For OBO release)
Knowledge in Feta Service Descriptions (XML) Ontology (OWL-DL) Jena Querying (RDF)
Service Discovery GOOD: RDF provides a convenient search capability, with a well-defined link to an ontology. BAD: Unsure about scalability. Issues of security and concurrency will probably also affect us.
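The link between an RDF-style query and an ontology can be illustrated with a minimal sketch. It is not Feta's implementation (which the slides describe as built on Jena and OWL-DL); it uses a plain Python set of triples, and every class name, service name and predicate below is a hypothetical example. It shows why the ontology link helps: a query for a general task also finds services annotated with a more specific one.

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
# All identifiers below are hypothetical, chosen only for illustration.
TRIPLES = {
    # Ontology fragment: a small class hierarchy of bioinformatics tasks.
    ("SequenceAlignment", "subClassOf", "SequenceAnalysis"),
    ("PairwiseAlignment", "subClassOf", "SequenceAlignment"),
    ("RestrictionMapping", "subClassOf", "SequenceAnalysis"),
    # Service annotations: which task each registered service performs.
    ("blast_service", "performsTask", "PairwiseAlignment"),
    ("restrict_service", "performsTask", "RestrictionMapping"),
    ("cpgreport_service", "performsTask", "CpGIslandDetection"),
}


def subclasses_of(concept, triples):
    """Return concept plus every class reachable via subClassOf links."""
    found = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p == "subClassOf" and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def discover(task, triples):
    """Find services annotated with task or with any of its subclasses."""
    wanted = subclasses_of(task, triples)
    return sorted(s for s, p, o in triples if p == "performsTask" and o in wanted)


print(discover("SequenceAlignment", TRIPLES))  # ['blast_service']
print(discover("SequenceAnalysis", TRIPLES))   # ['blast_service', 'restrict_service']
```

Each query here scans every triple, which is fine for a sketch but hints at the scalability worry raised on the slide.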
Provenance • Bioinformatics has a data circularity problem. • Computational data is hard to trace, reproduce or repeat. • We need to store provenance. • Service Orientated Architecture and Service Descriptions start to enable us to do this.
Generating Provenance Web Services Data Repository FreeFluo Taverna Metadata Repository (reified) LaunchPad Haystack
Organisation level provenance / Process level provenance
[Diagram. Nodes: Person; Organisation; Project; Experiment design; Workflow design; Workflow run; Process; Service; Event; Data item. Relations: runBy (e.g. BLAST @ NCBI); componentProcess (e.g. web service invocation of BLAST @ NCBI); partOf; instanceOf; componentEvent (e.g. completion of a web service invocation at 12.04pm); run for; data derivation (e.g. output data derived from input data).]
User can add templates to each workflow process to determine links between data items.
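As a rough sketch of how such records could be used, the example below stores provenance as (subject, relation, object) tuples and walks the data derivation links backwards from a result. The relation names echo the diagram labels, but `derivedFrom`, `runFor` and all item identifiers are assumptions made for illustration, not the myGrid schema.

```python
# Provenance records as (subject, relation, object) tuples. Relation names
# follow the diagram labels where possible; "derivedFrom" and "runFor" and
# every identifier here are hypothetical.
PROVENANCE = [
    ("workflow_run_1", "instanceOf", "workflow_design_A"),
    ("workflow_run_1", "runFor", "experiment_design_X"),
    ("blast_invocation_1", "partOf", "workflow_run_1"),
    ("blast_invocation_1", "runBy", "BLAST @ NCBI"),
    ("blast_report", "derivedFrom", "query_sequence"),
    ("filtered_hits", "derivedFrom", "blast_report"),
]


def lineage(item, records):
    """Return every data item that item was derived from, nearest first."""
    ancestors = []
    frontier = [item]
    while frontier:
        current = frontier.pop(0)
        for s, r, o in records:
            if s == current and r == "derivedFrom" and o not in ancestors:
                ancestors.append(o)
                frontier.append(o)
    return ancestors


print(lineage("filtered_hits", PROVENANCE))  # ['blast_report', 'query_sequence']
```

Tracing a result back to its inputs in this way is what makes a computational result traceable and repeatable.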
Provenance GOOD: RDF provides a convenient data model, which is flexible and adaptable. BAD: Visualisation tools are lacking. Scalability is even more of an issue with reification.
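The scalability concern with reification can be made concrete. In the standard RDF reification vocabulary, describing one statement takes four triples (`rdf:type rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`) before any annotation is attached. The sketch below counts them; the statement identifier, the `assertedBy` predicate and the data item names are hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def reify(statement_id, triple):
    """Expand one triple into the four triples of the RDF reification vocabulary."""
    subject, predicate, obj = triple
    return [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", subject),
        (statement_id, RDF + "predicate", predicate),
        (statement_id, RDF + "object", obj),
    ]


def annotate(statement_id, triple, annotations):
    """Reify a triple and attach (predicate, value) annotations to the statement."""
    reified = reify(statement_id, triple)
    reified += [(statement_id, p, v) for p, v in annotations]
    return reified


# Hypothetical provenance fact and a single annotation about who asserted it.
base = ("blast_report", "derivedFrom", "query_sequence")
stored = annotate("stmt_1", base, [("assertedBy", "workflow_run_1")])
print(len(stored))  # 5 triples stored to describe 1 annotated statement
```

A store that reifies every provenance statement therefore holds several times as many triples as the facts it records.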
LSIDs • Standard identifier mechanism, aimed at the life sciences. • Has a standard resolution mechanism by which the data can be obtained. • Has semantics for versioning. • Has a standard association with metadata. • Abbreviation distressingly similar to LSD.
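An LSID is a URN of the form `urn:lsid:<authority>:<namespace>:<object>[:<revision>]`, where the optional revision carries the versioning semantics mentioned above. The following is a minimal parsing sketch, not a full validator; the `LSID` tuple, the `parse_lsid` helper and the example identifier are illustrative assumptions.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]  # None when the identifier carries no revision


def parse_lsid(text):
    """Split an LSID URN into its components, raising ValueError if malformed."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError("expected urn:lsid:authority:namespace:object[:revision]")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not an LSID URN: " + text)
    if any(part == "" for part in parts[2:5]):
        raise ValueError("authority, namespace and object must be non-empty")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Hypothetical identifier; example.org is a placeholder authority.
lsid = parse_lsid("urn:lsid:example.org:sequences:12345:2")
print(lsid.authority, lsid.namespace, lsid.object_id, lsid.revision)
```

Resolution itself (finding the authority's service and fetching data or metadata) is a separate step that this sketch does not attempt.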
Provenance • Used LSID within provenance; all of our data is stored and resolved with LSID • Notion of a single identifier system within myGrid is attractive.
Worries • We are unclear as to how the metadata/data split should happen with LSID: use the former for mutable content, the latter for immutable content. • We have also been tending toward using “metadata” for RDF-based data, and “data” for relational.
LSID GOOD: Defined resolution mechanism, data and metadata. BAD: Unclear how to use data/metadata split.
Acknowledgements
Core • Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe.
Users • Simon Pearce and Claire Jennings, Institute of Human Genetics School of Clinical Medical Sciences, University of Newcastle, UK • Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
Postgraduates • Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire
Industrial • Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM) • Robin McEntire (GSK)
Collaborators • Keith Decker
Summary GOOD: RDF provides a convenient search capability, with a well-defined link to an ontology. RDF provides a convenient data model, which is flexible and adaptable. LSID: defined resolution mechanism, data and metadata. BAD: Unsure about scalability. Issues of security and concurrency will probably also affect us. Visualisation tools are lacking. Scalability is even more of an issue with reification. LSID: unclear how to use the data/metadata split.