Migrating to the Semantic Web: Bioinformatics as a case study. Phillip Lord, Dept of Computer Science, University of Manchester
What is the Semantic Web? [Figure: technology layers labelled XML, RDF and OWL, with a "We are here!" marker.]
The talk • Three (and a half) example case studies • Two different technologies. • Why we choose the different technologies.
The Motivation “At the doctor’s office, Lucy instructed her semantic web agent. It promptly retrieved information about her Mom’s prescribed treatment, looked up a list of several providers within 20 miles of home, with a good trust rating.”
Scientific American, May 2001: Beware of the Hype!
The Motivating Example [Figure: Lucy and the Doctor.]
myGrid • UK e-Science Pilot Project. • Oct 2001 – April 2005. • £3.4 million; £0.4 million studentships. • Sites: Newcastle, Sheffield, Manchester, Nottingham, Hinxton, Southampton.
Data(type)-intensive bioinformatics
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE    116    116       BINDS PEP (BY SIMILARITY).
FT   CONFLICT    374    374       S -> A (IN REF. 3).
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
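The record above packs many data types into one flat file, each line tagged with a two-letter code. As a rough illustration of what a consumer has to do with it, here is a minimal Python sketch that groups lines by code and joins the sequence block. The function name `parse_record` and the shortened sample are assumptions made for this example; a real parser would handle many more line types.

```python
from collections import defaultdict

# A shortened copy of the record on the slide (sequence truncated for brevity).
RECORD = """\
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE)
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP
"""


def parse_record(text):
    """Group lines by their two-letter code; untagged lines are sequence data."""
    fields = defaultdict(list)
    sequence = []
    for line in text.splitlines():
        code, body = line[:2], line[2:].strip()
        if code.strip():
            fields[code].append(body)
        elif body:
            # Sequence lines start with blanks; drop the spacing between blocks.
            sequence.append(body.replace(" ", ""))
    return dict(fields), "".join(sequence)


fields, seq = parse_record(RECORD)
entry_name = fields["ID"][0].split()[0]
keywords = [k.strip() for k in " ".join(fields["KW"]).rstrip(".").split(";")]
print(entry_name, keywords, len(seq))
```

Even this toy version has to know that DE lines continue each other, that KW is a semicolon-separated list, and that untagged lines carry the sequence, which is the kind of implicit typing the slide title points at.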
Service Stack [Figure: layered architecture diagram. Labels, in the order they appear: Bioinformaticians; Tool Providers; Service Providers; Work bench; Taverna; Talisman; Web Portal; Applications; Gateway; Personalisation; Registries; Service and Workflow Discovery; Provenance; Event Notification; Ontologies; Ontology Mgt; Views; Metadata Mgt; Core services; myGrid Information Repository; OGSA-DQP Distributed Query Processor; FreeFluo Workflow Enactment Engine; Web Service (Grid Service) communication fabric; External services; SoapLab; GowLab; Native Web Services; AMBIT Text Extraction Service; Legacy apps.]
WBS Workflows [Figure: workflow diagram starting from a query nucleotide sequence.]
Colour key: Pink: outputs/inputs of a service. Purple: tailor-made services. Green: Emboss soaplab services. Yellow: Manchester soaplab services. Grey: unknowns.
Services shown: RepeatMasker; ncbiBlastWrapper (tblastn vs nr, est, est_mouse, est_human databases; blastp vs nr; blastn vs nr, est databases); prettyseq; epestfind; seqret; pscan; pepstats; sixpack; transeq; pepcoil; GenScan; restrict; SignalP; TargetP; PSORTII; cpgreport; InterPro; PFAM; Prosite; Smart; pepwindow?; octanol?
Data and annotations shown: GenBank accession no.; URL inc. GB identifier; translation/sequence file (good for records and publications); GenBank entry; amino acid translation; sort for appropriate sequences only; identifies PEST seq; 6 ORFs; identifies FingerPRINTS; MW, length, charge, pI, etc.; nucleotide seq (Fasta); ORFs; predicts coiled-coil regions; coding sequence; restriction enzyme map; predicts cellular location; CpG island locations and %; identifies functional and structural domains/motifs; repetitive elements; hydrophobic regions.
Semantic discovery • Query-ontology – discovering workflows and services described in the registry by building a query in Taverna. • A common ontology is used to annotate and query. • Example: look for all workflows that accept an input of semantic type nucleotide sequence. • Aim: semantic discovery over a public view on the Web.
Service annotation • Adding structured metadata to a workflow registration to enable others to discover and reuse it more effectively. E.g. what semantic type of input does it accept.
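To make annotation and discovery concrete, here is a minimal Python sketch of a registry whose entries are annotated with the semantic type of their input and then queried through a shared ontology. The concept names, workflow names, and the `accepts` helper are all hypothetical; this is not the myGrid registry API or the ontology used in the project.

```python
# Toy ontology: child concept -> parent concept (all names are hypothetical).
IS_A = {
    "nucleotide sequence": "biological sequence",
    "protein sequence": "biological sequence",
    "DNA sequence": "nucleotide sequence",
}


def is_a(concept, ancestor):
    """True if `concept` equals `ancestor` or reaches it by following is-a links."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False


# Registry entries, each annotated with the semantic type of its input.
REGISTRY = [
    {"name": "blast_nucleotide_workflow", "input": "nucleotide sequence"},
    {"name": "protein_domain_workflow", "input": "protein sequence"},
    {"name": "any_sequence_formatter", "input": "biological sequence"},
]


def accepts(registry, semantic_type):
    """Names of workflows whose declared input type covers `semantic_type`."""
    return [w["name"] for w in registry if is_a(semantic_type, w["input"])]


print(accepts(REGISTRY, "nucleotide sequence"))
```

The sketch treats a workflow as a match when its declared input type is the queried type or a more general one. Whether a real query should instead match more specific types is a design choice the slides do not settle.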
Semantic Discovery [Screenshot callouts:] • Pedro data capture tool. • View annotations on workflow. • Drag a workflow entry into the explorer pane and the workflow loads. • Drag a service/workflow to the scavenger window for inclusion in the workflow.
[Figure: Biologist, Ontologist, Service Providers.]
Problems when doing In Silico Experiments • Scientists' in silico experiments are performed repeatedly, at different sites, at different times, by different users or groups. • The result: a large repository of records about experiments! • Issues: verification of data; “recipes” for experiment designs; explanation for the impact of changes; ownership; performance of services; data quality.
A Semantic Web of Provenance [Figure: a provenance record of a workflow run, linked through DAML+OIL ontologies to other provenance documents in XML, PDF and HTML. Links are labelled who, what, why, how, which, when and where. Linked items: literature relevant to the provenance study or data in this workflow; the interlinking graph of the workflow that generates the provenance logs; web pages of people who have interests related to those of the workflow's owner; experiment notes.]
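One way to picture the web of provenance is as a set of subject, predicate, object statements, in the spirit of RDF. The Python sketch below is only an illustration under that assumption: every identifier, predicate name, and date in it is hypothetical, and it does not reproduce the project's actual provenance schema.

```python
# (subject, predicate, object) statements; every identifier is hypothetical.
TRIPLES = [
    ("run:42", "what", "workflow:wbs_annotation"),
    ("run:42", "who", "person:alice"),
    ("run:42", "when", "2004-03-01"),
    ("run:42", "why", "note:experiment_17"),
    ("workflow:wbs_annotation", "relevantLiterature", "doc:paper.pdf"),
    ("person:alice", "homepage", "doc:alice.html"),
]


def describe(subject, triples):
    """Collect everything stated directly about one subject."""
    out = {}
    for s, p, o in triples:
        if s == subject:
            out.setdefault(p, []).append(o)
    return out


def linked(start, triples):
    """Everything reachable from `start` by following statements outward."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for s, _, o in triples:
            if s == node and o not in seen:
                seen.add(o)
                stack.append(o)
    return seen


print(describe("run:42", TRIPLES))
print(sorted(linked("run:42", TRIPLES)))
```

Following links outward from a single run reaches the literature and the owner's web page, which is the interlinking the diagram depicts.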
Population [Figure labels: Semantic Data; Web Services; Data Repository; FreeFluo; Taverna; Metadata Repository; LaunchPad; Haystack.]
[Figure: three Biologists and a Database.]
Gene Ontology Next Generation Project (GONG) • Demonstrate the utility of finer-grained concept descriptions in DAML+OIL (OWL-DL) • Develop methodologies and tools to support the process
Translating theory into practice • Gene Ontology provides a service to the model organism database community • Description logic (DL) is a technology born out of computer science research • OWL is a standard ontology interchange language underpinned by DL
GONG - proof of concept • Maintaining an exhaustive is-a structure [Figure legend: parent; is-a relationship; GO concept.]
Example: heparin biosynthesis
Axis 1: Chemicals
[chemical] biosynthesis (GO:0009058)
  [i] carbohydrate biosynthesis (GO:0016051)
    [i] aminoglycan biosynthesis (GO:0006023)
      [i] heparin biosynthesis (GO:0030210)
Example: heparin biosynthesis
Axis 1: Chemicals
[chemical] biosynthesis (GO:0009058)
  [i] carbohydrate biosynthesis (GO:0016051)
    [i] aminoglycan biosynthesis (GO:0006023)
      [i] heparin biosynthesis (GO:0030210)
Axis 2: Process
[i] heparin metabolism (GO:0030202)
  [i] heparin biosynthesis (GO:0030210)
Example: heparin biosynthesis
Axis 1: Chemicals
[chemical] biosynthesis (GO:0009058)
  [i] carbohydrate biosynthesis (GO:0016051)
    [i] aminoglycan biosynthesis (GO:0006023)
      [i] glycosaminoglycan biosynthesis (GO:0006024)
        [i] heparin biosynthesis (GO:0030210)
Axis 2: Process
[i] heparin metabolism (GO:0030202)
  [i] heparin biosynthesis (GO:0030210)
Is this important? • Missing is-a not noticed by users • BUT… improves fidelity of DB record retrieval. • Asking for gene products involved in ‘glycosaminoglycan biosynthesis’ will lead to an additional result: O94923 SPTr ISS - D-glucuronyl C5-epimerase (Fragment)
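The effect on retrieval can be sketched in a few lines of Python. The GO identifiers and the O94923 entry come from the slides; the assumption that O94923 is annotated directly to heparin biosynthesis (GO:0030210), and the helper names, are made up for this illustration.

```python
# Asserted is-a links (child -> parents), using the GO terms from the example.
IS_A = {
    "GO:0030210": ["GO:0006023", "GO:0030202"],  # heparin biosynthesis
    "GO:0006024": ["GO:0006023"],  # glycosaminoglycan biosynthesis
    "GO:0006023": ["GO:0016051"],  # aminoglycan biosynthesis
    "GO:0016051": ["GO:0009058"],  # carbohydrate biosynthesis
}

# Assumption for this sketch: O94923 is annotated to heparin biosynthesis.
ANNOTATIONS = {"O94923": "GO:0030210"}


def ancestors(term, is_a):
    """All terms reachable from `term` by following is-a links upward."""
    seen, stack = set(), [term]
    while stack:
        for parent in is_a.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def products_for(query, annotations, is_a):
    """Gene products annotated to `query` or to any of its descendants."""
    return sorted(
        product
        for product, term in annotations.items()
        if term == query or query in ancestors(term, is_a)
    )


before = products_for("GO:0006024", ANNOTATIONS, IS_A)

# Add the missing link: heparin biosynthesis is-a glycosaminoglycan biosynthesis.
with_link = {**IS_A, "GO:0030210": IS_A["GO:0030210"] + ["GO:0006024"]}
after = products_for("GO:0006024", ANNOTATIONS, with_link)

print(before, after)
```

Without the link, a query for glycosaminoglycan biosynthesis misses the entry; with it, the entry is returned, even though a user browsing the hierarchy might never have noticed the gap.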
Paraphrased reasoning process
• heparin biosynthesis
  class heparin biosynthesis defined
    subClassOf biosynthesis
    restriction onProperty acts_on hasClass heparin
• glycosaminoglycan biosynthesis
  class glycosaminoglycan biosynthesis defined
    subClassOf biosynthesis
    restriction onProperty acts_on hasClass glycosaminoglycan
[Diagram: an is-a link between the chemicals, heparin is-a glycosaminoglycan.]
Inferring a new is-a link
• heparin biosynthesis
  class heparin biosynthesis defined
    subClassOf biosynthesis
    restriction onProperty acts_on hasClass heparin
• glycosaminoglycan biosynthesis
  class glycosaminoglycan biosynthesis defined
    subClassOf biosynthesis
    restriction onProperty acts_on hasClass glycosaminoglycan
[Diagram: two is-a links. Because heparin is-a glycosaminoglycan, the reasoner infers that heparin biosynthesis is-a glycosaminoglycan biosynthesis.]
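The inference can be imitated with a deliberately small Python sketch. It is not a description logic reasoner and not the tooling used in GONG. It handles only classes defined by one parent plus one acts_on restriction, and the class and function names are assumptions made for this example.

```python
from dataclasses import dataclass

# Asserted is-a link between chemicals (child -> parent).
CHEMICAL_IS_A = {"heparin": "glycosaminoglycan"}
# No asserted links between process classes are needed for this example.
PROCESS_IS_A = {}


@dataclass(frozen=True)
class Defined:
    """A defined class: subClassOf `genus`, restricted to act on `acts_on`."""
    name: str
    genus: str
    acts_on: str


def covers(general, specific, is_a):
    """True if `specific` equals `general` or reaches it via is-a links."""
    while specific is not None:
        if specific == general:
            return True
        specific = is_a.get(specific)
    return False


def infer_is_a(classes, chemical_is_a, process_is_a):
    """Pairs (sub, sup) where both parts of sup's definition cover sub's."""
    links = []
    for sub in classes:
        for sup in classes:
            if sub == sup:
                continue
            if covers(sup.genus, sub.genus, process_is_a) and covers(
                sup.acts_on, sub.acts_on, chemical_is_a
            ):
                links.append((sub.name, sup.name))
    return links


CLASSES = [
    Defined("heparin biosynthesis", "biosynthesis", "heparin"),
    Defined("glycosaminoglycan biosynthesis", "biosynthesis", "glycosaminoglycan"),
]

print(infer_is_a(CLASSES, CHEMICAL_IS_A, PROCESS_IS_A))
```

The single asserted fact about the chemicals is enough to produce the link between the two processes, and nothing is inferred in the opposite direction.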
Results • Carbohydrate metabolism: ~250 concepts • 22 additional is-a links, 17 of which are now in GO • Amino acid metabolism: ~250 concepts • A further 17 additional is-a links, now in GO • The GO team will review results for metabolism as a whole once we have the tools to support the process • Useful results come from even partial coverage
Build a practical environment • Tools needed for: • Creating OWL definitions • Tracking changes • Reporting reasoning results • Viewing definitions
OWL for GONG [Figure: Biologist, Ontologist.]
Conclusions • Three problems, three different solutions, all making use of semantic web technologies. • A little semantics can go a long way. • The expressivity of the language has to be chosen at least in part based on the tasks to be performed, and the user base. • Tools, tools, tools.
Acknowledgments • Chris Wroe, Robert Stevens, Carole Goble, University of Manchester, UK • Michael Ashburner, EBI, Hinxton, UK • Jane Lomax and Midori Harris of the GO editorial team, for help and advice and for responding to the suggested changes • UMLS and MeSH, which provided valuable resources for chemical information • Sean Bechhofer, for development on OilEd • Project funded as a subcontract of the DARPA DAML programme
Acknowledgements • myGrid is an EPSRC funded UK eScience Program Pilot Project. • Particular thanks to the other members of the Taverna project, http://taverna.sf.net
myGrid People
Core • Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizauskas, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe.
Users • Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle, UK • Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
Postgraduates • Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire
Industrial • Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM) • Robin McEntire (GSK)
Collaborators • Keith Decker