Semantic Webs for Life Sciences. PSB 2006. Personnel. Olivier Bodenreider National Library of Medicine, NIH, USA. Yves Lussier Department of Medical Informatics, Columbia University, USA. Robert Stevens University of Manchester, UK. Introduction. The Web today The Semantic Web vision
Semantic Webs for Life Sciences PSB 2006
Personnel • Olivier Bodenreider, National Library of Medicine, NIH, USA • Yves Lussier, Department of Medical Informatics, Columbia University, USA • Robert Stevens, University of Manchester, UK
Introduction • The Web today • The Semantic Web vision • Talking about facts… • The Resource Description Framework • Naming things in a Semantic Web • Ontologies on the SW • RDFS and the Web Ontology Language • Semantic Webs and Semantic Web applications
A Web of Information in Bioinformatics …with a human in the middle
Human Centric Information • Over 700 bioinformatics data resources • Many analysis tools producing more data • Still largely in human readable form only • A human biologist sits at the centre and does all the semantic work
A Stack of Troubles • Heterogeneity at every level: Semantics, Structure, Syntax, System
Heterogeneity all Around • Numerous, distributed, highly heterogeneous resources • Differing platforms, APIs, storage paradigms, query languages, … • Differing formats, syntax, etc. • Differing schemas implying different conceptualisations • Differing values for schema • Semantic heterogeneity…
Uniprot: A protein database?
ID   PRIO_HUMAN STANDARD; PRT; 253 AA.
AC   P04156;
DT   01-NOV-1986 (Rel. 03, Created)
DT   01-NOV-1986 (Rel. 03, Last sequence update)
DT   20-AUG-2001 (Rel. 40, Last annotation update)
DE   Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (ASCR).
GN   PRNP.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=86300093; PubMed=3755672;
RA   Kretzschmar H.A., Stowring L.E., Westaway D., Stubblebine W.H.,
RA   Prusiner S.B., Dearmond S.J.;
RT   "Molecular cloning of a human prion protein cDNA.";
RL   DNA 5:315-324(1986).
RN   [2]
RP   SEQUENCE OF 8-253 FROM N.A.
RX   MEDLINE=86261778; PubMed=3014653;
RA   Liao Y.-C.J., Lebo R.V., Clawson G.A., Smuckler E.A.;
RT   "Human prion protein cDNA: molecular cloning, chromosomal mapping,
RT   and biological implications.";
RL   Science 233:364-367(1986).
RN   [3]
RP   SEQUENCE OF 58-85 AND 111-150 (VARIANT AMYLOID GSS).
RX   MEDLINE=91160504; PubMed=1672107;
RA   Tagliavini F., Prelli F., Ghiso J., Bugiani O., Serban D.,
RA   Prusiner S.B., Farlow M.R., Ghetti B., Frangione B.;
RT   "Amyloid protein of Gerstmann-Straussler-Scheinker disease (Indiana
RT   kindred) is an 11 kd fragment of prion protein with an N-terminal
RT   glycine at codon 58.";
RL   EMBO J. 10:513-519(1991).
RN   [4]
RP   STRUCTURE BY NMR OF 118-221.
RX   MEDLINE=20359708; PubMed=10900000;
RA   Calzolai L., Lysek D.A., Guntert P., von Schroetter C., Riek R.,
RA   Zahn R., Wuethrich K.;
RT   "NMR structures of three single-residue variants of the human prion
RT   protein.";
RL   Proc. Natl. Acad. Sci. U.S.A. 97:8340-8345(2000).
CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE
CC       HOST GENOME AND IS EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
CC   -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED
CC       "RODS".
CC   -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
CC   -!- POLYMORPHISM: THE FIVE TANDEM OCTAPEPTIDE REPEATS REGION IS HIGHLY
CC       UNSTABLE. INSERTIONS OR DELETIONS OF OCTAPEPTIDE REPEAT UNITS ARE
CC       ASSOCIATED TO PRION DISEASE.
CC   -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE
CC       BRAIN OF HUMANS AND ANIMALS INFECTED
CC       WITH NEURODEGENERATIVE DISEASES KNOWN AS
CC       TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION
CC       DISEASES,LIKE: CREUTZFELDT-JAKOB DISEASE (CJD),
CC       GERSTMANN-STRAUSSLER SYNDROME (GSS), FATAL
CC       FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS;
CC       SCRAPIE IN SHEEP AND GOAT; BOVINE SPONGIFORM
CC       ENCEPHALOPATHY (BSE) IN CATTLE; TRANSMISSIBLE
CC       MINK ENCEPHALOPATHY (TME); CHRONIC WASTING
CC       DISEASE (CWD) OF MULE DEER AND ELK; FELINE
CC       SPONGIFORM ENCEPHALOPATHY (FSE) IN CATS AND
CC       EXOTIC UNGULATE ENCEPHALOPATHY (EUE) IN
CC       NYALA AND GREATER KUDU. THE PRION DISEASES
CC       ILLUSTRATE THREE MANIFESTATIONS OF CNS
CC       DEGENERATION: (1) INFECTIOUS (2)
CC       SPORADIC AND (3) DOMINANTLY INHERITED FORMS.
CC       TME, CWD, BSE, FSE, EUE ARE ALL THOUGHT TO
CC       OCCUR AFTER CONSUMPTION OF PRION-INFECTED
CC       FOODSTUFFS.
DR   EMBL; M13667; AAA19664.1; -.
DR   EMBL; M13899; AAA60182.1; -.
DR   EMBL; D00015; BAA00011.1; -.
DR   PIR; A05017; A05017.
DR   PIR; A24173; A24173.
DR   PIR; S14078; S14078.
DR   PDB; 1E1G; 20-JUL-00.
DR   PDB; 1E1J; 20-JUL-00.
DR   PDB; 1E1P; 20-JUL-00.
DR   PDB; 1E1S; 21-JUL-00.
DR   PDB; 1E1U; 20-JUL-00.
DR   PDB; 1E1W; 20-JUL-00.
DR   MIM; 176640; -.
DR   MIM; 123400; -.
DR   MIM; 137440; -.
DR   MIM; 245300; -.
DR   MIM; 600072; -.
DR   MIM; 604920; -.
DR   InterPro; IPR000817; Prion.
DR   Pfam; PF00377; prion; 1.
DR   PRINTS; PR00341; PRION.
DR   SMART; SM00157; PRP; 1.
DR   PROSITE; PS00291; PRION_1; 1.
DR   PROSITE; PS00706; PRION_2; 1.
KW   Prion; Brain; Glycoprotein; GPI-anchor; Repeat; Signal;
KW   3D-structure; Polymorphism; Disease mutation.
FT   SIGNAL 1 22
FT   CHAIN 23 230 MAJOR PRION PROTEIN.
FT   PROPEP 231 253 REMOVED IN MATURE FORM (BY SIMILARITY).
FT   LIPID 230 230 GPI-ANCHOR (BY SIMILARITY).
FT   CARBOHYD 181 181 N-LINKED (GLCNAC...) (PROBABLE).
FT   DISULFID 179 214 BY SIMILARITY.
FT   DOMAIN 51 91 5 X 8 AA TANDEM REPEATS OF P-H-G-G-G-W-G-
FT       Q.
FT   REPEAT 51 59 1.
FT   REPEAT 60 67 2.
FT   REPEAT 68 75 3.
FT   REPEAT 76 83 4.
FT   REPEAT 84 91 5.
FT       IN PATIENTS WHO HAVE A PRP MUTATION AT
FT       CODON 178: PATIENTS WITH MET DEVELOP FFI,
FT       THOSE WITH VAL DEVELOP CJD).
FT       /FTId=VAR_006467.
FT   VARIANT 171 171 N -> S (IN SCHIZOAFFECTIVE DISORDER).
FT       /FTId=VAR_006468.
FT   VARIANT 178 178 D -> N (IN FFI AND CJD).
FT       /FTId=VAR_006469.
FT   VARIANT 180 180 V -> I (IN CJD).
FT       /FTId=VAR_006470.
FT   VARIANT 183 183 T -> A (IN FAMILIAL SPONGIFORM
FT       ENCEPHALOPATHY).
FT       /FTId=VAR_006471.
FT   VARIANT 187 187 H -> R (IN GSS).
FT       /FTId=VAR_008746.
FT   VARIANT 188 188 T -> K (IN EOAD; DEMENTIA ASSOCIATED TO
FT       PRION DISEASES).
FT       /FTId=VAR_008748.
FT   VARIANT 188 188 T -> R.
FT       /FTId=VAR_008747.
FT   VARIANT 196 196 E -> K (IN CJD).
FT       /FTId=VAR_008749.
FT       /FTId=VAR_006472.
SQ   SEQUENCE 253 AA; 27661 MW; 43DB596BAAA66484 CRC64;
     MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP PQGGGGWGQP
     HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN KPSKPKTNMK HMAGAAAAGA
     VVGGLGGYML GSAMSRPIIH FGSDYEDRYY RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV
     NITIKQHTVT TTTKGENFTE TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPV
     ILLISFLIFL IVG
//
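A flat-file record like the one above is structured enough for a program to pick apart, even though its comments are free text. The following is a minimal sketch, not a full UniProt parser: it groups a few lines of the PRIO_HUMAN record by their two-letter line code and turns the DR (cross-reference) lines into simple subject/predicate/object facts. The predicate name `hasCrossReference` and the whitespace layout of the sample lines are assumptions made for illustration.

```python
from collections import defaultdict

# A few lines taken from the PRIO_HUMAN record on the slide
# (column spacing is assumed; the slide text lost its layout).
RECORD = """\
ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
GN   PRNP.
DR   EMBL; M13667; AAA19664.1; -.
DR   PDB; 1E1G; 20-JUL-00.
DR   MIM; 176640; -.
KW   Prion; Brain; Glycoprotein; GPI-anchor; Repeat; Signal;
KW   3D-structure; Polymorphism; Disease mutation.
"""


def group_by_line_code(text):
    """Group flat-file lines by their leading two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, _, rest = line.partition(" ")
        fields[code].append(rest.strip())
    return fields


def cross_references(fields):
    """Turn DR lines into (database, identifier) pairs."""
    refs = []
    for entry in fields["DR"]:
        parts = [part.strip() for part in entry.rstrip(".").split(";")]
        refs.append((parts[0], parts[1]))
    return refs


fields = group_by_line_code(RECORD)
accession = fields["AC"][0].rstrip(";")

# "hasCrossReference" is a hypothetical predicate name, not a UniProt term.
triples = [
    (accession, "hasCrossReference", f"{database}:{identifier}")
    for database, identifier in cross_references(fields)
]

for triple in triples:
    print(triple)
```

The structured lines (AC, DR, KW) convert mechanically; the CC comment lines are the part that still needs a human reader or text mining.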
Representing facts on the Web • Much knowledge is represented as natural language… • … and much of it is unusable by a computer • Stating that a thing has a relationship to another thing • “protein has function hydrolysis” • “protein has name lymphocyte associated receptor of death” • “protein expressed by gene trp!”
Facts as RDF Triples • Subject: Gene • Predicate: hasName • Object: “TrpA”
The Vision • By describing facts in a computationally amenable form, computers can do some of the work • RDF forms the semantic glue for the Semantic Web • RDF triples form a graph of facts that a computer can traverse, query and make inferences over
What Do We Need for this Vision? • A standard way of finding and naming things • A standard way of describing things • Standard vocabularies for talking about things
Semantic Web Technologies • The Resource Description Framework (RDF) • Uniform Resource Identifiers (URIs) • RDF Schema (RDFS) • The Web Ontology Language (OWL) • RDF query languages • RDF stores • Web Services • Rules languages
Triple Basics • Stating a thing about another thing • Subject predicate object • Subject verb object • Animal hasName Aardvark • A triple describes resources on the Web • A resource can be any form of information, not just a page • Resources are identified by Uniform Resource Identifiers (URIs), of which URLs are one kind • The names used in a triple are URIs, often Uniform Resource Names (URNs) • Object values can also be literals • All parts of a triple have names, providing a vocabulary
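The subject/predicate/object structure can be sketched as a small data type. This is an illustrative model only, not an RDF library: the class names and the `http://example.org/zoo#` namespace are assumptions, chosen to show that subjects and predicates are named resources while an object may be either a resource or a literal.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Resource:
    """Something identified by a URI."""
    uri: str


@dataclass(frozen=True)
class Literal:
    """A plain value such as a name or a number written as text."""
    value: str


@dataclass(frozen=True)
class Triple:
    subject: Resource
    predicate: Resource
    obj: Union[Resource, Literal]


EX = "http://example.org/zoo#"  # hypothetical namespace for this sketch

fact = Triple(Resource(EX + "Animal"), Resource(EX + "hasName"), Literal("Aardvark"))


def vocabulary(triples):
    """Collect the URIs used anywhere in a set of triples."""
    names = set()
    for triple in triples:
        for part in (triple.subject, triple.predicate, triple.obj):
            if isinstance(part, Resource):
                names.add(part.uri)
    return names


print(sorted(vocabulary([fact])))
```

Only the two resources contribute to the vocabulary here; the literal "Aardvark" is a value, not a name.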
A Family of Identifiers • URI = Uniform Resource Identifier • URL = Uniform Resource Locator • URN = Uniform Resource Name • LSID = Life Science Identifier
Uniform Resource Identifiers • A URI does as the name suggests: it uniquely identifies things on the Web • This identifier can be a Uniform Resource Locator or a Uniform Resource Name
Uniform Resource Locators • URLs identify unique locations on the Web • Things or fragments of things • They specify a protocol by which they are retrieved: http, ftp, mailto, … • They also specify domain and host • http://www.cs.man.ac.uk/~stevensr
Uniform Resource Names • A name, but not necessarily a location • urn:isbn: • urn:lsid:gene.ucl.ac.uk.lsid.biopathways.org:hugo:MVP • Life Science Identifiers (LSIDs) are special kinds of URNs for biological entities
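The parts of a URL described here (protocol, host, and the path to the thing) can be seen by splitting the example address with Python's standard `urllib.parse` module. This is only a small illustration; the second URL, with a fragment, is a made-up `example.org` address.

```python
from urllib.parse import urlparse

# The URL from the slide.
parts = urlparse("http://www.cs.man.ac.uk/~stevensr")
protocol = parts.scheme   # how the resource is retrieved
host = parts.netloc       # domain and host
path = parts.path         # location on that host

# A hypothetical URL whose fragment names part of a document.
fragment = urlparse("http://example.org/genes#trpA").fragment

print(protocol, host, path, fragment)
```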
Life Science Identifiers • OMG standard for uniquely identifying biological entities • An LSID can be resolved by an LSR to deliver the entity • The entity is immutable and versioned • Metadata can be delivered alongside the entity • Different parties can provide different metadata
RDF Triple with URI • Subject: urn:lsid:gene.ucl.ac.uk.lsid.biopathways.org:hugo:MVP • Predicate: hasName • Object: “MVP”
Aggregation By Names [Figure: nodes numbered 1 to 9]
Aggregation of Triples • Gene hasName “TrpA” • Gene expresses Protein • Protein hasName “Tryptophan Synthetase” • Protein hasSubstrate Chemical
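An LSID has a regular, colon-separated shape, which makes it easy to take apart. The sketch below assumes the layout `urn:lsid:authority:namespace:object[:revision]`; the `LSID` tuple and `parse_lsid` function are names invented for this example, and no resolution against an LSR is attempted.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str           # who issued the identifier
    namespace: str           # the collection the object belongs to
    object_id: str           # the object within that namespace
    revision: Optional[str]  # optional version of the object


def parse_lsid(text):
    """Split an LSID of the assumed form urn:lsid:authority:namespace:object[:revision]."""
    parts = text.split(":")
    well_formed = (
        len(parts) in (5, 6)
        and parts[0].lower() == "urn"
        and parts[1].lower() == "lsid"
    )
    if not well_formed:
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


mvp = parse_lsid("urn:lsid:gene.ucl.ac.uk.lsid.biopathways.org:hugo:MVP")
print(mvp)
```

Because the name carries its authority and namespace, two parties can talk about the same entity without agreeing on where its data is stored.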
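One way to write such a triple down as text is the line-based N-Triples syntax, where URIs go in angle brackets and literals in double quotes. The function below is a rough sketch with minimal escaping, not a complete serializer. The slide gives the predicate only as `hasName`, so the full predicate URI used here (`http://example.org/vocab#hasName`) is an assumption.

```python
def to_ntriples(subject, predicate, obj, obj_is_literal):
    """Write one triple as an N-Triples line (backslash and quote escaping only)."""
    def escape(text):
        return text.replace("\\", "\\\\").replace('"', '\\"')

    if obj_is_literal:
        object_term = f'"{escape(obj)}"'
    else:
        object_term = f"<{obj}>"
    return f"<{subject}> <{predicate}> {object_term} ."


line = to_ntriples(
    "urn:lsid:gene.ucl.ac.uk.lsid.biopathways.org:hugo:MVP",
    "http://example.org/vocab#hasName",  # hypothetical URI for hasName
    "MVP",
    obj_is_literal=True,
)
print(line)
```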
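Aggregation works because triples from different sources that use the same name for a thing join up into one graph. The sketch below assumes the diagram reads as a gene that has a name and expresses a protein, which in turn has a name and a substrate. The two "databases", the `urn:example:` identifiers and the helper functions are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical identifiers shared by both sources.
GENE = "urn:example:gene:trpA"
PROTEIN = "urn:example:protein:tryptophan-synthetase"
CHEMICAL = "urn:example:chemical:substrate-1"

# Two independent sources, each holding part of the picture.
gene_db = {
    (GENE, "hasName", "TrpA"),
    (GENE, "expresses", PROTEIN),
}
protein_db = {
    (PROTEIN, "hasName", "Tryptophan Synthetase"),
    (PROTEIN, "hasSubstrate", CHEMICAL),
}

# Aggregation is a set union: shared names make the graphs connect.
graph = gene_db | protein_db


def describe(graph, subject):
    """Return everything stated about one subject, grouped by predicate."""
    statements = defaultdict(set)
    for s, p, o in graph:
        if s == subject:
            statements[p].add(o)
    return dict(statements)


def names_of_expressed_proteins(graph, gene):
    """Follow gene -> expresses -> protein -> hasName across both sources."""
    proteins = describe(graph, gene).get("expresses", set())
    return {
        name
        for protein in proteins
        for name in describe(graph, protein).get("hasName", set())
    }


print(names_of_expressed_proteins(graph, GENE))
```

Neither source alone can answer "what is the name of the protein this gene expresses"; the merged graph can.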
Lingua Franca • Everything can be represented as RDF triples • A common data model • Deliver underlying resources (DB) as RDF • Common data format • Common semantics for that format • Provide a Semantic Bus
RDF Vocabularies • A collection of RDF statements with their names forms a vocabulary • RDF names designed for representing UniProt become a UniProt vocabulary • Creating standard vocabularies for a domain attacks the major semantic barrier • For example, a vocabulary for talking about molecular function, biological processes, cellular components and sequence features etc. • That is, vocabularies delivered by ontologies
Uniprot Keywords in RDF • Acetoin Biosynthesis rdf:type Uniprot Concept • Acetoin Biosynthesis rdfs:comment “Protein involved in the synthesis of acetoin” • Acetoin Biosynthesis owl:sameAs urn:lsid:uniprot.org:go:45151 • After Eric Jain http://www.isb-sib.ch/~ejain/rdf/
Semantic Bus • Applications: Knowledge Discovery tools, Data mining tools, Semantic Portals, Social networking, Smart Discovery & Retrieval • Semantic Bus carrying RDF • Resources: BLASTp, UniProt, PubMed, Instruments, Web Pages, PDF docs, Notes
An RDF World • Distributed heterogeneous resources present their data as RDF • A common data model for a sea of data • A “bus” into which resources can plug • Common syntax, common data model • But no common vocabulary for values on the bus • Also need vocabularies from ontologies • Build ontologies in the Web Ontology Language (OWL) and use them via RDF Schema
SciFOAF http://www.urbigene.com/foaf/
UniProt RDF http://www.isb-sib.ch/~ejain/rdf/intro.html
A Shared Understanding • Semantic understanding is perhaps the most troublesome • (Computer science) ontology: a technique for representing the things and relationships between things in our world • Overcoming vocabulary problems • Same word, different understandings (polysemy) • Different terms, same understanding (synonymy) • To compare understandings in different resources, we need a common understanding
RDF Schema (RDFS) • A collection of names in a graph forms a vocabulary • RDF Schema (RDFS) is an RDF vocabulary for talking about ontologies • Can talk about • Classes • Class/sub-class and other associative relationships • Able to describe ontologies in RDFS
RDFS Vocabularies • The RDF Vocabulary Description Language • RDF Schema “semantically extends” RDF to enable us to talk about classes of resources, and the properties that will be used with them • It does this by giving special meaning to certain RDF properties and resources • RDF Schema provides the means to describe application specific RDF vocabularies
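The "special meaning" RDF Schema gives to certain properties can be made concrete with two of its entailment rules: subclass relationships are transitive (rdfs11), and an instance of a class is also an instance of its superclasses (rdfs9). The sketch below applies just these two rules until nothing new appears. It is a toy forward-chainer, not a complete RDFS reasoner, and the `ex:` class and instance names are invented.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

# A tiny, hypothetical class hierarchy with one instance.
triples = {
    ("ex:Enzyme", SUBCLASS_OF, "ex:Protein"),
    ("ex:Hydrolase", SUBCLASS_OF, "ex:Enzyme"),
    ("ex:lysozyme", RDF_TYPE, "ex:Hydrolase"),
}


def rdfs_closure(triples):
    """Apply rdfs9 and rdfs11 repeatedly until no new triples are inferred."""
    inferred = set(triples)
    while True:
        new = set()
        for a, p1, b in inferred:
            for c, p2, d in inferred:
                if b != c or p2 != SUBCLASS_OF:
                    continue
                if p1 == SUBCLASS_OF:
                    new.add((a, SUBCLASS_OF, d))  # rdfs11: transitivity
                elif p1 == RDF_TYPE:
                    new.add((a, RDF_TYPE, d))     # rdfs9: inherit the type
        if new <= inferred:
            return inferred
        inferred |= new


closure = rdfs_closure(triples)
for triple in sorted(closure - triples):
    print(triple)
```

The three stated triples yield three inferred ones, including that the instance is a protein, a fact nobody wrote down explicitly.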
Web Ontology Language (OWL) • Latest standard in ontology languages from the World Wide Web Consortium (W3C) • Built on top of RDF • OWL semantically extends RDF(S) • Based on its predecessor language DAML+OIL • OWL has a rich set of modelling constructors • Three ‘species’ • OWL-Lite • OWL-DL • OWL-Full
Components of an OWL Ontology • Individuals • Properties • Classes
Three Kinds of OWL • OWL-Full • The union of OWL and RDF(S) • No restrictions on how/where language constructs can be used • OWL-Full is not decidable • OWL-DL • Restricted version of OWL-Full • Corresponds to a description logic • Certain restrictions on how/where language constructs can be used in order to guarantee decidability • OWL-Lite • A subset of OWL-DL • The simplest and easiest to implement of the three species
OWL and RDF • Valid OWL-Lite is valid OWL-DL, which in turn is valid OWL-Full • Not the other way around • All can be delivered as RDF • RDFS using all its expressivity can be delivered as OWL-Full
What Makes a Semantic Web • Stores of RDF statements about pages • Semantically enabled browsers • Using links provided by RDF • Using vocabulary provided by RDF to search • Presentation sits on top of semantic layer – so can differ
What Makes a Semantic Web • Web Pages • RDF Stores • Semantic Bus • Query engines: RDQL, SeRQL, SPARQL • Semantic Web Browser
Semantic Web Applications • RDF Stores • Semantic Bus • Query engines: RDQL, SeRQL, SPARQL • Semantic Web Applications
5 Reasons for the Semantic Web for Life Sciences • Problem matches up: fragmented, distributed, scale, mismatches, dynamic, variable, information driven • Community matches up: loosely coupled multiple suppliers and consumers making connections; culture of (controlled) sharing, curating and connecting content • Lots of publicly available diverse information and knowledge content • Return on investment is worth it • Business as usual!
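Query languages such as RDQL, SeRQL and SPARQL are built around matching triple patterns that contain variables against a store. The sketch below imitates that core idea in plain Python over an in-memory list of triples. It is not an implementation of any of those languages, and the data, the `urn:example:` identifiers and the `?variable` convention are assumptions for illustration.

```python
# A small, hypothetical RDF store held in memory.
GENE = "urn:example:gene:trpA"
PROTEIN = "urn:example:protein:tryptophan-synthetase"

store = [
    (GENE, "hasName", "TrpA"),
    (GENE, "expresses", PROTEIN),
    (PROTEIN, "hasName", "Tryptophan Synthetase"),
]


def match(store, pattern):
    """Yield variable bindings for one pattern; terms starting with '?' are variables."""
    for triple in store:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


def query(store, patterns):
    """Join several patterns, carrying bindings from one pattern to the next."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for solution in solutions:
            # Substitute variables that are already bound.
            bound = tuple(solution.get(term, term) for term in pattern)
            for binding in match(store, bound):
                extended.append({**solution, **binding})
        solutions = extended
    return solutions


results = query(store, [
    ("?gene", "hasName", "TrpA"),
    ("?gene", "expresses", "?protein"),
    ("?protein", "hasName", "?name"),
])
print(results)
```

The three patterns together ask for the name of the protein expressed by the gene called TrpA, which is answered by walking the graph rather than by reading any single record.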
On a Threshold? • Biology in a prime position to make Semantic Webs • Lots of semantically heterogeneous distributed resources • Web orientated • Many ontologies potentially delivered as RDF • Beginning to see SW applications
Acknowledgements • Slides provided by: Carole Goble, Matt Horridge, Nick Drummond, and many others from the University of Manchester • Yeliz Yesilada for drawing and formatting
A Smooth Journey on the Semantic Web Underground from Tim Berners-Lee
Another Level Down • Visualisation and Querying • Graph Model • RDF/XML • Data Sources
Semantic Web Stack From Tim Berners-Lee