
An Introduction to Bio-Ontologies



Presentation Transcript


  1. An Introduction to Bio-Ontologies Robert Stevens Robert.Stevens@manchester.ac.uk

  2. Introduction • How we do bioinformatics • What is knowledge • What is an ontology • Classes, individuals, … • The components of an ontology • Examples

  3. How We Do Bioinformatics • No Euclid, no Newton • No equations and no axioms • Cannot take an amino acid sequence, submit it to an equation and get some biology • … so we do similarity searches

  4. Transferring Characteristics • Uncharacterised protein • High similarity: transfer characteristics • [Figure labels: Tra1, La2, La3]

  5. What Do We Transfer? • When sequences are sufficiently similar, we transfer what we understand about one sequence to the other • The “understanding” is our knowledge about that protein
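The transfer step on slides 3 to 5 can be sketched as follows. This is a minimal illustration, not a real similarity search: `difflib.SequenceMatcher` stands in for an alignment tool, and the protein names, the second sequence, the annotations and the 0.8 threshold are all assumptions made up for the example.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, a, b).ratio()

def transfer_annotation(query: str, annotated: dict, threshold: float = 0.8):
    """Return (annotation, best_hit, score).

    `annotated` maps a protein name to (sequence, annotation).
    The annotation is only transferred when the best hit is
    at least `threshold` similar; otherwise it is None.
    """
    best_name, best_score = None, 0.0
    for name, (sequence, _annotation) in annotated.items():
        score = similarity(query, sequence)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is not None and best_score >= threshold:
        return annotated[best_name][1], best_name, best_score
    return None, best_name, best_score

# Hypothetical characterised proteins (names and annotations are invented).
characterised = {
    "protA": ("MANLGCWMLVLFVATWSDLG", "membrane glycoprotein"),
    "protB": ("GGGGGGGGGGGGGGGGGGGG", "glycine-rich repeat"),
}

annotation, hit, score = transfer_annotation("MANLGCWMLVLFVATWSDLG", characterised)
print(annotation, hit, score)
```

What is transferred is free text, which is exactly the “knowledge” the following slides argue needs a shared, explicit representation.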

  6. What is Knowledge? • Knowledge: all information and an understanding to carry out tasks and to infer new information • Information: data equipped with meaning • Data: un-interpreted signals that reach our senses • [Figure: a record with fields Name, Job, Institution, Country, Conf holding the values Michael Ashburner, Professor, University of Cambridge, UK, ISMB; annotated with the interpretations: man; academic, senior; ancient university, 5 rated; European; important figure in biology; BIOLOGY]

  7. UniProt: A protein database?
     ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
     AC   P04156;
     DT   01-NOV-1986 (Rel. 03, Created)
     DT   01-NOV-1986 (Rel. 03, Last sequence update)
     DT   20-AUG-2001 (Rel. 40, Last annotation update)
     DE   Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (ASCR).
     GN   PRNP.
     OS   Homo sapiens (Human).
     OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
     OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
     OX   NCBI_TaxID=9606;
     RN   [1]
     RP   SEQUENCE FROM N.A.
     RX   MEDLINE=86300093; PubMed=3755672;
     RA   Kretzschmar H.A., Stowring L.E., Westaway D., Stubblebine W.H.,
     RA   Prusiner S.B., Dearmond S.J.;
     RT   "Molecular cloning of a human prion protein cDNA.";
     RL   DNA 5:315-324(1986).
     RN   [2]
     RP   SEQUENCE OF 8-253 FROM N.A.
     RX   MEDLINE=86261778; PubMed=3014653;
     RA   Liao Y.-C.J., Lebo R.V., Clawson G.A., Smuckler E.A.;
     RT   "Human prion protein cDNA: molecular cloning, chromosomal mapping,
     RT   and biological implications.";
     RL   Science 233:364-367(1986).
     RN   [3]
     RP   SEQUENCE OF 58-85 AND 111-150 (VARIANT AMYLOID GSS).
     RX   MEDLINE=91160504; PubMed=1672107;
     RA   Tagliavini F., Prelli F., Ghiso J., Bugiani O., Serban D.,
     RA   Prusiner S.B., Farlow M.R., Ghetti B., Frangione B.;
     RT   "Amyloid protein of Gerstmann-Straussler-Scheinker disease (Indiana
     RT   kindred) is an 11 kd fragment of prion protein with an N-terminal
     RT   glycine at codon 58.";
     RL   EMBO J. 10:513-519(1991).
     RN   [4]
     RP   STRUCTURE BY NMR OF 118-221.
     RX   MEDLINE=20359708; PubMed=10900000;
     RA   Calzolai L., Lysek D.A., Guntert P., von Schroetter C., Riek R.,
     RA   Zahn R., Wuethrich K.;
     RT   "NMR structures of three single-residue variants of the human prion
     RT   protein.";
     RL   Proc. Natl. Acad. Sci. U.S.A. 97:8340-8345(2000).
     CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE
     CC       HOST GENOME AND IS EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
     CC   -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED
     CC       "RODS".
     CC   -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
     CC   -!- POLYMORPHISM: THE FIVE TANDEM OCTAPEPTIDE REPEATS REGION IS HIGHLY
     CC       UNSTABLE. INSERTIONS OR DELETIONS OF OCTAPEPTIDE REPEAT UNITS ARE
     CC       ASSOCIATED TO PRION DISEASE.
     CC   -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE
     CC       BRAIN OF HUMANS AND ANIMALS INFECTED
     CC       WITH NEURODEGENERATIVE DISEASES KNOWN AS
     CC       TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION
     CC       DISEASES,LIKE: CREUTZFELDT-JAKOB DISEASE (CJD),
     CC       GERSTMANN-STRAUSSLER SYNDROME (GSS), FATAL
     CC       FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS;
     CC       SCRAPIE IN SHEEP AND GOAT; BOVINE SPONGIFORM
     CC       ENCEPHALOPATHY (BSE) IN CATTLE; TRANSMISSIBLE
     CC       MINK ENCEPHALOPATHY (TME); CHRONIC WASTING
     CC       DISEASE (CWD) OF MULE DEER AND ELK; FELINE
     CC       SPONGIFORM ENCEPHALOPATHY (FSE) IN CATS AND
     CC       EXOTIC UNGULATE ENCEPHALOPATHY (EUE) IN
     CC       NYALA AND GREATER KUDU. THE PRION DISEASES
     CC       ILLUSTRATE THREE MANIFESTATIONS OF CNS
     CC       DEGENERATION: (1) INFECTIOUS (2)
     CC       SPORADIC AND (3) DOMINANTLY INHERITED FORMS.
     CC       TME, CWD, BSE, FSE, EUE ARE ALL THOUGHT TO
     CC       OCCUR AFTER CONSUMPTION OF PRION-INFECTED
     CC       FOODSTUFFS.
     DR   EMBL; M13667; AAA19664.1; -.
     DR   EMBL; M13899; AAA60182.1; -.
     DR   EMBL; D00015; BAA00011.1; -.
     DR   PIR; A05017; A05017.
     DR   PIR; A24173; A24173.
     DR   PIR; S14078; S14078.
     DR   PDB; 1E1G; 20-JUL-00.
     DR   PDB; 1E1J; 20-JUL-00.
     DR   PDB; 1E1P; 20-JUL-00.
     DR   PDB; 1E1S; 21-JUL-00.
     DR   PDB; 1E1U; 20-JUL-00.
     DR   PDB; 1E1W; 20-JUL-00.
     DR   MIM; 176640; -.
     DR   MIM; 123400; -.
     DR   MIM; 137440; -.
     DR   MIM; 245300; -.
     DR   MIM; 600072; -.
     DR   MIM; 604920; -.
     DR   InterPro; IPR000817; Prion.
     DR   Pfam; PF00377; prion; 1.
     DR   PRINTS; PR00341; PRION.
     DR   SMART; SM00157; PRP; 1.
     DR   PROSITE; PS00291; PRION_1; 1.
     DR   PROSITE; PS00706; PRION_2; 1.
     KW   Prion; Brain; Glycoprotein; GPI-anchor; Repeat; Signal;
     KW   3D-structure; Polymorphism; Disease mutation.
     FT   SIGNAL        1     22
     FT   CHAIN        23    230       MAJOR PRION PROTEIN.
     FT   PROPEP      231    253       REMOVED IN MATURE FORM (BY SIMILARITY).
     FT   LIPID       230    230       GPI-ANCHOR (BY SIMILARITY).
     FT   CARBOHYD    181    181       N-LINKED (GLCNAC...) (PROBABLE).
     FT   DISULFID    179    214       BY SIMILARITY.
     FT   DOMAIN       51     91       5 X 8 AA TANDEM REPEATS OF P-H-G-G-G-W-G-
     FT                                Q.
     FT   REPEAT       51     59       1.
     FT   REPEAT       60     67       2.
     FT   REPEAT       68     75       3.
     FT   REPEAT       76     83       4.
     FT   REPEAT       84     91       5.
     FT                                IN PATIENTS WHO HAVE A PRP MUTATION AT
     FT                                CODON 178: PATIENTS WITH MET DEVELOP FFI,
     FT                                THOSE WITH VAL DEVELOP CJD).
     FT                                /FTId=VAR_006467.
     FT   VARIANT     171    171       N -> S (IN SCHIZOAFFECTIVE DISORDER).
     FT                                /FTId=VAR_006468.
     FT   VARIANT     178    178       D -> N (IN FFI AND CJD).
     FT                                /FTId=VAR_006469.
     FT   VARIANT     180    180       V -> I (IN CJD).
     FT                                /FTId=VAR_006470.
     FT   VARIANT     183    183       T -> A (IN FAMILIAL SPONGIFORM
     FT                                ENCEPHALOPATHY).
     FT                                /FTId=VAR_006471.
     FT   VARIANT     187    187       H -> R (IN GSS).
     FT                                /FTId=VAR_008746.
     FT   VARIANT     188    188       T -> K (IN EOAD; DEMENTIA ASSOCIATED TO
     FT                                PRION DISEASES).
     FT                                /FTId=VAR_008748.
     FT   VARIANT     188    188       T -> R.
     FT                                /FTId=VAR_008747.
     FT   VARIANT     196    196       E -> K (IN CJD).
     FT                                /FTId=VAR_008749.
     FT                                /FTId=VAR_006472.
     SQ   SEQUENCE   253 AA;  27661 MW;  43DB596BAAA66484 CRC64;
          MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP PQGGGGWGQP
          HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN KPSKPKTNMK HMAGAAAAGA
          VVGGLGGYML GSAMSRPIIH FGSDYEDRYY RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV
          NITIKQHTVT TTTKGENFTE TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPV
          ILLISFLIFL IVG
     //
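The entry on this slide mixes machine-readable fields (line codes such as ID, AC, DR, KW) with free-text comments (CC). A rough sketch of how the structured part can be pulled apart is shown here; it assumes one field per line with a two-letter line code, uses a handful of lines copied from the entry, and the function names are invented for illustration. It is not a complete parser for the format.

```python
from collections import defaultdict

# A few lines from the PRIO_HUMAN entry, one line code per line.
SAMPLE = """\
ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
DR   EMBL; M13667; AAA19664.1; -.
DR   InterPro; IPR000817; Prion.
KW   Prion; Brain; Glycoprotein; GPI-anchor; Repeat; Signal;
KW   3D-structure; Polymorphism; Disease mutation.
//
"""

def group_by_line_code(text: str) -> dict:
    """Group line contents under their two-letter line code."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):      # end-of-entry marker
            break
        if not line.strip():
            continue
        code, _, rest = line.partition(" ")
        groups[code].append(rest.strip())
    return dict(groups)

def keywords(groups: dict) -> list:
    """Join the KW lines and split them on ';'."""
    joined = " ".join(groups.get("KW", []))
    return [k.strip() for k in joined.rstrip(".").split(";") if k.strip()]

def cross_references(groups: dict) -> list:
    """Return (database, primary identifier) pairs from the DR lines."""
    refs = []
    for line in groups.get("DR", []):
        fields = [f.strip() for f in line.rstrip(".").split(";")]
        refs.append((fields[0], fields[1]))
    return refs

entry = group_by_line_code(SAMPLE)
print(keywords(entry))
print(cross_references(entry))
```

The CC lines resist this treatment: the disease comment is prose, so a program can retrieve it but cannot interpret it without a shared vocabulary.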

  8. A Web of Knowledge in Bioinformatics

  9. Words in Bioinformatics • “When I use a word,” Humpty Dumpty said in rather a scornful tone, “it means just what I choose it to mean - neither more nor less.” “The question is,” said Alice, “whether you can make words mean so many different things.” “The question is,” said Humpty Dumpty, “which is to be master - that’s all.” (Lewis Carroll, Through the Looking Glass)

  10. Post-Genomic Biology • Fly, mouse, yeast, worm all have their own terminologies • I want to compare genomes • How? • Sequences are comparable • What we know about sequences is not comparable (by human or machine) • Need a common understanding of what sequences do

  11. A Shared Understanding • Synonyms and homonyms are rife • Need to know that terms in one resource mean the same in another resource • Means comparisons are much easier: can ask questions over many resources • A structure of relationships enables discovery and query abstractions • Useful for both humans and computers • The Gene Ontology allows queries outside one model organism
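A small sketch of what “questions over many resources” and “query abstractions” can look like once every resource annotates to the same terms. The gene names and annotations are hypothetical, and the is-a links are a simplified assumption, not the actual structure of the Gene Ontology.

```python
# Simplified, assumed is-a links between shared terms (child -> parents).
IS_A = {
    "phosphatase activity": {"hydrolase activity"},
    "hydrolase activity": {"catalytic activity"},
    "catalytic activity": set(),
}

# Hypothetical annotations: (organism, gene) -> shared term.
ANNOTATIONS = {
    ("fly", "gene_f1"): "phosphatase activity",
    ("mouse", "gene_m1"): "hydrolase activity",
    ("yeast", "gene_y1"): "catalytic activity",
    ("worm", "gene_w1"): "phosphatase activity",
}

def term_and_descendants(term: str) -> set:
    """Collect a term plus every term below it in the is-a structure."""
    found = {term}
    changed = True
    while changed:
        changed = False
        for child, parents in IS_A.items():
            if child not in found and parents & found:
                found.add(child)
                changed = True
    return found

def genes_annotated_to(term: str) -> list:
    """Query across all organisms, abstracting over more specific terms."""
    wanted = term_and_descendants(term)
    return sorted(key for key, t in ANNOTATIONS.items() if t in wanted)

print(genes_annotated_to("hydrolase activity"))
```

Asking for the general term returns the fly and worm genes annotated to the more specific one as well as the mouse gene, which is the abstraction the relationship structure buys.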

  12. Gene Ontology http://www.geneontology.org • “a dynamic controlled vocabulary that can be applied to all eukaryotes” • Built by the community for the community. • Three organising principles: • Molecular function, Biological process, Cellular component • Describes kinds of things and parts of things • Describes ~17,000 things

  13. London Bills of Mortality

  14. Aggregated Stats

  15. “The art of ranking things in genera and species is of no small importance and very much assists our judgment as well as our memory. You know how much it matters in botany, not to mention animals and other substances, or again moral and notional entities as some call them. Order largely depends on it, and many good authors write in such a way that their whole account could be divided and subdivided according to a procedure related to genera and species. This helps one not merely to retain things, but also to find them. And those who have laid out all sorts of notions under certain headings or categories have done something very useful.” (Gottfried Wilhelm Leibniz, New Essays on Human Understanding)

  16. Ontology • Semantics – the meaning of meaning. • Philosophical discipline, branch of philosophy that deals with the nature and the organisation of reality. • Science of Being (Aristotle, Metaphysics, IV,1) • What is being? • What are the features common to all beings?

  17. So What? • Describing what “exists” in our domain • We have Protein, Gene, Intron, Exon, Hydrolase activity, etc. • We can also describe how these “things” relate to each other • We can define what they mean; define the properties of these things such that we can recognise those things • We are capturing our understanding • Sharing this understanding between humans and computer • Making what we understand explicit

  18. What Is An Ontology? • No universally agreed-upon definition • A “specification of a conceptualisation” • Conceptualisation refers to the set of concepts that people use to talk about a given domain and the relationships among these concepts • A set of vocabulary terms and definitions that capture a community’s understanding of their domain • CS has perverted the original philosophy • Ontology == conceptual model of a domain

  19. What Is An Ontology? Elements that most agree on: • classes = sets of things • instances = members of classes • relationships • axioms = additional logical statements

  20. What Is An Ontology? • Idea of a controlled vocabulary: • Each element has a unique name • Each element has a specified definition • For a given entity or relationship in the domain, there should only be one element in the ontology representing it • Ask for “hydrolase activity” and get all and only hydrolase activity
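The controlled-vocabulary rules above can be sketched in a few lines. This is an illustration under stated assumptions: the `Term` and `ControlledVocabulary` classes, the `EX:` identifiers, the definitions and the synonym are all invented for the example and do not come from any real vocabulary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Term:
    term_id: str          # hypothetical identifier scheme
    name: str             # the unique preferred name
    definition: str       # every element carries a definition
    synonyms: tuple = ()  # alternative labels that resolve to this term

class ControlledVocabulary:
    def __init__(self):
        self._terms = {}         # term_id -> Term
        self._label_index = {}   # lower-cased label -> term_id

    def add(self, term: Term) -> None:
        """Add a term, rejecting any identifier or label already in use."""
        labels = [label.lower() for label in (term.name, *term.synonyms)]
        if term.term_id in self._terms:
            raise ValueError(f"duplicate id: {term.term_id}")
        for label in labels:
            if label in self._label_index:
                raise ValueError(f"label already used: {label}")
        self._terms[term.term_id] = term
        for label in labels:
            self._label_index[label] = term.term_id

    def resolve(self, label: str) -> Term:
        """Map any known label to the single element that represents it."""
        return self._terms[self._label_index[label.lower()]]

vocab = ControlledVocabulary()
vocab.add(Term("EX:0001", "hydrolase activity",
               "Catalysis of the hydrolysis of a bond.", ("hydrolase",)))
vocab.add(Term("EX:0002", "protein binding",
               "Interacting selectively with a protein."))

print(vocab.resolve("Hydrolase").name)
```

Because every label resolves to exactly one element, a query phrased with any synonym reaches the same term, and a second element for the same entity is refused at insertion time.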

  21. What Is an Ontology? • Hierarchy (or taxonomy) is very important: • Classes arranged into a hierarchy • subclass = descendant class • direct subclass = child class • superclass = ancestor class • direct superclass = parent class

  22. What Is an Ontology? • Can be a single hierarchy, in which each class can only have one direct superclass, or a multiple hierarchy (or polyhierarchy), in which each class can have more than one direct superclass • is-a relationship between a class and its superclass(es) • A class inherits the properties that have been defined for its superclass(es)
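The hierarchy vocabulary from the last two slides (parent, ancestor, descendant, polyhierarchy, inheritance) can be made concrete with a small sketch. The class names are borrowed from the animal example later in the deck, but the arrangement of the is-a links, the second parent given to `mouse` and all property values are assumptions made for illustration.

```python
# Direct superclasses (parents) of each class; "mouse" has two parents,
# which makes this a polyhierarchy. The arrangement is assumed.
IS_A = {
    "animal": set(),
    "domestic": {"animal"},
    "vermin": {"animal"},
    "dog": {"domestic"},
    "cat": {"domestic"},
    "cow": {"domestic"},
    "rodent": {"vermin"},
    "mouse": {"rodent", "domestic"},
}

# Properties stated directly on a class (hypothetical values).
PROPERTIES = {
    "animal": {"alive": True},
    "domestic": {"kept_by_humans": True},
    "vermin": {"pest": True},
}

def ancestors(cls: str) -> set:
    """All superclasses: parents, their parents, and so on."""
    seen = set()
    stack = list(IS_A[cls])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(IS_A[parent])
    return seen

def descendants(cls: str) -> set:
    """All subclasses: every class that has `cls` among its ancestors."""
    return {c for c in IS_A if cls in ancestors(c)}

def inherited_properties(cls: str) -> dict:
    """Properties from every ancestor, then the class's own on top.

    The example has no conflicting values, so ancestor order does not matter.
    """
    merged = {}
    for ancestor in ancestors(cls):
        merged.update(PROPERTIES.get(ancestor, {}))
    merged.update(PROPERTIES.get(cls, {}))
    return merged

print(sorted(ancestors("mouse")))
print(inherited_properties("mouse"))
```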

  23. Why develop an ontology? • To make domain assumptions explicit • Easier to change domain assumptions • Easier to understand and update legacy data • To separate domain knowledge from operational knowledge • Re-use domain and operational knowledge separately • A community reference for applications • To share a consistent understanding of what information means.

  24. Classes • Classes: Sets of things in the world (nouns) • Classes of individuals • Classes: Person, protein, gene, DNA • Individuals: Robert (NE 67 51 48A), a LARD protein, a TrpA gene, a bacterium O23912 • Classes represent the things we know in our domain

  25. Properties • Classes have properties that describe their nature • Properties held by the individuals in a class • Properties made by relationships to individuals in other classes • Some properties must be held by a class • These are necessary to be a member of a class • Some properties are sufficient to define membership of a class • These are sufficient to recognise an individual as being a class member

  26. Classes • Primitive classes: • properties are necessary • Globular protein must have hydrophobic core, but a protein with a hydrophobic core need not be a globular protein • Defined classes: • properties are necessary + sufficient • Eukaryotic cells must have a nucleus. Every cell that contains a nucleus must be Eukaryotic.
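The difference between primitive and defined classes can be shown with a short sketch. It assumes a deliberately simple model in which an individual is just a set of property labels; the `ClassDef` type and both function names are invented for the example, and a real ontology language expresses these conditions as logical axioms instead.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClassDef:
    name: str
    conditions: frozenset  # properties every member must have
    defined: bool          # True: conditions are also sufficient

def satisfies_necessary(properties: set, cls: ClassDef) -> bool:
    """An individual lacking a necessary property cannot be a member."""
    return cls.conditions <= frozenset(properties)

def recognised_as_member(properties: set, cls: ClassDef) -> bool:
    """Only a defined class lets us conclude membership from properties."""
    return cls.defined and satisfies_necessary(properties, cls)

# Primitive class: necessary conditions only.
globular = ClassDef("GlobularProtein",
                    frozenset({"protein", "hydrophobic core"}),
                    defined=False)

# Defined class: necessary and sufficient conditions.
eukaryotic = ClassDef("EukaryoticCell",
                      frozenset({"cell", "nucleus"}),
                      defined=True)

some_protein = {"protein", "hydrophobic core", "membrane-spanning"}
some_cell = {"cell", "nucleus", "mitochondria"}

print(satisfies_necessary(some_protein, globular))   # consistent with membership
print(recognised_as_member(some_protein, globular))  # but not recognised
print(recognised_as_member(some_cell, eukaryotic))   # recognised
```

A protein with a hydrophobic core is compatible with being a globular protein but is not classified as one, whereas any cell with a nucleus is recognised as eukaryotic.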

  27. An explicit description of a domain • Rather than arguing about meaning of words • We argue about characteristics of things • Experience shows writing a list of characteristics or properties describing a “thing” saves much time • Computationally useful: gives a computer something to work with… • [Figure labels: animal, domestic, vermin, dog, cat, cow, rodent, mouse; relationship: eats]

  28. Classification of the Classical Tyrosine Phosphatases

  29. Incremental Addition of Protein Functional Domains

  30. Determining Class Definitions for Phosphatases • R2A: • Contains 2 protein tyrosine phosphatase domains • Contains 1 transmembrane domain • Contains 4 fibronectin domains • Contains 1 immunoglobulin domain • Contains 1 MAM domain • Contains 1 cadherin-like domain • Form complete OWL descriptions and classify

  31. What is the Ontology Telling Us? • Each class of phosphatase defined in terms of domain composition • We know the characteristics by which an individual protein can be recognised to be a member of a particular class of phosphatase • We have this knowledge in a computational form • If we had protein instances described in terms of the ontology, we could classify those individual proteins • A catalogue of phosphatases
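Classifying protein instances against class definitions based on domain composition might look roughly like this. The R2A composition is taken from the slide; the second class, the instance, and the decision to require an exact match of domain counts are assumptions for illustration. An OWL reasoner works from logical descriptions under different semantics, so this is an analogy, not the method used in the talk.

```python
from collections import Counter

# Class definitions as domain compositions. R2A follows the slide;
# "ExampleClassB" is hypothetical and only there for contrast.
CLASS_DEFINITIONS = {
    "R2A": Counter({
        "protein tyrosine phosphatase": 2,
        "transmembrane": 1,
        "fibronectin": 4,
        "immunoglobulin": 1,
        "MAM": 1,
        "cadherin-like": 1,
    }),
    "ExampleClassB": Counter({
        "protein tyrosine phosphatase": 1,
        "transmembrane": 1,
    }),
}

def classify(domains: list) -> list:
    """Return the classes whose composition matches the instance exactly.

    Exact matching is an assumption: it treats the listed domains
    as the complete description of the protein.
    """
    composition = Counter(domains)
    return sorted(name for name, definition in CLASS_DEFINITIONS.items()
                  if definition == composition)

# A hypothetical protein instance described by its domains.
protein_x = (["protein tyrosine phosphatase"] * 2
             + ["transmembrane"]
             + ["fibronectin"] * 4
             + ["immunoglobulin", "MAM", "cadherin-like"])

print(classify(protein_x))
```

Instances that match no definition come back with an empty list, which in a catalogue of phosphatases is itself informative: either the description is incomplete or a class is missing.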

  32. Classification of Protein Tyrosine Phosphatases

  33. So what is an ontology? • A spectrum of increasing formality [Deborah McGuinness, Stanford]: • Catalog/ID • Terms/glossary • Thesauri • Informal is-a • Formal is-a • Formal instance • Frames (properties) • Value restrictions • Disjointness, inverse, part-of • General logical constraints • Examples placed on the spectrum: Arom, Gene Ontology, TAMBIS, EcoCyc, Mouse Anatomy, PharmGKB

  34. EcoCyc

  35. Gene Ontology http://www.geneontology.org

  36. Controlled vocabulary • AGROVOC: Agricultural Vocabulary

  37. UMLS (Unified Medical Language System) http://umlsks.nlm.nih.gov/ • National Library of Medicine (NLM) database of medical terminology. Terms from several medical databases (MEDLINE, SNOMED International, Read Codes, etc.) are unified so that different terms are identified as the same medical concept. • Metathesaurus provides the concordance of medical concepts: 730,000 concepts, 1.5 million concept names in different source vocabularies • Specialist lexicon provides word synonyms, derivations, lexical variants, and grammatical forms of words used in Metathesaurus terms: 130,000 entries. • Semantic Network codifies the relationships (e.g. causality, "is a", etc.) among medical terms: 134 semantic types, 54 relationships.

  38. An Ontology Building Life-cycle • Identify purpose and scope • Consistency checking • Knowledge acquisition • Building • Language and representation • Conceptualisation • Integrating existing ontologies • Available development tools • Encoding • Ontology learning • Evaluation
