460 likes | 558 Views
Biomedical Ontologies. Bio-Trac 40 (Protein Bioinformatics) October 8, 2009 Zhang-Zhi Hu, M.D. Associate Professor Department of Oncology Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center. Overview. What is ontology ?
E N D
Biomedical Ontologies Bio-Trac 40 (Protein Bioinformatics) October 8, 2009 Zhang-Zhi Hu, M.D. Associate Professor Department of Oncology Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center
Overview • What is ontology? • What is biomedical ontology? • What is gene ontology? • How is it generated? • How is it used for annotation? • What is protein ontology? • Why is it necessary? • How to use it? • ……
Tree of Porphyry with Aristotle’s Categories Aristotle, 384 BC – 322 BC
Ontology: onto-, of being or existence; -logy, study. Greek origin; Latin, ontologia,1606 • In philosophy, it seeks to describe basic categories and relationships of being or existence to define entities and types of entities within its framework: • What do you know? How do you know it? • What is existence? What is a physical object? • What constitutes the identity of an object? …… • Central goal is to have a definitive and exhaustive classification of all entities. “The science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality” – Barry Smith, U Buffalo
is_a In computer and information science • Ontology is a data model that represents a set of concepts within a domain and the relationships between those concepts. It is used to reason about the objects within that domain. Most ontologies describe individuals (instances), classes (concepts), attributes, and relations Classes Relations Attributes Classes (concepts) e.g. color, engine, door… Individuals (instances) your Ford, my Ford, his Ford…
What are ontology useful for? Ontology is a form of knowledge representation about the world or some part of it. • Terminology management • Integration, interoperability, and sharing of data • promote precise communication between scientists • enable information retrieval across multiple resources • Knowledge reuse and decision support • extend the power of computational approaches to perform data exploration, inference, and mining Biomedical Terminologyvs. Biomedical Ontology • UMLS (unified medical language system) • MeSH (medical subject heading) • NCI Thesaurus • SNOMED / SNODENT • Medical WordNet
Ontology Enables Large-Scale Biomedical Science • Structured representation of biomedicine: • For different types of entities and relations to describe biomedicine (ontology content curation). • Annotation: using ontologies to summarize and describe biomedical experimental results to enable: • Integration of their data with other researchers’ results • Cross-species analyses The center of two major activities currently in biomedical research:
Gene Ontology (GO) what makes it so wildly successful ?
GO Consortium • The Gene Ontology was originally constructed in 1998 by a consortium of researchers studying the genome of three model organisms: • Drosophila melanogaster (fruit fly) (FlyBase) • Mus musculus (mouse) (MGD) • Saccharomyces cerevisiae (yeast) (SGD) • Many other model organism databases have joined the GO consortium, contributing: • development of the ontologies • annotations for the genes of one or more organisms http://www.geneontology.org/
Need for annotation of genome sequences • What is Gene Ontology? GO provides controlled vocabulary to describe gene and gene product attributes in any organism – how gene products behave in a cellular context Three key concepts: [Currently total 27399 GO terms] (May 2009) • Biological process: series of events accomplished by one or more ordered assemblies of molecular functions, e.g. signal transduction, or pyrimidine metabolism, and alpha-glucoside transport. [total: 16468] • Molecular function: describes activities, such as catalytic or binding activities, that occur at the molecular level. Activities that can be performed by individual gene products, or by assembled complexes of gene products; e.g. catalytic activity, transporter activity. [total: 8585] • Cellular component: a component of a cell that it is part of some larger object, maybe an anatomical structure (e.g. ER or nucleus) or a gene product group (e.g. ribosome, or a protein dimer). [total: 2346] • GO annotation • - Characterization of gene products using GO terms • - Members submit their data which are available at GO website.
root A C C has two parents, A and B B Relations: is_a, or part_of C Leaf node GO Representation: Tree or Network? GO is a network structure Node, a concept or a term
GO search and display tool Leaf node GO term (GO:0006366): mRNA transcription from RNA polymerase II promoter
Human p53 – GO annotation (UniProtKB:P04637) GO:0006289:nucleotide-excision repair [PMID:7663514; evidence:IMP]
GO annotation of gene products • Science basis of the GO: trained experts use the experimental observations from literature to associate GO terms with gene products (to annotate the entities represented in the gene/protein databases) • Enabling data integration across databases and making them available to semantic search http://www.geneontology.org/GO.current.annotations.shtml Human, mouse, plant, worm, yeast …
What GO is NOT…… • Ontology of gene products: e.g. cytochrome c is not in GO, but attributes of cytochrome c are, e.g. oxidoreductase activity. • Processes, functions and component unique to mutants or diseases: e.g. oncogenesis is not a valid GO. • Protein domains or structural features. • Protein-protein interactions. • Environment, evolution and expression. • Anatomical or histological features above the level of cellular components, including cell types. Neither GO is Ontology of Genes!! – a misnomer
Missing GO nodes… not deep enough…not broad enough…
Estrogen receptor Lack of connections among GOs
GO: A Common Standard for Omics Data Analysis what molecular function? what biological process? what cellular component?
need more… • need to improve the quality of GO to support more rigorous logic-based reasoning across the data annotated in its terms • need to extend the GO by engaging ever broader community support for addition of new terms and for correction of errors • need to extend the methodology to other domains, including clinical domains, such as: • disease ontology • immunology ontology • symptom (phenotype) ontology • clinical trial ontology • ...
http://www.obofoundry.org/ • Establish common rules governing best practices for creating ontologies and for using these in annotations • Apply these rules to create a complete suite of orthogonal interoperable biomedical reference ontologies National Center for Biomedical Ontology (NCBO) http://bioontology.org/
http://www.obofoundry.org/index.cgi?sort=domain&show=ontologieshttp://www.obofoundry.org/index.cgi?sort=domain&show=ontologies
The OBO Foundry • A family of interoperable gold standard biomedical reference ontologies to serve annotation of: • scientific literature • model organism databases • clinical trial data … • tight connection to the biomedical basic sciences • compatibility, interoperability, common relations • support for logic-based reasoning OBO Foundry = a subset of OBO ontologies, whose developers have agreed in advance to accept a common set of principles reflecting best practice in ontology development designed to ensure: OBO Foundry Principles: http://www.obofoundry.org/crit.shtml
RELATION TO TIME GRANULARITY Rationale of OBO Foundry coverage
OBO Relation Ontology e.g.: A is_a B =def. every instance of A is an instance of B “rose is_a plant all instances of rose is_a plant”
What is Protein Ontology? Why? http://pir.georgetown.edu/pro/ PRO
PRotein Ontology (PRO) • PRO in OBO Foundry to represent protein entities • Ontology for Protein Evolution (ProEvo) for evolutionary classes of proteins: captures the protein classes reflecting evolutionary relationships at full-length protein levels. It is meant to indicate the relatedness of proteins, not their evolution. Use is_a relationship. • Ontology for Protein Forms (ProForm) for multiple protein forms of a gene: captures the different protein forms of a specific gene from genetic variations, alternative splicing, cleavage, post-translational modifications. Relations used • is_a or derives_from 27
Why PRO? Allows specification of relationships between PRO and other ontologies, such as GO and Disease Ontology Provides a structure to support formal, computer-based inferences based on shared attributes among homologous proteins Provides a stable unique identifier to any protein type PRO:000000650 Smad2 isoform 1 phosphorylated 1 (phosphorylated SSxS motif) NOThas_functionGO:0003677:DNA binding PRO:000000656 Smad2 isoform 2 phosphorylated 1 (phosphorylated SSxS motif) has_functionGO:0003677 DNA binding • Provides formalization and precise annotation of specific protein forms/ classes, allowing accurate and consistent data mapping, integration and analysis: for dendritic cell ontology or pathway [Term] http://www.biomedcentral.com/1471-2105/10/70 id: DC_CL:0000003 name: conventional dendritic cell def: "conventional dendritic cell is_a leukocyte that has_high_plasma_membrane_amount_relative_to_leukocyte CD11c and lacks_plasma_membrane_part CD19, CD3, C34, and CD56." [AMM:amm] comment: Immunological Reviews 2007 219: 118-142 intersection_of: CL:0000738 ! leukocyte intersection_of: has_high_plasma_membrane_amount_relative_to_leukocyte PRO:000001013 ! CD11c intersection_of: lacks_plasma_membrane_part DC_CL:0000072 ! CD3 intersection_of: lacks_plasma_membrane_part PRO:000001002 ! CD19 intersection_of: lacks_plasma_membrane_part PRO:000001003 ! CD34 intersection_of: lacks_plasma_membrane_part PRO:000001024 ! CD56
LAP TGF-b TGF-b TGF-b II II I I STRAP Smad 7 Shc Smad 2 Smad 2 Smad 2 Smad 2 Smad 2 Smad 2 S S S S S S S S S X S S S S S S S S S S S S S S S S X S P P P P P P P P P P P P P P P P P P P P P P P P P P P P TAK1 Y T Y T K Y Y Y T Y T Y T T K Y T P P P P P P P P P P U P P P P U P Smad 4 Smad 4 Smad 4 Smad 2 Phosphorylation (P) at Serine (S), Threonine (T) Tyrosine (Y) Ubiquitination (U) at Lysine (K) TGF-beta signaling – comparison between PID and Reactome PRO:000000397 Furin PRO:000000618 Growth signals Ca2+ Growth signals Stress signals PRO:000000616 TGF-beta receptor PRO:000000523 PRO:000000410 Cytoplasm Smad 2 PRO:000000468 PRO:000000650 MEKK1 PRO:000000481 Smad 4 ERK1/2 PRO:000000366 Shc XIAP TAK1 CaM PRO:000000650 PRO:000000651 PRO:000000366 Degradation P38 MAPK pathway JNK cascade MAPKKK PRO:000000652 X PRO:000000650 Ski PRO:000000366 Nucleus Common in both Reactome & PID X Only included in Reactome * All others are in PID. Not all components in the pathway from both databases are listed DNA binding and transcription regulation
Human PRLR The Need for Representation of Various Proteins Forms Glucocorticoid receptor (GR) and PTMs…
lysosome Extracelluar, e.g. LDL binding M6P Sphingomyelin phosphodiesterase (SMPD1) (ASM_HUMAN) • Cleavage sites: • lysosomal: the enzyme is transported from the Golgi apparatus to the lysosome after additions of mannose-6-phosphate moieties (M6P) and binding to M6P receptor. • secreted: the shorter cleaved form is not modified with M6P and is targeted for secretion to the extracellular space, with different functions such as LDL binding and oxidized LDL catabolism.
OVOL2_MOUSE (Q8CIV7) GOA for Transcription factor Ovo-like 2 Form 1 - long:GO:0045892 IDA - negative regulation of transcription, DNA-dependent Form 2 – short: GO:0045893 IDA - positive regulation of transcription, DNA-dependent - Gene. 2004 336:47-58. PMID:15225875 274 aa
Implications of Protein Evolution B.subtilis Human Mouse Chimp Worm Yeast E.coli Rat Fly • Conclusions from experiments performed on proteins from one organism are often applicable to the homologous protein from another organism. • Information learned about existing proteins allows us to infer theproperties of ancestral proteins. Common ancestor
Functional convergence • Protein classes of the same function derived from different evolutionary origins, e.g. carbonate dehydratase (or carbonic anhydrase EC 4.2.1.1), which has three independent gene families with functional convergence. Animal and prokaryotic type Plant and prokaryotic type Archaea type
Functional divergence Gene Duplication (TGM3/EPB42 split) Speciation (Human/mouse split) Human TGM3 branch TGM3 (Human) Mouse TGM3 (Mouse) Human EPB42 (Human) EPB42 branch Mouse EPB42 (Mouse) TGM3 (Human) TGM3 (Mouse) EPB42 (Human) EPB42 (Mouse) TGM3 = Protein-glutamine gamma-glutamyltransferase (Transglutaminase; involved in protein modification) EBP42 = Erythrocyte membrane protein band 4.2 (Constituent of cytoskeleton; involved in cell shape)
Pfam Domain PRO protein domain protein Root Level is_a has_part GOGene Ontology • Family-Level Distinction • Derivation: common ancestor • Source: PIRSF family translation product of an evolutionarily-related gene molecular function is_a has_function • Gene-Level Distinction • Derivation: specific gene • Sources: PIRSF subfamily, Panther subfamily biological process translation product of a specific gene participates_in is_a cellular component • Sequence-Level Distinction • Derivation: specific allele or splice variant • Source: UniProtKB part_of (for complexes) located_in(for compartments) translation product of a specific mRNA derives_from OMIM Disease • Modification-Level Distinction • Derived from post-translational modification • Source: UniProtKB disease cleaved/modified translation product agent_in SOSequence Ontology Example: TGF-beta receptor phosphorylated smad2 isoform1 is a phosphorylated smad2 isoform1 derives_from smad2 isoform 1 is a smad2 is a TGF-b receptor-regulated smad is a smad is aprotein ProForm Modification Level sequence change Sequence Level has_agent(sequence change) agent_of (effect on function) Gene Level ProEvo PSI-MODModification Family Level protein modification Root Level has_modification Protein Ontology (PRO) http://pir.georgetown.edu/pro/
Mothers against decapentaplegic homolog 2 Smad 2 GO annotation of SMAD2_HUMAN: Cellular Component: - nucleus - cytoplasm Molecular Function:- protein bindingBiological Process: - signal transduction - regulation of transcription, DNA-dependent
TGF-b TGF-beta receptor II I Smad 2 1 phosphorylation Smad 4 Smad 2 P P CAMK2 ERK1 2 complex formation Smad 2 P P P P Smad 2 P P P P Smad 4 Cytoplasm 3 nuclear translocation Smad 2 P P P Smad 4 Nucleus 4 DNA binding ++ Transcription Regulation
Smad 2 P P Smad 2 P P P Smad 2 P P P Smad 2 P P x Smad 2 Smad2 gene products Forms Location ID SMAD2_HUMAN Smad 2 SMAD2_HUMAN SMAD2_HUMAN SMAD2_HUMAN SMAD2_HUMAN Smad 2 SMAD2_HUMAN SMAD2_HUMAN
Search:smad2 -> 21 protein terms http://pir.georgetown.edu/cgi-bin/pro/textsearch_pro
PRO hierarchy for smad2 isoform 2 Root:Protein
Summary • The vision of the biomedical ontology community is that all biomedical knowledge and data are disseminated on the Internet using principled ontologies, such that they are semantically interoperable and useful for improving biomedical science and clinical care. • The scope extends to all knowledge and data that is relevant to the understanding or improvement of human biology and health. • Knowledge and data are semantically interoperable when they enable predictable, meaningful, computation across knowledge sources developed independently to meet diverse needs. • Principled ontologies are ones that follow NCBO-recommended formats and methodologies for ontology development, maintenance, and use.