The Gene Ontology project and its application to fission yeast functional genomics data. Valerie Wood. Introduction to the Gene Ontology (GO) project. What is GO? (requirement, implementation). How does it work? (annotation and ontology development).
The Gene Ontology project and its application to fission yeast functional genomics data Valerie Wood
Introduction to the Gene Ontology (GO) project • What is GO? (requirement, implementation) • How does it work? (annotation and ontology development) • What can I use it for? (applications) • How can I use it? Practical exercises • Tools for using GO for data analysis • Data mining the fission yeast genome data
Gene Ontology Why?
Traditional analysis • requires literature searching • gene by gene basis • time-consuming
Gene 1: mRNA export; protein phosphorylation; transcription; mitotic cell cycle; …
Gene 2: mRNA export; DNA recombination; RNA elongation (pol II); …
Not scalable!
Gene 1: mRNA export; protein phosphorylation; transcription; mitotic cell cycle; …
Gene 2: mRNA export; DNA recombination; RNA elongation (pol II); …
Gene 3: mRNA export; transcription (pol II); …
Gene 4: mRNA export; transcription; polyadenylation; …
Gene 5: mRNA export; RNA elongation; …
Gene 6: mRNA export; rRNA transcription; DNA topological change; …
Gene 5000: cell cycle; chromosome segregation; kinetochore assembly; protein localization; …
Help! The problem gets bigger and bigger and bigger! http://www.teamtechnology.co.uk/f-scientist.jpg
What is the size of the ‘annotation problem’? The literature corpus: • Fission yeast + pombe gives 8170 results • Including cell cycle gives 3467 • Including DNA repair gives 555 How will we ever extract all of this information?
Grouping by process
• Cell cycle: Gene 1, Gene 7, Gene 8, …
• transcription: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5, …
• mRNA export: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5
• protein phosphorylation: Gene 1, Gene 7, Gene 10, …
• cell wall organization and biogenesis: Gene 10, Gene 15, Gene 18, …
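The regrouping on this slide amounts to inverting a gene-to-process table. A minimal sketch, assuming the annotations are already available as a Python dictionary; the gene names are the placeholders from the slides, and `group_by_process` is a hypothetical helper, not part of any GO tool:

```python
from collections import defaultdict

# Illustrative gene -> process annotations taken from the example slides.
# "Gene 1" etc. are placeholders from the talk, not real identifiers.
gene_to_processes = {
    "Gene 1": ["mRNA export", "protein phosphorylation", "transcription", "mitotic cell cycle"],
    "Gene 2": ["mRNA export", "DNA recombination", "RNA elongation (pol II)"],
    "Gene 3": ["mRNA export", "transcription (pol II)"],
    "Gene 4": ["mRNA export", "transcription", "polyadenylation"],
}


def group_by_process(annotations):
    """Invert a gene -> processes mapping into process -> sorted list of genes."""
    grouped = defaultdict(list)
    for gene, processes in annotations.items():
        for process in processes:
            grouped[process].append(gene)
    return {process: sorted(genes) for process, genes in grouped.items()}


process_to_genes = group_by_process(gene_to_processes)
for process, genes in sorted(process_to_genes.items()):
    print(f"{process}: {', '.join(genes)}")
```

Once the annotations use shared terms, one pass over the table replaces the gene by gene literature search.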
GO can be used to spot patterns in the thousands of genes typically obtained from functional genomics data. [Figure: time course comparing control and attacked samples, with gene clusters labelled defense response, immune response, response to stimulus, Toll regulated genes, JAK-STAT regulated genes, puparial adhesion, molting cycle, hemocyanin, amino acid catabolism, lipid metabolism, peptidase activity, protein catabolism] Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL and Eugene Schuster Group, EBI.
A controlled vocabulary GO is also necessary for handling different terminology used between and within scientific communities: • Different phrases have the same or related meanings • The same phrase is used to describe different ‘entities’
Different phrases, same meaning: • late endosome to vacuole transport • MVB sorting • multivesicular body sorting → late endosome to vacuole transport ; GO:0045324
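In code, a controlled vocabulary behaves like a lookup from free-text phrases to a single identifier. A minimal sketch, assuming a hand-built synonym table holding only the three phrases on this slide; `SYNONYMS`, `TERM_NAMES` and `normalise` are hypothetical names for illustration:

```python
# Hypothetical synonym table: every phrase from the slide maps to one GO ID.
SYNONYMS = {
    "late endosome to vacuole transport": "GO:0045324",
    "mvb sorting": "GO:0045324",
    "multivesicular body sorting": "GO:0045324",
}

# GO ID -> preferred term name, as shown on the slide.
TERM_NAMES = {"GO:0045324": "late endosome to vacuole transport"}


def normalise(phrase):
    """Return (GO ID, term name) for a known phrase, or None if it is not in the table."""
    key = " ".join(phrase.lower().split())  # ignore case and extra whitespace
    go_id = SYNONYMS.get(key)
    if go_id is None:
        return None
    return go_id, TERM_NAMES[go_id]


print(normalise("MVB sorting"))
print(normalise("Multivesicular body sorting"))
```

Both calls resolve to the same identifier, which is what makes annotations from different communities comparable.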
Same phrase, different meanings: Bud initiation? • tooth bud initiation • cellular bud initiation • flower bud initiation
So what is GO? • GO provides a “controlled vocabulary” for biological knowledge that can be interpreted identically both within and between genomes • Species independent, therefore enabling cross-species comparisons • Provides a way to capture and represent biological knowledge in a computable form
Gene Ontology Content and structure
What is Ontology? • Dictionary: A branch of metaphysics concerned with the nature and relations of being. • In philosophy, the most fundamental branch of metaphysics. It studies being or existence as well as the basic categories thereof—trying to find out what entities and what types of entities exist. – Wikipedia 1606 1700s
So what does that mean? From a practical view, an ontology is the representation of something we know about. “Ontologies” consist of a representation of things that are detectable or directly observable, and the relationships between those things. Ontologies provide controlled, consistent vocabularies to describe concepts and relationships, thereby enabling knowledge sharing (Gruber 1993).
Ontology Includes: • A vocabulary of terms (names for concepts) • Definitions • Defined logical relationships to each other
What information might we want to capture about a gene product? GO is divided into three parts: • What does the gene product do? (molecular function) • Where and when does it act? (cellular component) • Why does it perform these activities? (biological process)
Cellular Component • where a gene product acts (location or complex) Images from http://microscopy.fsu.edu
Molecular Function • What a gene product does (activity) • Examples: insulin binding, insulin receptor activity, drug transporter activity, glucose-6-phosphate isomerase activity
Biological Process • Broad objective or goal • Examples: transcription, cell division, gluconeogenesis
Analogy: Gene Product = hammer
Function (what) → Process (why)
• Drive stake (into soil) → Gardening
• Drive nail (into wood) → Carpentry
• Smash roach → Pest Control
Ontology Structure • The Gene Ontology is structured as a directed acyclic graph (DAG) • A DAG is similar to a hierarchy except terms can have more than one parent • Terms can have zero, one or more children • Terms are linked by two relationships • is-a • part-of
Parent-Child Relationships
• DAG (Directed Acyclic Graph): many-to-many parental relationship; each child may have one or more parents
• Hierarchy: one-to-many parental relationship; each child has only one parent
Ontology Structure [Figure: example graph with the terms cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane, linked by is-a and part-of relationships]
Ontology structure • This allows the modelling of biology more realistically than a hierarchy
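A minimal sketch of how a DAG differs from a hierarchy in code, using the term names from the chloroplast and membrane example. The edges below are assumed for illustration (the original diagram is not reproduced here), and `multi_parent_terms` and `is_acyclic` are hypothetical helper names:

```python
# child -> list of (relationship, parent). Edges are ASSUMED for illustration.
PARENTS = {
    "cell": [],
    "membrane": [("part_of", "cell")],
    "chloroplast": [("part_of", "cell")],
    "mitochondrial membrane": [("is_a", "membrane")],
    "chloroplast membrane": [("is_a", "membrane"), ("part_of", "chloroplast")],
}


def multi_parent_terms(parents):
    """Terms with more than one parent: allowed in a DAG, impossible in a strict hierarchy."""
    return sorted(term for term, edges in parents.items() if len(edges) > 1)


def is_acyclic(parents):
    """Depth-first search along child -> parent edges.

    Meeting a term that is still on the current path means the graph has a cycle.
    """
    state = {}  # term -> "visiting" or "done"

    def visit(term):
        if state.get(term) == "done":
            return True
        if state.get(term) == "visiting":
            return False
        state[term] = "visiting"
        for _relationship, parent in parents.get(term, []):
            if not visit(parent):
                return False
        state[term] = "done"
        return True

    return all(visit(term) for term in parents)


print(multi_parent_terms(PARENTS))
print(is_acyclic(PARENTS))
```

Storing parents per child keeps the two relationship types on the edges, so a term such as chloroplast membrane can be both a kind of membrane and a part of the chloroplast.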
Ontology structure • An important feature of GO is that broader parents give rise to more specific children • When a gene is annotated to a term, it is automatically annotated to all of its parent terms • This allows curators to assign terms at different levels of granularity, depending on what is known or can be inferred [Figure label: gene A]
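The automatic annotation to parent terms can be sketched as a walk up the graph. This is an illustration only: the small component graph and the annotation of "gene A" to nuclear chromosome are assumptions, and `ancestors` and `propagate` are hypothetical helpers rather than functions from a GO library:

```python
# child -> parents (is-a and part-of edges merged). ASSUMED toy graph.
PARENTS = {
    "cell": [],
    "nucleus": ["cell"],
    "chromosome": ["cell"],
    "nuclear chromosome": ["chromosome", "nucleus"],
}


def ancestors(term, parents):
    """Collect every term reachable by following parent edges upward."""
    found = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def propagate(direct, parents):
    """Expand each gene's direct annotations with all ancestor terms."""
    full = {}
    for gene, terms in direct.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[gene] = expanded
    return full


direct_annotations = {"gene A": {"nuclear chromosome"}}
all_annotations = propagate(direct_annotations, PARENTS)
print(sorted(all_annotations["gene A"]))
# -> ['cell', 'chromosome', 'nuclear chromosome', 'nucleus']
```

A query for genes annotated to "chromosome" therefore also finds genes that were curated at the more specific term.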
True Path Rule • Every path from any term back to its top-level parent(s) must always be true (biologically accurate), or the ontology must be revised cell • cytoplasm • chromosome • nuclear chromosome • cytoplasmic chromosome • mitochondrial chromosome • nucleus • nuclear chromosome • is-a • part-of
Anatomy of a GO term
id: GO:0006094 (unique GO ID)
name: gluconeogenesis (term name)
namespace: process (ontology)
def: The formation of glucose from noncarbohydrate precursors, such as pyruvate, amino acids and glycerol. (definition; def source: http://cancerweb.ncl.ac.uk/)
exact_synonym: glucose biosynthesis (synonym)
is_a: GO:0006006 (parentage)
is_a: GO:0006092 (parentage)
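Because a term is stored as plain `tag: value` lines, it is easy to read programmatically. A minimal sketch, assuming the stanza is exactly the one shown on this slide; `parse_stanza` is a hypothetical, simplified reader and does not handle the full syntax of real ontology files:

```python
STANZA = """\
id: GO:0006094
name: gluconeogenesis
namespace: process
def: The formation of glucose from noncarbohydrate precursors, such as pyruvate, amino acids and glycerol.
exact_synonym: glucose biosynthesis
is_a: GO:0006006
is_a: GO:0006092
"""


def parse_stanza(text):
    """Parse 'tag: value' lines into a dict of tag -> list of values.

    Values are kept in lists because tags such as is_a can repeat.
    """
    term = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        tag, _, value = line.partition(": ")  # split at the first ': ' only
        term.setdefault(tag, []).append(value)
    return term


term = parse_stanza(STANZA)
print(term["id"][0], term["name"][0])
print("parents:", term["is_a"])
```

The two `is_a` values are the term's parents, which is how the graph structure is encoded in the file.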
No GO Areas • GO covers ‘normal’ functions and processes • No pathological processes • No experimental conditions • NO evolutionary relationships • NO gene products • NOT a system of nomenclature for genes
Things to remember • A gene product may have several functions, processes or components • Sets of functions make up a biological process • A function term refers to a reaction or activity, NOT a gene product • …..
For each gene: • Read and record paper (PMID: *****) • Identify GO terms (GO:****) • What type of evidence? (IDA in this example)
Resulting annotation: ******* GO:******* IDA PMID:*******
GO:0019789 SUMO ligase activity: many groups annotate, so we see the results of research across species. [Figure: gene products annotated to this term by MGI, TAIR, RGD, GeneDB S. pombe and SGD, with labels Pias2, Pias3, Pias4, ATSIZ1, Miz1, CST9, pli1, NFI1, MMS21, nse2, SIZ1]
Fission yeast annotation progress: total 30,616 annotations to 3080 terms. Data from 06/06/07. [Figure: worked example with Molecular Function: citrate synthase (Acetyl-CoA, CoA-SH), Biological Process: TCA cycle, and Cellular Component; per-aspect counts shown in the figure: 7519, 9459, 13494]
Evidence Codes used
• 8618 IDA inferred from direct assay
• 776 IPI inferred from physical interaction
• 901 IGI inferred from genetic interaction
• 1089 TAS traceable author statement
• 1073 IC inferred by curator
• 9045 ISS inferred from sequence similarity
• 1912 IMP inferred from mutant phenotype
• 522 NAS non-traceable author statement
• 6397 IEA inferred from electronic annotation
• 30333 total
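The evidence-code counts can be checked and summarised with a few lines of code. A minimal sketch, assuming each count on the slide belongs to the code that follows it; the split into manual and electronic follows the curation strategy, where only IEA comes from computational mappings:

```python
# Counts copied from the slide (assumption: each count precedes its code).
EVIDENCE_COUNTS = {
    "IDA": 8618,
    "IPI": 776,
    "IGI": 901,
    "TAS": 1089,
    "IC": 1073,
    "ISS": 9045,
    "IMP": 1912,
    "NAS": 522,
    "IEA": 6397,
}

total = sum(EVIDENCE_COUNTS.values())
electronic = EVIDENCE_COUNTS["IEA"]   # computational mappings
manual = total - electronic           # every other code involves a curator

print(f"total annotations: {total}")
print(f"manual: {manual}, electronic (IEA): {electronic}")
print(f"electronic share: {electronic / total:.1%}")
```

The computed total matches the 30333 given on the slide, which is a quick sanity check that no code was dropped.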
GO Curation Strategy Manual Curation • Emphasis on Primary Literature (IDA, IMP, IGI, IPI) • Manual inspection of sequence similarity (ISS) Computational Mappings (IEA) • InterPro (domain or family) to GO • UniProt (Swissprot keyword to GO) • E.C. number to GO 1617 PMIDs 15230 annotations 9569 annotations 5815 annotations Data from 06/06/07
GO Curation Progress [Chart series: pombe manual, pombe electronic, pombe total, cerevisiae total] • S. pombe: total 30,616 annotations to 3080 GO terms • S. cerevisiae: 27662 annotations to 2971 GO terms (no IEA) Data from 06/06/07
GO aspect coverage • Function: 3542 (includes protein binding) • Biological Process: 4019 • Cellular Component: 4821 • Total: 5004 (5780 S. cerevisiae) • All three aspects unknown: 105 (564 S. cerevisiae) [Venn diagram region counts: 993, 18, 191, 54, 3279 (3455), 679, 672, 14]
Developing GO Adding new terms and biological concepts to the Gene Ontology • GO under constant development • International group of developers (all the major model organism databases contribute) • central editorial office at EBI - 4 members • Developed in consultation with domain experts • Term suggestions handled through online tracking system
Why GO changes • Advances in biology • New organisms join, need new terms • Fix errors and legacy terms • Improve logical consistency • Suggestions for changes come from • the GO editors and organism curators • the user community • Analysis of logical consistency