Learn about the importance of ontology in bioinformatics, how it benefits biologists in their work, and the Gene Ontology (GO) as a structured vocabulary for describing gene products. Explore the three hierarchies of GO and how gene annotation using GO terms can enhance research coordination.
What Is an Ontology? • An ontology is a set of terms, relationships and definitions that capture the knowledge of a certain domain. (common ontology ≠ common knowledge) • Terms represent a controlled vocabulary, and define the concepts of a domain. • Terms are linked by relationships, which constitute a semantic network. • Ontologies augment natural language annotations and can be more easily processed computationally. (An ontology becomes the language of the domain it describes, used for communication, coordination and collaboration.)
Why We Need Ontology in Bioinformatics • Biologists need knowledge in order to perform their work. • Example: sequence comparison to infer function. • Biologists need knowledge for communication, but such knowledge may be represented in different ways. • Different uses of the word "gene": • The coding region of DNA • A DNA fragment that can be transcribed and translated into a protein • A named DNA region of biological interest that carries a genetic trait or phenotype
The Gene Ontology (GO) • Provides structured vocabularies for describing gene products in the domain of molecular biology. • Enables a common understanding across model organisms and between databases. • Consists of three structurally unlinked hierarchies (molecular function, biological process and cellular component). • 2 types of relationships between terms: • is-a: subclass. • part-of: physical part of, or subprocess of.
Why Gene Ontology? • Without structured vocabularies, different sources can refer to the same concept using different terms (e.g., cdc54 in yeast is MCM4 in mouse). • What is a well-known shorthand in one research community is gibberish in another. Contributions by one research community may not be recognized by others. • Without coordination, research work may be duplicated. • The goal of the Gene Ontology Consortium is to produce a controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing.
Three GO Hierarchies • Molecular function: elemental activity/task (what) • (e.g., DNA-binding, polymerase, transcription factor) • (what a gene does at the biochemical level) • Biological process: goal or objective (why) • (e.g., mitosis, DNA replication, cell cycle control) • (A broad biological perspective – not currently a pathway) • Cellular component: location within cellular structures and macromolecular complex (where) • (e.g., nucleus, ribosome, pre-replication complex) (Each GO hierarchy has a DAG structure. A child term may have many parent terms) (Gene Ontology information can be accessed at http://www.geneontology.org/)
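The DAG structure can be sketched in a few lines of Python. This is only an illustration: the function name `ancestors` and the dictionary layout are assumptions, and the parent map contains just the two is_a parents that the GO file lists for GO:0000001 (mitochondrion inheritance), with their own parents left out.

```python
from collections import deque

# Direct parents of each term. Only the two is_a links of GO:0000001 are
# filled in; the parents' own links are omitted to keep the sketch short.
PARENTS = {
    "GO:0000001": ["GO:0048308", "GO:0048311"],
    "GO:0048308": [],
    "GO:0048311": [],
}

def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    queue = deque(parents.get(term, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # a term reached through two parents is counted once
        seen.add(current)
        queue.extend(parents.get(current, []))
    return seen

print(sorted(ancestors("GO:0000001", PARENTS)))
```

Because a child may have several parents, the traversal keeps a `seen` set so that a shared ancestor is visited only once.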
Example: Gene Ontology Hierarchy [Figure: part of the hierarchy under Biological process (GO:0008150), with edges labeled i (is-a) or P (part-of). Terms shown: Development (GO:0007275), Cellular process (GO:0009987), Physiological (GO:0007582), Behavior (GO:0007610), Communication (GO:0007154), Cell death (GO:0008219), Cell growth (GO:0008151), Cell aging (GO:0007569), Programmed (GO:0012501), Induction (GO:0012502), Apoptosis (GO:0006915), HS response (GO:0009626), Autophagic cell death (GO:0048102).]
Gene Annotation Using GO Terms • Association of GO terms with gene products based on evidence from literature reference or computational analysis. • The creation of GO and the association of GO terms with gene products (gene annotation) are two independent operations. • A gene can be associated with one or more GO terms (gene categories), and one category normally has many genes (many-to-many relationship between genes and GO terms)
Gene Product Associations to an Ontology [Figure: schema linking gene products from yeast, fly and mouse to the ontology. Fields shown: ID, Term, Definition, Ontology, Synonyms; Is-a | Part-of, Node1 ID, Node2 ID; GO ID, DB ID, Evidence code, Reference, Citation, NOT.]
Genes of a Biological Process Tend to Be Co-Regulated [Figure: table with columns Biological Process and Gene Names.]
Use Gene Ontology (GO) to Annotate Genes • GO URL: http://www.geneontology.org/ • Two concepts: • Gene Ontology: Provides structured vocabularies for describing gene products in the domain of molecular biology (all species share the same gene ontology) • Annotations: Association of GO terms with gene products based on evidence from literature reference or computational analysis (each species has a separate annotation file)
The Gene Ontology (GO) • GO file: http://www.geneontology.org/ontology/gene_ontology.obo • An example of GO term • [Term] • id: GO:0000001 (A unique id for the GO term) • name: mitochondrion inheritance (The name of the GO term) • namespace: biological_process (see next slide) • def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [PMID:10873824, PMID:11389764, SGD:mcc] (A detailed description of the GO term) • is_a: GO:0048308 ! organelle inheritance • is_a: GO:0048311 ! mitochondrion distribution
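A term stanza like this one is simple to read programmatically. The sketch below is a minimal hand-rolled parser, not an official OBO reader: `parse_term` is a hypothetical helper, it handles only the tags shown on this slide, and the `def` line is left out of the sample to keep it short.

```python
STANZA = """[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
"""

def parse_term(stanza):
    """Turn one [Term] stanza into a dict; is_a values are collected in a list."""
    term = {"is_a": []}
    for line in stanza.splitlines():
        if line.startswith("[") or ":" not in line:
            continue  # skip the [Term] header and blank lines
        tag, value = line.split(":", 1)
        value = value.strip()
        if tag == "is_a":
            # Keep only the parent id; the text after "!" is a readable comment.
            term["is_a"].append(value.split("!")[0].strip())
        else:
            term[tag] = value
    return term

term = parse_term(STANZA)
print(term["id"], term["name"], term["is_a"])
```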
Gene Annotation Using GO Terms • http://www.geneontology.org/GO.current.annotations.shtml • Select the annotation file for a particular species • An example of an annotation entry for yeast • SGD S000004660 AAC1 GO:0005743 SGD_REF:S000050955|PMID:2167309 TAS C ADP/ATP translocator YMR056C gene taxon:4932 • “SGD” is the source database and “S000004660” is the gene's ID in that database • “AAC1” is the gene name • “GO:0005743” is the GO id; it links to the corresponding term in the ontology file • “SGD_REF:S000050955|PMID:2167309” is the reference this annotation comes from • “TAS” is the evidence code (traceable author statement) • “C” means this annotation belongs to the “cellular component” namespace • “ADP/ATP translocator” is a brief description of the gene product • “YMR056C” is another name for this gene • “gene” is the type of the annotated object • “taxon:4932” means this is a yeast gene
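An entry like the one above can be split into named fields. The sketch below assumes the file is tab-separated and follows the usual GO annotation column order; the slide shows only the non-empty values, so the two empty columns (qualifier and with/from) and the field names in `GAF_COLUMNS` are assumptions.

```python
# Assumed column order for the first 13 fields of an annotation line.
GAF_COLUMNS = [
    "db", "db_object_id", "symbol", "qualifier", "go_id", "reference",
    "evidence_code", "with_from", "aspect", "name", "synonym",
    "object_type", "taxon",
]

# The yeast example, rebuilt with tabs; empty strings mark columns that
# are assumed to be blank in the original entry.
LINE = "\t".join([
    "SGD", "S000004660", "AAC1", "", "GO:0005743",
    "SGD_REF:S000050955|PMID:2167309", "TAS", "", "C",
    "ADP/ATP translocator", "YMR056C", "gene", "taxon:4932",
])

def parse_annotation(line):
    """Map the tab-separated fields of one annotation line to column names."""
    fields = line.rstrip("\n").split("\t")
    return dict(zip(GAF_COLUMNS, fields))

entry = parse_annotation(LINE)
print(entry["symbol"], entry["go_id"], entry["evidence_code"], entry["aspect"])
```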
Gene Annotation Using GO Terms Given a list of genes L from a specific species Sj 1) go to http://www.geneontology.org/GO.current.annotations.shtml 2) select and download the annotation file Fj for Sj For each gene Gi in list L 3) find the annotation entry Ek for Gi in Fj 4) find the GO term id from entry Ek 5) go to http://www.geneontology.org/ontology/gene_ontology.obo 6) find the GO term in the ontology file, the GO term provides more detailed annotation for this gene
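Once the annotation file and the ontology file have been loaded, steps 3 to 6 reduce to two dictionary lookups. The sketch below assumes both files are already in memory as plain dictionaries; `annotate_genes` and the dictionary layouts are hypothetical, and the single entry uses the AAC1 example (the term name for GO:0005743 is given from memory and should be checked against the current ontology file).

```python
def annotate_genes(genes, annotations, ontology):
    """For each gene, return the (GO id, term name) pairs annotated to it."""
    result = {}
    for gene in genes:
        go_ids = annotations.get(gene, [])           # steps 3-4: entry -> GO ids
        result[gene] = [
            (go_id, ontology.get(go_id, {}).get("name"))  # steps 5-6: id -> term
            for go_id in go_ids
        ]
    return result

# gene name -> GO ids, as would be read from the species annotation file
annotations = {"AAC1": ["GO:0005743"]}
# GO id -> term record, as would be read from the ontology file
ontology = {
    "GO:0005743": {
        "name": "mitochondrial inner membrane",  # verify against the GO file
        "namespace": "cellular_component",
    }
}

print(annotate_genes(["AAC1", "UNKNOWN_GENE"], annotations, ontology))
```

A gene with no annotation entry simply maps to an empty list, which keeps the many-to-many relationship between genes and terms explicit.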
Use of GO to Annotate Genes Problem: Given a list of n genes, are they significantly associated with a specific GO term? Solution: Calculate the p-value. Notation: Total number of genes in the data set: N. Total number of genes assigned to term T: M. Number of genes in the list: n. Number of genes in the list and assigned to term T: m.
How to Assess Overrepresentation of a GO Term? Genes on an array: Total number of genes (N): 2,285 Number of genes – cell cycle (M): 161 Genes in a cluster: Number of genes in the cluster (n): 147 Number of genes – cell cycle (m): 25 Is the GO term (i.e., cell cycle) significantly overrepresented in the cluster?
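A quick first check is to compare the observed count with the count expected by chance. The arithmetic below uses only the numbers on this slide; the symbol E[m] for the expected count is introduced here for illustration.

```latex
E[m] = n \cdot \frac{M}{N} = 147 \times \frac{161}{2285} \approx 10.4,
\qquad
\frac{m}{E[m]} = \frac{25}{10.4} \approx 2.4
```

The cluster contains roughly 2.4 times as many cell cycle genes as random selection would give; whether that excess is significant is what the p-value decides.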
Given that M of the N genes in the data set are associated with term T, if n genes are drawn at random from the data set, what is the probability that exactly m of the n selected genes are associated with T? Hypergeometric distribution: $P(X = m) = \binom{M}{m}\binom{N-M}{n-m} \Big/ \binom{N}{n}$
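Based on the hypergeometric distribution, the probability that a random list of n genes has fewer than m genes associated with T is obtained by summing the probabilities of the list having 0, 1, …, m − 1 genes associated with T. The p-value of over-representation is one minus that sum, i.e. the probability of seeing m or more such genes by chance: P-value: $p = 1 - \sum_{i=0}^{m-1} \binom{M}{i}\binom{N-M}{n-i} \Big/ \binom{N}{n}$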
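The over-representation p-value can be computed directly with the standard library. This is a minimal sketch that uses exact fractions to avoid rounding in the sum; the function names are assumptions, and the example call uses the cell cycle numbers from the earlier slide (N = 2,285, M = 161, n = 147, m = 25).

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(k, N, M, n):
    """P(exactly k of n randomly drawn genes are associated with term T)."""
    return Fraction(comb(M, k) * comb(N - M, n - k), comb(N, n))

def overrepresentation_p_value(N, M, n, m):
    """P(m or more of the n drawn genes are associated with T)."""
    # Sum the probabilities of 0, 1, ..., m - 1 associated genes.
    lower_tail = sum(hypergeom_pmf(k, N, M, n) for k in range(m))
    return float(1 - lower_tail)

p = overrepresentation_p_value(N=2285, M=161, n=147, m=25)
print(f"p-value for the cell cycle example: {p:.3g}")
```

When many GO terms are tested at once, the individual p-values usually need a multiple-testing correction; that step is outside this sketch.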
MAPPFinder • A tool for mapping gene expression data to the GO hierarchies. • Part of the free software package GenMAPP. • Available at http://www.genmapp.org/. (Doniger et al., 2003)
MAPPFinder Sample Output (Doniger et al., 2003)
GoMiner • A client-server application using Java (data on the server side). • Available at http://discover.nci.nih.gov/gominer/. (Zeeberg et al., 2003)
[Figure: results table with columns p, GO and # genes, for genes linked to poor breast cancer outcome.] Onto-Express • A web application for GO-based microarray data analysis (http://vortex.cs.wayne.edu/Projects.html). • The input to Onto-Express is a list of Affymetrix probe IDs, GenBank sequence accessions or UniGene cluster IDs. • Part of the integrated Onto-Tools, including: • Onto-Compare: compare commercial arrays. • Onto-Design: help array design (probe selection). • Onto-Translate: provide mapping of different IDs.