350 likes | 560 Views
Modeling Functional Genomics Datasets CVM8890-101. Lesson 2 13 June 2007 Teresia Buza. Lesson 2: Introduction to functional annotation. Orthologs and homologs; clusters of orthologous genes (COGs) and the gene ontology (GO); and how to find what functional annotation is available.
E N D
Modeling Functional Genomics DatasetsCVM8890-101 Lesson 2 13 June 2007 Teresia Buza
Lesson 2: Introduction to functional annotation. Orthologs and homologs; clusters of orthologous genes (COGs) and the gene ontology (GO); and how to find what functional annotation is available.
Introduction to Functional • Annotation
Where are we? Central Dogma New technology Genomic hypothesis Genome Gene ATGTCCTATCCATGTCGTACAGATTGACGAGAT Genome sequencing mRNA transcript Transcriptome Transcript profiling Protein Proteome Protein quantification What is all this? Structural annotation What next? Functional annotation
Genome Annotation • Biologists refer to both the annotation of the genome and functional annotation of gene products: • “Structural” Annotation • & • “Functional” Annotation
Structural & Functional Annotation • Structural annotation • Identification of genomic elements. • ORFs predicted during genome assembly • Location of ORFs • Gene structure • Coding regions • Location of regulatory motifs etc • Functional annotation • Attaching biological information to genomic elements. • Biochemical function • Biological function • Involved regulation and interactions • Expression etc • These steps may involve both biological experiments and in • silico analysis. http://en.wikipedia.org/wiki/Genome_annotation#Genome_annotation (with modifications)
Why Functional Annotation? Enables you to take large “laundry lists” of genes/proteins and turn them into a biologically useful model
Functional Annotation • Annotation of gene products = Gene Ontology (GO) annotation • Initially, predicted ORFs have no functional literature and GO annotation relies on computational methods (rapid but ?Quantity vs Quality) • Functional literature exists for many genes/proteins prior to genome sequencing (slow but provide high quality annotations) • GO annotation does not rely on a completed genome sequence!
Types of Functional annotation • Based in direct experimental evidence of function • Experiments in the same ORGANISM example: • Enzyme assays • Binding experiments • Pathway analysis • Synthetic lethals • Functional complementation • Gene mutations • RNAi • 2-hybrid interactions etc • Indirect Evidence of function • Expression analysis • Structure analysis • Sequence analysis
Functional Annotation • Problem: • Many genes/proteins have no annotation • Some have unknown functions • Challenge: • We want to get the maximum functional annotation for modeling our data • Solution: • Read papers (Pubmed etc) • Search for homologs/orthologs of known function • Homologs and orthologs help assign function….
2. Finding Function: orthologs and homologs
What are Homologs, Orthologs, Paralogs? Homolog Is a relationship between genes separated by the event of speciation or genetic duplication Ortholog Orthologs are homologous genes in different species that evolved from a common ancestor gene by speciation. Normally (not always), orthologs retain the same function in the course of evolution. Identification of orthologs is critical for reliable prediction of gene function in newly sequenced genomes. Paralog Paralogs are homologous genes related by duplication within a genome. Paralogs evolve new functions, even if these are related to the original one. http://homepage.usask.ca/~ctl271/857/def_homolog.shtml
Orthologs & Paralogs orthologs Paralogs http://www.ensembl.org/info/data/compara/tree_example1.jpg
How to search for Orthology? • BLAST : http://www.ncbi.nlm.nih.gov/BLAST/ • Sequence alignment search tool • Utilizes heuristic algorithm • MPsrch: http://www.ebi.ac.uk/MPsrch/ • Sequence comparison tool • Implement Smith & Waterman algorithm • Utilizes exhaustive algorithm • Domain analysis: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml • Analysis of regions of sequence homology among sets of proteins that are not all full- • length homologs. • Homology domains often, but not always, correspond to recognizable protein folding • domains • Protein family databases (e.g. COGs & KOGs) • Superfamily: Complete set of proteins having sequence homology over essentially their • full length. • Subfamilies: Incomplete set of homologous proteins which yet encompass proteins of • diverse function
Systems for Functional Annotation 1. Clusters of Orthologous Groups (COGs) Prokaryotes 2. euKaryote Orthologous Groups (KOGs) Eukaryotes 3. Gene Ontology (GO)
COGs & KOGs • Both are based on orthology. • Genes are assigned to broad categories (A-Z) • Each category corresponds to an ancient conserved domain • COGs - prokaryotes • KOGs - eukaryotes
Clusters of Orthologous Groups (COGs) http://www.ncbi.nlm.nih.gov/COG/ Text search: COGs has 25 functional categories (A – Z) in four broad groups • Information storage and processing • Cellular processes and signaling • Metabolism • Poorly characterized
COGs Categories INFORMATION STORAGE AND PROCESSING [J] Translation, ribosomal structure and biogenesis [A] RNA processing and modification [K] Transcription [L] Replication, recombination and repair [B] Chromatin structure and dynamics CELLULAR PROCESSES AND SIGNALING [D] Cell cycle control, cell division, chromosome partitioning [Y] Nuclear structure [V] Defense mechanisms [T] Signal transduction mechanisms [M] Cell wall/membrane/envelope biogenesis [N] Cell motility [Z] Cytoskeleton [W] Extracellular structures [U] Intracellular trafficking, secretion, and vesicular transport [O] Posttranslational modification, protein turnover, chaperones ftp://ftp.ncbi.nih.gov/pub/COG/COG/fun.txt
COGs Categories METABOLISM [C] Energy production and conversion [G] Carbohydrate transport and metabolism [E] Amino acid transport and metabolism [F] Nucleotide transport and metabolism [H] Coenzyme transport and metabolism [I] Lipid transport and metabolism [P] Inorganic ion transport and metabolism [Q] Secondary metabolites biosynthesis, transport and catabolism POORLY CHARACTERIZED [R] General function prediction only [S] Function unknown ftp://ftp.ncbi.nih.gov/pub/COG/COG/fun.txt
Example 1 Classification of COGs by functional categories Tatusov et al., 2000: The COG database: a tool for genome-scale analysis of protein functions and evolution
Example 2 40 35 30 AMX 25 20 15 10 5 0 40 35 30 CTC 25 Decrease 20 Increase 15 10 5 0 40 35 30 25 ENR 20 15 10 5 0 - C D E F G H I J K L M N O P Q R S T U V COG categories Effects of Antibiotics on Pasteurella multocida transcriptome Nanduri et al 2006
The Gene Ontology (GO) • The Gene Ontology (GO) is the de facto Standard for functional annotation • GO functional annotation is based on orthology AND direct experimental evidence • GO terms allow much more detailed functional analysis (> 24,000 terms) than COGs & KOGs (25 broad terms) • GO is a controlled vocabulary of terms split into three related ontologies covering basic areas of molecular biology: • molecular function: 8,123 terms • biological process: 13,960 terms • cellular component: 2,071 terms GO Report 2007- 04
Example 3 Functional Annotation of Chicken Proteomic data Cellular Component
Use GO for……. • Modeling function in high-throughput datasets (arrays!) started by Fly, Yeast, Mouse (Ashburner et al 2000, 2001) • Grouping gene products by biological function • Determining which classes of gene products are over-represented or under-represented • Focusing on particular biological pathways and functions (hypothesis-driven) • Relating a protein’s location to its function
Annotating to the GO • Need to show type of evidence of function • Literature curation: read and interpret reviewed literature (IDA, IGI, IMP, IPI, IGC) (TAS, NAS) • Computational analysis (RCA, ISS, IEA) http://www.geneontology.org/GO.evidence.shtml
How to find functional annotation • For quick search you need to know: • Name of your species (e.g Sus scrofa, Aspergillus flavus) • Taxonomy ID (e.g 9823 – S. scrofa, 5059 – A. flavus etc) • Database to look in (e.g. NCBI, Uniprot, EBI-GOA, GOC, AgBase etc) • Not all functional annotation for a species will be in one database! • Not very many species have a broad coverage of GO annotation… BUT do not worry • Search for their homologs might help • May rely on manual annotation from literature (Refer Manual annotation Course on by Fiona McCarthy)
Are the genes/proteins in GenBank? Check by Taxon ID Functional annotation Yes No Known? NM_, NP_ Annotate by structural/sequence similarity ORTHOLOGS (ISS code) Annotate by structural/sequence similarity ORTHOLOGS (ISS code) Annotate by structural/sequence similarity ORTHOLOGS (ISS code) UniParc/IPI Yes No GO No GO No GO Manual annotations from literature (IDA, IMP, IPI, IGI, IEP codes) Manual annotations from literature (IDA, IMP, IPI, IGI, IEP codes) Manual annotations from literature (IDA, IMP, IPI, IGI, IEP codes) UniProtKB GOA make GO annotations (IEA) using automated methods Fill in GO association file GOA collect all GO annotations & submit to GOC Submit to AgBase (Agricultural Species) • GOC maintain annotation files • unfiltered GOA • filtered GOA GOA maintain annotation file AgBase maintains annotation file