Finding the genes in microbial genomes. Natalia Ivanova, MGM Workshop, February 3, 2009.
Outline
• Introduction (who said annotating prokaryotic genomes is easy?)
• Tools out there
• Basic principles behind tools
• Known problems of the tools: why you may need manual curation
Finding the genes in microbial genomes
Sequence features in prokaryotic genomes:
• stable RNA-coding genes (rRNAs, tRNAs, RNA component of RNaseP, tmRNA)
• protein-coding genes (CDSs)
• transcriptional features (mRNAs, operons, promoters, terminators, protein-binding sites, DNA bends)
• translational features (RBS, regulatory antisense RNAs, mRNA secondary structures, translational recoding and programmed frameshifts, inteins)
• pseudogenes (tRNA and protein-coding genes)
• …
[Figure: well-annotated bacterial genome in the Artemis genome viewer, with labeled features: rRNA, tRNA, operon, promoter, terminator, protein-coding gene (CDS), protein-binding site]
Outline
• Introduction (who said annotating prokaryotic genomes is easy?)
• Tools out there (don’t bother to write down the names and links; all presentations will be available on the web site)
• Basic principles behind tools
• Known problems of the tools: why you may need manual curation
Tools out there: servers for microbial genome annotation - I
• IMG-ER http://img.jgi.doe.gov/er (submission page: http://durian.jgi-psf.org/~imachen/cgi-bin/Submission/main.cgi)
  Output: stable RNA-encoding genes, CDSs, functional annotations; output in GenBank format
• RAST http://rast.nmpdr.org/
  Output: rRNAs and tRNAs, CDSs, functional annotations; output in several formats
• JCVI Annotation Service http://www.tigr.org/tigr-scripts/AnnotationEngine/ann_engine.cgi
  Output: CDSs, stable RNAs?, functional annotations; format?
Tools out there: servers for microbial genome annotation - II
• AMIGENE http://www.genoscope.cns.fr/agc/tools/amiga/Form/form.php
  Output: CDSs; output in gff format
• RefSeq http://www.ncbi.nlm.nih.gov/genomes/MICROBES/genemark.cgi http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi
  Output: CDSs; output in tbl format
• EasyGene http://www.cbs.dtu.dk/services/EasyGene/
  Output: CDSs; size restriction < 1 Mb
Tools out there: genome browsers for manual annotation of microbial genomes
• Artemis http://www.sanger.ac.uk/Software/Artemis/
  Windows and Linux versions; works with files in many formats, annotated by any pipeline
• Manatee http://manatee.sourceforge.net/
  Linux version only; genome needs to be annotated by the JCVI Annotation Service
• Argo http://www.broad.mit.edu/annotation/argo/
  Windows and Linux; works with files in many formats
Major difference: viewer vs editor?
Tools out there: tools for finding stable (“non-coding”) RNAs - I
Large structural RNAs (23S and 16S rRNAs)
• RNAmmer http://www.cbs.dtu.dk/services/RNAmmer/
Small structural RNAs (5S rRNA, tRNAs, tmRNA, RNaseP RNA component)
• Rfam database, INFERNAL search tool http://www.sanger.ac.uk/Software/Rfam/ http://rfam.janelia.org/ http://infernal.janelia.org/
  Web service: sequence search is limited to 2 kb
• ARAGORN http://130.235.46.10/ARAGORN1.1/HTML/aragorn1.2.html
  Web service: sequence search is limited to 15 kb; finds tRNAs and tmRNAs only
• tRNAscan-SE http://lowelab.ucsc.edu/tRNAscan-SE/
  Web service: sequence search is limited to 5 Mb; finds tRNAs only
Tools out there: tools for finding “non-coding” RNAs - II
Short regulatory RNAs (riboswitches, etc.)
• Rfam database, INFERNAL search tool http://www.sanger.ac.uk/Software/Rfam/ http://rfam.janelia.org/ http://infernal.janelia.org/
  Web service: sequence search is limited to 2 kb; provides a list of pre-calculated RNAs for publicly available genomes
Other (less popular) tools:
• Pipeline for discovering cis-regulatory ncRNA motifs: http://bio.cs.washington.edu/supplements/yzizhen/pipeline/
• RNAz http://www.tbi.univie.ac.at/~wash/RNAz/
Tools out there: finding protein-coding genes (not ORFs!)
• Reading frames: translations of the nucleotide sequence with an offset of 0, 1 and 2 nucleotides (three possible translations in each direction)
• Open reading frame (ORF): the part of a reading frame between a start codon and a stop codon
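The ORF definition above can be made concrete with a short sketch. This is only an illustration, not the method of any tool named in this talk: it assumes the start codons ATG, GTG and TTG and the stop codons TAA, TAG and TGA, and the function names (`find_orfs`, `reverse_complement`) and the `min_codons` cutoff are made up for the example. It reports every ORF in the six reading frames. Most ORFs found this way are not genes, which is why the CDS finders listed next are needed.

```python
# Illustrative sketch: enumerate ORFs in all six reading frames.
# Assumed codon sets (standard bacterial usage); adjust for other genetic codes.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return a list of (strand, frame, start, end) tuples.

    start and end are 0-based, end-exclusive offsets on the scanned strand
    (for strand "-" they refer to the reverse-complemented sequence).
    Each ORF runs from the first start codon after the previous in-frame
    stop through the next in-frame stop codon; min_codons counts the stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if orf_start is None and codon in START_CODONS:
                    orf_start = i  # leftmost start codon gives the longest ORF
                elif codon in STOP_CODONS:
                    if orf_start is not None and (i + 3 - orf_start) // 3 >= min_codons:
                        orfs.append((strand, frame, orf_start, i + 3))
                    orf_start = None
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTGGGTAACC", min_codons=3):
        print(orf)
```

Taking the leftmost start codon always yields the longest possible ORF. Real gene finders instead choose among candidate start codons, which is one source of the inaccurate N-terminal sequences discussed later.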
Tools out there: most popular CDS-finding tools
• CRITICA
• Glimmer family (Glimmer2, Glimmer3, RBS finder) http://glimmer.sourceforge.net/
• GeneMark family (GeneMark-hmm, GeneMarkS) http://exon.gatech.edu/GeneMark/
• EasyGene
• AMIGENE
• PRODIGAL (default JGI gene finder) http://compbio.ornl.gov/prodigal/
Combinations and variations of the above:
• REGANOR (CRITICA + Glimmer3 + pre-processing)
• RAST (Glimmer2 + pre- and post-processing)
Tools out there: metagenome annotation
• BLASTx
• Fgenesb http://linux1.softberry.com/berry.phtml?topic=fgenesb&group=programs&subgroup=gfindb
• GeneMark (GeneMark-hmm for reads, GeneMarkS for longer contigs) http://exon.gatech.edu/GeneMark/
• MetaGene http://metagene.cb.k.u-tokyo.ac.jp/metagene/
• GISMO ? http://www.cebitec.uni-bielefeld.de/groups/brf/software/gismo/
Full-service servers:
• IMG/M-ER: uses GeneMark for Sanger, proxygenes for 454 http://img.jgi.doe.gov/submit
• MG-RAST http://metagenomics.nmpdr.org/
Outline
• Introduction (who said annotating prokaryotic genomes is easy?)
• Tools out there (don’t bother to write down the names and links; all presentations will be available on the web site)
• Basic principles behind tools (very basic, see specific papers for details)
• Known problems of the tools: why you may need manual curation
Basic principles: finding CDSs using evidence-based vs ab initio algorithms
Two major approaches to prediction of protein-coding genes:
• “evidence-based” (ORFs with translations homologous to known proteins are CDSs)
  • Advantages: finds “unusual” genes (e.g. horizontally transferred); relatively low rate of false positive predictions
  • Limitations: cannot find “unique” genes; low sensitivity towards short genes; prone to propagation of false positive results of ab initio annotation tools
• ab initio (ORFs with nucleotide composition similar to CDSs are also CDSs)
  • Advantages: finds “unique” genes; high sensitivity
  • Limitations: often misses “unusual” genes; high rate of false positives
Basic principles: finding protein-coding genes with ab initio methods
• An ORF that is likely to be protein-coding is found by searching for “coding potential”
  • “Coding potential” is defined by comparing the nucleotide sequence of an ORF to a hidden Markov model (HMM)
  • The HMM is generated using a training set from the genome or from average frequencies observed for multiple genomes
  • The probability that an ORF is a protein-coding gene is computed
• The N-terminal (5’) boundary is located by finding a start codon (ATG, GTG, TTG) next to a ribosomal binding site (RBS, Shine-Dalgarno sequence)
  • Different genomes have different frequencies of start codons
  • The RBS is found by multiple sequence alignment (Gibbs sampling) of upstream sequences and represented by a weighted positional frequency matrix
  • Or the RBS is found by multiple sequence alignment and represented as one of the states in an HMM
  • Or ...
[Figure: example of the overall HMM architecture used in EasyGene (from Larsen & Krogh, BMC Bioinformatics, 2003)]
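A rough sketch of the “coding potential” idea follows. It is deliberately much simpler than the tools in this talk: a plain fixed-order Markov chain trained on example sequences stands in for the HMM, codon position is ignored, and the score is a log-odds ratio of a “coding” model against a “non-coding” model. The function names (`train_markov`, `log_odds`), the pseudocount and the toy training sequences are all assumptions made for the illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, order=2, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences.

    Returns a function prob(context, base). Pseudocounts keep unseen
    transitions from getting probability zero.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding_prob, noncoding_prob, order=2):
    """Sum of log(P_coding / P_noncoding) over the sequence.

    Positive scores mean the sequence looks more like the coding
    training set than the non-coding one.
    """
    seq = seq.upper()
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        score += math.log(coding_prob(context, base) / noncoding_prob(context, base))
    return score


if __name__ == "__main__":
    # Toy training data, invented for the example.
    coding = train_markov(["ATGGCGGCGGCGGCGGCGTAA"])
    noncoding = train_markov(["ATATATATATATATATAT"])
    print(log_odds("GCGGCGGCG", coding, noncoding))  # GC-rich, resembles "coding" set
    print(log_odds("ATATATAT", coding, noncoding))   # resembles "non-coding" set
```

In practice the training set matters as much as the model: it can come from the genome itself or from averages over many genomes, as noted above.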
Features and differences between gene finding tools
• Training set selection (evidence-based vs purely ab initio)
  Example: CRITICA and EasyGene use evidence-based training sets (BLASTn with counting of synonymous/non-synonymous codons in CRITICA, BLASTx in EasyGene); Glimmer uses an ab initio training set of long non-overlapping ORFs; GeneMark uses an ab initio heuristic model
• Statistical model of coding and non-coding regions (codon frequencies, dicodon frequencies, hidden Markov models)
  Example: CRITICA uses dicodon frequencies to model coding regions; Glimmer uses interpolated Markov models (IMM) of up to 5th order; GeneMark uses an order-2 HMM for coding regions and an order-0 HMM for non-coding regions
• Statistical model architecture (i.e. which parts of the CDS are explicitly modeled; may include RBS, spacer region, start codon, second codon, internal codons, stop codon, etc.)
  Example: EasyGene explicitly models RBS, spacer region, start codon, second codon, internal codons, stop codon, codons surrounding stop codon, non-coding sequence; all other tools have less comprehensive HMM architectures
• Additional algorithms for refinement of predictions (RBS finder, overlap resolution, etc.)
  Example: Glimmer2.0 has a scoring schema for overlap resolution; Glimmer3 uses a dynamic programming algorithm to select the highest-scoring set of predictions consistent with the maximum allowed overlap
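To give a feel for overlap resolution by dynamic programming, here is a small sketch. It is not Glimmer3's actual algorithm, only a generic weighted-interval selection that follows the one-line description above. The function name `select_genes`, the coordinate convention and the example scores are invented for the illustration. Strand is ignored, scores are assumed positive, and two genes are treated as compatible when the later one starts no more than `max_overlap` bases before the earlier one ends.

```python
def select_genes(predictions, max_overlap=0):
    """Pick the highest-scoring set of mutually compatible predictions.

    predictions: iterable of (start, end, score), end-exclusive coordinates.
    Simplified O(n^2) dynamic programme over predictions sorted by end.
    """
    genes = sorted(predictions, key=lambda g: (g[1], g[0]))
    if not genes:
        return []
    n = len(genes)
    best = [0.0] * n   # best[i]: top total score of a chain ending with gene i
    prev = [-1] * n    # prev[i]: previous gene in that chain, or -1
    for i, (start_i, _end_i, score_i) in enumerate(genes):
        best[i] = score_i
        for j in range(i):
            overlap = genes[j][1] - start_i  # <= 0 means no overlap
            if overlap <= max_overlap and best[j] + score_i > best[i]:
                best[i] = best[j] + score_i
                prev[i] = j
    # Trace back from the best chain end.
    i = max(range(n), key=best.__getitem__)
    chosen = []
    while i != -1:
        chosen.append(genes[i])
        i = prev[i]
    return chosen[::-1]


if __name__ == "__main__":
    candidates = [(0, 300, 5.0), (250, 600, 4.0), (290, 900, 6.0), (600, 1000, 3.0)]
    print(select_genes(candidates, max_overlap=30))
    print(select_genes(candidates, max_overlap=0))
```

Because predictions are sorted by end coordinate, checking each gene only against its predecessor in the chain is enough to keep every pair in the chain within the allowed overlap. Changing `max_overlap` changes which set of genes is called, so the same raw predictions can give different annotations under different settings.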
Outline
• Introduction (who said annotating prokaryotic genomes is easy?)
• Tools out there (don’t bother to write down the names and links; all presentations will be available on the web site)
• Basic principles behind tools (very basic, see specific papers for details)
• Known problems of the tools: why you may need manual curation (more on manual curation in the next talk by Thanos Lykidis)
Known problems: RNAs
• Large rRNAs: RNAmmer is very accurate, but it has been developed very recently
  • variation of rRNA sizes in closely related strains
  • most 16S rRNA are missing anti-Shine-Dalgarno sequence
• Small structural RNAs: covariance models are generally accurate, but may miss some tRNAs in Archaea
  • Check for the full complement of tRNAs with all necessary anti-codons
  • No model for pyrrolysine tRNA
• Small regulatory RNAs: search is accurate but slow (too many models)
  • Annotations of regulatory RNAs are missing from many genomes
Known problems: CDSs
• Short CDSs: many are missed, others are overpredicted
  • short ribosomal proteins (30-40 aa long) are often missed
  • short proteins in the promoter region are often overpredicted
• N-terminal sequences are often inaccurate (many features of the sequence around the start codon are not accounted for)
  • Glimmer2.0 calls genes longer than they should be
  • GeneMark and Glimmer3.0 err both ways, but mostly call genes shorter
• Pseudogenes
  • all tools look for ORFs, which need valid start and stop codons
  • “unique” genes are often predicted on the opposite strand of a pseudogene
• Proteins with unusual translational features (recoding, programmed frameshifts)
  • these genes are often mistaken for pseudogenes
  • see pseudogenes
Supplemental tools
• TIS (translation initiation site) prediction/correction
  • TICO http://tico.gobics.de/
  • TriTISA http://mech.ctb.pku.edu.cn/protisa/TriTISA
  • The two tools often disagree about the best TIS, especially in high GC genomes
• Operon prediction
  • JPOP http://csbl.bmb.uga.edu/downloads/#jpop
  • http://www.cse.wustl.edu/~jbuhler/research/operons/
  • http://www.sph.umich.edu/~qin/hmm/
• Proteins with unusual translational features: selenocysteine-containing genes
  • bSECISearch http://genomics.unl.edu/bSECISearch/
Known problems: different gene finding tools applied to the same genome
Conclusions
• There are plenty of tools for automated annotation of microbial genomes, including several “full-service” servers and annotation pipelines
• Even “full-service” pipelines identify a limited range of features; development of automated or semi-automated tools for identification of operons, promoters, terminators, etc. is highly desirable, but not likely in the absence of experimental data
• Nearly all of the annotation tools and servers use different strategies, algorithms, models, settings, etc., so the results may and will vary
• Different automated gene finders have different advantages and limitations; the best strategy is to use any one of them, or a combination, followed by evidence-based manual curation