microbial genome annotation
  to annotate: make or furnish critical or explanatory notes or comments
  annotation: note or comment
bacterial genome features
  sizes: 0.6 Mb (Mycoplasma) up to 10 Mb (Myxobacteria)
  chromosomes: circular, linear, megaplasmids
  G+C content: 25% (Buchnera) up to 75% (Streptomyces)
  average gene density: one gene per 1-1.3 kbp
  intergenic regions: 10-20%
  organization: many genes form operons or gene islands
  genome plasticity: insertions, deletions, inversions, gene pools
sequenced vs. experimentally characterised genes
  current status in genome sequencing
  [figure: published genome sequencing projects since 1995; sequenced vs. characterised genes]
shotgun sequencing
  > fragmentation of genomic DNA (2-10 kbp)
  > plasmid library of genomic fragments (8-10-fold genome coverage)
  > sequencing (500 bp reads)
      ..AGGCATCTAGGATTACCATCTACTT
      ..AGCTATCGAGCATCTAGGATTACCATCTA
      TACCATCTACTTCTCATTTTCTAAATA..
      GATTACCATCTACTTCTCATTTTCTAAATATCGCGCA..
  > assembly of overlapping sequences
  > gap closure
  > annotation
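The "assembly of overlapping sequences" step can be sketched as a greedy merge of the longest suffix-prefix overlaps. This is a minimal sketch, assuming error-free reads that all come from the same strand. The function names and the toy reads are illustrative and do not come from the slides. Real assemblers also deal with sequencing errors, reverse complements and repeats.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the longest overlap."""
    contigs = list(reads)
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            # No overlaps left: the remaining contigs are separated by gaps.
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Toy reads (hypothetical) covering the sequence ATGGCGTACGTTAGC.
reads = ["CGTTAGC", "ATGGCGTAC", "CGTACGTTA"]
print(greedy_assemble(reads))
```

If more than one contig is returned, the reads do not overlap everywhere, which corresponds to the gaps that the gap-closure step has to fill.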
genome sequencing project
  example: Pseudomonas putida KT2440
  > restriction mapping
  > cosmid mapping
  > primary annotation
  > nucleotide sequence analysis
  > functional data
  > comparative genomics
  > shotgun sequencing
  > assembly and gap closure
annotation strategy
  sequencing
  > automatic gene finding: prediction of putative coding regions, application of 1 or more algorithms
  biological databases
gene finding
  • often based on probabilistic models (for instance HMMs, IMMs)
  • many algorithms available
  • no perfect algorithm (none reaches 99.9%): false positives and false negatives
  > additional evaluation needed (overlaps, intergenic regions, short genes, start sites)
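Before any probabilistic model is applied, candidate coding regions are usually enumerated as open reading frames. The sketch below is a simplification under stated assumptions: it scans only the forward strand, uses ATG as the only start codon, and uses a hypothetical minimum length in codons. It illustrates why short genes and start sites need additional evaluation, and it is not the method of any particular gene finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) pairs of ATG...stop ORFs, end exclusive.

    Forward strand only; the first ATG in a frame after the previous
    stop is taken as the start, which may not be the true start site.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count codons before the stop; skip ORFs that are too short.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


seq = "CCATGAAACCCGGGTAACC"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])
```

Lowering `min_codons` finds more short genes but also more false positives, which is the trade-off named on the slide.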
gene finding with Glimmer
  • Glimmer uses an Interpolated Markov Model (IMM) to predict those open reading frames (ORFs) most likely to be genes
  • The IMM makes predictions based on statistical probabilities generated when the model-building algorithm is trained on a set of 'known' genes (long ORFs, ORFs that match known proteins)
  • The algorithm calculates the occurrence of base x following oligomer y (up to 8-mers) in the set of 'known' genes and generates the probability of each combination occurring in a real gene
  • These probabilities are then used to predict whether any given ORF is a real gene or not
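The idea of "base x following oligomer y" can be illustrated with a plain fixed-order Markov model. This is a deliberately simplified sketch and not Glimmer's implementation: Glimmer interpolates between several context lengths, whereas the sketch uses one fixed order. It also assumes a uniform background of 0.25 per base and a pseudocount of 1. All names and the toy training set are hypothetical.

```python
import math
from collections import defaultdict


def train_markov(genes, order=2, pseudocount=1.0):
    """Count how often each base follows each oligomer of length `order`."""
    counts = defaultdict(lambda: defaultdict(float))
    for gene in genes:
        for i in range(order, len(gene)):
            counts[gene[i - order:i]][gene[i]] += 1

    def prob(context, base):
        seen = counts.get(context, {})
        total = sum(seen.values()) + 4 * pseudocount
        return (seen.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, prob, order=2):
    """Log2 ratio of the gene model versus a uniform background."""
    return sum(math.log2(prob(seq[i - order:i], seq[i]) / 0.25)
               for i in range(order, len(seq)))


# Toy training set standing in for the 'known' genes.
model = train_markov(["ATGATGATGATG"])
print(log_odds("ATGATGATG", model))  # resembles the training data
print(log_odds("ATCATC", model))     # contains transitions unseen in training
```

A positive score means the ORF looks more like the training genes than like the background. The cutoff for calling a gene is a separate choice.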
annotation strategy
  sequencing
  > automatic gene finding: prediction of putative coding regions, application of 1 or more algorithms
  > automatic annotation: similarity searches, assignments to protein families etc., sequence features, suggestion of function, classification
  > manual annotation: validation of gene finding and automatic annotations, additional database searches, literature searches and other information sources, contextual analysis
  > re-annotation: validation and update of previous annotations
  biological databases
errors in annotation
  stages: sequencing, automatic gene finding, automatic annotation, manual annotation, re-annotation, biological databases
  error types shown along the pipeline:
  > sequencing errors, assembly errors
  > false positives, start sites, false negatives
  > false negatives, under- and over-prediction
  > false positives, over-prediction, domain error, false negatives, under-prediction, undefined source, typographical errors
reannotation scores
  Score | Description | Comment
  7 | False positive | Original annotation predicts function without any supporting evidence
  6 | Over-prediction | Original annotation predicts a specific biochemical function without sufficient supporting evidence
  5 | Domain error | Original annotation overlooks different domain structure of query and reference proteins
  4 | False negative | Original annotation does not provide predicted function although there is sufficient evidence to characterize the query protein
  3 | Under-prediction | Original annotation predicts a nonspecific biochemical function although a more detailed prediction could have been made
  2 | Undefined source | Original annotation contains undefined terms, non-homology based predictions, and so on
  1 | Typographical error | Original annotation contains typographical errors that may be propagated in the database
  0 | Total agreement | Original annotation is correct, but annotations may be only semantically (but not computationally) identical
  CA Ouzounis and PD Karp, Genome Biology 2002, 3 (2): comment2001.1-2001.6
annotation
  • basic annotation:
      • name, gene symbol, functional category
      • gene characteristics (length, position, G+C content, ...)
      • protein characteristics (domains/motifs, MW, pI, ...)
  • extended annotation:
      • genomic context, phylogenetic relations
      • protein interaction, pathways
      • further gene characteristics (codon usage, oligonucleotides)
      • experimental data (high-throughput data)
diversity of nomenclature
  descriptive:
    multidrug efflux MFS transporter
    multidrug resistance efflux pump homolog
    efflux pump protein
    multidrug resistance protein B
    EmrB protein
  consistent:
    histidine sensor kinase
    sensor kinase
    two component sensor kinase
    transmembrane sensor kinase
    two component system, transmembrane sensor
    sensor histidine kinase
    sensory box protein
new developments
  annotation using Gene Ontology (GO) categories
  • controlled vocabulary
  • annotation according to function, process, or localization
  • combination of these ontologies
  • evidence codes
  example: translation factor
annotation strategies
  homology/structure
    > pairwise homology
    > protein domains/families
    > binding-sites
    > amino acid composition
    > secondary structures
    > 3D structures
pairwise alignment
  • search against protein databases: GenBank, SwissProt, PIR, ...
  • different algorithms: Blast, PSI-Blast, Fasta, Smith-Waterman
  • different search strategies:
      • combination of Blast and Fasta
      • Blast search followed by Smith-Waterman alignment
      • PSI-Blast (iterative search, builds a profile in the first run and repeats the search against the profile)
  Example_1: functional characterization possible
  Example_2: functional characterization impossible
  Example_3: functional characterization ambiguous
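Of the algorithms listed, Smith-Waterman is the exact local alignment method. A minimal score-only sketch follows, assuming a simple match/mismatch scheme and a linear gap penalty. The parameter values are illustrative. Protein searches in practice use substitution matrices such as BLOSUM62 and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap
            left = H[i][j - 1] + gap
            # The floor of 0 lets an alignment restart anywhere (local alignment).
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best


print(smith_waterman("GSGKST", "AAGSGKSTAA"))
print(smith_waterman("AAAA", "CCCC"))
```

Because the full matrix is filled, the run time grows with the product of the two sequence lengths, which is one reason a fast Blast search is often run first and Smith-Waterman is applied only to the hits.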
known problems
  • no cut-off values, leads to overprediction
  • different degrees of conservation during evolution
  • ambiguous substrate/interaction specificity
  • no information on orthology/paralogy
  • wrong annotations, database artefacts
  • transitive annotation
  • multidomain proteins
transitive annotation
  A (new predicted protein)
  B (database entry from sequencing project)
  C (well characterised database entry)
  [figure: similarities between the entries, each labelled 30%]
  A is like B, B is like C, but C is not like A
multidomain proteins
  • multidomain problem: dominant domains
  [figure: A-domain proteins, B-domain proteins and X (new sequence, domains A and B), with similarity labels 30% and 70%]
protein families - HMM
  > highly curated multiple alignment of well characterised seed proteins
      AIEEGEILVIMGLSGSGKST
      AIEEGEIFVIMGLSGSGKST
      EVYDGEIFVIMGLSGSGKST
      KIAKGEFICFIGPSGCGKTT
      DILKGEFICFIGPSGCGKTV
  > generation of Hidden Markov Model (HMM) including cutoffs
  > alignment to genome proteins, assignment of scores
      HMM:    eIakGEifvimGlSGsGKsT
      match:  +++ GEi+ ++G SGsGKs
      query:  DLYRGEILAVVGGSGSGKSV
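The step from a seed alignment to a scoring model can be approximated with a position-specific frequency profile. This is a simpler stand-in for a profile HMM, since it has no insert or delete states. The sketch assumes an ungapped alignment, a uniform background of 0.05 per amino acid and a pseudocount of 0.5. The seed and query sequences are the ones shown on the slide, and the function names are hypothetical.

```python
import math
from collections import Counter

SEED = [
    "AIEEGEILVIMGLSGSGKST",
    "AIEEGEIFVIMGLSGSGKST",
    "EVYDGEIFVIMGLSGSGKST",
    "KIAKGEFICFIGPSGCGKTT",
    "DILKGEFICFIGPSGCGKTV",
]
QUERY = "DLYRGEILAVVGGSGSGKSV"


def column_profile(alignment):
    """Residue counts for every column of an ungapped alignment."""
    return [Counter(column) for column in zip(*alignment)]


def conserved_columns(alignment):
    """Indices of columns where all seed proteins carry the same residue."""
    return [i for i, c in enumerate(column_profile(alignment)) if len(c) == 1]


def profile_score(seq, profile, pseudocount=0.5):
    """Sum of per-column log2 odds against a uniform background."""
    n = sum(profile[0].values())
    score = 0.0
    for counts, residue in zip(profile, seq):
        freq = (counts.get(residue, 0) + pseudocount) / (n + 20 * pseudocount)
        score += math.log2(freq / 0.05)
    return score


profile = column_profile(SEED)
print(conserved_columns(SEED))
print(profile_score(QUERY, profile))
```

The fully conserved columns correspond to the uppercase positions of the consensus line, and the query scores well mainly because it matches those positions.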
protein families
  > databases: Pfam, TIGRfam, Smart
  > based on highly curated sets of proteins known to share the same or similar functions or to be members of the same family
  > family name often refers to well characterized members
  > further classification into super- or subfamilies possible
  > trusted cutoff and noise cutoff can be used for evaluation of assignments
  example 1: uncovering of a MFS family transporter
  example 2: porins in P. putida KT2440
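How the two cutoffs can be used to evaluate an assignment is sketched below. The three-way split and the numeric values are assumptions for illustration. The actual cutoffs are defined per family by the respective database.

```python
def classify_hit(score, trusted_cutoff, noise_cutoff):
    """Evaluate a family assignment from its score and the family cutoffs."""
    if score >= trusted_cutoff:
        return "trusted"      # assign the family name
    if score > noise_cutoff:
        return "uncertain"    # candidate for manual annotation
    return "noise"            # do not assign


# Hypothetical cutoffs for one family.
for score in (150.0, 80.0, 20.0):
    print(score, classify_hit(score, trusted_cutoff=100.0, noise_cutoff=60.0))
```

Hits that fall between the two cutoffs are the ones where contextual evidence from manual annotation is most useful.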
motifs/domains/structure
  PROSITE motifs: binding sites, phosphorylation sites, membrane anchors
  Lipoprotein motifs: putative lipid modification
  Signal peptides: characteristic for membrane and extracellular proteins
  Membrane spanning regions: typical for transporters, sensors, etc.
  Secondary structures: helices, β-sheets, coils
  Tertiary structures: scan against profiles of known protein structures
annotation strategies
  homology/structure
    > pairwise homology
    > protein domains/families
    > binding-sites
    > amino acid composition
    > secondary structures
    > 3D structures
  phenotype/experiments
    > metabolic pathways
    > physiological features
    > localisation
    > expression data
    > knock-out phenotypes
    > comparative genomics
metabolic pathways
  Degradation of ferulic acid by Pseudomonas spp.
  [figure: gene cluster with labels arylsulfatase, acyl-CoA dehydrogenase, ori, aat, fcs, vdh, ech, regulator, 4-hydroxycinnamic acid, MFS transporter, vanB, vanA, pcaA, pcaB]
  Jörg Overhage, Horst Priefert, and Alexander Steinbüchel, AEM 1999, 65:4837ff
annotation strategies
  homology/structure
    > pairwise homology
    > protein domains/families
    > binding-sites
    > amino acid composition
    > secondary structures
    > 3D structures
  genomic context
    > orthology, phylogeny
    > conserved neighborhood
    > operon structure
    > gene fusion, protein interaction
    > phylogenetic profiles
  phenotype/experiments
    > metabolic pathways
    > physiological features
    > localisation
    > expression data
    > knock-out phenotypes
    > comparative genomics
orthology / paralogy
  > gene A from genome 1 is the ortholog of gene B from genome 2 if:
      • gene A is the best homolog of gene B among all genes in genome 1
      • gene B is the best homolog of gene A among all genes in genome 2
  > orthologs are genes that have diverged from each other after speciation events
  > paralogs are genes that have diverged from each other after gene duplication events
  > homologs are genes that descend from a common ancestor gene
orthology / paralogy
  [figure: gene X, gene Y (lysine transporter) and gene Z in genome A and genome B, with similarity labels 50% and 70%]
  gene Z, gene Y: orthologs
  gene X, gene Y: homologs
  gene X, gene Z: paralogs
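The two-sided "best homolog" criterion is commonly implemented as a reciprocal best hit test. The sketch below assumes that pairwise similarity scores between the genes of two genomes are already available. The example places genes X and Z in one genome and gene Y in the other, which is an assumption about the figure, and it uses the 50% and 70% labels as scores.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring gene in the other genome."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(scores):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(scores)
    reverse = best_hits({(b, a): s for (a, b), s in scores.items()})
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Keys are (gene in first genome, gene in second genome).
scores = {("X", "Y"): 50, ("Z", "Y"): 70}
print(reciprocal_best_hits(scores))
```

Gene X also has Y as its best hit, but Y's best hit is Z, so X and Y remain homologs without being called orthologs.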
paralogous families
  • protein clustering of a complete genome detects paralogous families
  • members of the same protein family share conserved domains that are often connected with a function, localization or process
  • paralogous families are used for maintaining consistency of annotation and for start-site editing
  • can be useful for genome comparison
  Enright et al. 2002, NAR 30, 1575-84
conserved neighborhood
  • neighborhood (gene order or proximity) of two or more genes is conserved between different taxospecies
  • assumption: gene order is conserved during evolution due to co-regulation or co-transcription, which suggests participation in the same complex, pathway or process
  • problem: false positives due to short phylogenetic distances
conserved neighborhood
  [figure: conserved gene neighborhood with the labels: inner membrane permease protein; ATP-binding protein; putative periplasmic substrate-binding protein; ABC transporter; binding protein; permease; putative ABC transporter ATPase; conserved hypothetical, one TM domain]
gene fusion analysis
  [figure: gene X, gene Y and gene Z in genome A and genome B]
  • orthologs of individual 'component' proteins in a genome A are fused into a single protein in a genome B
  • assumption: component proteins in genome A are involved in the same (or a similar) protein complex, pathway or process
  • problem: the number of false positives increases with the level of paralogy
phylogenetic profiles
  [figure: presence/absence matrix of protein 1 to protein 5 across species a to species f]
  • phylogenetic profiles are co-occurrence patterns of genes (orthologs) in different genomes
  • assumption: similar phylogenetic profiles of genes could indicate participation in the same pathway or process
  • problem: false positives due to short phylogenetic distances and a high noise-to-signal ratio
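Comparing phylogenetic profiles amounts to comparing presence/absence vectors. The sketch below uses the Jaccard index as the similarity measure and a threshold of 0.8, both of which are assumptions. The matrix values are invented for illustration and are not read from the slide's figure.

```python
from itertools import combinations

# Presence (1) or absence (0) of an ortholog in species a to f (hypothetical).
PROFILES = {
    "protein 1": [1, 0, 1, 1, 0, 1],
    "protein 2": [1, 0, 1, 1, 0, 1],
    "protein 3": [0, 1, 0, 0, 1, 0],
    "protein 4": [1, 1, 1, 1, 1, 1],
    "protein 5": [1, 0, 1, 0, 0, 1],
}


def jaccard(p, q):
    """Shared presences divided by presences in either profile."""
    both = sum(1 for x, y in zip(p, q) if x and y)
    either = sum(1 for x, y in zip(p, q) if x or y)
    return both / either if either else 0.0


def linked_pairs(profiles, threshold=0.8):
    """Protein pairs whose profiles are at least `threshold` similar."""
    pairs = []
    for (n1, p1), (n2, p2) in combinations(sorted(profiles.items()), 2):
        if jaccard(p1, p2) >= threshold:
            pairs.append((n1, n2))
    return pairs


print(linked_pairs(PROFILES))
```

A protein present in every genome, like "protein 4" here, is moderately similar to many profiles without being informative, which is one source of the noise mentioned on the slide.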
annotation strategies
  homology/structure
    > pairwise homology
    > protein domains/families
    > binding-sites
    > amino acid composition
    > secondary structures
    > 3D structures
  genome associated features
    > codon usage
    > mobile elements, islands
    > oligonucleotide frequencies
    > promoter/terminator
    > RNAs
  phenotype/experiments
    > metabolic pathways
    > physiological features
    > localisation
    > expression data
    > knock-out phenotypes
    > comparative genomics
  genomic context
    > orthology, phylogeny
    > conserved neighborhood
    > operon structure
    > gene fusion, protein interaction
    > phylogenetic profiles
genome sequence features
  • nucleotide content
  • oligonucleotide bias
  • oligonucleotide variance
  > all three features are expected to be relatively constant throughout the genome
  • codon usage
  • (oligo)nucleotide skew
  • third position GC skew
  • repeats
  > atypical sequence features often indicate alien DNA, highly/lowly expressed genes, or unusual structural features
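Since nucleotide content is expected to be relatively constant throughout the genome, atypical regions can be flagged by comparing windows against the genome-wide value. The sketch below does this for G+C content and also shows the usual GC skew formula (G - C) / (G + C). The window size, the tolerance and the toy sequence are assumptions for illustration.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_skew(seq):
    """(G - C) / (G + C), or 0.0 if the sequence has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def atypical_windows(genome, window, tolerance):
    """Start positions of windows whose G+C content deviates from the mean."""
    mean = gc_content(genome)
    return [start
            for start in range(0, len(genome) - window + 1, window)
            if abs(gc_content(genome[start:start + window]) - mean) > tolerance]


# Toy genome with a GC-rich insert in the middle (hypothetical).
genome = "ATGC" * 10 + "GGCC" * 5 + "ATGC" * 10
print(atypical_windows(genome, window=20, tolerance=0.2))
```

The flagged window is only a candidate. Whether it is alien DNA, a highly expressed gene or a structural feature has to be decided from the other evidence listed above.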
detection of gene islands
  Region 1: type II secretion proteins, others
  Region 2: lps/eps biosynthesis cluster
  Region 3: arsenate detoxification operon; unknown operon
  Region 4: prophage
  Region 5: hypothetical proteins, large non-coding regions
  Region 6+7: transposons
  Region 8+9: heavy metal resistance genes (transposons?)