Wellcome Trust Workshop Working with Pathogen Genomes Module 2 Gene Prediction
The Annotation Process [Diagram labels: DNA SEQUENCE; ANALYSIS SOFTWARE; Annotator; Useful Information]
Gene finding [Diagram: evidence from the sequence (base composition; sequence alignment to a related gene, e.g. an orthologue; sequence alignment to transcript data, e.g. ESTs) is used to accurately predict a sample set of genes. This training set is given to the gene finding software, which predicts the full gene set.]
DNA in Artemis [Screenshot labels: AT content; forward translations; reverse translations; DNA and amino acids]
Gene prediction programs: ORFs and CDSs
• ORFs are not equivalent to CDSs.
• Not all open reading frames are coding sequences.
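To make the ORF/CDS distinction concrete, here is a minimal sketch of an ORF scan. It assumes an ORF means an in-frame ATG followed by the first in-frame stop codon, on either strand; the function name `find_orfs` and the `min_codons` cut-off are illustrative choices, not part of any package named in these slides.

```python
# Minimal ORF scan (illustrative sketch, not a gene finder).
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA string (characters other than ACGT are kept)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end) for each ATG-to-stop ORF.

    start/end are 0-based, end-exclusive offsets on the scanned strand.
    min_codons counts the start codon but not the stop codon.
    """
    orfs = []
    forward = seq.upper()
    for strand, s in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG after the previous stop
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs


print(find_orfs("CCATGAAATTTGGGTAACC"))
```

Every span this returns is only a candidate: deciding which ORFs are real coding sequences needs the additional evidence discussed in the rest of the module.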
GC content • In AT-rich genomes, coding regions have a higher GC content than non-coding regions.
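A sliding-window GC calculation is one simple way to look for such regions. The sketch below is only an illustration; the window and step sizes are arbitrary assumptions, not values used by Artemis.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a DNA string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window=120, step=30):
    """Return a list of (window_start, gc_fraction) over full-length windows."""
    return [(i, gc_fraction(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, step)]


# Tiny example with a GC-rich stretch in the middle of an AT-rich sequence.
for start, gc in gc_windows("ATATATGCGCGCATAT", window=4, step=4):
    print(start, gc)
```

In an AT-rich genome, windows that stand out with raised GC are worth inspecting as possible coding regions, but the signal alone does not define gene boundaries.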
CODON USAGE
• Codon bias is different for each organism.
• Base composition in coding regions is constrained, but it is not constrained in non-coding regions.
• The codon usage of any particular gene can influence its expression.
Codon usage
• All organisms have a preferred set of codons. Example: fraction of each valine codon used.
Codon   Malaria   Trypanosoma
GUU     0.41      0.28
GUC     0.06      0.19
GUA     0.42      0.14
GUG     0.11      0.39
Codon Usage • http://www.kazusa.or.jp/codon/
Codon Usage Table (fields: codon, frequency per thousand, (count))
UUU  34.3( 26847)  UCU  15.3( 11956)  UAU  45.6( 35709)  UGU  15.3( 11942)
UUC   7.3(  5719)  UCC   5.3(  4141)  UAC   5.5(  4340)  UGC   2.4(  1872)
UUA  49.2( 38527)  UCA  18.2( 14239)  UAA   1.0(   813)  UGA   0.2(   188)
UUG  10.1(  7911)  UCG   2.8(  2154)  UAG   0.2(   123)  UGG   5.2(  4066)
CUU   8.7(  6776)  CCU   9.1(  7148)  CAU  19.5( 15287)  CGU   3.3(  2561)
CUC   1.7(  1354)  CCC   2.5(  1982)  CAC   3.9(  3020)  CGC   0.5(   354)
CUA   5.4(  4217)  CCA  13.1( 10221)  CAA  25.1( 19650)  CGA   2.4(  1878)
CUG   1.3(  1044)  CCG   0.9(   742)  CAG   3.3(  2598)  CGG   0.2(   184)
AUU  34.0( 26611)  ACU  12.8( 10050)  AAU 105.5( 82591)  AGU  21.6( 16899)
AUC   5.9(  4636)  ACC   5.5(  4312)  AAC  18.5( 14518)  AGC   3.8(  2994)
AUA  44.7( 34976)  ACA  22.8( 17822)  AAA  90.5( 70863)  AGA  16.9( 13213)
AUG  20.9( 16326)  ACG   3.8(  2951)  AAG  19.2( 15056)  AGG   3.9(  3091)
GUU  18.1( 14200)  GCU  12.5(  9811)  GAU  55.5( 43424)  GGU  16.6( 12960)
GUC   2.6(  2063)  GCC   3.2(  2541)  GAC   8.6(  6696)  GGC   1.6(  1269)
GUA  18.2( 14258)  GCA  12.6(  9871)  GAA  65.8( 51505)  GGA  16.7( 13043)
GUG   4.9(  3806)  GCG   1.1(   890)  GAG  10.1(  7878)  GGG   2.9(  2243)
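As a worked example, the counts in the table can be turned into the per-amino-acid fractions shown on the earlier valine slide. The sketch below uses only the four valine counts from the table; the helper name `synonymous_fractions` is an illustrative choice. The result appears to match the Malaria column of that slide.

```python
def synonymous_fractions(counts):
    """Fraction of each codon among the codons for one amino acid."""
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# Valine codon counts copied from the codon usage table above.
valine = {"GUU": 14200, "GUC": 2063, "GUA": 14258, "GUG": 3806}

for codon, fraction in synonymous_fractions(valine).items():
    print(codon, f"{fraction:.2f}")
# GUU 0.41
# GUC 0.06
# GUA 0.42
# GUG 0.11
```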
Codon Usage in Artemis [Screenshot: codon usage plots for the forward frames and the reverse frames]
Gene prediction: Amino acid usage: Correlation scores
• Within each window, plots the correlation between amino acid usage in the window and global amino acid usage in EMBL.
• “Magic number” = 52.7 (arbitrary units)
Gene prediction: Correlation scores [Figure: M. tuberculosis NADH dehydrogenase operon]
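The idea of a windowed correlation score can be sketched as follows. This is not the Artemis calculation: it assumes a plain Pearson correlation over the 20 amino acid frequencies, takes the reference usage from whatever protein text the caller supplies (a hypothetical stand-in for the global EMBL usage), and does not reproduce the 52.7 scale.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_usage(protein):
    """Fraction of each of the 20 amino acids in a protein string."""
    counts = [protein.count(a) for a in AMINO_ACIDS]
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def window_scores(protein, reference, window=50, step=10):
    """Correlate amino acid usage in each window with the reference usage."""
    ref = aa_usage(reference)
    return [(i, pearson(aa_usage(protein[i:i + window]), ref))
            for i in range(0, len(protein) - window + 1, step)]
```

A translation of a real coding frame would be expected to score higher than translations of the other frames, which is what the plot lets the annotator see by eye.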
Gene prediction: Positional base preference (FramePlot)
Plots the GC content in each position of each reading frame of the DNA sequence. In G+C-rich organisms the GC content of the 3rd base is often higher; in A+T-rich organisms it is lower. Good prediction of coding in malaria and trypanosomes and G+C-rich prokaryotes.
[Figure: frame-specific G+C content for frames 1, 2 and 3, shown with the G+C content of the chromosome]
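The positional calculation can be sketched for a single region as below. A real FramePlot is drawn over sliding windows; this simplified version, with the illustrative name `positional_gc`, just reports the GC fraction at codon positions 1, 2 and 3 for a chosen frame.

```python
def positional_gc(seq, frame=0):
    """GC fraction at codon positions 1, 2 and 3 for one reading frame.

    frame is the 0-based offset of the first codon within seq.
    """
    seq = seq.upper()
    gc = [0, 0, 0]
    seen = [0, 0, 0]
    for i in range(frame, len(seq)):
        pos = (i - frame) % 3  # 0, 1, 2 = codon position 1, 2, 3
        seen[pos] += 1
        if seq[i] in "GC":
            gc[pos] += 1
    return [g / n if n else 0.0 for g, n in zip(gc, seen)]


# Toy sequence with all of its G and C at the third codon position.
print(positional_gc("ATGATCATGATC"))           # [0.0, 0.0, 1.0]
print(positional_gc("ATGATCATGATC", frame=1))  # [0.0, 1.0, 0.0]
```

Reading the same sequence in a shifted frame moves the GC peak to a different codon position, which is why the plot can suggest which frame is coding.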
Genefinding programs
• Many genefinding software packages use Hidden Markov Models (HMMs).
• They predict coding, intergenic and intron sequences.
• They need to be trained on a specific organism.
• Never perfect!
What is an HMM
• A statistical model that represents a gene.
• Similar to a “weight matrix”, but one that can recognise gaps and treat them in a systematic way.
• Has different “states” that represent introns, exons, intergenic regions, etc.
• Considers the “state” of the preceding sequence.
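A toy two-state HMM shows how states and the preceding state enter the calculation. Everything numeric below is a made-up assumption for illustration (two states only, arbitrary transition and emission probabilities); real gene finders use many more states and parameters trained on the organism. The Viterbi algorithm is used to recover the most probable state path.

```python
from math import log

# Hypothetical toy model: all probabilities are illustrative assumptions.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Most probable state path for a non-empty ACGT string under the toy model."""
    seq = seq.upper()
    score = {s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[k][s] = best previous state for state s at position k + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = score[prev] + log(TRANS[prev][s]) + log(EMIT[s][base])
        back.append(pointers)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi("ATATATAT" + "GCGCGCGCGC" + "ATATATAT"))
```

With these toy numbers the GC-rich middle of the example is labelled "coding" and the AT-rich flanks "noncoding"; changing the training data (here, the hard-coded probabilities) changes the prediction, which is why the models must be trained per organism.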
A typical HMM http://linkage.rockefeller.edu/wli/gene/krogh98.pdf
Gene prediction programs: Problems
• ORFs are not equivalent to CDSs.
• Gene prediction programs find new genes that share properties with a given set of genes.
• They can be confounded by:
  • Sequence constraints (ribosomal proteins etc.)
  • Sequence biases
  • Sequence quality
  • Different sets of genes
  • Horizontal gene transfer
  • Non-coding DNA
Gene prediction programs: Problems. Sequence composition variation: Y. pestis ribosomal proteins [Figure tracks: final, glimmer, orpheus]
Gene prediction programs: Problems. Non-protein coding regions: S. typhi ribosomal RNA genes [Figure tracks, both strands: final, genefinder, orpheus, glimmer]
Gene prediction programs: Problems. Non-protein coding regions: N. meningitidis DNA repeats [Figure tracks, both strands: final, orpheus, glimmer]
Gene prediction programs: Problems. Pseudogenes: M. leprae
Gene prediction programs: Problems. Pseudogenes: M. leprae (Glimmer)
Gene prediction programs: Problems. Pseudogenes: M. leprae (ORPHEUS)
Gene prediction programs: Problems. Pseudogenes: M. leprae (WUBLASTX vs. M. tuberculosis)
Gene prediction programs: Problems. Pseudogenes: M. leprae (final annotation)
Gene prediction programs: Statistics
Mycobacterium marinum; 6,636,827 bp, 65.7% G+C
Predictions compared to manually curated gene set: 5519 genes (incl. 46 pseudogenes)
1 http://www.tigr.org/softlab/glimmer/glimmer.html
2 http://opal.biology.gatech.edu/GeneMark/
3 http://cbcb.umd.edu/software/glimmer/
4 Krogh+Larson pers comm
5 http://pedant.gsf.de/orpheus/
Gene prediction programs: Problems. Splicing: Plasmodium falciparum [Figure: original annotation vs. updated annotation]
Homology Data
• Coding regions are more conserved than non-coding regions due to selective pressure.
• Comparing all possible translations against all known proteins will give clues to known genes.
• BLASTX
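"All possible translations" means the six reading frames, three on each strand. The sketch below produces them under the assumption of the standard genetic code; it illustrates only the translation step, not the database search or scoring that BLASTX performs, and the frame labels are an illustrative convention.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    """Return translations for frames +1..+3 and -1..-3 of a DNA string."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


for name, protein in six_frame_translations("ATGAAATAG").items():
    print(name, protein)
```

Each translated frame can then be compared against a protein database; a conserved hit in one frame is evidence for a gene in that frame even where composition-based signals are weak.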
EST sequencing [Figure: gene with 5’UTR, exons (start codon M), intron, stop codon and 3’UTR; mRNA with CAP and poly-A tail (AAAAAAAAAA); cDNA with complementary poly-T (TTTTTTTTT); ESTs read from the cDNA]
The Gene Prediction Process [Diagram labels: DNA SEQUENCE; ANALYSIS SOFTWARE; ESTs, FASTA, BlastX, gene finders, codon usage, AT content; Annotator; Useful CDS Prediction]
Gene prediction in eukaryotes: HMMs
Figure colour key:
• highlighted: manually reviewed gene structure
• pale brown: hit to H. contortus EST cluster in Nembase, found using PASA
• brown-green: hit to H. contortus individual ESTs in NCBI database, found using PASA
• pink/red blocks: hits to Uniprot
• bright green: twinscan prediction (homology based)
• pale pink: snap prediction (ab initio)
• yellow: hmmgene prediction (ab initio)
• pale blue: genscan prediction (ab initio)
• red: genefinder (ab initio)
• dark blue: fgenesh prediction (ab initio)
• jade green: augustus hints prediction (homology based)
• orange: augustus prediction (ab initio)
• purple: genewise prediction (homology based)
Gene prediction in eukaryotes: HMMs [Figure, panels A and B: P. falciparum gene predictions (PlasmoDB)]
Gene prediction in eukaryotes: HMMs [Figure: Dictyostelium discoideum gene predictions; tracks: Bartfinder, hmmgene, geneid, Phat, EST (contig), combined prediction]
Manual refinement [Figure: P. falciparum and P. knowlesi]
Ongoing manual annotation, e.g. PF14_0021, PF14_0022 [Figure: P. falciparum and P. vivax]. Revised annotation (back to two genes!)
Using FASTA Results
• FASTA is a global alignment tool [Figure: BLAST and FASTA alignments compared]
• Reduces sensitivity, increases specificity