
WELLCOME TRUST ADVANCED COURSES Working with Pathogen Genomes 17th - 21st November 2008

WELLCOME TRUST ADVANCED COURSES Working with Pathogen Genomes 17th - 21st November 2008. Module 2 Gene Prediction Anna V. Protasio. Most of the slides were kindly provided by Matt Berriman. The Annotation Process. DNA SEQUENCE. Useful Information. ANALYSIS SOFTWARE. Annotator.


Presentation Transcript


  1. WELLCOME TRUST ADVANCED COURSES Working with Pathogen Genomes 17th - 21st November 2008 Module 2 Gene Prediction Anna V. Protasio Most of the slides were kindly provided by Matt Berriman

  2. The Annotation Process DNA SEQUENCE Useful Information ANALYSIS SOFTWARE Annotator

  3. Gene finding • Evidence from the sequence: base composition; sequence alignment to related gene (e.g. orthologue); sequence alignment to transcript data (e.g. EST) • Gene finding software: accurately predict sample set of genes (training set), then the full gene set

  4. DNA in Artemis AT content Forward translations Reverse Translations DNA and amino acids

  5. Gene prediction programs: ORFs and CDSs • ORFs are not equivalent to CDSs • Not all open reading frames (ORFs) are coding sequences • EXAMPLE: Compare two bacterial genomes, AT rich and GC rich
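
Why ORFs outnumber CDSs can be illustrated with a minimal Python sketch that lists every stop-free stretch in all six reading frames. The function names and the `min_codons` cutoff are assumptions chosen for illustration; this is not how Artemis implements "Mark open reading frames".

```python
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, min_codons=100):
    """Yield (strand, frame, start, end) for every stop-free run of codons.

    start and end are 0-based offsets on the strand that was scanned
    (end is exclusive). min_codons is a hypothetical length cutoff and
    should be at least 1.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOPS:
                    if (i - run_start) // 3 >= min_codons:
                        yield strand, frame, run_start, i
                    run_start = i + 3
            # Stop-free run that reaches the end of the sequence.
            end = frame + ((len(s) - frame) // 3) * 3
            if (end - run_start) // 3 >= min_codons:
                yield strand, frame, run_start, end

if __name__ == "__main__":
    for orf in find_orfs("ATGAAATTTTAGCCC", min_codons=3):
        print(orf)
```

Even this 15-base toy sequence yields several overlapping ORFs on both strands, and at most one of them could be a real coding sequence.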

  6. Comparing AT-rich and GC-rich genomes • Open files Data/Campy/Cj.seq and Data/Strep/St6A9.seq • Create / Mark open reading frames • Select / All • Edit / Trim selected features / To Any / (100). Some features will remain selected -> delete • View / Overview • Compare both genomes

  7. GC content • Coding regions have higher GC content in AT-rich genomes
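
A GC content plot of this kind can be sketched as a sliding-window calculation. The following Python is a minimal illustration; the function names and the default window and step sizes are arbitrary assumptions, not the settings Artemis uses.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def windowed_gc(seq, window=120, step=60):
    """Return (window_start, gc_fraction) pairs along the sequence."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]

if __name__ == "__main__":
    # Toy sequence: a GC-rich block followed by an AT-rich block.
    print(windowed_gc("GGGGAAAA", window=4, step=4))
```

Plotting the second element of each pair against the first gives a curve in which GC-rich stretches stand out from an AT-rich background.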

  8. GC content

  9. CODON USAGE • Codon bias is different for each organism. • DNA content in coding regions is restricted, but it is not restricted in non-coding regions. • The codon usage for any particular gene can influence expression. • All organisms have a preferred set of codons (example: Valine)
     Codon  Malaria  Trypanosoma
     GUU    0.41     0.28
     GUC    0.06     0.19
     GUA    0.42     0.14
     GUG    0.11     0.39
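
Per-amino-acid codon fractions like the valine example can be computed from any coding sequence. This is a minimal Python sketch; `codon_counts` and `synonymous_fractions` are hypothetical helper names, and the input is assumed to be an in-frame CDS.

```python
from collections import Counter

VALINE = ("GUU", "GUC", "GUA", "GUG")

def codon_counts(cds):
    """Count codons in an in-frame coding sequence (DNA or RNA alphabet)."""
    rna = cds.upper().replace("T", "U")
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    return Counter(rna[i:i + 3] for i in range(0, usable, 3))

def synonymous_fractions(counts, codons=VALINE):
    """Fraction of each synonymous codon among all codons for one amino acid."""
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {c: 0.0 for c in codons}
    return {c: counts[c] / total for c in codons}

if __name__ == "__main__":
    counts = codon_counts("ATGGTTGTTGTAGTGTAA")
    print(synonymous_fractions(counts))
```

Summing the counts over all annotated genes of an organism, rather than one toy sequence, gives a genome-wide codon usage table.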

  10. Codon Usage • http://www.kazusa.or.jp/codon/

  11. Codon Usage Table (fields: codon, frequency per thousand, (number of codons))
      UUU  34.3 (26847)   UCU  15.3 (11956)   UAU  45.6 (35709)   UGU  15.3 (11942)
      UUC   7.3 ( 5719)   UCC   5.3 ( 4141)   UAC   5.5 ( 4340)   UGC   2.4 ( 1872)
      UUA  49.2 (38527)   UCA  18.2 (14239)   UAA   1.0 (  813)   UGA   0.2 (  188)
      UUG  10.1 ( 7911)   UCG   2.8 ( 2154)   UAG   0.2 (  123)   UGG   5.2 ( 4066)
      CUU   8.7 ( 6776)   CCU   9.1 ( 7148)   CAU  19.5 (15287)   CGU   3.3 ( 2561)
      CUC   1.7 ( 1354)   CCC   2.5 ( 1982)   CAC   3.9 ( 3020)   CGC   0.5 (  354)
      CUA   5.4 ( 4217)   CCA  13.1 (10221)   CAA  25.1 (19650)   CGA   2.4 ( 1878)
      CUG   1.3 ( 1044)   CCG   0.9 (  742)   CAG   3.3 ( 2598)   CGG   0.2 (  184)
      AUU  34.0 (26611)   ACU  12.8 (10050)   AAU 105.5 (82591)   AGU  21.6 (16899)
      AUC   5.9 ( 4636)   ACC   5.5 ( 4312)   AAC  18.5 (14518)   AGC   3.8 ( 2994)
      AUA  44.7 (34976)   ACA  22.8 (17822)   AAA  90.5 (70863)   AGA  16.9 (13213)
      AUG  20.9 (16326)   ACG   3.8 ( 2951)   AAG  19.2 (15056)   AGG   3.9 ( 3091)
      GUU  18.1 (14200)   GCU  12.5 ( 9811)   GAU  55.5 (43424)   GGU  16.6 (12960)
      GUC   2.6 ( 2063)   GCC   3.2 ( 2541)   GAC   8.6 ( 6696)   GGC   1.6 ( 1269)
      GUA  18.2 (14258)   GCA  12.6 ( 9871)   GAA  65.8 (51505)   GGA  16.7 (13043)
      GUG   4.9 ( 3806)   GCG   1.1 (  890)   GAG  10.1 ( 7878)   GGG   2.9 ( 2243)
      How does it look for your favorite pathogen?
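
As a worked example of reading this table, the Python sketch below turns the four valine counts into per-amino-acid fractions. The counts are copied from the table above; the variable names are illustrative only. The rounded results appear to agree with the "Malaria" column of the valine example on slide 9.

```python
# Valine codon counts as listed in the table above.
valine_counts = {"GUU": 14200, "GUC": 2063, "GUA": 14258, "GUG": 3806}

total = sum(valine_counts.values())

# Fraction of valine codons contributed by each synonymous codon.
valine_fractions = {
    codon: round(count / total, 2) for codon, count in valine_counts.items()
}

if __name__ == "__main__":
    print(total)
    print(valine_fractions)
```

The same division applied to another organism's table shows how strongly its synonymous codon preferences differ.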

  12. Codon Usage in Artemis Forward frames Reverse frames

  13. Codon Usage

  14. GC frame plot • Plots the third position GC content of each frame of a DNA sequence. • In coding DNA the GC content of the 3rd base is often higher. • Good prediction of coding in malaria and trypanosomes.
      Frame 1: ATGCCTGCAGGGAAACCTTCTGGTCTGAAGACTGCGCGCA
      Frame 2: TGCCTGCAGGGAAACCTTCTGGTCTGAAGACTGCGCGCA
      Frame 3: GCCTGCAGGGAAACCTTCTGGTCTGAAGACTGCGCGCA
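
The quantity behind a GC frame plot can be sketched in a few lines of Python. This is a minimal illustration for a single stretch of sequence; the function name `gc3_by_frame` is hypothetical, and a real plot would repeat the calculation in a sliding window along the genome.

```python
def gc3_by_frame(seq):
    """Return the third-codon-position GC fraction for frames 1, 2 and 3.

    For the frame starting at offset f, the third codon positions are the
    bases at indices f + 2, f + 5, f + 8, ...
    """
    seq = seq.upper()
    result = []
    for frame in range(3):
        third_bases = seq[frame + 2::3]
        gc = sum(base in "GC" for base in third_bases)
        result.append(gc / len(third_bases) if third_bases else 0.0)
    return result

if __name__ == "__main__":
    # Sequence from the slide; the frame with the highest value is the
    # most likely coding frame under the GC3 assumption.
    print(gc3_by_frame("ATGCCTGCAGGGAAACCTTCTGGTCTGAAGACTGCGCGCA"))
```

In a coding region one frame tends to separate from the other two, which is what makes the plot useful as a visual predictor.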

  15. Gene finding programs • Gene-finding software packages use Hidden Markov Models. • Predict coding, intergenic and intron sequences • Need to be trained on a specific organism. • Never perfect!

  16. What is an HMM • A statistical model that represents a gene. • Similar to a “weight matrix” but one that can recognise gaps and treat them in a systematic way. • Has different “states” that represent introns, exons, intergenic regions, etc • Considers the “state” of preceding sequence - CONDITIONAL PROBABILITY
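
The idea of states and conditional probabilities can be made concrete with a toy two-state HMM decoded by the Viterbi algorithm. This Python sketch is only an illustration: the state names and every probability are invented assumptions, and real gene finders use many more states with parameters trained on the organism.

```python
import math

# All numbers below are made-up illustrative parameters, not trained values.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {  # probability of the next state given the current state
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {  # probability of each base given the state
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}

def viterbi(seq):
    """Return the most probable state path for a DNA string (A, C, G, T only)."""
    seq = seq.upper()
    if not seq:
        return []
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

if __name__ == "__main__":
    print(viterbi("ATATATATATGCGCGCGCGC"))
```

Because each state's score depends on the best score of the preceding state, the label given to one base is conditioned on its neighbours instead of being decided in isolation.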

  17. Phat http://linkage.rockefeller.edu/wli/gene/krogh98.pdf

  18. Gene prediction programs: Problems • ORFs are not equivalent to CDSs • Gene prediction programs find new genes that share properties with a given set of genes. • They can be “confused” by: • Sequence constraints (ribosomal proteins etc.) • Sequence biases • Sequence quality • Different sets of genes • Horizontal gene transfer • Non-coding DNA

  19. Gene prediction programs: Problems Non-protein coding regions: S. typhi ribosomal RNA genes (figure track labels: final, genefinder, orpheus, glimmer)

  20. Gene prediction programs: Problems Non-protein coding regions: N. meningitidis DNA repeats (figure track labels: final, orpheus, glimmer)

  21. Gene prediction programs: Problems Pseudogenes M. leprae

  22. Gene prediction programs: Problems Pseudogenes: M. leprae Glimmer

  23. Gene prediction programs: Problems Pseudogenes: M. leprae ORPHEUS

  24. Gene prediction programs: Problems Pseudogenes: M. leprae WUBLASTX vs. M. tuberculosis

  25. Gene prediction programs: Problems Pseudogenes: M. leprae Final annotation

  26. Gene prediction programs: Statistics CDS prediction 1 1 2 Size (Mb) G+C Glimmer G2 ORPHEUS other Final Organism 3 Campylobacter jejuni 1.641 30.55 1761 1518 1783 1654 Neisseria meningitidis A 2.184 51.81 3134 2024 2121 4 1605 intact 1115 pseudo Mycobacterium leprae 3.268 57.80 949 5679 4427 5 Salmonella typhi 4.809 52.09 5194 4666 4973 4600 Yersinia pestis 4.654 47.64 2654 4312 4011 1 http://www.tigr.org/softlab/glimmer/glimmer.html 2 http://pedant.mips.biochem.mpg.de/orpheus/index.html 3 Start-to-stop >100 aa 4 TIGR CMR (http://www.tigr.org/) 5 GeneFinder (Krogh+Larson pers comm)

  27. Gene prediction programs: Problems splicing Plasmodium falciparum Original annotation Updated annotation

  28. Homology Data • Coding regions are more conserved than non coding regions due to selective pressure. • Comparing all possible translations against all known proteins will give clues to known genes. • Blastx
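
The "all possible translations" searched by BLASTX are the six reading frames of the DNA. The Python sketch below produces them under the standard genetic code; the function names are hypothetical, and codons containing anything other than A, C, G or T are shown as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate a DNA string codon by codon; '*' marks a stop codon."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))

def six_frame_translations(seq):
    """Return a dict mapping '+1'..'+3' and '-1'..'-3' to protein strings."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames

if __name__ == "__main__":
    for name, protein in six_frame_translations("ATGAAATAG").items():
        print(name, protein)
```

Each of the six strings would then be compared against a protein database, and hits in one frame point to the likely coding strand and frame.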

  29. BLASTX

  30. Blastx on frame lines

  31. Using FASTA Results • FASTA is a global alignment tool (figure labels: BLAST, FASTA) • Reduces sensitivity, increases specificity

  32. EST sequencing (diagram labels: gene with 5’UTR, M, exon, intron, stop, 3’UTR; mRNA: CAP ... AAAAAAAAAA; cDNA: TTTTTTTTT; EST)

  33. ESTs

  34. Showing Multiple Evidence

  35. Illumina sequencing Reads = 35 bases long

  36. Alignment of multiple reads • SSAHA: Ning, Z., Cox, A. J. & Mullikin, J. C. SSAHA: a fast search method for large DNA databases. Genome Res 11, 1725-9 (2001). • MAQ: Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 18, 1851-8 (2008). • Plots - Artemis

  37. Short-read sequencing data - output files
      Smp_scaff001950 190 A 26 @,..,.,....,,.,..,,.,,...., @>><0>;>>>0<6:>>><<>>><>.>5
      Smp_scaff001950 191 A 25 @,..,.....,,.,..,,.,,...., @;><>>>>*8<<6>>><;>>><>>;7
      Smp_scaff001950 192 A 19 @,..,....,,,..,,.... @>6&6<>>;><>>:<><>>:
      Smp_scaff001950 193 A 18 @,.,.....,,,.,,.... @;</<>>6<<&<>><>>>9
      Smp_scaff001950 194 A 18 @,.,.....,,,.,,.... @:9:>>.*'<<<>><>>.'
      Smp_scaff001950 195 T 18 @,.,.....,,,..,,... @6&6<<1+,><9<<>><,3
      Smp_scaff001950 196 C 17 @..,.....,...,,... @6</9>96&6>7><>>+6
      Smp_scaff001950 197 A 17 @..,.....,...,,... @09/>>>&;<>/>>><>>
      Smp_scaff001950 198 A 16 @..,....,...,,... @5&8<>6'<>>;><>>;
      Smp_scaff001950 199 G 16 @T.,....,...,,... @36/<>;&>>+><<&>3
      Smp_scaff001950 200 T 14 @..,.....,...,, @3<><;'23>>;><<
      Smp_scaff001950 201 G 11 @..,....,.., @'*>>6+'<6><
      Smp_scaff001950 202 T 8 @.,....., @7>:&7'<>
      Smp_scaff001950 203 C 7 @T,..T.. @.>8<7*6
      Smp_scaff001950 204 C 7 @.,..... @6<>&8*>
      Smp_scaff001950 205 C 9 @,.,.,.... @88<>;<<*;
      Smp_scaff001950 206 A 11 @,..,.,...., @<96>51><';8
      Smp_scaff001950 207 T 11 @,G.G.,...., @>;,>>;,<>>6
      Smp_scaff001950 208 C 10 @,..,.,..., @<<,>;;<6/;
      Smp_scaff001950 209 C 9 @,..,.,.., @>>5>>;><9
      Smp_scaff001950 210 G 12 @,,A,..A.,.A, @<>;>6/<>,<<6

      Smp_scaff001950 1 X 0
      Smp_scaff001950 2 X 0
      Smp_scaff001950 3 X 0
      Smp_scaff001950 4 X 0
      Smp_scaff001950 5 X 0
      Smp_scaff001950 6 X 0
      Smp_scaff001950 7 X 0
      Smp_scaff001950 8 X 0
      Smp_scaff001950 9 X 0
      Smp_scaff001950 10 X 0
      Smp_scaff001950 11 X 0
      Smp_scaff001950 12 X 0
      Smp_scaff001950 13 X 0
      Smp_scaff001950 14 X 0
      Smp_scaff001950 15 X 0
      Smp_scaff001950 16 X 0
      Smp_scaff001950 17 X 0
      Smp_scaff001950 18 X 0
      Smp_scaff001950 19 X 0
      Smp_scaff001950 20 X 0
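
Output lines of this kind can be summarised per position with a small parser. The Python sketch below assumes, from the look of the data rather than from any tool documentation, that the columns are sequence name, position, reference base, read depth, read bases and base qualities, that the last two columns carry a leading "@", and that "," or "." marks a read base matching the reference while a letter marks a mismatch. The function name `parse_pileup_line` is hypothetical.

```python
def parse_pileup_line(line):
    """Summarise one pileup-style line as a dict of per-position values."""
    fields = line.split()
    name, pos, ref, depth = fields[0], int(fields[1]), fields[2], int(fields[3])
    # Zero-coverage lines have no read-base column at all.
    bases = fields[4][1:] if len(fields) > 4 else ""
    matches = sum(ch in ",." for ch in bases)
    mismatches = sum(ch.isalpha() for ch in bases)
    return {
        "seq": name,
        "pos": pos,
        "ref": ref,
        "depth": depth,
        "matches": matches,
        "mismatches": mismatches,
    }

if __name__ == "__main__":
    print(parse_pileup_line("Smp_scaff001950 203 C 7 @T,..T.. @.>8<7*6"))
    print(parse_pileup_line("Smp_scaff001950 1 X 0"))
```

Collecting the `depth` value for every position gives the coverage profile that can be loaded as a plot alongside the annotation.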

  38. Short-read sequencing data - plots Smp_14000

  39. The Gene Prediction Process ESTs FASTA BlastX DNA SEQUENCE Useful CDS Prediction ANALYSIS SOFTWARE Gene finders Codon Usage AT content Annotator

  40. Gene prediction in eukaryotes: HMMs. Colour key: • highlighted: manually reviewed gene structure • pale brown: hit to H. contortus EST cluster in Nembase, found using PASA • brown-green: hit to H. contortus individual ESTs in NCBI database, found using PASA • pink/red blocks: hits to Uniprot • bright green: twinscan prediction (homology based) • pale pink: snap prediction (ab initio) • yellow: hmmgene prediction (ab initio) • pale blue: genscan prediction (ab initio) • red: genefinder (ab initio) • dark blue: fgenesh prediction (ab initio) • jade green: augustus hints prediction (homology based) • orange: augustus prediction (ab initio) • purple: genewise prediction (homology based)

  41. Gene prediction in eukaryotes: HMMs A B P. falciparum gene predictions (PlasmoDB)

  42. Gene prediction in eukaryotes: HMMs Bartfinder hmmgene geneid Phat EST (contig) combined prediction Dictyostelium discoideum gene predictions

  43. Manual refinement P. falciparum P. knowlesi

  44. Ongoing manual annotation e.g. PF14_0021, PF14_0022 P. falciparum P. vivax Revised annotation (back to two genes!)
