  1. 10/21/05 Gene Prediction (formerly Gene Prediction - 3) D Dobbs ISU - BCB 444/544X: Gene Prediction

  5. Gene Prediction & Regulation • Mon - Gene structure review: eukaryotes vs prokaryotes • Wed - Regulatory regions: promoters & enhancers • Fri - Predicting genes; predicting regulatory regions (?) • Next week: Predicting RNA structure (miRNAs, too)

  6. Reading Assignment • Mount, Bioinformatics • Chp 9 Gene Prediction & Regulation • pp 361-385 Predicting Promoters • Check errata: http://www.bioinformaticsonline.org/help/errata2.html • * Brown, Genomes 2 (NCBI textbooks online) • Sect 9 Overview: Assembly of Transcription Initiation Complex • http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.chapter.7002 • Sect 9.1-9.3 DNA binding proteins, Transcription initiation • http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.section.7016 • * NOTE: Don't worry about the details!!

  7. Optional Reading • Reviews: • Zhang MQ (2002) Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet 3:698-709 http://proxy.lib.iastate.edu:2103/nrg/journal/v3/n9/full/nrg890_fs.html • Wasserman WW & Sandelin A (2004) Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 5:276-287 http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html

  8. Review last lecture: Gene Regulation (formerly Gene Prediction - 2) • cDNAs & ESTs • UniGene • Regulatory regions • Eukaryotes vs prokaryotes

  9. DNA → RNA → protein → phenotype (cDNA is copied from RNA) • [1] Transcription • [2] RNA processing (splicing) • [3] RNA export • [4] RNA surveillance • Pevsner p160

  10. UniGene: unique genes via ESTs • Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene • UniGene clusters contain many ESTs • UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene, you get information on its abundance and its regional distribution • Pevsner p164

  11. Today: Gene Prediction (formerly Gene Prediction - 3) • Predicting genes • Mon - Predicting regulatory regions • Focus on promoters • Introduction to RNA • Later: Genome browsers

  12. Gene Prediction • Overview of steps & strategies • What sequence signals can be used? • What other types of information can be used? • Algorithms • HMMs, discriminant functions, neural nets • Gene prediction software • 3 major types • many, many programs!

  13. Predicting Genes - Basic steps: • Obtain genomic sequence • Translate in all 6 reading frames • Compare with protein sequence database • Perform database similarity search • with EST & cDNA databases, if available • Use gene prediction program to locate genes • Analyze gene regulatory sequences
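The "translate in all 6 reading frames" step can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the standard genetic code is assumed, unknown or incomplete codons become "X", and the function names (`reverse_complement`, `translate`, `six_frame_translation`) are hypothetical helpers, not part of any program named in this lecture.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (assumed table).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
print(frames["+1"])  # MA*
```

Each of the six protein strings could then be compared against a protein database, which is the next step on the slide.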

  14. Overview of gene prediction strategies • What sequence signals can be used? • Transcription: TF binding sites, promoter, initiation site, terminator • Processing signals: splice donor/acceptors, polyA signal • Translation: start (AUG = Met) & stop (UGA, UAA, UAG) codons, ORFs, codon usage • What other types of information can be used? • cDNAs & ESTs (experimental data, pairwise alignment) • Homology (sequence comparison, BLAST)
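As a small illustration of using translation signals, the sketch below scans the three forward frames of a DNA sequence for open reading frames that begin with ATG and end at the first in-frame stop codon (TAA, TAG, TGA, the DNA equivalents of the RNA codons UAA, UAG, UGA). The function name `find_orfs` and the `min_codons` cutoff are assumptions made for this example; real gene finders combine ORFs with many other signals.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for forward-strand ORFs.

    start/end are 0-based, end-exclusive coordinates; the stop codon is included.
    min_codons counts the start and stop codons too (illustrative cutoff).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATGATAGCC"))  # [(2, 11, 2)]
```

The reverse strand can be handled by running the same function on the reverse complement.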

  15. Automated gene prediction strategies • Similarity-based or Comparative • BLAST - Do other organisms have a similar sequence? (Is the sequence similar to a known gene or protein?) • Ab initio = "from the beginning" • Predict without explicit comparison with cDNA or proteins, via "rule-based" gene models (but the rules are derived from statistical analysis of datasets) • Combined "evidence-based" • Combine gene models with alignment to known ESTs & protein sequences • BEST RESULTS? Combined

  16. Examples of gene prediction software • Similarity-based or Comparative • BLAST • SGP2 (extension of GeneID) • Ab initio = "from the beginning" • GeneID (used in lab this week) • GENSCAN (used in lab this week) • GeneMark.hmm (should try this!) • Combined "evidence-based" • GeneSeqer (Brendel et al., ISU) • BEST? GENSCAN, GeneMark.hmm, GeneSeqer, but it depends on the organism & specific task

  17. Gene prediction: Eukaryotes vs prokaryotes • Gene prediction is easier in microbial genomes • Why? Smaller genomes; simpler gene structures; more sequenced genomes (for comparative approaches)! • Methods? Previously, mostly HMM-based; now, similarity-based methods, because so many genomes are available

  18. GeneSeqer - Brendel et al. http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi

  19. Thanks to Volker Brendel, ISU, for the following figures & slides • Slightly modified from: BBSI Genome Informatics Module http://www.bioinformatics.iastate.edu/BBSI/course_desc_2005.html#moduleB • V Brendel vbrendel@iastate.edu

  20. Signals: Pre-mRNA Splicing • Genomic DNA (exons & introns; start codon, stop codon) → Transcription → pre-mRNA (Cap- … -Poly(A)) → Splicing → mRNA (Cap- … -Poly(A)) → Translation → Protein • Splice sites: donor site (GT), acceptor site (AG) • Brendel 2005
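To see why the GT and AG dinucleotides alone are weak signals, one can simply list every occurrence in a sequence. The sketch below does that; `candidate_splice_sites` is a hypothetical helper written for this illustration, and it makes no attempt to score the sites.

```python
def candidate_splice_sites(seq):
    """Return 0-based start positions of every GT (donor) and AG (acceptor) dinucleotide."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


donors, acceptors = candidate_splice_sites("AAGGTAAGTCCAGG")
print(donors)     # [3, 7]
print(acceptors)  # [1, 6, 11]
```

Even this 14-nt toy sequence yields several candidates, so most GT/AG hits in real genomic DNA must be false sites; the statistical models on the following slides are meant to separate true sites from false ones.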

  21. Brendel - Spliced Alignment I: Compare with cDNA or EST probes • [Figure: genomic DNA (start codon, stop codon) aligned with mRNA (Cap-, 5'-UTR, start codon, stop codon, 3'-UTR, -Poly(A))] • Brendel 2005

  22. Brendel - Spliced Alignment II: Compare with protein probes • [Figure: genomic DNA (start codon, stop codon) aligned with protein] • Brendel 2005

  23. Brendel Spliced Alignment Algorithm • Perform pairwise alignment with large gaps in one sequence (introns) • Align genomic DNA with cDNA, EST or protein • Score semi-conserved sequences at splice junctions • Score coding constraints in translated exons Brendel 2005
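The idea of "pairwise alignment with large gaps in one sequence" can be sketched as a small dynamic program with two states, exon and intron. This is a toy illustration, not GeneSeqer's algorithm: the scoring values, the hard requirement that introns start with GT and end with AG, and the name `spliced_align` are all assumptions, and coding constraints are not scored at all.

```python
def spliced_align(genomic, cdna, match=2, mismatch=-2, gap=-3, intron=-5):
    """Align all of cdna to a region of genomic; return (score, introns).

    introns are 0-based, end-exclusive genomic intervals. An intron costs a flat
    penalty regardless of length, must begin with GT and must end with AG.
    """
    NEG = float("-inf")
    m, n = len(genomic), len(cdna)
    # E[i][j]: best score for genomic[:i] vs cdna[:j], ending in an exon.
    # N[i][j]: same, but genomic[i-1] lies inside an intron.
    E = [[NEG] * (n + 1) for _ in range(m + 1)]
    N = [[NEG] * (n + 1) for _ in range(m + 1)]
    back_E = [[None] * (n + 1) for _ in range(m + 1)]
    back_N = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        E[i][0] = 0  # unaligned genomic sequence before the cDNA is free

    for i in range(1, m + 1):
        for j in range(n + 1):
            # Intron state: extend an open intron, or open one at a GT donor.
            cands = [(N[i - 1][j], ("N", i - 1, j))]
            if genomic[i - 1:i + 1] == "GT":
                cands.append((E[i - 1][j] + intron, ("E", i - 1, j)))
            N[i][j], back_N[i][j] = max(cands, key=lambda c: c[0])
            if j == 0:
                continue
            # Exon state: match/mismatch, small gaps, or leave an intron after AG.
            s = match if genomic[i - 1] == cdna[j - 1] else mismatch
            cands = [
                (E[i - 1][j - 1] + s, ("E", i - 1, j - 1)),
                (E[i - 1][j] + gap, ("E", i - 1, j)),
                (E[i][j - 1] + gap, ("E", i, j - 1)),
            ]
            if i >= 3 and genomic[i - 3:i - 1] == "AG":
                cands.append((N[i - 1][j - 1] + s, ("N", i - 1, j - 1)))
            E[i][j], back_E[i][j] = max(cands, key=lambda c: c[0])

    # Trailing genomic sequence is free: take the best end position.
    end_i = max(range(m + 1), key=lambda i: E[i][n])
    score, introns, intron_end = E[end_i][n], [], None
    state, i, j = "E", end_i, n
    while True:
        prev = back_E[i][j] if state == "E" else back_N[i][j]
        if prev is None:
            break
        if state == "E" and prev[0] == "N":
            intron_end = prev[1]                  # intron stops before genomic[prev[1]]
        elif state == "N" and prev[0] == "E":
            introns.append((prev[1], intron_end))  # intron starts at genomic[prev[1]]
        state, i, j = prev
    introns.reverse()
    return score, introns


genomic = "TT" + "ATGCCCAAC" + "GTATCTCTTTTCAG" + "CTCTTCTGA" + "TT"
cdna = "ATGCCCAAC" + "CTCTTCTGA"
print(spliced_align(genomic, cdna))  # (31, [(11, 25)])
```

The flat intron penalty is the design choice that lets one long gap stand in for an intron; a realistic scorer would instead weight the donor and acceptor by how well they match splice-site models.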

  24. Donor (GT) & Acceptor (AG) Sites Used for Model Training • Number of true splice sites per phase:

      Species               Type   Phase 1   Phase 2   Phase 3
      Homo sapiens          GT        6586      5277      3037
                            AG        6555      5194      2979
      Mus musculus          GT        1212      1185       521
                            AG        1194      1139       504
      Rattus norvegicus     GT         450       408       147
                            AG         442       386       140
      Gallus gallus         GT         288       238       107
                            AG         284       228       103
      Drosophila            GT         989       670       524
                            AG        1001       671       536
      C. elegans            GT       37029     20500     20789
                            AG       36864     20325     20626
      S. pombe              GT         170       118       119
                            AG         179       122       118
      Aspergillus           GT         221       176       157
                            AG         217       172       163
      Arabidopsis thaliana  GT       23019      9297      8653
                            AG       22929      9247      8611
      Zea mays              GT         316       107        88
                            AG         311       104        83

      Brendel 2005

  25. Splice Site Detection • Information content I_i: • Extent of splice signal window: • i: ith position in sequence • Ī: average information content over all positions i > 20 nt from splice site • σ_Ī: average standard deviation of Ī • Brendel 2005
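The slide names the information content I_i without giving its formula in text. A common definition for DNA, assumed here, is I_i = 2 + sum over bases B of f_iB * log2(f_iB), where f_iB is the frequency of base B at position i among aligned sites; it ranges from 0 bits (all four bases equally frequent) to 2 bits (one base only). The sketch below computes it per column; `information_content` is a hypothetical helper name and no small-sample correction is applied.

```python
import math
from collections import Counter


def information_content(sites):
    """Per-position information content (bits) for equal-length aligned DNA sites."""
    n = len(sites)
    content = []
    for column in zip(*sites):  # iterate over alignment columns
        counts = Counter(column)
        # sum of f * log2(f) over observed bases (absent bases contribute 0)
        entropy_term = sum((c / n) * math.log2(c / n) for c in counts.values())
        content.append(2.0 + entropy_term)
    return content


# Invariant GT dinucleotide followed by a fully variable position.
print(information_content(["GTA", "GTC", "GTG", "GTT"]))  # [2.0, 2.0, 0.0]
```

Positions far from the splice site would be expected to sit near the background average Ī, so the window of positions that rise clearly above that background marks the extent of the splice signal.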

  26. Results? • [Figure panels: Human T2_GT, T2_AG, F1_AG, Fi_AG; A. thaliana T2_GT, T2_AG, F1_AG, Fi_AG] • Brendel 2005

  27. Bayesian Splice Site Prediction • Let S = s_{-l} s_{-l+1} s_{-l+2} … s_{-1} GT s_1 s_2 s_3 … s_r • where H indexes the hypotheses of GT or AG at: • True site in reading phase 1, 2, or 0 • False within-exon site in reading phase 1, 2, or 0 • False within-intron site • Brendel 2005
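The slide does not spell out the posterior probability in text. By Bayes' theorem it would take the general form below, with the sum running over the seven hypotheses listed (three true-site phases, three false within-exon phases, and one false within-intron class). How the likelihoods P(S | H) and priors P(H) are modeled is not stated here, so this is only the generic statement and not necessarily the exact expression on the slide.

```latex
P(H \mid S) \;=\; \frac{P(S \mid H)\, P(H)}{\sum_{H'} P(S \mid H')\, P(H')}
```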

  28. Bayes Factor as Decision Criterion • H0: H = T • 2-class model: • 7-class model: • Brendel 2005

  29. Interpretation of Bayes Factor in terms of Critical Value c = 2 ln BF • Positive evidence for H0 if 2 ≤ c ≤ 6 • Strong support for H0 if 6 ≤ c ≤ 10 • Very strong support for H0 if c > 10 • Brendel 2005
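A minimal sketch of the decision rule for the 2-class case, assuming the Bayes factor is the likelihood ratio BF = P(S | true site) / P(S | false site). The function names are hypothetical, log-likelihoods are taken as inputs so that the likelihood model itself stays out of scope, and the handling of the shared boundary c = 6 and of values below 2 is a choice made for this example.

```python
def critical_value(loglik_true, loglik_false):
    """c = 2 ln BF, with BF = P(S | true) / P(S | false) given as natural-log likelihoods."""
    return 2.0 * (loglik_true - loglik_false)


def interpret(c):
    """Map a critical value onto the evidence scale from the slide."""
    if c > 10:
        return "very strong"
    if c >= 6:   # boundary value 6 assigned to "strong" (assumption)
        return "strong"
    if c >= 2:
        return "positive"
    return "weak or none"  # below the scale shown on the slide


c = critical_value(-10.0, -14.0)
print(c, interpret(c))  # 8.0 strong
```

Raising the cutoff on c trades sensitivity for specificity, which is the trade-off reported for Bayes factor cutoffs 0, 3 and 6 in the evaluation table.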

  30. Evaluation of Splice Site Prediction

                          Actual True    Actual False
      Predicted True      TP             FP              PP = TP + FP
      Predicted False     FN             TN              PN = FN + TN
                          AP = TP + FN   AN = FP + TN

      = Coverage • Misclassification rates: • Sensitivity: • Specificity: • Normalized specificity: • Brendel 2005
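The formulas for these measures are not given in text on the slide. The sketch below uses definitions that are common in the gene prediction literature and are assumed here: misclassification rates α = FN/AP and β = FP/AN, sensitivity Sn = TP/AP, specificity Sp = TP/PP, and normalized specificity σ = (1 − α)/(1 − α + β). The function name `splice_site_metrics` is hypothetical.

```python
def splice_site_metrics(tp, fp, fn, tn):
    """Compute evaluation measures from confusion-matrix counts (assumed definitions)."""
    ap = tp + fn  # actual positives
    an = fp + tn  # actual negatives
    pp = tp + fp  # predicted positives
    alpha = fn / ap                         # rate of missed true sites
    beta = fp / an                          # rate of false sites called true
    sn = tp / ap                            # sensitivity = 1 - alpha
    sp = tp / pp                            # specificity (fraction of predictions that are true)
    sigma = (1 - alpha) / (1 - alpha + beta)  # specificity normalized for class sizes
    return {"alpha": alpha, "beta": beta, "Sn": sn, "Sp": sp, "sigma": sigma}


metrics = splice_site_metrics(tp=90, fp=30, fn=10, tn=870)
print(round(metrics["Sn"], 3), round(metrics["Sp"], 3), round(metrics["sigma"], 3))
```

Because false GT/AG sites vastly outnumber true ones, Sp can be low even when β is small; σ removes that dependence on the ratio of false to true sites, which makes it easier to compare across test sets.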

  31. Splice site prediction results on test site sets:

      Species       Model  Site   True   False   Bayes factor   Sn (%)   σ (%)   Sp (%)
      Homo sapiens  2C     GT      921   44411        0           98.5    90.5    16.4
                                                      3           91.7    96.3    34.8
                                                      6           66.3    98.5    57.6
                           AG      920   65103        0           96.3    88.4     9.7
                                                      3           90.3    92.9    15.7
                                                      6           76.1    96.1    25.6
      Drosophila    2C     GT      329   11501        0           95.4    94.8    34.1
                                                      3           90.0    97.6    53.6
                                                      6           83.9    99.1    75.0
                           AG      329   14920        0           95.7    94.8    28.7
                                                      3           92.1    97.0    41.4
                                                      6           85.1    98.5    59.4
      C. elegans    7C     GT      400    7460        0           97.8    92.7    40.4
                                                      3           94.2    97.1    64.3
                                                      6           84.8    99.1    85.4
                           AG      400   10132        0           98.8    97.2    58.2
                                                      3           96.2    98.8    76.9
                                                      6           90.2    99.5    88.5
      A. thaliana   7C     GT      613    9027        0           99.5    93.2    48.1
                                                      3           95.6    97.6    73.2
                                                      6           87.1    99.3    91.0
                           AG      614   10196        0           99.2    92.3    41.9
                                                      3           96.4    96.4    62.0
                                                      6           87.1    98.6    81.2

      Brendel 2005

  32. Performance? • [Figure: plots with axes Sn and σ; panels: Human GT site, Human AG site, C. elegans GT site, C. elegans AG site, A. thaliana GT site, A. thaliana AG site] • Brendel 2005

  33. PG PG (1-PG)(1-PD(n+1)) en en+1 (1-PG)PD(n+1) PA(n)PG (1-PG)PD(n+1) in in+1 1-PA(n) Markov Model for Spliced Alignment Brendel 2005 D Dobbs ISU - BCB 444/544X: Gene Prediction

  34. Performance vs other methods • Comparison with ab initio gene prediction programs? • Depends on: • Availability of ESTs • Availability of protein homologs Brendel 2005

  35. GeneSeqer vs NAP vs GENSCAN (Exon prediction) • [Plot: Exon (Sn + Sp) / 2 (y axis, 0.00 to 1.00) vs target protein alignment score (x axis, 0 to 100); curves for GeneSeqer, NAP and GENSCAN] • GENSCAN - Burge, MIT • Brendel 2005

  36. GeneSeqer vs NAP vs GENSCAN (Intron prediction) • [Plot: Intron (Sn + Sp) / 2 (y axis, 0.00 to 1.00) vs target protein alignment score (x axis, 0 to 100); curves for GeneSeqer, NAP and GENSCAN] • GENSCAN - Burge, MIT • Brendel 2005

  37. GeneSeqer • [Flow diagram boxes: Genomic Sequence; EST or protein database (Suffix Array / Suffix Tree); Fast Search; Spliced Alignment; Assembly; Output] • Brendel 2005
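To give a feel for the "fast search" step, the sketch below builds a suffix array over one sequence and uses binary search to find all exact occurrences of a short seed. It is a toy: the construction is the naive sort of all suffixes, which sequence is indexed and which is queried is left open, and the names `build_suffix_array` and `find_occurrences` are hypothetical, not GeneSeqer functions.

```python
def build_suffix_array(text):
    """Return suffix start positions sorted by suffix (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return sorted start positions of exact matches of pattern in text."""
    k = len(pattern)
    # Lower bound: first suffix whose k-prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose k-prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "ACGTACGTTACG"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ACG"))  # [0, 4, 9]
```

Seed hits found this way would only nominate candidate regions; the slower spliced alignment is then run on those regions alone.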

  44. Gene Structure Annotation - Problems • False positive intergenic region: • 2 annotated genes actually correspond to a single gene • False negative intergenic region: • One annotated gene structure actually contains 2 genes • False negative gene prediction: • Missing gene (no annotation) • Other: • partially incorrect gene annotation • missing annotation of alternative transcripts Brendel 2005

  46. Other Resources • Current Protocols in Bioinformatics • http://www.4ulr.com/products/currentprotocols/bioinformatics.html • Finding Genes • 4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations • 4.2 Using MZEF To Find Internal Coding Exons • 4.3 Using GENEID to Identify Genes • 4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes • 4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm • 4.6 Eukaryotic Gene Prediction Using GeneMark.hmm • 4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome • 4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences • 4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation • 4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences

