10/21/05 Gene Prediction (formerly Gene Prediction - 3)


Presentation Transcript


  1. 10/21/05 Gene Prediction (formerly Gene Prediction - 3) D Dobbs ISU - BCB 444/544X: Gene Prediction

  2. Announcements
     • Exam 2 - next Friday
     • Posted online: Exam 2 Study Guide
     • 544 Reading Assignment (2 papers)

  3. Announcements
     • 544 Semester Projects - Information needed:
     • Please send email to me (or David): ddobbs@iastate.edu
     • Briefly describe:
       • Your background & current grad research
       • Is there a problem related to your research that you would like to learn more about & develop as a project for this course?
       • or: What would your 'dream' project be?

  4. Announcements: 2 Bioinformatics Seminars today (Fri Oct 21)
     • 12:10 PM - BCB Faculty Seminar in E164 Lagomarcino: "Protein Networks", Bob Jernigan, BBMB & Director, Baker Center for Bioinformatics & Biological Statistics
       http://www.bcb.iastate.edu/courses/BCB691-F2005.html#Oct%2021
     • 4:10 PM - GDCB Special Seminar in 1414 MBB: "Integrating the Unknown-eome with Abiotic Stress Response Networks in Arabidopsis", Ron Mittler, Dept. of Biochem & Mol Biology, University of Nevada, Reno

  5. Gene Prediction & Regulation
     • Mon - Gene structure review: Eukaryotes vs prokaryotes
     • Wed - Regulatory regions: Promoters & enhancers
     • Fri - Predicting genes; predicting regulatory regions (?)
     • Next week: Predicting RNA structure (miRNAs, too)

  6. Reading Assignment
     • Mount, Bioinformatics
       • Ch. 9 Gene Prediction & Regulation
       • pp 361-385 Predicting Promoters
       • Check errata: http://www.bioinformaticsonline.org/help/errata2.html
     • * Brown, Genomes 2 (NCBI textbooks online)
       • Sect 9 Overview: Assembly of Transcription Initiation Complex
         http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.chapter.7002
       • Sect 9.1-9.3 DNA binding proteins, Transcription initiation
         http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.section.7016
     * NOTE: Don't worry about the details!!

  7. Optional Reading
     • Reviews:
       • Zhang MQ (2002) Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet 3:698-709
         http://proxy.lib.iastate.edu:2103/nrg/journal/v3/n9/full/nrg890_fs.html
       • Wasserman WW & Sandelin A (2004) Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 5:276-287
         http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html

  8. Review last lecture: Gene Regulation (formerly Gene Prediction-2)
     • cDNAs & ESTs
     • UniGene
     • Regulatory regions
     • Eukaryotes vs prokaryotes

  9. DNA → RNA → protein → phenotype (diagram, with cDNA shown alongside RNA)
     • [1] Transcription
     • [2] RNA processing (splicing)
     • [3] RNA export
     • [4] RNA surveillance
     Pevsner p160

  10. UniGene: unique genes via ESTs
      • Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
      • UniGene clusters contain many ESTs
      • UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene, you get information on its abundance and its regional distribution
      Pevsner p164

  11. Today: Gene Prediction (formerly Gene Prediction - 3)
      • Predicting genes
      • Mon - Predicting regulatory regions: focus on promoters
      • Introduction to RNA
      • Later: Genome browsers

  12. Gene Prediction
      • Overview of steps & strategies
        • What sequence signals can be used?
        • What other types of information can be used?
      • Algorithms
        • HMMs, discriminant functions, neural nets
      • Gene prediction software
        • 3 major types
        • many, many programs!

  13. Predicting Genes - Basic steps:
      • Obtain genomic sequence
      • Translate in all 6 reading frames
      • Compare with protein sequence database
      • Perform database similarity search with EST & cDNA databases, if available
      • Use gene prediction program to locate genes
      • Analyze gene regulatory sequences
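      The "translate in all 6 reading frames" step can be sketched in a few lines of Python. This is only an illustration, not the code used by any program named in these slides: the function names are made up for the example, the standard genetic code is assumed, and incomplete or ambiguous codons are reported as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumed here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, s in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGCCTAA").items():
        print(label, protein)
```

      Each of the six protein strings would then be compared against a protein sequence database (the next step on the slide); "*" marks a stop codon.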

  14. Overview of gene prediction strategies
      • What sequence signals can be used?
        • Transcription: TF binding sites, promoter, initiation site, terminator
        • Processing signals: splice donor/acceptors, polyA signal
        • Translation: start (AUG = Met) & stop (UGA, UAA, UAG)
        • ORFs, codon usage
      • What other types of information can be used?
        • cDNAs & ESTs (experimental data, pairwise alignment)
        • homology (sequence comparison, BLAST)
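      As a small illustration of how the translation signals above (start codon, stop codons, ORFs) can be used, here is a minimal ORF scan in Python. It is a sketch under simplifying assumptions: it looks only at the forward strand, works on DNA (ATG, and TAA/TAG/TGA for the stops), and the function name and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA equivalents of UAA, UAG, UGA


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    start is 0-based, end is exclusive and includes the stop codon.
    min_codons counts coding codons (ATG included, stop excluded).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next ATG
    return orfs


if __name__ == "__main__":
    seq = "CCATGAAATGATAGCC"
    for start, end, frame in find_orfs(seq):
        print(frame, start, end, seq[start:end])
```

      A real gene finder would also scan the reverse strand and weigh codon usage; in eukaryotes a plain ORF scan is not enough because introns interrupt the coding sequence.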

  15. Automated gene prediction strategies
      • Similarity-based or Comparative
        • BLAST - Do other organisms have similar sequence? (Is the sequence similar to a known gene or protein?)
      • Ab initio = "from the beginning"
        • Predict without explicit comparison with cDNA or proteins via "rule-based" gene models - but rules are derived from statistical analysis of datasets
      • Combined "evidence-based"
        • Combine gene models with alignment to known ESTs & protein sequences
      • BEST RESULTS? Combined

  16. Examples of gene prediction software
      • Similarity-based or Comparative
        • BLAST
        • SGP2 (extension of GeneID)
      • Ab initio = "from the beginning"
        • GeneID - (used in lab this week)
        • GENSCAN - (used in lab this week)
        • GeneMark.hmm - (should try this!)
      • Combined "evidence-based"
        • GeneSeqer (Brendel et al., ISU)
      • BEST? GENSCAN, GeneMark.hmm, GeneSeqer
        • but depends on organism & specific task

  17. Gene prediction: Eukaryotes vs prokaryotes
      • Gene prediction is easier in microbial genomes
      • Why?
        • Smaller genomes
        • Simpler gene structures
        • More sequenced genomes! (for comparative approaches)
      • Methods?
        • Previously, mostly HMM-based
        • Now: similarity-based methods, because so many genomes are available

  18. GeneSeqer - Brendel et al. http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi

  19. Thanks to Volker Brendel, ISU, for the following figures & slides. Slightly modified from: BBSI Genome Informatics Module http://www.bioinformatics.iastate.edu/BBSI/course_desc_2005.html#moduleB V Brendel vbrendel@iastate.edu

  20. Signals: Pre-mRNA Splicing
      • Diagram: genomic DNA (start codon, stop codon, exons, introns) → transcription → pre-mRNA (Cap- ... -Poly(A)) → splicing → mRNA (Cap- ... -Poly(A)) → translation → protein
      • Splice sites: donor site (GT) and acceptor site (AG) flank each intron
      Brendel 2005

  21. Brendel - Spliced Alignment I: Compare with cDNA or EST probes
      • Diagram: genomic DNA (start codon, stop codon) aligned with mRNA (Cap-, 5'-UTR, start codon, stop codon, 3'-UTR, -Poly(A))
      Brendel 2005

  22. Brendel - Spliced Alignment II: Compare with protein probes
      • Diagram: genomic DNA (start codon, stop codon) aligned with protein
      Brendel 2005

  23. Brendel Spliced Alignment Algorithm
      • Perform pairwise alignment with large gaps in one sequence (introns)
      • Align genomic DNA with cDNA, EST or protein
      • Score semi-conserved sequences at splice junctions
      • Score coding constraints in translated exons
      Brendel 2005
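      To make the idea of "pairwise alignment with large gaps (introns)" concrete, here is a deliberately simplified toy in Python. It is not the GeneSeqer algorithm: it requires the cDNA to match the exons exactly, scores nothing, and simply insists that every intron starts with GT and ends with AG. The function name and the `min_intron` parameter are assumptions made for this sketch.

```python
from functools import lru_cache


def spliced_match(genomic, cdna, min_intron=4):
    """Toy exact spliced alignment (recursive, so for short sequences only).

    Returns a list of (start, end) exon intervals in `genomic` (0-based,
    end-exclusive) whose concatenation equals `cdna`, where every skipped
    intron begins with GT and ends with AG. Returns None if no such
    arrangement exists.
    """
    g, c = genomic.upper(), cdna.upper()
    n, m = len(g), len(c)

    @lru_cache(maxsize=None)
    def solve(i, j):
        """Introns needed to match c[j:] starting at g[i], or None."""
        if j == m:
            return ()
        if i >= n:
            return None
        if g[i] == c[j]:  # extend the current exon by one base
            rest = solve(i + 1, j + 1)
            if rest is not None:
                return rest
        if j > 0 and g[i:i + 2] == "GT":  # try to open an intron here
            for k in range(i + min_intron, n):
                # intron is g[i:k]; the next exon must start with c[j]
                if g[k - 2:k] == "AG" and g[k] == c[j]:
                    rest = solve(k + 1, j + 1)
                    if rest is not None:
                        return ((i, k),) + rest
        return None

    for start in range(n):
        if m == 0 or g[start] != c[0]:
            continue
        introns = solve(start + 1, 1)
        if introns is None:
            continue
        exons, cur = [], start
        for donor, acceptor in introns:
            exons.append((cur, donor))
            cur = acceptor
        exons.append((cur, start + m + sum(a - d for d, a in introns)))
        return exons
    return None


if __name__ == "__main__":
    genomic = "CC" + "ATGGCC" + "GTAAGTTTTCAG" + "TTCTAA" + "GG"
    print(spliced_match(genomic, "ATGGCCTTCTAA"))
```

      A real spliced aligner replaces the exact-match test with alignment scores, adds splice-site scores at each candidate GT/AG, and scores coding constraints in the translated exons, as listed on the slide.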

  24. Donor (GT) & Acceptor (AG) Sites Used for Model Training
      Number of true splice sites per phase:

      | Species              | Type | Phase 1 | Phase 2 | Phase 3 |
      |----------------------|------|---------|---------|---------|
      | Homo sapiens         | GT   | 6586    | 5277    | 3037    |
      | Homo sapiens         | AG   | 6555    | 5194    | 2979    |
      | Mus musculus         | GT   | 1212    | 1185    | 521     |
      | Mus musculus         | AG   | 1194    | 1139    | 504     |
      | Rattus norvegicus    | GT   | 450     | 408     | 147     |
      | Rattus norvegicus    | AG   | 442     | 386     | 140     |
      | Gallus gallus        | GT   | 288     | 238     | 107     |
      | Gallus gallus        | AG   | 284     | 228     | 103     |
      | Drosophila           | GT   | 989     | 670     | 524     |
      | Drosophila           | AG   | 1001    | 671     | 536     |
      | C. elegans           | GT   | 37029   | 20500   | 20789   |
      | C. elegans           | AG   | 36864   | 20325   | 20626   |
      | S. pombe             | GT   | 170     | 118     | 119     |
      | S. pombe             | AG   | 179     | 122     | 118     |
      | Aspergillus          | GT   | 221     | 176     | 157     |
      | Aspergillus          | AG   | 217     | 172     | 163     |
      | Arabidopsis thaliana | GT   | 23019   | 9297    | 8653    |
      | Arabidopsis thaliana | AG   | 22929   | 9247    | 8611    |
      | Zea mays             | GT   | 316     | 107     | 88      |
      | Zea mays             | AG   | 311     | 104     | 83      |

      Brendel 2005

  25. Splice Site Detection
      • Information content I_i (formula shown as an image on the slide)
      • Extent of splice signal window (formula shown as an image on the slide)
      • i: ith position in sequence
      • Ī: average information content over all positions i > 20 nt from splice site
      • σ_Ī: average standard deviation of Ī
      Brendel 2005
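      The slide's own formulas did not survive in this transcript, so the following Python sketch falls back on the usual Shannon definition of per-position information content for DNA, I_i = 2 + Σ_B f_iB log2 f_iB (in bits). The window rule used here, keeping positions whose I_i exceeds the background mean by `z` standard deviations, is an assumption for illustration, as are the function names and the default `z = 1.96`.

```python
import math
from collections import Counter


def information_content(aligned_sites):
    """Per-position information content (bits) of equal-length DNA sites.

    Uses I_i = 2 - H_i, where H_i is the Shannon entropy of the base
    frequencies at position i (no small-sample correction).
    """
    length = len(aligned_sites[0])
    profile = []
    for pos in range(length):
        counts = Counter(site[pos] for site in aligned_sites)
        total = sum(counts.values())
        entropy = -sum(
            (n / total) * math.log2(n / total) for n in counts.values()
        )
        profile.append(2.0 - entropy)
    return profile


def signal_positions(profile, background_mean, background_sd, z=1.96):
    """Positions whose information content exceeds mean + z * sd.

    The threshold form and z = 1.96 are assumptions of this sketch.
    """
    threshold = background_mean + z * background_sd
    return [i for i, ic in enumerate(profile) if ic >= threshold]


if __name__ == "__main__":
    # Hypothetical mini-alignment of four donor-like sites.
    sites = ["GTAA", "GTGC", "GTAG", "GTCT"]
    profile = information_content(sites)
    print([round(x, 2) for x in profile])
    print(signal_positions(profile, background_mean=0.1, background_sd=0.1))
```

      Invariant positions (the GT here) carry the full 2 bits, while positions with uniform base usage carry none; the splice signal window is the stretch around the site where the information content stands out from background.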

  26. Results? (Figure panels: Human T2_GT, Human T2_AG, Human F1_AG, Human Fi_AG; A. thaliana T2_GT, A. thaliana T2_AG, A. thaliana F1_AG, A. thaliana Fi_AG) Brendel 2005

  27. Bayesian Splice Site Prediction
      • Let $S = s_{-l}\, s_{-l+1}\, s_{-l+2} \ldots s_{-1}\, \mathrm{GT}\, s_1\, s_2\, s_3 \ldots s_r$
      • (Formula shown as an image on the slide), where H indexes the hypotheses of GT or AG at:
        • True site in reading phase 1, 2, or 0
        • False within-exon site in reading phase 1, 2, or 0
        • False within-intron site
      Brendel 2005

  28. Bayes Factor as Decision Criterion
      • H0: H = T
      • 2-class model and 7-class model (formulas shown as images on the slide)
      Brendel 2005

  29. Interpretation of Bayes Factor in terms of Critical Value c = 2 ln BF
      • Positive evidence for H0 if 2 ≤ c ≤ 6
      • Strong support for H0 if 6 ≤ c ≤ 10
      • Very strong support for H0 if c > 10
      Brendel 2005
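      The scale above translates directly into code. This Python sketch assumes the Bayes factor is the ratio of the likelihood of the sequence under H0 to its likelihood under the alternative; the function names are hypothetical, and because the slide's ranges share their endpoints, the choice to put c = 6 in "strong" is arbitrary.

```python
import math


def critical_value(likelihood_h0, likelihood_alt):
    """c = 2 ln BF, with BF assumed to be likelihood_h0 / likelihood_alt."""
    return 2.0 * math.log(likelihood_h0 / likelihood_alt)


def interpret(c):
    """Map a critical value onto the evidence categories from the slide."""
    if c > 10:
        return "very strong support for H0"
    if c >= 6:  # boundary value 6 assigned to 'strong' by choice
        return "strong support for H0"
    if c >= 2:
        return "positive evidence for H0"
    return "no positive evidence for H0"


if __name__ == "__main__":
    # Hypothetical likelihoods for one candidate site.
    c = critical_value(0.02, 0.001)
    print(round(c, 2), interpret(c))
```

      Raising the cutoff on c trades sensitivity for specificity, which is what the thresholds 0, 3 and 6 in the later results table explore.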

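  30. Evaluation of Splice Site Prediction

      |                 | Actual True  | Actual False | Total        |
      |-----------------|--------------|--------------|--------------|
      | Predicted True  | TP           | FP           | PP = TP + FP |
      | Predicted False | FN           | TN           | PN = FN + TN |
      | Total           | AP = TP + FN | AN = FP + TN |              |

      (Formulas for the following were shown as images on the slide)
      • Misclassification rates
      • Sensitivity (= Coverage)
      • Specificity
      • Normalized specificity
      Brendel 2005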
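      Because the formulas on this slide were lost, the Python sketch below fills them in with definitions that are assumptions, not quotations: misclassification rates α = FN/AP and β = FP/AN, sensitivity Sn = TP/AP, specificity Sp = TP/PP (the gene-prediction convention), and normalized specificity σ = (1 - α)/(1 - α + β). These choices are at least consistent with the Sn, Sp and normalized-specificity percentages reported on the next slide. The counts in the usage example are invented.

```python
def evaluate(tp, fp, fn, tn):
    """Compute splice-site evaluation measures from a 2x2 confusion table.

    All definitions below are assumed (the slide's formulas were images).
    """
    ap = tp + fn  # actual positives
    an = fp + tn  # actual negatives
    pp = tp + fp  # predicted positives
    alpha = fn / ap  # fraction of true sites missed
    beta = fp / an   # fraction of false sites accepted
    return {
        "alpha": alpha,
        "beta": beta,
        "Sn": tp / ap,  # sensitivity (coverage) = 1 - alpha
        "Sp": tp / pp,  # specificity as the fraction of predictions that are true
        "sigma": (1 - alpha) / (1 - alpha + beta),  # normalized specificity
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show the arithmetic.
    for name, value in evaluate(tp=90, fp=10, fn=10, tn=890).items():
        print(f"{name}: {value:.4f}")
```

      Unlike Sp, the normalized specificity does not depend on the ratio of true to false sites in the test set, which matters here because false GT/AG dinucleotides vastly outnumber true splice sites.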

  31. Test results by species, model and site

      | Species      | Model | Site | True sites | False sites | Bayes Factor | Sn (%) | σ (%) | Sp (%) |
      |--------------|-------|------|------------|-------------|--------------|--------|-------|--------|
      | Homo sapiens | 2C    | GT   | 921        | 44411       | 0            | 98.5   | 90.5  | 16.4   |
      |              |       |      |            |             | 3            | 91.7   | 96.3  | 34.8   |
      |              |       |      |            |             | 6            | 66.3   | 98.5  | 57.6   |
      | Homo sapiens | 2C    | AG   | 920        | 65103       | 0            | 96.3   | 88.4  | 9.7    |
      |              |       |      |            |             | 3            | 90.3   | 92.9  | 15.7   |
      |              |       |      |            |             | 6            | 76.1   | 96.1  | 25.6   |
      | Drosophila   | 2C    | GT   | 329        | 11501       | 0            | 95.4   | 94.8  | 34.1   |
      |              |       |      |            |             | 3            | 90.0   | 97.6  | 53.6   |
      |              |       |      |            |             | 6            | 83.9   | 99.1  | 75.0   |
      | Drosophila   | 2C    | AG   | 329        | 14920       | 0            | 95.7   | 94.8  | 28.7   |
      |              |       |      |            |             | 3            | 92.1   | 97.0  | 41.4   |
      |              |       |      |            |             | 6            | 85.1   | 98.5  | 59.4   |
      | C. elegans   | 7C    | GT   | 400        | 7460        | 0            | 97.8   | 92.7  | 40.4   |
      |              |       |      |            |             | 3            | 94.2   | 97.1  | 64.3   |
      |              |       |      |            |             | 6            | 84.8   | 99.1  | 85.4   |
      | C. elegans   | 7C    | AG   | 400        | 10132       | 0            | 98.8   | 97.2  | 58.2   |
      |              |       |      |            |             | 3            | 96.2   | 98.8  | 76.9   |
      |              |       |      |            |             | 6            | 90.2   | 99.5  | 88.5   |
      | A. thaliana  | 7C    | GT   | 613        | 9027        | 0            | 99.5   | 93.2  | 48.1   |
      |              |       |      |            |             | 3            | 95.6   | 97.6  | 73.2   |
      |              |       |      |            |             | 6            | 87.1   | 99.3  | 91.0   |
      | A. thaliana  | 7C    | AG   | 614        | 10196       | 0            | 99.2   | 92.3  | 41.9   |
      |              |       |      |            |             | 3            | 96.4   | 96.4  | 62.0   |
      |              |       |      |            |             | 6            | 87.1   | 98.6  | 81.2   |

      Brendel 2005

  32. Performance? (Figure panels: Human GT site, Human AG site; C. elegans GT site, C. elegans AG site; A. thaliana GT site, A. thaliana AG site. One axis is Sn; the symbol on the other axis was not preserved.) Brendel 2005

  33. PG PG (1-PG)(1-PD(n+1)) en en+1 (1-PG)PD(n+1) PA(n)PG (1-PG)PD(n+1) in in+1 1-PA(n) Markov Model for Spliced Alignment Brendel 2005 D Dobbs ISU - BCB 444/544X: Gene Prediction

  34. Performance vs other methods
      • Comparison with ab initio gene prediction programs?
      • Depends on:
        • Availability of ESTs
        • Availability of protein homologs
      Brendel 2005

  35. GeneSeqer vs NAP vs GENSCAN (Exon prediction). Plot: Exon (Sn + Sp) / 2 (y-axis, 0.00 to 1.00) vs target protein alignment score (x-axis, 0 to 100); series: GeneSeqer, NAP, GENSCAN. GENSCAN - Burge, MIT. Brendel 2005

  36. GeneSeqer vs NAP vs GENSCAN (Intron prediction). Plot: Intron (Sn + Sp) / 2 (y-axis, 0.00 to 1.00) vs target protein alignment score (x-axis, 0 to 100); series: GeneSeqer, NAP, GENSCAN. GENSCAN - Burge, MIT. Brendel 2005

  37. GeneSeqer (Flowchart components: Genomic Sequence; EST or protein database (Suffix Array / Suffix Tree); Fast Search; Spliced Alignment; Assembly; Output) Brendel 2005

  38-43. (Figure-only slides; no text content) Brendel 2005

  44. Gene Structure Annotation - Problems
      • False positive intergenic region: 2 annotated genes actually correspond to a single gene
      • False negative intergenic region: one annotated gene structure actually contains 2 genes
      • False negative gene prediction: missing gene (no annotation)
      • Other:
        • partially incorrect gene annotation
        • missing annotation of alternative transcripts
      Brendel 2005

  45. (Figure-only slide; no text content) Brendel 2005

  46. Other Resources
      • Current Protocols in Bioinformatics
        http://www.4ulr.com/products/currentprotocols/bioinformatics.html
      • Finding Genes
        • 4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations
        • 4.2 Using MZEF To Find Internal Coding Exons
        • 4.3 Using GENEID to Identify Genes
        • 4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes
        • 4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm
        • 4.6 Eukaryotic Gene Prediction Using GeneMark.hmm
        • 4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome
        • 4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences
        • 4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation
        • 4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences
