Explore automated gene prediction strategies, concepts of gene prediction, regulatory regions, genome browsers, and more. Stay updated on upcoming bioinformatics seminars and projects in the field.
10/21/05 Gene Prediction (formerly Gene Prediction - 3) D Dobbs ISU - BCB 444/544X: Gene Prediction
Announcements Exam 2 - next Friday Posted online: Exam 2 Study Guide 544 Reading Assignment (2 papers)
Announcements • 544 Semester Projects - Information needed: • Please send email to me (or David) • ddobbs@iastate.edu • Briefly describe: • Your background & current grad research • Is there a problem related to your research you would like to learn more about & develop as project for this course? • or • What would your ‘dream’ project be?
Announcements 2 Bioinformatics Seminars today (Fri Oct 21) 12:10 PM BCB Faculty Seminar in E164 Lagomarcino “Protein Networks” Bob Jernigan, BBMB & Director, Baker Center for Bioinformatics & Biological Statistics http://www.bcb.iastate.edu/courses/BCB691-F2005.html#Oct%2021 4:10 PM GDCB Special Seminar in 1414 MBB “Integrating the Unknown-eome with Abiotic Stress Response Networks in Arabidopsis” Ron Mittler, Dept. of Biochem & Mol Biology, University of Nevada, Reno
Gene Prediction & Regulation Mon - Gene structure review: Eukaryotes vs prokaryotes Wed - Regulatory regions: Promoters & enhancers Fri - Predicting genes - Predicting regulatory regions (?) Next week: Predicting RNA structure (miRNAs, too)
Reading Assignment • Mount Bioinformatics • Chp 9 Gene Prediction & Regulation • pp 361-385 Predicting Promoters • Check errata: http://www.bioinformaticsonline.org/help/errata2.html • * Brown Genomes 2 (NCBI textbooks online) • Sect 9 Overview: Assembly of Transcription Initiation Complex • http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.chapter.7002 • Sect 9.1-9.3 DNA binding proteins, Transcription initiation • http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.section.7016 * NOTE: Don’t worry about the details!!
Optional Reading • Reviews: • Zhang MQ (2002) Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet 3:698-709 http://proxy.lib.iastate.edu:2103/nrg/journal/v3/n9/full/nrg890_fs.html • Wasserman WW & Sandelin A (2004) Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 5:276-287 http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html
Review last lecture: Gene Regulation (formerly Gene Prediction - 2) cDNAs & ESTs UniGene Regulatory regions Eukaryotes vs prokaryotes
DNA RNA protein Phenotype cDNA [1] Transcription [2] RNA processing (splicing) [3] RNA export [4] RNA surveillance Pevsner p160
UniGene: unique genes via ESTs • Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene • UniGene clusters contain many ESTs • UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution Pevsner p164
Today: Gene Prediction (formerly Gene Prediction - 3) Predicting genes Mon - Predicting regulatory regions Focus on promoters Introduction to RNA Later: Genome browsers
Gene Prediction • Overview of steps & strategies • What sequence signals can be used? • What other types of information can be used? • Algorithms • HMMs, discriminant functions, neural nets • Gene prediction software • 3 major types • many, many programs!
Predicting Genes - Basic steps: • Obtain genomic sequence • Translate in all 6 reading frames • Compare with protein sequence database • Perform database similarity search • with EST & cDNA databases, if available • Use gene prediction program to locate genes • Analyze gene regulatory sequences
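The "translate in all 6 reading frames" step can be sketched as follows. This is a minimal illustration, assuming the standard genetic code and a plain DNA string as input; the function names are hypothetical and are not taken from any of the programs named in these slides.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return translations for frames +1..+3 (forward) and -1..-3 (reverse)."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translation("ATGGCCTAA").items():
        print(name, protein)
```

Each translated frame could then be compared against a protein database, which is the next step listed on the slide.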
Overview of gene prediction strategies What sequence signals can be used? Transcription: TF binding sites, promoter, initiation site, terminator Processing signals: splice donor/acceptors, polyA signal Translation: start (AUG = Met) & stop (UGA, UAA, UAG) ORFs, codon usage What other types of information can be used? cDNAs & ESTs (experimental data, pairwise alignment) homology (sequence comparison, BLAST)
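As an illustration of how the translation signals define ORFs, here is a minimal sketch of a forward-strand ORF scan. It assumes DNA input (so the stop codons appear as TAA, TAG, TGA rather than their RNA forms) and an arbitrary minimum length; `find_orfs` and `min_codons` are hypothetical names chosen for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA equivalents of UAA, UAG, UGA


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start is the 0-based index of the A in ATG, end is one past the stop codon,
    and min_codons counts both the start and the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATGA"))
```

This works reasonably for intron-free genes; in eukaryotes an ORF scan alone misses exon boundaries, which is why the splice signals and the other evidence types listed above matter.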
Automated gene prediction strategies • Similarity-based or Comparative • BLAST - Do other organisms have similar sequence? (Is sequence similar to known gene or protein) • Ab initio = “from the beginning” • Predict without explicit comparison with cDNA or proteins via “rule-based” gene models - but rules are derived from statistical analysis of datasets • Combined "evidence-based" • Combine gene models with alignment to known ESTs & protein sequences • BEST RESULTS? Combined
Examples of gene prediction software • Similarity-based or Comparative • BLAST • SGP2 (extension of GeneID) • Ab initio = “from the beginning” • GeneID - (used in lab this week) • GENSCAN - (used in lab this week) • GeneMark.hmm - (should try this!) • Combined "evidence-based" • GeneSeqer (Brendel et al., ISU) • BEST? GENSCAN, GeneMark.hmm, GeneSeqer • but depends on organism & specific task
Gene prediction: Eukaryotes vs prokaryotes Gene prediction is easier in microbial genomes Why? Smaller genomes Simpler gene structures More sequenced genomes! (for comparative approaches) Methods? Previously, mostly HMM-based Now: similarity-based methods because so many genomes available
GeneSeqer - Brendel et al. http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
Thanks to Volker Brendel, ISU for following Figs & Slides Slightly modified from: BBSI Genome Informatics Module http://www.bioinformatics.iastate.edu/BBSI/course_desc_2005.html#moduleB V Brendel vbrendel@iastate.edu
Signals: Pre-mRNA Splicing [Figure: genomic DNA (exons and introns, start codon, stop codon) → Transcription → pre-mRNA (Cap- ... -Poly(A)) → Splicing → mRNA (Cap- ... -Poly(A)) → Translation → Protein. Splice sites: donor site (GT) at the 5' end of the intron, acceptor site (AG) at the 3' end] Brendel 2005
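The GT/AG splice signals are only two nucleotides long, so most occurrences in a genome are not real splice sites. A minimal sketch, assuming a plain DNA string and a hypothetical helper name, that lists every candidate position:

```python
def candidate_splice_sites(genomic):
    """Return 0-based start positions of every GT (donor) and AG (acceptor)."""
    genomic = genomic.upper()
    donors = [i for i in range(len(genomic) - 1) if genomic[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(genomic) - 1) if genomic[i:i + 2] == "AG"]
    return donors, acceptors


if __name__ == "__main__":
    donors, acceptors = candidate_splice_sites("CAGGTAAGTTTCAGGA")
    print("GT at", donors)
    print("AG at", acceptors)
```

Separating the true sites from this much larger pool of candidates is the classification problem that the splice site models on the later slides address.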
Brendel - Spliced Alignment I: Compare with cDNA or EST probes [Figure: genomic DNA with start and stop codons, aligned against mRNA: Cap-, 5’-UTR, start codon, stop codon, 3’-UTR, -Poly(A)] Brendel 2005
Brendel - Spliced Alignment II: Compare with protein probes [Figure: genomic DNA with start and stop codons, aligned against protein] Brendel 2005
Brendel Spliced Alignment Algorithm • Perform pairwise alignment with large gaps in one sequence (introns) • Align genomic DNA with cDNA, EST or protein • Score semi-conserved sequences at splice junctions • Score coding constraints in translated exons Brendel 2005
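The idea of "pairwise alignment with large gaps in one sequence" can be illustrated with a small dynamic program. This is a simplified sketch and not the GeneSeqer algorithm: it aligns a cDNA to genomic DNA only, allows introns only at exact GT...AG boundaries with a flat penalty, and ignores splice site scoring and coding constraints. All scoring values and names are assumptions made for the example.

```python
NEG = float("-inf")


def spliced_align(genomic, cdna, match=2, mismatch=-2, gap=-3, intron=-8):
    """Align all of cdna within genomic; return (score, exon intervals).

    Exon intervals are 0-based, half-open genomic coordinates.
    """
    n, m = len(genomic), len(cdna)
    # E[i][j]: best score for genomic[:i] vs cdna[:j], ending in an exon column.
    # I[i][j]: best score where genomic[i-1] is an intron base.
    E = [[NEG] * (m + 1) for _ in range(n + 1)]
    I = [[NEG] * (m + 1) for _ in range(n + 1)]
    back = {}

    def exit_state(i, j):
        """Best (score, state) from which an exon column may follow cell (i, j)."""
        closed = I[i][j] if i >= 2 and genomic[i - 2:i] == "AG" else NEG
        return (E[i][j], "E") if E[i][j] >= closed else (closed, "I")

    for i in range(n + 1):
        for j in range(m + 1):
            cands = []
            if i >= 1:  # extend an open intron at no extra cost
                cands.append((I[i - 1][j], ("I", i - 1, j)))
            if i >= 2 and genomic[i - 2:i] == "GT":  # open an intron at a donor
                cands.append((E[i - 2][j] + intron, ("E", i - 2, j)))
            if cands:
                I[i][j], back["I", i, j] = max(cands, key=lambda c: c[0])
            if j == 0:
                E[i][j] = 0  # the alignment may start anywhere in the genomic DNA
                continue
            cands = []
            if i >= 1:
                s, st = exit_state(i - 1, j - 1)
                sub = match if genomic[i - 1] == cdna[j - 1] else mismatch
                cands.append((s + sub, (st, i - 1, j - 1)))
                s, st = exit_state(i - 1, j)
                cands.append((s + gap, (st, i - 1, j)))
            s, st = exit_state(i, j - 1)
            cands.append((s + gap, (st, i, j - 1)))
            E[i][j], back["E", i, j] = max(cands, key=lambda c: c[0])

    # Trailing genomic sequence is free: take the best end position.
    end_i = max(range(n + 1), key=lambda i: exit_state(i, m)[0])
    score, state = exit_state(end_i, m)

    # Trace back, recording genomic positions that sit in exon columns.
    exon_pos, i, j = [], end_i, m
    while not (state == "E" and j == 0):
        prev = back[state, i, j]
        if state == "E" and prev[1] == i - 1:
            exon_pos.append(i - 1)
        state, i, j = prev
    exon_pos.reverse()

    exons = []
    for p in exon_pos:  # merge consecutive positions into intervals
        if exons and exons[-1][1] == p:
            exons[-1][1] = p + 1
        else:
            exons.append([p, p + 1])
    return score, [tuple(e) for e in exons]


if __name__ == "__main__":
    genomic = "CC" + "ATGGCC" + "GTAAGTCTAG" + "TTCGGA" + "TT"
    print(spliced_align(genomic, "ATGGCC" + "TTCGGA"))
```

Replacing the flat intron penalty with position-specific donor and acceptor scores is where the splice site models from the following slides would plug in.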
Donor (GT) & Acceptor (AG) Sites Used for Model Training
Number of True Splice Sites / Phase

| Species | Type | Phase 1 | Phase 2 | Phase 3 |
|---|---|---|---|---|
| Homo sapiens | GT | 6586 | 5277 | 3037 |
| Homo sapiens | AG | 6555 | 5194 | 2979 |
| Mus musculus | GT | 1212 | 1185 | 521 |
| Mus musculus | AG | 1194 | 1139 | 504 |
| Rattus norvegicus | GT | 450 | 408 | 147 |
| Rattus norvegicus | AG | 442 | 386 | 140 |
| Gallus gallus | GT | 288 | 238 | 107 |
| Gallus gallus | AG | 284 | 228 | 103 |
| Drosophila | GT | 989 | 670 | 524 |
| Drosophila | AG | 1001 | 671 | 536 |
| C. elegans | GT | 37029 | 20500 | 20789 |
| C. elegans | AG | 36864 | 20325 | 20626 |
| S. pombe | GT | 170 | 118 | 119 |
| S. pombe | AG | 179 | 122 | 118 |
| Aspergillus | GT | 221 | 176 | 157 |
| Aspergillus | AG | 217 | 172 | 163 |
| Arabidopsis thaliana | GT | 23019 | 9297 | 8653 |
| Arabidopsis thaliana | AG | 22929 | 9247 | 8611 |
| Zea mays | GT | 316 | 107 | 88 |
| Zea mays | AG | 311 | 104 | 83 |

Brendel 2005
Splice Site Detection • Information Content $I_i$: • Extent of Splice Signal Window: • $i$: ith position in sequence • $\bar{I}$: average information content over all positions i > 20 nt from splice site • $\sigma_{\bar{I}}$: average standard deviation of $\bar{I}$ Brendel 2005
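A minimal sketch of a per-position information content calculation for aligned splice site sequences. It assumes the usual Shannon form for a four-letter alphabet, I_i = 2 + Σ_B f_iB log2 f_iB, where f_iB is the frequency of base B at position i; that formula is an assumption here and may differ in detail from the one used on the slide.

```python
import math
from collections import Counter


def information_content(aligned_sites):
    """Return the information content (bits) of each alignment column.

    aligned_sites: equal-length sequences aligned on the splice site.
    """
    n = len(aligned_sites)
    length = len(aligned_sites[0])
    profile = []
    for i in range(length):
        counts = Counter(site[i] for site in aligned_sites)
        # Bases that never occur contribute nothing to the sum.
        info = 2.0 + sum((c / n) * math.log2(c / n) for c in counts.values())
        profile.append(info)
    return profile


if __name__ == "__main__":
    sites = ["GTAA", "GTAC", "GTAG", "GTAT"]
    print(information_content(sites))
```

Invariant columns such as the GT dinucleotide reach the 2-bit maximum, while a column with all four bases equally frequent carries 0 bits. Comparing each column against the background average far from the site is one way to decide how far the signal window extends.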
Results? [Figure panels: Human T2_GT, Human T2_AG, Human F1_AG, Human Fi_AG, A. thaliana T2_GT, A. thaliana T2_AG, A. thaliana F1_AG, A. thaliana Fi_AG] Brendel 2005
Bayesian Splice Site Prediction Let $S = s_{-l}\, s_{-l+1}\, s_{-l+2} \ldots s_{-1}\, \mathrm{GT}\, s_1\, s_2\, s_3 \ldots s_r$, where H indexes the hypotheses of GT or AG at - True site in reading phase 1, 2, or 0 - False within-exon site in reading phase 1, 2, or 0 - False within-intron site Brendel 2005
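The seven hypotheses above (three true-site phases, three false within-exon phases, one false within-intron class) can be compared with Bayes' rule once a likelihood P(S | H) is available for each. The sketch below takes those likelihoods as given inputs; how they are modeled is not shown here, and the labels and numbers are purely illustrative assumptions.

```python
def posterior(likelihoods, priors):
    """Bayes' rule: P(H | S) = P(S | H) P(H) / sum over H' of P(S | H') P(H')."""
    joint = {h: likelihoods[h] * priors[h] for h in likelihoods}
    total = sum(joint.values())
    return {h: p / total for h, p in joint.items()}


if __name__ == "__main__":
    # Hypothetical labels: T1/T2/T0 = true site in phase 1/2/0,
    # E1/E2/E0 = false within-exon site in phase 1/2/0, I = false within-intron.
    hypotheses = ["T1", "T2", "T0", "E1", "E2", "E0", "I"]
    priors = {h: 1.0 / len(hypotheses) for h in hypotheses}  # illustrative only
    likelihoods = {"T1": 4e-9, "T2": 1e-9, "T0": 1e-9,
                   "E1": 5e-10, "E2": 5e-10, "E0": 5e-10, "I": 2.5e-9}
    for h, p in posterior(likelihoods, priors).items():
        print(h, round(p, 3))
```

A 2-class version of the same calculation would simply pool the three true classes and the four false classes.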
Bayes Factor as Decision Criterion H0: H=T: - 2-class model: - 7 class model: Brendel 2005
Interpretation of Bayes Factor in terms of Critical Value c = 2 ln BF • Positive evidence for H0 if 2 ≤ c ≤ 6 • Strong support for H0 if 6 ≤ c ≤ 10 • Very strong support for H0 if c > 10 Brendel 2005
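The interpretation scale can be written as a small helper. This is a sketch under two assumptions: the Bayes factor is supplied as a positive number, and boundary values (c = 6 and c = 10), which the slide leaves ambiguous, are assigned as noted in the comments. The "weak" label for c below 2 is also an assumption, since the slide does not name that range.

```python
import math


def critical_value(bayes_factor):
    """c = 2 ln BF, for a positive Bayes factor."""
    return 2.0 * math.log(bayes_factor)


def interpret(c):
    """Map a critical value to the evidence categories for H0."""
    if c > 10:
        return "very strong"
    if c >= 6:  # assumption: c = 6 counts as strong, c = 10 stays strong
        return "strong"
    if c >= 2:
        return "positive"
    return "weak"  # assumption: label for values below the slide's scale


if __name__ == "__main__":
    for bf in (1.0, 5.0, 50.0, 500.0):
        c = critical_value(bf)
        print(f"BF = {bf:6.1f}  c = {c:5.2f}  {interpret(c)}")
```

Raising the threshold on c trades sensitivity for specificity, which is the trade-off explored in the evaluation that follows.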
Evaluation of Splice Site Prediction

| | Actual True | Actual False | |
|---|---|---|---|
| Predicted True | TP | FP | PP = TP + FP |
| Predicted False | FN | TN | PN = FN + TN |
| | AP = TP + FN | AN = FP + TN | |

= Coverage • Misclassification rates: • Sensitivity: • Specificity: • Normalized specificity: Brendel 2005
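The measures listed above can be sketched in code from the four confusion matrix counts. The definitions used here are assumptions: misclassification rates α = FN/AP and β = FP/AN, sensitivity Sn = TP/AP, specificity Sp = TP/PP, and normalized specificity σ = (1 − α)/(1 − α + β). They appear consistent with the percentages tabulated on the next slide, but the function and key names are hypothetical.

```python
def evaluate(tp, fp, fn, tn):
    """Return splice site prediction measures from confusion matrix counts."""
    ap, an = tp + fn, fp + tn  # actual positives / actual negatives
    pp = tp + fp               # predicted positives
    alpha = fn / ap            # fraction of true sites missed
    beta = fp / an             # fraction of false sites accepted
    return {
        "alpha": alpha,
        "beta": beta,
        "Sn": tp / ap,
        "Sp": tp / pp,
        # Specificity recomputed as if true and false sets were equally large.
        "sigma": (1 - alpha) / (1 - alpha + beta),
    }


if __name__ == "__main__":
    for name, value in evaluate(tp=90, fp=30, fn=10, tn=270).items():
        print(f"{name}: {value:.3f}")
```

Because false GT/AG sites vastly outnumber true ones, Sp can be low even when β is small; the normalized form removes that dependence on the ratio of false to true sites in the test set.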
| Species | Model | Site | Test set: True | Test set: False | Bayes Factor | Sn (%) | σ (%) | Sp (%) |
|---|---|---|---|---|---|---|---|---|
| Homo sapiens | 2C | GT | 921 | 44411 | 0 | 98.5 | 90.5 | 16.4 |
| | | | | | 3 | 91.7 | 96.3 | 34.8 |
| | | | | | 6 | 66.3 | 98.5 | 57.6 |
| | | AG | 920 | 65103 | 0 | 96.3 | 88.4 | 9.7 |
| | | | | | 3 | 90.3 | 92.9 | 15.7 |
| | | | | | 6 | 76.1 | 96.1 | 25.6 |
| Drosophila | 2C | GT | 329 | 11501 | 0 | 95.4 | 94.8 | 34.1 |
| | | | | | 3 | 90.0 | 97.6 | 53.6 |
| | | | | | 6 | 83.9 | 99.1 | 75.0 |
| | | AG | 329 | 14920 | 0 | 95.7 | 94.8 | 28.7 |
| | | | | | 3 | 92.1 | 97.0 | 41.4 |
| | | | | | 6 | 85.1 | 98.5 | 59.4 |
| C. elegans | 7C | GT | 400 | 7460 | 0 | 97.8 | 92.7 | 40.4 |
| | | | | | 3 | 94.2 | 97.1 | 64.3 |
| | | | | | 6 | 84.8 | 99.1 | 85.4 |
| | | AG | 400 | 10132 | 0 | 98.8 | 97.2 | 58.2 |
| | | | | | 3 | 96.2 | 98.8 | 76.9 |
| | | | | | 6 | 90.2 | 99.5 | 88.5 |
| A. thaliana | 7C | GT | 613 | 9027 | 0 | 99.5 | 93.2 | 48.1 |
| | | | | | 3 | 95.6 | 97.6 | 73.2 |
| | | | | | 6 | 87.1 | 99.3 | 91.0 |
| | | AG | 614 | 10196 | 0 | 99.2 | 92.3 | 41.9 |
| | | | | | 3 | 96.4 | 96.4 | 62.0 |
| | | | | | 6 | 87.1 | 98.6 | 81.2 |

Brendel 2005
Performance? [Figure panels, each with an Sn axis: Human GT site, Human AG site, C. elegans GT site, C. elegans AG site, A. thaliana GT site, A. thaliana AG site] Brendel 2005
Markov Model for Spliced Alignment [Figure: state diagram over exon states e_n, e_n+1 and intron states i_n, i_n+1, with transitions labeled PG, (1-PG)(1-PD(n+1)), (1-PG)PD(n+1), PA(n)PG, and 1-PA(n)] Brendel 2005
Performance vs other methods • Comparison with ab initio gene prediction programs? • Depends on: • Availability of ESTs • Availability of protein homologs Brendel 2005
GeneSeqer vs NAP vs GENSCAN (Exon prediction) [Plot: Exon (Sn + Sp) / 2, from 0.00 to 1.00, versus target protein alignment score, from 0 to 100; curves for GeneSeqer, NAP, and GENSCAN] GENSCAN - Burge, MIT Brendel 2005
GeneSeqer vs NAP vs GENSCAN (Intron prediction) [Plot: Intron (Sn + Sp) / 2, from 0.00 to 1.00, versus target protein alignment score, from 0 to 100; curves for GeneSeqer, NAP, and GENSCAN] GENSCAN - Burge, MIT Brendel 2005
GeneSeqer [Flow diagram: Genomic Sequence and EST or protein database (Suffix Array / Suffix Tree) → Fast Search → Spliced Alignment → Output Assembly] Brendel 2005
Gene Structure Annotation - Problems • False positive intergenic region: • 2 annotated genes actually correspond to a single gene • False negative intergenic region: • One annotated gene structure actually contains 2 genes • False negative gene prediction: • Missing gene (no annotation) • Other: • partially incorrect gene annotation • missing annotation of alternative transcripts Brendel 2005
Other Resources • Current Protocols in Bioinformatics • http://www.4ulr.com/products/currentprotocols/bioinformatics.html • Finding Genes • 4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations • 4.2 Using MZEF To Find Internal Coding Exons • 4.3 Using GENEID to Identify Genes • 4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes • 4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm • 4.6 Eukaryotic Gene Prediction Using GeneMark.hmm • 4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome • 4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences • 4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation • 4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences