
BCB 444/544

BCB 444/544. Lecture 26: Gene Prediction (#26_Oct22). Required Reading (before lecture): Mon Oct 22 - Lecture 26, Gene Prediction, Chp 8 - pp 97 - 112. Wed Oct 24 - Lecture 27 (will not be covered on Exam 2), Regulatory Element Prediction, Chp 9 - pp 113 - 126.

sylvia

Presentation Transcript


  1. BCB 444/544 Lecture 26 Gene Prediction #26_Oct22 BCB 444/544 F07 ISU Dobbs #26 - Gene Prediction

  2. Required Reading (before lecture)
     Mon Oct 22 - Lecture 26: Gene Prediction
     • Chp 8 - pp 97 - 112
     Wed Oct 24 - Lecture 27: Regulatory Element Prediction (will not be covered on Exam 2)
     • Chp 9 - pp 113 - 126
     Thurs Oct 25 - Review Session & Project Planning
     Fri Oct 26 - EXAM 2

  3. Assignments & Announcements
     Sun Oct 21 - Study Guide for Exam 2 was posted
     Mon Oct 22 - HW#4 Due (no "correct" answer to post)
     Thu Oct 25 - Lab = Optional Review Session for Exam; 544 Project Planning/Consult with DD & MT
     Fri Oct 26 - Exam 2 - Will cover:
     • Lectures 13-26 (thru Mon Oct 22)
     • Labs 5-8
     • HW# 3 & 4
     • All assigned reading: Chps 6 (beginning with HMMs), 7-8, 12-16; Eddy: What is an HMM; Ginalski: Practical Lessons…

  4. BCB 544 "Team" Projects
     • 544 Extra HW#2 is next step in Team Projects
       • Write ~ 1 page outline
       • Schedule meeting with Michael & Drena to discuss topic
       • Read a few papers
       • Write a more detailed plan
     • You may work alone if you prefer
     • Last week of classes will be devoted to Projects
       • Written reports due: Mon Dec 3 (no class that day)
       • Oral presentations (15-20') will be: Wed-Fri Dec 5, 6, 7
       • 1 or 2 teams will present during each class period
     • See Guidelines for Projects posted online

  5. BCB 544 Only: New Homework Assignment
     544 Extra#2 (posted online Thurs?) No - sorry! Sent by email on Sat…
     Due: PART 1 - ASAP; PART 2 - Fri Nov 2 by 5 PM
     Part 1 - Brief outline of Project, email to Drena & Michael; after response/approval, then:
     Part 2 - More detailed outline of project
     • Read a few papers and summarize status of problem
     • Schedule meeting with Drena & Michael to discuss ideas

  6. Seminars this Week
     BCB List of URLs for Seminars related to Bioinformatics: http://www.bcb.iastate.edu/seminars/index.html
     • Oct 25 Thur - BBMB Seminar, 4:10 in 1414 MBB
       Dave Segal, UC Davis: Zinc Finger Protein Design
     • Oct 19 Fri - BCB Faculty Seminar, 2:10 in 102 ScI
       Guang Song, ComS, ISU: Probing functional mechanisms by structure-based modeling and simulations

  7. Chp 16 - RNA Structure Prediction
     SECTION V: STRUCTURAL BIOINFORMATICS
     Xiong: Chp 16 RNA Structure Prediction (Terribilini)
     • RNA Function
     • Types of RNA Structures
     • RNA Secondary Structure Prediction Methods
     • Ab Initio Approach
     • Comparative Approach
     • Performance Evaluation

  8. Covalent & non-covalent bonds in RNA (This is a new slide)
     • Primary:
       • Covalent bonds
     • Secondary/Tertiary:
       • Non-covalent bonds
       • H-bonds (base-pairing)
       • Base stacking
     Fig 6.2 Baxevanis & Ouellette 2005

  9. RNA Pseudoknots & Tetraloops (This is a new slide)
     • Often have important regulatory or catalytic functions
     Pseudoknot: http://www.lbl.gov/Science-Articles/Research-Review/Annual-Reports/1995/images/rna.gif
     Tetraloop: http://academic.brooklyn.cuny.edu/chem/zhuang/QD/mckay_hr.gif

  10. Base Pairing in RNA (This slide has been changed)
      G-C, A-U, G-U ("wobble") & many variants
      See: IMB Image Library of Biological Molecules
      http://www.fli-leibniz.de/ImgLibDoc/nana/IMAGE_NANA.html#basepairs

  11. RNA Secondary Structure Prediction Methods (This slide has been changed)
      Two (three, recently) main types of methods:
      • Ab initio - based on calculating most energetically favorable secondary structure(s)
        Energy minimization (thermodynamics)
      • Comparative approach - based on comparisons of multiple evolutionarily-related RNA sequences
        Sequence comparison (co-variation)
      • Combined computational & experimental
        Use experimental constraints when available

  12. RNA Secondary structure prediction - 3 (This is a new slide)
      3) Combined experimental & computational
      • Experiments: Map single-stranded vs double-stranded regions in folded RNA
      • How?
        Enzymes: S1 nuclease, T1 RNase
        Chemicals: kethoxal, DMS, OH
      • Software: Mfold, Sfold, RNAStructure, RNAFold, RNAlifold
      [Figure: RNA structure, positions ~200-240, marked with kethoxal and DMS modification sites (mild vs strong)]

  13. Ab Initio Prediction: Clarifications (This slide has been changed)
      • Free energy is calculated based on parameters determined in the wet lab
      • Correction: Use known energy associated with each type of nearest-neighbor pair (base-stacking), not each base-pair
      • Base-pair formation is not independent: multiple base-pairs adjacent to each other are more favorable than individual base-pairs (cooperative) because of base-stacking interactions
      • Bulges and loops adjacent to base-pairs have a free energy penalty

  14. Energy minimization: What are the rules? (This is a new slide)
      • Base pair A=U stacked on base pair A=U: ΔG = -1.2 kcal/mole
      • Base pair A=U stacked on base pair U=A: ΔG = -1.6 kcal/mole
      What gives here? Why 1.2 vs 1.6?
      C Staben 2005

  15. Energy minimization calculations: Base-stacking is critical (This is a new slide)
      - Tinoco et al.
      C Staben 2005

  16. Ab Initio Energy Calculation (This slide has been changed)
      • Search for all possible base-pairing patterns
      • Calculate total energy of each structure based on all stabilizing and destabilizing forces
      Total free energy for a specific RNA conformation = Sum of incremental energy terms for:
      • helical stacking (sequence dependent)
      • loop initiation
      • unpaired stacking
      (favorable "increments" are < 0)
      Fig 6.3 Baxevanis & Ouellette 2005

  17. Dynamic Programming (This slide has been changed)
      • Finding optimal secondary structure is difficult - lots of possibilities
      • Compare RNA sequence with itself
      • Apply scoring scheme based on energy parameters for base stacking, cooperativity, and penalties for destabilizing forces (loops, bulges)
      • Find path that represents most energetically favorable secondary structure
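The dynamic programming idea on this slide can be sketched with a deliberately simplified stand-in. The code below is a Nussinov-style recursion that maximizes the number of base pairs; it is not the nearest-neighbor energy minimization used by programs such as Mfold. The function names, the `min_loop` parameter, and the choice to allow G-U wobble pairs are assumptions made for illustration only.

```python
# Simplified stand-in: maximize the number of nested base pairs
# (Nussinov-style), NOT a free-energy minimization.

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(a, b):
    """True for Watson-Crick pairs and the G-U wobble pair."""
    return (a, b) in PAIRS


def nussinov(seq, min_loop=3):
    """Return the maximum number of nested base pairs in seq.

    min_loop is the minimum number of unpaired bases required inside a
    hairpin loop (hypothetical default of 3).
    """
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j stays unpaired
            # case 2: base j pairs with some base k, splitting the problem
            for k in range(i, j - min_loop):
                if can_pair(seq[k], seq[j]):
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


print(nussinov("GGGAAAUCCC"))  # three G-C pairs closing a hairpin loop
```

Replacing the "+ 1" per pair with stacking and loop energy terms is what turns this kind of recursion into an energy-minimization method; that extension is not shown here.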

  18. 3 - Popular Programs that use Combined Computational & Experimental Approaches
      • Mfold
      • Sfold
      • RNAStructure
      • RNAFold
      • RNAlifold

  19. Comparison of Predictions for Single RNA using Different Methods
      • Sfold: -51.14 kcal/mol
      • Mfold: -54.84 kcal/mol
      • RNAstructure: -71.3 kcal/mol
      • RNAfold: -80.16 kcal/mol
      [Figure: four predicted structures, each with stem-loops SL X, SL Y, SL Z labeled]
      JH Lee 2007

  20. Comparison of Mfold Predictions: -/+ Constraints
      • Mfold: -126.05 kcal/mol
      • Mfold plus constraints: -54.84 kcal/mol
      JH Lee 2007

  21. Performance Evaluation (This slide has been changed)
      • Ab initio methods? correlation coefficient = 20-60%
      • Comparative approaches? correlation coefficient = 20-80%
        • Programs that require user to supply MSA are more accurate
      • Comparative programs are consistently more accurate than ab initio
      • Base-pairs predicted by comparative sequence analysis for large & small subunit rRNAs are 97% accurate when compared with high resolution crystal structures! - Gutell, Pace
      • BEST APPROACH? Methods that combine computational prediction (ab initio & comparative) with experimental constraints (from chemical/enzymatic modification studies)
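The slide does not say how the correlation coefficient is computed. As a hedged illustration, the sketch below scores a predicted set of base pairs against a reference set using sensitivity, positive predictive value (PPV), and their geometric mean, which is a commonly used approximation of the Matthews correlation coefficient for base-pair predictions. The function name and the (i, j) tuple representation of a base pair are assumptions.

```python
from math import sqrt


def pair_accuracy(predicted, reference):
    """Compare predicted base pairs with reference base pairs.

    Both arguments are iterables of (i, j) index tuples (hypothetical
    representation). Returns (sensitivity, ppv, approx_mcc), where
    approx_mcc is the geometric mean of sensitivity and PPV.
    """
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    sensitivity = true_pos / len(reference) if reference else 0.0
    ppv = true_pos / len(predicted) if predicted else 0.0
    return sensitivity, ppv, sqrt(sensitivity * ppv)


# Two of three predicted pairs agree with the reference structure.
sens, ppv, mcc = pair_accuracy(
    predicted=[(0, 9), (1, 8), (2, 6)],
    reference=[(0, 9), (1, 8), (2, 7)],
)
print(f"sensitivity={sens:.2f} ppv={ppv:.2f} approx_mcc={mcc:.2f}")
```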

  22. Chp 8 - Gene Prediction
      SECTION III: GENE AND PROMOTER PREDICTION
      Xiong: Chp 8 Gene Prediction
      • Categories of Gene Prediction Programs
      • Gene Prediction in Prokaryotes
      • Gene Prediction in Eukaryotes

  23. What is a Gene?
      A gene is a segment of DNA, some of which is "structural," i.e., transcribed to give a functional RNA product, & some of which is "regulatory"
      • Genes can encode:
        • mRNA (for protein)
        • other types of RNA (tRNA, rRNA, miRNA, etc.)
      • Genes differ in eukaryotes vs prokaryotes (& archaea), in both structure & regulation

  24. Gene Finding
      Problem: Given a new genomic DNA sequence, identify coding regions and their predicted RNA and protein sequences
      ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT
      Steps:
      • Search against protein / EST database
      • Apply gene prediction programs (many programs available)
      • Analyze regulatory regions

  25. Gene Prediction in Prokaryotes vs Eukaryotes
      Prokaryotes:
      • Small genomes: 0.5 - 10 × 10^6 bp
      • About 90% of genome is coding
      • Simple gene structure
      • Prediction success ~99%
      Eukaryotes:
      • Large genomes: 10^7 - 10^10 bp
      • Often less than 2% coding
      • Complicated gene structure (splicing, long introns)
      • Prediction success 50-95%
      [Figure: prokaryotic gene (promoter, start codon ATG, open reading frame (ORF), stop codon TAA) vs eukaryotic gene (promoter, 5' UTR, start codon ATG, exons, introns, splice sites, stop codon TAA, 3' UTR)]

  26. DNA "Signals" Used by Gene Finding Algorithms
      • Exploit the regular gene structure
        ATG - Exon1 - Intron1 - Exon2 - … - ExonN - STOP
      • Recognize "coding bias"
        CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA-…
      • Recognize splice sites
        Intron - cAGt - Exon - gGTgag - Intron
      • Model the duration of regions
        Introns tend to be much longer than exons, in mammals
        Exons are biased to have a given minimum length
      • Use cross-species comparison
        Gene structure is conserved in mammals
        Exons are more similar (~85%) than introns
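As a rough illustration of searching for the splice-site signal listed above, the sketch below enumerates every stretch of DNA that begins with a GT donor dinucleotide and ends with an AG acceptor dinucleotide. Real gene finders score the surrounding sequence rather than matching dinucleotides alone; the function name and the `min_len` cutoff are hypothetical.

```python
def candidate_introns(dna, min_len=6):
    """Return (start, end) slices of dna that begin with GT and end with AG.

    end is exclusive, so dna[start:end] is the candidate intron.
    min_len is a hypothetical minimum intron length in nucleotides.
    """
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [j for j in range(len(dna) - 1) if dna[j:j + 2] == "AG"]
    return [
        (d, a + 2)
        for d in donors
        for a in acceptors
        if a + 2 - d >= min_len
    ]


seq = "AAGTCCCCAGTT"
for start, end in candidate_introns(seq):
    print(start, end, seq[start:end])
```

Because GT and AG occur frequently by chance, a scan like this produces many false candidates on real genomic sequence, which is why the other signals on the slide (coding bias, region lengths, cross-species comparison) are combined with it.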

  27. Computational Gene Finding Approaches
      • Ab initio methods
        • Search by signal: find DNA sequences involved in gene expression
        • Search by content: test statistical properties distinguishing coding from non-coding DNA
      • Similarity-based methods
        • Database search: exploit similarity to proteins, ESTs, and cDNAs
        • Comparative genomics: exploit aligned genomes
          • Do other organisms have similar sequence?
      • Hybrid methods - best

  28. Examples of Gene Prediction Software
      • Ab initio
        • Genscan, GeneMark.hmm, Genie, GeneID…
      • Similarity-based
        • BLAST, Procrustes…
      • Hybrids
        • GeneSeqer, GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM
      • BEST?
        Ab initio - Genscan (according to some assessments)
        Hybrid - GeneSeqer
        But depends on organism & specific task
      Lists of Gene Prediction Software:
      http://www.bioinformaticsonline.org/links/ch_09_t_1.html
      http://cmgm.stanford.edu/classes/genefind/

  29. Synthesis & Processing of Eukaryotic mRNA
      [Figure: Gene in DNA (exon 1, intron, exon 2, intron, exon 3) → Transcription → primary transcript (RNA) → Splicing (remove introns) → Capping & polyadenylation → Mature mRNA (5' 7MeG cap, 3' AAAAA tail) → Export to cytoplasm]

  30. What are cDNAs & ESTs?
      • cDNA libraries are important for determining gene structure & studying regulation of gene expression
      • Isolate RNA (always from a specific organism, region, and time point)
      • Convert RNA to complementary DNA (with reverse transcriptase)
      • Clone into cDNA vector
      • Sequence the cDNA inserts
      • Short cDNAs are called ESTs or Expressed Sequence Tags
      • ESTs are strong evidence for genes
      • Full-length cDNAs can be difficult to obtain
      [Figure: cDNA insert cloned into vector]

  31. UniGene: Unique genes via ESTs
      • Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
      • UniGene clusters contain many ESTs
      • UniGene data come from many cDNA libraries. When you look up a gene in UniGene, you can obtain information re: level & tissue distribution of expression

  32. Gene Prediction
      • Overview of steps & strategies
        • What sequence signals can be used?
        • What other types of information can be used?
      • Algorithms
        • HMMs, Bayesian models, neural nets
      • Gene prediction software
        • 3 major types
        • many, many programs!

  33. Overview of Gene Prediction Strategies
      • What sequence signals can be used?
        • Transcription: TF binding sites, promoter, initiation site, terminator, GC islands, etc.
        • Processing signals: splice donor/acceptors, polyA signal
        • Translation: start (AUG = Met) & stop (UGA, UAA, UAG)
        • ORFs, codon usage
      • What other types of information can be used?
        • Homology (sequence comparison, BLAST)
        • cDNAs & ESTs (experimental data, pairwise alignment)
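To make the translation signals concrete, here is a minimal sketch of an ORF scan on the forward strand, run on the example sequence from the Gene Finding slide. It looks only for an ATG start followed in the same frame by a stop codon, written in the DNA alphabet (TAA, TAG, TGA). The function name, the `min_codons` threshold, and the decision to ignore the reverse strand are simplifying assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Find ATG...stop ORFs in the three forward reading frames.

    Returns a list of (start, end, frame) tuples, where dna[start:end]
    runs from the ATG through the stop codon. min_codons is a
    hypothetical minimum number of codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # close the ORF and keep scanning
    return orfs


seq = "ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT"
for start, end, frame in find_orfs(seq):
    print(frame, start, end, seq[start:end])
```

A start codon with no in-frame stop before the end of the sequence is not reported, which is why the second ATG in the example does not produce an ORF.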

  34. Gene prediction: Eukaryotes vs prokaryotes
      Gene prediction is easier in microbial genomes. Why?
      • Smaller genomes
      • Simpler gene structures
      • Many more sequenced genomes! (for comparative approaches)
      Many microbial genomes have been fully sequenced & whole-genome "gene structure" and "gene function" annotations are available, e.g.:
      • GeneMark.hmm
      • TIGR Comprehensive Microbial Resource (CMR)
      • NCBI Microbial Genomes

  35. Predicting Genes - Basic steps:
      • Obtain genomic sequence
      • BLAST it!
        • Perform database similarity search (with EST & cDNA databases, if available)
      • Translate in all 6 reading frames (i.e., "6-frame translation")
        • Compare with protein sequence databases
      • Use Gene Prediction software to locate genes
      • Analyze regulatory sequences
      • Refine gene prediction
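The "6-frame translation" step can be sketched as follows: translate the sequence in three offsets on the forward strand and three on its reverse complement. The codon table is built from the standard genetic code; the function names and the frame labels ("+1" to "-3") are illustrative choices, and codons containing characters other than A, C, G, T are translated as "X" by assumption.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels '+1'..'+3', '-1'..'-3' to peptides."""
    dna = dna.upper()
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


for label, peptide in six_frame_translation("ATGGGGCAGGGTCAGATATAA").items():
    print(label, peptide)
```

Each of the six peptides would then be compared with a protein database, which is essentially what a translated search such as BlastX does internally.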

  36. Predicting Genes - Details:
      1. First, mask to "remove" repetitive elements (ALUs, etc.)
      2. Perform database search on translated DNA (BlastX, TFasta)
      3. Use several programs to predict genes (GENSCAN, GeneMark.hmm, GeneSeqer)
      4. Search for functional motifs in translated ORFs (Blocks, Motifs, etc.) & in neighboring DNA sequences
      5. Repeat

  37. Spliced Alignment Algorithm
      GeneSeqer - Brendel et al. - ISU
      http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
      Brendel et al (2004) Bioinformatics 20: 1157
      • Perform pairwise alignment with large gaps in one sequence (due to introns)
        • Align genomic DNA with cDNA, ESTs, protein sequences
      • Score semi-conserved sequences at splice junctions
        • Using a Bayesian model or Markov model (MM)
      • Score coding constraints in translated exons
        • Using a Bayesian model or MM
      [Figure: intron bounded by splice sites, GT (donor) and AG (acceptor)]
      Brendel 2005

  38. Brendel - Spliced Alignment II: Compare with protein probes
      [Figure: protein aligned to genomic DNA between start codon and stop codon]
      Brendel 2005

  39. Splice Site Detection
      Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal? YES
      • Information content: I_i
      • Extent of splice signal window:
        i: ith position in sequence
        Ī: avg information content over all positions >20 nt from splice site
        σ_Ī: avg sample standard deviation of Ī
      Brendel 2005
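The slide's own formula did not survive in this transcript. As an assumption, the sketch below uses the standard Shannon definition of per-position information content for DNA, I_i = 2 + Σ_B f_iB log2(f_iB), where f_iB is the frequency of base B at position i across a set of aligned splice-site sequences. The function name and the toy input are hypothetical.

```python
from collections import Counter
from math import log2


def information_content(aligned_sites):
    """Per-position information content (bits) for equal-length DNA sites.

    Assumes the standard definition I_i = 2 - H_i, where H_i is the
    Shannon entropy of base frequencies at position i. No small-sample
    correction is applied.
    """
    length = len(aligned_sites[0])
    info = []
    for i in range(length):
        counts = Counter(site[i] for site in aligned_sites)
        total = sum(counts.values())
        entropy = -sum((c / total) * log2(c / total) for c in counts.values())
        info.append(2.0 - entropy)
    return info


# Toy donor-site alignment: GT is invariant, the next two positions are random.
sites = ["GTAA", "GTGC", "GTCG", "GTTT"]
print(information_content(sites))
```

Positions that are fully conserved score 2 bits and positions with uniform base usage score 0 bits, so plotting these values against position shows how far the splice signal extends beyond the GT or AG dinucleotide.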

  40. Information content vs position
      [Figure: information content plotted against position for Human T2_GT and Human T2_AG]
      Which sequences are exons & which are introns? How can you tell?
      Brendel et al (2004) Bioinformatics 20: 1157
      Brendel 2005

  41. PG PG (1-PG)(1-PD(n+1)) en en+1 (1-PG)PD(n+1) PA(n)PG (1-PG)PD(n+1) in in+1 1-PA(n) Markov Model for Spliced Alignment Brendel 2005 BCB 444/544 F07 ISU Dobbs #26 - Gene Prediction
