BCB 444/544 Lecture 26 Gene Prediction #26_Oct22 BCB 444/544 F07 ISU Dobbs #26 - Gene Prediction
Required Reading (before lecture)
Mon Oct 22 - Lecture 26: Gene Prediction
• Chp 8 - pp 97-112
Wed Oct 24 - Lecture 27: Regulatory Element Prediction (will not be covered on Exam 2)
• Chp 9 - pp 113-126
Thurs Oct 25 - Review Session & Project Planning
Fri Oct 26 - EXAM 2
Assignments & Announcements
Sun Oct 21 - Study Guide for Exam 2 was posted
Mon Oct 22 - HW#4 Due (no "correct" answer to post)
Thu Oct 25 - Lab = Optional Review Session for Exam; 544 Project Planning/Consult with DD & MT
Fri Oct 26 - Exam 2 - Will cover:
• Lectures 13-26 (thru Mon Oct 22)
• Labs 5-8
• HW# 3 & 4
• All assigned reading: Chps 6 (beginning with HMMs), 7-8, 12-16; Eddy: What is an HMM; Ginalski: Practical Lessons…
BCB 544 "Team" Projects
• 544 Extra HW#2 is next step in Team Projects
• Write ~1 page outline
• Schedule meeting with Michael & Drena to discuss topic
• Read a few papers
• Write a more detailed plan
• You may work alone if you prefer
• Last week of classes will be devoted to Projects
• Written reports due: Mon Dec 3 (no class that day)
• Oral presentations (15-20') will be: Wed-Fri Dec 5, 6, 7
• 1 or 2 teams will present during each class period
• See Guidelines for Projects posted online
BCB 544 Only: New Homework Assignment
544 Extra#2 (posted online Thurs?) No - sorry! Sent by email on Sat…
Due: PART 1 - ASAP; PART 2 - Fri Nov 2 by 5 PM
Part 1 - Brief outline of Project, email to Drena & Michael; after response/approval, then:
Part 2 - More detailed outline of project
• Read a few papers and summarize status of problem
• Schedule meeting with Drena & Michael to discuss ideas
Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics: http://www.bcb.iastate.edu/seminars/index.html
• Oct 25 Thur - BBMB Seminar 4:10 in 1414 MBB
  Dave Segal, UC Davis: Zinc Finger Protein Design
• Oct 19 Fri - BCB Faculty Seminar 2:10 in 102 ScI
  Guang Song, ComS, ISU: Probing functional mechanisms by structure-based modeling and simulations
Chp 16 - RNA Structure Prediction
SECTION V: STRUCTURAL BIOINFORMATICS
Xiong: Chp 16 RNA Structure Prediction (Terribilini)
• RNA Function
• Types of RNA Structures
• RNA Secondary Structure Prediction Methods
  • Ab Initio Approach
  • Comparative Approach
• Performance Evaluation
Covalent & non-covalent bonds in RNA (this is a new slide)
• Primary structure:
  • Covalent bonds
• Secondary/Tertiary structure:
  • Non-covalent bonds
  • H-bonds (base-pairing)
  • Base stacking
Fig 6.2 Baxevanis & Ouellette 2005
RNA Pseudoknots & Tetraloops (this is a new slide)
• Often have important regulatory or catalytic functions
[Figure: pseudoknot] http://www.lbl.gov/Science-Articles/Research-Review/Annual-Reports/1995/images/rna.gif
[Figure: tetraloop] http://academic.brooklyn.cuny.edu/chem/zhuang/QD/mckay_hr.gif
Base Pairing in RNA (this slide has been changed)
G-C, A-U, G-U ("wobble") & many variants
See: IMB Image Library of Biological Molecules http://www.fli-leibniz.de/ImgLibDoc/nana/IMAGE_NANA.html#basepairs
RNA Secondary Structure Prediction Methods (this slide has been changed)
Two (three, recently) main types of methods:
• Ab initio - based on calculating the most energetically favorable secondary structure(s)
  Energy minimization (thermodynamics)
• Comparative approach - based on comparisons of multiple evolutionarily related RNA sequences
  Sequence comparison (co-variation)
• Combined computational & experimental
  Use experimental constraints when available
RNA Secondary structure prediction - 3 (this is a new slide)
3) Combined experimental & computational
• Experiments: Map single-stranded vs double-stranded regions in folded RNA
• How?
  Enzymes: S1 nuclease, T1 RNase
  Chemicals: kethoxal, DMS, hydroxyl radical (OH)
• Software: Mfold, Sfold, RNAstructure, RNAfold, RNAalifold
[Figure: secondary structure map, nucleotides ~200-240, marking sites of mild and strong kethoxal modification and mild and strong DMS modification]
Ab Initio Prediction: Clarifications (this slide has been changed)
• Free energy is calculated based on parameters determined in the wet lab
• Correction: Use known energy associated with each type of nearest-neighbor pair (base-stacking), not each base-pair
• Base-pair formation is not independent: multiple base-pairs adjacent to each other are more favorable than individual base-pairs. The effect is cooperative because of base-stacking interactions
• Bulges and loops adjacent to base-pairs have a free energy penalty
Energy minimization: What are the rules? (this is a new slide)
[Figure: two stacks of adjacent A-U base pairs in different orientations; ΔG = -1.2 kcal/mole for one stack, ΔG = -1.6 kcal/mole for the other]
What gives here? Why 1.2 vs 1.6?
C Staben 2005
Energy minimization calculations: Base-stacking is critical (this is a new slide)
- Tinoco et al.
C Staben 2005
Ab Initio Energy Calculation (this slide has been changed)
• Search for all possible base-pairing patterns
• Calculate total energy of each structure based on all stabilizing and destabilizing forces
Total free energy for a specific RNA conformation = sum of incremental energy terms for:
• helical stacking (sequence dependent)
• loop initiation
• unpaired stacking
(favorable "increments" are < 0)
Fig 6.3 Baxevanis & Ouellette 2005
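The additive bookkeeping described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stacking and loop values below are hypothetical placeholders, not measured nearest-neighbor parameters, and the function names are invented for this sketch. Stacking terms are keyed by an ordered pair of adjacent base-pairs, since the orientation of neighboring pairs changes the energy.

```python
# Hypothetical incremental energies in kcal/mol (placeholders, NOT lab-derived values).
# Keys are ordered: (base-pair closer to the 5' end of the stem, next base-pair).
STACK_ENERGY = {
    ("GC", "GC"): -3.0,
    ("GC", "AU"): -2.0,
    ("AU", "AU"): -1.0,
    ("AU", "UA"): -1.5,
}
# Hypothetical hairpin loop initiation penalties, keyed by loop size in nucleotides.
LOOP_INITIATION = {3: 5.5, 4: 5.0, 5: 5.5}


def hairpin_energy(stem_pairs, loop_size):
    """Sum stacking increments along the stem, then add the loop penalty."""
    stacking = sum(
        STACK_ENERGY[(upper, lower)]
        for upper, lower in zip(stem_pairs, stem_pairs[1:])
    )
    return stacking + LOOP_INITIATION[loop_size]


# A stem of four base-pairs closed by a 4-nt loop.
total = hairpin_energy(["GC", "GC", "AU", "UA"], loop_size=4)
print(f"{total:.1f} kcal/mol")
```

A negative total means the stacking increments outweigh the loop penalty for this toy hairpin; programs such as Mfold use much larger experimentally derived tables and more term types.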
Dynamic Programming (this slide has been changed)
• Finding optimal secondary structure is difficult - lots of possibilities
• Compare RNA sequence with itself
• Apply scoring scheme based on energy parameters for base stacking, cooperativity, and penalties for destabilizing forces (loops, bulges)
• Find path that represents most energetically favorable secondary structure
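To make the "compare the sequence with itself" idea concrete, here is a small Python sketch of a Nussinov-style dynamic program. Note the substitution: it maximizes the number of nested base-pairs instead of minimizing free energy, so it is a simplified stand-in for the energy-based scoring described above, not what Mfold or RNAfold compute. The function names and the `min_loop` parameter are assumptions of this sketch.

```python
# Allowed pairs: Watson-Crick plus the G-U wobble pair.
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Return the maximum number of nested base-pairs in seq.

    best[i][j] holds the optimum for the subsequence seq[i..j];
    min_loop is the minimum number of unpaired bases inside a hairpin.
    """
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):          # fill short subsequences first
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]               # case 1: base j is unpaired
            for k in range(i, j - min_loop):     # case 2: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCC"))
```

Replacing the "+ 1" with stacking and loop energy terms (and switching max to min) gives the general shape of an energy-minimizing recursion, at the cost of a more involved table layout.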
3 - Popular Programs that use Combined Computational & Experimental Approaches
• Mfold
• Sfold
• RNAstructure
• RNAfold
• RNAalifold
Comparison of Predictions for Single RNA using Different Methods
• Sfold: -51.14 kcal/mol
• Mfold: -54.84 kcal/mol
• RNAstructure: -71.3 kcal/mol
• RNAfold: -80.16 kcal/mol
[Figure: the four predicted structures, each with stem-loops SL X, SL Y, and SL Z labeled]
JH Lee 2007
Comparison of Mfold Predictions: -/+ Constraints
• Mfold: -126.05 kcal/mol
• Mfold plus constraints: -54.84 kcal/mol
JH Lee 2007
Performance Evaluation (this slide has been changed)
• Ab initio methods? Correlation coefficient = 20-60%
• Comparative approaches? Correlation coefficient = 20-80%
  • Programs that require user to supply MSA are more accurate
  • Comparative programs are consistently more accurate than ab initio
  • Base-pairs predicted by comparative sequence analysis for large & small subunit rRNAs are 97% accurate when compared with high resolution crystal structures! - Gutell, Pace
• BEST APPROACH? Methods that combine computational prediction (ab initio & comparative) with experimental constraints (from chemical/enzymatic modification studies)
Chp 8 - Gene Prediction
SECTION III: GENE AND PROMOTER PREDICTION
Xiong: Chp 8 Gene Prediction
• Categories of Gene Prediction Programs
• Gene Prediction in Prokaryotes
• Gene Prediction in Eukaryotes
What is a Gene?
A segment of DNA, some of which is "structural," i.e., transcribed to give a functional RNA product, & some of which is "regulatory"
• Genes can encode:
  • mRNA (for protein)
  • other types of RNA (tRNA, rRNA, miRNA, etc.)
• Genes differ in eukaryotes vs prokaryotes (& archaea), in both structure & regulation
Gene Finding Problem
Given a new genomic DNA sequence, identify coding regions and their predicted RNA and protein sequences
ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT
Steps:
• Search against protein / EST database
• Apply gene prediction programs (many programs available)
• Analyze regulatory regions
Gene Prediction in Prokaryotes vs Eukaryotes
Prokaryotes
• Small genomes: 0.5 - 10 × 10^6 bp
• About 90% of genome is coding
• Simple gene structure
• Prediction success ~99%
Eukaryotes
• Large genomes: 10^7 - 10^10 bp
• Often less than 2% coding
• Complicated gene structure (splicing, long introns)
• Prediction success 50-95%
[Figure: prokaryotic gene = promoter followed by an open reading frame (ORF) running from start codon (ATG) to stop codon (TAA); eukaryotic gene = promoter, 5' UTR, start codon (ATG), exons separated by introns with splice sites marked, stop codon (TAA), 3' UTR]
DNA "Signals" Used by Gene Finding Algorithms
• Exploit the regular gene structure
  ATG - Exon1 - Intron1 - Exon2 - … - ExonN - STOP
• Recognize "coding bias"
  CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA-…
• Recognize splice sites
  Intron - cAGt - Exon - gGTgag - Intron
• Model the duration of regions
  Introns tend to be much longer than exons, in mammals
  Exons are biased to have a given minimum length
• Use cross-species comparison
  Gene structure is conserved in mammals
  Exons are more similar (~85%) than introns
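As a small illustration of measuring "coding bias", the Python sketch below counts in-frame codons and computes GC3, the fraction of codons whose third position is G or C. GC3 is just one simple content statistic chosen for this example; the slide does not say which statistic any particular program uses, and real gene finders rely on richer models (for example hexamer frequencies or Markov chains). The function names are assumptions of this sketch.

```python
from collections import Counter


def codon_counts(dna, frame=0):
    """Count non-overlapping codons read from the given frame (0, 1, or 2)."""
    dna = dna.upper().replace("-", "")            # tolerate hyphen-separated codons
    codons = [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]
    return Counter(codons)


def gc3_fraction(dna, frame=0):
    """Fraction of codons whose third position is G or C."""
    counts = codon_counts(dna, frame)
    total = sum(counts.values())
    gc3 = sum(n for codon, n in counts.items() if codon[2] in "GC")
    return gc3 / total if total else 0.0


# The codon string shown on the slide.
example = "CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA"
print(codon_counts(example).most_common(3))
print(gc3_fraction(example))
```

Comparing such statistics across the three frames of a window, or against values from known genes of the same organism, is the basic idea behind "search by content".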
Computational Gene Finding Approaches
• Ab initio methods
  • Search by signal: find DNA sequences involved in gene expression
  • Search by content: test statistical properties distinguishing coding from non-coding DNA
• Similarity-based methods
  • Database search: exploit similarity to proteins, ESTs, and cDNAs
  • Comparative genomics: exploit aligned genomes
    • Do other organisms have similar sequence?
• Hybrid methods - best
Examples of Gene Prediction Software
• Ab initio
  • GENSCAN, GeneMark.hmm, Genie, GeneID…
• Similarity-based
  • BLAST, Procrustes…
• Hybrids
  • GeneSeqer, GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM
• BEST?
  Ab initio - GENSCAN (according to some assessments)
  Hybrid - GeneSeqer
  But depends on organism & specific task
Lists of Gene Prediction Software:
http://www.bioinformaticsonline.org/links/ch_09_t_1.html
http://cmgm.stanford.edu/classes/genefind/
Synthesis & Processing of Eukaryotic mRNA
Gene in DNA (double-stranded, 5'→3' / 3'→5'): exon 1 - intron - exon 2 - intron - exon 3
→ Transcription: primary (1°) transcript (RNA), 5'→3'
→ Splicing (remove introns)
→ Capping & polyadenylation: mature mRNA, 5' 7MeG … AAAAA 3'
→ Export to cytoplasm
What are cDNAs & ESTs?
• cDNA libraries are important for determining gene structure & studying regulation of gene expression
• Isolate RNA (always from a specific organism, region, and time point)
• Convert RNA to complementary DNA (with reverse transcriptase)
• Clone into cDNA vector
• Sequence the cDNA inserts
• Short cDNAs are called ESTs or Expressed Sequence Tags
• ESTs are strong evidence for genes
• Full-length cDNAs can be difficult to obtain
[Figure: cDNA insert cloned into vector]
UniGene: Unique genes via ESTs
• Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
• UniGene clusters contain many ESTs
• UniGene data come from many cDNA libraries. When you look up a gene in UniGene, you can obtain information re: level & tissue distribution of expression
Gene Prediction
• Overview of steps & strategies
  • What sequence signals can be used?
  • What other types of information can be used?
• Algorithms
  • HMMs, Bayesian models, neural nets
• Gene prediction software
  • 3 major types
  • many, many programs!
Overview of Gene Prediction Strategies
• What sequence signals can be used?
  • Transcription: TF binding sites, promoter, initiation site, terminator, CpG islands, etc.
  • Processing signals: splice donor/acceptor sites, polyA signal
  • Translation: start (AUG = Met) & stop (UGA, UAA, UAG) codons
  • ORFs, codon usage
• What other types of information can be used?
  • Homology (sequence comparison, BLAST)
  • cDNAs & ESTs (experimental data, pairwise alignment)
Gene prediction: Eukaryotes vs prokaryotes
Gene prediction is easier in microbial genomes. Why?
• Smaller genomes
• Simpler gene structures
• Many more sequenced genomes! (for comparative approaches)
Many microbial genomes have been fully sequenced, & whole-genome "gene structure" and "gene function" annotations are available, e.g.:
• GeneMark.hmm
• TIGR Comprehensive Microbial Resource (CMR)
• NCBI Microbial Genomes
Predicting Genes - Basic steps:
• Obtain genomic sequence
• BLAST it!
  • Perform database similarity search (with EST & cDNA databases, if available)
• Translate in all 6 reading frames (i.e., "6-frame translation")
  • Compare with protein sequence databases
• Use gene prediction software to locate genes
• Analyze regulatory sequences
• Refine gene prediction
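The 6-frame translation step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: it uses the standard genetic code, reports only ORFs that begin with ATG and end at an in-frame stop, and the function names and the `min_codons` cutoff are invented for this sketch. Coordinates on the "-" strand refer to the reverse-complemented sequence.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=3):
    """Return (strand, frame, start, end, protein) for ATG...stop ORFs in 6 frames."""
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                       # first in-frame start codon
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        protein = "".join(
                            CODON_TABLE.get(seq[k:k + 3], "X")  # X = unknown codon
                            for k in range(start, i, 3)
                        )
                        results.append((strand, frame, start, i + 3, protein))
                    start = None
    return results


# The example sequence from the "Gene Finding Problem" slide.
for orf in find_orfs("ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT"):
    print(orf)
```

The translated ORFs are what would then be compared against protein databases; an ORF scan alone is not sufficient for eukaryotic genes, where introns interrupt the reading frame.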
Predicting Genes - Details:
1. First, mask to "remove" repetitive elements (ALUs, etc.)
2. Perform database search on translated DNA (BlastX, TFasta)
3. Use several programs to predict genes (GENSCAN, GeneMark.hmm, GeneSeqer)
4. Search for functional motifs in translated ORFs (Blocks, Motifs, etc.) & in neighboring DNA sequences
5. Repeat
Spliced Alignment Algorithm
GeneSeqer - Brendel et al. - ISU
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
Brendel et al (2004) Bioinformatics 20: 1157
• Perform pairwise alignment with large gaps in one sequence (due to introns)
  • Align genomic DNA with cDNA, ESTs, protein sequences
• Score semi-conserved sequences at splice junctions
  • Using a Bayesian model or Markov model (MM)
• Score coding constraints in translated exons
  • Using a Bayesian model or Markov model (MM)
[Figure: intron bounded by splice sites, donor GT at the 5' end and acceptor AG at the 3' end]
Brendel 2005
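A toy Python sketch can show what "alignment with a large gap due to an intron" means. It is deliberately far simpler than GeneSeqer: it assumes exactly one intron, exact matches between cDNA and genomic exons, and a cDNA that spans the whole genomic segment, and it only checks the GT…AG boundary rule instead of scoring splice sites probabilistically. The function name and the `min_intron` parameter are assumptions of this sketch.

```python
def find_single_intron(genomic, cdna, min_intron=4):
    """Return (start, end) of one GT...AG intron whose removal yields cdna.

    Coordinates are 0-based, end-exclusive, on the genomic sequence.
    Returns None if no such intron exists.
    """
    genomic, cdna = genomic.upper(), cdna.upper()
    intron_len = len(genomic) - len(cdna)
    if intron_len < min_intron:
        return None
    for start in range(len(cdna) + 1):              # try every possible gap position
        end = start + intron_len
        intron = genomic[start:end]
        spliced = genomic[:start] + genomic[end:]   # exons joined after removal
        if intron.startswith("GT") and intron.endswith("AG") and spliced == cdna:
            return start, end
    return None


exon1, intron, exon2 = "ATGGCC", "GTAAGTTTTCAG", "GATTAA"
print(find_single_intron(exon1 + intron + exon2, exon1 + exon2))
```

A real spliced aligner replaces the exact-match test with a dynamic programming alignment that tolerates mismatches and multiple introns, and weights each candidate donor and acceptor by a trained splice site model.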
Brendel - Spliced Alignment II: Compare with protein probes
[Figure: protein sequence aligned to genomic DNA, with start codon and stop codon marked]
Brendel 2005
Splice Site Detection
Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal? YES
• Information content I_i: [equation not recovered]
• Extent of splice signal window: [equation not recovered]
where
  i: ith position in sequence
  Ī: avg information content over all positions >20 nt from splice site
  σ_Ī: avg sample standard deviation of Ī
Brendel 2005
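The slide's own equation did not survive, so the Python sketch below uses the standard Shannon definition for DNA, I_i = 2 + Σ_B f_iB · log2(f_iB), where f_iB is the frequency of base B at position i. Treat that choice as an assumption: it may differ in detail (for example small-sample corrections) from the formula on the original slide. The function name and the toy donor sites are invented for illustration.

```python
import math
from collections import Counter


def information_content(aligned_sites):
    """Per-position information content in bits for equal-length DNA sequences.

    2 bits  = position is perfectly conserved
    0 bits  = all four bases are equally frequent
    """
    length = len(aligned_sites[0])
    profile = []
    for i in range(length):
        counts = Counter(site[i] for site in aligned_sites)
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        profile.append(2.0 - entropy)
    return profile


# Toy alignment of four hypothetical donor sites (not real data).
donor_sites = ["GTAA", "GTGA", "GTAC", "GTGT"]
print(information_content(donor_sites))
```

Plotting such a profile against position is what reveals how far the splice signal extends beyond the invariant GT and AG dinucleotides.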
Information content vs position
[Figure: information content plotted against position for two panels, Human T2_GT (donor sites) and Human T2_AG (acceptor sites)]
Which sequences are exons & which are introns? How can you tell?
Brendel et al (2004) Bioinformatics 20: 1157
Brendel 2005
Markov Model for Spliced Alignment
[Figure: state diagram with exon states e_n, e_n+1 and intron states i_n, i_n+1; transition labels shown: P_G, (1-P_G)(1-P_D(n+1)), (1-P_G)P_D(n+1), P_A(n)P_G, 1-P_A(n)]
Brendel 2005