Explore the importance of RNA structures, learn about secondary structures and how to predict them, understand how function depends on structure, and study different RNA classes. Discover how to represent RNA structures and predict the most stable one.
Lecture 11. RNA Secondary Structure Prediction The Chinese University of Hong Kong CSCI3220 Algorithms for Bioinformatics
Lecture outline • From sequences to functions • RNA secondary structures CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Part 1 From Sequences to Functions
From sequences to functions • One of the biggest questions in molecular biology: Can one tell the function of a molecule (DNA/RNA/protein) from its sequence alone? • Sometimes, but usually not (yet) • Easier if we also know the structure • Common belief: sequence → structure → function • Of course, function also depends on the environment
Molecular structures • Four levels: • Primary structures • The sequence • Secondary structures • First formed • Local • Tertiary structures • Global • Sometimes called “folds” or “domains” • Quaternary structures • Multiple molecules Image credit: http://www.personal.psu.edu/jms5704/blogs/simmons/levels_of_protein_s_c_la_784.jpg
Structure and function • Why does function depend on structure? • Structure itself is the function (e.g., tubulins) • Binding • Complementarity of interacting structures • Formation of special bonds Image credit: http://www.nigms.nih.gov/NR/rdonlyres/54BEAC37-47A9-454A-BC4F-B94EA127FA1E/0/fig1a_large.jpg, http://upload.wikimedia.org/wikimedia/en-labs/7/7f/Protein_Protein_Docking.JPG
Structure and function • Why does function depend on structure? (cont’d) • Functional group (e.g., catalytic site) • Determining localization (e.g., transporter membrane proteins) Image credit: http://www.catalysis-ed.org.uk/principles/images/enzyme_substrate.gif, Spudich, Science 288(5470):1358-1359, 2000
Part 2 RNA Secondary Structures
Important RNA classes • Coding: • Messenger RNAs (mRNAs) • For translating into proteins • Non-coding: • Ribosomal RNAs (rRNAs) • Parts of the ribosome complex • Transfer RNAs (tRNAs) • Delivering free amino acids during translation • Micro RNAs (miRNAs) • Binding mRNA targets to promote RNA degradation or repress translation • Small nucleolar RNAs (snoRNAs) • Guiding chemical modifications of other RNAs • Small nuclear RNAs (snRNAs) • Involved in mRNA splicing • Long non-coding RNAs (lncRNAs) • Some involved in gene regulation • ... Image source: http://legacy.hopkinsville.kctcs.edu/sitecore/instructors/Jason-Arnold/VLI/Module%201/m1DNAfunction/m1DNAfunction3.html
Importance of RNA structures • Structure is important to many classes of RNA • Examples: tRNA, snoRNA Image sources: http://www.bio.miami.edu/dana/pix/tRNA.jpg, http://lowelab.ucsc.edu/images/CDBox.jpg
RNA secondary structures • Can largely be projected onto a 2D plane (all the T’s in the figure should be U’s) • Elements labeled in the figure: stem/hairpin loop, stacking pairs, bulge, internal loop, multi-loop, exterior loop, dangling nucleotides, less stable pair, coaxial stacking Image credit: http://www.clcbio.com/scienceimages/rna_prediction/RNA_structure_prediction_web.png
RNA secondary structures • Pseudoknots: complex structures Image credit: Wikipedia, Sperschneider and Datta, RNA 14(4):630-640, (2008)
Representing RNA secondary structures • Formats (see http://projects.binf.ku.dk/pgardner/bralibase/RNAformats.html): • Dot-bracket format • Stockholm format • ...
Dot-bracket format • Sequence (on the slide, nucleotides 10, 20, 30, etc. are marked in red):
GUGAAUGAUGAAUUUAAUUCUUUGGUCCGUGUUUAUGAUGGGAAGUAAGACCCCCGAUAUGAGUGACAAAAGAGAUGUGGUUGACUAUCACAGUAUCUGACG
• Structure:
......((((.......((((((.(((....((((((.((((..........)))).)))))).))).)))))).((((((.....)))))).)))).....
Image credit: Xihao Hu
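The dot-bracket string above can be turned back into a list of base pairs with a stack. The following is a minimal sketch, not part of the original slides; the function name `parse_dot_bracket` is made up for illustration, and positions are 1-based to match the lecture.

```python
def parse_dot_bracket(structure):
    """Return the sorted 1-based (i, j) base pairs encoded by a dot-bracket string.

    Assumes only '.', '(' and ')' appear, i.e. no pseudoknots.
    """
    open_positions = []  # positions of '(' that are still waiting for a partner
    pairs = []
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            open_positions.append(pos)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {pos}")
            # The most recent unmatched '(' pairs with this ')'.
            pairs.append((open_positions.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


print(parse_dot_bracket("((..))"))
```

Applied to the structure string on this slide, the same function recovers pairs such as (7, 97), (18, 74) and (81, 87).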
Predicting RNA secondary structures • A basic assumption in structure predictions: • Real structure has the lowest free energy • In a simplified view, more stable bonds lower free energy • In the case of RNA secondary structures: • Good to form more pairs • Canonical pairs: A-U, C-G • Sometimes G-U (a “wobble base pair”) • Good to form more stable pairs. Stability: • C-G > A-U > G-U • Good to have stable sub-structures • E.g., stacking pairs
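The pairing rules on this slide can be written as a small lookup. This is an illustrative sketch only: the ranks 3, 2, 1 merely encode the ordering C-G > A-U > G-U and are not measured energies, and the names `PAIR_RANK`, `pair_rank` and `can_pair` are hypothetical.

```python
# Relative ranks that only encode the slide's ordering C-G > A-U > G-U.
# The numbers are placeholders, not thermodynamic measurements.
PAIR_RANK = {
    frozenset("CG"): 3,
    frozenset("AU"): 2,
    frozenset("GU"): 1,  # wobble pair
}


def pair_rank(x, y):
    """Return the stability rank of pairing bases x and y, or 0 if they cannot pair."""
    return PAIR_RANK.get(frozenset((x.upper(), y.upper())), 0)


def can_pair(x, y):
    """True if x and y form a canonical (A-U, C-G) or wobble (G-U) pair."""
    return pair_rank(x, y) > 0


print(pair_rank("G", "C"), pair_rank("A", "U"), pair_rank("G", "U"), pair_rank("A", "C"))
```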
Predicting RNA secondary structures • We will assume there are no pseudoknots • With pseudoknots, there is currently no known algorithm that can find the optimal solution efficiently • We need two things: • A thermodynamic model for computing the free energy of a structure • A method for finding the structure with the minimum free energy • Does this setting sound familiar? Figure: a pseudoknot. Image credit: Wikipedia
Further assumptions • The free energy of a secondary structure is the sum of the free energy of the sub-structures. • Not the sum of individual bases/base pairs, as one base pair can participate in multiple sub-structures. • We will count each sub-structure exactly once. For example, to count a hairpin loop, we consider the base pair that closes the loop. • The free energy values of the sub-structures are independent.
Problem definition • Given an RNA sequence, find a set of base pairs so that each base is paired at most once • Example: • Input sequence: GUGAAUGAUGAAUUU...ACG • Output set of base pairs: • (7, 97) • (8, 96) • ... • (18, 74) • ... • (81, 87) Image credit: Xihao Hu
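A candidate output for this problem can be checked mechanically. The sketch below is an illustration under assumptions, not the lecture's code: `check_pair_set` is a made-up name, pairs are 1-based tuples (i, j) with i < j, and the crossing test reflects the later assumption that there are no pseudoknots.

```python
def check_pair_set(n, pairs):
    """Return a list of problems found in a set of base pairs (empty list if none).

    n     -- sequence length
    pairs -- iterable of 1-based (i, j) tuples
    """
    pairs = list(pairs)
    problems = []
    used = set()
    for i, j in pairs:
        if not (1 <= i < j <= n):
            problems.append(f"pair ({i}, {j}) is out of range or not ordered")
            continue
        for base in (i, j):
            # Each base may be paired at most once.
            if base in used:
                problems.append(f"base {base} is paired more than once")
            used.add(base)
    # Two pairs (i, j) and (k, l) cross when i < k < j < l: that is a pseudoknot.
    ordered = sorted(pairs)
    for index, (i, j) in enumerate(ordered):
        for k, l in ordered[index + 1:]:
            if i < k < j < l:
                problems.append(f"pairs ({i}, {j}) and ({k}, {l}) cross")
    return problems


print(check_pair_set(10, [(1, 10), (2, 9)]))
print(check_pair_set(10, [(1, 5), (3, 8)]))
```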
Linear view • Position 1: . • Positions 7-25: ((((.......((((((.( • Positions 67-97: ).)))))).((((((.....)))))).)))) • Positions not listed are elided (...) on the slide
Thermodynamic model • We will consider four types of sub-structures here: • Stacking pairs: both (i, j) and (i+1, j-1) are in the set • Hairpin loop: there is a pair (i, j), where all bases from i+1 to j-1 are not paired • Bulge/Internal loop: there are two pairs (i, j) and (i_1, j_1), where i < i_1 < j_1 < j, and all bases from i+1 to i_1-1 and from j_1+1 to j-1 are not paired • Multi-loop: there are pairs (i, j), (i_1, j_1), ..., (i_k, j_k), where i < i_1 < j_1 < ... < i_k < j_k < j, and all bases from i+1 to i_1-1, from j_1+1 to i_2-1, ..., from j_{k-1}+1 to i_k-1 and from j_k+1 to j-1 are unpaired • One base pair can participate in multiple structures
Stacking pairs • Both (i, j) and (i+1, j-1) are in the set, with j > i (we require j ≥ i+2) • E.g., i: 20, j: 72 • Marked in the linear view: i = 20, i+1 = 21, j-1 = 71, j = 72
Hairpin loop • There is a pair (i, j), with j > i (unless specified, we require j ≥ i+2), where all bases from i+1 to j-1 are not paired • E.g., i: 81, j: 87 (marked in the linear view) Image source: http://img.ehowcdn.com/article-new/ds-photo/getty/article/151/226/87820768_XS.jpg
Bulge/Internal loop • Internal loop: There are two pairs (i, j) and (i_1, j_1), where i < i_1 < j_1 < j, and all bases from i+1 to i_1-1 and from j_1+1 to j-1 are not paired • Called a bulge if only one side has unpaired bases • Unless specified, we allow i_1 = i+1 or j = j_1+1 (but not both) • E.g., i: 23, j: 69, i_1: 25, j_1: 67 (marked in the linear view)
Multi-loop • Multi-loop: There are pairs (i, j), (i_1, j_1), ..., (i_k, j_k), where k ≥ 2 and i < i_1 < j_1 < ... < i_k < j_k < j, and all bases from i+1 to i_1-1, from j_1+1 to i_2-1, ..., from j_{k-1}+1 to i_k-1 and from j_k+1 to j-1 are unpaired • E.g., k = 2, i: 10, j: 94, i_1: 18, j_1: 74, i_2: 76, j_2: 92 (marked in the linear view)
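The four definitions can be applied to a set of base pairs by looking at the pairs directly enclosed by each closing pair. The following sketch is one way to do that, assuming a pseudoknot-free set of 1-based pairs; the function name `classify_loops` and the label strings are illustrative choices, not the lecture's.

```python
def classify_loops(pairs):
    """Map each pair (i, j) to the sub-structure it closes.

    Assumes pairs are 1-based (i, j) tuples with i < j and no pseudoknots.
    """
    partner = {}
    for i, j in pairs:
        partner[i] = j
        partner[j] = i

    labels = {}
    for i, j in sorted(pairs):
        # Collect the pairs directly enclosed by (i, j), skipping over nested regions.
        inner = []
        pos = i + 1
        while pos < j:
            q = partner.get(pos)
            if q is not None and q > pos:
                inner.append((pos, q))
                pos = q + 1
            else:
                pos += 1

        if not inner:
            labels[(i, j)] = "hairpin loop"
        elif len(inner) >= 2:
            labels[(i, j)] = "multi-loop"
        else:
            i1, j1 = inner[0]
            left_gap = i1 - i - 1    # unpaired bases between i and i1
            right_gap = j - j1 - 1   # unpaired bases between j1 and j
            if left_gap == 0 and right_gap == 0:
                labels[(i, j)] = "stacking pair"
            elif left_gap == 0 or right_gap == 0:
                labels[(i, j)] = "bulge"
            else:
                labels[(i, j)] = "internal loop"
    return labels


example = [(1, 20), (2, 19), (3, 10), (4, 9), (12, 18), (13, 16)]
for pair, label in classify_loops(example).items():
    print(pair, label)
```

With the lecture's structure as input, the same logic should label (20, 72) a stacking pair, (81, 87) a hairpin loop, (23, 69) an internal loop and (10, 94) a multi-loop, matching the examples on the slides.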
One possible thermodynamic model • Unpaired bases have 0 free energy and all the terms below have negative free energy • eS(i, j): for the stacking pairs (i, j) and (i+1, j-1) • eH(i, j): for the hairpin loop closed at (i, j) • eBI(i, j, i_1, j_1): for a bulge or internal loop enclosed by the pairs (i, j) and (i_1, j_1) • eM(i, j, i_1, j_1, ..., i_k, j_k): for a multi-loop that consists of the pairs (i, j), (i_1, j_1), ..., (i_k, j_k) and satisfies i < i_1 < j_1 < ... < i_k < j_k < j
Finding the optimal structure • Dynamic programming • Let s be the RNA sequence with n nucleotides • Tables: • V(j): free energy of the optimal structure for s[1..j] • Final answer is based on V(n) • VP(i, j): free energy of the optimal structure for s[i..j] with i and j forming a pair • VBI(i, j): free energy of the optimal structure for s[i..j] with i and j forming a pair that closes a bulge or internal loop • VM(i, j): free energy of the optimal structure for s[i..j] with i and j forming a pair that closes a multi-loop
Update formulas • V(j): free energy of the optimal structure for s[1..j] • V(1) = 0 • For j > 1, take the better of two cases: • j is unpaired: use the optimal structure for s[1..j-1] • j pairs with some i: combine the optimal structure for s[1..i-1] with the optimal structure for s[i..j] in which i and j form a pair
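The formula itself is not preserved in this transcript. One recurrence consistent with the two cases on this slide is given below as a reconstruction, not as the slide's exact wording; the boundary value V(0) = 0 is an added convention so that i = 1 is covered.

```latex
V(j) = \min\Bigl\{\, V(j-1),\ \min_{1 \le i < j} \bigl[\, V(i-1) + VP(i, j) \,\bigr] \Bigr\},
\qquad V(0) = V(1) = 0
```

Here VP(i, j) is taken to be +∞ whenever s[i] and s[j] cannot form a pair, so only feasible choices of i contribute.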
Update formulas • VP(i, j): free energy of the optimal structure for s[i..j] with i and j forming a pair • We require that i < j • Cases illustrated on the slide: • Stacking pairs: (i+1, j-1) is also a pair • Hairpin loop: all bases between i and j are unpaired
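The formula is missing from this transcript. A recurrence consistent with the table definitions and energy terms introduced earlier is sketched below; it is a reconstruction under that assumption, with one term per sub-structure type that the pair (i, j) can close.

```latex
VP(i, j) = \min
\begin{cases}
eH(i, j) & \text{hairpin loop} \\
eS(i, j) + VP(i+1, j-1) & \text{stacking pairs} \\
VBI(i, j) & \text{bulge or internal loop} \\
VM(i, j) & \text{multi-loop}
\end{cases}
```

Given VBI(i, j) and VM(i, j), each entry takes constant time, which agrees with the running-time summary later in the lecture.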
Update formulas • VBI(i, j): free energy of the optimal structure for s[i..j] with i and j forming a pair that closes a bulge or internal loop (i.e., i and j take the roles of i1 and j1) • Figure: a bulge or internal loop closed by (i, j) and (i_1, j_1), with all bases between i and i_1 and between j_1 and j unpaired
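The formula is missing from this transcript. Based on the definition of a bulge/internal loop and the eBI term, a consistent recurrence would minimise over the inner pair; this is a reconstruction, and the side condition encodes the earlier rule that i_1 = i+1 and j = j_1+1 may not both hold.

```latex
VBI(i, j) = \min_{\substack{i < i_1 < j_1 < j \\ (i_1 - i) + (j - j_1) > 2}}
\bigl[\, eBI(i, j, i_1, j_1) + VP(i_1, j_1) \,\bigr]
```

There are O(n^2) choices of (i_1, j_1) for each entry, matching the O(n^2) time per entry stated in the running-time summary.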
Update formulas • VM(i, j): free energy of the optimal structure for s[i..j] with i and j forming a pair that closes a multi-loop
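The formula is missing from this transcript. Following the multi-loop definition and the eM term, a consistent recurrence would minimise over all choices of k ≥ 2 inner pairs; the form below is a reconstruction under that assumption.

```latex
VM(i, j) = \min_{\substack{k \ge 2 \\ i < i_1 < j_1 < \dots < i_k < j_k < j}}
\Bigl[\, eM(i, j, i_1, j_1, \dots, i_k, j_k) + \sum_{h=1}^{k} VP(i_h, j_h) \,\Bigr]
```

Choosing k inner pairs means choosing 2k positions, which is where the O(n^{2k}) time per entry in the running-time summary comes from.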
Time and space requirements • V: n entries, each takes O(n) time • VP: O(n^2) entries, each takes constant time
Time and space requirements • VBI: O(n^2) entries, each takes O(n^2) time • VM: O(n^2) entries, each takes O(n^{2k}) time
Time and space requirements • Summary: • V: n entries, each takes O(n) time • VP: O(n^2) entries, each takes constant time • VBI: O(n^2) entries, each takes O(n^2) time • VM: O(n^2) entries, each takes O(n^{2k}) time • Total: O(n^2) space, O(n^{2k+2}) time • Exponential if k is unbounded • Some approximations could bring the time down to O(n^4), which is still huge for large n but feasible for small or medium n
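To make the tables concrete, here is a small memoised sketch of the dynamic program for short sequences. It is an illustration under loud assumptions, not the lecture's implementation: the energy values are toy numbers chosen only to be negative and to respect C-G > A-U > G-U, the multi-loop table VM is left out entirely so that the running time stays at O(n^4), and all names (`min_free_energy`, `e_pair`, and so on) are hypothetical.

```python
import math
from functools import lru_cache

# Toy pair strengths (assumption): only the ordering C-G > A-U > G-U is from the lecture.
RANK = {frozenset("CG"): 3.0, frozenset("AU"): 2.0, frozenset("GU"): 1.0}


def min_free_energy(seq):
    """Return V(n) for a toy energy model, ignoring multi-loops (VM is omitted)."""
    s = " " + seq.upper()  # pad so that positions are 1-based as in the lecture
    n = len(seq)

    def e_pair(i, j):
        # Negative for pairable bases, 0.0 otherwise.
        return -RANK.get(frozenset((s[i], s[j])), 0.0)

    def can_pair(i, j):
        return j >= i + 2 and e_pair(i, j) < 0

    def eH(i, j):                 # toy hairpin energy
        return e_pair(i, j)

    def eS(i, j):                 # toy stacking energy: extra bonus of -1
        return e_pair(i, j) - 1.0

    def eBI(i, j, i1, j1):        # toy bulge/internal loop energy, weaker than a hairpin
        return e_pair(i, j) + 0.5

    @lru_cache(maxsize=None)
    def VP(i, j):
        if not can_pair(i, j):
            return math.inf
        return min(eH(i, j), eS(i, j) + VP(i + 1, j - 1), VBI(i, j))

    @lru_cache(maxsize=None)
    def VBI(i, j):
        best = math.inf
        for i1 in range(i + 1, j):
            for j1 in range(i1 + 2, j):
                if i1 == i + 1 and j1 == j - 1:
                    continue      # that case is a stacking pair, not a loop
                best = min(best, eBI(i, j, i1, j1) + VP(i1, j1))
        return best

    @lru_cache(maxsize=None)
    def V(j):
        if j <= 1:
            return 0.0
        best = V(j - 1)           # j is unpaired
        for i in range(1, j - 1): # j pairs with i, with j >= i + 2
            best = min(best, V(i - 1) + VP(i, j))
        return best

    return V(n)


print(min_free_energy("GGAAACC"))
```

Because the functions recurse, this sketch is only meant for short inputs; a table-filling version ordered by increasing j - i would avoid deep recursion.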
Some remarks • If we allow general pseudoknots, there is currently no known efficient way to find the RNA secondary structure with the minimum free energy • Other methods to predict RNA secondary structures: • Conservation and covariation, illustrated with an alignment of four sequences (columns 1-5):
12345
ACGGU
ACUGU
CCAGG
UCCGA
• High conservation: columns 2 and 4 • Strong covariation: columns 1 and 5 • Experimental methods (e.g., RNA footprinting)
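The conservation and covariation example can be reproduced with a few lines of code. This is a deliberately simple sketch with made-up function names: a column counts as conserved when a single base fills it, and two columns count as covarying when every sequence has a pairable combination there (A-U, C-G or G-U) and more than one combination occurs. Real methods use statistical scores instead of these all-or-nothing rules.

```python
from collections import Counter
from itertools import combinations

PAIRABLE = {frozenset("AU"), frozenset("CG"), frozenset("GU")}


def conserved_columns(alignment, threshold=1.0):
    """Return 1-based columns whose most common base has frequency >= threshold."""
    result = []
    for col, bases in enumerate(zip(*alignment), start=1):
        top_count = Counter(bases).most_common(1)[0][1]
        if top_count / len(bases) >= threshold:
            result.append(col)
    return result


def covarying_column_pairs(alignment):
    """Return 1-based column pairs that always pair but vary across sequences."""
    columns = list(zip(*alignment))
    result = []
    for a, b in combinations(range(len(columns)), 2):
        observed = list(zip(columns[a], columns[b]))
        always_pairs = all(frozenset(p) in PAIRABLE for p in observed)
        varies = len(set(observed)) > 1  # a fully conserved pair is not covariation
        if always_pairs and varies:
            result.append((a + 1, b + 1))
    return result


alignment = ["ACGGU", "ACUGU", "CCAGG", "UCCGA"]
print(conserved_columns(alignment))
print(covarying_column_pairs(alignment))
```

On the alignment from the slide this reports columns 2 and 4 as conserved and columns 1 and 5 as covarying, in line with the slide's annotation.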
Representing pseudoknots • Without pseudoknots, RNA secondary structures can be unambiguously represented by dots (single bases) and brackets (base pairs) • Example from the linear view: positions 7-25 are ((((.......((((((.( and positions 67-97 are ).)))))).((((((.....)))))).)))) • What if there are pseudoknots? • Need more types of brackets • Example (positions 1-19):
GAAGUACAAUAUGUAACCG
.{.((((.....))}))..
Image source: http://ultrastudio.org/upload/RNAPseudoKnot-25005810.jpg
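With several bracket types, parsing needs one stack per type, and a pseudoknot shows up as two pairs that cross. The sketch below illustrates this; the set of bracket characters and the function names are assumptions made for the example, not a fixed standard taken from the slides.

```python
# Bracket types accepted by this sketch (an assumption; tools differ in what they allow).
BRACKETS = {"(": ")", "[": "]", "{": "}", "<": ">"}


def parse_extended_dot_bracket(structure):
    """Return sorted 1-based (i, j) pairs, matching each bracket type separately."""
    opener_of = {close: open_ for open_, close in BRACKETS.items()}
    stacks = {open_: [] for open_ in BRACKETS}  # one stack per bracket type
    pairs = []
    for pos, symbol in enumerate(structure, start=1):
        if symbol in BRACKETS:
            stacks[symbol].append(pos)
        elif symbol in opener_of:
            stack = stacks[opener_of[symbol]]
            if not stack:
                raise ValueError(f"unmatched {symbol!r} at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    for open_, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched {open_!r} at position {stack[-1]}")
    return sorted(pairs)


def has_pseudoknot(pairs):
    """True if any two pairs (i, j) and (k, l) cross, i.e. i < k < j < l."""
    return any(i < k < j < l for i, j in pairs for k, l in pairs)


pairs = parse_extended_dot_bracket(".{.((((.....))}))..")
print(pairs)
print(has_pseudoknot(pairs))
```

For the 19-base example on this slide, the pair written with curly brackets, (2, 15), crosses the round-bracket pair (4, 17), which is why a second bracket type is needed.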
Epilogue Case Study, Summary and Further Readings
Case study: Drug finding/design • Drugs are mostly chemicals with a specific structure that interacts with some biological objects • Examples: • Inhibiting the activities of an important protein of bacteria • Blocking the interaction between a virus and the receptors of the host cell • Stimulating the production of a hormone
Case study: Drug finding/design • To identify or design a chemical that targets a particular object (e.g., a protein), we need to make sure that the two bind tightly, which is checked through a process called docking Image source: http://vds.cm.utexas.edu/
Case study: Drug finding/design • Computational problem: • Input: a target protein and a list of chemicals • Goal: find a chemical that binds the target well • Try different locations and orientations • Binding depends on structure and chemistry • Output: One or more chemicals that bind the target well • Difficulties: • Computational complexity • Large search space for each protein-chemical combination • Need to try many chemicals • Need to ensure specificity (not to target other proteins and cause side effects)
Case study: Drug finding/design • There is a game called FoldIt (http://fold.it/) in which players try to fold proteins • Score based on free energy • Real-time update of scores and ranks • Players can discuss and share solutions • Resulted in some amazingly good folds compared with automatic predictions by computer programs Image source: http://fold.it/portal/site_files/theme/science/competition.png
Summary • Functions depend on structures • Different levels of structures: • Primary (sequence) • Secondary (local) • Tertiary (global) • Quaternary (interactions) • RNA secondary structures can be predicted by dynamic programming based on a thermodynamic model • Important sub-structures • Stacking pairs • Hairpin loops • Internal loops/bulges • Multi-loops • Pseudoknots
Further readings • Chapter 11 of Algorithms in Bioinformatics: A Practical Introduction • Speed-up of the algorithm • Algorithm for RNA structure prediction with pseudoknots • Free slides available • Parts VII and VIII of Fundamental Concepts of Bioinformatics • Protein folding and protein structure prediction • Docking