RNA Secondary Structure Prediction. Dong Xu Computer Science Department 271C Life Sciences Center 1201 East Rollins Road University of Missouri-Columbia Columbia, MO 65211-2060 E-mail: xudong@missouri.edu 573-882-7064 (O) http://digbio.missouri.edu
Final Report • Due on Dec. 8. • A numerical score (0-15) will be assigned based on • Clear formulation of the project (2) • Method (4) • Significant results achieved (4) • Discussion (3) • Writing of the project (2)
Final Presentation • Preferably shown as a PowerPoint file; PDF is fine • Preferably 20 minutes (up to 25 min), plus 5 min for questions • 15 points for the presentation (introduction, methods, results, discussion) • 15 points for software demo • Implementation of the software • Major functionalities • Documentation • Perform a test run
Presentation Evaluation • A numerical score (0-15) will be assigned based on • Did the student put in enough effort? (3) • Is the work interesting or novel? (3) • Is the method technically sound? (3) • Is the discussion insightful? (3) • Is the presentation clear? (3)
Software Demo • A numerical score (0-15) will be assigned based on • Whether the program can run using actual biological data (3) • Documentation (3) • Whether it is easy to use (3) • Performance in accuracy (3) • Performance in computing time and memory usage (3)
Outline • RNA Secondary Structure • Comparative Approach • Base-Pair Maximization • Free Energy Minimization • Local Structure Prediction
RNA Types • siRNA: short interfering RNA • miRNA: microRNA • stRNA: small temporal RNA • snoRNA: small nucleolar RNA • snRNA: small nuclear RNA
Features of RNA • RNA: polymer composed of a combination of four nucleotides • adenine (A) • cytosine (C) • guanine (G) • uracil (U)
Features of RNA • G-C and A-U form complementary hydrogen-bonded base pairs (canonical Watson-Crick) • G-C base pairs are more stable (3 hydrogen bonds); A-U base pairs are less stable (2 hydrogen bonds) • Non-canonical pairs can occur in RNA; the most common is G-U
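The pairing rules on this slide can be written down as a small lookup. The following is a minimal sketch, not part of the original slides; the function name `classify_pair` and the labels it returns are illustrative assumptions.

```python
# Pairing rules from the slide: Watson-Crick pairs (with their hydrogen-bond
# counts) plus the non-canonical G-U pair. Names here are illustrative.
HYDROGEN_BONDS = {frozenset("GC"): 3, frozenset("AU"): 2}
GU_PAIR = frozenset("GU")


def classify_pair(a, b):
    """Return 'canonical', 'G-U' or 'none' for two nucleotides."""
    pair = frozenset((a.upper(), b.upper()))
    if pair in HYDROGEN_BONDS:
        return "canonical"
    if pair == GU_PAIR:
        return "G-U"
    return "none"


for x, y in [("G", "C"), ("A", "U"), ("G", "U"), ("A", "C")]:
    print(x, y, classify_pair(x, y))
```

Using `frozenset` makes the lookup order-independent, so `classify_pair("C", "G")` and `classify_pair("G", "C")` give the same answer.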
RNA Pairs A-U G-C G-U
RNA Structure Hierarchy • Primary structure: 5’ ACCACCUGCUGA 3’ • Secondary structure (figure) • Tertiary structure (figure)
Secondary Structure Categories • Hairpin loop • Stem • Internal loop • Bulge loop • Pseudoknots
Assumptions in Secondary Structure Prediction • Most likely structure similar to energetically most stable structure • Energy associated with any position is only influenced by local sequence and structure • Structure formed does not produce pseudoknots
Exceptions • Pseudoknot • Kissing hairpins • Hairpin-bulge • These do not obey the “parentheses rule”
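The “parentheses rule” says that base pairs must nest like balanced parentheses: two pairs (i, j) and (k, l) may not cross as i < k < j < l. The following sketch checks a list of pairs for such crossings; the function name `has_pseudoknot` and the 0-based pair representation are assumptions made for illustration.

```python
def has_pseudoknot(pairs):
    """Return True if any two base pairs cross, i.e. i < k < j < l.

    `pairs` is an iterable of (i, j) position tuples (0-based here,
    an illustrative convention).
    """
    ordered = sorted((min(p), max(p)) for p in pairs)
    for a, (i, j) in enumerate(ordered):
        # Pairs later in the sorted list start at or after position i.
        for k, l in ordered[a + 1:]:
            if i < k < j < l:
                return True
    return False


print(has_pseudoknot([(0, 9), (1, 8)]))  # nested pairs
print(has_pseudoknot([(0, 5), (3, 8)]))  # crossing pairs
```

A structure that passes this check can be written in dot-bracket notation with a single kind of bracket; a pseudoknot cannot.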
Outline • RNA Secondary Structure • Comparative Approach • Base-Pair Maximization • Free Energy Minimization • Local Structure Prediction
Inferring Structure By Comparative Sequence Analysis • First step is to calculate a multiple sequence alignment • Requires sequences be similar enough so that they can be initially aligned • Sequences should be dissimilar enough for correlated mutation to be detected
Mutual Information • fxi : frequency of a base in column i • fxixj: joint (pairwise) frequency of a base pair between columns i and j • M(i,j) = sum over xi, xj of fxixj log2( fxixj / (fxi fxj) ) • Information ranges from 0 to 2 bits • If i and j are uncorrelated, mutual information is 0
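As a rough illustration of the quantities on this slide, the sketch below computes the mutual information between two alignment columns from the single-column and joint frequencies, using the standard form: sum of fxixj · log2(fxixj / (fxi · fxj)). The function name, the toy alignment, and the choice to treat every character (including any gap symbol) as an ordinary symbol are assumptions, not part of the original slides.

```python
from collections import Counter
from math import log2


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j.

    `alignment` is a list of equal-length aligned sequences.
    """
    n = len(alignment)
    f_i = Counter(seq[i] for seq in alignment)             # f_xi
    f_j = Counter(seq[j] for seq in alignment)             # f_xj
    f_ij = Counter((seq[i], seq[j]) for seq in alignment)  # f_xixj
    total = 0.0
    for (a, b), count in f_ij.items():
        joint = count / n
        total += joint * log2(joint / ((f_i[a] / n) * (f_j[b] / n)))
    return total


# Toy alignment (hypothetical): columns 0 and 2 covary perfectly,
# column 1 is fully conserved.
toy = ["GAC", "CAG", "AAU", "UAA"]
print(mutual_information(toy, 0, 2))
print(mutual_information(toy, 0, 1))
```

In the toy alignment, the covarying columns reach the upper bound of 2 bits, while a conserved column carries no mutual information with any other column, which is why the sequences must be dissimilar enough for correlated mutations to be detected.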
Outline • RNA Secondary Structure • Comparative Approach • Base-Pair Maximization • Free Energy Minimization • Local Structure Prediction
Base-Pair Maximization • Find structure with the most base pairs • Efficient dynamic programming approach to this problem introduced by Nussinov (1970s). • Four ways to get the best structure between position i and j from the best structures of the smaller subsequences
Nussinov Algorithm • 1) Add the i,j pair onto the best structure found for subsequence i+1, j-1 • 2) Add unpaired position i onto the best structure for subsequence i+1, j • 3) Add unpaired position j onto the best structure for subsequence i, j-1 • 4) Combine two optimal structures i,k and k+1, j
Dynamic Programming - 1 Notation: • e(ri,rj) : free energy contribution of a base pair joining ri and rj, used here as a non-negative score (0 if ri and rj cannot pair) • S(i,j) : optimal (maximum) total score associated with segment ri…rj
Dynamic Programming - 2 • i is unpaired, added on to a structure for i+1…j: S(i,j) = S(i+1,j) • j is unpaired, added on to a structure for i…j-1: S(i,j) = S(i,j-1)
Dynamic Programming - 3 • i, j paired, but not to each other; the structure for i…j adds together structures for 2 subregions, i…k and k+1…j: S(i,j) = max over i<k<j of {S(i,k)+S(k+1,j)} • i, j paired to each other, added on to a structure for i+1…j-1: S(i,j) = S(i+1,j-1)+e(ri,rj)
Dynamic Programming - 4 Since there are only four cases, the optimal score S(i,j) is just the maximum of the four possibilities: S(i,j) = max { S(i+1,j), S(i,j-1), S(i+1,j-1)+e(ri,rj), max over i<k<j of [S(i,k)+S(k+1,j)] }
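The four cases can be sketched in code as follows. This is a minimal illustration, not the lecture's own implementation. The pair scores (G-C = 3, A-U = 2, G-U = 1) and the minimum hairpin loop of 3 unpaired bases are assumptions inferred from the worked example in these slides; the function names are illustrative.

```python
# Assumed pair scores, inferred from the worked example (e.g. e(C,G) = 3).
PAIR_SCORE = {("G", "C"): 3, ("C", "G"): 3,
              ("A", "U"): 2, ("U", "A"): 2,
              ("G", "U"): 1, ("U", "G"): 1}
MIN_LOOP = 3  # assumed: at least 3 unpaired bases between paired positions


def pair_score(a, b):
    """e(ri, rj): score of pairing a with b, 0 if they cannot pair."""
    return PAIR_SCORE.get((a, b), 0)


def nussinov_table(seq, min_loop=MIN_LOOP):
    """Fill S(i, j) (0-based indices) with the four-case recurrence."""
    n = len(seq)
    S = [[0] * n for _ in range(n)]  # initialisation: close positions score 0
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = max(S[i + 1][j], S[i][j - 1])      # i or j unpaired
            e = pair_score(seq[i], seq[j])
            if e > 0:
                best = max(best, S[i + 1][j - 1] + e)  # i pairs with j
            for k in range(i + 1, j):                  # bifurcation, i < k < j
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S


def traceback(seq, S, min_loop=MIN_LOOP):
    """Recover one optimal structure in dot-bracket notation."""
    structure = ["."] * len(seq)
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        e = pair_score(seq[i], seq[j])
        if S[i][j] == S[i + 1][j]:
            stack.append((i + 1, j))
        elif S[i][j] == S[i][j - 1]:
            stack.append((i, j - 1))
        elif e > 0 and S[i][j] == S[i + 1][j - 1] + e:
            structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if S[i][j] == S[i][k] + S[k + 1][j]:
                    stack.extend([(i, k), (k + 1, j)])
                    break
    return "".join(structure)


seq = "AUACCCUGUGGUAU"
table = nussinov_table(seq)
print(table[0][len(seq) - 1], traceback(seq, table))
```

Several structures can share the optimal score, so the traceback returns one of them, depending on the order in which the cases are tested. The triple loop over i, j and k is what gives the method its cubic running time.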
Initialisation: no close base pairs (figure: empty matrix S indexed by i and j)
Propagation: C5…U9 • C5 unpaired: S(6,9) = 0 • U9 unpaired: S(5,8) = 0 • C5-U9 paired: S(6,8)+e(C,U) = 0 • C5 paired, U9 paired (not to each other): S(5,6)+S(7,9) = 0; S(5,7)+S(8,9) = 0
Propagation: C5…G11 • C5 unpaired: S(6,11) = 3 • G11 unpaired: S(5,10) = 3 • C5-G11 paired: S(6,10)+e(C,G) = 6 • C5 paired, G11 paired (not to each other): S(5,6)+S(7,11) = 1; S(5,7)+S(8,11) = 0; S(5,8)+S(9,11) = 0; S(5,9)+S(10,11) = 0
Propagation (figure: filled matrix S indexed by i and j)
Traceback (figure: traceback path through matrix S)
Final Prediction • Sequence: AUACCCUGUGGUAU • (figure: predicted hairpin structure) • Total free energy: -12 kcal/mol
Some Notes • Computational complexity: O(N^3) • Does not work with pseudoknots (they would invalidate the DP algorithm) • Methods that include pseudoknots: Rivas and Eddy, JMB 285, 2053 (1999) • These methods are at least O(N^6)
Outline • RNA Secondary Structure • Comparative Approach • Base-Pair Maximization • Free Energy Minimization • Local Structure Prediction
Energy Minimization Methods • RNA folding is determined by biophysical properties • Energy minimization algorithm predicts the secondary structure by minimizing the free energy (ΔG) • ΔG calculated as sum of individual contributions of: • loops • base pairs • secondary structure elements • Energies of stems calculated as stacking contributions between neighboring base pairs
Calculating Best Structure • The sequence is compared against itself using a dynamic programming approach • Similar to the maximum base-paired structure • Instead of counting base pairs, the score is based upon free energy values • Gaps represent some form of a loop • The most widely used software that incorporates this minimum free energy algorithm is MFOLD.
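Summing the free energy over individual contributions first requires splitting a structure into its elements. The sketch below performs only that decomposition for a pseudoknot-free dot-bracket string and assigns no energy values, since the slides give none. The function names, the element labels, and the assumption that the input is balanced are illustrative.

```python
def pair_table(structure):
    """Map each opening position to its closing position (balanced input assumed)."""
    stack, partner = [], {}
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            partner[stack.pop()] = pos
    return partner


def decompose(structure):
    """List (element, i, j) for the loop closed by each base pair (i, j)."""
    partner = pair_table(structure)
    elements = []
    for i, j in sorted(partner.items()):
        # Collect the pairs directly enclosed by (i, j).
        inner, pos = [], i + 1
        while pos < j:
            if pos in partner:
                inner.append((pos, partner[pos]))
                pos = partner[pos] + 1
            else:
                pos += 1
        if not inner:
            kind = "hairpin loop"
        elif len(inner) > 1:
            kind = "multibranch loop"
        else:
            k, l = inner[0]
            if k == i + 1 and l == j - 1:
                kind = "stack"          # neighboring base pairs in a stem
            elif k == i + 1 or l == j - 1:
                kind = "bulge loop"     # unpaired bases on one side only
            else:
                kind = "internal loop"  # unpaired bases on both sides
        elements.append((kind, i, j))
    return elements


for element in decompose("(((.((...)))))"):
    print(element)
```

An energy model would then look up one contribution per element (stacking terms for stems, length-dependent terms for loops) and add them up.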
How well do they perform? • Current RNA folding programs get about 60-70% of base pairs correct, on average: useful, but not yet good. • The problem is the scoring system: thermodynamic model is accurate within 5-10%, and many alternative structures are within 10%. • Possible solution: combination of thermodynamic score with comparative sequence information
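One way to read “base pairs correct” is the fraction of base pairs in a reference structure that the prediction reproduces. The sketch below computes that fraction for two dot-bracket strings; this reading of the metric, the function names, and the example structures are assumptions for illustration.

```python
def base_pairs(structure):
    """Return the set of (i, j) pairs in a balanced dot-bracket string."""
    stack, pairs = [], set()
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def fraction_correct(predicted, reference):
    """Fraction of reference base pairs that also occur in the prediction."""
    ref = base_pairs(reference)
    if not ref:
        return 0.0
    return len(ref & base_pairs(predicted)) / len(ref)


# Hypothetical example: the prediction misses the innermost reference pair.
print(fraction_correct("((.....))", "(((...)))"))
```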
Outline • RNA Secondary Structure • Comparative Approach • Base-Pair Maximization • Free Energy Minimization • Local Structure Prediction
RNA Motif in HIV TAR motif: Transactivating Response Element
RNA Motifs Associated with Transcription Termination • Rho-independent terminators stop the transcription process via their hairpin structure
Algorithm in Rnall • Definition 1. A “match”: a canonical base pair • Definition 2. A “mismatch”: a non-canonical base pair • Definition 3. An “insertion”/“deletion”: an unpaired nucleotide
RNA LSS in HIV TAR (30) DIS (260) PolyA (82) SD (292) PSI (319)
Some RNA Resources • Comparative RNA web site: http://www.rna.icmb.utexas.edu/ • RNA world: http://www.imb-jena.de/RNA.html • RNA page by Michael Zuker: http://www.bioinfo.rpi.edu/~zukerm/rna/ • RNA structure database: http://www.rnabase.org/ ; http://ndbserver.rutgers.edu/ (nucleic acid database); http://prion.bchs.uh.edu/bp_type/ (non-canonical bases) • RNA structure classification: http://scor.berkeley.edu/ • RNA visualisation: http://ndbserver.rutgers.edu/services/download/index.html#rnaview ; http://rutchem.rutgers.edu/~xiangjun/3DNA/
Reading Assignments • Suggested reading: • Chapter 14 in “Current Topics in Computational Molecular Biology”, edited by Tao Jiang, Ying Xu, and Michael Zhang. MIT Press, 2002. • Optional reading: • http://www.bioinfo.rpi.edu/~zukerm/seqanal/mfold-3.0-manual.pdf