Improving Free Energy Functions for RNA Folding

RNA Secondary Structure Prediction

Why RNA is Important • Machinery of protein construction • Catalytic role in cells • May be possible to destroy specific sequences of RNA (to interrupt protein production) • RNase P (Cech/Altman c.1981)

RNA Structural Levels • Primary: AAUCG...CUUCUUCCA • Secondary (figure source): http://anx12.bio.uci.edu/~hudel/bs99a/lecture21/lecture2_2.html • Tertiary (figure source): http://www.leeds.ac.uk/bmb/courses/teachers/trnballs.html

Abstracting the problem • [Figure: example bases A G C G C A U C] • Zuker (1981) Nucleic Acids Research 9(1) 133-149

Why it is hard • Large search space (hard to enumerate) Hofacker et al. (1994) Monat. Chem. 125 167-188
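To give a feel for the size of the search space, here is a minimal sketch that counts pseudoknot-free secondary structures on n positions. It assumes that any two positions may pair as long as at least `min_loop` unpaired positions lie between them, so it ignores the actual bases and gives an upper bound. The function name and the `min_loop` parameter are illustrative and not taken from the slides.

```python
from functools import lru_cache


def count_structures(n: int, min_loop: int = 3) -> int:
    """Count pseudoknot-free structures on n positions (bases ignored)."""

    @lru_cache(maxsize=None)
    def count(length: int) -> int:
        # Too short to close even one hairpin: only the open chain exists.
        if length <= min_loop + 1:
            return 1
        # Case 1: the last position is unpaired.
        total = count(length - 1)
        # Case 2: the last position pairs with position i (1-based),
        # splitting the chain into a left part and an enclosed part.
        for i in range(1, length - min_loop):
            total += count(i - 1) * count(length - i - 1)
        return total

    return count(n)


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, count_structures(n))
```

The count grows exponentially with n, which is why explicit enumeration is impractical for sequences of realistic length.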

Why it is hard • Secondary structure does not exist. • Unlike proteins • Putative structures (prone to revision) • Quality of Energy Functions • Discussed later

Current Algorithms • Single-Strand • Minimum Free Energy (Zuker et al. 1981) • Partition Functions (McCaskill 1990) • Comparative Sequence Analysis • Max. Weighted Matching (Nussinov et al. 1978) • Stochastic CFG (Sakakibara et al. 1994) • Phylogenetic Trees (Gulko et al. 1995) • Statistical Significance (Noller & Woese, early 1980s) • See proposal for references
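As an illustration of the dynamic-programming style shared by several of these algorithms, the sketch below implements a simplified, unweighted form of base-pair maximization in the spirit of Nussinov et al. It is not the weighted variant or the MFE algorithm itself. The allowed pairs (Watson-Crick plus G-U) and the `min_loop` default are assumptions made for the example.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq: str, min_loop: int = 3):
    """Return (max number of pairs, one optimal dot-bracket structure)."""
    n = len(seq)
    best = [[0] * n for _ in range(n)]
    # Fill the table by increasing subsequence span.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # position j left unpaired
            for k in range(i, j - min_loop):  # position j paired with k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score

    # Trace back one optimal structure.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = best[i][k - 1] if k > i else 0
                if best[i][j] == left + 1 + best[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return (best[0][n - 1] if n else 0), "".join(structure)


if __name__ == "__main__":
    print(nussinov("GGGAAAUCCC"))
```

The table has O(n^2) entries and each takes O(n) work, so the sketch runs in O(n^3) time. This is feasible only because the score decomposes over independent subsequences.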

MFE / Tinoco Hypothesis The free energy of a secondary structure equals the sum of the free energies of the loops and stacked pairs Tinoco et al. (1971) Nature 230 362-367.
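Written as a formula, the hypothesis says the energy is additive over local structural elements. The notation below (loops(S), stacks(S), e_loop, e_stack) is introduced here for illustration and does not appear on the slides:

```latex
E(S) \;=\; \sum_{\ell \,\in\, \mathrm{loops}(S)} e_{\mathrm{loop}}(\ell)
      \;+\; \sum_{p \,\in\, \mathrm{stacks}(S)} e_{\mathrm{stack}}(p)
```

Because every term depends only on a local element of S, the minimum of E(S) over all structures can be found by dynamic programming.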

Proposed System • [Diagram with numbered steps 1-3: the sequence AAUCG...CUUCUUCCA is input to MFE (E) and GA (E'), which output secondary structures]

Step I - Calculate MFE Structure • Given a sequence, apply the MFE algorithm • Generates secondary structure S

Step II - Structural Similarity • Given a database of experimentally verified RNA structures • Let Q be the database structure most similar to S • Based on RNase P Database (Brown 1999)
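The slides do not say how similarity is measured. As one possible stand-in, the sketch below scores two dot-bracket structures by the Jaccard overlap of their base-pair sets and picks the best match from a list. It assumes both structures use the same position numbering (for example, after alignment), and all function names are hypothetical.

```python
from __future__ import annotations


def base_pairs(dot_bracket: str) -> set[tuple[int, int]]:
    """Convert a dot-bracket string into a set of (i, j) base pairs."""
    stack, pairs = [], set()
    for pos, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % pos)
            pairs.add((stack.pop(), pos))
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return pairs


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of base-pair sets (an assumed similarity measure)."""
    pa, pb = base_pairs(a), base_pairs(b)
    if not pa and not pb:
        return 1.0
    return len(pa & pb) / len(pa | pb)


def most_similar(s: str, database: list[str]) -> str:
    """Return the database structure Q with the highest similarity to S."""
    return max(database, key=lambda q: similarity(s, q))


if __name__ == "__main__":
    db = ["(.....)", ".......", "((...))"]
    print(most_similar("((...))", db))
```

Any other structure distance, such as a tree edit distance, could be substituted for `similarity` without changing the selection step.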

Step III - Construct E’ • Create a new energy function:

Discussion on E’ • E’ has global information • Global information precludes the use of dynamic programming (MFE, Partition) • Leaves (stochastic) combinatorial optimization • Gradient Descent (no ∂E/∂S) • Genetic Algorithms / Simulated Annealing

Step IV - Genetic Algorithm • RNA Structural Prediction by GA • Input: a sequence • Output: the structure that minimizes E’ for that sequence (equivalently, maximizes fitness) • Steady State Genetic Algorithm • Pseudoknots forbidden (conflicts) • Fitness = -E’ • Effect of Similarity(Q, S) diminishes with each generation (pseudo-SA).

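Genetic Algorithm - Repn. • Stem-loop representation (Chen et al. 2000) • Example stem from the figure: (23 52 3 3.2), with the labels length, start, end, weight (most plausibly start = 23, end = 52, length = 3, weight = 3.2) • Window method (EMBOSS Palindrome)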
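A stem pool in this representation can be sketched as follows. This is a plain scan for maximal runs of stacked pairs, not the EMBOSS Palindrome program. The `Stem` class, the 0-based positions, the parameter defaults, and the placeholder weight (set equal to the stem length) are all assumptions made for illustration.

```python
from dataclasses import dataclass

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


@dataclass(frozen=True)
class Stem:
    start: int     # 0-based position of the outermost 5' base
    end: int       # 0-based position of the outermost 3' base
    length: int    # number of stacked base pairs
    weight: float  # placeholder score; the slides do not define it


def build_stem_pool(seq: str, min_length: int = 3, min_loop: int = 3):
    """Collect maximal stems with at least min_length stacked pairs."""
    n = len(seq)
    pool = []
    for i in range(n):
        for j in range(i + min_loop + 1, n):
            # Skip (i, j) if it could be extended outwards: it is then
            # already covered by the stem starting at (i - 1, j + 1).
            if i > 0 and j < n - 1 and (seq[i - 1], seq[j + 1]) in PAIRS:
                continue
            length = 0
            # Extend inwards while bases pair and the loop stays long enough.
            while (j - length) - (i + length) > min_loop and (
                seq[i + length],
                seq[j - length],
            ) in PAIRS:
                length += 1
            if length >= min_length:
                pool.append(Stem(i, j, length, float(length)))
    return pool


if __name__ == "__main__":
    for stem in build_stem_pool("GGGAAAACCC"):
        print(stem)
```

Individuals in the genetic algorithm can then be represented as lists of mutually compatible stems drawn from this pool.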

Genetic Algorithm - Operators • Mutation • Add stem from stem pool to a child • Crossover (parents P1, P2; children C1, C2) • Fit stems of P2 into C1 or C2 randomly. • Placement must be conflict free.
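The two operators might be sketched as below. The slide gives only an outline, so the details are one possible reading: stems are (start, end, length) tuples, two stems conflict if they share a position or cross (a pseudoknot), and crossover offers the stems of both parents to the two children in random order. All names are hypothetical.

```python
import random
from typing import List, Tuple

Stem = Tuple[int, int, int]  # (start, end, length), 0-based positions


def compatible(a: Stem, b: Stem) -> bool:
    """True if two stems share no base and do not cross."""
    a_start, a_end, a_len = a
    b_start, b_end, b_len = b
    if a_end < b_start or b_end < a_start:  # side by side
        return True
    if a_start + a_len <= b_start and b_end <= a_end - a_len:  # b inside a
        return True
    if b_start + b_len <= a_start and a_end <= b_end - b_len:  # a inside b
        return True
    return False


def add_if_free(child: List[Stem], stem: Stem) -> bool:
    """Append stem to child only if the placement is conflict free."""
    if all(compatible(stem, other) for other in child):
        child.append(stem)
        return True
    return False


def mutate(child: List[Stem], stem_pool: List[Stem], rng: random.Random) -> List[Stem]:
    """Try to add one random stem from the pool to a copy of the child."""
    mutated = list(child)
    add_if_free(mutated, rng.choice(stem_pool))
    return mutated


def crossover(p1: List[Stem], p2: List[Stem], rng: random.Random):
    """Distribute parental stems over two children, conflict free."""
    c1: List[Stem] = []
    c2: List[Stem] = []
    stems = list(p1) + list(p2)
    rng.shuffle(stems)
    for stem in stems:
        first, second = (c1, c2) if rng.random() < 0.5 else (c2, c1)
        if not add_if_free(first, stem):
            add_if_free(second, stem)  # dropped if it fits neither child
    return c1, c2


if __name__ == "__main__":
    rng = random.Random(0)
    p1 = [(0, 20, 3), (5, 15, 3)]
    p2 = [(10, 30, 3), (40, 60, 4)]
    print(crossover(p1, p2, rng))
    print(mutate(p1, p2, rng))
```

Because every placement passes through `add_if_free`, children never contain overlapping or crossing stems, which matches the rule that pseudoknots are forbidden.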

Preliminary Results • E’ does not lead to a drastic speed-up • Genetic algorithm is very slow if the initial population is generated randomly from the stem pool • Remedy: use suboptimal foldings for the initial population

Preliminary Results Explained • The real structure is usually very similar to the Tinoco-optimal structure. • View E’ as a way of choosing among the suboptimal structures.

Future Work • More testing on the entire RNase P Database (> 400 structures) • Tune E’ • Accuracy comparison to MFE and Partition Function Algorithms • Parallelize genetic algorithm

END
