730 likes | 935 Views
Computational approaches for RNA energy parameter estimation. Mirela Andronescu Department of Computer Science. Supervisors Anne Condon Holger Hoos. Committee David Mathews Kevin Murphy. Tertiary structure. Secondary structure. a set of base pairs: A-U,C-G, G-U. RNA structure.
E N D
Computational approaches for RNA energy parameter estimation Mirela Andronescu Department of Computer Science • Supervisors • Anne Condon • Holger Hoos • Committee • David Mathews • Kevin Murphy
Tertiary structure Secondary structure • a set of base pairs: A-U,C-G, G-U RNA structure RNA sequence 5’ ACGUAGCGA…3’
Energy model Prediction algorithm 60% accuracy output input 5’ ACUGCUAGC UGCGUUGC… 3’ New energy model Prediction algorithm 71% accuracy Overview predict
Translation Catalysis Splicing Gene silencing Roles of RNA structures and thermodynamics
Determining RNA secondary structure • Experimentally • X-ray crystallography, NMR, chemical & structure probing -- expensive • Computationally • Comparative sequence analysis, given many homologous sequences • Thermodynamic approaches, using an energy model
Thermodynamic RNA secondary structure prediction • Assumption • RNAs fold into their minimum free energy structures • Common approach • dynamic programming algorithm O(n3) [Zuker & Stiegler, 1981; Lyngso et al, 1999] • Based on an energy model • the Turner model [Mathews et al, 1999, 2004]
The Turner model[Mathews et al, 1999, 2004] [3’ UTR protein-binding RNA from Rfam] • Energy model: • Features (stacked pair AG/CU) • Parametersθ (-2.1 kcal/mol) • Energy function ΔG(θ) = cT θ
The Turner model[Mathews et al, 1999, 2004] • Obtained by • Linear regression from experimental data • Biological knowledge • Limitations • No thorough computational method was used • Many parameters have been extrapolated • Large amounts of data were not exploited • Accuracy on our data set: 60% • Our goal: Improve the RNA energy model
Contributions • Databases (Ch 3) • Parameter estimation algorithms (Ch 4) • Parameter esti-mation for models without pseudoknots (Ch 5) • Model selection and feature relationships (Ch 6) • Parameter esti-mation for models with pseudoknots (Ch 7)
RNA STRAND • Structural data from 8 public databases • RNA sequences with • known secondary structures • unknown free energies • Determined by • comparative sequence analysis • X-ray crystallography • NMR • 4600 RNAs, avg. length 530 nucleotides [Andronescu et al, BMC Bioinformatics 2008]
RNA THERMO • Thermodynamic data from 58 papers • RNA sequences with • known secondary structures • measured free energies • Determined by • optical melting experiments [Turner lab & collaborators] • 1300 RNAs, avg. length 17 nucleotides
Outline • Databases: RNA STRAND and RNA THERMO (Ch 3) • Parameter estimation algorithms (Ch 4) • Parameter esti-mation for models without pseudoknots (Ch 5) • Model selection and feature relationships (Ch 6) • Parameter esti-mation for models with pseudoknots (Ch 7)
Parameter estimation problem • Given • A structural set S (seq + str) • A thermodynamic set T (seq + str + free energy) • A model with • a fixed set of features (e.g. Turner99 with 363 features) • a free energy function (e.g. linear in the parameters θ) • Estimate (learn) parameters θ that maximize avg. accuracy when measured on reference set Sn = #correctly predicted bp / # true bp PPV = #correctly predicted bp / # predicted bp F-measure = harmonic mean (Sn, PPV) = 2*Sn*PPV/(Sn+PPV)
Constraint Generation (CG) • Idea: for all (x,yknown) in S,yknown should have lower free energy than all other structures y Predict low energy structures with the current θ Solve a constrained quadratic opt. problem min (Σδ2 + Σ (free energy error for T)2 + regularizer) subject to ΔG(x,yknown,θ) < ΔG(x,y,θ) + δ, for all (x,yknown) in S Repeat until convergence [Andronescu et al, Bioinformatics 2007]
P(structural data) = Boltzmann Likelihood (BL) • The probability of a structure y is a Boltzmann function: • Solve a non-linear optimization problem with unique optimum max (P(structural data) P(thermo data) regularizer) • Similar approach (CONTRAfold) proposed by [Do et al, 2006] • no thermo data was used • free energies are not predicted correctly
Outline • Databases: RNA STRAND and RNA THERMO (Ch 3) • Parameter estimation algorithms: CG and BL (Ch 4) • Parameter esti-mation for models without pseudoknots (Ch 5) • Model selection and feature relationships (Ch 6) • Parameter esti-mation for models with pseudoknots (Ch 7)
BL*, trained on STrain+T, F=0.69, RMSE=1.34 CG 07 [Andr. 2007], trained on SProc+T F=0.65, RMSE=1.03 CG*, trained on STrain+T, F=0.68,RMSE=0.98 CONTRAfold 1.1, trained on 151Rfam F=0.61, RMSE=9.17 CONTRAfold 2.0, trained on SProc F=0.68, RMSE=6.02 Parameter estimation for models without pseudoknots Set from RNA STRAND, # str: 2500 Avg len: 330 Std len: 500 BL* gives the highest accuracy on average, an increase of 9% from the Turner99 parameters. Turner99 F=0.60, RMSE=1.24 • Sensitivity = #correctly predicted bp / # true bp • PPV = #correctly predicted bp / # predicted bp
Runtime analysis BL is at least 10 times slower than CG, but slightly more accurate. Reference machine: a 3GHz Intel Xeon CPU (1MB cache and 2GB RAM)
Outline • Databases: RNA STRAND and RNA THERMO (Ch 3) • Parameter estimation algorithms: CG and BL (Ch 4) • Parameter esti-mation for models without pseudoknots (Ch 5) • 9% better F-measure • Model selection and feature relationships (Ch 6) • Parameter esti-mation for models with pseudoknots (Ch 7)
Model selection • Explore parsimonious and lavish models • For lavish models, use feature relationships
Feature relationships • Link features not covered by thermo set T with those that are covered BL: max (P(structural data) P(thermo data) regularizer)
BL-FR*, trained on STrain+T, #features=7726, F=0.71, RMSE=1.51 BL*, trained on STrain+T, F=0.69, RMSE=1.34 CG 07 [Andr. 2007], trained on SProc+T F=0.65, RMSE=1.03 CG*, trained on STrain+T, F=0.68,RMSE=0.98 CONTRAfold 1.1, trained on 151Rfam F=0.61, RMSE=9.17 CONTRAfold 2.0, trained on SProc F=0.68, RMSE=6.02 Model selection and feature relationships Modeling feature relationships improves prediction by an additional 1.3% (10.6% from the Turner99 parameters). Turner99 F=0.60, RMSE=1.24
Outline • Databases: RNA STRAND and RNA THERMO (Ch 3) • Parameter estimation algorithms: CG and BL (Ch 4) • Parameter esti-mation for models without pseudoknots (Ch 5) • 9% better F-measure • Model selection and feature relationships (Ch 6) • 11% better F-measure • Parameter esti-mation for models with pseudoknots (Ch 7)
Parameter estimation for models with pseudoknots • Models (Turner features + additional features for pseudoknots) • Dirks & Pierce [Dirks and Pierce, 2003] • Cao & Chen [Cao and Chen, 2006] • Prediction algorithm • HotKnots [Ren et al, 2005] • Parameter estimation algorithm • CG modified for this problem • BL was much harder to implement
Parameter estimation for models with pseudoknots • Improvements on average: • Dirks & Pierce parameters by 9% • Cao &Chen parameters by 6% * Short means at most 100 nucleotides
Conclusions • Databases: RNA STRAND and RNA THERMO (Ch 3) • Parameter estimation algorithms: CG and BL (Ch 4) • Parameter esti-mation for models without pseudoknots (Ch 5) • 9% better F-measure • Model selection and feature relationships (Ch 6) • 11% better F-measure • Parameter esti-mation for models with pseudoknots (Ch 7) • 9% and 6% better F
Applications • CG 07 [Andr 2007] is part of RNA Vienna WebSuites • Many other software packages benefit from this work • MFE and suboptimal secondary structure prediction • Simulation of folding pathways, sampling and clustering • Prediction of hybridization efficiency, target availability of siRNA
Directions for future work • No single parameter set (or algorithm) results in better accuracy for all structures • Combine parameter sets and algorithms • Explore other models • Models for multi-loops are not accurate • Accuracy of data is questionable • Obtain / generate / pre-process data more accurately
Acknowledgments • Supervisors: • Anne Condon, Holger Hoos • Committee: • Dave Mathews, Kevin Murphy • Collaborators: • Vera Bereg, Cristina Pop, Alex Brown • Members of the BETA lab and CS department • UBC and IBM Research for funding
RNAs play diverse roles • Messenger RNA • Ribosomal RNA • Transfer RNA [contexo.info]
RNA structure plays role in splicing [Rogic et al, 2008] [Bruce R. Korf, Human Genetics and Genomics]
RNAs can act as catalysts (ribozymes) [James & Al-Shamkhani]
RNA hybridization thermodynamics [Lu and Mathews, 2008]
Design of optical melting experiments • 16% of multi-loops in RNA STRAND have 5 or more branches • 30% of internal loops have ≥7 unpaired bases • 13% of internal loops have asymmetry ≥ 3 • Pseudoknots (22 experiments, only 4 features out of the 11 DP are covered)