Computational approaches for RNA energy parameter estimation

Mirela Andronescu, Department of Computer Science. Supervisors: Anne Condon, Holger Hoos. Committee: David Mathews, Kevin Murphy.

Presentation Transcript


  1. Computational approaches for RNA energy parameter estimation • Mirela Andronescu, Department of Computer Science • Supervisors: Anne Condon, Holger Hoos • Committee: David Mathews, Kevin Murphy

  2. RNA structure • RNA sequence: 5’ ACGUAGCGA…3’ • Secondary structure: a set of base pairs (A-U, C-G, G-U) • Tertiary structure

  3. Overview • Input: RNA sequence 5’ ACUGCUAGC UGCGUUGC… 3’ • Output: predicted structure • Energy model + prediction algorithm: 60% accuracy • New energy model + prediction algorithm: 71% accuracy

  4. Roles of RNA structures and thermodynamics • Translation • Catalysis • Splicing • Gene silencing

  5. Determining RNA secondary structure • Experimentally • X-ray crystallography, NMR, chemical & structure probing -- expensive • Computationally • Comparative sequence analysis, given many homologous sequences • Thermodynamic approaches, using an energy model

  6. Thermodynamic RNA secondary structure prediction • Assumption • RNAs fold into their minimum free energy structures • Common approach • dynamic programming algorithm, O(n^3) [Zuker & Stiegler, 1981; Lyngso et al, 1999] • Based on an energy model • the Turner model [Mathews et al, 1999, 2004]
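The cubic-time dynamic programming idea on this slide can be illustrated with a much simpler stand-in. The sketch below is a Nussinov-style recursion that maximizes the number of base pairs; it is not the Zuker & Stiegler free-energy algorithm and uses no energy model. The function names and the minimum hairpin loop size of 3 are assumptions made for this example.

```python
# Pairs allowed in a secondary structure, as listed on slide 2.
ALLOWED = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def can_pair(a, b):
    """True if nucleotides a and b may form a base pair."""
    return (a, b) in ALLOWED


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (O(n^3) time, O(n^2) space).

    min_loop is the assumed minimum number of unpaired bases in a hairpin.
    """
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = max pairs in seq[i..j]; entries with j < i stay 0.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j is unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with k
                if can_pair(seq[k], seq[j]):
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


pairs = max_base_pairs("GGGAAAUCC")
```

The three nested loops over `span`, `i` and `k` are where the O(n^3) bound comes from; an energy-based predictor keeps the same table-filling shape but scores loops with model parameters instead of counting pairs.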

  7. The Turner model [Mathews et al, 1999, 2004] • Example structure: 3’ UTR protein-binding RNA from Rfam • Energy model: • Features (e.g. stacked pair AG/CU) • Parameters θ (e.g. -2.1 kcal/mol) • Energy function ΔG(θ) = c^T θ
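A minimal sketch of the linear energy function ΔG(θ) = c^T θ, assuming c is the vector of how often each feature occurs in a structure. Only the stacked-pair value of -2.1 kcal/mol comes from the slide; the other feature names and values are made up for illustration and are not Turner parameters.

```python
# Hypothetical three-feature model (names and last two values are illustrative).
FEATURES = ["stack_AG_CU", "hairpin_loop", "bulge_loop"]
THETA = [-2.1, 4.5, 3.3]  # kcal/mol; only -2.1 is taken from the slide


def free_energy(counts, theta):
    """Linear energy function: dot product of feature counts and parameters."""
    if len(counts) != len(theta):
        raise ValueError("counts and theta must have the same length")
    return sum(c * t for c, t in zip(counts, theta))


# A structure that uses the stacked pair twice and contains one hairpin loop.
counts = [2, 1, 0]
dg = free_energy(counts, THETA)  # about 0.3 kcal/mol
```

Because the function is linear in θ, changing one parameter shifts the energy of every structure by that parameter's count, which is what makes regression-style estimation possible.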

  8. The Turner model[Mathews et al, 1999, 2004] • Obtained by • Linear regression from experimental data • Biological knowledge • Limitations • No thorough computational method was used • Many parameters have been extrapolated • Large amounts of data were not exploited • Accuracy on our data set: 60% • Our goal: Improve the RNA energy model

  9. Contributions • Databases (Ch 3) • Parameter estimation algorithms (Ch 4) • Parameter estimation for models without pseudoknots (Ch 5) • Model selection and feature relationships (Ch 6) • Parameter estimation for models with pseudoknots (Ch 7)

  10. RNA STRAND • Structural data from 8 public databases • RNA sequences with • known secondary structures • unknown free energies • Determined by • comparative sequence analysis • X-ray crystallography • NMR • 4600 RNAs, avg. length 530 nucleotides [Andronescu et al, BMC Bioinformatics 2008]

  11. RNA THERMO • Thermodynamic data from 58 papers • RNA sequences with • known secondary structures • measured free energies • Determined by • optical melting experiments [Turner lab & collaborators] • 1300 RNAs, avg. length 17 nucleotides

  12. Outline • Databases: RNA STRAND and RNA THERMO (Ch 3) • Parameter estimation algorithms (Ch 4) • Parameter estimation for models without pseudoknots (Ch 5) • Model selection and feature relationships (Ch 6) • Parameter estimation for models with pseudoknots (Ch 7)

  13. Parameter estimation problem • Given • A structural set S (seq + str) • A thermodynamic set T (seq + str + free energy) • A model with • a fixed set of features (e.g. Turner99 with 363 features) • a free energy function (e.g. linear in the parameters θ) • Estimate (learn) parameters θ that maximize avg. accuracy when measured on a reference set • Sensitivity (Sn) = # correctly predicted bp / # true bp • PPV = # correctly predicted bp / # predicted bp • F-measure = harmonic mean(Sn, PPV) = 2*Sn*PPV/(Sn+PPV)
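The three accuracy measures defined on this slide can be computed directly from sets of base pairs. This is a minimal sketch; representing a base pair as an `(i, j)` index tuple and returning 0.0 for empty inputs are assumptions of the example, not conventions stated in the slides.

```python
def accuracy(predicted, reference):
    """Return (sensitivity, PPV, F-measure) for two collections of base pairs."""
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)  # correctly predicted base pairs
    sn = correct / len(reference) if reference else 0.0
    ppv = correct / len(predicted) if predicted else 0.0
    f = 2 * sn * ppv / (sn + ppv) if sn + ppv > 0 else 0.0
    return sn, ppv, f


reference = {(0, 8), (1, 7), (2, 6)}
predicted = {(0, 8), (1, 7), (3, 6), (2, 5)}
sn, ppv, f = accuracy(predicted, reference)
# Two of three true pairs are found (Sn = 2/3); two of four predictions are right (PPV = 1/2).
```

The harmonic mean penalizes a predictor that inflates one measure at the expense of the other, for example by predicting very few pairs to raise PPV.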

  14. Constraint Generation (CG) • Idea: for all (x, y_known) in S, y_known should have lower free energy than all other structures y • Predict low-energy structures with the current θ • Solve a constrained quadratic optimization problem: min (Σ δ^2 + Σ (free energy error for T)^2 + regularizer), subject to ΔG(x, y_known, θ) < ΔG(x, y, θ) + δ for all (x, y_known) in S • Repeat until convergence [Andronescu et al, Bioinformatics 2007]
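The loop on this slide (predict, add constraints, re-solve, repeat) can be sketched on a toy problem. This is not the published CG implementation: candidate structures are enumerated by hand instead of being produced by a folding algorithm, the constrained quadratic program is replaced by plain gradient descent on a penalized form (squared slack), the regularizer is assumed to be a squared norm, and the small `margin` is a hypothetical addition. All names and data values are illustrative.

```python
def dot(counts, theta):
    return sum(c * t for c, t in zip(counts, theta))


def predict(candidates, theta):
    """Stand-in for a folding algorithm: pick the lowest-energy candidate."""
    return min(candidates, key=lambda counts: dot(counts, theta))


def solve_penalized(theta, constraints, thermo, margin, reg, steps, lr):
    """Gradient descent on sum(slack^2) + sum(thermo error^2) + reg * |theta|^2."""
    theta = list(theta)
    for _ in range(steps):
        grad = [2.0 * reg * t for t in theta]
        for known, other in constraints:
            # Positive slack means the known structure is not yet low enough.
            slack = dot(known, theta) - dot(other, theta) + margin
            if slack > 0:
                for k in range(len(theta)):
                    grad[k] += 2.0 * slack * (known[k] - other[k])
        for counts, measured in thermo:
            err = dot(counts, theta) - measured
            for k in range(len(theta)):
                grad[k] += 2.0 * err * counts[k]
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


def constraint_generation(structural, thermo, theta0, rounds=20,
                          margin=0.1, reg=0.01, steps=2000, lr=0.01):
    theta, constraints = list(theta0), []
    for _ in range(rounds):
        new = []
        for known, candidates in structural:
            best = predict(candidates, theta)
            if best != known and (known, best) not in constraints:
                new.append((known, best))
        if not new:  # no new violated constraint: treat as converged
            break
        constraints.extend(new)
        theta = solve_penalized(theta, constraints, thermo, margin, reg, steps, lr)
    return theta


# Toy data: feature-count vectors (stacks, loops); the known structure is listed first.
structural = [((3, 1), [(3, 1), (1, 2), (0, 0)])]
thermo = [((2, 1), -1.0)]  # (feature counts, measured free energy)
theta0 = [1.0, 1.0]
theta = constraint_generation(structural, thermo, theta0)
```

Generating constraints only for structures the current parameters actually prefer keeps the problem small, since the full set of alternative structures grows exponentially with sequence length.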

  15. Boltzmann Likelihood (BL) • The probability of a structure y is a Boltzmann function: P(y | x, θ) = exp(-ΔG(x, y, θ)/RT) / Σ_y' exp(-ΔG(x, y', θ)/RT) • P(structural data) = product of P(y_known | x, θ) over all (x, y_known) in S • Solve a non-linear optimization problem with unique optimum: max (P(structural data) × P(thermo data) × regularizer) • Similar approach (CONTRAfold) proposed by [Do et al, 2006] • no thermo data was used • free energies are not predicted correctly
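A minimal sketch of the Boltzmann function for a sequence whose candidate structures can be listed explicitly: each structure's probability is exp(-ΔG/RT) divided by the sum of that weight over all structures. Real implementations compute the normalizing sum with dynamic programming over all structures rather than by enumeration; the explicit list, the function name and the temperature of 37 °C are assumptions of this example.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal / (mol * K)
TEMPERATURE = 310.15      # K, i.e. 37 degrees Celsius (assumed)
RT = GAS_CONSTANT * TEMPERATURE


def boltzmann_probabilities(energies, rt=RT):
    """Map free energies (kcal/mol) of all candidate structures to probabilities."""
    lowest = min(energies)
    # Shifting by the lowest energy leaves the ratios unchanged and avoids overflow.
    weights = [math.exp(-(e - lowest) / rt) for e in energies]
    partition = sum(weights)
    return [w / partition for w in weights]


# Three hypothetical structures of one sequence; the first is the most stable.
probs = boltzmann_probabilities([-3.0, -1.0, 0.0])
log_likelihood = math.log(probs[0])  # contribution if the first one is the known structure
```

Summing such log-probabilities over the structural set gives the log of P(structural data), the first factor of the objective on this slide.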

  16. Outline • Databases: RNA STRAND and RNA THERMO (Ch 3) • Parameter estimation algorithms: CG and BL (Ch 4) • Parameter estimation for models without pseudoknots (Ch 5) • Model selection and feature relationships (Ch 6) • Parameter estimation for models with pseudoknots (Ch 7)

  17. Parameter estimation for models without pseudoknots • Set from RNA STRAND: # str: 2500, avg len: 330, std len: 500 • Turner99: F=0.60, RMSE=1.24 • CONTRAfold 1.1, trained on 151Rfam: F=0.61, RMSE=9.17 • CONTRAfold 2.0, trained on SProc: F=0.68, RMSE=6.02 • CG 07 [Andr. 2007], trained on SProc+T: F=0.65, RMSE=1.03 • CG*, trained on STrain+T: F=0.68, RMSE=0.98 • BL*, trained on STrain+T: F=0.69, RMSE=1.34 • BL* gives the highest accuracy on average, an increase of 9% from the Turner99 parameters. • Sensitivity = # correctly predicted bp / # true bp • PPV = # correctly predicted bp / # predicted bp

  18. Runtime analysis • BL is at least 10 times slower than CG, but slightly more accurate. • Reference machine: a 3GHz Intel Xeon CPU (1MB cache and 2GB RAM)

  19. Outline • Databases: RNA STRAND and RNA THERMO (Ch 3) • Parameter estimation algorithms: CG and BL (Ch 4) • Parameter estimation for models without pseudoknots (Ch 5) • 9% better F-measure • Model selection and feature relationships (Ch 6) • Parameter estimation for models with pseudoknots (Ch 7)

  20. Model selection • Explore parsimonious and lavish models • For lavish models, use feature relationships

  21. Feature relationships • Link features not covered by thermo set T with those that are covered • BL: max (P(structural data) × P(thermo data) × regularizer)

  22. Model selection and feature relationships • Turner99: F=0.60, RMSE=1.24 • CONTRAfold 1.1, trained on 151Rfam: F=0.61, RMSE=9.17 • CONTRAfold 2.0, trained on SProc: F=0.68, RMSE=6.02 • CG 07 [Andr. 2007], trained on SProc+T: F=0.65, RMSE=1.03 • CG*, trained on STrain+T: F=0.68, RMSE=0.98 • BL*, trained on STrain+T: F=0.69, RMSE=1.34 • BL-FR*, trained on STrain+T, #features=7726: F=0.71, RMSE=1.51 • Modeling feature relationships improves prediction by an additional 1.3% (10.6% from the Turner99 parameters).

  23. Outline • Databases: RNA STRAND and RNA THERMO (Ch 3) • Parameter estimation algorithms: CG and BL (Ch 4) • Parameter estimation for models without pseudoknots (Ch 5) • 9% better F-measure • Model selection and feature relationships (Ch 6) • 11% better F-measure • Parameter estimation for models with pseudoknots (Ch 7)

  24. Parameter estimation for models with pseudoknots • Models (Turner features + additional features for pseudoknots) • Dirks & Pierce [Dirks and Pierce, 2003] • Cao & Chen [Cao and Chen, 2006] • Prediction algorithm • HotKnots [Ren et al, 2005] • Parameter estimation algorithm • CG modified for this problem • BL was much harder to implement

  25. Parameter estimation for models with pseudoknots • Improvements on average: • Dirks & Pierce parameters by 9% • Cao & Chen parameters by 6% • Note: "short" means at most 100 nucleotides

  26. Conclusions • Databases: RNA STRAND and RNA THERMO (Ch 3) • Parameter estimation algorithms: CG and BL (Ch 4) • Parameter estimation for models without pseudoknots (Ch 5) • 9% better F-measure • Model selection and feature relationships (Ch 6) • 11% better F-measure • Parameter estimation for models with pseudoknots (Ch 7) • 9% and 6% better F-measure

  27. Applications • CG 07 [Andr 2007] is part of RNA Vienna WebSuites • Many other software packages benefit from this work • MFE and suboptimal secondary structure prediction • Simulation of folding pathways, sampling and clustering • Prediction of hybridization efficiency, target availability of siRNA

  28. Directions for future work • No single parameter set (or algorithm) results in better accuracy for all structures • Combine parameter sets and algorithms • Explore other models • Models for multi-loops are not accurate • Accuracy of data is questionable • Obtain / generate / pre-process data more accurately

  29. Acknowledgments • Supervisors: • Anne Condon, Holger Hoos • Committee: • Dave Mathews, Kevin Murphy • Collaborators: • Vera Bereg, Cristina Pop, Alex Brown • Members of the BETA lab and CS department • UBC and IBM Research for funding

  30. Additional slides

  31. RNAs play diverse roles • Messenger RNA • Ribosomal RNA • Transfer RNA [contexo.info]

  32. RNA structure plays role in splicing [Rogic et al, 2008] [Bruce R. Korf, Human Genetics and Genomics]

  33. RNAs can act as catalysts (ribozymes) [James & Al-Shamkhani]

  34. RNA hybridization thermodynamics [Lu and Mathews, 2008]

  35. RNA STRAND

  36. Design of optical melting experiments • 16% of multi-loops in RNA STRAND have 5 or more branches • 30% of internal loops have ≥7 unpaired bases • 13% of internal loops have asymmetry ≥ 3 • Pseudoknots (22 experiments, only 4 features out of the 11 DP are covered)

  37. Analysis of RNA THERMO

  38. Analysis of RNA THERMO

  39. Schematic representation of data

  40. Other BL results (M363)

  41. Accuracy on classes

  42. Correlations between parameters

  43. Accuracy vs length, no pseudoknots

  44. Accuracy vs length, no pseudoknots

  45. Correlation accuracies, all

  46. Correlation accuracies, all

  47. Correlation accuracies, 0-200

  48. Correlation accuracies, 200-700

  49. Correlation accuracies, 700-2000

  50. Correlation accuracies, 2000-4000
