
RNA secondary structure prediction and runtime optimization

RNA secondary structure prediction and runtime optimization. Greg Goldgof October 5, 2006 CS374 Presentation Stanford University. Presentation Overview. CONTRAfold: probabilistic RNA folding. Background on RNA secondary structure prediction.


Presentation Transcript


  1. RNA secondary structure prediction and runtime optimization Greg Goldgof October 5, 2006 CS374 Presentation Stanford University

  2. Presentation Overview • CONTRAfold: probabilistic RNA folding • Background on RNA secondary structure prediction • Other RNA folding methods: Physics-based methods and SCFGs • How is RNA folding done from an algorithmic perspective? • CandidateFold: RNA folding in O(n²) • Genome-wide accessible motif detection

  3. What are RNA and mRNA? • RNA is a polymer of nucleotides A, U, C, and G transcribed from DNA • Traditional role as messenger molecule (mRNA) [Figure: DNA sequence GATTACA alongside RNA sequence GAUUACA]

  4. What is RNA secondary structure/folding? [Figure labels: hairpin loop, helix (stem), bulge loop, internal loop, multi-branch loop]

  5. Pseudoknots • Not dealt with by either paper. • Pseudoknots will not be treated in this talk.

  6. non-coding RNA (RNA genes) • RNA enzymes: catalytic RNA • Ribosomal RNA (rRNA) • Transfer RNA (tRNA) • RNAi: RNA mediated gene regulation • Micro RNA (miRNA) • Short-interfering RNA (siRNA) • Alternative splicing: small-nuclear RNA (snRNA) • Others: snoRNA, eRNA, srpRNA, tmRNA, gRNA Structure essential to function for many ncRNAs

  7. Presentation Overview • CONTRAfold: probabilistic RNA folding • Background on RNA secondary structure prediction • Other RNA folding methods: Physics-based methods and SCFGs • How is RNA folding done from an algorithmic perspective? • CandidateFold: RNA folding in O(n²) • Genome-wide accessible motif detection

  8. CONTRAfold • Problem: Given an RNA sequence, predict the most likely secondary structure • Example input: AUCCCCGUAUCGAUC AAAAUCCAUGGGUACCCUAGUGAAAGUGUA UAUACGUGCUCUGAU UCUUUACUGAGGAGU CAGUGAACGAACUGA

  9. How does CONTRAfold work? • CONTRAfold looks at features that indicate a good structure, for example: C-G base pairings; A-U base pairings; helices of length 5; hairpin loops of size 9; bulge loops of size 2; CG/GC base-pair stacking interactions • These examples are called thermodynamic parameters because they represent free energy values

  10. How does CONTRAfold choose a structure? • Every feature $f_i$ is associated with a weight $w_i$. • The probability of a structure $y$, given a sequence $x$, is determined by the following relationship: $P(y \mid x) \propto \exp\left(\sum_i w_i \, f_i(x, y)\right)$, where $w_i$ is the weight of feature $i$ and $f_i(x, y)$ is the number of occurrences of feature $i$ in structure $y$ generated from sequence $x$.
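The relationship on this slide can be illustrated with a minimal Python sketch. It normalizes over a small, explicitly enumerated set of candidate structures, which is an assumption made for illustration only; the feature names, weights, and counts are hypothetical, and a real folder would sum over all structures with dynamic programming rather than by enumeration.

```python
import math

def structure_probabilities(feature_counts, weights):
    """Return P(y | x) for each candidate structure y under a log-linear model.

    feature_counts: dict mapping structure -> {feature name: count}
    weights:        dict mapping feature name -> weight
    """
    # Score of a structure = sum over features of weight * occurrence count.
    scores = {
        y: sum(weights.get(f, 0.0) * c for f, c in counts.items())
        for y, counts in feature_counts.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    unnormalized = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(unnormalized.values())
    return {y: u / z for y, u in unnormalized.items()}

# Hypothetical weights and feature counts for three candidate folds.
weights = {"CG_pair": 1.0, "AU_pair": 0.5, "hairpin_size_3": -0.5}
candidates = {
    "((...))": {"CG_pair": 1, "AU_pair": 1, "hairpin_size_3": 1},
    ".(...).": {"CG_pair": 1, "hairpin_size_3": 1},
    ".......": {},
}
probs = structure_probabilities(candidates, weights)
best = max(probs, key=probs.get)
```

Raising the weight of a feature raises the probability of every structure that contains it, which is what makes the weights trainable from known structures.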

  11. How does CONTRAfold choose a structure? Cont’d • Considers all structures and finds the optimal structure via dynamic programming in O(n³) • Added bonus: a probability associated with each base [Figure legend: high-confidence bases darker, low-confidence bases lighter]

  12. Sensitivity = (# correct base pairings) / (# true base pairings) • Specificity = (# correct base pairings) / (# predicted base pairings) • Parameter γ allows trade-off between sensitivity and specificity [Figure: predicted structures of the example sequence AUCCCCGUAUCGAUC... for γ = 1, γ = 8, and γ = 1024]
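As a small worked example of the two ratios defined on this slide, the sketch below computes them from dot-bracket strings. The helper names and the example structures are illustrative assumptions, not taken from the talk.

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for j, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(j)
        elif ch == ")":
            pairs.add((stack.pop(), j))
    return pairs

def sensitivity_specificity(predicted, reference):
    """Sensitivity and specificity as defined on the slide."""
    pred, true = base_pairs(predicted), base_pairs(reference)
    correct = len(pred & true)
    sensitivity = correct / len(true) if true else 0.0
    specificity = correct / len(pred) if pred else 0.0
    return sensitivity, specificity

# The prediction misses the outermost pair of the reference structure.
sens, spec = sensitivity_specificity(".(((..))).", "((((..))))")
```

Here three of the four true pairs are recovered and every predicted pair is correct, so sensitivity is 0.75 and specificity is 1.0. Predicting fewer pairs tends to raise specificity at the cost of sensitivity, which is the trade-off γ controls.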

  13. CONTRAfold learns how to predict good structures • CONTRAfold learns the relative value, or weight, of each of its features • A training set is a collection of known correct solutions that a program learns from • CONTRAfold trains on a set of published examples of known RNA structures taken from a database called Rfam (RNA families) • CONTRAfold determines the weight for each feature that maximizes its performance on the training set

  14. CONTRAfold Performance

  15. Presentation Overview • CONTRAfold: probabilistic RNA folding • Background on RNA secondary structure prediction • Other RNA folding methods: Physics-based methods and SCFGs • How is RNA folding done from an algorithmic perspective? • CandidateFold: RNA folding in O(n²) • Genome-wide accessible motif detection

  16. Other Methods • Physics-based models • Stochastic context-free grammars

  17. Physics-based models • Features experimentally determined in the lab, rather than learned • All features reflect thermodynamic interactions • Until CONTRAfold, the best-performing method • Disadvantages compared to CONTRAfold: thermodynamic weights are difficult to calculate; no incorporation of non-thermodynamic features; cannot be tailored to specific families of RNAs, since the weights are always the same; cannot trade off between sensitivity and specificity; no probability associated with each base pairing

  18. Stochastic context-free grammars • Based on grammar rules with associated probabilities: S → aSu | cSg | aS | uS | … | Su | SS | ε, with probabilities P = .21, .15, .11, .08, .03, .22, .02 as listed on the slide • We select the set of transformations that has the highest probability of generating the input sequence. This set gives us our structure. • Let’s generate a structure for the sequence acuguaucuag: S → aS → acSg → acuSag → acuSuag → acugScuag → acuguScuag → acuguaScuag → acuguauScuag → acuguaucuag, giving the structure .(((...).)) • Ending the derivation early with S → ε would instead give: a . ; acg .() ; acuag .(()) ; acuuag .((.)) ; acugcuag .((().)) ; acugucuag .(((.).)) ; acuguacuag .(((..).))
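The derivation on this slide can be replayed mechanically. The sketch below is an illustration only, not code from any SCFG package: the `derive` helper is hypothetical, it handles only rules with a single S on the right-hand side (no S → SS bifurcation), and it ignores the rule probabilities. It shows how the sequence and its dot-bracket structure grow together: a rule with a terminal on both sides of S emits a base pair, a rule with a terminal on one side emits an unpaired base.

```python
def derive(rules):
    """Apply right-hand sides of S -> ... in order; return (sequence, dot-bracket).

    Each rule is written like "aS", "cSg", "Su", or "" for epsilon.
    Bifurcation (S -> SS) is not handled in this sketch.
    """
    left_seq, right_seq = "", ""
    left_db, right_db = "", ""
    for rhs in rules:
        if rhs == "":          # S -> epsilon ends the derivation
            break
        left, _, right = rhs.partition("S")
        paired = bool(left) and bool(right)
        # Terminals to the left of S extend the prefix; those to the right
        # are prepended to the suffix, so the string grows from the outside in.
        left_seq += left
        right_seq = right + right_seq
        left_db += ("(" if paired else ".") * len(left)
        right_db = (")" if paired else ".") * len(right) + right_db
    return left_seq + right_seq, left_db + right_db

# The rule sequence used in the slide's example derivation.
rules = ["aS", "cSg", "uSa", "Su", "gSc", "uS", "aS", "uS", ""]
sequence, structure = derive(rules)
```

In a full SCFG parser the rule sequence is not given; it is the highest-probability parse found by a CYK-style dynamic program over the input sequence.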

  19. Stochastic context-free grammars cont’d • Like CONTRAfold, transformation probabilities can be automatically trained • Therefore, they can also be optimized to specific datasets • Provide an associated probability with a given structure • Disadvantages compared to CONTRAfold: grammar rules of SCFGs are less expressive than the features of CONTRAfold or physics-based methods; poor accuracy: always dominated by physics-based models

  20. Advantages of CONTRAfold • High accuracy • Automated training of parameters • Can be tuned to specific data • Provides associated probabilities for each base-pairing • Ability to control sensitivity/specificity trade-off • Can incorporate both physics-based and non-thermodynamic parameters

  21. Presentation Overview • Background on RNA secondary structure prediction • CONTRAfold: probabilistic RNA folding • Other RNA folding methods: Physics-based methods and SCFGs • How is RNA folding done from an algorithmic perspective? • CandidateFold: RNA folding in O(n²) • Genome-wide accessible motif detection

  22. How is RNA folding done? • Simple Nussinov Folding Algorithm • Only scores interactions between paired bases • Useful for demonstrating general structure of more complex folding algorithms • We want the highest scoring fold: the score for the optimal structure from base i to base j • δ(i, j) = score for a pairing between i and j • Case 1: base i is unpaired, consider pairing between i+1 and j • Case 2: base j is unpaired, consider pairing between i and j-1

  23. How is RNA folding done? • Simple Nussinov Folding Algorithm • Only scores interactions between paired bases • Useful for demonstrating general structure of more complex folding algorithms • Case 3: pair i and j, then consider pairing between i+1 and j-1

  24. How is RNA folding done? • Simple Nussinov Folding Algorithm • Only scores interactions between paired bases • Useful for demonstrating general structure of more complex folding algorithms • Case 4: i and j begin a bifurcation. Consider every possible bifurcation point k and sum the scores from each folded structure.
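The four cases described on slides 22 to 24 combine into a single dynamic program. The following Python sketch is one way to write it, under stated assumptions: the table name `S`, the pairing rule (Watson-Crick plus G-U wobble), and the choice δ(i, j) = 1 for any allowed pair are illustrative, and no minimum hairpin loop size is enforced.

```python
def can_pair(a, b):
    """Assumed pairing rule: Watson-Crick pairs plus the G-U wobble pair."""
    return {a, b} in ({"A", "U"}, {"C", "G"}, {"G", "U"})

def nussinov(seq):
    """Maximum number of base pairs in seq, with delta(i, j) = 1 per allowed pair."""
    n = len(seq)
    if n == 0:
        return 0
    # S[i][j] = best score for the subsequence from base i to base j (inclusive).
    # Entries with j <= i stay 0: empty and single-base spans cannot pair.
    S = [[0] * n for _ in range(n)]
    for span in range(1, n):                 # fill shorter spans first
        for i in range(n - span):
            j = i + span
            best = max(S[i + 1][j],          # case 1: base i unpaired
                       S[i][j - 1])          # case 2: base j unpaired
            if can_pair(seq[i], seq[j]):     # case 3: pair i with j
                best = max(best, S[i + 1][j - 1] + 1)
            for k in range(i + 1, j):        # case 4: bifurcation at k
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][n - 1]

score = nussinov("GGGAAACCC")
```

The two outer loops visit every (i, j) and the inner loop visits every k, so the three nested loops make the cubic cost visible directly in the code.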

  25. How is RNA folding done? • What is the runtime of the Nussinov algorithm? • For a given sequence of length n = j - i we must consider all possible values of i: O(n) • For each i we must consider all possible values of j: O(n) • For each i, j pair we must consider all possible values of k: O(n) • Total: O(n) * O(n) * O(n) → O(n³)

  26. A more sophisticated algorithm • We want to take into account more advanced features than just base-pairings.

  27. What is V(i, j)? • eh = energy of a hairpin closed at i and j [Figure: hairpin loop closed by the pair i, j]

  28. What is V(i, j)? • es = energy of stacked pair i, j and i+1, j-1 [Figure: stacked pairs A-U and G-C at i, j]

  29. What is V(i, j)? • ebi = energy of a bulge or interior loop that begins at i, j and is closed at i’, j’ [Figure: loop between outer pair i, j and inner pair i’, j’]

  30. What is V(i, j)? Same old bifurcation equation, but i is paired to j

  31. What is its runtime? • The bulge/interior loop equation is theoretically O(n); however, it is standard to bound the size of RNA interior loops by a constant (30), making it O(1) • Still only O(n³) overall, because we are only recursing on i, j, and k
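Assembled from the four cases on slides 27 to 30, a Zuker-style recurrence consistent with those descriptions would read as follows. This is a sketch under assumptions: V(i, j) is taken to be the minimum energy of a fold of i..j in which i pairs with j, W(i, j) is an assumed name for the minimum energy of any fold of i..j, and the exact index ranges may differ from the speaker's slides.

```latex
\begin{aligned}
V(i,j) &= \min
\begin{cases}
  eh(i,j) \\
  es(i,j) + V(i+1,\,j-1) \\
  \min\limits_{i<i'<j'<j} \bigl[\, ebi(i,j,i',j') + V(i',j') \,\bigr] \\
  \min\limits_{i<k<j-1} \bigl[\, W(i+1,k) + W(k+1,\,j-1) \,\bigr]
\end{cases} \\[6pt]
W(i,j) &= \min
\begin{cases}
  W(i+1,\,j) \\
  W(i,\,j-1) \\
  V(i,j) \\
  \min\limits_{i<k<j} \bigl[\, W(i,k) + W(k+1,j) \,\bigr]
\end{cases}
\end{aligned}
```

The last line of each case block is the bifurcation term; it is the only term whose cost grows with the span once interior loops are bounded, which is why the overall runtime stays cubic.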

  32. Presentation Overview • Background on RNA secondary structure prediction • CONTRAfold: probabilistic RNA folding • Other RNA folding methods: Physics-based methods and SCFGs • How is RNA folding done from an algorithmic perspective? • CandidateFold: RNA folding in O(n²) • Genome-wide accessible motif detection

  33. CandidateFold • What does it do? • Same folding as the complex model in O(n²ψ(n)), where ψ(n) is shown to be a constant • How does it do it? • Imposes some constraints on W and V • Rather than trying all k, it keeps a list of candidate positions, reducing this step to O(1) time [Figure labels: “From W”, “From V”]
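The candidate-list idea can be demonstrated on the simple base-pair-maximization model rather than the full energy model. The sketch below is not the authors' CandidateFold code: the function names, the pairing rule, and the score of 1 per pair are assumptions. A position k is kept as a candidate for end position j only when pairing k with the last base is strictly better than every other way of folding that span; any other split point is dominated and can be skipped.

```python
def can_pair(a, b):
    """Assumed pairing rule: Watson-Crick pairs plus the G-U wobble pair."""
    return {a, b} in ({"A", "U"}, {"C", "G"}, {"G", "U"})

def max_pairs_dense(seq):
    """Baseline: try every partner k for the last base of every span."""
    n = len(seq)
    L = [[0] * (n + 1) for _ in range(n + 1)]   # L[i][j]: best for seq[i:j]
    for j in range(1, n + 1):
        for i in range(j - 1, -1, -1):
            best = L[i][j - 1]                  # last base unpaired
            for k in range(i, j - 1):           # last base pairs with base k
                if can_pair(seq[k], seq[j - 1]):
                    best = max(best, L[i][k] + L[k + 1][j - 1] + 1)
            L[i][j] = best
    return L[0][n]

def max_pairs_sparse(seq):
    """Same result, but only candidate partners are tried for each span."""
    n = len(seq)
    L = [[0] * (n + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        candidates = []                         # (k, score of seq[k:j] closed by k)
        for i in range(j - 1, -1, -1):
            best = L[i][j - 1]
            for k, closed in candidates:        # every stored k is greater than i
                best = max(best, L[i][k] + closed)
            if i < j - 1 and can_pair(seq[i], seq[j - 1]):
                closed = L[i + 1][j - 1] + 1
                if closed > best:               # strictly better than any other fold
                    candidates.append((i, closed))
                    best = closed
            L[i][j] = best
    return L[0][n]

dense, sparse = max_pairs_dense("GGGAAACCC"), max_pairs_sparse("GGGAAACCC")
```

The inner loop of the sparse version runs over the candidate list instead of all k, so its cost per (i, j) is the list length; the slide's ψ(n) plays that role for the full energy model.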

  34. CandidateFold • What is the advantage of CandidateFold? • Much faster RNA folding • What is an application of high-speed RNA folding? • Accessible motif finding

  35. Presentation Overview • Background on RNA secondary structure prediction • CONTRAfold: probabilistic RNA folding • Other RNA folding methods: Physics-based methods and SCFGs • How is RNA folding done from an algorithmic perspective? • CandidateFold: RNA folding in O(n²) • Genome-wide accessible motif detection

  36. What is an RNA regulatory motif? • Motif: a conserved sequence element • RNA regulatory motif: a motif used to regulate translation • A regulator binds to a regulatory motif; the regulator can be a regulatory protein or a microRNA [Figure: RNA GAUUACA... containing the regulatory motif AUUAC, bound by a microRNA with sequence UAAUG]

  37. What is an accessible motif? • If a sequence is part of an intramolecular hybridization, it is unlikely to bind to regulators • We define a motif as “accessible” if none of its nucleotides is hybridized as part of the folding
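The definition on this slide translates directly into a check on a dot-bracket structure. The sketch below is illustrative: the function names and the example sequence (a stem that hides one copy of the motif AUUAC and leaves another exposed) are assumptions, not data from the talk.

```python
def is_accessible(structure, start, length):
    """True if none of the bases in structure[start:start+length] is paired."""
    window = structure[start:start + length]
    return len(window) == length and set(window) <= {"."}

def accessible_motif_sites(seq, structure, motif):
    """Start positions where motif occurs in seq and is fully unpaired."""
    return [
        i for i in range(len(seq) - len(motif) + 1)
        if seq.startswith(motif, i) and is_accessible(structure, i, len(motif))
    ]

# Hypothetical example: the first AUUAC is hybridized to GUAAU in a stem,
# the second AUUAC lies in the unpaired tail.
seq = "GUAAU" + "CCCC" + "AUUAC" + "AAA" + "AUUAC"
structure = "(((((" + "...." + ")))))" + "..." + "....."
sites = accessible_motif_sites(seq, structure, "AUUAC")
```

Only the second occurrence (start position 17) is reported, because every base of the first occurrence is part of the stem.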

  38. Accessible motifs cont’d • Therefore, only accessible sequences should be scanned for regulatory motifs

  39. Accessible motifs cont’d • Therefore, only accessible sequences should be scanned for regulatory motifs.

  40. How do Wexler et al. detect regulatory motifs? • Problem: Given a set of mRNAs G, a parameter k denoting motif window size, and a pre-defined energy threshold δ, find the regulatory motifs • Stage 1: Process sequence set G to extract all “accessible windows”: (a) run a sliding window of size k across each mRNA sequence; (b) find the minimal energy fold for the sequence, assuming none of the bases in the window are paired; (c) if the energy of this folding minus the energy of a normal folding of the mRNA is < δ, accept the window • Stage 2: Search for regulatory motifs among the “accessible windows” • Motif finding will be discussed in later lectures
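Stage 1 can be sketched in a few lines of Python. This is an illustration under loud assumptions, not the procedure's actual implementation: `toy_energy` is a stand-in that scores -1 per base pair in a maximum-pairing fold instead of using a thermodynamic model, the pairing rule is assumed, and the threshold is expressed in those toy units.

```python
def can_pair(a, b):
    """Assumed pairing rule: Watson-Crick pairs plus the G-U wobble pair."""
    return {a, b} in ({"A", "U"}, {"C", "G"}, {"G", "U"})

def toy_energy(seq, blocked=frozenset()):
    """Toy stand-in for folding energy: -1 per pair in a maximum-pairing fold.

    Indices in `blocked` are forced to stay unpaired.
    """
    n = len(seq)
    L = [[0] * (n + 1) for _ in range(n + 1)]   # L[i][j]: max pairs in seq[i:j]
    for j in range(1, n + 1):
        for i in range(j - 1, -1, -1):
            best = L[i][j - 1]                  # last base unpaired
            if (j - 1) not in blocked:
                for k in range(i, j - 1):       # last base pairs with base k
                    if k not in blocked and can_pair(seq[k], seq[j - 1]):
                        best = max(best, L[i][k] + L[k + 1][j - 1] + 1)
            L[i][j] = best
    return -L[0][n]

def accessible_windows(seq, k, delta):
    """Start positions of size-k windows whose unpairing costs less than delta."""
    free = toy_energy(seq)                      # normal (unconstrained) fold
    hits = []
    for start in range(len(seq) - k + 1):
        window = frozenset(range(start, start + k))
        constrained = toy_energy(seq, window)   # window bases kept unpaired
        if constrained - free < delta:
            hits.append(start)
    return hits

# Hypothetical mRNA fragment: a G/C stem around an A loop, then an A tail.
mrna = "GGGG" + "AAAA" + "CCCC" + "AAAA"
windows = accessible_windows(mrna, k=4, delta=1)
```

Only the two all-A windows (starts 4 and 12) are accepted, since blocking them removes no pairs. Each window requires a fresh fold of the whole sequence, so the cost of one fold is multiplied by the number of windows and by the number of mRNAs in G, which is where a faster folding algorithm pays off.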

  41. Results: Degradation Related Motifs

  42. Results: Tissue Specific microRNAs • Silique: a long, slender, many-seeded, cylindrical fruit of the Mustard Family

  43. The End

  44. Works Cited • CB Do, DA Woods, S Batzoglou. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22(14): e90-e98, 2006. • Y Wexler, C Zilberstein, M Ziv-Ukelson. A Study of Accessible Motifs and RNA Folding Complexity. RECOMB 2006, LNBI 3909: 473-487, 2006.
