RNA secondary structure prediction and runtime optimization. Greg Goldgof, October 5, 2006. CS374 Presentation, Stanford University.
Presentation Overview • Background on RNA secondary structure prediction • CONTRAfold: probabilistic RNA folding • Other RNA folding methods: Physics-based methods and SCFGs • How is RNA folding done from an algorithmic perspective? • CandidateFold: RNA folding in O(n^2) • Genome-wide accessible motif detection
What are RNA and mRNA? • RNA is a polymer of the nucleotides A, U, C, and G, transcribed from DNA • Traditional role as a messenger molecule (mRNA) [Figure: DNA sequence GATTACA and the corresponding RNA sequence GAUUACA]
What is RNA secondary structure/folding? [Figure: a folded RNA with its structural elements labeled: helix (stem), hairpin loop, bulge loop, internal loop, multi-branch loop]
Pseudoknots • Not dealt with by either paper. • Pseudoknots will not be treated in this talk.
Non-coding RNA (RNA genes) • RNA enzymes: catalytic RNA • Ribosomal RNA (rRNA) • Transfer RNA (tRNA) • RNAi: RNA-mediated gene regulation • Micro RNA (miRNA) • Short-interfering RNA (siRNA) • Alternative splicing: small-nuclear RNA (snRNA) • Others: snoRNA, eRNA, srpRNA, tmRNA, gRNA • Structure is essential to function for many ncRNAs
CONTRAfold • Problem: Given an RNA sequence, predict the most likely secondary structure • Example input: AUCCCCGUAUCGAUC AAAAUCCAUGGGUACCCUAGUGAAAGUGUA UAUACGUGCUCUGAU UCUUUACUGAGGAGU CAGUGAACGAACUGA
How does CONTRAfold work? • CONTRAfold looks at features that indicate a good structure. For example: • C-G base pairings • A-U base pairings • Helices of length 5 • Hairpin loops of size 9 • Bulge loops of size 2 • CG/GC base-pair stacking interactions • These example features are called thermodynamic parameters because they correspond to free energy values
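How does CONTRAfold choose a structure? • Every feature f_i is associated with a weight w_i. • The probability of a structure y, given a sequence x, is determined by the following relationship:
$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_i w_i \, f_i(x, y) \Big), \qquad Z(x) = \sum_{y'} \exp\Big( \sum_i w_i \, f_i(x, y') \Big)$$
where f_i(x, y) is the number of occurrences of feature i in structure y generated from sequence x, w_i is the weight of feature i, and Z(x) normalizes over all structures of x.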
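The relationship above can be illustrated with a minimal sketch. It assumes a small, explicitly enumerated set of candidate structures, each described by a dictionary of feature counts; the function name, feature names, and weights are hypothetical, and the real program sums over all structures with dynamic programming rather than enumerating them.

```python
import math

def structure_probabilities(feature_counts, weights):
    """Log-linear probabilities over an explicit list of candidate structures.

    feature_counts: one dict per structure, mapping feature name -> count f_i(x, y)
    weights: dict mapping feature name -> weight w_i (illustrative values only)
    """
    # Score of each structure: sum_i w_i * f_i(x, y)
    scores = [
        sum(weights[name] * count for name, count in counts.items())
        for counts in feature_counts
    ]
    # Subtract the maximum before exponentiating for numerical stability
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)  # normalizer over the enumerated candidates only
    return [e / z for e in exps]

# Hypothetical candidates for one sequence
candidates = [
    {"cg_pair": 3, "au_pair": 1},
    {"cg_pair": 1, "au_pair": 1},
    {},  # fully unpaired structure
]
toy_weights = {"cg_pair": 1.0, "au_pair": 0.5}
probs = structure_probabilities(candidates, toy_weights)
```

The structure with the largest weighted feature sum receives the largest probability, and the probabilities sum to one over the candidates considered.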
How does CONTRAfold choose a structure? Cont’d • Considers all structures and finds the optimal structure via dynamic programming in O(n^3) • Added bonus: a probability associated with each base [Figure: predicted structure shaded by confidence; high-confidence bases darker, low-confidence bases lighter]
Sensitivity = (# correct base pairings) / (# true base pairings) • Specificity = (# correct base pairings) / (# predicted base pairings) • Parameter γ allows a trade-off between sensitivity and specificity [Figure: predicted structures for the example sequence at γ = 1, γ = 8, and γ = 1024]
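As a small illustration of the two measures defined above, the sketch below computes them from sets of base pairs. It assumes each pair is represented as a tuple (i, j) of positions; the function name and the example pairs are made up for illustration.

```python
def sensitivity_specificity(predicted_pairs, true_pairs):
    """Return (sensitivity, specificity) as defined on the slide.

    Both arguments are iterables of (i, j) position tuples.
    """
    predicted = set(predicted_pairs)
    true = set(true_pairs)
    correct = len(predicted & true)  # pairs that are both predicted and true
    sensitivity = correct / len(true) if true else 0.0
    specificity = correct / len(predicted) if predicted else 0.0
    return sensitivity, specificity

# Hypothetical example: 2 of 3 predicted pairs are among the 4 true pairs
sens, spec = sensitivity_specificity(
    predicted_pairs=[(0, 9), (1, 8), (3, 6)],
    true_pairs=[(0, 9), (1, 8), (2, 7), (4, 5)],
)
```

Predicting fewer, more confident pairs tends to raise specificity at the cost of sensitivity, which is the trade-off the parameter γ controls.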
CONTRAfold learns how to predict good structures • CONTRAfold learns the relative value, or weight, of each of its features • A training set is a collection of known correct solutions that a program learns from • CONTRAfold trains on a set of published examples of known RNA structures taken from a database called Rfam (RNA families) • CONTRAfold determines the weight for each feature that maximizes its performance on the training set
Other Methods • Physics-based models • Stochastic context-free grammars
Physics-based models • Features experimentally determined in the lab, rather than learned • All features reflect thermodynamic interactions • Until CONTRAfold, the best performing method • Disadvantages compared to CONTRAfold: • Thermodynamic weights are difficult to calculate • No incorporation of non-thermodynamic features • Cannot be tailored to specific families of RNAs, since the weights are always the same • Cannot trade off between sensitivity and specificity • No probability associated with each base pairing
Stochastic context-free grammars • Based on grammar rules with associated probabilities: S → aSu | cSg | aS | uS | … | Su | SS | ε, with probabilities P (as listed on the slide): .21 .15 .11 .08 .03 .22 .02 • We select the set of transformations that has the highest probability of generating the input sequence. This set gives us our structure. • Let’s generate a structure for the sequence acuguaucuag: S → aS → acSg → acuSag → acuSuag → acugScuag → acuguScuag → acuguaScuag → acuguauScuag → acuguaucuag, which gives the structure .(((...).))
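The derivation on this slide can be replayed with a short sketch. It assumes a linear derivation (no SS bifurcation rule) and represents each production by its right-hand side as a string, with the empty string standing for ε; the function name and this encoding are illustrative choices, not part of either paper.

```python
def derive(productions):
    """Replay a linear SCFG derivation and return (sequence, dot-bracket structure).

    Each production is the right-hand side of a rule applied to the single
    nonterminal S: "aS" emits an unpaired base on the left, "Su" an unpaired
    base on the right, "cSg" a base pair, and "" (epsilon) ends the derivation.
    """
    left_seq, right_seq = [], []
    left_db, right_db = [], []
    for rhs in productions:
        if rhs == "":
            break  # S -> epsilon
        pos = rhs.index("S")
        left, right = rhs[:pos], rhs[pos + 1:]
        if left and right:   # paired bases around S
            left_db.append("(")
            right_db.append(")")
        elif left:           # unpaired base on the left
            left_db.append(".")
        else:                # unpaired base on the right
            right_db.append(".")
        left_seq.append(left)
        right_seq.append(right)
    # Right-side emissions were collected outside-in, so reverse them
    sequence = "".join(left_seq) + "".join(reversed(right_seq))
    structure = "".join(left_db) + "".join(reversed(right_db))
    return sequence, structure

# The rules applied in the derivation shown on the slide
sequence, structure = derive(["aS", "cSg", "uSa", "Su", "gSc", "uS", "aS", "uS", ""])
```

Multiplying the probabilities of the productions used would give the probability of this particular derivation; that step is omitted here because the slide does not state which probability belongs to every rule.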
Stochastic context-free grammars cont’d • Like CONTRAfold, transformation probabilities can be automatically trained • Therefore, they can also be optimized to specific datasets • Provide an associated probability with a given structure • Disadvantages compared to CONTRAfold: • Grammar rules of SCFGs are less expressive than the features of CONTRAfold or physics-based methods • Poor accuracy: always dominated by physics-based models
Advantages of CONTRAfold • High accuracy • Automated training of parameters • Can be tuned to specific data • Provides associated probabilities for each base-pairing • Ability to control sensitivity/specificity trade-off • Can incorporate both physics-based and non-thermodynamic parameters
How is RNA folding done? • Simple Nussinov Folding Algorithm • Only scores interactions between paired bases • Useful for demonstrating the general structure of more complex folding algorithms • We want the highest scoring fold. Writing F(i, j) for the score of the optimal structure from base i to base j, and δ(i, j) for the score of a pairing between i and j:
$$F(i, j) = \max \begin{cases} F(i+1, j) & \text{base } i \text{ is unpaired; consider pairing between } i+1 \text{ and } j \\ F(i, j-1) & \text{base } j \text{ is unpaired; consider pairing between } i \text{ and } j-1 \\ F(i+1, j-1) + \delta(i, j) & \text{pair } i \text{ and } j \text{; now consider pairing between } i+1 \text{ and } j-1 \\ \max_{i < k < j} \big[ F(i, k) + F(k+1, j) \big] & i \text{ and } j \text{ begin a bifurcation; consider every bifurcation point } k \text{ and sum the scores of the two folded substructures} \end{cases}$$
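The four cases of the Nussinov recursion (i unpaired, j unpaired, i paired with j, bifurcation at k) can be sketched as follows. This is a minimal version under stated assumptions: the table is called F, the pairing score δ(i, j) is taken to be 1 for Watson-Crick pairs and pairing is otherwise disallowed, and no minimum hairpin loop length is enforced. None of these choices come from the slides.

```python
def nussinov(seq):
    """Return the maximum number of Watson-Crick pairs in a nested fold of seq."""
    n = len(seq)
    if n == 0:
        return 0
    # Assumed scoring: delta(i, j) = 1 for these pairs, pairing disallowed otherwise
    can_pair = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    # F[i][j] = score of the optimal structure from base i to base j
    F = [[0] * n for _ in range(n)]
    for span in range(1, n):            # fill by increasing subsequence length
        for i in range(n - span):
            j = i + span
            best = max(F[i + 1][j],     # base i is unpaired
                       F[i][j - 1])     # base j is unpaired
            if (seq[i], seq[j]) in can_pair:
                best = max(best, F[i + 1][j - 1] + 1)  # pair i and j
            for k in range(i + 1, j):   # bifurcation at every k
                best = max(best, F[i][k] + F[k + 1][j])
            F[i][j] = best
    return F[0][n - 1]

max_pairs = nussinov("GGGAAACCC")
```

The three nested loops over span, i, and k are what give the cubic running time discussed on the next slide.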
How is RNA folding done? • What is the runtime of the Nussinov algorithm? • For a given sequence of length n we must consider all possible values of i: O(n) • For each i we must consider all possible values of j: O(n) • For each i, j pair we must consider all possible values of k: O(n) • Total: O(n) * O(n) * O(n) → O(n^3)
A more sophisticated algorithm • We want to take into account more advanced features than just base-pairings.
What is V(i, j)? • eh = Energy of a hairpin closed at i and j [Figure: hairpin loop closed by the base pair i, j]
What is V(i, j)? • es = Energy of stacked pair i, j and i+1, j-1 [Figure: base pair i, j stacked on base pair i+1, j-1]
What is V(i, j)? • ebi = Energy of a bulge or interior loop that begins at i, j and is closed at i’, j’ [Figure: interior loop between the outer pair i, j and the inner pair i’, j’]
What is V(i, j)? Same old bifurcation equation, but i is paired to j
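The four cases for V(i, j) described on these slides can be collected into one recurrence. The block below is a reconstruction in the standard energy-minimization form, not a copy of the slide equations: it assumes W(i, j) is the minimum energy of any structure on bases i to j, V(i, j) is the minimum energy given that i is paired to j, and multi-branch loop penalties are omitted for brevity.

```latex
\begin{align*}
W(i, j) &= \min \begin{cases}
  W(i+1, j) \\
  W(i, j-1) \\
  V(i, j) \\
  \min_{i < k < j} \big[ W(i, k) + W(k+1, j) \big]
\end{cases} \\[6pt]
V(i, j) &= \min \begin{cases}
  eh(i, j) & \text{hairpin closed at } i, j \\
  es(i, j) + V(i+1, j-1) & \text{stacked pair} \\
  \min_{i < i' < j' < j} \big[ ebi(i, j, i', j') + V(i', j') \big] & \text{bulge or interior loop} \\
  \min_{i+1 < k < j-1} \big[ W(i+1, k) + W(k+1, j-1) \big] & \text{bifurcation with } i \text{ paired to } j
\end{cases}
\end{align*}
```

Note that this model minimizes energy, whereas the simple Nussinov recursion maximizes a pairing score; the structure of the case analysis is otherwise the same.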
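What is its runtime? • The bulge/interior loop term is not constant-time in theory, since every inner pair i’, j’ must be examined (up to O(n^2) choices per i, j); however, it is standard to bound RNA interior loops by a constant (30), making it O(1) • Still only O(n^3) overall, because we are only recursing on i, j, and k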
CandidateFold • What does it do? • Same folding as the complex model in O(n^2 ψ(n)), where ψ(n) is shown to be a constant • How does it do it? • Imposes some constraints on W and V • Rather than trying all k, it keeps a list of candidate positions, reducing this step to O(1) time [Figure labels: From W, From V]
CandidateFold • What is the advantage of CandidateFold? • Much faster RNA folding • What is an application of high-speed RNA folding? • Accessible motif finding
What is an RNA regulatory motif? • RNA regulatory motif: A motif used to regulate translation • Motif: A conserved sequence element • A regulator binds to a regulatory motif • Regulatory protein • Micro RNA [Figure: RNA ...GAUUACA... with the regulatory motif (AUUAC) bound by a complementary microRNA (UAAUG)]
What is an accessible motif? • If a sequence is part of an intramolecular hybridization, it is unlikely to bind to regulators • We define a motif as “accessible” if none of its nucleotides is hybridized as part of the folding
Accessible motifs cont’d • Therefore, only accessible sequences should be scanned for regulatory motifs
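The definition of an accessible motif can be checked directly on a predicted structure. The sketch below assumes the fold is given in dot-bracket notation, where "." marks an unpaired base; the function names are illustrative.

```python
def is_accessible(structure, start, k):
    """True if none of the k bases starting at `start` is paired in the fold."""
    window = structure[start:start + k]
    return len(window) == k and all(symbol == "." for symbol in window)

def accessible_positions(structure, k):
    """Start positions of all length-k windows that are fully unpaired."""
    return [
        start
        for start in range(len(structure) - k + 1)
        if is_accessible(structure, start, k)
    ]

# Dot-bracket structure for an 11-base sequence
fold = ".(((...).))"
windows_of_3 = accessible_positions(fold, 3)
```

Here only the hairpin loop (positions 4 to 6, counting from 0) is a fully unpaired window of size 3.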
How do Wexler et al. detect regulatory motifs? Problem: Given a set of mRNAs G, a parameter k denoting motif window size, and a pre-defined energy threshold δ, find the regulatory motifs • Stage 1: Process sequence set G to extract all “accessible windows” • Run sliding window of size k across each mRNA sequence • Find the minimal energy fold for the sequence, assuming none of the bases in the window are paired • If the energy of this folding minus the energy of a normal folding of the mRNA < δ, then accept the window • Stage 2: Search for regulatory motifs among the “accessible windows” • Motif finding will be discussed in later lectures
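Stage 1 above can be sketched with a toy energy model standing in for a real folding program. Everything here is an assumption for illustration: the energy is -1 per Watson-Crick pair (a Nussinov-style score, not the thermodynamic model used by Wexler et al.), and the function names are hypothetical.

```python
def min_energy(seq, blocked=frozenset()):
    """Toy minimum folding energy: -1 per Watson-Crick pair.

    Positions in `blocked` are forced to stay unpaired.
    """
    n = len(seq)
    if n == 0:
        return 0
    can_pair = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    E = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = min(E[i + 1][j], E[i][j - 1])
            if (seq[i], seq[j]) in can_pair and i not in blocked and j not in blocked:
                best = min(best, E[i + 1][j - 1] - 1)
            for k in range(i + 1, j):
                best = min(best, E[i][k] + E[k + 1][j])
            E[i][j] = best
    return E[0][n - 1]

def accessible_windows(seq, k, delta):
    """Slide a window of size k and keep windows that cost less than delta to open."""
    unconstrained = min_energy(seq)  # energy of a normal folding
    accepted = []
    for start in range(len(seq) - k + 1):
        window = frozenset(range(start, start + k))
        constrained = min_energy(seq, blocked=window)  # window left unpaired
        if constrained - unconstrained < delta:
            accepted.append(start)
    return accepted

hits = accessible_windows("GGGAAACCC", k=3, delta=1)
```

Each window requires its own constrained fold, so the cost of a single fold is multiplied by the number of windows and the number of mRNAs; this is why a faster folding algorithm matters for genome-wide scans.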
Results: Tissue-Specific microRNAs • Silique: A long, slender, many-seeded, cylindrical fruit of the Mustard Family
Works Cited • CB Do, DA Woods, S Batzoglou. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22(14): e90-e98, 2006. • Y Wexler, C Zilberstein, M Ziv-Ukelson. A Study of Accessible Motifs and RNA Folding Complexity. RECOMB 2006, LNBI 3909: 473-487, 2006.