600 likes | 827 Views
Wi'07. Bafna. The central dogma. The DNA contains the code to all of the essential machinery of the cell, mainly proteins.Information about proteins is in genes. The genic information is first copied to an intermediate molecule (transcription)The transcript goes outside the nucleus and is transla
E N D
1. Wi’07 Bafna Prediction of RNA structure Vineet Bafna
2. Wi’07 Bafna The central dogma The DNA contains the code to all of the essential machinery of the cell, mainly proteins.
Information about proteins is in genes.
The genic information is first copied to an intermediate molecule (transcription)
The transcript goes outside the nucleus and is translated into a protein.
3. Wi’07 Bafna
4. Wi’07 Bafna The Central Dogma However, not all genes are translated!
5. Wi’07 Bafna Example: tRNA
6. Wi’07 Bafna RNA Could it be that, there is a large trove of non-coding RNA, that is not discovered yet?
The answer seems to be YES
Today’s lecture: How can we computationally identify ncRNA?
7. Wi’07 Bafna Novel ncRNAs are abundant: Ex: miRNAs miRNAs were the second major story in 2001 (after the genome).
Subsequently, many other non-coding genes have been found
8. Wi’07 Bafna
9. Wi’07 Bafna Scientific American, 2006
10. Wi’07 Bafna Bacterial Riboswitches -Breaker Lab
11. Wi’07 Bafna ncRNA gene finding The RNA world hypothesis:
RNA are as important as protein coding genes.
Many undiscovered ncRNA exist
Computational methods for discovering ncRNA are not mature.
What are the clues to non-coding genes?
Structure: Given a sequence, what is the structure into which it can fold with minimum energy?
12. Wi’07 Bafna RNA structure: Basics Key: RNA is single-stranded. Think of a string over 4 letters, AC,G, and U.
The complementary bases form pairs.
A <-> U, C <-> G, G <-> U
Base-pairing defines a secondary structure. The base-pairing is usually non-crossing.
13. Wi’07 Bafna RNA structure: Basics Key: RNA is single-stranded. Think of a string over 4 letters, AC,G, and U.
The complementary bases form pairs.
Base-pairing defines a secondary structure. The base-pairing is usually non-crossing.
14. Wi’07 Bafna RNA structure: pseudoknots Sometimes, unpaired bases in loops form ‘crossing pairs’. These are pseudoknots.
15. Wi’07 Bafna De novo RNA structure prediction Any set of non-crossing base-pairs defines a secondary structure.
Abstract Question:
Given an RNA string find a structure that maximizes the number of non-crossing base-pairs
Incorporate the true energetics of folding
Incorporate Pseudo-knots
16. Wi’07 Bafna
17. Wi’07 Bafna De novo RNA prediction: a combinatorial formulation Input:
A string over A,C,G,U
A pairs with U, C pairs with G, G with U
Output:
A subset of possible base-pairings of maximum size (score) such that
No two base-pairs intersect
Each nucleotide pairs with at most one other nucleotide
How can we compute this set efficiently?
18. Wi’07 Bafna RNA structure Nussinov’s algorithm
Score B for every base-pair. No penalty for loops. No pesudo-knots.
Let W(i,j) be the score of the best structure of the subsequence from i to j.
19. Wi’07 Bafna Obtaining RNA structure
20. Wi’07 Bafna Obtaining RNA Structure
21. Wi’07 Bafna RNA structure: example
22. Wi’07 Bafna RNA Structure: Details
23. Wi’07 Bafna Base-pairing & Loops Base-pairs arise from complementary nucleotides
Single-stranded
Stack is when 2 base-pairs are contiguous
Loops arise when there are unpaired bases.
They are characterized by the number of base-pairs that close it.
Hairpin: closed by 1 base-pair
Bulge/Interior Loops (2 base-pairs)
Multiple Internal loops (k base-pairs)
24. Wi’07 Bafna Scoring Loops, multi-loops Zuker-Turner Energy Rules
http://www.bioinfo.rpi.edu/~zukerm/rna/energy/node2.html
Stacking Energies
Energy for Bulges and Interior Loops
Energy for Multi-loops
25. Wi’07 Bafna Improving the efficiency Computing structure on a 2000nt sequence is about 20s (2GHz, 2Gb)
(Wexler et al., 2006)
How much time would it take to search 10000 candidate regions?
What would be the impact of an O(n) speedup on this search?
26. Wi’07 Bafna A triangle inequality After the computations are done,
A triangle inequality is satisfied
27. Wi’07 Bafna Towards more efficient algorithms Consider the main loop again
We only need to branch, when i forms a base-pair. Therefore, w.l.o.g
28. Wi’07 Bafna A candidate list approach Note that if C is the maximum size of the candidate list, then the running time is O(Cn2)
How can we reduce the size of the candidate list?
Consider only those k s.t. bk is complementary to bi
29. Wi’07 Bafna Candidate list Consider i,j,k s.t.
W(i,j) = B(ri,rk)+W(i+1,k-1)+W(k+1,j) (1)
Claim: For all j’ >= j
j is not a candidate in computing W(i,j’)
30. Wi’07 Bafna Proof
31. Wi’07 Bafna The candidate list algorithm For all i = n down to 1
For all j = i+1 to n
32. Wi’07 Bafna How large is the candidate list? Here, the answer depends very much on the energy computations.
If we simply count the number of base-pairs, the candidate list should be large (probably)
For more realistic functions, it is very small (experiments with n upto 1000, showed C=6)
33. Wi’07 Bafna Empirical estimates of C
34. Wi’07 Bafna A thermodynamic argument It is conjectured, and empirically shown that RNA has the polymer-zeta property
Probability that a string of length m folds into an RNA structure is b/mc for constants b,c>0
Note that if c>0, the number of indices j that base-pair with i is O(1)
In practice c~1.5
35. Wi’07 Bafna End of Lecture
36. Wi’07 Bafna Incorporating pseudoknots in structure prediction Q: Determine optimal structure, when pseudoknots are allowed.
Pseudoknots are only loosely defined.
If any interleaving is allowed, then simply select a structure in which a max number of nucleotides can be paired.
Under some restrictive notions, the structure problem becomes NP-hard.
37. Wi’07 Bafna Akutsu’s Simple pseudoknots A region [i0,k0] forms a simple pseudoknot if there exist positions j’0, j0 s.t.
Each (i,j) ? M satisfies either
i0 ? i < j’0 ? j < j0 or j’0 ? i < j0 ? j ? k0
Within one of the two groups, there is no interleaving.
i<i’<j’0 or j’0 ?i<i’ implies j>j’.
38. Wi’07 Bafna Simple pseudoknots A region [i0,k0] forms a simple pseudoknot if it can be partioined into 3 regions s.t. every base-pair has the following properties
It either connects (A,B) , or (B,C)
Two base-pairs within (A,B) (or, (B,C)) do not interleave
39. Wi’07 Bafna Simple pseudoknotted structure Collection of simple pseudoknots (Mi) and other basepairs M’.
None of the simple pseudoknot regions overlap.
M’ is a secondary structure without pseudoknots for the sequence obtained by excising all pseudoknotted regions.
Can we identify a sub-structure for the pseudo-knots?
40. Wi’07 Bafna Akutsu’s algorithm (Main idea) Rotate the sequence so that it forms two loops.
Each allowed base-pair is a horizontal line in exactly one of the two loops. The horizontal lines in the two loops are non-crossing.
All base-pairs can be ordered.
41. Wi’07 Bafna Main idea The base-pairs have a total order that a d.p. can exploit.
(i,j) < (i’,j’) if one of the following holds
i<i’<j<j’
i<i’<j’<j<j0
j’0<i’<i<j<j’
42. Wi’07 Bafna Frontiers of D.P. We need to construct an increasing path of base-pairs.
Consider a ‘frontier’ triple (i,j,k), corresponding to the region (i0,i) and (j,k)
We have the following cases:
(i,j) form a base-pair
(j,k) form a base-pair
Neither (i,j) nor (j,k) form a base-pair
43. Wi’07 Bafna Ordering frontiers ‘frontier’ triple (i,j,k) >= (i’,j’,k’) if and only if
i’ <= i
j <= j’
k’ <= k
44. Wi’07 Bafna Ordering Frontiers (i,j,k) >= (i’,j’,k’) if and only if
45. Wi’07 Bafna Recurrences SL(i,j,k) is the optimum score of a frontier (i,j,k) assuming that
i<j’0<j<j0<k
(i,j) form a base-pair
SR(i,j,k) is the optimum score of a frontier (i,j,k) assuming that i<j’0<j<j0<k, and
(j,k) form a base-pair
SM(i,j,k) is the optimum score of a frontier (i,j,k) assuming that i<j’0<j<j0<k, and
Neither (i,j) nor (j,k) form a base-pair
46. Wi’07 Bafna Computing SL
47. Wi’07 Bafna Putting it all together
48. Wi’07 Bafna Computing optimal pseudo-knotted structures Let S(i,j) be the opt score of a simple pseudoknotted structure.
Either (i,j) is a simple pseudoknot
S(i,j) = Spseudo(i,j)
Or, not
S(i,j) = max{ v(ai,aj) + S(i+1,j-1),
max k { S(i,k-1)+S(k,j)}
}
49. Wi’07 Bafna Time Complexity We compute Spseudo(i,j) for all (i,j). Each computation is O(n3). Total time?
For each i0, perform the following computation:
Spseudo(i0,k0) = max i0?i<j ?k0{SL(i,j,k0),…}
Total time O(n4) Can you do better?
50. Wi’07 Bafna Recursive pseudoknots Each loop of the RNA structure is a recursive pseudoknotted RNA structure.
Optimal recursive pseudoknotted RNA structure problem can be solved in O(n5) time.
51. Wi’07 Bafna Open Questions There should be a direct generalization of simple pseudoknots to the following structure.
Rivas and Eddy do consider such a generalization, but a systematic treatment is missing.
Q: Given a pseudo-knotted structure, is it an Akutsu simple pseudoknotted structure?
Linear time algorithm was devised recently for this problem.
52. Wi’07 Bafna Structure as a proxy for ncRNA Any genomic region with an energetically favorable fold is a candidate ncRNA?
Rivas and Eddy show otherwise.
They use
LOD score L = Pr(s|RNA)/Pr(s|Null)
Z-score Z = G-?/?
53. Wi’07 Bafna LOD scores for putative ncRNA A region of C. elegans is containing two tRNAs is chosen. The position of tRNA is indicated by stars.
The true positions and the LOD-score correspond.
Stronger results in M. janaschii (AT%=70)
Weak results in E.coli (GC%=50)
54. Wi’07 Bafna Correlation with GC content GC content detector has results similar to structure prediction!
55. Wi’07 Bafna Significance of detected regions Test for significance was using Z-scores on shuffled sequences.
Earlier tests shuffled sequences using a large window.
Shuffling high scoring windows led to lower Z-scores.
56. Wi’07 Bafna Testing chimeric sequence Test on chimeric sequence. A real tRNA was embedded in 2000bp random sequence with similar GC-composition.
57. Wi’07 Bafna Increasing window size tRNA Z-score computation with a larger window (85nt).
58. Wi’07 Bafna For very strong signals, a Z-score > 4 may still be significant.
59. Wi’07 Bafna Z-score distribution Z-scores of 415 tRNA genes. 98% have Z-score lower than 4.
60. Wi’07 Bafna Structure We hoped that RNA structure computation would give us a clue into detecting novel ncRNA
Turns out that random DNA can ‘fold’ into a low energy structure.