160 likes | 302 Views
CISC 467/667 Intro to Bioinformatics (Spring 2007) RNA secondary structure. Facts about RNAs Mainly as “information carrier” in protein synthesis mRNA tRNA Also as catalysts Ribozymes Also as regulator RNAi Single strand Intra-molecular pairing Secondary structure
E N D
CISC 467/667 Intro to Bioinformatics(Spring 2007)RNA secondary structure CISC667, S07, Lec19, Liao
Facts about RNAs • Mainly as “information carrier” in protein synthesis • mRNA • tRNA • Also as catalysts • Ribozymes • Also as regulator • RNAi • Single strand • Intra-molecular pairing • Secondary structure • Essential for sequence stability and function • Similarity in secondary structure plays a more important role than primary sequence in determining homology for RNAs CISC667, S07, Lec19, Liao
C A A A G G A A G U A C A C G U G C C G C A RNA stem loop hybridization pairing: A-U, and C-G where “-” stands for a pairing, and “x” for no pairing. In the above example, seq1 and seq2 fold into a similar structure, whereas seq3 does not. Pairwise alignments disregarding such structural restrictions may be misleading; seq2 and seq3 have 70% sequence identity, seq1 and seq3 have 60%, whereas seq1 and seq2 have only 30%. G A U x C C x U G x G seq1 seq2 seq3 seq1 C A G G A A A C U G seq2 G C U G C A A A G C seq3 G C U G C A A C U G CISC667, S07, Lec19, Liao
A A G A G C A U C G C A G G A A A C U G C A G G A A A C U G ( ( ( . . . . ) ) ) Representations for palindromic structures 1) 2) 3) There is a 1-1 correspondence between RNA secondary structures and well-balanced parenthesis expressions, where the balancing parentheses correspond to base pairings via hydrogen bonds. CISC667, S07, Lec19, Liao
Secondary structure prediction Base pair maximization algorithm [Nussinov] mi,j: maximal # of base pairs that can be formed for sequence si…sj. di,j = 1, if si and sj are paired = 0, otherwise Initialization mi,i-1 = 0 for i= [2, L] mi,i = 0 for i = [1,L] Recursion mi,j= max {mi+1,j, mi,j-1, mi+1,j-1+ di,j, maxi<k<j {mi,k+ mk+1,j} // bifurcation } CISC667, S07, Lec19, Liao
Dynamic programming table 1 2 3 4 5 6 7 8 9 G G G A A A U C C 1 G 0 1 2 3 2 G 0 1 2 3 3 G 0 1 2 2 4 A 0 1 1 1 A A 5 A 0 0 1 1 1 6 A 0 1 1 1 A U 7 U 0 G C 8 C 0 G C 9 C 0 G Time Complexity: O(L3), where L is sequence length Space complexity: O(L2) CISC667, S07, Lec19, Liao
Secondary structure prediction Minimum energry algorithm [Zuker] Ei,j: minimum energy for an optimal fold formed by sequence si…sj. ei,j = -5, if si , sj is CG or GC = -4, if si , sj is AU or UA = -1, if si , sj is GU or UG Initialization ei,i-1 = 0 for i= [2, L] ei,i = 0 for i = [1,L] Recursion Ei,j= min {Ei+1,j, Ei,j-1, Ei+1,j-1+ ei,j, mini<k<j {Ei,k+ Ek+1,j} // bifurcation } CISC667, S07, Lec19, Liao
Motivation: • Both DNA and protein sequences can be considered as sentences in a “language”. • We know the “language” is not random • What’s the grammar of the language? • “The linguistics of DNA” – David Searls, American Scientist, 80(19920579. CISC667, S07, Lec19, Liao
Chomsky hierarchy of transformational grammars (1956) • Regular expression (no recursive structure, cannot handle palindromes) • W aW, or W a • Context-free • W • Context-sensitive • 1W 2 1 2 • Unrestricted grammars • 1W 2 Notations: a: any terminals; , : any string of non-terminals and/or terminals, including the null string; : any string of non-terminals and/or terminals, not including the null string. Note: Chomsky normal form: any CFG can be written such that all productions have the following forms: Wv Wy Wz Wv a CISC667, S07, Lec19, Liao
S w1 w2 w3 c a g g a a a c u g seq1 C A G G A A A C U G seq2 G C U G C A A A G C To capture the palindromic structure, a context-free grammar can be given as follows. S aW1u | cW1g | gW1c | uW1a W1 aW2u| cW2g| gW2c| uW2a W2 aW3u| cW3g| gW3c| uW3a W3 gaaa | gcaa CISC667, S07, Lec19, Liao
Stochastic grammars Motivations: • Irregularities (or exceptions to the grammar) in languages • Such irregularities can be taken care of by “growing” your grammar; introducing new non-terminals and new production rules. • It is useful to differentiate productions that account for a large portion of the language from those that are rare exceptions. • This can be achieved by assigning probabilities to various productions. • For any non-terminals, the probabilities of all its possible productions must add to 1. • Given a sentence x, a stochastic grammar parsing the sentence will assign a probability P(x|) that the sentence belongs to the language specified by the grammar, whereas, a conventional grammar will just give a yes-or-no answer. • x language P(x| ) =1 CISC667, S07, Lec19, Liao
Stochastic CFG for sequence modeling • Calculate optimal alignment of a sequence to a parameterized SCFG. [Cocke-Younger-Kasami algorithm] • Calculate probability for a given sequence to belong to the language specified by SCFG. [inside algorithm, and inside-outside algorithm] • Given a set of example sequences/structures, estimate optimal probability parameters for a SCFG. Note: • The issues addressed above by SCFGs are quite similar to those for HMMs; actually it can be proved that HMMs are equivalent to stochastic regular grammars. CISC667, S07, Lec19, Liao
Resources Lecture notes: http://www.bioinfo.rpi.edu/~zukerm/lectures/RNAfold-html/ RNA informatics: http://www-lbit.iro.umontreal.ca/RNA_Links/RNA.shtml Software: -Vienna RNA fold CISC667, S07, Lec19, Liao