CISC 467/667 Intro to Bioinformatics (Spring 2007) RNA secondary structure

CISC 467/667 Intro to Bioinformatics(Spring 2007)RNA secondary structure CISC667, S07, Lec19, Liao

Facts about RNAs • Mainly as “information carrier” in protein synthesis • mRNA • tRNA • Also as catalysts • Ribozymes • Also as regulator • RNAi • Single strand • Intra-molecular pairing • Secondary structure • Essential for sequence stability and function • Similarity in secondary structure plays a more important role than primary sequence in determining homology for RNAs CISC667, S07, Lec19, Liao

CISC667, S07, Lec19, Liao

C A A A G G A A G U A C A C G U G C C G C A RNA stem loop hybridization pairing: A-U, and C-G where “-” stands for a pairing, and “x” for no pairing. In the above example, seq1 and seq2 fold into a similar structure, whereas seq3 does not. Pairwise alignments disregarding such structural restrictions may be misleading; seq2 and seq3 have 70% sequence identity, seq1 and seq3 have 60%, whereas seq1 and seq2 have only 30%. G A U x C C x U G x G seq1 seq2 seq3 seq1 C A G G A A A C U G seq2 G C U G C A A A G C seq3 G C U G C A A C U G CISC667, S07, Lec19, Liao

A A G A G C A U C G C A G G A A A C U G C A G G A A A C U G ( ( ( . . . . ) ) ) Representations for palindromic structures 1) 2) 3) There is a 1-1 correspondence between RNA secondary structures and well-balanced parenthesis expressions, where the balancing parentheses correspond to base pairings via hydrogen bonds. CISC667, S07, Lec19, Liao

Secondary structure prediction Base pair maximization algorithm [Nussinov] mi,j: maximal # of base pairs that can be formed for sequence si…sj. di,j = 1, if si and sj are paired = 0, otherwise Initialization mi,i-1 = 0 for i= [2, L] mi,i = 0 for i = [1,L] Recursion mi,j= max {mi+1,j, mi,j-1, mi+1,j-1+ di,j, maxi<k<j {mi,k+ mk+1,j} // bifurcation } CISC667, S07, Lec19, Liao

Dynamic programming table 1 2 3 4 5 6 7 8 9 G G G A A A U C C 1 G 0 1 2 3 2 G 0 1 2 3 3 G 0 1 2 2 4 A 0 1 1 1 A A 5 A 0 0 1 1 1 6 A 0 1 1 1 A U 7 U 0 G C 8 C 0 G C 9 C 0 G Time Complexity: O(L3), where L is sequence length Space complexity: O(L2) CISC667, S07, Lec19, Liao

Secondary structure prediction Minimum energry algorithm [Zuker] Ei,j: minimum energy for an optimal fold formed by sequence si…sj. ei,j = -5, if si , sj is CG or GC = -4, if si , sj is AU or UA = -1, if si , sj is GU or UG Initialization ei,i-1 = 0 for i= [2, L] ei,i = 0 for i = [1,L] Recursion Ei,j= min {Ei+1,j, Ei,j-1, Ei+1,j-1+ ei,j, mini<k<j {Ei,k+ Ek+1,j} // bifurcation } CISC667, S07, Lec19, Liao

Motivation: • Both DNA and protein sequences can be considered as sentences in a “language”. • We know the “language” is not random • What’s the grammar of the language? • “The linguistics of DNA” – David Searls, American Scientist, 80(19920579. CISC667, S07, Lec19, Liao

Chomsky hierarchy of transformational grammars (1956) • Regular expression (no recursive structure, cannot handle palindromes) • W  aW, or W  a • Context-free • W   • Context-sensitive • 1W 2  1  2 • Unrestricted grammars • 1W 2   Notations: a: any terminals; , : any string of non-terminals and/or terminals, including the null string; : any string of non-terminals and/or terminals, not including the null string. Note: Chomsky normal form: any CFG can be written such that all productions have the following forms: Wv Wy Wz Wv a CISC667, S07, Lec19, Liao

S w1 w2 w3 c a g g a a a c u g seq1 C A G G A A A C U G seq2 G C U G C A A A G C To capture the palindromic structure, a context-free grammar can be given as follows. S  aW1u | cW1g | gW1c | uW1a W1 aW2u| cW2g| gW2c| uW2a W2 aW3u| cW3g| gW3c| uW3a W3  gaaa | gcaa CISC667, S07, Lec19, Liao

Stochastic grammars Motivations: • Irregularities (or exceptions to the grammar) in languages • Such irregularities can be taken care of by “growing” your grammar; introducing new non-terminals and new production rules. • It is useful to differentiate productions that account for a large portion of the language from those that are rare exceptions. • This can be achieved by assigning probabilities to various productions. • For any non-terminals, the probabilities of all its possible productions must add to 1. • Given a sentence x, a stochastic grammar  parsing the sentence will assign a probability P(x|) that the sentence belongs to the language specified by the grammar, whereas, a conventional grammar will just give a yes-or-no answer. • x language P(x| ) =1 CISC667, S07, Lec19, Liao

Stochastic CFG for sequence modeling • Calculate optimal alignment of a sequence to a parameterized SCFG. [Cocke-Younger-Kasami algorithm] • Calculate probability for a given sequence to belong to the language specified by SCFG. [inside algorithm, and inside-outside algorithm] • Given a set of example sequences/structures, estimate optimal probability parameters for a SCFG. Note: • The issues addressed above by SCFGs are quite similar to those for HMMs; actually it can be proved that HMMs are equivalent to stochastic regular grammars. CISC667, S07, Lec19, Liao

Resources Lecture notes: http://www.bioinfo.rpi.edu/~zukerm/lectures/RNAfold-html/ RNA informatics: http://www-lbit.iro.umontreal.ca/RNA_Links/RNA.shtml Software: -Vienna RNA fold CISC667, S07, Lec19, Liao

CISC 467/667 Intro to Bioinformatics (Spring 2007) RNA secondary structure