RNA Secondary Structure

RNA Secondary Structure
What is RNA?
Definition of RNA secondary structure
RNA molecule evolution
Algorithms for base pair maximisation
Chomsky’s Linguistic Hierarchy
Stochastic Context Free Grammars & Evolution
Miscellaneous topics

Base Pairing (from Przytycka)

An Example: t-RNA From Paul Higgs

Known RNAs
t-RNA (transfer-)
m-RNA (messenger-)
mi-RNA (micro-)
Sn-RNA (small nuclear)
RNAi (interfering)
Srp-RNA (Signal Recognition Particle)
5S RNA
16S RNA
23S RNA
RNA viruses: Retroviruses (HIV), Coronavirus (SARS), …

Functions of RNAs
Information Transfer: mRNA
Codon -> Amino Acid adapter: tRNA
Other base pairing functions: ???
Enzymatic Reactions:
Structural:
Metabolic: ???
Regulatory: RNAi

Known RNA Structures
http://www.rnabase.org/metaanalysis/
http://www.sanger.ac.uk/Software/rfam
http://www.scor.lbl.gov
Rfam: database of RNA alignments and secondary structure models
SCOR: database of experimentally solved RNA structures
Figure 1: The cumulative number of publicly available RNA-containing structures determined by X-ray crystallography (red), NMR spectroscopy (purple) or all techniques combined (blue) has been steadily increasing since the first RNA-containing structure was released in 1978. There has been a substantial acceleration in RNA structure determinations since the mid-1990s.
Figure 2: In a positive new trend, the average number of conformational map outliers per residue solved has shown a consistent downtrend recently. Interestingly, most of the improvement can be attributed to structures determined by X-ray crystallography. There has been no consistent trend for structures determined by NMR spectroscopy.

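RNA SS: recursive definition
Nussinov (1978), remade from Durbin et al., 1998
Secondary structure: a set of paired positions on the interval [i,j].
A-U and C-G can base pair. Some other pairings can occur, and triple interactions exist.
Pseudoknot: non-nested pairing, i.e. i < j < k < l with i paired to k and j paired to l.
A structure on [i,j] falls into one of four cases:
i,j pair: a structure on [i+1,j-1] enclosed by the pair i-j
j unpaired: a structure on [i,j-1]
i unpaired: a structure on [i+1,j]
bifurcation: a structure on [i,k] next to a structure on [k+1,j]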

RNA Secondary Structure
The number of secondary structures: Waterman, 1978
[Figure: recursive decomposition of the structures on N1…NL into cases, including a split into N1…Nk and Nk+1…NL.]
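A counting recursion of the kind attributed to Waterman (1978) can be sketched as follows. This is a minimal sketch under assumptions the slide does not spell out: any two bases may pair, the last base is either unpaired or paired with an earlier base k, and a hairpin loop must hold at least `min_loop` unpaired bases. The names `count_structures` and `min_loop` are illustrative only.

```python
def count_structures(n, min_loop=1):
    """Count secondary structures on n bases when any two bases may pair.

    Assumed recursion: base n is unpaired (count on n-1 bases), or it pairs
    with base k, leaving k-1 bases to the left of the pair and n-k-1 bases
    enclosed by it. The enclosed stretch must hold at least min_loop bases.
    """
    counts = [1] * (n + 1)  # counts[0] = 1: the empty structure
    for length in range(1, n + 1):
        total = counts[length - 1]  # last base unpaired
        for k in range(1, length - min_loop):  # last base paired with base k
            total += counts[k - 1] * counts[length - k - 1]
        counts[length] = total
    return counts[n]


if __name__ == "__main__":
    print([count_structures(n) for n in range(10)])
```

The count grows exponentially with sequence length, which is why structures are searched by dynamic programming and not by enumeration.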

RNA: Matching Maximisation (remade from Durbin et al., 1998)
Example: GGGAAAUCC (A-U & G-C)
[Figure: dynamic programming matrix indexed by i and j for GGGAAAUCC, and the resulting folded structure.]
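The matching maximisation for GGGAAAUCC can be sketched with the four-case recursion (i,j pair, i unpaired, j unpaired, bifurcation). This is a minimal sketch, not the slide's own code: the function name `nussinov`, the `min_loop` parameter and the dot-bracket output are assumptions added for illustration, and the traceback returns only one of possibly several optimal structures.

```python
CANONICAL = frozenset({"AU", "UA", "GC", "CG"})


def nussinov(seq, pairs=CANONICAL, min_loop=0):
    """Return (max number of nested base pairs, one optimal dot-bracket string)."""
    n = len(seq)
    if n == 0:
        return 0, ""
    # best[i][j] = max pairs on seq[i..j]; cells with i > j stay 0.
    best = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j], best[i][j - 1])  # i or j unpaired
            if span > min_loop and seq[i] + seq[j] in pairs:
                score = max(score, best[i + 1][j - 1] + 1)  # i,j pair
            for k in range(i + 1, j):  # bifurcation into [i,k] and [k+1,j]
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score

    # Traceback with an explicit stack of intervals.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if best[i][j] == best[i + 1][j]:
            stack.append((i + 1, j))
        elif best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
        elif (j - i > min_loop and seq[i] + seq[j] in pairs
              and best[i][j] == best[i + 1][j - 1] + 1):
            structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if best[i][j] == best[i][k] + best[k + 1][j]:
                    stack.extend([(i, k), (k + 1, j)])
                    break
    return best[0][n - 1], "".join(structure)


if __name__ == "__main__":
    print(nussinov("GGGAAAUCC"))
```

The triple loop over i, j and k makes the running time cubic in the sequence length.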

RNA Secondary Structure Evolution
From Durbin et al. (1998) Biological Sequence Analysis

Inference about hidden structure
Goldman, Thorne & Jones, 96
Knudsen & Hein, 99
Pedersen & Hein, 03
[Figure: sequence C C A A G C A U U with parts labelled Observable and Unobservable.]

Goldman, Thorne & Jones: "Structure" + "Evolution"
1 A S D F G H J K L P
2 A S D F G H J K L P
3 D S D F G K J K L C
4 D S D F G K J K L C
[Figure: tree relating sequences 1-4, and an HMM assigning a hidden state to each alignment column.]

Three Questions
What is the probability of the data?
What is the most probable "hidden" configuration?
What is the probability of a specific "hidden" state?
Training: Given a set of instances, find parameters making them probable if they were independent.
[Figure: observations O1…O10 generated by hidden states H1, H2, H3.]

The Basic Calculations
[Figure: observations O1…O10 generated by hidden states H1, H2, H3, shown twice.]
What is the most probable "hidden" configuration?
What is the probability of a specific "hidden" state?
The time required for these calculations is proportional to K^2*L, where K is the number of hidden states and L the length of the sequence.
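The K^2*L cost can be seen in a small sketch of two of these calculations: the forward algorithm (probability of the data) and the Viterbi algorithm (most probable hidden configuration). The two-state model and all its numbers below are invented for illustration and come from no slide; the posterior probability of a single hidden state would additionally need a backward pass, which is omitted here.

```python
def forward(obs, init, trans, emit):
    """Probability of the observations, summed over all hidden paths."""
    K = len(init)
    f = [init[k] * emit[k][obs[0]] for k in range(K)]
    for symbol in obs[1:]:
        # K terms for each of K states: K^2 work per position, L positions.
        f = [emit[k][symbol] * sum(f[h] * trans[h][k] for h in range(K))
             for k in range(K)]
    return sum(f)


def viterbi(obs, init, trans, emit):
    """Most probable hidden path and its joint probability with the data."""
    K = len(init)
    v = [init[k] * emit[k][obs[0]] for k in range(K)]
    backpointers = []
    for symbol in obs[1:]:
        ptr = [max(range(K), key=lambda h: v[h] * trans[h][k]) for k in range(K)]
        v = [emit[k][symbol] * v[ptr[k]] * trans[ptr[k]][k] for k in range(K)]
        backpointers.append(ptr)
    state = max(range(K), key=lambda k: v[k])
    best = v[state]
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return path[::-1], best


# Hypothetical two-state model (illustrative numbers only).
INIT = [0.5, 0.5]
TRANS = [[0.9, 0.1], [0.2, 0.8]]
EMIT = [{"A": 0.8, "G": 0.2}, {"A": 0.3, "G": 0.7}]

if __name__ == "__main__":
    print(forward("AG", INIT, TRANS, EMIT))
    print(viterbi("AG", INIT, TRANS, EMIT))
```

For long sequences the products underflow, so a practical version would work with logarithms or rescale at every position.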

Empirical Doublet Models
Alignment of N slowly related molecules, each L long.
[Figure: alignment rows AUUGCAUUCCAAUUGCAUUCCA.]
$$r_{N_1,N_2} = \frac{\#(N_1 \to N_2,\ N_2 \to N_1)}{N_{P/U}\,(N_{P/U}-1)/2}, \qquad N_1 \neq N_2$$
where N_{P/U} is the number of paired/unpaired in the alignment, and
$$r'_{N_1,N_2} = \frac{\#N_1 \cdot r_{N_1,N_2}}{\#N_2}$$
Partial Doublet Model
      AU      UA      GC      CG      UG      GU
AU   -1.16    .18     .5      .12     .02     .27
UA    .18    -1.16    .12     .5      .27     .02
GC    .33     .08    -.82     .13     .02     .23
CG    .08     .33     .13    -.82     .23     .02
UG    .08    1.00     .1     1.26   -2.56     .04
GU   1.00     .08    1.26     .1      .04   -2.56
Singlet/Marginalized Doublet Model
     A             C             G             U
A   -.75/-1.15     .16/.13       .32/.79       .26/.23
C    .4/.09       -1.57/-.84     .24/.16       .93/.59
G    .55/.45       .17/.13      -.96/-.7       .24/.11
U    .35/.18       .51/.70       .19/.16      -1.05/-1.03

Doublet Evolution From Bjarne Knudsen

Structure Dependent Evolution: RNA
[Figure: grid over positions 1-8 marking paired sites, alongside four aligned sequences U A C A C C G U.]

Structure Dependent Evolution: RNA

Grammars: Finite Set of Rules for Generating Strings
A starting symbol.
A set of substitution rules applied to variables in the present string.
Finished: no variables left.
Types: Regular, Context Free, Context Sensitive, General (also erasing)

Chomsky Linguistic Hierarchy
Source: Biological Sequence Analysis
W nonterminal sign, a any terminal sign; α, β, γ are strings, but β is not the null string; ε is the empty string.
Regular Grammars: W --> aW’, W --> a
Context-Free Grammars: W --> β
Context-Sensitive Grammars: α1Wα2 --> α1βα2
Unrestricted Grammars: α1Wα2 --> γ
The above listing is in increasing power of string generation. For instance, context-free grammars can generate every language that regular grammars can, and some more in addition.

Simple String Generators
Non-terminals (capital), terminals (small)
i. Start with S. S --> aT | bS, T --> aS | bT | ε
One sentence (odd # of a’s): S --> aT --> aaS --> aabS --> aabaT --> aaba
ii. S --> aSa | bSb | aa | bb
One sentence (even length palindromes): S --> aSa --> abSba --> abaaba
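The two generators can be exercised with a short sketch that repeatedly rewrites the leftmost non-terminal until none is left. It assumes that grammar i includes the rule T --> ε, which the example derivation needs for its last step. The helper `generate` and the uniform choice between rules are illustrative assumptions.

```python
import random

# Right-hand sides per non-terminal; "" stands for the empty string ε.
GRAMMAR_I = {"S": ["aT", "bS"], "T": ["aS", "bT", ""]}
GRAMMAR_II = {"S": ["aSa", "bSb", "aa", "bb"]}


def generate(rules, start="S", rng=random, max_steps=10_000):
    """Rewrite the leftmost non-terminal until only terminals remain."""
    string = start
    for _ in range(max_steps):
        position = next((i for i, c in enumerate(string) if c in rules), None)
        if position is None:
            return string  # finished: no variables left
        replacement = rng.choice(rules[string[position]])
        string = string[:position] + replacement + string[position + 1:]
    raise RuntimeError("derivation did not finish within max_steps")


if __name__ == "__main__":
    rng = random.Random(1)
    print([generate(GRAMMAR_I, rng=rng) for _ in range(5)])
    print([generate(GRAMMAR_II, rng=rng) for _ in range(5)])
```

Every string from grammar i has an odd number of a's, and every string from grammar ii is an even-length palindrome, whatever random choices are made.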

Stochastic Grammars
The grammars above classify every string as belonging to the language or not.
Every variable has a finite set of substitution rules. Assigning probabilities to the use of each rule assigns probabilities to the strings in the language.
If a string has a unique derivation, its probability is the product of the probabilities of the applied rules.
i. Start with S. S --> (0.3)aT | (0.7)bS, T --> (0.2)aS | (0.4)bT | (0.2)ε
S --> aT --> aaS --> aabS --> aabaT --> aaba, with probability 0.3 * 0.2 * 0.7 * 0.3 * 0.2
ii. S --> (0.3)aSa | (0.5)bSb | (0.1)aa | (0.1)bb
S --> aSa --> abSba --> abaaba, with probability 0.3 * 0.5 * 0.1
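The product rule for the two example derivations can be checked with a minimal sketch. The rule probabilities are used exactly as written on the slide (the three values listed for T add up to 0.8, not 1). The third alternative for T is assumed to be the empty string, and the helper names `replay` and `derivation_probability` are illustrative.

```python
from math import prod

# (left-hand side, right-hand side) -> probability; "" is the empty string ε.
GRAMMAR_I = {("S", "aT"): 0.3, ("S", "bS"): 0.7,
             ("T", "aS"): 0.2, ("T", "bT"): 0.4, ("T", ""): 0.2}
GRAMMAR_II = {("S", "aSa"): 0.3, ("S", "bSb"): 0.5,
              ("S", "aa"): 0.1, ("S", "bb"): 0.1}


def replay(derivation, start="S"):
    """Apply each rule to the first occurrence of its left-hand side."""
    string = start
    for lhs, rhs in derivation:
        if lhs not in string:
            raise ValueError(f"{lhs} does not occur in {string!r}")
        string = string.replace(lhs, rhs, 1)
    return string


def derivation_probability(rule_probs, derivation):
    """Product of the probabilities of the rules used in the derivation."""
    return prod(rule_probs[rule] for rule in derivation)


DERIVATION_I = [("S", "aT"), ("T", "aS"), ("S", "bS"), ("S", "aT"), ("T", "")]
DERIVATION_II = [("S", "aSa"), ("S", "bSb"), ("S", "aa")]

if __name__ == "__main__":
    print(replay(DERIVATION_I), derivation_probability(GRAMMAR_I, DERIVATION_I))
    print(replay(DERIVATION_II), derivation_probability(GRAMMAR_II, DERIVATION_II))
```

When a string has several derivations, its probability is the sum of these products over all of them.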

Secondary Structure Generators
S --> LS (.869) | L (.131)
F --> dFd (.788) | LS (.212)
L --> s (.895) | dFd (.105)
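A sketch of sampling structures from this grammar follows. It reads the rules as S --> LS | L, F --> dFd | LS and L --> s | dFd, with the listed probabilities. It assumes that s stands for an unpaired position (written ".") and that the two d's of dFd are a base pair (written "(" and ")"). The function name `sample_structure` is illustrative, and the sketch emits structures only, not nucleotides.

```python
import random

# non-terminal -> (alternative right-hand sides, their probabilities)
RULES = {
    "S": ([["L", "S"], ["L"]], [0.869, 0.131]),
    "F": ([["(", "F", ")"], ["L", "S"]], [0.788, 0.212]),
    "L": ([["."], ["(", "F", ")"]], [0.895, 0.105]),
}


def sample_structure(rng):
    """Expand S left to right with an explicit stack; return dot-bracket text."""
    output = []
    stack = ["S"]
    while stack:
        symbol = stack.pop()
        if symbol not in RULES:  # terminal: one of "(", ")", "."
            output.append(symbol)
            continue
        alternatives, weights = RULES[symbol]
        chosen = rng.choices(alternatives, weights=weights)[0]
        stack.extend(reversed(chosen))  # leftmost symbol is expanded first
    return "".join(output)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_structure(rng))
```

Because F must end in LS, every base pair encloses at least two further positions, so the sampler never produces "()" or "(.)".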

SCFG Analogue to HMM calculations (Durbin et al., 1998)
What is the probability of the data?
What is the most probable "hidden" configuration?
What is the probability of a specific "hidden" state?
[Figure: a non-terminal W covering positions i to j splits into WL and WR, within a sequence running from 1 to L.]
The time required for these calculations is proportional to K^2*L^3, where K is the number of hidden states and L the length of the sequence.

RNA Secondary Structure Knudsen & Hein, 03

1. Accuracy as certainty threshold is increased. 2. Accuracy as function of sequence number: From Knudsen & Hein (1999)

RNA Secondary Structure Knudsen & Hein, 03

Observing Evolution has 2 parts
P(x):
P(Further history of x):
[Figure: sequence x (C C A A G C A U U) and its further history.]

RNA Structure Prediction and Alignment
Can only align molecules of the same type.
Sankoff, 1985: Combined RNA secondary structure & alignment
Gorodkin, 1997: Foldalign (only hairpins)
2002: Dynalign
Perriquet, 2002: Carnac

RNA Structure Representations
Circle with chords
Full Description
Mountains
Ordered Tree
Balanced Nested Parenthesis
From Fontana, 2003; Moulton et al., 2002
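Two of these representations can be derived from the balanced nested parenthesis form, as the following minimal sketch shows. The list of pairs corresponds to the chords of the circle drawing, and the mountain height at a position is taken to be the number of brackets opened and not yet closed after reading that position. That convention and the function names are assumptions made for illustration.

```python
def pairs_from_dot_bracket(structure):
    """Return the sorted list of (i, j) base pairs, using 0-based positions."""
    open_positions = []
    pairs = []
    for position, char in enumerate(structure):
        if char == "(":
            open_positions.append(position)
        elif char == ")":
            if not open_positions:
                raise ValueError("unbalanced ')' at position %d" % position)
            pairs.append((open_positions.pop(), position))
    if open_positions:
        raise ValueError("unbalanced '(' at position %d" % open_positions[-1])
    return sorted(pairs)


def mountain(structure):
    """Height profile: '(' steps up, ')' steps down, '.' stays level."""
    heights = []
    height = 0
    for char in structure:
        height += (char == "(") - (char == ")")
        heights.append(height)
    return heights


if __name__ == "__main__":
    example = ".(((..)))"
    print(pairs_from_dot_bracket(example))
    print(mountain(example))
```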

RNA Structure Evolution
Insertion-deletion process of: Doublets, Singlets
There are methods of tree alignment that could probably be extended to statistical tree alignment.

Metrics on RNA Structures (Moulton, 2000)
Base Pair Metrics
Tree Metrics
Mountain Metrics
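As an illustration of a base pair metric, the sketch below counts the base pairs that occur in exactly one of two structures (the size of the symmetric difference of the two pair sets). This is one common definition; whether it matches the cited paper's definition in every detail is an assumption. The function names are illustrative, and the input is assumed to be balanced dot-bracket text.

```python
def base_pairs(structure):
    """Set of (i, j) pairs in a balanced dot-bracket string, 0-based."""
    open_positions = []
    pairs = set()
    for position, char in enumerate(structure):
        if char == "(":
            open_positions.append(position)
        elif char == ")":
            pairs.add((open_positions.pop(), position))
    return pairs


def base_pair_distance(first, second):
    """Number of base pairs present in exactly one of the two structures."""
    return len(base_pairs(first) ^ base_pairs(second))


if __name__ == "__main__":
    print(base_pair_distance("((..))", "(....)"))
    print(base_pair_distance("(..)..", "..(..)"))
```

The symmetric difference of sets satisfies the triangle inequality, so this distance is a metric on structures of equal length.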

Population Genetics of Coupled Mutations
W. Stephan, 96 & P. Higgs, 98
Possible separation of long term and short term evolution.
Creation of Linkage Disequilibrium of paired sites.

SingletDoublet Models Kirby et al, 95, Tillier et al.,98, Savill et al.,01 Jukes-Cantor with bias toward base pairing: 1/4ml, 1 difference, pairing gained 1/4m, 1 difference, pairing unchanged Ri,j= 1/4m/l, 1 difference pairing lost 0, 2 differences

Contagious Dependencies: Overlapping Reading Frames & CG frequencies
Pedersen & Jensen, 01
[Figure: row of nucleotide positions n.]

Doublet → Tetraplet Models
Nerman & Durbin at B. Knudsen’s exam, 02
[Figure: stacked nucleotides labelled N1, N2, N4, N2.]
Stacking: In principle a 4^4 times 4^4 matrix (256 x 256 = 65,536 entries!) is needed, but proper parametrisation and symmetries could reduce this substantially.

RNA + Protein Structure Dependent Molecular Evolution
Singlet: Straightforward, no interference from RNA level.
Doublets: What seems to be needed is a parametrisation of how base pairing creates departure from an independent singlet, singlet model.

Miscellaneous Topics
RNA Folding
Molecular Dynamics of RNA Structures
RNA Structure – Sequence Landscapes
RNA Homology Modelling & Threading
RNA Gene Finding
Close to Optimal Structures
Constraint Satisfaction Modelling

Literature & www-sites
Eddy, S. Non-coding RNA genes and the modern RNA world. Nat Rev Genet. 2001 Dec;2(12):919-29. Review.
Eddy, S. Computational genomics of noncoding RNA genes. Cell. 2002 Apr 19;109(2):137-40. Review.
Fontana (2002) Modelling "evo-devo" with RNA. BioEssays 24.12.1164-77.
Knudsen, B. and J.J. Hein (2003) Practical RNA Folding. (In Press, RNA)
Knudsen, B. and J.J. Hein (1999) Using stochastic context free grammars and molecular evolution to predict RNA secondary structure. Bioinformatics 15.6.446-454.
Moore (1999) Structural Motifs in RNA. Ann.Rev.Biochem. 68.287-300.
Moulton et al. (2000) Metrics on RNA Secondary Structures. J.Comput.Biol. 7.1/2.277-
Perriquet et al. (2003) Finding the common homologous structure shared by two homologous RNAs. Bioinformatics 19.1.108-116.
http://www.imb-jena.de/RNA.html
http://scor.lbl.gov/index.html
http://www.rnabase.org/metaanalysis/
