290 likes | 490 Views
Combining RNA and Protein selection models. The Central Idea in Comparative Molecular Biology & Genomics Three basic applications Protein secondary structure RNA secondary structure Gene structure Combining Evolution Constraints Protein-Protein RNA-Protein
E N D
Combining RNA and Protein selection models The Central Idea in Comparative Molecular Biology & Genomics Three basic applications Protein secondary structure RNA secondary structure Gene structure Combining Evolution Constraints Protein-Protein RNA-Protein Combining Structure Descriptions
TCGTA TGGTT Modelling Sequence Evolution a - unknown Biological setup Pi,j(t) continuous time markov chain on the state space {A,C,G,T}. t1 e A t2 C C
Jukes-Cantor 69: Total Symmetry Rate-matrix, R: T O A C G T F A -3*aa aa R C a -3*aaa O G a a -3* a a M T a a a -3* a Transition prob. after time t, a = a*t: P(equal) = ¼(1 + 3e-4*a ) ~ 1 - 3a P(diff.) = ¼(1 - 3e-4*a ) ~ 3a Stationary Distribution: (1,1,1,1)/4.
Comparison of Evolutionary Objects. C C A A G C A U U Observable Unobservable Goldman, Thorne & Jones, 96 Knudsen & Hein, 99 Eddy & others Pedersen & Hein, 03 Haussler & others Multiple levels of selection Protein-protein RNA-protein Pedersen, Meyer, Forsberg, Hein,… Observable Unobservable
Structure Description:Grammars Finite Set of Rules Generating Strings • A starting symbol: • A set of substitution rules applied to variables - - in the present string: Context Free Regular finished – no variables Protein secondary structure Gene Structure RNA secondary structure
Simple String Generators Terminals(capital)---Non-Terminals(small) i. Start with SS --> aTbS T --> aSbT One sentence – odd # of a’s: S-> aT -> aaS –> aabS -> aabaT -> aaba ii. S--> aSabSbaa bb One sentence (even length palindromes): S--> aSa --> abSba --> abaaba
Stochastic Grammars The grammars above classify all string as belonging to the language or not. All variables has a finite set of substitution rules. Assigning probabilities to the use of each rule will assign probabilities to the strings in the language. If there is a 1-1 derivation (creation) of a string, the probability of a string can be obtained as the product probability of the applied rules. i. Start with S.S --> (0.3)aT (0.7)bS T --> (0.2)aS (0.4)bT (0.2) *0.2 *0.7 *0.3 *0.3 *0.2 S -> aT -> aaS –> aabS -> aabaT -> aaba ii. S--> (0.3)aSa (0.5)bSb (0.1)aa (0.1)bb *0.1 *0.3 *0.5 S -> aSa -> abSba -> abaaba
Gene Describers Simple Prokaryotic Genes: Simple Eukaryotic Genes:
Secondary Structure Generators S --> LSL .869 .131 F --> dFdLS .788 .212 L --> s dFd .895 .105 Knudsen & Hein, 99
Structure Dependent Evolution Models • Protein Secondary Structure Dependent(Goldman, Thorne & Jones) • a, b & Loop each has their own mutation rate matrix (20,20) , Ra,Rb & Rloop 2. RNA Secondary Structure Dependent i. R singlet, singlet (4,4) ii. R doublet,doublet (16,16) (base pair conserving relative to R singlet, singletX R singlet, singlet ) 3. Gene Structure Dependent i. Rnon-coding{ATG-->GTG} ii. Rcoding{ATG-->GTG} iii-. Other structural categories, regulatory signals …..
i. The Genetic Code 3 classes of sites: 4 2-2 1-1-1-1 4 (3rd) 1-1-1-1 (3rd) ii. TA (2nd) Problems: i. Not all fit into those categories. ii. Change in on site can change the status of another.
b b a a b Kimura’s 2 parameter model & Li’s Model. Probabilities: Rates: start Selection on the 3 kinds of sites (a,b)(?,?) 1-1-1-1 (f*a,f*b) 2-2 (a,f*b) 4 (a, b)
alpha-globin from rabbit and mouse. Ser Thr Glu Met Cys Leu Met Gly Gly TCA ACT GAG ATG TGT TTA ATG GGG GGA * * * * * * * ** TCG ACA GGG ATA TAT CTA ATG GGT ATA Ser Thr Gly Ile Tyr Leu Met Gly Ile • Sites Total Conserved Transitions Transversions • 1-1-1-1 274 246 (.8978) 12(.0438) 16(.0584) • 2-2 77 51 (.6623) 21(.2727) 5(.0649) • 4 78 47 (.6026) 16(.2051) 15(.1923) • Z(at,bt) = .50[1+exp(-2at) - 2exp(-t(a+b)] transition Y(at,bt) = .25[1-exp(-2bt )] (transversion) • X(at,bt) = .25[1+exp(-2at) + 2exp(-t(a+b)] identity • L(observations,a,b,f)= • C(429,274,77,78)* {X(a*f,b*f)246*Y(a*f,b*f)12*Z(a*f,b*f)16}* {X(a,b*f)51*Y(a,b*f)21*Z(a,b*f)5}*{X(a,b)47*Y(a,b)16*Z(a,b)15} • where a = at and b = bt. • Estimated Parameters: a = 0.3003 b = 0.1871 2*b = 0.3742 (a + 2*b) = 0.6745 f = 0.1663 • Transitions Transversions • 1-1-1-1 a*f = 0.0500 2*b*f = 0.0622 • 2-2 a = 0.3004 2*b*f = 0.0622 • 4 a = 0.3004 2*b = 0.3741 • Expected number of: replacement substitutions 35.49 synonymous 75.93 • Replacement sites : 246 + (0.3742/0.6744)*77 = 314.72 • Silent sites : 429 - 314.72 = 114.28 Ks = .6644 Ka = .1127
Three Questions O1 O2O3 O4O5 O6O7 O8 O9 O10 H1 H2 H3 What is the probability of the data? What is the most probable ”hidden” configuration? What is the probability of specific ”hidden” state? HMM/Stochastic Regular Grammar: W Stochastic Context Free Grammars: WL WR j L 1 i i’ j’
Comparative Gene Finding Jakob Skou Pedersen & Hein, 2004
Why combine RNA & Protein Models? Short Term/Long Term Evolution Discrepancies Separating Selective Effects Analyzing one level without interference from the other level Predicting gene structure and RNA structure better. Annotation of Viral Genomes
Combining Levels of Selection. Assume multiplicativity: fA,B = fA*fB Protein-Protein Hein & Støvlbæk, 1995 Codon Nucleotide Independence Heuristic Jensen & Pedersen, 2001 Contagious Dependence Protein-RNA Singlet Doublets Contagious Dependence
Overlapping Coding Regions Hein & Stoevlbaek, 95 1st 1-1-1-1 2-2 4 2nd (f1f2a, f1f2b) (f2a, f1f2b) (f2a, f2b) 1-1-1-1 sites 2-2 4 (f1a, f1f2b) (f2a, f1f2b) (a, f2b) (f1a, f1b) (a, f1b) (a, b) pol gag Example: Gag & Pol from HIV Gag 1-1-1-1 2-2 4 Pol 64 31 34 1-1-1-1 sites 2-2 4 40 7 0 27 2 0 MLE:a=.084 b= .024 a+2b=.133 fgag=.403 fpol=.229 Ziheng Yang has an alternative model to this, were sites are lumped into the same category if they have the same configuration of positions and reading frames.
HIV2 Analysis Hasegawa, Kisino & Yano Subsitution Model Parameters: a*t β*t pApCpGpT 0.350 0.105 0.361 0.181 0.236 0.222 0.015 0.005 0.004 0.003 0.003 Selection Factors GAG 0.385 (s.d. 0.030) POL 0.220 (s.d. 0.017) VIF 0.407 (s.d. 0.035) VPR 0.494 (s.d. 0.044) TAT 1.229 (s.d. 0.104) REV 0.596 (s.d. 0.052) VPU 0.902 (s.d. 0.079) ENV 0.889 (s.d. 0.051) NEF 0.928 (s.d. 0.073) Estimated Distance per Site: 0.194
Evolution under double constraints Codon Nucleotide Independence Heuristic Singlet Ri,j =f* qi,j Doublet R(i1,i2),(j1,j2) = f1 * f2 * q (i1,i2),(j1,j2)
Structure Prediction: Hepatitis C Analysis U U U A A – U G – C G – C U – A C – G C C U U C – G C – G G – U C U G – C C A G – C A C A G G – U G – C C – G C – G G – C U – G A A A A C G - U A - U C - G U - G C - G C - G G - U U A C C G C C G - C G - C U - G G - C G - C G – C A - U U U A G A C C - G U – A A A A G U - G G - C G - U A - C - G C - G U - A C - G U - A U
Evolution Models: A hierarchy of hypotheses 3 3 3 1 1 1 2 2 2 Codon Factors transversion transition, ratio Duplet distortion Doublet/ singlet ratio Likelihood # parameters - - 4 0 1 2 3 4 5 - - - - - + L= 1.0531 10-25927 L= 2.0596 10-25797 L= 1.3104 10-21569 L= 2.5006 10-21513 L= 4.5739 10-21484 L= 2.1155 10-21473 - - 0.173 0.415 0.415 0.414 0.292 5 (f1:0.24,f2:0.14) (f1:0.24,f2:0.14) (f1:0.24,f2:0.14) (f1:0.24,f2:0.14) ts/tv=2.00 3 (ts/tv)=1.50,1.26,3.05 3 (ts/tv, equil.) 3 (ts/tv, equil.) 7 9 15 17 Singlet Doublets
Combined RNA & Protein Structure Gene Structure Fixed, RNA Structure Stochastic Presently being implemented with viral analysis in mind Both RNA & Gene Structure Stochastic Would imply Gene Finding as well. Grammar for overlapping genes a new phenomena Gene Structure Stochastic, RNA Structure Fixed An untypical situation A challenge for the future: structure evolution.
Open Problems N1 N2 N4 N3 Stacking Substitution Models In principle a 44 times 44 matrix (65.536 entries!!) is need, but proper parametrisation and symmetries is could reduce this substantially. Other Sets of Constraints: Regulatory Signals Combining with Alignment A C G T A T C G T T C G T
References. Hein,J & J.Stoevlbaek (1995) “A maximum-likelihood approach to analyzing nonoverlapping and overlapping reading frames” J.Mol.Evol. 40.181-189. Jensen,JL & Pedersen (2001) “Probabilistic models of DNA sequence evolution with context dependent rates of subsitution” Adv. Appl.Prob. 32.499-517. Katz and Burge (2003) “Widespread Selection for Local RNA Secondary Structure in Coding Regions of Bacterial Genes. Genome Research. 13.2042-51 Kirby, AK, SV Muse & W.Stephan (1995) “Maintenance of pre-mRNA secondary structure by epistatic selection” PNAS. 92.9047-51. Knudsen, Hein 99 “Predicting RNA Structure using Stochastic Context Free Grammars and Molecular Evolution” Bioinformatics 15.6.446-454. Knudsen and Hein (2003) “Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acid Research 31.13.3423-28. New Influenza gene article??? Meyer and Durbin (2002) “Comparative Ab Initio prediction of Gene Structure using pair HMMs” Bioinformatics 18.10.1309-18. Moulton, V., Zuker, M. Steel, M., Penny, D. and Pointon, R. “Metrics on RNA Structures”. J. Computational Biology, 7 (1): 277-292, (2000). Pedersen, AMK & JL Jensen (2001) “A Dependent – Rates Model and an MCMC-Based Methodology for the Maximum-Likelihood Analysis of Sequences with Overlapping Reading Frames” Mol.Biol.Evol. 18.5.763-76. Pedersen JS & J. Hein 2003 – “Gene finding with a Hidden Markov Model of genome structure and evolution” Bioinformatics Pedersen, Forsberg, Meyer, Simmonds and Hein (2003) “An evolutionary model for protein coding regions with RNA secondary structure” Manuscript in Preparation Pedersen, Forsberg, Meyer, Simmonds and Hein (2003) “Structure Models” Manuscript in Preparation Schadt, E. & K.Lange (2002) “Codon and Rate Variation Models in Molecular Phylogeny” Mol.Biol.Evol. 19.9.1534-49 Savill, NJ et al (2001) “RNA Sequence Evolution With Secondary Structure Constraints: Comparison of Substituin Ratye Models Using Maximum-Likehood Methods” Genetics. 2001 Jan 157.399-4111 Simmonds, P. and DB Smith (July1999) “Structural Constraints on RNA Virus Evolution” J.of Virology 5787-94 Tillier ERM & RA Collins (1998) “High Apparent Rate of Simultaneous Compensatory Base-Pair Substitutions in Ribosomal RNA” Genetics 149.1993-2001. Yang, Z. et al. (1995) “Molecular Evolution of the Hepatitis B Virus Genome” J.Mol.Evol. 41.587-96
Acknowledgements 1. Comparative RNA Structure - Bjarne Knudsen 2. Comparative Gene Structure - Jakob Skou Pedersen 3. Integrating Levels of Selection & Structure: Jakob Skou Pedersen, Irmtraud Meyer, Roald Forsberg Bjarne Knudsen Roald Forsberg Irmtraud Meyer Jakob Skou Pedersen