
Combining RNA and Protein selection models



  1. Combining RNA and Protein selection models. The Central Idea in Comparative Molecular Biology & Genomics. Three basic applications: Protein secondary structure; RNA secondary structure; Gene structure. Combining Evolution Constraints: Protein-Protein; RNA-Protein. Combining Structure Descriptions.

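  2. Modelling Sequence Evolution. Biological setup: observed sequences TCGTA and TGGTT, with a - unknown. $P_{i,j}(t)$: continuous-time Markov chain on the state space {A,C,G,T}. [Figure: tree with branch lengths t1 and t2 and example nucleotides at the nodes.]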

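  3. Jukes-Cantor 69: Total Symmetry. Rate matrix R (rows: FROM, columns: TO):

             A      C      G      T
       A   -3α      α      α      α
       C     α    -3α      α      α
       G     α      α    -3α      α
       T     α      α      α    -3α

     Transition probabilities after time t, with $a = \alpha t$: $P(\text{equal}) = \tfrac14(1 + 3e^{-4a}) \approx 1 - 3a$ and $P(\text{diff.}) = \tfrac34(1 - e^{-4a}) \approx 3a$, summed over the three other bases (each specific base has probability $\tfrac14(1 - e^{-4a}) \approx a$). Stationary distribution: (1,1,1,1)/4.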
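The Jukes-Cantor probabilities can be checked numerically. The sketch below is only an illustration: the function name `jc69_probabilities` and the example values are assumptions, and P(diff.) is taken as the probability of ending in any of the three other bases.

```python
import math
from typing import Tuple


def jc69_probabilities(alpha: float, t: float) -> Tuple[float, float]:
    """Return (P(same base), P(any different base)) after time t under JC69.

    alpha is the rate of change to each specific other base, so the
    diagonal of the rate matrix is -3 * alpha.
    """
    a = alpha * t
    decay = math.exp(-4.0 * a)
    p_equal = 0.25 * (1.0 + 3.0 * decay)
    p_diff = 0.75 * (1.0 - decay)  # summed over the three other bases
    return p_equal, p_diff


if __name__ == "__main__":
    for a in (0.001, 0.01, 0.1, 1.0, 10.0):
        p_equal, p_diff = jc69_probabilities(alpha=1.0, t=a)
        # For small a the linear approximations 1 - 3a and 3a are close;
        # for large a both a same-base and a specific-base probability tend to 1/4.
        print(f"a={a:<6} P(equal)={p_equal:.4f} (1-3a={1 - 3 * a:.4f})  "
              f"P(diff)={p_diff:.4f} (3a={3 * a:.4f})")
```

For long times P(equal) tends to 1/4, the stationary probability of each base.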

  4. Comparison of Evolutionary Objects. Observable: sequences; unobservable: structure. Protein secondary structure: Goldman, Thorne & Jones, 96. RNA secondary structure: Knudsen & Hein, 99; Eddy & others. Gene structure: Pedersen & Hein, 03; Haussler & others. Multiple levels of selection (Protein-protein, RNA-protein): Pedersen, Meyer, Forsberg, Hein,…

  5. Structure Description: Grammars. A finite set of rules generating strings: a starting symbol, and a set of substitution rules applied to variables in the present string. A string is finished when it contains no variables. Regular grammars: Protein secondary structure, Gene Structure. Context Free grammars: RNA secondary structure.

  6. Simple String Generators. Terminals (lowercase), non-terminals (capital). i. Start with S. S --> aT | bS; T --> aS | bT | ε. One sentence (odd # of a's): S -> aT -> aaS -> aabS -> aabaT -> aaba. ii. S --> aSa | bSb | aa | bb. One sentence (even-length palindromes): S -> aSa -> abSba -> abaaba.

  7. Stochastic Grammars. The grammars above classify all strings as belonging to the language or not. Each variable has a finite set of substitution rules. Assigning probabilities to the use of each rule assigns probabilities to the strings in the language. If a string has a unique derivation (creation), its probability is the product of the probabilities of the applied rules. i. Start with S. S --> (0.3)aT | (0.7)bS; T --> (0.2)aS | (0.4)bT | (0.2)ε. S -> aT -> aaS -> aabS -> aabaT -> aaba has probability 0.3 × 0.2 × 0.7 × 0.3 × 0.2 = 0.00252. ii. S --> (0.3)aSa | (0.5)bSb | (0.1)aa | (0.1)bb. S -> aSa -> abSba -> abaaba has probability 0.3 × 0.5 × 0.1 = 0.015.
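The product rule can be sketched for the palindrome grammar ii. Because every string in that language has exactly one derivation, peeling matching outer letters recovers the rules that were used. The names `RULES` and `palindrome_probability` are assumptions made for this illustration.

```python
# Rule probabilities of grammar ii: S --> aSa | bSb | aa | bb
RULES = {"aSa": 0.3, "bSb": 0.5, "aa": 0.1, "bb": 0.1}


def palindrome_probability(s: str) -> float:
    """Probability of string s under grammar ii (0.0 if s is not in the language)."""
    prob = 1.0
    while True:
        # Every rule puts the same letter (a or b) at both ends.
        if len(s) < 2 or s[0] != s[-1] or s[0] not in "ab":
            return 0.0
        if len(s) == 2:
            return prob * RULES[s]        # terminating rule S --> aa or S --> bb
        prob *= RULES[s[0] + "S" + s[0]]  # wrapping rule S --> aSa or S --> bSb
        s = s[1:-1]


if __name__ == "__main__":
    # Derivation from the slide: S -> aSa -> abSba -> abaaba
    print(palindrome_probability("abaaba"))
    print(palindrome_probability("aba"))  # odd length, not in the language
```

Summed over all strings of a fixed even length, the probabilities give the chance that a derivation stops at that length.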

  8. Gene Describers Simple Prokaryotic Genes: Simple Eukaryotic Genes:

  9. Secondary Structure Generators (Knudsen & Hein, 99).

       S --> LS  (.869) | L   (.131)
       F --> dFd (.788) | LS  (.212)
       L --> s   (.895) | dFd (.105)
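A minimal sketch of how such a grammar generates structures, assuming the rules read S --> LS | L, F --> dFd | LS, L --> s | dFd with the probabilities on the slide. Here `s` is written as an unpaired position `.` and the two `d` symbols of a pair as `(` and `)`. The function name and the dot-bracket output are choices made for this illustration, not part of the original model description.

```python
import random

# Each non-terminal maps to (probability, right-hand side) alternatives.
RULES = {
    "S": [(0.869, ["L", "S"]), (0.131, ["L"])],
    "F": [(0.788, ["(", "F", ")"]), (0.212, ["L", "S"])],
    "L": [(0.895, ["."]), (0.105, ["(", "F", ")"])],
}


def sample_structure(rng: random.Random) -> str:
    """Sample one secondary structure in dot-bracket notation.

    An explicit stack is used instead of recursion, so long stems
    cannot hit Python's recursion limit.
    """
    stack = ["S"]
    out = []
    while stack:
        symbol = stack.pop()
        if symbol not in RULES:
            out.append(symbol)  # terminal: '.', '(' or ')'
            continue
        first, second = RULES[symbol]
        rhs = first[1] if rng.random() < first[0] else second[1]
        stack.extend(reversed(rhs))  # leftmost symbol is expanded next
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(1)
    for _ in range(3):
        print(sample_structure(rng))
```

Every stem closes on F --> LS, so each hairpin loop contains at least two positions.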

  10. Structure Dependent Evolution Models. 1. Protein Secondary Structure Dependent (Goldman, Thorne & Jones): α, β & Loop each have their own (20×20) mutation rate matrix: R_α, R_β & R_loop. 2. RNA Secondary Structure Dependent: i. R_singlet,singlet (4×4); ii. R_doublet,doublet (16×16), base-pair conserving relative to R_singlet,singlet × R_singlet,singlet. 3. Gene Structure Dependent: i. R_non-coding{ATG-->GTG}; ii. R_coding{ATG-->GTG}; iii. Other structural categories, regulatory signals, …

  11. The Genetic Code. 3 classes of sites: 4, 2-2 and 1-1-1-1. [Figure labels: 4 (3rd); 1-1-1-1 (3rd); TA (2nd).] Problems: i. Not all sites fit into those categories. ii. A change at one site can change the status of another.

  12. Kimura's 2 parameter model & Li's Model. [Figure: substitution diagram with arrows labelled a and b; panels: Probabilities, Rates.] Selection on the 3 kinds of sites, (a,b) -> (?,?):

        1-1-1-1   (f*a, f*b)
        2-2       (a, f*b)
        4         (a, b)

  13. alpha-globin from rabbit and mouse.

        Ser Thr Glu Met Cys Leu Met Gly Gly
        TCA ACT GAG ATG TGT TTA ATG GGG GGA
          *   *  *    *  *  *         * **
        TCG ACA GGG ATA TAT CTA ATG GGT ATA
        Ser Thr Gly Ile Tyr Leu Met Gly Ile

        Sites     Total   Conserved      Transitions   Transversions
        1-1-1-1   274     246 (.8978)    12 (.0438)    16 (.0584)
        2-2        77      51 (.6623)    21 (.2727)     5 (.0649)
        4          78      47 (.6026)    16 (.2051)    15 (.1923)

      With $a = \alpha t$ (transitions) and $b = \beta t$ (each of the two transversions):
      $X(a,b) = \tfrac14\,[1 + e^{-4b} + 2e^{-2(a+b)}]$ (identity)
      $Y(a,b) = \tfrac14\,[1 + e^{-4b} - 2e^{-2(a+b)}]$ (transition)
      $Z(a,b) = \tfrac12\,[1 - e^{-4b}]$ (transversion)

      $L(\text{observations}, a, b, f) = C(429; 274, 77, 78)\,\{X(af,bf)^{246}\,Y(af,bf)^{12}\,Z(af,bf)^{16}\}\,\{X(a,bf)^{51}\,Y(a,bf)^{21}\,Z(a,bf)^{5}\}\,\{X(a,b)^{47}\,Y(a,b)^{16}\,Z(a,b)^{15}\}$

      Estimated Parameters: a = 0.3003, b = 0.1871, 2*b = 0.3742, (a + 2*b) = 0.6745, f = 0.1663.

                  Transitions      Transversions
        1-1-1-1   a*f = 0.0500     2*b*f = 0.0622
        2-2       a = 0.3004       2*b*f = 0.0622
        4         a = 0.3004       2*b = 0.3741

      Expected number of replacement substitutions: 35.49; synonymous: 75.93.
      Replacement sites: 246 + (0.3742/0.6744)*77 = 314.72.
      Silent sites: 429 - 314.72 = 114.28.
      Ks = .6644, Ka = .1127.
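The likelihood above can be evaluated numerically. The sketch below assumes the standard Kimura two-parameter probabilities, with a the transition rate times t and b the rate of each of the two transversions times t, and it omits the multinomial constant C because it does not depend on the parameters. The names `k2p_probabilities`, `COUNTS` and `log_likelihood` are assumptions made for this illustration.

```python
import math

# (conserved, transitions, transversions) counts per site class, from the table.
COUNTS = {
    "1-1-1-1": (246, 12, 16),
    "2-2": (51, 21, 5),
    "4": (47, 16, 15),
}


def k2p_probabilities(a: float, b: float):
    """Return (identity, transition, transversion) probabilities under K2P."""
    e1 = math.exp(-4.0 * b)
    e2 = math.exp(-2.0 * (a + b))
    x = 0.25 * (1.0 + e1 + 2.0 * e2)  # same base
    y = 0.25 * (1.0 + e1 - 2.0 * e2)  # the transition partner
    z = 0.5 * (1.0 - e1)              # either of the two transversions
    return x, y, z


def log_likelihood(a: float, b: float, f: float) -> float:
    """Log-likelihood of the site counts, up to the constant log C."""
    # Selection factor f scales the rates that change the amino acid.
    scaled = {"1-1-1-1": (a * f, b * f), "2-2": (a, b * f), "4": (a, b)}
    total = 0.0
    for site_class, counts in COUNTS.items():
        probs = k2p_probabilities(*scaled[site_class])
        total += sum(n * math.log(p) for n, p in zip(counts, probs))
    return total


if __name__ == "__main__":
    print(log_likelihood(0.3003, 0.1871, 0.1663))  # at the reported estimates
    print(log_likelihood(0.3003, 0.1871, 1.0))     # same rates, no selection
```

Comparing the two printed values shows how much the selection factor f improves the fit. A numerical optimiser over (a, b, f) would be the natural next step and is left out here.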

  14. Three Questions. Observed: O1 O2 O3 O4 O5 O6 O7 O8 O9 O10; hidden: H1 H2 H3. 1. What is the probability of the data? 2. What is the most probable "hidden" configuration? 3. What is the probability of a specific "hidden" state? HMM/Stochastic Regular Grammar: W. Stochastic Context Free Grammars: WL, WR. [Figure labels: positions 1, i, i', j', j, L.]
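For the HMM case, the three questions correspond to the forward algorithm, the Viterbi algorithm and forward-backward posteriors. The sketch below uses a toy two-state model whose states, transition and emission probabilities are invented for illustration and are not taken from the slides.

```python
# Hypothetical toy HMM: two hidden states emitting nucleotides.
STATES = ("H1", "H2")
START = {"H1": 0.5, "H2": 0.5}
TRANS = {"H1": {"H1": 0.9, "H2": 0.1}, "H2": {"H1": 0.2, "H2": 0.8}}
EMIT = {
    "H1": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "H2": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def forward_table(obs):
    """table[k][s] = P(obs[0..k], hidden state at k is s)."""
    table = [{s: START[s] * EMIT[s][obs[0]] for s in STATES}]
    for symbol in obs[1:]:
        prev = table[-1]
        table.append({s: EMIT[s][symbol] * sum(prev[r] * TRANS[r][s] for r in STATES)
                      for s in STATES})
    return table


def backward_table(obs):
    """table[k][r] = P(obs[k+1..] | hidden state at k is r)."""
    table = [{s: 1.0 for s in STATES}]
    for symbol in reversed(obs[1:]):
        nxt = table[0]
        table.insert(0, {r: sum(TRANS[r][s] * EMIT[s][symbol] * nxt[s] for s in STATES)
                         for r in STATES})
    return table


def data_probability(obs):
    """Question 1: probability of the data, summed over all hidden paths."""
    return sum(forward_table(obs)[-1].values())


def viterbi(obs):
    """Question 2: (probability, path) of the most probable hidden configuration."""
    best = {s: (START[s] * EMIT[s][obs[0]], [s]) for s in STATES}
    for symbol in obs[1:]:
        step = {}
        for s in STATES:
            prob, path = max(((best[r][0] * TRANS[r][s], best[r][1]) for r in STATES),
                             key=lambda c: c[0])
            step[s] = (prob * EMIT[s][symbol], path + [s])
        best = step
    return max(best.values(), key=lambda c: c[0])


def posterior(obs, position):
    """Question 3: P(hidden state at position | all data), for each state."""
    fwd, bwd = forward_table(obs), backward_table(obs)
    total = sum(fwd[-1].values())
    return {s: fwd[position][s] * bwd[position][s] / total for s in STATES}


if __name__ == "__main__":
    obs = "AATCCGG"
    print(data_probability(obs))
    print(viterbi(obs))
    print(posterior(obs, 3))
```

For stochastic context-free grammars the analogous computations are the inside, CYK and inside-outside algorithms, which work on subsequences (i, j) instead of prefixes.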

  15. Comparative Gene Finding Jakob Skou Pedersen & Hein, 2004

  16. Knudsen & Hein, 99

  17. From Knudsen & Hein (1999)

  18. Knudsen and Hein, 2003

  19. Why combine RNA & Protein Models? Short Term/Long Term Evolution Discrepancies. Separating Selective Effects. Analyzing one level without interference from the other level. Predicting gene structure and RNA structure better. Annotation of Viral Genomes.

  20. Combining Levels of Selection. Assume multiplicativity: $f_{A,B} = f_A \cdot f_B$. Protein-Protein: Hein & Støvlbæk, 1995; Codon Nucleotide Independence Heuristic; Jensen & Pedersen, 2001; Contagious Dependence. Protein-RNA: Singlet, Doublets; Contagious Dependence.

  21. Overlapping Coding Regions (Hein & Stoevlbaek, 95). (Transition, transversion) rates by site class in the 1st and 2nd reading frame:

                         1st: 1-1-1-1       1st: 2-2           1st: 4
      2nd: 1-1-1-1       (f1f2a, f1f2b)     (f2a, f1f2b)       (f2a, f2b)
      2nd: 2-2           (f1a, f1f2b)       (a, f1f2b)         (a, f2b)
      2nd: 4             (f1a, f1b)         (a, f1b)           (a, b)

      Example: Gag & Pol from HIV. Number of sites by class:

                         Gag: 1-1-1-1   Gag: 2-2   Gag: 4
      Pol: 1-1-1-1       64             31         34
      Pol: 2-2           40              7          0
      Pol: 4             27              2          0

      MLE: a = .084, b = .024, a+2b = .133, f_gag = .403, f_pol = .229. Ziheng Yang has an alternative model to this, where sites are lumped into the same category if they have the same configuration of positions and reading frames.
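The rate table for overlapping reading frames follows from applying the single-gene rule (1-1-1-1: (f*a, f*b); 2-2: (a, f*b); 4: (a, b)) once per frame and multiplying the factors. The sketch below is one way to write that down; the function names are assumptions made for this illustration. Under this rule a site that is 2-2 in both frames comes out as (a, f1*f2*b).

```python
def selection_factors(site_class: str, f: float):
    """(transition factor, transversion factor) imposed by one reading frame."""
    if site_class == "1-1-1-1":
        return f, f      # every change replaces the amino acid
    if site_class == "2-2":
        return 1.0, f    # transitions are silent, transversions are not
    if site_class == "4":
        return 1.0, 1.0  # every change is silent
    raise ValueError(f"unknown site class: {site_class!r}")


def overlapping_rates(class1: str, class2: str, a: float, b: float,
                      f1: float, f2: float):
    """(transition, transversion) rates at a site covered by two reading frames."""
    ts1, tv1 = selection_factors(class1, f1)
    ts2, tv2 = selection_factors(class2, f2)
    return a * ts1 * ts2, b * tv1 * tv2  # multiplicativity across frames


if __name__ == "__main__":
    # Estimates quoted on the slide; which gene is frame 1 is arbitrary here.
    a, b, f_gag, f_pol = 0.084, 0.024, 0.403, 0.229
    for c1 in ("1-1-1-1", "2-2", "4"):
        for c2 in ("1-1-1-1", "2-2", "4"):
            print(c1, c2, overlapping_rates(c1, c2, a, b, f_gag, f_pol))
```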

  22. HIV2 Analysis. Hasegawa, Kishino & Yano Substitution Model Parameters:

        α*t     β*t     pA      pC      pG      pT
        0.350   0.105   0.361   0.181   0.236   0.222
        0.015   0.005   0.004   0.003   0.003

      Selection Factors:

        GAG   0.385 (s.d. 0.030)
        POL   0.220 (s.d. 0.017)
        VIF   0.407 (s.d. 0.035)
        VPR   0.494 (s.d. 0.044)
        TAT   1.229 (s.d. 0.104)
        REV   0.596 (s.d. 0.052)
        VPU   0.902 (s.d. 0.079)
        ENV   0.889 (s.d. 0.051)
        NEF   0.928 (s.d. 0.073)

      Estimated Distance per Site: 0.194

  23. Evolution under double constraints. Codon Nucleotide Independence Heuristic. Singlet: $R_{i,j} = f \cdot q_{i,j}$. Doublet: $R_{(i_1,i_2),(j_1,j_2)} = f_1 \cdot f_2 \cdot q_{(i_1,i_2),(j_1,j_2)}$.

  24. Structure Prediction: Hepatitis C Analysis. [Figure: predicted RNA secondary structure, drawn as stem-loops of paired (G-C, A-U, G-U) and unpaired bases.]

  25. Evolution Models: A hierarchy of hypotheses. [Table residue. Columns: Codon Factors; transversion/transition ratio; Duplet distortion; Doublet/singlet ratio; Likelihood; # parameters. Models 0 to 5. Figure labels: codon positions 1, 2, 3; Singlet; Doublets.] Likelihoods, in the order given: L = 1.0531 × 10^-25927; L = 2.0596 × 10^-25797; L = 1.3104 × 10^-21569; L = 2.5006 × 10^-21513; L = 4.5739 × 10^-21484; L = 2.1155 × 10^-21473. Remaining entries as extracted: - - 4; - - - - - +; - - 0.173 0.415 0.415 0.414 0.292; 5; (f1: 0.24, f2: 0.14), four times; ts/tv = 2.00; 3 (ts/tv) = 1.50, 1.26, 3.05; 3 (ts/tv, equil.); 3 (ts/tv, equil.); 7; 9; 15; 17.

  26. Combined RNA & Protein Structure. Gene Structure Fixed, RNA Structure Stochastic: presently being implemented with viral analysis in mind. Both RNA & Gene Structure Stochastic: would imply gene finding as well; a grammar for overlapping genes is a new phenomenon. Gene Structure Stochastic, RNA Structure Fixed: an atypical situation. A challenge for the future: structure evolution.

  27. Open Problems. Stacking Substitution Models [figure labels: N1 N2 N4 N3]: in principle a $4^4 \times 4^4$ matrix (65,536 entries) is needed, but proper parametrisation and symmetries could reduce this substantially. Other Sets of Constraints: Regulatory Signals. Combining with Alignment [figure residue: A C G T A T C G T T C G T].

  28. References.
      Hein, J. & J. Stoevlbaek (1995) "A maximum-likelihood approach to analyzing nonoverlapping and overlapping reading frames" J. Mol. Evol. 40:181-189.
      Jensen, J.L. & Pedersen (2001) "Probabilistic models of DNA sequence evolution with context dependent rates of substitution" Adv. Appl. Prob. 32:499-517.
      Katz and Burge (2003) "Widespread Selection for Local RNA Secondary Structure in Coding Regions of Bacterial Genes" Genome Research 13:2042-51.
      Kirby, A.K., S.V. Muse & W. Stephan (1995) "Maintenance of pre-mRNA secondary structure by epistatic selection" PNAS 92:9047-51.
      Knudsen & Hein (1999) "Predicting RNA Structure using Stochastic Context Free Grammars and Molecular Evolution" Bioinformatics 15(6):446-454.
      Knudsen and Hein (2003) "Pfold: RNA secondary structure prediction using stochastic context-free grammars" Nucleic Acids Research 31(13):3423-28.
      New Influenza gene article???
      Meyer and Durbin (2002) "Comparative Ab Initio prediction of Gene Structure using pair HMMs" Bioinformatics 18(10):1309-18.
      Moulton, V., Zuker, M., Steel, M., Penny, D. and Pointon, R. (2000) "Metrics on RNA Structures" J. Computational Biology 7(1):277-292.
      Pedersen, A.M.K. & J.L. Jensen (2001) "A Dependent-Rates Model and an MCMC-Based Methodology for the Maximum-Likelihood Analysis of Sequences with Overlapping Reading Frames" Mol. Biol. Evol. 18(5):763-76.
      Pedersen, J.S. & J. Hein (2003) "Gene finding with a Hidden Markov Model of genome structure and evolution" Bioinformatics.
      Pedersen, Forsberg, Meyer, Simmonds and Hein (2003) "An evolutionary model for protein coding regions with RNA secondary structure" Manuscript in preparation.
      Pedersen, Forsberg, Meyer, Simmonds and Hein (2003) "Structure Models" Manuscript in preparation.
      Schadt, E. & K. Lange (2002) "Codon and Rate Variation Models in Molecular Phylogeny" Mol. Biol. Evol. 19(9):1534-49.
      Savill, N.J. et al. (2001) "RNA Sequence Evolution With Secondary Structure Constraints: Comparison of Substitution Rate Models Using Maximum-Likelihood Methods" Genetics 157:399-411.
      Simmonds, P. and D.B. Smith (July 1999) "Structural Constraints on RNA Virus Evolution" J. of Virology 5787-94.
      Tillier, E.R.M. & R.A. Collins (1998) "High Apparent Rate of Simultaneous Compensatory Base-Pair Substitutions in Ribosomal RNA" Genetics 149:1993-2001.
      Yang, Z. et al. (1995) "Molecular Evolution of the Hepatitis B Virus Genome" J. Mol. Evol. 41:587-96.

  29. Acknowledgements. 1. Comparative RNA Structure: Bjarne Knudsen. 2. Comparative Gene Structure: Jakob Skou Pedersen. 3. Integrating Levels of Selection & Structure: Jakob Skou Pedersen, Irmtraud Meyer, Roald Forsberg.
