
Simon Whelan University of Manchester

A simple model for a complex world. Simon Whelan University of Manchester. Isaac Newton Institute. Modelling sequence evolution. Seq1 TCTTTATTGACGTGTATGGACAATTC... Seq2 TCTTTGTTAACGTGCATGGACAATTC... Seq3 TCCTTGCTAACATGCATGGACAATTC... Seq4 TCTTTGCTAACGTGCATGGATAATTC...


Presentation Transcript


  1. A simple model for a complex world. Simon Whelan, University of Manchester. Isaac Newton Institute.

  2. Modelling sequence evolution
     Seq1 TCTTTATTGACGTGTATGGACAATTC...
     Seq2 TCTTTGTTAACGTGCATGGACAATTC...
     Seq3 TCCTTGCTAACATGCATGGACAATTC...
     Seq4 TCTTTGCTAACGTGCATGGATAATTC...
     Seq5 TCTT---TAACGTGCATAGATAACTC...
     Seq6 TCAC---TAACATGTATAGATAACTC...
     Seq7 TCTCTTCTAACGTGCATTGTGAAGTC...
     Seq8 TCTCTTTTGACATGTATTGAAAAATC...
     • Simple models assume:
       • All sites evolve according to the same process
       • All parts of the tree evolve according to the same process
     [Figure: diagram labelled with the nucleotides A, C, G and T]

  3. Spatial heterogeneity in sequence evolution
     Also known as pattern heterogeneity
     Seq1 TCTTTATTGACGTGTATGGACAATTC...
     Seq2 TCTTTGTTAACGTGCATGGACAATTC...
     Seq3 TCCTTGCTAACATGCATGGACAATTC...
     Seq4 TCTTTGCTAACGTGCATGGATAATTC...
     Seq5 TCTT---TAACGTGCATAGATAACTC...
     Seq6 TCAC---TAACATGTATAGATAACTC...
     Seq7 TCTCTTCTAACGTGCATTGTGAAGTC...
     Seq8 TCTCTTTTGACATGTATTGAAAAATC...
     [Figure: nucleotide (A, C, G, T) diagrams labelled Rate = 2.0, Rate = 0.5 and Rate = 1.0]


  6. Temporal heterogeneity in sequence evolution
     Seq1 TCTTTATTGACGTGTATGGACAATTC...
     Seq2 TCTTTGTTAACGTGCATGGACAATTC...
     Seq3 TCCTTGCTAACATGCATGGACAATTC...
     Seq4 TCTTTGCTAACGTGCATGGATAATTC...
     Seq5 TCTT---TAACGTGCATAGATAACTC...
     Seq6 TCAC---TAACATGTATAGATAACTC...
     Seq7 TCTCTTCTAACGTGCATTGTGAAGTC...
     Seq8 TCTCTTTTGACATGTATTGAAAAATC...
     [Figure: nucleotide (A, C, G, T) diagrams labelled Rate = 2.0, Rate = 0.5 and Rate = 1.0]


  8. Motivation
     • Biological
       • Not including heterogeneity leads to inaccurate inferences (Naylor; Lockhart)
       • Form of heterogeneity is poorly characterised
       • Understanding heterogeneity may lead to biological insights
     • Modelling
       • Needs to describe general heterogeneity (Warnow)
       • Must be identifiable (Rhodes; Allman)
       • Should be computationally efficient:
         • Few parameters
         • Small(-ish) number of states
         • Applicable to tree search

  9. Temporal hidden Markov models (THMMs)
     • General type of model
       • Describes temporal and spatial heterogeneity
       • Allows simple likelihood computation (reversible; stationary; i.i.d.)
     • Previous incarnations
       • Mostly examine temporal and spatial rate variation
       • Covarion model of Tuffley and Steel and its progeny
     • Other names include:
       • Markov modulated Markov processes (models)
       • Switching processes
       • Covarion-like

  10. Substitution processes
      There are g separate HKY substitution processes (k = 1,…,g), each representing a hidden state in an HMM.
      The kth hidden state is defined by a rate matrix M^k, parameterised by:
      • the nucleotide distribution of hidden state k
      • the rate of hidden state k
      • the transition/transversion rate ratio of hidden state k
      Note: Subscripts refer to observable states. Superscripts refer to hidden states.
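
The rate matrix of a single hidden state can be sketched in a few lines. The slide's own formula is not preserved in this transcript, so the version below assumes the common HKY parameterisation (off-diagonal rate proportional to the target frequency, multiplied by the Ts/Tv ratio for transitions) and assumes the matrix is scaled so that its mean rate at equilibrium equals the state's rate. The names `hky_rate_matrix`, `pi`, `kappa` and `rate` are illustrative, not taken from the talk.

```python
NUCLEOTIDES = "ACGT"
PURINES = {"A", "G"}


def is_transition(a, b):
    """True when a and b are both purines or both pyrimidines."""
    return (a in PURINES) == (b in PURINES)


def hky_rate_matrix(pi, kappa, rate=1.0):
    """Return a 4x4 HKY rate matrix with rows/columns ordered A, C, G, T.

    pi    : equilibrium nucleotide frequencies (sum to 1)
    kappa : transition/transversion rate ratio
    rate  : mean substitution rate after scaling (assumed convention)
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(NUCLEOTIDES):
        for j, b in enumerate(NUCLEOTIDES):
            if i != j:
                q[i][j] = (kappa if is_transition(a, b) else 1.0) * pi[j]
        # The diagonal is still 0.0 here, so this is minus the off-diagonal sum.
        q[i][i] = -sum(q[i])
    # Rescale so the expected rate at equilibrium equals `rate`.
    mean_rate = -sum(pi[i] * q[i][i] for i in range(4))
    return [[rate * x / mean_rate for x in row] for row in q]
```

Calling `hky_rate_matrix` once per hidden state, each with its own frequencies, ratio and rate, would give the g matrices the slide describes.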

  11. Temporal heterogeneity: a switching model
      A reversible Markov model describes the rate of switching between hidden states.
      This process is defined by a g x g rate matrix C, parameterised by:
      • the exchangeability between hidden states k and l
      • the probability of each hidden state
      Note: Subscripts refer to observable states. Superscripts refer to hidden states.
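
A reversible switching matrix can be built from a symmetric exchangeability table and a vector of hidden-state probabilities. The sketch below assumes the usual GTR-style construction (rate from k to l equals the exchangeability times the probability of l); the slide's exact definition is not preserved, and `switching_rate_matrix` and its argument names are illustrative.

```python
def switching_rate_matrix(exchangeability, state_probs):
    """Return a g x g rate matrix for switching between hidden states.

    exchangeability : symmetric g x g table (diagonal ignored)
    state_probs     : probability of each hidden state (sum to 1)
    """
    g = len(state_probs)
    c = [[0.0] * g for _ in range(g)]
    for k in range(g):
        for l in range(g):
            if k != l:
                # Assumed form: exchangeability scaled by the target state's probability.
                c[k][l] = exchangeability[k][l] * state_probs[l]
        # Diagonal makes each row sum to zero.
        c[k][k] = -sum(c[k])
    return c
```

With a symmetric exchangeability table this satisfies detailed balance with respect to `state_probs`, which is what makes the switching process reversible.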

  12. Defining a THMM
      The 4g x 4g instantaneous rate matrix describes changes between observable states i, j and hidden states k, l.
      Hidden states and observable states do not change simultaneously.
      [Equations for the rate matrix and its equilibrium distribution are not preserved in the transcript]
      Note: Subscripts refer to observable states. Superscripts refer to hidden states.
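
The block structure of the combined matrix can be sketched as follows. This is one possible construction, not necessarily the one on the slide: substitutions happen within a hidden state, switches keep the nucleotide fixed, and simultaneous changes have rate zero. To keep the process reversible with equilibrium equal to (hidden-state probability) x (nucleotide frequency in that state), the sketch assumes that the switching rate into hidden state l at nucleotide i is scaled by that state's frequency of i. All function and argument names are hypothetical.

```python
def thmm_rate_matrix(sub_matrices, freqs, exchangeability, state_probs):
    """Return a 4g x 4g rate matrix; state (k, i) has index 4*k + i.

    sub_matrices    : g reversible 4x4 rate matrices, one per hidden state
    freqs           : g nucleotide distributions (stationary for each matrix)
    exchangeability : symmetric g x g switching exchangeabilities
    state_probs     : probability of each hidden state
    """
    g = len(sub_matrices)
    n = 4 * g
    q = [[0.0] * n for _ in range(n)]
    for k in range(g):
        for i in range(4):
            row = 4 * k + i
            # Substitution i -> j within hidden state k.
            for j in range(4):
                if j != i:
                    q[row][4 * k + j] = sub_matrices[k][i][j]
            # Switch k -> l with the nucleotide unchanged (assumed scaling).
            for l in range(g):
                if l != k:
                    q[row][4 * l + i] = (
                        exchangeability[k][l] * state_probs[l] * freqs[l][i]
                    )
            # Simultaneous changes stay at 0.0; the diagonal balances the row.
            q[row][row] = -sum(q[row])
    return q


def thmm_equilibrium(freqs, state_probs):
    """Equilibrium under the assumed construction: probs[k] * freqs[k][i]."""
    return [state_probs[k] * freqs[k][i] for k in range(len(freqs)) for i in range(4)]
```

Setting every exchangeability to zero removes all off-block entries, leaving g independent substitution processes weighted by `state_probs`.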

  13. THMMs for spatial and temporal heterogeneity
      [Figure: bubble plots over nucleotides A, C, G, T for hidden states State 1, State 2 and State 3, and of the switching matrix C between States 1, 2 and 3]
      Rate of transitions between hidden states relative to substitution rate: 0.07
      Note: Value proportional to bubble area

  14. Mixture models for spatial heterogeneity
      Restricting all switching rates between hidden states to zero results in a mixture model.
      Probability of different hidden states accounted for by the equilibrium distribution at the root.
      [Figure: plots over nucleotides A, C, G, T for hidden states State 1, State 2 and State 3]

  15. Mixture models for spatial heterogeneity
      [Figure: three probability terms Pr( ); their arguments are not recoverable from the transcript]
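
For reference, the standard form of a mixture-model site likelihood is a weighted sum over hidden states. The expression below is a sketch under assumed notation: $D_s$ (the data at site $s$) and $\omega^{k}$ (the probability of hidden state $k$) are symbols chosen here, with superscripts indexing hidden states as in the slides' convention.

```latex
\Pr(D_s) \;=\; \sum_{k=1}^{g} \omega^{k}\, \Pr\!\left(D_s \mid M^{k}\right)
```

Each term is an ordinary single-process likelihood on the tree under rate matrix $M^{k}$, so a mixture with g classes costs roughly g times a single-model likelihood evaluation.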

  16. Investigating heterogeneity in groEL
      • Data
        • Herbeck et al. (2005) examined groEL sequences to investigate origins of primary endosymbionts
        • Variability of GC content demonstrated to affect tree estimate
        • There are 23 sequences of length 1572 nucleotides (all 3 codon positions)

  17. Investigating heterogeneity in groEL
      • Investigating spatial heterogeneity
        • Use mixture model (switching rates set to 0)
        • Examine 2 and 3 hidden states
        • Relative importance of allowing rate, nucleotide frequencies, or Ts/Tv bias to vary between classes
        • Importance of all HKY parameters varying between classes
      • Investigating simple temporal heterogeneity
        • Single extra degree of freedom over mixture models
        • Use ‘simple’ THMM (switching exchangeabilities set to equal)
        • Relative importance of allowing different HKY parameters to vary temporally
      • Investigating simple temporal heterogeneity
        • GTR switching allows all switching exchangeabilities to vary

  18. Results: groEL (no Γ-distribution)
      lnL(HKY) = -18579.6
      lnL(HKY+dG) = -16209.9
      [Table: improvement in lnL over HKY, with improvement in AIC over HKY in parentheses]
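
The relationship between the two reported quantities (improvement in lnL and improvement in AIC) follows from AIC = 2k - 2 lnL. The sketch below applies it to the two log-likelihoods quoted on this slide, assuming that HKY+dG differs from HKY by exactly one free parameter (a gamma shape parameter); that parameter count is an assumption, and the function names are illustrative.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def aic_improvement(lnl_base, lnl_alt, extra_params):
    """AIC(base) - AIC(alt) when alt has `extra_params` more free parameters.

    Parameters shared by both models cancel, so only the difference matters.
    """
    return aic(lnl_base, 0) - aic(lnl_alt, extra_params)


# Log-likelihoods from the slide; extra_params=1 is an assumption.
improvement = aic_improvement(-18579.6, -16209.9, extra_params=1)
print(round(improvement, 1))  # 4737.4
```

A positive value favours the richer model; each added parameter has to buy at least one unit of log-likelihood to pay for itself.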

  19. It’s all about rate: frequencies
      [Figure: bubble plots over nucleotides A, C, G, T, and of the switching matrix C]
      Rate of transitions between hidden states relative to substitution rate: 0.04

  20. Results: groEL (with Γ-distribution)
      lnL(HKY) = -18579.6
      lnL(HKY+dG) = -16209.9
      [Table: improvement in lnL over HKY+Γ, with improvement in AIC over HKY+Γ in parentheses]

  21. THMM+Γ Frequencies
      [Figure: bubble plots over nucleotides A, C, G, T for hidden states State 1, State 2 and State 3, and of the switching matrix C between States 1, 2 and 3]
      Rate of transitions between hidden states relative to substitution rate: 0.14

  22. THMM+Γ All+H
      [Figure: bubble plots over nucleotides A, C, G, T for hidden states State 1, State 2 and State 3, and of the switching matrix C between States 1, 2 and 3]
      Rate of transitions between hidden states relative to substitution rate: 0.07

  23. More results: data from PANDIT
      ΣlnL(HKY) = -1 053 026.8
      ΣlnL(HKY+dG) = -1 017 588.4
      Improvement = 35 438.4
      [Table: Σ improvement in lnL over HKY+Γ, with Σ improvement in AIC over HKY+Γ in parentheses]

  24. More evolution = more heterogeneity?
      [Figure: All+Γ with GTR switching; unlabelled values in the transcript: 81 51 74 38 83 70 52 77 60 64 24 48 33 23 23 46]
      (Looks similar for dN/dS and dN)

  25. More evolution = more heterogeneity?
      • Potential cause 1: Something wrong with the statistics
        • ΔAIC per site relative to HKY(+Γ) is not correcting properly for improvements given by additional branches or something else…
        • Some kind of systematic error as tree length grows, such as tree estimate accuracy
      • Potential cause 2: Something biologically interesting
        • As tree length grows the substitution process tends to appear more heterogeneous

  26. [Figure: nucleotide (A, C, G, T) diagrams labelled Rate = 2.0, Rate = 0.5 and Rate = 1.0, shown along a Time axis]


  30. A simple model for describing complexity
      • Degeneracy of the genetic code
        • The degeneracy of the genetic code can lead to staccato patterns of evolution, particularly at the 3rd codon position
        • Present in nearly all analyses of nucleotide coding data
      • Other types of complexity
        • Any sequence where biological function places restrictions on how sites change and those restrictions have the potential to vary over time
      [Figure: "Hidden biological process", with panels labelled 4-fold degeneracy, 2-fold degeneracy and 1-fold degeneracy]
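
The degeneracy classes named on this slide can be computed directly from the standard genetic code. The sketch below counts how many of the four bases at the 3rd codon position encode the same amino acid as the given codon. It uses the standard code only, and the function names are illustrative rather than taken from the talk.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def translate(codon):
    """Translate one DNA codon (e.g. 'ATG') with the standard code."""
    i, j, k = (BASES.index(base) for base in codon)
    return AMINO_ACIDS[16 * i + 4 * j + k]


def third_position_degeneracy(codon):
    """Number of bases at the 3rd position that give the same amino acid."""
    target = translate(codon)
    return sum(translate(codon[:2] + base) == target for base in BASES)


for codon in ("GGA", "AAA", "ATG"):
    print(codon, translate(codon), third_position_degeneracy(codon))
```

A site that is 4-fold degenerate in one lineage can become 2-fold or non-degenerate after a change at the 1st or 2nd position, which is one way the constraints on a site can vary over time.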

  31. Conclusions
      • Temporal and spatial heterogeneity
        • Spatial variation in rate masks other effects
        • Most complex model provides best description of data in all cases
        • Progression to 4 hidden state models provides further improvement, but runs into numerical optimisation problems
      • Biological causes of heterogeneity
        • May occur whenever there is biological function in sequence data
        • Long evolutionary times may require (even) more sophisticated models
      • THMMs could provide a simple framework for describing and drawing inferences from heterogeneity induced by complex dependencies
