A simple model for a complex world
Simon Whelan
University of Manchester
Isaac Newton Institute
Modelling sequence evolution
Seq1 TCTTTATTGACGTGTATGGACAATTC...
Seq2 TCTTTGTTAACGTGCATGGACAATTC...
Seq3 TCCTTGCTAACATGCATGGACAATTC...
Seq4 TCTTTGCTAACGTGCATGGATAATTC...
Seq5 TCTT---TAACGTGCATAGATAACTC...
Seq6 TCAC---TAACATGTATAGATAACTC...
Seq7 TCTCTTCTAACGTGCATTGTGAAGTC...
Seq8 TCTCTTTTGACATGTATTGAAAAATC...
• Simple models assume:
  • All sites evolve according to the same process
  • All parts of the tree evolve according to the same process
[Figure: a single substitution process among the nucleotides A, C, G and T]
Spatial heterogeneity in sequence evolution
Also known as pattern heterogeneity
[Figure: the Seq1 to Seq8 alignment with substitution processes among A, C, G and T labelled Rate = 2.0, Rate = 0.5 and Rate = 1.0]
Temporal heterogeneity in sequence evolution
[Figure: the Seq1 to Seq8 alignment with substitution processes among A, C, G and T labelled Rate = 2.0, Rate = 0.5 and Rate = 1.0]
Motivation
• Biological
  • Not including heterogeneity leads to inaccurate inferences (Naylor; Lockhart)
  • Form of heterogeneity is poorly characterised
  • Understanding heterogeneity may lead to biological insights
• Modelling
  • Needs to describe general heterogeneity (Warnow)
  • Must be identifiable (Rhodes; Allman)
  • Should be computationally efficient:
    • Few parameters
    • Small(-ish) number of states
    • Applicable to tree search
Temporal hidden Markov models (THMMs)
• General type of model
  • Describes temporal and spatial heterogeneity
  • Allows simple likelihood computation (reversible; stationary; i.i.d.)
• Previous incarnations
  • Mostly examine temporal and spatial rate variation
  • Covarion model of Tuffley and Steel and its progeny
• Other names include:
  • Markov modulated Markov processes (models)
  • Switching processes
  • Covarion-like
Substitution processes
• There are g separate HKY substitution processes (k = 1,…,g), each representing a hidden state in an HMM
• The kth hidden state is defined by the rate matrix M^k, parameterised by:
  • the nucleotide distribution of hidden state k
  • the rate of hidden state k
  • the transition/transversion rate ratio of hidden state k
Note: Subscripts refer to observable states. Superscripts refer to hidden states
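As a hedged illustration, a standard HKY rate matrix for a single hidden state can be sketched in Python as below. The names `hky_rate_matrix`, `pi`, `kappa` and `rate` are assumptions made for this sketch, as is the choice to scale the matrix so that the expected substitution rate at equilibrium equals `rate`; the talk's own parameterisation may differ in detail.

```python
NUCLEOTIDES = "ACGT"
PURINES = {"A", "G"}


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return a != b and (a in PURINES) == (b in PURINES)


def hky_rate_matrix(pi, kappa, rate=1.0):
    """Return a 4x4 HKY rate matrix (rows/columns ordered A, C, G, T).

    pi    : nucleotide distribution of the hidden state (sums to 1)
    kappa : transition/transversion rate ratio of the hidden state
    rate  : expected substitutions per unit time at equilibrium
            (assumed scaling convention, not taken from the slides)
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(NUCLEOTIDES):
        for j, b in enumerate(NUCLEOTIDES):
            if i != j:
                q[i][j] = (kappa if is_transition(a, b) else 1.0) * pi[j]
        q[i][i] = -sum(q[i])  # diagonal makes each row sum to zero
    mean_rate = -sum(pi[i] * q[i][i] for i in range(4))
    scale = rate / mean_rate
    return [[x * scale for x in row] for row in q]


# Hypothetical GC-rich, fast-evolving hidden state (values are illustrative)
m_fast = hky_rate_matrix(pi=[0.1, 0.4, 0.4, 0.1], kappa=2.0, rate=2.0)
```

Each of the g hidden states would get its own call with its own `pi`, `kappa` and `rate`.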
Temporal heterogeneity: a switching model
• A reversible Markov model describes the switching rate between hidden states
• This process is defined by the g x g rate matrix C, parameterised by:
  • the exchangeability between hidden states k and l
  • the probability of each hidden state
Note: Subscripts refer to observable states. Superscripts refer to hidden states
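One common way to build a reversible switching matrix from these two ingredients is to multiply each symmetric exchangeability by the probability of the target hidden state. The Python sketch below assumes that construction; the names `switching_matrix`, `exchangeability` and `psi`, and all numeric values, are hypothetical.

```python
def switching_matrix(exchangeability, psi):
    """Return a g x g reversible switching rate matrix C.

    exchangeability : symmetric g x g list of lists (diagonal ignored)
    psi             : probabilities of the g hidden states (sum to 1)
    """
    g = len(psi)
    c = [[0.0] * g for _ in range(g)]
    for k in range(g):
        for l in range(g):
            if k != l:
                # rate k -> l = exchangeability(k, l) * probability of state l
                c[k][l] = exchangeability[k][l] * psi[l]
        c[k][k] = -sum(c[k])  # rows of a rate matrix sum to zero
    return c


# Hypothetical values for three hidden states
s = [[0.0, 0.5, 0.1],
     [0.5, 0.0, 0.2],
     [0.1, 0.2, 0.0]]
psi = [0.5, 0.3, 0.2]
C = switching_matrix(s, psi)
```

Because the exchangeabilities are symmetric, `psi[k] * C[k][l] == psi[l] * C[l][k]`, which is the detailed-balance condition for reversibility.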
Defining a THMM
• The 4g x 4g instantaneous rate matrix describes changes between observable states i, j and hidden states k, l
• The equilibrium distribution combines the probability of each hidden state with its nucleotide distribution
• Hidden states and observable states do not change simultaneously
Note: Subscripts refer to observable states. Superscripts refer to hidden states
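A minimal Python sketch of how such a 4g x 4g matrix could be assembled is given below. It rests on assumptions that are not confirmed by the slides: compound state (k, i) is stored at index 4k + i, each hidden state uses a standard HKY matrix, and a switch from hidden state k to l at nucleotide i has rate exchangeability x (probability of l) x (frequency of i in l), which is one form that keeps the process reversible with equilibrium psi[k] * pis[k][i]. All function and argument names are hypothetical.

```python
TRANSITIONS = {(0, 2), (2, 0), (1, 3), (3, 1)}  # A<->G, C<->T (order A, C, G, T)


def hky_offdiag(pi, kappa, rate):
    """Off-diagonal HKY rates, scaled to mean rate `rate` (diagonal left at 0)."""
    q = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                q[i][j] = (kappa if (i, j) in TRANSITIONS else 1.0) * pi[j]
    mean = sum(pi[i] * sum(q[i]) for i in range(4))
    return [[x * rate / mean for x in row] for row in q]


def thmm_rate_matrix(pis, kappas, rates, exchangeability, psi):
    """Assemble the 4g x 4g rate matrix over (hidden state, nucleotide) pairs."""
    g = len(psi)
    q = [[0.0] * (4 * g) for _ in range(4 * g)]
    for k in range(g):
        m_k = hky_offdiag(pis[k], kappas[k], rates[k])
        for i in range(4):
            row = 4 * k + i
            for j in range(4):
                if j != i:  # observable change, hidden state fixed
                    q[row][4 * k + j] = m_k[i][j]
            for l in range(g):
                if l != k:  # hidden change, nucleotide fixed (assumed form)
                    q[row][4 * l + i] = exchangeability[k][l] * psi[l] * pis[l][i]
            # entries where both k and i change stay 0: no simultaneous changes
            q[row][row] = -sum(q[row])
    return q


def equilibrium(pis, psi):
    """Equilibrium distribution over compound states under the assumed form."""
    return [psi[k] * pis[k][i] for k in range(len(psi)) for i in range(4)]


# Hypothetical two-hidden-state example
pis = [[0.1, 0.4, 0.4, 0.1], [0.25, 0.25, 0.25, 0.25]]
Q = thmm_rate_matrix(pis, kappas=[2.0, 4.0], rates=[2.0, 0.5],
                     exchangeability=[[0.0, 0.3], [0.3, 0.0]], psi=[0.6, 0.4])
```

Setting every exchangeability to zero makes `Q` block-diagonal, with one independent HKY process per hidden state.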
THMMs for spatial and temporal heterogeneity
[Figure: substitution processes among A, C, G and T for hidden states 1, 2 and 3, with the switching matrix C between the states shown as bubbles]
Rate of transitions between hidden states relative to substitution rate: 0.07
Note: Value proportional to bubble area
Mixture models for spatial heterogeneity
• Restricting all exchangeabilities between hidden states to zero results in a mixture model
• Probability of different hidden states accounted for by the equilibrium distribution at the root
[Figure: substitution processes among A, C, G and T for hidden states 1, 2 and 3, with no switching between states]
Mixture models for spatial heterogeneity
[Figure: three Pr( ) terms with graphical arguments]
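With no switching, the hidden state of a site is fixed across the whole tree, so in the usual mixture formulation the likelihood of a site is the probability-weighted sum of its likelihoods under each hidden state. The Python sketch below assumes the per-state site likelihoods have already been computed elsewhere (for example by a pruning algorithm); `mixture_log_likelihood` and its arguments are hypothetical names and the numbers are made up for illustration.

```python
import math


def mixture_log_likelihood(weights, site_likelihoods):
    """Log-likelihood of an alignment under a mixture of hidden states.

    weights          : probability of each hidden state (sums to 1)
    site_likelihoods : one row per site, one likelihood per hidden state
    """
    total = 0.0
    for per_state in site_likelihoods:
        # weighted sum over hidden states, then log; sites are i.i.d.
        total += math.log(sum(w * l for w, l in zip(weights, per_state)))
    return total


# Illustrative values only: 2 sites, 3 hidden states
lnl = mixture_log_likelihood(
    weights=[0.5, 0.3, 0.2],
    site_likelihoods=[[1e-3, 4e-3, 2e-4],
                      [5e-5, 1e-5, 9e-4]],
)
```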
Investigating heterogeneity in groEL
• Data
  • Herbeck et al. (2005) examined groEL sequences to investigate origins of primary endosymbionts
  • Variability of GC content demonstrated to affect tree estimate
  • There are 23 sequences of length 1572 nucleotides (all 3 codon positions)
Investigating heterogeneity in groEL
• Investigating spatial heterogeneity
  • Use mixture model (exchangeabilities between hidden states set to 0)
  • Examine 2 and 3 hidden states
  • Relative importance of letting rate, nucleotide frequencies, or Ts/Tv bias vary between hidden states
  • Importance of all HKY parameters varying between classes
• Investigating simple temporal heterogeneity
  • Single extra degree of freedom over mixture models
  • Use ‘simple’ THMM (exchangeabilities between hidden states set to equal)
  • Relative importance of allowing different HKY parameters to vary temporally
• Investigating simple temporal heterogeneity
  • GTR switching allows all exchangeabilities to vary
Results: groEL (no Γ-distribution)
lnL(HKY) = -18579.6
lnL(HKY+dG) = -16209.9
[Table: improvement in lnL over HKY (improvement in AIC over HKY in parentheses)]
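For reference, the two quantities reported in this kind of table can be computed as in the Python sketch below, using the standard definition AIC = 2k - 2 lnL and the two log-likelihoods quoted on the slide. It assumes that HKY+dG differs from HKY by a single extra free parameter (the shape of a discrete gamma distribution); the function names are hypothetical.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def aic_improvement(lnl_simple, k_simple, lnl_complex, k_complex):
    """Positive when the more complex model has the lower (better) AIC."""
    return aic(lnl_simple, k_simple) - aic(lnl_complex, k_complex)


LNL_HKY = -18579.6     # from the slide
LNL_HKY_DG = -16209.9  # from the slide
EXTRA_PARAMS = 1       # assumption: one gamma shape parameter

delta_lnl = LNL_HKY_DG - LNL_HKY
# Only the difference in parameter counts matters, so the shared
# parameters (branch lengths, HKY parameters) are counted as 0 here.
delta_aic = aic_improvement(LNL_HKY, 0, LNL_HKY_DG, EXTRA_PARAMS)
print(f"{delta_lnl:.1f} {delta_aic:.1f}")  # prints: 2369.7 4737.4
```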
It’s all about rate: frequencies
[Figure: substitution processes among A, C, G and T for the hidden states, with the switching matrix C between them]
Rate of transitions between hidden states relative to substitution rate: 0.04
Results: groEL (with Γ-distribution)
lnL(HKY) = -18579.6
lnL(HKY+dG) = -16209.9
[Table: improvement in lnL over HKY+Γ (improvement in AIC over HKY+Γ in parentheses)]
THMM+Γ Frequencies
[Figure: substitution processes among A, C, G and T for hidden states 1, 2 and 3, with the switching matrix C between them]
Rate of transitions between hidden states relative to substitution rate: 0.14
THMM+Γ All+H
[Figure: substitution processes among A, C, G and T for hidden states 1, 2 and 3, with the switching matrix C between them]
Rate of transitions between hidden states relative to substitution rate: 0.07
More results: data from PANDIT
ΣlnL(HKY) = -1 053 026.8
ΣlnL(HKY+dG) = -1 017 588.4
Improvement = 35 438.4
[Table: Σ improvement in lnL over HKY+Γ (Σ improvement in AIC over HKY+Γ in parentheses)]
More evolution = more heterogeneity?
[Figure: All+Γ with GTR switching; values shown: 81 51 74 38 83 70 52 77 60 64 24 48 33 23 23 46]
(Looks similar for dN/dS and dN)
More evolution = more heterogeneity?
• Potential cause 1: Something wrong with the statistics
  • ΔAIC per site relative to HKY(+Γ) is not correcting properly for improvements given by additional branches or something else…
  • Some kind of systematic error as tree length grows, such as tree estimate accuracy
• Potential cause 2: Something biologically interesting
  • As tree length grows the substitution process tends to appear more heterogeneous
[Figure: substitution processes among A, C, G and T labelled Rate = 2.0, Rate = 0.5 and Rate = 1.0, drawn against a Time axis]
A simple model for describing complexity
• Degeneracy of the genetic code
  • The degeneracy of the genetic code can lead to staccato patterns of evolution, particularly at the 3rd codon position
  • Present in nearly all analyses of nucleotide coding data
• Other types of complexity
  • Any sequence where biological function places restrictions on how sites change and those restrictions have the potential to vary over time
[Figure: hidden biological process, showing 4-fold, 2-fold and 1-fold degeneracy]
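To make the degeneracy classes concrete, the Python sketch below counts, for a given codon, how many bases at the 3rd position leave the encoded amino acid unchanged under the standard genetic code. The assumption that the sequences use the standard code, and the helper name `third_position_degeneracy`, are made only for this sketch.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def third_position_degeneracy(codon):
    """Number of 3rd-position bases that give the same amino acid as `codon`."""
    amino_acid = CODON_TABLE[codon]
    return sum(CODON_TABLE[codon[:2] + b] == amino_acid for b in BASES)


print(third_position_degeneracy("GGA"))  # glycine: prints 4
print(third_position_degeneracy("TTT"))  # phenylalanine: prints 2
print(third_position_degeneracy("ATG"))  # methionine: prints 1
```

A site that is 4-fold degenerate can change freely without altering the protein, whereas a 1-fold site cannot, so a single substitution at the first or second codon position can switch the constraints acting on the third.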
Conclusions
• Temporal and spatial heterogeneity
  • Spatial variation in rate masks other effects
  • Most complex model provides best description of data in all cases
  • Progression to 4 hidden state models provides further improvement, but runs into numerical optimisation problems
• Biological causes of heterogeneity
  • May occur whenever there is biological function in sequence data
  • Long evolutionary times may require (even) more sophisticated models
  • THMMs could provide a simple framework for describing and drawing inferences from heterogeneity induced by complex dependencies