Modeling Protein Sequence Evolution: Let's Get Real(er)!

Modeling protein sequence evolution: Lets get real(er)! Andrew J. Roger Dept. of Biochemistry & Molecular Biology Dalhousie University, Halifax, N.S. Canada

Dr. Christian Blouin Fac. of Comp Sci Dr. Ed Susko (Dept. of Math/Stats) Dr. Matt Spencer Univ. of Liverpool Karen Li (smart summer student) Dr. Huaichun Wang (postdoctoral fellow) Dan Gaston (Bioinf./Comp. Biol. M.Sc. student)

Lactobacillus E. coli Human protein g Shiitake mush. …STTTGHLIYKCGGIDKR… …STTMGNLAYQLGVFDQR… …STTVGNLAFQLGAIDAR… …STTVGMLSYQLGAVDKR… Probability of going from state ito j at protein g,site x, branch e: Pij site x I I branch e i j V F A ‘super-alignment’ of proteins

Current phylogenetic models of protein evolution • Codon models • parameterized in terms of rates of interchange between synonymous and non-synonymous codons • Model of amino acid interchange are assembled from frequencies of changes observed in large databases • PAM, JTT, VT, mtREV, WAG, PMB • Usually combined with model of among-site rate variation • e.g. JTT+G or JTT+G+invariable sites models • Adjust the matrix to reflect the equilibrium (stationary) frequencies of amino acids in your dataset • JTT+F+ G

Human Lactobacillus j i e Shiitake mushroom E. coli r3 r1 r1 r2 Probability of going from state i to j at protein g, site x, edge e e

The problem… • Such models are a DRASTIC over-simplification of what is really going on • Average over sites, average over lineages, average across families • Sites in proteins can change function over time • sites under purifying selection <--> neutral <--> positive selection • Every amino acid site in a protein has a unique structural/functional context • Hydrophobicity, polarity, charge, dihedral angle, size, functional group…etc…etc • Different sites have different exchangeabilities to different aa’s • Different “frequencies” of aa’s occur at different sites

Human Lactobacillus j i Shiitake mushroom E. coli r3 r1 r1 r2 Probability of going from state i to j at protein g, site x, branch e • Assumptions • ‘fast-evolving’ positions are always fast and slow-evolving positions are always slow • Sites (x’s) have the same rate of evolution (rx) on different branches (e’s)

fast fast slow slow Changing rates of evolution at sites in different parts of the tree of life (heterotachy) Archaebacteria EF-1a Eukaryotes EF-1a

Models that 'deal' with heterotachy (changing site rates across the tree) • Covarion models (stationary) • Tuffley and Steel (1998) • Galtier (2001) • Huelsenbeck (2002) • Wang et al. (2007) • Discrete rate-shift models • Gu 1999, 2002 • Bivariate rates: Susko et al. (2002) • Pupko and Galtier (2001) - LRT for diff. site rates in subtrees • Knudsen and Miyamoto (2001) • Mixture of edgelength models • Kolaczkowski and Thornton (2005) • Spencer et al. (2005) • Zhou et al. (2007)

Human Lactobacillus e j i Q Shiitake mushroom E. coli Probability of going from state i to j at protein g, site x, branch e • Assumptions • different sites (x’s) and branches (e’s) all evolve according to the same general ‘rules’ • - i.e. rate matrices (R’s) and frequencies (P’s) are the ‘same’ for all x and e

Hydrophobic amino acids

AcidicBasic Hydrophobic amino acids

D or E C, V or A V or L R or K Evolution of chaperonin 60 over ~1.5 billion years Plants Fungi Animals Protists Bacteria

Distribution of the number of different amino acid states in alignment columns HSP90 protein Simulated under JTT+F+ G model on HSP90 tree (1x105 sites) Number of sites Number of amino acid states observed at site

** < 0.001 * < 0.01 p-values from the 2 tests Protein family (sites) Z-test (uniformity) c2 test (states) RATE 1 RATE 2 RATE 3 RATE 4 ** ** ** ** ** EF-2 (669) * ** ** ** ILVD_EDD (310) 0.1954 ** ** ** ** ** HSP90 (459) ** ** ** ** ** NuoF (405) ** ** ** ** Glu_synth_NTN (253) 0.01174 ** ** ** ** Poty_coat (212) 0.1897 * CTP synthetase (212) ** ** ** ** ** ** ** ** ** SecA (203) ** ** ** ** 0.2872 EF1a (361) ** * ** ** * a-tubulin (375) ** * ** ** HSP70 (432) 0.3127 ** ** ** DNA topo IV (228) 0.213 * * ** ** ** Usher (317) 0.08051 ** ** * ** b-tubulin (382) 0.01767 ** ** ** CPN60 (466) 0.1826 0.04338 ** * ** Carboxyl_trans (212) 0.9667 0.04754 * ** MreB (275) 0.4971 0.1046 0.02768 * ** ** ** * actin (363) ** MPP (203) 0.04491 0.2412 0.03161 0.3224 * * ** MCM (220) 0.6576 0.11 0.6625 Filament (210) 0.3517 0.09121 0.9233 0.4505

How do we model the site-specific nature of protein evolution? Use information from tertiary (3D) structure of the protein under examination: Parisi & Echave (2002) Robinson et al. (2005) Rodrigue et al. (2005) Use site-specific frequency classes to parameterize a model: Bruno (1996) Lartillot et al. (2004) ‘Dayhoff’ type matrices for structural classes from databases of alignments + characterized structures: Lio, Goldman et al. (1998) Gascuel et al. (?)

Principal Components Analysis (PCA) of aa-frequency matrices (from 21 globular protein alignments)

Can be cut up into at least 4 classes G (A,S) V,I,L (M) D,E

A simple class frequency (cF) mixture model.... Use 4 frequency classes from PCA and add a fifth corresponding to the whole dataset frequencies (PF): This way JTT+F+G is a special case of JTT+cF+G where P(P1)…P(P4) = 0 Can do likelihood ratio test where:

Likelihood ratio tests From which PCA classes were derived New datasets

How do we model the site-specific nature of protein evolution? Use information from tertiary (3D) structure of the protein under examination: Parisi & Echave (2002) Robinson et al. (2005) Rodrigue et al. (2005) Use site-specific frequency classes to parameterize a model: Bruno (1996) Lartillot et al. (2004) ‘Dayhoff’ type matrices for structural classes from databases of alignments + characterized structures: Lio, Goldman et al. (1998) Gascuel et al. (?)

Anfinsen’s corollory Christian B. Anfinsen 1916-1995 The native state of the protein is the conformation of minimum energy Energy ‘native’ state Conformation ‘space’

We are not the first to do this... Simulation-based approach • Parisi and Echave (2001) Mol. Biol. Evol. 18:750-756 Parameterized Markov Modeling approach • Robinson et al. (2003) Mol. Biol. Evol. 20:1692-1704 • model is at the codon-level • 'ground-breaking' • Rodrigue et al. (2005) Gene 347:207 & (2006) Mol. Biol. Evol. 23:1762 • models at the amino acid level Key features of the Robinson and Rodrigue models: • Bayesian approaches - explicitly context dependent (not i.i.d.) • difference in energy between sequence i and j on a fixed structure is used to parameterize the Q matrix • Qij--> instantaneous rate of sequencei changing to sequencej • these are 4nx4n (nucleotides) or 20nx20n (amino acids) Q matrices where n is the number of sites (typically n > 100)......yikes. • Use MCMC to sample character change histories • extremely high dimensional model --> how good are the approximations??

The energy of a given state is related to the probability that state is occupied at equilibrium: Er = energy of state r T = temperature k = Boltzmann’s constant pr = probability of state r Ludwig Boltzmann The Austrian Physicist 1844-1906 Boltzmann’s principle

How the ‘mean force potentials’ are derived: • Contact energy ( ) • For all amino acid pairs (i,j) at each distance slice v in a database of thousands of structures • To get the ‘total energy’ for site x in a given structure, sum the energy contributions over all sites within a given distance threshold of x (dv < t ) • Solvation energy ( ) is calculated similarly • Implemented in Sippl’s PROSA 2003 • program (http://www.came.sbg.ac.at) j dv i x

Some details • can measure distances between two residues from the 'backbone' carbon (C) or from first side-chain carbon (C) • the latter makes more sense biochemically (but early structures sometimes did not have good resolution of side chains) • fast approximation to 'full energy' calculations consider one distance slice corresponding to residues in 'contact' (within ~4-6Å) • Bastolla et al. (2005)... contact map • Robinson et al. (2005) used 'full energy' calculation, whereas Rodrigue et al. (2005) and (2006) used Bastolla contact map based energies (how good is this?)

An ‘energy-based’ model where sites are independent If substitution of amino acid j for i at a site x: • increases energy --> ‘bad’ --> should occur less often • decrease energy --> ‘good’ --> should occur more often where fjis a function of amino acid frequencies in the alignment, and s and p are weight parameters. But its not all about energy…. Plus add rates, r, from a discretized gamma distribution to get E+JTT+ model....

How do we get site specific energy differences between states? Two approaches: Structure For every site x, mutate state to 19 other aa's: …STTMGNL... A . . . j Average For each sequence q, for each site x, mutate to 19 other aa's: …STTTGHL… Average: …STTMGNL… …STTVGNL… …STTVGML… PROSA-2003 mutate mutate PROSA-2003

Performance - likelihood ratio tests P-value (df=3) contact 0.000 0.000 average 0.000 average 0.000 average 0.000 0.000 0.000  av. (no JTT) 0.43 1.00 0.19 0.07 -415.52 cF model (df=4) 0.43 92.24 0.000 Similar results with two other proteins -- lipoxygenase and myoglobin

Number of contacts Site-likelihood diff.s between energy model versus # of contacts at site For site x,

% solvent accessible Site-likelihood diff.s between energy model versus % solvent accessibility

Energydoes best! lnL(energy+JTT) - lnLJTT

Energydoes best! E126

Energies at 126 predict stationary amino acid frequencies better than JTT Site 126 Observed Contact energy Solvation energy JTT

Energy sucks lnLenergy+JTT - lnLJTT

S306 lnLenergy+JTT - lnLJTT

Energies at 306 site-specific amino acid stationary frequencies worse than JTT Site 306 Observed Contact energy Solvation energy JTT

Lobster enolase (1PDZ) aligned with minimized Schistosoma structure model b 6.55Å b W302 S306

7.73Å P306 W302 Lobster enolase (1PDZ) aligned with minimized Schistosoma structure model

Summary • Traditional 'average' protein models are useful but their assumptions are often seriously violated • Need to address: • heterotachy • site-specific nature of substitution process • coevolution • changing state frequencies over the tree • Often SEVERAL of these factors may be important for a given protein family • ignoring them may cause phylogenetic artefacts • New models come with new assumptions and new problems....e.g.: • energy models currently assume that structures do not change across species and that they are static entities • complex models may not be identifiable (Allman and Rhodes and others)

Be careful of believing too much in our models

Acknowledgements Group members Gabino Sanchez-Perez Huaichun Wang Jessica Leigh Daniel Gaston Karen Li Collaborators Ed Susko Matt Spencer Christian Blouin

Modeling Protein Sequence Evolution: Let's Get Real(er)!