
Modeling protein sequence evolution: Let's get real(er)!

Andrew J. Roger, Dept. of Biochemistry & Molecular Biology, Dalhousie University, Halifax, N.S., Canada. Dr. Christian Blouin (Fac. of Comp. Sci.), Dr. Ed Susko (Dept. of Math/Stats), Dr. Matt Spencer (Univ. of Liverpool), Karen Li


Presentation Transcript


  1. Modeling protein sequence evolution: Let's get real(er)! Andrew J. Roger, Dept. of Biochemistry & Molecular Biology, Dalhousie University, Halifax, N.S., Canada

  2. Dr. Christian Blouin (Fac. of Comp. Sci.); Dr. Ed Susko (Dept. of Math/Stats); Dr. Matt Spencer (Univ. of Liverpool); Karen Li (smart summer student); Dr. Huaichun Wang (postdoctoral fellow); Dan Gaston (Bioinf./Comp. Biol. M.Sc. student)

  3. A ‘super-alignment’ of proteins. [Figure: alignment of protein g from Lactobacillus, E. coli, human, and shiitake mushroom, with rows …STTTGHLIYKCGGIDKR…, …STTMGNLAYQLGVFDQR…, …STTVGNLAFQLGAIDAR…, …STTVGMLSYQLGAVDKR…; site x and branch e are marked.] Probability of going from state i to j at protein g, site x, branch e: P_ij.

  4. Current phylogenetic models of protein evolution • Codon models • parameterized in terms of rates of interchange between synonymous and non-synonymous codons • Models of amino acid interchange are assembled from frequencies of changes observed in large databases • PAM, JTT, VT, mtREV, WAG, PMB • Usually combined with a model of among-site rate variation • e.g. JTT+Γ or JTT+Γ+invariable sites models • Adjust the matrix to reflect the equilibrium (stationary) frequencies of amino acids in your dataset • JTT+F+Γ
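A minimal sketch of the "+F" idea from this slide: a rate matrix is assembled from symmetric exchangeabilities and dataset-specific stationary frequencies, and transition probabilities follow from P(t) = exp(Qt). The 4-state alphabet, the `EXCH` and `FREQS` values, and all function names are illustrative assumptions; a real analysis would use the 20x20 JTT exchangeabilities and a tested numerical library.

```python
def build_rate_matrix(exch, freqs):
    """Q[i][j] = exch[i][j] * freqs[j] for i != j, rows sum to zero,
    scaled so the mean substitution rate at stationarity is 1."""
    n = len(freqs)
    q = [[exch[i][j] * freqs[j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])
    mean_rate = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[value / mean_rate for value in row] for row in q]


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_probs(q, t, k=8, terms=12):
    """P(t) = exp(Qt): Taylor series on Qt / 2^k, then k squarings."""
    n = len(q)
    scaled = [[q[i][j] * t / 2 ** k for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for m in range(1, terms + 1):
        term = [[value / m for value in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, result)
    return result


# Toy inputs (assumptions, not JTT values).
EXCH = [[0, 1, 2, 1], [1, 0, 1, 3], [2, 1, 0, 1], [1, 3, 1, 0]]
FREQS = [0.4, 0.3, 0.2, 0.1]  # stands in for the dataset's own frequencies

Q = build_rate_matrix(EXCH, FREQS)
P_SHORT = transition_probs(Q, 0.5)
P_LONG = transition_probs(Q, 50.0)
```

For long branches every row of `P_LONG` approaches `FREQS`, which is what "adjusting the matrix to the stationary frequencies of your dataset" means in practice. Among-site rate variation would multiply `t` by a per-category rate before calling `transition_probs`.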

  5. [Figure: tree of human, Lactobacillus, shiitake mushroom, and E. coli with site rates r1, r2, r3 marked.] Probability of going from state i to j at protein g, site x, edge e

  6. The problem… • Such models are a DRASTIC over-simplification of what is really going on • Average over sites, average over lineages, average across families • Sites in proteins can change function over time • sites under purifying selection <--> neutral <--> positive selection • Every amino acid site in a protein has a unique structural/functional context • Hydrophobicity, polarity, charge, dihedral angle, size, functional group…etc…etc • Different sites have different exchangeabilities to different aa’s • Different “frequencies” of aa’s occur at different sites

  7. [Figure: tree of human, Lactobacillus, shiitake mushroom, and E. coli with site rates r1, r2, r3 marked.] Probability of going from state i to j at protein g, site x, branch e • Assumptions • ‘fast-evolving’ positions are always fast and slow-evolving positions are always slow • Sites (x’s) have the same rate of evolution (r_x) on different branches (e’s)

  8. Changing rates of evolution at sites in different parts of the tree of life (heterotachy). [Figure: EF-1α in archaebacteria versus EF-1α in eukaryotes, with sites labeled fast or slow in each group.]

  9. Models that 'deal' with heterotachy (changing site rates across the tree) • Covarion models (stationary) • Tuffley and Steel (1998) • Galtier (2001) • Huelsenbeck (2002) • Wang et al. (2007) • Discrete rate-shift models • Gu 1999, 2002 • Bivariate rates: Susko et al. (2002) • Pupko and Galtier (2001) - LRT for diff. site rates in subtrees • Knudsen and Miyamoto (2001) • Mixture of edgelength models • Kolaczkowski and Thornton (2005) • Spencer et al. (2005) • Zhou et al. (2007)
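As a rough illustration of the covarion idea in the first bullet group, the sketch below builds a Tuffley-and-Steel-style rate matrix in which each state exists in an "on" copy (substitutions allowed) and an "off" copy (no substitutions), with switching between them. The 4-state base process, the switching rates, and the function names are assumptions for illustration only.

```python
def covarion_rate_matrix(q, on_to_off, off_to_on):
    """Return a 2n x 2n matrix: rows 0..n-1 are 'on' states, rows n..2n-1
    are the matching 'off' states. Substitutions happen only while 'on'."""
    n = len(q)
    big = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                big[i][j] = q[i][j]
        big[i][n + i] = on_to_off      # site switches off, state unchanged
        big[n + i][i] = off_to_on      # site switches back on
    for r in range(2 * n):
        big[r][r] = -sum(big[r][c] for c in range(2 * n) if c != r)
    return big


def covarion_stationary(freqs, on_to_off, off_to_on):
    """Stationary distribution: base frequencies split by time spent on/off."""
    p_on = off_to_on / (on_to_off + off_to_on)
    return [f * p_on for f in freqs] + [f * (1.0 - p_on) for f in freqs]


# Toy base process (assumption): rate i -> j equals the frequency of j.
BASE_FREQS = [0.4, 0.3, 0.2, 0.1]
BASE_Q = [[BASE_FREQS[j] if i != j else 0.0 for j in range(4)] for i in range(4)]
for i in range(4):
    BASE_Q[i][i] = -sum(BASE_Q[i])

COV_Q = covarion_rate_matrix(BASE_Q, on_to_off=0.5, off_to_on=1.5)
COV_PI = covarion_stationary(BASE_FREQS, on_to_off=0.5, off_to_on=1.5)
```

Because a site can switch off on one part of the tree and on in another, its effective rate differs between lineages even though the process itself is stationary.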

  10. [Figure: tree of human, Lactobacillus, shiitake mushroom, and E. coli with a single rate matrix Q applied to every edge.] Probability of going from state i to j at protein g, site x, branch e • Assumptions • different sites (x’s) and branches (e’s) all evolve according to the same general ‘rules’, i.e. rate matrices (R’s) and frequencies (Π’s) are the ‘same’ for all x and e

  11. Hydrophobic amino acids

  12. Acidic, basic, and hydrophobic amino acids

  13. Evolution of chaperonin 60 over ~1.5 billion years. [Figure: alignment across plants, fungi, animals, protists, and bacteria; highlighted columns are restricted to D or E; C, V or A; V or L; R or K.]

  14. Distribution of the number of different amino acid states in alignment columns. [Histogram: HSP90 protein (observed) versus data simulated under the JTT+F+Γ model on the HSP90 tree (1×10^5 sites); x-axis: number of amino acid states observed at site; y-axis: number of sites.]
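The statistic plotted on this slide can be computed with a few lines of code. The sketch below counts the distinct amino acid states in each alignment column and tabulates how many columns show each count. The function names and the choice of gap characters are assumptions, and the toy alignment reuses the four fragments shown on slide 3.

```python
from collections import Counter


def states_per_column(alignment, gap_chars="-?X"):
    """Number of distinct residues in each column, ignoring gap characters."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    return [len({c for c in column if c not in gap_chars})
            for column in zip(*alignment)]


def state_count_distribution(alignment):
    """Map 'number of states observed at a site' -> 'number of sites'."""
    return Counter(states_per_column(alignment))


ALIGNMENT = [
    "STTTGHLIYKCGGIDKR",
    "STTMGNLAYQLGVFDQR",
    "STTVGNLAFQLGAIDAR",
    "STTVGMLSYQLGAVDKR",
]

PER_SITE = states_per_column(ALIGNMENT)
DISTRIBUTION = state_count_distribution(ALIGNMENT)
```

Running the same tabulation on the real alignment and on alignments simulated under the fitted model gives the two histograms that the slide compares.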

  15. p-values from the 2 tests (** < 0.001, * < 0.01). Protein family (sites) Z-test (uniformity) χ² test (states) RATE 1 RATE 2 RATE 3 RATE 4 ** ** ** ** ** EF-2 (669) * ** ** ** ILVD_EDD (310) 0.1954 ** ** ** ** ** HSP90 (459) ** ** ** ** ** NuoF (405) ** ** ** ** Glu_synth_NTN (253) 0.01174 ** ** ** ** Poty_coat (212) 0.1897 * CTP synthetase (212) ** ** ** ** ** ** ** ** ** SecA (203) ** ** ** ** 0.2872 EF-1α (361) ** * ** ** * α-tubulin (375) ** * ** ** HSP70 (432) 0.3127 ** ** ** DNA topo IV (228) 0.213 * * ** ** ** Usher (317) 0.08051 ** ** * ** β-tubulin (382) 0.01767 ** ** ** CPN60 (466) 0.1826 0.04338 ** * ** Carboxyl_trans (212) 0.9667 0.04754 * ** MreB (275) 0.4971 0.1046 0.02768 * ** ** ** * actin (363) ** MPP (203) 0.04491 0.2412 0.03161 0.3224 * * ** MCM (220) 0.6576 0.11 0.6625 Filament (210) 0.3517 0.09121 0.9233 0.4505

  16. How do we model the site-specific nature of protein evolution? Use information from the tertiary (3D) structure of the protein under examination: Parisi & Echave (2001), Robinson et al. (2003), Rodrigue et al. (2005). Use site-specific frequency classes to parameterize a model: Bruno (1996), Lartillot et al. (2004). ‘Dayhoff’-type matrices for structural classes from databases of alignments + characterized structures: Lio, Goldman et al. (1998), Gascuel et al. (?)

  17. Principal Components Analysis (PCA) of aa-frequency matrices (from 21 globular protein alignments)

  18. Can be cut up into at least 4 classes G (A,S) V,I,L (M) D,E

  19. A simple class frequency (cF) mixture model... Use the 4 frequency classes from PCA and add a fifth corresponding to the whole-dataset frequencies (Π_F). This way JTT+F+Γ is a special case of JTT+cF+Γ where P(Π_1)…P(Π_4) = 0. Can do a likelihood ratio test where the statistic is $2[\ln L(\text{JTT+cF+}\Gamma) - \ln L(\text{JTT+F+}\Gamma)]$.
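A hedged sketch of how such a mixture and its likelihood ratio test could be evaluated. It assumes the per-site likelihood under each frequency class has already been computed elsewhere (that tree-based step is not shown), and it uses the standard statistic 2(lnL_alt - lnL_null) with a chi-square reference distribution. Function names and the toy numbers are assumptions.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    total = 0.0
    term = math.sqrt(half) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (k + 0.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


def mixture_log_likelihood(site_likelihoods, weights):
    """site_likelihoods[x][c] is the likelihood of site x under class c;
    weights[c] is the mixture probability of class c."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("class weights must sum to 1")
    return sum(math.log(sum(w * lik for w, lik in zip(weights, liks)))
               for liks in site_likelihoods)


def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (statistic, p-value) for nested models."""
    statistic = 2.0 * (lnl_alt - lnl_null)
    return statistic, chi2_sf(statistic, df)


# Toy per-site likelihoods for 5 classes (4 PCA classes + whole-dataset F).
SITE_LIKS = [[0.010, 0.002, 0.001, 0.001, 0.004],
             [0.001, 0.020, 0.002, 0.001, 0.005]]
LNL_F_ONLY = mixture_log_likelihood(SITE_LIKS, [0, 0, 0, 0, 1])
LNL_MIXTURE = mixture_log_likelihood(SITE_LIKS, [0.3, 0.3, 0.1, 0.1, 0.2])
STAT, P_VALUE = likelihood_ratio_test(LNL_F_ONLY, LNL_MIXTURE, df=4)
```

Setting the four PCA-class weights to zero recovers the plain +F likelihood, which is the nesting the slide describes. Because the null hypothesis places those weights on the boundary of the parameter space, the chi-square reference with 4 degrees of freedom is an approximation.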

  20. Likelihood ratio tests. [Table: results for the datasets from which the PCA classes were derived and for new datasets.]

  21. How do we model the site-specific nature of protein evolution? Use information from the tertiary (3D) structure of the protein under examination: Parisi & Echave (2001), Robinson et al. (2003), Rodrigue et al. (2005). Use site-specific frequency classes to parameterize a model: Bruno (1996), Lartillot et al. (2004). ‘Dayhoff’-type matrices for structural classes from databases of alignments + characterized structures: Lio, Goldman et al. (1998), Gascuel et al. (?)

  22. Anfinsen’s corollary (Christian B. Anfinsen, 1916-1995): the native state of the protein is the conformation of minimum energy. [Figure: energy plotted over conformation ‘space’, with the ‘native’ state at the minimum.]

  23. We are not the first to do this... Simulation-based approach • Parisi and Echave (2001) Mol. Biol. Evol. 18:750-756 Parameterized Markov modeling approach • Robinson et al. (2003) Mol. Biol. Evol. 20:1692-1704 • model is at the codon level • 'ground-breaking' • Rodrigue et al. (2005) Gene 347:207 & (2006) Mol. Biol. Evol. 23:1762 • models at the amino acid level Key features of the Robinson and Rodrigue models: • Bayesian approaches - explicitly context dependent (not i.i.d.) • difference in energy between sequence i and sequence j on a fixed structure is used to parameterize the Q matrix • Q_ij --> instantaneous rate of sequence i changing to sequence j • these are 4^n x 4^n (nucleotides) or 20^n x 20^n (amino acids) Q matrices where n is the number of sites (typically n > 100)... yikes. • Use MCMC to sample character change histories • extremely high dimensional model --> how good are the approximations?

  24. Boltzmann’s principle (Ludwig Boltzmann, Austrian physicist, 1844-1906). The energy of a given state is related to the probability that the state is occupied at equilibrium: $p_r \propto e^{-E_r / kT}$, where E_r = energy of state r, T = temperature, k = Boltzmann’s constant, p_r = probability of state r.
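A small numerical illustration of Boltzmann's principle: given the energies of a set of states, the equilibrium occupancy of each state is proportional to exp(-E/kT), normalized over all states. The function name and the toy energies are assumptions, and kT is treated as a single scale parameter in the same units as the energies.

```python
import math


def boltzmann_probabilities(energies, kt=1.0):
    """Equilibrium probability of each state, p_r proportional to exp(-E_r / kT)."""
    e_min = min(energies)  # shift by the minimum for numerical stability
    weights = [math.exp(-(e - e_min) / kt) for e in energies]
    partition = sum(weights)
    return [w / partition for w in weights]


ENERGIES = [-1.0, 0.0, 2.0]  # toy values in units of kT
PROBS = boltzmann_probabilities(ENERGIES)
```

Lower-energy states receive higher probability, and the ratio of any two probabilities depends only on their energy difference.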

  25. How the ‘mean force potentials’ are derived: • Contact energy ( ) • For all amino acid pairs (i,j) at each distance slice v in a database of thousands of structures • To get the ‘total energy’ for site x in a given structure, sum the energy contributions over all sites within a given distance threshold of x (d_v < t) • Solvation energy ( ) is calculated similarly • Implemented in Sippl’s PROSA 2003 program (http://www.came.sbg.ac.at) [Figure: site x with residues i and j separated by distance d_v.]
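The slide's formula did not survive in the transcript, so the sketch below uses one common form of a knowledge-based potential (the inverse Boltzmann relation, E = -kT ln(f_pair / f_reference)) rather than PROSA's actual implementation. Pair counts per distance bin, the pseudocount, and every name here are illustrative assumptions.

```python
import math


def mean_force_potential(pair_counts, kt=1.0, pseudocount=1.0):
    """pair_counts[(a, b)][v] = times residue pair (a, b) was seen in distance
    bin v across a structure database. Returns energies with the same layout."""
    n_bins = len(next(iter(pair_counts.values())))
    bin_totals = [sum(counts[v] for counts in pair_counts.values())
                  for v in range(n_bins)]
    grand_total = sum(bin_totals)
    potential = {}
    for pair, counts in pair_counts.items():
        pair_total = sum(counts)
        energies = []
        for v in range(n_bins):
            f_pair = (counts[v] + pseudocount) / (pair_total + pseudocount * n_bins)
            f_ref = (bin_totals[v] + pseudocount) / (grand_total + pseudocount * n_bins)
            energies.append(-kt * math.log(f_pair / f_ref))
        potential[pair] = energies
    return potential


def site_energy(x, sequence, distance_bins, potential, max_bin):
    """Sum pair energies between site x and every site within the threshold bin."""
    total = 0.0
    for y, residue in enumerate(sequence):
        if y != x and distance_bins[x][y] <= max_bin:
            pair = tuple(sorted((sequence[x], residue)))
            total += potential[pair][distance_bins[x][y]]
    return total


# Toy database counts: bin 0 = close contact, bin 1 = farther apart.
COUNTS = {("L", "L"): [80, 20], ("K", "K"): [20, 80], ("K", "L"): [50, 50]}
POTENTIAL = mean_force_potential(COUNTS)
BINS = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]  # toy distance bins for a 3-residue chain
ENERGY_SITE0 = site_energy(0, "LLK", BINS, POTENTIAL, max_bin=0)
```

Pairs seen in contact more often than the database average get negative (favorable) energies, and summing them over the neighbors of a site gives that site's total energy for a given residue.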

  26. Some details • can measure distances between two residues from the 'backbone' carbon (Cα) or from the first side-chain carbon (Cβ) • the latter makes more sense biochemically (but early structures sometimes did not have good resolution of side chains) • a fast approximation to 'full energy' calculations considers one distance slice corresponding to residues in 'contact' (within ~4-6Å) • Bastolla et al. (2005)... contact map • Robinson et al. (2003) used the 'full energy' calculation, whereas Rodrigue et al. (2005) and (2006) used Bastolla contact-map-based energies (how good is this?)

  27. An ‘energy-based’ model where sites are independent. If substitution of amino acid j for i at a site x: • increases energy --> ‘bad’ --> should occur less often • decreases energy --> ‘good’ --> should occur more often, where f_j is a function of amino acid frequencies in the alignment, and s and p are weight parameters. But it's not all about energy… Plus add rates, r, from a discretized gamma distribution to get the E+JTT+Γ model....
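The model equation itself was lost from this slide, so the following is only a sketch of the qualitative rule it states, not the authors' formula. It assumes a site-specific rate of the form exchangeability x frequency x exp(-ΔE/2), where ΔE is the energy change caused by the substitution; the per-residue energies could be a weighted sum such as s·E_solvation + p·E_contact. All names and numbers are hypothetical.

```python
import math


def energy_adjusted_rates(exch, freqs, energies, weight=1.0):
    """Assumed form: q[i][j] = exch[i][j] * freqs[j] * exp(-weight * dE / 2),
    with dE = energies[j] - energies[i] at this site. Rows sum to zero."""
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                delta_e = energies[j] - energies[i]
                q[i][j] = exch[i][j] * freqs[j] * math.exp(-weight * delta_e / 2.0)
        q[i][i] = -sum(q[i])
    return q


def implied_site_frequencies(freqs, energies, weight=1.0):
    """Stationary frequencies implied by the assumed form above."""
    raw = [f * math.exp(-weight * e) for f, e in zip(freqs, energies)]
    total = sum(raw)
    return [r / total for r in raw]


# Toy 3-state example (assumption): state 0 is energetically favored at this site.
TOY_EXCH = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
TOY_FREQS = [1 / 3, 1 / 3, 1 / 3]
TOY_ENERGIES = [-1.0, 0.0, 1.5]
SITE_Q = energy_adjusted_rates(TOY_EXCH, TOY_FREQS, TOY_ENERGIES)
SITE_PI = implied_site_frequencies(TOY_FREQS, TOY_ENERGIES)
```

Under this assumed form, energy-raising substitutions are slowed, energy-lowering ones are sped up, and the site's stationary frequencies shift toward low-energy residues while the process stays reversible. Setting `weight` to zero recovers the ordinary frequency-adjusted matrix.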

  28. How do we get site-specific energy differences between states? Two approaches: • Structure: for every site x in the sequence of the structure (…STTMGNL...), mutate the state to each of the 19 other aa's and score with PROSA-2003 • Average: for each sequence q in the alignment (…STTTGHL…, …STTMGNL…, …STTVGNL…, …STTVGML…), for each site x, mutate to the 19 other aa's, score with PROSA-2003, then average over sequences

  29. Performance - likelihood ratio tests. [Table, flattened: P-value (df=3) contact 0.000 0.000 average 0.000 average 0.000 average 0.000 0.000 0.000 av. (no JTT) 0.43 1.00 0.19 0.07 -415.52 cF model (df=4) 0.43 92.24 0.000] Similar results with two other proteins: lipoxygenase and myoglobin

  30. Site-likelihood differences for the energy model versus number of contacts at the site. [Plot: site-likelihood difference for site x against number of contacts.]

  31. Site-likelihood differences for the energy model versus % solvent accessibility. [Plot: site-likelihood difference against % solvent accessible.]

  32. Energy does best! [Plot: lnL(energy+JTT) - lnL(JTT) by site.]

  33. Energy does best! [Structure view highlighting site E126.]

  34. Energies at site 126 predict stationary amino acid frequencies better than JTT. [Chart for site 126: observed, contact energy, solvation energy, JTT.]

  35. Energy sucks. [Plot: lnL(energy+JTT) - lnL(JTT) by site.]

  36. Site S306. [Plot: lnL(energy+JTT) - lnL(JTT) by site.]

  37. Site S306. [Plot: lnL(energy+JTT) - lnL(JTT) by site.]

  38. Energies at site 306 predict site-specific amino acid stationary frequencies worse than JTT. [Chart for site 306: observed, contact energy, solvation energy, JTT.]

  39. Lobster enolase (1PDZ) aligned with minimized Schistosoma structure model. [Structure view: W302 and S306, 6.55Å apart.]

  40. Lobster enolase (1PDZ) aligned with minimized Schistosoma structure model. [Structure view: W302 and P306, 7.73Å apart.]

  41. Summary • Traditional 'average' protein models are useful but their assumptions are often seriously violated • Need to address: • heterotachy • site-specific nature of substitution process • coevolution • changing state frequencies over the tree • Often SEVERAL of these factors may be important for a given protein family • ignoring them may cause phylogenetic artefacts • New models come with new assumptions and new problems....e.g.: • energy models currently assume that structures do not change across species and that they are static entities • complex models may not be identifiable (Allman and Rhodes and others)

  42. Be careful of believing too much in our models

  43. Acknowledgements. Group members: Gabino Sanchez-Perez, Huaichun Wang, Jessica Leigh, Daniel Gaston, Karen Li. Collaborators: Ed Susko, Matt Spencer, Christian Blouin.
