350 likes | 513 Views
Phylogenetic inference on the evolution of protein-coding genes. Lecture 1: The evolution of protein coding genes Jesse Bloom, jbloom@fhcrc.org.
E N D
Phylogenetic inference on the evolution of protein-coding genes Lecture 1: The evolution of protein coding genes Jesse Bloom, jbloom@fhcrc.org
You are lucky. You are in the same department as the person who has contributed more to this field than anyone else. His book is the best reference on the topic. It is not required for this class, but is a good place to look if you want to dig more deeply.I’m unlucky. Do any of you know who Pete Myers is? Joe Felsenstein
Phylogenetic inference refers analyzing sequences to make inferences about the evolutionary process that created them. Popular software packages include: BEAST, MrBayes, PHYLIP, PAUP, PAML, etc. Goals can be varied:* establishing evolutionary relationships (building phylogenetic trees)* detecting positive selection* understanding selective forces* reconstructing ancestral sequences* testing specific hypotheses (i.e. was this person infected with one or two HIV strains?)* gleaning biological information (i.e. constraints on protein function, etc)
My lectures will focus exclusively on the evolution of protein-coding genes. Other approaches may be better for other regions of the genome.
The gene sequences are the result of some forward evolutionary process. All we are able to observe is a small subset of the sequences that result from this process. We must use these sequences to make inferences about the evolutionary process.
Lecture 1: The forward evolutionary processLecture 2: Inferences about evolutionary distancesLecture 3: Inferences about evolutionary distancesLecture 4: Building phylogenetic trees
The forward evolutionary process. We will begin by thinking about this process.Your homework will begin with simulating this process.Both the lectures and the homework will thenproceed to making inferences from the sequences.Conceptually, errors and approximations come from two sources: (1) Simplifications about the actual evolutionary process.(2) Imperfections in the inference approaches.
The forward evolutionary process. Let si be the sequence of an individual variant of the gene that we are studying. Let gi be the sequence of the genome containing si. Let Gbe the set of all of these genomes. Conceptually, the forward model that describes how the sequences change is:G(t + dt) = F[G(t), universe(t), random effects]Formally, the evolution of the gene sequences can be affected by lots of processes that we have no hope of modeling!
The forward evolutionary process. G(t + dt) = F[G(t), universe(t), random effects]A fairly standard set of simplifications are used in phylogenetic inference to make things tractable. First, we assume that the protein-coding sequence s can be considered independent of the rest of the genome g.Second, we assume that the entire population can be modeled by a single protein sequence.Third, we assume that the evolution is only a function of mutation m and selection f acting on the gene sequence (we don’t worry about asteroids wiping out the dinosaurs…)s(t + dt) = f[m{s(t)}]
The forward evolutionary process: s(t + dt) = f[m{s(t)}] wildtype sequence s(t) mutation m(s) selection removes mutation f[m(s)] = s mutation is tolerated and goes to fixation: f[m(s)] = s’
The forward evolutionary process: mutationWe will only consider substitution mutations (i.e. A->T). Evolution also obviously involves insertions / deletions, but the theoretical framework for analyzing indels is less established.We assume that mutations are independent and identically distributed (i.i.d.). That is, each site in the gene is identically likely to be mutated, and the changes at sites are independent of each other. (When might this assumption be wrong?)Effectively, we have partitioned the whole sequence into individual sites, each of which experiences independent and identically distributed mutations.s = s1 s2s3s4s5s6s7s8 s9 …
The forward evolutionary process: mutation s = s1 s2s3s4s5s6s7s8 s9 …What is the i.i.d. model for how each site evolves? Transition: purine to purine mutation or pyrimidine to pyrimidine mutation Transversion: purine to pyrimidine mutation, or pyrimidine to purine mutation In most cases, transitions occur more frequently than transversions.
The forward evolutionary process: mutation Jukes Cantor model: all mutation types equally likely, a = b = c = d = e = f = μ / 3 Kimura’s two parameter model: transitions all equally likely, transversions all equally likely, so that a = c = d = f, and e = b, and μ = 2a + b. The transition transversion ratio is defined as R = b / 2a. A value of R = ½ corresponds to no bias. General time reversible model: different values for each type of mutation
The forward evolutionary process: mutation Although we usually think of gene composition as the result of selection, mutation pressure can also shape gene composition. Go to Michael Lynch’s seminar tomorrow at 3:30 PM! Mutations from G/C -> A/T tend to outnumber those from A/T -> G/C for most systems. If we let the ratio of these two types of mutations be m, what is the expected fraction of sites that are A or T? (solved on whiteboard)
The forward evolutionary process: selection s(t + dt) = f[m{s(t)}] wildtype sequence s(t) mutation m(s) selection removes mutation f[m(s)] = s mutation is tolerated and goes to fixation: f[m(s)] = s’
The forward evolutionary process: selection Selection on coding genes can arise from many sources:* selection for protein function* selection for protein stability / folding, against aggregation* selection for mRNA stability or structure* selection for codon usage Selection for factors related to the protein sequence is generally considered to the strongest selective force.
The forward evolutionary process: selection If we assume that selection is acting on the protein, we can partition substitutions into two types:synonymous mutationsgtg gag aatgaatranslates to GENEgtggaaaatgaatranslates to GENEnonsynonymous mutationsgtg gag aatgaatranslates to GENEgtggcgaatgaatranslates to GANE
The forward evolutionary process: selection The relative ratio of these two types of changes is often used to make inferences about selection.synonymous mutationsgtg gag aatgaatranslates to GENEgtggaaaatgaatranslates to GENEnonsynonymous mutationsgtg gag aatgaatranslates to GENEgtggcgaatgaatranslates to GANE
The forward evolutionary process: selection How do we model selection other than the crude idea that synonymous changes are likely to have less effect that nonsynonymous ones?For mutation, we assumed that mutations were independent and identically distributed (i.i.d.). As we will see most phylogenetic models of selection make a similar assumption (although it seems much less credible for selection than mutation).s = s1 s2 s3 s4 s5 s6 s7 s8 s9 …
The forward evolutionary process: selection How do we model selection other than the crude idea that synonymous changes are likely to have less effect that nonsynonymous ones?For mutation, we assumed that mutations were independent and identically distributed (i.i.d.). As we will see most phylogenetic models of selection make a similar assumption (although it seems much less credible for selection than mutation).s = s1 s2 s3 s4 s5 s6 s7 s8 s9 …
The forward evolutionary process: selection In reality, we believe that selection acts on the full protein sequence, such that the fitness w is a function of s: fitness = w(s)We further believe that the probability that a mutation becomes fixed is dependent on its fitness effect. So in reality, sites are probably not changing fully independently.
The forward evolutionary process: selection We are going to simulate the forward evolution of gene sequences using a molecular modeling force field on the encoded protein. Your first homework will involve performing this simulation, and then doing some rudimentary analysis. In subsequent lectures / homeworks, we will do analysis of the evolved protein sequences to see how well the inference tools actually work.
The forward evolutionary process: selection We will use the FOLDX computer program to compute the fitness based on the protein’s structure.fitness = w(s)Download the latest version of FOLDX from http://foldx.crg.es/ (register for an academic license).
We will use FOLDX to model the evolution of sperm whale myoglobin. This is PDB structure 1MBN, downloadable as a *.PDB file from http://www.rcsb.org/pdb/explore.do?structureId=1mbnThe nucleotide sequence encoding the protein in this PDB structure is:>nucleotide sequence for protein in 1MBN structureGTTCTGTCTGAAGGTGAATGGCAGCTGGTTCTGCATGTTTGGGCTAAAGTTGAAGCTGACGTCGCTGGTCATGGTCAGGACATCTTGATTCGACTGTTCAAATCTCATCCGGAAACTCTGGAAAAATTCGATCGTTTCAAACATCTGAAAACTGAAGCTGAAATGAAAGCTTCTGAAGATCTGAAAAAACATGGTGTTACCGTGTTAACTGCCCTAGGTGCTATCCTTAAGAAAAAAGGGCATCATGAAGCTGAGCTCAAACCGCTTGCGCAATCGCATGCTACTAAACATAAGATCCCGATCAAATACCTGGAATTCATCTCTGAAGCGATCATCCATGTTCTGCATTCTAGACATCCAGGTGACTTCGGTGCTGACGCTCAGGGTGCTATGAACAAAGCTCTCGAGCTGTTCCGTAAAGATATCGCTGCTAAGTACAAAGAACTGGGTTACCAGGGTTAA
Each time we make a mutation, FOLDX will generate a structure for the mutant protein and predict its effect on the stability of the protein. We will assume that the only selection pressure is that the stability of the protein must remain sufficiently high. ΔG is the stability of folding (more negative values indicate more stable) ΔΔG is the effect of a mutation on stability (negative values indicate stabilizing, positive values indicate destabilizing)
We assume that selection acts such that proteins that have ΔG < some threshold (are sufficiently stable) all have equal and high fitness, and proteins that have ΔG > some threshold have fitness of zero. We use FOLDX to calculate these stabilities. FOLDX therefore provides our fitness function: fitness = w(s)
The simulation algorithm: parent sequence mutation m(s) Mutation is nonsynonymous and destabilizes protein too much – it is rejected Mutation is synonymous, or is nonsynonymous but retains sufficient stability – mutation is accepted
Download FOLDX, the 1MBN.pdb file, and the 1MBN.fasta nucleotide sequence (the latter two are available on the course webpage). “Repair” the PDB file using the FOLDX <RepairPDB> function. This fixes any problems FOLDX identifies in the PDB file. This will take a few minutes to run on a laptop. Pick a random nucleotide in the gene sequence and mutate it according to the nucleotide mutation model. If the mutation is synonymous, it is accepted and that is our new nucleotide sequence. If the mutation introduces a stop codon, it is rejected. If the mutation is nonsynonymous, evaluate the ΔΔG using FOLDX with the <BuildModel> function. If the new stability is still below the threshold, the mutation is accepted. If it is above the threshold, the mutation is rejected. Initially, the protein has no stability beyond the threshold, so only mutations with ΔΔG <= 0 are accepted. But once the stabilizing mutation is accepted, the protein now has ΔΔG1 of extra stability. So the next mutation will be accepted if ΔΔG1 + ΔΔG2 <= 0, and so on. If a mutation is accepted, the mutant sequence and structure is the new starting point for the next mutation. Evaluating each nonsynonymous mutation should take from a few seconds to a minute on your laptop. Repeat the process in step 3 for a total of 200 attempted mutations. Keep track of how many nonsynonymous and synonymous substitutions result from these 200 mutations, and keep track of the nucleotide sequence at every step. Note that the total number of substitutions will be less than 200, since some nonsynonymous mutations will be culled by selection.
For concreteness, here is the core algorithm of my Python code implementing this exercise n_s = n_n = n_tot = 0 dg = 0 f = open(logfile, 'w') f.write('#STEP\tN_SYNONYMOUS\tN_NONSYNONYMOUS\tN_TOTAL\tdG\tSEQUENCE\n') foristep in range(nsteps): f.write('%d\t%d\t%d\t%d\t%.3f\t%s\n' % (istep, n_s, n_n, n_tot, dg, ntseq[1])) print '%d\t%d\t%d\t%d\t%.3f\t%s\n' % (istep, n_s, n_n, n_tot, dg, ntseq[1]) newseq = ('new', Mutate(ntseq, transition_transversion_r)) try: newprot = pips.fasta.Translate([newseq])[0] exceptValueError: continue # introduces stop codon ifnewprot == protseq: # synonymous n_s += 1 n_tot += 1 ntseq = newseq protseq = newprot else: (PDBtext, ddg) = MutatePDB(currentpdb, protseq[1], newprot[1], foldxpath) ifddg + dg <= 0: # mutationaccepted n_n += 1 n_tot += 1 dg = ddg + dg ntseq = newseq protseq = newprot currentpdb = '%s_mutant%s.pdb' % (repairedpdbfile, istep + 1) open(currentpdb, 'w').write(PDBtext) f.close()
The FOLDX syntax / commands are explained in the manual that comes with the executable. The format is rather complex. Here is an example of the runfile (run by ../FoldX.mac –runfilerunfile.txt) for running <RepairPDB>. This will generate a new repaired PDB file under the name 1MBN_repaired.pdb <TITLE>FOLDX_runscript; <JOBSTART>#; <PDBS>1MBN.pdb; <BATCH>#; <COMMANDS>FOLDX_commandfile; <RepairPDB>1MBN_repaired; <END>#; <OPTIONS>FOLDX_optionfile; <END>#; <JOBEND>#; <ENDFILE>#;
Here is an example of the runfile for making the mutations. <TITLE>FOLDX_runscript; <JOBSTART>#; <PDBS>temp.pdb; <BATCH>#; <COMMANDS>FOLDX_commandfile; <BuildModel>mutant,mutant_file.txt; <END>#; <OPTIONS>FOLDX_optionfile; <END>#; <JOBEND>#; <ENDFILE>#; This should be run with mutant_file.txt like this if the second residue is mutated from V to A (difference between first and second sequences). VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG VASEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG The resulting ΔΔG for the mutation is the second number in the last line of the file Average_Mutant.So in the example, it is 1.81, meaning that this mutation is destabilizing. FoldX 3.0 Beta 5.1 (2011) by the FoldX Consortium Jesper Borg, Frederic Rousseau, JoostSchymkowitz, Luis Serrano and Francois Stricher ------------------------------------------------------ PDB file analysed: 1MBN_repaired.pdb Output type: BuildModel Pdb SD total energy Backbone HbondSidechainHbond Van der Waals Electrostatics Solvation Polar Solvation Hydrophobic Van der Waals clashes entropy sidechain entropy mainchainsloop_entropymloop_entropycis_bond torsional clash backbone clash helix dipole water bridge disulfide electrostatic kon partial covalent bonds energy Ionisation Entropy Complex 1MBN_repaired_1 0 1.81 0.05 0.00 1.56 0.02 -0.73 3.00 0.12 -1.17 -0.12 0.00 0.00 0.00 -0.92 -0.34 -0.00 0.00 0.00 0.00 0.00 0.00 0.00
Homework • Write a program to implement the 200 steps of evolution described on the previous slide. Then make plots showing how each of the following change as a function of the number of mutational steps attempted: • The total number of substitutions that have accumulated • The total number of nonsynonymous substitutions that have accumulated • The total number of synonymous mutations that have accumulated • The average nucleotide percent identity of the sequence at each step with the initial sequence. • The average percent identity of the sequence at each step with the initial sequence plotted separately for each codon position • You will turn in these plots, along with your computer program and its output.
Summary We have talked about how we envision the evolutionary process that leads to the divergence of protein coding genes. You have begun a homework that simulates this process, using a model for protein stability as the selection pressure. In the remaining lectures, we are going to talk about the inference techniques that are used to analyze natural protein sequences to try to understand the evolutionary process.