Phylogenetic inference on the evolution of protein-coding genes

Lecture 1: Inferences about evolutionary distance
Jesse Bloom, jbloom@fhcrc.org

The forward evolutionary process: s(t + dt) = f[m{s(t)}]

Starting from the wildtype sequence s(t), mutation produces m(s). Then one of two things happens: either selection removes the mutation, so f[m(s)] = s, or the mutation is tolerated and goes to fixation, so f[m(s)] = s’.
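One forward step of this process can be sketched in a few lines of Python. This is only an illustration: the function names (`mutate`, `select`, `step`) and the selection rule `keeps_first_codon` are hypothetical stand-ins, since the slides do not specify a particular selection function.

```python
import random

NUCLEOTIDES = "acgt"

def mutate(seq, rng):
    """m(s): change one randomly chosen site to a different nucleotide."""
    site = rng.randrange(len(seq))
    new_nt = rng.choice([nt for nt in NUCLEOTIDES if nt != seq[site]])
    return seq[:site] + new_nt + seq[site + 1:]

def select(wildtype, mutant, is_tolerated):
    """f[m(s)]: keep the mutant if it is tolerated, otherwise keep the wildtype."""
    return mutant if is_tolerated(mutant) else wildtype

def step(seq, is_tolerated, rng):
    """One forward step: s(t + dt) = f[m{s(t)}]."""
    return select(seq, mutate(seq, rng), is_tolerated)

def keeps_first_codon(seq):
    """Hypothetical selection rule (an assumption): the first codon must stay 'ggg'."""
    return seq.startswith("ggg")

rng = random.Random(0)
s = "ggggacatccaacag"
for _ in range(20):
    s = step(s, keeps_first_codon, rng)
print(s)
```

Any mutation that touches the first three sites is removed by selection here, while mutations elsewhere go to fixation immediately. A more realistic model would replace `keeps_first_codon` with a fitness function on the encoded protein.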

In our homework and on Tuesday, we talked about the forward evolution of genes. In that case, we imagined that we knew the exact mutation and selection processes that were operating. We also imagined that we could see each intermediate produced by evolution. In reality, we do not have access to that information.

Ancestor: ggggacatccaacag

Lineage 1: ggggacatccatcag -> ggggacatacatcag -> ggcgacatacatcag -> ggcgacatgcatcag -> ggcgacatgcataag -> ggcgacatgcaaaag

Lineage 2: ggagacatccaacag -> ggagaaatccaacag -> ggagaaatccatcag -> ggataaatccatcag -> gggtaaatccatcag -> gggtaaatccatcat

In this case, an ancestral sequence (ggggacatccaacag) gave rise to two lineages, each of which experienced six mutations. Because of reversions and repeated mutations at the same site, the final sequences differ from the ancestor at only 3 and 4 sites, respectively. All we can observe is the final alignment of the sequences, which differ from each other at 7 sites:

ggcgacatgcaaaag
gggtaaatccatcat
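The site counts can be checked directly by counting mismatched positions. A minimal Python sketch, using the ancestral sequence (the first sequence in the series) and the two final sequences:

```python
def hamming(a, b):
    """Number of aligned sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

ancestor = "ggggacatccaacag"
final_1 = "ggcgacatgcaaaag"
final_2 = "gggtaaatccatcat"

# Differences between each final sequence and the (normally unobserved) ancestor.
print(hamming(ancestor, final_1), hamming(ancestor, final_2))

# Differences between the two observed sequences.
print(hamming(final_1, final_2))  # 7
```

The observed difference of 7 sites is well below the 12 mutations that actually occurred along the two lineages.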

The question we will ask: how many substitutions have occurred since the divergence of these species? We will call this the “evolutionary distance”, measured in substitutions per site. This will be proportional to time if we assume a molecular clock.

We are asking: how long are the branches on this tree? Note that we cannot actually place the root if the substitution model is assumed to be reversible (satisfies detailed balance). But the total summed branch length is conserved.

(Figure: tree connecting the two observed sequences, ggcgacatgcaaaag and gggtaaatccatcat.)

Aside: in some very interesting cases, the substitution model is known to be directional (non-reversible)

If we can only see the final sequences, how do we infer the actual number of substitutions that occurred? We must account for reversions and multiple mutations at the same site.

At low levels of divergence, the number of substitutions is roughly equal to the Hamming distance. But at higher levels of divergence, the number of substitutions will be larger than the Hamming distance, because repeated changes at a site are counted at most once and reversions are not counted at all. (Figure: plot from one of my simulations of the homework.)
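The gap between the two quantities is easy to reproduce. The following Python sketch is not the homework simulation itself; it assumes a random starting sequence, no selection, and substitutions that hit uniformly random sites and change them to a uniformly chosen different nucleotide.

```python
import random

NUCLEOTIDES = "acgt"

def simulate(length, n_substitutions, rng):
    """Apply substitutions one at a time to random sites; return (start, end)."""
    start = [rng.choice(NUCLEOTIDES) for _ in range(length)]
    seq = list(start)
    for _ in range(n_substitutions):
        site = rng.randrange(length)
        seq[site] = rng.choice([nt for nt in NUCLEOTIDES if nt != seq[site]])
    return start, seq

def hamming(a, b):
    """Number of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(1)
length = 1000
results = []
for n in (10, 100, 500, 1000, 3000):
    start, end = simulate(length, n, rng)
    results.append((n, hamming(start, end)))
    # Columns: substitutions per site, Hamming distance per site.
    print(n / length, hamming(start, end) / length)
```

For small numbers of substitutions the two columns nearly agree. As substitutions accumulate, the Hamming distance per site levels off below 1 while the true number of substitutions per site keeps growing.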

In order to derive the relationship between sequence identity and number of substitutions, we need a model both for how mutations are distributed among sites in the sequence and for how the substitutions themselves occur. Recall the simplest model: Jukes-Cantor. Each site mutates with equal probability to any of the other three nucleotides, so a = b = c = d = e = f = μ/3, where μ is the mutation rate. Mutations occur randomly, so the number of mutations m at a site in time t is Poisson distributed with mean <m> = μt:

Pr(m | μ, t) = exp(-μt) * (μt)^m / m!

Jukes-Cantor result, after averaging over m:

Pr(site is not changed | <m> = μt) = ¼ + ¾ * exp(-4/3 * <m>)
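The Jukes-Cantor result can be checked numerically and inverted to estimate distance from an observed alignment. The Python sketch below treats the argument of the exponential as the expected number of mutations per site, μt. It uses the fact that after exactly m mutations a site is back at its original nucleotide with probability ¼ + ¾ * (-1/3)^m, and averages this over the Poisson distribution. The function names are illustrative only.

```python
import math

def jc_identity(mean_mutations):
    """Pr(site unchanged) when the expected number of mutations is mean_mutations."""
    return 0.25 + 0.75 * math.exp(-4.0 * mean_mutations / 3.0)

def jc_identity_by_sum(mean_mutations, max_m=200):
    """Same quantity as a sum of Pr(m) * Pr(unchanged | exactly m mutations)."""
    total = 0.0
    poisson = math.exp(-mean_mutations)  # Pr(m = 0)
    for m in range(max_m):
        total += poisson * (0.25 + 0.75 * (-1.0 / 3.0) ** m)
        poisson *= mean_mutations / (m + 1)  # Pr(m + 1) from Pr(m)
    return total

def jc_distance(fraction_different):
    """Invert jc_identity: substitutions per site from the fraction of differing sites."""
    if not 0.0 <= fraction_different < 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 3/4")
    return -0.75 * math.log(1.0 - 4.0 * fraction_different / 3.0)

# Closed form and explicit Poisson average should agree.
print(jc_identity(0.5), jc_identity_by_sum(0.5))

# The two observed sequences in the example differ at 7 of 15 sites.
p = 7 / 15
d = jc_distance(p)
print(d, d * 15)
```

For the example alignment this gives a distance of roughly 0.73 substitutions per site, or about 11 substitutions over 15 sites, compared with the 12 mutations that actually occurred. With only 15 sites the estimate is noisy, so close agreement should not be expected in general.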

More general models of how a sequence can change: the transition matrix. The transition matrix will (in general) be an irreducible and aperiodic stochastic matrix, so it approaches a unique equilibrium distribution in the limit of long time.
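The approach to equilibrium can be seen by raising a transition matrix to a high power: every row converges to the same distribution, regardless of the starting nucleotide. The matrix `P` below is a made-up example with arbitrary entries, not one taken from the lecture. Since all its entries are positive, it is irreducible and aperiodic.

```python
def matmul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matrix_power(p, steps):
    """p multiplied by itself `steps` times, starting from the identity."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = matmul(result, p)
    return result

# Hypothetical per-step transition probabilities over (a, c, g, t).
# Row i gives the distribution of the next state given current state i; rows sum to 1.
P = [
    [0.90, 0.02, 0.06, 0.02],
    [0.04, 0.88, 0.02, 0.06],
    [0.09, 0.01, 0.88, 0.02],
    [0.03, 0.06, 0.01, 0.90],
]

P_long = matrix_power(P, 500)
for row in P_long:
    print([round(x, 4) for x in row])

# Any row of P_long approximates the equilibrium distribution pi, which satisfies pi P = pi.
pi = P_long[0]
pi_next = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]
```

Because this `P` is not symmetric, the equilibrium frequencies are not all ¼, unlike in the Jukes-Cantor model.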

For sites where mutations usually alter the protein sequence (such as first and second codon positions), evolution is clearly dominated by selection on the protein sequence.

We therefore use phenomenological transition matrices that capture the combined effect of mutation and selection on the coding sequence. Let’s look in detail at the WAG matrix…
