Understanding Selection Pressure on Protein Evolution: A Review

Identifying and Modeling Selection Pressure (a review of three papers) • Rose Hoberman • BioLM seminar • Feb 9, 2004

Today • McClellan and McCracken: Estimating the Influence of Selection on the Variable Amino Acid Sites of the Cytochrome b Protein Functional Domains • Dagan et al: Ratios of Radical to Conservative Amino Acid Replacement are Affected by Mutational and Compositional Factors and May Not Be Indicative of Positive Darwinian Selection • Halpern and Bruno: Evolutionary Distances for Protein-Coding Sequences: Modeling Site-Specific Residue Frequencies

Types of Selection • negative purifying selection • non-synonymous codon changes are selected against • neutral selection • non-synonymous changes in codons have an equivalent probability of elimination or fixation • positive diversifying selection • non-synonymous codon changes are selected for

Identifying Regions Under Selective Pressure • ds/dn << 1 and ds/dn >> 1 commonly used • synonymous substitutions become saturated more quickly than nonsynonymous ones • compare conservative/radical substitution ratio to expected distribution under neutral model

A “conservative” definition • Cluster amino acids according to physicochemical properties • Charge • Volume • Polarity • Grantham’s distance • ... • Within-class = conservative • Across-class = radical
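A minimal sketch of the within-class/across-class rule, using charge as the property. The three charge classes below (positive R/H/K, negative D/E, neutral for the rest) are one common grouping and are an assumption here; the papers under review may define their classes differently. The names `CHARGE_CLASS` and `classify_replacement` are made up for illustration.

```python
# Assumed charge classes (one common grouping; not taken from the slides).
CHARGE_CLASS = {}
for aa in "RHK":
    CHARGE_CLASS[aa] = "positive"
for aa in "DE":
    CHARGE_CLASS[aa] = "negative"
for aa in "ACFGILMNPQSTVWY":
    CHARGE_CLASS[aa] = "neutral"


def classify_replacement(aa_from, aa_to, classes=CHARGE_CLASS):
    """Label an amino acid replacement as identical, conservative, or radical."""
    if aa_from == aa_to:
        return "identical"
    # Within-class = conservative, across-class = radical.
    if classes[aa_from] == classes[aa_to]:
        return "conservative"
    return "radical"


print(classify_replacement("L", "M"))  # both neutral
print(classify_replacement("M", "R"))  # neutral to positive
```

Swapping in a different `classes` dictionary (volume, polarity, a thresholded Grantham distance) changes which replacements count as radical, which is exactly why the choice of property matters for the tests discussed later.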

Assessing Substitution Rates • 2 sequences • average over all possible pathways between two codons • TTG(Leu) - ATG(Met) - AGG(Arg) - AGA(Arg) • Many sequences • Build a phylogenetic tree • Infer most likely ancestral sequences • Count synonymous and nonsynonymous substitutions
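The two-sequence case can be sketched as follows: enumerate every order in which the differing codon positions could have changed, then average the synonymous and nonsynonymous step counts over the pathways. Two assumptions are made here that the slide does not state: all pathways are weighted equally, and pathways passing through a stop codon are discarded. The function names are hypothetical; the codon table is the standard genetic code.

```python
from itertools import permutations, product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def pathways(codon_a, codon_b):
    """Yield every single-step pathway from codon_a to codon_b."""
    diff = [i for i in range(3) if codon_a[i] != codon_b[i]]
    for order in permutations(diff):
        path, cur = [codon_a], list(codon_a)
        for i in order:
            cur[i] = codon_b[i]
            path.append("".join(cur))
        yield path


def average_counts(codon_a, codon_b):
    """Average (synonymous, nonsynonymous) steps over stop-free pathways."""
    syn = nonsyn = n_paths = 0
    for path in pathways(codon_a, codon_b):
        # Assumption: skip pathways with a stop codon as an intermediate.
        if any(CODE[c] == "*" for c in path[1:-1]):
            continue
        n_paths += 1
        for x, y in zip(path, path[1:]):
            if CODE[x] == CODE[y]:
                syn += 1
            else:
                nonsyn += 1
    if n_paths == 0:
        raise ValueError("no stop-free pathway between the two codons")
    return syn / n_paths, nonsyn / n_paths


print(average_counts("TTG", "AGA"))
```

For TTG and AGA, the pathway on the slide (TTG, ATG, AGG, AGA) is one of six possible orders; two of the six pass through TGA and are dropped under the assumption above.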

Cytochrome b Gene Evolution • Matrix and Transmembrane regions have comparable rates of change • Intermembrane region has lower rate of change (McClellan and McCracken)

Group Non-Syn Mutations • 5 Properties • 4 Groups • Neutral model • based only on codon frequencies • Chi-squared test • observed vs. expected (given domain amino acid frequencies)
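The observed-vs-expected comparison can be sketched with a plain chi-squared statistic. The counts and expected proportions below are invented for illustration and are not from McClellan and McCracken; the function name is hypothetical.

```python
def chi_squared_statistic(observed, expected_freqs):
    """Pearson chi-squared statistic for observed counts vs. expected proportions."""
    total = sum(observed)
    stat = 0.0
    for obs, freq in zip(observed, expected_freqs):
        exp = total * freq  # expected count under the neutral model
        stat += (obs - exp) ** 2 / exp
    return stat


# Hypothetical example: 40 substitutions split into two categories,
# with the neutral model expecting an even split.
observed = [30, 10]
expected_freqs = [0.5, 0.5]
print(chi_squared_statistic(observed, expected_freqs))
```

The statistic is then compared against a chi-squared distribution with (number of categories - 1) degrees of freedom to obtain a p-value.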

Question • Do factors unrelated to selection affect the radical/conservative ratio? • nucleotide frequencies • e.g. GC content • transition/transversion ratio • transitions (A<->G and C<->T) are more common than transversions • distances between amino acids • genetic code • codon biases • due to tRNA availability, energy usage, or pathogen avoidance • amino acid frequencies • ??
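As a small illustration of the transition/transversion distinction: a transition stays within purines (A, G) or within pyrimidines (C, T), and a transversion crosses between them. The helper names below are hypothetical, and the two short sequences are made up.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a, b):
    """True if a -> b is a transition, False if it is a transversion."""
    if a == b:
        raise ValueError("no substitution: bases are identical")
    pair = {a, b}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_counts(seq_a, seq_b):
    """Count transitions and transversions between two aligned sequences."""
    ts = tv = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        if is_transition(a, b):
            ts += 1
        else:
            tv += 1
    return ts, tv


print(ts_tv_counts("AAGT", "GACA"))
```

Each base has one possible transition and two possible transversions, so an observed excess of transitions reflects mutational bias rather than opportunity.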

An Initial Test • 3 proteins: Hemoglobin, Interleukin, Ribosomal protein • Simulated neutral evolution using substitution matrix built from pseudogenes • Tested for selection pressure • volume/polarity: 100% FP • Grantham: 13-21% FP • charge: 0% FP (Dagan et al)

Simulation Study • Generate virtual ancestral sequence • 300 nt long • Set mutational/compositional parameters • Simulate evolution (ROSE software) • 50 substitutions • Calculate conservative/radical ratio • Each parameter set simulated 50 times
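A rough sketch of this kind of simulation, written from scratch rather than with the ROSE software used in the study. Everything here is an assumption made for illustration: the ancestral sequence is a random stop-free reading frame, the only mutational parameter is a transition weight `kappa`, substitutions creating stop codons are rejected, and replacements are classified by charge. Substitutions are counted directly as they occur, not inferred from the final sequences.

```python
import random
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
POSITIVE, NEGATIVE = set("RHK"), set("DE")


def charge(aa):
    return "+" if aa in POSITIVE else "-" if aa in NEGATIVE else "0"


def is_transition(a, b):
    return {a, b} in ({"A", "G"}, {"C", "T"})


def mutate_once(codons, kappa, rng):
    """Apply one point substitution in place; return (old_aa, new_aa)."""
    while True:
        i, pos = rng.randrange(len(codons)), rng.randrange(3)
        old = codons[i]
        targets = [b for b in BASES if b != old[pos]]
        # Transitions are weighted kappa times more than transversions.
        weights = [kappa if is_transition(old[pos], b) else 1.0 for b in targets]
        new = old[:pos] + rng.choices(targets, weights)[0] + old[pos + 1:]
        if CODE[new] == "*":
            continue  # assumption: reject substitutions that create a stop
        codons[i] = new
        return CODE[old], CODE[new]


def simulate(n_codons=100, n_subs=50, kappa=2.0, seed=0):
    """Simulate n_subs neutral substitutions on a 3 * n_codons nt sequence."""
    rng = random.Random(seed)
    sense = [c for c, aa in CODE.items() if aa != "*"]
    codons = [rng.choice(sense) for _ in range(n_codons)]
    counts = {"synonymous": 0, "conservative": 0, "radical": 0}
    for _ in range(n_subs):
        a, b = mutate_once(codons, kappa, rng)
        if a == b:
            counts["synonymous"] += 1
        elif charge(a) == charge(b):
            counts["conservative"] += 1
        else:
            counts["radical"] += 1
    return counts


# 50 replicates of one parameter set, as on the slide.
replicates = [simulate(kappa=2.0, seed=s) for s in range(50)]
print(replicates[0])
```

Repeating the replicate loop over a grid of `kappa` values (and, in a fuller version, nucleotide and codon frequencies) gives the kind of table an ANOVA can be run on.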

ANOVA

Conclusion • Many composition and mutation factors influence conservative/radical ratio • Poor indicator of positive selection

Correlation or Causation? • Many factors are correlated, but direction of causation is undetermined • transitions more likely to cause conservative changes than transversions • codon bias can influence nucleotide frequencies • purifying selective pressure will reduce the rate of change • Generative models which model many of these relevant factors

Generative Models of Gene/Protein Evolution • Infer relative distances between sequences • Build a phylogenetic tree • Infer which positions are under positive selective pressure • Find additional homologous proteins • Identify co-varying sites

Modeling Evolutionary Processes • Most models • homogeneous, time-reversible Markov models • Simplest models • DNA mutation models • nucleotide frequencies • transition/transversion ratio
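One concrete example of a DNA mutation model with exactly these two ingredients is the HKY85 rate matrix; the slides do not name a specific model, so this choice is an assumption. The sketch builds the unscaled rate matrix and checks time-reversibility via detailed balance. The frequencies and `kappa` value are arbitrary.

```python
NUCS = "ACGT"


def is_transition(a, b):
    return {a, b} in ({"A", "G"}, {"C", "T"})


def hky_rate_matrix(freqs, kappa):
    """Unscaled HKY85-style rate matrix as a nested dict q[a][b]."""
    q = {a: {} for a in NUCS}
    for a in NUCS:
        for b in NUCS:
            if a != b:
                # Rate is the target frequency, boosted by kappa for transitions.
                q[a][b] = freqs[b] * (kappa if is_transition(a, b) else 1.0)
        # Diagonal makes each row sum to zero.
        q[a][a] = -sum(q[a].values())
    return q


freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # arbitrary example values
q = hky_rate_matrix(freqs, kappa=2.0)

# Time-reversibility: pi_a * q_ab equals pi_b * q_ba for every pair.
for a in NUCS:
    for b in NUCS:
        assert abs(freqs[a] * q[a][b] - freqs[b] * q[b][a]) < 1e-12
```

Homogeneity means this single matrix is applied on every branch of the tree; reversibility means the direction of time along a branch does not affect the likelihood.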

Too Simplistic • positions within codons are not independent • codon or amino acid models • parameters not sufficient to explain different rates of change between specific characters • empirical substitution matrix (e.g. PAM) • site-specific rates of change • use a gamma distribution to model variation in rates • equilibrium frequencies are also site-specific • due to functional or structural constraints
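The gamma approach to site-specific rates can be sketched by drawing one relative rate per site from a gamma distribution with mean 1. The shape value and function name below are illustrative assumptions; a small shape parameter gives many slow sites and a few fast ones.

```python
import random


def sample_site_rates(n_sites, alpha, seed=0):
    """Draw per-site relative rates from Gamma(shape=alpha, scale=1/alpha)."""
    rng = random.Random(seed)
    # Scale 1/alpha fixes the mean rate at 1, so alpha controls only the spread.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


rates = sample_site_rates(n_sites=20000, alpha=0.5)
print(sum(rates) / len(rates))  # close to 1 on average
```

Note what this does and does not capture: each site gets its own overall speed, but every site still shares the same equilibrium frequencies, which is the limitation the last bullet points at.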

Halpern & Bruno 1998 • A codon-based model of evolution • site-invariant DNA-based mutation model • site-specific amino acid level selection model • substitution rate at site i ∝ (probability of mutation) × (probability of fixation at site i)

Halpern & Bruno 1998 • Assumptions • most importantly, selective pressures are constant at a given position for all lineages over all times • sites independent • Markov process is reversible • Does not model • selection at the codon level • codon bias • DNA or RNA structural requirements • uncertainty in MSA

Calculating fixation rates (Kimura 1962) • terms in the equation: • relative fitness of b to a • population size
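The equation itself did not survive on this slide. One common way of writing Kimura's fixation probability is given below, as a reconstruction rather than a quotation. Assumptions: $s_{ab}$ denotes the selective advantage of b relative to a, and $N$ is the effective chromosomal population size (for a diploid population of $N$ individuals the exponent in the denominator is usually written $4Ns$).

```latex
f_{ab} = \frac{1 - e^{-2 s_{ab}}}{1 - e^{-2 N s_{ab}}}
```

In the neutral limit $s_{ab} \to 0$ this reduces to $f_{ab} = 1/N$, the initial frequency of a single new mutant.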

Fixation rates in terms of equilibrium frequencies and mutation probabilities (Kimura 1962) • terms in the equation: • relative fitness of b to a • population size
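A sketch of how the fixation factor can be written using only equilibrium frequencies and mutation probabilities, so that fitness and population size no longer appear explicitly. The form used here, f_ab ∝ ln(x) / (1 - 1/x) with x = (π_b p_ba) / (π_a p_ab), is my reading of the Halpern and Bruno result and should be checked against the paper. The function name and numbers are illustrative.

```python
import math


def relative_fixation(pi_a, pi_b, p_ab, p_ba):
    """Relative fixation rate of a -> b from equilibrium frequencies (pi)
    and mutation probabilities (p). Equals 1 when a and b are equally fit."""
    x = (pi_b * p_ba) / (pi_a * p_ab)
    if math.isclose(x, 1.0, rel_tol=1e-9):
        return 1.0  # limit of ln(x) / (1 - 1/x) as x -> 1
    return math.log(x) / (1.0 - 1.0 / x)


# Hypothetical values for one pair of codons at one site.
pi_a, pi_b = 0.1, 0.3
p_ab, p_ba = 0.02, 0.01

f_ab = relative_fixation(pi_a, pi_b, p_ab, p_ba)
f_ba = relative_fixation(pi_b, pi_a, p_ba, p_ab)

# Reversibility: the flux a -> b matches the flux b -> a.
print(pi_a * p_ab * f_ab, pi_b * p_ba * f_ba)
```

The two printed fluxes agree, which is the reversibility assumption listed earlier expressed numerically: π_a p_ab f_ab = π_b p_ba f_ba.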

A Simpler Formulation • p is estimated from nucleotide frequencies and the transition/transversion ratio • π represents the frequency of each codon, and is approximated via amino acid and nucleotide frequencies • model ignores: • site-specific nucleic acid selection effects (e.g. from RNA structure) • codon bias

Model Fallout • Amount of “flux” between two codons depends on their relative fitness • Rates are not explicitly modeled, but... • maximum substitution rate will be when all codons are equally fit • synonymous codons will have highest flux • because of degeneracy of 3rd position changes, they will be most frequent

Parameter Estimation • Ideal • estimate parameters simultaneously from large data set • What they did • nucleotide frequencies: from observed frequencies • transition/transversion ratio: using existing nucleotide-based methods • equilibrium amino acid frequencies: • estimate number of times each amino acid was introduced at each position (based on phylogenetic tree but ignores genetic code) • add pseudo-counts
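The last step, turning per-site introduction counts into equilibrium frequencies with pseudo-counts, might look like the sketch below. A uniform pseudo-count of 1 per amino acid is an assumption; the slides do not say which pseudo-count scheme the authors used. The counts are made up.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def site_frequencies(counts, pseudocount=1.0):
    """Estimate site-specific amino acid frequencies from counts plus pseudo-counts."""
    total = sum(counts.get(aa, 0) for aa in AMINO_ACIDS)
    total += pseudocount * len(AMINO_ACIDS)
    return {aa: (counts.get(aa, 0) + pseudocount) / total for aa in AMINO_ACIDS}


# Hypothetical site where Leu was introduced 3 times and Ile once.
freqs = site_frequencies({"L": 3, "I": 1})
print(freqs["L"], freqs["I"], freqs["W"])
```

The pseudo-counts keep unobserved amino acids at a nonzero frequency, which matters because a zero equilibrium frequency would make substitutions to that residue impossible under the model.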

Evaluation • Their hypothesis: • methods that only model differing rates will underestimate more remote divergence times • Test hypothesis on simulated data • given an MSA • estimate the tree (multiplied branch lengths by 6.0) • estimate amino acid frequencies • arbitrarily choose mutational parameters • stochastically generate sequences (how many?)

Predicting Distances Between Sequences • A: DNA model (learned?) • B: DNA model with site-rate variation • C: this model with simulation parameters • D: this model with parameters estimated from simulated data • x axis: estimated distances • y axis: true distances (based on simulation)

Conclusions • failing to model selection effects leads to substantial underestimation of longer distances • possible to estimate equilibrium amino acid frequencies from realistic data sets with an accuracy sufficient for estimating distances between highly divergent sequences • model accounts for heterogeneity of rates in a novel and more biologically realistic way • model parameters could in theory be estimated simultaneously using ML or Bayesian estimation
