
Identifying and Modeling Selection Pressure (a review of three papers)

Rose Hoberman, BioLM seminar, Feb 9, 2004.


Presentation Transcript


  1. Identifying and Modeling Selection Pressure (a review of three papers) • Rose Hoberman • BioLM seminar, Feb 9, 2004

  2. Today • McClellan and McCracken: Estimating the Influence of Selection on the Variable Amino Acid Sites of the Cytochrome b Protein Functional Domains • Dagan et al.: Ratios of Radical to Conservative Amino Acid Replacement are Affected by Mutational and Compositional Factors and May Not be Indicative of Positive Darwinian Selection • Halpern and Bruno: Evolutionary Distances for Protein-Coding Sequences: Modeling Site-Specific Residue Frequencies

  3. Types of Selection • negative purifying selection • non-synonymous codon changes are selected against • neutral selection • non-synonymous changes in codons have an equivalent probability of elimination or fixation • positive diversifying selection • non-synonymous codon changes are selected for

  4. Identifying Regions Under Selective Pressure • ds/dn << 1 and ds/dn >> 1 commonly used • synonymous substitutions become saturated more quickly than nonsynonymous ones • compare conservative/radical substitution ratio to expected distribution under neutral model

  5. A “conservative” definition • Cluster amino acids according to physicochemical properties • Charge • Volume • Polarity • Grantham’s distance • ... • Within-class = conservative • Across-class = radical
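
The within-class versus across-class rule can be sketched in a few lines. This is only an illustration for the charge property; the three charge classes below (positive R/H/K, negative D/E, everything else neutral) and the names `CHARGE_CLASS` and `classify_replacement` are assumptions, not taken from the slides.

```python
# Assumed charge classes for illustration; other groupings are possible.
CHARGE_CLASS = {}
for residues, label in (("RHK", "positive"),
                        ("DE", "negative"),
                        ("ACFGILMNPQSTVWY", "neutral")):
    for aa in residues:
        CHARGE_CLASS[aa] = label


def classify_replacement(aa_from, aa_to, classes=CHARGE_CLASS):
    """Return 'conservative' if both residues share a class, else 'radical'."""
    if aa_from == aa_to:
        raise ValueError("not a replacement: residues are identical")
    if classes[aa_from] == classes[aa_to]:
        return "conservative"
    return "radical"


print(classify_replacement("K", "R"))  # both positive
print(classify_replacement("K", "E"))  # positive to negative
```

Swapping in a different `classes` mapping (volume, polarity, a Grantham-distance threshold) changes which replacements count as radical, which is why the choice of property matters for the tests discussed later.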

  6. Assessing Substitution Rates • 2 sequences • average over all possible pathways between two codons • TTG(Leu) - ATG(Met) - AGG(Arg) - AGA(Arg) • Many sequences • Build a phylogenetic tree • Infer most likely ancestral sequences • Count synonymous and nonsynonymous substitutions
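
The two-sequence case (averaging over all pathways between two codons) can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the standard genetic code, weights all pathways equally, and discards pathways that pass through a stop codon. The function name `average_changes` is hypothetical.

```python
from itertools import permutations

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = stop).
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def average_changes(codon_a, codon_b):
    """Average (synonymous, nonsynonymous) step counts over all pathways."""
    diff = [i for i in range(3) if codon_a[i] != codon_b[i]]
    syn_total, nonsyn_total, n_paths = 0, 0, 0
    for order in permutations(diff):
        current, syn, nonsyn, valid = codon_a, 0, 0, True
        for pos in order:
            nxt = current[:pos] + codon_b[pos] + current[pos + 1:]
            if CODE[nxt] == "*":
                valid = False  # assumption: skip pathways through a stop codon
                break
            if CODE[nxt] == CODE[current]:
                syn += 1
            else:
                nonsyn += 1
            current = nxt
        if valid:
            syn_total += syn
            nonsyn_total += nonsyn
            n_paths += 1
    if n_paths == 0:
        raise ValueError("every pathway passes through a stop codon")
    return syn_total / n_paths, nonsyn_total / n_paths


print(average_changes("TTG", "AGA"))
```

For the slide's pair TTG and AGA, the pathway TTG(Leu) - ATG(Met) - AGG(Arg) - AGA(Arg) contributes two nonsynonymous steps and one synonymous step; two of the six orderings pass through TGA (stop) and are dropped under the assumption above.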

  7. Cytochrome b Gene Evolution • Matrix and Transmembrane regions have comparable rates of change • Intermembrane region has lower rate of change (McClellan and McCracken)

  8. Group Non-Syn Mutations • 5 Properties • 4 Groups • Neutral model • based only on codon frequencies • Chi-squared test • observed vs. expected (given domain amino acid frequencies)
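
The observed-versus-expected comparison is a Pearson chi-squared statistic. The sketch below computes only the statistic; the function name and the toy counts are illustrative assumptions, not values from the paper.

```python
def chi_squared(observed, expected_fractions):
    """Pearson chi-squared statistic for counts vs. expected proportions."""
    if len(observed) != len(expected_fractions):
        raise ValueError("observed and expected must have the same length")
    if abs(sum(expected_fractions) - 1.0) > 1e-9:
        raise ValueError("expected fractions must sum to 1")
    total = sum(observed)
    stat = 0.0
    for obs, frac in zip(observed, expected_fractions):
        exp = total * frac  # expected count under the neutral model
        stat += (obs - exp) ** 2 / exp
    return stat


# Toy counts for four groups of replacements (made up for illustration).
print(chi_squared([18, 12, 6, 4], [0.25, 0.25, 0.25, 0.25]))
```

With four groups the statistic has three degrees of freedom, so a value above roughly 7.81 would reject the neutral expectation at the 5% level.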

  9. Question • Do factors unrelated to selection affect the radical/conservative ratio? • nucleotide frequencies • e.g. GC content • transition/transversion ratio • transitions (A<->G and C<->T) are more common than transversions • distances between amino acids • genetic code • codon biases • due to tRNA availability, energy usage, or pathogen avoidance • amino acid frequencies • ??
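
As a small illustration of the transition/transversion distinction, the sketch below classifies single-base changes and counts both kinds between two aligned sequences. The function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def mutation_type(base_from, base_to):
    """Classify a single-base change as a transition or a transversion."""
    if base_from == base_to:
        raise ValueError("bases are identical")
    pair = {base_from, base_to}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"      # purine<->purine or pyrimidine<->pyrimidine
    return "transversion"        # purine<->pyrimidine


def count_ts_tv(seq_a, seq_b):
    """Count (transitions, transversions) between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    ts = tv = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        if mutation_type(a, b) == "transition":
            ts += 1
        else:
            tv += 1
    return ts, tv


print(count_ts_tv("AACC", "GATA"))
```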

  10. An Initial Test • 3 proteins: Hemoglobin, Interleukin, Ribosomal protein • Simulated neutral evolution using substitution matrix built from pseudogenes • Tested for selection pressure • volume/polarity: 100% FP • Grantham: 13-21% FP • charge: 0% FP (Dagan et al.)

  11. Simulation Study • Generate virtual ancestral sequence • 300 nt long • Set mutational/compositional parameters • Simulate evolution (ROSE software) • 50 substitutions • Calculate conservative/radical ratio • Each parameter set simulated 50 times
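
The simulation loop can be mimicked with a toy script. This is not the ROSE software and not the authors' protocol: it is a simplified stand-in that draws a random 300 nt coding sequence, applies 50 point substitutions with an adjustable transition bias, rejects changes that create stop codons, and tallies conservative versus radical replacements by charge. All names, the charge classes, and the `ts_bias` parameterization are assumptions.

```python
import random

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
# Assumed charge classes; residues not listed are treated as neutral.
CHARGE = {"R": "+", "H": "+", "K": "+", "D": "-", "E": "-"}


def simulate(n_codons=100, n_subs=50, ts_bias=2.0, rng=None):
    """Return (conservative, radical) replacement counts for one toy run."""
    rng = rng or random.Random()
    sense = [codon for codon, aa in CODE.items() if aa != "*"]
    seq = [rng.choice(sense) for _ in range(n_codons)]
    conservative = radical = accepted = 0
    while accepted < n_subs:
        i, pos = rng.randrange(n_codons), rng.randrange(3)
        old = seq[i]
        base = old[pos]
        # One transition target and two transversion targets per base;
        # ts_bias is the transition rate relative to each transversion.
        if rng.random() < ts_bias / (ts_bias + 2.0):
            new_base = TRANSITION[base]
        else:
            new_base = rng.choice(
                [b for b in BASES if b not in (base, TRANSITION[base])])
        new = old[:pos] + new_base + old[pos + 1:]
        if CODE[new] == "*":
            continue  # assumption: nonsense mutations are never fixed
        seq[i] = new
        accepted += 1
        if CODE[new] != CODE[old]:
            if CHARGE.get(CODE[old], "0") == CHARGE.get(CODE[new], "0"):
                conservative += 1
            else:
                radical += 1
    return conservative, radical


# Repeat each parameter setting 50 times, as on the slide.
for bias in (1.0, 2.0, 5.0):
    runs = [simulate(ts_bias=bias, rng=random.Random(seed)) for seed in range(50)]
    cons = sum(c for c, _ in runs)
    rad = sum(r for _, r in runs)
    print(bias, cons, rad)
```

Because no selection acts in this toy, any change in the conservative/radical tally across `ts_bias` settings comes from mutational factors alone, which is the kind of effect the study set out to measure.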

  12. ANOVA

  13. Conclusion • Many composition and mutation factors influence conservative/radical ratio • Poor indicator of positive selection

  14. Correlation or Causation? • Many factors are correlated, but direction of causation is undetermined • transitions more likely to cause conservative changes than transversions • codon bias can influence nucleotide frequencies • purifying selective pressure will reduce the rate of change • Generative models which model many of these relevant factors

  15. Generative Models of Gene/Protein Evolution • Infer relative distances between sequences • Build a phylogenetic tree • Infer which positions are under positive selective pressure • Find additional homologous proteins • Identify co-varying sites

  16. Modeling Evolutionary Processes • Most models • homogeneous, time-reversible Markov models • Simplest models • DNA mutation models • nucleotide frequencies • transition/transversion ratio
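
A concrete instance of such a DNA mutation model is an HKY85-style rate matrix, which has exactly these two ingredients: nucleotide frequencies and a transition/transversion rate ratio. The slides do not name a specific model, so HKY85 is an assumed example here, and the frequencies below are made up.

```python
def rate_matrix(freqs, kappa):
    """HKY85-style rate matrix as a dict keyed by (from_base, to_base)."""
    bases = "ACGT"
    transitions = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
    q = {}
    for i in bases:
        row_sum = 0.0
        for j in bases:
            if i == j:
                continue
            # Rate is proportional to the target frequency; transitions
            # are scaled by kappa relative to transversions.
            rate = freqs[j] * (kappa if (i, j) in transitions else 1.0)
            q[i, j] = rate
            row_sum += rate
        q[i, i] = -row_sum  # rows of a rate matrix sum to zero
    return q


freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # illustrative values
q = rate_matrix(freqs, kappa=2.0)
# Time reversibility: pi_i * q_ij equals pi_j * q_ji for every pair.
print(all(abs(freqs[i] * q[i, j] - freqs[j] * q[j, i]) < 1e-12
          for i in "ACGT" for j in "ACGT" if i != j))
```

The detailed-balance check at the end is what "time-reversible" means in practice for these models.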

  17. Too Simplistic • positions within codons are not independent • codon or amino acid models • parameters not sufficient to explain different rates of change between specific characters • empirical substitution matrix (e.g. PAM) • site-specific rates of change • use a gamma distribution to model variation in rates

  18. Too Simplistic • positions within codons clearly not independent • codon or amino acid models • different rates of change between specific characters • empirical substitution matrix (e.g. PAM) • site-specific rates of change • use a gamma distribution to model variation in rates • equilibrium frequencies are also site-specific • due to functional or structural constraints

  19. Too Simplistic • positions within codons are not independent • codon or amino acid models • parameters not sufficient to explain different rates of change between specific characters • empirical substitution matrix (e.g. PAM) • site-specific rates of change • use a gamma distribution to model variation in rates • equilibrium frequencies are also site-specific • due to functional or structural constraints
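
Gamma-distributed rate variation is commonly implemented by drawing one rate multiplier per site from a gamma distribution with mean 1. The sketch below assumes that convention (shape `alpha`, scale `1/alpha`); the function name is hypothetical.

```python
import random


def draw_site_rates(n_sites, alpha, rng=None):
    """Draw per-site rate multipliers from a gamma distribution with mean 1."""
    rng = rng or random.Random()
    # random.gammavariate takes (shape, scale); shape * scale is the mean.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


rates = draw_site_rates(10, alpha=0.5, rng=random.Random(0))
print([round(r, 3) for r in rates])
```

Small `alpha` gives strongly skewed rates (most sites nearly invariant, a few fast); large `alpha` approaches a single shared rate. Note that this only rescales the rate at each site and leaves equilibrium frequencies identical across sites, which is the limitation the last bullet points at.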

  20. Halpern & Bruno 1998 • A codon-based model of evolution • site-invariant DNA-based mutation model • site-specific amino-acid-level selection model • substitution rate from codon a to codon b at site i ∝ (probability of mutation) × (probability of fixation at site i)

  21. Halpern & Bruno 1998 • Assumptions • most importantly, selective pressures are constant at a given position for all lineages over all times • sites independent • Markov process is reversible • Does not model • selection at the codon level • codon bias • DNA or RNA structural requirements • uncertainty in MSA

  22. (Kimura 1962) Calculating fixation rates • relative fitness of b to a • population size
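
The slide's equation is not preserved in this transcript. As a hedged sketch, the code below uses a commonly quoted form of Kimura's fixation probability, (1 - e^(-2s)) / (1 - e^(-2Ns)), where `s` stands for the relative fitness advantage of b over a and `n` for the population size. Both the exact form and the symbol names are assumptions and should be checked against the paper.

```python
import math


def fixation_probability(s, n):
    """Assumed Kimura-style fixation probability of a new mutant.

    s: selective advantage of b relative to a (assumed symbol)
    n: population size (assumed symbol)
    """
    if s == 0:
        return 1.0 / n  # neutral limit of the expression below
    # expm1(x) = e^x - 1 keeps precision when |s| is small.
    # Very large values of -2*n*s would overflow expm1; not handled here.
    return math.expm1(-2.0 * s) / math.expm1(-2.0 * n * s)


n = 1000
for s in (-0.001, 0.0, 0.001):
    print(s, fixation_probability(s, n))
```

The printed values should show the expected ordering: a deleterious change fixes less often than a neutral one, and an advantageous change more often.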

  23. (Kimura 1962) Fixation rates in terms of equilibrium rates and mutation probabilities • relative fitness of b to a • population size
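
The equation on this slide is also missing from the transcript. The sketch below assumes the closed form usually associated with Halpern and Bruno's model, in which the relative fixation probability depends only on x = (π_b p_ba) / (π_a p_ab), with π the equilibrium codon frequencies and p the mutation probabilities. Treat the expression and the name `relative_fixation` as assumptions to be verified against the paper.

```python
import math


def relative_fixation(pi_a, pi_b, p_ab, p_ba):
    """Assumed relative fixation probability for a change from a to b."""
    x = (pi_b * p_ba) / (pi_a * p_ab)
    if math.isclose(x, 1.0):
        return 1.0  # limit of ln(x) / (1 - 1/x) as x approaches 1
    return math.log(x) / (1.0 - 1.0 / x)


# Illustrative numbers only.
pi_a, pi_b, p_ab, p_ba = 0.02, 0.05, 0.3, 0.1
f_ab = relative_fixation(pi_a, pi_b, p_ab, p_ba)
f_ba = relative_fixation(pi_b, pi_a, p_ba, p_ab)
# Reversibility: flux a->b equals flux b->a at equilibrium.
print(math.isclose(pi_a * p_ab * f_ab, pi_b * p_ba * f_ba))
```

Under this form the flux π_a · p_ab · f_ab is symmetric in a and b, which matches the reversibility assumption listed for the model.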

  24. A Simpler Formulation • p is estimated from nucleotide frequencies and the transition/transversion ratio • π represents the frequency of each codon, and is approximated via amino acid and nucleotide frequencies • model ignores: • site-specific nucleic acid selection effects (e.g. from RNA structure) • codon bias

  25. Model Fallout • Amount of “flux” between two codons depends on their relative fitness • Rates are not explicitly modeled, but... • maximum substitution rate will be when all codons are equally fit • synonymous codons will have highest flux • because of degeneracy of 3rd position changes, they will be most frequent

  26. Parameter Estimation • Ideal • estimate parameters simultaneously from large data set • What they did • nucleotide frequencies: from observed frequencies • transition/transversion ratio: using existing nucleotide-based methods • equilibrium amino acid frequencies: • estimate number of times each amino acid was introduced at each position (based on phylogenetic tree but ignores genetic code) • add pseudo-counts
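
The pseudo-count step can be sketched as below. The counts stand in for the estimated number of times each amino acid was introduced at one alignment position; the uniform pseudo-count value and the function name are assumptions, since the slide does not say how the pseudo-counts were chosen.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def site_frequencies(counts, pseudo=1.0):
    """Estimate per-site amino acid frequencies with uniform pseudo-counts."""
    total = sum(counts.get(aa, 0) for aa in AMINO_ACIDS)
    total += pseudo * len(AMINO_ACIDS)
    return {aa: (counts.get(aa, 0) + pseudo) / total for aa in AMINO_ACIDS}


# Toy counts for one site (made up for illustration).
freqs = site_frequencies({"L": 8, "I": 2}, pseudo=0.5)
print(freqs["L"], freqs["I"], freqs["W"])
```

Without pseudo-counts, any amino acid never observed at a site would get frequency zero, which would make substitutions to it impossible under the model.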

  27. Evaluation • Their hypothesis: • methods that only model differing rates will underestimate more remote divergence times • Test hypothesis on simulated data • given an MSA • estimate the tree (multiplied branch lengths by 6.0) • estimate amino acid frequencies • arbitrarily choose mutational parameters • stochastically generate sequences (how many?)

  28. Predicting Distances Between Sequences • A: DNA model (learned?) • B: DNA model with site-rate variation • C: this model with simulation parameters • D: this model with parameters estimated from simulated data • x axis: estimated distances • y axis: true distances (based on simulation)

  29. Conclusions • failing to model selection effects leads to substantial underestimation of longer distances • possible to estimate equilibrium amino acid frequencies from realistic data sets with an accuracy sufficient for estimating distances between highly divergent sequences • model accounts for heterogeneity of rates in a novel and more biologically realistic way • model parameters could in theory be estimated simultaneously using ML or Bayesian estimation
