
Rose Hoberman Roni Rosenfeld Judith Klein-Seetharaman

Using physical-chemical properties of amino acids to model site-specific substitution propensities. Rose Hoberman, Roni Rosenfeld, Judith Klein-Seetharaman.


Presentation Transcript


  1. Using physical-chemical properties of amino acids to model site-specific substitution propensities • Rose Hoberman, Roni Rosenfeld, Judith Klein-Seetharaman

  2. Heterogeneity Across Sites • Substitution rate varies across sites • rate parameter assumed to follow a gamma distribution • mathematically convenient • little biological justification • provides little explanation

  3. Heterogeneity Across Sites • Rate of substitution varies across sites • rate parameter distributed according to a gamma distribution • mathematically convenient • little biological justification • provides little explanation • Substitution propensities vary across sites • leads to an explosion of parameters (400) • still no biological explanation

  4. Explaining Why Substitution Propensities Vary • Differing substitution propensities are a result of different amino acid preferences (Halpern & Bruno, Koshi & Goldstein) • e.g. substitutions to deleterious amino acids are unlikely • Learning amino acid preferences at each site (~20 vs ~400 parameters) • still too many parameters to estimate accurately • still not biologically informative

  5. Our Modeling Assumption • Amino acid preferences are based on which physical and chemical properties are important at each site to the function or structure of the protein • restricts the parameter space (3-5) • provides more explanation

  6. A New Statistical Model of Site-Specific Molecular Evolution • Learn which properties are important at each site • Model amino acid preferences as a function of their properties • Determine a mapping from amino acid preferences to substitution propensities • Combine property-based substitution propensities with other factors that affect substitutions • nucleotide mutation processes • different distances between codons

  7. A New Statistical Model of Site-Specific Molecular Evolution • Learn which properties are important at each site • don’t rely on structural knowledge about the protein • do not artificially restrict to a few preselected physical features • Model amino acid preferences as a function of their properties • Determine a mapping from amino acid preferences to substitution propensities • Combine substitution propensities with codon distance and nucleotide mutation rates

  8. 250 Amino Acid Properties (Downloaded from http://www.scsb.utmb.edu/comp biol.html/venkat/prop.html)

  9. 250 Amino Acid Properties

  10. Visualizing the Amino Acid Distribution • Example alignment rows: FAMLR... LAMLR... IAMLR... P-EL-... GAELR... PGEIR... L-ELY... L-EVR... I-MLK... WAELR... HAELY... YAILY... WAML-...

  11. Variance • Example alignment rows: FAMLR... LAMLR... IAMLR... P-EL-... GAELR... PGEIR... L-ELY... L-EVR... I-MLK... WAELR... HAELY... YAILY... WAML-...
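The variance idea on this slide can be sketched as a frequency-weighted variance of one property over the residues in a single alignment column. This is only an illustration under stated assumptions: the Kyte-Doolittle hydropathy scale stands in for whichever of the 250 properties is being examined (the slides do not say), gaps are simply skipped, and `property_variance` is a hypothetical helper name.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values, used only as an illustrative property scale
# (an assumption; the slides do not name the property shown).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def property_variance(column, scale):
    """Frequency-weighted variance of a property over one alignment column."""
    residues = [aa for aa in column if aa in scale]  # gaps ('-') are skipped
    if not residues:
        raise ValueError("column has no scorable residues")
    counts = Counter(residues)
    n = len(residues)
    mean = sum(scale[aa] * c for aa, c in counts.items()) / n
    return sum(c * (scale[aa] - mean) ** 2 for aa, c in counts.items()) / n


# First and second columns of the example alignment rows on the slide.
first_column = "FLIPGPLLIWHYW"
second_column = "AAA-AG---AAAA"

print(property_variance(first_column, HYDROPATHY))
print(property_variance(second_column, HYDROPATHY))
```

A low value suggests the property may be conserved at that position, although a low-entropy column gives a low variance for almost any property.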

  12. Limitations of Variance

  13. Limitations of Variance • Our assumption: when selection is based on a single property, distribution should be unimodal

  14. Using Gaussian Goodness-of-Fit to Test for Property Conservation • Fit a maximum-likelihood Gaussian to amino acid frequencies in property space • From (discretized) Gaussian calculate expected AA frequencies • Calculate goodness-of-fit to learned Gaussian • identifies unimodal distributions • penalizes missing amino acids (“holes”) • Use Monte-Carlo method to calculate significance • Otherwise will have high false discovery rate when entropy is low
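The first three steps on this slide can be sketched as follows. The details are assumptions, not the authors' stated procedure: the Gaussian is discretized by evaluating its density at each amino acid's property value and normalizing over the 20 amino acids, the fit is scored with a chi-square-style statistic, and the Kyte-Doolittle hydropathy scale is used as a stand-in property. The names `gaussian_fit_statistic` and `min_sd` are hypothetical.

```python
import math
from collections import Counter

# Kyte-Doolittle hydropathy values, an illustrative stand-in for one property scale.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gaussian_fit_statistic(column, scale, min_sd=1e-3):
    """Chi-square-style misfit between observed residue counts and a fitted Gaussian.

    Lower values mean the column looks more like a single Gaussian in property space.
    """
    residues = [aa for aa in column if aa in scale]  # gaps are skipped
    n = len(residues)
    if n == 0:
        raise ValueError("column has no scorable residues")
    counts = Counter(residues)

    # Step 1: maximum-likelihood Gaussian (weighted mean, biased variance).
    mean = sum(scale[aa] * c for aa, c in counts.items()) / n
    var = sum(c * (scale[aa] - mean) ** 2 for aa, c in counts.items()) / n
    sd = max(math.sqrt(var), min_sd)  # floor avoids a zero-width Gaussian

    # Step 2: discretize (assumed scheme): density at each amino acid's
    # property value, normalized over all 20 amino acids.
    density = {aa: math.exp(-0.5 * ((x - mean) / sd) ** 2) for aa, x in scale.items()}
    total = sum(density.values())

    # Step 3: compare observed and expected counts over all 20 amino acids, so
    # unobserved amino acids near the mean ("holes") add to the misfit.
    stat = 0.0
    for aa, d in density.items():
        expected = max(n * d / total, 1e-12)  # guard against underflow to zero
        stat += (counts.get(aa, 0) - expected) ** 2 / expected
    return stat


unimodal = "ILVILVILV"     # residues clustered at high hydropathy
bimodal = "IIIIIIRRRRRR"   # two clusters at opposite ends of the scale

print(gaussian_fit_statistic(unimodal, HYDROPATHY))
print(gaussian_fit_statistic(bimodal, HYDROPATHY))
```

The bimodal column scores much worse because the fitted Gaussian expects residues in the middle of the scale that never appear. The Monte-Carlo significance step is not shown here.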

  15. GPCR-A Family • Characterized by 7 TM segments • Responds to a large variety of ligands • Ligand binding allows binding and activation of a G protein • Diversity in sequences • Believed to share similar structure • Only known structure is for Rhodopsin

  16. Results for GPCR

  17. Estimating the False Discovery Rate (FDR) • FDR = # false positives / # predicted positives
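One common way to estimate this ratio from a set of p-values is sketched below. It is an assumption that the slides' estimate works this way: the number of false positives is approximated by the count expected if every test were null (number of tests times the threshold), which is conservative. The function name `estimated_fdr` is hypothetical.

```python
def estimated_fdr(p_values, threshold):
    """Estimate FDR = (# false positives) / (# predicted positives) at a p-value cutoff.

    Assumes null p-values are uniform, so about len(p_values) * threshold tests
    would pass the cutoff by chance even if nothing were truly conserved.
    """
    predicted_positives = sum(1 for p in p_values if p <= threshold)
    if predicted_positives == 0:
        return 0.0  # no predictions, so no false discoveries
    expected_false_positives = len(p_values) * threshold
    return min(1.0, expected_false_positives / predicted_positives)


# Made-up p-values for ten (position, property) tests, for illustration only.
example_p_values = [0.001, 0.002, 0.003, 0.004, 0.2, 0.5, 0.6, 0.7, 0.8, 0.9]
print(estimated_fdr(example_p_values, threshold=0.01))
```

Here 4 tests pass the cutoff while about 0.1 would be expected to pass by chance, so the estimated FDR is about 0.025.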

  18. Initial Validation • Charge conserved at 134 • part of D/E R Y motif of importance to binding and activation of G-protein • Size conserved at 54, 80, 87, 123, 132, 153, 299 • helix faces one or two other helices • Cluster of dynamics properties conserved in third cytoplasmic loop • in Rhodopsin this is the most flexible interhelical loop

  19. Continuing Work • Use multivariate Gaussian to model selection pressure from multiple properties • Derive substitution propensities from amino acid preferences and combine these with codon distance effects and nucleotide mutation rates

  20. Thank You • Roni Rosenfeld • Judith Klein-Seetharaman • NSF

  21. Summary • Proposed a new approach for modeling heterogeneity of the evolutionary process across sites • Designed a test that is able to identify which properties are conserved at different sites • Promising approach for modeling site-specific substitution propensities in a biologically-realistic and computationally tractable way

  22. Significance • Problem: for positions with low entropy, every property will have low variance • very high false positive rate: any combination of 1 or more properties can explain this! • actual explanation may involve several properties • In this case, multiple property constraints • Cannot determine which one property is conserved • Need to condition on entropy

  23. Significance Testing • What is the probability of a property having low variance in this position purely by chance? • Generate a large set of “random” (shuffled) property scales • show examples of shuffling • Calculate variance for each random property • The distribution of this statistic can be used to calculate a threshold for acceptability of false-positives • Show picture here? add error bars?
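The shuffling test described on this slide can be sketched as a small Monte-Carlo procedure. The specifics are assumptions: a "random" property is made by permuting the real scale's values across the 20 amino acids, the p-value is the add-one fraction of shuffled scales whose variance is at least as low as the observed one, and Kyte-Doolittle hydropathy stands in for the property. All function names are hypothetical.

```python
import random
from collections import Counter

# Kyte-Doolittle hydropathy values, an illustrative stand-in for one property scale.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def property_variance(column, scale):
    """Frequency-weighted variance of a property over one alignment column."""
    residues = [aa for aa in column if aa in scale]  # gaps are skipped
    if not residues:
        raise ValueError("column has no scorable residues")
    counts = Counter(residues)
    n = len(residues)
    mean = sum(scale[aa] * c for aa, c in counts.items()) / n
    return sum(c * (scale[aa] - mean) ** 2 for aa, c in counts.items()) / n


def shuffled_scale(scale, rng):
    """A 'random' property: the same values reassigned to amino acids at random."""
    values = list(scale.values())
    rng.shuffle(values)
    return dict(zip(scale.keys(), values))


def variance_p_value(column, scale, n_shuffles=1000, seed=0):
    """Chance that a shuffled property has variance as low as the real one here."""
    rng = random.Random(seed)
    observed = property_variance(column, scale)
    hits = sum(
        property_variance(column, shuffled_scale(scale, rng)) <= observed
        for _ in range(n_shuffles)
    )
    return (hits + 1) / (n_shuffles + 1)  # add-one keeps the estimate above zero


print(variance_p_value("ILVILVILV", HYDROPATHY))  # hydrophobic column
print(variance_p_value("AAAAAAAA", HYDROPATHY))   # fully conserved column
```

Because the column is held fixed while the scale is shuffled, a fully conserved column gets a p-value of 1: every shuffled property also has zero variance there, so no single property can be singled out as the one conserved.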

  24. Gaussian Significance I

  25. Related Work • Koshi & Goldstein 1998 • Halpern & Bruno 1997

  26. New Model • Model of One Fitness Class • Model of Multiple Sequences from one Protein Family

  27. Abstract • Existing models of molecular evolution capture much of the variability in mutation rates across sites. More biologically realistic models also seek to explain site-specific differences in substitution propensities between residue pairs, leading to more accurate and informative models of evolutionary dynamics. Toward this end, we describe a procedure for systematically characterizing the conservation of each position in a multiple sequence alignment in terms of specific physical and chemical properties. We use a Monte-Carlo method to ascertain the statistical significance of the findings and to control the False Discovery Rate. We use our method to annotate the diverse GPCR-A family with a selection pressure profile. We demonstrate the computational and statistical significance of the properties we have identified, and discuss the biological significance of our findings. The latter include confirmation of experimentally determined properties as well as novel testable hypotheses.

  28. Results

  29. Novel Hypothesis • Positions 175 and 265 have highly similar conservation patterns • Both are tryptophans in rhodopsin • Trp265 is in direct contact with the retinal ligand, but when exposed to light, crosslinks to Ala169 instead • Trp161 has been proposed to contribute to this process • The property conservation patterns suggest Trp175 has a more significant role • This hypothesis can be tested experimentally
