
Detecting gene-gene interactions in SNP-association studies



  1. Detecting gene-gene interactions in SNP-association studies Peng Zhou, 03/21/2011

  2. Contents • Basic concepts • General terms, statistical models, etc. • Computational approaches • Exhaustive searches • Data-mining and machine-learning approaches • Bayesian model selection approaches • Summary

  3. Statistical models of association • Quantitative outcome (y) • Predictor variable (x) • Linear regression: • y = mx + c • Multiple regression: • y = m1x1 + m2x2 + m3x3 + c
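A minimal sketch of the multiple-regression model above, assuming SNP genotypes coded as 0/1/2 copies of the minor allele and simulated effect sizes; the coefficient names m1, m2, m3 and c follow the slide:

```python
# Quantitative phenotype y regressed on three simulated SNP genotypes by
# ordinary least squares. Data and effect sizes are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.integers(0, 3, size=(n, 3)).astype(float)        # genotype matrix
y = 0.5 * X[:, 0] + 0.2 * X[:, 2] + rng.normal(size=n)   # true m1=0.5, m3=0.2

design = np.column_stack([X, np.ones(n)])                 # add intercept column c
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
m1, m2, m3, c = coef
print(f"m1={m1:.2f}  m2={m2:.2f}  m3={m3:.2f}  c={c:.2f}")
```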

  4. Statistical models of association • Disease (Binary) outcome • Log odds: ln(p/(1 – p)) • predictor variable (x) • Linear regression: • ln(p/(1 – p))= mx + c • Multiple regression: • ln(p/(1 – p)) = m1x1 + m2x2 + m3x3 + c

  5. Statistical models of interaction • Logistic regression: • ln(p/(1 – p)) = α + βxB + γxC + ixBxC • xB and xC: measured indicator variables • β, γ: main-effect terms • i: interaction term • The test of interaction between xB and xC is simply: • a one-degree-of-freedom test of i = 0 • Factors that display interaction effects without displaying main effects will be missed in single-locus tests
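A sketch of this one-degree-of-freedom interaction test using statsmodels on simulated genotypes; the effect sizes and sample size are illustrative assumptions, and in a real study xB and xC would be the two loci under test:

```python
# Fit ln(p/(1-p)) = alpha + beta*xB + gamma*xC + i*xB*xC and read off the
# one-degree-of-freedom (Wald) test of i = 0. The generating model below
# has an interaction coefficient only (beta = gamma = 0).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
xB = rng.integers(0, 3, size=n).astype(float)
xC = rng.integers(0, 3, size=n).astype(float)
p = 1 / (1 + np.exp(-(-1.0 + 0.4 * xB * xC)))      # inverse logit
y = rng.binomial(1, p)                              # case (1) / control (0) status

exog = sm.add_constant(np.column_stack([xB, xC, xB * xC]))
fit = sm.Logit(y, exog).fit(disp=0)
print(fit.params)                                   # alpha, beta, gamma, i
print("p-value for the test of i = 0:", fit.pvalues[-1])
```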

  6. Case-control approach • Most popular approach • Fit a logistic regression model and test whether the interaction term equals zero • Linear regression in the case of quantitative phenotypes • ‘--epistasis’ option in the software package PLINK • the user provides the significance level

  7. Exhaustive search • Analyze all possible pairs of loci and perform the desired interaction test for each pair • Multiple testing issue • 100,000 loci -> ~10^10 tests -> 5×10^-12 global significance threshold • Are the tests independent? • Loci in linkage disequilibrium • However, it is feasible • 33 hours on a 10-node cluster to perform all pairwise tests of association allowing for interaction at 300,000 loci in 1,000 cases and 1,000 controls (Marchini et al., Nature Genetics, 2005)
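The multiple-testing arithmetic can be checked back-of-envelope; the exact count depends on whether ordered or unordered pairs are tallied (the slide's 10^10 is a round figure of order L^2):

```python
# Number of pairwise tests for L loci and the Bonferroni-corrected threshold.
L = 100_000
pairs = L * (L - 1) // 2
print(f"unordered pairs        : {pairs:.2e}")           # ~5.0e9
print(f"0.05 / 1e10 tests      : {0.05 / 1e10:.1e}")     # 5.0e-12
print(f"0.05 / {pairs:.1e} tests: {0.05 / pairs:.1e}")   # ~1.0e-11
```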

  8. Exhaustive search • Does not scale up to higher-order interactions • The number of tests and the analysis time increase exponentially with the order of interaction • Two-stage procedures: • A subset of loci that pass some single-locus significance threshold is chosen • An exhaustive search of all two-locus (or higher-order) interactions is carried out on this ‘filtered’ subset • Genome-Wide Interaction-Based Association Analysis Identified Multiple New Susceptibility Loci for Common Diseases (Yang et al., PLoS Genetics, 2011) • Drawback: interactions that do not show any marginal effects will be missed
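An illustrative sketch of such a two-stage procedure (not the Yang et al. implementation); the helper names, the chi-square filter and the stage-1 threshold are assumptions:

```python
# Stage 1 keeps loci whose single-locus chi-square p-value passes a lenient
# threshold; stage 2 runs a user-supplied pairwise interaction test on the
# filtered subset only. Genotypes are assumed coded 0/1/2, status 0/1.
from itertools import combinations

import numpy as np
from scipy.stats import chi2_contingency

def single_locus_p(genotypes, status):
    """Chi-square test on the 3 x 2 genotype-by-status contingency table."""
    table = np.zeros((3, 2))
    for g, s in zip(genotypes, status):
        table[g, s] += 1
    return chi2_contingency(table)[1]

def two_stage(geno, status, pair_test, threshold=1e-3):
    """geno: (n_individuals, n_loci) array; pair_test: callable on two loci."""
    kept = [j for j in range(geno.shape[1])
            if single_locus_p(geno[:, j], status) < threshold]   # stage 1: filter
    return {(a, b): pair_test(geno[:, a], geno[:, b], status)    # stage 2: exhaustive
            for a, b in combinations(kept, 2)}                   # search on the subset
```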

  9. Data mining approaches • Traditional regression-based methods are not able to deal with nonlinear models or high-dimensional data • Data-mining approaches step through the space of possible models in a computationally efficient way • Recursive partitioning approaches • Multifactor Dimensionality Reduction method

  10. Recursive partitioning approach • Trees are constructed using rules that determine how well a split on a predictor variable at a node can differentiate observations with respect to the outcome variable
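A minimal recursive-partitioning sketch using a scikit-learn classification tree; the simulated risk model and all tree parameters are illustrative assumptions:

```python
# A classification tree grown on simulated genotypes: each node is split on
# the SNP that best separates cases from controls, recursively.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
n, n_snps = 1000, 20
geno = rng.integers(0, 3, size=(n, n_snps))
# risk is elevated only when SNP0 and SNP1 both carry a minor allele
p = np.where((geno[:, 0] > 0) & (geno[:, 1] > 0), 0.6, 0.3)
status = rng.binomial(1, p)

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50, random_state=0)
tree.fit(geno, status)
print(export_text(tree, feature_names=[f"SNP{i}" for i in range(n_snps)]))
```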

  11. Multifactor Dimensionality Reduction • A data mining strategy for detecting and characterizing nonlinear interactions among discrete attributes (e.g. SNPs) that are predictive of a discrete outcome (e.g. case-control status). It combines: • attribute selection, • attribute construction and • classification with cross-validation

  12. MDR • Select interesting SNPs from the pool of possible candidates • Single-locus association tests that pass a certain threshold, or • Entropy-based measures of information gain and interaction • Constructive induction using MDR • Classification and machine learning • Decision trees, neural networks, naïve Bayes classifier

  13. Multifactor Dimensionality Reduction • Class variable (Y) • Indicator variables (X1, X2) • XOR: a logic operator that is not linearly separable • Y = X1 XOR X2
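A quick numerical check of the XOR model (simulated, for illustration only): each indicator alone is uninformative, while the pair determines Y exactly.

```python
# Y = X1 XOR X2: no marginal (single-variable) signal, perfect joint signal.
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.integers(0, 2, size=10_000)
X2 = rng.integers(0, 2, size=10_000)
Y = X1 ^ X2                                           # Y = X1 XOR X2

print("P(Y=1)               :", Y.mean())                              # ~0.5
print("P(Y=1 | X1=0), (X1=1):", Y[X1 == 0].mean(), Y[X1 == 1].mean())  # both ~0.5
print("P(Y=1 | X1=1, X2=0)  :", Y[(X1 == 1) & (X2 == 0)].mean())       # 1.0
```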

  14. MDR algorithm • Start by selecting 2 attributes (X1, X2) • Each combination of X1 and X2 is examined: • The number of times Y=1 and Y=0 is counted • The ratio of these counts is computed, compared to a fixed threshold (here, 1), and encoded as a binary variable (Z)
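A sketch of this counting-and-thresholding step; the function name mdr_high_risk_cells and the handling of cells with no controls are assumptions, not the published MDR implementation:

```python
# For a chosen pair of attributes, count cases and controls in each genotype
# combination and label the cell high-risk (Z = 1) when the case:control
# ratio exceeds the threshold (1 here, i.e. a balanced case-control design).
from collections import defaultdict

def mdr_high_risk_cells(x1, x2, y, threshold=1.0):
    counts = defaultdict(lambda: [0, 0])          # cell -> [n_controls, n_cases]
    for a, b, label in zip(x1, x2, y):            # label: 0 = control, 1 = case
        counts[(a, b)][label] += 1
    z = {}
    for cell, (controls, cases) in counts.items():
        ratio = cases / controls if controls else float("inf")
        z[cell] = 1 if ratio > threshold else 0   # 1 = high risk, 0 = low risk
    return z

# e.g. mdr_high_risk_cells([0, 1, 2, 1], [2, 1, 0, 1], [1, 0, 1, 1])
```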

  15. MDR • This is repeated for each possible n-factor combination and the combination that maximizes the case–control ratio of the high-risk group is selected • Cross-validation • In practice, the data is divided into ten equal parts and a model is fit to each nine-tenths of the data (the training data), and the remaining one-tenth (the test data) is used to assess model fit
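A sketch of the ten-fold scheme; MDR itself is not implemented here, so a scikit-learn decision tree stands in for the model fit to each training split:

```python
# Ten-fold cross-validation as described: fit on nine-tenths, score on the
# held-out tenth, repeat ten times.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
geno = rng.integers(0, 3, size=(500, 2))            # the selected two-locus model
status = rng.binomial(1, 0.5, size=500)

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True,
                                           random_state=0).split(geno, status):
    model = DecisionTreeClassifier(max_depth=2)
    model.fit(geno[train_idx], status[train_idx])               # training nine-tenths
    scores.append(model.score(geno[test_idx], status[test_idx]))  # held-out tenth
print("mean test accuracy:", np.mean(scores))
```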

  16. Multifactor Dimensionality Reduction • Main problem: does not scale up to large numbers of predictor variables • Anything more than a two-locus screen on more than a few hundred variables will be computationally prohibitive • Best suited for use with small numbers of loci (up to a few hundred) • Requires a prior processing or filtering step for larger datasets

  17. Bayesian Epistasis Association Mapping (BEAM) • Outputs the posterior probability that each marker is associated with the disease and involved with other markers in epistasis • Uses Markov chain Monte Carlo (MCMC) • Proposes a “B-statistic” for hypothesis testing • Can handle large numbers of markers (100,000 SNPs typed in 500 cases and 500 controls)

  18. BEAM • Nd cases + Nu controls • Case genotypes: D = (d1, …, dNd) • Control genotypes: U = (u1, …, uNu) • L markers are partitioned into 3 groups: • Group 0: markers unlinked to the disease • Group 1: markers contributing independently to the disease risk • Group 2: markers jointly influencing the disease risk (interactions) • I = (I1, …, IL): membership of the markers in each group

  19. BEAM • Let L0, L1 and L2 denote the number of markers in each group (L0 + L1 + L2 = L) • Let D0, D1 and D2 denote the case genotypes of markers in groups 0, 1 and 2, respectively

  20. BEAM • Initialize I according to the prior P(I) • Use the Metropolis-Hastings algorithm to update I • Output: posterior distribution of markers and interactions associated with the disease • B statistic and conditional B statistic • To accommodate hypothesis testing in a frequentist way
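A rough skeleton of such a Metropolis-Hastings update over the membership vector I; this is an illustration only, and the actual BEAM prior, likelihood and proposal moves are more elaborate. `log_posterior` below is a hypothetical placeholder for log P(I | D, U), which carries all the genotype data:

```python
# I[j] in {0, 1, 2} is marker j's group (slide 18). Each sweep proposes a new
# group for one randomly chosen marker and accepts it with the usual
# Metropolis-Hastings probability.
import numpy as np

def metropolis_hastings(log_posterior, n_markers, n_iter=10_000, seed=0):
    rng = np.random.default_rng(seed)
    I = rng.integers(0, 3, size=n_markers)     # initialize I (BEAM: draw from P(I))
    samples = []
    for _ in range(n_iter):
        proposal = I.copy()
        j = rng.integers(n_markers)            # pick one marker...
        proposal[j] = rng.integers(0, 3)       # ...and propose a new group for it
        log_ratio = log_posterior(proposal) - log_posterior(I)
        if np.log(rng.random()) < log_ratio:   # symmetric proposal: accept w.p. min(1, ratio)
            I = proposal
        samples.append(I.copy())
    # posterior probability that each marker is disease-associated (group 1 or 2)
    return np.mean(np.array(samples) > 0, axis=0)
```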

  21. BEAM analysis of 47,727 SNPs in Crohn’s disease and control samples. [Figures: ‘B-statistic’ p values for the 1,321 single-locus associations detected by BEAM; single-locus association results for all 47,727 SNPs using the trend test implemented in PLINK.]

  22. BEAM • Cannot currently handle the 500,000–1,000,000 markers that are now routinely genotyped in genome scans of 5,000 or more individuals • Can only account for linkage disequilibrium between adjacent markers • It is unclear whether linkage disequilibrium between non-adjacent markers is fully accounted for (identified pairs might simply be in high LD with each other)

  23. Summary • Numerous methods and software packages test for epistasis • The main difference between these methods is computational time: • The semi-exhaustive search of two-locus interactions implemented in PLINK is the most feasible • BEAM is feasible only for a filtered dataset and with some modification to the default settings • MDR is feasible for examining two-locus interactions in a filtered data set or for examining higher-level interactions in an even further reduced data set

  24. Thanks!
