Detecting gene-gene interactions in SNP-association studies. Peng Zhou, 03/21/2011. Contents. Basic concepts General items, statistical models, etc. Computational approaches Exhaustive searches Data-mining and machine-learning approaches Bayesian model selection approaches Summary.
Detecting gene-gene interactions in SNP-association studies Peng Zhou, 03/21/2011
Contents • Basic concepts • General items, statistical models, etc. • Computational approaches • Exhaustive searches • Data-mining and machine-learning approaches • Bayesian model selection approaches • Summary
Statistical models of association • Quantitative outcome (y) • Predictor variable (x) • Linear regression: • y = mx + c • Multiple regression: • y = m1x1 + m2x2 + m3x3 + c
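As a minimal sketch of the single-predictor model y = mx + c, the least-squares slope and intercept can be computed in closed form. The function name `fit_line` and the genotype and trait values below are hypothetical, chosen only for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + c with a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept follows from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Hypothetical data: genotype coded as 0/1/2 copies of an allele, plus a trait value.
genotypes = [0, 0, 1, 1, 2, 2]
trait = [1.0, 1.2, 2.9, 3.1, 5.0, 5.2]
m, c = fit_line(genotypes, trait)
print(f"slope m = {m:.3f}, intercept c = {c:.3f}")
```

Multiple regression extends this to several predictors, which would normally be fitted with a linear-algebra library rather than by hand.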
Statistical models of association • Disease (binary) outcome • Log odds: ln(p/(1 – p)) • Predictor variable (x) • Logistic regression: • ln(p/(1 – p)) = mx + c • Multiple logistic regression: • ln(p/(1 – p)) = m1x1 + m2x2 + m3x3 + c
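To make the log-odds scale concrete, the following sketch converts between a disease probability p and ln(p/(1 – p)), and maps a linear predictor mx + c back to a probability. The helper names and the coefficient values are assumptions for illustration, not estimates from any data set.

```python
import math


def logit(p):
    """Log odds of a probability p (0 < p < 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Inverse of logit: maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def disease_probability(x, m, c):
    """Probability of disease when ln(p/(1 - p)) = m*x + c."""
    return inv_logit(m * x + c)


# Hypothetical coefficients: baseline log odds c, per-allele effect m.
m, c = 0.5, -2.0
for x in (0, 1, 2):
    print(x, round(disease_probability(x, m, c), 4))
```

Working on the log-odds scale keeps the right-hand side unbounded while the implied probability always stays between 0 and 1.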
Statistical models of interaction • Logistic regression: • ln(p/(1 – p)) = α + βxB + γxC + ixBxC • xB and xC: measured indicator variables • β, γ: main effect terms • i: interaction term • The test of interaction between xB and xC is just: • A one-degree-of-freedom test of i = 0 • Factors that display interaction effects without displaying main effects will be missed in single-locus tests
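When xB and xC are both 0/1 indicators, the four-parameter model above is saturated, so the estimate of i has a closed form: the log odds in the (1,1) cell, minus those in the (1,0) and (0,1) cells, plus that in the (0,0) cell. The sketch below uses this to run a one-degree-of-freedom Wald test of i = 0. It is an illustration under that binary-indicator assumption, with made-up counts; a general analysis would fit the logistic regression numerically.

```python
import math


def interaction_wald_test(counts):
    """Wald test of i = 0 from a 2x2x2 table.

    counts maps (xB, xC) -> (n_cases, n_controls); all counts must be > 0.
    """
    def log_odds(cell):
        cases, controls = counts[cell]
        return math.log(cases / controls)

    # Saturated-model estimate of the interaction coefficient.
    i_hat = (log_odds((1, 1)) - log_odds((1, 0))
             - log_odds((0, 1)) + log_odds((0, 0)))
    # Large-sample variance: sum of reciprocals of the eight cell counts.
    variance = sum(1.0 / n for pair in counts.values() for n in pair)
    z = i_hat / math.sqrt(variance)
    # Two-sided normal p-value, equivalent to a 1-df chi-square test on z**2.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return i_hat, z, p_value


# Hypothetical counts: no main effects, but excess cases when both factors are present.
table = {(0, 0): (100, 100), (1, 0): (100, 100),
         (0, 1): (100, 100), (1, 1): (150, 50)}
i_hat, z, p_value = interaction_wald_test(table)
print(f"i = {i_hat:.3f}, z = {z:.2f}, p = {p_value:.2g}")
```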
Case-control approach • Most popular approach • Fit a logistic regression model and test whether the interaction terms equal zero • Linear regression in the case of quantitative phenotypes • '--epistasis' option in the software package PLINK • The user provides the significance level
Exhaustive search • Analyze all possible pairs of loci and perform the desired interaction test for each pair • Multiple testing issue • 100,000 loci -> 10^10 tests -> 5×10^-12 global significance threshold for the p-value • Are the tests independent? • Loci in linkage disequilibrium • However, it is feasible • 33 hours on a 10-node cluster to perform all pairwise tests of association allowing for interaction at 300,000 loci in 1,000 cases and 1,000 controls (Marchini et al., Nature Genetics, 2005)
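The multiple-testing arithmetic can be sketched as follows, assuming a Bonferroni correction at a global level of 0.05 (the slide does not name the correction, but 0.05 / 10^10 gives its 5×10^-12). Note that the exact number of unordered pairs among 100,000 loci is about 5×10^9; the 10^10 figure is a round, conservative order of magnitude.

```python
import math


def pairwise_tests(n_loci):
    """Number of unordered locus pairs, n * (n - 1) / 2."""
    return math.comb(n_loci, 2)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the global level at alpha."""
    return alpha / n_tests


n_pairs = pairwise_tests(100_000)
print(f"exact pairs: {n_pairs:.3e}")
print(f"threshold for exact pairs: {bonferroni_threshold(0.05, n_pairs):.2e}")
print(f"threshold for 1e10 tests:  {bonferroni_threshold(0.05, 10**10):.2e}")
```

Because loci in linkage disequilibrium give correlated tests, a Bonferroni threshold of this kind is likely to be conservative.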
Exhaustive search • Does not scale up to higher-order interactions • The number of tests and the analysis time increase exponentially with the interaction order • Two-stage procedures: • A subset of loci that pass some single-locus significance threshold is chosen • An exhaustive search of all two-locus (or higher-order) interactions is carried out on this 'filtered' subset • Genome-Wide Interaction-Based Association Analysis Identified Multiple New Susceptibility Loci for Common Diseases. (Yang et al., PLoS Genetics, 2011) • Drawback: interactions that do not show any marginal effects will be missed
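A rough sketch of why filtering helps: the number of k-locus combinations grows steeply with both the number of loci and the order k, and a first-stage filter shrinks the pool before the exhaustive stage. The SNP identifiers, p-values and threshold below are hypothetical placeholders.

```python
import math
from itertools import combinations


def n_interaction_tests(n_loci, order):
    """Number of distinct locus combinations of a given interaction order."""
    return math.comb(n_loci, order)


def two_stage_pairs(single_locus_p, threshold):
    """Stage 1: keep loci whose single-locus p-value is below the threshold.
    Stage 2: enumerate every pair among the retained loci."""
    kept = sorted(locus for locus, p in single_locus_p.items() if p < threshold)
    return list(combinations(kept, 2))


# Growth in the number of tests, full scan versus a filtered subset of 1,000 loci.
for order in (2, 3, 4):
    print(order, n_interaction_tests(300_000, order), n_interaction_tests(1_000, order))

# Hypothetical single-locus p-values.
p_values = {"rs1": 0.001, "rs2": 0.2, "rs3": 0.01, "rs4": 0.04}
pairs = two_stage_pairs(p_values, threshold=0.05)
print(pairs)
```

A locus such as the hypothetical rs2 is never paired with anything here, which is exactly the drawback noted above when an interacting locus has no marginal effect.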
Data mining approaches • Traditional regression-based methods are not able to deal with nonlinear models or high-dimensional data • Data-mining approaches step through the space of possible models in a computationally efficient way • Recursive partitioning approaches • Multifactor Dimensionality Reduction method
Recursive partitioning approach • Trees are constructed using rules that measure how well a split on a predictor variable at a given node differentiates observations with respect to the outcome variable
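One common splitting rule is the reduction in Gini impurity; the slide does not say which rule is used, so Gini is an assumption here. The sketch scores every candidate split of each predictor at a single node and returns the best one. The predictor names and data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 outcome labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def split_gain(values, labels, cutoff):
    """Impurity reduction from splitting observations at value <= cutoff."""
    left = [y for x, y in zip(values, labels) if x <= cutoff]
    right = [y for x, y in zip(values, labels) if x > cutoff]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


def best_split(predictors, labels):
    """Return (predictor name, cutoff, gain) for the best single split."""
    best = None
    for name, values in predictors.items():
        # The largest value is skipped: splitting there leaves one side empty.
        for cutoff in sorted(set(values))[:-1]:
            gain = split_gain(values, labels, cutoff)
            if best is None or gain > best[2]:
                best = (name, cutoff, gain)
    return best


# Hypothetical genotypes (0/1/2) for two SNPs and a binary outcome.
predictors = {"snp_a": [0, 0, 1, 1, 2, 2, 0, 1],
              "snp_b": [0, 1, 2, 0, 1, 2, 1, 0]}
labels = [0, 0, 1, 1, 1, 1, 0, 1]
best = best_split(predictors, labels)
print(best)
```

A full tree would apply the same search recursively to the observations on each side of the chosen split.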
Multifactor Dimensionality Reduction • A data mining strategy for detecting and characterizing nonlinear interactions among discrete attributes (e.g. SNPs) that are predictive of a discrete outcome (e.g. case-control status). It combines: • attribute selection, • attribute construction and • classification with cross-validation
MDR • Select interesting SNPs from the pool of possible candidates • Single-locus association tests that pass a certain threshold, or • Entropy-based measures of information gain and interaction • Constructive induction using MDR • Classification and machine learning • Decision trees, neural networks, naïve Bayes classifier
Multifactor Dimensionality Reduction • Class variable (Y) • Indicator variables (X1, X2) • XOR: a logic operator that is not linearly separable • Y = X1 XOR X2
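The XOR case can be checked numerically with entropy-based information gain: each attribute alone carries no information about Y, while the pair determines Y completely. This is a small sketch on the four-row XOR truth table; the function names are illustrative.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(attribute, outcome):
    """I(A; Y) = H(Y) - H(Y | A) for discrete lists of equal length."""
    n = len(outcome)
    conditional = 0.0
    for level in set(attribute):
        subset = [y for a, y in zip(attribute, outcome) if a == level]
        conditional += len(subset) / n * entropy(subset)
    return entropy(outcome) - conditional


# XOR truth table: Y = X1 XOR X2.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [a ^ b for a, b in zip(x1, x2)]

gain_x1 = information_gain(x1, y)
gain_x2 = information_gain(x2, y)
gain_joint = information_gain(list(zip(x1, x2)), y)
# Information attributable to the interaction beyond the two main effects.
interaction = gain_joint - gain_x1 - gain_x2
print(gain_x1, gain_x2, gain_joint, interaction)
```

A single-locus filter based on the individual gains would discard both attributes in this example, even though together they predict Y perfectly.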
MDR algorithm • Start by selecting 2 attributes (X1, X2) • Each combination of X1 and X2 values is examined: • The number of times Y=1 and Y=0 occur is counted • The ratio of these counts is computed and compared to a fixed threshold (1), and encoded as a binary variable (Z)
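The encoding step above might be sketched as follows. The handling of edge cases is an assumption: a cell with cases but no controls is treated as high risk, and a ratio exactly equal to the threshold is treated as low risk.

```python
from collections import defaultdict


def mdr_encode(x1, x2, y, threshold=1.0):
    """Collapse two attributes into one binary variable Z.

    Each (X1, X2) cell is labelled 1 (high risk) when its ratio of
    Y=1 to Y=0 counts exceeds the threshold, otherwise 0 (low risk).
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [count of Y=0, count of Y=1]
    for a, b, label in zip(x1, x2, y):
        counts[(a, b)][label] += 1

    high_risk = {}
    for cell, (controls, cases) in counts.items():
        if controls == 0:
            # Assumption: no controls observed, so any case makes the cell high risk.
            high_risk[cell] = 1 if cases > 0 else 0
        else:
            high_risk[cell] = 1 if cases / controls > threshold else 0

    z = [high_risk[(a, b)] for a, b in zip(x1, x2)]
    return high_risk, z


# XOR data repeated twice: Z should reproduce Y exactly.
x1 = [0, 0, 1, 1] * 2
x2 = [0, 1, 0, 1] * 2
y = [a ^ b for a, b in zip(x1, x2)]
high_risk, z = mdr_encode(x1, x2, y)
print(high_risk)
print(z == y)
```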
MDR • This is repeated for each possible n-factor combination and the combination that maximizes the case–control ratio of the high-risk group is selected • cross-validation • In practice, the data is divided into ten equal parts and a model is fit to each nine-tenths of the data (the training data), and the remaining one-tenth (the test data) is used to assess model fit
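A minimal sketch of the ten-fold scheme for one pair of attributes: the high-risk cells are learned on nine-tenths of the data and the held-out tenth is used to score predictions. Folds are assigned by index here to keep the example deterministic; in practice the individuals would normally be shuffled first. The function names and the synthetic XOR data are assumptions.

```python
from collections import Counter


def fit_high_risk_cells(rows, threshold=1.0):
    """Return the set of (x1, x2) cells whose case:control ratio exceeds the threshold."""
    cases = Counter((a, b) for a, b, label in rows if label == 1)
    controls = Counter((a, b) for a, b, label in rows if label == 0)
    # cases > threshold * controls avoids dividing by zero for cells without controls.
    return {cell for cell in cases if cases[cell] > threshold * controls[cell]}


def cross_validated_accuracy(x1, x2, y, k=10):
    """Mean test accuracy over k folds for the two-attribute MDR model."""
    rows = list(zip(x1, x2, y))
    accuracies = []
    for fold in range(k):
        test = [r for i, r in enumerate(rows) if i % k == fold]
        train = [r for i, r in enumerate(rows) if i % k != fold]
        high_risk = fit_high_risk_cells(train)
        correct = sum((1 if (a, b) in high_risk else 0) == label
                      for a, b, label in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k


# Synthetic data: 40 individuals following a noise-free XOR pattern.
x1 = [i % 2 for i in range(40)]
x2 = [(i // 2) % 2 for i in range(40)]
y = [a ^ b for a, b in zip(x1, x2)]
accuracy = cross_validated_accuracy(x1, x2, y)
print(accuracy)
```

With real, noisy data the test accuracy would fall below 1, and comparing it across candidate combinations guards against choosing a model that only fits the training data.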
Multifactor Dimensionality Reduction • Main problem: does not scale up to large numbers of predictor variables • Anything more than a two-locus screen on more than a few hundred variables will be computationally prohibitive • best suited for use with small numbers of loci (up to a few hundred) • prior processing or filtering step
Bayesian Epistasis Association Mapping (BEAM) • Outputs the posterior probability that each marker is associated with the disease and involved with other markers in epistasis • Uses Markov chain Monte Carlo (MCMC) • Proposes a "B-statistic" for hypothesis testing • Can handle large numbers of markers (100,000 SNPs typed in 500 cases and 500 controls)
BEAM • Nd cases + Nu controls • Case genotypes: D = (d1, …, dNd) • Control genotypes: U = (u1, …, uNu) • L markers are partitioned into 3 groups: • Group 0: markers unlinked to the disease • Group 1: markers contributing independently to the disease risk • Group 2: markers jointly influencing the disease risk (interactions) • I = (I1, …, IL): membership of the markers in each group
BEAM • Let L0, L1 and L2 denote the number of markers in each group (L0 + L1 + L2 = L) • Let D0, D1 and D2 denote the case genotypes of markers in groups 0, 1 and 2, respectively
BEAM • Initialize I according to the prior P(I) • Use the Metropolis-Hastings algorithm to update I • Output: posterior distribution of markers and interactions associated with the disease • B statistic and conditional B statistic • To accommodate hypothesis testing in a frequentist way
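The general shape of a Metropolis-Hastings update of the membership vector I can be illustrated with a toy sampler. This is not BEAM's actual model: the per-marker `weights` below are hypothetical stand-ins for the unnormalized posterior, a uniform prior is assumed for initialization, and no burn-in is discarded.

```python
import random


def metropolis_hastings_membership(weights, n_iter, seed=0):
    """Toy sampler over group memberships (0, 1 or 2) for each marker.

    weights[j][g] is a hypothetical unnormalized posterior weight (> 0)
    for marker j belonging to group g. Returns the fraction of iterations
    each marker spent in each group.
    """
    rng = random.Random(seed)
    n_markers = len(weights)
    # Stand-in for drawing the initial state from the prior P(I).
    membership = [rng.randrange(3) for _ in range(n_markers)]
    visits = [[0, 0, 0] for _ in range(n_markers)]

    for _ in range(n_iter):
        j = rng.randrange(n_markers)
        current = membership[j]
        # Symmetric proposal: move marker j to one of the other two groups.
        proposal = rng.choice([g for g in range(3) if g != current])
        ratio = weights[j][proposal] / weights[j][current]
        if rng.random() < min(1.0, ratio):
            membership[j] = proposal
        for marker, group in enumerate(membership):
            visits[marker][group] += 1

    return [[v / n_iter for v in row] for row in visits]


# Hypothetical weights favouring group 0, 1 and 2 for markers 0, 1 and 2.
weights = [[8, 1, 1], [1, 8, 1], [1, 1, 8]]
posterior = metropolis_hastings_membership(weights, n_iter=20000)
for row in posterior:
    print([round(p, 2) for p in row])
```

In this toy target each marker should spend roughly 80% of the iterations in its favoured group, which is the kind of per-marker posterior summary the output bullet describes.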
Figure: BEAM analysis of 47,727 SNPs in Crohn's disease and control samples. Panels: 'B-statistic' p-values for the 1,321 single-locus associations detected by BEAM; results from single-locus association analysis of all 47,727 SNPs using the trend test implemented in PLINK.
BEAM • Cannot currently handle the 500,000–1,000,000 markers that are now routinely genotyped in genome scans of 5,000 or more individuals • Can only account for linkage disequilibrium between adjacent markers • It is unclear whether linkage disequilibrium between non-adjacent markers is fully accounted for. (identified pairs might be in high LD)
Summary • Numerous methods and software packages test for epistasis • The main difference between these methods is the computational time: • The semi-exhaustive search of two-locus interactions implemented in PLINK is the most feasible • BEAM is feasible only for a filtered data set and with some modification to the default settings • MDR is feasible for examining two-locus interactions in a filtered data set or for examining higher-level interactions in an even further reduced data set