Interaction-based Learning in Genomics. Shaw-Hwa Lo, Tian Zheng & Herman Chernoff
Interaction-based Learning in Genomics Shaw-Hwa Lo, Tian Zheng & Herman Chernoff Columbia University Harvard University
Other Collaborators: Iuliana Ionita-Laza, Inchi Hu, Hongyu Zhao, Hui Wang, Xin Yan, Yuejing Ding, Chien-Hsun Huang, Bo Qiao, Ying Liu, Michael Agne, Ruixue Fan, Maggie Wang, Lei Cong, Hugh Arnolds, Jun Xie, Adeline Lo
Partition-Retention. We have n observations on a dependent variable Y and many discrete-valued explanatory variables X1, X2, ..., XS. We wish: 1) to identify those explanatory variables that influence Y; 2) to predict Y, based on the findings of 1). We assume that Y is influenced by a number of small groups of interacting variables (group sizes of about 1 to 8, depending on sample sizes and effects).
Marginal Effects: Causal and Observable
1. If Xi has an effect on Y, we expect Y to be correlated with Xi or some function of Xi. In that case Xi has a causal and observable marginal effect.
2. A variable Xi unrelated to (independent of) Y should be uncorrelated with Y except for random variation. But if S (the number of variables) is large and n is moderate, some of the explanatory variables that do not influence Y may show a substantial correlation (a marginal observable effect) with Y. They are impostors.
3. A group of important, interacting, influential variables may or may not have marginal observable effects (MOE). Therefore, methods that rely on the presence of strong observable marginal effects are unlikely to succeed when the MOE are weak.
Ex. 1. X1 and X2 are independent with P(Xi = 1) = P(Xi = −1) = 1/2, Y = X1X2, and E(Y | X1) = E(Y | X2) = 0. Y is uncorrelated with X1 and with X2, although the pair determines Y.
Ex. 2. Y = X1X2, with P(Xi = 1) = 3/4 and P(Xi = −1) = 1/4. Here Y is correlated with X1 and X2, and the sample will clearly show marginal observable effects (which can be detected by a t-test). That is, the interaction of both X1 and X2 is needed to influence Y.
Conclusion: To detect interacting influential variables, it is desirable and sometimes necessary to consider interactive effects. Impostors may present observable marginal effects if S is large and n is moderate.
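The two examples can be checked by exact enumeration. The sketch below is only an illustration (the function name `cov_y_x1` is ours, not from the slides); it computes Cov(Y, X1) for Y = X1X2 with independent X1, X2 taking values in {−1, +1}.

```python
from fractions import Fraction
from itertools import product

def cov_y_x1(p):
    """Exact Cov(Y, X1) for Y = X1*X2, X1 and X2 independent, P(Xi = 1) = p."""
    p = Fraction(p)
    prob = {1: p, -1: 1 - p}
    e_y = e_x1 = e_yx1 = Fraction(0)
    # Enumerate the four (x1, x2) outcomes with their probabilities.
    for x1, x2 in product((1, -1), repeat=2):
        w = prob[x1] * prob[x2]
        y = x1 * x2
        e_y += w * y
        e_x1 += w * x1
        e_yx1 += w * y * x1
    return e_yx1 - e_y * e_x1

print(cov_y_x1(Fraction(1, 2)))  # Ex. 1: no marginal effect
print(cov_y_x1(Fraction(3, 4)))  # Ex. 2: a marginal effect appears
```

With p = 1/2 the covariance is exactly 0; with p = 3/4 it is 3/8, since E(X1²X2) = E(X2) = 1/2 and E(Y)E(X1) = 1/8.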
An ideal analytical tool should be able to:
• 1. handle an extremely large number of variables and their higher-order interactions.
• 2. detect "module effects": a module C (a cluster of variables) holds predictive power but becomes useless for prediction if any one variable is removed.
• 3. identify interaction effects: the effect of one variable on a response depends on the values of other variables.
• 4. detect and utilize nonlinear and non-additive effects.
A score with four features
• We need a sensible score that measures the influence of a group of variables.
• The score should support an algorithm that removes noisy and non-informative variables as the dimension changes, meaning that the score is measured on the same scale across different dimensions.
• Given a cluster of variables, one can use the score to test the significance of its influence.
• A cluster with a high score (an influential cluster) automatically possesses predictive ability.
A Special Case of Influential Measure: Genotype-Trait Distortion
• In case-control studies:
$$\nu = \sum_{s} \left( \frac{n_{D,s}}{n_D} - \frac{n_{U,s}}{n_U} \right)^2$$
• where $n_{D,s}$ and $n_{U,s}$ are the counts of cases and controls in genotype (partition element) $s$, and $n_D$ and $n_U$ are the total numbers of cases and controls under study. A SNP has 3 genotypes (aa, ab, bb).
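A general form
• Let Y be the disease status (1 for cases and 0 for controls). Then, for a genotype partition Π, the score we just discussed can be naturally defined as:
$$I = \sum_{j \in \Pi} n_j^2 \left( \bar{Y}_j - \bar{Y} \right)^2$$
• where $n_j$ is the number of observations in partition element $j$, $\bar{Y}_j$ is the mean of Y in that element, and $\bar{Y}$ is the overall mean of Y.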
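The link between the general score and the case-control counts can be written out in a few lines. This is a sketch under assumed notation, not a reproduction of the original slide: it takes the score to be $I = \sum_j n_j^2(\bar{Y}_j - \bar{Y})^2$, with $n_{D,j}$ and $n_{U,j}$ the case and control counts in partition element $j$, $n_j = n_{D,j} + n_{U,j}$, $n_D$ and $n_U$ the total cases and controls, and $n = n_D + n_U$.

```latex
\begin{aligned}
n_j\left(\bar{Y}_j - \bar{Y}\right)
  &= n_{D,j} - n_j \frac{n_D}{n}
   = \frac{n_U\, n_{D,j} - n_D\, n_{U,j}}{n}
   = \frac{n_D n_U}{n}\left(\frac{n_{D,j}}{n_D} - \frac{n_{U,j}}{n_U}\right), \\[4pt]
I = \sum_{j \in \Pi} n_j^2\left(\bar{Y}_j - \bar{Y}\right)^2
  &= \left(\frac{n_D n_U}{n}\right)^{2}
     \sum_{j \in \Pi}\left(\frac{n_{D,j}}{n_D} - \frac{n_{U,j}}{n_U}\right)^{2}.
\end{aligned}
```

Under these assumptions, for binary Y the general score is a constant multiple of the sum of squared differences between the case and control genotype frequencies.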
Theorem: Under the null hypothesis that none of the variables has an influence, the null distribution of the score, when suitably normalized, is asymptotically a weighted sum of independent chi-square variables. This applies to both the null random model and the null specified model, under the standard conditions for the applicability of the CLT. The case-control study is a special case of the specified model (Chernoff, Lo and Zheng, 2009).
General Setting
• The main idea applies much more generally than to special genetic problems. A more general version addresses the problem of detecting which of many potentially influential variables X_s have an effect on a dependent variable Y, using a sample of n observations on Z = (X, Y), where X = (X1, X2, ..., XS).
• In the background is the assumption that Y may be slightly or negligibly influenced by each of a few variables X_s, but may be profoundly influenced by the confluence of appropriate values within one or a few small groups of these variables. At this stage the object is not to measure the overall effects of the influential variables, but to discover them efficiently.
Example • We introduce the partition retention approach and related terminology and issues by considering a small artificial example. • Suppose that an observed variable Y is normally distributed with mean X1X2 and variance 1, where X1 and X2 are two of S = 6 observed and potentially influential variables which can take on the values 0 and 1. Given the data on Y and X = (X1, . . . ,X6), for n = 200 subjects, the statistician, who does not know this model, desires to infer which of the six explanatory variables are causally related to Y. In our computation the Xi are selected independently to be 1 with probabilities 0.7, 0.7, 0.5, 0.5, 0.5, 0.5.
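The artificial example can be simulated in a few lines. The sketch below is an illustration only: it assumes the influence score takes the form I = Σ_j n_j² (Ȳ_j − Ȳ)² over the cells of the partition induced by a variable subset, and the function name `i_score`, the seed, and the choice to score all pairs are ours.

```python
import numpy as np
from itertools import combinations

def i_score(X, y):
    """Assumed score: sum over partition cells of n_j^2 * (cell mean - overall mean)^2."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X).reshape(len(y), -1)
    # Each distinct row of X defines one cell of the partition.
    _, cell = np.unique(X, axis=0, return_inverse=True)
    cell = cell.ravel()
    n_j = np.bincount(cell)
    sum_j = np.bincount(cell, weights=y)
    # n_j^2 * (ybar_j - ybar)^2 equals (sum_j - n_j * ybar)^2.
    return float(np.sum((sum_j - n_j * y.mean()) ** 2))

rng = np.random.default_rng(0)  # arbitrary seed
n, probs = 200, [0.7, 0.7, 0.5, 0.5, 0.5, 0.5]
X = (rng.random((n, 6)) < probs).astype(int)
y = rng.normal(loc=X[:, 0] * X[:, 1], scale=1.0)  # Y ~ N(X1*X2, 1)

# Score every pair of variables and list the three highest-scoring pairs.
scores = {pair: i_score(X[:, list(pair)], y) for pair in combinations(range(6), 2)}
for pair, s in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print([f"X{i + 1}" for i in pair], round(s, 1))
```

Scoring all 15 pairs is feasible only because S = 6 here; with many variables, a search procedure over random subsets is needed instead.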
A capable analytical tool should be able to surmount the following difficulties:
• (a) handle an extremely large number of variables in the data (SNPs and other variables in the hundreds of thousands or millions).
• (b) detect the so-called "module effect", the phenomenon where removing one variable from the current module renders it useless for prediction.
• (c) identify interactions (often higher-order effects): the effect of one variable on a response variable depends on the values of other variables in the same module.
• (d) extract and utilize nonlinear (or non-additive) effects.
Let the response variable Y and the explanatory variables X (30 Xs, all independent) all be binary, taking values 0 or 1 with 50% chance each. We independently generate 200 observations, and Y is related to X via a model in which one of two causal variable modules generates each response. The task is to predict Y based on the information in X. We use 150 observations as the training set and 50 as the test set. The prediction error rate has a 25% theoretical lower bound, since we do not know which of the two causal variable modules generated a given Y.
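The 25% bound can be illustrated by exact enumeration. The model equation did not survive in this transcript, so the sketch below uses a hypothetical stand-in: Y equals (X1 + X2) mod 2 with probability 1/2 and (X3 + X4 + X5) mod 2 otherwise. The module definitions, mixing probability, and function name are assumptions made only to show the arithmetic.

```python
from fractions import Fraction
from itertools import product

def bayes_error():
    """Lowest achievable error rate under the ASSUMED two-module mixture model."""
    half = Fraction(1, 2)
    err = Fraction(0)
    # Only X1..X5 matter in the assumed model; each pattern has probability 1/32.
    for x in product((0, 1), repeat=5):
        m1 = (x[0] + x[1]) % 2           # hypothetical module 1
        m2 = (x[2] + x[3] + x[4]) % 2    # hypothetical module 2
        p1 = half * m1 + half * m2       # P(Y = 1 | x)
        # The best prediction for this x errs with probability min(p1, 1 - p1).
        err += Fraction(1, 32) * min(p1, 1 - p1)
    return err

print(bayes_error())
```

The two modules disagree for half of the patterns, and when they disagree even the best guess is wrong half of the time, which gives 1/2 × 1/2 = 1/4.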
Diagrams of conventional approach and the variable-module enabled approach.
Basic tool: the Backward Dropping Algorithm (BDA). BDA is a "greedy" algorithm that seeks the variable subset maximizing the I-score through stepwise elimination of variables from an initial subset (k variables) sampled from the variable space (p variables), with k << p.
• Training Set: Consider a training set of n observations of (Y, X), where X is a p-dimensional vector of discrete variables. Typically p is very large (thousands).
• Sampling Randomly from Variable Space: Select an initial subset of k explanatory variables, k << p.
• Compute I-Score: Compute the I-score based on the k variables.
• Drop Variables: Tentatively drop each variable in turn and recalculate the I-score with one variable fewer. Then permanently drop the variable whose tentative removal gives the highest I-score.
• Return Set: Continue with the next round of dropping until only one variable is left. Keep the subset that yields the highest I-score over the whole dropping process. Refer to this subset as the return set.
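The steps above can be sketched in code. This is a minimal illustration rather than the authors' implementation: it assumes the I-score has the form Σ_j n_j² (Ȳ_j − Ȳ)², and the names `i_score` and `backward_dropping`, the synthetic XOR data, and the number of random draws are all hypothetical.

```python
import numpy as np

def i_score(X, y):
    """Assumed score: sum over partition cells of n_j^2 * (cell mean - overall mean)^2."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X).reshape(len(y), -1)
    _, cell = np.unique(X, axis=0, return_inverse=True)
    cell = cell.ravel()
    n_j = np.bincount(cell)
    sum_j = np.bincount(cell, weights=y)
    return float(np.sum((sum_j - n_j * y.mean()) ** 2))

def backward_dropping(X, y, init_vars):
    """Greedy backward dropping; returns (return set, its I-score)."""
    current = [int(v) for v in init_vars]
    best_set, best_score = list(current), i_score(X[:, current], y)
    while len(current) > 1:
        # Tentatively drop each variable and score the remaining subset.
        trials = [[v for v in current if v != d] for d in current]
        scored = [(i_score(X[:, keep], y), keep) for keep in trials]
        # Permanently drop the variable whose removal gives the highest score.
        score, current = max(scored, key=lambda t: t[0])
        if score > best_score:
            best_score, best_set = score, list(current)
    return best_set, best_score

# Synthetic demo: Y depends on columns 0 and 1 only through their XOR.
rng = np.random.default_rng(1)
n, p, k = 300, 20, 6
X = rng.integers(0, 2, size=(n, p))
y = (X[:, 0] ^ X[:, 1]).astype(float)
# Repeat BDA over random initial subsets and keep the best return set.
results = [backward_dropping(X, y, rng.choice(p, size=k, replace=False))
           for _ in range(50)]
best_set, best_score = max(results, key=lambda r: r[1])
print(sorted(best_set), best_score)
```

A single random draw of k variables may miss the influential module entirely, which is why the demo repeats the procedure over many initial subsets.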
Classification based on van't Veer's Data (2002). Applying the procedures described in the Discovery stage, we identified 18 influential modules with sizes ranging from 2 to 6. The purpose of the original study was to predict breast cancer relapse using gene expression data. The original data contain the expression levels of 24,187 genes for 97 patients: 46 relapse (distant metastasis within 5 years) and 51 non-relapse (no distant metastasis for at least 5 years). For the classification task we used the 4,918 genes to which Tibshirani and Efron (2002) reduced the data. Of the 97 patients, 78 were used as the training set (34 relapse and 44 non-relapse) and 19 as the test set (12 relapse and 7 non-relapse). The best error rate (biased or not) reported in the literature on this particular test set is around 10% (2 errors). The proposed method yields a zero error rate (no errors) on the test set.
The CV error rates on the van't Veer data are typically around 30%. The proposed method yields an average error rate of 8% over 10 randomly selected CV test samples, a reduction in error rate of about 73% ((30% - 8%)/30%) compared with existing methods. We ran the CV experiment by randomly partitioning the 97 patients into a training sample of size 87 and a test sample of size 10, and repeated the experiment ten times.
In a case-control design with n cases and n controls, the last line of the equations, divided by a normalizing factor, converges to an expression in the class probabilities (two classes, case vs. control). This expression is directly related to the correct prediction rate corresponding to the partition. Thus searching for clusters with larger I-scores automatically seeks clusters with stronger predictive ability, a very desirable property.
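One way to see the connection is sketched below. This is not the slide's own expression, which did not survive in this transcript. It assumes the score $I = \sum_j n_j^2(\bar{Y}_j - \bar{Y})^2$, writes $n_{D,j}$ and $n_{U,j}$ for the case and control counts in partition element $j$, and lets $p_{D,j}$ and $p_{U,j}$ be the probabilities that a case or a control falls in element $j$ (notation assumed here).

```latex
\begin{aligned}
\frac{I}{n^{2}/4}
  &= \sum_{j}\left(\frac{n_{D,j}}{n} - \frac{n_{U,j}}{n}\right)^{2}
  \;\longrightarrow\; \sum_{j}\left(p_{D,j} - p_{U,j}\right)^{2}
  \qquad (n \to \infty), \\[4pt]
\text{best correct prediction rate}
  &= \frac{1}{2}\sum_{j}\max\left(p_{D,j},\, p_{U,j}\right)
   = \frac{1}{2} + \frac{1}{4}\sum_{j}\left|p_{D,j} - p_{U,j}\right| .
\end{aligned}
```

Both quantities grow with the gaps $|p_{D,j} - p_{U,j}|$, so under these assumptions a partition with a larger limiting score tends to separate cases from controls better.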
Example Using Breast Cancer Data
• Case-control sporadic breast cancer data from NCI Cancer Genetic Markers of Susceptibility (CGEMS)
• 2,287 postmenopausal women
• 1,145 cases and 1,142 controls
• 18 genes with 304 SNPs selected from the literature
P-values of the observed marginal effects, with the null distribution estimated by permutations.
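A permutation null of this kind can be sketched as follows. The code is an illustration under assumptions, not the analysis used for the figure: the statistic is taken to be the sum over genotypes of squared differences between case and control genotype frequencies, and the function names, the number of permutations, and the toy data are hypothetical.

```python
import numpy as np

def gtd(genotype, status):
    """Assumed statistic: sum over genotypes of (case freq - control freq)^2."""
    genotype = np.asarray(genotype)
    status = np.asarray(status)
    cases = genotype[status == 1]
    controls = genotype[status == 0]
    return float(sum((np.mean(cases == g) - np.mean(controls == g)) ** 2
                     for g in np.unique(genotype)))

def permutation_p_value(genotype, status, n_perm=1000, seed=0):
    """Estimate the null by shuffling case/control labels; return a p-value."""
    rng = np.random.default_rng(seed)
    observed = gtd(genotype, status)
    null = np.array([gtd(genotype, rng.permutation(status)) for _ in range(n_perm)])
    # Add-one correction keeps the estimated p-value strictly positive.
    return float((1 + np.sum(null >= observed)) / (1 + n_perm))

# Toy data: genotype coded 0/1/2, first 10 subjects are cases.
genotype = np.array([0] * 10 + [2] * 10)
status = np.array([1] * 10 + [0] * 10)
print(gtd(genotype, status), permutation_p_value(genotype, status))
```

Shuffling the labels keeps the genotype distribution and the case and control totals fixed, so the permuted statistics reflect what arises when genotype and disease status are unrelated.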
Two-way Interaction Networks. Pairwise network based on 16 pairs of genes identified by the Mean-ratio method. Pairwise network based on 18 pairs of genes identified by the Quantile-ratio method.
Three-way Interaction Networks. 3-way interaction network based on 10 genes identified by the Mean-ratio method. 3-way interaction network based on 8 genes identified by the Quantile-ratio method.
Pairwise Interaction
(M, R)-plane: observed data and permutation quantiles. 1-ESR1 BRCA1, 2-BRCA1 PHB, 3-KRAS2 BRCA1, 4-SLC22A18 BRCA1, 5-RAD51 BRCA1, 6-RB1CC1 SLC22A18, 7-CASP8 KRAS2, 8-CASP8 SLC22A18, 9-PIK3CA BRCA1, 10-PIK3CA ESR1, 11-PIK3CA RB1CC1, 12-PIK3CA SLC22A18, 13-BRCA1 CHEK2, 14-BARD1 BRCA1, 15-BARD1 ESR1, 16-BARD1 TP53
(M, Q)-plane: observed data and permutation quantiles. 1-ESR1 BRCA1, 2-BRCA1 PHB, 3-KRAS2 BRCA1, 4-SLC22A18 BRCA1, 5-RAD51 BRCA1, 6-ESR1 SLC22A18, 7-RB1CC1 SLC22A18, 8-CASP8 KRAS2, 9-CASP8 SLC22A18, 10-PIK3CA BRCA1, 11-PIK3CA ESR1, 12-PIK3CA RB1CC1, 13-CASP8 PIK3CA, 14-BRCA1 CHEK2, 15-BARD1 BRCA1, 16-BARD1 ESR1, 17-BARD1 TP53, 18-BARD1 SLC22A18
Remarks
• One limitation of marginal approaches is that they use only a fraction of the information in the data.
• The proposed approach is intended to draw out more of the relevant information and thereby improve prediction.
• Additional scientific findings are likely if data already collected are suitably reanalyzed.
• The proposed approach is particularly useful when a large number of dense markers becomes available.
• Information about gene-gene interactions and their disease networks can be derived and constructed.
Collaborators
• Herman Chernoff, Tian Zheng, Iuliana Ionita-Laza, Inchi Hu, Hongyu Zhao, Hui Wang, Xin Yan, Yuejing Ding, Chien-Hsun Huang, Bo Qiao, Ying Liu, Michael Agne, Ruixue Fan, Maggie Wang, Lei Cong, Hugh Arnolds, Jun Xie, Kjell Doksum
Key References
• Lo SH, Zheng T (2002) Backward haplotype transmission association (BHTA) algorithm - a fast multiple-marker screening method. Human Heredity 53(4): 197-215.
• Lo SH, Zheng T (2004) A demonstration and findings of a statistical approach through reanalysis of inflammatory bowel disease data. PNAS 101(28): 10386-10391.
• Lo SH, Chernoff H, Cong L, Ding Y, Zheng T (2008) Discovering interactions among BRCA1 and other candidate genes involved in sporadic breast cancer. PNAS 105: 12387-12392.
• Chernoff H, Lo SH, Zheng T (2009) Discovering influential variables: a method of partitions. Annals of Applied Statistics 3(4): 1335-1369.
• Wang H, Lo SH, Zheng T, Hu I (2012) Interaction-based feature selection and classification for high-dimensional biological data. Bioinformatics 28(21): 2834-2842.