Effects of Environment, Genetics and Data Analysis in an Esophageal Cancer Genome-Wide Association Study. Alexander Statnikov, Discovery Systems Laboratory, Department of Biomedical Informatics, Vanderbilt University. 10/3/2007.
Project history • Joint project with Chun Li and Constantin Aliferis • Cancer Research 2005 paper by Hu et al.: “Genome-Wide Association Study in Esophageal Cancer Using GeneChip Mapping 10K Array” • Reported near-perfect classification of cancer patients and healthy controls on the basis of SNP data alone from a case-control GWA study. • Taken at face value, this finding suggests that esophageal cancer is a solely genetic disease… • Initial idea: Chun Li • At DSL we had independently obtained the GWA dataset before Chun and Constantin initiated this project
Background • SNPs make up >90% of all human genetic variation and have been extensively studied for functional relationships between phenotype and genotype. • Modern high-throughput genotyping technologies allow fast evaluation of SNPs on a genome-wide scale at a relatively low cost. • During the last 2 years, several studies have reported success in using SNP genotyping assays in GWA studies in cancer. Probably the strongest result is the one reported by Hu et al.
Claims of Hu et al. • “Using the generalized linear model (GLM) with adjustment for potential confounders and multiple comparisons, we identified 37 SNPs associated with disease.” • “When the 37 SNPs identified from the GLM recessive mode were used in a principal components analysis, the first principal component correctly predicted 46 of 50 cases and 47 of 50 controls.” […] • “The permutation tests indicated that our PCA classification can be generalized.”
Study dataset & its preparation • Study dataset: • 50 esophageal squamous cell carcinoma patients • 50 healthy controls (matched by age, sex, place of residence) • 10k Affymetrix SNP arrays with 11,555 SNPs • Additional variables: • Age • Tobacco use • Alcohol consumption • Family history • Consumption of pickled vegetables • Removed ~1.5k SNPs to minimize genotyping errors • Implemented recessive A encoding • Imputed missing genotypes
SNP selection: Original method of Hu et al. (denoted as GLM1) • For each SNP, fit a GLM using data for all 100 subjects: Probability(Cancer) = 1 / (1 + exp(-f)), where f = a + b ∙ SNP + c ∙ family history + d ∙ alcohol consumption • Obtain deviances: • D1 - deviance of the above fitted model • D0 - deviance of the null model (without predictor variables) • From the χ² distribution, compute a p-value for the test statistic D0 - D1 with 3 degrees of freedom • Perform Bonferroni correction at the 0.05 alpha level
SNP selection: Unbiased GLM-based method (denoted as GLM2) • For each SNP, fit a GLM using data for all 100 subjects: Probability(Cancer) = 1 / (1 + exp(-f)), where f = a + b ∙ SNP + c ∙ family history + d ∙ alcohol consumption • Obtain deviances: • D1 - deviance of the above fitted model • D0′ - deviance of the model with family history and alcohol consumption only • From the χ² distribution, compute a p-value for the test statistic D0′ - D1 with 1 degree of freedom • Perform Bonferroni correction at the 0.05 alpha level
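The difference between the two tests can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in the study: the function names (`logistic_deviance`, `glm1_pvalue`, `glm2_pvalue`) and the toy data are hypothetical, and the logistic model is fitted with a plain Newton-Raphson loop rather than a statistics package.

```python
import numpy as np
from scipy.stats import chi2


def logistic_deviance(y, *predictors, n_iter=50):
    """Deviance (-2 * log-likelihood) of a logistic GLM with an intercept."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):  # Newton-Raphson (IRLS) updates
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        # Tiny ridge term: a numerical safeguard only, not part of the model.
        hessian = (X * w[:, None]).T @ X + 1e-8 * np.eye(X.shape[1])
        step = np.linalg.solve(hessian, X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    p = np.clip(1.0 / (1.0 + np.exp(-X @ beta)), 1e-12, 1 - 1e-12)
    return -2.0 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))


def glm1_pvalue(y, snp, family_history, alcohol):
    """GLM1: full model against the null model, 3 degrees of freedom."""
    d0 = logistic_deviance(y)
    d1 = logistic_deviance(y, snp, family_history, alcohol)
    return chi2.sf(d0 - d1, df=3)


def glm2_pvalue(y, snp, family_history, alcohol):
    """GLM2: full model against the covariates-only model, 1 degree of freedom."""
    d0_prime = logistic_deviance(y, family_history, alcohol)
    d1 = logistic_deviance(y, snp, family_history, alcohol)
    return chi2.sf(max(d0_prime - d1, 0.0), df=1)


# Toy data (assumed for illustration): 50 cases, 50 controls, two covariates
# that are associated with case status, and a SNP that is not.
idx = np.arange(100)
y = (idx < 50).astype(float)
family_history = ((idx < 30) | ((idx >= 50) & (idx < 60))).astype(float)
alcohol = np.where(idx < 50, idx % 10 < 7, idx % 10 < 3).astype(float)
snp = (idx % 4 == 0).astype(float)

p1 = glm1_pvalue(y, snp, family_history, alcohol)
p2 = glm2_pvalue(y, snp, family_history, alcohol)
print(f"GLM1 p-value: {p1:.2e}   GLM2 p-value: {p2:.2f}")
```

On this toy data the SNP carries essentially no information, yet the GLM1 p-value is small because the covariates alone already lower the deviance; the GLM2 p-value is not.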
Classification: Original method of Hu et al. • Perform principal component analysis (PCA) on the selected SNPs using all 100 subjects in the dataset. • Extract the first principal component (PC1). • Use the following rule to classify each of the same 100 subjects as used for the PCA: if PC1 > 0, classify as control; otherwise classify as case
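The classification rule above can be sketched as follows. This is only an illustration: `pc1_classify` and the toy matrix are hypothetical, PCA is computed with a plain SVD, and the sign of a principal component is arbitrary, so which side of zero corresponds to "control" depends on how the component happens to be oriented.

```python
import numpy as np


def pc1_classify(snps):
    """Score every subject on PC1 and apply the rule 'PC1 > 0 -> control'."""
    centered = snps - snps.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pc1 = centered @ vt[0]
    labels = np.where(pc1 > 0, "control", "case")
    return pc1, labels


# Toy data (assumed): 100 subjects, 20 selected features, and the first
# 50 subjects shifted so that the two groups differ.
rng = np.random.default_rng(0)
toy = rng.normal(size=(100, 20))
toy[:50] += 3.0

pc1, labels = pc1_classify(toy)
agreement = np.mean((labels == "case") == (np.arange(100) < 50))
# Because the sign of PC1 is arbitrary, agreement may be close to 0 or to 1.
print(f"agreement with the true grouping: {agreement:.2f}")
```

Note that the PCA is fitted on the same subjects that are then classified, so the result is an in-sample figure rather than an estimate of performance on new subjects.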
Evaluation of classification performance • Hu et al. used the proportion of correct classifications; their classifier is trained and tested on the same dataset • We employ the area under the ROC curve (AUC) as the performance metric and a repeated 10-fold cross-validation scheme [Figure: repeated 10-fold cross-validation on the SNP dataset (100 subjects). Each repetition yields 10 per-fold AUC values (e.g. 0.9, 0.8, 0.9, 0.6, 0.9, 0.8, 0.7, 0.8, 0.9, 1.0) that are averaged into one estimate per repetition (0.83, 0.81, …, 0.79).]
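A repeated 10-fold cross-validation estimate of AUC can be sketched as below. The helper names (`auc`, `stratified_folds`, `repeated_cv_auc`, `centroid_scores`) are hypothetical, the nearest-centroid scorer is a stand-in for whatever classifier is being evaluated, and stratifying the folds by class is an assumption not stated on the slide.

```python
import numpy as np


def auc(y_true, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    diff = scores[y_true == 1][:, None] - scores[y_true == 0][None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def stratified_folds(y, n_folds, rng):
    """Assign each subject to a fold, keeping the class balance in every fold."""
    folds = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        members = rng.permutation(np.flatnonzero(y == label))
        folds[members] = np.arange(len(members)) % n_folds
    return folds


def repeated_cv_auc(X, y, fit_and_score, n_repeats=10, n_folds=10, seed=0):
    """Average per-fold AUC within each repetition, then across repetitions."""
    rng = np.random.default_rng(seed)
    repeat_means = []
    for _ in range(n_repeats):
        folds = stratified_folds(y, n_folds, rng)
        fold_aucs = []
        for k in range(n_folds):
            test = folds == k
            scores = fit_and_score(X[~test], y[~test], X[test])
            fold_aucs.append(auc(y[test], scores))
        repeat_means.append(float(np.mean(fold_aucs)))
    return float(np.mean(repeat_means)), repeat_means


def centroid_scores(X_train, y_train, X_test):
    """Stand-in classifier: project onto the difference of class means."""
    w = X_train[y_train == 1].mean(axis=0) - X_train[y_train == 0].mean(axis=0)
    return X_test @ w


# Toy data (assumed): 50 cases, 50 controls, 5 informative features.
rng = np.random.default_rng(0)
y = np.repeat([1, 0], 50)
X_signal = rng.normal(size=(100, 5)) + 2.0 * y[:, None]
mean_auc, per_repeat = repeated_cv_auc(X_signal, y, centroid_scores)
print(f"repeated 10-fold CV AUC: {mean_auc:.2f}")
```

Everything that learns from the labels has to happen inside `fit_and_score`, using the training part of each split only.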
Reproducing findings of Hu et al. • Using the GLM1 method, Hu et al. reported 37 significant SNPs; we found 226! • Apparently, they used an extra filtering step that was not reported in the paper (personal communication with their PI). • Nevertheless, applying the PCA-based classifier (as in Hu et al.) to the GLM1-significant SNPs resulted in a 0.93 proportion of correct classifications and 0.98 AUC. ⇒ The major findings are reproduced using the methods of Hu et al.
Bias in SNP selection method GLM1 of Hu et al. • The p-value calculated in GLM1 does not reflect the significance of the SNP, but the significance of 3 variables combined (SNP, family history, and alcohol consumption) • Family history and alcohol consumption are strong risk factors ⇒ the p-value is biased towards 0.
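The source of the bias can be written out explicitly. Writing D0′ for the deviance of the model that contains only family history and alcohol consumption, the GLM1 statistic splits into a covariate part and a SNP part:

```latex
\underbrace{D_0 - D_1}_{\text{GLM1 statistic, 3 d.f.}}
  \;=\;
  \underbrace{(D_0 - D_0')}_{\text{family history + alcohol, 2 d.f.}}
  \;+\;
  \underbrace{(D_0' - D_1)}_{\text{SNP alone, 1 d.f.}}
```

The first term is the same for every SNP and is large whenever the covariates are strong risk factors, so it pushes the GLM1 p-value towards 0 regardless of the SNP. Only the second term measures what the SNP adds once the covariates are in the model.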
Bias in SNP selection method GLM1 by Hu et al. • The distribution of SNP p-values for method GLM1 is not uniform: most p-values are < 10⁻³ • In contrast, GLM2 reflects the significance of the SNP alone and does not suffer from this bias: • Its distribution of SNP p-values is uniform • It returns no SNPs significant at the Bonferroni-adjusted α-level [Figure: distributions of SNP p-values; the Bonferroni-adjusted α-level is marked.]
Empirical demonstration of bias in SNP selection method • Main idea: Create a null distribution where SNPs are completely unrelated to the response variable and see how frequently methods GLM1 and GLM2 find statistically significant SNPs. • Permute all subjects in the SNP data while leaving the response variable, family history of esophageal cancer, and alcohol consumption intact. • Apply GLM1 and GLM2 to the permuted SNP data. Repeat 1,000 times
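The permutation experiment can be sketched on synthetic data as follows. This is a scaled-down illustration under stated assumptions: the helper names, the 200 random SNPs, the covariate patterns and the 3 permutations are all hypothetical (the study used the real SNP data and 1,000 permutations), and the logistic model is fitted with a simple Newton-Raphson loop.

```python
import numpy as np
from scipy.stats import chi2


def logistic_deviance(y, *predictors, n_iter=50):
    """Deviance (-2 * log-likelihood) of a logistic GLM with an intercept."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):  # Newton-Raphson (IRLS) updates
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        # Tiny ridge term: a numerical safeguard only, not part of the model.
        hessian = (X * w[:, None]).T @ X + 1e-8 * np.eye(X.shape[1])
        step = np.linalg.solve(hessian, X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    p = np.clip(1.0 / (1.0 + np.exp(-X @ beta)), 1e-12, 1 - 1e-12)
    return -2.0 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))


def count_significant(y, snps, family_history, alcohol, alpha=0.05):
    """Number of Bonferroni-significant SNPs under GLM1 and under GLM2."""
    threshold = alpha / snps.shape[1]
    d0 = logistic_deviance(y)                                  # null model
    d0_prime = logistic_deviance(y, family_history, alcohol)   # covariates only
    n_glm1 = n_glm2 = 0
    for j in range(snps.shape[1]):
        d1 = logistic_deviance(y, snps[:, j], family_history, alcohol)
        n_glm1 += int(chi2.sf(d0 - d1, df=3) < threshold)
        n_glm2 += int(chi2.sf(max(d0_prime - d1, 0.0), df=1) < threshold)
    return n_glm1, n_glm2


def permutation_null(y, snps, family_history, alcohol, n_permutations, seed=0):
    """Shuffle subjects in the SNP data only; response and covariates stay intact."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(n_permutations):
        shuffled = snps[rng.permutation(len(y))]
        results.append(count_significant(y, shuffled, family_history, alcohol))
    return results


# Synthetic stand-in data: covariates associated with case status, random SNPs.
idx = np.arange(100)
y = (idx < 50).astype(float)
family_history = ((idx < 30) | ((idx >= 50) & (idx < 60))).astype(float)
alcohol = np.where(idx < 50, idx % 10 < 7, idx % 10 < 3).astype(float)
snps = (np.random.default_rng(1).random((100, 200)) < 0.4).astype(float)

results = permutation_null(y, snps, family_history, alcohol, n_permutations=3)
for glm1_count, glm2_count in results:
    print(f"GLM1 significant: {glm1_count:3d}   GLM2 significant: {glm2_count}")
```

The synthetic covariates here are strong enough that GLM1 should flag most or all of the permuted SNPs, which exaggerates the effect relative to the real data; the point is only that the GLM1 count is driven by the covariates, while the GLM2 count stays near zero.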
Results of permutation experiments • GLM1 found significant SNPs in all 1,000 permutations! The number of significant SNPs found in a permuted dataset ranges from 185 to 1,938 (357 on average). • GLM2 found significant SNPs in only 48/1,000 permutations (4.8%, in line with the 0.05 level of the Bonferroni correction). The number of significant SNPs found in such a permuted dataset ranges from 1 to 3. ⇒ GLM1 is biased, while GLM2 is not.
Bias in the classification performance estimate of Hu et al. • All data-analysis methods of Hu et al. use data for all subjects. Neither cross-validation nor independent-sample validation was performed. • We repeated their data analysis (GLM1+PCA) embedded in the repeated 10-fold cross-validation design. The resulting performance is only 0.68 AUC (versus 0.98 AUC). ⇒ 0.30 AUC bias (overestimation) in the reported results
Empirical demonstration of performance estimation bias • Main idea: Create a null distribution where SNPs are completely unrelated to the response variable (i.e. AUC=0.5), apply GLM1+PCA methodology and record resulting performance estimates. • Permute all subjects in the SNP data while leaving the response variable, family history of esophageal cancer, and alcohol consumption intact. • Apply GLM1 to the permuted SNP data. • Build and apply classifier using PCA. • Estimate classification performance (AUC). Repeat 1,000 times
Results of permutation experiments • Classification performance of GLM1+PCA; both methods applied as in Hu et al. to all data (no cross-validation): 0.99 AUC • Classification performance of GLM1+PCA; GLM1 applied to all data, PCA applied by cross-validation (incomplete cross-validation): 0.98 AUC • Classification performance of GLM1+PCA; both methods applied by cross-validation: 0.50 AUC ⇒ 0.48-0.49 AUC bias (overestimation) under the null
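The gap between incomplete and complete cross-validation can be reproduced on pure noise. The sketch below does not use GLM1 or PCA: as stated stand-ins it selects the top-k features by absolute difference of class means and scores subjects with a nearest-centroid projection. All names and the synthetic data are hypothetical.

```python
import numpy as np


def auc(y_true, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    diff = scores[y_true == 1][:, None] - scores[y_true == 0][None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def select_top(X, y, k):
    """Stand-in for SNP selection: top-k features by absolute mean difference."""
    diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return np.argsort(np.abs(diff))[-k:]


def centroid_scores(X_train, y_train, X_test):
    """Stand-in classifier: project onto the difference of class means."""
    w = X_train[y_train == 1].mean(axis=0) - X_train[y_train == 0].mean(axis=0)
    return X_test @ w


def cv_auc(X, y, k, select_inside_cv, n_folds=10, seed=0):
    """10-fold CV AUC with feature selection done inside or outside the CV loop."""
    rng = np.random.default_rng(seed)
    folds = np.empty(len(y), dtype=int)
    for label in (0, 1):  # stratified fold assignment
        members = rng.permutation(np.flatnonzero(y == label))
        folds[members] = np.arange(len(members)) % n_folds
    global_features = select_top(X, y, k)  # sees ALL subjects, test folds included
    fold_aucs = []
    for f in range(n_folds):
        test = folds == f
        if select_inside_cv:
            features = select_top(X[~test], y[~test], k)  # training part only
        else:
            features = global_features
        scores = centroid_scores(X[~test][:, features], y[~test], X[test][:, features])
        fold_aucs.append(auc(y[test], scores))
    return float(np.mean(fold_aucs))


# Null data: 100 subjects, 2,000 features that are unrelated to the labels.
rng = np.random.default_rng(1)
y = np.repeat([1, 0], 50)
X = rng.normal(size=(100, 2000))

incomplete = cv_auc(X, y, k=20, select_inside_cv=False)
complete = cv_auc(X, y, k=20, select_inside_cv=True)
print(f"incomplete CV AUC: {incomplete:.2f}   complete CV AUC: {complete:.2f}")
```

With selection done on all subjects the estimate should look far better than chance even though no feature carries signal; with selection repeated inside every training fold it should fall back to roughly 0.5.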
Additional analysis of SNP data to assess the effects of genetics and environment.
Classification: Support Vector Machines (SVMs) • Supervised baseline technique for many types of high-throughput data (microarray, proteomics, etc.). • Trained and applied by cross-validation
SNP selection for fitting SVMs: Recursive Feature Elimination • Among the best performing techniques for the analysis of microarray gene expression data • Applied only to the training set during cross-validation [Figure: recursive feature elimination. An SVM model is fit on 10,000 SNPs and a performance estimate is obtained; the 5,000 SNPs important for classification are kept and those not important are discarded. The procedure repeats with an SVM model on 5,000 SNPs (keeping 2,500), and so on.]
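A halving variant of recursive feature elimination with a linear SVM can be sketched as below, assuming scikit-learn is available. The function `svm_rfe_halving`, the value C=1.0, the toy data and the single train/test split are assumptions made for illustration; how the subset size was chosen from the performance estimates in the study is not stated here.

```python
import numpy as np
from sklearn.svm import SVC


def svm_rfe_halving(X_train, y_train, min_features=1):
    """Fit a linear SVM, keep the half of the features with the largest
    absolute weights, and repeat. Returns [(feature_indices, model), ...]."""
    active = np.arange(X_train.shape[1])
    path = []
    while True:
        model = SVC(kernel="linear", C=1.0).fit(X_train[:, active], y_train)
        path.append((active.copy(), model))
        n_keep = len(active) // 2
        if n_keep < min_features:
            break
        weights = np.abs(model.coef_).ravel()
        keep = np.sort(np.argsort(weights)[::-1][:n_keep])
        active = active[keep]
    return path


# Toy data (assumed): 200 subjects, 64 features, the first 4 are informative.
rng = np.random.default_rng(0)
y = np.repeat([1, 0], 100)
X = rng.normal(size=(200, 64))
X[y == 1, :4] += 2.0

# Selection uses the training subjects only; the rest are held out.
order = rng.permutation(200)
train, test = order[:150], order[150:]
path = svm_rfe_halving(X[train], y[train])

for features, model in path:
    accuracy = np.mean(model.predict(X[test][:, features]) == y[test])
    print(f"{len(features):3d} features   held-out accuracy {accuracy:.2f}")
```

In a cross-validation design the whole elimination path is recomputed on the training part of every split, so the held-out subjects never influence which features are retained.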
Classification results: repeated 10-fold cross-validation estimates • “+” denotes building of the classifier by an ensembling technique
Feedback on our analysis from Hu et al. 1. Concerning bias in SNP selection: • “If we use p-values to rank the SNPs, the two methods [GLM1 and GLM2] will give the same order.” • Our comment: • The ranking of SNPs is irrelevant because the method of Hu et al. (GLM1), as described and used in their paper, is a method for selecting (not ranking) SNPs.
Feedback on our analysis from Hu et al. 2. Concerning bias in estimation of classifier performance: • “It was not our purpose to develop a classifier in this initial pilot effort.” • “…we made these calculations as a frame of reference only.” • The authors presented results of their “cross-validation effort”. SNPs were selected by GLM1 on all 100 subjects and the classifier was trained and tested by cross-validation (2/3 of the data used for training and 1/3 for testing). This cross-validation procedure was repeated 1,000 times with different splits into training and testing sets.
Feedback on our analysis from Hu et al. • The authors obtained the following histogram of classification performance estimates [Figure: histogram of the proportion of correct classifications.] • Our comment: • These results are expected because their SNP selection procedure uses both training and testing data. This is “incomplete cross-validation”, which has been shown to bias the estimate of classifier performance.
Publications • Statnikov A, Li C, Aliferis CF (2007) “Effects of Environment, Genetics and Data Analysis Pitfalls in an Esophageal Cancer Genome-Wide Association Study.” PLoS ONE 2(9): e958. • Statnikov A, Li C, Aliferis CF (2007) “A statistical reappraisal of the findings of an esophageal cancer genome-wide association study.” Cancer Research (accepted).
Conclusions • Data-analysis pitfalls in Hu et al. led the researchers to (1) identify SNPs that are not statistically significant and (2) derive biased estimates of classification performance. • Environmental factors and family history have a modest association with the disease, while SNPs do not appear to be associated. • It is crucially important to have sound statistical analysis in genome-wide association studies. • The amount of work involved in demonstrating errors (even obvious ones), correcting the analysis, communicating with the authors, and publishing the rebuttal is significantly greater than that of publishing the original paper!