Integration of Clinical, SNP, and Microarray Gene Expression Measurements in Prediction of CFS. Sooyeol Lim (1), Wen Le (1,2), Pingzhao Hu (1), Baifang Xing (1), Celia M.T. Greenwood (1,2), Joseph Beyene (1,2)*. (1) The Hospital for Sick Children Research Institute; (2) Department of Public Health Sciences, University of Toronto. The Sixth International Conference for the Critical Assessment of Microarray Data Analysis (CAMDA 2006), Duke University, Durham, NC, U.S.A., June 8-9, 2006. *Contact: joseph@utstat.toronto.edu
Objective Integration of clinical data, SNP microarray data, and gene expression microarray data to generate a statistical prediction model for chronic fatigue syndrome (CFS) and identify genes that can serve as potential biomarkers
Main Statistical Approach • Pre-validation method for data integration, due to Tibshirani and Efron (2002): • For SNP and expression data, use a simple test combined with cross-validation to generate a prediction score for each individual for each data type. • Combine prediction scores with clinical variables to improve predictive power of model.
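The pre-validation idea above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy one-dimensional feature, and the centroid-distance classifier are all assumptions made for the example. The point it shows is that each subject's score comes from a model that never saw that subject.

```python
def prevalidated_scores(features, labels, fold_of, fit, predict):
    """Return one out-of-fold prediction score per subject."""
    scores = [None] * len(labels)
    for fold in sorted(set(fold_of)):
        train = [i for i, f in enumerate(fold_of) if f != fold]
        test = [i for i, f in enumerate(fold_of) if f == fold]
        # The model is fitted without the held-out fold.
        model = fit([features[i] for i in train], [labels[i] for i in train])
        for i in test:
            scores[i] = predict(model, features[i])
    return scores


def fit_centroids(xs, ys):
    """Hypothetical toy learner: class means of a single numeric feature."""
    cases = [x for x, y in zip(xs, ys) if y == 1]
    controls = [x for x, y in zip(xs, ys) if y == 0]
    return sum(cases) / len(cases), sum(controls) / len(controls)


def predict_score(model, x):
    """Positive when x lies closer to the case mean than to the control mean."""
    mu_case, mu_control = model
    return abs(x - mu_control) - abs(x - mu_case)


# Toy data (invented for illustration): 4 controls, 4 cases, 2 folds.
features = [0.1, 0.3, 0.2, 0.4, 1.1, 0.9, 1.2, 0.8]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
fold_of = [0, 1, 0, 1, 0, 1, 0, 1]
scores = prevalidated_scores(features, labels, fold_of, fit_centroids, predict_score)
```

The resulting scores can then enter a second-stage model next to the clinical variables as ordinary covariates.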
Statistical Method: Cross-Validation (CV) • Cross-validation necessary to evaluate model performance while preventing overfitting. • Stratified sampling on case-control data to divide all subjects into 10 groups. • At each step of CV, use 9 groups to train statistical model which is subsequently used to make predictions on remaining 1 group. • Repeat 10 times to obtain predictions on all subjects.
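One way to build the stratified 10-group split is sketched below. The dealing scheme and the name stratified_folds are assumptions for illustration; the slides do not say how the groups were actually drawn. The class counts (129 cases, 35 controls) are taken from the Subjects slide.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each subject a fold index so that every fold keeps the class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [None] * len(labels)
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal subjects out like cards; the offset keeps total fold sizes even.
        for position, idx in enumerate(members):
            fold_of[idx] = (position + offset) % n_folds
        offset += len(members)
    return fold_of


labels = [1] * 129 + [0] * 35  # 129 cases, 35 controls
fold_of = stratified_folds(labels)
```

With these counts every fold holds 12 or 13 cases and 3 or 4 controls, so each training set of 9 groups sees roughly the same case-to-control ratio as the full sample.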
Subjects • 164 subjects had complete clinical, SNP, and gene expression data. • Subjects classified as 129 “cases” (64 CFS and 65 with CFS-like symptoms) and 35 controls. • Previously determined clinical diagnoses used (rather than empirically derived classifications [Reeves et al., 2005]).
Clinical Data • Previous factor analysis identified 3 major factors: musculoskeletal, infection, mental [Nisenbaum et al., 2004]. • Previous clinical research showed a statistically significant association with tender lymph nodes (p=0.02) [Solomon and Reeves, 2004], and 81.4% of CFS subjects reported sleep abnormalities [Unger et al., 2004]. • 3 clinical assessments chosen to represent the 3 major clinical factors: tender lymph nodes, sleep problems, muscle pain.
SNP and haplotype data • 42 SNP markers from 10 genes involved in neurotransmission and the neuroendocrine system. • Alleles at closely linked SNP markers frequently occur together as common blocks. • For 6 genes (COMT, TH, TPH2, CRHR1, CRHR2, NR3C1), PHASE software used to construct haplotypes with Bayesian algorithms.
Haplotype data (cont.) • For each of the 6 haplotyped genes, 3 major haplotypes coded as 2 binary indicator variables. • For haplotypes and SNP alleles on autosomal chromosomes, additive inheritance coding used (0, 1, or 2 according to the number of copies). • For genes on the X chromosome, dominant inheritance coding used. • Values in indicator variables multiplied by the probabilities associated with the determination of each haplotype to account for phase uncertainty.
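The coding rules above might look like the sketch below. Several details are assumptions for illustration only: the function name, the haplotype labels H1 to H3, treating the third major haplotype as the reference level, and applying a single phase probability per subject.

```python
def code_haplotype_pair(hap_pair, major_haplotypes, phase_prob, mode="additive"):
    """Code a subject's two haplotypes as 2 indicator variables.

    The first two major haplotypes get one variable each; the third is
    treated as the reference level (an assumption of this sketch).
    """
    values = []
    for hap in major_haplotypes[:2]:
        copies = sum(1 for h in hap_pair if h == hap)  # 0, 1, or 2
        if mode == "dominant":
            copies = 1 if copies > 0 else 0  # presence/absence only
        # Down-weight by the probability of the inferred phase.
        values.append(copies * phase_prob)
    return values


majors = ["H1", "H2", "H3"]  # hypothetical haplotype labels
additive = code_haplotype_pair(("H1", "H1"), majors, 1.0)
uncertain = code_haplotype_pair(("H1", "H2"), majors, 0.9)
dominant = code_haplotype_pair(("H1", "H1"), majors, 1.0, mode="dominant")
```

Here a subject carrying two copies of H1 with certain phase is coded (2, 0) under additive coding and (1, 0) under dominant coding, while an H1/H2 subject whose phase has probability 0.9 is coded (0.9, 0.9).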
Microarray Gene Expression Data • 177 microarray gene expression samples for 19,892 genes initially available. • 8 technical replicates and 5 samples from excluded subjects removed. • Remaining 164 samples log-transformed, then quantile-normalized before analysis.
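Quantile normalization after a log transformation can be sketched in a few lines. This is a generic textbook version, assuming a base-2 logarithm and arbitrary tie handling; the slides do not state the base or the software used.

```python
import math


def quantile_normalize(samples):
    """Force every array (list of per-gene values) onto the same distribution."""
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [
        sum(s[k] for s in sorted_samples) / len(samples) for k in range(n_genes)
    ]
    normalized = []
    for s in samples:
        # Gene indices ordered by expression; ties keep their original order.
        order = sorted(range(n_genes), key=lambda g: s[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]
        normalized.append(out)
    return normalized


# Toy intensities for 2 arrays and 3 genes (invented for illustration).
raw = [[4, 16, 64], [8, 2, 32]]
logged = [[math.log2(v) for v in s] for s in raw]
normalized = quantile_normalize(logged)
```

After normalization both arrays contain the same set of values (1.5, 3.5, 5.5); only the assignment of values to genes differs, following each array's original ranking.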
Statistical Method: Exploratory analysis • Performed on all 164 subjects, without CV. • For clinical and SNP data: logistic regression. • For microarray expression data: principal component analysis and Sammon’s non-linear mapping on the 400 genes with the largest inter-quartile range (IQR), to explore whether gene expression profiles can be grouped to achieve dimension reduction.
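Selecting the genes with the largest IQR could be done as in this sketch. The gene names and values are invented, and the quartile convention (the default of Python's statistics.quantiles) is an assumption; other conventions give slightly different IQRs.

```python
import statistics


def iqr(values):
    """Inter-quartile range using the default quartile method of statistics.quantiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def top_iqr_genes(expr_by_gene, n_top):
    """Return the n_top gene names with the largest IQR across samples."""
    return sorted(expr_by_gene, key=lambda g: iqr(expr_by_gene[g]), reverse=True)[:n_top]


# Toy expression values per gene across 5 samples (invented for illustration).
expr_by_gene = {
    "g1": [1, 1, 1, 1, 1],
    "g2": [0, 2, 4, 6, 8],
    "g3": [1, 2, 3, 2, 1],
}
top = top_iqr_genes(expr_by_gene, 2)
```

In the analysis described above, n_top would be 400 and the input would hold all 19,892 genes.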
Prediction scores from gene expression data • Feature selection: Signal-to-Noise (S2N) filter used to select the 50 top-ranked genes out of 19,892 genes on the array. The S2N statistic for gene j is computed from the means (µ) and standard deviations (σ) of its expression levels in the case and control groups. • Kernel-based K-nearest neighbor (KNN) algorithm used to generate prediction scores in 10-part CV; it classifies an observation based on its k nearest neighbors, weighted by a distance measure. [Hechenbichler and Schliep, 2004]
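The S2N equation itself did not survive in this transcript. A common form of the statistic is the difference in group means divided by the sum of group standard deviations; the sketch below assumes that form and ranks genes by its absolute value, which may differ from what the authors used. Gene names and values are invented.

```python
import statistics


def signal_to_noise(case_values, control_values):
    """(mu_case - mu_control) / (sigma_case + sigma_control); assumed form."""
    mu_case = statistics.mean(case_values)
    mu_control = statistics.mean(control_values)
    sd_case = statistics.stdev(case_values)
    sd_control = statistics.stdev(control_values)
    # Undefined when both groups have zero spread; not handled in this sketch.
    return (mu_case - mu_control) / (sd_case + sd_control)


def select_by_s2n(expr_by_gene, labels, n_top):
    """Rank genes by |S2N| and keep the n_top best."""
    scores = {}
    for gene, values in expr_by_gene.items():
        cases = [v for v, y in zip(values, labels) if y == 1]
        controls = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = signal_to_noise(cases, controls)
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)[:n_top]


labels = [1, 1, 1, 0, 0, 0]
expr_by_gene = {
    "geneA": [5, 6, 7, 1, 2, 3],
    "geneB": [3, 4, 5, 3, 4, 5],
    "geneC": [1, 2, 3, 3, 4, 5],
}
selected = select_by_s2n(expr_by_gene, labels, 2)
```

To keep the pre-validated scores honest, the filter would be applied inside each CV step, using only the 9 training groups.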
Feature selection for haplotype/SNP data • On each 9-part training dataset, logistic regressions were fitted gene by gene. • The j genes that yielded the highest value of a given selection criterion (AUC of ROC, accuracy, correlation measures) between fitted values and observed labels were selected. • Various values of j were tried in order to find the optimal value.
Prediction scores from haplotype/SNP data • After feature selection, all j genes combined in single logistic regression model. • Training data used to estimate parameters for logistic regression. • Trained logistic regression model was used to make predictions on test data.
Integration of clinical data and prediction scores from SNP and gene expression data • Final logistic regression model with 3 clinical variables (muscle pain, tender lymph nodes, sleep problems) and 2 prediction scores obtained in 10-part CV (logistic regression on SNP data, KNN on gene expression data). • Final model requires no feature selection as there are only 5 covariates in the model. • Model performance evaluated with AUC of ROC curve.
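The AUC used to evaluate the model equals the probability that a randomly chosen case receives a higher score than a randomly chosen control. The sketch below computes it directly from that definition; the function name and the toy scores are illustrative only.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy predicted probabilities for 3 cases and 3 controls (invented).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
value = auc(scores, labels)  # 8 of the 9 case-control pairs are ordered correctly
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of cases from controls.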
Exploratory Principal Component Analysis on Gene Expression Data Fig.1: PCA shows no clear pattern among 400 genes of gene expression data with highest inter-quartile range. (Red dots: controls, Blue dots: cases)
Exploratory Sammon’s non-linear mapping Fig.2: Sammon’s non-linear mapping shows no clear clusters among 400 genes of gene expression data with highest inter-quartile range. (1: cases, 0: controls)
Exploratory analysis with 3 clinical variables Logistic regression with all 3 clinical variables on 164 subjects.
Exploratory analysis with 10 genes in SNP dataset Gene-by-Gene logistic regression models on 164 subjects.
Summary of Similarity Measures for Genes in SNP Data during Feature Selection Similarity measures for 5 genes with highest AUC (mean across 10 training sets)
Number of Genes To Fit with SNP data via 10-part CV AUC values for prediction of CFS using SNP data with different gene selection criteria
Data Integration and Assessment of Logistic Regression Modeling via 10-part CV
Model Check • Inclusion of interaction effects: little improvement in AUC of ROC. • Model diagnostics with deviance residual plots show few observations with marked deviations.
Exploratory Analysis with Alternative Models: Tree and Random Forest • Tree model fitted with all SNP data to identify the structure of genes that best separates cases from controls. • Random forest model can handle a large number of covariates, so it is fitted on clinical variables, SNP genes, and expression score data in one step.
Results of Tree and Random Forest Modeling • Tree model on all SNP data shows that NR3C1 is at the top level and CRHR2 and SLC6A4 are at the second level of the tree. • Random forest model on clinical, SNP, and Expression data with 10-part CV: AUC = 0.620, Accuracy = 56.7%
Discussion • Integration of SNP scores with clinical data delivers small improvement in predictive power. • Potential genes: NR3C1, CRHR2 (CRHR1 and SLC6A4 possibilities as well) • Our method did not yield improvement from integration of gene expression microarray data. • Our proposed model yields better prediction than random forest model. • Our model has the advantage that one can use technology-specific method to identify predictors for each dataset and then combine afterward.
Discussion (cont.) • Uncertainty in case definition may explain the suboptimal performance of gene expression scores. Potential methods for improvement: • Integrate only expression data for genes also containing SNP markers in order to minimize noise introduced by other genes. • Use a method that assigns weights to classification rules so that gene expression information can be optimally combined.
Acknowledgement This research was supported by funding from Ontario Genomics Institute, Genome Canada, and CIHR grant NPG-64872.
References • Reeves, W.C., Lloyd, A., Vernon, S.D., Klimas, N., Jason, L.A., Bleijenberg, G., Evengard, B., White, P.D., Nisenbaum, R., Unger, E.R., and the International Chronic Fatigue Syndrome Study Group. Identification of ambiguities in the 1994 chronic fatigue syndrome research case definition and recommendations for resolution. BMC Health Services Research. 3:25, 2003. • Reeves, W.C., Wagner, D., Nisenbaum, R., Jones, J.F., Gurbaxani, B., Solomon, L., Papanicolaou, D.A., Unger, E.R., Vernon, S.D., and Heim, C. Chronic fatigue syndrome – a clinically empirical approach to its definition and study. BMC Medicine. 3:19, 2005. • Detours, V., Dumont, J.E., Bersini, H., and Maenhaut, C. Integration and cross-validation of high-throughput gene expression data: comparing heterogeneous data sets. FEBS Letters. 546:98-102, 2003. • Solomon, L., and Reeves, W.C. Factors influencing the diagnosis of chronic fatigue syndrome. Arch Intern Med. 164:2241-2245, 2004.
References (cont.) • Unger, E.R., Nisenbaum, R., Moldofsky, H., Cesta, A., Sammut, C., Reyes, M., and Reeves, W.C. Sleep assessment in a population-based study of chronic fatigue syndrome. BMC Neurology. 4:6, 2004. • Hattori, E., Liu, C., Zhu, H., and Gershon, E.S. Genetic tests of biologic systems in affective disorders. Molecular Psychiatry. 10:719-740, 2005. • Stephens, M., Smith, N.J., and Donnelly, P. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68:978-989, 2001. • Stephens, M., and Donnelly, P. A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am. J. Hum. Genet. 73:1162-1169, 2003.
References (cont.) • Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U., and Speed, T.P. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 4:249-264, 2003. • Tibshirani, R.J., and Efron, B. Pre-validation and inference in microarrays. Statistical Applications in Genetics and Molecular Biology. Vol. 1, Iss. 1, Article 1, 2002. • The SAS System for Windows. Cary, NC, USA. The SAS Institute, 2002. • Yeung, K.Y., and Ruzzo, W.L. Principal component analysis for clustering gene expression data. Bioinformatics. 17(9):763-774, 2001. • Ewing, R.M., and Cherry, J.M. Visualization of expression clusters using Sammon’s non-linear mapping. Bioinformatics. 17(7):658-659, 2001.
References (cont.) • Hu, P. et al. Serum diagnosis of chronic fatigue syndrome using array-based proteomics. Abstract for CAMDA 2006. • Swets, J.A. Measuring the accuracy of diagnostic systems. Science. 240:1285-1293, 1988. • Hechenbichler, K., and Schliep, K.P. Weighted k-Nearest-Neighbor Techniques and Ordinal Classification. Discussion Paper 399, SFB 386, Ludwig-Maximilians University Munich (http://www.stat.uni-muenchen.de/sfb386/papers/dsp/paper399.ps), 2004. • Wackerly, D.D., Mendenhall, W., and Scheaffer, R.L. Mathematical statistics with applications. 5th ed. Duxbury Press, 1996. • Long, P.M., and Vega, V.B. Boosting and microarray data. Machine Learning. 52(1):31-44, 2003.