A Multivariate Biomarker for Parkinson’s Disease
M. Coakley, G. Crocetti, P. Dressner, W. Kellum, T. Lamin
The Michael L. Gargano 12th Annual Research Day, Friday, May 2nd, 2014
Introduction
• Genomic analysis for the selection of genes associated with Parkinson’s Disease (PD)
• Adoption of multivariate techniques
• Comparison between several classification algorithms
The Data
• Microarray expression data from Affymetrix
• Expression dataset from NCBI GEO (accession GSE6613), generated using the MAS5 algorithm
• 105 samples, 22,283 measurements of gene expression, from three groups:
  • Parkinson’s disease group (50 patients)
  • Healthy control group (22)
  • Neurodegenerative control group (33)
Data Preparation
• Filtering: we removed noisy probesets (measurements) using “Filtering by Present calls” with a threshold of 25%, keeping only the genes expressed in at least 25% of the samples.
• After filtering, the number of probesets dropped from 22,283 to 8,100.
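
The filtering step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each probeset carries one MAS5 detection call per sample ('P' present, 'M' marginal, 'A' absent) and that the 25% threshold is inclusive. The function name `filter_by_present_calls` and the toy data are hypothetical.

```python
def filter_by_present_calls(calls, min_fraction=0.25):
    """Return the indices of probesets called Present in enough samples.

    calls: one row per probeset, one detection call per sample
           ('P' = present, 'M' = marginal, 'A' = absent).
    min_fraction: assumed inclusive threshold (0.25 mirrors the slide).
    """
    kept = []
    for index, row in enumerate(calls):
        present_fraction = sum(1 for call in row if call == "P") / len(row)
        if present_fraction >= min_fraction:
            kept.append(index)
    return kept


# Toy example: 3 probesets measured on 4 samples.
toy_calls = [
    ["P", "A", "A", "A"],  # present in 25% of the samples: kept
    ["A", "A", "M", "A"],  # never called present: dropped
    ["P", "P", "P", "A"],  # present in 75% of the samples: kept
]
kept_probesets = filter_by_present_calls(toy_calls)
print(kept_probesets)
```

Marginal calls are counted as not present here; counting them as present would be an equally defensible choice.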
Normality Assumption & Normalization
• The data showed strong right skewness
• We applied a logarithmic-scale transformation
• We normalized the data using z-scores, for outlier detection (z-score > 5) and algorithmic optimization
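
A minimal sketch of the transformation and normalization for a single probeset, using only the Python standard library. The slide does not state the logarithm base or whether the outlier rule is two-sided, so base 2 and an absolute-value test are assumptions here, and all function names are hypothetical.

```python
import math
import statistics


def log2_transform(values):
    """Log-transform strictly positive expression values (base 2 is assumed)."""
    return [math.log2(value) for value in values]


def z_scores(values):
    """Standardize values to zero mean and unit sample standard deviation."""
    mean = statistics.fmean(values)
    spread = statistics.stdev(values)
    return [(value - mean) / spread for value in values]


def flag_outliers(values, threshold=5.0):
    """Return the indices whose absolute z-score exceeds the threshold."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > threshold]


# Toy probeset: 49 similar samples and one extreme measurement.
raw_values = [100.0] * 49 + [100000.0]
log_values = log2_transform(raw_values)
outlier_indices = flag_outliers(log_values)
print(outlier_indices)
```

With a threshold of 5, a sample can only be flagged when there are enough samples: the largest possible z-score in a group of n values is (n - 1) / sqrt(n), which is why the toy example uses 50 values.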
Univariate Analysis
• Identify which (single) genes are associated with PD.
• This amounts to running 8,100 hypothesis tests, one per probeset:
  • $H_0: \mu_A = \mu_B$ against the alternative $H_1: \mu_A > \mu_B$
  • For each test we use the two-sample t-statistic $t$, with critical region $t \geq z_\alpha$
• Since we have 3 classes, a gene is selected if it is either:
  • up-regulated in PD (Parkinson’s Disease) when compared with the other classes, or
  • up-regulated in the other classes but down-regulated in PD.
• The result of this analysis does not indicate which class contains the up-regulated gene(s), so we need to check.
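
The formula for the t-statistic did not survive on the slide, so the sketch below assumes the unequal-variance (Welch) form, t = (mean_A - mean_B) / sqrt(s_A^2 / n_A + s_B^2 / n_B), and compares it with the standard normal quantile z_alpha as in the stated critical region. The function names and the toy expression values are hypothetical.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Two-sample t-statistic with unpooled variances (assumed form)."""
    mean_a = statistics.fmean(group_a)
    mean_b = statistics.fmean(group_b)
    var_a = statistics.variance(group_a)  # sample variance
    var_b = statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error


def is_up_regulated(group_a, group_b, alpha=0.05):
    """One-sided test: reject H0 when t falls in the region t >= z_alpha."""
    z_alpha = statistics.NormalDist().inv_cdf(1 - alpha)
    return welch_t(group_a, group_b) >= z_alpha


# Toy log-expression values for one gene in two classes.
pd_group = [5.1, 5.3, 5.2, 5.4, 5.0]
control_group = [4.1, 4.0, 4.3, 4.2, 3.9]
t_value = welch_t(pd_group, control_group)
print(round(t_value, 3), is_up_regulated(pd_group, control_group))
```

Because the test is one-sided, swapping the two groups checks for down-regulation in PD, which covers the second selection rule in the list above.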
Upregulated Features
We identified 60 genes out of the 22,283 measured!
Problems of Univariate Analysis in Genomics
In array-based differential expression analysis, the goal is to generate a list of differentially expressed genes that is as meaningful and complete as possible.
Suppose we have 1,000 genes, of which 40 are truly differentially expressed, and we test each with a t-test at a significance level of 0.05. Of the remaining 960 non-differentially expressed genes we can expect 5% errors, or 0.05 × 960 = 48 false positives.
There are more false positives than truly differentially expressed genes: this is called the multiple hypothesis testing problem.
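
The arithmetic of this example can be checked with a few lines of Python. The last step goes slightly beyond the slide and is an assumption: it supposes that all 40 truly differential genes are detected, which gives the most favourable share of false positives among the selected genes.

```python
total_genes = 1000
truly_differential = 40
alpha = 0.05  # significance level of each individual t-test

# Genes with no real difference can still pass the test by chance.
null_genes = total_genes - truly_differential
expected_false_positives = alpha * null_genes

# Best case (assumed): every truly differential gene is detected.
selected = truly_differential + expected_false_positives
false_positive_share = expected_false_positives / selected

print(null_genes, round(expected_false_positives), round(false_positive_share, 2))
```

Even in this best case, more than half of the selected genes would be false positives.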
Univariate vs. Multivariate
• In univariate analysis we consider the effect of each gene, individually, against the target (PD).
• The effect of a disease is rarely the result of a single gene.
• Even if good univariate leads are found (the 60 genes), this rarely turns into the identification of useful pathways.
• We don’t have information on any group of genes that, together, might be involved in the development of PD.
• Multivariate approaches test for groups of variables that, simultaneously, explain a particular output.
• Multivariate theory is much more complex.
Multivariate Mining on Genomics
We are trying to identify a subset of genes, as small as possible, to use as a classification model that differentiates the classes in the original data set.
• Wrapper Subset Evaluator (WSE): an implementation of the forward wrapper method for feature selection, used to build an optimal subset.
• Correlation-based Feature Selection (CFS): evaluates different combinations of features to identify an optimal subset. The candidate subsets are generated using different search techniques; we used Best First and Greedy search in the forward direction.
• Recursive Support Vector Machine (RSVM): a non-probabilistic binary linear classifier, in its recursive version.
Whichever algorithm is selected, it must use multivariate hypothesis testing.
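
The forward search shared by the wrapper and the greedy CFS variants can be sketched generically. This is an illustration under stated assumptions, not the implementation used in the study: `forward_select` and `toy_score` are hypothetical names, and the scoring function is a stand-in for whatever the method evaluates (cross-validated classifier accuracy for a wrapper, a correlation-based merit for CFS).

```python
def forward_select(n_features, evaluate, max_features=None):
    """Greedy forward selection over feature indices 0..n_features-1.

    evaluate: callable that scores a list of feature indices (higher is better).
    The search stops when no single added feature improves the score.
    """
    selected = []
    best_score = float("-inf")
    remaining = set(range(n_features))
    while remaining and (max_features is None or len(selected) < max_features):
        candidate, candidate_score = None, best_score
        # Try extending the current subset by each remaining feature.
        for feature in sorted(remaining):
            score = evaluate(selected + [feature])
            if score > candidate_score:
                candidate, candidate_score = feature, score
        if candidate is None:
            break  # no extension beats the current subset
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score


# Hypothetical score: reward two informative genes, penalize subset size.
INFORMATIVE = {2, 5}


def toy_score(subset):
    return sum(1 for feature in subset if feature in INFORMATIVE) - 0.1 * len(subset)


chosen, chosen_score = forward_select(6, toy_score)
print(chosen)
```

The size penalty in the toy score plays the role of the "as small as possible" requirement: a feature is only added when it improves the score enough to justify a larger subset.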
Multivariate Analysis: Evaluating Several Classification Models
We used 10-fold cross-validation during the feature selection process. In k-fold cross-validation the original data set is split into k sub-partitions of equal size. Out of the k sets, one is retained as a validation set for testing the model, and the remaining k-1 are used as training data. The cross-validation is repeated k times, so that each set serves once as the validation set, and the results are averaged.
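
A minimal sketch of the k-fold procedure described above, using only the standard library. The function names, the fixed shuffle seed, and the `train_and_score` callback are hypothetical; the callback stands in for training a classifier on the training indices and scoring it on the validation indices.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and split them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # A stride of k gives folds whose sizes differ by at most one.
    return [indices[start::k] for start in range(k)]


def cross_validate(n_samples, train_and_score, k=10, seed=0):
    """Average the score over k rounds, holding out one fold per round."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for held_out in range(k):
        validation = folds[held_out]
        training = [
            index
            for position, fold in enumerate(folds)
            if position != held_out
            for index in fold
        ]
        scores.append(train_and_score(training, validation))
    return sum(scores) / k


folds = k_fold_indices(105)
print(sorted(len(fold) for fold in folds))
```

With 105 samples and k = 10 the folds cannot be exactly equal, so five folds hold 11 samples and five hold 10.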
Multivariate Analysis: Results - WSE
The Kappa statistic measures the rate of agreement between the predicted and the actual classes, corrected for the agreement expected by chance.
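
Assuming the statistic reported here is Cohen's kappa, it can be computed as (observed agreement - chance agreement) / (1 - chance agreement). The sketch below uses hypothetical labels and predictions, not results from the study.

```python
from collections import Counter


def cohen_kappa(actual, predicted):
    """Cohen's kappa between actual and predicted class labels.

    Undefined (division by zero) when chance agreement equals 1,
    for example when both sequences contain a single class.
    """
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Chance agreement: product of the marginal frequencies, summed over classes.
    expected = sum(
        actual_counts[label] * predicted_counts[label] for label in actual_counts
    ) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical example: 10 samples, 8 classified correctly.
actual_labels = ["PD"] * 5 + ["control"] * 5
predicted_labels = ["PD"] * 4 + ["control"] * 5 + ["PD"]
kappa = cohen_kappa(actual_labels, predicted_labels)
print(round(kappa, 2))
```

A kappa of 1 means perfect agreement and a kappa of 0 means agreement no better than chance, which makes it more informative than raw accuracy when the classes are unbalanced.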
Multivariate Analysis: Results - CFS
This looks like a good starting point. Further investigation is warranted to understand the relationships among the 20 selected genes.
Conclusions
Multivariate models are a necessary tool in genomic studies. Among the algorithms tested in this study, RSVM clearly came out as an effective model to adopt in biomarker discovery, with the important ability to discriminate successfully between PD and other neurodegenerative diseases. This research cannot stop here: the natural next step is to look for the biological interpretation of this result.