An Evaluation of Gene Selection Methods for Multi-class Microarray Data Classification by Carlotta Domeniconi and Hong Chai
Outline • Introduction to microarray data • Problem description • Related work • Our methods • Experimental Analysis • Results • Conclusion and future work
Microarray • Measures gene expression levels across different conditions, time points, or tissue samples • Gene expression levels inform cell activity and disease status • Microarray data can distinguish between tumor types, define new subtypes, predict prognostic outcomes, identify candidate drugs, assess drug toxicity, etc.
Microarray Data • A matrix of measurements: rows are gene expression levels; columns are samples/conditions.
Microarray data analysis • Clustering is applied to genes, to identify genes with similar functions or that participate in similar biological processes, or to samples, to find potential tumor subclasses. • Classification builds a model to predict the class of diseased samples, which has diagnostic value.
Classification Problem • Large number of genes (features): a dataset may contain up to 20,000 features. • Small number of experiments (samples): at most a few hundred, and usually fewer than 100 samples. • The need to identify “marker genes” that classify tissue types, e.g. to diagnose cancer - feature selection
Our Focus • Binary classification and feature selection methods have been studied extensively; the multi-class case has received little attention. • In practice, many microarray datasets have more than two categories of samples. • We focus on multi-class gene ranking and selection.
Related Work Some criteria used in feature ranking • Correlation coefficient • Information gain • Chi-squared • SVM-RFE
Notation • Given C classes • m observations (samples or patients) • n feature measurements (gene expressions) • class labels y= 1,...,C
Correlation Coefficient • Two-class problem: y ∈ {-1, +1} • Ranking criterion defined in Golub: wj = (μj+ − μj−) / (σj+ + σj−) • where μj+, μj− and σj+, σj− are the mean and standard deviation along dimension j in the + and − classes; a large |wj| indicates a discriminant feature
Fisher’s Score • Fisher’s criterion score in Pavlidis: F(j) = (μj+ − μj−)² / ((σj+)² + (σj−)²)
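As an illustration, here is a minimal Python/NumPy sketch (not the authors' code) of these two two-class scores; X is assumed to be an m × n expression matrix and y a vector of labels in {-1, +1}.

```python
# Minimal sketch of the two-class Golub signal-to-noise ratio and Fisher score.
# Assumes X is an (m samples x n genes) array and y holds labels in {-1, +1}.
import numpy as np

def golub_score(X, y):
    """w_j = (mu_j+ - mu_j-) / (sigma_j+ + sigma_j-) for each gene j."""
    pos, neg = X[y == 1], X[y == -1]
    return (pos.mean(axis=0) - neg.mean(axis=0)) / (pos.std(axis=0) + neg.std(axis=0))

def fisher_score(X, y):
    """F(j) = (mu_j+ - mu_j-)^2 / (sigma_j+^2 + sigma_j-^2) for each gene j."""
    pos, neg = X[y == 1], X[y == -1]
    return (pos.mean(axis=0) - neg.mean(axis=0)) ** 2 / (pos.var(axis=0) + neg.var(axis=0))

# Rank genes: a large |w_j| (or large F(j)) indicates a discriminant gene, e.g.
# top50 = np.argsort(-np.abs(golub_score(X, y)))[:50]
```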
Assumption of above methods • Features are analyzed in isolation, without considering correlations between them. • Assumption: features are independent of each other. • Implication: redundant genes may be selected into the top subset.
Information Gain • A measure of the effectiveness of a feature in classifying the training data. • Expected reduction in entropy caused by partitioning the data according to this feature: Gain(S, A) = E(S) − Σv∈V(A) (|Sv| / |S|) E(Sv) • where V(A) is the set of all possible values of feature A, and Sv is the subset of S for which feature A has value v
Information Gain • E(S) is the entropy of the entire set S: E(S) = − Σi (|Ci| / |S|) log2(|Ci| / |S|) • where |Ci| is the number of training samples in class Ci, and |S| is the cardinality of the entire set S.
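A minimal sketch of this computation for a single discretized gene; the names `feature` (each sample's interval index for that gene) and `y` (class labels) are illustrative, not from the paper.

```python
# Minimal sketch of information gain for one discretized gene.
import numpy as np

def entropy(labels):
    """E(S) = -sum_i (|C_i|/|S|) * log2(|C_i|/|S|)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, y):
    """Gain(S, A) = E(S) - sum_{v in V(A)} (|S_v|/|S|) * E(S_v)."""
    gain = entropy(y)
    for v in np.unique(feature):
        mask = feature == v
        gain -= mask.mean() * entropy(y[mask])
    return gain
```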
Chi-squared • Measures features individually • Continuous-valued features are discretized into intervals • Form a matrix A, where Aij is the number of samples of class Ci within the j-th interval. • Let CIj be the number of samples in the j-th interval
Chi-squared • The expected frequency of Aij is Eij = CIj · |Ci| / |S| • The Chi-squared statistic of a feature is defined as χ² = Σi=1..C Σj=1..I (Aij − Eij)² / Eij • where I is the number of intervals. The larger the statistic, the more informative the feature is.
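A minimal sketch of the chi-squared score for one discretized gene; as above, `feature` and `y` are illustrative names for the per-sample interval indices and class labels.

```python
# Minimal sketch of the chi-squared statistic for one discretized gene.
import numpy as np

def chi_squared(feature, y):
    classes, intervals = np.unique(y), np.unique(feature)
    # A[i, j]: number of class-C_i samples falling in the j-th interval.
    A = np.array([[np.sum((y == c) & (feature == v)) for v in intervals]
                  for c in classes])
    CI = A.sum(axis=0)                     # samples per interval
    Ci = A.sum(axis=1)                     # samples per class
    E = np.outer(Ci, CI) / A.sum()         # expected frequencies E_ij
    return np.sum((A - E) ** 2 / E)        # larger = more informative
```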
SVM-RFE • Recursive Feature Elimination using SVM • In the linear SVM model on the full feature set, sign(w · x + b), w is a vector of feature weights, x is an input instance, and b is a threshold. If wi = 0, feature Xi does not influence classification and can be eliminated from the set of features.
SVM-RFE 1. Train a linear SVM on the full feature set and obtain w. 2. Sort the features in descending order of their weights; a percentage of the lowest-ranked features is eliminated. 3. A new linear SVM is built using the remaining features. Repeat the process. 4. The best feature subset is chosen.
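A minimal sketch of this loop using scikit-learn's RFE wrapper around a linear SVM (an assumed setup, not the authors' implementation); `X_scaled` and `y` stand for the scaled expression matrix and class labels.

```python
# Minimal sketch of SVM-RFE: repeatedly refit a linear SVM and drop a fraction
# of the lowest-weight genes. Uses scikit-learn as an assumed toolkit.
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

svm = SVC(kernel="linear", C=1.0)
rfe = RFE(estimator=svm, n_features_to_select=100, step=0.1)  # drop 10% per round
rfe.fit(X_scaled, y)                     # X_scaled: m x n expression matrix
selected_genes = rfe.get_support(indices=True)
```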
Other criteria • The Brown-Forsythe, Cochran, and Welch test statistics used in Chen et al. (extensions of the t-statistic used in the two-class classification problem). • PCA (disadvantage: new dimensions are formed and none of the original features can be discarded, so marker genes cannot be identified).
Our Ranking Methods • BScatter • MinMax • bSum • bMax • bMin • Combined
Notation • For each class i and each feature j, we define the mean value of feature j for class Ci: μj,i = (1 / |Ci|) Σx∈Ci xj • Define the total mean along feature j: μj = (1 / m) Σk=1..m xk,j
Notation • Define between-class scatter along feature j
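A small NumPy sketch of this notation; since the slide's formula is not reproduced, the class-size weighting used for the between-class scatter below is an assumption.

```python
# Sketch of the per-gene notation: class means, total mean, between-class scatter.
# The class-size weighting of the scatter is an ASSUMPTION (the slide's formula
# is not shown). X: (m samples x n genes) array, y: class labels 1..C.
import numpy as np

def per_gene_stats(X, y):
    classes = np.unique(y)
    mu = np.array([X[y == c].mean(axis=0) for c in classes])   # mu[i, j] = mu_{j,i}
    n_i = np.array([np.sum(y == c) for c in classes])          # class sizes |C_i|
    total_mean = X.mean(axis=0)                                # mu_j over all samples
    between = (n_i[:, None] * (mu - total_mean) ** 2).sum(axis=0)
    return mu, total_mean, between
```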
Function 1: BScatter • Fisher discriminant analysis for multiple classes, under the feature independence assumption. It credits the largest score to the feature that maximizes the ratio of the between-class scatter to the within-class scatter • where σj,i is the standard deviation of class i along feature j
Function 2: MinMax • Favors features along which the largest difference between class means is large and the within-class variance is small.
Function 3: bSum • For each feature j, we sort the C values μj,i in non-decreasing order: μj,(1) ≤ μj,(2) ≤ … ≤ μj,(C) • Define bj,l = μj,(l+1) − μj,(l), for l = 1, …, C−1 • bSum rewards the features with large distances between adjacent mean class values.
Function 4: bMax • Rewards features j whose largest between-neighbor-class mean difference is large
Function 5: bMin • Favors the features whose smallest between-neighbor-class mean difference is large
Function 6: Comb • Considers a score function which combines MinMax and bMin
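The six score functions are described on these slides only in words, so the following NumPy sketch shows one way they could be realized; the within-class normalizers for BScatter and MinMax and the unweighted sum used for Comb are assumptions, not the paper's definitions.

```python
# Rough sketch of the six proposed score functions, built from their verbal
# descriptions. The normalizers and the Comb combination are ASSUMPTIONS.
# X: (m samples x n genes) array, y: class labels.
import numpy as np

def multiclass_scores(X, y):
    classes = np.unique(y)
    mu = np.array([X[y == c].mean(axis=0) for c in classes])   # class means (C, n)
    sd = np.array([X[y == c].std(axis=0) for c in classes])    # class std devs
    n_i = np.array([np.sum(y == c) for c in classes])          # class sizes
    total_mean = X.mean(axis=0)

    between = (n_i[:, None] * (mu - total_mean) ** 2).sum(axis=0)
    within = (sd ** 2).sum(axis=0) + 1e-12       # assumed within-class term

    bscatter = between / within                  # between- to within-class ratio
    minmax = (mu.max(axis=0) - mu.min(axis=0)) / within   # large spread, small variance

    gaps = np.diff(np.sort(mu, axis=0), axis=0)  # b_{j,l}: adjacent sorted-mean gaps
    bsum = gaps.sum(axis=0)                      # reward large total gap
    bmax = gaps.max(axis=0)                      # reward a large largest gap
    bmin = gaps.min(axis=0)                      # reward a large smallest gap

    comb = minmax + bmin                         # assumed unweighted combination
    return dict(BScatter=bscatter, MinMax=minmax, bSum=bsum,
                bMax=bmax, bMin=bmin, Comb=comb)
```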
Experiment Design • Gene expression values scaled to [-1, 1] • Compared 9 feature selection methods (the 6 proposed scores, Chi-squared, Information Gain, and SVM-RFE) • Subsets of top-ranked genes used to train SVM classifiers (3 kernel functions: linear, 2-degree polynomial, Gaussian; soft-margin parameter in [1, 100]; Gaussian kernel width in [0.001, 2]) • Leave-one-out cross-validation due to the small sample size • One-vs-one multi-class classification implemented with LIBSVM
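A minimal sketch of this pipeline, assuming scikit-learn's SVC (which wraps LIBSVM and uses one-vs-one for multi-class) in place of the authors' exact setup; `rank_genes` stands for any of the ranking functions sketched above.

```python
# Minimal sketch of the evaluation pipeline: scale to [-1, 1], select the
# top-ranked genes, and estimate leave-one-out accuracy with an SVM.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

def loo_accuracy(X, y, rank_genes, n_genes=100, kernel="linear", C=1.0, gamma=0.01):
    X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)   # scale each gene
    # Genes ranked over the whole set, as on these slides (this induces the
    # selection bias noted under "Limitations & Future Work").
    top = np.argsort(-rank_genes(X, y))[:n_genes]
    clf = SVC(kernel=kernel, C=C, gamma=gamma)   # one-vs-one multi-class by default
    return cross_val_score(clf, X[:, top], y, cv=LeaveOneOut()).mean()
```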
Conclusions • SVM classification benefits from gene selection; • Gene ranking with correlation coefficients gives higher accuracy than SVM-RFE in low dimensions on most data sets. The best-performing correlation score varies from problem to problem; • Although SVM-RFE shows excellent performance in general, there is no clear winner. The performance of feature selection methods seems to be problem-dependent;
Conclusions • For a given classification model, different gene selection methods reach their best performance for different feature set sizes; • Very high accuracy was achieved on all the data sets studied here. In many cases perfect accuracy (based on leave-one-out error) was achieved; • The NCI60 dataset [17] shows lower accuracy values. This dataset has the largest number of classes (eight) and smaller sample sizes per class. SVM-RFE handles this case well, achieving 96.72% accuracy with 100 selected genes and a linear kernel. The gap in accuracy between SVM-RFE and the other gene ranking methods is highest for this dataset (ca. 11.5%).
Limitations & Future Work • The selection of features over the whole training set induces a bias in the results. We will study how to assess and correct this bias in future experiments. • We will take into consideration the correlation between any pair of selected features. The ranking methods will be modified so that correlations between selected features stay below a certain threshold. • We will evaluate the top-ranked genes from our research against marker genes identified in other studies.