
An Evaluation of Gene Selection Methods for Multi-class Microarray Data Classification


Presentation Transcript


  1. An Evaluation of Gene Selection Methods for Multi-class Microarray Data Classification by Carlotta Domeniconi and Hong Chai

  2. Outline • Introduction to microarray data • Problem description • Related work • Our methods • Experimental Analysis • Result • Conclusion and future work

  3. Microarray • Measures gene expression levels across different conditions, times or tissue samples • Gene expression levels inform cell activity and disease status • Microarray data distinguish between tumor types, define new subtypes, predict prognostic outcome, identify possible drugs, assess drug toxicity, etc.

  4. Microarray Data • A matrix of measurements: rows are gene expression levels; columns are samples/conditions.

  5. Example – Lymphoma Dataset

  6. Microarray data analysis • Clustering is applied to genes, to identify genes with similar functions or that participate in similar biological processes, or to samples, to find potential tumor subclasses. • Classification builds a model to predict diseased samples, and thus has diagnostic value.

  7. Classification Problem • Large number of genes (features) – may contain up to 20,000 features. • Small number of experiments (samples) – usually fewer than 100. • The need to identify “marker genes” to classify tissue types, e.g. to diagnose cancer – feature selection

  8. Our Focus • Binary classification and feature selection methods have been extensively studied; the multi-class case has received little attention. • In practice, many microarray datasets have more than two categories of samples. • We focus on multi-class gene ranking and selection.

  9. Related Work • Some criteria used in feature ranking: • Correlation coefficient • Information gain • Chi-squared • SVM-RFE

  10. Notation • Given C classes • m observations (samples or patients) • n feature measurements (gene expressions) • class labels y= 1,...,C

  11. Correlation Coefficient • Two-class problem: y ∈ {−1, +1} • Ranking criterion defined in Golub: wj = (μj+ − μj−) / (σj+ + σj−) • where μj± and σj± are the mean and standard deviation along dimension j in the + and − classes; a large |wj| indicates a discriminant feature
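As a concrete illustration, the Golub signal-to-noise score can be computed per feature in a few lines. This is a minimal sketch; the function name and toy data are ours, not from the paper:

```python
import numpy as np

def golub_score(X, y):
    """Golub signal-to-noise ratio for a two-class problem.

    X: (m, n) array of m samples by n gene-expression features.
    y: length-m array of labels in {-1, +1}.
    Returns |w_j| per feature; larger means more discriminant.
    """
    pos, neg = X[y == 1], X[y == -1]
    mu_p, mu_n = pos.mean(axis=0), neg.mean(axis=0)
    sd_p, sd_n = pos.std(axis=0), neg.std(axis=0)
    return np.abs((mu_p - mu_n) / (sd_p + sd_n))

# toy data: feature 0 separates the classes, feature 1 does not
X = np.array([[1.0, 5.0], [1.2, 5.2], [3.0, 5.1], [3.2, 4.9]])
y = np.array([1, 1, -1, -1])
scores = golub_score(X, y)
```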

  12. Fisher’s score • Fisher’s criterion score in Pavlidis: F(j) = (μj+ − μj−)² / ((σj+)² + (σj−)²)
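Fisher's criterion differs from the Golub score only in squaring the mean difference and summing class variances rather than standard deviations. A minimal sketch on toy data (names and data are ours):

```python
import numpy as np

def fisher_score(X, y):
    """Fisher criterion per feature: (mu+ - mu-)^2 / (var+ + var-)."""
    pos, neg = X[y == 1], X[y == -1]
    num = (pos.mean(axis=0) - neg.mean(axis=0)) ** 2
    den = pos.var(axis=0) + neg.var(axis=0)
    return num / den

# same toy data as before: feature 0 is discriminant, feature 1 is not
X = np.array([[1.0, 5.0], [1.2, 5.2], [3.0, 5.1], [3.2, 4.9]])
y = np.array([1, 1, -1, -1])
f = fisher_score(X, y)
```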

  13. Assumption of the above methods • Features are analyzed in isolation; correlations are not considered. • Assumption: features are independent of each other. • Implication: redundant genes may be selected into the top subset.

  14. Information Gain • A measure of the effectiveness of a feature in classifying the training data. • Expected reduction in entropy caused by partitioning the data according to this feature: Gain(S, A) = E(S) − Σv∈V(A) (|Sv| / |S|) E(Sv) • where V(A) is the set of all possible values of feature A, and Sv is the subset of S for which feature A has value v

  15. Information Gain • E(S) is the entropy of the entire set S: E(S) = − Σi (|Ci| / |S|) log2(|Ci| / |S|) • where |Ci| is the number of training data in class Ci, and |S| is the cardinality of the entire set S.
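The two definitions above combine into a short function. A minimal sketch for a discrete-valued feature (for microarray data the continuous expression values would first be discretized, as in the Chi-squared method below):

```python
import math
from collections import Counter

def entropy(labels):
    """E(S) = -sum over classes of p_i * log2(p_i)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(values, labels):
    """Expected entropy reduction from partitioning by a discrete feature."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# toy example: a feature that perfectly splits two classes
vals = ['lo', 'lo', 'hi', 'hi']
labs = [0, 0, 1, 1]
g = info_gain(vals, labs)  # E(S) = 1 bit, remainder = 0, so gain = 1
```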

  16. Chi-squared • Measures features individually • Continuous-valued features are discretized into intervals • Form a matrix A, where Aij is the number of samples of class Ci within the j-th interval. • Let CIj be the number of samples in the j-th interval

  17. Chi-squared • The expected frequency of Aij is Eij = CIj · |Ci| / m • The Chi-squared statistic of a feature is defined as χ² = Σi Σj (Aij − Eij)² / Eij • where I is the number of intervals. The larger the statistic, the more informative the feature is.
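Given the class-by-interval count matrix A, the statistic follows directly. A minimal sketch (the toy counts are ours):

```python
import numpy as np

def chi2_feature(A):
    """Chi-squared statistic for one discretized feature.

    A[i, j] = number of class-i samples whose value falls in interval j.
    """
    m = A.sum()
    class_totals = A.sum(axis=1, keepdims=True)     # |Ci|
    interval_totals = A.sum(axis=0, keepdims=True)  # CIj
    E = class_totals * interval_totals / m          # expected frequencies
    return ((A - E) ** 2 / E).sum()

# two classes, two intervals; counts concentrate on the diagonal,
# so observed frequencies deviate strongly from expected ones
A = np.array([[8, 2],
              [2, 8]])
stat = chi2_feature(A)
```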

  18. SVM-RFE • Recursive Feature Elimination using SVM • The linear SVM model on the full feature set is sign(w · x + b), where w is a vector of weights (one per feature), x is an input instance, and b a threshold. If wi = 0, feature Xi does not influence classification and can be eliminated from the feature set.

  19. SVM-RFE 1. Train a linear SVM on the full feature set to obtain w. 2. Sort features in descending order of weight; eliminate the lowest-ranked percentage of features. 3. Build a new linear SVM on the reduced feature set and repeat the process. 4. Choose the best feature subset.
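The four steps above can be sketched as a short loop, assuming scikit-learn's `SVC` (which wraps LIBSVM) is available; the elimination fraction and toy data are our choices, not the paper's:

```python
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, n_keep, drop_frac=0.5):
    """Recursive feature elimination with a linear SVM.

    Repeatedly trains on the surviving features, ranks them by |w|,
    and drops the lowest-weighted fraction until n_keep remain.
    Returns the indices of the surviving features.
    """
    surviving = np.arange(X.shape[1])
    while len(surviving) > n_keep:
        clf = SVC(kernel='linear').fit(X[:, surviving], y)
        w = np.abs(clf.coef_).ravel()          # one weight per feature
        order = np.argsort(w)                  # ascending by weight
        n_drop = min(max(1, int(len(surviving) * drop_frac)),
                     len(surviving) - n_keep)
        surviving = np.sort(surviving[order[n_drop:]])
    return surviving

# toy data: only feature 3 carries the class signal
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
y = np.tile([0, 1], 20)
X[:, 3] = 3.0 * (2 * y - 1) + 0.1 * rng.normal(size=40)
kept = svm_rfe(X, y, n_keep=2)
```

The informative feature survives the elimination rounds because the max-margin separator places nearly all of its weight on it.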

  20. Other criteria • The Brown-Forsythe, the Cochran, and the Welch test statistics used in Chen, et al. (Extensions of the t-statistic used in the two-class classification problem.) • PCA (Disadvantage: new dimensions are formed, and none of the original features can be discarded; therefore it can't identify marker genes.)

  21. Our Ranking Methods • BScatter • MinMax • bSum • bMax • bMin • Combined

  22. Notation • For each class i and each feature j, we define the mean value of feature j for class Ci: μj,i = (1 / |Ci|) Σx∈Ci xj • Define the total mean along feature j: μj = (1 / m) Σx xj

  23. Notation • Define the between-class scatter along feature j, which measures the spread of the class means μj,i around the total mean μj

  24. Function 1: BScatter • Fisher discriminant analysis for multiple classes under the feature-independence assumption. It assigns the largest score to features that maximize the ratio of the between-class scatter to the within-class scatter • where σj,i is the standard deviation of class i along feature j
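A minimal sketch of this ratio, under the stated assumptions: we take the between-class scatter to be the class-size-weighted squared deviation of class means from the total mean, and the within-class scatter to be the summed class standard deviations (the slide's exact weighting was not preserved, so this is one plausible reading, not the paper's definitive formula):

```python
import numpy as np

def bscatter(X, y):
    """Between-class over within-class scatter, per feature (sketch)."""
    mu = X.mean(axis=0)                       # total mean per feature
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - mu) ** 2
        within += Xc.std(axis=0)
    return between / within

# toy data: 3 classes; feature 0 separates them, feature 1 does not
X = np.array([[0.0, 1.0], [0.2, 0.0],
              [5.0, 1.1], [5.2, 0.1],
              [9.9, 0.9], [10.1, 0.2]])
y = np.array([0, 0, 1, 1, 2, 2])
s = bscatter(X, y)
```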

  25. Function 2: MinMax • Favors features along which the farthest class-mean difference is large, and the within-class variance is small.

  26. Function 3: bSum • For each feature j, we sort the C values μj,i in non-decreasing order: μj,1 ≤ μj,2 ≤ … ≤ μj,C • Define bj,l = μj,l+1 − μj,l • bSum rewards the features with large distances between adjacent mean class values

  27. Function 4: bMax • Rewards features j with a large between-neighbor-class mean difference

  28. Function 5: bMin • Favors the features whose smallest between-neighbor-class mean difference is large

  29. Function 6: Comb • Considers a score function that combines MinMax and bMin
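The b-family of scores (bSum, bMax, bMin) all derive from the same sorted-class-mean gaps bj,l defined above. A minimal sketch of the raw gaps and the three aggregations; any within-class normalization the paper may apply on top of the gaps is omitted here, and the toy data is ours:

```python
import numpy as np

def gap_scores(X, y):
    """Adjacent gaps between sorted class means, per feature.

    Returns b of shape (C-1, n): b[l, j] = mu_{j,(l+1)} - mu_{j,(l)}
    after sorting the C class means along feature j.
    """
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])  # (C, n)
    return np.diff(np.sort(means, axis=0), axis=0)

def b_sum(X, y): return gap_scores(X, y).sum(axis=0)
def b_max(X, y): return gap_scores(X, y).max(axis=0)
def b_min(X, y): return gap_scores(X, y).min(axis=0)

# one feature, three classes with means 0.05, 2.05, 7.05
X = np.array([[0.0], [0.1], [2.0], [2.1], [7.0], [7.1]])
y = np.array([0, 0, 1, 1, 2, 2])
# gaps are 2.0 and 5.0, so bSum = 7.0, bMax = 5.0, bMin = 2.0
```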

  30. Datasets

  31. Experiment Design • Gene expression values scaled to [−1, 1] • Compared 9 feature selection methods (the 6 proposed scores, Chi-squared, Information Gain, and SVM-RFE) • Subsets of top-ranked genes used to train SVM classifiers (3 kernel functions: linear, 2-degree polynomial, Gaussian; soft-margin parameter in [1, 100]; Gaussian kernel width in [0.001, 2]) • Leave-one-out cross-validation due to small sample size • One-vs-one multi-class classification implemented with LIBSVM
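The evaluation loop can be sketched as follows, assuming scikit-learn (whose `SVC` wraps LIBSVM and uses one-vs-one for multi-class); the dataset and gene indices below are toy stand-ins. Note that, as in the paper's setup, the gene subset is fixed before cross-validation rather than re-selected inside each fold, which is exactly the selection bias the authors flag in their limitations:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def loo_accuracy(X, y, top_idx, C=1.0):
    """Leave-one-out accuracy of a linear SVM on a chosen gene subset."""
    correct = 0
    for train, test in LeaveOneOut().split(X):
        clf = SVC(kernel='linear', C=C)
        clf.fit(X[np.ix_(train, top_idx)], y[train])
        correct += int(clf.predict(X[np.ix_(test, top_idx)])[0] == y[test[0]])
    return correct / len(y)

# toy data: feature 0 cleanly separates the two classes
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
y = np.array([0, 1] * 10)
X[:, 0] = 3.0 * (2 * y - 1) + 0.1 * rng.normal(size=20)
acc = loo_accuracy(X, y, top_idx=[0])
```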

  32. Result – MLL Dataset

  33. Result – Lymphoma Dataset

  34. Conclusions • SVM classification benefits from gene selection; • Gene ranking with correlation coefficients gives higher accuracy than SVM-RFE in low dimensions on most datasets. The best-performing correlation score varies from problem to problem; • Although SVM-RFE shows excellent performance in general, there is no clear winner. The performance of feature selection methods seems to be problem-dependent;

  35. Conclusions • For a given classification model, different gene selection methods reach the best performance for different feature set sizes; • Very high accuracy was achieved on all the data sets studied here. In many cases perfect accuracy (based on leave-one-out error) was achieved; • The NCI60 dataset [17] shows lower accuracy values. This dataset has the largest number of classes (eight) and smaller sample sizes per class. SVM-RFE handles this case well, achieving 96.72% accuracy with 100 selected genes and a linear kernel. The gap in accuracy between SVM-RFE and the other gene ranking methods is highest for this dataset (ca. 11.5%).

  36. Limitations & Future Work • Selecting features over the whole training set induces a bias in the results; future experiments will assess and correct this bias. • We will take into consideration the correlation between pairs of selected features; the ranking method will be modified so that correlations stay below a certain threshold. • We will evaluate the top-ranked genes in our research against marker genes identified in other studies.
