
An Evaluation of Gene Selection Methods for Multi-class Microarray Data Classification


Presentation Transcript


  1. An Evaluation of Gene Selection Methods for Multi-class Microarray Data Classification by Carlotta Domeniconi and Hong Chai

  2. Outline • Introduction to microarray data • Problem description • Related work • Our methods • Experimental Analysis • Result • Conclusion and future work

  3. Microarray • Measures gene expression levels across different conditions, times or tissue samples • Gene expression levels inform cell activity and disease status • Microarray data distinguish between tumor types, define new subtypes, predict prognostic outcome, identify possible drugs, assess drug toxicity, etc.

  4. Microarray Data • A matrix of measurements: rows are gene expression levels; columns are samples/conditions.

  5. Example – Lymphoma Dataset

  6. Microarray data analysis • Clustering is applied to genes, to identify genes with similar functions or that participate in similar biological processes, or to samples, to find potential tumor subclasses. • Classification builds a model to predict diseased samples, and thus has diagnostic value.

  7. Classification Problem • Large number of genes (features) – may contain up to 20,000 features. • Small number of experiments (samples) – usually fewer than 100. • The need to identify “marker genes” to classify tissue types, e.g. to diagnose cancer – feature selection

  8. Our Focus • Binary classification and feature selection methods have been extensively studied; the multi-class case has received little attention. • In practice, many microarray datasets have more than two categories of samples. • We focus on multi-class gene ranking and selection.

  9. Related Work • Some criteria used in feature ranking: • Correlation coefficient • Information gain • Chi-squared • SVM-RFE

  10. Notation • Given C classes • m observations (samples or patients) • n feature measurements (gene expressions) • class labels y= 1,...,C

  11. Correlation Coefficient • Two-class problem: y ∈ {−1, +1} • Ranking criterion defined in Golub: wj = (μj+ − μj−) / (σj+ + σj−) • where μj± and σj± are the mean and standard deviation along dimension j in the + and − classes; a large |wj| indicates a discriminant feature
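As a concrete illustration, the Golub signal-to-noise score can be computed per feature in a few lines. This is a minimal sketch; the function name and toy data are ours, not from the paper:

```python
import numpy as np

def golub_score(X, y):
    """Golub signal-to-noise ratio for a two-class problem.

    X: (m, n) array of m samples by n gene-expression features.
    y: length-m array of labels in {-1, +1}.
    Returns |w_j| per feature; larger means more discriminant.
    """
    pos, neg = X[y == 1], X[y == -1]
    mu_p, mu_n = pos.mean(axis=0), neg.mean(axis=0)
    sd_p, sd_n = pos.std(axis=0), neg.std(axis=0)
    return np.abs((mu_p - mu_n) / (sd_p + sd_n))

# toy data: feature 0 separates the classes, feature 1 does not
X = np.array([[1.0, 5.0], [1.2, 5.2], [3.0, 5.1], [3.2, 4.9]])
y = np.array([1, 1, -1, -1])
scores = golub_score(X, y)
```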

  12. Fisher’s score • Fisher’s criterion score in Pavlidis: F(j) = (μj+ − μj−)² / ((σj+)² + (σj−)²)
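Fisher's criterion differs from the Golub score only in squaring the mean difference and summing class variances rather than standard deviations. A minimal sketch on toy data (names and data are ours):

```python
import numpy as np

def fisher_score(X, y):
    """Fisher criterion per feature: (mu+ - mu-)^2 / (var+ + var-)."""
    pos, neg = X[y == 1], X[y == -1]
    num = (pos.mean(axis=0) - neg.mean(axis=0)) ** 2
    den = pos.var(axis=0) + neg.var(axis=0)
    return num / den

# same toy data as before: feature 0 is discriminant, feature 1 is not
X = np.array([[1.0, 5.0], [1.2, 5.2], [3.0, 5.1], [3.2, 4.9]])
y = np.array([1, 1, -1, -1])
f = fisher_score(X, y)
```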

  13. Assumption of the above methods • Features are analyzed in isolation; correlations are not considered. • Assumption: features are independent of each other. • Implication: redundant genes may be selected into the top subset.

  14. Information Gain • A measure of the effectiveness of a feature in classifying the training data. • Expected reduction in entropy caused by partitioning the data according to this feature: Gain(S, A) = E(S) − Σv∈V(A) (|Sv| / |S|) E(Sv) • where V(A) is the set of all possible values of feature A, and Sv is the subset of S for which feature A has value v

  15. Information Gain • E(S) is the entropy of the entire set S: E(S) = − Σi (|Ci| / |S|) log2(|Ci| / |S|) • where |Ci| is the number of training data in class Ci, and |S| is the cardinality of the entire set S.
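The two definitions above combine into a short function. A minimal sketch for a discrete-valued feature (for microarray data the continuous expression values would first be discretized, as in the Chi-squared method below):

```python
import math
from collections import Counter

def entropy(labels):
    """E(S) = -sum over classes of p_i * log2(p_i)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(values, labels):
    """Expected entropy reduction from partitioning by a discrete feature."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# toy example: a feature that perfectly splits two classes
vals = ['lo', 'lo', 'hi', 'hi']
labs = [0, 0, 1, 1]
g = info_gain(vals, labs)  # E(S) = 1 bit, remainder = 0, so gain = 1
```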

  16. Chi-squared • Measures features individually • Continuous-valued features are discretized into intervals • Form a matrix A, where Aij is the number of samples of class Ci within the j-th interval. • Let CIj be the number of samples in the j-th interval

  17. Chi-squared • The expected frequency of Aij is Eij = CIj · |Ci| / m • The Chi-squared statistic of a feature is defined as χ² = Σi Σj (Aij − Eij)² / Eij • where I is the number of intervals. The larger the statistic, the more informative the feature is.
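Given the class-by-interval count matrix A, the statistic follows directly. A minimal sketch (the toy counts are ours):

```python
import numpy as np

def chi2_feature(A):
    """Chi-squared statistic for one discretized feature.

    A[i, j] = number of class-i samples whose value falls in interval j.
    """
    m = A.sum()
    class_totals = A.sum(axis=1, keepdims=True)     # |Ci|
    interval_totals = A.sum(axis=0, keepdims=True)  # CIj
    E = class_totals * interval_totals / m          # expected frequencies
    return ((A - E) ** 2 / E).sum()

# two classes, two intervals; counts concentrate on the diagonal,
# so observed frequencies deviate strongly from expected ones
A = np.array([[8, 2],
              [2, 8]])
stat = chi2_feature(A)
```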

  18. SVM-RFE • Recursive Feature Elimination using SVM • The linear SVM model on the full feature set is sign(w · x + b), where w is a vector of weights (one per feature), x is an input instance, and b a threshold. If wi = 0, feature Xi does not influence classification and can be eliminated from the feature set.

  19. SVM-RFE 1. Train a linear SVM on the full feature set to obtain w. 2. Sort features in descending order of weight; eliminate the lowest-ranked percentage of features. 3. Build a new linear SVM on the reduced feature set and repeat the process. 4. Choose the best feature subset.
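The four steps above can be sketched as a short loop, assuming scikit-learn's `SVC` (which wraps LIBSVM) is available; the elimination fraction and toy data are our choices, not the paper's:

```python
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, n_keep, drop_frac=0.5):
    """Recursive feature elimination with a linear SVM.

    Repeatedly trains on the surviving features, ranks them by |w|,
    and drops the lowest-weighted fraction until n_keep remain.
    Returns the indices of the surviving features.
    """
    surviving = np.arange(X.shape[1])
    while len(surviving) > n_keep:
        clf = SVC(kernel='linear').fit(X[:, surviving], y)
        w = np.abs(clf.coef_).ravel()          # one weight per feature
        order = np.argsort(w)                  # ascending by weight
        n_drop = min(max(1, int(len(surviving) * drop_frac)),
                     len(surviving) - n_keep)
        surviving = np.sort(surviving[order[n_drop:]])
    return surviving

# toy data: only feature 3 carries the class signal
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
y = np.tile([0, 1], 20)
X[:, 3] = 3.0 * (2 * y - 1) + 0.1 * rng.normal(size=40)
kept = svm_rfe(X, y, n_keep=2)
```

The informative feature survives the elimination rounds because the max-margin separator places nearly all of its weight on it.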

  20. Other criteria • The Brown-Forsythe, the Cochran, and the Welch test statistics used in Chen, et al. (Extensions of the t-statistic used in the two-class classification problem.) • PCA (Disadvantage: new dimensions are formed, and none of the original features can be discarded; therefore it can't identify marker genes.)

  21. Our Ranking Methods • BScatter • MinMax • bSum • bMax • bMin • Combined

  22. Notation • For each class i and each feature j, we define the mean value of feature j for class Ci: μj,i = (1 / |Ci|) Σx∈Ci xj • Define the total mean along feature j: μj = (1 / m) Σx xj

  23. Notation • Define the between-class scatter along feature j, which measures the spread of the class means μj,i around the total mean μj

  24. Function 1: BScatter • Fisher discriminant analysis for multiple classes under the feature-independence assumption. It assigns the largest score to features that maximize the ratio of the between-class scatter to the within-class scatter • where σj,i is the standard deviation of class i along feature j
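A minimal sketch of this ratio, under the stated assumptions: we take the between-class scatter to be the class-size-weighted squared deviation of class means from the total mean, and the within-class scatter to be the summed class standard deviations (the slide's exact weighting was not preserved, so this is one plausible reading, not the paper's definitive formula):

```python
import numpy as np

def bscatter(X, y):
    """Between-class over within-class scatter, per feature (sketch)."""
    mu = X.mean(axis=0)                       # total mean per feature
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - mu) ** 2
        within += Xc.std(axis=0)
    return between / within

# toy data: 3 classes; feature 0 separates them, feature 1 does not
X = np.array([[0.0, 1.0], [0.2, 0.0],
              [5.0, 1.1], [5.2, 0.1],
              [9.9, 0.9], [10.1, 0.2]])
y = np.array([0, 0, 1, 1, 2, 2])
s = bscatter(X, y)
```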

  25. Function 2: MinMax • Favors features along which the farthest class-mean difference is large, and the within-class variance is small.

  26. Function 3: bSum • For each feature j, we sort the C values μj,i in non-decreasing order: μj,1 ≤ μj,2 ≤ … ≤ μj,C • Define bj,l = μj,l+1 − μj,l • bSum rewards the features with large distances between adjacent mean class values

  27. Function 4: bMax • Rewards features j with a large between-neighbor-class mean difference

  28. Function 5: bMin • Favors the features whose smallest between-neighbor-class mean difference is large

  29. Function 6: Comb • Considers a score function that combines MinMax and bMin
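The b-family of scores (bSum, bMax, bMin) all derive from the same sorted-class-mean gaps bj,l defined above. A minimal sketch of the raw gaps and the three aggregations; any within-class normalization the paper may apply on top of the gaps is omitted here, and the toy data is ours:

```python
import numpy as np

def gap_scores(X, y):
    """Adjacent gaps between sorted class means, per feature.

    Returns b of shape (C-1, n): b[l, j] = mu_{j,(l+1)} - mu_{j,(l)}
    after sorting the C class means along feature j.
    """
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])  # (C, n)
    return np.diff(np.sort(means, axis=0), axis=0)

def b_sum(X, y): return gap_scores(X, y).sum(axis=0)
def b_max(X, y): return gap_scores(X, y).max(axis=0)
def b_min(X, y): return gap_scores(X, y).min(axis=0)

# one feature, three classes with means 0.05, 2.05, 7.05
X = np.array([[0.0], [0.1], [2.0], [2.1], [7.0], [7.1]])
y = np.array([0, 0, 1, 1, 2, 2])
# gaps are 2.0 and 5.0, so bSum = 7.0, bMax = 5.0, bMin = 2.0
```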

  30. Datasets

  31. Experiment Design • Gene expression values scaled to [−1, 1] • Compared 9 feature selection methods (the 6 proposed scores, Chi-squared, Information Gain, and SVM-RFE) • Subsets of top-ranked genes used to train SVM classifiers (3 kernel functions: linear, 2-degree polynomial, Gaussian; soft-margin parameter in [1, 100]; Gaussian kernel width in [0.001, 2]) • Leave-one-out cross-validation due to small sample size • One-vs-one multi-class classification implemented with LIBSVM
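The evaluation loop can be sketched as follows, assuming scikit-learn (whose `SVC` wraps LIBSVM and uses one-vs-one for multi-class); the dataset and gene indices below are toy stand-ins. Note that, as in the paper's setup, the gene subset is fixed before cross-validation rather than re-selected inside each fold, which is exactly the selection bias the authors flag in their limitations:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def loo_accuracy(X, y, top_idx, C=1.0):
    """Leave-one-out accuracy of a linear SVM on a chosen gene subset."""
    correct = 0
    for train, test in LeaveOneOut().split(X):
        clf = SVC(kernel='linear', C=C)
        clf.fit(X[np.ix_(train, top_idx)], y[train])
        correct += int(clf.predict(X[np.ix_(test, top_idx)])[0] == y[test[0]])
    return correct / len(y)

# toy data: feature 0 cleanly separates the two classes
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
y = np.array([0, 1] * 10)
X[:, 0] = 3.0 * (2 * y - 1) + 0.1 * rng.normal(size=20)
acc = loo_accuracy(X, y, top_idx=[0])
```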

  32. Result – MLL Dataset

  33. Result – Lymphoma Dataset

  34. Conclusions • SVM classification benefits from gene selection; • Gene ranking with correlation coefficients gives higher accuracy than SVM-RFE in low dimensions on most datasets. The best-performing correlation score varies from problem to problem; • Although SVM-RFE shows excellent performance in general, there is no clear winner. The performance of feature selection methods seems to be problem-dependent;

  35. Conclusions • For a given classification model, different gene selection methods reach the best performance for different feature set sizes; • Very high accuracy was achieved on all the data sets studied here. In many cases perfect accuracy (based on leave-one-out error) was achieved; • The NCI60 dataset [17] shows lower accuracy values. This dataset has the largest number of classes (eight) and smaller sample sizes per class. SVM-RFE handles this case well, achieving 96.72% accuracy with 100 selected genes and a linear kernel. The gap in accuracy between SVM-RFE and the other gene ranking methods is highest for this dataset (ca. 11.5%).

  36. Limitations & Future Work • Selecting features over the whole training set induces a bias in the results; future experiments will assess and correct this bias. • We will take into consideration the correlation between pairs of selected features; the ranking method will be modified so that correlations stay below a certain threshold. • We will evaluate the top-ranked genes in our research against marker genes identified in other studies.
