A Comprehensive Evaluation of Multicategory Classification Methods for Microarray Gene Expression Cancer Diagnosis Presented by: Renikko Alleyne
Outline • Motivation • Major Concerns • Methods • SVMs • Non-SVMs • Ensemble Classification • Datasets • Experimental Design • Gene Selection • Performance Metrics • Overall Design • Results • Discussion & Limitations • Contributions • Conclusions
GEMS (Gene Expression Model Selector) • Goal: create powerful and reliable cancer diagnostic models from microarray data • Equip GEMS with the best classifier, gene selection, and cross-validation methods • Evaluation of the major algorithms for multicategory classification, gene selection methods, ensemble classifier methods, and 2 cross-validation designs • 11 datasets spanning 74 diagnostic categories, 41 cancer types, and 12 normal tissue types
Major Concerns • Prior studies conducted limited experiments in terms of the number of classifiers, gene selection algorithms, datasets, and cancer types involved. • From them, it cannot be determined which classifier performs best. • It is poorly understood which combinations of classification and gene selection algorithms work best across most array-based cancer datasets. • Overfitting. • Underfitting.
Goals for the Development of an Automated System That Creates High-Quality Diagnostic Models for Clinical Use • Investigate which currently available classifier for gene expression diagnosis performs best across many cancer types • Investigate how classifiers interact with existing gene selection methods in datasets with varying sample sizes, numbers of genes, and cancer types • Determine whether diagnostic performance can be increased further using meta-learning in the form of ensemble classification • Determine how to parameterize the classifiers and gene selection procedures to avoid overfitting
Why use Support Vector Machines (SVMs)? • Achieve superior classification performance compared to other learning algorithms • Fairly insensitive to the curse of dimensionality • Efficient enough to handle very large-scale classification problems in terms of both samples and variables
How SVMs Work • Objects in the input space are mapped to a feature space using a set of mathematical functions (kernels). • In the feature (transformed) space the mapped objects become linearly separable, so instead of drawing a complex curve, an optimal separating hyperplane (the maximum-margin hyperplane) can be found to separate the two classes.
Binary SVMs • Main idea is to identify the maximum-margin hyperplane that separates the training instances. • Selects the hyperplane that maximizes the width of the gap (margin) between the two classes. • The hyperplane is specified by the support vectors. • New instances are classified according to the side of the hyperplane on which they fall.
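A minimal binary SVM sketch, assuming Python with scikit-learn (the study itself used MATLAB implementations); the synthetic two-class dataset and all parameter values are illustrative only.

```python
# Minimal binary SVM sketch (assumes scikit-learn; not the authors' MATLAB code).
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class, high-dimensional data standing in for expression profiles.
X, y = make_classification(n_samples=100, n_features=50, random_state=0)

clf = SVC(kernel="linear", C=1.0)     # linear kernel: maximum-margin hyperplane in input space
clf.fit(X, y)

print(clf.support_vectors_.shape)     # the hyperplane is specified by these support vectors
print(clf.predict(X[:5]))             # new instances are labeled by the side they fall on
```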
1. Multiclass SVMs: one-versus-rest (OVR) • Simplest MC-SVM • Constructs k binary SVM classifiers: each class (positive) vs. all other classes (negative) • Computationally expensive because there are k quadratic programming (QP) optimization problems of size n to solve
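A one-versus-rest sketch, again assuming scikit-learn rather than the paper's own implementation; load_iris is just a stand-in multicategory dataset.

```python
# One-versus-rest (OVR) sketch: k binary SVMs, each separating one class from the rest.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)              # k = 3 classes -> 3 binary QP problems
ovr = OneVsRestClassifier(SVC(kernel="linear"))
ovr.fit(X, y)

print(len(ovr.estimators_))                    # one binary classifier per class
print(ovr.predict(X[:5]))
```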
2. Multiclass SVMs: one-versus-one (OVO) • Involves constructing binary SVM classifiers for all pairs of classes (k(k-1)/2 classifiers in total) • A decision function assigns an instance to the class that receives the largest number of votes (Max Wins strategy) • Computationally less expensive than OVR, since each pairwise problem involves only the samples of two classes
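A one-versus-one sketch under the same scikit-learn assumption; the Max Wins voting is handled internally by OneVsOneClassifier.

```python
# One-versus-one (OVO) sketch: one binary SVM per pair of classes, combined by voting.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
ovo = OneVsOneClassifier(SVC(kernel="linear"))
ovo.fit(X, y)

print(len(ovo.estimators_))                    # k(k-1)/2 = 3 pairwise classifiers
print(ovo.predict(X[:5]))                      # the class with the most votes wins (Max Wins)
```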
3. Multiclass SVMs: DAGSVM • Constructs a rooted binary decision DAG (directed acyclic graph) • Each node is a binary SVM for a pair of classes (p, q) • k leaves: k classification decisions • Each non-leaf node (p, q) has two edges: the left edge corresponds to a "not p" decision and the right edge to a "not q" decision
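A sketch of DAGSVM-style prediction, assuming scikit-learn pairwise SVMs and a hand-rolled elimination loop (an illustration of the idea, not the authors' implementation): each node eliminates one class, so k-1 binary evaluations yield the final decision.

```python
# DAGSVM-style prediction sketch: traverse a DAG of pairwise SVMs, eliminating one class per node.
from itertools import combinations
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Train one binary SVM per pair of classes (the same classifiers OVO would use).
pairwise = {}
for p, q in combinations(classes, 2):
    mask = np.isin(y, [p, q])
    pairwise[(p, q)] = SVC(kernel="linear").fit(X[mask], y[mask])

def dag_predict(x):
    candidates = list(classes)
    while len(candidates) > 1:
        p, q = candidates[0], candidates[-1]                  # current node (p, q)
        winner = pairwise[(p, q)].predict(x.reshape(1, -1))[0]
        candidates.remove(p if winner != p else q)            # follow the "not p" / "not q" edge
    return candidates[0]                                      # leaf: final classification decision

print([dag_predict(x) for x in X[:5]])
```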
4 & 5. Multiclass SVMs: Weston & Watkins (WW) and Crammer & Singer (CS) • Constructs a single classifier by maximizing the margin between all the classes simultaneously • Both require the solution of a single QP problem of size (k-1)n, but the CS MC-SVM uses less slack variables in the constraints of the optimization problem, thereby making it computationally less expensive
K-Nearest Neighbors (KNN) • For each case to be classified, locate the k closest members of the training dataset. • A Euclidean distance measure is used to calculate the distance between the training dataset members and the target case. • The weighted sum of the variable of interest is computed over the k nearest neighbors to make the prediction. • Repeat this procedure for the remaining target cases.
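A KNN sketch under the same scikit-learn assumption; Euclidean distance and distance-weighted votes of the k nearest training samples decide each target case.

```python
# K-nearest neighbors sketch: Euclidean distance, distance-weighted vote of the k neighbors.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean", weights="distance")
knn.fit(X, y)                  # "training" simply stores the dataset
print(knn.predict(X[:5]))      # each target case is labeled by its k closest training members
```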
Backpropagation Neural Networks (NNs) & Probabilistic Neural Networks (PNNs) • Backpropagation Neural Networks: • Feed-forward neural networks in which signals are propagated forward through the layers of units. • The connection weights are adjusted by the backpropagation learning algorithm whenever there is an error. • Probabilistic Neural Networks: • Similar in design to NNs, except that the hidden layer consists of a pattern layer and a competitive layer, and the unit connections do not have weights.
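A backpropagation feed-forward network sketch using scikit-learn's MLPClassifier as an illustrative stand-in (the hidden-layer size and other settings are assumptions, and PNNs have no stock scikit-learn equivalent).

```python
# Feed-forward network trained by backpropagation (illustrative stand-in, not the study's NN/PNN).
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)          # scaling helps gradient-based weight updates

nn = MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000, random_state=0)
nn.fit(X, y)                                   # connection weights adjusted by backpropagation
print(nn.predict(X[:5]))
```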
Ensemble Classification Methods (used to improve performance) • The outputs of N base classifiers (Classifier 1 … Classifier N, producing Output 1 … Output N) are combined by an ensemble classifier. • Combination techniques: majority voting, decision trees, MC-SVM (OVR, OVO, DAGSVM).
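A majority-voting ensemble sketch with scikit-learn's VotingClassifier; the particular base classifiers are assumptions chosen only to illustrate combining several outputs into one prediction.

```python
# Majority-voting ensemble sketch: N base classifiers, one vote each per sample.
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
ensemble = VotingClassifier(
    estimators=[
        ("svm_linear", SVC(kernel="linear")),
        ("svm_rbf", SVC(kernel="rbf")),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="hard",                             # hard voting = majority vote over the N outputs
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```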
Datasets & Data Preparatory Steps • Nine multicategory cancer diagnosis datasets • Two binary cancer diagnosis datasets • All datasets were produced by oligonucleotide-based technology • The oligonucleotides or genes with absent calls in all samples were excluded from analysis to reduce any noise.
Experimental Designs • Two experimental designs were used to obtain reliable performance estimates and avoid overfitting. • The data are split into mutually exclusive sets. • The outer loop estimates performance by training on all splits but one and testing on the held-out split. • The inner loop determines the best parameters of the classifier.
Experimental Designs • Design I uses stratified 10-fold cross-validation in both loops, while Design II uses 10-fold cross-validation in its inner loop and leave-one-out cross-validation in its outer loop. • Building the final diagnostic model involves: • Finding the best parameters for the classifier using a single loop of cross-validation • Building the classifier on all data using the previously found best parameters • Estimating a conservative bound on the classifier's accuracy using either design
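A nested cross-validation sketch in the spirit of Design I, assuming scikit-learn and a generic RBF SVC with an illustrative parameter grid; the key point is that the inner loop that picks parameters never touches the outer test split.

```python
# Nested cross-validation sketch (Design I flavor): inner loop tunes, outer loop estimates.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)   # parameter selection loop
outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)   # performance estimation loop

tuned = GridSearchCV(SVC(kernel="rbf"),
                     param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
                     cv=inner)

# The outer test split is never seen while choosing parameters, so the estimate is
# conservative and resistant to overfitting.
scores = cross_val_score(tuned, X, y, cv=outer)
print(scores.mean())
```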
Performance Metrics • Accuracy • Easy to interpret • Simplifies statistical testing • Sensitive to prior class probabilities • Does not describe the actual difficulty of the decision problem for unbalanced distributions • Relative classifier information (RCI) • Corrects for the differences in: • Prior probabilities of the diagnostic categories • Number of categories
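A sketch contrasting accuracy with an RCI-style metric; the paper's exact RCI formulation is not reproduced here, so as an assumption RCI is computed as the mutual information between true and predicted labels normalized by the entropy of the true class distribution.

```python
# Accuracy vs. an RCI-style metric (assumed formulation: normalized mutual information).
import numpy as np
from sklearn.metrics import accuracy_score, mutual_info_score

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def rci(y_true, y_pred):
    return mutual_info_score(y_true, y_pred) / entropy(y_true)

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])   # unbalanced two-class problem
y_pred = np.array([0] * 10)                          # trivial majority-class predictor
print(accuracy_score(y_true, y_pred))                # 0.8 looks good...
print(rci(y_true, y_pred))                           # ...but RCI is 0: no information gained
```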
Statistical Comparison Among Classifiers • Used to test whether the differences between the best method and the other methods are non-random (statistically significant).
Performance Results (Accuracies) without Gene Selection Using Design I
Performance Results (RCI) without Gene Selection Using Design I
Total Time of Classification Experiments w/o gene selection for all 11 datasets and two experimental designs • Executed in a Matlab R13 environment on 8 dual-CPU workstations connected in a cluster. • Fastest MC-SVMs: WW & CS • Fastest overall algorithm: KNN • Slowest MC-SVM: OVR • Slowest overall algorithms: NN and PNN
Performance Results (Accuracies) with Gene Selection Using Design I • The 4 gene selection methods were applied to the 4 most challenging datasets; annotations in the figure mark the improvement contributed by gene selection.
Performance Results (RCI) with Gene Selection Using Design I • The 4 gene selection methods were applied to the 4 most challenging datasets; annotations in the figure mark the improvement contributed by gene selection.
Discussion & Limitations • Limitations: • Use of the two performance metrics • Choice of KNN, PNN and NN classifiers • Future Research: • Improve existing gene selection procedures with the selection of optimal number of genes by cross-validation • Applying multivariate Markov blanket and local neighborhood algorithms • Extend comparisons with more MC-SVMs as they become available • Updating GEMS system to make it more user-friendly.
Contributions of the Study • Conducted the most comprehensive systematic evaluation to date of multicategory diagnosis algorithms applied to the majority of multicategory cancer-related gene expression human datasets. • Created the GEMS system, which automates the experimental procedures of the study in order to: • Develop optimal classification models for the domain of cancer diagnosis with microarray gene expression data • Estimate their performance in future patients
Conclusions • MC-SVMs are the best family of algorithms for these types of data and medical tasks; they outperform non-SVM machine learning techniques • Among the MC-SVM methods, OVR, CS, and WW are the best with respect to classification performance • Gene selection can improve the performance of both MC-SVM and non-SVM methods • Ensemble classification does not further improve the classification performance of the best MC-SVM methods