A HYBRID OF GENETIC ALGORITHMS AND SUPPORT VECTOR MACHINES (GASVM) FOR GENE SELECTION
A Flowchart of GASVM • The overall hybrid method consists of two main components: a GA and an SVM classifier. • The GA selects subsets of features, and the SVM classifier then evaluates each subset during a classification process. • The classification result is used as the fitness value of the GA, where accuracy(x) is the leave-one-out cross-validation (LOOCV) accuracy of the classifier on the feature subset represented by x.
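The evaluation loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy data, and the choice to give an empty subset a fitness of 0 are all assumptions, and a simple nearest-centroid rule stands in for the SVM so that the sketch needs only the Python standard library.

```python
def nearest_centroid(train_x, train_y, query):
    """Stand-in classifier (NOT an SVM): predict the class whose mean is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(query, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(chromosome, data, labels, classify=nearest_centroid):
    """LOOCV accuracy using only the genes whose bit is 1 in `chromosome`."""
    selected = [i for i, bit in enumerate(chromosome) if bit == 1]
    if not selected:
        return 0.0  # assumption: an empty gene subset gets the worst fitness
    reduced = [[row[i] for i in selected] for row in data]
    correct = 0
    for k in range(len(reduced)):
        # Hold out sample k, train on the rest, test on the held-out sample.
        train_x = reduced[:k] + reduced[k + 1:]
        train_y = labels[:k] + labels[k + 1:]
        if classify(train_x, train_y, reduced[k]) == labels[k]:
            correct += 1
    return correct / len(reduced)


# Hypothetical toy data: 4 samples x 3 genes; only gene 0 separates the classes.
toy_data = [
    [0.0, 5.0, 1.0],
    [0.2, 1.0, 9.0],
    [1.0, 4.8, 1.2],
    [1.2, 0.9, 8.8],
]
toy_labels = [0, 0, 1, 1]

fitness_informative = loocv_accuracy([1, 0, 0], toy_data, toy_labels)
fitness_noise = loocv_accuracy([0, 1, 1], toy_data, toy_labels)
```

On this toy data the chromosome that keeps only the informative gene scores higher than the one that keeps the two noisy genes, which is the signal the GA would use to rank chromosomes. Replacing `nearest_centroid` with an SVM would match the slide more closely.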
Chromosome Representation in GASVM • Let n be the total number of genes available for representing the data to be classified. • Hence, the chromosome is represented by a binary vector of dimension n.
Chromosome Representation in GASVM • A chromosome = a solution, i.e. a gene subset. • If a bit is 1, the corresponding gene is selected. If a bit is 0, the gene is not selected. • Figure: an example of chromosome representation in GASVM for gene selection.
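The binary encoding above can be illustrated with a short sketch. The helper names `random_chromosome` and `selected_genes` and the six-gene example are hypothetical and chosen only for illustration.

```python
import random


def random_chromosome(n_genes, rng):
    """Create a binary vector of dimension n_genes (one bit per gene)."""
    return [rng.randint(0, 1) for _ in range(n_genes)]


def selected_genes(chromosome):
    """Return the indices of the genes whose bit is 1 (the selected subset)."""
    return [i for i, bit in enumerate(chromosome) if bit == 1]


# Hypothetical chromosome over n = 6 genes: genes 0, 3 and 4 are selected.
example = [1, 0, 0, 1, 1, 0]
example_subset = selected_genes(example)

# A random starting chromosome, seeded so the run is repeatable.
start = random_chromosome(6, random.Random(0))
```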
Investigation of GASVM Limitation • The investigation showed that the number of possible gene subsets grows exponentially as the number of features (genes) increases, so the search problem is NP-complete.
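To make the exponential growth concrete: a binary chromosome of length n can encode 2^n bit patterns, so there are 2^n - 1 non-empty gene subsets. The sketch below simply evaluates that count; the function name is hypothetical.

```python
def count_gene_subsets(n_genes):
    """Number of non-empty gene subsets for a binary chromosome of length n_genes."""
    return 2 ** n_genes - 1


small = count_gene_subsets(10)
# For the 2000-gene Colon dataset the count is too long to print usefully,
# so only its number of decimal digits is kept here.
colon_digits = len(str(count_gene_subsets(2000)))
```

Even for a few thousand genes the count has hundreds of decimal digits, so exhaustive evaluation of every subset is out of reach and a search heuristic such as a GA is needed.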
Drawback of GASVM • GASVM: the search space is too large because of the high-dimensional data • complexity of the search space • low accuracy • high number of selected genes
Proposed Solution • Figure: relationship between the number of subsets y and the number of selected features x out of a total of n features (x-axis marked at n/2 and n).
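The figure itself is not preserved here. Assuming it plots the number of subsets of exactly x features drawn from n, that count is the binomial coefficient C(n, x), which peaks at x = n/2. The sketch below computes it with the standard library; the function names are hypothetical.

```python
from math import comb


def subsets_of_size(n, x):
    """Number of distinct gene subsets with exactly x of the n genes selected."""
    return comb(n, x)


def peak_subset_size(n):
    """Subset size x with the most candidate subsets (first maximum if tied)."""
    return max(range(n + 1), key=lambda x: comb(n, x))


n = 10
curve = [subsets_of_size(n, x) for x in range(n + 1)]
peak = peak_subset_size(n)
```

Fixing x at a small value therefore cuts the search space sharply compared with letting the GA range over all subset sizes.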
Chromosome representation in GASVM-II • Figure: an example of chromosome representation in GASVM-II for gene selection.
Drawback of GASVM-II • GASVM-II: • Genes are selected manually. • Overfitting: high LOOCV accuracy but low test accuracy, giving inconsistent results.
Case Study: GASVM Versus GASVM-II for Gene Selection • Leukemia Dataset • The first benchmark gene expression microarray dataset is Leukemia Cancer. The data contains examples of human acute leukemia, originally analyzed by Golub et al. • The dataset, which contains expression levels of 7129 genes, can be obtained at http://www.genome.wi.mit.edu/mpr. • The bone marrow or blood samples were taken from 72 patients, 25 with acute myeloid leukemia (AML) and 47 with acute lymphoblastic leukemia (ALL). • The training data consists of 38 samples, and the remaining 34 samples were used as testing data.
Colon Dataset • The second benchmark dataset is Colon Cancer. The data contains expression levels of 2000 genes from 40 tumor and 22 normal colon tissues. • The dataset, originally analyzed by Alon et al. [12] and downloaded from http://microarray.princeton.edu/oncology/affydata/index.html, has only 62 samples, all used as training data.
Experimental environment • Parameters of the GASVM and GASVM-II for the Leukemia and Colon Cancer datasets
Results analysis and discussions • Classification accuracies for different gene subsets using GASVM-II method
Results analysis and discussions • Benchmark of GASVM, GASVM-II and SVM performances and current best of previous methods on Leukemia Cancer dataset
Results analysis and discussions • Benchmark of GASVM, GASVM-II and SVM performances and current best of previous methods on the Colon Cancer dataset
Biological plausibility for informative genes in datasets • List of the same informative genes in the Leukemia Cancer dataset produced by GASVM-II and previous works
Biological plausibility for informative genes in datasets • List of the same informative genes in the Colon Cancer dataset produced by GASVM-II and previous works