Parallel Multicategory Support Vector Machines (PMC-SVM) for Classifying Microarray Data
Outline • Introduction • SMO-SVM • Parallel Multicategory SVM • Parallel Implementation and Environment • Parallel Evaluation and Analysis • Classifying Microarray Data • Conclusions
Introduction • Biologists want to separate data into multiple categories using a reliable cancer diagnostic model. • A comprehensive evaluation of several multicategory classification methods found that support vector machines (SVM) are the most effective classifiers for accurate cancer diagnosis from gene expression data. • In this paper, we developed new parallel multicategory support vector machines (PMC-SVM) based on the sequential minimal optimization-type decomposition method for support vector machines (SMO-SVM) of LibSVM, which needs less memory.
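SMO-SVM The basic idea behind SVM is to separate the two point classes of a training set, $\{(x_i, y_i)\}_{i=1}^{l}$ with $x_i \in \mathbb{R}^n$ and $y_i \in \{+1, -1\}$, (1) by using a decision function obtained by solving a convex quadratic programming problem of the form $\min_{\alpha} f(\alpha) = \frac{1}{2}\alpha^T Q \alpha - e^T \alpha$ (2) subject to $0 \le \alpha_i \le C$, $i = 1, \dots, l$, and $y^T \alpha = 0$,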
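SMO-SVM where $\alpha$ is the vector of dual variables and $C$ is a constant, $e$ is the vector of all ones, and $Q$ is an $l \times l$ symmetric positive semidefinite matrix whose entries are defined as $Q_{ij} = y_i y_j K(x_i, x_j)$, (3) where $K$ denotes a kernel function, such as a polynomial kernel or a Gaussian kernel.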
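The matrix in (3) can be illustrated with a short Python sketch. It assumes the standard definition $Q_{ij} = y_i y_j K(x_i, x_j)$; the function names, the `gamma`, `degree`, and `coef0` parameters, and the three sample points are illustrative choices, not taken from the slides.

```python
import math

def gaussian_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2). gamma is an assumed default."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """Polynomial kernel: (x . z + coef0)^degree. degree and coef0 are assumed defaults."""
    dot = sum(a * b for a, b in zip(x, z))
    return (dot + coef0) ** degree

def build_q(points, labels, kernel):
    """Return Q as a list of rows, with Q[i][j] = y_i * y_j * K(x_i, x_j)."""
    n = len(points)
    return [
        [labels[i] * labels[j] * kernel(points[i], points[j]) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical toy data: three 2-D points with labels in {+1, -1}.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
labels = [1, -1, 1]
Q = build_q(points, labels, gaussian_kernel)
```

Storing the full matrix needs memory quadratic in the number of training points, which is why decomposition methods that touch only a few columns per iteration are attractive.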
SMO-SVM • Decomposition methods update only a subset of the variables $\alpha$ in each iteration. • This subset, denoted as B, is called the working set. • If B is restricted to have only two elements, this special type of decomposition method is the Sequential Minimal Optimization (SMO).
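There are four steps to implement SMO: Step 1: Find $\alpha^1$ as the initial feasible solution. Set $k = 1$. Step 2: If $\alpha^k$ is a stationary point of (2), stop. Otherwise, find a two-element working set $B = \{i, j\} \subset \{1, \dots, l\}$. Define $N \equiv \{1, \dots, l\} \setminus B$, and $\alpha_B^k$ and $\alpha_N^k$ as the subvectors of $\alpha^k$ corresponding to $B$ and $N$, respectively.
Step 3: If $a_{ij} \equiv K_{ii} + K_{jj} - 2K_{ij} > 0$, solve the following sub-problem with the variable $\alpha_B$: $\min_{\alpha_i, \alpha_j} \frac{1}{2} \begin{bmatrix} \alpha_i & \alpha_j \end{bmatrix} \begin{bmatrix} Q_{ii} & Q_{ij} \\ Q_{ij} & Q_{jj} \end{bmatrix} \begin{bmatrix} \alpha_i \\ \alpha_j \end{bmatrix} + (-e_B + Q_{BN}\alpha_N^k)^T \begin{bmatrix} \alpha_i \\ \alpha_j \end{bmatrix}$ (4) subject to $0 \le \alpha_i, \alpha_j \le C$ and $y_i\alpha_i + y_j\alpha_j = -y_N^T \alpha_N^k$; else solve the objective of (4) with the added term $\frac{\tau - a_{ij}}{4}\left((\alpha_i - \alpha_i^k)^2 + (\alpha_j - \alpha_j^k)^2\right)$, where $\tau$ is a small positive constant, (5) subject to the constraints of (4). Step 4: Set $\alpha_B^{k+1}$ to be the optimal solution of (4) or (5) and $\alpha_N^{k+1} \equiv \alpha_N^k$. Set $k \leftarrow k + 1$ and go to Step 2.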
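The four steps can be sketched in plain Python as follows. This is a minimal illustration under several assumptions, not the implementation behind the slides: the working set is chosen as the first-order maximal violating pair (the slides do not say which selection rule is used), the kernel defaults to a linear one, and when the curvature $K_{ii} + K_{jj} - 2K_{ij}$ is not positive it is simply replaced by a small constant `tau`, a common simplification of the alternative sub-problem. The names `smo_train`, `eps`, `tau`, and `max_iter` are hypothetical.

```python
from math import inf

def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def smo_train(points, labels, C=1.0, kernel=linear_kernel,
              eps=1e-3, tau=1e-12, max_iter=10000):
    """Return the dual variables alpha for labels in {+1, -1}."""
    n = len(points)
    K = [[kernel(points[s], points[t]) for t in range(n)] for s in range(n)]
    alpha = [0.0] * n   # Step 1: alpha = 0 satisfies all constraints
    grad = [-1.0] * n   # gradient Q*alpha - e at alpha = 0
    for _ in range(max_iter):
        # Step 2: pick the maximal violating pair (i, j), or stop.
        i, m = -1, -inf
        j, M = -1, inf
        for t in range(n):
            v = -labels[t] * grad[t]
            can_go_up = (labels[t] == 1 and alpha[t] < C) or \
                        (labels[t] == -1 and alpha[t] > 0)
            can_go_down = (labels[t] == 1 and alpha[t] > 0) or \
                          (labels[t] == -1 and alpha[t] < C)
            if can_go_up and v > m:
                i, m = t, v
            if can_go_down and v < M:
                j, M = t, v
        if m - M < eps:   # approximately stationary
            break
        # Step 3: solve the two-variable sub-problem along the feasible
        # direction alpha_i += y_i * step, alpha_j -= y_j * step.
        a = K[i][i] + K[j][j] - 2.0 * K[i][j]
        if a <= 0:
            a = tau
        step = (m - M) / a
        room_i = C - alpha[i] if labels[i] == 1 else alpha[i]
        room_j = alpha[j] if labels[j] == 1 else C - alpha[j]
        step = min(step, room_i, room_j)   # keep 0 <= alpha <= C
        # Step 4: update the two variables and the gradient.
        alpha[i] += labels[i] * step
        alpha[j] -= labels[j] * step
        for t in range(n):
            grad[t] += labels[t] * step * (K[t][i] - K[t][j])
    return alpha

# Hypothetical toy problem: two 1-D points, one per class.
alpha = smo_train([(-1.0,), (1.0,)], [-1, 1], C=10.0)
```

Each iteration reads only two kernel columns, so a practical implementation can compute them on demand instead of storing the full matrix as this sketch does.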
Parallel Multicategory SVM (PMC-SVM) • In multicategory classification with support vector machines, the algorithm generates k(k-1)/2 sub-models for k categories. • Generating these models is the most time-consuming task in the algorithm, so it is desirable to distribute the sub-models across multiple processors, with each processor performing a subtask, to improve performance.
Example: With a total of 4 processors and k = 16 categories, we have to generate k(k-1)/2 models, which is 120 models in total.
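The arithmetic in this example can be checked with a short Python sketch. It assumes that each of the k(k-1)/2 models corresponds to one pair of categories and that models are handed out round-robin; the actual assignment scheme is not recoverable from the slides, so `assign_round_robin` is a hypothetical stand-in.

```python
from itertools import combinations

def pairwise_tasks(k):
    """One training task per unordered pair of categories: k*(k-1)/2 in total."""
    return list(combinations(range(k), 2))

def assign_round_robin(tasks, num_procs):
    """Give processor `rank` every num_procs-th task, starting at index `rank`."""
    return [tasks[rank::num_procs] for rank in range(num_procs)]

tasks = pairwise_tasks(16)
plan = assign_round_robin(tasks, 4)

print(len(tasks))                  # 120 models in total
print([len(p) for p in plan])      # [30, 30, 30, 30]
```

An even task count does not guarantee an even workload, since individual sub-models may take different amounts of time to train.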
Parallel Implementation and Environment • PMC-SVM was run on two platforms. • One is the shared-memory SGI Origin 2800 supercomputer (sweetgum), equipped with 128 CPUs, 64 gigabytes of memory, and 1.6 terabytes of Fibre Channel disk. • The other is a distributed-memory Linux cluster (mimosa) with 192 nodes.
Parallel Evaluation and Analysis • PMC-SVM is tested on both the sweetgum and mimosa platforms using the following two datasets. Dataset 1: Letter_scale (classes: 26, training size: 16,000, features: 16). Dataset 2: Mnist_scale (classes: 10, training size: 21,000, features: 780).
Figure 2. The speedup of PMC-SVM on sweetgum with Dataset 1 (Letter_scale). Figure 3. The speedup of PMC-SVM on mimosa with Dataset 1 (Letter_scale).
Figure 4. The speedup of PMC-SVM on sweetgum with Dataset 2 (Mnist_scale). Figure 5. The speedup of PMC-SVM on mimosa with Dataset 2 (Mnist_scale).
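The speedup plotted in these figures is presumably the conventional ratio of single-processor time to multi-processor time; the slides do not state the definition, so the Python sketch below is an assumption. The timings in the usage lines are placeholders chosen for easy arithmetic, not measurements from sweetgum or mimosa.

```python
def speedup(t_serial, t_parallel):
    """Conventional speedup: time on one processor divided by time on p processors."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, num_procs):
    """Speedup per processor; 1.0 would mean perfect linear scaling."""
    return speedup(t_serial, t_parallel) / num_procs

# Placeholder timings in seconds (hypothetical, not from the slides).
s = speedup(120.0, 40.0)
e = efficiency(120.0, 40.0, 4)
```

With these placeholder values the speedup is 3.0 on 4 processors, an efficiency of 0.75.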
Classifying Microarray Data In this work, two microarray datasets were used to demonstrate the performance of PMC-SVM, as listed below. Dataset 3: 14_Tumors (40 MB), human tumor types: 14, normal tissue types: 12. Dataset 4: 11_Tumors (18 MB), human tumor types: 11.
Table 6: Performance on sweetgum (Dataset 3) Table 7: Performance on sweetgum (Dataset 4)
Conclusions • PMC-SVM has been developed for classifying large datasets, based on the SMO-type decomposition method. • The experimental results show that high-performance computing techniques and a parallel implementation can achieve a significant speedup.