Classification of multiple cancer types by multicategory support vector machines using gene expression data
Support Vector Machine • A classification method that has been successful in cancer diagnosis problems • Two types • Binary SVM: no clear optimal extension to more than two classes, which limits its application to multiple tumor types • Multicategory SVM (recently proposed): demonstrated on leukemia data and on small round blue cell tumors of childhood.
DNA microarray technology • Measures the relative amount of mRNA in isolated cells or biopsied tissues • Uses SVM by solving a series of binary problems (DAG SVM algorithm) • MSVM is applied to two gene expression data sets
Features • Effectiveness • Prediction strength • Effect of data preprocessing • Gene selection • Dimension reduction
Procedure: 3-class problem • Gene expression was monitored to classify 2 leukemias: ALL (acute lymphoblastic leukemia) and AML (acute myeloid leukemia) • ALL subtypes • B-cell • T-cell
Procedure (cont.) • Number of genes: 7129 • Training set: 38 samples • Test set: 34 samples • Preprocessing steps performed • Thresholding (floor 100, ceiling 16000) • Filtering of genes (max/min <= 5 and max-min <= 500) • Base-10 logarithmic transformation
Procedure (cont.) • Standardization of each variable • Variable selection • Prescreening measure: ratio of between-class sum of squares to within-class sum of squares for each gene (genes with the largest ratios are kept)
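The three preprocessing steps on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `preprocess` and its parameter names are hypothetical, and it assumes that a gene meeting the slide's filter condition (max/min <= 5 and max-min <= 500) is excluded as showing too little variation.

```python
import math

def preprocess(rows, floor=100.0, ceiling=16000.0, max_ratio=5.0, max_diff=500.0):
    """rows: list of samples, each a list of raw expression values (one per gene)."""
    # Thresholding: clamp every value into [floor, ceiling].
    clipped = [[min(max(v, floor), ceiling) for v in row] for row in rows]

    kept = []
    for g in range(len(clipped[0])):
        col = [row[g] for row in clipped]
        hi, lo = max(col), min(col)
        # Assumption: a gene matching the slide's condition is dropped.
        low_variation = (hi / lo <= max_ratio) and (hi - lo <= max_diff)
        if not low_variation:
            kept.append(g)

    # Base-10 log transform of the surviving genes.
    logged = [[math.log10(row[g]) for g in kept] for row in clipped]
    return logged, kept
```

The function returns the transformed matrix together with the indices of the retained genes, so the same gene subset can be applied to the test set.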
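As a rough sketch of the standardization and prescreening steps, the code below computes, for each gene, the between-class sum of squares divided by the within-class sum of squares and keeps the genes with the largest ratios. The function names (`standardize`, `bss_wss_ratio`, `top_genes`) are illustrative, and the handling of zero within-class variance is an assumption that the slides do not address.

```python
from statistics import mean, stdev

def standardize(rows):
    """Center each gene (column) to mean 0 and scale it to unit sample std."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) or 1.0 for c in cols]  # guard: constant genes keep scale 1
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def bss_wss_ratio(rows, labels):
    """Per-gene ratio of between-class to within-class sum of squares."""
    ratios = []
    for col in zip(*rows):
        overall = mean(col)
        bss = wss = 0.0
        for k in set(labels):
            vals = [v for v, y in zip(col, labels) if y == k]
            mk = mean(vals)
            bss += len(vals) * (mk - overall) ** 2
            wss += sum((v - mk) ** 2 for v in vals)
        if wss > 0:
            ratios.append(bss / wss)
        else:
            # Assumption: no within-class spread means perfect or no separation.
            ratios.append(float("inf") if bss > 0 else 0.0)
    return ratios

def top_genes(rows, labels, n):
    """Indices of the n genes with the largest BSS/WSS ratio."""
    ratios = bss_wss_ratio(rows, labels)
    return sorted(range(len(ratios)), key=lambda g: ratios[g], reverse=True)[:n]
```

A gene whose class means differ strongly relative to its spread inside each class gets a large ratio, which is why the largest ratios are the ones taken.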
Small round blue cell tumors (SRBCTs) data • 4 types • Neuroblastoma (NB) • Rhabdomyosarcoma (RMS) • Non-Hodgkin lymphoma (NHL) • Ewing family of tumors (EWS)
Used Artificial Neural Networks (ANN) • Training set: 63 samples • Test set: 20 samples • Nearest neighbor, weighted voting, and linear SVM were applied to the data • MSVM was applied for comparison • Base-10 logarithm of expression levels
SANN • For multiclass classification • Classification results superior to ANN • ANN uses the backpropagation algorithm • Why? • Nonlinear connections • Inclusion of interactions among independent variables (input) • Independence from conventional processes
Limitations • Learned knowledge is contained in hundreds to thousands of weights (synapses) • Cannot be analyzed as a single regression formula
Combining several ANNs • Through ensembles of networks • An ensemble: a collection of a finite number of different classifiers • Cascading ANNs
Two-level ANN • Task: chest radiograms • With lung nodules (class A) • Without lung nodules (class B)
Two-level architecture carrying lower-level and higher-level concepts • Task: differentiate • (higher level) normal cells (class A) from malignant cells (class B) • (lower level) subclasses of class B • Class B_1 • Class B_2 • Class B_3 • Class B_4
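The two-level idea can be sketched as a cascade: a first classifier separates class A from class B, and a second classifier, trained only on class B samples, picks the subclass. In the sketch below a nearest-centroid rule is a hypothetical stand-in for each trained ANN, and the function names and the label "B" for the lumped malignant class are assumptions made for illustration.

```python
import math

def fit_nearest_centroid(X, y):
    """Hypothetical stand-in for a trained ANN: predict the nearest class centroid."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    centroids = {k: [sum(c) / len(v) for c in zip(*v)] for k, v in groups.items()}
    return lambda x: min(centroids, key=lambda k: math.dist(x, centroids[k]))

def fit_two_level(X, y, normal="A"):
    """Higher level: normal vs. malignant. Lower level: which malignant subclass."""
    # Level 1 sees every sample, with all subclasses lumped into "B".
    top = fit_nearest_centroid(X, [normal if l == normal else "B" for l in y])
    # Level 2 sees only the malignant samples, with their subclass labels.
    Xb = [x for x, l in zip(X, y) if l != normal]
    yb = [l for l in y if l != normal]
    sub = fit_nearest_centroid(Xb, yb)
    return lambda x: normal if top(x) == normal else sub(x)
```

The second classifier is consulted only when the first one has already ruled out class A, so it never has to model normal cells.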
One vs. all • Used with SVM • K binary classifiers: each distinguishes one class from all others lumped together • Sample is assigned to the class whose classifier gives the greatest output activity
All-pairs approach • Builds K(K-1)/2 binary classifiers • K-1 binary classifiers distinguish each class from the other classes • Output activities are summed up: the class with the greatest activity is the winning class
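A minimal sketch of the one-vs-all decision rule follows. To keep it self-contained, a simple centroid-based linear scorer (`fit_centroid_scorer`, a hypothetical helper) stands in for each binary SVM; only the way the K outputs are combined reflects the slide.

```python
def fit_centroid_scorer(pos, neg):
    """Hypothetical stand-in for a binary SVM: a linear score that is
    positive on the side of the `pos` centroid and negative on the `neg` side."""
    dim = len(pos[0])
    cp = [sum(x[j] for x in pos) / len(pos) for j in range(dim)]
    cn = [sum(x[j] for x in neg) / len(neg) for j in range(dim)]
    w = [a - b for a, b in zip(cp, cn)]
    mid = [(a + b) / 2 for a, b in zip(cp, cn)]
    bias = -sum(wj * mj for wj, mj in zip(w, mid))
    return lambda x: sum(wj * xj for wj, xj in zip(w, x)) + bias

def fit_one_vs_all(X, y):
    """Train K scorers, each separating one class from all others lumped together."""
    scorers = {}
    for k in sorted(set(y)):
        pos = [x for x, label in zip(X, y) if label == k]
        neg = [x for x, label in zip(X, y) if label != k]
        scorers[k] = fit_centroid_scorer(pos, neg)
    return scorers

def predict_one_vs_all(scorers, x):
    """Assign x to the class whose scorer gives the greatest output."""
    return max(scorers, key=lambda k: scorers[k](x))
```

With K classes this trains exactly K binary models, each on the full training set.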
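The pairwise scheme can be sketched the same way. The code below builds one scorer per pair of classes and sums the signed outputs per class; the centroid-based scorer is again a hypothetical stand-in for a binary SVM, and crediting the output to one class while debiting the other is an assumption about how the "activities" are summed. Counting votes instead of summing raw outputs is a common alternative.

```python
from itertools import combinations

def fit_centroid_scorer(pos, neg):
    """Hypothetical stand-in for a binary SVM: a linear score that is
    positive on the side of the `pos` centroid and negative on the `neg` side."""
    dim = len(pos[0])
    cp = [sum(x[j] for x in pos) / len(pos) for j in range(dim)]
    cn = [sum(x[j] for x in neg) / len(neg) for j in range(dim)]
    w = [a - b for a, b in zip(cp, cn)]
    mid = [(a + b) / 2 for a, b in zip(cp, cn)]
    bias = -sum(wj * mj for wj, mj in zip(w, mid))
    return lambda x: sum(wj * xj for wj, xj in zip(w, x)) + bias

def fit_all_pairs(X, y):
    """Train one scorer per unordered pair of classes: K(K-1)/2 in total."""
    classes = sorted(set(y))
    models = {}
    for a, b in combinations(classes, 2):
        pos = [x for x, label in zip(X, y) if label == a]
        neg = [x for x, label in zip(X, y) if label == b]
        models[(a, b)] = fit_centroid_scorer(pos, neg)
    return classes, models

def predict_all_pairs(classes, models, x):
    """Sum signed outputs per class; the class with the greatest total wins."""
    activity = {k: 0.0 for k in classes}
    for (a, b), scorer in models.items():
        s = scorer(x)
        activity[a] += s  # positive output favors class a
        activity[b] -= s  # and counts against class b
    return max(activity, key=activity.get)
```

Each class takes part in K-1 of the pairwise models, which matches the count given on the slide.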
SANN • Oriented to human decision making • Exclusion is performed: preferences are narrowed down • The classification made by the first ANN is a preselection for the second, successive ANN
References • http://info.cchmc.org/presentations/ylee_13Dec02.pdf