Support Vector Machine Classification
Computation & Informatics in Biology & Medicine
Madison Retreat, November 15, 2002
Olvi L. Mangasarian
with G. M. Fung, Y.-J. Lee, J. W. Shavlik, W. H. Wolberg & Collaborators at ExonHit – Paris
Data Mining Institute, University of Wisconsin – Madison
What is a Support Vector Machine?
• An optimally defined surface
• Linear or nonlinear in the input space
• Linear in a higher dimensional feature space
• Implicitly defined by a kernel function K(A, B) (an example kernel is sketched below)
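The talk does not name a particular kernel at this point; as one common choice (an assumption here, not something the slide specifies), a Gaussian kernel between the rows of two matrices can be computed in MATLAB as follows:

function K = gaussian_kernel(A, B, mu)
% Gaussian kernel matrix: K(i,j) = exp(-mu*||A(i,:) - B(j,:)||^2)
% A is m-by-n, B is k-by-n, mu > 0 is a user-chosen width parameter
m = size(A,1); k = size(B,1);
sqdist = sum(A.^2,2)*ones(1,k) + ones(m,1)*sum(B.^2,2)' - 2*(A*B');
K = exp(-mu*sqdist);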
What are Support Vector Machines Used For?
• Classification
• Regression & data fitting
• Supervised & unsupervised learning
Principal Topics
• Proximal support vector machine classification
  • Classify by proximity to planes instead of halfspaces
• Massive incremental classification
  • Classify by retiring old data & adding new data
• Knowledge-based classification
  • Incorporate expert knowledge into a classifier
• Fast Newton method classifier
  • Finitely terminating fast algorithm for classification
• Breast cancer prognosis & chemotherapy
  • Classify patients on the basis of distinct survival curves
  • Isolate a class of patients that may benefit from chemotherapy
Principal Topics • Proximal support vector machine classification
Support Vector Machines Maximize the Margin between Bounding Planes
[Figure: points of the two classes A+ and A− separated by two parallel bounding planes, with the margin between them]
Proximal Support Vector Machines Maximize the Margin between Proximal Planes
[Figure: points of the two classes A+ and A− clustered around two parallel proximal planes]
Standard Support Vector Machine: Algebra of the 2-Category Linearly Separable Case
• Given m points in n-dimensional space
• Represented by an m-by-n matrix A
• Membership of each point in class +1 or −1 specified by:
• An m-by-m diagonal matrix D with +1 & −1 entries
• Separate by two bounding planes, x′w = γ + 1 and x′w = γ − 1:
  A_i w ≥ γ + 1 for D_ii = +1, A_i w ≤ γ − 1 for D_ii = −1
• More succinctly: D(Aw − eγ) ≥ e, where e is a vector of ones.
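As a minimal illustration (the toy data below is invented, not from the talk), these objects can be set up in MATLAB, with the last line checking whether a candidate plane x′w = γ satisfies the separation condition:

A = [2 2; 3 1; -2 -2; -3 -1];            % m=4 points in n=2 dimensions
d = [1; 1; -1; -1];                      % class labels +1 / -1
D = diag(d);                             % m-by-m diagonal matrix of labels
e = ones(4,1);
w = [1; 1]; gamma = 0;                   % candidate plane x'*w = gamma
separated = all(D*(A*w - e*gamma) >= e)  % true iff the bounding planes separate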
Standard Support Vector Machine Formulation
• Solve the quadratic program for some ν > 0:
  min_{w,γ,y} ν e′y + ½ w′w
  s.t. D(Aw − eγ) + y ≥ e, y ≥ 0 (QP)
  where D_ii = ±1 denotes A+ or A− membership.
• The margin 2/‖w‖ between the bounding planes is maximized by minimizing ½ w′w.
Proximal SVM Formulation (PSVM)
Standard SVM formulation:
  min_{w,γ,y} ν e′y + ½ (w′w + γ²)
  s.t. D(Aw − eγ) + y ≥ e, y ≥ 0 (QP)
PSVM changes the inequality to an equality and squares the error y:
  min_{w,γ,y} (ν/2) ‖y‖² + ½ (w′w + γ²)
  s.t. D(Aw − eγ) + y = e
Solving for y = e − D(Aw − eγ) in terms of w and γ gives:
  min_{w,γ} (ν/2) ‖e − D(Aw − eγ)‖² + ½ (w′w + γ²)
This simple but critical modification changes the nature of the optimization problem tremendously! (Regularized least squares, or ridge regression.)
Advantages of the New Formulation
• Objective function remains strongly convex.
• An explicit exact solution can be written in terms of the problem data.
• The PSVM classifier is obtained by solving a single system of linear equations in the usually small dimensional input space.
• Exact leave-one-out correctness can be obtained in terms of the problem data.
Linear PSVM
We want to solve:
  min_{w,γ} (ν/2) ‖e − D(Aw − eγ)‖² + ½ (w′w + γ²)
• Setting the gradient equal to zero gives a nonsingular system of linear equations.
• Solution of the system gives the desired PSVM classifier.
Linear PSVM Solution
Here H = [A −e] and r = (w; γ), so setting the gradient to zero gives the linear system:
  (I/ν + H′H) r = H′De
• The linear system to solve depends on H′H, which is of size (n+1)-by-(n+1).
• n+1 is usually much smaller than m.
Linear & Nonlinear PSVM MATLAB Code

function [w, gamma] = psvm(A, d, nu)
% PSVM: linear and nonlinear classification
% INPUT: A, d = diag(D), nu.  OUTPUT: w, gamma
% [w, gamma] = psvm(A, d, nu);
[m, n] = size(A);
e = ones(m, 1);
H = [A -e];
v = (d'*H)';                     % v = H'*D*e
r = (speye(n+1)/nu + H'*H)\v;    % solve (I/nu + H'*H) r = v
w = r(1:n);
gamma = r(n+1);                  % getting w, gamma from r
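A minimal usage sketch of the code above (the data and the choice nu = 1 are invented for illustration):

rng(0);                                   % reproducible toy data
m = 200; n = 2;
A = [randn(m/2,n) + 2; randn(m/2,n) - 2]; % two Gaussian clusters
d = [ones(m/2,1); -ones(m/2,1)];
[w, gamma] = psvm(A, d, 1);
pred = sign(A*w - gamma);                 % classify the training points
fprintf('Training correctness: %.1f%%\n', 100*mean(pred == d));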
Numerical Experiments: One-Billion-Point Two-Class Dataset
• Synthetic dataset consisting of 1 billion points in 10-dimensional input space
• Generated by the NDC (Normally Distributed Clusters) dataset generator
• Dataset divided into 500 blocks of 2 million points each
• Solution obtained in less than 2 hours and 26 minutes on a 400 MHz processor
• About 30% of the time was spent reading data from disk
• Testing set correctness: 90.79%
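A sketch of how such a blockwise solution can proceed for linear PSVM: only the (n+1)-by-(n+1) matrix H′H and the (n+1)-vector H′De are accumulated, so each block can be discarded after it is read. The loadBlock reader below is hypothetical, standing in for whatever disk format the blocks use.

n = 10; nu = 1; numBlocks = 500;
HtH = zeros(n+1);                         % running sum of H'*H
v = zeros(n+1,1);                         % running sum of H'*D*e
for blk = 1:numBlocks
    [Ablk, dblk] = loadBlock(blk);        % hypothetical: returns one block's A and labels d
    H = [Ablk, -ones(size(Ablk,1),1)];
    HtH = HtH + H'*H;
    v = v + H'*dblk;                      % H'*D*e = H'*d, since D*e = d
    clear Ablk dblk H                     % retire the old data from memory
end
r = (eye(n+1)/nu + HtH)\v;
w = r(1:n); gamma = r(n+1);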
Principal Topics • Knowledge-based classification (NIPS*2002)
Incorporating Knowledge Sets Into an SVM Classifier
• Suppose that the knowledge set {x | Bx ≤ b} belongs to the class A+. Hence it must lie in the halfspace {x | x′w ≥ γ + 1}.
• We therefore have the implication:
  Bx ≤ b ⇒ x′w ≥ γ + 1
• This implication is equivalent to a set of constraints that can be imposed on the classification problem, as sketched below.
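The equivalence follows from linear programming duality; a sketch, assuming the knowledge set {x | Bx ≤ b} is nonempty:
  Bx ≤ b ⇒ x′w ≥ γ + 1   holds if and only if   there exists u ≥ 0 with B′u + w = 0 and b′u + γ + 1 ≤ 0.
Since these conditions are linear in (u, w, γ), they can be appended directly to the SVM optimization problem as constraints, with slack variables to allow for inexact knowledge.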
Numerical Testing: The Promoter Recognition Dataset
• Promoter: short DNA sequence that precedes a gene sequence
• A promoter consists of 57 consecutive DNA nucleotides belonging to {A, G, C, T}
• It is important to distinguish between promoters and nonpromoters
• This distinction identifies starting locations of genes in long uncharacterized DNA sequences
The Promoter Recognition Dataset: Numerical Representation
• Simple "1 of N" mapping scheme for converting nominal attributes into a real-valued representation
• Not the most economical representation, but commonly used
The Promoter Recognition Dataset: Numerical Representation
• Feature space mapped from the 57-dimensional nominal space to a real-valued 57 x 4 = 228 dimensional space:
  57 nominal values become 57 x 4 = 228 binary values, as illustrated below
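For instance, under the (assumed) assignment A → 1000, G → 0100, C → 0010, T → 0001, a 57-nucleotide sequence becomes a 228-dimensional binary vector:

seq = repmat('AGCT', 1, 15); seq = seq(1:57);  % toy 57-nucleotide sequence
alphabet = 'AGCT';                             % encoding order is an assumption
x = zeros(1, 4*numel(seq));
for i = 1:numel(seq)
    x(4*(i-1) + find(alphabet == seq(i))) = 1; % set the "1 of N" bit for position i
end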
Promoter Recognition Dataset: Prior Knowledge Rules
• Prior knowledge consists of the following 64 rules:
Promoter Recognition Dataset: Sample Rules
Here pⱼ denotes the nucleotide at position j with respect to a meaningful reference point, with positions starting at p₋₅₀ and ending at p₇. Then:
[sample rules shown on the slide]
The Promoter Recognition Dataset: Comparative Algorithms
• KBANN: knowledge-based artificial neural network [Shavlik et al.]
• BP: standard backpropagation for neural networks [Rumelhart et al.]
• O'Neill's method: empirical method suggested by biologist O'Neill [O'Neill]
• NN: nearest neighbor with k = 3 [Cost et al.]
• ID3: Quinlan's decision tree builder [Quinlan]
• SVM1: standard 1-norm SVM [Bradley et al.]
Wisconsin Breast Cancer Prognosis Dataset: Description of the Data
• 110 instances corresponding to 41 patients whose cancer had recurred and 69 patients whose cancer had not recurred
• 32 numerical features
• The domain theory: two simple rules used by doctors:
Wisconsin Breast Cancer Prognosis Dataset: Numerical Testing Results
• Doctors' rules applicable to only 32 out of the 110 patients
• Only 22 of those 32 patients are classified correctly by the rules (20% correctness over all 110 patients)
• KSVM linear classifier applicable to all patients, with correctness of 66.4%
• Correctness comparable to the best available results using conventional SVMs
• KSVM can obtain classifiers from knowledge alone, without using any data
Principal Topics • Fast Newton method classifier
Fast Newton Algorithm for Classification
Standard quadratic programming (QP) formulation of SVM (with squared error):
  min_{w,γ,y} (ν/2) y′y + ½ (w′w + γ²)
  s.t. D(Aw − eγ) + y ≥ e, y ≥ 0
At a solution, y = (e − D(Aw − eγ))₊, where (·)₊ replaces negative components by zero; substituting gives the equivalent smooth unconstrained minimization to which the Newton method is applied:
  min_{w,γ} (ν/2) ‖(e − D(Aw − eγ))₊‖² + ½ (w′w + γ²)
Newton Algorithm
• The Newton algorithm terminates in a finite number of steps
• Termination at the global minimum
• Error rate decreases linearly
• Can generate complex nonlinear classifiers by using nonlinear kernels K(x, y); a sketch of the linear case follows
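A minimal sketch of the linear case (my own illustration, not the authors' code; the Armijo stepsize safeguard and the termination test of the full algorithm are omitted):

function z = newton_svm_sketch(A, d, nu, iters)
% Minimize f(z) = nu/2*||(e - D*H*z)_+||^2 + 1/2*||z||^2 with z = [w; gamma]
[m, n] = size(A);
H = [A, -ones(m,1)];
z = zeros(n+1,1);
for k = 1:iters
    p = 1 - d.*(H*z);                         % e - D*H*z, componentwise
    grad = z - nu*(H'*(d.*max(p,0)));         % gradient of f
    act = p > 0;                              % active points define the generalized Hessian
    Hess = eye(n+1) + nu*(H(act,:)'*H(act,:));
    z = z - Hess\grad;                        % Newton step (unit stepsize assumed)
end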
Principal Topics • Breast cancer prognosis & chemotherapy
Kaplan-Meier Curves for Overall Patients: With & Without Chemotherapy
Breast Cancer Prognosis & Chemotherapy
Good, Intermediate & Poor Patient Groupings
(6 input features: 5 cytological, 1 histological)
(Grouping utilizes 2 histological features & chemotherapy)
Kaplan-Meier Survival Curves for Good, Intermediate & Poor Patients
82.7% classifier correctness via 3 SVMs
Kaplan-Meier Survival Curves for the Intermediate Group
Note the reversed role of chemotherapy
Conclusion
• New methods for classification, all based on a rigorous mathematical foundation
• Fast computational algorithms capable of classifying massive datasets
• Classifiers based on both abstract prior knowledge and conventional datasets
• Identification of breast cancer patients who may benefit from chemotherapy
Future Work
• Extend the proposed methods to broader optimization problems
  • Linear & quadratic programming
  • Preliminary results beat state-of-the-art software
• Incorporate abstract concepts into optimization problems as constraints
• Develop fast online algorithms for intrusion and fraud detection
• Classify the effectiveness of new drug cocktails in combating various forms of cancer
  • Encouraging preliminary results for breast cancer
Breast Cancer Treatment Response: Joint with ExonHit (French Biotech)
• 35 patients treated by a drug cocktail
• 9 partial responders; 26 nonresponders
• 25 gene expression measurements made on each patient
• 1-norm SVM classifier selected 12 of the 25 genes
• Combinatorially selected 6 genes out of the 12
• Separating plane obtained (applied in the sketch below):
  2.7915 T11 + 0.13436 S24 − 1.0269 U23 − 2.8108 Z23 − 1.8668 A19 − 1.5177 X05 + 2899.1 = 0
• Leave-one-out error: 1 out of 35 (97.1% correctness)
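A sketch of applying the reported plane to a new patient; the expression values below are invented, and which side of the plane corresponds to "partial responder" is an assumption:

w = [2.7915; 0.13436; -1.0269; -2.8108; -1.8668; -1.5177];  % T11 S24 U23 Z23 A19 X05
x = [120 85 240 60 150 95];               % hypothetical expression measurements
score = x*w + 2899.1;                     % which side of the separating plane?
if score >= 0
    disp('predicted: partial responder')  % side assignment is an assumption
else
    disp('predicted: nonresponder')
end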
Detection of Alternative RNA Isoforms via DATAS (Levels of mRNA that Correlate with Sensitivity to Chemotherapy)
DATAS: Differential Analysis of Transcripts with Alternative Splicing
[Diagram: DNA is transcribed into pre-mRNA (exons E1–E5, introns I1–I4); alternative RNA splicing yields two mRNAs, one retaining exon E3 and one lacking it, which are translated into different proteins; DATAS contrasts the chemo-sensitive and chemo-resistant isoforms]
Talk Available www.cs.wisc.edu/~olvi