Nonlinear Data Discrimination via Generalized Support Vector Machines
David R. Musicant and Olvi L. Mangasarian
University of Wisconsin - Madison
www.cs.wisc.edu/~musicant
Outline
• The linear support vector machine (SVM)
  • Linear kernel
• Generalized support vector machine (GSVM)
  • Nonlinear indefinite kernel
• Linear Programming Formulation of GSVM
  • MINOS
• Quadratic Programming Formulation of GSVM
  • Successive Overrelaxation (SOR)
• Numerical comparisons
• Conclusions
The Discrimination Problem
The Fundamental 2-Category Linearly Separable Case
[Figure: point sets A+ and A- separated by the surface x'w = γ]
The Discrimination Problem
The Fundamental 2-Category Linearly Separable Case
• Given m points in the n-dimensional space R^n
• Represented by an m x n matrix A
• Membership of each point A_i in the classes +1 or -1 is specified by:
  • An m x m diagonal matrix D with +1 or -1 along its diagonal
• Separate by two bounding planes, x'w = γ + 1 and x'w = γ - 1, such that:
  A_i w ≥ γ + 1 for D_ii = +1, and A_i w ≤ γ - 1 for D_ii = -1
• More succinctly: D(Aw - eγ) ≥ e, where e is a vector of ones
(A toy instantiation of this notation follows below.)
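A minimal numeric sketch of the notation above, with toy points and an assumed separating plane (w, γ); all values are illustrative, not from the talk:

```python
import numpy as np

# Toy data: rows of A are the m points in R^n, d holds the +1/-1 labels.
A = np.array([[2.0, 2.0],
              [3.0, 1.0],    # two points of class +1
              [-1.0, -1.0],
              [-2.0, 0.0]])  # two points of class -1
d = np.array([1.0, 1.0, -1.0, -1.0])
D = np.diag(d)               # m x m diagonal label matrix
e = np.ones(A.shape[0])      # vector of ones

# Any (w, gamma) satisfying D(Aw - e*gamma) >= e places the classes on
# opposite sides of the bounding planes x'w = gamma +/- 1.
w, gamma = np.array([1.0, 1.0]), 0.0   # assumed values for this toy set
print(D @ (A @ w - e * gamma) >= e)   # all True -> linearly separated
```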
Preliminary Attempt at the (Linear) Support Vector Machine:
Robust Linear Programming
• Solve the following mathematical program:
  min_{w, γ, y} e'y
  s.t. D(Aw - eγ) + y ≥ e, y ≥ 0
  where y is a nonnegative error (slack) vector
• Note: y = 0 if the convex hulls of A+ and A- do not intersect.
(A solver sketch follows below.)
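A hedged solver sketch for this robust LP using scipy.optimize.linprog, stacking the variables as z = [w, γ, y]; the data is a toy example, not the talk's:

```python
import numpy as np
from scipy.optimize import linprog

# Robust LP:  min e'y  s.t.  D(Aw - e*gamma) + y >= e,  y >= 0.
A = np.array([[2., 2.], [3., 1.], [-1., -1.], [-2., 0.]])  # toy data
D = np.diag([1., 1., -1., -1.])
m, n = A.shape
e = np.ones(m)

# Variables z = [w (n), gamma (1), y (m)]; objective weights only the slacks.
c = np.concatenate([np.zeros(n + 1), np.ones(m)])
# D(Aw - e*gamma) + y >= e  rewritten as  -(DA)w + (De)gamma - y <= -e.
G = np.hstack([-D @ A, (D @ e)[:, None], -np.eye(m)])
bounds = [(None, None)] * (n + 1) + [(0, None)] * m  # w, gamma free; y >= 0
res = linprog(c, A_ub=G, b_ub=-e, bounds=bounds)
w, gamma, y = res.x[:n], res.x[n], res.x[n + 1:]
print("total slack e'y =", y.sum())  # 0 when the convex hulls do not intersect
```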
The (Linear) Support Vector Machine
Maximize Margin Between Separating Planes
[Figure: A+ and A- bounded by the planes x'w = γ + 1 and x'w = γ - 1; the margin between the planes is 2/‖w‖]
The (Linear) Support Vector Machine Formulation
• Solve the following mathematical program:
  min_{w, γ, y} νe'y + (1/2)‖w‖²
  s.t. D(Aw - eγ) + y ≥ e, y ≥ 0
  where y is a nonnegative error (slack) vector
• Suppressing ‖w‖ maximizes the margin 2/‖w‖ between the bounding planes (using ‖w‖₁ instead yields a linear program)
• Note: y = 0 if the convex hulls of A+ and A- do not intersect.
GSVM: Generalized Support Vector Machine
Linear Programming Formulation
• Linear support vector machine (linear separating surface x'w = γ)
• By "duality", set w = A'Du (linear separating surface x'A'Du = γ)
• Nonlinear support vector machine: replace AA' by a nonlinear kernel K(A, A')
• Nonlinear separating surface: K(x', A')Du = γ
(A linear-programming sketch of one such formulation follows below.)
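A hedged sketch of one LP formulation of the GSVM, assuming 1-norm regularization e's on the dual variable u (linearized via -s ≤ u ≤ s) and a toy polynomial kernel; the exact objective used in the talk may differ:

```python
import numpy as np
from scipy.optimize import linprog

# One LP-GSVM variant:  min  nu*e'y + e's
# s.t.  D(K(A,A')Du - e*gamma) + y >= e,  -s <= u <= s,  y >= 0.
A = np.array([[2., 2.], [3., 1.], [-1., -1.], [-2., 0.]])  # toy data
d = np.array([1., 1., -1., -1.])
D, m = np.diag(d), A.shape[0]
e = np.ones(m)
nu = 1.0
K = (A @ A.T + 1.0) ** 2          # assumed polynomial kernel, degree 2

# Stack variables z = [u (m), gamma (1), y (m), s (m)].
c = np.concatenate([np.zeros(m + 1), nu * e, e])
rows, rhs = [], []
# -(DKD)u + (De)gamma - y <= -e    (the separation constraint, negated)
rows.append(np.hstack([-D @ K @ D, (D @ e)[:, None],
                       -np.eye(m), np.zeros((m, m))]))
rhs.append(-e)
#  u - s <= 0  and  -u - s <= 0    (linearize |u| <= s)
rows.append(np.hstack([np.eye(m), np.zeros((m, 1)),
                       np.zeros((m, m)), -np.eye(m)]))
rhs.append(np.zeros(m))
rows.append(np.hstack([-np.eye(m), np.zeros((m, 1)),
                       np.zeros((m, m)), -np.eye(m)]))
rhs.append(np.zeros(m))
G, h = np.vstack(rows), np.concatenate(rhs)
bounds = [(None, None)] * (m + 1) + [(0, None)] * m + [(None, None)] * m
res = linprog(c, A_ub=G, b_ub=h, bounds=bounds)
u, gamma = res.x[:m], res.x[m]
# Nonlinear separating surface: K(x', A') D u = gamma.
```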
Examples of Kernels
• Polynomial kernel: (AA' + μee')^d_•
  • (·)^d_• denotes componentwise exponentiation as in MATLAB
• Radial basis kernel: K(A, A')_ij = exp(-μ‖A_i - A_j‖²)
• Neural network kernel: (AA' + μee')_*
  • (·)_* denotes the step function applied componentwise
(Sketches of these kernels follow below.)
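Minimal sketches of the three kernel families, with assumed parameterizations (μ, d); any kernel of this shape can be substituted for AA' in the GSVM:

```python
import numpy as np

def polynomial_kernel(A, B, mu=1.0, d=2):
    # Componentwise exponentiation, as with .^ in MATLAB.
    return (A @ B.T + mu) ** d

def rbf_kernel(A, B, mu=1.0):
    # K_ij = exp(-mu * ||A_i - B_j||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * sq)

def neural_network_kernel(A, B, mu=0.0):
    # Step function applied componentwise.
    return (A @ B.T + mu >= 0).astype(float)

A = np.random.randn(5, 3)
print(polynomial_kernel(A, A).shape)   # (5, 5)
```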
A Nonlinear Kernel Application
Checkerboard Training Set: 1000 Points in R^2
Separate 486 Asterisks from 514 Dots
Large Margin Classifier
(SOR) Reformulation in R^{n+1} Space
[Figure: A+ and A- with bounding planes; the margin 2/√(w'w + γ²) is measured in the space of (w, γ)]
(SOR) Linear Support Vector Machine
Quadratic Programming Formulation
• Solve the following mathematical program:
  min_{w, γ, y} νe'y + (1/2)(w'w + γ²)
  s.t. D(Aw - eγ) + y ≥ e, y ≥ 0
• The quadratic term here maximizes the distance 2/√(w'w + γ²) between the bounding planes in the space R^{n+1} of (w, γ).
(See the solver sketch below.)
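A hedged sketch that hands this primal QP to a general-purpose solver (SLSQP via scipy.optimize.minimize); toy data, illustrative only:

```python
import numpy as np
from scipy.optimize import minimize

# Primal QP:  min  nu*e'y + (1/2)(w'w + gamma^2)
#             s.t. D(Aw - e*gamma) + y >= e,  y >= 0.
A = np.array([[2., 2.], [3., 1.], [-1., -1.], [-2., 0.]])  # toy data
d = np.array([1., 1., -1., -1.])                           # class labels
m, n = A.shape
nu = 1.0

def objective(z):                    # z = [w (n), gamma (1), y (m)]
    w, gamma, y = z[:n], z[n], z[n + 1:]
    return nu * y.sum() + 0.5 * (w @ w + gamma ** 2)

constraints = [{'type': 'ineq',      # componentwise D(Aw - e*gamma) + y - e >= 0
                'fun': lambda z: d * (A @ z[:n] - z[n]) + z[n + 1:] - 1.0}]
bounds = [(None, None)] * (n + 1) + [(0, None)] * m        # y >= 0
res = minimize(objective, np.zeros(n + 1 + m), method='SLSQP',
               bounds=bounds, constraints=constraints)
w, gamma = res.x[:n], res.x[n]
print("w =", w, "gamma =", gamma)
```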
Introducing a Nonlinear Kernel
• The Wolfe dual of the SOR linear SVM is:
  min_{u} (1/2) u'D(AA' + ee')Du - e'u s.t. 0 ≤ u ≤ νe
• Linear separating surface: x'A'Du + e'Du = 0, where w = A'Du and γ = -e'Du
• Substitute a kernel K(A, A') for the AA' term:
  min_{u} (1/2) u'D(K(A, A') + ee')Du - e'u s.t. 0 ≤ u ≤ νe
• Nonlinear separating surface: K(x', A')Du + e'Du = 0
(See the sketch below.)
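A sketch of the kernel substitution in the dual: only the matrix D(AA' + ee')D changes, with K(A, A') replacing AA'. The kernel argument is an assumed callable such as the kernel sketches above:

```python
import numpy as np

def dual_matrix(A, D, kernel=None):
    # Wolfe dual matrix: D(AA' + ee')D, or D(K(A,A') + ee')D with a kernel.
    e = np.ones(A.shape[0])
    K = A @ A.T if kernel is None else kernel(A, A)
    return D @ (K + np.outer(e, e)) @ D

def dual_objective(u, M):
    # min (1/2) u'Mu - e'u  over the box  0 <= u <= nu*e
    return 0.5 * u @ M @ u - u.sum()

def surface(x, A, D, u, kernel=None):
    # Separating surface value (K(x', A') + e')Du; its sign classifies x.
    Kx = x @ A.T if kernel is None else kernel(x[None, :], A).ravel()
    return (Kx + 1.0) @ (D @ u)
```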
SVM Optimality Conditions
• Define M = D(AA' + ee')D (or D(K(A, A') + ee')D with a kernel)
• Then the dual SVM becomes much simpler:
  min_{u} (1/2) u'Mu - e'u s.t. 0 ≤ u ≤ νe
• Gradient projection necessary & sufficient optimality condition:
  u = (u - ω(Mu - e))_# for any ω > 0
• (·)_# denotes projecting u onto the region 0 ≤ u ≤ νe
(A numerical check follows below.)
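A small numerical check of this condition, assuming the box projection (·)_# is componentwise clipping to [0, ν]:

```python
import numpy as np

def project(u, nu):
    # (.)_# : componentwise projection onto the box 0 <= u <= nu*e
    return np.clip(u, 0.0, nu)

def is_dual_optimal(u, M, nu, omega=1.0, tol=1e-8):
    # u solves the dual iff u = (u - omega*(Mu - e))_# for any omega > 0.
    e = np.ones(M.shape[0])
    return np.linalg.norm(u - project(u - omega * (M @ u - e), nu)) < tol
```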
SOR Algorithm & Convergence
• The above optimality conditions lead to the SOR algorithm: with E = diag(M) and L the strictly lower-triangular part of M,
  u^{i+1} = (u^i - ωE^{-1}(Mu^i - e + L(u^{i+1} - u^i)))_# , 0 < ω < 2,
  computed one component at a time, each update using the components already computed in the current sweep
• Remember, the optimality conditions are expressed as: u = (u - ω(Mu - e))_#
• SOR linear convergence [Luo-Tseng 1993]:
  • The iterates u^i of the SOR algorithm converge R-linearly to a solution ū of the dual problem
  • The objective function values f(u^i) converge Q-linearly to f(ū)
(A code sketch follows below.)
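A minimal sketch of the SOR sweep described above: componentwise projected updates that use each new component immediately (the Gauss-Seidel effect), with relaxation factor ω in (0, 2). Data and ν are toy choices:

```python
import numpy as np

def sor_svm(M, nu, omega=1.0, tol=1e-6, max_iter=10000):
    m = M.shape[0]
    u = np.zeros(m)
    diag = M.diagonal()
    for it in range(max_iter):
        u_old = u.copy()
        for j in range(m):
            # M[j] @ u already uses the updated components u[0..j-1],
            # which realizes the L(u^{i+1} - u^i) term of the matrix form.
            grad_j = M[j] @ u - 1.0
            u[j] = min(max(u[j] - omega * grad_j / diag[j], 0.0), nu)
        if np.linalg.norm(u - u_old) < tol:
            break
    return u

# Toy problem: same A, d as the earlier sketches, linear kernel K = AA'.
A = np.array([[2., 2.], [3., 1.], [-1., -1.], [-2., 0.]])
d = np.array([1., 1., -1., -1.])
D = np.diag(d)
e = np.ones(4)
M = D @ (A @ A.T + np.outer(e, e)) @ D
u = sor_svm(M, nu=1.0)
w, gamma = A.T @ (d * u), -e @ (d * u)   # recover the separating plane
print("plane normal w =", w, " threshold gamma =", gamma)
```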
Numerical Testing
• Comparison of linear & nonlinear kernels using:
  • Linear programming formulations
  • Quadratic programming (SOR) formulations
• Data sets:
  • UCI Liver Disorders: 345 points in R^6
  • Bell Labs Checkerboard: 1000 points in R^2
  • Gaussian Synthetic: 1000 points in R^32
  • SCDS Synthetic: 1 million points in R^32
  • Massive Synthetic: 10 million points in R^32
• Machines:
  • Cluster of 4 Sun Enterprise E6000 machines, each consisting of 16 UltraSPARC II 250 MHz processors with 2 Gig RAM
  • Total: 64 processors, 8 Gig RAM
Comparison of Linear & Nonlinear SVMs
Linear Programming Generated
• Nonlinear kernels yield better training and testing set correctness
SOR Results
• Comparison of linear and nonlinear kernels
• Examples of training on massive data:
  • 1 million point dataset generated by the SCDS generator:
    • Trained completely in 9.7 hours
    • Tuning set reached 99.7% of final accuracy in 0.3 hours
  • 10 million point randomly generated dataset:
    • Tuning set reached 95% of final accuracy in 14.3 hours
    • Under 10,000 iterations
Conclusions
• Linear programming and successive overrelaxation can generate complex nonlinear separating surfaces via GSVMs
• Nonlinear separating surfaces improve generalization over linear ones
• SOR can handle very large problems not (easily) solvable by other methods
• SOR scales up with virtually no changes
• Future directions:
  • Parallel SOR for very large problems not resident in memory
  • Massive multicategory discrimination via SOR
  • Support vector regression