Nonlinear Data Discrimination via Generalized Support Vector Machines
David R. Musicant and Olvi L. Mangasarian
University of Wisconsin - Madison
www.cs.wisc.edu/~musicant
Outline
• The linear support vector machine (SVM)
  • Linear kernel
• Generalized support vector machine (GSVM)
  • Nonlinear indefinite kernel
• Linear Programming Formulation of GSVM
  • MINOS
• Quadratic Programming Formulation of GSVM
  • Successive Overrelaxation (SOR)
• Numerical comparisons
• Conclusions
The Discrimination Problem
The Fundamental 2-Category Linearly Separable Case
[Figure: point sets A+ and A- separated by the surface x'w = γ]
The Discrimination Problem
The Fundamental 2-Category Linearly Separable Case
• Given m points in the n-dimensional space R^n
• Represented by an m x n matrix A
• Membership of each point A_i in the classes +1 or -1 is specified by:
  • An m x m diagonal matrix D with +1 or -1 along its diagonal
• Separate by two bounding planes, x'w = γ + 1 and x'w = γ - 1, such that:
  A_i w ≥ γ + 1 for D_ii = +1, and A_i w ≤ γ - 1 for D_ii = -1
• More succinctly: D(Aw - eγ) ≥ e, where e is a vector of ones
(A toy instantiation of this notation follows below.)
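A minimal numeric sketch of the notation above, with toy points and an assumed separating plane (w, γ); all values are illustrative, not from the talk:

```python
import numpy as np

# Toy data: rows of A are the m points in R^n, d holds the +1/-1 labels.
A = np.array([[2.0, 2.0],
              [3.0, 1.0],    # two points of class +1
              [-1.0, -1.0],
              [-2.0, 0.0]])  # two points of class -1
d = np.array([1.0, 1.0, -1.0, -1.0])
D = np.diag(d)               # m x m diagonal label matrix
e = np.ones(A.shape[0])      # vector of ones

# Any (w, gamma) satisfying D(Aw - e*gamma) >= e places the classes on
# opposite sides of the bounding planes x'w = gamma +/- 1.
w, gamma = np.array([1.0, 1.0]), 0.0   # assumed values for this toy set
print(D @ (A @ w - e * gamma) >= e)   # all True -> linearly separated
```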
Preliminary Attempt at the (Linear) Support Vector Machine:
Robust Linear Programming
• Solve the following mathematical program:
  min_{w, γ, y} e'y
  s.t. D(Aw - eγ) + y ≥ e, y ≥ 0
  where y is a nonnegative error (slack) vector
• Note: y = 0 if the convex hulls of A+ and A- do not intersect.
(A solver sketch follows below.)
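A hedged solver sketch for this robust LP using scipy.optimize.linprog, stacking the variables as z = [w, γ, y]; the data is a toy example, not the talk's:

```python
import numpy as np
from scipy.optimize import linprog

# Robust LP:  min e'y  s.t.  D(Aw - e*gamma) + y >= e,  y >= 0.
A = np.array([[2., 2.], [3., 1.], [-1., -1.], [-2., 0.]])  # toy data
D = np.diag([1., 1., -1., -1.])
m, n = A.shape
e = np.ones(m)

# Variables z = [w (n), gamma (1), y (m)]; objective weights only the slacks.
c = np.concatenate([np.zeros(n + 1), np.ones(m)])
# D(Aw - e*gamma) + y >= e  rewritten as  -(DA)w + (De)gamma - y <= -e.
G = np.hstack([-D @ A, (D @ e)[:, None], -np.eye(m)])
bounds = [(None, None)] * (n + 1) + [(0, None)] * m  # w, gamma free; y >= 0
res = linprog(c, A_ub=G, b_ub=-e, bounds=bounds)
w, gamma, y = res.x[:n], res.x[n], res.x[n + 1:]
print("total slack e'y =", y.sum())  # 0 when the convex hulls do not intersect
```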
The (Linear) Support Vector Machine
Maximize Margin Between Separating Planes
[Figure: A+ and A- bounded by the planes x'w = γ + 1 and x'w = γ - 1; the margin between the planes is 2/‖w‖]
The (Linear) Support Vector Machine Formulation
• Solve the following mathematical program:
  min_{w, γ, y} νe'y + (1/2)‖w‖²
  s.t. D(Aw - eγ) + y ≥ e, y ≥ 0
  where y is a nonnegative error (slack) vector
• Suppressing ‖w‖ maximizes the margin 2/‖w‖ between the bounding planes (using ‖w‖₁ instead yields a linear program)
• Note: y = 0 if the convex hulls of A+ and A- do not intersect.
GSVM: Generalized Support Vector Machine
Linear Programming Formulation
• Linear support vector machine (linear separating surface x'w = γ)
• By "duality", set w = A'Du (linear separating surface x'A'Du = γ)
• Nonlinear support vector machine: replace AA' by a nonlinear kernel K(A, A')
• Nonlinear separating surface: K(x', A')Du = γ
(A linear-programming sketch of one such formulation follows below.)
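A hedged sketch of one LP formulation of the GSVM, assuming 1-norm regularization e's on the dual variable u (linearized via -s ≤ u ≤ s) and a toy polynomial kernel; the exact objective used in the talk may differ:

```python
import numpy as np
from scipy.optimize import linprog

# One LP-GSVM variant:  min  nu*e'y + e's
# s.t.  D(K(A,A')Du - e*gamma) + y >= e,  -s <= u <= s,  y >= 0.
A = np.array([[2., 2.], [3., 1.], [-1., -1.], [-2., 0.]])  # toy data
d = np.array([1., 1., -1., -1.])
D, m = np.diag(d), A.shape[0]
e = np.ones(m)
nu = 1.0
K = (A @ A.T + 1.0) ** 2          # assumed polynomial kernel, degree 2

# Stack variables z = [u (m), gamma (1), y (m), s (m)].
c = np.concatenate([np.zeros(m + 1), nu * e, e])
rows, rhs = [], []
# -(DKD)u + (De)gamma - y <= -e    (the separation constraint, negated)
rows.append(np.hstack([-D @ K @ D, (D @ e)[:, None],
                       -np.eye(m), np.zeros((m, m))]))
rhs.append(-e)
#  u - s <= 0  and  -u - s <= 0    (linearize |u| <= s)
rows.append(np.hstack([np.eye(m), np.zeros((m, 1)),
                       np.zeros((m, m)), -np.eye(m)]))
rhs.append(np.zeros(m))
rows.append(np.hstack([-np.eye(m), np.zeros((m, 1)),
                       np.zeros((m, m)), -np.eye(m)]))
rhs.append(np.zeros(m))
G, h = np.vstack(rows), np.concatenate(rhs)
bounds = [(None, None)] * (m + 1) + [(0, None)] * m + [(None, None)] * m
res = linprog(c, A_ub=G, b_ub=h, bounds=bounds)
u, gamma = res.x[:m], res.x[m]
# Nonlinear separating surface: K(x', A') D u = gamma.
```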
Examples of Kernels
• Polynomial kernel: (AA' + μee')^d_•
  • (·)^d_• denotes componentwise exponentiation as in MATLAB
• Radial basis kernel: K(A, A')_ij = exp(-μ‖A_i - A_j‖²)
• Neural network kernel: (AA' + μee')_*
  • (·)_* denotes the step function applied componentwise
(Sketches of these kernels follow below.)
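Minimal sketches of the three kernel families, with assumed parameterizations (μ, d); any kernel of this shape can be substituted for AA' in the GSVM:

```python
import numpy as np

def polynomial_kernel(A, B, mu=1.0, d=2):
    # Componentwise exponentiation, as with .^ in MATLAB.
    return (A @ B.T + mu) ** d

def rbf_kernel(A, B, mu=1.0):
    # K_ij = exp(-mu * ||A_i - B_j||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * sq)

def neural_network_kernel(A, B, mu=0.0):
    # Step function applied componentwise.
    return (A @ B.T + mu >= 0).astype(float)

A = np.random.randn(5, 3)
print(polynomial_kernel(A, A).shape)   # (5, 5)
```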
A Nonlinear Kernel Application
Checkerboard Training Set: 1000 Points in R^2
Separate 486 Asterisks from 514 Dots
Large Margin Classifier
(SOR) Reformulation in R^{n+1} Space
[Figure: A+ and A- with bounding planes; the margin 2/√(w'w + γ²) is measured in the space of (w, γ)]
(SOR) Linear Support Vector Machine
Quadratic Programming Formulation
• Solve the following mathematical program:
  min_{w, γ, y} νe'y + (1/2)(w'w + γ²)
  s.t. D(Aw - eγ) + y ≥ e, y ≥ 0
• The quadratic term here maximizes the distance 2/√(w'w + γ²) between the bounding planes in the space R^{n+1} of (w, γ).
(See the solver sketch below.)
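A hedged sketch that hands this primal QP to a general-purpose solver (SLSQP via scipy.optimize.minimize); toy data, illustrative only:

```python
import numpy as np
from scipy.optimize import minimize

# Primal QP:  min  nu*e'y + (1/2)(w'w + gamma^2)
#             s.t. D(Aw - e*gamma) + y >= e,  y >= 0.
A = np.array([[2., 2.], [3., 1.], [-1., -1.], [-2., 0.]])  # toy data
d = np.array([1., 1., -1., -1.])                           # class labels
m, n = A.shape
nu = 1.0

def objective(z):                    # z = [w (n), gamma (1), y (m)]
    w, gamma, y = z[:n], z[n], z[n + 1:]
    return nu * y.sum() + 0.5 * (w @ w + gamma ** 2)

constraints = [{'type': 'ineq',      # componentwise D(Aw - e*gamma) + y - e >= 0
                'fun': lambda z: d * (A @ z[:n] - z[n]) + z[n + 1:] - 1.0}]
bounds = [(None, None)] * (n + 1) + [(0, None)] * m        # y >= 0
res = minimize(objective, np.zeros(n + 1 + m), method='SLSQP',
               bounds=bounds, constraints=constraints)
w, gamma = res.x[:n], res.x[n]
print("w =", w, "gamma =", gamma)
```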
Introducing a Nonlinear Kernel
• The Wolfe dual of the SOR linear SVM is:
  min_{u} (1/2) u'D(AA' + ee')Du - e'u s.t. 0 ≤ u ≤ νe
• Linear separating surface: x'A'Du + e'Du = 0, where w = A'Du and γ = -e'Du
• Substitute a kernel K(A, A') for the AA' term:
  min_{u} (1/2) u'D(K(A, A') + ee')Du - e'u s.t. 0 ≤ u ≤ νe
• Nonlinear separating surface: K(x', A')Du + e'Du = 0
(See the sketch below.)
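A sketch of the kernel substitution in the dual: only the matrix D(AA' + ee')D changes, with K(A, A') replacing AA'. The kernel argument is an assumed callable such as the kernel sketches above:

```python
import numpy as np

def dual_matrix(A, D, kernel=None):
    # Wolfe dual matrix: D(AA' + ee')D, or D(K(A,A') + ee')D with a kernel.
    e = np.ones(A.shape[0])
    K = A @ A.T if kernel is None else kernel(A, A)
    return D @ (K + np.outer(e, e)) @ D

def dual_objective(u, M):
    # min (1/2) u'Mu - e'u  over the box  0 <= u <= nu*e
    return 0.5 * u @ M @ u - u.sum()

def surface(x, A, D, u, kernel=None):
    # Separating surface value (K(x', A') + e')Du; its sign classifies x.
    Kx = x @ A.T if kernel is None else kernel(x[None, :], A).ravel()
    return (Kx + 1.0) @ (D @ u)
```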
SVM Optimality Conditions
• Define M = D(AA' + ee')D (or D(K(A, A') + ee')D with a kernel)
• Then the dual SVM becomes much simpler:
  min_{u} (1/2) u'Mu - e'u s.t. 0 ≤ u ≤ νe
• Gradient projection necessary & sufficient optimality condition:
  u = (u - ω(Mu - e))_# for any ω > 0
• (·)_# denotes projecting u onto the region 0 ≤ u ≤ νe
(A numerical check follows below.)
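A small numerical check of this condition, assuming the box projection (·)_# is componentwise clipping to [0, ν]:

```python
import numpy as np

def project(u, nu):
    # (.)_# : componentwise projection onto the box 0 <= u <= nu*e
    return np.clip(u, 0.0, nu)

def is_dual_optimal(u, M, nu, omega=1.0, tol=1e-8):
    # u solves the dual iff u = (u - omega*(Mu - e))_# for any omega > 0.
    e = np.ones(M.shape[0])
    return np.linalg.norm(u - project(u - omega * (M @ u - e), nu)) < tol
```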
SOR Algorithm & Convergence
• The above optimality conditions lead to the SOR algorithm: with E = diag(M) and L the strictly lower-triangular part of M,
  u^{i+1} = (u^i - ωE^{-1}(Mu^i - e + L(u^{i+1} - u^i)))_# , 0 < ω < 2,
  computed one component at a time, each update using the components already computed in the current sweep
• Remember, the optimality conditions are expressed as: u = (u - ω(Mu - e))_#
• SOR linear convergence [Luo-Tseng 1993]:
  • The iterates u^i of the SOR algorithm converge R-linearly to a solution ū of the dual problem
  • The objective function values f(u^i) converge Q-linearly to f(ū)
(A code sketch follows below.)
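A minimal sketch of the SOR sweep described above: componentwise projected updates that use each new component immediately (the Gauss-Seidel effect), with relaxation factor ω in (0, 2). Data and ν are toy choices:

```python
import numpy as np

def sor_svm(M, nu, omega=1.0, tol=1e-6, max_iter=10000):
    m = M.shape[0]
    u = np.zeros(m)
    diag = M.diagonal()
    for it in range(max_iter):
        u_old = u.copy()
        for j in range(m):
            # M[j] @ u already uses the updated components u[0..j-1],
            # which realizes the L(u^{i+1} - u^i) term of the matrix form.
            grad_j = M[j] @ u - 1.0
            u[j] = min(max(u[j] - omega * grad_j / diag[j], 0.0), nu)
        if np.linalg.norm(u - u_old) < tol:
            break
    return u

# Toy problem: same A, d as the earlier sketches, linear kernel K = AA'.
A = np.array([[2., 2.], [3., 1.], [-1., -1.], [-2., 0.]])
d = np.array([1., 1., -1., -1.])
D = np.diag(d)
e = np.ones(4)
M = D @ (A @ A.T + np.outer(e, e)) @ D
u = sor_svm(M, nu=1.0)
w, gamma = A.T @ (d * u), -e @ (d * u)   # recover the separating plane
print("plane normal w =", w, " threshold gamma =", gamma)
```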
Numerical Testing
• Comparison of linear & nonlinear kernels using:
  • Linear programming formulations
  • Quadratic programming (SOR) formulations
• Data sets:
  • UCI Liver Disorders: 345 points in R^6
  • Bell Labs Checkerboard: 1000 points in R^2
  • Gaussian Synthetic: 1000 points in R^32
  • SCDS Synthetic: 1 million points in R^32
  • Massive Synthetic: 10 million points in R^32
• Machines:
  • Cluster of 4 Sun Enterprise E6000 machines, each consisting of 16 UltraSPARC II 250 MHz processors with 2 Gig RAM
  • Total: 64 processors, 8 Gig RAM
Comparison of Linear & Nonlinear SVMs
Linear Programming Generated
• Nonlinear kernels yield better training and testing set correctness
SOR Results
• Comparison of linear and nonlinear kernels
• Examples of training on massive data:
  • 1 million point dataset generated by the SCDS generator:
    • Trained completely in 9.7 hours
    • Tuning set reached 99.7% of final accuracy in 0.3 hours
  • 10 million point randomly generated dataset:
    • Tuning set reached 95% of final accuracy in 14.3 hours
    • Under 10,000 iterations
Conclusions
• Linear programming and successive overrelaxation can generate complex nonlinear separating surfaces via GSVMs
• Nonlinear separating surfaces improve generalization over linear ones
• SOR can handle very large problems not (easily) solvable by other methods
• SOR scales up with virtually no changes
• Future directions:
  • Parallel SOR for very large problems not resident in memory
  • Massive multicategory discrimination via SOR
  • Support vector regression