Machine Learning and Data Mining via Mathematical Programming Based Support Vector Machines
May 8, 2003
Glenn M. Fung
Ph.D. Dissertation Talk, University of Wisconsin - Madison
Thesis Overview
• Proximal support vector machines (PSVM)
  • Binary classification
  • Multiclass classification
  • Incremental classification (massive datasets)
• Knowledge-based SVMs (KSVM)
  • Linear KSVM
  • Extension to nonlinear KSVM
• Sparse classifiers
  • Data selection for linear classifiers
  • Minimizing the number of support vectors
  • Minimal kernel classifiers
  • Feature selection
• Newton method for SVM
• Semi-supervised SVMs
• Finite Newton method for Lagrangian SVM classifiers
Outline of Talk
• (Standard) support vector machine (SVM)
  • Classify by halfspaces
• Proximal support vector machine (PSVM)
  • Classify by proximity to planes
• Incremental PSVM classifiers
  • Synthetic dataset of 1 billion points in 10-dimensional input space classified in less than 2 hours and 26 minutes
• Knowledge-based SVMs
  • Incorporate prior knowledge sets into classifiers
• Minimal kernel classifiers
  • Reduce data dependence of nonlinear classifiers
Support Vector Machines: Maximizing the Margin between Bounding Planes
(Figure: points of classes A+ and A- separated by two bounding planes; the support vectors are the points lying on the bounding planes.)
Standard Support Vector Machine: Algebra of the 2-Category Linearly Separable Case
• Given m points in n-dimensional space
  • Represented by an m-by-n matrix A
  • Membership of each point in class +1 or -1 specified by an m-by-m diagonal matrix D with +1 and -1 entries
• Separate by two bounding planes, $x'w = \gamma + 1$ and $x'w = \gamma - 1$:
  $A_i w \ge \gamma + 1$ for $D_{ii} = +1$, and $A_i w \le \gamma - 1$ for $D_{ii} = -1$
• More succinctly: $D(Aw - e\gamma) \ge e$, where e is a vector of ones.
Standard Support Vector Machine Formulation
• Solve the quadratic program for some $\nu > 0$:
  (QP) $\min_{w,\gamma,y} \ \nu e'y + \frac{1}{2} w'w$
  s.t. $D(Aw - e\gamma) + y \ge e, \quad y \ge 0$
  where $D_{ii} = \pm 1$ denotes A+ or A- membership.
• The margin $\frac{2}{\|w\|}$ between the bounding planes is maximized by minimizing $\frac{1}{2} w'w$.
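As an illustration of the QP above (not part of the original slides), the formulation can be handed to a general-purpose solver. The sketch below uses SciPy's SLSQP on a small synthetic two-class problem; the data, the value of nu, and the variable ordering z = (w, gamma, y) are assumptions of this example, not the thesis code.

```python
# Minimal sketch: solve the standard soft-margin SVM QP
#   min  nu * e'y + 0.5 * w'w   s.t.  D(Aw - e*gamma) + y >= e,  y >= 0
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
m, n, nu = 40, 2, 1.0
A = np.vstack([rng.normal(+1, 0.5, (m // 2, n)),
               rng.normal(-1, 0.5, (m // 2, n))])
d = np.hstack([np.ones(m // 2), -np.ones(m // 2)])   # diagonal of D

def objective(z):
    w, y = z[:n], z[n + 1:]
    return nu * y.sum() + 0.5 * w @ w

# Stack the constraint D(Aw - e*gamma) + y >= e as G z >= 1, z = [w, gamma, y].
G = np.hstack([d[:, None] * A, -d[:, None], np.eye(m)])
cons = [{"type": "ineq", "fun": lambda z: G @ z - 1.0}]
bounds = [(None, None)] * (n + 1) + [(0, None)] * m   # y >= 0

res = minimize(objective, np.zeros(n + 1 + m), method="SLSQP",
               constraints=cons, bounds=bounds)
w, gamma = res.x[:n], res.x[n]
print("training accuracy:", np.mean(np.sign(A @ w - gamma) == d))
```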
Proximal Support Vector Machines: Fitting the Data Using Two Parallel Bounding Planes
(Figure: the points of classes A+ and A- clustered around the two parallel planes $x'w = \gamma + 1$ and $x'w = \gamma - 1$.)
PSVM Formulation
• We have from the QP SVM formulation:
  (QP) $\min_{w,\gamma,y} \ \nu e'y + \frac{1}{2} w'w$  s.t. $D(Aw - e\gamma) + y \ge e, \ y \ge 0$
• Change the inequality to an equality, square the slack, and add $\gamma^2$ to the regularizer:
  $\min_{w,\gamma,y} \ \frac{\nu}{2} \|y\|^2 + \frac{1}{2}(w'w + \gamma^2)$  s.t. $D(Aw - e\gamma) + y = e$
• Solving for $y$ in terms of $w$ and $\gamma$ gives the unconstrained problem:
  $\min_{w,\gamma} \ \frac{\nu}{2} \|e - D(Aw - e\gamma)\|^2 + \frac{1}{2}(w'w + \gamma^2)$
• This simple but critical modification changes the nature of the optimization problem tremendously!
Advantages of New Formulation
• Objective function remains strongly convex.
• An explicit exact solution can be written in terms of the problem data.
• The PSVM classifier is obtained by solving a single system of linear equations in the usually small dimensional input space.
• Exact leave-one-out correctness can be obtained in terms of the problem data.
Linear PSVM
• We want to solve:
  $\min_{w,\gamma} \ \frac{\nu}{2} \|e - D(Aw - e\gamma)\|^2 + \frac{1}{2}(w'w + \gamma^2)$
• Setting the gradient equal to zero gives a nonsingular system of linear equations.
• The solution of the system gives the desired PSVM classifier.
Linear PSVM Solution
$\begin{bmatrix} w \\ \gamma \end{bmatrix} = \left(\frac{I}{\nu} + E'E\right)^{-1} E'De$
• Here, $E = [A \ \ -e]$.
• The linear system to solve depends on $E'E$, which is of size $(n+1) \times (n+1)$.
• $n$ is usually much smaller than $m$.
Linear Proximal SVM Algorithm
Input: $A$, $D$, $\nu$
Define: $E = [A \ \ -e]$
Calculate: $E'E$ and $E'De$
Solve: $\left(\frac{I}{\nu} + E'E\right)\begin{bmatrix} w \\ \gamma \end{bmatrix} = E'De$
Classifier: $\operatorname{sign}(x'w - \gamma)$
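A minimal NumPy sketch of the algorithm above (not from the original slides): the data matrix A, the labels d (the diagonal of D), and the value of nu are illustrative assumptions.

```python
# Sketch: linear Proximal SVM via one (n+1)x(n+1) linear system.
import numpy as np

def linear_psvm(A, d, nu=1.0):
    """A: m-by-n data matrix, d: +/-1 labels (diagonal of D), nu: tradeoff."""
    m, n = A.shape
    e = np.ones(m)
    E = np.hstack([A, -e[:, None]])          # E = [A  -e]
    lhs = np.eye(n + 1) / nu + E.T @ E       # I/nu + E'E, size (n+1)x(n+1)
    rhs = E.T @ (d * e)                      # E'De
    sol = np.linalg.solve(lhs, rhs)
    return sol[:n], sol[n]                   # w, gamma

# Usage on a small synthetic problem:
rng = np.random.default_rng(0)
A = np.vstack([rng.normal(+1, 0.5, (50, 2)), rng.normal(-1, 0.5, (50, 2))])
d = np.hstack([np.ones(50), -np.ones(50)])
w, gamma = linear_psvm(A, d, nu=1.0)
print("training accuracy:", np.mean(np.sign(A @ w - gamma) == d))
```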
Nonlinear PSVM Formulation
• Linear PSVM (linear separating surface $x'w = \gamma$):
  (QP) $\min_{w,\gamma,y} \ \frac{\nu}{2}\|y\|^2 + \frac{1}{2}(w'w + \gamma^2)$  s.t. $D(Aw - e\gamma) + y = e$
• By QP "duality", $w = A'Du$; working in the "dual space" (regularizing $u$ instead of $w$) gives:
  $\min_{u,\gamma,y} \ \frac{\nu}{2}\|y\|^2 + \frac{1}{2}(u'u + \gamma^2)$  s.t. $D(AA'Du - e\gamma) + y = e$
• Replace $AA'$ by a nonlinear kernel $K(A, A')$.
The Nonlinear Classifier
• The nonlinear classifier: $K(x', A')Du = \gamma$
• Where K is a nonlinear kernel, e.g. the Gaussian (radial basis) kernel:
  $K(A, A')_{ij} = \exp(-\mu \|A_i - A_j\|^2), \quad i, j = 1, \dots, m$
• The $ij$-entry of $K(A, A')$ represents the "similarity" of data points $A_i$ and $A_j$.
Nonlinear PSVM
• Defining $E$ slightly differently: $E = [K(A, A')D \ \ -e]$.
• Similar to the linear case, setting the gradient equal to zero, we obtain:
  $\begin{bmatrix} u \\ \gamma \end{bmatrix} = \left(\frac{I}{\nu} + E'E\right)^{-1} E'De$
• Here, the linear system to solve is of size $(m+1) \times (m+1)$.
• However, reduced kernel techniques (RSVM) can be used to reduce the dimensionality.
Nonlinear Proximal SVM Algorithm
Input: $A$, $D$, $\nu$
Define: $K = K(A, A')$, $E = [KD \ \ -e]$
Calculate: $E'E$ and $E'De$
Solve: $\left(\frac{I}{\nu} + E'E\right)\begin{bmatrix} u \\ \gamma \end{bmatrix} = E'De$
Classifier: $\operatorname{sign}(K(x', A')Du - \gamma)$
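A minimal sketch of the nonlinear algorithm with a Gaussian kernel (not from the original slides); the XOR-style data and the values of mu and nu are illustrative assumptions.

```python
# Sketch: nonlinear Proximal SVM with a Gaussian kernel.
import numpy as np

def gaussian_kernel(X, Y, mu=1.0):
    """K_ij = exp(-mu * ||X_i - Y_j||^2)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-mu * sq)

def nonlinear_psvm(A, d, nu=1.0, mu=1.0):
    """Solve one (m+1)x(m+1) linear system; return a prediction function."""
    m = A.shape[0]
    e = np.ones(m)
    K = gaussian_kernel(A, A, mu)
    E = np.hstack([K * d[None, :], -e[:, None]])     # E = [K(A,A')D  -e]
    sol = np.linalg.solve(np.eye(m + 1) / nu + E.T @ E, E.T @ (d * e))
    u, gamma = sol[:m], sol[m]
    # classifier: sign(K(x', A') D u - gamma)
    return lambda X: np.sign(gaussian_kernel(X, A, mu) @ (d * u) - gamma)

# Usage: XOR-like data that no linear classifier can separate.
A = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
d = np.array([1., 1., -1., -1.])
predict = nonlinear_psvm(A, d, nu=10.0, mu=5.0)
print(predict(A))   # expected: [ 1.  1. -1. -1.]
```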
Incremental PSVM Classification
• Suppose we have two "blocks" of data, $(A^1, D^1)$ and $(A^2, D^2)$, with $E^i = [A^i \ \ -e]$.
• The linear system to solve depends only on the compressed blocks:
  $E'E = E^{1\prime}E^1 + E^{2\prime}E^2$ and $E'De = E^{1\prime}D^1e + E^{2\prime}D^2e$,
  which are of size $(n+1)\times(n+1)$ and $(n+1)\times 1$, regardless of the number of points.
Linear Incremental Proximal SVM Algorithm
(Flowchart: Initialization; read a block $A^i, D^i$ from disk; compute $E^{i\prime}E^i$ and $E^{i\prime}D^ie$ and update the running sums stored in memory; discard the block, keeping only the small accumulated system; if more blocks remain, repeat; otherwise solve for $(w, \gamma)$ and output the classifier.)
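A minimal sketch of the block-accumulation idea above (not from the original slides): only the small compressed matrices stay in memory while blocks are streamed and discarded. The block generator and dataset below are illustrative assumptions.

```python
# Sketch: incremental linear PSVM by accumulating compressed blocks.
import numpy as np

def incremental_psvm(blocks, n, nu=1.0):
    """blocks: iterable yielding (A_i, d_i) chunks; n: input dimension."""
    lhs = np.eye(n + 1) / nu          # running I/nu + sum_i E_i' E_i
    rhs = np.zeros(n + 1)             # running sum_i E_i' D_i e
    for A_i, d_i in blocks:
        E_i = np.hstack([A_i, -np.ones((A_i.shape[0], 1))])
        lhs += E_i.T @ E_i            # add the block's compressed matrix
        rhs += E_i.T @ d_i            # add the block's compressed vector
        # A_i, d_i, E_i can now be discarded; retiring a block would subtract instead
    sol = np.linalg.solve(lhs, rhs)
    return sol[:n], sol[n]            # w, gamma

# Usage: stream a synthetic dataset in 10 blocks of 1000 points each.
rng = np.random.default_rng(0)
def make_blocks():
    for _ in range(10):
        A_i = np.vstack([rng.normal(+1, 1, (500, 10)), rng.normal(-1, 1, (500, 10))])
        d_i = np.hstack([np.ones(500), -np.ones(500)])
        yield A_i, d_i

w, gamma = incremental_psvm(make_blocks(), n=10, nu=1.0)
```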
Linear Incremental Proximal SVM: Adding and Retiring Data
• Capable of modifying an existing linear classifier by both adding and retiring data.
• Retiring old data is analogous to adding new data: its compressed block is subtracted instead of added.
  • Financial data: old data becomes obsolete.
• Option of keeping old data and merging it with the new data:
  • Medical data: old data does not become obsolete.
Numerical Experiments: One-Billion-Point Two-Class Dataset
• Synthetic dataset consisting of 1 billion points in 10-dimensional input space.
• Generated by the NDC (Normally Distributed Clustered) dataset generator.
• Dataset divided into 500 blocks of 2 million points each.
• Solution obtained in less than 2 hours and 26 minutes.
• About 30% of the time was spent reading data from disk.
• Testing set correctness: 90.79%.
Numerical Experiments: Simulation of a Two-Month, 60-Million-Point Dataset
• Synthetic dataset consisting of 60 million points (1 million per day) in 10-dimensional input space.
• Generated using NDC.
• At the beginning, we only have data corresponding to the first month.
• Every day:
  • The oldest block of data (1 million points) is retired.
  • A new block (1 million points) is added.
  • A new linear classifier is calculated daily.
• Only an 11-by-11 matrix is kept in memory at the end of each day; all other data is purged.
Numerical Experiments: Normals to the Separating Hyperplanes
(Figure: the classifier normals corresponding to 5-day intervals of the two-month simulation.)
Support Vector Machines: Linear Programming Formulation
• Use the 1-norm instead of the 2-norm:
  $\min_{w,\gamma,y} \ \nu e'y + \|w\|_1$  s.t. $D(Aw - e\gamma) + y \ge e, \ y \ge 0$
• This is equivalent to the following linear program:
  $\min_{w,\gamma,y,t} \ \nu e'y + e't$  s.t. $D(Aw - e\gamma) + y \ge e, \ -t \le w \le t, \ y \ge 0$
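A minimal sketch of the 1-norm SVM LP above, solved with scipy.optimize.linprog (not from the original slides); the synthetic data, the value of nu, and the variable ordering z = (w, gamma, y, t) are illustrative assumptions.

```python
# Sketch: 1-norm SVM as a linear program.
#   min  nu*e'y + e't   s.t.  D(Aw - e*gamma) + y >= e,  -t <= w <= t,  y >= 0
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n, nu = 60, 2, 1.0
A = np.vstack([rng.normal(+1, 0.6, (m // 2, n)), rng.normal(-1, 0.6, (m // 2, n))])
d = np.hstack([np.ones(m // 2), -np.ones(m // 2)])

c = np.concatenate([np.zeros(n), [0.0], nu * np.ones(m), np.ones(n)])

# -( D(Aw - e*gamma) + y ) <= -e
row1 = np.hstack([-d[:, None] * A, d[:, None], -np.eye(m), np.zeros((m, n))])
#  w - t <= 0   and   -w - t <= 0
row2 = np.hstack([np.eye(n), np.zeros((n, 1)), np.zeros((n, m)), -np.eye(n)])
row3 = np.hstack([-np.eye(n), np.zeros((n, 1)), np.zeros((n, m)), -np.eye(n)])
A_ub = np.vstack([row1, row2, row3])
b_ub = np.concatenate([-np.ones(m), np.zeros(2 * n)])

bounds = [(None, None)] * (n + 1) + [(0, None)] * (m + n)   # y >= 0, t >= 0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
w, gamma = res.x[:n], res.x[n]
print("training accuracy:", np.mean(np.sign(A @ w - gamma) == d))
```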
Support Vector Machines: Maximizing the Margin between Bounding Planes
(Figure repeated: classes A+ and A- separated by the two bounding planes whose margin is maximized.)
Incorporating Knowledge Sets into an SVM Classifier
• Suppose that the knowledge set $\{x \mid Bx \le b\}$ belongs to class A+. Hence it must lie in the halfspace:
  $\{x \mid x'w \ge \gamma + 1\}$
• We therefore have the implication:
  $Bx \le b \ \Rightarrow \ x'w \ge \gamma + 1$
• We will show that this implication is equivalent to a set of constraints that can be imposed on the classification problem.
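As a numerical illustration of the implication above (not from the original slides), one can check whether a polyhedral knowledge set lies in the halfspace by minimizing x'w over the set with a small LP. The knowledge set, w, and gamma below are illustrative assumptions.

```python
# Sketch: check  Bx <= b  =>  x'w >= gamma + 1  by minimizing x'w over {x | Bx <= b}.
import numpy as np
from scipy.optimize import linprog

B = np.array([[ 1.0,  0.0],     # knowledge set: the box 1 <= x1 <= 3, 1 <= x2 <= 3
              [-1.0,  0.0],
              [ 0.0,  1.0],
              [ 0.0, -1.0]])
b = np.array([3.0, -1.0, 3.0, -1.0])
w, gamma = np.array([1.0, 1.0]), 0.5

res = linprog(c=w, A_ub=B, b_ub=b, bounds=[(None, None)] * 2)
print("min of x'w over the knowledge set:", res.fun)   # 2.0 here
print("implication holds:", res.fun >= gamma + 1.0)    # 2.0 >= 1.5 -> True
```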
Knowledge-Based SVM Classification
• By a theorem of the alternative (LP duality), the implication $Bx \le b \Rightarrow x'w \ge \gamma + 1$ holds if and only if there exists $u \ge 0$ with $B'u + w = 0$ and $b'u + \gamma + 1 \le 0$.
• Adding one such set of constraints (with slacks penalized by $\mu$) for each knowledge set to the 1-norm SVM LP, we have:
  $\min \ \nu e'y + e't + \mu\left(\sum_i (e'r^i + \rho_i) + \sum_j (e's^j + \sigma_j)\right)$
  s.t. $D(Aw - e\gamma) + y \ge e, \quad -t \le w \le t, \quad y \ge 0$
  $-r^i \le B^{i\prime}u^i + w \le r^i, \quad b^{i\prime}u^i + \gamma + 1 \le \rho_i, \quad u^i, r^i, \rho_i \ge 0$ (class A+ sets)
  $-s^j \le C^{j\prime}v^j - w \le s^j, \quad c^{j\prime}v^j - \gamma + 1 \le \sigma_j, \quad v^j, s^j, \sigma_j \ge 0$ (class A- sets)
Numerical Testing: The Promoter Recognition Dataset
• Promoter: a short DNA sequence that precedes a gene sequence.
• A promoter consists of 57 consecutive DNA nucleotides belonging to {A, G, C, T}.
• Important to distinguish between promoters and nonpromoters.
• This distinction identifies starting locations of genes in long uncharacterized DNA sequences.
Minimal Kernel Classifiers: Model Simplification
• Goal #1: Generate a very sparse solution vector.
  • Why? Minimizes the number of kernel functions used.
  • Simplifies the separating surface.
  • Reduces storage.
• Goal #2: Minimize the number of active constraints.
  • Why? Reduces data dependence.
  • Useful for massive incremental classification.
Model Simplification Goal #1: Simplifying the Separating Surface
• The nonlinear separating surface: $K(x', A')Du = \gamma$
• If $u_i = 0$, the separating surface does not depend explicitly on the data point $A_i$.
• Minimize the number of nonzero components of $u$.
Model Simplification Goal #2: Minimize Data Dependence
• By the KKT conditions, only data points whose constraints are active influence the classifier.
• Hence: minimizing the number of active constraints reduces the dependence of the classifier on the data.
Achieving Model Simplification: Minimal Kernel Classifier Formulation
• Minimize a weighted combination of a new loss function # applied to the slack variables and the number of nonzero components of $u$ (modeled by the step function applied componentwise), subject to the kernel classifier constraints.
• The step-function term enforces Goal #1 (few kernel functions); the new loss function # enforces Goal #2 (few active constraints).
Minimal Kernel Classifier as a Concave Minimization Problem
• The step function is approximated by a concave exponential: for $\alpha > 0$ and $t \ge 0$, $(t)_* \approx 1 - \varepsilon^{-\alpha t}$; for sufficiently large $\alpha$ the approximate problem yields a solution of the original problem.
• The resulting problem minimizes a concave objective over a polyhedral set.
• The problem can be effectively solved using the finite Successive Linearization Algorithm (SLA) (Mangasarian, 1996).
Minimal Kernel Algorithm (SLA)
• Start with a feasible point $x^0$.
• Having $x^k$, determine $x^{k+1}$ by solving the LP:
  $x^{k+1} \in \arg\min_{x \in S} \ \nabla f(x^k)'(x - x^k)$
  where $f$ is the concave objective and $S$ is the polyhedral feasible set.
• Stop when $\nabla f(x^k)'(x^{k+1} - x^k) = 0$.
Minimal Kernel Algorithm (SLA)
• Each iteration of the algorithm solves a linear program.
• The algorithm terminates in a finite number of iterations (typically 5 to 7).
• The solution obtained satisfies the minimum principle necessary optimality condition.
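A minimal sketch of the generic SLA loop above (not from the original slides, and not the minimal kernel classifier LP itself): it minimizes a toy concave "count-like" objective over a small polyhedron, solving one LP per iteration. The toy objective, constraint, and tolerance are illustrative assumptions.

```python
# Sketch: Successive Linearization Algorithm (SLA) for concave minimization
# over a polyhedron, on a toy problem:
#   minimize f(x) = sum_i (1 - exp(-alpha * x_i))   (concave for x >= 0)
#   subject to  a'x >= 1,  0 <= x <= 1
import numpy as np
from scipy.optimize import linprog

alpha = 5.0
a = np.array([1.0, 2.0, 3.0])
n = a.size

def grad_f(x):
    return alpha * np.exp(-alpha * x)   # gradient of the concave objective

x = np.full(n, 1.0 / 3.0)               # feasible starting point
for _ in range(20):
    # Linearize f at x and solve: min grad_f(x)'z  s.t. a'z >= 1, 0 <= z <= 1
    res = linprog(c=grad_f(x), A_ub=-a[None, :], b_ub=[-1.0], bounds=[(0, 1)] * n)
    z = res.x
    if grad_f(x) @ (z - x) > -1e-8:      # minimum-principle stopping criterion
        break
    x = z
print("SLA solution:", x)                # sparse: weight only on the cheapest coordinate
```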
Checkerboard Separating Surface
(Figure: nonlinear separating surface on the checkerboard dataset; number of kernel functions = 27 (marked *), number of active constraints = 30 (marked o).)
Conclusions (PSVM)
• PSVM is an extremely simple procedure for generating linear and nonlinear classifiers by solving a single system of linear equations.
• Comparable test set correctness to standard SVM.
• Much faster than standard SVMs: typically an order of magnitude faster.
• The proposed incremental algorithm is an extremely simple procedure for generating linear classifiers for huge datasets in an incremental fashion.
• It has the ability to retire old data and add new data in a very simple manner.
• Only a matrix of the size of the input space is kept in memory at any time.
Conclusions (KSVM)
• Prior knowledge is easily incorporated into classifiers through polyhedral knowledge sets.
• The resulting problem is a simple LP.
• Knowledge sets can be used with or without conventional labeled data.
• In either case, KSVM is better than most knowledge-based classifiers.
Conclusions (Minimal Kernel Classifiers)
• A finite algorithm that generates a classifier depending on only a fraction of the input data.
• Important for fast online testing of unseen data, e.g. fraud or intrusion detection.
• Useful for incremental training on massive data.
• The overall algorithm consists of solving 5 to 7 LPs.
• Kernel data dependence reduced by up to 98.8% of the data used by a standard SVM.
• Testing time reduced by up to 98.2%.
• MKC testing set correctness is comparable to that of the more complex standard SVM.