Support Vector Machines Mei-Chen Yeh 04/20/2010
The Classification Problem • Label instances, usually represented by feature vectors, into one of the predefined categories. • Example: Image classification
Starting from the simplest setting • Two classes • Samples are linearly separable • How many classifiers can separate the data? Infinitely many! • Hyperplane: g(x) = wTx + w0 = 0, where w is the weight vector and w0 is the threshold; g(x) > 0 on one side of the hyperplane and g(x) < 0 on the other
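A minimal sketch (not from the slides) of evaluating the linear discriminant g(x) = wTx + w0 on a toy 2-D example; the weight values and the sign-to-class convention are illustrative assumptions only:

```python
import numpy as np

# Hypothetical weight vector and threshold for a 2-D toy problem
w = np.array([1.0, -2.0])   # weight vector
w0 = 0.5                    # threshold (bias)

def g(x):
    """Linear discriminant g(x) = w^T x + w0."""
    return np.dot(w, x) + w0

x_test = np.array([3.0, 1.0])
# Assumed convention: class 1 if g(x) > 0, class 2 otherwise
label = 1 if g(x_test) > 0 else 2
print(g(x_test), label)
```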
Formulation • Given training data (xi, yi), i = 1, 2, …, N • xi: feature vector • yi: label • Learn a hyperplane that separates all the data • variables: w and w0 • Testing: decision function f(x) = sign(wTx + w0) • x: test data
[Figure: three separating hyperplanes H1, H2, H3 between Class 1 and Class 2] Hyperplanes H1, H2, and H3 are candidate classifiers. Which one is preferred? Why?
Choose the one with the largest margin! [Figure: two separating hyperplanes for the same Class 1 / Class 2 data, one with a small margin and one with a large margin]
What is the margin? • The separating hyperplane wTx + w0 = 0 lies midway between two parallel hyperplanes, wTx + w0 = δ and wTx + w0 = −δ, that pass through the closest samples of Class 2 and Class 1 • Scale w and w0 so that δ = 1, i.e., the margin boundaries become wTx + w0 = 1 and wTx + w0 = −1
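A short derivation of the margin width under this canonical scaling (standard textbook material, not spelled out on the slide): the distance from a point x to the hyperplane wTx + w0 = 0 is |wTx + w0| / ||w||. A point on either canonical hyperplane satisfies wTx + w0 = ±1, so it lies at distance 1/||w|| from the separating hyperplane, and the total margin is 2/||w||. Maximizing the margin is therefore equivalent to minimizing ||w||2/2.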
Formulation • Compute w, w0 so as to: minimize J(w) = (1/2)||w||2 subject to yi(wTxi + w0) ≥ 1, i = 1, 2, …, N • Side information: the margin equals 2/||w||, so minimizing ||w||2 maximizes the margin
Formulation • The problem is equivalent to the dual optimization task: maximize Σi λi − (1/2) Σi Σj λi λj yi yj xiTxj subject to λi ≥ 0 and Σi λi yi = 0, where the λi are Lagrange multipliers • w can be recovered by w = Σi λi yi xi • Classification rule: assign x to ω1 (ω2) if g(x) = wTx + w0 > 0 (< 0)
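One detail left implicit on the slide (a standard result, added here for completeness): once w is known, w0 can be recovered from any support vector xs with label ys, since ys(wTxs + w0) = 1 on the margin, giving w0 = ys − wTxs; in practice it is averaged over all support vectors for numerical stability.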
Remarks • Only some of the λi are non-zero. • The xi with non-zero λi are called support vectors. • The hyperplane is determined only by the support vectors. • The cost function is expressed in terms of inner products, so it does not depend explicitly on the dimensionality of the input space!
Non-separable Classes • Allow training errors! • Previous constraint: yi(wTxi + w0) ≥ 1 • Introduce slack variables: yi(wTxi + w0) ≥ 1 − ξi • ξi > 1: the sample is misclassified • 0 < ξi ≤ 1: the sample is correctly classified but falls inside the margin • otherwise, ξi = 0: the sample is correctly classified and outside the margin
Formulation • Compute w, w0, and ξi so as to: minimize J(w, ξ) = (1/2)||w||2 + C Σi ξi subject to yi(wTxi + w0) ≥ 1 − ξi and ξi ≥ 0, i = 1, 2, …, N • C is the penalty parameter that controls the trade-off between a wide margin and few training errors
Formulation • The dual problem: maximize Σi λi − (1/2) Σi Σj λi λj yi yj xiTxj subject to 0 ≤ λi ≤ C and Σi λi yi = 0 • Identical to the separable case, except that the multipliers are now upper-bounded by C
Non-linear Case • Could the data be linearly separable in another space? • Idea: map the feature vectors into a higher-dimensional space
Non-linear Case • Example: [Figure: samples mapped by f(·) from the input space into a higher-dimensional feature space where they become linearly separable]
Problems with the explicit mapping • High computational burden when working in the high-dimensional space • Hard to get a good estimate of the classifier there
Kernel Trick • Recall that in the dual problem, w can be recovered by w = Σi λi yi f(xi) • g(x) = wTf(x) + w0 = Σi λi yi f(xi)Tf(x) + w0 • All we need here is the inner product of (transformed) feature vectors!
Kernel Trick • Decision function: sign( Σi λi yi K(xi, x) + w0 ) • Kernel function: K(xi, xj) = f(xi)Tf(xj) (a small sketch of the decision rule follows below)
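A minimal sketch (not from the slides) of the kernelized decision rule sign( Σi λi yi K(xi, x) + w0 ); the support vectors, labels, multipliers, and w0 below are hypothetical values standing in for a trained model, and a plain dot-product kernel is used for simplicity:

```python
import numpy as np

def linear_kernel(a, b):
    """Plain inner product, used here as the simplest possible kernel."""
    return float(np.dot(a, b))

def decision(x, support_vectors, labels, lambdas, w0, kernel=linear_kernel):
    """Kernel SVM decision: sign( sum_i lambda_i * y_i * K(x_i, x) + w0 )."""
    score = sum(lam * y * kernel(sv, x)
                for sv, y, lam in zip(support_vectors, labels, lambdas)) + w0
    return 1 if score > 0 else -1

# Hypothetical trained model (values are illustrative only)
svs  = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
ys   = [+1, -1]
lams = [0.5, 0.5]
w0   = 0.0
print(decision(np.array([0.8, 1.2]), svs, ys, lams, w0))   # -> 1
```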
Example kernel • The inner product can be computed directly from the original vectors, without going through the mapping f(·); a worked illustration follows below
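A standard illustration (the specific example on the original slide is not preserved here, so this is a common textbook one): for 2-D inputs x = (x1, x2), take the mapping f(x) = (x1^2, √2·x1x2, x2^2). Then f(x)Tf(y) = x1^2 y1^2 + 2 x1 x2 y1 y2 + x2^2 y2^2 = (x1 y1 + x2 y2)^2 = (xTy)^2, so the kernel K(x, y) = (xTy)^2 computes the inner product in the 3-D feature space directly from the 2-D inputs.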
Remarks • In practice, we specify K, thereby specifying f(.) indirectly, instead of choosing f(.) • Intuitively, K(x, y) represents the similarity between data x and y • K(x, y) needs to satisfy the Mercer condition in order for f(.) to exist
Examples of Kernel Functions • Polynomial kernel with degree d: K(x, y) = (xTy + 1)^d • Radial basis function kernel with width s: K(x, y) = exp(−||x − y||2 / (2s2)) • Sigmoid kernel with parameters k and q: K(x, y) = tanh(k·xTy + q) (sketches follow below)
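Minimal sketches of the three kernels listed above, using their standard forms (the exact constants on the original slide may differ slightly):

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    """Polynomial kernel of degree d."""
    return (np.dot(x, y) + 1) ** d

def rbf_kernel(x, y, s=1.0):
    """Radial basis function kernel with width s."""
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * s ** 2))

def sigmoid_kernel(x, y, k=1.0, q=0.0):
    """Sigmoid kernel with parameters k and q."""
    return np.tanh(k * np.dot(x, y) + q)

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(polynomial_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))
```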
Pros and Cons • Strengths • Training is relatively easy • It scales relatively well to high-dimensional data • The trade-off between classifier complexity and error can be controlled explicitly • Weaknesses • No principled method for selecting the best kernel function • It handles only binary classification directly; multi-class problems require combining several binary classifiers (next slides)
Combining SVM binary classifiers for the multi-class problem (1) • M-category classification (ω1, ω2, …, ωM) • Two popular approaches • One-against-all (ωi vs. the M−1 other categories) • M classifiers • Choose the category whose classifier gives the largest output • Example: 5 categories; the ω1 classifier gives the largest output, so the winner is ω1 (a small sketch follows below)
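A minimal sketch of the one-against-all decision; the decision values below are hypothetical, standing in for the outputs of five separately trained binary SVMs:

```python
import numpy as np

def one_against_all(decision_values):
    """decision_values[i] = output of the SVM trained as category i+1 vs. the rest."""
    return int(np.argmax(decision_values)) + 1   # categories numbered 1..M

scores = [2.1, -0.3, 0.8, -1.5, 0.2]   # hypothetical outputs of 5 binary classifiers
print("winner:", one_against_all(scores))        # -> 1 (i.e., omega_1)
```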
Combining SVM binary classifiers for the multi-class problem (2) • Pair-wise coupling (ωi vs. ωj) • M(M−1)/2 classifiers • Aggregate the outputs by voting • Example: 5 categories; the SVM outputs give the vote counts ω1: 4, ω2: 1, ω3: 3, ω4: 0, ω5: 2 → winner: ω1 (a small sketch follows below)
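A minimal sketch of pairwise (one-vs-one) voting; in practice each pairwise winner comes from a binary SVM trained on the samples of (ωi, ωj), whereas the toy rule below is purely illustrative and does not reproduce the slide's vote counts:

```python
from collections import Counter
from itertools import combinations

def pairwise_vote(pairwise_winner, num_classes):
    """pairwise_winner(i, j) returns i or j, the winner of that binary SVM."""
    votes = Counter()
    for i, j in combinations(range(1, num_classes + 1), 2):  # M(M-1)/2 pairs
        votes[pairwise_winner(i, j)] += 1
    return votes.most_common(1)[0][0], votes

# Toy rule: always prefer the smaller index, just to exercise the voting logic
winner, votes = pairwise_vote(lambda i, j: i, num_classes=5)
print(winner, dict(votes))   # -> 1, {1: 4, 2: 3, 3: 2, 4: 1}
```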
Data normalization • The features may have different ranges. Example: We use weight (w) and height (h) for classifying male and female college students. • male: avg.(w) = 69.80 kg, avg.(h) = 174.36 cm • female: avg.(w) = 52.86 kg, avg.(h) = 159.77 cm Different scales!
Data normalization • “Data pre-processing” • Equalize scales among different features • Zero mean and unit variance • Two cases in practice • scale to (0, 1) if all feature values are positive • scale to (−1, 1) if feature values may be positive or negative
Data normalization • xik: feature k of sample i, i = 1, 2, …, N • Mean and variance: x̄k = (1/N) Σi xik, σk2 = (1/(N − 1)) Σi (xik − x̄k)2 • Normalization: x̂ik = (xik − x̄k) / σk (a small sketch follows below)
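A minimal sketch of zero-mean, unit-variance normalization implementing the formulas above (one sample per row, one feature per column; the weight/height values are illustrative). The same mean and standard deviation computed on the training set should also be applied to the test set:

```python
import numpy as np

def normalize(X):
    """Return z-score normalized X plus the per-feature mean and std."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)        # sample standard deviation
    return (X - mean) / std, mean, std

X = np.array([[69.80, 174.36],
              [52.86, 159.77],
              [80.00, 180.00]])
X_norm, mean, std = normalize(X)
print(X_norm.mean(axis=0), X_norm.std(axis=0, ddof=1))   # ~0 and ~1 per feature
```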
Assignment #4 • Develop an SVM classifier using either • OpenCV, or • LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) • Use “training.txt” to train your classifier, and evaluate its performance on “test.txt” • Write a 1-page report that summarizes how you implemented your classifier and the classification accuracy rate.
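One possible way to complete the assignment with LIBSVM's bundled Python interface (svmutil), assuming training.txt and test.txt are in LIBSVM's sparse “label index:value …” format; this is only a sketch, and the import path depends on where LIBSVM's python/ directory sits on your system:

```python
# Assumes LIBSVM's python/ directory is on the Python path
from svmutil import svm_read_problem, svm_train, svm_predict

y_train, x_train = svm_read_problem('training.txt')
y_test,  x_test  = svm_read_problem('test.txt')

# '-t 2' selects the RBF kernel, '-c 1' sets the penalty parameter C
model = svm_train(y_train, x_train, '-t 2 -c 1')

# Returns predicted labels, (accuracy, MSE, squared correlation), decision values
p_labels, p_acc, p_vals = svm_predict(y_test, x_test, model)
print('accuracy: %.2f%%' % p_acc[0])
```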
Final project announcement • Please prepare a short (<5 minutes) presentation on what you’re going to develop for the final project.