
Support Vector Machines



Presentation Transcript


  1. Support Vector Machines Mei-Chen Yeh 04/20/2010

  2. The Classification Problem • Label instances, usually represented by feature vectors, into one of the predefined categories. • Example: Image classification

  3. Starting from the simplest setting • Two-class problem • Samples are linearly separable • How many classifiers can separate the data? Infinitely many! • A separating hyperplane: g(x) = wTx + w0 = 0, where w is the weight vector and w0 is the threshold; g(x) > 0 on one side of the hyperplane and g(x) < 0 on the other

  4. Formulation • Given training data: (xi, yi), i = 1, 2, …, N • xi: feature vector • yi: label • Learn a hyperplane that separates all the data • variables: w and w0 • Testing: decision function f(x) = sign(wTx + w0), where x is a test sample
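The decision function above can be sketched in a few lines. The weight vector and threshold here are made-up example values, not the result of any training:

```python
# A minimal sketch of the decision function f(x) = sign(wT x + w0).
# The weight vector w and threshold w0 below are illustrative values only.
def decision(w, w0, x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return 1 if score > 0 else -1

w = [2.0, -1.0]   # assumed example weight vector
w0 = -0.5         # assumed example threshold
print(decision(w, w0, [1.0, 0.0]))   # wT x + w0 = 1.5 > 0  -> +1
print(decision(w, w0, [0.0, 2.0]))   # wT x + w0 = -2.5 < 0 -> -1
```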

  5. Hyperplanes H1, H2, and H3 (each separating Class 1 from Class 2) are all candidate classifiers. Which one is preferred? Why?

  6. Choose the one with the largest margin! Of two hyperplanes that both separate Class 1 from Class 2, the one with the wider margin is preferred.

  7. What is the margin? Scale w and w0 so that the two margin hyperplanes are wTx + w0 = 1 and wTx + w0 = -1 (the separating hyperplane is wTx + w0 = 0). The margin is the distance between them: 2/||w||.
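Under that scaling, the margin width follows directly from the weight vector. A small numeric check, with an illustrative w:

```python
import math

# With the scaling wT x + w0 = +/-1 on the margin hyperplanes, the margin
# width is 2 / ||w||. The weight vector here is an illustrative value.
w = [3.0, 4.0]                                    # assumed example weights
margin = 2.0 / math.sqrt(sum(wi * wi for wi in w))
print(margin)  # ||w|| = 5, so the margin is 0.4
```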

  8. Formulation • Compute w and w0 so as to: minimize J(w) = (1/2)||w||², subject to yi(wTxi + w0) ≥ 1, i = 1, 2, …, N • Side information: maximizing the margin 2/||w|| is equivalent to minimizing (1/2)||w||²

  9. Formulation • The problem is equivalent to the dual optimization task: maximize Σi λi − (1/2) Σi Σj λi λj yi yj xiTxj, subject to λi ≥ 0 and Σi λi yi = 0, where the λi are Lagrange multipliers • w can be recovered by w = Σi λi yi xi • Classification rule: assign x to ω1 (ω2) if wTx + w0 > 0 (< 0)

  10. Remarks • Only some of the λi are nonzero. • The xi with nonzero λi are called support vectors. • The hyperplane is determined only by the support vectors. • The cost function is expressed entirely in inner products, so it does not depend explicitly on the dimensionality of the input space!
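The recovery of w from the support vectors alone can be sketched directly. The multipliers, labels, and points below are illustrative values, not the output of a real training run:

```python
# Recovering w from the support vectors only: w = sum_i lambda_i * y_i * x_i.
# The (x_i, y_i, lambda_i) triples below are illustrative, not trained.
support = [([1.0, 1.0], +1, 0.5),
           ([2.0, 0.0], -1, 0.5)]
w = [0.0, 0.0]
for x, y, lam in support:
    for k in range(len(w)):
        w[k] += lam * y * x[k]
print(w)  # [-0.5, 0.5]
```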

  11. Non-separable Classes • Allow training errors! • Previous constraint: yi(wTxi + w0) ≥ 1 • Introduce slack variables ξi: yi(wTxi + w0) ≥ 1 − ξi • ξi > 1: the sample is misclassified • 0 < ξi ≤ 1: the sample is correctly classified but inside the margin • otherwise, ξi = 0

  12. Formulation • Compute w and w0 so as to: minimize (1/2)||w||² + C Σi ξi, subject to yi(wTxi + w0) ≥ 1 − ξi and ξi ≥ 0, where C is the penalty parameter
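The slack values themselves are easy to compute for a fixed hyperplane: ξi = max(0, 1 − yi(wTxi + w0)). The weights and samples below are illustrative, chosen to hit the three cases from slide 11:

```python
# Slack variable for one sample: xi_i = max(0, 1 - y_i (wT x_i + w0)).
# w, w0, and the sample points are illustrative values.
def slack(w, w0, x, y):
    score = sum(wi * xk for wi, xk in zip(w, x)) + w0
    return max(0.0, 1.0 - y * score)

w, w0 = [1.0, 0.0], 0.0
print(slack(w, w0, [2.0, 0.0], +1))   # outside the margin: xi = 0
print(slack(w, w0, [0.5, 0.0], +1))   # inside the margin:  0 < xi <= 1 (0.5)
print(slack(w, w0, [-1.0, 0.0], +1))  # misclassified:      xi > 1 (2.0)
```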

  13. Formulation • The dual problem: maximize Σi λi − (1/2) Σi Σj λi λj yi yj xiTxj, subject to 0 ≤ λi ≤ C and Σi λi yi = 0

  14. Non-linear Case • Linearly separable in another space? • Idea: map the feature vectors to a higher-dimensional space

  15. Non-linear Case • Example: map each sample x to f(x) via a non-linear mapping f(·); samples that are not linearly separable in the input space may become separable in the mapped space

  16. Problems • High computational burden of working in the high-dimensional space • Hard to get a good estimate

  17. Kernel Trick • Recall that in the dual problem, w can be recovered by w = Σi λi yi f(xi) • g(x) = wTf(x) + w0 = Σi λi yi f(xi)Tf(x) + w0 • All we need here is the inner product of (transformed) feature vectors!

  18. Kernel Trick • Decision function: assign x to ω1 if Σi λi yi K(xi, x) + w0 > 0 • Kernel function: K(xi, xj) = f(xi)Tf(xj)

  19. Example kernel • The inner product can be computed directly through the kernel, without ever going through the mapping f(·)
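The slide's specific kernel is not preserved in this transcript, but a common concrete instance makes the point: for 2-D inputs, K(x, y) = (xTy)² equals f(x)Tf(y) for the explicit mapping f([a, b]) = [a², √2·ab, b²]. This example kernel is an assumption, not necessarily the one on the original slide:

```python
import math

# Assumed illustrative kernel: K(x, y) = (xT y)^2 for 2-D inputs, which
# equals phi(x)T phi(y) with phi([a, b]) = [a^2, sqrt(2)*a*b, b^2].
def phi(v):
    a, b = v
    return [a * a, math.sqrt(2) * a * b, b * b]

def kernel(x, y):
    return (x[0] * y[0] + x[1] * y[1]) ** 2

x, y = [1.0, 2.0], [3.0, 1.0]
direct = kernel(x, y)                                # (3 + 2)^2 = 25
mapped = sum(p * q for p, q in zip(phi(x), phi(y)))  # same value via phi
print(direct, mapped)
```

The direct computation never constructs the 3-D mapped vectors, which is exactly the saving the kernel trick provides in high dimensions.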

  20. Remarks • In practice, we specify K, thereby specifying f(.) indirectly, instead of choosing f(.) • Intuitively, K(x, y) represents the similarity between data x and y • K(x, y) needs to satisfy the Mercer condition in order for f(.) to exist

  21. Examples of Kernel Functions • Polynomial kernel with degree d: K(x, y) = (xTy + 1)^d • Radial basis function kernel with width s: K(x, y) = exp(−||x − y||²/(2s²)) • Sigmoid with parameters k and q: K(x, y) = tanh(k xTy + q)

  22. Pros and Cons • Strengths • Training is relatively easy • It scales relatively well to high-dimensional data • The tradeoff between classifier complexity and error can be controlled explicitly • Weaknesses • No practical method for selecting the best kernel function • Handles only binary classification directly

  23. Combining SVM binary classifiers for a multi-class problem (1) • M-category classification (ω1, ω2, …, ωM) • Two popular approaches • One-against-all (ωi vs. the M−1 others) • M classifiers • Choose the one with the largest output • Example: 5 categories; winner: ω1
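The one-against-all decision can be sketched as an argmax over the M classifier outputs. The scores below are illustrative values, not real SVM outputs:

```python
# One-against-all decision: each of the M classifiers scores x; pick the
# category whose classifier gives the largest output. The scores below
# are illustrative values only.
scores = {1: 0.9, 2: -0.3, 3: 0.1, 4: -1.2, 5: 0.4}  # g_i(x), 5 categories
winner = max(scores, key=scores.get)
print(winner)  # category 1, as in the slide's example
```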

  24. Combining SVM binary classifiers for a multi-class problem (2) • Pairwise coupling (ωi vs. ωj) • M(M−1)/2 classifiers • Aggregate the outputs by voting • Example: 5 categories; vote tallies 1: 4, 2: 1, 3: 3, 4: 0, 5: 2 • Winner: ω1
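Pairwise voting can be sketched as tallying one vote per category pair. The pairwise outcomes below are illustrative stand-ins for trained SVM decisions, chosen to reproduce the slide's tally:

```python
from itertools import combinations
from collections import Counter

# Pairwise coupling: M(M-1)/2 binary classifiers, one per category pair,
# each voting for the category it prefers. These outcomes are illustrative,
# chosen to match the slide's tally (1: 4, 2: 1, 3: 3, 4: 0, 5: 2).
M = 5
outcomes = {(1, 2): 1, (1, 3): 1, (1, 4): 1, (1, 5): 1,
            (2, 3): 3, (2, 4): 2, (2, 5): 5,
            (3, 4): 3, (3, 5): 3, (4, 5): 5}
votes = Counter(outcomes[pair] for pair in combinations(range(1, M + 1), 2))
print(votes.most_common(1)[0][0])  # winner: category 1
```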

  25. Data normalization • The features may have different ranges. Example: We use weight (w) and height (h) for classifying male and female college students. • male: avg.(w) = 69.80 kg, avg.(h) = 174.36 cm • female: avg.(w) = 52.86 kg, avg.(h) = 159.77 cm Different scales!

  26. Data normalization • “Data pre-processing” • Equalize scales among different features • Zero mean and unit variance • Two cases in practice • Scale to (0, 1) if all feature values are positive • Scale to (−1, 1) if feature values may be positive or negative

  27. Data normalization • xik: feature k of sample i, i = 1, 2, …, N • Mean and variance per feature: x̄k = (1/N) Σi xik, σk² = (1/(N−1)) Σi (xik − x̄k)² • Normalization: x̂ik = (xik − x̄k)/σk
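The per-feature normalization above can be sketched directly. The data matrix is illustrative, with two features on deliberately different scales (as in the weight/height example):

```python
# Per-feature zero-mean, unit-variance normalization.
# Rows are samples, columns are features; the values are illustrative,
# with the two features on very different scales.
data = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
n, d = len(data), len(data[0])
mean = [sum(row[k] for row in data) / n for k in range(d)]
std = [(sum((row[k] - mean[k]) ** 2 for row in data) / (n - 1)) ** 0.5
       for k in range(d)]
normed = [[(row[k] - mean[k]) / std[k] for k in range(d)] for row in data]
print(normed)  # both features now vary over the same range around zero
```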

  28. Assignment #4 • Develop an SVM classifier using either • OpenCV, or • LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) • Use “training.txt” to train your classifier, and evaluate the performance on “test.txt” • Write a 1-page report that summarizes how you implemented your classifier, and the classification accuracy rate.
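LIBSVM stores data as plain text, one sample per line in the form "label index:value index:value …" with sparse feature indices. A small parsing sketch (the sample line is illustrative, not taken from training.txt):

```python
# LIBSVM data format: "label index:value index:value ..." per line,
# with features stored sparsely. The example line below is illustrative.
def parse_libsvm_line(line):
    parts = line.split()
    label = int(parts[0])
    feats = {int(i): float(v)
             for i, v in (p.split(":") for p in parts[1:])}
    return label, feats

label, feats = parse_libsvm_line("+1 1:0.5 3:-1.2")
print(label, feats)  # 1 {1: 0.5, 3: -1.2}
```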

  29. Final project announcement • Please prepare a short (<5 minutes) presentation on what you’re going to develop for the final project.
