Non-linear SVMs & libSVM


Presentation Transcript


  1. Non-linear SVMs & libSVM Advanced Statistical Methods in NLP Ling 572 February 23, 2012

  2. Roadmap • Non-linear SVMs: • Motivation: Non-linear data • The kernel trick • Linear → Non-linear SVM models • LibSVM: • svm-train & svm-predict • Models • HW #7

  3. Non-Linear SVMs • Problem: • Sometimes data really isn’t linearly separable

  4. Non-Linear SVMs • Problem: • Sometimes data really isn’t linearly separable • Approach: • Map data non-linearly into higher dimensional space • Data is separable in the higher dimensional space

  5. Non-Linear SVMs • Problem: • Sometimes data really isn’t linearly separable • Approach: • Map data non-linearly into higher dimensional space • Data is separable in the higher dimensional space Figure from Hearst et al ‘98

  6. Feature Space • Basic approach: • Original data is not linearly separable

  7. Feature Space • Basic approach: • Original data is not linearly separable • Map data into ‘feature space’ • Higher dimensional dot product space • Mapping via non-linear map: Φ

  8. Feature Space • Basic approach: • Original data is not linearly separable • Map data into ‘feature space’ • Higher dimensional dot product space • Mapping via non-linear map: Φ • Compute separating hyperplane • In higher dimensional space

  9. Issues with Feature Space • Mapping idea is simple, • But has some practical problems

  10. Issues with Feature Space • Mapping idea is simple, • But has some practical problems • Feature space can be very high – infinite? – dimensional

  11. Issues with Feature Space • Mapping idea is simple, • But has some practical problems • Feature space can be very high – infinite? – dimensional • Approach depends on computing similarity (dot product) • Computationally expensive

  12. Issues with Feature Space • Mapping idea is simple, • But has some practical problems • Feature space can be very high – infinite? – dimensional • Approach depends on computing similarity (dot product) • Computationally expensive • Approach depends on mapping: • Also possibly intractable to compute

  13. Solution • “Kernel trick”: • Use a kernel function K: X × X → R

  14. Solution • “Kernel trick”: • Use a kernel function K: X × X → R • Computes similarity measure on images of data points

  15. Solution • “Kernel trick”: • Use a kernel function K: X × X → R • Computes similarity measure on images of data points • Can often compute similarity efficiently even on high (or infinite) dimensional space

  16. Solution • “Kernel trick”: • Use a kernel function K: X × X → R • Computes similarity measure on images of data points • Can often compute similarity efficiently even on high (or infinite) dimensional space • Choice of K equivalent to selection of Φ
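
In symbols, the kernel computes the feature-space inner product without ever forming Φ(x) explicitly: K(x, x') = Φ(x) · Φ(x') for all x, x' in X. The worked example on the next slides makes this concrete.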

  17. Example (Russell & Norvig)

  18. Example (cont’d) • Original 2-D data: x = (x1, x2)

  19. Example (cont’d) • Original 2-D data: x = (x1, x2) • Mapping to new values in 3-D feature space F(x): • f1 = x1^2; f2 = x2^2; f3 = sqrt(2)·x1·x2

  25. Example (cont’d) • More generally, the mapping never has to be computed explicitly: • F(x) · F(z) = x1^2 z1^2 + x2^2 z2^2 + 2 x1 x2 z1 z2 = (x · z)^2 • So the kernel K(x, z) = (x · z)^2 gives the feature-space similarity without constructing F, and any function satisfying Mercer’s condition plays the same role for some feature space
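
A quick numeric check of this identity (a sketch added for concreteness; the vectors are arbitrary, not from the original slides):

```python
import numpy as np

def phi(x):
    """Explicit 2-D -> 3-D mapping from the Russell & Norvig example."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def k(x, z):
    """Kernel: squared dot product computed in the original 2-D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Both quantities equal (x . z)^2 = (3 - 2)^2 = 1
print(np.dot(phi(x), phi(z)))  # 1.0
print(k(x, z))                 # 1.0
```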

  31. Kernel Trick: Summary • Avoids explicit mapping to high-dimensional space • Avoids explicit computation of inner product in feature space • Avoids explicit computation of mapping function • Or even feature vector • Replace all inner products in SVM train/test with K

  32. Non-Linear SVM Training • Linear version: • Maximize • subject to

  33. Non-Linear SVM Training • Linear version: • Maximize • subject to • Non-linear version: • Maximize
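
The objective functions on these slides were rendered as images and did not survive extraction; what follows is the standard soft-margin dual formulation they refer to (a reconstruction, not a verbatim copy of the slides):

```latex
% Linear soft-margin dual: maximize over the Lagrange multipliers alpha_i
\max_{\alpha} \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j
\quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C

% Non-linear version: identical, except the inner product x_i . x_j
% is replaced by the kernel K(x_i, x_j)
\max_{\alpha} \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, K(x_i, x_j)
\quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C
```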

  34. Decoding • Linear SVM:

  35. Decoding • Linear SVM: • Non-linear SVM:
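
The decision rules here were also images; the standard forms being contrasted are (again a reconstruction):

```latex
% Linear SVM: classify by the sign of the score, with w expanded over training points
f(x) = \operatorname{sign}(w \cdot x + b)
     = \operatorname{sign}\Big( \sum_{i} \alpha_i y_i \, x_i \cdot x + b \Big)

% Non-linear SVM: replace the inner product with the kernel;
% only support vectors (alpha_i > 0) contribute to the sum
f(x) = \operatorname{sign}\Big( \sum_{i} \alpha_i y_i \, K(x_i, x) + b \Big)
```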

  36. Common Kernel Functions • Implemented in most packages • Linear: u'*v • Polynomial: (gamma*u'*v + coef0)^degree • Radial Basis Function (RBF): exp(-gamma*|u-v|^2) • Sigmoid: tanh(gamma*u'*v + coef0)
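
For reference, a minimal NumPy sketch of these four kernels in the parameterization libSVM uses (gamma, coef0, and degree correspond to the -g, -r, and -d options covered later; the default values here are only illustrative):

```python
import numpy as np

def linear(u, v):
    return np.dot(u, v)

def polynomial(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * np.dot(u, v) + coef0) ** degree

def rbf(u, v, gamma=1.0):
    # exp(-gamma * squared Euclidean distance)
    return np.exp(-gamma * np.sum((u - v) ** 2))

def sigmoid(u, v, gamma=1.0, coef0=0.0):
    return np.tanh(gamma * np.dot(u, v) + coef0)
```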

  37. Kernels • Many conceivable kernels: • A function is a valid kernel if it satisfies Mercer’s condition: • It is symmetric and continuous, and every kernel (Gram) matrix it produces is positive semi-definite

  38. Kernels • Many conceivable kernels: • A function is a valid kernel if it satisfies Mercer’s condition: • It is symmetric and continuous, and every kernel (Gram) matrix it produces is positive semi-definite • Selection of kernel can have huge impact • Dramatic differences in accuracy

  39. Kernels • Many conceivable kernels: • A function is a valid kernel if it satisfies Mercer’s condition: • It is symmetric and continuous, and every kernel (Gram) matrix it produces is positive semi-definite • Selection of kernel can have huge impact • Dramatic differences in accuracy • Knowledge about ‘shape’ of data can help select • Ironically, linear SVMs perform well on many tasks
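
One practical way to sanity-check a candidate kernel is to test the Mercer condition empirically on a data sample: the Gram matrix must be symmetric with no negative eigenvalues. A sketch (a finite-sample check only, not a proof):

```python
import numpy as np

def gram_matrix(kernel, X):
    """Gram matrix K[i, j] = kernel(X[i], X[j]) over the sample X."""
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

def looks_like_valid_kernel(kernel, X, tol=1e-9):
    """Check symmetry and positive semi-definiteness on the sample."""
    K = gram_matrix(kernel, X)
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh(K) >= -tol)
    return symmetric and psd

X = np.random.randn(20, 5)
print(looks_like_valid_kernel(lambda u, v: np.exp(-np.sum((u - v) ** 2)), X))  # True (RBF)
```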

  40. Summary • Find decision hyperplane that maximizes margin

  41. Summary • Find decision hyperplane that maximizes margin • Employ soft-margin to support noisy data

  42. Summary • Find decision hyperplane that maximizes margin • Employ soft-margin to support noisy data • For non-linearly separable data, use non-linear SVMs • Project to higher dimensional space to separate • Use kernel trick to avoid intractable computation of projections or inner products

  43. MaxEnt vs. SVM • Due to F. Xia

  44. LibSVM • Well-known SVM package • Chang & Lin, most recent version 2011 • Supports: • Many different SVM variants • Range of kernels • Efficient multi-class implementation • Range of language interfaces, as well as CLI • Frameworks for tuning – cross-validation, grid search • Why libSVM? SVM not in Mallet

  45. libSVM Information • Main website: • http://www.csie.ntu.edu.tw/~cjlin/libsvm/ • Documentation & examples • FAQ

  46. libSVM Information • Main website: • http://www.csie.ntu.edu.tw/~cjlin/libsvm/ • Documentation & examples • FAQ • Installation on patas: • /NLP_TOOLS/ml_tools/svm/libsvm/latest/ • Calling either binary with no arguments prints its usage message

  47. libSVM • Main components: • svm-train: • Trains model on provided data • svm-predict: • Applies trained model to test data

  48. svm-train • Usage: • svm-train [options] training_data model_file • Options: • -t: kernel type • [0-3] • -d: degree – used in polynomial • -g: gamma – used in polynomial, RBF, sigmoid • -r: coef0 – used in polynomial, sigmoid • lots of others

  49. LibSVM Kernels • Predefined libSVM kernels • Set with -t switch: default 2 • 0 -- linear: u'*v • 1 -- polynomial: (gamma*u'*v + coef0)^degree • 2 -- radial basis function: exp(-gamma*|u-v|^2) • 3 -- sigmoid: tanh(gamma*u'*v + coef0) • From libSVM docs
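
The same switches are available from the Python wrapper (svmutil) bundled with libSVM, which can be easier to script than the command-line tools. A minimal sketch, with hypothetical file names and data already in libSVM's sparse format:

```python
# Assumes libSVM's bundled python/ directory is on PYTHONPATH.
from svmutil import svm_read_problem, svm_train, svm_predict

y_train, x_train = svm_read_problem('train.libsvm')   # hypothetical paths
y_test, x_test = svm_read_problem('test.libsvm')

# -t 2: RBF kernel, -g: gamma, -c: soft-margin cost C
model = svm_train(y_train, x_train, '-t 2 -g 0.5 -c 1')

# Returns predicted labels, (accuracy, MSE, squared corr.), and decision values
p_labels, p_acc, p_vals = svm_predict(y_test, x_test, model)
```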
