A beginner-friendly guide to Supervised vs. Unsupervised learning using SVM, HMM, clustering, and more. Learn about separating hyperplanes, maximizing margins, transforming data to feature space, and exploring different algorithms.
A Reassuring Introduction to Support Vector Machines • Mark Stamp
Supervised vs Unsupervised • Often use supervised learning… • …where training relies on labeled data • Training data must be pre-processed • In contrast, unsupervised learning… • …uses unlabeled data • No pre-processing required for training • Also semi-supervised algorithms • Supervised, but not too much?
HMM for Supervised Learning • Suppose we want to use HMM for malware detection • Train model on set of malware • All from one specific family • Data labeled as malware of that type • Test to see how well it distinguishes malware from benign • This is supervised learning
Unsupervised Learning? • Recall HMM for English text example • Using N = 2, we find hidden states correspond to consonants and vowels • We did not specify consonants/vowels • HMM extracted this info from raw data • Unsupervised or semi-supervised? • It seems to depend on your definition
Unsupervised Learning • Clustering • Good example of unsupervised learning • Other examples? • For “mixed” dataset, often the goal of clustering is to reveal structure • No pre-processing • Often no idea how to pre-process • Usually used in “data exploration” mode
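Below is a minimal clustering sketch in this “data exploration” spirit. It assumes scikit-learn and NumPy are available; the two-blob dataset and the choice of two clusters are invented purely for illustration.

```python
# Minimal unsupervised-learning sketch: k-means clustering on unlabeled data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# "Mixed" unlabeled data: two blobs, but we never tell the algorithm that
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(kmeans.labels_[:10])       # cluster assignment for each point
print(kmeans.cluster_centers_)   # structure "revealed" by clustering
```

No labels and no pre-processing are supplied; the algorithm is left to reveal whatever structure the data happens to have.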
Supervised Learning • SVM is one of the most popular supervised learning methods • Also, HMM, PHMM, PCA, ANN, etc., used for supervised learning • SVM is for binary classification • I.e., 2 classes, such as malware vs benign • SVM generalizes to multiple classes • As does LDA and some other techniques
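As a concrete sketch of supervised binary classification with an SVM, assuming scikit-learn is available; the tiny feature vectors and the malware/benign labels below are made up for illustration, not taken from the slides.

```python
# Minimal supervised-learning sketch with a binary SVM classifier.
from sklearn.svm import SVC

X_train = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]  # labeled training data (invented)
y_train = [1, 1, -1, -1]                                     # 1 = "malware", -1 = "benign"

clf = SVC(kernel='linear')          # binary classifier
clf.fit(X_train, y_train)           # training uses the labels (supervised)
print(clf.predict([[0.85, 0.15]]))  # classify a new sample directly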
Support Vector Machine • According to another author… • “SVMs are a rare example of a methodology where geometric intuition, elegant mathematics, theoretical guarantees, and practical algorithms meet” • We have something to say about each aspect of this… • Geometry, math, theory, and algorithms
Support Vector Machine • SVM is based on four BIG ideas • 1. Separating hyperplane • 2. Maximize the “margin” • Maximize minimum separation between classes • 3. Work in a higher dimensional space • More “room”, so easier to separate • 4. Kernel trick • This is intimately related to 3 • Ideas 1 and 2 are fairly intuitive
SVM • SVMs can be applied to any training data • Note that SVM yields a classification… • … not a score, per se • With HMM, for example • We first train a model… • …then generate scores and set a threshold • SVM directly gives a classification • Skip the intermediate scoring/threshold step
Separating Classes • Consider labeled data • Binary classifier • Red class is type “1” • Blue class is “-1” • And (x,y) are features • How to separate? • We’ll use a “hyperplane”… • …a line in this case
Separating Hyperplanes • Consider labeled data • Here, easy to separate • Draw a hyperplane to separate points • Classify new data based on separating hyperplane • Which hyperplane is better? Or best? Why?
Maximize Margin • Margin is min distance to misclassification • Maximize the margin • Yellow hyperplane is better than purple • Seems like a good idea • But, may not be possible • See next slide…
Separating… NOT • What about this case? • Yellow line not an option • Why not? • No longer “separating” • What to do? • Allow for some errors? • E.g., hyperplane need not completely separate
Soft Margin • Ideally, large margin and no errors • But allowing some misclassifications might increase the margin by a lot • I.e., relax the “separating” requirement • How many errors to allow? • Let it be a user-defined parameter • Tradeoff? Errors vs larger margin • In practice, can use trial and error
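One way to see this tradeoff in practice: in scikit-learn’s SVC the parameter C plays the role of the user-defined knob (large C punishes training errors, small C favors a wider margin). A minimal sketch, with an invented, non-separable toy dataset:

```python
# Soft-margin tradeoff sketch: sweep the error-vs-margin parameter C.
from sklearn.svm import SVC

# Tiny made-up dataset with one "outlier" that prevents perfect separation
X = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3], [0.5, 0.5]]
y = [-1, -1, -1, 1, 1, 1, 1]    # last point sits inside the "-1" region

for C in [0.01, 1.0, 100.0]:    # trial and error over the error/margin tradeoff
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C={C}: training accuracy = {clf.score(X, y):.2f}")
```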
Feature Space • Transform data to “feature space” • Feature space in higher dimension • But what about curse of dimensionality? • Q: Why increase dimensionality??? • A: Easier to separate in feature space • Goal is to make data “linearly separable” • Want to separate classes with hyperplane • But not pay a price for high dimensionality
Input Space & Feature Space ϕ Input space Feature space A Reassuring Introduction to SVM • Why transform? • Sometimes nonlinear can become linear…
Feature Space in Higher Dimension • An example of what can happen when transforming to a higher dimension
Feature Space • Usually, higher dimension is worse • From computational complexity POV… • ...and from statistical significance POV • But higher dimensional feature space can make data linearly separable • Can we have our cake and eat it too? • Linearly separable and easy to compute? • Yes! Thanks to the kernel trick
Kernel Trick • Enables us to work in input space • With results mapped to feature space • No work done explicitly in feature space • Computations in input space • Lower dimension, so computation easier • But, things “happen” in feature space • Higher dimension, so easier to separate • Very, very cool trick!
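A tiny numerical sketch of the idea, using the standard degree-2 polynomial kernel (this particular kernel and the two sample points are chosen for illustration, not taken from the slides): the kernel k(x,y) = (x·y)² is computed entirely in the 2-dimensional input space, yet it equals the inner product of explicit feature vectors in a 3-dimensional feature space.

```python
# Kernel trick sketch for the degree-2 polynomial kernel.
import numpy as np

def phi(v):
    # explicit feature map: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k(x, y):
    # kernel function: works only in the (lower-dimensional) input space
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(y)))   # inner product computed in feature space
print(k(x, y))                  # same number, never leaving input space
```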
Kernel Trick • Unfortunately, to understand kernel trick, must dig a little (a lot?) deeper • Makes all aspects of SVM clearer • We won’t cover every detail here • Just enough to get idea across • Well, maybe a little more than that… • We’ll need Lagrange multipliers • But first, constrained optimization
Constrained Optimization • General problem (in 2 variables) • Maximize: f(x,y) • Subject to: g(x,y) = c • Objective function f and constraint g • For example, • Maximize: f(x,y) = 16 – (x² + y²) • Subject to: 2x – y = 4 • We’ll look at this example in detail
Specific Example Maximize: f(x,y) = 16 – (x² + y²) Subject to: 2x – y = 4 Graph of f(x,y)
Intersection Intersection of f(x,y) and 2x – y = 4 What is the solution to the problem?
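Before working it out by hand, we can sanity-check the answer numerically. A minimal sketch assuming SciPy is available (the starting point x0 is an arbitrary choice):

```python
# Numerical check: maximize f(x,y) = 16 - (x^2 + y^2) subject to 2x - y = 4.
from scipy.optimize import minimize

f = lambda p: -(16 - (p[0]**2 + p[1]**2))                # negate: minimizing -f maximizes f
con = {'type': 'eq', 'fun': lambda p: 2*p[0] - p[1] - 4}  # equality constraint 2x - y - 4 = 0

res = minimize(f, x0=[0.0, 0.0], constraints=[con])
print(res.x, -res.fun)   # roughly (1.6, -0.8) and 12.8, i.e., (8/5, -4/5) and 64/5
```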
Constrained Optimization • This example looks easy • But how to solve in general? • Recall, general case (in 2 variables) is • Maximize: f(x,y) • Subject to: g(x,y) = c • How to “simplify”? • Combine objective function f(x,y) and constraint g(x,y) = c into one equation!
Proposed Solution • Define J(x,y) = f(x,y) + I(x,y) • Where I(x,y) is 0 whenever g(x,y) = c and -∞ otherwise • Recall the general problem… • Maximize: f(x,y) • Subject to: g(x,y) = c • Solution is given by max J(x,y) • Here, max is over (x,y)
Proposed Solution • We know how to solve maximization problems using calculus • So, we’ll use calculus to solve the problem max J(x,y), right? • WRONG! • The function J(x,y) is not at all “nice” • This function is not differentiable • It’s not even continuous!
Proposed Solution • Again, let J(x,y) = f(x,y) + I(x,y) • Where I(x,y) is 0 whenever g(x,y) = c and -∞ otherwise • Then max J(x,y) is solution to problem • This is good • But we can’t solve this max problem • This is very bad • What to do???
New-and-Improved Solution • Let’s replace I(x,y) with a nice function • What are the nicest functions of all? • Linear functions (in the constraint) • To maximize f(x,y), subject to g(x,y) = c, we first define the Lagrangian L(x,y,λ) = f(x,y) + λ(g(x,y) – c) • Nice function in λ, so calculus applies • But, not just a max problem (next slide…)
New-and-Improved Solution • Maximize: f(x,y), subject to: g(x,y) = c • Again, the Lagrangian is L(x,y,λ) = f(x,y) + λ(g(x,y) – c) • Observe that min L(x,y,λ) = J(x,y) • Where min is over λ • Recall that max J(x,y) solves problem • So max min L(x,y,λ) also solves problem • Advantage of this form of problem?
Lagrange Multipliers • Maximize: f(x,y), subject to: g(x,y) = c • Lagrangian: L(x,y,λ) = f(x,y) + λ(g(x,y) – c) • Solution given by max min L(x,y,λ) • Note this is max wrt the (x,y) variables… • ...and min is wrt the λ parameter • So, solution is at a “saddle point” wrt the overall function, i.e., the (x,y,λ) variables • By definition of a saddle point
Saddle Points • Graph of L(x,λ) = 4 – x² + λ(x – 1) • Note, f(x) = 4 – x² and constraint is x = 1
New-and-Improved Solution • Maximize: f(x,y), subject to: g(x,y) = c • Lagrangian is L(x,y,λ) = f(x,y) + λ(g(x,y) – c) • Solved by max min L(x,y,λ) • Calculus to the rescue! • Setting ∂L/∂λ = 0 recovers the constraint g(x,y) = c • Lagrangian: constrained optimization converted to unconstrained optimization
More, More, More • Lagrangian generalizes to more variables and/or more constraints • Or, more succinctly, L(x,λ) = f(x) + Σλi (gi(x) – ci) • Where x=(x1,x2,…,xn) and λ=(λ1,λ2,…,λm)
Another Example • Lots of good geometric examples • First, we do a non-geometric example • Consider discrete probability distribution on n points: p1,p2,p3,…,pn • What distribution has max entropy? • We want to maximize entropy function • Subject to constraint that the pj form a probability distribution
Maximize Entropy • Shannon entropy: –Σ pj log2 pj • Have a probability distribution, so… • Require 0 ≤ pj ≤ 1 for all j, and Σpj = 1 • We will solve this simplified problem: • Maximize: f(p1,…,pn) = –Σ pj log2 pj • Subject to constraint: Σpj = 1 • How should we solve this? • Do you really have to ask?
Entropy Example • Recall L(x,y,λ) = f(x,y) + λ (g(x,y) – c) • Problem statement • Maximize f(p1,…,pn) = –Σ pj log2 pj • Subject to constraint Σpj = 1 • In this case, Lagrangian is L(p1,…,pn,λ) = –Σ pj log2 pj + λ (Σpj – 1) • Compute partial derivatives wrt each pj and the partial derivative wrt λ
Entropy Example • Have L(p1,…,pn,λ) = –Σ pj log2 pj + λ (Σpj – 1) • The partial derivative wrt any pj yields –log2 pj – 1/ln(2) + λ = 0 (#) • And wrt λ yields the constraint Σpj – 1 = 0, or Σpj = 1 (##) • Equation (#) implies all pj are equal • With equation (##), all pj = 1/n • Conclusion?
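A quick numerical sanity check of this conclusion, comparing the uniform distribution against many randomly generated distributions; this sketch assumes only NumPy, and n = 5 and the number of random samples are arbitrary choices.

```python
# Sanity check: no random probability vector beats the uniform distribution's entropy.
import numpy as np

def entropy(p):
    return -np.sum(p * np.log2(p))

n = 5
uniform = np.full(n, 1 / n)
rng = np.random.default_rng(1)
random_dists = rng.dirichlet(np.ones(n), size=1000)   # random points on the probability simplex

print(entropy(uniform))                       # log2(5) ~ 2.3219
print(max(entropy(p) for p in random_dists))  # smaller than the uniform entropy
```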
Notation • Let x=(x1,x2,…,xn) and λ=(λ1,λ2,…,λm) • Again, we write Lagrangian as L(x,λ) = f(x) + Σλi (gi(x) – ci) • Note: L is a function of n+m variables • Can view the problem as… • Constraints gi define a feasible region • Maximize the objective function f over this feasible region
Lagrangian Duality • For Lagrange multipliers… • Primal problem: max min L(x,y,λ) • Where max over (x,y) and min over λ • Dual problem: min max L(x,y,λ) • As above, max over (x,y) and min over λ • We claim it’s easy to see that min max L(x,y,λ) ≥ max min L(x,y,λ) • Why is this true? Next slide...
Dual Problem • Recall J(x,y) = f(x,y) + I(x,y) • Where I(x,y) is 0 whenever g(x,y) = c and -∞ otherwise • And max J(x,y) is a solution • Then L(x,y,λ) ≥ J(x,y), since they agree when g(x,y) = c and J(x,y) = -∞ otherwise • And max L(x,y,λ) ≥ max J(x,y) for all λ • Therefore, min max L(x,y,λ) ≥ max J(x,y) • Hence min max L(x,y,λ) ≥ max min L(x,y,λ)
Dual Problem • So, we have shown that the dual problem provides an upper bound • min max L(x,y,λ) ≥ max min L(x,y,λ) • That is, dual solution ≥ primal solution • But it’s even better than that • For our Lagrangian problems, equality holds • Why equality? • Because the optimization problem is convex (so strong duality holds)
Primal Problem • Maximize: f(x,y) = 16 – (x² + y²) • Subject to: 2x – y = 4 • Then L(x,y,λ) = 16 – (x² + y²) + λ(2x – y – 4) • Compute partial derivatives… dL/dx = -2x + 2λ = 0, dL/dy = -2y – λ = 0, dL/dλ = 2x – y – 4 = 0 • Result: (x,y,λ) = (8/5,-4/5,8/5) • Which yields max of f(x,y) = 64/5
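A symbolic double-check of this calculation; a minimal sketch assuming SymPy is available (the variable names are arbitrary):

```python
# Solve the primal Lagrangian system dL/dx = dL/dy = dL/dlambda = 0 symbolically.
import sympy as sp

x, y, lam = sp.symbols('x y lambda')
L = 16 - (x**2 + y**2) + lam * (2*x - y - 4)   # the Lagrangian

sol = sp.solve([sp.diff(L, v) for v in (x, y, lam)], (x, y, lam), dict=True)[0]
print(sol)                              # {x: 8/5, y: -4/5, lambda: 8/5}
print((16 - (x**2 + y**2)).subs(sol))   # 64/5
```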
Dual Problem • Maximize: f(x,y) = 16 – (x² + y²) • Subject to: 2x – y = 4 • Then L(x,y,λ) = 16 – (x² + y²) + λ(2x – y – 4) • Recall that dual problem is min max L(x,y,λ) • Where max is over (x,y), min is over λ • How can we solve this?
Dual Problem • Dual problem: min max L(x,y,λ) • So, can first take max of L over (x,y) • Then we are left with a function L only in λ • To solve problem, then find min L(λ) • On next slide, we illustrate this for L(x,y,λ) = 16 – (x² + y²) + λ(2x – y – 4) • Same example as considered above
Dual Problem • Given L(x,y,λ) = 16 – (x² + y²) + λ(2x – y – 4) • Maximize over (x,y) by computing dL/dx = -2x + 2λ = 0, dL/dy = -2y – λ = 0 • Which implies x = λ and y = -λ/2 • Substitute these into L to obtain L(λ) = (5/4)λ² – 4λ + 16
Dual Problem • Original problem • Maximize: f(x,y) = 16 – (x² + y²) • Subject to: 2x – y = 4 • Solution can be found by minimizing L(λ) = (5/4)λ² – 4λ + 16 • Then L’(λ) = (5/2)λ – 4 = 0, which gives λ = 8/5 and (x,y) = (8/5,-4/5) • Same solution as the primal problem!
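The same two-step dual procedure, sketched in SymPy under the same assumptions as the primal check above: first eliminate (x,y), then minimize what is left over λ.

```python
# Dual procedure: maximize over (x, y) first, then minimize the remaining L(lambda).
import sympy as sp

x, y, lam = sp.symbols('x y lambda')
L = 16 - (x**2 + y**2) + lam * (2*x - y - 4)

# dL/dx = 0 and dL/dy = 0  =>  x = lambda, y = -lambda/2
xy = sp.solve([sp.diff(L, x), sp.diff(L, y)], (x, y))
L_dual = sp.simplify(L.subs(xy))              # 5*lambda**2/4 - 4*lambda + 16
lam_star = sp.solve(sp.diff(L_dual, lam), lam)[0]
print(lam_star, L_dual.subs(lam, lam_star))   # 8/5 and 64/5, matching the primal
```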
Summary of Dual Problem • Maximize L to find (x,y) in terms of λ • Then rewrite L as a function of λ only • Finally, minimize L(λ) to solve the problem • But, why all of the fuss? • The dual form lets us write the problem in a much more user-friendly way • In SVM, we’ll work with the dual form of the problem
Lagrange Multipliers and SVM • Lagrange multipliers very cool indeed • But what does this have to do with SVM? • Can view (soft) margin computation as constrained optimization problem • In this form, kernel trick becomes clear • We can kill 2 birds with 1 stone • Make margin calculation clearer • Make kernel trick perfectly clear
Problem Setup • Let X1,X2,…,Xn be data points (vectors) • Each Xi = (xi,yi) is a point in the plane • In general, could be higher dimension • Let z1,z2,…,zn be corresponding class labels, where each zi ∈ {-1,1} • Where zi = 1 if classified as “red” type • And zi = -1 if classified as “blue” type • Note this is a binary classification
Geometric View • Equation of yellow line: w1x + w2y + b = 0 • Equation of red line: w1x + w2y + b = 1 • Equation of blue line: w1x + w2y + b = -1 • Margin m is length of green line
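As a hypothetical illustration of these quantities, the sketch below fits a linear SVM (assuming scikit-learn) to four invented points and reads off w = (w1,w2), b, and 1/||w||, which is the distance from the hyperplane w·x + b = 0 to the lines w·x + b = ±1, i.e., how the margin m is read off the figure.

```python
# Recover (w1, w2), b, and the margin from a fitted linear SVM.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [2, 2], [2, 3]])  # invented training points
z = np.array([-1, -1, 1, 1])                    # class labels in {-1, 1}

clf = SVC(kernel='linear', C=1e6).fit(X, z)     # very large C ~ hard margin
w = clf.coef_[0]                                # (w1, w2)
b = clf.intercept_[0]
print(w, b)
print(1 / np.linalg.norm(w))   # distance from w.x + b = 0 to w.x + b = +/-1
```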