Kernel-Based Methods. Presented by Jason Friedman and Lena Gorelick. Advanced Topics in Computer and Human Vision, Spring 2003. Agenda: Structural Risk Minimization (SRM), Support Vector Machines (SVM), Feature Space vs. Input Space, Kernel PCA, Kernel Fisher Discriminant Analysis (KFDA).
Kernel-Based Methods • Presented by Jason Friedman and Lena Gorelick • Advanced Topics in Computer and Human Vision, Spring 2003
Agenda… • Structural Risk Minimization (SRM) • Support Vector Machines (SVM) • Feature Space vs. Input Space • Kernel PCA • Kernel Fisher Discriminant Analysis (KFDA)
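Structural Risk Minimization (SRM) • Definition: • Training set with $l$ observations: $(x_1, y_1), \dots, (x_l, y_l)$ • Each observation consists of a pair: a vector $x_i \in \mathbb{R}^n$ and a label $y_i \in \{-1, +1\}$ • [Figure: example input of 16×16 = 256 values, i.e. $n = 256$]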
Structural Risk Minimization (SRM) • The task: “Generalization” - find a mapping $x \mapsto y$ • Assumption: training and test data are drawn from the same probability distribution $p(x, y)$, i.e. a new pair $(x, y)$ is “similar” to $(x_1, y_1), \dots, (x_l, y_l)$
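Structural Risk Minimization (SRM) – Learning Machine • Definition: • A learning machine is a family of functions $\{f(\alpha)\}$, where $\alpha$ is a set of parameters. • For the task of learning two classes, $f(x, \alpha) \in \{-1, 1\}\ \forall x, \alpha$ • Example: the class of oriented lines in $\mathbb{R}^2$: $f(x, \alpha) = \mathrm{sign}(\alpha_1 x_1 + \alpha_2 x_2 + \alpha_3)$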
Structural Risk Minimization (SRM) – Capacity vs. Generalization • Definition: • The capacity of a learning machine measures its ability to learn any training set without error. • [Figure: too much capacity (overfitting): “Does it have the same # of leaves?”; too little capacity (underfitting): “Is the color green?”]
Structural Risk Minimization (SRM) – Capacity vs. Generalization • For small sample sizes overfitting or underfitting might occur • Best generalization = right balance between accuracy and capacity
Structural Risk Minimization (SRM) – Capacity vs. Generalization • Solution: Restrict the complexity (capacity) of the function class. • Intuition: “Simple” function that explains most of the data is preferable to a “complex” one.
Structural Risk Minimization (SRM) - VC dimension • What is a “simple”/“complex” function? • Definition: • Given $l$ points (which can be labeled in $2^l$ ways) • The set of points is shattered by the function class $\{f(\alpha)\}$ if for each labeling there is a function in the class that correctly assigns those labels.
Structural Risk Minimization (SRM) - VC dimension • Definition: • The VC dimension of $\{f(\alpha)\}$ is the maximum number of points that can be shattered by $\{f(\alpha)\}$; it is a measure of capacity.
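Structural Risk Minimization (SRM) - VC dimension • Theorem: The VC dimension of the set of oriented hyperplanes in $\mathbb{R}^n$ is $n+1$. • Low # of parameters $\Rightarrow$ low VC dimension here, but not in general: the one-parameter family $\mathrm{sign}(\sin(\alpha x))$ has infinite VC dimension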
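The theorem can be checked by brute force in $\mathbb{R}^2$, where it predicts a VC dimension of 3. The sketch below is an illustration, not part of the original slides: it tries every labeling of a point set and uses a perceptron to look for an oriented line that realises it. The function names, the point sets and the `max_epochs` cutoff are assumptions; a labeling that is not separated within the cutoff is only reported as "not found", which is conclusive here because the XOR labeling of the four corner points is known to be non-separable.

```python
from itertools import product

def separable(points, labels, max_epochs=1000):
    """Perceptron search for an oriented line sign(a1*x1 + a2*x2 + a3)
    that reproduces the labels. Returns False if none is found within
    max_epochs passes (a heuristic cutoff, assumed for this sketch)."""
    a1 = a2 = a3 = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (a1 * x1 + a2 * x2 + a3) <= 0:  # wrong side or on the line
                a1 += y * x1
                a2 += y * x2
                a3 += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def shattered(points):
    """True if every one of the 2^l labelings is realised by some line."""
    return all(separable(points, labels)
               for labels in product((-1, 1), repeat=len(points)))

three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]              # general position
four = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]   # corners of a square
print(shattered(three))
print(shattered(four))
```

Three points in general position are shattered, while the four corners are not, because no line separates the two diagonals. This single four-point set does not by itself prove the theorem, which needs the same failure for every four-point set.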
Structural Risk Minimization (SRM) - Bounds • Definition: Actual risk $R(\alpha) = \int \frac{1}{2} |y - f(x, \alpha)|\, p(x, y)\, dx\, dy$ • Minimize $R(\alpha)$ • But we can’t measure the actual risk, since we don’t know $p(x, y)$
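Structural Risk Minimization (SRM) - Bounds • Definition: Empirical risk $R_{emp}(\alpha) = \frac{1}{2l} \sum_{i=1}^{l} |y_i - f(x_i, \alpha)|$ • $R_{emp}(\alpha) \to R(\alpha)$ as $l \to \infty$ • But for a small training set, deviations might occur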
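Structural Risk Minimization (SRM) - Bounds • Risk bound: with probability $(1-\eta)$, $R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h(\log(2l/h) + 1) - \log(\eta/4)}{l}}$ • The second term is the confidence term; $h$ is the VC dimension of the function class • Not valid for infinite VC dimension • Note: the bound is independent of $p(x, y)$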
Structural Risk Minimization (SRM) - Principled Method • Principled method for choosing a learning machine for a given task:
SRM • Divide the class of functions into nested subsets • Either calculate $h$ for each subset, or get a bound on it • Train each subset to achieve minimal empirical error • Choose the subset with the minimal risk bound • [Figure: risk bound vs. complexity]
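The selection step can be sketched as follows, assuming the standard VC confidence term $\sqrt{(h(\log(2l/h) + 1) - \log(\eta/4))/l}$ added to the empirical risk. The subset names, VC dimensions and empirical risks are hypothetical numbers chosen for illustration, and `srm_select` is a made-up helper name.

```python
import math

def vc_confidence(h, l, eta):
    """VC confidence term sqrt((h*(ln(2l/h) + 1) - ln(eta/4)) / l)."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

def srm_select(subsets, l, eta=0.05):
    """subsets: list of (name, h, empirical_risk), one entry per nested
    function class, each already trained to minimal empirical risk.
    Returns the (name, bound) pair with the smallest risk bound."""
    bounds = [(name, r_emp + vc_confidence(h, l, eta))
              for name, h, r_emp in subsets]
    return min(bounds, key=lambda nb: nb[1])

# Hypothetical numbers: richer subsets fit better but pay a larger confidence term.
subsets = [("S1", 5, 0.30), ("S2", 20, 0.10), ("S3", 200, 0.02)]
best_name, best_bound = srm_select(subsets, l=1000)
print(best_name, round(best_bound, 3))
```

With these numbers the middle subset wins: S1 underfits (high empirical risk) and S3 has the lowest empirical risk but the largest confidence term.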
Agenda… • Structural Risk Minimization (SRM) • Support Vector Machines (SVM) • Feature Space vs. Input Space • Kernel PCA • Kernel Fisher Discriminant Analysis (KFDA)
Support Vector Machines (SVM) • Currently the “en vogue” approach to classification • Successful applications in bioinformatics, text, handwriting recognition, image processing • Introduced by Boser, Guyon and Vapnik, 1992 • SVMs are a particular instance of Kernel Machines
Linear SVM – Separable case • Two given classes are linearly separable
Linear SVM - definitions • Separating hyperplane H: $w \cdot x + b = 0$ • $w$ is normal to H • $|b| / \|w\|$ is the perpendicular distance from H to the origin • $d_+$ ($d_-$) is the shortest distance from H to the closest positive (negative) point.
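Linear SVM - definitions • If H is a separating hyperplane, then $w \cdot x_i + b > 0$ for $y_i = +1$ and $w \cdot x_i + b < 0$ for $y_i = -1$ • H1 and H2 are the hyperplanes parallel to H through the closest positive and negative points • No training points fall between H1 and H2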
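Linear SVM - definitions • By scaling $w$ and $b$, we can require that $w \cdot x_i + b \ge +1$ for $y_i = +1$ and $w \cdot x_i + b \le -1$ for $y_i = -1$ • Or more simply: $y_i (w \cdot x_i + b) - 1 \ge 0\ \forall i$ • Equality holds $\Leftrightarrow$ $x_i$ lies on H1 ($w \cdot x + b = +1$) or H2 ($w \cdot x + b = -1$)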
Linear SVM - definitions • Note: w is no longer a unit vector • Margin is now 2 / ||w|| • Find hyperplane with the largest margin.
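A minimal sketch of the rescaling and the resulting margin, assuming a pair $(w, b)$ that already separates the data. The helper names `canonical_form` and `margin` and the toy data are hypothetical.

```python
import math

def canonical_form(w, b, X, y):
    """Rescale (w, b) so that min_i y_i * (w . x_i + b) == 1.
    Assumes (w, b) already separates the data."""
    margins = [yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
               for xi, yi in zip(X, y)]
    m = min(margins)
    if m <= 0:
        raise ValueError("(w, b) does not separate the data")
    return [wj / m for wj in w], b / m

def margin(w):
    """Margin 2 / ||w|| of a hyperplane in canonical form."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))

# Hypothetical toy data: two points per class.
X = [(1.0, 0.0), (3.0, 1.0), (-1.0, 0.0), (-2.0, -1.0)]
y = [1, 1, -1, -1]
w, b = canonical_form([5.0, 0.0], 0.0, X, y)
print(w, b, margin(w))
```

The input $w = (5, 0)$ describes the same hyperplane as the rescaled $w = (1, 0)$; only after rescaling does $2 / \|w\|$ equal the geometric distance between H1 and H2 (here 2, the gap between $x_1 = -1$ and $x_1 = +1$).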
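Linear SVM – maximizing margin • Maximizing the margin $\Leftrightarrow$ minimizing $\|w\|^2$ • $\Rightarrow$ more room for unseen points to fall • $\Rightarrow$ restricts the capacity: $h \le \Lambda^2 R^2 + 1$ for hyperplanes with $\|w\| \le \Lambda$, where $R$ is the radius of the smallest ball around the data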
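Linear SVM – Constrained Optimization • Introduce Lagrange multipliers $\alpha_i \ge 0$, $i = 1, \dots, l$ • “Primal” formulation: $L_P = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{l} \alpha_i$ • Minimize $L_P$ with respect to $w$ and $b$ • Require $\alpha_i \ge 0$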
Linear SVM – Constrained Optimization • Objective function is quadratic • Each linear constraint defines a convex set • Intersection of convex sets is a convex set • $\Rightarrow$ can formulate the “Wolfe dual” problem
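Linear SVM – Constrained Optimization • Maximize $L_P$ with respect to $\alpha_i$ • Require $\partial L_P / \partial w = 0$ and $\partial L_P / \partial b = 0$, which give the solution $w = \sum_i \alpha_i y_i x_i$ and the constraint $\sum_i \alpha_i y_i = 0$ • Substitute into $L_P$ to give: $L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j$ • Maximize $L_D$ with respect to $\alpha_i$, subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$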
Linear SVM – Constrained Optimization • Using the Karush-Kuhn-Tucker conditions: $\alpha_i \left( y_i (w \cdot x_i + b) - 1 \right) = 0\ \forall i$ • If $\alpha_i > 0$ then $x_i$ lies either on H1 or H2 $\Rightarrow$ the solution is sparse in $\alpha_i$ • Those training points are called “support vectors”. Their removal would change the solution
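A worked instance of the dual for the smallest possible case may help: two training points, one per class. The data are hypothetical and chosen so that every step can be done by hand.

```latex
\begin{align*}
&x_1 = (1, 0),\; y_1 = +1, \qquad x_2 = (-1, 0),\; y_2 = -1 \\
&\textstyle\sum_i \alpha_i y_i = 0 \;\Rightarrow\; \alpha_1 = \alpha_2 = \alpha \\
&L_D = 2\alpha - \tfrac{1}{2}\,\alpha^2 \left( x_1 \cdot x_1 - 2\, x_1 \cdot x_2 + x_2 \cdot x_2 \right)
     = 2\alpha - 2\alpha^2 \\
&\frac{dL_D}{d\alpha} = 2 - 4\alpha = 0 \;\Rightarrow\; \alpha = \tfrac{1}{2} \\
&w = \sum_i \alpha_i y_i x_i = \tfrac{1}{2}(1, 0) - \tfrac{1}{2}(-1, 0) = (1, 0) \\
&y_1 (w \cdot x_1 + b) = 1 \;\Rightarrow\; b = 0, \qquad \text{margin} = 2 / \|w\| = 2
\end{align*}
```

Both multipliers are positive, so both points are support vectors and lie on H1 and H2 respectively; the margin of 2 is the distance between the two points.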
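SVM – Test Phase • Given the unseen sample $x$ we take the class of $x$ to be $f(x) = \mathrm{sign}(w \cdot x + b) = \mathrm{sign}\left( \sum_i \alpha_i y_i\, x_i \cdot x + b \right)$, where the sum runs over the support vectors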
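A minimal sketch of the test phase in its dual form, $\mathrm{sign}(\sum_i \alpha_i y_i\, x_i \cdot x + b)$. The "trained model" below (two support vectors, $\alpha_i = 1/2$, $b = 0$) is a hypothetical toy solution, and `classify` is a made-up helper name.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def classify(x, support_vectors, labels, alphas, b):
    """Return sign(sum_i alpha_i * y_i * (x_i . x) + b) as +1 or -1."""
    s = sum(a * yi * dot(xi, x)
            for xi, yi, a in zip(support_vectors, labels, alphas)) + b
    return 1 if s >= 0 else -1

# Hypothetical trained model: two support vectors, alpha = 1/2 each, b = 0.
sv = [(1.0, 0.0), (-1.0, 0.0)]
ys = [1, -1]
alphas = [0.5, 0.5]
print(classify((2.0, 5.0), sv, ys, alphas, b=0.0))
print(classify((-0.3, 1.0), sv, ys, alphas, b=0.0))
```

Only the support vectors enter the sum, so the cost of classifying a new point scales with the number of support vectors rather than with the size of the training set.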
Linear SVM – Non-separable case • The separable case corresponds to an empirical risk of zero. • For noisy data this might not be the minimum of the actual risk (overfitting). • The separable formulation has no feasible solution in the non-separable case
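Linear SVM – Non-separable case • Relax the constraints by introducing positive slack variables $\xi_i \ge 0$: $w \cdot x_i + b \ge +1 - \xi_i$ for $y_i = +1$ and $w \cdot x_i + b \le -1 + \xi_i$ for $y_i = -1$ • $\sum_i \xi_i$ is an upper bound on the number of training errors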
Linear SVM – Non-separable case • Assign extra cost to errors • Minimize $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$, where $C$ is a penalty parameter chosen by the user
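For a fixed $(w, b)$ the smallest feasible slack is $\xi_i = \max(0,\ 1 - y_i (w \cdot x_i + b))$, so the objective can be evaluated directly. The sketch below assumes the linear penalty $C \sum_i \xi_i$; the data, the value of `C` and the helper name are hypothetical.

```python
def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) with slack_i = max(0, 1 - y_i (w . x_i + b))
    and objective = 0.5 * ||w||^2 + C * sum(slacks)."""
    slacks = [max(0.0, 1.0 - yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b))
              for xi, yi in zip(X, y)]
    objective = 0.5 * sum(wj * wj for wj in w) + C * sum(slacks)
    return objective, slacks

# Hypothetical data: the last point is a positive example on the negative side.
X = [(2.0, 0.0), (-2.0, 0.0), (-1.0, 0.0)]
y = [1, -1, 1]
obj, slacks = soft_margin_objective([1.0, 0.0], 0.0, X, y, C=10.0)
print(obj, slacks)
```

A slack above 1 marks a misclassified training point, which is why the sum of the slacks bounds the number of training errors from above. A larger `C` makes each such violation more expensive relative to a wide margin.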
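Linear SVM – Non-separable case • Lagrange formulation again: $L_P = \frac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \left( y_i (w \cdot x_i + b) - 1 + \xi_i \right) - \sum_i \mu_i \xi_i$, where the $\mu_i$ are Lagrange multipliers enforcing $\xi_i \ge 0$ • “Wolfe dual” problem - maximize: $L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j$ subject to: $0 \le \alpha_i \le C$, $\sum_i \alpha_i y_i = 0$ • The solution: $w = \sum_i \alpha_i y_i x_i$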
Linear SVM – Non-separable case • Using the Karush-Kuhn-Tucker conditions: $\alpha_i \left( y_i (w \cdot x_i + b) - 1 + \xi_i \right) = 0$ and $\mu_i \xi_i = 0$ • The solution is sparse in $\alpha_i$
Nonlinear SVM • A nonlinear decision function might be needed
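Nonlinear SVM - Feature Space • Map the data to a high dimensional (possibly infinite) feature space: $\Phi: \mathbb{R}^n \to F$ • The solution depends only on the dot products $\Phi(x_i) \cdot \Phi(x_j)$ • If there were a function $k(x_i, x_j)$ s.t. $k(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$ $\Rightarrow$ no need to know $\Phi$ explicitly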
Nonlinear SVM – Toy example • [Figure: two panels, “Input Space” and “Feature Space”]
Nonlinear SVM – Avoid the Curse • Curse of dimensionality: the difficulty of an estimation problem increases drastically with the dimension • But! Learning in $F$ may be simpler if one uses a low-complexity function class (hyperplanes)
Nonlinear SVM - Kernel Functions • Kernel functions exist! • They effectively compute dot products in feature space • Can use a kernel without knowing $\Phi$ and $F$ • Given a kernel, $\Phi$ and $F$ are not unique • The $F$ with the smallest dimension is called the minimal embedding space
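A small numerical check of these points, assuming the degree-2 homogeneous polynomial kernel $k(x, y) = (x \cdot y)^2$ on $\mathbb{R}^2$ as the example (the slides do not say which kernel their toy example uses). The maps `phi` and `phi4` are two different explicit feature maps for the same kernel, which illustrates that $\Phi$ and $F$ are not unique.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def k(x, y):
    """Kernel evaluated in input space: (x . y)^2."""
    return dot(x, y) ** 2

def phi(x):
    """Feature map R^2 -> R^3: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def phi4(x):
    """A different map R^2 -> R^4 that induces the same kernel."""
    x1, x2 = x
    return (x1 * x1, x1 * x2, x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (3.0, -1.0)
print(k(x, y), dot(phi(x), phi(y)), dot(phi4(x), phi4(y)))
```

All three numbers agree up to floating-point rounding: the kernel returns the feature-space dot product without ever forming $\Phi(x)$, and the three-dimensional map is the smaller of the two embeddings shown.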
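Nonlinear SVM - Kernel Functions • Mercer’s condition: There exists a pair $\{\Phi, F\}$ such that $k(x, y) = \Phi(x) \cdot \Phi(y)$ iff for any $g(x)$ s.t. $\int g(x)^2\, dx$ is finite, $\int k(x, y)\, g(x)\, g(y)\, dx\, dy \ge 0$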
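Nonlinear SVM - Kernel Functions • Formulation of the algorithm in terms of kernels: maximize $L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$ subject to $0 \le \alpha_i \le C$, $\sum_i \alpha_i y_i = 0$ • Classify with $f(x) = \mathrm{sign}\left( \sum_i \alpha_i y_i\, k(x_i, x) + b \right)$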
Nonlinear SVM-Kernel Functions • Kernels frequently used:
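As an illustration, here is a sketch of three kernels that are commonly used with SVMs: polynomial, Gaussian RBF and sigmoid. Whether these are exactly the ones listed on the original slide is an assumption, and the parameter names `p`, `c`, `sigma`, `kappa` and `delta` are chosen here for the example.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def polynomial_kernel(x, y, p=2, c=1.0):
    """(x . y + c)^p"""
    return (dot(x, y) + c) ** p

def rbf_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 sigma^2))"""
    sq = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    """tanh(kappa * x . y - delta); satisfies Mercer's condition
    only for some values of kappa and delta."""
    return math.tanh(kappa * dot(x, y) - delta)

def gram_matrix(kernel, X):
    """Matrix of kernel values k(x_i, x_j), which replaces x_i . x_j in L_D."""
    return [[kernel(xi, xj) for xj in X] for xi in X]

X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gram_matrix(rbf_kernel, X)
for row in K:
    print([round(v, 3) for v in row])
```

The Gram matrix is all the dual problem needs from the data, so switching kernels changes only the `kernel` argument and leaves the rest of the algorithm untouched.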
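Nonlinear SVM - Feature Space • For the homogeneous polynomial kernel $(x \cdot y)^p$ on $\mathbb{R}^d$, $\dim(F) = \binom{d+p-1}{p}$: $d = 256$, $p = 4$ $\Rightarrow$ $\dim(F) = 183{,}181{,}376$ • A hyperplane $\{w, b\}$ in $F$ requires $\dim(F) + 1$ parameters • Solving the SVM means adjusting only $l + 1$ parameters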
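The figure above can be reproduced by counting the degree-$p$ monomials in $d$ variables, $\binom{d+p-1}{p}$, which is the feature-space dimension for the homogeneous polynomial kernel $(x \cdot y)^p$. The training-set size `l` below is a hypothetical value, used only to contrast the two parameter counts.

```python
import math

def poly_feature_dim(d, p):
    """Number of degree-p monomials in d variables: C(d + p - 1, p)."""
    return math.comb(d + p - 1, p)

d, p = 256, 4
l = 10_000  # hypothetical number of training points

print(poly_feature_dim(d, p))      # dimension of F
print(poly_feature_dim(d, p) + 1)  # parameters of an explicit {w, b} in F
print(l + 1)                       # parameters of the dual: l alphas and b
```

The dual therefore works with a number of parameters set by the training-set size rather than by the dimension of $F$.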
SVM - Solution • The dual problem (maximizing $L_D$) is convex $\Rightarrow$ the solution is global • Two types of non-uniqueness: • $\{w, b\}$ is not unique • $\{w, b\}$ is unique, but the set $\{\alpha_i\}$ is not; prefer the set with fewer support vectors (sparse)