Kernel-Based Methods

Kernel-Based Methods • Presented by Jason Friedman and Lena Gorelick • Advanced Topics in Computer and Human Vision, Spring 2003

Agenda… • Structural Risk Minimization (SRM) • Support Vector Machines (SVM) • Feature Space vs. Input Space • Kernel PCA • Kernel Fisher Discriminant Analysis (KFDA)

Structural Risk Minimization (SRM) • Definition: • Training set with l observations • Each observation consists of a pair: $x_i \in \mathbb{R}^n$, $y_i \in \{-1, +1\}$, $i = 1, \dots, l$ • Example: a 16x16 image gives n = 256

Structural Risk Minimization (SRM) • The task: “Generalization” - find a mapping x → y • Assumption: Training and test data are drawn from the same probability distribution, i.e. (x,y) is “similar” to (x1,y1), …, (xl,yl)

Structural Risk Minimization (SRM) – Learning Machine • Definition: • A learning machine is a family of functions {f(α)}, where α is a set of parameters. • For the task of learning two classes, f(x,α) ∈ {-1,1} ∀ x, α • Example: the class of oriented lines in R2: sign(α1x1 + α2x2 + α3)

Structural Risk Minimization (SRM) – Capacity vs. Generalization • Definition: • Capacity of a learning machine measures the ability to learn any training set without error. • Too much capacity (overfitting): “Does it have the same # of leaves?” • Too little capacity (underfitting): “Is the color green?”

Structural Risk Minimization (SRM) – Capacity vs. Generalization • For small sample sizes overfitting or underfitting might occur • Best generalization = right balance between accuracy and capacity

Structural Risk Minimization (SRM) – Capacity vs. Generalization • Solution: Restrict the complexity (capacity) of the function class. • Intuition: “Simple” function that explains most of the data is preferable to a “complex” one.

Structural Risk Minimization (SRM) – VC dimension • What is a “simple”/“complex” function? • Definition: • Given l points (can be labeled in 2^l ways) • The set of points is shattered by the function class {f(α)} if for each labeling there is a function which correctly assigns those labels.

Structural Risk Minimization (SRM) – VC dimension • Definition • VC dimension of {f(α)} is the maximum number of points that can be shattered by {f(α)} and is a measure of capacity.

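Structural Risk Minimization (SRM) – VC dimension • Theorem: The VC dimension of the set of oriented hyperplanes in Rn is n+1. • Low # of parameters ⇒ low VC dimension (true for hyperplanes, but not in general: the one-parameter family sign(sin(αx)) has infinite VC dimension)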
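
The shattering definition can be checked by brute force for oriented lines in R2. The sketch below is only an illustration under stated assumptions: it uses a perceptron search (a swapped-in technique, not something from the slides) to look for parameters (a1, a2, a3), and it treats "no line found within the epoch budget" as evidence of non-separability, not as a proof. The point sets and the function names are hypothetical.

```python
from itertools import product


def separable_by_line(points, labels, max_epochs=1000):
    """Search for (a1, a2, a3) with sign(a1*x1 + a2*x2 + a3) == label.

    Uses perceptron updates. Returns True if a separating oriented line is
    found within max_epochs; False means none was found in that budget.
    """
    a1 = a2 = a3 = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (a1 * x1 + a2 * x2 + a3) <= 0:  # wrong side or on the line
                a1 += y * x1
                a2 += y * x2
                a3 += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False


def is_shattered(points):
    """True if every one of the 2^l labelings is realized by some line."""
    return all(
        separable_by_line(points, labels)
        for labels in product((-1, 1), repeat=len(points))
    )


three_points = [(0, 0), (1, 0), (0, 1)]         # not collinear
four_points = [(0, 0), (1, 0), (0, 1), (1, 1)]  # corners of a square

print(is_shattered(three_points))  # True
print(is_shattered(four_points))   # False (the XOR labeling fails)
```

Note that the code only tests one particular set of four points; the theorem is the stronger statement that no set of four points in R2 can be shattered by oriented lines.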

Structural Risk Minimization (SRM) – Bounds • Definition: Actual risk $R(\alpha) = \int \frac{1}{2} |y - f(x,\alpha)| \, p(x,y) \, dx \, dy$ • Minimize R(α) • But we can’t measure the actual risk, since we don’t know p(x,y)

Structural Risk Minimization (SRM) – Bounds • Definition: Empirical risk $R_{emp}(\alpha) = \frac{1}{2l} \sum_{i=1}^{l} |y_i - f(x_i,\alpha)|$ • Remp(α) → R(α) as l → ∞ • But for a small training set, deviations might occur
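
As a small illustration, the empirical risk can be computed directly, assuming the usual form R_emp(α) = (1/2l) Σ |y_i − f(x_i, α)| for labels in {-1, +1}, so that each misclassified point contributes 1/l. The data and the two parameter vectors below are made up for the example.

```python
def oriented_line(x, alpha):
    """f(x, alpha) = sign(alpha1*x1 + alpha2*x2 + alpha3), with sign(0) = +1."""
    s = alpha[0] * x[0] + alpha[1] * x[1] + alpha[2]
    return 1 if s >= 0 else -1


def empirical_risk(f, alpha, xs, ys):
    """Fraction of training points that f(., alpha) misclassifies."""
    l = len(xs)
    return sum(abs(y - f(x, alpha)) for x, y in zip(xs, ys)) / (2 * l)


# Hypothetical training set: the label depends only on x1.
xs = [(0, 0), (1, 0), (0, 1), (1, 1)]
ys = [-1, 1, -1, 1]

print(empirical_risk(oriented_line, (1, 0, -0.5), xs, ys))  # 0.0
print(empirical_risk(oriented_line, (0, 1, -0.5), xs, ys))  # 0.5
```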

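Structural Risk Minimization (SRM) – Bounds • Risk bound: $R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h(\ln(2l/h) + 1) - \ln(\eta/4)}{l}}$ with probability (1-η) • The square-root term is the confidence term; h is the VC dimension of the function class • Not valid for infinite VC dimension • Note: the bound is independent of p(x,y)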
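
To get a feel for the confidence term, the sketch below evaluates it for a few VC dimensions. It assumes the standard form sqrt((h(ln(2l/h) + 1) − ln(η/4)) / l); the sample size and η are arbitrary illustration values.

```python
import math


def vc_confidence(h, l, eta):
    """Confidence term of the risk bound for VC dimension h, l samples,
    holding with probability 1 - eta. Requires a finite h > 0."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)


l, eta = 10_000, 0.05  # illustration values only
for h in (10, 100, 1000, 5000):
    print(h, round(vc_confidence(h, l, eta), 3))
```

For fixed l the term grows with h (as long as h < 2l), and for fixed h it shrinks as l grows, which is the trade-off the bound expresses.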

Structural Risk Minimization (SRM) – Bounds

Structural Risk Minimization (SRM) – Principled Method • Principled method for choosing a learning machine for a given task:

SRM • Divide the class of functions into nested subsets • Either calculate h for each subset, or get a bound on it • Train each subset to achieve minimal empirical error • Choose the subset with the minimal risk bound • (Figure: risk bound vs. complexity)
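
The four steps above can be sketched as a selection over nested subsets. This is a minimal sketch, assuming the standard VC bound (empirical risk plus confidence term); the VC dimensions and empirical risks in `candidates` are invented numbers standing in for the results of actually training each subset.

```python
import math


def risk_bound(r_emp, h, l, eta=0.05):
    """Empirical risk plus the VC confidence term (assumed standard form)."""
    confidence = math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)
    return r_emp + confidence


def select_subset(candidates, l, eta=0.05):
    """Return the candidate whose risk bound is smallest."""
    return min(candidates, key=lambda c: risk_bound(c["r_emp"], c["h"], l, eta))


# Hypothetical nested subsets S1 ⊂ S2 ⊂ S3: capacity rises, training error falls.
candidates = [
    {"name": "S1", "h": 5, "r_emp": 0.30},
    {"name": "S2", "h": 50, "r_emp": 0.05},
    {"name": "S3", "h": 500, "r_emp": 0.00},
]

best = select_subset(candidates, l=10_000)
print(best["name"])  # S2
```

With these numbers the middle subset wins: S1 underfits (large empirical risk) and S3 pays too much in the confidence term even though it fits the training set perfectly.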

Agenda… • Structural Risk Minimization (SRM) • Support Vector Machines (SVM) • Feature Space vs. Input Space • Kernel PCA • Kernel Fisher Discriminant Analysis (KFDA)

Support Vector Machines (SVM) • Currently the “en vogue” approach to classification • Successful applications in bioinformatics, text, handwriting recognition, image processing • Introduced by Boser, Guyon and Vapnik, 1992 • SVMs are a particular instance of Kernel Machines

Linear SVM – Separable case • Two given classes are linearly separable

Linear SVM - definitions • Separating hyperplane H: $w \cdot x + b = 0$ • w is normal to H • |b|/||w|| is the perpendicular distance from H to the origin • d+ (d-) is the shortest distance from H to the closest positive (negative) point.

Linear SVM - definitions

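Linear SVM - definitions • If H is a separating hyperplane, then $w \cdot x_i + b > 0$ for $y_i = +1$ and $w \cdot x_i + b < 0$ for $y_i = -1$ • No training points fall between H1 and H2 (the hyperplanes parallel to H through the closest positive and negative points)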

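Linear SVM - definitions • By scaling w and b, we can require that $w \cdot x_i + b \ge +1$ for $y_i = +1$ and $w \cdot x_i + b \le -1$ for $y_i = -1$ • Or more simply: $y_i (w \cdot x_i + b) - 1 \ge 0 \;\; \forall i$ • Equality holds ⇔ xi lies on H1 or H2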

Linear SVM - definitions • Note: w is no longer a unit vector • Margin is now 2 / ||w|| • Find hyperplane with the largest margin.
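
The rescaling step and the resulting margin can be sketched as follows. The code assumes the canonical form in which the closest points satisfy y_i (w·x_i + b) = 1; the hyperplane and the four training points are hypothetical.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def canonicalize(w, b, xs, ys):
    """Rescale (w, b) so that min_i y_i (w . x_i + b) equals 1."""
    m = min(y * (dot(w, x) + b) for x, y in zip(xs, ys))
    if m <= 0:
        raise ValueError("(w, b) does not separate the training set")
    return [wi / m for wi in w], b / m


def margin(w):
    """Width 2 / ||w|| of the band between H1 and H2 (canonical w only)."""
    return 2 / math.sqrt(dot(w, w))


# Hypothetical separable data; the closest points sit at x1 = +2 and x1 = -2.
xs = [(2, 0), (3, 1), (-2, 0), (-3, -1)]
ys = [1, 1, -1, -1]

w, b = canonicalize([5.0, 0.0], 0.0, xs, ys)
print(w, b)       # [0.5, 0.0] 0.0
print(margin(w))  # 4.0
```

The margin of 4 matches the geometric distance between the two closest points of opposite class, as expected for this hyperplane.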

Linear SVM – maximizing margin • Maximizing the margin ⇔ minimizing ||w||² • ⇒ more room for unseen points to fall • ⇒ restricts the capacity: the VC-dimension bound grows with R²||w||², where R is the radius of the smallest ball around the data

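Linear SVM – Constrained Optimization • Introduce Lagrange multipliers αi, i = 1, …, l • “Primal” formulation: $L_P = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{l} \alpha_i$ • Minimize LP with respect to w and b • Require $\alpha_i \ge 0$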

Linear SVM – Constrained Optimization • Objective function is quadratic • Linear constraint defines a convex set • Intersection of convex sets is a convex set • ⇒ can formulate the “Wolfe dual” problem

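Linear SVM – Constrained Optimization • Maximize LP with respect to αi • Require $\partial L_P / \partial w = 0$ and $\partial L_P / \partial b = 0$ • The solution: $w = \sum_i \alpha_i y_i x_i$, with $\sum_i \alpha_i y_i = 0$ • Substitute into LP to give: $L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$ • Maximize LD with respect to αi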

Linear SVM – Constrained Optimization • Using Karush-Kuhn-Tucker conditions: $\alpha_i \, (y_i (w \cdot x_i + b) - 1) = 0$ • If αi > 0 then xi lies either on H1 or H2 ⇒ the solution is sparse in αi • Those training points are called “support vectors”. Their removal would change the solution
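
A small worked instance may help here, assuming the usual hard-margin dual: maximize $L_D = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0$. The two training points are hypothetical and chosen only so the algebra stays short.

```latex
\begin{align*}
&x_1 = (1, 0),\; y_1 = +1, \qquad x_2 = (-1, 0),\; y_2 = -1 \\
&\textstyle\sum_i \alpha_i y_i = 0 \;\Rightarrow\; \alpha_1 = \alpha_2 = \alpha \\
&x_1 \cdot x_1 = x_2 \cdot x_2 = 1, \qquad x_1 \cdot x_2 = -1, \qquad y_1 y_2 = -1 \\
&L_D = 2\alpha - \tfrac{1}{2}\,\alpha^2 \left(1 + 1 + 2 \cdot (-1)(-1)\right) = 2\alpha - 2\alpha^2 \\
&\frac{dL_D}{d\alpha} = 2 - 4\alpha = 0 \;\Rightarrow\; \alpha = \tfrac{1}{2} \\
&w = \textstyle\sum_i \alpha_i y_i x_i = \tfrac{1}{2}(1, 0) - \tfrac{1}{2}(-1, 0) = (1, 0) \\
&\text{KKT: } y_1 (w \cdot x_1 + b) = 1 \;\Rightarrow\; 1 + b = 1 \;\Rightarrow\; b = 0 \\
&\text{margin} = 2 / \|w\| = 2
\end{align*}
```

Both multipliers are positive, so both points are support vectors, and the margin of 2 equals the distance between them.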

SVM – Test Phase • Given the unseen sample x we take the class of x to be $\mathrm{sign}(w \cdot x + b) = \mathrm{sign}\left(\sum_i \alpha_i y_i \, x_i \cdot x + b\right)$

Linear SVM – Non-separable case • Separable case corresponds to empirical risk of zero. • For noisy data this might not be the minimum of the actual risk (overfitting). • No feasible solution for non-separable case

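Linear SVM – Non-separable case • Relax the constraints by introducing positive slack variables ξi: $w \cdot x_i + b \ge +1 - \xi_i$ for $y_i = +1$, $w \cdot x_i + b \le -1 + \xi_i$ for $y_i = -1$, $\xi_i \ge 0$ • $\sum_i \xi_i$ is an upper bound on the number of training errors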

Linear SVM – Non-separable case • Assign extra cost to errors • Minimize $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$, where C is a penalty parameter chosen by the user
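
The role of the slack variables can be illustrated numerically. The sketch assumes the common choice ξ_i = max(0, 1 − y_i (w·x_i + b)), the smallest slack that satisfies the relaxed constraints for a given (w, b), and the objective ½||w||² + C Σ ξ_i. The hyperplane, data and C below are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def slacks(w, b, xs, ys):
    """Smallest slack per point that satisfies the relaxed constraints."""
    return [max(0.0, 1 - y * (dot(w, x) + b)) for x, y in zip(xs, ys)]


def soft_margin_objective(w, b, xs, ys, C):
    """0.5 * ||w||^2 + C * sum of slacks."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, xs, ys))


def training_errors(w, b, xs, ys):
    """Number of points on the wrong side of the hyperplane."""
    return sum(1 for x, y in zip(xs, ys) if y * (dot(w, x) + b) < 0)


w, b = (1.0, 0.0), 0.0
xs = [(2, 0), (0.5, 0), (-1, 0), (1, 0)]
ys = [1, 1, -1, -1]  # the last point is on the wrong side

print(slacks(w, b, xs, ys))                       # [0.0, 0.5, 0.0, 2.0]
print(training_errors(w, b, xs, ys))              # 1
print(soft_margin_objective(w, b, xs, ys, C=10))  # 25.5
```

A point is misclassified only when its slack exceeds 1, which is why the sum of the slacks (2.5 here) bounds the number of training errors (1 here) from above.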

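Linear SVM – Non-separable case • Lagrange formulation again: $L_P = \frac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \{ y_i (w \cdot x_i + b) - 1 + \xi_i \} - \sum_i \mu_i \xi_i$, where the μi are the Lagrange multipliers enforcing $\xi_i \ge 0$ • “Wolfe Dual” problem - maximize: $L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$ subject to: $0 \le \alpha_i \le C$, $\sum_i \alpha_i y_i = 0$ • The solution: $w = \sum_i \alpha_i y_i x_i$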

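Linear SVM – Non-separable case • Using Karush-Kuhn-Tucker conditions: $\alpha_i \{ y_i (w \cdot x_i + b) - 1 + \xi_i \} = 0$ and $\mu_i \xi_i = 0$ (ξi: slack variables, μi: multipliers enforcing ξi ≥ 0) • The solution is sparse in αi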

Nonlinear SVM • A nonlinear decision function might be needed

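Nonlinear SVM – Feature Space • Map the data to a high dimensional (possibly infinite) feature space: $\Phi: \mathbb{R}^n \to F$ • Solution depends on $\Phi(x_i) \cdot \Phi(x_j)$ • If there were a function k(xi,xj) s.t. $k(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$ ⇒ no need to know Φ explicitly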

Nonlinear SVM – Toy example (figure panels: Input Space, Feature Space)

Nonlinear SVM – Avoid the Curse • Curse of dimensionality: The difficulty of an estimation problem increases drastically with the dimension • But! Learning in F may be simpler if one uses a low-complexity function class (hyperplanes)

Nonlinear SVM – Kernel Functions • Kernel functions exist! • They effectively compute dot products in feature space • Can use a kernel without knowing Φ and F • Given a kernel, Φ and F are not unique • The F with the smallest dimension is called the minimal embedding space
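
A standard textbook illustration of these points uses the kernel k(x, y) = (x·y)² on R2. The sketch below (function names are hypothetical) shows two different feature maps, one into R3 and one into R4, whose dot products both reproduce the same kernel, so Φ and F are indeed not unique.

```python
import math


def k(x, y):
    """Kernel k(x, y) = (x . y)^2 on R^2, computed in input space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def phi3(x):
    """Feature map into R^3."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def phi4(x):
    """A different feature map, into R^4, for the same kernel."""
    return (x[0] ** 2, x[0] * x[1], x[0] * x[1], x[1] ** 2)


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


x, y = (1.0, 2.0), (3.0, -1.0)
print(k(x, y))                # 1.0
print(dot(phi3(x), phi3(y)))  # equal to k(x, y) up to rounding
print(dot(phi4(x), phi4(y)))  # equal to k(x, y) up to rounding
```

Evaluating k costs one dot product in R2 regardless of which embedding one has in mind, which is the practical content of "can use a kernel without knowing Φ and F".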

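Nonlinear SVM – Kernel Functions • Mercer’s condition: There exists a pair {Φ,F} such that $k(x,y) = \Phi(x) \cdot \Phi(y)$ iff for any g(x) s.t. $\int g(x)^2 \, dx$ is finite, $\int\!\!\int k(x,y) \, g(x) \, g(y) \, dx \, dy \ge 0$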
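
Mercer's condition is an integral statement, but its finite-sample analogue is easy to probe: for a valid kernel, the quadratic form Σ c_i c_j k(x_i, x_j) is non-negative for any points and any coefficients. The sketch below is only a randomized sanity check under that assumption (it can refute a candidate kernel but cannot prove validity); the sample points and the deliberately invalid `neg_dot` function are made up.

```python
import math
import random


def gaussian(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq / (2 * sigma ** 2))


def neg_dot(x, y):
    """Negative dot product: symmetric, but not a valid kernel."""
    return -sum(xi * yi for xi, yi in zip(x, y))


def min_quadratic_form(kernel, xs, trials=200, seed=0):
    """Smallest value of sum_ij c_i c_j k(x_i, x_j) over random c."""
    rng = random.Random(seed)
    smallest = float("inf")
    for _ in range(trials):
        c = [rng.uniform(-1, 1) for _ in xs]
        q = sum(ci * cj * kernel(xi, xj)
                for ci, xi in zip(c, xs) for cj, xj in zip(c, xs))
        smallest = min(smallest, q)
    return smallest


xs = [(0, 0), (1, 0), (0, 1), (2, 2), (-1, 3)]
print(min_quadratic_form(gaussian, xs))  # non-negative up to rounding
print(min_quadratic_form(neg_dot, xs))   # negative
```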

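Nonlinear SVM – Kernel Functions • Formulation of the algorithm in terms of kernels: maximize $L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)$ • Classify an unseen x with $\mathrm{sign}\left(\sum_i \alpha_i y_i \, k(x_i, x) + b\right)$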
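
To show what "in terms of kernels" means in code, the sketch below trains a kernel perceptron on the XOR problem. This is deliberately a simpler, swapped-in algorithm and not the SVM quadratic program: it does not maximize the margin. What it shares with the kernelized SVM is that training and prediction touch the data only through k(x_i, x_j), and that the decision function has the form sign(Σ α_i y_i k(x_i, x)). The kernel choice (x·y + 1)² and all names are assumptions for the example.

```python
def poly_kernel(x, y, p=2):
    """Inhomogeneous polynomial kernel (x . y + 1)^p on R^2."""
    return (x[0] * y[0] + x[1] * y[1] + 1) ** p


def decision_value(x, xs, ys, alpha, kernel):
    """sum_i alpha_i y_i k(x_i, x); the constant term lives inside the kernel."""
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, ys, xs))


def train_kernel_perceptron(xs, ys, kernel, max_epochs=100):
    """Dual-form perceptron: alpha[i] counts the mistakes made on x_i."""
    alpha = [0] * len(xs)
    for _ in range(max_epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(xs, ys)):
            if y * decision_value(x, xs, ys, alpha, kernel) <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha


def predict(x, xs, ys, alpha, kernel):
    return 1 if decision_value(x, xs, ys, alpha, kernel) > 0 else -1


# XOR: not linearly separable in input space.
xs = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
ys = [1, 1, -1, -1]

alpha = train_kernel_perceptron(xs, ys, poly_kernel)
print([predict(x, xs, ys, alpha, poly_kernel) for x in xs])  # [1, 1, -1, -1]
```

No linear classifier in input space can fit these four points, yet the same linear algorithm succeeds once every dot product is replaced by a kernel evaluation.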

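Nonlinear SVM – Kernel Functions • Kernels frequently used, e.g.: polynomial $(x \cdot y + 1)^p$, Gaussian RBF $\exp(-\|x - y\|^2 / 2\sigma^2)$, sigmoid $\tanh(\kappa \, x \cdot y - \delta)$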
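
Three kernels that appear throughout the SVM literature are the polynomial, Gaussian RBF and sigmoid kernels. The sketch below writes them out as they are commonly defined; the parameter names (p, sigma, kappa, delta) and default values are assumptions, not taken from the slide.

```python
import math


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def polynomial_kernel(x, y, p=2):
    """(x . y + 1)^p"""
    return (dot(x, y) + 1) ** p


def gaussian_rbf_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 sigma^2))"""
    sq = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq / (2 * sigma ** 2))


def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    """tanh(kappa * x . y - delta)"""
    return math.tanh(kappa * dot(x, y) - delta)


x, y = (1, 2), (3, 4)
print(polynomial_kernel(x, y, p=2))  # 144
print(gaussian_rbf_kernel(x, x))     # 1.0
print(sigmoid_kernel((0, 0), y))     # 0.0
```

The sigmoid kernel satisfies Mercer's condition only for some values of kappa and delta, so it is usually applied with more care than the other two.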

Nonlinear SVM – Feature Space • For the polynomial kernel (x·y)^p with d=256, p=4 ⇒ dim(F) = C(d+p-1, p) = 183,181,376 • Hyperplane {w,b} requires dim(F) + 1 parameters • Solving SVM means adjusting l+1 parameters
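
The figure 183,181,376 can be reproduced by counting monomials. The sketch assumes the homogeneous polynomial kernel (x·y)^p, whose minimal feature space has one coordinate per degree-p monomial in d variables, i.e. C(d + p − 1, p) coordinates; the training-set size `l` is a made-up value used only for the comparison.

```python
import math


def homogeneous_poly_feature_dim(d, p):
    """Number of degree-p monomials in d variables: C(d + p - 1, p)."""
    return math.comb(d + p - 1, p)


dim_f = homogeneous_poly_feature_dim(256, 4)
print(dim_f)  # 183181376

l = 10_000  # hypothetical number of training points
print("explicit hyperplane parameters:", dim_f + 1)
print("dual parameters (alphas and b):", l + 1)
```

For d = 2 and p = 2 the same formula gives 3, matching the three-dimensional feature space of the kernel (x·y)² on R2.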

SVM - Solution • LD is convex ⇒ the solution is global • Two types of non-uniqueness: • {w,b} is not unique • {w,b} is unique, but the set {αi} is not; prefer the set with fewer support vectors (sparse)

Nonlinear SVM-Toy Example
