An Introduction to Kernel-Based Learning Algorithms. K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda and B. Schölkopf. Presented by: Joanna Giforos. CS8980: Topics in Machine Learning. 9 March, 2006
Outline • Problem Description • Nonlinear Algorithms in Kernel Feature Space • Supervised Learning: • Nonlinear SVM • Kernel Fisher Discriminant Analysis • Unsupervised Learning: • Kernel Principal Component Analysis • Applications • Model-specific kernels
Problem Description • 2-class classification: estimate a function f from input-output training data (x_1, y_1), …, (x_n, y_n) such that f will correctly classify unseen examples (x, y) • i.e., find a mapping f: R^N → {-1, +1} • Assume: training and test data are drawn from the same probability distribution P(x, y)
Problem Description • A learning machine is a family of functions {f(·, α)} parameterized by α • For the task of learning two classes: f(x, α) ∈ {-1, +1}, ∀ x, α • Too complex ⇒ overfitting • Not complex enough ⇒ underfitting • Want to find the right balance between accuracy and complexity
Problem Description • The best f(·, α) is the one that minimizes the expected risk (test error): R(α) = ∫ (1/2)|f(x, α) - y| dP(x, y) • Since P(x, y) is unknown, we instead minimize the empirical risk (training error): R_emp(α) = (1/n) Σ_{i=1..n} (1/2)|f(x_i, α) - y_i| • R_emp(α) → R(α) as n → ∞
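A minimal sketch (not from the slides) of the empirical risk as an average 0/1 training loss; the toy classifier and data below are made up for illustration:

```python
# Empirical risk R_emp = (1/n) * sum_i (1/2)|f(x_i) - y_i| for labels y_i in {-1, +1}.
import numpy as np

def empirical_risk(f, X, y):
    preds = np.array([f(x) for x in X])
    return np.mean(0.5 * np.abs(preds - y))

# Example: a trivial classifier that looks only at the sign of the first feature.
X = np.array([[1.0, 2.0], [-0.5, 1.0], [2.0, -1.0], [-1.5, -2.0]])
y = np.array([1, -1, 1, 1])
print(empirical_risk(lambda x: np.sign(x[0]), X, y))   # 0.25: one of four points misclassified
```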
Structural Risk Minimization • Construct a nested family of function classes F_1 ⊂ F_2 ⊂ … ⊂ F_k, with non-decreasing VC dimension h_1 ≤ h_2 ≤ … ≤ h_k • Let f_1, …, f_k be the solutions of empirical risk minimization within each class • SRM chooses the function class F_i and the function f_i such that an upper bound on the generalization error is minimized
Nonlinear Algorithms in Kernel Feature Space • Via a nonlinear mapping Φ: R^N → F, x ↦ Φ(x), the data x_1, …, x_n is mapped into a potentially much higher-dimensional feature space F • Given this mapping, we can compute scalar products in feature space using kernel functions: k(x, y) = (Φ(x) · Φ(y)) • Φ does not need to be known explicitly ⇒ every linear algorithm that only uses scalar products can implicitly be executed in F by using kernels
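As a small aside not in the original slides, the sketch below illustrates the kernel trick for the degree-2 homogeneous polynomial kernel, whose explicit feature map Φ is known, so the identity k(x, y) = (Φ(x) · Φ(y)) can be checked numerically:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input x = (x1, x2):
    Phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel k(x, y) = (x . y)^2."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

# The kernel gives the scalar product in feature space without ever building Phi.
print(np.dot(phi(x), phi(y)))   # explicit scalar product in feature space: 16.0
print(poly_kernel(x, y))        # same value, computed in input space: 16.0
```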
Supervised Learning: Nonlinear SVM • Consider linear classifiers in feature space using dot products: f(x) = sgn((w · Φ(x)) + b) • Conditions for classification without training error: y_i((w · Φ(x_i)) + b) ≥ 1, i = 1, …, n • GOAL: find w and b such that the empirical risk and a regularization term are minimized • But we cannot explicitly access w in the feature space, so we introduce Lagrange multipliers α_i, one for each of the above constraints
Supervised Learning: Nonlinear SVM • Last class we saw that the nonlinear SVM primal problem is: min_{w,b} (1/2)||w||^2 subject to y_i((w · Φ(x_i)) + b) ≥ 1, i = 1, …, n • Which leads to the dual: max_α Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j) subject to α_i ≥ 0 and Σ_i α_i y_i = 0
Supervised Learning: Nonlinear SVM • Using the Karush-Kuhn-Tucker (KKT) complementarity conditions on the dual SVM problem, we obtain: α_i [y_i((w · Φ(x_i)) + b) - 1] = 0, i = 1, …, n • The solution is sparse in α ⇒ many patterns are outside the margin area and their optimal α_i's are zero • Without sparsity, SVM would be impractical for large data sets
Supervised Learning: Nonlinear SVM • The dual problem can be rewritten as: max_α 1^T α - (1/2) α^T Q α subject to α ≥ 0, y^T α = 0 • Where Q_ij = y_i y_j k(x_i, x_j) • Since the objective function is concave (the problem is a convex QP), every local maximum is a global maximum, but there can be several optimal solutions (in terms of the α_i) • Once the α_i's are found using a QP solver, simply plug into the prediction rule: f(x) = sgn(Σ_i α_i y_i k(x_i, x) + b)
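For a concrete, hedged illustration of the prediction rule, the sketch below uses scikit-learn's SVC to solve the dual QP and then rebuilds f(x) = sgn(Σ_i α_i y_i k(x_i, x) + b) by hand from the sparse dual solution; the toy data, gamma and C values are arbitrary choices, not from the slides:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(40, 2)
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)      # a nonlinearly separable (XOR-like) labeling

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# Reconstruct the decision function from the dual solution: the solution is
# sparse -- only the support vectors carry nonzero alpha_i.
x_new = np.array([[0.3, -0.7]])
k = np.exp(-gamma * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
f = clf.dual_coef_ @ k + clf.intercept_          # dual_coef_ holds y_i * alpha_i
print(np.sign(f), clf.predict(x_new))            # both give the same label
```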
Supervised Learning: KFD • Discriminant analysis seeks to find a projection of the data in a direction that is efficient for discrimination. Image from: R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, John Wiley & Sons, INC., 2001.
Supervised Learning: KFD • Solve Fisher's linear discriminant in kernel feature space • Aims at finding linear projections such that the classes are well separated: • How far apart are the projected means? (should be large) • How big is the variance of the data in this direction? (should be small) • Recall that this can be achieved by maximizing the Rayleigh quotient: J(w) = (w^T S_B w) / (w^T S_W w) • where S_B = (m_2 - m_1)(m_2 - m_1)^T is the between-class scatter and S_W = Σ_{k=1,2} Σ_{i ∈ class k} (x_i - m_k)(x_i - m_k)^T is the within-class scatter
Supervised Learning: KFD • In kernel feature space F, express w in terms of the mapped training patterns: w = Σ_{i=1..n} α_i Φ(x_i) • To get: J(α) = (α^T M α) / (α^T N α) • Where M = (μ_1 - μ_2)(μ_1 - μ_2)^T with (μ_k)_i = (1/n_k) Σ_{j ∈ class k} k(x_i, x_j), and N = Σ_{k=1,2} K_k (I - 1_{n_k}) K_k^T, K_k being the kernel matrix between all training points and the points of class k, and 1_{n_k} the matrix with all entries equal to 1/n_k
Supervised Learning: KFD • Projection of a test point onto the discriminant is computed by: (w · Φ(x)) = Σ_i α_i k(x_i, x) • Can solve the generalized eigenvalue problem: M α = λ N α • But N and M may be large and non-sparse; instead, KFD can be transformed into a convex QP problem • Question: can we use numerical approximations to the eigenvalue problem?
Supervised Learning: KFD • Can reformulate as a constrained optimization problem • FD tries to minimize the variance of the data along the projection whilst maximizing the distance between the projected means • This QP is equivalent to maximizing J(α) since: • M is a matrix of rank 1 (its columns are linearly dependent) • Solutions α of J(α) are invariant under rescaling ⇒ can fix the distance between the projected means to some arbitrary, positive value and just minimize the variance
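A minimal two-class KFD sketch, assuming an RBF kernel and a small ridge term on N; it uses the standard closed-form solution α ∝ (N + εI)^{-1}(μ_2 - μ_1) rather than the QP reformulation discussed above, and all data and parameter values are illustrative:

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    d = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d)

def kfd_fit(X, y, gamma=0.5, reg=1e-3):
    K = rbf(X, X, gamma)                          # kernel matrix over all training points
    mus, N = [], np.zeros((len(X), len(X)))
    for c in (-1, 1):
        Kc = K[:, y == c]                         # columns belonging to class c
        n_c = Kc.shape[1]
        mus.append(Kc.mean(axis=1))               # class mean mu_c in "alpha space"
        N += Kc @ (np.eye(n_c) - np.full((n_c, n_c), 1.0 / n_c)) @ Kc.T
    # Regularized closed-form direction: alpha ~ (N + reg*I)^{-1} (mu_{+1} - mu_{-1}).
    return np.linalg.solve(N + reg * np.eye(len(X)), mus[1] - mus[0])

# Toy data: two Gaussian blobs labeled -1 and +1.
X = np.vstack([np.random.randn(20, 2) - 2, np.random.randn(20, 2) + 2])
y = np.array([-1] * 20 + [1] * 20)
alpha = kfd_fit(X, y)

# Projection of a test point x onto the discriminant: sum_i alpha_i k(x_i, x).
proj = rbf(X, np.array([[1.5, 2.0]])).T @ alpha
print(proj)
```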
Connection Between Boosting and Kernel Methods • Can show that Boosting maximizes the smallest margin ρ • Recall, SVM attempts to maximize the margin measured with respect to the l2 norm of w • In general, using an lp norm constraint on the weight vector leads to maximizing the lq distance between the hyperplane and the training points, where 1/p + 1/q = 1 • Boosting uses the l1 norm • SVM uses the l2 norm
Unsupervised Methods: Linear PCA • Principal Components Analysis (PCA) attempts to efficiently represent the data by finding orthonormal axes which maximally decorrelate the data • Given centered observations x_k ∈ R^N, k = 1, …, n, with Σ_k x_k = 0 • PCA finds the principal axes by diagonalizing the covariance matrix C = (1/n) Σ_{j=1..n} x_j x_j^T • Note that C is positive semi-definite, and thus can be diagonalized with nonnegative eigenvalues
Unsupervised Methods: Linear PCA • Eigenvectors lie in the span of x_1, …, x_n: consider the eigenvalue problem λv = Cv • Thus it can be shown that λv = Cv = (1/n) Σ_j (x_j · v) x_j • But (x_j · v) is just a scalar, so all solutions v with λ ≠ 0 lie in the span of x_1, …, x_n, i.e. v = Σ_i α_i x_i
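A short linear-PCA sketch (illustrative only, with made-up data): center the observations, diagonalize the covariance matrix, and project onto the leading eigenvectors:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 5) @ rng.randn(5, 5)           # correlated toy data: 200 points in R^5

Xc = X - X.mean(axis=0)                           # centered observations
C = Xc.T @ Xc / len(Xc)                           # covariance matrix (positive semi-definite)
eigvals, eigvecs = np.linalg.eigh(C)              # nonnegative eigenvalues, orthonormal eigenvectors
order = np.argsort(eigvals)[::-1]                 # sort principal axes by decreasing variance
components = eigvecs[:, order[:2]]

Z = Xc @ components                               # projection onto the first two principal axes
```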
Unsupervised Methods: Kernel PCA • If we first map the data into another space F via Φ: x ↦ Φ(x) • Then, assuming we can center the mapped data (Σ_j Φ(x_j) = 0), we can write the covariance matrix as: C̄ = (1/m) Σ_{j=1..m} Φ(x_j) Φ(x_j)^T • Which can be diagonalized with nonnegative eigenvalues λ satisfying: λV = C̄V
Unsupervised Methods: Kernel PCA • As in linear PCA, all solutions V with λ ≠ 0 lie in the span of Φ(x_1), …, Φ(x_m), i.e. V = Σ_i α_i Φ(x_i) • Substituting, we get: λ Σ_i α_i Φ(x_i) = (1/m) Σ_{i,j} α_i (Φ(x_j) · Φ(x_i)) Φ(x_j) • Where K is the inner product (Gram) matrix: K_ij = (Φ(x_i) · Φ(x_j)) = k(x_i, x_j) • Premultiplying both sides by Φ(x_k)^T, k = 1, …, m, we finally get: mλKα = K²α, which is solved via the eigenvalue problem mλα = Kα
Unsupervised Methods: Kernel PCA • The resulting set of eigenvectors α^k is then used to extract the principal components of a test point x by: (V^k · Φ(x)) = Σ_{i=1..m} α_i^k k(x_i, x)
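A compact kernel-PCA sketch under the assumptions of an RBF kernel and the usual double-centering of K (standing in for the slides' centering assumption); the test-point kernel vector is left uncentered for brevity:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    d = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d)

X = np.random.RandomState(1).randn(100, 3)        # toy data, 100 points in R^3
m = len(X)

K = rbf_kernel(X, X)
one_m = np.full((m, m), 1.0 / m)
Kc = K - one_m @ K - K @ one_m + one_m @ K @ one_m   # centered kernel matrix

eigvals, eigvecs = np.linalg.eigh(Kc)                # solves m*lambda*alpha = K*alpha (up to scaling)
order = np.argsort(eigvals)[::-1][:2]                # two leading components
alphas = eigvecs[:, order] / np.sqrt(eigvals[order]) # normalize so that (V^k . V^k) = 1

# Principal components of a test point: sum_i alpha_i^k k(x_i, x).
x_new = np.array([[0.2, -0.1, 0.4]])
k_new = rbf_kernel(X, x_new)                         # shape (m, 1); centering of k_new omitted for brevity
pcs = alphas.T @ k_new
```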
Unsupervised Methods: Kernel PCA • Nonlinearities only enter the computation at two points: • In the calculation of the matrix K • In the evaluation of new points • Drawback of kernel PCA: • For large data sets, storage and computation of the m × m matrix K become an issue • Can use sparse approximations of K • Question: can we think of other unsupervised methods which can make use of kernels? • Kernel k-means, kernel ICA, spectral clustering
Applications • Support Vector Machines and Kernel Fisher Discriminant: • Bioinformatics: protein classification • OCR • Face Recognition • Content based image retrieval • Decision Tree Predictive Modeling • … • Kernel PCA • Denoising • Compression • Visualization • Feature extraction for classification
Kernels for Specific Applications • Image segmentation: Gaussian-weighted χ²-distance between local color histograms • Can be shown to be robust for color and texture discrimination • Text classification: vector space kernels • Structured data (strings, trees, etc.): spectrum kernels • Generative models: P-kernels, Fisher kernels
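A hedged sketch of the histogram kernel mentioned above, assuming the form k(h, h') = exp(-χ²(h, h') / (2σ²)) with χ²(h, h') = Σ_i (h_i - h'_i)² / (h_i + h'_i); the exact weighting convention varies across papers, and the histograms below are toy values:

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    # chi-squared distance between two nonnegative, normalized histograms
    return np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def gaussian_chi2_kernel(h1, h2, sigma=1.0):
    # Gaussian weighting of the chi-squared distance
    return np.exp(-chi2_distance(h1, h2) / (2 * sigma ** 2))

# Two toy local color histograms (each normalized to sum to 1).
h_a = np.array([0.2, 0.5, 0.3])
h_b = np.array([0.25, 0.45, 0.3])
print(gaussian_chi2_kernel(h_a, h_b))
```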