An Overview of Kernel-Based Learning Methods Yan Liu Nov 18, 2003
Outline
• Introduction
• Theory basis: reproducing kernel Hilbert space (RKHS), Mercer’s theorem, representer theorem, regularization
• Kernel-based learning algorithms
  • Supervised learning: support vector machines (SVMs), kernel Fisher discriminant (KFD)
  • Unsupervised learning: one-class SVM, kernel PCA
• Kernel design
  • Standard kernels
  • Making kernels from kernels
  • Application-oriented kernels: Fisher kernel
Introduction
• Example idea: map the problem into a higher-dimensional space
• Let F be a potentially much higher-dimensional feature space, and let Φ: X -> F, x -> Φ(x)
• The learning problem now works with the samples (Φ(x_1), y_1), ..., (Φ(x_N), y_N) in F × Y
• Key question: can the mapped problem be classified in a “simple” (for example, linear) way?
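A minimal sketch of the mapping idea, under assumptions that are not in the slides: the quadratic map `phi` and the two synthetic rings are chosen purely for illustration. The rings cannot be separated by a line in the input space, but a single hyperplane separates them after the map.

```python
import numpy as np

def phi(X):
    """Hypothetical quadratic feature map R^2 -> R^3 (illustrative choice)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, np.sqrt(2) * x1 * x2, x2**2])

rng = np.random.default_rng(0)
# Synthetic data: class -1 on a ring of radius 1, class +1 on a ring of radius 3.
angles = rng.uniform(0, 2 * np.pi, size=200)
radii = np.where(np.arange(200) < 100, 1.0, 3.0)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.where(radii > 2.0, 1, -1)

# In feature space, w = (1, 0, 1) and b = -4 compute x1^2 + x2^2 - 4,
# so one hyperplane separates the two rings.
w, b = np.array([1.0, 0.0, 1.0]), -4.0
pred = np.sign(phi(X) @ w + b).astype(int)
accuracy = np.mean(pred == y)

# The feature-space inner product is available without forming phi:
# <phi(x), phi(z)> = (x . z)^2.
kernel_matches = np.allclose(phi(X) @ phi(X).T, (X @ X.T) ** 2)
```

The last line is the reason the mapping never has to be computed explicitly: every inner product in F can be replaced by a kernel evaluation in the input space.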
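Reproducing Kernel Hilbert Space -1
• Inner product space: a vector space H equipped with an inner product ⟨f, g⟩ that is symmetric, linear in each argument, and satisfies ⟨f, f⟩ >= 0, with equality only for f = 0
• Hilbert space: a complete inner product space, that is, every Cauchy sequence in H converges, in the norm ||f|| = sqrt(⟨f, f⟩), to an element of H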
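Reproducing Kernel Hilbert Space -2
• Reproducing Kernel Hilbert Space (RKHS)
• Gram matrix
  • Given a kernel k(x, y) and points x_1, ..., x_N, define the Gram matrix K by K_ij = k(x_i, x_j)
  • We say the kernel is positive definite when the corresponding Gram matrix is positive (semi)definite for every finite set of points
• Definition of RKHS: a Hilbert space H of functions f: X -> R that contains k(x, ·) for every x in X and in which ⟨f, k(x, ·)⟩ = f(x) holds for every f in H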
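The Gram matrix condition can be checked numerically on a finite sample. The sketch below assumes an RBF kernel and random points (neither is specified in the slides); the helper names `gram_matrix` and `min_eigenvalue` are hypothetical. A finite check like this can refute a candidate kernel but cannot prove validity for all point sets.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel, used here as an example of a valid kernel."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def gram_matrix(kernel, X):
    """K_ij = kernel(x_i, x_j) for all pairs of rows of X."""
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

def min_eigenvalue(K):
    """Smallest eigenvalue of the symmetrized matrix; negative means not PSD."""
    return np.linalg.eigvalsh((K + K.T) / 2).min()

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

K_rbf = gram_matrix(rbf_kernel, X)
# A symmetric function that is not a valid kernel: negative squared distance.
# Its Gram matrix has zero trace, so it must have a negative eigenvalue.
K_bad = gram_matrix(lambda x, z: -np.sum((x - z) ** 2), X)

rbf_min, bad_min = min_eigenvalue(K_rbf), min_eigenvalue(K_bad)
```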
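Reproducing Kernel Hilbert Space -3
• Reproducing properties: ⟨f, k(x, ·)⟩ = f(x) and ⟨k(x, ·), k(y, ·)⟩ = k(x, y)
• Comment
  • An RKHS is a “bounded” Hilbert space: point evaluation f -> f(x) is a bounded linear functional
  • An RKHS is a “smoothed” Hilbert space: a function with small norm cannot change quickly between nearby points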
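One way to read the “bounded” and “smoothed” comments is through the Cauchy-Schwarz inequality. The short derivation below is a sketch of that reading, not a reconstruction of the original slide; the notation \mathcal{H} for the RKHS is an assumption.

```latex
% Reproducing property and its consequence for kernel sections
\langle f, k(x,\cdot) \rangle_{\mathcal{H}} = f(x),
\qquad
\langle k(x,\cdot), k(y,\cdot) \rangle_{\mathcal{H}} = k(x,y)

% "Bounded": point evaluation is controlled by the norm of f
|f(x)| = |\langle f, k(x,\cdot) \rangle_{\mathcal{H}}|
       \le \|f\|_{\mathcal{H}} \, \|k(x,\cdot)\|_{\mathcal{H}}
       = \|f\|_{\mathcal{H}} \sqrt{k(x,x)}

% "Smoothed": the change of f between two points is controlled the same way
|f(x) - f(y)| \le \|f\|_{\mathcal{H}} \, \|k(x,\cdot) - k(y,\cdot)\|_{\mathcal{H}}
             = \|f\|_{\mathcal{H}} \sqrt{k(x,x) - 2k(x,y) + k(y,y)}
```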
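Mercer’s Theorem-1
• Mercer’s theorem: a continuous, symmetric, positive definite kernel on a compact domain can be expanded as k(x, y) = sum_i λ_i ψ_i(x) ψ_i(y), with eigenvalues λ_i >= 0 and orthonormal eigenfunctions ψ_i of the integral operator defined by k
• For the discrete case, assume A is the Gram matrix. If A is positive definite, then A = V Λ V^T with positive eigenvalues λ_t, so A_ij = sum_t λ_t v_t(i) v_t(j), and the features Φ(x_i) = (sqrt(λ_1) v_1(i), ..., sqrt(λ_N) v_N(i)) satisfy k(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩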
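The discrete case can be illustrated with an eigendecomposition of a Gram matrix. This is a sketch assuming an RBF kernel on random points; the variable names are hypothetical. The rows of `Phi` act as explicit feature vectors whose inner products reproduce the kernel values on the sample.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))

# Gram matrix of an RBF kernel (assumed kernel choice, gamma = 0.5).
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)

# Symmetric eigendecomposition: K = V diag(lambda) V^T.
eigvals, eigvecs = np.linalg.eigh(K)
eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off

# Row i of Phi is the feature vector (sqrt(lambda_t) * v_t(i))_t of x_i.
Phi = eigvecs * np.sqrt(eigvals)

# Inner products of the constructed features reproduce the kernel values.
reconstruction_error = np.max(np.abs(Phi @ Phi.T - K))
```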
Mercer’s Theorem-2
• Comment
  • Mercer’s theorem provides a concrete way to construct a basis for an RKHS
  • Mercer’s condition is the only constraint on a kernel: a symmetric function is a valid kernel exactly when its Gram matrix is positive (semi)definite for every finite set of points
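Representer Theorem-2
• Comment
  • The representer theorem is a powerful result. Although we search for the optimal solution in a possibly infinite-dimensional feature space, with an RKHS-norm regularization term the minimizer is a finite combination of kernel functions centered on the training examples, f(x) = sum_i α_i k(x_i, x), so the problem reduces to finding N coefficients
  • In this sense regularization and the RKHS are two views of the same thing: the RKHS norm is the regularizer, so choosing a kernel fixes which notion of smoothness is penalized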
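Kernel ridge regression is a compact illustration of the representer theorem. The sketch assumes the objective sum_i (y_i - f(x_i))^2 + lam * ||f||^2, for which the coefficients have the closed form alpha = (K + lam I)^{-1} y; the data, the value of `lam`, and the function names are illustrative assumptions. With a linear kernel the N-coefficient dual solution gives the same predictions as ordinary ridge regression on the weights.

```python
import numpy as np

def linear_kernel(A, B):
    return A @ B.T

def rbf_kernel(A, B, gamma=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def fit_kernel_ridge(K, y, lam):
    """Coefficients alpha of f(x) = sum_i alpha_i k(x_i, x)."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def predict(kernel, X_train, alpha, X_new):
    """Evaluate f at new points using only kernel values against the training set."""
    return kernel(X_new, X_train) @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=30)
X_new = rng.normal(size=(5, 4))
lam = 0.1

# Dual (kernel) solution with a linear kernel: 30 coefficients.
alpha = fit_kernel_ridge(linear_kernel(X, X), y, lam)
dual_pred = predict(linear_kernel, X, alpha, X_new)

# Primal ridge regression on the 4 weights, for comparison.
w = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
primal_pred = X_new @ w

# Swapping the kernel changes the function class, not the algorithm.
alpha_rbf = fit_kernel_ridge(rbf_kernel(X, X), y, lam)
rbf_pred = predict(rbf_kernel, X, alpha_rbf, X_new)
```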
Support Vector Machines-3
• Parameter sparsity
  • Most α_i are zero; the training points with α_i > 0 are the support vectors
• C: regularization constant
• ξ_i: slack variables
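For reference, the standard soft-margin SVM is written out here to show where C, the slack variables, and the multipliers α_i enter. This is the textbook formulation, given as a sketch; the equations of the original slides did not survive, so the exact notation is an assumption.

```latex
% Primal problem: C trades margin size against the total slack
\min_{w, b, \xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i
\quad \text{s.t.} \quad
y_i \big( \langle w, \Phi(x_i) \rangle + b \big) \ge 1 - \xi_i, \quad \xi_i \ge 0

% Dual problem: only kernel values k(x_i, x_j) appear
\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i
 - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j k(x_i, x_j)
\quad \text{s.t.} \quad
0 \le \alpha_i \le C, \quad \sum_{i=1}^{N} \alpha_i y_i = 0

% Solution and sparsity: by the KKT conditions, \alpha_i = 0 for every point
% lying strictly outside the margin
w = \sum_{i=1}^{N} \alpha_i y_i \Phi(x_i),
\qquad
y_i \big( \langle w, \Phi(x_i) \rangle + b \big) > 1 \;\Rightarrow\; \alpha_i = 0
```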
Support Vector Machines-4: Optimization techniques
• Chunking
  • Each step solves the problem containing all non-zero α_i plus some of the α_i that violate the KKT conditions
• Decomposition methods: SVM_light
  • The size of the subproblem is fixed; one sample is added and one removed in each iteration
• Sequential minimal optimization (SMO)
  • Each iteration solves a quadratic problem of size two
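The size-two subproblem of SMO has a closed-form solution, which the sketch below uses. It is a simplified variant that picks the second multiplier at random instead of using Platt's selection heuristics, so it illustrates the idea rather than a production solver. The function name, the default parameters, and the toy data are all assumptions.

```python
import numpy as np

def simplified_smo(K, y, C=1.0, tol=1e-3, max_passes=10, max_sweeps=2000, seed=0):
    """Simplified SMO on a precomputed Gram matrix K with labels y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n = len(y)
    alpha, b, passes = np.zeros(n), 0.0, 0
    for _ in range(max_sweeps):
        changed = 0
        for i in range(n):
            E_i = (alpha * y) @ K[:, i] + b - y[i]
            # Only optimize multipliers that violate the KKT conditions.
            violates = (y[i] * E_i < -tol and alpha[i] < C) or \
                       (y[i] * E_i > tol and alpha[i] > 0)
            if not violates:
                continue
            j = int(rng.integers(n - 1))  # random partner j != i
            if j >= i:
                j += 1
            E_j = (alpha * y) @ K[:, j] + b - y[j]
            a_i, a_j = alpha[i], alpha[j]
            # Box bounds that keep 0 <= alpha <= C and sum_i alpha_i y_i = 0.
            if y[i] != y[j]:
                lo, hi = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
            else:
                lo, hi = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
            eta = 2 * K[i, j] - K[i, i] - K[j, j]
            if lo >= hi or eta >= 0:
                continue
            # Closed-form optimum of the two-variable problem, clipped to the box.
            new_j = float(np.clip(a_j - y[j] * (E_i - E_j) / eta, lo, hi))
            if abs(new_j - a_j) < 1e-5:
                continue
            new_i = a_i + y[i] * y[j] * (a_j - new_j)
            b1 = b - E_i - y[i] * (new_i - a_i) * K[i, i] - y[j] * (new_j - a_j) * K[i, j]
            b2 = b - E_j - y[i] * (new_i - a_i) * K[i, j] - y[j] * (new_j - a_j) * K[j, j]
            if 0 < new_i < C:
                b = b1
            elif 0 < new_j < C:
                b = b2
            else:
                b = (b1 + b2) / 2
            alpha[i], alpha[j] = new_i, new_j
            changed += 1
        # Stop after max_passes consecutive sweeps without any update.
        passes = passes + 1 if changed == 0 else 0
        if passes >= max_passes:
            break
    return alpha, b

# Toy data: two well-separated Gaussian blobs, linear kernel.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+3.0, 0.5, size=(20, 2)),
               rng.normal(-3.0, 0.5, size=(20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)
K = X @ X.T

alpha, b = simplified_smo(K, y)
decision = (alpha * y) @ K + b
accuracy = np.mean(np.sign(decision) == y)
n_support = int(np.sum(alpha > 1e-6))  # points with non-zero multiplier
```

Because each update moves two multipliers along the line that preserves sum_i α_i y_i = 0, the equality constraint holds after every step without any projection.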
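Kernel Fisher Discriminant-1: Overview of LDA
• Fisher’s discriminant (or LDA): find the linear projection onto the most discriminative direction
• Maximize the Rayleigh coefficient J(w) = (w^T S_B w) / (w^T S_W w), where S_W is the within-class variance (scatter matrix) and S_B is the between-class variance
• Comparison with PCA: PCA picks the directions of maximum variance and ignores the labels, whereas LDA uses the labels to pick the direction that best separates the classes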
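Kernel Fisher Discriminant-2
• KFD: solve Fisher’s linear discriminant in the feature space F, which yields a nonlinear discriminant in input space
• One can express w in terms of the mapped training patterns: w = sum_i α_i Φ(x_i)
• The optimization problem for KFD can then be written as: maximize J(α) = (α^T M α) / (α^T N α), where M and N are the kernel-matrix counterparts of S_B and S_W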
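A minimal two-class KFD sketch, assuming the common closed form alpha proportional to (N + reg I)^{-1} (M_1 - M_2) with a small ridge term `reg` for numerical stability. The ring data, the RBF kernel, and the names `fit_kfd` and `reg` are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def fit_kfd(K, y, reg=1e-3):
    """Expansion coefficients alpha of w = sum_i alpha_i Phi(x_i), labels in {+1, -1}."""
    n = len(y)
    class_means, N = [], np.zeros((n, n))
    for c in (1, -1):
        Kc = K[:, y == c]                      # kernel columns of class c (n x l_c)
        lc = Kc.shape[1]
        class_means.append(Kc.mean(axis=1))    # M_c: class mean in kernel form
        # Within-class scatter in kernel form: Kc (I - 1/l_c) Kc^T.
        N += Kc @ (np.eye(lc) - np.full((lc, lc), 1.0 / lc)) @ Kc.T
    # Maximizer of the Rayleigh quotient for the rank-one between-class term.
    return np.linalg.solve(N + reg * np.eye(n), class_means[0] - class_means[1])

# Synthetic data: inner ring (class +1) and outer ring (class -1).
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=100)
radii = np.where(np.arange(100) < 50, 1.0, 3.0) + 0.1 * rng.normal(size=100)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.where(np.arange(100) < 50, 1, -1)

K = rbf_kernel(X, X)
alpha = fit_kfd(K, y)

# Projection of each training point onto the discriminant direction.
proj = K @ alpha
threshold = (proj[y == 1].mean() + proj[y == -1].mean()) / 2
pred = np.where(proj > threshold, 1, -1)
accuracy = np.mean(pred == y)
```

A new point x is projected the same way, as sum_i alpha_i k(x_i, x), so the discriminant never needs the feature map explicitly.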
Kernel PCA -1
• The basic idea of PCA: find a set of orthogonal directions that capture most of the variance in the data
• However, the data may contain more clusters than N linear directions can separate (N is the number of input dimensions)
• Kernel PCA maps the data into a higher-dimensional space and performs standard PCA there. Using the kernel trick, all calculations are done through kernel evaluations on the input data, without ever forming the high-dimensional features
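Kernel PCA -2
• Covariance matrix in feature space (centered data, m training points): C = (1/m) sum_j Φ(x_j) Φ(x_j)^T
• By definition, an eigenvector v of C satisfies λ v = C v
• Then we have v in the span of the mapped data: v = sum_i α_i Φ(x_i)
• Define the Gram matrix K_ij = ⟨Φ(x_i), Φ(x_j)⟩ = k(x_i, x_j)
• At last we have: m λ α = K α
• Therefore we simply have to solve an eigenvalue problem on the Gram matrix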
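The eigenvalue problem on the Gram matrix can be sketched in a few lines. The code assumes the usual feature-space centering of K and normalizes the coefficients so that each feature-space eigenvector has unit length; the function name `kernel_pca` and the test data are hypothetical. With a linear kernel the result should agree with ordinary PCA up to the sign of each component, which gives a simple sanity check.

```python
import numpy as np

def kernel_pca(K, n_components):
    """Project the training points onto the top kernel principal components."""
    m = K.shape[0]
    one = np.full((m, m), 1.0 / m)
    # Center the data in feature space using only the Gram matrix.
    Kc = K - one @ K - K @ one + one @ K @ one
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scale so that v = sum_i alpha_i Phi(x_i) has unit norm (alpha^T Kc alpha = 1).
    alphas = eigvecs / np.sqrt(eigvals)
    # Eigenvalues of Kc equal m times the eigenvalues of the covariance matrix.
    return Kc @ alphas, eigvals

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 3)) * np.array([3.0, 2.0, 1.0])

# Kernel PCA with a linear kernel.
Z, eigvals = kernel_pca(X @ X.T, n_components=2)

# Ordinary PCA through the SVD of the centered data, for comparison.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_scores = Xc @ Vt[:2].T
```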
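Making kernels out of Kernels
• Theorem: if K1 and K2 are kernels on X, a > 0, f is a real-valued function on X, Φ maps X into R^m, and K3 is a kernel on R^m, then the following are kernels:
  • K(x, z) = K1(x, z) + K2(x, z)
  • K(x, z) = a K1(x, z)
  • K(x, z) = K1(x, z) * K2(x, z)
  • K(x, z) = f(x) f(z)
  • K(x, z) = K3(Φ(x), Φ(z))
• Kernel selection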
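The closure rules can be spot-checked numerically by confirming that each combined Gram matrix has no significantly negative eigenvalue. The base kernels, `f`, and `phi` below are arbitrary illustrative choices, and a check on one random sample is only a sanity test, not a proof.

```python
import numpy as np

def gram(kernel, X):
    return np.array([[kernel(x, z) for z in X] for x in X])

def is_psd(K, tol=1e-8):
    """True when the smallest eigenvalue is non-negative up to relative round-off."""
    eig = np.linalg.eigvalsh((K + K.T) / 2)
    return eig.min() >= -tol * max(1.0, eig.max())

# Illustrative building blocks (assumptions, not from the slides).
k1 = lambda x, z: float(x @ z)                                 # linear kernel
k2 = lambda x, z: float(np.exp(-0.5 * np.sum((x - z) ** 2)))   # RBF kernel
f = lambda x: float(np.sin(x[0]) + x[1])                       # real-valued function
phi = lambda x: np.array([x[0] * x[1], x[0] ** 2])             # map into R^2
k3 = lambda u, v: float((1.0 + u @ v) ** 2)                    # polynomial kernel on R^2

combined = {
    "sum": lambda x, z: k1(x, z) + k2(x, z),
    "scale": lambda x, z: 3.0 * k1(x, z),
    "product": lambda x, z: k1(x, z) * k2(x, z),
    "f(x)f(z)": lambda x, z: f(x) * f(z),
    "compose": lambda x, z: k3(phi(x), phi(z)),
}

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))
results = {name: is_psd(gram(k, X)) for name, k in combined.items()}

# Counterexample: a negative scale factor breaks the property, because the
# diagonal of the Gram matrix becomes -1.
negative_scale_is_psd = is_psd(gram(lambda x, z: -1.0 * k2(x, z), X))
```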
Fisher-kernel
• Jaakkola and Haussler proposed deriving a kernel from a generative model and using it in a discriminative (non-probabilistic) kernel classifier
• Build an HMM for each protein family
• Compute the Fisher scores, that is, the gradient of the log-likelihood with respect to each parameter of the HMM
• Use the scores as features and predict with an SVM with an RBF kernel
• Good performance for protein family classification
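To keep the example short, the sketch below swaps the HMM for a one-dimensional Gaussian model, which is an assumption made purely for illustration. It computes the Fisher score U_x (the gradient of log p(x | mu, sigma)) and the kernel U_x^T I^{-1} U_z, where I is the Fisher information matrix of the model. The function names are hypothetical.

```python
import numpy as np

def fisher_score(x, mu, sigma):
    """Gradient of log N(x | mu, sigma^2) with respect to (mu, sigma), one row per sample."""
    d_mu = (x - mu) / sigma**2
    d_sigma = ((x - mu) ** 2 - sigma**2) / sigma**3
    return np.stack([d_mu, d_sigma], axis=-1)

def fisher_kernel(x, z, mu, sigma):
    """K(x, z) = U_x^T I^{-1} U_z for the Gaussian stand-in model."""
    # Fisher information of a Gaussian in the (mu, sigma) parameterization.
    info = np.diag([1.0 / sigma**2, 2.0 / sigma**2])
    return fisher_score(x, mu, sigma) @ np.linalg.inv(info) @ fisher_score(z, mu, sigma).T

mu, sigma = 0.0, 2.0
rng = np.random.default_rng(0)
samples = rng.normal(mu, sigma, size=10)

scores = fisher_score(samples, mu, sigma)    # features, one row per sample
K = fisher_kernel(samples, samples, mu, sigma)
min_eig = np.linalg.eigvalsh((K + K.T) / 2).min()

# A sample at the mean has zero score for mu and -1/sigma for sigma.
score_at_mean = fisher_score(np.array([mu]), mu, sigma)
```

In the pipeline described on the slide, the rows of `scores` would play the role of the feature vectors passed to the SVM, with the generative model being an HMM rather than a Gaussian.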