Ch 6. Kernel Methods
• Introduced by Aizerman et al. (1964); re-introduced in the context of large margin classifiers by Boser et al. (1992).
• See also Vapnik (1995), Burges (1998), Cristianini and Shawe-Taylor (2000), Müller et al. (2001), Schölkopf and Smola (2002), and Herbrich (2002).
• Based on C. M. Bishop, Pattern Recognition and Machine Learning (2006).
Recall: linear methods for classification and regression
• Classical approaches: linear, parametric or nonparametric. A set of training data is used to obtain a parameter vector w.
  • Step 1: Train
  • Step 2: Recognize
• Kernel methods: memory-based
  • Store the entire training set in order to make predictions for future data points (e.g. nearest neighbours).
  • Transform the data to a higher-dimensional space to achieve linear separability.
Kernel methods approach
• The kernel methods approach is to stick with linear functions but work in a high-dimensional feature space: f(x) = wTφ(x) + b, where φ maps inputs into the feature space.
• The expectation is that the feature space has a much higher dimension than the input space.
Example
• Consider, for instance, the mapping φ: (x1, x2) → (x1², x2², √2 x1x2).
• If we consider a linear equation in this feature space, w1 x1² + w2 x2² + w3 √2 x1x2 + b = 0,
• we actually have an ellipse, i.e. a non-linear shape in the input space.
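A minimal sketch of the idea, assuming the quadratic feature map above (the weight values are my own illustrative choice): a linear function in the 3-d feature space traces out an ellipse in the original 2-d input space.

```python
import numpy as np

def phi(x):
    """Quadratic feature map (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

# A linear function f(z) = w^T z + b in the 3-d feature space ...
w = np.array([1.0, 4.0, 0.0])   # illustrative (hypothetical) weights
b = -1.0

def f(x):
    return w @ phi(x) + b

# ... is quadratic in the inputs: f(x) = 0 is the ellipse x1^2 + 4*x2^2 = 1.
print(f(np.array([1.0, 0.0])))   # 0.0: (1, 0) lies on the ellipse
print(f(np.array([0.0, 0.5])))   # 0.0: (0, 0.5) lies on the ellipse
```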
Capacity of feature spaces
• The capacity of the class of linear functions is proportional to the dimension of the feature space.
• 2-dim example: linear separators in 2 dimensions can shatter at most 3 points (the VC dimension of linear classifiers in d dimensions is d + 1).
Form of the functions
• So kernel methods use linear functions in a feature space: f(x) = wTφ(x) + b.
• For regression this could be the output function itself, y(x) = f(x).
• For classification we require thresholding, e.g. y(x) = sign(f(x)).
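A small sketch of the two uses of the same linear function in feature space (the feature map and names are hypothetical, chosen only to make the snippet self-contained):

```python
import numpy as np

def feature_map(x):
    # Hypothetical feature map: append squared terms to the raw input.
    return np.concatenate([x, x**2])

def f(x, w, b):
    """Linear function in feature space: f(x) = w^T phi(x) + b."""
    return w @ feature_map(x) + b

def predict_regression(x, w, b):
    return f(x, w, b)                 # regression: use the value directly

def predict_classification(x, w, b):
    return np.sign(f(x, w, b))        # classification: threshold at zero

x = np.array([0.5, -1.0])
w, b = np.ones(4), 0.0
print(predict_regression(x, w, b), predict_classification(x, w, b))
```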
Problems of high dimensions
• The capacity may easily become too large and lead to over-fitting: being able to realise every classifier means we are unlikely to generalise well.
• There are also computational costs involved in dealing with very large feature vectors.
Recall
• Two theoretical approaches converged on similar algorithms:
  • the Bayesian approach led to Bayesian inference using Gaussian processes;
  • the frequentist approach led to methods based on maximum likelihood estimation (MLE).
• First we briefly discuss the Bayesian approach before mentioning some of the frequentist results.
1. Bayesian approach
• The Bayesian approach relies on a probabilistic analysis by positing
  • a pdf model
  • a prior distribution over the function class
• Inference involves updating the prior distribution with the likelihood of the data.
• Possible outputs:
  • the MAP function
  • the Bayesian posterior average
Bayesian approach
• Avoids overfitting by
  • controlling the prior distribution
  • averaging over the posterior
Bayesian approach
• Subject to assumptions about the pdf model and the prior distribution, one can
  • obtain error bars (uncertainty estimates) on the output
  • compute the evidence for the model and use it for model selection
• The approach has been developed for different pdf models, e.g. classification.
• It typically requires approximate inference.
2. Frequentist approach
• The source of randomness is assumed to be a distribution that generates the training data i.i.d., with the same distribution generating the test data.
• Different/weaker assumptions than the Bayesian approach, so more general, but typically less analysis can be derived.
• The main focus is on generalisation error analysis.
Generalisation
• What do we mean by generalisation?
• The generalisation of a learner is its expected performance (expected loss or error) on new data drawn from the same distribution that generated the training set.
Example of Generalisation
• We consider the Breast Cancer dataset.
• We use the simple Parzen window classifier: the weight vector is w = c+ − c−, where c+ (c−) is the average of the positive (negative) training examples.
• The threshold b is set so that the hyperplane bisects the line joining these two points: b = (||c−||² − ||c+||²)/2.
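A minimal sketch of this mean-difference ("Parzen window") classifier under the construction described above; variable names are mine, and random toy data stands in for the Breast Cancer set:

```python
import numpy as np

def train_mean_classifier(X, y):
    """X: (n, d) data matrix, y: labels in {+1, -1}.
    Returns (w, b) for the hyperplane bisecting the segment joining the class means."""
    c_pos = X[y == +1].mean(axis=0)
    c_neg = X[y == -1].mean(axis=0)
    w = c_pos - c_neg
    b = 0.5 * (c_neg @ c_neg - c_pos @ c_pos)   # hyperplane passes through the midpoint
    return w, b

def predict(X, w, b):
    return np.sign(X @ w + b)

# Toy usage with random data standing in for the real dataset
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 1, (20, 5)), rng.normal(-1, 1, (20, 5))])
y = np.array([+1] * 20 + [-1] * 20)
w, b = train_mean_classifier(X, y)
print("training accuracy:", (predict(X, w, b) == y).mean())
```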
Example of Generalisation
• By repeatedly drawing random training sets S of size m we estimate the distribution of the generalisation error, using the test-set error as a proxy for the true generalisation.
• We plot the histogram and the average of the distribution for various training-set sizes: 648, 342, 273, 205, 137, 68, 34, 27, 20, 14, 7.
Example of Generalisation
• Since the expected classifier is in all cases the same, we do not expect large differences in the average of the distribution, though the non-linearity of the loss function means they won't be exactly the same.
Error distribution: full dataset
Observations
• Things can get bad if the number of training examples is small compared to the dimension.
• The mean can be a bad predictor of the true generalisation, i.e. things can look okay in expectation but still go badly wrong.
• A key ingredient of learning: keep flexibility high while still ensuring good generalisation.
Controlling generalisation
• The critical method of controlling generalisation for classification is to force a large margin on the training data: the geometric margin of a linear classifier is γ = min_n yn (wTφ(xn) + b) / ||w||.
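A short sketch (my own helper, not from the slides) computing the geometric margin of a given linear classifier on a training set:

```python
import numpy as np

def geometric_margin(Phi, y, w, b):
    """Phi: (n, d) feature matrix, y: labels in {+1, -1}.
    Returns min_n y_n (w^T phi_n + b) / ||w||; positive iff the data are separated."""
    functional = y * (Phi @ w + b)
    return functional.min() / np.linalg.norm(w)

# Toy usage
Phi = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0
print(geometric_margin(Phi, y, w, b))   # ~1.414 for this toy set
```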
Study: Hilbert space
• Functionals: a map from a vector space to a field
• Duality
• Inner product
• Norm
• Similarity
• Distance
• Metric
Kernel functions
• k(x, x') = φ(x)Tφ(x')
• For example, k(x, x') = (xTx' + c)^M
• What if x and x' are two images?
• The kernel then represents a particular weighted sum of all possible products of M pixels in the first image with M pixels in the second image.
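A small sketch of the polynomial kernel above and the Gram matrix it induces on a set of inputs (the parameter values and data are illustrative):

```python
import numpy as np

def poly_kernel(x, z, c=1.0, M=2):
    """Polynomial kernel k(x, z) = (x^T z + c)^M."""
    return (x @ z + c) ** M

def gram_matrix(X, kernel):
    """Gram matrix K with K[n, m] = k(x_n, x_m)."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = gram_matrix(X, poly_kernel)
print(K)                       # symmetric
print(np.linalg.eigvalsh(K))   # non-negative eigenvalues: K is positive semidefinite
```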
Kernel functions evaluated at the training data points: k(x, x') = φ(x)Tφ(x').
• Linear kernel: k(x, x') = xTx'
• Stationary kernels are invariant to translation: k(x, x') = k(x − x')
• Homogeneous kernels, i.e. radial basis functions: k(x, x') = k(||x − x'||)
Kernel trick
• If we have an algorithm in which the input vector x enters only in the form of scalar products, then we can replace that scalar product with some other choice of kernel.
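As one illustration of the trick, squared distances in feature space can be written purely in terms of kernel evaluations, ||φ(x) − φ(x')||² = k(x, x) − 2 k(x, x') + k(x', x'). A quick sketch (the kernel is an arbitrary choice of mine):

```python
import numpy as np

def poly_kernel(x, z, c=1.0, M=2):
    return (x @ z + c) ** M

def feature_distance_sq(x, z, kernel):
    """Squared distance between phi(x) and phi(z), computed without ever forming phi."""
    return kernel(x, x) - 2.0 * kernel(x, z) + kernel(z, z)

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(feature_distance_sq(x, z, poly_kernel))
```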
6.1 Dual Representations (1/4)
• Consider a linear regression model with a regularised sum-of-squares error function
  J(w) = ½ Σn { wTφ(xn) − tn }² + (λ/2) wTw.
• If we set the gradient of J(w) with respect to w to zero, the solution takes the form
  w = −(1/λ) Σn { wTφ(xn) − tn } φ(xn) = Σn an φ(xn) = ΦTa,
• where the nth row of the design matrix Φ is φ(xn)T,
• and an = −(1/λ){ wTφ(xn) − tn }.
6.1 Dual Representations (2/4)
• We can now reformulate the least-squares algorithm in terms of a (the dual representation). We substitute w = ΦTa into J(w) to obtain
  J(a) = ½ aTΦΦTΦΦTa − aTΦΦTt + ½ tTt + (λ/2) aTΦΦTa.
• Define the Gram matrix K = ΦΦT with entries Knm = φ(xn)Tφ(xm) = k(xn, xm).
6.1 Dual Representations (3/4)
• In terms of the Gram matrix, the sum-of-squares error function can be written as
  J(a) = ½ aTKKa − aTKt + ½ tTt + (λ/2) aTKa.
• Setting the gradient of J(a) with respect to a to zero, we obtain the optimal a:
  a = (K + λIN)−1 t.
• Recall that a was defined as a function of w: an = −(1/λ){ wTφ(xn) − tn }.
6.1 Dual Representations (4/4)
• Substituting this back into the linear regression model, we obtain the following prediction for a new input x:
  y(x) = wTφ(x) = aTΦφ(x) = k(x)T(K + λIN)−1 t,
  where we define the vector k(x) with elements kn(x) = k(xn, x).
• The prediction y(x) is computed as a linear combination of the target values t.
• y(x) is expressed entirely in terms of the kernel function k(x, x').
• Conversely, w is expressed as a linear combination of the feature vectors: w = ΦTa = Σn an φ(xn).
Recall
• Linear regression (primal) solution: w = (ΦTΦ + λI)−1 ΦTt
• Dual representation: a = (K + λI)−1 t
• Note that K is N×N, whereas ΦTΦ + λI is M×M (Φ itself is N×M).
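A compact sketch of this dual-form ("kernel ridge regression") predictor; the Gaussian kernel from later in the chapter and the toy data are used purely for illustration:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d = x - z
    return np.exp(-(d @ d) / (2.0 * sigma ** 2))

def fit_dual(X, t, kernel, lam=0.1):
    """Solve a = (K + lambda I)^{-1} t."""
    N = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    return np.linalg.solve(K + lam * np.eye(N), t)

def predict_dual(x_new, X, a, kernel):
    """y(x) = k(x)^T a, with k_n(x) = k(x_n, x)."""
    k_vec = np.array([kernel(x_n, x_new) for x_n in X])
    return k_vec @ a

# Toy usage: fit y = sin(x) from a few noisy samples
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, (20, 1))
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
a = fit_dual(X, t, gaussian_kernel, lam=0.1)
print(predict_dual(np.array([np.pi / 2]), X, a, gaussian_kernel))  # close to 1
```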
6.2 Constructing Kernels (1/5)
• One approach is to choose a feature space mapping φ(x) and define the kernel as the inner product of the two feature vectors, k(x, x') = φ(x)Tφ(x').
• Example of a kernel defined directly: for two-dimensional inputs,
  k(x, z) = (xTz)² = (x1z1 + x2z2)² = x1²z1² + 2 x1z1x2z2 + x2²z2² = (x1², √2 x1x2, x2²)(z1², √2 z1z2, z2²)T = φ(x)Tφ(z).
Basis functions and corresponding kernels
• Figure 6.1: upper plots show basis functions (polynomials, Gaussians, logistic sigmoids); lower plots show the corresponding kernel functions.
Constructing Kernels
• A necessary and sufficient condition for a function k(x, x') to be a valid kernel is that the Gram matrix K should be positive semidefinite for all possible choices of the set {xn}.
• Techniques for constructing new kernels: given valid kernels k1(x, x') and k2(x, x'), the following are also valid kernels:
  • k(x, x') = c k1(x, x') for a constant c > 0
  • k(x, x') = f(x) k1(x, x') f(x') for any function f
  • k(x, x') = q(k1(x, x')) for a polynomial q with non-negative coefficients
  • k(x, x') = exp(k1(x, x'))
  • k(x, x') = k1(x, x') + k2(x, x')
  • k(x, x') = k1(x, x') k2(x, x')
  • k(x, x') = xTAx' for a symmetric positive semidefinite matrix A
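A quick numerical sanity check (not a proof) that sums, products, and positive scalings of valid kernels give positive semidefinite Gram matrices; the base kernels and data are arbitrary choices of mine:

```python
import numpy as np

def gram(X, kernel):
    return np.array([[kernel(x, z) for z in X] for x in X])

linear = lambda x, z: x @ z
rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))

for name, k in [("sum", lambda x, z: linear(x, z) + rbf(x, z)),
                ("product", lambda x, z: linear(x, z) * rbf(x, z)),
                ("scaled", lambda x, z: 3.0 * rbf(x, z))]:
    eigvals = np.linalg.eigvalsh(gram(X, k))
    print(name, "min eigenvalue:", eigvals.min())   # >= 0 up to numerical error
```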
Gaussian Kernel
• The Gaussian kernel is k(x, x') = exp(−||x − x'||² / 2σ²).
• Show that the feature vector that corresponds to the Gaussian kernel has infinite dimensionality.
• Hint: write ||x − x'||² = xTx − 2xTx' + x'Tx', so that k(x, x') = exp(−xTx/2σ²) exp(xTx'/σ²) exp(−x'Tx'/2σ²), and expand the middle factor as a power series; each term corresponds to a polynomial kernel of increasing degree.
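A small numerical illustration of the expansion in the hint (my own sketch): truncating the power series of exp(xTx'/σ²) at increasing order approaches the exact Gaussian kernel, with each truncation corresponding to a finite-dimensional polynomial feature space.

```python
import numpy as np
from math import factorial

def gaussian_kernel(x, z, sigma=1.0):
    d = x - z
    return np.exp(-(d @ d) / (2.0 * sigma ** 2))

def truncated_gaussian_kernel(x, z, sigma=1.0, order=3):
    """exp(-x.x/2s^2) * [sum_{m<=order} (x.z/s^2)^m / m!] * exp(-z.z/2s^2)."""
    s2 = sigma ** 2
    series = sum((x @ z / s2) ** m / factorial(m) for m in range(order + 1))
    return np.exp(-(x @ x) / (2 * s2)) * series * np.exp(-(z @ z) / (2 * s2))

x = np.array([0.3, -0.7])
z = np.array([0.1, 0.4])
print("exact:", gaussian_kernel(x, z))
for order in (1, 3, 6, 10):
    print("order", order, ":", truncated_gaussian_kernel(x, z, order=order))
```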
Construction of Kernels from Generative Models
• Given a generative model p(x), define a kernel function k(x, x') = p(x)p(x').
• A kernel function measuring the similarity of two inputs can be defined by summing over a hidden variable z: k(x, x') = Σz p(x|z) p(x'|z) p(z).
• This leads to hidden Markov models if x and x' are sequences of observations and z is a sequence of hidden states.
Fisher Kernel
• Consider a parametric generative model p(x|θ) and the Fisher score g(θ, x) = ∇θ ln p(x|θ).
• The Fisher kernel is then defined as k(x, x') = g(θ, x)T F−1 g(θ, x'), where F is the Fisher information matrix, F = Ex[ g(θ, x) g(θ, x)T ].
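A toy sketch (entirely my own example) of the Fisher kernel for a univariate Gaussian model p(x|μ) = N(x | μ, σ²) with the mean as its only parameter: the score is g(μ, x) = (x − μ)/σ² and the Fisher information is F = 1/σ².

```python
import numpy as np

SIGMA2 = 2.0   # assumed known variance of the model
MU = 0.5       # parameter value at which the scores are evaluated

def fisher_score(x, mu=MU, sigma2=SIGMA2):
    """d/d_mu log N(x | mu, sigma2) = (x - mu) / sigma2."""
    return (x - mu) / sigma2

def fisher_kernel(x, z, mu=MU, sigma2=SIGMA2):
    """k(x, z) = g(x) F^{-1} g(z), with F = 1/sigma2 for this one-parameter model."""
    F = 1.0 / sigma2
    return fisher_score(x, mu, sigma2) * (1.0 / F) * fisher_score(z, mu, sigma2)

print(fisher_kernel(1.0, 2.0))   # = (1 - 0.5)(2 - 0.5) / sigma2 = 0.375
```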
Sigmoid kernel
• The sigmoid kernel has the form k(x, x') = tanh(a xTx' + b).
• Although its Gram matrix is in general not positive semidefinite, this kernel form gives the support vector machine a superficial resemblance to a neural network model.
How to select the basis functions?
• Assume a fixed nonlinear transformation.
• Transform the inputs using a vector of basis functions φ(x).
• The resulting decision boundaries will be linear in the feature space: y(x) = wTφ(x).
6.3 Radial Basis Function Networks (1/3)
• Each basis function depends only on the radial distance (typically Euclidean) from a centre μj, so that φj(x) = h(||x − μj||).
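A brief sketch (my own construction, not from the slides) of an RBF network with Gaussian basis functions centred on a subset of the training inputs, fitted by regularised least squares:

```python
import numpy as np

def rbf_design_matrix(X, centres, width=1.0):
    """Phi[n, j] = exp(-||x_n - mu_j||^2 / (2 width^2))."""
    sq_dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * width ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (50, 1))
t = np.sinc(X[:, 0]) + 0.05 * rng.normal(size=50)

centres = X[::5]                              # use every 5th training point as a centre
Phi = rbf_design_matrix(X, centres)
lam = 1e-3
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)

x_new = np.array([[0.5]])
print(rbf_design_matrix(x_new, centres) @ w)  # prediction, close to sinc(0.5)
```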
6.3 Radial Basis Function Networks (2/3)
• Let's consider the interpolation problem when the input variables are noisy. If the noise on the input vector x is described by a variable ξ having a distribution ν(ξ), the sum-of-squares error function becomes
  E = ½ Σn ∫ { y(xn + ξ) − tn }² ν(ξ) dξ.
• Using the calculus of variations, the optimal solution is
  y(x) = Σn tn h(x − xn), with normalised basis functions h(x − xn) = ν(x − xn) / Σm ν(x − xm).
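A short sketch (my own toy example) of these normalised basis functions, assuming a Gaussian input-noise density ν; it illustrates that the prediction is a normalised weighted average of the targets:

```python
import numpy as np

def nu(xi, s=0.5):
    """Assumed Gaussian noise density on the inputs (normalising constants cancel)."""
    return np.exp(-xi ** 2 / (2.0 * s ** 2))

def predict(x, X_train, t_train):
    """y(x) = sum_n t_n * nu(x - x_n) / sum_m nu(x - x_m)."""
    weights = nu(x - X_train)
    h = weights / weights.sum()          # normalised basis functions h(x - x_n)
    return h @ t_train

X_train = np.array([0.0, 1.0, 2.0, 3.0])
t_train = np.array([0.0, 1.0, 0.0, -1.0])
print(predict(1.1, X_train, t_train))    # dominated by the target at x_n = 1.0
```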