
Kernel Methods

Presentation Transcript


  1. Kernel Methods Mesfin Adane DEMA

  2. Polynomial regression

  3. K-Nearest Neighbour

  4. Advantage of an increase in dimension (figure panels: classifier of 1D data vs. classifier of 2D data)

  5. Kernel Methods

  6. Kernel Methods • Kernel methods are used to find and study general types of relations (e.g. clustering, correlation, classification) in general types of data (vectors, images, sets). • They approach the problem by mapping the data into a higher-dimensional feature space in which each coordinate corresponds to one feature of the data items. • For a nonlinear feature space mapping φ(x), the kernel function is given as k(x, x') = φ(x)^T φ(x').

  7. Kernel Methods • Kernel trick (kernel substitution): if the input vectors enter the algorithm only in the form of scalar products, then each scalar product can be replaced by some other kernel function k(x, x'). • Linear kernel: k(x, x') = x^T x', corresponding to the identity feature map φ(x) = x. • Stationary kernels: k(x, x') = k(x − x'), invariant to translations of the input space. • Homogeneous kernels (radial basis functions): k(x, x') = k(||x − x'||), a function of the distance between the arguments only.
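As a rough illustration of the kernel forms just listed (not part of the slides), here is a minimal NumPy sketch; the length scale sigma is an arbitrary illustrative choice:

```python
import numpy as np

def linear_kernel(x, z):
    # Linear kernel: k(x, z) = x^T z, i.e. the identity feature map phi(x) = x.
    return x @ z

def gaussian_kernel(x, z, sigma=1.0):
    # Stationary (in fact homogeneous/RBF) kernel: depends only on ||x - z||.
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(linear_kernel(x, z))      # scalar product of the two inputs
print(gaussian_kernel(x, z))    # similarity in (0, 1], 1 only when x == z
```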

  8. Dual Representations • Consider a linear regression model whose regularized sum-of-squares error function is given by J(w) = (1/2) Σ_n { w^T φ(x_n) − t_n }^2 + (λ/2) w^T w. • Minimizing with respect to w gives w = Φ^T a, with a_n = −(1/λ){ w^T φ(x_n) − t_n }, where Φ is the design matrix whose nth row is given by φ(x_n)^T.

  9. Dual Representations • Substituting w = Φ^T a into J(w). • Define the Gram matrix K = Φ Φ^T, which is an N x N symmetric matrix with elements K_nm = φ(x_n)^T φ(x_m) = k(x_n, x_m). • Re-writing the sum-of-squares error using the Gram matrix: J(a) = (1/2) a^T K K a − a^T K t + (1/2) t^T t + (λ/2) a^T K a.

  10. Dual Representations • Minimizing with respect to a and solving gives a = (K + λ I_N)^{-1} t. • Substituting back into the original linear regression model gives the prediction y(x) = w^T φ(x) = k(x)^T (K + λ I_N)^{-1} t, where the vector k(x) has elements k_n(x) = k(x_n, x). • Note that the vector a is computed by inverting an N x N matrix. • Compare: the parameter-space solution requires inverting an M x M matrix, where M is the number of basis functions.
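A minimal NumPy sketch of this dual solution (kernel ridge regression): the Gram matrix K is filled from a kernel, a = (K + λ I_N)^{-1} t is computed, and a new input is predicted as y(x) = k(x)^T a. The Gaussian kernel, λ, and σ below are illustrative choices, not values from the slides.

```python
import numpy as np

def k(x, z, sigma=0.5):
    # Gaussian kernel used to fill the Gram matrix (illustrative choice).
    return np.exp(-(x - z) ** 2 / (2.0 * sigma ** 2))

# Toy 1-D training data.
X = np.linspace(0.0, 1.0, 10)
t = np.sin(2.0 * np.pi * X)
lam = 1e-3                                    # regularisation parameter lambda

# Gram matrix K_nm = k(x_n, x_m) and dual variables a = (K + lambda I)^(-1) t.
K = k(X[:, None], X[None, :])
a = np.linalg.solve(K + lam * np.eye(len(X)), t)

# Prediction at a new point: y(x) = sum_n a_n k(x_n, x).
x_new = 0.37
y_new = k(X, x_new) @ a
print(y_new)                                  # close to sin(2*pi*0.37)
```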

  11. Dual Representations • Even though the dual formulation requires inverting an N x N matrix, which is typically much larger than the M x M matrix of the parameter-space solution, it comes with some advantages. • We can see that the prediction y(x) is expressed entirely in terms of the kernel function k(x, x'). • This allows us to work with feature spaces of high, even infinite, dimensionality, without ever computing the feature vectors explicitly.

  12. Constructing Kernels • First approach: select the feature space mapping φ(x) and use it to construct the corresponding kernel k(x, x') = φ(x)^T φ(x') = Σ_i φ_i(x) φ_i(x'). • Examples of basis functions: polynomial, Gaussian, and sigmoidal.

  13. Constructing Kernels

  14. Constructing Kernels • Second approach: construct kernel functions directly. We need to make sure that we are selecting a valid kernel. • Valid kernels: kernels whose Gram matrix K is positive semi-definite for all possible choices of the input set {x_n}. • Note that the Gram matrix is symmetric (it equals its own (conjugate) transpose), and positive semi-definiteness means that all of its eigenvalues are real and non-negative.
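One way to see the positive semi-definiteness requirement in practice (a sketch, not from the slides, assuming a Gaussian kernel): build the Gram matrix on arbitrary inputs and check that its eigenvalues are non-negative up to numerical error.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # Candidate kernel to be tested for validity.
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                 # 20 arbitrary 3-D inputs

# Gram matrix K_nm = k(x_n, x_m); a valid kernel must make K positive
# semi-definite for every possible choice of inputs, not just this one.
K = np.array([[gaussian_kernel(x, z) for z in X] for x in X])

eigvals = np.linalg.eigvalsh(K)              # real eigenvalues of the symmetric K
print(eigvals.min() >= -1e-10)               # True: all eigenvalues non-negative
```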

  15. Constructing Kernels • Consider a simple example: k(x, z) = (x^T z)^2. • Considering a 2D input space x = (x1, x2), the kernel can be expanded and written as an inner product of feature vectors, k(x, z) = φ(x)^T φ(z), as shown below. • New, more complex kernels can also be constructed by using simpler kernels as building blocks.
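Written out, the 2-D example expands as follows, which identifies the implicit feature map φ:

```latex
\begin{aligned}
k(\mathbf{x},\mathbf{z}) &= (\mathbf{x}^\mathsf{T}\mathbf{z})^2
  = (x_1 z_1 + x_2 z_2)^2 \\
 &= x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 \\
 &= (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)\,(z_1^2,\ \sqrt{2}\,z_1 z_2,\ z_2^2)^\mathsf{T}
  = \boldsymbol{\phi}(\mathbf{x})^\mathsf{T}\boldsymbol{\phi}(\mathbf{z}).
\end{aligned}
```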

  16. Constructing Kernels • Gaussian kernel: k(x, x') = exp(−||x − x'||^2 / 2σ^2). • Taking the inner part, the squared distance expands as ||x − x'||^2 = x^T x − 2 x^T x' + x'^T x', which shows the Gaussian kernel is a product of valid kernels (see the expansion below). • Substituting a nonlinear kernel κ(x, x') for the scalar products extends the Gaussian kernel beyond simple vector inputs.
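The validity argument and the kernel substitution, written out:

```latex
\begin{aligned}
k(\mathbf{x},\mathbf{x}') &= \exp\!\left(-\frac{\|\mathbf{x}-\mathbf{x}'\|^2}{2\sigma^2}\right)
 = \exp\!\left(-\frac{\mathbf{x}^\mathsf{T}\mathbf{x}}{2\sigma^2}\right)
   \exp\!\left(\frac{\mathbf{x}^\mathsf{T}\mathbf{x}'}{\sigma^2}\right)
   \exp\!\left(-\frac{(\mathbf{x}')^\mathsf{T}\mathbf{x}'}{2\sigma^2}\right), \\
k(\mathbf{x},\mathbf{x}') &= \exp\!\left\{-\frac{1}{2\sigma^2}
   \bigl(\kappa(\mathbf{x},\mathbf{x}) + \kappa(\mathbf{x}',\mathbf{x}')
   - 2\,\kappa(\mathbf{x},\mathbf{x}')\bigr)\right\}
   \quad \text{(scalar products replaced by a nonlinear kernel } \kappa\text{)}.
\end{aligned}
```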

  17. Constructing Kernels • Kernels are a powerful tool for combining generative and discriminative probabilistic approaches. • By combining the two approaches, we can benefit from the generative model's ability to handle missing data while obtaining the better performance of a discriminative model on discriminative tasks. • Idea: define a kernel using a generative model and then use this kernel in a discriminative approach.

  18. Constructing Kernels • Given a generative model p(x), define a kernel k(x, x') = p(x) p(x'). • Note that the kernel measures the similarity between x and x': two inputs are similar if they both have high probability under the model. • The construction generalizes by taking different probabilistic distributions and mixing over components or latent variables (standard forms below). • Example: hidden Markov models, where the mixing is over hidden state sequences.
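The standard forms of these generative-model kernels, written out for reference (the latent-variable and hidden Markov model versions generalize the basic product form):

```latex
\begin{aligned}
k(\mathbf{x},\mathbf{x}') &= p(\mathbf{x})\,p(\mathbf{x}'), \\
k(\mathbf{x},\mathbf{x}') &= \sum_i p(\mathbf{x}\mid i)\,p(\mathbf{x}'\mid i)\,p(i), \qquad
k(\mathbf{x},\mathbf{x}') = \int p(\mathbf{x}\mid\mathbf{z})\,p(\mathbf{x}'\mid\mathbf{z})\,p(\mathbf{z})\,\mathrm{d}\mathbf{z}, \\
k(\mathbf{X},\mathbf{X}') &= \sum_{\mathbf{Z}} p(\mathbf{X}\mid\mathbf{Z})\,p(\mathbf{X}'\mid\mathbf{Z})\,p(\mathbf{Z})
 \quad \text{(hidden Markov model, } \mathbf{Z} \text{ a hidden state sequence)}.
\end{aligned}
```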

  19. Radial Basis Function • Each basis function depends only on the radial distance (typically Euclidean) from a centre μ_j, so that φ_j(x) = h(||x − μ_j||). • Given input vectors {x_n} and corresponding targets {t_n}. • Goal: exact interpolation, f(x_n) = t_n for every n. • Achieved by expressing f(x) as a linear combination of radial basis functions, one centred on every data point: f(x) = Σ_n w_n h(||x − x_n||).
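A minimal NumPy sketch of this exact-interpolation construction, with a Gaussian radial basis function centred on every training point (the basis width is an illustrative choice, not from the slides):

```python
import numpy as np

def rbf(r, width=0.2):
    # Radial basis function: depends only on the distance r from a centre.
    return np.exp(-r ** 2 / (2.0 * width ** 2))

# Training inputs and targets; one basis function is centred on each x_n.
X = np.linspace(0.0, 1.0, 8)
t = np.cos(2.0 * np.pi * X)

# Interpolation matrix H_nm = h(|x_n - x_m|); solving H w = t gives exact
# interpolation, i.e. f(x_n) = t_n for every training point.
H = rbf(np.abs(X[:, None] - X[None, :]))
w = np.linalg.solve(H, t)

# Prediction at a new input: f(x) = sum_n w_n h(|x - x_n|).
x_new = 0.3
print(rbf(np.abs(x_new - X)) @ w)
```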

  20. Radial Basis Function • For noisy input data, with noise ξ having distribution ν(ξ), the sum-of-squares error is given by E = (1/2) Σ_n ∫ { y(x_n + ξ) − t_n }^2 ν(ξ) dξ. • Optimizing it, we get y(x) = Σ_n t_n h(x − x_n), • where h(x − x_n) = ν(x − x_n) / Σ_m ν(x − x_m) are normalized basis functions.

  21. Radial Basis Function

  22. Radial Basis Function • Having one basis function per data point makes the model computationally costly when predicting for new data points. • Orthogonal least squares: basis function centres are selected sequentially from the data points; at each step, the data point chosen is the one that gives the greatest reduction in the sum-of-squares error.

  23. Nadaraya-Watson Model

  24. Gaussian Process

  25. Gaussian Process

  26. Linear Regression Revisited • Consider the model y(x) = w^T φ(x). • Taking a prior distribution over w, p(w) = N(w | 0, α^{-1} I). • Note that this induces a probability distribution over functions y(x). • In terms of a vector representation, y = Φ w, where y = (y(x_1), ..., y(x_N))^T and Φ is the design matrix.

  27. Linear Regression Revisited • Note that since y is a linear combination of the Gaussian distributed elements of w, it is itself Gaussian distributed, with zero mean and covariance cov[y] = (1/α) Φ Φ^T = K, • where K is the Gram matrix with elements K_nm = k(x_n, x_m) = (1/α) φ(x_n)^T φ(x_m). • Definition: a Gaussian process is defined as a probability distribution over functions y(x) such that the values of y(x) evaluated at an arbitrary set of points x_1, ..., x_N jointly have a Gaussian distribution.

  28. Linear Regression Revisited • Note that the joint distribution of a Gaussian process is completely specified by its mean and covariance; the mean is usually taken to be zero. • Note also that the covariance can be evaluated directly from the kernel function: E[y(x_n) y(x_m)] = k(x_n, x_m). • Taking the Gaussian (squared-exponential) and exponential kernel functions, for example, samples can be drawn from the corresponding priors over functions.
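A sketch of how such prior samples can be drawn (not from the slides): evaluate the kernel on a grid of inputs to obtain the covariance matrix K, then sample from the zero-mean multivariate Gaussian N(0, K). The squared-exponential kernel and its length scale are illustrative choices; a small jitter term keeps K numerically positive definite.

```python
import numpy as np

def sq_exp_kernel(x, z, length=0.3):
    # Squared-exponential ("Gaussian") covariance function.
    return np.exp(-(x - z) ** 2 / (2.0 * length ** 2))

# Grid of input points at which the random function is evaluated.
x = np.linspace(-1.0, 1.0, 100)
K = sq_exp_kernel(x[:, None], x[None, :])

# Draw 5 sample functions y ~ N(0, K).
rng = np.random.default_rng(1)
samples = rng.multivariate_normal(np.zeros(len(x)), K + 1e-10 * np.eye(len(x)), size=5)
print(samples.shape)    # (5, 100): five functions evaluated at 100 points
```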

  29. Linear Regression Revisited

  30. Gaussian processes for regression • Considering noise on the observed targets, t_n = y_n + ε_n. • Considering Gaussian noise, p(t_n | y_n) = N(t_n | y_n, β^{-1}). • For independent noise, the joint distribution is given by p(t | y) = N(t | y, β^{-1} I_N). • Using the Gaussian process prior, p(y) = N(y | 0, K). • Marginalizing over y gives p(t) = ∫ p(t | y) p(y) dy = N(t | 0, C).

  31. Gaussian processes for regression • where C is the covariance matrix given as C_nm = k(x_n, x_m) + β^{-1} δ_nm. • Note the summation in the covariance: the randomness from y(x) and from the noise are independent Gaussian sources, so their covariances add. • A widely used kernel function is k(x_n, x_m) = θ_0 exp{ −(θ_1/2) ||x_n − x_m||^2 } + θ_2 + θ_3 x_n^T x_m.

  32. Gaussian processes for regression

  33. Gaussian processes for regression • Goal of regression: predict the target t_{N+1} for a new input x_{N+1}, given the training data. • Using the joint distribution p(t_{N+1}) = N(t_{N+1} | 0, C_{N+1}). • Partitioning the covariance as C_{N+1} = [[C_N, k], [k^T, c]], • where k has elements k(x_n, x_{N+1}) for n = 1, ..., N and c is given as c = k(x_{N+1}, x_{N+1}) + β^{-1}. • Using the conditional distribution of a partitioned Gaussian, the predictive mean and variance are m(x_{N+1}) = k^T C_N^{-1} t and σ^2(x_{N+1}) = c − k^T C_N^{-1} k.
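A minimal NumPy sketch of these predictive equations: with C_N = K + β^{-1} I built from the training inputs, the predictive mean is k^T C_N^{-1} t and the predictive variance is c − k^T C_N^{-1} k. The kernel, β, and the toy data are illustrative choices, not from the slides.

```python
import numpy as np

def kern(x, z, length=0.3):
    # Squared-exponential kernel (illustrative choice of covariance function).
    return np.exp(-(x - z) ** 2 / (2.0 * length ** 2))

beta = 25.0                                   # noise precision (illustrative)
X = np.linspace(0.0, 1.0, 12)                 # training inputs
t = np.sin(2.0 * np.pi * X) + np.random.default_rng(2).normal(0.0, 0.2, size=X.shape)

# Covariance of the training targets: C_nm = k(x_n, x_m) + (1/beta) * delta_nm.
C = kern(X[:, None], X[None, :]) + np.eye(len(X)) / beta
C_inv_t = np.linalg.solve(C, t)

# Predictive mean and variance at a new input x*.
x_star = 0.42
k_vec = kern(X, x_star)                       # k_n = k(x_n, x*)
c = kern(x_star, x_star) + 1.0 / beta         # c = k(x*, x*) + 1/beta
mean = k_vec @ C_inv_t
var = c - k_vec @ np.linalg.solve(C, k_vec)
print(mean, var)
```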

  34. Gaussian processes for regression • Note that the mean and covariance depend on the vector k, which in turn depends on the new input x_{N+1}. • Note also that the covariance matrix must be positive definite, which requires the kernel function to be a valid kernel. • The Gaussian process viewpoint is advantageous in that we can consider covariance functions that can only be expressed in terms of an infinite number of basis functions.

  35. Gaussian processes for regression

  36. Gaussian processes for regression

  37. Gaussian processes for classification • Since a Gaussian process makes predictions over the entire real axis, it cannot be applied directly to classification, where the outputs must be valid probabilities. • Solution: apply an activation function (e.g. the logistic sigmoid) to the output of the Gaussian process. • Note that the covariance no longer includes a noise term, because we assume the training labels are correct. A noise-like term is still introduced, however, for computational reasons: it keeps the covariance matrix well conditioned.

  38. Gaussian processes for classification

  39. Gaussian processes for classification • Considering the two-class problem, we want the predictive distribution p(t_{N+1} = 1 | t_N). • The required integral is intractable, and an approximation is necessary: • Variational inference • Expectation propagation • Laplace approximation
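For reference, the intractable predictive distribution referred to above has the standard form (a_{N+1} denotes the Gaussian process output at the new point), and it is this integral that variational inference, expectation propagation, and the Laplace approximation each approximate in a different way:

```latex
p(t_{N+1} = 1 \mid \mathbf{t}_N)
  = \int p(t_{N+1} = 1 \mid a_{N+1})\, p(a_{N+1} \mid \mathbf{t}_N)\, \mathrm{d}a_{N+1}
  = \int \sigma(a_{N+1})\, p(a_{N+1} \mid \mathbf{t}_N)\, \mathrm{d}a_{N+1}.
```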

  40. Gaussian processes for classification
