Kernel Methods Mesfin Adane DEMA
Advantage of an increase in dimension [Figures: classifier of 1D data; classifier of 2D data]
Kernel Methods • Are used to find and study general types of relations (e.g. clustering, correlation, classification) in general types of data (vectors, images, sets). • Approach the problem by mapping the data into a higher-dimensional feature space in which each coordinate corresponds to one feature of the data items. • For a nonlinear feature space mapping φ(x), the kernel function is given as k(x, x') = φ(x)^T φ(x'), as in the sketch below.
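A minimal sketch (not from the slides) of how an explicit feature map induces a kernel; the feature map phi below is a hypothetical 1-D polynomial mapping chosen only for illustration.

```python
import numpy as np

def phi(x):
    """Hypothetical feature map: x -> (1, x, x^2)."""
    return np.array([1.0, x, x**2])

def kernel(x, x_prime):
    """Kernel induced by the explicit feature map: k(x, x') = phi(x)^T phi(x')."""
    return phi(x) @ phi(x_prime)

print(kernel(2.0, 3.0))   # 1 + 2*3 + 4*9 = 43
```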
Kernel Methods • Kernel Trick (Kernel Substitution): if the input vector enters the algorithm only in the form of scalar products, then the scalar product can be replaced by some other kernel function k(x, x'). • Linear kernel: k(x, x') = x^T x'. • Stationary kernels: k(x, x') = k(x − x'), invariant to translations of the input space. • Homogeneous kernels (radial basis functions): k(x, x') = k(‖x − x'‖), depending only on the distance between the arguments.
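A small sketch of kernel substitution, assuming illustrative data and kernel choices: wherever x^T x' appears, a different kernel function can be plugged into the same Gram-matrix computation.

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z                                              # k(x, z) = x^T z

def gaussian_kernel(x, z, sigma=1.0):
    # stationary and homogeneous: depends only on ||x - z||
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def gram_matrix(X, k):
    N = len(X)
    return np.array([[k(X[n], X[m]) for m in range(N)] for n in range(N)])

X = np.random.randn(5, 2)
K_lin = gram_matrix(X, linear_kernel)      # scalar products
K_rbf = gram_matrix(X, gaussian_kernel)    # same code, substituted kernel
print(K_lin.shape, K_rbf.shape)
```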
Dual Representations • Consider a linear regression model whose regularized sum-of-squares error function is given by J(w) = (1/2) Σ_{n=1..N} {w^T φ(x_n) − t_n}² + (λ/2) w^T w. • Minimizing with respect to w gives w = Φ^T a, where a_n = −(1/λ){w^T φ(x_n) − t_n} and Φ is the design matrix, whose nth row is given by φ(x_n)^T.
Dual Representations • Substituting w = Φ^T a into J(w) gives an expression entirely in terms of Φ Φ^T. • Define the Gram matrix K = Φ Φ^T, an N × N symmetric matrix with elements K_{nm} = φ(x_n)^T φ(x_m) = k(x_n, x_m). • Re-writing the sum-of-squares error using the Gram matrix: J(a) = (1/2) a^T K K a − a^T K t + (1/2) t^T t + (λ/2) a^T K a.
Dual Representations • Minimizing with respect to a and solving gives a = (K + λ I_N)^{-1} t. • Substituting back into the original linear regression model: y(x) = w^T φ(x) = k(x)^T (K + λ I_N)^{-1} t, where k(x) is the vector with elements k_n(x) = k(x_n, x). • Note that the vector a is computed by inverting an N × N matrix. • Compare: the original (primal) solution requires inverting an M × M matrix, where M is the number of basis functions. A sketch of the dual solution follows.
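A sketch of the dual solution above (kernel ridge regression), under assumed choices of kernel, regularization strength, and synthetic data: compute a = (K + λI)^{-1} t once, then predict with y(x) = k(x)^T a.

```python
import numpy as np

def k(x, z, sigma=0.5):
    # assumed Gaussian kernel for illustration
    return np.exp(-(x - z) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 30)                                    # training inputs
t = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(30)    # noisy targets
lam = 0.1                                                    # assumed lambda

K = k(X[:, None], X[None, :])                                # N x N Gram matrix
a = np.linalg.solve(K + lam * np.eye(len(X)), t)             # a = (K + lam I)^{-1} t

def predict(x_new):
    return k(x_new, X) @ a                                   # y(x) = k(x)^T a

print(predict(0.25))
```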
Dual Representations • Even though the dual formulation requires inverting a typically larger N × N matrix, it comes with some advantages. • We can see that the prediction y(x) is expressed entirely in terms of the kernel function k(x, x'). • This allows us to work directly with kernels and avoid explicit feature vectors, even for feature spaces of very high or infinite dimensionality.
Constructing Kernels • First approach: select the feature space mapping φ(x) and use it to construct the corresponding kernel k(x, x') = Σ_j φ_j(x) φ_j(x'). • Examples of basis functions: • Polynomial: φ_j(x) = x^j. • Gaussian: φ_j(x) = exp(−(x − μ_j)² / (2s²)). • Sigmoid: φ_j(x) = σ((x − μ_j)/s).
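A brief sketch of the first approach, assuming Gaussian basis functions on a fixed grid of centres (the grid and width are illustrative choices): the kernel is built as the sum over basis functions.

```python
import numpy as np

centres = np.linspace(-1, 1, 11)       # assumed basis-function centres
s = 0.3                                # assumed width

def phi(x):
    # vector of Gaussian basis-function activations
    return np.exp(-(x - centres) ** 2 / (2 * s ** 2))

def k(x, x_prime):
    # k(x, x') = sum_j phi_j(x) phi_j(x')
    return phi(x) @ phi(x_prime)

print(k(0.1, 0.2))
```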
Constructing Kernels • Second approach: construct kernel functions directly. We need to make sure that we are selecting a valid kernel. • Valid kernels: kernels whose Gram matrix is positive semi-definite for all possible choices of the input set. • Note that a positive semi-definite matrix is unchanged by taking its conjugate transpose (it is symmetric in the real case), and all of its eigenvalues are real and non-negative.
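A sketch of a numerical sanity check for validity (with an assumed Gaussian kernel and random inputs): the Gram matrix should be symmetric with non-negative eigenvalues up to round-off.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

X = np.random.randn(20, 3)
K = np.array([[gaussian_kernel(x, z) for z in X] for x in X])

assert np.allclose(K, K.T)                 # symmetric
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)             # non-negative up to round-off
```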
Constructing Kernels • Consider a simple example: k(x, z) = (x^T z)². • For a 2-D input space, k(x, z) = (x₁z₁ + x₂z₂)² = x₁²z₁² + 2x₁z₁x₂z₂ + x₂²z₂² = (x₁², √2 x₁x₂, x₂²)(z₁², √2 z₁z₂, z₂²)^T = φ(x)^T φ(z). • New, more complex kernels can also be constructed by using simpler kernels as building blocks.
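A numerical check of this 2-D example: the squared scalar product agrees with the explicit feature-space inner product φ(x)^T φ(z).

```python
import numpy as np

def phi(x):
    # phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print((x @ z) ** 2, phi(x) @ phi(z))   # both equal 1.0
```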
Constructing Kernels • Gaussian kernel: k(x, x') = exp(−‖x − x'‖² / (2σ²)). • Taking the inner part: ‖x − x'‖² = x^T x + (x')^T x' − 2 x^T x'. • Substituting back: k(x, x') = exp(−x^T x / (2σ²)) exp(x^T x' / σ²) exp(−(x')^T x' / (2σ²)), which shows the Gaussian kernel is valid; the scalar products can themselves be replaced by a nonlinear kernel.
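A quick numerical confirmation of this factorisation (inputs and width are arbitrary illustrative values):

```python
import numpy as np

x, xp, s = np.array([0.3, -1.2]), np.array([0.7, 0.5]), 0.8

lhs = np.exp(-np.sum((x - xp) ** 2) / (2 * s ** 2))
rhs = (np.exp(-x @ x / (2 * s ** 2))
       * np.exp(x @ xp / s ** 2)
       * np.exp(-xp @ xp / (2 * s ** 2)))
print(np.isclose(lhs, rhs))            # True
```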
Constructing Kernels • Kernels are a powerful tool to combine generative and discriminative probabilistic approaches. • By combining the two, we can benefit from the generative model's ability to handle missing data and from the better performance of the discriminative model on discriminative tasks. • Idea: define a kernel using a generative model and use this kernel within a discriminative approach.
Constructing Kernels • Given a generative model p(x), define a kernel k(x, x') = p(x) p(x'). • Note that the kernel measures the similarity between x and x': two inputs are similar if they both have high probability. • This can be extended by taking different probability distributions, e.g. a mixture k(x, x') = Σ_i p(x|i) p(x'|i) p(i). • Example: a Hidden Markov Model, where the sum runs over hidden state sequences and x, x' are observed sequences.
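A minimal sketch of a kernel built from a generative model, assuming a hypothetical two-component Gaussian mixture p(x | i) with the mixture-style combination above (all parameters are illustrative):

```python
import numpy as np
from scipy.stats import norm

means, stds, priors = [-1.0, 2.0], [0.5, 1.0], [0.3, 0.7]   # assumed mixture

def k(x, x_prime):
    # k(x, x') = sum_i p(x|i) p(x'|i) p(i)
    return sum(norm.pdf(x, m, s) * norm.pdf(x_prime, m, s) * p
               for m, s, p in zip(means, stds, priors))

print(k(0.0, 0.1))   # larger when x and x' are likely under the same component
```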
Radial Basis Function • Each basis function depends only on the radial distance (typically Euclidean) from a centre μ_j, so that φ_j(x) = h(‖x − μ_j‖). • Given input vectors {x_n} and corresponding targets {t_n}. • Goal: find a function f(x) that interpolates the targets exactly, f(x_n) = t_n. • Achieved by expressing f(x) as a linear combination of RBFs, one centred on every data point: f(x) = Σ_n w_n h(‖x − x_n‖), as in the sketch below.
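A sketch of exact interpolation with one Gaussian RBF per data point (kernel width and data are illustrative assumptions): solve the N × N linear system for the weights, and the resulting f(x) passes through every target.

```python
import numpy as np

def h(r, s=0.1):
    # Gaussian radial basis function of the distance r
    return np.exp(-r ** 2 / (2 * s ** 2))

X = np.linspace(0, 1, 10)                   # training inputs
t = np.sin(2 * np.pi * X)                   # targets

H = h(np.abs(X[:, None] - X[None, :]))      # interpolation matrix H_{nm} = h(|x_n - x_m|)
w = np.linalg.solve(H, t)

def f(x):
    return h(np.abs(x - X)) @ w

print(max(abs(f(xn) - tn) for xn, tn in zip(X, t)))   # ~0: exact fit at the data points
```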
Radial Basis Function • For noisy input data, where the noise on the input variable has distribution ν(ξ), the sum-of-squares error is given by E = (1/2) Σ_n ∫ {y(x_n + ξ) − t_n}² ν(ξ) dξ. • Optimizing it with respect to y(x), we get y(x) = Σ_n t_n h(x − x_n), • where h(x − x_n) = ν(x − x_n) / Σ_m ν(x − x_m) are normalised basis functions.
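A sketch of the resulting predictor when ν is assumed Gaussian (a Nadaraya-Watson-style weighted average of the targets); data and width are illustrative.

```python
import numpy as np

def nu(d, s=0.1):
    # assumed Gaussian input-noise distribution (unnormalised is fine here)
    return np.exp(-d ** 2 / (2 * s ** 2))

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, 50)
t = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(50)

def y(x):
    weights = nu(x - X)
    return (weights @ t) / weights.sum()    # sum_n t_n * normalised basis function

print(y(0.25))                              # roughly sin(2*pi*0.25) = 1
```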
Radial Basis Function • With one basis function per data point, the model is computationally costly when predicting for new data points, so a smaller set of centres is often selected. • Orthogonal least squares: a sequential selection procedure in which, at each step, the data point chosen as the next basis-function centre is the one that gives the greatest reduction in the sum-of-squares error.
Linear Regression Revisited • Consider y(x) = w^T φ(x). • Taking a prior distribution over w: p(w) = N(w | 0, α^{-1} I). • Note that this induces a probability distribution over the function y(x). • In vector form, evaluating the function at the training points, y = Φ w, where y has elements y_n = y(x_n) and Φ is the design matrix.
Linear Regression Revisited • Note that since y is a linear combination of the Gaussian distributed variables w, it is itself Gaussian, with mean E[y] = 0 and covariance cov[y] = (1/α) Φ Φ^T = K, • where K is the Gram matrix with elements K_{nm} = k(x_n, x_m) = (1/α) φ(x_n)^T φ(x_m). • Definition: a Gaussian process is a probability distribution over functions y(x) such that the values of y(x) evaluated at an arbitrary set of points jointly have a Gaussian distribution.
Linear Regression Revisited • Note that the joint distribution of a Gaussian process is completely specified by its mean and covariance. • Note also that the covariance can be evaluated directly from the kernel function. • Taking the Gaussian and exponential kernel functions, for example, sample functions can be drawn from the corresponding priors, as in the sketch below.
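A sketch of drawing sample functions from a Gaussian process prior; the Gaussian and exponential kernels below are illustrative parameterisations, and the small jitter term exists only to stabilise the Cholesky factorisation numerically.

```python
import numpy as np

x = np.linspace(-1, 1, 100)                       # points at which to evaluate y(x)

def gaussian_k(a, b, s=0.3):
    return np.exp(-(a - b) ** 2 / (2 * s ** 2))

def exponential_k(a, b, theta=4.0):
    return np.exp(-theta * np.abs(a - b))

rng = np.random.default_rng(3)
for k in (gaussian_k, exponential_k):
    K = k(x[:, None], x[None, :])                 # covariance from the kernel
    L = np.linalg.cholesky(K + 1e-8 * np.eye(len(x)))
    sample = L @ rng.standard_normal(len(x))      # y ~ N(0, K)
    print(k.__name__, sample[:3])
```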
Gaussian processes for regression • Considering noise on the observed targets: t_n = y_n + ε_n. • For Gaussian noise, p(t_n | y_n) = N(t_n | y_n, β^{-1}). • For noise independent across data points, the joint distribution is p(t | y) = N(t | y, β^{-1} I_N). • Using the Gaussian process prior, p(y) = N(y | 0, K). • Marginalizing over y gives p(t) = ∫ p(t | y) p(y) dy = N(t | 0, C).
Gaussian processes for regression • where C is the covariance matrix given by C(x_n, x_m) = k(x_n, x_m) + β^{-1} δ_{nm}. • Note the summation of the two covariances: the randomness in y(x) and the observation noise are independent Gaussian sources, so their covariances simply add. • A widely used kernel function is k(x_n, x_m) = θ₀ exp(−(θ₁/2)‖x_n − x_m‖²) + θ₂ + θ₃ x_n^T x_m.
Gaussian processes for regression • Goal of regression: predict the target t_{N+1} for a new input x_{N+1}. • Using the joint distribution p(t_{N+1}) = N(t_{N+1} | 0, C_{N+1}). • Partitioning the covariance: C_{N+1} = [[C_N, k], [k^T, c]], • where k has elements k(x_n, x_{N+1}) for n = 1, …, N and c = k(x_{N+1}, x_{N+1}) + β^{-1}. • Using the conditional distribution, the predictive mean and variance are m(x_{N+1}) = k^T C_N^{-1} t and σ²(x_{N+1}) = c − k^T C_N^{-1} k, as in the sketch below.
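A sketch of these predictive equations; the kernel, noise precision, and synthetic data are illustrative assumptions.

```python
import numpy as np

def k(a, b, s=0.3):
    # assumed Gaussian kernel
    return np.exp(-(a - b) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(25)
beta = 100.0                                      # assumed noise precision

C = k(X[:, None], X[None, :]) + np.eye(len(X)) / beta   # C_N = K + beta^{-1} I
C_inv = np.linalg.inv(C)

def predict(x_new):
    kv = k(x_new, X)                              # elements k(x_n, x_new)
    mean = kv @ C_inv @ t                         # k^T C_N^{-1} t
    var = k(x_new, x_new) + 1.0 / beta - kv @ C_inv @ kv   # c - k^T C_N^{-1} k
    return mean, var

print(predict(0.25))
```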
Gaussian processes for regression • Note that the predictive mean and covariance depend on the vector k, which in turn depends on the test input x_{N+1}. • Note also that the kernel function must give rise to a valid (positive semi-definite) covariance matrix. • The Gaussian process viewpoint is advantageous in that we can consider covariance functions that can be expressed only in terms of an infinite number of basis functions.
Gaussian processes for classification • Since a Gaussian process makes predictions over the entire real axis, it cannot be applied directly to classification, where outputs must be probabilities in (0, 1). • Solution: apply a nonlinear activation function (e.g. the logistic sigmoid) to the output of the Gaussian process, y = σ(a(x)). • Note that the covariance of a(x) does not include a noise term, because the training labels are assumed to be correctly assigned; a noise-like term is nonetheless introduced for computational (numerical) reasons, to ensure the covariance matrix is positive definite.
Gaussian processes for classification • Considering a two-class problem, the predictive distribution is p(t_{N+1} = 1 | t_N) = ∫ σ(a_{N+1}) p(a_{N+1} | t_N) da_{N+1}. • This integral is analytically intractable, and an approximation is necessary: • Variational inference • Expectation propagation • Laplace approximation
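A minimal sketch of the intractable step, not one of the three schemes above: assuming some approximation has already produced a Gaussian p(a_{N+1} | t_N) = N(a | μ, σ²) (the values of μ and σ² below are placeholders), the sigmoid-Gaussian integral can be estimated by simple Monte Carlo.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

mu, var = 0.8, 0.5                                 # hypothetical posterior moments
rng = np.random.default_rng(5)
a_samples = rng.normal(mu, np.sqrt(var), 100_000)  # a ~ N(mu, var)

p_class1 = sigmoid(a_samples).mean()               # estimates the predictive integral
print(p_class1)
```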