Machine Learning & Bioinformatics
Tien-Hao Chang (Darby Chang)
Using kernel density estimation to enhance a conventional instance-based classifier
• Kernel approximation
• Kernel density estimation
• Kernel approximation in O(n)
• Multi-dimensional kernel approximation
• Application in data classification
Identifying the boundary between different classes of objects
Boundary identified
Problem definition of kernel approximation
• Given the values of a function f at a set of samples s_1, s_2, …, s_n
• We want to find a set of symmetric kernel functions K_i and the corresponding weights w_i such that f(v) ≈ Σ_i w_i K_i(v)
Kernel approximation: spherical Gaussian functions
• Hartman et al. showed that a linear combination of spherical Gaussian functions can approximate any function with arbitrarily small error
• “Layered neural networks with Gaussian hidden units as universal approximations”, Neural Computation, Vol. 2, No. 2, 1990
Kernel approximation: spherical Gaussian functions
• With the Gaussian kernel functions, we want to find weights w_i and bandwidths σ_i such that f(v) ≈ Σ_i w_i exp(−‖v − s_i‖² / 2σ_i²)
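To make the construction concrete, here is a minimal sketch of fitting the weights of spherical Gaussian kernels to sampled function values. The target function sin(x), the sample layout, the single shared bandwidth, and the use of a least-squares solve are all illustrative assumptions, not prescribed by the slides.

```python
import numpy as np

# Sketch: approximate f(x) = sin(x) on [0, 2*pi] by a linear combination
# of spherical Gaussian kernels centred at the samples.  Target function,
# sample layout, bandwidth, and the least-squares solve are illustrative.

def gaussian_kernel_matrix(x, centers, sigma):
    """G[i, j] = exp(-(x_i - c_j)^2 / (2 * sigma^2))."""
    diff = x[:, None] - centers[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2))

samples = np.linspace(0.0, 2.0 * np.pi, 20)
values = np.sin(samples)

sigma = samples[1] - samples[0]              # bandwidth ~ sample spacing
G = gaussian_kernel_matrix(samples, samples, sigma)
weights, *_ = np.linalg.lstsq(G, values, rcond=None)

# the fitted combination reproduces f at the samples
approx = G @ weights
max_err = np.max(np.abs(approx - values))
```

Solving a dense linear system like this costs O(n³); the O(n) algorithm discussed later avoids it entirely.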
Problem definition of kernel density estimation
• Assume that we are given a set of samples s_1, s_2, …, s_n taken from a probability distribution in a d-dimensional vector space
• The kernel density estimation problem is to find a linear combination of kernel functions that provides a good approximation of the probability density function
Probability density estimation
• The value of the probability density function at a vector v can be estimated as
  p̂(v) = (k / n) / ( π^(d/2) R(v)^d / Γ(d/2 + 1) )
• n is the total number of samples
• R(v) is the distance between vector v and its k-th nearest sample
• π^(d/2) R(v)^d / Γ(d/2 + 1) is the volume of a sphere with radius R(v) in a d-dimensional vector space
The Gamma function
• Γ(z) = ∫_0^∞ t^(z−1) e^(−t) dt, with Γ(z + 1) = z·Γ(z), Γ(1) = 1, and Γ(1/2) = √π
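The k-th-nearest-neighbour density estimate above can be sketched as follows. The uniform 2-D sample data and the choice k = 20 are illustrative assumptions; the true density of the uniform distribution on the unit square is 1, so the estimate should land near that value.

```python
from math import gamma, pi

import numpy as np

# Sketch of the k-th-nearest-neighbour density estimate:
# p_hat(v) = (k / n) / V(v), where V(v) is the volume of the d-dimensional
# sphere whose radius R(v) is the distance from v to its k-th nearest
# sample.  The uniform 2-D sample data and k = 20 are illustrative.

def sphere_volume(radius, d):
    """Volume of a d-dimensional ball: pi^(d/2) * R^d / Gamma(d/2 + 1)."""
    return pi ** (d / 2) * radius ** d / gamma(d / 2 + 1)

def knn_density(v, samples, k):
    n, d = samples.shape
    r = np.sort(np.linalg.norm(samples - v, axis=1))[k - 1]
    return (k / n) / sphere_volume(r, d)

rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=(5000, 2))   # true density is 1 on [0,1]^2
p = knn_density(np.array([0.5, 0.5]), samples, k=20)
```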
Kernel approximation: a 1-D example
Kernel approximation in O(n)
• In the learning algorithm, we assume uniform sampling
• namely, samples are located at the crosses of an evenly-spaced grid in the d-dimensional vector space
• Let δ denote the distance between two adjacent samples
• If the assumption of uniform sampling does not hold, then some sort of interpolation can be conducted to obtain the approximate function values at the crosses of the grid
A 2-D example of uniform sampling
The O(n) kernel approximation algorithm: the basic idea
• Under the assumption that the sampling density is sufficiently high, the function values at a sample s_i and at its k nearest samples s_i1, …, s_ik are virtually equal: f(s_i) ≈ f(s_i1) ≈ … ≈ f(s_ik)
• In other words, f(v) is virtually a constant function in the proximity of s_i
• Accordingly, we can expect that the approximation f̂(v) ≈ f(s_i) holds in the proximity of s_i
The O(n) kernel approximation algorithm: a 1-D example
The O(n) kernel approximation algorithm
• In the 1-D example, samples are located at x = hδ, where h is an integer
• Under the assumption above, the approximation takes the form f̂(x) = Σ_h w_h exp(−(x − hδ)² / 2σ²)
The O(n) kernel approximation algorithm
• The issue now is to find appropriate w_h and σ such that f̂(x) ≈ f(hδ) in the proximity of x = hδ
The O(n) kernel approximation algorithm
• Therefore, with σ = δ, we can set w_h = f(hδ) / √(2π) and obtain f̂(x) ≈ f(hδ) for x in the proximity of hδ
The O(n) kernel approximation algorithm
• In fact, it can be shown that the deviation of (1/√(2π)) Σ_h exp(−(x − hδ)² / 2δ²) from 1 is bounded by a small constant
• Therefore, we have the following function approximation: f̂(x) = (1/√(2π)) Σ_h f(hδ) exp(−(x − hδ)² / 2δ²)
The O(n) kernel approximation algorithm: generalization
• We can generalize the result by setting σ = βδ, where β is a positive real number
The O(n) kernel approximation algorithm: generalization
• The table shows the bounds of the approximation error for different settings of β
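A minimal sketch of the O(n) construction on a uniform 1-D grid: each sample carries one Gaussian with bandwidth σ = βδ and weight f(hδ)/(√(2π)β), so no linear system has to be solved. The test function cos(x) and β = 0.7 are illustrative assumptions.

```python
import numpy as np

# Sketch of the O(n) construction on a uniform 1-D grid with spacing
# delta: each sample carries one Gaussian with bandwidth sigma = beta*delta
# and weight w_h = f(h*delta) / (sqrt(2*pi) * beta).  Because the grid of
# Gaussians sums to ~1 everywhere (a Riemann sum of the normal density),
# the weighted sum reproduces f directly.

def kernel_approx(x, grid, f_vals, beta):
    delta = grid[1] - grid[0]
    sigma = beta * delta
    w = f_vals / (np.sqrt(2.0 * np.pi) * beta)
    diff = x[:, None] - grid[None, :]
    return (w[None, :] * np.exp(-diff ** 2 / (2.0 * sigma ** 2))).sum(axis=1)

grid = np.linspace(-3.0, 3.0, 121)           # delta = 0.05
x = np.linspace(-1.0, 1.0, 50)               # stay away from the grid edges
approx = kernel_approx(x, grid, np.cos(grid), beta=0.7)
err = np.max(np.abs(approx - np.cos(x)))
```

Computing the weights is a single pass over the samples, which is where the O(n) cost comes from.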
An example of the effect of different settings of β
The smoothing effect
• The kernel approximation function is actually a weighted average of the sampled function values
• Selecting a larger β implies that the smoothing effect will be more significant
• Our suggestion is to set β to a fixed default value
An example of the smoothing effect
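The smoothing effect can be illustrated by approximating noisy samples of cos(x) with a small and a large β: a larger β averages over more samples and so suppresses noise more strongly, at the cost of more bias. The noise level 0.2 and the specific β values are illustrative assumptions.

```python
import numpy as np

# Sketch of the smoothing effect of beta.  A larger beta widens the
# effective averaging window of the kernel approximation, so the noise
# in the sampled values is damped more strongly.

def kernel_approx(x, grid, f_vals, beta):
    delta = grid[1] - grid[0]
    sigma = beta * delta
    w = f_vals / (np.sqrt(2.0 * np.pi) * beta)
    diff = x[:, None] - grid[None, :]
    return (w[None, :] * np.exp(-diff ** 2 / (2.0 * sigma ** 2))).sum(axis=1)

rng = np.random.default_rng(0)
grid = np.linspace(-3.0, 3.0, 121)
noisy = np.cos(grid) + rng.normal(0.0, 0.2, grid.size)   # noisy samples

x = np.linspace(-1.0, 1.0, 50)
true = np.cos(x)
rms_small_beta = np.sqrt(np.mean((kernel_approx(x, grid, noisy, 0.7) - true) ** 2))
rms_large_beta = np.sqrt(np.mean((kernel_approx(x, grid, noisy, 3.0) - true) ** 2))
```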
Multi-dimensional kernel approximation
• Under the assumption that the sampling density is sufficiently high, the function values at a sample s_i and at its k nearest samples s_i1, …, s_ik are virtually equal: f(s_i) ≈ f(s_i1) ≈ … ≈ f(s_ik)
Multi-dimensional kernel approximation
• As a result, we can expect that w_i1 ≈ … ≈ w_ik and σ_i1 ≈ … ≈ σ_ik, where w_ij and σ_ij are the weights and bandwidths of the Gaussian functions located at s_i1, …, s_ik, respectively
Multi-dimensional kernel approximation
• Since the influence of a Gaussian function decreases exponentially as the distance increases, we can set k to a value such that, for a vector v in the proximity of sample s_i, only the k nearest Gaussian functions contribute materially: f̂(v) ≈ Σ_{j=1..k} w_ij exp(−‖v − s_ij‖² / 2σ_ij²)
• Since we have w_i1 ≈ … ≈ w_ik ≈ w_i and σ_i1 ≈ … ≈ σ_ik ≈ σ, our objective is to find w_i and σ such that f̂(v) ≈ f(s_i) in the proximity of s_i
Multi-dimensional kernel approximation • Let,then we have Machine Learning and Bioinformatics
• With samples on a grid of spacing δ and σ set to βδ, g(v) is bounded within a narrow range around (√(2π)β)^d, by the same argument as in the 1-D case
• Accordingly, f̂(v) ≈ w_i · (√(2π)β)^d is likewise bounded in the proximity of s_i
Multi-dimensional kernel approximation
• Therefore, with σ = βδ, g(v) is virtually a constant function equal to (√(2π)β)^d for v in the proximity of s_i
• Accordingly, we want to set w_i = f(s_i) / (√(2π)β)^d
Multi-dimensional kernel approximation
• Finally, by setting σ uniformly to βδ, we obtain the following kernel smoothing function that approximates f(v):
  f̂(v) = (1/(√(2π)β)^d) Σ_{j=1..k} f(s_ij) exp(−‖v − s_ij‖² / 2β²δ²),
  where s_i1, …, s_ik are the k nearest samples of the object located at v
• Generally speaking, if we set σ uniformly to βδ, then g(v) is bounded within a range around (√(2π)β)^d whose width depends on β
• If we select a proper value for β, then g(v) ≈ (√(2π)β)^d
Multi-dimensional kernel approximation
• As a result, we want to set w_i = f(s_i) / (√(2π)β)^d, with σ = βδ, and obtain the following function approximation:
  f̂(v) = (1/(√(2π)β)^d) Σ_{j=1..k} f(s_ij) exp(−‖v − s_ij‖² / 2β²δ²),
  where s_i1, …, s_ik are the k nearest samples of the object located at v
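The multi-dimensional smoothing function can be sketched as follows: each grid sample carries a Gaussian with bandwidth σ = βδ and weight f(s)/(√(2π)β)^d, and only the k nearest samples of v are summed. The 2-D grid, the test function, β = 0.7 and k = 200 are illustrative assumptions.

```python
import numpy as np

# Sketch of the multi-dimensional kernel smoothing function on a uniform
# d-dimensional grid with spacing delta.  Each sample s carries a Gaussian
# with bandwidth sigma = beta*delta and weight f(s) / (sqrt(2*pi)*beta)^d;
# only the k nearest samples of v are summed.

def kernel_smooth(v, grid_pts, f_vals, delta, beta, k):
    d = grid_pts.shape[1]
    sigma = beta * delta
    dist2 = ((grid_pts - v) ** 2).sum(axis=1)
    nearest = np.argsort(dist2)[:k]          # k nearest grid samples of v
    w = f_vals[nearest] / ((2.0 * np.pi) ** (d / 2.0) * beta ** d)
    return float((w * np.exp(-dist2[nearest] / (2.0 * sigma ** 2))).sum())

delta = 0.05
axis = np.linspace(-2.0, 2.0, 81)            # spacing delta in each axis
xx, yy = np.meshgrid(axis, axis)
grid_pts = np.column_stack([xx.ravel(), yy.ravel()])
f_vals = np.sin(grid_pts[:, 0]) + np.cos(grid_pts[:, 1])

v = np.array([0.3, -0.4])
approx = kernel_smooth(v, grid_pts, f_vals, delta, beta=0.7, k=200)
err = abs(approx - (np.sin(0.3) + np.cos(-0.4)))
```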
Lacking something? How about kernel density estimation?
Application in data classification
• Recent development in data classification focuses on support vector machines (SVM), due to accuracy concerns
• In this lecture, we will describe a kernel density estimation based data classification algorithm that delivers the same level of accuracy as the SVM and enjoys some advantages
The KDE-based classifier
• The learning algorithm constructs one approximate probability density function for each class of objects
• Let us adopt the following estimate of the probability density function at each training sample:
  p̂_m(s_i) = (k / n_m) · Γ(d/2 + 1) / ( π^(d/2) R(s_i)^d ),
  where
• p̂_m is the probability density function of class-m objects, and n_m is the number of class-m training samples
• R(s_i) is the distance between training sample s_i and its k-th nearest training sample of the same class
• d is the dimension of the vector space
The KDE-based classifier
• In the kernel approximation problem, we set the bandwidth of each Gaussian function uniformly to βδ, where δ is the distance between two adjacent training samples
• In the kernel density estimation problem, for each training sample, we need to determine the average distance between two adjacent training samples of the same class in the local region
The KDE-based classifier
• In the d-dimensional vector space, if the average distance between samples is δ, then the number of samples in a subspace of volume V is approximately equal to V / δ^d
• Accordingly, since about k samples fall in the sphere of radius R(s_i), we can estimate the local average distance δ_i by δ_i ≈ ( π^(d/2) R(s_i)^d / (Γ(d/2 + 1) · k) )^(1/d)
• Accordingly, with the kernel approximation function that we obtained earlier, we have the following approximate probability density function for class-m objects:
  p̂_m(v) = (1/(√(2π)β)^d) Σ_{j=1..k} p̂_m(s_ij) exp(−‖v − s_ij‖² / 2β²δ_ij²),
  where s_i1, …, s_ik are the k nearest class-m training samples of the object located at v, and δ_ij is the local average sample distance at s_ij
The KDE-based classifier
• An interesting observation is that, as long as β is set to a value within an appropriate range, then regardless of its exact value, we obtain virtually the same classifier
• With this observation, parameter tuning is greatly simplified
• In the discussion above, R(s_i) is defined to be the distance between sample s_i and its k-th nearest training sample
• However, this definition depends on only one single sample and tends to be unreliable if the data set is noisy
• We can replace R(s_i) with the average (1/k) Σ_{j=1..k} ‖s_i − s_ij‖, where s_i1, …, s_ik are the k nearest training samples of the same class as s_i
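A classifier in the spirit described above can be sketched as follows: one density estimate per class, with a query assigned to the class whose prior-weighted density at the query point is largest. For brevity this sketch uses the plain k-th-nearest-neighbour density estimate rather than the full Gaussian-mixture construction; the two-cluster data, the cluster locations, and k = 10 are illustrative assumptions.

```python
from math import gamma, pi

import numpy as np

# Sketch of a KDE-based classifier: build one density estimate per class,
# then assign a query to the class with the largest prior-weighted density
# at the query point.  Uses the plain kNN density estimate for brevity.

def knn_density(v, samples, k):
    n, d = samples.shape
    r = np.sort(np.linalg.norm(samples - v, axis=1))[k - 1]
    volume = pi ** (d / 2) * r ** d / gamma(d / 2 + 1)
    return (k / n) / volume

def classify(v, class_samples, k=10):
    n_total = sum(len(s) for s in class_samples.values())
    scores = {m: (len(s) / n_total) * knn_density(v, s, k)
              for m, s in class_samples.items()}
    return max(scores, key=scores.get)

rng = np.random.default_rng(1)
data = {
    0: rng.normal([0.0, 0.0], 0.5, size=(200, 2)),   # class-0 cluster
    1: rng.normal([3.0, 3.0], 0.5, size=(200, 2)),   # class-1 cluster
}
```

Unlike an SVM, nothing is trained here: all the work happens at query time, which is one of the practical differences the lecture alludes to.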
Parameter tuning
• The discussion so far is based on the assumption that the sampling density is sufficiently high, which may not hold for some real data sets
Parameter tuning
• Three parameters, among them k and β, are incorporated in the learning algorithm
Parameter tuning
• One may wonder how these parameters should be set
• According to our experimental results, the exact values have essentially no effect, as long as they are set within an appropriate range