580.691 Learning Theory Reza Shadmehr

Linear and quadratic decision boundaries; kernel estimates of density; missing data

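Bayesian classification • Suppose we wish to classify vector x as belonging to one of the classes {1,…,L}. We are given labeled data and need to form a classification function: $\hat{l}(x) = \arg\max_{l} P(l \mid x)$, where by Bayes' rule $P(l \mid x) = \frac{p(x \mid l)\,P(l)}{\sum_{l'=1}^{L} p(x \mid l')\,P(l')}$. Here $p(x \mid l)$ is the likelihood, $P(l)$ is the prior, and the denominator is the marginal $p(x)$. Classify x into the class l that maximizes the posterior probability.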
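As a rough illustration of the rule above, the following sketch computes the posterior for each class and picks the largest. It assumes one-dimensional Gaussian likelihoods purely for convenience; the function names and the numbers in the usage note are hypothetical and not taken from the slides.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior(x, means, stds, priors):
    """Return the list of P(l | x), one entry per class, via Bayes' rule."""
    # likelihood * prior for each class
    joint = [gaussian_pdf(x, m, s) * p for m, s, p in zip(means, stds, priors)]
    marginal = sum(joint)  # p(x), the normalizing constant
    return [j / marginal for j in joint]

def classify(x, means, stds, priors):
    """Index of the class with the largest posterior probability."""
    post = posterior(x, means, stds, priors)
    return max(range(len(post)), key=lambda l: post[l])
```

For example, with two hypothetical classes of means 166 and 176, a common standard deviation of 12, and equal priors, `classify(170, [166, 176], [12, 12], [0.5, 0.5])` picks the first class because 170 lies closer to 166.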

Classification when distributions have equal variance • Suppose we wish to classify a person as male or female based on height x. What we have: the class-conditional densities $p(x \mid \text{female})$ and $p(x \mid \text{male})$. What we want: the posterior probability $P(\text{female} \mid x)$. Assume equal probability of being male or female: $P(\text{female}) = P(\text{male}) = 0.5$. Note that the two densities have equal variance. [Figure: female and male densities plotted against height (160 to 200).]

Classification when distributions have equal variance [Figure: two panels plotted against height (160 to 200).]

Estimating the decision boundary between data of equal variance • Suppose the distribution of the data in each class is Gaussian. The decision boundary between any two classes is where the log of the ratio of their posterior probabilities is zero. If the data in each class have a Gaussian density with equal variance, the quadratic terms in x cancel in this log ratio, so the boundary between any two classes is linear in x: a line when x is two-dimensional.
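To make the linear form concrete, here is a minimal sketch for two classes in two dimensions that share one covariance matrix. With shared covariance $\Sigma$, the log posterior ratio reduces to $w^\top x + b$ with $w = \Sigma^{-1}(\mu_1 - \mu_2)$ and $b = -\tfrac{1}{2}\mu_1^\top \Sigma^{-1}\mu_1 + \tfrac{1}{2}\mu_2^\top \Sigma^{-1}\mu_2 + \log\frac{P(1)}{P(2)}$. The helper names and the restriction to 2x2 matrices are assumptions made to keep the example dependency-free.

```python
import math

def inv2(S):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(M, v):
    """Product of a 2x2 matrix and a length-2 vector."""
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def linear_boundary(mu1, mu2, cov, prior1=0.5, prior2=0.5):
    """Return (w, b) such that log[P(1|x)/P(2|x)] = w.x + b."""
    P = inv2(cov)  # shared precision matrix
    w = matvec(P, [mu1[0] - mu2[0], mu1[1] - mu2[1]])
    b = (-0.5 * dot(mu1, matvec(P, mu1))
         + 0.5 * dot(mu2, matvec(P, mu2))
         + math.log(prior1 / prior2))
    return w, b

def log_ratio(x, w, b):
    """Positive values favor class 1, negative values favor class 2."""
    return dot(w, x) + b
```

With equal priors the boundary passes through the midpoint of the two means, so `log_ratio` evaluated there is zero.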

Estimating the decision boundary from estimated densities • From the data we can get an ML estimate of the Gaussian parameters of each class. [Figure: data from Class 1, Class 2, and Class 3.]
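A minimal sketch of such an ML fit is given below, assuming one-dimensional data and a single variance shared by all classes (the equal-variance case). The function name and return layout are hypothetical. Note that the ML estimate of the variance divides by n rather than n - 1.

```python
def fit_equal_variance_gaussians(xs, labels):
    """ML estimates of class means, one pooled variance, and class priors."""
    classes = sorted(set(labels))
    n = len(xs)
    means, priors = {}, {}
    squared_dev = 0.0
    for c in classes:
        members = [x for x, l in zip(xs, labels) if l == c]
        means[c] = sum(members) / len(members)   # class sample mean
        priors[c] = len(members) / n             # class frequency
        squared_dev += sum((x - means[c]) ** 2 for x in members)
    pooled_var = squared_dev / n                 # ML estimate (divide by n)
    return means, pooled_var, priors
```

The fitted means, variance, and priors can then be plugged into the log posterior ratio to locate the boundary between any pair of classes.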

Relationship between Bayesian classification and Fisher discriminant. If we have two classes, labeled -1 and +1, then the Fisher decision boundary is where the discriminant output is 0. For the Bayesian classifier, under the assumption of equal variance, the decision boundary is where the log of the posterior ratio is zero. The Fisher decision boundary is the same as the Bayesian one when the two classes have equal variance and equal prior probability.

Classification when distributions have unequal variance [Figure: panels labeled "Assume", "What we have", and "Classification", plotted against height (140 to 200).]


Quadratic discriminant: when data comes from unequal variance Gaussians. The decision boundary between any two classes is where the log of the ratio of their posterior probabilities is zero. If the data in each class has a Gaussian density with unequal variance, the quadratic terms no longer cancel, so the boundary between any two classes is a quadratic function of x. [Figure: three classes shown in green, red, and blue.]
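For one-dimensional data the quadratic can be written out explicitly. Taking the log of $p(x \mid 1)P(1) / (p(x \mid 2)P(2))$ for Gaussians with means $\mu_1, \mu_2$ and standard deviations $\sigma_1, \sigma_2$ gives $a x^2 + b x + c$, and the boundaries are its real roots. The sketch below is one way to compute them; the function names and default priors are assumptions for illustration.

```python
import math

def log_ratio_coeffs(mu1, s1, mu2, s2, p1=0.5, p2=0.5):
    """Coefficients (a, b, c) of log[p(x|1)P(1) / (p(x|2)P(2))] = a x^2 + b x + c."""
    a = 0.5 / s2**2 - 0.5 / s1**2
    b = mu1 / s1**2 - mu2 / s2**2
    c = (0.5 * mu2**2 / s2**2 - 0.5 * mu1**2 / s1**2
         + math.log(s2 / s1) + math.log(p1 / p2))
    return a, b, c

def boundaries(mu1, s1, mu2, s2, p1=0.5, p2=0.5):
    """Sorted list of x values where the two posteriors are equal."""
    a, b, c = log_ratio_coeffs(mu1, s1, mu2, s2, p1, p2)
    if abs(a) < 1e-15:
        # Equal variances: the quadratic term vanishes and the boundary is linear.
        return [-c / b] if b != 0 else []
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return []  # one class has the larger posterior everywhere
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2.0 * a), (-b + root) / (2.0 * a)])
```

With unequal variances there can be two boundaries: the narrower class wins in an interval around its mean and the broader class wins on both sides. When the variances are equal the function falls back to the single linear boundary.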

Non-parametric estimate of densities: Kernel density estimate. Suppose we have points x(i) that belong to class l, and that we can’t assume these points come from a Gaussian distribution. To estimate the density, we need to form a function that assigns a weight to each point x in our space, with the integral of this function equal to 1. The more data points x(i) we find around x, the larger the weight of x should be. The kernel density estimate puts a Gaussian centered at each data point. Where there are more data points, there are more Gaussians, and their normalized sum is the density estimate. [Figure, three panels: histogram of the sampled data belonging to class l; ML estimate of a Gaussian density; density estimate using a Gaussian kernel.]
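The construction described above can be sketched in a few lines. This version places a Gaussian of width `bandwidth` on every data point and divides by the number of points so that the result integrates to 1. The name `make_kde` and the choice of bandwidth are assumptions; the slides do not specify how the kernel width is chosen.

```python
import math

def make_kde(data, bandwidth):
    """Return a function x -> estimated density, using a Gaussian kernel."""
    n = len(data)
    # Each kernel is a normalized Gaussian; dividing by n keeps the total mass at 1.
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density
```

For instance, `make_kde([-5.0, -4.5, 0.0, 6.0, 7.0], 1.5)` returns a density that is larger near -4.75, where two points cluster, than near 3, where no point is close. A smaller bandwidth gives a spikier estimate and a larger one a smoother estimate.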

Non-parametric estimate of densities: Kernel density estimate [Figure: three classes shown in green, red, and blue.]

Classification with missing data. Suppose that we have built a Bayesian classifier and are now given a new data point to classify, but that this new data point is missing some of the “features” that we normally expect to see. In the example below, we have two features (x1 and x2), and four classes. The likelihood function is plotted. Suppose that we are given data point (*,-1) to classify. This data point is missing a value for x1. If we assume the missing value is the average of the previously observed x1, then we would estimate it to be about 1. Assuming that the prior probabilities are equal among the four classes, we classify (1,-1) as class c2. However, c4 is a better choice because when x2=-1, c4 is the most likely class as it has the highest likelihood. [Figure: likelihood of the four classes over x1 (-4 to 8) and x2 (-3 to 3).]

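Classification with missing data. Split the features into good data $x_g$ and bad (or missing) data $x_b$. Classify using the posterior given the good data alone, obtained by integrating the joint density over the bad data: $P(l \mid x_g) = \frac{\int p(x_g, x_b \mid l)\,P(l)\,dx_b}{p(x_g)}$.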
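The contrast between filling in a missing feature with its average and integrating it out can be reproduced with a small sketch. The four classes below are hypothetical stand-ins for the plotted likelihoods, chosen only so that the behavior matches the example; each class is assumed to be Gaussian with independent features, so that integrating out a missing feature amounts to dropping its factor from the likelihood. Priors are taken as equal.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical classes: ((mean_x1, mean_x2), (std_x1, std_x2)).
CLASSES = {
    "c1": ((-2.0, 1.5), (1.0, 0.7)),
    "c2": ((1.0, 0.0), (1.0, 1.0)),
    "c3": ((0.0, 2.5), (1.0, 0.7)),
    "c4": ((5.0, -1.0), (1.0, 0.4)),
}

def classify(x1, x2):
    """Most likely class under equal priors; a feature set to None is
    treated as missing and integrated out of the likelihood."""
    scores = {}
    for name, (mean, std) in CLASSES.items():
        likelihood = 1.0
        for value, m, s in zip((x1, x2), mean, std):
            if value is not None:
                likelihood *= gaussian_pdf(value, m, s)
        scores[name] = likelihood
    return max(scores, key=scores.get)

# Mean imputation: replace the missing x1 by the average of the class means
# (a stand-in for the average of previously observed x1 values).
x1_guess = sum(mean[0] for mean, _ in CLASSES.values()) / len(CLASSES)
```

Here `classify(x1_guess, -1.0)` returns `"c2"`, while `classify(None, -1.0)` returns `"c4"`, mirroring the point that the imputed value can lead the classifier away from the class that is most likely given the observed feature alone.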
