Chapter 3: Maximum Likelihood and Bayesian Estimation – Part 1
Pattern Recognition
Practical Issues
• We could design an optimal classifier if we knew:
  • P(ωi) (priors)
  • p(x|ωi) (class-conditional densities)
• In practice, we rarely have this complete information!
• Solution: design the classifier from a set of training examples.
  • Estimating P(ωi) is usually easy.
  • Estimating p(x|ωi) is more difficult:
    • the number of samples is often too small
    • the dimensionality of the feature space is large
Parameter Estimation
• Assumptions:
  • We are given a sample set D = {x1, x2, ..., xn}, where the samples were drawn according to p(x|ωj).
  • p(x|ωj) has a known parametric form, i.e., it is determined by a parameter vector θ, e.g., p(x|ωj) ~ N(μj, Σj).
• Parameter estimation problem:
  • Given D, find the best possible θ.
  • This is a classical problem in statistics!
Main Methods in Parameter Estimation
• Maximum Likelihood (ML)
  • Assumes that the values of the parameters are fixed but unknown.
  • The best estimate is the one that maximizes the probability of obtaining the samples actually observed (i.e., the training data).
• Bayesian Estimation
  • Assumes that the parameters are random variables with some known a priori distribution.
  • Uses the samples to convert this prior into a posterior density over the parameters.
Maximum Likelihood (ML) Estimation – Assumptions
• Suppose the training data are divided into c sets (i.e., one per class): D1, D2, ..., Dc.
• Assume that the samples in Dj have been drawn independently according to p(x|ωj).
• Assume that p(x|ωj) has a known parametric form with parameter vector θj, e.g., θj = (μj, Σj) for a Gaussian distribution or, in general, θj = (θ1, θ2, ..., θp)^t.
ML Estimation – Problem Definition and Solution
• Problem: given D1, D2, ..., Dc and a model for each class, estimate θ1, θ2, ..., θc.
• If the samples in Dj give no information about θi (for i ≠ j), we can solve c independent problems (i.e., one for each class) and drop the class index.
• The ML estimate θ̂ for D = {x1, x2, ..., xn} is the value of θ that maximizes p(D|θ) (i.e., that best supports the training data).
ML Parameter Estimation (cont'd)
• How to find the maximum of p(D|θ) = ∏k=1..n p(xk|θ)?
• It is easier to consider the log-likelihood: l(θ) = ln p(D|θ) = ∑k=1..n ln p(xk|θ).
• The solution θ̂ = arg maxθ l(θ) maximizes p(D|θ) or, equivalently, ln p(D|θ); it is found by setting ∇θ l(θ) = ∑k=1..n ∇θ ln p(xk|θ) = 0.
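A minimal Python sketch of this idea, assuming a univariate Gaussian with known σ = 1 and an unknown mean (the data, seed, and grid values are illustrative): it evaluates l(θ) over a grid and picks the maximizer, which comes out as the sample mean.

```python
import numpy as np

# Minimal sketch: ML by maximizing the log-likelihood l(theta) = sum_k ln p(x_k | theta).
# Assumed setup (illustrative): univariate Gaussian with known sigma = 1, unknown mean mu.
rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.0, size=50)          # training samples x_1..x_n

def log_likelihood(mu, x, sigma=1.0):
    # l(mu) = sum_k ln N(x_k | mu, sigma^2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

# Crude grid search over candidate values of mu (a gradient-based optimizer would also work).
grid = np.linspace(-5, 5, 2001)
mu_hat = grid[np.argmax([log_likelihood(m, D) for m in grid])]

print(mu_hat, D.mean())   # the maximizer is (up to grid resolution) the sample mean
```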
Maximum A Posteriori (MAP) Estimator
• Assume that θ is a random vector and that we know the prior p(θ).
• Given D, MAP converts the prior p(θ) into a posterior p(θ|D) using Bayes' rule:
  p(θ|D) = p(D|θ) p(θ) / p(D)
• The goal is to maximize p(θ|D) or, since p(D) does not depend on θ, equivalently p(D|θ) p(θ):
  θ̂MAP = arg maxθ p(D|θ) p(θ)
• MAP is equivalent to ML when p(θ) is uniform (a flat prior).
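A minimal sketch of MAP vs. ML, assuming a univariate Gaussian likelihood with known variance σ² and a Gaussian prior μ ~ N(μ0, σ0²); the closed-form MAP estimate below follows from this conjugate setup, and all numbers are illustrative assumptions. As σ0² grows (the prior flattens), the MAP estimate approaches the ML estimate, matching the last bullet.

```python
import numpy as np

# Minimal sketch of MAP vs. ML for the mean of a univariate Gaussian.
# Assumptions (illustrative): known likelihood variance sigma^2, Gaussian prior mu ~ N(mu0, sigma0^2).
rng = np.random.default_rng(1)
sigma2 = 1.0                      # known variance of p(x|mu)
mu0, sigma0_2 = 0.0, 0.25         # prior p(mu) = N(mu0, sigma0^2)
D = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=10)
n = len(D)

mu_ml = D.mean()                                                        # maximizes p(D|mu)
mu_map = (sigma0_2 * D.sum() + sigma2 * mu0) / (n * sigma0_2 + sigma2)  # maximizes p(D|mu)p(mu)

print(mu_ml, mu_map)   # MAP is pulled toward the prior mean mu0; with a flat prior it equals ML
```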
ML for Gaussian Density: Case of Unknown θ = μ
• Consider ln p(x|μ), where p(x|μ) ~ N(μ, Σ) with Σ known:
  ln p(x|μ) = −(1/2) ln[(2π)^d |Σ|] − (1/2)(x − μ)^t Σ^(−1) (x − μ)
• Computing the gradient with respect to μ (by setting x = xk), we have:
  ∇μ ln p(xk|μ) = Σ^(−1) (xk − μ)
ML for Gaussian Density: Case of Unknown θ = μ (cont'd)
• Setting ∇μ ln p(D|μ) = ∑k=1..n Σ^(−1)(xk − μ̂) = 0, we have: ∑k=1..n (xk − μ̂) = 0
• The solution is given by: μ̂ = (1/n) ∑k=1..n xk
• The ML estimate is simply the "sample mean".
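A quick numerical check of this derivation on illustrative data: the condition ∑k (xk − μ̂) = 0 holds exactly at the sample mean and fails for any other value of μ.

```python
import numpy as np

# Minimal check of the derivation above: the gradient condition sum_k (x_k - mu_hat) = 0
# is satisfied by the sample mean (illustrative univariate data).
rng = np.random.default_rng(2)
D = rng.normal(loc=5.0, scale=2.0, size=100)

mu_hat = D.mean()                       # ML estimate = sample mean
print(np.sum(D - mu_hat))               # ~0 up to floating-point round-off
print(np.sum(D - (mu_hat + 0.1)))       # nonzero for any other value of mu
```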
ML for Gaussian Density: Case of Unknown θ = (θ1, θ2) = (μ, σ²)
• Consider ln p(xk|θ) for the univariate Gaussian, with θ1 = μ and θ2 = σ²:
  ln p(xk|θ) = −(1/2) ln(2π θ2) − (1/(2θ2)) (xk − θ1)²
• The partial derivatives are:
  ∂ ln p(xk|θ)/∂θ1 = (xk − θ1)/θ2
  ∂ ln p(xk|θ)/∂θ2 = −1/(2θ2) + (xk − θ1)²/(2θ2²)
ML for Gaussian Density: Case of Unknown θ = (θ1, θ2) = (μ, σ²) (cont'd)
• Setting the gradient of the log-likelihood to zero:
  ∑k=1..n (xk − θ̂1)/θ̂2 = 0
  −∑k=1..n 1/(2θ̂2) + ∑k=1..n (xk − θ̂1)²/(2θ̂2²) = 0
• The solutions are given by:
  θ̂1 = μ̂ = (1/n) ∑k=1..n xk
  θ̂2 = σ̂² = (1/n) ∑k=1..n (xk − μ̂)²
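A minimal sketch of these univariate closed-form estimates on illustrative data; note that NumPy's np.var uses the same 1/n (ML) normalization by default.

```python
import numpy as np

# Minimal sketch: closed-form ML estimates for a univariate Gaussian (theta1 = mu, theta2 = sigma^2).
rng = np.random.default_rng(3)
D = rng.normal(loc=1.0, scale=3.0, size=200)   # illustrative: true mu = 1, true sigma^2 = 9

mu_hat = D.mean()                          # theta1_hat = (1/n) sum_k x_k
sigma2_hat = np.mean((D - mu_hat) ** 2)    # theta2_hat = (1/n) sum_k (x_k - mu_hat)^2

print(mu_hat, sigma2_hat)
print(np.var(D))    # same value: np.var uses the 1/n (ML) normalization by default
```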
ML for Gaussian Density: Case of Unknown θ = (μ, Σ)
• In the general case (i.e., multivariate Gaussian with both μ and Σ unknown), the solutions are:
  μ̂ = (1/n) ∑k=1..n xk
  Σ̂ = (1/n) ∑k=1..n (xk − μ̂)(xk − μ̂)^t
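A minimal sketch of the multivariate case on illustrative synthetic data; np.cov with bias=True reproduces the 1/n-normalized ML covariance.

```python
import numpy as np

# Minimal sketch: ML estimates for a multivariate Gaussian with both mu and Sigma unknown.
rng = np.random.default_rng(4)
true_mu = np.array([1.0, -2.0])                       # illustrative ground truth
true_Sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
D = rng.multivariate_normal(true_mu, true_Sigma, size=500)   # n x d matrix of samples

mu_hat = D.mean(axis=0)                               # (1/n) sum_k x_k
diffs = D - mu_hat
Sigma_hat = (diffs.T @ diffs) / len(D)                # (1/n) sum_k (x_k - mu_hat)(x_k - mu_hat)^t

print(mu_hat)
print(Sigma_hat)
print(np.cov(D, rowvar=False, bias=True))   # same as Sigma_hat (bias=True gives 1/n normalization)
```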
Biased and Unbiased Estimates
• An estimate θ̂ is unbiased when E[θ̂] = θ, where θ is the true value.
• The ML estimate μ̂ is unbiased, i.e., E[μ̂] = μ.
• The ML estimates σ̂² and Σ̂ are biased:
  E[σ̂²] = ((n−1)/n) σ² ≠ σ²
  E[Σ̂] = ((n−1)/n) Σ ≠ Σ
Biased and Unbiased Estimates (cont'd)
• The following are unbiased estimators for σ² and Σ:
  s² = (1/(n−1)) ∑k=1..n (xk − μ̂)²
  C = (1/(n−1)) ∑k=1..n (xk − μ̂)(xk − μ̂)^t
• Note: the ML estimates of σ² and Σ are asymptotically unbiased: as n → ∞, (n−1)/n → 1, so E[σ̂²] → σ² and E[Σ̂] → Σ.
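A small simulation of the bias result (sample size and trial count are illustrative assumptions): the 1/n estimator averages to ((n−1)/n)·σ², while the 1/(n−1) estimator averages to σ².

```python
import numpy as np

# Minimal simulation: with n samples, E[sigma2_ML] = ((n-1)/n) * sigma^2,
# while the (n-1)-normalized estimator is unbiased. Numbers are illustrative.
rng = np.random.default_rng(5)
n, sigma2_true, trials = 5, 4.0, 200_000

X = rng.normal(loc=0.0, scale=np.sqrt(sigma2_true), size=(trials, n))
mu_hat = X.mean(axis=1, keepdims=True)

ml_estimates = np.mean((X - mu_hat) ** 2, axis=1)                 # 1/n normalization (biased)
unbiased_estimates = np.sum((X - mu_hat) ** 2, axis=1) / (n - 1)  # 1/(n-1) normalization

print(ml_estimates.mean())         # ~ ((n-1)/n) * 4.0 = 3.2
print(unbiased_estimates.mean())   # ~ 4.0
```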
Some Comments about ML
• ML estimation is usually simpler than alternative methods.
• It has good convergence properties as the number of training samples increases.
• If the model chosen for p(x|θ) is correct, and the independence assumptions hold, ML gives very good results.
• If the model is wrong, ML will give poor results.