Bayesian Decision Theory & ML Estimation Pattern Recognition
Bayesian Decision Theory • Fundamental statistical approach to the problem of pattern classification. • Quantifies the tradeoffs between various classification decisions using probabilities and the costs associated with such decisions. • Each action is associated with a cost or risk. • The simplest risk is the classification error. • Design classifiers to recommend actions that minimize some total expected risk.
Terminology (using sea bass – salmon classification example) • State of nature ω (random variable): • ω1 for sea bass, ω2 for salmon. • Probabilities P(ω1) and P(ω2) (priors): • prior knowledge of how likely it is to get a sea bass or a salmon. • Probability density function p(x) (evidence): • how frequently we will measure a pattern with feature value x (e.g., x is a lightness measurement). Note: if x and y are different measurements, p(x) and p(y) correspond to different pdfs: pX(x) and pY(y).
Terminology (cont’d) (using sea bass – salmon classification example) • Conditional probability density p(x/ωj) (likelihood): • how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj (e.g., the lightness distributions of the salmon and sea-bass populations).
Terminology (cont’d) (using sea bass – salmon classification example) • Conditional probability P(ωj/x) (posterior): • the probability that the fish belongs to class ωj given measurement x. Note: we will be using an uppercase P(.) to denote a probability mass function (pmf) and a lowercase p(.) to denote a probability density function (pdf).
Decision Rule Using Priors Only Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2 P(error) = min[P(ω1), P(ω2)] • Favours the most likely class … (optimum if no other info is available). • This rule would be making the same decision all the time! • Makes sense to use for judging just one fish …
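For instance (illustrative numbers, not from the slides): with P(ω1) = 0.7 and P(ω2) = 0.3, this rule always decides ω1 and errs on every salmon, so P(error) = min[0.7, 0.3] = 0.3.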
Decision Rule Using Conditional pdf • Using Bayes’ rule, the posterior probability of category ωj given measurement x is given by: P(ωj/x) = p(x/ωj)P(ωj) / p(x), where p(x) = ∑j p(x/ωj)P(ωj) (scale factor ensuring the posteriors sum to 1). Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2, or equivalently: Decide ω1 if p(x/ω1)P(ω1) > p(x/ω2)P(ω2); otherwise decide ω2.
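A minimal Python sketch of this rule; the priors and Gaussian lightness densities below are illustrative assumptions, not values from the slides:

```python
from scipy.stats import norm

# Illustrative priors and class-conditional densities (assumed numbers)
priors = {"sea bass": 0.6, "salmon": 0.4}
likelihoods = {"sea bass": norm(7.0, 1.0),   # p(x/w1): lightness of sea bass
               "salmon":   norm(4.0, 1.5)}   # p(x/w2): lightness of salmon

def posteriors(x):
    # Bayes' rule: P(wj/x) = p(x/wj) P(wj) / p(x)
    joint = {w: likelihoods[w].pdf(x) * priors[w] for w in priors}
    evidence = sum(joint.values())           # p(x), the scale factor
    return {w: v / evidence for w, v in joint.items()}

x = 5.5                                      # a lightness measurement
post = posteriors(x)
print(post, "->", max(post, key=post.get))   # decide the max-posterior class
```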
Probability of Error • The probability of error is defined as: P(error/x) = P(ω1/x) if we decide ω2, and P(ω2/x) if we decide ω1. • The average probability of error is given by: P(error) = ∫ P(error/x) p(x) dx • The Bayes rule is optimum, that is, it minimizes the average probability of error, since: P(error/x) = min[P(ω1/x), P(ω2/x)]
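Continuing the sketch above (same assumed densities), the average error can be approximated numerically, since P(error) = ∫ min[p(x/ω1)P(ω1), p(x/ω2)P(ω2)] dx:

```python
import numpy as np
from scipy.stats import norm

# Same assumed priors and densities as the previous sketch
priors = {"sea bass": 0.6, "salmon": 0.4}
likelihoods = {"sea bass": norm(7.0, 1.0), "salmon": norm(4.0, 1.5)}

# Riemann-sum approximation of P(error) = integral of min_j [p(x/wj) P(wj)] dx
xs = np.linspace(-5.0, 15.0, 20001)
dx = xs[1] - xs[0]
joints = np.stack([likelihoods[w].pdf(xs) * priors[w] for w in priors])
print("Bayes error ~", joints.min(axis=0).sum() * dx)
```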
Discriminant Functions • Functional structure of a general statistical classifier: compute the discriminant functions g1(x), …, gc(x) and pick the max. Assign x to ωi if: gi(x) > gj(x) for all j ≠ i
Discriminants for Bayes Classifier • Using risks: gi(x)=-R(αi/x) • Using zero-one loss function (i.e., min error rate): gi(x)=P(ωi/x) • Is the choice of gi unique? • Replacing gi(x) with f(gi(x)), where f() is monotonically increasing, does not change the classification results.
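A tiny sketch of that last point (discriminant values below are assumed for illustration): applying a monotonically increasing f(), such as the logarithm, preserves which gi is largest.

```python
import math

# Assumed discriminant values g_i(x) at some fixed x
g = [0.2, 0.5, 0.3]

best = max(range(len(g)), key=lambda i: g[i])
best_log = max(range(len(g)), key=lambda i: math.log(g[i]))  # f() = ln, monotonic
assert best == best_log   # the decision is unchanged
print(best)
```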
Decision Regions and Boundaries • Decision rules divide the feature space into decision regions R1, R2, …, Rc • The boundaries of the decision regions are the decision boundaries. g1(x)=g2(x) at the decision boundaries
Case of two categories • More common to use a single discriminant function (dichotomizer) instead of two: g(x) = g1(x) - g2(x); decide ω1 if g(x) > 0, otherwise decide ω2. • Examples of dichotomizers: g(x) = P(ω1/x) - P(ω2/x) g(x) = ln[p(x/ω1)/p(x/ω2)] + ln[P(ω1)/P(ω2)]
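A minimal sketch of the second dichotomizer, reusing the same assumed two-class setup as the earlier sketches:

```python
import math
from scipy.stats import norm

# Assumed priors and class-conditional densities (illustrative numbers)
P1, P2 = 0.6, 0.4
f1, f2 = norm(7.0, 1.0), norm(4.0, 1.5)

def g(x):
    # Dichotomizer: log-likelihood ratio plus log-prior ratio
    return math.log(f1.pdf(x) / f2.pdf(x)) + math.log(P1 / P2)

x = 5.5
print("decide w1" if g(x) > 0 else "decide w2")
```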
Discriminant Function for Multivariate Gaussian • Assume the following discriminant function: gi(x) = ln p(x/ωi) + ln P(ωi) • For p(x/ωi) ~ N(μi, Σi) this becomes: gi(x) = -(1/2)(x - μi)t Σi^-1 (x - μi) - (d/2) ln 2π - (1/2) ln|Σi| + ln P(ωi)
Multivariate Gaussian Density: Case I • Assumption: Σi = σ²I • Features are statistically independent • Each feature has the same variance σ² • The discriminant reduces to: gi(x) = -||x - μi||²/(2σ²) + ln P(ωi), where the ln P(ωi) term favors the a-priori more likely category.
Multivariate Gaussian Density: Case I (cont’d) • Expanding gi(x) yields a linear discriminant: gi(x) = (wi)t x + wi0, with wi = μi/σ² and wi0 = -(μi)t μi/(2σ²) + ln P(ωi) (wi0 = threshold or bias) • The decision boundary gi(x) = gj(x) is the hyperplane wt(x - x0) = 0, where w = μi - μj and x0 = (μi + μj)/2 - [σ²/||μi - μj||²] ln[P(ωi)/P(ωj)] (μi - μj)
Multivariate Gaussian Density: Case I (cont’d) • Comments about this hyperplane: • It passes through x0. • It is orthogonal to the line linking the means. • What happens when P(ωi) = P(ωj)? x0 lies halfway between the two means. • If P(ωi) ≠ P(ωj), then x0 shifts away from the more likely mean. • If σ is very small, the position of the boundary is insensitive to P(ωi) and P(ωj).
Multivariate Gaussian Density: Case I (cont’d) • Minimum distance classifier: • When P(ωi) is the same for each of the c classes, the ln P(ωi) term can be dropped and the rule reduces to assigning x to the class whose mean μi is nearest (see the sketch below).
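A minimal NumPy sketch of this equal-priors case; the class means below are assumed for illustration:

```python
import numpy as np

# Assumed class means (equal priors, covariances sigma^2 * I)
means = np.array([[0.0, 0.0],
                  [3.0, 3.0],
                  [0.0, 4.0]])              # one row per class

def classify(x):
    # Minimum (Euclidean) distance classifier: nearest mean wins
    d2 = ((means - x) ** 2).sum(axis=1)     # squared distances ||x - mu_i||^2
    return int(np.argmin(d2))

print(classify(np.array([1.0, 0.5])))       # -> class 0, the nearest mean
```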
Practical Issues • We could design an optimal classifier if we knew: • P(ωi) (priors) • p(x/ωi) (class-conditional densities) • In practice, we rarely have this complete information! • Design the classifier from a set of training examples. • Estimating P(ωi) is usually easy. • Estimating p(x/ωi) is more difficult: • Number of samples is often too small • Dimensionality of feature space is large
Parameter Estimation • Assumptions • We are given a sample set D = {x1, x2, ...., xn}, where the samples were drawn according to p(x/ωj) • p(x/ωj) has a known parametric form, that is, it is determined by parameters θ, e.g., p(x/ωj) ~ N(μj, Σj) • Parameter estimation problem • Given D, find the best possible θ • This is a classical problem in statistics!
Main Methods in Parameter Estimation • Maximum Likelihood (ML) • It assumes that the values of the parameters are fixed but unknown. • The best estimate is obtained by maximizing the probability of obtaining the samples actually observed (i.e., the training data). • Bayesian Estimation • It assumes that the parameters are random variables having some known a-priori distribution. • Observing the samples converts this prior into a posterior density, which is then used to estimate the parameters.
Maximum Likelihood (ML) Estimation - Assumptions • Suppose the training data is divided into c sets (i.e., one for each class): D1, D2, ..., Dc • Assume that samples in Dj have been drawn independently according to p(x/ωj). • Assume that p(x/ωj) has known parametric form with parameters θj: e.g., θj = (μj, Σj) for Gaussian distributions or, in general, θj = (θ1, θ2, …, θp)t
ML Estimation - Problem Definition and Solution • Problem: given D1, D2, ..., Dc and a model for each class, estimate θ1, θ2, …, θc • If samples in Dj give no information about θi (i ≠ j), we need to solve c independent problems (i.e., one for each class) • The ML estimate for D = {x1, x2, .., xn} is the value that maximizes p(D/θ) (i.e., best supports the training data).
ML Parameter Estimation (cont’d) • How to find the maximum? Since the samples are drawn independently, p(D/θ) = p(x1/θ) p(x2/θ) ··· p(xn/θ) • Easier to consider the log-likelihood: ln p(D/θ) = ∑k ln p(xk/θ) • The solution maximizes p(D/θ) or, equivalently, ln p(D/θ); a necessary condition is that the gradient with respect to θ vanishes.
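A sketch of maximizing the log-likelihood numerically, assuming a 1-D N(μ, 1) model and synthetic data (all numbers illustrative):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.0, size=100)    # assumed training sample

def neg_log_likelihood(mu):
    # -ln p(D/mu) for an N(mu, 1) model; minimizing this maximizes ln p(D/mu)
    return -norm(mu, 1.0).logpdf(D).sum()

res = minimize_scalar(neg_log_likelihood)
print(res.x, D.mean())    # the numerical maximizer matches the sample mean
```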
ML for Gaussian Density: Case of Unknown θ=μ Consider ln p(xk/μ), where p(xk/μ) ~ N(μ, Σ) (the density evaluated by setting x = xk): ln p(xk/μ) = -(1/2) ln[(2π)d |Σ|] - (1/2)(xk - μ)t Σ^-1 (xk - μ) Computing the gradient with respect to μ, we have: ∇μ ln p(xk/μ) = Σ^-1 (xk - μ)
ML for Gaussian Density: Case of Unknown θ=μ (cont’d) • Setting ∇μ ln p(D/μ) = ∑k Σ^-1 (xk - μ̂) = 0, we have: ∑k (xk - μ̂) = 0 • The solution is given by: μ̂ = (1/n) ∑k xk • The ML estimate is simply the “sample mean”.
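A quick check of this closed-form result on assumed multivariate data (true mean chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Assumed 2-D Gaussian training data with true mean [1, -2]
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=np.eye(2), size=500)

mu_hat = X.mean(axis=0)    # ML estimate of mu: the sample mean
print(mu_hat)              # close to [1, -2]
```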