240-650 Principles of Pattern Recognition
Montri Karnjanadecha
montri@coe.psu.ac.th
http://fivedots.coe.psu.ac.th/~montri
Chapter 2: Bayesian Decision Theory
Statistical Approach to Pattern Recognition
A Simple Example
• Suppose that we are given two classes ω1 and ω2, with priors P(ω1) = 0.7 and P(ω2) = 0.3
• No measurement is given, so we can only guess
• What shall we do to recognize a given input?
• The best we can do statistically is to always decide ω1, the class with the larger prior: any other rule is wrong more often, and this one errs only 30% of the time
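A minimal sketch of this prior-only setting in Python (the priors come from the slide; the simulation itself is illustrative): always deciding ω1 is the best possible rule, and its error rate equals P(ω2) = 0.3.

```python
import numpy as np

rng = np.random.default_rng(0)
priors = np.array([0.7, 0.3])          # P(w1), P(w2) from the slide

# Draw true labels according to the priors; with no measurement available,
# the best rule is to always decide class 0 (w1), the class with the larger prior.
labels = rng.choice(2, size=100_000, p=priors)
decisions = np.zeros_like(labels)      # always decide w1

print(np.mean(decisions != labels))    # close to P(w2) = 0.3
```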
A More Complicated Example
• Suppose that we are given two classes
• A single measurement x
• P(ω1|x) and P(ω2|x) are given graphically
A Bayesian Example
• Suppose that we are given two classes
• A single measurement x
• We are given p(x|ω1) and p(x|ω2) this time
A Bayesian Example – cont.
[figure: the given densities and the resulting posteriors for this example]
Bayesian Decision Theory
• Bayes formula: P(ωj|x) = p(x|ωj) P(ωj) / p(x)
• In the case of two categories: p(x) = p(x|ω1) P(ω1) + p(x|ω2) P(ω2)
• In English, it can be expressed as: posterior = (likelihood × prior) / evidence
Bayesian Decision Theory – cont.
• Posterior probability: P(ωj|x) is the probability of the state of nature being ωj given that feature value x has been measured
• Likelihood: p(x|ωj) is the likelihood of ωj with respect to x
• Evidence: the factor p(x) can be viewed as a scaling factor that guarantees that the posterior probabilities sum to one
Bayesian Decision Theory – cont.
• Whenever we observe a particular x, the probability of error is P(error|x) = P(ω1|x) if we decide ω2, and P(ω2|x) if we decide ω1
• The average probability of error is given by P(error) = ∫ P(error|x) p(x) dx
Bayesian Decision Theory – cont.
• Bayes decision rule: decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2
• Probability of error: P(error|x) = min[P(ω1|x), P(ω2|x)]
• If we ignore the “evidence” p(x), the decision rule becomes: decide ω1 if p(x|ω1) P(ω1) > p(x|ω2) P(ω2); otherwise decide ω2
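A minimal sketch of this rule (the two class-conditional densities are hypothetical Gaussians chosen only for illustration; the priors are those of the earlier example):

```python
import numpy as np
from scipy.stats import norm

priors = np.array([0.7, 0.3])

def posteriors(x):
    # Hypothetical class-conditional densities p(x|w1), p(x|w2)
    likelihoods = np.array([norm.pdf(x, 0.0, 1.0),
                            norm.pdf(x, 2.0, 1.0)])
    joint = likelihoods * priors        # p(x|wj) P(wj)
    return joint / joint.sum()          # divide by the evidence p(x)

post = posteriors(1.2)
decision = np.argmax(post)              # Bayes decision rule
p_error = post.min()                    # P(error|x) = min posterior
print(decision, p_error)
```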
Bayesian Decision Theory – Continuous Features
• Feature space: in general, an input can be represented by a vector x, a point in a d-dimensional Euclidean space R^d
• Loss function: the loss function states exactly how costly each action is, and is used to convert a probability determination into a decision
• Written as λ(αi|ωj)
Loss Function
• λ(αi|ωj) describes the loss incurred for taking action αi when the state of nature is ωj
Conditional Risk
• Suppose we observe a particular x and take action αi
• If the true state of nature is ωj, by definition we incur the loss λ(αi|ωj)
• We can minimize our expected loss by selecting the action that minimizes the conditional risk, R(αi|x)
Bayesian Decision Theory
• Suppose that there are c categories {ω1, ω2, ..., ωc}
• Conditional risk: R(αi|x) = Σj λ(αi|ωj) P(ωj|x), summing over j = 1, ..., c
• Risk is the average expected loss: R = ∫ R(α(x)|x) p(x) dx
Bayesian Decision Theory
• Bayes decision rule: for a given x, select the action αi for which the conditional risk R(αi|x) is minimum
• The resulting minimum overall risk is called the Bayes risk, denoted R*, which is the best performance that can be achieved
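A sketch of the general rule with a hypothetical 2-class, 3-action loss table (the third action could be a reject option; all loss values are made up):

```python
import numpy as np

# Hypothetical loss table: rows = actions, columns = true states of nature.
# lam[i, j] = loss lambda(a_i | w_j).
lam = np.array([[0.0, 2.0],    # action: decide w1
                [1.0, 0.0],    # action: decide w2
                [0.3, 0.3]])   # action: reject (made-up costs)

def bayes_action(post):
    # Conditional risk R(a_i | x) = sum_j lam[i, j] * P(w_j | x)
    risks = lam @ post
    return np.argmin(risks), risks      # pick the minimum-risk action

action, risks = bayes_action(np.array([0.55, 0.45]))
print(action, risks)
```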
Two-Category Classification
• Let λij = λ(αi|ωj)
• Conditional risk: R(α1|x) = λ11 P(ω1|x) + λ12 P(ω2|x) and R(α2|x) = λ21 P(ω1|x) + λ22 P(ω2|x)
• Fundamental decision rule: decide ω1 if R(α1|x) < R(α2|x)
Two-Category Classification – cont.
• The decision rule can be written in several equivalent ways. Decide ω1 if any one of the following holds:
(λ21 − λ11) P(ω1|x) > (λ12 − λ22) P(ω2|x)
(λ21 − λ11) p(x|ω1) P(ω1) > (λ12 − λ22) p(x|ω2) P(ω2)
p(x|ω1) / p(x|ω2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]   (assuming λ21 > λ11)
• The quantity p(x|ω1)/p(x|ω2) in the last form is called the likelihood ratio
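A sketch of the likelihood-ratio form (losses, priors, and densities are the hypothetical values used above):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical losses and priors, for illustration only.
l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0
P1, P2 = 0.7, 0.3

def decide(x):
    ratio = norm.pdf(x, 0.0, 1.0) / norm.pdf(x, 2.0, 1.0)   # p(x|w1) / p(x|w2)
    threshold = (l12 - l22) / (l21 - l11) * (P2 / P1)
    return 1 if ratio > threshold else 2                     # decide w1 or w2

print(decide(0.5), decide(1.8))
```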
Minimum-Error-Rate Classification
• A special case of the Bayes decision rule with the following zero-one loss function: λ(αi|ωj) = 0 if i = j, and 1 if i ≠ j
• Assigns no loss to a correct decision
• Assigns unit loss to any error
• All errors are equally costly
Minimum-Error-Rate Classification
• Conditional risk: R(αi|x) = Σ_{j≠i} P(ωj|x) = 1 − P(ωi|x)
Minimum-Error-Rate Classification
• We should select the i that maximizes the posterior probability P(ωi|x)
• For minimum error rate: decide ωi if P(ωi|x) > P(ωj|x) for all j ≠ i
Minimum-Error-Rate Classification
[figure: the likelihood ratio p(x|ω1)/p(x|ω2) and the decision threshold]
Classifiers, Discriminant Functions, and Decision Surfaces
• There are many ways to represent pattern classifiers
• One of the most useful is in terms of a set of discriminant functions gi(x), i = 1, ..., c
• The classifier assigns a feature vector x to class ωi if gi(x) > gj(x) for all j ≠ i
The Multicategory Classifier
[figure: a network that computes the c discriminant functions gi(x) and assigns the input to the category with the largest discriminant]
Classifiers, Discriminant Functions, and Decision Surfaces
• There are many equivalent discriminant functions, i.e., the classification results will be the same even though the functions differ
• For example, if f is a monotonically increasing function, then replacing each gi(x) by f(gi(x)) leaves the classification unchanged
Classifiers, Discriminant Functions, and Decision Surfaces
• Some discriminant functions are easier to understand or to compute than others
• For minimum-error-rate classification, equivalent choices include gi(x) = P(ωi|x), gi(x) = p(x|ωi) P(ωi), and gi(x) = ln p(x|ωi) + ln P(ωi), as the sketch below illustrates
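A sketch checking that equivalence numerically (hypothetical Gaussian likelihoods; since ln is monotonically increasing, both discriminants always select the same class):

```python
import numpy as np
from scipy.stats import norm

priors = np.array([0.7, 0.3])
locs = np.array([0.0, 2.0])         # hypothetical class-conditional means

def g_product(x):
    return norm.pdf(x, locs, 1.0) * priors                # p(x|wi) P(wi)

def g_log(x):
    return norm.logpdf(x, locs, 1.0) + np.log(priors)     # ln p(x|wi) + ln P(wi)

for x in (-1.0, 1.2, 3.0):
    assert np.argmax(g_product(x)) == np.argmax(g_log(x))
print("the two discriminant sets always agree")
```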
Decision Regions
• The effect of any decision rule is to divide the feature space into c decision regions R1, ..., Rc
• The regions are separated by decision boundaries, surfaces in feature space where ties occur among the largest discriminant functions
Decision Regions – cont.
[figure: example decision regions and the boundaries that separate them]
Two-Category Case (Dichotomizer)
• The two-category case is a special case
• Instead of two discriminant functions, a single one can be used: g(x) = g1(x) − g2(x)
• Decide ω1 if g(x) > 0; otherwise decide ω2
The Normal Density
• Univariate Gaussian density: p(x) = [1 / (√(2π) σ)] exp[−(x − μ)^2 / (2σ^2)]
• Mean: μ = E[x] = ∫ x p(x) dx
• Variance: σ^2 = E[(x − μ)^2] = ∫ (x − μ)^2 p(x) dx
The Normal Density
[figure: the univariate normal density]
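A sketch evaluating the density directly from the formula above:

```python
import numpy as np

def univariate_normal(x, mu, sigma):
    # p(x) = 1/(sqrt(2 pi) sigma) * exp(-(x - mu)^2 / (2 sigma^2))
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(-4.0, 4.0, 9)
p = univariate_normal(x, mu=0.0, sigma=1.0)
print(p.max())    # peak value 1/sqrt(2 pi), about 0.3989, at x = mu
```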
The Normal Density
• Central Limit Theorem: the aggregate effect of the sum of a large number of small, independent random disturbances leads to a Gaussian distribution
• The Gaussian is therefore often a good model for the actual probability distribution, as the demonstration below suggests
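A quick numerical illustration of that claim (the uniform disturbances and the sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Each sample is the sum of 30 small, independent uniform disturbances.
sums = rng.uniform(-1.0, 1.0, size=(100_000, 30)).sum(axis=1)

# Standardized sums should look approximately N(0, 1).
z = (sums - sums.mean()) / sums.std()
print(np.mean(np.abs(z) <= 2.0))   # close to 0.954, as for a Gaussian
```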
The Multivariate Normal Density
• Multivariate density (in d dimensions): p(x) = [1 / ((2π)^(d/2) |Σ|^(1/2))] exp[−(1/2)(x − μ)^T Σ^−1 (x − μ)]
• Abbreviation: p(x) ~ N(μ, Σ)
The Multivariate Normal Density
• Mean: μ = E[x]
• Covariance matrix: Σ = E[(x − μ)(x − μ)^T]
• The ijth component of Σ is σij = E[(xi − μi)(xj − μj)]
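A sketch of the multivariate density written straight from the formula (the 2-D μ and Σ are made-up values):

```python
import numpy as np

def multivariate_normal_pdf(x, mu, Sigma):
    d = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)     # (x-mu)^T Sigma^{-1} (x-mu)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 1.0])                          # made-up mean
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])         # made-up covariance
print(multivariate_normal_pdf(np.array([0.5, 0.5]), mu, Sigma))
```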
Statistical Independence
• If xi and xj are statistically independent, then σij = 0
• The covariance matrix then becomes a diagonal matrix, with all off-diagonal elements equal to zero
Whitening Transform
• Aw = Φ Λ^−1/2, where Φ is the matrix whose columns are the orthonormal eigenvectors of Σ, and Λ is the diagonal matrix of the corresponding eigenvalues
• Applying Aw to the data yields a distribution whose covariance matrix is the identity
Whitening Transform
[figure: the whitening transform applied to a Gaussian distribution]
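A sketch of the transform (made-up covariance; `numpy.linalg.eigh` is the appropriate eigensolver here because Σ is symmetric):

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])    # made-up covariance
x = rng.multivariate_normal([0.0, 0.0], Sigma, size=10_000)

# A_w = Phi Lambda^{-1/2}: Phi holds the orthonormal eigenvectors of Sigma,
# Lambda the corresponding eigenvalues.
eigvals, Phi = np.linalg.eigh(Sigma)
A_w = Phi @ np.diag(eigvals ** -0.5)

y = x @ A_w                                   # whitened samples
print(np.cov(y.T))                            # approximately the identity matrix
```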
Squared Mahalanobis Distance
• The squared Mahalanobis distance from x to μ is r^2 = (x − μ)^T Σ^−1 (x − μ)
• Contours of constant density are hyperellipsoids of constant Mahalanobis distance from the mean
• The principal axes of the hyperellipsoids are given by the eigenvectors of Σ; the lengths of the axes are determined by the eigenvalues of Σ
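A sketch of the squared distance (same made-up μ and Σ as before; `solve` avoids forming Σ^−1 explicitly):

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    # r^2 = (x - mu)^T Sigma^{-1} (x - mu)
    diff = x - mu
    return diff @ np.linalg.solve(Sigma, diff)

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
print(mahalanobis_sq(np.array([1.0, 2.0]), mu, Sigma))
```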
Discriminant Functions for the Normal Density
• For minimum-error-rate classification we can use gi(x) = ln p(x|ωi) + ln P(ωi)
• If the densities are multivariate normal, i.e., if p(x|ωi) ~ N(μi, Σi), then we have:
gi(x) = −(1/2)(x − μi)^T Σi^−1 (x − μi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)
Discriminant Functions for the Normal Density
• Case 1: Σi = σ^2 I
• Features are statistically independent and each feature has the same variance σ^2
• gi(x) = −||x − μi||^2 / (2σ^2) + ln P(ωi), where ||·|| denotes the Euclidean norm
Case 1: Σi = σ^2 I
[figure: spherical Gaussian clusters and the resulting linear decision boundary]
Linear Discriminant Function
• It is not necessary to compute the distances explicitly
• Expanding the quadratic form yields gi(x) = −(x^T x − 2 μi^T x + μi^T μi) / (2σ^2) + ln P(ωi)
• The term x^T x is the same for all i and can be dropped
• We then have the following linear discriminant function
Linear Discriminant Function
• gi(x) = wi^T x + wi0
where wi = μi / σ^2 and wi0 = −μi^T μi / (2σ^2) + ln P(ωi)
• wi0 is the threshold or bias for the ith category
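A sketch of the resulting linear machine for case 1 (the means, shared variance, and priors are made-up values):

```python
import numpy as np

sigma2 = 1.5                                   # shared variance (made up)
mus = np.array([[0.0, 0.0], [2.0, 1.0]])       # class means (made up)
priors = np.array([0.7, 0.3])

# w_i = mu_i / sigma^2,  w_i0 = -mu_i^T mu_i / (2 sigma^2) + ln P(w_i)
W = mus / sigma2
w0 = -np.sum(mus * mus, axis=1) / (2 * sigma2) + np.log(priors)

def classify(x):
    return np.argmax(W @ x + w0)               # largest g_i(x) = w_i^T x + w_i0

print(classify(np.array([1.5, 0.5])))
```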
Linear Machine
• A classifier that uses linear discriminant functions is called a linear machine
• Its decision surfaces are pieces of hyperplanes defined by the linear equations gi(x) = gj(x) for the two categories with the highest posterior probabilities
• For our case, this equation can be written as w^T (x − x0) = 0
Linear Machine
• where w = μi − μj
and x0 = (1/2)(μi + μj) − [σ^2 / ||μi − μj||^2] ln[P(ωi)/P(ωj)] (μi − μj)
• If P(ωi) = P(ωj), then the second term vanishes and the hyperplane passes through the midpoint of the two means
• With equal priors the rule simply assigns x to the category with the nearest mean; it is called a minimum-distance classifier
Priors change → decision boundaries shift
[figures: as the priors change, the decision boundary moves away from the mean of the more probable class]
Case 2: Σi = Σ
• Covariance matrices for all of the classes are identical but otherwise arbitrary
• The cluster for the ith class is centered about μi
• Discriminant function: gi(x) = −(1/2)(x − μi)^T Σ^−1 (x − μi) + ln P(ωi)
• The ln P(ωi) term can be ignored if the prior probabilities are the same for all classes
Case 2: Discriminant Function
• Expanding the quadratic form and dropping x^T Σ^−1 x, which is the same for all i, gives gi(x) = wi^T x + wi0
where wi = Σ^−1 μi and wi0 = −(1/2) μi^T Σ^−1 μi + ln P(ωi)
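A sketch of the case 2 discriminant (shared made-up Σ; `solve` computes Σ^−1 μi without forming an explicit inverse):

```python
import numpy as np

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])     # shared covariance (made up)
mus = np.array([[0.0, 0.0], [2.0, 1.0]])       # class means (made up)
priors = np.array([0.7, 0.3])

# w_i = Sigma^{-1} mu_i,  w_i0 = -1/2 mu_i^T Sigma^{-1} mu_i + ln P(w_i)
W = np.linalg.solve(Sigma, mus.T).T
w0 = -0.5 * np.sum(W * mus, axis=1) + np.log(priors)

def classify(x):
    return np.argmax(W @ x + w0)               # largest g_i(x) = w_i^T x + w_i0

print(classify(np.array([1.0, 0.5])))
```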