Learn about Bayesian decision theory, a statistical approach to pattern classification, where decisions are made optimally based on probabilistic information. Explore the use of Bayes' formula, error minimization, loss functions, and decision rules in the classification process.
Outline • Bayesian Decision Theory • Bayes' formula • Error • Bayes' Decision Rule • Loss function and Risk • Two-Category Classification • Classifiers, Discriminant Functions, and Decision Surfaces • Discriminant Functions for the Normal Density
Bayesian Decision Theory • Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. • Decision making when all the probabilistic information is known. • For given probabilities the decision is optimal. • When new information is added, it is assimilated in an optimal fashion to improve the decisions.
Bayesian Decision Theory cont. • Fish Example: • Each fish is in one of 2 states: sea bass or salmon • Let w denote the state of nature • w = w1 for sea bass • w = w2 for salmon
Bayesian Decision Theory cont. • The state of nature is unpredictable, so w is a variable that must be described probabilistically. • If the catch produced as much salmon as sea bass, the next fish is equally likely to be sea bass or salmon. • Define • P(w1): a priori probability that the next fish is sea bass • P(w2): a priori probability that the next fish is salmon.
Bayesian Decision Theory cont. • If other types of fish are irrelevant: P(w1) + P(w2) = 1. • Prior probabilities reflect our prior knowledge (e.g. time of year, fishing area, …) • Simple decision rule: • Make a decision without seeing the fish. • Decide w1 if P(w1) > P(w2); w2 otherwise. • OK if deciding for one fish. • If several fish must be classified, all are assigned to the same class.
Bayesian Decision Theory cont. • In general, we will have some features and more information. • Feature: lightness measurement = x • Different fish yield different lightness readings (x is a random variable)
Bayesian Decision Theory cont. • Define p(x|w1), the class-conditional probability density: the probability density function for x given that the state of nature is w1. • The difference between p(x|w1) and p(x|w2) describes the difference in lightness between sea bass and salmon.
Bayesian Decision Theory cont. • Hypothetical class-conditional probability density functions for the two categories (figure). • The density functions are normalized (the area under each curve is 1.0).
Bayesian Decision Theory cont. • Suppose that we know the prior probabilities P(w1) and P(w2) and the class-conditional densities p(x|w1) and p(x|w2), and we measure the lightness of a fish, x. • What is the category of the fish?
Bayes' formula P(wj|x) = p(x|wj) P(wj) / p(x), where p(x) = Σj p(x|wj) P(wj). In words: posterior = (likelihood × prior) / evidence.
Bayes' formula cont. • p(x|wj) is called the likelihood of wj with respect to x (the wj category for which p(x|wj) is large is more "likely" to be the true category). • p(x) is the evidence: how frequently we will measure a pattern with feature value x. It is a scale factor that guarantees that the posterior probabilities sum to 1.
Bayes' formula cont. Posterior probabilities for the particular priors P(w1)=2/3 and P(w2)=1/3 (figure). At every x the posteriors sum to 1.
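As a concrete illustration of Bayes' formula (not part of the original slides), the following minimal Python sketch reproduces the setup above with priors P(w1) = 2/3 and P(w2) = 1/3; the Gaussian class-conditional densities for the lightness feature and their parameters are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Univariate normal density; stands in for the hypothetical p(x|wj) curves.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

priors = np.array([2 / 3, 1 / 3])                            # P(w1), P(w2) as in the figure
means, sigmas = np.array([11.0, 13.0]), np.array([1.0, 1.5])  # made-up lightness models

def posteriors(x):
    likelihoods = gaussian_pdf(x, means, sigmas)   # p(x|wj)
    evidence = np.sum(likelihoods * priors)        # p(x) = sum_j p(x|wj) P(wj)
    return likelihoods * priors / evidence         # P(wj|x) by Bayes' formula

post = posteriors(12.0)
print(post, post.sum())                            # the posteriors sum to 1 at every x
print("decide w1" if post[0] > post[1] else "decide w2")
```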
Error For a given x, we can minimize the probability of error by deciding w1 if P(w1|x) > P(w2|x) and w2 otherwise.
Bayes' Decision Rule (minimizes the probability of error) • Decide w1 if P(w1|x) > P(w2|x); w2 otherwise. • Equivalently, decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2); w2 otherwise. • The probability of error is then P(error|x) = min[P(w1|x), P(w2|x)].
Bayesian Decision Theory: Continuous Features, General Case. Formalize the ideas just considered in 4 ways: • Allow more than one feature: replace the scalar x by a feature vector x in Rd; the d-dimensional Euclidean space Rd is called the feature space. • Allow more than 2 states of nature: generalize to several classes. • Allow actions other than merely deciding the state of nature: possibility of rejection, i.e., of refusing to make a decision in close cases. • Introduce a general loss function.
Loss function • The loss (or cost) function states exactly how costly each action is, and is used to convert a probability determination into a decision. Loss functions let us treat situations in which some kinds of classification mistakes are more costly than others.
Formulation • Let {w1, ... , wc} be the finite set of c states of nature ("categories"). • Let {α1, ... , αa} be the finite set of a possible actions. • The loss function λ(αi|wj) is the loss incurred for taking action αi when the state of nature is wj. • x = d-dimensional feature vector (random variable) • p(x|wj) = the state-conditional probability density function for x (the probability density function for x conditioned on wj being the true state of nature) • P(wj) = prior probability that nature is in state wj.
Expected Loss • Suppose that we observe a particular x and that we contemplate taking action αi. • If the true state of nature is wj, then the loss incurred is λ(αi|wj). • Before we have made an observation, the expected loss of action αi is Σj λ(αi|wj) P(wj).
Conditional Risk • After the observation, the expected loss, which is now called the conditional risk, is given by R(αi|x) = Σj λ(αi|wj) P(wj|x).
Total Risk • Objective: Select the action that minimizes the conditional risk. • A general decision rule is a function α(x) that specifies which action to take for every possible observation. • For every x, the decision function α(x) assumes one of the a values α1, ... , αa. • The "total risk" is R = ∫ R(α(x)|x) p(x) dx.
Bayes Decision Rule: • Compute the conditional risk R(αi|x) for i = 1, ... , a. • Select the action αi for which R(αi|x) is minimum. • The resulting minimum total risk is called the Bayes Risk, denoted R*, and is the best performance that can be achieved.
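A minimal Python sketch of this rule (illustrative loss matrix and posterior values, not from the original slides): the conditional risks are the loss matrix applied to the posterior vector, and the Bayes action is the one with minimum risk.

```python
import numpy as np

# lam[i, j] = lambda(alpha_i | w_j): loss for taking action alpha_i when the state is w_j.
# Illustrative values: calling a w2 pattern "w1" is assumed twice as costly as the reverse.
lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])

post = np.array([0.7, 0.3])        # P(w1|x), P(w2|x) for some observed x (made up)

cond_risk = lam @ post             # R(alpha_i|x) = sum_j lambda(alpha_i|w_j) P(w_j|x)
best = int(np.argmin(cond_risk))   # Bayes decision rule: minimize the conditional risk
print(cond_risk, "-> take action alpha_%d" % (best + 1))
```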
Two-Category Classification • Action α1 = deciding that the true state is w1 • Action α2 = deciding that the true state is w2 • Let λij = λ(αi|wj) be the loss incurred for deciding wi when the true state is wj. • Decide w1 if R(α1|x) < R(α2|x), or if (λ21 − λ11) P(w1|x) > (λ12 − λ22) P(w2|x), or if (λ21 − λ11) p(x|w1) P(w1) > (λ12 − λ22) p(x|w2) P(w2), and w2 otherwise.
Two-Category Likelihood Ratio Test • Under the reasonable assumption that λ21 > λ11 (why? an error should cost more than a correct decision), decide w1 if p(x|w1) / p(x|w2) > [(λ12 − λ22) / (λ21 − λ11)] · P(w2) / P(w1), and w2 otherwise. The ratio p(x|w1) / p(x|w2) is called the likelihood ratio. We decide w1 if the likelihood ratio exceeds a threshold value T that is independent of the observation x.
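The likelihood-ratio form translates directly into code; in this sketch the losses, priors, and Gaussian class-conditional densities are all illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Illustrative losses lambda_ij = lambda(alpha_i|w_j) and priors.
lam11, lam12, lam21, lam22 = 0.0, 2.0, 1.0, 0.0
P1, P2 = 2 / 3, 1 / 3

# The threshold T depends only on losses and priors, not on the observation x.
T = (lam12 - lam22) / (lam21 - lam11) * P2 / P1

x = 12.0
ratio = gaussian_pdf(x, 11.0, 1.0) / gaussian_pdf(x, 13.0, 1.5)   # p(x|w1) / p(x|w2)
print("decide w1" if ratio > T else "decide w2", "ratio =", ratio, "T =", T)
```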
Minimum-Error-Rate Classification • In classification problems, each state of nature is usually associated with one of c different classes. • Action αi = decision that the true state is wi. • If action αi is taken and the true state is wj, then the decision is correct if i = j, and in error otherwise. • The zero-one loss function is defined as λ(αi|wj) = 0 if i = j, and 1 if i ≠ j, for i, j = 1, …, c: all errors are equally costly.
Minimum-Error-Rate Classification cont. • The conditional risk is R(αi|x) = Σj λ(αi|wj) P(wj|x) = Σj≠i P(wj|x) = 1 − P(wi|x). • To minimize the average probability of error, we should select the i that maximizes the posterior probability P(wi|x): decide wi if P(wi|x) > P(wj|x) for all j ≠ i (same as Bayes' decision rule).
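A short numerical check (with an arbitrary posterior vector assumed for illustration) that the zero-one loss makes risk minimization identical to picking the largest posterior:

```python
import numpy as np

post = np.array([0.2, 0.5, 0.3])              # P(w_i|x) for c = 3 classes (made up)
zero_one = 1.0 - np.eye(3)                    # lambda(alpha_i|w_j) = 0 if i == j else 1

risk = zero_one @ post                        # conditional risks R(alpha_i|x)
assert np.allclose(risk, 1.0 - post)          # R(alpha_i|x) = 1 - P(w_i|x)
assert np.argmin(risk) == np.argmax(post)     # minimum risk <=> maximum posterior
print(risk, "-> decide w%d" % (np.argmax(post) + 1))
```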
Decision Regions • The likelihood ratio p(x|w1) / p(x|w2) plotted vs. x (figure). • The threshold θa corresponds to the zero-one loss function. • If we put λ12 > λ21 we shall get a larger threshold θb > θa.
Classifiers, Discriminant Functions, and Decision Surfaces: The Multi-Category Case • A pattern classifier can be represented by a set of discriminant functions gi(x), i = 1, ..., c. • The classifier assigns a feature vector x to class wi if gi(x) > gj(x) for all j ≠ i.
Statistical Pattern Classifier (figure)
The Bayes Classifier • A Bayes classifier can be represented in this way: • For the general case with risks: gi(x) = −R(αi|x). • For the minimum-error-rate case: gi(x) = P(wi|x). • If we replace every gi(x) by f(gi(x)), where f(·) is a monotonically increasing function, the resulting classification is unchanged; e.g. any of the following choices gives identical classification results: gi(x) = P(wi|x) = p(x|wi) P(wi) / Σj p(x|wj) P(wj), gi(x) = p(x|wi) P(wi), gi(x) = ln p(x|wi) + ln P(wi).
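As a sketch of how such a classifier can be implemented (with made-up Gaussian class models and priors, and a hand-rolled log-density helper rather than any particular library API), the log form gi(x) = ln p(x|wi) + ln P(wi) is convenient numerically:

```python
import numpy as np

def log_gaussian(x, mu, cov):
    """ln p(x|w_i) for a multivariate normal N(mu, cov)."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(cov) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(cov)))

# Illustrative 2-D, 3-class problem.
means  = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
covs   = [np.eye(2), np.eye(2), np.diag([2.0, 0.5])]
priors = [0.5, 0.3, 0.2]

def classify(x):
    # g_i(x) = ln p(x|w_i) + ln P(w_i); the argmax is the Bayes decision.
    g = [log_gaussian(x, m, c) + np.log(p) for m, c, p in zip(means, covs, priors)]
    return int(np.argmax(g))

print("x assigned to w%d" % (classify(np.array([2.0, 1.0])) + 1))
```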
The Bayes Classifier cont. • The effect of any decision rule is to divide the feature space into c decision regions, R1, ..., Rc. • If gi(x) > gj(x) for all j ≠ i, then x is in Ri, and x is assigned to wi. • Decision regions are separated by decision boundaries. • Decision boundaries are surfaces in the feature space.
The Decision Regions (figure): a two-dimensional, two-category classifier.
The Two-Category Case • Use 2 discriminant functions g1 and g2, and assign x to w1 if g1 > g2. • Alternative: define a single discriminant function g(x) = g1(x) − g2(x); decide w1 if g(x) > 0, otherwise decide w2. • In the two-category case two forms are frequently used: g(x) = P(w1|x) − P(w2|x), and g(x) = ln [p(x|w1) / p(x|w2)] + ln [P(w1) / P(w2)].
Normal Density - Univariate Case • The Gaussian density with mean μ and standard deviation σ (σ² is called the variance) is p(x) = (1 / (√(2π) σ)) exp[−½ ((x − μ) / σ)²]. • It can be shown that E[x] = μ and E[(x − μ)²] = σ².
Entropy • The entropy is given by H(p(x)) = −∫ p(x) ln p(x) dx and is measured in nats; if log2 is used instead, the unit is the bit. The entropy measures the fundamental uncertainty in the values of points selected randomly from a distribution. The normal distribution has the maximum entropy of all distributions having a given mean and variance. As stated by the Central Limit Theorem, the aggregate effect of the sum of a large number of small, i.i.d. random disturbances will lead to a Gaussian distribution. Because many patterns can be viewed as some ideal or prototype pattern corrupted by a large number of random processes, the Gaussian is often a good model for the actual probability distribution.
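As a quick numerical illustration (not in the original slides), the entropy of a univariate Gaussian is ½ ln(2πeσ²) nats, and a Monte Carlo estimate of −E[ln p(x)] recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 2.0
x = rng.normal(mu, sigma, size=200_000)        # samples from N(mu, sigma^2)

log_p = -0.5 * ((x - mu) / sigma) ** 2 - np.log(np.sqrt(2 * np.pi) * sigma)
h_mc = -log_p.mean()                           # Monte Carlo estimate of -E[ln p(x)]
h_exact = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)
print(h_mc, h_exact)                           # both about 2.11 nats
```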
Normal Density - Multivariate Case • The general multivariate normal density (MND) in d dimensions is written as p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp[−½ (x − μ)^T Σ^{-1} (x − μ)]. • It can be shown that E[x] = μ and E[(x − μ)(x − μ)^T] = Σ, which means for components: μi = E[xi] and σij = E[(xi − μi)(xj − μj)]. • The covariance matrix Σ is always symmetric and positive semidefinite.
Normal Density - Multivariate Case cont. • Diagonal elements σii are the variances of the xi, and the off-diagonal elements σij are the covariances of xi and xj. • If xi and xj are statistically independent, σij = 0. If all the off-diagonal elements are zero, then p(x) is a product of univariate normal densities. • Linear combinations of jointly normally distributed random variables are normally distributed: if p(x) ~ N(μ, Σ) and y = A^T x, where A is a d-by-k matrix, then p(y) ~ N(A^T μ, A^T Σ A). • If A is a vector a, then y = a^T x is a scalar and a^T Σ a is the variance of the projection of x onto a.
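A small sampling check of the linear-transformation property (the specific μ, Σ, and A below are arbitrary illustrative choices): y = A^T x should have mean A^T μ and covariance A^T Σ A.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [0.0, -1.0]])                    # d-by-k, here 3-by-2

x = rng.multivariate_normal(mu, Sigma, size=100_000)
y = x @ A                                      # each row is y = A^T x for one sample

print(y.mean(axis=0), A.T @ mu)                # sample mean vs. A^T mu
print(np.round(np.cov(y.T), 2))                # sample covariance of y
print(np.round(A.T @ Sigma @ A, 2))            # theoretical covariance A^T Sigma A
```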
Whitening transform • Define Φ to be the matrix whose columns are the orthonormal eigenvectors of Σ, and Λ the diagonal matrix of the corresponding eigenvalues. The transformation y = Aw^T x with Aw = Φ Λ^{-1/2} converts an arbitrary MND into a spherical one, i.e. one with covariance matrix I.
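A minimal NumPy sketch of the whitening matrix Aw = Φ Λ^{-1/2} (the covariance Σ below is an arbitrary example):

```python
import numpy as np

Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])                 # arbitrary covariance matrix

eigvals, Phi = np.linalg.eigh(Sigma)           # columns of Phi: orthonormal eigenvectors
A_w = Phi @ np.diag(eigvals ** -0.5)           # whitening matrix A_w = Phi Lambda^{-1/2}

print(np.round(A_w.T @ Sigma @ A_w, 6))        # ~ identity: whitened covariance is I
```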
Normal Density - Multivariate Case cont. • The multivariate normal density (MND) is completely specified by d + d(d+1)/2 parameters: the elements of μ and the independent elements of Σ. Samples drawn from an MND fall in a cluster whose center is determined by μ and whose shape is determined by Σ. The loci of points of constant density are hyperellipsoids (x − μ)^T Σ^{-1} (x − μ) = r². The quantity r is called the Mahalanobis distance from x to μ. The principal axes of the hyperellipsoid are given by the eigenvectors of Σ.
Normal Density - Multivariate Case cont. • The minimum-error-rate classification can be achieved using the discriminant functions gi(x) = ln p(x|wi) + ln P(wi). • If p(x|wi) ~ N(μi, Σi), then gi(x) = −½ (x − μi)^T Σi^{-1} (x − μi) − (d/2) ln 2π − ½ ln |Σi| + ln P(wi).
Discriminant Functions for the Normal Density Case 1: Σi = σ²I • The features are statistically independent, and each feature has the same variance σ². • The determinant is |Σi| = σ^(2d) and the inverse is Σi^{-1} = (1/σ²) I. • The term −(d/2) ln 2π − ½ ln |Σi| is independent of i and can be ignored.
Case 1 cont. • gi(x) = −||x − μi||² / (2σ²) + ln P(wi), where ||·|| denotes the Euclidean norm: ||x − μi||² = (x − μi)^T (x − μi). • Expanding the quadratic form, the term x^T x is independent of i, so gi(x) can be written as a linear discriminant function: gi(x) = wi^T x + wi0, where wi = μi / σ² and wi0 = −μi^T μi / (2σ²) + ln P(wi).
Case 1 cont. • wi0 is called the threshold or bias in the ith direction. • A classifier that uses linear discriminant functions is called a linear machine. • The decision surfaces of a linear machine are pieces of hyperplanes defined by the linear equations gi(x) = gj(x) for the 2 categories with the highest posterior probabilities. • For this particular case, setting gi(x) = gj(x) reduces to w^T (x − x0) = 0,
Case 1 cont. • where w = μi − μj and x0 = ½ (μi + μj) − [σ² / ||μi − μj||²] ln [P(wi) / P(wj)] (μi − μj). • The above equation defines a hyperplane through the point x0 and orthogonal to w (the line linking the means). • If P(wi) = P(wj), then x0 is halfway between the means.
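A small sketch computing w and x0 for this case; the means, variance, and priors are illustrative assumptions chosen to show the shift of x0 away from the more likely mean.

```python
import numpy as np

mu_i, mu_j = np.array([0.0, 0.0]), np.array([4.0, 0.0])
sigma2 = 1.0                                   # common variance sigma^2
P_i, P_j = 0.7, 0.3                            # unequal priors

diff = mu_i - mu_j
w = diff                                       # hyperplane normal (line linking the means)
x0 = 0.5 * (mu_i + mu_j) - sigma2 / (diff @ diff) * np.log(P_i / P_j) * diff

def decide(x):
    # w^T (x - x0) > 0  ->  decide w_i, otherwise w_j
    return "w_i" if w @ (x - x0) > 0 else "w_j"

print(x0)                                      # shifted away from the more likely mean mu_i
print(decide(np.array([2.0, 0.0])))
```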
Case 1 cont. • Decision boundaries for 1-D and 2-D examples (figures).
Case 1 cont. • If the covariances of 2 distributions are equal and proportional to the identity matrix, then the distributions are spherical in d dimensions, and the boundary is a generalized hyperplane of d−1 dimensions, perpendicular to the line linking the means. • If P(wi) is not equal to P(wj), the point x0 shifts away from the more likely mean.
Case 1 cont. • 1-D example (figure).
Minimum Distance Classifier • As the priors are changed, the decision boundary shifts. • If all prior probabilities are the same, the optimum decision rule becomes: • Measure the Euclidean distance ||x − μi|| from x to each of the c mean vectors. • Assign x to the class of the nearest mean.
Discriminant Functions for the Normal Density Case 2: Common Covariance Matrices • Case 2: Σi = Σ. • Covariance matrices for all of the classes are identical but otherwise arbitrary. • The term −(d/2) ln 2π − ½ ln |Σ| is independent of i and can be ignored.
Case 2 cont. • gi(x) = −½ (x − μi)^T Σ^{-1} (x − μi) + ln P(wi), or, if the priors are equal, gi(x) = −½ (x − μi)^T Σ^{-1} (x − μi). • If all prior probabilities are the same, the optimum decision rule becomes: • Measure the squared Mahalanobis distance (x − μi)^T Σ^{-1} (x − μi) from x to each of the c mean vectors. • Assign x to the class of the nearest mean.
Case 2 cont. • Expanding the quadratic form (x − μi)^T Σ^{-1} (x − μi) and dropping the term x^T Σ^{-1} x, which is independent of i, we obtain a linear classifier gi(x) = wi^T x + wi0, where wi = Σ^{-1} μi and wi0 = −½ μi^T Σ^{-1} μi + ln P(wi). • Decision boundaries are again hyperplanes, given by w^T (x − x0) = 0 with w = Σ^{-1} (μi − μj) and x0 = ½ (μi + μj) − [ln (P(wi)/P(wj)) / ((μi − μj)^T Σ^{-1} (μi − μj))] (μi − μj).
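A minimal sketch of the resulting linear classifier for the common-covariance case (the covariance, means, and priors below are illustrative assumptions):

```python
import numpy as np

Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])                       # covariance shared by all classes
means  = [np.array([0.0, 0.0]), np.array([3.0, 1.0]), np.array([1.0, 4.0])]
priors = [0.5, 0.3, 0.2]

Sigma_inv = np.linalg.inv(Sigma)
W  = [Sigma_inv @ m for m in means]                  # w_i = Sigma^{-1} mu_i
w0 = [-0.5 * m @ Sigma_inv @ m + np.log(p)           # w_i0 = -1/2 mu_i^T Sigma^{-1} mu_i + ln P(w_i)
      for m, p in zip(means, priors)]

def classify(x):
    g = [w @ x + b for w, b in zip(W, w0)]           # linear discriminants g_i(x)
    return int(np.argmax(g))

print("x assigned to w%d" % (classify(np.array([2.0, 2.0])) + 1))
```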