Outline
• Bayesian Decision Theory
• Bayes' formula
• Error
• Bayes' Decision Rule
• Loss function and Risk
• Two-Category Classification
• Minimax Criterion
• Classifiers, Discriminant Functions, and Decision Surfaces
• Discriminant Functions for the Normal Density
Bayesian Decision Theory
• Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification.
• It applies to decision making when all of the relevant probabilistic information is known.
• For the given probabilities, the resulting decision is optimal.
• When new information is added, it is assimilated in an optimal fashion to improve the decision.
Bayesian Decision Theory cont.
• Fish example: each fish is in one of 2 states: sea bass or salmon.
• Let w denote the state of nature:
• w = w1 for sea bass
• w = w2 for salmon
Bayesian Decision Theory cont.
• The state of nature is unpredictable, so w is a variable that must be described probabilistically.
• If the catch produced as much salmon as sea bass, the next fish is equally likely to be sea bass or salmon.
• Define
• P(w1): a priori probability that the next fish is sea bass
• P(w2): a priori probability that the next fish is salmon
Bayesian Decision Theory cont.
• If other types of fish are irrelevant: P(w1) + P(w2) = 1.
• Prior probabilities reflect our prior knowledge (e.g. time of year, fishing area, ...).
• Simple decision rule: make a decision without seeing the fish.
• Decide w1 if P(w1) > P(w2); otherwise decide w2.
• This is reasonable if we decide for only one fish; if there are several fish, they are all assigned to the same class.
Bayesian Decision Theory cont.
• In general, we will have some features and more information.
• Feature: lightness measurement x.
• Different fish yield different lightness readings (x is a random variable).
Bayesian Decision Theory cont.
• Define p(x|w1), the class-conditional probability density: the probability density function for x given that the state of nature is w1.
• The difference between p(x|w1) and p(x|w2) describes the difference in lightness between sea bass and salmon.
Bayesian Decision Theory cont.
Figure: hypothetical class-conditional probability density functions p(x|w1) and p(x|w2). The density functions are normalized (the area under each curve is 1.0).
Bayesian Decision Theory cont.
• Suppose that we know the prior probabilities P(w1) and P(w2) and the conditional densities p(x|w1) and p(x|w2), and we measure the lightness of a fish: x.
• What is the category of the fish?
Bayes' formula
P(wj|x) = p(x|wj) P(wj) / p(x), where
p(x) = Σj p(x|wj) P(wj)
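A minimal numerical sketch of Bayes' formula for the two-category fish example; the Gaussian class-conditional densities, their parameters, and the priors 2/3 and 1/3 are illustrative assumptions, not values from the lecture:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([2/3, 1/3])                    # P(w1), P(w2), assumed
densities = [norm(loc=10, scale=2),              # assumed p(x|w1): sea bass lightness
             norm(loc=7, scale=1.5)]             # assumed p(x|w2): salmon lightness

def posteriors(x):
    """Return [P(w1|x), P(w2|x)] via Bayes' formula."""
    likelihoods = np.array([d.pdf(x) for d in densities])   # p(x|wj)
    evidence = np.sum(likelihoods * priors)                  # p(x) = sum_j p(x|wj) P(wj)
    return likelihoods * priors / evidence

print(posteriors(8.5))   # the two posteriors sum to 1 for every x
```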
Bayes' formula cont.
• p(x|wj) is called the likelihood of wj with respect to x: the wj category for which p(x|wj) is large is more "likely" to be the true category.
• p(x) is the evidence: how frequently we will measure a pattern with feature value x. It is a scale factor that guarantees that the posterior probabilities sum to 1.
Bayes' formula cont.
Figure: posterior probabilities P(w1|x) and P(w2|x) for the particular priors P(w1) = 2/3 and P(w2) = 1/3. At every x the posteriors sum to 1.
Error
For a given x, we can minimize the probability of error by deciding w1 if P(w1|x) > P(w2|x) and w2 otherwise.
Bayes' Decision Rule (minimizes the probability of error)
Decide w1 if P(w1|x) > P(w2|x); decide w2 otherwise,
or equivalently
decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2); decide w2 otherwise,
and P(error|x) = min[P(w1|x), P(w2|x)].
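A short self-contained sketch of this rule, reusing the same illustrative priors and Gaussian densities assumed above:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([2/3, 1/3])                    # P(w1), P(w2), assumed
densities = [norm(10, 2), norm(7, 1.5)]          # assumed p(x|w1), p(x|w2)

def bayes_decide(x):
    """Decide w1 or w2 by the larger posterior; also report P(error|x)."""
    post = np.array([d.pdf(x) for d in densities]) * priors
    post /= post.sum()                            # posteriors P(wj|x)
    decision = 1 if post[0] > post[1] else 2
    return decision, post.min()                   # P(error|x) = min posterior

print(bayes_decide(8.5))
```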
Bayesian Decision Theory: Continuous Features - General Case
Formalize the ideas just considered in 4 ways:
• Allow more than one feature: replace the scalar x by the feature vector x in R^d; the d-dimensional Euclidean space R^d is called the feature space.
• Allow more than 2 states of nature: generalize to several classes.
• Allow actions other than merely deciding the state of nature: e.g. the possibility of rejection, i.e., of refusing to make a decision in close cases.
• Introduce a general loss function.
Loss function
• The loss (or cost) function states exactly how costly each action is, and is used to convert a probability determination into a decision.
• Loss functions let us treat situations in which some kinds of classification mistakes are more costly than others.
Formulation
• Let {w1, ..., wc} be the finite set of c states of nature ("categories").
• Let {α1, ..., αa} be the finite set of a possible actions.
• The loss function λ(αi|wj) = loss incurred for taking action αi when the state of nature is wj.
• x = d-dimensional feature vector (random variable).
• p(x|wj) = the state-conditional probability density function for x (the probability density function for x conditioned on wj being the true state of nature).
• P(wj) = prior probability that nature is in state wj.
Expected Loss
• Suppose that we observe a particular x and that we contemplate taking action αi.
• If the true state of nature is wj, then the loss incurred is λ(αi|wj).
• Before we have made an observation, the expected loss of taking action αi is
E[λ(αi|wj)] = Σj λ(αi|wj) P(wj)
Conditional Risk
After the observation, the expected loss, which is now called the "conditional risk", is given by
R(αi|x) = Σj λ(αi|wj) P(wj|x)
Total Risk
• Objective: select the action that minimizes the conditional risk.
• A general decision rule is a function α(x) that maps the feature space into the set of actions.
• For every x, the decision function α(x) assumes one of the a values α1, ..., αa.
• The "total risk" is
R = ∫ R(α(x)|x) p(x) dx
Bayes Decision Rule:
• Compute the conditional risk R(αi|x) for i = 1, ..., a.
• Select the action αi for which R(αi|x) is minimum.
• The resulting minimum total risk is called the Bayes risk, denoted R*, and is the best performance that can be achieved.
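A small self-contained sketch of the general rule with an explicit loss matrix; the 2x2 loss values, the Gaussian densities, and the priors are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([2/3, 1/3])                 # P(w1), P(w2), assumed
densities = [norm(10, 2), norm(7, 1.5)]       # assumed p(x|w1), p(x|w2)
lam = np.array([[0.0, 2.0],                   # lam[i, j] = loss for action a_i
                [1.0, 0.0]])                  # when the true state is w_j (assumed)

def bayes_action(x):
    """Pick the action minimizing R(a_i|x) = sum_j lam[i, j] P(w_j|x)."""
    post = np.array([d.pdf(x) for d in densities]) * priors
    post /= post.sum()                        # posteriors P(w_j|x)
    risks = lam @ post                        # conditional risk of each action
    return int(np.argmin(risks)), risks

print(bayes_action(8.5))
```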
Two-Category Classification
• Action α1 = deciding that the true state is w1.
• Action α2 = deciding that the true state is w2.
• Let λij = λ(αi|wj) be the loss incurred for deciding wi when the true state is wj.
• Decide w1 if R(α1|x) < R(α2|x),
or if (λ21 - λ11) P(w1|x) > (λ12 - λ22) P(w2|x),
or if (λ21 - λ11) p(x|w1) P(w1) > (λ12 - λ22) p(x|w2) P(w2),
and w2 otherwise.
Two-Category Likelihood Ratio Test
• Under the reasonable assumption that λ21 > λ11 (why?), decide w1 if
p(x|w1) / p(x|w2) > [(λ12 - λ22) / (λ21 - λ11)] [P(w2) / P(w1)],
and w2 otherwise.
• The ratio p(x|w1) / p(x|w2) is called the likelihood ratio. We decide w1 if the likelihood ratio exceeds a threshold value T that is independent of the observation x.
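The same test as a self-contained sketch; the loss values, priors, and densities are again the illustrative assumptions used above:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([2/3, 1/3])                 # P(w1), P(w2), assumed
densities = [norm(10, 2), norm(7, 1.5)]       # assumed p(x|w1), p(x|w2)
lam = np.array([[0.0, 2.0],                   # lam[i, j] = lambda(a_i | w_j), assumed
                [1.0, 0.0]])

# Threshold T = (lam12 - lam22)/(lam21 - lam11) * P(w2)/P(w1), independent of x
T = (lam[0, 1] - lam[1, 1]) / (lam[1, 0] - lam[0, 0]) * priors[1] / priors[0]

def likelihood_ratio_decide(x):
    """Decide w1 when the likelihood ratio p(x|w1)/p(x|w2) exceeds T."""
    ratio = densities[0].pdf(x) / densities[1].pdf(x)
    return 1 if ratio > T else 2

print(T, likelihood_ratio_decide(8.5))
```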
Zero-One Loss function
• In classification problems, each state is usually associated with one of c different classes.
• Action αi = the decision that the true state is wi.
• If action αi is taken and the true state is wj, then the decision is correct if i = j, and in error otherwise.
• The zero-one loss function is defined as
λ(αi|wj) = 0 if i = j, and 1 if i ≠ j, for i, j = 1, ..., c
so all errors are equally costly.
Zero-One Loss function cont.
• The conditional risk is
R(αi|x) = Σ(j≠i) P(wj|x) = 1 - P(wi|x)
• To minimize the average probability of error, we should select the i that maximizes the posterior probability P(wi|x):
decide wi if P(wi|x) > P(wj|x) for all j ≠ i
(the same as Bayes' decision rule).
Decision Regions
Figure: the likelihood ratio p(x|w1)/p(x|w2) vs. x, with the decision threshold marked. The threshold θa corresponds to the zero-one loss function; if we put λ12 > λ21 we obtain θb > θa.
Minimax Criterion
• Design a classifier which performs well over a range of prior probabilities.
• Let R1 be the region of feature space where the classifier decides w1, and likewise R2 for w2. The overall risk is
R = ∫R1 [λ11 P(w1) p(x|w1) + λ12 P(w2) p(x|w2)] dx + ∫R2 [λ21 P(w1) p(x|w1) + λ22 P(w2) p(x|w2)] dx
• Using P(w2) = 1 - P(w1) and ∫R1 p(x|w1) dx = 1 - ∫R2 p(x|w1) dx, we rewrite the risk as
R(P(w1)) = λ22 + (λ12 - λ22) ∫R1 p(x|w2) dx + P(w1) [ (λ11 - λ22) + (λ21 - λ11) ∫R2 p(x|w1) dx - (λ12 - λ22) ∫R1 p(x|w2) dx ]
Minimax Criterion
• This shows that once R1 and R2 are fixed, the risk is linear in P(w1). If we can find a boundary for which the coefficient inside the brackets is zero, then the risk is independent of the priors.
Minimax Criterion
• For the minimax solution, the boundary is chosen so that
(λ11 - λ22) + (λ21 - λ11) ∫R2 p(x|w1) dx - (λ12 - λ22) ∫R1 p(x|w2) dx = 0
Minimax Criterion
• Thus the minimax risk is
Rmm = λ22 + (λ12 - λ22) ∫R1 p(x|w2) dx = λ11 + (λ21 - λ11) ∫R2 p(x|w1) dx
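A numerical sketch of the minimax boundary for the zero-one loss case (λ11 = λ22 = 0, λ12 = λ21 = 1), where the condition above reduces to equal conditional errors, ∫R2 p(x|w1) dx = ∫R1 p(x|w2) dx. The single-threshold rule, the Gaussian densities, and the search interval are illustrative assumptions:

```python
from scipy.optimize import brentq
from scipy.stats import norm

p1, p2 = norm(10, 2), norm(7, 1.5)      # assumed p(x|w1), p(x|w2)

# Single-threshold rule: decide w2 for x < x*, w1 for x > x*
# (sensible here because w1 has the larger mean).
def error_gap(x_star):
    # mass of p(x|w1) in R2 minus mass of p(x|w2) in R1
    return p1.cdf(x_star) - p2.sf(x_star)

x_star = brentq(error_gap, 5.0, 12.0)   # boundary where both conditional errors match
print(x_star, p1.cdf(x_star))            # minimax boundary and the minimax risk Rmm
```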
Neyman-Pearson Criterion
• If we want to minimize the total risk subject to a constraint of the form ∫ R(αi|x) dx < constant for some particular i, the resulting criterion is called the Neyman-Pearson criterion.
ERROR PROBABILITIES AND INTEGRALS
• Consider the 2-class problem and suppose that the feature space is divided into 2 regions R1 and R2. There are 2 ways in which a classification error can occur:
• an observation x falls in R2 and the true state is w1, or
• an observation x falls in R1 and the true state is w2.
ERROR PROBABILITIES AND INTEGRALS cont.
P(error) = P(x in R2, w1) + P(x in R1, w2) = ∫R2 p(x|w1) P(w1) dx + ∫R1 p(x|w2) P(w2) dx
Figure: the two error integrals for a one-dimensional problem with an arbitrary decision boundary x*.
ERROR PROBABILITIES AND INTEGRALS cont.
• Because x* is chosen arbitrarily, the probability of error is not as small as it might be. xB, the Bayes optimal decision boundary, gives the lowest probability of error.
• In the multi-category case, there are more ways to be wrong than to be right, and it is simpler to compute the probability of being correct:
P(correct) = Σi ∫Ri p(x|wi) P(wi) dx
• This result depends neither on how the feature space is partitioned nor on the form of the underlying distribution. The Bayes classifier maximizes this probability, and no other partitioning can yield a smaller probability of error.
Classifiers, Discriminant Functions, and Decision Surfaces: The Multi-Category Case
• A pattern classifier can be represented by a set of discriminant functions gi(x), i = 1, ..., c.
• The classifier assigns a feature vector x to class wi if gi(x) > gj(x) for all j ≠ i.
Statistical Pattern Classifier
Figure: a statistical pattern classifier as a network that computes the c discriminant functions gi(x) and selects the category corresponding to the largest discriminant.
The Bayes Classifier
• A Bayes classifier can be represented in this way.
• For the general case with risks: gi(x) = -R(αi|x).
• For the minimum error-rate case: gi(x) = P(wi|x).
• If we replace every gi(x) by f(gi(x)), where f(.) is a monotonically increasing function, the resulting classification is unchanged, e.g. any of the following choices gives identical classification results:
gi(x) = P(wi|x) = p(x|wi) P(wi) / Σj p(x|wj) P(wj)
gi(x) = p(x|wi) P(wi)
gi(x) = ln p(x|wi) + ln P(wi)
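A quick self-contained check that these three discriminant choices produce the same classification; the densities and priors are again the illustrative assumptions from the earlier sketches:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([2/3, 1/3])                 # P(w1), P(w2), assumed
densities = [norm(10, 2), norm(7, 1.5)]       # assumed p(x|w1), p(x|w2)

def g_posterior(x):      # g_i(x) = P(w_i|x)
    g = np.array([d.pdf(x) for d in densities]) * priors
    return g / g.sum()

def g_unnormalized(x):   # g_i(x) = p(x|w_i) P(w_i)
    return np.array([d.pdf(x) for d in densities]) * priors

def g_log(x):            # g_i(x) = ln p(x|w_i) + ln P(w_i)
    return np.array([d.logpdf(x) for d in densities]) + np.log(priors)

x = 8.5
print(np.argmax(g_posterior(x)), np.argmax(g_unnormalized(x)), np.argmax(g_log(x)))
# all three discriminant choices pick the same category
```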
The Bayes Classifier cont.
• The effect of any decision rule is to divide the feature space into c decision regions, R1, ..., Rc.
• If gi(x) > gj(x) for all j ≠ i, then x is in Ri, and x is assigned to wi.
• Decision regions are separated by decision boundaries.
• Decision boundaries are surfaces in the feature space.
The Decision Regions
Figure: decision regions of a two-dimensional, two-category classifier.
The Two-Category Case
• Use 2 discriminant functions g1 and g2, and assign x to w1 if g1 > g2.
• Alternative: define a single discriminant function g(x) = g1(x) - g2(x); decide w1 if g(x) > 0, otherwise decide w2.
• For the two-category case, convenient choices are
g(x) = P(w1|x) - P(w2|x)
g(x) = ln [p(x|w1) / p(x|w2)] + ln [P(w1) / P(w2)]
Normal Density - Univariate Case
• Gaussian density with mean μ and standard deviation σ:
p(x) = (1 / (√(2π) σ)) exp[ -(1/2) ((x - μ) / σ)^2 ]
• It can be shown that:
E[x] = ∫ x p(x) dx = μ and E[(x - μ)^2] = ∫ (x - μ)^2 p(x) dx = σ^2
Entropy
• The entropy is given by
H(p(x)) = -∫ p(x) ln p(x) dx
and is measured in nats; if log2 is used instead, the unit is the bit. The entropy measures the fundamental uncertainty in the values of points selected randomly from a distribution.
• The normal distribution has the maximum entropy of all distributions having a given mean and variance.
• As stated by the Central Limit Theorem, the aggregate effect of the sum of a large number of small, i.i.d. random disturbances leads to a Gaussian distribution. Because many patterns can be viewed as some ideal or prototype pattern corrupted by a large number of random processes, the Gaussian is often a good model for the actual probability distribution.
Normal Density - Multivariate Case
• The general multivariate normal density (MND) in d dimensions is written as
p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp[ -(1/2) (x - μ)^t Σ^(-1) (x - μ) ]
• It can be shown that E[x] = μ and E[(x - μ)(x - μ)^t] = Σ, which means for the components
μi = E[xi] and σij = E[(xi - μi)(xj - μj)]
• The covariance matrix Σ is always symmetric and positive semidefinite.
Normal Density - Multivariate Case cont.
• The diagonal elements σii are the variances of the xi, and the off-diagonal elements σij are the covariances of xi and xj.
• If xi and xj are statistically independent, then σij = 0. If all σij = 0 for i ≠ j, then p(x) is a product of univariate normal densities.
• Linear combinations of jointly normally distributed random variables are normally distributed: if p(x) ~ N(μ, Σ) and y = A^t x, where A is a d-by-k matrix, then p(y) ~ N(A^t μ, A^t Σ A).
• If A is a vector a, then y = a^t x is a scalar, and a^t Σ a is the variance of the projection of x onto a.
Whitening transform
• Define Φ to be the matrix whose columns are the orthonormal eigenvectors of Σ, and Λ the diagonal matrix of the corresponding eigenvalues. The whitening transformation y = Aw^t x with Aw = Φ Λ^(-1/2) converts an arbitrary MND into a spherical one, with covariance matrix I.
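A self-contained sketch of the whitening transform; the mean and covariance values are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0])                    # assumed mean
Sigma = np.array([[4.0, 1.5],                # assumed covariance
                  [1.5, 1.0]])

eigvals, Phi = np.linalg.eigh(Sigma)         # Lambda (as a vector) and Phi
A_w = Phi @ np.diag(eigvals ** -0.5)         # A_w = Phi Lambda^(-1/2)

x = rng.multivariate_normal(mu, Sigma, size=100_000)
y = (x - mu) @ A_w                           # y = A_w^t (x - mu), applied row-wise
print(np.cov(y, rowvar=False))               # approximately the identity matrix
```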
Normal Density - Multivariate Case cont.
• The multivariate normal density (MND) is completely specified by d + d(d+1)/2 parameters. Samples drawn from an MND fall in a cluster whose center is determined by μ and whose shape is determined by Σ.
• The loci of points of constant density are hyperellipsoids of constant
r^2 = (x - μ)^t Σ^(-1) (x - μ)
where r is called the Mahalanobis distance from x to μ. The principal axes of the hyperellipsoid are given by the eigenvectors of Σ.
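A tiny helper for the Mahalanobis distance; the example mean, covariance, and query point are assumptions:

```python
import numpy as np

def mahalanobis(x, mu, Sigma):
    """Mahalanobis distance r from x to mu under covariance Sigma."""
    d = x - mu
    return float(np.sqrt(d @ np.linalg.solve(Sigma, d)))   # sqrt of d^t Sigma^-1 d

mu = np.array([1.0, 2.0])                    # assumed mean
Sigma = np.array([[4.0, 1.5],                # assumed covariance
                  [1.5, 1.0]])
print(mahalanobis(np.array([3.0, 3.0]), mu, Sigma))
```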
Normal Density - Multivariate Case cont.
• The minimum-error-rate classification can be achieved using the discriminant functions
gi(x) = P(wi|x) or gi(x) = ln p(x|wi) + ln P(wi)
• If p(x|wi) ~ N(μi, Σi), then
gi(x) = -(1/2) (x - μi)^t Σi^(-1) (x - μi) - (d/2) ln 2π - (1/2) ln |Σi| + ln P(wi)
Discriminant Functions for the Normal Density
• Case 1: Σi = σ^2 I.
• The features are statistically independent, and each feature has the same variance σ^2.
• The determinant is |Σi| = σ^(2d) and the inverse of Σi is Σi^(-1) = (1/σ^2) I; |Σi| is independent of i and can be ignored (along with the (d/2) ln 2π term).
Case 1 cont.
• This gives
gi(x) = -||x - μi||^2 / (2σ^2) + ln P(wi)
where ||·|| denotes the Euclidean norm: ||x - μi||^2 = (x - μi)^t (x - μi).
• Expanding the quadratic form, the term x^t x is independent of i and can be dropped, giving the linear discriminant function
gi(x) = wi^t x + wi0, where wi = μi / σ^2 and wi0 = -μi^t μi / (2σ^2) + ln P(wi)
Case1 cont. • is called the threshold or bias in the ith direction. • A classifier that uses linear discriminant functions is called a linear machine. • The decision surfaces of a linear machine are pieces of hyperplanes defined by the linear equations for the 2 categories with the highest posterior probabilities. • For this particular example, setting reduces to 236607 Visual Recognition