Outline
• Bayesian Decision Theory
• Bayes' formula
• Error
• Bayes' Decision Rule
• Loss function and Risk
• Two-Category Classification
• Minimax Criterion
• Classifiers, Discriminant Functions, and Decision Surfaces
• Discriminant Functions for the Normal Density
Bayesian Decision Theory
• Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification.
• It applies to decision making when all of the relevant probabilistic information is known.
• For the given probabilities, the resulting decision is optimal.
• When new information is added, it is assimilated in an optimal fashion to improve the decision.
Bayesian Decision Theory cont.
• Fish example: each fish is in one of 2 states: sea bass or salmon.
• Let w denote the state of nature:
• w = w1 for sea bass
• w = w2 for salmon
Bayesian Decision Theory cont.
• The state of nature is unpredictable, so w is a variable that must be described probabilistically.
• If the catch produced as much salmon as sea bass, the next fish is equally likely to be sea bass or salmon.
• Define
• P(w1): a priori probability that the next fish is sea bass
• P(w2): a priori probability that the next fish is salmon
Bayesian Decision Theory cont.
• If other types of fish are irrelevant: P(w1) + P(w2) = 1.
• Prior probabilities reflect our prior knowledge (e.g. time of year, fishing area, ...).
• Simple decision rule: make a decision without seeing the fish.
• Decide w1 if P(w1) > P(w2); otherwise decide w2.
• This is reasonable if we decide for only one fish; if there are several fish, they are all assigned to the same class.
Bayesian Decision Theory cont.
• In general, we will have some features and more information.
• Feature: lightness measurement x.
• Different fish yield different lightness readings (x is a random variable).
Bayesian Decision Theory cont.
• Define p(x|w1), the class-conditional probability density: the probability density function for x given that the state of nature is w1.
• The difference between p(x|w1) and p(x|w2) describes the difference in lightness between sea bass and salmon.
Bayesian Decision Theory cont.
Figure: hypothetical class-conditional probability density functions p(x|w1) and p(x|w2). The density functions are normalized (the area under each curve is 1.0).
Bayesian Decision Theory cont.
• Suppose that we know the prior probabilities P(w1) and P(w2) and the conditional densities p(x|w1) and p(x|w2), and we measure the lightness of a fish: x.
• What is the category of the fish?
Bayes' formula
P(wj|x) = p(x|wj) P(wj) / p(x), where
p(x) = Σj p(x|wj) P(wj)
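A minimal numerical sketch of Bayes' formula for the two-category fish example; the Gaussian class-conditional densities, their parameters, and the priors 2/3 and 1/3 are illustrative assumptions, not values from the lecture:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([2/3, 1/3])                    # P(w1), P(w2), assumed
densities = [norm(loc=10, scale=2),              # assumed p(x|w1): sea bass lightness
             norm(loc=7, scale=1.5)]             # assumed p(x|w2): salmon lightness

def posteriors(x):
    """Return [P(w1|x), P(w2|x)] via Bayes' formula."""
    likelihoods = np.array([d.pdf(x) for d in densities])   # p(x|wj)
    evidence = np.sum(likelihoods * priors)                  # p(x) = sum_j p(x|wj) P(wj)
    return likelihoods * priors / evidence

print(posteriors(8.5))   # the two posteriors sum to 1 for every x
```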
Bayes' formula cont.
• p(x|wj) is called the likelihood of wj with respect to x: the wj category for which p(x|wj) is large is more "likely" to be the true category.
• p(x) is the evidence: how frequently we will measure a pattern with feature value x. It is a scale factor that guarantees that the posterior probabilities sum to 1.
Bayes' formula cont.
Figure: posterior probabilities P(w1|x) and P(w2|x) for the particular priors P(w1) = 2/3 and P(w2) = 1/3. At every x the posteriors sum to 1.
Error
For a given x, we can minimize the probability of error by deciding w1 if P(w1|x) > P(w2|x) and w2 otherwise.
Bayes' Decision Rule (minimizes the probability of error)
Decide w1 if P(w1|x) > P(w2|x); decide w2 otherwise,
or equivalently
decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2); decide w2 otherwise,
and P(error|x) = min[P(w1|x), P(w2|x)].
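A short self-contained sketch of this rule, reusing the same illustrative priors and Gaussian densities assumed above:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([2/3, 1/3])                    # P(w1), P(w2), assumed
densities = [norm(10, 2), norm(7, 1.5)]          # assumed p(x|w1), p(x|w2)

def bayes_decide(x):
    """Decide w1 or w2 by the larger posterior; also report P(error|x)."""
    post = np.array([d.pdf(x) for d in densities]) * priors
    post /= post.sum()                            # posteriors P(wj|x)
    decision = 1 if post[0] > post[1] else 2
    return decision, post.min()                   # P(error|x) = min posterior

print(bayes_decide(8.5))
```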
Bayesian Decision Theory: Continuous Features - General Case
Formalize the ideas just considered in 4 ways:
• Allow more than one feature: replace the scalar x by the feature vector x in R^d; the d-dimensional Euclidean space R^d is called the feature space.
• Allow more than 2 states of nature: generalize to several classes.
• Allow actions other than merely deciding the state of nature: e.g. the possibility of rejection, i.e., of refusing to make a decision in close cases.
• Introduce a general loss function.
Loss function
• The loss (or cost) function states exactly how costly each action is, and is used to convert a probability determination into a decision.
• Loss functions let us treat situations in which some kinds of classification mistakes are more costly than others.
Formulation
• Let {w1, ..., wc} be the finite set of c states of nature ("categories").
• Let {α1, ..., αa} be the finite set of a possible actions.
• The loss function λ(αi|wj) = loss incurred for taking action αi when the state of nature is wj.
• x = d-dimensional feature vector (random variable).
• p(x|wj) = the state-conditional probability density function for x (the probability density function for x conditioned on wj being the true state of nature).
• P(wj) = prior probability that nature is in state wj.
Expected Loss
• Suppose that we observe a particular x and that we contemplate taking action αi.
• If the true state of nature is wj, then the loss incurred is λ(αi|wj).
• Before we have made an observation, the expected loss of taking action αi is
E[λ(αi|wj)] = Σj λ(αi|wj) P(wj)
Conditional Risk
After the observation, the expected loss, which is now called the "conditional risk", is given by
R(αi|x) = Σj λ(αi|wj) P(wj|x)
Total Risk
• Objective: select the action that minimizes the conditional risk.
• A general decision rule is a function α(x) that maps the feature space into the set of actions.
• For every x, the decision function α(x) assumes one of the a values α1, ..., αa.
• The "total risk" is
R = ∫ R(α(x)|x) p(x) dx
Bayes Decision Rule:
• Compute the conditional risk R(αi|x) for i = 1, ..., a.
• Select the action αi for which R(αi|x) is minimum.
• The resulting minimum total risk is called the Bayes risk, denoted R*, and is the best performance that can be achieved.
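A small self-contained sketch of the general rule with an explicit loss matrix; the 2x2 loss values, the Gaussian densities, and the priors are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([2/3, 1/3])                 # P(w1), P(w2), assumed
densities = [norm(10, 2), norm(7, 1.5)]       # assumed p(x|w1), p(x|w2)
lam = np.array([[0.0, 2.0],                   # lam[i, j] = loss for action a_i
                [1.0, 0.0]])                  # when the true state is w_j (assumed)

def bayes_action(x):
    """Pick the action minimizing R(a_i|x) = sum_j lam[i, j] P(w_j|x)."""
    post = np.array([d.pdf(x) for d in densities]) * priors
    post /= post.sum()                        # posteriors P(w_j|x)
    risks = lam @ post                        # conditional risk of each action
    return int(np.argmin(risks)), risks

print(bayes_action(8.5))
```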
Two-Category Classification
• Action α1 = deciding that the true state is w1.
• Action α2 = deciding that the true state is w2.
• Let λij = λ(αi|wj) be the loss incurred for deciding wi when the true state is wj.
• Decide w1 if R(α1|x) < R(α2|x),
or if (λ21 - λ11) P(w1|x) > (λ12 - λ22) P(w2|x),
or if (λ21 - λ11) p(x|w1) P(w1) > (λ12 - λ22) p(x|w2) P(w2),
and w2 otherwise.
Two-Category Likelihood Ratio Test
• Under the reasonable assumption that λ21 > λ11 (why?), decide w1 if
p(x|w1) / p(x|w2) > [(λ12 - λ22) / (λ21 - λ11)] [P(w2) / P(w1)],
and w2 otherwise.
• The ratio p(x|w1) / p(x|w2) is called the likelihood ratio. We decide w1 if the likelihood ratio exceeds a threshold value T that is independent of the observation x.
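The same test as a self-contained sketch; the loss values, priors, and densities are again the illustrative assumptions used above:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([2/3, 1/3])                 # P(w1), P(w2), assumed
densities = [norm(10, 2), norm(7, 1.5)]       # assumed p(x|w1), p(x|w2)
lam = np.array([[0.0, 2.0],                   # lam[i, j] = lambda(a_i | w_j), assumed
                [1.0, 0.0]])

# Threshold T = (lam12 - lam22)/(lam21 - lam11) * P(w2)/P(w1), independent of x
T = (lam[0, 1] - lam[1, 1]) / (lam[1, 0] - lam[0, 0]) * priors[1] / priors[0]

def likelihood_ratio_decide(x):
    """Decide w1 when the likelihood ratio p(x|w1)/p(x|w2) exceeds T."""
    ratio = densities[0].pdf(x) / densities[1].pdf(x)
    return 1 if ratio > T else 2

print(T, likelihood_ratio_decide(8.5))
```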
Zero-One Loss function
• In classification problems, each state is usually associated with one of c different classes.
• Action αi = the decision that the true state is wi.
• If action αi is taken and the true state is wj, then the decision is correct if i = j, and in error otherwise.
• The zero-one loss function is defined as
λ(αi|wj) = 0 if i = j, and 1 if i ≠ j, for i, j = 1, ..., c
so all errors are equally costly.
Zero-One Loss function cont.
• The conditional risk is
R(αi|x) = Σ(j≠i) P(wj|x) = 1 - P(wi|x)
• To minimize the average probability of error, we should select the i that maximizes the posterior probability P(wi|x):
decide wi if P(wi|x) > P(wj|x) for all j ≠ i
(the same as Bayes' decision rule).
Decision Regions
Figure: the likelihood ratio p(x|w1)/p(x|w2) vs. x, with the decision threshold marked. The threshold θa corresponds to the zero-one loss function; if we put λ12 > λ21 we obtain θb > θa.
Minimax Criterion
• Design a classifier which performs well over a range of prior probabilities.
• Let R1 be the region of feature space where the classifier decides w1, and likewise R2 for w2. The overall risk is
R = ∫R1 [λ11 P(w1) p(x|w1) + λ12 P(w2) p(x|w2)] dx + ∫R2 [λ21 P(w1) p(x|w1) + λ22 P(w2) p(x|w2)] dx
• Using P(w2) = 1 - P(w1) and ∫R1 p(x|w1) dx = 1 - ∫R2 p(x|w1) dx, we rewrite the risk as
R(P(w1)) = λ22 + (λ12 - λ22) ∫R1 p(x|w2) dx + P(w1) [ (λ11 - λ22) + (λ21 - λ11) ∫R2 p(x|w1) dx - (λ12 - λ22) ∫R1 p(x|w2) dx ]
Minimax Criterion
• This shows that once R1 and R2 are fixed, the risk is linear in P(w1). If we can find a boundary for which the coefficient inside the brackets is zero, then the risk is independent of the priors.
Minimax Criterion
• For the minimax solution, the boundary is chosen so that
(λ11 - λ22) + (λ21 - λ11) ∫R2 p(x|w1) dx - (λ12 - λ22) ∫R1 p(x|w2) dx = 0
Minimax Criterion
• Thus the minimax risk is
Rmm = λ22 + (λ12 - λ22) ∫R1 p(x|w2) dx = λ11 + (λ21 - λ11) ∫R2 p(x|w1) dx
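A numerical sketch of the minimax boundary for the zero-one loss case (λ11 = λ22 = 0, λ12 = λ21 = 1), where the condition above reduces to equal conditional errors, ∫R2 p(x|w1) dx = ∫R1 p(x|w2) dx. The single-threshold rule, the Gaussian densities, and the search interval are illustrative assumptions:

```python
from scipy.optimize import brentq
from scipy.stats import norm

p1, p2 = norm(10, 2), norm(7, 1.5)      # assumed p(x|w1), p(x|w2)

# Single-threshold rule: decide w2 for x < x*, w1 for x > x*
# (sensible here because w1 has the larger mean).
def error_gap(x_star):
    # mass of p(x|w1) in R2 minus mass of p(x|w2) in R1
    return p1.cdf(x_star) - p2.sf(x_star)

x_star = brentq(error_gap, 5.0, 12.0)   # boundary where both conditional errors match
print(x_star, p1.cdf(x_star))            # minimax boundary and the minimax risk Rmm
```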
Neyman-Pearson Criterion
• If we want to minimize the total risk subject to a constraint of the form ∫ R(αi|x) dx < constant for some particular i, the resulting criterion is called the Neyman-Pearson criterion.
ERROR PROBABILITIES AND INTEGRALS
• Consider the 2-class problem and suppose that the feature space is divided into 2 regions R1 and R2. There are 2 ways in which a classification error can occur:
• an observation x falls in R2 and the true state is w1, or
• an observation x falls in R1 and the true state is w2.
ERROR PROBABILITIES AND INTEGRALS cont.
P(error) = P(x in R2, w1) + P(x in R1, w2) = ∫R2 p(x|w1) P(w1) dx + ∫R1 p(x|w2) P(w2) dx
Figure: the two error integrals for a one-dimensional problem with an arbitrary decision boundary x*.
ERROR PROBABILITIES AND INTEGRALS cont.
• Because x* is chosen arbitrarily, the probability of error is not as small as it might be. xB, the Bayes optimal decision boundary, gives the lowest probability of error.
• In the multi-category case, there are more ways to be wrong than to be right, and it is simpler to compute the probability of being correct:
P(correct) = Σi ∫Ri p(x|wi) P(wi) dx
• This result depends neither on how the feature space is partitioned nor on the form of the underlying distribution. The Bayes classifier maximizes this probability, and no other partitioning can yield a smaller probability of error.
Classifiers, Discriminant Functions, and Decision Surfaces: The Multi-Category Case
• A pattern classifier can be represented by a set of discriminant functions gi(x), i = 1, ..., c.
• The classifier assigns a feature vector x to class wi if gi(x) > gj(x) for all j ≠ i.
Statistical Pattern Classifier
Figure: a statistical pattern classifier as a network that computes the c discriminant functions gi(x) and selects the category corresponding to the largest discriminant.
The Bayes Classifier
• A Bayes classifier can be represented in this way.
• For the general case with risks: gi(x) = -R(αi|x).
• For the minimum error-rate case: gi(x) = P(wi|x).
• If we replace every gi(x) by f(gi(x)), where f(.) is a monotonically increasing function, the resulting classification is unchanged, e.g. any of the following choices gives identical classification results:
gi(x) = P(wi|x) = p(x|wi) P(wi) / Σj p(x|wj) P(wj)
gi(x) = p(x|wi) P(wi)
gi(x) = ln p(x|wi) + ln P(wi)
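A quick self-contained check that these three discriminant choices produce the same classification; the densities and priors are again the illustrative assumptions from the earlier sketches:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([2/3, 1/3])                 # P(w1), P(w2), assumed
densities = [norm(10, 2), norm(7, 1.5)]       # assumed p(x|w1), p(x|w2)

def g_posterior(x):      # g_i(x) = P(w_i|x)
    g = np.array([d.pdf(x) for d in densities]) * priors
    return g / g.sum()

def g_unnormalized(x):   # g_i(x) = p(x|w_i) P(w_i)
    return np.array([d.pdf(x) for d in densities]) * priors

def g_log(x):            # g_i(x) = ln p(x|w_i) + ln P(w_i)
    return np.array([d.logpdf(x) for d in densities]) + np.log(priors)

x = 8.5
print(np.argmax(g_posterior(x)), np.argmax(g_unnormalized(x)), np.argmax(g_log(x)))
# all three discriminant choices pick the same category
```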
The Bayes Classifier cont.
• The effect of any decision rule is to divide the feature space into c decision regions, R1, ..., Rc.
• If gi(x) > gj(x) for all j ≠ i, then x is in Ri, and x is assigned to wi.
• Decision regions are separated by decision boundaries.
• Decision boundaries are surfaces in the feature space.
The Decision Regions
Figure: decision regions of a two-dimensional, two-category classifier.
The Two-Category Case
• Use 2 discriminant functions g1 and g2, and assign x to w1 if g1 > g2.
• Alternative: define a single discriminant function g(x) = g1(x) - g2(x); decide w1 if g(x) > 0, otherwise decide w2.
• For the two-category case, convenient choices are
g(x) = P(w1|x) - P(w2|x)
g(x) = ln [p(x|w1) / p(x|w2)] + ln [P(w1) / P(w2)]
Normal Density - Univariate Case
• Gaussian density with mean μ and standard deviation σ:
p(x) = (1 / (√(2π) σ)) exp[ -(1/2) ((x - μ) / σ)^2 ]
• It can be shown that:
E[x] = ∫ x p(x) dx = μ and E[(x - μ)^2] = ∫ (x - μ)^2 p(x) dx = σ^2
Entropy
• The entropy is given by
H(p(x)) = -∫ p(x) ln p(x) dx
and is measured in nats; if log2 is used instead, the unit is the bit. The entropy measures the fundamental uncertainty in the values of points selected randomly from a distribution.
• The normal distribution has the maximum entropy of all distributions having a given mean and variance.
• As stated by the Central Limit Theorem, the aggregate effect of the sum of a large number of small, i.i.d. random disturbances leads to a Gaussian distribution. Because many patterns can be viewed as some ideal or prototype pattern corrupted by a large number of random processes, the Gaussian is often a good model for the actual probability distribution.
Normal Density - Multivariate Case
• The general multivariate normal density (MND) in d dimensions is written as
p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp[ -(1/2) (x - μ)^t Σ^(-1) (x - μ) ]
• It can be shown that E[x] = μ and E[(x - μ)(x - μ)^t] = Σ, which means for the components
μi = E[xi] and σij = E[(xi - μi)(xj - μj)]
• The covariance matrix Σ is always symmetric and positive semidefinite.
Normal Density - Multivariate Case cont.
• The diagonal elements σii are the variances of the xi, and the off-diagonal elements σij are the covariances of xi and xj.
• If xi and xj are statistically independent, then σij = 0. If all σij = 0 for i ≠ j, then p(x) is a product of univariate normal densities.
• Linear combinations of jointly normally distributed random variables are normally distributed: if p(x) ~ N(μ, Σ) and y = A^t x, where A is a d-by-k matrix, then p(y) ~ N(A^t μ, A^t Σ A).
• If A is a vector a, then y = a^t x is a scalar, and a^t Σ a is the variance of the projection of x onto a.
Whitening transform
• Define Φ to be the matrix whose columns are the orthonormal eigenvectors of Σ, and Λ the diagonal matrix of the corresponding eigenvalues. The whitening transformation y = Aw^t x with Aw = Φ Λ^(-1/2) converts an arbitrary MND into a spherical one, with covariance matrix I.
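A self-contained sketch of the whitening transform; the mean and covariance values are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0])                    # assumed mean
Sigma = np.array([[4.0, 1.5],                # assumed covariance
                  [1.5, 1.0]])

eigvals, Phi = np.linalg.eigh(Sigma)         # Lambda (as a vector) and Phi
A_w = Phi @ np.diag(eigvals ** -0.5)         # A_w = Phi Lambda^(-1/2)

x = rng.multivariate_normal(mu, Sigma, size=100_000)
y = (x - mu) @ A_w                           # y = A_w^t (x - mu), applied row-wise
print(np.cov(y, rowvar=False))               # approximately the identity matrix
```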
Normal Density - Multivariate Case cont.
• The multivariate normal density (MND) is completely specified by d + d(d+1)/2 parameters. Samples drawn from an MND fall in a cluster whose center is determined by μ and whose shape is determined by Σ.
• The loci of points of constant density are hyperellipsoids of constant
r^2 = (x - μ)^t Σ^(-1) (x - μ)
where r is called the Mahalanobis distance from x to μ. The principal axes of the hyperellipsoid are given by the eigenvectors of Σ.
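A tiny helper for the Mahalanobis distance; the example mean, covariance, and query point are assumptions:

```python
import numpy as np

def mahalanobis(x, mu, Sigma):
    """Mahalanobis distance r from x to mu under covariance Sigma."""
    d = x - mu
    return float(np.sqrt(d @ np.linalg.solve(Sigma, d)))   # sqrt of d^t Sigma^-1 d

mu = np.array([1.0, 2.0])                    # assumed mean
Sigma = np.array([[4.0, 1.5],                # assumed covariance
                  [1.5, 1.0]])
print(mahalanobis(np.array([3.0, 3.0]), mu, Sigma))
```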
Normal Density - Multivariate Case cont.
• The minimum-error-rate classification can be achieved using the discriminant functions
gi(x) = P(wi|x) or gi(x) = ln p(x|wi) + ln P(wi)
• If p(x|wi) ~ N(μi, Σi), then
gi(x) = -(1/2) (x - μi)^t Σi^(-1) (x - μi) - (d/2) ln 2π - (1/2) ln |Σi| + ln P(wi)
Discriminant Functions for the Normal Density
• Case 1: Σi = σ^2 I.
• The features are statistically independent, and each feature has the same variance σ^2.
• The determinant is |Σi| = σ^(2d) and the inverse of Σi is Σi^(-1) = (1/σ^2) I; |Σi| is independent of i and can be ignored (along with the (d/2) ln 2π term).
Case 1 cont.
• This gives
gi(x) = -||x - μi||^2 / (2σ^2) + ln P(wi)
where ||·|| denotes the Euclidean norm: ||x - μi||^2 = (x - μi)^t (x - μi).
• Expanding the quadratic form, the term x^t x is independent of i and can be dropped, giving the linear discriminant function
gi(x) = wi^t x + wi0, where wi = μi / σ^2 and wi0 = -μi^t μi / (2σ^2) + ln P(wi)
Case1 cont. • is called the threshold or bias in the ith direction. • A classifier that uses linear discriminant functions is called a linear machine. • The decision surfaces of a linear machine are pieces of hyperplanes defined by the linear equations for the 2 categories with the highest posterior probabilities. • For this particular example, setting reduces to 236607 Visual Recognition