Learn about Bayesian decision theory, a statistical approach to pattern classification, where decisions are made optimally based on probabilistic information. Explore the use of Bayes' formula, error minimization, loss functions, and decision rules in the classification process.
Outline • Bayesian Decision Theory • Bayes' formula • Error • Bayes' Decision Rule • Loss function and Risk • Two-Category Classification • Classifiers, Discriminant Functions, and Decision Surfaces • Discriminant Functions for the Normal Density
Bayesian Decision Theory • Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. • Decision making when all the probabilistic information is known. • For given probabilities the decision is optimal. • When new information is added, it is assimilated in an optimal fashion to improve the decisions.
Bayesian Decision Theory cont. • Fish Example: • Each fish is in one of 2 states: sea bass or salmon • Let w denote the state of nature • w = w1 for sea bass • w = w2 for salmon
Bayesian Decision Theory cont. • The state of nature is unpredictable, so w is a variable that must be described probabilistically. • If the catch produced as much salmon as sea bass, the next fish is equally likely to be sea bass or salmon. • Define • P(w1): a priori probability that the next fish is sea bass • P(w2): a priori probability that the next fish is salmon.
Bayesian Decision Theory cont. • If other types of fish are irrelevant: P(w1) + P(w2) = 1. • Prior probabilities reflect our prior knowledge (e.g. time of year, fishing area, …) • Simple decision rule: • Make a decision without seeing the fish. • Decide w1 if P(w1) > P(w2); w2 otherwise. • OK if deciding for one fish. • If several fish must be classified, all are assigned to the same class.
Bayesian Decision Theory cont. • In general, we will have some features and more information. • Feature: lightness measurement = x • Different fish yield different lightness readings (x is a random variable)
Bayesian Decision Theory cont. • Define p(x|w1), the class-conditional probability density: the probability density function for x given that the state of nature is w1. • The difference between p(x|w1) and p(x|w2) describes the difference in lightness between sea bass and salmon.
Bayesian Decision Theory cont. • Hypothetical class-conditional probability density functions for the two categories (figure). • The density functions are normalized (the area under each curve is 1.0).
Bayesian Decision Theory cont. • Suppose that we know the prior probabilities P(w1) and P(w2) and the class-conditional densities p(x|w1) and p(x|w2), and we measure the lightness of a fish, x. • What is the category of the fish?
Bayes' formula P(wj|x) = p(x|wj) P(wj) / p(x), where p(x) = Σj p(x|wj) P(wj). In words: posterior = (likelihood × prior) / evidence.
Bayes' formula cont. • p(x|wj) is called the likelihood of wj with respect to x (the wj category for which p(x|wj) is large is more "likely" to be the true category). • p(x) is the evidence: how frequently we will measure a pattern with feature value x. It is a scale factor that guarantees that the posterior probabilities sum to 1.
Bayes' formula cont. Posterior probabilities for the particular priors P(w1)=2/3 and P(w2)=1/3 (figure). At every x the posteriors sum to 1.
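As a concrete illustration of Bayes' formula (not part of the original slides), the following minimal Python sketch reproduces the setup above with priors P(w1) = 2/3 and P(w2) = 1/3; the Gaussian class-conditional densities for the lightness feature and their parameters are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Univariate normal density; stands in for the hypothetical p(x|wj) curves.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

priors = np.array([2 / 3, 1 / 3])                            # P(w1), P(w2) as in the figure
means, sigmas = np.array([11.0, 13.0]), np.array([1.0, 1.5])  # made-up lightness models

def posteriors(x):
    likelihoods = gaussian_pdf(x, means, sigmas)   # p(x|wj)
    evidence = np.sum(likelihoods * priors)        # p(x) = sum_j p(x|wj) P(wj)
    return likelihoods * priors / evidence         # P(wj|x) by Bayes' formula

post = posteriors(12.0)
print(post, post.sum())                            # the posteriors sum to 1 at every x
print("decide w1" if post[0] > post[1] else "decide w2")
```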
Error For a given x, we can minimize the probability of error by deciding w1 if P(w1|x) > P(w2|x) and w2 otherwise.
Bayes' Decision Rule (minimizes the probability of error) • Decide w1 if P(w1|x) > P(w2|x); w2 otherwise. • Equivalently, decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2); w2 otherwise. • The probability of error is then P(error|x) = min[P(w1|x), P(w2|x)].
Bayesian Decision Theory: Continuous Features, General Case. Formalize the ideas just considered in 4 ways: • Allow more than one feature: replace the scalar x by a feature vector x in Rd; the d-dimensional Euclidean space Rd is called the feature space. • Allow more than 2 states of nature: generalize to several classes. • Allow actions other than merely deciding the state of nature: possibility of rejection, i.e., of refusing to make a decision in close cases. • Introduce a general loss function.
Loss function • The loss (or cost) function states exactly how costly each action is, and is used to convert a probability determination into a decision. Loss functions let us treat situations in which some kinds of classification mistakes are more costly than others.
Formulation • Let {w1, ... , wc} be the finite set of c states of nature ("categories"). • Let {α1, ... , αa} be the finite set of a possible actions. • The loss function λ(αi|wj) is the loss incurred for taking action αi when the state of nature is wj. • x = d-dimensional feature vector (random variable) • p(x|wj) = the state-conditional probability density function for x (the probability density function for x conditioned on wj being the true state of nature) • P(wj) = prior probability that nature is in state wj.
Expected Loss • Suppose that we observe a particular x and that we contemplate taking action αi. • If the true state of nature is wj, then the loss incurred is λ(αi|wj). • Before we have made an observation, the expected loss of action αi is Σj λ(αi|wj) P(wj).
Conditional Risk • After the observation, the expected loss, which is now called the conditional risk, is given by R(αi|x) = Σj λ(αi|wj) P(wj|x).
Total Risk • Objective: Select the action that minimizes the conditional risk. • A general decision rule is a function α(x) that specifies which action to take for every possible observation. • For every x, the decision function α(x) assumes one of the a values α1, ... , αa. • The "total risk" is R = ∫ R(α(x)|x) p(x) dx.
Bayes Decision Rule: • Compute the conditional risk R(αi|x) for i = 1, ... , a. • Select the action αi for which R(αi|x) is minimum. • The resulting minimum total risk is called the Bayes Risk, denoted R*, and is the best performance that can be achieved.
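A minimal Python sketch of this rule (illustrative loss matrix and posterior values, not from the original slides): the conditional risks are the loss matrix applied to the posterior vector, and the Bayes action is the one with minimum risk.

```python
import numpy as np

# lam[i, j] = lambda(alpha_i | w_j): loss for taking action alpha_i when the state is w_j.
# Illustrative values: calling a w2 pattern "w1" is assumed twice as costly as the reverse.
lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])

post = np.array([0.7, 0.3])        # P(w1|x), P(w2|x) for some observed x (made up)

cond_risk = lam @ post             # R(alpha_i|x) = sum_j lambda(alpha_i|w_j) P(w_j|x)
best = int(np.argmin(cond_risk))   # Bayes decision rule: minimize the conditional risk
print(cond_risk, "-> take action alpha_%d" % (best + 1))
```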
Two-Category Classification • Action α1 = deciding that the true state is w1 • Action α2 = deciding that the true state is w2 • Let λij = λ(αi|wj) be the loss incurred for deciding wi when the true state is wj. • Decide w1 if R(α1|x) < R(α2|x), or if (λ21 − λ11) P(w1|x) > (λ12 − λ22) P(w2|x), or if (λ21 − λ11) p(x|w1) P(w1) > (λ12 − λ22) p(x|w2) P(w2), and w2 otherwise.
Two-Category Likelihood Ratio Test • Under the reasonable assumption that λ21 > λ11 (why? an error should cost more than a correct decision), decide w1 if p(x|w1) / p(x|w2) > [(λ12 − λ22) / (λ21 − λ11)] · P(w2) / P(w1), and w2 otherwise. The ratio p(x|w1) / p(x|w2) is called the likelihood ratio. We decide w1 if the likelihood ratio exceeds a threshold value T that is independent of the observation x.
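The likelihood-ratio form translates directly into code; in this sketch the losses, priors, and Gaussian class-conditional densities are all illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Illustrative losses lambda_ij = lambda(alpha_i|w_j) and priors.
lam11, lam12, lam21, lam22 = 0.0, 2.0, 1.0, 0.0
P1, P2 = 2 / 3, 1 / 3

# The threshold T depends only on losses and priors, not on the observation x.
T = (lam12 - lam22) / (lam21 - lam11) * P2 / P1

x = 12.0
ratio = gaussian_pdf(x, 11.0, 1.0) / gaussian_pdf(x, 13.0, 1.5)   # p(x|w1) / p(x|w2)
print("decide w1" if ratio > T else "decide w2", "ratio =", ratio, "T =", T)
```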
Minimum-Error-Rate Classification • In classification problems, each state of nature is usually associated with one of c different classes. • Action αi = decision that the true state is wi. • If action αi is taken and the true state is wj, then the decision is correct if i = j, and in error otherwise. • The zero-one loss function is defined as λ(αi|wj) = 0 if i = j, and 1 if i ≠ j, for i, j = 1, …, c: all errors are equally costly.
Minimum-Error-Rate Classification cont. • The conditional risk is R(αi|x) = Σj λ(αi|wj) P(wj|x) = Σj≠i P(wj|x) = 1 − P(wi|x). • To minimize the average probability of error, we should select the i that maximizes the posterior probability P(wi|x): decide wi if P(wi|x) > P(wj|x) for all j ≠ i (same as Bayes' decision rule).
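A short numerical check (with an arbitrary posterior vector assumed for illustration) that the zero-one loss makes risk minimization identical to picking the largest posterior:

```python
import numpy as np

post = np.array([0.2, 0.5, 0.3])              # P(w_i|x) for c = 3 classes (made up)
zero_one = 1.0 - np.eye(3)                    # lambda(alpha_i|w_j) = 0 if i == j else 1

risk = zero_one @ post                        # conditional risks R(alpha_i|x)
assert np.allclose(risk, 1.0 - post)          # R(alpha_i|x) = 1 - P(w_i|x)
assert np.argmin(risk) == np.argmax(post)     # minimum risk <=> maximum posterior
print(risk, "-> decide w%d" % (np.argmax(post) + 1))
```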
Decision Regions • The likelihood ratio p(x|w1) / p(x|w2) plotted vs. x (figure). • The threshold θa corresponds to the zero-one loss function. • If we put λ12 > λ21 we shall get a larger threshold θb > θa.
Classifiers, Discriminant Functions, and Decision Surfaces: The Multi-Category Case • A pattern classifier can be represented by a set of discriminant functions gi(x), i = 1, ..., c. • The classifier assigns a feature vector x to class wi if gi(x) > gj(x) for all j ≠ i.
Statistical Pattern Classifier (figure)
The Bayes Classifier • A Bayes classifier can be represented in this way: • For the general case with risks: gi(x) = −R(αi|x). • For the minimum-error-rate case: gi(x) = P(wi|x). • If we replace every gi(x) by f(gi(x)), where f(·) is a monotonically increasing function, the resulting classification is unchanged; e.g. any of the following choices gives identical classification results: gi(x) = P(wi|x) = p(x|wi) P(wi) / Σj p(x|wj) P(wj), gi(x) = p(x|wi) P(wi), gi(x) = ln p(x|wi) + ln P(wi).
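As a sketch of how such a classifier can be implemented (with made-up Gaussian class models and priors, and a hand-rolled log-density helper rather than any particular library API), the log form gi(x) = ln p(x|wi) + ln P(wi) is convenient numerically:

```python
import numpy as np

def log_gaussian(x, mu, cov):
    """ln p(x|w_i) for a multivariate normal N(mu, cov)."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(cov) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(cov)))

# Illustrative 2-D, 3-class problem.
means  = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
covs   = [np.eye(2), np.eye(2), np.diag([2.0, 0.5])]
priors = [0.5, 0.3, 0.2]

def classify(x):
    # g_i(x) = ln p(x|w_i) + ln P(w_i); the argmax is the Bayes decision.
    g = [log_gaussian(x, m, c) + np.log(p) for m, c, p in zip(means, covs, priors)]
    return int(np.argmax(g))

print("x assigned to w%d" % (classify(np.array([2.0, 1.0])) + 1))
```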
The Bayes Classifier cont. • The effect of any decision rule is to divide the feature space into c decision regions, R1, ..., Rc. • If gi(x) > gj(x) for all j ≠ i, then x is in Ri, and x is assigned to wi. • Decision regions are separated by decision boundaries. • Decision boundaries are surfaces in the feature space.
The Decision Regions (figure): a two-dimensional, two-category classifier.
The Two-Category Case • Use 2 discriminant functions g1 and g2, and assign x to w1 if g1 > g2. • Alternative: define a single discriminant function g(x) = g1(x) − g2(x); decide w1 if g(x) > 0, otherwise decide w2. • In the two-category case two forms are frequently used: g(x) = P(w1|x) − P(w2|x), and g(x) = ln [p(x|w1) / p(x|w2)] + ln [P(w1) / P(w2)].
Normal Density - Univariate Case • The Gaussian density with mean μ and standard deviation σ (σ² is called the variance) is p(x) = (1 / (√(2π) σ)) exp[−½ ((x − μ) / σ)²]. • It can be shown that E[x] = μ and E[(x − μ)²] = σ².
Entropy • The entropy is given by H(p(x)) = −∫ p(x) ln p(x) dx and is measured in nats; if log2 is used instead, the unit is the bit. The entropy measures the fundamental uncertainty in the values of points selected randomly from a distribution. The normal distribution has the maximum entropy of all distributions having a given mean and variance. As stated by the Central Limit Theorem, the aggregate effect of the sum of a large number of small, i.i.d. random disturbances will lead to a Gaussian distribution. Because many patterns can be viewed as some ideal or prototype pattern corrupted by a large number of random processes, the Gaussian is often a good model for the actual probability distribution.
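As a quick numerical illustration (not in the original slides), the entropy of a univariate Gaussian is ½ ln(2πeσ²) nats, and a Monte Carlo estimate of −E[ln p(x)] recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 2.0
x = rng.normal(mu, sigma, size=200_000)        # samples from N(mu, sigma^2)

log_p = -0.5 * ((x - mu) / sigma) ** 2 - np.log(np.sqrt(2 * np.pi) * sigma)
h_mc = -log_p.mean()                           # Monte Carlo estimate of -E[ln p(x)]
h_exact = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)
print(h_mc, h_exact)                           # both about 2.11 nats
```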
Normal Density - Multivariate Case • The general multivariate normal density (MND) in d dimensions is written as p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp[−½ (x − μ)^T Σ^{-1} (x − μ)]. • It can be shown that E[x] = μ and E[(x − μ)(x − μ)^T] = Σ, which means for components: μi = E[xi] and σij = E[(xi − μi)(xj − μj)]. • The covariance matrix Σ is always symmetric and positive semidefinite.
Normal Density - Multivariate Case cont. • Diagonal elements σii are the variances of the xi, and the off-diagonal elements σij are the covariances of xi and xj. • If xi and xj are statistically independent, σij = 0. If all the off-diagonal elements are zero, then p(x) is a product of univariate normal densities. • Linear combinations of jointly normally distributed random variables are normally distributed: if p(x) ~ N(μ, Σ) and y = A^T x, where A is a d-by-k matrix, then p(y) ~ N(A^T μ, A^T Σ A). • If A is a vector a, then y = a^T x is a scalar and a^T Σ a is the variance of the projection of x onto a.
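A small sampling check of the linear-transformation property (the specific μ, Σ, and A below are arbitrary illustrative choices): y = A^T x should have mean A^T μ and covariance A^T Σ A.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [0.0, -1.0]])                    # d-by-k, here 3-by-2

x = rng.multivariate_normal(mu, Sigma, size=100_000)
y = x @ A                                      # each row is y = A^T x for one sample

print(y.mean(axis=0), A.T @ mu)                # sample mean vs. A^T mu
print(np.round(np.cov(y.T), 2))                # sample covariance of y
print(np.round(A.T @ Sigma @ A, 2))            # theoretical covariance A^T Sigma A
```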
Whitening transform • Define Φ to be the matrix whose columns are the orthonormal eigenvectors of Σ, and Λ the diagonal matrix of the corresponding eigenvalues. The transformation y = Aw^T x with Aw = Φ Λ^{-1/2} converts an arbitrary MND into a spherical one, i.e. one with covariance matrix I.
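A minimal NumPy sketch of the whitening matrix Aw = Φ Λ^{-1/2} (the covariance Σ below is an arbitrary example):

```python
import numpy as np

Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])                 # arbitrary covariance matrix

eigvals, Phi = np.linalg.eigh(Sigma)           # columns of Phi: orthonormal eigenvectors
A_w = Phi @ np.diag(eigvals ** -0.5)           # whitening matrix A_w = Phi Lambda^{-1/2}

print(np.round(A_w.T @ Sigma @ A_w, 6))        # ~ identity: whitened covariance is I
```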
Normal Density - Multivariate Case cont. • The multivariate normal density (MND) is completely specified by d + d(d+1)/2 parameters: the elements of μ and the independent elements of Σ. Samples drawn from an MND fall in a cluster whose center is determined by μ and whose shape is determined by Σ. The loci of points of constant density are hyperellipsoids (x − μ)^T Σ^{-1} (x − μ) = r². The quantity r is called the Mahalanobis distance from x to μ. The principal axes of the hyperellipsoid are given by the eigenvectors of Σ.
Normal Density - Multivariate Case cont. • The minimum-error-rate classification can be achieved using the discriminant functions gi(x) = ln p(x|wi) + ln P(wi). • If p(x|wi) ~ N(μi, Σi), then gi(x) = −½ (x − μi)^T Σi^{-1} (x − μi) − (d/2) ln 2π − ½ ln |Σi| + ln P(wi).
Discriminant Functions for the Normal Density Case 1: Σi = σ²I • The features are statistically independent, and each feature has the same variance σ². • The determinant is |Σi| = σ^(2d) and the inverse is Σi^{-1} = (1/σ²) I. • The term −(d/2) ln 2π − ½ ln |Σi| is independent of i and can be ignored.
Case 1 cont. • gi(x) = −||x − μi||² / (2σ²) + ln P(wi), where ||·|| denotes the Euclidean norm: ||x − μi||² = (x − μi)^T (x − μi). • Expanding the quadratic form, the term x^T x is independent of i, so gi(x) can be written as a linear discriminant function: gi(x) = wi^T x + wi0, where wi = μi / σ² and wi0 = −μi^T μi / (2σ²) + ln P(wi).
Case 1 cont. • wi0 is called the threshold or bias in the ith direction. • A classifier that uses linear discriminant functions is called a linear machine. • The decision surfaces of a linear machine are pieces of hyperplanes defined by the linear equations gi(x) = gj(x) for the 2 categories with the highest posterior probabilities. • For this particular case, setting gi(x) = gj(x) reduces to w^T (x − x0) = 0,
Case 1 cont. • where w = μi − μj and x0 = ½ (μi + μj) − [σ² / ||μi − μj||²] ln [P(wi) / P(wj)] (μi − μj). • The above equation defines a hyperplane through the point x0 and orthogonal to w (the line linking the means). • If P(wi) = P(wj), then x0 is halfway between the means.
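A small sketch computing w and x0 for this case; the means, variance, and priors are illustrative assumptions chosen to show the shift of x0 away from the more likely mean.

```python
import numpy as np

mu_i, mu_j = np.array([0.0, 0.0]), np.array([4.0, 0.0])
sigma2 = 1.0                                   # common variance sigma^2
P_i, P_j = 0.7, 0.3                            # unequal priors

diff = mu_i - mu_j
w = diff                                       # hyperplane normal (line linking the means)
x0 = 0.5 * (mu_i + mu_j) - sigma2 / (diff @ diff) * np.log(P_i / P_j) * diff

def decide(x):
    # w^T (x - x0) > 0  ->  decide w_i, otherwise w_j
    return "w_i" if w @ (x - x0) > 0 else "w_j"

print(x0)                                      # shifted away from the more likely mean mu_i
print(decide(np.array([2.0, 0.0])))
```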
Case 1 cont. • Decision boundaries for 1-D and 2-D examples (figures).
Case 1 cont. • If the covariances of 2 distributions are equal and proportional to the identity matrix, then the distributions are spherical in d dimensions, and the boundary is a generalized hyperplane of d−1 dimensions, perpendicular to the line linking the means. • If P(wi) is not equal to P(wj), the point x0 shifts away from the more likely mean.
Case 1 cont. • 1-D example (figure).
Minimum Distance Classifier • As the priors are changed, the decision boundary shifts. • If all prior probabilities are the same, the optimum decision rule becomes: • Measure the Euclidean distance ||x − μi|| from x to each of the c mean vectors. • Assign x to the class of the nearest mean.
Discriminant Functions for the Normal Density Case 2: Common Covariance Matrices • Case 2: Σi = Σ. • Covariance matrices for all of the classes are identical but otherwise arbitrary. • The term −(d/2) ln 2π − ½ ln |Σ| is independent of i and can be ignored.
Case 2 cont. • gi(x) = −½ (x − μi)^T Σ^{-1} (x − μi) + ln P(wi), or, if the priors are equal, gi(x) = −½ (x − μi)^T Σ^{-1} (x − μi). • If all prior probabilities are the same, the optimum decision rule becomes: • Measure the squared Mahalanobis distance (x − μi)^T Σ^{-1} (x − μi) from x to each of the c mean vectors. • Assign x to the class of the nearest mean.
Case 2 cont. • Expanding the quadratic form (x − μi)^T Σ^{-1} (x − μi) and dropping the term x^T Σ^{-1} x, which is independent of i, we obtain a linear classifier gi(x) = wi^T x + wi0, where wi = Σ^{-1} μi and wi0 = −½ μi^T Σ^{-1} μi + ln P(wi). • Decision boundaries are again hyperplanes, given by w^T (x − x0) = 0 with w = Σ^{-1} (μi − μj) and x0 = ½ (μi + μj) − [ln (P(wi)/P(wj)) / ((μi − μj)^T Σ^{-1} (μi − μj))] (μi − μj).
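A minimal sketch of the resulting linear classifier for the common-covariance case (the covariance, means, and priors below are illustrative assumptions):

```python
import numpy as np

Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])                       # covariance shared by all classes
means  = [np.array([0.0, 0.0]), np.array([3.0, 1.0]), np.array([1.0, 4.0])]
priors = [0.5, 0.3, 0.2]

Sigma_inv = np.linalg.inv(Sigma)
W  = [Sigma_inv @ m for m in means]                  # w_i = Sigma^{-1} mu_i
w0 = [-0.5 * m @ Sigma_inv @ m + np.log(p)           # w_i0 = -1/2 mu_i^T Sigma^{-1} mu_i + ln P(w_i)
      for m, p in zip(means, priors)]

def classify(x):
    g = [w @ x + b for w, b in zip(W, w0)]           # linear discriminants g_i(x)
    return int(np.argmax(g))

print("x assigned to w%d" % (classify(np.array([2.0, 2.0])) + 1))
```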