Bayesian Decision Theory: Patterns, Classification & Features

Understand Bayesian Decision Theory and its application in classifying continuous features. Learn about decision rules, posterior probabilities, likelihood, errors, risk minimization, and optimal decision properties. Explore actions beyond classification and the minimization of overall risk. Dive into classification methods, discriminant functions, and decision surfaces for accurate pattern recognition.

Presentation Transcript


  1. Pattern Classification • All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher

  2. Chapter 2 (Part 1): Bayesian Decision Theory (Sections 2.1-2.2) • Introduction • Bayesian Decision Theory – Continuous Features

  3. Introduction • The sea bass/salmon example • State of nature, prior • The state of nature is a random variable • The catch of salmon and sea bass is equiprobable • $P(\omega_1) = P(\omega_2)$ (uniform priors) • $P(\omega_1) + P(\omega_2) = 1$ (exclusivity and exhaustivity) Pattern Classification, Chapter 2 (Part 1)

  4. Decision rule with only the prior information • Decide $\omega_1$ if $P(\omega_1) > P(\omega_2)$; otherwise decide $\omega_2$ • Use of the class-conditional information • $P(x \mid \omega_1)$ and $P(x \mid \omega_2)$ describe the difference in lightness between the populations of sea bass and salmon

  6. Posterior, likelihood, evidence • $P(\omega_j \mid x) = \frac{P(x \mid \omega_j)\, P(\omega_j)}{P(x)}$ • where, in the case of two categories, $P(x) = \sum_{j=1}^{2} P(x \mid \omega_j)\, P(\omega_j)$ • Posterior = (Likelihood × Prior) / Evidence

  8. Decision given the posterior probabilities • $x$ is an observation for which: if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$, decide that the true state of nature is $\omega_1$; if $P(\omega_1 \mid x) < P(\omega_2 \mid x)$, decide that the true state of nature is $\omega_2$ • Therefore, whenever we observe a particular $x$, the probability of error is: $P(\text{error} \mid x) = P(\omega_1 \mid x)$ if we decide $\omega_2$; $P(\text{error} \mid x) = P(\omega_2 \mid x)$ if we decide $\omega_1$

  9. Minimizing the probability of error • Decide $\omega_1$ if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$; otherwise decide $\omega_2$ • Therefore: $P(\text{error} \mid x) = \min[P(\omega_1 \mid x), P(\omega_2 \mid x)]$ (Bayes decision)
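
The Bayes formula and the minimum-error decision above can be sketched in a few lines of Python. This is only an illustration: the Gaussian class-conditional densities, the function names, and all numbers are assumptions made for the example and do not come from the slides.

```python
import math

def gaussian_pdf(x, mean, std):
    """Univariate normal density, used as a hypothetical class-conditional model."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * std)

def posteriors(likelihoods, priors):
    """Bayes formula: P(w_j | x) = P(x | w_j) P(w_j) / P(x)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)  # P(x) = sum_j P(x | w_j) P(w_j)
    return [j / evidence for j in joint]

def bayes_decide(post):
    """Return (index of the largest posterior, conditional probability of error)."""
    best = max(range(len(post)), key=lambda j: post[j])
    return best, 1.0 - post[best]

# Illustrative values only: observed lightness x = 0.5,
# class 0 ~ N(0, 1), class 1 ~ N(2, 1), equal priors.
x = 0.5
post = posteriors(
    [gaussian_pdf(x, 0.0, 1.0), gaussian_pdf(x, 2.0, 1.0)],
    [0.5, 0.5],
)
decision, p_error = bayes_decide(post)
print(post, decision, p_error)
```

With two classes, the returned error probability equals the smaller of the two posteriors, which is the min[...] expression on the slide.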

  10. Bayes Error (Two Categories) - 1

  11. Bayes Error (Two Categories) - 2

  12. Bayesian Decision Theory – Continuous Features • Generalization of the preceding ideas • Use more than one feature • Use more than two states of nature • Allow actions other than deciding on the state of nature • Introduce a loss function, which is more general than the probability of error

  13. Allowing actions other than classification primarily allows the possibility of rejection • Rejection means refusing to make a decision in close or bad cases! • The loss function states how costly each action is

  14. Let $\{\omega_1, \omega_2, \dots, \omega_c\}$ be the set of $c$ states of nature (or “categories”) • Let $\{\alpha_1, \alpha_2, \dots, \alpha_a\}$ be the set of possible actions • Let $\lambda(\alpha_i \mid \omega_j)$ be the loss incurred for taking action $\alpha_i$ when the state of nature is $\omega_j$

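  15. Overall risk $R$ = probability-weighted sum of all $R(\alpha_i \mid x)$ for $i = 1, \dots, a$ • Minimizing $R$ $\Leftrightarrow$ minimizing the conditional risk $R(\alpha_i \mid x)$ over $i = 1, \dots, a$ for every $x$ • Conditional risk: $R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x)$ for $i = 1, \dots, a$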

  16. Select the action $\alpha_i$ for which $R(\alpha_i \mid x)$ is minimum • $R$ is then minimum, and $R$ in this case is called the Bayes risk = best performance that can be achieved!
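
Selecting the minimum-risk action can be sketched as follows. The loss matrix, including the third "reject" action and its cost of 0.25, is a hypothetical choice made for illustration and is not taken from the slides.

```python
def conditional_risks(loss, post):
    """R(a_i | x) = sum_j loss[i][j] * P(w_j | x), one value per action i."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, post)) for row in loss]

def bayes_action(loss, post):
    """Return (index of the minimum-risk action, list of all conditional risks)."""
    risks = conditional_risks(loss, post)
    best = min(range(len(risks)), key=lambda i: risks[i])
    return best, risks

# Hypothetical loss matrix: rows are actions, columns are states of nature.
# Row 0: decide w_1, row 1: decide w_2, row 2: reject (fixed cost 0.25).
loss = [
    [0.0, 1.0],
    [1.0, 0.0],
    [0.25, 0.25],
]

confident = bayes_action(loss, [0.9, 0.1])  # clear case
close = bayes_action(loss, [0.6, 0.4])      # close case
print(confident)
print(close)
```

In this sketch the clear case selects action 0 (decide the first class), while the close case selects action 2, because rejecting costs less than the expected loss of either classification.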

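  17. Two-category classification • $\alpha_1$: deciding $\omega_1$ • $\alpha_2$: deciding $\omega_2$ • $\lambda_{ij} = \lambda(\alpha_i \mid \omega_j)$: loss incurred for deciding $\omega_i$ when the true state of nature is $\omega_j$ • Conditional risk: $R(\alpha_1 \mid x) = \lambda_{11} P(\omega_1 \mid x) + \lambda_{12} P(\omega_2 \mid x)$ and $R(\alpha_2 \mid x) = \lambda_{21} P(\omega_1 \mid x) + \lambda_{22} P(\omega_2 \mid x)$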

  18. Our rule is the following: if $R(\alpha_1 \mid x) < R(\alpha_2 \mid x)$, action $\alpha_1$ (“decide $\omega_1$”) is taken • This results in the equivalent rule: decide $\omega_1$ if $(\lambda_{21} - \lambda_{11})\, P(x \mid \omega_1)\, P(\omega_1) > (\lambda_{12} - \lambda_{22})\, P(x \mid \omega_2)\, P(\omega_2)$, and decide $\omega_2$ otherwise

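  19. Likelihood ratio: the preceding rule is equivalent to the following rule: if $\frac{P(x \mid \omega_1)}{P(x \mid \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}$ (assuming $\lambda_{21} > \lambda_{11}$), then take action $\alpha_1$ (decide $\omega_1$); otherwise take action $\alpha_2$ (decide $\omega_2$)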
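
A minimal sketch of the likelihood-ratio form of the rule, checked against the risk-comparison form of the preceding slide. The threshold is obtained by dividing that inequality through, which assumes that a wrong decision costs more than a correct one (loss entry 21 greater than entry 11). Function names, the loss matrix, and the likelihood values are illustrative assumptions.

```python
def likelihood_ratio_threshold(loss, priors):
    """(l12 - l22) / (l21 - l11) * P(w2) / P(w1); does not depend on x."""
    (l11, l12), (l21, l22) = loss
    if l21 <= l11:
        raise ValueError("sketch assumes l21 > l11 (errors cost more than correct decisions)")
    return (l12 - l22) / (l21 - l11) * priors[1] / priors[0]

def decide_by_ratio(lik1, lik2, loss, priors):
    """Decide 1 if P(x|w1) / P(x|w2) exceeds the threshold, else 2."""
    # Multiplied through by lik2 so that lik2 == 0 does not divide by zero.
    return 1 if lik1 > likelihood_ratio_threshold(loss, priors) * lik2 else 2

def decide_by_risk(lik1, lik2, loss, priors):
    """Same decision, written as the weighted comparison of likelihood times prior."""
    (l11, l12), (l21, l22) = loss
    lhs = (l21 - l11) * lik1 * priors[0]
    rhs = (l12 - l22) * lik2 * priors[1]
    return 1 if lhs > rhs else 2

# Hypothetical inputs: a loss matrix that penalizes one error twice as much,
# priors 2/3 and 1/3, and two pairs of likelihood values.
loss = [[0.0, 2.0], [1.0, 0.0]]
priors = [2.0 / 3.0, 1.0 / 3.0]
for lik1, lik2 in [(0.30, 0.20), (0.15, 0.20)]:
    print(decide_by_ratio(lik1, lik2, loss, priors),
          decide_by_risk(lik1, lik2, loss, priors))
```

Both functions return the same class for each pair, which is the equivalence the slide states.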

  20. Optimal decision property • “If the likelihood ratio exceeds a threshold value independent of the input pattern $x$, we can take optimal actions”

  21. Exercise • Select the optimal decision where: $\Omega = \{\omega_1, \omega_2\}$ • $P(x \mid \omega_1) \sim N(2, 0.5)$ (normal distribution) • $P(x \mid \omega_2) \sim N(1.5, 0.2)$ • $P(\omega_1) = 2/3$ • $P(\omega_2) = 1/3$

  22. Minimax Criterion I

  23. Minimax Criterion II

  24. Chapter 2 (Part 2): Bayesian Decision Theory (Sections 2.3-2.5) • Minimum-Error-Rate Classification • Classifiers, Discriminant Functions and Decision Surfaces • The Normal Density

  25. Minimum-Error-Rate Classification • Actions are decisions on classes • If action $\alpha_i$ is taken and the true state of nature is $\omega_j$, then the decision is correct if $i = j$ and in error if $i \neq j$ • Seek a decision rule that minimizes the probability of error, which is the error rate

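  26. Introduction of the zero-one loss function: $\lambda(\alpha_i \mid \omega_j) = 0$ if $i = j$ and $1$ if $i \neq j$, for $i, j = 1, \dots, c$ • Therefore, the conditional risk is: $R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$ • “The risk corresponding to this loss function is the average probability of error”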

  27. Minimizing the risk requires maximizing $P(\omega_i \mid x)$ (since $R(\alpha_i \mid x) = 1 - P(\omega_i \mid x)$) • For minimum error rate: decide $\omega_i$ if $P(\omega_i \mid x) > P(\omega_j \mid x)$ $\forall j \neq i$
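
As a small check of this statement, the sketch below builds a zero-one loss matrix and confirms that the action with the smallest conditional risk is the class with the largest posterior. The three posterior values are made up for illustration.

```python
def zero_one_loss(c):
    """c x c loss matrix: 0 on the diagonal (correct), 1 elsewhere (error)."""
    return [[0.0 if i == j else 1.0 for j in range(c)] for i in range(c)]

def conditional_risks(loss, post):
    """R(a_i | x) = sum_j loss[i][j] * P(w_j | x)."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, post)) for row in loss]

# Hypothetical posteriors for three classes at some observation x.
post = [0.2, 0.5, 0.3]
risks = conditional_risks(zero_one_loss(3), post)

min_risk_action = min(range(3), key=lambda i: risks[i])
max_posterior_class = max(range(3), key=lambda i: post[i])
print(risks, min_risk_action, max_posterior_class)
```

Each risk comes out as one minus the corresponding posterior, so the two indices agree.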

  28. Decision regions (threshold rule) • Two examples for the loss function $\lambda$

  30. Classifiers, Discriminant Functions and Decision Surfaces • The multi-category case • Set of discriminant functions $g_i(x)$, $i = 1, \dots, c$ • The classifier assigns a feature vector $x$ to class $\omega_i$ if $g_i(x) > g_j(x)$ $\forall j \neq i$

  32. Let $g_i(x) = -R(\alpha_i \mid x)$ (max. discriminant corresponds to min. risk!) • For the minimum error rate, we take $g_i(x) = P(\omega_i \mid x)$ (max. discriminant corresponds to max. posterior!) • Equivalently, $g_i(x) \equiv P(x \mid \omega_i)\, P(\omega_i)$ or $g_i(x) = \ln P(x \mid \omega_i) + \ln P(\omega_i)$ (ln: natural logarithm)

  33. Feature space divided into $c$ decision regions: if $g_i(x) > g_j(x)$ $\forall j \neq i$, then $x$ is in $\mathcal{R}_i$ ($\mathcal{R}_i$ means assign $x$ to $\omega_i$) • The two-category case • A classifier is a “dichotomizer” that has two discriminant functions $g_1$ and $g_2$ • Let $g(x) \equiv g_1(x) - g_2(x)$ • Decide $\omega_1$ if $g(x) > 0$; otherwise decide $\omega_2$
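
The discriminant-function view can be sketched with the log form, ln-likelihood plus ln-prior, for a scalar feature. The Gaussian class models and the (mean, std, prior) triples are hypothetical values chosen for the example.

```python
import math

def log_gaussian(x, mean, std):
    """Natural log of the univariate normal density."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))

def discriminants(x, classes):
    """g_i(x) = ln P(x | w_i) + ln P(w_i) for each (mean, std, prior) triple."""
    return [log_gaussian(x, m, s) + math.log(p) for m, s, p in classes]

def classify(x, classes):
    """Assign x to the class whose discriminant is largest."""
    g = discriminants(x, classes)
    return max(range(len(g)), key=lambda i: g[i])

def dichotomizer(x, class1, class2):
    """Two-category case: g(x) = g_1(x) - g_2(x); decide class 1 if g(x) > 0."""
    g1, g2 = discriminants(x, [class1, class2])
    return g1 - g2

# Hypothetical three-class problem: (mean, std, prior) per class.
classes = [(0.0, 1.0, 0.5), (3.0, 1.0, 0.3), (6.0, 1.0, 0.2)]
print([classify(x, classes) for x in (0.2, 3.1, 7.0)])
print(dichotomizer(1.0, classes[0], classes[1]))
```

Working with logarithms leaves the decision unchanged, because ln is monotonically increasing, and it avoids multiplying very small density values.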

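  34. The computation of $g(x)$ • $g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)$, or equivalently for the decision, $g(x) = \ln \frac{P(x \mid \omega_1)}{P(x \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}$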

  36. The Normal Density • Univariate density • Analytically tractable density • Continuous density • Many processes are asymptotically Gaussian • Handwritten characters and speech sounds can be viewed as an ideal or prototype pattern corrupted by a random process (central limit theorem) • $P(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]$, where $\mu$ = mean (or expected value) of $x$ and $\sigma^2$ = expected squared deviation, or variance

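  38. Multivariate density • The multivariate normal density in $d$ dimensions is: $P(x) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\left[-\frac{1}{2} (x - \mu)^t\, \Sigma^{-1} (x - \mu)\right]$ • where: $x = (x_1, x_2, \dots, x_d)^t$ ($t$ stands for the transpose vector form) • $\mu = (\mu_1, \mu_2, \dots, \mu_d)^t$ is the mean vector • $\Sigma$ is the $d \times d$ covariance matrix • $|\Sigma|$ and $\Sigma^{-1}$ are its determinant and inverse, respectively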
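
To make the multivariate density concrete, here is a sketch of the usual formula restricted to d = 2, so the determinant and inverse can be written out by hand without a linear-algebra library. The function name and test values are illustrative.

```python
import math

def mvn_pdf_2d(x, mu, cov):
    """Bivariate normal density: exp(-0.5 * r2) / ((2*pi)^(d/2) * sqrt(det)), d = 2."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance matrix must be positive definite")
    # Inverse of a 2 x 2 matrix, written out explicitly.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # r2 = (x - mu)^t Sigma^{-1} (x - mu), the squared Mahalanobis distance.
    r2 = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * r2) / (2.0 * math.pi * math.sqrt(det))

# Peak value for an identity covariance: 1 / (2*pi).
peak = mvn_pdf_2d([0.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])

# With a diagonal covariance, the density factors into two univariate normals.
value = mvn_pdf_2d([1.0, 2.0], [0.0, 0.0], [[4.0, 0.0], [0.0, 1.0]])
print(peak, value)
```

For general d, a library routine such as a Cholesky-based solver would replace the hand-written inverse; that is outside this sketch.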

  39. M-dim Normal Density: Feature Transformations

  40. Chapter 2 (Part 3): Bayesian Decision Theory (Sections 2.6, 2.9) • Discriminant Functions for the Normal Density • Bayes Decision Theory – Discrete Features

  41. Discriminant Functions for the Normal Density • We saw that minimum-error-rate classification can be achieved by the discriminant function $g_i(x) = \ln P(x \mid \omega_i) + \ln P(\omega_i)$ • Case of the multivariate normal: $g_i(x) = -\frac{1}{2} (x - \mu_i)^t\, \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)$

  42. Case $\Sigma_i = \sigma^2 I$ ($I$ stands for the identity matrix)

  43. A classifier that uses linear discriminant functions is called “a linear machine” • The decision surfaces for a linear machine are pieces of hyperplanes defined by: $g_i(x) = g_j(x)$

  45. The hyperplane separating $\mathcal{R}_i$ and $\mathcal{R}_j$ is always orthogonal to the line linking the means!
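
The orthogonality claim can be checked numerically. The sketch below uses the standard closed form for equal spherical covariances (every class covariance equal to sigma squared times the identity), stated here as an assumption because the transcript does not preserve the equations: each discriminant is linear, with weight vector mu_i / sigma^2 and bias -|mu_i|^2 / (2 sigma^2) + ln P(w_i). All names and numbers are illustrative.

```python
import math

def linear_discriminant(mu, sigma2, prior):
    """Weights (w, w0) of g(x) = w . x + w0 for a class with covariance sigma2 * I."""
    w = [m / sigma2 for m in mu]
    w0 = -sum(m * m for m in mu) / (2.0 * sigma2) + math.log(prior)
    return w, w0

def g(x, w, w0):
    """Evaluate the linear discriminant at x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def boundary_point(mu_i, mu_j, sigma2, prior_i, prior_j):
    """Point x0 on the line through the means where g_i(x0) = g_j(x0)."""
    diff = [a - b for a, b in zip(mu_i, mu_j)]
    norm2 = sum(d * d for d in diff)
    shift = sigma2 / norm2 * math.log(prior_i / prior_j)
    return [0.5 * (a + b) - shift * d for a, b, d in zip(mu_i, mu_j, diff)]

# Hypothetical two-class problem in two dimensions.
mu_i, mu_j, sigma2 = [0.0, 0.0], [2.0, 2.0], 1.0
p_i, p_j = 0.7, 0.3
d_i = linear_discriminant(mu_i, sigma2, p_i)
d_j = linear_discriminant(mu_j, sigma2, p_j)

x0 = boundary_point(mu_i, mu_j, sigma2, p_i, p_j)
# Move from x0 along a direction orthogonal to (mu_i - mu_j) = (-2, -2).
x1 = [x0[0] + 1.0, x0[1] - 1.0]
print(g(x0, *d_i) - g(x0, *d_j), g(x1, *d_i) - g(x1, *d_j))
```

Both differences are zero up to rounding, so the boundary contains every point reached from x0 by moving orthogonally to the line linking the means. With unequal priors, x0 shifts away from the more probable mean.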

  48. Case $\Sigma_i = \Sigma$ (the covariance matrices of all classes are identical but arbitrary!) • Hyperplane separating $\mathcal{R}_i$ and $\mathcal{R}_j$ (this hyperplane is generally not orthogonal to the line between the means!)
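
A short sketch of why orthogonality is lost with a shared but non-spherical covariance. It assumes the standard result that the boundary hyperplane has normal vector equal to the inverse covariance applied to (mu_i - mu_j), which the transcript itself does not show. The 2 x 2 helper functions and the numbers are illustrative.

```python
def inv_2x2(m):
    """Inverse of a 2 x 2 matrix, written out explicitly."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0.0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    """Matrix-vector product for a 2 x 2 matrix."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

def boundary_normal(mu_i, mu_j, cov):
    """Normal vector w = cov^{-1} (mu_i - mu_j) of the separating hyperplane."""
    diff = [mu_i[0] - mu_j[0], mu_i[1] - mu_j[1]]
    return matvec(inv_2x2(cov), diff)

def cross_2d(u, v):
    """Zero exactly when u and v are parallel."""
    return u[0] * v[1] - u[1] * v[0]

mu_i, mu_j = [0.0, 0.0], [2.0, 0.0]
diff = [mu_i[0] - mu_j[0], mu_i[1] - mu_j[1]]

w_correlated = boundary_normal(mu_i, mu_j, [[2.0, 1.0], [1.0, 1.0]])
w_spherical = boundary_normal(mu_i, mu_j, [[1.0, 0.0], [0.0, 1.0]])
print(w_correlated, cross_2d(diff, w_correlated))
print(w_spherical, cross_2d(diff, w_spherical))
```

With the correlated covariance the normal vector is not parallel to the difference of the means, so the hyperplane is tilted relative to the line between them; with the identity covariance the two directions coincide.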
