
Chapter 11 Supervised Learning: STATISTICAL METHODS



Presentation Transcript


  1. Chapter 11 Supervised Learning: STATISTICAL METHODS Cios / Pedrycz / Swiniarski / Kurgan

  2. Outline • Bayesian Methods • Basics of Bayesian Methods • Bayesian Classification – General Case • Classification that Minimizes Risk • Decision Regions and Probability of Errors • Discriminant Functions • Estimation of Probability Densities • Probabilistic Neural Network • Constraints in Classifier Design Cios / Pedrycz / Swiniarski / Kurgan

  3. Outline • Regression • Data Models • Simple Linear Regression • Multiple Regression • General Least Squares and Multiple Regression • Assessing Quality of the Multiple Regression Model Cios / Pedrycz / Swiniarski / Kurgan

  4. Bayesian Methods Statistical processing based on the Bayes decision theory is a fundamental technique for pattern recognition and classification. The Bayes decision theory provides a framework for statistical methods for classifying patterns into classes based on probabilities of patterns and their features. Cios / Pedrycz / Swiniarski / Kurgan

  5. Basics of Bayesian Methods Let us assume an experiment involving recognition of two kinds of birds: an eagle and a hawk. States of nature: C = { "an eagle", "a hawk" }. Values of C: { c1, c2 } = { "an eagle", "a hawk" }. We may assume that, among a large number N of prior observations, a number neagle of them belonged to class c1 ("an eagle") and a number nhawk belonged to class c2 ("a hawk"), with neagle + nhawk = N. Cios / Pedrycz / Swiniarski / Kurgan

  6. Basics of Bayesian Methods • A priori (prior) probability P(ci) • Estimation of a prior: P(c1) ≈ neagle / N, P(c2) ≈ nhawk / N • P(ci) denotes the (unconditional) probability that an object belongs to class ci, without any further information about the object. Cios / Pedrycz / Swiniarski / Kurgan

  7. Basics of Bayesian Methods The a priori probabilities P(c1) and P(c2) represent our initial knowledge (in statistical terms) about how likely it is that an eagle or a hawk may appear, even before a bird is observed. • Natural and best decision: "Assign a bird to class c1 if P(c1) > P(c2); otherwise, assign the bird to class c2" • The probability of classification error: P(classification error) = P(c2) if we decide C = c1, and P(c1) if we decide C = c2 Cios / Pedrycz / Swiniarski / Kurgan

  8. Involving Object Features in Classification • Feature variable / feature x • It characterizes an object and allows for better discrimination of one class from another • We assume it to be a continuous random variable taking values from a given range • The variability of a random variable x can be expressed in probabilistic terms • We represent the distribution of a random variable x within class ci by the class conditional probability density function (the state conditional probability density function) p(x|ci) Cios / Pedrycz / Swiniarski / Kurgan

  9. Involving Object Features in Classification Examples of probability densities Cios / Pedrycz / Swiniarski / Kurgan

  10. Involving Object Features in Classification • Probability density function p(x|ci) • also called the likelihood of class ci with respect to the value x of a feature variable • other things being equal, an object is more likely to belong to class ci if p(x|ci) is larger • Joint probability density function p(ci, x) • The probability density that an object is in class ci and has feature variable value x. • A posteriori (posterior) probability P(ci|x) • The conditional probability P(ci|x) (i = 1, 2), which specifies the probability that the object class is ci given that the measured value of a feature variable is x. Cios / Pedrycz / Swiniarski / Kurgan

  11. Involving Object Features in Classification • Bayes' rule / Bayes' theorem • From probability theory (see Appendix B): P(ci|x) = p(x|ci) P(ci) / p(x) • The unconditional probability density function: p(x) = p(x|c1) P(c1) + p(x|c2) P(c2) Cios / Pedrycz / Swiniarski / Kurgan

  12. Involving Object Features in Classification • Bayes' rule • "The posterior probability P(ci|x) can be expressed in terms of the a priori probability P(ci), together with the class conditional probability density function p(x|ci) and the unconditional density p(x)." Cios / Pedrycz / Swiniarski / Kurgan
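
A minimal sketch of Bayes' rule in code, assuming the class likelihoods and priors are already known (the function name and the numeric values in the call are illustrative only):

```python
import numpy as np

def posteriors(likelihoods, priors):
    """Bayes' rule: P(ci|x) = p(x|ci) P(ci) / p(x), with p(x) = sum_i p(x|ci) P(ci)."""
    joint = np.asarray(likelihoods, dtype=float) * np.asarray(priors, dtype=float)
    return joint / joint.sum()          # divide by the unconditional density p(x)

# Illustrative two-class call with made-up likelihood values p(x|c1), p(x|c2):
print(posteriors([0.0228, 0.0111], [0.8, 0.2]))   # posteriors sum to 1
```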

  13. Involving Object Features in Classification • Bayes' decision rule • "Decide C = c1 if P(c1|x) > P(c2|x); otherwise decide C = c2" • P(classification error | x) = P(c2|x) if we decide C = c1, and P(c1|x) if we decide C = c2 • "This statistical classification rule is best in the sense of minimizing the probability of misclassification (the probability of classification error)" • Bayes' classification rule guarantees minimization of the average probability of classification error Cios / Pedrycz / Swiniarski / Kurgan

  14. Involving Object Features in Classification • Example • Let us consider a bird classification problem with P(c1) = P("an eagle") = 0.8 and P(c2) = P("a hawk") = 0.2 and known probability density functions p(x|c1) and p(x|c2). • Assume that, for a new bird, we have measured its size x = 45 cm and for this value we computed p(45|c1) = 2.2828 × 10^-2 and p(45|c2) = 1.1053 × 10^-2. • The classification rule then predicts class c1 ("an eagle") because p(x|c1)P(c1) > p(x|c2)P(c2) (2.2828 × 10^-2 × 0.8 > 1.1053 × 10^-2 × 0.2). Assume further that the value of the unconditional density is known to be p(45) = 0.3. The probability of classification error is then P(classification error | x = 45) = p(45|c2)P(c2) / p(45) = (1.1053 × 10^-2 × 0.2) / 0.3 ≈ 0.0074. Cios / Pedrycz / Swiniarski / Kurgan
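
The same arithmetic as a short sketch, using the values from the example above (variable names are illustrative; the value p(45) = 0.3 is taken as given):

```python
# Values from the example above.
p_x_c1, p_x_c2 = 2.2828e-2, 1.1053e-2   # p(45|c1), p(45|c2)
P_c1, P_c2 = 0.8, 0.2                    # priors
p_x = 0.3                                # unconditional density p(45), as given

# Bayes decision: compare p(x|ci) * P(ci) for the two classes.
decide_c1 = (p_x_c1 * P_c1) > (p_x_c2 * P_c2)

# Probability of error at x = 45 is the posterior of the rejected class.
p_error = (p_x_c2 * P_c2 if decide_c1 else p_x_c1 * P_c1) / p_x
print("eagle" if decide_c1 else "hawk", round(p_error, 4))   # eagle 0.0074
```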

  15. Bayesian Classification – General Case • Bayes' Classification Rule for Multiclass Multifeature Objects • Real-valued features of an object form an n-dimensional column vector x = [x1, x2, …, xn]T ∈ Rn • The object may belong to l distinct classes (l distinct states of nature): C = { c1, c2, …, cl } Cios / Pedrycz / Swiniarski / Kurgan

  16. Bayesian Classification – General Case • Bayes' Classification Rule for Multiclass Multifeature Objects • Bayes' theorem: P(ci|x) = p(x|ci) P(ci) / p(x) • A priori probability: P(ci) (i = 1, 2, …, l) • Class conditional probability density function: p(x|ci) • A posteriori (posterior) probability: P(ci|x) • Unconditional probability density function: p(x) = Σi p(x|ci) P(ci) Cios / Pedrycz / Swiniarski / Kurgan

  17. Bayesian Classification – General Case • Bayes' Classification Rule for Multiclass Multifeature Objects • Bayes classification rule: assign an object with a given value x of a feature vector to class cj when P(cj|x) > P(ci|x) for all i ≠ j, or equivalently when p(x|cj) P(cj) > p(x|ci) P(ci) for all i ≠ j Cios / Pedrycz / Swiniarski / Kurgan
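
A sketch of the multiclass rule stated above (function and argument names are illustrative): the common denominator p(x) can be dropped, so we simply pick the class maximizing p(x|cj) P(cj).

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def bayes_classify(x, class_densities, priors):
    """Assign x to the class cj maximizing p(x|cj) * P(cj).

    class_densities: list of callables, class_densities[i](x) -> p(x|ci)
    priors:          list of prior probabilities P(ci)
    """
    scores = [p(x) * prior for p, prior in zip(class_densities, priors)]
    return int(np.argmax(scores))

# Illustrative two-class, two-feature example with made-up Gaussian densities:
densities = [mvn(mean=[0, 0], cov=np.eye(2)).pdf,
             mvn(mean=[2, 2], cov=np.eye(2)).pdf]
print(bayes_classify(np.array([1.8, 1.5]), densities, [0.5, 0.5]))   # -> 1
```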

  18. Classification that Minimizes Risk • Basic Idea • To account for the fact that some misclassifications are more costly than others, we define a classification rule based on a minimization criterion that involves a loss associated with a given classification decision for a given true state of nature • A loss function • The cost (penalty, weight) of assigning an object to class cj when in fact the true class is ci Cios / Pedrycz / Swiniarski / Kurgan

  19. Classification that Minimizes Risk • A loss matrix • For an l-class classification problem, the loss function values Lij are arranged into an l × l loss matrix • Expected (average) conditional loss (conditional risk) of deciding cj: R(cj|x) = Σi L(cj|ci) P(ci|x) Cios / Pedrycz / Swiniarski / Kurgan

  20. Classification that Minimizes Risk • Overall Risk • The overall risk R can be used as a classification criterion for minimizing the risk related to a classification decision • Bayes risk: the minimal overall risk R • Minimizing the overall risk generalizes Bayes' rule for minimizing the probability of classification error Cios / Pedrycz / Swiniarski / Kurgan

  21. Classification that Minimizes Risk • Bayes' classification rule with Bayes risk • Choose a decision (a class) ci for which the conditional risk is smallest: R(ci|x) ≤ R(cj|x) for all j = 1, 2, …, l Cios / Pedrycz / Swiniarski / Kurgan
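
A minimal sketch of the risk-minimizing decision, assuming the posteriors P(ci|x) are already available (names and the example loss matrix are illustrative, with loss[j, i] read as the cost of deciding cj when the true class is ci):

```python
import numpy as np

def min_risk_decision(posteriors, loss):
    """Choose the class cj minimizing R(cj|x) = sum_i loss[j, i] * P(ci|x)."""
    risks = loss @ np.asarray(posteriors, dtype=float)   # conditional risk per decision
    return int(np.argmin(risks)), risks

# Illustrative asymmetric loss: wrongly rejecting class 1 costs 5, the reverse costs 1.
loss = np.array([[0.0, 5.0],
                 [1.0, 0.0]])
decision, risks = min_risk_decision([0.7, 0.3], loss)
print(decision, risks)   # decides class 1 even though its posterior (0.3) is smaller
```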

  22. Classification that Minimizes Risk • Bayesian Classification Minimizing the Probability of Error • Symmetrical zero-one conditional loss function: L(cj|ci) = 0 if i = j, and 1 otherwise • With this loss, the conditional risk criterion R(cj|x) = 1 - P(cj|x) is equivalent to the average probability of classification error • The average probability of classification error is thus used as the criterion of minimization for selecting the best classification decision Cios / Pedrycz / Swiniarski / Kurgan

  23. Classification that Minimizes Risk • Generalization of the Maximum Likelihood Classification • Generalized likelihood ratio for classes cj and ci: Λ(x) = p(x|cj) / p(x|ci) • Generalized threshold value θ, determined by the priors and the loss values (for the zero-one loss, θ = P(ci) / P(cj)) • The maximum likelihood classification rule: "Decide class cj if Λ(x) > θ" Cios / Pedrycz / Swiniarski / Kurgan
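
A hedged sketch of the likelihood-ratio test (the zero-one loss threshold is assumed; with a general loss matrix the threshold also involves the loss differences; all names are illustrative):

```python
def likelihood_ratio_decide(p_x_cj, p_x_ci, prior_cj, prior_ci):
    """Decide cj if p(x|cj)/p(x|ci) exceeds the threshold P(ci)/P(cj) (zero-one loss)."""
    return "cj" if (p_x_cj / p_x_ci) > (prior_ci / prior_cj) else "ci"

# With the earlier bird example (cj = eagle, ci = hawk): ratio ~ 2.07 > 0.25 -> cj.
print(likelihood_ratio_decide(2.2828e-2, 1.1053e-2, 0.8, 0.2))
```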

  24. Decision Regions and Probability of Errors • Decision regions • A classifier divides the feature space into l disjoint decision subspaces R1,R2, … Rl • The region Ri is a subspace such that each realization x of a feature vector of an object falling into this region will be assigned to a class ci Cios / Pedrycz / Swiniarski / Kurgan

  25. Decision Regions and Probability of Errors • Decision boundaries (decision surfaces) • The surfaces where adjacent decision regions meet are called decision boundaries (decision surfaces) • "The task of classifier design is to find classification rules that guarantee division of the feature space into optimal decision regions R1, R2, …, Rl (with optimal decision boundaries) that minimize a selected classification performance criterion" Cios / Pedrycz / Swiniarski / Kurgan

  26. Decision Regions and Probability of Errors • Decision boundaries Cios / Pedrycz / Swiniarski / Kurgan

  27. Decision Regions and Probability of Errors • Optimal classification with decision regions • Average probability of correct classification: P(classification_correct) = Σi ∫Ri p(x|ci) P(ci) dx • "The classification problem can be stated as choosing the decision regions Ri (thus defining a classification rule) that maximize the probability P(classification_correct) of correct classification, this probability being the optimization criterion" Cios / Pedrycz / Swiniarski / Kurgan

  28. Discriminant Functions • Discriminant functions di(x), i = 1, 2, …, l • Discriminant-type classifier • It assigns an object with a given value x of a feature vector to class cj if dj(x) > di(x) for all i ≠ j • Classification rule for a discriminant function-based classifier • Compute numerical values of all discriminant functions for x • Choose a class cj as the prediction of the true class for which the value of the associated discriminant function dj(x) is largest: select class cj for which dj(x) = max(di(x)), i = 1, 2, …, l Cios / Pedrycz / Swiniarski / Kurgan
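
The generic discriminant-type classifier as a short sketch (the discriminant functions passed in are whatever di(x) have been chosen; names are illustrative):

```python
import numpy as np

def discriminant_classify(x, discriminants):
    """Assign x to the class cj whose discriminant value dj(x) is largest."""
    return int(np.argmax([d(x) for d in discriminants]))

# Illustrative one-feature example with two made-up linear discriminants:
print(discriminant_classify(1.3, [lambda x: x, lambda x: 2 - x]))   # -> 0
```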

  29. Discriminant Functions • Discriminant classifier Cios / Pedrycz / Swiniarski / Kurgan

  30. Discriminant Functions • Discriminant-type classifier for Bayesian classification • The natural choice for the discriminant function is the a posteriori conditional probability: di(x) = P(ci|x) • Practical versions using Bayes' theorem: di(x) = p(x|ci) P(ci) / p(x), or simply di(x) = p(x|ci) P(ci), since the common factor 1/p(x) does not change the decision • Bayesian discriminant in natural logarithmic form: di(x) = ln p(x|ci) + ln P(ci) Cios / Pedrycz / Swiniarski / Kurgan
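
A sketch of the logarithmic form above; since ln is monotonically increasing it leaves the argmax (and hence the decision) unchanged, while being numerically safer when the densities are very small (names are illustrative):

```python
import numpy as np

def log_bayes_discriminants(log_likelihoods, priors):
    """di(x) = ln p(x|ci) + ln P(ci); the class with the largest value is chosen."""
    return np.asarray(log_likelihoods, dtype=float) + np.log(priors)

# Reusing the bird example values: the decision (class 0, the eagle) is unchanged.
d = log_bayes_discriminants(np.log([2.2828e-2, 1.1053e-2]), [0.8, 0.2])
print(int(np.argmax(d)))   # -> 0
```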

  31. Discriminant Functions • Characteristics of discriminant functions • Discriminant functions define the decision boundaries that separate the decision regions • In general, the decision boundary between neighboring decision regions lies where the corresponding discriminant function values are equal • The decision boundaries are unaffected by any monotonically increasing transformation of the discriminant functions Cios / Pedrycz / Swiniarski / Kurgan

  32. Discriminant Functions • Bayesian Discriminant Functions for Two Classes • General case • Two discriminant functions: d1(x) and d2(x) • Two decision regions: R1 and R2 • The decision boundary: d1(x) = d2(x) • Using a dichotomizer • Single discriminant function: d(x) = d1(x) - d2(x) Cios / Pedrycz / Swiniarski / Kurgan
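
A minimal sketch of the dichotomizer rule implied above, assuming the usual convention of deciding c1 when d(x) > 0 (names illustrative):

```python
def dichotomize(x, d1, d2):
    """Two-class rule: decide c1 if d(x) = d1(x) - d2(x) > 0, otherwise c2."""
    return "c1" if d1(x) - d2(x) > 0 else "c2"

# Illustrative call with two made-up discriminants:
print(dichotomize(0.4, lambda x: x, lambda x: 1 - x))   # -> c2
```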

  33. Discriminant Functions • Quadratic and Linear Discriminants Derived from the Bayes Rule • Quadratic Discriminant • Assumption: a multivariate normal (Gaussian) distribution of the feature vector x within each class • The Bayesian discriminant (from the previous section): di(x) = ln p(x|ci) + ln P(ci) Cios / Pedrycz / Swiniarski / Kurgan

  34. Discriminant Functions • Quadratic and Linear Discriminants Derived from the Bayes Rule • Quadratic Discriminant • Gaussian probability density function: p(x|ci) = (2π)^(-n/2) |Σi|^(-1/2) exp( -1/2 (x - μi)^T Σi^(-1) (x - μi) ) • Quadratic discriminant function (dropping terms common to all classes): di(x) = -1/2 (x - μi)^T Σi^(-1) (x - μi) - 1/2 ln|Σi| + ln P(ci) • Decision boundaries: hyperquadratic surfaces in n-dimensional feature space (hyperspheres, hyperellipsoids, hyperparaboloids, etc.) Cios / Pedrycz / Swiniarski / Kurgan

  35. Discriminant Functions • Quadratic and Linear Discriminants Derived from the Bayes Rule • Given: a pattern x, values of the state conditional probability densities p(x|ci), and the a priori probabilities P(ci) • Compute the mean vectors μi and the covariance matrices Σi for all classes i = 1, 2, …, l based on the training set • Compute the values of the discriminant function for all classes • Choose a class cj as the prediction of the true class for which the value of the associated discriminant function dj(x) is largest: select class cj for which dj(x) = max(di(x)), i = 1, 2, …, l Cios / Pedrycz / Swiniarski / Kurgan
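
A sketch of this quadratic-discriminant procedure, assuming per-class training samples are available as rows of arrays (all names and the generated data are illustrative):

```python
import numpy as np

def fit_gaussian_classes(class_samples):
    """Estimate (mean vector, covariance matrix) for each class from its training rows."""
    return [(np.mean(X, axis=0), np.cov(X, rowvar=False)) for X in map(np.asarray, class_samples)]

def quadratic_discriminants(x, params, priors):
    """di(x) = -1/2 (x-mu_i)^T Sigma_i^{-1} (x-mu_i) - 1/2 ln|Sigma_i| + ln P(ci)."""
    x = np.asarray(x, dtype=float)
    values = []
    for (mu, sigma), prior in zip(params, priors):
        diff = x - mu
        values.append(-0.5 * diff @ np.linalg.solve(sigma, diff)
                      - 0.5 * np.log(np.linalg.det(sigma))
                      + np.log(prior))
    return np.array(values)

# Illustrative two-class, two-feature training data drawn from Gaussians:
rng = np.random.default_rng(0)
classes = [rng.normal([0, 0], 1.0, size=(50, 2)), rng.normal([3, 3], 1.5, size=(50, 2))]
params = fit_gaussian_classes(classes)
print(int(np.argmax(quadratic_discriminants([2.5, 2.8], params, [0.5, 0.5]))))   # -> 1
```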

  36. Discriminant Functions • Quadratic and Linear Discriminants Derived from the Bayes Rule • Linear Discriminant • Assumption: equal covariances for all classes, Σi = Σ • With Σi = Σ, the quadratic discriminant simplifies to a linear form of the discriminant functions: di(x) = μi^T Σ^(-1) x - 1/2 μi^T Σ^(-1) μi + ln P(ci) Cios / Pedrycz / Swiniarski / Kurgan

  37. Discriminant Functions • Quadratic and Linear Discriminants Derived from the Bayes Rule • Linear Discriminant: Decision boundaries between classes i and j, for which di(x) = dj(x), are pieces of hyperplanes in n-dimensional feature space Cios / Pedrycz / Swiniarski / Kurgan

  38. Discriminant Functions • Quadratic and Linear Discriminants Derived from the Bayes Rule • The classification process using linear discriminants • Compute, for a given x, the numerical values of the discriminant functions for all classes • Choose the class cj for which the value of the discriminant function dj(x) is largest: select class cj for which dj(x) = max(di(x)), i = 1, 2, …, l Cios / Pedrycz / Swiniarski / Kurgan
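
A sketch of the linear-discriminant classification under the shared-covariance assumption (names illustrative):

```python
import numpy as np

def linear_discriminants(x, means, sigma, priors):
    """di(x) = mu_i^T Sigma^{-1} x - 1/2 mu_i^T Sigma^{-1} mu_i + ln P(ci)."""
    x = np.asarray(x, dtype=float)
    sigma_inv = np.linalg.inv(sigma)
    values = []
    for mu, prior in zip(means, priors):
        w = sigma_inv @ mu                                  # weight vector
        w0 = -0.5 * mu @ sigma_inv @ mu + np.log(prior)     # bias term
        values.append(w @ x + w0)
    return np.array(values)

# Illustrative call with a shared identity covariance and two made-up class means:
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
print(int(np.argmax(linear_discriminants([2.5, 2.8], means, np.eye(2), [0.5, 0.5]))))   # -> 1
```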

  39. Discriminant Functions • Quadratic and Linear Discriminants • Example • Let us assume that the following two-feature patterns x ∈ R2 from two classes c1 = 0 and c2 = 1 have been drawn according to the Gaussian (normal) density distribution: Cios / Pedrycz / Swiniarski / Kurgan

  40. Discriminant Functions • Quadratic and Linear Discriminants • Example • The estimates of the symmetric covariance matrices for both classes • The linear discriminant functions for both classes Cios / Pedrycz / Swiniarski / Kurgan

  41. Discriminant Functions • Quadratic and Linear Discriminants • Example • Two-class two-feature pattern dichotomizer. Cios / Pedrycz / Swiniarski / Kurgan

  42. Discriminant Functions • Quadratic and Linear Discriminants • Minimum Mahalanobis Distance Classifier • Assumptions • Equal covariances for all classes: Σi = Σ (i = 1, 2, …, l) • Equal a priori probabilities for all classes: P(ci) = P • Discriminant function: di(x) = -1/2 (x - μi)^T Σ^(-1) (x - μi) Cios / Pedrycz / Swiniarski / Kurgan

  43. Discriminant Functions • Quadratic and Linear Discriminants • Minimum Mahalanobis Distance Classifier • The classifier selects the class cj for which the value x is nearest, in the sense of the Mahalanobis distance, to the corresponding mean vector μj. This classifier is called a minimum Mahalanobis distance classifier. • Linear version of the minimum Mahalanobis distance classifier Cios / Pedrycz / Swiniarski / Kurgan

  44. Discriminant Functions • Quadratic and Linear Discriminants • Minimum Mahalanobis Distance Classifier • Given: the mean vectors μi for all classes (i = 1, 2, …, l) and a given value x of a feature vector • Compute the numerical values of the Mahalanobis distances between x and the means μi for all classes • Choose a class cj as the prediction of the true class for which the value of the associated Mahalanobis distance is the minimum Cios / Pedrycz / Swiniarski / Kurgan
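
A sketch of the minimum Mahalanobis distance classifier described above, assuming a single shared covariance Σ (names illustrative):

```python
import numpy as np

def mahalanobis_classify(x, means, sigma):
    """Assign x to the class whose mean is nearest in Mahalanobis distance:
    D_i^2 = (x - mu_i)^T Sigma^{-1} (x - mu_i)."""
    x = np.asarray(x, dtype=float)
    d2 = [(x - mu) @ np.linalg.solve(sigma, x - mu) for mu in means]
    return int(np.argmin(d2))

# Illustrative correlated shared covariance and two made-up class means.
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
# The point below is equally far from both means in Euclidean distance,
# but the Mahalanobis metric assigns it to class 1.
print(mahalanobis_classify([1.0, 2.0], means, sigma))   # -> 1
```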

  45. Discriminant Functions • Quadratic and Linear Discriminants • Linear Discriminant for Statistically Independent Features • Assumptions • Equal covariances for all classes: Σi = Σ (i = 1, 2, …, l) • Features are statistically independent • Discriminant function • where Cios / Pedrycz / Swiniarski / Kurgan

  46. Discriminant Functions • Quadratic and Linear Discriminants • Linear Discriminant for Statistically Independent Features • Discriminants • Quadratic discriminant formula • Linear discriminant formula Cios / Pedrycz / Swiniarski / Kurgan

  47. Discriminant Functions • Quadratic and Linear Discriminants • Linear Discriminant for Statistically Independent Features • "Neural network" style formulation as a linear threshold machine: di(x) = wi^T x + wi0 • The decision surfaces for the linear discriminants are pieces of hyperplanes defined by the equations di(x) = dj(x) Cios / Pedrycz / Swiniarski / Kurgan

  48. Discriminant Functions • Quadratic and Linear Discriminants • Minimum Euclidean Distance Classifier • Assumptions • Equal covariances for all classes: Σi = Σ (i = 1, 2, …, l) • Features are statistically independent • Equal a priori probabilities for all classes: P(ci) = P • Discriminant: di(x) = -||x - μi||^2, or equivalently, choose the class minimizing the Euclidean distance ||x - μi|| Cios / Pedrycz / Swiniarski / Kurgan

  49. Discriminant Functions • Quadratic and Linear Discriminants • Minimum Euclidean Distance Classifier • The minimum distance classifier (minimum Euclidean distance classifier) selects the class cj for which the value x is nearest to the corresponding mean vector μj. • Linear version of the minimum distance classifier Cios / Pedrycz / Swiniarski / Kurgan

  50. Discriminant Functions • Quadratic and Linear Discriminants • Given: the mean vectors μi for all classes (i = 1, 2, …, l) and a given value x of a feature vector • Compute the numerical values of the Euclidean distances between x and the means μi for all classes • Choose a class cj as the prediction of the true class for which the value of the associated Euclidean distance is smallest Cios / Pedrycz / Swiniarski / Kurgan
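
A sketch of this minimum Euclidean distance classification (names illustrative):

```python
import numpy as np

def euclidean_classify(x, means):
    """Assign x to the class whose mean vector is nearest in Euclidean distance."""
    x = np.asarray(x, dtype=float)
    return int(np.argmin([np.linalg.norm(x - mu) for mu in means]))

# Illustrative call with two made-up class means:
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
print(euclidean_classify([1.0, 0.5], means))   # -> 0
```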
