Chapter 11 Supervised Learning: STATISTICAL METHODS
Cios / Pedrycz / Swiniarski / Kurgan
Outline
• Bayesian Methods
• Basics of Bayesian Methods
• Bayesian Classification – General Case
• Classification that Minimizes Risk
• Decision Regions and Probability of Errors
• Discriminant Functions
• Estimation of Probability Densities
• Probabilistic Neural Network
• Constraints in Classifier Design
Outline
• Regression
• Data Models
• Simple Linear Regression
• Multiple Regression
• General Least Squares and Multiple Regression
• Assessing Quality of the Multiple Regression Model
Bayesian Methods
Statistical processing based on Bayes decision theory is a fundamental technique for pattern recognition and classification. Bayes decision theory provides a framework for classifying patterns into classes based on the probabilities of the patterns and of their features.
Basics of Bayesian Methods
Let us assume an experiment involving recognition of two kinds of birds: an eagle and a hawk.
• States of nature: C = { "an eagle", "a hawk" }
• Values of C: { c1, c2 } = { "an eagle", "a hawk" }
We may assume that, among a large number N of prior observations, neagle of them belonged to class c1 ("an eagle") and nhawk belonged to class c2 ("a hawk"), with neagle + nhawk = N.
Basics of Bayesian Methods
• A priori (prior) probability P(ci)
  • P(ci) denotes the (unconditional) probability that an object belongs to class ci, without any further information about the object
• Estimation of a prior P(ci) from the prior observations:
  P(c1) = neagle / N, P(c2) = nhawk / N
Basics of Bayesian Methods
The a priori probabilities P(c1) and P(c2) represent our initial knowledge (in statistical terms) of how likely it is that an eagle or a hawk will appear, even before a bird is physically observed.
• Natural and best decision: "Assign a bird to class c1 if P(c1) > P(c2); otherwise, assign it to class c2"
• The probability of classification error:
  P(classification error) = P(c2) if we decide C = c1
  P(classification error) = P(c1) if we decide C = c2
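A minimal Python sketch of this prior-only decision rule, with hypothetical observation counts:

```python
# Estimating priors from past observations and classifying
# by the larger prior alone (counts are hypothetical).
n_eagle, n_hawk = 80, 20                 # hypothetical prior observation counts
N = n_eagle + n_hawk
priors = {"eagle": n_eagle / N, "hawk": n_hawk / N}   # P(c1), P(c2)

decision = max(priors, key=priors.get)   # pick the class with the larger prior
p_error = 1.0 - priors[decision]         # probability of classification error
print(decision, p_error)                 # -> eagle 0.2
```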
Involving Object Features in Classification
• Feature variable (feature) x
  • characterizes an object and allows for better discrimination of one class from another
  • we assume it to be a continuous random variable taking values from a given range
  • the variability of a random variable x can be expressed in probabilistic terms
• We represent the distribution of a random variable x by the class conditional probability density function (the state conditional probability density function) p(x|ci), i = 1, 2
Involving Object Features in Classification
[Figure: examples of class conditional probability densities]
Involving Object Features in Classification
• Probability density function p(x|ci)
  • also called the likelihood of class ci with respect to the value x of a feature variable
  • the likelihood that an object belongs to class ci is larger if p(x|ci) is larger
• Joint probability density function p(ci, x)
  • the probability density that an object is in class ci and has feature variable value x
• A posteriori (posterior) probability P(ci|x)
  • the conditional probability P(ci|x) (i = 1, 2), which specifies the probability that the object's class is ci given that the measured value of the feature variable is x
Involving Object Features in Classification
• Bayes' rule / Bayes' theorem
  • from probability theory (see Appendix B):
    P(ci|x) = p(x|ci) P(ci) / p(x)
• The unconditional probability density function (law of total probability):
  p(x) = p(x|c1) P(c1) + p(x|c2) P(c2)
Involving Object Features in Classification
• Bayes' rule
  • "The conditional probability P(ci|x) can be expressed in terms of the a priori probability P(ci) together with the class conditional probability density function p(x|ci)."
Involving Object Features in Classification
• Bayes' decision rule: assign an object to class c1 if P(c1|x) > P(c2|x); otherwise assign it to class c2
• The probability of classification error:
  P(classification error | x) = P(c2|x) if we decide C = c1
  P(classification error | x) = P(c1|x) if we decide C = c2
• "This statistical classification rule is best in the sense of minimizing the probability of misclassification (the probability of classification error)"
• Bayes' classification rule guarantees minimization of the average probability of classification error
Involving Object Features in Classification
• Example
  • Let us consider a bird classification problem with P(c1) = P("an eagle") = 0.8, P(c2) = P("a hawk") = 0.2, and known probability density functions p(x|c1) and p(x|c2).
  • Assume that, for a new bird, we have measured its size x = 45 cm, and for this value we computed p(45|c1) = 2.2828 · 10^-2 and p(45|c2) = 1.1053 · 10^-2.
  • The classification rule predicts class c1 ("an eagle") because p(x|c1)P(c1) > p(x|c2)P(c2) (2.2828 · 10^-2 · 0.8 > 1.1053 · 10^-2 · 0.2).
  • Assume further that the value of the unconditional density is known to be p(45) = 0.3. The probability of classification error is then
    P(classification error | x = 45) = p(45|c2) P(c2) / p(45) = (1.1053 · 10^-2 · 0.2) / 0.3 ≈ 0.0074
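A minimal Python sketch of this decision, reusing the quoted density and prior values; note that p(x) is computed here via the law of total probability, so the resulting normalization differs from the assumed value p(45) = 0.3:

```python
# Two-class Bayes decision for the bird example.
priors = {"eagle": 0.8, "hawk": 0.2}                    # P(c1), P(c2)
likelihoods = {"eagle": 2.2828e-2, "hawk": 1.1053e-2}   # p(45|c1), p(45|c2)

# Unnormalized posteriors p(x|ci) * P(ci); p(x) cancels in the comparison.
scores = {c: likelihoods[c] * priors[c] for c in priors}
decision = max(scores, key=scores.get)                  # -> "eagle" (class c1)

# Normalizing by p(x) = sum_i p(x|ci) P(ci) gives proper posteriors.
px = sum(scores.values())
posteriors = {c: scores[c] / px for c in scores}
p_error = 1.0 - posteriors[decision]                    # P(error | x = 45)
print(decision, posteriors, p_error)
```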
Bayesian Classification – General Case
• Bayes' Classification Rule for Multiclass Multifeature Objects
  • Real-valued features of an object form an n-dimensional column vector x ∈ Rn
  • The object may belong to l distinct classes (l distinct states of nature): C = { c1, c2, …, cl }
Bayesian Classification – General Case
• Bayes' Classification Rule for Multiclass Multifeature Objects
  • Bayes' theorem: P(ci|x) = p(x|ci) P(ci) / p(x)
  • A priori probability: P(ci) (i = 1, 2, …, l)
  • Class conditional probability density function: p(x|ci)
  • A posteriori (posterior) probability: P(ci|x)
  • Unconditional probability density function: p(x) = Σ(i=1..l) p(x|ci) P(ci)
Bayesian Classification – General Case
• Bayes' Classification Rule for Multiclass Multifeature Objects
  • Bayes classification rule: assign an object with a given value x of a feature vector to class cj when
    P(cj|x) > P(ci|x) for all i = 1, 2, …, l, i ≠ j
  • equivalently, since p(x) is the same for all classes:
    p(x|cj) P(cj) > p(x|ci) P(ci) for all i ≠ j
Classification that Minimizes Risk
• Basic Idea
  • To incorporate the fact that misclassifications of some classes are more costly than others, we define a classification based on a minimization criterion that involves a loss associated with a given classification decision for a given true state of nature
• A loss function L(cj|ci)
  • the cost (penalty, weight) of assigning an object to class cj when in fact the true class is ci
Classification that Minimizes Risk
• A loss matrix
  • For an l-class classification problem, we arrange the loss function values into an l × l loss matrix with entries Lij = L(cj|ci)
• Expected (average) conditional loss for deciding class cj given x:
  R(cj|x) = Σ(i=1..l) L(cj|ci) P(ci|x)
  • in short, R(cj|x) = Σi Lij P(ci|x)
Classification that Minimizes Risk
• Overall Risk
  • The overall risk R, the expected conditional loss over all possible feature values, can be used as the classification criterion for minimizing the risk related to a classification decision
• Bayes risk
  • the minimal overall risk R
  • minimizing the overall risk leads to a generalization of Bayes' rule for minimizing the probability of classification error
Classification that Minimizes Risk
• Bayes' classification rule with Bayes risk
  • Choose a decision (a class) ci for which the conditional risk is minimal:
    R(ci|x) = min R(cj|x), j = 1, 2, …, l
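A short Python sketch of the risk-minimizing decision, with a hypothetical loss matrix and hypothetical posteriors:

```python
import numpy as np

# L[i, j] is the loss of deciding class j when the true class is i;
# here misclassifying class 2 is five times costlier (hypothetical values).
L = np.array([[0.0, 1.0],
              [5.0, 0.0]])
posteriors = np.array([0.7, 0.3])     # P(c1|x), P(c2|x) for some x

# Conditional risk R(cj|x) = sum_i L[i, j] * P(ci|x) for each decision j.
risks = posteriors @ L                # -> [1.5, 0.7]
decision = int(np.argmin(risks))      # Bayes decision: minimize the risk
print(risks, "decide class", decision + 1)   # -> decide class 2
```

Note that the zero-one loss would pick class c1 here; the asymmetric loss shifts the decision to c2.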
Classification that Minimizes Risk
• Bayesian Classification Minimizing the Probability of Error
  • Symmetrical zero-one conditional loss function: L(cj|ci) = 0 if i = j, and 1 otherwise
  • With this loss, the conditional risk criterion R(cj|x) is the same as the average probability of classification error:
    R(cj|x) = Σ(i≠j) P(ci|x) = 1 − P(cj|x)
  • The average probability of classification error is thus used as the minimization criterion for selecting the best classification decision
Classification that Minimizes Risk
• Generalization of the Maximum Likelihood Classification
  • Generalized likelihood ratio for classes ci and cj:
    Λij(x) = p(x|ci) / p(x|cj)
  • Generalized threshold value:
    θij = ((Lji − Ljj) P(cj)) / ((Lij − Lii) P(ci))
  • The maximum likelihood classification rule: "Decide class ci over class cj if Λij(x) > θij"
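The same two-class risk-minimizing decision written as a likelihood ratio test (a sketch; the density values reuse the bird example and the losses are hypothetical, with Lii = Ljj = 0):

```python
# Decide c1 if p(x|c1)/p(x|c2) > (L21 * P(c2)) / (L12 * P(c1)).
p1, p2 = 2.2828e-2, 1.1053e-2   # p(x|c1), p(x|c2)
P1, P2 = 0.8, 0.2               # priors P(c1), P(c2)
L12, L21 = 1.0, 5.0             # L12: deciding c2 when c1 is true, and vice versa

ratio = p1 / p2                 # generalized likelihood ratio, about 2.065
theta = (L21 * P2) / (L12 * P1) # generalized threshold, 1.25
print("decide", "c1" if ratio > theta else "c2")   # -> decide c1
```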
Decision Regions and Probability of Errors
• Decision regions
  • A classifier divides the feature space into l disjoint decision subspaces R1, R2, …, Rl
  • The region Ri is a subspace such that each realization x of an object's feature vector falling into this region is assigned to class ci
Decision Regions and Probability of Errors
• Decision boundaries (decision surfaces)
  • the surfaces in feature space that separate adjacent decision regions
  • "The task of classifier design is to find classification rules that guarantee division of the feature space into optimal decision regions R1, R2, …, Rl (with optimal decision boundaries) that minimize a selected classification performance criterion"
Decision Regions and Probability of Errors
[Figure: decision boundaries between decision regions]
Decision Regions and Probability of Errors
• Optimal classification with decision regions
  • Average probability of correct classification:
    P(classification_correct) = Σ(i=1..l) ∫(Ri) p(x|ci) P(ci) dx
  • "The classification problem can be stated as choosing the decision regions Ri (thus defining a classification rule) that maximize the probability of correct classification P(classification_correct), which serves as the optimization criterion"
Discriminant Functions
• Discriminant functions: di(x), i = 1, 2, …, l
• Discriminant type classifier
  • It assigns an object with a given value x of a feature vector to class cj if dj(x) > di(x) for all i ≠ j
• Classification rule for a discriminant function-based classifier
  • Compute numerical values of all discriminant functions for x
  • Choose a class cj as the prediction of the true class for which the value of the associated discriminant function dj(x) is largest:
    select class cj for which dj(x) = max di(x), i = 1, 2, …, l
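A generic Python sketch of a discriminant-type classifier; the three linear discriminants below are hypothetical:

```python
# Evaluate all discriminants d_i(x) and predict the class whose value is largest.
def classify(x, discriminants):
    values = [d(x) for d in discriminants]
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical discriminant functions for three classes of a 1-D feature.
ds = [lambda x: 2.0 * x - 1.0,
      lambda x: 0.5 * x + 1.0,
      lambda x: -x + 3.0]
print(classify(1.0, ds))   # d = [1.0, 1.5, 2.0] -> class index 2
```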
Discriminant Functions
[Figure: discriminant classifier]
Discriminant Functions
• Discriminant type classifier for Bayesian classification
  • The natural choice for the discriminant function is the a posteriori conditional probability: di(x) = P(ci|x)
  • Practical version using Bayes' theorem (p(x) is the same for all classes and can be dropped): di(x) = p(x|ci) P(ci)
  • Bayesian discriminant in natural logarithmic form: di(x) = ln p(x|ci) + ln P(ci)
Discriminant Functions
• Characteristics of discriminant functions
  • Discriminant functions define the decision boundaries that separate the decision regions
  • Generally, a decision boundary between neighboring decision regions consists of the points where the corresponding discriminant function values are equal
  • The decision boundaries are unaffected by monotonically increasing transformations of the discriminant functions
Discriminant Functions
• Bayesian Discriminant Functions for Two Classes
  • General case
    • Two discriminant functions: d1(x) and d2(x)
    • Two decision regions: R1 and R2
    • The decision boundary: d1(x) = d2(x)
  • Using a dichotomizer
    • A single discriminant function: d(x) = d1(x) − d2(x); decide class c1 if d(x) > 0, class c2 otherwise
Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
  • Quadratic Discriminant
    • Assumption: a multivariate normal (Gaussian) distribution of the feature vector x within each class
    • The Bayesian discriminant (from the previous section): di(x) = ln p(x|ci) + ln P(ci)
Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
  • Quadratic Discriminant
    • Gaussian class conditional probability density function:
      p(x|ci) = (2π)^(−n/2) |Σi|^(−1/2) exp(−(1/2)(x − μi)T Σi^(−1) (x − μi))
    • Quadratic discriminant function (dropping class-independent terms):
      di(x) = −(1/2)(x − μi)T Σi^(−1) (x − μi) − (1/2) ln |Σi| + ln P(ci)
    • Decision boundaries: hyperquadric surfaces in the n-dimensional feature space (hyperspheres, hyperellipsoids, hyperparaboloids, etc.)
Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
  • Given: a pattern x, values of the class conditional probability densities p(x|ci), and the a priori probabilities P(ci)
  • Compute values of the mean vectors μi and the covariance matrices Σi for all classes i = 1, 2, …, l based on the training set
  • Compute values of the quadratic discriminant function for all classes
  • Choose a class cj as the prediction of the true class for which the value of the associated discriminant function dj(x) is largest:
    select class cj for which dj(x) = max di(x), i = 1, 2, …, l
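A Python sketch of this procedure under the stated Gaussian assumption; the means, covariances, and priors are hypothetical stand-ins for estimates from a training set:

```python
import numpy as np

# Quadratic (Gaussian) Bayes discriminant:
# d_i(x) = -1/2 (x-mu_i)^T Sigma_i^{-1} (x-mu_i) - 1/2 ln|Sigma_i| + ln P(ci)
def quadratic_discriminant(x, mu, sigma, prior):
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(sigma) @ diff
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]        # hypothetical means
sigmas = [np.eye(2), np.array([[2.0, 0.3], [0.3, 1.0]])]  # hypothetical covariances
priors = [0.8, 0.2]

x = np.array([1.5, 1.0])
scores = [quadratic_discriminant(x, m, s, p)
          for m, s, p in zip(mus, sigmas, priors)]
print(int(np.argmax(scores)))   # index of the predicted class
```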
Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
  • Linear Discriminant
    • Assumption: equal covariance matrices for all classes, Σi = Σ
    • The quadratic discriminant then reduces (after dropping class-independent terms) to a linear form of discriminant functions:
      di(x) = μiT Σ^(−1) x − (1/2) μiT Σ^(−1) μi + ln P(ci)
Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
  • Linear Discriminant: decision boundaries between classes i and j, for which di(x) = dj(x), are pieces of hyperplanes in the n-dimensional feature space
Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
  • The classification process using linear discriminants
    • Compute, for a given x, numerical values of the linear discriminant functions for all classes
    • Choose a class cj for which the value of the associated discriminant function dj(x) is largest:
      select class cj for which dj(x) = max di(x), i = 1, 2, …, l
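A Python sketch of the linear-discriminant classification process under the shared-covariance assumption (all parameter values hypothetical):

```python
import numpy as np

# Linear Bayes discriminant with shared covariance Sigma:
# d_i(x) = mu_i^T Sigma^{-1} x - 1/2 mu_i^T Sigma^{-1} mu_i + ln P(ci)
sigma_inv = np.linalg.inv(np.array([[1.0, 0.2], [0.2, 2.0]]))
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
priors = [0.8, 0.2]

def linear_discriminant(x, mu, prior):
    return mu @ sigma_inv @ x - 0.5 * mu @ sigma_inv @ mu + np.log(prior)

x = np.array([1.5, 1.0])
scores = [linear_discriminant(x, m, p) for m, p in zip(mus, priors)]
print(int(np.argmax(scores)))   # index of the predicted class
```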
Discriminant Functions
• Quadratic and Linear Discriminants
  • Example
    • Let us assume that the following two-feature patterns x ∈ R2 from two classes c1 = 0 and c2 = 1 have been drawn according to a Gaussian (normal) density distribution:
Discriminant Functions
• Quadratic and Linear Discriminants
  • Example
    • The estimates of the symmetric covariance matrices for both classes
    • The linear discriminant functions for both classes
Discriminant Functions
• Quadratic and Linear Discriminants
  • Example
    [Figure: two-class two-feature pattern dichotomizer]
Discriminant Functions
• Quadratic and Linear Discriminants
  • Minimum Mahalanobis Distance Classifier
    • Assumptions
      • equal covariance matrices for all classes: Σi = Σ (i = 1, 2, …, l)
      • equal a priori probabilities for all classes: P(ci) = P
    • Discriminant function: maximizing di(x) = −(1/2)(x − μi)T Σ^(−1) (x − μi) is equivalent to minimizing the squared Mahalanobis distance between x and μi
Discriminant Functions
• Quadratic and Linear Discriminants
  • Minimum Mahalanobis Distance Classifier
    • The classifier selects the class cj whose mean vector μj is nearest, in the sense of the Mahalanobis distance, to the value x. This classifier is called a minimum Mahalanobis distance classifier.
    • Linear version of the minimum Mahalanobis distance classifier (expanding the quadratic form and dropping the class-independent term xT Σ^(−1) x):
      di(x) = μiT Σ^(−1) x − (1/2) μiT Σ^(−1) μi
Discriminant Functions
• Quadratic and Linear Discriminants
  • Minimum Mahalanobis Distance Classifier
    • Given: the mean vectors μi for all classes (i = 1, 2, …, l) and a given value x of a feature vector
    • Compute numerical values of the Mahalanobis distances between x and the means μi for all classes
    • Choose a class cj as the prediction of the true class for which the value of the associated Mahalanobis distance attains its minimum
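A Python sketch of the minimum Mahalanobis distance classifier; the shared covariance matrix and class means are hypothetical:

```python
import numpy as np

# With equal covariances and equal priors, pick the class whose mean is
# nearest to x in the metric r_i^2 = (x - mu_i)^T Sigma^{-1} (x - mu_i).
sigma_inv = np.linalg.inv(np.array([[1.0, 0.2], [0.2, 2.0]]))
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0]), np.array([4.0, 0.0])]

def mahalanobis_sq(x, mu):
    diff = x - mu
    return diff @ sigma_inv @ diff

x = np.array([1.5, 1.0])
distances = [mahalanobis_sq(x, m) for m in mus]
print(int(np.argmin(distances)))   # index of the nearest class mean
```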
Discriminant Functions
• Quadratic and Linear Discriminants
  • Linear Discriminant for Statistically Independent Features
    • Assumptions
      • equal covariance matrices for all classes: Σi = Σ (i = 1, 2, …, l)
      • features are statistically independent, with equal variances: Σ = σ2 I
    • Discriminant function:
      di(x) = −(||x − μi||2) / (2σ2) + ln P(ci)
      where ||x − μi||2 = (x − μi)T (x − μi) is the squared Euclidean norm
Discriminant Functions
• Quadratic and Linear Discriminants
  • Linear Discriminant for Statistically Independent Features
    • Discriminants
      • Quadratic discriminant formula:
        di(x) = −(xT x − 2 μiT x + μiT μi) / (2σ2) + ln P(ci)
      • Linear discriminant formula (dropping the class-independent term xT x):
        di(x) = (1/σ2) μiT x − (1/(2σ2)) μiT μi + ln P(ci)
Discriminant Functions
• Quadratic and Linear Discriminants
  • Linear Discriminant for Statistically Independent Features
    • "Neural network" style, as a linear threshold machine:
      di(x) = wiT x + wi0
      where wi = μi / σ2 and wi0 = −(μiT μi) / (2σ2) + ln P(ci)
    • The decision surfaces for the linear discriminants are pieces of hyperplanes defined by the equations di(x) = dj(x)
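A Python sketch of this linear threshold machine form; σ2, the means, and the priors are hypothetical:

```python
import numpy as np

# d_i(x) = w_i^T x + w_i0, with w_i = mu_i / sigma^2 and
# w_i0 = -mu_i^T mu_i / (2 sigma^2) + ln P(ci).
sigma2 = 1.5                                   # hypothetical common variance
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
priors = [0.8, 0.2]

W = np.array([m / sigma2 for m in mus])        # weight vectors w_i
w0 = np.array([-(m @ m) / (2 * sigma2) + np.log(p)
               for m, p in zip(mus, priors)])  # thresholds (biases) w_i0

x = np.array([1.5, 1.0])
print(int(np.argmax(W @ x + w0)))              # decision of the threshold machine
```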
Discriminant Functions
• Quadratic and Linear Discriminants
  • Minimum Euclidean Distance Classifier
    • Assumptions
      • equal covariance matrices for all classes: Σi = Σ (i = 1, 2, …, l)
      • features are statistically independent: Σ = σ2 I
      • equal a priori probabilities for all classes: P(ci) = P
    • Discriminants:
      di(x) = −||x − μi||2, or equivalently di(x) = −||x − μi||
Discriminant Functions
• Quadratic and Linear Discriminants
  • Minimum Euclidean Distance Classifier
    • The minimum distance classifier (minimum Euclidean distance classifier) selects the class cj whose mean vector μj is nearest to the value x.
    • Linear version of the minimum distance classifier (dropping the class-independent term xT x):
      di(x) = μiT x − (1/2) μiT μi
Discriminant Functions
• Quadratic and Linear Discriminants
  • Minimum Euclidean Distance Classifier
    • Given: the mean vectors μi for all classes (i = 1, 2, …, l) and a given value x of a feature vector
    • Compute numerical values of the Euclidean distances ||x − μi|| between x and the means μi for all classes
    • Choose a class cj as the prediction of the true class for which the value of the associated Euclidean distance is smallest:
      select class cj for which ||x − μj|| = min ||x − μi||, i = 1, 2, …, l
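A Python sketch of this procedure with hypothetical class means:

```python
import numpy as np

# Minimum Euclidean distance classifier: assign x to the nearest class mean.
mus = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]])   # hypothetical means

x = np.array([1.5, 1.0])
distances = np.linalg.norm(mus - x, axis=1)   # ||x - mu_i|| for each class
print(int(np.argmin(distances)))              # -> 1 (nearest mean [2, 2])
```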