INTRODUCTION TO STATISTICAL PATTERN RECOGNITION
Thotreingam Kasar
Medical Intelligence and Language Engineering Laboratory, Department of Electrical Engineering, Indian Institute of Science, Bangalore, INDIA - 560012
Outline
• Basic Probability Theory
• Bayesian Decision Theory
• Discussion
Probability Theory
Probability is a mathematical model that helps us study physical systems in an 'average' sense.
Kinds of probability:
• Classical: ratio of favourable outcomes to total outcomes
• Relative frequency: measure of the frequency of occurrence
• Axiomatic theory of probability
Axiomatic Probability
• Probability Space: the triplet (Ω, F, P), where
  - Ω is the sample space
  - F is the field of events defined on Ω
  - P is the probability measure
• Probability Measure: a function P(·) that assigns to every event E in F a number P(E) such that:
  - P(E) ≥ 0
  - P(Ω) = 1
  - For mutually exclusive events E1, E2, …: P(E1 ∪ E2 ∪ …) = P(E1) + P(E2) + …
Probability Theory
• Conditional Probability: the probability of B given A is
  P(B|A) = P(A ∩ B) / P(A), for P(A) > 0
• Unconditional (total) Probability: let A1, A2, …, AC be mutually exclusive events such that A1 ∪ A2 ∪ … ∪ AC = Ω; then for any event B,
  P(B) = Σi P(B|Ai) P(Ai)
• Bayes' theorem:
  P(Ai|B) = P(B|Ai) P(Ai) / P(B) = P(B|Ai) P(Ai) / Σj P(B|Aj) P(Aj)
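To make the formulas concrete, here is a minimal numerical sketch of Bayes' theorem for a hypothetical 2-class setting (the priors and likelihoods are illustrative assumptions, not from the slides):

```python
# Bayes' theorem for a hypothetical 2-event problem:
# priors P(A1), P(A2) and likelihoods P(B|Ai) are assumed for illustration.
priors = [0.7, 0.3]          # P(A1), P(A2)
likelihoods = [0.1, 0.8]     # P(B|A1), P(B|A2)

# Total probability: P(B) = sum_i P(B|Ai) P(Ai)
p_b = sum(l * p for l, p in zip(likelihoods, priors))

# Posteriors: P(Ai|B) = P(B|Ai) P(Ai) / P(B)
posteriors = [l * p / p_b for l, p in zip(likelihoods, priors)]
print(posteriors)  # [0.2258..., 0.7741...]; the posteriors sum to 1
```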
Random Variables
• A random variable X associates events in the sample space Ω with points on the real line R.
• Distribution function: F_X(x) = P(X ≤ x)
  Properties: 0 ≤ F_X(x) ≤ 1; F_X is non-decreasing; F_X(-∞) = 0 and F_X(+∞) = 1
• Density function: f_X(x) = dF_X(x)/dx
  Properties: f_X(x) ≥ 0; ∫ f_X(x) dx = 1
Random Variables
• Expected value: E[X] = ∫ x f_X(x) dx
• Conditional expectation: E[X|Y = y] = ∫ x f_X|Y(x|y) dx
• Moments: m_k = E[X^k]
• Variance: Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²
• Covariance: Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]
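A quick numerical sketch of these definitions using sample estimates (the Gaussian data here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)  # illustrative data

mean = x.mean()                        # estimate of E[X]
var = ((x - mean) ** 2).mean()         # E[(X - E[X])^2]
var_alt = (x ** 2).mean() - mean ** 2  # E[X^2] - (E[X])^2

print(mean, var, var_alt)  # ~1.0, ~4.0; the two variance forms agree
```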
Random Variables
• Uncorrelated: Cov(X, Y) = 0, i.e. E[XY] = E[X]E[Y]
• Orthogonal: E[XY] = 0
• Independent: f_XY(x, y) = f_X(x) f_Y(y)
Independence implies uncorrelatedness, but the converse does not hold in general.
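A classic counterexample, sketched numerically: with X standard normal and Y = X², the pair is clearly dependent yet (up to sampling noise) uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(500_000)
y = x ** 2  # fully determined by x, hence dependent

# Cov(X, Y) = E[X^3] - E[X] E[X^2] = 0 for symmetric zero-mean X
print(np.cov(x, y)[0, 1])  # close to 0: uncorrelated despite dependence
```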
Joint Random Variables
• X and Y are random variables defined on the same sample space.
• Joint distribution function: F_XY(x, y) = P(X ≤ x, Y ≤ y)
• Joint probability density function: f_XY(x, y) = ∂²F_XY(x, y)/∂x∂y
Conditional Distribution Function
For continuous random variables X and Y, we cannot define the conditional distribution function by the relation
  F_X|Y(x|y) = P(X ≤ x, Y = y) / P(Y = y)
because P(Y = y) = 0 for every y. Instead, it is defined as a limit:
  F_X|Y(x|y) = lim(Δy→0) P(X ≤ x | y < Y ≤ y + Δy) = [∫ from -∞ to x of f_XY(u, y) du] / f_Y(y)
Conditional Density Function
We have
  f_X|Y(x|y) = f_XY(x, y) / f_Y(y)
Density form of Bayes' theorem:
  f_Y|X(y|x) = f_X|Y(x|y) f_Y(y) / f_X(x)
Generalizing to the conditional density of random variables X_{k+1}, …, X_p given X_1, …, X_k leads to the Chain Rule:
  f(x_1, …, x_p) = f(x_p | x_1, …, x_{p-1}) … f(x_2 | x_1) f(x_1)
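The generalization step, written out explicitly (a reconstruction from the standard definitions above, not verbatim from the slides):

```latex
% Conditional density of X_{k+1},\dots,X_p given X_1,\dots,X_k:
f(x_{k+1},\dots,x_p \mid x_1,\dots,x_k)
  = \frac{f(x_1,\dots,x_p)}{f(x_1,\dots,x_k)}
% Applying this repeatedly for k = p-1, p-2, \dots, 1 yields the chain rule:
f(x_1,\dots,x_p)
  = f(x_p \mid x_1,\dots,x_{p-1})\,
    f(x_{p-1} \mid x_1,\dots,x_{p-2})\cdots
    f(x_2 \mid x_1)\, f(x_1)
```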
Statistical Pattern Recognition
• The Problem: given a set of measurements x obtained through observation, assign the pattern to one of C possible classes wi, i = 1, 2, …, C
• A decision rule partitions the measurement space into C regions Wi, i = 1, …, C
• If a pattern vector falls in the region Wi, it is assumed to belong to class wi
• If it falls on the boundary between regions, we may reject the pattern or withhold a decision until further information is available
Bayesian Decision Theory
• Consider C classes w1, …, wC, with a priori probabilities P(w1), …, P(wC), assumed known
• To minimize the error probability, with no extra information, we would assign a pattern to class wj if
  P(wj) > P(wk) for all k ≠ j
Bayesian Decision Theory
• If we have an observation vector x, considered to be a random variable whose distribution is given by p(x|w), then assign x to class wj if
  P(wj|x) > P(wk|x) for all k ≠ j  (MAP rule)
  Equivalently, by Bayes' theorem: p(x|wj)P(wj) > p(x|wk)P(wk) for all k ≠ j
• For the 2-class case, the decision rule reduces to a Likelihood Ratio test:
  L(x) = p(x|w1) / p(x|w2) > P(w2) / P(w1)  ⇒  decide w1; otherwise decide w2
Bayesian Decision Theory
Figure: Likelihood Ratio Test example with p(x|w1) = N(0,1), p(x|w2) = 0.6N(1,1) + 0.4N(-1,2), and P(w1) = P(w2) = 0.5. The upper plot shows the weighted densities p(x|w1)P(w1) and p(x|w2)P(w2) over x; the lower plot shows L(x) against the decision threshold P(w2)/P(w1) = 1.
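A runnable sketch of this likelihood ratio test for the example above. Note one assumption: the slides' N(-1,2) is read here as mean -1 and variance 2.

```python
import numpy as np
from scipy.stats import norm

# Class-conditional densities from the slide's example.
def p1(x):
    return norm.pdf(x, loc=0, scale=1)                     # N(0,1)

def p2(x):
    # 0.6*N(1,1) + 0.4*N(-1,2); N(-1,2) taken as variance 2 (assumption)
    return (0.6 * norm.pdf(x, loc=1, scale=1)
            + 0.4 * norm.pdf(x, loc=-1, scale=np.sqrt(2)))

P1, P2 = 0.5, 0.5  # equal priors, as in the example

def classify(x):
    # Likelihood ratio test: decide w1 if L(x) > P(w2)/P(w1)
    return 1 if p1(x) / p2(x) > P2 / P1 else 2

print([classify(x) for x in (-2.0, 0.0, 2.0)])  # [2, 1, 2]
```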
Probability of Error
• P(e|x) = 1 − max_j P(wj|x), i.e. the conditional error is minimized when P(wj|x) is maximum
• The average probability of error is
  P(e) = ∫ P(e|x) p(x) dx
• For every x, we ensure that P(e|x) is minimum, so that the integral is as small as possible
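Continuing the illustrative example, the Bayes error can be estimated numerically by integrating the smaller of the weighted densities (this reuses the densities p1, p2 and priors P1, P2 from the sketch above):

```python
import numpy as np

x = np.linspace(-10, 10, 20001)
# P(e|x) p(x) = min over classes of p(x|wi) P(wi)
integrand = np.minimum(p1(x) * P1, p2(x) * P2)
bayes_error = np.trapz(integrand, x)
print(bayes_error)  # minimum achievable error probability for this example
```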
Conditional Risk & Bayes' Risk
• Loss λ(ai|wj): a measure of the cost of taking action ai when the true class is wj
• Conditional Risk: the expected loss of taking action ai given x is
  R(ai|x) = Σj λ(ai|wj) P(wj|x)
• To minimize the average probability of error (the zero-one loss case), choose the i that maximizes the a posteriori probability P(wi|x)
• If the action a(x) is chosen such that for every x the conditional risk R(a(x)|x) is minimized, then the overall risk is minimized, and the resulting minimum overall risk is called the Bayes' risk
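A minimal sketch of risk-based classification with an asymmetric loss matrix (the loss values and the posterior vector are illustrative assumptions):

```python
import numpy as np

# lam[i][j] = loss for taking action a_i when the true class is w_j
lam = np.array([[0.0, 10.0],   # deciding w1: very costly if truth is w2
                [1.0,  0.0]])  # deciding w2: mildly costly if truth is w1

posterior = np.array([0.8, 0.2])  # P(w1|x), P(w2|x) for some x (illustrative)

# Conditional risk R(a_i|x) = sum_j lam[i,j] P(w_j|x)
risks = lam @ posterior
print(risks, np.argmin(risks))  # [2.0, 0.8] -> decide w2 despite P(w1|x) > P(w2|x)
```

With a zero-one loss matrix instead, minimizing the conditional risk reduces to the MAP rule from the previous slides.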
Bayes Decision Rule - Reject Option
• Partition the sample space into 2 regions:
  - Acceptance region A: the set of x where max_i P(wi|x) exceeds a threshold t
  - Reject region R: the set of x where it does not
Figure: posteriors P(w1|x) and P(w2|x) plotted over x with levels t and 1 − t marked; in the central region neither posterior exceeds t, giving the layout A, R, A along the x axis.
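A sketch of the reject rule (the threshold value is an illustrative assumption; in practice the posteriors would come from Bayes' theorem as above):

```python
def decide_with_reject(posteriors, t=0.9):
    """Return the MAP class index, or None to reject.

    Rejects when the largest posterior fails to exceed the threshold t
    (t = 0.9 is an illustrative choice).
    """
    best = max(range(len(posteriors)), key=lambda i: posteriors[i])
    return best if posteriors[best] > t else None

print(decide_with_reject([0.95, 0.05]))  # 0: confident, accept
print(decide_with_reject([0.55, 0.45]))  # None: ambiguous, reject
```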
Discussion
• In principle, the Bayes decision rule is optimal with respect to minimizing the classification error
• It assumes knowledge of the underlying class-conditional probability density functions of the feature vectors for each class
  - The pdfs are usually unknown and have to be estimated from a set of correctly classified samples, i.e. training
• An alternative approach is to develop decision rules that use the data to estimate the decision boundaries directly, without explicit calculation of the pdfs
Linear Discriminant Functions
• A discriminant function is a function of the pattern x that leads to a classification rule
• The form of the discriminant function is specified and is not imposed by the underlying distribution
• When g(x) is linear, the decision surface is a hyperplane
  e.g. for a 2-class case, we seek a weight vector w and threshold wo such that
  g(x) = wᵀx + wo > 0 ⇒ decide w1;  g(x) < 0 ⇒ decide w2
Linear Discriminant Functions
If x1 and x2 are both on the decision surface, then
  wᵀx1 + wo = wᵀx2 + wo = 0, i.e. wᵀ(x1 − x2) = 0
so the weight vector w is normal to vectors lying in the hyperplane g = 0. The hyperplane's distance from the origin is |wo|/||w||, and the distance of a pattern x from the hyperplane is
  r = g(x) / ||w||
The value of the discriminant function for a pattern x is therefore a measure of its distance from the hyperplane.
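A small numerical sketch of these relations (the weight vector, threshold and test point are illustrative):

```python
import numpy as np

w = np.array([3.0, 4.0])   # illustrative weight vector, ||w|| = 5
w0 = -5.0                  # illustrative threshold

def g(x):
    return w @ x + w0      # linear discriminant g(x) = w^T x + wo

x = np.array([3.0, 4.0])
print("class:", 1 if g(x) > 0 else 2)         # g(x) = 20 > 0 -> decide w1
print("distance:", g(x) / np.linalg.norm(w))  # signed distance r = g(x)/||w|| = 4.0
```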
Linear Machine
• A pattern classifier using linear discriminant functions is called a linear machine
• e.g. the minimum distance classifier (a nearest neighbour rule with one prototype per class): suppose we are given a set of prototype points p1, …, pC, one for each of the C classes w1, …, wC. The minimum distance classifier assigns a pattern x to the class wi associated with the nearest point pi
• For each prototype, the squared Euclidean distance is
  ||x − pi||² = xᵀx − 2xᵀpi + piᵀpi
  Since xᵀx is the same for every class, minimum distance classification is achieved by comparing xᵀpi − ½piᵀpi and choosing the largest value. So we can define the linear discriminants as
  gi(x) = xᵀpi − ½piᵀpi
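A sketch of the minimum distance classifier written as a linear machine (the prototypes and test point are illustrative):

```python
import numpy as np

prototypes = np.array([[0.0, 0.0],   # p1 for class w1 (illustrative)
                       [4.0, 0.0],   # p2 for class w2
                       [0.0, 4.0]])  # p3 for class w3

def classify(x):
    # Linear discriminants g_i(x) = x^T p_i - 0.5 * p_i^T p_i;
    # the largest g_i corresponds to the nearest prototype.
    scores = prototypes @ x - 0.5 * np.sum(prototypes ** 2, axis=1)
    return int(np.argmax(scores))

x = np.array([3.0, 1.0])
print(classify(x))  # 1 -> class w2, whose prototype p2 is nearest to x
```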
Linear Discriminant Functions
• The decision boundaries are assumed to be linear. The discriminant function divides the feature space by a hyperplane whose orientation is determined by the weight vector w and whose distance from the origin is determined by the threshold wo
• Different optimization schemes lead to different methods, such as the perceptron, Fisher's linear discriminant function and support vector machines
• Linear combinations of nonlinear functions serve as a stepping stone to nonlinear models
References
• H. Stark and J. W. Woods, "Probability and Random Processes with Applications to Signal Processing", 3rd Edition, Pearson Education Asia, 2002.
• R. O. Duda, P. E. Hart and D. G. Stork, "Pattern Classification", John Wiley & Sons, Inc., 2001.
• A. Webb, "Statistical Pattern Recognition", John Wiley & Sons, Ltd., 2005.