INTRODUCTION TO STATISTICAL PATTERN RECOGNITION
Thotreingam Kasar
Medical Intelligence and Language Engineering Laboratory, Department of Electrical Engineering, Indian Institute of Science, Bangalore, INDIA - 560012
Outline
• Basic Probability Theory
• Bayesian Decision Theory
• Discussion
Probability Theory
Probability is a mathematical model that helps us study physical systems in an 'average' sense.
Kinds of probability:
• Classical: ratio of favourable outcomes to total outcomes
• Relative frequency: measure of the frequency of occurrence
• Axiomatic theory of probability
Axiomatic Probability
• Probability Space: the triplet (Ω, F, P), where
  - Ω is the sample space
  - F is the field of events defined on Ω
  - P is the probability measure
• Probability Measure: a function P(·) that assigns to every event E in F a number P(E) such that:
  - P(E) ≥ 0
  - P(Ω) = 1
  - For mutually exclusive events E1, E2, …: P(E1 ∪ E2 ∪ …) = P(E1) + P(E2) + …
Probability Theory
• Conditional Probability: the probability of B given A is
  P(B|A) = P(A ∩ B) / P(A), for P(A) > 0
• Unconditional (total) Probability: let A1, A2, …, AC be mutually exclusive events such that A1 ∪ A2 ∪ … ∪ AC = Ω; then for any event B,
  P(B) = Σi P(B|Ai) P(Ai)
• Bayes' theorem:
  P(Ai|B) = P(B|Ai) P(Ai) / P(B) = P(B|Ai) P(Ai) / Σj P(B|Aj) P(Aj)
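To make the formulas concrete, here is a minimal numerical sketch of Bayes' theorem for a hypothetical 2-class setting (the priors and likelihoods are illustrative assumptions, not from the slides):

```python
# Bayes' theorem for a hypothetical 2-event problem:
# priors P(A1), P(A2) and likelihoods P(B|Ai) are assumed for illustration.
priors = [0.7, 0.3]          # P(A1), P(A2)
likelihoods = [0.1, 0.8]     # P(B|A1), P(B|A2)

# Total probability: P(B) = sum_i P(B|Ai) P(Ai)
p_b = sum(l * p for l, p in zip(likelihoods, priors))

# Posteriors: P(Ai|B) = P(B|Ai) P(Ai) / P(B)
posteriors = [l * p / p_b for l, p in zip(likelihoods, priors)]
print(posteriors)  # [0.2258..., 0.7741...]; the posteriors sum to 1
```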
Random Variables
• A random variable X associates events in the sample space Ω with points on the real line R.
• Distribution function: F_X(x) = P(X ≤ x)
  Properties: 0 ≤ F_X(x) ≤ 1; F_X is non-decreasing; F_X(-∞) = 0 and F_X(+∞) = 1
• Density function: f_X(x) = dF_X(x)/dx
  Properties: f_X(x) ≥ 0; ∫ f_X(x) dx = 1
Random Variables
• Expected value: E[X] = ∫ x f_X(x) dx
• Conditional expectation: E[X|Y = y] = ∫ x f_X|Y(x|y) dx
• Moments: m_k = E[X^k]
• Variance: Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²
• Covariance: Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]
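A quick numerical sketch of these definitions using sample estimates (the Gaussian data here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)  # illustrative data

mean = x.mean()                        # estimate of E[X]
var = ((x - mean) ** 2).mean()         # E[(X - E[X])^2]
var_alt = (x ** 2).mean() - mean ** 2  # E[X^2] - (E[X])^2

print(mean, var, var_alt)  # ~1.0, ~4.0; the two variance forms agree
```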
Random Variables
• Uncorrelated: Cov(X, Y) = 0, i.e. E[XY] = E[X]E[Y]
• Orthogonal: E[XY] = 0
• Independent: f_XY(x, y) = f_X(x) f_Y(y)
Independence implies uncorrelatedness, but the converse does not hold in general.
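A classic counterexample, sketched numerically: with X standard normal and Y = X², the pair is clearly dependent yet (up to sampling noise) uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(500_000)
y = x ** 2  # fully determined by x, hence dependent

# Cov(X, Y) = E[X^3] - E[X] E[X^2] = 0 for symmetric zero-mean X
print(np.cov(x, y)[0, 1])  # close to 0: uncorrelated despite dependence
```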
Joint Random Variables
• X and Y are random variables defined on the same sample space.
• Joint distribution function: F_XY(x, y) = P(X ≤ x, Y ≤ y)
• Joint probability density function: f_XY(x, y) = ∂²F_XY(x, y)/∂x∂y
Conditional Distribution Function
For continuous random variables X and Y, we cannot define the conditional distribution function by the relation
  F_X|Y(x|y) = P(X ≤ x, Y = y) / P(Y = y)
because P(Y = y) = 0 for every y. Instead, it is defined as a limit:
  F_X|Y(x|y) = lim(Δy→0) P(X ≤ x | y < Y ≤ y + Δy) = [∫ from -∞ to x of f_XY(u, y) du] / f_Y(y)
Conditional Density Function
We have
  f_X|Y(x|y) = f_XY(x, y) / f_Y(y)
Density form of Bayes' theorem:
  f_Y|X(y|x) = f_X|Y(x|y) f_Y(y) / f_X(x)
Generalizing to the conditional density of random variables X_{k+1}, …, X_p given X_1, …, X_k leads to the Chain Rule:
  f(x_1, …, x_p) = f(x_p | x_1, …, x_{p-1}) … f(x_2 | x_1) f(x_1)
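The generalization step, written out explicitly (a reconstruction from the standard definitions above, not verbatim from the slides):

```latex
% Conditional density of X_{k+1},\dots,X_p given X_1,\dots,X_k:
f(x_{k+1},\dots,x_p \mid x_1,\dots,x_k)
  = \frac{f(x_1,\dots,x_p)}{f(x_1,\dots,x_k)}
% Applying this repeatedly for k = p-1, p-2, \dots, 1 yields the chain rule:
f(x_1,\dots,x_p)
  = f(x_p \mid x_1,\dots,x_{p-1})\,
    f(x_{p-1} \mid x_1,\dots,x_{p-2})\cdots
    f(x_2 \mid x_1)\, f(x_1)
```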
Statistical Pattern Recognition
• The Problem: given a set of measurements x obtained through observation, assign the pattern to one of C possible classes wi, i = 1, 2, …, C
• A decision rule partitions the measurement space into C regions Wi, i = 1, …, C
• If a pattern vector falls in the region Wi, it is assumed to belong to class wi
• If it falls on the boundary between regions, we may reject the pattern or withhold a decision until further information is available
Bayesian Decision Theory
• Consider C classes w1, …, wC, with a priori probabilities P(w1), …, P(wC), assumed known
• To minimize the error probability, with no extra information, we would assign a pattern to class wj if
  P(wj) > P(wk) for all k ≠ j
Bayesian Decision Theory
• If we have an observation vector x, considered to be a random variable whose distribution is given by p(x|w), then assign x to class wj if
  P(wj|x) > P(wk|x) for all k ≠ j  (MAP rule)
  Equivalently, by Bayes' theorem: p(x|wj)P(wj) > p(x|wk)P(wk) for all k ≠ j
• For the 2-class case, the decision rule reduces to a Likelihood Ratio test:
  L(x) = p(x|w1) / p(x|w2) > P(w2) / P(w1)  ⇒  decide w1; otherwise decide w2
Bayesian Decision Theory
Figure: Likelihood Ratio Test example with p(x|w1) = N(0,1), p(x|w2) = 0.6N(1,1) + 0.4N(-1,2), and P(w1) = P(w2) = 0.5. The upper plot shows the weighted densities p(x|w1)P(w1) and p(x|w2)P(w2) over x; the lower plot shows L(x) against the decision threshold P(w2)/P(w1) = 1.
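A runnable sketch of this likelihood ratio test for the example above. Note one assumption: the slides' N(-1,2) is read here as mean -1 and variance 2.

```python
import numpy as np
from scipy.stats import norm

# Class-conditional densities from the slide's example.
def p1(x):
    return norm.pdf(x, loc=0, scale=1)                     # N(0,1)

def p2(x):
    # 0.6*N(1,1) + 0.4*N(-1,2); N(-1,2) taken as variance 2 (assumption)
    return (0.6 * norm.pdf(x, loc=1, scale=1)
            + 0.4 * norm.pdf(x, loc=-1, scale=np.sqrt(2)))

P1, P2 = 0.5, 0.5  # equal priors, as in the example

def classify(x):
    # Likelihood ratio test: decide w1 if L(x) > P(w2)/P(w1)
    return 1 if p1(x) / p2(x) > P2 / P1 else 2

print([classify(x) for x in (-2.0, 0.0, 2.0)])  # [2, 1, 2]
```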
Probability of Error
• P(e|x) = 1 − max_j P(wj|x), i.e. the conditional error is minimized when P(wj|x) is maximum
• The average probability of error is
  P(e) = ∫ P(e|x) p(x) dx
• For every x, we ensure that P(e|x) is minimum, so that the integral is as small as possible
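Continuing the illustrative example, the Bayes error can be estimated numerically by integrating the smaller of the weighted densities (this reuses the densities p1, p2 and priors P1, P2 from the sketch above):

```python
import numpy as np

x = np.linspace(-10, 10, 20001)
# P(e|x) p(x) = min over classes of p(x|wi) P(wi)
integrand = np.minimum(p1(x) * P1, p2(x) * P2)
bayes_error = np.trapz(integrand, x)
print(bayes_error)  # minimum achievable error probability for this example
```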
Conditional Risk & Bayes' Risk
• Loss λ(ai|wj): a measure of the cost of taking action ai when the true class is wj
• Conditional Risk: the expected loss of taking action ai given x is
  R(ai|x) = Σj λ(ai|wj) P(wj|x)
• To minimize the average probability of error (the zero-one loss case), choose the i that maximizes the a posteriori probability P(wi|x)
• If the action a(x) is chosen such that for every x the conditional risk R(a(x)|x) is minimized, then the overall risk is minimized, and the resulting minimum overall risk is called the Bayes' risk
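A minimal sketch of risk-based classification with an asymmetric loss matrix (the loss values and the posterior vector are illustrative assumptions):

```python
import numpy as np

# lam[i][j] = loss for taking action a_i when the true class is w_j
lam = np.array([[0.0, 10.0],   # deciding w1: very costly if truth is w2
                [1.0,  0.0]])  # deciding w2: mildly costly if truth is w1

posterior = np.array([0.8, 0.2])  # P(w1|x), P(w2|x) for some x (illustrative)

# Conditional risk R(a_i|x) = sum_j lam[i,j] P(w_j|x)
risks = lam @ posterior
print(risks, np.argmin(risks))  # [2.0, 0.8] -> decide w2 despite P(w1|x) > P(w2|x)
```

With a zero-one loss matrix instead, minimizing the conditional risk reduces to the MAP rule from the previous slides.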
Bayes Decision Rule - Reject Option
• Partition the sample space into 2 regions:
  - Acceptance region A: the set of x where max_i P(wi|x) exceeds a threshold t
  - Reject region R: the set of x where it does not
Figure: posteriors P(w1|x) and P(w2|x) plotted over x with levels t and 1 − t marked; in the central region neither posterior exceeds t, giving the layout A, R, A along the x axis.
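A sketch of the reject rule (the threshold value is an illustrative assumption; in practice the posteriors would come from Bayes' theorem as above):

```python
def decide_with_reject(posteriors, t=0.9):
    """Return the MAP class index, or None to reject.

    Rejects when the largest posterior fails to exceed the threshold t
    (t = 0.9 is an illustrative choice).
    """
    best = max(range(len(posteriors)), key=lambda i: posteriors[i])
    return best if posteriors[best] > t else None

print(decide_with_reject([0.95, 0.05]))  # 0: confident, accept
print(decide_with_reject([0.55, 0.45]))  # None: ambiguous, reject
```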
Discussion
• In principle, the Bayes decision rule is optimal with respect to minimizing the classification error
• It assumes knowledge of the underlying class-conditional probability density functions of the feature vectors for each class
  - The pdfs are usually unknown and have to be estimated from a set of correctly classified samples, i.e. training
• An alternative approach is to develop decision rules that use the data to estimate the decision boundaries directly, without explicit calculation of the pdfs
Linear Discriminant Functions
• A discriminant function is a function of the pattern x that leads to a classification rule
• The form of the discriminant function is specified and is not imposed by the underlying distribution
• When g(x) is linear, the decision surface is a hyperplane
  e.g. for a 2-class case, we seek a weight vector w and threshold wo such that
  g(x) = wᵀx + wo > 0 ⇒ decide w1;  g(x) < 0 ⇒ decide w2
Linear Discriminant Functions
If x1 and x2 are both on the decision surface, then
  wᵀx1 + wo = wᵀx2 + wo = 0, i.e. wᵀ(x1 − x2) = 0
so the weight vector w is normal to vectors lying in the hyperplane g = 0. The hyperplane's distance from the origin is |wo|/||w||, and the distance of a pattern x from the hyperplane is
  r = g(x) / ||w||
The value of the discriminant function for a pattern x is therefore a measure of its distance from the hyperplane.
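A small numerical sketch of these relations (the weight vector, threshold and test point are illustrative):

```python
import numpy as np

w = np.array([3.0, 4.0])   # illustrative weight vector, ||w|| = 5
w0 = -5.0                  # illustrative threshold

def g(x):
    return w @ x + w0      # linear discriminant g(x) = w^T x + wo

x = np.array([3.0, 4.0])
print("class:", 1 if g(x) > 0 else 2)         # g(x) = 20 > 0 -> decide w1
print("distance:", g(x) / np.linalg.norm(w))  # signed distance r = g(x)/||w|| = 4.0
```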
Linear Machine
• A pattern classifier using linear discriminant functions is called a linear machine
• e.g. the minimum distance classifier (a nearest neighbour rule with one prototype per class): suppose we are given a set of prototype points p1, …, pC, one for each of the C classes w1, …, wC. The minimum distance classifier assigns a pattern x to the class wi associated with the nearest point pi
• For each prototype, the squared Euclidean distance is
  ||x − pi||² = xᵀx − 2xᵀpi + piᵀpi
  Since xᵀx is the same for every class, minimum distance classification is achieved by comparing xᵀpi − ½piᵀpi and choosing the largest value. So we can define the linear discriminants as
  gi(x) = xᵀpi − ½piᵀpi
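A sketch of the minimum distance classifier written as a linear machine (the prototypes and test point are illustrative):

```python
import numpy as np

prototypes = np.array([[0.0, 0.0],   # p1 for class w1 (illustrative)
                       [4.0, 0.0],   # p2 for class w2
                       [0.0, 4.0]])  # p3 for class w3

def classify(x):
    # Linear discriminants g_i(x) = x^T p_i - 0.5 * p_i^T p_i;
    # the largest g_i corresponds to the nearest prototype.
    scores = prototypes @ x - 0.5 * np.sum(prototypes ** 2, axis=1)
    return int(np.argmax(scores))

x = np.array([3.0, 1.0])
print(classify(x))  # 1 -> class w2, whose prototype p2 is nearest to x
```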
Linear Discriminant Functions
• The decision boundaries are assumed to be linear. The discriminant function divides the feature space by a hyperplane whose orientation is determined by the weight vector w and whose distance from the origin is determined by the threshold wo
• Different optimization schemes lead to different methods, such as the perceptron, Fisher's linear discriminant function and support vector machines
• Linear combinations of nonlinear functions serve as a stepping stone to nonlinear models
References
• H. Stark and J. W. Woods, "Probability and Random Processes with Applications to Signal Processing", 3rd Edition, Pearson Education Asia, 2002.
• R. O. Duda, P. E. Hart and D. G. Stork, "Pattern Classification", John Wiley & Sons, Inc., 2001.
• A. Webb, "Statistical Pattern Recognition", John Wiley & Sons, Ltd., 2005.