Maximum Likelihood Parameter Estimation COMPE 467 - Pattern Recognition
Parameter Estimation • In previous chapters: • We could design an optimal classifier if we knew the prior probabilities P(ωi) and the class-conditional probabilities P(x|ωi) • Unfortunately, in pattern recognition applications we rarely have this kind of complete knowledge about the probabilistic structure of the problem.
Parameter Estimation • We have a number of design samples or training data. • The problem is to find some way to use this information to design or train the classifier. • One approach: • Use the samples to estimate the unknown probabilities and probability densities, • And then use the resulting estimates as if they were the true values.
Parameter Estimation • What are you going to estimate in your homework?
Parameter Estimation • If we know the number of parameters in advance and our general knowledge about the problem permits us to parameterize the conditional densities, then the severity of the problem can be reduced significantly.
Parameter Estimation • For example: • We can reasonably assume that p(x|ωi) is a normal density with mean µi and covariance matrix Σi, • We do not know the exact values of these quantities, • However, this knowledge simplifies the problem from one of estimating an unknown function p(x|ωi) to one of estimating the parameters: the mean µi and the covariance matrix Σi • Estimating p(x|ωi) reduces to estimating µi and Σi (see the sketch below)
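A minimal sketch of this idea in Python (the data, seed, and function name are illustrative, not from the course material): given training samples from one class, estimate the mean vector and covariance matrix and use them as if they were the true parameters of a Gaussian class-conditional density.

```python
import numpy as np

def estimate_gaussian_params(X):
    """Estimate (mu, Sigma) of an assumed Gaussian density from samples.

    X: (n_samples, n_features) array of training samples from one class.
    """
    mu = X.mean(axis=0)                          # estimate of the mean vector
    Sigma = np.cov(X, rowvar=False, bias=True)   # ML (divide-by-n) covariance estimate
    return mu, Sigma

# Illustrative data: 100 two-dimensional samples from a known Gaussian
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.3], [0.3, 1.0]], size=100)
mu_hat, Sigma_hat = estimate_gaussian_params(X)
print(mu_hat)
print(Sigma_hat)
```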
Parameter Estimation • Data availability in a Bayesian framework • We could design an optimal classifier if we knew: • P(ωi) (priors) • P(x | ωi) (class-conditional densities) • Unfortunately, we rarely have this complete information! • Design a classifier from a training sample • No problem with prior estimation • Samples are often too small for class-conditional estimation (large dimension of feature space!)
Parameter Estimation • Given a bunch of data from each class, how do we estimate the parameters of the class-conditional densities P(x | ωj)? • Ex: P(x | ωj) = N(µj, Σj) is Normal, with parameters θj = (µj, Σj)
Two major approaches • Maximum-Likelihood Method • Bayesian Method • Use P(ωi | x) for our classification rule! • Results are nearly identical, but the approaches are different
Maximum-Likelihood vs. Bayesian • Maximum Likelihood: • Parameters are fixed but unknown! • Best parameters are obtained by maximizing the probability of obtaining the samples observed • Bayes: • Parameters are random variables having some known distribution • Best parameters are obtained by estimating them given the data
Major assumptions • A priori information P(ωi) for each category is available • Samples are i.i.d. and P(x | ωi) is Normal: P(x | ωi) ~ N(µi, Σi) • Note: characterized by 2 parameters
Maximum-Likelihood Estimation • Has good convergence properties as the sample size increases • Often simpler than alternative techniques
Maximum-Likelihood Estimation • General principle • Assume we have c classes • The samples in class j have been drawn according to the probability law p(x | ωj): p(x | ωj) ~ N(µj, Σj) • We write p(x | ωj) ≡ p(x | ωj, θj), where θj = (µj, Σj) is the parameter vector of class j
Maximum-Likelihood Estimation • Our problem is to use the information provided by the training samples to obtain good estimates for the unknown parameter vectors associated with each category.
Likelihood Function • Use the information provided by the training samples to estimate θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated with one category • Suppose that D contains n samples, x1, x2, …, xn
Goal: find an estimate θ̂ • Find the θ̂ which maximizes P(D | θ) • Since the samples are i.i.d., P(D | θ) = ∏k P(xk | θ) • "It is the value of θ that best agrees with the actually observed training sample"
Maximize the log-likelihood function: l(θ) = ln P(D | θ) • Let θ = (θ1, θ2, …, θp)t and let ∇θ be the gradient operator • New problem statement: determine the θ̂ that maximizes the log-likelihood, θ̂ = arg maxθ l(θ) (a numerical sketch follows)
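As a hedged illustration of this recipe (the data and the known σ = 1 are invented for the example; scipy is used only as a generic numerical optimizer), one can form the negative log-likelihood and minimize it; for a Gaussian with unknown mean, the numerical answer should coincide with the sample mean derived on the next slides.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Illustrative data: one-dimensional samples with known sigma = 1, unknown mean
rng = np.random.default_rng(1)
samples = rng.normal(loc=3.0, scale=1.0, size=200)

def neg_log_likelihood(mu):
    # -l(mu) = -sum_k ln p(x_k | mu)
    return -np.sum(norm.logpdf(samples, loc=mu, scale=1.0))

result = minimize_scalar(neg_log_likelihood)
print("mu_hat =", result.x)
print("sample mean =", samples.mean())   # should agree with mu_hat
```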
Example of a specific case: unknown µ • p(xk | µ) ~ N(µ, Σ) (samples are drawn from a multivariate normal population) • θ = µ, therefore l(µ) = ∑k ln p(xk | µ) • The ML estimate for µ must satisfy ∇µ l(µ) = ∑k Σ⁻¹(xk − µ̂) = 0
Multiplying by Σ and rearranging, we obtain: µ̂ = (1/n) ∑k xk • Just the arithmetic average of the training samples! • Conclusion: if p(xk | ωj) (j = 1, 2, …, c) is assumed to be Gaussian in a d-dimensional feature space, then we can estimate the parameter vector θ = (θ1, θ2, …, θc)t and perform an optimal classification!
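A small check of this result on synthetic data (the particular mean and covariance are made up for the example): the ML estimate of the unknown mean is simply the arithmetic average of the samples.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mean=[0.0, 5.0, -1.0], cov=np.eye(3), size=500)

mu_hat = X.mean(axis=0)   # mu_hat = (1/n) * sum_k x_k
print(mu_hat)             # should be close to [0, 5, -1]
```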
ML Estimation • Gaussian case: unknown µ and σ² • θ = (θ1, θ2) = (µ, σ²)
Summation: setting the derivatives of l(θ) with respect to θ1 = µ and θ2 = σ² to zero gives two equations, (1) and (2). Combining (1) and (2), one obtains: µ̂ = (1/n) ∑k xk and σ̂² = (1/n) ∑k (xk − µ̂)²
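A quick sketch of these two estimates on invented univariate data; note that the ML variance estimate divides by n, not n − 1.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=4.0, scale=2.0, size=1000)   # illustrative samples

mu_hat = x.mean()                       # (1/n) * sum_k x_k
sigma2_hat = np.mean((x - mu_hat)**2)   # (1/n) * sum_k (x_k - mu_hat)^2  (biased ML estimate)
print(mu_hat, sigma2_hat)               # close to 4 and 4 (= 2^2)
```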
Example 1: Consider an exponential distribution (single feature, single parameter): p(x | θ) = θ e^(−θx) for x ≥ 0, and 0 otherwise. Given a random sample x1, x2, …, xn, estimate θ.
Example 1 (continued): The log-likelihood is l(θ) = n ln θ − θ ∑k xk; setting dl/dθ = 0 gives θ̂ = n / ∑k xk = 1 / x̄, valid as long as the sample average is positive (the inverse of the average).
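A brief check of this result on synthetic exponential data (the true θ = 0.5 is invented for the example):

```python
import numpy as np

rng = np.random.default_rng(4)
true_theta = 0.5
x = rng.exponential(scale=1.0 / true_theta, size=10_000)  # p(x|theta) = theta * exp(-theta * x)

theta_hat = 1.0 / x.mean()   # theta_hat = n / sum_k x_k
print(theta_hat)             # should be close to 0.5
```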
Example 2: Multivariate Gaussian with unknown mean vector M; assume the covariance matrix Σ is known. Given k i.i.d. samples X1, X2, …, Xk from the same distribution, maximizing the log-likelihood (a little linear algebra) gives the ML estimate M̂ = (1/k) ∑i Xi.
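A sketch for Example 2 with invented data: knowing Σ does not change the estimate of the mean, which is still the sample average.

```python
import numpy as np

rng = np.random.default_rng(5)
Sigma = np.array([[1.0, 0.2],
                  [0.2, 2.0]])                        # assumed known covariance
X = rng.multivariate_normal(mean=[2.0, -3.0], cov=Sigma, size=400)

M_hat = X.mean(axis=0)    # M_hat = (1/k) * sum_i X_i
print(M_hat)              # close to [2, -3]
```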
Example 3: Binary variables x1, x2, …, xn with unknown parameters θ1, θ2, …, θn, where θi = P(xi = 1) (n parameters). How do we estimate these? Think about your homework.
Example 3: Binary variables with unknown parameters θ1, θ2, …, θn (n parameters). So, given k samples, θ̂i = (1/k) ∑j xi(j), where xi(j) denotes the i-th element of sample j.
So, • θ̂i is the sample average of the i-th feature.
Since xi is binary, taking the sample average is the same as counting the occurrences of '1'. Consider a character recognition problem with binary matrices: for each pixel, count the number of 1's over the training images of a class and divide by the number of images; this is the estimate of θi. [Figure: example binary pixel matrices for characters 'A' and 'B']
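A sketch of this counting estimate on hypothetical binary character data (the 7×5 image size, sample count, and per-pixel probabilities are invented):

```python
import numpy as np

rng = np.random.default_rng(6)
true_theta = rng.random(35)                            # hypothetical per-pixel P(pixel = 1), 7x5 image
X = (rng.random((50, 35)) < true_theta).astype(int)    # 50 binary training images of one character

theta_hat = X.mean(axis=0)        # fraction of 1's at each pixel = ML estimate of theta_i
print(np.round(theta_hat.reshape(7, 5), 2))
```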
References • R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, New York: John Wiley, 2001. • Selim Aksoy, Pattern Recognition Course Materials, 2011. • M. Narasimha Murty, V. Susheela Devi, Pattern Recognition an Algorithmic Approach, Springer, 2011. • “Bayes Decision Rule Discrete Features”, http://www.cim.mcgill.ca/~friggi/bayes/decision/binaryfeatures.html, access: 2011.