Chapter 3: Maximum-Likelihood Parameter Estimation

Sec 1: Introduction
Sec 2: Maximum-Likelihood Estimation
• Multivariate Case: unknown μ, known Σ
• Univariate Case: unknown μ and unknown σ²
• Multivariate Case: unknown μ and unknown Σ
• Bias
• Maximum-Likelihood Problem Statement
Sec 5.1: When Do ML and Bayes Methods Differ?
Sec 7: Problems of Dimensionality
Sec 1: Introduction • Data availability in a Bayesian framework • We could design an optimal classifier if we knew: • P(ωi) (priors) • P(x | ωi) (class-conditional densities) Unfortunately, we rarely have this complete information! • Design a classifier from a training sample • Estimating the priors presents no problem • The number of samples is often too small for class-conditional density estimation (large dimension of feature space!) Pattern Classification, Chapter 3
A priori information about the problem • Normality of P(x | ωi): P(x | ωi) ~ N(μi, Σi) • Characterized by the parameters μi and Σi • Estimation techniques • Maximum-Likelihood and Bayesian estimation • Results are nearly identical, but the approaches are different • We will not cover Bayesian estimation in detail Pattern Classification, Chapter 3
In Maximum-Likelihood estimation, the parameters are assumed to be fixed but unknown! • The best parameters are obtained by maximizing the probability of obtaining the samples observed • Bayesian methods view the parameters as random variables having some known prior distribution • In either approach, we use P(ωi | x) for our classification rule! Pattern Classification, Chapter 3
Sec 2: Maximum-Likelihood Estimation • Has good convergence properties as the sample size increases • Simpler than alternative techniques • General principle • Assume we have c classes and P(x | ωj) ~ N(μj, Σj) • Write P(x | ωj) ≡ P(x | ωj, θj), where the parameter vector θj = (μj, Σj) holds the mean and covariance of class j Pattern Classification, Chapter 3
Use the information provided by the training samples to estimate θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated with one category • Suppose that D contains n samples, x1, x2, …, xn, and simplify the notation by omitting the class distinctions • The Maximum-Likelihood estimate of θ is, by definition, the value θ̂ that maximizes P(D | θ): "It is the value of θ that best agrees with the actually observed training samples" Pattern Classification, Chapter 3
Figure: the likelihood P(D | μ) and the log-likelihood l(μ) plotted as functions of the unknown mean μ (σ fixed); the training data appear as red dots, and both curves peak at the maximum-likelihood estimate μ̂. Pattern Classification, Chapter 3
Optimal estimation • Let θ = (θ1, θ2, …, θp)^t and let ∇θ be the gradient operator with respect to θ • We define l(θ) as the log-likelihood function: l(θ) = ln P(D | θ) • New problem statement: determine the θ that maximizes the log-likelihood Pattern Classification, Chapter 3
Because the samples are drawn independently, l(θ) = ln P(D | θ) = ∑_{k=1}^{n} ln P(xk | θ), where n = number of training samples • The set of necessary conditions for an optimum is: ∇θ l = ∑_{k=1}^{n} ∇θ ln P(xk | θ) = 0 Pattern Classification, Chapter 3
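As an illustration of the principle, here is a minimal Python/NumPy sketch (not from the text; the data values and the known σ are made up) that evaluates the log-likelihood l(μ) = ∑k ln P(xk | μ) of a univariate Gaussian with known σ over a grid of candidate means and confirms that the maximum sits at the sample mean:

import numpy as np

# Hypothetical 1-D training data and a known standard deviation
x = np.array([2.1, 2.9, 3.4, 1.8, 2.6])
sigma = 1.0

def log_likelihood(mu, x, sigma):
    # l(mu) = sum_k ln P(x_k | mu) for a univariate Gaussian N(mu, sigma^2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

# Evaluate l(mu) on a grid of candidate means
grid = np.linspace(0.0, 5.0, 1001)
values = np.array([log_likelihood(m, x, sigma) for m in grid])

mu_hat = grid[np.argmax(values)]
print(mu_hat, x.mean())   # the grid maximizer agrees with the sample mean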
Multivariate Gaussian: unknown μ, known Σ • Samples are drawn from a multivariate Gaussian population: P(xk | μ) ~ N(μ, Σ), with θ = μ • ln P(xk | μ) = -(1/2) ln[(2π)^d |Σ|] - (1/2)(xk - μ)^t Σ^-1 (xk - μ) • ∇μ ln P(xk | μ) = Σ^-1 (xk - μ), so the ML estimate for μ must satisfy: ∑_{k=1}^{n} Σ^-1 (xk - μ̂) = 0 Pattern Classification, Chapter 3
Multiplying by Σ and rearranging, we obtain: μ̂ = (1/n) ∑_{k=1}^{n} xk, which is just the arithmetic average of the training samples! Conclusion: if P(xk | ωj) (j = 1, 2, …, c) is assumed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ1, θ2, …, θc)^t and perform optimal classification! Pattern Classification, Chapter 3
Univariate Gaussian: unknown μ, unknown σ² • Samples are drawn from a univariate Gaussian population: P(xk | μ, σ²) ~ N(μ, σ²), with θ = (θ1, θ2) = (μ, σ²) • ln P(xk | θ) = -(1/2) ln(2π θ2) - (xk - θ1)² / (2 θ2) • ∇θ ln P(xk | θ) = [ (xk - θ1)/θ2 , -1/(2 θ2) + (xk - θ1)²/(2 θ2²) ]^t Pattern Classification, Chapter 3
Summing the gradient over the n samples and setting it to zero gives the likelihood equations: (1) ∑_{k=1}^{n} (xk - θ̂1)/θ̂2 = 0 and (2) -∑_{k=1}^{n} 1/θ̂2 + ∑_{k=1}^{n} (xk - θ̂1)²/θ̂2² = 0 • Combining (1) and (2), one obtains: μ̂ = (1/n) ∑_{k=1}^{n} xk and σ̂² = (1/n) ∑_{k=1}^{n} (xk - μ̂)² Pattern Classification, Chapter 3
Multivariate Gaussian: Maximum-Likelihood estimates for μ and Σ • The Maximum-Likelihood estimate for μ is: μ̂ = (1/n) ∑_{k=1}^{n} xk • The Maximum-Likelihood estimate for Σ is: Σ̂ = (1/n) ∑_{k=1}^{n} (xk - μ̂)(xk - μ̂)^t Pattern Classification, Chapter 3
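A short NumPy sketch of these estimates (the data matrix is made up for illustration); note that np.cov with ddof=0 uses the same 1/n normalization as the ML estimate:

import numpy as np

# Hypothetical training samples: n = 5 points in d = 2 dimensions
X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [0.8, 2.4],
              [1.2, 2.2],
              [1.1, 1.9]])
n = X.shape[0]

mu_hat = X.mean(axis=0)            # ML estimate of the mean: (1/n) sum_k x_k
diff = X - mu_hat
sigma_hat = (diff.T @ diff) / n    # ML estimate of the covariance (1/n normalization)

# np.cov expects variables in rows; ddof=0 reproduces the 1/n normalization
assert np.allclose(sigma_hat, np.cov(X.T, ddof=0))
print(mu_hat)
print(sigma_hat)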
Bias • The Maximum-Likelihood estimate for σ² is biased: E[(1/n) ∑_{i=1}^{n} (xi - x̄)²] = ((n-1)/n) σ² ≠ σ² • An elementary unbiased estimator for Σ is the sample covariance matrix: C = (1/(n-1)) ∑_{k=1}^{n} (xk - μ̂)(xk - μ̂)^t Pattern Classification, Chapter 3
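A small simulation sketch of this bias (purely illustrative; the true variance, sample size, and number of trials are made up): averaging the 1/n estimate over many samples of size n recovers roughly ((n-1)/n)·σ², while the 1/(n-1) estimator averages to σ².

import numpy as np

rng = np.random.default_rng(0)
sigma2_true, n, trials = 4.0, 5, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2_true), size=(trials, n))
ml_est = samples.var(axis=1, ddof=0)        # 1/n normalization (ML, biased)
unbiased_est = samples.var(axis=1, ddof=1)  # 1/(n-1) normalization (unbiased)

print(ml_est.mean())        # close to (n-1)/n * 4.0 = 3.2
print(unbiased_est.mean())  # close to 4.0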
Maximum-Likelihood Problem Statement • Let D = {x1, x2, …, xn} • Because the samples are drawn independently: P(x1, …, xn | θ) = ∏_{k=1}^{n} P(xk | θ) • Our goal is to determine θ̂, the value of θ that makes this sample the most representative! Pattern Classification, Chapter 3
Figure: the training set D of |D| = n samples is partitioned by class into subsets D1, …, Dk, …, Dc, where the samples in Dj are drawn from the class-conditional density P(x | ωj) ~ N(μj, Σj). Pattern Classification, Chapter 3
θ = (θ1, θ2, …, θc) • Problem: find θ̂ such that: P(D | θ̂) = maxθ P(D | θ) Pattern Classification, Chapter 3
Sec 5.1: When Do Maximum-Likelihood and Bayes Methods Differ? • In practice, their results rarely differ • ML is less complex and easier to understand • Sources of system classification error • Bayes Error: error due to overlapping densities for different classes (inherent error, can never be eliminated) • Model Error: error due to having an incorrect model • Estimation Error: error from estimating parameters from a finite sample Pattern Classification, Chapter 3
Sec 7: Problems of Dimensionality • Accuracy, Dimension, and Training Sample Size • Classification accuracy depends upon the dimensionality and the amount of training data • Case of two classes, multivariate normal with the same covariance Σ: the Bayes error is P(e) = (1/√(2π)) ∫_{r/2}^{∞} e^{-u²/2} du, where r² = (μ1 - μ2)^t Σ^-1 (μ1 - μ2) is the squared Mahalanobis distance between the class means Pattern Classification, Chapter 3
If the features are independent, so that Σ = diag(σ1², …, σd²), then: r² = ∑_{i=1}^{d} ((μ1i - μ2i)/σi)² • The most useful features are those for which the difference between the class means is large relative to the standard deviation • Since each added feature can only increase r, it appears that adding new features should improve accuracy • Yet it has frequently been observed in practice that, beyond a certain point, including additional features leads to worse rather than better performance: we have the wrong model! Pattern Classification, Chapter 3
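A minimal sketch of these two formulas (the class means and per-feature standard deviations are made up for illustration): it computes r under the independence assumption and the corresponding Bayes error, using 0.5·erfc(r/(2√2)) as an equivalent form of the integral above.

import numpy as np
from math import erfc, sqrt

# Hypothetical class means and shared per-feature standard deviations
mu1 = np.array([0.0, 0.0, 0.0])
mu2 = np.array([1.0, 0.5, 2.0])
sigma = np.array([1.0, 1.0, 2.0])

# Independent features: r^2 = sum_i ((mu1_i - mu2_i) / sigma_i)^2
r = np.sqrt(np.sum(((mu1 - mu2) / sigma) ** 2))

# Bayes error: P(e) = (1/sqrt(2*pi)) * integral from r/2 to inf of exp(-u^2/2) du
#                   = 0.5 * erfc(r / (2 * sqrt(2)))
p_error = 0.5 * erfc(r / (2 * sqrt(2)))
print(r, p_error)   # r = 1.5, P(e) ≈ 0.227 for these made-up values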
Computational Complexity • Maximum-Likelihood Estimation • Gaussian priors in d feature dimensions, with n training samples for each of c classes • For each category, we have to compute the discriminant function; the dominant cost is estimating the d×d sample covariance from n samples, giving O(d²n) per class • Total for c classes = O(cd²n) • Costs grow quickly when d and n are large! Pattern Classification, Chapter 3
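To make the counting concrete, here is a hedged NumPy sketch (the class data and prior values are made up, and g is the standard Gaussian discriminant gi(x) = -(1/2)(x - μ̂)^t Σ̂^-1 (x - μ̂) - (d/2) ln 2π - (1/2) ln|Σ̂| + ln P(ωi)); estimating Σ̂ from n samples is the O(d²n) step per class.

import numpy as np

def fit_gaussian(X):
    # ML estimates: O(d*n) for the mean, O(d^2 * n) for the covariance
    mu = X.mean(axis=0)
    diff = X - mu
    sigma = (diff.T @ diff) / X.shape[0]
    return mu, sigma

def discriminant(x, mu, sigma, prior):
    # g(x) = -1/2 (x-mu)^t Sigma^-1 (x-mu) - d/2 ln(2*pi) - 1/2 ln|Sigma| + ln P(w)
    d = len(mu)
    diff = x - mu
    inv = np.linalg.inv(sigma)              # O(d^3), independent of n
    _, logdet = np.linalg.slogdet(sigma)
    return (-0.5 * diff @ inv @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * logdet
            + np.log(prior))

# Hypothetical two-class training data in d = 2 dimensions
rng = np.random.default_rng(1)
X1 = rng.normal([0.0, 0.0], 1.0, size=(100, 2))
X2 = rng.normal([2.0, 2.0], 1.0, size=(100, 2))

params = [fit_gaussian(X) for X in (X1, X2)]
x = np.array([1.6, 1.9])
scores = [discriminant(x, mu, s, 0.5) for mu, s in params]
print(np.argmax(scores))   # predicted class index (likely 1 for this point)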
Overfitting • The number of training samples n can be inadequate for estimating all the parameters • What to do? • Simplify the model: reduce the number of parameters • Assume all classes share the same covariance matrix • Perhaps even a diagonal covariance (zero off-diagonal elements) or a scaled identity matrix • Assume statistical independence of the features • Reduce the number of features d • Principal Component Analysis, etc. (a sketch follows below) Pattern Classification, Chapter 3
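A minimal PCA sketch for reducing d (the data below are made up; the function keeps the k leading principal components via an eigendecomposition of the sample covariance matrix):

import numpy as np

def pca_reduce(X, k):
    # Center the data, then project onto the k leading eigenvectors
    # of the sample covariance matrix.
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = (Xc.T @ Xc) / (X.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # d x k matrix of leading eigenvectors
    return Xc @ top                                   # n x k reduced representation

# Hypothetical data: n = 200 samples in d = 10 dimensions, reduced to k = 3
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
Z = pca_reduce(X, k=3)
print(Z.shape)   # (200, 3)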