Chapter 3: Maximum-Likelihood Parameter Estimation


Presentation Transcript


  1. Chapter 3: Maximum-Likelihood Parameter Estimation • Sec 1: Introduction • Sec 2: Maximum-Likelihood Estimation • Multivariate Case: unknown μ, known Σ • Univariate Case: unknown μ and unknown σ² • Multivariate Case: unknown μ and unknown Σ • Bias • Maximum-Likelihood Problem Statement • Sec 5.1: When Do ML and Bayes Methods Differ? • Sec 7: Problems of Dimensionality

  2. Sec 1: Introduction • Data availability in a Bayesian framework • We could design an optimal classifier if we knew: • P(ωi) (priors) • P(x | ωi) (class-conditional densities) • Unfortunately, we rarely have this complete information! • Instead, we design the classifier from a training sample • Prior estimation poses no problem • The number of samples is often too small for class-conditional density estimation (the feature space has large dimension!)

  3. A priori information about the problem • Normality of P(x | ωi): P(x | ωi) ~ N(μi, Σi) • Characterized by the parameters μi and Σi • Estimation techniques: Maximum-Likelihood and Bayesian estimation • The results are nearly identical, but the approaches are different • We will not cover the details of Bayesian estimation

  4. Parameters in Maximum-Likelihood estimation are assumed fixed but unknown! • The best parameters are obtained by maximizing the probability of obtaining the samples actually observed • Bayesian methods view the parameters as random variables having some known prior distribution • In either approach, we use P(ωi | x) for our classification rule!

  5. Sec 2: Maximum-Likelihood Estimation • Has good convergence properties as the sample size increases • Simpler than alternative techniques • General principle: assume we have c classes and P(x | ωj) ~ N(μj, Σj) • P(x | ωj) ≡ P(x | ωj, θj), where θj = (μj, Σj) consists of the components of μj and Σj

  6. Use the information provided by the training samples to estimate θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated with one category • Suppose D contains n samples, x1, x2, …, xn, and simplify notation by omitting class distinctions • The Maximum-Likelihood estimate of θ is, by definition, the value θ̂ that maximizes P(D | θ): "It is the value of θ that best agrees with the actually observed training sample"

  7. Figure: the likelihood P(D | θ) and the log-likelihood l(θ) plotted as functions of the unknown mean μ (σ fixed, θ = μ); the red dots mark the training data, and both curves peak at the same value θ̂.
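A minimal numerical sketch of what this figure shows (illustrative only; the data and grid below are made up): for a univariate Gaussian with σ held fixed, evaluate the likelihood P(D | μ) and the log-likelihood l(μ) over a grid of candidate means; both curves peak at the same value, which is the sample mean.

```python
import numpy as np

# Hypothetical training data (the "red dots"); sigma is assumed known and fixed.
rng = np.random.default_rng(0)
sigma = 1.0
data = rng.normal(loc=2.0, scale=sigma, size=8)

# Candidate values of the unknown mean mu (theta = mu here).
mus = np.linspace(0.0, 4.0, 401)

def log_likelihood(mu, x, sigma):
    """l(mu) = sum_k ln P(x_k | mu) for a univariate Gaussian with known sigma."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

ll = np.array([log_likelihood(m, data, sigma) for m in mus])
likelihood = np.exp(ll)          # P(D | mu), the curve in the upper panel

# Both curves peak at the same mu-hat, which equals the sample mean.
print(mus[np.argmax(likelihood)], mus[np.argmax(ll)], data.mean())
```

Because the logarithm is monotonic, maximizing l(θ) and maximizing P(D | θ) yield the same θ̂, which is why the following slides work with the log-likelihood.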

  8. Optimal estimation • Let θ = (θ1, θ2, …, θp)^t and let ∇θ = (∂/∂θ1, …, ∂/∂θp)^t be the gradient operator • We define l(θ) as the log-likelihood function: l(θ) = ln P(D | θ) • New problem statement: determine the θ̂ that maximizes the log-likelihood

  9. The set of necessary conditions for an optimum is ∇θ l = ∑_{k=1}^{n} ∇θ ln P(xk | θ) = 0, where n = number of training samples

  10. Multivariate Gaussian: unknown μ, known Σ • Samples drawn from a multivariate Gaussian population: P(xk | μ) ~ N(μ, Σ), so θ = μ • ln P(xk | μ) = -(1/2) ln[(2π)^d |Σ|] - (1/2)(xk - μ)^t Σ⁻¹ (xk - μ), and ∇μ ln P(xk | μ) = Σ⁻¹(xk - μ) • The ML estimate for μ must therefore satisfy ∑_{k=1}^{n} Σ⁻¹(xk - μ̂) = 0

  11. Multiplying by Σ and rearranging, we obtain μ̂ = (1/n) ∑_{k=1}^{n} xk, which is just the arithmetic average of the training samples! • Conclusion: if P(xk | ωj) (j = 1, 2, …, c) is assumed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ1, θ2, …, θc)^t and perform an optimal classification!
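As a quick sanity check (a sketch with synthetic data, not part of the original slides), one can verify numerically that the arithmetic average satisfies the condition ∑_k Σ⁻¹(xk - μ̂) = 0 from the previous slide:

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])            # known covariance (assumed for the example)
mu_true = np.array([1.0, -2.0])
X = rng.multivariate_normal(mu_true, Sigma, size=500)   # n x d training sample

mu_hat = X.mean(axis=0)                   # ML estimate: the arithmetic average

# Gradient condition: sum_k Sigma^{-1} (x_k - mu_hat) should be zero (up to round-off).
grad = np.linalg.inv(Sigma) @ (X - mu_hat).sum(axis=0)
print(mu_hat, grad)
```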

  12. Univariate Gaussian: unknown μ, unknown σ² • Samples drawn from a univariate Gaussian population: P(xk | μ, σ²) ~ N(μ, σ²), so θ = (θ1, θ2) = (μ, σ²) • ln P(xk | θ) = -(1/2) ln(2πθ2) - (xk - θ1)²/(2θ2), and setting ∇θ l = 0 gives the pair of conditions (1) ∑_{k=1}^{n} (xk - θ̂1)/θ̂2 = 0 and (2) -∑_{k=1}^{n} 1/θ̂2 + ∑_{k=1}^{n} (xk - θ̂1)²/θ̂2² = 0

  13. Summing over the n samples and combining (1) and (2), one obtains: μ̂ = (1/n) ∑_{k=1}^{n} xk and σ̂² = (1/n) ∑_{k=1}^{n} (xk - μ̂)²
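In NumPy, these two univariate estimates are one-liners; the sample values below are made up for illustration:

```python
import numpy as np

x = np.array([2.1, 1.7, 2.4, 3.0, 2.2, 1.9])   # assumed 1-D training sample
mu_hat = x.mean()                               # (1/n) * sum_k x_k
sigma2_hat = ((x - mu_hat) ** 2).mean()         # (1/n) * sum_k (x_k - mu_hat)^2
print(mu_hat, sigma2_hat)                       # sigma2_hat equals x.var() (ddof=0)
```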

  14. Multivariate Gaussian: Maximum-Likelihood estimates for μ and Σ • The Maximum-Likelihood estimate for μ is μ̂ = (1/n) ∑_{k=1}^{n} xk • The Maximum-Likelihood estimate for Σ is Σ̂ = (1/n) ∑_{k=1}^{n} (xk - μ̂)(xk - μ̂)^t
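A short sketch of the multivariate estimates (synthetic, assumed two-dimensional data): the ML mean is the sample average, and the ML covariance is the 1/n-normalized scatter matrix of the centered samples.

```python
import numpy as np

def ml_gaussian(X):
    """ML estimates (mu_hat, Sigma_hat) for one class; X has shape (n, d)."""
    n = X.shape[0]
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    Sigma_hat = centered.T @ centered / n      # note the 1/n (not 1/(n-1))
    return mu_hat, Sigma_hat

# Example with synthetic data for a single class.
rng = np.random.default_rng(2)
X = rng.multivariate_normal([0.0, 3.0], [[1.0, 0.3], [0.3, 2.0]], size=1000)
mu_hat, Sigma_hat = ml_gaussian(X)
print(mu_hat)
print(Sigma_hat)            # equals np.cov(X.T, bias=True)
```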

  15. Bias • The Maximum-Likelihood estimate for σ² is biased: E[(1/n) ∑_k (xk - x̄)²] = ((n - 1)/n) σ² ≠ σ² • An elementary unbiased estimator for Σ is the sample covariance matrix C = (1/(n - 1)) ∑_{k=1}^{n} (xk - μ̂)(xk - μ̂)^t
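A small simulation (illustrative only) makes the bias visible: averaging the 1/n estimate of σ² over many repeated samples of size n gives roughly ((n - 1)/n)σ², while the 1/(n - 1) estimate averages to roughly σ².

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2_true = 4.0
n, trials = 5, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2_true), size=(trials, n))
mu_hat = samples.mean(axis=1, keepdims=True)

var_ml = ((samples - mu_hat) ** 2).sum(axis=1) / n            # biased ML estimate
var_unbiased = ((samples - mu_hat) ** 2).sum(axis=1) / (n - 1)

print(var_ml.mean())        # ~ (n-1)/n * sigma2_true = 3.2
print(var_unbiased.mean())  # ~ sigma2_true = 4.0
```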

  16. Maximum-Likelihood Problem Statement • Let D = {x1, x2, …, xn}; since the samples are drawn independently, P(x1, …, xn | θ) = ∏_{k=1}^{n} P(xk | θ) • Our goal is to determine θ̂, the value of θ that makes this sample the most representative!
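A practical note, sketched below with assumed univariate Gaussian data: because P(D | θ) is a product of n densities, computing it directly underflows once n is moderately large, which is another reason to work with the log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, size=2000)
mu, sigma = 0.0, 1.0

densities = np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

p_product = np.prod(densities)        # underflows to 0.0 for this many samples
log_lik = np.log(densities).sum()     # the log-likelihood l(theta) stays well-behaved

print(p_product, log_lik)
```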

  17. Figure: a data set D of |D| = n samples x1, x2, …, xn drawn from N(μ, Σ) = P(x | ω), and the class-specific subsets D1, …, Dk, …, Dc, each governed by its own class-conditional density P(x | ω1), …, P(x | ωk), …, P(x | ωc).

  18. θ = (θ1, θ2, …, θc) • Problem: find θ̂ such that P(D | θ̂) = max over θ of P(D | θ)

  19. Sec 5.1: When Do Maximum-Likelihood and Bayes Methods Differ? • They rarely differ; ML is less complex and easier to understand • Sources of classification error: • Bayes error - error due to overlapping densities for different classes (inherent error, can never be eliminated) • Model error - error due to having an incorrect model • Estimation error - error from estimating parameters from a finite sample

  20. Sec 7: Problems of Dimensionality • Accuracy, dimension, and training sample size • Classification accuracy depends upon the dimensionality and on the amount of training data • For two classes that are multivariate normal with equal priors and the same covariance Σ, the Bayes error is P(error) = (1/√(2π)) ∫ from r/2 to ∞ of e^(-u²/2) du, where r² = (μ1 - μ2)^t Σ⁻¹ (μ1 - μ2) is the squared Mahalanobis distance between the class means

  21. If the features are independent, then Σ = diag(σ1², …, σd²) and r² = ∑_{i=1}^{d} ((μ1i - μ2i)/σi)² • The most useful features are the ones for which the difference between the means is large relative to the standard deviation • Since r never decreases as features are added, it appears that adding new features always improves accuracy • Yet it has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance: we have the wrong model!
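The sketch below (with made-up means and standard deviations) evaluates the error integral for the equal-prior, independent-feature case: compute r from the per-feature mean differences, then P(error) via the complementary error function.

```python
import numpy as np
from math import erfc

def bayes_error(mu1, mu2, sigma):
    """P(error) for two equal-prior Gaussian classes with independent features.

    r^2 = sum_i ((mu1_i - mu2_i) / sigma_i)^2, and
    P(error) = (1/sqrt(2*pi)) * integral_{r/2}^{inf} exp(-u^2/2) du.
    """
    r = np.sqrt(np.sum(((np.asarray(mu1) - np.asarray(mu2)) / np.asarray(sigma)) ** 2))
    return 0.5 * erfc(r / (2 * np.sqrt(2)))

# Hypothetical problem: adding a feature whose means are well separated
# (relative to its sigma) increases r and lowers the error.
print(bayes_error([0, 0], [1, 1], [1, 1]))          # d = 2
print(bayes_error([0, 0, 0], [1, 1, 2], [1, 1, 1])) # d = 3, error drops
```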


  23. Computational Complexity • Maximum-Likelihood estimation with Gaussian models in d feature dimensions, with n training samples for each of the c classes • For each category we must estimate the parameters and evaluate the discriminant function; the dominant cost is the covariance estimate, giving a total of O(d²·n) per class • Total for c classes: O(c·d²·n) ≈ O(d²·n) • Costs grow quickly when d and n are large!

  24. Overfitting • The number of training samples n can be inadequate for estimating the parameters • What to do? • Simplify the model – reduce the number of parameters • Assume all classes share the same covariance matrix, perhaps the identity matrix, or a matrix with zero off-diagonal elements (statistical independence), as sketched below • Reduce the number of features d: Principal Component Analysis, etc.
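A rough sketch of the first two simplifications (shared covariance across classes, and diagonal covariance for statistical independence); the data, labels, and the helper class_covariances are hypothetical, for illustration only:

```python
import numpy as np

def class_covariances(X, y, shared=False, diagonal=False):
    """Estimate per-class (or pooled) covariance matrices with fewer free parameters.

    shared=True   -> one covariance matrix pooled over all classes
    diagonal=True -> keep only the variances (statistical independence assumption)
    """
    classes = np.unique(y)
    covs = {c: np.cov(X[y == c].T, bias=True) for c in classes}  # per-class ML estimates
    if shared:
        # Pool the per-class estimates, weighted by class sample counts.
        pooled = sum((y == c).sum() * covs[c] for c in classes) / len(y)
        covs = {c: pooled for c in classes}
    if diagonal:
        covs = {c: np.diag(np.diag(S)) for c, S in covs.items()}
    return covs

# Tiny synthetic example: two classes in 3 dimensions.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(2, 1, (20, 3))])
y = np.array([0] * 20 + [1] * 20)
print(class_covariances(X, y, shared=True, diagonal=True)[0])
```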
