  1. Maximum Entropy (ME) • Maximum Entropy Markov Model (MEMM) • Conditional Random Field (CRF)

  2. Boltzmann-Gibbs Distribution • Given: • States s_1, s_2, …, s_n • Density p(s) = p_s • Maximum entropy principle: • Without any information, one chooses the density p_s to maximize the entropy subject to the constraints, as written out below
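
A standard way to write the objective and constraints (a reconstruction of the slide's equations; F_i denotes the prescribed expectation of feature f_i, the usual Boltzmann-Gibbs setup, assumed here):

\[
\max_{p}\; H(p) = -\sum_{s} p_s \log p_s
\qquad \text{subject to} \qquad
\sum_{s} p_s = 1, \qquad \sum_{s} p_s f_i(s) = F_i, \quad i = 1, \dots, n.
\]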

  3. Boltzmann-Gibbs (Cnt'd) • Consider the Lagrangian • Taking partial derivatives of L with respect to p_s and setting them to zero, we obtain the Boltzmann-Gibbs density functions, where Z is the normalizing factor; the derivation is sketched below
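
Sketched step by step (the multipliers μ and λ_i correspond to the normalization and feature constraints above):

\[
L = -\sum_s p_s \log p_s + \mu \Big( \sum_s p_s - 1 \Big) + \sum_i \lambda_i \Big( \sum_s p_s f_i(s) - F_i \Big)
\]
\[
\frac{\partial L}{\partial p_s} = -\log p_s - 1 + \mu + \sum_i \lambda_i f_i(s) = 0
\quad \Longrightarrow \quad
p_s = \frac{1}{Z} \exp\Big( \sum_i \lambda_i f_i(s) \Big), \qquad Z = \sum_s \exp\Big( \sum_i \lambda_i f_i(s) \Big).
\]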

  4. Exercise • From the Lagrangian derive

  5. Boltzmann-Gibbs (Cnt'd) • Classification Rule • Use Boltzmann-Gibbs as the prior distribution • Compute the posterior for the given observed data and features f_i • Use the optimal posterior to classify

  6. Boltzmann-Gibbs (Cnt'd) • Maximum Entropy (ME) • The posterior is the state probability density p(s | X), where X = (x_1, x_2, …, x_n) • Maximum entropy Markov model (MEMM) • The posterior consists of transition probability densities p(s | s′, X)

  7. Boltzmann-Gibbs (Cnt'd) • Conditional random field (CRF) • The posterior consists of both transition probability densities p(s | s′, X) and state probability densities p(s | X); the three model forms are contrasted below
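
Written side by side (a sketch following the notation of Lafferty et al., cited in the References; f_i are transition features, g_j are state features, and the exact feature arguments are an assumption):

\[
\text{ME:}\quad p(s \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_i \lambda_i f_i(X, s) \Big)
\]
\[
\text{MEMM:}\quad p(s \mid s', X) = \frac{1}{Z(s', X)} \exp\Big( \sum_i \lambda_i f_i(s', s, X) \Big) \quad \text{(normalized locally, per source state)}
\]
\[
\text{CRF:}\quad p(s_1, \dots, s_n \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_t \sum_i \lambda_i f_i(s_{t-1}, s_t, X, t) + \sum_t \sum_j \mu_j g_j(s_t, X, t) \Big) \quad \text{(normalized globally)}
\]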

  8. References • R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd Ed., Wiley Interscience, 2001. • T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001. • P. Baldi and S. Brunak, Bioinformatics: The Machine Learning Approach, The MIT Press, 2001.

  9. Maximum Entropy Approach

  10. An Example • Five possible French translations of the English word in: • dans, en, à, au cours de, pendant • Certain constraints are obeyed: • When April follows in, the proper translation is en • How do we choose the proper French translation y given an English context x?

  11. Formalism • Probability assignment p(y|x): • y: French word, x: English context • Indicator function of a context feature f, e.g. f(x, y) = 1 if y = en and April follows in, and f(x, y) = 0 otherwise

  12. Expected Values of f • The expected value of f with respect to the empirical distribution p̃(x, y): p̃(f) = Σ_{x,y} p̃(x, y) f(x, y) • The expected value of f with respect to the conditional probability p(y|x): p(f) = Σ_{x,y} p̃(x) p(y|x) f(x, y)

  13. Constraint Equation • Set the two expected values equal: p(f) = p̃(f), or equivalently, Σ_{x,y} p̃(x) p(y|x) f(x, y) = Σ_{x,y} p̃(x, y) f(x, y)

  14. Maximum Entropy Principle • Given n feature functions f_i, we want p(y|x) to maximize the entropy measure H(p) = −Σ_{x,y} p̃(x) p(y|x) log p(y|x), where p is chosen from C = {p : p(f_i) = p̃(f_i), i = 1, …, n}

  15. Constrained Optimization Problem • The Lagrangian: Λ(p, λ) = H(p) + Σ_i λ_i (p(f_i) − p̃(f_i)) • Solution: p_λ(y|x) = exp(Σ_i λ_i f_i(x, y)) / Z_λ(x), where Z_λ(x) = Σ_y exp(Σ_i λ_i f_i(x, y))

  16. Iterative Solution • Compute the expectation of f_i under the current estimate of the probability function • Update the Lagrange multipliers • Update the probability functions • A runnable sketch of this loop follows below
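
A minimal runnable sketch of this loop in Python. The Generalized Iterative Scaling (GIS) update rule and the dense array layout are assumptions for illustration; the slides do not name the specific update, and Berger et al. use the closely related improved iterative scaling:

    import numpy as np

    def gis(F, p_xy, n_iters=200):
        """F[x, y, i] = f_i(x, y); p_xy[x, y] = empirical distribution p~(x, y).

        Assumes every feature has a positive empirical expectation and that
        feature counts sum to (roughly) the same constant C for each (x, y);
        the usual slack feature is omitted for brevity.
        """
        lam = np.zeros(F.shape[2])                     # Lagrange multipliers
        p_x = p_xy.sum(axis=1)                         # empirical p~(x)
        emp_f = np.einsum('xy,xyi->i', p_xy, F)        # target expectations p~(f_i)
        C = F.sum(axis=2).max()                        # GIS normalizing constant
        for _ in range(n_iters):
            # current model p_lam(y|x) = exp(sum_i lam_i f_i(x, y)) / Z(x)
            scores = np.einsum('xyi,i->xy', F, lam)
            p_y_x = np.exp(scores - scores.max(axis=1, keepdims=True))
            p_y_x /= p_y_x.sum(axis=1, keepdims=True)
            # expectation of f_i under the current estimate
            model_f = np.einsum('x,xy,xyi->i', p_x, p_y_x, F)
            # multiplier update; the probability update is implicit above
            lam += np.log(emp_f / model_f) / C
        return lam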

  17. Feature Selection • Motivation: • For a large collection of candidate features, we want to select a small subset • Incremental growth

  18. Incremental Learning

  19. Algorithm: Feature Selection

  20. Approximation • Computing the maximum entropy model for each candidate f is costly • Simplifying assumption: • The multipliers λ associated with S do not change when f is added to S
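
Under this assumption only the single new multiplier α must be fit; following Berger et al. (see References), the candidate model and its approximate gain take the form

\[
p_{S \cup f}^{\alpha}(y \mid x) = \frac{p_S(y \mid x)\, e^{\alpha f(x, y)}}{Z_\alpha(x)}, \qquad
Z_\alpha(x) = \sum_{y} p_S(y \mid x)\, e^{\alpha f(x, y)},
\]
\[
\Delta L(S, f) \approx \max_{\alpha}\; \big[\, L(p_{S \cup f}^{\alpha}) - L(p_S) \,\big],
\]

where L denotes the empirical log-likelihood, so each candidate reduces to a one-dimensional optimization over α.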

  21. Approximation (cnt’d)

  22. Approximate Solution

  23. Conditional Random Field (CRF)

  24. CRF

  25. Feature Functions

  26. Difference from MEMM • If the state features are dropped, we obtain an MEMM model • The drawback of MEMM: • The state probabilities are not learned, but inferred • Bias can arise, since the transition features dominate the training

  27. Difference from HMM • An HMM is a generative model • In order to define a joint distribution, the model must enumerate all possible observation sequences and their corresponding label sequences • This task is intractable unless the observation elements are represented as isolated units

  28. CRF Training Methods • CRF training requires intensive numerical optimization • Preconditioned conjugate gradient • Instead of searching along the gradient, conjugate gradient searches along a carefully chosen linear combination of the gradient and the previous search direction • Limited-memory quasi-Newton • Limited-memory BFGS (L-BFGS) is a second-order method that estimates the curvature from previous gradients and updates, avoiding an exact computation of the inverse Hessian • Voted perceptron
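
As a minimal sketch of the quasi-Newton option, the SciPy L-BFGS driver below fits a toy regularized logistic-regression objective standing in for the CRF negative log-likelihood (an assumption: the real CRF gradient requires forward-backward over sequences, omitted here, and the data is synthetic):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) > 0).astype(float)

    def neg_log_likelihood(w):
        # sum over examples of log(1 + e^z) - y*z, plus an L2 penalty,
        # as is typical for regularized CRF training objectives
        z = X @ w
        nll = np.sum(np.logaddexp(0.0, z) - y * z) + 0.5 * np.dot(w, w)
        grad = X.T @ (1.0 / (1.0 + np.exp(-z)) - y) + w
        return nll, grad

    # L-BFGS builds its curvature estimate from previous gradients/updates
    res = minimize(neg_log_likelihood, np.zeros(5), jac=True, method='L-BFGS-B')
    print(res.x)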

  29. Voted Perceptron • Like the perceptron algorithm, this algorithm scans through the training instances, updating the weight vector λ_t when a prediction error is detected • Instead of taking just the final weight vector, the voted perceptron takes the average of the λ_t
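
A minimal sketch of the weight-averaging idea on a binary toy problem. Strictly, this is the "averaged" variant the slide describes (the original voted perceptron keeps every intermediate vector and votes their predictions); extending it to CRF training replaces the dot-product prediction with Viterbi decoding over sequences. The toy data is an assumption for illustration:

    import numpy as np

    def averaged_perceptron(X, y, n_epochs=10):
        """Labels y in {-1, +1}; returns the average of all weight vectors."""
        w = np.zeros(X.shape[1])
        w_sum = np.zeros_like(w)              # running sum of weight vectors
        n_steps = 0
        for _ in range(n_epochs):
            for xi, yi in zip(X, y):
                if yi * np.dot(w, xi) <= 0:   # prediction error detected
                    w += yi * xi              # standard perceptron update
                w_sum += w                    # accumulate for the average
                n_steps += 1
        return w_sum / n_steps

    # toy usage
    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([1, 1, -1, -1])
    print(averaged_perceptron(X, y))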

  30. Voted Perceptron (cnt’d)

  31. References • A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra, A maximum entropy approach to natural language processing, Computational Linguistics, 1996 • A. McCallum, D. Freitag, and F. Pereira, Maximum entropy Markov models for information extraction and segmentation, ICML 2000 • H. M. Wallach, Conditional random fields: an introduction, 2004 • J. Lafferty, A. McCallum, and F. Pereira, Conditional random fields: probabilistic models for segmenting and labeling sequence data, ICML 2001 • F. Sha and F. Pereira, Shallow parsing with conditional random fields, HLT-NAACL 2003
