
MaxEnt : Training, Smoothing, Tagging



Presentation Transcript


  1. MaxEnt: Training, Smoothing, Tagging Advanced Statistical Methods in NLP Ling572 February 7, 2012

  2. Roadmap • Maxent: • Training • Smoothing • Case study: • POS Tagging (redux) • Beam search

  3. Training

  4. Training • Learn λs from training data • Challenge: usually can't solve analytically • Employ numerical methods • Main techniques: • Generalized Iterative Scaling (GIS; Darroch & Ratcliff, '72) • Improved Iterative Scaling (IIS; Della Pietra et al., '95) • L-BFGS, …

  5. Generalized Iterative Scaling • GIS setup: • GIS requires that the features of every event sum to a constant: Σ_j f_j(x, y) = C for all (x, y), where C is a constant • If not, set C = max_{x,y} Σ_j f_j(x, y) • and add a correction feature function f_{k+1}: f_{k+1}(x, y) = C − Σ_{j=1..k} f_j(x, y) • GIS also requires at least one active feature for any event • Default feature functions solve this problem • (A sketch of the correction feature follows this slide.)
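The correction feature can be computed directly from a feature matrix. Below is a minimal sketch, assuming a dense NumPy array F with one row per event (x, y) and one column per feature; the name add_correction_feature is illustrative, not from the slides.

    import numpy as np

    def add_correction_feature(F):
        # F[i, j] holds f_j(x, y) for event i; shape (num_events, k).
        totals = F.sum(axis=1)        # Σ_j f_j(x, y) for each event
        C = totals.max()              # the GIS constant C = max over events
        correction = C - totals       # f_{k+1}(x, y) = C - Σ_j f_j(x, y)
        return np.column_stack([F, correction]), C

After padding, every row of the returned matrix sums to exactly C, satisfying the GIS constraint.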

  6. GIS Iteration • Compute the empirical expectation of each feature • Initialization: λ_j^(0) set to 0 or some other value • Iterate until convergence, for each j: • Compute p(y|x) under the current model • Compute the model expectation under the current model • Update the model parameters by the weighted ratio of the empirical and model expectations • (The same steps appear as formulas on the next slide.)

  7. GIS Iteration • Compute the empirical expectation: E_p̃[f_j] = (1/N) Σ_i f_j(x_i, y_i) • Initialization: λ_j^(0) set to 0 or some other value • Iterate until convergence: • Compute p^(n)(y|x) = exp(Σ_j λ_j^(n) f_j(x, y)) / Z(x) • Compute the model expectation: E_{p^(n)}[f_j] = (1/N) Σ_i Σ_y p^(n)(y|x_i) f_j(x_i, y) • Update: λ_j^(n+1) = λ_j^(n) + (1/C) log(E_p̃[f_j] / E_{p^(n)}[f_j]) • (A runnable sketch of this loop follows this slide.)
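The loop above can be written compactly in NumPy. A minimal sketch, assuming features are stored densely as F[i, y, j] = f_j(x_i, y), that the correction feature has already been added so every (x, y) sums to C, and that gold holds the true label index of each instance; train_gis and num_iters are illustrative names.

    import numpy as np

    def train_gis(F, gold, C, num_iters=100):
        N, Y, k = F.shape
        lam = np.zeros(k)                                # λ_j^(0) = 0
        # Empirical expectation: (1/N) Σ_i f_j(x_i, y_i)
        emp = F[np.arange(N), gold, :].sum(axis=0) / N
        for _ in range(num_iters):
            # p^(n)(y|x_i) proportional to exp(Σ_j λ_j f_j(x_i, y))
            scores = F @ lam                             # shape (N, Y)
            scores -= scores.max(axis=1, keepdims=True)  # numerical stability
            p = np.exp(scores)
            p /= p.sum(axis=1, keepdims=True)
            # Model expectation: (1/N) Σ_i Σ_y p(y|x_i) f_j(x_i, y)
            model = np.einsum('iy,iyj->j', p, F) / N
            # Update: λ_j += (1/C) log(emp_j / model_j)
            lam += np.log(emp / model) / C
        return lam

Note that the update assumes every feature has a nonzero empirical count; in practice this is exactly where smoothing (next section) matters.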

  8. Convergence • These methods have convergence guarantees • However, full convergence may take a very long time • Frequently, a threshold is used instead: stop when the objective improves by less than some small ε (a sketch follows this slide)
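A common threshold test stops when the log-likelihood stops improving. A minimal sketch; ll_history and epsilon are illustrative names, not from the slides.

    def has_converged(ll_history, epsilon=1e-4):
        # ll_history holds LL(p) after each iteration, oldest first.
        if len(ll_history) < 2:
            return False
        return abs(ll_history[-1] - ll_history[-2]) < epsilon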

  9. Calculating LL(p) • LL = 0 • For each sample x in the training data: • Let y be the true label of x • prob = p(y|x) • LL += 1/N * log(prob) • (A runnable version follows this slide.)
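As runnable Python: a minimal sketch where data is a list of (x, y) pairs and p(y, x) returns the model probability p(y|x); both interfaces are assumptions for illustration.

    import math

    def log_likelihood(data, p):
        # data: list of (x, y) pairs; p(y, x) returns the model's p(y|x).
        N = len(data)
        LL = 0.0
        for x, y in data:
            LL += math.log(p(y, x)) / N
        return LL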

  10. Running Time • For each iteration, the running time is O(NPA), where: • N: number of training instances • P: number of classes • A: average number of active features per instance (x, y)

  11. L-BFGS • Limited-memory version of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method • A quasi-Newton method for unconstrained optimization • Well suited to optimization problems with many variables • The “algorithm of choice” for MaxEnt and related models

  12. L-BFGS • References: • Nocedal, J. (1980). “Updating Quasi-Newton Matrices with Limited Storage”. Mathematics of Computation 35: 773–782. • Liu, D. C.; Nocedal, J. (1989). “On the Limited Memory BFGS Method for Large Scale Optimization”. Mathematical Programming B 45 (3): 503–528. • Implementations: Java, MATLAB, Python via scipy, R, etc. • See the Wikipedia page • (A scipy sketch follows this slide.)
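In Python, scipy's general-purpose minimizer exposes L-BFGS-B. A minimal sketch, reusing the F/gold layout assumed in the GIS sketch above; the objective is the average negative conditional log-likelihood.

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_likelihood(lam, F, gold):
        # F[i, y, j] = f_j(x_i, y); gold[i] is the true label index.
        N = F.shape[0]
        scores = F @ lam                               # shape (N, Y)
        scores -= scores.max(axis=1, keepdims=True)    # numerical stability
        log_Z = np.log(np.exp(scores).sum(axis=1))
        return -(scores[np.arange(N), gold] - log_Z).sum() / N

    result = minimize(neg_log_likelihood, x0=np.zeros(F.shape[2]),
                      args=(F, gold), method='L-BFGS-B')
    lam = result.x

Supplying an analytic gradient via the jac= argument speeds this up considerably; without it, scipy falls back to numerical differentiation.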

  13. Smoothing (based on Klein & Manning, 2003; F. Xia)

  14. Smoothing • Problems of scale: • Large numbers of features • Some MaxEnt NLP problems have on the order of 1M features • Storage can be a problem • Sparseness problems • Ease of overfitting • Optimization problems • With very large feature sets, training can take a long time to converge

  15. Smoothing • Consider the coin-flipping problem • [Figure from K&M ’03: three empirical distributions over coin flips and the models fit to each]

  16. Need for Smoothing • Two problems (from K&M ’03): • Optimization: • The optimal value of λ? ∞ • Slow to optimize • No smoothing: • The learned distribution is just as spiky as the empirical one (K&M ’03)

  17. Possible Solutions • Early stopping • Feature selection • Regularization • (A regularization sketch follows this slide.)
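Regularization adds a penalty that keeps the λs small; a Gaussian prior (L2 penalty) is the usual choice for MaxEnt, as discussed by Klein & Manning (2003). A minimal sketch, building on the hypothetical neg_log_likelihood above; sigma2 (the prior variance) is an assumed hyperparameter, not from the slides.

    # L2 (Gaussian prior) penalty added to the negative log-likelihood;
    # smaller sigma2 pulls the weights toward 0 and flattens the model.
    def penalized_nll(lam, F, gold, sigma2=1.0):
        return neg_log_likelihood(lam, F, gold) + (lam @ lam) / (2 * sigma2)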

  18. Early Stopping • Prior use of early stopping: • Decision tree heuristics • Similarly here: • Stop training after a few iterations • The λs will have increased • Guarantees bounded, finite training time • (Usage sketch below.)
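In the GIS sketch above, early stopping is just a cap on the iteration count (num_iters is the illustrative knob from that sketch):

    # Early stopping: run a few GIS iterations rather than training
    # until the convergence threshold is met.
    lam = train_gis(F, gold, C, num_iters=5)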
