
Maximum Entropy Model (II)



  1. Maximum Entropy Model (II) LING 572 Fei Xia Week 6: 02/12-02/14/08

  2. Outline • Overview • The Maximum Entropy Principle • Modeling** • Decoding • Training* • Smoothing* • Case study: POS tagging

  3. Training

  4. Algorithms • Generalized Iterative Scaling (GIS): (Darroch and Ratcliff, 1972) • Improved Iterative Scaling (IIS): (Della Pietra et al., 1995) • L-BFGS: • …

  5. GIS: setup Requirements for running GIS: • The features are non-negative and sum to the same constant C for every (x, y): ∑j fj(x, y) = C • If that’s not the case, let C = max(x, y) ∑j fj(x, y) and add a “correction” feature function fk+1: fk+1(x, y) = C − ∑j fj(x, y)

  6. The “correction” feature function The weight of fk+1 will not affect P(y | x). Therefore, there is no need to estimate the weight.

  7. GIS algorithm • Compute the empirical expectation of each feature: E_p~[fj] = (1/N) ∑i fj(xi, yi) • Initialize the weights λj (any values, e.g., 0) • Repeat until convergence • For each j • Compute the model expectation under the current model: E_p(n)[fj] = (1/N) ∑i ∑y p(n)(y | xi) fj(xi, y) • Compute the update: Δλj = (1/C) log(E_p~[fj] / E_p(n)[fj]) • Update: λj(n+1) = λj(n) + Δλj
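A minimal Python sketch of this loop, assuming the training data is a list of (x, gold_y) pairs and a helper feature_fn(x, y) that returns the active features as a {feature index: value} dict; these names and the data layout are illustrative, not from the slides:

import math

def gis_train(data, feature_fn, num_features, labels, C, iterations=50):
    # data: list of (x, gold_y); feature_fn(x, y) -> {feature index: value};
    # C: the GIS constant (a correction feature is assumed to already be included)
    N = len(data)
    lam = [0.0] * num_features                      # initialize weights to 0

    # empirical expectation E_p~[fj] = (1/N) * sum_i fj(xi, yi)
    emp = [0.0] * num_features
    for x, y in data:
        for j, v in feature_fn(x, y).items():
            emp[j] += v / N

    for _ in range(iterations):
        model = [0.0] * num_features                # model expectation E_p(n)[fj]
        for x, _ in data:
            scores = {y: math.exp(sum(lam[j] * v
                                      for j, v in feature_fn(x, y).items()))
                      for y in labels}              # unnormalized p(y | x)
            Z = sum(scores.values())
            for y, s in scores.items():
                p = s / Z
                for j, v in feature_fn(x, y).items():
                    model[j] += p * v / N
        # GIS update: lambda_j += (1/C) * log(E_p~[fj] / E_p(n)[fj])
        for j in range(num_features):
            if emp[j] > 0 and model[j] > 0:
                lam[j] += (1.0 / C) * math.log(emp[j] / model[j])
    return lam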

  8. “Until convergence”

  9. Calculating LL(p)
  LL = 0;
  for each training instance x:
      let y be the true label of x
      prob = p(y | x);    # p is the current model
      LL += 1/N * log(prob);
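A direct Python version of this pseudocode, assuming a helper prob(x, y) that returns p(y | x) under the current model (the helper name is an assumption):

import math

def avg_log_likelihood(data, prob):
    # data: list of (x, true_label) pairs; prob(x, y) -> p(y | x)
    return sum(math.log(prob(x, y)) for x, y in data) / len(data)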

  10. Properties of GIS • L(p(n+1)) >= L(p(n)) • The sequence is guaranteed to converge to p*. • The convergence can be very slow. • The running time of each iteration is O(NPA): • N: the training set size • P: the number of classes • A: the average number of features that are active for an instance (x, y).

  11. L-BFGS • BFGS stands for Broyden-Fletcher-Goldfarb-Shanno, the four authors who each published the method in a single-authored paper in 1970. • Limited-memory BFGS: proposed in the 1980s. • It is a quasi-Newton method for unconstrained optimization. • It is especially efficient on problems involving a large number of variables.

  12. L-BFGS (cont) • J. Nocedal. Updating Quasi-Newton Matrices with Limited Storage (1980), Mathematics of Computation 35, pp. 773-782. • D.C. Liu and J. Nocedal. On the Limited Memory BFGS Method for Large Scale Optimization (1989), Mathematical Programming B, 45, 3, pp. 503-528. • Implementations: • Fortran: http://www.ece.northwestern.edu/~nocedal/lbfgs.html • C: http://www.chokkan.org/software/liblbfgs/index.html • Perl: http://search.cpan.org/~laye/Algorithm-LBFGS-0.12/lib/Algorithm/LBFGS.pm
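The slides do not show code, but as an illustration, SciPy's general-purpose optimizer exposes L-BFGS: given the negative average log-likelihood and its gradient, scipy.optimize.minimize(..., method="L-BFGS-B", jac=True) will fit the weights. The dense (N, L, F) feature array below is an assumption made to keep the sketch short:

import numpy as np
from scipy.optimize import minimize

def make_objective(feats, gold):
    # feats: array of shape (N, L, F) holding fj(xi, y) for every label y
    # gold:  array of shape (N,) holding the index of the true label
    N, L, F = feats.shape

    def objective(lam):
        scores = feats @ lam                            # (N, L)
        scores -= scores.max(axis=1, keepdims=True)     # numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)       # p(y | xi)
        ll = np.log(probs[np.arange(N), gold]).mean()   # average log-likelihood
        # gradient of -LL = model expectation - empirical expectation
        emp = feats[np.arange(N), gold].mean(axis=0)
        model = (probs[:, :, None] * feats).sum(axis=1).mean(axis=0)
        return -ll, model - emp

    return objective

# usage (illustrative):
#   lam = minimize(make_objective(feats, gold), np.zeros(F),
#                  jac=True, method="L-BFGS-B").x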

  13. Outline • Overview • The Maximum Entropy Principle • Modeling** • Decoding • Training* • Smoothing* • Case study: POS tagging

  14. Smoothing Many slides come from (Klein and Manning, 2003)

  15. Papers • (Klein and Manning, 2003) • Chen and Rosenfeld (1999): A Gaussian Prior for Smoothing Maximum Entropy Models, CMU Technical report (CMU-CS-99-108).

  16. Smoothing • MaxEnt models for NLP tasks can have millions of features. • Overfitting is a problem. • Feature weights can be infinite, and the iterative trainers can take a long time to reach those values.

  17. An example

  18. Approaches • Early stopping • Feature selection • Regularization

  19. Early stopping

  20. Feature selection • Methods: • Using predefined functions: e.g., dropping features with low counts • Wrapper approach: feature selection during training • It is equivalent to setting the removed features’ weights to zero. • It reduces the model size, but the performance could suffer.

  21. Regularization • In statistics and machine learning, regularization is any method of preventing overfitting of data by a model. • Typical examples of regularization in statistical machine learning include ridge regression, the lasso, and the L2 norm in support vector machines. • In this case, we change the objective function: Posterior ∝ Likelihood × Prior, i.e., P(λ | X, Y) ∝ P(Y | X, λ) P(λ)

  22. MAP estimate • ML: Maximum likelihood: λ* = argmaxλ P(Y | X, λ) • MAP: Maximum a posteriori: λ* = argmaxλ P(Y | X, λ) P(λ)

  23. The prior • Uniform distribution, exponential prior, … • Gaussian prior: P(λj) = (1/√(2πσj²)) exp(−λj²/(2σj²))

  24. Maximize P(Y | X, λ): λ* = argmaxλ ∑i log P(yi | xi, λ) • Maximize P(Y, λ | X): λ* = argmaxλ [∑i log P(yi | xi, λ) + ∑j log P(λj)] = argmaxλ [∑i log P(yi | xi, λ) − ∑j λj²/(2σj²)] (dropping constants) • In practice: a single variance σ² is typically shared across all features.
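In code, the Gaussian prior only adds an L2 penalty term to the objective and its gradient. A sketch that wraps an existing (negative log-likelihood, gradient) function such as the one in the L-BFGS example above; sigma2 is a shared variance hyperparameter, which is an assumption (the slides allow per-feature variances σj²):

import numpy as np

def with_gaussian_prior(objective, sigma2):
    # add sum_j lambda_j^2 / (2 sigma^2) to the loss
    # and lambda_j / sigma^2 to the gradient
    def wrapped(lam):
        nll, grad = objective(lam)
        return (nll + np.dot(lam, lam) / (2.0 * sigma2),
                grad + lam / sigma2)
    return wrapped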

  25. L1 or L2 • L1: Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) method (Andrew and Gao, 2007) • L2: L-BFGS method (Nocedal, 1980)

  26. Example: POS tagging

  27. Benefits of smoothing • Softens distributions • Pushes weights onto more explanatory features • Allows many features to be used safely • Speeds up convergence (if both are allowed to converge)

  28. Summary: training and smoothing • Training: many methods (e.g., GIS, IIS, L-BFGS). • Smoothing: • Early stopping • Feature selection • Regularization • Regularization: • Changing the objective function by adding the prior • A common prior: Gaussian distribution • Maximizing posterior is no longer the same as maximizing entropy.

  29. Outline • Overview • The Maximum Entropy Principle • Modeling** • Decoding • Training* • Smoothing* • Case study: POS tagging

  30. Case study

  31. POS tagging (Ratnaparkhi, 1996) • Notation variation: • fj(x, y): x: input, y: output • fj(h, t): h: history, t: tag for the word • History: hi = {wi, wi+1, wi+2, wi-1, wi-2, ti-1, ti-2} • Training data: • Treat a sentence as a set of (hi, ti) pairs. • How many pairs are there for a sentence?

  32. Using a MaxEnt Model • Modeling: • Training: • Define feature templates • Create the feature set • Determine the optimum feature weights via GIS or IIS • Decoding:

  33. Modeling

  34. Training step 1: define feature templates • The templates pair conditions on the history hi with the tag ti (Table 1 in Ratnaparkhi, 1996).
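Since Table 1 itself is not reproduced here, the following sketch only illustrates the flavor of Ratnaparkhi-style templates (current and surrounding words, previous one or two tags, and spelling features for rare words); the key names and string encoding are invented for the example:

def extract_features(h, rare):
    # h: history dict with keys 'w0', 'w-1', 'w-2', 'w+1', 'w+2', 't-1', 't-2'
    # rare: whether the current word is a rare word (switches on spelling features)
    w = h["w0"]
    feats = ["t-1=" + h["t-1"],
             "t-2t-1=" + h["t-2"] + "," + h["t-1"],
             "w-1=" + h["w-1"], "w-2=" + h["w-2"],
             "w+1=" + h["w+1"], "w+2=" + h["w+2"]]
    if not rare:
        feats.append("curW=" + w)
    else:
        # spelling features for rare words: prefixes/suffixes up to length 4,
        # digit / uppercase / hyphen indicators
        for k in range(1, min(4, len(w)) + 1):
            feats.append("pref=" + w[:k])
            feats.append("suf=" + w[-k:])
        if any(c.isdigit() for c in w): feats.append("hasNumber")
        if any(c.isupper() for c in w): feats.append("hasUpper")
        if "-" in w: feats.append("hasHyphen")
    return feats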

  35. Step 2: Create feature set • Collect all the features from the training data • Throw away features that appear less than 10 times

  36. The thresholds • Rare words: words that occur < 5 times in the training data. • Features (not feature functions): • All curWord features will be kept. • For the rest of the features, keep them if they occur >= 10 times in the training data.
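A small sketch of the count-and-filter step, assuming each training event has already been turned into a list of feature strings like the ones produced by the illustrative extract_features above (the curW= prefix test matches that invented encoding):

from collections import Counter

def filter_features(feature_lists, min_count=10):
    # keep every curWord feature, and any other feature that occurs
    # at least min_count times in the training data
    counts = Counter(f for feats in feature_lists for f in feats)
    return {f for f, c in counts.items()
            if f.startswith("curW=") or c >= min_count}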

  37. Step 3: determine the weights of feature functions • GIS • Training time: • Each iteration: O(NTA): • N: the training set size • T: the number of allowable tags • A: average number of features that are active for a (h, t). • About 24 hours on an IBM RS/6000 Model 380.

  38. Decoding: Beam search • Generate tags for w1, find top N, set s1j accordingly, j=1, 2, …, N • For i=2 to n (n is the sentence length) • For j=1 to N • Generate tags for wi, given s(i-1)j as previous tag context • Append each tag to s(i-1)j to make a new sequence. • Find N highest prob sequences generated above, and set sij accordingly, j=1, …, N • Return highest prob sequence sn1.
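A Python sketch of this decoder, assuming two helpers that are not in the slides: labels_for(word), which returns the candidate tags for a word, and score(history, tag), which returns log p(tag | history) under the trained model:

def beam_search(words, labels_for, score, beam_size=5):
    # each hypothesis is (log probability, tag sequence so far)
    beam = [(0.0, [])]
    for i, w in enumerate(words):
        candidates = []
        for logp, seq in beam:
            history = {"words": words, "i": i, "prev_tags": seq}
            for t in labels_for(w):
                candidates.append((logp + score(history, t), seq + [t]))
        # keep the N highest-probability sequences
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beam[0][1]                  # highest-probability tag sequence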

  39. Beam search

  40. Decoding (cont) • Tags for words: • Known words: use tag dictionary • Unknown words: try all possible tags • Ex: “time flies like an arrow” • Running time: O(NTAB) • N: sentence length • B: beam size • T: tagset size • A: average number of features that are active for a given event
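The tag-candidate helper assumed by the beam-search sketch above could look like this; tag_dict and all_tags are illustrative names:

def labels_for(word, tag_dict, all_tags):
    # known words: use the tag dictionary built from the training data;
    # unknown words: try all possible tags
    return tag_dict.get(word, all_tags)

# usage with the beam-search sketch above (illustrative):
#   beam_search(words, lambda w: labels_for(w, tag_dict, tagset), score)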

  41. Experiment results

  42. Comparison with other learners • HMM: MaxEnt can use more context • DT: MaxEnt does not split data • Naïve Bayes: MaxEnt does not assume that features are independent given the class.

  43. Hw6

  44. Hw6 • Q1: build trainer and test it on classification task. • Q2: run the trainer and the decoder on the stoplight example • Q3-Q5: POS tagging task • More classes: 45 • More training instances: about 1M

  45. Q1: Implementing a GIS trainer • Compute the empirical expectation E_p~[fj] • Compute the GIS constant C • Initialize the weights (all zeros) • Repeat until convergence • Calculate the model expectation E_p(n)[fj] based on the current weights • Update the weights

  46. “Until convergence”

  47. Q2: The stoplight example • Run your MaxEnt trainer and decoder on the data created in Hw5. • Three feature sets: • {f1=g, f1=r, f2=g, f2=r} • {f1=g, f1=r, f2=g, f2=r, f3=1, f3=0} • {f3=1, f3=0}

  48. Q3: Prepare data for POS tagging • Training data: sec0_18.word_pos • Dev data: sec19_21.word_pos • Features: • Table 1 in (Ratnaparkhi, 1996) • Rare word: word occurs < 5 times in the training data: see sec0_18.voc • Keep all the wi features. For other features, throw away the ones that occur less than 10 times: count features and then filter them. • Remember to replace “,” with “comma” in the vector files.

  49. Q4: Run the Mallet learner • Run the Mallet trainer and save the model: • The training took about 1.5 hours for me. • Run the Mallet decoder on the dev data: • The testing took a couple of minutes. • Run your decoder. Do you get the same results?
