Maximum Entropy Model LING 572 Fei Xia 02/08/07
Topics in LING 572 • Easy: • kNN, Rocchio, DT, DL • Feature selection, binarization, system combination • Bagging • Self-training
Topics in LING 572 • Slightly more complicated • Boosting • Co-training • Hard (to some people): • MaxEnt • EM
History • The concept of Maximum Entropy can be traced back along multiple threads to Biblical times. • Introduced to the NLP area by Berger et al. (1996). • Used in many NLP tasks: Tagging, Parsing, PP attachment, LM, …
Outline • Main idea • Modeling • Training: estimating parameters • Feature selection during training • Case study
Maximum Entropy • Why maximum entropy? • Maximize entropy = Minimize commitment • Model all that is known and assume nothing about what is unknown. • Model all that is known: satisfy a set of constraints that must hold • Assume nothing about what is unknown: choose the most “uniform” distribution, i.e., the one with maximum entropy
Ex1: Coin-flip example (Klein & Manning 2003) • Toss a coin: p(H)=p1, p(T)=p2. • Constraint: p1 + p2 = 1 • Question: what’s your estimation of p=(p1, p2)? • Answer: choose the p that maximizes H(p) [Plot: H as a function of p1]
Coin-flip example (cont) [Plot: the entropy surface H(p) over (p1, p2), the constraint line p1 + p2 = 1.0, and the effect of adding the constraint p1 = 0.3]
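A quick numeric check of the coin-flip example (a sketch of mine, not from the slides): among all distributions with p1 + p2 = 1, entropy peaks at the uniform p1 = p2 = 0.5, and pinning p1 = 0.3 gives a lower value.

```python
import math

def entropy(ps):
    """Shannon entropy in bits; terms with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# sweep the constraint line p1 + p2 = 1 and find the maximizer
best_p1 = max((i / 100.0 for i in range(101)),
              key=lambda p1: entropy([p1, 1.0 - p1]))
print(best_p1)                 # 0.5
print(entropy([0.5, 0.5]))     # 1.0 bit
print(entropy([0.3, 0.7]))     # ~0.881 bits, lower
```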
Ex2: An MT example (Berger et al., 1996) Possible translations for the word “in”: {dans, en, à, au cours de, pendant} Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1 Intuitive answer: the uniform distribution, 1/5 for each translation
An MT example (cont) Constraints: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1, p(dans) + p(en) = 3/10 Intuitive answer: p(dans) = p(en) = 3/20; p(à) = p(au cours de) = p(pendant) = 7/30
An MT example (cont) Constraints: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1, p(dans) + p(en) = 3/10, p(dans) + p(à) = 1/2 Intuitive answer: ?? (no longer obvious — this is exactly where maximum entropy helps)
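As a sanity check (my own sketch, not part of the original slides), the third-stage answer can be computed numerically by maximizing entropy under the three constraints listed above; the constraint values and the translation order [dans, en, à, au cours de, pendant] follow my reading of the Berger et al. (1996) example.

```python
import numpy as np
from scipy.optimize import minimize

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)        # avoid log(0)
    return float(np.sum(p * np.log(p)))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},            # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 3.0 / 10.0},  # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 1.0 / 2.0},   # p(dans) + p(à) = 1/2
]
res = minimize(neg_entropy, np.full(5, 0.2),
               bounds=[(0.0, 1.0)] * 5, constraints=constraints)
print(np.round(res.x, 3))   # maximum-entropy distribution under all three constraints
```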
Modeling the problem • Objective function: H(p) • Goal: Among all the distributions that satisfy the constraints, choose the one, p*, that maximizes H(p). • Question: How to represent constraints?
Reference papers • (Ratnaparkhi, 1997) • (Ratnaparkhi, 1996) • (Berger et al., 1996) • (Klein and Manning, 2003) These papers use different notations.
The basic idea • Goal: estimate p • Choose p with maximum entropy (or “uncertainty”) subject to the constraints (or “evidence”).
Setting • From training data, collect (a, b) pairs: • a: thing to be predicted (e.g., a class in a classification problem) • b: the context • Ex: POS tagging: • a=NN • b=the words in a window and previous two tags • Learn the prob of each (a, b): p(a, b)
Features in POS tagging (Ratnaparkhi, 1996) [Example figure: a word’s context (a.k.a. history) and its allowable classes (tags)]
Features • A feature (a.k.a. feature function, indicator function) is a binary-valued function on events: fj : A × B → {0, 1} • A: the set of possible classes (e.g., tags in POS tagging) • B: space of contexts (e.g., neighboring words/tags in POS tagging) • Ex: fj(a, b) = 1 if a = VBG and the current word in b ends in “ing”; 0 otherwise
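The same example feature as a small Python sketch (the function and context names are mine, for illustration only):

```python
def f_ing_vbg(a, b):
    """Binary feature: fires when the class is VBG and the current word ends in 'ing'."""
    return 1 if a == "VBG" and b["word"].endswith("ing") else 0

print(f_ing_vbg("VBG", {"word": "running"}))  # 1
print(f_ing_vbg("NN",  {"word": "running"}))  # 0
```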
Some notations • Finite training sample of events: S • Observed probability of x in S: p̃(x) • The model p’s probability of x: p(x) • The jth feature: fj • Observed expectation of fj (the empirical count of fj): Ep̃ fj = Σx p̃(x) fj(x) • Model expectation of fj: Ep fj = Σx p(x) fj(x)
Constraints • Model’s feature expectation = observed feature expectation: Ep fj = Ep̃ fj, for j = 1, …, k • How to calculate Ep̃ fj? Count how often fj fires in the training sample: Ep̃ fj = (1/N) Σi fj(xi)
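A minimal sketch (names are mine) of computing the observed feature expectations over a training sample of (a, b) events:

```python
def empirical_expectations(events, features):
    """events: list of (a, b) pairs; features: list of binary functions f(a, b)."""
    n = len(events)
    return [sum(f(a, b) for a, b in events) / n for f in features]
```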
Restating the problem The task: find p* s.t. p* = argmax over p in P of H(p), where P = { p : Ep fj = Ep̃ fj, j = 1, …, k } Objective function: H(p) Constraints: Ep fj = Ep̃ fj for every feature fj; adding a feature adds a constraint
Questions • Is P empty? • Does p* exist? • Is p* unique? • What is the form of p*? • How to find p*?
What is the form of p*? (Ratnaparkhi, 1997) Theorem: if p* ∈ P ∩ Q, then p* = argmax over p in P of H(p), where Q is the set of models of the log-linear (exponential) form p(x) = π Πj αj^fj(x), with αj > 0 and π a normalizing constant. Furthermore, p* is unique.
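The exponential form as a small sketch (mine), with αj written as exp(λj) and the event space given explicitly:

```python
import math

def loglinear_prob(x, features, lambdas, event_space):
    """p(x) proportional to prod_j alpha_j ** f_j(x), with alpha_j = exp(lambda_j)."""
    score = lambda e: math.exp(sum(l * f(e) for l, f in zip(lambdas, features)))
    z = sum(score(e) for e in event_space)   # 1/z plays the role of the constant pi
    return score(x) / z
```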
Using Lagrange multipliers Minimize A(p): A(p) = −H(p) + Σj λj (Ep fj − Ep̃ fj) + μ (Σx p(x) − 1) Setting ∂A/∂p(x) = 0 gives p(x) the exponential form above.
Relation to Maximum Likelihood The log-likelihood of the empirical distribution p̃ as predicted by a model q is defined as L(q) = Σx p̃(x) log q(x) Theorem: if p* ∈ P ∩ Q, then p* = argmax over q in Q of L(q). Furthermore, p* is unique.
Summary (so far) Goal: find p* in P, which maximizes H(p). It can be proved that, when p* exists, it is unique. The model p* in P with maximum entropy is the model in Q that maximizes the likelihood of the training sample
Summary (cont) • Adding constraints (features): (Klein and Manning, 2003) • Lower maximum entropy • Raise maximum likelihood of data • Bring the distribution further away from uniform • Bring the distribution closer to data
Algorithms • Generalized Iterative Scaling (GIS): (Darroch and Ratcliff, 1972) • Improved Iterative Scaling (IIS): (Della Pietra et al., 1995)
GIS: setup Requirements for running GIS: • Obey form of model and constraints: p(x) = π Πj αj^fj(x), Ep fj = Ep̃ fj • An additional constraint: Σj fj(x) = C for every x. Let C = maxx Σj fj(x). Add a new (correction) feature fk+1: fk+1(x) = C − Σj=1..k fj(x)
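A sketch (mine) of adding the correction feature so that every event ends up with the same total feature count C:

```python
def make_correction_feature(features, event_space):
    """Return C and the correction feature f_{k+1}(x) = C - sum_j f_j(x)."""
    C = max(sum(f(x) for f in features) for x in event_space)
    def f_correction(x):
        return C - sum(f(x) for f in features)
    return C, f_correction
```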
GIS algorithm • Compute dj = Ep̃ fj, j=1, …, k+1 • Initialize λj(0) (any values, e.g., 0) • Repeat until convergence • Compute p(n)(x) under the current weights λj(n) • For each j • Compute the model expectation Ep(n) fj • Update λj(n+1) = λj(n) + (1/C) log( dj / Ep(n) fj )
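A compact sketch of the loop above (mine, not the course implementation), over an explicit event space and assuming the correction feature is already included so that Σj fj(x) = C for every event; features with zero observed count would have to be dropped first, since the update takes log(dj / Ep(n) fj).

```python
import math

def gis(sample, features, C, iterations=100):
    """sample: hashable training events; features: binary f(x), correction feature included."""
    n = len(sample)
    d = [sum(f(x) for x in sample) / n for f in features]    # observed expectations d_j
    lam = [0.0] * len(features)
    space = list(set(sample))        # simplification: event space = events seen in training
    for _ in range(iterations):
        scores = {x: math.exp(sum(l * f(x) for l, f in zip(lam, features))) for x in space}
        z = sum(scores.values())
        p = {x: s / z for x, s in scores.items()}             # current model p^(n)
        for j, f in enumerate(features):
            e_model = sum(p[x] * f(x) for x in space)          # model expectation of f_j
            lam[j] += (1.0 / C) * math.log(d[j] / e_model)     # GIS update
    return lam
```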
Properties of GIS • L(p(n+1)) >= L(p(n)) • The sequence is guaranteed to converge to p*. • The convergence can be very slow. • The running time of each iteration is O(NPA): • N: the training set size • P: the number of classes • A: the average number of features that are active for a given event (a, b).
Feature selection • Throw in many features and let the machine select the weights • Manually specify feature templates • Problem: too many features • An alternative: greedy algorithm • Start with an empty set S • Add a feature at each iteration
Notation With the feature set S: the model is pS, with log-likelihood L(pS) After adding a feature f: the feature set is S ∪ {f}, the model is pS∪f, with log-likelihood L(pS∪f) The gain in the log-likelihood of the training data: ΔL(S, f) = L(pS∪f) − L(pS)
Feature selection algorithm (Berger et al., 1996) • Start with S being empty; thus pS is uniform. • Repeat until the gain is small enough • For each candidate feature f • Compute the model using IIS • Calculate the log-likelihood gain • Choose the feature with maximal gain, and add it to S Problem: too expensive
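The same greedy loop as a sketch (mine), written against hypothetical helpers train_model(features) and log_likelihood(model, data); it makes the cost visible, since every candidate feature requires training a full model.

```python
def select_features(candidates, data, min_gain=1e-3):
    selected = []
    model = train_model(selected)                 # empty feature set: uniform model
    while True:
        remaining = [f for f in candidates if f not in selected]
        if not remaining:
            return selected
        gains = {f: log_likelihood(train_model(selected + [f]), data)
                    - log_likelihood(model, data)
                 for f in remaining}
        best = max(gains, key=gains.get)
        if gains[best] < min_gain:
            return selected
        selected.append(best)
        model = train_model(selected)
```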
Approximating gains (Berger et al., 1996) • Instead of recalculating all the weights, calculate only the weight of the new feature (keeping the weights of the previously selected features fixed).
Training a MaxEnt Model Scenario #1: no feature selection during training • Define feature templates • Create the feature set • Determine the optimum feature weights via GIS or IIS Scenario #2: with feature selection during training • Define feature templates • Create candidate feature set S • At every iteration, choose the feature from S (with max gain) and determine its weight (or choose top-n features and their weights).
POS tagging (Ratnaparkhi, 1996) • Notation variation: • fj(a, b): a: class, b: context • fj(hi, ti): hi: history for ith word, ti: tag for ith word • History: hi = {wi, wi+1, wi+2, wi-1, wi-2, ti-1, ti-2} • Training data: • Treat it as a list of (hi, ti) pairs. • How many pairs are there?
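A sketch (mine) of turning a tagged sentence into (hi, ti) pairs, assuming the history definition above; the boundary symbols are placeholders of my own.

```python
def make_pairs(words, tags):
    """Return one (h_i, t_i) pair per word, padding the sentence boundaries."""
    pw = ["<B2>", "<B1>"] + list(words) + ["<A1>", "<A2>"]
    pt = ["<B2>", "<B1>"] + list(tags)
    pairs = []
    for i in range(len(words)):
        j = i + 2                                  # index into the padded word list
        h = {"w": pw[j], "w+1": pw[j + 1], "w+2": pw[j + 2],
             "w-1": pw[j - 1], "w-2": pw[j - 2],
             "t-1": pt[j - 1], "t-2": pt[j - 2]}
        pairs.append((h, tags[i]))
    return pairs
```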
Using a MaxEnt Model • Modeling: p(a | b) = (1/Z(b)) Πj αj^fj(a, b) • Training: • Define feature templates • Create the feature set • Determine the optimum feature weights via GIS or IIS • Decoding: given a new context b, choose a* = argmax over a of p(a | b)
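Decoding as a small sketch (mine), with αj written as exp(λj):

```python
import math

def decode(b, classes, features, lambdas):
    """Return argmax_a p(a | b) for a conditional log-linear model."""
    scores = {a: math.exp(sum(l * f(a, b) for l, f in zip(lambdas, features)))
              for a in classes}
    z = sum(scores.values())          # Z(b); not actually needed for the argmax
    return max(classes, key=lambda a: scores[a] / z)
```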
Training step 1: define feature templates [Table of feature templates: each row pairs a condition on the history hi with the tag ti]