Maximum Entropy Model (I) LING 572 Fei Xia Week 5: 02/01-02/03/2011
MaxEnt in NLP • The maximum entropy principle has a long history. • The MaxEnt algorithm was introduced to the NLP field by Berger et al. (1996). • Used in many NLP tasks: tagging, parsing, PP attachment, …
Reference papers • (Ratnaparkhi, 1997) • (Berger et al., 1996) • (Ratnaparkhi, 1996) • (Klein and Manning, 2003) Note: these papers often use different notations.
Notation We follow the notation in (Berger et al., 1996).
Outline • Overview • The Maximum Entropy Principle • Modeling** • Decoding • Training** • Case study: POS tagging
Joint vs. Conditional models ** • Given training data {(x, y)}, we want to build a model to predict y for new x’s. For each model, we need to estimate the parameters µ. • Joint (generative) models estimate P(x, y) by maximizing the likelihood P(X, Y | µ) • Ex: n-gram models, HMM, Naïve Bayes, PCFG • Choosing weights is trivial: just use relative frequencies (see the sketch below). • Conditional (discriminative) models estimate P(y | x) by maximizing the conditional likelihood P(Y | X, µ) • Ex: MaxEnt, SVM, etc. • Choosing weights is harder.
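To illustrate the "relative frequencies" point, here is a minimal Python sketch (the toy data and names are hypothetical): the maximum-likelihood estimate of a joint model's parameters is just counting.

```python
from collections import Counter

# Toy sketch: for a joint (generative) model, the maximum-likelihood
# estimate of P(x, y) is simply the relative frequency of (x, y) in the
# training data. All data below is made up for illustration.
train = [("sunny", "pos"), ("rainy", "neg"), ("sunny", "pos"), ("cloudy", "neg")]

counts = Counter(train)                            # counts of (x, y) pairs
n = len(train)
p_joint = {xy: c / n for xy, c in counts.items()}  # MLE of P(x, y)

print(p_joint[("sunny", "pos")])  # 0.5
```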
Naïve Bayes Model [Diagram: class node C with arrows to feature nodes f1, f2, …, fn] Assumption: each fm is conditionally independent of fn given C.
The conditional independence assumption fm and fn are conditionally independent given c: P(fm | c, fn) = P(fm | c) Counter-example in the text classification task: - P(“cut” | politics) != P(“cut” | politics, “budget”) Q: How to deal with correlated features? A: Many models, including MaxEnt, do not assume that features are conditionally independent.
Naïve Bayes highlights • Choose c* = argmaxc P(c) ∏k P(fk | c) (see the sketch below) • Two types of model parameters: • Class prior: P(c) • Conditional probability: P(fk | c) • The number of model parameters: |C| + |C|·|V|
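To make the decoding rule concrete, here is a small self-contained Python sketch (toy probabilities, hypothetical names); logs are summed rather than multiplying probabilities, to avoid floating-point underflow:

```python
import math

# Toy NB parameters (hypothetical numbers): class priors P(c) and
# conditional probabilities P(f | c).
prior = {"politics": 0.6, "sports": 0.4}
cond = {
    ("cut", "politics"): 0.05, ("budget", "politics"): 0.04,
    ("cut", "sports"): 0.01, ("budget", "sports"): 0.005,
}

def nb_classify(features, classes):
    """Return argmax_c P(c) * prod_k P(f_k | c); the product is computed
    as a sum of logs to avoid underflow with many features."""
    def score(c):
        return math.log(prior[c]) + sum(math.log(cond[(f, c)]) for f in features)
    return max(classes, key=score)

print(nb_classify(["cut", "budget"], ["politics", "sports"]))  # politics
```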
P(f | c) in NB [Table: P(f | c) values arranged by class and feature] Each cell is a weight for a particular (class, feat) pair.
Weights in NB and MaxEnt • In NB • P(f | y) are probabilities (i.e., ∈ [0,1]) • P(f | y) are multiplied at test time • In MaxEnt • the weights are real numbers: they can be negative • the weights are added at test time
The highlights in MaxEnt P(y | x) = exp(∑j λj fj(x, y)) / Z(x), where fj(x, y) is a feature function, which normally corresponds to a (feature, class) pair, and Z(x) is a normalization term. Training: estimate the weights λj. Testing: calculate P(y | x).
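A minimal sketch of how P(y | x) would be computed from this form, assuming toy feature functions and weights (all names and numbers are hypothetical):

```python
import math

def maxent_prob(x, y, classes, feature_funcs, lambdas):
    """P(y | x) = exp(sum_j lambda_j * f_j(x, y)) / Z(x)."""
    def unnorm(c):
        return math.exp(sum(l * f(x, c) for f, l in zip(feature_funcs, lambdas)))
    z = sum(unnorm(c) for c in classes)  # normalization term Z(x)
    return unnorm(y) / z

# Hypothetical feature functions, each tied to a (feature, class) pair.
f1 = lambda x, y: 1 if "cut" in x and y == "politics" else 0
f2 = lambda x, y: 1 if "cut" in x and y == "sports" else 0

print(maxent_prob({"cut"}, "politics", ["politics", "sports"], [f1, f2], [1.2, -0.3]))
# ~0.82
```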
Main questions • What is the maximum entropy principle? • What is a feature function? • Modeling: Why does P(y|x) have this form? • Training: How do we estimate λj?
Outline • Overview • The Maximum Entropy Principle • Modeling** • Decoding • Training** • Case study
The maximum entropy principle • Related to Occam’s razor and other similar justifications for scientific inquiry • Make the minimum assumptions about unseen data. • Also: Laplace’s Principle of Insufficient Reason: when one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely.
Maximum Entropy • Why maximum entropy? • Maximize entropy = Minimize commitment • Model all that is known and assume nothing about what is unknown. • Model all that is known: satisfy a set of constraints that must hold • Assume nothing about what is unknown: choose the most “uniform” distribution, i.e., the one with maximum entropy
Ex1: Coin-flip example (Klein & Manning, 2003) • Toss a coin: p(H) = p1, p(T) = p2. • Constraint: p1 + p2 = 1 • Question: what’s p(x)? That is, what is the value of p1? • Answer: choose the p that maximizes H(p) [Plot: H(p) as a function of p1; H peaks at p1 = 0.5, and the point p1 = 0.3 is marked on the curve]
Coin-flip example** (cont) [Plot: H over (p1, p2) with the constraint line p1 + p2 = 1; the point p1 = 0.3 lies on the constraint line]
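A quick numeric check of the coin example in Python: scanning the constraint line p1 + p2 = 1 confirms that entropy peaks at the uniform point p1 = 0.5, and that p1 = 0.3 is strictly worse:

```python
import math

def H(p1):
    """Entropy (in bits) of a coin with p(H) = p1, p(T) = 1 - p1."""
    if p1 in (0.0, 1.0):
        return 0.0
    return -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))

# Scan the constraint line p1 + p2 = 1.
best = max((i / 1000 for i in range(1001)), key=H)
print(best, H(best))  # 0.5 1.0  -> the uniform distribution
print(H(0.3))         # ~0.881  -> strictly lower
```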
Ex2: An MT example (Berger et al., 1996) Possible translations for the word “in”: {dans, en, à, au cours de, pendant} Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1 Intuitive answer: p = 1/5 for each translation
An MT example (cont) Constraints: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1 p(dans) + p(en) = 3/10 Intuitive answer: p(dans) = p(en) = 3/20; p(à) = p(au cours de) = p(pendant) = 7/30
An MT example (cont) Constraints: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1 p(dans) + p(en) = 3/10 p(dans) + p(à) = 1/2 Intuitive answer: ?? (no longer obvious; this is where the maximum entropy machinery is needed)
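Since the answer is no longer obvious, one can find it numerically. The sketch below maximizes entropy under the three constraints using scipy's general-purpose optimizer (scipy is an assumption here; the slides do not prescribe any tool):

```python
import numpy as np
from scipy.optimize import minimize

# Translations of "in" (Berger et al., 1996): dans, en, à, au cours de, pendant.
# Maximize H(p) = -sum p log p subject to:
#   sum(p) = 1,  p(dans) + p(en) = 3/10,  p(dans) + p(à) = 1/2
def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)  # guard against log(0)
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 0.5},
]
res = minimize(neg_entropy, np.full(5, 0.2),
               bounds=[(0.0, 1.0)] * 5, constraints=constraints)
print(res.x)  # the maximum entropy distribution over the 5 translations
```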
Ex4: Overlapping features (Klein and Manning, 2003) [Table: cell probabilities p1, p2, p3, p4]
Ex4 (cont) [Table: cell probabilities p1, p2]
Ex4 (cont) [Table: cell probability p1]
The MaxEnt Principle summary • Goal: Among all the distributions that satisfy the constraints, choose the one, p*, that maximizes H(p). • Q1: How to represent constraints? • Q2: How to find such distributions?
Reading #2 (Q1): Let P(X=i) be the probability of getting an i when rolling a die. What is the maximum entropy distribution P(X) if the following is true? (a) P(X=1) + P(X=2) = 1/2 Answer: 1/4, 1/4, 1/8, 1/8, 1/8, 1/8 (b) P(X=1) + P(X=2) = 1/2 and P(X=6) = 1/3 Answer: 1/4, 1/4, 1/18, 1/18, 1/18, 1/3
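A quick sanity check in Python: the proposed answer for (b) satisfies both constraints and has higher entropy than another feasible distribution:

```python
import math

def H(p):
    """Entropy in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

ans = [1/4, 1/4, 1/18, 1/18, 1/18, 1/3]   # the proposed answer to (b)
alt = [0.3, 0.2, 1/18, 1/18, 1/18, 1/3]   # also satisfies both constraints

assert abs(sum(ans) - 1) < 1e-9 and abs(ans[0] + ans[1] - 0.5) < 1e-9
print(H(ans) > H(alt))  # True: uniform-within-constraints has higher entropy
```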
(Q2) In the text classification task, |V| is the number of features, |C| is the number of classes. How many feature functions are there? |C| * |V|
Outline • Overview • The Maximum Entropy Principle • Modeling** • Decoding • Training** • Case study
The Setting • From the training data, collect (x, y) pairs: • x ∈ X: the observed data • y ∈ Y: the thing to be predicted (e.g., a class in a classification problem) • Ex: In a text classification task • x: a document • y: the category of the document • Goal: to estimate P(y|x)
The basic idea • Goal: to estimate p(y|x) • Choose p(x,y) with maximum entropy (or “uncertainty”) subject to the constraints (or “evidence”).
The outline for modeling • Feature function: • Calculating the expectation of a feature function • The forms of P(x,y) and P(y|x)
The definition • A feature function is a binary-valued function on events: fj : X × Y → {0, 1} • Ex: fj(x, y) = 1 if y = “politics” and x contains “cut”, 0 otherwise
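The same definition in code, as a minimal sketch (the particular (word, class) pair is made up for illustration):

```python
# A minimal sketch of a binary feature function f_j : X x Y -> {0, 1}.
def f_j(x, y):
    """Fires iff the document x contains 'cut' and the class y is 'politics'."""
    return 1 if "cut" in x and y == "politics" else 0

print(f_j({"cut", "budget"}, "politics"))  # 1
print(f_j({"cut", "budget"}, "sports"))    # 0
```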
The weights in NB [Table: P(f | c) values arranged by class and feature] Each cell is a weight for a particular (class, feat) pair.
The matrix in MaxEnt Each feature function fj corresponds to a (feat, class) pair.
The weights in MaxEnt Each feature function fj has a weight λj.
Feature function summary • A feature function in MaxEnt corresponds to a (feat, class) pair. • The number of feature functions in MaxEnt is approximately |C| * |V|. • The job of a MaxEnt trainer is to learn the weights of the feature functions.
The outline for modeling • Feature function: • Calculating the expectation of a feature function • The forms of P(x,y) and P(y | x)
The expected return • Ex1: • Flip a coin • if it is a head, you will get 100 dollars • if it is a tail, you will lose 50 dollars • What is the expected return? P(X=H) * 100 + P(X=T) * (-50) • Ex2: • If the outcome is xi, you receive vi dollars. • What is the expected return? ∑i P(X=xi) * vi
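The computation for Ex1, with the general formula ∑i P(X=xi) * vi spelled out in Python:

```python
# Expected return = sum_i P(X = x_i) * v_i
payoff = {"H": 100, "T": -50}  # v_i for each outcome
p = {"H": 0.5, "T": 0.5}       # a fair-coin model

expected = sum(p[x] * v for x, v in payoff.items())
print(expected)  # 25.0
```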
Empirical expectation • Denoted Ep̃(f): the expectation of f under the empirical distribution p̃ • Ex1: Toss a coin four times and get H, T, H, and H. • The average return: (100 − 50 + 100 + 100)/4 = 62.5 • Empirical distribution: p̃(H) = 3/4, p̃(T) = 1/4 • Empirical expectation: 3/4 * 100 + 1/4 * (−50) = 62.5
Model expectation • Ex1: Toss a coin four times and get H, T, H, and H. • A model: p(H) = p(T) = 1/2 • Model expectation: 1/2 * 100 + 1/2 * (−50) = 25
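The contrast between the two expectations for this coin example, as a small Python sketch:

```python
from collections import Counter

tosses = ["H", "T", "H", "H"]
payoff = {"H": 100, "T": -50}

# Empirical distribution from the four tosses: p~(H) = 3/4, p~(T) = 1/4.
counts = Counter(tosses)
p_emp = {x: c / len(tosses) for x, c in counts.items()}
p_model = {"H": 0.5, "T": 0.5}  # a fair-coin model

print(sum(p_emp[x] * payoff[x] for x in payoff))    # 62.5 (empirical)
print(sum(p_model[x] * payoff[x] for x in payoff))  # 25.0 (model)
```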
Some notations • Training data: {(x1, y1), …, (xN, yN)} • Empirical distribution: p̃(x, y) = count(x, y) / N • A model: p(y | x) • The jth feature function: fj(x, y) • Empirical expectation of fj: Ep̃(fj) = ∑x,y p̃(x, y) fj(x, y) • Model expectation of fj: Ep(fj) = ∑x,y p̃(x) p(y | x) fj(x, y)
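A sketch of both expectations in code, following the Berger et al. notation above (the toy data, feature function, and uniform model are hypothetical):

```python
def empirical_expectation(data, f):
    """E_p~(f) = sum_{x,y} p~(x, y) f(x, y), with p~ the relative
    frequency of (x, y) in the training data."""
    return sum(f(x, y) for x, y in data) / len(data)

def model_expectation(data, f, p_y_given_x, classes):
    """E_p(f) = sum_{x,y} p~(x) p(y | x) f(x, y)."""
    return sum(p_y_given_x(y, x) * f(x, y)
               for x, _ in data for y in classes) / len(data)

# Hypothetical toy data, feature function, and (uniform) model.
data = [({"cut"}, "politics"), ({"match"}, "sports")]
f = lambda x, y: 1 if "cut" in x and y == "politics" else 0
uniform = lambda y, x: 0.5

print(empirical_expectation(data, f))                               # 0.5
print(model_expectation(data, f, uniform, ["politics", "sports"]))  # 0.25
```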