Log-Linear Models in NLP
Noah A. Smith
Department of Computer Science / Center for Language and Speech Processing
Johns Hopkins University
nasmith@cs.jhu.edu
Outline • Maximum Entropy principle • Log-linear models • Conditional modeling for classification • Ratnaparkhi’s tagger • Conditional random fields • Smoothing • Feature Selection
Data. For now, we're just talking about modeling data; there is no task yet. How do we assign a probability to each type of shape?
Maximum Likelihood. Estimate the full joint table directly: 11 degrees of freedom (12 – 1). Fewer parameters? How to smooth?
Some other kinds of models. A chain-rule factorization over Color, Shape, and Size:
Pr(Color, Shape, Size) = Pr(Color) • Pr(Shape | Color) • Pr(Size | Color, Shape)
11 degrees of freedom (1 + 4 + 6). These two are the same: the chain-rule factorization is exact, so it is just the full joint table in another form.
Some other kinds of models. Make Shape independent of Color:
Pr(Color, Shape, Size) = Pr(Color) • Pr(Shape) • Pr(Size | Color, Shape)
9 degrees of freedom (1 + 2 + 6).
Some other kinds of models. Make Shape and Color depend only on Size:
Pr(Color, Shape, Size) = Pr(Size) • Pr(Shape | Size) • Pr(Color | Size)
7 degrees of freedom (1 + 2 + 4). No zeroes here ...
Some other kinds of models. Full independence:
Pr(Color, Shape, Size) = Pr(Size) • Pr(Shape) • Pr(Color)
4 degrees of freedom (1 + 2 + 1).
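As a quick check on the degree-of-freedom counts above (assuming, as the numbers suggest, 2 colors, 3 shapes, and 2 sizes, i.e. 12 cells in the joint table):

```latex
\underbrace{(2-1)}_{\Pr(\mathrm{Color})}
+ \underbrace{2\,(3-1)}_{\Pr(\mathrm{Shape}\mid\mathrm{Color})}
+ \underbrace{6\,(2-1)}_{\Pr(\mathrm{Size}\mid\mathrm{Color},\mathrm{Shape})}
= 1 + 4 + 6 = 11 = 12 - 1
\qquad\text{vs.}\qquad
\underbrace{(2-1)}_{\Pr(\mathrm{Size})} + \underbrace{(3-1)}_{\Pr(\mathrm{Shape})} + \underbrace{(2-1)}_{\Pr(\mathrm{Color})} = 4
```

The exact chain-rule factorization keeps all 11 degrees of freedom; each independence assumption removes some.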
This is difficult. Different factorizations affect:
• smoothing
• # parameters (model size)
• model complexity
• "interpretability"
• goodness of fit
• ...
Usually, this isn't done empirically, either!
Desiderata • You decide which features to use. • Some intuitive criterion tells you how to use them in the model. • Empirical.
Maximum Entropy “Make the model as uniform as possible ... but I noticed a few things that I want to model ... so pick a model that fits the data on those things.”
Occam’s Razor One should not increase, beyond what is necessary, the number of entities required to explain anything.
[Example slides: the model is required to match a few probabilities observed in the data, e.g. Pr(·, small) = 0.048 and Pr(large, ·) = 0.125, where · stands for particular shapes shown as images on the original slides; the remaining probabilities are left for the model to choose.]
Questions
• Is there an efficient way to solve this problem?
• Does a solution always exist? (And what to do if it doesn't?)
• Is there a way to express the model succinctly?
Entropy
• A statistical measure of a distribution p over a set X (definition below).
• Measured in bits.
• Ranges over [0, log2|X|].
• High entropy: close to uniform.
• Low entropy: close to deterministic.
• Concave in p.
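For reference, the standard definition (not spelled out on the slide itself):

```latex
H(p) \;=\; -\sum_{x \in X} p(x)\,\log_2 p(x),
\qquad 0 \;\le\; H(p) \;\le\; \log_2 |X|
```

with the maximum attained by the uniform distribution and the minimum by any deterministic (point-mass) distribution.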
[Figure: The Max Ent Problem. The entropy H plotted over the probability simplex (axes p1 and p2), with its maximum marked.]
The Max Ent Problem. We are picking a distribution p over X so as to maximize the objective function H(p), subject to:
• the probabilities sum to 1 and are nonnegative: Σ_x p(x) = 1 and p(x) ≥ 0;
• n feature constraints: for each feature f_i, the expected feature value under the model equals the expected feature value from the data, E_p[f_i] = E_p̃[f_i].
About feature constraints. The features are indicator functions on outcomes, for example (sketched in code below):
• f1(x) = 1 if x is a small [shape pictured on the slide], 0 otherwise
• f2(x) = 1 if x is large and light, 0 otherwise
• f3(x) = 1 if x is small, 0 otherwise
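A minimal sketch of such indicator features and their empirical expectations (the attribute names, the particular shape in f1, and the toy data are made up for illustration; only the small/large/light conditions come from the slide):

```python
from collections import namedtuple

# A toy outcome type mirroring the shapes data on the slides.
Item = namedtuple("Item", ["size", "color", "shape"])

# Indicator feature functions (f1's shape value is a hypothetical stand-in).
def f1(x): return 1.0 if x.size == "small" and x.shape == "circle" else 0.0
def f2(x): return 1.0 if x.size == "large" and x.color == "light" else 0.0
def f3(x): return 1.0 if x.size == "small" else 0.0

def empirical_expectation(f, data):
    """Average feature value over the observed data: the constraint target E_p~[f]."""
    return sum(f(x) for x in data) / len(data)

# Made-up data, just to show the computation.
data = [Item("small", "light", "circle"), Item("large", "dark", "square"),
        Item("large", "light", "circle"), Item("small", "dark", "triangle")]
print(empirical_expectation(f3, data))  # fraction of small items: 0.5 here
```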
Mathematical Magic. The constrained problem (|X| variables p, concave in p) can be converted into an unconstrained problem (N variables θ, concave in θ).
What’s the catch? The model takes on a specific, parameterized form. It can be shown that any max-ent model must take this form.
Outline • Maximum Entropy principle • Log-linear models • Conditional modeling for classification • Ratnaparkhi’s tagger • Conditional random fields • Smoothing • Feature Selection
Log-linear models
p_θ(x) = exp( Σ_i θ_i f_i(x) ) / Z(θ)
• One parameter (θ_i) for each feature f_i.
• The numerator exp( Σ_i θ_i f_i(x) ) is an unnormalized probability, or weight.
• Z(θ) = Σ_x exp( Σ_i θ_i f_i(x) ) is the partition function, which normalizes the weights.
• Taking logs, log p_θ(x) = Σ_i θ_i f_i(x) − log Z(θ), which is linear in the features: hence "log-linear".
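A small sketch of this model form in code, over a finite outcome space; the function name is illustrative, and the features are assumed to be indicator functions like those sketched earlier:

```python
import math

def loglinear_probs(outcomes, features, theta):
    """p_theta(x) = exp(sum_i theta_i * f_i(x)) / Z(theta), for a finite outcome space."""
    # Unnormalized weights: exponentiated linear scores.
    weights = [math.exp(sum(t * f(x) for t, f in zip(theta, features)))
               for x in outcomes]
    Z = sum(weights)                       # partition function
    return [w / Z for w in weights]

# Usage (with the earlier f1, f2, f3 and a list of all possible items):
# probs = loglinear_probs(all_items, [f1, f2, f3], [0.5, -1.0, 2.0])
```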
Mathematical Magic. The constrained max ent problem (|X| variables p, concave in p) corresponds to the unconstrained log-linear maximum-likelihood (ML) problem (N variables θ, concave in θ).
What does MLE mean? Because of independence among examples, the likelihood of the data is a product over examples, and because the arg max is the same in the log domain, we can equivalently maximize a sum of log-probabilities:
θ* = argmax_θ Π_t p_θ(x_t) = argmax_θ Σ_t log p_θ(x_t)
Iterative Methods
• Generalized Iterative Scaling
• Improved Iterative Scaling
• Gradient Ascent (see the sketch after this list)
• Newton/Quasi-Newton Methods
• Conjugate Gradient
• Limited-Memory Variable Metric
• ...
All of these methods are correct and will converge to the right answer; it's just a matter of how fast.
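A minimal gradient-ascent sketch for maximum-likelihood training of the log-linear model above (plain gradient ascent, not GIS/IIS; it uses the standard fact that the gradient of the average log-likelihood is the empirical feature expectation minus the model feature expectation, and reuses the loglinear_probs helper sketched earlier; the learning rate and iteration count are arbitrary):

```python
def fit_loglinear(outcomes, features, data, lr=0.1, iters=500):
    """Fit theta by plain gradient ascent on the average log-likelihood."""
    theta = [0.0] * len(features)
    # Empirical feature expectations E_data[f_i] (the max-ent constraint targets).
    emp = [sum(f(x) for x in data) / len(data) for f in features]
    for _ in range(iters):
        probs = loglinear_probs(outcomes, features, theta)
        # Model feature expectations E_model[f_i] = sum_x p_theta(x) * f_i(x).
        model = [sum(p * f(x) for p, x in zip(probs, outcomes)) for f in features]
        # Gradient step: theta_i += lr * (E_data[f_i] - E_model[f_i]).
        theta = [t + lr * (e - m) for t, e, m in zip(theta, emp, model)]
    return theta
```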
Questions
• Is there an efficient way to solve this problem? Yes, many iterative methods.
• Does a solution always exist? Yes, if the constraints come from the data.
• Is there a way to express the model succinctly? Yes, a log-linear model.
Outline • Maximum Entropy principle • Log-linear models • Conditional modeling for classification • Ratnaparkhi’s tagger • Conditional random fields • Smoothing • Feature Selection
Conditional Estimation. The training data consists of examples paired with labels. Training objective: maximize the conditional likelihood of the labels given the examples, Σ_t log p(y_t | x_t). Classification rule: predict the most probable label for each example, y* = argmax_y p(y | x).
[Figures, built up over four slides: Maximum Likelihood over the joint space of labels and objects; the model spreads probability over every (label, object) cell.]
[Figure: Conditional Likelihood over the same label-by-object space; probability is normalized over the labels for each object.]
Remember: log-linear models. For conditional estimation, the same log-linear form is applied to labels given objects:
p_θ(y | x) = exp( Σ_i θ_i f_i(x, y) ) / Z_θ(x), where Z_θ(x) = Σ_y' exp( Σ_i θ_i f_i(x, y') )
Log-linear models: MLE vs. CLE. In joint MLE, the partition function sums over all example types and all labels; in conditional estimation, the partition function Z_θ(x) sums only over all labels for the given example.
Classification Rule. Pick the most probable label y:
y* = argmax_y p_θ(y | x) = argmax_y Σ_i θ_i f_i(x, y)
We don't need to compute the partition function at test time, because Z_θ(x) is the same for every candidate label of a given x. But it does need to be computed during training. (A small sketch of this rule follows.)
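A tiny sketch of that test-time rule under the conditional model (the feature functions here take both the object x and a candidate label y; the names are placeholders, not from the slides):

```python
def classify(x, labels, features, theta):
    """Return the label with the highest unnormalized log-linear score.
    Z_theta(x) is constant across labels, so it is never computed here."""
    def score(y):
        return sum(t * f(x, y) for t, f in zip(theta, features))
    return max(labels, key=score)
```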
Outline • Maximum Entropy principle • Log-linear models • Conditional modeling for classification • Ratnaparkhi’s tagger • Conditional random fields • Smoothing • Feature Selection
Ratnaparkhi's POS Tagger (1996)
• Probability model: the tag sequence is scored left to right as a product of local conditionals, Pr(t_1 ... t_n | w_1 ... w_n) ≈ Π_i Pr(t_i | h_i), where each Pr(t_i | h_i) is a conditional log-linear model and the history h_i contains nearby words and the previous tags.
• Assume unseen words behave like rare words.
• Rare words ≡ count < 5.
• Training: GIS.
• Testing/Decoding: beam search (sketched below).
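A rough sketch of left-to-right beam-search decoding for such a tagger; local_prob stands in for the tagger's conditional model Pr(t_i | h_i) and is assumed rather than taken from the paper, and the beam width and the contents of the history are illustrative:

```python
def beam_search_tag(words, tagset, local_prob, beam_width=5):
    """Keep only the beam_width most probable partial tag sequences at each position."""
    beam = [([], 1.0)]                      # (partial tag sequence, probability so far)
    for i in range(len(words)):
        candidates = []
        for tags, prob in beam:
            for t in tagset:
                # The history passed to the model: the words, the position,
                # and the tags chosen so far (exact contents are model-dependent).
                p = local_prob(t, (words, i, tags))
                candidates.append((tags + [t], prob * p))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]
    return beam[0][0]                       # best complete tag sequence found
```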
The "Label Bias" Problem. Suppose the training data contains both "born to run" (tagged VBN TO VB) and "born to wealth" (tagged VBN IN NN), with the "to run" pattern the more frequent of the two. Now tag the sentence "born to wealth" with a locally normalized left-to-right model:

Pr(VBN | born) • Pr(IN | VBN, to) • Pr(NN | VBN, IN, wealth) = 1 × .4 × 1 = .4
Pr(VBN | born) • Pr(TO | VBN, to) • Pr(VB | VBN, TO, wealth) = 1 × .6 × 1 = .6

The incorrect path (VBN TO VB) wins. Each local distribution must sum to one over the next tag, and here the continuations after IN and after TO are both deterministic, so the word "wealth" cannot change either path's score; the decision was effectively made at "to", before the informative word was seen.
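A few lines of arithmetic in code, just to make the comparison above concrete (the probabilities are the ones on the slide; everything else is illustrative):

```python
# Locally normalized transition probabilities from the slide's example.
p_IN_given_to = 0.4   # Pr(IN | VBN, to)
p_TO_given_to = 0.6   # Pr(TO | VBN, to)

# Continuations after IN and after TO are deterministic (probability 1),
# so the final word "wealth" cannot change either path's score.
path_correct   = 1.0 * p_IN_given_to * 1.0   # VBN IN NN  -> 0.4
path_incorrect = 1.0 * p_TO_given_to * 1.0   # VBN TO VB  -> 0.6
print(path_correct, path_incorrect)          # the wrong path scores higher
```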