The Improved Iterative Scaling Algorithm: A Gentle Introduction
Adam Berger, CMU, 1997
Introduction
• We consider a random process that produces an output value y, a member of a (necessarily finite) set of possible output values.
• The value of the random variable y is influenced by some conditioning information (or "context") x.
• Example: the language modeling problem, in which we assign a probability p(y | x) to the event that the next word in a sequence of text will be y, given x, the value of the previous words.
Features and constraints
• The goal is to construct a statistical model of the process which generated the training sample.
• The building blocks of this model will be a set of statistics of the training sample, for example:
  • the frequency with which in translated to either dans or en was 3/10;
  • the frequency with which in translated to either dans or au cours de was 1/2;
  • and so on.
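As a concrete illustration, the following sketch computes such sample statistics from a small hypothetical training sample. The (context, translation) pairs are invented here; their counts were chosen only so that the two frequencies quoted above come out to 3/10 and 1/2.

```python
from collections import Counter

# Hypothetical training sample of (English context, French translation) pairs
# for the word "in". The data is illustrative, not from the original tutorial.
sample = [
    ("in the box", "dans"),
    ("in April", "en"), ("in May", "en"),
    ("in time", "au cours de"), ("in the course of", "au cours de"),
    ("in due course", "au cours de"), ("in the year", "au cours de"),
    ("in fact", "à"), ("in general", "à"),
    ("in between", "pendant"),
]

counts = Counter(y for _, y in sample)
n = len(sample)

# Statistics of the training sample, as on the slide:
print((counts["dans"] + counts["en"]) / n)            # -> 0.3  (3/10)
print((counts["dans"] + counts["au cours de"]) / n)   # -> 0.5  (1/2)
```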
Features and constraints
• A statistic may also depend on the conditioning information x.
• E.g., in the training sample, if April is the word following in, then the translation of in is en with frequency 9/10.
• We capture this with a binary-valued indicator function:
  f(x, y) = 1 if y = en and April follows in; f(x, y) = 0 otherwise.
• The expected value of f with respect to the empirical distribution \tilde{p}(x, y) of the training sample is
  \tilde{p}(f) = \sum_{x,y} \tilde{p}(x, y) f(x, y).
Features and constraints
• We can express any statistic of the sample as the expected value of an appropriate binary-valued indicator function f.
• We call such a function a feature function, or feature for short.
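A minimal sketch of a feature and its empirical expectation follows; the feature chosen here mirrors the April/en example above, and the function names are placeholders.

```python
def f(x, y):
    """Binary indicator feature: 1 if the translation y is 'en' and the
    English context x contains the word 'April', else 0."""
    return 1 if y == "en" and "April" in x else 0

def empirical_expectation(feature, sample):
    # p~(f) = sum over (x, y) of p~(x, y) * f(x, y); with one observation per
    # listed pair, p~(x, y) is simply count / N, so this is the fraction of
    # training pairs on which the feature fires.
    return sum(feature(x, y) for x, y in sample) / len(sample)
```

Applied to the hypothetical sample from the previous sketch, the empirical expectation is just the fraction of (x, y) pairs on which the feature fires.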
Features and constraints
• When we discover a statistic that we feel is useful, we can acknowledge its importance by requiring that our model accord with it.
• We do this by constraining the expected value that the model assigns to the corresponding feature function f.
• The expected value of f with respect to the model p(y | x) is
  p(f) = \sum_{x,y} \tilde{p}(x) p(y \mid x) f(x, y),
  where \tilde{p}(x) is the empirical distribution of the contexts x in the training sample.
Features and constraints
• We constrain this expected value to be the same as the expected value of f in the training sample. That is, we require
  p(f) = \tilde{p}(f).
• We call this requirement a constraint equation, or simply a constraint.
• Written out in full, the constraint is
  \sum_{x,y} \tilde{p}(x) p(y \mid x) f(x, y) = \sum_{x,y} \tilde{p}(x, y) f(x, y).
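Continuing the sketch, the model expectation and the constraint check might look as follows. The representation of the model as a dict mapping each context x to a dict of y: p(y | x) values is an assumption made here for illustration, not part of the original tutorial.

```python
def model_expectation(feature, model, sample):
    # p(f) = sum_x p~(x) * sum_y p(y|x) * f(x, y), where p~(x) = count(x) / N.
    # Iterating over the sample's contexts weights each x by its empirical count.
    n = len(sample)
    total = 0.0
    for x, _ in sample:
        total += sum(p * feature(x, y) for y, p in model[x].items())
    return total / n

def constraint_holds(feature, model, sample, tol=1e-9):
    # The constraint p(f) = p~(f), checked up to a numerical tolerance.
    empirical = sum(feature(x, y) for x, y in sample) / len(sample)
    return abs(model_expectation(feature, model, sample) - empirical) < tol
```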
Features and constraints
• To sum up so far, we now have:
  • a means of representing statistical phenomena inherent in a sample of data (namely, the empirical expectation \tilde{p}(f));
  • a means of requiring that our model of the process exhibit these phenomena (namely, the constraint p(f) = \tilde{p}(f)).
• A feature is a binary-valued function of (x, y).
• A constraint is an equation between the expected value of the feature function in the model and its expected value in the training data.
The maxent principle
• Suppose that we are given n feature functions f_i, which determine statistics we feel are important in modeling the process. We would like our model to accord with these statistics.
• That is, we would like p to lie in the subset C of P (the space of all conditional probability distributions) defined by
  C \equiv \{ p \in P \mid p(f_i) = \tilde{p}(f_i) \text{ for } i \in \{1, 2, \ldots, n\} \}.
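To make the principle concrete, here is a small worked example. It assumes, as in Berger's maxent tutorial, that the translator chooses among five French phrases; that phrase set is not stated on these slides and is used here only for illustration. With the single constraint p(dans) + p(en) = 3/10, the maximum-entropy model is the most uniform distribution satisfying it:

```latex
% Assumed output set {dans, en, à, au cours de, pendant} and the single
% constraint p(dans) + p(en) = 3/10: entropy is maximized by spreading the
% probability mass evenly within and outside the constrained set.
\begin{align*}
  p(\textit{dans}) = p(\textit{en}) &= \tfrac{1}{2}\cdot\tfrac{3}{10} = \tfrac{3}{20},\\
  p(\textit{à}) = p(\textit{au cours de}) = p(\textit{pendant}) &= \tfrac{1}{3}\cdot\tfrac{7}{10} = \tfrac{7}{30}.
\end{align*}
```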
Exponential form
• The maximum entropy principle presents us with a problem in constrained optimization: find the p ∈ C which maximizes H(p).
• That is, find
  p^{*} = \arg\max_{p \in C} H(p),
  where H(p) = -\sum_{x,y} \tilde{p}(x) p(y \mid x) \log p(y \mid x) is the conditional entropy of the model.
Exponential form
• We maximize H(p) subject to the following constraints:
  1. p(y \mid x) \geq 0 for all x, y.
  2. \sum_y p(y \mid x) = 1 for all x.
     This and the previous condition guarantee that p is a conditional probability distribution.
  3. \sum_{x,y} \tilde{p}(x) p(y \mid x) f_i(x, y) = \sum_{x,y} \tilde{p}(x, y) f_i(x, y) for i \in \{1, 2, \ldots, n\}.
     In other words, p ∈ C, and so p satisfies the active constraints C.
Exponential form
• To solve this optimization problem, introduce the Lagrangian
  \xi(p, \Lambda, \gamma) \equiv H(p) + \sum_i \lambda_i \big( p(f_i) - \tilde{p}(f_i) \big) + \gamma \Big( \sum_y p(y \mid x) - 1 \Big),
  where the \lambda_i and \gamma are Lagrange multipliers for the constraints above.
Exponential form
• Holding \Lambda and \gamma fixed, we maximize the Lagrangian with respect to p(y \mid x); setting the derivative to zero and solving gives the parametric form
  p_\Lambda(y \mid x) = \frac{1}{Z_\Lambda(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big),   (1)
  where Z_\Lambda(x) = \sum_y \exp\big( \sum_i \lambda_i f_i(x, y) \big) is the normalizing constant ensuring that \sum_y p_\Lambda(y \mid x) = 1.
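A minimal sketch of this exponential-form model in code follows; the feature list, weights, and label set are placeholders supplied by the caller, not fitted values from the slides. Finding the weights \lambda_i that actually satisfy the constraints is the job of the Improved Iterative Scaling algorithm developed in the rest of the tutorial.

```python
import math

def p_exponential(y, x, features, lambdas, labels):
    """Sketch of p_Lambda(y | x) = exp(sum_i lambda_i * f_i(x, y)) / Z_Lambda(x).

    `features` is a list of binary feature functions f_i(x, y), `lambdas` the
    corresponding weights, and `labels` the finite set of possible outputs y.
    """
    def unnormalized(candidate):
        return math.exp(sum(lam * f(x, candidate)
                            for f, lam in zip(features, lambdas)))
    z = sum(unnormalized(candidate) for candidate in labels)  # Z_Lambda(x)
    return unnormalized(y) / z
```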