EM Advanced Statistical Methods in NLP Ling 572 March 6, 2012 Slides based on F. Xia ’11
Roadmap • Motivation: • Unsupervised learning • Maximum Likelihood Estimation • EM: • Basic concepts • Main ideas • Example: Forward-backward algorithm
Motivation • Task: Train a speech recognizer • Approach: Build a Hidden Markov Model • States: Phonemes • Observations: Acoustic speech signal • Transition probabilities: Phone sequence probabilities • Emission probabilities: Acoustic model probabilities • Training data: • Easy to get: lots and lots of recorded audio • Hard to get: Phonetic labeling of lots of recorded audio • Can we train our model without the ‘hard to get’ part?
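To make these pieces concrete, here is a minimal sketch of such an HMM's parameters in Python, with a hypothetical three-phoneme inventory, a four-symbol quantization of the acoustic signal, and made-up probabilities (not a real acoustic model):

```python
import numpy as np

# Hypothetical phoneme inventory (hidden states) and toy numbers
states = ["AA", "IY", "S"]

# Transition probabilities: P(next phoneme | current phoneme); each row sums to 1
A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])

# Emission probabilities: P(acoustic symbol | phoneme), with the speech signal
# quantized into 4 discrete symbols purely for illustration
B = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.6, 0.2, 0.1],
              [0.1, 0.1, 0.2, 0.6]])

# Initial state distribution
pi = np.array([0.5, 0.3, 0.2])

# With phonetically labeled audio, A and B could be estimated by counting;
# EM (Forward-Backward) estimates them from unlabeled audio instead.
```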
Motivation • Task: Train a probabilistic context-free grammar • Model: • Production rule probabilities • Probability of non-terminal rewriting • Training data: • Easy to get: lots and lots of text sentences • Hard to get: parse trees on lots of text sentences • Can we train our model without the ‘hard to get’ part?
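As a concrete illustration, a minimal sketch of PCFG parameters as a table of rule probabilities, using a toy grammar with made-up numbers (not taken from any treebank):

```python
# Each non-terminal's rewrite probabilities sum to 1
pcfg = {
    "S":  {("NP", "VP"): 1.0},
    "NP": {("DT", "NN"): 0.7, ("NN",): 0.3},
    "VP": {("VB", "NP"): 0.6, ("VB",): 0.4},
}

for lhs, rules in pcfg.items():
    assert abs(sum(rules.values()) - 1.0) < 1e-9

# With parse trees, these probabilities come from rule counts;
# EM (Inside-Outside) estimates them from raw sentences instead.
```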
Approach • Unsupervised learning • EM approach: • Family of unsupervised parameter estimation techniques • General framework • Many specific algorithms instantiate it: • Forward-Backward, Inside-Outside, IBM MT models, etc.
EM • Expectation-Maximization: • Two-step iterative procedure • General parameter estimation method: • Based on Maximum Likelihood Estimation • General form provided by (Dempster, Laird, Rubin ’77) • Unified framework • Specific instantiations predate it
Maximum Likelihood Estimation • MLE: • Given data: X = {X1, X2, …, Xn} • Parameters: Θ • Likelihood: P(X|Θ) • Log likelihood: L(Θ) = log P(X|Θ) • Maximum likelihood: • ΘML = argmaxΘ log P(X|Θ) Based on F. Xia ’11
MLE • Assume data X is independently and identically distributed (i.i.d.): • P(X|Θ) = ∏i P(Xi|Θ), so L(Θ) = Σi log P(Xi|Θ) • Difficulty of computing the maximum depends on the form of P(Xi|Θ) Based on F. Xia ’11
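A minimal sketch of the i.i.d. log-likelihood as a sum of per-item log probabilities; the Bernoulli model and the value p = 0.6 are made up for illustration:

```python
import math

def log_likelihood(data, prob):
    """L(Theta) = sum over i of log P(Xi | Theta) for an i.i.d. sample."""
    return sum(math.log(prob(x)) for x in data)

# Toy usage: coin flips under a Bernoulli model with P(H) = p
flips = ["H", "T", "H", "H"]
p = 0.6
print(log_likelihood(flips, lambda x: p if x == "H" else 1 - p))
```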
Simple Example • Coin flipping: • Single coin: Probability of heads: p; tails: 1-p • Consider a sequence of N coin flips, m of which are heads • Data X: Coin flip sequence, e.g. X = {H, T, H} • Parameter(s) Θ: p • What value of p maximizes the probability of the data? Based on F. Xia ’11
Simple Example, Formally • L(Θ) = log P(X|Θ) = log [p^m (1-p)^(N-m)] • L(Θ) = log p^m + log (1-p)^(N-m) = m log p + (N-m) log(1-p) • Setting dL/dp = m/p - (N-m)/(1-p) = 0 gives the maximum at p = m/N Based on F. Xia ’11
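A quick numerical check of the closed-form answer p = m/N, with made-up counts (this sketch is not part of the original slides):

```python
import numpy as np

N, m = 10, 7                           # toy counts: 10 flips, 7 heads

def log_lik(p):
    return m * np.log(p) + (N - m) * np.log(1 - p)

grid = np.linspace(0.01, 0.99, 99)     # candidate values of p
p_hat = grid[np.argmax(log_lik(grid))]
print(p_hat)                           # ~0.7, i.e. m/N
```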
EM • General setting: • Data X = {X1, X2, …, Xn} • Parameter vector θ • EM provides a method to compute: • θML = argmaxθ L(θ) = argmaxθ log P(X|θ) • In many cases, computing P(X|θ) directly is hard • However, computing P(X,Y|θ) can be easier Based on F. Xia ’11
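A minimal sketch of why the complete-data likelihood can be easier to handle, using a toy two-component Gaussian mixture (the data, means, and weights are made up; the NLP cases above are HMMs and PCFGs): with Y observed, log P(X,Y|θ) is a plain sum of per-point terms, while log P(X|θ) puts a sum over Y inside every log.

```python
import numpy as np

def log_gauss(x, mu, sigma=1.0):
    # log of a univariate Gaussian density
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

x = np.array([0.2, 1.9, 2.1, -0.3])   # observed data X (toy values)
mus = np.array([0.0, 2.0])            # component means (made up)
weights = np.array([0.5, 0.5])        # mixing weights (made up)

# Complete data (X, Y): if the hidden component Y were observed,
# the log-likelihood is a simple sum of per-point terms
y = np.array([0, 1, 1, 0])
complete_ll = np.sum(np.log(weights[y]) + log_gauss(x, mus[y]))

# Incomplete data X only: each point requires a log of a sum over Y
incomplete_ll = np.sum(np.log(
    (weights * np.exp(log_gauss(x[:, None], mus))).sum(axis=1)))

print(complete_ll, incomplete_ll)
```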
Terminology • Z = (X, Y) • Z is the ‘complete’/‘augmented’ data • X is the ‘observed’/‘incomplete’ data • Y is the ‘hidden’/‘missing’ data • Different papers mix these labels and terms
Forms of EM Based on F. Xia ’11
Bird’s Eye View of EM • Start with some initial setting of the model parameters: • Small random values, or • Parameters trained on a small hand-labeled set • Use the current model to estimate the hidden data Y • Update the model parameters based on X and Y • Iterate until convergence
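A minimal sketch of this loop on a toy problem that is not one of the NLP models above: a hypothetical mixture of two biased coins, each session's coin chosen with equal probability. The E-step fills in the hidden coin choice Y as posterior probabilities; the M-step re-estimates each coin's bias from the resulting expected counts.

```python
import numpy as np

# Toy data: (heads, flips) for each session; the hidden variable Y is
# which of the two coins produced the session
data = np.array([[8, 10], [9, 10], [2, 10], [3, 10], [7, 10]])
heads, flips = data[:, 0], data[:, 1]

theta = np.array([0.6, 0.4])          # initial guess for the two coin biases

for _ in range(50):
    # E-step: posterior probability of each coin for each session
    log_w = heads[:, None] * np.log(theta) + (flips - heads)[:, None] * np.log(1 - theta)
    w = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)

    # M-step: re-estimate each coin's bias from expected head/flip counts
    theta = (w * heads[:, None]).sum(axis=0) / (w * flips[:, None]).sum(axis=0)

print(theta)    # converges to roughly [0.8, 0.25] on this toy data
```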
Key Features of EM • General framework for ‘hidden’ data problems • General iterative methodology • Must be specialized to particular problems: • Forward-Backward for HMMs • Inside-Outside for PCFGs • IBM models for MT
Maximum Likelihood • EM performs maximum likelihood parameter estimation: • ΘML = argmax L(Θ) = argmax log P(X|Θ) • Introduces ‘hidden’ data Y to allow a more tractable solution