EM Advanced Statistical Methods in NLP Ling 572 March 6, 2012 Slides based on F. Xia ’11
Roadmap • Motivation: • Unsupervised learning • Maximum Likelihood Estimation • EM: • Basic concepts • Main ideas • Example: Forward-backward algorithm
Motivation • Task: Train a speech recognizer • Approach: Build a Hidden Markov Model • States: Phonemes • Observations: Acoustic speech signal • Transition probabilities: Phone sequence probabilities • Emission probabilities: Acoustic model probabilities • Training data: • Easy to get: lots and lots of recorded audio • Hard to get: Phonetic labeling of lots of recorded audio • Can we train our model without the ‘hard to get’ part?
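To make these pieces concrete, here is a minimal sketch of such an HMM's parameters in Python, with a hypothetical three-phoneme inventory, a four-symbol quantization of the acoustic signal, and made-up probabilities (not a real acoustic model):

```python
import numpy as np

# Hypothetical phoneme inventory (hidden states) and toy numbers
states = ["AA", "IY", "S"]

# Transition probabilities: P(next phoneme | current phoneme); each row sums to 1
A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])

# Emission probabilities: P(acoustic symbol | phoneme), with the speech signal
# quantized into 4 discrete symbols purely for illustration
B = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.6, 0.2, 0.1],
              [0.1, 0.1, 0.2, 0.6]])

# Initial state distribution
pi = np.array([0.5, 0.3, 0.2])

# With phonetically labeled audio, A and B could be estimated by counting;
# EM (Forward-Backward) estimates them from unlabeled audio instead.
```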
Motivation • Task: Train a probabilistic context-free grammar • Model: • Production rule probabilities • Probability of non-terminal rewriting • Training data: • Easy to get: lots and lots of text sentences • Hard to get: parse trees on lots of text sentences • Can we train our model without the ‘hard to get’ part?
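As a concrete illustration, a minimal sketch of PCFG parameters as a table of rule probabilities, using a toy grammar with made-up numbers (not taken from any treebank):

```python
# Each non-terminal's rewrite probabilities sum to 1
pcfg = {
    "S":  {("NP", "VP"): 1.0},
    "NP": {("DT", "NN"): 0.7, ("NN",): 0.3},
    "VP": {("VB", "NP"): 0.6, ("VB",): 0.4},
}

for lhs, rules in pcfg.items():
    assert abs(sum(rules.values()) - 1.0) < 1e-9

# With parse trees, these probabilities come from rule counts;
# EM (Inside-Outside) estimates them from raw sentences instead.
```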
Approach • Unsupervised learning • EM approach: • Family of unsupervised parameter estimation techniques • General framework • Many specific algorithms instantiate it: • Forward-Backward, Inside-Outside, IBM MT models, etc.
EM • Expectation-Maximization: • Two-step iterative procedure • General parameter estimation method: • Based on Maximum Likelihood Estimation • General form provided by (Dempster, Laird, Rubin ’77) • Unified framework • Specific instantiations predate it
Maximum Likelihood Estimation • MLE: • Given data: X = {X1, X2, …, Xn} • Parameters: Θ • Likelihood: P(X|Θ) • Log likelihood: L(Θ) = log P(X|Θ) • Maximum likelihood: • ΘML = argmaxΘ log P(X|Θ) Based on F. Xia ’11
MLE • Assume data X is independently and identically distributed (i.i.d.): • P(X|Θ) = ∏i P(Xi|Θ), so L(Θ) = Σi log P(Xi|Θ) • Difficulty of computing the maximum depends on the form of P(Xi|Θ) Based on F. Xia ’11
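A minimal sketch of the i.i.d. log-likelihood as a sum of per-item log probabilities; the Bernoulli model and the value p = 0.6 are made up for illustration:

```python
import math

def log_likelihood(data, prob):
    """L(Theta) = sum over i of log P(Xi | Theta) for an i.i.d. sample."""
    return sum(math.log(prob(x)) for x in data)

# Toy usage: coin flips under a Bernoulli model with P(H) = p
flips = ["H", "T", "H", "H"]
p = 0.6
print(log_likelihood(flips, lambda x: p if x == "H" else 1 - p))
```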
Simple Example • Coin flipping: • Single coin: Probability of heads: p; tails: 1-p • Consider a sequence of N coin flips, m of which are heads • Data X: Coin flip sequence, e.g. X = {H, T, H} • Parameter(s) Θ: p • What value of p maximizes the probability of the data? Based on F. Xia ’11
Simple Example, Formally • L(Θ) = log P(X|Θ) = log [p^m (1-p)^(N-m)] • L(Θ) = log p^m + log (1-p)^(N-m) = m log p + (N-m) log(1-p) • Setting dL/dp = m/p - (N-m)/(1-p) = 0 gives the maximum at p = m/N Based on F. Xia ’11
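A quick numerical check of the closed-form answer p = m/N, with made-up counts (this sketch is not part of the original slides):

```python
import numpy as np

N, m = 10, 7                           # toy counts: 10 flips, 7 heads

def log_lik(p):
    return m * np.log(p) + (N - m) * np.log(1 - p)

grid = np.linspace(0.01, 0.99, 99)     # candidate values of p
p_hat = grid[np.argmax(log_lik(grid))]
print(p_hat)                           # ~0.7, i.e. m/N
```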
EM • General setting: • Data X = {X1, X2, …, Xn} • Parameter vector θ • EM provides a method to compute: • θML = argmaxθ L(θ) = argmaxθ log P(X|θ) • In many cases, computing P(X|θ) directly is hard • However, computing P(X,Y|θ) can be easier Based on F. Xia ’11
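A minimal sketch of why the complete-data likelihood can be easier to handle, using a toy two-component Gaussian mixture (the data, means, and weights are made up; the NLP cases above are HMMs and PCFGs): with Y observed, log P(X,Y|θ) is a plain sum of per-point terms, while log P(X|θ) puts a sum over Y inside every log.

```python
import numpy as np

def log_gauss(x, mu, sigma=1.0):
    # log of a univariate Gaussian density
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

x = np.array([0.2, 1.9, 2.1, -0.3])   # observed data X (toy values)
mus = np.array([0.0, 2.0])            # component means (made up)
weights = np.array([0.5, 0.5])        # mixing weights (made up)

# Complete data (X, Y): if the hidden component Y were observed,
# the log-likelihood is a simple sum of per-point terms
y = np.array([0, 1, 1, 0])
complete_ll = np.sum(np.log(weights[y]) + log_gauss(x, mus[y]))

# Incomplete data X only: each point requires a log of a sum over Y
incomplete_ll = np.sum(np.log(
    (weights * np.exp(log_gauss(x[:, None], mus))).sum(axis=1)))

print(complete_ll, incomplete_ll)
```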
Terminology • Z = (X, Y) • Z is the ‘complete’/‘augmented’ data • X is the ‘observed’/‘incomplete’ data • Y is the ‘hidden’/‘missing’ data • Different papers mix these labels and terms
Forms of EM Based on F. Xia ’11
Bird’s Eye View of EM • Start with some initial setting of the model parameters: • Small random values, or • Parameters trained on a small hand-labeled set • Use the current model to estimate the hidden data Y • Update the model parameters based on X and Y • Iterate until convergence
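A minimal sketch of this loop on a toy problem that is not one of the NLP models above: a hypothetical mixture of two biased coins, each session's coin chosen with equal probability. The E-step fills in the hidden coin choice Y as posterior probabilities; the M-step re-estimates each coin's bias from the resulting expected counts.

```python
import numpy as np

# Toy data: (heads, flips) for each session; the hidden variable Y is
# which of the two coins produced the session
data = np.array([[8, 10], [9, 10], [2, 10], [3, 10], [7, 10]])
heads, flips = data[:, 0], data[:, 1]

theta = np.array([0.6, 0.4])          # initial guess for the two coin biases

for _ in range(50):
    # E-step: posterior probability of each coin for each session
    log_w = heads[:, None] * np.log(theta) + (flips - heads)[:, None] * np.log(1 - theta)
    w = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)

    # M-step: re-estimate each coin's bias from expected head/flip counts
    theta = (w * heads[:, None]).sum(axis=0) / (w * flips[:, None]).sum(axis=0)

print(theta)    # converges to roughly [0.8, 0.25] on this toy data
```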
Key Features of EM • General framework for ‘hidden’ data problems • General iterative methodology • Must be specialized to particular problems: • Forward-Backward for HMMs • Inside-Outside for PCFGs • IBM models for MT
Maximum Likelihood • EM performs maximum likelihood parameter estimation: • ΘML = argmax L(Θ) = argmax log P(X|Θ) • Introduces ‘hidden’ data Y to allow a more tractable solution