
Forward-backward algorithm

Forward-backward algorithm. LING 575 Week 2: 01/10/08. Some facts. The forward-backward algorithm is a special case of EM. EM stands for “expectation maximization”. EM falls into the general framework of maximum-likelihood estimation (MLE). Outline. Maximum likelihood estimate (MLE)

hua


Presentation Transcript


  1. Forward-backward algorithm LING 575 Week 2: 01/10/08

  2. Some facts • The forward-backward algorithm is a special case of EM. • EM stands for “expectation maximization”. • EM falls into the general framework of maximum-likelihood estimation (MLE).

  3. Outline • Maximum likelihood estimate (MLE) • EM in a nutshell • Forward-backward algorithm • Hw1 • Additional slides • EM main ideas • EM for PM models

  4. MLE

  5. What is MLE? • Given • A sample $X = \{X_1, \dots, X_n\}$ • A list of parameters θ • We define • Likelihood of the data: $P(X \mid \theta)$ • Log-likelihood of the data: $L(\theta) = \log P(X \mid \theta)$ • Given X, find $\theta_{ML} = \arg\max_{\theta} L(\theta)$

  6. MLE (cont) • Often we assume that the $X_i$ are independent and identically distributed (i.i.d.), so that $P(X \mid \theta) = \prod_i P(X_i \mid \theta)$. • Depending on the form of P(X | θ), solving this optimization problem can be easy or hard.

  7. An easy case • Assuming • A coin has probability p of landing heads and 1-p of landing tails. • Observation: we toss the coin N times; the result is a sequence of Hs and Ts containing m Hs. • What is the value of p based on MLE, given the observation?

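  8. An easy case (cont) • $L(\theta) = \log P(X \mid \theta) = m \log p + (N - m) \log(1 - p)$ • Setting $dL/dp = m/p - (N - m)/(1 - p) = 0$ gives p = m/N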
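As a quick numerical check of p = m/N, the sketch below compares the closed-form estimate with a brute-force search over candidate values of p. The function names, the example counts (7 heads in 10 tosses), and the grid resolution are illustrative assumptions and do not come from the slides.

```python
import math


def log_likelihood(p, m, n):
    """Log-likelihood of observing m heads in n i.i.d. tosses with P(head) = p."""
    return m * math.log(p) + (n - m) * math.log(1 - p)


def mle_head_probability(m, n):
    """Closed-form maximum-likelihood estimate: p = m / n."""
    return m / n


# Hypothetical observation: 7 heads in 10 tosses.
m, n = 7, 10
p_hat = mle_head_probability(m, n)

# Brute-force check: evaluate the log-likelihood on a grid of candidate
# values in (0, 1) and keep the best one.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: log_likelihood(p, m, n))

print(p_hat, p_grid)
```

The grid search is only there to illustrate that the closed form really is the maximizer; in practice the closed form is all that is needed.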

  9. EM: basic concepts

  10. Basic setting in EM • X is a set of data points: observed data • θ is a parameter vector. • EM is a method to find $\theta_{ML}$ where $\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log P(X \mid \theta)$ • Calculating P(X | θ) directly is hard. • Calculating P(X,Y|θ) is much simpler, where Y is “hidden” data (or “missing” data).

  11. The basic EM strategy • Z = (X, Y) • Z: complete data (“augmented data”) • X: observed data (“incomplete” data) • Y: hidden data (“missing” data)

  12. The “missing” data Y • Y need not be missing in the practical sense of the word. • It may just be a conceptually convenient technical device to simplify the calculation of P(X | θ). • There could be many possible Ys.

  13. Examples of EM

  14. Basic idea • Consider a set of starting parameters • Use these to “estimate” the missing data • Use “complete” data to update parameters • Repeat until convergence

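  15. The EM algorithm • Start with initial estimate, $\theta^0$ • Repeat until convergence • E-step: calculate $Q(\theta, \theta^t) = \sum_y P(y \mid X, \theta^t) \log P(X, y \mid \theta)$ • M-step: find $\theta^{t+1} = \arg\max_{\theta} Q(\theta, \theta^t)$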

  16. EM Highlights • It is a general algorithm for missing data problems. • It requires “specialization” to the problem at hand. • Some classes of problem have a closed-form solution for the M-step. For example, • Forward-backward algorithm for HMM • Inside-outside algorithm for PCFG • EM in IBM MT Models

  17. The forward-backward algorithm

  18. Notation • A sentence: $O_{1,T} = o_1 \dots o_T$, where T is the sentence length • The state sequence $X_{1,T+1} = X_1 \dots X_{T+1}$ • t: time t, ranging from 1 to T+1 • $X_t$: the state at time t • i, j: states $s_i$, $s_j$ • k: word $w_k$ in the vocabulary

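  19. Forward probability • It is the probability of producing $o_{1,t-1}$ while ending up in state $s_i$: $\alpha_i(t) = P(O_{1,t-1}, X_t = i)$ • Initialization: $\alpha_i(1) = \pi_i$ • Induction: $\alpha_j(t+1) = \sum_i \alpha_i(t)\, a_{ij}\, b_{ij o_t}$, where $\pi_i$ is the initial probability of state $s_i$, $a_{ij}$ is the transition probability from $s_i$ to $s_j$, and $b_{ijk}$ is the probability of emitting $w_k$ on the arc from $s_i$ to $s_j$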

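  20. Backward probability • It is the probability of producing the sequence $O_{t,T}$, given that at time t, we are at state $s_i$: $\beta_i(t) = P(O_{t,T} \mid X_t = i)$ • Initialization: $\beta_i(T+1) = 1$ • Induction: $\beta_i(t) = \sum_j a_{ij}\, b_{ij o_t}\, \beta_j(t+1)$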

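  21. Calculating the prob of the observation • $P(O) = \sum_i \alpha_i(T+1)$ • $P(O) = \sum_i \pi_i\, \beta_i(1)$ • $P(O) = \sum_i \alpha_i(t)\, \beta_i(t)$ for any t from 1 to T+1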
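The forward and backward recursions for an arc-emission HMM can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: it uses the M&S-style arc-emission conventions (initial probabilities `pi[i]`, transition probabilities `a[i][j]`, and emission probabilities `b[i][j][k]` attached to the arc from state i to state j), stores time t of the slides at list index t-1, and runs on a made-up two-state, two-word HMM. None of the names or numbers come from the slides.

```python
def forward(pi, a, b, obs):
    """Return alpha, where alpha[t][i] is the slides' alpha_i(t+1)."""
    n = len(pi)
    alpha = [list(pi)]  # initialization: alpha_i(1) = pi_i
    for o in obs:
        prev = alpha[-1]
        # induction: alpha_j(t+1) = sum_i alpha_i(t) * a_ij * b_ij,o_t
        alpha.append([sum(prev[i] * a[i][j] * b[i][j][o] for i in range(n))
                      for j in range(n)])
    return alpha


def backward(a, b, obs, n):
    """Return beta, where beta[t][i] is the slides' beta_i(t+1)."""
    beta = [[1.0] * n for _ in range(len(obs) + 1)]  # beta_i(T+1) = 1
    for t in range(len(obs) - 1, -1, -1):
        # induction: beta_i(t) = sum_j a_ij * b_ij,o_t * beta_j(t+1)
        beta[t] = [sum(a[i][j] * b[i][j][obs[t]] * beta[t + 1][j]
                       for j in range(n))
                   for i in range(n)]
    return beta


# Hypothetical toy HMM: 2 states, vocabulary of 2 words (indices 0 and 1).
pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.4, 0.6]]
b = [[[0.5, 0.5], [0.1, 0.9]],
     [[0.8, 0.2], [0.3, 0.7]]]
obs = [0, 1, 1]

alpha = forward(pi, a, b, obs)
beta = backward(a, b, obs, len(pi))

# Three ways of computing P(O); they should agree up to rounding.
p_from_alpha = sum(alpha[-1])
p_from_beta = sum(pi[i] * beta[0][i] for i in range(len(pi)))
p_from_both = sum(alpha[1][i] * beta[1][i] for i in range(len(pi)))
print(p_from_alpha, p_from_beta, p_from_both)
```

Both recursions cost O(N²T) for N states and T observations, which is the dynamic-programming saving over enumerating all state sequences. For long sentences a real implementation would work in log space or rescale at each step to avoid underflow; that is omitted here.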

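  22. $\xi_t(i, j)$ is the prob of traversing the arc from state i to state j at time t given O (denoted by $p_t(i, j)$ in M&S): $\xi_t(i, j) = P(X_t = i, X_{t+1} = j \mid O) = \frac{\alpha_i(t)\, a_{ij}\, b_{ij o_t}\, \beta_j(t+1)}{P(O)}$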

  23. $\gamma_i(t)$ is the prob of being at state i at time t given O: $\gamma_i(t) = P(X_t = i \mid O) = \frac{\alpha_i(t)\, \beta_i(t)}{P(O)}$

  24. Expected counts • Calculating expected counts by summing over the time index t • Expected # of transitions from state i to j in O: $\sum_{t=1}^{T} P(X_t = i, X_{t+1} = j \mid O)$ • Expected # of transitions from state i in O: $\sum_{t=1}^{T} P(X_t = i \mid O)$

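  25. Update parameters • $\hat{\pi}_i$ = prob of being at state i at time 1 given O • $\hat{a}_{ij}$ = (expected # of transitions from state i to j) / (expected # of transitions from state i) • $\hat{b}_{ijk}$ = (expected # of transitions from state i to j with $w_k$ observed) / (expected # of transitions from state i to j)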

  26. Final formulae

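  27. The inner loop for forward-backward algorithm Given an input sequence and an HMM: 1. Calculate forward probability: • Base case: $\alpha_i(1) = \pi_i$ • Recursive case: $\alpha_j(t+1) = \sum_i \alpha_i(t)\, a_{ij}\, b_{ij o_t}$ 2. Calculate backward probability: • Base case: $\beta_i(T+1) = 1$ • Recursive case: $\beta_i(t) = \sum_j a_{ij}\, b_{ij o_t}\, \beta_j(t+1)$ 3. Calculate expected counts: $\xi_t(i, j) = \alpha_i(t)\, a_{ij}\, b_{ij o_t}\, \beta_j(t+1) / P(O)$ and $\gamma_i(t) = \alpha_i(t)\, \beta_i(t) / P(O)$ 4. Update the parameters: $\hat{a}_{ij} = \sum_{t=1}^{T} \xi_t(i, j) / \sum_{t=1}^{T} \gamma_i(t)$ and $\hat{b}_{ijk} = \sum_{t:\, o_t = w_k} \xi_t(i, j) / \sum_{t=1}^{T} \xi_t(i, j)$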
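The four steps of the inner loop can be put together as a single re-estimation function. The sketch below is one possible reading of the slide for an arc-emission HMM and a single input sequence; the function name `forward_backward_step`, the list-based parameter layout (`pi[i]`, `a[i][j]`, `b[i][j][k]`), and the toy numbers are all assumptions made for illustration. It also assumes that every parameter starts strictly positive, so that no expected count used as a denominator is zero.

```python
def forward_backward_step(pi, a, b, obs):
    """One EM iteration; returns (new_pi, new_a, new_b, P(O) under the old parameters)."""
    n, T, K = len(pi), len(obs), len(b[0][0])

    # 1. Forward probabilities; alpha[t] is the slides' alpha(t+1).
    alpha = [list(pi)]
    for o in obs:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * a[i][j] * b[i][j][o] for i in range(n))
                      for j in range(n)])

    # 2. Backward probabilities; beta[t] is the slides' beta(t+1).
    beta = [[1.0] * n for _ in range(T + 1)]
    for t in range(T - 1, -1, -1):
        beta[t] = [sum(a[i][j] * b[i][j][obs[t]] * beta[t + 1][j]
                       for j in range(n))
                   for i in range(n)]
    prob = sum(alpha[T])

    # 3. Expected counts, summed over the time index.
    trans = [[0.0] * n for _ in range(n)]                       # i -> j
    emit = [[[0.0] * K for _ in range(n)] for _ in range(n)]    # i -> j emitting k
    for t, o in enumerate(obs):
        for i in range(n):
            for j in range(n):
                xi = alpha[t][i] * a[i][j] * b[i][j][o] * beta[t + 1][j] / prob
                trans[i][j] += xi
                emit[i][j][o] += xi

    # 4. Update the parameters by normalizing the expected counts.
    new_pi = [alpha[0][i] * beta[0][i] / prob for i in range(n)]
    new_a = [[trans[i][j] / sum(trans[i]) for j in range(n)] for i in range(n)]
    new_b = [[[emit[i][j][k] / trans[i][j] for k in range(K)]
              for j in range(n)] for i in range(n)]
    return new_pi, new_a, new_b, prob


# Hypothetical toy HMM and input sequence (word indices).
pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.4, 0.6]]
b = [[[0.5, 0.5], [0.1, 0.9]],
     [[0.8, 0.2], [0.3, 0.7]]]
obs = [0, 1, 1, 0, 1]

likelihoods = []
for _ in range(5):
    pi, a, b, prob = forward_backward_step(pi, a, b, obs)
    likelihoods.append(prob)
print(likelihoods)
```

Because this is an instance of EM, the recorded values of P(O) should never decrease from one iteration to the next, which makes a convenient sanity check when debugging an implementation.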

  28. Summary • A way of estimating parameters for HMMs • Define forward and backward probabilities, which can be calculated efficiently with dynamic programming (DP) • Given an initial parameter setting, we re-estimate the parameters at each iteration. • The forward-backward algorithm is a special case of the EM algorithm for PM models

  29. Hw1

  30. Arc-emission HMM: Q1: How to estimate the emission probability in a state-emission HMM?

  31. Given an input sequence and an HMM: 1. Calculate forward probability (base case, recursive case) 2. Calculate backward probability (base case, recursive case) 3. Calculate expected counts 4. Update the parameters Q2: how to modify the algorithm when there are multiple input sentences (e.g., a set of sentences)?

  32. EM: main ideas

  33. Additional slides

  34. Idea #1: find θ that maximizes the likelihood of training data

  35. Idea #2: find the $\theta^t$ sequence No analytical solution → iterative approach: find $\theta^0, \theta^1, \dots, \theta^t, \dots$ s.t. $L(\theta^{t+1}) \ge L(\theta^t)$

  36. Idea #3: find $\theta^{t+1}$ that maximizes a tight lower bound of $L(\theta) - L(\theta^t)$

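  37. Idea #4: find $\theta^{t+1}$ that maximizes the Q function; the lower bound of $L(\theta) - L(\theta^t)$ differs from the Q function $Q(\theta, \theta^t)$ only by a term that does not depend on θ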

  38. The Q-function • Define the Q-function (a function of θ): $Q(\theta, \theta^t) = E[\log P(X, Y \mid \theta) \mid X, \theta^t] = \sum_y P(y \mid X, \theta^t) \log P(X, y \mid \theta)$ • Y is a random vector. • $X = (x_1, x_2, \dots, x_n)$ is a constant (vector). • $\theta^t$ is the current parameter estimate and is a constant (vector). • θ is the normal variable (vector) that we wish to adjust. • The Q-function is the expected value of the complete-data log-likelihood log P(X,Y|θ) with respect to Y given X and $\theta^t$.
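The link between the lower bound and the Q-function can be made explicit with Jensen's inequality. The derivation below is a standard argument written out as a sketch; it assumes the usual definition $Q(\theta, \theta^t) = \sum_y P(y \mid X, \theta^t) \log P(X, y \mid \theta)$ and uses superscripts for iteration indices, which is a notational choice rather than something fixed by the slides.

```latex
\begin{align*}
L(\theta) - L(\theta^t)
  &= \log \sum_y P(X, y \mid \theta) - \log P(X \mid \theta^t) \\
  &= \log \sum_y P(y \mid X, \theta^t)\,
       \frac{P(X, y \mid \theta)}{P(y \mid X, \theta^t)\, P(X \mid \theta^t)} \\
  &\ge \sum_y P(y \mid X, \theta^t)\,
       \log \frac{P(X, y \mid \theta)}{P(X, y \mid \theta^t)}
       && \text{(Jensen; } P(y \mid X, \theta^t) P(X \mid \theta^t) = P(X, y \mid \theta^t)\text{)} \\
  &= Q(\theta, \theta^t) - Q(\theta^t, \theta^t).
\end{align*}
```

At $\theta = \theta^t$ both sides are zero, so the bound is tight there. Since $Q(\theta^t, \theta^t)$ does not depend on θ, maximizing the bound over θ is the same as maximizing $Q(\theta, \theta^t)$, and the maximizer $\theta^{t+1}$ satisfies $L(\theta^{t+1}) \ge L(\theta^t)$.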

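  39. The EM algorithm • Start with initial estimate, $\theta^0$ • Repeat until convergence • E-step: calculate $Q(\theta, \theta^t) = \sum_y P(y \mid X, \theta^t) \log P(X, y \mid \theta)$ • M-step: find $\theta^{t+1} = \arg\max_{\theta} Q(\theta, \theta^t)$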

  40. Important classes of EM problem • Products of multinomial (PM) models • Exponential families • Gaussian mixture • …

  41. The EM algorithm for PM models

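  42. PM models $P(x, y \mid \theta) = \prod_{j=1}^{m} \prod_{r \in \Theta_j} \theta_r^{c(r;\, x, y)}$, where $c(r; x, y)$ is the number of times parameter $\theta_r$ is used in $(x, y)$, $(\Theta_1, \dots, \Theta_m)$ is a partition of all the parameters, and for any j, $\sum_{r \in \Theta_j} \theta_r = 1$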

  43. HMM is a PM

  44. PCFG • PCFG: each sample point (x,y): • x is a sentence • y is a possible parse tree for that sentence.

  45. PCFG is a PM

  46. Q-function for PM

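  47. Maximizing the Q function • Maximize $Q(\theta, \theta^t) = \sum_y P(y \mid X, \theta^t) \sum_r c(r; X, y) \log \theta_r$, where $c(r; X, y)$ is the number of times parameter $\theta_r$ is used in $(X, y)$ • Subject to the constraint $\sum_{r \in \Theta_j} \theta_r = 1$ for each group of parameters $\Theta_j$ • Use Lagrange multipliers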

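  48. Optimal solution $\theta_r = \frac{\sum_y P(y \mid X, \theta^t)\, c(r; X, y)}{\sum_{r' \in \Theta_j} \sum_y P(y \mid X, \theta^t)\, c(r'; X, y)}$ for $r \in \Theta_j$ • The numerator is the expected count of parameter $\theta_r$ • The denominator is the normalization factor: the total expected count of all parameters in the same group $\Theta_j$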
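The Lagrange-multiplier step behind the optimal solution can be spelled out as follows. This is a sketch of the standard argument; the shorthand $E_r$ for the expected count of parameter $\theta_r$, the count notation $c(r; X, y)$, and the multiplier names $\lambda_j$ are notation introduced here for illustration.

```latex
\begin{align*}
E_r &= \sum_y P(y \mid X, \theta^t)\, c(r; X, y)
  && \text{(expected count of } \theta_r\text{)} \\
\Lambda(\theta, \lambda) &= \sum_r E_r \log \theta_r
  - \sum_j \lambda_j \Bigl( \sum_{r \in \Theta_j} \theta_r - 1 \Bigr) \\
\frac{\partial \Lambda}{\partial \theta_r} &= \frac{E_r}{\theta_r} - \lambda_j = 0
  \;\Longrightarrow\; \theta_r = \frac{E_r}{\lambda_j}
  && \text{for } r \in \Theta_j \\
\sum_{r \in \Theta_j} \theta_r = 1
  &\;\Longrightarrow\; \lambda_j = \sum_{r' \in \Theta_j} E_{r'}
  \;\Longrightarrow\; \theta_r = \frac{E_r}{\sum_{r' \in \Theta_j} E_{r'}}
\end{align*}
```

Each multiplier $\lambda_j$ ends up being the normalization factor for its own group of parameters, which is why the M-step for PM models reduces to counting and normalizing.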

  49. PCFG example • Calculate expected counts • Update parameters

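  50. The EM algorithm for PM models • For each iteration: • For each training example $x_i$: • For each possible y: • For each parameter $\theta_r$: add $P(y \mid x_i, \theta^t)$ times the count of $\theta_r$ in $(x_i, y)$ to the expected count of $\theta_r$ • For each parameter $\theta_r$: set $\theta_r$ to its expected count divided by the normalization factor of its group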
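The nested loops above map directly onto code once a model says, for each training example, which hidden values y are possible and how often each parameter is used. The sketch below does this for a made-up PM model, a mixture of two biased coins: each example is a string of tosses produced by one coin, and the hidden y is the coin's identity. The model, the function names (`joint`, `analyses`, `em`), the dictionary layout of the parameters, and the data are all hypothetical and chosen only to keep the example short.

```python
import math


def joint(theta, counts):
    """P(x, y | theta) for a PM model: product of parameter ** count."""
    return math.prod(theta[group][outcome] ** c
                     for (group, outcome), c in counts.items())


def analyses(x):
    """Yield the parameter counts of (x, y) for every possible hidden y."""
    heads, tails = x.count("H"), x.count("T")
    for coin in ("A", "B"):
        yield {("pick", coin): 1, (coin, "H"): heads, (coin, "T"): tails}


def log_likelihood(theta, data):
    """log P(data | theta), summing out the hidden coin for each example."""
    return sum(math.log(sum(joint(theta, c) for c in analyses(x)))
               for x in data)


def em(theta, data, iterations):
    for _ in range(iterations):                       # for each iteration
        expected = {g: {o: 0.0 for o in dist} for g, dist in theta.items()}
        for x in data:                                # for each training example
            scored = [(joint(theta, c), c) for c in analyses(x)]
            p_x = sum(p for p, _ in scored)           # P(x | theta)
            for p_xy, counts in scored:               # for each possible y
                for (g, o), c in counts.items():      # for each parameter
                    expected[g][o] += (p_xy / p_x) * c
        # M-step: normalize expected counts within each multinomial.
        theta = {g: {o: v / sum(dist.values()) for o, v in dist.items()}
                 for g, dist in expected.items()}
    return theta


# Hypothetical starting point (asymmetric, so the two coins can separate).
theta0 = {"pick": {"A": 0.5, "B": 0.5},
          "A": {"H": 0.6, "T": 0.4},
          "B": {"H": 0.4, "T": 0.6}}
data = ["HHHHT", "HTTTT", "HHHTH", "TTTHT", "HHHHH"]

theta1 = em(theta0, data, 1)
theta20 = em(theta0, data, 20)
print(log_likelihood(theta0, data),
      log_likelihood(theta1, data),
      log_likelihood(theta20, data))
```

Enumerating every y explicitly is only feasible when the set of hidden values is small, as it is here. For HMMs and PCFGs the same expected counts are obtained without enumeration by dynamic programming (forward-backward and inside-outside respectively).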
