
Parameter Estimation in IBM Models

Explore the essential concepts, formulas, and training methods in IBM models for machine translation. Learn about word alignment, modeling, training processes, and the EM algorithm. Uncover the basics of IBM Model 1 and its application in translation and noisy channel models.


Presentation Transcript


  1. Parameter Estimation in IBM Models Ling 572 Fei Xia Week ??

  2. Outline • IBM Model 1 review (from LING571): • Word alignment • Modeling • Training: formulae

  3. IBM Model Basics • Classic paper: Brown et al. (1993) • Translation: F → E (or Fr → Eng) • Resource required: • Parallel data (a set of “sentence” pairs) • Main concepts: • Source channel model • Hidden word alignment • EM training

  4. Intuition • Sentence pairs: word mapping is one-to-one. • (1) S: a b c d e T: l m n o p • (2) S: c a e T: p n m • (3) S: d a c T: n p l ⇒ (b, o), (d, l), (e, m), and either (a, p), (c, n) or (a, n), (c, p)

  5. Source channel model for MT • [Diagram: Eng sent → noisy channel → Fr sent, with language model P(E) and translation model P(F | E)] • Two types of parameters: • Language model: P(E) • Translation model: P(F | E)
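The decoding rule the diagram implies, in the standard source channel formulation (a reconstruction; the slide shows it only as a figure):

$$\hat{E} = \arg\max_{E} P(E \mid F) = \arg\max_{E} P(E)\, P(F \mid E)$$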

  6. Word alignment • a(j) = i ⇔ aj = i • a = (a1, …, am) • Ex: • F: f1 f2 f3 f4 f5 • E: e1 e2 e3 e4 • a4 = 3 • a = (0, 1, 1, 3, 2)

  7. Word alignment • An alignment a is a function from Fr word position to Eng word position: a(j) = i means that fj is generated by ei. • The constraint: each Fr word is generated by exactly one Eng word (including e0), i.e., aj ∈ {0, 1, …, l} for every j.

  8. Modeling p(F | E) with alignment
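In the standard formulation (the slide's own equation is an image), p(F | E) is obtained by marginalizing over the hidden alignment:

$$P(F \mid E) = \sum_{a} P(F, a \mid E)$$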

  9. Notation • E: the Eng sentence: E = e1 … el • ei: the i-th Eng word. • F: the Fr sentence: F = f1 … fm • fj: the j-th Fr word. • e0: the Eng NULL word • f0: the Fr NULL word. • aj: the position of the Eng word that generates fj.

  10. Notation (cont) • l: Eng sent length • m: Fr sent length • i: Eng word position • j: Fr word position • e: an Eng word • f: a Fr word

  11. Generative process • To generate F from E: • Pick a length m for F, with prob P(m | l) • Choose an alignment a, with prob P(a | E, m) • Generate the Fr sent given the Eng sent and the alignment, with prob P(F | E, a, m). • Another way to look at it: • Pick a length m for F, with prob P(m | l). • For j = 1 to m • Pick an Eng word index aj, with prob P(aj | j, m, l). • Pick a Fr word fj according to the Eng word ei, where i = aj, with prob P(fj | ei).
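As a rough illustration of the second view, here is a minimal Python sketch of the generative story (not from the slides); the function name generate_fr and the tables len_prob and t are hypothetical inputs, and the alignment prob is taken to be uniform as in Model 1.

```python
import random

def generate_fr(eng_words, len_prob, t):
    """Sample a Fr sentence F (and its alignment) from an Eng sentence E.

    eng_words: list of Eng tokens e_1 .. e_l (the NULL word e_0 is added here).
    len_prob:  dict mapping (m, l) -> P(m | l)    [hypothetical table]
    t:         dict mapping (f, e) -> t(f | e)    [hypothetical table]
    """
    E = ["<NULL>"] + eng_words          # position 0 is the Eng NULL word e_0
    l = len(eng_words)

    # Step 1: pick a length m for F, with prob P(m | l)
    lengths = [m for (m, l2) in len_prob if l2 == l]
    m = random.choices(lengths, weights=[len_prob[(m2, l)] for m2 in lengths])[0]

    F, a = [], []
    for j in range(1, m + 1):
        # Step 2: pick an Eng word index a_j; in Model 1, P(a_j | j, m, l) = 1/(l+1)
        a_j = random.randint(0, l)
        a.append(a_j)
        # Step 3: pick a Fr word f_j with prob t(f_j | e_{a_j})
        e = E[a_j]
        fr_words = [f for (f, e2) in t if e2 == e]
        f_j = random.choices(fr_words, weights=[t[(f, e)] for f in fr_words])[0]
        F.append(f_j)
    return F, a
```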

  12. Decomposition
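Following the first view of the generative process, the joint probability decomposes as (a reconstruction; the slide's equation is an image):

$$P(F, a \mid E) = P(m \mid E)\; P(a \mid E, m)\; P(F \mid E, a, m)$$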

  13. Approximation • Fr sent length depends only on Eng sent length: • Fr word depends only on the Eng word that generates it: • Estimating P(a | E, m): All alignments are equally likely:
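In the standard Model 1 form, these three assumptions read (reconstructed from the bullet descriptions above):

$$P(m \mid E) = P(m \mid l), \qquad P(f_j \mid E, a, m) = t(f_j \mid e_{a_j}), \qquad P(a \mid E, m) = \frac{1}{(l+1)^m}$$

where (l+1)^m is the number of possible alignments for a sentence pair of lengths l and m.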

  14. Decomposition
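Substituting the approximations into the decomposition and summing over alignments gives (again a reconstruction of the standard form):

$$P(F \mid E) = \frac{P(m \mid l)}{(l+1)^m} \sum_{a} \prod_{j=1}^{m} t(f_j \mid e_{a_j})$$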

  15. Final formula and parameters for Model 1 • Two types of parameters: • Length prob: P(m | l) • Translation prob: P(fj | ei), or t(fj | ei)
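Because each aj is picked independently, the sum over alignments factors into a product of sums, giving the usual Model 1 formula, which can be computed in O(l · m) time:

$$P(F \mid E) = \frac{P(m \mid l)}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)$$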

  16. Training • Mathematically motivated: • Having an objective function to optimize • Using several clever tricks • The resulting formulae • are intuitively expected • can be calculated efficiently • EM algorithm • Hill climbing: each iteration is guaranteed to improve the objective function • It is not guaranteed to reach a global optimum.

  17. Training: Fractional counts • Let Ct(f, e) be the fractional count of the (f, e) pair in the training data, given the alignment prob P: Ct(f, e) = Σ over sentence pairs (E, F) Σa P(a | E, F) × count(f, e; a, E, F), where P(a | E, F) is the alignment prob and count(f, e; a, E, F) is the actual count of times e and f are linked in (E, F) by alignment a.

  18. Estimating P(a | E, F) • We could list all the alignments and estimate P(a | E, F).
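P(a | E, F) follows directly from quantities the model already defines:

$$P(a \mid E, F) = \frac{P(F, a \mid E)}{P(F \mid E)} = \frac{P(F, a \mid E)}{\sum_{a'} P(F, a' \mid E)}$$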

  19. Formulae so far → New estimate for t(f | e)
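The re-estimate simply normalizes the fractional counts (the slide's equation is shown only as an image):

$$t(f \mid e) = \frac{Ct(f, e)}{\sum_{f'} Ct(f', e)}$$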

  20. The algorithm • 1. Start with an initial estimate of t(f | e), e.g., a uniform distribution. • 2. Calculate P(a | F, E). • 3. Calculate Ct(f, e) and normalize to get t(f | e). • 4. Repeat Steps 2-3 until the “improvement” is too small.

  21. No need to enumerate all word alignments • Luckily, for Model 1, there is a way to calculate Ct(f, e) efficiently.
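The trick, reconstructed here in its standard form, is that under Model 1 the alignment posterior factorizes per position:

$$P(a_j = i \mid F, E) = \frac{t(f_j \mid e_i)}{\sum_{i'=0}^{l} t(f_j \mid e_{i'})}$$

so each fj spreads one fractional count over e0 … el in proportion to t(fj | ei), and no sum over the (l+1)^m alignments is needed.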

  22. The algorithm • 1. Start with an initial estimate of t(f | e), e.g., a uniform distribution. • 2. Calculate P(a | F, E). • 3. Calculate Ct(f, e) and normalize to get t(f | e). • 4. Repeat Steps 2-3 until the “improvement” is too small.
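To make the loop concrete, here is a minimal Python sketch of Model 1 EM training. It is a simplification: it drops the length prob P(m | l) and the NULL word, and the function name train_model1, the corpus format, and the fixed iteration count are assumptions, not from the slides.

```python
from collections import defaultdict

def train_model1(corpus, num_iters=10):
    """EM training of t(f | e) for IBM Model 1.

    corpus: list of (fr_words, eng_words) sentence pairs.
    Returns a dict mapping (f, e) -> t(f | e).
    """
    # Step 1: start with a uniform estimate of t(f | e)
    fr_vocab = {f for F, E in corpus for f in F}
    t = defaultdict(lambda: 1.0 / len(fr_vocab))

    for _ in range(num_iters):
        ct = defaultdict(float)       # fractional counts Ct(f, e)
        total = defaultdict(float)    # sum_f Ct(f, e), used for normalization

        # Steps 2-3 (E-step): collect fractional counts without enumerating
        # alignments -- each f_j spreads its count over the Eng words of the
        # pair in proportion to t(f_j | e_i).
        for F, E in corpus:
            for f in F:
                z = sum(t[(f, e)] for e in E)    # sum_i t(f | e_i)
                for e in E:
                    c = t[(f, e)] / z            # expected count of the link (f, e)
                    ct[(f, e)] += c
                    total[e] += c

        # Step 3 (M-step): normalize to get the new t(f | e)
        t = defaultdict(float,
                        {(f, e): ct[(f, e)] / total[e] for (f, e) in ct})

    return dict(t)

# Toy usage:
# t = train_model1([(["la", "maison"], ["the", "house"]),
#                   (["la", "fleur"], ["the", "flower"])])
```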

  23. Summary of Model 1 • Modeling: • Pick the length of F with prob P(m | l). • For each position j • Pick an English word position aj, with prob P(aj | j, m, l). • Pick a Fr word fj according to the Eng word ei, with prob t(fj | ei), where i = aj • The resulting formula can be calculated efficiently. • Training: EM algorithm. The update can be done efficiently. • Finding the best alignment can be done easily.

  24. New stuff

  25. EM algorithm • EM: expectation maximization • In a model with hidden states (e.g., word alignment), how can we estimate the model parameters? • EM does the following: • E-step: take the current parameter estimates (initially a guess) and calculate the expected values of the hidden data. • M-step: use those expected values to re-estimate the parameters so as to maximize the likelihood of the training data.

  26. Objective function
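The slide's formula appears only as an image; the quantity EM climbs for Model 1 is the log-likelihood of the training corpus:

$$L(t) = \sum_{(E, F)} \log P(F \mid E) = \sum_{(E, F)} \log \sum_{a} P(F, a \mid E)$$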

  27. Training Summary • Mathematically motivated: • Having an objective function to optimize • Using several clever tricks • The resulting formulae • are intuitively expected • can be calculated efficiently • EM algorithm • Hill climbing: each iteration is guaranteed to improve the objective function • It is not guaranteed to reach a global optimum.
