Statistical Machine Translation: IBM Models and the Alignment Template System


  1. Statistical Machine Translation: IBM Models and the Alignment Template System

  2. Statistical Machine Translation • Goal: • Given foreign sentence f: • “Maria no dio una bofetada a la bruja verde” • Find the most likely English translation e: • “Maria did not slap the green witch”

  3. Statistical Machine Translation • Most likely English translation e is given by: • e* = argmax over all e of P(e|f) • P(e|f) estimates conditional probability of any e given f

  4. Statistical Machine Translation • How to estimate P(e|f)? • Noisy channel: • Decompose P(e|f) into P(f|e) * P(e) / P(f) • Estimate P(f|e) and P(e) separately using parallel corpus • Direct: • Estimate P(e|f) directly using parallel corpus (more on this later)

  5. Noisy Channel Model • Translation Model • P(f|e) • How likely is f to be a translation of e? • Estimate parameters from bilingual corpus • Language Model • P(e) • How likely is e to be an English sentence? • Estimate parameters from monolingual corpus • Decoder • Given f, what is the best translation e?

  6. Noisy Channel Model • Generative story: • Generate e with probability p(e) • Pass e through noisy channel • Out comes f with probability p(f|e) • Translation task: • Given f, deduce most likely e that produced f, or: • e* = argmax over all e of p(e) * p(f|e)
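
A minimal sketch of this decision rule, picking the e that maximizes p(e) * p(f|e) over a handful of candidates; the candidate sentences and probability values below are made-up toy numbers for illustration:

```python
# Noisy-channel decision rule: e* = argmax_e P(e) * P(f|e)
# Candidate translations and probabilities are toy values for illustration only.

candidates = {
    # e : (P(e) from language model, P(f|e) from translation model)
    "Maria did not slap the green witch": (1e-6, 1e-4),
    "Maria not give a slap to the witch green": (1e-9, 1e-3),
    "Mary the green witch did not slap": (1e-8, 1e-4),
}

def noisy_channel_score(lm_prob, tm_prob):
    """Score of a candidate e for source f under the noisy channel model."""
    return lm_prob * tm_prob

best_e = max(candidates, key=lambda e: noisy_channel_score(*candidates[e]))
print(best_e)  # the candidate with the highest P(e) * P(f|e)
```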

  7. Translation Model • How to model P(f|e)? • Learn parameters of P(f|e) from a bilingual corpus S of sentence pairs <ei,fi>: <e1,f1> = <the blue witch, la bruja azul> <e2,f2> = <green, verde> … <eS,fS> = <the witch, la bruja>

  8. Translation Model • Insufficient data in parallel corpus to estimate P(f|e) at the sentence level (Why?) • Decompose process of translating e -> f into small steps whose probabilities can be estimated

  9. Translation Model • English sentence e = e1…el • Foreign sentence f = f1…fm • Alignment A = {a1…am}, where aj ∈ {0…l} • A indicates which English word generates each foreign word

  10. Alignments e: “the blue witch” f: “la bruja azul” A = {1,3,2} (intuitively “good” alignment)

  11. Alignments e: “the blue witch” f: “la bruja azul” A = {1,1,1} (intuitively “bad” alignment)

  12. Alignments e: “the blue witch” f: “la bruja azul” (illegal alignment!)

  13. Alignments • Question: how many possible alignments are there for a given e and f, where |e| = l and |f| = m?

  14. Alignments • Question: how many possible alignments are there for a given e and f, where |e| = l and |f| = m? • Answer: • Each foreign word can align with any one of |e| = l words, or it can remain unaligned • Each foreign word has (l + 1) choices for an alignment, and there are |f| = m foreign words • So, there are (l+1)^m alignments for a given e and f
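
A quick sanity check of the (l+1)^m count, enumerating every alignment for the three-word example (a small illustrative sketch in Python):

```python
from itertools import product

e = ["the", "blue", "witch"]   # l = 3 English words
f = ["la", "bruja", "azul"]    # m = 3 foreign words

l, m = len(e), len(f)

# Each foreign word aligns to one of the l English positions (1..l) or to 0 (NULL/unaligned)
all_alignments = list(product(range(l + 1), repeat=m))

print(len(all_alignments))   # 64
print((l + 1) ** m)          # 64, matching the closed form (l+1)^m
```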

  15. Alignments • Question: If all alignments are equally likely, what is the probability of any one alignment, given e?

  16. Alignments • Question: If all alignments are equally likely, what is the probability of any one alignment, given e? • Answer: • P(A|e) = p(|f| = m) * 1/(l+1)^m • If we assume that p(|f| = m) is uniform over all possible values of |f|, then we can let p(|f| = m) = C • P(A|e) = C /(l+1)^m

  17. Generative Story e: “blue witch” f: “bruja azul” How do we get from e to f?

  18. IBM Model 1 • Model parameters: • T(fj | eaj ) = translation probability of foreign word given English word that generated it

  19. IBM Model 1 • Generative story: • Given e: • Pick m = |f|, where all lengths m are equally probable • Pick A with probability P(A|e) = 1/(l+1)^m, since all alignments are equally likely given l and m • Pick f1…fm with probability P(f1…fm | A, e) = T(f1 | ea1) * … * T(fm | eam), where T(fj | eaj) is the translation probability of fj given the English word it is aligned to
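
Multiplying the three steps together gives the joint probability P(f,A|e) = p(m) * 1/(l+1)^m * T(f1|ea1) * … * T(fm|eam). A small sketch of that computation, with a hypothetical translation table and the uniform length probability C assumed to be 1 for illustration:

```python
def model1_joint_prob(f, e, A, T, C=1.0):
    """P(f, A | e) under IBM Model 1.

    f : list of foreign words f[0..m-1]
    e : list of English words e[0..l-1]; position 0 is reserved for NULL
    A : alignment, A[j] in {0..l}, 0 meaning NULL
    T : dict mapping (foreign_word, english_word) -> translation probability
    C : uniform length probability p(|f| = m), assumed constant
    """
    l, m = len(e), len(f)
    e_with_null = ["NULL"] + e
    prob = C / (l + 1) ** m                       # length and alignment terms
    for j in range(m):
        prob *= T[(f[j], e_with_null[A[j]])]      # translation term for each foreign word
    return prob

# Hypothetical translation table (values made up for illustration)
T = {("bruja", "witch"): 0.8, ("azul", "blue"): 0.7}
print(model1_joint_prob(["bruja", "azul"], ["blue", "witch"], [2, 1], T))
```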

  20. IBM Model 1 Example e: “blue witch”

  21. IBM Model 1 Example e: “blue witch” f: “f1 f2” Pick m = |f| = 2

  22. IBM Model 1 Example e: “blue witch” f: “f1 f2” Pick A = {2,1} with probability 1/(l+1)^m

  23. IBM Model 1 Example e: “blue witch” f: “bruja f2” Pick f1 = “bruja” with probability t(bruja|witch)

  24. IBM Model 1 Example e: “blue witch” f: “bruja azul” Pick f2 = “azul” with probability t(azul|blue)

  25. IBM Model 1: Parameter Estimation • How does this generative story help us to estimate P(f|e) from the data? • Since the model for P(f|e) contains the parameter T(fj | eaj), we first need to estimate T(fj | eaj)

  26. IBM Model 1: Parameter Estimation • How to estimate T(fj | eaj) from the data? • If we had the data and the alignments A, along with P(A|f,e), then we could estimate T(fj | eaj) using expected counts as follows: • T(f | e) = expected count(f aligned to e) / expected count(e), where the expected counts weight each alignment A by P(A|f,e)

  27. IBM Model 1: Parameter Estimation • How to estimate P(A|f,e)? • P(A|f,e) = P(A,f|e) / P(f|e) • But P(f|e) = Σ over all alignments A of P(A,f|e) • So we need to compute P(A,f|e)… • This is given by the Model 1 generative story: • P(A,f|e) = C/(l+1)^m * T(f1 | ea1) * … * T(fm | eam)

  28. IBM Model 1 Example e: “the blue witch” f: “la bruja azul” P(A|f,e) = P(f,A|e) / P(f|e) = P(f,A|e) / Σ over all alignments A' of P(f,A'|e)
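
Continuing the toy example, P(A|f,e) can be computed by enumerating all (l+1)^m alignments, summing their joint probabilities to get P(f|e), and normalizing. A sketch under the same assumptions as before (hypothetical translation table, uniform length probability, and a tiny floor probability for unseen word pairs):

```python
from itertools import product

def model1_joint_prob(f, e, A, T, C=1.0):
    """P(f, A | e) under IBM Model 1 (same sketch as above)."""
    l, m = len(e), len(f)
    e_with_null = ["NULL"] + e
    prob = C / (l + 1) ** m
    for j in range(m):
        prob *= T.get((f[j], e_with_null[A[j]]), 1e-6)  # tiny floor for unseen pairs
    return prob

e = ["the", "blue", "witch"]
f = ["la", "bruja", "azul"]
# Hypothetical translation table; unseen pairs fall back to the small floor value
T = {("la", "the"): 0.7, ("bruja", "witch"): 0.8, ("azul", "blue"): 0.7}

joint = {A: model1_joint_prob(f, e, A, T)
         for A in product(range(len(e) + 1), repeat=len(f))}
p_f_given_e = sum(joint.values())                            # P(f|e) = sum over all A
posterior = {A: p / p_f_given_e for A, p in joint.items()}   # P(A|f,e)

best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))   # the intuitively "good" alignment (1, 3, 2) dominates
```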

  29. IBM Model 1: Parameter Estimation • So, in order to estimate P(f|e), we first need to estimate the model parameter T(fj | eaj ) • In order to compute T(fj | eaj ), we need to estimate P(A|f,e) • And in order to compute P(A|f,e), we need to estimate T(fj | eaj )…

  30. IBM Model 1: Parameter Estimation • Training data is a set of pairs <ei, fi> • Log likelihood of training data given model parameters is: • Σi log P(fi | ei), summing over all training pairs <ei, fi> • To maximize log likelihood of training data given model parameters, use EM: • hidden variable = alignments A • model parameters = translation probabilities T

  31. EM • Initialize model parameters T(f|e) • Calculate alignment probabilities P(A|f,e) under current values of T(f|e) • Calculate expected counts from alignment probabilities • Re-estimate T(f|e) from these expected counts • Repeat until log likelihood of training data converges to a maximum
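
A compact sketch of this EM loop for Model 1 on a toy three-sentence corpus (the corpus, the uniform initialization, and the fixed number of iterations are illustrative assumptions). Because Model 1's alignment posterior factorizes over foreign words, the expected counts can be collected without enumerating alignments explicitly:

```python
from collections import defaultdict

# Toy parallel corpus (illustrative); NULL is added on the English side below
corpus = [
    ("the blue witch".split(), "la bruja azul".split()),
    ("the witch".split(),      "la bruja".split()),
    ("green".split(),          "verde".split()),
]

# Collect vocabularies and initialize T(f|e) uniformly
e_vocab = {e_w for e_s, _ in corpus for e_w in e_s} | {"NULL"}
f_vocab = {f_w for _, f_s in corpus for f_w in f_s}
T = {(f_w, e_w): 1.0 / len(f_vocab) for f_w in f_vocab for e_w in e_vocab}

for iteration in range(10):
    count = defaultdict(float)   # expected count of (f, e) alignment links
    total = defaultdict(float)   # expected count of e
    for e_s, f_s in corpus:
        e_s = ["NULL"] + e_s
        for f_w in f_s:
            # E-step: under Model 1, P(a_j = i | f, e) is proportional to T(f_j | e_i)
            norm = sum(T[(f_w, e_w)] for e_w in e_s)
            for e_w in e_s:
                p = T[(f_w, e_w)] / norm
                count[(f_w, e_w)] += p
                total[e_w] += p
    # M-step: re-estimate T(f|e) from the expected counts
    for (f_w, e_w) in count:
        T[(f_w, e_w)] = count[(f_w, e_w)] / total[e_w]

print(round(T[("azul", "blue")], 3))    # rises above its uniform start of 0.25 as EM iterates
print(round(T[("verde", "green")], 3))  # verde is the only word co-occurring with green, so this is 1.0
```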

  32. IBM Model 2 • Model parameters: • T(fj | eaj ) = translation probability of foreign word fj given English word eaj that generated it • d(i|j,l,m) = distortion probability, or probability that fj is aligned to ei , given l and m

  33. IBM Model 3 • Model parameters: • T(fj | eaj ) = translation probability of foreign word fj given English word eaj that generated it • r(j|i,l,m) = reverse distortion probability, or probability that fj ends up in position j given that it is aligned to ei, and given l and m • n(ei) = fertility of word ei, or number of foreign words aligned to ei • p1 = probability of generating a foreign word by alignment with the NULL English word

  34. IBM Model 3 • Generative Story: • Choose fertilities for each English word • Insert spurious words according to probability of being aligned to the NULL English word • Translate English words -> foreign words • Reorder words according to reverse distortion probabilities

  35. IBM Model 3 Example • Consider the following example from [Knight 1999]: • Maria did not slap the green witch

  36. IBM Model 3 Example • Maria did not slap the green witch • Maria not slap slap slap the green witch • Choose fertilities: phi(Maria) = 1

  37. IBM Model 3 Example • Maria did not slap the green witch • Maria not slap slap slap the green witch • Maria not slap slap slap NULL the green witch • Insert spurious words: p(NULL)

  38. IBM Model 3 Example • Maria did not slap the green witch • Maria not slap slap slap the green witch • Maria not slap slap slap NULL the green witch • Maria no dio una bofetada a la verde bruja • Translate words: t(verde|green)

  39. IBM Model 3 Example • Maria no dio una bofetada a la verde bruja • Maria no dio una bofetada a la bruja verde • Reorder words
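
Stepping through the four-stage generative story on this example, with every choice (fertilities, spurious insertion, word translations, reordering) hard-coded rather than sampled from learned distributions; a sketch for illustration only:

```python
# Model 3 generative story on the Knight example, with all choices hard-coded.

e = "Maria did not slap the green witch".split()

# 1. Choose fertilities: "did" gets fertility 0, "slap" fertility 3, the rest 1
fertility = {"Maria": 1, "did": 0, "not": 1, "slap": 3, "the": 1, "green": 1, "witch": 1}
step1 = [w for w in e for _ in range(fertility[w])]
# ['Maria', 'not', 'slap', 'slap', 'slap', 'the', 'green', 'witch']

# 2. Insert a spurious word generated by NULL (it will become "a")
step2 = step1[:5] + ["NULL"] + step1[5:]

# 3. Translate each English word into a foreign word
translate = {"Maria": "Maria", "not": "no", "NULL": "a",
             "the": "la", "green": "verde", "witch": "bruja"}
slap_words = iter(["dio", "una", "bofetada"])     # the three copies of "slap"
step3 = [next(slap_words) if w == "slap" else translate[w] for w in step2]
# ['Maria', 'no', 'dio', 'una', 'bofetada', 'a', 'la', 'verde', 'bruja']

# 4. Reorder according to distortion: swap "verde" and "bruja"
step4 = step3[:7] + ["bruja", "verde"]
print(" ".join(step4))   # Maria no dio una bofetada a la bruja verde
```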

  40. IBM Model 3 • For models 1 and 2: • We can compute exact EM updates • For models 3 and 4: • Exact EM updates cannot be efficiently computed • Use best alignments from previous iterations to initialize each successive model • Explore only the subspace of potential alignments that lies within same neighborhood as the initial alignments

  41. IBM Model 4 • Model parameters: • Same as model 3, except uses more complicated model of reordering (for details, see Brown et al. 1993)

  42. Language Model • Given an English sentence e1, e2 …el : P(e1, e2 …el ) = P(e1) * P(e2|e1 ) * … * P(el| e1, e2 …el-1 ) • N-gram model: • Assume P(ei) depends only on the N-1 previous words, so that P(ei | e1, e2, …ei-1) = P(ei | ei-N+1, …ei-1)

  43. N=2: Bigram Language Model P(Maria did not slap the green witch) = P(Maria|START) * P(did|Maria) * P(not|did) * … * P(END|witch)
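
A minimal bigram language model sketch with hypothetical counts and no smoothing (a real model would need smoothing to avoid zero probabilities for unseen bigrams):

```python
from collections import defaultdict

# Toy monolingual corpus for estimating bigram probabilities (illustrative only)
corpus = [
    "START Maria did not slap the green witch END".split(),
    "START the witch did not go END".split(),
]

bigram_counts = defaultdict(float)
unigram_counts = defaultdict(float)
for sentence in corpus:
    for prev, cur in zip(sentence, sentence[1:]):
        bigram_counts[(prev, cur)] += 1
        unigram_counts[prev] += 1

def bigram_prob(sentence):
    """P(e) under a bigram model: product of P(e_i | e_{i-1}); unsmoothed."""
    words = ["START"] + sentence.split() + ["END"]
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigram_counts[(prev, cur)] / unigram_counts[prev]
    return p

print(bigram_prob("Maria did not slap the green witch"))
```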

  44. Word-Based MT • Word = fundamental unit of translation • Weaknesses: • no explicit modeling of word context • word-by-word translation may not accurately convey meaning of phrase: • “il ne va pas” -> “he does not go” • IBM models prevent alignment of foreign words with >1 English word: • “aller” -> “to go”

  45. Phrase-Based MT • Phrase = basic unit of translation • Strengths: • explicit modeling of word context • captures local reorderings, local dependencies

  46. Example Rules: • English: he does not go • Foreign: il ne va pas • ne va pas -> does not go

  47. Alignment Template System • [Och and Ney, 2004] • Alignment template: • Pair of source and target language phrases • Word alignment among words within those phrases • Formally, an alignment template is a triple (F,E,A): • F = words on foreign side • E = words on English side • A = alignments among words on the foreign and English sides
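
One way the triple (F, E, A) might be represented as a data structure; the "ne va pas" / "does not go" template and its alignment links below are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AlignmentTemplate:
    """An alignment template (F, E, A): foreign phrase, English phrase,
    and word alignments between them (pairs of 0-based positions)."""
    F: tuple        # foreign-side words
    E: tuple        # English-side words
    A: frozenset    # set of (foreign_pos, english_pos) alignment links

# Hypothetical template: ne <-> not, va <-> go, pas <-> not (links are illustrative)
template = AlignmentTemplate(
    F=("ne", "va", "pas"),
    E=("does", "not", "go"),
    A=frozenset({(0, 1), (1, 2), (2, 1)}),
)
print(template)
```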

  48. Estimating P(e|f) • Noisy channel: • Decompose P(e|f) into P(f|e) and P(e) • Estimate P(f|e) and P(e) separately • Direct: • Estimate P(e|f) directly from training corpus • Use log-linear model

  49. Log-linear Models for MT • Compute best translation as follows: • e* = argmax over all e of Σi λi hi(e,f) • where hi are the feature functions and λi are the model parameters • Typical feature functions include: • phrase translation probabilities • lexical translation probabilities • language model probability • reordering model • word penalty [Koehn 2003]
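
A small sketch of this log-linear decision rule with hypothetical feature values and weights for two candidate translations (all numbers are made up for illustration):

```python
import math

# Hypothetical feature values h_i(e, f) for two candidate translations of f.
# Features follow the slide: log translation prob, log LM prob, word penalty.
candidates = {
    "he does not go": {"log_tm": math.log(0.3), "log_lm": math.log(1e-4), "word_penalty": -4},
    "he not goes":    {"log_tm": math.log(0.1), "log_lm": math.log(1e-6), "word_penalty": -3},
}

# Model parameters (lambdas); in practice these are tuned, here they are made up
weights = {"log_tm": 1.0, "log_lm": 1.0, "word_penalty": 0.5}

def loglinear_score(features):
    """Sum_i lambda_i * h_i(e, f)."""
    return sum(weights[name] * value for name, value in features.items())

best_e = max(candidates, key=lambda e: loglinear_score(candidates[e]))
print(best_e)   # "he does not go" under these toy numbers
```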

  50. Log-linear Models for MT • Noisy Channel model is a special case of Log-Linear model where: • h1 = log(P(f|e)), λ1 = 1 • h2 = log(P(e)), λ2 = 1 • Then: • argmax over all e of Σi λi hi(e,f) = argmax over all e of log P(f|e) + log P(e) = argmax over all e of P(f|e) * P(e), i.e., the noisy channel decision rule [Och and Ney 2003]
