Statistical Machine Translation: IBM Models and the Alignment Template System
Statistical Machine Translation • Goal: • Given foreign sentence f: • “Maria no dio una bofetada a la bruja verde” • Find the most likely English translation e: • “Maria did not slap the green witch”
Statistical Machine Translation • Most likely English translation e is given by: ê = argmax_e P(e|f) • P(e|f) estimates the conditional probability of any e given f
Statistical Machine Translation • How to estimate P(e|f)? • Noisy channel: • Decompose P(e|f) into P(f|e) * P(e) / P(f) • Estimate P(f|e) and P(e) separately using parallel corpus • Direct: • Estimate P(e|f) directly using parallel corpus (more on this later)
Noisy Channel Model • Translation Model • P(f|e) • How likely is f to be a translation of e? • Estimate parameters from bilingual corpus • Language Model • P(e) • How likely is e to be an English sentence? • Estimate parameters from monolingual corpus • Decoder • Given f, what is the best translation e?
Noisy Channel Model • Generative story: • Generate e with probability P(e) • Pass e through the noisy channel • Out comes f with probability P(f|e) • Translation task: • Given f, deduce the most likely e that produced f, or: ê = argmax_e P(e|f) = argmax_e P(f|e) * P(e)
Translation Model • How to model P(f|e)? • Learn parameters of P(f|e) from a bilingual corpus of S sentence pairs <ei, fi> : < e1,f1 > = <the blue witch, la bruja azul> < e2,f2 > = <green, verde> … < eS,fS > = <the witch, la bruja>
Translation Model • Insufficient data in parallel corpus to estimate P(f|e) at the sentence level (Why?) • Decompose process of translating e -> f into small steps whose probabilities can be estimated
Translation Model • English sentence e = e1…el • Foreign sentence f = f1…fm • Alignment A = {a1…am}, where aj ∈ {0…l} • A indicates which English word generates each foreign word
Alignments e: “the blue witch” f: “la bruja azul” A = {1,3,2} (intuitively “good” alignment)
Alignments e: “the blue witch” f: “la bruja azul” A = {1,1,1} (intuitively “bad” alignment)
Alignments e: “the blue witch” f: “la bruja azul” (illegal alignment!)
Alignments • Question: how many possible alignments are there for a given e and f, where |e| = l and |f| = m?
Alignments • Question: how many possible alignments are there for a given e and f, where |e| = l and |f| = m? • Answer: • Each foreign word can align with any one of |e| = l words, or it can remain unaligned • Each foreign word has (l + 1) choices for an alignment, and there are |f| = m foreign words • So, there are (l+1)^m alignments for a given e and f
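Worked example, using the sentences from the earlier alignment slides: for e = “the blue witch” (l = 3) and f = “la bruja azul” (m = 3), each of the 3 foreign words has 4 choices (one of the 3 English words or NULL), so there are (3+1)^3 = 64 possible alignments.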
Alignments • Question: If all alignments are equally likely, what is the probability of any one alignment, given e?
Alignments • Question: If all alignments are equally likely, what is the probability of any one alignment, given e? • Answer: • P(A|e) = p(|f| = m) * 1/(l+1)^m • If we assume that p(|f| = m) is uniform over all possible values of |f|, then we can let p(|f| = m) = C • P(A|e) = C /(l+1)^m
Generative Story e: “blue witch” f: “bruja azul” How do we get from e to f?
IBM Model 1 • Model parameters: • T(fj | eaj ) = translation probability of foreign word given English word that generated it
IBM Model 1 • Generative story: • Given e: • Pick m = |f|, where all lengths m are equally probable • Pick A with probability P(A|e) = 1/(l+1)^m, since all alignments are equally likely given l and m • Pick f1…fm with probability T(f1 | ea1) * … * T(fm | eam), where T(fj | eaj) is the translation probability of fj given the English word it is aligned to
IBM Model 1 Example e: “blue witch”
IBM Model 1 Example e: “blue witch” f: “f1 f2” Pick m = |f| = 2
IBM Model 1 Example e: “blue witch” f: “f1 f2” Pick A = {2,1} with probability 1/(l+1)^m
IBM Model 1 Example e: “blue witch” f: “bruja f2” Pick f1 = “bruja” with probability t(bruja|witch)
IBM Model 1 Example e: “blue witch” f: “bruja azul” Pick f2 = “azul” with probability t(azul|blue)
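Multiplying the pieces of this generative story gives P(f, A|e) for the example. Below is a minimal Python sketch; the translation probabilities t(bruja|witch) and t(azul|blue) are made-up values for illustration, and the constant factor p(|f| = m) is dropped.

```python
# Minimal sketch of IBM Model 1's P(f, A | e) for the "blue witch" example.
# The translation probabilities below are made up for illustration.
t = {
    ("bruja", "witch"): 0.8,
    ("azul", "blue"): 0.7,
}

def model1_joint_prob(f_words, e_words, alignment):
    """P(f, A | e) = 1/(l+1)^m * product over j of t(f_j | e_{a_j})."""
    l, m = len(e_words), len(f_words)
    prob = 1.0 / (l + 1) ** m
    for j, a_j in enumerate(alignment):        # a_j indexes into e (1-based)
        prob *= t[(f_words[j], e_words[a_j - 1])]
    return prob

e = ["blue", "witch"]
f = ["bruja", "azul"]
A = [2, 1]   # f1 "bruja" aligned to e2 "witch", f2 "azul" aligned to e1 "blue"
print(model1_joint_prob(f, e, A))   # (1/9) * 0.8 * 0.7
```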
IBM Model 1: Parameter Estimation • How does this generative story help us to estimate P(f|e) from the data? • Since the model for P(f|e) contains the parameter T(fj | eaj ), we first need to estimate T(fj | eaj )
IBM Model 1: Parameter Estimation • How to estimate T(fj | eaj ) from the data? • If we had the data and the alignments A, along with P(A|f,e), then we could estimate T(fj | eaj ) using expected counts as follows: T(f|e) = expected count of e aligned to f / expected count of e aligned to any foreign word, where each alignment contributes counts weighted by P(A|f,e)
IBM Model 1: Parameter Estimation • How to estimate P(A|f,e)? • P(A|f,e) = P(A,f|e) / P(f|e) • But P(f|e) = ΣA P(A,f|e) • So we need to compute P(A,f|e)… • This is given by the Model 1 generative story:
IBM Model 1 Example e: “the blue witch” f: “la bruja azul” P(A|f,e) = P(f,A|e) / P(f|e) = P(f,A|e) / ΣA′ P(f,A′|e)
IBM Model 1: Parameter Estimation • So, in order to estimate P(f|e), we first need to estimate the model parameter T(fj | eaj ) • In order to compute T(fj | eaj ), we need to estimate P(A|f,e) • And in order to compute P(A|f,e), we need to estimate T(fj | eaj )…
IBM Model 1: Parameter Estimation • Training data is a set of pairs < ei, fi > • Log likelihood of training data given model parameters is: Σi log P(fi | ei) • To maximize log likelihood of training data given model parameters, use EM: • hidden variable = alignments A • model parameters = translation probabilities T
EM • Initialize model parameters T(f|e) • Calculate alignment probabilities P(A|f,e) under current values of T(f|e) • Calculate expected counts from alignment probabilities • Re-estimate T(f|e) from these expected counts • Repeat until log likelihood of training data converges to a maximum
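As a concrete illustration, here is a minimal Python sketch of this EM loop for Model 1, trained on the tiny three-pair corpus from the Translation Model slide. It is a sketch only: the NULL word, the sentence-length factor, and a proper convergence test are omitted.

```python
from collections import defaultdict

# Tiny bilingual corpus from the Translation Model slide.
corpus = [
    (["the", "blue", "witch"], ["la", "bruja", "azul"]),
    (["green"], ["verde"]),
    (["the", "witch"], ["la", "bruja"]),
]

# Initialize T(f|e) uniformly over the foreign vocabulary.
f_vocab = {f for _, f_words in corpus for f in f_words}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for iteration in range(10):                  # fixed iteration count for the sketch
    count = defaultdict(float)               # expected count of e aligned to f
    total = defaultdict(float)               # expected count of e aligned to anything

    # E-step: compute alignment posteriors under the current T(f|e).
    for e_words, f_words in corpus:
        for f in f_words:
            # Under Model 1 the alignment of each foreign word is independent,
            # and P(a_j = i | f, e) is proportional to t(f_j | e_i).
            norm = sum(t[(f, e)] for e in e_words)
            for e in e_words:
                p = t[(f, e)] / norm
                count[(f, e)] += p
                total[e] += p

    # M-step: re-estimate T(f|e) from the expected counts.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

print(t[("bruja", "witch")])   # should move toward 1.0 as EM converges
```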
IBM Model 2 • Model parameters: • T(fj | eaj ) = translation probability of foreign word fj given English word eaj that generated it • d(i|j,l,m) = distortion probability, or probability that fj is aligned to ei , given l and m
IBM Model 3 • Model parameters: • T(fj | eaj ) = translation probability of foreign word fj given English word eaj that generated it • r(j|i,l,m) = reverse distortion probability, or probability that the foreign word generated by ei ends up in position j, given l and m • n(φ|ei) = fertility probability, or probability that ei generates φ foreign words • p1 = probability of generating a spurious foreign word by alignment with the NULL English word
IBM Model 3 • Generative Story: • Choose fertilities for each English word • Insert spurious words according to probability of being aligned to the NULL English word • Translate English words -> foreign words • Reorder words according to reverse distortion probabilities
IBM Model 3 Example • Consider the following example from [Knight 1999]: • Maria did not slap the green witch
IBM Model 3 Example • Maria did not slap the green witch • Maria not slap slap slap the green witch • Choose fertilities: phi(Maria) = 1
IBM Model 3 Example • Maria did not slap the green witch • Maria not slap slap slap the green witch • Maria not slap slap slap NULL the green witch • Insert spurious words: p(NULL)
IBM Model 3 Example • Maria did not slap the green witch • Maria not slap slap slap the green witch • Maria not slap slap slap NULL the green witch • Maria no dio una bofetada a la verde bruja • Translate words: t(verde|green)
IBM Model 3 Example • Maria no dio una bofetada a la verde bruja • Maria no dio una bofetada a la bruja verde • Reorder words
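The four generative steps can be followed mechanically on this example. Below is a toy Python sketch in which the fertility table, the per-copy translations, the inserted NULL word “a”, and the final swap are all hand-picked to reproduce the slide; they are not trained parameters.

```python
# Toy walk-through of Model 3's generative steps on the Knight (1999) example.
e = ["Maria", "did", "not", "slap", "the", "green", "witch"]

# 1. Choose fertilities: copy each English word phi times.
fertility = {"Maria": 1, "did": 0, "not": 1, "slap": 3,
             "the": 1, "green": 1, "witch": 1}
copies = [w for w in e for _ in range(fertility[w])]
# -> Maria not slap slap slap the green witch

# 2. Insert a spurious word generated by the NULL English word.
copies.insert(copies.index("slap") + 3, "NULL")
# -> Maria not slap slap slap NULL the green witch

# 3. Translate each copy into one foreign word.
per_copy_translation = {"Maria": ["Maria"], "not": ["no"],
                        "slap": ["dio", "una", "bofetada"],
                        "NULL": ["a"], "the": ["la"],
                        "green": ["verde"], "witch": ["bruja"]}
queues = {w: list(ts) for w, ts in per_copy_translation.items()}
f = [queues[w].pop(0) for w in copies]
# -> Maria no dio una bofetada a la verde bruja

# 4. Reorder according to the distortion model (here: swap verde and bruja).
i, j = f.index("verde"), f.index("bruja")
f[i], f[j] = f[j], f[i]
print(" ".join(f))   # Maria no dio una bofetada a la bruja verde
```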
IBM Model 3 • For models 1 and 2: • We can compute exact EM updates • For models 3 and 4: • Exact EM updates cannot be efficiently computed • Use best alignments from previous iterations to initialize each successive model • Explore only the subspace of potential alignments that lies within same neighborhood as the initial alignments
IBM Model 4 • Model parameters: • Same as model 3, except uses more complicated model of reordering (for details, see Brown et al. 1993)
Language Model • Given an English sentence e1, e2 …el : P(e1, e2 …el ) = P(e1) * P(e2|e1 ) * … * P(el| e1, e2 …el-1 ) • N-gram model: • Assume P(ei) depends only on the N-1 previous words, so that P(ei |e1,e2, …ei-1) = P(ei |ei-N+1,ei-N+2, …ei-1)
N=2: Bigram Language Model P(Maria did not slap the green witch) = P(Maria|START) * P(did|Maria) * P(not|did) * … P(END|witch)
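A minimal Python sketch of estimating a bigram model by maximum likelihood and scoring a sentence with it; the two-sentence corpus is made up, and a real language model would be trained on a large monolingual corpus and smoothed to handle unseen bigrams.

```python
from collections import defaultdict

# Made-up monolingual corpus for illustration.
sentences = [
    ["maria", "did", "not", "slap", "the", "green", "witch"],
    ["the", "witch", "did", "not", "go"],
]

bigram_counts = defaultdict(float)
context_counts = defaultdict(float)
for s in sentences:
    tokens = ["<s>"] + s + ["</s>"]
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def bigram_prob(prev, cur):
    """Maximum-likelihood estimate: P(cur | prev) = count(prev, cur) / count(prev)."""
    return bigram_counts[(prev, cur)] / context_counts[prev]

def sentence_prob(words):
    """P(w1 ... wn) under the bigram model, including sentence boundaries."""
    tokens = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, cur)
    return p

print(sentence_prob(["the", "witch", "did", "not", "slap", "the", "green", "witch"]))
```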
Word-Based MT • Word = fundamental unit of translation • Weaknesses: • no explicit modeling of word context • word-by-word translation may not accurately convey the meaning of a phrase: • “il ne va pas” -> “he does not go” • IBM models cannot align one foreign word with more than one English word: • “aller” -> “to go”
Phrase-Based MT • Phrase = basic unit of translation • Strengths: • explicit modeling of word context • captures local reorderings, local dependencies
Example Rules: • English: he does not go • Foreign: il ne va pas • ne va pas -> does not go
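A toy Python sketch of what a phrase table buys over word-by-word translation, using the rule above plus a hypothetical single-word entry for “il”. A real phrase-based decoder searches over segmentations and reorderings and scores them with a language model rather than matching greedily.

```python
# Toy phrase-based lookup; the phrase table entries are illustrative only.
phrase_table = {
    ("il",): "he",
    ("ne", "va", "pas"): "does not go",
}

def greedy_phrase_translate(f_words):
    """Greedily match the longest known source phrase at each position."""
    out, i = [], 0
    while i < len(f_words):
        for length in range(len(f_words) - i, 0, -1):   # try longest match first
            phrase = tuple(f_words[i:i + length])
            if phrase in phrase_table:
                out.append(phrase_table[phrase])
                i += length
                break
        else:
            out.append(f_words[i])   # pass unknown words through unchanged
            i += 1
    return " ".join(out)

print(greedy_phrase_translate(["il", "ne", "va", "pas"]))   # he does not go
```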
Alignment Template System • [Och and Ney, 2004] • Alignment template: • Pair of source and target language phrases • Word alignment among words within those phrases • Formally, an alignment template is a triple (F,E,A): • F = words on foreign side • E = words on English side • A = alignments among words on the foreign and English sides
Estimating P(e|f) • Noisy channel: • Decompose P(e|f) into P(f|e) and P(e) • Estimate P(f|e) and P(e) separately • Direct: • Estimate P(e|f) directly from training corpus • Use log-linear model
Log-linear Models for MT • Compute best translation as follows: ê = argmax_e Σi λi hi(e,f) • where hi are the feature functions and λi are the model parameters • Typical feature functions include: • phrase translation probabilities • lexical translation probabilities • language model probability • reordering model • word penalty [Koehn 2003]
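A minimal Python sketch of choosing among candidate translations with a log-linear model; the candidates, feature values hi(e,f), and weights λi are all made up for illustration (in practice the weights are tuned on held-out data).

```python
# Hypothetical feature values h_i(e, f) for two candidate translations of the
# same foreign sentence, and hand-set weights lambda_i.
candidates = {
    "Maria did not slap the green witch": {
        "log_phrase_prob": -2.1, "log_lm_prob": -8.3, "word_penalty": -7.0},
    "Maria no slap the witch green": {
        "log_phrase_prob": -1.8, "log_lm_prob": -12.9, "word_penalty": -6.0},
}
weights = {"log_phrase_prob": 1.0, "log_lm_prob": 0.8, "word_penalty": -0.1}

def score(features):
    """Log-linear score: sum over i of lambda_i * h_i(e, f)."""
    return sum(weights[name] * value for name, value in features.items())

best = max(candidates, key=lambda e: score(candidates[e]))
print(best)   # the fluent candidate wins despite a slightly worse phrase score
```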
Log-linear Models for MT • Noisy Channel model is a special case of Log-Linear model where: • h1 = log(P(f|e)), λ1 = 1 • h2 = log(P(e)), λ2 = 1 • Then: ê = argmax_e log P(f|e) + log P(e) = argmax_e P(f|e) * P(e) [Och and Ney 2003]