
Machine Translation- 2


Presentation Transcript


  1. Machine Translation- 2 Autumn 2008 Lecture 17 4 Sep 2008

  2. Statistical Machine Translation • Goal: • Given foreign sentence f: • “Maria no dio una bofetada a la bruja verde” • Find the most likely English translation e: • “Maria did not slap the green witch”

  3. Statistical Machine Translation • Most likely English translation e is given by: ê = argmax_e P(e | f) • P(e | f) estimates the conditional probability of any e given f

  4. What makes a good translation • Translators often talk about two factors we want to maximize: • Faithfulness or fidelity • How close is the meaning of the translation to the meaning of the original • (Even better: does the translation cause the reader to draw the same inferences as the original would have) • Fluency or naturalness • How natural the translation is, just considering its fluency in the target language

  5. Statistical MT Systems • [Figure: the noisy-channel SMT pipeline] Spanish/English bilingual text and monolingual English text each feed a statistical analysis step. A Spanish input (“Que hambre tengo yo”) is first mapped to candidate “broken English” renderings (“What hunger have I”, “Hungry I am so”, “I am so hungry”, “Have I that hunger”, …) and then to fluent English (“I am so hungry”).

  6. Statistical MT Systems • [Figure: the same pipeline with components labeled] The bilingual text yields the translation model P(s|e); the monolingual English text yields the language model P(e); a decoding algorithm maps “Que hambre tengo yo” to “I am so hungry” by computing argmax_e P(e) * P(s|e).

  7. Statistical MT: Faithfulness and Fluency formalized! • Best translation of a source sentence S: ê = argmax_e fluency(e) × faithfulness(e, S) • Developed by researchers who were originally in speech recognition at IBM • Called the IBM model

  8. Three Problems for Statistical MT • Language model • Given an English string e, assigns P(e) by formula • good English string -> high P(e) • random word sequence -> low P(e) • Translation model • Given a pair of strings <f,e>, assigns P(f | e) by formula • <f,e> look like translations -> high P(f | e) • <f,e> don’t look like translations -> low P(f | e) • Decoding algorithm • Given a language model, a translation model, and a new sentence f … find translation e maximizing P(e) * P(f | e)
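
A minimal sketch of how these three components fit together at decoding time, assuming a hypothetical list of candidate translations and placeholder scoring functions (a real decoder searches the space of possible translations rather than scoring a fixed list):

```python
def noisy_channel_best(candidates, lm_logprob, tm_logprob, f):
    """Pick the English candidate e maximizing P(e) * P(f | e), in log space.

    candidates -- hypothetical list of candidate English translations of f
    lm_logprob -- language model:    e -> log P(e)
    tm_logprob -- translation model: (f, e) -> log P(f | e)
    """
    return max(candidates, key=lambda e: lm_logprob(e) + tm_logprob(f, e))
```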

  9. Parallel Corpus • Example from DE-News (8/1/1996)

  10. Word-Level Alignments • Given a parallel sentence pair we can link (align) words or phrases that are translations of each other:

  11. Parallel Resources • Newswire: DE-News (German-English), Hong-Kong News, Xinhua News (Chinese-English) • Government: Canadian Hansards (French-English), Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish), UN Treaties (Russian, English, Arabic, …) • Manuals: PHP, KDE, OpenOffice (all from OPUS, many languages) • Web pages: STRAND project (Philip Resnik)

  12. Sentence Alignment • If document De is translation of document Df how do we find the translation for each sentence? • The n-th sentence in De is not necessarily the translation of the n-th sentence in document Df • In addition to 1:1 alignments, there are also 1:0, 0:1, 1:n, and n:1 alignments • Approximately 90% of the sentence alignments are 1:1

  13. Sentence Alignment (cont’d) • There are several sentence alignment algorithms: • Align (Gale & Church): aligns sentences based on their character length (shorter sentences tend to have shorter translations than longer sentences). Works astonishingly well • Char-align (Church): aligns based on shared character sequences. Works fine for similar languages or technical domains • K-Vec (Fung & Church): induces a translation lexicon from the parallel texts based on the distribution of foreign-English word pairs
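
Below is a much-simplified sketch of the Gale & Church idea, under the assumption that sentences are represented only by their character lengths; it uses a squared length-difference cost in place of the original Gaussian model of the length ratio, and allows only 1:1, 1:0, 0:1, 2:1 and 1:2 alignments:

```python
def length_cost(ls, lt):
    # Cost of aligning a source span of length ls with a target span of length lt.
    if ls == 0 and lt == 0:
        return 0.0
    mean = (ls + lt) / 2.0
    return ((ls - lt) ** 2) / mean + 1.0  # +1.0 acts as a per-bead penalty

def align(src_lens, tgt_lens):
    """src_lens, tgt_lens: character lengths of the sentences in each document.
    Returns a list of (src_indices, tgt_indices) beads."""
    n, m = len(src_lens), len(tgt_lens)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    beads = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]  # allowed alignment types
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for di, dj in beads:
                if i + di <= n and j + dj <= m:
                    c = cost[i][j] + length_cost(sum(src_lens[i:i + di]),
                                                 sum(tgt_lens[j:j + dj]))
                    if c < cost[i + di][j + dj]:
                        cost[i + di][j + dj] = c
                        back[i + di][j + dj] = (di, dj)
    # Trace back the lowest-cost path.
    result, i, j = [], n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        result.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    return list(reversed(result))
```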

  14. Computing Translation Probabilities • Given a parallel corpus we can estimate P(e | f). The maximum likelihood estimate of P(e | f) is: freq(e, f) / freq(f) • Way too specific to get any reasonable frequencies! The vast majority of unseen data will have zero counts! • P(e | f) could instead be re-defined at the word level, e.g. as a product of word translation probabilities ∏i P(ei | fi) • Problem: the English words maximizing P(e | f) might not result in a readable sentence
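
As an illustration of the relative-frequency estimate freq(e,f)/freq(f), here is a tiny sketch over hypothetical word-aligned pairs (the sentence-level estimate works the same way, only with far sparser counts):

```python
from collections import Counter

def mle_translation_probs(aligned_pairs):
    """aligned_pairs: hypothetical list of (e_word, f_word) pairs from a
    word-aligned parallel corpus. Returns P(e | f) = freq(e, f) / freq(f)."""
    pair_counts = Counter(aligned_pairs)             # freq(e, f)
    f_counts = Counter(f for _, f in aligned_pairs)  # freq(f)
    return {(e, f): c / f_counts[f] for (e, f), c in pair_counts.items()}

probs = mle_translation_probs([("witch", "bruja"), ("green", "verde"),
                               ("witch", "bruja"), ("the", "la")])
# probs[("witch", "bruja")] == 1.0, probs[("green", "verde")] == 1.0, ...
```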

  15. Computing Translation Probabilities (cont’d) • We can account for adequacy: each foreign word translates into its most likely English word • We cannot guarantee that this will result in a fluent English sentence • Solution: transform P(e | f) with Bayes’ rule: P(e | f) = P(e) P(f | e) / P(f) • P(f | e) accounts for adequacy • P(e) accounts for fluency

  16. Decoding • The decoder combines the evidence from P(e) and P(f | e) to find the sequence e that is the best translation: ê = argmax_e P(e) · P(f | e) • The choice of word e’ as translation of f’ depends on the translation probability P(f’ | e’) and on the context, i.e. the other English words preceding e’

  17. Noisy Channel Model for Translation

  18. Noisy Channel Model • Generative story: • Generate e with probability p(e) • Pass e through the noisy channel • Out comes f with probability p(f | e) • Translation task: • Given f, deduce the most likely e that produced f: ê = argmax_e p(e) p(f | e)

  19. Translation Model • How to model P(f|e)? • Learn parameters of P(f|e) from a bilingual corpus S of sentence pairs <ei,fi> : < e1,f1 > = <the blue witch, la bruja azul> < e2,f2 > = <green, verde> … < eS,fS > = <the witch, la bruja>

  20. Translation Model • Insufficient data in parallel corpus to estimate P(f|e) at the sentence level (Why?) • Decompose process of translating e -> f into small steps whose probabilities can be estimated

  21. Translation Model • English sentence e = e1…el • Foreign sentence f = f1…fm • Alignment A = {a1…am}, where aj ∈ {0…l} • A indicates which English word generates each foreign word (aj = 0 means fj is generated by the NULL word)

  22. Alignments e: “the blue witch” f: “la bruja azul” A = {1,3,2} (intuitively “good” alignment)

  23. Alignments e: “the blue witch” f: “la bruja azul” A = {1,1,1} (intuitively “bad” alignment)

  24. Alignments e: “the blue witch” f: “la bruja azul” (illegal alignment!)

  25. Alignments • Question: how many possible alignments are there for a given e and f, where |e| = l and |f| = m?

  26. Alignments • Question: how many possible alignments are there for a given e and f, where |e| = l and |f| = m? • Answer: • Each foreign word can align with any one of |e| = l words, or it can remain unaligned • Each foreign word has (l + 1) choices for an alignment, and there are |f| = m foreign words • So, there are (l+1)^m alignments for a given e and f
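
The counting argument can be checked by brute-force enumeration, treating index 0 as the NULL word:

```python
from itertools import product

# Each of the m foreign words picks one of the l English words or NULL
# (index 0), so there are (l + 1)**m possible alignments.
l, m = 3, 3   # e.g. |e| = 3 ("the blue witch"), |f| = 3 ("la bruja azul")
alignments = list(product(range(l + 1), repeat=m))
assert len(alignments) == (l + 1) ** m   # 4**3 = 64
```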

  27. Alignments • Question: If all alignments are equally likely, what is the probability of any one alignment, given e?

  28. Alignments • Question: If all alignments are equally likely, what is the probability of any one alignment, given e? • Answer: • P(A|e) = p(|f| = m) * 1/(l+1)^m • If we assume that p(|f| = m) is uniform over all possible values of |f|, then we can let p(|f| = m) = C • P(A|e) = C /(l+1)^m

  29. Generative Story e: “blue witch” f: “bruja azul” ? How do we get from e to f?

  30. Language Modeling • Determines the probability of some English sequence of length l • P(e) is hard to estimate directly, unless l is very small • P(e) is normally approximated as P(e) ≈ ∏i P(ei | ei−m … ei−1), where m is the size of the context, i.e. the number of previous words that are considered, normally m = 2 (trigram language model)
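
A minimal sketch of such a trigram approximation with unsmoothed maximum-likelihood estimates (a real language model would need smoothing to avoid zero probabilities for unseen trigrams):

```python
import math
from collections import Counter

def train_trigram_lm(sentences):
    # Count trigrams and their bigram contexts, with sentence-boundary padding.
    tri, bi = Counter(), Counter()
    for s in sentences:
        words = ["<s>", "<s>"] + s + ["</s>"]
        for i in range(2, len(words)):
            tri[tuple(words[i - 2:i + 1])] += 1
            bi[tuple(words[i - 2:i])] += 1
    return tri, bi

def logprob(sentence, tri, bi):
    # log P(e) under the trigram approximation, -inf for unseen trigrams.
    words = ["<s>", "<s>"] + sentence + ["</s>"]
    lp = 0.0
    for i in range(2, len(words)):
        t, b = tuple(words[i - 2:i + 1]), tuple(words[i - 2:i])
        if tri[t] == 0:
            return float("-inf")
        lp += math.log(tri[t] / bi[b])
    return lp
```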

  31. Translation Modeling • Determines the probability that the foreign word f is a translation of the English word e • How to compute P(f | e) from a parallel corpus? • Statistical approaches rely on the co-occurrence of e and f in the parallel data: If e and f tend to co-occur in parallel sentence pairs, they are likely to be translations of one another

  32. Finding Translations in a Parallel Corpus • Into which foreign words f, . . . , f’ does e translate? • Commonly, four factors are used: • How often do e and f co-occur? (translation) • How likely is a word occurring at position i to translate into a word occurring at position j? (distortion) For example: English is a verb-second language, whereas German is a verb-final language • How likely is e to translate into more than one word? (fertility) For example: defeated can translate into eine Niederlage erleiden • How likely is a foreign word to be spuriously generated? (null translation)

  33. Translation Steps

  34. IBM Models 1–5 • Model 1: Bag of words • Unique local maximum • Efficient EM algorithm (Models 1–2) • Model 2: General alignment probabilities a(aj | j, l, m) • Model 3: Fertility: n(k | e) • No full EM, count only neighbors (Models 3–5) • Deficient (Models 3–4) • Model 4: Relative distortion, word classes • Model 5: Extra variables to avoid deficiency

  35. IBM Model 1 • Model parameters: • T(fj | eaj ) = translation probability of foreign word given English word that generated it

  36. IBM Model 1 • Generative story: • Given e: • Pick m = |f|, where all lengths m are equally probable • Pick A with probability P(A|e) = 1/(l+1)^m, since all alignments are equally likely given l and m • Pick f1…fm with probability P(f1…fm | A, e) = ∏j T(fj | eaj), where T(fj | eaj) is the translation probability of fj given the English word it is aligned to
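
A small sketch of this generative story, using a hypothetical translation table T; the alignment is drawn uniformly over the (l+1)^m possibilities and each foreign word is then drawn from T(· | eaj):

```python
import random

# Hypothetical translation table: T[english_word][foreign_word] = probability.
T = {"blue":  {"azul": 0.9, "bruja": 0.1},
     "witch": {"bruja": 0.9, "azul": 0.1},
     "NULL":  {"la": 1.0}}

def generate(e, m):
    e = ["NULL"] + e                                 # index 0 is the NULL word
    l = len(e) - 1
    A = [random.randint(0, l) for _ in range(m)]     # uniform alignment, 1/(l+1)^m
    f = [random.choices(list(T[e[a]].keys()),
                        weights=T[e[a]].values())[0] for a in A]
    return f, A

print(generate(["blue", "witch"], 2))
```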

  37. IBM Model 1 Example e: “blue witch”

  38. IBM Model 1 Example e: “blue witch” f: “f1 f2” Pick m = |f| = 2

  39. IBM Model 1 Example e: “blue witch” f: “f1 f2” Pick A = {2,1} with probability 1/(l+1)^m

  40. IBM Model 1 Example e: “blue witch” f: “bruja f2” Pick f1 = “bruja” with probability t(bruja|witch)

  41. IBM Model 1 Example e: “blue witch” f: “bruja azul” Pick f2 = “azul” with probability t(azul|blue)

  42. IBM Model 1: Parameter Estimation • How does this generative story help us to estimate P(f|e) from the data? • Since the model for P(f|e) contains the parameter T(fj | eaj), we first need to estimate T(fj | eaj)

  43. IBM Model 1: Parameter Estimation • How to estimate T(fj | eaj) from the data? • If we had the data and the alignments A, along with P(A|f,e), then we could estimate T(fj | eaj) using expected counts: T(f | e) = countA(f, e) / Σf’ countA(f’, e), where countA(f, e) is the count of f aligned to e, weighted by P(A|f,e)

  44. IBM Model 1: Parameter Estimation • How to estimate P(A|f,e)? • P(A|f,e) = P(A,f|e) / P(f|e) • But P(f|e) = ΣA P(A,f|e) • So we need to compute P(A,f|e)… • This is given by the Model 1 generative story: P(A,f|e) = C/(l+1)^m · ∏j T(fj | eaj)

  45. IBM Model 1 Example e: “the blue witch” f: “la bruja azul” P(A|f,e) = P(f,A|e) / P(f|e) = P(f,A|e) / ΣA’ P(f,A’|e)
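
A sketch of this computation for short sentences, using a hypothetical translation table and normalizing P(f, A | e) over all (l+1)^m alignments to obtain P(A | f, e) (the constant length probability C cancels in the normalization, so it is dropped below):

```python
from itertools import product

def p_f_A_given_e(f, A, e, T):
    """P(f, A | e) up to the constant C; T maps (foreign, english) pairs to
    hypothetical translation probabilities T(f | e)."""
    e = ["NULL"] + e
    l, m = len(e) - 1, len(f)
    p = 1.0 / (l + 1) ** m
    for j, a in enumerate(A):
        p *= T.get((f[j], e[a]), 0.0)
    return p

def p_A_given_f_e(f, A, e, T):
    """P(A | f, e): normalize over every possible alignment of f to e + NULL."""
    l, m = len(e), len(f)
    total = sum(p_f_A_given_e(f, list(A2), e, T)
                for A2 in product(range(l + 1), repeat=m))
    return p_f_A_given_e(f, A, e, T) / total
```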

  46. IBM Model 1: Parameter Estimation • So, in order to estimate P(f|e), we first need to estimate the model parameter T(fj | eaj ) • In order to compute T(fj | eaj ), we need to estimate P(A|f,e) • And in order to compute P(A|f,e), we need to estimate T(fj | eaj )…

  47. IBM Model 1: Parameter Estimation • Training data is a set of pairs <ei, fi> • Log likelihood of the training data given the model parameters T is: L(T) = Σi log P(fi | ei) • To maximize the log likelihood of the training data given the model parameters, use EM: • hidden variable = alignments A • model parameters = translation probabilities T

  48. EM • Initialize model parameters T(f|e) • Calculate alignment probabilities P(A|f,e) under current values of T(f|e) • Calculate expected counts from alignment probabilities • Re-estimate T(f|e) from these expected counts • Repeat until log likelihood of training data converges to a maximum
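
A compact sketch of this EM loop for Model 1; it uses the standard simplification that, because Model 1 factorizes over foreign words, expected counts can be collected per word pair without enumerating alignments explicitly:

```python
from collections import defaultdict

def train_model1(corpus, iterations=10):
    """corpus: list of (english_words, foreign_words) sentence pairs.
    Returns translation probabilities t[(f, e)] = T(f | e)."""
    f_vocab = {f for _, fs in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))      # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)                   # expected count(e, f)
        total = defaultdict(float)                   # expected count(e)
        for es, fs in corpus:
            es = ["NULL"] + es
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / norm             # P(f aligned to e | this pair)
                    count[(e, f)] += c
                    total[e] += c
        for (e, f), c in count.items():              # M-step: re-estimate T(f|e)
            t[(f, e)] = c / total[e]
    return t
```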

  49. IBM Model 1 Example • Parallel ‘corpus’: the dog :: le chien the cat :: le chat • Step 1+2 (collect candidates and initialize uniformly): P(le | the) = P(chien | the) = P(chat | the) = 1/3 P(le | dog) = P(chien | dog) = P(chat | dog) = 1/3 P(le | cat) = P(chien | cat) = P(chat | cat) = 1/3 P(le | NULL) = P(chien | NULL) = P(chat | NULL) = 1/3
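
Running the EM sketch from the previous slide on this toy corpus (with NULL added to each English sentence, as above):

```python
corpus = [(["the", "dog"], ["le", "chien"]),
          (["the", "cat"], ["le", "chat"])]
t = train_model1(corpus, iterations=20)
# After a few iterations, t[("le", "the")] dominates t[("chien", "the")] and
# t[("chat", "the")], since "le" co-occurs with "the" in both sentence pairs.
print(round(t[("le", "the")], 3), round(t[("chien", "dog")], 3))
```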
