
(Statistical) Approaches to Word Alignment

Explore word alignment models for machine translation, learn how to translate words and phrases, and model associations between languages.


Presentation Transcript


  1. (Statistical) Approaches to Word Alignment 11-734 Advanced Machine Translation Seminar Sanjika Hewavitharana Language Technologies Institute Carnegie Mellon University 02/02/2006

  2. Word Alignment Models • We want to learn how to translate words and phrases • Can learn it from parallel corpora • Typically work with sentence aligned corpora • Available from LDC, etc • For specific applications new data collection required • Model the associations between the different languages • Word to word mapping -> lexicon • Differences in word order -> distortion model • ‘Wordiness’, i.e. how many words to express a concept -> fertility • Statistical translation is based on word alignment models

  3. Alignment Example Observations: • Often 1-1 • Often monotone • Some 1-to-many • Some 1-to-nothing

  4. Word Alignment Models • IBM1 – lexical probabilities only • IBM2 – lexicon plus absolute position • IBM3 – plus fertilities • IBM4 – inverted relative position alignment • IBM5 – non-deficient version of Model 4 • HMM – lexicon plus relative position • BiBr – Bilingual Bracketing, lexical probabilities plus reordering via parallel segmentation • Syntactic alignment models [Brown et al. 1993, Vogel et al. 1996, Och et al. 1999, Wu 1997, Yamada et al. 2003]

  5. Notation • Source language • $f$ : source (French) word • $J$ : length of source sentence • $j$ : position in source sentence; $j = 1, 2, \ldots, J$ • $f_1^J = f_1 \ldots f_J$ : source sentence • Target language • $e$ : target (English) word • $I$ : length of target sentence • $i$ : position in target sentence; $i = 1, 2, \ldots, I$ • $e_1^I = e_1 \ldots e_I$ : target sentence

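  6. SMT - Principle • Translate a ‘French’ string $f_1^J = f_1 \ldots f_J$ into an ‘English’ string $e_1^I = e_1 \ldots e_I$ • Bayes’ decision rule for translation: $\hat{e}_1^I = \arg\max_{e_1^I} \Pr(e_1^I \mid f_1^J) = \arg\max_{e_1^I} \Pr(e_1^I) \, \Pr(f_1^J \mid e_1^I)$ • Based on the noisy channel model • We will call f source and e target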

  7. Alignment as Hidden Variable • ‘Hidden alignments’ to capture word-to-word correspondences • Number of connections: $J \cdot I$ (each source word with each target word) • Number of alignments: $2^{JI}$ • Restricted alignment • Each source word has one connection – the alignment is a function • $i = a_j$: position $i$ of the target word $e_i$ that is connected to source position $j$ • Number of alignments is now: $I^J$ • $a_1^J = a_1 \ldots a_J$ : whole alignment • Relationship between Translation Model and Alignment Model: $\Pr(f_1^J \mid e_1^I) = \sum_{a_1^J} \Pr(f_1^J, a_1^J \mid e_1^I)$

  8. Empty Position (Null Word) • Sometimes a word has no correspondence • The alignment function aligns each source word to one target word, i.e. it cannot skip a source word • Solution: • Introduce empty position 0 with null word $e_0$ • ‘Skip’ source word $f_j$ by aligning it to $e_0$ • Target sentence is extended to: $e_0^I = e_0 e_1 \ldots e_I$ • Alignment is extended to: $a_j \in \{0, 1, \ldots, I\}$

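  9. Translation Model • Sum over all possible alignments: $\Pr(f_1^J \mid e_1^I) = \sum_{a_1^J} \Pr(f_1^J, a_1^J \mid e_1^I)$, where $\Pr(f_1^J, a_1^J \mid e_1^I) = \Pr(J \mid e_1^I) \prod_{j=1}^{J} \Pr(a_j \mid a_1^{j-1}, f_1^{j-1}, J, e_1^I) \, \Pr(f_j \mid f_1^{j-1}, a_1^{j}, J, e_1^I)$ • 3 probability distributions: • Length: $\Pr(J \mid e_1^I)$ • Alignment: $\Pr(a_j \mid a_1^{j-1}, f_1^{j-1}, J, e_1^I)$ • Lexicon: $\Pr(f_j \mid f_1^{j-1}, a_1^{j}, J, e_1^I)$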

  10. Model Assumptions • Decompose the interaction into pairwise dependencies • Length: source length only dependent on target length (very weak): $p(J \mid I)$ • Alignment: • Zero order model: target position only dependent on source position: $p(a_j \mid j, J, I)$ • First order model: target position only dependent on previous target position: $p(a_j \mid a_{j-1}, I)$ • Lexicon: source word only dependent on aligned word: $p(f_j \mid e_{a_j})$

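  11. IBM Model 1 • Length: source length only dependent on target length: $p(J \mid I)$ • Alignment: assume uniform probability for position alignment: $p(a_j \mid j, J, I) = \frac{1}{I+1}$ • Lexicon: source word only dependent on aligned word: $p(f_j \mid e_{a_j})$ • Alignment probability: $\Pr(f_1^J, a_1^J \mid e_1^I) = \frac{p(J \mid I)}{(I+1)^J} \prod_{j=1}^{J} p(f_j \mid e_{a_j})$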

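  12. IBM Model 1 – Generative Process To generate a French string $f_1^J$ from an English string $e_1^I$: • Step 1: Pick the length $J$ of $f_1^J$ • All lengths are equally probable; $p(J \mid I) = \epsilon$ is a constant • Step 2: Pick an alignment $a_1^J$ with probability $\frac{1}{(I+1)^J}$ • Step 3: Pick the French words with probability $\prod_{j=1}^{J} p(f_j \mid e_{a_j})$ • Final Result: $\Pr(f_1^J \mid e_1^I) = \frac{\epsilon}{(I+1)^J} \sum_{a_1^J} \prod_{j=1}^{J} p(f_j \mid e_{a_j})$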
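The Model 1 sentence probability, $\frac{\epsilon}{(I+1)^J} \sum_{a_1^J} \prod_j p(f_j \mid e_{a_j})$, sums over $(I+1)^J$ alignments. Because each $a_j$ is chosen independently, the sum factorizes into $\prod_j \sum_{i=0}^{I} p(f_j \mid e_i)$, which is why Model 1 can be computed exactly. The sketch below checks this on a toy example. The lexicon table `lex`, the `"NULL"` token for $e_0$, the default `epsilon=1.0`, and both function names are illustrative assumptions, not part of the original slides.

```python
from itertools import product

def model1_likelihood(f_words, e_words, lex, epsilon=1.0):
    """Pr(f | e) under IBM Model 1 via the factorized product-of-sums form."""
    e_ext = ["NULL"] + list(e_words)      # position 0 holds the empty word e_0
    J, I = len(f_words), len(e_words)
    prob = epsilon / (I + 1) ** J
    for f in f_words:
        # Sum lexical probabilities over every target position, including 0.
        prob *= sum(lex.get((f, e), 0.0) for e in e_ext)
    return prob

def model1_likelihood_bruteforce(f_words, e_words, lex, epsilon=1.0):
    """Same quantity, computed by enumerating all (I+1)^J alignments."""
    e_ext = ["NULL"] + list(e_words)
    J, I = len(f_words), len(e_words)
    total = 0.0
    for a in product(range(I + 1), repeat=J):   # a[j] is the target position of f_j
        p = 1.0
        for j, i in enumerate(a):
            p *= lex.get((f_words[j], e_ext[i]), 0.0)
        total += p
    return epsilon / (I + 1) ** J * total

# Toy lexicon p(f | e); the numbers are made up for illustration.
lex = {
    ("la", "NULL"): 0.5, ("maison", "NULL"): 0.5,
    ("la", "the"): 0.9, ("maison", "the"): 0.1,
    ("la", "house"): 0.2, ("maison", "house"): 0.8,
}
fast = model1_likelihood(["la", "maison"], ["the", "house"], lex)
slow = model1_likelihood_bruteforce(["la", "maison"], ["the", "house"], lex)
```

Both functions return the same value; the factorized form needs only $J \cdot (I+1)$ lookups instead of $(I+1)^J$ products.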

  13. IBM Model 1 – Training • Parameters of the model: $p(f \mid e)$ • Training data: parallel sentence pairs • We adjust the parameters so that they maximize the likelihood of the training data, $\prod_{s} \Pr(f^{(s)} \mid e^{(s)})$ • Normalized for each $e$: $\sum_{f} p(f \mid e) = 1$ • EM algorithm used for the estimation • Initialize the parameters uniformly • Collect counts for each pair in the corpus • Re-estimate parameters using counts • Repeat for several iterations • Model is simple enough to compute over all alignments • The resulting parameters do not depend on the initial values

  14. IBM Model 1 Training – Pseudo Code

      # Accumulation (over corpus)
      For each sentence pair
          For each source position j
              Sum = 0.0
              For each target position i
                  Sum += p(fj|ei)
              For each target position i
                  Count(fj,ei) += p(fj|ei)/Sum

      # Re-estimate probabilities (over count table)
      For each target word e
          Sum = 0.0
          For each source word f
              Sum += Count(f,e)
          For each source word f
              p(f|e) = Count(f,e)/Sum

      # Repeat for several iterations
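A runnable version of the pseudo code above might look like the following. It is a minimal sketch, not the author's implementation: the function name `train_model1`, the `"NULL"` token for the empty word, the three-sentence toy corpus, and the choice to reset the count table at the start of every iteration are all assumptions made for illustration.

```python
from collections import defaultdict

def train_model1(corpus, iterations=5):
    """EM training of p(f | e) for IBM Model 1.

    corpus: list of (source_words, target_words) pairs.
    Returns a dict-like table keyed by (f, e).
    """
    # Prepend the empty word so that source words can align to position 0.
    corpus = [(list(fs), ["NULL"] + list(es)) for fs, es in corpus]
    src_vocab = {f for fs, _ in corpus for f in fs}
    uniform = 1.0 / len(src_vocab)
    p = defaultdict(lambda: uniform)          # uniform initialization

    for _ in range(iterations):
        count = defaultdict(float)            # Count(f, e), reset each iteration
        total = defaultdict(float)            # sum over f of Count(f, e)
        # Accumulation (over corpus)
        for fs, es in corpus:
            for f in fs:
                norm = sum(p[(f, e)] for e in es)
                for e in es:
                    c = p[(f, e)] / norm      # expected count of the link (f, e)
                    count[(f, e)] += c
                    total[e] += c
        # Re-estimate probabilities (over count table)
        p = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return p

toy_corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
p = train_model1(toy_corpus, iterations=10)
```

After training, `p[("das", "the")]` is larger than `p[("Haus", "the")]`, because "das" co-occurs with "the" in two sentence pairs while "Haus" does so in only one.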

  15. IBM Model 2 • The only difference from Model 1 is in the alignment probability • Length: source length only dependent on target length • Alignment: target position depends on the source position (in addition to the source length and target length): $p(a_j \mid j, J, I)$ • Model 1 is a special case of Model 2, where $p(a_j \mid j, J, I) = \frac{1}{I+1}$ • Lexicon: source word only dependent on aligned word

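  16. IBM Model 2 – Generative Process To generate a French string $f_1^J$ from an English string $e_1^I$: • Step 1: Pick the length $J$ of $f_1^J$ • All lengths are equally probable; $p(J \mid I) = \epsilon$ is a constant • Step 2: Pick an alignment $a_1^J$ with probability $\prod_{j=1}^{J} p(a_j \mid j, J, I)$ • Step 3: Pick the French words with probability $\prod_{j=1}^{J} p(f_j \mid e_{a_j})$ • Final Result: $\Pr(f_1^J \mid e_1^I) = \epsilon \sum_{a_1^J} \prod_{j=1}^{J} p(a_j \mid j, J, I) \, p(f_j \mid e_{a_j})$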
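As with Model 1, the Model 2 sum over alignments factorizes position by position, giving $\epsilon \prod_j \sum_{i=0}^{I} p(i \mid j, J, I) \, p(f_j \mid e_i)$. The sketch below computes this and shows that plugging in a uniform alignment distribution recovers the Model 1 value. The names `model2_likelihood`, `uniform_alignment` and `diagonal_alignment`, the callable interface `align(i, j, J, I)`, and the toy lexicon are hypothetical choices for illustration.

```python
def model2_likelihood(f_words, e_words, lex, align, epsilon=1.0):
    """Pr(f | e) under IBM Model 2 in factorized form.

    lex:   dict mapping (f, e) to p(f | e)
    align: callable align(i, j, J, I) returning p(i | j, J, I), with i in 0..I
    """
    e_ext = ["NULL"] + list(e_words)          # position 0 is the empty word
    J, I = len(f_words), len(e_words)
    prob = epsilon
    for j, f in enumerate(f_words, start=1):
        prob *= sum(align(i, j, J, I) * lex.get((f, e_ext[i]), 0.0)
                    for i in range(I + 1))
    return prob

def uniform_alignment(i, j, J, I):
    """Model 1 as a special case: every target position is equally likely."""
    return 1.0 / (I + 1)

def diagonal_alignment(i, j, J, I):
    """Made-up distribution that prefers i == j; normalized over i = 0..I."""
    weights = [0.1] + [1.0 if k == j else 0.2 for k in range(1, I + 1)]
    return weights[i] / sum(weights)

lex = {
    ("la", "NULL"): 0.5, ("maison", "NULL"): 0.5,
    ("la", "the"): 0.9, ("maison", "the"): 0.1,
    ("la", "house"): 0.2, ("maison", "house"): 0.8,
}
f, e = ["la", "maison"], ["the", "house"]
as_model1 = model2_likelihood(f, e, lex, uniform_alignment)
diagonal = model2_likelihood(f, e, lex, diagonal_alignment)
```

With this toy lexicon, the diagonal-preferring distribution assigns the sentence pair a higher probability than the uniform one, since the likely word pairs sit on the diagonal.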

  17. IBM Model 2 – Training • Parameters of the model: $p(f \mid e)$ and $p(i \mid j, J, I)$ • Training data: parallel sentence pairs • We maximize the likelihood of the training data w.r.t. the translation and alignment parameters • EM algorithm used for the estimation • Initialize alignment parameters uniformly, and translation probabilities from Model 1 • Accumulate counts, re-estimate parameters • Model is simple enough to compute over all alignments

  18. Fertility-based Alignment Models • Models 3-5 are based on fertility • Fertility: number of source words connected with a target word • $\phi_i$ : fertility value of $e_i$ • $n(\phi \mid e_i)$ = probability that $e_i$ is connected with $\phi$ source words • Alignment: defined in the reverse direction (target to source) • $d(j \mid i, J, I)$ = probability of French position $j$ given that the English position is $i$

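  19. IBM Model 3 – Generative Process To generate a French string $f_1^J$ from an English string $e_0^I$: • Step 1: Choose $(I+1)$ fertilities $\phi_0, \phi_1, \ldots, \phi_I$ with probability $\Pr(\phi_0 \mid \phi_1, \ldots, \phi_I) \prod_{i=1}^{I} n(\phi_i \mid e_i)$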

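  20. IBM Model 3 – Generative Process • Step 2: For each $e_i$, for $k = 1 \ldots \phi_i$, choose a position $j$ in $1 \ldots J$ and a French word $f_j$ with probability $d(j \mid i, J, I) \, p(f_j \mid e_i)$ • For a given alignment, there are $\prod_{i} \phi_i!$ orderings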

  21. IBM Model 3 – Example [Knight 99]

      [e]                           e0    Mary  did  not  slap  the  green  witch
      [choose fertility]                  1     0    1    3     1    1      1
                                    Mary  not  slap  slap  slap  the  green  witch
      [fertility for e0]            Mary  not  slap  slap  slap  NULL  the  green  witch
      [choose translation]          Mary  no   daba  una   bofetada  a  la  verde  bruja
      [choose target positions j]   Mary  no   daba  una   bofetada  a  la  bruja  verde
                                    1     2    3     4     5         6  7   8      9
      [aj]                          1     3    4     4     4         0  5   7      6
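The fertilities in this example can be read directly off the alignment vector: $\phi_i$ is the number of source positions $j$ with $a_j = i$. A minimal sketch follows; the helper name `fertilities` is an illustrative assumption.

```python
from collections import Counter

def fertilities(alignment, target_length):
    """Return [phi_0, ..., phi_I] for an alignment given as [a_1, ..., a_J].

    Each a_j is a target position in 0..I, where 0 is the empty word e_0.
    """
    counts = Counter(alignment)
    return [counts.get(i, 0) for i in range(target_length + 1)]

# e_0 .. e_7 = NULL Mary did not slap the green witch
# a_1 .. a_9 for: Mary no daba una bofetada a la bruja verde
a = [1, 3, 4, 4, 4, 0, 5, 7, 6]
phi = fertilities(a, target_length=7)
# phi == [1, 1, 0, 1, 3, 1, 1, 1]
```

The first entry is the fertility of the empty word (it generates "a"), "did" has fertility 0, and "slap" has fertility 3. The fertilities always sum to the source length $J$, here 9.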

  22. IBM Model 3 – Training • Parameters of the model: fertility probabilities $n$, distortion probabilities $d$, null-word parameter $p$, and translation probabilities • EM algorithm used for the estimation • Not possible to compute exact EM updates • Initialize n, d, p uniformly, and translation probabilities from Model 2 • Accumulate counts, re-estimate parameters • Cannot efficiently compute over all alignments • Only the Viterbi alignment is used • Model 3 is deficient • Probability mass is wasted on impossible translations

  23. IBM Model 4 • Tries to model re-ordering of phrases • $d(j \mid i, J, I)$ is replaced with two sets of parameters: • One for placing the first word (head) of a group of words • One for placing the rest of the words relative to the head • Deficient • Alignment can generate source positions outside of sentence length $J$ • Model 5 removes this deficiency

  24. HMM Alignment Model • Idea: relative position model [Figure: alignment plot with axes labelled Target and Source, from Vogel 96]

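  25. HMM Alignment • First order model: target position dependent on previous target position (captures movement of entire phrases) • Alignment probability: $\Pr(f_1^J \mid e_1^I) = \sum_{a_1^J} \prod_{j=1}^{J} p(a_j \mid a_{j-1}, I) \, p(f_j \mid e_{a_j})$ • Alignment depends on relative position: $p(i \mid i', I) = \frac{c(i - i')}{\sum_{i''=1}^{I} c(i'' - i')}$ • Maximum approximation: $\Pr(f_1^J \mid e_1^I) \approx \max_{a_1^J} \prod_{j=1}^{J} p(a_j \mid a_{j-1}, I) \, p(f_j \mid e_{a_j})$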
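Under the maximum approximation, the best alignment for a first-order model can be found with the Viterbi algorithm, where each transition score depends only on the jump width $i - i'$. The sketch below illustrates this under several assumptions that are not taken from the slides: the function name `viterbi_alignment`, a dict `lex` of lexical probabilities, a dict `jump` of non-negative jump weights $c(i - i')$, a uniform distribution for the first position, a small `floor` value for unseen events, and no empty word. It multiplies raw probabilities, so long sentences would need log-space arithmetic.

```python
def viterbi_alignment(f_words, e_words, lex, jump, floor=1e-12):
    """Best alignment a_1..a_J (1-based target positions) for a first-order model.

    lex:  dict mapping (f, e) to p(f | e)
    jump: dict mapping a jump width (i - i_prev) to a non-negative weight c(.)
    """
    I, J = len(e_words), len(f_words)

    def trans(i_prev, i):
        # p(i | i_prev, I) = c(i - i_prev) / sum over i'' of c(i'' - i_prev)
        norm = sum(jump.get(k - i_prev, floor) for k in range(I))
        return jump.get(i - i_prev, floor) / norm

    def emit(j, i):
        return lex.get((f_words[j], e_words[i]), floor)

    # delta[i]: score of the best partial alignment that ends in target position i
    delta = [emit(0, i) / I for i in range(I)]
    back = []                                 # back[j-1][i]: best predecessor of i
    for j in range(1, J):
        pointers = [max(range(I), key=lambda k: delta[k] * trans(k, i))
                    for i in range(I)]
        delta = [delta[pointers[i]] * trans(pointers[i], i) * emit(j, i)
                 for i in range(I)]
        back.append(pointers)

    # Trace the best path backwards from the best final position.
    i = max(range(I), key=lambda k: delta[k])
    path = [i]
    for pointers in reversed(back):
        i = pointers[i]
        path.append(i)
    return [pos + 1 for pos in reversed(path)]

lex = {("das", "the"): 0.9, ("Haus", "house"): 0.9,
       ("ist", "is"): 0.9, ("klein", "small"): 0.9}
jump = {-1: 0.1, 0: 0.2, 1: 0.6, 2: 0.1}
best = viterbi_alignment(["das", "Haus", "ist", "klein"],
                         ["the", "house", "is", "small"], lex, jump)
# best == [1, 2, 3, 4]
```

Because the jump weights depend only on the distance between consecutive target positions, a phrase that moves as a block pays the large-jump cost once rather than once per word.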

  26. IBM2 vs HMM [Vogel 96]

  27. Enhancements to HMM & IBM Models • HMM model with empty word • Adding I empty words to the target side • Model 6 • IBM 4: predicts distance between subsequent target positions • HMM: predicts distance between subsequent source positions • Model 6: a log-linear combination of the IBM 4 and HMM models • Smoothing • Alignment prob. – interpolate with uniform distribution • Fertility prob. – depends on the number of letters in a word • Symmetrization • Heuristic postprocessing to combine alignments in both directions
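Symmetrization can be illustrated with sets of alignment links. The sketch below takes the links produced by training in each direction and returns their intersection, their union, and a "grown" set that starts from the intersection and adds neighbouring links from the union. The growing rule is a simplified illustration and is not claimed to match any published heuristic exactly; the function name `symmetrize` and the `(j, i)` link representation are assumptions.

```python
def symmetrize(f2e_links, e2f_links):
    """Combine two directional alignments given as sets of (j, i) links.

    j is a source position and i is a target position.
    Returns (intersection, union, grown).
    """
    inter = f2e_links & e2f_links             # high precision, low recall
    union = f2e_links | e2f_links             # high recall, lower precision
    grown = set(inter)
    added = True
    while added:
        added = False
        for (j, i) in sorted(union - grown):
            neighbours = {(j - 1, i), (j + 1, i), (j, i - 1), (j, i + 1)}
            j_unaligned = all(jj != j for jj, _ in grown)
            i_unaligned = all(ii != i for _, ii in grown)
            # Add a union link only if it touches an accepted link and
            # covers a source or target word that is still unaligned.
            if neighbours & grown and (j_unaligned or i_unaligned):
                grown.add((j, i))
                added = True
    return inter, union, grown

f2e = {(1, 1), (2, 2), (3, 3)}
e2f = {(1, 1), (2, 2), (3, 2), (4, 4)}
inter, union, grown = symmetrize(f2e, e2f)
# inter == {(1, 1), (2, 2)}
# grown == {(1, 1), (2, 2), (3, 2), (3, 3)}
```

The grown set always lies between the intersection and the union, so it trades some of the intersection's precision for recall. The isolated link (4, 4) is left out here because it has no accepted neighbour.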

  28. Experimental Results [Och and Ney 03] • Refined models perform better • Models 4, 5, 6 better than Model 1 or the Dice coefficient model • HMM better than IBM 2 • Alignment quality depends on the training method and bootstrap scheme used • IBM 1->HMM->IBM 3 better than IBM 1->IBM 2->IBM 3 • Smoothing and symmetrization have a significant effect on alignment quality • Using more alignments in training yields better results • Using word classes • Improvement for large corpora but not for small corpora

  29. References: • Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, Robert L. Mercer (1993). The Mathematics of Statistical Machine Translation. Computational Linguistics, vol. 19, no. 2. • Stephan Vogel, Hermann Ney, Christoph Tillmann (1996). HMM-based Word Alignment in Statistical Translation. COLING, The 16th Int. Conf. on Computational Linguistics, Copenhagen, Denmark, August, pp. 836-841. • Franz Josef Och, Hermann Ney (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, vol. 29, no. 1, pp. 19-51. • Kevin Knight (1999). A Statistical MT Tutorial Workbook. Available at http://www.isi.edu/natural-language/mt/wkbk.rtf.
