
Bayesian Word Alignment for Statistical Machine Translation

Bayesian Word Alignment for Statistical Machine Translation. Authors: Coskun Mermer, Murat Saraclar. Presented by Jun Lang, 2011-10-13, I2R SMT-Reading Group. Paper info: ACL 2011 short paper, with source code in Perl (379 lines).




Presentation Transcript


  1. Bayesian Word Alignment for Statistical Machine Translation • Authors: Coskun Mermer, Murat Saraclar • Presented by Jun Lang • 2011-10-13 • I2R SMT-Reading Group

  2. Paper info • Bayesian Word Alignment for Statistical Machine Translation • ACL 2011 short paper • Source code available in Perl (379 lines) • Authors • Coskun Mermer • Murat Saraclar

  3. Core Idea • Proposes a Gibbs sampler for fully Bayesian inference in IBM Model 1 • Result • Outperforms classical EM by up to 2.99 BLEU points • Effectively addresses the rare word problem • Yields a much smaller phrase table than EM

  4. Mathematics • (E, F): parallel corpus • e_i, f_j: the i-th source word and j-th target word of a sentence pair (e, f) in the corpus (E, F), where e has I words and f has J words • e_0: the "null" word, included in every source sentence e • V_E (V_F): size of the source (target) vocabulary • a (A): alignment for a sentence pair (the corpus) • a_j: target word f_j is aligned to source word e_{a_j} • T: parameter table of size V_E × V_F • t_{e,f} = P(f|e): word translation probability

  5. IBM Model 1 • T as a random variable
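
The equation on this slide did not survive extraction. As a reminder, the standard IBM Model 1 likelihood for one sentence pair, written in the notation of slide 4, can be sketched as follows. The constant ε (the sentence-length probability from the original IBM formulation) is an addition here and may not match what the slide showed.

```latex
P(\mathbf{f}, \mathbf{a} \mid \mathbf{e}; T)
  = \frac{\epsilon}{(I+1)^{J}} \prod_{j=1}^{J} t_{e_{a_j}, f_j}
```

Treating T as a random variable means placing a prior distribution on T instead of estimating a single point value for it, as EM does.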

  6. Dirichlet Distribution • Each row of T = {t_{e,f}} parameterizes a multinomial distribution, which belongs to the exponential family • We choose its conjugate prior, the Dirichlet distribution, for computational convenience

  7. Dirichlet Distribution • For each source word type e, t_e is a distribution over the target vocabulary and is given a Dirichlet prior • The prior helps avoid rare words acting as "garbage collectors"
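
The prior and its conjugacy can be written out as below. This is a sketch with assumed symbols: θ_{e,f} for the Dirichlet hyperparameters and N_{e,f} for the number of times target word f is aligned to source word e under the current alignment A. Neither symbol appears in the surviving slide text.

```latex
t_e \sim \mathrm{Dirichlet}(\theta_{e,1}, \dots, \theta_{e,V_F}),
\qquad
P(t_e) \propto \prod_{f=1}^{V_F} t_{e,f}^{\,\theta_{e,f} - 1}

t_e \mid A \;\sim\; \mathrm{Dirichlet}(\theta_{e,1} + N_{e,1}, \dots, \theta_{e,V_F} + N_{e,V_F})
```

Because the posterior is again a Dirichlet, updating it after observing alignment links only requires adding counts to the hyperparameters, which is the computational convenience mentioned on slide 6.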

  8. Dirichlet Distribution • Sample the unknowns A and T in turn • ¬j denotes the exclusion of the current value of a_j
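
The sampling formula itself is missing from the transcript. The ¬j notation suggests a collapsed update in which T is integrated out, so only the alignments are sampled. A reconstruction under that assumption is given below, with assumed symbols: Θ = {θ_{e,f}} for the Dirichlet hyperparameters and N^{¬j}_{e,f} for the count of links between e and f in A excluding the current a_j.

```latex
P(a_j = i \mid E, F, A^{\neg j}; \Theta)
  \;\propto\;
  \frac{N^{\neg j}_{e_i, f_j} + \theta_{e_i, f_j}}
       {\sum_{f=1}^{V_F} \left( N^{\neg j}_{e_i, f} + \theta_{e_i, f} \right)}
```

The numerator rewards source words already linked to f_j elsewhere in the corpus, and the denominator normalizes by how many links e_i has in total.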

  9. Algorithm • The initial alignment A can be arbitrary, but initializing from the output of standard EM works better
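
The algorithm listing is not in the transcript, so the following is only a minimal Python sketch of a collapsed Gibbs sampler for IBM Model 1, not the authors' bayesalign.pl. It assumes a symmetric Dirichlet prior with a single hyperparameter `theta`, random initialization of A, and that each a_j is resampled in proportion to (link count + theta) / (source word total + theta · V_F), with counts excluding the current a_j. The names `gibbs_align`, `NULL`, `theta`, and the default values are all hypothetical.

```python
import random
from collections import defaultdict

NULL = "<null>"  # hypothetical token standing in for the null word e_0


def gibbs_align(corpus, theta=0.01, iterations=50, seed=0):
    """Sketch of a collapsed Gibbs sampler for IBM Model 1.

    corpus: list of (source_words, target_words) pairs.
    theta: symmetric Dirichlet hyperparameter (arbitrary default).
    Returns the last alignment sample: one list per sentence pair, where
    a[j] indexes into [NULL] + source_words for target word j.
    """
    rng = random.Random(seed)
    pairs = [([NULL] + list(e), list(f)) for e, f in corpus]
    vocab_f_size = len({w for _, f in pairs for w in f})

    link_count = defaultdict(int)  # N[e, f]: links between e and f
    src_total = defaultdict(int)   # sum over f of N[e, f]

    # Arbitrary initialization of A (uniformly random here).
    alignments = []
    for e, f in pairs:
        a = [rng.randrange(len(e)) for _ in f]
        for j, i in enumerate(a):
            link_count[e[i], f[j]] += 1
            src_total[e[i]] += 1
        alignments.append(a)

    for _ in range(iterations):
        for (e, f), a in zip(pairs, alignments):
            for j, fj in enumerate(f):
                # Remove the current link so the counts exclude a_j.
                old = e[a[j]]
                link_count[old, fj] -= 1
                src_total[old] -= 1
                # Unnormalized conditional for every candidate source position.
                weights = [
                    (link_count[ei, fj] + theta)
                    / (src_total[ei] + theta * vocab_f_size)
                    for ei in e
                ]
                a[j] = rng.choices(range(len(e)), weights=weights)[0]
                # Add the newly sampled link back into the counts.
                new = e[a[j]]
                link_count[new, fj] += 1
                src_total[new] += 1
    return alignments
```

A call such as `gibbs_align([("the house".split(), "das haus".split())])` returns one index list per sentence pair, where index 0 means the target word is aligned to the null word. To start from EM output instead, the random initialization loop would be replaced by alignments loaded from an EM-trained aligner.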

  10. Results

  11. Code View • bayesalign.pl

  12. Conclusions • Outperforms classical EM by up to 2.99 BLEU points • Effectively addresses the rare word problem • Yields a much smaller phrase table than EM • Shortcomings • Too slow: 100 sentence pairs take 18 minutes • Could possibly be sped up with parallel computing

