Language Model Adaptation in Machine Translation from Speech
Ivan Bulyko, Spyros Matsoukas, Richard Schwartz, Long Nguyen, and John Makhoul
Why do we need LM adaptation in MT?
• Model genre/style variations
• LM training data sources are not homogeneous
• The largest corpora (e.g. Gigaword), which may not be the most relevant, dominate when n-gram counts are merged
• Style/topics vary depending on:
  • source of data (e.g. publisher)
  • epoch
  • original medium (newswire vs. broadcast)
  • processing (data created in English vs. translated from other languages)
Types of Adaptation Studied Here
• LM interpolation: combine LMs by interpolating their probabilities, with weights estimated by:
  • minimizing perplexity on a tuning set (supervised)
  • minimizing perplexity on the 1-best output (unsupervised)
  • optimizing an MT performance criterion (TER or BLEU) directly, using tuning-set n-best lists (discriminative)
• Log-linear combination of scores from different LMs (in symbols below)
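In symbols, the two combination schemes (the \lambda_i are the component weights; in the linear case they are constrained to sum to one):

    P(w \mid h) = \sum_i \lambda_i P_i(w \mid h)              (linear interpolation)
    P(w \mid h) \propto \prod_i P_i(w \mid h)^{\lambda_i}     (log-linear combination, used as an unnormalized score when rescoring)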
LM Adaptation via Linear Interpolation
• Build separate LMs from each training corpus
• Choose a tuning set similar in topics/style to the test material
• Interpolate the corpus LMs using weights estimated by minimizing perplexity on the tuning set (an EM sketch follows below)
[Diagram: Corpus 1 … Corpus N → Estimate LM → Corpus LM 1 … Corpus LM N; corpus LMs + tuning set → Interpolate LMs → Interpolated LM]
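The perplexity-minimizing weights are typically found with EM. A minimal sketch, assuming the per-token probabilities from each component LM have already been computed on the tuning set (this is the standard recipe, not the authors' code):

    import math

    def estimate_weights(component_probs, iters=50, tol=1e-6):
        # component_probs[i][t]: probability that component LM i assigns
        # to tuning-set token t. Returns mixture weights that (locally)
        # minimize tuning-set perplexity.
        n_lms, n_tok = len(component_probs), len(component_probs[0])
        lam = [1.0 / n_lms] * n_lms            # start from uniform weights
        prev_ll = -math.inf
        for _ in range(iters):
            counts = [0.0] * n_lms
            ll = 0.0
            for t in range(n_tok):
                mix = sum(lam[i] * component_probs[i][t] for i in range(n_lms))
                ll += math.log(mix)
                for i in range(n_lms):         # E-step: posterior of each component
                    counts[i] += lam[i] * component_probs[i][t] / mix
            lam = [c / n_tok for c in counts]  # M-step: re-estimate weights
            if ll - prev_ll < tol:             # tuning-set log-likelihood converged
                break
            prev_ll = ll
        return lam

    # Toy usage: two LMs scored on a 4-token tuning set
    p1, p2 = [0.20, 0.05, 0.10, 0.30], [0.02, 0.15, 0.25, 0.01]
    lam = estimate_weights([p1, p2])
    mix = [lam[0] * a + lam[1] * b for a, b in zip(p1, p2)]
    ppl = math.exp(-sum(math.log(p) for p in mix) / len(mix))

Each EM step provably keeps or lowers the tuning-set perplexity, which is why this converges without any learning rate.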
Unsupervised Adaptation
Input: one document or the entire test set
• Use the unadapted 3-gram LM to decode the input, either one document at a time or the entire test set
• Produce n-best lists and 1-best hypotheses
• Adapt the 5-gram LM by optimizing the LM interpolation weights on the 1-best output
• Rescore the n-best lists with the adapted 5-gram LM
[Diagram: Unadapted 3-gram LM → MT decoder → n-best lists and 1-best hypotheses → LM adaptation → Adapted 5-gram LM → Rescore → Output: translations]
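Putting the loop together, a hedged sketch (the decoder and LM interfaces translate_nbest, prob, and hyp.tokens are hypothetical placeholders, not BBN's API; estimate_weights is the EM sketch above, and n-gram context handling is elided):

    import math

    def adapted_logprob(hyp, lms, lam):
        # Hypothesis log-probability under the interpolated LM
        return sum(math.log(sum(l * lm.prob(tok) for l, lm in zip(lam, lms)))
                   for tok in hyp.tokens)

    def unsupervised_adapt(documents, decoder, lms, per_document=True):
        # Decode with the unadapted LM, re-estimate mixture weights on the
        # 1-best output, then rescore the n-best lists with the adapted LM.
        units = documents if per_document else [[s for d in documents for s in d]]
        output = []
        for unit in units:
            nbests = [decoder.translate_nbest(sent, n=300) for sent in unit]
            one_best = [nb[0] for nb in nbests]
            probs = [[lm.prob(tok) for hyp in one_best for tok in hyp.tokens]
                     for lm in lms]
            lam = estimate_weights(probs)      # EM on the 1-best hypotheses
            output += [max(nb, key=lambda h: adapted_logprob(h, lms, lam))
                       for nb in nbests]
        return output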
Discriminative Adaptation
• Hill-climbing optimization (maxBLEU or minTER) using tuning-set n-best lists (sketched below)
• Log-linear space: treat each LM component as an independent knowledge source
• Probability space: identical to standard LM interpolation
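A hedged sketch of the hill climbing, written as greedy coordinate search over the weight vector; the metric and n-best interfaces are illustrative, and adapted_logprob is reused from the sketch above:

    import itertools

    def hill_climb_weights(nbests, refs, lms, metric, rounds=20):
        # metric(hyps, refs) returns the corpus-level score to maximize
        # (BLEU as-is, TER negated).
        lam = [1.0 / len(lms)] * len(lms)

        def score(weights):
            picked = [max(nb, key=lambda h: adapted_logprob(h, lms, weights))
                      for nb in nbests]
            return metric(picked, refs)

        best = score(lam)
        for _ in range(rounds):
            improved = False
            for i, step in itertools.product(range(len(lms)), (0.1, -0.1)):
                cand = list(lam)
                cand[i] = max(1e-4, cand[i] + step)  # perturb one weight
                s = score(cand)
                if s > best:
                    lam, best, improved = cand, s, True
            if not improved:                         # local optimum reached
                break
        return lam

In probability space the returned weights would be renormalized to sum to one, recovering standard interpolation; in log-linear space they are left free, so each LM acts as an independent knowledge source.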
Evaluation Task
• Translation from Arabic speech to English text
• Broadcast News (BN) and Broadcast Conversations (BC), around 30K words in each genre
• BN and BC are transcribed/translated jointly but scored separately
• Translation performance is reported on both reference transcriptions and STT output
• Two scoring metrics: BLEU and TER (translation edit rate; similar to WER, but it allows phrase shifts, as in the example below)
• Tuning:
  • tuning set (BNC-tune) similar to the test set in epoch and sources
  • the MT system is optimized using reference transcripts
  • two sets of weights are computed with different optimization criteria: minTER and maxBLEU
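To make the TER/WER distinction concrete, a toy example (not from the paper):

    ref: the talks resumed in cairo today
    hyp: today the talks resumed in cairo

Word-by-word alignment (WER-style) counts an insertion and a deletion of "today": 2 edits / 6 reference words ≈ 0.33. TER instead counts moving the phrase "today" as a single shift edit: 1/6 ≈ 0.17.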
Experimental Setup
• BBN Arabic STT system
  • 1300 hours of acoustic training (SCTM, MPFE, SAT)
  • 1B words of LM training (2- and 3-gram decoding, 4-gram rescoring)
• BBN MT engine
  • 140M words of Arabic-English parallel text
  • 6B words of English LM training data
  • phrase translations obtained by running GIZA++
  • phrase translations generalized using POS classes
• Features used by the decoder:
  • backward and forward translation probabilities
  • pruned 3-gram LM score
  • phrase reordering penalty
  • phrase segmentation score
  • word insertion penalty
• N-best lists (n = 300) rescored with an unpruned 5-gram LM
English LM Training
• English training texts totaling around 6B words:
  • Gigaword v2
  • news articles from on-line archives
  • UW Web corpus: web text in a conversation-like style
  • CNN talk-show transcripts from CNN.com
  • news articles from a variety of on-line publishers, downloaded daily (02/2005–02/2006)
  • English side of the news portion of the parallel data
• 5-gram Kneser-Ney LM, without pruning
• 4.3B n-grams
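At toy scale, the smoothing choice looks like this with NLTK's interpolated Kneser-Ney (an illustration only; a 4.3B-n-gram model requires dedicated LM tooling, and the slides do not say which was used):

    from nltk.lm import KneserNeyInterpolated
    from nltk.lm.preprocessing import padded_everygram_pipeline

    sents = [["the", "talks", "resumed", "in", "cairo"],
             ["the", "markets", "fell", "in", "new", "york"]]
    order = 3  # the paper's LM is a 5-gram; 3 keeps the toy data meaningful
    train, vocab = padded_everygram_pipeline(order, sents)
    lm = KneserNeyInterpolated(order)
    lm.fit(train, vocab)
    print(lm.score("resumed", ("the", "talks")))  # P(resumed | the talks)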
LM Component Weights after Adaptation
[Table: interpolation weights assigned to each LM component by the different adaptation methods]
* The log-linear combination weights are normalized for ease of comparison.
Conclusions and Future Work
• LM adaptation improves MT performance from speech by up to half a point in TER or BLEU
• Discriminative adaptation gave the largest gains
• Gains from unsupervised adaptation diminish as WER increases
• Unsupervised adaptation at the document level did not outperform adaptation on the full test set
• A larger impact from adaptation is likely if:
  • both the decoding and rescoring LMs are adapted, especially if the decoding LM is pruned
  • unsupervised adaptation is performed on groups of similar documents
Related Work at BBN
• S. Matsoukas, I. Bulyko, B. Xiang, R. Schwartz, and J. Makhoul, "Integrating speech recognition and machine translation," in Proc. ICASSP, 2007. (lecture tomorrow, 3:45pm)
• B. Xiang, J. Xu, R. Bock, I. Bulyko, J. Maguire, S. Matsoukas, A. Rosti, R. Schwartz, R. Weischedel, and J. Makhoul, "The BBN machine translation system for the NIST 2006 MT evaluation," presentation at the NIST MT06 Workshop.
• J. Ma and S. Matsoukas, "Unsupervised training on a large amount of Arabic broadcast news data," in Proc. ICASSP, 2007. (poster this morning)
• M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul, "A study of translation edit rate with targeted human annotation," in Proc. AMTA, 2006.