Language Model Adaptation in Machine Translation from Speech
Ivan Bulyko, Spyros Matsoukas, Richard Schwartz, Long Nguyen, and John Makhoul
Why is LM adaptation needed in MT?
• Model genre/style variations
• LM training data sources are not homogeneous
  • The largest corpora (e.g. Gigaword), which may not be the most relevant, dominate when n-gram counts are merged
  • Style/topics vary depending on
    • source of data (i.e. publisher)
    • epoch
    • original medium (newswire vs. broadcast)
    • processing (data created in English vs. translated from other languages)
Types of Adaptation Studied Here
• LM interpolation: combine LMs by interpolating their probabilities, with weights estimated by
  • Minimizing perplexity on a tuning set (supervised)
  • Minimizing perplexity on 1-best MT output (unsupervised)
  • Optimizing MT performance criteria (TER or BLEU) directly using tuning-set n-best lists (discriminative)
• Log-linear combination of scores from different LMs (contrasted with interpolation in the sketch below)
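The two combination schemes differ in whether probabilities or log scores are mixed. The minimal Python sketch below (not from the paper; all probabilities are made-up toy values) contrasts them:

```python
import math

# Toy per-word probabilities assigned by two component LMs to the same word
# in the same context (hypothetical values, for illustration only).
p_news = 0.020   # P(w | h) under a newswire LM
p_conv = 0.004   # P(w | h) under a conversational LM

# Linear interpolation: probabilities are mixed, weights sum to 1.
lambdas = [0.7, 0.3]
p_interp = lambdas[0] * p_news + lambdas[1] * p_conv

# Log-linear combination: log scores are weighted and summed; the result is
# an unnormalized score, which is acceptable when each LM is just one feature
# among several in the decoder's log-linear model.
weights = [0.7, 0.3]
score_loglin = weights[0] * math.log(p_news) + weights[1] * math.log(p_conv)

print(f"interpolated probability: {p_interp:.4f}")
print(f"log-linear score:         {score_loglin:.4f}")
```

In probability space the weights form a proper mixture distribution; in log-linear space each LM behaves as an independent knowledge source whose weight can be tuned jointly with the decoder's other features.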
LM Adaptation via Linear Interpolation
• Build separate LMs from each training corpus
• Choose a tuning set similar in topics/style to the test material
• Interpolate the corpus LMs using weights estimated by minimizing perplexity on the tuning set (see the sketch below)
[Diagram: Corpus 1 … Corpus N → estimate per-corpus LMs → Corpus LM 1 … Corpus LM N → interpolate with weights tuned on the tuning set → interpolated LM]
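Interpolation weights of this kind are commonly estimated with EM, which (locally) maximizes tuning-set likelihood and hence minimizes perplexity under the mixture. The sketch below is a generic version of that procedure, not the authors' implementation; `component_probs` holds the per-word probabilities each component LM assigns to the tuning-set words, and the values at the bottom are toy numbers.

```python
def em_interpolation_weights(component_probs, iters=50):
    """Estimate mixture weights for linearly interpolated LMs.

    component_probs: one sequence of per-word probabilities per component LM,
    all aligned to the same tuning-set words. Returns weights that (locally)
    maximize tuning-set likelihood, i.e. minimize perplexity of the mixture.
    """
    n_lms = len(component_probs)
    n_words = len(component_probs[0])
    weights = [1.0 / n_lms] * n_lms          # start from uniform weights
    for _ in range(iters):
        # E-step: posterior probability that each word was generated by each
        # component LM, under the current weights.
        expected = [0.0] * n_lms
        for t in range(n_words):
            mix = sum(weights[k] * component_probs[k][t] for k in range(n_lms))
            for k in range(n_lms):
                expected[k] += weights[k] * component_probs[k][t] / mix
        # M-step: new weight = normalized expected count.
        weights = [e / n_words for e in expected]
    return weights

# Toy example: word probabilities from two component LMs on a 4-word tuning set.
probs_lm1 = [0.020, 0.001, 0.030, 0.004]
probs_lm2 = [0.005, 0.010, 0.002, 0.020]
print(em_interpolation_weights([probs_lm1, probs_lm2]))
```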
Unsupervised Adaptation
Input: one document or the entire test set
• Use the unadapted 3-gram LM to decode the input, either one document at a time or the entire test set
• Produce n-best lists and 1-best hypotheses
• Adapt the 5-gram LM by optimizing LM interpolation weights on the 1-best output
• Rescore the n-best lists with the adapted 5-gram LM (see the rescoring sketch below)
[Diagram: unadapted 3-gram LM → MT decoder → n-best lists and 1-best hypotheses → LM adaptation → adapted 5-gram LM → rescore n-best lists → output translations]
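The final rescoring step can be pictured as re-ranking each segment's n-best list after adding the adapted LM's score to the decoder score. The sketch below is a simplification, not the BBN rescorer: `adapted_lm_score` is a hypothetical callable, and all scores are invented.

```python
def rescore_nbest(nbest, adapted_lm_score, lm_weight=0.5):
    """Pick the best hypothesis for one segment after adding an adapted-LM score.

    nbest: list of (hypothesis_text, decoder_score) pairs, where decoder_score
    already bundles the other decoder features.
    adapted_lm_score: callable returning a log-probability for a hypothesis.
    """
    best_hyp, best_total = None, float("-inf")
    for hyp, decoder_score in nbest:
        total = decoder_score + lm_weight * adapted_lm_score(hyp)
        if total > best_total:
            best_hyp, best_total = hyp, total
    return best_hyp

# Toy usage with a stand-in adapted LM (hypothetical numbers).
toy_lm = {"the meeting ended today": -12.0, "the meeting finished today": -14.5}
nbest = [("the meeting finished today", -3.0), ("the meeting ended today", -3.2)]
print(rescore_nbest(nbest, lambda h: toy_lm[h]))
```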
Discriminative Adaptation
• Hill-climbing optimization (maxBLEU or minTER) of the LM weights using tuning-set n-best lists (see the sketch below)
• Log-linear space: treat each LM component as an independent knowledge source
• Probability space: identical to standard LM interpolation
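A rough illustration of the discriminative variant, under strong simplifications: a one-dimensional hill climb (here reduced to a grid search) over an extra LM weight, choosing the value that minimizes a corpus-level error count obtained by re-ranking tuning-set n-best lists. The error metric below is a crude per-word mismatch count standing in for TER, and every score is a made-up toy value.

```python
def pick_best(nbest, lm_weight):
    """Re-rank one n-best list: decoder score plus weighted extra-LM score."""
    return max(nbest, key=lambda h: h["decoder"] + lm_weight * h["lm"])["text"]

def corpus_errors(nbest_lists, refs, lm_weight):
    """Crude stand-in for TER: count word positions differing from the reference."""
    errors = 0
    for nbest, ref in zip(nbest_lists, refs):
        hyp_words, ref_words = pick_best(nbest, lm_weight).split(), ref.split()
        errors += sum(h != r for h, r in zip(hyp_words, ref_words))
        errors += abs(len(hyp_words) - len(ref_words))
    return errors

def hill_climb(nbest_lists, refs, grid):
    """Pick the LM weight on a grid that minimizes corpus errors (minTER-style)."""
    return min(grid, key=lambda w: corpus_errors(nbest_lists, refs, w))

# Toy tuning set: two segments, two hypotheses each (all numbers invented).
nbest_lists = [
    [{"text": "troops left the city", "decoder": -2.2, "lm": -7.0},
     {"text": "troops leave the city", "decoder": -2.0, "lm": -9.0}],
    [{"text": "talks resumed on monday", "decoder": -1.6, "lm": -6.5},
     {"text": "talks resume on monday", "decoder": -1.4, "lm": -8.0}],
]
refs = ["troops left the city", "talks resumed on monday"]
print(hill_climb(nbest_lists, refs, grid=[0.0, 0.1, 0.2, 0.5, 1.0]))
```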
Evaluation Task
• Translation from Arabic speech to English text
  • Broadcast News (BN) and Broadcast Conversations (BC), around 30K words in each genre
  • BN and BC are transcribed/translated jointly but scored separately
• Translation performance is reported for both reference transcriptions and STT output
• Two scoring metrics are used: BLEU and TER (translation edit rate, similar to WER but allowing phrase shifts)
• Tuning
  • Tuning set (BNC-tune) similar to the test set in epoch and sources
  • The MT system is optimized using reference transcripts
  • Two sets of weights are computed with different optimization criteria: minTER and maxBLEU
Experimental Setup
• BBN Arabic STT system
  • 1300 hours of acoustic training (SCTM, MPFE, SAT)
  • 1B words of LM training (2-/3-gram decoding, 4-gram rescoring)
• BBN MT translation engine
  • 140M words of Arabic-English parallel text
  • 6B words of English LM training data
  • Phrase translations are obtained by running GIZA++
  • Phrase translations are generalized using POS classes
  • Features used by the decoder (see the scoring sketch below) include
    • Backward and forward translation probabilities
    • Pruned 3-gram LM score
    • Penalty for phrase reordering
    • Phrase segmentation score
    • Word insertion penalty
  • N-best (n=300) lists rescored with an unpruned 5-gram LM
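For context, a phrase-based decoder of this kind typically scores each hypothesis as a weighted sum of feature log-scores, such as those listed above. The feature values and weights in this sketch are illustrative only, not the BBN system's.

```python
# Hypothetical feature log-scores for one translation hypothesis.
features = {
    "fwd_translation": -4.2,   # forward phrase translation log-probability
    "bwd_translation": -3.8,   # backward phrase translation log-probability
    "lm_3gram_pruned": -12.1,  # pruned 3-gram LM log-probability
    "reordering": -1.0,        # phrase reordering penalty
    "segmentation": -0.6,      # phrase segmentation score
    "word_penalty": -5.0,      # word insertion penalty (length control)
}

# Illustrative feature weights, normally tuned on a development set.
weights = {
    "fwd_translation": 0.9, "bwd_translation": 0.6, "lm_3gram_pruned": 1.0,
    "reordering": 0.4, "segmentation": 0.3, "word_penalty": 0.2,
}

hypothesis_score = sum(weights[f] * v for f, v in features.items())
print(f"combined hypothesis score: {hypothesis_score:.2f}")
```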
English LM Training
• English training texts totaling around 6B words
  • Gigaword v2
  • News articles from on-line archives
  • UW Web corpus: web text of conversation-like style
  • CNN talk show transcripts from CNN.com
  • News articles from a variety of on-line publishers, downloaded daily (02/2005–02/2006)
  • English side of the news portion of the parallel data
• 5-gram Kneser-Ney LM, without pruning (see the sketch below)
• 4.3B n-grams
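At toy scale, a 5-gram Kneser-Ney model without pruning (i.e. retaining every estimated n-gram) can be sketched with NLTK's language-model package; this is only an illustration of the model type, not the toolkit or data used by the authors.

```python
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

# Tiny stand-in corpus of tokenized sentences; the real model used ~6B words.
sents = [["the", "talks", "resumed", "in", "baghdad"],
         ["the", "meeting", "ended", "in", "cairo"]]

# Build padded 1..5-gram training data and the vocabulary stream.
train_ngrams, vocab = padded_everygram_pipeline(5, sents)

lm = KneserNeyInterpolated(order=5)   # interpolated Kneser-Ney smoothing
lm.fit(train_ngrams, vocab)           # no pruning: all observed n-grams are kept

# Probability of "resumed" given a (short) context.
print(lm.score("resumed", ["the", "talks"]))
```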
LM Component Weights after Adaptation
[Table of per-corpus LM component weights not reproduced here]
* The log-linear combination weights are normalized for ease of comparison.
Conclusions and Future Work
• LM adaptation leads to improvements in MT performance from speech (up to half a point in TER or BLEU)
• Discriminative adaptation gave the largest gains
• Gains from unsupervised adaptation diminish as WER increases
• Unsupervised adaptation at the document level did not outperform adaptation on the full test set
• A larger impact from adaptation is likely if:
  • Both the decoding and rescoring LMs are adapted, especially when the decoding LM is pruned
  • Unsupervised adaptation is performed on groups of similar documents
Related Work at BBN
• S. Matsoukas, I. Bulyko, B. Xiang, R. Schwartz and J. Makhoul, "Integrating speech recognition and machine translation," in Proc. ICASSP, 2007. (lecture tomorrow, 3:45pm)
• B. Xiang, J. Xu, R. Bock, I. Bulyko, J. Maguire, S. Matsoukas, A. Rosti, R. Schwartz, R. Weischedel and J. Makhoul, "The BBN machine translation system for the NIST 2006 MT evaluation," presentation, NIST MT06 Workshop.
• J. Ma and S. Matsoukas, "Unsupervised training on a large amount of Arabic broadcast news data," in Proc. ICASSP, 2007. (poster this morning)
• M. Snover, B. Dorr, R. Schwartz, L. Micciulla and J. Makhoul, "A study of translation edit rate with targeted human annotation," in Proc. AMTA, 2006.