“Applying Morphology Generation Models to Machine Translation” By Kristina Toutanova, Hisami Suzuki, Achim Ruopp (Microsoft Research). UW Machine Translation Reading Group, 19th May 2008
Meta-Motivation
• Machine Translation is a collection of sub-problems: alignment (corpus, sentence, word/phrase), reordering, phrase extraction, language modeling, transliteration, capitalization, etc.
• It's hard to work on just one sub-problem in Machine Translation and have those gains translate (har!) into overall system performance.
• A side goal is to work on independent, portable modules in the MT system.
Motivation
• Many languages use morphological inflection to express agreement, gender, case, etc. English… not so much.
• Inflection shows up in the surface form of a word: prefix + stem + suffix (more or less; let's not talk about infixes and circumfixes).
• Standard difficulty: data sparseness (you see fewer examples of each token).
Morphology in MT
• It's problematic when morphological information in one half of a language pair is not present in the other half.
• Depending on the translation direction, you either have “extra” information that you need to learn to ignore (easy), or you need to generate this extra information somehow (hard).
Too much morphology
• Easy hack: PRE-PROCESS!
• Strip out gender, split compounds, segment clitics; use as much Perl as it takes. (A toy sketch of this kind of pre-processing follows below.)
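This is not from the paper; it is a purely illustrative sketch of the kind of pre-processing the slide alludes to, with made-up clitic and compound rules.

```python
# Toy pre-processing sketch (not from the paper): clitic segmentation and
# greedy compound splitting, with invented clitics and vocabulary.
PROCLITICS = ("wa", "fa", "bi", "li")   # hypothetical transliterated Arabic proclitics

def segment_clitics(token):
    """Split a leading proclitic off a token: 'wakitab' -> 'wa+ kitab'."""
    for clitic in PROCLITICS:
        if token.startswith(clitic) and len(token) > len(clitic) + 2:
            return clitic + "+ " + token[len(clitic):]
    return token

def split_compound(token, vocab):
    """Greedy compound split: 'haustuer' -> 'haus tuer' if both halves are known words."""
    for i in range(3, len(token) - 2):
        if token[:i] in vocab and token[i:] in vocab:
            return token[:i] + " " + token[i:]
    return token

print(segment_clitics("wakitab"))                    # wa+ kitab
print(split_compound("haustuer", {"haus", "tuer"}))  # haus tuer
```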
Not enough morphology
• Current approaches mostly use a rich language model on the target side.
• Downside: this just rescores MT system output; it doesn't actually affect the options.
• Factored translation: fold the morphological generation into the translation model and do it all during decoding.
• Downside: computationally expensive, so the search space has to be pruned heavily. Too much?
Was gibt es Neues? (What's new?)
• The approach in this paper treats morphological inflection as a standalone (post-)process: first, decode the input; then, for the sequence of word stems in the output, generate the most likely sequence of inflections given the original input. (A pipeline sketch follows below.)
• Experiments: English→Russian (1.6M sentence pairs), English→Arabic (460k sentence pairs).
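A minimal sketch of this decode-then-inflect pipeline, using trivial stand-in functions for the decoder, stemmer, and inflection model; all of these are hypothetical, not the components used in the paper.

```python
# Sketch of the two-stage pipeline: decode first, then re-inflect the stemmed
# output conditioned on the original source. All components are toy stand-ins.
def translate_then_inflect(source, decode, stem, inflect):
    hypothesis = decode(source)                      # stage 1: ordinary MT decoding
    stems = [stem(w) for w in hypothesis.split()]    # reduce the hypothesis to word stems
    return " ".join(inflect(stems, source))          # stage 2: pick inflections given the source

# Toy stand-ins, purely for illustration:
decode  = lambda src: "kniga interesnyj"             # pretend decoder output (already stems)
stem    = lambda w: w
inflect = lambda stems, src: [s + "+INFL" for s in stems]

print(translate_then_inflect("the interesting book", decode, stem, inflect))
```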
Inflection prediction
• A lexicon determines three operations (sketched in code below):
• Stemming: produce the set of possible stems S_w = {s^1, …, s^x} for a word w.
• Inflection: produce the set of surface word forms I_w = {i^1, …, i^y} for the set of stems S_w.
• Morphological analysis: produce the set of morphological analyses A_w = {a^1, …, a^z} for a word w; each a is a vector of categorical values (POS, gender, etc.).
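A sketch of the three lexicon operations above, backed by a toy in-memory lexicon; the entries and feature names are invented for illustration, not taken from the paper's Russian or Arabic resources.

```python
from collections import defaultdict

class Lexicon:
    """Toy lexicon supporting stemming, inflection, and morphological analysis."""
    def __init__(self, entries):
        self.entries = entries                       # (surface form, stem, analysis) triples
        self.forms_by_stem = defaultdict(set)
        for form, stem, _ in entries:
            self.forms_by_stem[stem].add(form)

    def stems(self, word):
        """Stemming: the set of possible stems S_w for a surface word w."""
        return {stem for form, stem, _ in self.entries if form == word}

    def inflections(self, word):
        """Inflection: all surface forms I_w reachable from any stem of w."""
        return {f for s in self.stems(word) for f in self.forms_by_stem[s]}

    def analyses(self, word):
        """Morphological analysis: the set of feature vectors A_w for w."""
        return [a for form, _, a in self.entries if form == word]

lex = Lexicon([
    ("books", "book", {"POS": "NOUN", "Number": "Pl"}),
    ("book",  "book", {"POS": "NOUN", "Number": "Sg"}),
    ("book",  "book", {"POS": "VERB"}),
])
print(lex.stems("books"), lex.inflections("books"), lex.analyses("book"))
```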
Morphological analysis
• Morphological features: 7 for Russian (including capitalization!), 12 for Arabic.
• Each word can be factored into a stem plus a subset of the morphological features.
• On average, 14 inflections per stem in Russian and 24 per stem in Arabic (!).
How do you get them?
• Arabic: the Buckwalter analyzer.
• Russian: an off-the-shelf lexicon.
• Neither is exact, and neither is domain-specific; there could be errors here.
• (Curse of MT: error propagation.)
Models
• The probability of an inflection sequence is the product of local probabilities for each word, conditioned on a context window of prior predictions: roughly, p(y | x) = ∏_t p(y_t | x_t, y_{t-1}, …, y_{t-n+1}). (See the toy sketch below.)
• Markov order: 5-gram for Russian, 3-gram for Arabic.
• Unlike the morphological analyzer, which is purely word-based, the inflection model can use arbitrary features/dependencies (such as projected treelet syntactic information).
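A toy illustration of that decomposition: the log-probability of an inflection sequence is a sum of per-word local log-probabilities, each conditioned on the previous n-1 predictions. The local model here is a uniform stand-in, not the paper's maximum-entropy model.

```python
import math

def sequence_log_prob(inflections, stems, local_prob, order=3):
    """Sum of log p(y_t | x_t, y_{t-1}, ..., y_{t-order+1})."""
    total = 0.0
    for t, y_t in enumerate(inflections):
        history = tuple(inflections[max(0, t - order + 1):t])   # prior predictions in the window
        total += math.log(local_prob(y_t, stems[t], history))
    return total

# Toy local model: a fixed probability of 0.5 for every choice (purely illustrative).
uniform = lambda y, x, history: 0.5
print(sequence_log_prob(["book+Sg", "interesting"], ["book", "interesting"], uniform))
```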
Inflection Prediction Features
• Binary features.
• Each one pairs up the context (x, y_{t-1}, y_{t-2}, …) with the target label y_t.
• Features can be anything! (A sketch of feature extraction follows below.)
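A sketch of binary (indicator) feature extraction of the kind this slide describes: each feature pairs part of the context with the candidate label y_t. The feature templates are invented for illustration, not taken from the paper.

```python
def extract_features(y_t, x_t, history):
    """Return indicator features as strings; a feature 'fires' if its string is present."""
    feats = {
        "label=" + y_t,                                      # label alone
        "stem=" + x_t + "&label=" + y_t,                     # stem paired with label
    }
    if len(history) >= 1:
        feats.add("prev=" + history[-1] + "&label=" + y_t)   # previous prediction + label
    if len(history) >= 2:
        feats.add("prev2=" + history[-2] + "&label=" + y_t)  # prediction two back + label
    return feats

print(extract_features("Case=Gen", "kniga", ["Case=Nom"]))
```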
Baseline experiments
• Stem the reference translations, then try to predict the inflections.
• Done on 5k Russian sentences and 1k Arabic sentences (why the difference?).
• Very good accuracy (91%+).
• Better than a trigram LM (but how about a 5-gram for Russian?).
MT systems used
• 1. The Microsoft treelet translation system.
• 2. A Pharaoh reimplementation.
• Both trained on the MS corpus of technical manuals.
Experiments
• Translations are selected by the translation model, language model, and inflection model as follows (sketched below):
• For each hypothesis in the n-best MT system output, select the best inflection.
• Then, for each input sentence, select the best inflected hypothesis.
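A sketch of that two-stage selection. The slide does not spell out how the scores are combined; the linear interpolation of MT score and inflection-model score below is my assumption.

```python
def select_translation(nbest, best_inflection, mt_weight=1.0, infl_weight=1.0):
    """nbest: list of (stems, mt_score); best_inflection: stems -> (inflected string, score)."""
    best, best_score = None, float("-inf")
    for stems, mt_score in nbest:
        inflected, infl_score = best_inflection(stems)      # step 1: best inflection per hypothesis
        combined = mt_weight * mt_score + infl_weight * infl_score
        if combined > best_score:                           # step 2: best inflected hypothesis overall
            best, best_score = inflected, combined
    return best

# Toy call with fake hypotheses and scores:
nbest = [(["book", "interesting"], -1.2), (["book", "boring"], -1.5)]
print(select_translation(nbest, lambda stems: (" ".join(stems), -0.3)))
```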
N-best lists
• Only tested up to n = 100; n and the interpolation weights were then optimized via grid search. (A toy grid search is sketched below.)
• Optimum size of the n-best list: Russian: 32, Arabic: 2.
• (!!)
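A toy grid search over the n-best size and the interpolation weight, in the spirit of the tuning described above; the scoring function is a fake stand-in for a dev-set metric such as BLEU.

```python
import itertools

def tune(nbest_sizes, weights, dev_score):
    """Return the (n, weight) pair that maximizes dev_score(n, weight)."""
    return max(itertools.product(nbest_sizes, weights),
               key=lambda nw: dev_score(*nw))

# Illustrative call with a made-up scoring function that happens to peak at n=32:
best_n, best_w = tune([1, 2, 4, 8, 16, 32, 64, 100],
                      [0.0, 0.25, 0.5, 0.75, 1.0],
                      dev_score=lambda n, w: -abs(n - 32) - abs(w - 0.5))
print(best_n, best_w)   # -> 32 0.5 under this fake scorer
```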
Experiments
• 1. Train a regular MT system; stem the output and re-inflect it.
• 2. Train an MT system, but stem the target language after alignment. The system output is now just word stems, so inflect them.
• 3. Stem the parallel corpus first, then train an MT system.