A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages
Minh-Thang Luong, Preslav Nakov & Min-Yen Kan
EMNLP 2010, Massachusetts, USA
lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas
This is a Finnish word!
lento+ kone+ suihku+ turbiini+ moottori+ apu+ mekaanikko+ ali+ upseeri+ oppilas
flight + machine + shower + turbine + engine + assistance + mechanic + under + officer + student
= "technical warrant officer trainee specialized in aircraft jet engines", used in the Finnish Air Force
• Highly inflected languages (Arabic, Basque, Turkish, Russian, Hungarian, etc.) make extensive use of affixes and have a huge number of distinct word forms
• Highly inflected languages are hard for MT; analysis at the morpheme level is sensible
Morphological Analysis Helps
auto+ si = your car
auto+ i+ si = your car+ s
auto+ i+ ssa+ si = in your car+ s
auto+ i+ ssa+ si+ ko = in your car+ s?
• Morpheme representation alleviates data sparseness: Arabic-English (Lee, 2004); Czech-English (Goldwater & McClosky, 2005); Finnish-English (Yang & Kirchhoff, 2006)
• Word representation captures context better for large corpora: Arabic-English (Sadat & Habash, 2006); Finnish-English (de Gispert et al., 2009)
Our approach: the basic unit of translation is the morpheme, but word boundaries are respected at all stages.
Translation into Morphologically Rich Languages
• A challenging translation direction; we are interested in an unsupervised approach and experiment with a large dataset
• Recent interest in morphologically rich languages:
• Arabic (Badr et al., 2008) and Turkish (Oflazer and El-Kahlout, 2007): enhance performance only for small bi-texts
• Greek (Avramidis and Koehn, 2008) and Russian (Toutanova et al., 2008): rely heavily on language-specific tools
Methodology
Pipeline: 1) Morphological analysis (morphemes, words, alignment training); 2) Boundary-aware phrase extraction (phrase pairs, phrase table, PT scoring); 3) MERT tuning; 4) Decoding; 5) Translation model enrichment
• Morphological analysis: unsupervised
• Morphological enhancements: respect word boundaries
• Translation model enrichment: merge phrase tables (PTs)
1) Morphological Segmentation
• Use Morfessor (Creutz and Lagus, 2007), an unsupervised morphological analyzer
• Segments words into morphemes tagged PRE, STM, or SUF: un/PRE+ care/STM+ ful/SUF+ ly/SUF
• The "+" sign is used to enforce word-boundary constraints later
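As a rough sketch of this representation (not Morfessor's actual API; `mark_word` is a hypothetical helper), each morpheme of a word can be tagged PRE/STM/SUF and glued to its successor with a trailing "+":

```python
def mark_word(morphemes):
    """Render one word's morphemes in the tagged, '+'-marked format.

    morphemes: list of (morph, tag) pairs for a SINGLE word.
    A trailing '+' on a token means the next token continues the word.
    """
    out = []
    for i, (morph, tag) in enumerate(morphemes):
        marker = "+" if i < len(morphemes) - 1 else ""
        out.append(f"{morph}/{tag}{marker}")
    return " ".join(out)

print(mark_word([("un", "PRE"), ("care", "STM"), ("ful", "SUF"), ("ly", "SUF")]))
# prints: un/PRE+ care/STM+ ful/SUF+ ly/SUF
```

The trailing "+" is what later lets phrase extraction and decoding tell word-internal morphemes apart from word-final ones.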
2) Word Boundary-aware Phrase Extraction
• Typical SMT: maximum phrase length n = 7 words
• Problem: morpheme phrases of length n can span fewer than n words, or may only partially span words; this is severe for morphologically rich languages
• Solution: extract only morpheme phrases that span n words or fewer and cover a sequence of whole words; this avoids proposing non-words
SRC = theSTM newSTM , unPRE+ democraticSTM immigrationSTM policySTM
TGT = uusiSTM , epäPRE+ demokraatSTM+ tSUF+ iSUF+ sSUF+ enSUF maahanmuuttoPRE+ politiikanSTM
(uusi = new, epädemokraattisen = undemocratic, maahanmuuttopolitiikan = immigration policy)
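The boundary constraints can be sketched as a filter over candidate morpheme spans; this is an illustrative reconstruction under stated assumptions, not the authors' code, with tokens carrying a trailing "+" when the next morpheme belongs to the same word:

```python
def is_valid_phrase(tokens, start, end, max_words=7):
    """Check candidate morpheme phrase tokens[start:end] against the
    word-boundary constraints: it must start and end at word boundaries
    and span at most max_words whole words."""
    # must not start in the middle of a word
    if start > 0 and tokens[start - 1].endswith("+"):
        return False
    # must not end in the middle of a word
    if tokens[end - 1].endswith("+"):
        return False
    # word-final tokens (no trailing '+') count the whole words spanned
    words = sum(1 for t in tokens[start:end] if not t.endswith("+"))
    return words <= max_words
```

For the target sentence above, the span covering all six morphemes of "epädemokraattisen" would pass, while any span ending at "demokraat+" (mid-word) would be rejected.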
3) Morpheme MERT Optimizing Word BLEU
• Why? The BLEU brevity penalty is influenced by sentence length, and the same number of words can span very different numbers of morphemes, e.g., a 6-morpheme Finnish word: epäPRE+ demokraatSTM+ tSUF+ iSUF+ sSUF+ enSUF. This leads to a suboptimal weight for the word penalty feature function.
• Solution: in each iteration of MERT, decode at the morpheme level, convert the morpheme translation into a word sequence, compute word BLEU, then convert back to morphemes
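The morpheme-to-word conversion used before scoring can be sketched as follows (a minimal illustration assuming the "+"-marked token format; tags are omitted for clarity):

```python
def morphemes_to_words(morpheme_tokens):
    """Join morpheme tokens back into words: a trailing '+' glues a
    token to the next one. This is the conversion applied before
    computing word-level BLEU inside each MERT iteration."""
    words, current = [], ""
    for tok in morpheme_tokens:
        if tok.endswith("+"):
            current += tok[:-1]        # word continues
        else:
            words.append(current + tok)  # word ends here
            current = ""
    return words
```

For example, `morphemes_to_words(["epä+", "demokraat+", "t+", "i+", "s+", "en"])` yields the single word `epädemokraattisen`.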
4) Decoding with Twin Language Models
• Morpheme language model (LM): pros: alleviates data sparseness; cons: phrases span fewer words
• Introduce a second LM at the word level: the log-linear model gets a separate feature, and the Moses decoder gets a word-level "view" on the morpheme-level hypotheses
Example hypothesis: uusiSTM , epäPRE+ demokraatSTM+ tSUF+ iSUF+ sSUF+ enSUF maahanmuuttoPRE+ politiikanSTM
• Morpheme LM scores: "sSUF+ enSUF maahanmuuttoPRE+" ; "enSUF maahanmuuttoPRE+ politiikanSTM"
• Word LM scores: uusi , epädemokraattisen maahanmuuttopolitiikan
This is: (1) different from scoring with two word-level LMs, and (2) superior to n-best rescoring
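A hypothetical sketch of the word-level "view": as the decoder extends a morpheme-level hypothesis, only completed words can be scored by the word LM, while the still-open word fragment must wait for further expansion (`word_view` is an illustrative helper, not Moses code):

```python
def word_view(hypothesis_tokens):
    """Split a partial morpheme-level hypothesis into (completed_words,
    pending_fragment). Completed words can be fed to the word-level LM
    feature; the pending fragment is carried in the hypothesis state."""
    completed, pending = [], ""
    for tok in hypothesis_tokens:
        if tok.endswith("+"):
            pending += tok[:-1]          # word still open
        else:
            completed.append(pending + tok)
            pending = ""
    return completed, pending
```

For the running example, extending a hypothesis by "epä+ demokraat+" completes no new word; the fragment "epädemokraat" stays pending until a word-final morpheme arrives.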
5) Building Twin Translation Models
From the same parallel corpus, we generate two translation models:
• Word path: GIZA++ word alignment, then phrase extraction, yields PTw, e.g., problems ||| vaikeuksista
• Morpheme path: GIZA++ morpheme alignment, then phrase extraction, yields PTm, e.g., problemSTM+ sSUF ||| ongelmaSTM+ tSUF
• Morphological segmentation of PTw yields PTw→m, e.g., problemSTM+ sSUF ||| vaikeuSTM+ ksiSUF+ staSUF
• PT merging combines PTw→m and PTm for decoding
Phrase Table (PT) Merging
Existing methods for merging two PTs that originate from different sources:
• Add-feature methods (e.g., Chen et al., 2009): heuristic-driven; extra phrase-penalty features mark each pair's origin, e.g.
problemSTM+ sSUF ||| vaikeuSTM+ ksiSUF+ staSUF ||| 0.07 0.11 0.01 0.01 2.7 2.7 1
problemSTM+ sSUF ||| ongelmaSTM+ tSUF ||| 0.37 0.60 0.11 0.14 2.7 1 2.7
• Interpolation-based methods (e.g., Wu & Wang, 2007): linear interpolation of the phrase and lexicalized translation probabilities:
problemSTM+ sSUF ||| vaikeuSTM+ ksiSUF+ staSUF ||| 0.07 0.11 0.01 0.01 2.7
problemSTM+ sSUF ||| ongelmaSTM+ tSUF ||| 0.37 0.60 0.11 0.14 2.7
Our method takes into account the fact that our twin translation models are of equal quality.
Our Method: Phrase Translation Probabilities
• Preserve the normalized ML estimations (Koehn et al., 2003)
• Use the raw counts of both models (the number of times each pair was extracted from the training dataset) to compute the merged probabilities over PTw→m and PTm
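The count-based merging can be sketched as pooling raw extraction counts from the two tables and re-normalizing per source phrase, which preserves ML-style estimates; `merge_phrase_probs` and the toy counts below are illustrative assumptions, not the released implementation:

```python
from collections import defaultdict

def merge_phrase_probs(counts_m, counts_wm):
    """Pool raw extraction counts from the twin phrase tables, then
    re-normalize into phrase translation probabilities p(tgt | src).

    counts_m, counts_wm: dicts mapping (src, tgt) -> extraction count.
    """
    pooled = defaultdict(float)
    for table in (counts_m, counts_wm):
        for pair, c in table.items():
            pooled[pair] += c
    # normalize per source phrase
    totals = defaultdict(float)
    for (src, _tgt), c in pooled.items():
        totals[src] += c
    return {pair: c / totals[pair[0]] for pair, c in pooled.items()}
```

Pooling counts before normalizing (rather than averaging two already-normalized distributions) weights each table by how much evidence it actually contributes for each source phrase.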
Our Method: Lexicalized Translation Probabilities
• Use linear interpolation of the scores from PTw→m and PTm
• What happens if a phrase pair belongs to only one PT? Previous methods interpolate with 0, which might cause some good phrases to be penalized
• Our method: induce all scores before interpolation; use the lexical model of one PT to score the phrase pairs of the other
E.g., the word lexical model (from PTw) scores problemSTM+ sSUF ||| ongelmaSTM+ tSUF via (problems | ongelmat), while the morpheme lexical model (from PTm) scores problemSTM+ sSUF ||| vaikeuSTM+ ksiSUF+ staSUF
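Once both lexical models have scored every pair, the interpolation itself is straightforward. This sketch assumes the cross-scoring step has already filled in both score dictionaries, so no pair is interpolated against an artificial 0 (`interpolate_lex` and `alpha` are illustrative assumptions):

```python
def interpolate_lex(lex_m, lex_wm, alpha=0.5):
    """Linearly interpolate lexicalized scores from the twin phrase
    tables. Assumes every pair in either table has been scored by BOTH
    lexical models beforehand (the cross-scoring step)."""
    pairs = set(lex_m) | set(lex_wm)
    return {p: alpha * lex_m[p] + (1 - alpha) * lex_wm[p] for p in pairs}
```

With alpha = 0.5 the two models are weighted equally, reflecting the observation that the twin translation models are of comparable quality.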
Dataset & Settings
• Dataset: past shared task WPT05 (English-Finnish), 714K sentence pairs
• Split into training subsets T1, T2, T3, and T4 of sizes 40K, 80K, 160K, and 320K
• Standard phrase-based SMT settings: Moses, IBM Model 4, case-insensitive BLEU
MT Baseline Systems
• w-system: word level; m-system: morpheme level; m-BLEU: morpheme version of BLEU
• Either the m-system does not perform as well as the w-system, or BLEU is not capable of measuring morpheme-level improvements
Morphological Enhancements - Individual
• phr: boundary-aware phrase extraction; tune: MERT tuning on word BLEU; lm: decoding with twin LMs
• Each individual enhancement yields improvements for both small and large corpora
Morphological Enhancements - Combined
• phr: boundary-aware phrase extraction; tune: MERT tuning on word BLEU; lm: decoding with twin LMs
• Combined, the morphological enhancements are on par with the w-system and yield sizable improvements over the m-system
Translation Model Enrichment - Results
• add-1: one extra feature; add-2: two extra features; interpolation: linear interpolation; ourMethod: count-based merging
• Our method outperforms the w-system baseline
Result Significance
• Absolute improvement of 0.74 BLEU over the m-system, a non-trivial relative improvement of 5.6%
• Outperforms the w-system by 0.24 BLEU points (1.56% relative)
• Statistically significant with p < 0.01 (Collins' sign test)
Translation Proximity Match
Hypothesis: our approach yields translations close to the reference word forms but unjustly penalized by BLEU.
• Automatically extract phrase triples (src, out, ref), using src as the pivot to pair the translation output (out) with the reference (ref)
• Measure the longest common subsequence ratio to detect high character-level similarity between out and ref, e.g., (economic and social, taloudellisia ja sosiaalisia, taloudellisten ja sosiaalisten)
• Of 16,262 triples, 6,758 matched the reference exactly; the remaining triples were close word forms
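The character-level proximity measure can be sketched as a standard longest-common-subsequence ratio (an illustrative implementation, not the authors' exact script):

```python
def lcs_ratio(a, b):
    """Longest-common-subsequence ratio between two strings: LCS length
    divided by the length of the longer string. Used as a character-level
    similarity between a translated phrase and its reference."""
    m, n = len(a), len(b)
    # classic O(m*n) dynamic program over prefix pairs
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / max(m, n)
```

Identical strings score 1.0, and near word forms such as "taloudellisia" vs. "taloudellisten" score high because they share a long common stem.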
Human Evaluation
• 4 native Finnish speakers judged 50 random test sentences
• Following the WMT'09 evaluation, judges were given the source sentence, its reference translation, and the outputs of the m-system, w-system, and ourSystem in random order, and were asked for three pairwise judgments
• The judges consistently preferred: (1) ourSystem to the m-system, (2) ourSystem to the w-system, (3) the w-system to the m-system
Sample translations
src: we were very constructive and we negotiated until the last minute of these talks in the hague .
ref: olimme erittäin rakentavia ja neuvottelimme haagissa viime hetkeen saakka .
our: olemme olleet hyvin rakentavia ja olemme neuvotelleet viime hetkeen saakka näiden neuvottelujen haagissa .
w : olemme olleet hyvin rakentavia ja olemme neuvotelleet viime tippaan niin näiden neuvottelujen haagissa .
m : olimme erittäin rakentavan ja neuvottelimme viime hetkeen saakka näiden neuvotteluiden haagissa .
Rank: our > m ≥ w
src: it would be a very dangerous situation if the europeans were to become logistically reliant on russia .
ref: olisi erittäin vaarallinen tilanne , jos eurooppalaiset tulisivat logistisesti riippuvaisiksi venäjästä .
our: olisi erittäin vaarallinen tilanne , jos eurooppalaiset tulee logistisesti riippuvaisia venäjän .
w : se olisi erittäin vaarallinen tilanne , jos eurooppalaisten tulisi logistically riippuvaisia venäjän .
m : se olisi hyvin vaarallinen tilanne , jos eurooppalaiset haluavat tulla logistisesti riippuvaisia venäjän .
Rank: our > w ≥ m
(Slide callouts: match reference, wrong case, confusing meaning for example 1; match reference, wrong case, OOV, changed meaning for example 2.)
ourSystem consistently outperforms the m-system and w-system, and blends translations from both baselines well!
Conclusion
• Two key challenges: (1) bring the performance of morpheme systems to a level rivaling the standard word ones; (2) incorporate morphological analysis directly into the translation process
• We have built a preliminary framework addressing the first challenge
• Future work: extend the morpheme-level framework; tackle the second challenge
Thank you!
Analysis - Translation Model Comparison
• PTm+phr vs. PTm: PTm+phr is about half the size of PTm, and 95.07% of the phrase pairs in PTm+phr are also in PTm. Boundary-aware phrase extraction thus selects good phrase pairs from PTm to retain in PTm+phr.
• PTm+phr vs. PTw→m: comparable in size (22.5M vs. 28.9M pairs), but the overlap is only 47.67% of PTm+phr. Enriching the translation model with PTw→m therefore helps improve coverage.