Morphological Analysis for Phrase-Based Statistical Machine Translation LUONG Minh Thang Supervisor: Dr. KAN Min Yen National University of Singapore Web IR / NLP Group (WING)
Modern Machine Translation (MT)
• State-of-the-art systems perform phrase-to-phrase translation with data-intensive techniques
• But they still treat each word form as a distinct entity and do not understand the internal structure of words
We investigate the incorporation of word-structure knowledge (morphology) and adopt a language-independent approach
Machine translation: understand word structure
Issues we address
• A morphologically-aware system: derive word structure from raw data only, in a language-general approach
• This addresses the out-of-vocabulary problem: we may have seen "car" before, but not "cars", even though "cars" is just two morphemes, "car" + "s"
• Translation into highly inflected languages: an English-Finnish case study to understand their characteristics, leading to the suggestion of a self-correcting model
Finnish example: auto "car", auto/si "your car", auto/i/si "your cars", auto/i/ssa/si "in your cars", auto/i/ssa/si/ko "in your cars?"
What have others done?
• Most prior work addresses translation from highly inflected to less inflected languages: Arabic-English, German-English, Finnish-English
• Only a few works address the reverse direction, which is considered more challenging:
• English-Turkish (El-Kahlout & Oflazer, 2007)
• English-Russian, English-Arabic (Toutanova et al., 2008): employs a feature-rich approach using abundant annotated data & language-specific tools
We also look at the reverse direction, English-Finnish, but stick to our language-general approach!
Agenda
• Baseline statistical MT & terminology
• Our morphologically-aware SMT system: baseline + morphological layers
• Finnish study: morphological aspects
• Suggestion of a self-correcting model
• Experiments & results
Baseline statistical MT (SMT) - overview
• We construct our baseline using Moses (Koehn et al., 2007), a state-of-the-art open-source SMT toolkit
Pipeline: monolingual/parallel training data → training (language model, translation model, reordering model) → decoding of test data (source language) → output translation (target language) → evaluation (BLEU score)
Baseline statistical MT - terminology
• Parallel data: pairs of sentences in both languages (implies alignment correspondence)
• Monolingual data: sentences from one language only
• Distortion limit parameter: controls reordering, i.e. how far a translated word may end up from its source position; we test the effect of this parameter later
Automatic evaluation in SMT
• Human judgment is expensive & labor-intensive
• Instead, evaluate automatically against reference translation(s)
Input: Mary did not slap the green witch
Baseline SMT output: Maria daba una botefada a verde bruja
Ref: Maria no daba una botefada a la bruja verde
Evaluation: BLEU score
Automatic evaluation in SMT - BLEU score
• Match unigrams, bigrams, trigrams, and so on up to N-grams between output and reference
Ref: Maria no daba una botefada a la bruja verde
Output: Maria daba una botefada a verde bruja
• p1 (unigram matches) = 7
• p2 (bigram matches) = 4
• p3 (trigram matches) = 2
• p4 (4-gram matches) = 1
BLEU score = brevity_penalty × exp((log p1 + … + log p4) / 4), where each pn is the n-gram precision derived from the match counts above (matched n-grams / output n-grams)
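The computation can be sketched as a toy implementation. This is a simplified sketch, not the official BLEU script: `ngrams`, `clipped_matches`, and `bleu` are illustrative names, and the zero-count smoothing and brevity penalty are ad-hoc choices.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_matches(out_tokens, ref_tokens, n):
    """Count output n-grams that also occur in the reference (clipped)."""
    out_counts = Counter(ngrams(out_tokens, n))
    ref_counts = Counter(ngrams(ref_tokens, n))
    return sum(min(c, ref_counts[g]) for g, c in out_counts.items())

def bleu(output, reference, max_n=4):
    """Toy BLEU: brevity penalty times the geometric mean of n-gram precisions."""
    out, ref = output.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        matches = clipped_matches(out, ref, n)
        total = max(len(out) - n + 1, 1)
        log_prec += math.log(max(matches, 0.5) / total)  # crude zero-smoothing
    bp = min(1.0, math.exp(1 - len(ref) / len(out)))  # penalize short outputs
    return bp * math.exp(log_prec / max_n)

ref = "Maria no daba una botefada a la bruja verde"
out = "Maria daba una botefada a verde bruja"
```

Running `clipped_matches(out.split(), ref.split(), n)` for n = 1, 3, 4 reproduces the unigram, trigram, and 4-gram counts on the slide.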
Baseline SMT - shortcomings?
• Only handles language pairs of similar morphological complexity
• Suffers from the data sparseness problem in highly inflected languages
(Statistics from the 714K-sentence corpus)
• Type: number of distinct words (vocabulary size)
• Token: total number of running words
Why are highly inflected languages hard?
• Huge vocabulary size: the Finnish vocabulary is roughly 6 times the size of the English vocabulary
• Prefixes/suffixes can be freely concatenated to form new words
Finnish: oppositio/kansa/n/edusta/ja (opposition/people/of/represent/-ative) = opposition member of parliament
Turkish: uygarlaştıramadıklarımızdanmışsınızcasına (uygar/laş/tır/ama/dık/lar/ımız/dan/mış/sınız/casına) = (behaving) as if you are among those whom we could not cause to become civilized. And this is a single word!
We make our system morphologically-aware to address these issues
Agenda
• Baseline statistical MT & terminology
• Our morphologically-aware SMT system: baseline + morphological layers
• Finnish study: morphological aspects
• Suggestion of a self-correcting model
• Experiments & results
Morpheme pre- & post-processing modules
• Morpheme pre-processing before training & decoding, e.g. cars → car + s, autot → auto + t
• Morpheme post-processing after decoding, e.g. auto + t → autot
Pipeline: parallel & monolingual training data → morpheme pre-processing → language model training, translation & reordering model training; test data → morpheme pre-processing → decoding → morpheme post-processing → final translation
Incorporating morphological layers: our morphologically-aware SMT
Pipeline: parallel & monolingual training data → morpheme pre-processing → language model training, translation & reordering model training; test data → morpheme pre-processing → decoding → morpheme post-processing → final translation E
Preprocessing - morpheme segmentation (MS)
• We perform MS to address the data sparseness problem: "cars" might not appear in the training data, but "car" & "s" do
• (Oflazer, 2007) & (Toutanova, 2008) also perform MS, but use morphological analyzers that are customized for a specific language and utilize richly annotated data
• We use an unsupervised morpheme segmentation tool, Morfessor, that requires only unannotated monolingual data
Morpheme segmentation - Morfessor
• Morfessor segments words in an unsupervised manner, e.g. straight/STM + forward/STM + ness/SUF
• 3 tags: PRE (prefix), STM (stem) & SUF (suffix)
(Statistics from the 714K-sentence corpus)
• Segmentation reduces the data sparseness problem
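As a minimal sketch of this pre-processing step: the segmentation lexicon below is a hypothetical stand-in for the splits Morfessor induces from unannotated data (`SEGMENTATIONS` and `preprocess` are invented names), and the `morph/TAG` output with a trailing "+" mirrors the slides.

```python
# Hypothetical segmentation lexicon; in the real system Morfessor learns
# these splits from unannotated monolingual data in an unsupervised way.
SEGMENTATIONS = {
    "cars": [("car", "STM"), ("s", "SUF")],
    "straightforwardness": [("straight", "STM"), ("forward", "STM"), ("ness", "SUF")],
}

def preprocess(sentence):
    """Replace each word by tagged morphemes; a trailing '+' means the
    next morpheme belongs to the same word."""
    out = []
    for word in sentence.split():
        morphs = SEGMENTATIONS.get(word, [(word, "STM")])  # unsegmented fallback
        for i, (morph, tag) in enumerate(morphs):
            glue = "+" if i < len(morphs) - 1 else ""
            out.append(f"{morph}/{tag}{glue}")
    return " ".join(out)
```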
Post-processing - morpheme concatenation
• Output after decoding is a sequence of morphemes: Pitäkää mme se omassa täsmällis essä tehtävä ssä ä n. How do we put them back into words?
• During translation, keep the tag info & a trailing "+" sign to indicate a word-internal morpheme boundary
• Use the word structure: WORD = ( PRE* STM SUF* )+
Example: pre-processed test data Pitäkää/STM+ mme/SUF se/STM omassa/STM täsmällis/STM+ essä/SUF tehtävä/STM+ ssä/SUF+ ä/SUF+ n/SUF → decoding → morpheme post-processing → final translation: Pitäkäämme se omassa täsmällisessä tehtävässään
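Given that convention (a trailing "+" on a morpheme glues it to the next one), the concatenation rule WORD = ( PRE* STM SUF* )+ reduces to a single pass over the decoder output. A minimal sketch; `postprocess` is an illustrative name:

```python
def postprocess(morpheme_sequence):
    """Rejoin decoder output into words: a trailing '+' on a token means the
    next morpheme belongs to the same word; PRE/STM/SUF tags are stripped."""
    words, current = [], ""
    for token in morpheme_sequence.split():
        glue = token.endswith("+")
        morph = token.rstrip("+").rsplit("/", 1)[0]  # drop '+' and the tag
        current += morph
        if not glue:
            words.append(current)
            current = ""
    if current:  # flush a dangling '+' at sentence end
        words.append(current)
    return " ".join(words)
```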
Agenda
• Baseline statistical MT & terminology
• Our morphologically-aware SMT system: baseline + morphological layers
• Finnish study: morphological aspects
• Suggestion of a self-correcting model
• Experiments & results
Finnish study - two distinct characteristics
• More case endings than typical Indo-European languages; they normally correspond to prepositions or postpositions, e.g. auto/sta "out of the car", auto/on "into the car"
• Uses endings where Indo-European languages use function words: Finnish possessive suffixes correspond to English possessive pronouns, e.g. auto/si "your car", auto/mme "our car"
Structure of a nominal - a word followed by many suffixes
• Structure: nominal stem + number + case + possessive + particle
Structure of a finite verb form - Finnish suffixes correspond to English function words
• Structure: verb stem + tense/mood + personal ending + particle
Potential challenges of a highly inflected language to the system
• A word might be followed by several suffixes, so the system might get the stem right but miss a suffix
• Example: the correct translation of "my cars" is auto/i/ni (i: plural, ni: my), but the system outputs auto/STM+ i/SUF, missing the suffix ni
• Intuition: use the source-side "my" and "s" (my/STM car/STM+ s/SUF) to help self-correct this suffix to i/ni
Preliminary self-correcting model
• Suffixes in a highly inflected language correspond to function words in a less inflected language
• Besides prefixes & suffixes, make use of source function words
• Model suffix prediction as a sequence labeling task, where the labels are suffixes
Example: given the source my/STM car/STM+ s/SUF and the target stem auto, the features func="my", suf="s", stem_t="auto" (plus neighbouring stems and suffixes) should predict the correct suffix i/ni
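One way to realize this sequence labeler is to extract, at each target position, indicator features over the aligned source function word, source suffix, and the current and neighbouring target stems, and feed them to any standard classifier (e.g. a CRF or maximum-entropy model). The feature names and the `suffix_features` helper below are illustrative, not taken from the thesis:

```python
def suffix_features(target_stems, t, src_function_word, src_suffix):
    """Indicator features for predicting the suffix of target stem number t."""
    feats = {
        f"stem={target_stems[t]}": 1,    # current target stem
        f"func={src_function_word}": 1,  # aligned source function word
        f"suf={src_suffix}": 1,          # aligned source suffix
    }
    if t > 0:
        feats[f"stem[-1]={target_stems[t - 1]}"] = 1   # left context
    if t + 1 < len(target_stems):
        feats[f"stem[+1]={target_stems[t + 1]}"] = 1   # right context
    return feats

# For source "my/STM car/STM+ s/SUF" aligned to target stem "auto",
# the gold label at this position would be the suffix sequence i/ni.
feats = suffix_features(["auto"], 0, "my", "s")
```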
Agenda
• Baseline statistical MT & terminology
• Our morphologically-aware SMT system: baseline + morphological layers
• Finnish study: morphological aspects
• Suggestion of a self-correcting model
• Experiments & results
Datasets from the European Parliament corpus
• Four datasets of various sizes, selected by first picking a keyword for each dataset and then extracting all sentences containing that keyword and its morphological variants
• Modest in size compared to the full 714K-sentence corpus
• We chose these datasets to reduce running time and to simulate the real situation of scarce resources
Experiments - out-of-vocabulary (OOV) rates
• OOV rate = number of untranslated words / total number of words
• Reduction rate = (baseline OOV rate - our OOV rate) / baseline OOV rate
• Reduction rates range from 10.33% to 34.74%, with the strongest effect when data is limited
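Both rates are straightforward to compute; a small sketch with a toy vocabulary (`oov_rate` and `reduction_rate` are illustrative helpers, and the example vocabulary is made up):

```python
def oov_rate(test_tokens, train_vocab):
    """Fraction of test tokens never seen in training, hence untranslated."""
    unseen = sum(1 for w in test_tokens if w not in train_vocab)
    return unseen / len(test_tokens)

def reduction_rate(baseline_oov, our_oov):
    """Relative OOV reduction of the morpheme-level system over the baseline."""
    return (baseline_oov - our_oov) / baseline_oov

# Toy illustration: segmenting "cars" into "car s" removes the OOV hit.
train_vocab = {"the", "car", "s"}
base = oov_rate(["the", "cars"], train_vocab)      # "cars" is unseen
ours = oov_rate(["the", "car", "s"], train_vocab)  # all morphemes seen
```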
Overall results with BLEU score
• We use the BLEU metric judged at two levels:
• word level: the unit in an N-gram is a word
• morpheme level: the unit in an N-gram is a morpheme
• Word BLEU: our SMT is as competitive as the baseline SMT
• Morpheme BLEU: our SMT shows better morpheme coverage
Overall results - distortion limit tuning
• The distortion limit controls reordering and has an influential effect on performance (Virpioja, 2007)
• The baseline SMT peaks at distortion limit 6; our SMT peaks at 9
• At its best setting, our SMT is better in both word and morpheme BLEU
Error analysis
• We are interested in how often the system gets the stem right but not the suffixes
• The results show a real need for the self-correcting model
Even further analysis - new results after the thesis!
• Our datasets are specialized around their keywords, so results are more conclusive if we look at translations of phrases containing the dataset keywords
• Conclusion: our SMT performs better on both tasks, getting the stems and the suffixes right
References
• Koehn, P., et al., 2007. Moses: open source toolkit for statistical machine translation.
• Durgar El-Kahlout, I. & Oflazer, K., 2007. Exploring different representational units in English-to-Turkish statistical machine translation.
• Virpioja, S., et al., 2007. Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner.
• Toutanova, K., et al., 2008. Applying morphology generation models to machine translation.
Q & A? • Thank you
Baseline statistical MT - detailed pipeline
• Translation model training on parallel data: EM algorithm, symmetrized word alignment (GIZA++ tool) → phrase tables
• Language model training on target data: N-gram extraction (SRILM) → language model
• Tuning on development data: P(E|F) ~ ∑i λi fi(E|F); learn the λi by Minimum Error Rate Training → λ*i
• Decoding test data F: E = argmaxE ∑i λ*i fi(E|F) (beam search, Moses toolkit) → final translation E
Standard SMT system - translation model
• Learns how to translate a source phrase into a target phrase
• Output: a phrase table, e.g.
car industry in europe ||| euroopan autoteollisuus
car industry in the ||| autoteollisuuden
car industry in ||| autoteollisuuden
Standard SMT system - language model
• Captures constraints on which sequences of words can go together
• Output: an N-gram table of log probabilities, e.g.
-2.882216 commission 's argument 0
-3.182358 commission 's arguments 0
-3.620942 commission 's assertion 0
-3.11402 commission 's assessment 0
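A sketch of how such a table is consulted at decoding time. The entries copy the slide's SRILM-style log10 probabilities; the lookup function and the backoff floor for unseen trigrams are illustrative simplifications (a real LM backs off through lower-order n-grams):

```python
# Trigram log10 probabilities, copied from the N-gram table on the slide.
TRIGRAM_LOGPROBS = {
    ("commission", "'s", "argument"): -2.882216,
    ("commission", "'s", "arguments"): -3.182358,
    ("commission", "'s", "assertion"): -3.620942,
    ("commission", "'s", "assessment"): -3.11402,
}

def trigram_logprob(w1, w2, w3, unseen=-7.0):
    """Look up log10 P(w3 | w1 w2); `unseen` is a crude stand-in for backoff."""
    return TRIGRAM_LOGPROBS.get((w1, w2, w3), unseen)
```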
Standard SMT system - tuning
• Determines the weights for combining the different models, e.g. the translation and language models
• On parallel development data: P(E|F) ~ ∑i λi fi(E|F); learn the λi
Standard SMT system - decoding
• Uses the phrase table from the translation model, the N-gram table from the language model, and the combination weights from tuning
• For each input sentence F, generates a set of candidate translations and picks the highest-scoring one as the final translation E
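Over a finite candidate set, this picking step is just the log-linear argmax E = argmaxE ∑i λ*i fi(E|F). A sketch with invented feature values and weights; a real decoder explores the candidate space with beam search instead of enumerating it:

```python
def decode(candidates, feature_fns, weights):
    """Return the candidate E maximizing sum_i lambda_i * f_i(E)."""
    def score(e):
        return sum(lam * f(e) for lam, f in zip(weights, feature_fns))
    return max(candidates, key=score)

# Toy example: two candidate translations; the translation-model and
# language-model scores below are made up for illustration.
tm = {"autot": -1.0, "auto t": -0.5}.get  # translation model log-score
lm = {"autot": -0.8, "auto t": -1.5}.get  # language model log-score
best = decode(["autot", "auto t"], [tm, lm], weights=[1.0, 0.5])
```

Here "autot" scores 1.0·(-1.0) + 0.5·(-0.8) = -1.4 and "auto t" scores -0.5 + 0.5·(-1.5) = -1.25, so the decoder picks "auto t".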