Translation Models: Taking Translation Direction into Account. Gennadi Lembersky Noam Ordan Shuly Wintner ISCOL, 2011. Statistical Machine Translation (SMT). Given foreign sentence f : “Maria no dio una bofetada a la bruja verde ” Find the most likely English translation e :
Translation Models: Taking Translation Direction into Account • Gennadi Lembersky, Noam Ordan, Shuly Wintner • ISCOL, 2011
Statistical Machine Translation (SMT) • Given foreign sentence f: • “Maria no dio una bofetada a la bruja verde” • Find the most likely English translation e: • “Maria did not slap the green witch” • The most likely English translation e is given by arg max_e P(e|f) • P(e|f) estimates the conditional probability of any e given f • How to estimate P(e|f)? • Noisy channel: • Decompose P(e|f) into P(f|e) * P(e) / P(f) • Estimate P(f|e) using a parallel corpus (translation model) • Estimate P(e) using a monolingual corpus (language model)
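The noisy-channel decision rule on this slide can be sketched in a few lines. This is only an illustration: the candidate list and every probability value are made up, and the helper name `noisy_channel_best` is hypothetical. Because P(f) is the same for every candidate e, it drops out of the arg max.

```python
# Toy noisy-channel decoder: choose the e that maximizes P(f|e) * P(e).
# All probabilities are invented illustration values, not corpus estimates.

def noisy_channel_best(candidates, tm_prob, lm_prob):
    """Return the candidate e with the highest P(f|e) * P(e).

    P(f) is constant for a fixed f, so it is omitted from the arg max.
    """
    return max(candidates, key=lambda e: tm_prob[e] * lm_prob[e])


candidates = [
    "Maria did not slap the green witch",
    "Maria not slap the witch green",
]
# Hypothetical P(f|e): both candidates account for the Spanish words equally well.
tm_prob = {candidates[0]: 0.02, candidates[1]: 0.02}
# Hypothetical P(e): the language model prefers the fluent English sentence.
lm_prob = {candidates[0]: 1e-6, candidates[1]: 1e-9}

best = noisy_channel_best(candidates, tm_prob, lm_prob)
print(best)
```

In this toy setting the translation model cannot separate the two candidates, so the language model decides, which is the division of labour the noisy-channel decomposition is meant to give.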
Translation Model • How to model P(f|e)? • Learn the parameters of P(f|e) from a parallel corpus • Estimate translation model parameters at the phrase level • explicit modeling of word context • captures local reorderings, local dependencies • IBM Models define how words in a source sentence can be aligned to words in a parallel target sentence • EM is used to estimate the parameters • Aligned words are extended to phrases • Result: a phrase-table
Log-Linear Models • Log-linear models: $P(e|f) \propto \exp\left(\sum_i \lambda_i h_i(e, f)\right)$ • where $h_i$ are the feature functions and $\lambda_i$ are the model parameters • typical feature functions: phrase translation probabilities, lexical translation probabilities, language model probability, reordering model • Model parameter estimation (tuning) uses discriminative training: the MERT algorithm (Och, 2003)
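A log-linear model scores each hypothesis as a weighted sum of feature values, and the decoder keeps the highest-scoring one. The sketch below assumes the feature values are already log-probabilities; the feature names, weights, and numbers are hypothetical, and the weights are set by hand rather than tuned with MERT.

```python
# Minimal log-linear scoring sketch: score(e, f) = sum_i lambda_i * h_i(e, f).
# Feature names, feature values, and weights are illustrative assumptions.

def log_linear_score(features, weights):
    """Unnormalized log-linear score of one hypothesis."""
    return sum(weights[name] * value for name, value in features.items())


def best_hypothesis(hypotheses, weights):
    """Pick the hypothesis with the highest weighted feature sum."""
    return max(hypotheses, key=lambda h: log_linear_score(h["features"], weights))


# Hand-set weights (lambda_i); a real system would obtain these by tuning.
weights = {"phrase_tm": 1.0, "lexical_tm": 0.5, "lm": 1.0, "reordering": 0.3}

hypotheses = [
    {
        "text": "Maria did not slap the green witch",
        "features": {"phrase_tm": -1.0, "lexical_tm": -1.2, "lm": -14.0, "reordering": -0.5},
    },
    {
        "text": "Maria not slap the witch green",
        "features": {"phrase_tm": -0.8, "lexical_tm": -1.0, "lm": -20.0, "reordering": 0.0},
    },
]

best = best_hypothesis(hypotheses, weights)
print(best["text"])
```

Normalization is skipped because the denominator is the same for all hypotheses of a given f and does not change which one wins.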
Evaluation • Human evaluation is not practical – too slow and costly • Automatic evaluation is based on a human reference translation • The output of an MT system is compared to the human translation of the same set of sentences • The metrics essentially compute the distance between the MT output and the reference translation • Dozens of metrics have been developed • BLEU is the most popular one • METEOR and TER are close behind
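To make "distance to the reference" concrete, here is a simplified sentence-level BLEU: clipped n-gram precisions for n = 1..4, their geometric mean, and a brevity penalty. It is a sketch under stated assumptions (one reference, whitespace tokens, no smoothing, no corpus-level aggregation), not the scoring script used in the experiments; the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hyp, ref, n):
    """Clipped n-gram precision: each hypothesis n-gram is credited at most
    as many times as it occurs in the reference."""
    hyp_counts = ngrams(hyp, n)
    ref_counts = ngrams(ref, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total


def simple_bleu(hyp, ref, max_n=4):
    """Unsmoothed single-reference BLEU for one sentence pair."""
    precisions = [modified_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_avg)


reference = "Maria did not slap the green witch".split()
hypothesis = "Maria did not slap the witch".split()
score = simple_bleu(hypothesis, reference)
print(round(score, 3))
```

Real evaluations compute BLEU over a whole test set, where n-gram counts are pooled before the precisions are taken.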
Original vs. Translated Texts • Given this simplified model: [Diagram with components: Source Text, TM, LM, Target Text] • Two points are made with regard to the “intermediate component” (TM and LM): • The TM is blind to direction (but see Kurokawa et al., 2009) • LMs are based on originally written texts.
Original vs. Translated Texts • Translated texts are ontologically different from non-translated texts; they generally exhibit: • Simplification of the message, the grammar, or both (Al-Shabab, 1996; Laviosa, 1998) • Explicitation, the tendency to spell out implicit utterances that occur in the source text (Blum-Kulka, 1986)
Original vs. Translated Texts • Translated texts can be distinguished from non-translated texts with high accuracy (87% and more) • For Italian (Baroni & Bernardini, 2006) • For Spanish (Ilisei et al., 2010) • For English (Koppel & Ordan, 2011)
How Does Translation Direction Affect MT? • Language Models • Our work (accepted to EMNLP) shows that LMs built from translated texts are better for MT systems than LMs built from original texts. • Translation Models • Kurokawa et al. (2009) showed that when translating French into English it is better to use a French-translated-to-English parallel corpus, and vice versa. • This work supports this claim and extends it (in review for WMT)
Our Setup • Canadian Hansard corpus: parallel French-English corpus • 80% Original English (EO) • 20% Original French (FO) • The ‘source’ language is marked • Two scenarios: • Balanced: 750K FO sentences and 750K EO sentences • Biased: 750K FO sentences and 3M EO sentences • MOSES PB-SMT toolkit • Tuning & Evaluation: • 1000 FO sentences for tuning and 5000 FO sentences for evaluation
Baseline Experiments • We translate French-to-English • EO – train the phrase-table on the EO portion of the parallel corpus • FO – train the phrase-table on the FO portion of the parallel corpus • FO+EO – train the phrase-table on the entire parallel corpus
SystemA: Two Phrase-Tables • EO – train the phrase-table on EO portion of the parallel corpus • FO – train the phrase-table on FO portion of the parallel corpus • SystemA – let MOSES use both phrase-tables • Log-linear model training gives each phrase-table different scores
SystemA Results • In the balanced scenario we gained 1.29 BLEU • In the biased scenario we gained 0.69 BLEU • The cost is the decoding time and memory resources
Looking Inside… • Complete table – a phrase-table after training • Filtered table – a phrase-table that contains only phrases that appear in the evaluation set
A Few Observations… / 1 • Balanced Set / Complete tables • The FO table has many more unique French phrases (15.8M vs. 13M) • The EO table has more translation options per source phrase (1.42 vs. 1.33) • The source phrases in the intersection are shorter (3.76 vs. 5.07-5.16), but they have more translations (3.08-3.21 vs. 1.09-1.10)
A Few Observations… / 2 • Balanced Set / Filtered tables • The intersection comprises 96.1% of the translation phrase-pairs in the FO table and 98.3% of the translation phrase-pairs in the EO table.
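The table statistics on these slides (filtering, intersection share, translations per source phrase) can be sketched on toy data. The sketch assumes a phrase-table is a mapping from a source phrase to its set of target phrases, and that "intersection" means source phrases present in both tables; the tiny tables, the `max_len` limit, and all function names are hypothetical.

```python
# Toy phrase-table statistics. A table maps source phrase -> set of target phrases.

def filter_table(table, eval_sentences, max_len=7):
    """Keep only entries whose source phrase occurs in an evaluation sentence."""
    needed = set()
    for sentence in eval_sentences:
        tokens = sentence.split()
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
                needed.add(" ".join(tokens[i:j]))
    return {src: tgts for src, tgts in table.items() if src in needed}


def intersection_share(table, other):
    """Fraction of phrase-pairs in `table` whose source phrase is also in `other`."""
    total = sum(len(tgts) for tgts in table.values())
    shared = sum(len(tgts) for src, tgts in table.items() if src in other)
    return shared / total if total else 0.0


def avg_translations(table):
    """Average number of translation options per source phrase."""
    return sum(len(tgts) for tgts in table.values()) / len(table) if table else 0.0


# Invented miniature FO and EO tables.
fo = {
    "la maison": {"the house"},
    "maison": {"house", "home"},
    "chambre des communes": {"house of commons"},
}
eo = {
    "maison": {"house", "home", "household"},
    "la maison": {"the house", "the home"},
    "le chat": {"the cat"},
}
eval_sentences = ["la maison est grande"]

fo_filtered = filter_table(fo, eval_sentences)
eo_filtered = filter_table(eo, eval_sentences)
print(intersection_share(fo, eo), intersection_share(fo_filtered, eo_filtered))
print(avg_translations(fo), avg_translations(eo))
```

As on the slide, filtering to the evaluation set pushes the intersection share up, since source phrases found in only one table are rarely the ones the test sentences need.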
A Few Observations… / 3 • Biased Set – we added 2,250,000 English-original sentences. What happens? • In the ‘complete’ EO table – everything grows • In the filtered tables • the number of phrase-pairs increases by a factor of 3 • the number of unique source phrases increases by 1/3 • Coverage of French phrases hasn’t improved by much • The average number of translations increases by a factor of 2.3 (from 13.2 to 30.3) • Long tail – the probability is split among a larger number of translations. Good translations get lower probability than in the FO table
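The long-tail effect follows directly from relative-frequency estimation, where the probability of a translation is its count divided by the total count for the source phrase. The sketch below uses invented counts for a single French phrase to show how extra translation options dilute the probability of a good translation; it is not data from the experiments.

```python
from collections import Counter


def translation_probs(pair_counts):
    """Relative-frequency estimate: count(f, e) / sum over e' of count(f, e')."""
    total = sum(pair_counts.values())
    return {tgt: count / total for tgt, count in pair_counts.items()}


# Hypothetical counts of English translations for one French source phrase.
fo_counts = Counter({"house of commons": 9, "the house": 1})
eo_counts = Counter({
    "house of commons": 12,
    "the house": 4,
    "commons": 2,
    "the chamber": 1,
    "this house": 1,
})

fo_probs = translation_probs(fo_counts)
eo_probs = translation_probs(eo_counts)
print(fo_probs["house of commons"], eo_probs["house of commons"])
```

Although the good translation is seen more often in the larger table, its probability is lower there because the mass is shared with more alternatives.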
How does MOSES Select Phrases? • Balanced Set • 96.5% comes from FO table • 99.3% of the phrase-pairs selected from the intersection originated in the FO table • Biased Set • 94.5% comes from FO table • 98.2% of the phrase-pairs selected from the intersection originated in the FO table
The tuning effect /1 • A question: Is the FO phrase-table better than the EO phrase-table to begin with, or does it become better during tuning? • Let’s test SystemA with the initial (pre-tuning) configuration and with the configuration generated by tuning.
The tuning effect /2 • Balanced Set / Before tuning • Only 58% comes from the FO table • 57.7% of the phrase-pairs selected from the intersection originated in the FO table • Balanced Set / After tuning • 95.4% comes from the FO table • 99.3% of the phrase-pairs selected from the intersection originated in the FO table
The tuning effect /3 • The decoder prefers the FO table in the initial configuration (58%). • The preference becomes much stronger after tuning (95.4%) • Interestingly, the decoder doesn’t just replace EO phrases with FO phrases; it searches for longer phrases • The average length of a phrase selected from the EO table increases by about 1.5 words.
New Experiment: SystemB • Based on these results, we can throw away the intersection subset of the EO phrase-table • We expect a small loss in quality, but a significant improvement in translation speed.
What about a classified corpus? • Annotation of the source language is rarely available in parallel corpora. • Will our SystemA and SystemB outperform the FO+EO and FO MT systems? • We use an SVM for classification, and our features are punctuation marks and n-grams of part-of-speech tags. • We train the classifier on an English-French subset of the Europarl corpus. • Accuracy is about 73.5%
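The two feature families named on this slide can be sketched as a plain count-based extractor. This is an assumption-laden illustration: the input is taken to be already tokenized and POS-tagged, the n-gram range (`max_n=3`) and the feature-name prefixes are hypothetical, and the SVM training step itself is not shown.

```python
import string
from collections import Counter


def extract_features(tokens, pos_tags, max_n=3):
    """Count punctuation tokens and POS-tag n-grams for one sentence.

    `tokens` and `pos_tags` must be aligned lists of equal length.
    """
    if len(tokens) != len(pos_tags):
        raise ValueError("tokens and pos_tags must have the same length")
    feats = Counter()
    # Punctuation features: one count per punctuation-only token.
    for tok in tokens:
        if tok and all(ch in string.punctuation for ch in tok):
            feats["punct=" + tok] += 1
    # POS n-gram features for n = 1..max_n.
    for n in range(1, max_n + 1):
        for i in range(len(pos_tags) - n + 1):
            feats["pos=" + "_".join(pos_tags[i:i + n])] += 1
    return feats


# Hypothetical pre-tagged sentence (tags in Penn Treebank style).
tokens = ["Mr.", "Speaker", ",", "I", "agree", "."]
pos_tags = ["NNP", "NNP", ",", "PRP", "VBP", "."]
feats = extract_features(tokens, pos_tags)
print(feats.most_common(5))
```

Count vectors of this kind, built per sentence or per chunk of text, are what a linear classifier such as an SVM would take as input.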