Analyzing the Use of Character-Level Translation with Sparse and Noisy Datasets
Jörg Tiedemann, Uppsala University
Preslav Nakov, Qatar Computing Research Institute
RANLP'2013, September 10, 2013, Hissar, Bulgaria
Statistical Machine Translation (SMT): Trained on Bi-texts
English example: Reach Out to Asia (ROTA) has announced its fifth Wheels 'n' Heels, Qatar's largest annual community event, which will promote ROTA's partnership with the Qatar Japan 2012 Committee. Held at the Museum of Islamic Art Park on 10 February, the event will celebrate 40 years of cordial relations between the two countries. Essa Al Mannai, ROTA Director, said: "A group of 40 Japanese students are traveling to Doha especially to take part in our event."
SMT systems:
• learn from human-generated translations
• extract useful knowledge and build models
• use the models to translate new sentences
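As background (the standard noisy-channel formulation underlying phrase-based SMT, not shown on the slide itself): the decoder picks the target sentence e that maximizes the product of a translation model, estimated from the bi-text, and a language model, estimated from monolingual target data.

```latex
\hat{e} = \arg\max_{e} P(e \mid f)
        = \arg\max_{e} \; \underbrace{P(f \mid e)}_{\text{translation model}} \, \underbrace{P(e)}_{\text{language model}}
```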
The Problem: Not Enough Training Data for Most Language Pairs
Zipfian distribution of language resources
The Lack of Training Bi-texts is a Big Issue: Macedonian→English SMT
• Ref: It's a simple matter of self-preservation.
• SMT: It's simply a question of себесочувување. (the Macedonian word for "self-preservation" is left untranslated)
• Ref: Your girlfriend's very cynical.
• SMT: Пријателката цинична you very much. (Macedonian for "the girlfriend [is] cynical" is left untranslated)
Typical Solution: Pivoting
Macedonian: Никогаш не сум преспала цела сезона.
Bulgarian: Никога не съм спала цял сезон.
English: I've never slept for an entire season.
• For related languages:
• exploit subword transformations
• use character-level translation?
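A minimal sketch of sentence-level pivoting, assuming two hypothetical functions (translate_mk_bg and translate_bg_en are placeholder stand-ins for two independently trained SMT systems, not part of the paper's toolchain):

```python
# Pivoting sketch: bridge MK->EN through the closely related pivot BG.

def translate_mk_bg(sentence_mk: str) -> str:
    """Placeholder for a Macedonian->Bulgarian SMT system."""
    raise NotImplementedError

def translate_bg_en(sentence_bg: str) -> str:
    """Placeholder for a Bulgarian->English SMT system."""
    raise NotImplementedError

def pivot_translate(sentence_mk: str) -> str:
    # Cascade: first hop into the pivot language, then into the target.
    sentence_bg = translate_mk_bg(sentence_mk)
    return translate_bg_en(sentence_bg)
```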
Character-Level SMT • MK: Никогаш не сум преспала цела сезона. • BG: Никога не съм спала цял сезон. • MK: Н и к о г а ш _ н е _ с у м _ п р е с п а л а _ ц е л а _ с е з о н а _ . • BG: Н и к о г а _ н е _ с ъ м _ с п а л а _ ц я л _ с е з о н _ .
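A small sketch (assumed, not taken from the paper's scripts) of the character-level preprocessing shown above: spaces become visible as "_" and every character becomes its own token, so a standard word-level SMT pipeline can operate on characters.

```python
def to_char_level(sentence: str) -> str:
    """Turn a word-level sentence into a character-level token stream.

    Spaces are made visible as '_' so the aligner and decoder can learn
    word-boundary behavior; every character becomes a separate token.
    """
    return " ".join("_" if ch == " " else ch for ch in sentence)

def from_char_level(tokens: str) -> str:
    """Invert the transformation after decoding."""
    return "".join(" " if tok == "_" else tok for tok in tokens.split(" "))

print(to_char_level("Никогаш не сум преспала цела сезона ."))
# Н и к о г а ш _ н е _ с у м _ п р е с п а л а _ ц е л а _ с е з о н а _ .
```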
Character-Level Phrase Pairs
Can cover:
• word prefixes/suffixes
• entire words
• word sequences
• combinations thereof
Settings: max-phrase-length = 10, LM order = 10
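To make the coverage claim concrete, here is an assumed illustration (not from the paper) that enumerates the character "phrases" of up to 10 tokens in a segmented sentence; affixes, whole short words, and even spans crossing a word boundary all fit within the limit.

```python
def char_phrases(tokens: list[str], max_len: int = 10) -> set[str]:
    """All contiguous character-token spans of up to max_len tokens,
    i.e., the source-side units a character-level phrase table can store."""
    spans = set()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            spans.add(" ".join(tokens[i:j]))
    return spans

tokens = "Н и к о г а _ н е".split()
phrases = char_phrases(tokens)
print("н е" in phrases)                # True: a whole short word fits
print("Н и к о г а _ н е" in phrases)  # True: a 9-token span covering two words fits
```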
Data: OPUS movie subtitles (cleansed & realigned)
• Training: varying amounts (see the data-size experiments below)
• Development: 10K sentences
• Test: 10K sentences
Character Alignment and Phrase Table Filtering: Macedonian→Bulgarian character-level SMT
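A hedged sketch of score-based phrase table filtering. Assumptions: a Moses-style table with "|||"-separated fields whose third field holds the phrase scores in the usual order (phi(f|e), lex(f|e), phi(e|f), lex(e|f)); the paper's actual filtering criterion may differ.

```python
def filter_phrase_table(in_path: str, out_path: str, min_prob: float = 1e-4) -> None:
    """Drop noisy phrase pairs produced by bad character alignments."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fields = line.split(" ||| ")
            if len(fields) < 3:
                continue  # skip malformed entries
            scores = [float(s) for s in fields[2].split()]
            # Keep a pair only if both direction probabilities clear the bar.
            if len(scores) >= 3 and min(scores[0], scores[2]) >= min_prob:
                fout.write(line)
```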
The Impact of Data Size: MK→BG (Macedonian→Bulgarian)
Optimizing MK→BG→EN Pivot SMT: Local vs. Global Tuning
combined = baseline + word-based + char-based
global tuning based on 20 x 20 n-best lists
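A sketch (assumed, not the authors' code) of how the 20 x 20 n-best lists combine: produce 20 MK→BG hypotheses, expand each into 20 BG→EN hypotheses, and score the 400 end-to-end candidates; global tuning then optimizes weights against the final English reference rather than tuning each hop locally. nbest_mk_bg and nbest_bg_en are hypothetical stand-ins for Moses n-best output.

```python
def nbest_mk_bg(sentence_mk: str, n: int = 20) -> list[tuple[str, float]]:
    """Placeholder: (hypothesis, model_score) pairs for MK->BG."""
    raise NotImplementedError

def nbest_bg_en(sentence_bg: str, n: int = 20) -> list[tuple[str, float]]:
    """Placeholder: (hypothesis, model_score) pairs for BG->EN."""
    raise NotImplementedError

def global_nbest(sentence_mk: str, n: int = 20) -> list[tuple[str, float]]:
    """Combine 20 x 20 n-best lists into end-to-end EN candidates.

    Summed model scores stand in for the full feature vectors that global
    tuning (e.g., MERT) would actually reweight against the EN reference.
    """
    candidates = []
    for hyp_bg, score1 in nbest_mk_bg(sentence_mk, n):
        for hyp_en, score2 in nbest_bg_en(hyp_bg, n):
            candidates.append((hyp_en, score1 + score2))
    return sorted(candidates, key=lambda c: -c[1])
```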
Pivot Languages for MK→??→EN SMT
Candidate pivots, ordered by closeness to Macedonian: CZ, SL, SR, BG
Varying the Training Data Size: MK→XX Pivoting (baseline MK→EN = 22.33 BLEU)
Using Synthetic Data
Translate the Bulgarian side of a BG-EN corpus, and more generally of any BG-XX corpus, into Macedonian to obtain synthetic MK-XX training data.
All synthetic data combined (+mk-en): 36.69 BLEU
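A minimal sketch of the synthetic-data idea, assuming a hypothetical translate_bg_mk function (a stand-in for the character-level BG→MK system):

```python
def translate_bg_mk(sentence_bg: str) -> str:
    """Placeholder for the character-level BG->MK system."""
    raise NotImplementedError

def synthesize_bitext(bg_xx_pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """(bg, xx) pairs in -> (synthetic mk, xx) pairs out."""
    return [(translate_bg_mk(bg), xx) for bg, xx in bg_xx_pairs]

# The synthetic MK-XX pairs are then concatenated with the real MK-EN
# bi-text before training the final MK->EN system.
```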
Conclusion and Future Work
Findings:
• character alignment: use bigrams! (+0.4 BLEU)
• phrase table filtering: removes noise! (+0.5 BLEU)
• global tuning: better than local! (+1.0 BLEU)
• bi-text size: character-level beats word-level with little data! (+3.0 BLEU)
• choice of pivot language: closer is better!
• synthetic data: better than pivoting (+2.5 BLEU)
• results confirmed by manual evaluation
• Overall: +14 BLEU
Future Work:
• robustness of character-level models
• domain shifts
• noisy input: spelling, tokenization, etc.
• other language pairs
Thank you!
Thanks to Petya Kirova and Veno Pacovski for the manual judgments.