English-Persian SMT

English-Persian SMT Reza Saeedi Reza.saeedi@stu-mail.um.ac.ir WTLAB Wednesday, May 25, 2011

Outline • MT Introduction • SMT Introduction • Requirements for SMT • Evaluation metrics • English-Persian MT challenges • English-Persian SMT • System1 • System2 • Problems in English-Persian SMT

MT Introduction • Machine Translation (MT) is the automatic translation of text written in one natural language into another by computer. • There are several ways to do this: • Dictionary-based • Rule-based • Example-based • Statistical

SMT Introduction • The first ideas of statistical machine translation were proposed by Warren Weaver in 1947. • Statistical machine translation tries to learn how to translate by examining translations made by humans.

SMT Introduction(Cont.) • Statistical MT models take the view that every sentence in the target language is a translation of the source language sentence with some probability. • The best translation, of course, is the sentence that has the highest probability. • The key problems in statistical MT are: • estimating the probability of a translation • and efficiently finding the sentence with the highest probability.
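The two key problems above are commonly written down with Bayes' rule (the noisy-channel formulation). As a sketch, assuming f denotes the source sentence and e a candidate target sentence, in line with the P(e) and P(f|e) notation used on the requirements slides:

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
        \;=\; \arg\max_{e} P(f \mid e)\, P(e)
```

P(f) can be dropped because it is constant for a given input sentence. Estimating P(f|e) and P(e) is the probability-estimation problem; the arg max over e is the search problem.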

SMT Introduction(Cont.) • Philipp Koehn • http://homepages.inf.ed.ac.uk/pkoehn

Why SMT? • Better use of resources • No linguistic knowledge is needed • It can be used for any language pair • But • It needs a large training corpus

Steps of SMT

Requirements for SMT • Bilingual and monolingual corpora: • The bilingual corpus needs two files aligned sentence by sentence (one file for the source language and one for the target language) • Microsoft Bilingual Sentence Aligner • Language model: • We need a tool to compute P(e) • This step needs a monolingual corpus • SRILM: a tool for building N-gram language models
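To make the role of P(e) concrete, here is a minimal sketch of a bigram language model with add-one smoothing. It is not how SRILM works internally (SRILM offers more refined smoothing methods); the function names and the toy corpus are assumptions made for illustration only.

```python
from collections import Counter
import math

def train_bigram_lm(sentences):
    """Count histories and bigrams over whitespace-tokenized sentences."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(tokens[:-1])                 # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # adjacent token pairs
    # Words that can be predicted: all corpus words plus the end marker.
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return histories, bigrams, len(vocab)

def log_prob(sentence, histories, bigrams, vocab_size):
    """Log P(e) under the bigram model with add-one (Laplace) smoothing."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (histories[prev] + vocab_size))
    return total

# Toy monolingual corpus (illustrative only).
corpus = ["the cat sat", "the dog sat", "the cat ran"]
model = train_bigram_lm(corpus)
fluent = log_prob("the cat sat", *model)
scrambled = log_prob("sat cat the", *model)
print(fluent, scrambled)
```

The word order seen in the corpus receives a higher log-probability than the scrambled one, which is the property the decoder relies on when it prefers fluent output.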

LM output

Requirements for SMT • Translation model: • We need a tool to compute P(f|e) • This step needs a bilingual corpus • GIZA++ • Its word alignments are used to build the phrase table • Decoder: • Searches for the best translation • Moses
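GIZA++ trains the IBM alignment models. As a rough illustration of how word-translation probabilities P(f|e) can be learned from sentence-aligned text alone, the following is a minimal sketch of IBM Model 1 (the simplest of those models) trained with EM. It omits the NULL word and everything else GIZA++ adds; the function name and the transliterated toy sentence pairs are assumptions for illustration.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Estimate t(f|e) with EM from (f_tokens, e_tokens) sentence pairs."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)          # start from a uniform table
    for _ in range(iterations):
        count = defaultdict(float)            # expected count of (f, e) links
        total = defaultdict(float)            # expected count of e
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm   # E-step: soft alignment of f to e
                    count[(f, e)] += frac
                    total[e] += frac
        for (f, e), c in count.items():       # M-step: renormalize per e
            t[(f, e)] = c / total[e]
    return t

# Toy transliterated Persian / English pairs (illustrative only).
pairs = [
    (["in", "khaneh"], ["this", "house"]),
    (["in", "ketab"], ["this", "book"]),
    (["yek", "ketab"], ["a", "book"]),
]
t_table = train_ibm_model1(pairs)
print(t_table[("ketab", "book")], t_table[("in", "book")])
```

Even though no word alignment is given, the co-occurrence pattern across the three pairs is enough for EM to concentrate probability mass on the pairing of "ketab" with "book".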

Phrase table

Moses tool

The training steps • Prepare data • Run GIZA++ • Align words • Get lexical translation table • Extract phrases • Score phrases • Build reordering model • Build generation models • Create configuration file
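The "Score phrases" step can be sketched as a relative-frequency estimate over the extracted phrase pairs, phi(f|e) = count(f, e) / count(e). This is a simplified illustration, not the Moses implementation: the function name and the toy list of extracted pairs are assumptions, and a real phrase table carries several scores per entry.

```python
from collections import Counter

def score_phrases(extracted_pairs):
    """Relative-frequency estimate phi(f|e) = count(f, e) / count(e)."""
    pair_counts = Counter(extracted_pairs)
    e_counts = Counter(e for _, e in extracted_pairs)
    return {(f, e): c / e_counts[e] for (f, e), c in pair_counts.items()}

# Toy output of the phrase-extraction step, as (f_phrase, e_phrase) pairs
# (transliterated, illustrative only).
extracted = (
    [("ketab", "book")] * 3
    + [("ketab ra", "book")]
    + [("khaneh", "house")]
)
scores = score_phrases(extracted)
print(scores[("ketab", "book")])
```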

Evaluation metrics • BLEU (BiLingual Evaluation Understudy) • Developed at IBM • The closer a machine translation is to a professional human translation, the better it is • NIST
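As a sketch of how BLEU measures closeness to a human translation, the function below computes clipped n-gram precisions, combines them with a geometric mean, and applies the brevity penalty. It is a simplified sentence-level version with a single reference and no smoothing, written for illustration; standard BLEU is computed over a whole test set and may use several references. The function names are assumptions.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against one reference."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0                      # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat"
perfect = bleu("the cat sat on the mat", reference)
repeated = bleu("the the the the", reference, max_n=1)
print(perfect, repeated)
```

The second call shows why counts are clipped: repeating a common word four times only earns credit for the two occurrences present in the reference.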

English-Persian MT challenges • The structure of Persian is very different from that of English • The structure of Persian is very complex • Little previous work has been done on this language pair • Effective SMT systems rely on very large bilingual corpora, but these are not readily available for the English-Persian language pair

English-Persian SMT • Few English-Persian MT systems have been developed • Most of them are purely rule-based • There are two works on English-Persian SMT • Mohaghegh and Sarrafzadeh (Massey University) • Pilevar and Faili (Tehran University)

System1 • Corpus: BBC news

System1(Cont.) • Tools: SRILM, GIZA++, Moses

System1: Improved Language Modeling

System2 • Corpus: • Bidirectional (TEP): film subtitles, 3 books, KDE4

System2(Cont.) • Corpus: • Monolingual: Hamshahri, film subtitles

System2(Cont.) • Tools: SRILM, GIZA++, Moses • PersianSMT with 4-gram Sub-LM

Comparison of PersianSMT with Google Translate

Problems in English-Persian SMT • Compound verbs (alignment problem) • Use a phrase-based SMT system • But inflectional morphology remains a problem • The large number of inflected verb forms prevents the system from learning to translate every individual form of a compound verb • Personal pronouns are optional in Persian sentences (alignment problem)

Problems(Cont.) • Failure of the system to place the elements of the sentence in the right order • Use a phrase-based SMT system • Re-rank the n-best output list and/or reorder the output sentences • Reorder the input sentence before translation using morpho-syntactic information, so that its word order better resembles that of the target language
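The n-best re-ranking idea can be sketched as rescoring each hypothesis with a weighted sum of features and sorting. Everything specific here is hypothetical: the feature names (`decoder`, `verb_final`), the weights, the scores, and the transliterated hypotheses are invented for illustration and do not come from either system described in these slides.

```python
def rerank(nbest, feature_weights):
    """Sort hypotheses by a weighted sum of their feature scores, best first."""
    def total(hyp):
        return sum(feature_weights[name] * value
                   for name, value in hyp["features"].items())
    return sorted(nbest, key=total, reverse=True)

# Hypothetical 2-best list: the decoder slightly prefers the wrong word order.
nbest = [
    {"text": "man khandam ketab ra",
     "features": {"decoder": -4.1, "verb_final": 0.0}},
    {"text": "man ketab ra khandam",
     "features": {"decoder": -4.3, "verb_final": 1.0}},
]
# Hypothetical weights: an extra feature rewards verb-final Persian output.
weights = {"decoder": 1.0, "verb_final": 0.5}
best = rerank(nbest, weights)[0]
print(best["text"])
```

With these made-up numbers the verb-final hypothesis overtakes the decoder's first choice (-3.8 versus -4.1), which is the effect re-ranking aims for when the decoder's own model orders words poorly.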

References • [1] A. Ramanathan, "Statistical Machine Translation", Ph.D. Seminar Report, Department of Computer Science and Engineering, Indian Institute of Technology, 2000. • [2] A. Lopez, "Statistical Machine Translation", ACM Computing Surveys, 2008. • [3] M. Mohaghegh and A. Sarrafzadeh, "The first English-Persian statistical machine translation", New Zealand Postgraduate Conference, 2009. • [4] M. Mohaghegh and A. Sarrafzadeh, "An analysis of the effect of training data variation in English-Persian Statistical Machine Translation", 2009 International Conference on Innovations in Information Technology (IIT 2009). • [5] M. Mohaghegh and A. Sarrafzadeh, "Performance evaluation of various training data in English-Persian statistical machine translation", in Proceedings of the 10th International Conference on the Statistical Analysis of Textual Data (JADT 2010), Rome, Italy, June 9-11, 2010. • [6] M. Mohaghegh and A. Sarrafzadeh, "Improved Language Modeling for English-Persian Statistical Machine Translation", COLING 2010 / SIGMT Workshop, 23rd International Conference on Computational Linguistics, Beijing, China, 28 August 2010.

References(Cont.) • [7] M.T. Pilevar and H. Faili, "PersianSMT: A First Attempt to English-Persian Statistical Machine Translation", to appear in Proceedings of the 10th International Conference on the Statistical Analysis of Textual Data (JADT 2010).
