English-Persian SMT. Reza Saeedi Reza.saeedi@stu-mail.um.ac.ir. WTLAB. Wednesday, May 25, 2011. Outline. MT Introduction SMT Introduction Requirements for SMT Evaluation metrics English-Persian MT challenges English-Persian SMT System1 System2 Problems in English-Persian SMT.
English-Persian SMT Reza Saeedi Reza.saeedi@stu-mail.um.ac.ir WTLAB Wednesday, May 25, 2011
Outline • MT Introduction • SMT Introduction • Requirements for SMT • Evaluation metrics • English-Persian MT challenges • English-Persian SMT • System1 • System2 • Problems in English-Persian SMT
MT Introduction • Machine Translation is the automatic translation, by computer, of text written in one natural language into another. • There are several ways to do this: • Dictionary-based • Rule-based • Example-based • Statistical approach
SMT Introduction • The first ideas of statistical machine translation were proposed by Warren Weaver in 1947. • Statistical machine translation tries to learn how to translate by examining translations made by humans.
SMT Introduction(Cont.) • Statistical MT models take the view that every sentence in the target language is a translation of the source language sentence with some probability. • The best translation, of course, is the sentence that has the highest probability. • The key problems in statistical MT are: • estimating the probability of a translation • and efficiently finding the sentence with the highest probability.
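SMT Introduction(Cont.) • Given a source sentence f, we seek the target sentence e that maximizes P(e | f): $\hat{e} = \arg\max_e P(e \mid f)$ • Intuitively, P(e | f) should depend on two factors. By Bayes' rule: $P(e \mid f) = P(e)\,P(f \mid e) / P(f)$ • Because P(f) is the same for every candidate e, it can be dropped from the maximization: $\arg\max_e P(e \mid f) = \arg\max_e P(e)\,P(f \mid e)$ • P(e) measures fluency; P(f | e) measures faithfulness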
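The decision rule above can be illustrated with a minimal sketch. The probabilities, the candidate list, and the simplified Persian transliterations below are made-up toy values, not the output of any real model; the function name `best_translation` is hypothetical.

```python
import math

# Toy, made-up values for illustration only.
# lm[e] plays the role of P(e); tm[(f, e)] plays the role of P(f | e).
source = "the house is small"
candidates = ["khane kuchak ast", "kuchak khane ast", "khane ast kuchak"]

lm = {
    "khane kuchak ast": 0.02,     # fluent word order gets a high P(e)
    "kuchak khane ast": 0.0001,
    "khane ast kuchak": 0.001,
}
tm = {
    (source, "khane kuchak ast"): 0.3,
    (source, "kuchak khane ast"): 0.4,
    (source, "khane ast kuchak"): 0.3,
}

def best_translation(f, candidates, lm, tm):
    """Return the candidate e that maximizes P(e) * P(f | e).

    Log-probabilities are summed instead of multiplying raw probabilities,
    which avoids numerical underflow on long sentences.
    """
    return max(candidates, key=lambda e: math.log(lm[e]) + math.log(tm[(f, e)]))

best = best_translation(source, candidates, lm, tm)
```

Note that the second candidate has the highest P(f | e) but loses because its P(e) is very low: the language model vetoes a faithful but disfluent output. A real decoder does not enumerate candidates like this; it searches a very large space of partial hypotheses.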
SMT Introduction(Cont.) • Philipp Koehn • http://homepages.inf.ed.ac.uk/pkoehn
Why SMT? • Better use of resources • No linguistic knowledge is needed • It can be used for any language pair • But • We need a large training corpus
Requirements for SMT • Bilingual and monolingual corpora: • The bilingual corpus needs two files aligned sentence by sentence (one file for the source language and one for the target language) • Microsoft Bilingual Sentence Aligner • Language model: • We need a tool to compute P(e) • This step needs a monolingual corpus • SRILM: a toolkit for building n-gram language models
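To make the role of the language model concrete, here is a minimal sketch of a bigram model with add-one smoothing that assigns P(e) to a sentence. It is an illustration only and not how SRILM works internally; SRILM supports higher orders and more refined smoothing. The two training sentences and the function names are assumptions made for this example.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Count bigrams and their left contexts over a tokenized monolingual corpus."""
    contexts = Counter()
    bigrams = Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])               # every word that can be predicted
        contexts.update(tokens[:-1])           # every word that acts as a context
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab

def sentence_prob(sentence, model):
    """P(e) as a product of add-one smoothed bigram probabilities."""
    contexts, bigrams, vocab = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))
    return prob

# Toy monolingual corpus (hypothetical).
model = train_bigram_lm(["the house is small", "the house is big"])
fluent = sentence_prob("the house is small", model)
scrambled = sentence_prob("small is house the", model)
```

The fluent sentence receives a much higher probability than the scrambled one, which is exactly the signal the decoder uses to prefer well-formed output.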
Requirements for SMT • Translation model: • We need a tool to compute P(f|e) • This step needs a bilingual corpus • GIZA++ (word alignment) • The phrase table is built from the word alignments produced by this tool • Decoder: • Searches for the best translation • Moses
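GIZA++ trains word alignment models of the IBM family. As a rough illustration of how word-level P(f | e) can be estimated from a sentence-aligned corpus, here is a minimal sketch of IBM Model 1 trained with EM. It is not GIZA++'s implementation: it omits the NULL word and the higher-order models, and the three sentence pairs are toy data with simplified Persian transliterations chosen for this example.

```python
from collections import defaultdict

def ibm_model1(corpus, iterations=10):
    """Estimate t[(f, e)] ~ P(f | e) with EM.

    corpus: list of (f_tokens, e_tokens) sentence pairs.
    """
    f_vocab = {f for f_tokens, _ in corpus for f in f_tokens}
    # Start from a uniform distribution over source words.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (f, e) links
        total = defaultdict(float)   # expected count of links from e
        # E-step: distribute each source word over the target words of its pair.
        for f_tokens, e_tokens in corpus:
            for f in f_tokens:
                norm = sum(t[(f, e)] for e in e_tokens)
                for e in e_tokens:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalize the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

# Toy parallel corpus (hypothetical): English source, transliterated Persian target.
corpus = [
    ("big house".split(), "khane bozorg".split()),
    ("big book".split(), "ketab bozorg".split()),
    ("small book".split(), "ketab kuchak".split()),
]
t = ibm_model1(corpus)
```

Even though no word-level links are given, co-occurrence across sentence pairs is enough for EM to concentrate probability on the pairs house/khane, big/bozorg, book/ketab and small/kuchak.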
The training steps • Prepare data • Run GIZA++ • Align words • Get lexical translation table • Extract phrases • Score phrases • Build reordering model • Build generation models • Create configuration file
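The "extract phrases" step can be sketched as follows. A phrase pair is kept only if it is consistent with the word alignment, meaning that no word inside the pair is aligned to a word outside it. This is a simplified sketch, not the Moses implementation: it does not extend phrases over unaligned boundary words, and the sentence pair, alignment, and `max_len` value are assumptions made for the example.

```python
def extract_phrases(src, tgt, alignment, max_len=3):
    """Return the phrase pairs that are consistent with a word alignment.

    src, tgt: token lists; alignment: set of (i, j) meaning src[i] ~ tgt[j].
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: nothing inside the target span may link outside the source span.
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs

# Toy example (hypothetical): "is" and "small" swap places in the target.
src = "house is small".split()
tgt = "khane kuchak ast".split()
alignment = {(0, 0), (1, 2), (2, 1)}
pairs = extract_phrases(src, tgt, alignment)
```

Here "house is" is rejected, because its target span would have to include "kuchak", which is aligned to "small" outside the source span. In the scoring step, the extracted pairs are then counted over the whole corpus and turned into relative-frequency probabilities.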
Evaluation metrics • BLEU (BiLingual Evaluation Understudy) • Developed at IBM • The closer a machine translation is to a professional human translation, the better it is • NIST
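As a rough illustration of how BLEU measures closeness to a human translation, here is a minimal sentence-level sketch: clipped n-gram precisions combined by a geometric mean and multiplied by a brevity penalty. It makes simplifying assumptions that are not part of the metric as normally reported: a single reference, whitespace tokenization, no smoothing, and a score per sentence rather than over a whole test corpus.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference, without smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0  # any zero precision makes the geometric mean zero
        log_precision += math.log(clipped / sum(cand_counts.values())) / max_n
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(ref) / len(cand))
    return penalty * math.exp(log_precision)

reference = "the house is very small"
perfect = bleu(reference, reference)
partial = bleu("the house is small indeed", reference, max_n=2)
```

With `max_n=2`, the second candidate matches 4 of 5 unigrams and 2 of 4 bigrams, so its score is the geometric mean of 0.8 and 0.5. With the default `max_n=4` the same candidate scores 0, because it shares no 4-gram with the reference; this is why unsmoothed BLEU is usually computed over a whole corpus.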
English-Persian MT challenges • The structure of Persian is very different from that of English • The structure of the Persian language is very complex • Little previous work has been done for this language pair • Effective SMT systems rely on very large bilingual corpora, but these are not readily available for the English/Persian language pair
English-Persian SMT • Few English-Persian MT systems have been developed • Most of them are purely rule-based • There are two lines of work on English-Persian SMT • Mohaghegh and Sarrafzadeh (Massey University) • Pilevar and Faili (Tehran University)
System1 • Corpus: BBC news
System1(Cont.) • Tools: SRILM, GIZA++, Moses
System2 • Corpus: • Bidirectional (TEP): film subtitles, 3 books, KDE4
System2(Cont.) • Corpus: • Monolingual: Hamshahri, film subtitles
System2(Cont.) • Tools: SRILM, GIZA++, Moses PersianSMT with 4-gram Sub-LM
Problems in English-Persian SMT • Compound verbs (alignment problem) • Use a phrase-based SMT system • But inflectional morphology remains a problem • The large number of inflected verb forms prevents the system from learning to translate every individual form of a compound verb • Persian treats personal pronouns as an optional element of the sentence (alignment problem)
Problems(Cont.) • Failure of the system to place the elements of the sentence in the right order • Use a phrase-based SMT system • Re-rank the n-best output list and/or reorder the output sentences • Prior to translation, reorder the input sentence using morpho-syntactic information, so that its word order more closely resembles that of the target language.
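The pre-reordering idea can be illustrated with a deliberately tiny sketch: English is subject-verb-object while Persian is verb-final, so moving the verb to the end of a simple clause brings the source order closer to the target order. The single rule, the tag set, and the function name below are hypothetical; real systems rely on richer morpho-syntactic analysis and handle many more constructions.

```python
def reorder_svo_to_sov(tagged_clause):
    """Move verbs to the end of a single clause, keeping all other words in order.

    tagged_clause: list of (word, tag) pairs, where the tag "V" marks a verb.
    This toy rule assumes one simple clause with no subordinate clauses.
    """
    verbs = [word for word, tag in tagged_clause if tag == "V"]
    others = [word for word, tag in tagged_clause if tag != "V"]
    return others + verbs

# Hypothetical tagged input; the tags would come from a POS tagger in practice.
clause = [("Ali", "N"), ("bought", "V"), ("a", "D"), ("book", "N")]
reordered = reorder_svo_to_sov(clause)
```

After reordering, the English words appear in a verb-final order, so the word alignments learned in training become more monotone and the decoder has less long-distance reordering to do.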
References • [1] A. Ramanathan, "Statistical Machine Translation", Ph.D. Seminar Report, Department of Computer Science and Engineering, Indian Institute of Technology, 2000. • [2] A. Lopez, "Statistical Machine Translation", ACM Computing Surveys, 2008. • [3] M. Mohaghegh & A. Sarrafzadeh, "The first English-Persian statistical machine translation", New Zealand Postgraduate Conference, 2009. • [4] M. Mohaghegh & A. Sarrafzadeh, "An analysis of the effect of training data variation in English-Persian Statistical Machine Translation", 2009 International Conference on Innovations in Information Technology (IIT 2009). • [5] M. Mohaghegh & A. Sarrafzadeh, "Performance evaluation of various training data in English-Persian statistical machine translation", to appear in Proceedings of the 10th International Conference on the Statistical Analysis of Textual Data (JADT 2010), Rome, Italy, June 9-11, 2010. • [6] M. Mohaghegh & A. Sarrafzadeh, "Improved Language Modeling for English-Persian Statistical Machine Translation", COLING 2010 / SIGMT Workshop, 23rd International Conference on Computational Linguistics, Beijing, China, 28 August 2010.
References(Cont.) • [7] M.T. Pilevar & H. Faili, "PersianSMT: A First Attempt to English-Persian Statistical Machine Translation", to appear in Proceedings of the 10th International Conference on the Statistical Analysis of Textual Data (JADT 2010).