
English-Persian SMT

English-Persian SMT. Reza Saeedi Reza.saeedi@stu-mail.um.ac.ir. WTLAB. Wednesday, May 25, 2011. Outline. MT Introduction SMT Introduction Requirements for SMT Evaluation metrics English-Persian MT challenges English-Persian SMT System1 System2 Problems in English-Persian SMT.



  1. English-Persian SMT Reza Saeedi Reza.saeedi@stu-mail.um.ac.ir WTLAB Wednesday, May 25, 2011

  2. Outline • MT Introduction • SMT Introduction • Requirements for SMT • Evaluation metrics • English-Persian MT challenges • English-Persian SMT • System1 • System2 • Problems in English-Persian SMT

  3. MT Introduction • Machine translation (MT) is the automatic translation of text written in one natural language into another by computer. • There are several approaches: • Dictionary-based • Rule-based • Example-based • Statistical

  4. SMT Introduction • The first ideas for statistical machine translation were proposed by Warren Weaver in 1947. • Statistical machine translation tries to learn how to translate by examining translations made by humans.

  5. SMT Introduction(Cont.) • Statistical MT models take the view that every sentence in the target language is a translation of the source language sentence with some probability. • The best translation, of course, is the sentence that has the highest probability. • The key problems in statistical MT are: • estimating the probability of a translation • and efficiently finding the sentence with the highest probability.

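  6. SMT Introduction(Cont.) • Given a source sentence f, we seek the target sentence $\hat{e}$ that maximizes P(e | f): $\hat{e} = \arg\max_e P(e \mid f)$ • By Bayes' rule, P(e | f) depends on two factors: $P(e \mid f) = P(e)\,P(f \mid e) / P(f)$ • P(f) is fixed for a given input sentence, so it does not affect the maximization: $\arg\max_e P(e \mid f) = \arg\max_e P(e)\,P(f \mid e)$ • P(e) measures fluency (language model); P(f | e) measures faithfulness (translation model)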
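The argmax on slide 6 can be illustrated with a minimal Python sketch. It assumes a tiny, hand-written candidate list; the probabilities, the placeholder source string, and the function name `best_translation` are invented for illustration and do not come from the presentation. A real decoder searches a far larger space than a fixed list.

```python
import math

def best_translation(f, candidates, lm_prob, tm_prob):
    """Return the candidate e that maximizes log P(e) + log P(f | e)."""
    def score(e):
        # Log space avoids underflow when probabilities are small.
        return math.log(lm_prob[e]) + math.log(tm_prob[(f, e)])
    return max(candidates, key=score)

# Hypothetical toy values, not taken from any real model.
f = "<source sentence>"
lm_prob = {  # stands in for P(e), the fluency term
    "the house is small": 0.02,
    "small the house is": 0.0001,
    "the home is little": 0.005,
}
tm_prob = {  # stands in for P(f | e), the faithfulness term
    (f, "the house is small"): 0.3,
    (f, "small the house is"): 0.3,
    (f, "the home is little"): 0.4,
}

best = best_translation(f, list(lm_prob), lm_prob, tm_prob)
print(best)
```

In this toy setup the second candidate is as faithful as the first but far less fluent, and the third is slightly more faithful but less fluent, so the product favors the first.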

  7. SMT Introduction(Cont.) • Philipp Koehn • http://homepages.inf.ed.ac.uk/pkoehn

  8. Why SMT? • Better use of resources • No linguistic knowledge is needed • It can be used for any language pair • But • It needs a large training corpus

  9. Steps of SMT

  10. Requirements for SMT • Bilingual and monolingual corpora: • The bilingual corpus needs two files aligned sentence by sentence (one file for the source language and one for the target language) • Microsoft Bilingual Sentence Aligner • Language model: • We need a tool to compute P(e) • This step needs a monolingual corpus • SRILM: a tool for building n-gram language models
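To make the role of the language model concrete, here is a minimal sketch of a bigram model that estimates P(e) from a monolingual corpus. It uses add-one smoothing as a simplifying assumption; SRILM offers more sophisticated smoothing and higher n-gram orders. The three-sentence corpus and the function names are hypothetical.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Count bigrams and their histories over a tokenized corpus."""
    history_counts, bigram_counts = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])
        history_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    return history_counts, bigram_counts, len(vocab)

def sentence_prob(sentence, model):
    """P(e) as a product of add-one-smoothed bigram probabilities."""
    history_counts, bigram_counts, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        prob *= (bigram_counts[(prev, cur)] + 1) / (history_counts[prev] + vocab_size)
    return prob

# Hypothetical toy corpus, for illustration only.
corpus = ["the house is small", "the house is big", "a house is small"]
model = train_bigram_lm(corpus)
fluent = sentence_prob("the house is small", model)
scrambled = sentence_prob("small is house the", model)
print(fluent > scrambled)
```

The scrambled sentence uses the same words but only unseen bigrams, so the model assigns it a much lower probability. This sketch does not handle words outside the training vocabulary in a properly normalized way.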

  11. LM output

  12. Requirements for SMT • Translation model: • We need a tool to compute P(f|e) • This step needs a bilingual corpus • GIZA++ (word alignment) • The output of this step, after phrase extraction and scoring, is a phrase table • Decoder: • Searches for the best translation • Moses

  13. Phrase table
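As a rough illustration of what a phrase table contains: Moses stores it as text, one phrase pair per line, with fields separated by `|||` (source phrase, target phrase, a list of scores, and possibly further fields). The sketch below parses one such line. The sample line, its placeholder target tokens, and its scores are invented, and the number and meaning of the scores depend on how the model was trained.

```python
def parse_phrase_table_line(line):
    """Split one phrase-table line into (source, target, scores)."""
    fields = [field.strip() for field in line.split("|||")]
    source, target = fields[0], fields[1]
    scores = [float(x) for x in fields[2].split()]
    return source, target, scores

# Hypothetical line: tokens and scores are invented for illustration.
sample = "the house ||| TGT_1 TGT_2 ||| 0.5 0.25 0.75 0.125"
source, target, scores = parse_phrase_table_line(sample)
print(source, target, scores)
```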

  14. Moses tool

  15. The training steps • Prepare data • Run GIZA++ • Align words • Get lexical translation table • Extract phrases • Score phrases • Build reordering model • Build generation models • Create configuration file
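The "Get lexical translation table" step can be sketched as a relative-frequency estimate over word-alignment links: w(f | e) = count(f, e) / count(e). The snippet below is a simplified illustration, not the Moses training script; the alignment links use placeholder source tokens and are hypothetical.

```python
from collections import Counter

def lexical_table(links):
    """Estimate w(f | e) from (source_word, target_word) alignment links."""
    links = list(links)
    pair_counts = Counter(links)
    target_counts = Counter(e for _, e in links)
    return {(f, e): count / target_counts[e] for (f, e), count in pair_counts.items()}

# Hypothetical alignment links, as a word aligner might produce them.
links = [("F1", "house"), ("F1", "house"), ("F2", "house"), ("F3", "small")]
table = lexical_table(links)
print(table[("F1", "house")])
```

Here "house" is linked three times in total, twice to F1, so w(F1 | house) is two thirds.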

  16. Evaluation metrics • BLEU (Bilingual Evaluation Understudy) • Developed at IBM • The closer a machine translation is to a professional human translation, the better it is • NIST
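The idea behind BLEU can be sketched as clipped n-gram precision combined with a brevity penalty. The version below is a simplified, sentence-level variant with a single reference and no smoothing, written only for illustration. Standard BLEU is computed over a whole test corpus and may use several references. The example sentences are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU: geometric mean of clipped precisions."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precision += math.log(clipped / total) / max_n
    # Brevity penalty: penalize candidates shorter than the reference.
    penalty = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return penalty * math.exp(log_precision)

exact = bleu("the house is very small", "the house is very small")
partial = bleu("the house is very small", "the house is quite small", max_n=2)
print(exact, partial)
```

An exact match scores 1, and a candidate that differs in one word scores strictly between 0 and 1 when only unigrams and bigrams are used.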

  17. English-Persian MT challenges • The structure of Persian is very different from that of English • The structure of Persian is very complex • Little previous work has been done for this language pair • Effective SMT systems rely on very large bilingual corpora, but these are not readily available for the English-Persian language pair

  18. English-Persian SMT • Few English-Persian MT systems have been developed • Most of them are purely rule-based • There are two lines of work on English-Persian SMT: • Mohaghegh and Sarrafzadeh (Massey University) • Pilevar and Faili (Tehran University)

  19. System1 • Corpus: BBC news

  20. System1(Cont.) • Tools: SRILM, GIZA++, Moses

  21. System1: Improved Language Modeling

  22. System2 • Corpus: • Bidirectional(TEP): Subtitle of films, 3 books, KDE4

  23. System2(Cont.) • Corpus: • Monolingual: Hamshahri, subtitle of films

  24. System2(Cont.) • Tools: SRILM, GIZA++, Moses • PersianSMT with 4-gram Sub-LM

  25. Comparison of PersianSMT with Google Translate

  26. Problems in English-Persian SMT • Compound verbs (alignment problem) • Use a phrase-based SMT system • But inflectional morphology remains a problem • The large number of inflected verb forms prevents the system from learning to translate every individual form of a compound verb • Personal pronouns are optional in Persian sentences (alignment problem)

  27. Problems(Cont.) • The system fails to place the elements of the sentence in the right order • Use a phrase-based SMT system • Re-rank the n-best output list and/or reorder the output sentences • Before translation, reorder the input sentence using morpho-syntactic information so that its word order better resembles that of the target language
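The pre-translation reordering idea can be sketched with a single toy rule. Persian is predominantly verb-final (SOV), so the snippet moves verbs in a POS-tagged English clause to the end, just before the final punctuation. This is a deliberately simple illustration under strong assumptions (one clause, punctuation only at the end, tags already available); the tag names and the function are hypothetical and are not the method used by either system described here.

```python
def reorder_svo_to_sov(tagged):
    """Move verbs to the end of a single tagged clause, before punctuation."""
    verbs = [tok for tok in tagged if tok[1] == "VERB"]
    punct = [tok for tok in tagged if tok[1] == "PUNCT"]
    rest = [tok for tok in tagged if tok[1] not in ("VERB", "PUNCT")]
    return rest + verbs + punct

# Hypothetical tagged input; a real system would obtain tags from a tagger.
sentence = [("she", "PRON"), ("bought", "VERB"), ("a", "DET"),
            ("book", "NOUN"), (".", "PUNCT")]
reordered = " ".join(word for word, _ in reorder_svo_to_sov(sentence))
print(reordered)
```

The reordered source then shares the target's verb position, which tends to make word alignment more monotone.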

  28. References • [1] A. Ramanathan, "Statistical Machine Translation", Ph.D. Seminar Report, Department of Computer Science and Engineering, Indian Institute of Technology, 2000. • [2] A. Lopez, "Statistical Machine Translation", ACM Computing Surveys, 2008. • [3] M. Mohaghegh and A. Sarrafzadeh, "The first English-Persian statistical machine translation", New Zealand Postgraduate Conference, 2009. • [4] M. Mohaghegh and A. Sarrafzadeh, "An analysis of the effect of training data variation in English-Persian Statistical Machine Translation", 2009 International Conference on Innovations in Information Technology (IIT 2009). • [5] M. Mohaghegh and A. Sarrafzadeh, "Performance evaluation of various training data in English-Persian statistical machine translation", to appear in Proceedings of the 10th International Conference on the Statistical Analysis of Textual Data (JADT 2010), Rome, Italy, June 9-11, 2010. • [6] M. Mohaghegh and A. Sarrafzadeh, "Improved Language Modeling for English-Persian Statistical Machine Translation", COLING 2010 / SIGMT Workshop, 23rd International Conference on Computational Linguistics, Beijing, China, 28 August 2010.

  29. References(Cont.) • [7] M.T. Pilevar and H. Faili, "PersianSMT: A First Attempt to English-Persian Statistical Machine Translation", to appear in Proceedings of the 10th International Conference on the Statistical Analysis of Textual Data (JADT 2010).
