Hindi To English Statistical Machine Translation
Naveed Khan
[Under the guidance of] Prof. Pushpak Bhattacharyya
CFILT, IIT-Bombay
Presentation Outline
• Overview of Statistical Approach
• Language Model
• Translation Model
• Components of SMT
• Moses and Giza++
• Word based alignments
• Parallel alignments
• Phrase based SMT
• Moses Steps
• Evaluation of results
• Conclusion and Future Work
Overview of Statistical Approach
• "Find the English translation e corresponding to a given foreign sentence f"
• Thus, we seek $e_{best}$ such that
  $e_{best} = \arg\max_e P(e \mid f) = \arg\max_e \left[ P(e) \cdot P(f \mid e) \right]$
  Language Model: $P(e)$
  Translation Model: $P(f \mid e)$
• Translations are produced on the basis of a statistical model
• Parameters are estimated using bilingual parallel corpora
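The step from $P(e \mid f)$ to $P(e) \cdot P(f \mid e)$ is Bayes' rule with the denominator dropped. A short derivation, written out here as a reading aid:

```latex
\begin{align*}
e_{best} &= \arg\max_e P(e \mid f) \\
         &= \arg\max_e \frac{P(e)\, P(f \mid e)}{P(f)} \\
         &= \arg\max_e P(e)\, P(f \mid e)
\end{align*}
```

The last step holds because $P(f)$ is the same for every candidate $e$ once the input sentence $f$ is fixed, so it does not affect which $e$ attains the maximum.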
Language Model
• The goal is to find a highly fluent English sentence. For a given sentence $s_1 s_2 \dots s_n$:
  $\Pr(s_1 s_2 \dots s_n) = \Pr(s_1) \cdot \Pr(s_2 \mid s_1) \cdots \Pr(s_n \mid s_1 s_2 \dots s_{n-1})$
• Here $\Pr(s_n \mid s_1 s_2 \dots s_{n-1})$ is the probability that word $s_n$ follows the word string $s_1 s_2 \dots s_{n-1}$
• N-gram model probability
• Trigram model probability calculation
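The trigram calculation can be sketched as follows. This is a minimal illustration that assumes unsmoothed maximum-likelihood estimates and a two-sentence toy corpus; the function names and the `<s>` / `</s>` padding symbols are choices made for this sketch, not part of the original slides.

```python
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their two-word histories, padding with <s> and </s>."""
    tri, bi = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>", "<s>"] + sent.split() + ["</s>"]
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2:i + 1])] += 1
            bi[tuple(toks[i - 2:i])] += 1
    return tri, bi

def sentence_prob(sentence, tri, bi):
    """Product of Pr(w_i | w_{i-2} w_{i-1}) using raw relative frequencies."""
    toks = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for i in range(2, len(toks)):
        hist = tuple(toks[i - 2:i])
        if bi[hist] == 0:
            return 0.0  # unseen history: no estimate without smoothing
        p *= tri[tuple(toks[i - 2:i + 1])] / bi[hist]
    return p

# Toy corpus (hypothetical data, for illustration only)
corpus = ["ram eats rice", "ram eats bread"]
tri, bi = train_trigram(corpus)
print(sentence_prob("ram eats rice", tri, bi))
```

A real language model would add smoothing so that unseen trigrams do not force the whole sentence probability to zero.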
Translation Model [1/2]
• It is a generative model: given a Hindi sentence, it tries to find a highly fluent English sentence
• Whenever it faces an English sentence, it reasons backward and tries to identify which Hindi sentence is likely to have produced this English sentence
• Since there are infinitely many sentences, it is not possible to estimate Pr(f, e) for all pairs of sentences, so the concept of alignment is introduced
Translation Model [2/2]
• Alignment is the mapping of individual words in aligned sentence pairs
• A = {a1, a2, a3, a4, ..., am} is termed an alignment, where aj = the set of positions in the English sentence to which the jth word of the foreign sentence is aligned
• Without loss of generality, generating f from e can be described as three choices:
  • Choose the length m of the foreign-language string, given e
  • Choose the identity of the foreign word f, given e, m, a
  • Choose the alignment a, given e, m
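The equation that the three labels annotate did not survive in the transcript. The labels match the standard IBM-model decomposition, so the sketch below assumes that form; the notation $a_1^{j-1}$ and $f_1^{j-1}$ (the alignments and foreign words chosen before position $j$) is an assumption of this reconstruction.

```latex
\begin{align*}
\Pr(f \mid e) &= \sum_{a} \Pr(f, a \mid e) \\
\Pr(f, a \mid e) &= \underbrace{\Pr(m \mid e)}_{\text{length}}
  \prod_{j=1}^{m}
  \underbrace{\Pr(a_j \mid a_1^{j-1}, f_1^{j-1}, m, e)}_{\text{alignment}}
  \;\underbrace{\Pr(f_j \mid a_1^{j}, f_1^{j-1}, m, e)}_{\text{word identity}}
\end{align*}
```

Read left to right: first the length of the foreign string is chosen, then for each position an alignment, then the foreign word placed at that position.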
Moses and Giza++
• GIZA++ is a freely available implementation of the IBM Models. We need it as an initial step to establish word alignments. Our word alignments are taken from the intersection of bidirectional runs of GIZA++ plus some additional alignment points from the union of the two runs.
• Moses is a statistical machine translation system that allows you to automatically train translation models for any language pair. All you need is a collection of translated texts (a parallel corpus). An efficient search algorithm quickly finds the highest-probability translation among the exponential number of choices.
• These tools can be obtained as Debian packages from http://cl.aist-nara.ac.jp/~eric-n/ubuntu-nlp/pool/jaunty/nlp/
Word Based Alignment
• For each word in the source language, align the words from the target language that this word possibly produces
• Based on IBM Models 1-5
• Model 1 is the simplest
• As we go from Model 1 to Model 5, the models get more complex but more realistic
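To make Model 1 concrete, here is a minimal sketch of its EM training loop for the lexical probabilities t(f|e). It is not GIZA++'s implementation; the function name, the `NULL` token handling, and the three-sentence toy corpus are assumptions made for illustration.

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Estimate t(f|e) with EM. pairs: list of (foreign_tokens, english_tokens)."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (f, e) links
        total = defaultdict(float)   # expected count of links from e
        for fs, es in pairs:
            es = ["NULL"] + list(es)  # every foreign word may align to NULL
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / norm  # E-step: posterior of this link
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():   # M-step: renormalise per English word
            t[(f, e)] = c / total[e]
    return t

# Toy parallel corpus (hypothetical data)
pairs = [
    ("यह घर".split(), "this house".split()),
    ("यह किताब".split(), "this book".split()),
    ("एक किताब".split(), "a book".split()),
]
t = ibm_model1(pairs)
print(t[("यह", "this")], t[("घर", "this")])
```

Because यह co-occurs with "this" in two sentence pairs while घर does so in only one, EM shifts probability mass toward the यह/this link. Models 2 to 5 add position, fertility and distortion parameters on top of this lexical table.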
Parallel Alignments
• Hindi to English Alignments
# Sentence pair (1) source length 14 target length 19 alignment score : 8.99895e-36
इस क्षेत्र में दिगम्बर जैन मंदिर प्राप्त कर लिया है जो , बर्ड्स अस्पताल का संचय करता है .
NULL ({ 18 }) the ({ 1 }) area ({ 2 }) has ({ 10 }) got ({ 7 }) the ({ 3 }) digamber ({ 4 8 9 13 14 16 17 }) jain ({ 5 }) temple ({ 6 }) which ({ 11 }) houses ({ }) the ({ 12 }) birds ({ 15 }) hospital ({ }) . ({ 19 })
• English to Hindi Alignments
# Sentence pair (1) source length 19 target length 14 alignment score : 3.37018e-21
the area has got the digamber jain temple which houses the birds hospital .
NULL ({ 5 }) इस ({ 1 }) क्षेत्र ({ 2 }) में ({ }) दिगम्बर ({ 3 4 6 }) जैन ({ 7 }) मंदिर ({ 8 }) प्राप्त ({ }) कर ({ }) लिया ({ }) है ({ }) जो ({ 9 }) , ({ }) बर्ड्स ({ 12 }) अस्पताल ({ 13 }) का ({ 11 }) संचय ({ 10 }) करता ({ }) है ({ }) . ({ 14 })
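The alignment line of each record pairs every word with the 1-based positions it is linked to in the other sentence. A small parser makes that structure explicit; this is a sketch that assumes the `word ({ positions })` layout shown above and tokens without embedded whitespace, and the function name is hypothetical.

```python
import re

# One token followed by its "({ ... })" list of 1-based positions
TOKEN_RE = re.compile(r"(\S+) \(\{([\d ]*)\}\)")

def parse_a3_alignment(line):
    """Parse the alignment line of an A3 record into (word, [positions]) pairs."""
    return [(word, [int(p) for p in positions.split()])
            for word, positions in TOKEN_RE.findall(line)]

# Shortened line in the same layout as the records above
line = "NULL ({ 18 }) the ({ 1 }) area ({ 2 }) houses ({ }) . ({ 19 })"
parsed = parse_a3_alignment(line)
print(parsed)
```

Words with an empty list, such as "houses" here, are unaligned in that direction, and the positions listed under NULL are words of the other sentence that align to nothing.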
Phrase-Based SMT
• Consider the translation of the sentence "राम चम्मच से चावल खाता है"
  राम चम्मच से चावल खाता है
  Ram eats rice with a spoon
Moses Steps [1/4]
• Training
  [../train-factored-phrase-model.perl -scripts-root-dir ../scripts -root-dir . -corpus filename.clean -e en -f hi -lm 0:3:../filename.lm:0]
• Preparation of data
• Run GIZA++

train.hi
1 इस क्षेत्र में दिगम्बर जैन मंदिर प्राप्त कर लिया है जो , बर्ड्स अस्पताल का संचय करता है .
2 स्थान जहां एक पर्यटक उसकी चिन्ताओं को पीछे छोड सकता है , जम्मू और कश्मीर में गुलमर्ग , गढवाल में औली , हिमाचल प्रदेश में कुफ्री और नारूण्डा को सम्मिलित करते हुये .
3 छोटे बच्चों को मंदिरों में ले जाया जाता है और उनका परिचय बुध्दि एवं ज्ञान की देवी , सरस्वती के आगे वर्णमाला के अक्षरों से करवाया जाता है .

train.en
1 the area has got the digamber jain temple which houses the birds hospital .
2 places where a tourist can whiz past his worries include gulmarg in jammu and kashmir , auli in grawhal , kufri and narkanda in himachal pradesh .
3 young children are taken to the temples and are introduced to the letters of the alphabet in front of saraswati , the goddess of wisdom and learning .

hi-en.A3.final
# Sentence pair (1) source length 14 target length 19 alignment score : 8.99895e-36
इस क्षेत्र में दिगम्बर जैन मंदिर प्राप्त कर लिया है जो , बर्ड्स अस्पताल का संचय करता है .
NULL ({ 18 }) the ({ 1 }) area ({ 2 }) has ({ 10 }) got ({ 7 }) the ({ 3 }) digamber ({ 4 8 9 13 14 16 17 }) jain ({ 5 }) temple ({ 6 }) which ({ 11 }) houses ({ }) the ({ 12 }) birds ({ 15 }) hospital ({ }) . ({ 19 })
# Sentence pair (2) source length 27 target length 33 alignment score : 2.45498e-47
स्थान जहां एक पर्यटक उसकी चिन्ताओं को पीछे छोड सकता है , जम्मू और कश्मीर में गुलमर्ग , गढवाल में औली , हिमाचल प्रदेश में कुफ्री और नारूण्डा को सम्मिलित करते हुये .
NULL ({ 11 }) places ({ 1 }) where ({ 2 }) a ({ 3 }) tourist ({ 4 }) can ({ }) whiz ({ }) past ({ }) his ({ 5 }) worries ({ 6 7 }) include ({ }) gulmarg ({ 17 }) in ({ 16 }) jammu ({ 13 }) and ({ 14 }) kashmir ({ 15 }) , ({ 18 }) auli ({ 19 }) in ({ 20 }) grawhal ({ 8 9 10 21 }) , ({ 12 22 }) kufri ({ 26 }) and ({ 27 }) narkanda ({ 28 29 30 31 32 }) in ({ 25 }) himachal ({ 23 }) pradesh ({ 24 }) . ({ 33 })

en-hi.A3.final
# Sentence pair (1) source length 19 target length 14 alignment score : 3.37018e-21
the area has got the digamber jain temple which houses the birds hospital .
NULL ({ 5 }) इस ({ 1 }) क्षेत्र ({ 2 }) में ({ }) दिगम्बर ({ 3 4 6 }) जैन ({ 7 }) मंदिर ({ 8 }) प्राप्त ({ }) कर ({ }) लिया ({ }) है ({ }) जो ({ 9 }) , ({ }) बर्ड्स ({ 12 }) अस्पताल ({ 13 }) का ({ 11 }) संचय ({ 10 }) करता ({ }) है ({ }) . ({ 14 })
# Sentence pair (2) source length 33 target length 27 alignment score : 4.73882e-36
places where a tourist can whiz past his worries include gulmarg in jammu and kashmir , auli in grawhal , kufri and narkanda in himachal pradesh .
NULL ({ }) स्थान ({ 1 }) जहां ({ 2 }) एक ({ 3 }) पर्यटक ({ 4 }) उसकी ({ 8 }) चिन्ताओं ({ 6 7 9 }) को ({ }) पीछे ({ }) छोड ({ }) सकता ({ 5 }) है ({ }) , ({ }) जम्मू ({ 13 }) और ({ 14 }) कश्मीर ({ 15 }) में ({ 12 }) गुलमर्ग ({ 10 11 }) , ({ 16 }) गढवाल ({ }) में ({ }) औली ({ 17 18 19 }) , ({ 20 }) हिमाचल ({ 25 }) प्रदेश ({ 26 }) में ({ 24 }) कुफ्री ({ 21 }) और ({ 22 }) नारूण्डा ({ 23 }) को ({ }) सम्मिलित ({ }) करते ({ }) हुये ({ }) . ({ 27 })
Moses Steps [2/4]
• Align words
  To establish word alignments based on the two GIZA++ alignments, a number of heuristics may be applied. The default heuristic, grow-diag-final, starts with the intersection of the two alignments and then adds additional alignment points.
• Get lexical translation table

aligned.grow-diag-final
1. इस क्षेत्र में दिगम्बर जैन मंदिर प्राप्त कर लिया है जो , बर्ड्स अस्पताल का संचय करता है .
   the area has got the digamber jain temple which houses the birds hospital .
   0-0 1-1 3-2 9-2 3-3 6-3 2-4 3-5 7-5 8-5 16-5 4-6 5-7 10-8 15-9 11-10 14-10 12-11 13-12 18-13
2. स्थान जहां एक पर्यटक उसकी चिन्ताओं को पीछे छोड सकता है , जम्मू और कश्मीर में गुलमर्ग , गढवाल में औली , हिमाचल प्रदेश में कुफ्री और नारूण्डा को सम्मिलित करते हुये .
   places where a tourist can whiz past his worries include gulmarg in jammu and kashmir , auli in grawhal , kufri and narkanda in himachal pradesh .
   0-0 1-1 2-2 3-3 9-4 5-5 5-6 4-7 5-8 6-8 16-9 16-10 15-11 12-12 13-13 14-14 17-15 18-16 19-17 7-18 8-18 20-18 11-19 21-19 25-20 26-21 27-22 28-22 29-22 30-22 31-22 24-23 22-24 23-25 32-26

model/lex.h2e
बैंक banking 0.0588235
बैंक bank 0.2571429
बैंक several 0.0116279
बैंक banks 0.1269841
बैंक sterling 0.0526316
बैंक paperwork 0.2857143
यूनियन union 0.1142857
अन्तिम success 0.0909091
अन्तिम final 0.1111111
अन्तिम eighties 0.1428571
अन्तिम last 0.0933333
अन्तिम terminus 0.0476190
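The idea of starting from the intersection and growing toward the union can be sketched as below. This is a simplified version of the grow-diag part only: the "final" pass that adds remaining unaligned points is omitted, and it is not the code Moses itself runs. The function name and the toy alignments are assumptions.

```python
def grow_diag(e2f, f2e):
    """Start from the intersection of two alignments and add neighbouring
    points from their union. Inputs are sets of (f_index, e_index) pairs."""
    union = e2f | f2e
    alignment = set(e2f & f2e)
    neighbours = [(-1, 0), (0, -1), (1, 0), (0, 1),
                  (-1, -1), (-1, 1), (1, -1), (1, 1)]
    added = True
    while added:  # repeat until no new point can be added
        added = False
        for f, e in sorted(alignment):
            for df, de in neighbours:
                cand = (f + df, e + de)
                if cand in union and cand not in alignment:
                    f_aligned = any(p[0] == cand[0] for p in alignment)
                    e_aligned = any(p[1] == cand[1] for p in alignment)
                    # add only if it covers a word that is still unaligned
                    if not f_aligned or not e_aligned:
                        alignment.add(cand)
                        added = True
    return alignment

# Toy alignments from the two directions (hypothetical data)
e2f = {(0, 0), (1, 1), (2, 1), (5, 5)}
f2e = {(0, 0), (1, 1), (1, 2)}
print(sorted(grow_diag(e2f, f2e)))
```

In this example (2, 1) and (1, 2) are adopted because they neighbour the intersection point (1, 1), while the isolated point (5, 5) is left out.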
Moses Steps [3/4]
• Extract Phrases
  The extract file contains, on each line, a Hindi phrase, an English phrase and the alignment points. Alignment points are pairs (hindi, english). An inverted alignment file, extract.inv, is also generated.

model/extract.0-0
इस ||| the ||| 0-0
इस क्षेत्र ||| the area ||| 0-0 1-1
क्षेत्र ||| area ||| 0-0
में ||| the ||| 0-0
जैन ||| jain ||| 0-0
जैन मंदिर ||| jain temple ||| 0-0 1-1
मंदिर ||| temple ||| 0-0
जो ||| which ||| 0-0
जो , बर्ड्स अस्पताल का संचय ||| which houses the birds hospital ||| 0-0 5-1 1-2 4-2 2-3 3-4
संचय ||| houses ||| 0-0
, बर्ड्स अस्पताल का संचय ||| houses the birds hospital ||| 4-0 0-1 3-1 1-2 2-3
, बर्ड्स अस्पताल का ||| the birds hospital ||| 0-0 3-0 1-1 2-2
बर्ड्स ||| birds ||| 0-0
बर्ड्स अस्पताल ||| birds hospital ||| 0-0 1-1
अस्पताल ||| hospital ||| 0-0
. ||| . ||| 0-0
है . ||| . ||| 1-0
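Phrase pairs are extracted when they are consistent with the word alignment, meaning that no word inside the pair is aligned to a word outside it. The sketch below implements that consistency check in its simplest form: it returns only the tightest English span for each Hindi span and does not extend over unaligned boundary words, which the real extractor does. The function name and the `max_len` default are assumptions.

```python
def extract_phrases(f_len, alignment, max_len=7):
    """Return spans (f_start, f_end, e_start, e_end), inclusive, consistent
    with the alignment, given as a set of (f_index, e_index) pairs."""
    phrases = []
    for f_start in range(f_len):
        for f_end in range(f_start, min(f_start + max_len, f_len)):
            # English positions linked to any word in the Hindi span
            e_points = [e for f, e in alignment if f_start <= f <= f_end]
            if not e_points:
                continue
            e_start, e_end = min(e_points), max(e_points)
            if e_end - e_start >= max_len:
                continue
            # Consistent only if the English span links to nothing outside
            if all(f_start <= f <= f_end
                   for f, e in alignment if e_start <= e <= e_end):
                phrases.append((f_start, f_end, e_start, e_end))
    return phrases

hindi = "जैन मंदिर".split()
english = "jain temple".split()
spans = extract_phrases(len(hindi), {(0, 0), (1, 1)})
for fs, fe, es, ee in spans:
    print(" ".join(hindi[fs:fe + 1]), "|||", " ".join(english[es:ee + 1]))
```

For this two-word example the sketch yields the pairs जैन/jain, जैन मंदिर/jain temple and मंदिर/temple.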
Moses Steps [4/4]
• Score Phrases
  A translation table is created from the stored phrase translation pairs.

जैन ||| jain ||| (0) ||| (0) ||| 1 0.981818 0.857143 0.915254 2.718
क्षेत्र ||| area ||| (0) ||| (0) ||| 0.8375 0.671779 0.503759 0.376936 2.718
बर्ड्स ||| birds ||| (0) ||| (0) ||| 0.0175439 0.0147059 1 1 2.718
बर्ड्स अस्पताल ||| birds hospital ||| (0) (1) ||| (0) (1) ||| 1 0.0026738 1 0.5 2.718
अस्पताल ||| hospital ||| (0) ||| (0) ||| 0.4 0.181818 1 0.5 2.718
संचय ||| houses ||| (0) ||| (0) ||| 0.0327869 0.0134529 1 0.5 2.718
मंदिर ||| temple ||| (0) ||| (0) ||| 0.864903 0.768421 0.763838 0.760417 2.718

The five scores on each line are, in order:
• Phrase translation probability (f|e)
• Lexical weighting lex(f|e)
• Phrase translation probability (e|f)
• Lexical weighting lex(e|f)
• Phrase penalty, always exp(1) = 2.718
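The two phrase translation probabilities are relative frequencies over the extracted phrase pairs. A minimal sketch, assuming the extraction step yields one (Hindi phrase, English phrase) tuple per occurrence; the function name and the counts below are made up for illustration, and lexical weighting is left out.

```python
from collections import Counter

def score_phrases(extracted):
    """extracted: list of (hindi_phrase, english_phrase), one per extraction.
    Returns {(h, e): (phi(h|e), phi(e|h))} by relative frequency."""
    pair_count = Counter(extracted)
    h_count = Counter(h for h, _ in extracted)
    e_count = Counter(e for _, e in extracted)
    return {(h, e): (c / e_count[e], c / h_count[h])
            for (h, e), c in pair_count.items()}

# Hypothetical extraction counts, not taken from the corpus above
extracted = ([("मंदिर", "temple")] * 3
             + [("मंदिर", "shrine")]
             + [("देवालय", "temple")])
scores = score_phrases(extracted)
print(scores[("मंदिर", "temple")])
```

Here मंदिर/temple was extracted 3 times, "temple" appears 4 times overall and मंदिर appears 4 times overall, so both directions come out as 0.75.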
Decoding
Hindi sentence: राम चम्मच से चावल खाता है

Step 1
  Phrase table entry [राम → Ram]
  Probability = p1
  p1 = p(राम|Ram) * pLM(Ram|<start>) * d(0)
  h = * चम्मच से चावल खाता है
  e = Ram

Step 2
  Phrase table entry [खाता है → eats]
  Probability = p1*p2
  p2 = p(खाता है|eats) * pLM(eats|Ram<start>) * d(2)
  h = * चम्मच से चावल * *
  e = Ram eats

Step 3
  Phrase table entry [चावल → rice]
  Probability = p1*p2*p3
  p3 = p(चावल|rice) * pLM(rice|eats<start>) * d(2)
  h = * चम्मच से * * *
  e = Ram eats rice

Step 4
  Phrase table entry [चम्मच से → with a spoon]
  Probability = p1*p2*p3*p4
  p4 = p(चम्मच से|with a spoon) * pLM(with a spoon|rice<start>) * d(2)
  h = * * * * * *
  e = Ram eats rice with a spoon
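The walk-through above amounts to repeatedly expanding a hypothesis: mark a source span as covered, append its English phrase, and multiply in the phrase, language-model and distortion scores. The sketch below follows one such path; it is not the Moses search, which explores many hypotheses at once. The `expand` helper is hypothetical and every score is a 0.5 or 1.0 placeholder, not a value from the phrase table.

```python
def expand(hyp, span, english, p_phrase, p_lm, p_dist):
    """Extend a hypothesis (coverage, output, prob) by translating the source
    span [start, end) and multiplying in the three component scores."""
    coverage, output, prob = hyp
    start, end = span
    if any(coverage[start:end]):
        raise ValueError("span already translated")
    new_cov = coverage[:start] + [True] * (end - start) + coverage[end:]
    return new_cov, output + [english], prob * p_phrase * p_lm * p_dist

source = "राम चम्मच से चावल खाता है".split()
hyp = ([False] * len(source), [], 1.0)

# (source span, English phrase, p_phrase, p_lm, p_dist); placeholder scores
steps = [
    ((0, 1), "Ram", 0.5, 0.5, 1.0),
    ((4, 6), "eats", 0.5, 0.5, 0.5),
    ((3, 4), "rice", 0.5, 0.5, 0.5),
    ((1, 3), "with a spoon", 0.5, 0.5, 0.5),
]
for span, english, pp, plm, pd in steps:
    hyp = expand(hyp, span, english, pp, plm, pd)

coverage, output, prob = hyp
print(" ".join(output), prob)
```

The coverage list plays the role of the starred positions in the h lines: decoding is complete when every source word is marked as translated.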
Some Positive Results
• H: शब्दशह् धर्मशाला का मतलब पवित्र शरणस्थल् हैं .
  E: dharamshala literally means ' the holy refuge .
• H: फतुहपुर सीकरी लाल बलुआ पत्थर में एक महाकाव्य है .
  E: fatehpur sikri is an epic in red sandstone .
• H: कुल्लु घाटी भी वैली ओह्फ गोह्ड्स के नाम से प्रचलित है .
  E: the kullu valley also known as the valley of the gods .
• H: वस्तुओं की गुणवत्ता परिवर्तनशील है , परंतु आपको अच्छा असली सौदा मिल सकता है .
  E: the quality of goods varies , but you may well find a genuine bargain .
• H: हिमाचल प्रदेश की राजधानी शिमला को पहाडी स्टेशनों की रानी कहा जाता हैं .
  E: shimla the capital of himachal pradesh , called the queen of hill stations .
• H: क्वीन विक्टोरिया ने ब्लैकफ्रयर्स ब्रिज का शुभारंभ नवम्बर 1869 में किया .
  E: queen victoria opened blackfriars bridge in november 1869 .
• H: राजघाट यमुना के किनारे महात्मा गांधी का शांत स्मारक है .
  E: on the banks of yamuna raj ghat is the serene memorial of mahatma gandhi .
Error Analysis
• H: बंकिघम पैलेस महारानी तथा युवराज फिलिप्स का लंदन निवास है .
  E: queen and prince philip buckingham palace is the london home of the .
  Error: The translated sentence follows the wrong word order.
• H: वैज्ञानिक तरीके से एक दिव्य सैर के लिए तारामण्डल आएं .
  E: scientific a celestial trip to आएं planetarium .
  Error: Since 'आएं' is not present in the phrase table, the word is left untranslated. Phrases are selected from the phrase table in the decoding step, and no selection was made for the span [ आएं ; 9-9].
• H: ऊंट सफारियां अपनी उत्पत्ति को भारत एवं चीन के बीच व्यापार के समय में चिन्हित करती हैं जब ऊंट कारवों मसालों , जडीबूटियों एवं रत्नों से लदे हुए स्थापित व्यापार मार्गों के साथ यात्रा करते थे .
  E: camel safaris india and china its origin to the time of trade between mark of when camel caravans spices and herbs , precious stones , from established trade routes laden with travel and
  Error: Every word in the sentence is translated correctly, but the word order is wrong.
Evaluation Criteria
• Automatic Evaluation
  • BLEU: measures the n-gram precision of a translation with respect to given reference translations
  • A higher score indicates a better translation
• Subjective Evaluation
  • Translations are judged by human evaluators on fluency and adequacy on a scale of 1 to 5
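The n-gram precision at the core of BLEU can be sketched as follows. This shows only the clipped (modified) precision for a single n; full BLEU combines the precisions for several n-gram orders with a geometric mean and applies a brevity penalty, which this sketch leaves out. The function names and example sentences are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in any single reference."""
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, c in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], c)
    clipped = sum(min(c, max_ref[gram]) for gram, c in cand.items())
    return clipped / sum(cand.values())

# Right words, wrong order: unigram precision is perfect, bigram precision is not
print(modified_precision("ram eats rice", ["ram rice eats"], 1))
print(modified_precision("ram eats rice", ["ram rice eats"], 2))
```

The example shows why higher-order n-grams matter for this system: an output with every word translated but in the wrong order still scores perfectly on unigrams and only loses on bigrams and above.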
Subjective Evaluation
[Tables: Fluency; Adequacy]
Results
• The 400 test sentences translated from Hindi to English were manually sorted into four categories
Conclusion and Future Work
• Shorter sentences yield better BLEU scores when translated
• POS tagging, morphological analysis and chunking of the Hindi sentences, followed by the application of reordering rules, is an experiment that is still in progress
• A significant improvement in word order is expected
Reordering Rules
• Tokenizing the input sentence
• POS tagging of the sentence
• Morphological analysis of the sentence
• Chunking of the tokenized, POS-tagged and morphologically analysed sentence
• Determining the subject, object and verb chunks
• SOV to SVO reordering
• Reordering the prepositions
• Modifier reordering
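The SOV to SVO step can be illustrated on already-chunked input. This is only a sketch of the idea, not the authors' rule set: it assumes each chunk arrives with a hypothetical role label ('S', 'O', 'V' or anything else) from the earlier steps, and it applies a single rule that moves verb chunks to just after the last subject chunk.

```python
def sov_to_svo(chunks):
    """chunks: list of (role, text) pairs in Hindi (SOV) order.
    Moves verb chunks to directly after the last subject chunk; all other
    chunks keep their relative order. With no subject, verbs come first."""
    verbs = [c for c in chunks if c[0] == "V"]
    rest = [c for c in chunks if c[0] != "V"]
    last_subj = max((i for i, c in enumerate(rest) if c[0] == "S"), default=-1)
    return rest[:last_subj + 1] + verbs + rest[last_subj + 1:]

# Hypothetical chunking of the example sentence used earlier in the slides
chunks = [("S", "राम"), ("X", "चम्मच से"), ("O", "चावल"), ("V", "खाता है")]
reordered = sov_to_svo(chunks)
print(" ".join(text for _, text in reordered))
```

After this step the Hindi chunks follow English subject-verb-object order (राम खाता है चम्मच से चावल), which is the kind of source-side reordering expected to reduce the word-order errors seen in the error analysis; the preposition and modifier rules would then act on the remaining chunks.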