Pushpak Bhattacharyya, CSE Dept., IIT Bombay, 15th Feb, 2011

CS460/626: Natural Language Processing / Speech, NLP and the Web (Lecture 18: Alignment in SMT and Tutorial on Giza++ and Moses)

Going forward from word alignment
Word alignment → Phrase alignment (going to bigger units of correspondence) → Decoding (best possible translation)

Abstract Problem
Given: e_0 e_1 e_2 e_3 ... e_n e_{n+1} (entities)
Goal: l_0 l_1 l_2 l_3 ... l_n l_{n+1} (labels)
The goal is to find the best possible label sequence, using a generative model.

Simplification
Using the Markov assumption, the language model can be represented using bigrams. Similarly, the translation model can be simplified so that each entity depends only on its own label.
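The formulas for this simplification are written out below in the usual HMM-style form. This is a reconstruction under assumptions, not a copy of the slide: it assumes the entity sequence e_0 ... e_{n+1} and label sequence l_0 ... l_{n+1} from the abstract problem, a bigram model over labels, and each entity depending only on its own label.

```latex
\begin{align*}
L^{*} &= \arg\max_{L} P(L \mid E) = \arg\max_{L} P(L)\, P(E \mid L) \\
P(L) &\approx \prod_{i=1}^{n+1} P(l_i \mid l_{i-1}) \\
P(E \mid L) &\approx \prod_{i=0}^{n+1} P(e_i \mid l_i)
\end{align*}
```

The first line is Bayes' rule with the constant P(E) dropped; the second and third lines are the two independence assumptions.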

Statistical Machine Translation
• Finding the best possible English sentence given the foreign sentence: E* = argmax_E P(E|F) = argmax_E P(E) · P(F|E)
• P(E) = Language Model
• P(F|E) = Translation Model
• E: English, F: Foreign Language

Problems in the framework
• Labels are words of the target language
• Very large in number
• Who do you want to_go with ? (preposition stranding)
• With whom do you want to go ?
• आप किस के_साथ जाना चाहते_हो
• (Aap kis ke_sath jaana chahate_ho)
• Candidate labels (figure): who, do, you, want, to_go, with, and so on
• Each word has multiple translation options

Columns of target-language words over the source-language words
Lattice (figure): ^ Aap kis ke_sath jaana chahate_ho . with a column of candidate target words (who, do, you, want, to_go, with, and so on) under each source word.
Find the best possible path from ‘^’ to ‘.’ using transition and observation probabilities. Viterbi can be used.
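The path search over this lattice can be sketched as below. It is a minimal illustration rather than the lecture's own code: the function name `viterbi`, the dictionary layout, the two-word toy sentence, and every probability value are made up for the example. `transition[(prev, label)]` plays the role of the bigram language model, and `emission[(word, label)]` is the probability of the source word given the target label.

```python
import math

def viterbi(source_words, candidates, transition, emission, start="^", end="."):
    """Return (best label path, its probability), or None if no path exists."""
    # best[i][label] = (log-probability of the best path ending in label, backpointer)
    best = [{start: (0.0, None)}]
    for word in source_words:
        column = {}
        for label in candidates[word]:
            emit = emission.get((word, label), 0.0)
            scored = []
            for prev, (prev_score, _) in best[-1].items():
                trans = transition.get((prev, label), 0.0)
                if trans > 0 and emit > 0:
                    scored.append((prev_score + math.log(trans) + math.log(emit), prev))
            if scored:
                column[label] = max(scored)
        best.append(column)

    # Close the path with the end marker.
    final = []
    for prev, (prev_score, _) in best[-1].items():
        trans = transition.get((prev, end), 0.0)
        if trans > 0:
            final.append((prev_score + math.log(trans), prev))
    if not final:
        return None
    score, label = max(final)

    # Follow backpointers from the last column to the first.
    path = []
    for i in range(len(best) - 1, 0, -1):
        path.append(label)
        label = best[i][label][1]
    path.reverse()
    return path, math.exp(score)

# Toy lattice; all numbers are invented for illustration.
source = ["aap", "chahate_ho"]
candidates = {"aap": ["you", "who"], "chahate_ho": ["want", "do"]}
transition = {("^", "you"): 0.6, ("^", "who"): 0.4,
              ("you", "want"): 0.7, ("you", "do"): 0.3,
              ("who", "want"): 0.2, ("who", "do"): 0.8,
              ("want", "."): 0.5, ("do", "."): 0.5}
emission = {("aap", "you"): 0.9, ("aap", "who"): 0.1,
            ("chahate_ho", "want"): 0.8, ("chahate_ho", "do"): 0.2}

path, prob = viterbi(source, candidates, transition, emission)
print(path, prob)
```

Working in log space avoids underflow on longer sentences; the cost per position is the product of the sizes of two adjacent columns, which is why restricting each column to a few candidate words matters when the label set is the whole target vocabulary.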

TUTORIAL ON Giza++ and Moses tools (delivered by Kushal Ladha)

Word-based alignment
• For each word in the source language, align the words from the target language that this word possibly produces
• Based on IBM Models 1-5
• Model 1 is the simplest
• As we go from Model 1 to Model 5, the models get more complex but more realistic
• This is all that Giza++ does

Alignment
• A function from target position to source position
• The alignment sequence is: 2,3,4,5,6,6,6
• Alignment function A: A(1) = 2, A(2) = 3, ...
• A different alignment function will give the sequence 1,2,1,2,3,4,3,4 for A(1), A(2), ...
• To allow spurious insertion, allow alignment with word 0 (NULL)
• No. of possible alignments: (I+1)^J, where J is the number of target positions and I the number of source positions
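The count of possible alignments can be checked by brute force for small sentences. The sketch below assumes J target positions, each mapped to one of I source positions or to position 0 (NULL); the helper name `all_alignments` is invented for the example.

```python
from itertools import product

def all_alignments(I, J):
    """Every alignment as a tuple (A(1), ..., A(J)), each value in 0..I (0 = NULL)."""
    return product(range(I + 1), repeat=J)

I, J = 2, 3
alignments = list(all_alignments(I, J))
print(len(alignments))  # 27, i.e. (I+1)**J
```

The count grows exponentially with J, which is why training cannot simply enumerate alignments for realistic sentence lengths.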

IBM Model 1: Generative Process
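The generative process of IBM Model 1 is commonly written as the formula below. It is given here in its standard textbook form, with assumptions about notation that the slide text does not confirm: F = f_1 ... f_J is the generated sentence, E = e_1 ... e_I is the sentence it is generated from, e_0 is the NULL word, A(j) is the alignment function, and ε is a normalisation constant for the length J.

```latex
\[
P(F, A \mid E) \;=\; \frac{\epsilon}{(I+1)^{J}} \prod_{j=1}^{J} t\bigl(f_j \mid e_{A(j)}\bigr)
\]
```

Read as a process: pick the length J, pick one of the (I+1)^J alignments uniformly, then generate each word f_j from the word it is aligned to using the lexical probability t(f|e).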

Training Alignment Models
• Given a parallel corpus, for each (F,E) learn the best alignment A and the component probabilities:
• t(f|e) for Model 1
• lexicon probability P(f|e) and alignment probability P(a_i | a_{i-1}, I)
• How do we compute these probabilities if all we have is a parallel corpus?

Intuition: Interdependence of Probabilities
• If you knew which words are probable translations of each other, then you could guess which alignment is probable and which one is improbable
• If you were given alignments with probabilities, then you could compute translation probabilities
• Looks like a chicken-and-egg problem
• The EM algorithm comes to the rescue
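How EM breaks this circularity can be seen in a small sketch of IBM Model 1 training. This is an illustrative implementation under assumptions, not what Giza++ does internally: the function name `train_ibm_model1`, the uniform initialisation, the fixed iteration count, and the three-sentence toy corpus are all chosen for the example.

```python
from collections import defaultdict

def train_ibm_model1(corpus, iterations=10):
    """corpus: list of (f_words, e_words) pairs. Returns t with t[(f, e)] = t(f|e)."""
    f_vocab = {f for f_sent, _ in corpus for f in f_sent}
    # Start from uniform translation probabilities.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e being aligned to anything
        # E-step: distribute each f over the e words of its sentence pair,
        # in proportion to the current t(f|e).
        for f_sent, e_sent in corpus:
            e_with_null = ["NULL"] + list(e_sent)
            for f in f_sent:
                norm = sum(t[(f, e)] for e in e_with_null)
                for e in e_with_null:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: re-estimate t(f|e) from the expected counts.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

# Toy parallel corpus, invented for illustration.
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
t = train_ibm_model1(corpus)
print(t[("das", "the")], t[("haus", "the")])
```

The E-step is the "given translation probabilities, guess the alignments" half of the intuition, and the M-step is the "given alignments, compute translation probabilities" half. Because Model 1 treats every alignment as equally likely a priori, the expected counts can be collected word by word without enumerating all (I+1)^J alignments.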

Limitation: Only 1->Many Alignments allowed

Phrase-based alignment
• More natural
• Many-to-one mappings allowed

Giza++ and Moses Package
• http://cl.naist.jp/~eric-n/ubuntu-nlp/
• Select your Ubuntu version
• Browse the nlp folder
• Download the Debian packages of giza++, moses, mkcls, srilm
• Resolve all the dependencies and they get installed
• For alternate installation, refer to http://www.statmt.org/moses_steps.html

Steps
• Input: sentence-aligned parallel corpus
• Output: target-side tagged data
• Training
• Tuning
• Generate output on test corpus (decoding)

Training
• Create a folder named corpus containing the test, train and tuning files
• Giza++ is used to generate the alignment
• The phrase table is generated after training
• Before training, the language model needs to be built on the target side:
    mkdir lm
    /usr/bin/ngram-count -order 3 -interpolate -kndiscount -text $PWD/corpus/train_surface.hi -lm lm/train.lm
    /usr/share/moses/scripts/training/train-factored-phrase-model.perl -scripts-root-dir /usr/share/moses/scripts -root-dir . -corpus train.clean -e hi -f en -lm 0:3:$PWD/lm/train.lm:0

Example
Parallel training data, one entry per line (train.pr holds the phoneme sequences, train.en the letter sequences):
    train.pr                        train.en
    hh eh l ow                      h e l l o
    hh ah l ow                      h e l l o
    w er l d                        w o r l d
    k aa m p aw n d                 c o m p o u n d
    w er d                          w o r d
    hh ay f ah n ey t ih d          h y p h e n a t e d
    ow eh n iy                      o n e
    b uw m                          b o o m
    k w iy z l ah b aa t ah r       k w e e z l e b o t t e r
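In this example each letter is treated as a "word", so the letter side of the corpus is simply each word split into space-separated characters. A minimal sketch of that preparation step follows; the helper name `to_letter_tokens` and the word list are invented for illustration, and the phoneme side would have to come from a pronunciation dictionary, which is not shown.

```python
def to_letter_tokens(word):
    """Return the word as space-separated lowercase letters (one line of train.en)."""
    return " ".join(word.lower())

words = ["hello", "world", "boom"]
lines = [to_letter_tokens(w) for w in words]
print("\n".join(lines))
```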

Sample from Phrase-table
    l l o ||| l ow ||| (0) (0) (1) ||| (0,1) (2) ||| 0.5 1 1 0.227273 2.718
    l l ||| l ||| (0) (0) ||| (0,1) ||| 0.25 1 1 0.833333 2.718
    l o ||| l ow ||| (0) (1) ||| (0) (1) ||| 0.5 1 1 0.227273 2.718
    l ||| l ||| (0) ||| (0) ||| 0.75 1 1 0.833333 2.718
    m ||| m ||| (0) ||| (0) ||| 1 0.5 1 1 2.718
    n d ||| n d ||| (0) (1) ||| (0) (1) ||| 1 1 1 1 2.718
    n e ||| eh n iy ||| (1) (2) ||| () (0) (1) ||| 1 1 0.5 0.3 2.718
    n e ||| n iy ||| (0) (1) ||| (0) (1) ||| 1 1 0.5 0.3 2.718
    n ||| eh n ||| (1) ||| () (0) ||| 1 1 0.25 1 2.718
    o o m ||| uw m ||| (0) (0) (1) ||| (0,1) (2) ||| 1 0.5 1 0.181818 2.718
    o o ||| uw ||| (0) (0) ||| (0,1) ||| 1 1 1 0.181818 2.718
    o ||| aa ||| (0) ||| (0) ||| 1 0.666667 0.2 0.181818 2.718
    o ||| ow eh ||| (0) ||| (0) () ||| 1 1 0.2 0.272727 2.718
    o ||| ow ||| (0) ||| (0) ||| 1 1 0.6 0.272727 2.718
    w o r ||| w er ||| (0) (1) (1) ||| (0) (1,2) ||| 1 0.1875 1 0.424242 2.718
    w ||| w ||| (0) ||| (0) ||| 1 0.75 1 1 2.718
    b o ||| b aa ||| (0) (1) ||| (0) (1) ||| 1 0.666667 1 0.181818 2.718
    b ||| b ||| (0) ||| (0) ||| 1 1 1 1 2.718
    c o m p o ||| aa m p ||| (2) (0,1) (1) (0) (1) ||| (1,3) (1,2,4) (0) ||| 1 0.0486111 1 0.154959 2.718
    c ||| p ||| (0) ||| (0) ||| 1 1 1 1 2.718
    d w ||| d w ||| (0) (1) ||| (0) (1) ||| 1 0.75 1 1 2.718
    d ||| d ||| (0) ||| (0) ||| 1 1 1 1 2.718
    e b ||| ah b ||| (0) (1) ||| (0) (1) ||| 1 1 1 0.6 2.718
    e l l ||| ah l ||| (0) (1) (1) ||| (0) (1,2) ||| 1 1 0.5 0.5 2.718
    e l l ||| eh l ||| (0) (0) (1) ||| (0,1) (2) ||| 1 0.111111 0.5 0.111111 2.718
    e l ||| eh ||| (0) (0) ||| (0,1) ||| 1 0.111111 1 0.133333 2.718
    e ||| ah ||| (0) ||| (0) ||| 1 1 0.666667 0.6 2.718
    h e ||| hh ah ||| (0) (1) ||| (0) (1) ||| 1 1 1 0.6 2.718
    h ||| hh ||| (0) ||| (0) ||| 1 1 1 1 2.718
    l e b ||| l ah b ||| (0) (1) (2) ||| (0) (1) (2) ||| 1 1 1 0.5 2.718
    l e ||| l ah ||| (0) (1) ||| (0) (1) ||| 1 1 1 0.5 2.718
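Each phrase-table entry in the sample has five fields separated by `|||`: source phrase, target phrase, two alignment fields, and a list of scores. A small parser for that layout might look like the sketch below. The function name and the dictionary keys are invented for illustration, and the five-field layout is assumed from the sample only; other Moses versions write phrase tables with a different set of fields.

```python
def parse_phrase_table_line(line):
    """Split one phrase-table line of the five-field form shown in the sample."""
    fields = [field.strip() for field in line.split("|||")]
    source, target, align_a, align_b, scores = fields
    return {
        "source": source.split(),
        "target": target.split(),
        "alignments": (align_a, align_b),
        "scores": [float(x) for x in scores.split()],
    }

entry = parse_phrase_table_line(
    "l l o ||| l ow ||| (0) (0) (1) ||| (0,1) (2) ||| 0.5 1 1 0.227273 2.718"
)
print(entry["source"], entry["target"], entry["scores"])
```

In the sample, the last score is 2.718 (approximately e) on every line, so it carries no information that distinguishes one phrase pair from another.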

Tuning
• Not a compulsory step, but it will improve the decoding by a small percentage
    mkdir tuning
    cp $WDIR/corpus/tun.en tuning/input
    cp $WDIR/corpus/tun.hi tuning/reference
    /usr/share/moses/scripts/training/mert-moses.pl $PWD/tuning/input $PWD/tuning/reference /usr/bin/moses $PWD/model/moses.ini --working-dir $PWD/tuning --rootdir /usr/share/moses/scripts
• It will take around 1 hour on a server with 32GB RAM

Testing
    mkdir evaluation
    /usr/bin/moses -config $WDIR/tuning/moses.ini -input-file $WDIR/corpus/test.en > evaluation/test.output
• The output will be in the evaluation/test.output file
• Sample output:
    h o t → hh aa t
    p h o n e → p|UNK hh ow eh n iy
    b o o k → b uw k
