CS460/626: Natural Language Processing/Speech, NLP and the Web (Lecture 17 – Alignment in SMT)
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
14th Feb, 2011
Language Divergence Theory: Lexico-Semantic Divergences (ref: Dave, Parikh, Bhattacharyya, Journal of MT, 2002)
• Conflational divergence
  • F: vomir; E: to be sick
  • E: stab; H: churaa se maaranaa (knife-with hit)
  • S (Swedish): utrymningsplan; E: escape plan
• Structural divergence
  • E: SVO; H: SOV
• Categorial divergence
  • Change is in POS category (many examples discussed)
• Head swapping divergence
  • E: Prime Minister of India; H: bhaarat ke pradhaan mantrii (India-of Prime Minister)
• Lexical divergence
  • E: advise; H: paraamarsh denaa (advice give); noun incorporation is a very common Indian-language phenomenon
Language Divergence Theory: Syntactic Divergences
• Constituent order divergence
  • E: Singh, the PM of India, will address the nation today; H: bhaarat ke pradhaan mantrii, Singh, … (India-of PM, Singh, …)
• Adjunction divergence
  • E: She will visit here in the summer; H: vah yahaa garmii meM aayegii (she here summer-in will come)
• Preposition-stranding divergence
  • E: Who do you want to go with?; H: kisake saath aap jaanaa chaahate ho? (who with …)
• Null subject divergence
  • E: I will go; H: jaauMgaa (subject dropped)
• Pleonastic divergence
  • E: It is raining; H: baarish ho rahii hai (rain happening is); the pleonastic 'it' has no translation
Alignment
• Completely aligned
  • Your answer is right
  • Votre réponse est juste
• Problematic alignment
  • We first met in Paris
  • Nous nous sommes rencontrés pour la première fois à Paris
The Statistical MT model: notation
• Source language: F
• Target language: E
• Source language sentence: f
• Target language sentence: e
• Source language word: w_f
• Target language word: w_e
The Statistical MT model
To translate f:
• Assume that all sentences in E are translations of f with some probability!
• Choose the translation with the highest probability
SMT Model
• What is a good translation?
  • Faithful to the source (faithfulness)
  • Fluent in the target (fluency)
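The faithfulness/fluency split corresponds to the standard noisy-channel formulation of SMT; the slide's diagram is not in the extracted text, so the following restatement is a reconstruction of that standard form rather than the slide's own equation:

```latex
% Noisy-channel view of SMT: pick the target sentence e maximizing P(e | f),
% which Bayes' rule splits into a translation model and a language model
\hat{e} = \arg\max_{e} P(e \mid f)
        = \arg\max_{e} \; \underbrace{P(f \mid e)}_{\text{faithfulness (translation model)}} \;
                          \underbrace{P(e)}_{\text{fluency (language model)}}
```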
Language Modeling
• Task: to find P(e), i.e., assigning probabilities to sentences
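The usual way to make P(e) computable is the chain-rule expansion below; this is the standard decomposition, written out here since the slide's formula image did not survive extraction:

```latex
% Chain rule: the probability of a sentence e = w_1 w_2 ... w_n
P(e) = P(w_1, w_2, \ldots, w_n)
     = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
```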
Language Modeling: The N-gram approximation
• Probability of a word given the previous N-1 words
• N = 2: bigram approximation
• N = 3: trigram approximation
• Bigram approximation: see the reconstruction below
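The bigram formula on the slide appears to have been an image; a standard reconstruction of the N-gram and bigram approximations is:

```latex
% N-gram approximation: condition each word only on the previous N-1 words
P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})

% Bigram approximation (N = 2)
P(e) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
```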
Translation Modeling
• Task: to find P(f|e)
• Cannot use the counts of f and e
• Approximate P(f|e) using the product of word translation probabilities (IBM Model 1)
• Problem: how to calculate word translation probabilities?
• Note: we do not have counts – the training corpus is sentence-aligned, not word-aligned
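As a minimal sketch of what "product of word translation probabilities" means here (a standard Model-1 style expression, not copied from the slide):

```latex
% Each foreign word f_j may be generated by any English word e_i (e_0 = NULL),
% so P(f | e) is approximated, up to constants, by a product of sums of t(f_j | e_i)
P(f \mid e) \;\propto\; \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)
```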
Word-alignment example
• English: Ram (1) has (2) an (3) apple (4)
• Hindi: राम (1) के (2) पास (3) एक (4) सेब (5) है (6)
Expectation-Maximization algorithm
• Start with uniform word translation probabilities
• Use these probabilities to find the (fractional) counts
• Use these new counts to recompute the word translation probabilities
• Repeat the above steps till the values converge
• Works because of the co-occurrence of words that are actually translations
• It can be proven that EM converges
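A minimal Python sketch of the loop just described, following the per-sentence normalization used in the worked example on the later slides; the NULL word is omitted, and names such as em_word_alignment are illustrative, not from the lecture:

```python
from collections import defaultdict

def em_word_alignment(corpus, iterations=10):
    """EM for word translation probabilities t(e -> f).

    corpus: list of (english_words, foreign_words) sentence pairs.
    Returns a dict t[(e, f)] approximating t(e -> f).
    """
    e_vocab = {e for es, _ in corpus for e in es}
    f_vocab = {f for _, fs in corpus for f in fs}

    # Step 1: start with uniform word translation probabilities (1 / |F vocab|)
    t = {(e, f): 1.0 / len(f_vocab) for e in e_vocab for f in f_vocab}

    for _ in range(iterations):
        # Step 2 (E-step): collect fractional counts using the current t values
        count = defaultdict(float)   # expected count of the pair (e, f)
        total = defaultdict(float)   # expected count of e aligned to anything

        for es, fs in corpus:
            for e in es:
                # normalize over the foreign words of this sentence pair
                z = sum(t[(e, f)] for f in fs)
                for f in fs:
                    frac = t[(e, f)] / z
                    count[(e, f)] += frac
                    total[e] += frac

        # Step 3 (M-step): recompute the probabilities from the fractional counts
        t = {(e, f): count[(e, f)] / total[e] for (e, f) in count}
        # Step 4: repeat (fixed iteration count here) until the values converge

    return t


if __name__ == "__main__":
    # The rabbits/lapins example from the following slides
    corpus = [
        ("three rabbits".split(), "trois lapins".split()),
        ("rabbits of Grenoble".split(), "lapins de Grenoble".split()),
    ]
    t = em_word_alignment(corpus)
    # t('rabbits' -> 'lapins'), i.e. the (b, x) binding, grows with each iteration
    print(t[("rabbits", "lapins")])
```

On the two rabbits/lapins sentence pairs, the value t(rabbits → lapins), the (b, x) binding of the later slides, should increase with each iteration, which is exactly the behaviour the lecture points out.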
The counts in IBM Model 1
• Works by maximizing P(f|e) over the entire corpus
• For IBM Model 1, we get the following relationship:
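The relationship itself appears to have been an image on the slide; it is reconstructed here so that it matches the worked example that follows (which, for a fixed English word e, normalizes over the French words of the sentence pair):

```latex
% Expected (fractional) count of the link e -> f in one sentence pair (E, F),
% computed from the current estimates t(e -> f); #( ) counts occurrences
C[e \to f;\; \mathbf{E} \leftrightarrow \mathbf{F}]
  = \frac{t(e \to f)}{\sum_{f' \in \mathbf{F}} t(e \to f')}
    \times \#(e \text{ in } \mathbf{E}) \times \#(f \text{ in } \mathbf{F})
```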
English-French example of alignment
• Completely aligned
  • Your(1) answer(2) is(3) right(4)
  • Votre(1) réponse(2) est(3) juste(4)
  • Alignment: 1↔1, 2↔2, 3↔3, 4↔4
• Problematic alignment
  • We(1) first(2) met(3) in(4) Paris(5)
  • Nous(1) nous(2) sommes(3) rencontrés(4) pour(5) la(6) première(7) fois(8) à(9) Paris(10)
  • Alignment: 1↔(1,2), 2↔(5,6,7,8), 3↔4, 4↔9, 5↔10
  • Fertility? Yes (e.g., 'first' maps to four French words)
EM for word alignment from sentence alignment: example
• English: three rabbits (a b); rabbits of Grenoble (b c d)
• French: trois lapins (w x); lapins de Grenoble (x y z)
Initial probabilities: each cell of the table denotes t(a → w), t(a → x), etc.; all entries start at the uniform value 1/4 (one over the number of French words w, x, y, z).
Example of expected count:
C[a → w; (a b) ↔ (w x)] = t(a → w) / (t(a → w) + t(a → x)) × #(a in 'a b') × #(w in 'w x')
                        = (1/4) / (1/4 + 1/4) × 1 × 1 = 1/2
Revised probability: example
t_revised(a → w) = (1/2) / [ (1/2 + 1/2 + 0 + 0) from (a b) ↔ (w x)  +  (0 + 0 + 0 + 0) from (b c d) ↔ (x y z) ] = 1/2
Re-revised probabilities table: continue until convergence; notice that the (b, x) binding gets progressively stronger.
Another example
• A corpus of two sentence pairs: a b ↔ x y (illustrated book ↔ livre illustré) and b c ↔ x z (book shop ↔ livre magasin)
• Assuming no null alignments, the possible alignments are:
  • a b ↔ x y: {a↔x, b↔y} or {a↔y, b↔x}
  • b c ↔ x z: {b↔x, c↔z} or {b↔z, c↔x}
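With no null alignments and equal sentence lengths, each possible alignment is a one-to-one assignment, so the four alignments above can be enumerated mechanically; a small illustrative Python sketch (the corpus and variable names are just for this example):

```python
from itertools import permutations

# Two sentence pairs from the slide (word labels a, b, c / x, y, z)
corpus = [
    (["a", "b"], ["x", "y"]),   # illustrated book <-> livre illustré
    (["b", "c"], ["x", "z"]),   # book shop <-> livre magasin
]

# With no NULL word and equal lengths, an alignment is a bijection, i.e. a
# permutation of the foreign words assigned to the English words in order.
for e_words, f_words in corpus:
    print(e_words, "<->", f_words)
    for perm in permutations(f_words):
        print("  possible alignment:", list(zip(e_words, perm)))
```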
Translation Model: Exact expression
• Five models for estimating the parameters in the expression [2]: Model-1, Model-2, Model-3, Model-4, Model-5
• The expression factors into three choices: choose the length of the foreign language string given e; choose the alignment given e and m; choose the identity of each foreign word given e, m, a
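The exact expression itself was an equation image; the standard factorization that the three choices annotate would be (a reconstruction, with e the English sentence, f the foreign sentence of length m, and a the alignment):

```latex
% Marginalize over alignments, then factor P(f, a | e) into: length choice,
% alignment choice per position, and foreign-word choice per position
P(\mathbf{f} \mid \mathbf{e}) = \sum_{\mathbf{a}} P(\mathbf{f}, \mathbf{a} \mid \mathbf{e}),
\qquad
P(\mathbf{f}, \mathbf{a} \mid \mathbf{e})
  = P(m \mid \mathbf{e})
    \prod_{j=1}^{m} P\!\left(a_j \mid a_1^{j-1}, f_1^{j-1}, m, \mathbf{e}\right)
                    P\!\left(f_j \mid a_1^{j}, f_1^{j-1}, m, \mathbf{e}\right)
```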
Proof of Translation Model: Exact expression
• P(f | e) = Σ_m P(f, m | e)  (marginalization)
• P(f, m | e) = Σ_a P(f, a, m | e)  (marginalization)
• m is fixed for a particular f, hence P(f | e) = Σ_a P(f, a | e), which expands into the exact expression above
Model-1
• Simplest model
• Assumptions:
  • Pr(m|e) is independent of m and e and is equal to ε
  • The alignment of each foreign language word (FLW) depends only on the length of the English sentence and equals (l+1)^(-1), where l is the length of the English sentence
• The likelihood function is reconstructed below
• Maximize the likelihood function subject to the constraint that, for each e, the t(f|e) values sum to 1
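The likelihood the slide refers to has the standard Model-1 form; since the equation image did not survive extraction, the following is a reconstruction of that standard form together with the normalization constraint:

```latex
% IBM Model 1 likelihood: l = |e| (with e_0 = NULL), m = |f|,
% and t(f_j | e_i) are the word translation probabilities
P(\mathbf{f} \mid \mathbf{e}) = \frac{\epsilon}{(l+1)^{m}}
    \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)

% Constraint: for every English word e the translation probabilities sum to one
\sum_{f} t(f \mid e) = 1
```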
Model-1: Parameter estimation
• Using a Lagrange multiplier for the constrained maximization gives the solution for the Model-1 parameters (reconstructed below)
• λ_e: normalization constant; c(f|e; f, e): expected count; δ(f, f_j) is 1 if f and f_j are the same, zero otherwise
• Estimate t(f|e) using the Expectation Maximization (EM) procedure
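A reconstruction of the standard Model-1 solution the slide describes (the equation image is not in the extracted text); boldface f, e denote a sentence pair, and the sum in the first line runs over the training corpus:

```latex
% M-step: renormalized expected counts collected over all sentence pairs
t(f \mid e) = \lambda_e^{-1} \sum_{(\mathbf{f}, \mathbf{e})} c(f \mid e;\, \mathbf{f}, \mathbf{e})

% E-step: expected count of the pair (f, e) within one sentence pair,
% with e_0 = NULL, l = |e|, m = |f|
c(f \mid e;\, \mathbf{f}, \mathbf{e})
  = \frac{t(f \mid e)}{t(f \mid e_0) + \cdots + t(f \mid e_l)}
    \sum_{j=1}^{m} \delta(f, f_j) \sum_{i=0}^{l} \delta(e, e_i)
```

Note that this standard formulation normalizes over the English words of the sentence for a fixed French word, the mirror image of the simplified per-English-word normalization used in the earlier rabbits/lapins worked example.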