Statistical Machine Translation • Mohammad Taher Pilevar • University of Tehran • Winter 2010
Machine Translation? • English: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport. • Persian: جزیره گوام ایالات متحده، پس از اینکه فرودگاه گوام و دفاتردولتی آن ایمیلی تهدیدآمیز در مورد حمله زیستی/شیمیایی به اماکن عمومی از جمله فرودگاه بود را از طرف شخصی که خود را اسامه بن لادن می نامید در یافت کردند، به حالت آماده باش کامل در آمده است.
Translation? • Strictly speaking, it is (sometimes) impossible for a sentence in one language to be a translation of a sentence in another • E.g.: "the Lord is my shepherd" • "the Lord will look after me": fluent, at the cost of fidelity • "The Lord is for me like somebody who looks after animals with cotton-like hair": faithful to the original sentence, at the cost of fluency • Translation is a compromise between the two
Fidelity + Fluency • So a true translation is both • Faithful to the source language and • Fluent in the target language
…Goal of Translation • The production of an output that maximizes some value function that represents the importance of both faithfulness and fluency • → Statistical MT
best-translation: $\hat{T} = \arg\max_{T} \text{faithfulness}(T, S) \cdot \text{fluency}(T)$ • $\hat{E} = \arg\max_{E} P(F|E) \, P(E)$ • Translation model: $P(F|E)$ • Language model: $P(E)$
…so, what do we need? • Language model P(E) • Translation model P(F|E) • Decoder: given F, produces the most probable E
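The three components can be sketched in a few lines. This is a minimal illustration, not a real decoder: the candidate list, the Spanish input, and every probability in `tm` and `lm` are hypothetical values invented for the example, and a real decoder searches a far larger space instead of enumerating candidates.

```python
import math

# Hypothetical toy probabilities, invented purely for illustration.
f = "la bruja verde"
tm = {  # translation model P(F|E), keyed by (F, E)
    (f, "the green witch"): 0.30,
    (f, "the witch green"): 0.40,
    (f, "a green witch"): 0.10,
}
lm = {  # language model P(E)
    "the green witch": 0.02,
    "the witch green": 0.0005,
    "a green witch": 0.015,
}

def decode(f, candidates, tm, lm):
    """Return the candidate E that maximizes log P(F|E) + log P(E)."""
    return max(candidates, key=lambda e: math.log(tm[(f, e)]) + math.log(lm[e]))

best = decode(f, list(lm), tm, lm)
tm_only = max(lm, key=lambda e: tm[(f, e)])  # what the translation model alone prefers
print(best, "|", tm_only)
```

With these made-up numbers the translation model alone prefers the word-for-word order, and the language model is what pushes the decoder toward the fluent candidate.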
Language model: P(E) • Assigns a higher probability to fluent / grammatical sentences • Estimated using monolingual corpora • High P(e): این جمله از نظر دستور زبان فارسی، یک جمله صحیح محسوب می شود ("In terms of Persian grammar, this sentence is considered a correct sentence") • Low P(e): این محسوب جمله از دستور زبان نظر می شود فارسی یک جمله صحیح (the same words in scrambled order)
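One common way to estimate P(E) from monolingual text is a bigram model. The sketch below assumes add-one smoothing and a three-sentence English training set; both are choices made for this illustration and are not taken from the slides.

```python
import math
from collections import Counter

# Hypothetical monolingual training sentences.
training = [
    "the poor do not have any money",
    "the rich have money",
    "the poor have no money",
]

def train_bigram(sentences):
    """Count bigrams and their left contexts, with sentence boundary markers."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        vocab.update(s.split())
        contexts.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    return contexts, bigrams, len(vocab)

def log_prob(sentence, model):
    """Add-one smoothed bigram log-probability of a sentence."""
    contexts, bigrams, v = model
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + v))
        for a, b in zip(words, words[1:])
    )

model = train_bigram(training)
fluent = log_prob("the poor have no money", model)
scrambled = log_prob("money the no have poor", model)
print(fluent > scrambled)
```

The scrambled sentence uses the same words, but none of its bigrams were seen in training, so it scores lower.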
P(F|E): THE PHRASE-BASED TRANSLATION MODEL • The job of the translation model, given an English sentence E and a foreign sentence F, is to assign a probability that E generates F.
Translation model: P(f|e) • Assigns higher probability to sentence pairs that have corresponding meaning • Estimated using bilingual corpora • e: The former president gave a speech • High P(f|e): رئیس جمهور سابق، سخنرانی کرد ("The former president gave a speech") • Low P(f|e): در سخنرانی رئیس جمهور شرکت کردم ("I attended the president's speech")
Raw data to bilingual corpus • Some books, websites, … in English • The same books, websites, … in Persian
Breaking sentences into words • The poor don’t have any money • [The] [poor] [don’t] [have] [any] [money] • {انسان های} {فقیر} {هیچ} {پولی} {ندارند} • Align according to co-occurrence
Some examples Spurious words
Alignments (e.g. 1) • The poor don’t have any money • انسانهای فقیر هیچ پولی ندارند • [The poor] [don’t have] [any money] • [The poor] [don’t have any money]
Alignments (e.g. 2) • He forgot to turn off the stove • او فراموش کرد که گاز خاموش کند • [He forgot to] [turn off] [the stove]
P(F,A|E) Story • English: null The quick fox jumps over the lazy dog • French: Le renard rapide saute par-dessus le chien paresseux
Getting Pt(f|e) • We need numbers for Pt(f|e) • Example: Pt(le|the) • Count lines in a large collection of aligned text
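If word-aligned text were available, counting alignment lines and normalizing per English word would give Pt(f|e) directly. The sketch below assumes a tiny hand-aligned sample; the three sentence pairs and their links are hypothetical.

```python
from collections import Counter

# Hypothetical word-aligned data: each sentence is a list of (english, french) links.
aligned = [
    [("the", "le"), ("dog", "chien")],
    [("the", "la"), ("house", "maison")],
    [("the", "le"), ("fox", "renard")],
]

def estimate_t(aligned):
    """Relative-frequency estimate: Pt(f|e) = count(e linked to f) / count(e)."""
    link_counts = Counter(link for sentence in aligned for link in sentence)
    e_totals = Counter(e for sentence in aligned for e, _ in sentence)
    return {(f, e): c / e_totals[e] for (e, f), c in link_counts.items()}

t = estimate_t(aligned)
print(t[("le", "the")])
```

Here "the" is linked to "le" in two of its three occurrences, so Pt(le|the) comes out as 2/3.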
Where’s “heaven” in Vietnamese? English: In the beginning God created the heavens and the earth. Vietnamese: Ban dâu Dúc Chúa Tròi dung nên tròi dât. English: God called the expanse heaven. Vietnamese: Dúc Chúa Tròi dat tên khoang không la tròi. English: … you are this day like the stars of heaven in number. Vietnamese: … các nguoi dông nhu sao trên tròi. Example borrowed from Jason Eisner
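The guess a reader makes here can be mimicked with plain co-occurrence: keep only the Vietnamese words that appear in every sentence whose English side mentions heaven. The sketch below uses the three sentence pairs exactly as they appear above (including the simplified diacritics); the helper names and the prefix match on "heaven" are assumptions of this illustration.

```python
pairs = [
    ("In the beginning God created the heavens and the earth.",
     "Ban dâu Dúc Chúa Tròi dung nên tròi dât."),
    ("God called the expanse heaven.",
     "Dúc Chúa Tròi dat tên khoang không la tròi."),
    ("… you are this day like the stars of heaven in number.",
     "… các nguoi dông nhu sao trên tròi."),
]

def tokens(sentence):
    """Lowercase, whitespace-split, strip surrounding punctuation, drop empties."""
    stripped = (w.strip(".,…").lower() for w in sentence.split())
    return [w for w in stripped if w]

def cooccurring(pairs, english_prefix):
    """Foreign words present in every pair whose English side has the prefix."""
    candidates = None
    for english, foreign in pairs:
        if any(w.startswith(english_prefix) for w in tokens(english)):
            words = set(tokens(foreign))
            candidates = words if candidates is None else candidates & words
    return candidates or set()

result = cooccurring(pairs, "heaven")
print(result)
```

Only one word survives the intersection across all three pairs, which is the candidate translation.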
EM: Expectation Maximization • Assume a probability distribution (weights) over hidden events • Take counts of events based on this distribution • Use counts to estimate new parameters • Use parameters to re-weight examples • Rinse and repeat
Good grief! We forgot about P(F|E)! • No worries, a little more stats gets us what we need:
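Presumably the step meant here is the usual marginalization over alignments: the alignment A is hidden, so P(F|E) is obtained by summing the joint P(F,A|E) over all possible alignments. This is the standard identity, stated under the assumption that it is the one the slide intended:

```latex
P(F \mid E) = \sum_{A} P(F, A \mid E)
```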
Big Example: Corpus • 1: fast car / voiture rapide • 2: fast / rapide
Possible Alignments • 1a: fast→voiture, car→rapide • 1b: fast→rapide, car→voiture • 2: fast→rapide
Iteration 1: Parameters → Weight Calculations → Count Lines → Normalize • Alignment weights: 1a = 1/2, 1b = 1/2, 2 = 1
Iteration 2: Parameters → Weight Calculations → Count Lines → Normalize • Alignment weights: 1a = 1/4, 1b = 3/4, 2 = 1
After many iterations: • 1a ≈ 0 • 1b ≈ 1 • 2 = 1
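The alignment weights in this example can be reproduced with a short EM loop. The sketch below assumes one-to-one alignments only (so sentence 1 has exactly two alignments: fast→voiture, car→rapide and fast→rapide, car→voiture) and uniform starting parameters. The function names are illustrative, and exact fractions are used so the numbers can be compared with the slides.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations

corpus = [
    (["fast", "car"], ["voiture", "rapide"]),
    (["fast"], ["rapide"]),
]

def alignments(e_words, f_words):
    """All one-to-one alignments, each a list of (english, french) links."""
    return [list(zip(e_words, perm)) for perm in permutations(f_words)]

def initial_parameters(corpus):
    """Uniform t(f|e) over the French words that co-occur with each English word."""
    cooc = defaultdict(set)
    for e_words, f_words in corpus:
        for e in e_words:
            cooc[e].update(f_words)
    return {(f, e): Fraction(1, len(fs)) for e, fs in cooc.items() for f in fs}

def alignment_weights(t, e_words, f_words):
    """Weight calculation: score each alignment, then normalize per sentence."""
    scores = []
    for links in alignments(e_words, f_words):
        score = Fraction(1)
        for e, f in links:
            score *= t[(f, e)]
        scores.append(score)
    total = sum(scores)
    return [s / total for s in scores]

def em_step(t, corpus):
    """Count lines weighted by alignment weight, then normalize per English word."""
    counts, totals = defaultdict(Fraction), defaultdict(Fraction)
    for e_words, f_words in corpus:
        weights = alignment_weights(t, e_words, f_words)
        for weight, links in zip(weights, alignments(e_words, f_words)):
            for e, f in links:
                counts[(f, e)] += weight
                totals[e] += weight
    return {(f, e): c / totals[e] for (f, e), c in counts.items()}

t = initial_parameters(corpus)
history = []  # weights of the two alignments of sentence 1, per iteration
for _ in range(3):
    history.append(alignment_weights(t, *corpus[0]))
    t = em_step(t, corpus)
print(history[:2])  # weights 1/2, 1/2 and then 1/4, 3/4
```

The lone pair fast / rapide keeps adding count to t(rapide|fast), which is why the second alignment of sentence 1 gains weight on every pass.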
Mary dió una bofetada a la bruja verde • [Figure: lattice of possible English translations for words and phrases in a particular sentence F]
Generative story… • Group the English source words into phrases • Translate each phrase (translation probability) • Optionally reorder the phrases (distortion probability)
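Translation probability: $\phi(\bar{f}_i, \bar{e}_i)$ • Distortion: a word having a different (‘distorted’) position in the Spanish sentence than it had in the English sentence, parameterized by $d(a_i - b_{i-1})$ • where $a_i$ is the start position of the foreign phrase generated by the $i$th English phrase $\bar{e}_i$, and $b_{i-1}$ is the end position of the foreign phrase generated by the $(i-1)$th English phrase $\bar{e}_{i-1}$.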
Distortion probability: $d(a_i - b_{i-1}) = \alpha^{|a_i - b_{i-1} - 1|}$ • This distortion model penalizes large distortions: the larger the distortion, the lower the probability.
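A small sketch of how the distortion penalty combines with phrase translation probabilities. It assumes an exponential penalty alpha ** |a_i - b_prev - 1|, where a_i is the start position of the current foreign phrase and b_prev is the end position of the previous one, and it assumes the sentence score is the product of translation and distortion terms over all phrases. The value alpha = 0.5, the phrase segmentation, the convention that the end position before the first phrase is 0, and the flat phi values are all hypothetical.

```python
def distortion(a_i, b_prev, alpha=0.5):
    """Penalty for starting a foreign phrase at a_i after the previous ended at b_prev."""
    return alpha ** abs(a_i - b_prev - 1)

def phrase_model_prob(phrase_pairs, phi, alpha=0.5):
    """Product over phrases of phi(f, e) * d(a_i - b_prev), taken in English order."""
    prob, b_prev = 1.0, 0  # assumption: "end position" before the first phrase is 0
    for e_phrase, f_phrase, start, end in phrase_pairs:
        prob *= phi[(f_phrase, e_phrase)] * distortion(start, b_prev, alpha)
        b_prev = end
    return prob

# "Mary dió una bofetada a la bruja verde", Spanish positions 1..8.
# Hypothetical segmentation: (English phrase, Spanish phrase, start, end).
phrase_pairs = [
    ("Mary", "Mary", 1, 1),
    ("slapped", "dió una bofetada a", 2, 5),
    ("the", "la", 6, 6),
    ("green", "verde", 8, 8),
    ("witch", "bruja", 7, 7),
]
phi = {(f, e): 0.5 for e, f, _, _ in phrase_pairs}  # hypothetical phrase table

p = phrase_model_prob(phrase_pairs, phi)
print(p)
```

In this segmentation only "green" and "witch" are out of order relative to the Spanish, so they are the only phrases that pay a distortion penalty.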
Alignment in MT • The poor don’t have any money • [The] [poor] [don’t] [have] [any] [money] • {انسان های} {فقیر} {هیچ} {پولی} {ندارند} • Align according to co-occurrence