
Evaluating Machine Translation Using BLEU Scores

Learn how to assess the quality of machine translations, understand the BLEU metric, and compare candidate translations with human references.





Presentation Transcript


  1. LING 138/238 SYMBSYS 138 Intro to Computer Speech and Language Processing Lecture 12: Machine Translation (II) November 4, 2004 Dan Jurafsky Thanks to Kevin Knight for much of this material!! LING 138/238 Autumn 2004

  2. Outline for MT Week • Intro and a little history • Language Similarities and Divergences • Four main MT Approaches • Transfer • Interlingua • Direct • Statistical • Evaluation LING 138/238 Autumn 2004

  3. Thanks to Bonnie Dorr! • Next ten slides draw from her slides on BLEU LING 138/238 Autumn 2004

  4. How do we evaluate MT? Human • Fluency • Overall fluency • Human rating of sentences read out loud • Cohesion (Lexical chains, anaphora, ellipsis) • Hand-checking for cohesion. • Well-formedness • 5-point scale of syntactic correctness • Fidelity (same information as source?) • Hand rating of target text on 100pt scale • Clarity • Comprehensibility • Noise test • Multiple choice questionnaire • Readability • cloze LING 138/238 Autumn 2004

  5. Evaluating MT: Problems • Asking humans to judge sentences on a 5-point scale for 10 factors takes time and $$$ (weeks or months!) • We can’t build language engineering systems if we can only evaluate them once every quarter!!!! • We need a metric that we can run every time we change our algorithm. • It would be OK if it wasn’t perfect, but just tended to correlate with the expensive human metrics, which we could still run quarterly. LING 138/238 Autumn 2004

  6. BiLingual Evaluation Understudy (BLEU —Papineni, 2001) • Automatic Technique, but …. • Requires the pre-existence of Human (Reference) Translations • Approach: • Produce corpus of high-quality human translations • Judge “closeness” numerically (word-error rate) • Compare n-gram matches between candidate translation and 1 or more reference translations http://www.research.ibm.com/people/k/kishore/RC22176.pdf LING 138/238 Autumn 2004

  7. Bleu Comparison Chinese-English Translation Example: Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party. Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct. Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party. LING 138/238 Autumn 2004

  8. How Do We Compute Bleu Scores? • Intuition: “What percentage of words in the candidate occurred in some human translation?” • Proposal: count up the number of candidate translation words (unigrams) that occur in any reference translation, then divide by the total number of words in the candidate translation • But we can’t just count the total number of overlapping N-grams! • Candidate: the the the the the the • Reference 1: The cat is on the mat • Solution: A reference word should be considered exhausted after a matching candidate word is identified. LING 138/238 Autumn 2004

  9. “Modified n-gram precision” • For each word compute: (1) the total number of times it occurs in any single reference translation (2) the number of times it occurs in the candidate translation • Instead of using count #2 directly, use the minimum of #1 and #2, i.e. clip the counts at the maximum for the reference translation • Now use that modified count. • And divide by the number of candidate words. LING 138/238 Autumn 2004
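The clipped counting described above can be sketched in a few lines of Python. This is a minimal illustration, not the official BLEU implementation; the tokenizer here just lowercases and strips periods, which happens to be enough for the slide examples.

```python
from collections import Counter
from fractions import Fraction

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n=1):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in the single reference where it is most
    frequent, so repeating a word cannot inflate the score."""
    tokenize = lambda s: s.lower().replace(".", "").split()
    cand_counts = Counter(ngrams(tokenize(candidate), n))
    # Maximum count of each n-gram over the individual reference translations
    max_ref = Counter()
    for ref in references:
        for g, c in Counter(ngrams(tokenize(ref), n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return Fraction(clipped, sum(cand_counts.values()))

# The Chinese-English example from slide 7
references = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]
candidate1 = "It is a guide to action which ensures that the military always obeys the commands of the party."
candidate2 = "It is to insure the troops forever hearing the activity guidebook that party direct."
```

Running this reproduces the numbers worked out on the next four slides: 17/18 and 8/14 for unigrams, 10/17 and 1/13 for bigrams.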

  10. Modified Unigram Precision: Candidate #1 It(1) is(1) a(1) guide(1) to(1) action(1) which(1) ensures(1) that(2) the(4) military(1) always(1) obeys(0) the commands(1) of(1) the party(1) Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party. What’s the answer??? 17/18 LING 138/238 Autumn 2004

  11. Modified Unigram Precision: Candidate #2 It(1) is(1) to(1) insure(0) the(4) troops(0) forever(1) hearing(0) the activity(0) guidebook(0) that(2) party(1) direct(0) Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party. What’s the answer???? 8/14 LING 138/238 Autumn 2004

  12. Modified Bigram Precision: Candidate #1 It is(1) is a(1) a guide(1) guide to(1) to action(1) action which(0) which ensures(0) ensures that(1) that the(1) the military(1) military always(0) always obeys(0) obeys the(0) the commands(0) commands of(0) of the(1) the party(1) Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party. 10/17 What’s the answer???? LING 138/238 Autumn 2004

  13. Modified Bigram Precision: Candidate #2 It is(1) is to(0) to insure(0) insure the(0) the troops(0) troops forever(0) forever hearing(0) hearing the(0) the activity(0) activity guidebook(0) guidebook that(0) that party(0) party direct(0) Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party. What’s the answer???? 1/13 LING 138/238 Autumn 2004

  14. Catching Cheaters Candidate: the(2) the(0) the(0) the(0) the(0) the(0) the(0) Reference 1: The cat is on the mat Reference 2: There is a cat on the mat What’s the unigram answer? 2/7 What’s the bigram answer? 0/7 LING 138/238 Autumn 2004
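A short standalone check of the clipped unigram count for this cheating candidate: "the" occurs at most twice in any single reference, so only two of the seven copies get credit.

```python
from collections import Counter

candidate = "the the the the the the the".split()          # 7 words
references = ["The cat is on the mat".lower().split(),
              "There is a cat on the mat".lower().split()]

cand_counts = Counter(candidate)
# Clip each word's count at its maximum count in any single reference
clipped = sum(min(c, max(Counter(r)[w] for r in references))
              for w, c in cand_counts.items())
precision = clipped / len(candidate)
```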

  15. Bleu distinguishes human from machine translations LING 138/238 Autumn 2004

  16. Bleu problems with sentence length • Candidate: of the • Solution: brevity penalty; prefers candidate translations which are the same length as one of the references Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party. Problem: modified unigram precision is 2/2, bigram 1/1! LING 138/238 Autumn 2004
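In the BLEU paper the brevity penalty is computed once over a whole test corpus; the per-sentence sketch below shows the same idea on the "of the" example, using the closest reference length as the effective length.

```python
import math

def brevity_penalty(cand_len, ref_lens):
    """BLEU-style brevity penalty: no penalty if the candidate is at least
    as long as the closest reference; otherwise decay exponentially."""
    # Effective reference length r: the reference length closest to cand_len
    r = min(ref_lens, key=lambda rl: (abs(rl - cand_len), rl))
    return 1.0 if cand_len >= r else math.exp(1.0 - r / cand_len)

# "of the" has length 2; the three references have 16, 18, and 16 words
bp = brevity_penalty(2, [16, 18, 16])
```

The two-word candidate is crushed by a factor of e^(1 − 16/2) = e^(−7), so its perfect modified precisions no longer help it.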

  17. Statistical MT • Fidelity and fluency • Best translation: T̂ = argmaxT P(S|T) P(T) • Developed by researchers who were originally in speech recognition at IBM • Called the IBM model LING 138/238 Autumn 2004

  18. The IBM model • Hmm, those two factors might look familiar… • Yup, it’s Bayes rule: P(T|S) = P(S|T) P(T) / P(S), and since P(S) is fixed for a given source sentence, argmaxT P(T|S) = argmaxT P(S|T) P(T) LING 138/238 Autumn 2004

  19. Fluency: P(T) • How to measure that this sentence • That car was almost crash onto me • is less fluent than this one: • That car almost hit me. • Answer: language models (N-grams!) • For example P(hit|almost) > P(was|almost) • But can use any other more sophisticated model of grammar • Advantage: this is monolingual knowledge! LING 138/238 Autumn 2004
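The N-gram comparison on this slide can be made concrete with a maximum-likelihood bigram model. The corpus below is invented purely for illustration; any monolingual text would do.

```python
from collections import Counter

# Toy monolingual corpus (an assumption for illustration)
corpus = [
    "that car almost hit me",
    "the ball almost hit the window",
    "she almost hit the brakes",
    "it was late and it was cold",
]

bigram_counts = Counter()
history_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    bigram_counts.update(zip(tokens, tokens[1:]))
    history_counts.update(tokens[:-1])   # count each word as a history

def p(word, prev):
    """Maximum-likelihood bigram estimate P(word | prev)."""
    return bigram_counts[(prev, word)] / history_counts[prev]
```

On this corpus P(hit|almost) > P(was|almost), matching the slide's intuition; and note that building it required only English text, no bilingual data.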

  20. Faithfulness: P(S|T) • French: ça me plait [that me pleases] • English: • that pleases me - most fluent • I like it • I’ll take that one • How to quantify this? • Intuition: degree to which words in one sentence are plausible translations of words in other sentence • Product of probabilities that each word in target sentence would generate each word in source sentence. LING 138/238 Autumn 2004

  21. Faithfulness P(S|T) • Need to know, for every target language word, probability of it mapping to every source language word. • How do we learn these probabilities? • Parallel texts! • Lots of times we have two texts that are translations of each other • If we knew which word in Source Text mapped to each word in Target Text, we could just count! LING 138/238 Autumn 2004

  22. Faithfulness P(S|T) • Sentence alignment: • Figuring out which source language sentence maps to which target language sentence • Word alignment • Figuring out which source language word maps to which target language word LING 138/238 Autumn 2004

  23. Big Point about Faithfulness and Fluency • Job of the faithfulness model P(S|T) is just to model the “bag of words”: which words map from, say, English to Spanish. • P(S|T) doesn’t have to worry about internal facts about Spanish word order: that’s the job of P(T) • P(T) can do Bag generation: put the following words in order • Have programming a seen never I language better • Actual the hashing is since not collision-free usually the is less perfectly the of somewhat capacity table LING 138/238 Autumn 2004

  24. P(T) and bag generation: the answer • “Usually the actual capacity of the table is somewhat less, since the hashing is not perfectly collision-free” LING 138/238 Autumn 2004

  25. A motivating example • Japanese phrase: 2000nen taio • 2000nen: 2000 (highest), Y2K, 2000 years, 2000 year • taio: correspondence (highest), corresponding, equivalent, tackle, dealing with, deal with • P(S|T) alone prefers: 2000 correspondence • Adding P(T) might produce the correct: Dealing with Y2K LING 138/238 Autumn 2004

  26. More formally: The IBM Model • Let’s flesh out these intuitions about P(S|T) and P(T) a bit. • Many of the next slides are drawn from Kevin Knight’s fantastic “A Statistical MT Tutorial Workbook”! LING 138/238 Autumn 2004

  27. IBM Model 3 as probabilistic version of Direct MT • We translate English into Spanish as follows: • Replace the words in the English sentence by Spanish words • Scramble around the words to look like Spanish order • But we can’t propose that English words are replaced by Spanish words one-for-one, because translations aren’t the same length. LING 138/238 Autumn 2004

  28. IBM Model 3 (from Knight 1999) • For each word ei in the English sentence, choose a fertility φi. The choice of φi depends only on ei, not on other words or other φ’s. • For each word ei, generate φi Spanish words. The choice of Spanish word depends only on the English word ei, not on the English context or any other Spanish words. • Permute all the Spanish words. Each Spanish word gets assigned an absolute target position slot (1, 2, 3, etc.). The choice of position depends only on the absolute position of the English word generating it. LING 138/238 Autumn 2004

  29. Translation as String rewriting (from Knight 1999) • Mary did not slap the green witch • Assign fertilities: 1 = copy over word, 2 = copy twice, etc., 0 = delete • Mary not slap slap slap the the green witch • Replace English words with Spanish ones one-for-one: • Mary no daba una bofetada a la verde bruja • Permute the words: • Mary no daba una bofetada a la bruja verde LING 138/238 Autumn 2004
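The three rewriting steps can be traced deterministically. In the real model each choice is sampled from the n, t, and d distributions; here the fertilities, word choices, and final permutation are hard-coded from the slide.

```python
# Fertilities from the slide: 0 deletes a word, 2 copies it twice, etc.
fertility = {"Mary": 1, "did": 0, "not": 1, "slap": 3,
             "the": 2, "green": 1, "witch": 1}

english = "Mary did not slap the green witch".split()

# Step 1: apply fertilities
step1 = [w for w in english for _ in range(fertility[w])]
# Step 2: replace each copy with a Spanish word, one-for-one
spanish_words = ["Mary", "no", "daba", "una", "bofetada",
                 "a", "la", "verde", "bruja"]
step2 = spanish_words[:len(step1)]
# Step 3: permute into Spanish order (adjective follows noun: "bruja verde")
order = [0, 1, 2, 3, 4, 5, 6, 8, 7]
step3 = [step2[i] for i in order]
```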

  30. Model 3: P(S|T) training parameters • What are the parameters for this model? Just look at the dependencies: • Words: P(casa|house) • Fertilities: n(1|house): prob that “house” will produce 1 Spanish word whenever “house” appears. • Distortions: d(5|2): prob that the English word in position 2 of the English sentence generates a Spanish word in position 5 of the Spanish translation • Actually, distortions are d(5,2,4,6), where 4 is the length of the English sentence and 6 is the Spanish length • Remember, P(S|T) doesn’t have to model fluency LING 138/238 Autumn 2004

  31. Model 3: last twist • Imagine some Spanish words are “spurious”: they appear in Spanish even though they weren’t in the English original • Like function words; we generated “a la” from “the” by giving “the” fertility 2 • Instead, we could give “the” fertility 1, and generate “a” spuriously • Do this by pretending every English sentence contains an invisible word NULL as word 0. • Then parameters like t(a|NULL) give the probability of the word “a” being generated spuriously from NULL LING 138/238 Autumn 2004

  32. Spurious words • We could imagine having n(3|NULL) (the probability of there being exactly 3 spurious words in a Spanish translation) • Instead of n(0|NULL), n(1|NULL), …, n(25|NULL), have a single parameter p1 • After assigning fertilities to the non-NULL English words, we want to generate (say) z Spanish words. • As we generate each of the z words, we optionally toss in a spurious Spanish word with probability p1 • The probability of not tossing in a spurious word is p0 = 1 − p1 LING 138/238 Autumn 2004

  33. Distortion probabilities for spurious words • Can’t just have d(5|0,4,6), i.e. the chance that a NULL-generated word will end up in position 5. • Why? These are spurious words! They could occur anywhere!! Too hard to predict • Instead: • Use the normal-word distortion parameters to choose positions for the normally-generated Spanish words • Put the NULL-generated words into the empty slots left over • If there are three NULL-generated words and three empty slots, then there are 3!, or six, ways of slotting them all in • We’ll assign a probability of 1/6 to each way LING 138/238 Autumn 2004
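The slot-filling arithmetic is just a factorial: with k NULL-generated words and k empty slots, every one of the k! complete assignments is taken to be equally likely.

```python
import math

def null_slot_probability(k):
    """Probability of any single way of placing k NULL-generated words
    into k remaining empty slots, all k! orderings being equally likely."""
    return 1 / math.factorial(k)
```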

  34. Real Model 3 • For each word ei (i = 1, 2, …, L) in the English sentence, choose a fertility φi with prob n(φi|ei) • Choose the number φ0 of spurious Spanish words to be generated from e0 = NULL, using p1 and the sum of fertilities from step 1 • Let m be the sum of fertilities for all words, including NULL • For each i = 0, 1, 2, …, L and k = 1, 2, …, φi: choose a Spanish word τik with probability t(τik|ei) • For each i = 1, 2, …, L and k = 1, 2, …, φi: choose a target Spanish position πik with prob d(πik|i, L, m) • For each k = 1, 2, …, φ0: choose position π0k from the φ0 − k + 1 remaining vacant positions in 1, 2, …, m, for total prob 1/φ0! • Output the Spanish sentence with words τik in positions πik (0 ≤ i ≤ L, 1 ≤ k ≤ φi) LING 138/238 Autumn 2004

  35. String rewriting • Mary did not slap the green witch (input) • Mary not slap slap slap the green witch (choose fertilities) • Mary not slap slap slap NULL the green witch (choose number of spurious words) • Mary no daba una bofetada a la verde bruja (choose translations) • Mary no daba una bofetada a la bruja verde (choose target positions) LING 138/238 Autumn 2004

  36. Model 3 parameters • n, t, p, d • If we had English strings and step-by-step rewritings into Spanish, we could: • Compute n(0|did) by locating every instance of “did” and seeing what happens to it during the first rewriting step • If “did” appeared 15,000 times and was deleted during the first rewriting step 13,000 times, then n(0|did) = 13/15 LING 138/238 Autumn 2004
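The counting argument is a one-line maximum-likelihood estimate. The 15,000/13,000 figures are from the slide; the split of the remaining 2,000 occurrences between fertilities 1 and 2 below is invented to make the example complete.

```python
from collections import Counter
from fractions import Fraction

def fertility_table(observed):
    """MLE fertility distribution n(phi | e) from the observed fertilities
    of a single English word across step-by-step rewritings."""
    counts = Counter(observed)
    total = sum(counts.values())
    return {phi: Fraction(c, total) for phi, c in counts.items()}

# "did" seen 15,000 times, deleted (fertility 0) 13,000 times (slide numbers);
# the 1,800 / 200 split for fertilities 1 and 2 is an assumption
observed_did = [0] * 13000 + [1] * 1800 + [2] * 200
n_did = fertility_table(observed_did)
```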

  37. Alignments NULL And the program has been implemented | | | | | | /|\ Le programme a ete mis en application • If we had lots of alignments like this: • n(0|did): how many times “did” connects to no French words • t(maison|house): how many of all the French words generated by “house” were “maison” • d(5|2,4,6): out of all the times some word in position 2 moved somewhere, how many times did it move to position 5? LING 138/238 Autumn 2004

  38. Where to get alignments • It turns out we can bootstrap alignments if we just have a bilingual corpus: • 1. Assume some startup values for n, d, t, etc. • 2. Use the values of n, d, t, etc. to run Model 3 in “forced alignment” mode, i.e. to pick the best word alignments between sentences • 3. Use these alignments to retrain n, d, t, etc. • 4. Go to step 2 • This is called the Expectation-Maximization or EM algorithm LING 138/238 Autumn 2004
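The EM loop is easiest to see in IBM Model 1, which keeps only the translation table t and drops fertilities and distortions; the sketch below (not the full Model 3 procedure from the slide) runs it on an invented two-sentence toy corpus.

```python
from collections import defaultdict

def train_model1(corpus, iterations=10):
    """EM for IBM Model 1: estimate t(f|e) from sentence pairs alone,
    with no word alignments given."""
    t = defaultdict(float)
    f_vocab = {f for _, fs in corpus for f in fs.split()}
    uniform = 1.0 / len(f_vocab)
    for es, fs in corpus:
        for e in ["NULL"] + es.split():
            for f in fs.split():
                t[(f, e)] = uniform          # startup values (step 1)
    for _ in range(iterations):
        count = defaultdict(float)           # expected counts c(f, e)
        total = defaultdict(float)           # expected counts c(e)
        for es, fs in corpus:
            e_words = ["NULL"] + es.split()
            for f in fs.split():
                # E-step: split credit for f among the e's, proportional to t
                z = sum(t[(f, e)] for e in e_words)
                for e in e_words:
                    count[(f, e)] += t[(f, e)] / z
                    total[e] += t[(f, e)] / z
        # M-step: renormalize the expected counts into probabilities
        for f, e in count:
            t[(f, e)] = count[(f, e)] / total[e]
    return t

# Toy parallel corpus (invented): EM figures out "casa" goes with "house"
corpus = [("the house", "la casa"), ("the", "la")]
t = train_model1(corpus)
```

After a few iterations the pigeonhole logic kicks in: "la" co-occurs with "the" in both pairs, so "casa" gets pushed onto "house".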

  39. Summary • Intro and a little history • Language Similarities and Divergences • Four main MT Approaches • Transfer • Interlingua • Direct • Statistical • Evaluation LING 138/238 Autumn 2004

  40. Classes • LINGUIST 139M/239M. Human and Machine Translation. (Martin Kay) • CS 224N. Natural Language Processing (Chris Manning) LING 138/238 Autumn 2004
