Improving SMT with Phrase to Phrase Translations
Joy Ying Zang, Ashish Venugopal, Stephan Vogel, Alex Waibel
Carnegie Mellon University
Project: Mega-RADD
CMU Mega RADD
The Mega-RADD Team:
• SMT: Stephan Vogel, Alex Waibel, John Lafferty
• EBMT: Ralf Brown, Bob Frederking
• Chinese: Joy Ying Zang, Ashish Venugopal, Bing Zhao, Fei Huang
• Arabic: Alicia Tribble, Ahmed Badran
Overview
• Goals:
  • Develop data-driven, general-purpose MT systems
  • Train on large and small corpora; evaluate to test portability
• Approaches:
  • Two data-driven approaches: statistical and example-based
  • Also a grammar-based translation system
  • Multi-engine translation
  • Languages: Chinese and Arabic
• Statistical translation:
  • Exploit structure in language: phrases
  • Determine phrases from mono- and bilingual co-occurrences
  • Determine phrases from lexical and alignment information
Arabic: Initial System
• 1 million words of UN data, 300 sentences for testing
• Preprocessing: separation of punctuation marks, lowercasing for English, correction of corrupted numbers
• Adding human knowledge: cleaning the statistical lexicon for the 100 most frequent words; building lists of names, simple date expressions, and numbers (total: 1000 entries; total effort: two part-timers * 4 weeks)
• Alignment: IBM1 plus HMM training; lexicon plus phrase translations
• Language model: trained on the 1m sub-corpus
• Results (20 May 2002):
  • UN test data (300 sentences): Bleu = 0.1176
  • NIST devtest (203 sentences): Bleu = 0.0242, NIST = 2.0608
Arabic: Portability to a New Language
• Training on a subset of the UN corpus chosen to cover the vocabulary of the test data
• Training English to Arabic for extraction of phrase translations
• Minimalist morphology: strip/add suffixes for ~200 unknown words
  NIST: 5.5368 → 5.6700
• Adapting the LM: select stories from 2 years of English Xinhua stories according to an 'Arabic' keyword list (280 entries); size 6.9m words
  NIST: 5.5368 → 5.9183
• Results:
  - 20 May (devtest): 2.0608
  - 13 June (devtest): 6.5805
  - 14 June (evaltest): 5.4662 (final training not completed)
  - 17 June (evaltest): 6.4499 (after completed training)
  - 19 July (devtest): 7.0482
Two Approaches
• Determine phrases from mono- and bilingual co-occurrences (Joy)
• Determine phrases from lexical and alignment information (Ashish)
Why phrases?
• Mismatch between languages: word-to-word translation doesn't work
• Phrases encapsulate the context of words, e.g. verb tense
Why phrases? (Cont.)
• Local reordering, e.g. Chinese relative clauses
• Using phrases to mitigate word segmentation errors
Utilizing bilingual information
• Given a sentence pair (S,T), S = <s1, s2, …, si, …, sm>, T = <t1, t2, …, tj, …, tn>, where si/tj are source/target words.
• Given an m*n matrix B, where B(i,j) = co-occurrence(si, tj), computed from a 2*2 contingency table (a: sentence pairs containing both si and tj; b: si only; c: tj only; d: neither) with N = a + b + c + d, using the chi-square statistic:
  co-occurrence(si, tj) = N * (a*d - b*c)^2 / ((a+b) * (c+d) * (a+c) * (b+d))
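A minimal Python sketch (not from the slides) of building B with the chi-square measure above; count_s, count_t, and count_st are hypothetical corpus-count tables:

def chi_square(a, b, c, d):
    # a: sentence pairs containing both si and tj; b: si only; c: tj only; d: neither.
    n = a + b + c + d  # N = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def cooccurrence_matrix(S, T, count_s, count_t, count_st, num_pairs):
    # count_s[s]: pairs whose source side contains s; count_st[(s, t)]: pairs containing both.
    B = [[0.0] * len(T) for _ in S]
    for i, s in enumerate(S):
        for j, t in enumerate(T):
            a = count_st.get((s, t), 0)
            b = count_s[s] - a
            c = count_t[t] - a
            d = num_pairs - a - b - c
            B[i][j] = chi_square(a, b, c, d)
    return B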
Utilizing bilingual information (Cont.)
• Goal: find a partition over matrix B, under the constraint that each source/target word can align to only one target/source word or phrase (adjacent word sequence).
• (Figures: a legal segmentation with an imperfect alignment vs. an illegal segmentation with a perfect alignment.)
Utilizing bilingual information (Cont.)
# Phrase extraction for one sentence pair, given its m*n co-occurrence matrix B.
def extract_phrases(B, sim_thresh):
    m, n = len(B), len(B[0])
    row_done, col_done = [False] * m, [False] * n
    def similar(x, y):  # similar(x, y) = true if abs((x - y) / y) < 1 - sim_thresh
        return y != 0 and abs((x - y) / y) < 1 - sim_thresh

    def expand(i, j):
        # Current aligned region: [RowStart=i, RowEnd=i; ColStart=j, ColEnd=j].
        r0, r1, c0, c1, grown = i, i, j, j, True
        while grown:  # while still OK to expand
            grown = False
            if r0 > 0 and all(similar(B[r0-1][k], B[i][j]) for k in range(c0, c1+1)):
                r0, grown = r0 - 1, True  # expand to north
            if r1 < m-1 and all(similar(B[r1+1][k], B[i][j]) for k in range(c0, c1+1)):
                r1, grown = r1 + 1, True  # expand to south
            if c0 > 0 and all(similar(B[k][c0-1], B[i][j]) for k in range(r0, r1+1)):
                c0, grown = c0 - 1, True  # expand to west
            if c1 < n-1 and all(similar(B[k][c1+1], B[i][j]) for k in range(r0, r1+1)):
                c1, grown = c1 + 1, True  # expand to east
        return r0, r1, c0, c1

    regions = []
    while not (all(row_done) and all(col_done)):  # still has a row or column not aligned
        # Find cell [i, j] where B(i, j) is the max among all available (not aligned) cells.
        avail = [(B[i][j], i, j) for i in range(m) if not row_done[i]
                 for j in range(n) if not col_done[j]]
        if not avail:
            break
        _, i, j = max(avail)
        r0, r1, c0, c1 = expand(i, j)
        for k in range(r0, r1+1): row_done[k] = True  # mark the region as aligned
        for k in range(c0, c1+1): col_done[k] = True
        regions.append(((r0, r1), (c0, c1)))
    return regions  # output the aligned regions as phrases
Utilizing bilingual information (Cont.)
(Figure: step-by-step expansion of the aligned region to the north, south, east, and west.)
Integrating monolingual information
• Motivation:
  • Use more information in the alignment
  • Easier for aligning phrases
  • There is much more monolingual data than bilingual data
(Figure: example place names — Santa Monica, Santa Clarita, Union town, Pittsburgh, Los Angeles, Corona, Somerset.)
Integrating monolingual information (Cont.)
• Given a sentence pair (S,T), S = <s1, s2, …, si, …, sm> and T = <t1, t2, …, tj, …, tn>, where si/tj are source/target words.
• Construct an m*m matrix A, where A(i,j) = collocation(si, sj); only A(i,i-1) and A(i,i+1) have values.
• Construct an n*n matrix C, where C(i,j) = collocation(ti, tj); only C(j-1,j) and C(j+1,j) have values.
• Construct an m*n matrix B, where B(i,j) = co-occurrence(si, tj).
Integrating monolingual information (Cont.)
• Normalize A so that each row sums to 1: Σj A(i,j) = 1
• Normalize C so that each column sums to 1: Σi C(i,j) = 1
• Normalize B so that all entries sum to 1: Σi Σj B(i,j) = 1
• Calculate the new source-target matrix B' = A * B * C
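A minimal numpy sketch of the combination step; the normalization directions (rows of A, columns of C, global sum of B) are assumptions recovered from the slide:

import numpy as np

def combine(A, B, C):
    # A: m*m source collocations, C: n*n target collocations, B: m*n co-occurrence.
    A = A / (A.sum(axis=1, keepdims=True) + 1e-12)  # each row of A sums to 1
    C = C / (C.sum(axis=0, keepdims=True) + 1e-12)  # each column of C sums to 1
    B = B / B.sum()                                 # all entries of B sum to 1
    return A @ B @ C                                # B' = A * B * C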
Discussion and Results
• Simple
• Efficient:
  • Partitioning the matrix is linear, O(min(m,n))
  • The construction of A*B*C is O(m*n), since A and C have only adjacent-word entries
• Effective:
  • Improved translation quality from the baseline (NIST = 6.3775, Bleu = 0.1417) to (NIST = 6.7405, Bleu = 0.1681) on the small data track dev-test
Utilizing alignment information: Motivation
• The alignment model associates words and their translations at the sentence level.
• Context and co-occurrence are represented when considering a set of sentence-level alignments.
• Extract phrase relations from the alignment information.
Processing Alignments
• Identification – selection of target phrase candidates for each source phrase.
• Scoring – assigning a score to each candidate phrase pair to create a ranking.
• Pruning – reducing the set of candidate translations to a computationally tractable number.
Identification
• Extraction from sentence-level alignments.
• For each source phrase, identify the sentences in which it occurs and load the sentence alignment.
• Form a sliding/expanding window in the alignment to identify candidate translations.
Identification Example - II
• is
• is in step with the
• is in step with the establishment
• is in step with the establishment of
• is in step with the establishment of its
• is in step with the establishment of its legal
• is in step with the establishment of its legal system
• the
• the establishment
• the establishment of
• ……
• the establishment of its legal system
• ……
• establishment
• establishment of
• establishment of its
• ….
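A small sketch of the expanding window; it assumes the alignment supplies an anchor position in the target sentence for the source phrase (the anchor interface is hypothetical):

def candidate_windows(target_words, anchor, max_len=7):
    # Yield every target span of up to max_len words that covers the anchor position.
    n = len(target_words)
    for start in range(max(0, anchor - max_len + 1), anchor + 1):
        for end in range(anchor + 1, min(n, start + max_len) + 1):
            yield " ".join(target_words[start:end])

With the anchor on "establishment", this enumerates "the establishment", "the establishment of", and so on, as in the example above.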
Scoring - I
• This candidate set H needs to be scored and ranked before pruning.
• Alignment-based scores.
• Similarity clustering:
  • Assume that the hypothesis set contains several similar phrases (across several sentences) and several noisy phrases.
  • SimScore(h) = Mean(EditDistance(h, h') / AvgLen(h, h')) for h, h' in H
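A sketch of the similarity-clustering score, assuming token-level edit distance and AvgLen(h, h') = (|h| + |h'|) / 2; hypotheses are token lists:

def edit_distance(a, b):
    # Classic one-row dynamic-programming Levenshtein distance over tokens.
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def sim_score(h, H):
    # Mean normalized edit distance to the other hypotheses; low = well supported.
    others = [g for g in H if g is not h]
    if not others:
        return 0.0
    return sum(edit_distance(h, g) / ((len(h) + len(g)) / 2.0) for g in others) / len(others)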
Scoring - II
• Lexicon augmentation:
  • Weight each point in alignment scoring by its lexical probability: P(si | tj), where I, J represent the area of the translation hypothesis being considered. Only pairs of words where there is an alignment are considered.
  • Calculate the translation probability of the hypothesis: Σi Πj P(si | tj). All words in the hypothesis are considered.
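A sketch of the hypothesis-level score Σi Πj P(si | tj); lex_prob is a hypothetical lookup into the Model 1 statistical lexicon:

def lexical_score(src_words, hyp_words, lex_prob):
    # Sum over source words of the product of lexical probabilities P(s | t).
    total = 0.0
    for s in src_words:
        prod = 1.0
        for t in hyp_words:
            prod *= lex_prob(s, t)
        total += prod
    return total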
Combining Scores
• FinalScore(h) = Πj Scorej(h) over each scoring method j.
• Due to the additional morphology present in English as compared to Chinese, a length model is used to adjust the final score to prefer longer phrases:
  • DiffRatio = (I - J) / J, if I > J
  • FinalScore(h) = FinalScore(h) * (1.0 + c * e^(-DiffRatio))
  • c is an experimentally determined constant
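A sketch of the length adjustment; i_len and j_len follow the slide's I and J, and c = 0.5 is only a placeholder for the experimentally determined constant:

import math

def length_bonus(score, i_len, j_len, c=0.5):
    # Boost the score when the hypothesis side (I) is longer than the other side (J).
    if i_len > j_len:
        diff_ratio = (i_len - j_len) / j_len
        return score * (1.0 + c * math.exp(-diff_ratio))
    return score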
Pruning
• This large candidate list is now sorted by score and is ready for pruning.
• It is difficult to pick a threshold that works across different phrases; we need a split point that separates the useful candidates from the noisy ones.
• SplitPoint = argmax_p {MeanScore(h < p) - MeanScore(h >= p)}, where h ranges over the ordered hypothesis set H.
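A sketch of the split-point search over the score-sorted candidate list:

def prune(scored):
    # scored: list of (hypothesis, score) pairs, sorted by score in descending order.
    def mean(xs):
        return sum(xs) / len(xs)
    scores = [s for _, s in scored]
    best_p, best_gap = 1, float("-inf")
    for p in range(1, len(scores)):
        gap = mean(scores[:p]) - mean(scores[p:])  # MeanScore(h < p) - MeanScore(h >= p)
        if gap > best_gap:
            best_p, best_gap = p, gap
    return scored[:best_p]  # keep the useful side of the split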
Experiments
• Alignment model – experimented with one-way (EF) and two-way (EF-FE union/intersection) alignments for IBM Models 1-4.
• Best results were found using the union (high-recall model) from Model 4.
• Both lexicon-augmentation scores (using the Model 1 lexicon) and the length bonus were applied.
Results and Thoughts
NIST scores:
                              Small Track   Large Track
Baseline (IBM1 + LDC-Dic)     6.3775        6.52
+ Phrases                     6.7405        7.316
• More effective pruning techniques would significantly reduce the experimentation cycle.
• Improved alignment models that better combine bi-directional alignment information.
Combining Methods
Small Data Track (Dec-01 data), NIST scores:
                              Standard segmentation   Improved segmentation
Baseline (IBM1 + LDC-Dic)     6.2381                  6.3775
+ Phrases Joy                 6.5624                  6.7987
+ Phrases Ashish              6.5295                  6.7405
+ Phrases Joy & Ashish        6.6427                  6.8790