
Improving SMT with Phrase to Phrase Translations

Joy Ying Zang, Ashish Venugopal, Stephan Vogel, Alex Waibel. Carnegie Mellon University. Project: Mega-RADD.


Presentation Transcript


  1. Improving SMT with Phrase to Phrase Translations. Joy Ying Zang, Ashish Venugopal, Stephan Vogel, Alex Waibel. Carnegie Mellon University. Project: Mega-RADD

  2. CMU Mega RADD. The Mega-RADD Team:
     • SMT: Stephan Vogel, Alex Waibel, John Lafferty
     • EBMT: Ralf Brown, Bob Frederking
     • Chinese: Joy Ying Zang, Ashish Venugopal, Bing Zhao, Fei Huang
     • Arabic: Alicia Tribble, Ahmed Badran

  3. Overview
     • Goals:
       • Develop Data-Driven General Purpose MT Systems
       • Train on Large and Small Corpora, Evaluate to test Portability
     • Approaches:
       • Two Data-driven Approaches: Statistical, Example-Based
       • Also Grammar-based Translation System
       • Multi-Engine Translation
     • Languages: Chinese and Arabic
     • Statistical Translation:
       • Exploit Structure in Language: Phrases
       • Determine Phrases from Mono- and Bi-Lingual Co-occurrences
       • Determine Phrases from Lexical and Alignment Information

  4. Arabic: Initial System
     • 1 million words of UN data, 300 sentences for testing
     • Preprocessing: separation of punctuation marks, lower-casing of English, correction of corrupted numbers
     • Adding human knowledge: cleaning the statistical lexicon for the 100 most frequent words; building lists of names, simple date expressions, and numbers (total: 1000 entries; total effort: two part-timers * 4 weeks)
     • Alignment: IBM1 plus HMM training, lexicon plus phrase translations
     • Language Model: trained on 1m sub-corpus
     • Results (20 May 2002):
       • UN test data (300 sentences): Bleu = 0.1176
       • NIST devtest (203 sentences): Bleu = 0.0242, NIST = 2.0608

  5. Arabic: Portability to a New Language
     • Training on a subset of the UN corpus chosen to cover the vocabulary of the test data
     • Training English to Arabic for extraction of phrase translations
     • Minimalist Morphology: strip/add suffixes for ~200 unknown words. NIST: 5.5368 → 5.6700
     • Adapting LM: select stories from 2 years of English Xinhua stories according to an 'Arabic' keyword list (280 entries); size 6.9m words. NIST: 5.5368 → 5.9183
     • Results:
       - 20 May (devtest): 2.0608
       - 13 June (devtest): 6.5805
       - 14 June (evaltest): 5.4662 (final training not completed)
       - 17 June (evaltest): 6.4499 (after completed training)
       - 19 July (devtest): 7.0482

  6. Two Approaches
     • Determine Phrases from Mono- and Bi-Lingual Co-occurrences (Joy)
     • Determine Phrases from Lexical and Alignment Information (Ashish)

  7. Why phrases?
     • Mismatch between languages: word-to-word translation doesn’t work
     • Phrases encapsulate the context of words, e.g. verb tense

  8. Why phrases? (Cont.)
     • Local reordering, e.g. Chinese relative clauses
     • Using phrases to mitigate word segmentation errors

  9. Utilizing bilingual information
     • Given a sentence pair (S,T), S=<s1,s2,…,si,…,sm>, T=<t1,t2,…,tj,…,tn>, where si/tj are source/target words.
     • Given an m*n matrix B, where B(i,j) = co-occurrence(si,tj) [formula not preserved in this transcript], with N=a+b+c+d.

  10. Utilizing bilingual information (Cont.)
      • Goal: find a partition over matrix B, under the constraint that one src/tgt word can only align to one tgt/src word or phrase (adjacent word sequence)
      [Figure panels: "Legal segmentation, imperfect alignment"; "Illegal segmentation, perfect alignment"]

  11. Utilizing bilingual information (Cont.)
      For each sentence pair in the training data:
        While (some row or column is still not aligned) {
          Find cell[i,j], where B(i,j) is the max among all available (not aligned) cells;
          Expand cell[i,j] with similarity sim_thresh to region[RowStart,RowEnd; ColStart,ColEnd];
          Mark all the cells in the region as aligned
        }
        Output the aligned regions as phrases
      -----------------------------------------------------
      Sub expand cell[i,j] with sim_thresh {
        current aligned region: region[RowStart=i, RowEnd=i; ColStart=j, ColEnd=j]
        While (still ok to expand) {
          if all cells[m,n], where m=RowStart-1, ColStart<=n<=ColEnd, have B(m,n) similar to B(i,j)
            then RowStart = RowStart - 1;  // expand to north
          if all cells[m,n], where m=RowEnd+1, ColStart<=n<=ColEnd, have B(m,n) similar to B(i,j)
            then RowEnd = RowEnd + 1;      // expand to south
          …  // expand to east
          …  // expand to west
        }
      }
      Define similar(x,y) = true, if abs((x-y)/y) < 1 - sim_thresh
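The greedy loop on slide 11 could be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation. The names `similar`, `expand`, and `extract_regions`, the default `sim_thresh=0.9`, and the choice to treat every row and column of a finished region as unavailable afterwards are assumptions. The east and west steps, which the slide elides, are filled in here by symmetry with north and south.

```python
def similar(x, y, sim_thresh):
    """Slide 11's test: |(x - y) / y| < 1 - sim_thresh, with y the seed value."""
    return y != 0 and abs((x - y) / y) < 1.0 - sim_thresh


def expand(B, i, j, free_rows, free_cols, sim_thresh):
    """Grow a rectangle around seed cell (i, j), one row or column at a time."""
    seed = B[i][j]
    r0 = r1 = i
    c0 = c1 = j

    def row_ok(r):  # row must be free and similar to the seed across the region's columns
        return r in free_rows and all(
            similar(B[r][c], seed, sim_thresh) for c in range(c0, c1 + 1))

    def col_ok(c):  # column must be free and similar to the seed across the region's rows
        return c in free_cols and all(
            similar(B[r][c], seed, sim_thresh) for r in range(r0, r1 + 1))

    grew = True
    while grew:
        grew = False
        if row_ok(r0 - 1):  # north
            r0, grew = r0 - 1, True
        if row_ok(r1 + 1):  # south
            r1, grew = r1 + 1, True
        if col_ok(c1 + 1):  # east (assumed symmetric to north/south)
            c1, grew = c1 + 1, True
        if col_ok(c0 - 1):  # west (assumed symmetric to north/south)
            c0, grew = c0 - 1, True
    return r0, r1, c0, c1


def extract_regions(B, sim_thresh=0.9):
    """Return aligned regions as (row_start, row_end, col_start, col_end) tuples."""
    free_rows = set(range(len(B)))
    free_cols = set(range(len(B[0])))
    regions = []
    while free_rows and free_cols:
        # Highest-scoring cell whose row and column are both still free.
        i, j = max(((r, c) for r in sorted(free_rows) for c in sorted(free_cols)),
                   key=lambda rc: B[rc[0]][rc[1]])
        r0, r1, c0, c1 = expand(B, i, j, free_rows, free_cols, sim_thresh)
        free_rows -= set(range(r0, r1 + 1))
        free_cols -= set(range(c0, c1 + 1))
        regions.append((r0, r1, c0, c1))
    return regions
```

For a toy matrix `[[0.9, 0.88, 0.1], [0.1, 0.1, 0.8]]`, the first seed is the 0.9 cell, which grows east to absorb 0.88, pairing source word 0 with a two-word target phrase. The remaining free cell becomes a one-to-one region.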

  12. Utilizing bilingual information (Cont.) [Figure panels: Expand to North, Expand to South, Expand to East, Expand to West]

  13. Integrating monolingual information
      • Motivation:
        • Use more information in the alignment
        • Makes aligning phrases easier
        • There is much more monolingual data than bilingual data
      [Figure: place names Santa Clarita, Union town, Pittsburgh, Los Angeles, Corona, Somerset, Santa Monica]

  14. Integrating monolingual information (Cont.)
      • Given a sentence pair (S,T), S=<s1,s2,…,si,…,sm> and T=<t1,t2,…,tj,…,tn>, where si/tj are source/target words.
      • Construct m*m matrix A, where A(i,j) = collocation(si, sj); only A(i,i-1) and A(i,i+1) have values.
      • Construct n*n matrix C, where C(i,j) = collocation(ti, tj); only C(j-1,j) and C(j+1,j) have values.
      • Construct m*n matrix B, where B(i,j) = co-occurrence(si, tj).

  15. Integrating monolingual information (Cont.)
      • Normalize A, C, and B [normalization constraints not preserved in this transcript]
      • Calculate the new src-tgt matrix B’ from B [Figure: B → B’]

  16. Discussion and Results
      • Simple
      • Efficient
        • Partitioning the matrix is linear, O(min(m,n)).
        • The construction of A*B*C is O(m*n).
      • Effective
        • Improved the translation quality from baseline (NIST = 6.3775, Bleu = 0.1417) to (NIST = 6.7405, Bleu = 0.1681) on small data track dev-test

  17. Utilizing alignment information: Motivation
      • The alignment model associates words and their translations at the sentence level.
      • Context and co-occurrence are represented when considering a set of sentence-level alignments.
      • Extract phrase relations from the alignment information.

  18. Processing Alignments
      • Identification: selection of target phrase candidates for each source phrase.
      • Scoring: assigning a score to each candidate phrase pair to create a ranking.
      • Pruning: reducing the set of candidate translations to a computationally tractable number.

  19. Identification
      • Extraction from sentence-level alignments.
      • For each source phrase, identify the sentences in which it occurs and load the sentence alignment.
      • Form a sliding/expanding window in the alignment to identify candidate translations.

  20. Identification Example - I

  21. Identification Example - II
      • is
      • is in step with the
      • is in step with the establishment
      • is in step with the establishment of
      • is in step with the establishment of its
      • is in step with the establishment of its legal
      • is in step with the establishment of its legal system
      • the
      • the establishment
      • the establishment of
      • ……
      • the establishment of its legal system
      • ……
      • establishment
      • establishment of
      • establishment of its
      • ….
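One way to read the sliding/expanding window of slides 19 and 21 is sketched below. The slides do not say how the window is bounded, so this sketch assumes it spans the leftmost to rightmost target word aligned to any word of the source phrase, and that every contiguous sub-span inside it is a candidate. The function name `candidate_translations` and the `(source_index, target_index)` link format are hypothetical.

```python
def candidate_translations(target, alignment, src_start, src_end):
    """Enumerate contiguous target spans inside the aligned window.

    target     -- list of target-language words
    alignment  -- iterable of (source_index, target_index) links
    src_start, src_end -- inclusive span of the source phrase
    """
    linked = [t for s, t in alignment if src_start <= s <= src_end]
    if not linked:
        return []  # the source phrase has no aligned target words
    lo, hi = min(linked), max(linked)
    # Slide the start position and expand the end position (assumed window policy).
    return [" ".join(target[a:b + 1])
            for a in range(lo, hi + 1)
            for b in range(a, hi + 1)]
```

A window of k target words yields k(k+1)/2 candidates per sentence, which is why the scoring and pruning stages that follow are needed.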

  22. Scoring - I
      • The candidate set H needs to be scored and ranked before pruning.
      • Alignment-based scores.
      • Similarity clustering
        • Assume that the hypothesis set contains several similar phrases (across several sentences) and several noisy phrases.
        • SimScore(h) = Mean(EditDistance(h, h’) / AvgLen(h, h’)) for h, h’ in H
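The SimScore formula on slide 22 can be illustrated with a short Python sketch. The slide does not say whether edit distance is computed over words or characters, or whether h is compared with itself, so this version assumes word-level Levenshtein distance, skips the self-comparison, and expects non-empty hypotheses. The names `edit_distance` and `sim_scores` are illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token lists (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def sim_scores(hypotheses):
    """Mean length-normalised edit distance from each hypothesis to all the others."""
    toks = [h.split() for h in hypotheses]
    scores = []
    for i, h in enumerate(toks):
        ratios = [edit_distance(h, o) / ((len(h) + len(o)) / 2.0)
                  for k, o in enumerate(toks) if k != i]
        scores.append(sum(ratios) / len(ratios) if ratios else 0.0)
    return scores
```

Because the score is a mean distance, a low value marks a hypothesis that resembles many others in H, while an outlier receives a high value.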

  23. Scoring Example

  24. Scoring - II
      • Lexicon augmentation
        • Weight each point in alignment scoring by its lexical probability.
        • P(si | tj), where I, J represent the area of the translation hypothesis being considered. Only the pairs of words where there is an alignment are considered.
      • Calculate translation probability of hypothesis
        • Σi Πj P(si | tj). All words in the hypothesis are considered.

  25. Combining Scores
      • FinalScore(h) = Πj Scorej(h), one factor for each scoring method.
      • Due to additional morphology present in English as compared to Chinese, a length model is used to adjust the final score to prefer longer phrases.
      • DiffRatio = (I-J) / J if I > J
      • FinalScore(h) = FinalScore(h) * (1.0 + c * e^(-1.0 * DiffRatio))
      • c is an experimentally determined constant
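As a small worked sketch of slide 25, the product of scores and the length adjustment might look like this in Python. The slide only defines DiffRatio for I > J, so leaving the score unchanged otherwise is an assumption, as are the function name `final_score`, the parameter names, and the placeholder value `c=0.5` (the slide gives no value for c).

```python
import math


def final_score(scores, i_len, j_len, c=0.5):
    """Multiply the per-method scores, then apply the length adjustment.

    scores       -- one score per scoring method for a single hypothesis
    i_len, j_len -- the lengths I and J from the slide
    c            -- experimentally determined constant (0.5 is a placeholder)
    """
    score = math.prod(scores)
    if i_len > j_len:  # the slide defines DiffRatio only for this case
        diff_ratio = (i_len - j_len) / j_len
        score *= 1.0 + c * math.exp(-1.0 * diff_ratio)
    return score
```

With scores 0.5 and 0.4, I = 4 and J = 2, DiffRatio is 1 and the product 0.2 is multiplied by 1 + 0.5·e^(-1), roughly 1.18.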

  26. Pruning
      • This large candidate list is now sorted by score and is ready for pruning.
      • It is difficult to pick a threshold that will work across different phrases. We need a split point that separates the useful and the noisy candidates.
      • Split point = argmax_p {MeanScore(h<p) – MeanScore(h>=p)}, where h represents each hypothesis in the ordered set H.
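The split-point rule on slide 26 can be sketched in a few lines of Python. This assumes the scores are already sorted from best to worst and that both sides of the split must be non-empty. The name `split_point` and the convention of returning the list length when fewer than two candidates exist are assumptions.

```python
def split_point(sorted_scores):
    """Return the index p maximising mean(scores[:p]) - mean(scores[p:]).

    Candidates before index p are kept; those from p onward are pruned.
    """
    best_p = len(sorted_scores)  # with fewer than two candidates, keep everything
    best_gap = float("-inf")
    for p in range(1, len(sorted_scores)):
        head, tail = sorted_scores[:p], sorted_scores[p:]
        gap = sum(head) / len(head) - sum(tail) / len(tail)
        if gap > best_gap:
            best_p, best_gap = p, gap
    return best_p
```

For scores `[0.9, 0.85, 0.8, 0.2, 0.1]` the largest gap between the two means occurs at p = 3, so the first three candidates are kept. Recomputing the means for every p makes this quadratic in the list length; running sums would make it linear.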

  27. Experiments
      • Alignment model: experimented with one-way (EF) and two-way (EF-FE union/intersection) alignments for IBM Models 1-4.
      • Best results found using the union (high-recall model) from Model 4.
      • Both lexical augmentation scores (using the Model 1 lexicon) and the length bonus were applied.

  28. Results and Thoughts
                                   Small Track   Large Track
      Baseline (IBM1+LDC-Dic)      6.3775        6.52
      + Phrases                    6.7405        7.316
      - More effective pruning techniques would significantly reduce the experimentation cycle
      - Improved alignment models that better combine bi-directional alignment information

  29. Combining Methods. Small Data Track (Dec-01 data)
      Segmentation                 standard   improved
      Baseline (IBM1+LDC-Dic)      6.2381     6.3775
      + Phrases Joy                6.5624     6.7987
      + Phrases Ashish             6.5295     6.7405
      + Phrases Joy & Ashish       6.6427     6.8790
