A Novel Phrase-based System Combination Framework for MT
Necip Fazil Ayan¹, Bing Xiang², Bonnie J. Dorr¹, Richard Schwartz², Spyros Matsoukas², Antti-Veikko I. Rosti²
¹ University of Maryland, College Park  ² BBN Technologies, Inc.
Motivation and Goal • Motivation: • Different systems have different strengths and weaknesses • Can we create better hypotheses by choosing the best phrases from different systems? • Goal: • Combine the outputs of various machine translation (MT) systems by examining the source-to-target phrases used by the individual systems • Choose the “best” phrases from each system and merge them so that the result is as close as possible to the reference translations
Phrase-based Combination Framework • Pipeline (from the framework diagram): N-best lists from the input systems S1 … SN → Phrase Collection → Confidence Estimation → Decoding → Output, with the feature weights tuned in an Optimization loop
Phrase Collection • Reduces the search space of the combination decoding • Collect the source-to-target phrases used for each sentence; two phrases are treated as identical when they have the • Same source interval • Same target surface string • How to collect them? (a sketch follows this slide) • Can be provided by the systems • Can be generated automatically using an automated word-alignment tool • Phrases can overlap (and subsume each other) • Hierarchical phrases must be flattened out
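A minimal sketch of this collection step, assuming each hypothesis comes with (source interval, target string) phrase tuples; the `Phrase` type and `collect_phrases` name are illustrative, not the authors' code:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Phrase:
    src_start: int  # first source position covered (inclusive)
    src_end: int    # last source position covered (inclusive)
    target: str     # target surface string

def collect_phrases(nbest_lists):
    """Pool the phrases used by all systems for one source sentence.

    `nbest_lists` maps a system id to its n-best list, where each
    hypothesis is given as a list of (src_start, src_end, target_string)
    tuples. Phrases with the same source interval and the same target
    surface string are treated as one entry; the pool remembers which
    system/hypothesis pairs used each entry.
    """
    pool = defaultdict(list)  # Phrase -> [(system_id, hypothesis_index), ...]
    for sys_id, hypotheses in nbest_lists.items():
        for hyp_idx, hyp in enumerate(hypotheses):
            for src_start, src_end, target in hyp:
                pool[Phrase(src_start, src_end, target)].append((sys_id, hyp_idx))
    return pool
```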
Enriching Initial Phrase Sets • Goal: Increase the number of common phrases among systems • Different systems might use different phrases to translate the same source interval to the same target string • Original phrase set for each system is extended by • Phrase Concatenation: Combine adjacent phrases (both at the source and target level) • Phrase Splitting: Generate one-word phrases by estimating word correspondences using word translation probabilities
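A sketch of the concatenation part of this enrichment, under the assumption that a hypothesis' phrases are listed in target order and source adjacency can be read off the intervals; the splitting step (one-word phrases from word translation probabilities) is omitted:

```python
def concatenate_adjacent(hyp_phrases):
    """Extend a hypothesis' phrase set with concatenations of neighbours.

    `hyp_phrases` is the phrase sequence of one hypothesis in target
    order, as (src_start, src_end, target_string) tuples. A pair of
    consecutive phrases is merged only when it is also adjacent on the
    source side, so the concatenation is contiguous in both languages.
    """
    extended = set(hyp_phrases)
    for left, right in zip(hyp_phrases, hyp_phrases[1:]):
        l_start, l_end, l_tgt = left
        r_start, r_end, r_tgt = right
        if r_start == l_end + 1:  # adjacent on the source side as well
            extended.add((l_start, r_end, l_tgt + " " + r_tgt))
    return extended
```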
Phrase Confidence Estimation • Weighted combination of phrase posteriors • Posteriors of the hypotheses that contain this phrase • System weights • Similarity of phrases to other phrases • 4 types of similarity levels • Same source interval, same target words, and same original distortion • Same source interval, same target words but different original distortion • Overlapping source intervals with the same target words • Overlapping target words • Each phrase in one hypothesis is similar to another hypothesis at only one similarity level
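A sketch of how the four similarity levels could be assigned to a pair of phrases; the dict representation and field names are assumptions, and the original-distortion check is reduced to comparing a stored value:

```python
def similarity_level(p, q):
    """Return the strongest similarity level (1-4) between phrases p and q,
    or None if they share nothing. Each phrase is a dict with
    'src_start', 'src_end', 'target' (list of target words) and
    'distortion' (original distortion of the system that produced it).
    """
    same_interval = (p['src_start'], p['src_end']) == (q['src_start'], q['src_end'])
    overlap_source = p['src_start'] <= q['src_end'] and q['src_start'] <= p['src_end']
    same_target = p['target'] == q['target']
    overlap_target = bool(set(p['target']) & set(q['target']))

    if same_interval and same_target and p['distortion'] == q['distortion']:
        return 1  # same source interval, same target words, same original distortion
    if same_interval and same_target:
        return 2  # same source interval, same target words, different distortion
    if overlap_source and same_target:
        return 3  # overlapping source intervals, same target words
    if overlap_target:
        return 4  # overlapping target words
    return None
```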
Example: Phrase Posteriors • 2 systems, 2 hypotheses per system, 3 phrases • p1 ~ p2 at similarity level 2, p2 ~ p3 at similarity level 3 • [Table: for each phrase (p1, p2, p3) and each system (Sys 1, Sys 2), the hypothesis posteriors that contribute at each similarity level Sim 1–Sim 3]
Example: Combining Phrase Posteriors For One System • Similarity-level interpolation weights: SimW1 = 0.7, SimW2 = 0.2, SimW3 = 0.1 • [Table: the per-level posteriors from the previous slide interpolated into a single posterior per phrase (p1, p2, p3) and per system (Sys 1, Sys 2)]
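A sketch of this per-system combination, assuming each matching hypothesis contributes its posterior at exactly one similarity level (as stated on the previous slide); the function name and input shape are illustrative:

```python
def system_posterior(match_posteriors, sim_weights):
    """Interpolate one system's hypothesis posteriors for a single phrase.

    `match_posteriors` lists (hypothesis_posterior, similarity_level)
    pairs, one per hypothesis of the system that contains a phrase
    similar to the phrase in question. `sim_weights` maps similarity
    levels to interpolation weights, e.g. {1: 0.7, 2: 0.2, 3: 0.1} as in
    the example above.
    """
    return sum(sim_weights.get(level, 0.0) * posterior
               for posterior, level in match_posteriors)
```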
Combining Phrase Posteriors Among Different Systems • sysWi: Confidence weight for system i • Post(m, i): Posterior of phrase m in system i • Conf(m): Overall confidence of phrase m
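The equation on this slide did not survive the conversion; read from the definitions above, and assuming a simple convex combination over the N systems, it is presumably of the form Conf(m) = Σᵢ sysWᵢ · Post(m, i), with Σᵢ sysWᵢ = 1.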
Decoding • Phrasal decoding based on standard beam search (Koehn, 2004) • Features • Language Model • Phrase penalty • Word penalty • Distortion penalty • Original distortion penalty (computed over the set of phrases generated by each system) • Combined phrase confidence • The total score for a hypothesis is a log-linear combination of these features
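A sketch of the log-linear combination, with feature names taken from the list above; the dict-based interface is an assumption, not the decoder's actual API:

```python
def hypothesis_score(feature_values, feature_weights):
    """Total model score of a hypothesis as a log-linear combination.

    `feature_values` maps feature names (language model score, phrase
    penalty, word penalty, distortion penalty, original distortion
    penalty, combined phrase confidence) to the hypothesis' values, and
    `feature_weights` maps the same names to the tuned weights.
    """
    return sum(feature_weights[name] * value
               for name, value in feature_values.items())
```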
System Optimization • Optimization by BBN’s generic optimizer • Based on Powell’s method • Operates on n-best lists with various feature scores • Can optimize for an arbitrary scoring function • Number of weights to be optimized: N+M+6 • N: Number of systems • M: Number of similarity levels • One feature weight for each of the first 5 features • (N-1) + M + (3-1) = N+M+1 weights for phrase confidence • System weights sum up to 1 • Interpolation weights sum up to 1
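For example, with the six input systems and the four similarity levels described earlier, this comes to 6 + 4 + 6 = 16 weights to optimize.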
Experimental Settings (1) • 6 input systems • Three phrase-based • Two hierarchical phrase-based • One syntax-based • Arabic-to-English training data: ~140M words • Chinese-to-English training data: ~225M words • All systems optimized on NIST MTEval'02 test set • 2 systems were tuned to minimize TER (Snover et al., 2006) • 4 systems were tuned to maximize BLEU (Papineni et al., 2002)
Experimental Settings (2) • Input: 100-best lists generated by each system with phrase-to-phrase alignments • Combination weights optimized on NIST MTEval'03 test sets • Optimization Functions: • TER for Arabic-English • BLEU for Chinese-English • Tested on NIST MTEval'04 and NIST MTEval'05 test sets • Evaluated using mixed case TER and BLEU scores
Conclusions • Presented a novel phrase-based system combination framework • Build a new translation option table for each sentence separately and re-decode for a consensus translation • Arabic-English: Up to 1.7 BLEU point improvement over the best system • Chinese-English: The improvements on the tuning sets are not reflected in the results on the test sets. Why? • The success of the combination method depends on setting several weights appropriately • Hard to optimize (too many weights!) • Distortion features override the effects of the other features
Proposal for System Combination Evaluation • A lot of effort on system combination within GALE • Hard to evaluate different system combination approaches against each other • Different input systems • Different development and evaluation sets • Software not publicly available • Proposal: Create a common corpus to evaluate system combination methods • A standard set of system outputs on standard MTEval test sets • A common format that satisfies the requirements of each approach • Periodic updates on system outputs