A Syntax-Driven Bracketing Model for Phrase-Based Translation • Deyi Xiong, et al. • ACL 2009
Introduction • Machine Translation • Chinese to English • Chinese source: 把 7月 11日 設立 為 航海 節 • An ideal case: 把 7月 11日 設立 為 航海 節 → to establish July 11 as Sailing Festival day
Wrong Linguistic Structure • 航海 節 is a syntactic constituent, but the translation below splits it • 把 7月 11日 設立 為 航海 節 → to set up for navigation on July 11 knots
A Naive Solution • Employ syntactic constraints • Fully respect linguistic structures
A Naive Solution (2) • Unfortunately, this hurts translation performance • Non-syntactic translations are sometimes useful • 把 今天 設立 為 航海 節 → establish today as Sailing Festival day
Syntax-Driven Bracketing Model • SDB model • Whether a phrase is a translation unit matters more than whether it is syntactic or non-syntactic • Includes, but is not limited to, constituent matching/violation • Preserves the strength of the phrase-based system
Translation Unit • A bracketable source phrase and its corresponding translation • Bracketable • A source phrase is bracketable if its translation is contiguous • A pair of neighboring phrases is bracketable if their translations are contiguous after being combined
Translation Unit Examples • Bracketable: 把 今天 設立 為 → establish today as • 把 今天 設立 and 為 are bracketable • 把 今天 設立 為 is bracketable
Translation Unit Examples • Unbracketable: 把 今天 設立 為 → establish today as • 設立 and 為 are unbracketable • 設立 為 is unbracketable
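The bracketable/unbracketable examples above can be checked mechanically from a word alignment. The following is a minimal sketch, not the authors' extraction code. It assumes "contiguous translation" means the usual alignment-consistency test: no target word inside the translated span may link to a source word outside the phrase. The function names and the alignment itself (把 left unaligned, 今天→today, 設立→establish, 為→as) are assumptions made for illustration.

```python
def is_bracketable(alignment, start, end):
    """Check whether source words start..end (inclusive) translate contiguously.

    alignment is a set of (source_index, target_index) links.
    Assumption: "contiguous" means no target word inside the translated
    span is linked to a source word outside start..end.
    """
    linked = [t for s, t in alignment if start <= s <= end]
    if not linked:
        return False  # nothing aligned, so there is no translation to bracket
    lo, hi = min(linked), max(linked)
    return all(start <= s <= end for s, t in alignment if lo <= t <= hi)


def pair_is_bracketable(alignment, start, mid, end):
    """Neighboring phrases start..mid and mid+1..end are bracketable
    if the merged span translates contiguously."""
    return is_bracketable(alignment, start, end)


source = ["把", "今天", "設立", "為"]
target = ["establish", "today", "as"]
# Hypothetical alignment for the slide example; 把 is left unaligned.
alignment = {(1, 1), (2, 0), (3, 2)}

print(is_bracketable(alignment, 0, 3))          # 把 今天 設立 為
print(pair_is_bracketable(alignment, 0, 2, 3))  # 把 今天 設立 + 為
print(is_bracketable(alignment, 2, 3))          # 設立 為
```

Under this assumed alignment the first two checks succeed, while 設立 為 fails because "today" sits between "establish" and "as" and links to 今天, which lies outside the phrase.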
Bracketing Instances Extraction • Extract bracketable and unbracketable instances from training data • Aligned sentence pair + parsed source sentence • Estimate whether a source phrase is bracketable at run time
Rule Features • Rule Features (RF) • CFG rule • Horizontal context
Rule Features (2) • S1: ADVP → AD • S2: VP → VV AS NP • S: VP → ADVP VP
Path Features • Path features (PF) • Path to roots • S1 to the root of S • S2 to the root of S • S to the root of this tree • Vertical context
Path Features (2) S1: ADVP VP S2: VP VP S: VP IP
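The rule and path features listed above can be read directly off a parse tree. The sketch below is illustrative only: the `Node` class, the function names, and the tree shape (in particular, IP as the direct parent of the VP that covers S) are assumptions chosen to reproduce the slide's example strings.

```python
class Node:
    """A parse-tree node that remembers its parent (hypothetical helper)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def rule_feature(node):
    """CFG rule that expands this node, e.g. 'VP -> ADVP VP'."""
    return node.label + " -> " + " ".join(c.label for c in node.children)


def path_feature(node, ancestor):
    """Labels on the path from node up to ancestor, e.g. 'ADVP VP'."""
    labels = [node.label]
    while node is not ancestor:
        node = node.parent
        if node is None:
            raise ValueError("ancestor does not dominate node")
        labels.append(node.label)
    return " ".join(labels)


# Assumed tree for the slide example.
s1 = Node("ADVP", [Node("AD")])
s2 = Node("VP", [Node("VV"), Node("AS"), Node("NP")])
s = Node("VP", [s1, s2])
root = Node("IP", [s])

for name, node, top in [("S1", s1, s), ("S2", s2, s), ("S", s, root)]:
    print(name, "| rule:", rule_feature(node), "| path:", path_feature(node, top))
```

Rule features capture what lies beside and below a phrase (horizontal context), while path features capture what lies above it (vertical context).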
Constituent Boundary Matching Features • Constituent Boundary Matching Features (CBMF) • Exact match • Source phrase exactly matches the boundaries of a subtree • Inside match • Source phrase covers a sequence of subtrees inside its tree • Crossing match • Source phrase crosses the boundary of a subtree
Constituent Boundary Matching Features (3) • [Figure: examples of exact match, inside match, and crossing match]
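The three match types can be told apart by comparing a source span with the spans of all tree nodes. This is a rough sketch under the assumption that the constituent list contains every node, including single-word ones. In that case a span that is not itself a constituent and crosses none must cover a sequence of sibling subtrees. The function name and the example spans are hypothetical.

```python
def classify_boundary_match(span, constituents):
    """Return 'exact', 'inside', or 'crossing' for a source span.

    span and each constituent are (start, end) word indices, inclusive.
    Assumption: constituents lists every tree node, including single words.
    """
    start, end = span
    if span in constituents:
        return "exact"
    for c_start, c_end in constituents:
        overlaps = start <= c_end and c_start <= end
        nested = (c_start <= start and end <= c_end) or (
            start <= c_start and c_end <= end
        )
        if overlaps and not nested:
            return "crossing"  # the span cuts through this constituent
    return "inside"


# Hypothetical 5-word sentence: root (0,4), a VP over (1,4), an NP over (3,4).
constituents = {(0, 4), (0, 0), (1, 4), (1, 1), (2, 2), (3, 4), (3, 3), (4, 4)}

print(classify_boundary_match((1, 4), constituents))  # matches the VP
print(classify_boundary_match((1, 2), constituents))  # two siblings inside the VP
print(classify_boundary_match((2, 3), constituents))  # cuts into the NP
```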
Integration into Phrase-based MT • The SDB model estimates the probability that a source phrase is bracketable • That is, whether it can be translated as a unit • Integrated into a BTG MT system • Bracketing Transduction Grammar (Wu, 1997) • Straight: 把 今天 設立 為 → establish today as • Inverted: 把 今天 設立 為 → as establish today
Experiment • Comparing models • Baseline: BTG system • XP+ (Marton and Resnik, 2008) • NP, VP, PP, ADVP, … • Penalizes each violation of a syntactic boundary (soft constraint) • UniSDB • Only S features • BiSDB • S1, S2 and S features
Experiment (2) • Chinese parser • Lexicalized PCFG parser (Xiong et al., 2005) • Parallel corpus • FBIS corpus • Word alignment • GIZA++ • Four-gram language model • Built with SRILM • Xinhua section of the English Gigaword corpus • Maximum Entropy (ME) Trainer • (Zhang, 2004)
Result • SDB receives the largest feature weight • This indicates its impact on the decoder • [Table: feature weights for the XP+ and SDB features and for the baseline features common to phrase-based systems]
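The straight and inverted orders of a BTG merge can be illustrated in a few lines. This is a toy sketch of the two merge orders only, not of a BTG decoder. The function name and the split of the example into "establish today" and "as" are assumptions.

```python
def merge(left, right, order):
    """Combine the translations of two neighboring source phrases.

    'straight' keeps the source order on the target side,
    'inverted' swaps the two translations.
    """
    if order == "straight":
        return left + right
    if order == "inverted":
        return right + left
    raise ValueError("unknown order: " + order)


left = ["establish", "today"]  # assumed translation of 把 今天 設立
right = ["as"]                 # assumed translation of 為

print(" ".join(merge(left, right, "straight")))
print(" ".join(merge(left, right, "inverted")))
```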
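Since the setup lists a maximum entropy trainer, the bracketing probability presumably takes the standard conditional maximum entropy form. The notation below is an assumption, not taken from the slides: $b$ is the bracketable/unbracketable decision, $s$ the source phrase, $T$ the source parse tree, $h_i$ the binary features (RF, PF, CBMF), and $\theta_i$ their trained weights.

```latex
p(b \mid s, T) =
  \frac{\exp\left( \sum_i \theta_i \, h_i(b, s, T) \right)}
       {\sum_{b'} \exp\left( \sum_i \theta_i \, h_i(b', s, T) \right)},
\qquad b \in \{\text{bracketable}, \text{unbracketable}\}
```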
Result (2) • NIST MT-05 test set • Improvement of 1.67 BLEU over baseline • Improvement of 0.59 BLEU over XP+
Result (3) • Adding rule and path features on top of CBMF yields further improvement • BiSDB is consistently better than UniSDB • Inner contexts (S1 and S2) are useful
XP+ and SDB • Same • Both consider syntactic constituents • Different • XP+ only penalizes non-syntactic source phrases • SDB can encourage a non-syntactic phrase if it is bracketable
Conclusion • The SDB model predicts whether a source phrase can be translated as a unit • Appropriate constituent violations are helpful • Because the model better inherits the strength of the phrase-based approach