Microsoft Research Asia Chinese-English Translation System: CWMT2008 Evaluation Technical Report

张冬冬, 李志灏, 李沐, 周明 (Microsoft Research Asia)

Outline • Overview • MSRA Submissions • System Description • Experiments • Training Data & Toolkits • Chinese-English Machine Translation • Chinese-English System Combination • Conclusion

Evaluation Task Participation

MSRA Submission • Machine translation task • Primary submission • Unlimited training corpus • Combining: SysA + SysB + SysC + SysD • Contrast submission • Limited training corpus • Combining: SysA + SysB + SysC • System combination task • Limited training corpus • Combining: 10 systems

SysA • Phrase-based model • CYK decoding algorithm • BTG grammar • Features: • Similar to (Koehn, 2004) • Maximum Entropy reordering model • (Zhang et al., 2007; Xiong et al., 2006)
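As a rough illustration of the BTG constraint that SysA decodes under, the two binary rules merge adjacent spans in either straight or inverted target order. This is a toy sketch, not the MSRA decoder: the function names and the example phrases are hypothetical, and the scoring of the two orders by the Maximum Entropy reordering model is left out.

```python
def merge_straight(left, right):
    """BTG straight rule X -> [X1 X2]: the target keeps the source order."""
    return left + right


def merge_inverted(left, right):
    """BTG inverted rule X -> <X1 X2>: the target swaps the two spans."""
    return right + left


def btg_candidates(left, right):
    """Both target orders a CYK cell can build from two adjacent spans."""
    return {
        "straight": merge_straight(left, right),
        "inverted": merge_inverted(left, right),
    }


# Hypothetical target-side phrases for two adjacent source spans.
cands = btg_candidates(["in", "Beijing"], ["held", "a", "meeting"])
```

In a full decoder, each CYK cell would keep both candidates and let the reordering model and the other features decide between them.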

SysB • Syntactic pre-reordering model • (Li et al., 2007) • Motivations • Isolating reordering model from decoder • Making use of syntactic information
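The idea of reordering the source sentence before decoding can be sketched as follows. This is a simplified illustration under stated assumptions, not the model of (Li et al., 2007): the parse is assumed to be a binary tree written as nested pairs, every node may keep or swap its children, and no probabilities are attached. The function name and the example sentence are hypothetical.

```python
def reorderings(tree):
    """All source word orders reachable by keeping or swapping the
    children at each node of a binary tree.

    A tree is either a word (str) or a (left, right) pair of subtrees.
    """
    if isinstance(tree, str):
        return [[tree]]
    left, right = tree
    out = []
    for l in reorderings(left):
        for r in reorderings(right):
            out.append(l + r)  # keep the children in source order
            out.append(r + l)  # swap the children
    return out


# Hypothetical parse of "在 北京 举行 会议" as a binary tree.
tree = (("在", "北京"), ("举行", "会议"))
variants = reorderings(tree)
```

A real system would score the variants and pass only the few best reordered inputs to the decoder rather than enumerating all of them.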

SysC • Hierarchical phrase-based model • (Chiang, 2005) • Hiero re-implementation • Weighted synchronous CFG
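To make the synchronous CFG idea concrete, the sketch below applies the target side of one hierarchical rule to already-translated sub-spans. The rule, its encoding as a list with integer nonterminal indices, and the function name are all illustrative assumptions. Rule weights are omitted.

```python
def apply_rule(target_pattern, fillers):
    """Substitute sub-translations into the target side of a synchronous rule.

    target_pattern: list of tokens; integers stand for nonterminal indices.
    fillers: dict mapping a nonterminal index to a list of target tokens.
    """
    out = []
    for tok in target_pattern:
        if isinstance(tok, int):
            out.extend(fillers[tok])
        else:
            out.append(tok)
    return out


# Hypothetical rule: X -> < X1 的 X2 , X2 of X1 >
rule_target = [2, "of", 1]
translation = apply_rule(rule_target, {1: ["China"], 2: ["the", "economy"]})
```

The rule reorders its two gaps around the lexical item "of", which is the kind of reordering a flat phrase table cannot express on its own.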

SysD • String-to-dependency MT • (Shen et al., 2008) • Integrating a target dependency language model • Motivations • Target dependency structures integrate linguistic knowledge • Directly targeted on lexical items, simpler than CFG • Capture long-distance relations by local dependency trees

System Combination • Analogous to BBN’s work (Rosti et al., 2007)

System Combination (Cont.) • Adaptations in MSRA system • Single confusion network • Candidate skeletons come from the top-1 translations of each system • The best skeleton is the candidate most similar to the others, as measured by BLEU • Word alignment between the skeleton and the other candidate translations is performed by GIZA++ • Parameters are tuned to maximize BLEU on development data
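The skeleton selection step can be sketched as below. This is a minimal illustration, assuming an add-one smoothed sentence-level BLEU against a single reference and a plain average over the other systems. The slides do not say which BLEU variant or averaging scheme was used, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hyp, ref, max_n=4):
    """Add-one smoothed sentence-level BLEU of hyp against a single ref."""
    if not hyp:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        matches = sum(min(c, r[g]) for g, c in h.items())
        total = sum(h.values())
        log_prec += math.log((matches + 1) / (total + 1))
    brevity = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity * math.exp(log_prec / max_n)


def pick_skeleton(top1):
    """Index of the top-1 translation most similar, on average, to the others."""
    if len(top1) == 1:
        return 0

    def avg_sim(i):
        others = [h for j, h in enumerate(top1) if j != i]
        return sum(sentence_bleu(top1[i], o) for o in others) / len(others)

    return max(range(len(top1)), key=avg_sim)


# Hypothetical top-1 outputs from three systems.
outputs = [
    "the meeting was held in beijing".split(),
    "the meeting held in beijing".split(),
    "meeting in beijing was hold yesterday".split(),
]
best = pick_skeleton(outputs)
```

The chosen skeleton then fixes the word order of the confusion network, and the remaining hypotheses are aligned to it.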

Training Data Primary MT Submission Contrast MT Submission

Pre-/Post-processing • Pre-processing • Tokenization for Chinese and English sentences • Before word alignment and language model training • Special tokens recognized and normalized (date, time and number) for training data • Special tokens are pre-translated with rules for test data before decoding • Post-processing • English case restoration after translation • OOVs are removed from the final translation
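Two of the steps above, normalizing special tokens and removing OOVs, can be sketched as follows. This covers numbers only and treats any output token containing a CJK character as an untranslated OOV. Both choices, the `$number` placeholder, and the function names are assumptions made for illustration; the actual rules are not given in the slides.

```python
import re

NUM = re.compile(r"\d+(?:\.\d+)?")
CJK = re.compile(r"[\u4e00-\u9fff]")


def normalize_numbers(tokens, placeholder="$number"):
    """Replace purely numeric tokens with one placeholder for training."""
    return [placeholder if NUM.fullmatch(t) else t for t in tokens]


def drop_oov(tokens):
    """Remove untranslated tokens, approximated here as tokens that
    still contain CJK characters."""
    return [t for t in tokens if not CJK.search(t)]


train = normalize_numbers("sales rose 3.5 percent in 2008".split())
final = drop_oov("he visited 哈尔滨 yesterday".split())
```

Collapsing numbers into one token reduces sparsity during alignment and language model training, which is the usual motivation for this kind of normalization.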

Tools • MSR-SEG • MSRA word segmentation tool used to segment Chinese sentences in parallel data • Berkeley parser • Parse sentences for both training and test data for syntactic pre-reordering model based system • GIZA++ • Used for bilingual word alignment • MaxEnt Toolkit • Reordering Model (Le Zhang, 2004) • MSRA internal tools • Language modeling • Decoders • Case-restoration for English words • System combination

Experiments for MT Task

Experiments for System Comb. Unlimited LM

Conclusions • Syntactic information improves SMT • Syntactic pre-reordering model • Target dependency model • Limited LM affects the system combination • Performs worse on unlimited-track output when a limited LM is used

Thanks!

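MSRA Systems • SysA: • Based on a contiguous phrase translation model • SysB: • SysA + multiple pre-reordered source-language inputs • SysC: • Based on a hierarchical phrase translation model • SysD: • Based on a string-to-target-dependency-tree translation model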
