Microsoft Research Asia Chinese-English Translation System: CWMT2008 Evaluation Technical Report. Dongdong Zhang, Zhihao Li, Mu Li, Ming Zhou. Microsoft Research Asia.
Outline • Overview • MSRA Submissions • System Description • Experiments • Training Data & Toolkits • Chinese-English Machine Translation • Chinese-English System Combination • Conclusion
MSRA Submission • Machine translation task • Primary submission • Unlimited training corpus • Combining: SysA + SysB + SysC + SysD • Contrast submission • Limited training corpus • Combining: SysA + SysB + SysC • System combination task • Limited training corpus • Combining: 10 systems
SysA • Phrase-based model • CYK decoding algorithm • BTG grammar • Features: • Similar to (Koehn, 2004) • Maximum Entropy reordering model • (Zhang et al., 2007; Xiong et al., 2006)
SysB • Syntactic pre-reordering model • (Li et al., 2007) • Motivations • Isolating the reordering model from the decoder • Making use of syntactic information
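The idea of reordering the source sentence before decoding can be illustrated with a minimal sketch. It assumes a parse tree stored as nested `(label, children)` tuples and a hand-written rule table mapping a node's child-label sequence to a new child order. The `reorder` and `leaves` helpers, the rule table, and the pinyin toy sentence are all hypothetical; the slides do not say how the actual model scores or selects reorderings.

```python
def reorder(tree, rules):
    """Recursively permute the children of nodes that match a reordering rule."""
    if isinstance(tree, str):          # a leaf is a plain source word
        return tree
    label, children = tree
    children = [reorder(child, rules) for child in children]
    # Key: the node label plus the labels (or words) of its children.
    key = (label, tuple(c if isinstance(c, str) else c[0] for c in children))
    order = rules.get(key)
    if order is not None:
        children = [children[i] for i in order]
    return (label, children)


def leaves(tree):
    """Read the (possibly reordered) source words off the tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1] for word in leaves(child)]


# Hypothetical rule: move a pre-verbal PP after the verb phrase,
# which brings Chinese word order closer to English.
RULES = {("VP", ("PP", "VP")): (1, 0)}

# Toy parse of "ta zai Beijing gongzuo" (he / in / Beijing / works), in pinyin.
TREE = ("IP", [("NP", ["ta"]),
               ("VP", [("PP", ["zai", "Beijing"]),
                       ("VP", ["gongzuo"])])])

reordered_words = leaves(reorder(TREE, RULES))
```

Because the reordering happens on the source side, the decoder itself can stay unchanged and simply receives the reordered word sequence as input.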
SysC • Hierarchical phrase-based model • (Chiang, 2005) • Hiero re-implementation • Weighted synchronous CFG
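A weighted synchronous CFG rule pairs a source pattern with a target pattern that share nonterminals, so one rule can both translate and reorder. The sketch below is only an illustration of that idea: the `Rule` class, the `apply_rule` helper, the example rule, and its weight are hypothetical and are not taken from the MSRA system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A synchronous rule X -> <source, target> with linked nonterminals X1, X2."""
    source: tuple
    target: tuple
    weight: float


def apply_rule(rule, fillers):
    """Build the target string by substituting sub-translations for nonterminals."""
    return " ".join(fillers.get(symbol, symbol) for symbol in rule.target)


# Hypothetical rule for the Chinese "X1 de X2" construction (pinyin source side):
# the two sub-spans swap positions around "of" on the English side.
rule = Rule(source=("X1", "de", "X2"), target=("X2", "of", "X1"), weight=0.6)

translation = apply_rule(rule, {"X1": "China", "X2": "the economy"})
print(translation)
```

In a full decoder the fillers would themselves be produced by applying other rules to the sub-spans, and the rule weights would be combined with the remaining model features.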
SysD • String-to-dependency MT • (Shen et al., 2008) • Integrating a target dependency language model • Motivations • Target dependency structures integrate linguistic knowledge • Directly targeted on lexical items, simpler than CFG • Capture long-distance relations by local dependency trees
System Combination • Analogous to BBN’s work (Rosti et al., 2007)
System Combination (Cont.) • Adaptations in MSRA system • Single confusion network • Candidate skeletons come from the top-1 translations of each system • The best skeleton is the candidate most similar to the others, measured by BLEU • Word alignment between the skeleton and the other candidate translations is performed by GIZA++ • Parameters are tuned to maximize BLEU on the Dev. data
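The skeleton selection step can be sketched as follows. The sketch assumes that similarity is the average sentence-level BLEU of one candidate against each of the others, with add-one smoothing. The smoothing, the scoring direction, and the function names `sentence_bleu` and `pick_skeleton` are assumptions made for illustration; the slides only state that BLEU is the similarity measure.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hyp, ref, max_n=4):
    """Add-one smoothed sentence-level BLEU of hyp against a single reference."""
    if not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        log_precision += math.log((matches + 1) / (total + 1))
    log_brevity = min(0.0, 1 - len(ref) / len(hyp))  # penalize short hypotheses
    return math.exp(log_precision / max_n + log_brevity)


def pick_skeleton(hyps):
    """Return the index of the hypothesis with the highest average BLEU
    against all other hypotheses (requires at least two hypotheses)."""
    if len(hyps) < 2:
        raise ValueError("need at least two candidate translations")

    def average_bleu(i):
        others = [h for j, h in enumerate(hyps) if j != i]
        return sum(sentence_bleu(hyps[i], other) for other in others) / len(others)

    return max(range(len(hyps)), key=average_bleu)


# Toy top-1 outputs from three hypothetical systems.
candidates = [
    "the cat sat on the mat".split(),
    "the cat sat on a mat".split(),
    "a dog ran".split(),
]
skeleton_index = pick_skeleton(candidates)
```

The remaining candidates would then be word-aligned to the chosen skeleton to build the confusion network; that alignment step is not covered by this sketch.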
Training Data • Primary MT Submission • Contrast MT Submission
Pre-/Post-processing • Pre-processing • Tokenization for Chinese and English sentences • Before word alignment and language model training • Special tokens (date, time and number) are recognized and normalized for training data • Special tokens are pre-translated with rules for test data before decoding • Post-processing • English case restoration after translation • OOVs are removed from the final translation
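The normalization and OOV-removal steps can be illustrated with a small sketch. The regular expressions, the placeholder names (`$date`, `$time`, `$number`), and the rule that treats any output token containing a CJK character as an untranslated OOV are all assumptions made for illustration; the slides do not describe the actual rules.

```python
import re

# Hypothetical patterns, checked in order; each maps a whole token to a placeholder.
PATTERNS = [
    ("$date", re.compile(r"\d{4}-\d{1,2}-\d{1,2}")),
    ("$time", re.compile(r"\d{1,2}:\d{2}")),
    ("$number", re.compile(r"\d+(?:\.\d+)?")),
]


def normalize(tokens):
    """Replace date, time and number tokens with class placeholders."""
    normalized = []
    for token in tokens:
        for placeholder, pattern in PATTERNS:
            if pattern.fullmatch(token):
                token = placeholder
                break
        normalized.append(token)
    return normalized


def remove_oov(tokens):
    """Drop output tokens that still contain CJK characters, assuming these
    are source words the decoder could not translate."""
    def has_cjk(token):
        return any("\u4e00" <= ch <= "\u9fff" for ch in token)

    return [token for token in tokens if not has_cjk(token)]


normalized = normalize("meeting on 2008-11-27 at 9:30 with 120 people".split())
cleaned = remove_oov(["the", "\u5fae\u8f6f", "report"])
```

Mapping such tokens to a shared placeholder reduces data sparsity for both word alignment and language model training, since every date or number is then counted as the same event.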
Tools • MSR-SEG • MSRA word segmentation tool used to segment Chinese sentences in parallel data • Berkeley parser • Parses training and test sentences for the system based on the syntactic pre-reordering model • GIZA++ • Used for bilingual word alignment • MaxEnt Toolkit • Reordering Model (Le Zhang, 2004) • MSRA internal tools • Language modeling • Decoders • Case restoration for English words • System combination
Conclusions • Syntactic information improves SMT • Syntactic pre-reordering model • Target dependency model • A limited LM affects system combination • Performs worse than the unlimited output when using the limited LM
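MSRA Systems • SysA: • Based on a contiguous phrase translation model • SysB: • SysA + multiple pre-reordered source-language inputs • SysC: • Based on a hierarchical phrase translation model • SysD: • Based on a string-to-target-dependency-tree translation model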
SysB • Syntactic pre-reordering model • (Li et al., 2007) • Motivations • Isolating the reordering model from the decoder • Making use of syntactic parse information