TIDES MT Workshop Review
Using Syntax? • ISI-small: • Cross-lingual parsing/decoding • Input: Chinese sentence + English lattice built with all possible phrase substitutions • Output: English parse tree • Algorithm: lattice parsing (?) • JHU: • Failed to incorporate “linguistic knowledge” • Morphology, NE, syntax • Modeling phrase movement did not help
ITC-irst • ITC-irst: very similar to Franz’s system • Log-linear model / minimum error rate training • Phrase-based model • Preprocessing • Chinese numerical translation, segmentation, splitting of long sentences (testing) • LM adaptation: mixture of LMs from different corpora
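A minimal sketch of what a "mixture of LMs" can look like, assuming the common linear-interpolation form P(w) = Σ λ_i · P_i(w). The slide does not say which mixture form ITC-irst used; the toy unigram tables, the weights, and the name `mixture_prob` are hypothetical and only illustrate the idea.

```python
def mixture_prob(word, models, weights):
    """Linearly interpolate component LM probabilities for one word.

    `models` is a list of dicts mapping word -> probability (toy unigram LMs);
    `weights` are the interpolation weights and must sum to 1.
    """
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    # Unseen words get probability 0.0 here; a real LM would smooth instead.
    return sum(w * m.get(word, 0.0) for m, w in zip(models, weights))


# Toy component models standing in for LMs trained on different corpora.
news_lm = {"market": 0.6, "goal": 0.4}
sports_lm = {"market": 0.1, "goal": 0.9}

p_goal = mixture_prob("goal", [news_lm, sports_lm], [0.75, 0.25])
```

In practice the weights would be tuned on held-out data close to the test domain, which is what makes the mixture an adaptation technique.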
JHU • Alignment template + WFST • Multiple phrase segmentations of the source sentence are essential in translation • Bitext chunking (DP and divisive clustering): similar idea to our sentence splitting • Phrase-level movement • Document-specific LM (LM adaptation) • Gains from doc-specific LMs and MBR-Bleu are not additive
ATR • ATR: • Unsupervised Chinese word segmenter • Truecasing with a conditional random field
IBM • Word reordering • Pre-ordering: reorder the source sentence; 10% improvement • Word-level and block-level reordering
ISI • Franz’s system: • Log-linear model • Alignment template • Discriminative training • DP search • New feature functions • Lexicalized reordering: +1% Bleu • Penalize word deletions: +2% Bleu • Tight integration of rule-based translations: +2% • Translation components (TC): numbers, NE, dates • Train a classifier to identify where TC works
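A minimal sketch of log-linear scoring, assuming the usual form score(e, f) = Σ λ_m · h_m(e, f), with the decoder keeping the highest-scoring hypothesis. The feature names (`lm`, `tm`, `length`), their values, and the weights are hypothetical; the slide does not list the actual feature set or weights.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values: sum_m lambda_m * h_m."""
    return sum(weights[name] * value for name, value in features.items())


def best_hypothesis(hypotheses, weights):
    """Return the hypothesis with the highest log-linear score."""
    return max(hypotheses, key=lambda h: loglinear_score(h["features"], weights))


# Hypothetical weights; discriminative training would tune these on a dev set.
weights = {"lm": 1.0, "tm": 0.5, "length": -0.2}

# Two toy candidate translations with log-probability-style feature values.
hypotheses = [
    {"text": "A", "features": {"lm": -4.0, "tm": -2.0, "length": 3}},
    {"text": "B", "features": {"lm": -3.0, "tm": -3.0, "length": 4}},
]

best = best_hypothesis(hypotheses, weights)
```

Adding a new feature function, such as a word-deletion penalty, only means adding one more key to each feature dict and one more weight to tune.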
ISI • Franz: important things in system development • Good engineering is important • Scalability • Efficiency • No bugs in software • Good overall system architecture • Error analysis should drive research • Step 1: what is the major error in the current system? • Step 2: fix it! • Step 3: go to Step 1
Comparable Corpora • ISI: • Arabic: 99M->106M; Bleu: 43.8->42.99 • Chinese: 168M->176M; Bleu: 32.05->32.85
Confidence Intervals • Bootstrapping • IBM’s method • Chop the test data into 50 pieces • NIST’s method • Sign test
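A minimal sketch of two of the ideas listed above: a percentile bootstrap interval and a paired sign test. It assumes per-segment (or per-piece) scores are available and, for simplicity, uses their mean as the metric; for Bleu, the metric passed in would recompute the corpus-level score on each resampled test set. All function names and numbers are hypothetical, and this is not a description of IBM's or NIST's exact procedure.

```python
import random
from math import comb


def mean(xs):
    return sum(xs) / len(xs)


def bootstrap_interval(scores, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample with replacement, recompute the metric."""
    rng = random.Random(seed)
    n = len(scores)
    stats = sorted(
        metric([scores[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


def sign_test_p(scores_a, scores_b):
    """Two-sided exact sign test on paired scores; ties are dropped."""
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy per-piece scores for two systems on the same test pieces.
system_a = [0.31, 0.35, 0.33, 0.36, 0.34, 0.32]
system_b = [0.30, 0.33, 0.31, 0.35, 0.30, 0.31]

lo, hi = bootstrap_interval(system_a, mean)
p_value = sign_test_p(system_a, system_b)
```

Chopping the test data into pieces, as mentioned above, yields exactly the kind of paired per-piece scores that the sign test consumes.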
New Players • BYU: simple transfer system • Linear B: human post-editing of MT hypotheses (HAMT) • MTM linguaSoft: based on the CIMOS rule-based system • NTT: WFST-based decoder