Cross-lingual Parsing with Syntax: Workshop Review and Algorithm Comparison

TIDES MT Workshop Review

Using Syntax?
• ISI-small:
  • Cross-lingual parsing/decoding
  • Input: Chinese sentence + English lattice built with all possible phrase substitutions
  • Output: English parse tree
  • Algorithm: lattice parsing (?)
• JHU:
  • Failed to incorporate “linguistic knowledge”
    • Morphology, NE, Syntax
  • Modeling phrase movement did not help

Phrase Alignment

ITC-irst
• Very similar to Franz’s system
  • Log-linear model / minimum error training
  • Phrase-based model
• Preprocessing
  • Chinese numerical translation, segmentation, splitting of long sentences (testing)
• LM adaptation: mixture of LMs from different corpora
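The LM adaptation bullet describes mixing language models trained on different corpora. A minimal sketch of one common way to do this, linear interpolation, is given here for unigram models. The toy corpora, the function names, and the hand-set mixture weights are all assumptions for illustration; the slides do not say which LM order ITC-irst used or how the weights were estimated.

```python
from collections import Counter

def unigram_lm(tokens):
    """Maximum-likelihood unigram model: word -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def mixture_prob(word, models, weights):
    """Linear interpolation: P(w) = sum_i weight_i * P_i(w)."""
    return sum(lam * m.get(word, 0.0) for lam, m in zip(weights, models))

# Hypothetical toy corpora standing in for two training sources.
news = unigram_lm("the market fell the market rose".split())
speech = unigram_lm("the president said the talks continue".split())

# Hypothetical mixture weights; they must sum to 1.
weights = [0.7, 0.3]

p_market = mixture_prob("market", [news, speech], weights)
p_the = mixture_prob("the", [news, speech], weights)
```

Because the weights sum to 1 and each component is a proper distribution, the mixture is also a proper distribution over the union of the two vocabularies.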

JHU
• Alignment template + WFST
• Multiple phrase segmentations of the source sentence are essential in translation
• BiText chunking (DP and divisive clustering): similar idea to our sentence splitting
• Phrase-level movement
• Document-specific LM (LM adaptation)
• Gains from doc-specific LMs and MBR-Bleu are not additive

ATR
• Unsupervised Chinese word segmenter
• Truecasing by Conditional Random Field

IBM
• Word reordering
  • Pre-ordering: reorder the source sentence (10% improvement)
  • Word-level and block-level reordering

ISI
• Franz system:
  • Log-linear model
  • Alignment template
  • Discriminative training
  • DP search
• New feature functions
  • Lexicalized reordering: +1% Bleu
  • Penalize word deletions: +2% Bleu
• Tight integration of rule-based translations: +2%
  • Translation Components (TC): numbers, NE, dates
  • Train a classifier to identify where TC works
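The log-linear model and the "penalize word deletions" feature can be illustrated with a small scoring sketch. In a log-linear model each hypothesis is scored as a weighted sum of feature values, and the decoder keeps the highest-scoring one. The feature names, values, and weights below are hypothetical and chosen only to show how a deletion penalty can change the ranking; they are not taken from the ISI system.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values: score = sum_i lambda_i * h_i."""
    return sum(weights[name] * value for name, value in features.items())

def best_hypothesis(hypotheses, weights):
    """Return the hypothesis with the highest log-linear score."""
    return max(hypotheses, key=lambda h: loglinear_score(h["features"], weights))

# Two hypothetical translation candidates with made-up feature values.
hypotheses = [
    {"text": "short output",
     "features": {"log_tm": -2.0, "log_lm": -4.0, "word_deletions": 2}},
    {"text": "complete output with all words",
     "features": {"log_tm": -3.0, "log_lm": -4.0, "word_deletions": 0}},
]

# Hypothetical weights: with no deletion penalty the shorter candidate wins.
no_penalty = {"log_tm": 1.0, "log_lm": 0.5, "word_deletions": 0.0}
# With a negative weight on deletions the complete candidate wins.
with_penalty = {"log_tm": 1.0, "log_lm": 0.5, "word_deletions": -2.0}

winner_without = best_hypothesis(hypotheses, no_penalty)["text"]
winner_with = best_hypothesis(hypotheses, with_penalty)["text"]
```

In a real system the weights would be tuned on held-out data (the slides mention discriminative training) rather than set by hand.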

ISI
• Franz: important things in system development
  • Good engineering is important
    • Scalability
    • Efficiency
    • No bugs in software
    • Good overall system architecture
  • Error analysis should drive research
    • Step 1: what is the major error in the current system?
    • Step 2: fix it!
    • Step 3: go to step 1

Comparable Corpora
• ISI:
  • Arabic: 99M->106M; Bleu: 43.8->42.99
  • Chinese: 168M->176M; Bleu: 32.05->32.85

Confidence Intervals
• Bootstrapping
• IBM’s method
  • Chop the test data into 50 pieces
• NIST’s method
  • Sign test
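Two of the techniques named on this slide, bootstrapping and the sign test, can be sketched briefly. The code below is a simplified illustration under stated assumptions: it resamples per-segment scores and averages them, whereas corpus-level Bleu would need to be recomputed from the resampled segments' n-gram statistics. The function names, the example scores, and the percentile method for the interval are assumptions, not the procedure any workshop participant reported.

```python
import random
from math import comb

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-segment scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Draw n segments with replacement and record the resampled mean.
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def sign_test_p(scores_a, scores_b):
    """Two-sided exact sign test on paired segment scores (ties dropped)."""
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of k or fewer of the rarer outcome under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical per-segment scores for one system.
segment_scores = [0.20, 0.30, 0.25, 0.40, 0.35, 0.30, 0.28, 0.33, 0.31, 0.27]
lo, hi = bootstrap_ci(segment_scores)

# Hypothetical paired scores where system A beats system B on every segment.
p_value = sign_test_p([1, 2, 3, 4, 5], [0, 1, 2, 3, 4])
```

The sign test only uses which system wins each segment, not by how much, which makes it robust but less sensitive than the bootstrap on score differences.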

New Players
• BYU: simple transfer system
• Linear B: human post-editing of MT hypotheses (HAMT)
• MTM linguaSoft: based on CIMOS rule-based system
• NTT: WFST-based decoder
