A Discriminative Syntactic Word Order Model for MT
Pi-Chuan Chang (Stanford University), Kristina Toutanova (Microsoft Research)
Introduction (1/2)
• Machine Translation can be viewed as two subtasks:
  • Word Choice: choosing which words to translate into
  • Word Order: deciding the ordering of those words
• Task decomposition: assuming the existence of a Word Choice Model, we focus on a Word Order Model, which predicts the order of the words
Introduction (2/2)
• It is useful to model word order in terms of:
  • Movement of syntactic constituents (Yamada and Knight 01; Galley et al. 06)
  • Movement of surface strings (Och and Ney 2004; Al-Onaizan and Papineni 2006; Kuhn et al. 2006)
• Our goal:
  • Combine the information from syntactic and surface movement
  • Train a discriminative, globally normalized model for the ordering task
Input for the Word Order Model
• Information given to our Word Order Model:
  • Source Sentence (s)
  • Source Dependency tree (depTree(s))
  • Word alignment (A)
  • Words in target sentence (bag(t))
  • Target Dependency Tree (depTree(t))
• Target Dependency Tree is used to:
  • constrain the space of possible orders
  • provide features for ordering
[Figure: the source sentence "all constraints are satisfied" aligned to the target words [制約] [条件] [は] [すべて] [満たさ] [れました], glossed "restriction", "condition", TOPIC, "all", "satisfy", PASSIVE-PRES.]
Two Evaluation Settings for Word Ordering Task
• Input from Reference Sentences
  • To evaluate the Order Model when word choice is correct
• Input from MT Output Sentences
  • To evaluate the Order Model when the word choice comes from an MT system
• For training, we use input from Reference Sentences
• In both settings, we need to obtain the necessary input for the Word Order Model
Setting 1: Input from Reference Sentences
• Parse the source sentence to get depTree(s)
• Use GIZA++ to obtain the alignment A
• Obtain depTree(t) with the Dependency Tree Projection algorithm
Setting 2: Input from MT Output Sentences
• Baseline MT system: the tree-to-string MSR system (Quirk et al., 2005), a Dependency Treelet Translation system
• We use the 1-best output of the Baseline MT system, which provides the necessary input for our Word Order Model
[Figure: treelet example pairing "your password" with "votre mot de passe".]
Obtaining target dependency trees for Reference Sentences
• Dependency Tree Projection Algorithm (Quirk et al. 2005)
  • Project the source tree through the word alignment
  • Fix non-projectivity and null-aligned words using heuristics
  • Note: the heuristics use the correct word order
[Figure: the example inputs (s, depTree(s), A, bag(t)) for "all constraints are satisfied", with the projected target dependency tree over (restriction) (condition) (TOPIC) (all) (satisfy) (PASSIVE-PRES).]
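The projection step can be sketched in a few lines, under strong simplifying assumptions: the alignment is one-to-one and there are no null-aligned words, so none of the heuristics mentioned on the slide are needed. The function name `project_tree` and the toy words are hypothetical and are not taken from Quirk et al. (2005).

```python
from typing import Dict, Optional


def project_tree(src_head: Dict[str, Optional[str]],
                 alignment: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Project source heads onto target words through a one-to-one alignment."""
    tgt_head: Dict[str, Optional[str]] = {}
    for src_word, tgt_word in alignment.items():
        src_parent = src_head[src_word]
        # The root stays the root; otherwise the target head is the target
        # word aligned to the head of the source word.
        tgt_head[tgt_word] = None if src_parent is None else alignment[src_parent]
    return tgt_head


# Toy data (hypothetical): word -> head, with None marking the root.
SRC_HEAD = {"satisfied": None, "constraints": "satisfied", "all": "constraints"}
ALIGNMENT = {"satisfied": "SATISFY", "constraints": "CONSTRAINT", "all": "ALL"}
TGT_HEAD = project_tree(SRC_HEAD, ALIGNMENT)
```

With many-to-one or missing alignments this sketch would fail or produce an ill-formed tree, which is where the heuristics of the real algorithm come in.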
Constraining orders with Target Dependency Tree
• Orders that are not projective w.r.t. depTree(t) are assigned zero probability.
• How helpful is the constraint?
[Figure: a non-projective order, with crossing edges.]
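One way to make the projectivity constraint concrete: an order is projective w.r.t. a dependency tree when every subtree occupies a contiguous span of the order. The sketch below checks exactly that. The tree encoding (a word-to-head dictionary with `None` for the root, unique words) and the toy tree are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def is_projective(order: List[str], head: Dict[str, Optional[str]]) -> bool:
    """Return True if every subtree covers a contiguous span of `order`."""
    pos = {word: i for i, word in enumerate(order)}
    children: Dict[str, List[str]] = {word: [] for word in order}
    for word, parent in head.items():
        if parent is not None:
            children[parent].append(word)

    def span(node: str) -> Tuple[int, int, int]:
        # Leftmost position, rightmost position, and size of the subtree.
        lo = hi = pos[node]
        size = 1
        for child in children[node]:
            c_lo, c_hi, c_size = span(child)
            lo, hi, size = min(lo, c_lo), max(hi, c_hi), size + c_size
        return lo, hi, size

    # A subtree is contiguous exactly when its span length equals its size.
    return all(hi - lo + 1 == size for lo, hi, size in map(span, order))


# Toy tree (hypothetical): "all" modifies "constraints"; the rest attach to the root.
TOY_HEAD = {"satisfied": None, "constraints": "satisfied",
            "all": "constraints", "are": "satisfied"}
```

Here `is_projective(["all", "constraints", "are", "satisfied"], TOY_HEAD)` holds, while moving "are" between "all" and "constraints" splits the subtree of "constraints" and makes the order non-projective.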
Compare two search spaces
1. All permutations of a bag of words (AllPermutations)
2. All orders projective w.r.t. target dependency tree (TargetProjective)
• We use a trigram LM as the order model
[Results table not recovered. Callouts on the table:]
• Correct Word Choices. Upper bound BLEU: 100
• Projected target tree provides a huge amount of information
• In this case, the best order is not necessarily in the TargetProjective space. TargetProjective is still better by 5.4 BLEU
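To get a feel for how much smaller the TargetProjective space is, the two spaces can be counted for a small tree. A bag of n words has n! permutations. For projective orders, each node with k children can arrange itself and its k child subtrees in (k+1)! ways, so the total is the product of (k+1)! over all nodes. The tree encoding and the toy tree below are assumptions for illustration only.

```python
from math import factorial
from typing import Dict, Optional


def count_all_permutations(head: Dict[str, Optional[str]]) -> int:
    """Size of AllPermutations: n! for a bag of n distinct words."""
    return factorial(len(head))


def count_projective_orders(head: Dict[str, Optional[str]]) -> int:
    """Size of TargetProjective: product over nodes of (number of children + 1)!."""
    n_children = {word: 0 for word in head}
    for parent in head.values():
        if parent is not None:
            n_children[parent] += 1
    total = 1
    for k in n_children.values():
        total *= factorial(k + 1)
    return total


# Toy tree (hypothetical): word -> head, None marks the root.
TOY_HEAD = {"satisfied": None, "constraints": "satisfied",
            "all": "constraints", "are": "satisfied"}
```

For this four-word toy tree the counts are 24 permutations against 12 projective orders. The gap widens quickly for longer sentences with deeper trees, although both spaces still grow with factorial terms.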
Discriminative Word Order Model
• We have shown that a target dependency tree provides useful information about order
• We propose a discriminative model to select the best order (projective w.r.t. a target dependency tree)
  • Features are defined on the complete ordering
  • The space of possible orders is still factorially large, so we use a re-ranking approach (Collins 2000)
Discriminative Word Order Model
[Diagram: target dependency tree → Base Model → N-best orders → Re-ranking model → Best order]
Base Model
• Combination of two models, used to generate N-best orders:
  • Local Tree Order Model (LTOM)
  • Trigram Language Model (LM)
Local Tree Order Model (Quirk et al. 2005)
• Models the position of each word with respect to its parent in the target dependency tree
• A log-linear model that makes individual decisions for head-relative positions (non-sequence model)
  • A set of local features (lexical, source POS, etc.) is used
• Captures syntactic movement
  • Example in the figure: reverse order of NNs in a noun compound
• Provides an alternative view to the trigram LM
[Figure: example with source words the/DT, Rio/NN, agenda/NN and target words el, programa, de, Rio, annotated with head-relative positions (-2, -1 and -1, +1).]
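The head-relative positions that the LTOM predicts can be illustrated by reading them off a known order: among the children of one head, the child closest on the left gets -1, the next one -2, and children on the right get +1, +2, and so on. The sketch below computes these labels. The tree assumed for "el programa de Rio" is a guess made for illustration, and the function name is hypothetical.

```python
from typing import Dict, List, Optional


def head_relative_positions(order: List[str],
                            head: Dict[str, Optional[str]]) -> Dict[str, int]:
    """Map each non-root word to its position relative to its head."""
    pos = {word: i for i, word in enumerate(order)}
    result: Dict[str, int] = {}
    heads = {parent for parent in head.values() if parent is not None}
    for h in heads:
        # Children of h, listed in surface order.
        kids = [w for w in order if head.get(w) == h]
        left = [w for w in kids if pos[w] < pos[h]]
        right = [w for w in kids if pos[w] > pos[h]]
        # Count outward from the head: nearest left child is -1, nearest right is +1.
        for i, w in enumerate(reversed(left), start=1):
            result[w] = -i
        for i, w in enumerate(right, start=1):
            result[w] = i
    return result


# Assumed toy tree: "programa" is the root, "el" and "Rio" modify it, "de" modifies "Rio".
TOY_ORDER = ["el", "programa", "de", "Rio"]
TOY_HEAD = {"programa": None, "el": "programa", "Rio": "programa", "de": "Rio"}
POSITIONS = head_relative_positions(TOY_ORDER, TOY_HEAD)
```

Note that "Rio" is labelled +1 even though "de" sits between it and "programa": positions are counted among sibling modifiers of the same head, not over surface tokens.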
Base Model
• Quality of the N-best orders generated by the Base Model?
• (Before showing results, we describe the data we used.)
Experimental Setup
• English-Japanese parallel text (computer domain)
• Total data: 445K sentence pairs
• Parse trees on English (source) side
• Alignment: GIZA++
• Baseline MT system:
  • Trained on 500K sentences (superset of the 445K)
  • Uses the same LM and a similar LTOM
  • State-of-the-art system; outperforms Pharaoh on this dataset
• Data split (labels on the pipeline diagram):
  • Base Model training size: 400K sentence pairs
  • Re-ranking model training: 44K sentence pairs
  • Evaluation: 1K sentence pairs
Performance of the Base Model
• Ordering Reference Sentences:
  • The 30-best list achieves 98 BLEU
• Ordering MT Output Sentences:
  • The 30-best list has an oracle BLEU of 36
  • With re-ranking, we can get up to a 3 BLEU point improvement
Discriminative Word Order Model
[Pipeline diagram repeated, with a "We're here!" marker.]
Re-ranking Model
• Log-linear probability distribution over possible orders with parameters λ and feature functions f
• Loss function: negative log-likelihood of the best orders according to BLEU
• Example N-best list (o0 has the highest BLEU):
  • o0: t1 t3 t2
  • o1: t1 t2 t3
  • o2: t3 t1 t2
  • ...
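A minimal sketch of such a re-ranker, assuming each candidate order in the N-best list is already represented by a feature vector: the probability of an order is the exponential of its weighted feature sum, normalized over the list, and the loss is the negative log-probability of the order with the highest BLEU. The function names, the feature values, and the use of a single best order per list are illustrative assumptions.

```python
import math
from typing import List, Sequence


def order_probabilities(features: Sequence[Sequence[float]],
                        weights: Sequence[float]) -> List[float]:
    """Log-linear distribution over the candidate orders in one N-best list."""
    scores = [sum(w * f for w, f in zip(weights, fv)) for fv in features]
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def nll_of_best(features: Sequence[Sequence[float]],
                weights: Sequence[float],
                best_index: int = 0) -> float:
    """Negative log-likelihood of the order with the highest BLEU."""
    return -math.log(order_probabilities(features, weights)[best_index])


# Hypothetical feature vectors for o0, o1, o2 (two features each).
FEATURES = [[-1.0, -2.0], [-1.5, -2.5], [-3.0, -1.0]]
UNIFORM_LOSS = nll_of_best(FEATURES, weights=[0.0, 0.0])
TRAINED_LOSS = nll_of_best(FEATURES, weights=[1.0, 1.0])
```

With all-zero weights every order is equally likely, so the loss equals log 3. Weights that score o0 above the others push the loss below that value, which is what training aims for.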
Re-ranking Features
• 2 features from Base Model
  • Log probabilities for LM and LTOM
• Different conjunctions of the following features:
  • Part-of-Speech tags
    • on both source and target sentences
  • Displacement Feature
    • A real-valued feature (the distortion feature in Pharaoh)
    • A categorical DISP feature
      • For every two adjacent target words:
        (1) Parallel: also adjacent (in the same order) on source side
        (2) Crossing: reverse order
        (3) Widening: in the same order but not adjacent in the source
  • Word Bigram features
    • discriminatively trained
[Figure: adjacent target words Ji, Ji+1 aligned to source words: (1) parallel: Ek, Ek+1; (2) crossing: Ek, Ek+n; (3) widening: Ek, Ek+n.]
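The categorical DISP feature can be sketched as a small classifier over pairs of adjacent target words, given the source position each target word is aligned to. The sketch assumes every target word has exactly one aligned source position. The "same" label for two target words aligned to the same source word is an added assumption, since the slide does not cover that case.

```python
from typing import List


def disp_category(src_first: int, src_second: int) -> str:
    """Classify two adjacent target words by their aligned source positions."""
    if src_second == src_first + 1:
        return "parallel"   # also adjacent, in the same order, on the source side
    if src_second < src_first:
        return "crossing"   # reverse order on the source side
    if src_second > src_first + 1:
        return "widening"   # same order, but not adjacent in the source
    return "same"           # assumption: both words align to one source word


def disp_features(aligned_src_positions: List[int]) -> List[str]:
    """DISP category for every pair of adjacent target words."""
    return [disp_category(a, b)
            for a, b in zip(aligned_src_positions, aligned_src_positions[1:])]


# Hypothetical alignment: target words 0..3 align to source positions 0, 1, 3, 2.
EXAMPLE = disp_features([0, 1, 3, 2])
```

For the example alignment the three adjacent pairs come out as parallel, widening, and crossing, in that order.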
Re-ranking results with different features
[Results table not recovered.]
• Our DISP feature improves more than Pharaoh DISP
• Discriminatively trained bigram features improved about 0.55 BLEU
• For ordering Reference Sentences, our word order model achieves 94.50 BLEU
• Overall, our Word Order Model improves 2.36 BLEU over the Baseline MT system
Conclusion
• Having good projective target dependency trees helps the word ordering task
  • Reduces the search space; helps model syntactic movement
• Discriminative Word Order Model:
  • Combined information from syntactic and surface movement
  • Achieved 94.50 BLEU on ordering Reference Sentences
  • Improved an English-Japanese MT system by 2.36 BLEU points
• Future direction: Better integration with the MT system
  • Instead of fixing P_WordChoice and maximizing P_WordOrder, we can directly maximize P(t|s)
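One way to write out the decomposition behind this future direction, as a sketch rather than the authors' exact formulation: because a target sentence t determines its bag of words, the chain rule gives a product of a word choice term and a word order term. The conditioning shown here is simplified and is an assumption. The slides additionally give the order model depTree(s), A, and depTree(t).

```latex
% Chain-rule decomposition (t determines bag(t)):
P(t \mid s) = P_{\mathrm{WordChoice}}\bigl(\mathrm{bag}(t) \mid s\bigr)
              \cdot P_{\mathrm{WordOrder}}\bigl(t \mid \mathrm{bag}(t), s\bigr)

% Current setup: the bag b is fixed by the word choice step,
% and only the order is optimized.
\hat{t}_{\mathrm{fixed}} = \arg\max_{t \,:\, \mathrm{bag}(t) = b}
    P_{\mathrm{WordOrder}}\bigl(t \mid b, s\bigr)

% Future direction: optimize word choice and word order jointly.
\hat{t}_{\mathrm{joint}} = \arg\max_{t} P(t \mid s)
```

The two maximizers can differ: a bag that is slightly less likely under word choice may admit a much better ordering, which only the joint search can find.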