Explore a permutation search for machine translation with very large-scale neighborhoods. The workshop presentation by Jason Eisner and Roy Tromble focuses on finding optimal permutations using local search techniques in various NLP applications. Discover the motivation and challenges in rearranging information for improved translation, discourse generation, and rule ordering. Learn about cost models, traditional beam search, and alternative local search methods utilizing swap and jump neighborhoods. Dive into dynamic programming techniques to consider exponentially many neighbors efficiently. Take a deep dive into optimizing permutations for better translation accuracy.
Local Search with Very Large-Scale Neighborhoods for Optimal Permutations in Machine Translation. Jason Eisner and Roy Tromble
Motivation • MT is really easy! • Just use a finite-state transducer! • Phrases, morphology, the works! Eisner & Tromble - HLT-NAACL Workshop on Computationally Hard Methods and Joint Inference in NLP - June 2006
Permutation search in MT [Figure: the French words Marie (NNP), ne (NEG), m’ (PRP), a (AUX), pas (NEG), vu (VBN), numbered 1 to 6 in their initial order (French), are permuted into a best order (French’), from which “Mary hasn’t seen me” is an easy transduction.]
Motivation • MT is really easy! • Just use a finite-state transducer! • Phrases, morphology, the works! • Just have to fix that pesky word order.
Often want to find an optimal permutation … • Machine translation: Reorder French to French-prime (Brown et al. 1992), so it’s easier to align or translate • MT eval: How much do you need to rearrange MT output so it scores well under an LM derived from ref translations? • Discourse generation, e.g., multi-doc summarization: Order the output sentences (Lapata 2003) so they flow nicely • Reconstruct temporal order of events after info extraction • Learn rule ordering or constraint ranking for phonology? • Multi-word anagrams that score well under an LM
Permutation search: The problem [Figure: an initial order of items 1 to 6 and the best order according to some cost function.] How can we find this needle in the haystack of N! possible permutations?
Cost models [Figure: an initial order of items 1 to 6, annotated “4 before 3 …?” and “1…2…3?”.] Cost of this order: • Does my favorite WFSA like it as a string? • Non-local pair order ok? • Non-local triple order ok? • Add these all up … These costs are enough to encode Traveling Salesperson, many other NP-complete problems, IBM Model 4, and more …
Traditional approach: Beam search • Approx. best path through a really big FSA • N! paths: one for each permutation • only 2^N states • state remembers what we’ve generated so far (but not in what order) • arc weight = cost of picking 5 next if we’ve seen {1,2,4} so far
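The beam search over subset states can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `beam_search` and the callback `step_cost(seen, x)` (the arc weight for picking `x` next after the set `seen`) are hypothetical names introduced here.

```python
def beam_search(items, step_cost, beam_width):
    """Approximate the cheapest path through the subset-state FSA.

    A state is the frozenset of items generated so far (not their order).
    step_cost(seen, x) is an assumed callback giving the arc weight for
    emitting x next when `seen` has already been emitted.
    Returns (total_cost, order) for the best surviving path.
    """
    beam = {frozenset(): (0, ())}            # state -> (cost so far, order so far)
    for _ in range(len(items)):
        candidates = {}
        for seen, (cost, order) in beam.items():
            for x in items:
                if x in seen:
                    continue
                new_state = seen | {x}
                new_path = (cost + step_cost(seen, x), order + (x,))
                # Keep only the cheapest path into each state.
                if new_state not in candidates or new_path < candidates[new_state]:
                    candidates[new_state] = new_path
        # Prune: keep the beam_width cheapest states at this depth.
        kept = sorted(candidates.items(), key=lambda kv: kv[1])[:beam_width]
        beam = dict(kept)
    return min(beam.values())
```

With a beam wide enough to hold every subset at each depth, this reduces to exact dynamic programming over the 2^N states; a narrow beam trades accuracy for speed. If the arc weight also depended on the previous item (as with a bigram model), the state would need to remember that item as well.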
An alternative: Local search (Germann et al. 2001) The “swap” neighborhood [Figure: current order 1 2 3 4 5 6 (cost=22), with neighbors 2 1 3 4 5 6 (cost=26), 1 3 2 4 5 6 (cost=20), 1 2 4 3 5 6 (cost=19), 1 2 3 5 4 6 (cost=25).]
An alternative: Local search (Germann et al. 2001) The “swap” neighborhood [Figure: from 1 2 3 4 5 6 (cost=22), the neighbor 1 2 4 3 5 6 (cost=19) is selected.]
An alternative: Local search The “swap” neighborhood [Figure: successive orders with cost=22, cost=19, cost=17, cost=16, …] • Why are the costs always going down? We pick the best swap. • How long does it take to pick your swap? O(N)*O(1)? • How many swaps might you need to reach the answer? O(N^2) • What if you get stuck in a local min? Random restarts.
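A minimal sketch of hill-climbing in the swap neighborhood, assuming a swap exchanges two adjacent items and that `cost` is any caller-supplied function from a tuple to a number. The names `swap_neighbors` and `hill_climb` are illustrative only.

```python
def swap_neighbors(perm):
    """Yield the N-1 permutations obtained by swapping two adjacent items."""
    for i in range(len(perm) - 1):
        nb = list(perm)
        nb[i], nb[i + 1] = nb[i + 1], nb[i]
        yield tuple(nb)


def hill_climb(perm, cost):
    """Repeatedly move to the best swap neighbor until none is cheaper.

    Returns (cost, permutation) at a local minimum of the swap neighborhood.
    """
    current = tuple(perm)
    current_cost = cost(current)
    if len(current) < 2:
        return current_cost, current
    while True:
        best_cost, best_nb = min((cost(nb), nb) for nb in swap_neighbors(current))
        if best_cost >= current_cost:        # no improving swap: local minimum
            return current_cost, current
        current, current_cost = best_nb, best_cost
```

This sketch recomputes the full cost for every neighbor. The O(1) per neighbor mentioned on the slide would require updating the cost incrementally from the two items that moved. Random restarts would simply call `hill_climb` from several shuffled starting orders and keep the cheapest result.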
Hill-climbing vs. random walks [Figure: current order 1 2 3 4 5 6 (cost=22), with neighbors 2 1 3 4 5 6 (cost=26), 1 3 2 4 5 6 (cost=20), 1 2 4 3 5 6 (cost=19), 1 2 3 5 4 6 (cost=25).]
Larger neighborhoods: fewer local mins? (Germann et al. 2001, Germann 2003) “Jump” neighborhood [Figure: a jump takes an order with cost=22 to one with cost=17.] • Now we can get to our destination in O(N) steps instead of O(N^2) • But each step has to consider O(N^2) neighbors instead of O(N) • Push the runtime down here, it pops up there … • Can we do better? Yes! Consider exponentially many neighbors by dynamic programming
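To make the O(N^2) neighbor count concrete, here is a small sketch of the jump neighborhood, under the assumption that a jump removes a single item and reinserts it at any other position (the slides do not spell out whether whole blocks may also jump). The name `jump_neighbors` is illustrative.

```python
def jump_neighbors(perm):
    """Return the set of permutations reachable by moving one item elsewhere."""
    perm = tuple(perm)
    neighbors = set()
    for i in range(len(perm)):
        rest = perm[:i] + perm[i + 1:]       # the order with item i removed
        for j in range(len(perm)):
            if j != i:                       # j == i would restore the original
                neighbors.add(rest[:j] + (perm[i],) + rest[j:])
    return neighbors
```

For N distinct items this gives (N-1)^2 distinct neighbors, since the two ways of describing an adjacent swap coincide. For example, `len(jump_neighbors(range(6)))` is 25, versus 5 neighbors in the adjacent-swap neighborhood.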
Let’s define each neighbor by a tree [Figure: a binary tree over the current order 1 4 5 6 2 3; marked nodes mean “swap children”.]
If that was the optimal neighbor … now look for its optimal neighbor: new tree! [Figure: a new binary tree over the new order 1 5 6 4 2 3.]
If that was the optimal neighbor … now look for its optimal neighbor … repeat till reach local optimum. At each step, consider all possible trees by dynamic programming (CKY parsing). [Figure: a binary tree over 1 4 5 6 2 3.]
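The tree-defined neighborhood can be made concrete by brute-force enumeration for small inputs. The sketch below is only an illustration of what the neighborhood contains (the name `tree_neighbors` is made up here); it is exponential and is not the dynamic program the slides refer to.

```python
def tree_neighbors(seq):
    """Return every order reachable by some binary tree over seq,
    where each internal node either keeps or swaps its two children."""
    seq = tuple(seq)
    if len(seq) == 1:
        return {seq}
    results = set()
    for k in range(1, len(seq)):             # split point of the root
        for left in tree_neighbors(seq[:k]):
            for right in tree_neighbors(seq[k:]):
                results.add(left + right)    # root keeps its children
                results.add(right + left)    # root swaps its children
    return results
```

For 4 items this yields 22 of the 24 permutations and for 5 items 90 of 120, so the neighborhood grows exponentially with N even though it never covers everything.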
Dynamic program must pick the tree that leads to the lowest-cost permutation [Figure: initial order and a permuted order of items 1 to 6.] Cost of this order: • Does my favorite WFSA like it as a string?
A bigram model as a WFSA • After you read 1, you’re in state 1 • After you read 2, you’re in state 2 • After you read 3, you’re in state 3 • … and this state determines the cost of the next symbol you read
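As a small illustration of a bigram model viewed as a WFSA, the sketch below scores a sequence by walking it through the automaton. The dictionary `arc_cost`, keyed by (state, symbol), and the start-state name `"I"` are assumptions made for this example.

```python
def bigram_cost(seq, arc_cost, start="I"):
    """Total cost of reading seq in a bigram WFSA.

    arc_cost[(state, symbol)] is the assumed cost of reading `symbol`
    while in `state`; after reading a symbol, the state is that symbol.
    """
    state = start
    total = 0
    for symbol in seq:
        total += arc_cost[(state, symbol)]
        state = symbol                       # the state remembers the last symbol read
    return total
```

For instance, with `arc_cost = {("I", 5): 1, (5, 6): 2, (6, 1): 4}`, the call `bigram_cost([5, 6, 1], arc_cost)` adds the three arc costs along the path I, 5, 6, 1.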
Including WFSA costs via nonterminals. A possible preterminal for word 2 is an arc in A that’s labeled with 2. The preterminal 42 (the arc from state 4 that reads 2) rewrites as word 2 with a cost equal to the arc’s cost. [Figure: the words 4 5 6 1 2 3 with candidate preterminals 61, 42, 23, 14, I5, 56.]
Including WFSA costs via nonterminals [Figure: trees over the words 4 5 6 1 2 3 with preterminals 61, 42, 23, 14, I5, 56 and larger constituents labeled 13, 43, 63, I6, I3.] A constituent labeled 63 has a total cost equal to the total cost of the best 63 path. The total cost at I3 is the cost of the new permutation.
Dynamic program must pick the tree that leads to the lowest-cost permutation [Figure: initial order and a permuted order of items 1 to 6, annotated “4 before 3 …?”.] Cost of this order: • Does my favorite WFSA like it as a string? • Non-local pair order ok?
Incorporating the pairwise ordering costs [Figure: a tree node that swaps the blocks 1 2 3 4 and 5 6 7.] This puts {5,6,7} before {1,2,3,4}. So this hypothesis must add costs 5 < 1, 5 < 2, 5 < 3, 5 < 4, 6 < 1, 6 < 2, 6 < 3, 6 < 4, 7 < 1, 7 < 2, 7 < 3, 7 < 4. Uh-oh! So now it takes O(N^2) time to combine two subtrees, instead of O(1) time? Nope: dynamic programming to the rescue again!
Incorporating the pairwise ordering costs [Figure: a tree node that swaps the blocks 1 2 3 4 and 5 6 7.] This puts {5,6,7} before {1,2,3,4}. So this hypothesis must add costs [Figure: an equation of the form “= + − +” whose terms were drawn as pictures; the terms on the right-hand side were already computed at earlier steps of parsing.]
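One way to read the “= + − +” picture is as inclusion-exclusion over adjacent blocks: the total for a block pair equals the total with the leftmost item dropped, plus the total with the rightmost item dropped, minus the total with both dropped, plus the single cost for that leftmost/rightmost pair. The sketch below is written under that assumption and handles pairwise costs only (no WFSA). The names `B`, `pairwise_cost` and `best_tree_neighbor` are illustrative, with `B[a][b]` standing for the cost of placing item `a` somewhere before item `b`.

```python
from functools import lru_cache


def pairwise_cost(order, B):
    """Sum of B[a][b] over all pairs where a precedes b in order."""
    return sum(B[a][b] for x, a in enumerate(order) for b in order[x + 1:])


def best_tree_neighbor(pi, B):
    """Cheapest order in the tree neighborhood of pi under pairwise costs.

    Returns (cost, order). Items are assumed to be valid indices into B.
    """
    pi = tuple(pi)

    @lru_cache(maxsize=None)
    def cross(i, j, k):
        """(keep, swap) cost between blocks pi[i:j] and pi[j:k].

        keep: left block first; swap: right block first.
        Each entry takes O(1) work given three smaller entries.
        """
        if i == j or j == k:
            return (0, 0)
        a, b = pi[i], pi[k - 1]
        k1, s1 = cross(i + 1, j, k)          # drop leftmost item
        k2, s2 = cross(i, j, k - 1)          # drop rightmost item
        k3, s3 = cross(i + 1, j, k - 1)      # drop both (counted twice above)
        return (k1 + k2 - k3 + B[a][b], s1 + s2 - s3 + B[b][a])

    @lru_cache(maxsize=None)
    def best(i, k):
        """Best (cost, order) for the span pi[i:k], CKY style."""
        if k - i == 1:
            return (0, (pi[i],))
        options = []
        for j in range(i + 1, k):
            left_cost, left = best(i, j)
            right_cost, right = best(j, k)
            keep, swap = cross(i, j, k)
            options.append((left_cost + right_cost + keep, left + right))
            options.append((left_cost + right_cost + swap, right + left))
        return min(options)

    return best(0, len(pi))
```

The decomposition works because the cost between two blocks depends only on which block comes first, not on how each block is ordered internally. Integer costs keep the subtraction exact; with floating-point costs the results may differ by rounding.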
Incorporating 3-way ordering costs • See the paper … • A little tricky, but • comes “for free” if you’re willing to accept a certain restriction on these costs • more expensive without that restriction, but possible
How many steps to get from here to there? Initial order: 6 2 8 4 7 5 3 1. Best order: 1 2 4 5 7 3 6 8. One twisted-tree step? Not always … (Dekai Wu)
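Whether a target order lies within one tree step can be checked with a known characterization: the permutations obtainable from binary trees with swapped children (separable permutations) are exactly those that avoid the patterns 2413 and 3142. The brute-force sketch below relies on that fact; the function name `reachable_in_one_step` is made up for illustration, and the check is O(N^4), far slower than necessary.

```python
from itertools import combinations

# The two forbidden patterns, 2413 and 3142, written with 0-based ranks.
FORBIDDEN = {(1, 3, 0, 2), (2, 0, 3, 1)}


def reachable_in_one_step(current, target):
    """True if target can be produced from current by one binary tree
    whose internal nodes each keep or swap their children."""
    position = {item: r for r, item in enumerate(target)}
    relative = [position[item] for item in current]   # target ranks, in current order
    for quad in combinations(relative, 4):            # values at 4 increasing positions
        ranks = tuple(sorted(quad).index(v) for v in quad)
        if ranks in FORBIDDEN:
            return False
    return True
```

For example, `reachable_in_one_step([1, 2, 3, 4], [2, 4, 1, 3])` is `False`, while reversing the whole sequence is always reachable.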
Can you get to the answer in one step? (German-English, Giza++ alignment) • not always (yay, local search) • often (yay, big neighborhood)
How many steps to the answer in the worst case? (What is the diameter of the search space?) Initial order: 6 2 8 4 7 5 3 1. Best order: 1 2 4 5 7 3 6 8. Claim: only log_2 N steps at worst (if you know where to step). Let’s sketch the proof!
Quicksort anything into, e.g., 1 2 3 4 5 6 7 8 [Figure: a right-branching tree over 6 2 8 4 7 5 3 1, annotated “4 5”.]
Quicksort anything into, e.g., 1 2 3 4 5 6 7 8 [Figure: a sequence of right-branching trees over 2 4 1 7 5 3 8 6, annotated “4 5”, “2 3”, “6 7”.] Only log_2 N steps to get to 1 2 3 4 5 6 7 8 … or to anywhere!
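The quicksort argument can be sketched in code under one reading of the slides: within a block, a right-branching tree that swaps at the nodes of the "high" items leaves the low items in front in their original order and sends the high items to the back in reverse order. Partitioning every block around its median in this way is a single tree step, and block sizes halve each time. The function names below are illustrative.

```python
def quicksort_step(perm, blocks):
    """One tree step: partition each block around its median.

    blocks is a list of (lo, hi) index ranges covering perm.
    Low items keep their order; high items land behind them, reversed,
    which is what a right-branching tree with swaps can produce.
    """
    new_perm, new_blocks = [], []
    for lo, hi in blocks:
        block = perm[lo:hi]
        if len(block) <= 1:
            new_perm += block
            new_blocks.append((lo, hi))
            continue
        pivot = sorted(block)[(len(block) - 1) // 2]   # median item
        low = [x for x in block if x <= pivot]
        high = [x for x in block if x > pivot][::-1]
        new_perm += low + high
        mid = lo + len(low)
        new_blocks += [(lo, mid), (mid, hi)]
    return new_perm, new_blocks


def sort_by_tree_steps(perm):
    """Sort distinct items with repeated tree steps; return (result, steps)."""
    perm, blocks, steps = list(perm), [(0, len(perm))], 0
    while any(hi - lo > 1 for lo, hi in blocks):
        perm, blocks = quicksort_step(perm, blocks)
        steps += 1
    return perm, steps
```

On the slide's example, `sort_by_tree_steps([6, 2, 8, 4, 7, 5, 3, 1])` sorts the 8 items in 3 steps. Reaching an arbitrary target order instead of the sorted one only requires relabeling each item by its rank in the target.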
Speedups (read the paper!) • We’re just parsing the current permutation as a string, and we know how to speed up parsers! • pruning • A* • best-first • coarse-to-fine • Can restrict to a subset of parse trees • Gives us smaller neighborhoods, quicker to search, but still exponentially large • Right-branching trees, asymmetric trees … • Note: Even w/o any of this, super-fast and effective on the LOP (linear ordering problem: no WFSA, no grammar const).
More on modeling (read the paper!) • Encoding classical NP-complete problems • Encoding translation decoding in general • Encoding IBM Model 4 • Encoding soft phrasal constraints via hidden bracket symbols • Costs that depend on features of source sentence • Training the feature weights
Summary • Local search is fun and easy • Popular elsewhere in AI • Closely related to MCMC sampling • Probably useful for translation • Can efficiently use huge local neighborhoods • Algorithms are closely related to parsing and FSMs • We know that stuff better than anyone!