Shuffling Non-Constituents
Jason Eisner, with David A. Smith and Roy Tromble
ACL SSST Workshop (invited talk), June 2008
Two themes: a syntactically-flavored reordering model, and syntactically-flavored reordering search methods
Starting point: Synchronous alignment
• Synchronous grammars are very pretty.
• But does parallel text actually have parallel structure?
  • Depends on what kind of parallel text
    • Free translations? Noisy translations?
    • Were the parsers trained on parallel annotation schemes?
  • Depends on what kind of parallel structure
    • What kinds of divergences can your synchronous grammar formalism capture?
    • E.g., wh-movement versus wh in situ
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English.
French: “beaucoup d’enfants donnent un baiser à Sam”
English: “kids kiss Sam quite often”
[Figure: one tree per sentence. French glosses: beaucoup (“lots”), d’ (“of”), enfants (“kids”), donnent (“give”), un (“a”), baiser (“kiss”), à (“to”), Sam.]
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
French: “beaucoup d’enfants donnent un baiser à Sam”
English: “kids kiss Sam quite often”
[Figure: the two trees with aligned node pairs labeled Start, NP and Adv; some nodes are aligned to null.]
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. A much worse alignment ...
French: “beaucoup d’enfants donnent un baiser à Sam”
English: “kids kiss Sam quite often”
[Figure: the same two trees under a different alignment, with node pairs labeled Start, NP and Adv.]
Synchronous Grammar = Set of Elementary Trees
[Figure: the aligned French and English trees cut into elementary tree pairs, with frontier nodes labeled Start, NP, Adv and null.]
But many examples are harder
German:  Auf diese Frage habe ich leider keine Antwort bekommen
Gloss:   To this question have I alas no answer received
English: I did not unfortunately receive an answer to this question
[Figure: word alignment between the two dependency trees, including a NULL node.]
But many examples are harder (same German-English sentence pair)
• Displaced modifier (negation)
But many examples are harder (same German-English sentence pair)
• Displaced argument (here, because projective parser)
But many examples are harder (same German-English sentence pair)
• Head-swapping (here, just different annotation conventions)
Free Translation
German:  Tschernobyl könnte dann etwas später an die Reihe kommen
Gloss:   Chernobyl could then something later on the queue come
English: Then we could deal with Chernobyl some time later
[Figure: word alignment between the two dependency trees, including a NULL node.]
Free Translation (same German-English sentence pair)
• Probably not systematic (but words are correctly aligned)
Free Translation (same German-English sentence pair)
• Erroneous parse
What to do?
• Current practice:
  • Don’t try to model all systematic phenomena!
  • Just use non-syntactic alignments (Giza++).
  • Only care about the fragments that recur often
    • Phrases or gappy phrases
    • Sometimes even syntactic constituents (can favor these, e.g., Marton & Resnik 2008)
  • Use these (gappy) phrases in a decoder
    • Phrase-based or hierarchical
What to do?
• Current practice:
  • Use non-syntactic alignments (Giza++)
  • Keep frequent phrases for a decoder
• But could syntax give us better alignments?
  • Would have to be “loose” syntax …
• Why do we want better alignments?
  • Throw away less of the parallel training data
  • Help learn a smarter, syntactic, reordering model
    • Could help decoding: less reliance on LM
  • Some applications care about full alignments
Quasi-synchronous grammar
• How do we handle “loose” syntax?
• Translation story:
  • Generate target English by a monolingual grammar
  • Any grammar formalism is okay
  • Pick a dependency grammar formalism for now
Example: “I did not unfortunately receive an answer to this question”
  P(I | did, PRP)
  P(PRP | no previous left children of “did”)
Parsing: O(n^3)
Quasi-synchronous grammar
• How do we handle “loose” syntax?
• Translation story:
  • Generate target English by a monolingual grammar
  • But probabilities are influenced by source sentence
    • Each English node is aligned to some source node
    • Prefers to generate children aligned to nearby source nodes
Example: “I did not unfortunately receive an answer to this question”
Parsing: O(n^3)
QCFG Generative Story
German:  Auf diese Frage habe ich leider keine Antwort bekommen (plus NULL)
English: I did not unfortunately receive an answer to this question
  P(I | did, PRP, ich)
  P(PRP | no previous left children of “did”, habe)
[Figure labels: “observed”, “?”, P(parent-child), P(breakage).]
Aligned parsing: O(m^2 n^3)
What’s a “nearby node”?
• Given parent’s alignment, where might child be aligned?
[Figure: the candidate configurations, namely the synchronous grammar case plus “none of the above”.]
Quasi-synchronous grammar
• How do we handle “loose” syntax?
• Translation story:
  • Generate target English by a monolingual grammar
  • But probabilities are influenced by source sentence
• Useful analogies:
  • Generative grammar with latent word senses
  • MEMM
    • Generate n-gram tag sequence, but probabilities are influenced by word sequence
[Figure labels: Source, Target.]
Quasi-synchronous grammar
• How do we handle “loose” syntax?
• Translation story:
  • Generate target English by a monolingual grammar
  • But probabilities are influenced by source sentence
• Useful analogies:
  • Generative grammar with latent word senses
  • MEMM
  • IBM Model 1
    • Source nodes can be freely reused or unused
    • Future work: Enforce 1-to-1 to allow good decoding (NP-hard to do exactly)
Some results: Quasi-synchronous dependency grammar
• Alignment (D. Smith & Eisner 2006)
  • Quasi-synchronous syntax much better than synchronous
  • Maybe also better than IBM Model 4
• Question answering (Wang et al. 2007)
  • Align question w/ potential answer
  • Mean average precision: 43% (previous state of the art) → 48% (+ QG) → 60% (+ lexical features)
• Bootstrapping a parser for a new language (D. Smith & Eisner 2007 & ongoing)
  • Learn how parsed parallel text influences target dependencies
  • Along with many other features! (cf. co-training)
  • Unsupervised: German 30% → 69%, Spanish 26% → 65%
Summary of part I
• Current practice:
  • Use non-syntactic alignments (Giza++)
  • Some bits align nicely
  • Use the frequent bits in a decoder
• Suggestion: Let syntax influence alignments.
  • So far, loose syntax methods are like IBM Model 1.
  • NP-hard to enforce 1-to-1 in any interesting model.
• Rest of talk:
  • How to enforce 1-to-1 in interesting models?
  • Can we do something smarter than beam search?
Shuffling Non-Constituents
Jason Eisner, with David A. Smith and Roy Tromble
ACL SSST Workshop, June 2008
Next: syntactically-flavored reordering search methods (so far: syntactically-flavored reordering model)
Motivation
• MT is really easy!
• Just use a finite-state transducer!
• Phrases, morphology, the works!
Permutation search in MT
Initial order (French): Marie/NNP ne/NEG m’/PRP a/AUX pas/NEG vu/VBN (positions 1 to 6)
Best order (French’): a permutation of the same six tagged words
Easy transduction of French’ gives: Mary hasn’t seen me
Motivation
• MT is really easy!
• Just use a finite-state transducer!
• Phrases, morphology, the works!
• Just have to fix that pesky word order.
Framing it this way lets us enforce 1-to-1 exactly at the permutation step. Deletion and fertility > 1 are still allowed in the subsequent transduction.
Often want to find an optimal permutation …
• Machine translation: Reorder French to French-prime (Brown et al. 1992), so it’s easier to align or translate
• MT eval: How much do you need to rearrange MT output so it scores well under an LM derived from ref translations?
• Discourse generation, e.g., multi-doc summarization: Order the output sentences (Lapata 2003), so they flow nicely
• Reconstruct temporal order of events after info extraction
• Learn rule ordering or constraint ranking for phonology?
• Multi-word anagrams that score well under an LM
Permutation search: The problem
How can we find this needle in the haystack of N! possible permutations?
[Figure: an initial order of items 1 to 6 and the best order according to some cost function.]
Traditional approach: Beam search
• Approx. best path through a really big FSA
• N! paths: one for each permutation
• Only 2^N states
• A state remembers what we’ve generated so far (but not in what order)
• Arc weight = cost of picking 5 next if we’ve seen {1,2,4} so far
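The automaton described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `beam_search_permutation`, the `arc_cost(seen, nxt)` callback and the `beam_width` parameter are hypothetical names, not taken from the talk. States are the sets of items generated so far, and hypotheses that reach the same set are recombined.

```python
def beam_search_permutation(items, arc_cost, beam_width):
    """Approximate lowest-cost path through the 2^N-state automaton.

    arc_cost(seen, nxt) is a hypothetical callback: the cost of emitting
    `nxt` next when the frozenset `seen` has already been emitted.
    """
    # Each hypothesis is (total cost, order so far, set of items used).
    beam = [(0.0, (), frozenset())]
    for _ in range(len(items)):
        best_by_state = {}
        for cost, order, seen in beam:
            for x in items:
                if x in seen:
                    continue
                new_cost = cost + arc_cost(seen, x)
                new_seen = seen | {x}
                # Paths that reach the same state recombine: keep the cheaper.
                old = best_by_state.get(new_seen)
                if old is None or new_cost < old[0]:
                    best_by_state[new_seen] = (new_cost, order + (x,), new_seen)
        # Prune to the beam_width cheapest states of this layer.
        beam = sorted(best_by_state.values(), key=lambda h: h[0])[:beam_width]
    cost, order, _ = beam[0]
    return list(order), cost
```

With a beam at least as wide as the largest layer the search is exact, but the number of states then grows exponentially in N; a narrow beam is fast and may miss the best permutation.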
An alternative: Local search (“hill-climbing”)
The SWAP neighborhood
Current: 1 2 3 4 5 6 (cost=22)
Neighbors shown:
• 2 1 3 4 5 6 (cost=26)
• 1 3 2 4 5 6 (cost=20)
• 1 2 4 3 5 6 (cost=19)
• 1 2 3 5 4 6 (cost=25)
An alternative: Local search (“hill-climbing”)
The SWAP neighborhood
1 2 3 4 5 6 (cost=22) → 1 2 4 3 5 6 (cost=19)
An alternative: Local search (“hill-climbing”)
Like “greedy decoder” of Germann et al. 2001
The SWAP neighborhood
Costs along the search: cost=22, cost=19, cost=17, cost=16, …
• Why are the costs always going down? We pick the best swap.
• How long does it take to pick the best swap? O(N) if you’re careful.
• How many swaps might you need to reach the answer? O(N^2)
• What if you get stuck in a local min? Random restarts.
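The hill-climbing loop over the SWAP neighborhood might look like the sketch below. The function name `swap_hill_climb` and the black-box `cost` callback are assumptions made for illustration. For simplicity the sketch re-scores each neighbor from scratch, so it does not achieve the O(N) per step that the slide mentions; that would need a cost function whose change under a swap can be computed in constant time.

```python
def swap_hill_climb(perm, cost):
    """Repeatedly apply the best adjacent swap until none lowers the cost.

    `cost` is a hypothetical callback mapping a list to a number.
    """
    perm = list(perm)
    current = cost(perm)
    while True:
        best_i, best_cost = None, current
        for i in range(len(perm) - 1):
            # Try swapping positions i and i+1, score it, then undo.
            perm[i], perm[i + 1] = perm[i + 1], perm[i]
            c = cost(perm)
            perm[i], perm[i + 1] = perm[i + 1], perm[i]
            if c < best_cost:
                best_i, best_cost = i, c
        if best_i is None:
            return perm, current  # local minimum of the SWAP neighborhood
        perm[best_i], perm[best_i + 1] = perm[best_i + 1], perm[best_i]
        current = best_cost
```

Because a move is taken only when it strictly lowers the cost, the loop terminates; random restarts would wrap this function and call it from several starting permutations.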
Larger neighborhood
Current: 1 2 3 4 5 6 (cost=22)
SWAP neighbors shown:
• 2 1 3 4 5 6 (cost=26)
• 1 3 2 4 5 6 (cost=20)
• 1 2 4 3 5 6 (cost=19)
• 1 2 3 5 4 6 (cost=25)
Larger neighborhood (well-known in the literature; works well)
INSERT neighborhood
[Figure: permutations with cost=22 and cost=17.]
• Fewer local minima? Yes: 3 can move past 4 to get past 5.
• Graph diameter (max #moves needed)? O(N) rather than O(N^2)
• How many neighbors? O(N^2) rather than O(N)
• How long to find best neighbor? O(N^2) rather than O(N)
Even larger neighborhood
BLOCK neighborhood
[Figure: permutations with cost=22 and cost=14.]
• Fewer local minima? Yes: 2 can get past 45 without having to cross 3 or move 3 first.
• Graph diameter (max #moves needed)? Still O(N)
• How many neighbors? O(N^3) rather than O(N), O(N^2)
• How long to find best neighbor? O(N^3) rather than O(N), O(N^2)
Larger yet: Via dynamic programming??
• Fewer local minima?
• Graph diameter (max #moves needed)? Logarithmic
• How many neighbors? Exponential
• How long to find best neighbor? Polynomial
Unifying/generalizing neighborhoods so far
Exchange two adjacent blocks, of max widths w ≤ w’. A move is defined by an (i,j,k) triple.
• SWAP: w=1, w’=1: O(N)
• INSERT: w=1, w’=N: O(N^2)
• BLOCK: w=N, w’=N: O(N^3)
runtime = # neighbors = O(ww’N)
Everything in this talk can be generalized to other values of w, w’.
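The three neighborhoods can be enumerated by one routine. The sketch below assumes a particular reading of the (i,j,k) triple: the move exchanges the adjacent blocks `perm[i:j]` and `perm[j:k]`, where the narrower block has width at most w and the wider one at most w’. The names `block_moves` and `apply_move` are hypothetical.

```python
def block_moves(n, w, w_prime):
    """Yield every (i, j, k) with 0 <= i < j < k <= n whose two blocks
    perm[i:j] and perm[j:k] satisfy the width limits w <= w_prime."""
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n + 1):
                a, b = j - i, k - j  # widths of the two blocks
                if min(a, b) <= w and max(a, b) <= w_prime:
                    yield i, j, k


def apply_move(perm, move):
    """Return a new list with the two adjacent blocks exchanged."""
    i, j, k = move
    return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]
```

For N=6 this yields 5 SWAP moves, 25 INSERT moves and 35 BLOCK moves. The triple loop always takes O(N^3) time to filter, so an implementation that cared about the O(ww’N) bound would restrict the loop ranges instead.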
Very large-scale neighborhoods
• What if we consider multiple simultaneous exchanges that are “independent”?
• The DYNASEARCH neighborhood (Potts & van de Velde 1995; Congram 2000)
[Figure: a lattice over positions 1 to 6 in which each path selects a set of independent swaps. The cost of an arc is the Δcost of swapping that pair, e.g. (4,5), here < 0.]
Lowest-cost neighbor is lowest-cost path.
Very large-scale neighborhoods
Lowest-cost neighbor is lowest-cost path.
[Figure: on items 1 2 3 4, DYNASEARCH takes two swaps (-20 + -20), whereas the best single SWAP gives -30.]
• Why would this be a good idea?
  • Help get out of bad local minima? No; they’re still local minima.
  • Help avoid getting into bad local minima? Yes (less greedy).
Very large-scale neighborhoods
Lowest-cost neighbor is lowest-cost path.
• Why would this be a good idea?
  • Help get out of bad local minima? No; they’re still local minima.
  • Help avoid getting into bad local minima? Yes (less greedy).
  • More efficient? Yes! A shortest-path algorithm finds the best set of swaps in O(N) time, as fast as the best single swap.
Up to N moves as fast as 1 move: no penalty for “parallelism”!
Globally optimizes over exponentially many neighbors (paths).
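The shortest-path computation can be written as a left-to-right dynamic program. The sketch assumes that the cost changes of non-overlapping adjacent swaps simply add up, which is what “independent” is taken to mean here; the function name `dynasearch_swaps` and the `delta` list are illustrative names, not from the talk.

```python
def dynasearch_swaps(delta):
    """Pick the best set of non-overlapping adjacent swaps in O(N) time.

    delta[i] is the (assumed additive) change in cost from swapping
    positions i and i+1. Returns (total change, list of chosen i).
    """
    n = len(delta) + 1               # number of positions
    best = [0.0] * (n + 1)           # best[p]: lowest total over the first p positions
    took_swap = [False] * (n + 1)    # back-pointers for recovering the swaps
    for p in range(2, n + 1):
        keep = best[p - 1]                   # position p-1 stays where it is
        swap = best[p - 2] + delta[p - 2]    # swap positions p-2 and p-1
        if swap < keep:
            best[p], took_swap[p] = swap, True
        else:
            best[p] = keep
    # Walk the back-pointers from the right end to recover the chosen swaps.
    swaps, p = [], n
    while p >= 2:
        if took_swap[p]:
            swaps.append(p - 2)
            p -= 2
        else:
            p -= 1
    return best[n], swaps[::-1]
```

With `delta = [-20, -30, -20]`, numbers borrowed from the slide's illustration, the function returns a total of -40 with swaps at positions 0 and 2, rather than the greedy single swap worth -30.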
Can we extend this idea (up to N moves in parallel by dynamic programming) to neighborhoods beyond SWAP?
Yes. Asymptotic runtime is always unchanged.
Exchange two adjacent blocks, of max widths w ≤ w’. A move is defined by an (i,j,k) triple.
• SWAP: w=1, w’=1: O(N)
• INSERT: w=1, w’=N: O(N^2)
• BLOCK: w=N, w’=N: O(N^3)
runtime = # neighbors = O(ww’N)
Let’s define each neighbor by a “colored tree”
Just like ITG!
[Figure: a binary tree over the items 1 4 5 6 2 3; a colored node means “swap children”.]
Let’s define each neighbor by a “colored tree”
Just like ITG!
[Figure: a binary tree over the items 1 4 5 6 2 3; a colored node means “swap children”.]
This is like the BLOCK neighborhood, but with multiple block exchanges, which may be nested.
If that was the optimal neighbor …
… now look for its optimal neighbor
[Figure: a new tree over the items 1 5 6 4 2 3.]
If that was the optimal neighbor …
… now look for its optimal neighbor
… repeat till reach local optimum
Each tree defines a neighbor. At each step, optimize over all possible trees by dynamic programming (CKY parsing).
Use your favorite parsing speedups (pruning, best-first, …)
[Figure: a tree over the items 1 4 5 6 2 3.]
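A CKY-style search over colored trees can be sketched once a cost function is fixed. The slides above leave the cost function open, so the sketch assumes a pairwise cost: `B[a][b]` is a hypothetical cost paid whenever item a ends up before item b, and the cost of a permutation is the sum over all ordered pairs. Under that assumption the cost of keeping or swapping two adjacent spans depends only on which span comes first, so the chart decomposes. The name `best_itg_neighbor` is also an assumption.

```python
def best_itg_neighbor(perm, B):
    """Lowest-cost permutation reachable from `perm` by one colored tree.

    Items are assumed to be integers indexing the square matrix B, where
    B[a][b] is the cost of a preceding b. Returns (cost, order).
    """
    n = len(perm)
    # P holds 2-D prefix sums of M[r][c] = B[perm[r]][perm[c]].
    P = [[0.0] * (n + 1) for _ in range(n + 1)]
    for r in range(n):
        for c in range(n):
            P[r + 1][c + 1] = (B[perm[r]][perm[c]]
                               + P[r][c + 1] + P[r + 1][c] - P[r][c])

    def block_sum(r0, r1, c0, c1):
        # Cost of every item of perm[r0:r1] preceding every item of perm[c0:c1].
        return P[r1][c1] - P[r0][c1] - P[r1][c0] + P[r0][c0]

    # chart[i, k] = (best cost, best order) for the span perm[i:k].
    chart = {(i, i + 1): (0.0, [perm[i]]) for i in range(n)}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            candidates = []
            for j in range(i + 1, k):
                (lc, lo), (rc, ro) = chart[i, j], chart[j, k]
                # Uncolored node: left span stays before right span.
                candidates.append((lc + rc + block_sum(i, j, j, k), lo + ro))
                # Colored node: swap the children.
                candidates.append((lc + rc + block_sum(j, k, i, j), ro + lo))
            chart[i, k] = min(candidates, key=lambda t: t[0])
    return chart[0, n]
```

Storing whole orders in the chart keeps the sketch short at the price of extra list copying; back-pointers would avoid that. Calling the function repeatedly on its own output, until the cost stops improving, gives the local search loop described on this slide.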