
Shuffling Non-Constituents

Jason Eisner, with David A. Smith and Roy Tromble. ACL SSST Workshop, June 2008.


Presentation Transcript


  1. Shuffling Non-Constituents
  Jason Eisner, with David A. Smith and Roy Tromble
  ACL SSST Workshop, June 2008
  Part I: a syntactically-flavored reordering model. Part II: syntactically-flavored reordering search methods.

  2. Starting point: Synchronous alignment
  • Synchronous grammars are very pretty.
  • But does parallel text actually have parallel structure?
  • Depends on what kind of parallel text:
    • Free translations? Noisy translations?
    • Were the parsers trained on parallel annotation schemes?
  • Depends on what kind of parallel structure:
    • What kinds of divergences can your synchronous grammar formalism capture?
    • E.g., wh-movement versus wh in situ

  3. Synchronous Tree Substitution Grammar
  Two training trees, showing a free translation from French to English:
  “beaucoup d’enfants donnent un baiser à Sam” → “kids kiss Sam quite often”
  [Figure: the two dependency trees, with glosses donnent (“give”), à (“to”), baiser (“kiss”), un (“a”), beaucoup (“lots”), d’ (“of”), enfants (“kids”)]

  4. Synchronous Tree Substitution Grammar
  Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
  “beaucoup d’enfants donnent un baiser à Sam” → “kids kiss Sam quite often”
  [Figure: the aligned tree pair, with linked Start, NP–NP, and Adv–Adv nonterminals and several null-aligned nodes]

  5. Synchronous Tree Substitution Grammar
  Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. A much worse alignment ...
  “beaucoup d’enfants donnent un baiser à Sam” → “kids kiss Sam quite often”
  [Figure: the same tree pair under a much worse alignment]

  6. Synchronous Tree Substitution Grammar
  Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
  “beaucoup d’enfants donnent un baiser à Sam” → “kids kiss Sam quite often”
  [Figure: the aligned tree pair again, with linked Start, NP–NP, and Adv–Adv nonterminals and several null-aligned nodes]

  7. Grammar = Set of Elementary Trees
  [Figure: the elementary tree pairs extracted from the aligned training trees, including null-aligned NP and Adv pairs]

  8. But many examples are harder
  Auf diese Frage habe ich leider keine Antwort bekommen
  (To this question have I alas no answer received)
  NULL I did not unfortunately receive an answer to this question

  9. But many examples are harder
  Auf diese Frage habe ich leider keine Antwort bekommen
  (To this question have I alas no answer received)
  NULL I did not unfortunately receive an answer to this question
  Displaced modifier (negation)

  10. But many examples are harder
  Auf diese Frage habe ich leider keine Antwort bekommen
  (To this question have I alas no answer received)
  NULL I did not unfortunately receive an answer to this question
  Displaced modifier (negation)

  11. But many examples are harder
  Auf diese Frage habe ich leider keine Antwort bekommen
  (To this question have I alas no answer received)
  NULL I did not unfortunately receive an answer to this question
  Displaced argument (here, because of a projective parser)

  12. But many examples are harder
  Auf diese Frage habe ich leider keine Antwort bekommen
  (To this question have I alas no answer received)
  NULL I did not unfortunately receive an answer to this question
  Head-swapping (here, different annotation conventions)

  13. Free Translation
  Tschernobyl (Chernobyl) könnte (could) dann (then) etwas (something) später (later) an (on) die (the) Reihe (queue) kommen (come)
  NULL Then we could deal with Chernobyl some time later

  14. Free Translation
  Tschernobyl (Chernobyl) könnte (could) dann (then) etwas (something) später (later) an (on) die (the) Reihe (queue) kommen (come)
  NULL Then we could deal with Chernobyl some time later
  Probably not systematic (but words are correctly aligned)

  15. Free Translation
  Tschernobyl (Chernobyl) könnte (could) dann (then) etwas (something) später (later) an (on) die (the) Reihe (queue) kommen (come)
  NULL Then we could deal with Chernobyl some time later
  Erroneous parse

  16. What to do?
  • Current practice:
    • Don’t try to model all systematic phenomena!
    • Just use non-syntactic alignments (Giza++).
    • Only care about the fragments that recur often:
      • Phrases or gappy phrases
      • Sometimes even syntactic constituents (can favor these, e.g., Marton & Resnik 2008)
    • Use these (gappy) phrases in a decoder, phrase-based or hierarchical

  17. What to do?
  • Current practice:
    • Use non-syntactic alignments (Giza++)
    • Keep frequent phrases for a decoder
  • But could syntax give us better alignments?
    • Would have to be “loose” syntax …
  • Why do we want better alignments?
    • Throw away less of the parallel training data
    • Help learn a smarter, syntactic reordering model
    • Could help decoding: less reliance on the LM
    • Some applications care about full alignments

  18. Quasi-synchronous grammar
  • How do we handle “loose” syntax?
  • Translation story:
    • Generate target English by a monolingual grammar
    • Any grammar formalism is okay
    • Pick a dependency grammar formalism for now
  [Figure: dependency parse of “I did not unfortunately receive an answer to this question”, with factors P(I | did, PRP) and P(PRP | no previous left children of “did”)]
  parsing: O(n³)

  19. Quasi-synchronous grammar
  • How do we handle “loose” syntax?
  • Translation story:
    • Generate target English by a monolingual grammar
    • But probabilities are influenced by the source sentence
    • Each English node is aligned to some source node
    • Prefers to generate children aligned to nearby source nodes
  [Figure: the same English parse, “I did not unfortunately receive an answer to this question”]
  parsing: O(n³)

  20. QCFG Generative Story
  [Figure: the observed source sentence “Auf diese Frage habe ich leider keine Antwort bekommen” (plus NULL) above the generated English “I did not unfortunately receive an answer to this question”; attachment factors now also condition on the aligned source nodes, e.g. P(I | did, PRP, ich) and P(PRP | no previous left children of “did”, habe), with P(parent-child) and P(breakage) factors for the alignment configurations]
  aligned parsing: O(m²n³)
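
A minimal sketch, in Python, of one way to read this generative story: an English dependency tree is scored by a monolingual attachment model, plus a factor for the syntactic configuration (parent-child, breakage, etc.) of the source nodes that each English parent-child pair aligns to. The configuration classes, weights, and function signatures below are illustrative assumptions, not the paper's actual parameterization.

```python
import math

def source_config(src_head, a_parent, a_child):
    """Classify how two source nodes relate in the source tree.
    src_head[i] is the source parent of node i (None for the root)."""
    if a_parent is None or a_child is None:
        return "none-of-the-above"          # aligned to NULL
    if src_head[a_child] == a_parent:
        return "parent-child"               # the synchronous-grammar case
    if a_parent == a_child:
        return "same-node"
    if src_head[a_child] == src_head[a_parent]:
        return "siblings"
    return "none-of-the-above"              # a "breakage"

LOG_P_CONFIG = {                            # made-up log-probabilities
    "parent-child": math.log(0.6),
    "same-node": math.log(0.15),
    "siblings": math.log(0.15),
    "none-of-the-above": math.log(0.1),
}

def qg_log_score(eng_head, align, src_head, log_p_attach):
    """log P(English tree, alignment | source tree), up to normalization.
    eng_head[i]: English head of word i (None for the root); align[i]: the
    source node word i is aligned to (None for NULL). log_p_attach is the
    monolingual attachment model, e.g. P(I | did, PRP) on the slides."""
    total = 0.0
    for child, parent in enumerate(eng_head):
        if parent is None:
            continue                        # skip the root
        total += log_p_attach(parent, child)           # monolingual part
        total += LOG_P_CONFIG[source_config(src_head,
                                            align[parent], align[child])]
    return total
```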

  21. What’s a “nearby node”?
  • Given the parent’s alignment, where might the child be aligned?
  [Figure: possible source-side configurations, from the synchronous-grammar (parent-child) case down to “none of the above”]

  22. Quasi-synchronous grammar
  • How do we handle “loose” syntax?
  • Translation story:
    • Generate target English by a monolingual grammar
    • But probabilities are influenced by the source sentence
  • Useful analogies:
    • Generative grammar with latent word senses
    • MEMM: generate an n-gram tag sequence, but probabilities are influenced by the word sequence
  [Figure: source and target sentences side by side]

  23. Quasi-synchronous grammar
  • How do we handle “loose” syntax?
  • Translation story:
    • Generate target English by a monolingual grammar
    • But probabilities are influenced by the source sentence
  • Useful analogies:
    • Generative grammar with latent word senses
    • MEMM
    • IBM Model 1: source nodes can be freely reused or unused
  • Future work: Enforce 1-to-1 to allow good decoding (NP-hard to do exactly)

  24. Some results: Quasi-synchronous Dependency Grammar
  • Alignment (D. Smith & Eisner 2006)
    • Quasi-synchronous much better than synchronous
    • Maybe also better than IBM Model 4
  • Question answering (Wang et al. 2007)
    • Align the question with a potential answer
    • Mean average precision: 43% (previous state of the art) → 48% (+ QG) → 60% (+ lexical features)
  • Bootstrapping a parser for a new language (D. Smith & Eisner 2007 & ongoing)
    • Learn how parsed parallel text influences target dependencies
    • Along with many other features! (cf. co-training)
    • Unsupervised: German 30% → 69%, Spanish 26% → 65%

  25. Summary of part I
  • Current practice:
    • Use non-syntactic alignments (Giza++)
    • Some bits align nicely
    • Use the frequent bits in a decoder
  • Suggestion: Let syntax influence alignments.
    • So far, loose-syntax methods are like IBM Model 1.
    • NP-hard to enforce 1-to-1 in any interesting model.
  • Rest of talk:
    • How to enforce 1-to-1 in interesting models?
    • Can we do something smarter than beam search?

  26. Shuffling Non-Constituents
  Jason Eisner, with David A. Smith and Roy Tromble
  ACL SSST Workshop, June 2008
  Part II: syntactically-flavored reordering search methods (Part I was the syntactically-flavored reordering model).

  27. Motivation
  • MT is really easy!
  • Just use a finite-state transducer!
  • Phrases, morphology, the works!

  28. Permutation search in MT
  [Figure: the tagged French words (NNP Marie, NEG ne, PRP m’, AUX a, NEG pas, VBN vu) are permuted from their initial order (French) to the best order (French’), from which an easy transduction yields “Mary hasn’t seen me”]

  29. Motivation
  • MT is really easy!
  • Just use a finite-state transducer!
  • Phrases, morphology, the works!
  • Just have to fix that pesky word order.
  Framing it this way lets us enforce 1-to-1 exactly at the permutation step. Deletion and fertility > 1 are still allowed in the subsequent transduction.

  30. Often want to find an optimal permutation …
  • Machine translation: Reorder French to French-prime (Brown et al. 1992), so it’s easier to align or translate
  • MT eval: How much do you need to rearrange MT output so it scores well under an LM derived from reference translations?
  • Discourse generation, e.g., multi-doc summarization: Order the output sentences (Lapata 2003) so they flow nicely
  • Reconstruct the temporal order of events after information extraction
  • Learn rule ordering or constraint ranking for phonology?
  • Multi-word anagrams that score well under an LM

  31. Permutation search: The problem
  How can we find this needle in the haystack of N! possible permutations?
  [Figure: an initial order, e.g. 1 2 3 4 5 6, and the best order according to some cost function]

  32. Traditional approach: Beam search
  Approximate the best path through a really big FSA:
  • N! paths, one for each permutation, but only 2^N states
  • A state remembers what we’ve generated so far (but not in what order)
  • An arc weight is the cost of picking, e.g., 5 next if we’ve seen {1,2,4} so far
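
A hedged sketch of that beam search: the state is the set of items emitted so far, and `arc_cost(seen, nxt)` is a hypothetical stand-in for whatever the model charges for emitting `nxt` after the set `seen`.

```python
import heapq

def beam_permute(n, arc_cost, beam_width=8):
    """Approximate the cheapest of the N! paths through the 2^N-state FSA.
    Returns (cost, permutation)."""
    beam = [(0.0, ())]                      # (cost so far, emitted prefix)
    for _ in range(n):
        best = {}                           # best hypothesis per state (set)
        for cost, prefix in beam:
            seen = frozenset(prefix)
            for nxt in range(n):
                if nxt in seen:
                    continue
                c = cost + arc_cost(seen, nxt)
                state = seen | {nxt}        # hypotheses merge by state
                if state not in best or c < best[state][0]:
                    best[state] = (c, prefix + (nxt,))
        beam = heapq.nsmallest(beam_width, best.values())   # prune
    return min(beam)
```

For instance, `beam_permute(6, lambda seen, nxt: abs(nxt - len(seen)))` prefers to keep each item near its original position, since `len(seen)` is the position being filled.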

  33. An alternative: Local search (“hill climbing”)
  The SWAP neighborhood
  [Figure: the current order 1 2 3 4 5 6 (cost=22) and its adjacent-swap neighbors: 2 1 3 4 5 6 (cost=26), 1 3 2 4 5 6 (cost=20), 1 2 4 3 5 6 (cost=19), 1 2 3 5 4 6 (cost=25), …]

  34. An alternative: Local search (“hill-climbing”)
  The SWAP neighborhood
  [Figure: the best neighbor, 1 2 4 3 5 6 (cost=19), replaces the current order 1 2 3 4 5 6 (cost=22)]

  35. An alternative: Local search (“hill-climbing”), like the “greedy decoder” of Germann et al. 2001
  The SWAP neighborhood
  [Figure: repeated best swaps drive the cost down: 22 → 19 → 17 → 16 → …]
  • Why are the costs always going down? We pick the best swap.
  • How long does it take to pick the best swap? O(N) if you’re careful.
  • How many swaps might you need to reach the answer? O(N²).
  • What if you get stuck in a local min? Random restarts.
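
A sketch of this SWAP hill-climber. To make the "O(N) if you're careful" claim concrete, it assumes (my illustrative choice, not the talk's) that the total cost is a sum of `bigram_cost(a, b)` over adjacent items, so each swap's delta touches only three bigrams and costs O(1) to evaluate.

```python
def swap_hill_climb(perm, bigram_cost):
    """Greedy local search over the SWAP neighborhood: repeatedly apply
    the single best adjacent swap until no swap lowers the cost."""
    perm = list(perm)
    if len(perm) < 2:
        return perm

    def delta(i):
        # Cost change from swapping positions i and i+1: only the bigrams
        # touching positions i-1 .. i+2 are affected.
        a, b = perm[i], perm[i + 1]
        d = bigram_cost(b, a) - bigram_cost(a, b)
        if i > 0:
            d += bigram_cost(perm[i - 1], b) - bigram_cost(perm[i - 1], a)
        if i + 2 < len(perm):
            d += bigram_cost(a, perm[i + 2]) - bigram_cost(b, perm[i + 2])
        return d

    while True:
        i_best = min(range(len(perm) - 1), key=delta)  # best of N-1 swaps
        if delta(i_best) >= 0:
            return perm                    # local minimum: no swap helps
        perm[i_best], perm[i_best + 1] = perm[i_best + 1], perm[i_best]
```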

  36. Larger neighborhood
  [Figure: the current order 1 2 3 4 5 6 (cost=22) and its SWAP neighbors again, as a starting point: 2 1 3 4 5 6 (cost=26), 1 3 2 4 5 6 (cost=20), 1 2 4 3 5 6 (cost=19), 1 2 3 5 4 6 (cost=25), …]

  37. Larger neighborhood (well-known in the literature; reportedly works well): the INSERT neighborhood
  [Figure: one INSERT move takes the order with cost=22 to a neighbor with cost=17 by moving item 3 rightward past items 4 and 5]
  • Fewer local minima? Yes: 3 can move past 4 to get past 5.
  • Graph diameter (max # moves needed)? O(N) rather than O(N²).
  • How many neighbors? O(N²) rather than O(N).
  • How long to find the best neighbor? O(N²) rather than O(N).

  38. Even larger neighborhood: the BLOCK neighborhood
  [Figure: one BLOCK move takes the order with cost=22 to a neighbor with cost=14 by exchanging two adjacent blocks]
  • Fewer local minima? Yes: 2 can get past 4 5 without having to cross 3 or move 3 first.
  • Graph diameter (max # moves needed)? Still O(N).
  • How many neighbors? O(N³) rather than O(N) or O(N²).
  • How long to find the best neighbor? O(N³) rather than O(N) or O(N²).

  39. Larger yet: Via dynamic programming??
  [Figure: the same order, cost=22]
  • Fewer local minima?
  • Graph diameter (max # moves needed)? Logarithmic.
  • How many neighbors? Exponential.
  • How long to find the best neighbor? Polynomial.

  40. Unifying/generalizing the neighborhoods so far
  Exchange two adjacent blocks, of max widths w ≤ w’. A move is defined by an (i,j,k) triple.
  [Figure: within 1 … 6 7 8 …, the two blocks between positions i, j and j, k exchange places]
  • SWAP: w=1, w’=1 → O(N) moves
  • INSERT: w=1, w’=N → O(N²) moves
  • BLOCK: w=N, w’=N → O(N³) moves
  runtime = # neighbors = O(ww’N)
  Everything in this talk can be generalized to other values of w, w’.
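
A straightforward (hypothetical) enumerator for this unified move set; with w = w' = 1 it yields the SWAP neighborhood's N-1 moves, and with w = w' = N it yields all O(N³) block exchanges.

```python
def block_exchange_neighbors(perm, w, w_prime):
    """Yield ((i, j, k), neighbor) for every move that exchanges the
    adjacent blocks perm[i:j] and perm[j:k], where the two block widths
    are at most w and w' respectively (w <= w'). Per the slide, the
    number of moves is O(w * w' * N)."""
    n = len(perm)
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n + 1):
                w1, w2 = j - i, k - j          # the two block widths
                if min(w1, w2) <= w and max(w1, w2) <= w_prime:
                    yield (i, j, k), perm[:i] + perm[j:k] + perm[i:j] + perm[k:]
```

For example, the best single move under a cost function `cost` is `min(block_exchange_neighbors(perm, 1, len(perm)), key=lambda m: cost(m[1]))` for INSERT.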

  41. Very large-scale neighborhoods
  • What if we consider multiple simultaneous exchanges that are “independent”?
  • The DYNASEARCH neighborhood (Potts & van de Velde 1995; Congram 2000)
  • The lowest-cost neighbor is the lowest-cost path
  [Figure: a lattice over the positions of 1 2 3 4 5 6; each arc either keeps the next element or swaps an adjacent pair, and a swap arc’s weight is the Δcost of that swap, e.g. of swapping (4,5), here < 0]

  42. Very large-scale neighborhoods
  The lowest-cost neighbor is the lowest-cost path.
  [Figure: two independent swaps taken together (DYNASEARCH, Δcost −20 + −20) beat the single best SWAP move (Δcost −30)]
  • Why would this be a good idea?
    • Help get out of bad local minima? No; they’re still local minima.
    • Help avoid getting into bad local minima? Yes: less greedy.

  43. Very large-scale neighborhoods
  The lowest-cost neighbor is the lowest-cost path.
  • Why would this be a good idea?
    • Help get out of bad local minima? No; they’re still local minima.
    • Help avoid getting into bad local minima? Yes: less greedy.
    • More efficient? Yes! A shortest-path algorithm finds the best set of swaps in O(N) time, as fast as the best single swap. Up to N moves as fast as 1 move: no penalty for “parallelism”! It globally optimizes over exponentially many neighbors (paths).
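
A sketch of one DYNASEARCH step as a linear-time DP, the shortest-path view above. It assumes the Δcosts of disjoint swaps add up exactly, which holds for costs that decompose per position (e.g. a sum of c(item, slot) terms); costs with adjacent-pair interactions need more careful arc weights than this sketch uses.

```python
def dynasearch_step(perm, swap_delta):
    """One DYNASEARCH move: the best *set* of non-overlapping adjacent
    swaps, via a shortest-path DP in O(N) evaluations of swap_delta.
    swap_delta(perm, i) = cost change from swapping perm[i], perm[i+1].
    Returns (total delta, new permutation)."""
    n = len(perm)
    best = [0.0] * (n + 1)     # best[i]: cheapest way to decide slots < i
    took = [False] * (n + 1)   # took[i]: we swapped (i-2, i-1) to reach i
    for i in range(1, n + 1):
        best[i] = best[i - 1]  # option 1: leave slot i-1 alone
        if i >= 2:             # option 2: swap the pair ending at slot i-1
            c = best[i - 2] + swap_delta(perm, i - 2)
            if c < best[i]:
                best[i], took[i] = c, True
    out, i = list(perm), n     # walk back and apply the chosen swaps
    while i > 0:
        if took[i]:
            out[i - 2], out[i - 1] = out[i - 1], out[i - 2]
            i -= 2
        else:
            i -= 1
    return best[n], out
```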

  44. Can we extend this idea (up to N moves in parallel, by dynamic programming) to neighborhoods beyond SWAP?
  Yes. The asymptotic runtime is always unchanged.
  Recall: exchange two adjacent blocks of max widths w ≤ w’; a move is defined by an (i,j,k) triple.
  • SWAP: w=1, w’=1 → O(N) moves
  • INSERT: w=1, w’=N → O(N²) moves
  • BLOCK: w=N, w’=N → O(N³) moves
  runtime = # neighbors = O(ww’N)

  45. Let’s define each neighbor by a “colored tree”. Just like ITG!
  [Figure: a binary tree over the order 1 4 5 6 2 3; a colored node swaps its two children]

  46. Let’s define each neighbor by a “colored tree”. Just like ITG!
  [Figure: the same tree over 1 4 5 6 2 3, with a different set of colored (swapped) nodes]

  47. Let’s define each neighbor by a “colored tree”. Just like ITG!
  This is like the BLOCK neighborhood, but with multiple block exchanges, which may be nested.
  [Figure: a colored tree over 1 4 5 6 2 3 performing nested block exchanges]

  48. If that was the optimal neighbor … now look for its optimal neighbor: a new tree!
  [Figure: a fresh colored tree is built over the new order 1 5 6 4 2 3]

  49. If that was the optimal neighbor … now look for its optimal neighbor: a new tree!
  [Figure: another colored tree over the order 1 5 6 4 2 3]

  50. If that was the optimal neighbor … now look for its optimal neighbor … repeat till we reach a local optimum.
  • Each tree defines a neighbor.
  • At each step, optimize over all possible trees by dynamic programming (CKY parsing).
  • Use your favorite parsing speedups (pruning, best-first, …).
  [Figure: CKY search over colored trees for the order 1 4 5 6 2 3]
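
A didactic sketch of one such step: CKY-style DP over all colored trees, again assuming an adjacent-bigram cost so that a span's interaction with the outside is captured by its two end words. This only illustrates the recurrence; it is unpruned and far less efficient than the talk's method.

```python
from functools import lru_cache

def best_itg_neighbor(seq, bigram_cost):
    """Optimize over the whole "colored tree" neighborhood of seq: every
    binary tree over the current order, with any subset of nodes swapping
    their children, yields one neighbor. Returns (best order, its cost)."""
    seq = tuple(seq)

    @lru_cache(maxsize=None)
    def span(i, j):
        # Maps (first item, last item) -> (cost, arrangement) for seq[i:j].
        if j - i == 1:
            return {(seq[i], seq[i]): (0.0, (seq[i],))}
        out = {}
        for m in range(i + 1, j):                       # split point
            for (a1, b1), (c1, s1) in span(i, m).items():
                for (a2, b2), (c2, s2) in span(m, j).items():
                    for key, c, s in (
                        # keep children in order: pay for the seam b1-a2
                        ((a1, b2), c1 + c2 + bigram_cost(b1, a2), s1 + s2),
                        # swap children: pay for the seam b2-a1
                        ((a2, b1), c1 + c2 + bigram_cost(b2, a1), s2 + s1),
                    ):
                        if key not in out or c < out[key][0]:
                            out[key] = (c, s)
        return out

    cost, order = min(span(0, len(seq)).values())
    return list(order), cost
```

Iterating `best_itg_neighbor` on its own output until the cost stops improving gives the local search of slides 48-50, since ITG permutations of the new order can reach arrangements the old order could not.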
