
Evidence Reinforcement for Phrase-Based Paraphrase Generation in NLP

Explore complexity, challenges, and future advancements in implementing phrase pattern matching for paraphrase generation using linguistic constraints in NLP.


Presentation Transcript


  1. Evidence Reinforcement for “a disponer de los” “to leave the”. Figure label: A disponer de los, .95. Yuval Marton, Dissertation Defense

  2. Complexity
     • Suffix array (Manber and Myers 1993)
       ◦ All suffixes of the text in lexicographic order
     • Phrase pattern matching implementation (Lopez 2007)
       ◦ Find all occurrences of phrase phr in text T: O(|phr| + log|T|)
         ▪ Can be done in optimal O(m) (Abouelhoda et al. 2004)
       ◦ Find all occurrences of gapped phrase LXR in text T: O(|L|+|R|+log|T| + c(L)c(R))
         ▪ Can be improved with stratified tree: O(|L|+|R|+log|T| + (|L|+|R|)loglog|T|) (Emde Boas et al. 1977)
         ▪ Or hash table: O(|L|+|R|+log|T| + c(L)+c(R)) (Lopez 2007)
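The suffix-array lookup on this slide can be illustrated with a minimal sketch. It is not the implementation from Lopez (2007): the function names `build_suffix_array` and `find_occurrences` are hypothetical, the array is built by naive sorting, and the plain binary search below costs O(|phr| log|T|) comparisons rather than the O(|phr| + log|T|) bound quoted above, which needs additional longest-common-prefix information.

```python
from typing import List, Sequence


def build_suffix_array(tokens: Sequence[str]) -> List[int]:
    """Return the start positions of all suffixes of `tokens`, in lexicographic order.

    Naive construction by sorting whole suffixes; adequate for a toy text only.
    """
    return sorted(range(len(tokens)), key=lambda i: list(tokens[i:]))


def find_occurrences(tokens: Sequence[str], sa: List[int], phrase: Sequence[str]) -> List[int]:
    """Return the sorted start positions of every occurrence of `phrase` in `tokens`."""
    target = list(phrase)
    m = len(target)

    def prefix(rank: int) -> List[str]:
        # First m tokens of the suffix stored at this rank of the suffix array.
        start = sa[rank]
        return list(tokens[start:start + m])

    # Lower bound: first rank whose m-token prefix is >= the phrase.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < target:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first rank whose m-token prefix is > the phrase.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= target:
            lo = mid + 1
        else:
            hi = mid

    # All matching suffixes are contiguous in the suffix array.
    return sorted(sa[first:lo])
```

Because every suffix that starts with the phrase sits in one contiguous block of the array, the count c(phr) is simply the width of that block, which is why the occurrence counts c(L) and c(R) used in the gapped-phrase bounds come almost for free.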

  3. Complexity II
     • Find all paraphrases of phr:
       ◦ Find all occurrences of phr: O(|phr| + log|T|)
       ◦ Find all adequate contexts L_R: O(c(phr))
         ▪ Sample phr if too frequent: c(phr) > st = 10,000
       ◦ Find all occurrences of each LXR: O(|L|+|R|+log|T| + c(L)+c(R))
         ▪ Choose L, R long enough so that max(c(L),c(R)) < mcc = 2000. In practice, it will be rare that max(|L|,|R|) > 5
         ▪ There are up to c(phr) such L_R contexts, each occurring c(LXR) times < max(c(L),c(R)). Total occurrences of contexts: c(phr)·max(c(L),c(R))
       ◦ Compare all X, phr
     • Total: O(|phr| + log|T| + c(phr) + c(phr)[|L|+|R|+log|T| + c(L)+c(R)] + c(phr)·max(c(L),c(R)))
     • With limits: O(|phr| + st[log|T| + mcc])
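The search procedure on this slide can be sketched as follows. This is a simplified illustration under stated assumptions, not the engine described in the talk: it scans the text linearly instead of using a suffix array, uses a fixed context length `ctx_len` instead of growing L and R until their counts fall below mcc, and ranks candidates X by how many shared-context occurrences they have rather than by a distributional similarity score. The names `occurrences`, `paraphrase_candidates`, `ctx_len`, and `max_gap` are hypothetical; `sample_threshold` plays the role of st.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def occurrences(tokens: Sequence[str], phrase: Sequence[str]) -> List[int]:
    """Start positions of `phrase` in `tokens` (linear scan, for illustration only)."""
    target = list(phrase)
    m = len(target)
    return [i for i in range(len(tokens) - m + 1) if list(tokens[i:i + m]) == target]


def paraphrase_candidates(
    tokens: Sequence[str],
    phrase: Sequence[str],
    ctx_len: int = 1,                # assumed fixed length of L and R
    max_gap: int = 3,                # assumed maximum length of a candidate X
    sample_threshold: int = 10_000,  # "st" on the slide
    seed: int = 0,
) -> Counter:
    """Count candidates X that fill the gap in some context L_R also seen around `phrase`."""
    phrase = tuple(phrase)
    m = len(phrase)

    # Step 1: find occurrences of phr, sampling if it is too frequent.
    positions = occurrences(tokens, phrase)
    if len(positions) > sample_threshold:
        positions = random.Random(seed).sample(positions, sample_threshold)

    # Step 2: collect the distinct contexts L_R around those occurrences.
    contexts = set()
    for p in positions:
        if p - ctx_len >= 0 and p + m + ctx_len <= len(tokens):
            left = tuple(tokens[p - ctx_len:p])
            right = tuple(tokens[p + m:p + m + ctx_len])
            contexts.add((left, right))

    # Step 3: for each context, find every gapped match L X R and record X.
    counts: Counter = Counter()
    for left, right in contexts:
        for i in occurrences(tokens, left):
            gap_start = i + ctx_len
            for gap in range(1, max_gap + 1):
                r_start = gap_start + gap
                if tuple(tokens[r_start:r_start + ctx_len]) == right:
                    x = tuple(tokens[gap_start:r_start])
                    if x != phrase:
                        counts[x] += 1
    return counts
```

The sampling cap in step 1 and a cap on context frequency are what turn the general total on the slide into the bounded form O(|phr| + st[log|T| + mcc]): at most st contexts are examined, and each contributes at most a log|T| lookup plus mcc occurrences.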

  4. Comparison with Pivoting Method
     • Resource availability:
       ◦ Pivoting uses parallel text, limited.
       ◦ Distributional paraphrases use monolingual text, abundant.
     • Top candidate challenges:
       ◦ Pivoting: function words (due to “promiscuity” + frequency?); translational shift (due to double translation step)
       ◦ Distributional: antonyms, co-hypernyms
     • Semantic similarity measure:
       ◦ Pivoting: probability p(f1|e) p(e|f2)
       ◦ Distributional: vector similarity. Can plug in any similarity measure
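The two similarity measures contrasted on this slide can be put side by side in a short sketch. The function names `pivot_score` and `cosine` are hypothetical, and the dictionaries stand in for probability tables and context vectors that a real system would estimate from corpora. Summing the product over all shared pivot phrases e is an assumption here (the slide shows only the product p(f1|e) p(e|f2)), as is the choice of cosine as the vector similarity.

```python
import math
from typing import Dict


def pivot_score(p_f1_given_e: Dict[str, float], p_e_given_f2: Dict[str, float]) -> float:
    """Pivoting similarity of f1 and f2: sum over shared pivots e of p(f1|e) * p(e|f2).

    Both arguments map a pivot phrase e to a probability; the summation over
    pivots is an assumption of this sketch.
    """
    return sum(p * p_e_given_f2[e] for e, p in p_f1_given_e.items() if e in p_e_given_f2)


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse context vectors keyed by context feature."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for empty vectors in this sketch
    return dot / (norm_u * norm_v)
```

The pivot score needs probabilities estimated from parallel text, while `cosine` only needs co-occurrence statistics from monolingual text, and it could be replaced by any other vector similarity without changing the rest of the pipeline.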

  5. Unified Model (Soft Semantic Constraints)
     Semantic distance of word e in sense s from word e′ in sense s′: cos(e_s, e′_s′) = [formula not preserved in the transcript], where: fSense(e, w_i), fWord(e, w_i) [definitions not preserved].
     Slide annotations on the formula's terms: sense-proportional terms (≈ cos); cross-terms; pure corpus-based terms (≈ cos).

  6. Future Challenges
     • Quality:
       ◦ Use antonym detection to penalize antonymous candidates, e.g., Mohammad, Dorr and Dunne (2009)
       ◦ How to benefit from POS and syntactic info, e.g., Callison-Burch (2008)
       ◦ How to benefit from semantic info / WSD, e.g., Erk & Pado (2008)
     • Scaling:
       ◦ How? (Hadoop + Map/Reduce?)
       ◦ Find out why a larger monolingual training set doesn’t always improve paraphrase contribution to SMT
       ◦ Beat pivoting method (Callison-Burch et al. (2006))
       ◦ Get gains on larger SMT sets (for resource-rich languages!)

  7. Main Contributions
     • Showing the advantage of fine-grained linguistic soft constraints in SMT, relative to “pure” corpus-based baseline and coarse-grained soft constraints.
     • Showing the advantage of fine-grained linguistic soft constraints in lexical semantics and paraphrase generation, relative to “pure” corpus-based baseline, hard constraints and coarse-grained soft constraints.
     • Showing statistically significant gains in state-of-the-art end-to-end phrase-based SMT systems of both syntactic and semantic (paraphrastic) contributions.
     • Introducing a novel paraphrase generation engine, using a monolingual corpus-based distributional approach, independent of parallel texts (a limited resource).
     • Introducing a novel evidence reinforcement component for scoring translation rules in paraphrase-augmented translation models.
     • Tunable (task-specific optimization) unified linear statistical NLP model.
