
Features, Formalized



Presentation Transcript


  1. Features, Formalized Stephen Mayhew Hyung Sul Kim

  2. Outline • What are features? • How are they defined in NLP tasks in general? • How are they defined specifically for relation extraction? (Kernel methods)

  3. What are features?

  4. Feature Extraction Pipeline • Define Feature Generation Functions (FGF) • Apply FGFs to Data to make a lexicon • Translate examples into feature space • Learning with vectors

  5. Feature Generation Functions • When we say ‘features’, we are often actually talking about FGFs. • An FGF defines a relation over the instance space. • For example, let the instance be the phrase “little brown cow”. • The relation containsWord(w) is active (= 1) three times: containsWord(little), containsWord(brown), containsWord(cow)
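A minimal sketch of this idea in Python (the helper name and the tokenized instance are illustrative, not part of the slides' framework): the parametrized relation containsWord(w) is grounded once for every word that occurs in the instance.

```python
# Minimal sketch: grounding the parametrized relation containsWord(w)
# on a tokenized instance. Names are illustrative.

def contains_word_features(tokens):
    """Return the set of grounded containsWord(w) features active on the instance."""
    return {f"containsWord({w})" for w in tokens}

instance = ["little", "brown", "cow"]   # illustrative instance
print(contains_word_features(instance))
# {'containsWord(little)', 'containsWord(brown)', 'containsWord(cow)'}  (set order may vary)
```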

  6. Feature Generation Functions Let R = {r1, r2, …} be an enumerable collection of relations on the instance space X. A Feature Generation Function is a mapping F: X → 2^R that maps each x ∈ X to the set of all elements in R that x satisfies. Common notation for FGF: F(x) = { r ∈ R : r(x) = 1 }

  7. Feature Generation Functions Example: “Gregor Samsa woke from troubled dreams.” Let R = { isCap(…), hasLen4(…), endsWithS(…) }. Define an FGF over R and apply it to the instance: isCap(Gregor), isCap(Samsa), hasLen4(woke), hasLen4(from), endsWithS(dreams)
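A sketch of the same example in Python, assuming a small hand-written collection of relations; the dictionary of lambdas is just one illustrative encoding, not a specific library API.

```python
# Sketch of an FGF built from a collection of relations R = {isCap, hasLen4, endsWithS}.
# The FGF returns every grounded relation the instance satisfies.

RELATIONS = {
    "isCap":     lambda w: w[0].isupper(),
    "hasLen4":   lambda w: len(w) == 4,
    "endsWithS": lambda w: w.endswith("s"),
}

def fgf(tokens):
    """Map an instance (token list) to the set of active grounded relations."""
    return {f"{name}({w})" for w in tokens for name, rel in RELATIONS.items() if rel(w)}

tokens = "Gregor Samsa woke from troubled dreams".split()
print(sorted(fgf(tokens)))
# ['endsWithS(dreams)', 'hasLen4(from)', 'hasLen4(woke)', 'isCap(Gregor)', 'isCap(Samsa)']
```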

  8. Feature Extraction Pipeline • Define Feature Generation Functions (FGF) • Apply FGFs to Data to make a lexicon • Translate examples into feature space • Learning with vectors

  9. Lexicon Apply our FGF to all input data. This creates grounded features and indexes them: … 3534: hasWord(stark) 3535: hasWord(stamp) 3536: hasWord(stampede) 3537: hasWord(starlight) …
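A sketch of how such a lexicon might be built, reusing the kind of FGF from the sketches above; the toy corpus and the resulting indices are illustrative (the 3534-range indices on the slide come from a much larger corpus).

```python
# Sketch: build a lexicon by applying an FGF to every training instance
# and assigning each distinct grounded feature the next free integer index.

def build_lexicon(instances, fgf):
    lexicon = {}
    for x in instances:
        for feature in sorted(fgf(x)):      # sorted only to make indices deterministic
            if feature not in lexicon:
                lexicon[feature] = len(lexicon)
    return lexicon

has_word = lambda tokens: {f"hasWord({w})" for w in tokens}   # illustrative FGF
corpus = ["In the stark starlight".split(),
          "the stampede startled the herd".split()]
print(build_lexicon(corpus, has_word))
# e.g. {'hasWord(In)': 0, 'hasWord(stark)': 1, 'hasWord(starlight)': 2, 'hasWord(the)': 3, ...}
```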

  10. Feature Extraction Pipeline • Define Feature Generation Functions (FGF) • Apply FGFs to Data to make a lexicon • Translate examples into feature space • Learning with vectors

  11. Translate examples to feature space “In the stark starlight” <98, 241, 3534, 3537> From Lexicon: … 98: hasWord(In) … 241: hasWord(the) … 3534: hasWord(stark) 3535: hasWord(stamp) 3536: hasWord(stampede) 3537: hasWord(starlight) …
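A sketch of this translation step; the tiny lexicon below simply hard-codes the indices shown on the slide for illustration.

```python
# Sketch: translate an example into feature space as a sorted list of active
# feature indices (a sparse binary vector). The lexicon is a tiny stand-in
# for the one built at training time.

lexicon = {"hasWord(In)": 98, "hasWord(the)": 241,
           "hasWord(stark)": 3534, "hasWord(stamp)": 3535,
           "hasWord(stampede)": 3536, "hasWord(starlight)": 3537}

def has_word(tokens):
    return {f"hasWord({w})" for w in tokens}

def to_feature_vector(tokens, lexicon):
    # Features not present in the lexicon are simply dropped.
    return sorted(lexicon[f] for f in has_word(tokens) if f in lexicon)

print(to_feature_vector("In the stark starlight".split(), lexicon))
# [98, 241, 3534, 3537]
```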

  12. Feature Extraction Pipeline • Define Feature Generation Functions (FGF) • Apply FGFs to Data to make a lexicon • Translate examples into feature space • Learning with vectors Easy.

  13. Feature Extraction Pipeline Testing • FGFs are already defined • Lexicon is already defined • Translate examples into feature space • Learning with vectors No surprises here.

  14. Structured Pipeline - Training • Define Feature Generation Functions (FGF) (Note: in this case the FGF is defined over input/output pairs, F(x, y)) • Apply FGFs to data to make a lexicon • Translate examples into feature space • Learning with vectors Exactly the same as before!

  15. Structured Pipeline - Testing Remember, the FGF is F(x, y). Now we don’t have a gold y to use, but the idea is very similar: for every possible y we create features.
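Assuming the structured FGF really is defined over input/output pairs F(x, y) (a reconstruction, since the original formula is not shown in the transcript), a minimal sketch of the train/test difference: at training time features are grounded on the gold output, at test time on every candidate output.

```python
# Sketch of a structured FGF F(x, y): features conjoin input observations
# with a candidate output label. All names here are illustrative.

def structured_fgf(tokens, label):
    return {f"hasWord({w})&label={label}" for w in tokens}

tokens = ["Gregor", "Samsa"]

# Training: ground features against the gold output.
gold = "PER"
print(structured_fgf(tokens, gold))

# Testing: no gold output is available, so generate features for every
# possible output and let the learned weights score each candidate.
for y in ["PER", "LOC", "ORG"]:
    print(y, structured_fgf(tokens, y))
```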

  16. Automatic Feature Generation Two ways to look at this: • Creating an FGF: this is a black art, not even intuitive for humans to do • Choosing the best subset of a closed set: this is possible, algorithms exist
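For the second option, a hedged sketch using scikit-learn's chi-squared filter on synthetic data (scikit-learn and the random data are assumptions for illustration; the slides do not name a toolkit or a selection criterion):

```python
# Sketch: select the k most informative features from a closed set with a
# chi-squared filter. Requires scikit-learn; the data below is synthetic.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))   # 200 examples, 50 binary features
y = rng.integers(0, 2, size=200)         # binary labels

selector = SelectKBest(chi2, k=10).fit(X, y)
print(selector.get_support(indices=True))   # indices of the 10 selected features
```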

  17. Exploiting Syntactico-Semantic Structures for Relation Extraction Chan and Roth, ACL 2011 Before doing the hard task of relation classification, apply some easy heuristics to recognize: • Premodifiers: [the [Seattle] Zoo] • Possessives: [[California’s] Governor] • Prepositions: [officials] in [California] • Formulaics: [Medford], [Massachusetts] These 4 structures cover 80% of the mention pairs (in ACE 2004)
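A hedged sketch of what such heuristics might look like over token spans; these simplified patterns are illustrations only, not the actual rules from Chan and Roth (2011).

```python
# Simplified illustrations of two of these structures; NOT the actual
# Chan & Roth (2011) rules. m1 and m2 are (start, end) token spans,
# with m1 occurring before m2.

def is_possessive(tokens, m1, m2):
    """[[California 's] Governor]: m1 ends with a possessive marker right before m2."""
    return m1[1] == m2[0] and tokens[m1[1] - 1] in ("'s", "'")

def is_preposition(tokens, m1, m2):
    """[officials] in [California]: a single preposition separates the two mentions."""
    between = tokens[m1[1]:m2[0]]
    return len(between) == 1 and between[0].lower() in {"in", "of", "at", "from", "on"}

print(is_preposition("officials in California".split(), (0, 1), (2, 3)))   # True
print(is_possessive("California 's Governor".split(), (0, 2), (2, 3)))     # True
```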

  18. Kernels for Relation Extraction Hyung Sul Kim

  19. Kernel Tricks • Borrowed a few slides from the ACL 2012 tutorial on kernels in NLP by Moschitti

  20. All We Need is K(x1, x2) = ϕ(x1) · ϕ(x2) Computing K(x1, x2) is often possible without explicitly mapping x to ϕ(x)
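A small numeric sketch of why this matters: a quadratic kernel computed directly from dot products in the original space agrees with the dot product under an explicit (much larger) feature map, without ever building that map.

```python
# Sketch of the kernel trick: K(x1, x2) = (x1 . x2)**2 equals the dot product
# of explicit quadratic feature maps phi(x), but never materializes phi.
import itertools
import numpy as np

def phi(x):
    """Explicit quadratic feature map: all pairwise products x_i * x_j."""
    return np.array([x[i] * x[j] for i, j in itertools.product(range(len(x)), repeat=2)])

def K(x1, x2):
    """Same value, computed without the explicit map."""
    return float(np.dot(x1, x2)) ** 2

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([0.5, -1.0, 2.0])
print(np.dot(phi(x1), phi(x2)))   # 20.25
print(K(x1, x2))                  # 20.25
```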

  21. Linear Kernels with Features (Zhou et al., 2005) • Pairwise binary-SVM training • Features • Words • Entity Types • Mention Level • Overlap • Base Phrase Chunking • Dependency Tree • Parse Tree • Semantic Resources

  22. Word Features

  23. Entity Types, Mention Level, Overlap

  24. Base Phrase Chunking

  25. Performance of Features (F1 Measure)

  26. Performance Comparison

  27. Syntactic Kernels (Zhao and Grishman, 2005) • Syntactic Kernels (Composite of 5 Kernels) • Argument Kernel • Bigram Kernel • Link Sequence Kernel • Dependency Path Kernel • Local Dependency Kernel

  28. Bigram Kernel • All unigrams and bigrams in the text from M1 to M2
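A sketch of the feature set behind such a kernel, assuming the text between (and including) the two mentions has already been tokenized; a linear kernel would then compare these sets.

```python
# Sketch: unigram and bigram features over the tokens from mention M1 to M2.

def unigrams_and_bigrams(tokens):
    unigrams = set(tokens)
    bigrams = {f"{a} {b}" for a, b in zip(tokens, tokens[1:])}
    return unigrams | bigrams

span = "officials in California".split()   # text from M1 to M2, illustrative
print(unigrams_and_bigrams(span))
# {'officials', 'in', 'California', 'officials in', 'in California'}  (set order may vary)
```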

  29. Dependency Path Kernel That's because Israel was expected to retaliate against Hezbollah forces in areas controlled by Syrian troops.

  30. Performance Comparison

  31. Composite Kernel (Zhang et al., 2006) • Composite of Two Kernels • Entity Kernel (Linear Kernel with entity-related features given by ACE datasets) • Convolution Tree Kernel (Collins and Duffy, 2001) • Two ways to compose the two kernels • Linear Combination • Polynomial Expansion
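A sketch of the two composition schemes, with toy stand-ins for the entity kernel and the tree kernel; alpha and the polynomial degree are hyperparameters, and the exact form used in the paper may differ from this simplification.

```python
# Sketch: two ways to combine an entity kernel K1 with a tree kernel K2.
# K1/K2 below are toy stand-ins; alpha and d are hyperparameters.

def linear_combination(K1, K2, alpha=0.4):
    return lambda a, b: alpha * K1(a, b) + (1 - alpha) * K2(a, b)

def polynomial_expansion(K1, K2, alpha=0.4, d=2):
    # Polynomially expand the entity kernel before combining it with the tree kernel.
    return lambda a, b: alpha * (K1(a, b) + 1) ** d + (1 - alpha) * K2(a, b)

K_entity = lambda a, b: float(len(set(a) & set(b)))   # toy entity-feature overlap
K_tree   = lambda a, b: 1.0 if a == b else 0.5        # toy tree similarity

K = polynomial_expansion(K_entity, K_tree)
print(K(["PER", "GPE"], ["PER", "ORG"]))   # 0.4 * (1 + 1)**2 + 0.6 * 0.5 ≈ 1.9
```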

  32. Convolution Tree Kernel (Collins and Duffy, 2001) Efficiently computes K(x1, x2) in O(|x1|·|x2|) time. An example tree
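A compact sketch of the Collins and Duffy (2001) dynamic program, assuming parse trees are encoded as simple (label, children) tuples with bare strings as terminal words; lam is the usual decay parameter, and the toy trees are illustrative.

```python
# Sketch of the convolution tree kernel (Collins & Duffy, 2001):
# K(T1, T2) = sum over node pairs of delta(n1, n2), where delta counts
# (decayed) common subtrees rooted at that pair. Memoization keeps the
# computation O(|T1| * |T2|).

def nodes(tree):
    if isinstance(tree, str):       # terminal word, not an internal node
        return
    yield tree
    for child in tree[1]:
        yield from nodes(child)

def production(node):
    label, children = node
    return (label, tuple(c if isinstance(c, str) else c[0] for c in children))

def is_preterminal(node):
    return all(isinstance(c, str) for c in node[1])

def tree_kernel(t1, t2, lam=0.4):
    memo = {}
    def delta(n1, n2):
        key = (id(n1), id(n2))
        if key not in memo:
            if production(n1) != production(n2):
                memo[key] = 0.0
            elif is_preterminal(n1):
                memo[key] = lam
            else:
                score = lam
                for a, b in zip(n1[1], n2[1]):
                    score *= 1.0 + delta(a, b)
                memo[key] = score
        return memo[key]
    return sum(delta(n1, n2) for n1 in nodes(t1) for n2 in nodes(t2))

# Toy example: two parses sharing the NP "the cat" but with different verbs.
t1 = ("S", [("NP", [("DT", ["the"]), ("NN", ["cat"])]), ("VP", [("VBD", ["slept"])])])
t2 = ("S", [("NP", [("DT", ["the"]), ("NN", ["cat"])]), ("VP", [("VBD", ["ran"])])])
print(tree_kernel(t1, t2))   # ≈ 2.983 with lam = 0.4
```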

  33. Relation Instance Spaces (figure comparing instance spaces; F1 scores: 61.9, 51.3, 60.4, 59.2)

  34. Performance Comparison

  35. Context-Sensitive Tree Kernel (Zhou et al., 2007) • Motivational example: “John and Mary got married”, where the relation is expressed by a predicate outside the two mentions; such cases are called the predicate-linked category (about 10% of instances) • Context-Sensitive Tree Kernel: 73.2, PT: 63.6

  36. Performance Comparison

  37. Best Kernel (Nguyen et al., 2009) • Use Multiple Kernels on • Constituent Trees • Dependency Trees • Sequential Structures • Design 5 different Kernel Composites with 4 Tree Kernels and 6 Sequential Kernels

  38. Convolution Tree Kernels on 4 Special Trees (F1): PET 68.9, GRW 56.3, DW 58.5, GR 60.2; combinations: PET + GR = 70.5, DW + GR = 61.8

  39. Word Sequence Kernels on 6 Special Sequences • SK1. Sequence of terminals (lexical words) in the PET, e.g. T2-LOC washington, U.S. T1-PER officials • SK2. Sequence of part-of-speech (POS) tags in the PET, e.g. T2-LOC NN, NNP T1-PER NNS • SK3. Sequence of grammatical relations in the PET, e.g. T2-LOC pobj, nn T1-PER nsubj • SK4. Sequence of words in the DW, e.g. Washington T2-LOC In working T1-PER officials GPE U.S. • SK5. Sequence of grammatical relations in the GR, e.g. pobj T2-LOC prep ROOT T1-PER nsubj GPE nn • SK6. Sequence of POS tags in the DW, e.g. NN T2-LOC IN VBP T1-PER NNS GPE NNP Individual F1 scores: 61.0, 60.8, 61.6, 59.7, 59.8, 59.7; SK1 + SK2 + SK3 + SK4 + SK5 + SK6 = 69.8

  40. Word Sequence Kernels (Cancedda et al., 2003) • Extended sequence kernels • Map to high-dimensional spaces using every subsequence • Penalties for • common subsequences (using IDF) • longer subsequences • non-contiguous subsequences
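A naive sketch of a gap-weighted word subsequence kernel in this spirit: every subsequence up to length n is a feature, weighted so that longer and gappier subsequences count less. Real implementations use dynamic programming, and the IDF down-weighting of common subsequences from Cancedda et al. is omitted here.

```python
# Naive sketch of a gap-weighted word subsequence kernel: every subsequence
# up to length n is a feature, weighted by lam**span so that longer and
# non-contiguous (gappy) subsequences are penalized.
from collections import defaultdict
from itertools import combinations

def subsequence_map(tokens, n=2, lam=0.5):
    phi = defaultdict(float)
    for length in range(1, n + 1):
        for idx in combinations(range(len(tokens)), length):
            span = idx[-1] - idx[0] + 1                  # total span covered, gaps included
            phi[tuple(tokens[i] for i in idx)] += lam ** span
    return phi

def sequence_kernel(s, t, n=2, lam=0.5):
    phi_s, phi_t = subsequence_map(s, n, lam), subsequence_map(t, n, lam)
    return sum(v * phi_t[k] for k, v in phi_s.items() if k in phi_t)

a = "officials in California".split()
b = "officials living in California".split()
print(sequence_kernel(a, b))   # 0.8515625
```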

  41. Performance Comparison • (Zhang et al., 2006): F-measure 68.9 in our settings • (Zhou et al., 2007): “Such heuristics expand the tree and remove unnecessary information allowing a higher improvement on RE. They are tuned on the target RE task so although the result is impressive, we cannot use it to compare with pure automatic learning approaches, such as our models.”

  42. Topic Kernel (Wang et al., 2011) • Use Wikipedia InfoBox to learn topics of relations (like topics of words) based on co-occurrences

  43. Overview

  44. Performance Comparison
