Features, Formalized Stephen Mayhew Hyung Sul Kim
Outline • What are features? • How are they defined in NLP tasks in general? • How are they defined specifically for relation extraction? (Kernel methods)
Feature Extraction Pipeline • Define Feature Generation Functions (FGF) • Apply FGFs to Data to make a lexicon • Translate examples into feature space • Learning with vectors
Feature Generation Functions • When we say ‘features’, we are often actually talking about FGFs. • An FGF defines a relation over the instance space. • For example, let the instance be the phrase “little brown cow” • The relation containsWord(w) is active ( = 1) three times: containsWord(little), containsWord(brown), containsWord(cow)
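As a rough sketch (the function and variable names are illustrative, not from the slides), the parameterized relation containsWord(w) can be grounded against an instance like this:

```python
# Minimal sketch: ground the parameterized relation containsWord(w) on an
# instance, yielding the features that are active (= 1).

def contains_word(instance_words):
    """Return the grounded containsWord(w) features active on the instance."""
    return {f"containsWord({w})" for w in instance_words}

print(sorted(contains_word(["little", "brown", "cow"])))
# ['containsWord(brown)', 'containsWord(cow)', 'containsWord(little)']
```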
Feature Generation Functions Let X be the instance space and let R = {r1, r2, …} be an enumerable collection of relations on X. A Feature Generation Function is a mapping F: X → 2^R that maps each x ∈ X to the set of all elements of R that x satisfies. Common notation for the FGF: F(x) ⊆ R
Feature Generation Functions Example: “Gregor Samsa woke from troubled dreams.” Let R = { isCap(…), hasLen4(…), endsWithS(…) } Define an FGF over R and apply it to the instance: isCap(Gregor), isCap(Samsa), hasLen4(woke), hasLen4(from), endsWithS(dreams)
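A minimal Python sketch of the FGF just defined, assuming simple token-level implementations of the three relations (the predicate bodies are assumptions, not given on the slides):

```python
# Sketch of an FGF F: X -> 2^R for R = {isCap, hasLen4, endsWithS}.
R = {
    "isCap":     lambda w: w[:1].isupper(),
    "hasLen4":   lambda w: len(w) == 4,
    "endsWithS": lambda w: w.endswith("s"),
}

def fgf(tokens):
    """Map an instance (token list) to the set of grounded relations it satisfies."""
    return {f"{name}({w})" for w in tokens for name, rel in R.items() if rel(w)}

print(sorted(fgf("Gregor Samsa woke from troubled dreams".split())))
# endsWithS(dreams), hasLen4(from), hasLen4(woke), isCap(Gregor), isCap(Samsa)
```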
Feature Extraction Pipeline • Define Feature Generation Functions (FGF) • Apply FGFs to Data to make a lexicon • Translate examples into feature space • Learning with vectors
Lexicon Apply our FGF to all input data. Creates grounded features and indexes them … 3534: hasWord(stark) 3535: hasWord(stamp) 3536: hasWord(stampede) 3537: hasWord(starlight) …
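A sketch of how such a lexicon might be built (function names are illustrative): apply the FGF to every training instance and give each new grounded feature the next free index.

```python
# Illustrative lexicon construction: grounded feature string -> integer index.
def build_lexicon(corpus, fgf):
    lexicon = {}
    for instance in corpus:
        for feature in fgf(instance):
            if feature not in lexicon:
                lexicon[feature] = len(lexicon)   # assign the next free index
    return lexicon

# e.g. lexicon.get("hasWord(starlight)") might return 3537 on some corpus
```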
Feature Extraction Pipeline • Define Feature Generation Functions (FGF) • Apply FGFs to Data to make a lexicon • Translate examples into feature space • Learning with vectors
Translate examples to feature space “In the stark starlight” <98, 241, 3534, 3537> From Lexicon: … 98: hasWord(In) … 241: hasWord(the) … 3534: hasWord(stark) 3535: hasWord(stamp) 3536: hasWord(stampede) 3537: hasWord(starlight) …
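Under the same assumptions, translation is just a lexicon lookup over the active features; a sketch:

```python
# Sketch: represent an example as the sorted indices of its active features.
def translate(instance, fgf, lexicon):
    # features unseen at training time have no index and are simply dropped
    return sorted({lexicon[f] for f in fgf(instance) if f in lexicon})

# "In the stark starlight" -> [98, 241, 3534, 3537] under a lexicon like the one above
```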
Feature Extraction Pipeline • Define Feature Generation Functions (FGF) • Apply FGFs to Data to make a lexicon • Translate examples into feature space • Learning with vectors Easy.
Feature Extraction Pipeline Testing • FGFs are already defined • Lexicon is already defined • Translate examples into feature space • Learning with vectors No surprises here.
Structured Pipeline - Training • Define Feature Generation Functions (FGF) (Note: in this case the FGF is defined over input/output pairs (x, y): F: X × Y → 2^R) • Apply FGFs to data to make a lexicon • Translate examples into feature space • Learning with vectors Exactly the same as before!
Structured Pipeline - Testing Remember, the FGF is defined over pairs (x, y). Now we don't have a gold y to use, but the idea is very similar: for every possible y we create features.
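A rough sketch of this idea (all names are assumptions): conjoin the input features with a candidate label y, and at test time score every candidate y with the learned weight vector.

```python
# Sketch of a structured FGF over (x, y) pairs and test-time decoding.
def joint_fgf(x_tokens, y_label, fgf):
    # conjoin each input-side feature with the candidate output label
    return {f"{feat}&y={y_label}" for feat in fgf(x_tokens)}

def predict(x_tokens, candidate_labels, weights, lexicon, fgf):
    def score(y):
        idxs = [lexicon[f] for f in joint_fgf(x_tokens, y, fgf) if f in lexicon]
        return sum(weights[i] for i in idxs)
    return max(candidate_labels, key=score)   # best-scoring y
```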
Automatic Feature Generation Two ways to look at this: • Creating an FGF: this is a black art, not even intuitive for humans to do • Choosing the best subset of a closed set: this is possible, and algorithms exist
Exploiting Syntactico-Semantic Structures for Relation Extraction Chan and Roth, ACL 2011 Before doing the hard task of relation classification, apply some easy heuristics to recognize: • Premodifiers: [the [Seattle] Zoo] • Possessives: [[California’s] Governor] • Prepositions: [officials] in [California] • Formulaics: [Medford], [Massachusetts] These 4 structures cover 80% of the mention pairs (in ACE 2004)
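As a very simplified sketch (these patterns are my own illustrative approximations, not Chan and Roth's actual rules), such heuristics can be written as a few token-level tests on a mention pair:

```python
# Oversimplified patterns for three of the easy structures; real rules also use
# parses and nested-mention structure (e.g. for premodifiers), omitted here.
def easy_structure(m1_tokens, between_tokens, m2_tokens):
    if m1_tokens and m1_tokens[-1].endswith("'s") and not between_tokens:
        return "possessive"     # [[California's] Governor]
    if len(between_tokens) == 1 and between_tokens[0] in {"in", "of", "at", "from"}:
        return "preposition"    # [officials] in [California]
    if between_tokens == [","]:
        return "formulaic"      # [Medford], [Massachusetts]
    return None                 # fall back to the full relation classifier
```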
Kernels for Relation Extraction Hyung Sul Kim
Kernel Tricks • A few slides borrowed from the ACL 2012 tutorial on kernels in NLP by Moschitti
All We Need is K(x1, x2) = ϕ(x1) · ϕ(x2) Computing K(x1, x2) is often possible without explicitly mapping x to ϕ(x)
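A small sketch of why this matters, using a degree-2 polynomial kernel as the example (the kernel choice is mine, for illustration): K(x1, x2) = (x1 · x2)² equals ϕ(x1) · ϕ(x2) for the explicit map of all pairwise coordinate products, but computing K never builds ϕ.

```python
import itertools

def k_poly2(x1, x2):
    # kernel computed directly in the input space
    return sum(a * b for a, b in zip(x1, x2)) ** 2

def phi(x):
    # explicit feature map: all ordered coordinate products x_i * x_j
    return [a * b for a, b in itertools.product(x, repeat=2)]

x1, x2 = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
assert abs(k_poly2(x1, x2) - sum(a * b for a, b in zip(phi(x1), phi(x2)))) < 1e-9
```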
Linear Kernels with Features (Zhou et al., 2005) • Pairwise binary-SVM training • Features • Words • Entity Types • Mention Level • Overlap • Base Phrase Chunking • Dependency Tree • Parse Tree • Semantic Resources
Syntactic Kernels (Zhao and Grishman, 2005) • Syntactic Kernels (Composite of 5 Kernels) • Argument Kernel • Bigram Kernel • Link Sequence Kernel • Dependency Path Kernel • Local Dependency Kernel
Bigram Kernel • All unigrams and bigrams in the text from M1 to M2
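One plausible reading of this kernel (a sketch under my own assumptions, not Zhao and Grishman's exact formulation): count the unigrams and bigrams in the span between the mentions and take the dot product of the two count vectors.

```python
from collections import Counter

def unigrams_and_bigrams(tokens):
    feats = Counter(tokens)                  # unigram counts
    feats.update(zip(tokens, tokens[1:]))    # bigram counts (as tuples)
    return feats

def bigram_kernel(span1_tokens, span2_tokens):
    f1, f2 = unigrams_and_bigrams(span1_tokens), unigrams_and_bigrams(span2_tokens)
    return sum(f1[k] * f2[k] for k in f1.keys() & f2.keys())
```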
Dependency Path Kernel • Example sentence: "That's because Israel was expected to retaliate against Hezbollah forces in areas controlled by Syrian troops."
Composite Kernel (Zhang et al., 2006) • Composite of Two Kernels • Entity Kernel (Linear Kernel with entity-related features given by the ACE datasets) • Convolution Tree Kernel (Collins and Duffy, 2001) • Two ways to combine the two kernels • Linear Combination • Polynomial Expansion
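A sketch of the two composition strategies (alpha and the degree are illustrative hyper-parameters, and the exact polynomial form used by Zhang et al. may differ):

```python
# Combine an entity kernel and a tree kernel, each given as a function k(a, b).
def linear_combination(k_ent, k_tree, alpha=0.4):
    return lambda a, b: alpha * k_ent(a, b) + (1 - alpha) * k_tree(a, b)

def polynomial_expansion(k_ent, k_tree, alpha=0.4, degree=2):
    # expand the entity kernel polynomially before combining
    return lambda a, b: alpha * (k_ent(a, b) + 1) ** degree + (1 - alpha) * k_tree(a, b)
```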
Convolution Tree Kernel (Collins and Duffy, 2001) • Efficiently computes K(x1, x2) in O(|x1|·|x2|) time • (figure: an example tree)
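A compact sketch of the Collins and Duffy recursion (the tree encoding and decay value are my own choices): C(n1, n2) is zero when the productions differ, λ at matching pre-terminals, and otherwise λ times the product of (1 + C) over aligned children; summing C over all node pairs gives the kernel, and memoization keeps it at O(|x1|·|x2|) node-pair computations.

```python
from functools import lru_cache

LAMBDA = 0.4   # decay factor (illustrative value)

# A tree is (label, children), where children is a tuple of subtrees or word strings.
def nodes(t):
    yield t
    for c in t[1]:
        if isinstance(c, tuple):
            yield from nodes(c)

def production(t):
    return (t[0], tuple(c[0] if isinstance(c, tuple) else c for c in t[1]))

def tree_kernel(t1, t2):
    @lru_cache(maxsize=None)
    def C(n1, n2):
        if production(n1) != production(n2):
            return 0.0
        kids1 = [c for c in n1[1] if isinstance(c, tuple)]
        kids2 = [c for c in n2[1] if isinstance(c, tuple)]
        if not kids1:                         # pre-terminal: children are words
            return LAMBDA
        prod = LAMBDA
        for c1, c2 in zip(kids1, kids2):
            prod *= 1.0 + C(c1, c2)
        return prod
    return sum(C(n1, n2) for n1 in nodes(t1) for n2 in nodes(t2))

t = ("NP", (("DT", ("the",)), ("NN", ("zoo",))))
print(tree_kernel(t, t))   # weighted count of common subtrees of t with itself
```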
Relation Instance Spaces (figure comparing instance spaces; F-measures shown: 61.9, 51.3, 60.4, 59.2)
Context-Sensitive Tree Kernel (Zhou et al., 2007) • Motivational example: "John and Mary got married" • The path-enclosed tree between the two mentions misses the predicate "got married"; such cases form the so-called predicate-linked category (10%) • Context-Sensitive Tree Kernel: 73.2 vs. PT: 63.6
Best Kernel (Nguyen et al., 2009) • Use Multiple Kernels on • Constituent Trees • Dependency Trees • Sequential Structures • Design 5 different Kernel Composites with 4 Tree Kernels and 6 Sequential Kernels
Convolution Tree Kernels on 4 Special Trees • PET: 68.9 • GRW: 56.3 • DW: 58.5 • GR: 60.2 • PET + GR = 70.5 • DW + GR = 61.8
Word Sequence Kernels on 6 Special Sequences
SK1 (61.0): sequence of terminals (lexical words) in the PET, e.g. T2-LOC washington, U.S. T1-PER officials
SK2 (60.8): sequence of part-of-speech (POS) tags in the PET, e.g. T2-LOC NN, NNP T1-PER NNS
SK3 (61.6): sequence of grammatical relations in the PET, e.g. T2-LOC pobj, nn T1-PER nsubj
SK4 (59.7): sequence of words in the DW, e.g. Washington T2-LOC In working T1-PER officials GPE U.S.
SK5 (59.8): sequence of grammatical relations in the GR, e.g. pobj T2-LOC prep ROOT T1-PER nsubj GPE nn
SK6 (59.7): sequence of POS tags in the DW, e.g. NN T2-LOC IN VBP T1-PER NNS GPE NNP
SK1 + SK2 + SK3 + SK4 + SK5 + SK6 = 69.8
Word Sequence Kernels (Cancedda et al., 2003) • Extended Sequence Kernels • Map to high-dimensional spaces using every subsequence • Penalties for • common subsequences (using IDF) • longer subsequences • non-contiguous subsequences
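A naive brute-force sketch of a gap-penalized word-subsequence kernel (my own simplification; real implementations use the dynamic programming of Lodhi et al. 2002 / Cancedda et al. 2003, and the IDF weighting is omitted here). Each length-n subsequence is weighted by λ raised to the span it covers, so longer and non-contiguous matches are penalized.

```python
from collections import defaultdict
from itertools import combinations

def subsequence_weights(tokens, n, lam):
    weights = defaultdict(float)
    for idxs in combinations(range(len(tokens)), n):
        span = idxs[-1] - idxs[0] + 1          # gaps make the span larger
        weights[tuple(tokens[i] for i in idxs)] += lam ** span
    return weights

def word_sequence_kernel(s, t, n=2, lam=0.5):
    ws, wt = subsequence_weights(s, n, lam), subsequence_weights(t, n, lam)
    return sum(ws[k] * wt[k] for k in ws.keys() & wt.keys())
```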
Performance Comparison • (Zhang et al., 2006): F-measure 68.9 in our settings • (Zhou et al., 2007): “Such heuristics expand the tree and remove unnecessary information allowing a higher improvement on RE. They are tuned on the target RE task so although the result is impressive, we cannot use it to compare with pure automatic learning approaches, such us our models.”
Topic Kernel (Wang et al., 2011) • Use Wikipedia Infobox data to learn topics of relations (like topics of words) based on co-occurrences