Learn about ACE relations, benchmarks, feature-based models, kernel methods, and neural networks for relation extraction tasks. Explore various kernels, lexical generalization techniques, and comparisons of different models. Understand the benefits of distant supervision in training models for relation extraction.
NYU Relation Extraction CSCI-GA.2591 Ralph Grishman
ACE Relations An ACE relation mention connects two entity mentions in the same sentence: • the CEO of Microsoft → OrgAff:employment(the CEO of Microsoft, Microsoft) • in the West Bank, a passenger was wounded → Phys:Located(a passenger, the West Bank) ACE 2005 had 6 types of relations and 18 subtypes • most papers report on types only Most relations are local … • in roughly 70% of relations, arguments are adjacent or separated by one word • so chunking is important but full parsing is not critical
Benchmarks • ACE 2003 / 2004 / 2005 corpora • generally assuming perfect entity mentions on input • some work assumes only position (and not semantic type) is given • SemEval-2010 Task 8 • carefully selected examples of 10 relations • a classification task
Using MaxEnt • First description of an ACE relation extractor • IBM system [Kambhatla ACL 2004] • Used features: • words • entity type • mention level • overlap • dependency tree • parse tree • used 2003 ACE data • F = 55 (perfect mentions), F = 23 (system mentions) • good system mentions are important
Lots of features • Singapore system [Zhou et al. ACL 2005] used a very rich feature set, including • 11 chunk-based features • family-relative feature • 2 country-name features • 7 dependency-based features • . . . • highly tuned to ACE task • F = 68 (relation type), F = 55 (subtype) • reports a gain of several percent over the IBM system • used perfect mentions • further extended at NYU, on ACE 2004: F = 70.1
Kernel methods and SVMs • As an alternative to a feature-based model, one can provide a kernel function: a similarity function between pairs of the objects being classified • the kernel can be used directly by a k-nearest-neighbor (kNN) classifier • or can be used in training an SVM (Support Vector Machine)
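As a rough illustration of using a kernel directly in a nearest-neighbor classifier, the sketch below labels a new item by majority vote among its k most similar training items. The word-overlap kernel, the function names, and the toy training data are all assumptions made for this example; they are not taken from any of the systems discussed in these slides.

```python
from collections import Counter


def word_overlap_kernel(a, b):
    """Toy kernel (an assumption): number of distinct tokens two token lists share."""
    return len(set(a) & set(b))


def knn_classify(kernel, train, query, k=3):
    """Label `query` by majority vote among the k most similar training items."""
    ranked = sorted(train, key=lambda ex: kernel(ex[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training items: (tokens around the two arguments, relation type).
train = [
    (["the", "CEO", "of", "Microsoft"], "OrgAff"),
    (["the", "president", "of", "IBM"], "OrgAff"),
    (["a", "passenger", "in", "the", "West", "Bank"], "Phys"),
    (["troops", "in", "the", "city"], "Phys"),
]

prediction = knn_classify(word_overlap_kernel, train,
                          ["the", "chairman", "of", "Intel"], k=3)
print(prediction)  # OrgAff
```

The classifier never builds a feature vector; it only calls the kernel on pairs of objects, so the same code would work with a kernel defined over sequences or trees.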
SVM • The SVM, when trained, creates a separating hyperplane • if data is fully separable, all data on one side of the hyperplane are classified +, on the other side – • inherently binary classifier
Benefit of kernel methods • provides a natural way of handling structured input of variable size: sequences and trees • feature-based system may require a large number of features for the same effect
Shortest-path kernel • [Bunescu & Mooney EMNLP 2005] • Sept 2002 corpus • Based on dependency path between arguments • Kernel function between two paths x and y of lengths m and n • c = degree of match (lexical / POS) • Train SVM • F = 52.5
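The kernel formula itself did not survive on the slide. A minimal sketch of the shortest-path kernel as it is commonly described: the kernel is 0 when the two paths differ in length (m ≠ n), and otherwise the product over positions of c, the number of features the two paths share at that position. Representing each position as a Python set of word, POS, and entity-type features (with dependency arrows as their own positions) is an assumption of this sketch, as are the example paths.

```python
def shortest_path_kernel(x, y):
    """Product of per-position feature overlaps; 0 if the path lengths differ."""
    if len(x) != len(y):
        return 0
    score = 1
    for fx, fy in zip(x, y):
        score *= len(fx & fy)  # c(x_i, y_i): number of shared features
    return score


# Two toy dependency paths; arrows are treated as single-feature positions.
x = [{"his", "PRP", "PERSON"}, {"->"}, {"actions", "NNS", "Noun"}, {"<-"},
     {"in", "IN"}, {"<-"}, {"Brcko", "NNP", "Noun", "LOCATION"}]
y = [{"his", "PRP", "PERSON"}, {"->"}, {"arrival", "NN", "Noun"}, {"<-"},
     {"in", "IN"}, {"<-"}, {"Beijing", "NNP", "Noun", "LOCATION"}]

print(shortest_path_kernel(x, y))  # 3 * 1 * 1 * 1 * 2 * 1 * 3 = 18
```

Because the score is a product, a single position with no shared feature drives the whole kernel to 0, which makes the measure strict about path shape.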
Tree kernel • To take account of more of the tree than the dependency path, use PET (path-enclosed tree) • PET = Portion of tree enclosed by shortest path • Using entire sentence tree introduces too much irrelevant data • Use a tree kernel which recursively compares the two trees • For example, counts number of shared subtrees • Best kernel is a composite kernel: • tree kernel + entity kernel
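One way to count shared subtrees recursively is a convolution-style tree kernel in the manner of Collins and Duffy. The sketch below is an illustration under assumptions: trees are nested tuples `(label, child, ...)` with words as plain strings, `decay` is a hypothetical down-weighting parameter, and nothing here is claimed to be the exact kernel used in the systems the slide refers to.

```python
def nodes(tree):
    """Yield every internal node of a tree given as (label, child, ...) tuples."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from nodes(child)


def production(node):
    """The grammar rule at a node: its label plus the labels of its children."""
    return (node[0],) + tuple(c[0] if isinstance(c, tuple) else c for c in node[1:])


def common_rooted(n1, n2, decay=1.0):
    """Weighted count of tree fragments rooted at both n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    result = decay
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, tuple) and isinstance(c2, tuple):
            result *= 1.0 + common_rooted(c1, c2, decay)
    return result


def tree_kernel(t1, t2, decay=1.0):
    """Sum the shared-fragment counts over all pairs of nodes."""
    return sum(common_rooted(a, b, decay) for a in nodes(t1) for b in nodes(t2))


t1 = ("NP", ("DT", "the"), ("NN", "CEO"))
t2 = ("NP", ("DT", "the"), ("NN", "chairman"))
print(tree_kernel(t1, t2))  # 3.0
```

The three shared fragments in the example are `[NP DT NN]`, `[NP [DT the] NN]`, and `[DT the]`. A composite kernel of the kind mentioned on the slide could add an entity-level similarity to this value.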
Lexical Generalization • Test data will include words not seen in training • Remedies • Use lemmas • Use Brown clusters • Use word embeddings • Can be used with feature-based or kernel-based methods
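As a sketch of how lemmas and Brown clusters might enter a feature-based system, the function below emits the word, its lemma, and prefixes of its Brown cluster bit string as features. The lemma table, the cluster bit strings, and the prefix lengths are made-up values for illustration; real tables would come from a lemmatizer and from clustering a large corpus.

```python
def lexical_features(word, lemmas, brown_paths, prefix_lengths=(4, 6)):
    """Return word, lemma, and Brown-cluster-prefix features for one token."""
    key = word.lower()
    feats = {"word=" + key, "lemma=" + lemmas.get(key, key)}
    path = brown_paths.get(key)
    if path is not None:
        for n in prefix_lengths:
            # Shorter prefixes correspond to coarser clusters.
            feats.add(f"brown{n}=" + path[:n])
    return feats


# Hypothetical resources (assumptions, not real cluster assignments).
lemmas = {"wounded": "wound"}
brown_paths = {"ceo": "1101001", "chairman": "1101010"}

seen = lexical_features("CEO", lemmas, brown_paths)
unseen = lexical_features("chairman", lemmas, brown_paths)
print(seen & unseen)  # {'brown4=1101'}
```

Even though "chairman" never appeared in training in this toy setup, it shares a coarse cluster feature with "CEO", so weights learned for that feature carry over.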
FCM: Feature-Rich Compositional Embedding Models • Combines word embedding and hand-made discrete features: • where • e is the word embedding vector • f is a vector of hand-coded features • T is a matrix of weights • If e is fixed during training, this is a feature-rich log-linear model
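The model equation is missing from the slide. A minimal sketch of one common way to write such a model, stated here as an assumption: each label y has a weight matrix T_y, the score of y sums, over the words in the sentence, the weights applied to the outer product of that word's feature vector f and embedding e, and a softmax turns scores into probabilities. The dimensions, weights, and label names below are invented for illustration.

```python
import math


def fcm_scores(words, T):
    """words: list of (f, e) pairs; T[y][j][k] weights feature j times embedding dim k."""
    scores = {}
    for label, weights in T.items():
        total = 0.0
        for f, e in words:
            for j, fj in enumerate(f):
                if fj:  # only active hand-coded features contribute
                    total += fj * sum(w * ek for w, ek in zip(weights[j], e))
        scores[label] = total
    return scores


def softmax(scores):
    """Normalize label scores into a probability distribution."""
    m = max(scores.values())
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: v / z for y, v in exps.items()}


# Two words, two binary features each, two-dimensional embeddings (all hypothetical).
words = [([1, 0], [0.5, -1.0]), ([1, 1], [1.0, 2.0])]
T = {"OrgAff": [[1.0, 0.0], [0.0, 1.0]], "Other": [[0.0, 0.0], [0.0, 0.0]]}

scores = fcm_scores(words, T)
probs = softmax(scores)
print(scores["OrgAff"])  # 3.5
```

If the embeddings e are held fixed, every product f_j * e_k is just a real-valued feature and the score is linear in T, which is why the model reduces to a log-linear one in that case.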
Neural Network • neural networks • provide a richer model than log-linear • reduce the need for feature engineering • although it may help to add features to embeddings • but are slow to train and hard to inspect • several types of networks have been used • convolutional NNs • recurrent NNs • an ensemble of different NN types appears most effective • may even include a log-linear model in the ensemble
Some comparisons • ACE 2005, train nw+bn, test bc • perfect mentions, including entity types • Log-linear system 57.8 • FCM 61.9 • hybrid FCM 63.5 • CNN 63.0 • NN ensemble 67.0 • The richer model of even a simple NN beats a log-linear (MaxEnt) system • [Nguyen and Grishman, IJCAI Workshop 2016]
Comparing scores • Using subset of ACE 2005 (news) • Feature-based system • Perfect mention position but no type info • Baseline 51.4 • Single Brown cluster 52.3 • Multiple clusters 53.7 • Word embedding (WE) 54.1 • Multiple clusters + WE 55.5 • Multiple clusters + WE + regularization 59.4 • Moral: lexical generalization & regularization are worthwhile (probably for all ACE tasks) • [Nguyen & Grishman ACL 2014]
Distant Supervision • We have focused on supervised methods, which produce the best performance • If we have a large database with instances of the relations of interest, we can use distant supervision • Use the database to tag the corpus • If the DB has relation R(x, y), tag all sentences in the corpus containing x and y as examples of R • Train model from tagged corpus
Distant Supervision • By itself, distant supervision is too noisy • If the same pair <x, y> is connected by several relations, which one do we label? • But it can be combined with selective manual annotation to produce a satisfactory result
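The tagging step of distant supervision can be sketched in a few lines. The database triples, entity names, and sentences below are invented for illustration, and plain substring matching stands in for real entity mention detection, which is an assumption of this sketch.

```python
def distant_label(db, sentences):
    """Tag every sentence containing both x and y as an example of each R(x, y) in db."""
    examples = []
    for sent in sentences:
        for relation, x, y in db:
            if x in sent and y in sent:
                examples.append((sent, x, y, relation))
    return examples


# Hypothetical database of relation instances.
db = [
    ("OrgAff:employment", "Alice Smith", "Acme"),
    ("Phys:Located", "Acme", "Springfield"),
]
sentences = [
    "Alice Smith was named CEO of Acme.",
    "Alice Smith criticized Acme in an interview.",
    "Acme is based in Springfield.",
]

tagged = distant_label(db, sentences)
for sent, x, y, relation in tagged:
    print(relation, "|", sent)
```

The second sentence is tagged OrgAff:employment even though it does not express that relation, which is the kind of noise that makes distant supervision insufficient on its own. If the database held two relations for the same pair, this procedure would emit both labels for every matching sentence.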