
Improved Relation Extraction with Feature-Rich Compositional Embedding Models


Presentation Transcript


  1. Improved Relation Extraction with Feature-Rich Compositional Embedding Models. Mo Yu*, Matt Gormley*, Mark Dredze. EMNLP, September 21, 2015. *Co-first authors

  2. FCM or: How I Learned to Stop Worrying (about Deep Learning) and Love Features. Mo Yu*, Matt Gormley*, Mark Dredze. EMNLP, September 21, 2015. *Co-first authors

  3. Handcrafted Features. p(y|x) ∝ exp(Θ_y · f(x)) [Figure: a parse tree (S → NP VP, with nested NP, VP, ADJP) over a sentence whose PER and LOC mentions stand in the born-in relation; the features f(x) are read off this structure.]

  4. Where do features come from? [Figure: a landscape spanning feature engineering and feature learning; hand-crafted features (Zhou et al., 2005; Sun et al., 2011) sit at the feature-engineering end.]

  5. Where do features come from? Word embeddings (Mikolov et al., 2013) come from unsupervised learning: similar words get similar embeddings (e.g., cat and dog). [Figure: the CBOW model of Mikolov et al. (2013): a look-up table maps input context words to embeddings, and a classifier predicts the missing word. The landscape gains word embeddings at the feature-learning end.]

  6. Where do features come from? String embeddings compose word embeddings over a word sequence such as "The [movie] showed [wars]": Convolutional Neural Networks with pooling (Collobert & Weston, 2008) and Recursive Auto-Encoders (Socher, 2011). [Figure: the landscape adds string embeddings (CNN, RAE) above word embeddings.]

  7. Where do features come from? Tree embeddings compose embeddings along a parse (S → NP VP), with composition matrices keyed by syntactic categories (W_{NP,VP}, W_{DT,NN}, W_{V,NN}): Socher et al., 2013; Hermann & Blunsom, 2013. [Figure: the landscape adds tree embeddings.]

  8. Where do features come from? Word embedding features refine embedding features with semantic/syntactic information (Koo et al., 2008; Turian et al., 2010; Hermann et al., 2014), moving back toward feature engineering. [Figure: the landscape adds word embedding features between the two ends.]

  9. Where do features come from? [Figure: our model (FCM) sits at the intersection of feature engineering and feature learning, alongside hand-crafted features, word embedding features, and word/string/tree embeddings.]

  10. Feature-rich Compositional Embedding Model (FCM). Goals for our model: • Incorporate semantic/syntactic structural information • Incorporate word meaning • Bridge the gap between feature engineering and feature learning – but remain as simple as possible

  11. Feature-rich Compositional Embedding Model (FCM). Per-word features: [Figure: one feature vector f_1, …, f_6 per word of "The [movie]_M1 I watched depicted [hope]_M2"; each word carries a class tag such as noun-other, noun-person, verb-percep., verb-comm., or nil.]

  12. Feature-rich Compositional Embedding Model (FCM). Per-word features: [Figure: the same sentence, zooming in on f_5, the feature vector for the word "depicted".]

  13. Feature-rich Compositional Embedding Model (FCM). Per-word features (with conjunction): [Figure: f_5 with each binary feature conjoined with the word "depicted" itself.]

  14. Feature-rich Compositional Embedding Model (FCM). Per-word features (with soft conjunction): [Figure: the hard conjunction is replaced by an outer product of f_5 with the embedding e_depicted.]

  15. Feature-rich Compositional Embedding Model (FCM). Per-word features (with soft conjunction): [Figure: the resulting outer-product matrix f_5 ⊗ e_depicted.]

  16. Feature-rich Compositional Embedding Model (FCM). Our full model sums over each word in the sentence, takes the dot-product with a parameter tensor, and finally exponentiates and renormalizes: p(y|x) ∝ exp( Σ_{i=1}^{n} f_i^⊤ T_y e_{w_i} )
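
A minimal sketch of this scoring function in numpy (our own shapes and names, not the authors' released code), assuming dense arrays for the features, embeddings, and parameter tensor:

```python
import numpy as np

def fcm_probs(F, E, T):
    """p(y|x) for one sentence under the FCM.

    F: (n, d_f) binary feature vector f_i for each of the n words
    E: (n, d_e) word embedding e_{w_i} for each word
    T: (L, d_f, d_e) parameter tensor, one slice T_y per label y
    """
    # Sum over words of f_i^T T_y e_{w_i}: the outer product
    # f_i (x) e_{w_i} contracted against each label's tensor slice.
    scores = np.einsum('nf,lfe,ne->l', F, T, E)
    # Exponentiate and renormalize (a softmax over the L labels).
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()
```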

  17. Features for FCM • Let M1 and M2 denote the left and right entity mentions • Our per-word binary features (a minimal sketch follows this list): • head of M1 • head of M2 • in-between M1 and M2 • -2, -1, +1, or +2 of M1 • -2, -1, +1, or +2 of M2 • on dependency path between M1 and M2 • Optionally: add the entity type of M1, M2, or both
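
A sketch of these per-word features; the string feature names and the dependency-path representation are our assumptions:

```python
def per_word_features(i, m1, m2, dep_path):
    """Binary feature names that fire for word i.

    i, m1, m2: word positions; m1/m2 are the head words of M1/M2.
    dep_path: set of positions on the dependency path between M1 and M2.
    """
    feats = []
    if i == m1:
        feats.append('head-of-M1')
    if i == m2:
        feats.append('head-of-M2')
    if min(m1, m2) < i < max(m1, m2):
        feats.append('between-M1-M2')
    if i - m1 in (-2, -1, 1, 2):
        feats.append('M1%+d' % (i - m1))  # e.g. 'M1+2'
    if i - m2 in (-2, -1, 1, 2):
        feats.append('M2%+d' % (i - m2))
    if i in dep_path:
        feats.append('on-dep-path')
    return feats or ['nil']
```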

  18. FCM as a Neural Network • Embeddings are (optionally) treated as model parameters • A log-bilinear model • We initialize with pre-trained embeddings, then fine-tune them (see the gradient sketch below) [Figure: the FCM equation annotated with its parts: parameter tensor T_y, binary features f_i, embeddings e_{w_i}.]
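
Why fine-tuning is cheap here: the score s_y = Σ_i f_i^⊤ T_y e_{w_i} is bilinear, so the gradient of each label's score with respect to an embedding has a closed form. A sketch in our notation (the full log-likelihood gradient also subtracts the model's expectation of this quantity over labels):

```python
import numpy as np

def score_grad_wrt_embedding(f_i, T_y):
    """d s_y / d e_{w_i} = T_y^T f_i.

    f_i: (d_f,) binary features for word i; T_y: (d_f, d_e) slice of T.
    """
    return T_y.T @ f_i
```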

  19. Baseline Model • Multinomial logistic regression (standard approach; sketched below) • Bring in all the usual binary NLP features (Sun et al., 2011): • type of the left entity mention • dependency path between mentions • bag of words in right mention • … [Figure: a factor graph with one relation variable Y_{i,j} per mention pair (i, j).]
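
For contrast with the FCM, a minimal sketch of the baseline: plain multinomial logistic regression over one sentence-level binary feature vector (feature extraction not shown):

```python
import numpy as np

def baseline_probs(f, Theta):
    """f: (d,) binary feature vector; Theta: (L, d) per-label weights."""
    scores = Theta @ f       # Theta_y . f(x) for every label y
    scores -= scores.max()   # stabilize the softmax
    p = np.exp(scores)
    return p / p.sum()
```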

  20. Hybrid Model: Baseline + FCM. Product of experts (sketched below): p(y|x) = (1/Z(x)) · p_Baseline(y|x) · p_FCM(y|x) [Figure: the relation variable Y_{i,j} is shared by both models.]
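
A sketch of the combination, reusing the two scoring functions above: each expert yields a distribution over the L relation labels, and their elementwise product is renormalized by Z(x):

```python
def hybrid_probs(p_baseline, p_fcm):
    """Product of experts over two (L,) probability vectors."""
    joint = p_baseline * p_fcm   # unnormalized product
    return joint / joint.sum()   # divide by Z(x)
```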

  21. Experimental Setup
  • ACE 2005
  • Data: 6 domains: Newswire (nw), Broadcast Conversation (bc), Broadcast News (bn), Telephone Speech (cts), Usenet Newsgroups (un), Weblogs (wl)
  • Train: bn+nw (~3600 relations); Dev: ½ of bc; Test: ½ of bc, cts, wl
  • Metric: Micro F1 (given entity mentions)
  • SemEval-2010 Task 8
  • Data: Web text
  • Train/Dev/Test: standard split from the shared task
  • Metric: Macro F1 (given entity boundaries)

  22. ACE 2005 Results

  23. SemEval-2010 Results Best in SemEval-2010 Shared Task

  24. SemEval-2010 Results Best in SemEval-2010 Shared Task

  25. SemEval-2010 Results

  26. SemEval-2010 Results

  27. SemEval-2010 Results

  28. SemEval-2010 Results

  29. Takeaways. FCM bridges the gap between feature engineering and feature learning. If you are allergic to deep learning: • Try the FCM for your task: it is simple, easy to implement, and effective on two relation extraction benchmarks. If you are a deep learning expert: • Inject the FCM (i.e., the outer product of features and embeddings) into your fancy deep network.

  30. Questions? Two open source implementations: • Java (within the Pacaya framework): https://github.com/mgormley/pacaya • C++ (from our NAACL 2015 paper on LRFCM): https://github.com/Gorov/ERE_RE
