Event Extraction Using Distant Supervision Kevin Reschke, Mihai Surdeanu, Martin Jankowiak, David McClosky, Christopher Manning Nov 15, 2012
Event Extraction
• Slot filling for event templates.
• Plane Crashes: <Crash Site>, <Operator/Airline>, <Aircraft Type>, <#Fatalities>, <#Injuries>, <#Survivors>, <#Crew>, <#Passengers>
• Terrorist Attacks: <Location>, <Perpetrator>, <Instrument>, <Target>, <Victim>
• Corporate Joint Ventures: <Entity>, <Ownership>, <Activity>, <Name>, <Aliases>, <Location>, <Nationality>, <Type>
Event Extraction: High Level • Candidate Generation • Mention Classification • Label Merging
Mention Classification • US Airways Flight 133 crashed in Toronto. • Example features: “crashed in,” NEType=LOCATION, syntactic dependency (crashed, Toronto). • Training a classifier: • Supervised (see ACE and MUC tasks) • Unsupervised (see Chambers and Jurafsky, ACL 2011) • Distantly supervised
Distant Supervision (High Level) • Begin with set of known events. • Use this set to automatically label mentions. • Train and classify (handle noise)
Distant Supervision for Relation Extraction • TAC KBP 2011: slot filling for named entity relations. • E.g., Company: <ceo of>, <founder of>, <founding date>, <city of headquarters>, etc. • Known relation: founder_of(Steve Jobs, Apple) • Noisy Labeling Rule: slot value and entity name must be in the same sentence. • Apple co-founder Steve Jobs passed away yesterday. • Steve Jobs delivered the Stanford commencement address. • Steve Jobs was fired from Apple in 1985.
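A minimal sketch of this same-sentence labeling rule, with an illustrative relation table and plain string matching standing in for real entity linking (not the TAC KBP system itself):

```python
# Minimal sketch of the same-sentence noisy labeling rule (illustrative data
# and helper names; not the actual slot-filling system).
known_relations = {("Steve Jobs", "Apple"): "founder_of"}

def noisy_label(sentence, entity, slot_value):
    """Label a sentence with a relation iff the entity and slot value co-occur in it."""
    if entity in sentence and slot_value in sentence:
        return known_relations.get((slot_value, entity), "NIL")
    return "NIL"

# Correct hit:
print(noisy_label("Apple co-founder Steve Jobs passed away yesterday.", "Apple", "Steve Jobs"))
# Noisy hit (the sentence is not actually about founding):
print(noisy_label("Steve Jobs was fired from Apple in 1985.", "Apple", "Steve Jobs"))
```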
Distant Supervision for Event Extraction • E.g., known event: <Crash Site = Bermuda> • Noisy Labeling Rule: slot value and event name must be in the same sentence. • Doesn’t work! • What is the name of an event? (“The crash of USAir Flight 11”) • Slot values often occur separately from any event name: • The plane went down in central Texas. • 10 died and 30 were injured in yesterday’s tragic incident. • Solution: a document-level noisy labeling rule.
Task: Plane Crash Event Extraction • 80 plane crashes from Wikipedia. • 32 train; 8 dev; 40 test
Infoboxes to Event Templates <Flight Number = Flight 3272> <Site = Monroe, Michigan, USA> <Passengers = 26> <Crew = 3> <Fatalities = 29> <Survivors = 0> <Aircraft Type = Embraer 120 RT Brasilia> <Operator = Comair, Delta Connection> Labels for mention classification: <NIL>, Site, Passenger, Crew, Fatalities, Survivors, Aircraft Type, Operator, Injuries
Newswire Corpus • Corpus: newswire docs from 1988 to present. • Preprocessing: POS, Dependencies and NER with Stanford NLP Pipeline.
Automatic Labeling • Use the Flight Number as a proxy for the event name. • Rule: if a doc contains the Flight Number, it is relevant to the event (see the sketch below). • Noise: hand-checked 50 labeled mentions per label; 39% are bad. • E.g., Good: At least 52 people survived the crash of the Boeing 767. Bad: First envisioned in 1964, the Boeing 737 entered service in 1968. • Bad labels are concentrated on common words: two, Brazil, etc. • Future work: reduce noise by avoiding common words in the training set.
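A rough sketch of the document-level labeling rule (illustrative data structures and exact string matching, not the authors' code):

```python
# Minimal sketch of document-level noisy labeling: a document is tied to an
# event if it mentions the event's flight number; candidate mentions that
# match a slot value in the event template get that slot's label, and
# everything else is NIL.
def label_document(doc_text, mentions, event):
    if event["Flight Number"] not in doc_text:
        return None                        # document not relevant to this event
    labels = []
    for mention in mentions:               # candidate mentions (NEs, numbers, ...)
        label = "NIL"
        for slot, value in event.items():
            if slot != "Flight Number" and mention == value:
                label = slot
        labels.append((mention, label))
    return labels

event = {"Flight Number": "Flight 3272", "Site": "Monroe, Michigan, USA",
         "Fatalities": "29", "Operator": "Comair"}
doc = "Comair Flight 3272 went down near Monroe, Michigan, USA; 29 died."
print(label_document(doc, ["Comair", "Monroe, Michigan, USA", "29"], event))
```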
Label Merging
• Exhaustive Aggregation (see the sketch below):
• Mention “Four”: <NIL>, <Crew>, <Crew>, <Fatalities>, <NIL>, <NIL> → merged labels <Crew>, <Fatalities>
• Mention “Shanghai”: <NIL>, <NIL>, <NIL>, <NIL>, <NIL> → merged label <NIL>
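A minimal sketch of exhaustive aggregation as described above (hypothetical helper, grouping mention-level predictions by mention string):

```python
# Minimal sketch of exhaustive aggregation (not the authors' code): a mention
# string keeps every distinct non-NIL label it received; it is NIL only if it
# was never labeled with anything else.
def exhaustive_aggregate(labeled_mentions):
    merged = {}
    for mention, label in labeled_mentions:
        merged.setdefault(mention, set())
        if label != "NIL":
            merged[mention].add(label)
    return {m: (labels or {"NIL"}) for m, labels in merged.items()}

labels = [("Four", "NIL"), ("Four", "Crew"), ("Four", "Crew"),
          ("Four", "Fatalities"), ("Four", "NIL"),
          ("Shanghai", "NIL"), ("Shanghai", "NIL")]
print(exhaustive_aggregate(labels))
# {'Four': {'Crew', 'Fatalities'}, 'Shanghai': {'NIL'}}
```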
Test Time • Task: Given a Flight Number, fill remaining slots. • Metric: Multiclass Precision and Recall • Precision: # correct (non-NIL) guesses / total (non-NIL) guesses • Recall: # slots correctly filled / # slots possibly filled
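The metric above can be written out as a small sketch (illustrative; the exact bookkeeping in the original evaluation may differ):

```python
# Sketch of the multiclass precision/recall metric described on the slide.
def precision_recall(guesses, gold_slots):
    """guesses: {slot: value} non-NIL predictions; gold_slots: {slot: value} findable slots."""
    correct = sum(1 for slot, value in guesses.items()
                  if gold_slots.get(slot) == value)
    precision = correct / len(guesses) if guesses else 0.0
    recall = correct / len(gold_slots) if gold_slots else 0.0
    return precision, recall

gold = {"Site": "Toronto", "Fatalities": "29", "Operator": "Comair"}
pred = {"Site": "Toronto", "Fatalities": "30"}
print(precision_recall(pred, gold))   # (0.5, 0.333...)
```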
Experiment 1: Simple Local Classifier • Multiclass logistic regression. • Features: unigrams, POS, NE types, part of doc, dependencies. • E.g., for the mention “Toronto” in “US Airways Flight 133 crashed in Toronto”: LexIncEdge-prep_in-crash-VBD, UnLexIncEdge-prep_in-VBD, PREV_WORD-in, 2ndPREV_WORD-crash, NEType-LOCATION, Sent-NEType-ORGANIZATION, etc.
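A toy illustration of a few of these feature templates (the template names imitate the slide's examples; the real system's extractor works over full Stanford pipeline output, including dependency parses):

```python
# Toy sketch of a handful of local feature templates for mention i.
def local_features(tokens, ne_tags, i):
    feats = []
    if i > 0:
        feats.append("PREV_WORD-" + tokens[i - 1].lower())
    if i > 1:
        feats.append("2ndPREV_WORD-" + tokens[i - 2].lower())
    feats.append("NEType-" + ne_tags[i])                       # NE type of the mention itself
    feats.extend("Sent-NEType-" + t for t in set(ne_tags) if t != "O")  # NE types in the sentence
    return feats

tokens  = ["US", "Airways", "Flight", "133", "crashed", "in", "Toronto"]
ne_tags = ["ORGANIZATION", "ORGANIZATION", "O", "NUMBER", "O", "O", "LOCATION"]
print(local_features(tokens, ne_tags, 6))
```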
Experiment 1: Simple Local Classifier
• Baseline: Majority Class Classifier (Maj)
• Distribution of test set labels (count, NE type):
• <NIL>: 19196 (with subsampling)
• Site: 10365 (LOCATION)
• Operator: 4869 (ORGANIZATION)
• Fatalities: 2241 (NUMBER)
• Aircraft_Type: 1028 (ORGANIZATION)
• Crew: 470 (NUMBER)
• Survivors: 143 (NUMBER)
• Passengers: 121 (NUMBER)
• Injuries: 0 (NUMBER)
Experiment 1: Simple Local Classifier • Results (tables on slide): <NIL> class priors are varied to maximize F1 score on the dev set. • Dev set prec/recall (8 events; 23 findable slots) • Test set prec/recall (40 events; 135 findable slots)
Experiment 1: Simple Local Classifier • Accuracy of Local, by slot type.
Experiment 2: Training Set Bias • Number of relevant documents per event (chart on slide); Pan Am Flight 103 (a.k.a. the Lockerbie bombing) accounts for far more documents than any other event.
Experiment 2: Training Set Bias • Does a more balanced training set improve performance? • Train a new model with only 200 documents for Flight 103. • Results on dev (prec/rec): LocalBalanced (0.30/0.13), Local (0.26/0.30). • Is the difference due to balance, or to less data? • Train a new model with all 3456 docs for Flight 103, but with 24 events instead of 32. • Results on dev (prec/rec): (0.33/0.13)
Experiment 3: Sentence Relevance • Problem: many errors come from sentences that are irrelevant to the event. • E.g.: Clay Foushee, vice president for flying operations for Northwest Airlines, which also once suffered from a coalition of different pilot cultures, said the process is “long and involved and acrimonious.” • Solution: only extract values from “relevant” sentences. • Train a binary relevance classifier (unigram/bigram features). • Noisily label the training data: a training sentence is relevant iff it contains at least one slot value from some event (see the sketch below).
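A minimal sketch of that noisy relevance-labeling rule (the real system then trains a classifier over unigram/bigram features rather than applying this rule directly at test time):

```python
# Sketch of the noisy relevance-labeling rule used to create training data
# for the binary sentence-relevance classifier (illustrative slot values).
def noisy_relevance_label(sentence, all_slot_values):
    """A training sentence is 'relevant' iff it contains a known slot value."""
    return any(value in sentence for value in all_slot_values)

slot_values = {"Toronto", "29", "Comair", "Embraer 120 RT Brasilia"}
print(noisy_relevance_label("The Comair flight went down near Toronto.", slot_values))       # True
print(noisy_relevance_label("Northwest Airlines faced pilot-culture issues.", slot_values))  # False
```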
Experiment 3: Sentence Relevance • New mention classifiers: • LocalWithHardSent: all mentions from non-relevant sentences are <NIL>. • LocalWithSoftSent: use sentence relevance as a feature. • Initial results, dev set, no tuning (prec/rec): Local (0.18/0.30), HardSent (0.21/0.17), SoftSent (0.22/0.17) • Final results, with tuned <NIL> class priors: Dev set: Local (0.26/0.30), HardSent (0.25/0.17), SoftSent (0.25/0.17); Test set: Local (0.16/0.40), HardSent (0.12/0.24), SoftSent (0.11/0.23)
Experiment 3: Sentence Relevance • Why did sentence relevance hurt performance? • Possibility: sentence relevance causes overfitting. • …But Local outperforms the others on the training set. • Other possibilities?
Experiment 4: Pipeline Model • Intuition: there are dependencies between labels. • E.g., Crew and Passengers go together: 4 crew and 200 passengers were on board. • Site often follows Site: The plane crash landed in Beijing, China. • Fatalities never follows Fatalities: *20 died and 30 were killed in last Wednesday’s crash. • Solution: a pipeline model where the previous label is a feature (see the sketch below).
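A minimal sketch of the left-to-right pipeline with the previous prediction fed back as a feature (the classifier below is a toy stand-in for the trained model; feature names are illustrative):

```python
# Sketch of the pipeline idea: classify mentions left to right and feed the
# previous prediction back in as a feature.
def pipeline_classify(mentions, base_features, classify):
    predictions, prev_label = [], "NONE"
    for mention, feats in zip(mentions, base_features):
        feats = feats + ["PREV_LABEL-" + prev_label]   # dependency on the previous label
        label = classify(mention, feats)
        predictions.append(label)
        prev_label = label
    return predictions

# Toy stand-in classifier: "200" right after a Crew prediction becomes Passengers.
def toy_classify(mention, feats):
    if mention == "4":
        return "Crew"
    if mention == "200" and "PREV_LABEL-Crew" in feats:
        return "Passengers"
    return "NIL"

print(pipeline_classify(["4", "200"], [["NEType-NUMBER"], ["NEType-NUMBER"]], toy_classify))
# ['Crew', 'Passengers']
```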
Experiment 4: Pipeline Model • Results (tables on slide): dev set (prec/rec), test set (prec/rec).
Experiment 5: Joint Model • Problem: pipeline models propagate error. • 20 dead, 15 injured in a US Airways Boeing 747 crash. • Gold: Fat. Inj. Oper. A.Type. / Pred: Fat. Surv. ?? ?? • Gold: Fat. Fat. Oper. A.Type. / Pred: Fat. Inj. ?? ??
Experiment 5: Joint Model • Solution: the Searn algorithm (Daumé III et al., 2009). • Searn: Search-based Structured Prediction. • Example mention sequence (figure on slide): 20, 15, USAir, Boe. 747.
Experiment 5: Joint Model
• A Searn iteration, for a mention m in sentence s:
• Step 1: Classify all mentions left of m using the current hypothesis.
• Step 2: Use these decisions as features for m.
• Step 3: Pick a label y for m.
• Step 4: Given m = y, classify all mentions right of m with the current hypothesis.
• Step 5: Compare the whole sentence to the gold labels.
• Step 6: Set the cost for m = y to the number of errors.
• Repeat Steps 2–5 for all possible labels; repeat Steps 1–6 for all mentions and sentences.
• You now have a cost vector for each mention. Train a cost-sensitive classifier, then interpolate it with the current hypothesis (see the sketch below).
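A simplified sketch of the cost-vector construction in one Searn iteration (the cost-sensitive training step and the policy interpolation are omitted, and `hypothesis(mentions, j, previous_labels)` is an assumed signature for the current policy):

```python
# Sketch of per-mention cost vectors in one Searn iteration (simplified).
def searn_costs(mentions, gold_labels, label_set, hypothesis):
    cost_vectors = []
    for i in range(len(mentions)):
        # Step 1: classify everything left of mention i with the current hypothesis.
        left = []
        for j in range(i):
            left.append(hypothesis(mentions, j, left))
        costs = {}
        for y in label_set:                       # Steps 2-6: try each label y for mention i
            seq = left + [y]
            for j in range(i + 1, len(mentions)): # roll out to the right with the hypothesis
                seq.append(hypothesis(mentions, j, seq))
            costs[y] = sum(p != g for p, g in zip(seq, gold_labels))  # errors vs. gold
        cost_vectors.append(costs)
    return cost_vectors
```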
Experiment 5: Joint Model • Cost-sensitive classifier: we use a passive-aggressive multiclass perceptron (Crammer et al., 2006). • Results: • Dev set (prec/rec): Local (0.26/0.30), Searn (0.35/0.35), SearnSent (0.30/0.30) • Test set (prec/rec): Local (0.159/0.407), Searn (0.213/0.422), SearnSent (0.182/0.407)
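For orientation, here is a minimal sketch of a plain (not cost-sensitive) multiclass passive-aggressive update in the spirit of Crammer et al. (2006); the system's actual cost-sensitive variant differs:

```python
import numpy as np

# Minimal sketch of a plain multiclass PA-I update; shown only to illustrate
# the basic update rule the cost-sensitive classifier builds on.
def pa_update(W, x, y_gold, C=1.0):
    """W: (num_labels, num_features) weights; x: feature vector; y_gold: gold label index."""
    scores = W @ x
    rivals = scores.copy()
    rivals[y_gold] = -np.inf
    y_hat = int(np.argmax(rivals))                            # highest-scoring wrong label
    loss = max(0.0, 1.0 - scores[y_gold] + scores[y_hat])     # hinge loss with margin 1
    if loss > 0.0:
        tau = min(C, loss / (2.0 * float(x @ x) + 1e-12))     # PA-I step size
        W[y_gold] += tau * x                                  # push the gold label up
        W[y_hat]  -= tau * x                                  # push the rival label down
    return W

W = np.zeros((3, 4))                                          # 3 labels, 4 features
x = np.array([1.0, 0.0, 1.0, 0.5])
W = pa_update(W, x, y_gold=2)
```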
Experiment 5: Joint Model • Problem: pipeline models propagate error. 20 dead, 15 injured in a US Airways Boeing 747 crash. Gold: Fat. Inj. Oper. A.Type. / Pred: Fat. Surv. ?? ?? • Let’s try a CRF model…
What kind of CRF do we want? • To begin with, we can start with a linear-chain CRF. • Factor types (diagram on slide): label factor, label-label factor, mention-label factor.
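In standard notation (mine, not the slides'), a linear-chain CRF with those three factor types can be written as:

```latex
p(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}
  \prod_{t=1}^{T} \underbrace{\psi(y_t)}_{\text{label}}\,
  \underbrace{\chi(y_t, x_t)}_{\text{mention-label}}
  \prod_{t=2}^{T} \underbrace{\phi(y_{t-1}, y_t)}_{\text{label-label}}
```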
What kind of CRF do we want? • We can add a sentence-level relevance variable (diagram on slide).
What kind of CRF do we want? • we can imagine joining neighboring sentences
What kind of CRF do we want? • we can add “skip-chains” connecting identical mentions • (possibly in different sentences)
Implementation • For implementation, we’re lucky to have FACTORIE: • a probabilistic modeling toolkit written in Scala • allows for general graphical structures, which can be defined imperatively • supports various inference methods • …and much more. See http://factorie.cs.umass.edu/ and Andrew McCallum, Karl Schultz, Sameer Singh, “FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs.”
Experiment 5: Joint Model • Preliminary results (just in this morning!) for the linear-chain CRF: • Large positive label-label weights: <Crew, Passengers>, <Passengers, Crew>, <Operator, Aircraft_Type> • Large negative label-label weights: <Fatalities, Fatalities>, <Survivors, Survivors>, <Passengers, Passengers> • Results on dev set (prec/recall): CRF (0.25/0.39), Searn (0.35/0.35), Local (0.26/0.30)
Experiment 6: Label Merging
• Exhaustive Aggregation: mention “Four” labeled <NIL>, <Crew>, <Crew>, <Fatalities>, <NIL>, <NIL> → merged labels <Crew>, <Fatalities>
• Max Aggregation: the same labels for “Four” → single merged label <Crew>
• Results: Max Aggregation gives a small (0.01) improvement in test set precision.
Experiment 6: Label Merging • Noisy-OR Aggregation • Key idea: the classifier gives us a distribution over labels for each mention: • Stockholm <NIL: 0.8, Site: 0.1, Crew: 0.01, …> • Stockholm <Site: 0.5, NIL: 0.3, Crew: 0.1, …> • Compute the noisy-OR score for each label; if it exceeds a threshold, use the label (threshold tuned on the dev set; see the sketch below). • Results for the Local model (prec/recall): Dev (threshold = 0.9): ExhaustAgg (0.26/0.30), NoisyORAgg (0.35/0.30); Test: ExhaustAgg (0.16/0.41), NoisyORAgg (0.19/0.39)
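A minimal sketch of noisy-OR aggregation over per-mention label distributions (threshold and example numbers taken from the slide; the helper name is mine):

```python
# Sketch of noisy-OR label aggregation for one mention string.
def noisy_or_aggregate(mention_distributions, threshold=0.9):
    """mention_distributions: list of {label: prob} dicts, one per mention occurrence."""
    labels = {l for dist in mention_distributions for l in dist if l != "NIL"}
    kept = set()
    for label in labels:
        prob_never = 1.0
        for dist in mention_distributions:
            prob_never *= 1.0 - dist.get(label, 0.0)
        if 1.0 - prob_never > threshold:          # noisy-OR score for this label
            kept.add(label)
    return kept or {"NIL"}

stockholm = [{"NIL": 0.8, "Site": 0.1, "Crew": 0.01},
             {"Site": 0.5, "NIL": 0.3, "Crew": 0.1}]
print(noisy_or_aggregate(stockholm))   # {'NIL'} at threshold 0.9; {'Site'} at lower thresholds
```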
Summary • Tried 3 things to cope with noisy training data for distantly supervised event extraction. • Sentence Relevance FAIL • Searn Joint Model SUCCESS • Noisy-OR label aggregation SUCCESS • Future work: • Better doc selection. • Make Sentence Relevance work. • Apply technique to ACE or MUC tasks.