Information Extraction for New Event Detection Xiaoqiang Luo
Acknowledgments • Martin Franz • Abe Ittycheriah • Scott McCarley • Salim Roukos • Todd Ward
Outline • NED systems • tf-idf baseline • MaxEnt model: tf-idf + ACE annotation • Errors and Observations • Conclusions
tf-idf Baseline • Similarity score: • Confidence: • d0: current doc • d-: previous doc • W: docs in the past window • Decision: New Event if • DET Curve: Varying Threshold Value
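The slide's formulas did not survive, so the following is only a minimal sketch of a common tf-idf baseline for new event detection, not necessarily the one used in the talk. It assumes cosine similarity between tf-idf vectors, a confidence defined as one minus the maximum similarity to any document in the window W, and a "new event" decision when that confidence exceeds a threshold. The function names, the smoothed idf, and the direction of the threshold test are all assumptions.

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, n_docs):
    """Sparse tf-idf vector (term -> weight); the smoothed idf is an assumption."""
    counts = Counter(tokens)
    return {
        term: tf * math.log((n_docs + 1) / (doc_freq.get(term, 0) + 1))
        for term, tf in counts.items()
    }

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def novelty_confidence(d0, window):
    """Assumed confidence: 1 - max similarity of d0 to any past doc in W."""
    if not window:
        return 1.0  # nothing to compare against, so treat as new
    return 1.0 - max(cosine(d0, d_prev) for d_prev in window)

def is_new_event(d0, window, threshold):
    """Assumed decision rule: flag a new event when confidence > threshold."""
    return novelty_confidence(d0, window) > threshold
```

Sweeping `threshold` over a range of values and recording the miss and false-alarm rates at each value would trace out the DET curve mentioned on the slide.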
Baseline Performance Window size: *Newswire stories only
Why Just Words? • The tf-idf score: structure is ignored • Example “structures”: • PERSON, LOCATION, ORGANIZATION etc • Coreference information • Relation between entities
ACE: mention, entity, relation • Example relation: Management Role between AMA and Reardon • Example text: The American Medical Association voted yesterday to install Thomas R. Reardon as its president-elect, rejecting a strong, upstart challenge by a District doctor who argued that the nation’s largest physicians’ group needs stronger ethics and new leadership. In electing Thomas R. Reardon, an Oregon general practitioner who had been the chairman of its board, members signified they did not hold him responsible for a costly gaffe last year, when the group agreed to endorse a line of Sunbeam Corp. health care products. Reardon had become chairman of …
ACE Entity and Relation Types • Entity Type: PERSON, ORGANIZATION, FACILITY, LOCATION, GPE • Mention Level: NAME, NOMINAL, PRONOUN • Relation types (RelType: Subtypes) • AT: based-In, located, residence • NEAR: relative-location • PART: other, part-Of, subsidiary • ROLE: affiliate-partner, citizen-Of, client, founder, general-staff, management, member, other, owner • SOCIAL: associate, grandparent, other-personal, other-professional, other-relative, parent, sibling, spouse
ME Model for NED • Probability of “new”: • MaxEnt Model • Used to rescore a top-N set of candidate documents {d-}
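The probability formula itself is missing from the slide. As a rough sketch of how a binary MaxEnt model can score "new" versus "old" for a pair of current and candidate documents, the code below uses the standard exponential form with one weight per (feature, label) pair. The feature names, the weight table, and the rule for combining the top-N candidates (taking the minimum probability of "new") are hypothetical and are not taken from the talk.

```python
import math

LABELS = ("new", "old")

def maxent_prob_new(active_features, weights):
    """P(new) under a binary MaxEnt model.

    active_features: names of the binary features that fire for this pair.
    weights: dict mapping (feature_name, label) -> weight (hypothetical table).
    """
    scores = {
        label: sum(weights.get((f, label), 0.0) for f in active_features)
        for label in LABELS
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {label: math.exp(s - top) for label, s in scores.items()}
    return exp_scores["new"] / sum(exp_scores.values())

def rescore(candidate_feature_sets, weights):
    """Assumed combination rule: a story is only as new as its closest match,
    so take the minimum P(new) over the top-N candidate documents."""
    if not candidate_feature_sets:
        return 1.0
    return min(maxent_prob_new(fs, weights) for fs in candidate_feature_sets)
```

With no weights the model returns 0.5 for every pair; training would set the weights so that features such as a high tf-idf similarity push probability mass toward "old".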
Counting Common Entities • A past story and the current story are compared • Story 1 (N1 = 4): In electing Thomas R. Reardon, an Oregon general practitioner who had been the chairman of its board, members signified they did not hold him responsible for a costly gaffe last year, when the group agreed to endorse a line of Sunbeam Corp. health care products. Reardon had become chairman of … • Story 2 (N2 = 5): The American Medical Association voted yesterday to install Thomas R. Reardon as its president-elect, rejecting a strong, upstart challenge by a District doctor who argued that the nation’s largest physicians’ group needs stronger ethics and new leadership. • Common entities: N(ent) = 2 • Ratios: R1 = N/N1, R2 = N/N2, Rc = N/(N1+N2)
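The ratios on this slide can be computed directly from two sets of entity identifiers. The sketch below assumes that coreference has already collapsed each story's mentions into entity IDs. The concrete ID sets are hypothetical, chosen only to match the counts on the slide (N1 = 4, N2 = 5, N = 2).

```python
def entity_overlap_ratios(ents_1, ents_2):
    """Return N, R1 = N/N1, R2 = N/N2 and Rc = N/(N1+N2) for two entity sets."""
    ents_1, ents_2 = set(ents_1), set(ents_2)
    n = len(ents_1 & ents_2)          # entities common to both stories
    n1, n2 = len(ents_1), len(ents_2)
    return {
        "N": n,
        "R1": n / n1 if n1 else 0.0,
        "R2": n / n2 if n2 else 0.0,
        "Rc": n / (n1 + n2) if (n1 + n2) else 0.0,
    }

# Hypothetical entity IDs consistent with the slide's counts.
story_1 = {"AMA", "Reardon", "Oregon", "Sunbeam"}
story_2 = {"AMA", "Reardon", "District", "doctor", "nation"}
ratios = entity_overlap_ratios(story_1, story_2)
```

For these sets the result is N = 2, R1 = 0.5, R2 = 0.4 and Rc = 2/9.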
Features in MaxEnt Model • Example Features: • tfidf: if • R1: if • R1&Rc: if • Relation: similar
ME Learning Curve Training Data: TDT3 #events: 2963 (180+) #Features: 294
ME Results Summary * Training on TDT3 Test=TDT3*,TDT2,TDT4
ME Model: Easier to Pick A Good Operating Point • [Figure: tfidf system vs. ME system]
Similar on TDT4 • [Figure: tfidf system vs. ME system]
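Picking an operating point amounts to choosing a decision threshold and reading off the resulting miss and false-alarm rates. The sketch below shows one way to tabulate those rates over a range of thresholds, assuming each story has a novelty score and a gold "first story" label, and that a story is flagged as new when its score exceeds the threshold. The function name and the input format are assumptions.

```python
def det_points(scores, is_new, thresholds):
    """Return (threshold, miss_rate, false_alarm_rate) for each threshold.

    scores: novelty score per story (higher means more likely new).
    is_new: gold labels, True where the story is the first on its event.
    """
    n_new = sum(1 for y in is_new if y)
    n_old = len(is_new) - n_new
    points = []
    for t in thresholds:
        # Miss: a truly new story that was not flagged as new.
        misses = sum(1 for s, y in zip(scores, is_new) if y and s <= t)
        # False alarm: an old story that was flagged as new.
        false_alarms = sum(1 for s, y in zip(scores, is_new) if not y and s > t)
        p_miss = misses / n_new if n_new else 0.0
        p_fa = false_alarms / n_old if n_old else 0.0
        points.append((t, p_miss, p_fa))
    return points
```

A system whose miss and false-alarm rates change slowly around the chosen threshold is easier to tune, since small errors in the threshold cost little.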
ME Model on TDT5 (submission)
Analysis: Extra Information from ACE Entity? From nl312::ws2/h/hh
Some Other Findings • Entities not covered by ACE • “Hurricane George” • First Story or First Event? • Features and metrics are computed at the story level • Example follows
Analysis: Want Event, Not Story TDT3 30033: Introduction of euro 1st Story: 19981001_0635_0719_APW_ENG.tkn_RECID=4226.sent.htm (FA) Doc: 19981001_0931_1012_APW_ENG.tkn_RECID=5863.sent.htm
Conclusions • ACE used in NED • ME model useful for picking a good operating point • Benefit of ACE features should be enhanced by: • Training set with consistent annotation rules! • More entity/relation types and Events • Need for sub-document level analysis: • Document-level features not good for detecting events!