Text Classification and Named Entities for New Event Detection. Giridhar Kumaran and James Allan, University of Massachusetts Amherst. SIGIR 2004.
Introduction • New Event Detection (NED) is one of the tasks in the Topic Detection and Tracking (TDT) program (http://www.nist.gov/speech/tests/tdt/index.htm). • The vector space model has achieved the best results to date. • Improvements are sought through better similarity metrics and document representations.
Previous Research • Increasing the number of features. • Weighting event-level features more heavily than more general topic-level features. • Lexical chains (using WordNet). • NED and tracking system. • Re-weighting named entities and creating a stop list for each topic. • Incremental TF-IDF.
NED Evaluation • The NED algorithm assigns each story a confidence score between 0 and 1, either immediately or after a look-ahead period. • 0 means new, 1 means old. • The threshold is chosen to give the least cost. • A Detection Error Tradeoff (DET) curve is used to plot misses against false alarms.
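The threshold selection described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation code: the cost is taken to be a weighted sum of miss and false alarm rates, and the default values of `c_miss`, `c_fa`, and `p_target` are the ones commonly quoted for TDT evaluations, treated here as assumptions. The function names are hypothetical.

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Weighted detection cost; the default weights are assumed, not taken from the slides."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


def best_threshold(scores, is_new, thresholds):
    """Return (threshold, cost) with the lowest cost.

    scores: one confidence per story, 0 = new and 1 = old (the slide's convention).
    is_new: ground-truth flags, True when the story starts a new event.
    A story is declared new when its score is below the threshold.
    """
    n_new = sum(is_new)
    n_old = len(is_new) - n_new
    best = None
    for t in thresholds:
        # A miss is a truly new story that was not declared new.
        misses = sum(1 for s, new in zip(scores, is_new) if new and s >= t)
        # A false alarm is an old story that was declared new.
        false_alarms = sum(1 for s, new in zip(scores, is_new) if not new and s < t)
        p_miss = misses / n_new if n_new else 0.0
        p_fa = false_alarms / n_old if n_old else 0.0
        cost = detection_cost(p_miss, p_fa)
        if best is None or cost < best[1]:
            best = (t, cost)
    return best
```

Sweeping the threshold and recording each (miss rate, false alarm rate) pair instead of only the minimum gives the points that a DET curve plots.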
Basic Model • Cosine similarity
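As a rough illustration of the basic model, the sketch below scores each incoming story by its highest cosine similarity to any earlier story. It assumes raw term counts as vector weights to stay short, whereas a real system would use TF-IDF weights; the function names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def novelty_scores(stories):
    """For each story, in arrival order, return its maximum similarity to any earlier story.

    stories: list of token lists. A low score suggests a new event,
    a high score suggests an old one.
    """
    seen = []
    result = []
    for tokens in stories:
        vec = Counter(tokens)  # assumption: raw counts stand in for TF-IDF weights
        result.append(max((cosine(vec, old) for old in seen), default=0.0))
        seen.append(vec)
    return result
```

The first story always scores 0.0 because there is nothing earlier to compare it with, which matches the convention that 0 means new.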
Modified Model • Cosine similarity works well but still makes mistakes. • The level in the hierarchy of events is of interest. • Look at other parameters such as the category, the overlap of named entities, and the overlap of non-named entities. • Develop simple rules that reflect the questions a human being would ask before deciding whether a story is new or old.
Modification to document model • Problem: within a category such as health care, terms like drugs, cost, coverage, plan, and prescription are shared by many stories, whereas locations and individuals set events apart. • Solution: first place stories into broad categories, then compute term weights. • Use the topic types specified by the LDC. • Classify stories according to the LDC topics. • Train on TDT2, test on TDT3.
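One way to picture "categorize first, then compute term weights" is to keep a separate document-frequency table per category. The sketch below is an assumption-laden illustration: it uses the plain textbook form log(N / df) rather than whatever weighting the authors used, and `category_idf` is a hypothetical name.

```python
import math
from collections import defaultdict


def category_idf(docs):
    """Compute IDF separately within each category.

    docs: list of (category, tokens) pairs.
    Returns {category: {term: idf}} with idf = log(N_category / df_category).
    """
    doc_counts = defaultdict(int)
    doc_freq = defaultdict(lambda: defaultdict(int))
    for category, tokens in docs:
        doc_counts[category] += 1
        for term in set(tokens):  # count each term once per document
            doc_freq[category][term] += 1
    return {
        category: {
            term: math.log(doc_counts[category] / df)
            for term, df in terms.items()
        }
        for category, terms in doc_freq.items()
    }
```

With per-category statistics, a term that appears in every health care story gets a weight of zero inside that category, while a location that appears in only one story keeps a positive weight.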
Modification to Similarity Metric • Isolate the named entities and treat them preferentially (nothing new). • Named entities are a double-edged sword; deciding when to use them can be tricky.
Multiple document representations • Alpha: all terms. • Beta: only named entities. • Gamma: non-named entity terms. • Named entity types: Event, GPE (Geographical and Political Entities), Language, Location, Nationality, Organization, Person, Cardinal, Ordinal, Date, and Time.
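The three representations can be sketched as three term-count vectors built from one tagged story. The input format is an assumption: each token is paired with an entity type, or `None` when an upstream named entity tagger (not shown) assigned no type. The function name is hypothetical.

```python
from collections import Counter


def split_representations(tagged_tokens):
    """Build the alpha, beta, and gamma term-count vectors for one story.

    tagged_tokens: list of (token, entity_type) pairs, entity_type is None
    for tokens that are not part of a named entity.
    """
    alpha = Counter(tok for tok, _ in tagged_tokens)  # all terms
    beta = Counter(tok for tok, tag in tagged_tokens if tag is not None)  # named entities only
    gamma = Counter(tok for tok, tag in tagged_tokens if tag is None)  # non-named entity terms
    return alpha, beta, gamma
```

Each vector can then be compared with the matching vector of an earlier story, giving one similarity score per representation.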
Election News • Gamma scores are below 0.2, while beta scores are spread out (2 graphs): use alpha + gamma.
Legal/Criminal Cases • Gamma scores are below 0.4 and beta scores are above 0.4: use beta + alpha.
Financial News • Neither beta nor gamma is decisive: use alpha only.
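The per-category choices on the last three slides amount to a lookup from category to the representations that are consulted. The sketch below shows only that dispatch. The slides do not say how the selected scores are combined, so the plain average used here is a placeholder assumption, and `RULES` and `combined_score` are hypothetical names.

```python
# Which similarity scores each category consults, as listed on the slides.
RULES = {
    "Election News": ("alpha", "gamma"),
    "Legal/Criminal Cases": ("alpha", "beta"),
    "Financial News": ("alpha",),
}


def combined_score(category, scores, default=("alpha",)):
    """Combine the similarity scores selected for a category.

    scores: dict with keys "alpha", "beta", and "gamma".
    Averaging is an assumed stand-in for the unspecified combination rule.
    Categories without a rule fall back to alpha only.
    """
    used = RULES.get(category, default)
    return sum(scores[name] for name in used) / len(used)
```

Keeping the rules in a table makes it easy to add or revise a category without touching the scoring code.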
Term scores and categories • (Table 4)
Experimental Results • Results appear to be worse on TDT4. • TDT4 may contain topics that are not conducive to the named entity-based modifications.
DET Curve of TDT3 • Focus on the high accuracy area.
Conclusion and Future Work • Presents a new multi-stage system for NED. • Shows a way to harness the named entities in documents and illustrates their utility in different situations. • Future work: improve the named entity rules. • Explore different ways to develop stop lists for different categories. • Use temporal information.