Exploiting Topic Pragmatics for New Event Detection in TDT-2004 TDT-2004 Evaluation Workshop December 2-3, 2004 Ronald K. Braun 1107 NE 45th St., Suite 310, Seattle, WA 98105 206-545-2941 FAX: 206-545-7227 rbraun@stottlerhenke.com http://www.stottlerhenke.com
Who We Are • Stottler Henke is a small business specializing in AI consulting and R&D. • The Seattle office focuses on information retrieval and text mining. • This work constitutes part of a DARPA-sponsored Small Business Innovation Research (SBIR) contract (#DAAH01-03-C-R108).
Project Overview Leverage topic pragmatics and committee methods to increase accuracy in the new event detection task. • Pragmatics: non-semantic structure arising from how a topic is reported through time. • Committee methods: combining evidence from multiple perspectives (e.g., ensemble learning).
An Informal Experiment • Reviewed, case by case, the errors made on the TDT3 corpus (topic-weighted CFSD = 0.4912, pMiss = 0.4000, pFA = 0.0186). • Examined 30 misses and 20 false alarms and asked what percentage of these are computationally tractable. • 28% of the misses and 35% of the false alarms in our sample had computationally visible features. • With copious caveats, we estimate that a CFSD limit of 0.35 exists for the TDT3 corpus under current NED evaluation conditions. • The limit might be greater due to one-topic bias.
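For reference, the first-story detection cost used in TDT evaluations has the general form below. This is a sketch using the nominal TDT parameter values we recall (C_Miss = 1.0, C_FA = 0.1, P_target = 0.02); the exact constants should be checked against the TDT-2004 evaluation plan.

    C_{FSD} = C_{Miss} \cdot P_{Miss} \cdot P_{target} + C_{FA} \cdot P_{FA} \cdot (1 - P_{target})

    (C_{FSD})_{norm} = \frac{C_{FSD}}{\min\left(C_{Miss} \cdot P_{target},\ C_{FA} \cdot (1 - P_{target})\right)}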
Error Classes • Annotation effects – limited annotation time, possible keyword biases. • Lack of a priori topic definitions – topic structure not computationally accessible. • Lack of semantic knowledge – causality, abstraction relationships not modeled. • Multiple topics within a story – at event level, single topic per story may be exceptional.
Error Classes (continued) • High overlap of entities due to subject marginality or class membership – “Podunk country syndrome”, topics in same topic category. • Topics joined in later stages of activity – earliest event activities are ossified into shorthand tags. • Sparseness of topical allusions – “a season of crashing banks, plunging rubles, bouncing paychecks, failing crops and rotating governments” == Russian economic crisis. • Outlier / peripheral events – human interest stuff.
TDT5 Corpus Effects • At 280,000 stories, the corpus is an order of magnitude larger than TDT3 or TDT4. • This reduced the evaluation in part to an exercise in scalability (only 9 seconds per story for all processing). • Required a lot of optimization. • Threw out several techniques that relied on POS tagging, as the tagger was not sufficiently efficient.
TDT5 Corpus Effects (continued) • Performed worse on TDT5 than on TDT4 for the topic-weighted CFSD metric, suggesting the TDT5 topic set has some differing attribute.
TDT5 Corpus Effects (continued) • An increase in p(miss) rate was expected. • Less annotation time per topic implies an increased likelihood of missed annotations. • Possible conflation of stories due to ubiquitous Iraq verbiage.
NED Classifiers • Made use of three classifiers in our official submission: • Vector Cosine (Baseline) • Sentence Linkage • Location Association
Vector Cosine (Baseline) • Traditional full-text similarity. • Stemmed, stopped bag-of-words feature vector. • TF/IDF weighting, vector cosine distance. • Non-incremental raw DF statistics, generated from all manual stories of TDT3.
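A minimal sketch of how such a baseline can be computed, assuming a precomputed document-frequency table; the helper names, the log TF/IDF weighting variant, and the novelty threshold value are illustrative rather than taken from the actual system.

    import math
    from collections import Counter

    def tfidf_vector(tokens, df, num_docs):
        """Stemmed, stopped bag-of-words vector with TF/IDF weights
        (tokens are assumed to be already stemmed and stop-filtered)."""
        tf = Counter(tokens)
        return {t: (1 + math.log(c)) * math.log(num_docs / (1 + df.get(t, 0)))
                for t, c in tf.items()}

    def cosine(v1, v2):
        """Vector cosine similarity between two sparse term-weight vectors."""
        dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
        n1 = math.sqrt(sum(w * w for w in v1.values()))
        n2 = math.sqrt(sum(w * w for w in v2.values()))
        return dot / (n1 * n2) if n1 and n2 else 0.0

    def is_first_story(story_vec, prior_vecs, threshold=0.2):
        """Flag a story as novel when its best match among previously seen
        stories falls below the decision threshold (value illustrative)."""
        return all(cosine(story_vec, p) < threshold for p in prior_vecs)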
Sentence Linkage • Detect linking sentences in text, i.e., sentences that refer to events also described or referenced by previous or future stories. • For TDT-2003, we used a temporal reference heuristic to identify event candidates. • Sentence Linkage generalizes this technique by treating every sentence (>= 15 unique features, >= one capitalized feature) as a potential event reference candidate. • Candidates of a new story are compared to all previous stories and all future stories.
Sentence Linkage (continued) • If all capitalized features in the candidate occur in the story and at least a threshold fraction of all unique features also overlap, the stories are linked. • Targets error classes: multiple topics within a story, shared event enforcement in high entity overlapping stories, linking across topic activities, and outlier / peripheral stories. • Problems: contextual events, ambient events.
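A rough sketch of the linkage test described above; the 15-unique-feature minimum comes from the slide, while the overlap threshold value and the capitalization check on the raw token are stand-ins.

    def is_candidate(sentence_features):
        """A sentence is an event-reference candidate if it has at least 15
        unique features and at least one capitalized feature."""
        feats = set(sentence_features)
        return len(feats) >= 15 and any(f[:1].isupper() for f in feats)

    def stories_link(candidate_features, story_features, overlap_threshold=0.5):
        """Link two stories when every capitalized feature of the candidate
        sentence occurs in the other story and the fraction of the candidate's
        unique features that also occur there meets the (illustrative) threshold."""
        cand, story = set(candidate_features), set(story_features)
        caps = {f for f in cand if f[:1].isupper()}
        if not caps or not caps <= story:
            return False
        return len(cand & story) / len(cand) >= overlap_threshold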
Location Association • Looks for pairs of strongly associated location entities and non-location words in a story. • Co-occurrence frequencies are maintained for all BBN locations and non-location words in a moving window (the deferment window plus twice that amount of past data). • Interesting pairs must satisfy: A + B > 5, A + C > 5, assoc > 0.7.
Location Association (continued) • For each interesting pair in a story, the pair is added to the feature vector, and all location words and the non-location word are removed. • The feature weight is the non-location word's TF/IDF weight plus the maximum TF/IDF weight of the words in the location. • Uses the Baseline TF/IDF methodology otherwise. • Addresses the high-entity-overlap error class.
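A simplified sketch of the co-occurrence bookkeeping, assuming one update per story inside the moving window; the slides do not name the association measure, so a Jaccard-style score is used here as a placeholder, and the thresholds echo the values above.

    from collections import Counter
    from itertools import product

    # Thresholds echo the slide; the association measure is a placeholder.
    MIN_COUNT, MIN_ASSOC = 5, 0.7

    loc_counts, word_counts, pair_counts = Counter(), Counter(), Counter()

    def observe_story(locations, words):
        """Update co-occurrence statistics for one story in the moving window."""
        locs, wrds = set(locations), set(words)
        loc_counts.update(locs)
        word_counts.update(wrds)
        pair_counts.update(product(locs, wrds))

    def interesting_pairs(locations, words):
        """Return (location, word) pairs that are strongly associated."""
        pairs = []
        for loc, word in product(set(locations), set(words)):
            joint = pair_counts[(loc, word)]
            denom = loc_counts[loc] + word_counts[word] - joint
            assoc = joint / denom if denom else 0.0
            if loc_counts[loc] > MIN_COUNT and word_counts[word] > MIN_COUNT and assoc > MIN_ASSOC:
                pairs.append((loc, word))
        return pairs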
Evidence Combination • Authority voting – a single classifier is primary; other classifiers may override it with a non-novel judgment based on their expertise. • Non-primary members of the committee are trained to low miss error rates. • Confidence is the claimant's normalized confidence for a non-FS decision and the least normalized confidence among all classifiers for an FS decision. • Evaluation run of Baseline + Sentence Linkage.
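A compact sketch of the authority scheme as we read it, assuming each classifier returns an (is_novel, normalized_confidence) pair; the function and tuple layout are illustrative.

    def authority_vote(primary, overrides):
        """Authority voting: the primary classifier decides, but any override
        classifier may veto with a non-novel (non-FS) judgment.
        Each input is an (is_novel, normalized_confidence) pair."""
        p_novel, p_conf = primary
        if not p_novel:
            return False, p_conf                       # primary claims non-FS
        vetoes = [c for novel, c in overrides if not novel]
        if vetoes:
            return False, max(vetoes)                  # an override claims non-FS
        all_confs = [p_conf] + [c for _, c in overrides]
        return True, min(all_confs)                    # FS: least confidence of all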
Evidence Combination (continued) • Majority voting – members of the committee are each polled for a NED judgment, and the majority decision becomes the system decision. • All classifiers were trained to minimize topic-weighted CFSD over TDT3 and TDT4. • Confidence is the average normalized distance between each majority classifier's confidence value and its decision threshold. • Ties: the side with the maximal average normalized difference between the novel and non-novel voters decides the system output. • Used for our official submission SHAI1.
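A corresponding sketch of the majority scheme, assuming each committee member supplies its judgment, normalized confidence, and trained decision threshold; the confidence reported in the tie case is a guess, since the slide only specifies how ties are decided.

    def majority_vote(votes):
        """Majority voting over (is_novel, confidence, threshold) triples; the
        system confidence is the average normalized distance between each
        majority voter's confidence and its decision threshold."""
        novel = [(c, t) for is_novel, c, t in votes if is_novel]
        non_novel = [(c, t) for is_novel, c, t in votes if not is_novel]

        def avg_margin(group):
            return sum(abs(c - t) for c, t in group) / len(group) if group else 0.0

        if len(novel) > len(non_novel):
            return True, avg_margin(novel)
        if len(non_novel) > len(novel):
            return False, avg_margin(non_novel)
        # Tie: the side with the larger average normalized margin decides.
        decision = avg_margin(novel) > avg_margin(non_novel)
        return decision, avg_margin(novel if decision else non_novel)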
Evaluation Results • Five runs: three singletons to gauge individual classifier performance and two committees.
Evaluation Results (continued) • The authority committee was not useful. • Explained by a poor threshold on the Baseline, which made the Baseline promiscuous in issuing non-FS judgments. • The majority committee did surprisingly well given the non-optimized thresholds of its classifiers. • Topic-weighted performance was worse than last year, but story-weighted performance improved. • The committee again outperformed all of its constituent classifiers this year. • Suggests less sensitivity to initial thresholding than was expected.