First Story Detection: Combining Similarity and Novelty Based Approaches
Martin Franz, Abraham Ittycheriah, J. Scott McCarley, Todd Ward
IBM T. J. Watson Research Center
What is First Story Detection?
• Have we seen this before? If it is not old, it must be new.
• Novelty is measured at three levels:
  • word: "Bloomberg"
  • story: "Bloomberg wins!" … (NYT, 11/7/2001, page 1)
  • story cluster: NYC mayoral elections in 2001
[Timeline figure: past / future]
Outline
• Our first participation in FSD
• A combined approach:
  • story similarity (unsupervised clustering)
  • term novelty
FSD with Unsupervised Clustering
• For each incoming story, compute a story/cluster similarity score against every existing cluster.
• If the best score > threshold, merge the story into that cluster; otherwise, start a new cluster.
• FSD confidence = 1 / best similarity score.
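A minimal sketch of this clustering loop in Python, under stated assumptions: `similarity` stands in for the Okapi-style score described on the next slide, and the threshold value is a hypothetical placeholder rather than the tuned value used in the actual system.

```python
# Illustrative sketch of the FSD clustering loop (not the exact IBM system).
THRESHOLD = 0.1  # hypothetical value; in practice tuned on development data

def first_story_detection(stories, similarity, threshold=THRESHOLD):
    clusters = []      # each cluster is a list of member stories
    confidences = []   # FSD confidence per story
    for story in stories:
        scores = [similarity(story, c) for c in clusters]
        best = max(scores, default=0.0)
        if best > threshold:
            clusters[scores.index(best)].append(story)   # merge into best cluster
        else:
            clusters.append([story])                     # start a new cluster
        # low best similarity -> high confidence that this is a first story
        confidences.append(1.0 / best if best > 0 else float("inf"))
    return confidences
```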
Story/Cluster Similarity
• Cluster representation: the "mean story".
• Symmetrized Okapi formula:
  Ok(s, c) = Σ_t cnt_s(t) · cnt_c(t) · idf(t)
• cnt is a warped, length-scaled term count.
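A sketch of a similarity score in the spirit of this formula. The slide does not specify the exact term-count warping or length scaling, so the log warp and length normalization below are assumptions, and pooling all cluster tokens is only one way to approximate the "mean story".

```python
import math
from collections import Counter

def warped_counts(tokens, length_scale=100.0):
    """Warped, length-scaled term counts (illustrative: log warp + length normalization)."""
    counts = Counter(tokens)
    scale = length_scale / max(len(tokens), 1)
    return {t: math.log1p(c) * scale for t, c in counts.items()}

def okapi_similarity(story_tokens, cluster_tokens, idf):
    """Ok(s, c) = sum_t cnt_s(t) * cnt_c(t) * idf(t)."""
    cnt_s = warped_counts(story_tokens)
    cnt_c = warped_counts(cluster_tokens)   # cluster tokens pooled as a stand-in for the "mean story"
    return sum(cnt_s[t] * cnt_c.get(t, 0.0) * idf.get(t, 0.0) for t in cnt_s)
```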
Text Pre-Processing
• tokenizing
• part-of-speech tagging
• morphing: word_tag -> morph
  • computers_NNS -> computer
  • computed_VBD -> compute
• unigrams and noun bigrams
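A rough stand-in for this pipeline using NLTK; the slide does not name the tagger or morphological analyzer actually used, so NLTK's tokenizer, tagger, and WordNet lemmatizer are substitutes (and require the usual NLTK data downloads, which may vary by NLTK version).

```python
import nltk
from nltk.stem import WordNetLemmatizer

# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
#           nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

def morph(word, tag):
    """Map word_tag -> morph, e.g. computers_NNS -> computer, computed_VBD -> compute."""
    pos = "v" if tag.startswith("VB") else "n"
    return lemmatizer.lemmatize(word.lower(), pos)

def extract_terms(text):
    """Unigrams plus noun bigrams, after tokenizing, tagging, and morphing."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    morphs = [(morph(w, t), t) for w, t in tagged]
    unigrams = [m for m, _ in morphs]
    noun_bigrams = [
        f"{a}_{b}"
        for (a, ta), (b, tb) in zip(morphs, morphs[1:])
        if ta.startswith("NN") and tb.startswith("NN")
    ]
    return unigrams + noun_bigrams
```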
Refinement: Cluster Recency
[Plot: distance from the first story (TDT2, January-March)]
• correct rejects: flat
• false alarms (FA): decreasing with the distance from the seed story
Clusters are more "attractive" shortly after they are created:
  score' = score · (1 + 2^(-age / half-time))
half-time ~ 2 days ~ 860 stories
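The recency boost as code; the half-time constant below is the slide's ~2 days, expressed in stories, and the cluster age is assumed to be measured in stories since the cluster was created.

```python
HALF_TIME = 860.0  # ~2 days of the TDT stream, measured in stories

def recency_weighted(score, cluster_age):
    """score' = score * (1 + 2^(-age / half_time)), with age in stories."""
    return score * (1.0 + 2.0 ** (-cluster_age / HALF_TIME))
```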
Effect of Cluster Recency
[Plot: before (baseline) vs. after (cluster recency), varying half-time; TDT2, first 10 000 stories]
Baseline vs. Cluster Recency
[Results plot: TDT3, ASR, reference boundaries]
Effect of Cluster Recency
[Results plot: TDT3, ASR, reference boundaries]
Processing Very Short Stories, Automatic Boundaries
• Problem: numerous segmentation false alarms result in short "stories", which cause FSD false alarms.
• Solutions: (1) find and connect similar neighboring stories; (2) a "catch all" cluster.
Processing Very Short Stories
• Problem: short "stories" causing FSD false alarms.
• Solution:
  if best similarity score = 0 or story vocabulary size < 20
  then story -> "catch all" cluster
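The routing rule as a small predicate; the vocabulary cutoff of 20 is taken from the slide, and the function name is illustrative.

```python
MIN_VOCAB = 20  # vocabulary-size cutoff from the slide

def goes_to_catch_all(story_terms, best_score):
    """True if the story should be absorbed by the "catch all" cluster
    instead of seeding a new cluster (and being flagged as a first story)."""
    return best_score == 0 or len(set(story_terms)) < MIN_VOCAB
```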
Term Novelty Feature
• new story ~ new words and phrases
• score(t) = (1 - 2^(-distance / half-time)) · tf · idf
• half-time = (dev_corpus_size / df) · c, where c is a constant
• development corpus: TDT2, Jan-March, clean
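A direct transcription of the novelty score into code; here `distance` is assumed to be the number of stories since term t was last seen, and `c` is the tuned constant from the slide (its actual value is not given, so the default is a placeholder).

```python
def term_novelty(tf, idf, distance, df, dev_corpus_size, c=1.0):
    """score(t) = (1 - 2^(-distance / half_time)) * tf * idf,
    with half_time = (dev_corpus_size / df) * c.
    distance: stories since term t was last observed (large for unseen terms);
    df: document frequency of t in the development corpus (assumed > 0)."""
    half_time = (dev_corpus_size / df) * c
    return (1.0 - 2.0 ** (-distance / half_time)) * tf * idf
```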
Combining Similarity and Novelty Scores
  score_FSD = 0.8 · score_Sim + 0.2 · score_Nov
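The combination is a fixed linear interpolation; the 0.8/0.2 weights come from the slide, and the parameter names below are only illustrative.

```python
def combined_fsd_score(score_sim, score_nov, w_sim=0.8, w_nov=0.2):
    """score_FSD = 0.8 * score_Sim + 0.2 * score_Nov (weights from the slide)."""
    return w_sim * score_sim + w_nov * score_nov
```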
Combining Similarity and Novelty Scores
[Results plots: TDT3, manual; TDT3, ASR; TDT3, ASR, auto boundaries]
FSD on Mandarin (Systran) Data
[Plots: reference boundaries and automatic boundaries; det_SR=nwt+bnasr_TE=mul,eng.ndx]
• October-December, Mandarin only, 99 topics
FSD on Mandarin (Systran) and English Data
[Plots: reference boundaries and automatic boundaries; det_SR=nwt+bnasr_TE=mul,eng.ndx]
• October-December, Mandarin (Systran) + English
• 240 topics, 39 have a Mandarin first story
Conclusion
• The cluster recency feature brings a moderate performance gain.
• The term novelty approach shows acceptable performance and is more robust to noise.
• Combining the two algorithms improves performance under most conditions.
• As the noise level grows, the performance gain from combining the novelty and similarity systems increases.
Lessons Learned
• Automatic FSD is a hard problem.
• Solution: deeper story understanding?