First Story Detection: Combining Similarity and Novelty Based Approaches
Martin Franz, Abraham Ittycheriah, J. Scott McCarley, Todd Ward
IBM T. J. Watson Research Center
What is First Story Detection?
• Have we seen this before? If it is not old, it must be new.
• Novelty is measured at three levels:
  • word: "Bloomberg"
  • story: "Bloomberg wins!" … (NYT, 11/7/2001, page 1)
  • story cluster: NYC mayoral elections in 2001
[Timeline figure: past / future]
Outline
• Our first participation in FSD
• A combined approach:
  • story similarity (unsupervised clustering)
  • term novelty
FSD with Unsupervised Clustering
• For each incoming story, compute a story/cluster similarity score against every existing cluster.
• If the best score > threshold, merge the story into that cluster; otherwise, start a new cluster.
• FSD confidence = 1 / best similarity score.
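A minimal sketch of this clustering loop in Python, under stated assumptions: `similarity` stands in for the Okapi-style score described on the next slide, and the threshold value is a hypothetical placeholder rather than the tuned value used in the actual system.

```python
# Illustrative sketch of the FSD clustering loop (not the exact IBM system).
THRESHOLD = 0.1  # hypothetical value; in practice tuned on development data

def first_story_detection(stories, similarity, threshold=THRESHOLD):
    clusters = []      # each cluster is a list of member stories
    confidences = []   # FSD confidence per story
    for story in stories:
        scores = [similarity(story, c) for c in clusters]
        best = max(scores, default=0.0)
        if best > threshold:
            clusters[scores.index(best)].append(story)   # merge into best cluster
        else:
            clusters.append([story])                     # start a new cluster
        # low best similarity -> high confidence that this is a first story
        confidences.append(1.0 / best if best > 0 else float("inf"))
    return confidences
```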
Story/Cluster Similarity
• Cluster representation: the "mean story".
• Symmetrized Okapi formula:
  Ok(s, c) = Σ_t cnt_s(t) · cnt_c(t) · idf(t)
• cnt is a warped, length-scaled term count.
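A sketch of a similarity score in the spirit of this formula. The slide does not specify the exact term-count warping or length scaling, so the log warp and length normalization below are assumptions, and pooling all cluster tokens is only one way to approximate the "mean story".

```python
import math
from collections import Counter

def warped_counts(tokens, length_scale=100.0):
    """Warped, length-scaled term counts (illustrative: log warp + length normalization)."""
    counts = Counter(tokens)
    scale = length_scale / max(len(tokens), 1)
    return {t: math.log1p(c) * scale for t, c in counts.items()}

def okapi_similarity(story_tokens, cluster_tokens, idf):
    """Ok(s, c) = sum_t cnt_s(t) * cnt_c(t) * idf(t)."""
    cnt_s = warped_counts(story_tokens)
    cnt_c = warped_counts(cluster_tokens)   # cluster tokens pooled as a stand-in for the "mean story"
    return sum(cnt_s[t] * cnt_c.get(t, 0.0) * idf.get(t, 0.0) for t in cnt_s)
```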
Text Pre-Processing
• tokenizing
• part-of-speech tagging
• morphing: word_tag -> morph
  • computers_NNS -> computer
  • computed_VBD -> compute
• unigrams and noun bigrams
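A rough stand-in for this pipeline using NLTK; the slide does not name the tagger or morphological analyzer actually used, so NLTK's tokenizer, tagger, and WordNet lemmatizer are substitutes (and require the usual NLTK data downloads, which may vary by NLTK version).

```python
import nltk
from nltk.stem import WordNetLemmatizer

# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
#           nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

def morph(word, tag):
    """Map word_tag -> morph, e.g. computers_NNS -> computer, computed_VBD -> compute."""
    pos = "v" if tag.startswith("VB") else "n"
    return lemmatizer.lemmatize(word.lower(), pos)

def extract_terms(text):
    """Unigrams plus noun bigrams, after tokenizing, tagging, and morphing."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    morphs = [(morph(w, t), t) for w, t in tagged]
    unigrams = [m for m, _ in morphs]
    noun_bigrams = [
        f"{a}_{b}"
        for (a, ta), (b, tb) in zip(morphs, morphs[1:])
        if ta.startswith("NN") and tb.startswith("NN")
    ]
    return unigrams + noun_bigrams
```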
Refinement: Cluster Recency
[Plot: distance from the first story (TDT2, January-March)]
• correct rejects: flat
• false alarms (FA): decreasing with the distance from the seed story
Clusters are more "attractive" shortly after they are created:
  score' = score · (1 + 2^(-age / half-time))
half-time ~ 2 days ~ 860 stories
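The recency boost as code; the half-time constant below is the slide's ~2 days, expressed in stories, and the cluster age is assumed to be measured in stories since the cluster was created.

```python
HALF_TIME = 860.0  # ~2 days of the TDT stream, measured in stories

def recency_weighted(score, cluster_age):
    """score' = score * (1 + 2^(-age / half_time)), with age in stories."""
    return score * (1.0 + 2.0 ** (-cluster_age / HALF_TIME))
```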
Effect of Cluster Recency
[Plot: before (baseline) vs. after (cluster recency), varying half-time; TDT2, first 10 000 stories]
Baseline vs. Cluster Recency
[Results plot: TDT3, ASR, reference boundaries]
Effect of Cluster Recency
[Results plot: TDT3, ASR, reference boundaries]
Processing Very Short Stories, Automatic Boundaries
• Problem: numerous segmentation false alarms result in short "stories", which cause FSD false alarms.
• Solutions: (1) find and connect similar neighboring stories; (2) a "catch all" cluster.
Processing Very Short Stories
• Problem: short "stories" causing FSD false alarms.
• Solution:
  if best similarity score = 0 or story vocabulary size < 20
  then story -> "catch all" cluster
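The routing rule as a small predicate; the vocabulary cutoff of 20 is taken from the slide, and the function name is illustrative.

```python
MIN_VOCAB = 20  # vocabulary-size cutoff from the slide

def goes_to_catch_all(story_terms, best_score):
    """True if the story should be absorbed by the "catch all" cluster
    instead of seeding a new cluster (and being flagged as a first story)."""
    return best_score == 0 or len(set(story_terms)) < MIN_VOCAB
```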
Term Novelty Feature
• new story ~ new words and phrases
• score(t) = (1 - 2^(-distance / half-time)) · tf · idf
• half-time = (dev_corpus_size / df) · c, where c is a constant
• development corpus: TDT2, Jan-March, clean
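A direct transcription of the novelty score into code; here `distance` is assumed to be the number of stories since term t was last seen, and `c` is the tuned constant from the slide (its actual value is not given, so the default is a placeholder).

```python
def term_novelty(tf, idf, distance, df, dev_corpus_size, c=1.0):
    """score(t) = (1 - 2^(-distance / half_time)) * tf * idf,
    with half_time = (dev_corpus_size / df) * c.
    distance: stories since term t was last observed (large for unseen terms);
    df: document frequency of t in the development corpus (assumed > 0)."""
    half_time = (dev_corpus_size / df) * c
    return (1.0 - 2.0 ** (-distance / half_time)) * tf * idf
```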
Combining Similarity and Novelty Scores
  score_FSD = 0.8 · score_Sim + 0.2 · score_Nov
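The combination is a fixed linear interpolation; the 0.8/0.2 weights come from the slide, and the parameter names below are only illustrative.

```python
def combined_fsd_score(score_sim, score_nov, w_sim=0.8, w_nov=0.2):
    """score_FSD = 0.8 * score_Sim + 0.2 * score_Nov (weights from the slide)."""
    return w_sim * score_sim + w_nov * score_nov
```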
Combining Similarity and Novelty Scores
[Results plots: TDT3, manual; TDT3, ASR; TDT3, ASR, auto boundaries]
FSD on Mandarin (Systran) Data
[Plots: reference boundaries and automatic boundaries; det_SR=nwt+bnasr_TE=mul,eng.ndx]
• October-December, Mandarin only, 99 topics
FSD on Mandarin (Systran) and English Data
[Plots: reference boundaries and automatic boundaries; det_SR=nwt+bnasr_TE=mul,eng.ndx]
• October-December, Mandarin (Systran) + English
• 240 topics, 39 have a Mandarin first story
Conclusion
• The cluster recency feature brings a moderate performance gain.
• The term novelty approach shows acceptable performance and is more robust to noise.
• Combining the two algorithms improves performance under most conditions.
• As the noise level grows, the performance gain from combining the novelty and similarity systems increases.
Lessons Learned
• Automatic FSD is a hard problem.
• Solution: deeper story understanding?