Novelty and Redundancy Detection in Adaptive Filtering Yi Zhang, Jamie Callan, Thomas Minka Carnegie Mellon University {yiz, callan, minka}@cs.cmu.edu
Outline • Introduction: task definition and related work • Building a filtering system • Filtering system structure • Redundancy measures • Experimental methodology • Creating testing datasets • Evaluation measures • Experimental results • Conclusion and future work
Task Definition • What users want in adaptive filtering: relevant and novel information, as soon as the document arrives • Current filtering systems are relevance-oriented • Optimization: deliver as much relevant information as possible • Evaluation: relevance recall/precision; the system gets credit even for relevant but redundant information
Relates to First Story Detection in TDT • No prior work on novelty detection in adaptive filtering • Current research on FSD in TDT: • Goal: identify the first story of an event • Current performance: far from solved • FSD in TDT != novelty detection while filtering • Different assumptions about the definition of redundancy • Unsupervised learning vs. supervised learning • Novelty detection in filtering operates in a user-specified domain, and user information is available
Outline • Introduction: task definition and related work • Building a filtering system • Filtering system structure • Redundancy measures • Experimental methodology • Creating testing datasets • Evaluation measures • Experimental results • Conclusion and future work
Relevancy vs. Novelty • User wants: relevant and novel information • Contradiction? • Relevant: deliver documents similar to previously delivered relevant documents • Novel: deliver documents not similar to previously delivered relevant documents • Solution: a two-stage system • Use different similarity measures to model relevancy and novelty
Two-Stage Filtering System (diagram): a stream of documents passes through relevance filtering and then redundancy filtering; delivered documents are labeled as novel or redundant.
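A minimal sketch of this two-stage structure, assuming relevance and redundancy scoring functions and thresholds are supplied by the caller; all names here are illustrative, not the authors' code:

```python
def filter_stream(documents, profile, relevance_score, redundancy_score,
                  rel_threshold, red_threshold):
    """Stage 1 keeps documents relevant to the profile; stage 2 labels each
    surviving document as novel or redundant w.r.t. earlier deliveries."""
    delivered = []                                    # relevant documents already delivered
    for doc in documents:
        if relevance_score(doc, profile) < rel_threshold:
            continue                                  # stage 1: relevance filtering
        score = max((redundancy_score(doc, prev) for prev in delivered),
                    default=float("-inf"))            # stage 2: redundancy vs. history
        label = "redundant" if score >= red_threshold else "novel"
        delivered.append(doc)
        yield doc, label
```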
Two Problems for Novelty Detection • Input: • A sequence of documents the user reads • User feedback • Redundancy measure (our current focus): • Measures the redundancy of the current document with respect to previous documents • Profile-specific, any-time updating of the redundancy/novelty measure • Thresholding: • Only documents with a redundancy score below the threshold are considered novel
Redundancy Measures • Use the similarity/distance/difference between two documents to measure redundancy • Three types of document representation: • Set difference • Geometric distance (cosine similarity) • Distributional similarity (language model)
Set Difference • Main idea: • Boolean bag-of-words representation • Use smoothing to add frequent words to the document representation • Algorithm: • wj ∈ Set(d) iff Count(wj, d) > k, where Count(wj, d) = λ1 · tf(wj, d) + λ3 · rdf(wj) + λ2 · df(wj) • Using the number of new words in dt to measure novelty: R(dt | di) = -|Set(dt) \ Set(di)|
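A minimal sketch of the set-difference measure, assuming the smoothed count above; the weights λ1, λ2, λ3 and the cutoff k are free parameters, and the values used here are placeholders:

```python
def smoothed_set(doc_tf, df, rdf, l1=1.0, l2=0.5, l3=0.5, k=1.0):
    """Boolean bag of words: keep word w if l1*tf + l3*rdf + l2*df exceeds k."""
    return {w for w, tf in doc_tf.items()
            if l1 * tf + l3 * rdf.get(w, 0) + l2 * df.get(w, 0) > k}

def redundancy_set_difference(set_dt, set_di):
    """R(dt | di) = -|Set(dt) \\ Set(di)|: fewer new words means more redundant."""
    return -len(set_dt - set_di)
```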
Geometric Distance • Main idea: • Basic vector space approach • Algorithm: • Represent a document as a vector, where the weight of each dimension is the tf*idf score of the corresponding word • Using cosine similarity to measure redundancy: R(dt | di) = cosine(dt, di)
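A small sketch of the cosine measure, assuming an idf table is available; the sparse dictionary vectors stand in for whatever vector representation the system actually uses:

```python
import math

def tfidf_vector(doc_tf, idf):
    """Weight each word by tf * idf (missing idf entries default to 0)."""
    return {w: tf * idf.get(w, 0.0) for w, tf in doc_tf.items()}

def redundancy_cosine(vec_dt, vec_di):
    """R(dt | di) = cosine(dt, di): higher similarity means more redundant."""
    dot = sum(wt * vec_di.get(w, 0.0) for w, wt in vec_dt.items())
    norm_t = math.sqrt(sum(v * v for v in vec_dt.values()))
    norm_i = math.sqrt(sum(v * v for v in vec_di.values()))
    return dot / (norm_t * norm_i) if norm_t and norm_i else 0.0
```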
Distributional Similarity (1) • Main idea: • Unigram language models • Algorithm: • Represent a document d as a word distribution θd • Measure the redundancy/novelty between two documents using the Kullback-Leibler (KL) distance between the corresponding distributions: R(dt | di) = -KL(θdt, θdi)
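A minimal sketch of the distributional measure; it assumes θdi has already been smoothed (next slide), so every word of dt has non-zero probability under θdi:

```python
import math

def redundancy_kl(theta_dt, theta_di):
    """R(dt | di) = -KL(theta_dt || theta_di): smaller divergence means more redundant."""
    kl = sum(p * math.log(p / theta_di[w]) for w, p in theta_dt.items() if p > 0)
    return -kl
```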
Distributional Similarity (2): Smoothing • Why smoothing: • Maximum likelihood estimation of θd makes KL(θdt, θdi) infinite because of unseen words • Makes the estimate of the language model more accurate • Smoothing algorithms for θd: • Bayesian smoothing using Dirichlet priors (Zhai & Lafferty, SIGIR 2001) • Smoothing using shrinkage (McCallum, ICML 1998) • A mixture-model-based smoothing
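A sketch of Dirichlet-prior smoothing in the spirit of Zhai & Lafferty (SIGIR 2001); the prior mass mu is a free parameter, and the collection model is assumed to cover the vocabulary:

```python
def dirichlet_smoothed_lm(doc_tf, collection_lm, mu=1000.0):
    """theta_d(w) = (tf(w, d) + mu * p(w | collection)) / (|d| + mu)."""
    doc_len = sum(doc_tf.values())
    return {w: (doc_tf.get(w, 0) + mu * p_c) / (doc_len + mu)
            for w, p_c in collection_lm.items()}
```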
A Mixture Model: Relevancy vs. Novelty • Each document is modeled as a mixture of three unigram language models: θE (general English), θT (topic), and θd_core (document-specific new information), with mixture weights λE, λT, and λd • Relevancy detection: focus on learning θT • Redundancy detection: focus on learning θd_core
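A hedged sketch of estimating θd_core by EM with the general-English and topic models held fixed; the mixture weights here are fixed illustrative values, not the settings used in the paper:

```python
def estimate_core_lm(doc_tf, p_english, p_topic, lE=0.3, lT=0.3, ld=0.4, iters=20):
    """EM for p(w) = lE*p(w|E) + lT*p(w|T) + ld*p(w|d_core), updating only d_core."""
    p_core = {w: 1.0 / len(doc_tf) for w in doc_tf}             # uniform start
    for _ in range(iters):
        expected = {}
        for w, tf in doc_tf.items():
            mix = (lE * p_english.get(w, 1e-12) + lT * p_topic.get(w, 1e-12)
                   + ld * p_core[w])
            expected[w] = tf * ld * p_core[w] / mix              # E-step: counts credited to d_core
        total = sum(expected.values())
        p_core = {w: c / total for w, c in expected.items()}     # M-step: renormalize
    return p_core
```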
Outline • Introduction: task definition and related work • Building a filtering system • Filtering system structure • Redundancy measures • Experimental methodology • Creating testing datasets • Evaluation measures • Experimental results • Conclusion and future work
A New Evaluation Dataset: APWSJ • Combine the 1988-1990 AP and WSJ collections to get a corpus that is likely to contain redundant documents • Hired undergraduates to read all relevant documents, chronologically sorted, and judge: • Whether a document is redundant • If yes, identify the set of documents that make it redundant • Two degrees of redundancy: absolutely redundant vs. somewhat redundant • Adjudicated by two assessors
Another Evaluation Dataset: TREC Interactive Data • Combine the TREC-6, TREC-7 and TREC-8 interactive datasets (20 TREC topics) • Each topic contains several aspects • NIST assessors identify the aspects covered by each document • Assume dt is redundant if all aspects related to dt have already been covered by documents the user has previously seen • A strong assumption about what is novel/redundant • Can still provide useful information
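A small sketch of the aspect-coverage assumption used to derive redundancy labels from the interactive-track judgments; the data structures are illustrative:

```python
def label_stream(stream):
    """stream: chronologically sorted (doc_id, aspects) pairs for one topic.
    A document is redundant iff every aspect it covers was covered earlier."""
    covered, labels = set(), {}
    for doc_id, aspects in stream:
        labels[doc_id] = "redundant" if set(aspects) <= covered else "novel"
        covered |= set(aspects)
    return labels
```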
Evaluation Methodology (1) • Four components of an adaptive filtering system: • Relevancy measure • Relevance threshold • Redundancy measure • Redundancy threshold • Goal: focus on redundancy measures and avoid the influence of the other parts of the filtering system • Assume a perfect relevancy detection stage to remove the influence of that stage • Use 11-point average recall-precision graphs to remove the influence of the thresholding module
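A minimal sketch of the 11-point interpolated recall-precision summary, treating redundant documents as the target class so that no redundancy threshold is needed (assumes at least one truly redundant document):

```python
def eleven_point_precision(ranked_is_redundant, num_redundant):
    """ranked_is_redundant: booleans sorted by redundancy score, most redundant first."""
    curve, hits = [], 0
    for i, is_red in enumerate(ranked_is_redundant, start=1):
        hits += is_red
        curve.append((hits / num_redundant, hits / i))           # (recall, precision)
    points = []
    for level in (i / 10 for i in range(11)):
        ps = [p for r, p in curve if r >= level]
        points.append(max(ps) if ps else 0.0)                    # interpolated precision
    return points                                                # average these for one number
```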
Outline • Introduction: task definition and related work • Building a filtering system • Filtering system structure • Redundancy measures • Experimental methodology • Creating testing datasets • Evaluation measures • Experimental results • Conclusion and future work
Comparing Different Redundancy Measures on Two Datasets • The cosine measure is consistently good • The mixture language model works much better than the other language model approaches
Mistakes After Thresholding • A simple thresholding algorithm makes the system complete • Learning the user's preference is important • Similar results for the interactive track data are reported in the paper
Outline • Introduction: task definition and related work • Building a filtering system • Filtering system structure • Redundancy measures • Experimental methodology • Creating testing datasets • Evaluation measures • Experimental results • Conclusion and future work
Conclusion: Our Contributions • Novelty/redundancy detection in an adaptive filtering system • Two-stage approach • Reasonably good at identifying redundant documents: • Cosine similarity • Mixture language model • Factors affecting accuracy: • Accuracy at finding relevant documents • Redundancy measure • Redundancy threshold
Future work • Cosine similarity is far from optimal (symmetric vs. asymmetric) • Feature engineering: time, source, author, named entities… • Better novelty measures • Document-document distance vs. document-cluster distance (?) • Depends on the user: what is novel/redundant for this user? • Learning user redundancy preferences • Thresholding: the sparse training data problem
Appendix: Threshold Algorithm • Initialize Rthreshold so that only near-duplicates are treated as redundant • For each delivered dt: • If the user says dt is redundant and R(dt) > max R(di) over all previously delivered relevant documents di: Rthreshold = R(dt) • Else: Rthreshold = Rthreshold - (Rthreshold - R(dt)) / 10
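A direct transcription of this update rule into Python; the 1/10 step comes from the slide, while the variable names and data structures are illustrative:

```python
def update_threshold(r_threshold, r_dt, user_says_redundant, delivered_scores):
    """delivered_scores: R(di) for previously delivered relevant documents di."""
    if user_says_redundant and r_dt > max(delivered_scores, default=float("-inf")):
        return r_dt                                      # jump to this document's score
    return r_threshold - (r_threshold - r_dt) / 10       # otherwise drift slowly toward R(dt)
```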