Newsjunkie: Providing Personalized Newsfeeds via Analysis of Information Novelty

Newsjunkie: Providing Personalized Newsfeeds via Analysis of Information Novelty. Gabrilovich et al., WWW 2004

Main Contents
• Goal: identify the novelty of news stories given the preceding news a user has read
• Newsjunkie: a set of algorithms for different (but related) tasks
• Technique: text collection comparison
• Tasks:
  • Ranking news by novelty
  • Personalized news updates
  • Characterization of relevance types of articles
• Evaluation or examples

Review: Text Comparison
• Syntactic differences between Web pages
  • e.g., AT&T Internet Difference Engine
• Characteristic words
  • e.g., genre classification
• Language models for entire collections
  • e.g., corpus linguistics
• Comparing one set of documents to another
  • e.g., MMR (Maximal Marginal Relevance)
  • Newsjunkie

Research Problems
• Focus on temporal aspects of content difference
  • Automatically assess the novelty over time of news articles arriving from live newsfeeds
• Look for documents most dissimilar from documents reviewed earlier
  • Limitation: outputs entire documents rather than the novel parts of multiple documents
  • Extracting novel parts is much harder: it requires information extraction (IE) plus summarization

Difference of Text Content
• KL divergence
• Density of new named entities
  • Assumption: novelty is often conveyed by introducing new named entities
? Is the length normalization reasonable? What we need is new information, regardless of how long the document is.
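As a minimal sketch of the two distance measures on this slide, the following Python compares an article to its background. It assumes articles are already tokenized and named entities already extracted (both steps are outside the sketch). The add-alpha smoothing, the function names, and the choice to count distinct new entities per token are illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter


def unigram_model(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a shared vocabulary.

    Smoothing (an assumption here) keeps every probability non-zero,
    so the KL divergence below is always finite.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_novelty(article_tokens, background_tokens, alpha=1.0):
    """KL(article || background) between smoothed unigram models."""
    vocab = set(article_tokens) | set(background_tokens)
    p = unigram_model(article_tokens, vocab, alpha)
    q = unigram_model(background_tokens, vocab, alpha)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)


def new_entity_density(article_entities, background_entities, article_length):
    """Distinct entities unseen in the background, divided by article length.

    The division by length is the normalization questioned on the slide.
    """
    if article_length == 0:
        return 0.0
    new = set(article_entities) - set(background_entities)
    return len(new) / article_length
```

Dropping the division in `new_entity_density` gives the raw count of new entities, which is one way to explore the normalization question raised above.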

Task 1: news ranking

Evaluation 1
• Users evaluated 3 distance metrics on 12 topics
  • KL divergence; density of named entities (NE); chronological order
• Each metric produced a set of 3 novel documents
• Users judged which set was the most novel
• Statistical significance tests on mean ranks
  • KL and NE are superior to chronological order
  • No significant difference between KL and NE
? The order of the 3 articles within a set is not considered, although the task is ranking!
? The statistical tests compare only means; what about variance?

Task 2: personalized news update
• Task 2.1: single daily update
  • Articles from the preceding day serve as background
  • The user specifies a novelty threshold
• Future work: consider more previous articles, with weights decaying with age
• No evaluation in this part
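The daily update described on this slide can be sketched as a simple filter: score each of today's articles against yesterday's articles and keep those above the user's threshold. The function names are hypothetical, and `new_word_fraction` is only a stand-in distance so the sketch runs on its own; the slides name KL divergence and named-entity density as the actual metrics.

```python
def new_word_fraction(article_tokens, background_tokens):
    """Stand-in distance (an assumption): share of distinct article words
    that never occur in the background."""
    words = set(article_tokens)
    if not words:
        return 0.0
    return len(words - set(background_tokens)) / len(words)


def daily_update(todays_articles, yesterdays_articles, threshold,
                 distance=new_word_fraction):
    """Return today's articles whose novelty reaches the threshold,
    most novel first. Each article is a list of tokens."""
    # Pool all of yesterday's articles into one background collection.
    background = [tok for art in yesterdays_articles for tok in art]
    scored = [(distance(art, background), art) for art in todays_articles]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [art for score, art in scored if score >= threshold]
```

Raising `threshold` yields a shorter update; the decaying-weight extension listed as future work would replace the single pooled background with an age-weighted one.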

Task 2.2: breaking news report
• Detect new information about a story
  • Preceding articles within a sliding window serve as background
  • Empirically, a window size of 40 articles
• Filter out delayed reports and recaps
  • These appear as narrow spikes in a distance graph
  • Based on the nature of news reports
  • A median filter removes narrow spikes
  • Empirically, a filter width of 5
? How should these parameters be set?
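A rough sketch of the two steps on this slide, assuming a stream of tokenized articles and a caller-supplied `distance(article_tokens, background_tokens)` function: score each article against the preceding window of 40 articles, then smooth the score series with a width-5 median filter. The function names, the zero score for the first article, and the truncated windows at the ends of the series are assumptions made for illustration.

```python
from collections import deque
from statistics import median


def novelty_series(articles, distance, window=40):
    """Score each article against the `window` articles that precede it."""
    background = deque(maxlen=window)  # oldest articles drop out automatically
    scores = []
    for art in articles:
        if background:
            bg_tokens = [tok for prev in background for tok in prev]
            scores.append(distance(art, bg_tokens))
        else:
            scores.append(0.0)  # assumption: the first article has no background
        background.append(art)
    return scores


def median_filter(scores, width=5):
    """Replace each score by the median of its neighbourhood.

    Spikes narrower than about half the width are removed; sustained
    rises survive. Windows are truncated at both ends of the series.
    """
    half = width // 2
    return [median(scores[max(0, i - half): i + half + 1])
            for i in range(len(scores))]
```

With width 5, a one-article spike such as `[0, 0, 0, 9, 0, 0, 0]` is flattened to zeros, while a rise that persists for several articles passes through, which matches the slide's distinction between delayed reports or recaps and real developments.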

Task 2.2: example

Task 3: relevance type of articles
• Four types of relevance to the background
  • Recap: repeats old material
  • Elaboration: adds new information
  • Offshoot: mainly about another topic
  • Irrelevant: a totally different topic
• Identify them using intra-document dynamics

Task 3: intra-document dynamics
• Estimate the relevance of different parts within a document
  • Sliding window with a fixed size
  • Compare the content within the window to the background
  • Plot the distance scores
  • Identify different patterns
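The sliding-window procedure above might look like the following sketch. It assumes a tokenized document and a caller-supplied `distance(window_tokens, background_tokens)` function; the default window of 20 tokens, the step size, and both function names are placeholders, since the slides only say the window has a fixed size.

```python
def intra_document_scores(doc_tokens, background_tokens, distance,
                          window=20, step=1):
    """Distance to the background for each fixed-size window of the document."""
    if len(doc_tokens) <= window:
        # Document shorter than one window: score it as a single chunk.
        return [distance(doc_tokens, background_tokens)]
    return [distance(doc_tokens[i:i + window], background_tokens)
            for i in range(0, len(doc_tokens) - window + 1, step)]


def dynamic_range(scores):
    """Spread between the most and least novel window of a document."""
    return max(scores) - min(scores)
```

Plotting the returned scores against window position gives the per-document curve whose shape is inspected for patterns; `dynamic_range` summarizes how much that curve moves.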

What will the graph of an irrelevant article look like? -- Higher absolute scores, but a small dynamic range

Contributions
• Novel novelty metric
  • Density of named entities
• Evaluation by users
• Breaking news detection
  • Novel adoption of a median filter
• Characterization of article types
  • Intra-story novelty patterns

Limitations
• Generalization of the named-entity metric
  • Works well in the news domain, but what about others?
• User evaluation: too coarse
  • Does not consider the order of articles
  • Used old news that users had seen before the tests
• Claimed to be “personalized”, but only provides flexibility in the threshold and, possibly, in the selection of article relevance type
• Better if it could identify the novel parts
  • Or maybe not: keeping whole articles preserves the integrity of a piece of news

Thank you!
