Newsjunkie: Providing Personalized Newsfeeds via Analysis of Information Novelty

Gabrilovich et al., WWW 2004. Main contents: identifying the novelty of news stories given the preceding news a user has read. Newsjunkie is a set of algorithms for different (but related) tasks.


Presentation Transcript


  1. Newsjunkie: Providing Personalized Newsfeeds via Analysis of Information Novelty. Gabrilovich et al., WWW 2004

  2. Main Contents
     • Identify the novelty of news stories given the preceding news a user has read
     • Newsjunkie: a set of algorithms for different (but related) tasks
     • Technique: text collection comparison
     • Tasks:
       • Ranking news by novelty
       • Personalized news updates
       • Characterization of relevance types of articles
     • Evaluation or examples

  3. Review: Text Comparison
     • Syntactic differences between Web pages
       • e.g., AT&T Internet Difference Engine
     • Characteristic words
       • e.g., genre classification
     • Language models for entire collections
       • e.g., corpus linguistics
     • Comparing one set of documents to another
       • e.g., MMR (Maximal Marginal Relevance)
     • Newsjunkie

  4. Research Problems
     • Focus on temporal aspects of content difference
       • Automatically assess the novelty over time of news articles coming from live newsfeeds
     • Look for documents most dissimilar from documents reviewed earlier
       • Limitation: outputs entire documents rather than the novel parts of multiple documents; the latter is much harder (it would need information extraction plus summarization)

  5. Difference of Text Content
     • KL divergence
     • Density of new named entities
       • Assumption: novelty is often conveyed by introducing new named entities
     ? Is normalization reasonable? What we need is new information, regardless of how long the document is.
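A minimal sketch of the two distance measures named on this slide. Several details are assumptions, not taken from the slides: the unigram (bag-of-words) model, the additive smoothing constant `alpha`, the direction of the divergence (candidate relative to background), and normalizing the entity count by document length. Named entities are passed in as pre-extracted lists; the tagger itself is out of scope here, and all function names are hypothetical.

```python
import math
from collections import Counter


def word_distribution(tokens, vocab, alpha=1.0):
    """Additively smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(candidate_tokens, background_tokens, alpha=1.0):
    """KL(candidate || background) in nats, over smoothed unigram models.

    The direction and the smoothing scheme are assumptions of this sketch.
    """
    vocab = set(candidate_tokens) | set(background_tokens)
    p = word_distribution(candidate_tokens, vocab, alpha)
    q = word_distribution(background_tokens, vocab, alpha)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)


def new_entity_density(candidate_entities, background_entities, candidate_length):
    """Distinct entities absent from the background, divided by document length."""
    if candidate_length == 0:
        return 0.0
    new_entities = set(candidate_entities) - set(background_entities)
    return len(new_entities) / candidate_length
```

The division by `candidate_length` is exactly the normalization questioned on the slide: dropping it would score a long article with many new entities above a short one with a few, which may or may not be what the reader wants.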

  6. Task 1: news ranking
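The body of this slide did not survive in the transcript. One plausible reading, based on slide 4 ("look for documents most dissimilar from documents reviewed earlier") and the sets of 3 documents used in the evaluation, is a greedy selection loop. The sketch below is that reading only, not the authors' confirmed algorithm; `unseen_word_fraction` is a hypothetical stand-in for whichever distance metric is used.

```python
def unseen_word_fraction(candidate, background):
    """Stand-in distance: share of the candidate's distinct words absent from the background."""
    candidate_words = set(candidate.lower().split())
    if not candidate_words:
        return 0.0
    seen = set(background.lower().split())
    return len(candidate_words - seen) / len(candidate_words)


def rank_by_novelty(seed, candidates, distance=unseen_word_fraction, top_k=3):
    """Greedily pick the article most distant from everything read so far."""
    background = seed
    remaining = list(candidates)
    ranked = []
    while remaining and len(ranked) < top_k:
        best = max(remaining, key=lambda doc: distance(doc, background))
        ranked.append(best)
        remaining.remove(best)
        # Assumption: a selected article counts as read, so it joins the background.
        background = background + " " + best
    return ranked
```

Adding each pick to the background keeps two near-identical articles from both ranking highly, since the second one is no longer novel once the first has been chosen.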

  7. Evaluation 1
     • Users evaluated 3 distance metrics on 12 topics
       • KL divergence; density of named entities (NE); chronological order
     • Each metric produced a set of 3 novel documents
     • Users judged which set was the most novel
     • Statistical significance tests on mean ranks
       • KL and NE are superior to chronological order
       • No significant difference between KL and NE
     ? The order of the 3 articles is not considered, although the task is ranking!
     ? The statistical tests cover only the mean; what about the variance?

  8. Task 2: personalized news update
     • Task 2.1: single daily update
       • Articles from the preceding day serve as background
       • The user specifies a novelty threshold
     • Future work: consider more previous articles, with weights decaying with age
     • No evaluation in this part
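The daily update described here can be sketched as a simple filter: pool the previous day's articles into a background and keep today's articles whose distance from it reaches the user's threshold. This is a sketch under assumptions; the function names are hypothetical, and `unseen_word_fraction` is a stand-in for the actual distance metric.

```python
def unseen_word_fraction(candidate, background):
    """Stand-in distance: share of the candidate's distinct words absent from the background."""
    candidate_words = set(candidate.lower().split())
    if not candidate_words:
        return 0.0
    seen = set(background.lower().split())
    return len(candidate_words - seen) / len(candidate_words)


def daily_update(todays_articles, yesterdays_articles, threshold,
                 distance=unseen_word_fraction):
    """Keep today's articles whose distance from yesterday's news meets the threshold."""
    background = " ".join(yesterdays_articles)
    return [article for article in todays_articles
            if distance(article, background) >= threshold]
```

A higher `threshold` yields a shorter update; since the threshold is the only user-facing knob on this slide, it is kept as a required argument.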

  9. Task 2.2: breaking news report
     • Detect new information about a story
       • Preceding articles within a sliding window serve as background
       • Empirically, a window size of 40 articles
     • Filter out delayed reports and recaps
       • These appear as narrow spikes in a distance graph
       • Based on the nature of news reports
       • A median filter removes narrow spikes
       • Empirically, a filter width of 5
     ? How should these parameters be set?
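A minimal sketch of the two steps on this slide: score each incoming article against a sliding window of the preceding articles, then smooth the score series with a median filter. The window size of 40 and filter width of 5 come from the slide; the function names, the caller-supplied `distance` function, and the truncated windows at the ends of the series are assumptions.

```python
from statistics import median


def distance_series(articles, distance, window=40):
    """Score each article against the (up to) `window` articles that preceded it."""
    scores = []
    for i, article in enumerate(articles):
        background = " ".join(articles[max(0, i - window):i])
        scores.append(distance(article, background))
    return scores


def median_filter(scores, width=5):
    """Replace each score by the median of the `width` scores centered on it.

    Near the ends of the series the window is truncated (an assumption of this sketch).
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd number")
    half = width // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        smoothed.append(median(scores[lo:hi]))
    return smoothed
```

With a width of 5, a spike one or two articles wide is outvoted by its neighbors and disappears, while a sustained rise of three or more articles survives, which is why the filter separates isolated recaps from a genuine development.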

  10. Task 2.2: example

  11. Task 3: relevance type of articles
      • Four types of relevance to the background
        • Recap: repeats old material
        • Elaboration: adds new information
        • Offshoot: mainly about another topic
        • Irrelevant: a totally different topic
      • Identify them using intra-document dynamics

  12. Task 3: intra-document dynamics
      • Estimate the relevance of different parts within a document
      • Slide a fixed-size window over the document
      • Compare the content within the window to the background
      • Plot the distance scores
      • Identify different patterns
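The windowed comparison above can be sketched as follows. The slide gives no window size, so `window=20` is an arbitrary placeholder, and the per-window score (fraction of tokens not seen in the background) is a stand-in for the actual distance metric; the function name is hypothetical.

```python
def window_scores(document_tokens, background_tokens, window=20, step=1):
    """Distance of each fixed-size window of the document from the background."""
    if window < 1 or step < 1:
        raise ValueError("window and step must be positive")
    n = len(document_tokens)
    if n == 0:
        return []
    seen = set(background_tokens)
    scores = []
    # A document shorter than the window is scored as a single window.
    last_start = max(0, n - window)
    for start in range(0, last_start + 1, step):
        chunk = document_tokens[start:start + window]
        unseen = sum(1 for token in chunk if token not in seen)
        scores.append(unseen / len(chunk))
    return scores
```

Plotting the returned list against window position gives the per-document curve whose shape is then inspected for patterns.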

  13. What will the graph of an irrelevant article look like? Answer: higher absolute scores, but a small dynamic range

  14. Contributions
      • Novel novelty metric
        • Density of named entities
      • Evaluation by users
      • Breaking news detection
        • Novel adoption of a median filter
      • Characterization of article types
        • Intra-story novelty patterns

  15. Limitations
      • Generalization of the named-entity metric:
        • Works well in the news domain, but what about others?
      • User evaluation: too coarse
        • Did not consider the order of articles
        • Used old news which users had seen before the tests
      • Claimed to be “personalized”, but only provided flexibility in the threshold and, possibly, in article relevance type selection
      • Better if it could identify novel parts
        • Or maybe not, to keep the integrity of a piece of news

  16. Thank you!
