1 / 33

Turning Down the Noise in the Blogosphere



  1. Turning Down the Noise in the Blogosphere Khalid El-Arini, Gaurav Veda, Dafna Shahaf, Carlos Guestrin

  2. Millions of blog posts published every day • Some stories become disproportionately popular • Hard to find information you care about

  3. Our Goal: Coverage • Turn down the noise in the blogosphere • Select a small set of posts that covers the most important stories (example day: January 17, 2009)


  5. Our Goal: Personalization • Tailor post selection to user tastes • [Figure: posts selected without personalization; the user objects ("But, I like sports! I want articles like: …"); posts after personalization based on Zidane's feedback]

  6. Main Contributions • Formalize the notion of covering the blogosphere • Near-optimal solution for post selection • Learn a personalized coverage function • No-regret algorithm for learning user preferences using limited feedback • Evaluate on real blog data • Conduct user studies and compare against Google and Yahoo!

  7. Approach Overview • Pipeline: Blogosphere → Feature Extraction → Coverage Function → Post Selection

  8. Document Features • Low level: words, noun phrases, named entities (e.g., Obama, China, peanut butter) • High level: e.g., topics from a topic model • Topic = probability distribution over words • [Figure: example word distributions for an Inauguration topic and a National Security topic]
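
As a concrete illustration of the low-level features, here is a minimal Python sketch of feature extraction. The capitalized-run heuristic and the helper name `low_level_features` are assumptions for illustration only; a real pipeline would use a proper named-entity tagger and noun-phrase chunker.

```python
import re
from collections import Counter

def low_level_features(text):
    """Crude low-level feature extractor: treat runs of capitalized
    words as named-entity candidates and count their occurrences.
    (Illustrative heuristic only; it will also catch ordinary
    sentence-initial words like 'The'.)"""
    candidates = re.findall(r"(?:[A-Z][a-z]+ )*[A-Z][a-z]+", text)
    return Counter(candidates)

post = "Barack Obama discussed trade with China at a peanut butter summit."
print(low_level_features(post))
# Counter({'Barack Obama': 1, 'China': 1})
```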

  9. Coverage • cover_d(f) = amount by which document d covers feature f • cover_A(f) = amount by which the set of posts A covers feature f • [Figure: bipartite diagram linking posts to the features they cover]

  10. Simple Coverage: MAX-COVER • Find k posts that cover the most features • cover_A(f) = 1 if at least one post in A contains f, and 0 otherwise • Problems with MAX-COVER: it ignores a feature's significance in the document and its significance in the corpus • e.g., a passing mention such as "… at George Mason University in Fairfax, Va." counts as full coverage of those features

  11. Feature Significance in Document • Solution: define a probabilistic coverage function • cover_d(f) = P(feature f | post d) • e.g., a post that mentions Washington only in passing is not really about Washington: cover_d(Washington) = 0.01 • With topics as features, cover_d(f) ≡ P(post d is about topic f)
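
A minimal sketch of this probabilistic coverage, assuming cover_d(f) is estimated as the feature's relative frequency within the post (one natural choice for low-level features; with LDA features it would instead be the post's inferred topic proportions):

```python
def cover_d(feature_counts):
    """cover_d(f) = P(feature f | post d), estimated here as relative
    frequency within the post. (An assumption for illustration; with
    topic features this is P(post d is about topic f).)"""
    total = sum(feature_counts.values())
    return {f: count / total for f, count in feature_counts.items()}

# A post mentioning Washington once among 100 feature occurrences gets
# cover_d(Washington) = 0.01: the post is not really about Washington.
```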

  12. Feature Significance in Corpus • Some features are more important than others • Want to cover the important features • Solution: associate a weight w_f with each feature f, e.g., the frequency of the feature in the corpus • Cover an important feature using multiple posts • [Figure: corpus-frequency comparison, e.g., Barack Obama vs. Carlos Guestrin]

  13. Incremental Coverage • cover_A(f) = probability that at least one post in set A covers feature f = 1 − ∏_{d∈A} (1 − cover_d(f)) • Example: "Obama: Tight noose on Bin Laden as good as capture" covers the Obama feature with probability 0.5; "What Obama's win means for China" covers it with probability 0.4 • Together: cover_A(Obama) = 1 − (1 − 0.5)(1 − 0.4) = 0.7 • Each single post's coverage (0.5 or 0.4) < 0.7 < 0.5 + 0.4: there is gain from covering a feature with multiple posts, but with diminishing returns
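
The set coverage is a noisy-OR over the posts in A. A short sketch reproducing the slide's numbers, where each post is represented by its cover_d dictionary from the sketch above:

```python
def cover_A(posts, f):
    """cover_A(f) = P(at least one post in A covers f)
                  = 1 - prod over d in A of (1 - cover_d(f))."""
    p_none = 1.0
    for post in posts:                 # post: dict feature -> probability
        p_none *= 1.0 - post.get(f, 0.0)
    return 1.0 - p_none

two_obama_posts = [{"Obama": 0.5}, {"Obama": 0.4}]
print(cover_A(two_obama_posts, "Obama"))  # 1 - 0.5 * 0.6 = 0.7
```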

  14. Post Selection Optimization • Want to select a set of posts A that maximizes F(A) = Σ_{f ∈ F} w_f · cover_A(f), where F is the feature set, w_f is the weight on feature f, and cover_A(f) is the probability that set A covers feature f • This function is submodular • Exact maximization is NP-hard • The greedy algorithm gives a (1 − 1/e) ≈ 63% approximation, i.e., a near-optimal solution • We use CELF (Leskovec et al., 2007)
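
A minimal sketch of lazy greedy selection in the spirit of CELF, reusing cover_A from the previous sketch; `weights` maps each feature to its w_f. This illustrates the lazy-evaluation idea rather than the authors' implementation: by submodularity, a stale marginal gain is an upper bound on the current one, so most re-evaluations can be skipped.

```python
import heapq

def objective(selected, weights):
    """F(A) = sum over features f of w_f * cover_A(f)."""
    return sum(w * cover_A(selected, f) for f, w in weights.items())

def celf_select(posts, weights, k):
    """Pick k posts greedily with lazy evaluation of marginal gains."""
    selected, current = [], 0.0
    # Max-heap entries: (negated marginal gain, post index, iteration in
    # which that gain was computed). The initial gains are exact for
    # iteration 1 because the selected set starts empty.
    heap = [(-objective([d], weights), i, 1) for i, d in enumerate(posts)]
    heapq.heapify(heap)
    for it in range(1, k + 1):
        while heap:
            neg_gain, i, stamp = heapq.heappop(heap)
            if stamp == it:                # gain is up to date: select it
                selected.append(posts[i])
                current -= neg_gain
                break
            gain = objective(selected + [posts[i]], weights) - current
            heapq.heappush(heap, (-gain, i, it))   # re-insert, fresh gain
    return selected
```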

  15. Approach Overview • Pipeline: Blogosphere → Feature Extraction → Coverage Function → Post Selection (submodular function optimization)

  16. Evaluating Coverage • Evaluate on real blog data from Spinn3r • Two-week period in January 2009, ~200K posts per day (after pre-processing) • Two variants of our algorithm: TDN+LDA (high-level features: Latent Dirichlet Allocation topics) and TDN+NE (low-level features: named entities and common noun phrases) • User study involving 27 subjects to evaluate topicality & redundancy

  17. Topicality User Study • Subjects see a post for evaluation, e.g., "Downed jet lifted from ice-laden Hudson River. NEW YORK (AP) - The airliner that was piloted to a safe emergency landing in the Hudson…", alongside a list of reference stories • Question: is this post topical, i.e., is it related to any of the major stories of the day?

  18. Results: Topicality • [Chart: topicality scores for TDN+NE (named entities and common noun phrases as features) and TDN+LDA (LDA topics as features); higher is better] • We do as well as Google & Yahoo!

  19. Evaluation: Redundancy • Subjects see the previously selected posts: • Israel unilaterally halts fire as rockets persist • Downed jet lifted from ice-laden Hudson River • Israeli-trained Gaza doctor loses three daughters and niece to IDF tank shell • … • Question: is this post redundant with respect to any of the previous posts?

  20. Results: Redundancy • [Chart: redundancy scores for TDN+LDA and TDN+NE; lower is better] • Google performs poorly • We do as well as Yahoo!

  21. Results: Coverage • Google: good topicality, but high redundancy • Yahoo!: performs well on both, but uses rich features (CTR, search trends, user voting, etc.) • [Charts: topicality (higher is better) and redundancy (lower is better) for TDN+LDA and TDN+NE] • We do as well as Yahoo! using only text-based features

  22. Results: January 22, 2009 • [Figure: example of the posts selected for January 22, 2009]

  23. Personalization • People have varied interests • Our Goal: learn a personalized coverage function using limited user feedback • [Figure: different users prefer different stories, e.g., Barack Obama vs. Britney Spears]

  24. Approach Overview • Pipeline with personalization: Blogosphere → Feature Extraction → Personalized Coverage Function → Personalized Post Selection

  25. Modeling User Preferences • π_f represents the user's preference for feature f • Want to learn the preference vector π over the features • The weight on feature f combines the feature's importance in the corpus with the user preference π_f • [Figure: example preference vectors (π_1 … π_5) for a politico vs. a sports fan]

  26. Learning User Preferences • Multiplicative Weights Update • [Figure: the preference vector before any feedback, after 1 day of personalization, and after 2 days of personalization]
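
A minimal sketch of one multiplicative-weights step on the preference vector. The reward term (each feature's coverage in the rated post, signed by the rating) and the learning rate eta are illustrative assumptions, not the paper's exact update rule:

```python
def mwu_update(pi, rated_posts, eta=0.5):
    """One multiplicative-weights pass: boost pi_f for features covered
    by liked posts, shrink it for features covered by disliked posts,
    then renormalize so pi stays a distribution."""
    new_pi = dict(pi)
    for post, rating in rated_posts:   # post: feature -> cover_d(f),
        for f, p in post.items():      # rating: +1 (like) or -1 (dislike)
            if f in new_pi:
                new_pi[f] *= (1.0 + eta) ** (rating * p)
    total = sum(new_pi.values())
    return {f: v / total for f, v in new_pi.items()}

pi = {"sports": 0.5, "politics": 0.5}
pi = mwu_update(pi, [({"sports": 0.9}, +1), ({"politics": 0.8}, -1)])
print(pi)  # preference mass shifts toward "sports"
```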

  27. No-Regret Learning • Theorem: for TDN, avg(reward of the π learned using TDN) − avg(reward of the optimal fixed π) → 0 • i.e., we achieve no-regret • The benchmark is the optimal fixed π in hindsight: the best single preference vector one could pick given all the user ratings in advance
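
Written out in generic online-learning notation (a paraphrase; R_t denotes the reward earned on day t, and the benchmark is the best fixed π in hindsight), the guarantee has the standard no-regret form:

```latex
\frac{1}{T}\sum_{t=1}^{T} R_t\bigl(\pi^{(t)}\bigr)
\;-\;
\max_{\pi}\,\frac{1}{T}\sum_{t=1}^{T} R_t(\pi)
\;\longrightarrow\; 0
\qquad \text{as } T \to \infty .
```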

  28. Approach Overview • Full loop: Blogosphere → Feature Extraction → Personalized Coverage Function → Personalized Post Selection (submodular function optimization) → User Feedback → Personalization (online learning), which updates the coverage function

  29. Simulating a Sports Fan • Simulated user likes all posts from Fan House (a sports blog) • Personalization Ratio = Personalized Objective / Unpersonalized Objective • [Chart: personalization ratio vs. days of sports personalization for Fan House (sports blog) and Dead Spin (sports blog), with Huffington Post (politics blog) and the unpersonalized baseline for comparison]

  30. Personalizing for India • Simulated user likes all posts about India and dislikes everything else • After 5 epochs: 1. India keeps up pressure on Pakistan over Mumbai • After 10 epochs: 1. Pakistan's shift alarms the U.S. 3. India among 20 most dangerous places in world • After 15 epochs: 1. 26/11 effect: Pak delegation gets cold vibes 3. Pakistan flaunts its all-weather ties with China 4. Benjamin Button gets 13 Oscar nominations [mentions Slumdog Millionaire] 8. Miliband was not off-message, he toed the UK line on Kashmir • (Numbers are the posts' ranks within the day's selected set)

  31. Personalization User Study • Arm 1: generate personalized posts and obtain user ratings • Arm 2: generate posts without using feedback and obtain user ratings • [Figure: the two pipelines running in parallel from the blogosphere]

  32. Personalization Evaluation • [Chart: mean ratings for personalized vs. unpersonalized posts; higher is better] • Users like personalized posts more than unpersonalized posts

  33. Summary • Formalized covering the blogosphere • Near-optimal optimization algorithm • Learned a personalized coverage function • No-regret learning algorithm • Evaluated on real blog data • Coverage: using only post content, we perform as well as other techniques that use richer features • Successfully tailor post selection to user preferences • www.TurnDownTheNoise.com
