Turning Down the Noise in the Blogosphere Khalid El-Arini, Gaurav Veda, Dafna Shahaf, Carlos Guestrin
Millions of blog posts published every day • Some stories become disproportionately popular • Hard to find information you care about
Our Goal: Coverage
• Turn down the noise in the blogosphere
• Select a small set of posts that covers the most important stories (example: posts selected for January 17, 2009)
Our Goal: Personalization
• Tailor post selection to user tastes
• Example: a user ("Zidane") who likes sports; compare posts selected without personalization against posts selected after personalization based on Zidane's feedback
Main Contributions
• Formalize the notion of covering the blogosphere
• Near-optimal solution for post selection
• Learn a personalized coverage function
• No-regret algorithm for learning user preferences using limited feedback
• Evaluate on real blog data: conduct user studies and compare against Google and Yahoo!
Approach Overview
Blogosphere → Feature Extraction → Coverage Function → Post Selection
Document Features
• Low level: words, noun phrases, named entities (e.g., Obama, China, peanut butter)
• High level: e.g., topics from a topic model, where a topic is a probability distribution over words (example topics: Inauguration, National Security)
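As a rough illustration of the low-level feature extraction step, the sketch below pulls named entities and noun phrases from a post with spaCy. The library choice, model name, and helper function are assumptions for illustration only, not the system's actual pipeline.

```python
# Sketch: extracting low-level document features (named entities and noun
# phrases). spaCy and the "en_core_web_sm" model are illustrative choices;
# the original system uses its own extraction pipeline.
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

def low_level_features(text):
    """Return a bag of named entities and noun phrases for one post."""
    doc = nlp(text)
    feats = Counter()
    for ent in doc.ents:            # named entities, e.g. "Obama", "China"
        feats[ent.text.lower()] += 1
    for np in doc.noun_chunks:      # common noun phrases, e.g. "peanut butter"
        feats[np.text.lower()] += 1
    return feats

print(low_level_features("Obama discussed trade with China over peanut butter sandwiches."))
```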
Coverage
• cover_d(f) = amount by which document d covers feature f
• cover_A(f) = amount by which a set of posts A covers feature f
Simple Coverage: MAX-COVER
• Find k posts that cover the most features
• cover_A(f) = 1 if at least one post in A contains feature f, 0 otherwise
• Problems with MAX-COVER: it ignores a feature's significance in the document (e.g., a post that merely mentions "… at George Mason University in Fairfax, Va." in passing) and its significance in the corpus
Feature Significance in Document
• Solution: define a probabilistic coverage function cover_d(f) = P(feature f | post d)
• e.g., a post that is not really about Washington gets cover_d(Washington) = 0.01
• With topics as features, cover_d(f) ≡ P(post d is about topic f)
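A minimal sketch of this probabilistic coverage, assuming topic distributions already computed by a topic model; the document names, topic names, and probabilities below are made up for illustration.

```python
# Minimal sketch of probabilistic document coverage with topics as features.
# Topic distributions are assumed to come from a topic model (e.g. LDA);
# the numbers below are illustrative only.
doc_topics = {
    "hudson_landing_post": {"aviation": 0.62, "new_york": 0.30, "politics": 0.03},
    "inauguration_post":   {"politics": 0.71, "obama": 0.24, "aviation": 0.01},
}

def cover_d(doc_id, feature):
    """cover_d(f) = P(feature f | post d): how much post d is about topic f."""
    return doc_topics[doc_id].get(feature, 0.0)

print(cover_d("hudson_landing_post", "aviation"))   # 0.62: really about the topic
print(cover_d("inauguration_post", "aviation"))     # 0.01: only an incidental mention
```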
Feature Significance in Corpus
• Some features are more important than others; we want to cover the important ones (e.g., Barack Obama vs. Carlos Guestrin)
• Solution: associate a weight w_f with each feature f, e.g., the frequency of the feature in the corpus
• Cover an important feature using multiple posts
Incremental Coverage
• cover_A(f) = probability that at least one post in set A covers feature f = 1 − ∏_{d ∈ A} (1 − cover_d(f))
• Example (feature: Obama): "Obama: Tight noose on Bin Laden as good as capture" has cover 0.5, "What Obama's win means for China" has cover 0.4; together, cover = 1 − (1 − 0.5)(1 − 0.4) = 0.7
• Each post alone covers less than 0.7, and 0.7 < 0.5 + 0.4: there is a gain from covering a feature with multiple posts, but with diminishing returns
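The diminishing-returns behavior can be seen in a few lines of code; this is a sketch of the noisy-OR set coverage described above, using the 0.5 and 0.4 example values.

```python
# Sketch of incremental (probabilistic) set coverage: a feature is covered by a
# set of posts with the probability that at least one post covers it (noisy-OR).
def cover_set(covers):
    """covers: per-post cover_d(f) values for one feature f."""
    p_none = 1.0
    for c in covers:
        p_none *= (1.0 - c)
    return 1.0 - p_none

post1, post2 = 0.5, 0.4
print(cover_set([post1]))          # 0.5
print(cover_set([post1, post2]))   # 0.7 -- more than either post alone,
                                   # but less than 0.5 + 0.4: diminishing returns
```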
Post Selection Optimization
• Want to select a set of posts A that maximizes F(A) = Σ_f w_f · cover_A(f), summing over the feature set, where w_f is the weight on feature f and cover_A(f) is the probability that set A covers feature f
• This function is submodular; exact maximization is NP-hard
• The greedy algorithm gives a (1 − 1/e) ≈ 63% approximation, i.e., a near-optimal solution
• We use CELF (Leskovec et al., 2007), a lazy-greedy speedup
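Below is a minimal lazy-greedy (CELF-style) sketch of post selection under the weighted noisy-OR objective above; the data layout and function names are assumptions for illustration, not the authors' implementation.

```python
# Minimal lazy-greedy (CELF-style) sketch for selecting k posts that maximize
# F(A) = sum_f w_f * (1 - prod_{d in A} (1 - cover[d][f])).
# `cover`, `weights`, and `k` are assumed inputs.
import heapq

def objective(selected, cover, weights):
    total = 0.0
    for f, w in weights.items():
        p_none = 1.0
        for d in selected:
            p_none *= (1.0 - cover[d].get(f, 0.0))
        total += w * (1.0 - p_none)
    return total

def select_posts(cover, weights, k):
    selected, current = [], 0.0
    # max-heap of (negated) marginal gains; stale entries are re-evaluated lazily
    heap = [(-objective([d], cover, weights), d, 0) for d in cover]
    heapq.heapify(heap)
    while heap and len(selected) < k:
        neg_gain, d, stamp = heapq.heappop(heap)
        if stamp == len(selected):            # gain is up to date: take the post
            selected.append(d)
            current = objective(selected, cover, weights)
        else:                                 # recompute gain w.r.t. current set
            gain = objective(selected + [d], cover, weights) - current
            heapq.heappush(heap, (-gain, d, len(selected)))
    return selected

cover = {"post1": {"obama": 0.5}, "post2": {"obama": 0.4, "china": 0.8}}
weights = {"obama": 1.0, "china": 0.5}
print(select_posts(cover, weights, k=1))      # ['post2']
```

Lazy evaluation exploits submodularity: a post's marginal gain can only shrink as the selected set grows, so stale heap entries need re-evaluation only when they reach the top of the heap.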
Approach Overview
Blogosphere → Feature Extraction → Coverage Function → Post Selection (submodular function optimization)
Evaluating Coverage
• Evaluate on real blog data from Spinn3r: a 2-week period in January, ~200K posts per day (after pre-processing)
• Two variants of our algorithm: TDN+LDA (high-level features: Latent Dirichlet Allocation topics) and TDN+NE (low-level features: named entities)
• User study involving 27 subjects to evaluate topicality and redundancy
Topicality User Study
• Users see a post for evaluation (e.g., "Downed jet lifted from ice-laden Hudson River. NEW YORK (AP) - The airliner that was piloted to a safe emergency landing in the Hudson…") alongside a set of reference stories
• Question: is this post topical, i.e., is it related to any of the major stories of the day?
Results: Topicality (higher is better)
• Chart compares TDN+NE (named entities and common noun phrases as features) and TDN+LDA (LDA topics as features) against Google and Yahoo!
• We do as well as Google & Yahoo!
Evaluation: Redundancy
• Users see an ordered list of posts, e.g.: "Israel unilaterally halts fire as rockets persist"; "Downed jet lifted from ice-laden Hudson River"; "Israeli-trained Gaza doctor loses three daughters and niece to IDF tank shell"; ...
• Question: is this post redundant with respect to any of the previous posts?
Results: Redundancy (lower is better)
• Chart compares TDN+LDA and TDN+NE against Google and Yahoo!
• Google performs poorly; we do as well as Yahoo!
Results: Coverage
• Google: good topicality, but high redundancy
• Yahoo!: performs well on both topicality (higher is better) and redundancy (lower is better), but uses rich features (CTR, search trends, user voting, etc.)
• We do as well as Yahoo! using only text-based features
Personalization
• People have varied interests (e.g., Barack Obama vs. Britney Spears)
• Our goal: learn a personalized coverage function using limited user feedback
Approach Overview
Blogosphere → Feature Extraction → Personalized Coverage Function → Personalized Post Selection, with personalization feeding user feedback back into the coverage function
Modeling User Preferences
• π_f represents the user's preference for feature f, alongside the feature's importance in the corpus
• Want to learn the preference vector π over the features (e.g., π for a politico looks very different from π for a sports fan)
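By analogy with the unpersonalized objective, one plausible way to write the personalized coverage function is with the learned preferences playing the role of the corpus weights; the exact combination used in the paper is not reproduced here, so treat this as a sketch.

```latex
% Plausible form of the personalized objective, by analogy with
% F(A) = \sum_f w_f \,\mathrm{cover}_A(f); the paper's exact rule may differ.
F_\pi(A) \;=\; \sum_{f \in \mathcal{F}} \pi_f \cdot \mathrm{cover}_A(f)
```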
Learning User Preferences
• Multiplicative weights update on π after each round of user feedback
• Figure: posts selected before any feedback, after 1 day of personalization, and after 2 days of personalization
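A generic multiplicative-weights sketch of the preference update is below; the feedback signal, learning rate, and normalization are illustrative assumptions rather than the paper's exact rule.

```python
# Generic multiplicative-weights sketch for learning the preference vector pi
# over features from per-round feedback. The reward signal and learning rate
# are illustrative; the paper's exact update is not reproduced here.
def mw_update(pi, rewards, eta=0.5):
    """pi: dict feature -> preference; rewards: dict feature -> value in [-1, 1],
    e.g. +1 if liked posts covered the feature, -1 if disliked posts did."""
    new_pi = {f: p * (1.0 + eta) ** rewards.get(f, 0.0) for f, p in pi.items()}
    z = sum(new_pi.values())                    # renormalize to a distribution
    return {f: p / z for f, p in new_pi.items()}

pi = {"sports": 0.25, "politics": 0.25, "tech": 0.25, "celebrity": 0.25}
pi = mw_update(pi, {"sports": 1.0, "celebrity": -1.0})   # user liked sports posts
print(pi)   # mass shifts toward "sports" and away from "celebrity"
```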
No-Regret Learning
• Given the user ratings in advance, compare against the optimal fixed π (in hindsight)
• Theorem: for TDN, avg(reward of the π learned using TDN) − avg(reward of the optimal fixed π) → 0
• i.e., we achieve no-regret
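In standard no-regret form, and in this talk's notation, the guarantee can be written as follows; the precise rate and constants proved in the paper are not restated here.

```latex
% F_t is the feedback-derived reward at round t, \pi_t the preference vector
% played at round t, and \pi^\ast the best fixed preference in hindsight.
\frac{1}{T}\sum_{t=1}^{T} F_t(\pi_t) \;-\; \frac{1}{T}\sum_{t=1}^{T} F_t(\pi^\ast)
\;\xrightarrow[\;T \to \infty\;]{}\; 0,
\qquad \pi^\ast = \arg\max_{\pi} \sum_{t=1}^{T} F_t(\pi)
```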
Approach Overview
Blogosphere → Feature Extraction → Personalized Coverage Function → Personalized Post Selection (submodular function optimization), with user feedback driving personalization via online learning
Simulating a Sports Fan
• Simulated user likes all posts from Fan House (a sports blog)
• Personalization Ratio = Personalized Objective / Unpersonalized Objective
• Plot: personalization ratio vs. days of sports personalization for Fan House (sports blog), Dead Spin (sports blog), and Huffington Post (politics blog), against the unpersonalized baseline
Personalizing for India • Like all posts about India • Dislike everything else • After 5 epochs: • 1. India keeps up pressure on Pakistan over Mumbai • After 10 epochs: • 1. Pakistan’s shift alarms the U.S. • 3. India among 20 most dangerous places in world • After 15 epochs: • 1. 26/11 effect: Pak delegation gets cold vibes • 3. Pakistan flaunts its all weather ties with China • 4. Benjamin Button gets 13 Oscar nominations [mentions Slumdog Millionaire] • 8. Miliband was not off-message, he toed the UK line on Kashmir
Personalization User Study
• Generate personalized posts from the blogosphere and obtain user ratings
• Generate posts from the same blogosphere without using feedback and obtain user ratings
Personalization Evaluation (higher is better)
• Users like personalized posts more than unpersonalized posts
Summary
• Formalized covering the blogosphere; near-optimal optimization algorithm
• Learned a personalized coverage function; no-regret learning algorithm
• Evaluated on real blog data
• Coverage: using only post content, we perform as well as other techniques that use richer features
• Successfully tailor post selection to user preferences
www.TurnDownTheNoise.com