Turning Down the Noise in the Blogosphere
Khalid El-Arini, Gaurav Veda, Dafna Shahaf, Carlos Guestrin
• Millions of blog posts published every day
• Some stories become disproportionately popular
• Hard to find information you care about
Our Goal: Coverage
• Turn down the noise in the blogosphere
• Select a small set of posts that covers the most important stories (example day shown on the slide: January 17, 2009)
Our Goal: Personalization
• Tailor post selection to user tastes
[Slide figure: posts selected without personalization vs. posts after personalization based on Zidane's feedback ("But, I like sports! I want articles like: …")]
Main Contributions
• Formalize the notion of covering the blogosphere
  • Near-optimal solution for post selection
• Learn a personalized coverage function
  • No-regret algorithm for learning user preferences using limited feedback
• Evaluate on real blog data
  • Conduct user studies and compare against existing systems (Yahoo! and Google in the results below)
Approach Overview
[Pipeline diagram: Blogosphere → Feature Extraction → Coverage Function → Post Selection]
Document Features
• Low level: words, noun phrases, named entities (e.g., Obama, China, peanut butter)
• High level: e.g., topics from a topic model; a topic is a probability distribution over words
[Slide figure: example word distributions for an "Inauguration" topic and a "National Security" topic]
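A minimal sketch of extracting the high-level topic features, assuming an off-the-shelf LDA implementation (gensim here; the library choice and the toy posts are ours, not the slide's):

```python
from gensim import corpora, models

# Toy tokenized posts; a real pipeline would also extract the low-level
# features (words, noun phrases, named entities).
posts = [
    ["obama", "inauguration", "washington", "crowd", "speech"],
    ["security", "threat", "capture", "intelligence", "obama"],
    ["jet", "hudson", "river", "landing", "pilot", "rescue"],
]

dictionary = corpora.Dictionary(posts)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in posts]

# Each topic is a probability distribution over words.
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)

# P(topic f | post d): the per-document topic mixture, used later as the
# probabilistic coverage cover_d(f) when topics are the features.
doc_topics = [dict(lda.get_document_topics(bow, minimum_probability=0.0))
              for bow in bow_corpus]
print(doc_topics[0])
```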
Coverage
• cover_d(f) = amount by which document d covers feature f
• cover_A(f) = amount by which a set of posts A covers feature f
[Slide figure: bipartite diagram linking posts to the features they cover]
Simple Coverage: MAX-COVER
• Find k posts that cover the most features
• cover_A(f) = 1 if at least one post in A contains feature f
• Problems with MAX-COVER: it ignores a feature's significance in the document and its significance in the corpus
  • e.g., a post containing "… at George Mason University in Fairfax, Va." would count as covering those features even though it is not about them
Feature Significance in Document
• Solution: define a probabilistic coverage function
• cover_d(f) = P(feature f | post d)
• e.g., with topics as features, cover_d(f) ≡ P(post d is about topic f)
• Example: a post that mentions Washington only in passing is not really about Washington: cover_d(Washington) = 0.01
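A small sketch of single-post coverage for low-level features, estimating P(feature | post) by the feature's relative frequency in the post (an illustrative choice, not necessarily the paper's exact estimator):

```python
from collections import Counter

def cover_d(post_tokens, feature):
    """Probabilistic coverage of feature f by a single post d:
    cover_d(f) = P(feature f | post d), estimated here by the feature's
    relative frequency in the post. With topic features this would instead
    be the post's topic proportion from the topic model."""
    counts = Counter(post_tokens)
    total = sum(counts.values())
    return counts[feature] / total if total else 0.0

# A post that mentions Washington only once in passing: low coverage.
post = ["washington"] + ["economy"] * 99
print(cover_d(post, "washington"))  # 0.01 -- not really about Washington
```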
Feature Significance in Corpus
• Some features are more important than others; we want to cover the important ones
• Solution:
  • Associate a weight w_f with each feature f (e.g., the feature's frequency in the corpus)
  • Cover an important feature using multiple posts
[Slide figure: "Barack Obama" appears far more often in the corpus than "Carlos Guestrin", so it receives a higher weight]
Incremental Coverage
• cover_A(f) = probability that at least one post in set A covers feature f
• Example with feature f = Obama and two posts:
  • "Obama: Tight noose on Bin Laden as good as capture" covers Obama with probability 0.5
  • "What Obama's win means for China" covers Obama with probability 0.4
  • cover_A(Obama) = 1 − P(neither post covers Obama) = 1 − (1 − 0.5)(1 − 0.4) = 0.7
• 0.5 < 0.7 < 0.5 + 0.4: there is a gain from covering a feature with multiple posts, but with diminishing returns
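The incremental coverage formula as a straightforward sketch of cover_A(f) = 1 − Π_{d ∈ A} (1 − cover_d(f)):

```python
def cover_A(per_post_coverage):
    """Incremental coverage of a feature f by a set of posts A:
    cover_A(f) = 1 - prod_{d in A} (1 - cover_d(f)),
    i.e., the probability that at least one post in A covers f."""
    prob_none = 1.0
    for p in per_post_coverage:
        prob_none *= 1.0 - p
    return 1.0 - prob_none

# The slide's example: two posts cover "Obama" with probabilities 0.5 and 0.4.
print(cover_A([0.5, 0.4]))  # 0.7 -- more than either alone, less than 0.5 + 0.4
```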
Post Selection Optimization
• Want to select a set of posts A that maximizes F(A) = Σ_{f ∈ F} w_f · cover_A(f), where F is the feature set, w_f are the weights on features, and cover_A(f) is the probability that set A covers feature f
• This function is submodular
• Exact maximization is NP-hard
• The greedy algorithm gives a (1 − 1/e) ≈ 63% approximation, i.e., a near-optimal solution
• We use CELF (Leskovec et al., 2007)
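A minimal sketch of the weighted coverage objective and a CELF-style lazy greedy selection loop; this is our illustration of the idea, not the authors' implementation, and `post_coverage` / `weights` are hypothetical inputs:

```python
import heapq

def objective(selected, post_coverage, weights):
    """F(A) = sum_f w_f * cover_A(f), with cover_A(f) = 1 - prod_d (1 - cover_d(f)).
    post_coverage[d] maps feature -> cover_d(f); weights maps feature -> w_f."""
    value = 0.0
    for f, w in weights.items():
        prob_none = 1.0
        for d in selected:
            prob_none *= 1.0 - post_coverage[d].get(f, 0.0)
        value += w * (1.0 - prob_none)
    return value

def select_posts(post_coverage, weights, k):
    """Lazy greedy maximization in the spirit of CELF: marginal gains are
    only recomputed when a stale candidate reaches the top of the heap."""
    selected, current_value = [], 0.0
    # Max-heap entries: (-gain, post index, round in which the gain was computed).
    heap = [(-objective([d], post_coverage, weights), d, 0)
            for d in range(len(post_coverage))]
    heapq.heapify(heap)
    while len(selected) < k and heap:
        neg_gain, d, stamp = heapq.heappop(heap)
        if stamp == len(selected):           # gain is current: greedily take d
            selected.append(d)
            current_value = objective(selected, post_coverage, weights)
        else:                                # gain is stale: recompute and re-insert
            gain = objective(selected + [d], post_coverage, weights) - current_value
            heapq.heappush(heap, (-gain, d, len(selected)))
    return selected

# Tiny example: three posts, three features, pick two posts.
coverage = [{"obama": 0.5}, {"obama": 0.4, "china": 0.6}, {"hudson": 0.9}]
print(select_posts(coverage, {"obama": 2.0, "china": 1.0, "hudson": 1.0}, k=2))
```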
Approach Overview
[Pipeline diagram: Blogosphere → Feature Extraction → Coverage Function → Post Selection, with post selection done via submodular function optimization]
Evaluating Coverage
• Evaluate on real blog data from Spinn3r
  • Two-week period in January 2009
  • ~200K posts per day (after pre-processing)
• Two variants of our algorithm:
  • TDN+LDA: high-level features (Latent Dirichlet Allocation topics)
  • TDN+NE: low-level features (named entities and common noun phrases)
• User study involving 27 subjects to evaluate topicality & redundancy
Topicality User Study
• Subjects see a post for evaluation alongside the reference stories of the day
• Example post: "Downed jet lifted from ice-laden Hudson River — NEW YORK (AP) - The airliner that was piloted to a safe emergency landing in the Hudson…"
• Question asked: Is this post topical? i.e., is it related to any of the major stories of the day?
Results: Topicality
[Bar chart: topicality scores (higher is better) for TDN+NE (named entities and common noun phrases as features), TDN+LDA (LDA topics as features), Yahoo!, and Google]
• We do as well as Yahoo! and Google
Evaluation: Redundancy
• Subjects see the previously shown posts, e.g.:
  • Israel unilaterally halts fire as rockets persist
  • Downed jet lifted from ice-laden Hudson River
  • Israeli-trained Gaza doctor loses three daughters and niece to IDF tank shell
  • ...
• Question asked: Is this post redundant with respect to any of the previous posts?
Results: Redundancy
[Bar chart: redundancy scores (lower is better) for TDN+LDA, TDN+NE, Yahoo!, and Google]
• Google performs poorly; we do as well as Yahoo!
Results: Coverage
• Google: good topicality, but high redundancy
• Yahoo!: performs well on both, but uses rich features (CTR, search trends, user voting, etc.)
[Bar charts: topicality (higher is better) and redundancy (lower is better) for TDN+LDA and TDN+NE]
• We do as well as Yahoo! using only post content
Personalization
• People have varied interests (e.g., Barack Obama vs. Britney Spears)
• Our goal: learn a personalized coverage function using limited user feedback
Approach Overview
[Pipeline diagram: Blogosphere → Feature Extraction → Personalized Coverage Function → Personalized Post Selection, with a new Personalization component]
Modeling User Preferences
• π_f represents the user's preference for feature f
• Want to learn the preference π over the features
[Slide figure: importance of each feature in the corpus vs. user preference π_1 … π_5; example π for a politico vs. π for a sports fan]
Learning User Preferences
• Update π with a multiplicative weights update after each round of user feedback
[Slide figure: the preference distribution before any feedback, after 1 day of personalization, and after 2 days of personalization]
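A sketch of one multiplicative-weights step on the preference vector; how the per-feature payoff is built from user ratings, and the step size, are illustrative assumptions rather than the paper's exact update rule:

```python
def update_preferences(pi, feedback, eta=0.5):
    """One multiplicative-weights step on the preference vector pi.
    `feedback` maps a feature to a signed payoff derived from the user's
    ratings of that round's posts (positive when liked posts covered the
    feature, negative when disliked posts did) -- an illustrative assumption."""
    updated = {f: p * (1.0 + eta) ** feedback.get(f, 0.0) for f, p in pi.items()}
    total = sum(updated.values())
    return {f: v / total for f, v in updated.items()}  # keep pi a distribution

# Start uniform; a sports fan's feedback boosts sports features over politics.
pi = {"sports": 0.25, "politics": 0.25, "business": 0.25, "celebrity": 0.25}
pi = update_preferences(pi, {"sports": 1.0, "politics": -1.0})
print(pi)
```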
No-Regret Learning
• Given the user ratings in advance, compare with the optimal fixed preference vector π chosen in hindsight
• Theorem: For TDN, the difference between the average objective achieved with the π learned by TDN and the average objective achieved with the optimal fixed π goes to 0
• i.e., we achieve no regret
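One way to write the statement in symbols (the notation $F_t$, $\pi_t$, $\pi^*$ is ours, not the slide's): with $F_t$ the round-$t$ objective induced by the user's ratings, $\pi_t$ the preferences learned by TDN, and $\pi^*$ the best fixed preference vector in hindsight,

$$\frac{1}{T}\sum_{t=1}^{T} F_t(\pi^*) \;-\; \frac{1}{T}\sum_{t=1}^{T} F_t(\pi_t) \;\to\; 0 \quad \text{as } T \to \infty.$$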
Approach Overview
[Pipeline diagram: Blogosphere → Feature Extraction → Personalized Coverage Function → Personalized Post Selection (submodular function optimization), with user feedback feeding Personalization (online learning)]
Simulating a Sports Fan
• Simulated user likes all posts from Fan House (a sports blog)
• Personalization Ratio = Personalized Objective / Unpersonalized Objective
[Plot: personalization ratio vs. days of sports personalization for Fan House (sports blog), Dead Spin (sports blog), and Huffington Post (politics blog), against the unpersonalized baseline]
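A tiny sketch of the personalization ratio for a fixed selected set A, treating the learned preferences directly as the personalized weights (an illustrative reading of the slide's definition, not the paper's exact formulation):

```python
def personalization_ratio(set_coverage, base_weights, user_prefs):
    """Personalized objective over unpersonalized objective for the same
    selected post set A. `set_coverage` maps each feature to cover_A(f)."""
    personalized = sum(user_prefs.get(f, 0.0) * c for f, c in set_coverage.items())
    unpersonalized = sum(base_weights.get(f, 0.0) * c for f, c in set_coverage.items())
    return personalized / unpersonalized if unpersonalized else float("inf")

# Example: a selection that mostly covers sports features scores higher
# under a sports fan's preferences than under the corpus-wide weights.
print(personalization_ratio({"sports": 0.8, "politics": 0.2},
                            {"sports": 0.5, "politics": 0.5},
                            {"sports": 0.9, "politics": 0.1}))
```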
Personalizing for India
• Simulated user likes all posts about India and dislikes everything else
• After 5 epochs:
  1. India keeps up pressure on Pakistan over Mumbai
• After 10 epochs:
  1. Pakistan's shift alarms the U.S.
  3. India among 20 most dangerous places in world
• After 15 epochs:
  1. 26/11 effect: Pak delegation gets cold vibes
  3. Pakistan flaunts its all-weather ties with China
  4. Benjamin Button gets 13 Oscar nominations [mentions Slumdog Millionaire]
  8. Miliband was not off-message, he toed the UK line on Kashmir
Personalization User Study
• Generate personalized posts and obtain user ratings
• Generate posts without using feedback and obtain user ratings, for comparison
Personalization Evaluation
[Bar chart: user ratings (higher is better) for personalized vs. unpersonalized posts]
• Users like personalized posts more than unpersonalized posts
Summary
• Formalized covering the blogosphere
  • Near-optimal optimization algorithm
• Learned a personalized coverage function
  • No-regret learning algorithm
• Evaluated on real blog data
  • Coverage: using only post content, we perform as well as other techniques that use richer features
  • Successfully tailor post selection to user preferences
www.TurnDownTheNoise.com