Explore how Twitter data can predict public opinion trends by analyzing sentiment and volume correlations with polling results. The study showcases approaches using sentiment analysis and statistical modeling to gauge public sentiment on political and economic matters.
Who Needs Polls? Gauging Public Opinion from Twitter Data
David Cummings, Haruki Oh, Ningxuan (Jason) Wang
From Tweets to Poll Numbers
• Motivation: People spend millions of dollars on polling every year: politics, economy, entertainment
• Millions of posts on Twitter every day
• Can we model public opinion using tweets?
• Data: 476 million tweets from June to December 2009, courtesy of Jure Leskovec
• Public polls from The Gallup Organization (presidential approval, economic confidence) and Rasmussen Reports (generic Congressional ballot)
• Goal: high correlation with public opinion polls
• All correlation figures use a 6-day smoothing window unless noted otherwise
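The evaluation described on this slide (smooth a daily series, then correlate it with poll numbers) can be sketched as follows. This is a minimal illustration, not the authors' code: the slides do not say whether the window is trailing or centered, so a trailing moving average is assumed here, and the function names `smooth` and `pearson` are hypothetical.

```python
from math import sqrt

def smooth(values, window=6):
    """Trailing moving average (assumed): each point averages up to
    `window` of the most recent values, including the current one."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

A daily tweet-derived signal would be passed through `smooth` and then compared with the poll series for the same days using `pearson`; the percentages on the slides correspond to this coefficient multiplied by 100.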
Approach 1: Volume
• The simplest metric: percentage of tweets that mention a given topic in a certain time window
• Moderate negative correlation (-36.3%, -35.7%) for economy and Congressional ballot: people may mention a topic more often when they want to complain about it
• Higher correlation (52.4%) for Obama
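The volume metric can be sketched as a per-day fraction of tweets that mention a topic. The input format (pairs of date and text), the keyword matching by lowercase substring, and the name `daily_volume` are assumptions made for illustration; the slides do not describe how topic mentions were detected.

```python
from collections import defaultdict
from datetime import date

def daily_volume(tweets, keywords):
    """tweets: iterable of (day, text) pairs.
    Returns {day: fraction of that day's tweets mentioning any keyword}."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for day, text in tweets:
        total[day] += 1
        lowered = text.lower()
        # Assumed matching rule: case-insensitive substring search.
        if any(keyword in lowered for keyword in keywords):
            hits[day] += 1
    return {day: hits[day] / total[day] for day in total}

sample = [
    (date(2009, 6, 1), "Obama speaks tonight"),
    (date(2009, 6, 1), "having lunch"),
    (date(2009, 6, 2), "nothing to see"),
]
volume = daily_volume(sample, ["obama"])
```

The resulting daily fractions form the series that would then be smoothed over the time window and correlated with the polls.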
Approach 2: Generic Sentiment
• Can we distinguish between positive and negative sentiment of tweets?
• University of Pittsburgh OpinionFinder subjective polarity lexicon
    • “conceited”: strong negative, -10
    • “ironic”: weak negative, -5
    • “trendy”: weak positive, +5
    • “illuminating”: strong positive, +10
• Sum word scores for a tweet to classify it as positive, negative, or neutral; then subtract negative counts from positive counts and normalize over window
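The scoring procedure on this slide can be sketched as below. The four lexicon entries are the examples from the slide; the real lexicon is far larger. The tokenizer, the function names, and the choice to normalize by the total number of tweets in the window are assumptions, since the slide only says "normalize over window".

```python
import re

# Only the four example entries shown on the slide.
LEXICON = {"conceited": -10, "ironic": -5, "trendy": 5, "illuminating": 10}

def classify(text, lexicon=LEXICON):
    """Sum the word scores of a tweet and map the sign to a label."""
    words = re.findall(r"[a-z']+", text.lower())  # assumed tokenizer
    score = sum(lexicon.get(word, 0) for word in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def window_sentiment(texts, lexicon=LEXICON):
    """(positive count - negative count) / total tweets in the window.
    Dividing by the total is an assumed normalization."""
    labels = [classify(text, lexicon) for text in texts]
    if not labels:
        return 0.0
    return (labels.count("positive") - labels.count("negative")) / len(labels)
```

For example, a window containing two positive tweets, one negative tweet, and one neutral tweet would score (2 - 1) / 4 = 0.25 under this normalization.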
Approach 2: Generic Sentiment
• Good results on economic confidence: 60.4% correlation (70.1% with a 15-day window)
• Poor performance on presidential approval and Congressional ballot: -24.5% and 21.5% correlation respectively
• Sentiment about politics expressed differently?
Approach 3: LM-based Classification
• Train three language models (positive, negative, and neutral) on hand-classified data
• Classify each tweet according to the language model that assigns it the highest probability
• Applied to the case of Obama: manually classified 3,633 tweets
    • “can we all talk about how awesome Obama is?”
    • “that Obama sticker on your car might as well say ‘Yes I’m stupid’ #tcot #iamthemob #teaparty #glennbeck”
• Then we tested the language models: the best performer was a linearly interpolated bigram model
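A linearly interpolated bigram model and the "pick the most probable class" rule can be sketched as follows. This is an illustrative toy, not the authors' implementation: the interpolation weight `lam`, the add-one smoothing of the unigram term, the sentence padding symbols, and the class and function names are all assumptions.

```python
import math
from collections import Counter

class InterpolatedBigramLM:
    """P(w | prev) = lam * P_bigram(w | prev) + (1 - lam) * P_unigram(w)."""

    def __init__(self, sentences, lam=0.7):
        self.lam = lam  # assumed interpolation weight
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.context = Counter()  # how often each token appears as a history
        for tokens in sentences:
            padded = ["<s>"] + tokens + ["</s>"]
            self.unigrams.update(padded[1:])
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[(prev, cur)] += 1
                self.context[prev] += 1
        self.total = sum(self.unigrams.values())
        self.vocab = len(self.unigrams) + 1  # one extra slot for unseen words

    def log_prob(self, tokens):
        padded = ["<s>"] + tokens + ["</s>"]
        log_p = 0.0
        for prev, cur in zip(padded, padded[1:]):
            # Add-one smoothed unigram keeps every probability above zero.
            p_uni = (self.unigrams[cur] + 1) / (self.total + self.vocab)
            seen = self.context[prev]
            p_bi = self.bigrams[(prev, cur)] / seen if seen else 0.0
            log_p += math.log(self.lam * p_bi + (1 - self.lam) * p_uni)
        return log_p

def classify(tokens, models):
    """Return the label whose language model scores the tokens highest."""
    return max(models, key=lambda label: models[label].log_prob(tokens))

# Tiny made-up training sets, one per class, for demonstration only.
models = {
    "positive": InterpolatedBigramLM([["obama", "is", "awesome"], ["love", "obama"]]),
    "negative": InterpolatedBigramLM([["obama", "is", "stupid"], ["hate", "obama"]]),
    "neutral": InterpolatedBigramLM([["obama", "gave", "a", "speech"]]),
}
```

With these toy models, `classify(["obama", "is", "awesome"], models)` selects the positive model. In practice the interpolation weight would be tuned on held-out data rather than fixed.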
Approach 3: LM-based Classification
• Much-improved results on presidential approval: 49.4% correlation
• Throwing out retweets and duplicate tweets helps a little more: 55.9% correlation
• Finally, combining both volume and LM-based sentiment gives the best results: 63.3% correlation (69.6% with a 15-day window)
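The retweet and duplicate filtering step might look like the sketch below. The slides do not say how retweets or duplicates were identified, so both rules here are assumptions: a tweet counts as a retweet if it begins with "RT @", and two tweets count as duplicates if they match after lowercasing and collapsing whitespace. The function name is hypothetical.

```python
import re

def drop_retweets_and_duplicates(texts):
    """Keep the first occurrence of each distinct tweet and skip retweets."""
    seen = set()
    kept = []
    for text in texts:
        # Assumed retweet marker: text starting with "RT @".
        if re.match(r"\s*RT\s+@", text, flags=re.IGNORECASE):
            continue
        # Assumed duplicate key: lowercase with whitespace collapsed.
        key = " ".join(text.lower().split())
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept

filtered = drop_retweets_and_duplicates(
    ["Obama rocks", "RT @bob: Obama rocks", "obama  rocks", "Other"]
)
```

The filtered tweets would then be classified and aggregated as before.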