Towards Detecting Influenza Epidemics by Analyzing Twitter Messages
Aron Culotta
Jedsada Chartree
Introduction
• Growing interest in monitoring disease outbreaks.
• Growth in Twitter usage:
  - February 2010: 50 million tweets/day
  - June 2010: 65 million tweets/day (about 750 tweets/s)
  - 190 million users
Source: http://en.wikipedia.org/wiki/Twitter
Introduction
• Twitter is a website that offers a social networking and micro-blogging service.
  - Users send and read messages called “tweets” (up to 140 characters).
Introduction
• Advantages of Twitter for this research
  - Full messages provide more information than search queries.
  - Twitter profiles contain more detail to analyze (city, state, gender, age).
  - Twitter users are diverse.
Methodology
• Data
  - Collected 574,643 messages over 10 weeks (February 12, 2010 to April 24, 2010).
  - The US Centers for Disease Control and Prevention (CDC) publishes ILI statistics through the US Outpatient Influenza-like Illness Surveillance Network (ILINet).
Methodology
• Ground-truth ILI rates obtained from the CDC statistics
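Methodology
• Regression Models
1. Simple linear regression

$$\operatorname{logit}(P) = \beta_1 \operatorname{logit}(Q(W, D)) + \beta_2 + \epsilon$$

  - $P$ = the proportion of the population exhibiting ILI symptoms
  - $\beta_1, \beta_2$ = the coefficients
  - $\epsilon$ = the error term
  - $Q(W, D) = |D_W| / |D|$ = the fraction of documents in $D$ that match $W$
  - $D$ = a document collection
  - $|D_W|$ = the document frequency for word $W$ (the number of documents in $D$ that match $W$)
  - $\operatorname{logit}(x) = \ln\left(\frac{x}{1 - x}\right)$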
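As a rough illustration of the simple model, the sketch below fits a least-squares line in logit space, assuming the model takes the form logit(P) = b1 · logit(Q) + b2 + error, where Q is the weekly fraction of messages that match the keyword. The function names (`fit_simple`, `predict`) and the weekly values are hypothetical and made up for illustration; they are not data from the study.

```python
import math

def logit(x):
    """Log-odds of a proportion x, with 0 < x < 1."""
    return math.log(x / (1.0 - x))

def inv_logit(z):
    """Inverse of logit: maps a real number back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_simple(q_values, p_values):
    """Ordinary least-squares fit of logit(P) = b1 * logit(Q) + b2."""
    xs = [logit(q) for q in q_values]
    ys = [logit(p) for p in p_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx                 # slope
    b2 = mean_y - b1 * mean_x      # intercept
    return b1, b2

def predict(q, b1, b2):
    """Estimated ILI proportion for a keyword-matching fraction q."""
    return inv_logit(b1 * logit(q) + b2)

# Hypothetical weekly values (assumptions, not real measurements).
weekly_q = [0.010, 0.012, 0.015, 0.011, 0.009]   # fraction of matching tweets
weekly_p = [0.020, 0.023, 0.027, 0.021, 0.018]   # ground-truth ILI proportion
b1, b2 = fit_simple(weekly_q, weekly_p)
print(predict(0.013, b1, b2))
```

Working in logit space keeps every prediction inside (0, 1) once it is mapped back, which suits a quantity that is a proportion.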
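Methodology
• Regression Models
2. Multiple linear regression

$$\operatorname{logit}(P) = \beta_1 \operatorname{logit}(Q(W_1, D)) + \dots + \beta_k \operatorname{logit}(Q(W_k, D)) + \beta_{k+1} + \epsilon$$

  - $P$ = the proportion of the population exhibiting ILI symptoms
  - $\beta_1, \dots, \beta_{k+1}$ = the coefficients
  - $\epsilon$ = the error term
  - $Q(W_i, D) = |D_{W_i}| / |D|$ = the fraction of documents in $D$ that match $W_i$
  - $D$ = a document collection
  - $|D_{W_i}|$ = the document frequency for word $W_i$ (the number of documents in $D$ that match $W_i$)
  - $\operatorname{logit}(x) = \ln\left(\frac{x}{1 - x}\right)$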
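The multiple-keyword variant can be sketched the same way, assuming one logit-transformed matching fraction per keyword plus an intercept, fitted by ordinary least squares through the normal equations. The helper names (`solve`, `fit_multiple`) and the sample numbers are hypothetical; a numerical library would normally replace the hand-written solver.

```python
import math

def logit(x):
    """Log-odds of a proportion x, with 0 < x < 1."""
    return math.log(x / (1.0 - x))

def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting.
    Assumes the matrix a is square and non-singular."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_multiple(q_rows, p_values):
    """q_rows[t][i] is the matching fraction for keyword i in week t.
    Returns [b1, ..., bk, b_(k+1)], where the last entry is the intercept."""
    design = [[logit(q) for q in row] + [1.0] for row in q_rows]
    ys = [logit(p) for p in p_values]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data: two keywords observed over five weeks (assumptions).
q_rows = [[0.010, 0.020], [0.020, 0.015], [0.050, 0.030],
          [0.030, 0.050], [0.040, 0.010]]
p_values = [0.020, 0.024, 0.035, 0.033, 0.025]
print(fit_multiple(q_rows, p_values))
```

With only ten weekly observations, as in the data described earlier, adding many keywords makes overfitting a real risk, so the number of regressors should stay small.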
Methodology
• Keyword Selection
1. Correlation coefficient
  - Evaluates the fit of the simple linear regression model.
2. Residual sum of squares (RSS)
  - Measures the discrepancy between the data and the model's estimates.
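Both selection criteria are standard statistics, and a minimal sketch of each follows. The function names are illustrative, and how the scores are then used to rank candidate keywords is left as an assumption rather than taken from the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def rss(observed, predicted):
    """Residual sum of squares: total squared gap between data and estimates."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))
```

A higher correlation and a lower RSS both indicate a better fit, but they can rank keywords differently because correlation ignores the scale of the predictions while RSS does not.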
Methodology
• Keyword Generation
  - Hand-chosen keywords (flu, cough, sore throat, headache)
  - Most frequent keywords
    - Search for all documents containing any of the hand-chosen keywords.
    - Find the top 5,000 most frequently occurring words.
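The frequent-keyword step can be sketched as a filter followed by a word count. This is a minimal sketch under stated assumptions: lowercasing, whitespace tokenization, and substring matching of the seed keywords are choices made here for brevity and are not described in the slides, and the function name `top_words` is hypothetical.

```python
from collections import Counter

HAND_CHOSEN = ["flu", "cough", "sore throat", "headache"]

def top_words(messages, seeds=HAND_CHOSEN, n=5000):
    """Return the n most frequent words among messages that mention any seed.

    Assumption: a message "contains" a seed if the seed appears as a
    substring of the lowercased text, so "flu" also matches "influenza".
    """
    counts = Counter()
    for message in messages:
        text = message.lower()
        if any(seed in text for seed in seeds):
            counts.update(text.split())
    return [word for word, _ in counts.most_common(n)]

sample = ["I have the flu", "flu flu season", "nice weather today"]
print(top_words(sample, n=3))
```

In practice stop words such as "the" dominate raw counts, so a later selection step (for example ranking by correlation or RSS) is what separates useful keywords from frequent but uninformative ones.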
Methodology
• Document Filtering
  - Apply logistic regression to predict whether a Twitter message reports an ILI symptom.
  - $y_i$ = a binary random variable (1 if document $D_i$ is positive, 0 otherwise)
  - $x_i = \{x_{ij}\}$, where $x_{ij}$ = the number of times word $j$ appears in document $i$
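A bag-of-words logistic regression filter of this kind can be sketched as follows. In its standard form the model sets p(y = 1 | x) = 1 / (1 + exp(-x · θ)), where θ is a weight vector learned from labeled messages. The training procedure here (plain stochastic gradient ascent), the added bias feature, the tiny vocabulary, and the labeled examples are all assumptions for illustration, not the setup used in the study.

```python
import math
from collections import Counter

def features(text, vocab):
    """Word-count vector over vocab, plus a constant bias feature."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab] + [1.0]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train(xs, ys, epochs=500, lr=0.1):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    theta = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(sum(t * v for t, v in zip(theta, x)))
            for j, v in enumerate(x):
                theta[j] += lr * (y - p) * v
    return theta

def predict(theta, x):
    """Label 1 if the estimated probability of a positive is at least 0.5."""
    return 1 if sigmoid(sum(t * v for t, v in zip(theta, x))) >= 0.5 else 0

# Hypothetical vocabulary and labeled messages (1 = reports an ILI symptom).
vocab = ["flu", "fever", "homework"]
docs = ["i have the flu", "fever and flu today", "so much homework", "homework again"]
labels = [1, 1, 0, 0]
xs = [features(d, vocab) for d in docs]
theta = train(xs, labels)
print([predict(theta, x) for x in xs])
```

Filtering messages this way before computing matching fractions is meant to drop tweets that mention a keyword without reporting an illness.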
Methodology
• Classification evaluation
  - Accuracy
  - Precision
  - Recall
  - F-measure
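The four metrics follow directly from the counts of true and false positives and negatives. The sketch below assumes binary labels (1 = positive) and takes "F-measure" to mean the balanced F1 score, which is an assumption since the slides do not give the weighting; the function name `evaluate` is illustrative.

```python
def evaluate(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

scores = evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(scores)
```

Precision and recall matter more than accuracy when positive messages are rare, because a classifier that labels everything negative can still score a high accuracy.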
Results
• Document Filtering
  - Evaluation of message classification, with standard error in parentheses
Results
• Regression
  - The 10 different systems evaluated
Results
• Regression
  - The correlation coefficient (r), residual sum of squares (RSS), and standard error of each system
Results
• Results for multi-hand-rss(2)
• Results for classification-hand
Results
• Results for multi-freq-rss(3)
• Results for simple-hand-rss(1)
Results
• Correlation results for simple-hand-rss and multi-hand-rss
• Correlation results for simple-hand-corr and multi-hand-corr
Results
• Correlation results for simple-freq-rss and multi-freq-rss
• Correlation results for simple-freq-corr and multi-freq-corr
Conclusion
• Explored several methods to identify influenza-related messages.
• Compared a number of regression models that correlate the messages with CDC statistics.
• The best model achieves a correlation of 0.78.