Explore the impact of propagation in data mining models across various domains, such as epidemiology, online diffusion, and public health. Understand how dynamical processes over networks influence social collaboration, information diffusion, viral marketing, and more. Discover insights into disease surveillance, flu forecasting, and hidden flu-state topic modeling using propagation patterns. Analyze the challenges and potential improvements in leveraging propagation techniques for better disease control and information dissemination.
Leveraging Propagation for Data Mining: Models, Algorithms, Applications. B. Aditya Prakash, Department of Computer Science. Social Computing Workshop, ARL, Sept 28, 2016
Dynamical Processes over networks are also everywhere! Prakash 2016
Why do we care? • Social collaboration • Information Diffusion • Viral Marketing • Epidemiology and Public Health • Cyber Security • Human mobility • Games and Virtual Worlds • Ecology • ........ Prakash 2016
Why do we care? (1: Epidemiology) • Dynamical Processes over networks [AJPH 2007] CDC data: Visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts Diseases over contact networks Prakash 2016
Why do we care? (1: Epidemiology) • Dynamical Processes over networks • Each circle is a hospital • ~3000 hospitals • More than 30,000 patients transferred [US-MEDICARE NETWORK 2005] Problem: Given k units of disinfectant, whom to immunize?
Why do we care? (1: Epidemiology) [US-MEDICARE NETWORK 2005] CURRENT PRACTICE vs. OUR METHOD: ~6x fewer! Hospital-acquired infections took 99K+ lives and cost $5B+ (all per year)
Why do we care? (2: Online Diffusion) > 800m users, ~$1B revenue [WSJ 2010] ~100m active users > 50m users Prakash 2016
Why do we care? (2: Online Diffusion) • Dynamical Processes over networks Buy Versace™! Followers Celebrity Social Media Marketing Prakash 2016
Why do we care? (3: To change the world?) • Dynamical Processes over networks Social networks and Collaborative Action Prakash 2016
High Impact – Multiple Settings epidemic out-breaks Q. How to squash rumors faster? Q. How do opinions spread? Q. How to market better? products/viruses transmit s/w patches Prakash 2016
Research Theme ANALYSIS Understanding POLICY/ ACTION Managing DATA Large real-world networks & processes Prakash 2016
Research Theme – Social Media ANALYSIS # cascades in future? POLICY/ ACTION How to market better? DATA Modeling Tweets spreading Prakash 2016
Research Theme – Public Health ANALYSIS Will an epidemic happen? POLICY/ ACTION How to control out-breaks? DATA Modeling # patient transfers Prakash 2016
In this talk Using propagation for _________ Q1: Syndromic Surveillance Q2: Memes, Tweets, Blogs Q3: Summarization & Communities. Applications Large real-world networks & processes
Applications Using propagation for _________ • Q1: Syndromic Surveillance • Q2: Memes, Tweets, Blogs • Q3: General Graph Mining Prakash 2016
Surveillance [Chen et al. ICDM 2014] • How to estimate and predict flu trends? Surveillance Report Hospital record Lab survey Population survey
GFT & Twitter • Estimate flu trends using online electronic sources
Flu forecasting • Twitter – a surrogate for flu forecasting? • Google Flu Trends: using keywords to track the flu season • Can we get more specific? • Consider: Prakash 2016
“Propagation” ideas • Can we develop better disease surveillance tools by leveraging • How flu-related information propagates on Twitter • Epidemiological models Prakash 2016
Observation 1: States • There are different states in an infection cycle. • SEIR model: 1. Susceptible 2. Exposed 3. Infected 4. Recovered
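The four states above can be illustrated with a minimal sketch of the textbook SEIR compartmental model, integrated with forward Euler steps. The rate names (`beta`, `sigma`, `gamma`), their values, and the initial fractions are illustrative assumptions, not values from the talk.

```python
def simulate_seir(beta, sigma, gamma, s0, e0, i0, r0, days, dt=0.1):
    """Forward-Euler integration of the standard SEIR equations.

    All compartments are population fractions. beta is the contact rate,
    sigma the rate of leaving Exposed, gamma the recovery rate
    (all hypothetical values chosen for illustration).
    """
    s, e, i, r = s0, e0, i0, r0
    history = [(s, e, i, r)]
    steps_per_day = int(round(1 / dt))
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i * dt    # Susceptible -> Exposed
            new_infected = sigma * e * dt      # Exposed -> Infected
            new_recovered = gamma * i * dt     # Infected -> Recovered
            s -= new_exposed
            e += new_exposed - new_infected
            i += new_infected - new_recovered
            r += new_recovered
        history.append((s, e, i, r))           # one snapshot per day
    return history


history = simulate_seir(beta=0.5, sigma=0.2, gamma=0.1,
                        s0=0.99, e0=0.0, i0=0.01, r0=0.0, days=160)
peak_day = max(range(len(history)), key=lambda d: history[d][2])
print("peak infected fraction on day", peak_day)
```

Each person moves through the states in order, so the Susceptible fraction can only shrink and the four fractions always sum to one.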
Observation 2: Epidemiology & Social Media Gap • Infection cases drop exponentially in epidemiology (Hethcote 2000) • Keyword mentions drop in a power-law pattern in social media (Matsubara 2012)
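To make the gap concrete, the short sketch below compares an exponential decay with a power-law decay. The exponential rate 0.1 is an arbitrary assumption; the power-law exponent 1.5 matches the "-1.5 slope" quoted later in the talk.

```python
import math


def exponential_decay(t, rate):
    """Exponential drop-off, as for infection cases in epidemiological models."""
    return math.exp(-rate * t)


def power_law_decay(t, exponent):
    """Power-law drop-off, as for keyword mentions in social media (t >= 1)."""
    return t ** (-exponent)


for t in (1, 10, 100, 1000):
    print(t, exponential_decay(t, rate=0.1), power_law_decay(t, exponent=1.5))
```

For large t the power law stays orders of magnitude above the exponential, so mentions linger in social media long after case counts in an epidemiological model would have vanished.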
Flu Forecasting • Using combination of propagation patterns, develop a hidden flu-state topic model • Learn “flu” vocabulary and transition probabilities Prakash 2016
Details HFSTM Model • Hidden Flu-State from Tweet Model (HFSTM) • Each word (w) in a tweet (Oi) can be generated by: • A background topic • Non-flu related topics • State related topics • Model variables (figure labels): latent state, initial prob., transit. prob., transit. switch, binary non-flu related switch, binary background switch, word distribution
Details HFSTM Model • Generating tweets • Generate the state for a tweet • Generate the topic for a word • State: [S, E, I] • Topic: [Background, Non-flu, State] • S: This restaurant is really good • E: The movie was good but it was freezing • I: I think I have flu
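The two generation steps (state per tweet, topic per word) can be sketched roughly as follows. This is a simplified illustration, not the HFSTM implementation: it assumes a single non-flu topic, that the transition switch chooses between keeping the previous state and re-drawing it, and every probability and vocabulary entry in `toy_params` is made up.

```python
import random

STATES = ("S", "E", "I")


def draw(dist, rng):
    """Sample one key from a {outcome: probability} mapping."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]


def generate_tweets(params, n_tweets, words_per_tweet, rng):
    """Return a list of (state, [(word, topic), ...]) for one user's tweets."""
    tweets = []
    state = draw(params["initial"], rng)
    for t in range(n_tweets):
        # Transition switch (assumed semantics): re-draw the state from the
        # transition probabilities, or keep the previous tweet's state.
        if t > 0 and rng.random() < params["p_switch_state"]:
            state = draw(params["transition"][state], rng)
        words = []
        for _ in range(words_per_tweet):
            if rng.random() < params["p_background"]:    # binary background switch
                topic, dist = "background", params["background_words"]
            elif rng.random() < params["p_non_flu"]:     # binary non-flu switch
                topic, dist = "non-flu", params["non_flu_words"]
            else:                                        # state-related topic
                topic, dist = "state", params["state_words"][state]
            words.append((draw(dist, rng), topic))
        tweets.append((state, words))
    return tweets


# Hypothetical toy parameters, for illustration only.
toy_params = {
    "initial": {"S": 0.8, "E": 0.15, "I": 0.05},
    "transition": {
        "S": {"S": 0.7, "E": 0.25, "I": 0.05},
        "E": {"S": 0.2, "E": 0.4, "I": 0.4},
        "I": {"S": 0.3, "E": 0.1, "I": 0.6},
    },
    "p_switch_state": 0.5,
    "p_background": 0.4,
    "p_non_flu": 0.5,
    "background_words": {"the": 0.5, "is": 0.3, "really": 0.2},
    "non_flu_words": {"movie": 0.5, "restaurant": 0.5},
    "state_words": {
        "S": {"good": 0.6, "fun": 0.4},
        "E": {"freezing": 0.5, "tired": 0.5},
        "I": {"flu": 0.7, "fever": 0.3},
    },
}

tweets = generate_tweets(toy_params, n_tweets=5, words_per_tweet=6,
                         rng=random.Random(0))
for state, words in tweets:
    print(state, " ".join(word for word, _ in words))
```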
Details Inference • EM-based algorithm: HFSTM-FIT • E-step: • $A_t(i) = P(O_1, O_2, \ldots, O_t, S_t = i)$ • $B_t(i) = P(O_{t+1}, \ldots, O_{T_u} \mid S_t = i)$ • $\gamma_t(i) = P(S_t = i \mid O_u)$ • M-step: • Other parameters such as state transition probabilities, topic distributions, etc. • Parameters learned:
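The E-step quantities A, B and γ have the same form as the forward, backward and posterior terms of a standard hidden Markov model. The sketch below computes them for a plain HMM, assuming the per-tweet likelihoods P(O_t | S_t = i) are already available; in HFSTM these would come from the topic mixture, which is not modeled here. All numbers are toy values, and the recursion is unscaled, so it only suits short sequences.

```python
def forward_backward(initial, transition, likelihoods):
    """Unscaled forward-backward pass for a discrete-state HMM.

    initial[i]        = P(S_1 = i)
    transition[i][j]  = P(S_{t+1} = j | S_t = i)
    likelihoods[t][i] = P(O_t | S_t = i)
    Returns (A, B, gamma) with gamma[t][i] = P(S_t = i | O_1..O_T).
    """
    T, K = len(likelihoods), len(initial)
    A = [[0.0] * K for _ in range(T)]
    B = [[1.0] * K for _ in range(T)]
    # Forward: A[t][i] = P(O_1..O_t, S_t = i)
    for i in range(K):
        A[0][i] = initial[i] * likelihoods[0][i]
    for t in range(1, T):
        for j in range(K):
            A[t][j] = likelihoods[t][j] * sum(
                A[t - 1][i] * transition[i][j] for i in range(K))
    # Backward: B[t][i] = P(O_{t+1}..O_T | S_t = i)
    for t in range(T - 2, -1, -1):
        for i in range(K):
            B[t][i] = sum(
                transition[i][j] * likelihoods[t + 1][j] * B[t + 1][j]
                for j in range(K))
    evidence = sum(A[T - 1])  # P(O_1..O_T)
    gamma = [[A[t][i] * B[t][i] / evidence for i in range(K)]
             for t in range(T)]
    return A, B, gamma


# Toy example with three states ordered as (S, E, I); values are made up.
initial = [0.8, 0.15, 0.05]
transition = [[0.7, 0.25, 0.05], [0.2, 0.4, 0.4], [0.3, 0.1, 0.6]]
likelihoods = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.05, 0.25, 0.7]]
A, B, gamma = forward_backward(initial, transition, likelihoods)
for t, row in enumerate(gamma):
    print(t, [round(p, 3) for p in row])
```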
A possible issue with HFSTM • Suffers from large, noisy vocabulary. • Semi-supervision for improvement • Introduce weak supervision into HFSTM.
Details HFSTM-A [Chen et al. DAMI 2015] • HFSTM-A(spect) • Introduce an aspect variable y, expressing our belief on whether a word is flu-related or not. • The value of y biases the switch variables s.t. flu-related words are more likely to be explained by state topics. When the aspect value (y) is introduced, the switching probabilities are updated accordingly.
Vocabulary & Dataset • Vocabulary (230 words): • Flu-related keyword list by Chakraborty SDM 2014 • Extra state-related keyword list • Dataset (34,000 tweets): • Identify infected users and collect their tweets • Train on data from Jun 20, 2013-Aug 06, 2013 • Test on two time periods: • Dec 01, 2012-July 08, 2013 • Nov 10, 2013-Jan 26, 2014
Learned word distributions • The most probable words learned in each state • Probably healthy: S • Having symptoms: E • Definitely sick: I
Learned state transition Transition probabilities Transition in real tweets Learned by HFSTM: Not directly flu-related, yet correctly identified Prakash 2016
Flu trend fitting • Ground-truth: • The Pan American Health Organization (PAHO) • Algorithms: • Baseline: • Count the number of keywords weekly as features, and regress to the ground-truth curve. • Google Flu Trends: • Take the Google Flu Trends data as input, regress to the PAHO curve. • HFSTM: • Distinguish different states of keywords, and only use the number of keywords in the I state. Again regress to PAHO.
Flu trend fitting • Linear regression to the case count reported by PAHO (the ground-truth) Prakash 2016
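The regression step can be sketched as an ordinary least-squares fit of weekly case counts against a single weekly feature, such as the number of I-state keywords. The numbers below are synthetic placeholders (constructed to lie exactly on a line), not PAHO or Twitter data, and the function and variable names are illustrative.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic weekly feature (e.g. count of I-state keywords) and a synthetic
# ground-truth curve built as 2 * feature + 1, for illustration only.
weekly_i_state_counts = [3, 5, 9, 14, 10, 6]
reported_cases = [2 * c + 1 for c in weekly_i_state_counts]

slope, intercept = fit_line(weekly_i_state_counts, reported_cases)
fitted = [slope * c + intercept for c in weekly_i_state_counts]
print(slope, intercept)
```

The same fit applies to each method in the comparison; only the input feature changes (raw keyword counts, Google Flu Trends values, or I-state keyword counts).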
HFSTM-A • Results are qualitatively similar to those of HFSTM, even when the vocabulary is 10 times larger.
Applications Using propagation for _________ • Q1: Syndromic Surveillance • Q2: Memes, Tweets, Blogs • Q3: General Graph Mining Prakash 2016
Meme tracking • Meme: a virally transmitted cultural symbol or social idea (term coined by Richard Dawkins in 1976) • Usually text (a phrase) and/or an image • A viral meme from 2012 Olympics • All the way to the White House
Patterns Anomaly Imputation Compression Extrapolation Prakash 2016
Google Search Volume e.g., given (1) first spike, (2) release date of two sequel movies (3) access volume before the release date (1) First spike (2) Release date (3) Two weeks before release ? ? Prakash 2016
Rise and fall patterns in social media • Meme (# of mentions in blogs) • short phrases Sourced from U.S. politics in 2008 “you can put lipstick on a pig” “yes we can” Prakash 2016
Rise and fall patterns in social media • Can we find a unifying model, which includes these patterns? • four classes on YouTube [Crane et al. ’08] • six classes on Meme [Yang et al. ’11] Prakash 2016
Rise and fall patterns in social media • Answer: YES! • We can represent all patterns by single model In Matsubara+ SIGKDD 2012 Prakash 2016
Main idea - SpikeM • 1. Un-informed bloggers (uninformed about rumor) • 2. External shock at time n_b (e.g., breaking news) • 3. Infection (word-of-mouth) • Timeline: time n=0, time n=n_b, time n=n_b+1 • Infectiveness of a blog-post at age n: • Strength of infection (quality of news): β • Decay function (how infective a blog posting is): power law
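The three ingredients above (uninformed bloggers, an external shock at n_b, and word-of-mouth infection with power-law decay) can be sketched as a simple recurrence. This follows my reading of the base SpikeM model in Matsubara et al. (SIGKDD 2012) without the periodicity term; the parameter names, the clamping step, and all values are assumptions for illustration.

```python
def spikem_base(n_bloggers, beta, n_b, s_b, epsilon, n_steps):
    """Sketch of a SpikeM-style recurrence without periodicity.

    n_bloggers: total number of initially uninformed bloggers
    beta:       strength of infection
    n_b, s_b:   time and size of the external shock
    epsilon:    background noise added at every step
    Returns delta_b, where delta_b[n] is the number of new posts at time n.
    """
    uninformed = float(n_bloggers)
    delta_b = [0.0] * n_steps
    for n in range(n_steps - 1):
        exposure = 0.0
        for t in range(n_b, n + 1):
            shock = s_b if t == n_b else 0.0
            age = n + 1 - t
            # Infectiveness of a post at age `age`: beta * age^(-1.5)
            exposure += (delta_b[t] + shock) * beta * age ** -1.5
        new_posts = uninformed * exposure + epsilon
        # Clamp so the uninformed pool never goes negative (my addition).
        new_posts = min(max(new_posts, 0.0), uninformed)
        delta_b[n + 1] = new_posts
        uninformed -= new_posts
    return delta_b


delta_b = spikem_base(n_bloggers=1000, beta=0.0005, n_b=5, s_b=10,
                      epsilon=0.0, n_steps=60)
print([round(v, 2) for v in delta_b[:12]])
```

With no noise, nothing happens before the shock; the first posts appear at time n_b + 1 and the activity then rises and falls as the uninformed pool is used up.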
J. G. Oliveira et al. Human Dynamics: The Correspondence Patterns of Darwin and Einstein. Nature 437, 1251 (2005). (also in Leskovec, McGlohon+, SDM 2007) -1.5 slope
Details SpikeM - with periodicity • Full equation of SpikeM • Periodicity: bloggers change their activity over time (e.g., daily, weekly, yearly) • [Figure: activity vs. time n, with peak activity at 12pm and low activity at 3am]
Tail-part forecasts • SpikeM can capture tail part
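One way to model the periodic activity is a sinusoidal modulation that multiplies the number of new posts at each time step. The form below is my recollection of the periodicity term in Matsubara et al. (SIGKDD 2012); the parameter names (`p_a` for amplitude, `p_p` for period, `p_s` for phase shift) and the values used are assumptions, so treat it as a sketch rather than the exact equation from the slide.

```python
import math


def periodicity(n, p_a, p_p, p_s):
    """Activity multiplier in [1 - p_a, 1] that repeats every p_p time steps."""
    return 1.0 - 0.5 * p_a * (math.sin(2.0 * math.pi / p_p * (n + p_s)) + 1.0)


# Hourly time steps with a daily cycle (p_p = 24); amplitude and shift are toy values.
daily = [periodicity(n, p_a=0.4, p_p=24, p_s=0) for n in range(48)]
print(min(daily), max(daily))
```

Setting `p_a = 0` turns the modulation off, which recovers a model with no periodicity.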
“What-if” forecasting e.g., given (1) first spike, (2) release date of two sequel movies (3) access volume before the release date (1) First spike (2) Release date (3) Two weeks before release ? ? Prakash 2016
“What-if” forecasting • SpikeM can forecast not only tail-part, but also rise-part! • SpikeM can forecast upcoming spikes (1) First spike (2) Release date (3) Two weeks before release
Bonus: Protest Predictions [Sundereisan et al. ASONAM 2014] [Jin et al. SIGKDD 2014] • Can Twitter provide a lead time? • South American Twitter dataset • Language: Spanish/Portuguese • Idea • Look for trending keywords. • Predict event type for protest using SpikeM parameters! • Event types: Violent Protest (VP), Non-Violent Protest (P) • [Figure: a political tweet, labeled VP or P]
[Papalexakis et al. ASONAM 2013] Propagation and Cyber-Security: Temporal Patterns Looks familiar?
[Chan et al. WSDM 2016] Propagation and Cyber-Security: Ensemble Models
Applications Using propagation for _________ • Q1: Syndromic Surveillance • Q2: Memes, Tweets, Blogs • Q3: General Graph Mining Prakash 2016