Learn how to understand and predict human behavior through the study of propagation. Explore topics such as flu trends, cyber security, viral marketing, and more.
Understanding and Predicting Human Behavior using Propagation: From Flu-trends to Cyber-Security B. Aditya Prakash Computer Science Virginia Tech. Keynote Talk, BEAMS Workshop, ICDM, Nov 14, 2015
Thanks! • Reza Zafarani • Huan Liu Prakash 2015
Networks are everywhere! • Facebook Network [2010] • Gene Regulatory Network [Decourty 2008] • Human Disease Network [Barabasi 2007] • The Internet [2005]
Dynamical Processes over networks are also everywhere!
Why do we care? • Social collaboration • Information Diffusion • Viral Marketing • Epidemiology and Public Health • Cyber Security • Human mobility • Games and Virtual Worlds • Ecology • ...
Why do we care? (1: Epidemiology) • Dynamical processes over networks: diseases over contact networks (SI model) • CDC data: visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts [AJPH 2007]
Why do we care? (1: Epidemiology) • Dynamical processes over networks [US-MEDICARE NETWORK 2005] • Each circle is a hospital • ~3000 hospitals • More than 30,000 patients transferred • Problem: Given k units of disinfectant, whom to immunize?
Why do we care? (1: Epidemiology) • Current practice vs. our method: ~6x fewer! [US-MEDICARE NETWORK 2005] • Hospital-acquired infections took 99K+ lives and cost $5B+ (all per year)
Why do we care? (2: Online Diffusion) • > 800m users, ~$1B revenue [WSJ 2010] • ~100m active users • > 50m users
Why do we care? (2: Online Diffusion) • Dynamical processes over networks • Social media marketing: a celebrity posts “Buy Versace™!” to followers
Why do we care? (3: To change the world?) • Dynamical processes over networks • Social networks and collaborative action
High Impact – Multiple Settings • Epidemic outbreaks • Products/viruses • Transmit s/w patches • Q. How to squash rumors faster? • Q. How do opinions spread? • Q. How to market better?
Research Theme • DATA: Large real-world networks & processes • ANALYSIS: Understanding • POLICY/ACTION: Managing/Utilizing
Research Theme – Public Health • DATA: Modeling # patient transfers • ANALYSIS: Will an epidemic happen? • POLICY/ACTION: How to control outbreaks?
Research Theme – Social Media • DATA: Modeling Tweets spreading • ANALYSIS: # cascades in future? • POLICY/ACTION: How to market better?
In this talk • Q1: How to predict Flu-trends better? • Q2: How does information evolve over time? • DATA: Large real-world networks & processes
In this talk • Q3: How do malware attacks evolve over time? • DATA: Large real-world networks & processes
Outline • Motivation • Part 1: Learning Models (Empirical Studies) • Q1: How to predict Flu-trends better? • Q2: How does information evolve over time? • Q3: How do malware attacks evolve over time? • Conclusion
Surveillance [Chen et al. ICDM 2014] • How to estimate and predict flu trends? • Surveillance report: hospital record, lab survey, population survey
GFT (Google Flu Trends) & Twitter • Estimate flu trends using online electronic sources • Example posts: “So cold today, I’m catching cold.” “I have headache, sore throat, I can’t go to school today.” “My nose is totally congested, I have a hard time understanding what I’m saying.”
Observation 1: States • There are different states in an infection cycle. • SEIR model: 1. Susceptible 2. Exposed 3. Infected 4. Recovered
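The four SEIR states above can be made concrete with a small simulation. This is a minimal sketch, assuming a discrete-time update on population fractions; the function name `simulate_seir` and the rates `beta`, `sigma`, and `gamma` are illustrative choices, not values from the talk.

```python
def simulate_seir(beta, sigma, gamma, s0, e0, i0, r0, steps):
    """Discrete-time SEIR on population fractions (one update per time step).

    beta:  contact/infection rate (S -> E), an assumed parameter
    sigma: rate at which exposed become infected (E -> I), assumed
    gamma: recovery rate (I -> R), assumed
    """
    s, e, i, r = s0, e0, i0, r0
    history = [(s, e, i, r)]
    for _ in range(steps):
        new_exposed = beta * s * i    # susceptible people meeting infected ones
        new_infected = sigma * e      # exposed people becoming infected
        new_recovered = gamma * i     # infected people recovering
        s -= new_exposed
        e += new_exposed - new_infected
        i += new_infected - new_recovered
        r += new_recovered
        history.append((s, e, i, r))
    return history


# Illustrative run: 1% of the population starts out infected.
history = simulate_seir(beta=0.5, sigma=0.2, gamma=0.1,
                        s0=0.99, e0=0.0, i0=0.01, r0=0.0, steps=100)
peak_step = max(range(len(history)), key=lambda t: history[t][2])
print("infected fraction peaks at step", peak_step)
```

People only move between compartments, so the four fractions always sum to one.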
Observation 2: Epidemiology & Social Media Gap • Infection cases drop exponentially in epidemiology (Hethcote 2000) • Keyword mentions drop in a power-law pattern in social media (Matsubara 2012)
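The gap between the two decay shapes is easy to see numerically. A minimal sketch, assuming an exponential rate of 0.5 and a power-law exponent of 1.5 purely for illustration (neither value is fitted to data from the talk):

```python
import math


def exponential_decay(t, rate):
    """Relative volume after t steps under exponential decay."""
    return math.exp(-rate * t)


def power_law_decay(t, exponent):
    """Relative volume after t steps (t >= 1) under power-law decay."""
    return t ** (-exponent)


# The power-law tail stays orders of magnitude heavier at late times.
for t in (1, 10, 100):
    print(t, exponential_decay(t, 0.5), power_law_decay(t, 1.5))
```

A model fitted to an exponential drop will therefore underestimate how long keyword mentions linger.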
HFSTM Model • Hidden Flu-State from Tweet Model (HFSTM) • Each word (w) in a tweet (O_i) can be generated by a background topic, non-flu related topics, or state related topics • Model variables: latent state, initial prob., transit. prob., transit. switch, binary non-flu related switch, binary background switch, word distribution
HFSTM Model • Generating tweets: generate the state for a tweet (State: [S, E, I]), then generate the topic for each word (Topic: [Background, Non-flu, State]) • S: This restaurant is really good • E: The movie was good but it was freezing • I: I think I have flu
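The two generation steps (a state per tweet, then a topic per word) can be sketched as a toy sampler. This is a simplified illustration, not the HFSTM implementation: it omits the transition switch, and every probability and word list below is invented for the example.

```python
import random

STATES = ["S", "E", "I"]

# All numbers and word lists are made up for illustration.
INITIAL = {"S": 0.8, "E": 0.15, "I": 0.05}
TRANSITION = {
    "S": {"S": 0.8, "E": 0.15, "I": 0.05},
    "E": {"S": 0.1, "E": 0.5, "I": 0.4},
    "I": {"S": 0.3, "E": 0.1, "I": 0.6},
}
BACKGROUND_WORDS = {"the": 0.4, "is": 0.3, "i": 0.3}
NON_FLU_WORDS = {"movie": 0.5, "restaurant": 0.5}
STATE_WORDS = {
    "S": {"good": 0.6, "great": 0.4},
    "E": {"freezing": 0.5, "tired": 0.5},
    "I": {"flu": 0.6, "fever": 0.4},
}
P_BACKGROUND = 0.4  # binary background switch
P_NON_FLU = 0.5     # binary non-flu switch, consulted when not background


def draw(dist, rng):
    """Sample one key from a {key: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def generate_tweets(num_tweets, words_per_tweet, rng):
    """Sample one user's tweet sequence as (state, words) pairs."""
    tweets = []
    state = None
    for _ in range(num_tweets):
        # Step 1: generate the state for the tweet (Markov chain over S, E, I).
        state = draw(INITIAL, rng) if state is None else draw(TRANSITION[state], rng)
        words = []
        for _ in range(words_per_tweet):
            # Step 2: generate the topic for the word, then the word itself.
            if rng.random() < P_BACKGROUND:
                words.append(draw(BACKGROUND_WORDS, rng))
            elif rng.random() < P_NON_FLU:
                words.append(draw(NON_FLU_WORDS, rng))
            else:
                words.append(draw(STATE_WORDS[state], rng))
        tweets.append((state, words))
    return tweets


tweets = generate_tweets(5, 6, random.Random(0))
for state, words in tweets:
    print(state, " ".join(words))
```

Only the state-topic branch depends on the hidden state, which is why state-specific words carry the signal during inference.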
Inference • EM-based algorithm: HFSTM-FIT • E-step: $A_t(i) = P(O_1, O_2, \ldots, O_t, S_t = i)$; $B_t(i) = P(O_{t+1}, \ldots, O_{T_u} \mid S_t = i)$; $\gamma_t(i) = P(S_t = i \mid O_u)$ • M-step: re-estimate the other parameters, such as state transition probabilities, topic distributions, etc.
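The E-step quantities are the standard forward, backward, and posterior terms of a hidden Markov model. The sketch below computes them for a generic HMM; it is not HFSTM-FIT itself. It assumes the per-tweet likelihoods P(O_t | S_t = i) are already available, and the function name and example numbers are hypothetical.

```python
def forward_backward(initial, transition, emission_likelihoods):
    """Unscaled forward-backward pass for a small HMM.

    initial[i]                 = P(S_1 = i)
    transition[i][j]           = P(S_{t+1} = j | S_t = i)
    emission_likelihoods[t][i] = P(O_t | S_t = i)
    Returns (forward, backward, posterior), playing the roles of A, B, gamma.
    """
    num_steps = len(emission_likelihoods)
    num_states = len(initial)
    forward = [[0.0] * num_states for _ in range(num_steps)]
    backward = [[1.0] * num_states for _ in range(num_steps)]

    # Forward: A_t(i) = P(O_1..O_t, S_t = i)
    for i in range(num_states):
        forward[0][i] = initial[i] * emission_likelihoods[0][i]
    for t in range(1, num_steps):
        for j in range(num_states):
            incoming = sum(forward[t - 1][i] * transition[i][j]
                           for i in range(num_states))
            forward[t][j] = emission_likelihoods[t][j] * incoming

    # Backward: B_t(i) = P(O_{t+1}..O_T | S_t = i)
    for t in range(num_steps - 2, -1, -1):
        for i in range(num_states):
            backward[t][i] = sum(
                transition[i][j] * emission_likelihoods[t + 1][j] * backward[t + 1][j]
                for j in range(num_states))

    # Posterior: gamma_t(i) = P(S_t = i | O_1..O_T)
    evidence = sum(forward[-1])
    posterior = [[forward[t][i] * backward[t][i] / evidence
                  for i in range(num_states)]
                 for t in range(num_steps)]
    return forward, backward, posterior


# Hypothetical three-state example (S, E, I) over four tweets.
initial = [0.6, 0.3, 0.1]
transition = [[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.1, 0.7]]
emission_likelihoods = [[0.5, 0.3, 0.1],
                        [0.2, 0.5, 0.3],
                        [0.1, 0.3, 0.6],
                        [0.1, 0.2, 0.7]]
forward, backward, posterior = forward_backward(initial, transition, emission_likelihoods)
```

For long tweet sequences the unscaled products underflow, so a practical version would rescale each step or work in log space.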
A possible issue with HFSTM • Suffers from a large, noisy vocabulary. • Semi-supervision for improvement: introduce weak supervision into HFSTM.
HFSTM-A • HFSTM-A(spect) • Introduce an aspect variable y, expressing our belief on whether a word is flu-related or not. • The value of y biases the switch variables s.t. flu-related words are more likely to be explained by state topics. • When the aspect value (y) is introduced, the switching probabilities are updated accordingly.
Vocabulary & Dataset • Vocabulary (230 words): flu-related keyword list by Chakraborty SDM 2014, plus an extra state-related keyword list • Dataset (34,000 tweets): identify infected users and collect their tweets • Train on data from Jun 20, 2013 to Aug 06, 2013 • Test on two time periods: Dec 01, 2012 to Jul 08, 2013, and Nov 10, 2013 to Jan 26, 2014
Learned word distributions • The most probable words learned in each state • S: probably healthy • E: having symptoms • I: definitely sick
Learned state transition • Transition probabilities • Transition in real tweets • Learned by HFSTM: not directly flu-related, yet correctly identified
Flu trend fitting • Ground truth: The Pan American Health Organization (PAHO) • Algorithms: • Baseline: count the number of keywords weekly as features, and regress to the ground-truth curve. • Google Flu Trends: take the Google Flu Trends data as input, and regress to the PAHO curve. • HFSTM: distinguish the different states of keywords, and use only the number of keywords in the I state; again regress to PAHO.
Flu trend fitting • Linear regression to the case count reported by PAHO (the ground truth)
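The regression step can be sketched with ordinary least squares on a single weekly feature. This is a minimal illustration, assuming one feature (the weekly count of I-state keywords); the numbers are synthetic and chosen to lie exactly on a line, not data from the talk or from PAHO.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic weekly series (hypothetical): keyword counts and reported cases.
weekly_i_state_keywords = [3, 5, 9, 14, 11, 6]
weekly_case_counts = [11, 15, 23, 33, 27, 17]

slope, intercept = fit_line(weekly_i_state_keywords, weekly_case_counts)
fitted = [slope * x + intercept for x in weekly_i_state_keywords]
print(slope, intercept)
```

The three compared methods differ only in the feature fed to this step: raw keyword counts, Google Flu Trends values, or I-state keyword counts.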
HFSTM-A • Results are qualitatively similar to those of HFSTM when the vocabulary is 10 times larger.
Outline • Motivation • Part 1: Learning Models (Empirical Studies) • Q1: How to predict Flu-trends better? • Q2: How does information evolve over time? • Q3: How do malware attacks evolve over time? • Conclusion
Google Search Volume • e.g., given (1) first spike, (2) release date of two sequel movies, (3) access volume before the release date • Plot annotations: (1) First spike (2) Release date (3) Two weeks before release
Patterns • Plot of Y against X • More data • Anomaly? • Extrapolation • Imputation • Compression
Rise and fall patterns in social media • Meme (# of mentions in blogs) • Short phrases sourced from U.S. politics in 2008: “you can put lipstick on a pig”, “yes we can”
Rise and fall patterns in social media • Can we find a unifying model, which includes these patterns? • four classes on YouTube [Crane et al. ’08] • six classes on Meme [Yang et al. ’11]
Rise and fall patterns in social media • Answer: YES! • We can represent all patterns by a single model • In Matsubara, Sakurai, Prakash+ SIGKDD 2012
Main idea - SpikeM • 1. Un-informed bloggers (uninformed about rumor) • 2. External shock at time n_b (e.g., breaking news) • 3. Infection (word-of-mouth) • Timeline: Time n=0, Time n=n_b, Time n=n_b+1 • Infectiveness of a blog-post at age n: combines the strength of infection β (quality of news) with a power-law decay function (how infective a blog posting is)
J. G. Oliveira et al. Human Dynamics: The Correspondence Patterns of Darwin and Einstein. Nature 437, 1251 (2005). (also in Leskovec, McGlohon+, SDM 2007) • -1.5 slope
Details: SpikeM with periodicity • Full equation of SpikeM, with periodicity • Bloggers change their activity over time (e.g., daily, weekly, yearly) • Plot: activity vs. time n, with peak activity at 12pm and low activity at 3am
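The full SpikeM equation did not survive in the slide text, so the sketch below is an assumption-laden illustration rather than the authors' code. It assumes the update as recalled from the SIGKDD 2012 paper: newly informed bloggers at step n+1 equal a periodicity factor times the uninformed count times the summed, power-law-decayed influence of earlier posts and the external shock, plus background noise. The parameter names (`S_b`, `eps`, `P_a`, `P_p`, `P_s`), the -1.5 exponent taken from the slope slide, and the clamp to the remaining uninformed count are all assumptions.

```python
import math


def spikem(N, beta, n_b, S_b, eps, P_a, P_p, P_s, steps):
    """Simulate newly informed bloggers per time step (assumed SpikeM form).

    N: total bloggers; beta: strength of infection; n_b: time of the shock;
    S_b: size of the shock; eps: background noise; P_a, P_p, P_s: amplitude,
    period, and phase shift of the periodic activity factor.
    """
    uninformed = float(N)
    delta_b = [0.0] * (steps + 1)
    for n in range(steps):
        # Influence of the shock and earlier posts, decaying as age ** -1.5.
        pressure = 0.0
        for t in range(n_b, n + 1):
            shock = S_b if t == n_b else 0.0
            age = n + 1 - t
            pressure += (delta_b[t] + shock) * beta * age ** -1.5
        # Periodicity factor in [1 - P_a, 1]: activity dips at off-peak hours.
        periodicity = 1.0 - 0.5 * P_a * (math.sin(2 * math.pi / P_p * (n + 1 + P_s)) + 1.0)
        newly_informed = periodicity * (uninformed * pressure + eps)
        # Safety clamp (an addition here): cannot inform more than remain.
        newly_informed = min(newly_informed, uninformed)
        delta_b[n + 1] = newly_informed
        uninformed -= newly_informed
    return delta_b


# Illustrative run with no noise and no periodicity (P_a = 0).
delta_b = spikem(N=1000, beta=0.001, n_b=10, S_b=5.0, eps=0.0,
                 P_a=0.0, P_p=24, P_s=0, steps=60)
```

Setting `P_a` above zero with `P_p=24` modulates hourly activity, which is the daily cycle the slide describes.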
Tail-part forecasts • SpikeM can capture the tail part
“What-if” forecasting • e.g., given (1) first spike, (2) release date of two sequel movies, (3) access volume before the release date • Plot annotations: (1) First spike (2) Release date (3) Two weeks before release
“What-if” forecasting • SpikeM can forecast not only the tail part, but also the rise part! • SpikeM can forecast upcoming spikes • Plot annotations: (1) First spike (2) Release date (3) Two weeks before release