Explore the importance of leveraging propagation in data mining, including applications in social collaboration, viral marketing, epidemiology, cyber security, human mobility, and more. Learn about algorithms to control outbreaks and improve online diffusion.
Leveraging Propagation for Data Mining: Models, Algorithms & Applications • B. Aditya Prakash, Dept. of Computer Science • November 16, 2017, UTRC, Hartford, CT
Thanks! • Hala Mostafa Prakash 2017
Networks are everywhere! • Facebook Network [2010] • Gene Regulatory Network [Decourty 2008] • Human Disease Network [Barabasi 2007] • The Internet [2005]
Dynamical processes over networks are also everywhere!
Why do we care? • Social collaboration • Information Diffusion • Viral Marketing • Epidemiology and Public Health • Cyber Security • Human mobility • Games and Virtual Worlds • Ecology • ...
Why do we care? (1: Epidemiology) • Dynamical processes over networks: diseases over contact networks • CDC data [AJPH 2007]: visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts
Why do we care? (1: Epidemiology) • Dynamical processes over networks [US-MEDICARE NETWORK 2005] • Each circle is a hospital • ~3000 hospitals • More than 30,000 patients transferred • Problem: Given k units of disinfectant, whom to immunize?
Why do we care? (1: Epidemiology) • Current practice vs. our method: ~6x fewer! [US-MEDICARE NETWORK 2005] • Hospital-acquired infections took 99K+ lives and cost $5B+ (all per year)
Why do we care? (2: Online Diffusion) • > 800m users, ~$1B revenue [WSJ 2010] • ~100m active users • > 50m users
Why do we care? (2: Online Diffusion) • Dynamical processes over networks • Social media marketing: celebrity to followers, "Buy Versace™!"
Why do we care? (3: To change the world?) • Dynamical processes over networks • Social networks and collaborative action
High Impact – Multiple Settings • Settings: epidemic outbreaks; products/viruses; transmit s/w patches • Q. How to squash rumors faster? • Q. How do opinions spread? • Q. How to market better?
Research Theme • ANALYSIS: Understanding • POLICY/ACTION: Managing • DATA: Large real-world networks & processes
Research Theme – Public Health • ANALYSIS: Will an epidemic happen? • POLICY/ACTION: How to control outbreaks? • DATA: Modeling # patient transfers
Research Theme – Social Media • ANALYSIS: # cascades in future? • POLICY/ACTION: How to market better? • DATA: Modeling Tweets spreading
In this talk • Algorithms (Managing/Manipulating) • Q1: How to immunize and control outbreaks better? • Q2: How to reverse-engineer epidemics?
In this talk • Applications (large real-world networks & processes): How to use propagation for _________ • Q3: Memes, and Malware • Q4: Disease Surveillance • Q5: General Graph Mining
Outline • Motivation • Part 2: Policy and Action (Algorithms) • Part 3: Applications (Data-Driven) • Conclusion
Part 2: Algorithms • Q1: Whom to immunize? • Q2: How to reverse-engineer epidemics?
Immunization • The Centers for Disease Control (CDC) cares about containing epidemic diseases • E.g., ~400 million dollars used for vaccines for children in 2013 • Twitter tries to stop rumor spread • E.g., rumors of victims after the Boston Marathon bombing in 2013 • How to choose the best nodes/edges etc. to vaccinate (remove)?
Immunization • Given: a graph A, a virus propagation model, and a budget k • Find: the k ‘best’ nodes for immunization (removal) • Figure: example with k = 2
Background • “SIR” model: lifelong immunity (e.g., mumps) • Each node in the graph is in one of three states • Susceptible (i.e., healthy) • Infected • Removed (i.e., can’t get infected again) • Transitions: infection with prob. β, removal with prob. δ • Figure: snapshots at t = 1, t = 2, t = 3
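The SIR dynamics on this slide can be sketched as a small discrete-time simulation. This is a minimal illustration, not the speaker's code: it assumes that β is the per-step probability that an infected node infects each susceptible neighbor and that δ is the per-step probability that an infected node is removed. The function name `simulate_sir` and the ring graph are hypothetical.

```python
import random

def simulate_sir(adj, seeds, beta, delta, steps, rng=None):
    """Discrete-time SIR on an adjacency-list graph (illustrative sketch).

    adj: dict mapping node -> list of neighbors; seeds: initially infected nodes.
    Returns a list of (S, I, R) counts, one tuple per time step.
    """
    rng = rng or random.Random(0)
    state = {v: "S" for v in adj}
    for s in seeds:
        state[s] = "I"
    history = []
    for _ in range(steps):
        # Record the counts at the start of the step.
        history.append(tuple(sum(1 for v in state if state[v] == c) for c in "SIR"))
        infected = [v for v in adj if state[v] == "I"]
        newly_infected = set()
        # Assumption: each infected node attacks each susceptible neighbor with prob. beta.
        for v in infected:
            for u in adj[v]:
                if state[u] == "S" and rng.random() < beta:
                    newly_infected.add(u)
        # Assumption: each infected node is removed with prob. delta (permanent immunity).
        for v in infected:
            if rng.random() < delta:
                state[v] = "R"
        for u in newly_infected:
            state[u] = "I"
    return history

# Hypothetical example: a ring of 10 nodes with a single seed.
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
history = simulate_sir(ring, seeds=[0], beta=0.5, delta=0.3, steps=20)
```

Because removed nodes never return to the susceptible state, the S count can only fall and the R count can only rise over the run.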
Immunization (= Interventions) • Different flavors: • Pre-emptive: immunization before the epidemic starts (choose nodes in advance) • Data-aware: immunization after the epidemic has started • Group-based: allocation based on groups • Data-based: allocation directly using data
Pre-emptive: Vulnerability • The first eigenvalue λ1 (of the adjacency matrix) is sufficient for most diffusion models [Prakash+ ICDM 2011; Selected for Best Papers] • λ1 is the epidemic threshold • Figure: “Safe”, “Vulnerable”, “Deadly” graphs; increasing λ1, increasing vulnerability
“Eigen-Drop” • Eigen-Drop(S): Δλ = λ − λs, where λs is the first eigenvalue after removing the node set S • Figure: original graph vs. the same graph without nodes {2, 6}
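The eigen-drop can be computed directly on a small graph. The sketch below is illustrative only: it estimates λ1 with a plain power iteration on A + I (the shift is my choice, to avoid oscillation on bipartite graphs), and the helper names `largest_eigenvalue`, `remove_nodes` and `eigen_drop` are hypothetical, not from the talk.

```python
import math

def largest_eigenvalue(adj, iters=500):
    """Estimate the first eigenvalue of an undirected graph's adjacency matrix."""
    nodes = list(adj)
    if not nodes:
        return 0.0
    x = {v: 1.0 / math.sqrt(len(nodes)) for v in nodes}
    lam = 0.0
    for _ in range(iters):
        y = {v: x[v] + sum(x[u] for u in adj[v]) for v in nodes}  # y = (A + I) x
        lam = sum(x[v] * y[v] for v in nodes) - 1.0  # Rayleigh quotient of A (x has unit norm)
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {v: y[v] / norm for v in nodes}
    return lam

def remove_nodes(adj, removed):
    """Return a copy of the graph with the given nodes (and their edges) deleted."""
    removed = set(removed)
    return {v: [u for u in nbrs if u not in removed]
            for v, nbrs in adj.items() if v not in removed}

def eigen_drop(adj, removed):
    """Delta-lambda = lambda(original graph) - lambda(graph without the removed set)."""
    return largest_eigenvalue(adj) - largest_eigenvalue(remove_nodes(adj, removed))

# Hypothetical example: a star with hub 0 and four leaves.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
drop_hub = eigen_drop(star, [0])   # removing the hub leaves no edges
drop_leaf = eigen_drop(star, [1])  # removing a leaf leaves a smaller star
```

In this toy star, removing the hub yields a much larger drop than removing a leaf, which is the intuition behind choosing nodes by their eigen-drop.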
Pre-emptive: Goal • Decrease λ1 as much as possible • Node-based [Tong, Prakash+ ICDM 2010] • Edge-based [Tong, Prakash+ CIKM 2012, Best Paper Award] • Edge-Manipulation [Prakash, Adamic+ SDM 2013]
Node-based: Direct Algorithm too expensive! [Tong, Prakash+ ICDM 2010; Prakash, Adamic+ SDM 2013] • Select the k nodes that maximize Δλ: S = argmax Δλ • Combinatorial complexity! • Example: 1,000 nodes, with 10,000 edges • It takes 0.01 seconds to compute λ • It takes 2,615 years to find the 5 best nodes!
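The running-time figure in this example can be checked with a back-of-the-envelope calculation. The sketch assumes the direct algorithm computes λ once for every 5-node subset of the 1,000 nodes and that a year has 365 days; both assumptions are mine, not stated on the slide.

```python
import math

n, k = 1_000, 5
seconds_per_eigenvalue = 0.01  # time to compute lambda once, from the slide

# Assumption: brute force evaluates every size-k subset exactly once.
subsets = math.comb(n, k)
total_seconds = subsets * seconds_per_eigenvalue
years = total_seconds / (365 * 24 * 3600)

print(f"{subsets:,} subsets, about {years:,.0f} years")
```

Under these assumptions the estimate comes out at roughly 2,600 years, in line with the figure quoted on the slide.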
Our Solution • Part 1: Carefully approximate the eigen-drop (Δλ) • Matrix perturbation theory • Part 2: Algorithm • Greedily pick the best node at each step • The eigen-drop approximation is submodular • NetShield (linear complexity): O(nk² + m), where n = # nodes and m = # edges
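The greedy step can be illustrated with a deliberately naive variant. This is not NetShield: NetShield relies on a perturbation-based approximation of the eigen-drop that the slide does not spell out, whereas the sketch below recomputes λ exactly (by power iteration) for every candidate node, which is far slower but easy to verify. All function names and the example graph are hypothetical.

```python
import math

def largest_eigenvalue(adj, iters=300):
    """Estimate the first adjacency eigenvalue via power iteration on A + I."""
    nodes = list(adj)
    if not nodes:
        return 0.0
    x = {v: 1.0 / math.sqrt(len(nodes)) for v in nodes}
    lam = 0.0
    for _ in range(iters):
        y = {v: x[v] + sum(x[u] for u in adj[v]) for v in nodes}  # y = (A + I) x
        lam = sum(x[v] * y[v] for v in nodes) - 1.0  # Rayleigh quotient of A
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {v: y[v] / norm for v in nodes}
    return lam

def without(adj, removed):
    """Copy of the graph with the given nodes deleted."""
    removed = set(removed)
    return {v: [u for u in nbrs if u not in removed]
            for v, nbrs in adj.items() if v not in removed}

def greedy_immunize(adj, k):
    """Pick k nodes one at a time, each time removing the node whose
    deletion leaves the smallest first eigenvalue (largest eigen-drop)."""
    chosen = []
    for _ in range(k):
        current = without(adj, chosen)
        if not current:
            break
        best = min(current, key=lambda v: largest_eigenvalue(without(current, [v])))
        chosen.append(best)
    return chosen

# Hypothetical example: two connected hubs (0 and 5) with leaves attached.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (5, 6), (5, 7), (5, 8)]
graph = {v: [] for v in range(9)}
for a, b in edges:
    graph[a].append(b)
    graph[b].append(a)
picked = greedy_immunize(graph, k=2)
```

On this toy graph the greedy rule picks the two hubs, after which no edges remain. The cost is one eigenvalue computation per candidate per step, which is exactly the overhead an approximate score is meant to avoid.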
Experiment: Immunization quality • Plot: log(fraction of infected nodes) vs. time; lower is better • Methods compared: PageRank, Betweenness (shortest path), Degree, Acquaintance, Eigs (=HITS), NetShield
Latest results • First (provable) approximation algorithms for the edge-based problem [Saha, Adiga, Prakash, Vullikanti SDM 2015] • O(log² n)-factor (can be improved to O(log n)) • Based on the idea of removing closed walks • Semi-definite programming, rounding-based O(1) factor • Running time more expensive than NetShield
Data-aware Immunization [Zhang and Prakash, SDM 2014; Zhang and Prakash, TKDD 2015] • Given: graph and infected nodes • Find: ‘best’ nodes for immunization • Complexity: NP-hard; hard to approximate within an absolute error • DAVA-tree: optimal solution on the tree • DAVA and DAVA-fast: merge infected nodes, build a “dominator tree”, and run DAVA-tree • Running time: subquadratic • DAVA: O(k(|E| + |V| log |V|)) • DAVA-fast: O(|E| + |V| log |V|) • Figure: graph with infected nodes; dominator tree
Extensions • Can be extended to uncertain and noisy initial data as well [Zhang and Prakash, CIKM 2014] • Twitter Firehose API 1% sample
Group-based immunization [Zhang+, ICDM 2015] • Sometimes individual immunization cannot be easily turned into implementable policies • E.g., it is hard to ensure that specific individuals take the adequate vaccine
Group-based immunization • Observation: Groups naturally exist in underlying networks • ages, demographics, occupations, … • interests, geolocations, … • Figures: occupation groups; geolocation groups • How to select groups to control propagation over networks?
Summary of methods • Notation: m = number of vaccines (budget); n = number of groups; L = simulation time for greedy algorithm; V = node set
Data-driven Immunization [Zhang+, ICDM 2017, Best-paper candidate] • Data explosion • Network data: Twitter following network, population contact network, … • Propagation data: tweets in social media, flu reports in public health, … • However, can we build algorithms directly using data?
Case-Study • Figure (Houston, Miami): allocation from contact networks; allocation from propagation data; our approach • Observation 1: Our approach considers both networks and propagation data
Case-Study • Figure (Houston, Miami): allocation from contact networks; allocation from propagation data; our approach • Observation 2: Our approach distributes vaccines to areas with high risk of influenza outbreak • E.g., the Texas Medical Center (large medical center) • E.g., Miami Beach (with large transient population)
Part 2: Algorithms • Q1: Whom to immunize? • Q2: How to reverse-engineer epidemics?
Problem definition • 2-d grid; ‘+’ -> infected • Who started it? • In [Prakash+, ICDM 2012] (Selected for Best Papers)
Problem definition • 2-d grid; ‘+’ -> infected • Who started it? • Prior work: [Lappas et al. 2010, Shah et al. 2011]
Who are the culprits? • NetSleuth: two-part solution • Use MDL for the number of seeds • For a given number: exoneration = centrality + penalty • Novel Laplacian sub-matrix method • Running time = linear! (in edges and nodes)
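The Laplacian sub-matrix idea can be sketched for the single-seed case. The slide gives no formula, so the rule used here is an assumption based on my reading of the cited ICDM 2012 paper: take the graph Laplacian L = D − A, restrict it to the infected nodes, and return the node with the largest entry in the eigenvector of the smallest eigenvalue. The MDL step and the exoneration loop for multiple seeds are omitted, and the name `most_likely_seed` is hypothetical.

```python
import math

def most_likely_seed(adj, infected, iters=2000):
    """Single-seed sketch of the Laplacian sub-matrix idea (assumed rule).

    adj: dict node -> list of neighbors (full graph); infected: non-empty node list.
    """
    infected = list(infected)
    inf_set = set(infected)
    # Degrees come from the FULL graph, so healthy neighbors still count.
    degree = {v: len(adj[v]) for v in infected}
    # Shift so that (shift*I - L_sub) is positive definite; its top eigenvector
    # is the eigenvector of L_sub's smallest eigenvalue.
    shift = 2.0 * max(degree.values()) + 1.0
    x = {v: 1.0 for v in infected}
    for _ in range(iters):
        y = {v: (shift - degree[v]) * x[v]
                + sum(x[u] for u in adj[v] if u in inf_set)
             for v in infected}
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {v: y[v] / norm for v in infected}
    return max(infected, key=lambda v: x[v])

# Hypothetical example: a path 0-1-...-8 where nodes 1..7 are infected.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 8] for i in range(9)}
seed = most_likely_seed(path, infected=range(1, 8))
```

On this path the middle of the infected segment scores highest, which matches the intuition that the culprit sits at the "center" of the infected region while the boundary with healthy nodes is penalized.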
Case-Study • CDC data [AJPH 2007]: 35 TB patients + 1039 contacts • Patient zero identified by NetSleuth = patient zero identified by CDC
Many extensions • Temporal networks [Rozenshtein+ SIGKDD 2016] • Noisy input [Sundareisan+ SDM 2015]
Outline • Motivation • Part 1: Understanding Epidemics (Theory) • Part 2: Policy and Action (Algorithms) • Part 3: Applications (Data-Driven) • Conclusion
Part 3: Applications • How to use propagation for _________ • Q3: Disease Surveillance • Q4: Memes, and Malware • Q5: General Graph Mining
GFT & Twitter • Estimate flu trends using online electronic sources • Example tweets: “So cold today, I’m catching cold.” “I have headache, sore throat, I can’t go to school today.” “My nose is totally congested, I have a hard time understanding what I’m saying.”
Nowcasting the Flu • Propagation on Twitter to “nowcast” the H1N1 pandemic • Track the spread of flu-related keywords • Support vector regression to CDC ILI dictionary
Flu forecasting • Twitter – a surrogate for flu forecasting? • Google Flu Trends: using keywords to track the flu season • Can we get more specific? • Consider:
“Propagation” ideas • Can we develop better disease surveillance tools by leveraging • How flu-related information propagates on Twitter • Epidemiological models