
Leveraging Propagation for Data Mining: Models, Algorithms & Applications

Explore the importance of leveraging propagation in data mining, including applications in social collaboration, viral marketing, epidemiology, cyber security, human mobility, and more. Learn about algorithms to control outbreaks and improve online diffusion.

ronnieking




Presentation Transcript


  1. Leveraging Propagation for Data Mining: Models, Algorithms & Applications. B. Aditya Prakash, Dept. of Computer Science. November 16, 2017. UTRC, Hartford, CT.

  2. Thanks! • Hala Mostafa Prakash 2017

  3. Networks are everywhere! Facebook Network [2010]; Gene Regulatory Network [Decourty 2008]; Human Disease Network [Barabasi 2007]; The Internet [2005]

  4. Dynamical Processes over networks are also everywhere!

  5. Why do we care? • Social collaboration • Information Diffusion • Viral Marketing • Epidemiology and Public Health • Cyber Security • Human mobility • Games and Virtual Worlds • Ecology • ........

  6. Why do we care? (1: Epidemiology) • Dynamical Processes over networks • Diseases over contact networks • CDC data [AJPH 2007]: visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts

  7. Why do we care? (1: Epidemiology) • Dynamical Processes over networks • Each circle is a hospital • ~3000 hospitals • More than 30,000 patients transferred [US-MEDICARE NETWORK 2005] • Problem: Given k units of disinfectant, whom to immunize?

  8. Why do we care? (1: Epidemiology) • Current practice vs. our method: ~6x fewer! [US-MEDICARE NETWORK 2005] • Hospital-acquired infections took 99K+ lives and cost $5B+ (all per year)

  9. Why do we care? (2: Online Diffusion) • > 800m users, ~$1B revenue [WSJ 2010] • ~100m active users • > 50m users

  10. Why do we care? (2: Online Diffusion) • Dynamical Processes over networks • Social Media Marketing [Figure: Celebrity, Followers, "Buy Versace™!"]

  11. Why do we care? (3: To change the world?) • Dynamical Processes over networks • Social networks and Collaborative Action

  12. High Impact – Multiple Settings • Epidemic outbreaks • Products/viruses • Transmit s/w patches • Q. How to squash rumors faster? • Q. How do opinions spread? • Q. How to market better?

  13. Research Theme • DATA: Large real-world networks & processes • ANALYSIS: Understanding • POLICY/ACTION: Managing

  14. Research Theme – Public Health • DATA: Modeling # patient transfers • ANALYSIS: Will an epidemic happen? • POLICY/ACTION: How to control outbreaks?

  15. Research Theme – Social Media • DATA: Modeling Tweets spreading • ANALYSIS: # cascades in future? • POLICY/ACTION: How to market better?

  16. In this talk (Algorithms: Managing/Manipulating) • Q1: How to immunize and control outbreaks better? • Q2: How to reverse-engineer epidemics?

  17. In this talk (Applications: Large real-world networks & processes) How to use propagation for _________ • Q3: Memes and Malware • Q4: Disease Surveillance • Q5: General Graph Mining

  18. Outline • Motivation • Part 2: Policy and Action (Algorithms) • Part 3: Applications (Data-Driven) • Conclusion

  19. Part 2: Algorithms • Q1: Whom to immunize? • Q2: How to reverse-engineer epidemics?

  20. Immunization • Centers for Disease Control (CDC) cares about containing epidemic diseases • E.g.: ~400 million dollars used for vaccines for children in 2013 • Twitter tries to stop rumor spread • E.g.: rumors of victims after the Boston Marathon bombings in 2013 • How to choose the best nodes/edges etc. to vaccinate (remove)?

  21. Immunization • Given: a graph A, a virus propagation model, and a budget k • Find: the k ‘best’ nodes for immunization (removal) [Figure: example graph with k = 2]

  22. Background “SIR” model: lifetime immunity (e.g., mumps) • Each node in the graph is in one of three states • Susceptible (i.e. healthy) • Infected • Removed (i.e. can’t get infected again) • An infected node infects a susceptible neighbor with prob. β and is removed with prob. δ [Figure: node states at t = 1, t = 2, t = 3]
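
To make the SIR dynamics on this slide concrete, here is a minimal discrete-time simulation sketch. The function name `simulate_sir`, the adjacency-dict input format, and the order of events within a time step (infections first, then removals) are assumptions for illustration; the slide only specifies the three states and the probabilities β and δ.

```python
import random

def simulate_sir(adj, seeds, beta, delta, rng=None, max_steps=1000):
    """Discrete-time SIR on an undirected graph given as {node: [neighbors]}.

    Returns the set of nodes that were infected at some point.
    """
    rng = rng or random.Random()
    infected = set(seeds)
    removed = set()
    for _ in range(max_steps):
        if not infected:
            break
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                # Only Susceptible neighbors can catch the infection (prob. beta).
                if v not in infected and v not in removed and rng.random() < beta:
                    newly_infected.add(v)
        # Each infected node is removed for good with probability delta.
        recovered = {u for u in infected if rng.random() < delta}
        removed |= recovered
        infected = (infected - recovered) | newly_infected
    return removed | infected
```

With β = 1 and δ = 1 on a path graph the infection moves one hop per step and reaches every node; with β = 0 only the seed is ever infected.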

  23. Immunization (= Interventions) • Different flavors: • Pre-emptive: immunization before the epidemic starts • Data-aware: immunization after the epidemic has started • Group-based: allocation based on groups • Data-based: allocation directly using data

  24. Pre-emptive: Vulnerability • The first eigenvalue λ1 (of the adjacency matrix) is sufficient for most diffusion models [Prakash+ ICDM 2011; Selected for Best Papers] • λ1 is the epidemic threshold [Figure: “Safe”, “Vulnerable”, and “Deadly” graphs; increasing λ1, increasing vulnerability]
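
As a small worked example of the threshold idea, the sketch below estimates λ1 by power iteration and checks whether an epidemic is expected to die out. It assumes the SIS/SIR form of the condition, λ1 · β/δ < 1; the model-dependent constant differs for other diffusion models. The function names and the dense list-of-lists matrix format are illustrative choices, not from the talk.

```python
def leading_eigenvalue(A, iters=500):
    """Largest eigenvalue of a symmetric adjacency matrix (list of lists).

    Power iteration is run on A + I so that bipartite graphs, whose
    spectrum contains both +lambda and -lambda, still converge.
    """
    n = len(A)
    if n == 0:
        return 0.0
    x = [1.0 / n ** 0.5] * n
    lam = 0.0
    for _ in range(iters):
        ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        lam = sum(x[i] * ax[i] for i in range(n))  # Rayleigh quotient, ||x|| = 1
        y = [x[i] + ax[i] for i in range(n)]       # (A + I) x
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return lam

def epidemic_dies_out(A, beta, delta):
    """Assumed SIS/SIR condition: below threshold when lambda1 * beta / delta < 1."""
    return leading_eigenvalue(A) * beta / delta < 1.0
```

For a triangle (λ1 = 2), β = 0.1 and δ = 0.5 give an effective strength of 0.4, below the threshold; β = 0.4 gives 1.6, above it.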

  25. “Eigen-Drop” • Eigen-Drop(S): Δλ = λ − λ_S, where λ_S is the first eigenvalue after removing node set S [Figure: original graph vs. the same graph without nodes {2, 6}]

  26. Pre-emptive: Goal • Decrease λ1 as much as possible • Node-based [Tong, Prakash+ ICDM 2010] • Edge-based [Tong, Prakash+ CIKM 2012, Best Paper Award] • Edge-Manipulation [Prakash, Adamic+ SDM 2013]

  27. Node-based: Direct algorithm too expensive! [Tong, Prakash+ ICDM 2010; Prakash, Adamic+ SDM 2013] • Select the k nodes which maximize Δλ: S = argmax Δλ • Combinatorial! • Complexity: • Example: • 1,000 nodes, with 10,000 edges • It takes 0.01 seconds to compute λ • It takes 2,615 years to find the 5 best nodes!
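
The 2,615-year figure can be reproduced with a short calculation, assuming the direct algorithm computes λ once for every possible set of k nodes (an assumption; the slide's complexity expression did not survive in the transcript). The function name is hypothetical.

```python
from math import comb

def brute_force_years(n_nodes, k, seconds_per_eigenvalue):
    """Years needed to evaluate lambda once per k-subset of n_nodes nodes."""
    subsets = comb(n_nodes, k)  # number of candidate node sets
    seconds = subsets * seconds_per_eigenvalue
    return seconds / (365.25 * 24 * 3600)

years = brute_force_years(1000, 5, 0.01)
print(f"{years:,.0f} years")
```

There are about 8.25 × 10^12 five-node subsets of 1,000 nodes, so at 0.01 seconds each the enumeration takes roughly 2,600 years, in line with the slide.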

  28. Our Solution • Part 1: Carefully approximate the Eigen-drop (Δλ) • Matrix perturbation theory • Part 2: Algorithm • Greedily pick the best node at each step • The Eigen-drop approximation is submodular • NetShield (linear complexity) • O(nk² + m), where n = # nodes and m = # edges
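
The greedy step can be sketched as follows. This is an illustration only: the score used here (2λ·u_j² for a candidate j, minus a penalty for edges between j and already-chosen nodes, with u the leading eigenvector) is an assumed form of the first-order eigen-drop approximation and is not stated on the slide. The dense-matrix loops are also for readability and do not achieve the O(nk² + m) bound. All function names are hypothetical.

```python
def leading_eigenpair(A, iters=2000):
    """Leading eigenvalue and unit eigenvector of a symmetric matrix A."""
    n = len(A)
    x = [1.0 / n ** 0.5] * n
    lam = 0.0
    for _ in range(iters):
        ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        lam = sum(x[i] * ax[i] for i in range(n))
        y = [x[i] + ax[i] for i in range(n)]  # iterate on A + I for stability
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return lam, x

def greedy_shield(A, k):
    """Greedily pick up to k nodes with the largest approximate eigen-drop."""
    n = len(A)
    lam, u = leading_eigenpair(A)
    # Stand-alone score of each node (assumed approximation).
    base = [(2.0 * lam - A[j][j]) * u[j] ** 2 for j in range(n)]
    chosen = []
    for _ in range(min(k, n)):
        best, best_score = None, float("-inf")
        for j in range(n):
            if j in chosen:
                continue
            # Penalize candidates that are adjacent to already-chosen nodes.
            overlap = sum(A[j][i] * u[i] for i in chosen)
            score = base[j] - 2.0 * overlap * u[j]
            if score > best_score:
                best, best_score = j, score
        chosen.append(best)
    return chosen
```

On a star graph the hub has by far the largest score, so it is picked first; removing it disconnects the graph and drops λ1 to zero.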

  29. Experiment: Immunization quality [Plot: log(fraction of infected nodes) vs. time; lower is better. Methods compared: PageRank, Betweenness (shortest path), Degree, Acquaintance, Eigs (=HITS), NetShield]

  30. Latest results • First (provable) approximation algorithms for the edge-based problem [Saha, Adiga, Prakash, Vullikanti SDM 2015] • O(log² n)-factor (can be improved to O(log n)) • Based on the idea of removing closed walks • Semi-Definite Programming rounding-based O(1) factor • Running time more expensive than NetShield

  31. Data-aware Immunization [Zhang and Prakash, SDM 2014; Zhang and Prakash, TKDD 2015] • Given: graph and infected nodes • Find: ‘best’ nodes for immunization • Complexity • NP-hard • Hard to approximate within an absolute error • DAVA-tree • Optimal solution on the tree • DAVA and DAVA-fast • Merging infected nodes • Build a “dominator tree”, and run DAVA-tree • Running time: subquadratic • DAVA: O(k(|E| + |V| log |V|)) • DAVA-fast: O(|E| + |V| log |V|) [Figure: graph with infected nodes; its dominator tree]

  32. Extensions • Can be extended to uncertain and noisy initial data as well [Zhang and Prakash, CIKM 2014] [Figure: Twitter Firehose, API, 1% sample]

  33. Group-based immunization [Zhang+, ICDM 2015] • Sometimes individual immunization cannot be easily turned into implementable policies • E.g., it is hard to ensure that specific individuals take the adequate vaccine

  34. Group-based immunization • Observation: Groups naturally exist in underlying networks • ages, demographics, occupations, … • interests, geolocations, … • How to select groups to control propagation over networks? [Figures: Occupation Groups; Geolocation Groups]

  35. Summary of methods • Notation: m = number of vaccines (budget); n = number of groups; L = simulation time for greedy algorithm; V = node set

  36. Data-driven Immunization [Zhang+, ICDM 2017, Best-paper candidate] • Data Explosion • Network Data: Twitter following network, population contact network, … • Propagation Data: tweets in social media, flu reports in public health, … • However, can we build algorithms directly using data?

  37. Case-Study [Figure: Houston and Miami; allocation from contact networks, allocation from propagation data, and our approach] • Observation 1: Our approach considers both networks and propagation data

  38. Case-Study [Figure: Houston and Miami; allocation from contact networks, allocation from propagation data, and our approach] • Observation 2: Our approach distributes vaccines to areas with high risk of influenza outbreak • E.g., the Texas Medical Center (large medical center) • E.g., Miami Beach (with a large transient population)

  39. Part 2: Algorithms • Q1: Whom to immunize? • Q2: How to reverse-engineer epidemics?

  40. Problem definition • 2-d grid; ‘+’ -> infected • Who started it? • In Prakash+, ICDM 2012 (Selected for best papers)

  41. Problem definition • 2-d grid; ‘+’ -> infected • Who started it? • Prior work: [Lappas et al. 2010, Shah et al. 2011]

  42. Who are the culprits? NetSleuth • Two-part solution • Use MDL for the number of seeds • For a given number: exoneration = centrality + penalty • Novel Laplacian sub-matrix method • Running time = linear! (in edges and nodes)
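
As a rough illustration of the Laplacian sub-matrix idea only, the sketch below finds a single likely seed. It assumes the seed is scored by the eigenvector for the smallest eigenvalue of the graph Laplacian restricted to the infected nodes (with degrees counted in the full graph), which is one reading of the bullet above and not something the slide spells out. It omits the MDL step for choosing the number of seeds and the exoneration step, and the dense loops are not linear-time. The name `likely_seed` is hypothetical.

```python
def likely_seed(adj, infected, iters=5000):
    """Pick one likely seed among `infected` in graph adj = {node: [neighbors]}."""
    nodes = sorted(infected)
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    # Laplacian restricted to infected nodes; the diagonal keeps the
    # full-graph degree, so nodes next to uninfected ones are penalized.
    L = [[0.0] * n for _ in range(n)]
    for v in nodes:
        i = index[v]
        L[i][i] = float(len(adj[v]))
        for w in adj[v]:
            if w in index:
                L[i][index[w]] = -1.0
    # Power iteration on (shift*I - L) converges to the eigenvector of the
    # smallest eigenvalue of L; 2*max_degree bounds L's spectrum (Gershgorin).
    shift = 2.0 * max(L[i][i] for i in range(n)) + 1.0
    x = [1.0] * n
    for _ in range(iters):
        y = [shift * x[i] - sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = max(abs(val) for val in y)
        x = [val / norm for val in y]
    return nodes[max(range(n), key=lambda i: x[i])]
```

On a 7-node path where the middle five nodes are infected, the score peaks at the center of the infected stretch, which matches the intuition that the outbreak spread outward from there.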

  43. Case-Study • 35 TB patients + 1039 contacts, CDC [AJPH 2007] • Patient zero found by NetSleuth = patient zero identified by CDC

  44. Many extensions • Temporal networks [Rozenshtein+ SIGKDD 2016] • Noisy input [Sundareisan+ SDM 2015]

  45. Outline • Motivation • Part 1: Understanding Epidemics (Theory) • Part 2: Policy and Action (Algorithms) • Part 3: Applications (Data-Driven) • Conclusion

  46. Part 3: Applications How to use propagation for _________ • Q3: Disease Surveillance • Q4: Memes, and Malware • Q5: General Graph Mining

  47. GFT (Google Flu Trends) & Twitter • Estimate flu trends using online electronic sources • Example tweets: “So cold today, I’m catching cold.” “I have headache, sore throat, I can’t go to school today.” “My nose is totally congested, I have a hard time understanding what I’m saying.”

  48. Nowcasting the Flu • Propagation on Twitter to “nowcast” the H1N1 pandemic • Track the spread of flu-related keywords • Support vector regression to CDC ILI dictionary

  49. Flu forecasting • Twitter – a surrogate for flu forecasting? • Google Flu Trends: using keywords to track the flu season • Can we get more specific? • Consider:

  50. “Propagation” ideas • Can we develop better disease surveillance tools by leveraging: • How flu-related information propagates on Twitter • Epidemiological models
