Leveraging Propagation for Data Mining Models, Algorithms & Applications. B. Aditya Prakash Naren Ramakrishnan. August 10, Tutorial, SIGKDD 2016, San Francisco. About us. B. Aditya Prakash Asst. Professor CS, Virginia Tech. PhD. CMU, 2012. Data Mining, Applied ML
Leveraging Propagation for Data Mining: Models, Algorithms & Applications B. Aditya Prakash Naren Ramakrishnan August 10, Tutorial, SIGKDD 2016, San Francisco
About us • B. Aditya Prakash • Asst. Professor • CS, Virginia Tech. • PhD. CMU, 2012. • Data Mining, Applied ML • Graph and Time-series mining • Applications to Social Media, Epidemiology/Public Health, Cyber Security • Homepage: http://www.cs.vt.edu/~badityap/ Prakash and Ramakrishnan 2016
About us • Naren Ramakrishnan • Thomas L. Phillips Prof. • CS, Virginia Tech. • PhD. Purdue, 1997. • Data mining for intelligence analysis, forecasting, sustainability, and health informatics • Homepage: http://people.cs.vt.edu/naren/
Tutorial webpage • http://people.cs.vt.edu/~badityap/TALKS/16-kdd-tutorial/ • All Slides will be posted there. • Talk video as well (later).
Networks are everywhere! Facebook Network [2010] Gene Regulatory Network [Decourty 2008] Human Disease Network [Barabasi 2007] The Internet [2005]
Dynamical Processes over networks are also everywhere!
Why do we care? • Social collaboration • Information Diffusion • Viral Marketing • Epidemiology and Public Health • Cyber Security • Human mobility • Games and Virtual Worlds • Ecology • ........
Why do we care? (1: Epidemiology) • Dynamical Processes over networks [AJPH 2007] CDC data: Visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts Diseases over contact networks
Why do we care? (1: Epidemiology) • Dynamical Processes over networks • Each circle is a hospital • ~3000 hospitals • More than 30,000 patients transferred [US-MEDICARE NETWORK 2005] Problem: Given k units of disinfectant, whom to immunize?
Why do we care? (1: Epidemiology) • [Figure: CURRENT PRACTICE vs. OUR METHOD on the US-MEDICARE NETWORK 2005: ~6x fewer!] • Hospital-acquired inf. took 99K+ lives, cost $5B+ (all per year)
Why do we care? (2: Online Diffusion) > 800m users, ~$1B revenue [WSJ 2010] ~100m active users > 50m users
Why do we care? (2: Online Diffusion) • Dynamical Processes over networks Buy Versace™! Followers Celebrity Social Media Marketing
Why do we care? (3: To change the world?) • Dynamical Processes over networks Social networks and Collaborative Action
High Impact – Multiple Settings epidemic out-breaks Q. How to squash rumors faster? Q. How do opinions spread? Q. How to market better? products/viruses transmit s/w patches
Research Theme ANALYSIS Understanding POLICY/ ACTION Managing DATA Large real-world networks & processes
Research Theme – Public Health ANALYSIS Will an epidemic happen? POLICY/ ACTION How to control out-breaks? DATA Modeling # patient transfers
Research Theme – Social Media ANALYSIS # cascades in future? POLICY/ ACTION How to market better? DATA Modeling Tweets spreading
In this tutorial Given propagation models, on arbitrary networks: Q1: What is the epidemic threshold? Q2: How do viruses compete? With extensions to dynamic networks, multi-profile networks etc. Fundamental Models Understanding
In this tutorial Q3: How to estimate and learn influence and networks? Q4: How to immunize and control out-breaks better? Q5: How to reverse-engineer epidemics? Q6: How to leverage viral marketing? Q7: How to pick sensors for graphs? Algorithms Managing/Manipulating
In this tutorial How to use propagation for _________ Q8: Memes, Tweets, Blogs Q9: Disease Surveillance Q10: Protest Trends Q11: Malware Attacks Q12: General Graph Mining Applications Large real-world networks & processes
Plan • Three breaks! • 2-2:05pm • 3-3:30pm (conference coffee break) • 4:15-4:20pm • Part 2: Algorithms starts at roughly 1:50pm • Part 3: Applications at 3:30pm (after the coffee break) • Please interrupt anytime for questions
Outline • Motivation • Part 1: Understanding Epidemics (Theory) • Part 2: Policy and Action (Algorithms) • Part 3: Applications (Data-Driven) • Conclusion
Part 1: Theory • Q1: What is the epidemic threshold? • Q2: How do viruses compete?
A fundamental question Strong Virus Epidemic?
example (static graph) Weak Virus Epidemic?
Problem Statement • [Figure: # infected vs. time, with curves above (epidemic) and below (extinction) the threshold] • Separate the regimes? • Find a condition under which • virus will die out exponentially quickly • regardless of initial infection condition
Threshold (static version) Problem Statement • Given: • Graph G, and • Virus specs (attack prob. etc.) • Find: • A condition for virus extinction/invasion
Threshold: Why important? • Accelerating simulations • Forecasting (‘What-if’ scenarios) • Design of contagion and/or topology • A great handle to manipulate the spreading • Immunization • Maximize collaboration …..
Part 1: Theory • Q1: What is the epidemic threshold? • Background • Result and Intuition (Static Graphs) • Proof Ideas (Static Graphs) • Bonus: Dynamic Graphs • Q2: How do viruses compete?
Background “SIR” model: lifetime immunity (mumps) • Each node in the graph is in one of three states • Susceptible (i.e. healthy) • Infected • Removed (i.e. can’t get infected again) • [Figure: example at t = 1, 2, 3; an infected node infects a susceptible neighbor with prob. β and is removed with prob. δ]
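A minimal sketch of the SIR process described on this slide, assuming a discrete-time version in which every infected node attacks each susceptible neighbor independently with probability β per step and is then removed with probability δ. The function name `simulate_sir`, the adjacency-list input format, and the order of the updates within a step are illustrative assumptions, not taken from the tutorial.

```python
import random
from collections import Counter


def simulate_sir(adj, seeds, beta, delta, steps, rng=None):
    """Discrete-time SIR on a graph given as {node: [neighbors]}.

    Returns one (susceptible, infected, removed) count tuple per time step.
    """
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    state = {v: "S" for v in adj}
    for v in seeds:
        state[v] = "I"

    history = []
    for _ in range(steps):
        infected = [v for v in adj if state[v] == "I"]

        # Each infected node attacks each susceptible neighbor with prob. beta.
        newly_infected = set()
        for v in infected:
            for u in adj[v]:
                if state[u] == "S" and rng.random() < beta:
                    newly_infected.add(u)

        # Each node that was infected at the start of the step is removed
        # with prob. delta (assumed update order).
        for v in infected:
            if rng.random() < delta:
                state[v] = "R"

        for u in newly_infected:
            state[u] = "I"

        counts = Counter(state.values())
        history.append((counts["S"], counts["I"], counts["R"]))
    return history
```

With β = δ = 1 the process is deterministic: the infection moves one hop per step and every node ends up removed, which makes a convenient sanity check before trying smaller probabilities.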
Background Terminology: continued • Other virus propagation models (“VPM”) • SIS : susceptible-infected-susceptible, flu-like • SIRS : temporary immunity, like pertussis • SEIR : mumps-like, with virus incubation (E = Exposed) ….…………. • Underlying contact-network – ‘who-can-infect-whom’
Background Related Work • All are about either: • Structured topologies (cliques, block-diagonals, hierarchies, random) • Specific virus propagation models • Static graphs • R. M. Anderson and R. M. May. Infectious Diseases of Humans. Oxford University Press, 1991. • A. Barrat, M. Barthélemy, and A. Vespignani. Dynamical Processes on Complex Networks. Cambridge University Press, 2010. • F. M. Bass. A new product growth for model consumer durables. Management Science, 15(5):215–227, 1969. • D. Chakrabarti, Y. Wang, C. Wang, J. Leskovec, and C. Faloutsos. Epidemic thresholds in real networks. ACM TISSEC, 10(4), 2008. • D. Easley and J. Kleinberg. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press, 2010. • A. Ganesh, L. Massoulie, and D. Towsley. The effect of network topology on the spread of epidemics. IEEE INFOCOM, 2005. • Y. Hayashi, M. Minoura, and J. Matsukubo. Recoverable prevalence in growing scale-free networks and the effective immunization. arXiv:cond-mat/0305549 v2, Aug. 6 2003. • H. W. Hethcote. The mathematics of infectious diseases. SIAM Review, 42, 2000. • H. W. Hethcote and J. A. Yorke. Gonorrhea transmission dynamics and control. Springer Lecture Notes in Biomathematics, 46, 1984. • J. O. Kephart and S. R. White. Directed-graph epidemiological models of computer viruses. IEEE Computer Society Symposium on Research in Security and Privacy, 1991. • J. O. Kephart and S. R. White. Measuring and modeling computer virus prevalence. IEEE Computer Society Symposium on Research in Security and Privacy, 1993. • R. Pastor-Satorras and A. Vespignani. Epidemic spreading in scale-free networks. Physical Review Letters 86, 14, 2001. • ……… • ……… • ………
Part 1: Theory • Q1: What is the epidemic threshold? • Background • Result and Intuition (Static Graphs) • Proof Ideas (Static Graphs) • Bonus: Dynamic Graphs • Q2: How do viruses compete?
What should the answer look like? ….. • Answer should depend on: • Graph • Virus Propagation Model (VPM) • But how?? • Graph – average degree? max. degree? diameter? • VPM – which parameters? • How to combine – linear? quadratic? exponential?
Static Graphs: Our Main Result • Informally, for • any arbitrary topology (adjacency matrix A) • any virus propagation model (VPM) in standard literature • the epidemic threshold depends only on • λ, the first eigenvalue of A, and • some constant $C_{VPM}$, determined by the virus propagation model • No epidemic if $\lambda \cdot C_{VPM} < 1$ • In Prakash+ ICDM 2011
Our thresholds for some models • s = effective strength • s < 1: below threshold
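A minimal sketch of how an effective strength could be computed for a given graph, assuming the threshold has the form s = λ · C with λ the largest eigenvalue of the adjacency matrix and C a model-dependent constant. The choice C = β/δ used below is the value usually quoted for SIS- and SIR-style models and is an assumption here, since the per-model table from this slide is not reproduced. `largest_eigenvalue` and `effective_strength` are hypothetical helper names.

```python
import numpy as np


def largest_eigenvalue(A):
    """Largest eigenvalue of a symmetric adjacency matrix A."""
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    return float(np.linalg.eigvalsh(A)[-1])


def effective_strength(A, beta, delta):
    """s = lambda * C, with the assumed model constant C = beta / delta."""
    return largest_eigenvalue(A) * beta / delta


def below_threshold(A, beta, delta):
    """True when s < 1, i.e. the virus is expected to die out."""
    return effective_strength(A, beta, delta) < 1.0
```

For a 5-node clique (λ = 4) with β = 0.1 and δ = 0.5, this gives s = 0.8, which is below the threshold.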
Our result: Intuition for λ • “Official” definition: Let A be the adjacency matrix. Then λ is the root with the largest magnitude of the characteristic polynomial of A [det(A – xI)]. Doesn’t give much intuition! • “Un-official” intuition: λ ~ # paths in the graph • $A^k \approx \lambda^k u u^T$, where u is the eigenvector of A for λ • $A^k(i, j)$ = # of paths i → j of length k
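The path-counting intuition can be checked numerically. The sketch below, under the assumption of an undirected graph with a symmetric 0/1 adjacency matrix, compares the exact counts $A^k(i, j)$ with the rank-one approximation $\lambda^k u u^T$. The helper names are hypothetical, and "paths" here means walks, in which nodes may repeat.

```python
import numpy as np


def walk_counts(A, k):
    """Entry (i, j) is the number of length-k walks from i to j."""
    return np.linalg.matrix_power(A, k)


def rank_one_approx(A, k):
    """lambda^k * u u^T, using the largest eigenvalue and its unit eigenvector."""
    vals, vecs = np.linalg.eigh(A)  # ascending eigenvalues, eigenvectors in columns
    lam, u = vals[-1], vecs[:, -1]
    return (lam ** k) * np.outer(u, u)
```

The approximation tightens as k grows when λ is well separated from the other eigenvalues in magnitude, as in a clique. On bipartite graphs, where -λ is also an eigenvalue, the rank-one term alone is not a good approximation.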
Largest Eigenvalue (λ) • better connectivity → higher λ • [Figure: three example graphs on N nodes, in order of increasing connectivity: λ ≈ 2, λ ≈ √N, λ = N-1; for N = 1000: λ ≈ 2, λ = 31.67, λ = 999]
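A short sketch that computes λ for three standard graph families of increasing connectivity. The choice of chain, star, and clique is an assumption made for illustration, since the figure from the slide is not reproduced; the helper names are hypothetical.

```python
import numpy as np


def chain(n):
    """Path graph: node i is linked to node i + 1."""
    A = np.zeros((n, n))
    idx = np.arange(n - 1)
    A[idx, idx + 1] = 1.0
    A[idx + 1, idx] = 1.0
    return A


def star(n):
    """Star graph: node 0 is linked to every other node."""
    A = np.zeros((n, n))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    return A


def clique(n):
    """Complete graph on n nodes (no self-loops)."""
    return np.ones((n, n)) - np.eye(n)


def largest_eigenvalue(A):
    """Largest eigenvalue of a symmetric adjacency matrix."""
    return float(np.linalg.eigvalsh(A)[-1])
```

The known closed forms are 2·cos(π/(n+1)) for the chain (just under 2), √(n-1) for the star, and n-1 for the clique, so `largest_eigenvalue(clique(1000))` evaluates to 999 up to floating-point error.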
Examples: Simulations – SIR (mumps) • PORTLAND graph: 31 million links, 6 million nodes • [Figure: (a) Infection profile: fraction of infections vs. time ticks; (b) “Take-off” plot: footprint vs. effective strength]
Examples: Simulations – SIRS (pertussis) • PORTLAND graph: 31 million links, 6 million nodes • [Figure: (a) Infection profile: fraction of infections vs. time ticks; (b) “Take-off” plot: footprint vs. effective strength]
Part 1: Theory • Q1: What is the epidemic threshold? • Background • Result and Intuition (Static Graphs) • Proof Ideas (Static Graphs) • Bonus: Dynamic Graphs • Q2: How do viruses compete?
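Proof Sketch • General VPM structure (model-based) • Topology and stability (graph-based) • Together these give the condition $\lambda \cdot C_{VPM} < 1$, where $C_{VPM}$ is the constant determined by the virus propagation model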
Models and more models
Ingredient 1: Our generalized model • Three classes of states: Susceptible, Infected, Vigilant • [Figure: state diagram with endogenous and exogenous transitions between the classes]
Special case: SIR Susceptible Infected Vigilant
Special case: H.I.V. “Non-terminal” “Terminal” Multiple Infectious, Vigilant states
Details Ingredient 2: NLDS + Stability • View as an NLDS: a discrete-time non-linear dynamical system • Probability vector: specifies the state of the system at time t • size mN × 1: stacked blocks (S, I, V, ...), each of size N (number of nodes in the graph)
Details Ingredient 2: NLDS + Stability • View as an NLDS: a discrete-time non-linear dynamical system • Non-linear function: explicitly gives the evolution of the system, acting on the mN × 1 probability vector
Ingredient 2: NLDS + Stability • View as an NLDS: a discrete-time non-linear dynamical system • Threshold: determined by the stability of the NLDS
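The link between threshold and stability rests on a standard fact about discrete-time systems, sketched below. The symbols $x_t$ (state vector), $g$ (update function), $x^*$ (fixed point), and $J$ (Jacobian) are notation assumed here, not taken from the slide. In the epidemic setting, the fixed point of interest would presumably be the state with no infected nodes.

```latex
x_{t+1} = g(x_t), \qquad g(x^*) = x^*, \qquad
J = \left. \frac{\partial g}{\partial x} \right|_{x = x^*}

x^* \text{ is asymptotically stable if } |\mu| < 1
\text{ for every eigenvalue } \mu \text{ of } J .
```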
Details Special case: SIR • Probability vector of size 3N × 1, stacking the S, I, and R blocks • The NLDS update uses, for each node i, the probability that node i is not attacked by any of its infectious neighbors
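A minimal sketch of one step of an SIR system of this kind, assuming a common mean-field form in which node i escapes attack with probability ζ_i = Π_j (1 − β · A[i, j] · I[j]). The symbol ζ, the exact update equations, and the function name `sir_nlds_step` are assumptions made for illustration; the equations from the slide itself are not reproduced.

```python
import numpy as np


def sir_nlds_step(A, S, I, R, beta, delta):
    """One step of an assumed mean-field SIR update.

    A is the N x N adjacency matrix; S, I, R are length-N probability
    vectors with S[i] + I[i] + R[i] == 1 for every node i.
    """
    # zeta[i]: probability that node i is not attacked by any infectious neighbor.
    zeta = np.prod(1.0 - beta * A * I[np.newaxis, :], axis=1)

    S_next = S * zeta                              # stayed healthy
    I_next = S * (1.0 - zeta) + (1.0 - delta) * I  # newly infected + not yet removed
    R_next = R + delta * I                         # removed nodes stay removed
    return S_next, I_next, R_next
```

Each update preserves S[i] + I[i] + R[i] = 1 for every node, so the three blocks can be stacked back into a single 3N × 1 probability vector after each step.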