Scalable Vaccine Distribution in Large Graphs given Uncertain Data

Yao Zhang, B. Aditya Prakash
Department of Computer Science, Virginia Tech
CIKM, Shanghai, November 6, 2014

Outline
• Motivation
• Problem Definition
• Our Proposed Methods
• Experiments
• Conclusion
Zhang and Prakash, CIKM2014

Propagation on networks
• Virus outbreaks over population networks
  • E.g., WHO estimates 5,000 to 10,000 new Ebola cases weekly in West Africa by the first week of December [from leverage.com]
• Information spreads over social networks
  • E.g., millions of photos and messages are shared [from the Economist]

Motivation I: Diffusion models – Social Media
• In social media, information spreads over friendship networks
  • E.g., a rumor spreads over the Facebook friendship network
• Independent cascade (IC) model [Kempe+, KDD03]
  • Weights βij: propagation prob. from i to j
  • Each node has only one chance to infect its neighbors
[Figure: rumor spreading, with edge weights β12 and β13]
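The IC model above can be illustrated with a minimal simulation sketch. The graph encoding (a dict of `node -> {neighbor: βij}`), the function name `simulate_ic`, and the `removed` parameter standing in for vaccinated nodes are assumptions made for illustration; they are not taken from the talk.

```python
import random
from collections import deque

def simulate_ic(graph, seeds, removed=frozenset(), rng=None):
    """Run one independent-cascade simulation; return the set of infected nodes.

    graph:   dict mapping node -> {neighbor: beta_ij} (assumed encoding)
    seeds:   initially infected nodes
    removed: vaccinated nodes, which can neither catch nor pass the infection
    """
    rng = rng or random.Random()
    infected = {s for s in seeds if s not in removed}
    frontier = deque(infected)
    while frontier:
        i = frontier.popleft()
        # Node i gets exactly one chance to infect each neighbor j.
        for j, beta in graph.get(i, {}).items():
            if j in infected or j in removed:
                continue
            if rng.random() < beta:
                infected.add(j)
                frontier.append(j)
    return infected

# Toy graph with deterministic weights so the outcome does not depend on the seed.
toy_graph = {1: {2: 1.0, 3: 0.0}, 2: {4: 1.0}, 3: {}, 4: {}}
spread = simulate_ic(toy_graph, seeds=[1], rng=random.Random(0))
spread_after_vaccination = simulate_ic(toy_graph, seeds=[1], removed={2})
```

Averaging the size of the returned set over many runs gives a Monte Carlo estimate of the expected epidemic size.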

Motivation I: Diffusion models – Epidemiology
• In epidemiology, a virus spreads over population contact networks
  • E.g., Ebola, chickenpox, etc. may spread when people come into contact
• SIR model [Anderson+ 1991]
  • Susceptible-Infectious-Recovered
  • Weights βij: propagation prob. from i to j
  • Recovery prob. δ for each infected node
[Figure: Ebola spreading, with edge weights β12 and β13 and recovery prob. δ]
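A discrete-time SIR simulation can be sketched as follows. The update order (infection attempts first, then recovery in the same step), the graph encoding, and the name `simulate_sir` are assumptions; the slides do not fix these details.

```python
import random

def simulate_sir(graph, seeds, delta, rng=None, max_steps=10_000):
    """Simulate discrete-time SIR; return every node that was ever infected.

    graph: dict mapping node -> {neighbor: beta_ij} (assumed encoding)
    delta: per-step recovery probability of an infectious node
    """
    rng = rng or random.Random()
    infectious = set(seeds)
    recovered = set()
    for _ in range(max_steps):
        if not infectious:
            break
        newly_infected = set()
        for i in infectious:
            for j, beta in graph.get(i, {}).items():
                susceptible = j not in infectious and j not in recovered
                if susceptible and rng.random() < beta:
                    newly_infected.add(j)
        # Assumed order: nodes try to infect neighbors before they may recover.
        healed = {i for i in infectious if rng.random() < delta}
        recovered |= healed
        infectious = (infectious - healed) | newly_infected
    return recovered | infectious

chain = {1: {2: 1.0}, 2: {3: 1.0}, 3: {}}
ever_infected = simulate_sir(chain, seeds=[1], delta=1.0, rng=random.Random(0))
```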

Motivation II: Immunization
• Epidemiology
  • Centers for Disease Control (CDC)
  • Which people to vaccinate to control the spread of Ebola?
• Social Media
  • Twitter
  • Which people to warn to stop rumors like “wall street crashing”?
Common abstract goal: “find best nodes to remove”

Immunization Strategies
• Pre-emptive Strategy
  • Choose nodes before the epidemic starts
  • Netshield [Tong+ 2010]
  • Minimizes the largest eigenvalue of the graph, which determines the epidemic threshold [Prakash+ 2011] above which many people get infected
[Figure: which nodes to vaccinate?]

Immunization Strategies
• Pre-emptive Strategy
  • Choose nodes before the epidemic starts
  • Netshield [Tong+ 2010]
• Data-aware Strategy
  • Choose nodes knowing the current infections (which nodes are infected)
  • DAVA-fast algorithm [Zhang and Prakash 2014]
[Figure: which nodes to vaccinate?]
However…

Motivation III: Real Data is Uncertain
We don’t know exactly who is infected
• Epidemiology
  • Public-health surveillance
[Figure: surveillance pyramid [Nishiura+, PLoS ONE 2011]; labels: CDC, Lab, Hospital, CNN headlines, “Not sure?”]
Each level has a certain probability of missing some truly infected people

Motivation III: Real Data is Uncertain
We don’t know exactly who is infected
• Social Media
  • Twitter: because of uniform sampling [Morstatter+, ICWSM 2013], relevant ‘infected’ tweets may be missed
[Figure: sampling from Tweets to Sampled Tweets; some tweets are missing]

Motivation III: Real Data is Uncertain
How to design an immunization strategy in the presence of uncertainty?
• Not sure whether some nodes are infected
• More realistic intervention
Challenge
• Cannot vaccinate/warn people who are already infected
[Figure: which nodes to vaccinate?]
We call it the Uncertain Data-Aware Vaccination Problem (this paper)

Outline
• Motivation
• Problem Definition
  • Uncertainty Models
  • Problem Formulation
• Our Proposed Methods
• Experiments
• Conclusion

Uncertainty Models
• Uniform
  • Identical prob. to be infected
  • E.g., Twitter API
• Surveillance
  • Each node takes a prob. from a set P
  • E.g., Surveillance pyramid
• Prop-Deg
  • The prob. to be infected is proportional to a node’s degree
  • E.g., people with more connections have a higher prob. of being infected
• General
  • Each node has its own infection prob.
We assume factorizable distributions.
[Figure: sampling from Tweets to Sampled Tweets]
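The three concrete uncertainty models can be sketched as functions that assign each node an infection probability. The function names and the dict encoding are assumptions. The Prop-Deg normalization by the maximum degree is one common choice, and `world_probability` reads "factorizable" as independence across nodes, which is an interpretation rather than the formula from the slide.

```python
import random

def uniform_model(nodes, p):
    """Uniform: every node has the same infection probability p."""
    return {v: p for v in nodes}

def surveillance_model(nodes, levels, rng):
    """Surveillance: each node takes its probability from a fixed set of levels."""
    return {v: rng.choice(levels) for v in nodes}

def prop_deg_model(degrees):
    """Prop-Deg: probability proportional to degree (normalized by the max degree)."""
    d_max = max(degrees.values())
    return {v: d / d_max for v, d in degrees.items()}

def world_probability(probs, infected):
    """Probability of one possible world, assuming nodes are independent."""
    result = 1.0
    for v, p in probs.items():
        result *= p if v in infected else (1.0 - p)
    return result

example_probs = prop_deg_model({"a": 1, "b": 4})
example_world = world_probability({"a": 0.5, "b": 0.8}, infected={"a"})
```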

Problem Formulation
Uncertain Data-Aware Vaccination Problem (UDAV)
Given: graph G(V,E), uncertainty model U, infected node set I
Find: the best set S of k nodes to vaccinate
Such that: the final expected epidemic size is minimized
Formally: the expected epidemic size is the expectation, over “possible” worlds Gi, of the expected number of infected nodes after vaccination in Gi
[Figure: example graph with uncertain nodes labeled ?0.5, ?0.5, ?0.8, ?0.8; which two nodes to vaccinate?]
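One plausible way to write the objective is shown below. The symbols are assumptions introduced here, not the paper's notation: Pr(G_i) is the probability of possible world G_i under the uncertainty model U, and σ_{G_i}(S) is the expected number of infected nodes in G_i after vaccinating S. The constraint that already-infected nodes cannot be vaccinated is not captured in this sketch.

```latex
S^{*} \;=\; \arg\min_{S \subseteq V,\ |S| = k} \; \sum_{i} \Pr(G_i)\, \sigma_{G_i}(S)
```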

Complexity of UDAV
• NP-hard, and cannot be approximated within an absolute error
  • A special case of UDAV (the deterministic case) is NP-hard [Zhang+ 2014]

Overview of proposed methods
• UDAV is a stochastic optimization problem
• Sampling-based method
  • The Sample Average Approximation (SAA) framework
• Expectation-based method
  • The expected “situation”
Hedging uncertainty
[Figure: example graph with uncertain nodes labeled ?0.5, ?0.5, ?0.8, ?0.8; which two nodes to vaccinate?]

Outline
• Motivation
• Problem Definition
• Our Proposed Methods
  • Sample-Cascade
  • Expect-Max
• Experiments
• Conclusion

Sample-Cascade: Idea
UDAV can also be formulated as maximizing the expected benefit, where the benefit is that of vaccinating the healthy node set Si in the deterministic graph Gi
Idea: sample deterministic cases, and take the average
[Figure: an uncertain graph (?0.5, ?0.8) has 4 “possible” worlds; Sample 1 ... Sample L are drawn, and the method works on the sampled graphs]
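The sampling idea can be sketched in a few lines: draw possible worlds from the per-node probabilities, then average a benefit function over them. The names `sample_world` and `average_benefit`, and the `benefit` callback, are hypothetical placeholders for whatever per-world benefit estimate is used.

```python
import random

def sample_world(certain_infected, probs, rng):
    """Draw one possible world: certain infections plus each uncertain node with prob. p."""
    world = set(certain_infected)
    for v, p in probs.items():
        if rng.random() < p:
            world.add(v)
    return world

def average_benefit(candidate, worlds, benefit):
    """Sample-average estimate of the expected benefit of vaccinating `candidate`."""
    return sum(benefit(candidate, w) for w in worlds) / len(worlds)

rng = random.Random(1)
# Probabilities 1.0 and 0.0 make this particular draw deterministic.
one_world = sample_world({"x"}, {"a": 1.0, "b": 0.0}, rng)
worlds = [sample_world({"x"}, {"a": 0.5, "b": 0.8}, rng) for _ in range(500)]
```

With the toy probabilities 0.5 and 0.8 there are only 4 possible worlds, so 500 samples cover them many times over; on real graphs the sample count drives the running time.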

Sample-Cascade
Issue 1: how to approximate the benefit in each sampled graph Gi
Solution: use its lower bound (Lemma 1), the expected benefit on the dominator tree of Gi
Dominator tree: u dominates v if every path from the root to v contains u; the dominator tree encodes these relations (see [Lengauer and Tarjan, 1979]). Here, the root is the set of infected nodes.
See paper for details
[Figure: samples are converted to the dominator trees of the sampled graphs; the method works on trees]
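The dominator relation can be made concrete with a naive sketch: a node u dominates exactly those nodes that become unreachable from the infected set once u is removed. This brute-force check costs O(|V|(|V|+|E|)) and is a stand-in for the Lengauer-Tarjan algorithm, not the method used in the paper. It also ignores edge weights, so it counts nodes that are certainly cut off rather than an expected benefit. Function names are assumptions.

```python
from collections import deque

def reachable(graph, roots, blocked=None):
    """Nodes reachable from any root by BFS, never entering `blocked`."""
    seen = {r for r in roots if r != blocked}
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v != blocked and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def dominated_counts(graph, infected):
    """For each healthy reachable node u, count the nodes cut off by removing u.

    The count includes u itself, i.e. it is the size of u's dominator subtree.
    """
    base = reachable(graph, infected)
    counts = {}
    for u in base - set(infected):
        counts[u] = len(base) - len(reachable(graph, infected, blocked=u))
    return counts

# r is infected; c lies on every path from r to d, so removing c saves c and d.
diamond = {"r": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"], "d": []}
counts = dominated_counts(diamond, infected={"r"})
```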

Sample-Cascade
• Algorithm:
  1. Sample Gi from G and U, and build the dominator trees of Gi
  2. Select the node a* with the maximum average benefit on the dominator trees
  3. Remove a* from G and add it to S
  4. Go to Step 2 until |S|=k
[Figure: dominator trees of the sampled graphs; the method works on trees]

Sample-Cascade
Issue 2: number of samples l
Solution: bound l using Hoeffding's inequality
Worst case: l = O(|V|^2)
Running time: O(l*(k|E|+k|V|+|V|log|V|))
Accurate, but too slow for large networks!
[Figure: dominator trees of the sampled graphs; the method works on trees]
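A sample count of this kind can be derived from the standard two-sided Hoeffding bound, as in the sketch below. The parameter names and the way the bound is applied are assumptions, since the exact statement is in the paper. If the per-sample benefit can range over up to |V| nodes and the error tolerance is a fixed absolute value, the count grows as |V|^2, which is one way to read the worst case above.

```python
import math

def hoeffding_samples(value_range, epsilon, failure_prob):
    """Samples needed so the sample mean is within epsilon of its expectation.

    From P(|mean - E| >= epsilon) <= 2 * exp(-2 * l * epsilon^2 / range^2),
    solving for l with the right-hand side set to failure_prob.
    """
    return math.ceil(
        value_range ** 2 * math.log(2.0 / failure_prob) / (2.0 * epsilon ** 2)
    )

samples_unit_range = hoeffding_samples(1, 0.1, 0.05)
samples_wide_range = hoeffding_samples(10, 0.1, 0.05)
```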

Outline
• Motivation
• Problem Definition
• Our Proposed Methods
  • Sample-Cascade
  • Expect-Max
• Experiments
• Conclusion

Expect-Max: Idea
Idea: construct the expected “situation” (graph)
• Create a “super node”
• Add an edge from the super node to each node that may be infected, weighted by that node's infection prob. (0.5, 0.8, and 1.0 in the example)
Lemma: when the budget=1, UDAV can be exactly solved on the expected graph
How to calculate it? See more details in the paper
[Figure: original graph with uncertain nodes ?0.5 and ?0.8, and expected graph GE with super-node edges 0.5, 0.8, 1.0]
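The construction can be sketched as follows, assuming a dict-of-dicts graph encoding. The function name and the `"SUPER"` label are hypothetical. Nodes known to be infected get a super-node edge of weight 1.0, matching the 1.0 edge in the figure.

```python
def build_expected_graph(graph, certain_infected, probs, super_node="SUPER"):
    """Return a copy of `graph` with a super node that stands for the infection source.

    graph: dict mapping node -> {neighbor: beta_ij} (assumed encoding)
    probs: dict mapping uncertain node -> infection probability
    """
    expected = {u: dict(nbrs) for u, nbrs in graph.items()}
    expected[super_node] = {}
    for v, p in probs.items():
        if p > 0.0:
            expected[super_node][v] = p
    # Written last so that a definitely infected node always gets weight 1.0.
    for v in certain_infected:
        expected[super_node][v] = 1.0
    return expected

original = {"a": {"b": 0.3}, "b": {}, "c": {}}
expected_graph = build_expected_graph(original, {"c"}, {"a": 0.5, "b": 0.8})
```

The uncertain initial state then becomes an ordinary propagation problem with a single seed, the super node.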

Calculating Benefit on the Expected Graph
• We propose two methods to calculate the benefit
  • Using the dominator tree: Expect-Dom
  • Using the drop of the first eigenvalue: Expect-Eig

Expect-Dom
Idea: use the benefit on the dominator tree to approximate the benefit on the expected graph GE
Steps:
1. GE = construct the expected graph
2. T = build a dominator tree of GE
3. Select v with max. benefit on T
4. Remove v from G
5. Go to Step 3 until |S|=k
[Figure: expected graph GE (super-node edges 0.5, 0.8, 1.0) and the dominator tree of GE constructed from it]

Expect-Eig
Idea: use the drop of the first eigenvalue to approximate the benefit on the expected graph GE
• The first eigenvalue measures the threshold of the epidemic
• The drop can be computed fast [Tong+, ICDM 2010]
Lemma: the number of newly infected nodes is bounded by the first eigenvalue (details in the paper)
[Figure: expected graph GE with super-node edges 0.5, 0.8, 1.0, on which the eigenvalue drop is calculated]

Expect-Eig
Idea: use the drop of the first eigenvalue to approximate the benefit on the expected graph GE
Steps:
1. GE = construct the expected graph
2. Select v with max. drop of the first eigenvalue
3. Remove v from G
4. Go to Step 2 until |S|=k
[Figure: expected graph GE with super-node edges 0.5, 0.8, 1.0]
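The greedy loop can be sketched with a brute-force eigenvalue drop. This is not the fast approximation of [Tong+, ICDM 2010]: it recomputes the largest eigenvalue for every candidate with power iteration. It also assumes a symmetric, non-negative weighted adjacency structure and sortable node labels. The function names are hypothetical.

```python
def largest_eigenvalue(adj, nodes, iters=200):
    """Largest eigenvalue of the subgraph induced by `nodes`.

    Power iteration is run on A + I and 1 is subtracted at the end; the shift
    avoids the oscillation that plain power iteration shows on bipartite graphs.
    """
    nodes = set(nodes)
    if not nodes:
        return 0.0
    x = {v: 1.0 for v in nodes}
    lam = 0.0
    for _ in range(iters):
        y = {v: x[v] + sum(w * x[u] for u, w in adj[v].items() if u in nodes)
             for v in nodes}
        norm = max(y.values())
        x = {v: val / norm for v, val in y.items()}
        lam = norm - 1.0
    return lam

def expect_eig_greedy(adj, candidates, k):
    """Greedily pick up to k candidates, each maximizing the eigenvalue drop."""
    candidates = list(candidates)
    remaining = set(adj)
    chosen = []
    for _ in range(k):
        pool = sorted(c for c in candidates if c in remaining)
        if not pool:
            break
        base = largest_eigenvalue(adj, remaining)
        best = max(pool, key=lambda c: base - largest_eigenvalue(adj, remaining - {c}))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Star graph: the center carries all of the connectivity.
star = {"c": {"a": 1.0, "b": 1.0, "d": 1.0, "e": 1.0},
        "a": {"c": 1.0}, "b": {"c": 1.0}, "d": {"c": 1.0}, "e": {"c": 1.0}}
picked = expect_eig_greedy(star, star.keys(), k=1)
```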

Expect-Dom vs. Expect-Eig
• Let α be the support of U
  • The percentage of nodes that may be initially infected
[Figure: example with uncertain nodes ?0.5 and ?0.8, where α=0.5]

Expect-Dom vs. Expect-Eig
• Let α be the support of U
  • The percentage of nodes that may be initially infected
• As α increases,
  • Observation I: Expect-Dom becomes worse
    • Intuition: when α is equal to 0, UDAV reduces to the deterministic case (which can be solved by DAVA-fast [Zhang and Prakash 2014])
  • Observation II: Expect-Eig becomes better
    • Intuition: as α increases, there is more and more uncertainty, which is close to the pre-emptive case (which can be solved by Netshield [Tong+ 2010])
More formal justification in the paper

Expect-Max: a hybrid algorithm
Idea: put Expect-Dom and Expect-Eig together
• They are complementary for different distributions and different networks (we don’t know where the crossover point is)
• Pick the better one between Expect-Dom and Expect-Eig
Running time (subquadratic): O(k(|V|+|E|)+|V|log|V|+T)

Extending to SIR
• Our methods can be extended to the SIR model
• Idea: use an equivalent IC model with a suitably chosen propagation probability
See paper for details

Experiments: datasets
• Social Media
  • AS router graph: OREGON
  • Hyperlink network: STANFORD
  • Peer-to-peer network: GNUTELLA
  • Friendship network: BRIGHTKITE
• Epidemiology
  • PORTLAND and MIAMI
  • Large urban social-contact graphs used in national smallpox modeling studies [Eubank+, 2004]

Experiments: setup
• Uncertainty models
  • Uniform: p=0.6
  • Surveillance: p is chosen from {0.1, 0.5}
  • Prop-Deg: pi=di/dmax
• Settings
  • Pick 5% of nodes uniformly at random as infected
  • Number of samples: 500
See more details in the paper
[Figure: sampling from Tweets to Sampled Tweets]

Experiments: baselines
• OPTIMAL: brute-force algorithm that tries all possible cases (optimal; run only on KARATE)
• RANDOM: choose nodes uniformly at random from W
• DEGREE: choose the top-k nodes from W by weighted degree
• PAGERANK: choose the top-k nodes from W by PageRank
• PER-PRANK: choose the top-k nodes from W by personalized PageRank with respect to the infected nodes
• DAVA-fast: a fast data-aware immunization method in the presence of already infected nodes [Zhang+, SDM 14]
W: the set of nodes that are not definitely infected (0<=p<1)
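As an illustration of how the candidate set W restricts a baseline, here is a sketch of DEGREE. Taking "weighted degree" to mean the sum of outgoing edge weights is an assumption, as are the function name and the tie-breaking by node label.

```python
def degree_baseline(graph, probs, k):
    """Pick the k nodes of W with the largest weighted degree.

    graph: dict mapping node -> {neighbor: beta_ij} (assumed encoding)
    probs: infection probability per node; nodes missing from probs count as p=0
    W is every node that is not definitely infected (p < 1).
    """
    candidates = [v for v in graph if probs.get(v, 0.0) < 1.0]
    weighted = {v: sum(graph[v].values()) for v in candidates}
    # Sort by decreasing weighted degree, breaking ties by node label.
    return sorted(candidates, key=lambda v: (-weighted[v], str(v)))[:k]

toy = {"a": {"b": 0.5, "c": 0.5}, "b": {"a": 0.9}, "c": {}}
top_one = degree_baseline(toy, {"a": 1.0, "b": 0.3}, k=1)
```

Node "a" has the largest weighted degree but is excluded because it is definitely infected, so the baseline falls back to "b".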

Results: Sample-Cas
• Sample-Cas saves at least 90% of nodes compared to OPTIMAL
• Close to optimal
[Figure: higher is better]

Results: Expect-Max: α matters
[Figure: panels BRIGHTKITE and STANFORD]
• R>1: Expect-Dom is better
• R<1: Expect-Eig is better
• R=1: cross point (different for different networks and different distributions)
This is why we use Expect-Max

Results: Effectiveness
[Figure: panels GNUTELLA (IC) and MIAMI (SIR), 10K nodes; higher is better]
Sample-Cas and Expect-Max consistently outperform the baseline algorithms. (See more results in the paper)

Results: Scalability
[Figure: running time (sec.), lower is better; one annotation reads “did not finish within 24 hours”]

Conclusion
Uncertain Data-Aware Vaccination
Given: graph and uncertainty model
Find: ‘best’ k nodes for vaccination
• Uncertainty models
  • Uniform, Surveillance, Prop-Deg, General
• Proposed Methods
  • Sample-Cas: sampling graphs (slow, accurate)
  • Expect-Max: constructing expected graph (fast, subquadratic)
[Figure: sampling of an uncertain graph (?0.5, ?0.8) and the expected graph with super-node edges 0.5, 0.8, 1.0]

Any questions?
Code at: http://people.cs.vt.edu/~yaozhang
Funding: [sponsor logos]
Yao Zhang, B. Aditya Prakash
