Explore the application of information theory in mining graphs and sequences for various domains such as epidemiology, online diffusion, and social media. Analyze real-world networks for epidemic outbreaks, viral marketing, and social collaboration.
Leveraging Information Theory for Mining Graphs and Sequences: From Propagation to Segmentation B. Aditya Prakash Computer Science Virginia Tech. ITA Workshop, San Diego, Feb 5, 2016
Networks are everywhere! Facebook Network [2010] Gene Regulatory Network [Decourty 2008] Human Disease Network [Barabasi 2007] The Internet [2005] Prakash 2016
Dynamical Processes over networks are also everywhere!
Why do we care? • Social collaboration • Information Diffusion • Viral Marketing • Epidemiology and Public Health • Cyber Security • Human mobility • Games and Virtual Worlds • Ecology • ...
Why do we care? (1: Epidemiology) • Dynamical Processes over networks: diseases over contact networks (SI model) • CDC data: visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts [AJPH 2007]
Why do we care? (1: Epidemiology) • Dynamical Processes over networks • Each circle is a hospital • ~3000 hospitals • More than 30,000 patients transferred [US-MEDICARE NETWORK 2005] • Problem: Given k units of disinfectant, whom to immunize?
Why do we care? (1: Epidemiology) • CURRENT PRACTICE vs. OUR METHOD: ~6x fewer! [US-MEDICARE NETWORK 2005] • Hospital-acquired infections took 99K+ lives, cost $5B+ (all per year)
Why do we care? (2: Online Diffusion) > 800m users, ~$1B revenue [WSJ 2010] ~100m active users > 50m users
Why do we care? (2: Online Diffusion) • Dynamical Processes over networks Buy Versace™! Followers Celebrity Social Media Marketing
Why do we care? (3: To change the world?) • Dynamical Processes over networks Social networks and Collaborative Action
High Impact – Multiple Settings • epidemic outbreaks • products/viruses • transmit s/w patches • Q. How to squash rumors faster? • Q. How do opinions spread? • Q. How to market better?
Research Theme • DATA: Large real-world networks & processes • ANALYSIS: Understanding • POLICY/ACTION: Managing/Utilizing
Research Theme – Public Health • DATA: Modeling # patient transfers • ANALYSIS: Will an epidemic happen? • POLICY/ACTION: How to control outbreaks?
Research Theme – Social Media • DATA: Modeling Tweets spreading • ANALYSIS: # cascades in future? • POLICY/ACTION: How to market better?
In this talk • Q1: How to find hidden culprits? • Q2: How to segment multi-dimensional sequences? • DATA: Large real-world networks & processes
Outline • Motivation • Part 1: Learning Models (Empirical Studies) • Q1: How to find hidden culprits? • Q2: How to segment data sequences? • Conclusion
Culprits Motivation • Patient zeroes • Who started the epidemic? • Rumors • Who started the rumor?
But: Real data is noisy! We don’t know exactly who is infected • Epidemiology • Public-health surveillance • Surveillance Pyramid [Nishiura+, PLoS ONE 2011] (CDC, Lab, Hospital, CNN headlines; "Not sure?" between levels) • Each level has a certain probability of missing some truly infected people
Real data is noisy! Correcting missing data is by itself very important • Social Media • Twitter: due to uniform sampling [Morstatter+, ICWSM 2013], the relevant ‘infected’ tweets may be missed • Figure: Tweets → Sampling → Sampled Tweets, with missing tweets marked "?"
Outline • Motivation---Introduction • Problem Definition • Our Approach • Experiments • Conclusion
The Problem [Sundareisan, Vreeken, Prakash 2015] • GIVEN: • Graph G(V, E) from historical data • Infected set D ⊆ V, sampled (p%) and incomplete • Infectivity β of the virus (assumed to follow the SI model) • FIND: • Seed set S, i.e. patient zeros/culprits • Set C- (the missing infected nodes) • Ripple R (the order of infections)
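To make the problem setup concrete, here is a minimal sketch of how an observed set D could arise: a discrete-time SI spread followed by node sampling. The function names, the adjacency-dict graph format, and the per-step infection rule are assumptions made for illustration, not the authors' code.

```python
import random

def simulate_si(adj, seeds, beta, steps, rng):
    """Discrete-time SI sketch: in each step, every infected node infects
    each susceptible neighbor independently with probability beta.
    Returns the final infected set and the ripple as (time, node) pairs."""
    infected = set(seeds)
    ripple = [(0, s) for s in sorted(seeds)]
    for t in range(1, steps + 1):
        newly = set()
        for u in sorted(infected):
            for v in adj[u]:
                if v not in infected and rng.random() < beta:
                    newly.add(v)
        ripple.extend((t, v) for v in sorted(newly))
        infected |= newly
    return infected, ripple

def sample_observed(infected, p, rng):
    """Keep each truly infected node with probability p (the observed set D);
    the nodes that are dropped play the role of the missing set C-."""
    observed = {v for v in sorted(infected) if rng.random() < p}
    return observed, infected - observed

if __name__ == "__main__":
    # Hypothetical 4-node path graph 0-1-2-3 with node 0 as the seed.
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    rng = random.Random(0)
    truly_infected, order = simulate_si(path, {0}, beta=0.7, steps=5, rng=rng)
    D, C_minus = sample_observed(truly_infected, p=0.8, rng=rng)
    print(D, C_minus, order)
```

The inverse problem on the slide starts from G, D, β and p only, and asks for the seeds, C- and the ripple.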
Related Work – Culprits (Partial) • Shah and Zaman, IEEE TIT, 2011 • One seed • Provably finds MLE seed for d-regular trees • SI process • Lappas et al., KDD, 2010 • k seeds (takes k as input) • Infected graph assumed to be in steady state • IC model • Prakash et al., ICDM, 2012 (NetSleuth) • Finds number of seeds automatically • Assumes no mistakes in infected set D
Related Work – Missing Nodes (Partial) • Costenbader and Valente 2003; Kossinets 2006; Borgatti et al. 2006 • Study the effect of sampling on macro-level network statistics • Adiga et al. 2013 • Sensitivity of total infections to noise in network structure • Sadikov et al., WSDM, 2011 • Correct for sampling for macro-level cascade statistics
Outline • Motivation---Introduction • Problem Definition • Our Approach • MDL • Decoupling • Finding S given C • Finding C given S • Experiments • Conclusion
MDL: Minimum Description Length Principle • Occam’s Razor • Simplest model is the best model • “Induction by Compression” • Related to Bayesian approaches • MDL cost in bits = Model cost + Data cost • Best model = least cost in bits • Sender → Channel (Data + Model) → Receiver
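As a toy illustration of "MDL cost in bits = Model cost + Data cost", the sketch below compares two ways of transmitting a binary string and picks the cheaper one. The two models and their encodings are assumptions chosen for simplicity; they are unrelated to the encoding used later for seeds and ripples.

```python
from math import comb, log2

def cost_fair_coin(bits):
    """Model with no parameters: 0 model bits, then 1 bit per symbol."""
    return 0.0 + len(bits)

def cost_biased_coin(bits):
    """Model = the number of ones k (one of n + 1 values);
    data = which of the C(n, k) arrangements actually occurred."""
    n, k = len(bits), sum(bits)
    model_bits = log2(n + 1)
    data_bits = log2(comb(n, k))
    return model_bits + data_bits

def best_model(bits):
    """Return the name of the cheapest model and all total costs in bits."""
    costs = {"fair": cost_fair_coin(bits), "biased": cost_biased_coin(bits)}
    return min(costs, key=costs.get), costs

if __name__ == "__main__":
    skewed = [1] * 2 + [0] * 30    # very regular: the extra model bits pay off
    balanced = [0, 1] * 16         # no skew to exploit
    print(best_model(skewed))
    print(best_model(balanced))
```

The skewed string is cheaper under the biased-coin model even after paying for the parameter, while the balanced string is not, which is the trade-off the principle formalizes.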
MDL Encoding For Our Problem • Sender knows: Graph G(V, E), Infectivity (β), Sampling (p), Seeds (S), Infected set (D ∪ C-), Ripple (R), Missing nodes (C-) • Receiver knows: Graph G(V, E), Infectivity (β), Sampling (p) • The Model: Seeds (S), Ripple (R) • Data given model: Missing nodes (C-)
Model (S, R) Cost • Scoring the seed set (S): cost of encoding the integer |S| + cost of identifying one of the possible |S|-sized sets • Scoring the ripple?
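The two labels on this slide (encoding the integer |S|, and the number of possible |S|-sized sets) suggest a seed-set cost of the following shape. This is a hedged sketch: the slide does not say which integer code is used, so Rissanen's universal code for integers is assumed here, and the function names are hypothetical.

```python
from math import comb, log2

def universal_int_bits(n):
    """Code length in bits for an integer n >= 1 under Rissanen's universal
    code: log2(c0) + log2(n) + log2(log2(n)) + ..., keeping positive terms.
    (Assumed encoding; the slide only says 'encoding integer |S|'.)"""
    if n < 1:
        raise ValueError("n must be >= 1")
    bits = log2(2.865064)  # normalizing constant c0
    term = log2(n)
    while term > 0:
        bits += term
        term = log2(term)
    return bits

def seed_set_bits(num_nodes, num_seeds):
    """Bits to say how many seeds there are, plus bits to say which
    of the C(|V|, |S|) possible seed sets it is."""
    return universal_int_bits(num_seeds) + log2(comb(num_nodes, num_seeds))

if __name__ == "__main__":
    for k in (1, 2, 5):
        print(k, round(seed_set_bits(1000, k), 2))
```

Under this scoring every extra seed costs more bits, so a larger seed set is preferred only if it shortens the rest of the description.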
Model (S, R) Cost • Scoring a ripple (R) • Figure: Original Graph, Infected Snapshot, and two candidate ripples R1 and R2
Model (S, R) Cost • Ripple cost for ripple R = how long the ripple is + how the ‘frontier’ advances
Cost of the data (C-) • We have to transmit the missed nodes C- (green nodes) • So that receiver can recover D • Detail: γ = 1 – p, i.e. the probability of a node to be truly missing
Total MDL Cost • Finally • And our problem is now • Find S, R, C- to minimize it
Outline • Motivation---Introduction • Problem Definition • Our Approach • MDL • Decoupling • Finding S given C • Finding C given S • Experiments • Conclusion
Our Approach: Decoupling • The two problems are • Finding seeds/ripple (S, R) • Finding Missing nodes (C-) • Can we decouple them?
Decoupling the problems (cont.) • Finding seeds depends on missing nodes. • Figure: NetSleuth with no missing nodes as input vs. NetSleuth with correct missing nodes filled in as input (legend: missing nodes, seed, infected node)
Decoupling the problems (cont.) • Finding missing nodes also depends on seeds. • Figure: seed S with neighbors A and B (legend: infected, not infected); most probably A was missed
Outline • Motivation---Introduction • Problem Definition • Our Approach • MDL • Decoupling • Finding S given C • Finding C given S • Experiments • Conclusion
Finding culprits (S) given missing nodes (C-) • Suppose an oracle gives us the missing nodes (C-) • We have the complete infected set (D ∪ C-) • Apply NetSleuth* directly • NO SAMPLING INVOLVED • Will give us the seed set • Figure: applying NetSleuth* on the oracle’s answer (legend: missing nodes, seed, infected node) * Prakash et al., ICDM 2012
Outline • Motivation---Introduction • Problem Definition • Our Approach • MDL • Decoupling • Finding S given C • Finding C given S • Experiments • Conclusion
Missing Nodes (C-) given (S) • Oracle gives us S, find C- • Naïve Approach? • Find all possible C- • Pick the best one according to MDL • Infeasible! ( sets)
Our Approach • Sub-problem 1: |Seeds| = 1 • |Missing nodes| = 1 • Sub-problem 2: Finding the right number of missing nodes. • Sub-problem 3: |Seeds| > 1
Sub-Problem 1: Best hidden culprit given one seed • The best node is the one which makes the seed s more likely • We use empirical risk as the measure • Sanity check: ideally risk should be 0 • So best hidden culprit,
Sub-Problem 1: Best Hidden Culprit • Using some results in Prakash et al. 2012 (see details in paper), we can rewrite it as • u1 is the eigenvector corresponding to the smallest eigenvalue of the Laplacian submatrix of D
Detour: Laplacian Submatrix • Laplacian = Deg(G) – A(G) • LD = take only rows for nodes in D (Laplacian submatrix) • u1 (smallest eigenvalue’s eigenvector) • Figure: Laplacian = Degree – Adjacency; Laplacian → Laplacian Submatrix (D) → λ, Eigenvector
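The detour can be sketched in a few lines of dependency-free Python. Two assumptions are made here: the submatrix keeps both the rows and the columns of the nodes in D (an eigenvector needs a square matrix, while the slide mentions rows only), and the smallest eigenpair is found by power iteration on a shifted matrix, which is one simple option and not necessarily what the authors use. All names are hypothetical.

```python
def laplacian_submatrix(adj, infected):
    """Rows/columns of L = Deg(G) - A(G) restricted to the nodes in D.
    Degrees are taken in the full graph G, so edges leaving D still count.
    Assumes an undirected simple graph given as {node: neighbors}."""
    nodes = sorted(infected)
    pos = {v: i for i, v in enumerate(nodes)}
    L = [[0.0] * len(nodes) for _ in nodes]
    for v in nodes:
        L[pos[v]][pos[v]] = float(len(adj[v]))
        for w in adj[v]:
            if w in pos:
                L[pos[v]][pos[w]] = -1.0
    return nodes, L

def smallest_eigenpair(L, iters=2000):
    """Power iteration on (shift * I - L): its dominant eigenvector is the
    eigenvector of L with the smallest eigenvalue."""
    n = len(L)
    shift = 2.0 * max(L[i][i] for i in range(n)) + 1.0
    x = [1.0] * n
    for _ in range(iters):
        y = [shift * x[i] - sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(c * c for c in y) ** 0.5
        x = [c / norm for c in y]
    Lx = [sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
    lam = sum(x[i] * Lx[i] for i in range(n))  # Rayleigh quotient of unit vector
    return lam, x

if __name__ == "__main__":
    # Hypothetical path graph 0-1-2-3 with infected set D = {0, 1, 2}.
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    nodes, L_D = laplacian_submatrix(path, {0, 1, 2})
    lam, u1 = smallest_eigenpair(L_D)
    print(nodes, round(lam, 4), [round(c, 4) for c in u1])
```

In this toy example the entries of u1 are largest for nodes deep inside the infected region and smaller for nodes on its boundary.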
Okay • How to solve this quickly? Proof Omitted: see paper
Best hidden hazard • Choose n* such Measures • how connected a node n is to centrally located infected nodes w.r.t. s in D • Depends on the seed as well as the structure
Sub-Problem 2: How many missing nodes? • MDL? • Add nodes based on Z-scores till MDL increases. • MDL is not convex! • But it has convex-like behavior...
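The stopping rule on this slide can be sketched as a greedy loop: walk down a ranking of candidates and stop at the first node that makes the total cost go up. Here `cost_fn` is a hypothetical stand-in for the total MDL cost and the ranking is assumed to be by descending Z-score. Because the cost is only convex-like, stopping at the first increase is a heuristic.

```python
def add_until_cost_rises(ranked_candidates, cost_fn):
    """Add candidates in ranked order while the cost keeps decreasing.
    ranked_candidates: nodes sorted from most to least promising.
    cost_fn: maps a list of chosen nodes to a cost in bits (hypothetical)."""
    chosen = []
    best = cost_fn(chosen)
    for node in ranked_candidates:
        trial = cost_fn(chosen + [node])
        if trial >= best:
            break  # first increase (or tie): stop adding nodes
        chosen.append(node)
        best = trial
    return chosen, best

if __name__ == "__main__":
    # Toy cost with a single minimum at three missing nodes.
    toy_cost = lambda chosen: (len(chosen) - 3) ** 2 + 10
    print(add_until_cost_rises(list("abcdef"), toy_cost))
```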
Sub-Problem 3: What if |Seeds| > 1 SKIP! • Using Z-scores: missing nodes are near one seed • Ideal: missing nodes near both seeds
Sub-Problem 3: What if |Seeds| > 1 SKIP! • Exonerate previous seeds • Make previous seeds uninfected and calculate u1 • The blame is transferred to the locality of the older seed • Complete Z-score(n) = max over all seeds of Z-score(n) • Maximum as we need high quality missing nodes • Take nodes with top-k complete Z-scores
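The combination step (complete Z-score as the maximum over seeds, then top-k) might look like the sketch below. It assumes that each seed already has a raw score per candidate node and that the Z-score is the usual standardization of those raw scores. How the raw scores are obtained, including the exoneration step, is not covered, and all names are hypothetical.

```python
from statistics import mean, pstdev

def z_scores(raw):
    """Standardize one seed's raw scores {node: score} across candidates."""
    values = list(raw.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {n: 0.0 for n in raw}
    return {n: (v - mu) / sigma for n, v in raw.items()}

def top_k_complete(raw_by_seed, k):
    """Complete Z-score(n) = max over seeds of that seed's Z-score(n);
    return the k best nodes (ties broken by node name) and all scores."""
    complete = {}
    for raw in raw_by_seed.values():
        for n, z in z_scores(raw).items():
            complete[n] = max(z, complete.get(n, float("-inf")))
    ranked = sorted(complete, key=lambda n: (-complete[n], n))
    return ranked[:k], complete

if __name__ == "__main__":
    # Toy raw scores: node "a" stands out for seed s1, node "d" for seed s2.
    raw_by_seed = {
        "s1": {"a": 9.0, "b": 1.0, "c": 1.0, "d": 1.0},
        "s2": {"a": 1.0, "b": 1.0, "c": 1.0, "d": 9.0},
    }
    print(top_k_complete(raw_by_seed, 2))
```

Taking the maximum keeps a node that is strongly tied to any single seed, matching the slide's "missing nodes near both seeds" goal.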
Finding missing nodes given seeds Phew!
The complete algorithm – NetFill (Outline) Running time: sub-quadratic in practice
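The outline itself did not survive in this transcript. One plausible reading, given the decoupling slides, is an alternation between the two sub-problems that stops when the total MDL cost no longer improves. The sketch below encodes that reading only; `find_seeds`, `find_missing`, and `total_cost` are hypothetical callables standing in for NetSleuth, the missing-node search, and the MDL score.

```python
def alternate_until_converged(find_seeds, find_missing, total_cost,
                              observed, max_rounds=20):
    """Alternate 'seeds given missing nodes' and 'missing nodes given seeds'
    until the total cost stops decreasing (assumed stopping rule)."""
    missing = set()
    seeds = find_seeds(observed | missing)
    best = total_cost(seeds, missing)
    for _ in range(max_rounds):
        new_missing = find_missing(seeds, observed)
        new_seeds = find_seeds(observed | new_missing)
        cost = total_cost(new_seeds, new_missing)
        if cost >= best:
            break  # no improvement: keep the previous answer
        seeds, missing, best = new_seeds, new_missing, cost
    return seeds, missing, best

if __name__ == "__main__":
    # Toy stand-ins: node 2 is the only plausible missing node.
    observed = {1, 3}
    find_seeds = lambda infected: {min(infected)}
    find_missing = lambda seeds, obs: {2}
    total_cost = lambda s, m: len(s) + len(m) + (0 if 2 in m else 5)
    print(alternate_until_converged(find_seeds, find_missing, total_cost, observed))
```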