Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

1 Stanford University2 MPI for Biological Cybernetics3 California Institute of Technology Inferring Networks of Diffusion and Influence Manuel Gomez Rodriguez1,2Jure Leskovec1Andreas Krause3

Networks and Processes Many times hard to directly observe a social or information network: Hidden/hard-to-reach populations: Drug injection users Implicit connections: Network of information sharing in online media Often easier to observe results of the processes taking place on such (invisible) networks: Virus propagation: People get sick, they see the doctor Information networks: Blogs mention information

Information Diffusion Network Information diffuses through the network • We only see who mentions but not where they got the information from • Can we reconstruct the (hidden) diffusion network?

More Examples Virus propagation Word of mouth & Viral marketing Viruses propagate through the network We only observe when people get sick But NOT who infected them • Recommendations and influence propagate • We only observe when people buy products • But NOT who influenced them Process We observe It’s hidden Can we infer the underlying social or information network?

Inferring the Network • There is ahiddendirected network: • We only see times when nodes get infected: • Contagion c1: (a, 1), (c, 2), (b, 3), (e, 4) • Contagion c2: (c, 1), (a, 4), (b, 5), (d, 6) • Want to infer the who-infects-whom network a a a b b b d d c c c e e

Our Problem Formulation Plan for the talk: Define a continuous time model of diffusion Define the likelihood of the observed propagation data given a graph Show how to efficiently compute the likelihood Show how to efficiently optimize the likelihood Find a graph G that maximizes the likelihood There is a super-exponential number of graphs G Our method finds a near-optimal solution in O(N2)!

Cascade generation model: Cascade reaches u at tu, and spreads to u’s neighbors v: with probability β cascade propagates along (u, v) and tv = tu + Δ, with Δ~ f() Cascade Generation Model ta tb tc te tf Δ1 Δ2 Δ3 Δ4 a a a b b b c c c d We assume each node v has only one parent! e e f f

Likelihood of a Cascade • If u infected v in a cascade c, its transmission probability is: • Pc(u, v)  f(tv - tu)with tv > tuand (u, v) are neighbors a a b b • To model that in reality any node v in a cascade can have been infected by an external influence m:Pc(m, j) =ε d c c m • Prob. that cascade c propagates in a tree T: ε e e ε ε Tree pattern T on cascade c: (a, 1), (b, 2), (c, 4), (e, 8)

Finding the Diffusion Network There are many possible propagation trees: c: (a, 1), (c, 2), (b, 3), (e, 4) Need to consider all possible propagation trees T supported by G: Good news Computing P(c|G) is tractable: • Bad news • We actually want to find: • We have a super-exponential number of graphs! a a a a a a b b b b b b For each c, consider all O(nn) possible transmission trees of G. Matrix Tree Theorem can compute this sum in O(n3)! d d d c c c c c c e e e e e e • Likelihood of a set of cascades C on G: • Want to find:

An Alternative Formulation We consider only the most likely tree Maximum log-likelihood for a cascade c under a graph G: Log-likelihood of G given a set of cascades C: The problem is NP-hard: MAX-k-COVER Our algorithm can do it near-optimally in O(N2)

Max Directed Spanning Tree Given a cascade c, What is the most likely propagation tree? where • A maximum directed spanning tree (MDST): • Just need to compute the MDST of a the sub-graph of G induced by c (i.e., a DAG) • For each node, just picks an in-edge ofmax-weight: Subgraph of G induced by c doesn’t have loops (DAG) a a 3 5 b b 2 c c 4 6 1 d d Local greedy selection gives optimal tree!

Great News: Submodularity Theorem: Log-likelihood Fc(G) is monotonic, and submodular in the edges of the graph G Fc(A  {e}) – Fc (A) ≥ Fc (B  {e}) – Fc (B) Gain of adding an edge to a “small” graph Gain of adding an edge to a “large“ graph A B  VxV • Proof: • Single cascade c, edge e with weight x A r i i w x • Let w be max weight in-edge of s in A s k k • Let w’ be max weight in-edge of s in B B • We know: w ≤ w’ j w’ • Now: Fc(A  {e}) – Fc(A) = max (w, x) – w ≥ max (w’, x) – w’ = Fc(B {e}) – Fc(B) o a Then, log-likelihood FC(G) is monotonic, and submodular too

Finding the Diffusion Graph Use the greedy hill-climbing to maximize FC(G): At every step, pick the edge that maximizes the marginal improvement Benefits • 1. Approximation guarantee (≈ 0.63 of OPT) • 2. Tight on-line bounds on the solution quality • 3. Speed-ups: • Lazy evaluation (by submodularity) • Localized update (by the structure of the problem) Marginal gains b a : 12 a c a : 3 b d a : 6 a b : 20 : 17 c b : 18 Localized update d Localized update : 2 : 1 d b : 4 c : 1 : 3 e b : 5 a c : 15 Localized update b c : 6 : 8 e b d : 16 Localized update c d : 7 : 8 e d : 8 : 10 b e : 7 d e : 13

Experimental Setup We validate our method on: Synthetic data Generate a graph G on k edges Generate cascades Record node infection times Reconstruct G Real data MemeTracker: 172m news articles Aug ’08 – Sept ‘09 343m textual phrases (quotes) • How well do we optimize Fc(G)? • How many edges of G can we find? • Precision-Recall • Break-even point • How fast are we? • How many cascades do we need?

Small Synthetic Example True network Baseline network Our method • Pick kstrongest edges: • Small synthetic network: 15

Synthetic Networks 1024 node hierarchical Kronecker exponential transmission model 1000 node Forest Fire (α = 1.1) power law transmission model • Our performance does not depend on the network structure: • Synthetic Networks: Forest Fire, Kronecker, etc. • Prob. of transmission: Exponential, Power Law • Break-even points of > 90% when the baseline gets 30-50%!

How good is our graph? We achieve ≈ 90 % of the best possible network!

How many cascades do we need? With 2x as many infections as edges, the break-even point is already 0.8 - 0.9!

Running Time Lazy evaluation and localized updates speed up 2 orders of magnitude!

Real Data MemeTracker dataset: 172m news articles Aug ’08 – Sept ‘09 343m textual phrases (quotes) Times tc(w) when site w mentions phrase (quote) c http://memetracker.org • Given times when sites mention phrases • We infer the network of information diffusion: • Who tends to copy (repeat after) whom

Real Network • We use the hyperlinks in the MemeTracker dataset to generate the edges of a ground truth G • From the MemeTracker dataset, we have the timestamps of: • 1. cascades of hyperlinks: • sites link other sites • 2.cascades of (MemeTracker) quotes • sites copy quotes from other sites e a c f Are they correlated? e a c f Can we infer the hyperlinks network from… …cascades of hyperlinks? …cascades of MemeTracker quotes?

Real Network 500 node hyperlink network using hyperlinks cascades 500 node hyperlink network using MemeTracker cascades • Break-even points of 50% for hyperlinks cascades and 30% for MemeTracker cascades!

Diffusion Network Blogs Mainstream media • 5,000 news sites:

Diffusion Network (small part) Blogs Mainstream media

Networks and Processes We infer hidden networks based on diffusion data (timestamps) Problem formulation in a maximum likelihood framework NP-hard problem to solve exactly We develop an approximation algorithm that: It is efficient -> It runs in O(N2) It is invariant to the structure of the underlying network It gives a sub-optimal network with tight bound Future work: Learn both the network and the diffusion model Extensions to other processes taking place on networks Applications to other domains: biology, neuroscience, etc.

Thanks! For more (Code & Data):http://snap.stanford.edu/netinf

Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3