Information Diffusion. Mary McGlohon CMU 10-802 3/23/10. Outline. Intro: Models for diffusion Epidemiological: SIS/SIR/SIRS Threshold models Case studies SIR: Info diffusion in blogs SIS: Cascades in blogs Timing: Cascades in chain letters A closer look: Network-based Marketing.
Information Diffusion Mary McGlohon CMU 10-802 3/23/10
Outline • Intro: Models for diffusion • Epidemiological: SIS/SIR/SIRS • Threshold models • Case studies • SIR: Info diffusion in blogs • SIS: Cascades in blogs • Timing: Cascades in chain letters • A closer look: Network-based Marketing
Epidemiological: SIS • Susceptible, Infected, Susceptible • Infected for t_I timesteps • While infected, transmits with probability b • After t_I steps, returns to susceptible
Epidemiological: SIR • Susceptible, Infected, Removed • Infected for t_I timesteps • While infected, transmits with probability b • After t_I steps, goes to removed/recovered
Epidemiological: SIRS • Susceptible, Infected, Removed, Susceptible • Combination of SIS+SIR • After t_I steps, goes to removed/recovered • After t_R steps, returns to susceptible
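The three models differ only in what happens when the infection period ends, which a small simulation makes concrete. The following is a minimal sketch, not taken from the slides: the function name `simulate`, the adjacency-dict graph format, and the choice to apply transmissions synchronously at the end of each timestep are all assumptions.

```python
import random

def simulate(adj, seeds, b, t_i, t_r=None, model="SIR", steps=50, rng=None):
    """Discrete-time SIS/SIR/SIRS on a graph given as {node: [neighbors]}.

    b   : per-timestep transmission probability along each edge
    t_i : number of timesteps a node stays infected
    t_r : number of timesteps a node stays removed (SIRS only)
    Returns the number of infected nodes after each timestep.
    """
    if model not in ("SIS", "SIR", "SIRS"):
        raise ValueError("model must be SIS, SIR, or SIRS")
    if model == "SIRS" and not t_r:
        raise ValueError("SIRS needs t_r >= 1")
    rng = rng or random.Random(0)  # fixed seed is an arbitrary choice
    state = {v: "S" for v in adj}
    clock = {v: 0 for v in adj}
    for s in seeds:
        state[s], clock[s] = "I", t_i
    history = []
    for _ in range(steps):
        # 1. Infected nodes try to infect susceptible neighbors.
        newly = set()
        for v in adj:
            if state[v] == "I":
                for u in adj[v]:
                    if state[u] == "S" and rng.random() < b:
                        newly.add(u)
        # 2. Advance clocks; this is where the three models differ.
        for v in adj:
            if state[v] == "I":
                clock[v] -= 1
                if clock[v] == 0:
                    if model == "SIS":
                        state[v] = "S"          # back to susceptible
                    else:
                        state[v] = "R"          # removed/recovered
                        clock[v] = t_r if model == "SIRS" else 0
            elif state[v] == "R" and model == "SIRS":
                clock[v] -= 1
                if clock[v] == 0:
                    state[v] = "S"              # immunity wears off
        # 3. Apply the new infections.
        for u in newly:
            state[u], clock[u] = "I", t_i
        history.append(sum(1 for v in adj if state[v] == "I"))
    return history
```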
Epidemiological: Networks • Historically, SIS/SIR models assumed a fully connected population (a clique): any person could infect any other. • SIS has an epidemic threshold: below it, the infection dies out. • For random power-law networks, the threshold is 0 [Pastor-Satorras and Vespignani] • (But not for power-law networks with high clustering coefficients [Eguíluz and Klemm])
Threshold Models • Each node in the network has a (weighted) threshold. • If the (weighted) number of adopted neighbors reaches the threshold, the node adopts.
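One common reading of this rule is the linear threshold model, sketched below. The data layout (`weights[v]` maps each neighbor `u` to its influence on `v`) and the function name are assumptions made for illustration; the slides do not fix a particular weighting scheme.

```python
def threshold_cascade(weights, thresholds, seeds):
    """Run a threshold model until no further node adopts.

    weights[v]    : {u: w_uv}, influence of neighbor u on node v (assumed layout)
    thresholds[v] : total adopted-neighbor weight at which v adopts
    seeds         : initially adopted nodes
    Returns the final set of adopters.
    """
    adopted = set(seeds)
    changed = True
    while changed:
        changed = False
        for v, nbrs in weights.items():
            if v in adopted:
                continue
            # Total weight of neighbors that have already adopted.
            influence = sum(w for u, w in nbrs.items() if u in adopted)
            if influence >= thresholds[v]:
                adopted.add(v)
                changed = True
    return adopted
```

Because adoption is never undone, the order in which nodes are checked does not change the final set of adopters, only how many passes it takes to reach it.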
Outline • Intro: Models for diffusion • Epidemiological: SIS/SIR/SIRS • Threshold models • Case studies • SIR: Info diffusion in blogs • SIS: Cascades in blogs • Timing: Cascades in chain letters • A closer look: Network-based Marketing
Info Diffusion in Blogs D. Gruhl, R. Guha, D. Liben-Nowell, A. Tomkins. Information Diffusion Through Blogspace. In WWW '04 (2004). Goal: How do topics trend in blogs, and how can we model diffusion of topics?
Info Diffusion in Blogs • Data: Crawled 11K blogs, 400K posts. • Found 340 topics: • apple, arianna, ashcroft, astronaut, blair, boykin, bustamante, chibi, china, davis, diana, farfarello, guantanamo, harvard, kazaa, longhorn, schwarzenegger, udell, siegfried, wildfires, zidane, gizmodo, microsoft, saddam
Info Diffusion in Blogs • Topics = Chatter + Spikes • Chatter: Alzheimer • Spike: Chibi • Spiky Chatter: Microsoft
Info Diffusion in Blogs • Modeled as SIR • Some set of authors is infected to write about a topic • Then propagate, as others write new posts on that topic • Measure the topic over time and other properties • Fit using EM • Compute probability of propagation along each edge
Info Diffusion in Blogs • Validation: • Synthetic • Used a modified Erdős-Rényi graph and simulated propagation on it • Found that EM was able to identify transmission along most edges • Real • Found “internet-only” topics • Looked at the most highly ranked expected transmission links and identified a real link in 90% of cases
Info Diffusion in Blogs • Limitations of SIR • No multiple postings • No “stickiness” (which topics resonate with whom) • No time-limiting factor in topics • “Closed world assumption” • No outside influences after initial infection
Outline • Intro: Models for diffusion • Epidemiological: SIS/SIR/SIRS • Threshold models • Case studies • SIR: Info diffusion in blogs • SIS: Cascades in blogs • Timing: Cascades in chain letters • A closer look: Network-based Marketing
Cascades in Blogs Jure Leskovec, Mary McGlohon, Christos Faloutsos, Natalie Glance, Matthew Hurst. Cascading Behavior in Large Blog Graphs: Patterns and a Model. In Society for Industrial and Applied Mathematics: Data Mining (SDM07) (2007). Goal: What do cascades (conversation trees) in blogs look like, and how can we model them?
Cascades in Blogs • Data: • Gathered from August-September 2005 • Used set of 44,362 blogs, 2.4 million posts • 245,404 blog-to-blog links [Figure: number of posts per day over time (axis ticks: Jul 4, Aug 1, Sep 29)]
Cascades in Blogs • What is the timing of links? • What are cascade sizes? • What are cascade shapes? [Figure: the blogosphere (blogs B1-B4 containing posts a-e) and the cascades extracted from it, including a “star” and a “chain”]
Cascades in Blogs • What is the timing of links? • Does popularity decay at a constant rate? • With an exponential (“half life”)? [Figure panels: linear-linear scale, log-linear scale, log-log scale]
Cascades in Blogs Observation: The probability that a post written at time $t_p$ acquires a link at time $t_p + \Delta$ is: $p(t_p + \Delta) \propto \Delta^{-1.5}$ [Figure: log(# in-links) vs. log(days after post), slope = -1.5]
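A power law appears as a straight line on a log-log plot, so its exponent can be read off as the slope. The sketch below estimates that slope by ordinary least squares on the logged values; this is a rough diagnostic chosen for illustration, not necessarily how the slope of -1.5 was obtained, and the function name is an assumption.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x).

    Pairs with a non-positive x or y are skipped because they
    cannot be placed on a log-log plot.
    """
    pts = [(math.log(x), math.log(y)) for x, y in zip(xs, ys) if x > 0 and y > 0]
    if len(pts) < 2:
        raise ValueError("need at least two positive points")
    n = len(pts)
    mean_x = sum(px for px, _ in pts) / n
    mean_y = sum(py for _, py in pts) / n
    sxx = sum((px - mean_x) ** 2 for px, _ in pts)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in pts)
    return sxy / sxx

# Synthetic check: counts that decay exactly as days ** -1.5.
days = list(range(1, 31))
links = [1000 * d ** -1.5 for d in days]
slope = loglog_slope(days, links)
```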
Cascades in Blogs • How are cascade sizes distributed? • Geometric distribution? [Figure panels: linear-linear scale, log-linear scale, log-log scale]
Cascades in Blogs Q: What size distribution do cascades follow? Are large cascades frequent? Observation: The probability of observing a cascade of n blog posts follows a Zipf distribution: $p(n) \propto n^{-2}$ [Figure: log(Count) vs. log(Cascade size) (# of nodes), slope = -2]
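To see what an exponent of -2 implies for large cascades, the distribution can be normalized and its tail bounded. This is a worked illustration under the idealizing assumption that the power law holds exactly for every size n ≥ 1; N denotes the size of a random cascade.

```latex
p(n) = \frac{n^{-2}}{\zeta(2)} = \frac{6}{\pi^{2} n^{2}},
\qquad
\frac{6}{\pi^{2} n}
\;\le\; P(N \ge n) = \frac{6}{\pi^{2}} \sum_{k \ge n} k^{-2}
\;\le\; \frac{6}{\pi^{2} (n-1)}
\quad (n \ge 2).
```

The tail therefore decays only about as fast as 1/n: under this idealization roughly 0.6% of cascades would have at least 100 posts, far more than a geometric distribution with a comparable share of size-1 cascades would allow.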
Cascades in Blogs • How are cascade shapes distributed? • More stars? More chains?
Cascades in Blogs Q: What is the distribution of particular cascade shapes? Observation: Stars and chains in blog cascades also follow a power law, with different exponents (star -3.1, chain -8.5). [Figure panels: log(Count) vs. log(Size) of star (# nodes), a = -3.1; log(Count) vs. log(Size) of chain (# nodes), a = -8.5]
Cascades in Blogs • Based on SIS model in epidemiology • Randomly pick blog to infect, add post to cascade • Infect each in-linked neighbor with some infection probability • Add infected neighbors’ posts to cascade. • Set old infected node to uninfected. [Figure: animation of the model on blogs B1-B4, adding posts p1,1, p2,1, and p4,1 to the cascade step by step]
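The steps of this generative model can be sketched as follows. This is a loose illustration rather than the authors' code: the name `generate_cascade`, the `in_links` dictionary layout, the single shared probability `beta`, and the `max_posts` safety cap are all assumptions.

```python
import random

def generate_cascade(in_links, beta, rng=None, max_posts=10_000):
    """Generate one cascade with an SIS-style process.

    in_links[b] : list of blogs that link to blog b (assumed layout)
    beta        : probability of infecting each in-linked neighbor
    Returns (posts, edges): posts[i] is the blog that wrote post i,
    and each edge (child, parent) means post `child` links to post `parent`.
    """
    rng = rng or random.Random()
    start = rng.choice(sorted(in_links))     # randomly pick a blog to infect
    posts = [start]                          # its post starts the cascade
    edges = []
    frontier = [0]                           # indices of currently infected posts
    # max_posts is a soft cap, checked once per round, to stop runaway epidemics.
    while frontier and len(posts) < max_posts:
        next_frontier = []
        for parent in frontier:
            for blog in in_links.get(posts[parent], []):
                if rng.random() < beta:      # infect the in-linked neighbor
                    posts.append(blog)       # add its new post to the cascade
                    child = len(posts) - 1
                    edges.append((child, parent))
                    next_frontier.append(child)
        # Old infected nodes become uninfected, so a blog may be infected again.
        frontier = next_frontier
    return posts, edges
```

Because every new post links to exactly one parent post, each generated cascade is a tree, and its size and shape (star-like or chain-like) can be compared with the observed distributions.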
Cascades in Blogs [Figure: model vs. data. Panels: most frequent cascades; log(Count) vs. log(Cascade size) (# nodes); log(Count) vs. log(Star size); log(Count) vs. log(Chain size)]
Cascades in Blogs • Limitations of SIS • Closed world assumption • Forced to set infection probability low to avoid large epidemics, which possibly limits stars. • No time limit, which possibly overestimates chains.
Outline • Intro: Models for diffusion • Epidemiological: SIS/SIR/SIRS • Threshold models • Case studies • SIR: Info diffusion in blogs • SIS: Cascades in blogs • Timing: Cascades in chain letters • A closer look: Network-based Marketing
Chain Letter Cascades David Liben-Nowell, Jon Kleinberg. Tracing the Flow of Information on a Global Scale Using Internet Chain-Letter Data. Proceedings of the National Academy of Sciences, Vol. 105, No. 12 (March 2008), pp. 4633-4638. Goal: How can we trace the path of a meme, and explain these paths?
Chain Letter Cascades • Data: NPR chain letter records. • People directed to sign and send back to admin • Had several copies of lists, overlaps • Reconstructed the trees using edit distance
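Edit distance counts the insertions, deletions, and substitutions needed to turn one sequence into another, which makes it a natural way to match near-identical signatures or signature lists across copies. Below is a minimal sketch of the standard Levenshtein dynamic program; exactly what was compared in the study is not stated on the slide, so the usage shown is an assumption and the names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))       # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]                        # deleting the first i items of a
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

# Hypothetical copies of one petition: the second has a misspelled
# third signature and one extra signer appended.
copy_1 = ["ann", "bob", "cy"]
copy_2 = ["ann", "bob", "cyd", "dee"]
distance = edit_distance(copy_1, copy_2)
```

A small distance between two lists suggests they share a common prefix of signers and so belong on the same branch of the tree.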
Chain Letter Cascades A reconstruction:
Chain Letter Cascades The tree:
Chain Letter Cascades • How to model? • These trees have much longer paths • 2 considerations • Spatial distance (geographic) • Timing
Chain Letter Cascades Model: based on a delay distribution Nodes reply-to-all, so latecomers just append.
Chain Letter Cascades • Validation: Simulated on a real social network (LiveJournal), produced similar trees. • Limitations: • The chain letter mechanism is a somewhat nontraditional form of diffusion • Closed-world assumption is perhaps OK
Outline • Intro: Models for diffusion • Epidemiological: SIS/SIR/SIRS • Threshold models • Case studies • SIR: Info diffusion in blogs • SIS: Cascades in blogs • Timing: Cascades in chain letters • A closer look: Network-Based Marketing
Network-Based Marketing Shawndra Hill, Foster Provost, Chris Volinsky. Network-based marketing: Identifying likely adopters via consumer networks. Statistical Science, Vol. 22, No. 2. (2006), pp. 256-275. Question: Is there statistical evidence that network linkage directly affects product adoption?
Network-Based Marketing • Data: Direct-mail marketing campaign for adopting a new communications service. • 21 target segments, millions of customers • Divided based on: • Loyalty • Previous adoptions • Predictive scores based on other demographics • Different marketing campaigns (postcards, calls)
Network-Based Marketing • Hypothesis: A customer who has had direct communication with a subscriber is more likely to adopt. • Data: (incomplete) network information • ID of users, Timestamp, Duration • To test, added an “NN” (network neighbor) flag to the features if a customer had communicated with a subscriber. (0.3% overall)
Network-Based Marketing • Created baseline statistical model based on node attributes. • “Loyalty”: how the consumer used services in the past • Geographic: city, state, etc. • Demographic: census-type data, credit score • Added a variable for NN and performed logistic regression on each segment, with the response variable being “take rate”.
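The quantities used to evaluate the NN variable can be computed from simple counts. The sketch below treats the take rate as adopters divided by customers targeted and the lift as the ratio of the NN take rate to the non-NN take rate; both definitions, the function name, and the toy counts are assumptions for illustration, not figures from the study. With NN as the only predictor, the logistic-regression coefficient equals the log-odds ratio computed here; the study's model also includes the baseline attributes, which this sketch omits.

```python
import math

def nn_effect(nn_adopters, nn_total, other_adopters, other_total):
    """Compare adoption between network neighbors (NN) and everyone else."""
    rate_nn = nn_adopters / nn_total
    rate_other = other_adopters / other_total
    odds_nn = nn_adopters / (nn_total - nn_adopters)
    odds_other = other_adopters / (other_total - other_adopters)
    return {
        "take_rate_nn": rate_nn,
        "take_rate_other": rate_other,
        "lift": rate_nn / rate_other,
        "log_odds_ratio": math.log(odds_nn / odds_other),
    }

# Toy counts, invented purely to exercise the function.
result = nn_effect(nn_adopters=30, nn_total=1000,
                   other_adopters=100, other_total=10000)
```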
Network-Based Marketing Log-odds ratio for NN variable
Network-Based Marketing [Figure panels: take rates; lift ratios]
Network-Based Marketing Added a “segment 22” consisting of only NN, but made up of less promising customers.