
Information Diffusion

Information Diffusion. Mary McGlohon CMU 10-802 3/23/10. Outline. Intro: Models for diffusion Epidemiological: SIS/SIR/SIRS Threshold models Case studies SIR: Info diffusion in blogs SIS: Cascades in blogs Timing: Cascades in chain letters A closer look: Network-based Marketing.


Presentation Transcript


  1. Information Diffusion Mary McGlohon CMU 10-802 3/23/10

  2. Outline • Intro: Models for diffusion • Epidemiological: SIS/SIR/SIRS • Threshold models • Case studies • SIR: Info diffusion in blogs • SIS: Cascades in blogs • Timing: Cascades in chain letters • A closer look: Network-based Marketing

  3. Epidemiological: SIS • Susceptible, Infected, Susceptible • Infected for t_I timesteps • While infected, transmits with probability b • After t_I steps, returns to susceptible

  4. Epidemiological: SIR • Susceptible, Infected, Removed • Infected for t_I timesteps • While infected, transmits with probability b • After t_I steps, goes to removed/recovered

  5. Epidemiological: SIRS • Susceptible, Infected, Removed, Susceptible • Combination of SIS+SIR • After t_I steps, goes to removed/recovered • After t_R steps, returns to susceptible
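The three compartment models on slides 3 to 5 differ only in what happens when the infection timer runs out. The following is a minimal sketch of a discrete-time simulation on an arbitrary contact network. The function name `simulate`, the adjacency-dict representation, and the synchronous update order are assumptions made for illustration and do not come from the slides.

```python
import math
import random

def simulate(adj, seeds, b, t_i, t_r=None, model="SIS", steps=20, rng=None):
    """Discrete-time SIS / SIR / SIRS on an adjacency dict {node: [neighbors]}.

    b   : per-step transmission probability along each edge
    t_i : number of steps a node stays infected
    t_r : number of steps a node stays removed (needed for SIRS only)
    Returns the number of infected nodes after each step.
    """
    rng = rng or random.Random(0)
    state = {v: "S" for v in adj}
    timer = {v: 0 for v in adj}          # steps left in the current I or R state
    for v in seeds:
        state[v], timer[v] = "I", t_i
    history = []
    for _ in range(steps):
        # Every infected node tries to infect each susceptible neighbor.
        newly = {u for v in adj if state[v] == "I"
                 for u in adj[v] if state[u] == "S" and rng.random() < b}
        # Advance timers and apply the model-specific transition.
        for v in adj:
            if state[v] == "S":
                continue
            timer[v] -= 1
            if timer[v] > 0:
                continue
            if state[v] == "R" or model == "SIS":
                state[v] = "S"                          # back to susceptible
            elif model == "SIR":
                state[v], timer[v] = "R", math.inf      # removed for good
            else:                                       # SIRS
                state[v], timer[v] = "R", t_r           # removed for t_r steps
        for u in newly:
            state[u], timer[u] = "I", t_i
        history.append(sum(s == "I" for s in state.values()))
    return history

# A full clique of 5 nodes, the classical "anyone can infect anyone" setting.
clique = {v: [u for u in range(5) if u != v] for v in range(5)}
sir_history = simulate(clique, seeds=[0], b=1.0, t_i=1, model="SIR", steps=3)
```

With `b=1.0` the seed infects the whole clique in one step and the SIR epidemic then dies out, since every node ends up removed.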

  6. Epidemiological: Networks • Historically, SIS/SIR assumed that a person could infect anybody else, i.e. the contact network is a full clique. • There is an epidemic threshold in SIS. • For random power-law networks, threshold = 0 [Pastor-Satorras and Vespignani] • (But not for power-law networks with high clustering coefficients [Eguíluz and Klemm])

  7. Threshold Models • Each node in the network has a weighted threshold. • If the weight of a node's adopted neighbors reaches its threshold, the node adopts.
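A minimal sketch of a threshold cascade follows. It assumes the simplest weighting, in which every neighbor counts equally and a node adopts once the fraction of its neighbors that have adopted reaches its threshold. The name `threshold_cascade` and the equal-weight choice are illustrative assumptions; the slide does not specify the weighting.

```python
def threshold_cascade(adj, thresholds, seeds):
    """Return the final set of adopters.

    adj        : {node: [neighbors]}
    thresholds : {node: fraction of neighbors in [0, 1] required to adopt}
    seeds      : initial adopters
    """
    adopted = set(seeds)
    changed = True
    while changed:                       # repeat until no node changes its mind
        changed = False
        for v, nbrs in adj.items():
            if v in adopted or not nbrs:
                continue
            frac = sum(u in adopted for u in nbrs) / len(nbrs)
            if frac >= thresholds[v]:
                adopted.add(v)
                changed = True
    return adopted

# A path 0 - 1 - 2 - 3 seeded at node 0.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
everyone = threshold_cascade(path, {v: 0.5 for v in path}, seeds=[0])
nobody_new = threshold_cascade(path, {v: 0.6 for v in path}, seeds=[0])
```

Because adoption is never undone, the final set does not depend on the order in which nodes are visited. Raising the threshold from 0.5 to 0.6 is enough to stop the cascade at the seed in this example.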

  8. Outline • Intro: Models for diffusion • Epidemiological: SIS/SIR/SIRS • Threshold models • Case studies • SIR: Info diffusion in blogs • SIS: Cascades in blogs • Timing: Cascades in chain letters • A closer look: Network-based Marketing

  9. Info Diffusion in Blogs D. Gruhl, R. Guha, D. Liben-Nowell, A. Tomkins. Information Diffusion Through Blogspace. In WWW '04 (2004). Goal: How do topics trend in blogs, and how can we model diffusion of topics?

  10. Info Diffusion in Blogs • Data: Crawled 11K blogs, 400K posts. • Found 340 topics, for example: • apple, arianna, ashcroft, astronaut, blair, boykin, bustamante, chibi, china, davis, diana, farfarello, guantanamo, harvard, kazaa, longhorn, schwarzenegger, udell, siegfried, wildfires, zidane, gizmodo, microsoft, saddam

  11. Info Diffusion in Blogs • Topics = Chatter + Spikes • Chatter: Alzheimer • Spike: Chibi • Spiky Chatter: Microsoft

  12. Info Diffusion in Blogs • Modeled as SIR • Some set of authors is infected to write about a topic • Then propagate, as others write new posts on that topic • Measure the topic over time and other properties • Fit using EM • Compute probability of propagation along each edge

  13. Info Diffusion in Blogs • Validation: • Synthetic • Used a modified Erdős-Rényi graph and simulated propagation on it • Found that EM was able to identify transmission along most edges • Real • Found “internet-only” topics • Looked at the most highly ranked expected transmission links, identified a real link in 90% of cases

  14. Info Diffusion in Blogs • Limitations of SIR • No multiple postings • No “stickiness”, which topics resonate with whom • No time limiting factor in topics • “Closed world assumption” • No outside influences after initial infection

  15. Outline • Intro: Models for diffusion • Epidemiological: SIS/SIR/SIRS • Threshold models • Case studies • SIR: Info diffusion in blogs • SIS: Cascades in blogs • Timing: Cascades in chain letters • A closer look: Network-based Marketing

  16. Cascades in Blogs Jure Leskovec, Mary McGlohon, Christos Faloutsos, Natalie Glance, Matthew Hurst. Cascading Behavior in Large Blog Graphs: Patterns and a Model. In SIAM International Conference on Data Mining (SDM '07) (2007). Goal: What do cascades (conversation trees) in blogs look like, and how can we model them?

  17. Cascades in Blogs • Data: • Gathered from August-September 2005 • Used set of 44,362 blogs, 2.4 million posts • 245,404 blog-to-blog links [Figure: number of posts over time in 1-day bins, with the axis marked at Jul 4, Aug 1, and Sep 29]

  18. Cascades in Blogs • What is the timing of links? • What are cascade sizes? • What are cascade shapes? [Figure: a blogosphere of blogs B1-B4 containing posts a-e, and the cascades extracted from it, including a “star” and a “chain”]

  19. Cascades in Blogs • What is the timing of links? • Does popularity decay at a constant rate? • With an exponential (“half life”)? [Figure: the same data plotted on linear-linear, log-linear, and log-log scales]

  20. Cascades in Blogs Observation: The probability that a post written at time t_p acquires a link at time t_p + Δ is: p(t_p + Δ) ∝ Δ^{-1.5} [Figure: log(# in-links) vs. log(days after post), slope = -1.5]
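On a log-log plot a power law shows up as a straight line whose slope is the exponent, which is how the -1.5 above is read off. The sketch below fits that slope by ordinary least squares on synthetic, noise-free counts. The helper `loglog_slope` and the synthetic data are illustrative assumptions and are not the authors' fitting procedure or data.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

# Synthetic in-link counts that decay exactly as delta^-1.5 (made-up scale).
days = list(range(1, 31))
links = [1000 * d ** -1.5 for d in days]
slope = loglog_slope(days, links)
```

On real, noisy counts a plain least-squares fit of this kind is only a rough estimate, so it is best read as a check on the plot rather than as a careful estimator.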

  21. Cascades in Blogs • How are cascade sizes distributed? • Geometric distribution? [Figure: cascade size distribution plotted on linear-linear, log-linear, and log-log scales]

  22. Cascades in Blogs Q: What size distribution do cascades follow? Are large cascades frequent? Observation: The probability of observing a cascade of n blog posts follows a Zipf distribution: p(n) ∝ n^{-2} [Figure: log(Count) vs. log(Cascade size) (# of nodes), slope = -2]
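To get a feel for what p(n) ∝ n^{-2} implies about large cascades, the sketch below normalizes the distribution over sizes 1 to `n_max` and sums the tail. The truncation at 10,000 posts and the cutoff of 10 posts for a "large" cascade are arbitrary assumptions chosen for illustration.

```python
def zipf_pmf(exponent, n_max):
    """Probabilities p(1), ..., p(n_max) with p(n) proportional to n^-exponent."""
    weights = [n ** -exponent for n in range(1, n_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

pmf = zipf_pmf(2.0, 10_000)
p_single = pmf[0]          # cascades consisting of a single post
p_large = sum(pmf[9:])     # cascades with 10 or more posts
```

Under these assumptions about 61% of cascades are single posts and about 6% have ten or more, so large cascades are uncommon but far more frequent than a geometric distribution would allow.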

  23. Cascades in Blogs • How are cascade shapes distributed? • More stars? More chains? [Figure: example cascades over posts a-e]

  24. Cascades in Blogs Q: What is the distribution of particular cascade shapes? Observation: Stars and chains in blog cascades also follow a power law, with different exponents (star -3.1, chain -8.5). [Figure: log(Count) vs. log(Size) of star (# nodes), a = -3.1; log(Count) vs. log(Size) of chain (# nodes), a = -8.5]

  25. Cascades in Blogs • Based on SIS model in epidemiology • Randomly pick blog to infect, add post to cascade • Infect each in-linked neighbor with probability β • Add infected neighbors’ posts to cascade. • Set old infected node to uninfected. [Figure: blogs B1-B4]

  26. Cascades in Blogs • Based on SIS model in epidemiology • Randomly pick blog to infect, add post to cascade • Infect each in-linked neighbor with probability β • Add infected neighbors’ posts to cascade. • Set old infected node to uninfected. [Figure: blogs B1-B4; post p1,1]

  27. Cascades in Blogs • Based on SIS model in epidemiology • Randomly pick blog to infect, add post to cascade • Infect each in-linked neighbor with probability β • Add infected neighbors’ posts to cascade. • Set old infected node to uninfected. [Figure: blogs B1-B4; post p1,1]

  28. Cascades in Blogs • Based on SIS model in epidemiology • Randomly pick blog to infect, add post to cascade • Infect each in-linked neighbor with probability β • Add infected neighbors’ posts to cascade. • Set old infected node to uninfected. [Figure: blogs B1-B4; posts p1,1, p4,1, p2,1]

  29. Cascades in Blogs • Based on SIS model in epidemiology • Randomly pick blog to infect, add post to cascade • Infect each in-linked neighbor with probability β • Add infected neighbors’ posts to cascade. • Set old infected node to uninfected. [Figure: blogs B1-B4; posts p1,1, p4,1, p2,1]

  30. Cascades in Blogs • Based on SIS model in epidemiology • Randomly pick blog to infect, add post to cascade • Infect each in-linked neighbor with probability β • Add infected neighbors’ posts to cascade. • Set old infected node to uninfected. [Figure: blogs B1-B4; posts p1,1, p4,1, p2,1]

  31. Cascades in Blogs • Based on SIS model in epidemiology • Randomly pick blog to infect, add post to cascade • Infect each in-linked neighbor with probability β • Add infected neighbors’ posts to cascade. • Set old infected node to uninfected. [Figure: blogs B1-B4; posts p1,1, p4,1, p2,1]
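The generation steps repeated on slides 25 to 31 can be sketched as a short loop. This is an illustrative reading of the bullets, not the authors' code. The function name `generate_cascade`, the `(blog, post_number)` labels, the optional `start` argument, and the `max_posts` safety cap are all assumptions added here. The cap exists only so that the sketch terminates when the infection probability is set high.

```python
import random

def generate_cascade(in_links, beta, rng, start=None, max_posts=1000):
    """Grow one cascade on a blog graph.

    in_links : {blog: [blogs that link to it]}
    beta     : probability of infecting each in-linked neighbor
    Returns (root_post, edges) where each edge is (child_post, parent_post)
    and a post is labelled (blog, k) for the k-th post of that blog.
    """
    blogs = sorted(in_links)
    start = rng.choice(blogs) if start is None else start
    posts_per_blog = {}

    def new_post(blog):
        posts_per_blog[blog] = posts_per_blog.get(blog, 0) + 1
        return (blog, posts_per_blog[blog])

    root = new_post(start)               # randomly picked blog gets the first post
    edges = []
    infected = [root]
    while infected:
        next_infected = []
        for post in infected:
            for neighbor in in_links[post[0]]:
                if len(edges) + 1 >= max_posts:      # hypothetical safety cap
                    return root, edges
                if rng.random() < beta:              # infect in-linked neighbor
                    child = new_post(neighbor)
                    edges.append((child, post))      # neighbor's post joins cascade
                    next_infected.append(child)
        infected = next_infected         # old infected posts become uninfected
    return root, edges

in_links = {"B1": ["B2", "B3"], "B2": [], "B3": []}
root, edges = generate_cascade(in_links, 1.0, random.Random(0), start="B1")
```

Because a blog returns to the uninfected state after each round, it can be infected again and contribute a second post, which is the SIS behavior the slides describe.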

  32. Cascades in Blogs Data vs. model. [Figure: most frequent cascades in the data and in the model; log(Count) vs. log(Cascade size) (# nodes), log(Count) vs. log(Star size), and log(Count) vs. log(Chain size), each comparing model and data]

  33. Cascades in Blogs • Limitations of SIS • Closed world assumption • Forced to set infection probability low to avoid large epidemics, which possibly limits stars. • No time limit, which possibly overestimates chains.

  34. Outline • Intro: Models for diffusion • Epidemiological: SIS/SIR/SIRS • Threshold models • Case studies • SIR: Info diffusion in blogs • SIS: Cascades in blogs • Timing: Cascades in chain letters • A closer look: Network-based Marketing

  35. Chain Letter Cascades David Liben-Nowell, Jon Kleinberg. Tracing the Flow of Information on a Global Scale Using Internet Chain-Letter Data. Proceedings of the National Academy of Sciences, Vol. 105, No. 12 (March 2008), pp. 4633-4638. Goal: How can we trace the path of a meme, and explain these paths?

  36. Chain Letter Cascades • Data: NPR chain letter records. • People directed to sign and send back to admin • Had several copies of lists, overlaps • Reconstructed the trees using edit distance
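The slide only says that edit distance was used in the reconstruction, so the following is a generic sketch of Levenshtein distance rather than the authors' procedure. It works on any sequences, so it can compare two spellings of a name or two copies of a signature list. The example names are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))       # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]

# Two hypothetical copies of a signature list that share a prefix.
copy_a = ["ann", "bob", "cy"]
copy_b = ["ann", "bob", "dee", "eve"]
list_distance = edit_distance(copy_a, copy_b)
name_distance = edit_distance("kitten", "sitting")
```

Lists at a small distance from each other are plausible candidates for sharing a branch of the tree, although deciding which copy descends from which needs more than the distance alone.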

  37. Chain Letter Cascades A reconstruction:

  38. Chain Letter Cascades The tree:

  39. Chain Letter Cascades • How to model? • These trees have much longer paths • 2 considerations • Spatial distance (geographic) • Timing

  40. Chain Letter Cascades • Model: based on a delay distribution. • Nodes reply-to-all, so latecomers just append.

  41. Chain Letter Cascades • Validation: Simulated on a real social network (Livejournal), produced similar trees. • Limitations: • The chain letter mechanism is somewhat nontraditional diffusion • Closed-world assumption is perhaps OK

  42. Outline • Intro: Models for diffusion • Epidemiological: SIS/SIR/SIRS • Threshold models • Case studies • SIR: Info diffusion in blogs • SIS: Cascades in blogs • Timing: Cascades in chain letters • A closer look: Network-Based Marketing

  43. Network-Based Marketing Shawndra Hill, Foster Provost, Chris Volinsky. Network-based marketing: Identifying likely adopters via consumer networks. Statistical Science, Vol. 21, No. 2 (2006), pp. 256-275. Question: Is there statistical evidence that network linkage directly affects product adoption?

  44. Network-Based Marketing • Data: Direct-mail marketing campaign for adopting a new communications service. • 21 target segments, millions of customers • Divided based on: • Loyalty • Previous adoptions • Predictive scores based on other demographics • Different marketing campaigns (postcards, calls)

  45. Network-Based Marketing

  46. Network-Based Marketing • Hypothesis: A customer who has had direct communication with a subscriber is more likely to adopt. • Data: (incomplete) network information • ID of users, Timestamp, Duration • To test, added a “NN” (network neighbor) flag to features if a customer had communicated with a subscriber. (0.3% overall)

  47. Network-Based Marketing • Created baseline statistical model based on node attributes. • “Loyalty”: how the consumer used services in the past • Geographic: city, state, etc. • Demographic: census-type data, credit score • Added a variable for NN, performed logistic regression on each segment, with response variable being “take rate”.

  48. Network-Based Marketing Log-odds ratio for NN variable

  49. Network-Based Marketing [Figure: take rates and lift ratios]
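The quantities named on slides 48 and 49 can be illustrated with a small calculation. The sketch assumes that the take rate is adopters divided by customers targeted, and that the lift ratio is the take rate of network neighbors (NN) divided by the take rate of everyone else in the segment. The counts are made up for illustration and are not results from the study. The log-odds ratio here comes from a plain two-by-two table, whereas the study's regression also adjusts for other attributes.

```python
import math

def take_rate(adopters, targeted):
    """Fraction of targeted customers who adopted."""
    return adopters / targeted

def lift_ratio(nn_adopt, nn_total, other_adopt, other_total):
    """Take rate of network neighbors relative to non-neighbors."""
    return take_rate(nn_adopt, nn_total) / take_rate(other_adopt, other_total)

def log_odds_ratio(nn_adopt, nn_total, other_adopt, other_total):
    """Unadjusted log-odds ratio of adoption for NN vs. non-NN customers."""
    odds_nn = nn_adopt / (nn_total - nn_adopt)
    odds_other = other_adopt / (other_total - other_adopt)
    return math.log(odds_nn / odds_other)

# Hypothetical segment: 30 of 1,000 NN customers adopt, 100 of 10,000 others do.
lift = lift_ratio(30, 1000, 100, 10000)
lor = log_odds_ratio(30, 1000, 100, 10000)
```

A lift above 1, or equivalently a positive log-odds ratio, means network neighbors adopted at a higher rate than the rest of the segment.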

  50. Network-Based Marketing Added a “segment 22” consisting of only NN, but made up of less promising customers.
