Blogs (web logs) contain online stamped entries

Implicit Structure and Dynamics of BlogSpace. Eytan Adar, Li Zhang, Lada Adamic, & Rajan Lukose, HP Labs, Palo Alto, CA.


Presentation Transcript


  1. Implicit Structure and Dynamics of BlogSpace. Eytan Adar, Li Zhang, Lada Adamic, & Rajan Lukose, HP Labs, Palo Alto, CA

  2. Blogs (web logs) contain online stamped entries. [Annotated blog screenshot; callouts: list of read blogs; date and time stamps; URL that is being commented on; via link]

  3. Blogs: structure and transmission
     • Blog use:
       • Record real-world and virtual experiences
       • Note and discuss things “seen” on the net
     • Blog structure: blog-to-blog linking
     • Use + Structure:
       • Great to track “memes” (catchy ideas)
       • Patterns of information flow
       • How does the popularity of a topic evolve over time?
       • Who is getting information from whom?
       • Ranking algorithms that take advantage of transmission patterns

  4. Related Work
     Link prediction in social networks:
     • Butts, C., Network Inference, Error, and Information (In)Accuracy: A Bayesian Approach, Social Networks, 25(2):103-140.
     • Dombroski, M., P. Fischbeck, and K. Carley, An Empirically-Based Model for Network Estimation and Prediction, NAACSOS conference proceedings, Pittsburgh, PA, 2003.
     • O’Madadhain, J., P. Smyth, and L. Adamic, Learning Predictive Models for Link Formation, Sunbelt 2005 (hope you were there!)
     • Getoor, L., N. Friedman, D. Koller, and B. Taskar, Learning Probabilistic Models of Link Structure, Journal of Machine Learning Research, vol. 3 (2002), pp. 690-707.
     • Adamic, L., and E. Adar, Friends and neighbors on the Web, Social Networks, 2003.
     • Kleinberg, J., and D. Liben-Nowell, The Link Prediction Problem for Social Networks, in Proceedings of CIKM ’03 (New Orleans, LA, November 2003), ACM Press.
     Blog ranking: Technorati, BlogPulse, Daypop…
     Blog epidemic tracking: Blogdex at MIT Media Lab (Cameron Marlow, Sunbelt 2003); BlogPulse

  5. Intelliseek’s BlogPulse: a service for tracking trends in the blogosphere (popular URLs, phrases, people)

  6. BlogPulse Data
     • Analyzed 37,153 blogs
     • Differential daily crawls (to find new posts) for May 2003
     • Full page crawl for May 18, 2003 to capture blogrolls
     • 175,712 URLs occurring on > 2 blogs

  7. Tracking popularity over time. Blogdex, BlogPulse, etc. track the most popular links/phrases of the day. [Plot: popularity against time, annotated “Slashdot Effect” and “BoingBoing Effect”]

  8. Election Map Cartograms. Michael Gastner, Cosma Shalizi, and Mark Newman, University of Michigan. http://www-personal.umich.edu/~mejn/election/

  9. Tracking popularity over time. [Plot: popularity against time]

  10. Clustering information popularity profiles (May 2003)
     • Total # of mentions substantial (40)
     • URL mentioned for the first time in May

  11. K-means clustering
     • 259 URLs in the sample satisfy the criteria
     • Take normalized cumulative profiles [Plot: all mentions against day]
     • K-means minimizes the total within-cluster difference (in standard K-means, the squared distance of each profile to its cluster centroid)
     • 4 clusters captured most of the differences
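The clustering step on the slide above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes plain Lloyd's K-means with squared Euclidean distance, and the function names (`cumulative_profile`, `kmeans`) and the daily counts are hypothetical.

```python
import numpy as np

def cumulative_profile(daily_mentions):
    """Normalized cumulative profile: fraction of all mentions seen by each day."""
    counts = np.asarray(daily_mentions, dtype=float)
    return np.cumsum(counts) / counts.sum()

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm (an assumed variant); returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct profiles chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every profile to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if a cluster went empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical daily mention counts for four URLs (not data from the talk).
daily = [
    [20, 10, 5, 3, 2],   # early spike, fast decay
    [18, 12, 6, 2, 2],   # early spike, fast decay
    [2, 3, 5, 10, 20],   # slow build-up
    [1, 4, 5, 12, 18],   # slow build-up
]
profiles = np.array([cumulative_profile(d) for d in daily])
labels, centroids = kmeans(profiles, k=2)
```

Normalizing each profile to end at 1 makes URLs with very different total mention counts comparable, so the clusters reflect the shape of the popularity curve rather than its size.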

  12. Different kinds of information have different popularity profiles. [Four-panel plot, clusters 1 to 4; panel labels: Major-news site (editorial content), back of the paper; Products, etc.; Slashdot postings; Front-page news. y-axis: 0 to 1; x-axis: day, 5 to 15]

  13. Popularity profiles. [Plot: profiles for clusters 1 to 4; y-axis: 0 to 1; x-axis: 2 to 22]

  14. Micro example: Giant Microbes

  15. Microscale Dynamics: what do we need to track specific info ‘epidemics’?
     • Timings
     • Underlying network
     [Diagram: blogs b1, b2, b3 on a time-of-infection axis from t0 to t1]

  16. Microscale Dynamics: challenges
     • Root may be unknown
     • Multiple possible paths
     • Uncrawled space, alternate media (email, voice)
     • No links
     [Diagram: blogs b1, b2, b3, … bn with uncertain (“?”) paths, on a time-of-infection axis from t0 to t1]

  17. Microscale Dynamics: who is getting info from whom
     • Via links (< 2% of links, 50% within sample): unambiguous
     • Multiple explicit links: which link is more likely?
     • No explicit links (70%): which implicit path is more likely?

  18. Link Inference
     • Use machine learning algorithms:
       • A) Support Vector Machine (SVM)
       • B) Logistic Regression
     • What we can use:
       • Full text
       • Blogs in common
       • Links in common
       • History of infection
     [Diagram labels: BoingBoing, WIRED]

  19. Percentage of blog pairs sharing at least one link

  20. Similarity in links between reciprocated, unreciprocated, and non-linked blog pairs

  21. Training on positive and negative examples of ‘infection’. [Diagram: a positive example (+) and a negative example (-), each a pair Blog A, Blog B; labels: T_infection(Blog B) > T_infection(Blog A); Infected; Uninfected]

  22. Prediction results
     • Link inference:
       • SVM: 91% accuracy
       • Regression: 92% accuracy (blog-blog links most predictive)
     • Infection inference:
       • SVM: 71.5% accuracy, using blog and non-blog link similarity + timing features (AbeforeB)/nA, (BbeforeA)/nA, (A same day B)/nA, …
       • Regression: 75% accuracy using only timing features
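As a rough illustration of the timing features named on the slide above, the sketch below computes the three ratios for one blog pair. The slide does not define nA; here it is assumed to be the number of URLs blog A mentioned, and the counts are assumed to run over URLs both blogs mentioned. The function name and the sample data are hypothetical.

```python
def timing_features(days_a, days_b):
    """Timing features for the ordered blog pair (A, B).

    days_a, days_b: dicts mapping URL -> day of first mention on that blog.
    Assumption: nA is the number of URLs mentioned by blog A.
    """
    n_a = len(days_a)
    if n_a == 0:
        return {"a_before_b": 0.0, "b_before_a": 0.0, "same_day": 0.0}
    shared = days_a.keys() & days_b.keys()  # URLs both blogs mentioned
    a_first = sum(1 for u in shared if days_a[u] < days_b[u])
    b_first = sum(1 for u in shared if days_b[u] < days_a[u])
    same = sum(1 for u in shared if days_a[u] == days_b[u])
    return {
        "a_before_b": a_first / n_a,  # (AbeforeB)/nA
        "b_before_a": b_first / n_a,  # (BbeforeA)/nA
        "same_day": same / n_a,       # (A same day B)/nA
    }

# Hypothetical first-mention days (day of month) for two blogs.
blog_a = {"u1": 1, "u2": 3, "u3": 5, "u4": 7}
blog_b = {"u1": 2, "u2": 3, "u3": 4}
features = timing_features(blog_a, blog_b)
```

With one-day timestamp resolution, the same-day ratio carries no ordering information, which is one reason it appears as its own feature rather than being folded into either direction.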

  23. Sources of error
     • Incomplete crawls [Diagram: an uncrawled blog or media source C infects both A and B; inferred link compared with actual path]
     • Coarseness and sparseness of timing data (1 day resolution)
     • Mirror URLs (actually helps)

  24. GUESS tool (build your own, see demo @ 5:30!)
     • Using GraphViz (by AT&T) layouts
     • Simple algorithm:
       • If a single, explicit link exists, draw it (add node if needed)
       • Otherwise use the ML algorithm:
         • Pick the most likely explicit link
         • Pick the most likely possible link
     • Tool lets you zoom around space, control threshold, link types, etc.
     Visualization by Eytan Adar: http://www-idl.hpl.hp.com/blogstuff
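The edge-selection rule on the slide above might look like the sketch below. It is an interpretation, not the tool's actual code: `choose_source`, the `score` callback (a stand-in for the trained classifier), and the way the threshold is applied only to inferred links are all assumptions.

```python
def choose_source(blog, explicit_sources, possible_sources, score, threshold=0.0):
    """Pick which blog most likely infected `blog`.

    explicit_sources: earlier-infected blogs that `blog` links to explicitly.
    possible_sources: earlier-infected blogs with no explicit link.
    score(source, blog): assumed classifier likelihood of infection.
    Returns (source, link_type).
    """
    # A single explicit link is unambiguous: draw it.
    if len(explicit_sources) == 1:
        return explicit_sources[0], "explicit"
    # Several explicit links: let the classifier pick the most likely one.
    if explicit_sources:
        best = max(explicit_sources, key=lambda s: score(s, blog))
        return best, "explicit"
    # No explicit link: pick the most likely inferred link above the threshold.
    candidates = [s for s in possible_sources if score(s, blog) >= threshold]
    if not candidates:
        return None, "none"
    best = max(candidates, key=lambda s: score(s, blog))
    return best, "inferred"

# Hypothetical classifier scores for one target blog.
scores = {("boingboing", "blogA"): 0.9, ("wired", "blogA"): 0.4}

def lookup_score(source, blog):
    return scores.get((source, blog), 0.0)

single = choose_source("blogA", ["wired"], ["boingboing"], lookup_score)
multi = choose_source("blogA", ["wired", "boingboing"], [], lookup_score)
inferred = choose_source("blogA", [], ["wired", "boingboing"], lookup_score, threshold=0.5)
nothing = choose_source("blogA", [], ["wired"], lookup_score, threshold=0.5)
```

Returning the link type alongside the source lets a viewer style explicit and inferred edges differently, as the visualization legend does.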

  25. Giant Microbes epidemic visualization. [Graph legend: blog; via link; explicit link; inferred link]

  26. iRank: find early sources of good information using inferred information paths or timing. [Diagram: b1 (true source), b2 (popular site), b3, b4, b5, … bn]

  27. iRank Algorithm
     • Draw a weighted edge for all pairs of blogs that cite the same URL
     • Higher weight for mentions closer together
     • Run PageRank
     • Control for ‘spam’
     [Diagram: time-of-infection axis from t0 to t1]
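A minimal sketch of the steps listed above, under several assumptions the slide does not spell out: edges point from the later blog to the earlier one so that rank flows toward early sources, the weight is 1 / (1 + day gap), same-day pairs are skipped because they have no direction, and the spam control is omitted. The function name `irank` and the sample mentions are hypothetical.

```python
from itertools import combinations

import numpy as np

def irank(mentions, damping=0.85, n_iter=100):
    """mentions: dict mapping URL -> list of (blog, day_of_mention)."""
    blogs = sorted({b for cites in mentions.values() for b, _ in cites})
    idx = {b: i for i, b in enumerate(blogs)}
    n = len(blogs)
    weights = np.zeros((n, n))
    for cites in mentions.values():
        for (b1, t1), (b2, t2) in combinations(cites, 2):
            if b1 == b2 or t1 == t2:
                continue  # assumption: same-day mentions give no direction
            later, earlier = (b1, b2) if t1 > t2 else (b2, b1)
            # Assumed weighting: mentions closer in time get a higher weight.
            weights[idx[later], idx[earlier]] += 1.0 / (1.0 + abs(t1 - t2))
    row_sums = weights.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # Row-normalize; blogs with no outgoing edges jump uniformly.
    transition = np.where(row_sums > 0, weights / safe_sums, 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        # Standard PageRank power iteration with uniform teleportation.
        rank = (1 - damping) / n + damping * (rank @ transition)
    return dict(zip(blogs, rank))

# Hypothetical mention days for two URLs (not data from the talk).
mentions = {
    "url1": [("source", 0), ("popular", 1), ("reader1", 2), ("reader2", 2)],
    "url2": [("source", 0), ("popular", 2), ("reader1", 3)],
}
ranks = irank(mentions)
```

In this toy input the blog that always mentions URLs first ends up ranked above the popular site that picks them up a day or two later, which is the behavior the previous slide asks for.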

  28. Do Bloggers Kill Kittens?
     • 02:00 AM Friday Mar. 05, 2004 PST: Wired publishes “Warning: Blogs Can Be Infectious.”
     • 7:25 AM Friday Mar. 05, 2004 PST: Slashdot posts “Bloggers' Plagiarism Scientifically Proven”
     • 9:55 AM Friday Mar. 05, 2004 PST: Metafilter announces “A good amount of bloggers are outright thieves.”

  29. For more info
     • Information Dynamics Lab @ HP: http://www.hpl.hp.com/research/idl
     • Blog Epidemic Analyzer: http://www-idl.hpl.hp.com/blogstuff
     • Eytan, Li, Lada & Rajan:
       • http://www.hpl.hp.com/research/idl/people/eytan/
       • http://www.hpl.hp.com/personal/Li_Zhang/
       • http://www.hpl.hp.com/personal/Lada_Adamic
       • http://www.hpl.hp.com/research/idl/people/lukose/

  30. CNN: Wal-Mart banishes bawdy mags
