Analyzing Social Bookmarking Systems: A del.icio.us Cookbook

Analyzing Social Bookmarking Systems: A del.icio.us Cookbook. Robert Wetzker, Carsten Zimmermann, Christian Bauckhage. Workshop on Mining Social Data, ECAI 2008. 13 September, 2014. Dipl.-Ing. Robert Wetzker | robert.wetzker@dai-labor.de

Presentation Transcript


  1. Analyzing Social Bookmarking Systems: A del.icio.us Cookbook. Robert Wetzker, Carsten Zimmermann, Christian Bauckhage. Workshop on Mining Social Data, ECAI 2008. 13 September, 2014. Dipl.-Ing. Robert Wetzker | robert.wetzker@dai-labor.de

  3. Why this paper? Why social bookmarking? • Provides a vast amount of user-generated annotations for web content. • Reflects the interests of millions of users. • Wisdom-of-crowds. Research areas: • (Web-) Search • (Web-) Content classification • Ontology building • Trend detection • Recommendation • …

  4. Outline • The del.icio.us bookmarking service • Bookmarking patterns • Tagging patterns • Social bookmarking and spam • Conclusions and future work

  5. The del.icio.us bookmarking service

  6. The del.icio.us bookmarking service

  7. The growth of del.icio.us

  10. The dataset • We recursively crawled del.icio.us tag-wise, starting with the tag “web2.0” (Oct. to Dec. 2007). • From the retrieved corpus of 45 million bookmarks we extracted the 1 million most frequent users and downloaded the bookmarks of these users (Dec. 2007 to Apr. 2008). • For the analysis presented here, we only considered the 142 million bookmarks obtained from the user-wise crawling. Corpus details > 80% of del.icio.us

  11. Bookmarking patterns

  12. Bookmarking patterns • The del.icio.us community is biased toward web community and web technology related content. Top 10 most frequent URLs in the corpus

  13. Bookmarking patterns • The del.icio.us community is biased toward web community and web technology related content. Top 10 most frequent domains in the corpus

  14. Bookmarking patterns • The top 1% of users contribute 22% of all bookmarks. • 39% of all bookmarks link to 1% of all URLs.
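The two concentration figures on slide 14 can be reproduced in principle by counting bookmarks per user and per URL and then measuring the share held by the top fraction. The sketch below is only an illustration: the `(user, url)` tuple layout, the helper name `top_share`, and the toy data are assumptions, not the authors' code or corpus.

```python
import math
from collections import Counter


def top_share(counts, fraction=0.01):
    """Share of the total count held by the top `fraction` of keys (at least one key)."""
    values = sorted(counts.values(), reverse=True)
    if not values:
        return 0.0
    k = max(1, math.ceil(len(values) * fraction))
    return sum(values[:k]) / sum(values)


# Hypothetical toy corpus of (user, url) bookmarks.
bookmarks = [
    ("u1", "a"), ("u1", "b"), ("u1", "c"),
    ("u2", "a"), ("u3", "a"), ("u4", "d"),
]
per_user = Counter(user for user, _ in bookmarks)
per_url = Counter(url for _, url in bookmarks)

# With four users and four URLs, fraction=0.25 selects the single top entry.
user_share = top_share(per_user, fraction=0.25)
url_share = top_share(per_url, fraction=0.25)
```

On the real corpus one would call `top_share(per_user, 0.01)` and `top_share(per_url, 0.01)`.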

  15. Bookmarking patterns • The del.icio.us community pays attention to new content only for a very short period of time.

  16. Tagging patterns

  17. Tagging patterns • Each bookmark is labeled with 3.16 tags on average. • About 7% of all bookmarks are not tagged at all. Top 20 most frequent tags in the corpus
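Both numbers on slide 17 are simple per-bookmark aggregates. A minimal sketch, assuming each bookmark is represented just by its list of tags (a hypothetical layout chosen for illustration):

```python
def tagging_stats(bookmarks):
    """Return (mean tags per bookmark, share of bookmarks without any tag)."""
    n = len(bookmarks)
    if n == 0:
        return 0.0, 0.0
    total_tags = sum(len(tags) for tags in bookmarks)
    untagged = sum(1 for tags in bookmarks if not tags)
    return total_tags / n, untagged / n


# Hypothetical toy sample: four bookmarks, one of them untagged.
sample = [["web2.0", "design"], [], ["python"], ["css", "web", "tools"]]
mean_tags, untagged_share = tagging_stats(sample)
```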

  18. Tagging patterns • 700 of 7,000,000 tags account for 50% of all labels. • 55% of all tags appear only once.
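The long-tail statistics on slide 18 follow from the tag frequency table: sort tags by frequency, accumulate until half of all labels are covered, and count the tags that occur exactly once. The function names and toy labels below are assumptions made for this sketch.

```python
from collections import Counter


def tags_covering(tag_counts, share=0.5):
    """Smallest number of most frequent tags whose labels reach `share` of all labels."""
    total = sum(tag_counts.values())
    running = 0
    for rank, (_, count) in enumerate(tag_counts.most_common(), start=1):
        running += count
        if running >= share * total:
            return rank
    return 0


def singleton_fraction(tag_counts):
    """Fraction of distinct tags that appear exactly once."""
    if not tag_counts:
        return 0.0
    return sum(1 for c in tag_counts.values() if c == 1) / len(tag_counts)


# Hypothetical toy label list: 10 labels over 4 distinct tags.
labels = ["design"] * 5 + ["web"] * 3 + ["css", "ajax"]
counts = Counter(labels)
head_size = tags_covering(counts, share=0.5)
once_only = singleton_fraction(counts)
```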

  19. Tagging patterns • Tendencies in the del.icio.us tag distribution strongly correlate with upcoming and periodic external events. Occurrence of 5 sample tags in 2007.

  20. Social bookmarking and spam

  22. Social bookmarking and spam • Del.icio.us is highly vulnerable to spam. • 19 of the top 20 users are apparently of non-human origin, accounting for 1.3 million bookmarks, around 1% of the corpus. • We find spammers to exhibit one or more of the following characteristics: • very high activity • bookmarking only a few domains • high tagging rate • very low tagging rate • bulk posts • a combination of the above
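The characteristics listed on slide 22 can be turned into per-user indicators. The sketch below is one possible reading, not the authors' detector: the `(url, tags, timestamp)` layout, the function name `spam_indicators`, the interpretation of "bulk posts" as many posts sharing one timestamp, and every threshold are assumptions chosen for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def spam_indicators(posts, max_posts=10000, min_domains=3, min_posts_for_domains=100,
                    high_tag_rate=20.0, low_tag_rate=0.1, bulk_size=100):
    """Return the slide-22 characteristics that one user's posts trigger.

    posts: list of (url, tags, timestamp). All default thresholds are
    hypothetical placeholders, not values taken from the slides.
    """
    n = len(posts)
    if n == 0:
        return []
    domains = {urlparse(url).netloc for url, _, _ in posts}
    tag_rate = sum(len(tags) for _, tags, _ in posts) / n
    # Assumed reading of "bulk posts": many posts with an identical timestamp.
    largest_batch = max(Counter(ts for _, _, ts in posts).values())

    flags = []
    if n > max_posts:
        flags.append("very high activity")
    if n >= min_posts_for_domains and len(domains) < min_domains:
        flags.append("few domains")
    if tag_rate > high_tag_rate:
        flags.append("high tagging rate")
    elif tag_rate < low_tag_rate:
        flags.append("very low tagging rate")
    if largest_batch >= bulk_size:
        flags.append("bulk posts")
    return flags


# Toy thresholds so that the tiny examples below can trigger the indicators.
toy = dict(max_posts=5, min_domains=2, min_posts_for_domains=5,
           high_tag_rate=10.0, low_tag_rate=0.5, bulk_size=5)
suspect = [("http://spam.example/page%d" % i, [], "2007-12-01T10:00:00") for i in range(6)]
regular = [
    ("http://a.example/x", ["web", "css"], "2007-12-01T10:00:00"),
    ("http://b.example/y", ["python", "tools"], "2007-12-02T11:30:00"),
    ("http://c.example/z", ["news", "ajax"], "2007-12-05T09:15:00"),
]
suspect_flags = spam_indicators(suspect, **toy)
regular_flags = spam_indicators(regular, **toy)
```

Returning the list of triggered indicators rather than a single yes/no keeps the "combination of the above" case visible and leaves the final decision to the analyst.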

  23. Social bookmarking and spam The number of bookmarks and the number of users linking to a domain.

  24. Social bookmarking and spam The number of user bookmarks and the average number of tags per bookmark.

  25. The diffusion of attention

  27. The diffusion of attention • In some cases spam detection may prove computationally expensive or ambiguous. • The diffusion of attention concept reduces the effect of spam on the tag distribution without the actual need for spam detection. • We define the attention given to a tag as the number of users using the tag. • The diffusion of attention for a tag is then given by the number of users that assign the tag for the first time in a given period.
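The definition on slide 27 translates directly into a count of first-time users per tag and period. The sketch below contrasts it with plain tag occurrence on toy data; the `(user, tag, period)` layout, the function names, and the example users are assumptions made for illustration.

```python
from collections import defaultdict


def tag_occurrence(posts):
    """Number of times each tag is assigned per period: {tag: {period: count}}."""
    occ = defaultdict(lambda: defaultdict(int))
    for _, tag, period in posts:
        occ[tag][period] += 1
    return {tag: dict(per_period) for tag, per_period in occ.items()}


def diffusion_of_attention(posts):
    """Number of users assigning each tag for the first time per period."""
    first_use = {}
    for user, tag, period in posts:
        key = (user, tag)
        if key not in first_use or period < first_use[key]:
            first_use[key] = period
    diffusion = defaultdict(lambda: defaultdict(int))
    for (_, tag), period in first_use.items():
        diffusion[tag][period] += 1
    return {tag: dict(per_period) for tag, per_period in diffusion.items()}


# Hypothetical toy data: one automated account repeats "casino" 100 times per
# period, while three human users pick up "iphone" over two periods.
posts = [("bot", "casino", 1)] * 100 + [("bot", "casino", 2)] * 100
posts += [("alice", "iphone", 1), ("bob", "iphone", 1),
          ("alice", "iphone", 2), ("carol", "iphone", 2)]

occurrence = tag_occurrence(posts)
diffusion = diffusion_of_attention(posts)
```

In this toy case the single heavy poster dominates the occurrence counts but contributes only one first-time use, which is the damping effect the slide describes.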

  29. The diffusion of attention Tagging trends by tag occurrence. Tagging trends by diffusion of attention.

  30. Future work • Provide automatic and scalable spam detection methods. • Topic-aware detection of trends. Follow-up paper: Detecting Trends in Social Bookmarking Systems using a Probabilistic Generative Model and Smoothing, R. Wetzker, T. Plumbaum, A. Korth, C. Bauckhage, T. Alpcan, F. Metze, International Conference on Pattern Recognition (ICPR), 2008, Tampa, USA (to appear)

  31. Thank you. Questions?

  32. Social bookmarking and spam The number of bookmarks and the number of users linking to a domain. http://d.hatena.ne.jp
