Analyzing Social Bookmarking Systems: A del.icio.us Cookbook Robert Wetzker, Carsten Zimmermann, Christian Bauckhage Workshop on Mining Social Data, ECAI 2008. 13 September, 2014 Dipl.-Ing. Robert Wetzker I robert.wetzker@dai-labor.de.
Why this paper? Why social bookmarking? • Provides a vast amount of user-generated annotations for web content. • Reflects the interests of millions of users. • Wisdom-of-crowds. Research areas: • (Web-) Search • (Web-) Content classification • Ontology building • Trend detection • Recommendation • …
Outline • The del.icio.us bookmarking service • Bookmarking patterns • Tagging patterns • Social bookmarking and spam • Conclusions and future work
The dataset • We recursively crawled del.icio.us tag-wise, starting with the tag “web2.0” (Oct.–Dec. 2007). • From the retrieved corpus of 45 million bookmarks we extracted the 1 million most frequent users and downloaded their bookmarks (Dec. 2007 – Apr. 2008). • For the analysis presented here, we considered only the 142 million bookmarks obtained from the user-wise crawl. Corpus details > 80% of del.icio.us
Bookmarking patterns • The del.icio.us community is biased toward web community and web technology related content. Top 10 most frequent URLs in the corpus
Bookmarking patterns • The del.icio.us community is biased toward web community and web technology related content. Top 10 most frequent domains in the corpus
Bookmarking patterns • The top 1% of users contribute 22% of all bookmarks. • 39% of all bookmarks link to 1% of all URLs.
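A concentration figure of this kind can be reproduced from raw bookmark records with a few lines of code. The sketch below is illustrative only: the function name `top_share` and the toy `(user, url)` pairs are assumptions, not the pipeline used for the slides.

```python
import math
from collections import Counter


def top_share(counts, fraction=0.01):
    """Return the share of the total held by the top `fraction` of items."""
    values = sorted(counts, reverse=True)
    if not values:
        return 0.0
    # Always include at least one item, even for very small collections.
    k = max(1, math.ceil(len(values) * fraction))
    return sum(values[:k]) / sum(values)


# Hypothetical toy data: one (user, url) pair per bookmark.
bookmarks = [
    ("u1", "a"), ("u1", "b"), ("u1", "c"),
    ("u2", "a"), ("u3", "a"), ("u4", "d"),
]
per_user = Counter(user for user, _ in bookmarks)
per_url = Counter(url for _, url in bookmarks)

# With only four users and four URLs, 25% stands in for the slide's 1%.
user_share = top_share(per_user.values(), fraction=0.25)
url_share = top_share(per_url.values(), fraction=0.25)
```

On the full corpus the same function would be called with `fraction=0.01` on the per-user and per-URL bookmark counts.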
Bookmarking patterns • The del.icio.us community pays attention to new content only for a very short period of time.
Tagging patterns • Each bookmark is labeled with 3.16 tags on average. • About 7% of all bookmarks are not tagged at all. Top 20 most frequent tags in the corpus
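Both summary statistics follow directly from the tag lists of the bookmarks. A minimal sketch, assuming each bookmark is represented simply as a list of tags (the representation and the name `tagging_summary` are hypothetical):

```python
def tagging_summary(bookmarks):
    """Return (mean tags per bookmark, share of untagged bookmarks)."""
    n = len(bookmarks)
    if n == 0:
        return 0.0, 0.0
    total_tags = sum(len(tags) for tags in bookmarks)
    untagged = sum(1 for tags in bookmarks if not tags)
    return total_tags / n, untagged / n


# Hypothetical toy data: one list of tags per bookmark.
bookmarks = [["web", "css"], [], ["ajax"], ["design", "web", "blog"]]
mean_tags, untagged_share = tagging_summary(bookmarks)
```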
Tagging patterns • 700 of 7,000,000 tags account for 50% of all labels. • 55% of all tags appear only once.
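The two long-tail measures above (how many of the most frequent tags cover half of all labels, and what share of tags occurs exactly once) can be computed from a tag frequency table. The following is a sketch on toy data; the function names and the example labels are assumptions.

```python
from collections import Counter


def tags_needed_for_coverage(tag_counts, coverage=0.5):
    """Return how many of the most frequent tags cover `coverage` of all labels."""
    total = sum(tag_counts.values())
    running = 0
    for rank, (_, count) in enumerate(tag_counts.most_common(), start=1):
        running += count
        if running >= coverage * total:
            return rank
    return len(tag_counts)


def singleton_share(tag_counts):
    """Return the share of distinct tags that appear exactly once."""
    if not tag_counts:
        return 0.0
    singletons = sum(1 for count in tag_counts.values() if count == 1)
    return singletons / len(tag_counts)


# Hypothetical toy data: one entry per tag assignment (label).
labels = ["web", "web", "web", "web", "design", "design", "css", "ajax"]
tag_counts = Counter(labels)

needed = tags_needed_for_coverage(tag_counts, coverage=0.5)
once = singleton_share(tag_counts)
```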
Tagging patterns • Tendencies in the del.icio.us tag distribution strongly correlate with upcoming and periodic external events. Occurrence of 5 sample tags in 2007.
Social bookmarking and spam • Del.icio.us is highly vulnerable to spam. • 19 of the top 20 users are apparently of non-human origin, accounting for 1.3 million bookmarks, around 1% of the corpus. • We find that spammers exhibit one or more of the following characteristics: • very high activity • bookmarking only a few domains • high tagging rate • very low tagging rate • bulk posts • a combination of the above
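Several of the listed characteristics can be expressed as simple per-user features. The sketch below computes activity, the number of distinct domains, and the average tagging rate, then flags users with a naive rule. The thresholds, the function names, and the example URLs are hypothetical, and this is not the detection method of the paper; bulk posts would additionally require timestamps, which are not modelled here.

```python
from collections import defaultdict
from urllib.parse import urlparse


def user_features(bookmarks):
    """Aggregate per-user features from (user, url, tags) records."""
    stats = defaultdict(lambda: {"bookmarks": 0, "domains": set(), "tags": 0})
    for user, url, tags in bookmarks:
        entry = stats[user]
        entry["bookmarks"] += 1
        entry["domains"].add(urlparse(url).netloc)
        entry["tags"] += len(tags)
    return {
        user: {
            "bookmarks": entry["bookmarks"],
            "distinct_domains": len(entry["domains"]),
            "tags_per_bookmark": entry["tags"] / entry["bookmarks"],
        }
        for user, entry in stats.items()
    }


def suspicious_users(features, min_bookmarks=3, max_domains=1):
    """Flag active users who bookmark very few domains (hypothetical thresholds)."""
    return sorted(
        user
        for user, f in features.items()
        if f["bookmarks"] >= min_bookmarks and f["distinct_domains"] <= max_domains
    )


# Hypothetical toy data.
bookmarks = [
    ("bot", "http://spam.example/a", []),
    ("bot", "http://spam.example/b", []),
    ("bot", "http://spam.example/c", []),
    ("alice", "http://news.example/x", ["news"]),
    ("alice", "http://blog.example/y", ["blog", "web"]),
    ("alice", "http://wiki.example/z", ["reference"]),
]
features = user_features(bookmarks)
flagged = suspicious_users(features)
```

In practice the thresholds would have to be tuned against the distributions shown in the scatter plots, since legitimate power users can also be very active.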
Social bookmarking and spam The number of bookmarks and the number of users linking to a domain.
Social bookmarking and spam The number of user bookmarks and the average number of tags per bookmark.
The diffusion of attention • In some cases spam detection may prove computationally expensive or ambiguous. • The diffusion of attention concept reduces the effect of spam on the tag distribution without requiring actual spam detection. • We define the attention given to a tag as the number of users using the tag. • The diffusion of attention for a tag is then given by the number of users who assign the tag for the first time in a given period.
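The difference between raw tag occurrence and the diffusion of attention can be made concrete with a small sketch. It reads "for the first time" as the first time a user ever assigns the tag within the available data, which is an interpretation of the definition above; the function names, the monthly period strings, and the toy records are assumptions.

```python
from collections import defaultdict


def tag_occurrence(posts):
    """Count how often each tag is assigned in each period."""
    counts = defaultdict(int)
    for _user, tag, period in posts:
        counts[(tag, period)] += 1
    return dict(counts)


def diffusion_of_attention(posts):
    """Count, per tag and period, the users assigning the tag for the first time."""
    first_use = {}
    for user, tag, period in posts:
        key = (user, tag)
        # ISO-style period strings such as "2007-01" sort chronologically.
        if key not in first_use or period < first_use[key]:
            first_use[key] = period
    counts = defaultdict(int)
    for (_user, tag), period in first_use.items():
        counts[(tag, period)] += 1
    return dict(counts)


# Hypothetical toy data: (user, tag, period) per tag assignment.
posts = [
    ("bot", "iphone", "2007-01"), ("bot", "iphone", "2007-01"),
    ("bot", "iphone", "2007-02"), ("bot", "iphone", "2007-02"),
    ("alice", "iphone", "2007-01"),
    ("bob", "iphone", "2007-02"),
    ("alice", "iphone", "2007-02"),
]
occurrence = tag_occurrence(posts)
diffusion = diffusion_of_attention(posts)
```

In this toy example the repeated posts of the single account `bot` inflate the occurrence counts in both months, while under the diffusion measure it contributes only once, in the month of its first use.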
The diffusion of attention Tagging trends by tag occurrence. Tagging trends by diffusion of attention.
Future work • Provide automatic and scalable spam detection methods. • Topic-aware detection of trends. Follow-up paper: Detecting Trends in Social Bookmarking Systems using a Probabilistic Generative Model and Smoothing, R. Wetzker, T. Plumbaum, A. Korth, C. Bauckhage, T. Alpcan, F. Metze, International Conference on Pattern Recognition (ICPR), 2008, Tampa, USA (to appear)
Thank you. Questions?
Social bookmarking and spam The number of bookmarks and the number of users linking to a domain. http://d.hatena.ne.jp