Kwan Yi School of Library and Information Science

Mining a Web 2.0 service for the discovery of semantically similar terms: A case study with Del.icio.us
Kwan Yi
School of Library and Information Science, College of Communications and Information Studies, University of Kentucky

Social bookmarking: Del.icio.us • Del.icio.us is one of the most popular social bookmarking systems: • 3 million registered users and • 100 million unique URLs bookmarked, as of September 2007

Folksonomy • We define folksonomy as a collective set of tags (keywords or terms) assigned by participants in a social tagging system. • User-created vocabulary • Uncontrolled vocabulary • Built in a collaborative manner

Example: A folksonomy in Delicious.com [Screenshot with labeled parts: resource title, resource URL, resource taggers, popular tags, tagging history]

Objective of the Study • To examine an effective way of mining semantically similar terms from a folksonomy, in order to assess the feasibility of folksonomies as a data source of semantically similar terms

Proposed algorithms for mining similar terms from a folksonomy • Co-occurrence-based similarity algorithm • Correlation-based similarity algorithm

Experiment (I) • To identify similar terms for each of the 121 most popular tags on Del.icio.us, as posted on May 15, 2008

Result: How many similar terms for the 121 popular tags?
• Co-occurrence-based algorithm
  • 2.6 similar terms (Level of similarity = 0.9)
  • 5.1 similar terms (Level of similarity = 0.7)
  • 10.1 similar terms (Level of similarity = 0.5)
• Correlation-based algorithm
  • 0.9 similar terms (Level of similarity = 0.9)
  • 1.6 similar terms (Level of similarity = 0.7)
  • 2.6 similar terms (Level of similarity = 0.5)

Experiment (II) • To identify similar terms for each of the 32 tags (out of the 121) that are not listed in the online version of the Merriam-Webster Dictionary

Result: How many similar terms for the 32 not-in-the-dictionary tags?
• Co-occurrence-based algorithm
  • 3.3 similar terms (Level of similarity = 0.9)
  • 5.9 similar terms (Level of similarity = 0.7)
  • 10.1 similar terms (Level of similarity = 0.5)
• Correlation-based algorithm
  • 1 similar term (Level of similarity = 0.9)
  • 1.7 similar terms (Level of similarity = 0.7)
  • 2.4 similar terms (Level of similarity = 0.5)

Webdesign (similarity level: 0.9) • Co-occurrence [12]: resources, css, web, design, reference, html, tutorial, tutorials, inspiration, gallery, development, webdev • Correlation [4]: css, design, html, inspiration

Findings • The correlation-based algorithm is more selective than the co-occurrence-based algorithm. • The co-occurrence-based algorithm appears most attractive at a similarity level of 0.7.

Conclusion • As social bookmarking systems become more widely used, the potential of their folksonomies for this mining task will increase.

Thanks!

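Co-occurrence-based similarity algorithm (Identifying similar terms of the term W)
• Step 1: Collect resources tagged with W, each with its tags and tag counts:
  • W (100), A (50), B (20), C (10)
  • W (87), B (57), C (40), A (30)
  • W (1032), A (250), F (120), D (78)
  • W (37), A (29), B (16), F (9)
• Step 2: For each tag, count the resources in which it co-occurs with W: A (4), B (3), C (2), F (2), D (1)
• Step 3: Assign each tag a similarity level s, which in this example equals its count from Step 2 divided by the number of resources (4):
  • CoSA(s=1: A → W)
  • CoSA(s=0.75: B → W)
  • CoSA(s=0.5: C → W)
  • CoSA(s=0.5: F → W)
  • CoSA(s=0.25: D → W)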
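
The worked example on this slide can be reproduced with a minimal Python sketch. It assumes that the similarity level s is the fraction of W-tagged resources in which a tag also appears, and that a "level of similarity" setting acts as a lower bound on s; both are readings of the slide's numbers, not definitions stated in the talk. The names `cosa_levels` and `similar_terms` are hypothetical.

```python
from collections import Counter

def cosa_levels(target, resources):
    """Map each co-occurring tag to the fraction of target-tagged
    resources in which it appears (assumed meaning of s)."""
    with_target = [set(tags) for tags in resources if target in tags]
    if not with_target:
        return {}
    counts = Counter(
        tag for tags in with_target for tag in tags if tag != target
    )
    return {tag: n / len(with_target) for tag, n in counts.items()}

def similar_terms(target, resources, threshold):
    """Tags whose level s is at least `threshold` (assumed cut-off rule)."""
    levels = cosa_levels(target, resources)
    return sorted(tag for tag, s in levels.items() if s >= threshold)

# The four resources from the slide: tag -> number of users applying it.
# The per-tag counts are not used by this sketch; only tag presence matters.
resources = [
    {"W": 100, "A": 50, "B": 20, "C": 10},
    {"W": 87, "B": 57, "C": 40, "A": 30},
    {"W": 1032, "A": 250, "F": 120, "D": 78},
    {"W": 37, "A": 29, "B": 16, "F": 9},
]

levels = cosa_levels("W", resources)
print(levels["A"], levels["B"], levels["D"])  # 1.0 0.75 0.25
print(similar_terms("W", resources, 0.7))     # ['A', 'B']
```

With a threshold of 0.5 the same call would also return C and F, which matches the slide's pattern of more similar terms at lower similarity levels.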

Correlation-based similarity algorithm • Term X is said to be similar to term W on the basis of the correlation-based algorithm: CrSA(s: X → W) • CrSA(s: X → W) can be defined only if both CoSA(s: X → W) and CoSA(s: W → X) are satisfied.
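
The two-way condition can be illustrated with a short Python sketch. It assumes that the co-occurrence level of X toward W is the fraction of W-tagged resources that also carry X, and that a condition is "satisfied" when that fraction is at least s; the slide does not spell out either detail. The function names and the toy data are hypothetical.

```python
def cosa_level(x, w, resources):
    """Fraction of resources tagged w that are also tagged x
    (assumed meaning of the co-occurrence level; 0.0 if w is unused)."""
    with_w = [tags for tags in resources if w in tags]
    if not with_w:
        return 0.0
    return sum(1 for tags in with_w if x in tags) / len(with_w)

def crsa_holds(x, w, resources, s):
    """True when the co-occurrence condition holds in both directions at level s."""
    return (cosa_level(x, w, resources) >= s
            and cosa_level(w, x, resources) >= s)

# Toy corpus (made up for illustration): each resource is a set of tags.
resources = [
    {"W", "A", "B"},
    {"W", "A", "B"},
    {"A", "C"},
    {"A", "C"},
    {"A"},
    {"A"},
]

# A appears on every W resource, but W appears on only 2 of 6 A resources,
# so the one-way condition holds while the two-way condition fails.
print(cosa_level("A", "W", resources))       # 1.0
print(crsa_holds("A", "W", resources, 0.9))  # False
print(crsa_holds("B", "W", resources, 0.9))  # True
```

Under these assumptions, a very common tag such as A passes the one-way test for many targets but is filtered out by the reverse check, which is consistent with the finding that the correlation-based algorithm is more selective.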
