
Clustering the Tagged Web

Clustering the Tagged Web. D. Ramage, P. Heymann, C. Manning, & H. Garcia-Molina from Stanford InfoLab. ACM Conference on Web Search and Data Mining (WSDM 2009). IDS Lab. Seminar Spring 2009. Mar. 20th, 2009. 강 민 석. minsuk@europa.snu.ac.kr.


Presentation Transcript


  1. Clustering the Tagged Web D. Ramage, P. Heymann, C. Manning, & H. Garcia-Molina from Stanford InfoLab ACM Conference on Web Search and Data Mining (WSDM 2009) IDS Lab. Seminar Spring 2009 Mar. 20th, 2009 강 민 석 minsuk@europa.snu.ac.kr

  2. Contents • Introduction • Problem Statement • Main Topics • Experiments • Further Studies • Conclusion

  3. Contents • Introduction • Problem Statement • Main Topics • Experiments • Further Studies • Conclusion

  4. Introduction: Clustering the Web • One of the most promising approaches to handling the inherent ambiguity of user queries is automatic clustering of web pages. • Cluster hypothesis: “the associations between documents convey information about the relevance of documents to requests” • [Figure: clustering a set of documents]

  5. Introduction: Social Bookmarking & Tags • Tags promise a uniquely well-suited source of information on the similarity between web documents. • This paper is the first to systematically evaluate how best to use tags for clustering web documents.

  6. Problem Statement • How can tagging data best be used to improve web document clustering?

  7. Contents • Introduction • Problem Statement • Main Topics • Clustering Algorithms • Combine Words & Tags • Evaluation Metric • Experiments • Further Studies • Conclusion

  8. Main Topics • Title: “Clustering the Tagged Web” • Goal: Document Clustering for Search • Topics: Clustering Algorithm (K-means, MM-LDA); Modeling a Document (How to Combine Words & Tags in the VSM); Evaluation Metric (use ODP, use F-score)

  9. Clustering Algorithm • A clustering algorithm partitions a set of web documents into groups of similar documents. • The task is similar to standard document clustering, except that each document has tags as well as words. • We look at two algorithms: K-means, based on the vector space model (VSM), and an LDA-derived algorithm, based on a probabilistic model. • Input: the number of clusters K and a set of documents 1,2,…,D, each with a bag of words and a bag of tags

  10. K-means clustering • A simple and highly scalable clustering algorithm • Based on the Vector Space Model • Assigns each document to one of K groups by iteratively re-assigning documents to the nearest cluster centroid • All documents are vectors, and the dimensionality is the size of the vocabulary. • Then, how should the documents be modeled? Images from http://www.cs.cmu.edu/~dpelleg/kmeans.html

  11. MM-LDA (Multi-Multinomial Latent Dirichlet Allocation) • A variation of LDA, a generative probabilistic topic model • LDA models each document as a mixture of hidden topic variables; each topic is associated with a distribution over words. • LDA adds fully generative probabilistic semantics to pLSI, which is itself a probabilistic version of LSI (Latent Semantic Indexing). • We extend LDA to jointly account for words and tags as distinct sets of observations. • Lineage: pLSI → LDA → MM-LDA

  12. MM-LDA (Multi-Multinomial Latent Dirichlet Allocation) • A variation of LDA, a generative probabilistic topic model • LDA models each document as a mixture of hidden topic variables; each topic is associated with a distribution over words. • Process generating a collection of tagged documents: (1) For each topic k, draw a multinomial distribution beta_k of size |W| from a Dirichlet distribution with parameter eta_w. (2) For each topic k, draw a multinomial distribution gamma_k of size |T| from a Dirichlet distribution with parameter eta_t. (3) For each document i, draw a multinomial distribution theta_i of size |K| from a Dirichlet distribution with parameter alpha. (4) For each word j in document i, draw a topic z_j from theta_i, then draw a word w_j from beta_{z_j}. (5) For each tag j in document i, draw a topic z_j from theta_i, then draw a tag t_j from gamma_{z_j}. • [Figure: graphical representation of MM-LDA] • Steps 1, 3, and 4 are equivalent to standard LDA. • In step 2, we construct distributions of tags per topic. • In step 5, we sample a topic for each tag. • Given this generative process, the MM-LDA parameters are learned using Gibbs sampling.
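The generative story on this slide can be sketched as a forward sampler. This is an illustration only: it generates synthetic documents and does not perform the Gibbs-sampling inference the slide mentions. Step 5 draws tags from gamma, following the slide's note that step 5 samples a topic for each tag. The function name, the fixed document lengths, and the hyperparameter values are assumptions made for the example.

```python
import numpy as np

def generate_tagged_corpus(n_docs, n_topics, n_words, n_tags,
                           words_per_doc=20, tags_per_doc=5,
                           alpha=0.5, eta_w=0.5, eta_t=0.5, seed=0):
    """Sample (word ids, tag ids) per document from the MM-LDA generative story."""
    rng = np.random.default_rng(seed)
    # Step 1: per-topic word distributions beta_k, each of size |W|.
    beta = rng.dirichlet(np.full(n_words, eta_w), size=n_topics)
    # Step 2: per-topic tag distributions gamma_k, each of size |T|.
    gamma = rng.dirichlet(np.full(n_tags, eta_t), size=n_topics)
    docs = []
    for _ in range(n_docs):
        # Step 3: per-document topic mixture theta_i.
        theta = rng.dirichlet(np.full(n_topics, alpha))
        # Step 4: for each word, draw a topic, then a word from that topic.
        z_words = rng.choice(n_topics, size=words_per_doc, p=theta)
        words = [int(rng.choice(n_words, p=beta[z])) for z in z_words]
        # Step 5: for each tag, draw a topic, then a tag from that topic.
        z_tags = rng.choice(n_topics, size=tags_per_doc, p=theta)
        tags = [int(rng.choice(n_tags, p=gamma[z])) for z in z_tags]
        docs.append((words, tags))
    return docs, beta, gamma

docs, beta, gamma = generate_tagged_corpus(n_docs=3, n_topics=2,
                                           n_words=8, n_tags=6)
```

Words and tags share the document's topic mixture `theta` but have separate per-topic distributions, which is how the model treats them as distinct sets of observations.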

  13. Combining Words & Tags • The key question is “How to model the documents in the VSM?” • Five ways to model a document with a bag of words and a bag of tags as a vector V: Words Only, Tags Only, Words + Tags, Tags as Words Times n, Tags as New Words

  14. Combining Words & Tags • Example: the word vocabulary has 8 words and the tag vocabulary has 6 tags • [Figure: example vectors for Words Only, Tags Only, Words + Tags, Tags as Words Times 2, Tags as New Words]
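The five document models named on these slides might be built along the following lines. This is a sketch under stated assumptions, since the slides list only the names: here Words + Tags normalizes the word and tag channels separately before concatenating them, Tags as Words Times n counts each tag as n occurrences in the shared word vocabulary, and Tags as New Words gives tags their own dimensions without separate normalization. The `combine` function, the scheme strings, and the sample tokens are hypothetical.

```python
import math
from collections import Counter

def l2_normalize(vec):
    """Scale a sparse vector (dict) to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm > 0 else dict(vec)

def combine(words, tags, scheme, n=2):
    """Build a sparse document vector from a bag of words and a bag of tags."""
    w, t = Counter(words), Counter(tags)
    if scheme == "words_only":
        return dict(w)
    if scheme == "tags_only":
        return dict(t)
    if scheme == "words_plus_tags":
        # Assumed: two independent channels, each normalized on its own.
        vec = {("word", k): v for k, v in l2_normalize(w).items()}
        vec.update({("tag", k): v for k, v in l2_normalize(t).items()})
        return vec
    if scheme == "tags_as_words_times_n":
        # Assumed: a tag counts as n occurrences of the same token as a word.
        merged = Counter(w)
        for term, count in t.items():
            merged[term] += n * count
        return dict(merged)
    if scheme == "tags_as_new_words":
        # Assumed: tags get their own dimensions, no per-channel scaling.
        vec = {("word", k): c for k, c in w.items()}
        vec.update({("tag", k): c for k, c in t.items()})
        return vec
    raise ValueError(f"unknown scheme: {scheme}")

words = ["python", "code", "python", "tutorial"]
tags = ["python", "programming"]
times2 = combine(words, tags, "tags_as_words_times_n", n=2)
new_words = combine(words, tags, "tags_as_new_words")
plus = combine(words, tags, "words_plus_tags")
```

Under these assumptions, the practical difference between Words + Tags and Tags as New Words is the per-channel normalization, which keeps a long page text from drowning out a short tag list.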

  15. Evaluation of Cluster Quality • It is difficult to evaluate clustering algorithms. • Several studies compared their output with a hierarchical web directory. • We derive gold standard clusters from the ODP.

  16. Evaluation of Cluster Quality • Compare the generated clusters with the clustering derived from ODP by using the F1 measure. • The F1 cluster evaluation measure is the harmonic mean of precision and recall.
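One common way to compute a cluster-level F1 against a gold standard is to count pairs of documents. The sketch below assumes that pairwise formulation (a true positive is a pair placed in the same generated cluster and the same gold category); the slide does not spell out how precision and recall are counted, and the function name and toy labels are hypothetical.

```python
from itertools import combinations

def pairwise_f1(predicted, gold):
    """F1 over document pairs: harmonic mean of pairwise precision and recall."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(predicted)), 2):
        same_pred = predicted[i] == predicted[j]
        same_gold = gold[i] == gold[j]
        if same_pred and same_gold:
            tp += 1   # pair correctly clustered together
        elif same_pred:
            fp += 1   # clustered together, but gold separates them
        elif same_gold:
            fn += 1   # gold groups them, but the clustering split them
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

predicted = [0, 0, 1, 1, 1]           # generated cluster ids
gold = ["a", "a", "a", "b", "b"]      # gold categories, e.g. from a directory
score = pairwise_f1(predicted, gold)  # tp=2, fp=2, fn=2, so F1 = 0.5
```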

  17. Contents • Introduction • Problem Statement • Main Topics • Experiments • Term Weighting in VSM • How to Combine Words and Tags • Compare MM-LDA and K-means • Further Studies • Conclusion

  18. Term Weighting in the VSM • A document vector V is defined by a weight for each term in the vocabulary. • Then, how should the weights be assigned? • Consider two common functions: tf and tf-idf • [Table: tf vs. tf-idf weighting on K-means, F1-score for 2,000 documents] • Conclusion • Words+Tags outperforms words alone under both weightings. • tf on Words+Tags outperforms tf-idf on Words+Tags. • tf-idf performs poorly because it over-emphasizes the rarest terms.
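The two weighting functions compared on this slide can be sketched as below. The idf variant used here, log(N / df), is one standard choice and is an assumption; the slide does not say which variant was used. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_vectors(docs):
    """Raw term-frequency weights: one sparse vector (dict) per document."""
    return [dict(Counter(doc)) for doc in docs]

def tfidf_vectors(docs):
    """tf-idf weights with idf = log(N / df), an assumed standard variant."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / df[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]

docs = [
    ["web", "search", "web"],
    ["web", "tags"],
    ["web", "clustering", "tags"],
]
tf = tf_vectors(docs)
tfidf = tfidf_vectors(docs)
```

In this toy corpus, "web" appears in every document, so its tf-idf weight is zero, while the rarest terms receive the largest idf factor. That is the behavior the slide blames for over-emphasizing rare terms.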

  19. How to Combine Words and Tags with VSM • Which of the five ways to model a document works best in the VSM? • Ten runs of tf weighting on 13,230 documents • [Table: F1-score for K-means (tf) with several means of combining words and tags] • Conclusion • The Words+Tags model outperforms every other model. • Tags are a qualitatively different type of content than “just more words”. • K-means can incorporate tagging data as an independent information channel.

  20. How to Combine Words and Tags with MM-LDA • Then, how about MM-LDA? • [Table: F1-score for MM-LDA with several means of combining words and tags] • Conclusion • Here too, the Words+Tags model outperforms all other configurations. • Interestingly, performance decreases when tags are added to the words, due in part to the very different distributional statistics observed for words vs. tags.

  21. Compare MM-LDA and K-means • Which model is better? • [Table: F-scores for K-means and MM-LDA on 13,230 documents] • Conclusion • The inclusion of tagging data improves performance. • MM-LDA’s Words+Tags model is significantly better than all other models.

  22. Experiments • Highest scoring tags & words from clusters generated by K-means & MM-LDA

  23. Contents • Introduction • Problem Statement • Main Topics • Experiments • Further Studies • Tags vs. Anchor Text • More Specific Subtrees • Conclusion

  24. Tags vs. Anchor Text • Q: Do the advantages of tagging data hold up in the presence of anchor text? • A: The inclusion of tagging data still improves cluster quality. • [Table: F1 score in the presence of anchor text] • Tags are different from anchor text. • Performance is depressed for Anchors as Words in the VSM because of the VSM’s sensitivity to the weights of the now-noisier terms. • Words+Anchors did not perform well because of the difficulty of extracting quality anchor text. • Results might be improved by down-weighting anchor words or by more advanced weighting techniques.

  25. More Specific Subtrees • Does the impact of tags depend on the specificity of the clustering? • We selected two representative ODP subtrees: • Programming Languages (Top-Programming-Languages category) • Social Sciences (Top/Society/Social Sciences category) • [Tables: F-scores for the Programming Languages category; F-scores for the Social Sciences category] • Tags > Words+Tags • Clustering on tags alone outperforms alternatives that use word information. • A higher proportion of the remaining tags are direct indicators of sub-category membership.

  26. Contents • Introduction • Problem Statement • Main Topics • Experiments • Further Studies • Conclusion

  27. Conclusion • Social tagging data provides a useful source of information for web page clustering, a task core to several IR applications. • Tagging data improves performance compared to clustering on page text alone. • K-means can better exploit tagging data when tags are included as an independent information channel. • A novel algorithm, MM-LDA, performs even better.

  28. Clustering the Tagged Web Thank you~
