Clustering the Tagged Web D. Ramage, P. Heymann, C. Manning, & H. Garcia-Molina from Stanford InfoLab ACM Conference on Web Search and Data Mining (WSDM 2009) IDS Lab. Seminar Spring 2009 Mar. 20th, 2009 Minsuk Kang (강민석) minsuk@europa.snu.ac.kr
Contents • Introduction • Problem Statement • Main Topics • Experiments • Further Studies • Conclusion
Introduction: Clustering the Web • One of the most promising approaches to handling the inherent ambiguity of user queries is automatic clustering of web pages. • Cluster hypothesis: “the associations between documents convey information about the relevance of documents to requests” (Figure: clustering a set of documents)
Introduction: Social Bookmarking & Tag • Tags promise a uniquely well suited source of information on the similarity between web documents. • This paper is the first to systematically evaluate how best to use tags for clustering web documents.
Problem Statement • How can tagging data best be used to improve web document clustering?
Contents • Introduction • Problem Statement • Main Topics • Clustering Algorithms • Combine Words & Tags • Evaluation Metric • Experiments • Further Studies • Conclusion
Main Topics • Title: “Clustering the Tagged Web” • Goal: document clustering for web search • Topics • Clustering Algorithm: K-means, MM-LDA • Modeling a Document: how to combine words & tags in the VSM • Evaluation Metric: gold standard from the ODP, F-score
Clustering Algorithm • partitions a set of web documents into groups of similar documents • similar to the standard clustering task, except each document has tags as well as words • we look at two algorithms • K-means, based on the Vector Space Model (VSM) • an LDA-derived algorithm, based on a probabilistic model • Setting: K clusters; documents 1,2,…,D, each a bag of words plus a bag of tags
K-means clustering • simple and highly scalable clustering algorithm • based on the Vector Space Model • clusters documents into one of K groups by iteratively re-assigning each document to its nearest cluster centroid • All documents are vectors whose dimensionality is the size of the vocabulary. • Then, how do we model the documents? Images from http://www.cs.cmu.edu/~dpelleg/kmeans.html
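The iterative re-assignment loop above can be sketched in a few lines of plain numpy. This is a minimal illustration on hypothetical toy term-frequency vectors with Euclidean distance, not the paper's implementation:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain K-means over document vectors X (one row per document)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k random documents.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each document to its nearest centroid (Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned documents.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

# Two obvious groups of toy term-frequency vectors.
X = np.array([[5.0, 0.0], [4.0, 1.0], [0.0, 5.0], [1.0, 4.0]])
labels = kmeans(X, k=2)
```

The same loop scales to vocabulary-sized vectors; the open question the slide raises is only what goes into those vectors.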
MM-LDA (Multi-Multinomial Latent Dirichlet Allocation) • A variation of LDA, a generative probabilistic topic model • LDA models each document as a mixture of hidden topic variables; each topic is associated with a distribution over words. • LDA adds fully generative probabilistic semantics to pLSI, which is itself a probabilistic version of LSI (Latent Semantic Indexing): pLSI → LDA → MM-LDA • We extend LDA to jointly account for words and tags as distinct sets of observations.
MM-LDA (Multi-Multinomial Latent Dirichlet Allocation) • Process generating a collection of tagged documents • 1. For each topic k, draw a multinomial distribution beta_k of size |W| from a Dirichlet distribution with parameter eta_w • 2. For each topic k, draw a multinomial distribution gamma_k of size |T| from a Dirichlet distribution with parameter eta_t • 3. For each document i, draw a multinomial distribution theta_i of size |K| from a Dirichlet distribution with parameter alpha • 4. For each word j in document i, draw a topic z_j from theta_i, then draw a word w_j from beta_{z_j} • 5. For each tag j in document i, draw a topic z_j from theta_i, then draw a tag t_j from gamma_{z_j} • Steps 1, 3, 4 are equivalent to standard LDA; step 2 constructs distributions of tags per topic, and step 5 samples a topic for each tag. • MM-LDA parameters are learned using Gibbs sampling. (Figure: graphical representation of MM-LDA)
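The five generative steps above can be simulated directly. A sketch with assumed toy sizes (K=3 topics, |W|=8 words, |T|=6 tags, D=2 documents) and assumed hyperparameter values, showing only the forward sampling process, not Gibbs inference:

```python
import numpy as np

rng = np.random.default_rng(0)
K, W, T, D = 3, 8, 6, 2  # topics, word vocab size, tag vocab size, documents

# Steps 1-2: per-topic word and tag multinomials (Dirichlet priors eta_w, eta_t).
beta  = rng.dirichlet(np.full(W, 0.1), size=K)   # beta_k: distribution over words
gamma = rng.dirichlet(np.full(T, 0.1), size=K)   # gamma_k: distribution over tags

docs = []
for i in range(D):
    # Step 3: per-document topic mixture theta_i (Dirichlet prior alpha).
    theta = rng.dirichlet(np.full(K, 1.0))
    # Step 4: for each word slot, sample a topic z_j, then a word from beta_{z_j}.
    words = [rng.choice(W, p=beta[rng.choice(K, p=theta)]) for _ in range(20)]
    # Step 5: for each tag slot, sample a topic z_j, then a tag from gamma_{z_j}.
    tags = [rng.choice(T, p=gamma[rng.choice(K, p=theta)]) for _ in range(5)]
    docs.append((words, tags))
```

Words and tags share the same per-document topic mixture theta_i but are drawn from separate per-topic distributions, which is exactly what lets MM-LDA treat them as distinct observation channels.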
Combining Words & Tags • Key question: “How do we model the documents in the VSM?” • Five ways to model a document with a bag of words and a bag of tags as a vector V: Words Only, Tags Only, Words + Tags, Tags as Words × n, Tags as New Words
Combining Words & Tags • Example • word vocabulary has 8 words, tag vocabulary has 6 tags • variants illustrated: Words Only, Tags Only, Words + Tags, Tags as Words × 2, Tags as New Words
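The five representations can be made concrete on a hypothetical toy document (the vocabularies and counts below are invented for illustration, not from the paper's example):

```python
# Toy document: raw counts for its words and its tags.
words = {"web": 2, "search": 1, "cluster": 1}
tags  = {"web": 1, "ir": 1}
word_vocab = ["web", "search", "cluster", "ir"]
tag_vocab  = ["web", "ir"]

def vec(counts, vocab):
    """Count vector over a fixed vocabulary."""
    return [counts.get(t, 0) for t in vocab]

# Words Only / Tags Only: use one channel, ignore the other.
words_only = vec(words, word_vocab)
tags_only  = vec(tags, tag_vocab)

# Words + Tags: tag counts get their own dimensions, appended to the word vector,
# so the two channels stay distinguishable.
words_plus_tags = vec(words, word_vocab) + vec(tags, tag_vocab)

# Tags as Words x n: each tag is treated as n extra occurrences of the same word.
n = 2
tags_as_words = [w + n * tags.get(t, 0)
                 for w, t in zip(vec(words, word_vocab), word_vocab)]

# Tags as New Words: each tag becomes a fresh vocabulary entry, so the tag "web"
# is a different term than the word "web".
new_vocab = word_vocab + ["tag:" + t for t in tag_vocab]
tags_as_new_words = vec(words, word_vocab) + vec(tags, tag_vocab)
```

The key design difference: Tags as Words conflates the two channels in shared dimensions, while Words + Tags and Tags as New Words keep them in separate dimensions.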
Evaluation of Cluster Quality • It is difficult to evaluate clustering algorithms. • Several studies compared their output with a hierarchical web directory. • We derive gold standard clusters from the ODP.
Evaluation of Cluster Quality • compare the generated clusters with the clustering derived from ODPby using the F1 measure • F1 cluster evaluation measure is the harmonic mean of precision and recall
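One common variant of this cluster F-measure matches each gold ODP class to its best-scoring generated cluster and weights by class size; the sketch below follows that convention, which may differ in detail from the paper's exact definition:

```python
def cluster_f1(clusters, gold):
    """clusters, gold: lists of sets of document ids.
    Each gold class takes the F1 of its best-matching cluster,
    weighted by the class's share of all documents."""
    total = sum(len(g) for g in gold)
    score = 0.0
    for g in gold:
        best = 0.0
        for c in clusters:
            overlap = len(g & c)
            if overlap == 0:
                continue
            p = overlap / len(c)                 # precision
            r = overlap / len(g)                 # recall
            best = max(best, 2 * p * r / (p + r))  # harmonic mean
        score += (len(g) / total) * best
    return score

gold = [{1, 2}, {3, 4}]
perfect = cluster_f1([{1, 2}, {3, 4}], gold)   # matching clustering scores 1.0
merged  = cluster_f1([{1, 2, 3, 4}], gold)     # one big cluster: full recall, half precision
```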
Contents • Introduction • Problem Statement • Main Topics • Experiments • Term Weighting in VSM • How to Combine Words and Tags • Compare MM-LDA and K-means • Further Studies • Conclusion
Term Weighting in the VSM • A document vector V is defined by a term-weighting function. • Then, how should the weights be assigned? • We consider two common functions: tf & tf-idf (Figure: tf vs. tf-idf weighting on K-means, F1-score for 2,000 documents) • Conclusion • Words+Tags outperforms words alone under both weightings. • tf on Words+Tags outperforms tf-idf on Words+Tags. • tf-idf performs poorly because it over-emphasizes the rarest terms
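The two weighting functions compared above are easy to state in code. A sketch on an invented three-document corpus, using the standard idf = log(N / df) form (one of several common tf-idf variants):

```python
import math
from collections import Counter

docs = [["web", "search", "web"], ["web", "mining"], ["tag", "cluster"]]
vocab = sorted({w for d in docs for w in d})
N = len(docs)
# df: number of documents containing each term.
df = {t: sum(t in d for d in docs) for t in vocab}

def tf_vector(doc):
    """Raw term-frequency weights."""
    c = Counter(doc)
    return [c[t] for t in vocab]

def tfidf_vector(doc):
    """tf * idf weights; rare terms (low df) get large idf,
    which is why tf-idf can over-emphasize the rarest terms."""
    c = Counter(doc)
    return [c[t] * math.log(N / df[t]) for t in vocab]

tf0 = tf_vector(docs[0])
tfidf0 = tfidf_vector(docs[0])
```

Note how "web" (appearing in 2 of 3 documents) is down-weighted by idf relative to the rarer "search", illustrating the rare-term emphasis the slide's conclusion points to.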
How to Combine Words and Tags with VSM • Which of the five ways to model a document works best in the VSM? • Ten runs of tf weighting on 13,230 documents (Figure: F1-score for K-means (tf) with several means of combining words and tags) • Conclusion • The Words+Tags model outperforms every other model. • Tags are a qualitatively different type of content than “just more words”. • K-means can incorporate tagging data as an independent information channel.
How to Combine Words and Tags with MM-LDA • Then… how about MM-LDA? (Figure: F1-score for MM-LDA with several means of combining words and tags) • Conclusion • Again, the Words+Tags model outperforms all other configurations. • Interestingly, performance decreases when tags are simply added to the words, due in part to the very different distributional statistics observed for words vs. tags.
Compare MM-LDA and K-means • Which model is better? F-scores for K-means and MM-LDA on 13,320 documents • Conclusion • The inclusion of tagging data improves the performance. • MM-LDA’s Words+Tags model is significantly better than all other models.
Experiments • Highest scoring tags & words from clusters generated by K-means & MM-LDA
Contents • Introduction • Problem Statement • Main Topics • Experiments • Further Studies • Tags vs. Anchor Text • More Specific Subtrees • Conclusion
Tags vs. Anchor text • Q: Do the advantages of tagging data hold up in the presence of anchor text? • A: Yes — the inclusion of tagging data still improves cluster quality. (Figure: F1 score in the presence of anchor text) • Tags are different from anchor text • Performance is depressed for Anchors as Words in the VSM because of the VSM’s sensitivity to the weights of the now-noisier terms. • Words+Anchors did not perform well because of the difficulty of extracting quality anchor text. • This might be improved by down-weighting anchor words or by more advanced weighting techniques.
More Specific Subtrees • Does the impact of tags depend on the specificity of the clustering? • We selected two representative ODP subtrees. • Programming Languages (Top-Programming-Languages category) • Social Sciences (Top/Society/Social Sciences category) (Figures: F-scores for the Programming Languages and Social Sciences categories) • Tags > Words+Tags • Clustering on tags alone outperforms alternatives that use word information. • A higher proportion of the remaining tags are direct indicators of sub-category membership.
Contents • Introduction • Problem Statement • Main Topics • Experiments • Further Studies • Conclusion
Conclusion • Social tagging data provides a useful source of information for web page clustering, a task core to several IR applications. • Tagging data improves clustering performance compared to clustering on page text alone. • Treating tags as an independent information channel enables K-means to better exploit tagging data. • A novel algorithm, MM-LDA, performs even better.
Clustering the Tagged Web Thank you~