This paper introduces a comprehensive hierarchical generative model that simulates the social annotation process in social tagging systems. It represents the relationships between users, documents, tags, and their semantic structures, providing valuable insights into topical structures and user communities of interests.
KDD’10, July 26, 2010, Washington, D.C. The Topic-Perspective Model for Social Tagging Systems Caimei Lu, Xiaohua (Tony) Hu, Xin Chen, Jung-ran Park caimei.lu@drexel.edu {xiaohua.hu, xin.chen, jung-ran.park}@ischool.drexel.edu College of Information Science and Technology Drexel University Philadelphia, PA 19104
Introduction • Social annotations, as user-generated data, have been exploited in recent literature for various application purposes: • Tag recommendation/prediction. (Jaschke et al. 2007, Heymann et al. 2008, Guan et al. 2009) • Document clustering and classification. (Ramage et al. 2009, Yin et al. 2009, Lu et al. 2009) • Information retrieval. (Zhou et al. 2008) • Not only the social annotations themselves, but also the social network formed through users’ social tagging behavior, provide rich and valuable information for learning the topical structures of web resources, the semantic structures of tags, and user communities of interest.
Introduction • Two ways to model a social tagging network: • Flat tripartite network • Hierarchical Bayesian models based on Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). • In this paper, we propose a comprehensive hierarchical generative model which simulates the real social annotation process and represents all related entities (users, documents, document content units, and tags) and their relations in a unified framework.
Outline • Observations • The Topic-Perspective Model • Two Models for Comparison • Experiments • Evaluation Results
Observations O1: Social tags are generated differently from resource content • Content terms contained in a document are generated by a single author or a small group of authors sharing common interests, whereas the social tags of a resource are generated by many users from different perspectives. • The perspective of a user is decided not only by the user’s interest, but also by the user’s expertise, purpose, preference and other personal factors.
Observations O2: The creation of a tag depends on both the topics of the annotated resource and the user’s perspective. O3: The impact of resource topic and user perspective on the generation of each tag is different. Tags are created by users for different functional purposes, such as topical description, opinion expression, self reference, etc.
[Figure: example bookmark for http://www.brainyquote.com/, with its tags labeled as topical description or subjective opinion]
Observations • Different classification schemas of social tags
The Topic-Perspective Model • Distinctive Features • The tag generation process is modeled separately from the content term generation process. O1: Social tags are generated differently from resource content • During the tag generation process, resource topics and user perspectives together generate the social tags for a resource. O2: The creation of a tag depends on both the topics of the annotated resource and the user’s perspective. • Tags differ in how strongly they depend on resource topics or on user perspectives. (A switch variable is adopted to control the influence of user perspectives and document topics on tag generation.) O3: The impact of resource topic and user perspective on the generation of each tag is different.
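The Topic-Perspective Model [Figure: plate diagram of the model, split into a tag generation part and a word generation part. Hyperparameters: 𝛼u, 𝛼d, 𝛾, 𝜂, 𝛽t, 𝛽w; distributions: 𝜃(u), 𝜃(d), 𝜆, 𝜓, 𝜙(t), 𝜙(w); variables: x, p/zt, z, t, w; plates: U, T, Md, Nd, K, L]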
The Topic-Perspective Model
• For each of the D documents d, sample 𝜃(d)_d ~ Dirichlet(𝛼d);
• For each of the U users u, sample 𝜃(u)_u ~ Dirichlet(𝛼u);
• For each of the K topics k, sample 𝜙(w)_k ~ Dirichlet(𝛽w), and sample 𝜙(t)_k ~ Dirichlet(𝛽t);
• For each of the L user perspectives l, sample 𝜓_l ~ Dirichlet(𝜂);
• For each of the N_d word tokens w_i in document d:
  • sample a topic z_i ~ Multinomial(𝜃(d)_d);
  • sample a word w_i ~ Multinomial(𝜙(w)_{z_i});
• For each of the T tags t in the collection D, sample 𝜆_t ~ Beta(𝛾);
• For each of the M_d tag tokens t_j in document d created by user u:
  • sample a flag X ~ Binomial(𝜆_{t_j});
  • if (X = 1):
    • sample a topic z_tj ~ Uniform(z_w1, …, z_wn);
    • sample a tag t_j ~ Multinomial(𝜙(t)_{z_tj});
  • if (X = 0):
    • sample a perspective p_j ~ Multinomial(𝜃(u)_u);
    • sample a tag t_j ~ Multinomial(𝜓_{p_j})
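The steps above can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: the sizes, the random seed, the toy hyperparameter value 0.5, and the helper name `tag_score` are all hypothetical. The word side is simulated forward exactly as listed. Because the switch 𝜆 is indexed by the tag itself, the tag side is shown as a score for a given tag that mixes the topic route (X = 1) and the perspective route (X = 0). These scores need not sum to one over all tags.

```python
import numpy as np

rng = np.random.default_rng(0)
K, L, W, T = 3, 2, 6, 5   # toy numbers of topics, perspectives, words, tags (assumed)
PRIOR = 0.5               # toy symmetric hyperparameter (assumed, not the slide's values)

theta_d = rng.dirichlet(np.full(K, PRIOR))        # document-topic distribution
theta_u = rng.dirichlet(np.full(L, PRIOR))        # user-perspective distribution
phi_w = rng.dirichlet(np.full(W, PRIOR), size=K)  # topic-word distributions, shape (K, W)
phi_t = rng.dirichlet(np.full(T, PRIOR), size=K)  # topic-tag distributions, shape (K, T)
psi = rng.dirichlet(np.full(T, PRIOR), size=L)    # perspective-tag distributions, shape (L, T)
lam = rng.beta(PRIOR, PRIOR, size=T)              # per-tag switch probabilities

# Word generation: one topic per token, then one word from that topic.
n_words = 20
z = rng.choice(K, size=n_words, p=theta_d)
words = np.array([rng.choice(W, p=phi_w[k]) for k in z])

def tag_score(t):
    """Mix the two generation routes for tag index t (hypothetical helper)."""
    # X = 1: pick a topic uniformly from the word-topic assignments, then emit t.
    topic_route = phi_t[z, t].mean()
    # X = 0: pick a perspective from the user's distribution, then emit t.
    perspective_route = theta_u @ psi[:, t]
    return lam[t] * topic_route + (1.0 - lam[t]) * perspective_route

scores = np.array([tag_score(t) for t in range(T)])
```

A tag with 𝜆 close to 1 is scored almost entirely by the document's topics, while a tag with 𝜆 close to 0 is scored almost entirely by the user's perspectives.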
The Topic-Perspective Model • Parameter Estimation The Topic-Perspective model has six parameters to estimate: (1) the document-topic distribution 𝜃(d), (2) the topic-word distribution 𝜙(w), (3) the topic-tag distribution 𝜙(t), (4) the user-perspective distribution 𝜃(u), (5) the perspective-tag distribution 𝜓, and (6) the binomial distribution 𝜆.
The Topic-Perspective Model • Parameter Estimation (Gibbs Sampling) • Sampling equation of word topic variables for each content word wi: • Sampling equation of tag topic variables when switch variable X=1: • Sampling equation of the tag perspective variables when X=0:
The Topic-Perspective Model • Parameter Estimation
Two Models for Comparison • Conditionally-independent LDA (CI-LDA) (Ramage, Heymann et al. 2009) [Figure: plate diagram of CI-LDA. Nodes: 𝛼, 𝜃, zw, zt, w, t, 𝜙(w), 𝜙(t), 𝛽w, 𝛽t; plates: D, Nw, Nt, K, K]
Two Models for Comparison • CI-LDA • It has been used for modeling the generation of words and entities in news articles (D. Newman et al., 2006) and the generation of words and document links (Elena Erosheva et al., 2004). • Tags are generated from the same source as words: the topics of the document. (This is not appropriate, because the process that generates content is different from the annotation process, especially for non-textual resources like images and videos.) • Users’ impact on the generation of tags is not considered in this model.
Two Models for Comparison • Correspondence LDA (CorrLDA) model for content words and social annotations (M. Bundschus et al., 2009) [Figure: plate diagram of CorrLDA. Nodes: 𝛼, 𝜃, zw, zt, w, t, 𝜙(w), 𝜙(t), 𝛽w, 𝛽t; plates: D, Nw, Nt, K, K]
Two Models for Comparison • CorrLDA • First proposed to model image/caption data by Blei and Jordan (2003). • It is also used to model the generation of words and entities in news articles (Newman, Chemudugunta et al. 2006). • It can model the generation of two different but related information features in a document. • Compared to the CI-LDA model, the CorrLDA model can force a greater degree of correspondence between two information sources. • The CorrLDA model first generates word topics for a document. Then the topics associated with the words in the document are used to generate tags. • As in CI-LDA, user information is left out of the tag generation process of CorrLDA.
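The difference between the two comparison models lies in where a tag's topic comes from. The following is a minimal sketch of that one step, assuming toy inputs; the function names are hypothetical and the code is not taken from either cited paper.

```python
import numpy as np

def tag_topics_ci_lda(theta_d, n_tags, rng):
    """CI-LDA: draw each tag topic from the document-topic distribution,
    independently of the topics assigned to the document's words."""
    return rng.choice(len(theta_d), size=n_tags, p=theta_d)

def tag_topics_corr_lda(word_topics, n_tags, rng):
    """CorrLDA: draw each tag topic uniformly from the topics that were
    already assigned to the document's word tokens."""
    return rng.choice(word_topics, size=n_tags)

rng = np.random.default_rng(0)
theta_d = np.array([0.7, 0.2, 0.1])               # toy document-topic distribution
word_topics = rng.choice(3, size=10, p=theta_d)   # toy word-topic assignments

ci_topics = tag_topics_ci_lda(theta_d, 5, rng)
corr_topics = tag_topics_corr_lda(word_topics, 5, rng)
```

Under CorrLDA a tag can only take a topic that some word in the document already uses, which is what forces the closer correspondence between the two information sources. Neither function takes a user argument, reflecting that both models ignore the tagging user.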
Experiments • Datasets • Data source: 3,246,424 posts for 1,731,780 URLs created by 4,784 users crawled from Delicious during Jan. and Feb. 2009. • 41,190 web documents, 4,414 users, 28,740 unique tags, and 129,908 unique words. • We randomly selected 10% of the documents and their associated users and tags as held-out test data and trained the model on the remaining 90%.
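A document-level 90/10 split of this kind could look as follows. This is only a sketch: the function name, the fixed seed, and the use of integer document ids are assumptions, and the slides do not say how the random selection was actually done.

```python
import random

def split_documents(doc_ids, test_fraction=0.1, seed=0):
    """Shuffle document ids and hold out a fraction of them for testing."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)          # seeded for reproducibility (assumed)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]         # (train ids, test ids)

# 41,190 documents, as reported in the dataset description.
train_docs, test_docs = split_documents(range(41190))
```

Splitting by document rather than by post keeps all tags of a held-out document out of training, which matches the goal of predicting tags for new, unseen documents.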
Experiments • Evaluation Criterion--Perplexity • Perplexity is a standard measure for evaluating the generalization performance of a probabilistic model. • The value of perplexity reflects the ability of a model to generalize to unseen data. • Specifically, in our case, perplexity reflects the ability of a model to predict tags for new unseen documents. • A lower perplexity score indicates better generalization performance.
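For reference, the usual definition of perplexity is the exponential of the negative average log-probability per held-out token. The sketch below uses that standard definition and assumes the model's predictive probability for each held-out tag token has already been computed; the exact formula used in the paper is not shown on the slides.

```python
import math

def perplexity(token_probs):
    """Standard perplexity: exp(-(1/N) * sum(log p_i)) over N held-out tokens."""
    probs = list(token_probs)
    if not probs:
        raise ValueError("need at least one held-out token")
    log_likelihood = sum(math.log(p) for p in probs)
    return math.exp(-log_likelihood / len(probs))

# A model that spreads its mass uniformly over 4 tags has perplexity of about 4.
uniform_score = perplexity([0.25] * 8)
```

Read this way, a perplexity of N means the model is about as uncertain as a uniform guess among N tags, which is why lower values indicate better prediction.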
Experiments • Parameter Selection • 𝛼d=0.3, 𝛼u =0.3, 𝛽w=0.05, 𝛽t =0.05, 𝜂=0.05, 𝛾=0.5 • The perplexities over the iterations for five settings of topic number K when perspective number L=80
Experiments • Parameter Selection • The perplexities over the iterations for five settings of perspective number L when topic number K=80
Evaluation Results • Tag Perplexity (Iteration = 80, Perspective Num=80)
Evaluation Results • Discovered topics
Evaluation Results • Discovered Perspectives
Evaluation Results • The generation sources of tags • A greater value of 𝜆 indicates a higher probability that the tag is generated from document topics; a smaller value indicates that it is more likely generated from user perspectives.
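One plausible way to obtain such a 𝜆 value for a tag is the posterior mean of a Beta-Bernoulli model, computed from how often the tag's tokens were assigned to each route during sampling. The sketch below assumes a symmetric Beta(𝛾, 𝛾) prior and hypothetical count inputs; the paper's own estimator is not reproduced on the slides and may differ.

```python
def switch_posterior_mean(n_topic, n_perspective, gamma=0.5):
    """Posterior mean of a tag's switch probability under a symmetric
    Beta(gamma, gamma) prior (an assumption), given the number of tokens
    assigned to the topic route (X = 1) and the perspective route (X = 0)."""
    return (n_topic + gamma) / (n_topic + n_perspective + 2.0 * gamma)

# Hypothetical counts: 9 of a tag's 10 tokens were assigned to document topics.
mostly_topical = switch_posterior_mean(9, 1)
# A tag with no observed tokens falls back to the prior mean of 0.5.
unseen_tag = switch_posterior_mean(0, 0)
```

With these hypothetical counts the first tag gets a value above 0.5, so it would be read as a mainly topical tag.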
Modeling the generation of social annotations --Topic-Perspective Model • Conclusions By modeling the tag generation and word generation process separately and incorporating the user information into the tag generation process, the Topic-Perspective model is able to model the social annotation system in a more meaningful way and achieve better generalization performance than other models.
Modeling the generation of social annotations --Topic-Perspective Model • Application (Future work) • Tag recommendation • General tag recommendation • Personalized tag recommendation