Topic-Link LDA: Joint Models of Topic and Author Community
Yan Liu, Alexandru Niculescu-Mizil, Wojciech Gryc
Predictive Modeling Group, Mathematical Sciences Department, IBM Research
Jun 16, 2009
Background
• In many applications, we have two types of information:
  • Texts: unstructured information describing the properties of the entities
  • Links: a graph describing relationships between the entities
• Examples: scientific publications, blogs, movie reviews (a graph linking users and movies)
Related Work
• Previous work treats these tasks independently; for example, in the blog analysis domain:
  • Topic identification: latent semantic indexing [Deerwester et al., 1990], probabilistic latent semantic indexing [Hofmann, 1999], latent Dirichlet allocation [Blei et al., 2003]
  • Community discovery: graph partitioning [Gibson et al., 1998; Chakrabarti & Faloutsos, 2006], the HITS and PageRank algorithms for hubs and authorities [Cohn & Hofmann, 2001]
  • Link prediction: graph-based algorithms such as preferential attachment [Newman, 1999]; content-based algorithms such as similarity-based methods; methods combining content and graph information [Xu et al., 2005; Yu et al., 2006]
• Recent progress in modeling text and links jointly:
  • Link-LDA [Erosheva et al., 2004]
  • Citation influence model [Dietz et al., 2007]
  • Link-PLSA-LDA [Nallapati & Cohen, 2008]
  • Relational topic model (RTM) [Chang & Blei, 2009]
Motivation
• Current solutions to both topic modeling and community discovery treat all links (or all missing links) between documents the same way
• A link between two posts (POSITIVE) that share little or no content similarity suggests an intimate friendship
• A missing link between two posts (NEGATIVE) that share strong content similarity suggests less acquaintance
[Figure: probability distribution of content similarity scores for positive and negative pairs]
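Graphical Model Representation of Topic-Link LDA
[Figure: plate diagrams of LDA (Blei et al., 2003) and Topic-Link LDA. The Topic-Link LDA diagram adds an author-community part and a link-existence node; symbols appearing in the figure include α, β, θ, z, w, μ, κ, τ, G and the plate labels N, M, K, P.]
• We are able to achieve three tasks in one unified model: topic identification, community discovery and link prediction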
Inference and Learning: Topic-Link LDA (1/3)
• The likelihood of the data in the Topic-Link LDA model is [equation; its terms are labeled Links, Community and Texts]
• We need efficient algorithms to estimate the parameters of the model
• We use a variational expectation-maximization (EM) algorithm to derive the update equations for the parameters
• We use the following auxiliary functions to approximate the logistic function [Jaakkola, 1997]
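
The auxiliary functions themselves did not survive in this transcript. As a minimal sketch, assuming the approximation meant is the well-known quadratic lower bound on the logistic function, σ(x) ≥ σ(ξ) exp((x − ξ)/2 − λ(ξ)(x² − ξ²)) with λ(ξ) = tanh(ξ/2)/(4ξ), the Python below evaluates that bound. The function names are illustrative and not taken from the slides.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def lam(xi: float) -> float:
    """lambda(xi) = tanh(xi / 2) / (4 * xi); its limit at xi = 0 is 1/8."""
    if abs(xi) < 1e-8:
        return 0.125
    return math.tanh(xi / 2.0) / (4.0 * xi)


def logistic_lower_bound(x: float, xi: float) -> float:
    """Quadratic-in-x lower bound on sigmoid(x), tight at x = +xi and x = -xi.

    xi is the variational parameter; this is an assumed form of the
    auxiliary function, not a transcription of the slide.
    """
    return sigmoid(xi) * math.exp((x - xi) / 2.0 - lam(xi) * (x * x - xi * xi))


if __name__ == "__main__":
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(x, sigmoid(x), logistic_lower_bound(x, xi=1.0))
```

Because the exponent is quadratic in x, expectations of the bound under a variational distribution are easier to handle than expectations of the logistic function itself, which is the usual reason for introducing it in a variational EM derivation.
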
Inference and Learning: Topic-Link LDA (2/3)
• E-step: estimate the expectations of the hidden variables
• Φ: variational variable of Z
• γ: variational variable of Θ
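
The E-step update equations are not preserved in this transcript. For orientation only, here is a sketch of the E-step of plain LDA (Blei et al., 2003), which Topic-Link LDA builds on: φ is updated in proportion to β times exp(ψ(γ)), and γ = α + Σₙ φₙ. The Topic-Link LDA updates presumably carry extra terms from the link and community parts of the model, which this sketch does not attempt to reproduce. All function and argument names are assumptions.

```python
import math


def digamma(x: float) -> float:
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    if x <= 0:
        raise ValueError("x must be positive")
    result = 0.0
    while x < 6.0:  # shift x upward, using psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def e_step(word_ids, alpha, beta, n_iter=50):
    """Plain-LDA variational E-step for one document.

    word_ids: list of vocabulary indices, one per token
    alpha:    list of K Dirichlet parameters
    beta:     K x V topic-word probabilities (strictly positive here)
    Returns (phi, gamma): phi is N x K, gamma has length K.
    """
    n_topics, n_words = len(alpha), len(word_ids)
    phi = [[1.0 / n_topics] * n_topics for _ in range(n_words)]
    gamma = [a + n_words / n_topics for a in alpha]
    for _ in range(n_iter):
        exp_psi = [math.exp(digamma(g)) for g in gamma]
        for n, w in enumerate(word_ids):
            row = [beta[k][w] * exp_psi[k] for k in range(n_topics)]
            total = sum(row)
            phi[n] = [v / total for v in row]
        gamma = [alpha[k] + sum(phi[n][k] for n in range(n_words))
                 for k in range(n_topics)]
    return phi, gamma


if __name__ == "__main__":
    toy_beta = [[0.9, 0.1], [0.1, 0.9]]  # hypothetical 2-topic, 2-word model
    phi, gamma = e_step([0, 0, 0, 1], [0.5, 0.5], toy_beta)
    print(gamma)
```

A fixed number of iterations is used here for simplicity; a convergence check on γ would be the more usual stopping rule.
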
Inference and Learning: Topic-Link LDA (3/3)
• M-step: compute the model parameters
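
The M-step formulas are likewise not preserved. As an illustration of what an M-step looks like in this family of models, the sketch below implements the topic-word update of plain LDA (Blei et al., 2003), where β for topic k and word w is proportional to the sum of φ over all tokens equal to w. The updates for the remaining Topic-Link LDA parameters are not shown because the transcript gives no basis for them. The `smoothing` argument is an assumption added to avoid zero probabilities.

```python
def m_step_beta(docs, phis, n_topics, vocab_size, smoothing=1e-12):
    """Plain-LDA M-step for the topic-word matrix beta.

    docs: list of documents, each a list of vocabulary indices
    phis: list of per-document phi matrices (N_d x K), e.g. from an E-step
    Returns a K x V matrix whose rows each sum to 1.
    """
    counts = [[smoothing] * vocab_size for _ in range(n_topics)]
    for word_ids, phi in zip(docs, phis):
        for n, w in enumerate(word_ids):
            for k in range(n_topics):
                counts[k][w] += phi[n][k]  # expected count of word w in topic k
    beta = []
    for row in counts:
        total = sum(row)
        beta.append([c / total for c in row])
    return beta


if __name__ == "__main__":
    # Hypothetical toy input: one document, hard topic assignments.
    docs = [[0, 0, 1]]
    phis = [[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]]
    print(m_step_beta(docs, phis, n_topics=2, vocab_size=2))
```
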
Experiment Setup
• Web 2.0: blog posts on Web 2.0 technologies
  • Top 75 blogs from Technorati and the Techmeme Leaderboard
  • Training: 3853 posts from Feb 1-14, 2008
  • Test: 2096 posts from Feb 15-22, 2008
• Politics: blog data set on US politics
  • 101 political blogs from Technorati and from most-popular blog listings in the literature
  • Training: 3467 posts from Feb 1-14, 2008
  • Test: 1897 posts from Feb 15-22, 2008
• CORA: research paper citation data from CiteSeer [McCallum et al., 2000]
  • Papers classified as “Artificial Intelligence-Machine Learning” and “Artificial Intelligence-Data Mining”
  • 423 authors with their 2695 papers, split into halves by time stamp for training and testing, respectively
Topic Modeling
• Perplexity of the test set
[Figure: perplexity vs. number of hidden topics on the Web 2.0 data set]
• Examples of identified topics on the CORA data set
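
For readers unfamiliar with the metric, test-set perplexity is conventionally defined as exp(− Σ_d log p(w_d) / Σ_d N_d), where N_d is the number of tokens in document d; lower is better. The sketch below assumes the per-document log-likelihoods have already been computed by some model. The function name and inputs are illustrative.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """Perplexity = exp(-(total log-likelihood) / (total number of tokens))."""
    total_tokens = sum(doc_lengths)
    if total_tokens == 0:
        raise ValueError("test set contains no tokens")
    return math.exp(-sum(doc_log_likelihoods) / total_tokens)


if __name__ == "__main__":
    # Hypothetical check: a model that is uniform over a 100-word vocabulary.
    lengths = [10, 25]
    log_liks = [n * math.log(1.0 / 100) for n in lengths]
    print(perplexity(log_liks, lengths))
```

For a model that assigns uniform probability over the vocabulary, the perplexity equals the vocabulary size, which makes a convenient sanity check.
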
Community Discovery
• Clustering results
[Figure: panels comparing spectral clustering and Topic-Link LDA]
• Examples of discovered communities on the CORA data set
Link Formation
• How do content similarity and community similarity contribute to the formation of a link?
• Contribution of community (content) similarity = coefficient τ × estimated mean of the community (content) similarity scores
• Community similarity has a much stronger effect on link formation in the political domain than in the technical domain and scientific papers
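
The contribution measure above is a single multiplication, which the following sketch spells out. The coefficient and the similarity scores used in the example are made-up numbers for illustration, not results from the experiments.

```python
def similarity_contribution(tau, similarity_scores):
    """Contribution = coefficient tau * mean of the similarity scores."""
    if not similarity_scores:
        raise ValueError("need at least one similarity score")
    mean_score = sum(similarity_scores) / len(similarity_scores)
    return tau * mean_score


if __name__ == "__main__":
    # Hypothetical values only.
    community = similarity_contribution(2.0, [0.1, 0.3])
    content = similarity_contribution(0.5, [0.4, 0.6])
    print(community, content)
```

Comparing the two products, rather than the raw coefficients, accounts for the fact that community and content similarity scores may live on different scales.
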
Link Prediction
• Baselines
  • Graph-based algorithms: preferential attachment [Newman, 1999]
  • Content-based algorithms: content similarity between input pairs of posts as features, with logistic regression as the classifier
• Evaluation measures
  • Precision, recall, F1
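
To make the graph-based baseline and the evaluation measures concrete, here is a minimal sketch. It assumes the common formulation of preferential attachment in which a candidate pair is scored by the product of the two node degrees, and it treats links as undirected; both are assumptions, since the slides do not spell out these details. Function names are illustrative.

```python
from collections import Counter


def preferential_attachment_scores(edges):
    """Score each unlinked node pair (u, v) by degree(u) * degree(v)."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    existing = {frozenset(edge) for edge in edges}
    nodes = sorted(degree)
    scores = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if frozenset((u, v)) not in existing:
                scores[(u, v)] = degree[u] * degree[v]
    return scores


def precision_recall_f1(predicted, actual):
    """Precision, recall and F1 of a predicted link set against the true one."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical toy graph.
    train_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
    print(preferential_attachment_scores(train_edges))
    print(precision_recall_f1({("a", "b"), ("a", "c"), ("b", "d")},
                              {("a", "b"), ("c", "d")}))
```

In practice the scores would be thresholded, or the top-ranked pairs selected, to produce the predicted link set that is then evaluated.
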
Conclusion
• In this paper, we develop the Topic-Link LDA model to jointly model topics and author community
  • Motivation: the formation of a link between two documents is modeled as a combination of topic similarity and community closeness
  • Advantage: achieving three tasks at the same time, i.e. topic modeling, community discovery and link prediction
• Future work
  • Extending the model to analyze time-series linked documents
  • Modeling the dynamics between citation and influence as topics evolve over time
Acknowledgement
• John Lafferty, Eric Xing (CMU)
• Jure Leskovec (Cornell Univ.)
• Ramesh Nallapati (Stanford Univ.)
• Rick Lawrence (IBM Research)