Unified Model for Topic Identification, Community Discovery, and Link Prediction in Textual Data

Topic-Link LDA: Joint Models of Topic and Author Community • Yan Liu, Alexandru Niculescu-Mizil, Wojciech Gryc • Predictive Modeling Group, Mathematical Sciences Department, IBM Research • Jun 16, 2009

Background • In many applications, we have two types of information • Texts: unstructured information describing the properties of the entities • Links: a graph describing some relationships between the entities • [Figure: example domains: blogs, movie reviews (users and movies), scientific publications]

Related Work • Previous work treats these tasks independently; for example, in the blog analysis domain: • Topic identification • Latent semantic indexing [Deerwester et al., 1990], probabilistic latent semantic indexing [Hofmann, 1999], latent Dirichlet allocation [Blei et al., 2003] • Community discovery • Graph partitioning for community discovery [Gibson et al., 1998; Chakrabarti & Faloutsos, 2006], HITS and PageRank algorithms for hubs and authorities [Cohn & Hofmann, 2001] • Link prediction • Graph-based algorithms: preferential attachment [Newman, 1999]; content-based algorithms: similarity-based algorithms; combining both content and graph information [Xu et al., 2005; Yu et al., 2006] • Recent progress in modeling text and links jointly • Link-LDA [Erosheva et al., 2004] • Citation influence model [Dietz et al., 2007] • Link-PLSA-LDA [Nallapati & Cohen, 2008] • Relational topic model (RTM) [Chang & Blei, 2009]

Motivation • Current solutions to both topic modeling and community discovery treat all links (or missing links) between documents as the same • Links between two posts (POSITIVE) that share little or no content similarity => intimate friendship • Missing links between two posts (NEGATIVE) that share strong content similarity => less acquaintance • [Figure: probability distribution of content similarity scores for positive and negative pairs]

Graphical Model Representation of Topic-Link LDA • [Figure: plate diagrams of LDA (Blei et al., 2003) and Topic-Link LDA; node labels include α, κ, τ, μ, θ, z, w, β, and G, with the Topic-Link LDA diagram adding author community and link existence components] • We are able to achieve three tasks in one unified model: topic identification, community discovery, and link prediction
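The slide's diagram is not preserved, so the following is only a minimal sketch of how a link-existence variable might combine the two signals. It assumes, without confirmation from the slides, that each similarity is a dot product (of topic mixtures θ for content, of author community memberships μ for community) and that the link probability is a logistic function of their weighted sum. The names `tau_community`, `tau_topic`, and `bias` are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def link_probability(theta_i, theta_j, mu_i, mu_j,
                     tau_community, tau_topic, bias=0.0):
    """Probability of a link between documents i and j.

    Assumed form (illustrative only):
        sigmoid(tau_community * <mu_i, mu_j> + tau_topic * <theta_i, theta_j> + bias)
    """
    score = (tau_community * dot(mu_i, mu_j)
             + tau_topic * dot(theta_i, theta_j)
             + bias)
    return sigmoid(score)


# Two posts with the same topic mixture; authors in the same vs. different communities.
theta = [0.5, 0.5]
mu_a, mu_b = [1.0, 0.0], [0.0, 1.0]
p_same = link_probability(theta, theta, mu_a, mu_a, tau_community=2.0, tau_topic=1.0)
p_diff = link_probability(theta, theta, mu_a, mu_b, tau_community=2.0, tau_topic=1.0)
```

With a positive community coefficient, the same-community pair receives the higher link probability even though the content similarity is identical, which is the distinction the Motivation slide draws.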

Inference and Learning: Topic-Link LDA (1/3) • The likelihood of the data in the Topic-Link LDA model is [equation not preserved in the transcript; its terms are labeled Links, Community, and Texts] • We need efficient algorithms to estimate the parameters of the model • Variational expectation-maximization (EM) algorithm to derive the update equations for the parameters • We use auxiliary functions to approximate the logistic function [Jaakkola, 1997] (the functions themselves are not preserved in the transcript)
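Since the auxiliary function is missing from the transcript, the sketch below shows one widely used form of the Jaakkola-Jordan variational lower bound on the logistic function, offered as an assumption about what the slide showed rather than a reproduction of it: σ(x) ≥ σ(ξ) exp((x − ξ)/2 − λ(ξ)(x² − ξ²)), with λ(ξ) = tanh(ξ/2) / (4ξ). The function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def jj_lambda(xi):
    """lambda(xi) = tanh(xi / 2) / (4 * xi); its limit as xi -> 0 is 1/8."""
    if abs(xi) < 1e-8:
        return 0.125
    return math.tanh(xi / 2.0) / (4.0 * xi)


def jj_lower_bound(x, xi):
    """Lower bound on sigmoid(x) with variational parameter xi.

    The exponent is quadratic in x, which is what makes the bound
    convenient inside a variational EM derivation.
    """
    return sigmoid(xi) * math.exp((x - xi) / 2.0 - jj_lambda(xi) * (x * x - xi * xi))
```

The bound touches the logistic function at x = ±ξ, so an E-step that re-optimizes ξ for each document pair keeps the approximation tight where it matters.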

Inference and Learning: Topic-Link LDA (2/3) • E-step: estimate the expectation of hidden variables • Φ: variational variable of Z • γ: variational variable of Θ

Inference and Learning: Topic-Link LDA (3/3) • M-step: compute the model parameters

Experiment Setup • Web 2.0: blog posts on Web 2.0 technologies • Top 75 blogs from Technorati and Techmeme Leaderboard • Training: 3853 posts within Feb 1-14, 2008 • Test: 2096 posts within Feb 15-22, 2008 • Politics: blog data sets on US politics • 101 political blogs from Technorati and most popular blog listings in literature • Training: 3467 posts within Feb 1-14, 2008 • Test: 1897 posts within Feb 15-22, 2008 • CORA: research paper citation data from CiteSeer [McCallum et al., 2000] • Papers classified as “Artificial Intelligence-Machine Learning” and “Artificial Intelligence-Data Mining” • 423 authors with their 2695 papers, split into halves by time stamp as training and testing sets, respectively

Topic Modeling • Perplexity of test set • [Figure: perplexity vs. number of hidden topics, Web 2.0 dataset] • Examples of identified topics on CORA dataset
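Perplexity is conventionally computed as the exponential of the negative average per-word log-likelihood on held-out documents; lower is better. A minimal sketch, assuming the per-document log-likelihoods have already been estimated by the model (the names `doc_log_likelihoods` and `doc_lengths` are hypothetical, and the slides do not state the exact estimator used):

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-(sum of log p(w_d)) / (total number of words)).

    doc_log_likelihoods: natural-log likelihood of each held-out document.
    doc_lengths: number of word tokens in each held-out document.
    """
    total_words = sum(doc_lengths)
    if total_words == 0:
        raise ValueError("held-out set contains no words")
    return math.exp(-sum(doc_log_likelihoods) / total_words)
```

As a sanity check on the definition, a model that assigns every word a probability of 1/V has perplexity V, regardless of document lengths.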

Community Discovery • Clustering results • [Figure panels: spectral clustering, Topic-Link LDA] • Examples of discovered communities on CORA dataset

Link Formation • How do content similarity and community similarity contribute to the formation of a link? • Contribution of community (content) similarity = coefficient τ × estimated mean of community (content) similarity scores • Community similarity has a much stronger effect on link formation in the political domain than in the technical domain and scientific papers
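The contribution measure above is a product of a learned coefficient and a mean similarity score. A minimal sketch of that arithmetic, where the function names are hypothetical and the inputs are placeholders for values a fitted model would supply:

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def contribution(coefficient, similarity_scores):
    """Contribution = coefficient tau * estimated mean of the similarity scores."""
    return coefficient * mean(similarity_scores)


def community_to_content_ratio(tau_community, community_scores,
                               tau_content, content_scores):
    """Relative strength of community vs. content similarity for one dataset."""
    return (contribution(tau_community, community_scores)
            / contribution(tau_content, content_scores))
```

Comparing the ratio across datasets is one way to express the slide's claim that community similarity matters more in the political domain; the slides do not say whether the authors computed such a ratio.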

Link Prediction • Baselines • Graph-based algorithms: preferential attachment [Newman, 1999] • Content-based algorithms: content similarity between input pairs of posts as features and logistic regression as the classifier • Evaluation measures • Precision, recall, F1
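A minimal sketch of the graph-based baseline and the evaluation measures, assuming the usual formulation of preferential attachment (the score of a candidate pair is the product of the two nodes' degrees) and binary labels where 1 means a link exists. The slides do not give implementation details, so the function names and the undirected-edge representation are assumptions.

```python
from collections import defaultdict


def degrees(edges):
    """Count the degree of each node in an undirected edge list."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def preferential_attachment_score(deg, u, v):
    """Score a candidate link (u, v) as degree(u) * degree(v)."""
    return deg.get(u, 0) * deg.get(v, 0)


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = link, 0 = no link)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Turning the preferential attachment scores into link predictions requires a threshold or a top-k cutoff, which the slides do not specify.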

Conclusion • In this paper, we develop the Topic-Link LDA model to jointly model topics and author community • Motivation: the formation of a link between two documents is modeled as a combination of topic similarity and community closeness • Advantage: achieving three tasks at the same time, i.e., topic modeling, community discovery, and link prediction • Future work • Extending the model to analyze time-series linked documents • Modeling the dynamics between citation and influence as topics evolve over time

Acknowledgement • John Lafferty, Eric Xing (CMU) • Jure Leskovec (Cornell Univ.) • Ramesh Nallapati (Stanford Univ.) • Rick Lawrence (IBM Research)

Thank you!
