
Unified Model for Topic Identification, Community Discovery, and Link Prediction in Textual Data

This presentation introduces Topic-Link LDA, a model that combines topic identification, community discovery, and link prediction for linked web text. It covers the model, its inference method, and experimental results.


Presentation Transcript


  1. Topic-Link LDA: Joint Models of Topic and Author Community
     Yan Liu, Alexandru Niculescu-Mizil, Wojciech Gryc
     Predictive Modeling Group, Mathematical Sciences Department, IBM Research
     Jun 16, 2009

  2. Background
     • In many applications, we have two types of information:
       • Texts: unstructured information describing the properties of the entities
       • Links: a graph describing relationships between the entities
     [Figure labels: scientific publications, blogs, movie reviews (user-movie graph)]

  3. Related Work
     • Previous work treats these tasks independently, for example, in the blog analysis domain:
       • Topic identification: latent semantic indexing [Deerwester et al., 1990], probabilistic latent semantic indexing [Hofmann, 1999], latent Dirichlet allocation [Blei et al., 2003]
       • Community discovery: graph partitioning for community discovery [Gibson et al., 1998; Chakrabarti & Faloutsos, 2006], HITS and PageRank algorithms for hubs and authorities [Cohn & Hofmann, 2001]
       • Link prediction: graph-based algorithms such as preferential attachment [Newman, 1999]; content-based algorithms such as similarity-based algorithms; algorithms combining both content and graph information [Xu et al., 2005; Yu et al., 2006]
     • Recent progress in modeling text and links jointly:
       • Link-LDA [Erosheva et al., 2004]
       • Citation influence model [Dietz et al., 2007]
       • Link-PLSA-LDA [Nallapati & Cohen, 2008]
       • Relational topic model (RTM) [Chang & Blei, 2009]

  4. Motivation
     • Current solutions to both topic modeling and community discovery treat all links (or missing links) between documents as the same
     • Links between two posts (POSITIVE) that share little or no content similarity => close friendship between the authors
     • Missing links between two posts (NEGATIVE) that share strong content similarity => the authors are less acquainted
     [Figure: probability distribution of content similarity scores for positive and negative pairs]

  5. Graphical Model Representation of Topic-Link LDA
     [Figure: plate diagrams of LDA (Blei et al., 2003) and Topic-Link LDA. The Topic-Link LDA diagram adds nodes for author community and link existence. Symbols appearing in the diagrams: α, β, θ, z, w, μ, κ, τ, G.]
     We are able to achieve three tasks in one unified model: topic identification, community discovery, and link prediction

  6. Inference and Learning: Topic-Link LDA (1/3)
     • The likelihood of the data in the Topic-Link LDA model is [equation not preserved in the transcript; its terms are labeled Links, Community, and Texts]
     • We need efficient algorithms to estimate the parameters of the model
     • A variational expectation-maximization (EM) algorithm is used to derive the update equations for the parameters
     • We use the following auxiliary functions to approximate the logistic function [Jaakkola, 1997] [equation not preserved in the transcript]
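The auxiliary functions on this slide did not survive in the transcript. As a hedged illustration only, the sketch below checks numerically the standard quadratic lower bound on the logistic function that is commonly used in variational treatments of logistic models. Whether the slide used exactly this form is an assumption, and the function names `sigmoid`, `lam`, and `jaakkola_jordan_bound` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def lam(xi: float) -> float:
    """lambda(xi) = tanh(xi / 2) / (4 * xi); its limit as xi -> 0 is 1/8."""
    if abs(xi) < 1e-8:
        return 0.125
    return math.tanh(xi / 2.0) / (4.0 * xi)


def jaakkola_jordan_bound(x: float, xi: float) -> float:
    """Lower bound on sigmoid(x) with variational parameter xi.

    sigmoid(x) >= sigmoid(xi) * exp((x - xi) / 2 - lambda(xi) * (x^2 - xi^2))
    The bound is tight when xi equals x or -x.
    """
    return sigmoid(xi) * math.exp((x - xi) / 2.0 - lam(xi) * (x * x - xi * xi))


if __name__ == "__main__":
    for x in (-3.0, 0.5, 2.0):
        for xi in (0.5, 2.0):
            print(x, xi, jaakkola_jordan_bound(x, xi), sigmoid(x))
```

The exponent is quadratic in x, which is what makes expectations under the variational distribution tractable in an E-step; the parameter xi is then tuned per data point to tighten the bound.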

  7. Inference and Learning: Topic-Link LDA (2/3)
     • E-step: estimate the expectations of the hidden variables
       • Φ: variational variable of Z
       • γ: variational variable of Θ

  8. Inference and Learning: Topic-Link LDA (3/3) • M-step: compute the model parameters

  9. Experiment Setup
     • Web 2.0: blog posts on Web 2.0 technologies
       • Top 75 blogs from Technorati and the Techmeme Leaderboard
       • Training: 3853 posts within Feb 1-14, 2008
       • Test: 2096 posts within Feb 15-22, 2008
     • Politics: blog data sets on US politics
       • 101 political blogs from Technorati and the most popular blog listings in the literature
       • Training: 3467 posts within Feb 1-14, 2008
       • Test: 1897 posts within Feb 15-22, 2008
     • CORA: research paper citation data from CiteSeer [McCallum et al., 2000]
       • Papers classified as “Artificial Intelligence-Machine Learning” and “Artificial Intelligence-Data Mining”
       • 423 authors with their 2695 papers, split into halves by time stamp for training and testing, respectively

  10. Topic Modeling
     • Perplexity of the test set [Figure: Web 2.0 perplexity vs. number of hidden topics]
     • Examples of identified topics on the CORA dataset
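The slide reports test-set perplexity without defining it. A minimal sketch of the usual definition for topic models follows: the exponential of the negative per-word log-likelihood over the held-out documents. The function name `perplexity` and the toy inputs are hypothetical, and the per-document log-likelihoods are assumed to come from whichever model is being evaluated.

```python
import math
from typing import Sequence


def perplexity(doc_log_likelihoods: Sequence[float], doc_lengths: Sequence[int]) -> float:
    """Perplexity = exp(-sum_d log p(w_d) / sum_d N_d). Lower is better."""
    total_words = sum(doc_lengths)
    if total_words <= 0:
        raise ValueError("test set must contain at least one word")
    return math.exp(-sum(doc_log_likelihoods) / total_words)


if __name__ == "__main__":
    # Toy check: a model that assigns every word probability 1/100
    # behaves like a uniform guess over 100 words.
    lengths = [10, 20]
    log_liks = [n * math.log(1.0 / 100.0) for n in lengths]
    print(perplexity(log_liks, lengths))
```

A uniform model over a vocabulary of V words has perplexity V, which gives a convenient sanity check when comparing curves across numbers of hidden topics.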

  11. Community Discovery
     • Clustering results [Figure: spectral clustering vs. Topic-Link LDA]
     • Examples of discovered communities on the CORA dataset

  12. Link Formation
     • How do content similarity and community similarity contribute to the formation of a link?
     • Contribution of community (content) similarity = coefficient τ * estimated mean of community (content) similarity scores
     Community similarity has a much stronger effect on link formation in the political domain than in the technical domain and scientific papers
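The contribution measure on this slide is a simple product, sketched below. The coefficient and the similarity scores are made-up placeholder numbers, not results from the presentation, and the function name `contribution` is hypothetical.

```python
from statistics import mean
from typing import Sequence


def contribution(tau: float, similarity_scores: Sequence[float]) -> float:
    """Contribution = coefficient tau * estimated mean of the similarity scores."""
    return tau * mean(similarity_scores)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    community_scores = [0.1, 0.3]
    content_scores = [0.4, 0.6]
    print("community:", contribution(2.0, community_scores))
    print("content:", contribution(0.5, content_scores))
```

Comparing the two products for a given dataset indicates which kind of similarity dominates link formation there.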

  13. Link Prediction
     • Baselines
       • Graph-based algorithms: preferential attachment [Newman, 1999]
       • Content-based algorithms: content similarity between input pairs of posts as the feature and logistic regression as the classifier
     • Evaluation measures
       • Precision, Recall, F1
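The evaluation measures listed above can be computed as in the sketch below, which treats link prediction as predicting a set of node pairs. Representing links as tuples and the function name `precision_recall_f1` are assumptions made for illustration, and the example pairs are invented.

```python
from typing import Hashable, Iterable, Tuple

Link = Tuple[Hashable, Hashable]


def precision_recall_f1(true_links: Iterable[Link],
                        predicted_links: Iterable[Link]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 of predicted links against the true links."""
    true_set, pred_set = set(true_links), set(predicted_links)
    true_positives = len(true_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    truth = [(1, 2), (2, 3), (3, 4)]
    predicted = [(1, 2), (2, 3), (5, 6), (7, 8)]
    print(precision_recall_f1(truth, predicted))
```

Links are compared as ordered pairs here; for undirected links, each pair would need to be normalized (for example, sorted) before the comparison.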

  14. Conclusion
     • In this paper, we develop the Topic-Link LDA model to jointly model topics and author community
       • Motivation: the formation of a link between two documents is modeled as a combination of topic similarity and community closeness
       • Advantage: achieving three tasks at the same time, i.e., topic modeling, community discovery, and link prediction
     • Future work
       • Extending the model to analyze time-series linked documents
       • Modeling the dynamics between citation and influence as topics evolve over time

  15. Acknowledgement • John Lafferty, Eric Xing (CMU) • Jure Leskovec (Cornell Univ.) • Ramesh Nallapati (Stanford Univ.) • Rick Lawrence (IBM Research)

  16. Thank you!
