This paper discusses a probabilistic approach to topic modeling in text collections, considering the local and global consistency of the data's geometric structure. Experimental results on text clustering and classification demonstrate the effectiveness of the proposed approach.
Probabilistic Dyadic Data Analysis with Local and Global Consistency
Deng Cai, Xuanhui Wang, Xiaofei He
Zhejiang University / University of Illinois at Urbana-Champaign
ICML 2009
Outline
• Motivation
  • Traditional topic modeling (e.g. PLSA, LDA)
  • The geometric structure of the data
• Topic Modeling with Local Consistency
  • Locally-consistent Topic Modeling (LTM)
• Experiments
• Summary
Why Topic Modeling
[Figure: a text collection is mapped to topic models, i.e. multinomial distributions over terms, e.g. {term 0.16, relevance 0.08, weight 0.07, feedback 0.04, …} and {web 0.21, search 0.10, link 0.08, graph 0.05, …}]
Probabilistic topic modeling is a powerful tool for text mining:
• Topic discovery
• Summarization
• Opinion mining
• Many more…
Probabilistic Latent Semantic Analysis
[Figure: a document-term matrix; how should P(w|d) be estimated?]
• Naive approach: estimate P(w|d) directly from counts,
  P(w|d) = n(d,w) / Σ_{w'} n(d,w'),
  where n(d,w) is the number of occurrences of term w in document d
• Zero-frequency problem: terms not occurring in a document get zero probability
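A minimal NumPy sketch of this naive estimator (the function name and the toy counts matrix are illustrative, not from the paper):

```python
import numpy as np

def naive_term_probs(counts):
    """Naive estimate P(w|d) = n(d,w) / sum_w' n(d,w').

    counts: (n_docs, n_terms) matrix of raw term frequencies.
    """
    totals = counts.sum(axis=1, keepdims=True)
    return counts / totals

# Toy corpus: term 2 never occurs in document 0, so it gets
# probability exactly 0 there -- the zero-frequency problem.
counts = np.array([[3.0, 1.0, 0.0],
                   [0.0, 2.0, 2.0]])
print(naive_term_probs(counts))
```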
Probabilistic Latent Semantic Analysis
[Figure: documents and terms linked through latent concepts (e.g. a TRADE concept covering "trade", "imports", "economic"), learned by model fitting]
• PLSA introduces latent concepts z between documents and terms:
  P(w|d) = Σ_z P(w|z) P(z|d)
T. Hofmann, Probabilistic Latent Semantic Analysis, UAI 1999.
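To make the mixture concrete, a small sketch of the PLSA decomposition; the sizes and the random Dirichlet parameters are illustrative assumptions:

```python
import numpy as np

K, N, M = 2, 3, 4  # illustrative sizes: topics, documents, terms
rng = np.random.default_rng(0)
p_w_given_z = rng.dirichlet(np.ones(M), size=K)  # (K, M), each row is a topic
p_z_given_d = rng.dirichlet(np.ones(K), size=N)  # (N, K), topic mixture per doc

# PLSA mixture: P(w|d) = sum_z P(w|z) P(z|d), written as a matrix product.
p_w_given_d = p_z_given_d @ p_w_given_z          # (N, M)
assert np.allclose(p_w_given_d.sum(axis=1), 1.0)
```

Because every topic assigns some mass to every term, P(w|d) stays strictly positive, which sidesteps the zero-frequency problem of the naive approach.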
Various Topic Modeling Approaches
• Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI) [Hofmann 99]
• Latent Dirichlet Allocation (LDA) [Blei et al. 03]
• Pachinko allocation [Li & McCallum 06]
• Many more…
All of these fail to consider the geometric structure of the data.
Manifold?
• Manifold assumption (maybe too strong)
• Local consistency assumption (much weaker)
  • Nearby points (neighbors) share similar properties
Geometric Structure for Topic Modeling
• Build a p-nearest-neighbor graph W over the documents
• Smooth the pLSA topic distributions P(z|d) over the graph
Intuition: a document has topics similar to those of its neighbors.
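One possible construction of the p-nearest-neighbor graph, sketched with scikit-learn; the paper's exact choices (edge weights, symmetrization) may differ:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def build_pnn_graph(X, p=5):
    """0/1 affinity matrix W of a p-nearest-neighbor graph over the
    document vectors X, symmetrized so that d_j and d_l are linked
    if either is among the other's p nearest neighbors."""
    W = kneighbors_graph(X, n_neighbors=p, mode='connectivity').toarray()
    return np.maximum(W, W.T)
```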
Objective Function
Log-likelihood of pLSA:
  L = Σ_j Σ_i n(d_j, w_i) log Σ_k P(w_i|z_k) P(z_k|d_j)
Smoothness of P(z|d) over the geometric structure of the data:
  R = (1/2) Σ_{j,l} W_{jl} Σ_k ( P(z_k|d_j) − P(z_k|d_l) )²
Regularized log-likelihood of LTM:
  O = L − λR
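A direct, unoptimized sketch of this objective, assuming W is a dense (N, N) affinity matrix; the function name and the small smoothing constant are illustrative:

```python
import numpy as np

def ltm_objective(counts, p_w_given_z, p_z_given_d, W, lam):
    """Regularized log-likelihood O = L - lam * R from the slide.
    counts: (N, M) term frequencies; W: (N, N) affinity matrix."""
    p_w_given_d = p_z_given_d @ p_w_given_z                  # (N, M)
    L = np.sum(counts * np.log(p_w_given_d + 1e-12))         # pLSA log-likelihood
    diffs = p_z_given_d[:, None, :] - p_z_given_d[None, :, :]
    R = 0.5 * np.sum(W * (diffs ** 2).sum(axis=2))           # graph smoothness
    return L - lam * R
```

Note the dense (N, N, K) difference tensor: this form is only practical for small corpora; a sparse W would be iterated edge by edge instead.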
Parameter Estimation via EM
• E-step: posterior probability of the latent variables ("concepts"), same as pLSA:
  P(z_k | d_j, w_i) = P(w_i|z_k) P(z_k|d_j) / Σ_{k'} P(w_i|z_{k'}) P(z_{k'}|d_j)
• M-step: parameter estimation based on the "completed" statistics; the update for P(w|z) is the same as pLSA
A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society B, vol. 39, no. 1, pp. 1-38, 1977.
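A sketch of this shared E-step, assuming the array shapes noted in the docstring:

```python
import numpy as np

def e_step(p_w_given_z, p_z_given_d):
    """Posterior P(z|d,w) proportional to P(w|z) P(z|d); identical to
    pLSA's E-step. Shapes: p_w_given_z (K, M), p_z_given_d (N, K);
    returns an (N, M, K) array that sums to 1 over the last axis."""
    joint = p_z_given_d[:, None, :] * p_w_given_z.T[None, :, :]
    return joint / joint.sum(axis=2, keepdims=True)
```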
Parameter Estimation via EM
• M-step: parameter estimation based on the "completed" statistics
• If λ = 0, the updates reduce exactly to those of pLSA:
  P(w_i|z_k) ∝ Σ_j n(d_j, w_i) P(z_k|d_j, w_i)
  P(z_k|d_j) ∝ Σ_i n(d_j, w_i) P(z_k|d_j, w_i)
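A sketch of the λ = 0 (i.e. plain pLSA) M-step; for λ > 0 the P(z|d) update must additionally account for the graph regularizer, which is not shown here:

```python
import numpy as np

def m_step_lambda0(counts, posterior):
    """M-step for the lambda = 0 case, i.e. plain pLSA.
    counts: (N, M); posterior: (N, M, K) from the E-step."""
    weighted = counts[:, :, None] * posterior            # n(d,w) * P(z|d,w)
    p_w_given_z = weighted.sum(axis=0).T                 # (K, M)
    p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
    p_z_given_d = weighted.sum(axis=1)                   # (N, K)
    p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)
    return p_w_given_z, p_z_given_d
```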
Experimental Results
• Text clustering
  • Reuters-21578 corpus: 30 categories, 8067 documents, 18832 distinct terms
• Text classification
  • TDT2 corpus: 10 categories, 7456 documents, 33947 distinct terms
For reproducibility, our algorithms and the data sets used in the experiments are available at: http://www.cs.uiuc.edu/homes/dengcai2/Data/data.html
Clustering Results on Reuters
• P(z|d) can be used to indicate each document's cluster (see the sketch below)
• Comparison of 6 algorithms:
  • 3 topic modeling approaches
  • 3 clustering algorithms
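A sketch of how hard cluster assignments could be read off P(z|d); the Dirichlet draw is a stand-in for actual LTM output, with sizes matching the Reuters setup:

```python
import numpy as np

rng = np.random.default_rng(0)
p_z_given_d = rng.dirichlet(np.ones(30), size=8067)  # stand-in for LTM output

# Hard-assign every document to its most probable topic; the resulting
# labels can then be scored against the ground-truth categories.
cluster_labels = p_z_given_d.argmax(axis=1)
```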
Classification Results on TDT2
• Classifier: SVM (a sketch follows)
• LTM with labels: construct the graph W taking the label information into account
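A sketch of the classification setup, with random stand-ins for the learned P(z|d) features and the TDT2 category labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
p_z_given_d = rng.dirichlet(np.ones(10), size=7456)  # stand-in for LTM output
y = rng.integers(0, 10, size=7456)                   # stand-in category labels

# Train a linear SVM on the P(z|d) representation of each document.
X_tr, X_te, y_tr, y_te = train_test_split(p_z_given_d, y, random_state=0)
clf = SVC(kernel='linear').fit(X_tr, y_tr)
print('accuracy:', clf.score(X_te, y_te))
```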
Summary
• Topic modeling with local and global consistency (considering the geometric structure of the data)
• We use EM to solve the resulting optimization problem
• Experimental results on text clustering and classification show the effectiveness of the proposed approach
• Future work:
  • Experiments on real applications
  • Extensions to other topic models (e.g. LDA)
  • Other ways of constructing the document graph