This paper discusses a probabilistic approach to topic modeling in text collections, considering the local and global consistency of the data's geometric structure. Experimental results on text clustering and classification demonstrate the effectiveness of the proposed approach.
Probabilistic Dyadic Data Analysis with Local and Global Consistency
Deng Cai, Xuanhui Wang, Xiaofei He
Zhejiang University / University of Illinois at Urbana-Champaign
ICML 2009
Outline
• Motivation
  • Traditional topic modeling (e.g. PLSA, LDA)
  • The geometric structure of the data
• Topic Modeling with Local Consistency
  • Locally-consistent Topic Modeling (LTM)
• Experiments
• Summary
Why Topic Modeling
[Figure: a text collection is mapped to topic models, i.e. multinomial distributions over terms, e.g. {term 0.16, relevance 0.08, weight 0.07, feedback 0.04, …} and {web 0.21, search 0.10, link 0.08, graph 0.05, …}]
Probabilistic topic modeling is a powerful tool for text mining:
• Topic discovery
• Summarization
• Opinion mining
• Many more…
Probabilistic Latent Semantic Analysis
[Figure: a document-term matrix; how should P(w|d) be estimated?]
• Naive approach: estimate P(w|d) directly from counts,
  P(w|d) = n(d,w) / Σ_{w'} n(d,w'),
  where n(d,w) is the number of occurrences of term w in document d
• Zero-frequency problem: terms not occurring in a document get zero probability
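A minimal NumPy sketch of this naive estimator (the function name and the toy counts matrix are illustrative, not from the paper):

```python
import numpy as np

def naive_term_probs(counts):
    """Naive estimate P(w|d) = n(d,w) / sum_w' n(d,w').

    counts: (n_docs, n_terms) matrix of raw term frequencies.
    """
    totals = counts.sum(axis=1, keepdims=True)
    return counts / totals

# Toy corpus: term 2 never occurs in document 0, so it gets
# probability exactly 0 there -- the zero-frequency problem.
counts = np.array([[3.0, 1.0, 0.0],
                   [0.0, 2.0, 2.0]])
print(naive_term_probs(counts))
```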
Probabilistic Latent Semantic Analysis
[Figure: documents and terms linked through latent concepts (e.g. a TRADE concept covering "trade", "imports", "economic"), learned by model fitting]
• PLSA introduces latent concepts z between documents and terms:
  P(w|d) = Σ_z P(w|z) P(z|d)
T. Hofmann, Probabilistic Latent Semantic Analysis, UAI 1999.
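To make the mixture concrete, a small sketch of the PLSA decomposition; the sizes and the random Dirichlet parameters are illustrative assumptions:

```python
import numpy as np

K, N, M = 2, 3, 4  # illustrative sizes: topics, documents, terms
rng = np.random.default_rng(0)
p_w_given_z = rng.dirichlet(np.ones(M), size=K)  # (K, M), each row is a topic
p_z_given_d = rng.dirichlet(np.ones(K), size=N)  # (N, K), topic mixture per doc

# PLSA mixture: P(w|d) = sum_z P(w|z) P(z|d), written as a matrix product.
p_w_given_d = p_z_given_d @ p_w_given_z          # (N, M)
assert np.allclose(p_w_given_d.sum(axis=1), 1.0)
```

Because every topic assigns some mass to every term, P(w|d) stays strictly positive, which sidesteps the zero-frequency problem of the naive approach.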
Various Topic Modeling Approaches
• Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI) [Hofmann 99]
• Latent Dirichlet Allocation (LDA) [Blei et al. 03]
• Pachinko allocation [Li & McCallum 06]
• Many more…
All of these fail to consider the geometric structure of the data.
Manifold?
• Manifold assumption (maybe too strong)
• Local consistency assumption (much weaker)
  • Nearby points (neighbors) share similar properties
Geometric Structure for Topic Modeling
• Build a p-nearest-neighbor graph W over the documents
• Smooth the pLSA topic distributions P(z|d) over the graph
Intuition: a document has topics similar to those of its neighbors.
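One possible construction of the p-nearest-neighbor graph, sketched with scikit-learn; the paper's exact choices (edge weights, symmetrization) may differ:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def build_pnn_graph(X, p=5):
    """0/1 affinity matrix W of a p-nearest-neighbor graph over the
    document vectors X, symmetrized so that d_j and d_l are linked
    if either is among the other's p nearest neighbors."""
    W = kneighbors_graph(X, n_neighbors=p, mode='connectivity').toarray()
    return np.maximum(W, W.T)
```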
Objective Function
Log-likelihood of pLSA:
  L = Σ_j Σ_i n(d_j, w_i) log Σ_k P(w_i|z_k) P(z_k|d_j)
Smoothness of P(z|d) over the geometric structure of the data:
  R = (1/2) Σ_{j,l} W_{jl} Σ_k ( P(z_k|d_j) − P(z_k|d_l) )²
Regularized log-likelihood of LTM:
  O = L − λR
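A direct, unoptimized sketch of this objective, assuming W is a dense (N, N) affinity matrix; the function name and the small smoothing constant are illustrative:

```python
import numpy as np

def ltm_objective(counts, p_w_given_z, p_z_given_d, W, lam):
    """Regularized log-likelihood O = L - lam * R from the slide.
    counts: (N, M) term frequencies; W: (N, N) affinity matrix."""
    p_w_given_d = p_z_given_d @ p_w_given_z                  # (N, M)
    L = np.sum(counts * np.log(p_w_given_d + 1e-12))         # pLSA log-likelihood
    diffs = p_z_given_d[:, None, :] - p_z_given_d[None, :, :]
    R = 0.5 * np.sum(W * (diffs ** 2).sum(axis=2))           # graph smoothness
    return L - lam * R
```

Note the dense (N, N, K) difference tensor: this form is only practical for small corpora; a sparse W would be iterated edge by edge instead.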
Parameter Estimation via EM
• E-step: posterior probability of the latent variables ("concepts"), same as pLSA:
  P(z_k | d_j, w_i) = P(w_i|z_k) P(z_k|d_j) / Σ_{k'} P(w_i|z_{k'}) P(z_{k'}|d_j)
• M-step: parameter estimation based on the "completed" statistics; the update for P(w|z) is the same as pLSA
A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society B, vol. 39, no. 1, pp. 1-38, 1977.
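A sketch of this shared E-step, assuming the array shapes noted in the docstring:

```python
import numpy as np

def e_step(p_w_given_z, p_z_given_d):
    """Posterior P(z|d,w) proportional to P(w|z) P(z|d); identical to
    pLSA's E-step. Shapes: p_w_given_z (K, M), p_z_given_d (N, K);
    returns an (N, M, K) array that sums to 1 over the last axis."""
    joint = p_z_given_d[:, None, :] * p_w_given_z.T[None, :, :]
    return joint / joint.sum(axis=2, keepdims=True)
```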
Parameter Estimation via EM
• M-step: parameter estimation based on the "completed" statistics
• If λ = 0, the updates reduce exactly to those of pLSA:
  P(w_i|z_k) ∝ Σ_j n(d_j, w_i) P(z_k|d_j, w_i)
  P(z_k|d_j) ∝ Σ_i n(d_j, w_i) P(z_k|d_j, w_i)
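A sketch of the λ = 0 (i.e. plain pLSA) M-step; for λ > 0 the P(z|d) update must additionally account for the graph regularizer, which is not shown here:

```python
import numpy as np

def m_step_lambda0(counts, posterior):
    """M-step for the lambda = 0 case, i.e. plain pLSA.
    counts: (N, M); posterior: (N, M, K) from the E-step."""
    weighted = counts[:, :, None] * posterior            # n(d,w) * P(z|d,w)
    p_w_given_z = weighted.sum(axis=0).T                 # (K, M)
    p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
    p_z_given_d = weighted.sum(axis=1)                   # (N, K)
    p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)
    return p_w_given_z, p_z_given_d
```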
Experimental Results
• Text clustering
  • Reuters-21578 corpus: 30 categories, 8067 documents, 18832 distinct terms
• Text classification
  • TDT2 corpus: 10 categories, 7456 documents, 33947 distinct terms
For reproducibility, our algorithms and the data sets used in the experiments are available at: http://www.cs.uiuc.edu/homes/dengcai2/Data/data.html
Clustering Results on Reuters
• P(z|d) can be used to indicate each document's cluster (see the sketch below)
• Comparison of 6 algorithms:
  • 3 topic modeling approaches
  • 3 clustering algorithms
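A sketch of how hard cluster assignments could be read off P(z|d); the Dirichlet draw is a stand-in for actual LTM output, with sizes matching the Reuters setup:

```python
import numpy as np

rng = np.random.default_rng(0)
p_z_given_d = rng.dirichlet(np.ones(30), size=8067)  # stand-in for LTM output

# Hard-assign every document to its most probable topic; the resulting
# labels can then be scored against the ground-truth categories.
cluster_labels = p_z_given_d.argmax(axis=1)
```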
Classification Results on TDT2
• Classifier: SVM (a sketch follows)
• LTM with labels: construct the graph W taking the label information into account
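A sketch of the classification setup, with random stand-ins for the learned P(z|d) features and the TDT2 category labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
p_z_given_d = rng.dirichlet(np.ones(10), size=7456)  # stand-in for LTM output
y = rng.integers(0, 10, size=7456)                   # stand-in category labels

# Train a linear SVM on the P(z|d) representation of each document.
X_tr, X_te, y_tr, y_te = train_test_split(p_z_given_d, y, random_state=0)
clf = SVC(kernel='linear').fit(X_tr, y_tr)
print('accuracy:', clf.score(X_te, y_te))
```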
Summary
• Topic modeling with local and global consistency (considering the geometric structure of the data)
• We use EM to solve the resulting optimization problem
• Experimental results on text clustering and classification show the effectiveness of the proposed approach
• Future work:
  • Experiments on real applications
  • Extensions to other topic models (e.g. LDA)
  • Other ways of constructing the document graph