Topic Significance Ranking for LDA Generative Models. Loulwah AlSumait, Daniel Barbará, James Gentle, Carlotta Domeniconi. ECML PKDD - Bled, Slovenia - September 7-11, 2009.
Agenda • Introduction • Junk/Insignificant topic definitions • Distance measures • 4-phase Weighted Combination Approach • Experimental results • Conclusions and future work
Latent Dirichlet Allocation (LDA) Model, Blei, Ng, & Jordan (2003) • Generative Process • Probabilistic generative model • Hidden variables (topics) are associated with the observed text • Dirichlet priors on document and topic distributions • Inference Process • Exact inference is intractable • Approximation approaches • Input: K • Output: Φ, θ
[Figure: LDA plate diagram; surviving labels: z_i, w_i, N_d, D, K]
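As a rough illustration of the generative process named on this slide, the following minimal sketch samples a small synthetic corpus. It is not the authors' code. The function names, the symmetric prior values `alpha` and `beta`, and the fixed document length `N_d` are assumptions made for the example; `phi` and `theta` correspond to the slide's Φ and θ.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(K, W, D, N_d, alpha=0.5, beta=0.1, seed=0):
    """Sample D documents of N_d word ids each from an LDA model.

    K topics, W vocabulary words; alpha and beta are assumed symmetric priors.
    Returns (phi, theta, docs): topic-word distributions (K x W),
    document-topic distributions (D x K), and the word ids per document.
    """
    rng = random.Random(seed)
    # One word distribution per topic, drawn from a Dirichlet prior.
    phi = [sample_dirichlet([beta] * W, rng) for _ in range(K)]
    theta, docs = [], []
    for _ in range(D):
        # One topic distribution per document, drawn from a Dirichlet prior.
        theta_d = sample_dirichlet([alpha] * K, rng)
        words = []
        for _ in range(N_d):
            z = sample_categorical(theta_d, rng)           # hidden topic z_i
            words.append(sample_categorical(phi[z], rng))  # observed word w_i
        theta.append(theta_d)
        docs.append(words)
    return phi, theta, docs
```

Inference runs in the opposite direction: given only `docs` and K, an approximate method estimates `phi` and `theta`.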
Topic Significance Ranking • The setting of K has a critical effect on the inferred topics • Most previous work examines the topics manually • Goal: quantify the semantic significance of topics • Measure how different a topic's distribution is from junk/insignificant topic distributions
Topic Significance Ranking • Example: 20 NewsGroups
The Volgenau School of Information Technology and Engineering, Department of Computer Science
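Junk/Insignificant Topic Definitions • Uniform Distribution Over Words • Uniformity of a topic: [formula lost in extraction] • Vacuous Semantic Distribution • p(w_i|k) = φ_ik [remaining terms lost in extraction] • Vacuousness of a topic: [formula lost in extraction] • Background Distribution • Background of a topic: [formula lost in extraction]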
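The formulas on this slide did not survive extraction, so the sketch below is one plausible reading of the three junk/insignificant reference distributions, not a reconstruction of the authors' definitions. Every definition here is an assumption: the uniform distribution gives each of the W words probability 1/W; the vacuous distribution is taken as the corpus-wide word distribution obtained by mixing the topics, with topic weights p(k) estimated as the mean of θ over documents; and the background distribution is taken as uniform over the D documents, to be compared with a topic's normalized distribution over documents. All function names are hypothetical.

```python
def uniform_word_distribution(W):
    """Assumed junk reference: every word has probability 1/W."""
    return [1.0 / W] * W

def topic_weights(theta):
    """Assumed estimate of p(k): mean of theta[d][k] over all documents."""
    D, K = len(theta), len(theta[0])
    return [sum(theta[d][k] for d in range(D)) / D for k in range(K)]

def vacuous_distribution(phi, theta):
    """Assumed junk reference: word distribution of the whole corpus,
    sum over k of phi[k][i] * p(k), where phi[k][i] = p(w_i | k)."""
    p_k = topic_weights(theta)
    W = len(phi[0])
    return [sum(phi[k][i] * p_k[k] for k in range(len(phi))) for i in range(W)]

def topic_over_documents(theta, k):
    """Topic k's weight in each document, normalized to sum to 1."""
    column = [row[k] for row in theta]
    total = sum(column)
    return [v / total for v in column]

def background_distribution(D):
    """Assumed junk reference: a topic spread evenly over all D documents."""
    return [1.0 / D] * D
```

Under this reading, a topic's uniformity and vacuousness compare its row of Φ against the first two references, and its background compares `topic_over_documents(theta, k)` against the third.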
Distance Measures • Symmetric KL-Divergence • Uniformity, Background, W-Vacuous • Cosine Dissimilarity • Uniformity, W-Vacuous, Background • Correlation Coefficient • Uniformity, W-Vacuous, Background
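The three measures listed on this slide can be sketched for two discrete distributions `p` and `q` of equal length. This is an illustrative sketch with several assumptions: the symmetric KL-divergence is taken as the average of the two directed divergences (some definitions use the sum), a small `eps` is added to avoid taking the logarithm of zero, and the correlation coefficient is taken to be Pearson's.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Directed KL(p || q); eps smoothing is an assumption to handle zeros."""
    return sum((a + eps) * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))

def symmetric_kl(p, q, eps=1e-12):
    """Average of KL(p || q) and KL(q || p) (assumed form of the symmetric KL)."""
    return 0.5 * (kl_divergence(p, q, eps) + kl_divergence(q, p, eps))

def cosine_dissimilarity(p, q):
    """One minus the cosine of the angle between p and q."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)

def correlation_coefficient(p, q):
    """Pearson correlation between p and q.

    Pearson correlation is undefined when either vector is constant
    (for example an exactly uniform distribution); returning 0.0 in
    that case is an arbitrary choice made for this sketch.
    """
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    if var_p == 0 or var_q == 0:
        return 0.0
    return cov / math.sqrt(var_p * var_q)
```

A topic that is far from a junk distribution gets a large symmetric KL and cosine dissimilarity and a low correlation with it.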
Topic Significance Ranking • Multi-Criteria Weighted Combination • 4 phases • Standardization procedure • Transform distances into standardized measures • Scores • Weights
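The slide does not give the standardization formulas, so the following is only a sketch of two common ways to put per-topic distances on a shared scale. Both functions and their names are assumptions, not the procedure used in the talk: one rescales relative to the observed range, the other relative to the total.

```python
def min_max_standardize(values):
    """Rescale values to [0, 1] relative to their range (assumed procedure)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All topics are equally distant; no topic stands out.
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def sum_standardize(values):
    """Rescale values so they sum to 1 (assumed procedure)."""
    total = sum(values)
    if total == 0:
        return [0.0] * len(values)
    return [v / total for v in values]
```

For example, `min_max_standardize([2.0, 4.0, 6.0])` gives `[0.0, 0.5, 1.0]`, so the topic farthest from the junk distribution receives the highest standardized value.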
Topic Significance Ranking • 4 phases (Continued) • Intra-Criterion Weighted Combination • Combine standardized measures of each J/I definition • Inter-Criteria Weighted Combination • Combine J/I scores and weights • Topic Rank
[Figure: combination diagram; surviving labels: S1 and S2 (superscripts U, V, B; subscript k), Uniformity scores, W-Vacuous scores, Background scores, TSR]
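The two combination phases can be sketched as nested weighted averages. This is a simplification under stated assumptions: the slide says J/I scores and weights are combined but does not give the formula, so the plain weighted average, the function names, and the equal weights in the usage note are all hypothetical.

```python
def weighted_average(values, weights):
    """Weighted average of values; weights need not sum to 1."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def intra_criterion_scores(measures, weights):
    """Combine the standardized measures of one J/I definition.

    measures: one list per distance measure, each holding K per-topic values.
    Returns one combined score per topic.
    """
    K = len(measures[0])
    return [weighted_average([m[k] for m in measures], weights) for k in range(K)]

def rank_topics(criterion_scores, criterion_weights):
    """Combine the per-criterion scores into one TSR value per topic.

    criterion_scores: one list per J/I definition (e.g. uniformity,
    W-vacuous, background), each holding K per-topic scores.
    Returns (tsr, order), where order lists topic indices from most
    to least significant.
    """
    K = len(criterion_scores[0])
    tsr = [weighted_average([c[k] for c in criterion_scores], criterion_weights)
           for k in range(K)]
    order = sorted(range(K), key=lambda k: tsr[k], reverse=True)
    return tsr, order
```

With two topics and equal weights, `rank_topics([[0.0, 1.0], [0.2, 0.8], [0.1, 0.9]], [1, 1, 1])` places topic 1 ahead of topic 0, since it is farther from all three junk definitions.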
Individual vs. Combined Score Simulated Data
Individual vs. Combined Score 20 NewsGroups
Conclusions and Future Work • Unsupervised numerical quantification of the topics' semantic significance • Novel post-analysis in LDA modeling • Three J/I topic distributions • 4-phase weighted combination approach • Future directions: • Analysis of TSR sensitivity to the approach, K, and weight settings • More J/I definitions • Tool to visualize topic evolution in an online setting