UTOPIAN: User-Driven Topic Modeling Based on Interactive Nonnegative Matrix Factorization
Jaegul Choo (1*), Changhyun Lee (1), Chandan K. Reddy (2), and Haesun Park (1)
(1) Georgia Institute of Technology, (2) Wayne State University
*E-mail: jaegul.choo@cc.gatech.edu
Intro: Topic Modeling
• Topic: a distribution over keywords
• Document: a distribution over topics
[Figure: Documents 1 to 4, Topics 1 to 3, and the keywords brain, evolve, dna, genetic, gene, nerve, neuron, life, organism]
Latent Dirichlet Allocation (LDA) in Visual Analytics
• LDA has been widely used in visual analytics.
• TIARA [Wei et al., KDD10], iVisClustering [Lee et al., EuroVis12], ParallelTopics [Dou et al., VAST12], TopicViz [Eisenstein et al., CHI-WIP12], …
*Images courtesy of the original papers.
Overview of Our Work
• Proposes nonnegative matrix factorization (NMF) for topic modeling.
• Highlights advantages of NMF over LDA in visual analytics.
• Presents UTOPIAN, an NMF-based interactive topic modeling system.
[Figure panels: Keyword-induced topic creation, Topic merging, Doc-induced topic creation, Topic splitting]
Nonnegative Matrix Factorization (NMF)
Lower-rank approximation with nonnegativity constraints: $A \approx WH$
Why nonnegativity?
• Easy interpretation and semantically meaningful output
Algorithm
• Alternating nonnegativity-constrained least squares [Kim et al., 2008]
• $\min_{W \ge 0,\ H \ge 0} \|A - WH\|_F$
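To make the factorization on this slide concrete, here is a minimal NumPy sketch. Note the swap: the slide cites alternating nonnegativity-constrained least squares, whereas this sketch uses the simpler Lee-Seung multiplicative updates, which target the same Frobenius objective. The function name `nmf_multiplicative`, the toy matrix, and all parameter values are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def nmf_multiplicative(A, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate A (m x n, nonnegative) as W @ H with W, H >= 0.

    Uses multiplicative updates, NOT the ANLS algorithm cited on the slide.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps  # strictly positive start
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors nonnegative by construction.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Hypothetical 6-keyword x 4-document count matrix.
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
], dtype=float)

W, H = nmf_multiplicative(A, k=2)
rel_err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
print(f"relative error: {rel_err:.3f}")
```

Because the result depends on the random start, a fixed `seed` is used here so that repeated runs give the same factors.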
NMF as Topic Modeling
$A \approx WH$
• Topic: a distribution over keywords
• Document: a distribution over topics
[Figure: Documents 1 to 4, Topics 1 to 3, and the keywords brain, evolve, dna, genetic, gene, nerve, neuron, life, organism]
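As a sketch of how the two factors can be read as a topic model, the snippet below assumes A is a keyword-by-document matrix, so that each column of W describes a topic over keywords and each column of H describes a document over topics. The vocabulary comes from the slide; the numeric values of W and H are made up for illustration.

```python
import numpy as np

vocab = ["brain", "evolve", "dna", "genetic", "gene",
         "nerve", "neuron", "life", "organism"]

# Hypothetical factor values (keywords x topics), for illustration only.
W = np.array([
    [0.0, 0.9, 0.0],  # brain
    [0.1, 0.0, 0.8],  # evolve
    [0.9, 0.0, 0.0],  # dna
    [0.8, 0.0, 0.1],  # genetic
    [0.7, 0.0, 0.0],  # gene
    [0.0, 0.7, 0.0],  # nerve
    [0.0, 0.8, 0.0],  # neuron
    [0.0, 0.1, 0.7],  # life
    [0.0, 0.0, 0.6],  # organism
])
# Hypothetical factor values (topics x documents).
H = np.array([
    [0.8, 0.1, 0.0, 0.3],
    [0.1, 0.9, 0.0, 0.3],
    [0.1, 0.0, 1.0, 0.4],
])

# Normalize columns so each one sums to 1 and reads as a distribution.
topic_word = W / W.sum(axis=0, keepdims=True)
doc_topic = H / H.sum(axis=0, keepdims=True)


def top_keywords(topic, n=3):
    """Return the n highest-weighted keywords of one topic."""
    order = np.argsort(topic_word[:, topic])[::-1][:n]
    return [vocab[i] for i in order]


topics = [top_keywords(t) for t in range(W.shape[1])]
dominant = doc_topic.argmax(axis=0)  # strongest topic per document
print(topics)
print(dominant.tolist())
```

The column normalization is only a reading aid; NMF itself does not require the columns to sum to one.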
Advantages of NMF in Visual Analytics
• Reliable algorithmic behaviors
• Flexible support for user interactions
NMF vs. LDA: Consistency from Multiple Runs
[Figure: documents' topical membership changes among 10 runs. Panels: InfoVis/VAST paper data set, 20 Newsgroups data set]
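The slide does not say how membership changes across runs were counted. One possible way, sketched here purely as an assumption, is to compare two runs by the fraction of document pairs that share a topic in one run but not in the other; this avoids having to match topic ids between runs. The function name and the toy labels are hypothetical.

```python
from itertools import combinations


def comembership_disagreement(labels_a, labels_b):
    """Fraction of document pairs grouped together in one run but not the other."""
    pairs = list(combinations(range(len(labels_a)), 2))
    differing = sum(
        (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return differing / len(pairs)


run1 = [0, 0, 1, 1, 2]
run2 = [2, 2, 0, 0, 1]  # same grouping as run1, topic ids permuted
run3 = [0, 1, 1, 1, 2]  # document 1 moved to another topic

print(comembership_disagreement(run1, run2))
print(comembership_disagreement(run1, run3))
```

A value of zero means the two runs group the documents identically, regardless of how the topics are numbered.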
NMF vs. LDA: Empirical Convergence
[Figure: documents' topical membership changes between iterations, InfoVis/VAST paper data set. Figure labels: 48 seconds, 10 minutes; NMF, LDA]
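A simple way to track the quantity named on this slide is to record each document's strongest topic after every iteration and count how many documents switched. The sketch below assumes topic ids stay fixed between iterations of a single run; the function name and the label history are hypothetical.

```python
def membership_change_rate(prev_labels, curr_labels):
    """Fraction of documents whose assigned topic differs between two iterations."""
    changed = sum(p != c for p, c in zip(prev_labels, curr_labels))
    return changed / len(curr_labels)


# Hypothetical per-iteration topic assignments for six documents.
history = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 1, 0, 1, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
]

rates = [membership_change_rate(a, b) for a, b in zip(history, history[1:])]
print(rates)
```

A rate that reaches and stays at zero indicates that the visible clustering has stopped changing, even if the numerical objective is still improving slightly.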
NMF vs. LDA: Topic Summary (Top Keywords)
InfoVis/VAST paper data set
• Topics are more consistent in NMF than in LDA.
• Topic quality is comparable between NMF and LDA.
Advantages of NMF in Visual Analytics
• Reliable algorithmic behaviors
• Flexible support for user interactions
Weakly Supervised NMF [Choo et al., DMKD, accepted with revision]
$$\min_{W \ge 0,\ H \ge 0} \|A - WH\|_F^2 + \alpha \|(W - W_r) M_W\|_F^2 + \beta \|M_H (H - D_H H_r)\|_F^2$$
• $W_r$, $H_r$: reference matrices for W and H
• $M_W$, $M_H$: diagonal matrices for weighting/masking columns/rows of W and H
• Provides a flexible yet intuitive means for user interaction.
• Maintains the same computational complexity as the original NMF.
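The objective on this slide can be evaluated term by term, which shows how the masking matrices restrict supervision to selected topics. The sketch below only computes the objective value; it does not solve the minimization. The function name `ws_nmf_objective` and the toy matrices are assumptions, and D_H, which the slide does not define, is simply taken here to be a k x k matrix set to the identity.

```python
import numpy as np


def ws_nmf_objective(A, W, H, W_r, H_r, M_W, M_H, D_H, alpha, beta):
    """Value of the weakly supervised NMF objective as written on the slide."""
    fit = np.linalg.norm(A - W @ H, "fro") ** 2
    w_penalty = np.linalg.norm((W - W_r) @ M_W, "fro") ** 2
    h_penalty = np.linalg.norm(M_H @ (H - D_H @ H_r), "fro") ** 2
    return fit + alpha * w_penalty + beta * h_penalty


# Hypothetical example: 3 keywords, 2 topics, 2 documents.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = A.copy()
H = np.eye(2)  # so A == W @ H and the fit term is zero

W_r = np.zeros_like(W)
W_r[:, 0] = [2.0, 0.0, 1.0]  # user-supplied reference for topic 0 only
H_r = np.zeros_like(H)

M_W = np.diag([1.0, 0.0])  # supervise column (topic) 0 of W, ignore topic 1
M_H = np.zeros((2, 2))     # no supervision on H
D_H = np.eye(2)            # assumed identity; not defined on the slide

objective = ws_nmf_objective(A, W, H, W_r, H_r, M_W, M_H, D_H,
                             alpha=0.5, beta=0.5)
unsupervised = ws_nmf_objective(A, W, H, W_r, H_r, np.zeros((2, 2)), M_H, D_H,
                                alpha=0.5, beta=0.5)
print(objective, unsupervised)
```

Setting a diagonal entry of M_W or M_H to zero removes the corresponding topic from the penalty, so only the topics a user has actually touched are pulled toward their references.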
UTOPIAN: User-Driven Topic Modeling Based on Interactive NMF
[Figure panels: Topic merging, Keyword-induced topic creation, Doc-induced topic creation, Topic splitting]
UTOPIAN Overview
Supervised t-distributed stochastic neighbor embedding (t-SNE)
User interactions supported
• Keyword refinement
• Topic merging/splitting
• Keyword-/document-induced topic creation
Real-time interaction via PIVE (Per-Iteration Visualization Environment)
[Figure panels: Keyword-induced topic creation, Topic merging, Doc-induced topic creation, Topic splitting]
Supervised t-SNE
Original t-SNE
• Documents are often too noisy to work with.
Supervised t-SNE
• $d(x_i, x_j) \leftarrow \alpha \cdot d(x_i, x_j)$ if $x_i$ and $x_j$ belong to the same topic cluster.
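The distance adjustment on this slide can be sketched as a preprocessing step on a pairwise distance matrix. The slide does not give a value for α; the sketch assumes 0 < α < 1 so that documents in the same topic cluster are pulled closer. The function name and the toy data are hypothetical.

```python
import numpy as np


def supervise_distances(D, labels, alpha=0.5):
    """Scale d(x_i, x_j) by alpha when documents i and j share a topic cluster."""
    labels = np.asarray(labels)
    same_cluster = labels[:, None] == labels[None, :]
    return np.where(same_cluster, alpha * D, D)


# Hypothetical pairwise distances for three documents.
D = np.array([
    [0.0, 2.0, 4.0],
    [2.0, 0.0, 6.0],
    [4.0, 6.0, 0.0],
])
labels = [0, 0, 1]  # documents 0 and 1 share a topic cluster

D_supervised = supervise_distances(D, labels, alpha=0.5)
print(D_supervised)
```

The adjusted matrix can then be handed to any t-SNE implementation that accepts precomputed distances, for example scikit-learn's `TSNE` with `metric="precomputed"`.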
PIVE (Per-Iteration Visualization Environment) for Real-time Interaction [Choo et al., under revision]
[Figure panels: Standard approach, PIVE approach]
Usage Scenario: Hyundai Genesis Review Data
[Figure panels: Initial result, After interaction]
Summary
• Presented UTOPIAN, a user-driven topic modeling system based on interactive NMF.
• Highlighted the advantages of NMF over LDA in visual analytics.
  • Reliable algorithmic behaviors
    • Consistency from multiple runs
    • Early empirical convergence
  • Flexible support for user interactions
    • Keyword refinement
    • Topic merging/splitting
    • Keyword-/document-induced topic creation
More in the Paper & On-going Work
• A general taxonomy of user interactions with computational methods
  • Keyword-based vs. document-based
  • Template-based vs. from-scratch-based
• Algorithmic details about supported user interactions
• Implementation details
• More usage scenarios
On-going Work
• Scaling up the system with parallel distributed NMF
Thank you!
Jaegul Choo
jaegul.choo@cc.gatech.edu
http://www.cc.gatech.edu/~joyfull/
http://tinyurl.com/UTOPIAN2013
For more details, please find me at ‘Meet the Candidate’, A601 + A602, 6 PM today.
[Figure panels: Topic merging, Keyword-induced topic creation, Doc-induced topic creation, Topic splitting]