Visual Analytics for Interactive Exploration of Large-Scale Documents via Nonnegative Matrix Factorization. Jaegul Choo*, Barry L. Drake†, and Haesun Park* (*Georgia Institute of Technology, †Georgia Tech Research Institute). Big Data Innovators Gathering (BIG) 2014.
What is Visual Analytics? Data Mining, Visualization
What is Visual Analytics? Leveraging Both Worlds: Visual Analytics = Data Mining + Visualization
Visual Analytics for Large-Scale Documents
• UTOPIAN: User-driven Topic Modeling based on Interactive NMF (topic merging, keyword-induced topic creation, doc-induced topic creation, topic splitting)
• VisIRR: Information Retrieval and Personalized Recommender System
Motivation: Too Many Documents to Read
• Product reviews: Which tablet to buy? iPad (2,000 reviews) vs. Galaxy Tab (1,300 reviews)
• Research papers: Which sub-area in data mining to focus on? Thousands of new papers every year
• Patent search
• Many other applications
Topic Modeling: Summarizing Documents
• Document: distribution over topics (Document 1, Document 2, Document 3, Document 4, …)
• Topic: distribution over keywords (Topic 1, Topic 2, Topic 3)
• Example keywords: brain, evolve, dna, gene, nerve, neuron, life, organism, …
Nonnegative Matrix Factorization (NMF)
• Low-rank approximation via matrix factorization: $A \approx WH$
• $\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F$
• Why nonnegativity constraints? Better interpretation (vs. better approximation, e.g., SVD)
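To make the optimization problem above concrete, here is a minimal sketch in plain Python. It assumes the classic multiplicative-update rules as the solver, which the slides do not specify, and uses a tiny hypothetical keyword-by-document count matrix `A`. The helper names (`nmf`, `frob_error`, `matmul`, `transpose`) are illustrative only.

```python
import random

def transpose(X):
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    """Plain-Python product of two matrices stored as lists of rows."""
    Yt = transpose(Y)
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

def frob_error(A, W, H):
    """Frobenius norm ||A - WH||_F."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb)) ** 0.5

def nmf(A, k, iters=200, seed=0, eps=1e-9):
    """Rank-k NMF by multiplicative updates (an assumed solver, not the talk's)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    # Strictly positive random start keeps every update well defined.
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T A) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[h * p / (q + eps) for h, p, q in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (A H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[w * p / (q + eps) for w, p, q in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H

# Hypothetical counts: rows are keywords, columns are four documents.
A = [[3, 2, 0, 0],   # brain
     [0, 0, 2, 1],   # evolve
     [0, 0, 3, 2],   # dna
     [0, 0, 3, 3],   # gene
     [2, 2, 0, 0],   # nerve
     [3, 3, 0, 0],   # neuron
     [0, 0, 1, 2],   # life
     [0, 0, 2, 2]]   # organism

W0, H0 = nmf(A, k=2, iters=0)    # random starting point
W, H = nmf(A, k=2, iters=200)    # factors after 200 updates
print(frob_error(A, W0, H0), frob_error(A, W, H))
```

Because the updates only multiply nonnegative numbers, `W` and `H` stay nonnegative without an explicit projection step. For a large corpus, a sparse-matrix library and a faster solver would be the practical choice.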
NMF as Topic Modeling
• $A \approx WH$
• H: documents, each a distribution over topics (Document 1, Document 2, Document 3, Document 4, …)
• W: topics, each a distribution over keywords (Topic 1, Topic 2, Topic 3; e.g., brain, evolve, dna, gene, nerve, neuron, life, organism, …)
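As a small illustration of reading topics off the factors, the sketch below assumes that rows of `A` are keywords and columns are documents, so each column of `W` is a topic and each column of `H` is a document. The numbers in `W` and `H` are made up for the example, and the function names are hypothetical.

```python
vocab = ["brain", "evolve", "dna", "gene", "nerve", "neuron", "life", "organism"]

# Hypothetical factors: W is keywords x topics, H is topics x documents.
W = [[0.9, 0.0],
     [0.0, 0.4],
     [0.0, 0.8],
     [0.1, 0.9],
     [0.7, 0.0],
     [0.8, 0.0],
     [0.0, 0.5],
     [0.1, 0.6]]
H = [[2.0, 1.5, 0.0, 0.5],
     [0.0, 0.5, 3.0, 1.5]]

def top_keywords(W, vocab, topic, n=3):
    """Keywords with the largest weights in one column (topic) of W."""
    ranked = sorted(range(len(vocab)), key=lambda i: W[i][topic], reverse=True)
    return [vocab[i] for i in ranked[:n]]

def topic_proportions(H, doc):
    """Normalize one column (document) of H so that it sums to 1."""
    col = [row[doc] for row in H]
    total = sum(col)
    return [v / total for v in col] if total > 0 else col

print(top_keywords(W, vocab, topic=0))   # ['brain', 'neuron', 'nerve']
print(top_keywords(W, vocab, topic=1))   # ['gene', 'dna', 'organism']
print(topic_proportions(H, doc=3))       # [0.25, 0.75]
```

NMF factors are not probabilities by construction, so the normalization is what turns a column into a "distribution" in the sense used on the slide.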
Why NMF (instead of LDA)? Consistency from Multiple Runs
• Data sets: 20 Newsgroups, InfoVis/VAST papers
• [Figure: documents' topical membership changes among 10 runs]
Why NMF (instead of LDA)? Empirical Convergence
• Data set: InfoVis/VAST papers
• [Figure: documents' topical membership changes between iterations, NMF vs. LDA; time annotations as extracted: 10 minutes 48 seconds]
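One way to compute a "membership changes between iterations" count is sketched below. It assumes each document is assigned to its largest-weight topic, which is a common convention but is not stated on the slides. The function names and matrices are hypothetical.

```python
def memberships(H):
    """Hard topic label per document: index of the largest entry in each column of H."""
    return [max(range(len(H)), key=lambda t: H[t][j]) for j in range(len(H[0]))]

def membership_changes(H_prev, H_next):
    """Number of documents whose hard topic label differs between two iterates."""
    return sum(p != q for p, q in zip(memberships(H_prev), memberships(H_next)))

# Hypothetical topics x documents matrices from two consecutive iterations.
H_prev = [[0.9, 0.2, 0.6],
          [0.1, 0.8, 0.4]]
H_next = [[0.8, 0.3, 0.3],
          [0.2, 0.7, 0.7]]

print(membership_changes(H_prev, H_next))  # 1 (only the third document switches)
```

Comparing separate runs rather than consecutive iterations needs one extra step: topic indices are arbitrary per run, so the topics must be matched across runs before labels can be compared.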
NMF vs. LDA: Topic Summary (Top Keywords)
• Data set: InfoVis/VAST papers
• Topics are more consistent in NMF than in LDA.
• Topic quality is comparable between NMF and LDA.
UTOPIAN: User-Driven Topic Modeling Based on Interactive NMF [Choo et al., TVCG'13]
• Keyword-induced topic creation
• Topic merging
• Doc-induced topic creation
• Topic splitting
Visualization Example: Car Reviews
• Topic summaries are NOT perfect.
• UTOPIAN allows user interactions for improving them.
Weakly Supervised NMF: Supporting User Interactions
• Weakly supervised NMF [Choo et al., DMKD, accepted with rev.]:
  $\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F^2 + \alpha \|(W - W_r) M_W\|_F^2 + \beta \|M_H (H - D_H H_r)\|_F^2$
• $W_r$, $H_r$: reference matrices for W and H (user input)
• $M_W$, $M_H$: diagonal matrices for weighting/masking columns and rows of W and H
• Algorithm: block-coordinate descent framework
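The sketch below evaluates the three terms of the objective above for given factors. It is only an illustration, not the block-coordinate descent solver. The slide does not define $D_H$, so the code assumes it is a diagonal scaling matrix. All diagonal matrices are passed as lists of their diagonal entries, and the function names are hypothetical.

```python
def sq_frob(X):
    """Squared Frobenius norm of a matrix stored as a list of rows."""
    return sum(v * v for row in X for v in row)

def ws_nmf_objective(A, W, H, W_r, H_r, m_w, m_h, d_h, alpha, beta):
    """||A - WH||_F^2 + alpha ||(W - W_r) M_W||_F^2 + beta ||M_H (H - D_H H_r)||_F^2.

    m_w, m_h, d_h hold the diagonals of M_W, M_H, D_H (D_H diagonal is an assumption).
    """
    m, k, n = len(W), len(H), len(H[0])
    WH = [[sum(W[i][t] * H[t][j] for t in range(k)) for j in range(n)]
          for i in range(m)]
    fit = sq_frob([[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, WH)])
    # Right-multiplying by M_W scales column t of (W - W_r) by m_w[t].
    w_pen = sq_frob([[(W[i][t] - W_r[i][t]) * m_w[t] for t in range(k)]
                     for i in range(m)])
    # Left-multiplying by M_H scales row t of (H - D_H H_r) by m_h[t].
    h_pen = sq_frob([[m_h[t] * (H[t][j] - d_h[t] * H_r[t][j]) for j in range(n)]
                     for t in range(k)])
    return fit + alpha * w_pen + beta * h_pen

# Tiny hand-checkable example: perfect fit, one masked-in topic per penalty.
I2 = [[1.0, 0.0], [0.0, 1.0]]
value = ws_nmf_objective(
    A=I2, W=I2, H=I2,
    W_r=[[0.5, 0.0], [0.0, 1.0]], H_r=[[1.0, 0.0], [0.0, 2.0]],
    m_w=[1.0, 0.0], m_h=[0.0, 1.0], d_h=[1.0, 1.0],
    alpha=2.0, beta=3.0,
)
print(value)  # 3.5 = 0 + 2 * 0.25 + 3 * 1
```

Setting an entry of `m_w` or `m_h` to zero switches off supervision for that topic, which is how a mask can restrict a user's edit to the topics the user actually touched.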
Interaction Demo Video: http://tinyurl.com/UTOPIAN2013
• Data set: InfoVis/VAST papers
• [Figure: before interaction vs. after topic splitting (triangle) and topic merging (circle)]
VisIRR: Information Retrieval and Personalized Recommender System
Features: Efficient Large-Scale Data Processing
• Document corpus: ~400,000 academic papers in CS
• Data management
  • Structured data: author, year, venue, keywords, citation/reference count
  • Unstructured data: bag-of-words vectors of title, abstract, keywords
  • Graph data: content, citation, and co-authorship
• Efficient data handling
  • Dynamic loading from disk to memory via a cache-like strategy
  • Scalable data expansion in O(n)
Features: Personalized Recommendation
• Works based on user preference on documents
  • Preference scale of 1 (highly dislike) to 5 (highly like)
  • Various recommendation schemes, based on content, citation network, and co-authorship
• Algorithm: preference propagation on graph using heat kernel
  • $r_\alpha = \alpha \sum_k (1 - \alpha)^k f W^k$
  • $r_\alpha$ is a recommendation score vector with a control parameter $\alpha$, $f$ is a user-assigned rating, and $W$ is an input graph.
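The propagation formula above can be sketched as a truncated sum. The code assumes that `f` is a row vector of ratings, that `W` is a row-normalized adjacency matrix, and that the series is cut off after `num_terms` terms. None of these details are given on the slide, and the names are hypothetical.

```python
def propagate(f, W, alpha, num_terms=50):
    """Approximate r = alpha * sum_k (1 - alpha)^k f W^k with the first num_terms terms."""
    n = len(f)
    term = list(f)        # current f W^k, starting at k = 0
    weight = alpha        # current alpha * (1 - alpha)^k
    scores = [0.0] * n
    for _ in range(num_terms):
        scores = [s + weight * t for s, t in zip(scores, term)]
        # Row vector times matrix: advance f W^k to f W^(k+1).
        term = [sum(term[i] * W[i][j] for i in range(n)) for j in range(n)]
        weight *= 1 - alpha
    return scores

# Hypothetical 3-document chain graph 0 - 1 - 2, rows normalized to sum to 1.
W = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
f = [1.0, 0.0, 0.0]       # the user rated only document 0

scores = propagate(f, W, alpha=0.5)
print(scores)             # document 1 (a direct neighbor) outranks document 2
```

A larger `alpha` keeps scores concentrated near the rated items, while a smaller `alpha` lets preference spread further along the content, citation, or co-authorship graph.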
VisIRR Demo: Citation-based Recommendation (http://tinyurl.com/VisIRR)
• Item rated 'highly like': 'Automatic Classification System for the Diagnosis of Alzheimer Disease Using Component-Based SVM Aggregations'
• Most of the recommended items are highly cited.
• Computational zoom-in shows sub-areas relevant to the article.
VisIRR Demo: Co-authorship-based Recommendation (http://tinyurl.com/VisIRR)
• Item rated 'highly like': 'Automatic Classification System for the Diagnosis of Alzheimer Disease Using Component-Based SVM Aggregations'
• The recommendations show other research areas of this paper's authors.
• [Figure: retrieved + recommended items; computational zoom-in on recommended items]
Interested in learning about Micro-Financing Analysis in Kiva.org? Check out my presentation in Room 104, Wed 4pm.
Thank you!
Jaegul Choo, jaegul.choo@cc.gatech.edu (Currently on the Academic Job Market)
• UTOPIAN: User-driven Topic Modeling based on Interactive NMF (topic merging, keyword-induced topic creation, doc-induced topic creation, topic splitting)
• VisIRR: Information Retrieval and Personalized Recommender System
• Micro-Financing Analysis in Kiva.org: Room 104, Wed 4pm
Selected Papers
• Choo et al., Document Topic Modeling and Discovery in Visual Analytics via Nonnegative Matrix Factorization, TVCG, 2013
• Choo et al., VisIRR: Interactive Visual Information Retrieval and Recommendation for Large-scale Document Data, Tech Report, Georgia Tech, 2013
UTOPIAN: Interactions and Key Techniques
• Visualization: supervised t-SNE
• Topic modeling: NMF
• Interaction
  • Refining topic keywords
  • Merging topics
  • Splitting a topic
  • Creating new topics from seed documents/keywords
• Weakly-supervised NMF
• Per-iteration visualization framework
Supervised t-SNE: Visualizing Documents
• Supervised t-SNE: $d(x_i, x_j) \leftarrow \alpha \cdot d(x_i, x_j)$ if $x_i$ and $x_j$ belong to the same topic (e.g., $\alpha = 0.3$)
• Original t-SNE: documents do not have clear topic clusters.
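The distance rescaling above is simple enough to sketch directly. The example assumes Euclidean distance between document vectors and one hard topic label per document. The slide specifies neither, and the function name is hypothetical.

```python
import math

def supervised_distances(points, labels, alpha=0.3):
    """Pairwise distances, shrunk by alpha when two documents share a topic label."""
    n = len(points)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])   # Euclidean distance (assumed)
            if labels[i] == labels[j]:
                d *= alpha                        # same topic: pull the pair together
            D[i][j] = D[j][i] = d
    return D

# Hypothetical 2-D document vectors and topic labels.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
labels = [0, 0, 1]

D = supervised_distances(points, labels)
print(D[0][1], D[0][2], D[1][2])   # 1.5 10.0 5.0
```

The resulting matrix can then be handed to any t-SNE implementation that accepts precomputed distances, so the embedding step itself stays unchanged.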
PIVE (Per-Iteration Visualization Environment)
• Integration Methodology of Iterative Methods for Real-Time Interactive Visualization [Choo et al., VAST'14, to submit]
• [Figure: standard approach vs. PIVE approach]