Jaegul Choo*, Barry L. Drake†, and Haesun Park* *Georgia Institute of Technology

Visual Analytics for Interactive Exploration of Large-Scale Documents via Nonnegative Matrix Factorization • Jaegul Choo*, Barry L. Drake†, and Haesun Park* • *Georgia Institute of Technology, †Georgia Tech Research Institute • Big Data Innovators Gathering (BIG) 2014

What is Visual Analytics? • Data Mining • Visualization

What is Visual Analytics? Leveraging Both Worlds • Visual Analytics: Data Mining + Visualization

Visual Analytics for Large-Scale Documents • UTOPIAN: User-driven Topic Modeling based on Interactive NMF (topic merging, keyword-induced topic creation, doc-induced topic creation, topic splitting) • VisIRR: Information Retrieval and Personalized Recommender System

Motivation: Too Many Documents to Read • Product reviews: Which tablet to buy? iPad (2,000 reviews) vs. Galaxy Tab (1,300 reviews) • Research papers: Which sub-area in data mining to focus on? Thousands of new papers every year • Patent search • Many other applications

Topic Modeling: Summarizing Documents • Documents: Document 1, Document 2, Document 3, Document 4, … • Keywords: brain, evolve, dna, gene, nerve, neuron, life, organism, …

Topic Modeling: Summarizing Documents • Documents: Document 1, Document 2, Document 3, Document 4, … • Topics: Topic 1, Topic 2, Topic 3 • Topic: distribution over keywords (brain, evolve, dna, gene, nerve, neuron, life, organism, …)

Topic Modeling: Summarizing Documents • Documents: Document 1, Document 2, Document 3, Document 4, … • Document: distribution over topics (Topic 1, Topic 2, Topic 3) • Topic: distribution over keywords (brain, evolve, dna, gene, nerve, neuron, life, organism, …)

Nonnegative Matrix Factorization (NMF) • Low-rank approximation via matrix factorization: $A \approx WH$ • $\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F$ • Why nonnegativity constraints? Better interpretation (vs. better approximation, e.g., SVD)
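A minimal sketch of the factorization above, assuming the classic multiplicative-update rule rather than the block-coordinate descent framework the talk mentions for its own solver. The function name `nmf_multiplicative`, the rank `k`, the iteration count, and the toy matrix are illustrative assumptions, not details from the talk.

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate a nonnegative matrix A (m x n) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates: nonnegative factors stay nonnegative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy input: an exactly rank-2 nonnegative matrix.
rng = np.random.default_rng(1)
A = rng.random((8, 2)) @ rng.random((2, 6))
W, H = nmf_multiplicative(A, k=2)
rel_err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
print(f"relative error: {rel_err:.4f}")
```

Because both factors are nonnegative, each entry of A is rebuilt as a sum of nonnegative parts, which is what makes the factors easier to read than the signed factors of an SVD.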

NMF as Topic Modeling • $A \approx WH$ • Documents: Document 1, Document 2, Document 3, Document 4, … • Document: distribution over topics (H; Topic 1, Topic 2, Topic 3) • Topic: distribution over keywords (W; brain, evolve, dna, gene, nerve, neuron, life, organism, …)
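To make the reading of the two factors concrete, here is a small sketch with hand-made numbers. It assumes A is a keyword-by-document matrix, so each column of W scores keywords for one topic and each column of H weights topics for one document. The factor values and the helper names `top_keywords` and `topic_proportions` are hypothetical.

```python
import numpy as np

vocab = ["brain", "evolve", "dna", "gene", "nerve", "neuron", "life", "organism"]

# Hypothetical factors: W is keywords x topics, H is topics x documents.
W = np.array([
    [0.9, 0.0, 0.0],   # brain
    [0.0, 0.1, 0.8],   # evolve
    [0.0, 0.9, 0.1],   # dna
    [0.0, 0.8, 0.2],   # gene
    [0.7, 0.0, 0.0],   # nerve
    [0.8, 0.0, 0.0],   # neuron
    [0.1, 0.2, 0.7],   # life
    [0.0, 0.1, 0.9],   # organism
])
H = np.array([
    [2.0, 0.0, 0.5, 0.0],
    [0.0, 3.0, 0.5, 1.0],
    [0.0, 1.0, 1.0, 3.0],
])

def top_keywords(W, vocab, n=3):
    """Return the n highest-weighted keywords for each topic (column of W)."""
    order = np.argsort(-W, axis=0)[:n]   # keyword indices by descending weight, per topic
    return [[vocab[i] for i in order[:, t]] for t in range(W.shape[1])]

def topic_proportions(H):
    """Normalize each document column of H so its topic weights sum to 1."""
    return H / H.sum(axis=0, keepdims=True)

topics = top_keywords(W, vocab)
props = topic_proportions(H)
print(topics)
print(props.round(2))
```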

Why NMF (instead of LDA)? Consistency from Multiple Runs • [Figure: documents’ topical membership changes among 10 runs; panels: 20 newsgroup data set, InfoVis/VAST paper data set]

Why NMF (instead of LDA)? Empirical Convergence • InfoVis/VAST paper data set • [Figure: documents’ topical membership changes between iterations; labels: 10 minutes, 48 seconds; NMF, LDA]

NMF vs. LDA: Topic Summary (Top Keywords) • InfoVis/VAST paper data set • Topics are more consistent in NMF than in LDA. • Topic quality is comparable between NMF and LDA.

UTOPIAN: User-Driven Topic Modeling Based on Interactive NMF [Choo et al., TVCG’13] • Keyword-induced topic creation • Topic merging • Doc-induced topic creation • Topic splitting

Visualization Example: Car Reviews • Topic summaries are NOT perfect. • UTOPIAN allows user interactions for improving them.

Weakly Supervised NMF: Supporting User Interactions • Weakly supervised NMF [Choo et al., DMKD, accepted with rev.]: $\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F^2 + \alpha \|(W - W_r) M_W\|_F^2 + \beta \|M_H (H - D_H H_r)\|_F^2$ • $W_r$, $H_r$: reference matrices for W and H (user input) • $M_W$, $M_H$: diagonal matrices for weighting/masking columns and rows of W and H • Algorithm: block-coordinate descent framework
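The following sketch only evaluates the objective as written on the slide; it does not implement the block-coordinate descent solver. The matrix shapes (W is m x k, H is k x n, and M_W, M_H, D_H are k x k diagonal matrices) are assumptions inferred from how the terms multiply. D_H is not defined on the slide and is simply taken as an input here. The function name and the toy values are hypothetical.

```python
import numpy as np

def ws_nmf_objective(A, W, H, W_r, H_r, M_W, M_H, D_H, alpha, beta):
    """Evaluate ||A-WH||_F^2 + alpha*||(W-W_r)M_W||_F^2 + beta*||M_H(H-D_H H_r)||_F^2."""
    fit = np.linalg.norm(A - W @ H, "fro") ** 2
    w_penalty = np.linalg.norm((W - W_r) @ M_W, "fro") ** 2
    h_penalty = np.linalg.norm(M_H @ (H - D_H @ H_r), "fro") ** 2
    return fit + alpha * w_penalty + beta * h_penalty

# Toy setup: 4 keywords, 2 topics, 3 documents, with a perfect fit A = W @ H.
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]])
H = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]])
A = W @ H

# The user supplies a reference for topic 0 only; topic 1 is masked out by M_W.
W_r = W.copy()
W_r[0, 0] = 3.0          # differs from W in the supervised column
W_r[1, 1] = 5.0          # differs from W in the masked column, so it is ignored
M_W = np.diag([1.0, 0.0])

# No supervision on H in this example.
H_r = np.zeros_like(H)
M_H = np.zeros((2, 2))
D_H = np.eye(2)

obj = ws_nmf_objective(A, W, H, W_r, H_r, M_W, M_H, D_H, alpha=0.5, beta=1.0)
print(obj)
```

Only the supervised column of W contributes to the penalty, which is the masking role the slide gives to $M_W$.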

Interaction Demo Video: http://tinyurl.com/UTOPIAN2013 • InfoVis-VAST paper data • [Figure panels: before interaction; after topic splitting (triangle) and topic merging (circle)]

VisIRR: Information Retrieval and Personalized Recommender System

Features: Efficient Large-scale Data Processing • Document corpus: ~400,000 academic papers in CS • Data management: structured data (author, year, venue, keywords, citation/reference count); unstructured data (bag-of-words vectors of title, abstract, keywords); graph data (content, citation, and co-authorship) • Efficient data handling: dynamic loading from disk to memory via a cache-like strategy; scalable data expansion in O(n)
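One way to picture "dynamic loading from disk to memory via a cache-like strategy" is a least-recently-used (LRU) cache. The slide does not say which eviction policy VisIRR uses, so LRU is an assumption, and the class name `DocumentCache`, the loader callable, and the capacity are all hypothetical.

```python
from collections import OrderedDict

class DocumentCache:
    """Keep at most `capacity` documents in memory, evicting the least recently used."""

    def __init__(self, load_from_disk, capacity=1000):
        self._load = load_from_disk      # hypothetical callable: doc_id -> document
        self._capacity = capacity
        self._items = OrderedDict()
        self.disk_reads = 0

    def get(self, doc_id):
        if doc_id in self._items:
            self._items.move_to_end(doc_id)      # mark as most recently used
            return self._items[doc_id]
        doc = self._load(doc_id)                 # cache miss: go to disk
        self.disk_reads += 1
        self._items[doc_id] = doc
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)      # drop the least recently used entry
        return doc

# A dictionary stands in for real disk storage.
fake_disk = {i: f"paper {i}" for i in range(10)}
cache = DocumentCache(fake_disk.__getitem__, capacity=2)
for doc_id in [0, 1, 0, 2, 1]:
    cache.get(doc_id)
print(cache.disk_reads)
```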

Features: Personalized Recommendation • Works from user preferences on documents • Preference scale of 1 (highly dislike) to 5 (highly like) • Various recommendation schemes, based on content, citation network, and co-authorship • Algorithm: preference propagation on a graph using a heat kernel, $r_\alpha = \alpha \sum_k (1 - \alpha)^k f W^k$ • $r_\alpha$ is a recommendation score vector with a control parameter $\alpha$, $f$ is a user-assigned rating, and $W$ is an input graph
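The propagation formula can be sketched as a truncated sum. This assumes f is a row vector of ratings, W is a row-normalized adjacency matrix, and the series is cut off after `n_terms` terms; none of these details are stated on the slide, and the function name and toy graph are hypothetical.

```python
import numpy as np

def propagate_preferences(f, W, alpha, n_terms=50):
    """Truncated sum r = alpha * sum_{k=0}^{n_terms-1} (1 - alpha)^k * f @ W^k."""
    term = np.asarray(f, dtype=float)   # holds f @ W^k, starting at k = 0
    r = np.zeros_like(term)
    weight = alpha                      # holds alpha * (1 - alpha)^k
    for _ in range(n_terms):
        r += weight * term
        term = term @ W
        weight *= 1.0 - alpha
    return r

# Toy path graph 0 - 1 - 2 with row-normalized edge weights.
W = np.array([
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
])
f = np.array([1.0, 0.0, 0.0])           # the user rated only item 0
r = propagate_preferences(f, W, alpha=0.5)
print(r.round(4))
```

Scores decay with graph distance from the rated item, and a larger α keeps more of the score on the rated item itself.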

VisIRR Demo: Citation-based Recommendation (http://tinyurl.com/VisIRR) • Item rated ‘highly like’: ‘Automatic Classification System for the Diagnosis of Alzheimer Disease Using Component-Based SVM Aggregations’ • Most of the recommended items are highly cited. • Computational zoom-in shows sub-areas relevant to the article.

VisIRR Demo: Co-authorship-based Recommendation (http://tinyurl.com/VisIRR) • Item rated ‘highly like’: ‘Automatic Classification System for the Diagnosis of Alzheimer Disease Using Component-Based SVM Aggregations’ • It shows other research areas of this paper's authors. • [Figure panels: retrieved + recommended items; computational zoom-in on recommended items]

Interested in learning Micro-Financing Analysis in Kiva.org? Check out my presentation at Room 104, Wed 4pm

Thank you! Jaegul Choo, jaegul.choo@cc.gatech.edu (Currently on the Academic Job Market) • UTOPIAN: User-driven Topic Modeling based on Interactive NMF (topic merging, keyword-induced topic creation, doc-induced topic creation, topic splitting) • VisIRR: Information Retrieval and Personalized Recommender System • Micro-Financing Analysis in Kiva.org: Room 104, Wed 4pm • Selected Papers • Choo et al., Document Topic Modeling and Discovery in Visual Analytics via Nonnegative Matrix Factorization, TVCG, 2013 • Choo et al., VisIRR: Interactive Visual Information Retrieval and Recommendation for Large-scale Document Data, Tech Report, Georgia Tech, 2013

UTOPIAN: Interactions and Key Techniques • Visualization: supervised t-SNE • Topic modeling: NMF • Interaction: refining topic keywords, merging topics, splitting a topic, creating new topics from seed documents/keywords • Weakly-supervised NMF • Per-iteration Visualization Framework

Supervised t-SNE: Visualizing Documents • Original t-SNE: documents do not have clear topic clusters. • Supervised t-SNE: $d(x_i, x_j) \leftarrow \alpha \cdot d(x_i, x_j)$ if $x_i$ and $x_j$ belong to the same topic (e.g., $\alpha = 0.3$)
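The distance rescaling can be sketched as follows, assuming Euclidean distances and one topic label per document (for example, the topic with the largest weight in NMF). The function name and the toy points are hypothetical, and only the rescaling step is shown, not t-SNE itself.

```python
import numpy as np

def supervised_distances(X, topic_labels, alpha=0.3):
    """Pairwise Euclidean distances, shrunk by alpha for same-topic pairs."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    labels = np.asarray(topic_labels)
    same_topic = labels[:, None] == labels[None, :]
    return np.where(same_topic, alpha * D, D)

# Three toy documents in 2-D; the first two share a topic.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
labels = [0, 0, 1]
D = supervised_distances(X, labels)
print(D)
```

The resulting matrix could then be handed to a t-SNE implementation that accepts precomputed distances, so that same-topic documents are pulled closer together in the layout.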

PIVE (Per-iteration Visualization Environment): Integration Methodology of Iterative Methods for Real-Time Interactive Visualization [Choo et al., VAST’14, to submit] • [Figure panels: standard approach; PIVE approach]
