Haesun Park Division of Computational Science and Engineering Georgia Institute of Technology

FODAVA-Lead Research Dimension Reduction and Data Reduction: Foundations for Interactive Visualization Haesun Park Division of Computational Science and Engineering Georgia Institute of Technology FODAVA Review Meeting, Dec. 2009

Challenges in Analyzing High Dimensional Massive Data on Visual Analytics Systems • Screen Space and Visual Perception: the low dimension of the display and the number of available pixels are fundamentally limiting constraints • High dimensional data: Effective dimension reduction • Large data sets: Informative representation of data • Speed: necessary for real-time, interactive use • Scalable algorithms • Adaptive algorithms Goal: Development of Fundamental Theory and Algorithms in Data Representations and Transformations to enable Visual Understanding

FODAVA-Lead Research Topics • Dimension Reduction • Dimension reduction with prior info/interpretability constraints • Manifold learning • Informative Presentation of Large Scale Data • Sparse recovery by L1 penalty • Clustering, semi-supervised clustering • Multi-resolution data approximation • Fast Algorithms • Large-scale optimization/matrix decompositions • Adaptive updating algorithms for dynamic and time-varying data, and interactive vis. • Data Fusion • Fusion of different types of data from various sources • Fusion of different uncertainty level • Integration with DAVA systems

FODAVA-Lead Presentations • H. Park – Overview of proposed FODAVA research, Introduction to FODAVA Test-bed, dimension reduction of clustered data for effective representation, application to text, image, and audio data sets • A. Gray – Nonlinear dimension reduction (manifold learning), fast data analysis algorithms, formulation of problems as large scale optimization problems (SDP) • V. Koltchinskii – Multiple kernel learning method for fusion of data with heterogeneous types, sparse representation • R. Monteiro – Convex optimization, SDP, novel approach for dimension reduction, compressed sensing and sparse representation • J. Stasko – Visual Analytics System demo, interplay between math/comp and interactive visualization

Test Bed for Visual Analytics of High Dimensional Massive Data • Open source software • Integrates results from mathematics, statistics, numerical algorithms/optimization across FODAVA teams • Easily accessible to a wide community of researchers • Makes theory/algorithms relevant and readily available to VA and applications community • Identifies effective methods for specific problems (evaluation) [Diagram labels: Test Bed, Applications, FODAVA Fundamental Research]

Modules in Data and Visual Analytics System for High Dimensional Massive Data • Data Representation & Transformation Tasks • Classification • Clustering • Regression • Dimension reduction • Density estimation • Retrieval of similar items • Automatic summarization • … [Diagram labels: Raw Data; Data in Input Space; Mathematical, Statistical, and Computational Methods; Analytical Reasoning; Visual Representation and Interaction]

Modules in FODAVA Test Bed Informative Representation and Transformation • Visual Representation • Dimension Reduction (2D/3D) • Temporal Trend • Uncertainty • Anomaly/Outlier • Causal relationship • Zoom in/out by dynamic updating … • Vector Rep. of Raw Data • Text • Image • Audio … • Clustering • Summarization • Regression • Multi-Resolution Data Reduction … Label Similarity Density Missing value … Interactive Analysis

Research in Data Representations and Transformations (by H. Park’s group) • 2D/3D Representation of Data with Prior Information (J. Choo, J. Kim, K. Balasubramanian) • Clustered Data: Two-stage dimension reduction for clustered data • Nonnegative Data: • Nonnegative Matrix Factorization (NMF) • Nonnegative Tensor Factorization (NTF) • Clustering and Classification (J. Kim, D. Kuang) • New clustering algorithms based on NMF • Semi-supervised clustering based on NMF • Sparse Representation of Data (J. Kim, V. Koltchinskii, R. Monteiro) • Sparse Solution for Regression • Sparse PCA • FODAVA Testbed Development (J. Choo, J. Kihm, H. Lee)

Nonnegativity Preserving Dimension Reduction: Nonnegative Matrix Factorization (NMF) (Paatero & Tapper 94, Lee & Seung, Nature 99, Pauca et al. SIAM DM 04, Hoyer 04, Lin 05, Berry 06, Kim and Park 06 Bioinformatics, Kim and Park 08 SIAM Journal on Matrix Analysis and Applications, …)
• Approximate $A \approx WH$ by solving $\min_{W \ge 0,\, H \ge 0} \| A - WH \|_F$
• Why Nonnegativity Constraints?
  • Better Approx. vs. Better Representation/Interpretation
  • Nonnegative Constraints often physically meaningful
  • Interpretation of analysis results possible
• Fastest Algorithm for NMF, with theoretical convergence (J. Kim and H. Park, ICDM 08)
• NMF/ANLS: Iterate the following with Active Set-type Method
  • fixing $W$, solve $\min_{H \ge 0} \| WH - A \|_F$
  • fixing $H$, solve $\min_{W \ge 0} \| H^T W^T - A^T \|_F$
• Sparse NMF can be used as a clustering algorithm
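The alternating scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes SciPy's general-purpose `scipy.optimize.nnls` as a stand-in for the active-set-type solver, solves each nonnegative least squares subproblem one column at a time, and uses hypothetical names (`nnls_columns`, `nmf_anls`, `n_iter`, `seed`) that do not come from the slides.

```python
import numpy as np
from scipy.optimize import nnls


def nnls_columns(C, B):
    """Solve min_{X >= 0} ||C X - B||_F, one column of B at a time."""
    X = np.zeros((C.shape[1], B.shape[1]))
    for j in range(B.shape[1]):
        X[:, j], _ = nnls(C, B[:, j])
    return X


def nmf_anls(A, k, n_iter=30, seed=0):
    """Alternating nonnegative least squares sketch for A ~ W H (illustrative)."""
    rng = np.random.default_rng(seed)
    W = rng.random((A.shape[0], k))       # random nonnegative start (an assumption)
    for _ in range(n_iter):
        H = nnls_columns(W, A)            # fixing W, solve for H >= 0
        W = nnls_columns(H.T, A.T).T      # fixing H, solve the transposed problem for W >= 0
    return W, H
```

Because each subproblem is solved exactly, the residual $\| A - WH \|_F$ cannot increase from one sweep to the next; a fixed iteration count is used here only to keep the sketch short, where a practical code would test a convergence criterion.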

2D Representation: Utilize Cluster Structure if Known [Figure: 2D representation of 700x1000 data with 7 clusters: LDA vs. SVD vs. PCA; panels: LDA+PCA(2), SVD(2), PCA(2)]

Optimal Dimension Reducing Transformation
High quality clusters have small $\mathrm{trace}(S_w)$ and large $\mathrm{trace}(S_b)$.
Want: $F$ s.t. $\min \mathrm{trace}(F^T S_w F)$ and $\max \mathrm{trace}(F^T S_b F)$
• $\max \mathrm{trace}\big((F^T S_w F)^{-1} (F^T S_b F)\big)$: LDA (Fisher 36, Rao 48), LDA/GSVD (Park, ..)
• $\max \mathrm{trace}(F^T S_b F)$ with $F^T F = I$: Orthogonal Centroid (Park et al. 03)
• $\max \mathrm{trace}(F^T (S_w + S_b) F)$ with $F^T F = I$: PCA (Hotelling 33)
• $\max \mathrm{trace}(F^T A A^T F)$ with $F^T F = I$: LSI (Deerwester et al. 90)
Can easily be non-linearized using kernel functions.
Optimal reduced dimension >> 3 in general.
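To make the trace criteria concrete, the following sketch computes the within-cluster and between-cluster scatter matrices and an orthonormal $F$ that maximizes $\mathrm{trace}(F^T S F)$ for a given symmetric $S$. It assumes data items are stored as columns of $A$ (consistent with the $A A^T$ term in the LSI criterion) and that the leading eigenvectors are taken as one maximizer; the function names are hypothetical and not from the slides.

```python
import numpy as np


def scatter_matrices(A, labels):
    """Within-cluster (S_w) and between-cluster (S_b) scatter; columns of A are items."""
    labels = np.asarray(labels)
    m = A.shape[0]
    c = A.mean(axis=1, keepdims=True)            # global centroid
    Sw = np.zeros((m, m))
    Sb = np.zeros((m, m))
    for g in np.unique(labels):
        Ag = A[:, labels == g]
        cg = Ag.mean(axis=1, keepdims=True)      # cluster centroid
        D = Ag - cg
        Sw += D @ D.T
        d = cg - c
        Sb += Ag.shape[1] * (d @ d.T)
    return Sw, Sb


def trace_maximizer(S, l):
    """Orthonormal F (m x l) maximizing trace(F^T S F): top-l eigenvectors of symmetric S."""
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1][:l]
    return vecs[:, order]
```

With these helpers, `trace_maximizer(Sw + Sb, l)` gives the PCA criterion and `trace_maximizer(Sb, l)` gives one solution of the $S_b$ criterion; the LDA criterion needs a generalized eigenvalue or GSVD solver and is not covered by this sketch.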

Two-stage Dimension Reduction for 2D Visualization of Clustered Data • LDA + LDA = Rank-2 LDA • LDA + PCA • OCM (Orthogonal Centroid Method) + PCA • OCM + Rank-2 PCA on SFb = Rank-2 PCA on Sb (In-Spire) • (J. Choo, S. Bohn, H. Park, VAST 09)
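One of the combinations listed, OCM + PCA, might be sketched as follows. This is an illustration under stated assumptions rather than the method of the cited paper: the first stage is taken to be an orthonormal basis of the cluster centroids obtained by a QR factorization, the second stage is ordinary PCA on the reduced data, items are columns of `A`, and the name `ocm_then_pca` is hypothetical.

```python
import numpy as np


def ocm_then_pca(A, labels, dim=2):
    """Two-stage sketch: centroid-based reduction to k dims, then PCA to `dim` dims.

    A is m x n with items as columns; assumes m >= k >= dim for k clusters.
    Returns (Z, F): Z holds the dim x n centered coordinates, F is the m x dim map.
    """
    labels = np.asarray(labels)
    groups = np.unique(labels)
    # Stage 1: orthonormal basis Q (m x k) for the span of the cluster centroids.
    C = np.column_stack([A[:, labels == g].mean(axis=1) for g in groups])
    Q, _ = np.linalg.qr(C)
    Y = Q.T @ A                                   # k x n reduced representation
    # Stage 2: PCA on the reduced data via SVD of the centered matrix.
    Yc = Y - Y.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Yc, full_matrices=False)
    F = Q @ U[:, :dim]                            # combined orthonormal transformation
    return U[:, :dim].T @ Yc, F
```

The second stage operates on a k x n matrix instead of the original m x n data, so it is cheap when the number of clusters k is much smaller than the original dimension m.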

2D Visualization: Newsgroups [Figure: 2D visualization of Newsgroups Data (21,347 dimensions, 770 items, 11 clusters); panels: Rank-2 LDA, LDA + PCA, OCM + PCA, Rank-2 PCA on Sb] Legend: g: talk.politics.guns, p: talk.politics.misc, c: soc.religion.christian, r: talk.religion.misc, p: comp.sys.ibm.pc.hardware, a: comp.sys.mac.hardware, y: sci.crypt, d: sci.med, e: sci.electronics, f: misc.forsale, b: rec.sport.baseball

2D Visualization of Clustered Text, Image, Audio Data [Figure: data sets: Medline Data (Text), Facial Data (Image), Spoken Letters (Audio); panel titles: PCA (three panels), LDA+PCA, Rank-2 LDA (two panels)] Legend: h: heart attack, c: colon cancer, o: oral cancer, d: diabetes, t: tooth decay

Visual Facial Recognizer: A Test Bed Application • Weizmann Face Data • (352 * 512 pixels each) x (28 persons * 52 images each) • Significant variations in angle, illumination, and facial expression • Problem • No data analytic algorithm alone is perfect. • e.g., Accuracy comparison

Visual Facial Recognizer: A Test Bed Application • Visually reduce human’s search space • → Efficiently utilize human visual recognition • e.g., Test bed visualization of Weizmann images using Rank-2 LDA

Summary / Future Research • Informative 2D/3D Representation of Data • Clustered Data: Two-stage dimension reduction methods • effective for a wide range of problems • Interpretable Dimension Reduction for nonnegative data: NMF • New clustering algorithms based on NMF • Semi-supervised clustering based on NMF • Extension to Tensors for Time-series Data • Customized Fast Algorithms for 2D/3D Reduction needed • Dynamic Updating methods for Efficient and Interactive Visualization • Sparse methods with L1 regularization • Sparse Solution for Regression • Sparse PCA • FODAVA Test bed Development
