A Two-Way Visualization Method for Clustered Data

A Two-Way Visualization Method for Clustered Data • Advisor: Dr. Hsu • Presenter: Keng-Wei Chang • Authors: Yehuda Koren and David Harel • ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Outline • Motivation • Objective • Introduction • Basic Notions • Computing The x-Coordinates • Computing The y-Coordinates • Result • Related Work • Conclusions • Personal Opinion

Motivation • A number of technological developments have led to an explosion of raw data that has to be analyzed • We are especially interested in two families of tools in this domain • Clustering algorithms and data visualization methods

Objective • in this paper, we integrate the two approaches • hierarchical clustering depicted as a dendrogram • low-dimensional embedding

Introduction • A number of technological developments have led to an explosion of raw data that has to be analyzed • We are especially interested in two families of tools in this domain • Clustering algorithms and data visualization methods • Clustering methods can be broadly classified as • Hierarchical and partitional

Introduction • Our main interest here is hierarchical clustering • The clustering hierarchy is often visualized as a dendrogram • A full binary tree • has a significant disadvantage • does not provide exploratory visual representations of the data itself • another issue is that of cluster validity

Introduction • we are particularly interested in methods for achieving a low-dimensional embedding of data • principal component analysis (PCA) • multidimensional scaling (MDS) • force-directed placement • these solve some limitations of the dendrogram • but cannot utilize external clustering information

Introduction • for a demonstration of the relative merits of the two approaches • a dendrogram vs. a low-dimensional embedding

Introduction • in this paper, we integrate the two approaches • hierarchical clustering depicted as a dendrogram • low-dimensional embedding

Basic Notions • given data about n elements {1,…,n} • relationships between pairs of elements are described by • distances d_ij ≥ 0 or • similarities w_ij ≥ 0 • a 2-dimensional embedding of the data • is defined by two vectors x, y ∈ ℝ^n • the coordinates of element i are (x_i, y_i)

Computing The x-Coordinates • The embedding must place each element exactly below its corresponding leaf in the dendrogram • this means that the x-coordinate of each element must equal that of its corresponding leaf in the dendrogram • we therefore face the problem of • computing the x-coordinates of the dendrogram leaves • in a way that preserves the relationships among the data as much as possible

Computing The x-Coordinates • we exhaust all the existing methods, opting for a twofold process • find the best orientation of the dendrogram • this step determines the ordering of the leaves • decide on the exact gaps between consecutive leaves in the ordering

Dendrogram orientation • a dendrogram with n leaves has 2^(n-1) different orientations • example:
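The count of orientations can be checked on a small tree. The sketch below assumes a dendrogram stored as nested 2-tuples with integer leaves (a representation chosen here for illustration, not taken from the slides); each of the n-1 internal nodes can be flipped independently, which gives 2^(n-1) leaf orders.

```python
def orientations(tree):
    """Yield every left-to-right leaf order reachable by flipping internal nodes."""
    if not isinstance(tree, tuple):        # a leaf
        yield [tree]
        return
    for a in orientations(tree[0]):
        for b in orientations(tree[1]):
            yield a + b                    # children in their stored order
            yield b + a                    # this node flipped


tree = ((0, 1), (2, 3))                    # hypothetical 4-leaf dendrogram
orders = list(orientations(tree))
print(len(orders))                         # 2 ** (4 - 1) = 8
```

Every order keeps the leaves of each subtree contiguous, which is what "agreeing with the dendrogram" means for an ordering.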

Dendrogram orientation • one way of formally defining what should be considered a “good” ordering • associate a cost function with the dendrogram • such that finding the best ordering is equivalent to optimizing this function • an example is the classical minimum linear arrangement problem, which minimizes Σ_{i<j} w_ij · |x_i − x_j| over all permutations x of {1,…,n}
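As an illustration of such a cost function, the sketch below evaluates the standard minimum linear arrangement objective, the sum of w_ij · |x_i − x_j| over pairs i < j, for a given leaf order. The function name and the similarity matrix are made up for the example; leaves are assumed to be labelled 0..n-1.

```python
def arrangement_cost(order, w):
    """Sum of w[i][j] * |pos(i) - pos(j)| over all pairs i < j."""
    pos = {leaf: p for p, leaf in enumerate(order, start=1)}
    n = len(order)
    return sum(w[i][j] * abs(pos[i] - pos[j])
               for i in range(n) for j in range(i + 1, n))


# Hypothetical similarities: 0-1 and 2-3 are strongly related.
w = [[0, 5, 1, 0],
     [5, 0, 0, 1],
     [1, 0, 0, 5],
     [0, 1, 5, 0]]
print(arrangement_cost([0, 1, 2, 3], w))   # 14
print(arrangement_cost([0, 2, 1, 3], w))   # 22
```

Orders that keep strongly similar elements adjacent get a lower cost, which is why minimizing this function yields a "good" ordering.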

Dendrogram orientation • in our particular problem • we are also faced with an ordering task • finding a permutation of {1, …, n} • however, here we should not consider all possible permutations, but only those that agree with the dendrogram’s structure • this reduces the number of candidates from n! to 2^(n-1) • using dynamic programming, the running time is exponential in the dendrogram’s height, not in its size
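The restriction to dendrogram-compatible orders can be illustrated with a brute-force search. This is not the dynamic program mentioned above; it simply tries every orientation of a small tree and keeps the cheapest one under the minimum linear arrangement cost. The tree encoding (nested 2-tuples), the helper names and the weights are assumptions made for the example.

```python
from itertools import permutations


def orientations(tree):
    """Yield every leaf order obtainable by flipping internal nodes."""
    if not isinstance(tree, tuple):
        yield [tree]
        return
    for a in orientations(tree[0]):
        for b in orientations(tree[1]):
            yield a + b
            yield b + a


def cost(order, w):
    """Minimum-linear-arrangement cost of a leaf order."""
    pos = {leaf: p for p, leaf in enumerate(order)}
    return sum(w[i][j] * abs(pos[i] - pos[j])
               for i in pos for j in pos if i < j)


tree = ((0, 1), (2, 3))
w = [[0, 1, 5, 0],      # hypothetical similarities: 0 and 2 are strongly
     [1, 0, 0, 0],      # related although they sit in different subtrees
     [5, 0, 0, 1],
     [0, 0, 1, 0]]

candidates = list(orientations(tree))
best = min(candidates, key=lambda o: cost(o, w))
all_orders = list(permutations(range(4)))
print(len(candidates), "orientations vs", len(all_orders), "permutations")
print(best, cost(best, w))
```

The best orientation flips the left subtree so that leaves 0 and 2 end up adjacent. Exhaustive search is only viable for tiny trees, which is the reason a dynamic program is needed in practice.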

Dendrogram orientation • introduce an additional form of the cost function, one that is maximized rather than minimized

Dendrogram orientation • given an ordered dendrogram T • a node v • Leaves(v): the set of leaves in the subtree rooted at v • let x be the ordering on the leaves • let S be Leaves(v) • L be the set of leaves to the left of S • R be the set of leaves to the right of S • if |L| = l, |S| = s, we have x(L) = {1,…,l}, x(S) = {l+1,…,l+s}, x(R) = {l+s+1,…,n}
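The sets L, S and R can be made concrete on a toy tree. The sketch assumes an ordered dendrogram encoded as nested 2-tuples with unique leaf labels (an encoding chosen for illustration); because the leaves of any subtree are contiguous in the ordering, S occupies positions l+1 through l+s.

```python
def leaves(node):
    """Leaves of the subtree rooted at node, in left-to-right order."""
    if not isinstance(node, tuple):
        return [node]
    return leaves(node[0]) + leaves(node[1])


def split_around(root, v):
    """Return (L, S, R) for node v of the ordered dendrogram root."""
    order = leaves(root)
    S = leaves(v)
    start = order.index(S[0])          # relies on unique leaf labels
    return order[:start], S, order[start + len(S):]


v = ('c', 'd')
root = (('a', 'b'), (v, 'e'))          # hypothetical ordered dendrogram
L, S, R = split_around(root, v)
l, s, n = len(L), len(S), len(leaves(root))

x_L = list(range(1, l + 1))            # positions 1..l
x_S = list(range(l + 1, l + s + 1))    # positions l+1..l+s
x_R = list(range(l + s + 1, n + 1))    # positions l+s+1..n
print(L, S, R)                         # ['a', 'b'] ['c', 'd'] ['e']
print(x_L, x_S, x_R)                   # [1, 2] [3, 4] [5]
```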

Dendrogram orientation • a key concept of the algorithm is the • local arrangement cost, defined as: • if |L| = l, |S| = s, we have x(L) = {1,…,l}, • x(S) = {l+1,…,l+s}, x(R) = {l+s+1,…,n}

Dendrogram orientation • two additional related terms will be used • another term that will be used in the algorithm

Determining coordinates of the leaves • computing the exact gap between each two consecutive leaves • example:

Determining coordinates of the leaves • a better approach is to take a weighted average over all influenced leaf pairs

Computing The y-Coordinates • Principal component analysis • Classical multidimensional scaling • Eigen-projection • Stress minimization
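To give a feel for one of these options, here is a minimal sketch of plain classical multidimensional scaling that produces a single coordinate per element from a distance matrix. It is a generic textbook version, written in pure Python with power iteration; it does not take the already-fixed x-coordinates into account, so it should be read as background rather than as the procedure used in the paper. The function name and the sample points are made up.

```python
def classical_mds_1d(d, iters=200):
    """One coordinate per element from distance matrix d (classical MDS)."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    tot = sum(row) / n
    # Double centering: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row[i] - row[j] + tot) for j in range(n)]
         for i in range(n)]
    # Power iteration for the dominant eigenvector of B
    v = [float(i + 1) for i in range(n)]
    for _ in range(iters):
        u = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(c * c for c in u) ** 0.5
        v = [c / norm for c in u]
    # Rayleigh quotient gives the eigenvalue; clamp in case it is negative
    lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
    return [c * max(lam, 0.0) ** 0.5 for c in v]


points = [0.0, 1.0, 3.0]                       # hypothetical 1-D data
d = [[abs(p - q) for q in points] for p in points]
y = classical_mds_1d(d)
print([round(c, 3) for c in y])
```

For distances that come from points on a line, the recovered coordinates reproduce the original gaps up to a shift and a possible sign flip.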

Result • Odors dataset • consists of 30 volatile odorous pure chemicals • contains 262 elements, natural clusters: 30 • a UPGMA agglomerative clustering is used to construct the dendrogram
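UPGMA is average-linkage agglomerative clustering: at each step the two clusters with the smallest mean pairwise distance between their members are merged. A naive sketch, assuming a full symmetric distance matrix and returning the dendrogram as nested 2-tuples of leaf indices (function name and data are illustrative, and the Odors data itself is not reproduced here):

```python
def upgma(d):
    """Average-linkage clustering; returns a nested-tuple dendrogram."""
    clusters = {i: (i, [i]) for i in range(len(d))}   # id -> (tree, members)
    next_id = len(d)

    def avg(a, b):
        # Mean distance over all member pairs of the two clusters
        return sum(d[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        ids = sorted(clusters)
        a, b = min(((p, q) for k, p in enumerate(ids) for q in ids[k + 1:]),
                   key=lambda pq: avg(clusters[pq[0]][1], clusters[pq[1]][1]))
        (ta, ma), (tb, mb) = clusters.pop(a), clusters.pop(b)
        clusters[next_id] = ((ta, tb), ma + mb)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Hypothetical distances: {0, 1} and {2, 3} form two tight groups.
d = [[0, 1, 8, 9],
     [1, 0, 9, 10],
     [8, 9, 0, 2],
     [9, 10, 2, 0]]
print(upgma(d))        # ((0, 1), (2, 3))
```

This version recomputes every average from scratch and is only meant for small inputs; library implementations update the distances incrementally.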

Result • Iris dataset • an example of discriminant analysis • contains 150 elements, natural clusters: 3

Result • Gene expression data: CDC15-synchronized cell cycle • a much larger dataset of gene-expression data • contains 6113 elements

Related Work • TreeView • dendrogram over a color-coded matrix

Discussion • the method successfully integrates two key methods in exploratory data analysis • cluster analysis and low-dimensional embedding • two unique properties • Guaranteed separation between any kind of given clusters • The ability to deal with a predefined hierarchical clustering

Personal Opinion • Advantages • successfully integrates clustering with low-dimensional embedding • makes analysis more intuitive • Application • Clustering and analyzing real data • May help when clustering information is lacking • Limitations • cannot show the real shape of clusters
