
A Two-Way Visualization Method for Clustered Data

Advisor: Dr. Hsu. Presenter: Keng-Wei Chang. Authors: Yehuda Koren and David Harel. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.


Presentation Transcript


  1. A Two-Way Visualization Method for Clustered Data • Advisor: Dr. Hsu • Presenter: Keng-Wei Chang • Authors: Yehuda Koren and David Harel • ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

  2. Outline • Motivation • Objective • Introduction • Basic Notions • Computing The x-Coordinates • Computing The y-Coordinates • Result • Related Work • Conclusions • Personal Opinion

  3. Motivation • A number of technological developments have led to an explosion of raw data that has to be analyzed • We are especially interested in two families of tools in this domain • Clustering algorithms and data visualization methods

  4. Objective • in this paper, we integrate the two approaches • hierarchical clustering depicted as a dendrogram • low-dimensional embedding

  5. Introduction • A number of technological developments have led to an explosion of raw data that has to be analyzed • We are especially interested in two families of tools in this domain • Clustering algorithms and data visualization methods • Clustering methods can be broadly classified into two kinds • Hierarchical and partitional

  6. Introduction • Our main interest here is hierarchical clustering • The clustering hierarchy is often visualized as a dendrogram • A full binary tree • The dendrogram has a significant disadvantage • it does not provide an exploratory visual representation of the data itself • another issue is that of cluster validity

  7. Introduction • we are particularly interested in methods for achieving a low-dimensional embedding of data • principal component analysis (PCA) • multidimensional scaling (MDS) • force-directed placement • these solve some limitations of the dendrogram • but they cannot utilize external clustering information

  8. Introduction • for a demonstration of the relative merits of the two approaches • a dendrogram vs. a low-dimensional embedding

  9. Introduction • in this paper, we integrate the two approaches • hierarchical clustering depicted as a dendrogram • low-dimensional embedding

  10. Basic Notions • given data about n elements {1,…,n} • relationships between pairs of elements are described by • distances d_ij ≥ 0 or • similarities w_ij ≥ 0 • a 2-dimensional embedding of the data • is defined by two vectors x, y ∈ R^n • the coordinates of element i are (x_i, y_i)

  11. Computing The x-Coordinates • The embedding must place each element exactly below its corresponding leaf in the dendrogram • this means that the x-coordinate of each element must equal that of its corresponding leaf in the dendrogram • we therefore face the problem of • computing the x-coordinates of the dendrogram leaves • in a way that preserves the relationships among the data as much as possible

  12. Computing The x-Coordinates • rather than relying on any of the existing methods, we opt for a twofold process • first, find the best orientation of the dendrogram • this step determines the ordering of the leaves • second, decide on the exact gaps between consecutive leaves in the ordering

  13. Dendrogram orientation • a dendrogram with n leaves has 2^(n-1) different orientations • each of its n-1 internal nodes can be flipped independently
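
The orientation count can be checked on a toy tree. The sketch below is illustrative only: the nested-tuple tree encoding and the function name `orientations` are assumptions made for this example, not notation from the paper.

```python
def orientations(tree):
    """Yield every leaf ordering obtainable by flipping internal nodes.

    A tree is either a leaf label or a pair (left, right); this nested-tuple
    encoding is a hypothetical choice made for this example.
    """
    if not isinstance(tree, tuple):
        yield [tree]
        return
    left, right = tree
    for left_order in orientations(left):
        for right_order in orientations(right):
            yield left_order + right_order   # node kept as is
            yield right_order + left_order   # node flipped

# A dendrogram with n = 4 leaves and 3 internal nodes.
tree = ((0, 1), (2, 3))
orders = list(orientations(tree))
n = 4
print(len(orders), 2 ** (n - 1))
```

Every ordering produced this way keeps the leaves of each subtree contiguous, which is what "agreeing with the dendrogram's structure" amounts to.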

  14. Dendrogram orientation • one way of formally defining what should be considered a "good" ordering • associate a cost function with the dendrogram • such that finding the best ordering is equivalent to optimizing this function • one candidate is the classical minimum linear arrangement problem, which minimizes the sum over all pairs i < j of w_ij · |x_i − x_j|

  15. Dendrogram orientation • in our particular problem • we are also faced with an ordering task • finding a permutation of {1, …, n} • however, here we should not consider all possible permutations, but only those that agree with the dendrogram's structure • n! ≫ 2^(n-1) • using dynamic programming, the running time is exponential in the dendrogram's height, not in its size
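
To make the restricted search space concrete, here is a minimal brute-force sketch. It is not the dynamic-programming algorithm from the paper: it simply tries all 2^(n-1) orientations of a small tree and scores each with the standard minimum linear arrangement cost. The tree encoding, the toy similarity matrix `w`, and all function names are assumptions for illustration.

```python
from math import factorial

def orientations(tree):
    """Yield all leaf orderings consistent with a nested-tuple dendrogram."""
    if not isinstance(tree, tuple):
        yield [tree]
        return
    left, right = tree
    for lo in orientations(left):
        for ro in orientations(right):
            yield lo + ro
            yield ro + lo

def arrangement_cost(order, w):
    """Linear arrangement cost: sum over pairs i < j of w[i][j] * |x_i - x_j|."""
    pos = {leaf: p for p, leaf in enumerate(order)}
    n = len(order)
    return sum(w[i][j] * abs(pos[i] - pos[j])
               for i in range(n) for j in range(i + 1, n))

def best_orientation(tree, w):
    """Exhaustive search; only feasible for small n (assumed here)."""
    return min(orientations(tree), key=lambda order: arrangement_cost(order, w))

# Toy similarities: 0-1 and 2-3 are strongly related, 1-2 moderately.
w = [[0, 5, 0, 0],
     [5, 0, 3, 0],
     [0, 3, 0, 5],
     [0, 0, 5, 0]]
tree = ((0, 1), (2, 3))
best = best_orientation(tree, w)
print(best, arrangement_cost(best, w))
print(factorial(4), "permutations vs", 2 ** 3, "orientations")
```

The exhaustive loop grows as 2^(n-1), so it only serves to illustrate the objective; the slides state that the actual method uses dynamic programming.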

  16. Dendrogram orientation • an additional form of the cost function is introduced, which is maximized rather than minimized

  17. Dendrogram orientation • given an ordered dendrogram T • a node v • Leaves(v): the set of leaves in the subtree rooted at v • let x be the ordering on the leaves • let S be Leaves(v) • L be the set of leaves to the left of S • R be the set of leaves to the right of S • if |L| = l, |S| = s, we have x(L) = {1,…,l}, x(S) = {l+1,…,l+s}, x(R) = {l+s+1,…,n}

  18. Dendrogram orientation • a key concept of the algorithm is • local arrangement cost, defined as: • if |L| = l, |S| = s, we have x(L) = {1,…,l}, • x(S) = {l+1,…,l+s}, x(R) = {l+s+1,…,n}
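
The L / S / R split can be illustrated with a small sketch. The nested-tuple tree, the leaf labels, and the helper names `leaves` and `partition` are hypothetical and chosen only for this example.

```python
def leaves(tree):
    """Leaves of a nested-tuple dendrogram in left-to-right order."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def partition(root, node):
    """Split the ordered leaves of `root` into (L, S, R) around subtree `node`.

    Assumes leaf labels are distinct and `node` is a subtree of `root`.
    """
    order = leaves(root)
    S = leaves(node)
    start = order.index(S[0])          # S occupies a contiguous block
    return order[:start], S, order[start + len(S):]

node = ("c", "d")
root = (("a", "b"), (node, "e"))
L, S, R = partition(root, node)
l, s, n = len(L), len(S), len(leaves(root))

# 1-based positions, matching x(L), x(S), x(R) on the slide.
x_L = list(range(1, l + 1))
x_S = list(range(l + 1, l + s + 1))
x_R = list(range(l + s + 1, n + 1))
print(L, S, R, x_L, x_S, x_R)
```

Because the leaves of any subtree are contiguous in an ordered dendrogram, the positions of S depend only on how many leaves lie to its left, not on how L or R are arranged internally.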

  19. Dendrogram orientation • two additional related terms will be used • another term that will be used in the algorithm

  20. Determining coordinates of the leaves • computing the exact gaps between each two consecutive leaves • example:

  21. Determining coordinates of the leaves • a better approach is to take a weighted average over all influenced leaf pairs

  22. Computing The y-Coordinates • Principal component analysis • Classical multidimensional scaling • Eigen-projection • Stress minimization
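
The slide only names the candidate methods; how the authors adapt them to already-fixed x-coordinates is not shown in this transcript. As a rough illustration of the first option, the sketch below computes a single coordinate per element with plain PCA. It assumes the data are available as a feature matrix and that NumPy is installed; the function name is hypothetical.

```python
import numpy as np

def first_principal_coordinate(data):
    """Project each row of `data` onto the first principal component."""
    X = np.asarray(data, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each feature
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[0]

# Four points lying on the line y = 2x.
points = [[0, 0], [1, 2], [2, 4], [3, 6]]
y = first_principal_coordinate(points)
print(y)
```

The sign of the result is arbitrary, since a principal direction is only defined up to reflection.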

  23. Result • Odors dataset • consists of 30 volatile odorous pure chemicals • contains 262 elements, natural clusters: 30 • UPGMA agglomerative clustering is used to construct the dendrogram
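
For readers unfamiliar with UPGMA (average-linkage agglomerative clustering), here is a minimal pure-Python sketch that builds a nested-tuple dendrogram from a distance matrix. It is a naive illustration on made-up distances, not the authors' implementation and not suited to large datasets; the function name and tree encoding are assumptions.

```python
def upgma(dist):
    """Build a dendrogram by repeatedly merging the two clusters with the
    smallest average pairwise distance (average linkage)."""
    n = len(dist)
    # cluster id -> (subtree, member leaves)
    clusters = {i: (i, [i]) for i in range(n)}
    next_id = n
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for a_pos, a in enumerate(ids):
            for b in ids[a_pos + 1:]:
                members_a, members_b = clusters[a][1], clusters[b][1]
                avg = (sum(dist[i][j] for i in members_a for j in members_b)
                       / (len(members_a) * len(members_b)))
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        tree_a, members_a = clusters.pop(a)
        tree_b, members_b = clusters.pop(b)
        clusters[next_id] = ((tree_a, tree_b), members_a + members_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree

# Toy distances: elements 0 and 1 are close, as are 2 and 3.
D = [[0, 1, 6, 7],
     [1, 0, 5, 6],
     [6, 5, 0, 2],
     [7, 6, 2, 0]]
dendrogram = upgma(D)
print(dendrogram)
```

The resulting tree is a full binary tree over the elements, which is the input the orientation step described earlier expects.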

  24. Result • Iris dataset • an example of discriminant analysis • contains 150 elements, natural clusters: 3

  25. Result • Gene expression data: CDC15-synchronized cell cycle • a much larger dataset of gene-expression data • contains 6113 elements

  26. Related Work • TreeView • dendrogram over a color-coded matrix

  27. Discussion • successful integration of two key methods in exploratory data analysis • cluster analysis and low-dimensional embedding • two unique properties • Guaranteed separation between any kind of given clusters • The ability to deal with a predefined hierarchical clustering

  28. Personal Opinion • Advantages • successfully integrates clustering with low-dimensional embedding • makes analysis more intuitive • Applications • clustering and analyzing real data • may solve the problem of missing clustering information • Limitations • cannot show the real shape of clusters
