
The Clustering Problem


Presentation Transcript


  1. The Clustering Problem Yongsub Lim Applied Algorithm Laboratory KAIST

  2. Contents • The Clustering Problem • Basic Algorithms • K-Means • K-Clustering of Max. Spacing • Two-Phase Algorithms • Other Algorithms

  3. The Clustering Problem • Given data, the goal is to discover “meaningful” groups • Data in the same group are similar • Data in different groups are dissimilar

  4. Example of clustering

  5. Example of clustering

  6. Example of clustering

  7. Applications of Clustering • The image segmentation problem can be viewed as clustering the pixels of an image • In unsupervised learning, unlabeled training data are grouped by clustering before a decision rule is built

  8. Applications of Clustering • In a network or a graph, we can group vertices so that the vertices within each group are highly connected • Clustering is also useful in biology, e.g., to classify genes

  9. Basic Algorithms • Two algorithms will be introduced • K-Means iteratively computes the centers of K clusters • K-Clustering of Max. Spacing uses a minimum spanning tree • The two have different objective functions

  10. K-Means • Choose the means of the K clusters at random • At each iteration: • Assign every data point to the cluster whose mean is the nearest among the K means • Recompute the means of all clusters
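
A minimal sketch of the iteration just described, in Python with NumPy. The data layout (an n-by-d array), the random initialization from the data points, and the convergence test are assumptions of this sketch, not details fixed by the slides.

```python
import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    """Plain K-Means (Lloyd's iteration) on an (n, d) data array."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the K means with K distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute every cluster's mean (keep the old mean if a
        # cluster happens to be empty).
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return labels, centers
```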

  11. K-Means • The objective is to minimize the sum of distances between each cluster center and the members of that cluster • It favors clusterings with high density within each cluster
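
The objective can be written out as follows. The squared-Euclidean form shown here is the standard K-Means objective (the slide says "distance"; either variant conveys the same idea), and the symbols (clusters C_k with means mu_k) are notation introduced for this transcript, not taken from the slides.

```latex
\min_{C_1,\dots,C_K}\ \sum_{k=1}^{K}\ \sum_{x \in C_k} \lVert x - \mu_k \rVert^{2},
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x .
```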

  12. K-Means Algorithm • Worst case: with the initial two centers chosen at random, the resulting clustering may not be what we want!

  13. K-Clustering of Max. Spacing • Given data, find K clusters that maximize the minimum distance between clusters • Spacing: the minimum distance between any pair of data points lying in different clusters
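
In symbols, with d the distance between data points and C = {C_1, ..., C_K} a K-clustering (notation introduced here for clarity):

```latex
\operatorname{spacing}(C) \;=\; \min_{i \neq j}\ \min_{x \in C_i,\; y \in C_j} d(x, y),
\qquad
\text{goal: } \max_{C = \{C_1,\dots,C_K\}} \operatorname{spacing}(C).
```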

  14. K-Clustering of Max. Spacing

  15. K-Clustering of Max. Spacing • Treat the given data as a complete graph with Euclidean distances as edge weights • Compute an MST • Delete the K-1 most expensive edges of the MST
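
A sketch of this procedure in Python. It uses the equivalent Kruskal-style formulation (stop merging once K components remain) rather than literally deleting the K-1 heaviest MST edges; the union-find helper and the O(n^2) edge enumeration are choices of this sketch.

```python
import numpy as np

def max_spacing_clustering(X, K):
    """K-clustering of maximum spacing: Kruskal's algorithm on the complete
    Euclidean graph, stopped when exactly K components remain (equivalent to
    building the MST and deleting its K-1 most expensive edges)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    parent = list(range(n))

    def find(i):  # union-find root with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All edges of the complete graph, sorted by Euclidean length.
    edges = sorted(
        (np.linalg.norm(X[i] - X[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    components = n
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            if components == K:
                break        # w is the spacing achieved by the clustering
            parent[ri] = rj
            components -= 1
    # Label each point by the root of its component.
    roots = {}
    return np.array([roots.setdefault(find(i), len(roots)) for i in range(n)])
```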

  16. K-Clustering of Max. Spacing • Claim: for the optimal clustering Copt and the algorithm's clustering Calg, spacing of Copt ≤ spacing of Calg • Sketch: any other K-clustering C must split some cluster of Calg; the MST path inside that cluster then contains an edge joining two different clusters of C, and that edge is no longer than the spacing of Calg, so the spacing of C cannot exceed it

  17. K-Clustering of Max. Spacing • There is no randomness • Its objective seems better, or at least more reasonable, than that of K-Means

  18. K-Means vs. Max. Spacing • A good clustering has • High density within each cluster (K-Means) • Long distances between clusters (Max. Spacing)

  19. K-Means vs. Max. Spacing

  20. Two-Phase Algorithms • Two algorithms will be introduced • In the first phase, both cluster the data without restricting the number of clusters to K • In the second phase, if the number of clusters is larger than K, clusters are merged using Max. Spacing

  21. Hierarchical EMST (Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms)

  22. Hierarchical EMST (Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms) • HEMST removes all MST edges with weights greater than a threshold (mean + std. of the edge weights) • If the resulting number of clusters is less than the given K, it proceeds in the same way as Max. Spacing • Otherwise, it runs Max. Spacing on the representative points, each being the data point nearest to the center of its cluster
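
A sketch of only the edge-removal step described above, assuming SciPy for the Euclidean MST and connected components; the function name and the dense distance matrix are conveniences of this sketch, not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def hemst_first_phase(X):
    """First HEMST step per the slide: build the Euclidean MST, then cut
    every edge heavier than (mean + std) of the MST edge weights."""
    D = squareform(pdist(np.asarray(X, dtype=float)))  # pairwise distances
    mst = minimum_spanning_tree(csr_matrix(D))          # sparse (n, n) MST
    weights = mst.data
    threshold = weights.mean() + weights.std()
    mst.data[mst.data > threshold] = 0                  # drop heavy edges
    mst.eliminate_zeros()
    # The remaining connected components are the first-phase clusters.
    n_clusters, labels = connected_components(mst, directed=False)
    return n_clusters, labels
```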

  23. Hierarchical EMST (Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms)

  24. Hierarchical EMST (Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms)

  25. Hierarchical EMST (Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms)

  26. Hierarchical EMST (Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms)

  27. Modified K-Means Process (M.F. Jiang, S.S. Tseng, C.M. Su, Two-Phase Clustering Process for Outliers Detection) • In the first phase, the modified K-Means process is similar to K-Means • The difference is that if a data point is far enough from all existing clusters, it becomes the center of a new cluster • While running, if the number of clusters grows larger than a threshold, the two nearest clusters are merged • In the second phase, Max. Spacing is applied
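
A rough single-pass sketch of the first phase as described on this slide: a point far from every existing center starts a new cluster, and the two nearest centers are merged whenever the cluster count exceeds a cap. The running-mean update, the specific thresholds, and the merge rule details are assumptions of this sketch rather than the authors' exact procedure.

```python
import numpy as np

def modified_kmeans_first_phase(X, far_threshold, max_clusters):
    """Single pass over the data: assign each point to its nearest center,
    spawn a new cluster when the point is farther than far_threshold from
    every center, and merge the two nearest centers whenever the number of
    clusters exceeds max_clusters."""
    X = np.asarray(X, dtype=float)
    centers, counts = [X[0].copy()], [1]
    for x in X[1:]:
        d = [np.linalg.norm(x - c) for c in centers]
        k = int(np.argmin(d))
        if d[k] > far_threshold:
            centers.append(x.copy())             # point starts a new cluster
            counts.append(1)
        else:
            counts[k] += 1                        # running-mean update
            centers[k] += (x - centers[k]) / counts[k]
        if len(centers) > max_clusters:
            # Merge the two nearest centers into their weighted mean.
            m = len(centers)
            _, i, j = min((np.linalg.norm(centers[a] - centers[b]), a, b)
                          for a in range(m) for b in range(a + 1, m))
            total = counts[i] + counts[j]
            centers[i] = (counts[i] * centers[i] + counts[j] * centers[j]) / total
            counts[i] = total
            del centers[j], counts[j]
    return np.array(centers), np.array(counts)
```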

  28. Modified K-Means Process (M.F. Jiang, S.S. Tseng, C.M. Su, Two-Phase Clustering Process for Outliers Detection) • This scheme can identify outliers by using Max. Spacing

  29. Modified K-Means Process (M.F. Jiang, S.S. Tseng, C.M. Su, Two-Phase Clustering Process for Outliers Detection)

  30. Modified K-Means Process (M.F. Jiang, S.S. Tseng, C.M. Su, Two-Phase Clustering Process for Outliers Detection)

  31. Two-Phase Algorithms • In the first phase, both give more weight to the members of small sets • A small set is very likely to consist of data that truly belong together, so it is reasonable to decrease the distances between its members

  32. Other Algorithms (Erez Hartuv, Ron Shamir, A Clustering Algorithm Based on Graph Connectivity) • HCS uses the minimum cut of a graph • It recursively separates the data into two disjoint subsets along a min-cut until all clusters are highly connected • A graph is highly connected if the minimum number of edges whose removal disconnects it is greater than |V|/2
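
A compact sketch of the HCS recursion, using networkx for the global minimum edge cut. The "highly connected" test (edge connectivity greater than |V|/2) follows the slide; the choice of networkx and the handling of small or disconnected inputs are assumptions of this sketch.

```python
import networkx as nx

def hcs(G):
    """Highly Connected Subgraphs: recursively split the graph along a global
    minimum edge cut until every remaining subgraph is highly connected,
    i.e. its edge connectivity exceeds |V| / 2.  Returns a list of vertex sets."""
    if G.number_of_nodes() <= 1:
        return [set(G.nodes)]
    if not nx.is_connected(G):
        # Process each connected component independently.
        return [c for comp in nx.connected_components(G)
                for c in hcs(G.subgraph(comp).copy())]
    cut = nx.minimum_edge_cut(G)              # set of edges in a global min cut
    if len(cut) > G.number_of_nodes() / 2:
        return [set(G.nodes)]                 # already highly connected
    H = G.copy()
    H.remove_edges_from(cut)                  # removing the cut disconnects H
    return [c for comp in nx.connected_components(H)
            for c in hcs(G.subgraph(comp).copy())]
```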

  33. Other Algorithms (Ana L.N. Fred, Anil K. Jain, Data Clustering Using Evidence Accumulation) • Voting • Apply K-Means N times • If a pair of data points fell into the same cluster more than a threshold t times, they are grouped together
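
One possible reading of the voting scheme in Python: run K-Means N times (scikit-learn's KMeans here), count how often each pair of points co-occurs in a cluster, and group pairs that co-occur more than t times with a small union-find. The grouping rule and parameter defaults are assumptions of this sketch, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def evidence_accumulation(X, K, N=50, t=25, seed=0):
    """Run K-Means N times, count how often each pair of points lands in the
    same cluster, then group pairs that co-occurred more than t times."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    co = np.zeros((n, n), dtype=int)              # co-association counts
    for r in range(N):
        labels = KMeans(n_clusters=K, n_init=10,
                        random_state=seed + r).fit_predict(X)
        co += labels[:, None] == labels[None, :]
    parent = list(range(n))                        # union-find over points
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if co[i, j] > t:                       # voted together often enough
                parent[find(i)] = find(j)
    roots = {}
    return np.array([roots.setdefault(find(i), len(roots)) for i in range(n)])
```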

  34. Thanks
