Data Clustering: 50 years beyond K-means

Presenter: Jiang-Shan Wang • Author: Anil K. Jain • National Yunlin University of Science and Technology • PRL 2010

Outline • Motivation • Objective • Data clustering • User’s dilemma • K-means • Extensions of K-means • Trends in data clustering • Summary • Comments

Motivation • Provide a brief overview of clustering and point out some of the emerging and useful research directions.

Objective Summarize well-known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions.

Data clustering • Three main purposes: • Underlying structure • Natural classification • Compression

K-means • Three parameters • Number of clusters • Cluster initialization • Distance metrics
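The three parameters above map directly onto the arguments of a basic K-means loop. The following is a minimal sketch, not the paper's code: the function names, the `seed` and `max_iter` arguments, and the choice of initializing from randomly sampled data points are assumptions made for illustration.

```python
import random

def squared_euclidean(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, distance=squared_euclidean, seed=0, max_iter=100):
    """Basic K-means sketch.

    k        -> number of clusters
    seed     -> controls cluster initialization (k sampled data points; an assumption)
    distance -> distance metric; note that the mean update below is the
                optimal center only for squared Euclidean distance
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: distance(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster became empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers
```

Running it with different `seed` values or a different `k` on the same data is a quick way to see how sensitive the result is to the first two parameters.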

Extensions of K-means Fuzzy C-means Bisecting K-means X-means K-medoid Kernel K-means
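As one illustration of how an extension departs from K-means, Fuzzy C-means replaces the hard assignment with a degree of membership in every cluster. The sketch below shows only the standard membership update for a single point; the function name and the default fuzzifier `m=2.0` are assumptions for illustration, and the center update is omitted.

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy C-means membership of one point in each cluster.

    u_j = 1 / sum_k (d_j / d_k) ** (2 / (m - 1)), with fuzzifier m > 1.
    """
    dists = [math.dist(point, c) for c in centers]
    # If the point sits exactly on one or more centers, split full
    # membership among those centers to avoid dividing by zero.
    on_center = [d == 0.0 for d in dists]
    if any(on_center):
        n = sum(on_center)
        return [1.0 / n if hit else 0.0 for hit in on_center]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_j / d_k) ** exponent for d_k in dists) for d_j in dists
    ]
```

The memberships of a point always sum to 1, and as `m` approaches 1 they approach the hard 0/1 assignment of K-means.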

User’s dilemma Representation

User’s dilemma Purpose of grouping

User’s dilemma Number of clusters

User’s dilemma Cluster validity

User’s dilemma Comparing clustering algorithms

User’s dilemma • Admissibility analysis of clustering algorithms • Fisher and Van Ness’s criteria • Convex • Cluster proportion • Cluster omission • Monotone • Kleinberg’s criteria • Scale invariance • Richness • Consistency

Trends in data clustering Clustering ensembles
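The slide gives only the topic, so the following is an assumption about one common way to combine several partitions of the same data: a co-association matrix, which records how often each pair of points lands in the same cluster. The function name is hypothetical, and this is a sketch of that general idea rather than the method used in the paper.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of points shares a cluster.

    partitions: list of label lists, all over the same n points.
    Label values need not match across partitions; only co-membership counts.
    """
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    total = len(partitions)
    return [[c / total for c in row] for row in counts]
```

A consensus partition can then be obtained by clustering this matrix as a similarity matrix, for example with a hierarchical method.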

Trends in data clustering Semi-supervised clustering

Trends in data clustering • Large-scale clustering • Studies • Efficient Nearest Neighbor • Data summarization • Distributed computing • Incremental clustering • Sampling-based methods

Trends in data clustering • Multi-way clustering • Heterogeneous data • Rank data • Dynamic data • Graph data • Relational data

Summary A suite of benchmark data is needed. Clustering algorithms should be integrated more tightly with application needs. Most clustering methods reduce to optimization problems. A clustering should be stable, or consistent, under perturbations of the data. Choose clustering principles according to how well they satisfy the stated axioms. Develop semi-supervised clustering.

Comments • Advantage • Many figures aid understanding. • Drawback • … • Application • Clustering.
