Performance evaluation of some clustering algorithms and validity indices

Advisor: Dr. Hsu • Presenter: Yu Cheng Chen • Authors: Ujjwal Maulik and Sanghamitra Bandyopadhyay • IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003, pp. 1650-1654

Outline • Motivation • Objective • Introduction • Clustering Algorithms • Clustering Validity Indices • Experimental Results • Conclusions • Personal Opinion

Motivation • We want to know how well several clustering algorithms and cluster validity indices perform.

Objective • Evaluate the performance of three clustering algorithms: • K-means • Single linkage • Simulated annealing (SA) • in conjunction with four cluster validity indices: • Davies-Bouldin index • Dunn’s index • Calinski-Harabasz index • Index I

Introduction • Clustering searches for a partition of a given data set into groups that are not known in advance. • Two common questions addressed in clustering systems are: • How many clusters are present in the data? • How good is the clustering itself?

Introduction • A validity measure should order candidate clusterings by how good they are. • For example: • U1, U2, …, Um are m partitions of the data set X • V1, V2, …, Vm are the corresponding values of a validity measure • Vk1 ≥ Vk2 ≥ … ≥ Vkm indicates that Uk1 is at least as good as Uk2, and so on.

Clustering Algorithms • K-means: [Figure: a data set of unassigned points is clustered into k clusters; the cluster centers are marked and the centroids are reassigned]
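The assign-then-reassign loop pictured on this slide can be sketched as follows. This is a minimal illustration of the usual K-means iteration, not the implementation used in the paper; the function name `kmeans`, its parameters, and the choice of k random data points as initial centers are assumptions made for the example.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain K-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)      # assumed initialization: k data points
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:         # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:                  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers

# Toy data (invented for the example): two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, 2)
```

The result depends on the initial centers in general, which is one reason the slides also consider a simulated annealing variant.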

Clustering Algorithms • Single linkage: [Figure, three panels: points A, B, C, D; the panels are labeled 2, 3, and 10]
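A small sketch of single linkage may help here: start with every point in its own cluster and repeatedly merge the two clusters whose closest members are nearest to each other. The function name `single_linkage` and the one-dimensional coordinates below are invented for illustration; the coordinates were chosen so that the merge distances come out as 2, 3, and 10, the values shown on the figure panels, but the slide itself does not give the points' positions.

```python
import math

def single_linkage(points, k):
    """Merge the two closest clusters until k clusters remain (sketch)."""
    clusters = [[i] for i in range(len(points))]   # one cluster per point

    def gap(a, b):
        # Single-linkage distance: closest pair of points across two clusters.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    merges = []                                    # distance at each merge
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: gap(clusters[ij[0]], clusters[ij[1]]))
        merges.append(gap(clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

# Hypothetical positions for A, B, C, D on a line.
points = [(0.0,), (2.0,), (5.0,), (15.0,)]
clusters, merges = single_linkage(points, 1)
print(merges)   # [2.0, 3.0, 10.0]
```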

Clustering Algorithms • Simulated annealing:

Clustering Algorithms • Simulated annealing: Components: • “Energy” (cost) function to minimize • represents the entire state and drives the system forward • Moves • local rearrangements • Cooling schedule • initial temperature • temperature steps (sequence) • time at each temperature

Clustering Algorithms • Basic algorithm of SA • Pick an initial solution and a minimum temperature Tmin • Set the temperature (T) to its initial value • while (T > Tmin) • for the time spent at T • pick a move at random • compute Δcost • if Δcost ≤ 0, accept the move • else if a random probability < e^(-Δcost/T), accept the move • update T • [Figure: cost landscape showing the starting point, the descent direction, local minima, and the global minimum]
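The loop on this slide can be written out as a short generic routine. This is a sketch under stated assumptions: the names `anneal`, `cost`, and `random_move`, the geometric cooling factor `alpha`, and the default temperatures are all illustrative choices, since the slide does not specify the cooling schedule. The routine also remembers the best state seen, which the slide does not mention.

```python
import math
import random

def anneal(initial, cost, random_move, t_init=10.0, t_min=1e-3,
           alpha=0.9, steps_per_t=50, seed=0):
    """Generic simulated annealing loop (illustrative sketch)."""
    rng = random.Random(seed)
    state, best = initial, initial
    t = t_init
    while t > t_min:
        for _ in range(steps_per_t):            # time spent at temperature t
            candidate = random_move(state, rng) # pick a move at random
            delta = cost(candidate) - cost(state)
            # Always accept improvements; accept a worse move with
            # probability exp(-delta / t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                state = candidate
                if cost(state) < cost(best):
                    best = state
        t *= alpha                              # assumed geometric cooling
    return best

# Toy usage: minimize (x - 3)^2 over the integers with +/-1 moves.
best = anneal(0, lambda x: (x - 3) ** 2,
              lambda x, rng: x + rng.choice((-1, 1)))
```

At high temperature most uphill moves are accepted, which lets the search leave local minima; as T falls the loop behaves more and more like plain descent.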

Clustering Algorithms • How is SA applied to clustering? • The cost function is: • The new state is accepted or rejected according to the probability:

Clustering Validity Indices • The indices aim to identify compact and well-separated clusters • Davies-Bouldin (DB) index: • Based on the intra-cluster distance (scatter within a cluster) • and the inter-cluster distance (separation between clusters) • Smaller values of DB correspond to better clusterings
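As a rough illustration of how the two distances combine, the sketch below uses the commonly cited form of the Davies-Bouldin index: for each cluster, take the worst ratio (Si + Sj) / dij against any other cluster, then average over clusters. Here Si is taken as the mean Euclidean distance of the points in cluster i to its centroid and dij as the Euclidean distance between centroids; these particular choices, and the function names, are assumptions for the example rather than details given on the slide.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(x) / len(cluster) for x in zip(*cluster))

def davies_bouldin(clusters):
    """clusters: list of lists of points (tuples). Lower is better."""
    centers = [centroid(c) for c in clusters]
    # Intra-cluster scatter: mean distance of each point to its centroid.
    scatter = [sum(math.dist(p, z) for p in c) / len(c)
               for c, z in zip(clusters, centers)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
                     for j in range(k) if j != i)
    return total / k

# Toy data (invented): the same four points grouped well and badly.
good = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
bad = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]
print(davies_bouldin(good) < davies_bouldin(bad))   # True
```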

Clustering Validity Indices • Dunn’s index: • S and T are two nonempty subsets of the data. • The diameter of S is defined as Δ(S) = max {d(x, y) : x, y ∈ S} • The distance between S and T is δ(S, T) = min {d(x, y) : x ∈ S, y ∈ T} • Larger values of vD correspond to better clusterings.
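Using the diameter and set distance defined on this slide, Dunn's index is usually written as the smallest distance between any two clusters divided by the largest cluster diameter. The sketch below assumes that form and Euclidean d(x, y); the function name `dunn_index` is illustrative, and the code assumes at least one cluster contains two distinct points so the largest diameter is nonzero.

```python
import math

def dunn_index(clusters):
    """clusters: list of lists of points (tuples). Higher is better."""
    def diameter(s):
        # Largest distance between two points of the same cluster.
        return max(math.dist(x, y) for x in s for y in s)

    def separation(s, t):
        # Smallest distance between a point of s and a point of t.
        return min(math.dist(x, y) for x in s for y in t)

    k = len(clusters)
    max_diam = max(diameter(c) for c in clusters)
    min_sep = min(separation(clusters[i], clusters[j])
                  for i in range(k) for j in range(i + 1, k))
    return min_sep / max_diam

# Toy data (invented): clusters of diameter 2 that are 10 apart.
good = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
print(dunn_index(good))   # 5.0
```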

Clustering Validity Indices • Calinski-Harabasz (CH) index: • nk is the number of points in cluster k • z is the centroid of the entire data set.
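The symbols on this slide fit the usual form of the CH index, CH = [trace B / (K - 1)] / [trace W / (n - K)], where trace B sums nk times the squared distance from each cluster centroid to z, and trace W sums the squared distances of the points to their own cluster centroid. The sketch below assumes that form; the function name is illustrative, and the code assumes K ≥ 2, n > K, and a nonzero within-cluster sum.

```python
import math

def calinski_harabasz(clusters):
    """clusters: list of lists of points (tuples). Higher is better."""
    def centroid(pts):
        return tuple(sum(x) / len(pts) for x in zip(*pts))

    all_points = [p for c in clusters for p in c]
    n, k = len(all_points), len(clusters)
    z = centroid(all_points)                   # centroid of the whole data set
    centers = [centroid(c) for c in clusters]
    # Between-cluster scatter: n_k * ||z_k - z||^2 summed over clusters.
    trace_b = sum(len(c) * math.dist(zk, z) ** 2
                  for c, zk in zip(clusters, centers))
    # Within-cluster scatter: ||x - z_k||^2 summed over all points.
    trace_w = sum(math.dist(p, zk) ** 2
                  for c, zk in zip(clusters, centers) for p in c)
    return (trace_b / (k - 1)) / (trace_w / (n - k))

# Toy data (invented): the same four points grouped well and badly.
good = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
bad = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]
print(calinski_harabasz(good) > calinski_harabasz(bad))   # True
```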

Clustering Validity Indices • Index I: • n is the total number of points in the data set • zk is the center of the kth cluster • As K or EK increases, the index I decreases • As K increases, DK increases • The three factors compete with and balance each other critically.
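The behavior described on this slide is consistent with the form I(K) = ((1/K) · (E1/EK) · DK)^p, where EK is the sum of the distances from each point to its cluster center, E1 is the same sum for a single cluster containing all the data, and DK is the largest distance between two cluster centers. The sketch below assumes that form for crisp clusters, with p = 2 taken as an assumed exponent; the function name `index_i` is illustrative.

```python
import math

def index_i(clusters, p=2):
    """clusters: list of lists of points (tuples). Higher is better."""
    def centroid(pts):
        return tuple(sum(x) / len(pts) for x in zip(*pts))

    all_points = [q for c in clusters for q in c]
    k = len(clusters)
    z = centroid(all_points)
    centers = [centroid(c) for c in clusters]
    # E1: total distance to the single overall centroid (K = 1).
    e_1 = sum(math.dist(q, z) for q in all_points)
    # EK: total distance of the points to their own cluster center.
    e_k = sum(math.dist(q, zk) for c, zk in zip(clusters, centers) for q in c)
    # DK: largest separation between any two cluster centers.
    d_k = max(math.dist(centers[i], centers[j])
              for i in range(k) for j in range(i + 1, k))
    return ((1.0 / k) * (e_1 / e_k) * d_k) ** p

# Toy data (invented): the same four points grouped well and badly.
good = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
bad = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]
print(index_i(good) > index_i(bad))   # True
```

The 1/K factor penalizes many clusters, E1/EK rewards compactness, and DK rewards separation, which is why the three terms pull against each other as K changes.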

Clustering Validity Indices • Xie and Beni defined an index that is the ratio of the compactness of the fuzzy K-partition of a data set to its separation. • Here, the relationship between index I and Dunn’s index is shown as follows:

Experimental Results • Artificial data sets • AD_10_2, AD_4_3N, AD_2_10

Experimental Results • Real data sets • Crude Oil, and Cancer

Experimental Results • In this paper, the correct number of clusters is decided as follows: • Each of the three algorithms is run for K = Kmin, …, Kmax, yielding (Kmax - Kmin + 1) partitions. • Let UKmin, UKmin+1, …, UKmax be these partitions and VKmin, VKmin+1, …, VKmax the corresponding validity index values. • The K* with the maximal index value (minimal for the DB index, where smaller is better) is chosen as the correct number of clusters.
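The selection procedure on this slide amounts to a scan over K. The sketch below is one way to write it; `choose_k`, `partition_for`, and `validity` are hypothetical names, with `partition_for(k)` standing for any clustering algorithm run with k clusters and `validity` for any index. Taking Kmax as the integer part of √n is an assumption about rounding.

```python
import math

def choose_k(n_points, partition_for, validity, larger_is_better=True):
    """Scan K = 2 .. floor(sqrt(n)) and return the best K with all scores."""
    k_min, k_max = 2, max(2, math.isqrt(n_points))
    # One partition and one index value per candidate K.
    scores = {k: validity(partition_for(k)) for k in range(k_min, k_max + 1)}
    pick = max if larger_is_better else min   # min for indices such as DB
    return pick(scores, key=scores.get), scores

# Stand-in functions for illustration only: the "partition" is just k and
# the "index" peaks at k = 3.
best_k, scores = choose_k(25, lambda k: k, lambda k: -(k - 3) ** 2)
print(best_k)   # 3
```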

Experimental Results • The values of Kmin and Kmax are chosen as 2 and √n, respectively, where n is the number of points in the data set.

Conclusions • We presented a comparison of four cluster validity indices on artificial and real data sets. • Cluster validity indices can be used to determine the appropriate number of clusters. • Index I achieves its maximum value for the correct number of clusters.

Personal Opinion • Drawbacks: • Only crisp clustering algorithms for numeric data are considered. • The time complexity of the validity indices is not analyzed. • Application: • Index I can be used to search for the correct number of clusters, but handling categorical values remains an open problem. • Future Work: • Find or develop a validity index for mixed data.
