Performance evaluation of some clustering algorithms and validity indices. Advisor: Dr. Hsu. Presenter: Yu Cheng Chen. Authors: Ujjwal Maulik and Sanghamitra Bandyopadhyay. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2003, pp. 1650-1654.
Outline • Motivation • Objective • Introduction • Clustering Algorithms • Clustering Validity Indices • Experimental Results • Conclusions • Personal Opinion
Motivation • We want to know how well several clustering algorithms and cluster validity indices perform.
Objective • Show the performance of three clustering algorithms: • K-means • Single linkage • Simulated annealing (SA) • In conjunction with four cluster validity indices: • Davies-Bouldin index • Dunn's index • Calinski-Harabasz index • Index I
Introduction • Clustering searches for the underlying partitions (natural groupings) of a data set. • Two questions commonly addressed in clustering systems are: • How many clusters are present in the data? • How good is the clustering itself?
Introduction • A cluster validity measure should be able to rank partitions by how good they are. • For example: • U1, U2, …, Um are m partitions of the data set X • V1, V2, …, Vm are the corresponding values of a validity measure • Vk1 >= Vk2 >= … >= Vkm then indicates that Uk1 is at least as good as Uk2, and so on.
Clustering Algorithms • K-means: (figure: an unassigned data set; k cluster centers; each point assigned to a cluster; centroids reassigned)
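The K-means loop pictured on this slide can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the function name `kmeans`, the random choice of initial centers, the stopping rule, and the handling of empty clusters are all assumptions.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)   # assumption: start from k randomly chosen data points
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, new_labels) if l == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members)
                                         for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        # Stop once neither the assignment nor the centers change.
        if new_labels == labels and new_centers == centers:
            break
        centers, labels = new_centers, new_labels
    return centers, labels
```

Calling `kmeans(points, 2)` on two well-separated groups of points should place each group in its own cluster; which group receives label 0 depends on the random start.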
Clustering Algorithms • Single linkage: (figure: points A, B, C, D shown over three steps, labeled 2, 3, and 10)
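Single linkage repeatedly merges the two clusters whose closest members are nearest to each other. The sketch below is a naive illustration under that standard definition; the function name `single_linkage` and the choice to return the merge distances are assumptions.

```python
import math

def single_linkage(points, k):
    """Agglomerative single linkage: merge the closest clusters until k remain.

    Returns (clusters as lists of point indices, list of merge distances).
    """
    clusters = [[i] for i in range(len(points))]
    merge_distances = []

    def gap(a, b):
        # Single-linkage distance: the closest pair across the two clusters.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        pairs = [(gap(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        d, a, b = min(pairs)                      # closest pair of clusters
        clusters[a] = clusters[a] + clusters[b]   # merge b into a
        del clusters[b]
        merge_distances.append(d)
    return clusters, merge_distances
```

As a hypothetical example (the coordinates are invented, since the original figure is not available), four points on a line at 0, 2, 5, and 15 merge at distances 2, 3, and 10, which is one layout that would produce the three labels on the slides.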
Clustering Algorithms • Simulated annealing:
Clustering Algorithms • Simulated annealing: Components: • "Energy" (cost) function to minimize • represents the entire state and drives the system forward • Moves • local rearrangements • Cooling schedule • initial temperature • temperature steps (sequence) • time spent at each temperature
Clustering Algorithms • Basic algorithm of SA (figure labels: starting point, descent direction, local minima, global minimum) • Pick an initial solution and a minimum temperature Tmin • Set the temperature T to its initial value • while (T > Tmin) • for the time allotted at T • pick a move at random • compute Δcost • if Δcost <= 0, accept • else if a uniform random number < e^(-Δcost/T), accept • update T
Clustering Algorithms • How is SA applied to clustering? • The cost function is: (equation not preserved in this transcript) • A new state is accepted or rejected according to the acceptance probability.
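The SA loop from the previous slide can be applied to clustering roughly as sketched below. Several parts are assumptions, because the slide's cost function is not preserved: the energy used here is the sum of distances from each point to its cluster centroid, the move reassigns one random point, and the cooling schedule is geometric (`t *= alpha`). The names `cost` and `sa_cluster` and all default parameter values are hypothetical. The sketch requires k >= 2.

```python
import math
import random

def cost(points, labels, k):
    """Assumed energy: sum of distances from each point to its cluster centroid."""
    total = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if not members:
            continue  # an empty cluster contributes nothing
        centroid = tuple(sum(col) / len(members) for col in zip(*members))
        total += sum(math.dist(p, centroid) for p in members)
    return total

def sa_cluster(points, k, t_init=10.0, t_min=0.01, alpha=0.9,
               steps_per_t=50, seed=0):
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]   # initial solution
    energy = cost(points, labels, k)
    best_labels, best_energy = labels[:], energy
    t = t_init
    while t > t_min:
        for _ in range(steps_per_t):              # time spent at temperature t
            # Move: reassign one random point to a different cluster.
            i = rng.randrange(len(points))
            old = labels[i]
            labels[i] = rng.choice([c for c in range(k) if c != old])
            new_energy = cost(points, labels, k)
            delta = new_energy - energy
            # Accept improvements always, worse states with prob. e^(-delta/t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                energy = new_energy
                if energy < best_energy:
                    best_labels, best_energy = labels[:], energy
            else:
                labels[i] = old                   # reject: undo the move
        t *= alpha                                # assumed geometric cooling
    return best_labels, best_energy
```

The function returns the lowest-energy labeling seen during the run rather than the final state, a common practical choice that the slides do not specify.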
Clustering Validity Indices • The aim is to identify compact and well-separated clusters • Davies-Bouldin (DB) index: • Uses the intra-cluster distance (scatter within a cluster) • Uses the inter-cluster distance (separation between clusters) • Smaller values of DB correspond to better clusters
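The slide's formulas are not preserved, so the sketch below assumes the commonly used form of the DB index: the scatter S_i is the mean distance of a cluster's points to its centroid, the separation d_ij is the distance between centroids, and DB is the average over clusters of the largest ratio (S_i + S_j) / d_ij. The function names are hypothetical.

```python
import math

def centroid(members):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(col) / len(members) for col in zip(*members))

def davies_bouldin(points, labels):
    ks = sorted(set(labels))
    groups = {c: [p for p, l in zip(points, labels) if l == c] for c in ks}
    centers = {c: centroid(groups[c]) for c in ks}
    # Intra-cluster scatter S_i: mean distance of members to their centroid.
    scatter = {c: sum(math.dist(p, centers[c]) for p in groups[c]) / len(groups[c])
               for c in ks}
    total = 0.0
    for i in ks:
        # R_i: worst-case ratio of combined scatter to inter-cluster separation.
        total += max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
                     for j in ks if j != i)
    return total / len(ks)
```

The sketch needs at least two clusters and fails with a division by zero if two centroids coincide.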
Clustering Validity Indices • Dunn's index: • S and T are two nonempty subsets of the data. • The diameter of S is defined as Δ(S) = max {d(x, y) : x, y ∈ S} • The distance between S and T is δ(S, T) = min {d(x, y) : x ∈ S, y ∈ T} • Larger values of vD correspond to better clusters.
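Using the diameter and set distance defined on this slide, Dunn's index in its usual form is the smallest distance between any two clusters divided by the largest cluster diameter. The sketch below assumes that form; the function name `dunn_index` is hypothetical.

```python
import math
from itertools import combinations

def dunn_index(points, labels):
    ks = sorted(set(labels))
    groups = [[p for p, l in zip(points, labels) if l == c] for c in ks]
    # Largest diameter: max pairwise distance within any single cluster
    # (a one-point cluster has diameter 0).
    max_diameter = max(
        max((math.dist(x, y) for x, y in combinations(g, 2)), default=0.0)
        for g in groups)
    # Smallest set distance: min pairwise distance between two different clusters.
    min_separation = min(
        min(math.dist(x, y) for x in s for y in t)
        for s, t in combinations(groups, 2))
    return min_separation / max_diameter
```

The sketch needs at least two clusters and fails with a division by zero if every cluster contains a single point.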
Clustering Validity Indices • Calinski-Harabasz (CH) index: • nk is the number of points in cluster k • z is the centroid of the entire data set.
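The sketch below assumes the standard form of the CH index, [trace B / (K - 1)] / [trace W / (n - K)], where trace B sums nk times the squared distance from each cluster centroid to z, and trace W sums the squared distances of points to their own cluster centroid. The function names are hypothetical.

```python
import math

def centroid(members):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(col) / len(members) for col in zip(*members))

def calinski_harabasz(points, labels):
    ks = sorted(set(labels))
    n, k = len(points), len(ks)
    z = centroid(points)                 # centroid of the entire data set
    trace_b = 0.0                        # between-cluster scatter
    trace_w = 0.0                        # within-cluster scatter
    for c in ks:
        members = [p for p, l in zip(points, labels) if l == c]
        z_k = centroid(members)
        trace_b += len(members) * math.dist(z_k, z) ** 2
        trace_w += sum(math.dist(p, z_k) ** 2 for p in members)
    return (trace_b / (k - 1)) / (trace_w / (n - k))
```

Larger values indicate more compact, better separated clusters. The sketch requires 2 <= K < n and a nonzero within-cluster scatter.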
Clustering Validity Indices • Index I: • n is the total number of points in the data set • zk is the center of the kth cluster • As K or EK increases, index I decreases • As K increases, DK increases • These three factors compete with and balance each other.
Clustering Validity Indices • Xie and Beni defined an index that is the ratio of the compactness of a fuzzy K-partition of a data set to its separation. • The relationship between index I and Dunn's index is shown by the following: (equation not preserved in this transcript)
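The formula for index I is not preserved on the slide. The sketch below assumes the form I(K) = ((1/K) × (E1/EK) × DK)^p, which matches the three factors listed above: EK is the sum of distances from each point to its cluster center, E1 is the same quantity with the whole data set as a single cluster, and DK is the largest distance between two cluster centers. The exponent default p = 2 and the function names are assumptions.

```python
import math
from itertools import combinations

def centroid(members):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(col) / len(members) for col in zip(*members))

def scatter(members):
    """Sum of distances from each member to the centroid of the group."""
    z = centroid(members)
    return sum(math.dist(p, z) for p in members)

def index_i(points, labels, p=2):
    ks = sorted(set(labels))
    groups = [[pt for pt, l in zip(points, labels) if l == c] for c in ks]
    k = len(groups)
    e_1 = scatter(points)                       # E1: all points as one cluster
    e_k = sum(scatter(g) for g in groups)       # EK: within-cluster scatter
    d_k = max(math.dist(centroid(a), centroid(b))
              for a, b in combinations(groups, 2))  # DK: max center separation
    return ((1.0 / k) * (e_1 / e_k) * d_k) ** p
```

The sketch needs at least two clusters and fails with a division by zero if EK is zero, for example when every cluster holds a single point.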
Experimental Results • Artificial data sets • AD_10_2, AD_4_3N, AD_2_10
Experimental Results • Real data sets • Crude Oil and Cancer
Experimental Results • In this paper, the correct number of clusters is decided as follows: • Each of the three algorithms yields (Kmax - Kmin + 1) partitions. • Let UKmin, UKmin+1, …, UKmax be these partitions and VKmin, VKmin+1, …, VKmax the corresponding validity index values. • We choose as the correct number of clusters the K* with the optimal index value (maximal for Dunn's index, the CH index, and index I; minimal for the DB index).
Experimental Results • The values of Kmin and Kmax are chosen as 2 and √n, respectively.
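The selection procedure described on these two slides can be sketched as follows. The helper names `candidate_ks` and `select_k` are hypothetical, and rounding √n down to an integer is an assumption, since the slide does not say how a non-integer √n is handled.

```python
import math

def candidate_ks(n):
    """K values from Kmin = 2 to Kmax = floor(sqrt(n)), inclusive."""
    return list(range(2, math.isqrt(n) + 1))

def select_k(index_by_k, larger_is_better=True):
    """Return the K whose validity index value is optimal.

    index_by_k maps each candidate K to the index value of its partition.
    Use larger_is_better=False for the DB index, where smaller is better.
    """
    pick = max if larger_is_better else min
    return pick(index_by_k, key=index_by_k.get)
```

For instance, with made-up index values `{2: 0.5, 3: 0.9, 4: 0.7}`, `select_k` returns 3 when larger is better and 2 when smaller is better.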
Conclusions • We presented a comparison of four cluster validity indices on artificial and real data sets. • Cluster validity indices can be used to determine the appropriate number of clusters. • Index I achieves its maximum value at the correct number of clusters.
Personal Opinion • Drawbacks: • Only crisp clustering algorithms for numeric data are considered. • There is no analysis of the time complexity of the validity indices. • Application: • Index I can be used to search for the correct number of clusters, but how to handle categorical values is still a big problem. • Future work: • A validity index for mixed data needs to be found or developed.