
Performance evaluation of some clustering algorithms and validity indices



  1. Performance evaluation of some clustering algorithms and validity indices Advisor: Dr. Hsu Presenter: Yu Cheng Chen Authors: Ujjwal Maulik and Sanghamitra Bandyopadhyay IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2003, pp. 1650-1654

  2. Outline • Motivation • Objective • Introduction • Clustering Algorithms • Clustering Validity Indices • Experimental Results • Conclusions • Personal Opinion

  3. Motivation • To evaluate and compare the performance of several clustering algorithms and cluster validity indices.

  4. Objective • Show the performance of three clustering algorithms: • K-means • Single linkage • Simulated annealing (SA) • in conjunction with four cluster validity indices: • Davies-Bouldin index • Dunn's index • Calinski-Harabasz index • Index I

  5. Introduction • Clustering searches for a suitable partitioning of a given data set without prior knowledge of the groups. • The two common questions addressed in clustering systems are: • How many clusters are present in the data? • How good is the clustering itself?

  6. Introduction • The validity measure of the clusters should be able to impose an ordering of the clusterings in terms of their goodness. • For example: • U1, U2, …, Um are m partitions of data set X • V1, V2, …, Vm are the corresponding values of a validity measure • V_k1 >= V_k2 >= … >= V_km indicates that U_k1 is better than U_k2, and so on.

  7. Clustering Algorithms • K-means: [Figure: a data set, k cluster centers, assignment of unassigned data to the nearest center, and recomputation of the centroids.]
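The K-means loop sketched on this slide (assign each point to its nearest center, then recompute the centroids) can be written as the following Python sketch. This is illustrative only, not the authors' code; the function name, iteration cap, and random initialization are my own choices.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Pick k distinct points as the initial cluster centers.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Reassign each centroid as the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return labels, centroids
```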

  8. Clustering Algorithms • Single linkage: [Figure: points A, B, C, D; merge step at distance 2.]

  9. Clustering Algorithms • Single linkage: [Figure: points A, B, C, D; merge step at distance 3.]

  10. Clustering Algorithms • Single linkage: [Figure: points A, B, C, D; merge step at distance 10.]
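Slides 8-10 illustrate single linkage repeatedly merging the pair of clusters whose closest members are nearest to each other. Below is a naive O(n^3) Python sketch of that procedure; the function name and the stopping rule (stop once k clusters remain) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def single_linkage(X, k):
    """Agglomerative single-linkage sketch: merge closest clusters until k remain."""
    clusters = [[i] for i in range(len(X))]          # every point starts alone
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    while len(clusters) > k:
        # Single-link distance between two clusters = smallest pairwise distance.
        best = (0, 1, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a].extend(clusters[b])              # merge the closest pair
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels
```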

  11. Clustering Algorithms • Simulated annealing:

  12. Clustering Algorithms • Simulated annealing: Components: • "Energy" (cost) function to minimize • represents the entire state and drives the system forward • Moves • local rearrangements • Cooling schedule • initial temperature • temperature steps (sequence) • time at each temperature

  13. Clustering Algorithms • Basic algorithm of SA [Figure: cost landscape showing a starting point, the descent direction, local minima, and the global minimum.] • Pick an initial solution • Set temperature (T) to its initial value and choose a minimum temperature Tmin • while (T > Tmin) • for the time allotted at T • pick a move at random • compute ΔCost • if ΔCost <= 0, then accept • else if random prob. < e^(−ΔCost/T), accept • update (lower) T

  14. Clustering Algorithms • How is SA applied to clustering? • The cost function is [formula on slide] • The new state is accepted or rejected according to the probability [formula on slide]
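The SA loop of slide 13 applied to clustering (slide 14) can be sketched as below. Since the slide's cost formula was not transcribed, the sketch assumes a common choice, the total distance of points to their cluster centroids; the move type (reassigning one randomly chosen point) and all parameter values are likewise illustrative assumptions.

```python
import numpy as np

def sa_cost(X, labels, k):
    """Assumed cost: total distance of points to their cluster centroids."""
    cost = 0.0
    for j in range(k):
        members = X[labels == j]
        if len(members):
            cost += np.linalg.norm(members - members.mean(axis=0), axis=1).sum()
    return cost

def sa_clustering(X, k, T=10.0, T_min=1e-3, alpha=0.95, moves_per_T=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))         # random initial assignment
    cost = sa_cost(X, labels, k)
    while T > T_min:
        for _ in range(moves_per_T):
            # Move: reassign one random point to a random cluster.
            i, new = rng.integers(len(X)), rng.integers(k)
            old = labels[i]
            labels[i] = new
            new_cost = sa_cost(X, labels, k)
            d_cost = new_cost - cost
            # Accept downhill moves; accept uphill moves with prob. e^(-dCost/T).
            if d_cost <= 0 or rng.random() < np.exp(-d_cost / T):
                cost = new_cost
            else:
                labels[i] = old                      # reject: undo the move
        T *= alpha                                   # cooling step
    return labels
```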

  15. Clustering Validity Indices • Aim: to identify compact and well-separated clusters • Davies-Bouldin (DB) index: • based on intra-cluster distances (scatter within each cluster) • and inter-cluster distances (separation between cluster centers) • Smaller values of DB correspond to better clusterings
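The slide's DB formula was an image and is not reproduced above, so the following sketch uses the standard definition: the per-cluster scatter S_k (average distance of a cluster's points to its centroid), the centroid separation d(z_i, z_j), and the ratio (S_i + S_j) / d(z_i, z_j) maximized over j and averaged over all clusters i.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index (standard form); lower values mean better clusters."""
    ks = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in ks])
    # S_k: average distance of cluster k's points to its centroid (scatter).
    S = np.array([np.linalg.norm(X[labels == k] - centroids[i], axis=1).mean()
                  for i, k in enumerate(ks)])
    K = len(ks)
    db = 0.0
    for i in range(K):
        # Worst-case similarity of cluster i to any other cluster.
        ratios = [(S[i] + S[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(K) if j != i]
        db += max(ratios)
    return db / K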

  16. Clustering Validity Indices • Dunn's index: • S and T are two nonempty subsets (clusters). • The diameter of S is defined as Δ(S) = max{d(x, y) : x, y ∈ S} • The distance between S and T is δ(S, T) = min{d(x, y) : x ∈ S, y ∈ T} • Larger values of v_D correspond to better clusterings.
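A direct sketch of Dunn's index from the definitions on this slide: the smallest between-cluster distance δ divided by the largest cluster diameter Δ. It assumes at least two clusters and uses brute-force pairwise distances for clarity.

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn's index v_D = min inter-cluster distance / max cluster diameter."""
    ks = np.unique(labels)
    clusters = [X[labels == k] for k in ks]          # assumes at least 2 clusters
    # Largest diameter: max pairwise distance within any single cluster.
    diam = max(np.linalg.norm(c[:, None] - c[None, :], axis=2).max() for c in clusters)
    # Smallest separation: min pairwise distance between any two clusters.
    sep = min(np.linalg.norm(a[:, None] - b[None, :], axis=2).min()
              for i, a in enumerate(clusters) for b in clusters[i + 1:])
    return sep / diam
```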

  17. Clustering Validity Indices • Calinski-Harabasz (CH) index: • n_k is the number of points in cluster k • z is the centroid of the entire data set.
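The slide's CH formula image was not transcribed, so this sketch follows the standard definition: the between-cluster scatter (weighted by n_k) divided by K − 1, over the within-cluster scatter divided by n − K; larger values indicate better partitions.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Calinski-Harabasz index (standard form); assumes 2 <= K < n."""
    n, ks = len(X), np.unique(labels)
    K = len(ks)
    z = X.mean(axis=0)                               # centroid of the entire data set
    between = within = 0.0
    for k in ks:
        members = X[labels == k]
        zk = members.mean(axis=0)
        between += len(members) * np.sum((zk - z) ** 2)   # between-cluster scatter
        within += np.sum((members - zk) ** 2)             # within-cluster scatter
    return (between / (K - 1)) / (within / (n - K))
```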

  18. Clustering Validity Indices • Index I: • n is the total number of points in the data set • z_k is the center of the kth cluster • If K or E_K increases, index I decreases • If K increases, D_K increases • These three factors compete with and balance one another critically.
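The slide's formula for index I was not transcribed, so the sketch below assumes the usual statement I(K) = ((1/K) · (E_1/E_K) · D_K)^p with p = 2, where E_K is the total distance of points to their cluster centers, E_1 is the same quantity for a single cluster, and D_K is the largest distance between two cluster centers.

```python
import numpy as np

def index_I(X, labels, p=2):
    """Index I sketch under the assumed form ((1/K) * (E_1/E_K) * D_K) ** p."""
    ks = np.unique(labels)
    K = len(ks)                                      # assumes K >= 2
    centroids = np.array([X[labels == k].mean(axis=0) for k in ks])
    # E_1: total distance of all points to the centroid of the whole data set.
    E1 = np.linalg.norm(X - X.mean(axis=0), axis=1).sum()
    # E_K: total distance of points to their own cluster centers.
    EK = sum(np.linalg.norm(X[labels == k] - centroids[i], axis=1).sum()
             for i, k in enumerate(ks))
    # D_K: largest distance between any two cluster centers.
    DK = max(np.linalg.norm(centroids[i] - centroids[j])
             for i in range(K) for j in range(i + 1, K))
    return ((1.0 / K) * (E1 / EK) * DK) ** p
```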

  19. Clustering Validity Indices • Xie and Beni defined an index that is the ratio of the compactness of a fuzzy K-partition of the data to its separation. • Here, the relationship between index I and Dunn's index is shown as follows:

  20. Experimental Results • Artificial data sets • AD_10_2, AD_4_3N, AD_2_10

  21. Experimental Results • Real data sets • Crude Oil, and Cancer

  22. Experimental Results • In this paper, the correct number of clusters is determined as follows: • Each of the three algorithms yields (Kmax − Kmin + 1) partitions. • Let U_Kmin, U_Kmin+1, …, U_Kmax be these partitions and V_Kmin, V_Kmin+1, …, V_Kmax the corresponding validity index values. • The K* with the maximal index value is chosen as the correct number of clusters.

  23. Experimental Results • The values of Kmin and Kmax are chosen as 2 and √n, respectively, where n is the number of data points.
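The selection procedure of slides 22-23 can be sketched as below: cluster the data for every K from Kmin = 2 to Kmax = √n and keep the K whose partition maximizes the validity index. It reuses the kmeans and index_I sketches above, and the three-blob synthetic example is purely illustrative.

```python
import numpy as np

def choose_k(X):
    """Pick K* in [2, sqrt(n)] that maximizes index I over K-means partitions."""
    n = len(X)
    k_min, k_max = 2, int(np.sqrt(n))
    scores = {}
    for k in range(k_min, k_max + 1):
        labels, _ = kmeans(X, k)                     # kmeans sketch from slide 7
        scores[k] = index_I(X, labels)               # index I sketch from slide 18
    return max(scores, key=scores.get), scores

# Illustrative usage: three well-separated blobs should typically give K* = 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0.0, 5.0, 10.0)])
k_star, scores = choose_k(X)
print(k_star)
```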

  24. Conclusions • A comparison of four cluster validity indices on artificial and real data sets was presented. • Cluster validity indices can be used to determine the appropriate number of clusters. • Index I achieves its maximum value for the correct number of clusters.

  25. Personal Opinion • Drawbacks: • Only crisp clustering algorithms for numeric data are considered. • There is no analysis of the time complexity of the validity indices. • Application: • Index I can be used to search for the correct number of clusters, but handling categorical values is still a big problem. • Future Work: • A validity index for mixed data needs to be found or developed.
