An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data Presenter: Cheng-Han Tsai Authors: Liang Bai, Jiye Liang, Chuangyin Dang KBS, 2011
Outline • Motivation • Objectives • Methodology • Experiments • Conclusions • Comments
Motivation The k-modes algorithm is sensitive to the initial cluster centers and requires the number of clusters to be given in advance. There is no guarantee that the number of clusters we select is the best one.
Objectives • To propose an initialization method to find initial cluster centers and the number of clusters. • The method can efficiently deal with large categorical data in linear time.
Methodology [Flowchart: data set → construct a potential exemplar set S → set the estimated number of clusters → k-modes-type algorithm → the clustering result]
Methodology The k-modes algorithm • Hamming distance: the number of positions at which two codes differ (computed with XOR) • Example: 10001001 XOR 10110001 = 00111000 → Hamming distance = 3
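The Hamming distance example above, and the way a k-modes-type algorithm uses the same mismatch count on categorical attributes, can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function names (hamming_bits, mismatch, k_modes) and the toy data in the usage note are assumptions made for this sketch, and the initial centers are simply passed in rather than found by the proposed initialization method.

```python
from collections import Counter


def hamming_bits(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings, via XOR."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    # XOR leaves a 1 in every position where the two codes differ.
    return bin(int(a, 2) ^ int(b, 2)).count("1")


def mismatch(x, y) -> int:
    """Number of categorical attributes on which objects x and y differ."""
    return sum(xi != yi for xi, yi in zip(x, y))


def k_modes(data, centers, max_iter=100):
    """Plain k-modes loop (illustrative); `centers` are the initial modes."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each object goes to the nearest mode.
        labels = [
            min(range(len(centers)), key=lambda j: mismatch(x, centers[j]))
            for x in data
        ]
        # Update step: the new mode takes the most frequent value per attribute.
        new_centers = []
        for j, old in enumerate(centers):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if not members:
                new_centers.append(old)  # keep the mode of an empty cluster
                continue
            new_centers.append(
                tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
            )
        if new_centers == centers:  # converged: modes no longer change
            break
        centers = new_centers
    return labels, centers
```

For example, hamming_bits("10001001", "10110001") reproduces the distance of 3 from the slide. Calling k_modes on a toy data set such as [("a", "x"), ("a", "y"), ("b", "z"), ("b", "z")] with different initial centers shows why the choice of initial centers matters: the loop only refines whatever modes it is given.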
Methodology • New cluster centers initialization method • Finding the number of clusters
Methodology New cluster centers initialization method.
Methodology • Finding the number of clusters • We need to input a value k’, which is an estimated number of clusters • If k’ cannot be determined, we set k’ = |S|, the size of the potential exemplar set S
Methodology • More than one knee point of the function P(k) • More than one peak of the function C(k)
Experiments • Performance analysis • Soybean data (4 diseases) • Lung cancer data (3 classes) • Zoo data (7 classes: 3 big clusters and 4 small clusters) • Mushroom data (2 classes) • Scalability analysis
Experiments Performance analysis
Experiments • Scalability analysis • 67557 data points and 42 categorical attributes
Conclusions The proposed method is effective and efficient at obtaining good initial cluster centers and the number of clusters. The analysis shows that its time complexity is linear.
Comments • Advantages • Improves on earlier methods for setting the two parameters (the initial cluster centers and the number of clusters) • Applications • Data clustering