Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011

An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data

Outline
• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments

Motivation
The k-modes algorithm is sensitive to its initial cluster centers and requires the number of clusters to be specified in advance. There is no guarantee that the number of clusters we choose is the best one.

Objectives
• To propose an initialization method that finds both the initial cluster centers and the number of clusters.
• The method should handle large categorical data sets efficiently, in linear time.

Methodology
[Flowchart, with elements numbered 1 to 7 in the original slide: data set; construct a potential exemplar set S; set the estimated number of clusters; k-modes-type algorithm; the clustering result]

Methodology
The k-modes algorithm
Hamming distance: the number of positions at which two codes differ (computed with XOR). Example:
        10001001
    XOR 10110001
    ------------
        00111000 → Hamming distance = 3
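The XOR example above, and the way a k-modes-type algorithm uses the same idea on categorical attributes, can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`hamming_bits`, `matching_dissimilarity`, `mode_of`, `k_modes`) are hypothetical, the dissimilarity is assumed to be the plain count of mismatched attributes, and the initial centers are simply passed in by the caller.

```python
from collections import Counter


def hamming_bits(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings, via XOR."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return bin(int(a, 2) ^ int(b, 2)).count("1")


def matching_dissimilarity(x, y) -> int:
    """Number of attributes on which two categorical objects differ."""
    return sum(xi != yi for xi, yi in zip(x, y))


def mode_of(cluster):
    """Attribute-wise most frequent value of a non-empty list of objects."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))


def k_modes(data, centers, max_iter=100):
    """Alternate assignment and mode update until assignments stop changing."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assign each object to the nearest center (fewest mismatches).
        new_labels = [
            min(range(len(centers)),
                key=lambda j: matching_dissimilarity(x, centers[j]))
            for x in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each center as the mode of its members.
        for j in range(len(centers)):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # assumption: keep the old center if a cluster is empty
                centers[j] = mode_of(members)
    return labels, centers


print(hamming_bits("10001001", "10110001"))  # the example from the slide
```

Because `k_modes` takes the centers as an argument, running it with different starting centers on the same data is a simple way to observe the sensitivity to initialization that motivates the work.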

Methodology New cluster centers initialization method Finding the number of clusters

Methodology New cluster centers initialization method.

Methodology

Methodology
• Finding the number of clusters
  • We need to input a value k', which is an estimated number of clusters
  • If k' cannot be determined, we set k' = |S|
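The fallback rule for k' can be written as a small helper. This is only a sketch: the name `estimated_k` is hypothetical, and the check that a user-supplied k' lies between 1 and |S| is an assumption inferred from |S| being used as the default, not something stated on the slide.

```python
def estimated_k(exemplars, k_prime=None):
    """Return the estimated number of clusters k'.

    If the caller cannot supply k', fall back to |S|, the size of the
    potential exemplar set.
    """
    size = len(exemplars)
    if k_prime is None:
        return size
    # Assumption: k' is only meaningful in the range 1..|S|.
    if not 1 <= k_prime <= size:
        raise ValueError("k' must be between 1 and |S|")
    return k_prime
```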

Methodology

Methodology
• More than one knee point of the function P(k)
• More than one peak of the function C(k)

Experiments
• Performance analysis
  • Soybean data (4 diseases)
  • Lung cancer data (3 classes)
  • Zoo data (7 classes: 3 large clusters and 4 small clusters)
  • Mushroom data (2 classes)
• Scalability analysis

Experiments Performance analysis

Experiments

Experiments
• Scalability analysis
  • 67,557 data points and 42 categorical attributes

Conclusions
The proposed method is effective and efficient at obtaining good initial cluster centers and the number of clusters. Its time complexity has been shown to be linear.

Comments
• Advantages
  • Improves on earlier approaches to setting the two parameters (the initial cluster centers and the number of clusters)
• Applications
  • Data clustering
