Categorical K-means Clustering Algorithm
Graduate: Chen, Shao-Pei
Authors: C.C. Hsu
TKDE
Outline • Motivation • Objective • Methodology • Experimental Results • Conclusion • Appendix
Motivation • The k-means algorithm works only for numerical data. • The drawback of k-modes-type algorithms is that similarity information between categorical values is not considered.
Objective • We propose a categorical k-means algorithm (CAKM) based on distance hierarchies to overcome these problems.
Methodology • Distance hierarchy
A distance hierarchy represents the similarity of categorical values and is further used for measuring the distance between two multidimensional mixed-type data points.
[Figure: (a) the general structure of a distance hierarchy, a tree of nodes nd_{i,h} connected by links lk_{i,h} from the root down to the leaves; (b) an example hierarchy rooted at Any whose leaves include Coke, Pepsi, and TV.]
• A branch is the sequence of links from the root to a leaf, e.g., lk1 = {lk1,1, lk1,2} where the link lk1,1 = (Any, Drink) and lk1,2 = (Drink, Coke).
• Two branches may share the same nodes and links in the upper part of the hierarchy, e.g., nd1,1 = nd2,1 and lk1,1 = lk2,1.
• A point is an (anchor, offset) pair: Y = (TV, 1.2) indicates that Y is on the path between Any and TV, 1.2 away from the root Any.
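To make the point representation concrete, here is a minimal Python sketch, not the authors' code: the PATHS table and the (anchor, offset) encoding follow the slide's example with unit-weight links, and the distance d(X, Y) = d_X + d_Y − 2·d_LCP(X, Y) anticipates the definition on the next slide.

```python
# A minimal sketch (not the authors' code) of the slide's example hierarchy.
# Each leaf stores its root-to-leaf path; every link has unit weight, and a
# point is an (anchor, offset) pair, as in Y = ("TV", 1.2).
PATHS = {
    "Coke":  ["Any", "Drink", "Coke"],
    "Pepsi": ["Any", "Drink", "Pepsi"],
    "TV":    ["Any", "Appliance", "TV"],
    "PC":    ["Any", "Appliance", "PC"],
}

def lcp_offset(x, y):
    """Offset of the least common point of two (anchor, offset) points."""
    (ax, dx), (ay, dy) = x, y
    shared = 0  # number of links the two paths share, starting from the root
    for a, b in zip(PATHS[ax][1:], PATHS[ay][1:]):
        if a != b:
            break
        shared += 1
    return min(shared, dx, dy)  # the common point is no deeper than either point

def distance(x, y):
    """d(X, Y) = d_X + d_Y - 2 * d_LCP(X, Y)."""
    return x[1] + y[1] - 2 * lcp_offset(x, y)

X, W, Y = ("Coke", 2.0), ("Pepsi", 1.9), ("TV", 1.2)
print(distance(X, W))  # 1.9 -- common point Drink, offset 1
print(distance(X, Y))  # 3.2 -- common point Any, offset 0
```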
Methodology • Centroid of a loaded distance hierarchy
To determine the centroid, we need a loaded distance hierarchy, along with its intra distance, local centroids, centroid, and winning branch.
[Figure: a loaded distance hierarchy of height 2 rooted at Any, with internal nodes Drink and Appliance and leaves Coke (12 data points), Pepsi (14), and TV and PC (17 and 3 between them); example points D, M, R, W, X, Y, and Z are marked on the links.]
• Distance between two points: the distance between two points X and Y in a distance hierarchy is d(X, Y) = d_X + d_Y − 2·d_{LCP(X,Y)}, where d_P is the offset of point P from the root and LCP(X, Y) is the least common point of X and Y.
Example: the distances of point X = (Coke, 2) to W = (Pepsi, 1.9) and to Y = (TV, 1.2) are d(X, W) = 2 + 1.9 − 2·1 = 1.9 and d(X, Y) = 2 + 1.2 − 2·0 = 3.2.
• Definition 1 (intra distance of a loaded distance hierarchy): the intra distance of a point P is the square root of the sum of the squared distances from P to all the data points loaded on the hierarchy.
Example: the intra distance with respect to a point D at Drink is Intra_dh(D) = (26·1² + 20·3²)^{1/2} = (206)^{1/2} = 14.4.
• Definition 2 (local centroid of a link): the local centroid LC_{r,l} of link lk_{r,l} (between offsets l−1 and l of the rth branch) is the point on the link that minimizes the intra distance, i.e., the unconstrained minimizer clamped to the link's offset range.
Example: the offset of the local centroid of lk2,1 = (Any, Drink) is ((12·2 + 14·2) − (17·2 + 3·2))/46 = 0.26, so LC2,1 = (Pepsi, 0.26). The unconstrained offset for lk2,2 = (Drink, Pepsi) is −0.26; because −0.26 < 2 − 1, the lower bound of the link's range, the offset is clamped to 1 and LC2,2 = (Pepsi, 1). Since lk1,1 = lk2,1, their local centroids coincide: LC1,1 = (Coke, 0.26) = (Pepsi, 0.26) = LC2,1.
• Definition 3 (centroid of a loaded distance hierarchy): the centroid C of dh can be determined by identifying the point among the local centroids {LC_{r,l}} that gives the minimum intra distance.
• Definition 4 (winning branch of a loaded distance hierarchy): the winning branch bch_w is the branch where the centroid C of dh resides. The search for C can be restricted to this branch: C must reside on the branch led by the link (Any, nd_{i,1}) where nd_{i,1} has the most accumulated count of points on its leaves.
Example: the accumulated count of branch bch1 (Any, …, Coke) is 12 + (12 + 14) = 38. The winning branch is bch2 (Any, …, Pepsi), with the maximum accumulated count 14 + (12 + 14) = 40.
• Algorithm 1: Determining the centroid of a loaded distance hierarchy
Input: a loaded distance hierarchy dh with data points distributed on the leaves.
Output: the centroid point C.
1. Determine the winning branch bch_w among all the branches {bch_r} in dh, i.e., the branch that generates the maximum accumulated data count;
2. Calculate the local centroids {LC_{w,l}} of the links of bch_w;
3. Determine C from {LC_{w,l}} as the point that gives the minimum intra distance.
Example (a runnable reconstruction follows below):
C = argmin {Intra_dh(LC2,1), Intra_dh(LC2,2)}
  = argmin {Intra_dh((Pepsi, 0.26)), Intra_dh((Pepsi, 1))}
  = argmin {13.448, 14.353}
  = (Pepsi, 0.26)
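The definitions above can be turned into a short sketch of Algorithm 1. This is a reconstruction, not the paper's code: intra, accumulated, and local_centroid are hypothetical helper names, the clamped one-dimensional minimizer is inferred from the LC2,1 and LC2,2 examples, and the 17/3 split between TV and PC is read off the figure (only their sum affects the result). It reuses PATHS and distance from the previous sketch.

```python
import math

HEIGHT = 2  # height of the example hierarchy; all data points sit on leaves

def shared_depth(leaf_a, leaf_b):
    """Number of links the two root-to-leaf paths share, from the root down."""
    s = 0
    for a, b in zip(PATHS[leaf_a][1:], PATHS[leaf_b][1:]):
        if a != b:
            break
        s += 1
    return s

def intra(point, counts):
    """Intra distance: sqrt of summed squared distances to all loaded points."""
    return math.sqrt(sum(n * distance(point, (leaf, HEIGHT)) ** 2
                         for leaf, n in counts.items()))

def accumulated(leaf, counts):
    """Accumulated count of a branch: the subtree counts of every node on its
    path, e.g. bch(Any, ..., Coke) = (12 + 14) + 12 = 38."""
    path = PATHS[leaf]
    return sum(n for other, n in counts.items()
               for d in range(1, len(path))
               if PATHS[other][:d + 1] == path[:d + 1])

def local_centroid(branch_leaf, link, counts):
    """1-D minimizer of the summed squared distances over link [link-1, link],
    clamped to the link's offset range (cf. LC2,2 = (Pepsi, 1) on the slide)."""
    num = 0.0
    for leaf, n in counts.items():
        a = shared_depth(branch_leaf, leaf)  # where this leaf branches off
        if a >= link:   # leaf hangs below the link:  d = HEIGHT - t
            num += n * HEIGHT
        else:           # leaf is off the branch:     d = t + HEIGHT - 2a
            num -= n * (HEIGHT - 2 * a)
    t = num / sum(counts.values())
    return (branch_leaf, min(max(t, link - 1.0), float(link)))

def centroid(counts):
    """Algorithm 1: winning branch -> its local centroids -> minimum intra."""
    win = max(counts, key=lambda leaf: accumulated(leaf, counts))
    cands = [local_centroid(win, l, counts) for l in range(1, HEIGHT + 1)]
    return min(cands, key=lambda p: intra(p, counts))

# The slide's loaded hierarchy (TV/PC split assumed to be 17/3):
print(centroid({"Coke": 12, "Pepsi": 14, "TV": 17, "PC": 3}))
# -> ('Pepsi', 0.2608...), i.e. C = (Pepsi, 0.26) as on the slide
```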
Methodology • Categorical k-means algorithm
• Algorithm 2: The categorical k-means clustering algorithm CAKM
Input: an m-dimensional dataset X = {x1, …, xn}, a set of m distance hierarchies DH = {dh1, …, dhm}, and the cluster number K.
Output: clusters C = {C1, …, CK}.
1. Initialize K cluster centroids {vk};
2. do
3. Compute the distance of each xi to each vk by mapping their components to the distance hierarchies and aggregating the pairwise distances between xi and vk;
4. Assign each xi to the cluster Ck whose centroid vk is closest to xi;
5. Update each centroid vk by loading the cluster's data Ck into the distance hierarchies and then determining the centroids of the loaded hierarchies;
6. repeat until the clusters C have no changes.
• The clustering objective: cluster X into K distinct clusters C = {C1, …, CK} by minimizing the cost function P(W, V) = Σ_{k=1}^{K} Σ_{i=1}^{n} w_{ik} · d(xi, vk)², subject to Σ_{k=1}^{K} w_{ik} = 1 and w_{ik} ∈ {0, 1}, where W is the partition matrix representing the membership between data points and clusters: w_{ik} = 1 if xi ∈ Ck and w_{ik} = 0 otherwise.
• Distance between a data point and a cluster centroid: the distance between the m-dimensional xi and vk is d(xi, vk) = (Σ_{j=1}^{m} d_{dh_j}(X_{i,j}, V_{k,j})²)^{1/2}, where X_{i,j} and V_{k,j} are the mapping points of x_{i,j} and v_{k,j} in dh_j, 1 ≤ j ≤ m.
• Step 1 (initialization): choose initial centroids {v_k^(0)}, 1 ≤ k ≤ K; each component v_{k,j}^(0) is randomly drawn from the domain of attribute j, with an offset that is a non-negative real number between 0 and the height of dh_j.
• Step 2 (group assignment): at step t, each xi ∈ X is assigned to the closest of the K centroids {v_k^(t)}, i.e., w_{ik} is set to 1 for that centroid and 0 for the others.
• Step 3 (update centroids): for each cluster Ck, the centroid vk = [v_{k,1}, …, v_{k,m}] is recomputed as v_{k,j} = Centroid(dh_j) after loading the cluster's data into the hierarchies, using the centroid of a loaded distance hierarchy (Algorithm 1).
• Step 4 (convergence judgment): if there is no change of data points in any cluster, stop; otherwise set t ← t + 1 and go to Step 2. (A runnable sketch of the whole loop follows below.)
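As an illustration only, here is how the four steps compose into a clustering loop for a single categorical attribute, reusing distance, HEIGHT, and centroid from the sketches above. The cakm function and its signature are hypothetical; the paper's algorithm is m-dimensional and would aggregate the per-attribute distances as defined above.

```python
import random
from collections import Counter

def cakm(data, k, max_iter=100, seed=0):
    """One-attribute sketch of the CAKM loop; data is a list of leaf labels."""
    rng = random.Random(seed)
    leaves = list(PATHS)
    # Step 1: a random anchor plus a random offset in [0, height] per centroid.
    cents = [(rng.choice(leaves), rng.uniform(0, HEIGHT)) for _ in range(k)]
    assign = None
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the closest centroid.
        new = [min(range(k), key=lambda c: distance((x, HEIGHT), cents[c]))
               for x in data]
        if new == assign:        # Step 4: stop when memberships are unchanged
            break
        assign = new
        # Step 3: reload each cluster's data and recompute its centroid with
        # Algorithm 1 (an empty cluster keeps its previous centroid).
        for c in range(k):
            members = Counter(x for x, a in zip(data, assign) if a == c)
            if members:
                cents[c] = centroid(members)
    return assign, cents

points = ["Coke"] * 12 + ["Pepsi"] * 14 + ["TV"] * 17 + ["PC"] * 3
_, cents = cakm(points, k=2)
print(cents)  # one centroid under Drink, one under Appliance (seed-dependent)
```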
Experimental Results
• Clustering accuracy: r = (Σ_{k=1}^{K} a_k)/n, where a_k is the number of data points occurring in both the kth cluster and its corresponding class, and n is the number of training data.
• Synthetic dataset: [Table: average clustering accuracy achieved by the four clustering algorithms on the synthetic dataset.] CAKM correctly classified the whole dataset in 63 of 100 runs.
• Real datasets: [Table: the average accuracy achieved by the four algorithms on the five real datasets.]
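The accuracy measure can be scored in a few lines. Note that the cluster-to-class mapping used here (each cluster matched to its majority class) is our assumption about what "corresponding class" means on the slide.

```python
from collections import Counter

def clustering_accuracy(assign, truth):
    """r = (sum_k a_k) / n, taking a_k as the overlap between cluster k and
    its majority class (our reading of 'corresponding class')."""
    best = Counter()
    for key, cnt in Counter(zip(assign, truth)).items():
        cluster = key[0]
        best[cluster] = max(best[cluster], cnt)
    return sum(best.values()) / len(truth)

print(clustering_accuracy([0, 0, 1, 1], ["a", "a", "a", "b"]))  # 0.75
```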
Experimental Results
• Clustering quality: [Tables: ACU computed at the leaf and at Level 1, and the increase rate, for the Adult dataset and for the Car Evaluation dataset.]
• Time and space complexity: [Tables: time complexity and space complexity of FKM, FKMFC, KM, and CAKM.]
Conclusion • According to the experimental results, CAKM achieves the best accuracy in clustering and classification. • CAKM can reflect the cluster structure of the data when the categorical values are similar to one another to different extents. • Future work: extend the algorithm to mixed numeric and categorical data, and compare it with recent mixed-type clustering algorithms.