
Categorical K-means Clustering Algorithm




  1. Categorical K-means Clustering Algorithm Graduate: Chen, Shao-Pei Authors: C.C. Hsu (TKDE)

  2. Outline • Motivation • Objective • Methodology • Experimental Results • Conclusion • Appendix

  3. Motivation • The k-means algorithm works only for numerical data. • The problem with k-modes-type algorithms is that similarity information between categorical values is not considered.

  4. Objective • We propose a categorical k-means algorithm (CAKM) based on distance hierarchies to overcome these problems.

  5. Methodology
  • Distance hierarchy: a distance hierarchy represents the similarity of categorical values and is further used for measuring the distance between two multidimensional mixed-type data points.
  • [Figure: (a) the general structure of a distance hierarchy, with nodes nd1,h, ..., ndN,h and links lk1,h, ..., lkN,h at each height h; (b) a concrete hierarchy rooted at Any, with internal nodes Drink and Appliance and leaves Coke, Pepsi, TV and PC.]
  • A branch is the set of links from the root to a leaf, e.g. lk1 = {lk1,1, lk1,2}, where lk1,1 = (Any, Drink) and lk1,2 = (Drink, Coke). Two branches may share the same nodes and links in the upper part of the hierarchy, e.g. nd1,1 = nd2,1 and lk1,1 = lk2,1.
  • A point in a distance hierarchy is written as an anchor together with an offset from the root, e.g. Y = (TV, 1.2) indicates that Y lies on the path between Any and TV, 1.2 away from the root Any.
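
To make the point-to-point distance concrete, here is a minimal Python sketch of a distance hierarchy. It assumes unit link weights and the distance form d(X, Y) = dX + dY - 2*dLCP, where dLCP is the length of the common part of the two points' root paths; the class name, method names, and printed values below are illustrative, not the authors' code.

```python
# A minimal sketch of a distance hierarchy with unit link weights.
class DistanceHierarchy:
    def __init__(self, parent):
        self.parent = parent                 # child -> parent; the root has no entry

    def path_to_root(self, node):
        path = [node]
        while path[-1] in self.parent:
            path.append(self.parent[path[-1]])
        return path                          # e.g. ["Coke", "Drink", "Any"]

    def depth(self, node):
        return len(self.path_to_root(node)) - 1   # unit link weights

    def lcp_depth(self, a, b):
        # depth of the least common ancestor of the two anchor nodes
        ancestors_a = set(self.path_to_root(a))
        node, d = b, self.depth(b)
        while node not in ancestors_a:
            node, d = self.parent[node], d - 1
        return d

    def distance(self, x, y):
        # x, y are points written as (anchor_node, offset_from_root)
        (nx, dx), (ny, dy) = x, y
        shared = min(self.lcp_depth(nx, ny), dx, dy)   # length of the shared path
        return dx + dy - 2 * shared

# The hierarchy from the slide: Any -> {Drink, Appliance} -> {Coke, Pepsi, TV, PC}.
dh = DistanceHierarchy({"Drink": "Any", "Appliance": "Any",
                        "Coke": "Drink", "Pepsi": "Drink",
                        "TV": "Appliance", "PC": "Appliance"})

print(dh.distance(("Coke", 2), ("Pepsi", 1.9)))   # ~1.9: the points share the Any-Drink link
print(dh.distance(("Coke", 2), ("TV", 1.2)))      # ~3.2: the paths share only the root
```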

  6. Methodology
  • To determine the centroid of a cluster, we need a loaded distance hierarchy, i.e. a distance hierarchy whose leaves carry the data points of the cluster, together with its intra distance, local centroids, centroid, and winning branch.
  • The distance between two points in a distance hierarchy is determined from their offsets and the part of the root path they share. Example: for X = (Coke, 2), W = (Pepsi, 1.9) and Y = (TV, 1.2), the distance from X to W is small because both lie below the shared link (Any, Drink), whereas the distance from X to Y is larger because their paths share only the root.
  • Definition 1 (intra distance of a loaded distance hierarchy): the intra distance of a point P aggregates the distances from all loaded data points to P. Example: for a point D at Drink, Intradh(D) = (206)^(1/2) ≈ 14.4.
  • Definition 2 (local centroid of a link): the local centroid LCr,l of link lkr,l is a point anchored at the leaf of the rth branch with an offset restricted to that link. Example: the offset of the local centroid of lk2,1 = (Any, Drink) is ((12*2 + 14*2) - (17*2 + 3*2))/46 = 0.26; since lk1,1 = lk2,1, their local centroids coincide, LC1,1 = (Coke, 0.26) = (Pepsi, 0.26) = LC2,1. For lk2,2 = (Drink, Pepsi) the local centroid is LC2,2 = (Pepsi, 1), because 0.26 < 2 - 1, i.e. the optimum offset falls below the link.
  • Definition 3 (centroid of a loaded distance hierarchy): the centroid C of dh can be determined by identifying the point among the local centroids {LCr,l} that gives the minimum intra distance. The search can be restricted to a particular branch, called the winning branch. Example: C = argmin {Intradh(LC2,1), Intradh(LC2,2)} = argmin {Intradh((Pepsi, 0.26)), Intradh((Pepsi, 1))} = argmin {13.448, 14.353} = (Pepsi, 0.26).
  • Definition 4 (winning branch of a loaded distance hierarchy): the winning branch bchw is the branch where the centroid C of dh resides. C must reside on a branch led by the link (Any, ndi,1) whose node ndi,1 has the most data points on its leaves. Example: the accumulated count of branch bch1 (Any, Coke) is 12 + (12 + 14) = 38; the winning branch is bch2 (Any, Pepsi), with the maximum accumulated count 14 + (12 + 14) = 40.
  • Algorithm 1: Determining the centroid of a loaded distance hierarchy. Input: a loaded distance hierarchy dh with data points distributed on the leaves. Output: the centroid point C. (1) Determine the winning branch bchw among all the branches {bchr} in dh, i.e. the branch with the maximum accumulated data count; (2) calculate the local centroids {LCw,l} of the links of bchw; (3) determine C from {LCw,l} as the point that gives the minimum intra distance.
  • [Figure: a loaded distance hierarchy rooted at Any with heights 0 to 2, link weights 1, and points W, X, Y, Z, D marked; the leaves under Drink (Coke, Pepsi) hold 12 and 14 data points and the leaves under Appliance (TV, PC) hold 17 and 3, 46 points in total; the centroid (Pepsi, 0.26) is marked.]
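
The sketch below follows the three steps of Algorithm 1, reusing the DistanceHierarchy class and the dh instance from the previous sketch. Instead of the paper's closed-form local-centroid formula it scans candidate offsets along the winning branch numerically; the grid step, the helper names, and the exact split of the Appliance-side counts between TV and PC are assumptions.

```python
# A brute-force sketch of Algorithm 1 (centroid of a loaded distance hierarchy).
def centroid(dh, loaded, step=0.01):
    """loaded: dict mapping each leaf to the count of data points on it."""
    leaves = list(loaded)

    # Step 1: winning branch = the root-to-leaf path whose links accumulate the
    # largest data count (each link counts the points beneath its child node).
    def accumulated(leaf):
        return sum(sum(loaded[l] for l in leaves if node in dh.path_to_root(l))
                   for node in dh.path_to_root(leaf)[:-1])   # skip the root Any
    winner = max(leaves, key=accumulated)

    # Steps 2-3: among candidate points on the winning branch, keep the one with
    # the minimum intra distance Intra(P) = sqrt(sum_i d(x_i, P)^2).
    def intra(point):
        return sum(c * dh.distance((leaf, dh.depth(leaf)), point) ** 2
                   for leaf, c in loaded.items()) ** 0.5

    offsets = [i * step for i in range(int(round(dh.depth(winner) / step)) + 1)]
    best = min(((winner, o) for o in offsets), key=intra)
    return best, intra(best)

# Leaf counts from the slide's example: 12, 14 under Drink and 17, 3 under Appliance.
print(centroid(dh, {"Coke": 12, "Pepsi": 14, "TV": 17, "PC": 3}))
# -> roughly (('Pepsi', 0.26), 13.45), matching the slide's C = (Pepsi, 0.26)
```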

  7. Methodology • Categorical k-means algorithm
  • The clustering algorithm: cluster X into K distinct clusters C = {C1, ..., CK} by minimizing a cost function over the cluster memberships, subject to each data point belonging to exactly one cluster. W is a partition matrix representing the membership between data points xi and clusters C; in particular, wik = 1 if xi ∈ Ck, otherwise wik = 0.
  • Distance between a data point and a cluster centroid: the distance between the m-dimensional xi and vk is defined by aggregating the per-attribute distances between Xi,j and Vk,j, 1 ≤ j ≤ m, where Xi,j and Vk,j are the mapping points of xi,j and vk,j in dhj, respectively.
  • Algorithm 2: The categorical k-means clustering algorithm CAKM. Input: an m-dimensional dataset X = {x1, ..., xn}, a set of m distance hierarchies DH = {dh1, ..., dhm}, and the cluster number K. Output: clusters C = {C1, ..., CK}. (1) Initialize K cluster centroids {vk}; (2) do (3) compute the distance between each xi and each vk by mapping their components to the distance hierarchies and aggregating the pairwise distances; (4) assign each xi to the cluster Ck whose centroid vk is closest to xi; (5) update each centroid vk by loading the cluster's data Ck into the distance hierarchies and then determining the centroids of the loaded hierarchies; (6) repeat until the clusters C no longer change.
  • Step 1 Initialization: choose initial centroids {vk}, 1 ≤ k ≤ K, where each component vk,j is randomly drawn from the domain of dhj and its offset is a non-negative real number between 0 and the height of dhj.
  • Step 2 Group assignment: for xi ∈ X and the set of K centroids {vk} at step t, xi is assigned to the closest centroid by setting wik = 1.
  • Step 3 Update centroids of clusters: for a cluster Ck, the cluster centroid is vk = [vk,1, ..., vk,m], where vk,j = Centroid(the data of Ck loaded into dhj); that is, each component is obtained with the centroid of a loaded distance hierarchy (Algorithm 1).
  • Step 4 Convergence judgment: if no data point changes its cluster, stop; otherwise set t = t + 1 and go to Step 2.
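
The sketch below wires the previous pieces into the CAKM loop of Algorithm 2, reusing DistanceHierarchy and centroid() from the earlier sketches. It assumes one distance hierarchy per attribute, leaf-valued records (offset equal to the leaf depth), a Euclidean-style aggregation of the per-attribute distances, and initialization from K randomly drawn records rather than random points in the hierarchies; names and signatures are illustrative, not the authors' code.

```python
# A minimal sketch of Algorithm 2 (CAKM), built on the sketches above.
import random

def cakm(X, hierarchies, K, max_iter=100, seed=0):
    """X: list of records, each a tuple of categorical values (one per attribute).
    hierarchies: list of DistanceHierarchy objects, one per attribute."""
    rng = random.Random(seed)
    m = len(hierarchies)

    def point(value, dh):                    # map a categorical value to (leaf, depth)
        return (value, dh.depth(value))

    def dist(x, v):                          # aggregate the per-attribute distances
        return sum(hierarchies[j].distance(point(x[j], hierarchies[j]), v[j]) ** 2
                   for j in range(m)) ** 0.5

    # Step 1: initialise K centroids (here: K records drawn at random).
    centroids = [[point(x[j], hierarchies[j]) for j in range(m)]
                 for x in rng.sample(X, K)]
    assign = [None] * len(X)
    for _ in range(max_iter):
        # Step 2: assign each record to its closest centroid.
        new_assign = [min(range(K), key=lambda k: dist(x, centroids[k])) for x in X]
        if new_assign == assign:             # Step 4: convergence judgment
            break
        assign = new_assign
        # Step 3: update each centroid attribute-wise by loading the cluster's
        # values into that attribute's distance hierarchy (Algorithm 1).
        for k in range(K):
            members = [x for x, a in zip(X, assign) if a == k]
            if not members:
                continue                     # keep the old centroid for empty clusters
            for j in range(m):
                counts = {}
                for x in members:
                    counts[x[j]] = counts.get(x[j], 0) + 1
                centroids[k][j] = centroid(hierarchies[j], counts)[0]
    return assign, centroids

# Example usage with the single-attribute hierarchy dh from above (illustrative only):
# labels, centers = cakm([("Coke",), ("Pepsi",), ("TV",), ("PC",)], [dh], K=2)
```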

  8. Experimental Results • Clustering accuracy is measured as r = (Σk ak)/n, where ak is the number of data points occurring in both the kth cluster and its corresponding class, and n is the number of training data. • Real dataset: [Table: the average accuracy achieved by the four algorithms for the five real datasets.] • Synthetic dataset: [Table: average clustering accuracy achieved by the four clustering algorithms for the synthetic dataset.] CAKM correctly classified the whole dataset in 63 of 100 runs.
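
A small sketch of the accuracy measure described above, to make r = (Σk ak)/n concrete. It assumes each cluster's "corresponding class" is its majority class, which is the usual matching rule but is not stated on the slide.

```python
# Clustering accuracy: match each cluster to its majority class and sum the matches.
from collections import Counter

def clustering_accuracy(assign, labels):
    n = len(labels)
    a = 0
    for k in set(assign):
        # a_k: data points that fall in cluster k and in its corresponding class
        cluster_labels = [labels[i] for i, c in enumerate(assign) if c == k]
        a += Counter(cluster_labels).most_common(1)[0][1]
    return a / n

print(clustering_accuracy([0, 0, 1, 1, 1], ["A", "A", "B", "B", "A"]))  # 0.8
```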

  9. Experimental Results • Clustering quality: [Table: ACU computed at the leaf and at Level 1, and the increase rate, for the Adult dataset.] [Table: ACU computed at the leaf and at Level 1, and the increase rate, for the Car Evaluation dataset.] • Time and space complexity: [Table: time complexity of FKM, FKMFC, KM and CAKM.] [Table: space complexity of FKM, FKMFC, KM and CAKM.]

  10. Conclusion • According to the experimental results, CAKM achieves the best accuracy in clustering and classification. • CAKM can reflect the cluster structure of the data when the categorical values are similar to one another to different extents. • Future work: • To extend the algorithm to mixed numeric and categorical data. • To compare with recent mixed-type clustering algorithms.
