Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values

Presentation Transcript


  1. Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values
     Author: Zhexue Huang
     Advisor: Dr. Hsu
     Graduate: Yu-Wei Su
     The Lab of Intelligent Database System, IDS

  2. Outline
     • Motivation
     • Objective
     • Research Review
     • Notation
     • K-means Algorithm
     • K-modes Algorithm
     • K-prototypes Algorithm
     • Experiment
     • Conclusion
     • Personal opinion

  3. Motivation
     • K-means methods are efficient for processing large data sets
     • K-means is limited to numeric data
     • Real-world data sets often mix numeric and categorical values and can contain millions of objects

  4. Objective
     • Extend K-means to categorical domains and to domains with mixed numeric and categorical values

  5. Research review
     • Partition methods
     • A partitioning algorithm organizes the N objects into K partitions (K < N)
     • K-means [MacQueen, 1967]
     • K-medoids [Kaufman and Rousseeuw, 1990]
     • CLARANS [Ng and Han, 1994]

  6. Notation
     • $A_1, A_2, \dots, A_m$ are the $m$ attributes; each $A_j$ describes a domain of values, denoted by $\mathrm{DOM}(A_j)$
     • $X = \{X_1, X_2, \dots, X_n\}$ is a set of $n$ objects; object $X_i$ is represented as $[x_{i,1}, x_{i,2}, \dots, x_{i,m}]$
     • $X_i = X_k$ if $x_{i,j} = x_{k,j}$ for $1 \le j \le m$
     • In $[x_{i,1}, \dots, x_{i,p}, x_{i,p+1}, \dots, x_{i,m}]$, the first $p$ elements are numeric values and the rest are categorical values

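  7. K-means Algorithm
     • $k$ is the number of clusters and $n$ is the number of objects
     • $W = [w_{i,l}]$ is an $n \times k$ partition matrix, and $Q = \{Q_1, Q_2, \dots, Q_k\}$ is a set of objects in the same object domain
     • $d(\cdot,\cdot)$ is the squared Euclidean distance between two objects
     • Problem P: minimise
       $$P(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{i,l}\, d(X_i, Q_l)$$
       subject to
       $$\sum_{l=1}^{k} w_{i,l} = 1, \quad 1 \le i \le n$$
       $$w_{i,l} \in \{0, 1\}, \quad 1 \le i \le n,\ 1 \le l \le k$$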

  8. K-means Algorithm (cont.)
     • Problem P can be solved by iteratively solving the following two problems:
     • Problem P1: fix $Q = \hat{Q}$ and solve the reduced problem $P(W, \hat{Q})$:
       $w_{i,l} = 1$ if $d(X_i, Q_l) \le d(X_i, Q_t)$ for $1 \le t \le k$; $w_{i,t} = 0$ for $t \ne l$
     • Problem P2: fix $W = \hat{W}$ and solve the reduced problem $P(\hat{W}, Q)$:
       $$q_{l,j} = \frac{\sum_{i=1}^{n} w_{i,l}\, x_{i,j}}{\sum_{i=1}^{n} w_{i,l}}, \quad 1 \le l \le k,\ 1 \le j \le m$$

  9. K-means Algorithm (cont.)
     1. Choose an initial $Q^0$ and solve $P(W, Q^0)$ to obtain $W^0$. Set $t = 0$.
     2. Let $\hat{W} = W^t$ and solve $P(\hat{W}, Q)$ to obtain $Q^{t+1}$. If $P(\hat{W}, Q^t) = P(\hat{W}, Q^{t+1})$, output $\hat{W}$, $Q^t$ and stop; otherwise, go to 3.
     3. Let $\hat{Q} = Q^{t+1}$ and solve $P(W, \hat{Q})$ to obtain $W^{t+1}$. If $P(W^t, \hat{Q}) = P(W^{t+1}, \hat{Q})$, output $W^t$, $\hat{Q}$ and stop; otherwise, let $t = t + 1$ and go to 2.
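
The alternation between P1 (reassign objects) and P2 (recompute centres) can be sketched in plain Python. This is a minimal illustration, not the author's implementation: the function names, the `max_iter` guard, and the choice to keep the old centre of an empty cluster are all assumptions made for the example.

```python
def sq_euclid(x, y):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def assign(X, Q):
    """Problem P1: with Q fixed, put each object in its nearest cluster."""
    return [min(range(len(Q)), key=lambda l: sq_euclid(x, Q[l])) for x in X]

def update(X, labels, Q):
    """Problem P2: with W fixed, set each centre to the mean of its members."""
    new_Q = []
    for l, q in enumerate(Q):
        members = [x for x, lab in zip(X, labels) if lab == l]
        if members:
            new_Q.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_Q.append(list(q))  # assumption: an empty cluster keeps its centre
    return new_Q

def cost(X, labels, Q):
    """Objective P(W, Q): total distance from each object to its centre."""
    return sum(sq_euclid(x, Q[lab]) for x, lab in zip(X, labels))

def k_means(X, Q0, max_iter=100):
    Q = [list(q) for q in Q0]
    labels = assign(X, Q)                      # step 1: W^0 from Q^0
    for _ in range(max_iter):                  # max_iter is a safety guard only
        new_Q = update(X, labels, Q)           # step 2: Q^{t+1} from W^t
        if cost(X, labels, new_Q) == cost(X, labels, Q):
            break
        Q = new_Q
        new_labels = assign(X, Q)              # step 3: W^{t+1} from Q^{t+1}
        if cost(X, new_labels, Q) == cost(X, labels, Q):
            break
        labels = new_labels
    return labels, Q

X = [[1.0, 1.0], [1.0, 2.0], [9.0, 9.0], [10.0, 9.0]]
labels, Q = k_means(X, [[0, 0], [10, 10]])
```

The stopping tests compare the cost before and after each half-step, as in steps 2 and 3; a practical implementation would usually compare with a tolerance rather than exact floating-point equality.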

  10. K-modes Algorithm
      • Uses a simple matching dissimilarity measure for categorical objects
      • Replaces the means of clusters with modes
      • Uses a frequency-based method to find the modes

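  11. K-modes Algorithm (cont.)
      • Dissimilarity measure:
        $$d_1(X, Y) = \sum_{j=1}^{m} \delta(x_j, y_j)$$
        where
        $$\delta(x_j, y_j) = \begin{cases} 0 & (x_j = y_j) \\ 1 & (x_j \ne y_j) \end{cases}$$
      • Mode of a set: a mode of $X = \{X_1, X_2, \dots, X_n\}$ is a vector $Q = [q_1, q_2, \dots, q_m]$ that minimises
        $$D(X, Q) = \sum_{i=1}^{n} d_1(X_i, Q)$$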
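
As a small illustration of the simple matching measure, the sketch below counts mismatching attributes between two categorical objects and sums that count over a set. The function names `matching_dissim` and `total_dissim` are made up for this example.

```python
def matching_dissim(x, y):
    """Simple matching dissimilarity: number of attributes on which x and y differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

def total_dissim(X, q):
    """D(X, Q): summed dissimilarity between every object in X and the vector q."""
    return sum(matching_dissim(x, q) for x in X)

X = [["red", "small"], ["red", "large"], ["blue", "large"]]
d = matching_dissim(X[0], X[2])          # differs on both attributes
D = total_dissim(X, ["red", "large"])    # 1 + 0 + 1
```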

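  12. K-modes Algorithm (cont.)
      • Finding a mode for a set: let $n_{c_{k,j}}$ be the number of objects having the $k$th category $c_{k,j}$ in attribute $A_j$, and let $f_r(A_j = c_{k,j} \mid X) = n_{c_{k,j}} / n$ be the relative frequency of category $c_{k,j}$ in $X$
      • Theorem 1: $D(X, Q)$ is minimised iff $f_r(A_j = q_j \mid X) \ge f_r(A_j = c_{k,j} \mid X)$ for $q_j \ne c_{k,j}$, for all $j = 1, \dots, m$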
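
Theorem 1 says a mode can be built attribute by attribute by taking a most frequent category in each column. A minimal sketch using `collections.Counter` follows; `find_mode` is an illustrative name, and when two categories tie the example simply takes whichever `most_common` returns first.

```python
from collections import Counter

def find_mode(X):
    """Return a mode of X: a most frequent category for each attribute."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*X)]

X = [["a", "x"], ["a", "y"], ["b", "y"]]
mode = find_mode(X)
```

Because ties can be broken either way, the mode of a set is not necessarily unique.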

  13. K-modes Algorithm (cont.)
      • Two initial mode selection methods:
      • Select the first K distinct records from the data set as the K initial modes
      • Select the K initial modes by a frequency-based method
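
The first selection method is simple enough to sketch directly. The helper name `first_k_distinct` and the decision to raise `ValueError` when the data holds fewer than K distinct records are assumptions of this example; the frequency-based method is not shown because the slide does not spell out its steps.

```python
def first_k_distinct(X, k):
    """Return the first k distinct records of X to use as initial modes."""
    modes = []
    for x in X:
        record = list(x)
        if record not in modes:
            modes.append(record)
            if len(modes) == k:
                return modes
    # Assumption: treat too few distinct records as an error.
    raise ValueError("fewer than k distinct records in the data set")

X = [["a", "x"], ["a", "x"], ["b", "y"], ["c", "z"]]
initial_modes = first_k_distinct(X, 2)
```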

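  14. K-modes Algorithm (cont.)
      • The basic algorithm needs to calculate the total cost $P$ against the whole data set each time a new $Q$ or $W$ is obtained:
        $$P(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} \sum_{j=1}^{m} w_{i,l}\, \delta(x_{i,j}, q_{l,j})$$
        where $w_{i,l} \in W$ and $Q_l = [q_{l,1}, q_{l,2}, \dots, q_{l,m}] \in Q$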

  15. K-modes Algorithm (cont.)
      1. Select K initial modes, one for each cluster.
      2. Allocate each object to the cluster whose mode is nearest to it. Update the mode of the cluster after each allocation according to Theorem 1.

  16. K-modes Algorithm (cont.)
      3. After all objects have been allocated to clusters, retest the dissimilarity of objects against the current modes. If an object's nearest mode belongs to a cluster other than its current one, reallocate the object to that cluster and update the modes of both clusters.
      4. Repeat 3 until no object has changed clusters.
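
The four steps can be put together in a compact sketch. It is only one possible reading of the procedure: all function names, the `max_passes` guard, and the rule that an empty cluster keeps its previous mode are assumptions. Per-attribute frequency counters let a mode be refreshed after each allocation without rescanning the cluster.

```python
from collections import Counter

def matching_dissim(x, y):
    return sum(1 for a, b in zip(x, y) if a != b)

def nearest(x, modes):
    """Index of the mode closest to x (lowest index wins a tie)."""
    return min(range(len(modes)), key=lambda l: matching_dissim(x, modes[l]))

def add(counts, x):
    for c, v in zip(counts, x):
        c[v] += 1

def remove(counts, x):
    for c, v in zip(counts, x):
        c[v] -= 1
        if c[v] == 0:
            del c[v]

def mode_of(counts, old_mode):
    """Most frequent category per attribute; an empty cluster keeps its old mode."""
    return [c.most_common(1)[0][0] if c else old for c, old in zip(counts, old_mode)]

def k_modes(X, initial_modes, max_passes=100):
    k, m = len(initial_modes), len(initial_modes[0])
    modes = [list(q) for q in initial_modes]           # step 1
    counts = [[Counter() for _ in range(m)] for _ in range(k)]
    labels = []
    for x in X:                                        # step 2
        l = nearest(x, modes)
        labels.append(l)
        add(counts[l], x)
        modes[l] = mode_of(counts[l], modes[l])
    for _ in range(max_passes):                        # steps 3 and 4
        moved = False
        for i, x in enumerate(X):
            l = nearest(x, modes)
            if l != labels[i]:
                old = labels[i]
                remove(counts[old], x)
                add(counts[l], x)
                modes[old] = mode_of(counts[old], modes[old])
                modes[l] = mode_of(counts[l], modes[l])
                labels[i] = l
                moved = True
        if not moved:
            break
    return labels, modes

X = [["a", "x"], ["a", "y"], ["b", "z"], ["b", "z"], ["a", "x"]]
labels, modes = k_modes(X, [["a", "x"], ["b", "z"]])
```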

  17. K-prototypes Algorithm
      • Integrates the k-means and k-modes algorithms to cluster mixed-type objects
      • Dissimilarity between two mixed-type objects $X$ and $Y$:
        $$d_2(X, Y) = \sum_{j=1}^{p} (x_j - y_j)^2 + \gamma \sum_{j=p+1}^{m} \delta(x_j, y_j)$$
        where $m$ is the number of attributes; the first $p$ attributes are numeric and the rest are categorical

  18. K-prototypes Algorithm (cont.)
      • The first term is the squared Euclidean distance measure on the numeric attributes, and the second term is the simple matching dissimilarity measure on the categorical attributes
      • The weight $\gamma$ is used to avoid favouring either type of attribute
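
A short sketch of the two-term measure may help. It assumes each object is a list whose first `p` entries are numbers and whose remaining entries are category labels; the function name and the example value of `gamma` are illustrative only.

```python
def mixed_dissim(x, y, p, gamma):
    """Squared Euclidean distance on the first p attributes plus
    gamma times the number of mismatches on the remaining ones."""
    numeric = sum((a - b) ** 2 for a, b in zip(x[:p], y[:p]))
    categorical = sum(1 for a, b in zip(x[p:], y[p:]) if a != b)
    return numeric + gamma * categorical

x = [1.0, 2.0, "a", "b"]
y = [2.0, 4.0, "a", "c"]
d = mixed_dissim(x, y, p=2, gamma=0.5)   # (1 + 4) + 0.5 * 1
```

With `gamma = 0` only the numeric attributes matter, and a large `gamma` lets the categorical attributes dominate, which is why the weight has to be chosen with care.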

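  19. K-prototypes Algorithm (cont.)
      • Cost function: minimise
        $$P(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{i,l} \left( \sum_{j=1}^{p} (x_{i,j} - q_{l,j})^2 + \gamma \sum_{j=p+1}^{m} \delta(x_{i,j}, q_{l,j}) \right)$$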
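
The cost can be evaluated directly from cluster labels and prototypes, as in the hedged sketch below. It also shows one natural way to form a prototype, taking means on the numeric attributes and most frequent categories on the categorical ones; the function names and data layout (numeric attributes first) are assumptions of the example.

```python
from collections import Counter

def prototype(members, p):
    """Prototype of a non-empty cluster: means for the first p attributes,
    most frequent categories for the rest."""
    cols = list(zip(*members))
    means = [sum(c) / len(c) for c in cols[:p]]
    modes = [Counter(c).most_common(1)[0][0] for c in cols[p:]]
    return means + modes

def cost(X, labels, Q, p, gamma):
    """Total cost: numeric part plus gamma times categorical part, over all objects."""
    total = 0.0
    for x, l in zip(X, labels):
        q = Q[l]
        total += sum((a - b) ** 2 for a, b in zip(x[:p], q[:p]))
        total += gamma * sum(1 for a, b in zip(x[p:], q[p:]) if a != b)
    return total

X = [[1.0, "a"], [3.0, "a"], [10.0, "b"]]
labels = [0, 0, 1]
Q = [prototype(X[:2], p=1), prototype(X[2:], p=1)]
total = cost(X, labels, Q, p=1, gamma=1.0)
```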

  20. K-prototypes Algorithm (cont.)
      [Figure: algorithm steps labelled "Choose clusters" and "Modify the mode"]

  21. K-prototypes Algorithm (cont.)
      [Figure: algorithm step labelled "Modify the mode"]

  22. Experiment
      • K-modes: the first data set was the soybean disease data set, with 4 diseases and 47 instances {D=10, C=10, R=10, P=17}, and 21 attributes
      • K-prototypes: the second data set was the credit approval data set, with 2 classes and 666 instances {approval=299, reject=367}, and 6 numeric and 9 categorical attributes

  23. Experiment (cont.)

  24. Experiment (cont.)

  25. Experiment (cont.)

  26. Conclusion
      • The k-modes algorithm is faster than the k-means and k-prototypes algorithms because it needs fewer iterations to converge
      • Open question: how many clusters are in the data?
      • Choosing the weight $\gamma$ is an additional problem

  27. Personal opinion
      • Conceptual inclusion relationships
      • Outlier problem
      • Massive data sets cause efficiency problems
