A k-mean clustering algorithm for mixed numeric and categorical data • Presenter: Shao-Wei Cheng • Authors: Amir Ahmad, Lipika Dey • DKE 2007
Outline • Motivation • Objective • Methodology • Experiments • Conclusion • Comments
Motivation • The traditional k-mean algorithm is limited to numeric data. • Huang’s cost algorithm attempts to cluster mixed numeric and categorical data: • For categorical attributes, the cluster center is represented by the mode of the cluster. • The distance between two categorical attribute values is binary (0 if they match, 1 otherwise). • The significance (weight) of every numeric attribute is taken to be 1, and γ_j is a user-defined parameter.
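The three properties of Huang’s approach listed above can be sketched as a single distance function. This is a minimal illustration, not the original implementation: the function name `huang_distance`, the argument layout, and the sample values are assumptions, and `gamma` stands for the user-defined parameter γ_j.

```python
def huang_distance(x_num, x_cat, center_num, center_cat, gamma):
    """Distance between a data element and a cluster center (illustrative sketch).

    Numeric attributes: squared Euclidean distance, each attribute weighted 1.
    Categorical attributes: binary distance (1 per mismatch), scaled by gamma.
    """
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, center_num))
    categorical_part = sum(1 for a, b in zip(x_cat, center_cat) if a != b)
    return numeric_part + gamma * categorical_part


# Hypothetical element and center: one numeric difference of 1.0 and one
# categorical mismatch ("small" vs "large").
d = huang_distance([1.0, 2.0], ["red", "small"],
                   [0.0, 2.0], ["red", "large"], gamma=0.5)
print(d)  # 1.0 + 0.5 * 1 = 1.5
```

Because every mismatch costs the same and `gamma` is chosen by hand, the result depends heavily on a parameter the user has to guess, which is the shortcoming the presented paper addresses.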
Objectives • This paper attempts to alleviate the shortcomings of Huang’s cost algorithm. • Propose a new representation for the cluster center. • Compute the distance between two categorical values from the overall distribution of the categorical attribute. • Determine the weight parameter from the contribution of each categorical attribute instead of leaving it user-defined.
Methodology • Cost function • Huang’s cost algorithm • The proposed cost algorithm • Example: what is the distance between De Niro and Stewart?
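The formulas on this slide did not survive in the text. As a rough sketch of the underlying idea (two categorical values are far apart when they co-occur very differently with the values of the other attributes), the following assumes a distance equal to the sum over values u of max(P(u|x), P(u|y)) minus 1, averaged over all other attributes. The function names, the toy rows, and this exact formula are assumptions for illustration; the slide’s own example table is not reproduced here.

```python
from collections import Counter


def conditional_distribution(rows, attr, value, other):
    """Estimate P(other = u | attr = value) by counting matching rows."""
    matching = [r[other] for r in rows if r[attr] == value]
    counts = Counter(matching)
    return {u: c / len(matching) for u, c in counts.items()}


def categorical_distance(rows, attr, x, y):
    """Average, over every other attribute, of how differently x and y
    co-occur with that attribute's values (0 = identical, 1 = disjoint)."""
    others = [a for a in rows[0] if a != attr]
    total = 0.0
    for other in others:
        px = conditional_distribution(rows, attr, x, other)
        py = conditional_distribution(rows, attr, y, other)
        values = set(px) | set(py)
        # Sum of the larger conditional probability per value, minus 1.
        total += sum(max(px.get(u, 0.0), py.get(u, 0.0)) for u in values) - 1.0
    return total / len(others)


# Hypothetical toy data set with three categorical attributes.
rows = [
    {"color": "red", "shape": "round", "size": "small"},
    {"color": "red", "shape": "round", "size": "small"},
    {"color": "blue", "shape": "square", "size": "large"},
    {"color": "blue", "shape": "square", "size": "large"},
    {"color": "green", "shape": "round", "size": "large"},
    {"color": "green", "shape": "square", "size": "small"},
]
print(categorical_distance(rows, "color", "red", "blue"))   # 1.0
print(categorical_distance(rows, "color", "red", "green"))  # 0.5
```

Unlike the binary distance, "red" ends up closer to "green" than to "blue" here, because "green" shares some co-occurrence patterns with "red" while "blue" shares none.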
Methodology • Significance of numeric attributes • Numeric attributes must first be discretized so that their significance can be computed. • Equal-width discretization is used.
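Equal-width discretization splits the range of a numeric attribute into intervals of the same width and replaces each value by the index of its interval. A minimal sketch follows; the function name `equal_width_bins`, the number of bins, and the sample values are assumptions, since the slide does not state them.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls into a single bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Range 1.0..9.0 split into 4 intervals of width 2.0.
print(equal_width_bins([1.0, 2.0, 3.0, 4.0, 5.0, 9.0], 4))  # [0, 0, 1, 1, 2, 3]
```

After this step a numeric attribute can be treated like a categorical one when its significance is estimated.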
Methodology • Algorithm • 1. Initialization. • 2. Compute the cluster centers. • 3. Assign each data element to the cluster whose center is closest to it. • 4. Repeat steps 2 and 3 until the clusters no longer change or a fixed number of iterations is reached.
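The iteration described on this slide can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper’s algorithm: initialization picks k data elements as starting centers, categorical parts of a center are represented by the mode, and the distance is the binary-mismatch distance with a fixed `gamma`. The paper replaces these pieces with its own center representation and distance measure, which the slide text does not spell out. All names are hypothetical.

```python
import random
from collections import Counter


def mixed_distance(point, center, n_num, gamma):
    """Squared Euclidean on the first n_num attributes, gamma per mismatch on the rest."""
    num = sum((a - b) ** 2 for a, b in zip(point[:n_num], center[:n_num]))
    cat = sum(1 for a, b in zip(point[n_num:], center[n_num:]) if a != b)
    return num + gamma * cat


def compute_center(members, n_num):
    """Mean for numeric attributes, mode for categorical ones (an assumption)."""
    center = []
    for t in range(len(members[0])):
        column = [m[t] for m in members]
        if t < n_num:
            center.append(sum(column) / len(column))
        else:
            center.append(Counter(column).most_common(1)[0][0])
    return tuple(center)


def cluster_mixed(data, k, n_num, gamma=1.0, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Initialization: k distinct data elements serve as the first centers.
    centers = rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        # Assign each element to the cluster whose center is closest.
        new_labels = [
            min(range(k), key=lambda j: mixed_distance(p, centers[j], n_num, gamma))
            for p in data
        ]
        if new_labels == labels:
            break  # clusters did not change
        labels = new_labels
        # Recompute the center of every non-empty cluster.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:
                centers[j] = compute_center(members, n_num)
    return labels, centers


# Hypothetical data: two numeric attributes followed by one categorical attribute.
data = [
    (0.0, 0.0, "a"), (0.0, 1.0, "a"), (1.0, 0.0, "a"),
    (10.0, 10.0, "b"), (10.0, 11.0, "b"), (11.0, 10.0, "b"),
]
labels, centers = cluster_mixed(data, k=2, n_num=2)
print(labels)
```

On this toy data the first three elements and the last three elements end up in separate clusters; which cluster receives index 0 depends on the random initialization.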
Experiments • Evaluation method • Data sets • Iris – all numeric attributes • Vote – all categorical attributes • Heart disease data – mixed data set • Australian credit data – mixed data set
Conclusion • This paper introduced a new distance measure for categorical attribute values and proposed a modified k-mean algorithm for clustering mixed data sets. • The results obtained with this algorithm on a number of real-world data sets are highly encouraging. • Future work • Other methods for discretizing numeric-valued attributes. • Other implementations of the k-mean algorithm.
Comments • Advantage • Taking the overall distribution of the attributes into account is a good idea. • Drawback • … • Application • Clustering of mixed data sets.