A k-mean clustering algorithm for mixed numeric and categorical data

Presenter: Shao-Wei Cheng • Authors: Amir Ahmad, Lipika Dey • DKE 2007

Outline • Motivation • Objective • Methodology • Experiments • Conclusion • Comments

Motivation • The traditional k-mean algorithm is limited to numeric data. • Huang's cost algorithm attempts to cluster mixed numeric and categorical data, with these characteristics: • For categorical attributes, the cluster center is represented by the mode of the cluster. • The distance between two categorical attribute values is binary (0 if they match, 1 otherwise). • The significance (weight) of each numeric attribute is taken to be 1, and the weight γ_j of each categorical attribute is a user-defined parameter.
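As a rough illustration of the Huang-style cost described on this slide, here is a minimal Python sketch. It assumes squared Euclidean distance on the numeric part, which the slide does not state; the names `mode_center`, `huang_distance`, and `gamma` are hypothetical and chosen only for this example.

```python
from collections import Counter


def mode_center(values):
    """Return the most frequent categorical value (the mode) of a cluster."""
    return Counter(values).most_common(1)[0][0]


def huang_distance(x_num, x_cat, center_num, center_mode, gamma):
    """Distance from one data element to a cluster center.

    Numeric attributes all have weight 1 (squared Euclidean is an assumption).
    Categorical attributes use the binary distance, weighted by the
    user-defined gamma[j] for the j-th categorical attribute.
    """
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, center_num))
    categorical = sum(g * (a != b) for g, a, b in zip(gamma, x_cat, center_mode))
    return numeric + categorical
```

Because the binary distance treats every mismatch alike and `gamma` must be chosen by hand, the result depends heavily on a parameter the data itself does not determine.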

Objectives • This paper attempts to alleviate the shortcomings of Huang's cost algorithm. • Propose a new representation for the cluster center. • Compute the distance between two categorical values from the overall distribution of values of the categorical attributes. • Define the weight parameter by the contribution of each categorical attribute instead of leaving it user-defined.

Methodology • Cost function • Huang's cost algorithm • The proposed cost algorithm • Example: what is the distance between De Niro and Stewart?
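The formulas on this slide did not survive in the transcript. The following Python sketch shows one co-occurrence-based distance between two categorical values that is consistent with the stated objective (distance derived from the overall distribution of the other attributes). It is an assumption for illustration, not necessarily the exact formula on the slide; the function name `cooccurrence_distance` and the toy rows are hypothetical.

```python
from collections import Counter


def cooccurrence_distance(rows, i, x, y):
    """Distance between values x and y of categorical column i.

    For every other column j, compare the conditional distributions
    P(u | x) and P(u | y) of its values u, then average over those columns.
    Assumes x and y both occur in column i and there are at least 2 columns.
    """
    m = len(rows[0])
    total = 0.0
    for j in range(m):
        if j == i:
            continue
        px = Counter(r[j] for r in rows if r[i] == x)
        py = Counter(r[j] for r in rows if r[i] == y)
        nx, ny = sum(px.values()), sum(py.values())
        # Sum of max(P(u|x), P(u|y)) over all values u, minus 1:
        # 0 when the two distributions coincide, 1 when they are disjoint.
        total += sum(max(px[u] / nx, py[u] / ny) for u in set(px) | set(py)) - 1.0
    return total / (m - 1)


# Hypothetical toy data: (actor, genre)
rows = [
    ("De Niro", "Crime"),
    ("De Niro", "Crime"),
    ("Stewart", "Thriller"),
    ("Stewart", "Crime"),
]
distance = cooccurrence_distance(rows, 0, "De Niro", "Stewart")
```

Unlike the binary distance, this measure gives two different values a small distance when they co-occur with similar values of the other attributes.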

Methodology

Methodology • Significance of numeric attributes • To compute their significance, the numeric attributes need to be discretized. • Equal-width discretization is used.
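Equal-width discretization can be sketched in a few lines of Python. This is a generic illustration: the function name `equal_width_bins` and the choice of the number of bins are assumptions, since the slide does not say how many intervals are used.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0..n_bins-1.

    The range [min, max] is split into n_bins intervals of equal width;
    the maximum value is placed in the last bin.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so they share a single bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

For example, `equal_width_bins([0.0, 2.5, 5.0, 7.5, 10.0], 4)` assigns the values to bins `[0, 1, 2, 3, 3]`. Once discretized, a numeric attribute can be handled like a categorical one when estimating its significance.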

Methodology • Algorithm • Step 1: Initialization. • Step 2: Compute the cluster centers. • Step 3: Assign each data element to the cluster whose center is closest to it. • Step 4: Repeat steps 2 and 3 until the clusters do not change or a fixed number of iterations is reached.
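The iteration above can be sketched as a generic k-means-style loop in Python. This is a minimal sketch under stated assumptions: initialization picks k random data elements as centers (the slide does not specify the initialization), and the names `cluster`, `distance`, and `compute_center` are hypothetical. The distance and center functions are passed in as parameters so that a mixed-data version can be supplied.

```python
import random


def cluster(points, k, distance, compute_center, max_iter=100, seed=0):
    """Generic k-means-style loop with pluggable distance and center."""
    rng = random.Random(seed)
    # Initialization (assumed): use k randomly chosen data elements as centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign each data element to the cluster whose center is closest.
        new_labels = [
            min(range(k), key=lambda c: distance(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # clusters did not change
        labels = new_labels
        # Recompute the center of every non-empty cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = compute_center(members)
    return labels, centers


# Usage with purely numeric, one-dimensional toy data.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
labels, centers = cluster(
    points,
    k=2,
    distance=lambda p, c: (p - c) ** 2,
    compute_center=lambda members: sum(members) / len(members),
)
```

On this toy input the first three values end up in one cluster and the last three in the other; which cluster receives which index depends on the random initialization.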

Experiments • Evaluation method • Data sets • Iris: all numeric attributes • Vote: all categorical attributes • Heart disease data: mixed data set • Australian credit data: mixed data set

Experiments

Conclusion • This paper introduced a new distance measure for categorical attribute values and proposed a modified k-mean algorithm for clustering mixed data sets. • The results obtained with this algorithm on a number of real-world data sets are highly encouraging. • Future work • Other methods for discretizing numeric-valued attributes. • Other implementations of the k-mean algorithm.

Comments • Advantage • Taking the overall distribution of all attributes into account is a good idea. • Drawback • … • Application • Clustering of mixed data sets.
