Clustering Categorical Data

Pasi Fränti, 18.2.2016

K-means clustering

Definitions and data
• Set of N data points: X = {x1, x2, …, xN}
• Partition of the data: P = {p1, p2, …, pM}
• Set of M cluster prototypes (centroids): C = {c1, c2, …, cM}

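Distance and cost function
• Euclidean distance of data vectors (D denotes the number of attributes): $d(x_i, c_j) = \sqrt{\sum_{k=1}^{D} (x_{ik} - c_{jk})^2}$
• Mean square error: $MSE(C, P) = \frac{1}{N} \sum_{j=1}^{M} \sum_{x_i \in p_j} d(x_i, c_j)^2$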
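The Euclidean distance and mean square error named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names, the `labels` list (index of the cluster each point belongs to) and the sample points are all assumptions, and the error is normalised by the number of points N only.

```python
import math

def euclidean(x, c):
    """Euclidean distance between a data vector x and a prototype c."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))

def mse(points, centroids, labels):
    """Mean square error: average squared distance of each point
    to the centroid of the cluster it is assigned to (assumed: divide by N)."""
    total = sum(euclidean(x, centroids[j]) ** 2 for x, j in zip(points, labels))
    return total / len(points)

# Hypothetical toy data: two clusters of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
centroids = [(0.0, 1.0), (4.0, 1.0)]
labels = [0, 0, 1, 1]

print(mse(points, centroids, labels))
```

Every point here lies at distance 1 from its centroid, so the error comes out as 1.0.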

Clustering result as partition
• Partition of data
• Cluster prototypes
• Illustrated by Voronoi diagram
• Illustrated by convex hulls

Duality of partition and centroids
• Partition of data
• Cluster prototypes
• Partition by nearest-prototype mapping
• Centroids as prototypes

Categorical data

Categorical clustering: three attributes

Categorical clustering: sample 2-d data (color and shape)
[Figure panels: Model A, Model B, Model C]

Hamming distance (binary and categorical data)
• Number of different attribute values.
• Distance of (1011101) and (1001001) is 2.
• Distance of (2143896) and (2233796) is 3.
• Distance between (toned) and (roses) is 3.
3-bit binary cube:
• 100 -> 011 has distance 3 (red path)
• 010 -> 111 has distance 2 (blue path)
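The Hamming distance examples above can be checked with a short Python sketch. The function name `hamming` is chosen here for illustration; it counts the positions where two equal-length sequences differ.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

print(hamming("1011101", "1001001"))  # 2
print(hamming("2143896", "2233796"))  # 3
print(hamming("toned", "roses"))      # 3
print(hamming("100", "011"))          # 3
print(hamming("010", "111"))          # 2
```

The same function works on tuples of categorical attribute values, since it only compares elements for equality.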

K-means variants
Methods (the slide also carries the label "Histogram-based methods"):
• k-modes
• k-medoids
• k-distributions
• k-histograms
• k-populations
• k-representatives

Entropy-based cost functions
• Category utility
• Entropy of data set
• Entropies of the clusters relative to the data
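The slide's formulas did not survive in the transcript, so the following is only a sketch of one common entropy-based cost, under stated assumptions: the entropy of a set of categorical rows is taken as the sum, over attributes, of the Shannon entropy (in bits) of the value frequencies, and the cost of a clustering is the size-weighted average of the cluster entropies. The function names and the toy data are hypothetical, and the lecture's exact definitions may differ.

```python
import math
from collections import Counter

def entropy(rows):
    """Assumed definition: sum over attributes of the Shannon entropy
    (in bits) of the observed value frequencies."""
    n = len(rows)
    total = 0.0
    for column in zip(*rows):  # iterate attribute by attribute
        for count in Counter(column).values():
            p = count / n
            total -= p * math.log2(p)
    return total

def expected_entropy(clusters):
    """Size-weighted average of the cluster entropies (lower is better)."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * entropy(c) for c in clusters)

# Hypothetical toy data: color and shape.
data = [("red", "circle"), ("red", "circle"),
        ("blue", "square"), ("blue", "square")]

print(entropy(data))                          # whole data set
print(expected_entropy([data[:2], data[2:]])) # split by color
```

In this toy case the whole data set has entropy 2.0 bits (one bit per attribute), while splitting by color gives two pure clusters and an expected entropy of 0.0.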

Iterative algorithms

K-modes clustering: Distance function

K-modes clustering: Prototype of cluster
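In k-modes, the prototype of a cluster is the mode, that is, the most frequent value, of each attribute. A minimal Python sketch follows; the function name and the sample cluster are hypothetical, and ties are broken by first occurrence, which is an arbitrary choice made here.

```python
from collections import Counter

def mode_prototype(rows):
    """Per-attribute mode of a cluster of categorical vectors.
    Ties go to the value seen first (an arbitrary choice)."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*rows))

# Hypothetical cluster with three attributes.
cluster = [("A", "C", "E"), ("B", "C", "F"), ("B", "D", "F")]

print(mode_prototype(cluster))  # ('B', 'C', 'F')
```

Note that the mode vector does not have to be one of the cluster's own members, which is the main difference from a medoid.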

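K-medoids clustering: Prototype of cluster
Vector with minimal total distance to every other vector in the cluster.
• Vectors: ACE, BCF, BDG
• Pairwise distances: d(ACE, BCF) = 2, d(ACE, BDG) = 3, d(BCF, BDG) = 2
• Total distances: ACE: 2+3=5, BCF: 2+2=4, BDG: 2+3=5
• Medoid: BCF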
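The medoid calculation for the three vectors on this slide can be reproduced with a short Python sketch. It assumes the Hamming distance as the dissimilarity; the function names are chosen here for illustration.

```python
def hamming(a, b):
    """Number of attributes on which two vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def total_distance(row, rows):
    """Sum of distances from one vector to all vectors in the cluster."""
    return sum(hamming(row, other) for other in rows)

def medoid(rows):
    """Cluster member with the smallest total distance to the others."""
    return min(rows, key=lambda row: total_distance(row, rows))

cluster = [("A", "C", "E"), ("B", "C", "F"), ("B", "D", "G")]

for row in cluster:
    print("".join(row), total_distance(row, cluster))  # ACE 5, BCF 4, BDG 5
print(medoid(cluster))  # ('B', 'C', 'F')
```

This brute-force search compares every pair of vectors, so its cost grows quadratically with the cluster size.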

K-medoids: Example

K-medoids: Calculation

K-histograms
[Figure: histogram of attribute value frequencies, D 2/3, F 1/3]
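A histogram prototype keeps, for each attribute, the relative frequency of every value in the cluster instead of a single representative value. The sketch below is an illustration under assumptions: the function names and the sample cluster are hypothetical (the cluster is chosen so that one attribute has the frequencies D 2/3 and F 1/3), and the point-to-histogram distance used here, the sum over attributes of one minus the frequency of the point's value, is one common choice rather than necessarily the one used in the lecture.

```python
from collections import Counter
from fractions import Fraction

def histogram_prototype(rows):
    """One {value: relative frequency} histogram per attribute."""
    n = len(rows)
    return [{value: Fraction(count, n) for value, count in Counter(column).items()}
            for column in zip(*rows)]

def distance_to_histogram(row, prototype):
    """Assumed dissimilarity: for each attribute, 1 minus the frequency
    of the point's value in the cluster histogram."""
    return sum(1 - hist.get(value, 0) for value, hist in zip(row, prototype))

# Hypothetical cluster with two attributes.
cluster = [("A", "D"), ("B", "D"), ("B", "F")]
prototype = histogram_prototype(cluster)

print(prototype[1])                                   # frequencies of D and F
print(distance_to_histogram(("B", "D"), prototype))   # 2/3
```

Exact fractions are used so that the frequencies print as 2/3 and 1/3 rather than rounded floats.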

K-distributions: Cost function with ε addition

Example of cluster allocation: Change of entropy

Problem of non-convergence

Results with Census dataset

Literature
• Modified k-modes + k-histograms: M. Ng, M.J. Li, J.Z. Huang and Z. He, "On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm", IEEE Trans. on Pattern Analysis and Machine Intelligence, 29 (3), 503-507, March 2007.
• ACE: K. Chen and L. Liu, "The 'Best k' for entropy-based categorical data clustering", Int. Conf. on Scientific and Statistical Database Management (SSDBM'2005), pp. 253-262, Berkeley, USA, 2005.
• ROCK: S. Guha, R. Rastogi and K. Shim, "Rock: A robust clustering algorithm for categorical attributes", Information Systems, Vol. 25, No. 5, pp. 345-366, 200x.
• K-medoids: L. Kaufman and P.J. Rousseeuw, Finding groups in data: an introduction to cluster analysis, John Wiley & Sons, New York, 1990.
• K-modes: Z. Huang, "Extensions to k-means algorithm for clustering large data sets with categorical values", Data Mining and Knowledge Discovery, Vol. 2, No. 3, pp. 283-304, 1998.
• K-distributions: Z. Cai, D. Wang and L. Jiang, "K-Distributions: A New Algorithm for Clustering Categorical Data", Int. Conf. on Intelligent Computing (ICIC 2007), pp. 436-443, Qingdao, China, 2007.
• K-histograms: Z. He, X. Xu, S. Deng and B. Dong, "K-Histograms: An Efficient Clustering Algorithm for Categorical Dataset", CoRR, abs/cs/0509033, http://arxiv.org/abs/cs/0509033, 2005.
