X-means: Extending K-means with Efficient Estimation of the Number of Clusters

Dan Pelleg, Andrew Moore • Carnegie Mellon University • Published: ICML 2000 • Presentation by: Payam Refaeilzadeh

Problems with K-means • K must be known in advance • Searching for K is expensive • Even with a fixed K, K-means scales poorly: every round computes the distance from each point to each centroid to find the new cluster assignments
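To make the scaling problem concrete, here is a minimal sketch of the naive assignment step. The function name `assign_naive` and the sample data are illustrative assumptions, not taken from the slides. The point is that each round performs one distance computation per (point, centroid) pair.

```python
def assign_naive(points, centroids):
    """Return, for each point, the index of its nearest centroid.

    Cost per round: len(points) * len(centroids) distance computations.
    """
    labels = []
    for p in points:
        # Squared Euclidean distance from p to every centroid.
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels


# Illustrative data (assumption): two obvious groups in 2-D.
points = [(0.0, 0.0), (0.0, 1.0), (9.0, 9.0), (10.0, 10.0)]
centroids = [(0.0, 0.5), (9.5, 9.5)]
labels = assign_naive(points, centroids)
```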

Remedies • Forward search for the appropriate value of K in a given range • Recursively split each cluster and use the BIC score to decide whether to keep each split • Use kd-trees to accelerate individual rounds of K-means

Splitting • Use local BIC score to decide on keeping a split • Use global BIC score to decide which K to output at the end

BIC (Bayesian Information Criterion) • Log-likelihood of the model, adjusted by a penalty for the number of free parameters • The likelihood measures how well the data is “explained by” the clusters under the spherical-Gaussian assumption of K-means
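A rough sketch of how such a score could be computed and used for a split decision follows. It assumes the usual form BIC = log-likelihood - (p / 2) * log R, with R points, M dimensions, K centroids, and p = (K - 1) + M*K + 1 free parameters (mixing weights, centroid coordinates, one shared variance). The variance estimator, which divides by M * (R - K), and the name `bic_spherical` are assumptions of this sketch and not necessarily the exact form used in the paper.

```python
import math


def bic_spherical(points, labels, centroids):
    """BIC of a hard-assignment spherical-Gaussian model (higher is better)."""
    R, M, K = len(points), len(points[0]), len(centroids)
    if R <= K:
        raise ValueError("need more points than centroids")
    counts = [0] * K
    sse = 0.0  # sum of squared distances to the assigned centroid
    for x, j in zip(points, labels):
        counts[j] += 1
        sse += sum((a - b) ** 2 for a, b in zip(x, centroids[j]))
    # Shared per-dimension variance (assumed estimator), floored to avoid log(0).
    variance = max(sse / (M * (R - K)), 1e-12)
    loglik = sum(c * math.log(c / R) for c in counts if c > 0)  # mixing weights
    loglik -= 0.5 * R * M * math.log(2 * math.pi * variance)   # normalization
    loglik -= sse / (2 * variance)                              # exponent term
    n_params = (K - 1) + M * K + 1
    return loglik - 0.5 * n_params * math.log(R)


# Illustrative data (assumption): two tight, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
parent = bic_spherical(points, [0] * 8, [(5.5, 5.5)])
children = bic_spherical(points, [0, 0, 0, 0, 1, 1, 1, 1],
                         [(0.5, 0.5), (10.5, 10.5)])
keep_split = children > parent  # keep the split only if the score improves
```

The same comparison works locally (restricted to the points of one parent cluster) when deciding on a single split, and globally (over all points) when comparing candidate values of K.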

Kd-trees • Points to be clustered are put into a binary hierarchical structure • Each node represents a subset of points and stores: (a) the minimal hyper-rectangle enclosing all points in the subset, (b) the vector sum of all the points in the subset, and (c) the number of points in the subset
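The node contents described above can be sketched as follows. The names `KdNode` and `build_kdtree`, the median split along the widest dimension, and the `leaf_size` parameter are assumptions made for illustration; the slides do not specify how nodes are split.

```python
from dataclasses import dataclass
from typing import List, Optional

Point = List[float]


@dataclass
class KdNode:
    lo: Point                 # lower corner of the bounding hyper-rectangle
    hi: Point                 # upper corner of the bounding hyper-rectangle
    vec_sum: Point            # vector sum of all points in the subset
    count: int                # number of points in the subset
    points: Optional[List[Point]] = None  # raw points, kept only in leaves
    left: Optional["KdNode"] = None
    right: Optional["KdNode"] = None


def build_kdtree(points: List[Point], leaf_size: int = 2) -> KdNode:
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    vec_sum = [sum(p[d] for p in points) for d in dims]
    node = KdNode(lo, hi, vec_sum, len(points))
    widest = max(dims, key=lambda d: hi[d] - lo[d])
    if len(points) <= leaf_size or hi[widest] == lo[widest]:
        node.points = points  # leaf: small subset or all points identical
        return node
    ordered = sorted(points, key=lambda p: p[widest])
    mid = len(ordered) // 2   # median split along the widest dimension
    node.left = build_kdtree(ordered[:mid], leaf_size)
    node.right = build_kdtree(ordered[mid:], leaf_size)
    return node


root = build_kdtree([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0], [7.0, 8.0], [9.0, 6.0]])
```

Because every node caches its vector sum and count, a whole subtree can be credited to a centroid without visiting its points.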

Using kd-trees • For each centroid, keep a counter holding the vector sum and the number of the points that belong to it • Update these counters by scanning the kd-tree only once • Start at the root node with all centroids as candidates • While walking down the tree, centroids get blacklisted: a centroid is dropped for a node when no point in that node could possibly belong to it • When only one centroid remains, its counter can be updated directly from the statistics stored in the node • At the end of the scan there is enough information to recalculate the centroid coordinates
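The scan described above might look like the sketch below. It is self-contained, so it re-declares a compact dict-based node. The blacklisting test is an assumption of this sketch: take the candidate closest to the node's hyper-rectangle, and drop any other centroid if even the box corner most favorable to that centroid is still strictly closer to the closest candidate. All function names are hypothetical.

```python
import random


def build(points, leaf_size=4):
    """Compact kd-tree node: bounding box, vector sum, count, children."""
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    node = {"lo": lo, "hi": hi, "n": len(points), "points": None, "kids": (),
            "sum": [sum(p[d] for p in points) for d in dims]}
    d = max(dims, key=lambda i: hi[i] - lo[i])
    if len(points) <= leaf_size or hi[d] == lo[d]:
        node["points"] = points
    else:
        pts = sorted(points, key=lambda p: p[d])
        mid = len(pts) // 2
        node["kids"] = (build(pts[:mid], leaf_size), build(pts[mid:], leaf_size))
    return node


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sq_dist_to_box(c, lo, hi):
    """Squared distance from c to the nearest point of the box [lo, hi]."""
    return sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(c, lo, hi))


def dominates(best, other, lo, hi):
    """True if every point of the box is strictly closer to best than to other."""
    # Corner of the box lying farthest in the direction from best to other.
    corner = [h if o > b else l for b, o, l, h in zip(best, other, lo, hi)]
    return sq_dist(corner, best) < sq_dist(corner, other)


def scan(node, candidates, centroids, sums, counts):
    lo, hi = node["lo"], node["hi"]
    best = min(candidates, key=lambda j: sq_dist_to_box(centroids[j], lo, hi))
    # Blacklist every candidate that cannot own any point in this node.
    alive = [j for j in candidates
             if j == best or not dominates(centroids[best], centroids[j], lo, hi)]
    if len(alive) == 1:
        # One owner left: credit the whole node from its cached statistics.
        for d, s in enumerate(node["sum"]):
            sums[best][d] += s
        counts[best] += node["n"]
    elif node["points"] is not None:
        # Leaf with several candidates: fall back to per-point assignment.
        for p in node["points"]:
            j = min(alive, key=lambda c: sq_dist(p, centroids[c]))
            for d, v in enumerate(p):
                sums[j][d] += v
            counts[j] += 1
    else:
        for kid in node["kids"]:
            scan(kid, alive, centroids, sums, counts)


def update_centroids(root, centroids):
    """One K-means round: returns (new centroids, per-centroid counts)."""
    k, m = len(centroids), len(centroids[0])
    sums, counts = [[0.0] * m for _ in range(k)], [0] * k
    scan(root, list(range(k)), centroids, sums, counts)
    new = [[s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
           for j in range(k)]
    return new, counts


# Illustrative data (assumption): three Gaussian blobs with a fixed seed.
random.seed(0)
points = [[random.gauss(cx, 1.0), random.gauss(cy, 1.0)]
          for cx, cy in [(0, 0), (8, 0), (4, 7)] for _ in range(50)]
centroids = [[0.0, 1.0], [7.0, 0.0], [4.0, 6.0]]
new_centroids, counts = update_centroids(build(points), centroids)
```

The result should match a naive round that compares every point with every centroid; the saving comes from nodes that are credited in bulk after all but one candidate has been blacklisted.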

Results
