K-means algorithm

K-means algorithm Jelena Vukovic 53/07 jeca.zr@gmail.com

Introduction • Basic idea of the k-means algorithm • Detailed explanation • Most common problems of the algorithm • Applications • Possible improvements Elektrotehnički fakultet u Beogradu

Basic principles of the algorithm • Given a set of points (x1, x2, … , xn) • Partition the n points into k sets (S1, S2, … , Sk), with n > k • The goal is to minimize the within-cluster sum of squares • µi is the mean of the points in Si
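Written out with the symbols defined on this slide, the within-cluster sum of squares objective takes its standard textbook form (the norm is assumed to be Euclidean):

```latex
\underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2},
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
```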

The algorithm • Choose the number of means (k) and initialize them • Iterate until the assignments stop changing: • Assign each point to the nearest mean • Move each mean to the center of its cluster
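The two alternating steps can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names, the random choice of starting means from the input points, the `max_iter` cap, and the rule that an empty cluster keeps its previous mean are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean (centroid) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (means, labels) once the assignments stop changing."""
    rng = random.Random(seed)
    # Assumption: start from k input points drawn at random without replacement.
    means = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each point to the nearest mean.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, means[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the means would not move
        labels = new_labels
        # Step 2: move each mean to the center of its cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old mean
                means[j] = mean_point(members)
    return means, labels
```

For example, `kmeans([(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)], k=2)` separates the three points near the origin from the three points near (8.5, 8.5).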

The algorithm • Figure: assign points to the nearest mean, then move the means

The algorithm • The complexity is O(n * k * I * d) • n – number of points • k – number of clusters • I – number of iterations • d – number of attributes • Figure: re-assign points
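To make the bound concrete, the sketch below multiplies out the four factors for one workload. The function name and the sample sizes are hypothetical, and the count is only the order-of-magnitude estimate that the O(n * k * I * d) expression describes, not a measured cost.

```python
def assignment_cost(n, k, iterations, d):
    """Coordinate-level operations spent on distance computations.

    Each iteration compares every one of the n points with each of the
    k means, and each comparison touches all d attributes.
    """
    return n * k * iterations * d


# Hypothetical workload: 10,000 points, 5 clusters, 20 iterations, 3 attributes.
print(assignment_cost(10_000, 5, 20, 3))  # 3000000
```

Doubling any single factor doubles the estimate, which is why the running time grows linearly in the number of points.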

K nearest neighbors • A similarly named, distance-based algorithm, but used for supervised classification rather than clustering • The decision is made by simple majority among the k closest neighbors • In k-means, the Euclidean distance measure is used
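The majority-vote rule can be sketched as follows. The function name, the `(point, label)` data layout, and the tie-breaking behavior are assumptions made for this example; the distance is Euclidean, as in k-means.

```python
from collections import Counter


def knn_classify(train, query, k=3):
    """Label `query` by simple majority among its k nearest labeled points.

    `train` is a list of (point, label) pairs. Distances are squared
    Euclidean, which preserves the nearest-neighbor ordering.
    """
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    # Ties go to the label that was encountered first, i.e. the closest one.
    return votes.most_common(1)[0][0]
```

Unlike k-means, this needs labeled training points, and k is the number of neighbors consulted rather than the number of clusters.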

Some limitations of the algorithm • The number of clusters needs to be known in advance • The result depends on the initial positions of the means • Problems appear when clusters differ in: • Shape • Size • Density

Initial centroids problem • Random initialization (the most common) • Multiple runs • Testing on a data sample • Analyzing the data
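The "multiple runs" remedy can be sketched as below: repeat k-means from different random starts and keep the run with the lowest within-cluster sum of squares. The helper names, the number of runs, and the choice of that sum as the selection criterion are assumptions for this illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def run_kmeans(points, k, rng, max_iter=100):
    """One k-means run from a random start; returns (means, labels)."""
    means = rng.sample(points, k)  # random initialization from the data
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, means[j])) for p in points
        ]
        if new_labels == labels:
            break  # converged: assignments did not change
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mean
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, labels


def wcss(points, means, labels):
    """Within-cluster sum of squares for a given clustering."""
    return sum(sq_dist(p, means[lab]) for p, lab in zip(points, labels))


def best_of_runs(points, k, runs=10, seed=0):
    """Repeat k-means with different random starts; keep the lowest WCSS."""
    rng = random.Random(seed)
    results = (run_kmeans(points, k, rng) for _ in range(runs))
    return min(results, key=lambda r: wcss(points, *r))
```

Because every run starts from a different random sample of the data, a poor start affects only that run; the selected result is never worse than any single run it was compared against.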

Different density • Figure: original points and the resulting 3 clusters

Non-globular shapes • Figure: original points and the resulting 2 clusters

Pros and cons • Pros: • Simple to implement • Fast • Not highly demanding • Cons: • k needs to be known • Ellipsoid cluster shape is assumed • Requires some knowledge about the data in advance • Possibly many iterations without significant changes in the clusters

Applications of the algorithm • Many different uses: • Computer vision • Market segmentation • Geostatistics • Astronomy • etc.

Improvements • Pre-process the data in order to better estimate k • Run multiple instances in parallel with different centroid initializations • Ignore possible errors to avoid non-standard cluster shapes

Thank you!
