K-means algorithm
Jelena Vukovic 53/07
jeca.zr@gmail.com
Introduction
• Basic idea of the k-means algorithm
• Detailed explanation
• Most common problems of the algorithm
• Applications
• Possible improvements
Elektrotehnički fakultet u Beogradu
Basic principles of the algorithm
• Given a set of points (x1, x2, …, xn)
• Partition the n points into k sets (n > k): (S1, S2, …, Sk)
• The goal is to minimize the within-cluster sum of squares: $\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$
• µi is the mean of the points in Si
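To make the within-cluster sum of squares concrete, here is a minimal sketch in plain Python. The function names `mean` and `wcss` and the two toy partitions are illustrative assumptions, not part of the original slides.

```python
def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def wcss(clusters):
    """Within-cluster sum of squares: sum over sets S_i of ||x - mu_i||^2."""
    total = 0.0
    for cluster in clusters:
        mu = mean(cluster)  # mu_i, the mean of the points in S_i
        total += sum(
            sum((xj - mj) ** 2 for xj, mj in zip(x, mu)) for x in cluster
        )
    return total

# Two ways to partition the same four points into k = 2 sets (toy data).
good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
print(wcss(good), wcss(bad))  # 4.0 100.0
```

The partition that groups nearby points has the much lower objective value, which is what the algorithm tries to achieve.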
The algorithm
• Choose the number of means (k) and their initial positions
• Iterate:
  • Assign each point to the nearest mean
  • Move each mean to the center of its cluster
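The two alternating steps can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the initial means are taken to be k randomly chosen data points, the loop stops when no assignment changes, and the names `kmeans`, `sq_dist`, `max_iter` and `seed` are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    means = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each point to the nearest mean.
        new_labels = [
            min(range(k), key=lambda i: sq_dist(p, means[i])) for p in points
        ]
        if new_labels == labels:  # no assignment changed: converged
            break
        labels = new_labels
        # Step 2: move each mean to the center of its cluster.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous mean
                means[i] = [sum(c) / len(members) for c in zip(*members)]
    return means, labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, labels = kmeans(points, k=2)
print(means, labels)
```

On this toy data the two well-separated groups end up in different clusters; which group receives label 0 depends on the random initialization.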
The algorithm
[Figure labels: Move means; Assign points to nearest mean]
The algorithm
• The complexity is O(n · k · I · d)
  • n – number of points
  • k – number of clusters
  • I – number of iterations
  • d – number of attributes
[Figure label: Re-assign points]
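As a rough illustration of where the n · k · I · d term comes from, the sketch below counts the coordinate-level operations in one assignment pass. The helper `count_assignment_ops` and the sizes used are assumptions chosen only for the example.

```python
def count_assignment_ops(points, means):
    """Count coordinate operations in one 'assign to nearest mean' pass."""
    ops = 0
    for p in points:        # n points
        for m in means:     # k means
            ops += len(p)   # one distance touches all d attributes
    return ops

n, k, d, iterations = 1000, 5, 3, 10
points = [(0.0,) * d for _ in range(n)]
means = [(0.0,) * d for _ in range(k)]

per_pass = count_assignment_ops(points, means)
print(per_pass, per_pass * iterations)  # 15000 150000
```

Each pass costs n · k · d operations, and repeating it for I iterations gives the stated bound.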
K nearest neighbors
• A related, distance-based algorithm, but used for supervised classification rather than clustering
• The decision is made by a simple majority vote among the k closest neighbors
• K-means uses the Euclidean distance measure
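For comparison, the majority-vote rule of k nearest neighbors can be sketched like this. The function name `knn_classify`, the use of squared Euclidean distance, and the labeled toy data are assumptions made for illustration.

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_classify(query, labeled_points, k):
    """Return the majority label among the k points closest to `query`.

    labeled_points is a list of (point, label) pairs.
    """
    nearest = sorted(labeled_points, key=lambda pl: sq_dist(query, pl[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
    ((5, 5), "b"), ((5, 6), "b"),
]
print(knn_classify((0.5, 0.5), train, k=3))  # a
print(knn_classify((5, 5.5), train, k=3))    # b
```

Unlike k-means, this needs labels for the training points, and k counts neighbors rather than clusters.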
Some limitations of the algorithm
• The number of clusters needs to be known in advance
• The result depends on the initial positions of the means
• Problems appear when clusters have different
  • Shapes
  • Sizes
  • Densities
Initial centroids problem
• Random initialization (the most common)
• Multiple runs
• Testing on a data sample
• Analyzing the data
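The "multiple runs" idea can be sketched as repeating k-means from different random starting positions and keeping the run with the lowest within-cluster sum of squares. This is only a sketch: picking the initial means from the data points, the selection criterion, and the names `best_of_runs`, `runs` and `seed` are assumptions, not something the slides specify.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rng, max_iter=100):
    """Plain k-means; initial means are k random data points (assumption)."""
    means = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda i: sq_dist(p, means[i])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                means[i] = [sum(c) / len(members) for c in zip(*members)]
    return means, labels

def wcss(points, means, labels):
    """Within-cluster sum of squares for a finished run."""
    return sum(sq_dist(p, means[lab]) for p, lab in zip(points, labels))

def best_of_runs(points, k, runs=10, seed=0):
    """Run k-means `runs` times and keep the lowest-cost result."""
    rng = random.Random(seed)
    best = None
    for _ in range(runs):
        means, labels = kmeans(points, k, rng)
        cost = wcss(points, means, labels)
        if best is None or cost < best[0]:
            best = (cost, means, labels)
    return best

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (0, 10), (0, 11), (1, 10)]
cost, means, labels = best_of_runs(points, k=3)
print(round(cost, 3), labels)
```

Because the best result is kept, the final cost can never be worse than that of the first run alone.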
Different density
[Figure panels: Original points; 3 Clusters]
Non-globular shapes
[Figure panels: Original points; 2 Clusters]
Pros and cons
Pros
• Simple to implement
• Fast
• Not computationally demanding
Cons
• K needs to be known
• A roughly spherical (globular) cluster shape is assumed
• Requires some knowledge about the data in advance
• May take many iterations without significant changes in the clusters
Applications of the algorithm
• Many different uses
  • Computer vision
  • Market segmentation
  • Geostatistics
  • Astronomy
  • etc.
Improvements
• Pre-process the data in order to better estimate k
• Run multiple instances in parallel with different centroid initializations
• Ignore possible errors in the data to avoid non-standard cluster shapes
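One common way to get a feel for a reasonable k, swapped in here as an illustration since the slides do not name a method, is to sweep over several values of k and watch how the within-cluster sum of squares falls (often called the elbow heuristic). The helper names, the number of restarts and the toy data below are all assumptions.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_cost(points, k, rng, max_iter=100):
    """Run plain k-means once and return its within-cluster sum of squares."""
    means = [list(p) for p in rng.sample(points, k)]  # assumption: init from data
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda i: sq_dist(p, means[i])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                means[i] = [sum(c) / len(members) for c in zip(*members)]
    return sum(sq_dist(p, means[lab]) for p, lab in zip(points, labels))

def cost_for_k(points, k, restarts=5, seed=0):
    """Lowest cost found over a few random restarts for a given k."""
    rng = random.Random(seed)
    return min(kmeans_cost(points, k, rng) for _ in range(restarts))

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (0, 10), (0, 11), (1, 10)]
costs = {k: cost_for_k(points, k) for k in range(1, 5)}
for k, c in costs.items():
    print(k, round(c, 3))
```

A value of k after which the cost stops dropping sharply is a candidate for the number of clusters; this is a heuristic, not a guarantee.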
Thank you!