Chapter 4 Clustering
What is Clustering?
• The process of organizing objects into groups whose members are similar in some way
• Statistics, machine learning, and database researchers have studied data clustering
• Recent emphasis on large datasets
Approaches to Clustering
• Two main approaches to clustering:
  • Partitional clustering
    • A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
  • Hierarchical clustering
    • A set of nested clusters organized as a hierarchical tree
Problem Statement
• N objects to be grouped into k clusters
• Different possibilities
  • If we have 5 objects to be classified into 2 clusters, how many possibilities are there? 2^5 / 2! = 32 / 2 = 16 (this count includes the grouping that leaves one cluster empty; requiring both clusters to be non-empty gives 15)
• The objective is to find a grouping such that the distances between objects in a group are minimal
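The count above can be checked by brute force. The following is a minimal sketch, not part of the original slides; the function name `count_groupings` and its `allow_empty` flag are hypothetical, introduced only for illustration. It enumerates every assignment of labeled objects to k clusters and ignores the order of the clusters.

```python
from itertools import product


def count_groupings(n_objects, k, allow_empty=False):
    """Count distinct ways to split labeled objects into k unlabeled clusters."""
    seen = set()
    for labels in product(range(k), repeat=n_objects):
        # Build each cluster as a set of object indices.
        clusters = [
            frozenset(i for i, c in enumerate(labels) if c == j)
            for j in range(k)
        ]
        if not allow_empty and any(len(c) == 0 for c in clusters):
            continue
        # A frozenset of clusters ignores which cluster is "first".
        seen.add(frozenset(clusters))
    return len(seen)


print(count_groupings(5, 2, allow_empty=True))  # counts the empty-cluster case
print(count_groupings(5, 2))                    # both clusters non-empty
```

Exhaustive enumeration grows as k^N, so this is only practical for very small N. That growth is the reason heuristic algorithms such as k-means are used instead of trying every grouping.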
Types
• Statistical methods
  • K-means algorithm
  • Probabilistic clustering
  • The agglomerative algorithm
• Neural network based approaches
  • Kohonen’s self organizing maps (SOM)
• Evolutionary computing (GA)
• Text clustering
K-means Algorithm
1. Randomly select k points to be the starting points for the centroids of the k clusters.
2. Assign each object to the centroid closest to the object, forming k exclusive clusters of examples.
3. Calculate new centroids of the clusters. Take the average of all the attribute values of the objects belonging to the same cluster.
4. Check if the cluster centroids have changed their coordinates. If yes, repeat from Step 2.
5. If no, cluster detection is finished, and all objects have their cluster memberships defined.
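The steps above can be sketched for one-dimensional data as follows. This is an illustrative sketch, not the slides' own implementation: the function name `kmeans_1d`, the `initial` and `max_iter` parameters, and the rule of keeping the old centroid when a cluster becomes empty are all assumptions made here.

```python
import random


def kmeans_1d(points, k, initial=None, max_iter=100):
    """Cluster 1-D points into k groups; returns (centroids, clusters)."""
    # Step 1: pick k starting centroids (random unless supplied by the caller).
    centroids = list(initial) if initial is not None else random.sample(points, k)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Step 3: the new centroid is the mean of each cluster
        # (assumption: an empty cluster keeps its previous centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        # Steps 4 and 5: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


centroids, clusters = kmeans_1d([1.0, 2.0, 9.0, 10.0], k=2, initial=[1.0, 2.0])
print(centroids, clusters)
```

Passing `initial` makes a run reproducible. Leaving it out follows Step 1 literally, so different runs may start from different centroids.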
Numerical Example
• One-dimensional database with N = 9
• Objects labeled z1…z9
• Let k = 2
• Let us start with z1 and z2 as the initial centroids: z1 = 2, z2 = 4
• Compute the distance from each object to the centroids.
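The data table for this example did not survive in the text, so the sketch below uses a hypothetical nine-point data set. It was chosen only because it starts with z1 = 2 and z2 = 4 and reproduces the centroids quoted in the following slides (2.5 and 16, then 3 and 18). The helper name `assign_and_update` is also an assumption.

```python
# Hypothetical data: NOT taken from the slides, chosen to match their centroids.
points = [2, 4, 10, 12, 3, 20, 30, 11, 25]
centroids = [2, 4]  # z1 and z2 as the initial centroids


def assign_and_update(points, centroids):
    """One pass: assign points to the nearest centroid, then recompute means."""
    clusters = [[] for _ in centroids]
    for p in points:
        distances = [abs(p - c) for c in centroids]
        # Ties go to the lower-numbered cluster (index of the first minimum).
        clusters[distances.index(min(distances))].append(p)
    # Assumes no cluster ends up empty, which holds for this data.
    new_centroids = [sum(c) / len(c) for c in clusters]
    return clusters, new_centroids


clusters, centroids = assign_and_update(points, centroids)
print(clusters, centroids)  # centroids after the first pass: 2.5 and 16.0
clusters, centroids = assign_and_update(points, centroids)
print(clusters, centroids)  # centroids after the second pass: 3.0 and 18.0
```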
Example
• Reassign each object to the two clusters based on the new calculations: Centroid-1 = 2.5, Centroid-2 = 16
Clustering: iteration 3
• Reassign each object to the two clusters based on the new calculations: Centroid-1 = 3, Centroid-2 = 18
Example
• No change in clusters, so the algorithm stops.
• The means have converged to stable values. K-means guarantees only a local optimum, which can depend on the initial centroids.