DATA MINING CLUSTERING K-Means
Clustering Definition • Techniques used to divide data objects into groups • A form of classification in that it labels objects with class (cluster) labels; unlike supervised classification, the labels are derived from the data itself • Cluster analysis is therefore categorized as unsupervised classification • Clustering is useful when you have no predefined idea of how to group the data
Types of Clustering • Hierarchical vs Partitional • Hierarchical: nested clusters, organized as a tree • Partitional: non-overlapping, each object in exactly one subset • Exclusive vs Overlapping vs Fuzzy • Exclusive: each object is assigned to a single cluster • Overlapping: an object can simultaneously belong to more than one cluster • Fuzzy: every object belongs to every cluster with a membership weight between 0 and 1 • Complete vs Partial • Complete: assigns every object to a cluster • Partial: not all objects are assigned (e.g., outliers and noise may be left out)
Types of Clusters • Well-separated • Prototype-based • Graph-based • Density-based • Shared-property (Conceptual Clusters)
K-Means • Partitional clustering • Prototype-based • One level
Basic K-Means • k, the number of clusters to be formed, must be decided before beginning • Step 1 • Select k data points to act as the seeds (initial cluster centroids) • Step 2 • Each record is assigned to its nearest centroid, forming k clusters • Step 3 • The centroids of the new clusters are recalculated. Go back to Step 2 and repeat until the centroids no longer change
Basic K-Means -2- (illustration): determine cluster boundaries, assign each record to the nearest centroid, calculate the new centroids
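The three steps above can be sketched in NumPy. This is a minimal illustration of the basic algorithm, not a production implementation; the function name and parameters are my own:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: seed k centroids, assign points, recompute, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: select k data points to act as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each record to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids no longer change
        centroids = new_centroids
    return labels, centroids
```

Note the guard for empty clusters: a randomly seeded centroid can end up with no assigned points, the weakness discussed on the next slide.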
Choosing Initial Centroids • Random initial centroids • Often poor: different runs can give very different results • Can produce empty clusters • Working around the limits of random initialization • Perform multiple runs, each with a different set of randomly chosen centroids, then select the clustering with the minimum SSE
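The multiple-runs workaround can be sketched as follows, using a compact single-run helper so the example stands alone (names and iteration counts are illustrative):

```python
import numpy as np

def one_run(X, k, rng, n_iter=50):
    """A single K-means run from one random seeding; returns (SSE, labels)."""
    c = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        lab = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
        c = np.array([X[lab == j].mean(axis=0) if (lab == j).any() else c[j]
                      for j in range(k)])
    sse = ((X - c[lab]) ** 2).sum()
    return sse, lab

def best_of(X, k, runs=10, seed=0):
    """Run K-means several times; keep the clustering with minimum SSE."""
    rng = np.random.default_rng(seed)
    return min((one_run(X, k, rng) for _ in range(runs)), key=lambda r: r[0])
```

The same idea is built into library implementations (e.g., scikit-learn's KMeans repeats the run `n_init` times and keeps the lowest-inertia result).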
Similarity, Association, and Distance • The method just described assumes that each record can be described as a point in a metric space • This is not easily done for many data sets (e.g., those with categorical and some numeric variables), so pre-processing is often necessary • Records in a cluster should have a natural association, so a measure of similarity is required • Euclidean distance is often used, but it is not always suitable • Euclidean distance treats changes in each dimension equally, but a change in one field may matter more than a change in another • and changes of the same “size” in different fields can have very different significance • e.g., a 1 metre difference in height vs. a $1 difference in annual income
Measures of Similarity • Euclidean distance between vectors X and Y: d(X, Y) = √( Σᵢ (xᵢ − yᵢ)² ) • Weighting: each dimension can be given a weight wᵢ reflecting its importance: d(X, Y) = √( Σᵢ wᵢ (xᵢ − yᵢ)² )
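The height-vs-income point can be made concrete with a small sketch of the weighted distance; the function name and weights are illustrative:

```python
import numpy as np

def weighted_euclidean(x, y, w):
    """Euclidean distance with a per-field weight w_i, so fields on large
    scales (e.g. income in dollars) do not drown out fields on small
    scales (e.g. height in metres)."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    return float(np.sqrt(np.sum(w * (x - y) ** 2)))
```

Plain Euclidean distance is the special case where every weight is 1; setting a field's weight to 0 ignores that field entirely.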
Redefine Cluster Centroids • Sum of the Squared Error (SSE) for data in Euclidean space: SSE = Σₖ Σ_{x ∈ Cₖ} dist(cₖ, x)² • The centroid (mean) of the i-th cluster Cᵢ, which minimizes the SSE, is defined as: cᵢ = (1/mᵢ) Σ_{x ∈ Cᵢ} x, where mᵢ is the number of objects in Cᵢ • Other cases: with a different proximity measure (e.g., Manhattan distance), a different prototype, such as the median, is appropriate
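As a concrete check of these definitions, a minimal NumPy sketch (function names are my own):

```python
import numpy as np

def centroid(points):
    """Mean of a cluster's points: the prototype that minimizes Euclidean SSE."""
    return points.mean(axis=0)

def total_sse(clusters):
    """Sum of squared distances from each point to its own cluster's centroid."""
    return float(sum(((c - centroid(c)) ** 2).sum() for c in clusters))
```

Moving a point to a different cluster, or replacing the mean with any other prototype, can only leave this SSE the same or make it larger under Euclidean distance.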
Bisecting K-means • Basic idea: • Split the set of all points into two clusters • Select one of these clusters to split • And so on, until K clusters have been produced • Choosing the cluster to split: • The cluster with the largest SSE • The cluster with the largest size • Both, or some other criterion • Bisecting K-means is less susceptible to initialization problems
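The bisecting procedure can be sketched as below, using the largest-SSE criterion to pick which cluster to split (a self-contained illustration; helper names are my own):

```python
import numpy as np

def _kmeans2(X, n_iter=50, seed=0):
    """Split one set of points into two clusters with plain 2-means."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=2, replace=False)]
    for _ in range(n_iter):
        lab = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
        c = np.array([X[lab == j].mean(axis=0) if (lab == j).any() else c[j]
                      for j in range(2)])
    return [X[lab == 0], X[lab == 1]]

def bisecting_kmeans(X, k):
    """Repeatedly split the cluster with the largest SSE until k clusters exist."""
    clusters = [X]
    while len(clusters) < k:
        # choose the cluster to split: here, the one with the largest SSE
        sses = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        worst = clusters.pop(int(np.argmax(sses)))
        clusters.extend(_kmeans2(worst))
    return clusters
```

Swapping the `sses` line for cluster sizes gives the largest-size criterion mentioned above.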
Strengths and Weaknesses • Strengths • Simple, and can be used for a wide variety of data types • Computationally efficient • Weaknesses • Not suitable for all types of data (e.g., non-globular clusters, or clusters of very different sizes or densities) • Sensitive to outliers, which should be removed beforehand • Restricted to data for which there is a notion of a center (centroid)