
CLUSTERING



Presentation Transcript


  1. CLUSTERING By M. GANESHKUMAR, Assistant Professor, Rajapalayam Rajus' College, Rajapalayam

  2. What is Cluster Analysis? • Cluster: a collection of data objects • Similar to one another within the same cluster • Dissimilar to the objects in other clusters • Cluster analysis • Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters • Unsupervised learning: no predefined classes • As a stand-alone tool to get insight into data distribution • As a preprocessing step for other algorithms

  3. Patterns to be clustered may be labeled or unlabeled, but clustering itself does not use the labels • Clustering is the commonest form of unsupervised learning • Unsupervised learning = learning from raw, unlabeled data, as opposed to supervised learning, where a classification of the examples is given

  4. Measure the Quality of Clustering • Dissimilarity/Similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j) • There is a separate "quality" function that measures the "goodness" of a cluster • The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables • Weights should be associated with different variables based on the application and data semantics • It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective

  5. Similarity and Dissimilarity Between Objects • Distances are normally used to measure the similarity or dissimilarity between two data objects • A popular choice is the Minkowski distance: d(i, j) = (|xi1 − xj1|^q + |xi2 − xj2|^q + … + |xip − xjp|^q)^(1/q), where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects and q is a positive integer

  6. If q = 1, d is the Manhattan distance; if q = 2, d is the Euclidean distance: d(i, j) = sqrt(|xi1 − xj1|^2 + |xi2 − xj2|^2 + … + |xip − xjp|^2) • Properties • d(i, j) ≥ 0 • d(i, i) = 0 • d(i, j) = d(j, i) • d(i, j) ≤ d(i, k) + d(k, j)
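These distance functions are straightforward to compute. A minimal Python/NumPy sketch (the sample points and function name are illustrative, not from the slides):

import numpy as np

def minkowski(x, y, q=2):
    """Minkowski distance; q = 1 gives Manhattan, q = 2 gives Euclidean."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

# Hypothetical example points i and j in 3 dimensions
i = np.array([1.0, 2.0, 3.0])
j = np.array([4.0, 0.0, 3.0])

print(minkowski(i, j, q=1))  # Manhattan: |1-4| + |2-0| + |3-3| = 5.0
print(minkowski(i, j, q=2))  # Euclidean: sqrt(9 + 4 + 0) ≈ 3.61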

  7. Overview of clustering • From the survey paper "Data Clustering: A Review" (Jain, Murty, and Flynn) • Feature Selection • identifying the most effective subset of the original features to use in clustering • Feature Extraction • transformations of the input features to produce new salient features • Interpattern Similarity • measured by a distance function defined on pairs of patterns • Grouping • methods to group similar patterns into the same cluster

  8. Major Clustering Approaches • Partitioning approach: • Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors • Typical methods: k-means, k-medoids, CLARANS • Hierarchical approach: • Create a hierarchical decomposition of the set of data (or objects) using some criterion • Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON • Density-based approach: • Based on connectivity and density functions • Typical methods: DBSCAN, OPTICS, DenClue
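As a rough illustration of the three families, the sketch below runs one representative algorithm from each on the same toy data, assuming scikit-learn is available; the data and parameter values are made up:

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Two obvious blobs plus one stray point (hypothetical data)
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1],
              [4.0, 12.0]])

print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))  # partitioning
print(AgglomerativeClustering(n_clusters=2).fit_predict(X))            # hierarchical
print(DBSCAN(eps=1.0, min_samples=2).fit_predict(X))                   # density-based; -1 marks noise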

  9. Hierarchical Clustering • Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents • One approach: recursive application of a partitional clustering algorithm • [Figure: example taxonomy — animal splits into vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean)]

  10. Dendrogram: Hierarchical Clustering • Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster

  11. Hierarchical Agglomerative Clustering (HAC) • Starts with each document in a separate cluster • Then repeatedly joins the closest pair of clusters, until there is only one cluster • The history of merging forms a binary tree or hierarchy • Note: the resulting clusters are still "hard" and induce a partition of the documents

  12. Hierarchical clustering • There are two styles of hierarchical clustering algorithms for building a tree from the input set S: • Agglomerative (bottom-up): • Begin with singletons (sets with one element) • Merge them until the whole set S is reached as the root • This is the most common approach • Divisive (top-down): • Recursively partition S until singleton sets are reached

  13. Hierarchical clustering • Input: a pairwise distance matrix involving all instances in S • Algorithm • Step 1: Place each instance of S in its own cluster (singleton), creating the list of clusters L (initially, the leaves of T): L = S1, S2, S3, ..., Sn-1, Sn • Step 2: Compute a merging cost function between every pair of elements in L to find the two closest clusters {Si, Sj}, which will be the cheapest pair to merge • Step 3: Remove Si and Sj from L • Step 4: Merge Si and Sj to create a new internal node Sij in T, which will be the parent of Si and Sj in the resulting tree, and add Sij to L • Step 5: Go to Step 2 until there is only one set remaining
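A minimal Python sketch of this merge loop, assuming a precomputed pairwise distance matrix and using single linkage as the merging cost (function and variable names are illustrative):

import numpy as np

def hac_single_linkage(dist):
    """Agglomerative clustering on a pairwise distance matrix; returns the merge history."""
    n = dist.shape[0]
    L = {i: [i] for i in range(n)}          # Step 1: every instance is its own singleton cluster
    merges, next_id = [], n
    while len(L) > 1:
        # Step 2: merging cost = smallest point-to-point distance (single linkage)
        pairs = [(a, b) for a in L for b in L if a < b]
        costs = {(a, b): min(dist[i][j] for i in L[a] for j in L[b]) for a, b in pairs}
        (a, b) = min(costs, key=costs.get)
        # Steps 3-4: remove Si and Sj, create their parent node Sij, put it back in L
        L[next_id] = L.pop(a) + L.pop(b)
        merges.append((a, b, costs[(a, b)]))
        next_id += 1                        # Step 5: repeat until one cluster remains
    return merges

# Toy usage with made-up 2-D points
pts = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
print(hac_single_linkage(dist))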

  14. Step 2 can be done in different ways, which is what distinguishes single-linkage from complete-linkage and average-linkage clustering. • In single-linkage clustering (also called the connectedness or minimum method): we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. • In complete-linkage clustering (also called the diameter or maximum method), we consider the distance between one cluster and another cluster to be equal to the greatest distance from any member of one cluster to any member of the other cluster. • In average-linkage clustering, we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.
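All three linkage criteria are available off the shelf; assuming SciPy and NumPy are installed, the sketch below (with made-up data) builds the merge history under each criterion and cuts each dendrogram into three clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.0, 5.5], [9.0, 1.0]])
condensed = pdist(X)                                     # pairwise Euclidean distances

for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)                # merge history (the dendrogram)
    labels = fcluster(Z, t=3, criterion="maxclust")      # cut the tree into 3 clusters
    print(method, labels)

Cutting with a different t, or with criterion="distance", corresponds to cutting the dendrogram at a desired level, as described on slide 10.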

  15. Hierarchical clustering: example

  16. Hierarchical clustering: example using single linkage

  17. Hierarchical clustering: forming clusters • Forming clusters from dendrograms

  18. Advantages • Dendrograms are great for visualization • Provides hierarchical relations between clusters • Shown to be able to capture concentric clusters • Disadvantages • Not easy to define levels for clusters • Experiments have shown that other clustering techniques can outperform hierarchical clustering

  19. Partitioning Algorithms: Basic Concept • Partitioning method: construct a partition of a database D of n objects into a set of k clusters that minimizes the sum of squared distances, E = Σ (m = 1..k) Σ (p ∈ Cm) (p − cm)^2, where cm is the representative (center) of cluster Cm • Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion • Global optimal: exhaustively enumerate all partitions • Heuristic methods: k-means and k-medoids algorithms • k-means (MacQueen'67): each cluster is represented by the center of the cluster • k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster

  20. K-Means • Step 0: Start with a random partition into K clusters • Step 1: Generate a new partition by assigning each pattern to its closest cluster center • Step 2: Compute new cluster centers as the centroids of the clusters • Step 3: Repeat Steps 1 and 2 until there is no change in the membership (the cluster centers then also remain the same)
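These steps translate into a short NumPy sketch. This is a toy implementation, not the slide's reference code; it starts from randomly chosen data points as initial centers (a common simplification of Step 0):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: alternate assignment (Step 1) and centroid update (Step 2)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # initial centers
    for _ in range(n_iters):
        # Step 1: assign each pattern to its closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each center as the centroid of its cluster
        # (keep the old center if a cluster happens to become empty)
        new_centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                                else centers[c] for c in range(k)])
        if np.allclose(new_centers, centers):                      # Step 3: stop when nothing changes
            break
        centers = new_centers
    return labels, centers

# Toy usage: two made-up Gaussian blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(5.0, 0.5, (20, 2))])
labels, centers = kmeans(X, k=2)
print(centers)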

  21. The K-Means Clustering Method: Example • [Figure: with K = 2, arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects and update the means again, repeating until assignments stop changing]
