Cluster Analysis Part 2
Recap • Considered two types of clustering algorithms – the partitioning type and the hierarchical type. • Discussed a partitioning algorithm called K-Means. It requires the user to specify the number of clusters and their seeds, and uses a Euclidean-distance nearest-neighbour approach. • A hierarchical algorithm called the Agglomerative Method was also discussed. It starts with one cluster per object and merges the pair of clusters that are nearest, continuing until one large cluster includes all objects. Distances between clusters need to be computed; possible approaches are Single Link, Complete Link and others.
Recap • In our examples, the attributes are not independent: there is strong correlation between students’ marks, and strong correlation between Olympic medals and per capita income. • The examples used are always very simple and too small to illustrate the methods fully. • For K-Means, it is difficult to guess the number of clusters and the starting seeds, and there is no generally accepted procedure for determining them. • How should the results be interpreted?
K-Means An iterative-improvement greedy algorithm. A random initial selection of clusters is improved by reassigning objects until no further improvement is possible. All the data in the dataset must be processed on each iteration, so if the data is very large and cannot fit in main memory the process may become inefficient.
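A minimal sketch of this iterative-improvement loop, assuming numeric data held in a NumPy array (the function and parameter names are illustrative, not from the lecture):

```python
import numpy as np

def k_means(X, k, seeds=None, max_iter=100, rng=None):
    """Basic K-Means: assign each object to its nearest centre, then
    recompute the centres, repeating until the assignment stops changing."""
    rng = np.random.default_rng() if rng is None else rng
    # Start from user-supplied seeds, or pick k objects at random.
    centres = (np.asarray(seeds, dtype=float) if seeds is not None
               else X[rng.choice(len(X), size=k, replace=False)].astype(float))
    labels = None
    for _ in range(max_iter):
        # Euclidean-distance nearest-centre assignment; note that every
        # object in the dataset is processed on every iteration.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no reassignment, so no further improvement is possible
        labels = new_labels
        # Move each centre to the mean of the objects assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return labels, centres
```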
K-Means Finds a local minimum. To have a better chance of finding the global minimum, the method should be restarted with a new set of cluster seeds (and perhaps a different number of clusters). The results of several such runs can then be evaluated on the basis of within-group and between-group variation, and the best result accepted.
K-Means Define the within-cluster variation I and the between-clusters variation E as follows: • I = sum of squares of pairwise distances between the objects in a cluster • E = sum of squares of pairwise distances between the central points of the clusters One possible objective could then be to maximize E/I. This ratio could also be used to judge the quality of different runs.
K-Means A small I shows that a cluster is tight; a large I shows that the cluster has much variation within it. Similarly, a small E (relative to, say, the sum of the I values) shows that the clusters are not well separated, so the clusters found may not be very good; a large E shows good separation between the clusters.
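As a rough sketch, I (summed over all clusters, as suggested above), E and the ratio E/I could be computed from a clustering like this (the helper names are illustrative):

```python
import numpy as np

def pairwise_sq_sum(points):
    """Sum of squared pairwise Euclidean distances among a set of points."""
    total = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += np.sum((points[i] - points[j]) ** 2)
    return total

def cluster_quality(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    # I: within-cluster variation, summed over all clusters.
    I = sum(pairwise_sq_sum(c) for c in clusters)
    # E: between-clusters variation, using the cluster centres.
    centres = np.array([c.mean(axis=0) for c in clusters])
    E = pairwise_sq_sum(centres)
    return I, E, (E / I if I > 0 else float("inf"))
```

Several runs with different seeds (and possibly a different number of clusters) can then be compared on this ratio, and the run with the best value accepted.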
Hierarchical Methods Quite different from K-Means. There is no need to specify the number of clusters or the cluster seeds. The agglomerative method starts with each object in its own cluster and then merges the two closest clusters to build larger and larger clusters; the divisive approach is the opposite. The results depend on the distance measure used, and it is not easy to evaluate their quality.
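For illustration only, SciPy's hierarchical-clustering routines can carry out the agglomerative method directly (this assumes SciPy is available; the dataset here is made up for the example):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Small illustrative dataset: rows are objects, columns are attributes.
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Repeatedly merge the two closest clusters; 'single' uses the
# nearest-neighbour (single-link) distance between clusters.
Z = linkage(X, method="single")

# Cut the resulting tree to obtain, say, two clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```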
Distance between Clusters Single Link method (nearest neighbour) – the distance between two clusters is the distance between the two closest points, one from each cluster. Chains can be formed (why?) and a long string of points may be assigned to the same cluster. Complete Linkage method (furthest neighbour) – the distance between two clusters is the distance between the two furthest points, one from each cluster. Does not allow chains.
Distance between Clusters Centroid method – the distance between two clusters is the distance between the centroids (centres of gravity) of the two clusters. Unweighted pair-group average method – the distance between two clusters is the average distance between all pairs of objects in the two clusters. This means p*n distances need to be computed if p and n are the numbers of objects in the two clusters.
Distance between Clusters Ward’s method – the distance between two clusters is the increase in the total within-cluster sum of squares caused by merging them, i.e. the within-cluster sum of squares of the merged cluster minus the total within-cluster sum of squares of the two clusters taken separately.
Single Link Consider two clusters, each with 5 objects. The smallest distance between a point in one cluster and a point in the other needs to be computed. (Diagram of points A and B omitted.)
Complete Link Consider two clusters, each with 5 objects. The largest distance between a point in one cluster and a point in the other needs to be computed. (Diagram of points A and B omitted.)
Centroid Link Consider two clusters. The distance between the two centroids needs to be computed. (Diagram of points A and B with their centroids C omitted.)
Pair-Group Average Consider two clusters, each with 4 objects. All 16 pairwise distances need to be computed and their average found. (Diagram of points A and B omitted.)
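The inter-cluster distances described above can be written down directly; a sketch, assuming each cluster is a NumPy array of points (the function names are illustrative):

```python
import numpy as np

def _all_pair_dists(A, B):
    """Euclidean distances between every point of A and every point of B
    (p*n distances for clusters of size p and n)."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_link(A, B):      # nearest neighbour
    return _all_pair_dists(A, B).min()

def complete_link(A, B):    # furthest neighbour
    return _all_pair_dists(A, B).max()

def centroid_link(A, B):    # distance between centres of gravity
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

def average_link(A, B):     # unweighted pair-group average
    return _all_pair_dists(A, B).mean()

def ward_increase(A, B):
    """Ward's method: increase in the within-cluster sum of squares
    caused by merging A and B."""
    def sse(C):
        return np.sum((C - C.mean(axis=0)) ** 2)
    return sse(np.vstack([A, B])) - sse(A) - sse(B)
```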
Divisive Clustering The opposite of the agglomerative approach, and less commonly used. Start with one cluster that contains all the objects and seek to split it into two clusters, which are themselves then split into even smaller clusters. The split may be based on one variable at a time or on all the variables together. The algorithm terminates when a termination condition is met or when each cluster has only one object.
Divisive Clustering There are two types of divisive methods, depending on whether the split uses one variable at a time or all the variables together. Monothetic – split a cluster using only one variable at a time. How does one choose the variable? Polythetic – split a cluster using all of the attributes together. How should objects be allocated to the new clusters?
Algorithm A typical (polythetic) divisive algorithm works as follows (a code sketch is given after the steps): • Decide on a method of measuring the distance between two objects and decide on a threshold distance. • Create a distance matrix by computing the distances between all pairs of objects within the cluster. Sort these distances in ascending order. • Find the two most dissimilar objects (i.e. the two objects that have the largest distance between them).
Algorithm • If this distance is smaller than the pre-specified threshold and there is no other cluster that needs to be divided, then stop. • Use the pair of objects identified in Step 3 as the seeds of a K-Means-type algorithm to create two new clusters. Examine all objects and place each in the cluster whose seed is nearer. • If there is only one object in each cluster then stop; otherwise continue with Step 2.
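A sketch of one such split, following the steps above (the helper names, and the choice of Manhattan distance, are illustrative assumptions):

```python
import numpy as np

def manhattan_matrix(X):
    """Distance matrix of Manhattan distances between all pairs of objects."""
    return np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)

def split_cluster(X):
    """Split one cluster in two: take the two most dissimilar objects as
    seeds and assign every object to the nearer seed (Steps 2-5)."""
    D = manhattan_matrix(X)
    i, j = np.unravel_index(np.argmax(D), D.shape)  # most dissimilar pair
    in_first = D[:, i] <= D[:, j]                   # nearer to seed i?
    return np.where(in_first)[0], np.where(~in_first)[0], D[i, j]
```

The two returned index sets are the new clusters, and the returned largest distance can be compared with the threshold of Step 4 when deciding whether further splitting is worthwhile.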
Divisive Method Two major issues that need resolving are: • Which cluster to split next? • How to split the cluster?
Which cluster to split? There are a number of possibilities: • Split the clusters in some sequential order • Split the cluster that has the largest number of objects • Split the cluster that has the largest variation within it The first two approaches are clearly very simple, but the third is better, since splitting the cluster with the most variation is a sound criterion.
How to split a cluster? A simple approach is to split a cluster based on the distances between its objects, as outlined in the algorithm above: a distance matrix is created and the two most dissimilar objects are selected as the seeds of two new clusters. A method like the K-Means method may then be used to complete the split.
Example We now consider the simple example about students’ marks and use the Manhattan distance between two objects, since it is simple to compute without a computer. We first calculate the distance matrix.
Example • The matrix gives the distance of each object from every other object. • The largest distance is between S8 and S9, so they become the seeds of two new clusters. Use a K-Means-type step to split the group into two clusters.
Example • The distances of the other objects from S8 and S9 are:
      S1   S2   S3   S4   S5   S6   S7   S8   S9   S10
S8    44   74   40    8   91  104   28    0  115   98
S9    20   22   36   60   37   46   30  115    0   99
• The two clusters are: C1: S8, S4, S7, S10  C2: S9, S1, S2, S3, S5, S6
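The assignment to C1 and C2 can be checked mechanically; a small sketch using only the two distance rows above:

```python
# Distances of each object from the two seeds S8 and S9 (from the table above).
objects = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9", "S10"]
from_S8 = [44, 74, 40, 8, 91, 104, 28, 0, 115, 98]
from_S9 = [20, 22, 36, 60, 37, 46, 30, 115, 0, 99]

# Each object joins the cluster whose seed is nearer.
C1 = [o for o, d8, d9 in zip(objects, from_S8, from_S9) if d8 <= d9]
C2 = [o for o, d8, d9 in zip(objects, from_S8, from_S9) if d8 > d9]
print("C1:", C1)   # ['S4', 'S7', 'S8', 'S10']
print("C2:", C2)   # ['S1', 'S2', 'S3', 'S5', 'S6', 'S9']
```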
Next split • We now decide which cluster to split next and then repeat the process. • The two clusters are: C1: S8, S4, S7, S10  C2: S9, S1, S2, S3, S5, S6 • We may want to split the larger cluster first. • Find the largest distance in C2 first; the information is available in the distance matrix given earlier.
Next split • In cluster C2 the largest distance is 86, between S3 and S6, so C2 can be split with these as seeds. • In cluster C1 (distance matrix on the next slide) the largest distance is 98, between S8 and S10, so C1 can be split with these as seeds.
Next • We stop short of finishing the example. • The result might look something like what is given on the next slide.
Hierarchical Clustering • The ordering produced can be useful in gaining some insight into the data. • The major difficulty is that once an object is allocated to a cluster it cannot be moved to another cluster, even if the initial allocation was incorrect. • Different distance metrics can produce different results.
Validating Clusters A difficult problem, since no objective metric exists. Often a subjective measure must be used, e.g. does the result give the user insight into the data? The validity is therefore in the eye of the beholder. Statistical testing may be possible – one could test the hypothesis that there are no clusters. Perhaps test data is available for which the clusters are known.
Summary There is a bewildering array of clustering methods; they are based on diverse underlying principles and often lead to qualitatively different results. Little work has been done on reasoning about clustering independently of any particular method. What basic properties ought a clustering method to obey? Three such properties could be scale-invariance, richness and consistency.