Cluster Analysis (Lecture# 07-08)
Dr. Tahseen Ahmed Jilani, Assistant Professor, Member IEEE-CIS, IFSA, IRSS
Department of Computer Science, University of Karachi
References:
Richard A. Johnson and Dean W. Wichern, “Applied Multivariate Statistical Analysis”, Pearson Education.
Mehmed Kantardzic, “Data Mining: Concepts, Models, Methods and Algorithms”.
Cluster Analysis • It is a set of methodologies for the automatic classification of samples into a number of groups using a measure of association, so that the samples in one group are similar and samples belonging to different groups are not. • The input to a cluster-analysis system is a set of samples and a measure of similarity (or dissimilarity) between two samples. The output is a number of groups (clusters) that form a partition, or a structure of partitions, of the data set. • An additional result of cluster analysis is a generalized description of every cluster, which is especially important for a deeper analysis of the data set's characteristics. Dr. Tahseen A. Jilani-DCS-Uok
Clustering Concepts • Samples for clustering are represented as a vector of measurements, or more formally, as a point in a multidimensional space. • Samples within a valid cluster are more similar to each other than they are to samples belonging to a different cluster. Clustering methodology is therefore particularly appropriate for exploring the interrelationships among samples and making a preliminary assessment of the sample structure. • Humans perform competitively with automatic clustering procedures in one, two, or three dimensions, but most real problems involve clustering in higher dimensions, and it is very difficult for humans to intuitively interpret data embedded in a high-dimensional space. Dr. Tahseen A. Jilani-DCS-Uok
Clustering Concepts: Example • Table 6.1 shows a simple example of clustering information for nine customers, distributed across three clusters. Two features describe customers: the first feature is the number of items the customers bought, and the second feature shows the price they paid for each. Dr. Tahseen A. Jilani-DCS-Uok
Decision Surfaces, Coarse and Fine Clustering • Even this simple example and interpretation of a cluster's characteristics shows that cluster analysis (in some references also called unsupervised classification) refers to situations in which the objective is to construct decision boundaries (classification surfaces) based on an unlabeled training data set. The samples in these data sets have only input dimensions, and the learning process is classified as unsupervised. • Clustering is a very difficult problem because data can reveal clusters of different shapes and sizes in an n-dimensional data space. • To compound the problem further, the number of clusters in the data often depends on the resolution (fine vs. coarse) with which we view the data. Dr. Tahseen A. Jilani-DCS-Uok
Visual Clustering for Low-Dimensional Data with a Small Number of Samples • Figure 6.1a shows a set of points (samples in a two-dimensional space) scattered on a 2D plane. • This kind of arbitrariness in the number of clusters, as shown in Figures (b) and (c), is a major problem in clustering, and the choice becomes even harder in three or higher dimensions. Figure: Cluster analysis of points in a 2D space Dr. Tahseen A. Jilani-DCS-Uok
How Will Clustering Work in Higher Dimensions? • The example on the previous slide uses a 2D data set only. How do we perform clustering for a data set with 15 fields (15-D) for each record (sample)? • Accordingly, we need an objective criterion for clustering. To describe this criterion, we have to introduce a more formalized approach to describing the basic concepts and the clustering process. • An input to a cluster analysis can be described as an ordered pair (X, s), or (X, d), where X is a set of descriptions of samples, and s and d are measures of similarity or dissimilarity (distance) between samples, respectively. The output of the clustering system is a partition Λ = {G1, G2, …, GN}, where each Gk, k = 1, …, N, is a crisp subset of X and the subsets are non-overlapping. Dr. Tahseen A. Jilani-DCS-Uok
Types of Cluster Representations There are several schemata for a formal description of discovered clusters: • Represent a cluster of points in an n-dimensional space (samples) by their centroid or by a set of distant (border) points in the cluster. • Represent a cluster graphically using nodes in a clustering tree. • Represent clusters by using logical expressions on sample attributes. Dr. Tahseen A. Jilani-DCS-Uok
Common Problems with Clustering Algorithms • The availability of a vast collection of clustering algorithms in the literature and in different software environments can easily confound a user attempting to select an approach suitable for the problem at hand. • It is important to mention that there is no clustering technique that is universally applicable in uncovering the variety of structures present in multidimensional data sets. The user's understanding of the problem and the corresponding data types will be the best criteria for selecting the appropriate method. Dr. Tahseen A. Jilani-DCS-Uok
Most Common Clustering Approaches • Hierarchical Clustering: Organizes data into a nested sequence of groups, which can be displayed in the form of a dendrogram or a tree structure. • Iterative Square-Error Partitional Clustering: Attempts to obtain the partition that minimizes the within-cluster scatter or maximizes the between-cluster scatter. These methods are nonhierarchical because all resulting clusters are groups of samples at the same level of partition. To guarantee that an optimum solution has been obtained, one would have to examine all possible partitions of N samples of n dimensions into K clusters (for a given K), but that exhaustive search is not computationally feasible. Dr. Tahseen A. Jilani-DCS-Uok
Different Measures of Similarity/Dissimilarity in Clustering Algorithms • Since similarity is fundamental to the definition of a cluster, this measure must be chosen very carefully, because the quality of a clustering process depends on this decision. • A sample x (or feature vector, observation) is a single data vector used by the clustering algorithm in a space of samples X. • We assume that each sample xi ∈ X, i = 1, …, n, is represented by a vector xi = {xi1, xi2, …, xim}. The value m is the number of dimensions (features) of a sample, while n is the total number of samples prepared for the clustering process that belong to the sample domain X. Dr. Tahseen A. Jilani-DCS-Uok
Similarity Measure OR Dissimilarity Measure • It is most common to calculate, instead of the similarity measure s(x, x′), the dissimilarity d(x, x′) between two samples, using a distance measure defined on the feature space. • A distance measure may be a metric or a quasi-metric on the sample space, and it is used to quantify the dissimilarity of samples. • A distance d(x, x′) is small when x and x′ are similar; if x and x′ are not similar, d(x, x′) is large. We assume without loss of generality that the distance measure is also symmetric: d(x, x′) = d(x′, x). Dr. Tahseen A. Jilani-DCS-Uok
Similarity Measure OR Dissimilarity Measure • Most efforts to produce a rather simple group structure from a complex data set require a measure of “closeness” or “similarity”. There is often a great deal of subjectivity involved in the choice of a similarity measure. • Important considerations include the nature of the variables (discrete, continuous, binary), scales of measurement (nominal, ordinal, interval, ratio) and subject matter knowledge. • When samples are clustered, proximity is usually indicated by some sort of distance. Dr. Tahseen A. Jilani-DCS-Uok
Well-known Dissimilarity Measures: Distance/Metric For two samples x = (x1, …, xm) and x′ = (x′1, …, x′m): • Cosine-correlation measure: s(x, x′) = Σk xk x′k / (Σk xk² · Σk x′k²)^(1/2), used as the dissimilarity 1 − s(x, x′) • Canberra metric (for nonnegative values): d(x, x′) = Σk |xk − x′k| / (xk + x′k) • Czekanowski coefficient (for nonnegative values): d(x, x′) = 1 − 2 Σk min(xk, x′k) / Σk (xk + x′k) • Euclidean distance: d(x, x′) = (Σk (xk − x′k)²)^(1/2) • L1 or city-block distance: d(x, x′) = Σk |xk − x′k| • Minkowski metric: d(x, x′) = (Σk |xk − x′k|^p)^(1/p), p ≥ 1 Dr. Tahseen A. Jilani-DCS-Uok
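As a quick, hedged illustration of these formulas, the following MATLAB sketch evaluates several of the measures for two hypothetical sample vectors x and y (the pdist call assumes the Statistics Toolbox):
% Evaluate several of the listed dissimilarity measures for two samples.
x = [1 3 5 7];                                   % hypothetical sample x
y = [2 3 1 9];                                   % hypothetical sample x'
d_euclid   = sqrt(sum((x - y).^2));              % Euclidean distance
d_city     = sum(abs(x - y));                    % L1 / city-block distance
p = 3;
d_minkow   = sum(abs(x - y).^p)^(1/p);           % Minkowski metric with p = 3
d_canberra = sum(abs(x - y) ./ (x + y));         % Canberra metric (nonnegative data)
d_czekan   = 1 - 2*sum(min(x, y)) / sum(x + y);  % Czekanowski coefficient
s_cosine   = sum(x .* y) / sqrt(sum(x.^2) * sum(y.^2));  % cosine similarity
D = pdist([x; y], 'cityblock');                  % pdist offers the standard metrics for whole data matrices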
Similarity for Qualitative Features • Computing distances or measures of similarity between samples that have some or all features that are non-continuous is problematic, since the different types of features are not comparable and one standard measure is not applicable. • A conventional method for obtaining a distance measure between two samples xi and xj represented with binary features is to use the 2 × 2 contingency table for samples xi and xj, as shown in Table 6.2 (a = number of features equal to 1 in both samples, b = 1 in xi and 0 in xj, c = 0 in xi and 1 in xj, d = 0 in both). Dr. Tahseen A. Jilani-DCS-Uok
Qualitative Similarity Measures Using the counts a, b, c and d from the 2 × 2 contingency table: • Simple Matching Coefficient (SMC): s(xi, xj) = (a + d) / (a + b + c + d) • Jaccard's Coefficient: s(xi, xj) = a / (a + b + c) • Rao's Coefficient: s(xi, xj) = a / (a + b + c + d) Dr. Tahseen A. Jilani-DCS-Uok
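A small MATLAB sketch of these three coefficients follows; the two binary feature vectors are hypothetical examples, and a, b, c, d are counted as in Table 6.2:
% Binary-similarity coefficients from two binary feature vectors.
xi = [1 0 1 1 0 0];           % hypothetical binary sample xi
xj = [1 1 0 1 0 1];           % hypothetical binary sample xj
a = sum(xi == 1 & xj == 1);   % 1-1 matches
b = sum(xi == 1 & xj == 0);   % 1-0 mismatches
c = sum(xi == 0 & xj == 1);   % 0-1 mismatches
d = sum(xi == 0 & xj == 0);   % 0-0 matches
SMC     = (a + d) / (a + b + c + d);  % simple matching coefficient
Jaccard = a / (a + b + c);            % ignores 0-0 matches
Rao     = a / (a + b + c + d);        % counts only 1-1 matches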
Example: Qualitative Similarity Measures • Consider five individuals who possess the following characteristics. • Define six binary variables X1, X2, X3, X4, X5 and X6. Dr. Tahseen A. Jilani-DCS-Uok
The first table shows the scores of individuals 1 and 2 on the p = 6 binary variables. Applying the simple matching coefficient, which gives equal weight to matches, we compute (a + d)/6 = (1 + 0)/6 = 1/6. • The second table shows the matrix of these coefficients for all five individuals (pairwise). Dr. Tahseen A. Jilani-DCS-Uok
Mutual Neighbor Distance (MND) for Categorical Samples Dr. Tahseen A. Jilani-DCS-Uok
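The body of this slide did not survive extraction. Assuming the usual definition, MND(xi, xj) = NN(xi, xj) + NN(xj, xi), where NN(xi, xj) is the rank of xj among the neighbors of xi (1 = nearest), a minimal MATLAB sketch with a hypothetical data matrix X would be:
% Mutual neighbor distance (MND) sketch, under the definition stated above.
X = [1 2; 1 3; 2 2; 8 8; 9 8];        % hypothetical samples, one per row
n = size(X, 1);
D = squareform(pdist(X));             % pairwise distances (Statistics Toolbox)
NN = zeros(n);                        % NN(i, j) = neighbor rank of xj with respect to xi
for i = 1:n
    [~, order] = sort(D(i, :));       % order(1) is i itself (distance 0)
    NN(i, order(2:end)) = 1:(n - 1);
end
MND = NN + NN';                       % mutual neighbor distance matrix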
Types of Hierarchical Clustering • Most procedures for hierarchical clustering are not based on the concept of optimization; the goal is to find some approximate, suboptimal solution, using iterations to improve the partitions until convergence. • Algorithms for hierarchical cluster analysis are divided into two categories: divisible (divisive) algorithms and agglomerative algorithms. • A divisible algorithm starts from the entire set of samples X and divides it into a partition of subsets, then divides each subset into smaller sets, and so on. Thus, a divisible algorithm generates a sequence of partitions ordered from a coarser one to a finer one. Dr. Tahseen A. Jilani-DCS-Uok
Agglomerative Algorithm • An agglomerative algorithm first regards each object as an initial cluster. The clusters are merged into coarser partitions, and the merging process proceeds until the trivial partition is obtained: all objects in one large cluster. • This is a bottom-up process, in which the partitions proceed from a finer one to a coarser one. • In general, agglomerative algorithms are more frequently used in real-world applications than divisible methods. Dr. Tahseen A. Jilani-DCS-Uok
Types of Agglomerative Hierarchical Clustering Algorithms • Most agglomerative hierarchical clustering algorithms are variants of the single-link or complete-link algorithms. These two basic algorithms differ only in the way they characterize the similarity between a pair of clusters. • In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of samples drawn from the two clusters (one element from the first cluster, the other from the second). • In the complete-link algorithm, the distance between two clusters is the maximum of all distances between all pairs drawn from the two clusters. A graphical illustration of these two distance measures is given in Figure 6.5. Dr. Tahseen A. Jilani-DCS-Uok
Diagrammatic Presentation of Single and Complete Link Agglomerative Hierarchical Clustering Algorithms • Linkage Methods are suitable for clustering samples as well as variables. Dr. Tahseen A. Jilani-DCS-Uok
The Basic Steps of the Agglomerative Clustering • Place each sample in its own cluster. Construct the list of inter-cluster distances for all distinct unordered pairs of samples, and sort this list in ascending order. • Step through the sorted list of distances, forming for each distinct threshold value dk a graph of the samples where pairs of samples closer than dk are connected into a new cluster by a graph edge. If all the samples are members of a connected graph, stop. Otherwise, repeat this step. • The output of the algorithm is a nested hierarchy of graphs, which can be cut at the desired dissimilarity level forming a partition (clusters) identified by simple connected components in the corresponding subgraph. Dr. Tahseen A. Jilani-DCS-Uok
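In library terms, the threshold dk corresponds to the level at which the nested hierarchy is cut. The following MATLAB sketch (Statistics Toolbox assumed; the data matrix X and the threshold dk are hypothetical) illustrates the procedure:
% Agglomerative clustering and a cut of the hierarchy at dissimilarity dk.
X  = [0 2; 0 0; 1.5 0; 5 0; 5 2];    % hypothetical samples, one per row
Y  = pdist(X);                       % pairwise distances between all sample pairs
Z  = linkage(Y, 'single');           % nested hierarchy of merges
dk = 2.5;                            % desired dissimilarity threshold
labels = cluster(Z, 'cutoff', dk, 'criterion', 'distance');  % connected components below dk
dendrogram(Z)                        % display the nested hierarchy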
Types of Agglomerative Hierarchical Clustering Algorithms • The results of both divisible and agglomerative clustering methods may be displayed in the form of a two-dimensional diagram known as a dendrogram (tree diagram). Dr. Tahseen A. Jilani-DCS-Uok
Single Linkage Method • The inputs to a single-link method can be distances or similarities between samples. • Groups are formed from the individual samples by merging the nearest objects, say U and V, to get the cluster (UV). • The distances between (UV) and any other cluster W are computed by d(UV)W = min{dUW, dVW}. • The result of merging clusters to form new clusters can be shown graphically using a dendrogram or tree diagram. Dr. Tahseen A. Jilani-DCS-Uok
Example: Single-Link Agglomerative Clustering Method • Consider the following hypothetical distances dik between pairs of five objects, arranged as a symmetric matrix:
      1    2    3    4    5
 1    0
 2    9    0
 3    3    7    0
 4    6    5    9    0
 5   11   10    2    8    0
• Treating each object as a cluster, we commence clustering by merging the two closest items. • Since min(dik) = d53 = 2, samples 3 and 5 are merged to form a new cluster (35). • To implement the next level/iteration of the clustering, we need the distances between the cluster (35) and the remaining samples/clusters 1, 2 and 4. Dr. Tahseen A. Jilani-DCS-Uok
Example (Continued): Steps #02 and #03 • d(35)1 = min{d31, d51} = min{3, 11} = 3 • d(35)2 = min{d32, d52} = min{7, 10} = 7 • d(35)4 = min{d34, d54} = min{9, 8} = 8 • Deleting the initial distance rows and columns corresponding to objects 3 and 5, and adding a row and column for the cluster (35), we obtain the new distance matrix. • Here min(d) = 3 = d(35)1, so we merge cluster (35) and sample 1 to form the cluster (135). • d(135)2 = min{d(35)2, d21} = min{7, 9} = 7 • d(135)4 = min{d(35)4, d14} = min{8, 6} = 6 Dr. Tahseen A. Jilani-DCS-Uok
The minimum nearest-neighbor distance between pairs of the remaining clusters is d42 = 5, and so we merge samples 4 and 2 to form the cluster (24). Finally, • d(135)(24) = min{d(135)2, d(135)4} = min{7, 6} = 6, so the clusters (135) and (24) are joined into a single cluster of all five objects at distance 6. Dr. Tahseen A. Jilani-DCS-Uok
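The same merge sequence can be verified with MATLAB's linkage routine (Statistics Toolbox assumed); the matrix D below collects the pairwise distances used in the worked example:
% Single-link clustering of the five objects: expect merges (3,5) at 2,
% then with 1 at 3, then (2,4) at 5, and a final join at distance 6.
D = [ 0  9  3  6 11;
      9  0  7  5 10;
      3  7  0  9  2;
      6  5  9  0  8;
     11 10  2  8  0];
Y = squareform(D);            % convert the square matrix to pdist vector form
Z = linkage(Y, 'single');     % single-link (nearest-neighbor) agglomeration
disp(Z)                       % each row: [cluster, cluster, merge distance]
dendrogram(Z)                 % the corresponding tree diagram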
Example: A Public-Utility Data Set with Eight Variables • X1 = Fixed-charge coverage ratio (income/debt) • X2 = Rate of return on capital • X3 = Cost per kW capacity in place • X4 = Annual load factor • X5 = Peak kWh demand growth from 1974 to 1975 • X6 = Sales (kWh use per year) • X7 = Percent nuclear • X8 = Total fuel cost (cents per kWh) Dr. Tahseen A. Jilani-DCS-Uok
Correlation Matrix between Pairs of Variables using MATLAB
>> X = [put all data inside the square brackets];
>> Y = corr(X)   % output is an 8x8 matrix of correlations
 1.0000  0.1598 -0.1028 -0.0820 -0.2618 -0.1517  0.0448 -0.0134
 0.1598  1.0000 -0.3108  0.1881 -0.2618 -0.2486  0.3973 -0.1432
-0.1028 -0.3108  1.0000  0.1003  0.3611  0.0280  0.1147  0.0052
-0.0820  0.1881  0.1003  1.0000 -0.0100 -0.2879 -0.1642  0.4855
-0.2618 -0.2618  0.3611 -0.0100  1.0000  0.2793 -0.0699 -0.0656
-0.1517 -0.2486  0.0280 -0.2879  0.2793  1.0000 -0.3737 -0.5605
 0.0448  0.3973  0.1147 -0.1642 -0.0699 -0.3737  1.0000 -0.1851
-0.0134 -0.1432  0.0052  0.4855 -0.0656 -0.5605 -0.1851  1.0000
Dr. Tahseen A. Jilani-DCS-Uok
MATLAB Code for the Single Linkage Method for Agglomerative Clustering
>> X = [all data in this matrix];            % one sample per row
>> Y = pdist(X);                             % pairwise (Euclidean) distances
>> Z = linkage(Y, 'single');                 % single-link agglomeration (minimum distance between clusters)
>> [H, T] = dendrogram(Z, 'colorthreshold', 'default');   % plot the clustering tree
Dr. Tahseen A. Jilani-DCS-Uok
Types of Agglomerative Hierarchical Clustering Algo Dr. Tahseen A. Jilani-DCS-Uok
Complete Linkage Method • Do it yourself: repeat the previous example, but merge clusters using d(UV)W = max{dUW, dVW} instead of the minimum. Dr. Tahseen A. Jilani-DCS-Uok
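As a hedged starting point for this exercise, the only change relative to the single-link sketch is the linkage method; using the same example distance matrix:
% Complete-link version of the five-object example (Statistics Toolbox).
D = [ 0  9  3  6 11;
      9  0  7  5 10;
      3  7  0  9  2;
      6  5  9  0  8;
     11 10  2  8  0];
Z = linkage(squareform(D), 'complete');   % cluster distance = maximum pairwise distance
dendrogram(Z)                             % compare the merge levels with the single-link tree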
Partitional Clustering • Every partitional-clustering algorithm obtains a single partition of the data instead of the clustering structure, such as a dendrogram, produced by a hierarchical technique. • Partitional methods have the advantage in applications involving large data sets for which the construction of a dendrogram is computationally very complex. Dr. Tahseen A. Jilani-DCS-Uok
Criterion/Performance/Objective Function • The partitional techniques usually produce clusters by optimizing a criterion function defined either locally (on a subset of samples) or globally (defined over all of the samples). Thus we say that a clustering criterion can be either global or local. Global Criteria and Local Criteria • A global criterion, such as the Euclidean square-error measure, represents each cluster by a prototype or centroid and assigns the samples to clusters according to the most similar prototypes. • A local criterion, such as the minimal mutual neighbor distance (MND), forms clusters by utilizing the local structure or context in the data. Therefore, identifying high-density regions in the data space is a basic criterion for forming clusters. Dr. Tahseen A. Jilani-DCS-Uok
Other Criterion Functions • MSE (mean square error) • RMSE (root mean square error) • Absolute MSE • Other statistical criteria Dr. Tahseen A. Jilani-DCS-Uok
Mean Square Error based on Euclidean Distance • The most commonly used partitional-clustering strategy is based on the square-error criterion. • The general objective is to obtain the partition that, for a fixed number of clusters, minimizes the total square-error. • Suppose that the given set of N samples in an n-dimensional space has somehow been partitioned into K clusters {C1, C2, …, Ck}. Each Ck has nk samples and each sample is in exactly one cluster, so that ∑ nk = N, where k = 1,…,K. Dr. Tahseen A. Jilani-DCS-Uok
MSE based on Euclidean Distance • The mean vector Mk of cluster Ck, defined as the centroid (mean) of the cluster, is Mk = (1/nk) Σi xik, where xik is the ith sample belonging to cluster Ck. • The square-error for cluster Ck is the sum of the squared Euclidean distances between each sample in Ck and its centroid. This error is also called the within-cluster variation: ek² = Σi ||xik − Mk||². Dr. Tahseen A. Jilani-DCS-Uok
K-means Partitional-Clustering Algorithm • The square-error for the entire clustering space containing K clusters is the sum of the within-cluster variations: EK² = Σk ek², k = 1, …, K. • The objective of a square-error clustering method is to find a partition containing K clusters that minimizes EK² for a given K. • The K-means partitional-clustering algorithm is the simplest and most commonly used algorithm employing a square-error criterion. • It starts with a random initial partition and keeps reassigning the samples to clusters, based on the similarity between samples and cluster centroids, until a convergence criterion is met. Dr. Tahseen A. Jilani-DCS-Uok
Diagrammatic Presentation of K-Mean/Centroids Algorithm Dr. Tahseen A. Jilani-DCS-Uok
K-means Partitional-Clustering Algorithm • Typically, this criterion is met when there is no reassignment of any sample from one cluster to another that will cause a decrease of the total squared error. • The K-means algorithm is popular because it is easy to implement, and its time and space complexity is relatively small. • A major problem with this algorithm is that it is sensitive to the selection of the initial partition and may converge to a local minimum of the criterion function if the initial partition is not properly chosen. • The simple K-means partitional-clustering algorithm is computationally efficient and gives surprisingly good results if the clusters are compact: hyperspherical in shape and well separated in the feature space. Dr. Tahseen A. Jilani-DCS-Uok
Basic steps of the K-means algorithm: • Select an initial partition with K clusters containing randomly chosen samples, and compute the centroids of the clusters, • Generate a new partition by assigning each sample to the closest cluster center, • Compute new cluster centers as the centroids of the clusters, • Repeat steps 2 and 3 until an optimum value of the criterion function is found (or until the cluster membership stabilizes). Dr. Tahseen A. Jilani-DCS-Uok
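A minimal MATLAB sketch of these four steps is shown below; the data matrix X reuses the five points of the example that follows, and empty clusters are handled only crudely, so treat it as an illustration rather than a production implementation:
% Minimal K-means loop following the four steps above.
X = [0 2; 0 0; 1.5 0; 5 0; 5 2];     % samples, one per row (from the example below)
K = 2;                               % desired number of clusters
n = size(X, 1);
M = X(randperm(n, K), :);            % step 1: K randomly chosen samples as initial centroids
idx = zeros(n, 1);
for iter = 1:100
    Dist = pdist2(X, M);                  % distance of every sample to every centroid
    [~, newIdx] = min(Dist, [], 2);       % step 2: assign each sample to the closest centroid
    if isequal(newIdx, idx), break; end   % step 4: stop when the membership stabilizes
    idx = newIdx;
    for k = 1:K                           % step 3: recompute the centroids
        if any(idx == k)
            M(k, :) = mean(X(idx == k, :), 1);
        end                               % an empty cluster keeps its old centroid (sketch only)
    end
end
% Built-in equivalent (Statistics Toolbox): [idx, M] = kmeans(X, K);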
Example • Let us analyze the steps of the K-means algorithm on the simple data set given in Figure 6.6. Suppose that the required number of clusters is two and that, initially, the clusters are formed by a random distribution of the samples. • Step #01: x1 = (0, 2), x2 = (0, 0), x3 = (1.5, 0), x4 = (5, 0), x5 = (5, 2), with C1 = {x1, x2, x4} and C2 = {x3, x5}. The centroids of these two clusters are M1 = {(0 + 0 + 5)/3, (2 + 0 + 0)/3} = {1.66, 0.66} and M2 = {(1.5 + 5)/2, (0 + 2)/2} = {3.25, 1.00}. Dr. Tahseen A. Jilani-DCS-Uok
Example (Continued) • The within-cluster variations, after the initial random distribution of samples, are e1² = [(0 − 1.66)² + (2 − 0.66)²] + [(0 − 1.66)² + (0 − 0.66)²] + [(5 − 1.66)² + (0 − 0.66)²] = 19.36 and e2² = [(1.5 − 3.25)² + (0 − 1.00)²] + [(5 − 3.25)² + (2 − 1.00)²] = 8.12. • The total square error is ESS = SSE = e1² + e2² = 19.36 + 8.12 = 27.48. Dr. Tahseen A. Jilani-DCS-Uok
Example: K-Means Clustering (Continued) • Step #02: When we reassign all samples according to the minimum distance from the centroids M1 and M2, the new distribution of samples inside the clusters will be as follows. Dr. Tahseen A. Jilani-DCS-Uok
Example: K-Means Clustering (Continued) • C1 = {x1, x2, x3} and C2 = {x4, x5} have new centroids M1 = {(0 + 0 + 1.5)/3, (2 + 0 + 0)/3} = {0.50, 0.67} and M2 = {(5 + 5)/2, (0 + 2)/2} = {5.00, 1.00}. The corresponding within-cluster variations are e1² = 4.17 and e2² = 2.00, so the total square error is E² = 6.17. • After the first iteration, the total square error is significantly reduced (from 27.48 to 6.17). In this simple example, the first iteration was also the final one, because if we analyze the distances between the new centroids and the samples, all samples remain assigned to the same clusters. There is no reassignment, and therefore the algorithm halts. Dr. Tahseen A. Jilani-DCS-Uok
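The two partitions above can be reproduced in a few MATLAB lines; this is a sketch using the same data and the same initial clusters as the example (pdist2 assumes the Statistics Toolbox):
% Reproduce the square-error values of the worked K-means example.
X = [0 2; 0 0; 1.5 0; 5 0; 5 2];          % samples x1 ... x5, one per row
idx0 = [1; 1; 2; 1; 2];                   % initial partition: C1 = {x1,x2,x4}, C2 = {x3,x5}
M1 = mean(X(idx0 == 1, :), 1);            % -> approximately [1.66 0.66]
M2 = mean(X(idx0 == 2, :), 1);            % -> [3.25 1.00]
E0 = sum(sum((X(idx0 == 1, :) - M1).^2)) + ...
     sum(sum((X(idx0 == 2, :) - M2).^2)); % -> about 27.5 (27.48 in the slides, with rounding)
Dist = pdist2(X, [M1; M2]);               % distance of every sample to both centroids
[~, idx1] = min(Dist, [], 2);             % reassignment gives C1 = {x1,x2,x3}, C2 = {x4,x5}
N1 = mean(X(idx1 == 1, :), 1);            % -> [0.50 0.67]
N2 = mean(X(idx1 == 2, :), 1);            % -> [5.00 1.00]
E1 = sum(sum((X(idx1 == 1, :) - N1).^2)) + ...
     sum(sum((X(idx1 == 2, :) - N2).^2)); % -> about 6.17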
The Reasons behind the Popularity of the K-Means Algorithm In summary, only the K-means algorithm and its equivalent in the artificial-neural-network domain, the Kohonen neural networks, have been applied for clustering on large data sets. Other approaches have been tested, typically, on small data sets. • Its time complexity is O(nkl), where n is the number of samples, k is the number of clusters, and l is the number of iterations taken by the algorithm to converge. Typically, k and l are fixed in advance, and so the algorithm has linear time complexity in the size of the data set. • Its space complexity is O(k + n), and if it is possible to store all the data in primary memory, access time to all elements is very fast and the algorithm is very efficient. Dr. Tahseen A. Jilani-DCS-Uok