Cluster analysis
Jaehyun Lee, Probability and Statistics Laboratory, Department of Industrial Engineering, POSTECH (Pohang University of Science and Technology)
Definition • Cluster analysis is a technique used for combining observations into groups or clusters such that • Each group or cluster is homogeneous or compact with respect to certain characteristics • Each group should be different from other groups with respect to the same characteristics • Example • A marketing manager is interested in identifying similar cities that can be used for test marketing • The campaign manager for a political candidate is interested in identifying groups of voters who have similar views on important issues
Objective of cluster analysis • The objective of cluster analysis is to group observations into clusters such that each cluster is as homogeneous as possible with respect to the clustering variables • Overview of cluster analysis • Step 1: n objects measured on p variables • Step 2: Transform to an n × n similarity (distance) matrix • Step 3: Cluster formation (hierarchical or nonhierarchical clustering) • Step 4: Cluster profile
Key problems • Measure of similarity • Fundamental to the use of any clustering technique is the computation of a measure of similarity or distance between the respective objects. • Distance-type measures – Euclidean distance for standardized data, Mahalanobis distance • Matching-type measures – association coefficients, correlation coefficients • A procedure for forming the clusters • Hierarchical clustering – centroid method, single-linkage method, complete-linkage method, average-linkage method, Ward's method • Nonhierarchical clustering – k-means clustering
Similarity Measure – Distance type • Minkowski metric: d_ij = (Σ_k |x_ik − x_jk|^r)^(1/r) • If r = 2, the metric is the Euclidean distance • If r = 1, it is the absolute (city-block) distance • Consider the example sketched below
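A minimal numpy sketch of the Minkowski metric; the two observation vectors are made up for illustration:

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski distance between two observation vectors; r = 2 gives
    the Euclidean distance, r = 1 the absolute (city-block) distance."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** r) ** (1.0 / r)

x, y = [5.0, 5.0], [10.0, 15.0]   # two hypothetical observations
print(minkowski(x, y, 2))          # Euclidean: sqrt(5^2 + 10^2) ≈ 11.18
print(minkowski(x, y, 1))          # absolute: 5 + 10 = 15
```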
Similarity Measure – Distance type • Euclidean distance for standardized data • To make the measure scale invariant, the squared Euclidean distance is weighted by 1/s_k^2, the reciprocal of the variance of the kth variable • Mahalanobis distance: D^2 = (x_i − x_j)' S^(−1) (x_i − x_j), where x is a p × 1 vector and S is the p × p covariance matrix • It is designed to take into account the correlation among the variables and is also scale invariant.
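A short sketch of both scale-invariant measures, using a small made-up data matrix; the variance weighting and the S^(−1) term follow the definitions above:

```python
import numpy as np

X = np.array([[5., 5.], [10., 15.], [20., 22.], [27., 19.]])  # hypothetical n x p data

# Squared Euclidean distance on standardized data: each variable's squared
# difference is weighted by 1 / s_k^2, making the measure scale invariant.
s2 = X.var(axis=0, ddof=1)
d2_std = np.sum((X[0] - X[1]) ** 2 / s2)

# Mahalanobis distance: also scale invariant, and additionally accounts
# for the correlations among the variables through the inverse of S.
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X[0] - X[1]
d_mahal = np.sqrt(diff @ S_inv @ diff)
print(d2_std, d_mahal)
```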
Similarity Measure – Matching type • Association coefficients • This type of measure is used to represent similarity for binary variables • Similarity coefficients, such as the simple matching and Jaccard coefficients, are built from the 2 × 2 table of matches and mismatches (see the sketch below)
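A sketch of two common association coefficients, the simple matching and Jaccard coefficients; the slide does not name specific coefficients, so these are standard examples, and the binary vectors are hypothetical:

```python
import numpy as np

def matching_coefficients(x, y):
    """Association coefficients for two binary vectors.
    a = 1-1 matches, d = 0-0 matches, b and c = mismatches."""
    x, y = np.asarray(x), np.asarray(y)
    a = np.sum((x == 1) & (y == 1))
    b = np.sum((x == 1) & (y == 0))
    c = np.sum((x == 0) & (y == 1))
    d = np.sum((x == 0) & (y == 0))
    simple = (a + d) / (a + b + c + d)   # simple matching: counts 0-0 matches
    jaccard = a / (a + b + c)            # Jaccard: ignores joint absences
    return simple, jaccard

print(matching_coefficients([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))  # (0.6, 0.5)
```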
Similarity Measure – Matching type • Correlation coefficient • The Pearson product-moment correlation coefficient between two subjects' profiles is used as the measure of similarity • For example, r(A, B) = 1 and r(A, C) = 0.82
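A small illustration of correlation as a profile similarity; the profiles A, B, and C here are made up, so the resulting values differ from the slide's 0.82:

```python
import numpy as np

# Hypothetical profiles for three subjects measured on four variables;
# the correlation between profiles serves as the similarity measure.
A = np.array([1., 2., 3., 4.])
B = A + 10.0                      # same shape as A, just shifted: r(A, B) = 1
C = np.array([1., 3., 2., 4.])    # similar but partly reordered profile

r_AB = np.corrcoef(A, B)[0, 1]
r_AC = np.corrcoef(A, C)[0, 1]
print(r_AB, r_AC)                 # 1.0 and 0.8 for these made-up profiles
```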
Hierarchical clustering • Centroid method • Each group is replaced by an average subject, the centroid of that group, and the distance between two groups is the distance between their centroids (sketched below)
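A minimal sketch of the centroid idea with two hypothetical groups:

```python
import numpy as np

# Once subjects are merged, the group is represented by its centroid
# (the "average subject"), and inter-cluster distance is the distance
# between centroids.
cluster1 = np.array([[5., 5.], [10., 15.]])
cluster2 = np.array([[20., 22.], [27., 19.]])

c1, c2 = cluster1.mean(axis=0), cluster2.mean(axis=0)
centroid_distance = np.sqrt(np.sum((c1 - c2) ** 2))
print(c1, c2, centroid_distance)
```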
Hierarchical clustering • Single-linkage method • The distance between two clusters is represented by the minimum of the distances between all possible pairs of subjects in the two clusters • For example, with pairwise distances of 181 and 145, the single-linkage distance is min(181, 145) = 145; with 221 and 181, it is min(221, 181) = 181
Hierarchical clustering • Complete-linkage method • The distance between two clusters is defined as the maximum of the distances between all possible pairs of observations in the two clusters • For example, with pairwise distances of 181 and 145, the complete-linkage distance is max(181, 145) = 181; with 625 and 557, it is max(625, 557) = 625
Hierarchical clustering • Average-linkage method • The distance between two clusters is obtained by taking the average of the distances between all pairs of subjects in the two clusters • For example, (181 + 145) / 2 = 163
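One sketch covers all three pairwise linkage rules; the clusters are hypothetical, and squared Euclidean distances are used as in the examples above:

```python
import numpy as np
from scipy.spatial.distance import cdist

# `between` holds all pairwise distances between subjects in the two clusters.
cluster1 = np.array([[5., 5.], [10., 15.]])
cluster2 = np.array([[20., 22.], [27., 19.]])
between = cdist(cluster1, cluster2, metric='sqeuclidean')

print(between.min())    # single linkage: minimum pairwise distance
print(between.max())    # complete linkage: maximum pairwise distance
print(between.mean())   # average linkage: mean of all pairwise distances
```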
Hierarchical clustering • Ward’s method • It forms clusters by maximizing within-cluster homogeneity, with the within-group sum of squares used as the measure of homogeneity. At each step, Ward’s method merges the pair of clusters whose union produces the smallest increase in the total within-group or within-cluster sum of squares
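A short example of Ward's method via scipy; the data matrix is made up:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[5., 5.], [10., 15.], [20., 22.],
              [27., 19.], [6., 6.], [26., 20.]])   # hypothetical data

# Ward's method: at each step, merge the pair of clusters whose union
# gives the smallest increase in the total within-cluster sum of squares.
Z = linkage(X, method='ward')
print(fcluster(Z, t=2, criterion='maxclust'))       # two-cluster solution
```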
Evaluating the cluster solution and determining the number of clusters • Root-mean-square standard deviation (RMSSTD) of the new cluster • RMSSTD is the pooled standard deviation of all the variables forming the cluster: pooled variance = pooled SS for all the variables / pooled degrees of freedom for all the variables • R-squared (RS) • RS is the ratio of SSb to SSt (SSt = SSb + SSw). RS of CL2 is (701.166 − 184.000) / 701.166 = 0.7376
Evaluating the cluster solution and determining the number of clusters • Semipartial R-squared (SPR) • The SSw of the new cluster, minus the sum of the pooled SSw's of the clusters joined to obtain it, is called the loss of homogeneity. If the loss of homogeneity is large, the new cluster is obtained by merging two heterogeneous clusters. • SPR is the loss of homogeneity due to combining two groups or clusters to form a new group or cluster, divided by SSt. SPR of CL2 is (183 − (1 + 13)) / 701.166 = 0.241 • Distance between clusters • It is simply the Euclidean distance between the centroids of the two clusters that are to be joined or merged, and it is termed the centroid distance (CD)
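A sketch computing RS and SPR from their definitions; the data and the two-cluster partition are hypothetical, so the values differ from the slide's:

```python
import numpy as np

def within_ss(cluster):
    """Pooled within-cluster sum of squares over all variables."""
    c = np.asarray(cluster)
    return np.sum((c - c.mean(axis=0)) ** 2)

# Hypothetical data split into two clusters.
X = np.array([[5., 5.], [10., 15.], [20., 22.], [27., 19.]])
clusters = [X[:2], X[2:]]

SSt = within_ss(X)                              # total sum of squares
SSw = sum(within_ss(c) for c in clusters)       # pooled within-cluster SS
RS = (SSt - SSw) / SSt                          # R-squared = SSb / SSt

# SPR for merging the two clusters into one: loss of homogeneity / SSt,
# where the loss is the merged cluster's SSw minus the components' SSw's.
SPR = (within_ss(X) - SSw) / SSt
print(RS, SPR)
```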
Evaluating the cluster solution and determining the number of clusters • Summary of the statistics for evaluating the cluster solution
Nonhierarchical clustering • The data are divided into k partitions or groups, with each partition representing a cluster. The number of clusters must be known a priori. • Steps (sketched in code below) • 1. Select k initial cluster centroids or seeds, where k is the number of clusters desired. • 2. Assign each observation to the cluster to which it is closest. • 3. Reassign or reallocate each observation to one of the k clusters according to a predetermined stopping rule. • 4. Stop if there is no reallocation of data points or if the reassignment satisfies the criteria set by the stopping rule. Otherwise go to step 2. • Algorithms differ in • the method used for obtaining initial cluster centroids or seeds • the rule used for reassigning observations
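A minimal sketch of these steps, seeding with the first k observations (as Algorithms 1 and 2 below also do); this is an illustration under made-up data, not a production k-means:

```python
import numpy as np

def kmeans(X, k, max_iter=100):
    """Seed k centroids, assign each observation to the closest centroid,
    recompute centroids, and stop when no observation is reallocated."""
    centroids = X[:k].astype(float)               # seeds: first k observations
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each observation to the nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                 # stopping rule: no reallocation
        labels = new_labels
        # Step 3: recompute each cluster's centroid.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

X = np.array([[5., 5.], [6., 6.], [20., 22.], [27., 19.]])
print(kmeans(X, 2))
```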
Nonhierarchical clustering • Algorithm 1 • Steps • Select the first k observations as cluster centers • Compute the centroid of each cluster • Reassign each observation to its closest cluster by computing its distance to each centroid
Nonhierarchical clustering • Algorithm 2 • Steps • Select the first k observations as cluster seeds • Seeds are replaced by the remaining observations as they are processed • Reassign each observation by computing its distance to each cluster • Example sequence of partitions: {1}, {2}, {3} → {1}, {2}, {3, 4} → {1, 2}, {5}, {3, 4} → {1, 2}, {5, 6}, {3, 4}
Nonhierarchical clustering • Algorithm 3 • Selecting the initial seeds: let Sum(i) be the sum of the values of the variables for observation i • Reassign observations so as to minimize the ESS (error sum of squares) • Change in ESS = 3[(5 − 27.5)^2 + (5 − 19.5)^2]/2 − [(5 − 5.5)^2 + (5 − 5.5)^2]/2 = 1074.5, where the first bracket is the decrease in ESS from removing the observation from its current cluster and the second is the increase from adding it to the other cluster (see the sketch below)
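A sketch of the reassignment rule as a function, using the standard change-of-ESS formula for moving one observation between clusters; the sign convention here is increase minus decrease, so a negative value means the move lowers the total ESS. The demo reuses the centroids and cluster sizes read off the slide's example:

```python
import numpy as np

def delta_ess(x, centroid_from, n_from, centroid_to, n_to):
    """Net change in ESS when observation x leaves a cluster of size n_from
    and joins a cluster of size n_to; negative means the move helps."""
    x = np.asarray(x, dtype=float)
    increase = n_to / (n_to + 1) * np.sum((x - np.asarray(centroid_to)) ** 2)
    decrease = n_from / (n_from - 1) * np.sum((x - np.asarray(centroid_from)) ** 2)
    return increase - decrease

# Moving (5, 5) out of a 3-member cluster centered at (27.5, 19.5) and into
# a 1-member cluster centered at (5.5, 5.5): prints -1074.5, i.e. the total
# ESS drops by 1074.5, matching the slide's computation.
print(delta_ess([5., 5.], centroid_from=[27.5, 19.5], n_from=3,
                centroid_to=[5.5, 5.5], n_to=1))
```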
Which clustering method is best • Hierarchical methods • Advantage: do not require a priori knowledge of the number of clusters or the starting partition. • Disadvantage: once an observation is assigned to a cluster, it cannot be reassigned to another cluster. • Nonhierarchical methods • The cluster centers or the initial partition have to be identified before the technique can proceed to cluster observations. Nonhierarchical clustering algorithms, in general, are very sensitive to the initial partition. • The k-means algorithm and other nonhierarchical clustering algorithms perform poorly when random initial partitions are used. However, their performance is much superior when the results from hierarchical methods are used to form the initial partition (see the sketch below). • Hierarchical and nonhierarchical techniques should be viewed as complementary clustering techniques rather than as competing techniques.
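A short sketch of this complementary use: Ward's method supplies the number of clusters and the seed centroids for a nonhierarchical k-means pass. The data matrix is made up:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.cluster.vq import kmeans2

X = np.array([[5., 5.], [6., 6.], [10., 15.],
              [20., 22.], [27., 19.], [26., 20.]])  # hypothetical data

# Hierarchical step: Ward's method suggests a two-cluster solution,
# whose centroids then seed the k-means pass instead of random starts.
labels_h = fcluster(linkage(X, method='ward'), t=2, criterion='maxclust')
seeds = np.array([X[labels_h == g].mean(axis=0) for g in np.unique(labels_h)])

# Nonhierarchical step: k-means refines the partition from those seeds.
centroids, labels = kmeans2(X, seeds, minit='matrix')
print(labels)
```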