
Metrics, Algorithms & Follow-ups


Presentation Transcript


  1. Metrics, Algorithms & Follow-ups
  • Profile Similarity Measures
  • Cluster combination procedures
  • Hierarchical vs. Non-hierarchical Clustering
  • Statistical follow-up analyses
    • “Internal” ANOVA & ldf Analyses
    • “External” ANOVA & ldf Analyses

  2. Profile Dissimilarity Measures
  For each formula, y are the data from the 1st person and x are the data from the 2nd person being compared, summing across variables:
  • Euclidean: √[ Σ (y − x)² ]
  • Squared Euclidean: Σ (y − x)²  (probably the most popular)
  • City-Block: Σ | y − x |
  • Chebychev: max | y − x |
  • Cosine: cos r_xy  (a similarity index)
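These measures are simple to compute directly. Below is a minimal sketch in Python/NumPy; the function names are just illustrative, and the cosine shown is the cosine of the angle between the two raw profile vectors.

```python
import numpy as np

def euclidean(y, x):
    """Square root of the sum of squared differences across variables."""
    return np.sqrt(np.sum((y - x) ** 2))

def squared_euclidean(y, x):
    """Sum of squared differences (probably the most popular)."""
    return np.sum((y - x) ** 2)

def city_block(y, x):
    """Sum of absolute differences across variables."""
    return np.sum(np.abs(y - x))

def chebychev(y, x):
    """Greatest single univariate difference."""
    return np.max(np.abs(y - x))

def cosine_similarity(y, x):
    """Cosine of the angle between the two profile vectors (a similarity index)."""
    return np.dot(y, x) / (np.linalg.norm(y) * np.linalg.norm(x))
```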

  3. Euclidean: √[ Σ (y − x)² ]
  [Figure: X = (2, 0) and Y = (−2, −2) plotted on the V1 × V2 plane]
  √( [−2 − 2]² + [−2 − 0]² ) = √( 4² + 2² ) = √20 = 4.47
  4.47 represents the “multivariate dissimilarity” of X & Y.

  4. Squared Euclidean: Σ (y − x)²
  [Figure: X = (2, 0) and Y = (−2, −2) plotted on the V1 × V2 plane]
  [−2 − 2]² + [−2 − 0]² = 4² + 2² = 20
  • 20 represents the “multivariate dissimilarity” of X & Y.
  • Squared Euclidean is a little better at “noticing” strays: remember that we use a square-root transform to “pull in” outliers, so leaving the value squared makes the strays stand out a bit more.

  5. City-Block: Σ | y − x |
  [Figure: X = (2, 0) and Y = (−2, −2) plotted on the V1 × V2 plane]
  | −2 − 2 | + | −2 − 0 | = 4 + 2 = 6
  So named because in a city you have to go “around the block”; you can’t “cut the diagonal.”

  6. Chebychev: max | y − x |
  [Figure: X = (2, 0) and Y = (−2, −2) plotted on the V1 × V2 plane]
  | −2 − 2 | = 4,  | −2 − 0 | = 2,  max(4, 2) = 4
  Uses the “greatest univariate difference” to represent the multivariate dissimilarity.
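For a quick check, SciPy's built-in distance functions reproduce the worked values on slides 3–6 (a sketch, assuming SciPy is installed):

```python
from scipy.spatial import distance

X = [2, 0]
Y = [-2, -2]

print(distance.euclidean(Y, X))    # 4.47 (the square root of 20)
print(distance.sqeuclidean(Y, X))  # 20.0
print(distance.cityblock(Y, X))    # 6
print(distance.chebyshev(Y, X))    # 4
```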

  7. Cosine: cos r_xy
  [Figure: X = (2, 0) and Y = (−2, −2) plotted as vectors on the V1 × V2 plane, with the angle Φ between them]
  • First: correlate the scores from the 2 cases across variables.
  • Second: find the cosine (angle Φ) of that correlation.
  • This is a “similarity index”; all the other measures we have looked at are “dissimilarity indices.”
  • Using correlations ignores level differences between cases and looks only at shape differences (see the next slide).

  8. [Figure: four profiles (orange, blue, green, yellow) plotted across variables A–E]
  • Based on Euclidean or Squared Euclidean, these four cases would probably group as orange & blue and green & yellow: while the cases within those groups have somewhat different shapes, they have very similar levels.
  • Based on cos r, these four cases would probably group as blue & green and orange & yellow, because correlation pays attention only to profile shape.
  It is important to carefully consider how you want to define “profile similarity” when clustering; it will likely change the results you get.
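To see the level-versus-shape distinction numerically, here is a small sketch with two hypothetical profiles that have identical shape but different levels: their squared Euclidean distance is large, while the correlation across variables (the basis of the cos r measure) is a perfect 1.

```python
import numpy as np

# Hypothetical profiles across five variables: same shape, different level
case1 = np.array([1.0, 3.0, 2.0, 4.0, 3.0])
case2 = case1 + 5.0                      # identical shape, shifted up 5 points

print(np.sum((case1 - case2) ** 2))      # squared Euclidean = 125: very "dissimilar"
print(np.corrcoef(case1, case2)[0, 1])   # correlation = 1.0: identical shape
```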

  9. How Hierarchical Clustering works
  • Data are in an “X” matrix (cases × variables).
  • Compute the “profile similarity” of all pairs of cases and put those values in a “D” matrix (cases × cases).
  • Start with # clusters = # cases (1 case in each cluster).
  • On each step:
    • identify the 2 clusters that are “most similar” (a “cluster” may have 1 or more cases)
    • combine those 2 into a single cluster
    • re-compute the “profile similarity” among all cluster pairs
  • Repeat until there is a single cluster.
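In practice the loop above is handled by library routines. A minimal sketch with SciPy; the data matrix, linkage method, and number of clusters are placeholders.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))        # 30 cases x 4 variables (the "X" matrix)

# linkage() computes the pairwise "D" matrix internally, then repeatedly
# joins the two most similar clusters until a single cluster remains
Z = linkage(X, method='ward', metric='euclidean')

# cut the merge tree at, say, 3 clusters and get a cluster label for each case
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)
```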

  10. Amalgamation & Linkage Procedures -- which clusters to combine?
  • Ward’s -- joins the two clusters that will produce the smallest increase in the pooled within-cluster variation (works best with Squared Euclidean).
  • Centroid Condensation -- joins the two clusters with the closest centroids; the profile of the joined cluster is the mean of the two (works best with the Squared Euclidean distance metric).
  • Median Condensation -- same as centroid, except that equal weighting is used to construct the centroid of the joined cluster (as if the 2 clusters being joined had equal N).
  • Between-Groups Average Linkage -- joins the two clusters for which the average distance between members of those two clusters is the smallest.

  11. Amalgamation & Linkage Procedures, cont.
  • Within-Groups Average Linkage -- joins the two clusters for which the average distance between members of the resulting (combined) cluster will be smallest.
  • Single Linkage -- joins the two clusters that have the most similar pair of cases.
  • Complete Linkage -- joins the two clusters for which the maximum distance between a pair of cases in the two clusters is the smallest.
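For reference, most of these amalgamation rules correspond (at least approximately) to the method= argument of SciPy's linkage function; the mapping below is my reading rather than something stated on the slides, and the within-groups average linkage has no direct SciPy counterpart.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Approximate correspondence between the procedures above and SciPy's method names:
#   Ward's                   -> method='ward'      (Euclidean distances required)
#   Centroid Condensation    -> method='centroid'  (Euclidean distances required)
#   Median Condensation      -> method='median'    (Euclidean distances required)
#   Between-Groups Average   -> method='average'
#   Single Linkage           -> method='single'
#   Complete Linkage         -> method='complete'
# (Within-Groups Average Linkage has no direct SciPy counterpart.)

X = np.random.default_rng(1).normal(size=(20, 3))   # toy data: 20 cases x 3 variables
for m in ['ward', 'centroid', 'median', 'average', 'single', 'complete']:
    Z = linkage(X, method=m)                         # one merge tree per amalgamation rule
```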

  12. Ward’s -- joins the two clusters that will produce the smallest increase in the pooled within-cluster variation
  • works well with Squared Euclidean metrics (identifies strays)
  • attempts to reduce cluster overlap by minimizing SSerror
  • produces more clusters, each with lower variability
  Computationally intensive, but statistically simple. On each step:
  • take every pair of clusters & tentatively combine them
  • compute the variance across cases for each variable
  • combine those univariate variances into a multivariate variance index
  • identify the pair of clusters with the smallest multivariate variance index
  • those two clusters are combined
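To make the "computationally intensive, but statistically simple" part concrete, here is a rough sketch of one Ward's step with made-up cluster contents: every candidate merge is tried, and the pair that produces the smallest increase in the pooled within-cluster sum of squares is joined. (The slide phrases the criterion in terms of a combined variance index; this sketch works directly with the pooled SS, which expresses the same idea.)

```python
import numpy as np
from itertools import combinations

def within_ss(cluster):
    """Sum of squared deviations of each case from the cluster centroid, over all variables."""
    cluster = np.asarray(cluster, dtype=float)
    return np.sum((cluster - cluster.mean(axis=0)) ** 2)

# Toy current solution: three clusters of cases measured on two variables
clusters = [np.array([[2.0, 0.0], [2.5, 0.5]]),
            np.array([[-2.0, -2.0], [-1.5, -2.5]]),
            np.array([[0.0, 3.0]])]

total_ss = sum(within_ss(c) for c in clusters)

# Tentatively combine every pair of clusters and measure the increase in pooled within-cluster SS
increases = {}
for i, j in combinations(range(len(clusters)), 2):
    merged = np.vstack([clusters[i], clusters[j]])
    increases[(i, j)] = within_ss(merged) - within_ss(clusters[i]) - within_ss(clusters[j])

best_pair = min(increases, key=increases.get)   # the merge Ward's would make on this step
print(best_pair, increases[best_pair])
```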

  13. Centroid Condensation -- joins the two clusters with the closest centroids; the profile of the joined cluster is the mean of the two
  • Compute the centroid for each cluster.
  • The distance between every pair of cluster centroids is computed.
  • The two clusters with the shortest centroid distance are joined.
  • The centroid for the new cluster is computed as the mean of the joined centroids.
  • The new centroid will be closest to the larger group, since it contributes more cases.
  • If a “stray” is added, it is unlikely to mis-position the new centroid.

  14. Median Condensation -- joins the two clusters with the closest centroids; the profile of the joined cluster is the median of the two; better than the previous method if you suspect groups of different sizes
  • Compute the centroid for each cluster.
  • The distance between every pair of cluster centroids is computed.
  • The two clusters with the shortest centroid distance are joined.
  • The centroid for the new cluster is computed as the median of the joined centroids, with the two clusters weighted equally regardless of their size.
  • If a “stray” is added, it is very likely to mis-position the new centroid.
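The practical difference between the two condensation rules is just how the merged centroid is formed. A small sketch with a hypothetical 10-case cluster joined by a 1-case "stray":

```python
import numpy as np

centroid_big = np.array([2.0, 0.0])      # centroid of a 10-case cluster
centroid_stray = np.array([-2.0, -2.0])  # a single stray case forming its own cluster
n_big, n_stray = 10, 1

# Centroid condensation: size-weighted mean; the stray barely moves the new centroid
centroid_weighted = (n_big * centroid_big + n_stray * centroid_stray) / (n_big + n_stray)

# Median condensation: equal weighting, as if both clusters had equal N;
# the stray drags the new centroid halfway toward itself
centroid_equal = (centroid_big + centroid_stray) / 2

print(centroid_weighted, centroid_equal)
```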

  15. Between-Groups Average Linkage -- joins the two clusters with the smallest average cross-cluster linkage; the profile of the joined cluster is the mean of the two
  [Figure: two clusters with the cross-cluster links drawn for one pair -- yep, there are lots of these]
  • For each pair of clusters, find all the links (case-to-case distances) across the two clusters.
  • The two clusters with the shortest average cross-cluster distance are joined; this is more complete than just comparing centroid distances.
  • The centroid for the new cluster is computed as the mean of the joined centroids.
  • The new centroid will be closest to the larger group, since it contributes more cases.
  • If a “stray” is added, it is unlikely to mis-position the new centroid.

  16. Within-Groups Average Linkage -- joins the two clusters with the smallest average within-cluster linkage after joining; the profile of the joined cluster is the mean of the two
  [Figure: two clusters with a few of the between- and within-cluster links shown -- yep, there are scads of these]
  • For each pair of clusters, find all the links among the cases of the would-be combined cluster.
  • The two clusters whose combined cluster has the shortest average within-cluster distance are joined; this is more complete than between-groups average linkage.
  • The centroid for the new cluster is computed as the mean of the joined centroids.
  • Like Ward’s, but with “smallest average distance” instead of “minimum SS.”
  • The new centroid will be closest to the larger group.
  • If a “stray” is added, it is unlikely to mis-position the new centroid.
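A small sketch of how the two average-linkage criteria differ for one candidate pair of clusters (the cluster contents are toy values):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

cluster_a = np.array([[2.0, 0.0], [2.5, 0.5], [1.5, -0.5]])
cluster_b = np.array([[-2.0, -2.0], [-1.5, -2.5]])

# Between-groups average linkage: mean of all cross-cluster case-to-case distances
between_avg = cdist(cluster_a, cluster_b).mean()

# Within-groups average linkage: mean of all pairwise distances within the would-be joined cluster
within_avg = pdist(np.vstack([cluster_a, cluster_b])).mean()

print(between_avg, within_avg)
```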

  17. Single Linkage -- joins the two clusters with the nearest neighbors; the profile of the joined cluster is computed from the case data
  • Compute the nearest-neighbor distance (the closest cross-cluster pair of cases) for each cluster pair.
  • The two clusters with the shortest nearest-neighbor distance are joined.
  • The centroid for the new cluster is computed from all cases in the new cluster.
  • Groupings are based on the position of a single pair of cases.
  • Outlying cases can lead to “undisciplined groupings.”

  18. Complete Linkage -- joins the two clusters with the nearest farthest neighbors; the profile of the joined cluster is computed from the case data
  • Compute the farthest-neighbor distance (the most distant cross-cluster pair of cases) for each cluster pair.
  • The two clusters with the shortest farthest-neighbor distance are joined.
  • The centroid for the new cluster is computed from all cases in the new cluster.
  • Groupings are based on the position of a single pair of cases, which can lead to “undisciplined groupings.”
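Single and complete linkage use the extremes of the cross-cluster distances rather than their average; a sketch with the same kind of toy clusters:

```python
import numpy as np
from scipy.spatial.distance import cdist

cluster_a = np.array([[2.0, 0.0], [2.5, 0.5], [1.5, -0.5]])
cluster_b = np.array([[-2.0, -2.0], [-1.5, -2.5]])

cross = cdist(cluster_a, cluster_b)   # all cross-cluster case-to-case distances

single_linkage_dist = cross.min()     # nearest neighbors: the distance single linkage uses
complete_linkage_dist = cross.max()   # farthest neighbors: the distance complete linkage uses

print(single_linkage_dist, complete_linkage_dist)
```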

  19. k-means Clustering – Non-hierarchical
  • Select the desired number of clusters (k) and identify the clustering variables.
  • First iteration:
    • the computer places each case into the multivariate space defined by the clustering variables
    • the computer randomly assigns cases to the k groups & computes the centroid of each group
    • the distance from each case to each group centroid is computed
    • cases are re-assigned to the group to which they are closest
  • Subsequent iterations:
    • re-compute the centroid for each group
    • for each case, re-compute the distance to each group centroid
    • cases are re-assigned to the group to which they are closest
  • Stop when cases don’t change groups or the centroids don’t change (failure to converge can happen, but not often).
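Here is a bare-bones NumPy sketch of the iteration just described; a real analysis would normally use a library routine (e.g. scikit-learn's KMeans), and this minimal version does not handle the corner case of a group emptying out.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(X))              # first iteration: random group assignment
    for _ in range(max_iter):
        # re-compute the centroid of each group (empty groups are not handled in this sketch)
        centroids = np.array([X[labels == g].mean(axis=0) for g in range(k)])
        # distance from each case to each group centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # re-assign each case to the group to which it is closest
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):          # stop when no case changes groups
            break
        labels = new_labels
    return labels, centroids

X = np.random.default_rng(1).normal(size=(50, 3))       # toy data: 50 cases x 3 variables
labels, centroids = kmeans(X, k=3)
```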

  20. Hierarchical “&” k-means Clustering
  There are two major issues in cluster analysis, leading to a third:
  1. How many clusters are there?
  2. Who belongs to each cluster?
  3. What are the clusters? That is, how do we describe them, based on a description of who is in each?
  • Different combinations of clustering metric, amalgamation procedure and linkage often lead to different answers to these questions.
  • Hierarchical & k-means clustering often lead to different answers as well.
  • The more clusters in the solutions derived by different procedures, the more likely those solutions are to disagree.
  • How different procedures handle strays and small-frequency profiles often accounts for the resulting differences.
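One simple way to quantify how much two procedures agree is to compare their label vectors, for example with the adjusted Rand index; a sketch on toy data (the data, number of clusters, and methods are placeholders):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X = np.random.default_rng(2).normal(size=(60, 4))          # toy data: 60 cases x 4 variables

hier_labels = fcluster(linkage(X, method='ward'), t=3, criterion='maxclust')
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# 1.0 = identical partitions, values near 0 = chance-level agreement
print(adjusted_rand_score(hier_labels, km_labels))
```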

  21. Using ldf when clustering
  It is common to hear that following a clustering with an ldf is “silly” -- it depends! There are two different kinds of ldfs, with different goals:
  • Predicting the groups using the same variables used to create the clusters -- an “internal ldf”
    • always “works” -- there are discriminable groups (duh!!)
    • but you will learn something about which variables separate which groups (it may be a small subset of the variables used)
    • reclassification errors tell you about “strays” & “forces”
    • gives you a “spatial model” of the clusters (concentrated vs. diffuse structure, etc.)
  • Predicting the groups using a different set of variables than those used to create the clusters -- an “external ldf”
    • asks whether knowing group membership “tells you anything”
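A sketch of the two kinds of follow-up ldf using scikit-learn's linear discriminant analysis; the split between "clustering variables" and "external variables" is hypothetical, and with purely random data the external ldf should hover near chance.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
cluster_vars = rng.normal(size=(80, 4))     # variables used to build the clusters
external_vars = rng.normal(size=(80, 3))    # variables NOT used in the clustering

groups = fcluster(linkage(cluster_vars, method='ward'), t=3, criterion='maxclust')

# "Internal" ldf: predict the clusters from the same variables used to form them.
# It will always discriminate; the interest is in which variables do the separating
# and in the reclassification errors (the strays).
internal = LinearDiscriminantAnalysis().fit(cluster_vars, groups)
print(internal.score(cluster_vars, groups))          # reclassification accuracy

# "External" ldf: predict the clusters from a different set of variables,
# asking whether knowing group membership "tells you anything" about them.
external = LinearDiscriminantAnalysis().fit(external_vars, groups)
print(external.score(external_vars, groups))
```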
