
Cluster Analysis – 2 Approaches


Presentation Transcript


  1. Cluster Analysis – 2 Approaches
     • K-Means (traditional)
     • Latent Class Analysis (new)
     by Jay Magidson, Statistical Innovations; based in part on a presentation by Wagner Kamakura at Statistical Modeling Week 2004

  2. K-Means Clustering
     A partitioning algorithm: it partitions a data set into a pre-determined number of groups (clusters) that are homogeneous in terms of selected continuous variables Y1, Y2, …
     How it works:
     • The user chooses the number of clusters K and selects the variables that define the clusters
     • Step 1: The algorithm randomly positions each of the K clusters at a point in the variable space
     • Step 2: Each case is assigned to the nearest of the K clusters using Euclidean distance
     • Step 3: Within-cluster means are computed and each cluster is re-positioned at its centroid
     • Steps 2 and 3 are iterated until convergence
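     A minimal sketch of these steps in NumPy (illustrative only: the function name k_means and the generated data are assumptions, and in practice a library routine such as sklearn.cluster.KMeans would normally be used):

     import numpy as np

     def k_means(Y, K, n_iter=100, seed=0):
         rng = np.random.default_rng(seed)
         # Step 1: randomly position the K cluster centers (here, at randomly chosen cases).
         centers = Y[rng.choice(len(Y), size=K, replace=False)]
         for _ in range(n_iter):
             # Step 2: assign each case to the nearest center by Euclidean distance.
             dist = np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=2)
             labels = dist.argmin(axis=1)
             # Step 3: re-position each center at the mean (centroid) of its assigned cases.
             new_centers = np.array([Y[labels == k].mean(axis=0) for k in range(K)])
             if np.allclose(new_centers, centers):  # convergence: centroids stopped moving
                 break
             centers = new_centers
         return labels, centers

     # Example: 200 cases measured on two continuous variables Y1 and Y2.
     rng = np.random.default_rng(1)
     Y = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
     labels, centers = k_means(Y, K=2)
     print("cluster sizes:", np.bincount(labels))
     print("centroids:", centers.round(2))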

  3. Illustration of K-Means Clustering – Step 1
     [Figure: scatterplot in the Y1–Y2 plane showing three randomly positioned cluster centers, labeled Cluster 1, Cluster 2 and Cluster 3]

  4. Step 2: Cases Assigned to Clusters
     [Figure: each case in the Y1–Y2 plane assigned to the nearest cluster center]

  5. Step 3: Clusters are Repositioned
     [Figure: cluster centers in the Y1–Y2 plane moved to the centroids of their assigned cases]

  6. Steps 2 and 3 are Repeated
     [Figure: cases re-assigned and cluster centers re-positioned in the Y1–Y2 plane]

  7. Cases are Assigned to Repositioned Clusters, and Algorithm Continues
     [Figure: assignment of cases to the repositioned cluster centers in the Y1–Y2 plane]

  8. Problems with the K-Means Approach
     [Figure: Gender (Female/Male) plotted against Marital status (Single/Married/Divorced), with the distance between categories marked "?"]
     • Needs a metric for similarity or distance between pairs of respondents
     • No statistical criterion for choosing the number of clusters
     • The solution is not unique – it depends on the random start
     • Use of Euclidean distance implies that, within each cluster, the variance of Y1 equals the variance of Y2
     • Can't classify respondents with missing data
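     A short sketch (assuming scikit-learn; the generated data and variable names are illustrative) of two of these problems: the lack of a statistical criterion for K and the inability to handle missing values.

     import numpy as np
     from sklearn.cluster import KMeans

     rng = np.random.default_rng(0)
     Y = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(5.0, 1.0, (100, 2))])

     # No statistical test for the number of clusters: we can only watch the
     # within-cluster sum of squares (inertia) drop as K grows and look for an "elbow".
     for k in (1, 2, 3, 4):
         inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Y).inertia_
         print(f"K={k}: within-cluster sum of squares = {inertia:.1f}")

     # Missing data: a single missing value makes the fit fail outright.
     Y_missing = Y.copy()
     Y_missing[0, 1] = np.nan
     try:
         KMeans(n_clusters=2, n_init=10, random_state=0).fit(Y_missing)
     except ValueError as err:
         print("K-Means rejects missing data:", err)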

  9. Latent Class Cluster Models
     Differences from the traditional clustering approach:
     • Based on similarity in response patterns, rather than distance between respondents
     • Maximum likelihood and posterior mode parameter estimation use the EM algorithm, which corresponds to a probabilistic extension of the K-Means algorithm
     • Can be applied to variables of different scale types (discrete or continuous)
     • Statistical tests are available to compare different models
     • Random sets of starting values are used to avoid local solutions
     • Missing data is not a major problem
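     A minimal sketch of the probabilistic view (assuming scikit-learn; GaussianMixture is used only as an illustration, not the latent class software used by the authors): with continuous indicators, a latent class cluster model reduces to a finite Gaussian mixture fitted by EM, and an information criterion such as BIC can be used to compare models with different numbers of classes.

     import numpy as np
     from sklearn.mixture import GaussianMixture

     rng = np.random.default_rng(1)
     Y = np.vstack([rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 2.0]], 150),
                    rng.multivariate_normal([4, 3], [[2.0, -0.5], [-0.5, 1.0]], 150)])

     for k in (1, 2, 3, 4):
         # n_init > 1 refits from several random starting values to avoid local solutions;
         # the default covariance_type='full' lets each class have its own variances.
         gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(Y)
         print(f"{k} classes: BIC = {gm.bic(Y):.1f}")  # lower BIC = preferred model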

  10. Latent Class Cluster Analysis
     • Instead of using distances to classify cases into segments, it uses probabilities
     • Can handle nominal, ordinal and continuous variables, in any combination
     • Is not as sensitive to missing data as traditional cluster analysis techniques
     • It is easier to classify a respondent into a segment when some of the data are not available
     Reference: Magidson, J., and Vermunt, J. K. (2002), "Latent class models for clustering: A comparison with K-means", Canadian Journal of Marketing Research, Vol. 20, No. 1, pp. 36–43.
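     A short sketch of probability-based classification (again using a Gaussian mixture as a stand-in for a latent class model with continuous indicators; the respondent values are made up): the class posterior for a respondent with a missing value is obtained from the marginal density of the observed variable alone.

     import numpy as np
     from scipy.stats import norm
     from sklearn.mixture import GaussianMixture

     rng = np.random.default_rng(2)
     Y = np.vstack([rng.normal([0, 0], 1.0, (150, 2)), rng.normal([4, 3], 1.0, (150, 2))])
     gm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(Y)

     # Fully observed respondent: soft (posterior) class membership, not a hard label.
     print("P(class | Y1=3.5, Y2=2.5) =", gm.predict_proba([[3.5, 2.5]])[0].round(3))

     # Y2 missing: classify from the marginal distribution of Y1 within each class.
     y1 = 3.5
     dens = np.array([gm.weights_[k] * norm.pdf(y1, loc=gm.means_[k, 0],
                                                scale=np.sqrt(gm.covariances_[k, 0, 0]))
                      for k in range(2)])
     print("P(class | Y1=3.5, Y2 missing) =", (dens / dens.sum()).round(3))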
