Cluster Analysis

Cluster Analysis Grouping Cases or Variables

Clustering Cases • Goal is to cluster cases into groups based on shared characteristics. • Start out with each case being a one-case cluster. • The clusters are located in v-dimensional space, where v is the number of variables. • Compute the squared Euclidean distance between each case and each other case.

Squared Euclidean Distance • The sum across variables (from i = 1 to v) of the squared difference between the score on variable i for the one case (Xi) and the score on variable i for the other case (Yi); that is, the sum of (Xi − Yi)² for i = 1 to v.
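The definition above can be written as a short function. This is a minimal Python sketch; the function name and the toy scores are illustrative assumptions, not values from the faculty data.

```python
def squared_euclidean(x, y):
    """Sum over variables of the squared difference between two cases' scores."""
    if len(x) != len(y):
        raise ValueError("both cases need a score on every variable")
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


# Two hypothetical cases measured on v = 3 variables.
case_a = [1.0, 2.0, 3.0]
case_b = [2.0, 4.0, 6.0]
print(squared_euclidean(case_a, case_b))  # 1^2 + 2^2 + 3^2 = 14.0
```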

Agglomerate • The two cases closest to each other are agglomerated into a cluster. • The distances between entities (clusters and cases) are recomputed. • The two entities closest to each other are agglomerated. • This continues until all cases end up in one cluster.
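The agglomeration steps above can be sketched in plain Python. The slides do not say how the distance between two multi-case clusters is defined, so this sketch assumes average linkage (the mean squared Euclidean distance over all cross-cluster pairs of cases). The function names and the toy data are hypothetical, and cases are numbered from 0 rather than from 1.

```python
from itertools import combinations


def squared_euclidean(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def average_linkage(cluster_a, cluster_b, data):
    """Mean squared Euclidean distance over all cross-cluster pairs (an assumed linkage rule)."""
    total = sum(squared_euclidean(data[i], data[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(data, n_clusters=1):
    """Merge the two closest entities repeatedly; return (final clusters, merge schedule)."""
    clusters = [[i] for i in range(len(data))]  # every case starts as a one-case cluster
    schedule = []
    while len(clusters) > n_clusters:
        # Recompute distances between all current entities and pick the closest pair.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]], data),
        )
        distance = average_linkage(clusters[a], clusters[b], data)
        schedule.append((clusters[a], clusters[b], distance))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, schedule


# Hypothetical toy data: four cases on two variables.
toy = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
final, steps = agglomerate(toy, n_clusters=2)
print(final)
for stage, (left, right, d) in enumerate(steps, start=1):
    print(stage, left, right, d)
```

Calling `agglomerate(toy)` with the default `n_clusters=1` runs the process to the end, where all cases sit in one cluster. The returned schedule plays the role of an agglomeration schedule: one row per stage, naming the two entities merged and their distance.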

What is the Correct Solution? • You may have theoretical reasons to expect a solution with a certain number of clusters. • Look at that solution and see if it matches your expectations. • Alternatively, you may try to make sense out of solutions at two or more levels of the analysis.

Faculty Salaries • Subjects were faculty in Psychology at ECU. • Variables were rank, experience, number of publications, course load, and salary. • Data are in ClusterAnonFaculty.sav. • Also see the statistical output.

Analyze, Classify, Hierarchical Cluster

Statistics

Plots

Method

Save

Proximity Matrix • We did not request this, but if we had, it would display a measure of dissimilarity for each pair of entities. • The pair of cases with the smallest squared Euclidean distance is clustered first.

Look at the Agglomeration Schedule. In Stage 1, cases 32 and 33 are clustered. They are very similar (distance = 0.000).

Steps 2 Through 5

Stages 2-5 • The agglomeration schedule shows that in Stage 2 cases 41 and 42 are clustered. • In Stage 3 cases 43 and 44 are clustered. • In Stage 4 cases 37 and 38 are clustered. • In Stage 5 case 39 is added to the cluster that contains cases 37 and 38. • And so on.

Vertical Icicle, Two Clusters • Look at the top of the display (next slide). • You can see two clusters. • On the left, Boris through Willy. • On the right, Deanna through Sunila. • The two-cluster solution was adjuncts versus full-time faculty.

Vertical Icicle, Three Clusters • Look at the second-highest white bar in the icicle plot. • Now there are three clusters: • Adjuncts • Junior faculty (Deanna through Mickey) • Senior faculty (Lawrence through Roslyn)

Vertical Icicle, Four Clusters • Look at the white bar furthest to the right. • Now there are four clusters • Adjuncts • Junior faculty • The acting chair (Lawrence) • The rest of the senior faculty (Catalina through Roslyn)

The Dendrogram • At the far right you can see the two-cluster solution. • The next step to the left shows the three-cluster solution. • The next step to the left shows the four-cluster solution. • And so on. • Truncated and rotated dendrogram on next slide.

Compare Two Clusters • The 2 cluster solution was adjuncts versus everybody else. • Look at the t tests in the output • Adjuncts had lower rank, experience, number of publications, course load, and salary.
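Comparing two clusters on one variable comes down to an independent-samples t test. The sketch below assumes the pooled-variance (equal variances) form; the slides do not say which form the output reports. The function name and the scores are hypothetical, not taken from the faculty data.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Student's t for two independent groups, assuming equal variances. Returns (t, df)."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Hypothetical scores on one variable, split by cluster membership.
cluster_1 = [1.0, 2.0, 3.0]
cluster_2 = [4.0, 5.0, 6.0]
t, df = pooled_t(cluster_1, cluster_2)
print(round(t, 3), df)
```

A negative t here means the first cluster has the lower mean, which is the direction reported for adjuncts on every variable.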

Compare Three Clusters • Look at the ANOVAs and plots. • The senior faculty had higher salary, experience, rank, and number of pubs.

Compare Four Clusters • The acting chair had a higher salary and number of publications.

I Could Not Help Myself • With these data on hand, I could not resist predicting salary from the other variables. • Salary was well correlated with Rank, FTEs, Publications, and Experience. • In the multiple regression, only Rank and FTEs had significant unique effects. • The residuals suggest who was being overpaid and who underpaid.

Split by Sex • For men, the unique effect of number of publications was positive – more publications, higher salary. • For women it was negative – more publications, lower salary. • Curious.

Workaholism • Aziz & Zickar (2005) • Workaholics may be defined as those • High in work involvement, • High in drive to work, and • Low in work enjoyment. • For each case, a score was obtained for each of these three dimensions.

The Three Cluster Solution • Workaholics • High work involvement • High drive to work • Low work enjoyment • Positively engaged workers • High work involvement • Medium drive to work • High work enjoyment

Unengaged workers • Low work involvement • Low drive to work • Low work enjoyment • Past research/theory indicated there should be six clusters, but the theorized six clusters were not obtained.

Clustering Variables • FactBeer.sav • The statistical output. • Analyze, Classify, Hierarchical Cluster

Statistics

Plots

Method

Proximity Matrix • Is simply the intercorrelation matrix • The two most correlated variables are Color and Aroma (r = .909) – they are clustered on the first step. • Stage 2: Size and Alcohol (r = .904) are clustered. • Stage 3: Taste added to the cluster that already contains Color and Aroma
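When variables rather than cases are clustered, the correlation takes the place of the distance, and the first stage merges the most highly correlated pair. A minimal sketch of that first stage follows. It assumes the signed Pearson r is used as the similarity (not its absolute value), and the ratings are hypothetical, not the FactBeer.sav data.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def most_similar_pair(variables):
    """Return the pair of variable names with the highest correlation, plus that r."""
    pair = max(
        combinations(variables, 2),
        key=lambda p: pearson_r(variables[p[0]], variables[p[1]]),
    )
    return pair, pearson_r(variables[pair[0]], variables[pair[1]])


# Hypothetical ratings from five respondents on three variables.
ratings = {
    "color": [1.0, 2.0, 3.0, 4.0, 5.0],
    "aroma": [1.0, 2.0, 3.0, 5.0, 4.0],
    "cost": [5.0, 1.0, 4.0, 2.0, 3.0],
}
pair, r = most_similar_pair(ratings)
print(pair, round(r, 3))
```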

Also See Other Tables & Plots • Stage 4: Cost added to the cluster that already contains Size and Alcohol. • Stage 5: The two clusters are combined • But they are not very similar (similarity coefficient = .038) • Now we have one cluster with six variables and one with one (Reputation)
