Cluster Analysis

Cluster Analysis Some of the examples on cluster analysis will come from Base SAS.

What is Cluster Analysis? • What is cluster analysis? • Cluster analysis is an exploratory data analysis tool. • Cluster analysis can classify records (people, things, events, etc) into groups, or clusters. • The classification is performed in a manner to produce a strong degree of association between members of the same cluster and weak degree of association between members of different clusters. • Cluster analysis can illuminate associations and structure in the data that was not previously evident, but are logical and useful once found.

When Do We Use Cluster Analysis? • When do we use cluster analysis? • When you want to classify people/things together into groups/clusters. Thus there are many situations in which cluster analysis can be helpful. Often it is useful in marketing. • Why marketing? • Often in marketing it is desirable to group customers with similar likes and dislikes together. • This can enable more effective marketing campaigns. • Also in marketing it is often difficult to determine how to place customers into categories.

Measures of Distance • Two Main Types of Distance Used for Cluster Analysis • Euclidean distance • Standard Euclidean distance • Standardize the data first then calculate the Euclidean distance. Typically we will use sample mean and sample standard deviation for respectively.

An Example of the Power of Cluster Analysis • We will now give an example of the power of cluster analysis with SAS. • Generate data (Complex code to generate random data, expecting 3 clusters.): • DATA body1; *the data represents the heights and weights of 300 women. • KEEP height1 weight1; • N=100; SCALE=1; *N=observations per cluster. Scale=variance; • MX=150; MY=45; LINK GENERATE; *mean of x,y of cluster 1; • MX=155; MY=50; LINK GENERATE; *mean of x,y of cluster 2; • MX=160; MY=40; LINK GENERATE; *mean of x,y of cluster 3; • STOP; • GENERATE: • DO I=1TO N; • height1=RANNOR(0)*SCALE+MX; *generates from random normal distribution; • weight1=RANNOR(0)*SCALE+MY; *generates from random normal distribution; • OUTPUT; • END; • RETURN; • RUN;

An Example of the Power of Cluster Analysis • How will we group the women looking at height and weight? • Now let us see the data. • procprintdata=body1; run;

An Example of the Power of Cluster Analysis • How will we group the women looking at height and weight? • Let us see some descriptive statistics for more information. • procunivariatedata=body1; • var height1 weight1; • run; Descriptive Statistics on Height

An Example of the Power of Cluster Analysis • How will we group the women looking at height and weight? • Let us see some descriptive statistics for more information. • procunivariatedata=body1; • var height1 weight1; • run; Descriptive Statistics on Weight

An Example of the Power of Cluster Analysis • How will we group the women looking at height and weight? • The descriptive statistics were informative but it is still very difficult to know how to group the women using their heights and weights. • Cluster Analysis: • PROCCLUSTERDATA=body1 OUTTREE=TREE METHOD=SINGLE PRINT=20; • run; Input (name=body1) and output (name=tree) datasets respectively. The SAS keywords are data and outtree. The clustering method is a single linkage method.

An Example of the Power of Cluster Analysis Prevents printing of the tree. • Cluster Analysis: • PROCTREEdata=tree NOPRINTOUT=OUT N=3; • COPY height1 weight1; • PROCGPLOT; • PLOT height1*weight1=CLUSTER; • RUN; Will copy the variables height1 and weight1 to the output dataset called out. Specifies number of desired clusters. For this example it is 3. Makes the plot distinguish between the clusters produced. The vertical (y) axis and the horizontal (x) axis respectively.

An Example of the Power of Cluster Analysis • The results of clustering using SAS.

The Example of the Power of Cluster Analysis Input (name=body1) and output (name=tree) datasets respectively. The SAS keywords are data and outtree. Cluster Analysis: • PROCCLUSTERDATA=body1 OUTTREE=TREE METHOD=SINGLE PRINT=20; run; The clustering method used in the example was single linkage method. • In SAS the options for “Method” are: • AVE or AVERAGE for average linkage method. • CEN or CENTROID for centroid method. • COM or COMPLETE for complete linkage method • DEN or DENSITY for density linkage methods using nonparametric probability density estimation. • EML use for coordinate data, much slower than the other clustering methods. • FLE or FLEXIBLE for Lance-Williams flexible-beta method. • MCQ or MCQUITTY for McQuitty’s similarity analysis. • MED or MEDIAN for Grower’s median method. • SIN or SINGLE for single linkage method. • TWO or TWOSTAGE for two-stage density linkage. • WAR or WARD for Ward’s minimum variance method. A lot of possible methods!?! The most common are Single, Average Complete, Centroid, and Ward’s.

Cluster Analysis: Two Main Procedures In SAS • Proc Cluster and it is Hierarchical. • With Proc Tree • Proc FastClus and it is Nonhierarchical.

Cluster Analysis: Hierarchical • Proc Cluster • Hierarchical: The data points are placed into clusters in a nested sequence of clustering. The most efficient type of hierarchical clustering methods are the single link clustering methods. • Types of single link clustering: • Nearest Neighbor Method • Furthest Neighbor Method • Many Others • Nearest Neighbor Method tends to create fewer clusters than the Furthest Neighbor Method. • Most other methods tend to give results somewhere in between the latter two methods mentioned. • It is good to try more than one method. • If multiple methods produce consistent/similar results then your results are more credible.

Cluster Analysis: Nonhierarchical • Proc FastClus • Nonhierarchical: Set of cluster seed points and build clusters around the seeds using dissimilarity measures and distance between data points and the cluster seed points. • Example of a dissimilarity measure is Euclidean distance • Three main disadvantages: • Must guess initially the number of clusters that exist. • Greatly influenced by the location of the cluster seed points. • Possibly not feasible computationally due to the multitude of possible choices for the number of clusters and the location for the cluster seeds as well.

Reference: • Much of the next few slides are taken from SPSS Help/Tutorial.

SPSS’s: The TwoStep Cluster Analysis • The TwoStep Cluster Analysis procedure is an exploratory tool designed to reveal natural groupings (or clusters) within a data set that would otherwise not be apparent. The algorithm employed by this procedure has several desirable features that differentiate it from traditional clustering techniques: • The ability to create clusters based on both categorical and continuous variables. • Automatic selection of the number of clusters. • The ability to analyze large data files efficiently.

SPSS’s: The TwoStep Cluster Analysis • In order to handle categorical and continuous variables, the TwoStep Cluster Analysis procedure uses a likelihood distance measure which assumes that variables in the cluster model are independent. Further, each continuous variable is assumed to have a normal (Gaussian) distribution and each categorical variable is assumed to have a multinomial distribution. Empirical internal testing indicates that the procedure is fairly robust to violations of both the assumption of independence and the distributional assumptions, but you should try to be aware of how well these assumptions are met. • The two steps of the TwoStep Cluster Analysis procedure's algorithm: • Step 1. The procedure begins with the construction of a Cluster Features (CF) Tree. The tree begins by placing the first case at the root of the tree in a leaf node that contains variable information about that case. Each successive case is then added to an existing node or forms a new node, based upon its similarity to existing nodes and using the distance measure as the similarity criterion. A node that contains multiple cases contains a summary of variable information about those cases. Thus, the CF tree provides a capsule summary of the data file. • Step 2. The leaf nodes of the CF tree are then grouped using an agglomerative clustering algorithm. The agglomerative clustering can be used to produce a range of solutions. To determine which number of clusters is "best", each of these cluster solutions is compared using Schwarz's Bayesian Criterion (BIC) or the Akaike Information Criterion (AIC) as the clustering criterion.

SPSS’s: The TwoStep Cluster Analysis • Car manufacturers need to be able to appraise the current market to determine the likely competition for their vehicles. If cars can be grouped according to available data, this task can be largely automatic using cluster analysis. • Information for various makes and models of motor vehicles is contained in car_sales.sav. Use the TwoStep Cluster Analysis procedure to group automobiles according to their prices and physical properties.

SPSS’s: The TwoStep Cluster Analysis

SPSS’s: The TwoStep Cluster Analysis • Select Vehicle type as a categorical variable. • Select Price in thousands through Fuel efficiency as continuous variables. • Click Plots.

SPSS’s: The TwoStep Cluster Analysis • Select Rank of variable importance. • Select By variable in the Rank Variables group. • Select Confidence level. • 4.Click Continue • 5.Then click Output in the TwoStep Cluster Analysis dialog box.

SPSS’s: The TwoStep Cluster Analysis • Select Information criterion (AIC or BIC) in the Statistics group. • Click Continue. • Finally, click OK in the TwoStep Cluster Analysis dialog box.

SPSS’s: The TwoStep Cluster Analysis • The Auto-clustering table summarizes the process by which the number of clusters is chosen. • The clustering criterion (in this case the BIC) is computed for each potential number of clusters. Smaller values of the BIC indicate better models, and in this situation, the "best" cluster solution has the smallest BIC.

SPSS’s: The TwoStep Cluster Analysis However, there are clustering problems in which the BIC will continue to decrease as the number of clusters increases, but the improvement in the cluster solution, as measured by the BIC Change, is not worth the increased complexity of the cluster model, as measured by the number of clusters.

SPSS’s: The TwoStep Cluster Analysis In such situations, the changes in BIC and changes in the distance measure are evaluated to determine the "best" cluster solution. A good solution will have a reasonably large Ratio of BIC Changes and a large Ratio of Distance Measures.

SPSS’s: The TwoStep Cluster Analysis The cluster distribution table shows the frequency of each cluster. Of the 157 total cases, 5 were excluded from the analysis due to missing values on one or more of the variables. Of the 152 cases assigned to clusters, 62 were assigned to the first cluster, 39 to the second, and 51 to the third.

SPSS’s: The TwoStep Cluster Analysis The centroids show that the clusters are well separated by the continuous variables.

SPSS’s: The TwoStep Cluster Analysis Motor vehicles in cluster 1 are cheap, small, and fuel efficient.

SPSS’s: The TwoStep Cluster Analysis Motor vehicles in cluster 2 are moderately priced, heavy, and have a large gas tank, presumably to compensate for their poor fuel efficiency.

SPSS’s: The TwoStep Cluster Analysis Motor vehicles in cluster 3 are expensive, large, and are moderately fuel efficient.

SPSS’s: The TwoStep Cluster Analysis The cluster frequency table by Vehicle type further clarifies the properties of the clusters.

Go To SPSS Output for the Rest • Note for myself, so as not to forget.

SPSS’s: The TwoStep Cluster Analysis • Using the TwoStep Cluster Analysis procedure, you have separated the motor vehicles into three fairly broad categories. In order to obtain finer separations within these groups, you should collect information on other attributes of the vehicles. For example, you could note the crash test performance or the options available. • The TwoStep Cluster Analysis procedure is useful for finding natural groupings of cases or variables. It works well with categorical and continuous variables, and can analyze very large data files. • If you have a small number of cases, and want to choose between several methods for cluster formation, variable transformation, and measuring the dissimilarity between clusters, try the Hierarchical Cluster Analysis procedure. The Hierarchical Cluster Analysis procedure also allows you to cluster variables instead of cases. • The K-Means Cluster Analysis procedure is limited to scale variables, but can be used to analyze large data and allows you to save the distances from cluster centers for each object.

Cluster Analysis