Session 9
Outline
• Two Multivariate Methods
• Cluster Analysis
• Excel
• Minitab
• Discriminant Analysis
• Excel
• Minitab
• Steam case
• Cars
Applied Regression -- Prof. Juran
Cluster Analysis
• Concerned with grouping a large number of observations into reasonable sub-groups (clusters) on the basis of their similarities on multiple dimensions
• Similar to regression in terms of its basic method: finding a solution that minimizes a total sum of squared errors
• Not concerned with explaining variability or forecasting
• No dependent variable
Example: MBA Programs
Cluster Analysis Questions
• Given a certain number of clusters, which schools are grouped together?
• How is the set of clusters affected if we change the number of clusters?
• For each cluster, which school is the most “typical”?
• How different are the clusters from each other?
• What is the best number of clusters?
Basic Method in Excel
• We will assume that all of these attributes deserve equal weighting in our analysis.
• We will
  • name a school as the “typical” school in each cluster (called the centroid of the cluster),
  • assign each non-centroid school to the cluster whose centroid it is most similar to, and
  • optimize the identities of the centroids and the cluster assignments so as to minimize the total squared Euclidean distance between each school and its cluster centroid.
• We define “most similar” to mean the smallest sum of squared differences across all attributes between a cluster member and the centroid of its cluster.
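Written as a formula, the objective described above can be sketched as follows. The symbols are assumptions introduced here for illustration and do not appear in the original slides: $z_{ij}$ is the value of attribute $j$ for school $i$, $n$ is the number of schools, $p$ is the number of attributes, $k$ is the number of clusters, and $c_1, \dots, c_k$ are the indices of the schools chosen as centroids.

```latex
\min_{c_1, \dots, c_k \in \{1, \dots, n\}} \; \sum_{i=1}^{n} \; \min_{m = 1, \dots, k} \; \sum_{j=1}^{p} \left( z_{ij} - z_{c_m j} \right)^2
```

The innermost sum is the squared distance between school $i$ and centroid $c_m$, the inner minimum assigns each school to its most similar centroid, and the outer sum is the total being minimized.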
Nonlinear Problems
Some nonlinear problems can be formulated in a linear fashion (e.g. some network problems). Other nonlinear functions can be solved with our basic methods (e.g. smooth, continuous functions that are concave or convex, such as portfolio variances). However, there are many types of nonlinear problems that pose significant difficulties.
Nonlinear Problems
• The linear solution to a nonlinear (say, integer) problem may be infeasible.
• The linear solution may be far away from the actual optimal solution.
• Some functions have many local minima (or maxima), and Solver is not guaranteed to find the global minimum (or maximum).
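The local-minimum issue can be illustrated with a small sketch. The function below is a hypothetical example chosen for this illustration (it is not from the slides), and plain gradient descent stands in for a generic local search; it is not a model of how Solver works internally. The point is only that the answer depends on where the search starts.

```python
def f(x):
    """A hypothetical objective with two local minima (not from the slides)."""
    return x**4 - 3 * x**2 + x


def f_prime(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x, step=0.01, iterations=2000):
    """Follow the downhill slope from a starting point x."""
    for _ in range(iterations):
        x -= step * f_prime(x)
    return x


left = gradient_descent(-2.0)   # settles in the valley to the left of zero
right = gradient_descent(2.0)   # settles in the valley to the right of zero

print(round(left, 2), round(f(left), 2))
print(round(right, 2), round(f(right), 2))
```

Both runs stop at a point where the slope is zero, but only the left-hand one is the global minimum; a search started on the right never discovers the better valley.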
3 Solvers
• Simplex LP Solver
• GRG Nonlinear Solver
• Evolutionary Solver
Solution Methodology
The standard simplex algorithm (Solver’s default method) won’t work on this problem. The GRG Nonlinear algorithm will make an honest effort, but is likely to give up without finding the optimal solution. This can result from the use of MAX, IF, and SUMIF functions, which introduce discontinuities in our objective function and constraints as functions of the decision variables. It can also result from using numerical decision variables that are in fact simply names (as in this example, where the names of the clusters happen to be numbers). The Evolutionary Solver, a genetic algorithm, can do a good job with a problem like this, but is not guaranteed to find the optimal solution.
Solution Methodology
The Evolutionary Solver operates in a completely different way from the other types. Instead of searching in a structured way guaranteed to reach the optimal solution, genetic algorithms operate somewhat like biological evolutionary processes, with some degree of randomness in the steps taken from one solution to the next. In a finite period of time, the Evolutionary Solver is not guaranteed to find the optimal solution, but it will find very good solutions and try to improve upon them.
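To make the idea of a randomized, evolution-style search concrete, here is a minimal sketch. It is a simple mutation-and-selection loop, a simplified relative of a genetic algorithm, and is not the Evolutionary Solver's actual algorithm. The function names, the tiny `points` data set, and the parameters `pop_size`, `generations`, and `seed` are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Sum of squared differences across all attributes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def total_distance(data, centroid_ids):
    """Each row contributes its squared distance to the nearest centroid row."""
    return sum(
        min(squared_distance(row, data[c]) for c in centroid_ids)
        for row in data
    )


def mutate(ids, n_rows, rng):
    """Copy a solution and replace one centroid index at random."""
    child = list(ids)
    child[rng.randrange(len(child))] = rng.randrange(n_rows)
    return child


def evolve(data, k, pop_size=20, generations=50, seed=0):
    """Mutation-and-selection search over k centroid indices (0-based)."""
    rng = random.Random(seed)
    start = [0] * k  # every centroid starts as the first row
    population = [start] + [
        mutate(start, len(data), rng) for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        # Keep the better half unchanged, refill with mutated copies of it.
        population.sort(key=lambda ids: total_distance(data, ids))
        survivors = population[: pop_size // 2]
        children = [
            mutate(rng.choice(survivors), len(data), rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    best = min(population, key=lambda ids: total_distance(data, ids))
    return best, total_distance(data, best)


# Hypothetical two-attribute data, not the MBA data set
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
best_ids, best_total = evolve(points, k=2)
print(best_ids, best_total)
```

Because the better half of each generation is carried forward unchanged, the best solution found never gets worse from one generation to the next, but nothing guarantees that it is the global optimum.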
Standardization
In cluster analysis it is common to standardize the attribute data, so that variables with large units (such as cost, salary, and student body size) do not dominate the sum of squares over attributes with small units (such as % female, % admitted, and % with a job at graduation). So we transform each attribute for each school into a z-value: the school's value minus the attribute's mean, divided by the attribute's standard deviation.
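A minimal sketch of the z-value transformation follows. The numbers are hypothetical and are not taken from the MBA data set, and the use of the sample standard deviation is an assumption, since the slides do not say which version the spreadsheet uses.

```python
from statistics import mean, stdev


def standardize(column):
    """Return z-values for one attribute column.

    Assumption: the sample standard deviation is used.
    """
    m = mean(column)
    s = stdev(column)
    return [(x - m) / s for x in column]


# Hypothetical values for three schools, for illustration only
salaries = [90000.0, 100000.0, 110000.0]   # large units
pct_female = [30.0, 35.0, 40.0]            # small units

z_salary = standardize(salaries)
z_female = standardize(pct_female)
print(z_salary)
print(z_female)
```

After standardization the two attributes are on the same scale, so neither dominates a sum of squared differences simply because of its units.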
Optimization Procedure
We set up the model in a large spreadsheet, as shown here. The upper section contains the standardized data, the middle section contains information about the 10 centroids, and the lower section evaluates the distances between each school and each of the centroids, and assigns schools to clusters on the basis of minimum distance.
Decision Variables
We begin by setting up cells C34:C43, where Solver can identify which schools are centroids. In this initial solution, all centroids have a value of 1 (the index for Stanford), and the corresponding standardized data for Stanford appear in D34:P43. These indices will be manipulated by Solver to find the best ten centroids.
In the lower section of the worksheet, we calculate the total squared distance from each school to each centroid, and pick the minimum. Cell B45 (the objective function in this problem) is the sum of M49:M75.
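The worksheet logic just described can be sketched in a few lines. This is an illustration under stated assumptions, not the spreadsheet itself: the function names and the small `rows` data set are hypothetical, and indices are 0-based here, whereas the spreadsheet uses 1 for Stanford.

```python
def squared_distance(a, b):
    """Sum of squared differences across all attributes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_and_score(data, centroid_ids):
    """Assign each row to its nearest centroid and total the distances.

    data         -- list of standardized attribute rows, one per school
    centroid_ids -- row indices of the schools acting as centroids
    Returns (assignments, total): the position in centroid_ids chosen
    for each row, and the sum of the minimum squared distances.
    """
    assignments = []
    total = 0.0
    for row in data:
        dists = [squared_distance(row, data[c]) for c in centroid_ids]
        best = min(range(len(dists)), key=dists.__getitem__)
        assignments.append(best)
        total += dists[best]
    return assignments, total


# Hypothetical two-attribute data, not the MBA data set
rows = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]

# Initial solution: every centroid is the first row
start_assign, start_total = assign_and_score(rows, [0, 0])

# A better choice of centroids: rows 0 and 2
good_assign, good_total = assign_and_score(rows, [0, 2])

print(start_assign, start_total)
print(good_assign, good_total)
```

The first call mirrors the initial worksheet state, in which all centroids point at the same school; the second shows how the total drops once the centroid indices are changed, which is what the optimizer is asked to do.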
Cluster Analysis Questions
• Given a certain number of clusters, which schools are grouped together?
• How is the set of clusters affected if we change the number of clusters?
• For each cluster, which school is the most “typical”?
• How different are the clusters from each other?
• What is the best number of clusters?
Given a certain number of clusters, which schools are grouped together?
• Columbia and NYU are always in the same cluster, as are Harvard and Penn, and Indiana and Michigan State.
• Michigan-Cornell-Yale-Dartmouth-Chicago-Duke.
• Texas-Emory-Georgetown-Minnesota.
• What happens with UCLA-Berkeley?
How is the set of clusters affected if we change the number of clusters?
• Notice the behavior of Northwestern as we reduce the number of clusters.
• Stanford seems to be very different from all other schools; it is the last school to have its own cluster.
For each cluster, which school is the most “typical”?
• The centroid represents the school most typical in each cluster.
• We observe that Michigan is almost always the centroid of a large cluster.
How different are the clusters from each other?
• This is difficult to assess with this method; Minitab will provide more useful output.
What is the best number of clusters?
Correlation issues?