SAS Homework 4 Review Clustering and Segmentation

Presentation Transcript


  1. SAS Homework 4 Review Clustering and Segmentation. MIS2502 Data Analytics

  2. SAS Homework 4 Review Clustering and Segmentation
     • Using AAEM.DUNGAREE Data Set
     • Explore data set: SALESTOT and STOREID
     • Assign ID role to STOREID
     • SALESTOT Role – Rejected
     • Add a Cluster node (Explore)
     • In Properties select Internal Standardization => Standardize
     • Run and Evaluate
     • Change Properties Segment Max to 6
     • Run and Evaluate
     • Add a Segment Profile node (Assess)
     • Run and Evaluate

  3. Set Up • Retail – looking for patterns in sales of types of jeans by store

  4. Data Source - Edit Variables

  5. Data Source – Explore: note scale

  6. Add Cluster Node, Standardize

  7. Segments, Automatic: note root mean square std deviation

  8. Change Number of Clusters to 6

  9. Segments, Max 6: note root mean square std deviation

  10. Segment Profile Node

  11. Segment Profiles: red outline is the overall distribution

  12. Questions
      • How do the SALESTOT and STOREID distributions differ from the other variables' distributions (look at the histograms of each one)?
      • Assign STOREID a model role of ID and SALESTOT a model role of Rejected. Make sure that the remaining variables have the Input model role and the Interval measurement level. Based on the variable descriptions on page 1 and your answer to part […], why do you think that the variable SALESTOT should be rejected?
      • Add a Cluster node to the diagram workspace and connect it to the Input Data node.
      • Select the Cluster node and select Internal Standardization => Standardization. Why is it important to standardize your inputs? (Hint: look at the range of the scales on the X axis of the histograms.)
      • Run the diagram from the Cluster node and examine the results. How many clusters are created? What might be a problem with having so many clusters?
      • What is the highest root mean squared standard deviation among the clusters? Two hints: Look at the Mean Statistics window. The root mean squared standard deviation means basically the same thing as the sum of squares error: both measure how spread out a cluster's members are around its center.
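The hint about root mean squared standard deviation can be made concrete with a minimal sketch. It assumes the statistic is the square root of a cluster's sum of squared errors divided by p(n - 1), where p is the number of inputs and n the number of stores in the cluster. That formula, the function name cluster_rms_std, and the sample points are assumptions for illustration, not values from the SAS output.

```python
from math import sqrt


def cluster_rms_std(points):
    """Root mean squared standard deviation of one cluster.

    points: list of equal-length tuples, one tuple per store.
    Assumption: sample variance (n - 1 denominator) for each input.
    """
    n = len(points)
    p = len(points[0])
    # Cluster center: the mean of each input across the cluster's stores.
    centroid = [sum(pt[j] for pt in points) / n for j in range(p)]
    # Sum of squared errors: squared distance of every store from the center.
    sse = sum((pt[j] - centroid[j]) ** 2 for pt in points for j in range(p))
    return sqrt(sse / (p * (n - 1)))


# Hypothetical standardized inputs for two clusters of four stores each.
tight = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2)]
loose = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]

rms_tight = cluster_rms_std(tight)
rms_loose = cluster_rms_std(loose)
print(rms_tight < rms_loose)  # the tighter cluster has the smaller value
```

Because the statistic is a rescaled square root of the sum of squared errors, a lower value indicates a more cohesive cluster.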

  13. Distribution of STOREID

  14. Distribution of SALESTOT • It does tell you that there are a handful of stores selling well below average • These 2 variables aren't useful for the product mix analysis.

  15. Why Standardize? • Note the difference in the range of numbers on the x axis
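As a minimal sketch of what standardizing does, the example below converts two inputs on very different scales to z-scores (subtract the mean, divide by the standard deviation). The variable names and store figures are hypothetical, and z-scoring is assumed here as the standardization method; the Cluster node's exact computation may differ.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]


# Hypothetical per-store unit sales on very different scales.
original_jeans = [1200.0, 1500.0, 1800.0, 2100.0]
fashion_jeans = [12.0, 15.0, 18.0, 21.0]

z_original = standardize(original_jeans)
z_fashion = standardize(fashion_jeans)

# After standardizing, both inputs are on the same scale, so neither
# dominates the distance calculation used to form clusters.
print(z_original)
print(z_fashion)
```

Without this step, the input with the larger numeric range would account for almost all of the distance between stores, and the clusters would mostly reflect that one input.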

  16. Segment Profile Node

  17. Reading a Histogram
      1) The red bars are the distribution of Original Jeans sales over all segments. By comparing the specific segment distribution (blue) to the overall distribution (red), you can make some observations about what makes this segment different with regard to Original Jeans sold.
      2) Note that you have 8 ranges of standardized sales volumes on the x axis for the overall distribution (red). These are ordered from lowest (on the left) to highest (on the right). We established this earlier when looking at the individual segments.
      3) Note that for ranges 3, 4 and 5, the overall distribution (red) shows that roughly 65% of stores sell in these volume ranges (11%, 23% and 31% respectively). You get this by reading the Y axis.
      4) Now look at the specific segment distribution (blue). For this segment, approximately 86% of the stores sell within volume ranges 3 and 4. Look at the distribution in total, and then at the individual bars. For this distribution you would say that the stores in this segment sell fewer Original Jeans than average, and in a narrower range, with less variability (not part of the question). You can say this because the segment distribution is to the left of, and 'tighter' than, the overall distribution.
      5) Conclusion: Overall, this segment has more stores selling Original Jeans in lower volume ranges than the overall average. Therefore, for this segment we can say that the stores sell fewer Original Jeans than average.
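The arithmetic behind this reading can be checked with a short sketch. It uses only the percentages quoted on the slide; the variable names are illustrative.

```python
# Overall distribution (red): percent of all stores in volume ranges 3, 4 and 5,
# as read from the Y axis on the slide.
overall_pct = {3: 11, 4: 23, 5: 31}

# Segment distribution (blue): approximate percent of this segment's stores
# that fall in ranges 3 and 4 combined, as stated on the slide.
segment_pct_3_4 = 86

overall_3_to_5 = sum(overall_pct[r] for r in (3, 4, 5))
overall_3_4 = overall_pct[3] + overall_pct[4]

print(overall_3_to_5)   # 65
print(overall_3_4)      # 34
print(segment_pct_3_4 > overall_3_4)  # True
```

Comparing like with like, about 86% of this segment's stores fall in ranges 3 and 4, against 34% of stores overall, which is why the segment's distribution sits to the left of the overall one.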

  18. Original Segment Profiles: red outline is the overall distribution

  19. In Class
      Answer the questions about this output:
      1. How many distinct customer groups (segments) are there?
      2. Explain how the customers in cluster 1 are different from those in cluster 2.
      3. What aspect of the customer data most differentiates cluster 1 from cluster 3?
      4. Which cluster has the highest cohesion? In practical terms, what does that mean?

  20. In Class – Evaluating Clustering Output
      5. Is the root mean squared standard deviation of these clusters higher or lower than it was in the three cluster scenario? Why?
      6. Is the distance to the nearest cluster higher or lower than in the three cluster scenario? Why?
      7. Which scenario (#1 or #2) has higher cohesion among its clusters?
      8. Which scenario (#1 or #2) has higher separation between its clusters?
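Separation can be illustrated with a minimal sketch that computes, for each cluster, the Euclidean distance from its center to the nearest other center. The centroid coordinates and the function name nearest_cluster_distance are hypothetical; they are not taken from the class output, and real results depend on the data.

```python
from math import dist


def nearest_cluster_distance(centroids):
    """For each centroid, the distance to the closest other centroid."""
    return [
        min(dist(c, other) for j, other in enumerate(centroids) if j != i)
        for i, c in enumerate(centroids)
    ]


# Hypothetical cluster centers on two standardized inputs.
three_clusters = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
# Same space split into six clusters: three extra centers sit between the others.
six_clusters = three_clusters + [(2.0, 0.0), (0.0, 1.5), (2.0, 1.5)]

nearest_three = nearest_cluster_distance(three_clusters)
nearest_six = nearest_cluster_distance(six_clusters)

print(nearest_three)
print(nearest_six)
```

In this made-up layout every nearest-cluster distance is smaller with six centers than with three. A larger distance to the nearest cluster indicates higher separation, while a lower root mean squared standard deviation indicates higher cohesion.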
