Object Oriented Data Analysis, Last Time

This presentation recaps Q-Q plots and SigClust, which assesses cluster significance using the 2-means cluster index, and then takes a closer look at K-means clustering, including the impact of different starting values on the clustering results.

holderm

Presentation Transcript


  1. Object Oriented Data Analysis, Last Time
      • Finished Q-Q Plots
      • Assess variability with Q-Q Envelope Plot
      • SigClust
      • When is a cluster “really there”?
      • Statistic: 2-means Cluster Index
      • Gaussian null distribution
      • Fit to data (for HDLSS data, using invariance)
      • P-values by simulation
      • Breast Cancer Data
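The 2-means Cluster Index (CI) named on this slide is commonly defined as the within-cluster sum of squares about the two cluster means, divided by the total sum of squares about the overall mean, so values near 0 indicate tight, well-separated clusters. A minimal Python sketch under that assumed definition (the function name `cluster_index` and the toy points are illustrative, not taken from the slides):

```python
def cluster_index(points, labels):
    """Within-cluster sum of squares divided by total sum of squares."""
    dim = len(points[0])

    def mean(pts):
        return [sum(p[j] for p in pts) / len(pts) for j in range(dim)]

    def sum_sq(pts, center):
        return sum((p[j] - center[j]) ** 2 for p in pts for j in range(dim))

    total = sum_sq(points, mean(points))
    within = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        within += sum_sq(members, mean(members))
    return within / total


# Toy data: two tight pairs of points, far apart in the first coordinate.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = cluster_index(points, [0, 0, 1, 1])  # split between the pairs: 1/101
bad = cluster_index(points, [0, 1, 0, 1])   # split across the pairs: 100/101
print(f"good split CI = {good:.3f}, bad split CI = {bad:.3f}")
```

The same points give a CI close to 0 or close to 1 depending only on the labelling, which is why the index is usable as a clustering quality statistic.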

  2. More on K-Means Clustering: Classical Algorithm (from MacQueen, 1967)
      • Start with initial means
      • Cluster: each data point to closest mean
      • Recompute class mean
      • Stop when no change
      Demo from: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
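The four steps on this slide can be sketched in a few lines of Python. This is a minimal batch version under stated assumptions: squared Euclidean distance, ties going to the lower-numbered mean, and an empty class keeping its previous mean. The function name `kmeans` and the sample points are illustrative only.

```python
def kmeans(points, centers, max_iter=100):
    """Iterate assign / recompute until the assignments stop changing."""
    dim = len(points[0])
    centers = [list(c) for c in centers]  # Step 1: start with initial means.
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each data point to the closest mean.
        new_labels = [
            min(range(len(centers)),
                key=lambda k: sum((p[j] - centers[k][j]) ** 2 for j in range(dim)))
            for p in points
        ]
        # Step 4: stop when no assignment changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each class mean (an empty class keeps its old mean).
        for k in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                centers[k] = [sum(p[j] for p in members) / len(members)
                              for j in range(dim)]
    return labels, centers


points = [(0, 0), (1, 0), (0, 1), (9, 9), (10, 9), (9, 10)]
# Deliberately poor start: both initial means sit in the lower-left group.
labels, centers = kmeans(points, [(0, 0), (1, 0)])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)
```

Even with both starting means in the same group, two iterations are enough here for the means to separate and the assignments to settle.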

  3. More on K-Means Clustering: Raw Data; 2 Starting Centers

  4. More on K-Means Clustering: Assign Each Data Point To Nearest Center; Recompute Mean; Re-assign

  5. More on K-Means Clustering: Recompute Mean; Re-Assign Data Points To Nearest Center

  6. More on K-Means Clustering: Recompute Mean; Re-Assign Data Points To Nearest Center

  7. More on K-Means Clustering: Recompute Mean; Final Assignment

  8. More on K-Means Clustering: New Example; Raw Data; Deliberately Strange Starting Centers

  9. More on K-Means Clustering: Assign Clusters To Given Means; Note poor clustering

  10. More on K-Means Clustering: Recompute Mean; Re-assign; Shows Improvement

  11. More on K-Means Clustering: Recompute Mean; Re-assign; Shows Improvement; Now very good

  12. More on K-Means Clustering: Different Example; Best 2-means Cluster? Local Minima?

  13. More on K-Means Clustering: Assign; Recompute Mean; Re-assign; Note poor clustering

  14. More on K-Means Clustering: Recompute Mean; Final Assignment; Stuck in Local Min

  15. More on K-Means Clustering: Same Data, but slightly different starting points; Impact?

  16. More on K-Means Clustering: Assign; Recompute Mean; Re-assign; Note poor clustering

  17. More on K-Means Clustering: Recompute Mean; Final Assignment; Now get Global Min
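The contrast between slides 12 to 14 (stuck in a local minimum) and slides 15 to 17 (global minimum) can be reproduced on a tiny hand-made data set. The sketch below is not the applet's data: it assumes four points at the corners of a 4-by-1 rectangle, two hypothetical pairs of starting centers, and the cluster index taken as within-cluster over total sum of squares.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(pts):
    return [sum(c) / len(pts) for c in zip(*pts)]


def two_means(points, starts, max_iter=100):
    """Batch 2-means from the given pair of starting centers."""
    centers = [list(s) for s in starts]
    labels = None
    for _ in range(max_iter):
        new_labels = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                      for p in points]
        if new_labels == labels:  # no change: converged
            break
        labels = new_labels
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = mean(members)
    return labels


def cluster_index(points, labels):
    """Within-cluster sum of squares over total sum of squares."""
    overall = mean(points)
    total = sum(sq_dist(p, overall) for p in points)
    within = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        center = mean(members)
        within += sum(sq_dist(p, center) for p in members)
    return within / total


# Corners of a wide, flat rectangle: the natural split is left vs. right.
points = [(0, 0), (4, 0), (0, 1), (4, 1)]

stuck = two_means(points, [(2, 0), (2, 1)])   # converges to bottom vs. top
found = two_means(points, [(1, 0), (3, 1)])   # converges to left vs. right

ci_stuck = cluster_index(points, stuck)       # 16/17
ci_found = cluster_index(points, found)       # 1/17
print(stuck, round(ci_stuck, 3))
print(found, round(ci_found, 3))
```

Both runs satisfy the "stop when no change" rule, so the algorithm alone cannot tell the two answers apart; comparing the CI values does.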

  18. More on K-Means Clustering: ??? Next time: redo the above using my own Matlab calculations. That way can show each step and get the right answers.

  19. More on K-Means Clustering: Now explore starting values
      • Approach: randomly choose 2 data points
      • Give stable solutions?
      • Explore for different point configurations
      • And try 100 random choices
      • Do 2-d examples for easy visualization
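The experiment outlined on this slide can be sketched as follows. The data set is an assumption made for illustration (a 2-d mixture of two unit-variance Gaussians centered at x = -5 and x = 5, 50 points each, fixed seed), not the author's data, and the cluster index is taken as within-cluster over total sum of squares.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(pts):
    return [sum(c) / len(pts) for c in zip(*pts)]


def two_means(points, starts, max_iter=100):
    """Batch 2-means from the given pair of starting centers."""
    centers = [list(s) for s in starts]
    labels = None
    for _ in range(max_iter):
        new_labels = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = mean(members)
    return labels


def cluster_index(points, labels):
    overall = mean(points)
    total = sum(sq_dist(p, overall) for p in points)
    within = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        center = mean(members)
        within += sum(sq_dist(p, center) for p in members)
    return within / total


rng = random.Random(0)  # fixed seed so the run is repeatable
points = ([(rng.gauss(-5, 1), rng.gauss(0, 1)) for _ in range(50)]
          + [(rng.gauss(5, 1), rng.gauss(0, 1)) for _ in range(50)])

# 100 random starts, each using 2 distinct data points as initial means.
cis = sorted(cluster_index(points, two_means(points, rng.sample(points, 2)))
             for _ in range(100))
print(f"smallest CI = {cis[0]:.3f}, largest CI = {cis[-1]:.3f}")
```

If the smallest and largest CI agree, every start reached the same solution; a spread of distinct values would point to local minima.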

  20. More on K-Means Clustering: 2 Clusters: Raw Data (Normal mixture)

  21. More on K-Means Clustering: 2 Clusters: Cluster Index, based on 100 Random Starts

  22. More on K-Means Clustering: 2 Clusters: Chosen Clustering

  23. More on K-Means Clustering: 2 Clusters Results
      • All starts end up with good answer
      • Answer is very good (CI = 0.03)
      • No obvious local minima

  24. More on K-Means Clustering: Stretched Gaussian: Raw Data

  25. More on K-Means Clustering: Stretched Gaussian: CI, based on 100 Random Starts

  26. More on K-Means Clustering: Stretched Gaussian: Chosen Clustering

  27. More on K-Means Clustering: Stretched Gaussian Results
      • All starts end up with same answer
      • Answer is less good (CI = 0.35)
      • No obvious local minima

  28. More on K-Means Clustering: Standard Gaussian: Raw Data

  29. More on K-Means Clustering: Standard Gaussian: CI, based on 100 Random Starts

  30. More on K-Means Clustering: Standard Gaussian: Chosen Clustering

  31. More on K-Means Clustering: Standard Gaussian Results
      • All starts end up with same answer
      • Answer even less good (CI = 0.62)
      • No obvious local minima
      • So still stable, despite poor CI
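A rough check on why a single Gaussian gives a large CI: if a Gaussian with variances λ1 ≥ λ2 along its principal axes is cut through its mean, perpendicular to the first axis, each half has its mean at distance sqrt(2 λ1 / π) from the center, so the population-level CI of that split is 1 - (2/π) λ1 / (λ1 + λ2). This gives about 0.68 for a standard 2-d Gaussian and approaches 1 - 2/π, about 0.36, as the stretching grows. The sketch below checks the formula by simulation; it uses a fixed split by the sign of the first coordinate rather than a fitted 2-means solution, and the sample size, seed, and standard deviations (10 and 1 for the stretched case) are assumptions. The CI values quoted on the slides come from the author's own data and need not match these numbers.

```python
import math
import random


def split_ci(points):
    """CI of the fixed 2-way split by the sign of the first coordinate."""
    def sum_sq(pts):
        m = [sum(c) / len(pts) for c in zip(*pts)]
        return sum((a - b) ** 2 for p in pts for a, b in zip(p, m))

    left = [p for p in points if p[0] < 0]
    right = [p for p in points if p[0] >= 0]
    return (sum_sq(left) + sum_sq(right)) / sum_sq(points)


def theoretical_ci(variances):
    """Population CI when splitting through the mean along the largest-variance axis."""
    return 1 - (2 / math.pi) * max(variances) / sum(variances)


rng = random.Random(1)
n = 20000
standard = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
stretched = [(rng.gauss(0, 10), rng.gauss(0, 1)) for _ in range(n)]

ci_standard = split_ci(standard)
ci_stretched = split_ci(stretched)
print(f"standard:  simulated {ci_standard:.3f}, formula {theoretical_ci([1, 1]):.3f}")
print(f"stretched: simulated {ci_stretched:.3f}, formula {theoretical_ci([100, 1]):.3f}")
```

On this view a stable solution with a poor CI is what one would expect when there is no real cluster structure: the split is well defined, but it explains only a modest share of the total variation.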

  32. More on K-Means Clustering: 4 Balanced Clusters: Raw Data (Normal mixture)

  33. More on K-Means Clustering: 4 Balanced Clusters: CI, based on 100 Random Starts

  34. More on K-Means Clustering: 4 Balanced Clusters, 100 Random Starts
      • Many different solutions appear
      • I.e. there are many local minima
      • Sorting on CI (bottom) shows how many
      • 2 seem smaller than others
      • What are other local minima? Understand with deeper visualization

  35. More on K-Means Clustering: 4 Balanced Clusters: Class Assignment Image Plot

  36. More on K-Means Clustering: 4 Balanced Clusters: Vertically Regroup (better view?)

  37. More on K-Means Clustering: 4 Balanced Clusters: Choose cases to “flip” – color cases

  38. More on K-Means Clustering: 4 Balanced Clusters: Choose cases to “flip” – color cases

  39. More on K-Means Clustering: 4 Balanced Clusters: “flip”, shows local min clusters

  40. More on K-Means Clustering: 4 Balanced Clusters: sort columns, for better visualization

  41. More on K-Means Clustering: 4 Balanced Clusters: CI, based on 100 Random Starts

  42. More on K-Means Clustering: 4 Balanced Clusters: Color according to local minima

  43. More on K-Means Clustering: 4 Balanced Clusters: Chosen Clustering, smallest CI

  44. More on K-Means Clustering: 4 Balanced Clusters: Chosen Clustering, 2nd smallest CI

  45. More on K-Means Clustering: 4 Balanced Clusters: Chosen Clustering, 3rd CI (larger)

  46. More on K-Means Clustering: 4 Balanced Clusters: Chosen Clustering, 4th CI (larger)

  47. More on K-Means Clustering: 4 Balanced Clusters: Chosen Clustering, 5th CI (larger)

  48. More on K-Means Clustering: 4 Balanced Clusters: Chosen Clustering, 6th CI (larger)

  49. More on K-Means Clustering: 4 Balanced Clusters Results
      • Many Local Minima
      • Two good ones appear often (2-2 splits)
      • 4 worse ones (1-3 splits less common)
      • 1 with single strange point
      • Overall very unstable
      • Raises concern over starting values
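One way to count the local minima reached from random starts is to store each final labelling in a canonical form and tally how often it occurs. The sketch below does this for 2-means on an assumed data set (four unit-variance Gaussian clusters of 25 points each, centered at the corners of a square, fixed seed). It does not reproduce the author's data or plots, so the number of distinct solutions and their CI values may differ from the slides.

```python
import random
from collections import Counter


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(pts):
    return [sum(c) / len(pts) for c in zip(*pts)]


def two_means(points, starts, max_iter=100):
    """Batch 2-means from the given pair of starting centers."""
    centers = [list(s) for s in starts]
    labels = None
    for _ in range(max_iter):
        new_labels = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = mean(members)
    return labels


def cluster_index(points, labels):
    overall = mean(points)
    total = sum(sq_dist(p, overall) for p in points)
    within = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        center = mean(members)
        within += sum(sq_dist(p, center) for p in members)
    return within / total


rng = random.Random(0)
corners = [(-3, -3), (-3, 3), (3, -3), (3, 3)]  # assumed cluster centers
points = [(rng.gauss(cx, 1), rng.gauss(cy, 1))
          for cx, cy in corners for _ in range(25)]

solutions = Counter()  # canonical labelling -> number of starts reaching it
ci_of = {}             # canonical labelling -> its cluster index
for _ in range(100):
    labels = two_means(points, rng.sample(points, 2))
    if labels[0] == 1:  # canonical form: the first point always has label 0
        labels = [1 - lab for lab in labels]
    key = tuple(labels)
    solutions[key] += 1
    ci_of[key] = cluster_index(points, labels)

# Report distinct solutions from smallest to largest CI.
for key, count in sorted(solutions.items(), key=lambda kv: ci_of[kv[0]]):
    print(f"CI = {ci_of[key]:.3f}  sizes = ({key.count(0)}, {key.count(1)})  "
          f"reached by {count} of 100 starts")
```

Flipping labels so that the first point is always in class 0 removes the arbitrary naming of the two classes, so two runs are counted as the same solution exactly when they produce the same partition.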

  50. More on K-Means Clustering: 4 Unbalanced Clusters: Raw Data (try for stability)
