
Cluster validation



Presentation Transcript


  1. Cluster validation. Machine Learning, University of Eastern Finland. Clustering methods: Part 3. Pasi Fränti, 10.5.2017.

  2. Part I: Introduction

  3. Cluster validation
  Supervised classification: ground-truth class labels are known, so accuracy, precision and recall can be measured.
  Cluster analysis: no class labels. Validation is needed to:
  • Compare clustering algorithms
  • Solve the number of clusters
  • Avoid finding patterns in noise
  Example (a minimal computation is sketched below):
  Oranges: Precision = 5/5 = 100%, Recall = 5/7 = 71%
  Apples: Precision = 5/5 = 100%, Recall = 3/5 = 60%
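The counts in the oranges/apples example above plug directly into the definitions; a minimal sketch (the helper function is mine, not from the slides):

```python
# Precision/recall for the oranges example above.
# Counts come from the slide; the helper itself is only an illustration.
def precision_recall(tp, retrieved, relevant):
    """Precision = TP / retrieved items, recall = TP / relevant items."""
    return tp / retrieved, tp / relevant

p, r = precision_recall(tp=5, retrieved=5, relevant=7)   # oranges
print(f"precision = {p:.0%}, recall = {r:.0%}")          # 100%, 71%
```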

  4. Measuring clustering validity
  Internal index: validates without external information; compares solutions with different numbers of clusters; used to solve the number of clusters.
  External index: validates against ground truth; compares two clusterings (how similar they are).

  5. Clustering of random data
  (Figure: random points and the clusterings produced by DBSCAN, K-means and Complete Link.)

  6. Cluster validation process
  • Distinguishing whether non-random structure actually exists in the data (one cluster).
  • Comparing the results of a cluster analysis to external ground truth (class labels).
  • Evaluating how well the results fit the data without reference to external information.
  • Comparing two different clustering results to determine which is better.
  • Determining the number of clusters.

  7. Cluster validation process
  Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion. [Jain & Dubes, 1988]
  • How to be "quantitative": employ the measures.
  • How to be "objective": validate the measures!
  (Figure: DataSet(X) is fed to a clustering algorithm with different numbers of clusters m; the resulting partitions P and codebook C are scored by a validity index, which outputs the best number of clusters m*.)

  8. Part II: Internal indexes

  9. Internal indexes
  Ground truth is rarely available, but unsupervised validation must still be done. An internal index is minimized (or maximized) over the number of clusters:
  • Variances within clusters and between clusters
  • Rate-distortion method
  • F-ratio
  • Davies-Bouldin index (DBI)
  • Bayesian information criterion (BIC)
  • Silhouette coefficient
  • Minimum description length principle (MDL)
  • Stochastic complexity (SC)
  Several of these are sketched in code after this list.
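Several of the listed indexes ship with scikit-learn; a minimal sketch, assuming scikit-learn is installed (the synthetic data is only for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("Silhouette coefficient:", silhouette_score(X, labels))        # maximize
print("Davies-Bouldin index:  ", davies_bouldin_score(X, labels))    # minimize
print("Calinski-Harabasz:     ", calinski_harabasz_score(X, labels)) # maximize
```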

  10. Sum of squared errors
  SSE = Σ_j Σ_{x∈C_j} ||x − c_j||², summed over all clusters j with centroids c_j.
  The more clusters, the smaller the value. A small knee-point appears near the correct number of clusters, but how can it be detected? Here the knee-point lies between 14 and 15 clusters.
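A sketch of computing the SSE curve over candidate cluster counts; scikit-learn's `inertia_` attribute is exactly the SSE of the fitted k-means solution (the data here is hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=15, random_state=0)

# SSE (inertia) for each candidate number of clusters k.
sse = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(2, 26)}
for k in sorted(sse):
    print(k, round(sse[k]))   # monotonically decreasing; look for the knee
```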

  11. Sum of squared errors
  (Figure: clustering results with 5 clusters and with 10 clusters.)

  12. From TSE to cluster validity
  • Minimize within-cluster (intra-cluster) variance (TSE)
  • Maximize between-cluster (inter-cluster) variance

  13. Jump point of TSE (rate-distortion approach)
  Take the first derivative (successive differences) of the powered TSE values, as reconstructed below; the biggest jump is at 15 clusters.
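The formula itself did not survive the transcript; a plausible reconstruction is the jump statistic of Sugar & James (2003), which powers the distortion by -d/2 before differencing:

```latex
% Assumed reconstruction (jump statistic, Sugar & James 2003):
% D_k is the TSE with k clusters, d is the dimension of the data.
\[
  J(k) \;=\; D_k^{-d/2} \,-\, D_{k-1}^{-d/2}
\]
% The number of clusters is taken where the jump J(k) is largest
% (k = 15 in the example above).
```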

  14. Cluster variances
  Within clusters: SSW = Σ_{i=1..N} ||x_i − c_{p(i)}||²
  Between clusters: SSB = Σ_{j=1..k} n_j ||c_j − x̄||²
  Total variance of the data set: SST = SSW + SSB
  (x_i = data vector, p(i) = its cluster index, c_j = centroid of cluster j, n_j = its size, x̄ = mean of the whole data set.)

  15. WB-index
  Measures the ratio of the within-groups variance against the between-groups variance, scaled by the number of clusters (smaller is better):
  WB(m) = m · SSW / SSB

  16. Sum-of-squares based indexes
  • SSW / k ---- Ball and Hall (1965)
  • k²·|W| ---- Marriot (1971)
  • (SSB / (k−1)) / (SSW / (N−k)) ---- Calinski & Harabasz (1974)
  • log(SSB / SSW) ---- Hartigan (1975)
  • d·log(√(SSW / (d·N²))) + log k ---- Xu (1997)
  (d = dimensions; N = size of data; k = number of clusters; SSW = sum of squares within the clusters (= TSE); SSB = sum of squares between the clusters)

  17. Calculation of the WB-index (also called F-ratio / F-test)
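A from-scratch sketch, assuming the definition WB(m) = m · SSW / SSB given above and choosing the number of clusters that minimizes it:

```python
import numpy as np

def wb_index(X, labels):
    """WB-index, assuming WB(m) = m * SSW / SSB; smaller is better."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    grand_mean = X.mean(axis=0)
    ids = np.unique(labels)
    ssw = ssb = 0.0
    for c in ids:
        pts = X[labels == c]
        centroid = pts.mean(axis=0)
        ssw += ((pts - centroid) ** 2).sum()                    # within-cluster SS
        ssb += len(pts) * ((centroid - grand_mean) ** 2).sum()  # between-cluster SS
    return len(ids) * ssw / ssb
```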

  18. Dataset S1

  19. Dataset S2

  20. Dataset S3

  21. Dataset S4

  22. Extension for S3

  23. Sum-of-squares based indexes
  (Figure: SSW/SSB, MSE, SSW/m and log(SSB/SSW) plotted as functions of the number of clusters m; the optimum m* is read from the SSW/SSB curve.)

  24. Davies-Bouldin index (DBI)
  • Minimize intra-cluster variance
  • Maximize the distance between clusters
  The cost function combines the two for each pair of clusters: R_ij = (S_i + S_j) / d(c_i, c_j), where S_i is the average distance of the vectors in cluster i to their centroid and d(c_i, c_j) is the distance between the centroids.

  25. Davies-Bouldin index (DBI)
  DBI = (1/k) Σ_{i=1..k} max_{j≠i} R_ij, i.e. each cluster is scored by its worst (most similar) neighbour and the scores are averaged; smaller values are better.
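A from-scratch sketch of the index as reconstructed above (scikit-learn's `davies_bouldin_score` offers an equivalent measure):

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index: average worst-case R_ij; lower is better."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # S_i: average distance of the members of cluster i to its centroid.
    S = np.array([np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
                  for i, c in enumerate(ids)])
    k = len(ids)
    worst = []
    for i in range(k):
        # R_ij = (S_i + S_j) / d(c_i, c_j); keep the worst neighbour of i.
        worst.append(max((S[i] + S[j]) / np.linalg.norm(centroids[i] - centroids[j])
                         for j in range(k) if j != i))
    return sum(worst) / k
```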

  26. Measured values for S2

  27. Silhouette coefficient [Kaufman & Rousseeuw, 1990]
  • Cohesion: measures how close the objects within a cluster are.
  • Separation: measures how well separated the clusters are.

  28. Silhouette coefficient
  • Cohesion a(x): average distance of x to all other vectors in the same cluster.
  • Separation b(x): average distance of x to the vectors in each of the other clusters; take the minimum over those clusters.
  • Silhouette of x: s(x) = (b(x) − a(x)) / max{a(x), b(x)}, with s(x) ∈ [−1, +1]: −1 = bad, 0 = indifferent, +1 = good.
  • Silhouette coefficient (SC): the average of s(x) over all vectors x.

  29. Silhouette coefficient (SC)
  (Figure: a(x) is the average distance of x within its own cluster (cohesion); b(x) is the average distance to each other cluster, minimized over them (separation).)
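A direct transcription of a(x), b(x) and s(x), kept deliberately simple (O(N²) distance matrix; singleton clusters are not special-cased):

```python
import numpy as np

def silhouettes(X, labels):
    """Per-vector silhouette values s(x); SC is their mean."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # all pairwise distances
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                                  # exclude x itself
        a = D[i, same].mean()                            # cohesion a(x)
        b = min(D[i, labels == c].mean()                 # separation b(x):
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# SC = silhouettes(X, labels).mean()
```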

  30. Performance of SC

  31. Bayesian information criterion (BIC)
  Formula for a GMM (in its common penalized-likelihood form): BIC = L(θ) − (p/2)·log n, where L(θ) is the log-likelihood function of the model, p the number of free parameters, n the size of the data set and m the number of clusters.
  Under a spherical Gaussian assumption this yields a closed-form BIC for partitioning-based clustering, expressed in terms of d (dimension of the data set), n_i (size of the i-th cluster) and Σ_i (covariance of the i-th cluster).

  32. Knee point detection on BIC
  Original BIC curve: F(m). The knee is detected from the second difference: SD(m) = F(m−1) + F(m+1) − 2·F(m).
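A minimal sketch of the knee-point rule, assuming F is given as a mapping from consecutive values of m to BIC values:

```python
def knee_point(F):
    """Return the m maximizing SD(m) = F(m-1) + F(m+1) - 2*F(m).

    F: dict mapping consecutive numbers of clusters m to BIC values F(m);
    the two end points have no second difference and are skipped."""
    ms = sorted(F)
    sd = {m: F[m - 1] + F[m + 1] - 2 * F[m] for m in ms[1:-1]}
    return max(sd, key=sd.get)
```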

  33. Internal indexes

  34. Internal indexes: soft partitions

  35. Comparison of the indexes: K-means

  36. Comparison of the indexes: Random Swap

  37. Part III: Stochastic complexity for binary data

  38. Stochastic complexity
  Principle of minimum description length (MDL): find the clustering C that describes the data with the minimum amount of information.
  Data = Clustering + description of the data given the clustering:
  • the clustering is defined by the centroids;
  • the data are defined by which cluster each vector belongs to (partition index) and where in the cluster it lies (difference from the centroid).

  39. Solution for binary data
  (Formula slide: the stochastic complexity expression for binary data, the definitions of its terms, and the simplified form it reduces to.)
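Since the exact expression is lost here, the following is only a toy two-part code length in the spirit of the MDL decomposition on the previous slide, not Fränti's SC formula: partition indices cost log2(k) bits each, and each cluster's deviations from its binary "centroid" are costed by the per-attribute binary entropy.

```python
import numpy as np

def binary_entropy(p):
    """Entropy of a Bernoulli(p) variable in bits, elementwise."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def description_length(X, labels):
    """Toy MDL-style code length (bits) for binary data X (n x d)."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    n, k = len(X), len(np.unique(labels))
    bits = n * np.log2(k)                   # which cluster each vector is in
    for c in np.unique(labels):
        pts = X[labels == c]
        freq = pts.mean(axis=0)             # attribute frequencies = "centroid"
        bits += len(pts) * binary_entropy(freq).sum()  # coding the deviations
    return bits
```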

  40. Number of clusters by stochastic complexity (SC)

  41. Part IV: Stability-based approach

  42. Cross-validation
  Compare the clustering of the full data against the clustering of a sub-sample.

  43. Cross-validation: correct number of clusters
  (Figure: the two clusterings give the same results.)

  44. Cross-validation: incorrect number of clusters
  (Figure: the two clusterings give different results.)

  45. Stability approach in general
  • Add randomness (cross-validation strategy)
  • Solve the clustering
  • Compare the clusterings
  A minimal sketch of this loop follows.
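A minimal sketch, with my own choices for the details the slide leaves open: k-means as the algorithm and the adjusted Rand index as the comparison measure (neither is prescribed here); the 30% sub-sample rate follows the 20-40% recommendation on slide 47.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability(X, k, rate=0.3, repeats=10, seed=0):
    """Average agreement between full-data and sub-sample clusterings."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    full = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    scores = []
    for _ in range(repeats):
        idx = rng.choice(len(X), size=int(rate * len(X)), replace=False)
        sub = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
        # Compare the sub-sample labels against the full clustering on the
        # same points; ARI is invariant to label permutations.
        scores.append(adjusted_rand_score(full[idx], sub))
    return float(np.mean(scores))
```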

  46. Adding randomness
  • Three choices: 1. Sub-sample 2. Add noise 3. Randomize the algorithm
  • What sub-sample size?
  • How to model the noise, and how much?
  • Use k-means?

  47. Sub-sample size
  • Too large (80%): always the same clustering
  • Too small (5%): may break the cluster structure
  • Recommended: 20-40%
  (Figure: Spiral dataset with a 60% and a 20% sub-sample.)

  48. Classification approach
  Does not really add anything; it just makes the process more complex.

  49. Comparison of three approaches
  • Cross-validation works OK
  • Classification also works OK
  • Randomizing the algorithm fails

  50. Problem
  Stability can also arise for other reasons:
  • Different cluster sizes
  • Wrong cluster model
  This happens when k < k*.
