Centroid index Cluster level quality measure

Pasi Fränti, 3.9.2018

Clustering accuracy

Classification accuracy: Known class labels
• Solution A: Oranges 100 %, Apples 100 %
• Solution B, Oranges: Precision = 5/7 = 71%, Recall = 5/5 = 100%
• Solution B, Apples: Precision = 3/3 = 100%, Recall = 3/5 = 60%
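The precision and recall figures on this slide can be reproduced with a few lines of Python. This is only a sketch: the function name `precision_recall` and the reading of the counts (7 items labelled as oranges of which 5 are correct, 3 items labelled as apples of which all 3 are correct, 5 true items per class) are assumptions made for illustration.

```python
def precision_recall(true_positives, predicted_total, actual_total):
    """Return (precision, recall) for a single class."""
    precision = true_positives / predicted_total  # correct among those predicted
    recall = true_positives / actual_total        # correct among those that exist
    return precision, recall


# Counts as read off the slide (assumed interpretation).
oranges = precision_recall(true_positives=5, predicted_total=7, actual_total=5)
apples = precision_recall(true_positives=3, predicted_total=3, actual_total=5)

print(f"Oranges: precision {oranges[0]:.0%}, recall {oranges[1]:.0%}")
print(f"Apples:  precision {apples[0]:.0%}, recall {apples[1]:.0%}")
```

Both measures need class labels, which is exactly what a clustering task does not have.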

Clustering accuracy: No class labels! Solution A vs. Solution B: ???

Internal index: Sum-of-squared error (MSE)
[Figure: Solutions A and B with per-cluster errors; 11.3 + 18.7 = 30 for one solution and 31.2 + 8.8 = 40 for the other]
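As a rough illustration of an internal index, the sum of squared errors can be computed from the points, their cluster labels and the centroids alone. The sketch below uses NumPy; the function name `sse` and the toy data are hypothetical and are not the data behind the slide.

```python
import numpy as np


def sse(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its own centroid."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    diffs = points - centroids[np.asarray(labels)]
    return float(np.sum(diffs ** 2))


# Hypothetical toy data: four points on a line, two clusters.
points = [[0, 0], [2, 0], [10, 0], [14, 0]]
labels = [0, 0, 1, 1]
centroids = [[1, 0], [12, 0]]

total = sse(points, labels, centroids)  # 1 + 1 + 4 + 4
print(total)
```

Dividing the total by the number of points gives a mean value; some papers also divide by the dimensionality, so the exact normalization should be checked against the source being compared.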

External index: Compare two solutions
• Two clusterings (A and B)
• Clustering against ground truth
[Figure: Solution A vs. Solution B with overlap counts 5, 2 and 3; ratios 5/7 = 71%, 2/5 = 40%, 3/5 = 60%]

Set-matching based methods M. Rezaei and P. Fränti, "Set-matching measures for external cluster validity", IEEE Trans. on Knowledge and Data Engineering, 28 (8), 2173-2186, August 2016.

What about this…? [Figure: Solution 1, Solution 2 and Solution 3]

External index: Selection of existing methods
• Pair-counting measures
  • Rand index (RI) [Rand, 1971]
  • Adjusted Rand index (ARI) [Hubert & Arabie, 1985]
• Information-theoretic measures
  • Mutual information (MI) [Vinh, Epps, Bailey, 2010]
  • Normalized Mutual information (NMI) [Kvalseth, 1987]
• Set-matching based measures
  • Normalized van Dongen (NVD) [Kvalseth, 1987]
  • Criterion H (CH) [Meila & Heckerman, 2001]
  • Purity [Rendon et al, 2011]
  • Centroid index (CI) [Fränti, Rezaei & Zhao, 2014]
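To make the pair-counting idea concrete, here is a minimal sketch of the plain Rand index: it checks every pair of points and counts how often the two clusterings agree on whether the pair belongs together. The function name and the toy label vectors are illustrative assumptions; the adjusted variant (ARI) adds a chance correction that is not shown here.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]  # pair together in clustering A?
        same_b = labels_b[i] == labels_b[j]  # pair together in clustering B?
        agree += same_a == same_b
        total += 1
    return agree / total


# Hypothetical label vectors for six points.
a = [0, 0, 0, 1, 1, 1]
b = [0, 0, 1, 1, 2, 2]
print(round(rand_index(a, b), 3))
```

The loop is quadratic in the number of points, which is fine for a sketch but not for large data sets.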

Cluster level measure

Point level vs. cluster level: [Figure panels: Cluster-level mismatches; Point-level differences]

Point level vs. cluster level: [Figure: solutions from Agglomerative (AC), Random Swap (RS) and K-means, with ARI values 0.91, 0.82 and 0.88 shown between them]

Centroid index

Pigeon hole principle
• 15 pigeons (clustering A)
• 15 pigeon holes (clustering B)
• Only one bijective (1:1) mapping exists

Centroid index (CI) [Fränti, Rezaei, Zhao, Pattern Recognition 2014]: 15 prototypes (pigeons), 15 real clusters (pigeon holes). Four holes are left empty, so CI = 4.

Definitions
• Find nearest centroids (A→B)
• Detect prototypes with no mapping
• Number of orphans
• Centroid index: number of zero mappings! Mapping both ways
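The steps listed above can be sketched in NumPy as follows. Each centroid in A is mapped to its nearest centroid in B, centroids in B that receive no mapping are counted as orphans, and the same is done in the other direction. Taking the maximum of the two directions follows the cited Pattern Recognition 2014 paper as far as we understand it; the function names and the toy centroids are assumptions made for this example.

```python
import numpy as np


def ci_one_way(centroids_a, centroids_b):
    """Number of centroids in B that no centroid in A maps to (orphans)."""
    a = np.asarray(centroids_a, dtype=float)
    b = np.asarray(centroids_b, dtype=float)
    # Squared Euclidean distance between every pair (a_i, b_j).
    dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)  # index of nearest B centroid for each A centroid
    indegree = np.bincount(nearest, minlength=len(b))
    return int(np.sum(indegree == 0))


def centroid_index(centroids_a, centroids_b):
    """Map both ways and keep the larger orphan count."""
    return max(ci_one_way(centroids_a, centroids_b),
               ci_one_way(centroids_b, centroids_a))


# Hypothetical example: B puts two centroids where A has one, and vice versa.
A = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]]
B = [[0.4, 0.0], [10.0, 0.0], [11.0, 0.0]]
print(centroid_index(A, B))
```

A value of 0 means every centroid found a distinct counterpart, so the two solutions share the same cluster-level structure.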

Example S2: mappings and indegree counts. [Figure: counts 1 1 2 1 2 1 1 1 0 1 1 1 1 1 0] Value 0 indicates an orphan cluster. CI = 2
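The CI value of this example follows directly from the indegree counts: count the zeros. In the short check below the counts are copied from the slide in the order they were extracted, so which count belongs to which cluster is not known; only the number of zeros matters here.

```python
# Indegree counts as extracted from the S2 slide (cluster order unknown).
indegree = [1, 1, 2, 1, 2, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0]

# Every one of the 15 prototypes maps somewhere, so the counts sum to 15.
total_mappings = sum(indegree)

# Clusters that received no mapping are orphans.
orphans = sum(1 for count in indegree if count == 0)

print(total_mappings, orphans)
```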

Example Birch1: mappings, CI = 7

Example Birch2: mappings, CI = 18. Two clusters but only one allocated; three mapped into one. [Figure: mapping counts 1 0 1 1 3 1]

Example Birch2: mappings, CI = 18. Two clusters but only one allocated; three mapped into one. [Figure: mapping counts 2 1 0 1 1 0 1]

Unbalanced exampleK-means result KMGT CI = 4 500 0 2000 0 1 1 1011 492 4 2 458 560 0 989 490 0 K-means tend to put too many clusters here … … and too few here

Unbalanced exampleK-means result GTKM CI = 4 5 1 1 0 1 0 0 0

Experiments

Mean Squared Errors: Green = correct clustering structure. Raw numbers don’t tell much.

Adjusted Rand Index [Hubert & Arabie, 1985]: How high is good?

Normalized Mutual information [Kvalseth, 1987]

Normalized Van Dongen [Kvalseth, 1987]: Lower is better.

Centroid Similarity Index [Fränti, Rezaei, Zhao, 2014]: OK but lacks threshold.

Centroid Index [Fränti, Rezaei, Zhao, 2014]

Going deeper…

Accurate clustering (Bridge): GAIS-2002 similar to GAIS-2012? [Figure: MSE differences -0.04, -0.29, -0.29]

GAIS’02 and GAIS’12 the same? Virtually the same MSE values.
[Figure: MSE axis from 160.2 to 160.8. GAIS 2002: 160.72; GAIS 2012: 160.68 (difference -0.04). GAIS 2002 + RS(8M): 160.43; GAIS 2012 + RS(8M): 160.39 (difference -0.04). RS(8M) lowers each solution by 0.29.]

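GAIS’02 and GAIS’12 the same? But different structure!
[Figure: MSE axis from 160.2 to 160.8. CI=17 between GAIS 2002 (160.72) and GAIS 2012 (160.68); CI=0 and CI=1 between each solution and its + RS(8M) refinement; CI=18 between GAIS 2002 + RS(8M) (160.43) and GAIS 2012 + RS(8M) (160.39).]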

Seemingly the same solutions: same structure, “same family”

But why?
• Real cluster structure missing
• Clusters allocated like a well-optimized “grid”
• Several grids result in different allocations
• Overall clustering quality can still be the same

RS runs

RS runs: 1M vs 8M (runs A, B, C; suffix 1 = 1M iterations, suffix 8 = 8M iterations)
• 1M: MSE values 161.37, 161.48, 161.44; between runs CI=26 (MSE difference 0.07%) and CI=26 (0.02%)
• 8M: MSE values 160.99, 161.22, 161.24; between runs CI=24 (0.01%) and CI=26 (0.16%)
• 1M vs 8M: CI=1 (0.15%), CI=3 (0.28%), CI=1 (0.09%)

Generalization

Three alternatives
1. Prototype similarity: prototype must exist
2. Model similarity: derived from model
3. Partition similarity

Partition similarity
• Cluster similarity using Jaccard
• Calculated from contingency table
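When no prototypes exist, the same orphan counting can be driven by cluster overlap instead of centroid distance. The sketch below builds a contingency table from two label vectors, turns it into Jaccard similarities, and maps each cluster to its most similar counterpart. It assumes labels are integers 0..K-1 with no empty clusters; the function names and toy labels are hypothetical, and details such as tie handling may differ from the authors' implementation.

```python
import numpy as np


def contingency(labels_a, labels_b):
    """table[i, j] = number of points in cluster i of A and cluster j of B."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    table = np.zeros((a.max() + 1, b.max() + 1), dtype=int)
    np.add.at(table, (a, b), 1)
    return table


def jaccard_matrix(table):
    """Jaccard similarity |A_i ∩ B_j| / |A_i ∪ B_j| for every cluster pair."""
    size_a = table.sum(axis=1, keepdims=True)
    size_b = table.sum(axis=0, keepdims=True)
    return table / (size_a + size_b - table)


def orphans(similarity):
    """Map each row to its most similar column; count columns never chosen."""
    nearest = similarity.argmax(axis=1)
    indegree = np.bincount(nearest, minlength=similarity.shape[1])
    return int(np.sum(indegree == 0))


def ci_partition(labels_a, labels_b):
    sim = jaccard_matrix(contingency(labels_a, labels_b))
    return max(orphans(sim), orphans(sim.T))


# Hypothetical partitions of eight points into three clusters each.
pa = [0, 0, 0, 0, 1, 1, 2, 2]
pb = [0, 0, 1, 1, 2, 2, 2, 2]
print(ci_partition(pa, pb))
```

In this toy case A splits a group that B keeps whole and B splits a group that A keeps whole, so one cluster is left without a counterpart in each direction.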

Gaussian mixture model: Split-and-Merge EM vs. Random Swap EM
• RI 0.98, ARI 0.84, MI 3.60, NMI 0.94, NVD 0.08, CH 0.16
• CI=2
[Figure: mixture components annotated with mapping counts]

Gaussian mixture model: [Figure panels: CI=2, CI=1, CI=3, CI=0, CI=1, CI=0]

Arbitrary-shape data

KMGT 1 1 2 CI = 2 2 0 1 0

GTKM 1 1 1 0 CI = 1 1 1 2

SLGT 1 0 1 3 0 1 CI = 2

GTSL 0 2 2 0 0 2 CI = 3

Summary of experiments: prototype similarity

Summary of experiments: partition similarity
