Information Theoretic Measures for Clusterings Comparison: Is a Correction for Chance Necessary?
Nguyen Xuan Vinh, UNSW
Julien Epps, UNSW
James Bailey, Uni Melbourne
Australia's ICT Research Centre of Excellence
Correction for Chance for Information Theoretic based measures - Outline
• Introduction to clustering and clustering comparison
• A brief survey of clustering comparison measures
• How does chance agreement affect information theoretic based measures?
• Adjusted-for-chance measures
• Conclusion
Introduction
• Clustering: the “art” of dividing the points of a data set into meaningful groups
• Notation:
• Data set: S = {s1, s2, ..., sN}
• (Hard) clustering: a way to partition the data set into non-overlapping parts
• U = {U1, U2, ..., UR}, where the Ui are non-overlapping subsets of S
• V = {V1, V2, ..., VC}, where the Vj are non-overlapping subsets of S
Introduction
• Clustering comparison measures are used to:
• Evaluate the goodness of clustering solutions (assuming the “true” clustering is known)
• Evaluate clustering algorithms (over multiple data sets)
• More active uses:
• To search for a good clustering solution, as in ensemble clustering
• To quantify the discordance within a set of clusterings => stability assessment, which may give a useful hint for model selection, such as choosing the “right” number of clusters
Correction for Chance for Information Theoretic based measures - Outline
• Introduction to clustering and clustering comparison
• A brief review of clustering comparison measures
• How does chance agreement affect information theoretic based measures?
• Adjusted-for-chance measures
• Conclusion
Clustering comparison measures – A brief review
3 categories:
• Pair-counting based
• Rand Index (RI), Adjusted Rand Index (ARI)
• Jaccard Index, Fowlkes & Mallows index…
• 22 in total (Albatineh et al. (2006))
• Set-matching based
• The “classification error”, the Van Dongen metric
• Information theoretic based
• Mutual Information (MI), normalized MI
• Variation of Information (Meila (2005))
Clustering comparison measures – A brief review – Rand Index
• Rand Index (RI): a set of N data points has N(N-1)/2 pairs of points, each of which falls into one of 4 categories: in the same cluster under both clusterings (N11), in different clusters under both (N00), or in the same cluster under one clustering but not the other (N10, N01)
• RI = (N11 + N00) / [N(N-1)/2]: the proportion of pairs on which both clusterings agree
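As an illustration, the pair-counting definition above can be coded directly. This is a minimal sketch (plain Python on label lists, not the authors' code):

```python
from itertools import combinations

def rand_index(u, v):
    """Rand Index: the fraction of the N(N-1)/2 point pairs on which
    the two clusterings u and v agree (u, v are label lists, one per point)."""
    pairs = list(combinations(range(len(u)), 2))
    agree = sum(1 for i, j in pairs
                if (u[i] == u[j]) == (v[i] == v[j]))  # together in both, or apart in both
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # identical partitions -> 1.0
```

Note that only the induced partition matters, not the label values themselves, which is why relabeled but identical partitions score 1.0.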
Clustering comparison measures – A brief review – Adjusted Rand Index
• Problem with the Rand Index: its baseline value (the average value between random clusterings) is high and varies
• Solution: an adjusted index,
Adjusted Index = (Index − Expected Index) / (Max Index − Expected Index)
• Applying this adjustment to RI yields the Adjusted Rand Index (ARI)
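The adjustment can be computed in closed form from the contingency table. A minimal sketch (stdlib only; the expectation is taken under the fixed-marginals hypergeometric model discussed later in the deck):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(u, v):
    """ARI = (Index - Expected Index) / (Max Index - Expected Index),
    computed from the contingency table of label lists u and v."""
    n = len(u)
    nij = Counter(zip(u, v))                          # joint counts |Ui ∩ Vj|
    sum_ij = sum(comb(c, 2) for c in nij.values())    # the raw pair-count index
    sum_a = sum(comb(c, 2) for c in Counter(u).values())
    sum_b = sum(comb(c, 2) for c in Counter(v).values())
    expected = sum_a * sum_b / comb(n, 2)             # chance level under fixed marginals
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # identical partitions -> 1.0
```

ARI equals 1 for identical partitions and is 0 on average for random ones; it can go negative for clusterings that agree less than chance would predict.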
Why do we care about information theoretic based measures?
• Strong theoretical foundation: information theory
• Ability to detect general non-linear correlation
• As a matter of fact, they are receiving increasing interest
• Ensemble clustering: Strehl and Ghosh (2002); Fern and Brodley (2003); Singh et al. (2007); He et al. (2008)
• Comparison measures: Meila (2003, 2005, 2007); Vinh and Phuong (2008a, 2008b)
Ingredients for information theoretic based measures
• Given a clustering U = {U1, U2, ..., UR}
• Entropy of U: H(U) = −Σi P(i) log P(i)
• where P(i) = |Ui| / N
• H(U) quantifies the uncertainty in determining the cluster label of a data point in S
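In code (a sketch using the natural logarithm; the input is a list of cluster labels, one per data point):

```python
from collections import Counter
from math import log

def entropy(u):
    """H(U) = -sum_i P(i) log P(i), with P(i) = |Ui| / N (natural log)."""
    n = len(u)
    return -sum((c / n) * log(c / n) for c in Counter(u).values())

print(entropy([0, 1, 2, 3]))  # 4 singleton clusters: maximal uncertainty, log 4
```

A single all-encompassing cluster gives H(U) = 0: there is no uncertainty about any point's label.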
Ingredients for information theoretic based measures – Mutual information
Given two clusterings with cluster-membership distributions
• U = {U1, U2, ..., UR}, P(i) = |Ui| / N
• V = {V1, V2, ..., VC}, P'(j) = |Vj| / N
the mutual information is calculated from the joint distribution of points over the clusters of U and V:
I(U,V) = Σi Σj P(i,j) log [ P(i,j) / (P(i) P'(j)) ]
• where P(i,j) = |Ui ∩ Vj| / N
Information theoretic based measures
• Mutual information I(U,V)
• A similarity measure, range: [0, log N]
• Measures the information shared between two clusterings
• Normalized Mutual Information, e.g. NMI(U,V) = I(U,V) / √(H(U) H(V))
• Range: [0, 1]
• Variation of Information (Meila (2005)): VI(U,V) = H(U) + H(V) − 2 I(U,V)
• A dissimilarity measure
• A true metric on the space of clusterings
• Range: [0, log N]
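The three quantities above, sketched in plain Python (the √(H(U)H(V)) normalization is only one of several NMI variants in the literature):

```python
from collections import Counter
from math import log, sqrt

def entropy(u):
    """H(U) = -sum_i P(i) log P(i), natural log."""
    n = len(u)
    return -sum((c / n) * log(c / n) for c in Counter(u).values())

def mutual_information(u, v):
    """I(U,V) = sum_ij P(i,j) log[ P(i,j) / (P(i) P'(j)) ]."""
    n = len(u)
    pu, pv, puv = Counter(u), Counter(v), Counter(zip(u, v))
    return sum((c / n) * log(c * n / (pu[i] * pv[j]))
               for (i, j), c in puv.items())

def nmi(u, v):
    """Normalized MI with the sqrt(H(U) H(V)) normalization; range [0, 1]."""
    return mutual_information(u, v) / sqrt(entropy(u) * entropy(v))

def variation_of_information(u, v):
    """VI(U,V) = H(U) + H(V) - 2 I(U,V); a true metric, 0 iff U and V coincide."""
    return entropy(u) + entropy(v) - 2 * mutual_information(u, v)
```

Identical clusterings give NMI = 1 and VI = 0; independent ones give I(U,V) = 0.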
Correction for Chance for Information Theoretic based measures - Outline
• Introduction to clustering and clustering comparison
• A brief survey of clustering comparison measures
• How does chance agreement affect information theoretic based measures?
• Adjusted-for-chance measures
• Conclusion
How does chance agreement affect information theoretic based measures? – Scenario 1
• A known ground-truth clustering with Ktrue clusters
• Algorithm 1 generates a clustering with K1 clusters
• Algorithm 2 generates a clustering with K2 clusters
• If K1 > K2, would the comparison be “fair”?
[Figure: true clustering (Ktrue = 5) at distances d and d' from the clusterings of Algorithm 1 (K1 = 3) and Algorithm 2 (K2 = 7)]
How does chance agreement affect information theoretic based measures? – Experiment 1
• Fix a ground-truth clustering with Ktrue clusters
• 10000 random clusterings are generated for each value of K in the range [2, Kmax = 2Ktrue]
• Measure the average similarity from each set to the ground truth using the Normalized Mutual Information
[Figure: average distance d from the true clustering (Ktrue = Kmax/2) to the sets of random clusterings with K = 2, 3, …, Kmax]
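A scaled-down sketch of this experiment (200 trials instead of 10000, 60 points, stdlib only; `random_clustering` is an illustrative generator, not the authors' code) reproduces the trend: the average NMI to a fixed ground truth rises with K even though every clustering is purely random.

```python
import random
from collections import Counter
from math import log, sqrt

def entropy(u):
    n = len(u)
    return -sum((c / n) * log(c / n) for c in Counter(u).values())

def nmi(u, v):
    """NMI with the sqrt(H(U) H(V)) normalization."""
    n = len(u)
    pu, pv, puv = Counter(u), Counter(v), Counter(zip(u, v))
    mi = sum((c / n) * log(c * n / (pu[i] * pv[j])) for (i, j), c in puv.items())
    return mi / sqrt(entropy(u) * entropy(v))

def random_clustering(n, k, rng):
    """Random labeling of n points into k non-empty clusters (illustrative)."""
    labels = list(range(k)) + [rng.randrange(k) for _ in range(n - k)]
    rng.shuffle(labels)
    return labels

rng = random.Random(0)
n, k_true = 60, 5
truth = random_clustering(n, k_true, rng)

def avg_nmi_to_truth(k, trials=200):
    """Average NMI between random K-clusterings and the fixed ground truth."""
    return sum(nmi(random_clustering(n, k, rng), truth) for _ in range(trials)) / trials

for k in (2, 5, 10):
    print(k, round(avg_nmi_to_truth(k), 3))  # the baseline grows with k
```

This is exactly the bias the slide describes: under NMI, random clusterings with more clusters look "closer" to the ground truth.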
How does chance agreement affect information theoretic based measures? – Experiment 1
• Using NMI and RI: random clusterings with a larger number of clusters tend to lie closer to the ground truth than clusterings with fewer clusters
• The ARI appears not to be biased in favour of any particular number of clusters
How does chance agreement affect information theoretic based measures? – Scenario 2
• Selecting the appropriate number of clusters
• In clustering, K is unknown
• Approaches:
• For hierarchical clustering: 30 stopping-rule procedures (Milligan and Cooper (1985))
• For model-based clustering: the Bayesian Information Criterion (BIC)
• The Gap statistic
• …
• The stability assessment approach
How does chance agreement affect information theoretic based measures? – Scenario 2
• Selecting the appropriate number of clusters via stability assessment
• Generate multiple sets of clusterings, each set having the same number of clusters
• Measure the concordance within each set by calculating the average pairwise similarity value (the Consensus Index)
• A higher value indicates stability => a hint for selecting the true number of clusters
[Figure: sets of clusterings with #clusters = 2, …, Ktrue, …, Kmax]
How does chance agreement affect information theoretic based measures? – Experiment 2
• Experiment 2:
• Generate 200 random clusterings of N data points for each value of K in the range [2, Kmax]
• Measure the average pairwise similarity within each set using the Normalized Mutual Information
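The Consensus Index computation can be sketched the same way (scaled down to 15 clusterings per set, stdlib only; `random_clustering` is again an illustrative generator):

```python
import random
from collections import Counter
from itertools import combinations
from math import log, sqrt

def entropy(u):
    n = len(u)
    return -sum((c / n) * log(c / n) for c in Counter(u).values())

def nmi(u, v):
    """NMI with the sqrt(H(U) H(V)) normalization."""
    n = len(u)
    pu, pv, puv = Counter(u), Counter(v), Counter(zip(u, v))
    mi = sum((c / n) * log(c * n / (pu[i] * pv[j])) for (i, j), c in puv.items())
    return mi / sqrt(entropy(u) * entropy(v))

def random_clustering(n, k, rng):
    """Random labeling of n points into k non-empty clusters (illustrative)."""
    labels = list(range(k)) + [rng.randrange(k) for _ in range(n - k)]
    rng.shuffle(labels)
    return labels

def consensus_index(clusterings):
    """Average pairwise similarity (here: NMI) within a set of clusterings."""
    pairs = list(combinations(clusterings, 2))
    return sum(nmi(a, b) for a, b in pairs) / len(pairs)

rng = random.Random(0)
n = 60
for k in (2, 5, 10):
    ci = consensus_index([random_clustering(n, k, rng) for _ in range(15)])
    print(k, round(ci, 3))  # the baseline Consensus Index also grows with k
```

The demo shows the problem directly: with an unadjusted measure, sets of purely random clusterings look more "stable" simply because they have more clusters.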
How does chance agreement affect information theoretic based measures? – Experiment 2
• Using NMI and RI: the average pairwise similarity within sets of random clusterings with a larger number of clusters tends to be higher than that within sets of random clusterings with fewer clusters
• Using ARI: unbiased toward any particular number of clusters
Correction for Chance for Information Theoretic based measures - Outline
• Introduction to clustering and clustering comparison
• A brief survey of clustering comparison measures
• How does chance agreement affect information theoretic based measures?
• Adjusted-for-chance measures
• Conclusion
Adjusting information theoretic measures for chance
• Model of randomness: the hypergeometric distribution model (clusterings are created randomly, subject to the fixed-marginals condition, i.e. fixed cluster sizes)
• This is the model previously employed for the Adjusted Rand Index.
Adjusting information theoretic measures for chance
• The expected Mutual Information between a pair of random clusterings is given by:
E[I(U,V)] = Σi Σj Σnij (nij / N) log( N·nij / (ai·bj) ) · [ai! bj! (N−ai)! (N−bj)!] / [N! nij! (ai−nij)! (bj−nij)! (N−ai−bj+nij)!]
• where ai = |Ui|, bj = |Vj|, and the inner sum runs over nij from max(1, ai+bj−N) to min(ai, bj)
Adjusting Mutual Information (AMI)
• General formula for an adjusted similarity measure:
Adjusted Index = (Index − Expected Index) / (Max Index − Expected Index)
• The Adjusted Mutual Information:
AMI(U,V) = ( I(U,V) − E[I(U,V)] ) / ( max(H(U), H(V)) − E[I(U,V)] )
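Putting the pieces together, here is a direct sketch of the expected MI under the hypergeometric model and the resulting AMI (plain Python; the max(H(U), H(V)) upper bound is assumed here, and other normalizations are possible):

```python
from collections import Counter
from math import comb, log

def entropy(u):
    n = len(u)
    return -sum((c / n) * log(c / n) for c in Counter(u).values())

def mutual_information(u, v):
    n = len(u)
    pu, pv, puv = Counter(u), Counter(v), Counter(zip(u, v))
    return sum((c / n) * log(c * n / (pu[i] * pv[j]))
               for (i, j), c in puv.items())

def expected_mi(u, v):
    """E[I(U,V)] under the hypergeometric (fixed-marginals) model of randomness."""
    n = len(u)
    total = 0.0
    for ai in Counter(u).values():
        for bj in Counter(v).values():
            # nij ranges over the support of the hypergeometric distribution;
            # the nij = 0 term vanishes, so start at 1
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                p = comb(bj, nij) * comb(n - bj, ai - nij) / comb(n, ai)
                total += p * (nij / n) * log(n * nij / (ai * bj))
    return total

def ami(u, v):
    """AMI = (I - E[I]) / (max(H(U), H(V)) - E[I]): 1 for identical
    clusterings, ~0 on average for random ones."""
    e = expected_mi(u, v)
    return (mutual_information(u, v) - e) / (max(entropy(u), entropy(v)) - e)

print(ami([0, 0, 1, 1], [1, 1, 0, 0]))  # identical partitions -> 1.0
```

Like the ARI, the AMI can go negative when two clusterings agree less than chance predicts.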
Experiment 1
• Variation due to chance is negligible
• Note: the data are not generated according to the assumed model
Experiment 2
• Variation due to chance is negligible
• Note: the data are not generated according to the assumed model
Conclusion & Future work
• Information theoretic measures for clustering comparison are affected by chance, especially when the number of data points per cluster is small
• Adjusted-for-chance measures have been proposed
• They work well in practice, despite the hypergeometric assumption of randomness
• Code: http://ee.unsw.edu.au/~nguyenv/Software.htm
• What are the differences between the ARI and the AMI?
'Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance', N.X. Vinh, Epps, J. and Bailey, J., to be submitted.