Measures of Clustering Quality: A Working Set of Axioms for Clustering Margareta Ackerman Joint work with Shai Ben-David
The Theory-Practice Gap Clustering is one of the most widely used tools for exploratory data analysis. Social Sciences, Biology, Astronomy, Computer Science, ... all apply clustering to gain a first understanding of the structure of large data sets. Yet there is distressingly little theoretical understanding of clustering.
Questions that research of fundamentals of clustering should address • Can clustering be given a formal and general definition? • What is a “good” clustering? • Can we distinguish “clusterable” from “structureless” data?
Inherent Obstacles • Clustering is not well defined. There is a wide variety of different clustering tasks, with different (often implicit) measures of quality. • In most practical clustering tasks there is no clear ground truth to evaluate your solution by (in contrast with classification tasks, in which you can have a held-out labeled set to evaluate the classifier against). • A clustering may have different value to different users, e.g. clustering paintings by painter vs. by topic.
Common Solutions • Objective utility functions: Sum of In-Cluster Distances, Average Distances to Center Points, Cut Weight, Spectral Clustering, etc. (Shmoys, Charikar, Meyerson, Luxburg, ...). Analyze the computational complexity of discrete optimization problems. • Consider a restricted set of distributions (“generative models”), e.g. mixtures of Gaussians [Dasgupta ‘99], [Vempala ‘03], [Kannan et al. ‘04], [Achlioptas, McSherry ‘05]. Recover the parameters of the model generating the data. • Add structure: “Relevant Information”, e.g. the information bottleneck approach [Tishby, Pereira, Bialek ‘99]. Factor out user-irrelevant information. • Many more…
Quest for a General Theory What can we say independently of any specific algorithm, specific objective function, or specific generative data model? Clustering Axioms: Postulate axioms that, ideally, every clustering approach should satisfy, e.g. [Hartigan 1975], [Puzicha, Hofmann, Buhmann ‘00], [Kleinberg ‘02]. These attempts usually conclude with negative results.
Our Formal Setup For a finite domain set S, a distance function d gives the distance between each pair of domain points. A Clustering Function maps Input: a distance function d over S to Output: a partition (clustering) of S.
Kleinberg’s Work on Clustering Functions Kleinberg proposes natural-looking “Axioms” that distinguish clustering functions from other functions that output domain partitions.
Kleinberg’s Axioms • Scale Invariance: F(λd) = F(d) for all d and all strictly positive λ. • Consistency: If d’ equals d, except for shrinking distances within clusters of F(d) or stretching between-cluster distances, then F(d) = F(d’). • Richness: For any partition P of S, there exists a distance function d over S so that F(d) = P.
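To make the clustering-function setting concrete, here is a minimal sketch that checks Scale Invariance numerically for one example function F. The choice of F (single linkage stopped at k clusters), the helper names `single_linkage` and `scaled`, and the toy points are all assumptions made for illustration; they are not taken from the talk.

```python
from itertools import combinations

def single_linkage(points, d, k):
    """Illustrative clustering function: merge the two closest clusters until k remain."""
    clusters = [frozenset([p]) for p in points]
    while len(clusters) > k:
        # Single linkage: cluster distance is the smallest point-to-point distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: min(d(x, y) for x in pair[0] for y in pair[1]),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return frozenset(clusters)

def scaled(d, lam):
    """Return the distance function lam * d."""
    return lambda x, y: lam * d(x, y)

# Hypothetical toy data: points on a line with absolute difference as distance.
points = [0.0, 1.0, 1.5, 10.0, 12.0]
d = lambda x, y: abs(x - y)
F = lambda dist: single_linkage(points, dist, k=2)

partition = F(d)
# Scale Invariance asks that F(lam * d) == F(d) for every strictly positive lam;
# only a few values of lam are tried here.
same_after_scaling = all(F(scaled(d, lam)) == partition for lam in (0.1, 2.0, 50.0))
print(partition, same_after_scaling)
```

Such a check can only refute an axiom on specific inputs; it does not prove that a function satisfies it for all d.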
Theorem [Kleinberg, 2002]: These axioms are inconsistent. Namely, no function can satisfy all three axioms. How can “axioms” that seem to capture our intuition about clustering be inconsistent? Our answer: The formalization of these axioms is stronger than the intuition they intend to capture. We express that same intuition in an alternative framework, and achieve consistency.
Clustering-Quality Measures How good is this clustering? Clustering-quality measures quantify the quality of clusterings.
Defining Clustering-Quality Measures A clustering-quality measure is a function m(dataset, clustering) ∈ R satisfying some properties that make this function a meaningful clustering-quality measure. What properties should it satisfy?
Rephrasing Kleinberg’s axioms as clustering-quality measures axioms • Scale Invariance: m(C,d) = m(C,λd) for all d, all strictly positive λ, and all clusterings C over d. • Richness: For any clustering C of S, there exists a distance function d over S so that C = argmax_{C’} m(C’,d).
Rephrasing Kleinberg’s axioms as clustering-quality measures axioms • Consistency: If d’ equals d, except for shrinking distances within clusters of C or stretching between-cluster distances, then m(C,d) ≤ m(C,d’). (Figure: clustering C under d and under d’.)
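The Consistency requirement for quality measures can be tried out on a small example. The sketch below uses Dunn's index (one of the measures the talk later lists as satisfying the axioms) in a simple form: smallest between-cluster distance divided by largest within-cluster distance. The function names, the shrink and stretch factors, and the toy data are assumptions for illustration, and the transformed distance is not checked for the triangle inequality.

```python
from itertools import combinations

def dunn_index(clusters, d):
    """Smallest between-cluster distance divided by largest within-cluster distance."""
    between = min(d(x, y) for a, b in combinations(clusters, 2) for x in a for y in b)
    within = max(d(x, y) for c in clusters for x, y in combinations(c, 2))
    return between / within

def consistent_change(clusters, d, shrink=0.5, stretch=2.0):
    """Build d' from d by shrinking within-cluster and stretching between-cluster distances."""
    label = {x: i for i, c in enumerate(clusters) for x in c}

    def d_prime(x, y):
        factor = shrink if label[x] == label[y] else stretch
        return factor * d(x, y)

    return d_prime

# Hypothetical toy data: two clusters of points on a line.
clusters = [(0.0, 1.0), (5.0, 7.0)]
d = lambda x, y: abs(x - y)
d_prime = consistent_change(clusters, d)

before = dunn_index(clusters, d)        # 4.0 / 2.0
after = dunn_index(clusters, d_prime)   # 8.0 / 1.0
print(before, after, before <= after)
```

Here the quality of C rises from 2.0 to 8.0, in line with m(C,d) ≤ m(C,d’); the axiom says nothing about how other clusterings score under d’.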
An Additional Axiom Clusterings C over (X,d) and C’ over (X,d’) are isomorphic if there exists a distance-preserving automorphism f: X → X such that x, y share the same C-cluster iff f(x) and f(y) share the same C’-cluster. Isomorphism Invariance: If C and C’ are isomorphic, then m(C,d) = m(C’,d’).
Major Gain – Consistency of New Axioms • Theorem: Consistency, scale invariance, richness, and isomorphism invariance for clustering-quality measures form a consistent set of requirements. We prove this result by demonstrating measures that satisfy these axioms. Moreover, every reasonable clustering-quality measure (CQM) satisfies our axioms.
An example of a CQM for center-based clustering: Relative Margin The Relative Margin of a point x in C is (distance from x to its closest center) / (distance from x to its second-closest center). The Relative Margin of C is the average relative margin over all non-center points (over all possible center settings). Relative Margin satisfies scale-invariance, consistency, richness, and isomorphism invariance.
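The per-point ratio and its average can be sketched in a few lines. This is only an illustration under stated assumptions: the centers are given as an explicit list, a single center setting is evaluated (the "over all possible center settings" part of the definition is not handled), and the function name and toy data are hypothetical.

```python
def relative_margin(points, centers, d):
    """Average of (distance to closest center) / (distance to 2nd closest center)."""
    ratios = []
    for x in points:
        if x in centers:
            continue  # the average runs over non-center points only
        nearest, second = sorted(d(x, c) for c in centers)[:2]
        ratios.append(nearest / second)
    return sum(ratios) / len(ratios)

# Hypothetical toy data: two well-separated groups on a line.
d = lambda x, y: abs(x - y)
points = [0, 1, 2, 10, 11, 12]

good = relative_margin(points, centers=[1, 11], d=d)  # one center per group
poor = relative_margin(points, centers=[0, 2], d=d)   # both centers in one group
rescaled = relative_margin(points, [1, 11], lambda x, y: 3 * d(x, y))
print(good, poor, rescaled)
```

Because the measure is a ratio of distances, multiplying d by a positive constant leaves it unchanged, which is the scale-invariance property named above. With this definition, smaller values indicate points that sit much closer to their own center than to the next one.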
Additional CQMs Satisfying Our Axioms • C-index (Dalrymple-Alford, 1970) • Gamma (Baker & Hubert, 1975) • Adjusted ratio of clustering (Roenker et al., 1971) • D-index (Dalrymple-Alford, 1970) • Modified ratio of repetition (Bower, Lesgold, and Tieman, 1969) • Dunn’s index (Dunn, 1973) • Variations of Dunn’s index (Bezdek and Pal, 1998) • Strict separation (based on Balcan, Blum, and Vempala, 2008) • And many more...
Why is the CQM formalism more faithful to intuition? In the setting of clustering functions, the consistency axiom requires that consistent changes to the underlying distance should not create any new contenders for the best clustering of the data. (Figure: clustering C under d and under d’, alongside an alternative clustering C’.) A clustering function that satisfies Kleinberg’s Consistency cannot output C’.
Why is the CQM formalism more faithful to intuition? In the setting of clustering-quality measures, the consistency axiom requires only that the quality of a given clustering C does not get worse. (Figure: clustering C under d and under d’, alongside an alternative clustering C’.) While the quality of C improves, a different clustering, C’, can still have better quality.
Summary • The intuition behind Kleinberg’s axioms is consistent (in spite of his impossibility result). • The Impossibility Result can be overcome by a change of formalism. • We do this by focusing on clustering-quality measures. • Every reasonable clustering-quality measure satisfies our axioms.
Future Work • How can the “completeness” of a set of axioms be argued? • Are the axioms useful for gaining interesting new insights about clusterings? • Can we find properties that distinguish different clustering paradigms?
Appendix: Another Clustering-Quality Measure: Gamma (Baker & Hubert, 1975) Gamma is the best-performing measure in Milligan’s study of 30 internal criteria (Milligan, 1981). • Let d(+) denote the number of times that points which were clustered together in C had distance greater than two points which were not in the same cluster. • Let d(-) denote the opposite result. Gamma satisfies scale-invariance, consistency, richness, and isomorphism invariance.
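The two counts can be computed by comparing every within-cluster distance against every between-cluster distance. The slide does not show how d(+) and d(-) are combined, so the normalized difference below, (d(-) − d(+)) / (d(+) + d(-)), is an assumption chosen so that higher values mean within-cluster distances tend to be smaller; the function names and toy data are likewise hypothetical.

```python
from itertools import combinations

def gamma_counts(clusters, d):
    """Return (d_plus, d_minus) following the slide's wording.

    d_plus:  comparisons where a within-cluster distance exceeds a between-cluster one.
    d_minus: comparisons where a within-cluster distance is smaller.
    """
    label = {x: i for i, c in enumerate(clusters) for x in c}
    pairs = list(combinations(label, 2))
    within = [d(x, y) for x, y in pairs if label[x] == label[y]]
    between = [d(x, y) for x, y in pairs if label[x] != label[y]]
    d_plus = sum(1 for w in within for b in between if w > b)
    d_minus = sum(1 for w in within for b in between if w < b)
    return d_plus, d_minus

def gamma_index(clusters, d):
    """Assumed combination: normalized difference, higher is better."""
    d_plus, d_minus = gamma_counts(clusters, d)
    return (d_minus - d_plus) / (d_plus + d_minus)

# Hypothetical toy data: four points on a line, clustered two ways.
d = lambda x, y: abs(x - y)
tight = [(0, 1), (10, 11)]
mixed = [(0, 10), (1, 11)]
print(gamma_counts(tight, d), gamma_index(tight, d))
print(gamma_counts(mixed, d), gamma_index(mixed, d))
```

Only the ordering of distances matters in these counts, so rescaling d by a positive constant leaves both of them unchanged.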
Variants of Quality Measures Given a clustering-quality measure m, we can create new ones by applying it to a subset of the clusters. m_min(C,d) = min_S m(S,d), where S ranges over subsets of at least 2 clusters in C. Similarly, we can define m_max and m_average. If m satisfies the axioms of clustering-quality measures, then so do m_min, m_max, and m_average.
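The construction can be sketched by enumerating every subset of at least two clusters and aggregating the base measure over them. In this illustration the base measure is a simple form of Dunn's index, and the names `variant` and `dunn_index`, as well as the toy data, are assumptions. The sketch also assumes that m applied to a subset S only looks at points in the clusters of S.

```python
from itertools import combinations

def dunn_index(clusters, d):
    """Example base measure: smallest between-cluster / largest within-cluster distance."""
    between = min(d(x, y) for a, b in combinations(clusters, 2) for x in a for y in b)
    within = max(d(x, y) for c in clusters for x, y in combinations(c, 2))
    return between / within

def variant(m, clusters, d, aggregate):
    """Apply m to every subset of at least 2 clusters and aggregate the scores."""
    scores = [
        m(list(subset), d)
        for size in range(2, len(clusters) + 1)
        for subset in combinations(clusters, size)
    ]
    return aggregate(scores)

# Hypothetical toy data: three clusters of points on a line.
d = lambda x, y: abs(x - y)
clusters = [(0, 1), (5, 7), (20, 21)]

m_min = variant(dunn_index, clusters, d, min)
m_max = variant(dunn_index, clusters, d, max)
m_average = variant(dunn_index, clusters, d, lambda s: sum(s) / len(s))
print(m_min, m_max, m_average)
```

The number of subsets grows exponentially with the number of clusters, so this direct enumeration is only practical for small clusterings.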