Generalization Bounds for Clustering - Some Thoughts and Many Questions
Shai Ben-David, University of Waterloo
Canada, Dec 2004
The Goal
Provide rigorous generalization bounds for clustering. Why? It would be useful to have assurances that the clusterings we produce are meaningful, rather than just an artifact of data randomness.
1st Step: A Formal Model for Sample-Based Clustering
• There is some large, possibly infinite, domain set X.
• An unknown probability distribution over X generates an i.i.d. sample.
• Upon viewing such a sample, a learner wishes to deduce a clustering, as a simple, yet meaningful, description of the distribution.
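A minimal simulation of this setting. The two-Gaussian distribution and the choice of k-means are illustrative assumptions standing in for "an unknown distribution" and "our clustering algorithm"; neither is fixed by the model itself.

```python
# Minimal simulation of the sample-based clustering model.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def draw_sample(m):
    """Draw an i.i.d. sample of size m from a distribution over X = R^2
    (here, an equal-weight mixture of two unit Gaussians)."""
    centers = np.array([[0.0, 0.0], [4.0, 4.0]])
    which = rng.integers(0, 2, size=m)
    return centers[which] + rng.normal(size=(m, 2))

S = draw_sample(200)
C = KMeans(n_clusters=2, n_init=10, random_state=0).fit(S)
print(C.labels_[:10])  # the clustering deduced from the sample
```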
2nd Step: What Should a Bound Look Like?
Roughly, we wish to be able to say: if sufficiently many sample points have been drawn, then the clustering we come up with is “stable”.
What Should a Bound Look Like? More Formally
If S1, S2 are sufficiently large i.i.d. samples from the same distribution, then, w.h.p., C(S1) is ‘similar’ to C(S2), where C(S) is the clustering we get by applying our clustering algorithm to S.
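Read as an experiment, the bound suggests a two-sample protocol. A sketch follows; comparing clusterings of *different* samples is the subtle part, so a crude proxy (distance between coordinate-wise sorted center sets) stands in for ‘similar’ here, pending the extension-operator definition on the later slides.

```python
# Skeleton of the two-sample stability experiment.
import numpy as np
from sklearn.cluster import KMeans

def cluster(S, k=2):
    return KMeans(n_clusters=k, n_init=10).fit(S)

def center_proxy_distance(C1, C2):
    # Crude stand-in for dissimilarity of clusterings of different samples.
    a = np.sort(C1.cluster_centers_, axis=0)
    b = np.sort(C2.cluster_centers_, axis=0)
    return float(np.linalg.norm(a - b))

m = 500
S1, S2 = draw_sample(m), draw_sample(m)  # draw_sample from the sketch above
print(center_proxy_distance(cluster(S1), cluster(S2)))  # small w.h.p. for large m
```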
How Is It Different from Classification Bounds?
Classification generalization bounds guarantee the convergence of the loss of the hypothesis: “For any distribution P and large enough samples S, L(A(S)) is close to L(A(P)).”
Since for clustering there is no natural analogue of the true distribution cost L(A(P)), we consider its ‘stability’ implication: “If S1, S2 are sufficiently large i.i.d. samples from the same distribution, then, w.h.p., L(A(S1)) is close to L(A(S2)).”
Here, for clustering, we seek a stronger statement, namely: “If S1, S2 are sufficiently large i.i.d. samples from the same distribution, then, w.h.p., C(S1) is ‘similar’ to C(S2).”
A Different Perspective – Replication
From a more traditional scientific-methodology point of view, stability can be viewed as the fundamental issue of replication -- to what extent are the results of an experiment reproducible? Replication has been investigated in many applications of clustering, but mostly by visual inspection of the results of cluster analysis on two samples.
Some Issues Need Clarification:
How should similarity between clusterings be defined? There are two notions to be defined: similarity between clusterings of the same set, and similarity between clusterings of different sets. Similarity between two clusterings of the same set has been extensively discussed in the literature (see, e.g., Meila in COLT’03); one concrete choice is sketched below.
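One such same-set measure, as an illustrative choice (not the one fixed by the talk; Meila’s paper discusses several): the fraction of points on which two labelings agree under the best matching of cluster names.

```python
# Same-set similarity: match the cluster names of two labelings of the SAME
# sample so as to maximize agreement, then report the agreeing fraction.
import numpy as np
from scipy.optimize import linear_sum_assignment

def same_set_similarity(labels1, labels2, k):
    # confusion[a, b] = #points labeled a by one clustering and b by the other
    confusion = np.zeros((k, k), dtype=int)
    for a, b in zip(labels1, labels2):
        confusion[a, b] += 1
    rows, cols = linear_sum_assignment(-confusion)  # maximize matched mass
    return confusion[rows, cols].sum() / len(labels1)
```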
Reducing the Second Notion to the First:
A common approach to defining similarity between clusterings of different sets is to reduce it to a definition of similarity between clusterings of the same set. This is done via an extension operator -- a method for extending a clustering of a domain subset to a clustering of the full domain (Breckenridge ’89, Roth et al. COMPSTAT’02, and BD in COLT’04). Examples of such extensions are Nearest Neighbor and Center-Based clustering.
Reducing Similarity over Two Sets to Similarity over the Same Set:
For a clustering C1 of S1 (or C2 of S2), use the extension operator to extend C1 to a clustering C1,2 of S2 (or C2 to a clustering C2,1 of S1, respectively). Given a similarity measure d for same-set clusterings, define a similarity measure
D(C1, C2) = ½ (d(C1, C2,1) + d(C2, C1,2))
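A sketch of this reduction, assuming the Nearest Neighbor extension operator and, as the measure d, the same-set similarity from the earlier sketch (both illustrative choices):

```python
# Cross-sample similarity D via the Nearest Neighbor extension operator.
# Uses same_set_similarity() from the earlier sketch as the measure d.
import numpy as np

def nn_extend(S_from, labels_from, S_to):
    """Extension operator: each point of S_to inherits the label of its
    nearest neighbor in S_from."""
    labels_from = np.asarray(labels_from)
    dists = np.linalg.norm(S_to[:, None, :] - S_from[None, :, :], axis=2)
    return labels_from[dists.argmin(axis=1)]

def cross_similarity(S1, labels1, S2, labels2, k):
    """D(C1, C2) = 1/2 * ( d(C1, C2,1) + d(C2, C1,2) )."""
    labels_21 = nn_extend(S2, labels2, S1)  # C2,1 : C2 extended to S1
    labels_12 = nn_extend(S1, labels1, S2)  # C1,2 : C1 extended to S2
    return 0.5 * (same_set_similarity(labels1, labels_21, k)
                  + same_set_similarity(labels2, labels_12, k))
```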
Types of Potential Bounds: 1. Fixed # of Clusters
If the number of clusters, k, is fixed, there is no hope of distribution-free stability results.
Example 1: the uniform distribution over a circle.
Example 2: a square with 4 equal-mass bumps on its corners -- bad for k ≠ 4.
Example 3: concentric rings -- bad for center-based clustering algorithms.
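Example 1 is easy to see experimentally. A small sketch (2-means is an illustrative choice of algorithm): the diameter along which the circle is split is arbitrary, so it varies from sample to sample.

```python
# Example 1: the uniform distribution on a circle is intrinsically unstable
# for fixed k = 2 -- the direction of the 2-means split is arbitrary.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng()

def circle_sample(m):
    theta = rng.uniform(0, 2 * np.pi, size=m)
    return np.column_stack([np.cos(theta), np.sin(theta)])

for trial in range(3):
    C = KMeans(n_clusters=2, n_init=10).fit(circle_sample(500))
    dx, dy = C.cluster_centers_[0] - C.cluster_centers_[1]
    # The split direction drifts across independent samples.
    print(f"trial {trial}: split direction = {np.degrees(np.arctan2(dy, dx)):.0f} deg")
```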
What Can We Currently Prove? (Not too much …)
Von Luxburg, Bousquet, and Belkin (this NIPS) analyze when Spectral Clustering converges to a global clustering of the domain space.
Koltchinskii (2002) proved that if the underlying distribution is generated by a certain tree structure of Gaussians, then a clustering algorithm can recover this structure from random samples.
BD (COLT 2004) showed distribution-free convergence rates for the limited issue of the clustering loss function.
Fixed # of Clusters – Natural Questions
• What is the “Intrinsic Instability” of a given sample distribution? (Buhmann et al.)
• What levels of intrinsic instability render a clustering meaningless?
• Can one characterize (useful) families of probability distributions for which cluster stability holds (i.e., the intrinsic instability is zero)?
Types of Potential Bounds: 2. Let the Algorithm Choose k
Now there may be hope for distribution-free bounds (for example, the algorithm may choose to have just one cluster for a uniform distribution).
Major issue: a tradeoff between the stability and the “information content” of a clustering.
Potential Uses of Bounds:
• To assure that the outcome of a clustering algorithm is meaningful.
• Model selection: choose the number of clusters that maximizes a stability-based criterion (Lange-Braun-Roth-Buhmann, NIPS’02); a sketch follows below.
• Help detect changes in the sample-generating distribution (“the two-sample problem”).
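A minimal sketch of the model-selection idea, in the spirit of the Lange et al. criterion rather than their exact protocol, reusing the pieces sketched on the earlier slides:

```python
# Stability-based model selection: choose the k whose clusterings replicate
# best across independent sample pairs. Uses draw_sample() and
# cross_similarity() from the earlier sketches.
import numpy as np
from sklearn.cluster import KMeans

def stability_score(k, m=500, trials=10):
    scores = []
    for _ in range(trials):
        S1, S2 = draw_sample(m), draw_sample(m)
        C1 = KMeans(n_clusters=k, n_init=10).fit(S1)
        C2 = KMeans(n_clusters=k, n_init=10).fit(S2)
        scores.append(cross_similarity(S1, C1.labels_, S2, C2.labels_, k))
    return float(np.mean(scores))

# Caveat from the previous slide (stability vs. information content):
# k = 1 is trivially stable, so the raw score must be normalized or
# penalized before comparing across k; here we simply start from k = 2.
print(max(range(2, 8), key=stability_score))
```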