270 likes | 474 Views
A UNIQUENESS THEOREM FOR CLUSTERING. Reza Bosagh Zadeh (Carnegie Mellon) Shai Ben-David (Waterloo) UAI 09, Montréal, June 2009. TALK OUTLINE. Questions being addressed Introduce Axioms Introduce Properties Uniqueness Theorem for Single-Linkage Taxonomy of Partitioning Functions.
E N D
A UNIQUENESS THEOREMFOR CLUSTERING Reza Bosagh Zadeh (Carnegie Mellon) Shai Ben-David (Waterloo) UAI 09, Montréal, June 2009
TALK OUTLINE • Questions being addressed • Introduce Axioms • Introduce Properties • Uniqueness Theorem for Single-Linkage • Taxonomy of Partitioning Functions
WHAT IS CLUSTERING? • Given a collection of objects (characterized by feature vectors, or just a matrix of pair-wise similarities), detects the presence of distinct groups, and assign objects to groups.
SOME BASIC UNANSWERED QUESTIONS • Are there principles governing all clustering paradigms? • Which clustering paradigm should I use for a given task?
THERE ARE MANY CLUSTERING TASKS “Clustering” is an ill-defined problem • There are many different clustering tasks, leading to different clustering paradigms:
THERE ARE MANY CLUSTERING TASKS “Clustering” is an ill-defined problem • There are many different clustering tasks, leading to different clustering paradigms:
WE WOULD LIKE TO DISCUSS THE BROAD NOTION OF CLUSTERING Independently of any • particular algorithm, • particular objective function, or • particular generative data model
WHAT FOR? • Choosing a suitable algorithm for a given task. • Axioms:to capture intuition about clustering in general. • Expected to be satisfied by all clustering functions • Properties: to capture differences between different clustering paradigms
TIMELINE • Jardine, Sibson 1971 • Considered only hierarchical functions • Kleinberg 2003 • Presented an impossibility result • This paper • Presents a uniqueness result for Single-Linkage
THE BASIC SETTING • For a finite domain set S, a distance function is a symmetric mapping d:SxS → R+ such that d(x,y)=0 iff x=y. • A partitioning function takes a dissimilarity function on S and returns a partition of S. • We wish to define the axioms that distinguishclustering functions, from any other functions that output domain partitions.
KLEINBERG’S AXIOMS • Scale Invariance F(λd)=F(d) for all d and all strictly positive λ. • Richness The range of F(d) over all d is the set of all possible partitionings • Consistency If d’ equals d except for shrinking distances within clusters of F(d) or stretching between-cluster distances, then F(d) = F(d’).
KLEINBERG’S AXIOMS • Scale Invariance F(λd)=F(d) for all d and all strictly positive λ. • Richness The range of F(d) over all d is the set of all possible partitionings • Consistency If d’ equals d except for shrinking distances within clusters of F(d) or stretching between-cluster distances, then F(d) = F(d’). Inconsistent! No algorithm can satisfy all 3 of these.
CONSISTENT AXIOMS: Fix k • Scale Invariance F(λd, k)=F(d, k) for all d and all strictly positive λ. • k-Richness The range of F(d, k) over all d is the set of all possible k-partitionings • Consistency If d’ equals d except for shrinking distances within clusters of F(d, k) or stretching between-cluster distances, then F(d, k)=F(d’, k). Consistent! (And satisfied by Single-Linkage, Min-Sum, …)
CLUSTERING FUNCTIONS • Definition. Call any partitioning function which satisfies a Clustering Function • Scale Invariance • k-Richness • Consistency
TWO CLUSTERING FUNCTIONS Single-Linkage Min-Sum k-Clustering • Start with with all points in their own cluster • While there are more than k clusters • Merge the two most similar clusters Find the k-partitioning Γ which minimizes Not Hierarchical (Is NP-Hard to optimize) Both Functions satisfy: Similarity between two clusters is the similarity of the most similar two points from differing clusters • Scale Invariance • k-Richness • Consistency Hierarchical Proofs in paper.
CLUSTERING FUNCTIONS • Single-Linkage and Min-Sum are both Clustering functions. • How to distinguish between them in an Axiomatic framework? Use Properties • Not all properties are desired in every clustering situation: pick and choose properties for your task
PROPERTIES - ORDER-CONSISTENCY • Order-Consistency • If two datasets d and d’ have the same ordering of the distances, then for all k, • F(d, k)=F(d’, k) • In other words the clustering function only cares about whether a pair of points are closer/further than another pair of points. • Satisfied by Single-Linkage, Max-Linkage, Average-Linkage… • NOT satisfied by most objective functions (Min-Sum, k-means, …)
PATH-DISTANCE In other words, we find the path from x to y, which has the smallest longest jump in it. 1 2 e.g. Pd( , ) = 2 Since the path from above has a jump of distance 2 7 Undrawn edges are large 4 1
PATH-DISTANCE • Imagine each point is an island, and we would like to go from island a to island b. As if we’re trying to cross a river by jumping on rocks. • Being human,we are restricted in how far we can jump from island to island. • Path-Distance would have us find the path with the smallest longest jump, ensuring that we could complete all the jumps successfully.
PROPERTIES – PATH-DISTANCE COHERENCE • Path-Distance Coherence • If two datasets d and d’ have the same induced path distance then for all k, • F(d, k)=F(d’, k)
UNIQUENESS THEOREM • Theorem (This work) • Single-Linkage is the only clustering function satisfying Order-Consistency and Path Distance-Coherence
UNIQUENESS THEOREM • Theorem (This work) • Single-Linkage is the only clustering function satisfying Order-Consistency and Path-Distance-Coherence • Is Path-Distance-Coherence doing all the work? No. • Consistency is necessary for uniqueness • k-Richness is necessary • “X is Necessary”: All other axioms/properties satisfied, just X missing, still not enough to get uniqueness
PRACTICAL CONSIDERATIONS • Single-Linkage is not always the right function to use. • Because Path-Distance-Coherence is not always desirable. • It’s not always immediately obvious when we want a function to focus on the Path Distance • Introduce a different formulation involving Minimum Spanning Trees
PROPERTIES - MST-COHERENCE If F , 2 Then F , 2 , 2 20 20 20 20 • MST-Coherence • If two datasets d and d’ have the same Minimum Spanning Tree then for all k, • F(d, k)=F(d’, k)
A TAXONOMY OF CLUSTERING FUNCTIONS • Min-Sum satisfies neither MST-Coherence nor Order-Consistency • Future work: Characterize other clustering functions