Explore the power of kernel functions for learning to distinguish between men and women, even with poor pixel representations. Investigate conditions for kernels to be useful for learning and clustering, with a focus on similarity functions. Analyze the characteristics of good similarity functions and their implications for classification problems.
A Theory of Learning and Clustering via Similarity Functions. Maria-Florina Balcan, Carnegie Mellon University. Joint work with Avrim Blum and Santosh Vempala. 09/17/2007
2-Minute Version. Generic classification problem: learn to distinguish men from women. Problem: the pixel representation is not so good. Powerful technique: use a kernel, a special kind of similarity function with a nice SLT theory behind it. But that theory is stated in terms of implicit mappings. Can we develop a theory that views K as a measure of similarity? What are general sufficient conditions for K to be useful for learning?
2-Minute Version (continued). Generic classification problem: learn to distinguish men from women. Problem: the pixel representation is not so good. Powerful technique: use a kernel, a special kind of similarity function. What if we don't have any labeled data (i.e., clustering)? Can we develop a theory of conditions sufficient for K to be useful now?
Part I: On Similarity Functions for Classification
Kernel Functions and Learning. E.g., given images labeled by gender, learn a rule to distinguish men from women. [Goal: do well on new data.] Problem: our best algorithms learn linear separators, which are not good for data in its natural representation. Old approach: learn a more complex class of functions. New approach: use a kernel.
Kernels, Kernelizable Algorithms. K is a kernel if there exists an implicit mapping φ s.t. K(x,y) = φ(x)·φ(y). Point: many algorithms interact with data only via dot-products. If we replace x·y with K(x,y), the algorithm acts implicitly as if the data were in the higher-dimensional φ-space. If the data is linearly separable by a large margin γ in φ-space, we don't have to pay in terms of sample complexity or computation time: with margin γ in φ-space, only about 1/γ² examples are needed to learn well.
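To make the "interact with data only via dot-products" point concrete, here is a minimal kernel-perceptron sketch in Python (not part of the talk; the quadratic kernel and the toy data are made-up illustrations):

```python
import numpy as np

def kernel_perceptron(X, y, K, epochs=10):
    """Perceptron that touches the data only through K(x_i, x_j) (the kernel trick).

    X: (n, d) array of examples, y: length-n array of +/-1 labels,
    K: similarity/kernel function K(a, b) -> float.
    Returns dual coefficients alpha, one per training example.
    """
    n = len(X)
    gram = np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            # implicit prediction: sign(sum_j alpha_j * y_j * K(x_j, x_i))
            if y[i] * np.dot(alpha * y, gram[:, i]) <= 0:
                alpha[i] += 1.0
    return alpha

def kp_predict(x, X, y, alpha, K):
    return np.sign(sum(a * yi * K(xi, x) for a, yi, xi in zip(alpha, y, X)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 2))
    y = np.where(np.linalg.norm(X, axis=1) > 1.0, 1.0, -1.0)  # circular boundary, not linearly separable in R^2
    quad = lambda a, b: (a @ b + 1.0) ** 2                     # hypothetical quadratic kernel
    alpha = kernel_perceptron(X, y, quad)
    errs = sum(kp_predict(x, X, y, alpha, quad) != yi for x, yi in zip(X, y))
    print("training errors:", errs)
```

The point of the sketch is only that the algorithm never looks at coordinates of φ(x); it sees the data exclusively through K.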
Kernels and Similarity Functions. Kernels: useful for many kinds of data, elegant SLT. Our work: analyze more general similarity functions. Characterization of good similarity functions: 1) In terms of natural, direct properties: no implicit high-dimensional spaces, no requirement of positive-semidefiniteness. 2) If K satisfies these properties, it can be used for learning. 3) The notion is broad: it includes the usual notion of a "good kernel", i.e., one under which the data has a large-margin separator in φ-space.
A First Attempt: Definition Satisfying (1) and (2). Let P be a distribution over labeled examples (x, l(x)). K:(x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1−ε prob. mass of x satisfy: E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ. E.g., K(x,y) ≥ 0.2 when l(x) = l(y); K(x,y) random in [-1,1] when l(x) ≠ l(y). Note: such a K might not be a legal kernel.
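A small Python sketch of what this condition asks for empirically, estimating the per-point gap between same-label and different-label average similarity (an illustration, not code from the paper; X, labels, and K are assumed to be supplied by the caller):

```python
import numpy as np

def goodness_gaps(X, labels, K):
    """For each x, estimate E[K(x,y) | l(y)=l(x)] - E[K(x,y) | l(y)!=l(x)].

    K is (eps, gamma)-good in the sense above if at least a 1-eps fraction of
    points have a gap of at least gamma.  Assumes every class has >= 2 points.
    """
    gaps = []
    for i, x in enumerate(X):
        same = [K(x, y) for j, y in enumerate(X) if j != i and labels[j] == labels[i]]
        diff = [K(x, y) for j, y in enumerate(X) if labels[j] != labels[i]]
        gaps.append(np.mean(same) - np.mean(diff))
    return np.array(gaps)

# e.g. the fraction of points achieving a gap of at least 0.05:
#   gaps = goodness_gaps(X, labels, K); print(np.mean(gaps >= 0.05))
```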
A First Attempt: Definition Satisfying (1) and (2). How to use it? K:(x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1−ε prob. mass of x satisfy: E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ. Algorithm: Draw a set S+ of O((1/γ²) ln(1/δ²)) positive examples and a set S− of O((1/γ²) ln(1/δ²)) negative examples. Classify x based on which set gives the better (higher average similarity) score. Guarantee: with probability ≥ 1−δ, error ≤ ε + δ.
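A sketch of this algorithm in Python (illustrative only: the Gaussian blobs and the cosine similarity stand in for real data and a real K, and the sample sizes are not tuned to the O((1/γ²) ln(1/δ²)) bound):

```python
import numpy as np

def classify(x, S_plus, S_minus, K):
    """Label x by whichever drawn sample it is, on average, more similar to."""
    score_plus = np.mean([K(x, y) for y in S_plus])
    score_minus = np.mean([K(x, z) for z in S_minus])
    return +1 if score_plus >= score_minus else -1

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    pos = rng.normal(loc=+1.0, size=(60, 5))   # toy positive class
    neg = rng.normal(loc=-1.0, size=(60, 5))   # toy negative class
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    S_plus, S_minus = pos[:20], neg[:20]        # the drawn samples S+ and S-
    test = [(x, +1) for x in pos[20:]] + [(x, -1) for x in neg[20:]]
    err = np.mean([classify(x, S_plus, S_minus, cos) != y for x, y in test])
    print("test error:", err)
```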
A First Attempt: Definition Satisfying (1) and (2). How to use it? K:(x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1−ε prob. mass of x satisfy: E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ. Guarantee: with probability ≥ 1−δ, error ≤ ε + δ. Proof sketch: by Hoeffding, for any given "good" x, the probability of misclassifying x (over the draw of S+, S−) is ≤ δ². By Markov, there is at most a δ chance that the error rate over the good points is ≥ δ. Overall error rate ≤ ε + δ.
A First Attempt: Not Broad Enough. K:(x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1−ε prob. mass of x satisfy: E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ. But K(x,y) = x·y can have a large-margin separator and still not satisfy our definition. [Figure: an example where points can be more similar to + than to the typical −.]
A First Attempt: Not Broad Enough. K:(x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1−ε prob. mass of x satisfy: E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ. Broaden: it is OK if there exists a non-negligible set R s.t. most x are on average more similar to the points y ∈ R of their own label than to the points y ∈ R of the other label.
Broader/Main Definition. K:(x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1−ε prob. mass of x satisfy: E_{y~P}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y~P}[w(y)K(x,y) | l(y)≠l(x)] + γ. Algorithm: Draw S+ = {y1, …, yd}, S− = {z1, …, zd}, with d = O((1/γ²) ln(1/δ²)). "Triangulate" the data: F(x) = [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)]. Take a new set of labeled examples, project them into this space, and run any algorithm for learning linear separators. Theorem: with probability ≥ 1−δ, there exists a linear separator of error ≤ ε + δ at margin γ/4.
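A sketch of the "triangulation" step in Python, with scikit-learn's LinearSVC standing in for "any algorithm for learning linear separators" (an illustration under those assumptions, not the paper's code):

```python
import numpy as np
from sklearn.svm import LinearSVC

def triangulate(X, S_plus, S_minus, K):
    """Map each x to F(x) = [K(x,y1),...,K(x,yd), K(x,z1),...,K(x,zd)]."""
    landmarks = list(S_plus) + list(S_minus)
    return np.array([[K(x, l) for l in landmarks] for x in X])

def learn_with_similarity(X_labeled, y_labeled, S_plus, S_minus, K):
    """Project the labeled data into the triangulated space, then fit a linear separator."""
    F = triangulate(X_labeled, S_plus, S_minus, K)
    return LinearSVC().fit(F, y_labeled)

# prediction on new points X_new:
#   clf = learn_with_similarity(X_labeled, y_labeled, S_plus, S_minus, K)
#   clf.predict(triangulate(X_new, S_plus, S_minus, K))
```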
Main Definition & Algorithm, Implications. S+ = {y1, …, yd}, S− = {z1, …, zd}, d = O((1/γ²) ln(1/δ²)). "Triangulate" the data: F(x) = [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)]. Theorem: with prob. ≥ 1−δ, there exists a linear separator of error ≤ ε + δ at margin γ/4. Picture: legal kernels are a subset of arbitrary similarity functions; an (ε,γ)-good similarity function yields (via F) an (ε+δ, γ/4)-good kernel function. Theorem: Any (ε,γ)-good kernel is an (ε′,γ′)-good similarity function (with some penalty: ε′ = ε + ε_extra, γ′ = γ²ε_extra).
Similarity Functions for Classification, Summary. A formal way of understanding kernels as similarity functions. Algorithms and guarantees for general similarity functions that aren't necessarily PSD.
Part II: Can we use this angle to help think about Clustering?
What if only unlabeled examples are available? S: a set of n objects (e.g., documents or images). There is some (unknown) "ground truth" clustering: each object has a true label l(x) in {1,…,t} (e.g., its topic, such as [sports] or [fashion]). Goal: output a clustering h of low error up to isomorphism of label names: Err(h) = min_σ Pr_{x~S}[σ(h(x)) ≠ l(x)], where σ ranges over permutations of the label names. Problem: we only have unlabeled data! But we have a similarity function!
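A small Python sketch of the error measure above, brute-forcing the minimization over label permutations (illustrative; fine for the small t used in these examples):

```python
from itertools import permutations

def clustering_error(pred, truth, t):
    """Err(h) = min over label permutations sigma of Pr_x[ sigma(h(x)) != l(x) ].

    pred and truth are lists of labels in {0, ..., t-1}; we brute-force the
    t! permutations, which is fine for small t.
    """
    n = len(pred)
    best = 1.0
    for sigma in permutations(range(t)):
        err = sum(sigma[p] != l for p, l in zip(pred, truth)) / n
        best = min(best, err)
    return best

# e.g. clustering_error([0, 0, 1, 1, 1], [1, 1, 0, 0, 0], t=2) == 0.0
```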
What conditions on a similarity function would be enough to allow one to cluster well?
Contrast with the "Standard" Approach. Traditional approach: the input is a graph or an embedding of points into R^d; one analyzes algorithms that optimize various criteria and asks which criterion produces "better-looking" results. We flip this perspective around. This is more natural, since the input graph/similarity is merely based on some heuristic. It is closer in spirit to learning mixtures of Gaussians, but discriminative rather than generative.
What conditions on a similarity function would be enough to allow one to cluster well? A condition that trivially works: K(x,y) > 0 for all x,y with l(x) = l(y), and K(x,y) < 0 for all x,y with l(x) ≠ l(y).
What conditions on a similarity function would be enough to allow one to cluster well? Strict Ordering Property (still strong): K is s.t. all x are more similar to points y in their own cluster than to any y' in other clusters. Problem: the same K can satisfy this for two very different clusterings of the same data (e.g., {sports, fashion} vs. {soccer, tennis, Lacoste, Coco Chanel})! Unlike learning, you can't even test your hypotheses!
Relax Our Goals. 1. Produce a hierarchical clustering s.t. the correct answer is approximately some pruning of it.
Relax Our Goals. 1. Produce a hierarchical clustering s.t. the correct answer is approximately some pruning of it. Example hierarchy: All topics splits into sports and fashion; sports splits into soccer and tennis; fashion splits into Coco Chanel and Lacoste. Both the coarse clustering {sports, fashion} and the finer {soccer, tennis, Coco Chanel, Lacoste} are prunings of this tree.
Relax Our Goals. 1. Produce a hierarchical clustering s.t. the correct answer is approximately some pruning of it. 2. Produce a list of clusterings s.t. at least one has low error. Tradeoff: strength of the assumption on K vs. size of the list.
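As an illustration of "the correct answer is some pruning", here is a small Python sketch that enumerates the prunings of a binary hierarchy such as the topic tree above (illustrative only; the tree encoding is an assumption, and the enumeration is exponential in general):

```python
def leaves(tree):
    """Collect all objects under a subtree into one flat cluster."""
    if not isinstance(tree, tuple):            # a leaf is a list of objects
        return list(tree)
    return leaves(tree[0]) + leaves(tree[1])

def prunings(tree):
    """Enumerate every pruning of a binary hierarchy.

    A tree is either a leaf (a list of objects) or a pair (left, right).
    A pruning either keeps the whole subtree as one cluster, or combines a
    pruning of the left child with a pruning of the right child.
    """
    if not isinstance(tree, tuple):
        return [[list(tree)]]
    result = [[leaves(tree)]]                  # cut here: the subtree is one cluster
    for pl in prunings(tree[0]):
        for pr in prunings(tree[1]):
            result.append(pl + pr)
    return result

# the topic hierarchy above, as nested pairs:
topics = ((["soccer"], ["tennis"]), (["Lacoste"], ["Coco Chanel"]))
# prunings(topics) contains, among others, the coarse answer
#   [["soccer", "tennis"], ["Lacoste", "Coco Chanel"]]   (sports vs. fashion)
```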
Start Getting Nice Algorithms/Properties. Strict Ordering Property (sufficient for hierarchical clustering): K is s.t. all x are more similar to points y in their own cluster than to any y' in other clusters. Weak Stability Property (also sufficient for hierarchical clustering): for all clusters C, C' and all A ⊆ C, A' ⊆ C', at least one of A, A' is more attracted to its own cluster than to the other.
Example Analysis for the Strong Stability Property. Strong Stability: K is s.t. for all clusters C, C', all A ⊂ C, A' ⊆ C': K(A, C−A) > K(A, A'), where K(A, A') denotes the average attraction between A and A'. Algorithm: Average Single-Linkage, i.e., repeatedly merge the two "parts" whose average similarity is highest. Analysis: all "parts" formed are laminar w.r.t. the target clustering. Proof sketch: a failure would mean merging P1, P2 with P1 ⊂ C and P2 ∩ C = ∅. But then there must exist P3 ⊂ C with K(P1, P3) ≥ K(P1, C−P1), and strong stability gives K(P1, C−P1) > K(P1, P2), so the algorithm would have merged P1 with P3 instead. Contradiction.
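A minimal Python sketch of average single-linkage, i.e., bottom-up merging of the pair of current parts with highest average similarity (an illustration: a naive cubic-time implementation, with the similarity K assumed to be supplied by the caller):

```python
import numpy as np

def average_linkage(objects, K, num_merges=None):
    """Bottom-up average-linkage: repeatedly merge the two current parts with the
    highest average pairwise similarity K, recording each merge.

    Returns the list of merges; performing all n-1 merges defines a hierarchy.
    """
    parts = [[o] for o in objects]
    merges = []
    steps = num_merges if num_merges is not None else len(objects) - 1
    for _ in range(steps):
        best_score, best_pair = -np.inf, None
        for i in range(len(parts)):
            for j in range(i + 1, len(parts)):
                # average attraction K(parts[i], parts[j])
                score = np.mean([K(a, b) for a in parts[i] for b in parts[j]])
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        i, j = best_pair
        merges.append((list(parts[i]), list(parts[j])))
        parts[i] = parts[i] + parts[j]
        del parts[j]
    return merges
```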
Strong Stability Property, Inductive Setting. Inductive setting: draw a sample S, hierarchically partition S, then insert new points as they arrive. Need to argue that sampling preserves stability. Assume for all C, C', all A ⊂ C, A' ⊆ C': K(A, C−A) > K(A, A') + γ. The analysis is a sample-complexity-type argument using regularity-type results of [AFKK].
A’ A Weaker Conditions Not Sufficient for hierarchy Average AttractionProperty Ex’ 2 C(x)[K(x,x’)] > Ex’ 2 C’ [K(x,x’)]+ (8 C’C(x)) Can produce a small list of clusterings. Upper bound tO(t/2). [doesn’t depend on n] Lower bound ~ t(1/). Sufficient for hierarchy Stability of Large SubsetsProperty Might cause bottom-up algorithms to fail. Find hierarchy using learning-based algorithm. (running time tO(t/2))
Similarity Functions for Clustering, Summary. A discriminative/SLT-style model for clustering with non-interactive feedback. Minimal conditions on K to be useful for clustering: list clustering and hierarchical clustering. Our notion of a property is the analogue of a data-dependent concept class in classification.