Learning with Similarity Functions

Learning with Similarity Functions Maria-Florina Balcan & Avrim Blum CMU, CSD Maria-Florina Balcan

Kernels and Similarity Functions Kernels have become a powerful tool in ML. • Useful in practice for dealing with many different kinds of data. • Elegant theory about what makes a given kernel good for a given learning problem. Our Goal: analyze more general similarity functions. • In the process we describe ways of constructing good data dependent kernels. Maria-Florina Balcan

 (x) 1 w Kernels • A kernel K is a pairwise similarity function s.t. 9 an implicit mapping  s.t. K(x,y)=(x) ¢(y). • Point is: many learning algorithms can be written so only interact with data via dot-products. • If replace x¢y with K(x,y), it acts implicitly as if data was in higher-dimensional -space. • If data is linearly separable by large margin in -space, don’t have to pay in terms of data or comp time. If margin  in -space, only need 1/2 examples to learn well. Maria-Florina Balcan

General Similarity Functions Goal:definition ofgood similarity functionfor a learning problem that: 1) Talks in terms of natural direct properties: • no implicit high-dimensional spaces • no requirement of positive-semidefiniteness 2) If K satisfies these properties for our given problem, then has implications to learning. 3) Is broad: includes usual notion of “good kernel”. (induces a large margin separator in -space) Maria-Florina Balcan

- B C - A + A First Attempt: Definition satisfying properties (1) and (2) Let P be a distribution over labeled examples (x, l(x)) • K:(x,y) ! [-1,1] is an (,)-good similarity for P if at leasta 1-probability mass of x satisfy: Ey~P[K(x,y)|l(y)=l(x)] ¸ Ey~P[K(x,y)|l(y)l(x)]+ • Suppose that positives have K(x,y) ¸ 0.2, negatives have K(x,y) ¸ 0.2, but for a positive and a negative K(x,y) are uniform random in [-1,1]. Note: this might not be a legal kernel. Maria-Florina Balcan

A First Attempt: Definition satisfying properties (1) and (2). How to use it? • K:(x,y) ! [-1,1] is an(,)-good similarityfor P if at leasta 1-probability mass of x satisfy: Ey~P[K(x,y)|l(y)=l(x)] ¸ Ey~P[K(x,y)|l(y)l(x)]+ Algorithm • Draw S+ of O((1/2) ln(1/2)) positive examples. • Draw S- of O((1/2) ln(1/2)) negative examples. • Classify x based on which gives better score. Maria-Florina Balcan

A First Attempt: How to use it? • K:(x,y) ! [-1,1] is an(,)-good similarityfor P if at leasta 1-probability mass ofx satisfy: Ey~P[K(x,y)|l(y)=l(x)] ¸ Ey~P[K(x,y)|l(y)l(x)]+ Algorithm • Draw S+ of O((1/2) ln(1/2)) positive examples. • Draw S- of O((1/2) ln(1/2)) negative examples. • Classify x based on which gives better score. Guarantee: with probability ¸1-, error · + . Proof • Hoeffding: for any given “goodx”, probability of error w.r.t. x (over draw of S+, S-) at most 2. • By Markov, at most  chance that the error rate over GOOD is more than . So overall error rate · + . Maria-Florina Balcan

more similar to negs than to typical pos + + + + + + - - - - - - A First Attempt: Not Broad Enough • K:(x,y) ! [-1,1] is an(,)-good similarityfor P if at leasta 1-probability mass of x satisfy: Ey~P[K(x,y)|l(y)=l(x)] ¸ Ey~P[K(x,y)|l(y)l(x)]+ • K(x,y)=x ¢ y has good (large margin) separator but doesn’t satisfy our definition. Maria-Florina Balcan

A First Attempt: Not Broad Enough • K:(x,y) ! [-1,1] is an(,)-good similarityfor P if at leasta 1-probability mass of x satisfy: Ey~P[K(x,y)|l(y)=l(x)] ¸ Ey~P[K(x,y)|l(y)l(x)]+ R + + + + + + - - - - - - Idea: would work if we didn’t pick y’s rom top-left. Broaden to say:OK if 9 large region R s.t. most x are on average more similar to y2R of same label than to y2 R of other label. Maria-Florina Balcan

Broader/Main Definition • K:(x,y) ! [-1,1] is an(,)-good similarityfor P if exists a weighting functionw(y) 2 [0,1]at leasta 1-probability mass of x satisfy: Ey~P[w(y)K(x,y)|l(y)=l(x)] ¸ Ey~P[w(y)K(x,y)|l(y)l(x)]+ Maria-Florina Balcan

Main Definition, How to Use It • K:(x,y) ! [-1,1] is an(,)-good similarityfor P if exists a weighting functionw(y) 2 [0,1] at leasta 1-probability mass of x satisfy: Ey~P[w(y)K(x,y)|l(y)=l(x)] ¸ Ey~P[w(y)K(x,y)|l(y)l(x)]+ Algorithm • Draw S+={y1, , yd}, S-={z1, , zd}, d=O((1/2) ln(1/2)). • Use to “triangulate” data: F(x) = [K(x,y1), …,K(x,yd), K(x,zd),…,K(x,zd)]. • Take a new set of labeled examples, project to this space, and run your favorite alg for learning lin. separators. Point is: with probability ¸ 1-, exists linear separator of error · + at margin /4. (w = [w(y1), …,w(yd),-w(zd),…,-w(zd)]) Maria-Florina Balcan

Main Definition, Implications Algorithm • Draw S+={y1, , yd}, S-={z1, , zd}, d=O((1/2) ln(1/2)). • Use to “triangulate” data: F(x) = [K(x,y1), …,K(x,yd), K(x,zd),…,K(x,zd)]. Guarantee: with prob. ¸ 1-, exists linear separator of error · + at margin /4. legal kernel Implications K arbitrary sim. function (,)-goodsim. function (+,/4)-goodkernelfunction Maria-Florina Balcan

Good Kernels are Good Similarity Functions Main Definition: K:(x,y) ! [-1,1] is an(,)-good similarityfor P if exists a weighting functionw(y) 2 [0,1] at leasta 1-probability mass of x satisfy: Ey~P[w(y)K(x,y)|l(y)=l(x)] ¸ Ey~P[w(y)K(x,y)|l(y)l(x)]+ Theorem • An (,)-good kernel is an (’,’)-good similarity function under main definition. Our current proofs incur some penalty: ’ =  + extra, ’ = 3extra. Maria-Florina Balcan

Good Kernels are Good Similarity Functions Theorem • An (,)-good kernel is an (’,’)-good similarity function under main definition, where ’ =  + extra, ’ = 3extra. Proof Sketch • Suppose K is a good kernel in usual sense. • Then, standard margin bounds imply: • if S is a random sample of size Õ(1/(2)), then whp we can give weights wS(y) to all examples y 2 S so that the weighted sum of these examples defines a good LTF. • But, we want sample-independent weights [and bounded]. • Boundedness not too hard (imagine a margin-perceptron run over just the good y). • Get sample-independence using an averaging argument. Maria-Florina Balcan

Sample complexity is roughly Learning with Multiple Similarity Functions • Let K1, …, Kr be similarity functions s. t. some (unknown) convex combination of them is (,)-good. Algorithm • Draw S+={y1, , yd}, S-={z1, , zd}, d=O((1/2) ln(1/2)). • Use to “triangulate” data: F(x) = [K1(x,y1), …,Kr(x,yd), K1(x,zd),…,Kr(x,zd)]. Guarantee: The induced distribution F(P) in R2dr has a separator of error · +  at margin at least Maria-Florina Balcan

Implications & Conclusions • Develop theory that provides a formal way of understanding kernels as similarity function. • Our algorithms work for similarity fns that aren’t necessarily PSD (or even symmetric). Open Problems • Improve existing bounds. • Better results for learning with multiple similarity functions. Extending [SB’06]. Maria-Florina Balcan

Learning with Similarity Functions

Learning with Similarity Functions

Presentation Transcript

Similarity

CPN Distance/Similarity Functions

Learning Submodular Functions

Learning Submodular Functions

Learning Submodular Functions

Estimation Camera Response Functions using Probabilistic Intensity Similarity

Learning Term-weighting Functions for Similarity Measures

Similarity

Similarity

Learning Embeddings for Similarity-Based Retrieval

On a Theory of Similarity functions for Learning and Clustering

Similarity

Similarity

A Theory of Learning and Clustering via Similarity Functions

On a Theory of Similarity Functions for Learning and Clustering

Learning with Similarity Functions

Similarity searching modell with Excel

Learning Functions with Sesame Street

Similarity