This talk presents kernels and similarity functions as tools in machine learning, together with a theory of what makes a similarity function good for a given learning problem. It analyzes and constructs general similarity functions, gives definitions of good similarity functions and their implications for learning, and describes algorithms by Maria-Florina Balcan and Avrim Blum that learn with such functions and come with explicit error guarantees.
Learning with Similarity Functions
Maria-Florina Balcan & Avrim Blum
CMU, CSD
Kernels and Similarity Functions
Kernels have become a powerful tool in ML.
• Useful in practice for dealing with many different kinds of data.
• Elegant theory about what makes a given kernel good for a given learning problem.
Our goal: analyze more general similarity functions.
• In the process, we describe ways of constructing good data-dependent kernels.
Kernels
• A kernel K is a pairwise similarity function s.t. there exists an implicit mapping Φ s.t. K(x,y) = Φ(x)·Φ(y).
• Point is: many learning algorithms can be written so they only interact with the data via dot products.
• If we replace x·y with K(x,y), the algorithm acts implicitly as if the data were in the higher-dimensional Φ-space.
• If the data is linearly separable by a large margin in Φ-space, we don't have to pay in terms of data or computation time. If the margin is γ in Φ-space, only about 1/γ² examples are needed to learn well.
[Figure: a large-margin separator w in Φ-space.]
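To make the implicit-mapping idea concrete, here is a minimal numerical sketch (my own, not from the talk) using the quadratic kernel K(x,y) = (x·y)², whose explicit feature map Φ(x) lists all pairwise products xᵢxⱼ:

```python
import numpy as np

def quadratic_kernel(x, y):
    # K(x, y) = (x . y)^2, computed without ever building the feature space
    return np.dot(x, y) ** 2

def phi(x):
    # explicit feature map: all pairwise products x_i * x_j (d^2 dimensions)
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])

# Both lines print the same value: the kernel evaluates the dot product
# in Phi-space implicitly, at the cost of a single dot product in the input space.
print(quadratic_kernel(x, y))
print(np.dot(phi(x), phi(y)))
```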
General Similarity Functions
Goal: a definition of a good similarity function for a learning problem that:
1) Talks in terms of natural, direct properties:
• no implicit high-dimensional spaces
• no requirement of positive-semidefiniteness
2) If K satisfies these properties for our given problem, then it has implications for learning.
3) Is broad: includes the usual notion of a "good kernel" (one that induces a large-margin separator in Φ-space).
A First Attempt: Definition Satisfying Properties (1) and (2)
Let P be a distribution over labeled examples (x, l(x)).
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1−ε probability mass of x satisfy:
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
• Example: suppose pairs of positives have K(x,y) ≥ 0.2, pairs of negatives have K(x,y) ≥ 0.2, but K(x,y) for a positive–negative pair is uniform random in [-1,1]. Note: such a K might not be a legal kernel.
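A small numerical check of the last point, under an assumed toy instantiation of the example (my own choice: within-class similarity fixed at 0.5, which satisfies the ≥ 0.2 condition, and cross-class similarity uniform in [-1,1]); the smallest eigenvalue of the resulting similarity matrix is typically negative, so such a K is generally not a legal (PSD) kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pos, n_neg = 5, 5
n = n_pos + n_neg

K = np.empty((n, n))
K[:n_pos, :n_pos] = 0.5                          # positive-positive similarities
K[n_pos:, n_pos:] = 0.5                          # negative-negative similarities
cross = rng.uniform(-1, 1, size=(n_pos, n_neg))  # positive-negative similarities
K[:n_pos, n_pos:] = cross
K[n_pos:, :n_pos] = cross.T                      # keep K symmetric
np.fill_diagonal(K, 1.0)

# Typically negative for random draws, i.e., K is not positive semidefinite.
print(np.linalg.eigvalsh(K).min())
```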
A First Attempt: Definition Satisfying Properties (1) and (2). How to Use It?
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1−ε probability mass of x satisfy:
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Algorithm
• Draw S+ of O((1/γ²) ln(1/δ²)) positive examples.
• Draw S− of O((1/γ²) ln(1/δ²)) negative examples.
• Classify x based on which set gives the higher average similarity score.
A First Attempt: How to Use It?
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1−ε probability mass of x satisfy:
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Algorithm
• Draw S+ of O((1/γ²) ln(1/δ²)) positive examples.
• Draw S− of O((1/γ²) ln(1/δ²)) negative examples.
• Classify x based on which set gives the higher average similarity score.
Guarantee: with probability ≥ 1−δ, error ≤ ε + δ.
Proof
• Hoeffding: for any given "good" x, the probability of error w.r.t. x (over the draw of S+, S−) is at most δ².
• By Markov, there is at most a δ chance that the error rate over the good x's exceeds δ. So the overall error rate is ≤ ε + δ.
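A minimal sketch of this algorithm (my own; the similarity function `K` and the drawn sets `S_pos`, `S_neg` are assumed given), classifying by comparing average similarity to the positive and negative samples:

```python
import numpy as np

def classify(x, S_pos, S_neg, K):
    """Predict +1 iff the mean similarity of x to S_pos beats its mean similarity to S_neg."""
    score_pos = np.mean([K(x, y) for y in S_pos])
    score_neg = np.mean([K(x, z) for z in S_neg])
    return 1 if score_pos >= score_neg else -1

# Tiny usage example with the ordinary dot product as the similarity function:
S_pos = [np.array([1.0, 1.0]), np.array([0.8, 1.2])]
S_neg = [np.array([-1.0, -1.0]), np.array([-1.1, -0.7])]
print(classify(np.array([0.9, 0.8]), S_pos, S_neg, np.dot))   # prints 1
```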
A First Attempt: Not Broad Enough
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1−ε probability mass of x satisfy:
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
• K(x,y) = x·y has a good (large-margin) separator here but doesn't satisfy our definition.
[Figure: + and − points where some positives are more similar to the negatives than to a typical positive.]
A First Attempt: Not Broad Enough
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1−ε probability mass of x satisfy:
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
[Figure: the same + and − points, with a region R marked.]
Idea: this would work if we didn't pick y's from the top-left. Broaden to say: OK if there exists a large region R s.t. most x are on average more similar to y ∈ R of the same label than to y ∈ R of the other label.
Broader/Main Definition
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1−ε probability mass of x satisfy:
E_{y~P}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y~P}[w(y)K(x,y) | l(y)≠l(x)] + γ
Main Definition, How to Use It
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1−ε probability mass of x satisfy:
E_{y~P}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y~P}[w(y)K(x,y) | l(y)≠l(x)] + γ
Algorithm
• Draw S+ = {y1, …, yd} and S− = {z1, …, zd}, with d = O((1/γ²) ln(1/δ²)).
• Use them to "triangulate" the data: F(x) = [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)].
• Take a new set of labeled examples, project it into this space, and run your favorite algorithm for learning linear separators.
Point is: with probability ≥ 1−δ, there exists a linear separator of error ≤ ε + δ at margin γ/4 (namely w = [w(y1), …, w(yd), −w(z1), …, −w(zd)]).
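A sketch of this triangulation step in code (my own; it assumes a similarity function `K`, already-drawn landmark sets `S_pos` and `S_neg`, and a fresh labeled sample `X`, `y`, and uses scikit-learn's LinearSVC as one possible "favorite algorithm for learning linear separators"):

```python
import numpy as np
from sklearn.svm import LinearSVC

def featurize(X, landmarks, K):
    """Map each x to F(x) = [K(x, y1), ..., K(x, yd), K(x, z1), ..., K(x, zd)]."""
    return np.array([[K(x, l) for l in landmarks] for x in X])

def learn_with_similarity(X, y, S_pos, S_neg, K):
    landmarks = list(S_pos) + list(S_neg)
    F = featurize(X, landmarks, K)          # project the labeled sample into the landmark space
    clf = LinearSVC()                       # any standard linear-separator learner would do here
    clf.fit(F, y)
    # the returned classifier triangulates a new point against the same landmarks
    return lambda x: clf.predict(featurize([x], landmarks, K))[0]
```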
Main Definition, Implications
Algorithm
• Draw S+ = {y1, …, yd} and S− = {z1, …, zd}, with d = O((1/γ²) ln(1/δ²)).
• Use them to "triangulate" the data: F(x) = [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)].
Guarantee: with probability ≥ 1−δ, there exists a linear separator of error ≤ ε + δ at margin γ/4.
Implications: an arbitrary (ε,γ)-good similarity function K, via the mapping F, yields an (ε+δ, γ/4)-good kernel function, i.e., a legal kernel.
Good Kernels are Good Similarity Functions
Main Definition: K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1−ε probability mass of x satisfy:
E_{y~P}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y~P}[w(y)K(x,y) | l(y)≠l(x)] + γ
Theorem
• An (ε,γ)-good kernel is an (ε′,γ′)-good similarity function under the main definition. Our current proofs incur some penalty: ε′ = ε + ε_extra, γ′ = γ³·ε_extra.
Good Kernels are Good Similarity Functions
Theorem
• An (ε,γ)-good kernel is an (ε′,γ′)-good similarity function under the main definition, where ε′ = ε + ε_extra and γ′ = γ³·ε_extra.
Proof Sketch
• Suppose K is a good kernel in the usual sense.
• Then standard margin bounds imply: if S is a random sample of size Õ(1/(γ²ε)), then w.h.p. we can give weights wS(y) to all examples y ∈ S so that the weighted sum of these examples defines a good LTF (linear threshold function).
• But we want sample-independent weights [and bounded ones].
• Boundedness is not too hard (imagine a margin perceptron run over just the good y's).
• Sample-independence follows from an averaging argument.
Learning with Multiple Similarity Functions
• Let K1, …, Kr be similarity functions such that some (unknown) convex combination of them is (ε,γ)-good.
Algorithm
• Draw S+ = {y1, …, yd} and S− = {z1, …, zd}, with d = O((1/γ²) ln(1/δ²)).
• Use them to "triangulate" the data: F(x) = [K1(x,y1), …, Kr(x,y1), …, K1(x,yd), …, Kr(x,yd), K1(x,z1), …, Kr(x,zd)].
Guarantee: the induced distribution F(P) in R^{2dr} has a separator of error ≤ ε + δ at margin at least γ/(4√r). Sample complexity is roughly a factor r larger than in the single-similarity case.
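The only change from the earlier sketch is the feature map: each example now gets all r similarities to every landmark (my own sketch, same assumptions as before; `Ks` is the list K1, …, Kr). The result can be dropped into `learn_with_similarity` in place of `featurize`.

```python
import numpy as np

def featurize_multi(X, landmarks, Ks):
    """Map each x to all r similarities to all 2d landmarks, i.e., a point in R^(2dr)."""
    return np.array([[K(x, l) for l in landmarks for K in Ks] for x in X])
```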
Implications & Conclusions • Develop theory that provides a formal way of understanding kernels as similarity function. • Our algorithms work for similarity fns that aren’t necessarily PSD (or even symmetric). Open Problems • Improve existing bounds. • Better results for learning with multiple similarity functions. Extending [SB’06]. Maria-Florina Balcan