
Features, Kernels, and Similarity functions


Presentation Transcript


  1. Features, Kernels, and Similarity functions Avrim Blum Machine learning lunch 03/05/07

  2. Suppose you want to… use learning to solve some classification problem. E.g., given a set of images, learn a rule to distinguish men from women. • The first thing you need to do is decide what you want as features. • Or, for algs like SVM and Perceptron, can use a kernel function, which provides an implicit feature space. But then what kernel to use? • Can Theory provide any help or guidance?

  3. Plan for this talk Discuss a few ways theory might be of help: • Algorithms designed to do well in large feature spaces when only a small number of features are actually useful. • So you can pile a lot on when you don’t know much about the domain. • Kernel functions. Standard theoretical view, plus new one that may provide more guidance. • Bridge between “implicit mapping” and “similarity function” views. Talk about quality of a kernel in terms of more tangible properties. [work with Nina Balcan] • Combining the above. Using kernels to generate explicit features.

  4. A classic conceptual question • How is it possible to learn anything quickly when there is so much irrelevant information around? • Must there be some hard-coded focusing mechanism, or can learning handle it?

  5. A classic conceptual question Let’s try a very simple theoretical model. • Have n boolean features. Labels are + or -. 1001101110 + 1100111101 + 0111010111 - • Assume distinction is based on just one feature. • How many prediction mistakes do you need to make before you’ve figured out which one it is? • Can take majority vote over all possibilities consistent with data so far. Each mistake crosses off at least half. O(log n) mistakes total. • log(n) is good: doubling n only adds 1 more mistake. • Can’t do better (consider log(n) random strings with random labels. Whp there is a consistent feature in hindsight).
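A minimal sketch (in Python, not from the slides) of the majority-vote "halving" scheme described above; the data, the number of features, and the choice of target feature are illustrative assumptions:

```python
import random

def halving_single_feature(examples):
    """examples: list of (x, y) with x a tuple of 0/1 features, y in {0, 1}.
    Predict by majority vote over the single-feature rules still consistent
    with the data; each mistake eliminates at least half of them."""
    n = len(examples[0][0])
    alive = set(range(n))                    # candidate rules "label = x_i"
    mistakes = 0
    for x, y in examples:
        votes_for_1 = sum(x[i] for i in alive)
        pred = 1 if 2 * votes_for_1 >= len(alive) else 0
        if pred != y:
            mistakes += 1
        alive = {i for i in alive if x[i] == y}   # keep consistent candidates
    return mistakes

if __name__ == "__main__":
    random.seed(0)
    n, target = 1000, 7                      # hypothetical: label is feature x_7
    data = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(200)]
    labeled = [(x, x[target]) for x in data]
    print("mistakes:", halving_single_feature(labeled))   # roughly log2(n) ~ 10
```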

  6. A classic conceptual question What about more interesting classes of functions (not just target = a single feature)?

  7. Littlestone’s Winnow algorithm [MLJ 1988] • Motivated by the question: what if the target is an OR of r << n features? • A majority-vote scheme over all n^r possibilities would make O(r log n) mistakes but is totally impractical. Can you do this efficiently? • Winnow is a simple, efficient algorithm that meets this bound. • More generally, if there exists an LTF such that • positives satisfy w1x1 + w2x2 + … + wnxn ≥ c, • negatives satisfy w1x1 + w2x2 + … + wnxn ≤ c − γ (W = Σi |wi|), • then # mistakes = O((W/γ)² log n). • E.g., if the target is a “k of r” function, get O(r² log n). • Key point: still only a log dependence on n. [Slide figure: example string 100101011001101011 labeled + under a target on the features x4, x7, x10.]

  8. Littlestone’s Winnow algorithm [MLJ 1988] How does it work? Balanced version: • Maintain weight vectors w+ and w-. • Initialize all weights to 1. Classify based on whether w+·x or w-·x is larger. (Features satisfy x ≥ 0.) • If you make a mistake on a positive x, then for each xi=1 set wi+ = (1+ε)wi+, wi- = (1−ε)wi-. • And vice versa for a mistake on a negative x. Other properties: • Can show this approximates maxent constraints. • In the other direction, [Ng’04] shows that maxent with L1 regularization gets Winnow-like bounds.
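A minimal sketch of the balanced Winnow update described on this slide; the data, the value of ε, and the added constant feature x0 = 1 (used here as a stand-in threshold) are illustrative assumptions:

```python
import random

def balanced_winnow(examples, eps=0.1):
    """examples: list of (x, y) with x a tuple of 0/1 features, y in {+1, -1}."""
    n = len(examples[0][0]) + 1                   # +1 for a constant feature x_0 = 1
    w_pos, w_neg = [1.0] * n, [1.0] * n
    mistakes = 0
    for x, y in examples:
        x = (1,) + x                              # constant feature acts as a threshold
        score = sum((wp - wn) * xi for wp, wn, xi in zip(w_pos, w_neg, x))
        if (1 if score >= 0 else -1) != y:        # mistake-driven multiplicative update
            mistakes += 1
            for i, xi in enumerate(x):
                if xi == 1:
                    if y == 1:                    # mistake on a positive example
                        w_pos[i] *= (1 + eps); w_neg[i] *= (1 - eps)
                    else:                         # and vice versa on a negative one
                        w_pos[i] *= (1 - eps); w_neg[i] *= (1 + eps)
    return mistakes

if __name__ == "__main__":
    random.seed(0)
    n = 500
    data = []
    for _ in range(2000):
        x = tuple(random.randint(0, 1) for _ in range(n))
        data.append((x, 1 if (x[3] or x[6] or x[9]) else -1))   # target: x4 OR x7 OR x10
    print("mistakes:", balanced_winnow(data))     # stays small; the bound scales with log n
```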

  9. Practical issues • On a batch problem, may want to cycle through the data, each time with a smaller ε. • Can also do a margin version: update if just barely correct. • If you want to output a likelihood, a natural choice is e^(w+·x) / [e^(w+·x) + e^(w-·x)]. Can extend to multiclass too. • William & Vitor have a paper with some other nice practical adjustments.

  10. Winnow versus Perceptron/SVM Winnow is similar at a high level to Perceptron updates. What’s the difference? • Suppose the data is linearly separable by w·x = 0 with |w·x| ≥ γ. • For Perceptron, mistakes/samples are bounded by O((L2(w)L2(x)/γ)²). • For Winnow, mistakes/samples are bounded by O((L1(w)L∞(x)/γ)² log n). • For boolean features, L∞(x)=1, while L2(x) can be sqrt(n). • If the target is sparse and the examples are dense, Winnow is better. • E.g., x random in {0,1}^n, f(x)=x1. Perceptron: O(n) mistakes. • If the target is dense (most features are relevant) and the examples are sparse, then Perceptron wins.
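A rough empirical illustration of this comparison, assuming the sparse-target, dense-example setting from the slide (x uniform in {0,1}^n, f(x) = x1); the learners are bare-bones sketches and the parameters are arbitrary:

```python
import random

def perceptron_mistakes(data):
    w = [0.0] * len(data[0][0])
    mistakes = 0
    for x, y in data:
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return mistakes

def winnow_mistakes(data, eps=0.1):
    n = len(data[0][0])
    wp, wn = [1.0] * n, [1.0] * n
    mistakes = 0
    for x, y in data:
        score = sum((p - q) * xi for p, q, xi in zip(wp, wn, x))
        if (1 if score >= 0 else -1) != y:
            mistakes += 1
            for i, xi in enumerate(x):
                if xi:
                    if y > 0:
                        wp[i] *= 1 + eps; wn[i] *= 1 - eps
                    else:
                        wp[i] *= 1 - eps; wn[i] *= 1 + eps
    return mistakes

if __name__ == "__main__":
    random.seed(0)
    n, m = 200, 3000
    data = []
    for _ in range(m):
        x = (1,) + tuple(random.randint(0, 1) for _ in range(n))   # x[0] is a constant feature
        data.append((x, 1 if x[1] else -1))                        # target f(x) = x_1
    print("Perceptron mistakes:", perceptron_mistakes(data))
    print("Winnow mistakes:   ", winnow_mistakes(data))            # typically far fewer
```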

  11. OK, now on to kernels…

  12. Generic problem • Given a set of images, want to learn a linear separator to distinguish men from women. • Problem: the pixel representation is no good. One approach: • Pick a better set of features! But this seems ad hoc. Instead: • Use a kernel! K(x,y) = Φ(x)·Φ(y), where Φ is an implicit, high-dimensional mapping. • Perceptron/SVM only interact with the data through dot-products, so they can be “kernelized”. If the data is separable in Φ-space by a large L2 margin, you don’t have to pay for it.

  13. Kernels • E.g., the kernel K(x,y) = (1 + x·y)^d for the case of n=2, d=2 corresponds to the implicit mapping: [Slide figure: data separated by a circle in (x1,x2)-space becomes linearly separable in the (z1,z2,z3)-space of the implicit mapping.]
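A minimal sketch checking this correspondence numerically for n = 2, d = 2, assuming the standard explicit map Φ(x) = (1, √2·x1, √2·x2, x1², x2², √2·x1·x2) (the exact coordinates are not spelled out on the slide):

```python
import math, random

def kernel(x, y, d=2):
    return (1 + x[0] * y[0] + x[1] * y[1]) ** d

def phi(x):
    s2 = math.sqrt(2)
    return (1.0, s2 * x[0], s2 * x[1], x[0] ** 2, x[1] ** 2, s2 * x[0] * x[1])

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

if __name__ == "__main__":
    random.seed(0)
    for _ in range(5):
        x = (random.uniform(-1, 1), random.uniform(-1, 1))
        y = (random.uniform(-1, 1), random.uniform(-1, 1))
        # the two values agree: K(x,y) = phi(x).phi(y)
        print(round(kernel(x, y), 6), round(dot(phi(x), phi(y)), 6))
```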

  14. Kernels • Perceptron/SVM only interact with the data through dot-products, so they can be “kernelized”. If the data is separable in Φ-space by a large L2 margin, you don’t have to pay for it. • E.g., K(x,y) = (1 + x·y)^d: Φ maps the n-dimensional space to an n^d-dimensional space. • E.g., K(x,y) = e^(-||x-y||²). • Conceptual warning: you’re not really “getting all the power of the high-dimensional space without paying for it”. The margin matters. • E.g., K(x,y)=1 if x=y, K(x,y)=0 otherwise. This corresponds to a mapping where every example gets its own coordinate. Everything is linearly separable, but there is no generalization.

  15. Question: do we need the notion of an implicit space to understand what makes a kernel helpful for learning?

  16. Focus on batch setting • Assume examples drawn from some probability distribution: • Distribution D over x, labeled by target function c. • Or distribution P over (x, l) • Will call P (or (c,D)) our “learning problem”. • Given labeled training data, want algorithm to do well on new data.

  17. Something funny about the theory of kernels • On the one hand, operationally a kernel is just a similarity function: K(x,y) ∈ [-1,1], with some extra requirements. [Here I’m scaling to |Φ(x)| = 1.] • And in practice, people think of a good kernel as a good measure of similarity between data points for the task at hand. • But theory talks about margins in an implicit high-dimensional Φ-space, K(x,y) = Φ(x)·Φ(y).

  18. I want to use ML to classify protein structures and I’m trying to decide on a similarity fn to use. Any help? It should be pos. semidefinite, and should result in your data having a large margin separator in implicit high-diml space you probably can’t even calculate.

  19. Umm… thanks, I guess. It should be pos. semidefinite, and should result in your data having a large margin separator in implicit high-diml space you probably can’t even calculate.

  20. Something funny about the theory of kernels • Theory talks about margins in an implicit high-dimensional Φ-space, K(x,y) = Φ(x)·Φ(y). • Not great for intuition (do I expect this kernel or that one to work better for me?). • Can we connect better with the idea of a good kernel being one that is a good notion of similarity for the problem at hand? • Motivation [BBV]: if the margin is γ in Φ-space, then we can pick Õ(1/γ²) random examples y1,…,yn (“landmarks”) and do the mapping x → [K(x,y1),…,K(x,yn)]. Whp the data in this space will be approximately linearly separable.
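A minimal sketch of the landmark mapping x → [K(x,y1),…,K(x,yn)] mentioned above; the Gaussian similarity, the pool of unlabeled points, and the sample sizes are illustrative assumptions:

```python
import math, random

def gaussian_similarity(x, y, sigma=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2))

def landmark_features(x, landmarks, K):
    """Map x to the vector [K(x, y_1), ..., K(x, y_n)]."""
    return [K(x, y) for y in landmarks]

if __name__ == "__main__":
    random.seed(0)
    pool = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)]  # unlabeled pool
    gamma = 0.2
    n_landmarks = int(1 / gamma ** 2)          # on the order of 1/gamma^2 landmarks
    landmarks = random.sample(pool, n_landmarks)
    x = (0.5, -1.0)
    print(landmark_features(x, landmarks, gaussian_similarity)[:5])
```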

  21. Goal: notion of “good similarity function” that… • Talks in terms of more intuitive properties (no implicit high-dimensional spaces, no requirement of positive-semidefiniteness, etc.). • If K satisfies these properties for our given problem, then this has implications for learning. • Is broad: includes the usual notion of a “good kernel” (one that induces a large-margin separator in Φ-space). If so, then this can help with designing K. [Recent work with Nina, with extensions by Nati Srebro]

  22. Proposal satisfying (1) and (2): • Say we have a learning problem P (a distribution D over examples labeled by an unknown target f). • A similarity function K:(x,y) → [-1,1] is (ε,γ)-good for P if at least a 1−ε fraction of examples x satisfy: Ey~D[K(x,y) | l(y)=l(x)] ≥ Ey~D[K(x,y) | l(y)≠l(x)] + γ. • Q: how could you use this to learn?

  23. How to use it At least a 1−ε prob mass of x satisfy: Ey~D[K(x,y) | l(y)=l(x)] ≥ Ey~D[K(x,y) | l(y)≠l(x)] + γ. • Draw a set S+ of O((1/γ²) ln(1/δ²)) positive examples. • Draw a set S- of O((1/γ²) ln(1/δ²)) negative examples. • Classify x based on which set gives the better average score. • Hoeffding: for any given “good x”, the probability of error over the draw of S+,S- is at most δ². • So, there is at most a δ chance that our draw is bad on more than a δ fraction of the “good x”. • With probability ≥ 1−δ, the error rate is ≤ ε + δ.
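A minimal sketch of this rule: compare the average similarity to the positive sample versus the negative sample; the Gaussian similarity and the toy data are illustrative assumptions:

```python
import math, random

def K(x, y, sigma=1.0):                        # any bounded similarity would do here
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2))

def classify(x, S_pos, S_neg):
    """Label x by whichever sample it is more similar to on average."""
    score_pos = sum(K(x, y) for y in S_pos) / len(S_pos)
    score_neg = sum(K(x, y) for y in S_neg) / len(S_neg)
    return 1 if score_pos >= score_neg else -1

if __name__ == "__main__":
    random.seed(1)
    # toy problem: positives near (1, 1), negatives near (-1, -1)
    S_pos = [(1 + random.gauss(0, 0.5), 1 + random.gauss(0, 0.5)) for _ in range(50)]
    S_neg = [(-1 + random.gauss(0, 0.5), -1 + random.gauss(0, 0.5)) for _ in range(50)]
    for x, y in [((1.2, 0.8), 1), ((-0.9, -1.1), -1), ((0.2, 0.3), 1)]:
        print(x, "predicted:", classify(x, S_pos, S_neg), "true:", y)
```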

  24. But not broad enough • K(x,y)=x·y has a good separator but doesn’t satisfy the definition (half of the positives are more similar to negatives than to a typical positive). [Slide figure: two positive clusters and a negative region separated by 30° angles.]

  25. But not broad enough • Idea: it would work if we didn’t pick y’s from the top-left cluster. • Broaden to say: it is OK if there exists a large region R s.t. most x are on average more similar to y ∈ R of the same label than to y ∈ R of the other label (even if we don’t know R in advance). [Same slide figure as above.]

  26. Broader defn… • Say K:(x,y) → [-1,1] is an (ε,γ)-good similarity function for P if there exists a weighting function w(y) ∈ [0,1] s.t. at least a 1−ε fraction of x satisfy: Ey~D[w(y)K(x,y) | l(y)=l(x)] ≥ Ey~D[w(y)K(x,y) | l(y)≠l(x)] + γ. • Can still use it for learning: • Draw S+ = {y1,…,yn}, S- = {z1,…,zn}, n = Õ(1/γ²). • Use these to “triangulate” the data: x → [K(x,y1),…,K(x,yn), K(x,z1),…,K(x,zn)]. • Whp, there exists a good separator in this space: w = [w(y1),…,w(yn),−w(z1),…,−w(zn)].

  27. Broader defn… • Say K:(x,y) → [-1,1] is an (ε,γ)-good similarity function for P if there exists a weighting function w(y) ∈ [0,1] s.t. at least a 1−ε fraction of x satisfy: Ey~D[w(y)K(x,y) | l(y)=l(x)] ≥ Ey~D[w(y)K(x,y) | l(y)≠l(x)] + γ. • So, take a new set of examples, project them into this space, and run your favorite linear-separator learning algorithm.* • Whp, there exists a good separator in this space: w = [w(y1),…,w(yn),−w(z1),…,−w(zn)]. *Technically the bounds are better if we adjust the definition to penalize more heavily the examples that fail the inequality badly…

  28. Broader defn… Algorithm • Draw S+ = {y1,…,yd}, S- = {z1,…,zd}, d = O((1/γ²) ln(1/δ²)). Think of these as “landmarks”. • Use them to “triangulate” the data: x → [K(x,y1),…,K(x,yd), K(x,z1),…,K(x,zd)]. Guarantee: with probability ≥ 1−δ, there exists a linear separator of error ≤ ε + δ at margin γ/4. • Actually, the margin is good in both the L1 and L2 senses. • This particular approach requires wasting examples for use as the “landmarks”, but unlabeled data could be used for this part.
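A minimal sketch of this landmark "triangulation" algorithm, with a plain Perceptron standing in for "your favorite" linear-separator learner; the kernel, data, and sample sizes are illustrative assumptions:

```python
import math, random

def K(x, y, sigma=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2))

def triangulate(x, landmarks):
    """Represent x by its similarity to each landmark."""
    return [K(x, y) for y in landmarks]

def perceptron(train, epochs=20):
    w = [0.0] * len(train[0][0])
    for _ in range(epochs):
        for x, y in train:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

if __name__ == "__main__":
    random.seed(2)
    def sample(label, m):
        c = (1, 1) if label > 0 else (-1, -1)
        return [((c[0] + random.gauss(0, 0.6), c[1] + random.gauss(0, 0.6)), label)
                for _ in range(m)]
    data = sample(1, 200) + sample(-1, 200)
    random.shuffle(data)
    landmarks = [x for x, _ in data[:40]]                  # points "spent" as landmarks
    train = [(triangulate(x, landmarks), y) for x, y in data[40:340]]
    test = [(triangulate(x, landmarks), y) for x, y in data[340:]]
    w = perceptron(train)
    correct = sum(1 for x, y in test
                  if y * sum(wi * xi for wi, xi in zip(w, x)) > 0)
    print("test accuracy:", correct / len(test))
```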

  29. Interesting property of the definition • An (ε,γ)-good kernel [at least a 1−ε fraction of x have margin ≥ γ] is an (ε’,γ’)-good sim fn under this definition. • But our current proofs suffer a penalty: ε’ = ε + εextra, γ’ = γ³εextra. • So, at a qualitative level, we can have a theory of similarity functions that doesn’t require implicit spaces. Nati Srebro has improved this to γ², which is tight, and extended it to hinge loss.

  30. Approach we’re investigating With Nina & Mugizi: • Take a problem where the original features are already pretty good, plus you have a couple of reasonable similarity functions K1, K2,… • Take some unlabeled data as landmarks and use them to enlarge the feature space: K1(x,y1), K2(x,y1), K1(x,y2),… • Run Winnow on the result. • Can prove guarantees if some convex combination of the Ki is good.
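A minimal sketch of the feature construction described on this slide; the two similarity functions and the landmarks are illustrative placeholders, and the resulting features would then be fed to a sparse-tolerant learner such as Winnow:

```python
import math

def K1(x, y):                                  # e.g., a normalized dot product
    nx = math.sqrt(sum(a * a for a in x)) or 1.0
    ny = math.sqrt(sum(b * b for b in y)) or 1.0
    return sum(a * b for a, b in zip(x, y)) / (nx * ny)

def K2(x, y, sigma=1.0):                       # e.g., a Gaussian similarity
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2))

def enlarge(x, landmarks, sims=(K1, K2)):
    """[original features] + [Ki(x, yj) for each similarity Ki and landmark yj]."""
    return list(x) + [K(x, y) for y in landmarks for K in sims]

if __name__ == "__main__":
    landmarks = [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5)]   # unlabeled points used as landmarks
    x = (0.2, 0.9)
    print(enlarge(x, landmarks))               # 2 original + 3 landmarks x 2 sims = 8 features
```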

  31. Open questions • This view gives some sufficient conditions for a similarity function to be useful for learning, but it doesn’t have direct implications for plugging the function directly into an SVM, say. • Can one define other interesting, reasonably intuitive sufficient conditions for a similarity function to be useful for learning?
