Features, Kernels, and Similarity Functions
Avrim Blum
Machine learning lunch, 03/05/07
Suppose you want to…
…use learning to solve some classification problem. E.g., given a set of images, learn a rule to distinguish men from women.
• The first thing you need to do is decide what you want as features.
• Or, for algorithms like SVM and Perceptron, you can use a kernel function, which provides an implicit feature space. But then what kernel to use?
• Can theory provide any help or guidance?
Plan for this talk
Discuss a few ways theory might be of help:
• Algorithms designed to do well in large feature spaces when only a small number of features are actually useful. So you can pile a lot on when you don't know much about the domain.
• Kernel functions. The standard theoretical view, plus a new one that may provide more guidance: a bridge between the "implicit mapping" and "similarity function" views, talking about the quality of a kernel in terms of more tangible properties. [work with Nina Balcan]
• Combining the above: using kernels to generate explicit features.
A classic conceptual question
• How is it possible to learn anything quickly when there is so much irrelevant information around?
• Must there be some hard-coded focusing mechanism, or can learning handle it?
A classic conceptual question
Let's try a very simple theoretical model.
• Have n boolean features. Labels are + or −. E.g.:
  1001101110  +
  1100111101  +
  0111010111  −
• Assume the distinction is based on just one feature.
• How many prediction mistakes do you need to make before you've figured out which one it is?
• Take a majority vote over all possibilities consistent with the data so far (sketched below). Each mistake crosses off at least half, so O(log n) mistakes total.
• log(n) is good: doubling n only adds 1 more mistake.
• Can't do better (consider log(n) random strings with random labels; with high probability there is a consistent feature in hindsight).
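A minimal sketch of the majority-vote ("halving") scheme just described, assuming the single-feature setup above; the data format and function name are my own illustration, not from the talk.

```python
def halving_single_feature(examples):
    """examples: list of (x, y) pairs, x a tuple of 0/1 features, y in {0, 1}.
    Assumes the true label equals some (unknown) single feature."""
    n = len(examples[0][0])
    consistent = set(range(n))     # features still consistent with everything seen
    mistakes = 0
    for x, y in examples:
        votes = sum(x[i] for i in consistent)
        prediction = 1 if 2 * votes >= len(consistent) else 0   # majority vote
        if prediction != y:
            mistakes += 1          # each mistake eliminates at least half of 'consistent'
        consistent = {i for i in consistent if x[i] == y}
    return consistent, mistakes
```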
A classic conceptual question
What about more interesting classes of functions (not just targets that depend on a single feature)?
Littlestone's Winnow algorithm [MLJ 1988]
• Motivated by the question: what if the target is an OR of r << n features? (E.g., the example 100101011001101011 is positive for the target x4 ∨ x7 ∨ x10, since x4 = 1.)
• A majority-vote scheme over all ~n^r possibilities would make O(r log n) mistakes, but is totally impractical. Can you do this efficiently?
• Winnow is a simple, efficient algorithm that meets this bound.
• More generally, if there exists a linear threshold function such that
  positives satisfy w1x1 + w2x2 + … + wnxn ≥ c,
  negatives satisfy w1x1 + w2x2 + … + wnxn ≤ c − γ,  (W = Σi |wi|)
  then the number of mistakes is O((W/γ)² log n).
• E.g., if the target is a "k of r" function, this gives O(r² log n).
• Key point: still only a log dependence on n.
Littlestone's Winnow algorithm [MLJ 1988]
How does it work? Balanced version (sketched below):
• Maintain weight vectors w+ and w−.
• Initialize all weights to 1. Classify based on whether w+·x or w−·x is larger. (Here x ≥ 0.)
• If you make a mistake on a positive x, then for each xi = 1: wi+ ← (1+ε)wi+, wi− ← (1−ε)wi−.
• And vice versa for a mistake on a negative x.
Other properties:
• Can show this approximates maxent constraints.
• In the other direction, [Ng '04] shows that maxent with L1 regularization gets Winnow-like bounds.
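A minimal sketch of the balanced Winnow update just described (my own rendering; the ε parameter and data format are illustrative, not specifics from the talk).

```python
def balanced_winnow(examples, eps=0.1):
    """examples: list of (x, y), x a tuple of 0/1 features, y in {+1, -1}."""
    n = len(examples[0][0])
    w_pos = [1.0] * n                  # w+
    w_neg = [1.0] * n                  # w-
    for x, y in examples:
        score_pos = sum(wp for wp, xi in zip(w_pos, x) if xi)
        score_neg = sum(wn for wn, xi in zip(w_neg, x) if xi)
        prediction = +1 if score_pos >= score_neg else -1
        if prediction != y:            # multiplicative update only on mistakes
            for i, xi in enumerate(x):
                if xi:
                    if y == +1:        # missed a positive
                        w_pos[i] *= (1 + eps)
                        w_neg[i] *= (1 - eps)
                    else:              # missed a negative
                        w_pos[i] *= (1 - eps)
                        w_neg[i] *= (1 + eps)
    return w_pos, w_neg
```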
Practical issues
• On a batch problem, may want to cycle through the data, each time with a smaller ε.
• Can also do a margin version: update even if the example was just barely correct.
• If you want to output a likelihood, a natural choice is e^(w+·x) / [e^(w+·x) + e^(w−·x)]. Can extend to multiclass too.
• William & Vitor have a paper with some other nice practical adjustments.
Winnow versus Perceptron/SVM
Winnow is similar at a high level to Perceptron updates. What's the difference?
• Suppose the data is linearly separable by w·x = 0 with |w·x| ≥ γ.
• For Perceptron, mistakes/samples are bounded by O((L2(w)·L2(x)/γ)²).
• For Winnow, mistakes/samples are bounded by O((L1(w)·L∞(x)/γ)² log n).
• For boolean features, L∞(x) = 1, while L2(x) can be sqrt(n).
• If the target is sparse and examples are dense, Winnow is better. E.g., for x random in {0,1}^n and f(x) = x1, Perceptron makes O(n) mistakes.
• If the target is dense (most features are relevant) and examples are sparse, then Perceptron wins.
Generic problem
• Given a set of images, want to learn a linear separator to distinguish men from women.
• Problem: the pixel representation is no good.
One approach:
• Pick a better set of features! But this seems ad hoc.
Instead:
• Use a kernel! K(x,y) = φ(x)·φ(y), where φ is an implicit, high-dimensional mapping.
• Perceptron/SVM only interact with the data through dot products, so they can be "kernelized". If the data is separable in φ-space by a large L2 margin, you don't have to pay for it.
Kernels
• E.g., the kernel K(x,y) = (1 + x·y)^d for the case of n = 2, d = 2 corresponds to the implicit mapping:
[Figure: data that is not linearly separable in the original (x1, x2) space becomes linearly separable in the implicit (z1, z2, z3) space.]
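A quick numerical check of this correspondence, assuming the standard explicit mapping φ(x) = (1, x1², x2², √2·x1, √2·x2, √2·x1·x2) for this kernel (my own illustration, not code from the talk).

```python
import math
import random

def K(x, y):
    """Polynomial kernel (1 + x.y)^2 for n = 2."""
    return (1 + x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit 6-dimensional mapping whose dot product reproduces K."""
    r2 = math.sqrt(2)
    return (1, x[0] ** 2, x[1] ** 2, r2 * x[0], r2 * x[1], r2 * x[0] * x[1])

x = (random.random(), random.random())
y = (random.random(), random.random())
assert abs(K(x, y) - sum(a * b for a, b in zip(phi(x), phi(y)))) < 1e-9
```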
Kernels
• Perceptron/SVM only interact with the data through dot products, so they can be "kernelized". If the data is separable in φ-space by a large L2 margin, you don't have to pay for it.
• E.g., K(x,y) = (1 + x·y)^d: φ maps an n-dimensional space into an ~n^d-dimensional space.
• E.g., K(x,y) = e^(−||x−y||²).
• Conceptual warning: you're not really "getting all the power of the high-dimensional space without paying for it". The margin matters.
• E.g., K(x,y) = 1 if x = y and K(x,y) = 0 otherwise corresponds to a mapping where every example gets its own coordinate. Everything is linearly separable, but there is no generalization.
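A minimal kernelized-Perceptron sketch illustrating "interacting with the data only through dot products"; the Gaussian kernel choice and function names are my own assumptions, not specifics from the talk.

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def kernel_perceptron(X, y, kernel=rbf_kernel, epochs=5):
    """X: (m, n) array; y: array of +1/-1 labels. Returns dual coefficients."""
    m = len(X)
    alpha = np.zeros(m)                       # mistake counts, one per training point
    for _ in range(epochs):
        for i in range(m):
            # f(x_i) = sum_j alpha_j * y_j * K(x_j, x_i): only kernel values are needed
            score = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(m) if alpha[j])
            if y[i] * score <= 0:             # mistake: add x_i to the support set
                alpha[i] += 1
    return alpha

def predict(x, X, y, alpha, kernel=rbf_kernel):
    return np.sign(sum(alpha[j] * y[j] * kernel(X[j], x) for j in range(len(X)) if alpha[j]))
```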
Question: do we need the notion of an implicit space to understand what makes a kernel helpful for learning?
Focus on batch setting
• Assume examples are drawn from some probability distribution: either a distribution D over x labeled by a target function c, or a distribution P over pairs (x, ℓ).
• Will call P (or (c, D)) our "learning problem".
• Given labeled training data, want the algorithm to do well on new data.
Something funny about the theory of kernels
• On the one hand, operationally a kernel is just a similarity function: K(x,y) ∈ [−1, 1], with some extra requirements. [Here I'm scaling so that |φ(x)| = 1.]
• And in practice, people think of a good kernel as a good measure of similarity between data points for the task at hand.
• But the theory talks about margins in an implicit high-dimensional φ-space: K(x,y) = φ(x)·φ(y).
"I want to use ML to classify protein structures and I'm trying to decide on a similarity function to use. Any help?"
"It should be positive semidefinite, and should result in your data having a large-margin separator in an implicit high-dimensional space you probably can't even calculate."
"Umm… thanks, I guess."
Something funny about the theory of kernels
• The theory talks about margins in an implicit high-dimensional φ-space: K(x,y) = φ(x)·φ(y).
• Not great for intuition (do I expect this kernel or that one to work better for me?).
• Can we connect better with the idea of a good kernel being one that is a good notion of similarity for the problem at hand?
• Motivation [BBV]: if there is a margin γ in φ-space, then one can pick Õ(1/γ²) random examples y1,…,yn ("landmarks") and do the mapping x → [K(x,y1),…,K(x,yn)]. Whp the data in this space will be approximately linearly separable.
Goal: a notion of "good similarity function" that…
1. Talks in terms of more intuitive properties (no implicit high-dimensional spaces, no requirement of positive semidefiniteness, etc.).
2. If K satisfies these properties for our given problem, then this has implications for learning.
3. Is broad: includes the usual notion of a "good kernel" (one that induces a large-margin separator in φ-space).
If so, then this can help with designing K. [Recent work with Nina, with extensions by Nati Srebro]
Proposal satisfying (1) and (2)
• Say we have a learning problem P (a distribution D over examples labeled by an unknown target f).
• A similarity function K: (x,y) → [−1, 1] is (ε, γ)-good for P if at least a 1−ε fraction of examples x satisfy:
  E_{y~D}[K(x,y) | ℓ(y) = ℓ(x)] ≥ E_{y~D}[K(x,y) | ℓ(y) ≠ ℓ(x)] + γ
• Q: how could you use this to learn?
How to use it
At least a 1−ε probability mass of x satisfy:
  E_{y~D}[K(x,y) | ℓ(y) = ℓ(x)] ≥ E_{y~D}[K(x,y) | ℓ(y) ≠ ℓ(x)] + γ
• Draw a set S+ of O((1/γ²) ln(1/δ²)) positive examples.
• Draw a set S− of O((1/γ²) ln(1/δ²)) negative examples.
• Classify x based on which set gives the better average score (sketched below).
• Hoeffding: for any given "good x", the probability of error over the draw of S+, S− is at most δ².
• So there is at most a δ chance that our draw is bad on more than a δ fraction of the "good x".
• With probability ≥ 1−δ, the error rate is ≤ ε + δ.
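A minimal sketch of the classifier just described: compare x's average similarity to the positive sample versus the negative sample (my own rendering; K is any similarity function as in the definition).

```python
import numpy as np

def similarity_classifier(S_pos, S_neg, K):
    """S_pos, S_neg: lists of sampled positive / negative examples; K: similarity fn."""
    def classify(x):
        score_pos = np.mean([K(x, y) for y in S_pos])   # avg similarity to positives
        score_neg = np.mean([K(x, y) for y in S_neg])   # avg similarity to negatives
        return +1 if score_pos >= score_neg else -1
    return classify
```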
But not broad enough
• K(x,y) = x·y has a good separator but doesn't satisfy the definition: half of the positives are more similar to negatives than to a typical positive.
[Figure: positives and negatives lying in narrow 30° wedges.]
But not broad enough
• Idea: this would work if we didn't pick the y's from the top-left wedge.
• Broaden to say: OK if there exists a large region R such that most x are on average more similar to y ∈ R of the same label than to y ∈ R of the other label (even if we don't know R in advance).
Broader definition…
• Say K: (x,y) → [−1, 1] is an (ε, γ)-good similarity function for P if there exists a weighting function w(y) ∈ [0, 1] such that at least a 1−ε fraction of x satisfy:
  E_{y~D}[w(y)K(x,y) | ℓ(y) = ℓ(x)] ≥ E_{y~D}[w(y)K(x,y) | ℓ(y) ≠ ℓ(x)] + γ
• Can still use this for learning:
• Draw S+ = {y1,…,yn}, S− = {z1,…,zn}, with n = Õ(1/γ²).
• Use these to "triangulate" the data: x → [K(x,y1),…,K(x,yn), K(x,z1),…,K(x,zn)].
• Whp, there exists a good separator in this space: w = [w(y1),…,w(yn), −w(z1),…,−w(zn)].
Broader definition…
• Say K: (x,y) → [−1, 1] is an (ε, γ)-good similarity function for P if there exists a weighting function w(y) ∈ [0, 1] such that at least a 1−ε fraction of x satisfy:
  E_{y~D}[w(y)K(x,y) | ℓ(y) = ℓ(x)] ≥ E_{y~D}[w(y)K(x,y) | ℓ(y) ≠ ℓ(x)] + γ
• So, take a new set of examples, project them into this space, and run your favorite linear-separator learning algorithm.*
• Whp, there exists a good separator in this space: w = [w(y1),…,w(yn), −w(z1),…,−w(zn)].
*Technically the bounds are better if the definition is adjusted to penalize more heavily the examples that fail the inequality badly…
Broader definition…
Algorithm (sketched in code below):
• Draw S+ = {y1,…,yd} and S− = {z1,…,zd}, with d = O((1/γ²) ln(1/δ²)). Think of these as "landmarks".
• Use them to "triangulate" the data: x → [K(x,y1),…,K(x,yd), K(x,z1),…,K(x,zd)].
• Guarantee: with probability ≥ 1−δ, there exists a linear separator of error ≤ ε + δ at margin γ/4.
• Actually, the margin is good in both the L1 and L2 senses.
• This particular approach requires "wasting" labeled examples for use as the landmarks. But one could use unlabeled data for this part.
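A sketch of the landmark / "triangulation" mapping above (my own rendering); the linear learner on top is left abstract, since any margin-based algorithm could be plugged in.

```python
import numpy as np

def triangulate(X, landmarks_pos, landmarks_neg, K):
    """Map each x to [K(x,y_1),...,K(x,y_d), K(x,z_1),...,K(x,z_d)]."""
    landmarks = list(landmarks_pos) + list(landmarks_neg)
    return np.array([[K(x, y) for y in landmarks] for x in X])

# Usage sketch:
#   F_train = triangulate(X_train, S_pos, S_neg, K)
#   then run your favorite linear-separator learner (Perceptron, Winnow, SVM) on
#   (F_train, y_train); per the guarantee, whp a separator of error <= eps + delta
#   exists in this space at margin ~gamma/4.
```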
Interesting property of the definition
• An (ε, γ)-good kernel [at least a 1−ε fraction of x have margin ≥ γ] is an (ε', γ')-good similarity function under this definition.
• But our current proofs suffer a penalty: ε' = ε + ε_extra, γ' = γ³·ε_extra.
• So, at a qualitative level, we can have a theory of similarity functions that doesn't require implicit spaces.
• Nati Srebro has improved the margin loss to γ², which is tight, and extended the result to hinge loss.
Approach we're investigating
With Nina & Mugizi:
• Take a problem where the original features are already pretty good, plus you have a couple of reasonable similarity functions K1, K2, ….
• Take some unlabeled data as landmarks, and use them to enlarge the feature space: K1(x,y1), K2(x,y1), K1(x,y2), …
• Run Winnow on the result (see the sketch below).
• Can prove guarantees if some convex combination of the Ki is good.
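A sketch of the feature-space enlargement just described (my own rendering): concatenate the original features with K_i(x, landmark) values for each candidate similarity function, then hand the result to a Winnow-style (L1-based) learner that tolerates many irrelevant coordinates.

```python
import numpy as np

def augmented_features(X, landmarks, similarity_fns):
    """Original features plus K_i(x, y_j) for every similarity fn K_i and landmark y_j."""
    rows = []
    for x in X:
        sims = [K(x, y) for K in similarity_fns for y in landmarks]
        rows.append(np.concatenate([np.asarray(x, dtype=float), sims]))
    return np.array(rows)

# Usage sketch: landmarks drawn from unlabeled data; run Winnow (or any
# L1-regularized linear learner) on augmented_features(X, landmarks, [K1, K2]).
```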
Open questions
• This view gives some sufficient conditions for a similarity function to be useful for learning, but it doesn't say anything directly about plugging the similarity function into an SVM, say.
• Can one define other interesting, reasonably intuitive sufficient conditions for a similarity function to be useful for learning?