Kernels and Margins Maria Florina Balcan 10/25/2011
Linear Separators
• Hypothesis class of linear decision surfaces in R^n.
• Instance space: X = R^n.
• h(x) = w · x; if h(x) > 0, then label x as +, otherwise label it as −.
(figure: positive and negative points in R^n separated by a linear separator with normal vector w)
Lin. Separators: Perceptron algorithm
• Start with the all-zeroes weight vector w.
• Given example x, predict positive iff w · x ≥ 0.
• On a mistake, update as follows:
  • Mistake on a positive: w ← w + x
  • Mistake on a negative: w ← w − x
• w is a weighted sum of the incorrectly classified examples.
Note: Guarantee: the mistake bound is 1/γ².
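The update rule above is compact enough to sketch directly. A minimal Python sketch, assuming the data comes as a NumPy array X with labels y in {−1, +1} and that we make a fixed number of passes (both assumptions, not from the slides):

```python
import numpy as np

def perceptron(X, y, passes=10):
    """Perceptron as on the slide: start at w = 0, predict sign(w . x),
    and on a mistake add (for a positive) or subtract (for a negative) x."""
    w = np.zeros(X.shape[1])
    for _ in range(passes):
        for x, label in zip(X, y):          # labels in {-1, +1}
            if label * np.dot(w, x) <= 0:   # mistake (ties counted as mistakes here)
                w += label * x              # w <- w + x or w <- w - x
    return w
```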
Geometric Margin
• If S is a set of labeled examples, then a unit-length vector w has margin γ w.r.t. S if every labeled example (x, ℓ(x)) in S satisfies ℓ(x)(w · x) ≥ γ, i.e., every point of S lies at distance at least γ from the separating hyperplane h: w · x = 0.
(figure: the hyperplane h: w · x = 0, with a point x at distance γ from it)
What if Not Linearly Separable?
Problem: data is not linearly separable in the most natural feature representation.
Example: no good linear separator in pixel representation (figure: two images being compared).
Solutions:
• Classic: "Learn a more complex class of functions."
• Modern: "Use a Kernel" (a prominent method today).
Overview of Kernel Methods
What is a Kernel?
• A kernel K is a legal definition of a dot product: i.e., there exists an implicit mapping φ such that K(x, y) = φ(x) · φ(y).
Why do Kernels matter?
• Many algorithms interact with the data only via dot products. So, if we replace x · y with K(x, y), they act implicitly as if the data were in the higher-dimensional φ-space.
• If the data is linearly separable by a margin in the φ-space, then we get good sample complexity.
Kernels
• K(·,·) is a kernel if it can be viewed as a legal definition of an inner product:
  • ∃ φ: X → R^N such that K(x, y) = φ(x) · φ(y)
  • the range of φ is the "φ-space"
  • N can be very large
• But think of φ as implicit, not explicit!
Example
K(x, y) = (x · y)^d corresponds to a degree-d polynomial mapping. E.g., for n = 2, d = 2, the kernel K(x, y) = (x · y)² corresponds to the map φ(x1, x2) = (x1², x2², √2·x1x2).
(figure: points in the original space with coordinates x1, x2, and the same points in the φ-space with coordinates z1, z2, z3, where they become linearly separable)
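A quick numerical check of the n = 2, d = 2 example; the explicit map φ(x) = (x1², x2², √2·x1x2) is the standard one for this kernel (the sample vectors are arbitrary):

```python
import numpy as np

def poly_kernel(x, y, d=2):
    return np.dot(x, y) ** d

def phi(x):
    # explicit feature map for n=2, d=2: (x1^2, x2^2, sqrt(2)*x1*x2)
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly_kernel(x, y), np.dot(phi(x), phi(y)))  # both are (approximately) 1.0
```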
Kernels, More Examples
• Linear: K(x, y) = x · y
• Polynomial: K(x, y) = (x · y)^d or K(x, y) = (1 + x · y)^d
• Gaussian: K(x, y) = exp(−‖x − y‖² / (2σ²))
Theorem: K is a kernel iff
• K is symmetric, and
• for any set of training points x1, x2, …, xm and for any a1, a2, …, am ∈ R we have Σ_{i,j} a_i a_j K(x_i, x_j) ≥ 0.
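A small sketch of the condition in the theorem: build the Gram matrix of a candidate K on a handful of points and verify symmetry and positive semi-definiteness numerically (the sample points, the σ = 1 choice, and the tolerance are assumptions for illustration):

```python
import numpy as np

def gram(K, xs):
    return np.array([[K(a, b) for b in xs] for a in xs])

def looks_like_kernel(K, xs, tol=1e-9):
    M = gram(K, xs)
    symmetric = np.allclose(M, M.T)
    # PSD <=> sum_{i,j} a_i a_j K(x_i, x_j) >= 0 for all a
    psd = np.min(np.linalg.eigvalsh(M)) >= -tol
    return symmetric and psd

xs = [np.random.randn(3) for _ in range(5)]
gaussian = lambda x, y, s=1.0: np.exp(-np.linalg.norm(x - y)**2 / (2 * s**2))
print(looks_like_kernel(gaussian, xs))   # True
```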
Kernelizing a learning algorithm
• If all computations involving instances are in terms of inner products, then:
  • Conceptually, we work in a very high-dimensional space, and the algorithm's performance depends only on linear separability in that extended space.
  • Computationally, we only need to modify the algorithm by replacing each x · y with K(x, y).
• Examples of kernelizable algorithms: Perceptron, SVM.
Lin. Separators: Perceptron algorithm
• Start with the all-zeroes weight vector w.
• Given example x, predict positive iff w · x ≥ 0.
• On a mistake, update as follows:
  • Mistake on a positive: w ← w + x
  • Mistake on a negative: w ← w − x
• Easy to kernelize, since w is a weighted sum of the examples on which mistakes were made: replace w · φ(x) = Σ_{i ∈ mistakes} ℓ(x_i) φ(x_i) · φ(x) with Σ_{i ∈ mistakes} ℓ(x_i) K(x_i, x).
Note: we need to store all the mistakes so far.
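A minimal sketch of the kernelized update, assuming the same array-style inputs as before; instead of maintaining w, we store the mistakes and evaluate the kernel sum at prediction time:

```python
import numpy as np

def kernel_perceptron(X, y, K, passes=10):
    """Kernel Perceptron: w is never formed explicitly; the score of x is
    the sum over stored mistakes of label_i * K(x_i, x)."""
    mistakes = []  # list of (x_i, label_i) pairs
    for _ in range(passes):
        for x, label in zip(X, y):
            score = sum(l_i * K(x_i, x) for x_i, l_i in mistakes)
            if label * score <= 0:          # mistake (ties counted as mistakes here)
                mistakes.append((x, label))
    return mistakes

def predict(mistakes, K, x):
    return 1 if sum(l_i * K(x_i, x) for x_i, l_i in mistakes) >= 0 else -1
```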
Generalize Well if Good Margin
• If the data is linearly separable by a margin in the φ-space, then we get good sample complexity: if the margin is γ in the φ-space (with |φ(x)| ≤ 1), then we need a sample size of only Õ(1/γ²) to get confidence in generalization.
• We cannot rely on standard VC-bounds, since the dimension of the φ-space might be very large.
  • VC-dim for the class of linear separators in R^m is m+1.
(figure: positive and negative points separated by a margin)
Kernels & Large Margins
• If S is a set of labeled examples, then a vector w in the φ-space has margin γ if: for every (x, ℓ(x)) ∈ S, ℓ(x) (w · φ(x)) / (|w| |φ(x)|) ≥ γ.
• A vector w in the φ-space has margin γ with respect to P if: Pr_{(x, ℓ(x)) ∼ P}[ ℓ(x) (w · φ(x)) / (|w| |φ(x)|) ≥ γ ] = 1.
• A vector w in the φ-space has error ε at margin γ if: Pr_{(x, ℓ(x)) ∼ P}[ ℓ(x) (w · φ(x)) / (|w| |φ(x)|) < γ ] ≤ ε.
Such a K is an (ε, γ)-good kernel.
Large Margin Classifiers
• If we have a large margin γ, then the amount of data we need depends only on 1/γ and is independent of the dimension of the space!
  • If there is a large margin and our algorithm produces a large margin classifier, then the amount of data we need depends only on 1/γ [Bartlett & Shawe-Taylor '99].
  • If there is a large margin, then the Perceptron also behaves well.
  • Another nice justification is based on Random Projection [Arriaga & Vempala '99].
Kernels & Large Margins
• A powerful combination in ML in recent years!
• A kernel implicitly allows mapping data into a high-dimensional space and performing certain operations there without paying a high price computationally.
• If the data indeed has a large margin linear separator in that space, then one can avoid paying a high price in terms of sample size as well.
Kernel Methods
• Offer great modularity.
• No need to change the underlying learning algorithm to accommodate a particular choice of kernel function.
• Also, we can substitute a different algorithm while maintaining the same kernel.
What we really care about are good kernels, not only legal kernels!
Good Kernels, Margins, and Low Dimensional Mappings
• Designing a kernel function is much like designing a feature space.
• Given a good kernel K, we can reinterpret K as defining a new set of features. [Balcan-Blum-Vempala, MLJ'06]
Kernels as Features [BBV06]
• If there is indeed a large margin under K, then a random linear projection of the φ-space down to a low-dimensional space approximately preserves linear separability,
• by the Johnson-Lindenstrauss lemma!
Main Idea - Johnson-Lindenstrauss lemma
(figure: positives and negatives with separator w and point x in R^N; a random projection A maps them to R^d, where the projected separator w' and point x' still separate the data)
Main Idea - Johnson-Lindenstrauss lemma, cont.
• For any vectors u, v, with probability ≥ 1 − δ the angle ∠(u, v) is preserved up to ±γ/2 if we use d = O((1/γ²) log(1/δ)).
• Usual use in algorithms: m points, set δ = O(1/m²).
• In our case, if we want w.h.p. ∃ a separator of error ε, use d = O((1/γ²) log(1/(εδ))).
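A minimal sketch of the projection step itself, assuming a Gaussian random matrix scaled by 1/√d (one standard way to instantiate the JL lemma; the seed and scaling convention are illustrative choices):

```python
import numpy as np

def jl_project(points, d, seed=0):
    """Project points in R^N down to R^d with a random Gaussian matrix;
    w.h.p. angles (and hence margins) are approximately preserved."""
    rng = np.random.default_rng(seed)
    N = points.shape[1]
    A = rng.normal(size=(N, d)) / np.sqrt(d)
    return points @ A
```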
Main Idea, cont.
(figure: a random projection A from R^N down to R^d; F is the resulting mapping of the data)
• If c has margin γ in the φ-space, then F(D, c) will w.h.p. have a linear separator of error at most ε.
Problem Statement
• For a given kernel K, the dimensionality and description of φ(x) might be large, or even unknown.
• We do not want to explicitly compute φ(x).
• Given kernel K, produce such a mapping F efficiently:
  • running time that depends polynomially only on 1/γ, 1/ε, and the time to compute K;
  • no dependence on the dimension of the φ-space.
Main Result [BBV06]
• Positive answer - if our procedure for computing the mapping F is also given black-box access to the distribution D (unlabeled data).
Formally:
• Given black-box access to K(·,·), access to D, and ε, γ, δ, construct in polynomial time a mapping F: X → R^d, where d = O((1/γ²) log(1/(εδ))), such that if c has margin γ in the φ-space, then with probability 1 − δ the induced distribution in R^d is separable at margin γ/2 with error ≤ ε.
3 methods (from simplest to best)
1. Draw d examples x1, …, xd from D. Use F0(x) = (K(x, x1), ..., K(x, xd)). For d = (8/ε)[1/γ² + ln(1/δ)], if P was separable with margin γ in the φ-space, then w.h.p. this will be separable with error ε (but this method does not preserve the margin). See the sketch after this list.
2. Same d, but a little more complicated. Separable with error ε at margin γ/2.
3. Combine (2) with a further projection as in the JL lemma. Get a dimension d with a log dependence on 1/ε, rather than linear. So we can set ε ≪ 1/d.
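A sketch of method (1); `landmarks` is an illustrative name for the d points x1, …, xd drawn from D:

```python
import numpy as np

def F0(x, landmarks, K):
    """Method (1): map x to its kernel values against the d points drawn from D.
    Preserves separability (error eps) but not the margin."""
    return np.array([K(x, xi) for xi in landmarks])
```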
A Key Fact
Claim 1: If ∃ a unit-length w of margin γ in the φ-space, then if we draw x1, …, xd ∈ D for d ≥ (8/ε)[1/γ² + ln(1/δ)], w.h.p. (1 − δ) there exists w' in span(φ(x1), ..., φ(xd)) of error ≤ ε at margin γ/2.
Proof: Let S = {φ(x)} for the examples x drawn so far.
• w_in = proj(w, span(S)), w_out = w − w_in.
• Say w_out is large if Pr_x(|w_out · φ(x)| ≥ γ/2) ≥ ε; else small.
• If small, then we are done: w' = w_in.
• Else, the next x has at least ε probability of improving S: |w_out|² ← |w_out|² − (γ/2)².
• This can happen at most 4/γ² times. □
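A one-line justification (not spelled out on the slide) for the decrease in |w_out|²: when an x with |w_out · φ(x)| ≥ γ/2 is added to S, the component of w_out along φ(x) moves into span(S), so, using |φ(x)| ≤ 1,

\[
|w_{\text{out}}'|^2 \;\le\; |w_{\text{out}}|^2 - \frac{(w_{\text{out}} \cdot \phi(x))^2}{|\phi(x)|^2} \;\le\; |w_{\text{out}}|^2 - (\gamma/2)^2 ,
\]

where w_out' denotes the new w_out after enlarging the span; since |w_out|² starts at most |w|² = 1, this can happen at most 4/γ² times.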
A First Attempt
• If we draw x1, ..., xd ∈ D for d = (8/ε)[1/γ² + ln(1/δ)], then w.h.p. there exists w' in span(φ(x1), ..., φ(xd)) of error ≤ ε at margin γ/2.
• So, for some w' = a1φ(x1) + ... + adφ(xd), Pr_{(x, ℓ) ∼ P}[sign(w' · φ(x)) ≠ ℓ] ≤ ε.
• But notice that w' · φ(x) = a1K(x, x1) + ... + adK(x, xd).
  ⇒ the vector (a1, ..., ad) is a separator in the feature space (K(x, x1), …, K(x, xd)) with error ε.
• However, the margin is not preserved, because of the lengths of the target and of the examples [if many of the φ(x_i) are very similar, then their associated features K(x, x_i) will be highly correlated].
A First Mapping
• Draw a set S = {x1, ..., xd} of unlabeled examples from D.
• Run K(x, y) for all x, y ∈ S; get M(S) = (K(xi, xj))_{xi, xj ∈ S}.
• Place S into a d-dimensional space based on K (i.e., on M(S)).
(figure: F1 places x1, x2, x3 in R^d so that F1(xi) · F1(xj) = K(xi, xj); e.g., K(x1, x1) = |F1(x1)|²)
A First Mapping, cont.
• What to do with new points?
• Extend the embedding F1 to all of X: consider F1: X → R^d defined as follows: for x ∈ X, let F1(x) ∈ R^d be the point such that F1(x) · F1(xi) = K(x, xi), for all i ∈ {1, ..., d}.
• The mapping is equivalent to orthogonally projecting φ(x) down to span(φ(x1), …, φ(xd)).
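One concrete way to realize F1 (a sketch under the assumption that M(S) is positive definite, with a small jitter added for numerical stability; this construction is not spelled out on the slides): factor M(S) = L Lᵀ with Cholesky, take F1(xi) to be the i-th row of L, and for a new x solve L·F1(x) = (K(x, x1), …, K(x, xd)), which enforces F1(x) · F1(xi) = K(x, xi).

```python
import numpy as np

def fit_F1(S, K):
    """Embed the sample S into R^d so that inner products match K on S."""
    M = np.array([[K(a, b) for b in S] for a in S])       # Gram matrix M(S)
    L = np.linalg.cholesky(M + 1e-10 * np.eye(len(S)))    # M = L L^T (small jitter)
    return L                                              # row i of L is F1(x_i)

def F1(x, S, L, K):
    """New point: solve L v = (K(x,x_1),...,K(x,x_d)), so F1(x).F1(x_i) = K(x,x_i)."""
    k = np.array([K(x, xi) for xi in S])
    return np.linalg.solve(L, k)
```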
A First Mapping, cont.
• From Claim 1, we have that with probability 1 − δ, for some w' = a1φ(x1) + ... + adφ(xd) we have: Pr_{(x, ℓ(x)) ∼ P}[ ℓ(x) (w' · φ(x)) / (|w'| |φ(x)|) ≥ γ/2 ] ≥ 1 − ε.
• Consider w'' = a1F1(x1) + ... + adF1(xd). We have |w''| = |w'|, w' · φ(x) = w'' · F1(x), and |F1(x)| ≤ |φ(x)|.
• So: Pr_{(x, ℓ(x)) ∼ P}[ ℓ(x) (w'' · F1(x)) / (|w''| |F1(x)|) ≥ γ/2 ] ≥ 1 − ε, i.e., w'' has error ≤ ε at margin γ/2 in R^d.
An improved mapping
• A two-stage process: compose the first mapping, F1, with a random linear projection.
• Combine two types of random projection:
  • a projection based on points chosen at random from D;
  • a projection based on choosing points uniformly at random in the intermediate space.
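A sketch of the two-stage composition, reusing the hypothetical fit_F1/F1 helpers from the earlier sketch (the names and the Gaussian projection are illustrative assumptions):

```python
import numpy as np

def build_improved_mapping(S, K, d_final, seed=0):
    """Stage 1: embed via F1 using points S drawn from D.
    Stage 2: random Gaussian projection from the |S|-dim intermediate space to R^d_final."""
    rng = np.random.default_rng(seed)
    L = fit_F1(S, K)                                            # from the earlier sketch
    A = rng.normal(size=(len(S), d_final)) / np.sqrt(d_final)   # JL-style projection
    return lambda x: F1(x, S, L, K) @ A                         # F = A o F1
```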
(figure: the original space X = R^n maps via φ into R^N (N ≫ n); F1 maps into an intermediate space, and a random projection A maps down to R^d; the overall mapping is F)
Improved Mapping - Properties
• Given ε, γ, δ < 1, if P has margin γ in the φ-space, then with probability ≥ 1 − δ, our mapping into R^d, for d = O((1/γ²) log(1/(εδ))), has the property that F(D, c) is linearly separable with error at most ε at margin γ/2, given that we use Õ((1/ε)[1/γ² + ln(1/δ)]) unlabeled examples.
Improved Mapping, Consequences
• If P has margin γ in the φ-space, then we can use unlabeled examples to produce a mapping into R^d, for d = O((1/γ²) log(1/(ε'δ))), such that w.h.p. the data is linearly separable with error ≪ ε'/d.
• The error rate of the induced target function in R^d is so small that a set S of labeled examples will, w.h.p., be perfectly separable in R^d.
• So we can use any generic, zero-noise linear separator algorithm in R^d.