Kernels, Margins, and Low-dimensional Mappings

Kernels, Margins, and Low-dimensional Mappings. Maria-Florina Balcan, Avrim Blum, Santosh Vempala. [NIPS 2007 Workshop on Topology Learning]. Generic problem: given a set of images, want to learn a linear separator to distinguish men from women.

Presentation Transcript


  1. Kernels, Margins, and Low-dimensional Mappings. Maria-Florina Balcan, Avrim Blum, Santosh Vempala [NIPS 2007 Workshop on Topology Learning]

  2. Generic problem
     • Given a set of images, want to learn a linear separator to distinguish men from women.
     • Problem: pixel representation no good.
     Old style advice:
     • Pick a better set of features!
     • But seems ad-hoc. Not scientific.
     New style advice:
     • Use a Kernel! K(x, y) = φ(x)·φ(y). φ is an implicit, high-dimensional mapping.
     • Feels more scientific. Many algorithms can be "kernelized". Use "magic" of implicit high-dim'l space. Don't pay for it if there exists a large-margin separator.

  3. Generic problem
     Old style advice:
     • Pick a better set of features!
     • But seems ad-hoc. Not scientific.
     New style advice:
     • Use a Kernel! K(x, y) = φ(x)·φ(y). φ is an implicit, high-dimensional mapping.
     • Feels more scientific. Many algorithms can be "kernelized". Use "magic" of implicit high-dim'l space. Don't pay for it if there exists a large-margin separator.
     • E.g., K(x,y) = (x·y + 1)^m. φ: (n-dim'l space) → (n^m-dim'l space).
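The polynomial-kernel example can be checked numerically. The sketch below is an illustration only, not code from the talk: `poly_kernel` and `explicit_features` are hypothetical names, and the explicit map appends a constant coordinate, so it has (n+1)^m coordinates, the same order as the n^m quoted on the slide.

```python
import itertools
import numpy as np

def poly_kernel(x, y, m=2):
    """K(x, y) = (x·y + 1)^m, computed without building phi."""
    return (np.dot(x, y) + 1.0) ** m

def explicit_features(x, m=2):
    """One explicit phi with phi(x)·phi(y) = K(x, y).

    Appending a constant 1 gives xt·yt = x·y + 1; the m-fold products of
    the coordinates of xt then expand (xt·yt)^m term by term.
    """
    xt = np.append(x, 1.0)
    return np.array([np.prod(c) for c in itertools.product(xt, repeat=m)])

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), rng.normal(size=4)
implicit = poly_kernel(x, y, m=3)
explicit = explicit_features(x, m=3) @ explicit_features(y, m=3)
```

The two values agree up to rounding, while the explicit vector already has 125 coordinates for n = 4 and m = 3; the kernel evaluation never materializes it.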

  4. Claim: Can view new method as way of conducting old method.
     • Given a kernel [as a black-box program K(x,y)] and access to typical inputs [samples from D],
     • Claim: Can run K and reverse-engineer an explicit (small) set of features, such that if K is good [∃ large-margin separator in φ-space for D,c], then this is a good feature set [∃ almost-as-good separator].
     "You give me a kernel, I give you a set of features."
     Do this using idea of random projection…

  5. Claim: Can view new method as way of conducting old method.
     • Given a kernel [as a black-box program K(x,y)] and access to typical inputs [samples from D],
     • Claim: Can run K and reverse-engineer an explicit (small) set of features, such that if K is good [∃ large-margin separator in φ-space for D,c], then this is a good feature set [∃ almost-as-good separator].
     E.g., sample z_1,...,z_d from D. Given x, define x_i = K(x,z_i).
     Implications:
     • Practical: alternative to kernelizing the algorithm.
     • Conceptual: View kernel as (principled) way of doing feature generation. View as similarity function, rather than "magic power of implicit high dimensional space".
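A minimal sketch of the "sample z_1,...,z_d, define x_i = K(x,z_i)" recipe, treating K as a black box. The names `make_mapping1` and `rbf`, and the use of a standard normal as a stand-in for the distribution D, are assumptions made for illustration.

```python
import numpy as np

def make_mapping1(K, landmarks):
    """Return F1 with F1(x) = (K(x, z_1), ..., K(x, z_d))."""
    def F1(x):
        return np.array([K(x, z) for z in landmarks])
    return F1

def rbf(x, y):
    """Example black-box kernel (an assumption; any kernel K would do)."""
    return float(np.exp(-0.5 * np.sum((x - y) ** 2)))

rng = np.random.default_rng(0)
landmarks = list(rng.normal(size=(6, 3)))   # stand-in for d = 6 draws from D
F1 = make_mapping1(rbf, landmarks)
features = F1(rng.normal(size=3))           # explicit 6-dimensional feature vector
```

Any ordinary linear learner can then be run on `features`, which is the "alternative to kernelizing the algorithm" mentioned under Implications.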

  6. Basic setup, definitions
     [Figure: positive and negative examples in X, with P=(D,c), mapped by φ and separated by w.]
     • Instance space X.
     • Distribution D, target c. Use P = (D,c).
     • K(x,y) = φ(x)·φ(y).
     • P is separable with margin γ in φ-space if ∃ w s.t. Pr_{(x,ℓ)∈P}[ℓ(w·φ(x)/|φ(x)|) < γ] = 0. (|w|=1)
     • Error ε at margin γ: replace "0" with "ε".
     Goal is to use K to get mapping to low-dim'l space.
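The margin definition can be read as an empirical quantity on a finite sample. The following is a hedged sketch with a hypothetical helper `margin_error`; it assumes the φ-images are available as rows of a matrix and labels are ±1.

```python
import numpy as np

def margin_error(w, Phi, labels, gamma):
    """Fraction of examples with l * (w·phi(x) / |phi(x)|) < gamma, for |w| = 1.

    Phi: (num_examples, N) array whose rows are phi(x); labels: array of +1/-1.
    """
    w = w / np.linalg.norm(w)                        # enforce |w| = 1
    normalized = (Phi @ w) / np.linalg.norm(Phi, axis=1)
    return float(np.mean(labels * normalized < gamma))

Phi = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
labels = np.array([1, -1, 1])
w = np.array([1.0, 0.0])
err_at_half = margin_error(w, Phi, labels, gamma=0.5)
err_at_zero = margin_error(w, Phi, labels, gamma=0.0)
```

Here the third example lies on the separator, so it violates margin 0.5 but is not counted at margin 0 because the inequality is strict.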

  7. One idea: Johnson-Lindenstrauss lemma
     [Figure: positive and negative examples in X, with P=(D,c), mapped by φ.]
     • If P separable with margin γ in φ-space, then with prob 1-δ, a random linear projection down to space of dimension d = O((1/γ²) log[1/(δε)]) will have a linear separator of error < ε. [Arriaga Vempala]
     • If vectors are r_1,r_2,...,r_d, then can view as features x_i = φ(x)·r_i.
     • Problem: uses φ. Can we do directly, using K as black-box, without computing φ?
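For contrast with the kernel-only methods, here is a sketch of the random-projection features x_i = φ(x)·r_i when φ is available explicitly. Gaussian directions scaled by 1/√d are one common choice and an assumption here; the slide does not fix the distribution of the r_i, and `random_projection_features` is a hypothetical name.

```python
import numpy as np

def random_projection_features(Phi, d, rng):
    """Map rows phi(x) in R^N to (phi(x)·r_1, ..., phi(x)·r_d).

    The columns of R play the role of r_1, ..., r_d; the 1/sqrt(d) scaling
    keeps squared lengths unchanged in expectation.
    """
    N = Phi.shape[1]
    R = rng.normal(size=(N, d)) / np.sqrt(d)
    return Phi @ R

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 50))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)   # unit-length examples
X_low = random_projection_features(Phi, d=2000, rng=rng)
```

Dot products between the projected rows stay close to the original ones, which is why a large-margin separator survives; the catch noted on the slide is that this needs φ(x) itself.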

  8. 3 methods (from simplest to best)
     • (1) Draw d examples z_1,...,z_d from D. Use: F(x) = (K(x,z_1), ..., K(x,z_d)). [So, "x_i" = K(x,z_i).] For d = (8/ε)[1/γ² + ln 1/δ], if P was separable with margin γ in φ-space, then whp this will be separable with error ε. (But this method doesn't preserve margin.)
     • (2) Same d, but a little more complicated. Separable with error ε at margin γ/2.
     • (3) Combine (2) with further projection as in JL lemma. Get d with log dependence on 1/ε, rather than linear. So, can set ε ≪ 1/d.
     All these methods need access to D, unlike JL. Can this be removed? We show NO for generic K, but may be possible for natural K.
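To get a feel for the sample size in methods (1) and (2), the bound d = (8/ε)[1/γ² + ln 1/δ] can simply be evaluated. The helper name `mapping_dimension` and the parameter values are illustrative assumptions.

```python
import math

def mapping_dimension(eps, gamma, delta):
    """Smallest integer d with d >= (8/eps) * (1/gamma^2 + ln(1/delta))."""
    return math.ceil((8.0 / eps) * (1.0 / gamma ** 2 + math.log(1.0 / delta)))

d_example = mapping_dimension(eps=0.1, gamma=0.5, delta=0.05)   # 560
d_tighter = mapping_dimension(eps=0.01, gamma=0.5, delta=0.05)
```

Tightening ε by a factor of 10 multiplies d by about 10, which is the linear dependence on 1/ε that method (3) replaces with a logarithmic one.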

  9. Key fact
     Claim: If ∃ perfect w of margin γ in φ-space, then if draw z_1,...,z_d ∈ D for d ≥ (8/ε)[1/γ² + ln 1/δ], whp (1-δ) there exists w' in span(φ(z_1),...,φ(z_d)) of error ≤ ε at margin γ/2.
     Proof: Let S = examples drawn so far. Assume |w|=1, |φ(z)|=1 ∀ z.
     • w_in = proj(w, span(S)), w_out = w − w_in.
     • Say w_out is large if Pr_z(|w_out·φ(z)| ≥ γ/2) ≥ ε; else small.
     • If small, then done: w' = w_in.
     • Else, next z has at least ε prob of improving S: |w_out|² ← |w_out|² − (γ/2)².
     • Can happen at most 4/γ² times, since |w_out|² ≤ 1 at the start. □
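The decomposition w = w_in + w_out used in the proof can be illustrated when φ is explicit. This is a toy sketch under that assumption; `split_in_out` is a hypothetical helper and the vectors are made up for the example.

```python
import numpy as np

def split_in_out(w, S):
    """Split w into its projection onto span(S) and the remainder.

    S: (k, N) array whose rows are phi(z) for the examples drawn so far.
    Least squares gives the coefficients of the orthogonal projection.
    """
    coeffs, *_ = np.linalg.lstsq(S.T, w, rcond=None)
    w_in = S.T @ coeffs
    return w_in, w - w_in

w = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
S1 = np.array([[1.0, 0.0, 0.0]])
S2 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
_, w_out1 = split_in_out(w, S1)
_, w_out2 = split_in_out(w, S2)
sq1, sq2 = float(w_out1 @ w_out1), float(w_out2 @ w_out2)
```

Adding the second example removes the remaining part of w, so |w_out|² drops from 0.5 to 0; the proof bounds how many such drops of size at least (γ/2)² can occur.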

  10. So...
     If draw z_1,...,z_d ∈ D for d = (8/ε)[1/γ² + ln 1/δ], then whp there exists w' in span(φ(z_1),...,φ(z_d)) of error ≤ ε at margin γ/2.
     • So, for some w' = α_1φ(z_1) + ... + α_dφ(z_d), Pr_{(x,ℓ)∈P}[sign(w'·φ(x)) ≠ ℓ] ≤ ε.
     • But notice that w'·φ(x) = α_1K(x,z_1) + ... + α_dK(x,z_d). ⇒ vector (α_1,...,α_d) is an ε-good separator in the feature space: x_i = K(x,z_i).
     • But margin not preserved because length of target, examples not preserved.
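The identity w'·φ(x) = α_1K(x,z_1) + ... + α_dK(x,z_d) is easy to confirm for a kernel whose φ can be written out. The sketch below uses the degree-2 polynomial kernel with an explicit map as an assumed example; the coefficients α and the points are arbitrary.

```python
import itertools
import numpy as np

def K(x, y):
    """Degree-2 polynomial kernel, K(x, y) = (x·y + 1)^2."""
    return (np.dot(x, y) + 1.0) ** 2

def phi(x):
    """Explicit map with phi(x)·phi(y) = K(x, y): pairwise products of (x, 1)."""
    xt = np.append(x, 1.0)
    return np.array([a * b for a, b in itertools.product(xt, repeat=2)])

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 3))     # stand-in for z_1, ..., z_4
alpha = rng.normal(size=4)      # arbitrary coefficients alpha_1, ..., alpha_4
x = rng.normal(size=3)

w_prime = sum(a * phi(z) for a, z in zip(alpha, Z))   # w' in span of the phi(z_i)
lhs = w_prime @ phi(x)                                # w'·phi(x)
rhs = sum(a * K(x, z) for a, z in zip(alpha, Z))      # alpha·(K(x,z_1), ..., K(x,z_4))
```

The right-hand side uses only kernel evaluations, so α acts as a linear separator on the features x_i = K(x,z_i).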

  11. How to preserve margin? (mapping #2)
     • We know ∃ w' in span(φ(z_1),...,φ(z_d)) of error ≤ ε at margin γ/2.
     • So, given a new x, just want to do an orthogonal projection of φ(x) into that span. (Preserves dot-product, decreases |φ(x)|, so only increases margin.)
     • Run K(z_i,z_j) for all i,j=1,...,d. Get matrix M.
     • Decompose M = UᵀU.
     • (Mapping #2) = (mapping #1)U⁻¹. □
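The recipe "decompose M = UᵀU, then apply U⁻¹ to mapping #1" can be sketched with a Cholesky factorization, which is one way to obtain such a U and assumes M is strictly positive definite. The names `make_mapping2` and `rbf` and the sample points are illustrative assumptions.

```python
import numpy as np

def rbf(x, y):
    """Example kernel (an assumption); its Gram matrix is positive definite
    for distinct points, so Cholesky applies."""
    return float(np.exp(-0.5 * np.sum((x - y) ** 2)))

def make_mapping2(K, landmarks):
    """Return F2 with F2(x) = F1(x) U^{-1}, where M = U^T U."""
    M = np.array([[K(zi, zj) for zj in landmarks] for zi in landmarks])
    L = np.linalg.cholesky(M)          # M = L L^T, so U = L^T
    def F2(x):
        f1 = np.array([K(x, z) for z in landmarks])   # mapping #1
        # Row vector f1 times U^{-1} equals the solution y of L y = f1.
        return np.linalg.solve(L, f1)
    return F2

rng = np.random.default_rng(1)
landmarks = list(rng.normal(size=(5, 3)))
F2 = make_mapping2(rbf, landmarks)
embedded = np.array([F2(z) for z in landmarks])   # F2(z_1), ..., F2(z_5)
```

With this choice F2(z_i)·F2(z_j) reproduces K(z_i,z_j), and F2(x)·F2(z_i) = K(x,z_i) for a new x, consistent with projecting φ(x) onto the span.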

  12. Mapping #2, Details
     • Draw a set S={z_1, ..., z_d} of d = (8/ε)[1/γ² + ln 1/δ] unlabeled examples from D.
     • Run K(x,y) for all x,y ∈ S, get M(S) = (K(z_i,z_j))_{z_i,z_j ∈ S}.
     • Place S into d-dim. space based on K (or M(S)).
     [Figure: z_1, z_2, z_3 in X are placed in R^d as F2(z_1), F2(z_2), F2(z_3), with K(z_i,z_i) = |F2(z_i)|² and dot products such as K(z_1,z_2).]

  13. Mapping #2, Details, cont.
     • What to do with new points?
     • Extend the embedding F1 to all of X:
     • Consider F2: X → R^d defined as follows: for x ∈ X, let F2(x) ∈ R^d be the point of smallest length such that F2(x)·F2(z_i) = K(x,z_i), for all i ∈ {1, ..., d}.
     • The mapping is equivalent to orthogonally projecting φ(x) down to span(φ(z_1),…,φ(z_d)).
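The "point of smallest length" definition can be sketched with an eigendecomposition of M, which also covers the case where M is singular. This is an illustration under assumptions: `make_mapping2_eig` is a hypothetical name, directions with negligible eigenvalue are dropped (so the output has at most d coordinates), and the linear kernel is used only because its φ is the identity.

```python
import numpy as np

def make_mapping2_eig(K, landmarks, tol=1e-10):
    """Return (F2, Z): Z has rows F2(z_i); F2(x) solves Z y = k_x with smallest |y|."""
    M = np.array([[K(zi, zj) for zj in landmarks] for zi in landmarks])
    evals, evecs = np.linalg.eigh(M)
    keep = evals > tol * evals.max()          # drop numerically zero directions
    V, lam = evecs[:, keep], evals[keep]
    Z = V * np.sqrt(lam)                      # Z @ Z.T reproduces M
    def F2(x):
        k_x = np.array([K(x, z) for z in landmarks])
        return (V.T @ k_x) / np.sqrt(lam)
    return F2, Z

def linear(a, b):
    return float(np.dot(a, b))

landmarks = [np.array(p, dtype=float) for p in [[1, 0], [0, 1], [1, 1], [2, -1]]]
F2, Z = make_mapping2_eig(linear, landmarks)
x = np.array([0.5, -1.5])
k_x = np.array([linear(x, z) for z in landmarks])
```

In this toy case the four landmarks span only two dimensions, so F2(x) has two coordinates, satisfies every constraint F2(x)·F2(z_i) = K(x,z_i), and has the same length as x.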

  14. How to improve dimension?
     • Current mapping (F2) gives d = (8/ε)[1/γ² + ln 1/δ].
     • Johnson-Lindenstrauss gives d1 = O((1/γ²) log 1/(δε)). Nice because can have d1 ≪ 1/ε.
     • Answer: just combine the two...
     • Run Mapping #2, then do random projection down from that.
     • Gives us desired dimension (# features), though sample-complexity remains as in mapping #2.
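Combining the two steps amounts to composing any mapping into R^d with a random matrix. The sketch below assumes Gaussian projection directions and uses a placeholder for F2; `compose_with_jl` is a hypothetical name, not from the talk.

```python
import numpy as np

def compose_with_jl(F2, d, d1, rng):
    """Return F(x) = F2(x) R, with R a fixed random d x d1 Gaussian matrix."""
    R = rng.normal(size=(d, d1)) / np.sqrt(d1)
    def F(x):
        return F2(x) @ R
    return F

rng = np.random.default_rng(0)
identity_F2 = lambda x: np.asarray(x, dtype=float)   # placeholder for a real mapping #2
F = compose_with_jl(identity_F2, d=10, d1=3, rng=rng)
a, b = rng.normal(size=10), rng.normal(size=10)
out = F(a)
```

The random matrix is drawn once and reused for every x, so F is a fixed feature map; only the d draws from D needed to build F2 count toward sample complexity.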

  15. [Figure: points labeled X and O are mapped by φ from X into R^N, then by F2 into R^d, then by a JL random projection into R^d1; F denotes the combined mapping.]

  16. Open Problems
     • For specific natural kernels, like K(x,y) = (1 + x·y)^m, is there an efficient analog to JL, without needing access to D?
     • Or, can one at least reduce the sample-complexity? (Use fewer accesses to D.)
     • Can one extend results (e.g., mapping #1: x → [K(x,z_1), ..., K(x,z_d)]) to more general similarity functions K?
     • Not exactly clear what theorem statement would look like.
