Kernels for dummies
Tea talk, September 22, 2014
I find kernel talks confusing.
• Equations seem to appear out of nowhere.
• It’s hard for me to extract the main message.
• I still don’t know what an RKHS is.
• Which is a little strange, considering I’ve known about Hilbert spaces since before most of you were born.
So today we’re going to attempt to demystify kernels. My claim: if you understand linear algebra, you’ll understand kernels. There will be no RKHSs in this talk. And no Polish spaces. And no mean embeddings. But also no proofs; basically I’m going to avoid the hard stuff. Instead, just the intuition!
We’ll start with a pretty standard problem:
• You have samples from some distribution, $p(x)$.
• You want to minimize $F\left(\int dx\, f(x)\, p(x)\right)$ with respect to $f(x)$.
• I’ll give examples later, but just about every kernel talk you have ever heard considers this problem, or a slight generalization.
Our problem:
$\tilde f(x) = \arg\min_{f(x)} F\left(\int dx\, f(x)\, p(x)\right)$, given samples $x \sim p$.

We only have samples, so we use the empirical distribution,
$p(x) = \frac{1}{n} \sum_i \delta(x - x_i)$,
which turns the integral into a sum: $\int dx\, f(x)\, p(x) = \frac{1}{n} \sum_i f(x_i)$.

Without any constraint on $f$, this can be made anything we like:
$f(x) = +\infty\, \delta(x - x_1) \;\Rightarrow\; \int dx\, f(x)\, p(x) = +\infty$
$f(x) = -\infty\, \delta(x - x_1) \;\Rightarrow\; \int dx\, f(x)\, p(x) = -\infty$
By suitably adjusting $f(x)$, $\int dx\, f(x)\, p(x)$ can range from $-\infty$ to $+\infty$. Therefore, we have to regularize $f(x)$: we need a smoothness constraint.
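A quick numerical illustration of that point (my own toy example, not from the talk): with the empirical distribution the objective is just a sample average, and an unconstrained $f$, here a tall narrow bump parked on one sample, makes that average as large as we like.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)            # samples x_i ~ p

def objective(f):
    """∫ dx f(x) p(x) with p the empirical distribution = (1/n) Σ_i f(x_i)."""
    return np.mean(f(x))

f_smooth = lambda t: np.sin(t)                               # a tame f
f_spike  = lambda t: 1e6 * np.exp(-(t - x[0])**2 / 1e-6)     # tall, narrow bump at x_1

print(objective(f_smooth))   # small
print(objective(f_spike))    # ~1e6 / n -- grows without bound as the bump gets taller
```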
If you’re Bayesian, you put a prior over $f(x)$. If you’re a kernel person, you demand that
$\int dx\, dy\, f(x)\, K^{-1}(x, y)\, f(y)$
is in some sense small. $K(x, y)$ is a kernel. For example, $K(x, y) = \exp(-(x-y)^2/2)$.
An aside: $\langle f, g \rangle = \int dx\, dy\, f(x)\, K^{-1}(x, y)\, g(y)$.
This raises two questions:
1. How do we make sense of $K^{-1}(x, y)$?
2. What does $\int dx\, dy\, f(x)\, K^{-1}(x, y)\, f(y)$ have to do with smoothness?
1. Making sense of $K^{-1}(x, y)$:
$\int dy\, K^{-1}(x, y)\, K(y, z) = \delta(x - z)$ defines $K^{-1}(x, y)$.
Think of $K$ as a matrix, and $K^{-1}$ as its inverse. $K$ has an uncountably infinite number of indices, but otherwise it’s a very standard matrix. $K^{-1}$ exists if all the eigenvalues of $K$ are positive.
An aside: $K^{-1}$ doesn’t really exist. But that’s irrelevant.
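To make “$K$ is just a matrix” concrete, here is a small sketch of my own (assuming a Gaussian kernel on a 1-D grid, which stands in for the continuum of indices). The aside that $K^{-1}$ doesn’t really exist shows up numerically as severe ill-conditioning, so a tiny jitter is added before inverting.

```python
import numpy as np

x = np.linspace(-5, 5, 101)                      # a grid standing in for the continuum of "indices"
K = np.exp(-(x[:, None] - x[None, :])**2 / 2)    # K_ij = K(x_i, x_j) = exp(-(x_i - x_j)^2 / 2)

eigvals = np.linalg.eigvalsh(K)
print(eigvals.min())          # all (numerically) >= 0, but many are absurdly tiny ...
print(np.linalg.cond(K))      # ... so the condition number is astronomical:
                              # the aside that K^{-1} "doesn't really exist", in the flesh

# With a tiny jitter, K behaves like any other positive-definite matrix and can be inverted:
ridge = K + 1e-8 * np.eye(len(x))
Kinv = np.linalg.inv(ridge)
print(np.abs(ridge @ Kinv - np.eye(len(x))).max())   # tiny: a perfectly ordinary matrix inverse
```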
2. What does $\int dx\, dy\, f(x)\, K^{-1}(x, y)\, f(y)$ have to do with smoothness?
I’ll answer for a specific case: translation-invariant kernels, $K(x, y) = K(x - y)$.
$\int dx\, dy\, f(x)\, K^{-1}(x, y)\, f(y)$
$= \int dx\, dy\, f(x)\, K^{-1}(x - y)\, f(y)$   (translation invariance)
$= \int dk\, \frac{|\tilde f(k)|^2}{\tilde K(k)}$   (Fourier transform),
where $\tilde f(k)$ and $\tilde K(k)$ are the Fourier transforms of $f(x)$ and $K(x)$.

For smooth kernels, $\tilde K(k)$ falls off rapidly with $k$. For the above integral to be small, $\tilde f(k)$ must fall off even more rapidly with $k$. In other words, $f(x)$ must be smooth.

Example: $K(x) = \exp(-x^2/2) \;\Rightarrow\; \tilde K(k) \propto \exp(-k^2/2)$, so
$\int dk\, \frac{|\tilde f(k)|^2}{\tilde K(k)} \propto \int dk\, |\tilde f(k)|^2 \exp(+k^2/2)$.
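Here is a small numerical check of that statement (my own sketch, not from the talk): compute $\int dk\, |\tilde f(k)|^2 \exp(+k^2/2)$ by FFT for a smooth bump and for the same bump with a high-frequency wiggle added. The grid, the test functions, and the frequency cutoff (needed to keep $\exp(k^2/2)$ from multiplying FFT noise where $\tilde f$ is already negligible) are all choices of mine.

```python
import numpy as np

dx = 0.05
x = np.arange(-20, 20, dx)
k = 2 * np.pi * np.fft.fftfreq(len(x), d=dx)      # angular frequencies
dk = k[1] - k[0]

def penalty(f):
    """∫ dk |f~(k)|^2 exp(+k^2/2), the smoothness penalty for K(x) = exp(-x^2/2)."""
    fk = dx * np.fft.fft(f)                       # approximate Fourier transform of f
    mask = np.abs(k) < 8.0                        # exp(k^2/2) is astronomical beyond this,
                                                  # but |f~(k)|^2 is essentially zero there
    return np.sum(np.abs(fk[mask])**2 * np.exp(k[mask]**2 / 2)) * dk

f_smooth = np.exp(-x**2 / 4)                                     # smooth bump
f_rough  = f_smooth + 0.1 * np.sin(6 * x) * np.exp(-x**2 / 4)    # add a k ≈ 6 wiggle

print(penalty(f_smooth))   # modest
print(penalty(f_rough))    # many orders of magnitude larger
```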
More generally, let $g_k$ and $\lambda(k)$ be the eigenfunctions and eigenvalues of the kernel,
$\int dy\, K(x, y)\, g_k(y) = \lambda(k)\, g_k(x)$.
Then
$\int dx\, dy\, f(x)\, K^{-1}(x, y)\, f(y) = \int dk\, \frac{\left[\int dx\, f(x)\, g_k(x)\right]^2}{\lambda(k)}$.
Typically, $\lambda(k)$ falls off with $k$, and the $g_k(x)$ become increasingly rough with $k$.
Finally, let’s link this to linear algebra:
$\int dx\, f(x)\, g(x) \equiv f \cdot g$
$\int dx\, f(x)\, A(x, y) \equiv (f \cdot A)(y)$
$\Rightarrow \int dx\, dy\, f(x)\, K^{-1}(x, y)\, f(y) = f \cdot K^{-1} \cdot f$
Compare to:
$\sum_i x_i y_i = x \cdot y$
$\sum_j A_{ij} x_j = (Ax)_i$
$\sum_{ij} x_i A^{-1}_{ij} x_j = x \cdot A^{-1} \cdot x$
Integrals are glorified sums!
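To see the “glorified sums” claim in action, here is a sketch of mine, using the same grid-and-jitter setup as above: on a grid with spacing $h$, $K^{-1}(x_i, x_j) \approx (K^{-1})_{ij}/h^2$, so the two factors of $h$ from the integrals cancel and the double integral $\int dx\, dy\, f(x)\, K^{-1}(x,y)\, f(y)$ collapses to the plain quadratic form $\mathbf{f}^\top K^{-1} \mathbf{f}$. It behaves exactly as the Fourier argument says: small for a smooth $f$, huge for a rough one.

```python
import numpy as np

x = np.linspace(-10, 10, 201)
K = np.exp(-(x[:, None] - x[None, :])**2 / 2)        # Gaussian kernel matrix on the grid
Kreg = K + 1e-8 * np.eye(len(x))                     # jitter, as before, so the inverse exists

def penalty(f):
    """Discretized ∫ dx dy f(x) K^{-1}(x,y) f(y): just the quadratic form f^T K^{-1} f."""
    return f @ np.linalg.solve(Kreg, f)

f_smooth = np.exp(-x**2 / 4)
f_rough  = f_smooth + 0.1 * np.sin(6 * x) * np.exp(-x**2 / 4)

print(penalty(f_smooth))   # small
print(penalty(f_rough))    # vastly larger: the high-frequency wiggle is heavily penalized
```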
Our problem:
$\tilde f = \arg\min_f F(f \cdot p)$ subject to $f \cdot K^{-1} \cdot f$ being small.
Two notions of small:
• $\lambda$ fixed: $\frac{d}{df}\left[ F(f \cdot p) + \lambda\, f \cdot K^{-1} \cdot f \right] = 0$
• Lagrange multipliers: $f \cdot K^{-1} \cdot f = \text{constant}$.
An aside: $\lambda\, f \cdot K^{-1} \cdot f$ can often be thought of as coming from a prior.
$\frac{d}{df}\left[ F(f \cdot p) + \lambda\, f \cdot K^{-1} \cdot f \right] = 0$
is easy to solve:
$F'(f \cdot p)\, p + 2\lambda K^{-1} f = 0 \;\Rightarrow\; f = -\frac{F'(f \cdot p)}{2\lambda}\, K \cdot p$
Remember: $p(x) = \frac{1}{n} \sum_i \delta(x - x_i) \;\Rightarrow\; (K \cdot p)(x) = \frac{1}{n} \sum_i K(x, x_i)$.
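The only data-dependent object in that solution is $K \cdot p$, which is just the kernel averaged over the samples. A minimal sketch of computing it (my own; the samples and the kernel width are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
xi = rng.normal(size=50)                      # samples x_i ~ p

def Kp(x, samples, width=1.0):
    """(K·p)(x) = (1/n) Σ_i K(x, x_i), with the Gaussian kernel K(x, y) = exp(-(x-y)^2 / (2 width^2))."""
    x = np.atleast_1d(x)
    return np.mean(np.exp(-(x[:, None] - samples[None, :])**2 / (2 * width**2)), axis=1)

print(Kp(np.array([-2.0, 0.0, 2.0]), xi))     # K·p evaluated at a few test points
```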
The more general problem:
$\{\tilde f_1, \tilde f_2, \ldots\} = \arg\min_{f_1, f_2, \ldots} F(f_1 \cdot p_1, f_2 \cdot p_2, \ldots)$ subject to the $f_i \cdot K^{-1} \cdot f_i$ being small.
Almost all kernel-related problems fall into this class. Those problems are fully specified by:
• the functional, $F(f_1 \cdot p_1, f_2 \cdot p_2, \ldots)$, to be minimized
• what one means by small (e.g., $f_i \cdot K^{-1} \cdot f_i = c_i$).
The rest is (typically very straightforward) algebra.
Three examples: 1. A “witness” function. 2. Ridge regression. 3. Kernel PCA (which is a little bit different).
1. A “witness” function.
Maximize
$[f \cdot (p - q)]^2 = \left[\int dx\, f(x)\,[p(x) - q(x)]\right]^2$
subject to the constraint
$f \cdot K^{-1} \cdot f = \int dx\, dy\, f(x)\, K^{-1}(x, y)\, f(y) = 1$.
$p$ and $q$ are sums of delta functions.
1. A “witness” function.
Maximize $[f \cdot (p - q)]^2$ subject to the constraint $f \cdot K^{-1} \cdot f = 1$.
Lagrange multipliers:
$\frac{d}{df}\left( [f \cdot (p - q)]^2 - \lambda\, f \cdot K^{-1} \cdot f \right) = 0$
$\Rightarrow\; f = \frac{K \cdot (p - q)}{[(p - q) \cdot K \cdot (p - q)]^{1/2}}$
$\Rightarrow\; [f \cdot (p - q)]^2 = (p - q) \cdot K \cdot (p - q)$.
1. A “witness” function.
$p \cdot K \cdot p = \int dx\, dy\, p(x)\, K(x, y)\, p(y)$, with $p(x) = \frac{1}{n}\sum_j \delta(x - x_j)$, so
$p \cdot K \cdot p = \frac{1}{n^2} \sum_{ij} K(x_i, x_j)$.
We didn’t mention RKHSs.
We didn’t mention mean embeddings.
All we did was linear algebra.
1. A “witness” function.
That solution, $f = K \cdot (p - q)\, /\, [(p - q) \cdot K \cdot (p - q)]^{1/2}$ with $[f \cdot (p - q)]^2 = (p - q) \cdot K \cdot (p - q)$, is ~50% of Arthur’s Gatsby job talk.
I do not mean to trivialize the work of kernel people. But I do want to point out that the setup is almost always straightforward.
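Putting the witness-function slides together in code (my sketch; the data, sample sizes, and kernel width are mine): everything reduces to plain kernel sums over the two sample sets.

```python
import numpy as np

rng = np.random.default_rng(2)
xp = rng.normal(0.0, 1.0, size=200)            # samples from p
xq = rng.normal(0.5, 1.0, size=300)            # samples from q

def K(a, b, width=1.0):
    """Gaussian kernel matrix K(a_i, b_j)."""
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * width**2))

# (p - q)·K·(p - q) = p·K·p - 2 p·K·q + q·K·q, each term a plain average of kernel values
pKp = K(xp, xp).mean()
pKq = K(xp, xq).mean()
qKq = K(xq, xq).mean()
value = pKp - 2 * pKq + qKq                    # = [f·(p - q)]^2 at the optimum

def witness(x):
    """The (unnormalized) witness function K·(p - q) evaluated at x."""
    x = np.atleast_1d(x)
    return K(x, xp).mean(axis=1) - K(x, xq).mean(axis=1)

print(value)
print(witness(np.array([-1.0, 0.0, 1.0])))     # positive where p has more mass, negative where q does
```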
2. Ridge regression.
Minimize $\sum_i (y_i - f \cdot p_i)^2 + \lambda\, f \cdot K^{-1} \cdot f$ with respect to $f$.
• $i$ labels observations
• the $y_i$ are observed (they’re scalars)
• we have samples from the distributions $p_i(x)$
• $\lambda$ is fixed
Ridge regression (with a kernel twist).
2. Ridge regression.
Solution (very straightforward algebra):
$f^* = \sum_i \alpha_i\, K \cdot p_i, \qquad \alpha_i = \sum_j (B + \lambda I)^{-1}_{ij}\, y_j$,
where $I$ is the identity matrix and
$B_{ij} = p_i \cdot K \cdot p_j = \frac{1}{n_i n_j} \sum_{mn} K(x_m^{(i)}, x_n^{(j)})$.
We didn’t mention RKHSs.
We didn’t mention mean embeddings.
All we did was linear algebra.
That’s ~50% of Zoltan’s second-to-last research talk. I do not mean to trivialize the work of kernel people. But I do want to point out that the setup is almost always straightforward.
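A sketch of that solution in numpy (mine, not from the talk), for the common special case where each observation comes with a single sample $x_i$, so $p_i$ is a single delta function and $B$ is just the Gram matrix $K(x_i, x_j)$; the data and the value of $\lambda$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-3, 3, size=40))       # one sample x_i per observation
y = np.sin(x) + 0.1 * rng.normal(size=40)      # noisy scalar targets y_i
lam = 0.1                                      # the fixed regularization weight

def K(a, b, width=1.0):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * width**2))

B = K(x, x)                                    # B_ij = p_i·K·p_j = K(x_i, x_j) here
alpha = np.linalg.solve(B + lam * np.eye(len(x)), y)   # α = (B + λI)^{-1} y

def f_star(xq):
    """f*(x) = Σ_i α_i (K·p_i)(x) = Σ_i α_i K(x, x_i)."""
    return K(np.atleast_1d(xq), x) @ alpha

print(f_star(np.array([-1.0, 0.0, 1.0])))      # roughly sin at those points
print(np.sin(np.array([-1.0, 0.0, 1.0])))
```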
3. Kernel PCA (which is a little bit different).
We have a set of points (in, for instance, Euclidean space), $z_i$, $i = 1, \ldots, n$. We want to project them into a higher-dimensional space, and do PCA in that space. Why not go to the extreme, and project them into an infinite-dimensional space?
$f_i(x) = K(z_i, x)$
3. Kernel PCA (which is a little bit different).
Now we have a set of points (in function space), $f_i$, $i = 1, \ldots, n$. We want to find a lower-dimensional manifold that captures as much variance as possible. If this were standard PCA, we would minimize
$\sum_i \left(f_i - \sum_j A_{ij} v_j\right) \cdot \left(f_i - \sum_j A_{ij} v_j\right)$
with respect to $A_{ij}$ and $v_j$.
3. Kernel PCA (which is a little bit different).
Remember, $\left(f_i - \sum_j A_{ij} v_j\right) \cdot \left(f_i - \sum_j A_{ij} v_j\right)$ is shorthand for
$\int dx\, \left(f_i(x) - \sum_j A_{ij} v_j(x)\right)\left(f_i(x) - \sum_j A_{ij} v_j(x)\right)$.
But we can mess with the norm to emphasize smoothness:
$\int dx\, dy\, \left(f_i(x) - \sum_j A_{ij} v_j(x)\right) Q^{-1}(x, y) \left(f_i(y) - \sum_j A_{ij} v_j(y)\right)$
3. Kernel PCA (which is a little bit different).
and minimize
$\sum_i \left(f_i - \sum_j A_{ij} v_j\right) \cdot Q^{-1} \cdot \left(f_i - \sum_j A_{ij} v_j\right)$
with respect to $A_{ij}$ and $v_j$.
If we set $Q = K$, we get standard kernel PCA. That’s the most convenient choice, because it makes it easy to compute $A_{ij}$ and $v_j$. I don’t know if there are any other justifications.
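With $Q = K$, the whole calculation collapses onto the $n \times n$ Gram matrix $K(z_i, z_j)$: the standard kernel-PCA recipe is to center that matrix and eigendecompose it. A minimal sketch of that recipe (mine; the data, the kernel width, and the particular eigenvector normalization are choices of mine):

```python
import numpy as np

rng = np.random.default_rng(4)
z = rng.normal(size=(100, 2))                   # points z_i in Euclidean space

def K(a, b, width=1.0):
    d2 = ((a[:, None, :] - b[None, :, :])**2).sum(-1)
    return np.exp(-d2 / (2 * width**2))

G = K(z, z)                                     # Gram matrix K(z_i, z_j)
H = np.eye(len(z)) - np.ones((len(z), len(z))) / len(z)
Gc = H @ G @ H                                  # center the f_i = K(z_i, ·) in function space

eigvals, eigvecs = np.linalg.eigh(Gc)           # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Coordinates of each point on the top two components (one common normalization convention):
proj = eigvecs[:, :2] * np.sqrt(np.clip(eigvals[:2], 0, None))
print(proj[:5])
```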
Summary
Most (almost all?) kernel problems are of the form
$\{\tilde f_1, \tilde f_2, \ldots\} = \arg\min_{f_1, f_2, \ldots} F(f_1 \cdot p_1, f_2 \cdot p_2, \ldots)$ subject to the $f_i \cdot K^{-1} \cdot f_i$ being small.
Specify
• the functional, $F(f_1 \cdot p_1, f_2 \cdot p_2, \ldots)$, to be minimized
• what one means by small (e.g., $f_i \cdot K^{-1} \cdot f_i = c_i$),
and the rest is (typically very straightforward) algebra.
The typical problem:
$\frac{d}{df}\left[ F(f \cdot p) + \lambda\, f \cdot K^{-1} \cdot f \right] = 0$
The solution (two lines of algebra):
$f = -\frac{F'(f \cdot p)}{2\lambda}\, K \cdot p$
There is no reason (I can find) to mention RKHSs or mean embeddings. All quantities one needs arise very naturally as the solution to the problem one has proposed.