Kernels for dummies
Tea talk, September 22, 2014
I find kernel talks confusing.
• Equations seem to appear out of nowhere.
• It’s hard for me to extract the main message.
• I still don’t know what an RKHS is.
• Which is a little strange, considering I’ve known about Hilbert spaces since before most of you were born.
So today we’re going to attempt to demystify kernels. My claim: if you understand linear algebra, you’ll understand kernels. There will be no RKHSs in this talk. And no Polish spaces. And no mean embeddings. But also no proofs; basically I’m going to avoid the hard stuff. Instead, just the intuition!
We’ll start with a pretty standard problem:
• You have samples from some distribution, $p(x)$.
• You want to minimize $F\left(\int dx\, f(x)\, p(x)\right)$ with respect to $f(x)$.
• I’ll give examples later, but just about every kernel talk you have ever heard considers this problem, or a slight generalization.
Our problem:
$\tilde f(x) = \arg\min_{f(x)} F\left(\int dx\, f(x)\, p(x)\right)$, given samples $x \sim p$.

We only have samples, so we use the empirical distribution,
$p(x) = \frac{1}{n} \sum_i \delta(x - x_i)$,
which turns the integral into a sum: $\int dx\, f(x)\, p(x) = \frac{1}{n} \sum_i f(x_i)$.

Without any constraint on $f$, this can be made anything we like:
$f(x) = +\infty\, \delta(x - x_1) \;\Rightarrow\; \int dx\, f(x)\, p(x) = +\infty$
$f(x) = -\infty\, \delta(x - x_1) \;\Rightarrow\; \int dx\, f(x)\, p(x) = -\infty$
By suitably adjusting $f(x)$, $\int dx\, f(x)\, p(x)$ can range from $-\infty$ to $+\infty$. Therefore, we have to regularize $f(x)$: we need a smoothness constraint.
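A quick numerical illustration of that point (my own toy example, not from the talk): with the empirical distribution the objective is just a sample average, and an unconstrained $f$, here a tall narrow bump parked on one sample, makes that average as large as we like.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)            # samples x_i ~ p

def objective(f):
    """∫ dx f(x) p(x) with p the empirical distribution = (1/n) Σ_i f(x_i)."""
    return np.mean(f(x))

f_smooth = lambda t: np.sin(t)                               # a tame f
f_spike  = lambda t: 1e6 * np.exp(-(t - x[0])**2 / 1e-6)     # tall, narrow bump at x_1

print(objective(f_smooth))   # small
print(objective(f_spike))    # ~1e6 / n -- grows without bound as the bump gets taller
```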
If you’re Bayesian, you put a prior over $f(x)$. If you’re a kernel person, you demand that
$\int dx\, dy\, f(x)\, K^{-1}(x, y)\, f(y)$
is in some sense small. $K(x, y)$ is a kernel. For example, $K(x, y) = \exp(-(x-y)^2/2)$.
An aside: $\langle f, g \rangle = \int dx\, dy\, f(x)\, K^{-1}(x, y)\, g(y)$.
This raises two questions:
1. How do we make sense of $K^{-1}(x, y)$?
2. What does $\int dx\, dy\, f(x)\, K^{-1}(x, y)\, f(y)$ have to do with smoothness?
1. Making sense of $K^{-1}(x, y)$:
$\int dy\, K^{-1}(x, y)\, K(y, z) = \delta(x - z)$ defines $K^{-1}(x, y)$.
Think of $K$ as a matrix, and $K^{-1}$ as its inverse. $K$ has an uncountably infinite number of indices, but otherwise it’s a very standard matrix. $K^{-1}$ exists if all the eigenvalues of $K$ are positive.
An aside: $K^{-1}$ doesn’t really exist. But that’s irrelevant.
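To make “$K$ is just a matrix” concrete, here is a small sketch of my own (assuming a Gaussian kernel on a 1-D grid, which stands in for the continuum of indices). The aside that $K^{-1}$ doesn’t really exist shows up numerically as severe ill-conditioning, so a tiny jitter is added before inverting.

```python
import numpy as np

x = np.linspace(-5, 5, 101)                      # a grid standing in for the continuum of "indices"
K = np.exp(-(x[:, None] - x[None, :])**2 / 2)    # K_ij = K(x_i, x_j) = exp(-(x_i - x_j)^2 / 2)

eigvals = np.linalg.eigvalsh(K)
print(eigvals.min())          # all (numerically) >= 0, but many are absurdly tiny ...
print(np.linalg.cond(K))      # ... so the condition number is astronomical:
                              # the aside that K^{-1} "doesn't really exist", in the flesh

# With a tiny jitter, K behaves like any other positive-definite matrix and can be inverted:
ridge = K + 1e-8 * np.eye(len(x))
Kinv = np.linalg.inv(ridge)
print(np.abs(ridge @ Kinv - np.eye(len(x))).max())   # tiny: a perfectly ordinary matrix inverse
```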
2. What does $\int dx\, dy\, f(x)\, K^{-1}(x, y)\, f(y)$ have to do with smoothness?
I’ll answer for a specific case: translation-invariant kernels, $K(x, y) = K(x - y)$.
$\int dx\, dy\, f(x)\, K^{-1}(x, y)\, f(y)$
$= \int dx\, dy\, f(x)\, K^{-1}(x - y)\, f(y)$   (translation invariance)
$= \int dk\, \frac{|\tilde f(k)|^2}{\tilde K(k)}$   (Fourier transform),
where $\tilde f(k)$ and $\tilde K(k)$ are the Fourier transforms of $f(x)$ and $K(x)$.

For smooth kernels, $\tilde K(k)$ falls off rapidly with $k$. For the above integral to be small, $\tilde f(k)$ must fall off even more rapidly with $k$. In other words, $f(x)$ must be smooth.

Example: $K(x) = \exp(-x^2/2) \;\Rightarrow\; \tilde K(k) \propto \exp(-k^2/2)$, so
$\int dk\, \frac{|\tilde f(k)|^2}{\tilde K(k)} \propto \int dk\, |\tilde f(k)|^2 \exp(+k^2/2)$.
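Here is a small numerical check of that statement (my own sketch, not from the talk): compute $\int dk\, |\tilde f(k)|^2 \exp(+k^2/2)$ by FFT for a smooth bump and for the same bump with a high-frequency wiggle added. The grid, the test functions, and the frequency cutoff (needed to keep $\exp(k^2/2)$ from multiplying FFT noise where $\tilde f$ is already negligible) are all choices of mine.

```python
import numpy as np

dx = 0.05
x = np.arange(-20, 20, dx)
k = 2 * np.pi * np.fft.fftfreq(len(x), d=dx)      # angular frequencies
dk = k[1] - k[0]

def penalty(f):
    """∫ dk |f~(k)|^2 exp(+k^2/2), the smoothness penalty for K(x) = exp(-x^2/2)."""
    fk = dx * np.fft.fft(f)                       # approximate Fourier transform of f
    mask = np.abs(k) < 8.0                        # exp(k^2/2) is astronomical beyond this,
                                                  # but |f~(k)|^2 is essentially zero there
    return np.sum(np.abs(fk[mask])**2 * np.exp(k[mask]**2 / 2)) * dk

f_smooth = np.exp(-x**2 / 4)                                     # smooth bump
f_rough  = f_smooth + 0.1 * np.sin(6 * x) * np.exp(-x**2 / 4)    # add a k ≈ 6 wiggle

print(penalty(f_smooth))   # modest
print(penalty(f_rough))    # many orders of magnitude larger
```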
More generally, let $g_k$ and $\lambda(k)$ be the eigenfunctions and eigenvalues of the kernel,
$\int dy\, K(x, y)\, g_k(y) = \lambda(k)\, g_k(x)$.
Then
$\int dx\, dy\, f(x)\, K^{-1}(x, y)\, f(y) = \int dk\, \frac{\left[\int dx\, f(x)\, g_k(x)\right]^2}{\lambda(k)}$.
Typically, $\lambda(k)$ falls off with $k$, and the $g_k(x)$ become increasingly rough with $k$.
Finally, let’s link this to linear algebra:
$\int dx\, f(x)\, g(x) \equiv f \cdot g$
$\int dx\, f(x)\, A(x, y) \equiv (f \cdot A)(y)$
$\Rightarrow \int dx\, dy\, f(x)\, K^{-1}(x, y)\, f(y) = f \cdot K^{-1} \cdot f$
Compare to:
$\sum_i x_i y_i = x \cdot y$
$\sum_j A_{ij} x_j = (Ax)_i$
$\sum_{ij} x_i A^{-1}_{ij} x_j = x \cdot A^{-1} \cdot x$
Integrals are glorified sums!
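To see the “glorified sums” claim in action, here is a sketch of mine, using the same grid-and-jitter setup as above: on a grid with spacing $h$, $K^{-1}(x_i, x_j) \approx (K^{-1})_{ij}/h^2$, so the two factors of $h$ from the integrals cancel and the double integral $\int dx\, dy\, f(x)\, K^{-1}(x,y)\, f(y)$ collapses to the plain quadratic form $\mathbf{f}^\top K^{-1} \mathbf{f}$. It behaves exactly as the Fourier argument says: small for a smooth $f$, huge for a rough one.

```python
import numpy as np

x = np.linspace(-10, 10, 201)
K = np.exp(-(x[:, None] - x[None, :])**2 / 2)        # Gaussian kernel matrix on the grid
Kreg = K + 1e-8 * np.eye(len(x))                     # jitter, as before, so the inverse exists

def penalty(f):
    """Discretized ∫ dx dy f(x) K^{-1}(x,y) f(y): just the quadratic form f^T K^{-1} f."""
    return f @ np.linalg.solve(Kreg, f)

f_smooth = np.exp(-x**2 / 4)
f_rough  = f_smooth + 0.1 * np.sin(6 * x) * np.exp(-x**2 / 4)

print(penalty(f_smooth))   # small
print(penalty(f_rough))    # vastly larger: the high-frequency wiggle is heavily penalized
```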
Our problem:
$\tilde f = \arg\min_f F(f \cdot p)$ subject to $f \cdot K^{-1} \cdot f$ being small.
Two notions of small:
• $\lambda$ fixed: $\frac{d}{df}\left[ F(f \cdot p) + \lambda\, f \cdot K^{-1} \cdot f \right] = 0$
• Lagrange multipliers: $f \cdot K^{-1} \cdot f = \text{constant}$.
An aside: $\lambda\, f \cdot K^{-1} \cdot f$ can often be thought of as coming from a prior.
$\frac{d}{df}\left[ F(f \cdot p) + \lambda\, f \cdot K^{-1} \cdot f \right] = 0$
is easy to solve:
$F'(f \cdot p)\, p + 2\lambda K^{-1} f = 0 \;\Rightarrow\; f = -\frac{F'(f \cdot p)}{2\lambda}\, K \cdot p$
Remember: $p(x) = \frac{1}{n} \sum_i \delta(x - x_i) \;\Rightarrow\; (K \cdot p)(x) = \frac{1}{n} \sum_i K(x, x_i)$.
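The only data-dependent object in that solution is $K \cdot p$, which is just the kernel averaged over the samples. A minimal sketch of computing it (my own; the samples and the kernel width are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
xi = rng.normal(size=50)                      # samples x_i ~ p

def Kp(x, samples, width=1.0):
    """(K·p)(x) = (1/n) Σ_i K(x, x_i), with the Gaussian kernel K(x, y) = exp(-(x-y)^2 / (2 width^2))."""
    x = np.atleast_1d(x)
    return np.mean(np.exp(-(x[:, None] - samples[None, :])**2 / (2 * width**2)), axis=1)

print(Kp(np.array([-2.0, 0.0, 2.0]), xi))     # K·p evaluated at a few test points
```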
The more general problem:
$\{\tilde f_1, \tilde f_2, \ldots\} = \arg\min_{f_1, f_2, \ldots} F(f_1 \cdot p_1, f_2 \cdot p_2, \ldots)$ subject to the $f_i \cdot K^{-1} \cdot f_i$ being small.
Almost all kernel-related problems fall into this class. Those problems are fully specified by:
• the functional, $F(f_1 \cdot p_1, f_2 \cdot p_2, \ldots)$, to be minimized
• what one means by small (e.g., $f_i \cdot K^{-1} \cdot f_i = c_i$).
The rest is (typically very straightforward) algebra.
Three examples: 1. A “witness” function. 2. Ridge regression. 3. Kernel PCA (which is a little bit different).
1. A “witness” function.
Maximize
$[f \cdot (p - q)]^2 = \left[\int dx\, f(x)\,[p(x) - q(x)]\right]^2$
subject to the constraint
$f \cdot K^{-1} \cdot f = \int dx\, dy\, f(x)\, K^{-1}(x, y)\, f(y) = 1$.
$p$ and $q$ are sums of delta functions.
1. A “witness” function.
Maximize $[f \cdot (p - q)]^2$ subject to the constraint $f \cdot K^{-1} \cdot f = 1$.
Lagrange multipliers:
$\frac{d}{df}\left( [f \cdot (p - q)]^2 - \lambda\, f \cdot K^{-1} \cdot f \right) = 0$
$\Rightarrow\; f = \frac{K \cdot (p - q)}{[(p - q) \cdot K \cdot (p - q)]^{1/2}}$
$\Rightarrow\; [f \cdot (p - q)]^2 = (p - q) \cdot K \cdot (p - q)$.
1. A “witness” function.
$p \cdot K \cdot p = \int dx\, dy\, p(x)\, K(x, y)\, p(y)$, with $p(x) = \frac{1}{n}\sum_j \delta(x - x_j)$, so
$p \cdot K \cdot p = \frac{1}{n^2} \sum_{ij} K(x_i, x_j)$.
We didn’t mention RKHSs.
We didn’t mention mean embeddings.
All we did was linear algebra.
1. A “witness” function.
That solution, $f = K \cdot (p - q)\, /\, [(p - q) \cdot K \cdot (p - q)]^{1/2}$ with $[f \cdot (p - q)]^2 = (p - q) \cdot K \cdot (p - q)$, is ~50% of Arthur’s Gatsby job talk.
I do not mean to trivialize the work of kernel people. But I do want to point out that the setup is almost always straightforward.
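Putting the witness-function slides together in code (my sketch; the data, sample sizes, and kernel width are mine): everything reduces to plain kernel sums over the two sample sets.

```python
import numpy as np

rng = np.random.default_rng(2)
xp = rng.normal(0.0, 1.0, size=200)            # samples from p
xq = rng.normal(0.5, 1.0, size=300)            # samples from q

def K(a, b, width=1.0):
    """Gaussian kernel matrix K(a_i, b_j)."""
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * width**2))

# (p - q)·K·(p - q) = p·K·p - 2 p·K·q + q·K·q, each term a plain average of kernel values
pKp = K(xp, xp).mean()
pKq = K(xp, xq).mean()
qKq = K(xq, xq).mean()
value = pKp - 2 * pKq + qKq                    # = [f·(p - q)]^2 at the optimum

def witness(x):
    """The (unnormalized) witness function K·(p - q) evaluated at x."""
    x = np.atleast_1d(x)
    return K(x, xp).mean(axis=1) - K(x, xq).mean(axis=1)

print(value)
print(witness(np.array([-1.0, 0.0, 1.0])))     # positive where p has more mass, negative where q does
```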
2. Ridge regression.
Minimize $\sum_i (y_i - f \cdot p_i)^2 + \lambda\, f \cdot K^{-1} \cdot f$ with respect to $f$.
• $i$ labels observations
• the $y_i$ are observed (they’re scalars)
• we have samples from the distributions $p_i(x)$
• $\lambda$ is fixed
Ridge regression (with a kernel twist).
2. Ridge regression.
Solution (very straightforward algebra):
$f^* = \sum_i \alpha_i\, K \cdot p_i, \qquad \alpha_i = \sum_j (B + \lambda I)^{-1}_{ij}\, y_j$,
where $I$ is the identity matrix and
$B_{ij} = p_i \cdot K \cdot p_j = \frac{1}{n_i n_j} \sum_{mn} K(x_m^{(i)}, x_n^{(j)})$.
We didn’t mention RKHSs.
We didn’t mention mean embeddings.
All we did was linear algebra.
That’s ~50% of Zoltan’s second-to-last research talk. I do not mean to trivialize the work of kernel people. But I do want to point out that the setup is almost always straightforward.
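A sketch of that solution in numpy (mine, not from the talk), for the common special case where each observation comes with a single sample $x_i$, so $p_i$ is a single delta function and $B$ is just the Gram matrix $K(x_i, x_j)$; the data and the value of $\lambda$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-3, 3, size=40))       # one sample x_i per observation
y = np.sin(x) + 0.1 * rng.normal(size=40)      # noisy scalar targets y_i
lam = 0.1                                      # the fixed regularization weight

def K(a, b, width=1.0):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * width**2))

B = K(x, x)                                    # B_ij = p_i·K·p_j = K(x_i, x_j) here
alpha = np.linalg.solve(B + lam * np.eye(len(x)), y)   # α = (B + λI)^{-1} y

def f_star(xq):
    """f*(x) = Σ_i α_i (K·p_i)(x) = Σ_i α_i K(x, x_i)."""
    return K(np.atleast_1d(xq), x) @ alpha

print(f_star(np.array([-1.0, 0.0, 1.0])))      # roughly sin at those points
print(np.sin(np.array([-1.0, 0.0, 1.0])))
```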
3. Kernel PCA (which is a little bit different).
We have a set of points (in, for instance, Euclidean space), $z_i$, $i = 1, \ldots, n$. We want to project them into a higher-dimensional space, and do PCA in that space. Why not go to the extreme, and project them into an infinite-dimensional space?
$f_i(x) = K(z_i, x)$
3. Kernel PCA (which is a little bit different).
Now we have a set of points (in function space), $f_i$, $i = 1, \ldots, n$. We want to find a lower-dimensional manifold that captures as much variance as possible. If this were standard PCA, we would minimize
$\sum_i \left(f_i - \sum_j A_{ij} v_j\right) \cdot \left(f_i - \sum_j A_{ij} v_j\right)$
with respect to $A_{ij}$ and $v_j$.
3. Kernel PCA (which is a little bit different).
Remember, $\left(f_i - \sum_j A_{ij} v_j\right) \cdot \left(f_i - \sum_j A_{ij} v_j\right)$ is shorthand for
$\int dx\, \left(f_i(x) - \sum_j A_{ij} v_j(x)\right)\left(f_i(x) - \sum_j A_{ij} v_j(x)\right)$.
But we can mess with the norm to emphasize smoothness:
$\int dx\, dy\, \left(f_i(x) - \sum_j A_{ij} v_j(x)\right) Q^{-1}(x, y) \left(f_i(y) - \sum_j A_{ij} v_j(y)\right)$
3. Kernel PCA (which is a little bit different).
and minimize
$\sum_i \left(f_i - \sum_j A_{ij} v_j\right) \cdot Q^{-1} \cdot \left(f_i - \sum_j A_{ij} v_j\right)$
with respect to $A_{ij}$ and $v_j$.
If we set $Q = K$, we get standard kernel PCA. That’s the most convenient choice, because it makes it easy to compute $A_{ij}$ and $v_j$. I don’t know if there are any other justifications.
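With $Q = K$, the whole calculation collapses onto the $n \times n$ Gram matrix $K(z_i, z_j)$: the standard kernel-PCA recipe is to center that matrix and eigendecompose it. A minimal sketch of that recipe (mine; the data, the kernel width, and the particular eigenvector normalization are choices of mine):

```python
import numpy as np

rng = np.random.default_rng(4)
z = rng.normal(size=(100, 2))                   # points z_i in Euclidean space

def K(a, b, width=1.0):
    d2 = ((a[:, None, :] - b[None, :, :])**2).sum(-1)
    return np.exp(-d2 / (2 * width**2))

G = K(z, z)                                     # Gram matrix K(z_i, z_j)
H = np.eye(len(z)) - np.ones((len(z), len(z))) / len(z)
Gc = H @ G @ H                                  # center the f_i = K(z_i, ·) in function space

eigvals, eigvecs = np.linalg.eigh(Gc)           # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Coordinates of each point on the top two components (one common normalization convention):
proj = eigvecs[:, :2] * np.sqrt(np.clip(eigvals[:2], 0, None))
print(proj[:5])
```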
Summary
Most (almost all?) kernel problems are of the form
$\{\tilde f_1, \tilde f_2, \ldots\} = \arg\min_{f_1, f_2, \ldots} F(f_1 \cdot p_1, f_2 \cdot p_2, \ldots)$ subject to the $f_i \cdot K^{-1} \cdot f_i$ being small.
Specify
• the functional, $F(f_1 \cdot p_1, f_2 \cdot p_2, \ldots)$, to be minimized
• what one means by small (e.g., $f_i \cdot K^{-1} \cdot f_i = c_i$),
and the rest is (typically very straightforward) algebra.
The typical problem:
$\frac{d}{df}\left[ F(f \cdot p) + \lambda\, f \cdot K^{-1} \cdot f \right] = 0$
The solution (two lines of algebra):
$f = -\frac{F'(f \cdot p)}{2\lambda}\, K \cdot p$
There is no reason (I can find) to mention RKHSs or mean embeddings. All quantities one needs arise very naturally as the solution to the problem one has proposed.