Learning to make specific predictions using Slow Feature Analysis
Memory/prediction hierarchy with temporal invariances
• Slow: temporally invariant abstractions
• Fast: quickly changing input
But… how does each module work: learn, map, and predict?
My (old) module:
• Quantize high-dim input space
• Map to low-dim output space
• Discover temporal sequences in input space
• Map sequences to low-dim sequence language
• Feedback = same map run backwards
Problems:
• Sequence-mapping (step #4) depends on several previous steps: brittle, not robust
• Sequence-mapping not well-defined statistically
New module design: Slow Feature Analysis (SFA)
Pros of SFA:
• Nearly guaranteed to find some slow features
• No quantization
• Defined over the entire input space
• Hierarchical "stacking" is easy
• Statistically robust building blocks (simple polynomials, Principal Components Analysis, variance reduction, etc.)
• A great way to find invariant functions
• Invariants change slowly, hence easily predictable
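Since everything below builds on an SFA front end, here is a minimal sketch of quadratic SFA (expand the input, whiten it, keep the directions whose temporal derivative has the least variance). The helper name `quadratic_sfa` and all parameter choices are my assumptions, not code from the talk.

```python
import numpy as np

def quadratic_sfa(X, n_slow=2):
    """Minimal quadratic-SFA sketch: returns a function S(x) giving the
    n_slow slowest features of the (T, d) input series X."""
    def expand(X):
        # quadratic expansion: all x_i and x_i * x_j terms
        iu = np.triu_indices(X.shape[1])
        quad = np.einsum('ti,tj->tij', X, X)[:, iu[0], iu[1]]
        return np.hstack([X, quad])

    Z = expand(X)
    mean = Z.mean(axis=0)
    Z = Z - mean

    # whiten the expanded signal (PCA plus rescaling to unit variance)
    evals, evecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    keep = evals > 1e-8 * evals.max()            # drop near-degenerate directions
    W = evecs[:, keep] / np.sqrt(evals[keep])

    # slowness: directions minimizing the variance of the temporal derivative
    dZ = np.diff(Z @ W, axis=0)
    _, dvecs = np.linalg.eigh(np.cov(dZ, rowvar=False))
    P = dvecs[:, :n_slow]                        # slowest directions first

    def S(x):
        return (expand(np.atleast_2d(x)) - mean) @ W @ P
    return S
```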
But… no feedback!
• Can't get specific output from invariant input
• It's hard to take a low-dim signal and turn it into the right high-dim one (underdetermined)
Here's my solution (straightforward, probably done before somewhere): do feedback with a separate map.
First, show it working… then, show how and why.
Input space: 20-dim "retina", wrapped (pixel 21 = pixel 1)
Input shapes: Gaussian blurs of 3 different widths
Input sequences: constant-velocity motion (0.3 pixels/step)
[Figure: example frames at T = 0, 2, 4 and T = 23, 25, 27]
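One way such stimuli might be generated (my reconstruction; only the 20-pixel wrapped retina, the three blur widths, and the 0.3 pixels/step velocity come from the slide, everything else is assumed):

```python
def make_sequence(T=200, n_pix=20, width=2.0, velocity=0.3, seed=None):
    """Gaussian blur of a given width drifting at constant velocity
    across an n_pix-pixel wrapped retina."""
    rng = np.random.default_rng(seed)
    pixels = np.arange(n_pix)
    center0 = rng.uniform(0, n_pix)                      # random starting position
    frames = []
    for t in range(T):
        center = (center0 + velocity * t) % n_pix
        d = np.abs(pixels - center)
        d = np.minimum(d, n_pix - d)                     # wrapped (circular) distance
        frames.append(np.exp(-0.5 * (d / width) ** 2))
    return np.array(frames)                              # shape (T, n_pix)

# training set: sequences at three assumed widths, plus the slow-feature map
X = np.vstack([make_sequence(width=w, seed=i) for i, w in enumerate((1.0, 2.0, 3.0))])
S = quadratic_sfa(X, n_slow=2)
```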
Sanity check: the extracted slow features match the generating parameters:
• Slow feature #1 ↔ Gaussian std. dev. ("what")
• Slow feature #2 ↔ Gaussian center position ("where")
(… so far, this is plain vanilla SFA, nothing new …)
New contribution: predict all pixels of the next image, given the previous images.
[Figure: frames at T = 0, 2, 4; frame T = 5 unknown]
Reference prediction: just reuse the previous image ("tomorrow's weather is just like today's").
Plot the ratio (mean-squared prediction error) / (mean-squared reference error):
• Median ratio over all points = 0.06 (including discontinuities)
• Median ratio over high-confidence points = 0.03 (toss the worst 20%)
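The plotted ratio could be computed like this (a sketch; `X_hat` is an assumed array holding the model's prediction for each frame, not something defined in the talk):

```python
def error_ratio(X_true, X_pred):
    """Per-frame ratio of the model's mean-squared error to the
    'reuse the previous frame' reference error."""
    model_err = np.mean((X_pred[1:] - X_true[1:]) ** 2, axis=1)
    reference_err = np.mean((X_true[:-1] - X_true[1:]) ** 2, axis=1)
    return model_err / reference_err

# e.g. np.median(error_ratio(X, X_hat)); values well below 1 mean the specific
# predictions beat the naive "copy the previous frame" reference.
```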
Take-home messages:
• SFA can be inverted
• SFA can be used to make specific predictions
• The prediction works very well
• The prediction can be further improved by using confidence estimates
So why is it hard, and how is it done?
Why it's hard:
• Easy (forward): the high-dim input x1, x2, x3, …, x20 maps to low-dim slow features, e.g.
  S1 = 0.3 x1 + 0.1 x1² + 1.4 x2 x3 + 1.1 x4² + … + 0.5 x5 x9 + …
• HARD (backward): given S1 = 1.4 and S2 = -0.33, find x1 = ?, x2 = ?, …, x20 = ?
• Infinitely many possibilities for the x's
• Vastly under-determined
• No simple polynomial-inverse formula (nothing like the "quadratic formula")
Very simple, graphable example:
• 2-dim input (x1, x2), 1-dim slow feature S1
• S1(t) = x1² + x2² is nearly constant, i.e. slow
• x1(t), x2(t) trace approximately circular motion in the plane
What follows is a series of six clue/trick pairs for learning the specific-prediction mapping.
Clue #1: The actual input data is a small subset of all possible input data (i.e. it lies on a "manifold").
Trick #1: Find a set of points that represent where the actual input data is: 20-80 "anchor points" Ai (found using k-means, k-medoids, etc.; this is quantization, but only for feedback).
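Trick #1 might look like this with scikit-learn's k-means, continuing from the `X` defined above (the anchor count of 40 is an assumed value inside the slide's 20-80 range):

```python
from sklearn.cluster import KMeans

n_anchors = 40                               # assumed value in the 20-80 range
km = KMeans(n_clusters=n_anchors, n_init=10, random_state=0).fit(X)
anchors = km.cluster_centers_                # the anchor points A_i
labels = km.labels_                          # which anchor each training point belongs to
```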
Clue #2: The actual input data is not distributed evenly about those anchor points.
Trick #2: Calculate the covariance matrix Ci of the data around each Ai; its eigenvectors describe the local data scatter.
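Continuing the sketch, the local covariances and their eigenvectors (largest first) could be collected per anchor:

```python
covs, eigvecs = [], []
for i in range(n_anchors):
    local = X[labels == i] - anchors[i]      # data assigned to A_i, centered on it
    C = np.cov(local, rowvar=False)          # local covariance matrix C_i
    w, V = np.linalg.eigh(C)                 # eigenvalues in ascending order
    covs.append(C)
    eigvecs.append(V[:, ::-1])               # eigenvectors, largest eigenvalue first
```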
Clue #3: S(x) is locally linear about each anchor point.
Trick #3: Construct linear (affine) Taylor-series mappings SLi approximating S(x) about each Ai. (NB: this doesn't require polynomial SFA, just that S be differentiable.)
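One way to build the affine maps SLi(x) ≈ S(Ai) + Ji (x - Ai) is with finite-difference Jacobians, which only needs S to be differentiable (a hypothetical helper, continuing from the sketch above):

```python
def jacobian(S, a, eps=1e-4):
    """Numerical Jacobian of S at anchor a (S maps (1, d) -> (1, n_slow))."""
    a = np.atleast_2d(a)
    S0 = S(a)
    J = np.zeros((S0.shape[1], a.shape[1]))
    for j in range(a.shape[1]):
        da = np.zeros_like(a)
        da[0, j] = eps
        J[:, j] = (S(a + da) - S0)[0] / eps
    return J

S_at_anchor = [S(a)[0] for a in anchors]          # S(A_i)
J_at_anchor = [jacobian(S, a) for a in anchors]   # Jacobian of the affine map SL_i
```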
Good news: the linear SLi can be pseudo-inverted (SVD).
Bad news: we don't want any old (x1, x2); we want the (x1, x2) on the data manifold.
Clue #4: The covariance eigenvectors tell us about the local data manifold.
Trick #4:
• Get the SVD pseudo-inverse: ΔX = SLi⁻¹(Snew - S(Ai))
• Then stretch ΔX onto the manifold by multiplying by the chopped* Ci
[Figure: ΔS = Snew - S(Ai), the resulting ΔX, and the stretched ΔX]
* Projection matrix, keeping only as many eigenvectors as the dimension of S
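A sketch of Trick #4, reading "chopped Ci" as the footnote's projection onto the leading eigenvectors of Ci (continuing from the helpers above):

```python
def invert_about_anchor(S_new, i, n_slow=2):
    """Local inverse: pseudo-invert SL_i, then 'stretch' the step onto the
    local manifold spanned by the leading eigenvectors of C_i."""
    dS = S_new - S_at_anchor[i]
    dX = np.linalg.pinv(J_at_anchor[i]) @ dS     # SVD pseudo-inverse step
    U = eigvecs[i][:, :n_slow]                   # keep as many eigenvectors as dims of S
    dX_stretched = U @ (U.T @ dX)                # project ("stretch") onto the manifold
    return anchors[i] + dX_stretched             # predicted X near anchor A_i
```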
Good news: given Ai and Ci, we can invert Snew → Xnew.
Bad news: how do we choose which Ai and SLi⁻¹ to use?
[Figure: three different points that all have the same value of Snew]
Clue #5:
a) We need an anchor Ai such that S(Ai) is close to Snew
b) We need a "hint" of which anchors are close in X-space
Trick #5: Choose the anchor Ai such that
• Ai is "close to" the hint, AND
• S(Ai) is close to Snew
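One plausible implementation of Trick #5 is to rank the anchors on both criteria and take the best combined rank (my reading of "index is high on both candidate lists"; continues from the sketch above):

```python
def choose_anchor(S_new, hint_x):
    """Pick the anchor close to S_new in S-space AND close to the hint in X-space."""
    s_dist = np.linalg.norm(np.array(S_at_anchor) - S_new, axis=1)
    x_dist = np.linalg.norm(anchors - hint_x, axis=1)
    s_rank = np.argsort(np.argsort(s_dist))      # rank 0 = closest in S-space
    x_rank = np.argsort(np.argsort(x_dist))      # rank 0 = closest in X-space
    return int(np.argmin(s_rank + x_rank))       # best on both lists
```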
All tricks together: a local linear inverse map about each anchor point.
[Figure: anchor points (+), their neighborhoods in X-space, and the corresponding S(Ai)]
Clue #6: The local data scatter can decide whether a given point is probable ("on the manifold") or not.
Trick #6: Use Gaussian hyper-ellipsoid probabilities about the closest Ai (this can tell whether a prediction makes sense or not).
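Trick #6 could use the negative log-probability of a Gaussian fitted at the chosen anchor, which up to a constant is half the squared Mahalanobis distance (a sketch; the ridge term is an assumed regularizer):

```python
def neg_log_prob(x, i, ridge=1e-6):
    """-log p(x) under a Gaussian N(A_i, C_i); larger means less probable."""
    d = x - anchors[i]
    C = covs[i] + ridge * np.eye(len(d))         # regularize for invertibility
    maha = d @ np.linalg.solve(C, d)             # squared Mahalanobis distance
    logdet = np.linalg.slogdet(C)[1]
    return 0.5 * (maha + logdet + len(d) * np.log(2 * np.pi))
```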
Estimated uncertainty (-log P) increases away from the anchor points.
Summary of the SFA inverse/prediction method:
We have X(t-2), X(t-1), X(t); we want X(t+1).
1. Calculate the slow features S(t-2), S(t-1), S(t)
2. Extrapolate that trend linearly to Snew (NB: S varies slowly/smoothly in time)
3. Find the candidate S(Ai)'s close to Snew, e.g. candidates i = {1, 16, 3, 7}
Summary cont'd
4. Take X(t) as the "hint," and find the candidate Ai's close to it, e.g. candidates i = {8, 3, 5, 17}
5. Find the "best" candidate Ai, whose index is high on both candidate lists.
6. Use the chosen Ai and the pseudo-inverse (SLi⁻¹(Snew - S(Ai)), via SVD) to get ΔX
7. Stretch ΔX onto the low-dim manifold using the chopped Ci
8. Add the stretched ΔX back onto Ai to get the final prediction
9. Use the covariance hyper-ellipsoids to estimate confidence in this prediction
This method uses virtually everything we know about the data; any improvements presumably would need further clues:
• Discrete sub-manifolds
• Discrete sequence steps
• Better nonlinear mappings
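Tying the summary's nine steps together with the hypothetical helpers sketched above (quadratic_sfa, choose_anchor, invert_about_anchor, neg_log_prob), the whole prediction might read:

```python
def predict_next(X_recent, n_slow=2):
    """Predict the next frame from recent frames X(t-2), X(t-1), X(t);
    returns the prediction and a -log(P) score (lower = more confident)."""
    S_recent = S(X_recent)                                 # step 1: slow features
    S_new = S_recent[-1] + (S_recent[-1] - S_recent[-2])   # step 2: linear extrapolation
    i = choose_anchor(S_new, X_recent[-1])                 # steps 3-5: pick the anchor
    x_pred = invert_about_anchor(S_new, i, n_slow)         # steps 6-8: invert and stretch
    return x_pred, neg_log_prob(x_pred, i)                 # step 9: confidence estimate

x_next, score = predict_next(X[-3:])
```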
Next steps
• Online learning
  • Adjust anchor points and covariances as new data arrive
  • Use weighted k-medoid clusters to mix old data in with new
• Hierarchy
  • Set the output of one layer as the input to the next
  • Enforce ever-slower features up the hierarchy
• Test with more complex stimuli and natural movies
• Let feedback from above modify the slow-feature polynomials
• Find slow features in the unpredicted input (input - prediction)