Nens220: Lecture 5 Neural Networks part 2
Topics • Dayan and Abbott chapters 9,10 • Function approximation: Radial Basis Functions • Density estimation: kernel methods • Clustering: Expectation Maximization, k-means, and Kohonen networks • Reinforcement Learning • Independent Components Analysis
Cortical map plasticity • The area of cortex dedicated to a function increases with the importance of that function. Why? • We need models of how this allocation of resources happens. • Information theory tells us the optimal answer, but not the algorithms needed to find it. • Radial basis functions and kernel methods are models of this process…
Nonlinear networks • The output is a weighted sum of nonlinear functions of the input: y = wᵀf(x) • Learning the functions f(x) is “feature extraction”.
Radial basis functions • The functions f(x) are called “basis functions”. If they are radially symmetric, i.e. depend only on the distance ‖x − μi‖ from a center, they are “radial basis functions”. • Example: Gaussians, fi(x) = exp(−‖x − μi‖² / 2ai²)… • Equivalent to filtering and subsampling.
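A minimal NumPy sketch of a Gaussian radial-basis-function layer; the function names and array shapes are illustrative choices, not from the lecture:

```python
import numpy as np

def gaussian_rbf(x, centers, widths):
    """Gaussian radial basis functions f_i(x) = exp(-||x - mu_i||^2 / (2 a_i^2))."""
    # x: (D,) input vector; centers: (N, D) array of mu_i; widths: (N,) array of a_i
    d2 = np.sum((centers - x) ** 2, axis=1)          # squared distance to each center
    return np.exp(-d2 / (2.0 * widths ** 2))

def rbf_network(x, centers, widths, weights):
    """Network output y = w . f(x): a weighted sum of the basis-function activations."""
    return weights @ gaussian_rbf(x, centers, widths)
```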
The curse of dimensionality • If x has D dimensions and you use N basis functions per dimension, then you need N^D basis functions (e.g. N = 10 per dimension in D = 20 dimensions already gives 10^20). If D > 20, this is almost always impossible, no matter how small N is. • Solution: there isn’t that much data anyway. You never need more basis functions than data points.
RBF solution • Place the centers μi at the locations of the data points xi. • Make the widths ai reasonable (usually smaller than the distance between data points). • Make the weights wi equal to the desired output values yi.
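A self-contained sketch of this one-shot RBF memorization scheme; the data, widths, and helper name are made up for illustration:

```python
import numpy as np

# Training data: inputs X (one point per row) and desired outputs y
X = np.array([[0.0], [1.0], [2.5], [4.0]])
y = np.array([1.0, 0.2, -0.5, 0.8])

centers = X.copy()                    # one RBF center per data point
widths = np.full(len(X), 0.5)         # widths smaller than the spacing of the data
weights = y.copy()                    # weights set directly to the desired outputs

def predict(x):
    """One-shot RBF memorizer: weighted sum of Gaussians centered on the training points."""
    d2 = np.sum((centers - x) ** 2, axis=1)
    return weights @ np.exp(-d2 / (2 * widths ** 2))

print(predict(np.array([1.2])))       # approximates g(x) near the stored points
```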
Why is this good? • You can represent any desired function y = g(x) to arbitrary accuracy if you have enough RBFs. • Learning is easy (sometimes): one-shot memorization, with minimal “un-learning”. • Generalization is often poor.
Kernel Density Estimation • Similar to RBFs. • For each data point xi, add a new “kernel” function centered at that point: p(x) ≈ (1/n) Σi K(x − xi) • The result is a convolution of the kernel with the data points (the same operation as spike-rate reconstruction…)
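A minimal Gaussian kernel density estimate in NumPy; the bandwidth and the sample data are illustrative:

```python
import numpy as np

def kde(x_grid, data, bandwidth=0.3):
    """Kernel density estimate: average of Gaussian kernels centered on each data point."""
    # x_grid: (m,) evaluation points; data: (n,) observed samples
    diffs = x_grid[:, None] - data[None, :]              # (m, n) pairwise differences
    kernels = np.exp(-diffs**2 / (2 * bandwidth**2))     # Gaussian kernel at each data point
    kernels /= bandwidth * np.sqrt(2 * np.pi)            # normalize each kernel
    return kernels.mean(axis=1)                          # average over data points

data = np.random.randn(200)            # illustrative samples
x_grid = np.linspace(-4, 4, 101)
density = kde(x_grid, data)
```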
Clustering algorithms • Set of categories r = A, B, C, etc… • Set of expected data for each category: p(x|r) • Any given observation x has a probability of being in category r: p(r|x) • Goal is to figure out what the categories are by observing values of x. • “Mixture model”: the overall data distribution is a weighted sum of the per-category distributions (see below).
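The mixture-model equation itself did not survive extraction; the standard form it presumably refers to (a sum over categories, with Bayes’ rule giving the category probabilities) is:

```latex
p(x) = \sum_{r} p(x \mid r)\, p(r),
\qquad
p(r \mid x) = \frac{p(x \mid r)\, p(r)}{\sum_{r'} p(x \mid r')\, p(r')}
```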
K-means algorithm • Start with k centers μr; each center r is one category. • New points xi are assigned the category of the closest center. • Move each center to the mean of all its assigned points. • Then re-categorize all the points every time a mean changes, and repeat until the assignments stop changing.
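A minimal batch k-means sketch in NumPy; the initialization from random data points and the convergence test are illustrative choices:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: assign points to the nearest center, then move centers to the mean."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initialize from random data points
    for _ in range(n_iters):
        # Assignment step: label each point with its closest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        new_centers = np.array([X[labels == r].mean(axis=0) if np.any(labels == r) else centers[r]
                                for r in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```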
Kohonen network • How to do K-means without having to remember all the data points. • “Winner-take-all” network: every time a data point x arrives, only the closest center (the “winner”) changes, moving a small step toward x: μwin ← μwin + η(x − μwin)
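A sketch of that online winner-take-all update; the learning rate η = 0.05 is an illustrative value:

```python
import numpy as np

def kohonen_update(centers, x, eta=0.05):
    """Winner-take-all update: only the center closest to x moves, a small step toward x."""
    winner = np.argmin(((centers - x) ** 2).sum(axis=1))   # index of the closest center
    centers[winner] += eta * (x - centers[winner])         # move the winner toward the data point
    return centers
```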
Expectation Maximization • Need a set of parameterized functions to represent p(x|r) [e.g. Gaussians]. • When a data point xi arrives, use Bayes’ rule with the current Gaussians to calculate p(r|xi) = p(xi|r)p(r)/p(xi) for each category r. • Adjust the parameters [mean, variance] of each category at each step, using the new empirical estimate of p(x|r) (each data point weighted by p(r|xi)).
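A compact sketch of batch EM for a one-dimensional Gaussian mixture; the initialization, iteration count, and fixed number of components are illustrative assumptions:

```python
import numpy as np

def em_gmm(X, k, n_iters=50, seed=0):
    """EM for a 1-D Gaussian mixture: alternate responsibilities (E step) and parameter updates (M step)."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(X, size=k, replace=False)     # initial means drawn from the data
    var = np.full(k, X.var())                     # initial variances
    pi = np.full(k, 1.0 / k)                      # initial mixing proportions p(r)
    for _ in range(n_iters):
        # E step: responsibilities p(r | x_i) from Bayes' rule with the current Gaussians
        lik = pi * np.exp(-(X[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M step: re-estimate mean, variance, and mixing proportion of each Gaussian
        n_r = resp.sum(axis=0)
        mu = (resp * X[:, None]).sum(axis=0) / n_r
        var = (resp * (X[:, None] - mu) ** 2).sum(axis=0) / n_r
        pi = n_r / len(X)
    return mu, var, pi
```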
EM Convergence • If the data is well represented by a sum of Gaussian bumps, then this will probably converge to the correct centers and widths of the bumps. • The problem is that the shape and number of the kernels must be known in advance.
Map plasticity • RBFs, KDE, k-means, and EM all place more bumps wherever there is more data. • NB: RBF learning is supervised; the others are unsupervised. • These may be models for cortical map formation and “cortical magnification”. • If each bump is a cell, then the cell responses will cluster near regions of high data density. • This is also the correct solution according to information theory.
Reinforcement Learning • Neither supervised nor unsupervised. • There is a reward signal R(x) that is a function of the current state x. • Goal is to maximize the future reward. • Try to predict the current reward: R(x) = wᵀx • Try to predict the total future reward (the “value”): V(t) = wᵀx(t)
Predicting Reward • Can learn R(x) using the LMS rule. • To learn V(t), you need to know what you will do in the future (x(t+1), x(t+2), etc…). • So you need a “policy” x(t+1) = f(x(t)). • Ask: what is the expected total future reward for this particular policy, if I start in state x(t)? • If I have a choice of state x(t), which is best? • Then: how do I optimize the policy?
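A minimal sketch of the LMS (delta-rule) update for learning R(x) = wᵀx; the learning rate and function name are illustrative:

```python
import numpy as np

def lms_update(w, x, reward, eta=0.1):
    """Delta rule: adjust w to reduce the squared error between predicted and received reward."""
    delta = reward - w @ x      # prediction error for the current state
    return w + eta * delta * x
```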
Temporal Difference Learning • Estimate the value of each state, V(x), assuming you follow your policy. • Use the LMS rule, but the target for V(x(t)) is R(t) + V(x(t+1)), so the error is the temporal difference δ(t) = R(t) + V(x(t+1)) − V(x(t)). • Now use a “greedy” policy and always choose the next state with the highest value.
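A sketch of the corresponding TD(0) update for a linear value function V(x) = wᵀx, with no discount factor, matching the target above; the names are illustrative:

```python
import numpy as np

def td_update(w, x_t, x_next, reward, eta=0.1):
    """TD(0): move V(x_t) = w @ x_t toward the target reward + V(x_next)."""
    td_error = reward + w @ x_next - w @ x_t     # temporal-difference error
    return w + eta * td_error * x_t
```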
Backing up • TD “backs up” value from the goal toward earlier states. • This allows you to predict the future effects of current actions. • May be a model for reinforcement learning and the role of dopamine.
Independent Components Analysis • A linear procedure, like PCA: y = Wx • But the goal is not to maximize variance; it is to make the outputs statistically independent (if possible): p(y) = Πi p(yi) • For Gaussian distributions ICA reduces to PCA, so it is only interesting for non-Gaussian distributions.
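A minimal usage sketch with scikit-learn’s FastICA on a toy two-source mixture; the sources and mixing matrix are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
s = np.column_stack([rng.laplace(size=1000),        # two non-Gaussian sources
                     rng.uniform(-1, 1, size=1000)])
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])                          # mixing matrix (illustrative)
x = s @ A.T                                         # observed mixtures, x = A s

ica = FastICA(n_components=2, random_state=0)
y = ica.fit_transform(x)                            # recovered sources, y ≈ W x (up to order/scale)
```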
ICA • [Diagram: the input x passes through the matrix W to give the outputs y = Wx.]