Presentation next week: cerebellum and supervised learning. Kitazawa S, Kimura T, Yin PB. Cerebellar complex spikes encode both destinations and errors in arm movements. Nature. 1998;392(6675):494-7.
Motor learning: learning algorithms
- network and distributed representations
- supervised learning
  - perceptrons and LMS
  - backpropagation
- reinforcement learning
- unsupervised learning
  - Hebbian networks
Motor learning - supervised learning
- knowledge of the desired behavior is specified: for every input x, we know the corresponding desired output y
Motor learning - supervised learning
e.g. learning the mapping between joint configuration and end-point position. Vision gives information about both values (or proprioception could be used for the joint angles).
Motor learning - supervised learning
- limited feedback from the periphery: just a 'good' or 'bad' evaluation
- have to adjust behavior to maximize the 'good' evaluation => reinforcement learning
e.g. maze learning: a sequence of actions leads to a reward - how do we learn the appropriate sequence?
Motor learning - unsupervised learning
- no feedback from the periphery
- rely on statistics of the inputs (or outputs) to find structure in the data
e.g. clustering of data: develop representations based on properties of the data
Motor learning
- supervised motor learning
  - parameterized models
  - non-parametric, 'neural network' models
- reinforcement learning
- unsupervised learning
  - Hebbian learning
  - principal components analysis
  - independent components analysis
Supervised motor learning - learning parameterized models
Linear regression: we know the general structure of the model, y = a*x + b, but we don't know the parameters a or b. We want to estimate a and b based on paired data sets {xi} and {yi}.
Parameterized models
Linear regression y = a*x + b has an analytical solution (Intro stats):
a = S(xi - <x>)(yi - <y>) / S(xi - <x>)^2
b = <y> - a<x>
where S denotes a sum over the data and <x> is the expected value of x, i.e. the mean.
This is from Intro stats - a single step of calculation across all the data.
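As a check, the closed-form solution can be computed directly. A minimal sketch with made-up noise-free data (the values and variable names are purely illustrative):

```python
import numpy as np

# Hypothetical data generated from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Slope from centered sums, intercept from the means
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()
print(a, b)  # recovers slope 2, intercept 1
```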
Parameterized models
Linear regression using iterative gradient descent:
y* = a* x + b*; a*, b* are the correct parameters and y* is the observed data.
Assume initial parameters a and b, and define an error term:
E = 1/2 (y - y*)^2, where y is the value predicted by the current parameters and y* is the target value.
We want to find parameters which minimize this error - move the parameters to reduce the error:
a = a + da; da is the change in a that reduces the error
b = b + db; db is the change in b that reduces the error
Choose da, db along the (negative) gradient of the error.
Parameterized models
y* = a* x + b*; E = 1/2 (y - y*)^2
Find the gradient of the error with respect to the parameters:
dE/da = (y - y*) dy/da = (y - y*) x
dE/db = (y - y*)
so choose
a = a - m (y - y*) x
b = b - m (y - y*)
with 0 < m < 1 to control the speed of learning.
Parameterized models
e.g. iterative gradient descent for linear regression
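The iterative scheme can be sketched as follows. The data are hypothetical (generated from a* = 2, b* = 1), and the learning rate m and epoch count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y_star = 2.0 * x + 1.0            # targets from the 'true' parameters a* = 2, b* = 1

a, b, m = 0.0, 0.0, 0.1           # initial guesses and learning rate
for _ in range(1000):
    for xi, yi in zip(x, y_star):
        y = a * xi + b            # prediction with the current parameters
        a -= m * (y - yi) * xi    # a <- a - m (y - y*) x
        b -= m * (y - yi)         # b <- b - m (y - y*)
print(a, b)                       # converges toward a* = 2, b* = 1
```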
Parameterized models
learn limb parameters for a 2-dof arm:
x = l1*cos(q1) + l2*cos(q1+q2)
y = l1*sin(q1) + l2*sin(q1+q2)
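Note that for known joint angles the end-point coordinates are linear in the link lengths l1 and l2, so the limb parameters can be recovered by ordinary least squares. A sketch with assumed link lengths and simulated 'visual' observations of the endpoint:

```python
import numpy as np

l1_true, l2_true = 0.3, 0.25          # hypothetical link lengths
rng = np.random.default_rng(1)
q1 = rng.uniform(0, np.pi / 2, 50)    # sampled joint configurations
q2 = rng.uniform(0, np.pi / 2, 50)

# Observed endpoint coordinates (e.g. from vision)
ex = l1_true * np.cos(q1) + l2_true * np.cos(q1 + q2)
ey = l1_true * np.sin(q1) + l2_true * np.sin(q1 + q2)

# The endpoint is linear in (l1, l2): stack both coordinates
# into one regression problem and solve by least squares
A = np.block([[np.cos(q1)[:, None], np.cos(q1 + q2)[:, None]],
              [np.sin(q1)[:, None], np.sin(q1 + q2)[:, None]]])
t = np.concatenate([ex, ey])
l1, l2 = np.linalg.lstsq(A, t, rcond=None)[0]
print(l1, l2)                         # recovers the assumed link lengths
```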
Motor learning and representations - how are properties of the limb (e.g. the mapping from joint angles to end position) represented by the CNS?
Distributed representations
- parameters are not explicitly fit
- both the parameters and the model structure are identified
Learn the parameters and the model within a distributed network.
Distributed models - network architecture
[diagram: two inputs x1, x2 fully connected to two outputs y1, y2 through weights w11, w12, w21, w22]
y1 = w11*x1 + w21*x2
y2 = w12*x1 + w22*x2
=> y = Wx
As shown here, this is just linear regression.
Distributed network models
Simple network: one layer of linear units, y = Wx.
From inputs x and corresponding outputs y*, find the W that best approximates the function.
Distributed network models
To fit the network parameters:
define the error: E = 1/2 (y - y*)^2
take the derivative with respect to the weights: dE/dW = (y - y*) x^T
update the weights: W = W - u (y - y*) x^T
or weight by weight: wij = wij - u (yj - yj*) xi
i.e. similar to the rule for linear regression.
This is the Widrow-Hoff / adaline / LMS rule - the least mean squares rule.
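A minimal sketch of the LMS rule for a one-layer linear network. The target mapping, layer sizes, learning rate, and iteration count are all assumptions of this example:

```python
import numpy as np

rng = np.random.default_rng(0)
W_true = rng.normal(size=(3, 2))   # hypothetical target mapping: 2 inputs -> 3 outputs
W = np.zeros((3, 2))               # initial weights
u = 0.1                            # learning rate

for _ in range(2000):
    x = rng.normal(size=(2, 1))    # a random input pattern
    y_star = W_true @ x            # desired output for this input
    y = W @ x                      # network's current output
    W -= u * (y - y_star) @ x.T    # LMS: W <- W - u (y - y*) x^T

print(np.max(np.abs(W - W_true))) # residual error shrinks toward zero
```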
Distributed network models - linear units, single layer networks
batch mode: learn from all the data at once
W = W - u dE/dW
online mode: learn from one data point at a time
W = W - u dEi/dW; Ei is the error for {xi, yi}, the ith data point
Distributed network models
- linear units, single layer networks: essentially linear regression
- the gradient descent learning rule leads to the LMS update rule, changing the weights iteratively
Distributed network models
- more complicated computations
- classification: learn to assign data points to the correct category
We want to map the inputs x to outputs y = {-1, 1}, i.e. categorize the data.
Distributed network models
- more complicated computations
- classification: learn to assign data points to the correct category
The weight vector acts to project the inputs to produce the outputs: y = w*x > 0 on one side of the boundary, y = w*x < 0 on the other. If we take y = sign(w*x), we can do classification.
Distributed network models - categorization (non-linear transformation)
Learning in nonlinear networks - the outputs are a non-linear function of their inputs, a sigmoidal 'squashing' function:
g(Wx) = 1/(1 + exp(-Wx))
g works like a 'bistable' categorization unit, mapping patterns to a category in (0,1); one can also use g(x) = sign(x) (Perceptrons).
Distributed network models - categorization (non-linear transformation)
Learning in nonlinear networks
y = g(Wx) = 1/(1 + exp(-Wx))
Find the gradient of E = 1/2 (y - y*)^2:
dE/dW = (y - y*) g'(Wx) x^T
noting that g'(z) = g(z)(1 - g(z)).
This is the basic neural network learning rule.
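This rule can be sketched for a single sigmoidal unit trained on a hypothetical, linearly separable 0/1 labeling (the data, learning rate u, and epoch count are arbitrary choices):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))   # sigmoidal 'squashing' function

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Hypothetical 0/1 category labels, separable through the origin
y_star = (X @ np.array([1.5, -1.0]) > 0).astype(float)

w = np.zeros(2)
u = 0.5                               # learning rate
for _ in range(300):
    for x, t in zip(X, y_star):
        y = g(w @ x)
        # dE/dw = (y - y*) g'(wx) x, with g'(z) = g(z)(1 - g(z))
        w -= u * (y - t) * y * (1 - y) * x

acc = np.mean((g(X @ w) > 0.5) == (y_star > 0.5))
print(acc)                            # classification accuracy on the training data
```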
Distributed network models
- non-linear units, single layer networks: 'logistic', non-linear regression
- allows learning of categorization problems
Distributed network models - single layer, classification networks
find a network to perform the logical AND function:
x1 x2 | y
 0  0 | 0
 0  1 | 0
 1  0 | 0
 1  1 | 1
Distributed network models - single layer, classification networks
logical AND: choose weights w = [1 1], and add an offset to the inputs to shift the origin (a bias of -1.2, drawn in the diagram as -.6 from each input unit):
x1 x2 | Wx + bias | threshold(y)
 0  0 |   -1.2    | 0
 0  1 |    -.2    | 0
 1  0 |    -.2    | 0
 1  1 |     .8    | 1
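The AND unit can be verified directly (a sketch using the weights and offset from the table above):

```python
import numpy as np

# Weights and offset from the AND table: w = [1, 1], bias = -1.2
w = np.array([1.0, 1.0])
bias = -1.2
outputs = []
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    net = w @ np.array([x1, x2]) + bias   # Wx plus the shifted origin
    outputs.append(int(net > 0))          # threshold unit
print(outputs)  # -> [0, 0, 0, 1], i.e. logical AND
```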
Distributed network models - single layer, classification networks
find a network to perform the logical XOR function:
x1 x2 | y
 0  0 | 0
 0  1 | 1
 1  0 | 1
 1  1 | 0
What weights will make this work? There are none: single layer networks are computationally limited.
Distributed network models - multiple layer networks
XOR can be solved with a multi-layered network. With hidden units h1 = threshold(x1 + x2 - .5) (an OR-like unit) and h2 = threshold(x1 + x2 - 1.5) (an AND-like unit), and output y = threshold(h1 - 2*h2 - .5):
x1 x2 | h1 h2 | y
 0  0 |  0  0 | 0
 0  1 |  1  0 | 1
 1  0 |  1  0 | 1
 1  1 |  1  1 | 0
- more complicated computations can be performed with multiple layer networks
- they can characterize problems which are not linearly separable
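The hand-wired XOR network can be checked unit by unit (a sketch assuming the thresholds shown on the slide: hidden biases -.5 and -1.5, output weights 1 and -2, output bias -.5):

```python
def step(z):
    return int(z > 0)   # threshold unit

outputs = []
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h1 = step(x1 + x2 - 0.5)        # OR-like hidden unit
    h2 = step(x1 + x2 - 1.5)        # AND-like hidden unit
    y = step(h1 - 2 * h2 - 0.5)     # fires for h1 AND NOT h2
    outputs.append(y)
print(outputs)  # -> [0, 1, 1, 0], i.e. XOR
```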
Distributed network models - learning in multiple layer networks y x h W V 1 1 1 outputs inputs 2 2 2 Consider a linear network: h = Wx y = Vh NB: there’s not much point to multiple layers with linear units since it can all be reexpressed as a single linear network: y = VWx = W’x; i.e. just redefine your weight matrix
Distributed network models - learning in multiple layer networks
linear network: h = Wx, y = Vh
Form the error: E = 1/2 (y - y*)^2
To update the weights V, from h to y:
dE/dV = (y - y*) dy/dV = (y - y*) h^T
i.e. the same rule as for the single layer network.
Distributed network models - learning in multiple layer networks
linear network: h = Wx, y = Vh
To update the weights W, from x to h, use the chain rule:
dE/dW = (y - y*) dy/dW = (y - y*) dy/dh dh/dW = [V^T (y - y*)] x^T
- this is the gradient for the 'hidden' layer
Distributed network models - learning in multiple layer networks
non-linear network: h = g(Wx), y = g(Vh)
Updating the weights V is the same as before:
dE/dV = (y - y*) g'(Vh) h^T
Distributed network models - learning in multiple layer networks
To update the weights W, use the chain rule:
dE/dW = (y - y*) dy/dW = (y - y*) dy/dh dh/dW = [V^T ((y - y*) g'(Vh))] g'(Wx) x^T
Essentially, we're propagating the error backwards through the network, changing the weights according to how much they affect the output => backpropagation learning.
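The full update can be sketched as a training loop. Everything concrete here (XOR as the task, the appended bias inputs, 4 hidden units, the learning rate, and the epoch count) is an assumption of the sketch, not part of the slides:

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid squashing function

rng = np.random.default_rng(0)
# XOR patterns, with a constant 1 appended as a bias input
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], float)
T = np.array([0.0, 1.0, 1.0, 0.0])

W = rng.normal(scale=0.5, size=(4, 3))   # input (+bias) -> 4 hidden units
V = rng.normal(scale=0.5, size=(1, 5))   # hidden (+bias) -> 1 output
u = 0.5                                  # learning rate

def forward(x):
    h = np.append(g(W @ x), 1.0)         # hidden activity plus a bias unit
    return h, g(V @ h)[0]

mse0 = np.mean([(forward(x)[1] - t) ** 2 for x, t in zip(X, T)])
for _ in range(10000):
    for x, t in zip(X, T):
        h, y = forward(x)
        dy = (y - t) * y * (1 - y)                   # (y - y*) g'(Vh)
        dh = (V[0, :4] * dy) * h[:4] * (1 - h[:4])   # error propagated back to h
        V -= u * dy * h[None, :]
        W -= u * np.outer(dh, x)
mse = np.mean([(forward(x)[1] - t) ** 2 for x, t in zip(X, T)])
print(mse0, mse)   # error before vs. after training
```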
Distributed network models - backpropagation learning in multiple layer networks
linear network: h = Wx, y = Vh
- Find out how much of the error in the output is due to V; the responsibility will be due to the activity of h: dE/dV = (y - y*) h. Change V according to this responsibility.
- Find out how much of the error is due to W: units in h which have a large output weight V will be more responsible for the error (i.e. weight the error by V): (y - y*) V. Values in h will be due to activities in x (i.e. weight the h responsibility by x): dE/dW = (y - y*) V x. Change W according to this 'accumulated' responsibility.
Learning in multi-layer neural networks - backpropagation learning
- allows simple learning of arbitrarily complex input/output mappings: with enough 'neurons', almost any mapping is possible
- results in 'distributed' representations: knowledge of the mapping is distributed across neuronal populations, not individual cells
- changes in restricted regions of the input state space will result in restricted changes of the output
Learning in multi-layer neural networks - backpropagation learning
- much slower than parameterized models: the network has to estimate the parameters and the model structure from scratch
- convergence can be slow, especially if the error surface is shallow
- speed can be increased by altering the learning rate (annealing), by using conjugate gradient descent, or with 'momentum':
W = W - u dE/dW + n <change in W last time>
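The momentum idea can be illustrated on a toy one-parameter error surface (E = 1/2 w^2 is an arbitrary choice here; the values of u and n are illustrative):

```python
# Toy quadratic error E(w) = 1/2 w^2, so dE/dw = w
w, dw_prev = 5.0, 0.0
u, n = 0.1, 0.9                  # learning rate and momentum coefficient
for _ in range(300):
    dw = -u * w + n * dw_prev    # -u dE/dW plus a fraction of the last change
    w, dw_prev = w + dw, dw
print(w)                         # decays toward the minimum at w = 0
```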
Learning in multi-layer neural networks - backpropagation learning
- local minima: the error surface might have small 'basins' which can trap the network in a local minimum rather than the global minimum
- start the network from different initial conditions to find the global minimum
Learning in multi-layer neural networks - backpropagation learning
- choosing the learning rate
- small values of u can make the network take a long time to converge
- large values can lead to instability: the parameters overshoot and oscillate (learning rate too high vs. learning rate ok)
Motor learning: learning algorithms • - gradient descent • - change model parameters to reduce error in prediction • - parameterized models • - non-parametric models • - single layer, linear and non-linear networks • - LMS/adaline learning rules • - multi layer, non-linear networks • - back propagation learning • in all of the above, we knew the correct answer and tried to match it • - i.e. ‘supervised learning’ • But what if our knowledge of outcome is limited? • => reinforcement learning
Reinforcement learning - supervised learning, but with limited feedback
The environment sends back only a global evaluation signal saying good or bad (1 or -1) depending on system performance, e.g. move the limb and bump into things (pain as a reinforcer).
Reinforcement learning - supervised learning, but with limited feedback
Using a global reinforcement signal to train a network. Basic idea:
- start with an initial network
- produce an output based on a given input, but add noise to the network to explore
- evaluate the output
- find those units with large activity
- change the weights so that they'll be large the next time the input is given
Reinforcement learning - supervised learning, but with limited feedback
Using a global reinforcement signal to train a network: the associative reward-penalty algorithm (AR-P).
Consider probabilistic outputs y = {-1, 1} with p(y = 1) = 1/(1 + exp(-2Wx)).
The output produced on any given trial is therefore stochastic, with expected value <y> = tanh(Wx).
We then use gradient descent to get the update rule:
dW = u+ (y - <y>) x^T, if r is a reward
dW = u- (-y - <y>) x^T, if r is a penalty
Reinforcement learning - supervised learning, but with limited feedback
Using a global reinforcement signal to train a network: the associative reward-penalty algorithm (AR-P).
dW = u+ (y - <y>) x^T, if r is a reward
dW = u- (-y - <y>) x^T, if r is a penalty
1) if the expected value is close to what the unit actually did, then don't change things (nothing new)
2) if the expected value is different from what it did, and it was rewarded, then change W so that it will do it again
3) if the expected value is different from what it did, and it was penalized, then change W so that it won't do it again
=> trial and error learning
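A sketch of a single AR-P unit on a hypothetical task: reward is given when the stochastic output matches the sign of an assumed target rule. The step sizes u+ and u- are arbitrary, with the usual choice u- much smaller than u+:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -1.0])   # hypothetical rule: reward if y matches sign(w_true . x)
w = np.zeros(2)
u_pos, u_neg = 0.1, 0.01         # reward step much larger than penalty step

for _ in range(5000):
    x = rng.normal(size=2)
    mean_y = np.tanh(w @ x)                                  # expected output <y>
    y = 1.0 if rng.random() < (1 + mean_y) / 2 else -1.0     # stochastic output
    if y == np.sign(w_true @ x):                             # reward
        w += u_pos * (y - mean_y) * x                        # make this output more likely
    else:                                                    # penalty
        w += u_neg * (-y - mean_y) * x                       # push toward the other output

# How often does the unit's preferred output now match the rule?
acc = np.mean([np.sign(np.tanh(w @ x)) == np.sign(w_true @ x)
               for x in rng.normal(size=(500, 2))])
print(acc)
```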
Reinforcement learning - supervised learning, but with limited feedback
Using a global reinforcement signal to train a network:
- much slower than gradient descent
- more biologically plausible: how would the error be backpropagated in supervised learning?
- more directly ethologically plausible: based on direct reward/penalty feedback, i.e. information about survival
Motor learning: learning algorithms • - gradient descent • - change model parameters to reduce error in prediction • - parameterized models • - non-parametric models • - single layer, linear and non-linear networks • - LMS/adaline learning rules • - multi layer, non-linear networks • - back propagation learning • reinforcement learning • - AR-P networks • - Q learning, TD learning, dynamic programming • - unsupervised learning