10. Supervised learning and rewards systems
Fundamentals of Computational Neuroscience, T. P. Trappenberg, 2002.
Lecture Notes on Brain and Computation
Byoung-Tak Zhang
Biointelligence Laboratory, School of Computer Science and Engineering
Graduate Programs in Cognitive Science, Brain Science and Bioinformatics
Brain-Mind-Behavior Concentration Program
Seoul National University
E-mail: btzhang@bi.snu.ac.kr
This material is available online at http://bi.snu.ac.kr/
10.1 Motor learning and control
• Act on a large number of training data without the intention of storing all the specific examples
• The learning of motor skills, motor control
• Important for the survival of a species
• Ex) Catching a ball, playing the piano, etc.
• The brain must be able to direct the control system
• Visual guidance
• Arm movements with visual signals
• Commonly able to adapt to a changed environment within only a few additional trials
10.1.1 Feedback controller
• How limb movements could be controlled by the nervous system
• Feedback control
• How to find and implement an appropriate and accurate motor command generator
Fig. 10.1 Negative feedback control and the elements of a standard control system.
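The negative feedback loop of Fig. 10.1 can be sketched in a few lines. This is only an illustration under stated assumptions: the motor command generator is taken to be a simple proportional gain and the controlled object is taken to add the command to its state. Neither choice comes from the slides, and the names `simulate_feedback` and `gain` are hypothetical.

```python
def simulate_feedback(desired, gain=0.5, steps=20, state=0.0):
    """Drive `state` towards `desired` with proportional negative feedback."""
    trajectory = [state]
    for _ in range(steps):
        error = desired - state          # compare desired state with sensed state
        motor_command = gain * error     # assumed proportional motor command generator
        state = state + motor_command    # assumed controlled object: integrates the command
        trajectory.append(state)
    return trajectory


trajectory = simulate_feedback(desired=1.0)
print(trajectory[-1])
```

With a gain between 0 and 1 the remaining error shrinks by a constant factor on every step, so the state approaches the desired value without overshoot.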
10.1.2 Forward controller
• Refined schemes for motor control with slow sensory feedback
• Forward models
• Model the dynamics of the controlled object and the behavior of the sensory system
Fig. 10.2 Forward model controller
10.1.3 Inverse model controller
• Refined schemes for motor control with slow sensory feedback
• Inverse model controller
• Incorporated as a side-loop to the standard feedback controller, it learns to correct the computation of the motor command generator
Fig. 10.3 Inverse model controller
10.1.4 The cerebellum and motor control
• Adaptive controllers are realized in the brain and are vital for our survival
Fig. 10.4 Schematic illustration of some connectivity patterns in the cerebellum. Note that the output of the cerebellum is provided by Purkinje neurons that make inhibitory synapses. Climbing fibers are specific for each Purkinje neuron and are tightly interwoven with its dendritic tree.
10.2 The delta rule
• Forward and inverse models can be implemented by feed-forward mapping networks
• How such mapping networks can be trained
• To minimize the mean difference between the output of a feed-forward mapping network and a desired state provided by a teacher
• Objective function or cost function
• Measures the distance between the actual output and the desired output, E
• The mean square error (MSE), Eq. (10.1)
• r_i^out is the actual output
• y_i is the desired output
10.2.1 Gradient descent
• Minimize the error function of a single-layer mapping network
• By changing the weight values
• k, the learning rate
• The gradient of the error function
• Delta rule
Eqs. (10.2)-(10.5)
Fig. 10.5 Illustration of error minimization with a gradient descent method on a one-dimensional error surface E(w).
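The equations of this slide are not reproduced in the text, so the following is a hedged sketch of the standard delta rule rather than a transcription of Eqs. (10.2)-(10.5). It assumes the error E = 1/2 Σ_i (r_i^out - y_i)^2, a sigmoid gain function g, and the update w_ij ← w_ij + k δ_i r_j^in with δ_i = g'(h_i)(y_i - r_i^out). The training set (logical OR with a constant bias input) and all function names are illustrative assumptions.

```python
import math


def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))


def outputs(w, r_in):
    """r_out[i] = g(sum_j w[i][j] * r_in[j]) for a single-layer network."""
    return [sigmoid(sum(wij * xj for wij, xj in zip(row, r_in))) for row in w]


def mean_square_error(w, data):
    """E = 1/2 * sum_i (r_out[i] - y[i])**2, averaged over the training patterns."""
    total = 0.0
    for r_in, y in data:
        total += 0.5 * sum((o - t) ** 2 for o, t in zip(outputs(w, r_in), y))
    return total / len(data)


def delta_rule_update(w, r_in, y, k):
    """In-place update w[i][j] += k * delta[i] * r_in[j]."""
    r_out = outputs(w, r_in)
    for i, row in enumerate(w):
        # For the sigmoid, g'(h) = g(h) * (1 - g(h)).
        delta = r_out[i] * (1.0 - r_out[i]) * (y[i] - r_out[i])
        for j in range(len(row)):
            row[j] += k * delta * r_in[j]


# Hypothetical training set: logical OR; the constant third input acts as a bias.
data = [([0.0, 0.0, 1.0], [0.0]), ([0.0, 1.0, 1.0], [1.0]),
        ([1.0, 0.0, 1.0], [1.0]), ([1.0, 1.0, 1.0], [1.0])]
w = [[0.0, 0.0, 0.0]]
error_before = mean_square_error(w, data)
for _ in range(2000):
    for r_in, y in data:
        delta_rule_update(w, r_in, y, k=0.5)
error_after = mean_square_error(w, data)
print(error_before, error_after)
```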
10.2.2 Batch versus online algorithm
• Batch algorithm versus online learning algorithm
Table 10.1 Summary of delta-rule algorithm
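Table 10.1 is not reproduced here, so the difference between the two schemes is sketched below for a single linear node (gain function g(h) = h, an assumption made to keep the example short). The online version changes the weights after every training pattern, while the batch version first averages the delta-rule terms over all patterns. The data set (y = 2x + 1 with a constant bias input) and the function names are hypothetical.

```python
def predict(w, r_in):
    return sum(wj * xj for wj, xj in zip(w, r_in))


def train_online(w, data, k, epochs):
    """Update the weights after each individual training pattern."""
    w = list(w)
    for _ in range(epochs):
        for r_in, y in data:
            delta = y - predict(w, r_in)
            w = [wj + k * delta * xj for wj, xj in zip(w, r_in)]
    return w


def train_batch(w, data, k, epochs):
    """Average the delta-rule terms over all patterns, then update once per epoch."""
    w = list(w)
    for _ in range(epochs):
        summed = [0.0] * len(w)
        for r_in, y in data:
            delta = y - predict(w, r_in)
            summed = [s + delta * xj for s, xj in zip(summed, r_in)]
        w = [wj + k * s / len(data) for wj, s in zip(w, summed)]
    return w


# Hypothetical noise-free data from y = 2*x + 1; the second input is a constant bias.
data = [([float(x), 1.0], 2.0 * x + 1.0) for x in range(4)]
w_online = train_online([0.0, 0.0], data, k=0.1, epochs=2000)
w_batch = train_batch([0.0, 0.0], data, k=0.1, epochs=2000)
print(w_online, w_batch)
```

On this noise-free toy problem both variants settle on the same weights. With noisy data the online updates keep fluctuating around the minimum unless the learning rate is reduced over time.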
10.2.3 Supervised learning
• The delta learning rule depends on knowledge of the desired output
• Supervised learning
• Supplies the network with the desired response
• The training signal
• The climbing fiber in the cerebellum could very well supply such an error signal to the Purkinje cells
• The weight change still takes the form of a correlation rule between an error factor and the presynaptic activity
• The biological mechanisms underlying synaptic plasticity
• Unsupervised learning
• Hebbian learning
10.2.4 Supervised learning in multilayer networks
• Generalize the delta rule to multilayer mapping networks
• The error-back-propagation algorithm, or generalized delta rule
• The application of multilayer feed-forward mapping networks (multilayer perceptrons)
• Discuss difficulties in connecting the computational step with brain processes
• Strongly restricted number of hidden nodes to achieve good generalization
• There might not be the need in the brain to train multilayer mapping networks with supervised learning algorithms such as the generalized delta rule
• Single-layer networks can represent complicated functions
• Expansion recoding
10.3 Generalized delta rules (1)
• The gradient of the MSE error function with respect to the output weights
• The delta factor
• The calculation of the gradients with respect to the weights to the hidden layer
• The derivative of the output layer
• The delta term of the hidden layer
Eqs. (10.6)-(10.10)
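Since Eqs. (10.6)-(10.10) are missing from the text, here is a minimal sketch of the generalized delta rule for one hidden layer, written under assumptions: tanh gain functions in both layers, the MSE error E = 1/2 Σ_i (r_i^out - y_i)^2, output delta δ_i^out = g'(h_i^out)(y_i - r_i^out), and hidden delta δ_j^h = g'(h_j^h) Σ_i w_ij^out δ_i^out. The weight values and function names are hypothetical. The weight changes are compared against a numerical derivative of E as a sanity check.

```python
import math


def forward(w_hidden, w_out, x):
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    out = [math.tanh(sum(w * hj for w, hj in zip(row, hidden))) for row in w_out]
    return hidden, out


def error(w_hidden, w_out, x, y):
    _, out = forward(w_hidden, w_out, x)
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, y))


def weight_changes(w_hidden, w_out, x, y, k):
    """Return (changes for hidden weights, changes for output weights)."""
    hidden, out = forward(w_hidden, w_out, x)
    # Output delta: g'(h) * (y - r_out); for tanh, g' = 1 - tanh**2.
    delta_out = [(1.0 - o * o) * (t - o) for o, t in zip(out, y)]
    # Hidden delta: g'(h) times the deltas propagated back through the output weights.
    delta_hidden = [
        (1.0 - hidden[j] ** 2) * sum(w_out[i][j] * delta_out[i] for i in range(len(w_out)))
        for j in range(len(hidden))
    ]
    d_out = [[k * d * hj for hj in hidden] for d in delta_out]
    d_hidden = [[k * d * xi for xi in x] for d in delta_hidden]
    return d_hidden, d_out


def numeric_gradient(w_hidden, w_out, x, y, weights, i, j, eps=1e-6):
    """Central-difference estimate of dE/dw for weights[i][j].

    `weights` must be the same list object as w_hidden or w_out.
    """
    original = weights[i][j]
    weights[i][j] = original + eps
    e_plus = error(w_hidden, w_out, x, y)
    weights[i][j] = original - eps
    e_minus = error(w_hidden, w_out, x, y)
    weights[i][j] = original
    return (e_plus - e_minus) / (2.0 * eps)


# Hypothetical small network: 2 inputs, 2 hidden nodes, 1 output node.
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_out = [[0.5, -0.6]]
x, y = [1.0, 0.5], [0.2]
d_hidden, d_out = weight_changes(w_hidden, w_out, x, y, k=1.0)
check_hidden = numeric_gradient(w_hidden, w_out, x, y, w_hidden, 0, 1)
check_out = numeric_gradient(w_hidden, w_out, x, y, w_out, 0, 0)
print(d_hidden[0][1], -check_hidden)
print(d_out[0][0], -check_out)
```

With k = 1 each weight change should equal the negative gradient of E, which is what the two printed pairs compare.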
10.3 Generalized delta rules (2)
Table 10.2 Summary of error-back-propagation algorithm
10.3.1 Biological plausibility
• The back-propagation of error signals is probably the most problematic feature in biological terms
• The non-locality of the algorithm, in which a neuron has to gather the back-propagated error from all the other nodes to which it projects
• Synchronization issues
• Disadvantages for true parallel processing
• The delta signal is also problematic
• How a forward-propagating phase of signals can be separated effectively from the back-propagation phase of the error signals
10.3.2 Advanced algorithms
• The basic error-back-propagation algorithm
• Convergence performance problem
• The learning in the form of statistical learning theories
• Improvements over the basic algorithm
• Initial conditions
• Different error functions
• Various acceleration techniques
• Hybrid methods
• The limitation of the basic error-back-propagation algorithm
• Alternative learning strategies
10.3.3 Momentum method and adaptive learning rate
• The basic gradient descent method
• Typically finds an initial phase of rapid improvement
• Followed by a phase of very slow convergence
• A shallow part of the error function
• Momentum term, Eq. (10.11)
• Remembers the changes of the weight in the previous time step
• The momentum term has the effect of biasing the direction of the new update vector towards the previous direction
• To increase the learning rate
• when the gradient becomes small
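The effect of the momentum term can be sketched on a one-dimensional shallow error surface. The update used here, Δw(t) = -k dE/dw + μ Δw(t-1), is the common textbook form and is assumed to correspond to Eq. (10.11), which is not reproduced in the text. The error function E(w) = 0.005 w^2 and the parameter values are hypothetical.

```python
def gradient_descent(gradient, w, k, momentum, steps):
    """Gradient descent where each step also carries over a fraction of the previous step."""
    dw = 0.0
    for _ in range(steps):
        dw = -k * gradient(w) + momentum * dw   # momentum term remembers the last change
        w += dw
    return w


def shallow_gradient(w):
    # Gradient of the assumed shallow error surface E(w) = 0.005 * w**2.
    return 0.01 * w


w_plain = gradient_descent(shallow_gradient, 1.0, k=1.0, momentum=0.0, steps=100)
w_momentum = gradient_descent(shallow_gradient, 1.0, k=1.0, momentum=0.9, steps=100)
print(w_plain, w_momentum)
```

On this surface the run with momentum ends much closer to the minimum at w = 0 after the same number of steps, at the price of some oscillation around it.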
10.3.4 Different error functions
• Shallow areas in the error function depend on the particular choice of the error function
• Entropic error function, Eq. (10.12)
• A proper measure for the information content (or entropy) of the actual output of the multilayer perceptron given the knowledge of the correct output
• It is not always obvious which error functions should be used
• A general strategy for choosing the error function can unfortunately not be given
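Why the choice of error function changes the shallow regions can be illustrated with a single sigmoid output node. The entropic function used below is the binary cross-entropy, E = -[y ln r + (1 - y) ln(1 - r)], which is one common entropic choice and an assumption here; Eq. (10.12) of the source is not reproduced and may have a different form. The sketch compares the gradient with respect to the net input h for a strongly saturated, wrong output.

```python
import math


def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))


def mse_gradient(h, y):
    """dE/dh for E = 1/2 * (r - y)**2 with r = sigmoid(h)."""
    r = sigmoid(h)
    return (r - y) * r * (1.0 - r)


def cross_entropy_gradient(h, y):
    """dE/dh for E = -(y * ln r + (1 - y) * ln(1 - r)) with r = sigmoid(h)."""
    return sigmoid(h) - y


# Hypothetical saturated case: the node outputs almost 0 while the target is 1.
h_saturated, target = -6.0, 1.0
print(mse_gradient(h_saturated, target), cross_entropy_gradient(h_saturated, target))
```

The MSE gradient is suppressed by the factor g'(h), which is tiny in saturation, while the cross-entropy gradient stays close to the full output error.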
10.3.5 High-order gradient methods
• The basic line search algorithm of gradient descent is known for its poor performance with shallow error functions
• The minimization of an error function
• Many other advanced minimization techniques
• Take high-order gradient terms into account
• Curvature terms
• The curvature of the error surface in the weight change calculations
• The calculation of the inverse of the Hessian matrix
• Natural gradient algorithm
• Levenberg-Marquardt method
10.3.6 Local minima and simulated annealing
• A general limitation of pure gradient descent methods
• A local minimum of the error surface
• The system is not able to approach a global minimum of the error function
• Solution
• Stochastic processes
• Simulated annealing
• Add noise to the weight values
Fig. 10.5 Illustration of error minimization with a gradient descent method on a one-dimensional error surface E(w).
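A minimal sketch of the idea of adding noise to the weight values: plain gradient descent is run first and gets stuck in a local minimum, then the search continues from that point with Gaussian noise whose amplitude is lowered step by step. The double-well error surface, the linear cooling schedule, the clipping interval and all names are assumptions made for illustration, not the procedure of the source.

```python
import random


def error(w):
    # Assumed error surface with a local minimum near w = +1 and a deeper one near w = -1.
    return (w * w - 1.0) ** 2 + 0.3 * w


def gradient(w):
    return 4.0 * w * (w * w - 1.0) + 0.3


def descend(w, k=0.01, steps=2000, noise=0.0, seed=0):
    """Gradient descent with optional annealed noise; returns (final w, best w seen)."""
    rng = random.Random(seed)
    best = w
    for step in range(steps):
        temperature = noise * (1.0 - step / steps)   # assumed linear cooling schedule
        w = w - k * gradient(w) + rng.gauss(0.0, 1.0) * temperature
        w = max(-2.5, min(2.5, w))                   # keep the search in a bounded interval
        if error(w) < error(best):
            best = w
    return w, best


w_stuck, _ = descend(1.5)                            # pure gradient descent
w_noisy, w_best = descend(w_stuck, noise=0.3, seed=1)
print(w_stuck, w_noisy, w_best)
```

Whether the noisy run ends in the deeper minimum depends on the noise level, the cooling schedule and the random seed, which is why the best weight seen so far is tracked separately.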
10.3.7 Hybrid methods
• A variety of methods utilize the rapid initial convergence of the gradient descent method and combine it with global search strategies
• After the gradient descent method slows down below an acceptable level, a new starting point is chosen randomly
• Hybrid methods combine the efficient local optimization capabilities of the gradient descent method with the global search abilities of stochastic processes
• Genetic algorithms use similar combinations of deterministic minimization and stochastic components
10.4 Reward learning
10.4.1 Classical conditioning and temporal credit assignment problem
• Learning with reward signals
• Conditioning
Fig. 10.6 Classical conditioning and temporal credit assignment problem. A subject is required to associate the ringing of a bell with the pressing of a button that will open the door to a chamber with some food reward. In the example the subject has learned to press the left button after the ringing of the bell. This is an example of a temporal credit assignment problem. It is difficult to devise a system that is still open to possible other solutions such as a bigger reward hidden in the right chamber.
10.4.2 Stochastic escape
• The experiment with another chamber (with a rodent)
• A larger food reward
• Conditioned
• Chance to open the left door after the ringing of the bell
• If the rodent always stuck to the initially conditioned situation, it would never learn about the existence of the larger food reward
• If the rodent is running around randomly in the button chamber before the bell rings, it could still happen that it presses the right button before running to the left button
• The opening of the right door and the larger food reward
• Changes the association of the auditory signal to a new motor action
• Stochastic escape that can balance habit versus novelty
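The balance between habit and novelty can be sketched with a two-button choice in which the subject occasionally presses a random button. The ε-greedy choice rule, the reward sizes and the running-average value update are assumptions chosen for brevity; the slides do not specify a particular mechanism.

```python
import random


def run_trials(epsilon, trials=2000, learning_rate=0.5, seed=0):
    """Return the button the subject prefers after the given number of trials."""
    rng = random.Random(seed)
    reward = {"left": 1.0, "right": 5.0}   # hypothetical reward sizes
    value = {"left": 1.0, "right": 0.0}    # initially conditioned on the left button
    for _ in range(trials):
        if rng.random() < epsilon:
            action = rng.choice(["left", "right"])   # stochastic escape: random press
        else:
            action = max(value, key=value.get)       # habit: press the preferred button
        # Move the value estimate of the pressed button towards the received reward.
        value[action] += learning_rate * (reward[action] - value[action])
    return max(value, key=value.get)


print(run_trials(epsilon=0.0), run_trials(epsilon=0.1))
```

Without random presses the subject never leaves the left button. With a small probability of random presses, one visit to the right chamber is enough to raise its estimated value above that of the left one.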
10.4.3 Reinforcement models
• The implementation of a system
• Learns from reward signals within neural architectures
• The input to this node represents a certain input stimulus such as the ringing of the bell
• The node gets activated under the right conditions and is therefore able to predict the future reward
Fig. 10.7 (A) Linear predictor node.
Eq. (10.13)
10.4.4 Temporal delta rule
• A reward is given at time t + 1
• A scalar value r(t + 1)
• A temporal version of the delta rule
• Eligibility trace
• The node calculates an effective reinforcement signal
• Rescorla-Wagner theory
• The model can produce one-step-ahead predictions of a reward signal
Eqs. (10.14), (10.15)
Fig. 10.7 (B) Neural implementation of temporal delta rule.
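A hedged sketch of a linear predictor node trained with a temporal delta rule. It assumes the prediction V(t) = Σ_i w_i r_i^in(t), the effective reinforcement r(t + 1) - V(t), and the update w_i ← w_i + k [r(t + 1) - V(t)] r_i^in(t), with the input itself serving as the eligibility trace. These are the usual Rescorla-Wagner forms and may differ in detail from Eqs. (10.13)-(10.15), which are not reproduced in the text. The stimuli and names are hypothetical.

```python
def train_predictor(trials, k=0.2, n_inputs=2):
    """Train a linear node to predict the reward that follows each stimulus."""
    w = [0.0] * n_inputs
    for stimulus, reward in trials:
        prediction = sum(wi * si for wi, si in zip(w, stimulus))   # V(t)
        effective_reinforcement = reward - prediction              # r(t+1) - V(t)
        # Only weights of active inputs (the eligibility trace) are changed.
        w = [wi + k * effective_reinforcement * si for wi, si in zip(w, stimulus)]
    return w


# Hypothetical conditioning: the bell (first input) is followed by a reward of 1,
# a light (second input) is never rewarded.
trials = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)] * 50
w = train_predictor(trials)
print(w)
```

After training, the weight of the bell input approaches the size of the reward, so the node's activity after the bell predicts the reward one step ahead.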
10.4.5 Reward chain
• Learning in the previous model is restricted to the prediction of reward in the next time step
• The ability to predict future reward at different time steps or even a whole series of rewards
• V(t), the reinforcement value, takes all future rewards into account
• α_i allow us to specify the weights we give to the reward at different times
• A simple realization of such a model
• 0 ≤ γ < 1, α_i = γ^(i-1)
Eqs. (10.16), (10.17)
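With α_i = γ^(i-1), the reinforcement value is assumed to take the form V(t) = Σ_{i≥1} γ^(i-1) r(t + i), which is the standard discounted sum and is presumably what Eqs. (10.16) and (10.17) express (they are not reproduced in the text). The sketch below evaluates this sum for a finite, hypothetical reward sequence and checks the recursion V(t) = r(t + 1) + γ V(t + 1) that follows from it.

```python
def reinforcement_value(rewards, t, gamma):
    """V(t) = sum over i >= 1 of gamma**(i-1) * r(t+i), truncated at the end of the list.

    rewards[n] is the reward r(n) delivered at time step n.
    """
    return sum(gamma ** (i - 1) * rewards[t + i] for i in range(1, len(rewards) - t))


rewards = [0.0, 1.0, 0.0, 2.0]   # hypothetical reward sequence r(0), ..., r(3)
gamma = 0.5
v0 = reinforcement_value(rewards, 0, gamma)   # 1 + 0.5 * 0 + 0.25 * 2
v1 = reinforcement_value(rewards, 1, gamma)   # 0 + 0.5 * 2
print(v0, v1, rewards[1] + gamma * v1)
```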
10.4.6 Temporal difference learning
• Temporal difference learning (advanced reinforcement learning)
• Predict the reinforcement value at time t correctly
• Predict the correct reinforcement value at the previous time step
• So
• Minimize the temporal difference error
Eqs. (10.18)-(10.21)
Fig. 10.7 (C) Neural implementation of temporal difference learning.
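A minimal sketch of temporal difference learning, assuming the temporal difference error takes the usual form r(t + 1) + γ V(t + 1) - V(t); Eqs. (10.18)-(10.21) are not reproduced in the text and may be written differently. To keep the example short, each time step within a trial has its own value estimate (equivalent to a linear predictor with one-hot inputs), and a reward of 1 arrives only at the end of a three-step trial. All names and parameter values are hypothetical.

```python
def td_learning(episodes, gamma=0.9, k=0.1, n_states=3):
    """Learn the reinforcement value of each step in a fixed sequence of steps."""
    values = [0.0] * (n_states + 1)   # last entry: end of trial, value stays 0
    for _ in range(episodes):
        for state in range(n_states):
            reward = 1.0 if state == n_states - 1 else 0.0   # reward only after the last step
            td_error = reward + gamma * values[state + 1] - values[state]
            values[state] += k * td_error
    return values[:n_states]


values = td_learning(episodes=2000)
print(values)
```

The learned values approach γ^2, γ and 1, so the prediction of the delayed reward propagates backwards to earlier time steps even though no reward is delivered there.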
10.4.7 Adaptive critic controller
• Temporal difference learning is a method of learning to predict future reward contingencies
• Adaptive critic
• Designed to predict the correct motor command for accurate future actions
• Supervises the motor command generator
Fig. 10.8 Adaptive critic controller.
10.4.8 The basal ganglia in the actor-critic scheme
Fig. 10.9 (A) Anatomical overview of the connections within the basal ganglia and the major projections comprising the input and output of the basal ganglia. (B) Organizations within the basal ganglia are composed of processing pathways within the striosomal and matrix modules reflecting an architecture that could implement an actor-critic control scheme. C, cerebral cortex; F, frontal lobe; TH, thalamus; ST, subthalamic nucleus; PD, pallidus; SPm, spiny neurons in the matrix module; SPs, spiny neurons in the striosomal module; DA, dopaminergic neurons.
10.4.9 Other reward mechanisms in the brain
• The proposed functional role of the basal ganglia
• Only one hypothesis mentioned in the literature
• Several hypotheses
• The details of the biochemical nature of an eligibility trace
• Experimental verifications
• The origin of reward learning in the brain is still not well understood
• Involves some association of reward contingencies with specific motor actions in the brain
• Amygdala
• Orbitofrontal cortex
• Dopaminergic neurons
Conclusion
• Motor learning
• Feedback, forward, inverse model controller
• The delta rule
• Gradient descent
• Batch algorithm
• Online learning
• Supervised learning
• Generalized delta rule
• Acceleration of delta rule
• Reward learning
• Classical conditioning
• Reinforcement learning
• Biological mechanisms of reward learning