Reinforcement learning (mostly taken from Dayan and Abbott, ch. 9). Reinforcement learning differs from supervised learning in that there is no all-knowing teacher: the reinforcement signal carries less information. The central problem is temporal credit assignment.
Example: spatial learning is impaired by blockade of NMDA receptors (Morris, 1989). [Figure: Morris water maze – rat and hidden platform.]
Solving this problem comprises two separate tasks:
• Predicting reward, or policy evaluation (the critic)
• Choosing the correct action, or policy improvement (the actor)
Classical vs. instrumental conditioning. Classical: think of Pavlov's dog. In instrumental (operant) conditioning the animal is rewarded for "correct" actions and not rewarded, or even punished, for incorrect ones; what the animal does (the policy) matters.
Predicting reward – the Rescorla-Wagner rule. Notation: u – stimulus, r – reward, v – expected reward, w – weight (filter). The prediction is v = wu, and the weight is updated with the delta rule
w → w + ε δ u,  with  δ = r − v.
For more than one stimulus, u and w become vectors: v = w·u and w → w + ε δ u.
[Figure: Rescorla-Wagner simulations – random reward; learning with r = 1; extinction with r = 0.]
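A minimal MATLAB/Octave sketch of the Rescorla-Wagner rule above; the learning rate, trial count, and the switch from r = 1 to r = 0 are illustrative choices meant to reproduce acquisition and extinction curves like those in the figure:

% Rescorla-Wagner: v = w*u, delta = r - v, w -> w + eps*delta*u
epsilon = 0.1;                           % learning rate (illustrative)
nTrials = 200;
w = 0;                                   % initial weight
u = 1;                                   % stimulus present on every trial
v_hist = zeros(1, nTrials);
for t = 1:nTrials
    if t <= 100, r = 1; else, r = 0; end % acquisition, then extinction
    v = w * u;                           % predicted reward
    delta = r - v;                       % prediction error
    w = w + epsilon * delta * u;         % delta-rule update
    v_hist(t) = v;
end
plot(v_hist); xlabel('trial'); ylabel('predicted reward v');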
Predicting future reward: temporal difference (TD) learning. In more realistic conditions, especially in operant conditioning, the actual reward may come some time after the signal that predicts it. What we care about is then not the immediate reward at this time point, but the total reward predicted given the choice made at this time. How can we estimate the total reward? The total average future reward at time t is
R(t) = ⟨ Σ_{τ=0}^{T−t} r(t+τ) ⟩.
Assume that we estimate this with a linear estimator:
v(t) = Σ_{τ=0}^{t} w(τ) u(t−τ).
Use the δ rule at time t:
w(τ) → w(τ) + ε δ(t) u(t−τ),
where δ(t) is the difference between the actual future reward and the prediction of that reward:
δ(t) = Σ_{τ=0}^{T−t} r(t+τ) − v(t).    (1)
But we do not know the future rewards Σ_{τ=0}^{T−t} r(t+τ). Instead we can approximate them by:
Σ_{τ=0}^{T−t} r(t+τ) ≈ r(t) + v(t+1).    (2)
Substituting (2) into (1) gives us:
δ(t) = r(t) + v(t+1) − v(t).
The temporal difference learning rule then becomes:
w(τ) → w(τ) + ε [r(t) + v(t+1) − v(t)] u(t−τ).
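A minimal MATLAB/Octave sketch of this TD rule for a single stimulus channel; the stimulus time, reward time, learning rate, and trial count are illustrative, and the predictions v are recomputed once per trial from the current weights (a simplification of the fully online update):

% TD learning: v(t) = sum_tau w(tau) u(t-tau);  w(tau) -> w(tau) + eps*delta(t)*u(t-tau)
T = 25; epsilon = 0.2; nTrials = 300;
u = zeros(1, T); u(10) = 1;          % stimulus at t = 10
r = zeros(1, T); r(20) = 1;          % reward at t = 20
w = zeros(1, T);                     % temporal weights w(tau), tau = 0..T-1
for trial = 1:nTrials
    c = conv(w, u); v = [c(1:T) 0];  % predictions; v(T+1) = 0 (nothing after the trial)
    for t = 1:T
        delta = r(t) + v(t+1) - v(t);                   % TD error
        w(1:t) = w(1:t) + epsilon * delta * u(t:-1:1);  % update w(tau) for tau = 0..t-1
    end
end
% After learning, the TD error peaks around stimulus onset rather than at the
% reward time, mirroring the dopaminergic responses shown in the next figure.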
Dopamine and predicted reward. [Figure: activity of VTA dopaminergic neurons in a monkey. A: top – before learning, bottom – after learning. B: after learning; top – with reward, bottom – no reward.]
Generalizations of TD(0):
• The stimulus u can be a vector u, so w is also a vector; this handles more complex or multiple possible stimuli.
• A decay (discount) term γ < 1: δ = r + γ v(u′) − v(u), where u is the current location and u′ is the location moved to after action a. This puts stronger emphasis on rewards that take fewer steps to reach (see the sketch below).
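A minimal MATLAB/Octave sketch of the discounted update on a toy chain of locations; the chain, γ, and ε are illustrative, and only the form of the update with the decay term is taken from the rule above:

% Discounted TD(0) on a chain: move right from location 1 to 5, reward 1 at the end.
nLoc = 5; gamma = 0.9; epsilon = 0.1; nTrials = 500;
v = zeros(1, nLoc);                            % location values
for trial = 1:nTrials
    for u = 1:nLoc-1
        u_next = u + 1;                        % location moved to
        r = (u_next == nLoc);                  % reward only on reaching the end
        delta = r + gamma * v(u_next) - v(u);  % TD error with decay term gamma
        v(u) = v(u) + epsilon * delta;
    end
end
% v converges toward [gamma^3 gamma^2 gamma 1 0]: rewards reached in fewer
% steps are weighted more strongly.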
Until now – how to predict a reward. We still need to see how we decide which path to take, i.e., what policy to use. Bee foraging example: blue and yellow flowers each give a different reward, drawn from distributions P(r_b) and P(r_y).
Learn "action values" m_b and m_y (the actor); these will determine which choice to make. Assume r_b = 1, r_y = 2; what is the best choice we can make? The average reward is
⟨r⟩ = P[b] r_b + P[y] r_y.
What will maximize this reward?
Learn "action values" m_b and m_y; these will determine which choice to make. Use the softmax:
P[b] = exp(β m_b) / (exp(β m_b) + exp(β m_y)),   P[y] = 1 − P[b].
This is a stochastic choice; β is a variability parameter (larger β means less variability). A good choice for the action values is to set them to the mean rewards:
m_b = ⟨r_b⟩,   m_y = ⟨r_y⟩.
This is also called the "indirect actor".
How good is this choice? Assume β = 1, r_b = 1, r_y = 2; what is ⟨r⟩?
>> rb = 1; ry = 2;
>> pb = exp(rb)/(exp(rb)+exp(ry))
pb = 0.2689
>> py = exp(ry)/(exp(rb)+exp(ry))
py = 0.7311
>> r_av = rb*pb + ry*py
r_av = 1.7311
These action values can be learned with a delta rule applied to the chosen flower, m_x → m_x + ε (r_x − m_x). [Figure: simulation with r_b = 1, r_y = 2 for t < 100 and r_b = 2, r_y = 1 for t > 100, shown for β = 1 and β = 50.]
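A minimal MATLAB/Octave sketch of this indirect actor: action values are learned with a delta rule toward the mean reward and choices are drawn from the softmax. The learning rate is an illustrative choice; the reward switch at trial 100 follows the figure above:

% Indirect actor for the two-flower task (action 1 = blue, 2 = yellow)
nTrials = 200; epsilon = 0.1; beta = 1;       % try beta = 50 for near-greedy choices
m = [0 0];                                    % action values m_b, m_y
for t = 1:nTrials
    if t < 100, rew = [1 2]; else, rew = [2 1]; end   % r_b, r_y switch at t = 100
    p = exp(beta * m) / sum(exp(beta * m));   % softmax choice probabilities
    a = 1 + (rand < p(2));                    % sample an action
    m(a) = m(a) + epsilon * (rew(a) - m(a));  % delta rule toward the mean reward
end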
Another option, the "direct actor", is to set the action values so as to maximize the expected reward ⟨r⟩. This can be done by stochastic gradient ascent on ⟨r⟩. For example, if action b is chosen and reward r_b received:
m_b → m_b + ε (1 − P[b]) (r_b − r0),
so that generally, for action value m_x given that action a was taken:
m_x → m_x + ε (δ_{xa} − P[x]) (r_a − r0).
A good choice for r0 is the mean of r_x over all possible choices. (See the D&A book, p. 344.)
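A minimal MATLAB/Octave sketch of the direct actor on the same two-flower task; the β factor from the softmax gradient is absorbed into ε here, and r0 is tracked as a running mean of the received rewards (both are illustrative choices):

% Direct actor: stochastic gradient ascent on <r> with a softmax policy
nTrials = 200; epsilon = 0.1; beta = 1;
m = [0 0]; r0 = 0;                          % action values and reference reward
for t = 1:nTrials
    if t < 100, rew = [1 2]; else, rew = [2 1]; end
    p = exp(beta * m) / sum(exp(beta * m)); % softmax choice probabilities
    a = 1 + (rand < p(2));                  % sampled action
    r = rew(a);
    m = m + epsilon * (([1 2] == a) - p) * (r - r0);  % m_x += eps*(delta_xa - P[x])*(r - r0)
    r0 = r0 + 0.05 * (r - r0);              % running estimate of the mean reward
end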
The maze task and sequential action choice. Policy evaluation: learn the value v(u) of each maze location under the initial random policy. What would the values be for an ideal policy?
Policy improvement: use the direct actor to improve the policy. At A there is no immediate reward, so the TD error is δ = v(B) − v(A) for a left turn and δ = v(C) − v(A) for a right turn; these δ values take the place of r − r0 in the direct-actor update of m_L(A) and m_R(A). Note – policy improvement and policy evaluation are best carried out sequentially: evaluate – improve – evaluate – improve …
Policy evaluation under the initial random policy gives v(B) = 2.5, v(C) = 1, v(A) = 1.75.
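A minimal MATLAB/Octave actor-critic sketch for this maze, interleaving policy evaluation (critic, TD) and policy improvement (direct actor driven by the TD error). The reward structure (5 or 0 from B, 0 or 2 from C) follows the standard Dayan and Abbott example and is consistent with the random-policy values above; ε, β, and the trial count are illustrative:

% States: 1 = A, 2 = B, 3 = C; actions: 1 = left, 2 = right; 0 marks an exit.
nTrials = 1000; epsilon = 0.2; beta = 1;
nextState = [2 3; 0 0; 0 0];      % from A: left -> B, right -> C; B and C lead out
reward    = [0 0; 5 0; 0 2];      % immediate reward for each (state, action)
v = zeros(1, 3);                  % critic: location values
m = zeros(3, 2);                  % actor: action values m(u, a)
for trial = 1:nTrials
    u = 1;                                        % every trial starts at A
    while u ~= 0
        p = exp(beta * m(u, :)) / sum(exp(beta * m(u, :)));
        a = 1 + (rand < p(2));                    % softmax action choice
        r = reward(u, a);
        u_next = nextState(u, a);
        if u_next == 0, v_next = 0; else, v_next = v(u_next); end
        delta = r + v_next - v(u);                % TD error (no discount)
        v(u) = v(u) + epsilon * delta;            % policy evaluation (critic)
        m(u, :) = m(u, :) + epsilon * (([1 2] == a) - p) * delta;  % policy improvement (actor)
        u = u_next;
    end
end
% With the initial random policy v(A) is about 1.75, as above; as the policy
% improves, v(A) moves toward 5 (left at A, then left at B).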