Learning and Memory Reinforcement Learning
Learning Levels • Darwinian • Trial -> death or children • Skinnerian • Reinforcement learning • Popperian • Our hypotheses die in our stead • Gregorian • Tools and artifacts
Machine Learning • Unsupervised • Cluster similar items • Association (no “right” answer) • Supervised • For observations/features, teacher gives the correct “answer” • E.g., Learn to recognize categories • Reinforcement • Take action, observe consequence • bad dog!
Pavlovian Conditioning • Pavlov • Food causes salivation • Sound before food • -> sound causes salivation • Learn to associate sound with food
Associative Memory • Hebbian Learning • When two connected neurons are both excited, the connection between them is strengthened • “Neurons that fire together, wire together”
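A minimal sketch of the simplest rate-based form of the rule, Δw = η · pre · post (the learning rate and the activity patterns below are illustrative assumptions, not from the slides):

```python
# Simplest rate-based Hebbian rule: the weight between two connected
# neurons grows in proportion to the product of their activities,
# so it is strengthened only when both are excited together.
eta = 0.1     # learning rate (illustrative value)
w = 0.0       # strength of the connection

for pre, post in [(1, 1), (1, 0), (0, 1), (1, 1)]:
    w += eta * pre * post    # only the (1, 1) trials change w

print(w)      # 0.2: two coincident firings, each adding eta
```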
Explanations of Pavlov • S-S (stimulus-stimulus) • Dogs learn to associate sound with food • (and salivate based on “thinking” of food) • S-R (stimulus-response) • Dogs learn to salivate based on the tone • (and salivate directly without “thinking” of food) • How to test? • Do dogs think lights are food?
Conditioning in humans • Two pathways • The “slow” pathway dogs use • Cognitive (conscious) learning • How to test this hypothesis? • Learn to blink based on a stimulus associated with a puff of air
Blocking • Tone -> Shock -> Fear • Tone -> Fear • Tone + Light -> Shock -> Fear • Light -> ?
Rescorla-Wagner Model • Hypothesis: learn from observations that are surprising • V_n <- V_n + c(V_max - V_n) • ΔV_n = c(V_max - V_n) • V_n is the strength of association between US and CS • c is the learning rate • Predictions • Contingency
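As a sketch, the update can be simulated in a few lines; run on the blocking procedure from the earlier slide, it predicts that the pre-trained tone leaves no surprise for the light to absorb (the learning rate c = 0.3 and the trial counts are arbitrary choices):

```python
# Rescorla-Wagner: every stimulus present on a trial shares one
# prediction error, dV = c * (V_max - total V of stimuli present).
c, V_max = 0.3, 1.0
V = {"tone": 0.0, "light": 0.0}

def trial(present):
    error = V_max - sum(V[s] for s in present)   # the "surprise"
    for s in present:
        V[s] += c * error

for _ in range(50):       # phase 1: tone alone predicts the shock
    trial(["tone"])
for _ in range(50):       # phase 2: tone + light compound
    trial(["tone", "light"])

print(V)   # tone ~1.0, light ~0.0: the light is blocked
```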
Limitations of Rescorla-Wagner • Tone -> food • Light -> food • Tone + light -> ?
Reinforcement Learning • Often one takes a long sequence of actions and only discovers the result of these actions later (e.g., when you win or lose a game) • Q: How can one ascribe credit (or blame) to one action in a sequence of actions? • A: By noting surprises
Consider a game • Estimate probability of winning • Take an action, see how the opponent (or the world) responds • Re-estimate probability of winning • If it is unchanged, you learned nothing • If it is higher, the initial state was better than you thought • If it is lower, the state was worse than you thought
Tic-tac-toe example • Decision tree • Alternate layers give possible moves for each player
Reinforcement Learning • State • E.g. board position • Action • E.g. move • Policy • State -> Action • Reward function • State -> utility • Model of the environment • State, action -> state
Definitions of key terms • State • What you need to know about the world to predict the effect of an action • Policy • What action to take in each state • Reward function • The cost or benefit of being in a state • (e.g. points won or lost, happiness gained or lost)
Value Iteration • Value Function • Expected value of a policy over time = sum of the expected rewards • V(s) <- V(s) + c[V(s’) - V(s)] • s = state before the move • s’ = state after the move • “temporal difference” learning
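A runnable sketch of that update on a classic 5-state random walk (the environment, the learning rate, and the use of the terminal reward as the final target are my assumptions; the slide's equation omits the reward term because only the end of an episode pays off):

```python
import random

# TD(0) on a random walk over states 0..6; 0 and 6 are terminal,
# and reaching 6 pays reward 1. Each step applies the slide's rule
# V(s) <- V(s) + c * (target - V(s)), where the target is V(s')
# mid-episode and the terminal reward at the end.
c = 0.1
V = [0.0] * 7

for _ in range(5000):
    s = 3                                   # start in the middle
    while s not in (0, 6):
        s2 = s + random.choice((-1, 1))
        if s2 == 6:
            target = 1.0
        elif s2 == 0:
            target = 0.0
        else:
            target = V[s2]
        V[s] += c * (target - V[s])         # learn from the surprise
        s = s2

print([round(v, 2) for v in V[1:6]])        # approaches 1/6 .. 5/6
```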
Mouse in Maze Example • (figures: the learned policy and the corresponding value function for a mouse navigating a maze)
Exploration - Exploitation • Exploration • Always try a different route to work • Exploitation • Always take the best route to work that you have found so far • Learning requires exploration • Unless the environment is noisy
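One standard way to trade the two off is an epsilon-greedy rule: exploit the best-known option most of the time, explore at random otherwise. A sketch (epsilon and the route values are assumed for illustration):

```python
import random

def choose(values, epsilon=0.1):
    """Epsilon-greedy: explore with probability epsilon, else exploit."""
    if random.random() < epsilon:
        return random.randrange(len(values))                  # explore
    return max(range(len(values)), key=lambda i: values[i])   # exploit

route_values = [12.0, 15.5, 9.0]   # estimated payoff of each route to work
print(choose(route_values))        # usually route 1, occasionally random
```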
RL can be very simple • A simple learning algorithm leads to an optimal policy • Without predicting the effects of the agent's actions • Without predicting immediate payoffs • Without planning • Without an explicit model of the world
How to play chess • Computer • Evaluation function for board positions • Fast search • Human (grandmaster) • Memorize tens of thousands of board positions and what to do in them • Do a much smaller search!
AI and Games • Chess: deterministic; position evaluation + search • Backgammon: stochastic; policy evaluation + search
Scaling up value functions • For a small number of states • Learn the value function of each state • Not possible for Backgammon • ~10^20 states • Learn a mapping from features to value • Then use reinforcement learning to get improved value estimates
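A sketch of the feature-based version: the value becomes a linear function of state features, and the temporal-difference error updates the shared weights instead of one table entry (the features, states, and step size here are made up for illustration):

```python
import numpy as np

# Linear value function V(s) = w . phi(s). The TD error that would
# have updated a table cell now nudges the weight vector, so states
# with similar features generalize -- essential with ~10^20 states.
c = 0.1
w = np.zeros(3)

def td_update(phi, phi_next, reward):
    global w
    error = reward + w @ phi_next - w @ phi   # temporal-difference error
    w += c * error * phi                      # move weights along features

phi_s  = np.array([1.0, 0.5, 0.0])   # features of state s (hypothetical)
phi_s2 = np.array([0.0, 1.0, 1.0])   # features of successor s'
td_update(phi_s, phi_s2, reward=1.0)
print(w)
```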
Q-learning • Instead of the value of a state, learn the value Q(s, a) of taking an action a from a state s • Optimal policy: take the best action, argmax_a Q(s, a) • Learning rule • Q(s, a) <- Q(s, a) + c[r_t + max_b Q(s', b) - Q(s, a)]
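A minimal tabular Q-learning sketch of that rule on a toy chain (the environment is invented, and a discount factor g is added, which the slide's rule omits, so that shorter paths to the reward score higher):

```python
import random
from collections import defaultdict

# Q-learning on a 4-state chain (0..3); reaching state 3 pays 1.
# Update: Q(s,a) <- Q(s,a) + c * [r + g * max_b Q(s',b) - Q(s,a)].
c, g = 0.5, 0.9               # learning rate and discount (assumed)
actions = (-1, +1)
Q = defaultdict(float)

for _ in range(2000):
    s = 0
    while s != 3:
        a = random.choice(actions)    # explore at random while learning
        s2 = min(max(s + a, 0), 3)
        r = 1.0 if s2 == 3 else 0.0
        best_next = 0.0 if s2 == 3 else max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += c * (r + g * best_next - Q[(s, a)])
        s = s2

# The learned greedy policy, argmax_a Q(s, a), moves right everywhere.
print({s: max(actions, key=lambda a: Q[(s, a)]) for s in range(3)})
```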
Learning to Sing • The zebra finch hears its father's song • Memorizes it • Then practices for months to learn to reproduce it • What kind of learning is this?
Controversies? • Is conditioning good? • How much learning do people do? • Innateness, learning, and free will