Learning and Memory Reinforcement Learning
Learning Levels • Darwinian • Trial -> death or children • Skinnerian • Reinforcement learning • Popperian • Our hypotheses die in our stead • Gregorian • Tools and artifacts
Machine Learning • Unsupervised • Cluster similar items • Association (no “right” answer) • Supervised • For observations/features, teacher gives the correct “answer” • E.g., Learn to recognize categories • Reinforcement • Take action, observe consequence • bad dog!
Pavlovian Conditioning • Pavlov • Food causes salivation • Sound before food • -> sound causes salivation • Learn to associate sound with food
Associative Memory • Hebbian Learning • When two connected neurons are both excited, the connection between them is strengthened • “Neurons that fire together, wire together”
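A minimal sketch of the simplest rate-based form of the rule, Δw = η · pre · post (the learning rate and the activity patterns below are illustrative assumptions, not from the slides):

```python
# Simplest rate-based Hebbian rule: the weight between two connected
# neurons grows in proportion to the product of their activities,
# so it is strengthened only when both are excited together.
eta = 0.1     # learning rate (illustrative value)
w = 0.0       # strength of the connection

for pre, post in [(1, 1), (1, 0), (0, 1), (1, 1)]:
    w += eta * pre * post    # only the (1, 1) trials change w

print(w)      # 0.2: two coincident firings, each adding eta
```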
Explanations of Pavlov • S-S (stimulus-stimulus) • Dogs learn to associate sound with food • (and salivate based on “thinking” of food) • S-R (stimulus-response) • Dogs learn to salivate based on the tone • (and salivate directly without “thinking” of food) • How to test? • Do dogs think lights are food?
Conditioning in humans • Two pathways • The “slow” pathway dogs use • Cognitive (conscious) learning • How to test this hypothesis? • Learn to blink based on a stimulus associated with a puff of air
Blocking • Tone -> Shock -> Fear • Tone -> Fear • Tone + Light -> Shock -> Fear • Light -> ?
Rescorla-Wagner Model • Hypothesis: learn from observations that are surprising • V_n <- V_n + c(V_max - V_n) • ΔV_n = c(V_max - V_n) • V_n is the strength of association between US and CS • c is the learning rate • Predictions • Contingency
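As a sketch, the update can be simulated in a few lines; run on the blocking procedure from the earlier slide, it predicts that the pre-trained tone leaves no surprise for the light to absorb (the learning rate c = 0.3 and the trial counts are arbitrary choices):

```python
# Rescorla-Wagner: every stimulus present on a trial shares one
# prediction error, dV = c * (V_max - total V of stimuli present).
c, V_max = 0.3, 1.0
V = {"tone": 0.0, "light": 0.0}

def trial(present):
    error = V_max - sum(V[s] for s in present)   # the "surprise"
    for s in present:
        V[s] += c * error

for _ in range(50):       # phase 1: tone alone predicts the shock
    trial(["tone"])
for _ in range(50):       # phase 2: tone + light compound
    trial(["tone", "light"])

print(V)   # tone ~1.0, light ~0.0: the light is blocked
```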
Limitations of Rescorla-Wagner • Tone -> food • Light -> food • Tone + light -> ?
Reinforcement Learning • Often one takes a long sequence of actions and only discovers the result of these actions later (e.g., when you win or lose a game) • Q: How can one ascribe credit (or blame) to one action in a sequence of actions? • A: By noting surprises
Consider a game • Estimate probability of winning • Take an action, see how the opponent (or the world) responds • Re-estimate probability of winning • If it is unchanged, you learned nothing • If it is higher, the initial state was better than you thought • If it is lower, the state was worse than you thought
Tic-tac-toe example • Decision tree • Alternate layers give possible moves for each player
Reinforcement Learning • State • E.g. board position • Action • E.g. move • Policy • State -> Action • Reward function • State -> utility • Model of the environment • State, action -> state
Definitions of key terms • State • What you need to know about the world to predict the effect of an action • Policy • What action to take in each state • Reward function • The cost or benefit of being in a state • (e.g. points won or lost, happiness gained or lost)
Value Iteration • Value Function • Expected value of a policy over time = sum of the expected rewards • V(s) <- V(s) + c[V(s’) - V(s)] • s = state before the move • s’ = state after the move • “temporal difference” learning
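A runnable sketch of that update on a classic 5-state random walk (the environment, the learning rate, and the use of the terminal reward as the final target are my assumptions; the slide's equation omits the reward term because only the end of an episode pays off):

```python
import random

# TD(0) on a random walk over states 0..6; 0 and 6 are terminal,
# and reaching 6 pays reward 1. Each step applies the slide's rule
# V(s) <- V(s) + c * (target - V(s)), where the target is V(s')
# mid-episode and the terminal reward at the end.
c = 0.1
V = [0.0] * 7

for _ in range(5000):
    s = 3                                   # start in the middle
    while s not in (0, 6):
        s2 = s + random.choice((-1, 1))
        if s2 == 6:
            target = 1.0
        elif s2 == 0:
            target = 0.0
        else:
            target = V[s2]
        V[s] += c * (target - V[s])         # learn from the surprise
        s = s2

print([round(v, 2) for v in V[1:6]])        # approaches 1/6 .. 5/6
```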
Mouse in Maze Example • (figures: the learned policy and the corresponding value function for a mouse navigating a maze)
Exploration - Exploitation • Exploration • Always try a different route to work • Exploitation • Always take the best route to work that you have found so far • Learning requires exploration • Unless the environment is noisy
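One standard way to trade the two off is an epsilon-greedy rule: exploit the best-known option most of the time, explore at random otherwise. A sketch (epsilon and the route values are assumed for illustration):

```python
import random

def choose(values, epsilon=0.1):
    """Epsilon-greedy: explore with probability epsilon, else exploit."""
    if random.random() < epsilon:
        return random.randrange(len(values))                  # explore
    return max(range(len(values)), key=lambda i: values[i])   # exploit

route_values = [12.0, 15.5, 9.0]   # estimated payoff of each route to work
print(choose(route_values))        # usually route 1, occasionally random
```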
RL can be very simple • A simple learning algorithm leads to an optimal policy • Without predicting the effects of the agent's actions • Without predicting immediate payoffs • Without planning • Without an explicit model of the world
How to play chess • Computer • Evaluation function for board positions • Fast search • Human (grandmaster) • Memorize tens of thousands of board positions and what to do in them • Do a much smaller search!
AI and Games • Chess: deterministic; position evaluation + search • Backgammon: stochastic; policy evaluation + search
Scaling up value functions • For a small number of states • Learn the value function of each state • Not possible for Backgammon • ~10^20 states • Learn a mapping from features to value • Then use reinforcement learning to get improved value estimates
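A sketch of the feature-based version: the value becomes a linear function of state features, and the temporal-difference error updates the shared weights instead of one table entry (the features, states, and step size here are made up for illustration):

```python
import numpy as np

# Linear value function V(s) = w . phi(s). The TD error that would
# have updated a table cell now nudges the weight vector, so states
# with similar features generalize -- essential with ~10^20 states.
c = 0.1
w = np.zeros(3)

def td_update(phi, phi_next, reward):
    global w
    error = reward + w @ phi_next - w @ phi   # temporal-difference error
    w += c * error * phi                      # move weights along features

phi_s  = np.array([1.0, 0.5, 0.0])   # features of state s (hypothetical)
phi_s2 = np.array([0.0, 1.0, 1.0])   # features of successor s'
td_update(phi_s, phi_s2, reward=1.0)
print(w)
```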
Q-learning • Instead of the value of a state, learn the value Q(s, a) of taking an action a from a state s • Optimal policy: take the best action, argmax_a Q(s, a) • Learning rule • Q(s, a) <- Q(s, a) + c[r_t + max_b Q(s', b) - Q(s, a)]
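A minimal tabular Q-learning sketch of that rule on a toy chain (the environment is invented, and a discount factor g is added, which the slide's rule omits, so that shorter paths to the reward score higher):

```python
import random
from collections import defaultdict

# Q-learning on a 4-state chain (0..3); reaching state 3 pays 1.
# Update: Q(s,a) <- Q(s,a) + c * [r + g * max_b Q(s',b) - Q(s,a)].
c, g = 0.5, 0.9               # learning rate and discount (assumed)
actions = (-1, +1)
Q = defaultdict(float)

for _ in range(2000):
    s = 0
    while s != 3:
        a = random.choice(actions)    # explore at random while learning
        s2 = min(max(s + a, 0), 3)
        r = 1.0 if s2 == 3 else 0.0
        best_next = 0.0 if s2 == 3 else max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += c * (r + g * best_next - Q[(s, a)])
        s = s2

# The learned greedy policy, argmax_a Q(s, a), moves right everywhere.
print({s: max(actions, key=lambda a: Q[(s, a)]) for s in range(3)})
```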
Learning to Sing • The zebra finch hears its father's song • Memorizes it • Then practices for months to learn to reproduce it • What kind of learning is this?
Controversies? • Is conditioning good? • How much learning do people do? • Innateness, learning, and free will