This lecture explores the combination of dynamic programming and Monte Carlo methods in model-free reinforcement learning using temporal difference learning and Q-learning. It also discusses the problems with TD value learning and the concept of Q-functions.
CS 188: Artificial Intelligence, Spring 2007 • Lecture 23: Reinforcement Learning: IV • 4/19/2007 • Srini Narayanan – ICSI and UC Berkeley
Announcements • Othello tournament rules up. • On-line readings for this week.
Combining DP and MC • [Figure: backup tree with terminal nodes marked T]
Model-Free Learning • Big idea: why bother learning T? • Update each time we experience a transition • Frequent outcomes will contribute more updates (over time) • Temporal difference learning (TD) • Policy still fixed! • Move values toward value of whatever successor occurs
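The "move values toward the value of whatever successor occurs" step can be sketched as follows. This is a minimal illustration, not the lecture's own code: the function name `td_update`, the state labels, and the sample transitions are all hypothetical, and the step size `alpha` and discount `gamma` are assumed parameters.

```python
from collections import defaultdict


def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Nudge V[s] toward the one-step sample r + gamma * V[s_next]."""
    sample = r + gamma * V[s_next]
    V[s] += alpha * (sample - V[s])
    return V[s]


# Value table for a fixed policy; unseen states default to 0.
V = defaultdict(float)

# Hypothetical experience (state, reward, next state) under that policy.
transitions = [("A", 0.0, "B"), ("B", 1.0, "end"), ("A", 0.0, "B")]
for s, r, s_next in transitions:
    td_update(V, s, r, s_next, alpha=0.5, gamma=1.0)

print(V["A"], V["B"])  # 0.25 0.5
```

No transition model T appears anywhere: each update uses only the transition that was actually experienced.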
TD Learning features • On-line, Incremental • Bootstrapping (like DP unlike MC) • Model free • Converges for any policy to the correct value of a state for that policy. • On average when alpha is small • With probability 1 when alpha is high in the beginning and low at the end (say 1/k)
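One way to see why a step size of 1/k is a natural schedule: with alpha = 1/k the incremental update computes the exact running mean of the samples seen so far. The sketch below assumes a plain stream of numeric samples; `running_average` is a hypothetical name used only for illustration.

```python
def running_average(samples):
    """Incremental update with alpha = 1/k, which yields the sample mean."""
    estimate = 0.0
    for k, x in enumerate(samples, start=1):
        alpha = 1.0 / k  # high at the beginning, low at the end
        estimate += alpha * (x - estimate)
    return estimate


print(running_average([2.0, 4.0, 6.0]))  # 4.0
```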
Driving Home • Changes recommended by Monte Carlo methods (α=1) • Changes recommended by TD methods (α=1)
Problems with TD Value Learning • TD value learning is model-free for policy evaluation • However, if we want to turn our value estimates into a policy, we’re sunk: picking the best action from state values needs a one-step look-ahead through the model • Idea: Learn state-action pairings (Q-values) directly • Makes action selection model-free too!
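To make the contrast concrete, here is action selection from state values next to action selection from Q-values. The notation (transition model T, reward function R, discount γ) is the usual MDP notation and is assumed here rather than taken from this slide.

```latex
\pi(s) = \arg\max_a \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma V(s')\bigr]
\qquad\text{versus}\qquad
\pi(s) = \arg\max_a Q(s,a)
```

The left-hand rule needs T and R; the right-hand rule needs only the learned table.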
Q-Functions • A q-value is the value of a (state, action) pair under a policy • Utility of starting in state s, taking action a, then following the policy π thereafter
The Bellman Equations • Definition of utility leads to a simple relationship amongst optimal utility values: Optimal rewards = maximize over first action and then follow optimal policy • Formally: $V^*(s) = \max_a Q^*(s,a)$ and $Q^*(s,a) = \sum_{s'} T(s,a,s')\,[R(s,a,s') + \gamma V^*(s')]$
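Q-Learning • Learn Q*(s,a) values • Receive a sample (s,a,s’,r) • Consider your old estimate: $Q(s,a)$ • Consider your new sample estimate: $r + \gamma \max_{a'} Q(s',a')$ • Nudge the old estimate towards the new sample: $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$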
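The three steps on this slide (old estimate, new sample, nudge) can be sketched as a tabular update. This is an illustration under assumptions: the table is a dictionary keyed by (state, action), terminal states keep the default value 0, and the names `q_update`, `actions`, and the sample transitions are hypothetical.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update for the sample (s, a, s_next, r)."""
    old = Q[(s, a)]                                         # old estimate
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)  # new sample
    Q[(s, a)] = old + alpha * (sample - old)                # nudge toward it
    return Q[(s, a)]


Q = defaultdict(float)
actions = ["left", "right"]

first = q_update(Q, "s1", "right", 1.0, "s2", actions)   # 0.5
second = q_update(Q, "s0", "right", 0.0, "s1", actions)  # 0.225
```

The max over next actions is what makes the target the value of acting greedily afterwards, whatever action the agent actually takes next.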
Exploration / Exploitation • Several schemes for forcing exploration • Simplest: random actions (ε-greedy) • Every time step, flip a coin • With probability ε, act randomly • With probability 1-ε, act according to current policy (best q value for instance) • Problems with random actions? • You do explore the space, but keep thrashing around once learning is done • One solution: lower ε over time • Another solution: exploration functions
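The coin-flip scheme can be sketched in a few lines. The function name `epsilon_greedy`, the dictionary layout of the q-table, and the optional `rng` argument are assumptions made for this illustration.

```python
import random


def epsilon_greedy(Q, s, actions, epsilon, rng=random):
    """With probability epsilon act randomly, otherwise take the best-known action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    # Exploit: pick the action with the highest current q-value (0 if unseen).
    return max(actions, key=lambda a: Q.get((s, a), 0.0))


Q = {("s", "a"): 1.0, ("s", "b"): 2.0}
greedy_choice = epsilon_greedy(Q, "s", ["a", "b"], epsilon=0.0)  # always "b"
random_choice = epsilon_greedy(Q, "s", ["a", "b"], epsilon=1.0,
                               rng=random.Random(0))  # "a" or "b"
```

Lowering the exploration probability over time amounts to passing a smaller `epsilon` as training proceeds.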
Q-Learning • Q-learning produces tables of q-values:
Demo of Q Learning • Demo arm-control • Parameters • α = learning rate • γ = discount factor (high values weight future rewards more) • ε = exploration rate (should decrease with time) • MDP • Reward = number of pixels moved to the right / iteration number • Actions: arm up and down (yellow line), hand up and down (red line)
Q-Learning Properties • Will converge to optimal policy • If you explore enough • If you make the learning rate small enough • Neat property: does not learn policies which are optimal in the presence of action selection noise
Q-Learning • In realistic situations, we cannot possibly learn about every single state! • Too many states to visit them all in training • Too many states to even hold the q-tables in memory • Instead, we want to generalize: • Learn about some small number of training states from experience • Generalize that experience to new, similar states • This is a fundamental idea in machine learning • Clustering, classification (unsupervised, supervised), non-parametric learning
Evaluation Functions • Function which scores non-terminals • Ideal function: returns the utility of the position • In practice: typically weighted linear sum of features: $\mathrm{Eval}(s) = w_1 f_1(s) + w_2 f_2(s) + \dots + w_n f_n(s)$ • e.g. f1(s) = (num white queens – num black queens), etc.
Function Approximation • Problem: inefficient to learn each state’s utility (or eval function) one by one • Solution: what we learn about one state (or position) should generalize to similar states • Very much like supervised learning • If states are treated entirely independently, we can only learn on very small state spaces
Linear Value Functions • Another option: values are linear functions of features of states (or action-state pairs) • Good if you can describe states well using a few features (e.g. for game playing board evaluations) • Now we only have to learn a few weights rather than a value for each state • [Figure: grid of state values, ranging from 0.60 to 0.95]
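TD Updates for Linear Qs • Can use TD learning with linear Qs • (Actually it’s just like the perceptron!) • Old Q-learning update: $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$ • Simply update weights of features in Q(s,a): $w_i \leftarrow w_i + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]\,f_i(s,a)$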
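A minimal sketch of the weight update for a linear Q-function, assuming Q(s,a) is the dot product of a weight vector with a feature vector for the pair (s,a). The names `linear_q` and `linear_q_update`, and the convention of passing one feature vector per next action, are hypothetical choices for this example.

```python
def linear_q(weights, feats):
    """Q(s, a) as a weighted linear sum of features f_i(s, a)."""
    return sum(w * f for w, f in zip(weights, feats))


def linear_q_update(weights, feats, r, next_feats_by_action,
                    alpha=0.1, gamma=0.9):
    """Return new weights after one TD update on a linear Q-function.

    feats: features of the (s, a) pair that was taken.
    next_feats_by_action: one feature vector per action available in s'
    (empty if s' is terminal, in which case the next value is 0).
    """
    q_sa = linear_q(weights, feats)
    best_next = max((linear_q(weights, f) for f in next_feats_by_action),
                    default=0.0)
    error = r + gamma * best_next - q_sa
    # Each weight moves in proportion to its feature's activation.
    return [w + alpha * error * f for w, f in zip(weights, feats)]


new_weights = linear_q_update([0.0, 0.0], [1.0, 2.0], r=1.0,
                              next_feats_by_action=[])
print(new_weights)  # [0.1, 0.2]
```

Features that were inactive (value 0) for the experienced pair leave their weights unchanged, which is the perceptron-like behaviour the slide mentions.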
Generalization of Q-functions • Non-linear Q functions are required for more complex spaces. Such functions can be learnt using • Multi-Layer Perceptrons (TD-gammon) • Support Vector Machines • Non-Parametric Methods
Demo: Learning walking controllers • (From Stanford AI Lab)
Policy Search • Problem: often the feature-based policies that work well aren’t the ones that approximate V / Q best • E.g. the value functions from the gridworld were probably bad estimates of future rewards, but they could still produce good decisions • Solution: learn the policy that maximizes rewards rather than the value that predicts rewards • This is the idea behind policy search, such as what controlled the upside-down helicopter
Policy Search • Simplest policy search: • Start with an initial linear value function or q-function • Nudge each feature weight up and down and see if your policy is better than before • Problems: • How do we tell the policy got better? • Need to run many sample episodes! • If there are a lot of features, this can be impractical
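The "nudge each weight up and down" idea can be sketched as a coordinate-wise hill climb. Everything here is an assumption for illustration: `hill_climb` is a hypothetical name, and `evaluate` stands in for whatever scores a policy (in practice an average return over many sample episodes, which is the expensive part the slide warns about). The toy `evaluate` below is a deterministic stand-in, not a real rollout.

```python
def hill_climb(weights, evaluate, step=0.1, sweeps=10):
    """Nudge each weight up and down, keeping any change that scores better."""
    best = evaluate(weights)
    for _ in range(sweeps):
        for i in range(len(weights)):
            for delta in (step, -step):
                candidate = list(weights)
                candidate[i] += delta
                score = evaluate(candidate)
                if score > best:
                    weights, best = candidate, score
    return weights, best


def toy_evaluate(w):
    """Stand-in for average episode return; peaks at w = [1.0, -0.5]."""
    return -((w[0] - 1.0) ** 2) - ((w[1] + 0.5) ** 2)


found, score = hill_climb([0.0, 0.0], toy_evaluate, step=0.5, sweeps=5)
print(found)  # [1.0, -0.5]
```

With a noisy evaluator, each call to `evaluate` would need many episodes to tell a real improvement from luck, and the number of calls grows with the number of features.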
Policy Search* • Advanced policy search: • Write a stochastic (soft) policy, for example a softmax over linear scores: $\pi_w(a \mid s) \propto \exp\bigl(\sum_i w_i f_i(s,a)\bigr)$ • Turns out you can efficiently approximate the derivative of the returns with respect to the parameters w (details in the book, but you don’t have to know them) • Take uphill steps, recalculate derivatives, etc.
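A soft policy can be written in several ways; one common choice is a softmax over linear feature scores, sketched here. The form of the policy, the name `softmax_policy`, and the per-action feature dictionary are assumptions for this example, not necessarily the form used in the lecture.

```python
import math


def softmax_policy(weights, feats_by_action):
    """Return action probabilities proportional to exp(w . f(s, a))."""
    scores = {a: sum(w * f for w, f in zip(weights, feats))
              for a, feats in feats_by_action.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {a: math.exp(s - top) for a, s in scores.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


even = softmax_policy([1.0], {"up": [0.0], "down": [0.0]})
tilted = softmax_policy([1.0], {"up": [2.0], "down": [0.0]})
print(even)  # {'up': 0.5, 'down': 0.5}
```

Because every action keeps nonzero probability and the probabilities vary smoothly with w, the expected return is differentiable in w, which is what makes uphill steps possible.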
Neural Correlates of RL • Parkinson’s Disease: Motor control + initiation? • Intracranial self-stimulation; Drug addiction; Natural rewards: Reward pathway? Learning? • Also involved in: • Working memory • Novel situations • ADHD • Schizophrenia • …
Conditioning (Ivan Pavlov) • [Figure legend: conditional stimulus; unconditional stimulus; response = unconditional response (reflex) or conditional response (reflex)]
Dopamine Levels track RL signals (Montague et al. 1996) • Unpredicted reward (unlearned/no stimulus) • Predicted reward (learned task) • Omitted reward (probe trial)
Current Hypothesis Phasic dopamine encodes a reward prediction error • Precise (normative!) theory for generation of DA firing patterns • Compelling account for the role of DA in classical conditioning: prediction error acts as signal driving learning in prediction areas • Evidence • Monkey single cell recordings • Human fMRI studies • Current Research • Better information processing model • Other reward/punishment circuits including Amygdala (for visual perception) • Overall circuit (PFC-Basal Ganglia interaction)
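The reward prediction error in this hypothesis has the same form as the TD error. As a purely illustrative sketch (the numeric values and the mapping onto the three recording conditions are assumptions, not data from the cited work), the sign of the error in each condition works out as follows.

```python
def td_error(r, v_next, v_current, gamma=1.0):
    """Prediction error: what happened (r + gamma * v_next) minus what was expected."""
    return r + gamma * v_next - v_current


# Unpredicted reward: nothing expected, reward arrives -> positive error.
unpredicted = td_error(r=1.0, v_next=0.0, v_current=0.0)  # 1.0

# Predicted reward: reward fully expected -> no error at reward time.
predicted = td_error(r=1.0, v_next=0.0, v_current=1.0)    # 0.0

# Omitted reward: reward expected but withheld -> negative error.
omitted = td_error(r=0.0, v_next=0.0, v_current=1.0)      # -1.0
```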
Reinforcement Learning • What you should know • MDPs • Utilities, discounting • Policy Evaluation • Bellman’s equation • Value iteration • Policy iteration • Reinforcement Learning • Adaptive Dynamic Programming • TD learning (Model-free) • Q Learning • Function Approximation
Hierarchical RL • Stratagus: Example of a large RL task, from Bhaskara Marthi’s thesis (w/ Stuart Russell) • Stratagus is hard for reinforcement learning algorithms • > $10^{100}$ states • > $10^{30}$ actions at each point • Time horizon ≈ $10^{4}$ steps • Stratagus is hard for human programmers • Typically takes several person-months for game companies to write computer opponent • Still, no match for experienced human players • Programming involves much trial and error • Hierarchical RL • Humans supply high-level prior knowledge using partial program • Learning algorithm fills in the details
Partial “Alisp” Program
(defun top ()
  (loop
    (choose (gather-wood)
            (gather-gold))))
(defun gather-wood ()
  (with-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))
(defun gather-gold ()
  (with-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))
(defun nav (dest)
  (until (= (pos (get-state)) dest)
    (with-choice (move '(N S E W NOOP))
      (action move))))
Hierarchical RL • They then define a hierarchical Q-function which learns a linear feature-based mini-Q-function at each choice point • Very good at balancing resources and directing rewards to the right region • Still not very good at the strategic elements of these kinds of games (i.e. the Markov game aspect) [DEMO]