
Combining DP and MC in Model-Free Reinforcement Learning

This lecture explores the combination of dynamic programming and Monte Carlo methods in model-free reinforcement learning using temporal difference learning and Q-learning. It also discusses the problems with TD value learning and the concept of Q-functions.



Presentation Transcript


  1. CS 188: Artificial Intelligence, Spring 2007 • Lecture 23: Reinforcement Learning: IV • 4/19/2007 • Srini Narayanan – ICSI and UC Berkeley

  2. Announcements • Othello tournament rules up. • On-line readings for this week.

  3. Combining DP and MC (figure: backup diagram with nodes labeled T)

  4. Model-Free Learning (figure: backup diagram over s, a, (s,a), (s,a,s’), s’) • Big idea: why bother learning T? • Update each time we experience a transition • Frequent outcomes will contribute more updates (over time) • Temporal difference learning (TD) • Policy still fixed! • Move values toward value of whatever successor occurs
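
The TD idea on this slide (move a state's value toward the value of whatever successor occurs) can be sketched as follows. This is a minimal sketch, not the lecture's code: the function name `td_update`, the dictionary `V`, and the default `alpha` and `gamma` values are assumptions made for illustration.

```python
from collections import defaultdict


def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update for an observed transition (s, r, s_next).

    V is a mapping from state to estimated value under the fixed policy.
    alpha (learning rate) and gamma (discount) are illustrative defaults.
    """
    sample = r + gamma * V[s_next]      # one-step sample of the return
    V[s] += alpha * (sample - V[s])     # nudge the old estimate toward it
    return V[s]


# Hypothetical two-state example: all values start at 0.
V = defaultdict(float)
td_update(V, "A", r=1.0, s_next="B")    # V["A"] moves from 0 toward 1.0
```

No transition model T appears anywhere in the update: outcomes that occur more often simply contribute more updates over time.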

  5. TD Learning features • On-line, Incremental • Bootstrapping (like DP unlike MC) • Model free • Converges for any policy to the correct value of a state for that policy. • On average when alpha is small • With probability 1 when alpha is high in the beginning and low at the end (say 1/k)

  6. Driving Home • Changes recommended by Monte Carlo methods (α=1) • Changes recommended by TD methods (α=1)

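  7. Problems with TD Value Learning (figure: backup diagram over s, a, (s,a), (s,a,s’), s’) • TD value learning is model-free for policy evaluation • However, if we want to turn our value estimates into a policy, we’re sunk: choosing the best action from state values needs a one-step lookahead through T and the rewards, which we never learned • Idea: Learn state-action pairings (Q-values) directly • Makes action selection model-free too!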

  8. Q-Functions • A q-value is the value of a (state, action) pair under a policy • Utility of starting in state s, taking action a, then following π thereafter

  9. The Bellman Equations • Definition of utility leads to a simple relationship amongst optimal utility values: Optimal rewards = maximize over first action and then follow optimal policy • Formally:
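
The formula did not survive in this transcript. The standard form of the Bellman optimality equations, which matches the wording "maximize over first action and then follow optimal policy", is given below. The symbols T (transition model), R (reward function), and γ (discount) are assumed notation, not recovered from the slide.

```latex
V^*(s) = \max_a Q^*(s,a)

Q^*(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]
```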

  10. Q-Learning • Learn Q*(s,a) values • Receive a sample (s,a,s’,r) • Consider your old estimate: • Consider your new sample estimate: • Nudge the old estimate towards the new sample:
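
The three steps on this slide (old estimate, new sample estimate, nudge) can be sketched as follows. This is a minimal tabular sketch under assumptions: the name `q_learning_update`, the dictionary-based table `Q`, and the default `alpha` and `gamma` are illustrative, not taken from the lecture.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update for a sample (s, a, s_next, r).

    Q maps (state, action) pairs to estimates; unseen pairs default to 0.
    """
    old = Q[(s, a)]                                        # old estimate
    best_next = max(Q[(s_next, a2)] for a2 in actions)     # best follow-up
    sample = r + gamma * best_next                         # new sample estimate
    Q[(s, a)] = old + alpha * (sample - old)               # nudge toward sample
    return Q[(s, a)]


# Hypothetical usage with two actions.
Q = defaultdict(float)
q_learning_update(Q, "s0", "right", r=1.0, s_next="s1", actions=["left", "right"])
```

The update takes the maximum over next actions regardless of which action the agent actually takes next, so the values it learns do not depend on the exploration scheme in use.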

  11. Q-Learning

  12. Exploration / Exploitation • Several schemes for forcing exploration • Simplest: random actions (ε-greedy) • Every time step, flip a coin • With probability ε, act randomly • With probability 1-ε, act according to current policy (best q value for instance) • Problems with random actions? • You do explore the space, but keep thrashing around once learning is done • One solution: lower ε over time • Another solution: exploration functions
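
The coin-flip scheme above can be sketched as follows. This is a sketch under assumptions: `epsilon_greedy` and `decayed_epsilon` are hypothetical names, the Q-table is a plain dictionary keyed by (state, action), and the 1/k decay is only one possible way to lower the exploration rate over time.

```python
import random


def epsilon_greedy(Q, s, actions, epsilon, rng=random):
    """With probability epsilon act randomly, otherwise act greedily on Q."""
    if rng.random() < epsilon:
        return rng.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))   # exploit


def decayed_epsilon(k, eps0=1.0):
    """One possible schedule: exploration rate shrinks as eps0 / k, k >= 1."""
    return eps0 / k


# Hypothetical usage: explore less as the step counter k grows.
Q = {("s", "up"): 1.0, ("s", "down"): 2.0}
action = epsilon_greedy(Q, "s", ["up", "down"], decayed_epsilon(k=10))
```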

  13. Q-Learning • Q-learning produces tables of q-values:

  14. Demo of Q Learning • Demo arm-control • Parameters • α = learning rate • γ = discount factor (high values weight future rewards more) • ε = exploration rate (should decrease with time) • MDP • Reward = number of pixels moved to the right / iteration number • Actions: Arm up and down (yellow line), hand up and down (red line)

  15. Q-Learning Properties • Will converge to optimal policy • If you explore enough • If you make the learning rate small enough • Neat property: learns the optimal policy itself, not a policy that is only optimal in the presence of the action selection noise used while exploring

  16. Q-Learning • In realistic situations, we cannot possibly learn about every single state! • Too many states to visit them all in training • Too many states to even hold the q-tables in memory • Instead, we want to generalize: • Learn about some small number of training states from experience • Generalize that experience to new, similar states • This is a fundamental idea in machine learning • Clustering, classification (unsupervised, supervised), non-parametric learning

  17. Evaluation Functions • Function which scores non-terminals • Ideal function: returns the utility of the position • In practice: typically weighted linear sum of features: • e.g. f1(s) = (num white queens – num black queens), etc.
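
The formula for the weighted linear sum is missing from this transcript. Its usual form is shown below, where the weights w_i and the feature count n are assumed notation alongside the slide's own features f_i.

```latex
\mathrm{Eval}(s) = w_1 f_1(s) + w_2 f_2(s) + \dots + w_n f_n(s)
```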

  18. Function Approximation • Problem: inefficient to learn each state’s utility (or eval function) one by one • Solution: what we learn about one state (or position) should generalize to similar states • Very much like supervised learning • If states are treated entirely independently, we can only learn on very small state spaces

  19. Linear Value Functions • Another option: values are linear functions of features of states (or action-state pairs) • Good if you can describe states well using a few features (e.g. for game playing board evaluations) • Now we only have to learn a few weights rather than a value for each state (figure: grid of example state values ranging from 0.60 to 0.95)

  20. TD Updates for Linear Qs • Can use TD learning with linear Qs • (Actually it’s just like the perceptron!) • Old Q-learning update: • Simply update weights of features in Q(s,a)
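
The weight update for a linear Q-function can be sketched as follows. This is a sketch under assumptions, not the slide's own formula (which is missing from the transcript): the names `q_value` and `linear_q_update`, the list-based weights and features, and the caller-supplied `max_next_q` are all illustrative.

```python
def q_value(weights, features):
    """Linear Q: weighted sum of the features of a (state, action) pair."""
    return sum(w * f for w, f in zip(weights, features))


def linear_q_update(weights, features, r, max_next_q, alpha=0.1, gamma=0.9):
    """Return new weights after one TD update on a linear Q-function.

    features   : feature values f_i(s, a) for the pair just experienced
    max_next_q : max over a' of Q(s', a'), computed by the caller
    """
    error = (r + gamma * max_next_q) - q_value(weights, features)
    # Each weight moves in proportion to its feature's activation,
    # which is the same shape as a perceptron-style correction.
    return [w + alpha * error * f for w, f in zip(weights, features)]


# Hypothetical usage with two features.
w = linear_q_update([0.0, 0.0], [1.0, 2.0], r=1.0, max_next_q=0.0)
```

Because the weights are shared, one experienced transition changes the Q-value of every state-action pair with similar features.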

  21. Generalization of Q-functions • Non-linear Q functions are required for more complex spaces. Such functions can be learnt using • Multi-Layer Perceptrons (TD-gammon) • Support Vector Machines • Non-Parametric Methods

  22. Demo: Learning walking controllers • (From Stanford AI Lab)

  23. Policy Search

  24. Policy Search • Problem: often the feature-based policies that work well aren’t the ones that approximate V / Q best • E.g. the value functions from the gridworld were probably bad estimates of future rewards, but they could still produce good decisions • Solution: learn the policy that maximizes rewards rather than the value that predicts rewards • This is the idea behind policy search, such as what controlled the upside-down helicopter

  25. Policy Search • Simplest policy search: • Start with an initial linear value function or q-function • Nudge each feature weight up and down and see if your policy is better than before • Problems: • How do we tell the policy got better? • Need to run many sample episodes! • If there are a lot of features, this can be impractical
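
The "nudge each weight up and down" procedure can be sketched as a simple hill climb. This is a sketch under assumptions: `hill_climb` and `evaluate` are hypothetical names, and `evaluate` stands in for running many sample episodes and averaging the returns, which is exactly the expensive step the slide warns about.

```python
def hill_climb(weights, evaluate, step=0.1, sweeps=10):
    """Nudge each weight up and down, keeping any change that scores better.

    evaluate(weights) -> float is supplied by the caller. In practice it
    would average returns over many sample episodes and would be noisy.
    """
    best = list(weights)
    best_score = evaluate(best)
    for _ in range(sweeps):
        for i in range(len(best)):
            for delta in (step, -step):
                candidate = list(best)
                candidate[i] += delta
                score = evaluate(candidate)
                if score > best_score:          # keep only improvements
                    best, best_score = candidate, score
    return best, best_score


# Toy stand-in for policy quality, peaked at weights (1.0, -0.5).
def toy_score(w):
    return -((w[0] - 1.0) ** 2) - ((w[1] + 0.5) ** 2)


best_w, best_s = hill_climb([0.0, 0.0], toy_score, step=0.1, sweeps=20)
```

Each sweep costs two evaluations per feature, so with many features and noisy episode-based evaluation the total cost grows quickly.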

  26. Policy Search* • Advanced policy search: • Write a stochastic (soft) policy: • Turns out you can efficiently approximate the derivative of the returns with respect to the parameters w (details in the book, but you don’t have to know them) • Take uphill steps, recalculate derivatives, etc.
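
The stochastic policy formula is missing from this transcript. One common choice, shown here as an assumption rather than as the slide's own definition, is a softmax over linear feature scores with weights w_i and features f_i(s,a):

```latex
\pi_w(a \mid s) = \frac{\exp\left(\sum_i w_i f_i(s,a)\right)}{\sum_{a'} \exp\left(\sum_i w_i f_i(s,a')\right)}
```

Because this expression is smooth in w, the expected return has a derivative with respect to the weights, which is what makes the uphill steps possible.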

  27. Helicopter Control (Andrew Ng)

  28. Neural Correlates of RL • Parkinson’s Disease → Motor control + initiation? • Intracranial self-stimulation; Drug addiction; Natural rewards → Reward pathway? → Learning? • Also involved in: • Working memory • Novel situations • ADHD • Schizophrenia • …

  29. Conditioning (Ivan Pavlov) • Figure legend: conditional stimulus; unconditional stimulus; response = unconditional response (reflex) or conditional response (reflex)

  30. Dopamine Levels Track RL Signals (Montague et al. 1996) • Unpredicted reward (unlearned / no stimulus) • Predicted reward (learned task) • Omitted reward (probe trial)

  31. Current Hypothesis Phasic dopamine encodes a reward prediction error • Precise (normative!) theory for generation of DA firing patterns • Compelling account for the role of DA in classical conditioning: prediction error acts as signal driving learning in prediction areas • Evidence • Monkey single cell recordings • Human fMRI studies • Current Research • Better information processing model • Other reward/punishment circuits including Amygdala (for visual perception) • Overall circuit (PFC-Basal Ganglia interaction)

  32. Reinforcement Learning • What you should know • MDPs • Utilities, discounting • Policy Evaluation • Bellman’s equation • Value iteration • Policy iteration • Reinforcement Learning • Adaptive Dynamic Programming • TD learning (Model-free) • Q Learning • Function Approximation

  33. Hierarchical Learning

  34. Hierarchical RL • Stratagus: Example of a large RL task, from Bhaskara Marthi’s thesis (w/ Stuart Russell) • Stratagus is hard for reinforcement learning algorithms • > 10^100 states • > 10^30 actions at each point • Time horizon ≈ 10^4 steps • Stratagus is hard for human programmers • Typically takes several person-months for game companies to write computer opponent • Still, no match for experienced human players • Programming involves much trial and error • Hierarchical RL • Humans supply high-level prior knowledge using partial program • Learning algorithm fills in the details

  35. Partial “Alisp” Program

      (defun top ()
        (loop
          (choose (gather-wood)
                  (gather-gold))))

      (defun gather-wood ()
        (with-choice (dest *forest-list*)
          (nav dest)
          (action 'get-wood)
          (nav *base-loc*)
          (action 'dropoff)))

      (defun gather-gold ()
        (with-choice (dest *goldmine-list*)
          (nav dest)
          (action 'get-gold)
          (nav *base-loc*)
          (action 'dropoff)))

      (defun nav (dest)
        (until (= (pos (get-state)) dest)
          (with-choice (move '(N S E W NOOP))
            (action move))))

  36. Hierarchical RL • They then define a hierarchical Q-function which learns a linear feature-based mini-Q-function at each choice point • Very good at balancing resources and directing rewards to the right region • Still not very good at the strategic elements of these kinds of games (i.e. the Markov game aspect) [DEMO]
