CIS 540 Robotics Laboratory Robotic Soccer: A Machine Learning Testbed Monday, September 18, 2000 William H. Hsu Department of Computing and Information Sciences, KSU http://www.cis.ksu.edu/~bhsu Readings: Sections 13.1-13.4, Mitchell; Sections 20.1-20.2, Russell and Norvig
Lecture Outline • References: Chapter 13, Mitchell; Sections 20.1-20.2, Russell and Norvig • Today: Sections 13.1-13.4, Mitchell • Review: “Learning to Predict by the Method of Temporal Differences”, Sutton • Introduction to Reinforcement Learning • Control Learning • Control policies that choose optimal actions • MDP framework, continued • Issues • Delayed reward • Active learning opportunities • Partial observability • Reuse requirement • Q Learning • Dynamic programming algorithm • Deterministic and nondeterministic cases; convergence properties
Control Learning • Learning to Choose Actions • Performance element • Applying policy in uncertain environment (last time) • Control, optimization objectives: belong to intelligent agent • Applications: automation (including mobile robotics), information retrieval • Examples • Robot learning to dock on battery charger • Learning to choose actions to optimize factory output • Learning to play Backgammon • Problem Characteristics • Delayed reward: reward signal may be episodic (e.g., win-loss at end of game) • Opportunity for active exploration: situated learning • Possible partial observability of states • Possible need to learn multiple tasks with same sensors, effectors (e.g., actuators)
Example: TD-Gammon • Learns to Play Backgammon [Tesauro, 1995] • Predecessor: Neurogammon [Tesauro and Sejnowski, 1989] • Learned from examples of labelled moves (very tedious for human expert) • Result: strong computer player, but not grandmaster-level • TD-Gammon: first version, 1992 - used reinforcement learning • Immediate Reward • +100 if win • -100 if loss • 0 for all other states • Learning in TD-Gammon • Algorithm: temporal differences [Sutton, 1988] - next time • Training: playing 200,000 to 1.5 million games against itself (several weeks) • Learning curve: improves until ~1.5 million games • Result: now approximately equal to best human player (won World Cup of Backgammon in 1992; among top 3 since 1995)
Reinforcement Learning: Problem Definition • Interactive Model • State (may be partially observable), incremental reward presented to agent • Agent selects actions based upon (current) policy • Taking action puts agent into new state in environment • New reward: reinforcement (feedback) • Agent uses decision cycle to estimate new state, compute outcome distributions, select new actions • Reinforcement Learning Problem • Given • Observation sequence • Discount factor γ ∈ [0, 1) • Learn to: choose actions that maximize r(t) + γr(t + 1) + γ²r(t + 2) + … • [Figure: agent-environment loop - Agent (applying Policy) sends Action to Environment, which returns State and Reward]
Quick Review: Markov Decision Processes • Markov Decision Processes (MDPs) • Components • Finite set of states S • Set of actions A • At each time, agent • observes state s(t) ∈ S and chooses action a(t) ∈ A; • then receives reward r(t), • and state changes to s(t + 1) • Markov property, aka Markov assumption: s(t + 1) = δ(s(t), a(t)) and r(t) = r(s(t), a(t)) • i.e., r(t) and s(t + 1) depend only on current state and action • Previous history s(0), s(1), …, s(t - 1): irrelevant • i.e., s(t + 1) conditionally independent of s(0), s(1), …, s(t - 1) given s(t) • δ, r: may be nondeterministic; not necessarily known to agent • Variants: totally observable (accessible), partially observable (inaccessible) • Criterion for a(t): Total Reward – Maximum Expected Utility (MEU) • (A toy deterministic MDP is sketched in code below.)
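For concreteness, the sketch below encodes a tiny deterministic MDP as plain Python dictionaries. The particular states, actions, and rewards are illustrative assumptions (a small goal-reaching world in the spirit of the grid example later in the lecture), not values taken from the slides.

```python
# Minimal deterministic MDP: states S, actions A, transition delta, reward r.
S = ["s1", "s2", "G"]            # finite set of states (G = absorbing goal)
A = ["left", "right"]            # finite set of actions

# delta: (state, action) -> next state  (Markov property: depends only on s, a)
delta = {
    ("s1", "right"): "s2", ("s1", "left"): "s1",
    ("s2", "right"): "G",  ("s2", "left"): "s1",
    ("G",  "right"): "G",  ("G",  "left"): "G",
}

# r: (state, action) -> immediate reward (100 for entering the goal, else 0)
r = {(s, a): (100 if delta[(s, a)] == "G" and s != "G" else 0)
     for s in S for a in A}
```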
Agent’s Learning Task • Performance Element • Execute actions in environment, observe results • Learn action policy π : state → action that maximizes expected discounted reward E[r(t) + γr(t + 1) + γ²r(t + 2) + …] from any starting state in S • γ ∈ [0, 1) • Discount factor on future rewards • Expresses preference for rewards sooner rather than later • Note: Something New! • Target function is π : state → action • However… • We have no training examples of form <state, action> • Training examples are of form <<state, action>, reward>
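Written out with γ as the discount factor (as reconstructed from the slide text), the target is a policy

```latex
\pi : S \rightarrow A
\quad\text{maximizing}\quad
E\!\left[\, r(t) + \gamma\, r(t+1) + \gamma^{2} r(t+2) + \cdots \,\right],
\qquad \gamma \in [0, 1)
```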
First Learning Scenario: Deterministic Worlds • Agent considers adopting policy π from policy space Π • For each possible policy π, can define an evaluation function over states: Vπ(s) ≡ r(t) + γr(t + 1) + γ²r(t + 2) + …, where r(t), r(t + 1), r(t + 2), … are generated by following policy π starting at state s • Restated, task is to learn optimal policy π* • Finding Optimal Policy • [Figure: 6-state grid world with goal state G, showing r(state, action) immediate reward values, Q(state, action) values, V*(state) values, and one optimal policy]
What to Learn • Idea • Might have agent try to learn evaluation function Vπ* (abbreviated V*) • Could then perform lookahead search to choose best action from any state s, because: π*(s) = argmaxa [r(s, a) + γV*(δ(s, a))] • Problem with Idea • Works well if agent knows • δ : state × action → state • r : state × action → R • When agent doesn’t know δ and r, cannot choose actions this way
Q Function • Solution Approach • Define new function very similar to V*: Q(s, a) ≡ r(s, a) + γV*(δ(s, a)) • If agent learns Q, it can choose optimal action even without knowing δ! • Using Learned Q • Q: evaluation function to be learned by agent • Apply Q to select action (one-line sketch below) • Idealized, computed policy (this is your brain without Q-learning): π*(s) = argmaxa [r(s, a) + γV*(δ(s, a))] • Approximated policy (this is your brain with Q-learning): π*(s) = argmaxa Q(s, a)
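A minimal sketch of the approximated policy: greedy selection over a learned table of Q̂ values. The dictionary-keyed table is an illustrative representation, not one fixed by the slides.

```python
def greedy_action(Q, s, actions):
    """pi(s) = argmax_a Q(s, a): pick the action with the highest estimated value."""
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```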
Training Rule to Learn Q • Developing Recurrence Equation for Q • Note: Q and V* closely related: V*(s) = maxa' Q(s, a') • Allows us to write Q recursively as Q(s(t), a(t)) = r(s(t), a(t)) + γ maxa' Q(s(t + 1), a') • Nice! Let Q̂ denote learner’s current approximation to Q • Training Rule: Q̂(s, a) ← r + γ maxa' Q̂(s', a') • s’: state resulting from applying action a in state s • (Deterministic) transition function δ made implicit • Dynamic programming: iterate over table of possible a’ values
Q Learning for Deterministic Worlds • (Nonterminating) Procedure for Situated Agent • Procedure Q-Learning-Deterministic (Reinforcement-Stream) • Reinforcement-Stream: consists of <<state, action>, reward> tuples • FOR each <s, a> DO • Initialize table entry Q̂(s, a) ← 0 • Observe current state s • WHILE (true) DO • Select action a and execute it • Receive immediate reward r • Observe new state s’ • Update table entry Q̂(s, a) ← r + γ maxa' Q̂(s', a') • Move: record transition from s to s’ (i.e., s ← s’) • (A runnable sketch of this procedure appears below.)
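A runnable sketch of the procedure above for a tabular deterministic world, with δ and r passed in as dictionaries (as in the toy MDP sketched earlier). Random action selection and the 100-step episode cap are added assumptions, since the slide's loop is nonterminating and leaves the selection strategy open.

```python
import random

def q_learning_deterministic(S, A, delta, r, gamma=0.9, episodes=1000):
    """Tabular Q-learning for a deterministic world.
    delta: dict (s, a) -> s';  r: dict (s, a) -> immediate reward."""
    Q = {(s, a): 0.0 for s in S for a in A}           # initialize all table entries to 0
    for _ in range(episodes):
        s = random.choice(S)                          # observe current state s
        for _ in range(100):                          # bounded stand-in for WHILE (true)
            a = random.choice(A)                      # select action a and execute it
            reward, s_next = r[(s, a)], delta[(s, a)] # receive reward r, observe new state s'
            Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in A)  # update rule
            s = s_next                                # move: s <- s'
    return Q
```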
Updating the Q Estimate • Example: Propagating Credit (Q) for Candidate Action • Robot glyph in figure denotes the mobile agent • Initial state: s1 (upper left) • Let discount factor γ be 0.9 • Q̂ estimate for action aright taken from s1: worked update below • Property • If rewards are nonnegative, then Q̂ increases monotonically between 0 and the true Q • [Figure: grid world before and after the update - initial state s1, next state s2, goal G; Q̂ values leaving s2 are 63, 81, 100; Q̂(s1, aright) revised from 72 to 90]
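Worked out, the update highlighted in the figure (with r = 0 for the move and γ = 0.9) is:

```latex
\hat{Q}(s_1, a_{\text{right}})
  \;\leftarrow\; r + \gamma \max_{a'} \hat{Q}(s_2, a')
  \;=\; 0 + 0.9 \cdot \max\{63,\, 81,\, 100\}
  \;=\; 90
```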
Convergence • Claim • Q̂ converges to Q • Scenario: deterministic world where each <s, a> is observed (visited) infinitely often • Proof • Define full interval: interval during which each <s, a> is visited • During each full interval, largest error in Q̂ table is reduced by factor of γ • Let Q̂n be the table after n updates and Δn be the maximum error in Q̂n; that is, Δn = maxs,a |Q̂n(s, a) - Q(s, a)| • For any table entry Q̂n(s, a) updated on iteration n + 1, the error in the revised estimate Q̂n+1(s, a) is bounded by γΔn (derivation below) • Note: used general fact, |maxa f1(a) - maxa f2(a)| ≤ maxa |f1(a) - f2(a)|
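Filling in the error bound sketched above (the argument of Mitchell, Section 13.3):

```latex
\begin{aligned}
|\hat{Q}_{n+1}(s,a) - Q(s,a)|
  &= \bigl|\,(r + \gamma \max_{a'} \hat{Q}_n(s',a')) - (r + \gamma \max_{a'} Q(s',a'))\,\bigr| \\
  &= \gamma \,\bigl|\max_{a'} \hat{Q}_n(s',a') - \max_{a'} Q(s',a')\bigr| \\
  &\le \gamma \,\max_{a'} \bigl|\hat{Q}_n(s',a') - Q(s',a')\bigr|
   \;\le\; \gamma\, \Delta_n
\end{aligned}
```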
Nondeterministic Case • Second Learning Scenario: Nondeterministic World (Nondeterministic MDP) • What if reward and next state are nondeterministically selected? • i.e., reward function r(s, a) and transition function δ(s, a) are nondeterministic • Nondeterminism may express many kinds of uncertainty • Inherent uncertainty in dynamics of world • Effector exceptions (qualifications), side effects (ramifications) • Solution Approach • Redefine V, Q in terms of expected values: Vπ(s) ≡ E[Σi γⁱ r(t + i)], Q(s, a) ≡ E[r(s, a) + γV*(δ(s, a))] • Introduce decay factor; retain some of previous Q̂ value • Compare: momentum term in ANN learning
Nondeterministic Case: Generalizing Q Learning • Q Learning Generalizes to Nondeterministic Worlds • Alter training rule to Q̂n(s, a) ← (1 - αn) Q̂n-1(s, a) + αn [r + γ maxa' Q̂n-1(s', a')], where αn = 1 / (1 + visitsn(s, a)) • Decaying weighted average • visitsn(s, a): total number of times <s, a> has been visited by iteration n, inclusive • r: observed reward (may also be stochastically determined) • Can Still Prove Convergence of Q̂ to Q [Watkins and Dayan, 1992] • Intuitive Idea • αn ∈ [0, 1]: discounts estimate by number of visits, prevents oscillation • Tradeoff • More gradual revisions to Q̂ • Able to deal with stochastic environment: P(s’ | s, a), P(r | s, s’, a) • (A sketch of this update follows.)
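A sketch of the nondeterministic training rule, assuming sampled transitions (s, a, r, s') arrive from a stochastic environment; the visit-count dictionary and helper name are illustrative.

```python
def q_update_stochastic(Q, visits, s, a, reward, s_next, actions, gamma=0.9):
    """One revision of Q-hat for a nondeterministic world (decaying weighted average)."""
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / (1 + visits[(s, a)])                    # alpha_n shrinks with visit count
    target = reward + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
```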
Terminology • Reinforcement Learning (RL) • RL: learning to choose optimal actions from <state, reward, action> observations • Scenarios • Delayed reward: reinforcement is deferred until end of episode • Active learning: agent can control collection of experience • Partial observability: may only be able to observe rewards (must infer state) • Reuse requirement: sensors, effectors may be required for multiple tasks • Markov Decision Processes (MDPs) • Markovity (aka Markov property, Markov assumption): CI assumption on states over time • Maximum expected utility (MEU): maximum expected total reward (under additive decomposition assumption) • Q Learning • Action-value function Q : state × action → value • Q learning: training rule and dynamic programming algorithm for RL
CIS 540 Robotics Laboratory More Reinforcement Learning: Temporal Differences Monday, September 18, 2000 William H. Hsu Department of Computing and Information Sciences, KSU http://www.cis.ksu.edu/~bhsu Readings: Sections 13.5-13.8, Mitchell; Sections 20.2-20.7, Russell and Norvig
Lecture Outline • Readings: 13.1-13.4, Mitchell; 20.2-20.7, Russell and Norvig • This Week’s Paper Review: “Connectionist Learning Procedures”, Hinton • Suggested Exercises: 13.4, Mitchell; 20.11, Russell and Norvig • Reinforcement Learning (RL) Concluded • Control policies that choose optimal actions • MDP framework, continued • Continuing research topics • Active learning: experimentation (exploration) strategies • Generalization in RL • Next: ANNs and GAs for RL • Temporal Difference (TD) Learning • Family of dynamic programming algorithms for RL • Generalization of Q learning • More than one step of lookahead • More on TD learning in action
Quick Review: Policy Learning Framework • Interactive Model • State s (may be partially observable) • Agent selects action a based upon (current) policy π • Incremental reward (aka reinforcement) r(s, a) presented to agent • Taking action puts agent into new state s’ = δ(s, a) in environment • Agent uses decision cycle to estimate s’, compute outcome distributions, select new actions • Reinforcement Learning Problem • Given • Observation sequence • Discount factor γ ∈ [0, 1) • Learn to: choose actions that maximize r(t) + γr(t + 1) + γ²r(t + 2) + … • [Figure: agent-environment loop - Agent (applying Policy) sends Action to Environment, which returns State and Reward]
Quick Review: Q Learning • Deterministic World Scenario • “Knowledge-free” (here, model-free) search for policy π from policy space Π • For each possible policy π, can define an evaluation function over states: Vπ(s) ≡ r(t) + γr(t + 1) + γ²r(t + 2) + …, where r(t), r(t + 1), r(t + 2), … are generated by following policy π starting at state s • Restated, task is to learn optimal policy π* • Finding Optimal Policy • Q-Learning Training Rule: Q̂(s, a) ← r + γ maxa' Q̂(s', a') • [Figure: 6-state grid world with goal state G, showing r(state, action) immediate reward values, Q(state, action) values, V*(state) values, and one optimal policy]
Learning Scenarios • First Learning Scenario • Passive learning in known environment (Section 20.2, Russell and Norvig) • Intuition (passive learning in known and unknown environments) • Training sequences (s1, s2, …, sn, r = U(sn)) • Learner has fixed policy π; determine benefits (expected total reward) • Important note: known does not imply accessible or deterministic (even if transition model is known, state may not be directly observable and δ may be stochastic) • Solutions: naïve updating (LMS), dynamic programming, temporal differences • Second Learning Scenario • Passive learning in unknown environment (Section 20.3, Russell and Norvig) • Solutions: LMS, temporal differences; adaptation of dynamic programming • Third Learning Scenario • Active learning in unknown environment (Sections 20.4-20.6, Russell and Norvig) • Policy must be learned (e.g., through application and exploration) • Solutions: dynamic programming (Q-learning), temporal differences
Reinforcement Learning Methods • Solution Approaches • Naïve updating: least-mean-square (LMS) utility update • Dynamic programming (DP): solving constraint equations • Adaptive DP (ADP): includes value iteration, policy iteration, exact Q-learning • Passive case: teacher selects sequences (trajectories through environment) • Active case: exact Q-learning (recursive exploration) • Method of temporal differences (TD): approximating constraint equations • Intuitive idea: use observed transitions to adjust U(s) or Q(s, a) • Active case: approximate Q-learning (TD Q-learning) • Passive: Examples • Temporal differences: U(s) ← U(s) + α(R(s) + U(s’) - U(s)) - sketch below • No exploration function • Active: Examples • ADP (value iteration): U(s) ← R(s) + maxa Σs’ (Ms,s’(a) · U(s’)) • Exploration (exact Q-learning): update uses exploration function f (next slide)
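A minimal sketch of the passive temporal-difference update listed above, assuming observed transitions (s, r, s') under a fixed policy. The learning rate alpha is a free parameter (the slide does not fix it), and gamma defaults to 1 to match the undiscounted form shown.

```python
def td_utility_update(U, s, reward, s_next, alpha=0.1, gamma=1.0):
    """Passive TD(0): nudge U(s) toward the one-step sample R(s) + gamma * U(s')."""
    u_s = U.get(s, 0.0)
    U[s] = u_s + alpha * (reward + gamma * U.get(s_next, 0.0) - u_s)
```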
Active Learning and Exploration • Active Learning Framework • So far: optimal behavior is to choose action with maximum expected utility (MEU), given current estimates • Proposed revision: action has two outcomes • Gains rewards on current sequence (agent preference: greed) • Affects percepts → ability of agent to learn → ability of agent to receive future rewards (agent preference: “investment in education”, aka novelty, curiosity) • Tradeoff: comfort (lower risk, reduced payoff) versus higher risk, higher potential payoff • Problem: how to quantify tradeoff, reward latter case? • Exploration • Define: exploration function - e.g., f(u, n) = (n < N) ? R+ : u • u: expected utility under optimistic estimate; f increasing in u (greed) • n ≡ N(s, a): number of trials of action-value pair; f decreasing in n (curiosity) • Optimistic utility estimator: U+(s) ← R(s) + maxa f(Σs’ (Ms,s’(a) · U+(s’)), N(s, a)) • Key Issues: Generalization (Today); Allocation (CIS 830)
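The exploration function as a one-liner. R+ and N appear on the slide; the particular default values below are illustrative assumptions.

```python
def f(u, n, R_plus=100.0, N=10):
    """Exploration function: return the optimistic bound R_plus until the
    state-action pair has been tried at least N times, then the estimate u."""
    return R_plus if n < N else u
```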
Temporal Difference Learning: Rationale and Formula • Q-Learning • Reduce discrepancy between successive Q̂ estimates • One-step time difference: Q(1)(s(t), a(t)) • Method of Temporal Differences (TD(λ)), aka Temporal Differencing • Why not two steps, Q(2)? • Or n steps, Q(n)? • TD(λ) formula: Qλ(s(t), a(t)) • Blends all of these • Intuitive idea: use constant 0 ≤ λ ≤ 1 to combine estimates from various lookahead distances (note normalization factor 1 - λ) • (Formulas written out below.)
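Written out (following Mitchell, Section 13.5), the lookahead estimates and their λ-blend are as follows; the last line is the recursive form used as the training rule on the next slide.

```latex
\begin{aligned}
Q^{(1)}(s_t,a_t) &\equiv r_t + \gamma \max_{a} \hat{Q}(s_{t+1},a) \\
Q^{(n)}(s_t,a_t) &\equiv r_t + \gamma r_{t+1} + \cdots + \gamma^{\,n-1} r_{t+n-1}
                    + \gamma^{\,n} \max_{a} \hat{Q}(s_{t+n},a) \\
Q^{\lambda}(s_t,a_t) &\equiv (1-\lambda)\left[\, Q^{(1)}(s_t,a_t) + \lambda\, Q^{(2)}(s_t,a_t)
                    + \lambda^{2} Q^{(3)}(s_t,a_t) + \cdots \right] \\
Q^{\lambda}(s_t,a_t) &= r_t + \gamma\Bigl[(1-\lambda)\max_{a}\hat{Q}(s_{t+1},a)
                    + \lambda\, Q^{\lambda}(s_{t+1},a_{t+1})\Bigr]
\end{aligned}
```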
Temporal Difference Learning: TD(λ) Training Rule and Algorithm • Training Rule: Derivation from Formula • Formula: Qλ(s(t), a(t)) = r(t) + γ[(1 - λ) maxa Q̂(s(t + 1), a) + λQλ(s(t + 1), a(t + 1))] • Recurrence equation for Qλ(s(t), a(t)) (recursive definition) defines update rule • Select a(t + i) based on current policy π • Algorithm • Use above training rule • Properties • Sometimes converges faster than Q learning • Converges for learning V* for any 0 ≤ λ ≤ 1 [Dayan, 1992] • Other results [Sutton, 1988; Peng and Williams, 1994] • Application: Tesauro’s TD-Gammon uses this algorithm [Tesauro, 1995] • Recommended book • Reinforcement Learning [Sutton and Barto, 1998] • http://www.cs.umass.edu/~rich/book/the-book.html
Applying Results of RL: Models versus Action-Value Functions • Distinction: Learning Policies with and without Models • Model-theoretic approach • Learning: transition function δ, utility function U • ADP component: value/policy iteration to reconstruct U from R • Putting learning and ADP components together: decision cycle (Lecture 17) • Function Active-ADP-Agent: Figure 20.9, Russell and Norvig • Contrast: Q-learning • Produces estimated action-value function Q̂ • No environment model (i.e., no explicit representation of state transitions) • NB: this includes both exact and approximate (e.g., TD) Q-learning • Function Q-Learning-Agent: Figure 20.12, Russell and Norvig • Ramifications: A Debate • Knowledge in model-theoretic approach corresponds to “pseudo-experience” in TD (see: 20.3, Russell and Norvig; distal supervised learning; phantom induction) • Dissenting conjecture: model-free methods “reduce need for knowledge” • At issue: when is it worthwhile to combine analytical and inductive learning?
Applying Results of RL: MDP Decision Cycle Revisited • Function Decision-Theoretic-Agent (Percept) • Percept: agent’s input; collected evidence about world (from sensors) • COMPUTE updated probabilities for current state based on available evidence, including current percept and previous action (prediction, estimation) • COMPUTE outcome probabilities for actions, given action descriptions and probabilities of current state (decision model) • SELECT action with highest expected utility, given probabilities of outcomes and utility functions • RETURN action • Situated Decision Cycle • Update percepts, collect rewards • Update active model (prediction and estimation; decision model) • Update utility function: value iteration • Selecting action to maximize expected utility: performance element • Role of Learning: Acquire State Transition Model, Utility Function
Generalization in RL • Explicit Representation • One output value for each input tuple • Assumption: functions represented in tabular form for DP • Utility U: state → value, Uh: state vector → value • Transition M: state × state × action → probability • Reward R: state → value, r: state × action → value • Action-value Q: state × action → value • Reasonable for small state spaces, breaks down rapidly with more states • ADP convergence, time per iteration becomes unmanageable • “Real-world” problems and games: still intractable even for approximate ADP • Solution Approach: Implicit Representation • Compact representation: allows calculation of U, M, R, Q • e.g., checkers: Û(s) = w1·f1(s) + w2·f2(s) + … + wn·fn(s), a weighted sum of board features • Input Generalization • Key benefit of compact representation: inductive generalization over states • Implicit representation : RL :: representation bias : supervised learning • (A sketch of one compact representation follows.)
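One way to realize a compact representation, sketched below: Q̂ as a weighted sum of state-action features, trained toward the usual one-step target. The feature encoding, step size, and function names are illustrative assumptions; the slides do not commit to a particular approximator.

```python
def q_hat(w, features):
    """Compact Q estimate: dot product of weights with state-action features phi(s, a)."""
    return sum(wi * fi for wi, fi in zip(w, features))

def q_gradient_step(w, features, reward, best_next_q, gamma=0.9, lr=0.01):
    """Nudge weights so q_hat(s, a) moves toward the sampled target r + gamma * max_a' Q(s', a')."""
    error = (reward + gamma * best_next_q) - q_hat(w, features)
    return [wi + lr * error * fi for wi, fi in zip(w, features)]
```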
Relationship to Dynamic Programming • Q-Learning • Exact version closely related to DP-based MDP solvers • Typical assumption: perfect knowledge of δ(s, a) and r(s, a) • NB: remember, this does not mean • Accessibility (total observability of s) • Determinism of δ, r • Situated Learning • aka in vivo, online, lifelong learning • Achieved by moving about, interacting with real environment • Opposite: simulated, in vitro learning • Bellman’s Equation [Bellman, 1957]: (∀s ∈ S) V(s) = maxa E[r(s, a) + γV(δ(s, a))] • Note very close relationship to definition of optimal policy: π* ≡ argmaxπ Vπ(s), ∀s • Result: Vπ satisfies above equation iff π = π*
Subtle Issues and Continuing Research • Current Research Topics • Replace table of Q estimates with ANN or other generalizer • Neural reinforcement learning (next time) • Genetic reinforcement learning (next week) • Handle case where state only partially observable • Estimation problem clear for ADPs (many approaches, e.g., Kalman filtering) • How to learn Q in partially observable MDPs? • Optimal exploration strategies • Extend to continuous action, state • Knowledge: incorporate or attempt to discover? • Role of Knowledge in Control Learning • Method of incorporating domain knowledge: simulated experiences • Distal supervised learning [Jordan and Rumelhart, 1992] • Pseudo-experience [Russell and Norvig, 1995] • Phantom induction [Brodie and DeJong, 1998] • TD Q-learning: knowledge discovery or brute force (or both)?
RL Applications: Game Playing • Board Games • Checkers • Samuel’s player [Samuel, 1959]: precursor to temporal difference methods • Early case of multi-agent learning and co-evolution • Backgammon • Predecessor: Neurogammon (backprop-based) [Tesauro and Sejnowski, 1989] • TD-Gammon: based on TD(λ) [Tesauro, 1992] • Robot Games • Soccer • RoboCup web site: http://www.robocup.org • Soccer server manual: http://www.dsv.su.se/~johank/RoboCup/manual/ • Air hockey: http://cyclops.csl.uiuc.edu • Discussions Online (Other Games and Applications) • Sutton and Barto book: http://www.cs.umass.edu/~rich/book/11/node1.html • Sheppard’s thesis: http://www.cs.jhu.edu/~sheppard/thesis/node32.html
RL Applications:Control and Optimization • Mobile Robot Control: Autonomous Exploration and Navigation • USC Information Sciences Institute (Shen et al): http://www.isi.edu/~shen • Fribourg (Perez): http://lslwww.epfl.ch/~aperez/robotreinfo.html • Edinburgh (Adams et al): http://www.dai.ed.ac.uk/groups/mrg/MRG.html • CMU (Mitchell et al): http://www.cs.cmu.edu/~rll • General Robotics: Smart Sensors and Actuators • CMU robotics FAQ: http://www.frc.ri.cmu.edu/robotics-faq/TOC.html • Colorado State (Anderson et al): http://www.cs.colostate.edu/~anderson/res/rl/ • Optimization: General Automation • Planning • UM Amherst: http://eksl-www.cs.umass.edu/planning-resources.html • USC ISI (Knoblock et al) http://www.isi.edu/~knoblock • Scheduling: http://www.cs.umass.edu/~rich/book/11/node7.html
Terminology • Reinforcement Learning (RL) • Definition: learning policies π : state → action from <<state, action>, reward> observations • Markov decision problems (MDPs): finding control policies to choose optimal actions • Q-learning: produces action-value function Q : state × action → value (expected utility) • Active learning: experimentation (exploration) strategies • Exploration function: f(u, n) • Tradeoff: greed (u) preference versus novelty (1 / n) preference, aka curiosity • Temporal Difference (TD) Learning • λ: constant for blending alternative training estimates from multi-step lookahead • TD(λ): algorithm that uses recursive training rule with λ-estimates • Generalization in RL • Explicit representation: tabular representation of U, M, R, Q • Implicit representation: compact (aka compressed) representation
Summary Points • Reinforcement Learning (RL) Concluded • Review: RL framework (learning from <<state, action>, reward> observations) • Continuing research topics • Active learning: experimentation (exploration) strategies • Generalization in RL: made possible by implicit representations • Temporal Difference (TD) Learning • Family of algorithms for RL: generalizes Q-learning • More than one step of lookahead • Many more TD learning results, applications: [Sutton and Barto, 1998] • More Discussions Online • Harmon’s tutorial: http://www-anw.cs.umass.edu/~mharmon/rltutorial/ • CMU RL Group: http://www.cs.cmu.edu/Groups/reinforcement/www/ • Michigan State RL Repository: http://www.cse.msu.edu/rlr/ • For More Info • Post to http://www.kddresearch.org web board • Send e-mail to bhsu@cis.ksu.edu, dag@cis.ksu.edu
Summary Points • Control Learning • Learning policies from <state, reward, action> observations • Objective: choose optimal actions given new percepts and incremental rewards • Issues • Delayed reward • Active learning opportunities • Partial observability • Reuse of sensors, effectors • Q Learning • Action-value function Q : state × action → value (expected utility) • Training rule • Dynamic programming algorithm • Q learning for deterministic worlds • Convergence to true Q • Generalizing Q learning to nondeterministic worlds • Next Week: More Reinforcement Learning (Temporal Differences)