CIS 540 Robotics Laboratory Robotic Soccer: A Machine Learning Testbed Monday, September 18, 2000 William H. Hsu Department of Computing and Information Sciences, KSU http://www.cis.ksu.edu/~bhsu Readings: Sections 13.1-13.4, Mitchell; Sections 20.1-20.2, Russell and Norvig
Lecture Outline • References: Chapter 13, Mitchell; Sections 20.1-20.2, Russell and Norvig • Today: Sections 13.1-13.4, Mitchell • Review: “Learning to Predict by the Method of Temporal Differences”, Sutton • Introduction to Reinforcement Learning • Control Learning • Control policies that choose optimal actions • MDP framework, continued • Issues • Delayed reward • Active learning opportunities • Partial observability • Reuse requirement • Q Learning • Dynamic programming algorithm • Deterministic and nondeterministic cases; convergence properties
Control Learning • Learning to Choose Actions • Performance element • Applying policy in uncertain environment (last time) • Control, optimization objectives: belong to intelligent agent • Applications: automation (including mobile robotics), information retrieval • Examples • Robot learning to dock on battery charger • Learning to choose actions to optimize factory output • Learning to play Backgammon • Problem Characteristics • Delayed reward: reward signal may be episodic (e.g., win/loss at end of game) • Opportunity for active exploration: situated learning • Possible partial observability of states • Possible need to learn multiple tasks with same sensors, effectors (e.g., actuators)
Example: TD-Gammon • Learns to Play Backgammon [Tesauro, 1995] • Predecessor: NeuroGammon [Tesauro and Sejnowski, 1989] • Learned from examples of labelled moves (very tedious for human expert) • Result: strong computer player, but not grandmaster-level • TD-Gammon: first version, 1992 - used reinforcement learning • Immediate Reward • +100 if win • -100 if loss • 0 for all other states • Learning in TD-Gammon • Algorithm: temporal differences [Sutton, 1988] - next time • Training: playing 200,000 - 1.5 million games against itself (several weeks) • Learning curve: improves until ~1.5 million games • Result: now approximately equal to best human player (won World Cup of Backgammon in 1992; among top 3 since 1995)
Interactive Model • State (may be partially observable), incremental reward presented to agent • Agent selects actions based upon (current) policy • Taking action puts agent into new state in environment • New reward: reinforcement (feedback) • Agent uses decision cycle to estimate new state, compute outcome distributions, select new actions • Reinforcement Learning Problem • Given: observation sequence; discount factor γ ∈ [0, 1) • Learn to: choose actions that maximize r(t) + γ r(t + 1) + γ² r(t + 2) + … • Reinforcement Learning: Problem Definition • [Figure: agent-environment loop; agent applies policy to choose action, environment returns state and reward]
Quick Review: Markov Decision Processes • Markov Decision Processes (MDPs) • Components • Finite set of states S • Set of actions A • At each time, agent • observes state s(t) ∈ S and chooses action a(t) ∈ A; • then receives reward r(t), • and state changes to s(t + 1) • Markov property, aka Markov assumption: s(t + 1) = δ(s(t), a(t)) and r(t) = r(s(t), a(t)) • i.e., r(t) and s(t + 1) depend only on current state and action • Previous history s(0), s(1), …, s(t - 1): irrelevant • i.e., s(t + 1) conditionally independent of s(0), s(1), …, s(t - 1) given s(t) • δ, r: may be nondeterministic; not necessarily known to agent • Variants: totally observable (accessible), partially observable (inaccessible) • Criterion for a(t): Total Reward – Maximum Expected Utility (MEU)
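To make these components concrete, here is a minimal Python sketch of a deterministic MDP. The 2 x 3 grid layout, the state and action names, and the reward of 100 for entering an absorbing goal state are illustrative assumptions in the spirit of the grid-world example on the following slides, not definitions taken from them.

    # Minimal deterministic MDP sketch: finite states S, actions A,
    # transition function delta(s, a) -> s', and reward function r(s, a).
    S = [(row, col) for row in range(2) for col in range(3)]   # six grid cells (assumed layout)
    A = ["up", "down", "left", "right"]
    GOAL = (0, 2)                                              # absorbing goal state (assumption)

    def delta(s, a):
        """Deterministic transition: move one cell; stay put at walls; goal is absorbing."""
        if s == GOAL:
            return s
        moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
        nr, nc = s[0] + moves[a][0], s[1] + moves[a][1]
        return (nr, nc) if (nr, nc) in S else s

    def r(s, a):
        """Immediate reward: 100 for an action that enters the goal, 0 otherwise."""
        return 100 if s != GOAL and delta(s, a) == GOAL else 0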
Agent’s Learning Task • Performance Element • Execute actions in environment, observe results • Learn action policy π : state → action that maximizes expected discounted reward E[r(t) + γ r(t + 1) + γ² r(t + 2) + …] from any starting state in S • γ ∈ [0, 1) • Discount factor on future rewards • Expresses preference for rewards sooner rather than later • Note: Something New! • Target function is π : state → action • However… • We have no training examples of form <state, action> • Training examples are of form <<state, action>, reward>
First Learning Scenario: Deterministic Worlds • Agent considers adopting policy π from policy space Π • For each possible policy π, can define an evaluation function over states: Vπ(s) ≡ r(t) + γ r(t + 1) + γ² r(t + 2) + …, where r(t), r(t + 1), r(t + 2), … are generated by following policy π starting at state s • Restated, task is to learn optimal policy π* ≡ argmaxπ Vπ(s), ∀s • Finding Optimal Policy • Value Function • [Grid-world figure: r(state, action) immediate reward values; Q(state, action) values; V*(state) values; one optimal policy]
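As a quick worked check of where the figure's values come from (assuming, as in Mitchell's standard grid-world example, γ = 0.9 and an immediate reward of 100 only for actions entering the absorbing goal state G, 0 otherwise): a state adjacent to G has V* = 100 (one step collects the reward, then nothing further); one state farther away has V* = 0 + γ · 100 = 90; two states farther has V* = 0 + γ · 0 + γ² · 100 = 81; three farther has γ³ · 100 ≈ 72.9, shown rounded as 72. The Q values follow the same pattern via Q(s, a) = r(s, a) + γ V*(δ(s, a)); for example, an action leading to a state with V* = 100 has Q(s, a) = 0 + 0.9 · 100 = 90.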
Idea • Might have agent try to learn evaluation function Vπ* (abbreviated V*) • Could then perform lookahead search to choose best action from any state s, because: π*(s) = argmaxa [r(s, a) + γ V*(δ(s, a))] • Problem with Idea • Works well if agent knows • δ : state × action → state • r : state × action → R • When agent doesn’t know δ and r, cannot choose actions this way • What to Learn
Solution Approach • Define new function very similar to V*: Q(s, a) ≡ r(s, a) + γ V*(δ(s, a)) • If agent learns Q, it can choose optimal action even without knowing δ! • Using Learned Q • Q: evaluation function to be learned by agent • Apply Q to select action • Idealized, computed policy (this is your brain without Q-learning): π*(s) = argmaxa [r(s, a) + γ V*(δ(s, a))] • Approximated policy (this is your brain with Q-learning): π̂(s) = argmaxa Q̂(s, a) • Q Function
Developing Recurrence Equation for Q • Note: Q and V* closely related: V*(s) = maxa’ Q(s, a’) • Allows us to write Q recursively as Q(s(t), a(t)) = r(s(t), a(t)) + γ V*(δ(s(t), a(t))) = r(s(t), a(t)) + γ maxa’ Q(s(t + 1), a’) • Nice! Let Q̂ denote learner’s current approximation to Q • Training Rule: Q̂(s, a) ← r + γ maxa’ Q̂(s’, a’) • s’: state resulting from applying action a in state s • (Deterministic) transition function δ made implicit • Dynamic programming: iterate over table of possible a’ values • Training Rule to Learn Q
(Nonterminating) Procedure for Situated Agent • Procedure Q-Learning-Deterministic (Reinforcement-Stream) • Reinforcement-Stream: consists of <<state, action>, reward> tuples • FOR each <s, a> DO • Initialize table entry Q̂(s, a) ← 0 • Observe current state s • WHILE (true) DO • Select action a and execute it • Receive immediate reward r • Observe new state s’ • Update table entry for Q̂(s, a) as follows: Q̂(s, a) ← r + γ maxa’ Q̂(s’, a’) • Move: record transition from s to s’ (i.e., s ← s’) • Q Learning for Deterministic Worlds
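A minimal Python sketch of this procedure, under stated assumptions: it reuses the illustrative S, A, delta, r, and GOAL sketched after the MDP review slide, selects actions uniformly at random as its exploration policy, and restarts at a random state whenever the goal is reached. None of those choices come from the slide itself.

    import random
    from collections import defaultdict

    GAMMA = 0.9   # discount factor, matching the worked grid example

    def q_learning_deterministic(S, A, delta, r, goal, n_steps=10000):
        """Tabular Q-learning for a deterministic world: Q(s, a) <- r + gamma * max_a' Q(s', a')."""
        Q = defaultdict(float)                      # initialize every table entry to 0
        s = random.choice(S)                        # observe current (here: random) start state
        for _ in range(n_steps):
            a = random.choice(A)                    # illustrative random exploration policy
            reward, s_next = r(s, a), delta(s, a)   # receive immediate reward, observe new state
            Q[(s, a)] = reward + GAMMA * max(Q[(s_next, a2)] for a2 in A)   # update table entry
            s = random.choice(S) if s_next == goal else s_next              # episode restart (assumption)
        return Q

    # Example call with the illustrative grid MDP sketched earlier:
    # Q_hat = q_learning_deterministic(S, A, delta, r, GOAL)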
Example: Propagating Credit (Q̂) for Candidate Action • Glyph denotes mobile robot • Initial state: s1 (upper left); next state: s2 • Let discount factor γ be 0.9 • Q̂ estimate: Q̂(s1, aright) ← r + γ maxa’ Q̂(s2, a’) = 0 + 0.9 · max{63, 81, 100} = 90 • Property • If rewards nonnegative, then Q̂ increases monotonically between 0 and true Q: (∀ s, a, n) Q̂n+1(s, a) ≥ Q̂n(s, a) and 0 ≤ Q̂n(s, a) ≤ Q(s, a) • [Figure: grid-world snapshots before and after the update; Q̂(s1, aright) changes from 72 to 90, using the estimates 63, 81, 100 available at s2] • Updating the Q Estimate
Claim • Q̂ converges to Q • Scenario: deterministic world where <s, a> are observed (visited) infinitely often • Proof • Define full interval: interval during which each <s, a> is visited • During each full interval, largest error in Q̂ table is reduced by factor of γ • Let Q̂n be the table after n updates and Δn be the maximum error in Q̂n; that is, Δn = maxs, a |Q̂n(s, a) - Q(s, a)| • For any table entry Q̂n(s, a), updated error in revised estimate Q̂n+1(s, a) is |Q̂n+1(s, a) - Q(s, a)| = |(r + γ maxa’ Q̂n(s’, a’)) - (r + γ maxa’ Q(s’, a’))| = γ |maxa’ Q̂n(s’, a’) - maxa’ Q(s’, a’)| ≤ γ maxa’ |Q̂n(s’, a’) - Q(s’, a’)| ≤ γ Δn • Note: used general fact |maxa f1(a) - maxa f2(a)| ≤ maxa |f1(a) - f2(a)| • Convergence
Second Learning Scenario: Nondeterministic World (Nondeterministic MDP) • What if reward and next state are nondeterministically selected? • i.e., reward function r and transition function δ are nondeterministic • Nondeterminism may express many kinds of uncertainty • Inherent uncertainty in dynamics of world • Effector exceptions (qualifications), side effects (ramifications) • Solution Approach • Redefine V, Q in terms of expected values: Vπ(s) ≡ E[r(t) + γ r(t + 1) + γ² r(t + 2) + …], Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))] • Introduce decay factor; retain some of previous Q value • Compare: momentum term in ANN learning • Nondeterministic Case
Q Learning Generalizes to Nondeterministic Worlds • Alter training rule to Q̂n(s, a) ← (1 - αn) Q̂n-1(s, a) + αn [r + γ maxa’ Q̂n-1(s’, a’)], where αn = 1 / (1 + visitsn(s, a)) • Decaying weighted average • visitsn(s, a): total number of times <s, a> has been visited by iteration n, inclusive • r: observed reward (may also be stochastically determined) • Can Still Prove Convergence of Q̂ to Q [Watkins and Dayan, 1992] • Intuitive Idea • αn ∈ [0, 1]: discounts estimate by number of visits, prevents oscillation • Tradeoff • More gradual revisions to Q̂ • Able to deal with stochastic environment: P(s’ | s, a), P(r | s, s’, a) • Nondeterministic Case: Generalizing Q Learning
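A hedged sketch of how the update step inside the earlier deterministic loop could be replaced by this decaying weighted average; only the visit-count bookkeeping is new, and the function and variable names are mine.

    from collections import defaultdict

    def make_nondeterministic_updater(gamma=0.9):
        """Return an update implementing
        Q_n(s, a) <- (1 - alpha_n) Q_{n-1}(s, a) + alpha_n [r + gamma * max_a' Q_{n-1}(s', a')],
        with alpha_n = 1 / (1 + visits_n(s, a))."""
        visits = defaultdict(int)
        def update(Q, s, a, reward, s_next, actions):
            visits[(s, a)] += 1
            alpha = 1.0 / (1.0 + visits[(s, a)])
            target = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
        return update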
Terminology • Reinforcement Learning (RL) • RL: learning to choose optimal actions from <state, reward, action> observations • Scenarios • Delayed reward: reinforcement is deferred until end of episode • Active learning: agent can control collection of experience • Partial observability: may only be able to observe rewards (must infer state) • Reuse requirement: sensors, effectors may be required for multiple tasks • Markov Decision Processes (MDPs) • Markovity (aka Markov property, Markov assumption): conditional independence (CI) assumption on states over time • Maximum expected utility (MEU): maximum expected total reward (under additive decomposition assumption) • Q Learning • Action-value function Q : state × action → value • Q learning: training rule and dynamic programming algorithm for RL
CIS 540 Robotics Laboratory More Reinforcement Learning: Temporal Differences Monday, September 18, 2000 William H. Hsu Department of Computing and Information Sciences, KSU http://www.cis.ksu.edu/~bhsu Readings: Sections 13.5-13.8, Mitchell Sections 20.2-20.7, Russell and Norvig
Lecture Outline • Readings: 13.1-13.4, Mitchell; 20.2-20.7, Russell and Norvig • This Week’s Paper Review: “Connectionist Learning Procedures”, Hinton • Suggested Exercises: 13.4, Mitchell; 20.11, Russell and Norvig • Reinforcement Learning (RL) Concluded • Control policies that choose optimal actions • MDP framework, continued • Continuing research topics • Active learning: experimentation (exploration) strategies • Generalization in RL • Next: ANNs and GAs for RL • Temporal Difference (TD) Learning • Family of dynamic programming algorithms for RL • Generalization of Q learning • More than one step of lookahead • More on TD learning in action
Interactive Model • State s (may be partially observable) • Agent selects action a based upon (current) policy π • Incremental reward (aka reinforcement) r(s, a) presented to agent • Taking action puts agent into new state s’ = δ(s, a) in environment • Agent uses decision cycle to estimate s’, compute outcome distributions, select new actions • Reinforcement Learning Problem • Given: observation sequence; discount factor γ ∈ [0, 1) • Learn to: choose actions that maximize r(t) + γ r(t + 1) + γ² r(t + 2) + … • Quick Review: Policy Learning Framework • [Figure: agent-environment loop; agent’s policy maps state and reward to action]
Deterministic World Scenario • “Knowledge-free” (here, model-free) search for policy π from policy space Π • For each possible policy π, can define an evaluation function over states: Vπ(s) ≡ r(t) + γ r(t + 1) + γ² r(t + 2) + …, where r(t), r(t + 1), r(t + 2), … are generated by following policy π starting at state s • Restated, task is to learn optimal policy π* • Finding Optimal Policy • Q-Learning Training Rule: Q̂(s, a) ← r + γ maxa’ Q̂(s’, a’) • Quick Review: Q Learning • [Grid-world figure: r(state, action) immediate reward values; Q(state, action) values; V*(state) values; one optimal policy]
Learning Scenarios • First Learning Scenario • Passive learning in known environment (Section 20.2, Russell and Norvig) • Intuition (passive learning in known and unknown environments) • Training sequences (s1, s2, …, sn, r = U(sn)) • Learner has fixed policy π; determine benefits (expected total reward) • Important note: known does not imply accessible or deterministic (even if transition model is known, state may not be directly observable and transitions may be stochastic) • Solutions: naïve updating (LMS), dynamic programming, temporal differences • Second Learning Scenario • Passive learning in unknown environment (Section 20.3, Russell and Norvig) • Solutions: LMS, temporal differences; adaptation of dynamic programming • Third Learning Scenario • Active learning in unknown environment (Sections 20.4-20.6, Russell and Norvig) • Policy must be learned (e.g., through application and exploration) • Solutions: dynamic programming (Q-learning), temporal differences
Solution Approaches • Naïve updating: least-mean-square (LMS) utility update • Dynamic programming (DP): solving constraint equations • Adaptive DP (ADP): includes value iteration, policy iteration, exact Q-learning • Passive case: teacher selects sequences (trajectories through environment) • Active case: exact Q-learning (recursive exploration) • Method of temporal differences (TD): approximating constraint equations • Intuitive idea: use observed transitions to adjust U(s) or Q(s, a) • Active case: approximate Q-learning (TD Q-learning) • Passive: Examples • Temporal differences: U(s) ← U(s) + α (R(s) + U(s’) - U(s)) • No exploration function • Active: Examples • ADP (value iteration): U(s) ← R(s) + maxa Σs’ (Ms,s’(a) · U(s’)) • Exploration (exact Q-learning): update uses exploration function f(u, n) (next slide) • Reinforcement Learning Methods
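A small sketch of the passive TD utility update above, applied along one observed training sequence. The 1/(1 + visits) learning-rate schedule, the undiscounted setting, and the toy trajectory are assumptions for illustration only.

    from collections import defaultdict

    def passive_td_step(U, visits, s, reward, s_next):
        """One passive TD step: U(s) <- U(s) + alpha * (R(s) + U(s') - U(s))."""
        visits[s] += 1
        alpha = 1.0 / (1.0 + visits[s])          # decaying learning rate (assumption)
        U[s] += alpha * (reward + U[s_next] - U[s])

    # Illustrative use on one training sequence of (state, reward-in-state) pairs:
    U, visits = defaultdict(float), defaultdict(int)
    sequence = [("s1", 0.0), ("s2", 0.0), ("s3", 1.0)]
    for (s, rew), (s_next, _) in zip(sequence, sequence[1:]):
        passive_td_step(U, visits, s, rew, s_next)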
Active Learning and Exploration • Active Learning Framework • So far: optimal behavior is to choose action with maximum expected utility (MEU), given current estimates • Proposed revision: action has two outcomes • Gains rewards on current sequence (agent preference: greed) • Affects percepts → ability of agent to learn → ability of agent to receive future rewards (agent preference: “investment in education”, aka novelty, curiosity) • Tradeoff: comfort (lower risk, reduced payoff) versus higher risk, higher potential payoff • Problem: how to quantify tradeoff, reward latter case? • Exploration • Define: exploration function - e.g., f(u, n) = (n < N) ? R+ : u • u: expected utility under optimistic estimate; f increasing in u (greed) • n ≡ N(s, a): number of trials of action-value pair; f decreasing in n (curiosity) • Optimistic utility estimator: U+(s) ← R(s) + maxa f(Σs’ (Ms,s’(a) · U+(s’)), N(s, a)) • Key Issues: Generalization (Today); Allocation (CIS 830)
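A minimal sketch of this exploration function and of the optimistic backup it plugs into. R_PLUS (the optimistic reward bound the slide writes as R+), the trial threshold N_E, and the dictionary layout of the transition model M are all illustrative assumptions.

    R_PLUS = 100.0   # optimistic bound on achievable reward (assumed value)
    N_E = 5          # try each (state, action) at least this many times (assumed value)

    def f(u, n):
        """Exploration function: optimistic while (s, a) is under-explored, else the estimate u.
        Increasing in u (greed), decreasing in n (curiosity)."""
        return R_PLUS if n < N_E else u

    def optimistic_backup(R, M, U_plus, N, s, states, actions):
        """U+(s) <- R(s) + max_a f(sum_s' M[s][s'][a] * U+(s'), N[(s, a)])."""
        return R[s] + max(
            f(sum(M[s][s2][a] * U_plus[s2] for s2 in states), N[(s, a)])
            for a in actions
        )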
Q-Learning • Reduce discrepancy between successive Q̂ estimates • One-step time difference: Q(1)(s(t), a(t)) ≡ r(t) + γ maxa Q̂(s(t + 1), a) • Why not two steps? Q(2)(s(t), a(t)) ≡ r(t) + γ r(t + 1) + γ² maxa Q̂(s(t + 2), a) • Or n steps? Q(n)(s(t), a(t)) ≡ r(t) + γ r(t + 1) + … + γn-1 r(t + n - 1) + γn maxa Q̂(s(t + n), a) • Method of Temporal Differences (TD(λ)), aka Temporal Differencing • TD(λ) formula blends all of these: Qλ(s(t), a(t)) ≡ (1 - λ) [Q(1)(s(t), a(t)) + λ Q(2)(s(t), a(t)) + λ² Q(3)(s(t), a(t)) + …] • Intuitive idea: use constant 0 ≤ λ ≤ 1 to combine estimates from various lookahead distances (note normalization factor 1 - λ) • Temporal Difference Learning: Rationale and Formula
Training Rule: Derivation from Formula • Formula: Qλ(s(t), a(t)) ≡ (1 - λ) [Q(1)(s(t), a(t)) + λ Q(2)(s(t), a(t)) + λ² Q(3)(s(t), a(t)) + …] • Recurrence equation for Qλ(s(t), a(t)) (recursive definition) defines update rule: Qλ(s(t), a(t)) = r(t) + γ [(1 - λ) maxa Q̂(s(t + 1), a) + λ Qλ(s(t + 1), a(t + 1))] • Select a(t + i) based on current policy • Algorithm • Use above training rule • Properties • Sometimes converges faster than Q learning • Converges for learning V* for any 0 ≤ λ ≤ 1 [Dayan, 1992] • Other results [Sutton, 1988; Peng and Williams, 1994] • Application: Tesauro’s TD-Gammon uses this algorithm [Tesauro, 1995] • Recommended book • Reinforcement Learning [Sutton and Barto, 1998] • http://www.cs.umass.edu/~rich/book/the-book.html • Temporal Difference Learning: TD(λ) Training Rule and Algorithm
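To make the blending concrete, here is a Python sketch that computes the n-step estimates Q(n) and their TD(λ) mixture for the first step of a short observed trajectory. The episodic truncation of the sum, the sample rewards, and the per-step maxa Q̂ values are illustrative assumptions.

    def n_step_return(rewards, max_q_next, n, gamma=0.9):
        """Q^(n) for step t = 0: r(0) + gamma*r(1) + ... + gamma^(n-1)*r(n-1)
        + gamma^n * max_a Qhat(s(n), a)."""
        discounted = sum(gamma**i * rewards[i] for i in range(n))
        return discounted + gamma**n * max_q_next[n]

    def td_lambda_return(rewards, max_q_next, lam=0.5, gamma=0.9):
        """Q^lambda for step t = 0: (1 - lam) * sum over n of lam^(n-1) * Q^(n),
        truncated where the observed trajectory ends."""
        N = len(rewards)
        return (1 - lam) * sum(
            lam**(n - 1) * n_step_return(rewards, max_q_next, n, gamma)
            for n in range(1, N + 1)
        )

    # Illustrative data: three observed rewards; max_q_next[n] holds max_a Qhat(s(n), a).
    rewards = [0.0, 0.0, 100.0]
    max_q_next = [81.0, 90.0, 100.0, 0.0]
    print(td_lambda_return(rewards, max_q_next))   # each Q^(n) here equals 81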
Applying Results of RL: Models versus Action-Value Functions • Distinction: Learning Policies with and without Models • Model-theoretic approach • Learning: transition model M (or δ), utility function U • ADP component: value/policy iteration to reconstruct U from R • Putting learning and ADP components together: decision cycle (Lecture 17) • Function Active-ADP-Agent: Figure 20.9, Russell and Norvig • Contrast: Q-learning • Produces estimated action-value function Q̂ • No environment model (i.e., no explicit representation of state transitions) • NB: this includes both exact and approximate (e.g., TD) Q-learning • Function Q-Learning-Agent: Figure 20.12, Russell and Norvig • Ramifications: A Debate • Knowledge in model-theoretic approach corresponds to “pseudo-experience” in TD (see: 20.3, Russell and Norvig; distal supervised learning; phantom induction) • Dissenting conjecture: model-free methods “reduce need for knowledge” • At issue: when is it worthwhile to combine analytical and inductive learning?
Applying Results of RL:MDP Decision Cycle Revisited • Function Decision-Theoretic-Agent (Percept) • Percept: agent’s input; collected evidence about world (from sensors) • COMPUTE updated probabilities for current state based on available evidence, including current percept and previous action (prediction, estimation) • COMPUTE outcome probabilities for actions, given action descriptions and probabilities of current state (decision model) • SELECT action with highest expected utility, given probabilities of outcomes and utility functions • RETURN action • Situated Decision Cycle • Update percepts, collect rewards • Update active model (prediction and estimation; decision model) • Update utility function: value iteration • Selecting action to maximize expected utility: performance element • Role of Learning: Acquire State Transition Model, Utility Function
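A skeletal sketch of this decision cycle as code. The belief-state representation and the model interfaces (transition_model, sensor_model, outcome_model, utility) are assumed names for illustration; they are not from the slide or from Russell and Norvig's pseudocode.

    def decision_theoretic_agent(percept, prev_action, belief, transition_model,
                                 sensor_model, outcome_model, utility, actions):
        """One pass of the situated decision cycle: estimate state, score actions, act."""
        # 1. Prediction/estimation: update state probabilities from previous action and percept.
        belief = transition_model(belief, prev_action)
        belief = sensor_model(belief, percept)
        # 2. Decision model: outcome distribution for each candidate action,
        # 3. then select the action with highest expected utility.
        def expected_utility(a):
            return sum(p * utility(s2) for s2, p in outcome_model(belief, a).items())
        return max(actions, key=expected_utility), belief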
Explicit Representation • One output value for each input tuple • Assumption: functions represented in tabular form for DP • Utility U: state → value; Uh: state vector → value • Transition M: state × state × action → probability • Reward R: state → value; r: state × action → value • Action-value Q: state × action → value • Reasonable for small state spaces, breaks down rapidly with more states • ADP convergence, time per iteration becomes unmanageable • “Real-world” problems and games: still intractable even for approximate ADP • Solution Approach: Implicit Representation • Compact representation: allows calculation of U, M, R, Q • e.g., checkers: utility estimated as a weighted linear function of board features, U(s) = w1 f1(s) + w2 f2(s) + … + wn fn(s) • Input Generalization • Key benefit of compact representation: inductive generalization over states • Implicit representation : RL :: representation bias : supervised learning • Generalization in RL
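A minimal sketch of such an implicit (compact) representation: a linear Q̂ over hand-chosen features trained with a gradient-style one-step update. The feature function, learning rate, and class names are standard choices assumed for illustration, not taken from the slides.

    import numpy as np

    def features(s, a):
        """Hand-designed feature vector phi(s, a) for the illustrative grid world (assumed)."""
        (row, col), a_index = s, ["up", "down", "left", "right"].index(a)
        return np.array([1.0, row, col, a_index], dtype=float)

    class LinearQ:
        """Qhat(s, a) = w . phi(s, a): a few weights instead of one table cell per (s, a)."""
        def __init__(self, n_features=4, alpha=0.01, gamma=0.9):
            self.w, self.alpha, self.gamma = np.zeros(n_features), alpha, gamma

        def q(self, s, a):
            return float(self.w @ features(s, a))

        def update(self, s, a, reward, s_next, actions):
            """Move w toward the one-step target r + gamma * max_a' Qhat(s', a')."""
            target = reward + self.gamma * max(self.q(s_next, a2) for a2 in actions)
            self.w += self.alpha * (target - self.q(s, a)) * features(s, a)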
Q-Learning • Exact version closely related to DP-based MDP solvers • Typical assumption: perfect knowledge of δ(s, a) and r(s, a) • NB: remember, this does not mean • Accessibility (total observability of s) • Determinism of δ, r • Situated Learning • aka in vivo, online, lifelong learning • Achieved by moving about, interacting with real environment • Opposite: simulated, in vitro learning • Bellman’s Equation [Bellman, 1957] • Note very close relationship to definition of optimal policy: ∀s ∈ S: Vπ(s) = maxa [r(s, a) + γ Vπ(δ(s, a))] • Result: π satisfies above equation iff π = π* • Relationship to Dynamic Programming
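For contrast with the model-free Q-learning loop sketched earlier, here is a hedged sketch of the DP-style solver this slide alludes to: value iteration on the deterministic Bellman equation when δ and r are known. The convergence tolerance and the reuse of the illustrative grid-world functions are assumptions.

    def value_iteration(S, A, delta, r, gamma=0.9, tol=1e-6):
        """Iterate V(s) <- max_a [r(s, a) + gamma * V(delta(s, a))] to a fixed point."""
        V = {s: 0.0 for s in S}
        while True:
            max_change = 0.0
            for s in S:
                new_v = max(r(s, a) + gamma * V[delta(s, a)] for a in A)
                max_change = max(max_change, abs(new_v - V[s]))
                V[s] = new_v
            if max_change < tol:
                return V

    # Example call with the illustrative grid MDP sketched earlier:
    # V_star = value_iteration(S, A, delta, r)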
Subtle Issues and Continuing Research • Current Research Topics • Replace table of Q estimates with ANN or other generalizer • Neural reinforcement learning (next time) • Genetic reinforcement learning (next week) • Handle case where state only partially observable • Estimation problem clear for ADPs (many approaches, e.g., Kalman filtering) • How to learn Q in MDPs? • Optimal exploration strategies • Extend to continuous action, state • Knowledge: incorporate or attempt to discover? • Role of Knowledge in Control Learning • Method of incorporating domain knowledge: simulated experiences • Distal supervised learning [Jordan and Rumelhart, 1992] • Pseudo-experience [Russell and Norvig, 1995] • Phantom induction [Brodie and DeJong, 1998] • TD Q-learning: knowledge discovery or brute force (or both)?
RL Applications:Game Playing • Board Games • Checkers • Samuel’s player [Samuel, 1959]: precursor to temporal difference methods • Early case of multi-agent learning and co-evolution • Backgammon • Predecessor: Neurogammon (backprop-based) [Tesauro and Sejnowski, 1989] • TD-Gammon: based on TD() [Tesauro, 1992] • Robot Games • Soccer • RoboCup web site: http://www.robocup.org • Soccer server manual: http://www.dsv.su.se/~johank/RoboCup/manual/ • Air hockey: http://cyclops.csl.uiuc.edu • Discussions Online (Other Games and Applications) • Sutton and Barto book: http://www.cs.umass.edu/~rich/book/11/node1.html • Sheppard’s thesis: http://www.cs.jhu.edu/~sheppard/thesis/node32.html
RL Applications:Control and Optimization • Mobile Robot Control: Autonomous Exploration and Navigation • USC Information Sciences Institute (Shen et al): http://www.isi.edu/~shen • Fribourg (Perez): http://lslwww.epfl.ch/~aperez/robotreinfo.html • Edinburgh (Adams et al): http://www.dai.ed.ac.uk/groups/mrg/MRG.html • CMU (Mitchell et al): http://www.cs.cmu.edu/~rll • General Robotics: Smart Sensors and Actuators • CMU robotics FAQ: http://www.frc.ri.cmu.edu/robotics-faq/TOC.html • Colorado State (Anderson et al): http://www.cs.colostate.edu/~anderson/res/rl/ • Optimization: General Automation • Planning • UM Amherst: http://eksl-www.cs.umass.edu/planning-resources.html • USC ISI (Knoblock et al) http://www.isi.edu/~knoblock • Scheduling: http://www.cs.umass.edu/~rich/book/11/node7.html
Terminology • Reinforcement Learning (RL) • Definition: learning policies π : state → action from <<state, action>, reward> observations • Markov decision problems (MDPs): finding control policies to choose optimal actions • Q-learning: produces action-value function Q : state × action → value (expected utility) • Active learning: experimentation (exploration) strategies • Exploration function: f(u, n) • Tradeoff: greed (u) preference versus novelty (1 / n) preference, aka curiosity • Temporal Difference (TD) Learning • λ: constant for blending alternative training estimates from multi-step lookahead • TD(λ): algorithm that uses recursive training rule with λ-estimates • Generalization in RL • Explicit representation: tabular representation of U, M, R, Q • Implicit representation: compact (aka compressed) representation
Summary Points • Reinforcement Learning (RL) Concluded • Review: RL framework (learning from <<state, action>, reward> observations) • Continuing research topics • Active learning: experimentation (exploration) strategies • Generalization in RL: made possible by implicit representations • Temporal Difference (TD) Learning • Family of algorithms for RL: generalizes Q-learning • More than one step of lookahead • Many more TD learning results, applications: [Sutton and Barto, 1998] • More Discussions Online • Harmon’s tutorial: http://www-anw.cs.umass.edu/~mharmon/rltutorial/ • CMU RL Group: http://www.cs.cmu.edu/Groups/reinforcement/www/ • Michigan State RL Repository: http://www.cse.msu.edu/rlr/ • For More Info • Post to http://www.kddresearch.org web board • Send e-mail to bhsu@cis.ksu.edu, dag@cis.ksu.edu
Summary Points • Control Learning • Learning policies from <state, reward, action> observations • Objective: choose optimal actions given new percepts and incremental rewards • Issues • Delayed reward • Active learning opportunities • Partial observability • Reuse of sensors, effectors • Q Learning • Action-value function Q : state × action → value (expected utility) • Training rule • Dynamic programming algorithm • Q learning for deterministic worlds • Convergence to true Q • Generalizing Q learning to nondeterministic worlds • Next Week: More Reinforcement Learning (Temporal Differences)