Mastering Q-Learning for Optimal Policies: A Comprehensive Guide

Introduction to Reinforcement Learning and Q-Learning
Andrew L. Nelson, Visiting Research Faculty, University of South Florida
Q-Learning

Overview
• Slide outline: Overview, References, Introduction, Agent and environment, Nomenclature, Cell World, Policy Learning in known space, Example, Reinforcement Policy Learning, Q-Function, Q-Algorithm, Example, Generalization, Summary
• References
• Introduction
• Learning an optimal policy in a known environment
• Learning an approximate optimal policy in an unknown environment
• Example
• Generalization and representation
• Knowledge-based vs. general function approximation methods

References
• C. Watkins, P. Dayan, “Q-Learning,” Machine Learning, vol. 8, pp. 279-292, 1992.
• T.M. Mitchell, Machine Learning, WCB/McGraw-Hill, 1997.

Introduction
• Situated Learning Agents
• The goal of a learning agent is to learn to choose actions (a) so that the net reward over a sequence of actions is maximized
• Supervised learning methods make use of knowledge of the world and of known reward functions
• Reinforcement learning methods use rewards to learn an optimal policy in a given (unknown) environment

Agent and Environment
• An agent produces an action (a); in response, the environment returns a reward (r) and moves to a new state (s')

Nomenclature
• Action: a ∈ A
• State: s ∈ S
• Reward: r = R(s)
• Policy: π: S → A
• Optimal Policy: π*
• World Model: s' = T(s, a)
• Utility: U(s)
• Value: Q(s, a)

Cell World
• Agent
• States
• Transitions
• Reward

Learning π* in Known Environments
• The supervised method:
  • Find the maximum possible utility for each state (iterative search)
  • Learn the optimal policy π*: S → A by learning, for each state s, the action that leads to the next state s' with the maximum possible utility, U*
• Requirements:
  • Known world model, T(s, a)
  • Known reward function, R(s)

Known Rewards and Transitions
• R(s) and s' = T(s, a) are known for all s ∈ S and a ∈ A

Calculate U* for Each State (using an iterative search algorithm, for example)
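
One possible iterative search is sketched below. It is only an illustration under stated assumptions: transitions are deterministic, a discount factor `gamma` is introduced (it does not appear in the slides), and the four-cell corridor with a reward of 100 in the last cell is a hypothetical stand-in for the cell world.

```python
def value_iteration(states, actions, T, R, terminal, gamma=0.9, tol=1e-9):
    """Repeat U(s) = R(s) + gamma * max_a U(T(s, a)) until the values stop changing."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in terminal:
                new_u = R(s)  # no future reward after a terminal state
            else:
                new_u = R(s) + gamma * max(U[T(s, a)] for a in actions)
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < tol:
            return U


# Hypothetical corridor: cells 0..3, cell 3 is terminal and pays 100.
states = [0, 1, 2, 3]
actions = [-1, +1]  # move left / move right


def T(s, a):
    """Known world model: move one cell, staying inside the corridor."""
    return min(max(s + a, 0), 3)


def R(s):
    """Known reward function."""
    return 100.0 if s == 3 else 0.0


U_star = value_iteration(states, actions, T, R, terminal={3})
# U_star is approximately {0: 72.9, 1: 81.0, 2: 90.0, 3: 100.0}
```

The utilities shrink with distance from the rewarding cell, which is what lets a policy be read off from them afterwards.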

Calculate π* Using the Known U* Values
• π*(s) = argmax_a U*(T(s, a)), for all s
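
Reading a policy off known utilities can be sketched as follows. The corridor, its transition function `T`, and the utility values in `U_star` are hypothetical and chosen only for illustration.

```python
def greedy_policy(states, actions, T, U, terminal):
    """For each non-terminal state, pick the action whose successor state has the highest utility."""
    return {
        s: max(actions, key=lambda a: U[T(s, a)])
        for s in states
        if s not in terminal
    }


states = [0, 1, 2, 3]
actions = [-1, +1]  # move left / move right


def T(s, a):
    """Known world model for a hypothetical 4-cell corridor."""
    return min(max(s + a, 0), 3)


# Assumed utilities for illustration (larger values closer to cell 3).
U_star = {0: 72.9, 1: 81.0, 2: 90.0, 3: 100.0}

pi_star = greedy_policy(states, actions, T, U_star, terminal={3})
# pi_star == {0: 1, 1: 1, 2: 1}: move right in every non-terminal cell
```

Note that this step needs the world model T(s, a), which is why it only applies to known environments.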

Notes
• Supervised learning methods work well when a complete model of the environment and the reward function are known
• Since R(s) and T(s, a) are known, we can reduce learning to a standard iterative learning process

Unknown Environments
• What if the environment is unknown?

The Q-Function
• Instead of learning utilities, state-action values (Q) will be learned
• U(s) = max_a Q(s, a)
• Local action and exploration can be used to discover and learn Q(s, a) values in an unknown environment
• We will use the following update: Q(s, a) ← r + max_a' Q(s', a')
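
The update follows from the relation between U and Q in a few lines. The sketch below uses the common formulation with a discount factor γ, as in Mitchell's textbook treatment. The factor γ is an assumption here, since the slides do not show it, and setting γ = 1 recovers the form used on this slide.

```latex
\begin{aligned}
Q(s, a) &\equiv r + \gamma\, U(s'), \qquad s' = T(s, a) \\
U(s')   &= \max_{a'} Q(s', a') \\
\Rightarrow\quad Q(s, a) &= r + \gamma \max_{a'} Q(s', a')
\end{aligned}
```

The right-hand side depends only on the observed reward r, the observed next state s', and the current Q estimates, so no explicit model T(s, a) or R(s) is needed to apply it.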

The Q-Learning Algorithm
• Build up a table of Q(s, a) values as follows:
• Do forever, from the current state s:
  • Set each uninitialized state-action value Q(s, a) to 0 and add it to the table of Q values
  • With probability p, select the action a with the maximum Q value (otherwise select a at random)
  • Execute a, receive the immediate reward r, and observe the new state s'
  • Update the table entry for Q(s, a) as Q(s, a) ← r + max_a' Q(s', a')
  • s ← s'
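
The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation. The names `q_learning` and `step`, the bounded number of episodes (in place of "do forever"), the random tie-breaking, the discount factor `gamma`, and the four-cell corridor environment are all assumptions made for the example. Passing `gamma=1.0` gives exactly the update written on the slide.

```python
import random


def q_learning(step, actions, start, is_terminal,
               episodes=50, p=0.8, gamma=0.9, seed=0):
    """Tabular Q-learning for a deterministic environment.

    step(s, a) -> (s_next, r) is the only access to the environment.
    """
    rng = random.Random(seed)
    Q = {}
    for _ in range(episodes):  # bounded stand-in for "do forever"
        s = start
        while not is_terminal(s):
            for a in actions:  # initialize unseen table entries to 0
                Q.setdefault((s, a), 0.0)
            if rng.random() < p:
                # Exploit: highest Q value, ties broken at random.
                a = max(actions, key=lambda act: (Q[(s, act)], rng.random()))
            else:
                # Explore: uniformly random action.
                a = rng.choice(actions)
            s_next, r = step(s, a)
            best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
            Q[(s, a)] = r + gamma * best_next  # table update
            s = s_next
    return Q


def step(s, a):
    """Hypothetical 4-cell corridor: entering cell 3 pays 100 and ends the episode."""
    s_next = min(max(s + a, 0), 3)
    return s_next, (100.0 if s_next == 3 else 0.0)


Q = q_learning(step, actions=[-1, +1], start=0, is_terminal=lambda s: s == 3)
policy = {s: max([-1, +1], key=lambda a: Q[(s, a)]) for s in range(3)}
```

The agent never calls a world model or reward function directly. It only observes `(s_next, r)` after acting, which is the point of the method in unknown environments.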

Q-Learning Example
• Initialize table and first position
• Move to s'... iterate
• Continue
• Terminal state, start over
• Starting new iteration
• After a few more iterations...
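
The way values spread backward from the terminal state over repeated iterations can be traced on a toy case. The sketch below is hypothetical and deliberately simplified: a three-step corridor with a single "move right" action per cell (so the max over actions is trivial), a reward of 100 on reaching the goal, and an assumed discount factor of 0.9. It is not the cell world from the slides.

```python
GOAL = 3      # terminal cell of a hypothetical corridor 0 -> 1 -> 2 -> 3
GAMMA = 0.9   # assumed discount factor

# One Q value per non-terminal cell, for the single action "move right".
Q = {cell: 0.0 for cell in range(GOAL)}
history = []

for episode in range(3):
    s = 0  # start over from the first position
    while s != GOAL:
        s_next = s + 1
        r = 100.0 if s_next == GOAL else 0.0
        best_next = Q[s_next] if s_next != GOAL else 0.0
        Q[s] = r + GAMMA * best_next  # table update
        s = s_next
    history.append([Q[cell] for cell in range(GOAL)])

# history after each episode:
#   [0.0, 0.0, 100.0]
#   [0.0, 90.0, 100.0]
#   [81.0, 90.0, 100.0]
```

Each pass through the corridor pushes the reward information one cell further back, which is why the table keeps changing for a few iterations before it settles.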

Representation and Generalization
• Policies learned using state transition representations do not generalize to unvisited states
• Functional representations allow for generalization to states not explored, e.g. f(s) = p1·a + p2·a^2 + p3·a^3 ...
• Functional representations might cover search spaces that do not contain the target policy
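
The difference between a table and a functional representation can be illustrated with a small sketch. Everything here is hypothetical: the visited states and their values are made up, and a straight line f(s) = p0 + p1·s fitted by least squares stands in for the more general polynomial form.

```python
def fit_line(points):
    """Least-squares fit of f(s) = p0 + p1 * s to (state, value) pairs."""
    n = len(points)
    mean_s = sum(s for s, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_s = sum((s - mean_s) ** 2 for s, _ in points)
    p1 = sum((s - mean_s) * (v - mean_v) for s, v in points) / var_s
    p0 = mean_v - p1 * mean_s
    return p0, p1


# Values learned only for the states that were visited (assumed numbers).
table = {0: 70.0, 1: 80.0, 3: 100.0}

p0, p1 = fit_line(list(table.items()))


def f(s):
    """Functional representation fitted to the visited states."""
    return p0 + p1 * s


unvisited = 2
table_estimate = table.get(unvisited)  # None: the table has no entry
function_estimate = f(unvisited)       # approximately 90.0
```

The table has nothing to say about state 2, while the fitted function interpolates a value for it. The same sketch also hints at the caveat in the last bullet: if the true values were not close to a straight line, no choice of p0 and p1 would represent them.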

Summary
• Reinforcement learning (RL) is useful for learning policies in un-characterized environments
• RL uses reward from actions taken during exploration
• RL is useful on small state transition spaces
• Functional representations increase the power of RL both in terms of generalization and representation
