Introduction to Reinforcement Learning: Maximizing Rewards in Dynamic Environments

Introduction to Reinforcement Learning Shijiang Lu

What Is Reinforcement Learning • Reinforcement learning (RL) is the problem faced by an agent that must learn, through trial-and-error interaction with a dynamic environment, how to act so as to maximize a scalar reward.

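Agent And Its Environment (figure: the agent sends action a to the environment; the environment's true state sT is mapped by Fs(sT) and Fr(sT) to the state s and reward r that the agent perceives) • a: the agent’s action • sT: the true state of the environment • s: the state of the environment as perceived by the agent • r: the immediate reward perceived by the agent • Fs(sT) and Fr(sT): functions that map sT to s and r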
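The loop in the diagram can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the true state sT is an integer position, Fs reveals only its sign, Fr rewards closeness to a hypothetical target position 3, and the agent acts at random.

```python
import random

class Environment:
    """Toy environment (hypothetical): sT is an integer position on a line."""

    def __init__(self):
        self.true_state = 0  # sT, never shown to the agent directly

    def perceived_state(self):
        # Fs(sT): the agent only sees the sign of the position (-1, 0 or 1).
        return (self.true_state > 0) - (self.true_state < 0)

    def reward(self):
        # Fr(sT): the closer sT is to the assumed target 3, the higher r.
        return -abs(self.true_state - 3)

    def step(self, action):
        # The action a changes the true state; the agent gets back (s, r).
        self.true_state += action
        return self.perceived_state(), self.reward()


class RandomAgent:
    """Placeholder agent: ignores s and r and moves left or right at random."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def act(self, s, r):
        return self.rng.choice([-1, 1])


def run(steps=5):
    env, agent = Environment(), RandomAgent()
    s, r = env.perceived_state(), env.reward()
    history = []
    for _ in range(steps):
        a = agent.act(s, r)      # agent picks a from what it perceives
        s, r = env.step(a)       # environment answers with new s and r
        history.append((a, s, r))
    return history
```

Calling run(5) returns a list of (a, s, r) triples. A learning agent would replace RandomAgent.act with a rule that uses s and r.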

Characteristics of RL • The agent has a goal (or goals) to achieve • The agent can take actions and the agent’s action will affect its environment • The agent learns in a trial and error fashion, i.e., the agent has no teacher and must learn by itself

Characteristics of RL (Cont.) • The agent chooses its action based on its perception of the environment and on its evaluation of how well its needs have been met so far. • The agent may or may not have initial knowledge of its environment. Either way, it must interact with its environment.

Characteristics of RL (Cont.) • The agent may not know everything about the environment, i.e., there can be hidden states that the agent knows nothing about. • The environment may change independently of the agent’s actions.

Characteristics of RL (Cont.) • The environment may be non-deterministic, i.e., when the agent takes the same action in the same state, the environment may respond differently. • The reward for an action may arrive immediately, or it may be delayed, i.e., arrive some time after the agent’s action.

Tradeoff Between Exploration and Exploitation • Exploration: Finding new knowledge by trying new actions, etc. • Exploitation: Using learned knowledge to find the best action. • Tradeoff: Neither exploration nor exploitation alone will yield satisfactory results
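One common way to balance the two is an epsilon-greedy rule: with a small probability the agent tries a random action, and otherwise it takes the action it currently believes is best. The sketch below is an illustration, not something the slides prescribe. It assumes a hypothetical multi-armed bandit whose rewards are Gaussian with unit variance, and the names epsilon_greedy and run_bandit are made up for this example.

```python
import random

def epsilon_greedy(estimates, epsilon, rng):
    """Return an action index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))                      # exploration
    return max(range(len(estimates)), key=lambda i: estimates[i])  # exploitation


def run_bandit(true_means, epsilon, steps=2000, seed=0):
    """Play a bandit with the given (hidden) mean rewards; return what was learned."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)  # learned value of each action
    counts = [0] * len(true_means)       # how often each action was tried
    total = 0.0
    for _ in range(steps):
        a = epsilon_greedy(estimates, epsilon, rng)
        r = rng.gauss(true_means[a], 1.0)               # noisy reward (assumed)
        counts[a] += 1
        estimates[a] += (r - estimates[a]) / counts[a]  # running average
        total += r
    return estimates, counts, total / steps
```

Setting epsilon to 0 gives pure exploitation, which can lock onto a poor action after a few lucky rewards. Setting it to 1 gives pure exploration, which never uses what has been learned. Values in between trade one off against the other.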

Four Components of an RL Agent • Policy π. At each time step, the policy π takes s and r as input and outputs an action a. • Reward function R(s, a). The reward function takes s and a as input and returns a scalar value: the expected immediate reward for taking action a in state s.

Four Components of an RL Agent (Cont.) • Value function V. The expected total return starting from s, given that the agent follows policy π. • Model. The model predicts the behavior of the environment, i.e., for a given s and a, what the immediate reward will be and how the state will change.
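The four components can be seen working together in a small sketch. Everything specific here is an assumption made for illustration: the two states and two actions, the reward numbers, the transition probabilities, and the discount factor gamma used to keep the total return finite. For simplicity the policy looks only at s, and the value function is computed by repeatedly applying the model (iterative policy evaluation).

```python
STATES = ["low", "high"]   # hypothetical states
ACTIONS = ["wait", "work"]  # hypothetical actions


def policy(s, r=None):
    """Policy: maps the perceived state (and optionally r) to an action."""
    return "work" if s == "low" else "wait"


def reward(s, a):
    """Reward function R(s, a): expected immediate reward (made-up numbers)."""
    table = {("low", "wait"): 0.0, ("low", "work"): -1.0,
             ("high", "wait"): 2.0, ("high", "work"): 1.0}
    return table[(s, a)]


def model(s, a):
    """Model: predicted distribution over next states for (s, a)."""
    if a == "work":
        return {"high": 1.0}
    # Waiting keeps "low" as it is; "high" decays to "low" half the time.
    return {"low": 1.0} if s == "low" else {"low": 0.5, "high": 0.5}


def evaluate_policy(gamma=0.9, sweeps=500):
    """Value function V: expected discounted return from s under the policy."""
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        V = {s: reward(s, policy(s))
                + gamma * sum(p * V[s2] for s2, p in model(s, policy(s)).items())
             for s in STATES}
    return V
```

Each sweep replaces V(s) with the immediate reward plus the discounted value of the predicted next state, so after enough sweeps V satisfies V(s) = R(s, a) + gamma * E[V(s')] for the action a the policy picks in s.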

RL for Adaptive Clustering • Actions: changing the clustering algorithm, its parameters, the attributes/features used, etc. • Immediate reward: how good the clustering result is • By using a trial-and-error approach, we can learn which clustering algorithm is best, which attributes/features to choose, etc.
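As a rough sketch of this idea, the code below treats the number of clusters k as the action and a clustering quality score as the immediate reward. Several things are assumptions for illustration only: the data are one-dimensional, the clustering algorithm is a tiny k-means with a fixed evenly spaced initialization, and the quality score (negative within-cluster squared error minus a per-cluster penalty) is a hypothetical stand-in for whatever "how good the clustering result is" means in a real application.

```python
import random

def kmeans_1d(points, k, iters=10):
    """Tiny 1-D k-means; returns the within-cluster sum of squared errors."""
    pts = sorted(points)
    # Deterministic start: centers taken at evenly spaced ranks of the data.
    centers = [pts[int((i + 0.5) * len(pts) / k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centers) for p in pts)


def clustering_reward(points, k, penalty=5.0):
    """Hypothetical quality score: tight clusters are good, extra clusters cost."""
    return -kmeans_1d(points, k) - penalty * k


def learn_k(points, candidate_ks, trials=50, epsilon=0.2, seed=0):
    """Trial and error over k: try every k once, then act epsilon-greedily."""
    rng = random.Random(seed)
    estimates = {k: 0.0 for k in candidate_ks}
    counts = {k: 0 for k in candidate_ks}
    for _ in range(trials):
        untried = [k for k in candidate_ks if counts[k] == 0]
        if untried:
            k = untried[0]
        elif rng.random() < epsilon:
            k = rng.choice(candidate_ks)                        # explore
        else:
            k = max(candidate_ks, key=lambda c: estimates[c])   # exploit
        r = clustering_reward(points, k)
        counts[k] += 1
        estimates[k] += (r - estimates[k]) / counts[k]          # running average
    return max(candidate_ks, key=lambda c: estimates[c]), estimates
```

With nine points in three well-separated groups, for example [0.0, 0.5, 1.0, 10.0, 10.5, 11.0, 20.0, 20.5, 21.0], and candidate values 1 to 4, the learned choice is k = 3. In a real setting the reward would be noisy and the actions could also cover the algorithm and the feature set, as the slide suggests.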

