Introduction to Reinforcement Learning: Q-Learning and evolutions. Numerical Aspects of Artificial Intelligence.
[Diagram, built up over several slides: Reinforcement Learning is placed within Machine Learning (ML), itself within AI; ML also contains supervised learning, unsupervised learning, ANNs and data mining, with RL shown alongside active learning and supervised learning.]
[Diagram, built up over several slides: a learning autonomous agent receives perceptions from the environment and acts on it through actions; it receives a delayed reward and develops new behavior.]
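The interaction loop in the diagram can be sketched in a few lines of Python; the Environment and Agent classes below are hypothetical placeholders chosen only to illustrate the perception/action/reward cycle, not part of any particular library.

import random

class Environment:
    def __init__(self):
        self.state = 0

    def step(self, action):
        """Apply the agent's action; return the new perception and a (possibly delayed) reward."""
        self.state = (self.state + action) % 5
        reward = 1.0 if self.state == 4 else 0.0   # reward only in a "goal" state
        return self.state, reward

class Agent:
    def perceive_and_act(self, perception):
        """Choose an action from the current perception (here: at random)."""
        return random.choice([0, 1])

    def learn(self, perception, action, reward, new_perception):
        pass  # a learning rule (e.g. Q-learning) would go here

env, agent = Environment(), Agent()
perception = env.state
for t in range(100):
    action = agent.perceive_and_act(perception)
    new_perception, reward = env.step(action)
    agent.learn(perception, action, reward, new_perception)
    perception = new_perception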
2-Armed Bandit Problem • 1. Insert a coin. • 2. Pull an arm. • 3. Receive a random reward (0 or $).
Exploration vs. Exploitation [Diagram: which arm to pull, the one with the known payoff ($) or the one whose payoff is still unknown (?)]
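As a concrete illustration of this trade-off, here is a minimal Python sketch of a 2-armed bandit played with an ε-greedy rule; the payout probabilities and the value of ε are made-up assumptions.

import random

payout_prob = [0.3, 0.6]          # unknown to the player
estimates = [0.0, 0.0]            # running estimate of each arm's value
pulls = [0, 0]
epsilon = 0.1                     # fraction of exploratory pulls

for t in range(1000):
    if random.random() < epsilon:                 # explore: random arm
        arm = random.randrange(2)
    else:                                         # exploit: best arm so far
        arm = max(range(2), key=lambda a: estimates[a])
    reward = 1.0 if random.random() < payout_prob[arm] else 0.0
    pulls[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / pulls[arm]   # incremental mean

print(estimates, pulls)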
Q-learning and evolutions • A method of RL introduced by Watkins (89) • The world is modeled as a Markov Decision Process (MDP) • Evolutions: Q(λ)-learning, HQ-learning, Bayesian Q-learning, W-learning, fuzzy learning.
Markov Decision Processes [Diagram, built up over several slides: from a state, the agent takes an action, reaches a new state and receives a reward.]
Markov Decision Processes • Transitions are probabilistic: y and r are drawn from the stationary probability distributions P_xa(y) and P_xa(r). • We have: Σ_r P_xa(r) = 1 and Σ_y P_xa(y) = 1. • The Markov stationary property is expressed as: P(x_{t+1} = y | x_t = x, a_t = a) = P_xa(y) for all t.
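For illustration, such stationary distributions can be stored as plain tables; the tiny MDP below (state names, actions, probabilities) is entirely made up and only shows the normalization constraints.

# P_y[x][a][y] = P_xa(y), P_r[x][a][r] = P_xa(r); all values are illustrative
P_y = {
    "x1": {"a": {"x1": 0.2, "x2": 0.8},
           "b": {"x1": 1.0}},
    "x2": {"a": {"x1": 0.5, "x2": 0.5}},
}
P_r = {
    "x1": {"a": {0.0: 0.8, 1.0: 0.2},
           "b": {0.0: 1.0}},
    "x2": {"a": {0.0: 0.5, 1.0: 0.5}},
}

# Each distribution must sum to 1 (Σ_y P_xa(y) = 1 and Σ_r P_xa(r) = 1).
for P in (P_y, P_r):
    for x, actions in P.items():
        for a, dist in actions.items():
            assert abs(sum(dist.values()) - 1.0) < 1e-9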
Markov Decision Processes • Special case: deterministic worlds: P_xa(y) = 1 if y = y_xa, 0 otherwise; P_xa(r) = 1 if r = r_xa, 0 otherwise. • Special case: different action sets: P_xa(x) = 1 for all actions a in the 'unavailable' set for x (taking an unavailable action leaves the state unchanged).
Markov Decision Processes • Expected reward: when we take action a in state x, the reward we expect to receive is E(r) = Σ_r r P_xa(r). • Typically, r is a function of the transition from x to y. Writing r = r(x,y), the probability of a particular reward is P_xa(r) = Σ_{y | r(x,y) = r} P_xa(y), and the expected reward becomes E(r) = Σ_y r(x,y) P_xa(y). • Rewards r are bounded by r_min and r_max. Hence, for a given x and a, r_min ≤ E(r) ≤ r_max.
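A minimal sketch of the expected-reward computation E(r) = Σ_y r(x,y) P_xa(y), assuming a made-up transition distribution and reward function:

P_xa = {"x1": 0.2, "x2": 0.8}        # P_xa(y) for one fixed (x, a) pair (illustrative)

def r(x, y):
    """Illustrative reward for the transition x -> y."""
    return 1.0 if y == "x2" else 0.0

expected_reward = sum(r("x1", y) * p for y, p in P_xa.items())
print(expected_reward)   # 0.8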
Markov Decision Processes • The task: the agent acts according to a policy π. • Deterministic policy: a unique action a = π(x). • Stochastic policy: a distribution P_x^π, each action a being chosen with probability P_x^π(a). • Stationary or memory-less policy: no concept of time. • Non-stationary policy: the agent must possess memory. • Following a stationary deterministic policy π, at time t the agent observes state x_t, takes action a_t = π(x_t), observes the new state x_{t+1}, and receives reward r_t with expected value E(r_t) = Σ_r r P_{x_t a_t}(r).
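A small sketch contrasting the two kinds of stationary policy; the states, actions and probabilities are illustrative assumptions:

import random

deterministic_policy = {"x1": "a", "x2": "b"}      # a = π(x)

stochastic_policy = {                              # P_x^π(a)
    "x1": {"a": 0.7, "b": 0.3},
    "x2": {"a": 0.1, "b": 0.9},
}

def act(policy, x):
    if isinstance(policy[x], str):                 # deterministic: unique action
        return policy[x]
    actions, probs = zip(*policy[x].items())       # stochastic: sample from P_x^π
    return random.choices(actions, weights=probs)[0]

print(act(deterministic_policy, "x1"), act(stochastic_policy, "x1"))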
Markov Decision Processes [Diagram: delayed reward; rewards received at successive time steps t-1, t, t+1, … are weighted by the discount factor γ.]
Markov Decision Processes • The agent is interested in the total discounted reward: R = r_t + γ r_{t+1} + γ² r_{t+2} + …, where 0 ≤ γ < 1. • Special case γ = 0: we only try to maximize the immediate reward. • A low/high γ means paying little/great attention to the future. • The expected total discounted reward if we follow policy π, starting from x_t, is: V^π(x_t) = E(R) = E(r_t) + γ E(r_{t+1}) + γ² E(r_{t+2}) + … = Σ_r r P_{x_t a_t}(r) + γ Σ_y V^π(y) P_{x_t a_t}(y).
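A minimal sketch of the total discounted reward for an example reward sequence; γ and the rewards are illustrative values:

gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, 1.0]     # r_t, r_{t+1}, ...

# R = Σ_k γ^k * r_{t+k}
R = sum((gamma ** k) * r for k, r in enumerate(rewards))
print(R)   # 0.9**2 + 0.9**4 = 1.4661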
Markov Decision Processes • V^π(x): value of state x under policy π. • The agent must find an optimal policy π* that maximizes the total discounted expected reward. • DP theory assures us of the existence of a stationary and deterministic optimal policy π* for an MDP, which satisfies: V*(x) = max_{b ∈ A} [Σ_r r P_xb(r) + γ Σ_y V*(y) P_xb(y)] for all x. • π* may be non-unique, but V*(x) is unique and is the best that an agent can do from x. • All optimal policies have: V^π*(x) = V*(x).
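Assuming a known model, this optimality equation can be solved by value iteration; the sketch below uses a tiny made-up MDP and is only meant to illustrate the fixed-point computation, not a specific algorithm from the slides.

gamma = 0.9
# transitions[x][a] = list of (probability, next_state, reward) outcomes (illustrative)
transitions = {
    "x1": {"a": [(1.0, "x2", 0.0)],
           "b": [(0.5, "x1", 0.0), (0.5, "x2", 1.0)]},
    "x2": {"a": [(1.0, "x1", 1.0)]},
}

V = {x: 0.0 for x in transitions}
for _ in range(200):                      # repeat until (approximately) converged
    V = {x: max(sum(p * (r + gamma * V[y]) for p, y, r in outcomes)
                for outcomes in actions.values())
         for x, actions in transitions.items()}
print(V)                                  # approximates V*(x) for each state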
Markov Decision Processes • The strategy: build up a Q-value Q(x,a) for each pair (x,a). • The Q-learning agent must find an optimal policy when P_xa(y) and P_xa(r) are initially unknown, and interacts with the world to learn these probabilities. • In 1-step Q-learning, after each experience we observe state y, receive reward r and update: Q(x,a) := r + γ max_{b ∈ A} Q(y,b).
Markov Decision Processes • In the discrete case, where we store each Q(x,a) explicitly in a lookup table, we update: Q(x,a) := (1 - α) Q(x,a) + α (r + γ max_{b ∈ A} Q(y,b)). • 0 ≤ α ≤ 1 indicates the weight given to the new experience. • Start with α = 1, which favours exploration. • Finish with α → 0, which favours exploitation.
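A minimal tabular Q-learning sketch of this update rule; the action set, learning rate, discount factor and ε-greedy exploration are illustrative assumptions:

import random
from collections import defaultdict

gamma, alpha, epsilon = 0.9, 0.5, 0.1
actions = [0, 1]
Q = defaultdict(float)                      # Q[(state, action)], 0 by default

def q_update(x, a, r, y):
    """Apply Q(x,a) := (1-α) Q(x,a) + α (r + γ max_b Q(y,b)) for experience (x, a, r, y)."""
    target = r + gamma * max(Q[(y, b)] for b in actions)
    Q[(x, a)] = (1 - alpha) * Q[(x, a)] + alpha * target

def choose_action(x):
    if random.random() < epsilon:           # occasional exploration
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(x, a)])

# example of a single experience (x, a, r, y), with made-up values
q_update("x1", 0, 1.0, "x2")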
Undirected Exploration • Semi-uniform distribution: with high probability choose the action a that is estimated optimal in s; otherwise choose an action uniformly at random. • Boltzmann law: actions are chosen with probability proportional to e^{Q(s,a)/T}, where T is a temperature parameter.
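A short sketch of both exploration rules; P_best and the temperature T are illustrative values:

import math
import random

def semi_uniform(q_values, p_best=0.8):
    """Pick the estimated-optimal action with probability p_best, else a uniform random one."""
    actions = list(q_values)
    if random.random() < p_best:
        return max(actions, key=lambda a: q_values[a])
    return random.choice(actions)

def boltzmann(q_values, T=0.5):
    """Sample an action with probability proportional to exp(Q(s,a)/T)."""
    actions = list(q_values)
    weights = [math.exp(q_values[a] / T) for a in actions]
    return random.choices(actions, weights=weights)[0]

q = {"left": 0.2, "right": 0.7}
print(semi_uniform(q), boltzmann(q))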
Hierarchical Q-learning • As the complexity of problems scales up, both the size of the state space and the complexity of the reward function increase. • Lin (93) suggests breaking a complex problem into sub-problems and having a collection of Q-learning agents A1, …, An learn the sub-problems. • A single controlling Q-learning agent learns Q(x,i), where i is which agent to choose in state x. • When the creature observes state x, each agent Ai suggests an action ai. The switch chooses a winner k and executes ak, as sketched below.
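A rough sketch of this arrangement (not Lin's actual implementation): each sub-agent proposes an action, and a controlling switch, which learns Q(x,i), decides which proposal to execute. All internals are illustrative placeholders.

import random
from collections import defaultdict

class SubAgent:
    """A Q-learning agent for one sub-problem; internals kept trivial here."""
    def __init__(self):
        self.Q = defaultdict(float)          # its own Q(x, a) table

    def suggest(self, x):
        # Suggest the action it currently estimates best for state x.
        return max([0, 1], key=lambda a: self.Q[(x, a)])

class Switch:
    """The controlling agent: learns Q(x, i), i.e. which sub-agent to obey in state x."""
    def __init__(self, n_agents):
        self.n_agents = n_agents
        self.Q = defaultdict(float)

    def choose(self, x, epsilon=0.1):
        if random.random() < epsilon:                      # occasional exploration
            return random.randrange(self.n_agents)
        return max(range(self.n_agents), key=lambda i: self.Q[(x, i)])

agents = [SubAgent() for _ in range(3)]
switch = Switch(len(agents))
x = "some_state"
k = switch.choose(x)           # the switch picks the winner k ...
action = agents[k].suggest(x)  # ... and the creature executes a_k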
Q-learning and evolutions • A method of RL introduced by Watkins (89) • The world is modeled as a Markov Decision Process (MDP) • Q(λ)-learning (Peng and Williams 96): combines Q-learning (Watkins 89) and TD(λ) (Sutton 88, Tesauro 92). • HQ-learning (Wiering and Schmidhuber 97): a hierarchical extension of Q-learning. • Bayesian Q-learning: uses beliefs and observations. • W-learning: Q-learning with multiple independent agents.
Speeding-up Q(λ)-learning • In discrete Q-learning, we update: Q(x_t,a_t) := (1 - α_k) Q(x_t,a_t) + α_k (r_t + γ max_{b ∈ A} Q(x_{t+1},b)). • Rewrite this as: Q(x_t,a_t) := Q(x_t,a_t) + α_k e_t, with e_t = r_t + γ max_{b ∈ A} Q(x_{t+1},b) - Q(x_t,a_t). • In Q(λ)-learning, we update, for every pair (s,a): Q(s,a) := Q(s,a) + α_k [e_t η_t(s,a) + e'_t ι_t(s,a)], with: e_t = r_t + γ max_{b ∈ A} Q(x_{t+1},b) - Q(x_t,a_t), e'_t = r_t + γ max_{b ∈ A} Q(x_{t+1},b) - max_{b ∈ A} Q(x_t,b). • η_t(s,a) returns 1 if (s,a) occurred at time t, and 0 otherwise. • ι_t(s,a) = γλ (ι_{t-1}(s,a) + η_{t-1}(s,a)) is the eligibility trace.
Speeding-up Q(λ)-learning • Peng and Williams' algorithm for online Q(λ): [pseudocode figure]
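The pseudocode itself is not reproduced here; the following rough sketch (assuming a tabular representation, and not the authors' exact algorithm) shows one online Q(λ) step using the errors e_t, e'_t and the eligibility traces defined above.

from collections import defaultdict

gamma, lam, alpha = 0.9, 0.8, 0.1       # illustrative constants
actions = [0, 1]
Q = defaultdict(float)                  # Q[(s, a)]
trace = defaultdict(float)              # eligibility trace ι(s, a)

def q_lambda_step(x_t, a_t, r_t, x_next):
    max_next = max(Q[(x_next, b)] for b in actions)
    e = r_t + gamma * max_next - Q[(x_t, a_t)]                             # e_t
    e_prime = r_t + gamma * max_next - max(Q[(x_t, b)] for b in actions)   # e'_t
    # Trace term: previously visited pairs get e'_t weighted by ι_t(s,a).
    for sa, iota in list(trace.items()):
        Q[sa] += alpha * e_prime * iota
    # η_t term: the pair actually visited at time t gets the full error e_t.
    Q[(x_t, a_t)] += alpha * e
    # Advance the traces: ι_{t+1}(s,a) = γλ (ι_t(s,a) + η_t(s,a)).
    trace[(x_t, a_t)] += 1.0
    for sa in list(trace):
        trace[sa] *= gamma * lam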
Speeding-up Q(λ)-learning • Notes: • There are other possible variants (e.g. Rummery and Niranjan 94). • There is also a fast Q(λ)-learning algorithm based on the fact that the only Q-values needed at any given time are those of the possible actions in the current state. The algorithm relies on two procedures: • the Local Update procedure calculates exact Q-values only once they are required; • the Global Update procedure updates the global variables and the current Q-values. • Use of "lazy learning" is possible.