Optimal Tuning of Continual Online Exploration in Reinforcement Learning
Youssef Achbany, Francois Fouss, Luh Yen, Alain Pirotte & Marco Saerens
Information Systems Research Unit (ISYS), Université de Louvain, Belgium
Outline • Introduction • Mathematical concepts • Modelling exploration by entropy • Optimal policy • Preliminary experiments • Conclusion and further work
Introduction • One of the challenges of reinforcement learning is managing the trade-off between exploration and exploitation. • Exploitation aims to capitalize on already well-established solutions. • Exploration aims to continually try new ways of solving the problem; it is relevant when the environment is changing.
Introduction • Simple routing problem • The goal is to reach a destination node (13) from an initial node (1) while minimizing costs • Each node has a set of admissible actions, each with an associated weight (cost) • We define a probability distribution on the set of admissible actions
Mathematical concepts • We have a set of states, S = {1, 2, …, n} • s_t = k means that the system is in state k at time t • In each state s = k, we have a set of admissible control actions, U(k) • So that u(k) ∈ U(k) is a control action available at state k
Mathematical concepts • When we choose action u(s_t) at state s_t, • a bounded cost C(u(s_t) | s_t) < ∞ is incurred, • and the system jumps to state s_{t+1} = f(u(s_t) | s_t), where f is a deterministic transition function • We suppose the network of states does not contain any negative cycle
Mathematical concepts • For each state s, we define a probability distribution on the set of admissible actions, P(u(s) | s) • Meaning that the choice of action is randomized • This introduces exploration, not only exploitation • This is the main contribution of our work
Mathematical concepts • For instance, if in state s = k there are three admissible actions, the probability distribution P(u(k) | s = k) involves three values [diagram: state k with actions u_k1, u_k2, u_k3 and their probabilities P(u_k1 | k), P(u_k2 | k), P(u_k3 | k)]
Mathematical concepts • The policy π is defined as the set of all probability distributions for all states
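In symbols (a reconstruction, since the definition appeared as an equation image on the slide): \pi \equiv \{ P(u(k) \mid k) : k = 1, \dots, n \}, with \sum_{i \in U(k)} P(i \mid k) = 1 for each state k.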
Mathematical concepts • The goal is to reach a destination state, s = d, from an initial state, s_0 = k_0, while minimizing the total expected cost • The expectation is taken over the policy, that is, over all the random action choices u(k) associated with the states
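The cost criterion itself was shown as an equation image; a reconstruction from the definitions above (notation assumed) is V^{\pi}(k_0) = \mathrm{E}_{\pi}\!\left[ \sum_{t=0}^{T} C(u(s_t) \mid s_t) \right], where T is the (random) time at which the destination state d is reached, and V^{\pi}(d) = 0.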
Mathematical concepts • In other words, we have to determine the best policy π that minimizes V^π(k_0) • That is, the best probability distributions • This is standard, except for the fact that we introduce choice randomisation
Mathematical concepts • We now introduce a way to control exploration • We introduce the degree of exploration, E_k, defined for each state k • which is the entropy of the probability distribution of actions in this state k
Modelling exploration by entropy • The degree of exploration, E_k, is defined as the entropy at state k • The minimum is 0 (no exploration) • The maximum is log(n_k), where n_k is the number of admissible actions in state k (full exploration)
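The entropy formula did not survive the slide export; the standard Shannon form implied by the stated minimum and maximum is E_k = -\sum_{i \in U(k)} P(i \mid k) \log P(i \mid k).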
Modelling exploration by entropy • The exploration rate is the entropy E_k rescaled to take its value between 0 (no exploration) and 1 (full exploration).
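A natural normalization consistent with these bounds (an assumption here, since the formula was an image) is to divide the entropy by its maximum: exploration rate at k = E_k / \log(n_k).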
Modelling exploration by entropy • The goal now is to determine the optimal policy under exploration constraints • That is, to seek, among all policies, the policy π* • for which the expected cost V^π(k_0) is minimal • while guaranteeing a given degree of exploration (entropy) in each state k
Modelling exploration by entropy • In other words, we solve a constrained minimization problem, • where the E_k are provided/fixed by the user/designer • They control the degree of exploration at each node k
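Written out (reconstructed, since the program appeared as an image): \pi^* = \arg\min_{\pi} V^{\pi}(k_0) subject to -\sum_{i \in U(k)} P(i \mid k) \log P(i \mid k) = E_k for every state k.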
Modelling exploration by entropy • Thus, we route the agents as fast as possible, while exploring the network
Optimal policy • Here are the necessary optimality conditions (for a local minimum), very similar to Bellman's equations • V*(k) is the optimal expected cost from state k • P(i|k) is the probability of choosing action i, satisfying the entropy constraint through the parameter θ_k
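The conditions themselves were displayed as an image; a reconstruction consistent with the two limiting cases discussed below (large θ_k recovering Bellman's equation, θ_k = 0 giving blind uniform exploration) is the Boltzmann (softmax) form: P(i \mid k) = \dfrac{\exp[-\theta_k (C(i \mid k) + V^*(f(i \mid k)))]}{\sum_{j \in U(k)} \exp[-\theta_k (C(j \mid k) + V^*(f(j \mid k)))]}, \qquad V^*(k) = \sum_{i \in U(k)} P(i \mid k)\,[C(i \mid k) + V^*(f(i \mid k))], with V^*(d) = 0 and θ_k adjusted so that the entropy of P(· | k) equals the prescribed E_k.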
Optimal policy • These conditions lead to the following updating rules, iterated until convergence (a small code sketch is given below) • Convergence has been proved in a stationary environment
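As a concrete illustration, here is a minimal Python sketch of this kind of iteration. It assumes the Boltzmann form sketched above, uses hypothetical names (succ, cost, theta, dest), and takes θ_k as given instead of tuning it to meet the entropy constraint E_k, so it is a simplified sketch rather than the authors' exact algorithm.

import numpy as np

def boltzmann_policy(costs, next_values, theta):
    # Probability of each admissible action: softmax of -(C(i|k) + V(successor)).
    # Larger theta -> greedier choice (less exploration); theta = 0 -> uniform choice.
    q = costs + next_values
    q = q - q.min()                      # shift for numerical stability
    w = np.exp(-theta * q)
    return w / w.sum()

def value_iteration_with_exploration(succ, cost, theta, dest, n_iter=200):
    # succ[k]  : successor states of the admissible actions at state k
    # cost[k]  : costs C(i|k) of those actions (np.array)
    # theta[k] : exploration parameter at state k (here fixed; in the paper it
    #            would be tuned so that the policy's entropy matches E_k)
    # dest     : destination state, with V(dest) = 0
    n = len(succ)
    V = np.zeros(n)
    for _ in range(n_iter):
        for k in range(n):
            if k == dest:
                continue
            v_next = np.array([V[s] for s in succ[k]])
            p = boltzmann_policy(cost[k], v_next, theta[k])
            V[k] = float(np.dot(p, cost[k] + v_next))   # expected cost under the policy
    return V

# Tiny usage example on a 3-state chain: 0 -> {1, 2}, 1 -> {2}, destination 2.
succ = {0: [1, 2], 1: [2], 2: []}
cost = {0: np.array([1.0, 4.0]), 1: np.array([1.0]), 2: np.array([])}
theta = {0: 2.0, 1: 2.0, 2: 0.0}
print(value_iteration_with_exploration(succ, cost, theta, dest=2))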
Optimal policy • This updating rule has a nice interpretation: • Route the agents preferably (with probability P(i|k)) to the state from which the expected cost is minimal • Including the direct cost for reaching this state
Optimal policy • If θ_k is large (zero entropy: no exploration), we obtain the common value iteration algorithm, or Bellman's equation, for finding the shortest path
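In that limit, the recursion reduces to the familiar deterministic form (reconstructed notation): V^*(k) = \min_{i \in U(k)} [\, C(i \mid k) + V^*(f(i \mid k)) \,].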
Optimal policy • If θ_k is zero (maximum entropy: full exploration), • we perform a blind exploration: each of the n_k admissible actions in state k is chosen with the same probability • We then estimate the « average first passage time », without taking the costs into consideration when choosing actions
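Concretely (again a reconstruction), the recursion then averages uniformly over the admissible actions, V(k) = \frac{1}{n_k} \sum_{i \in U(k)} [\, C(i \mid k) + V(f(i \mid k)) \,], which, with unit costs, is the average first-passage time to d of a blind random walk.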
Advantages of our algorithm • Our strategy could be interesting when the environment is changing • and there is a need for continual exploration • Indeed, if no exploration is performed, • the agent will not notice the changes unless they occur on the shortest path, • so the policy will not be adjusted • In other words, we propose an optimal exploration/exploitation trade-off
Preliminary experiments • Simple network routing • Dynamic • Uncertain
Preliminary experiments • Exploration rate of 0% for all nodes (no exploration)
Preliminary experiments • Exploration rate of 30% for all nodes
Preliminary experiments • Exploration rate of 60% for all nodes
Preliminary experiments • Exploration rate of 90% for all nodes
Preliminary experiments • Other experimental simulations are provided in: • Tuning continual exploration in reinforcement learning (technical report submitted for publication) • http://www.isys.ucl.ac.be/staff/francois/Articles/Achbany2005a.pdf
Conclusion • In this work, • we presented a model integrating both exploration and exploitation in a common framework. • The exploration rate is controlled by the entropy of the choice probability distribution defined on the states of the system. • When no exploration is performed (zero entropy on each node), the model reduces to the common value iteration algorithm computing the minimum cost policy. • On the other hand, when full exploration is performed (maximum entropy on each node), the model reduces to a "blind" exploration, without considering the costs.
Further work • This model has been extended to • Stochastic shortest-path problems • Discounted problems • Acyclic graphs • Edit-distances between strings • We are also developing links with Q-learning
Thank you!