Hierarchical Reinforcement Learning Ersin Basaran 19/03/2005
Outline • Reinforcement Learning • RL Agent • Policy • Hierarchical Reinforcement Learning • The Need • Sub-Goal Detection • State Clusters • Border States • Continuous State and/or Action Spaces • Options • Macro Q-Learning with Parallel Option Discovery • Experimental Results
Reinforcement Learning • Agent observes the state and takes an action according to its policy • A policy is a function from the state space to the action space • The policy can be deterministic or stochastic • State and action spaces can be discrete, continuous, or hybrid
RL Agent • No model of the environment • Agent observes state s, takes action a, and moves to state s' observing reward r • Agent tries to maximize the total expected reward (return) • Finite state machine view: s --(a, r)--> s'
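A minimal sketch of this observe-act-observe loop, assuming a hypothetical `env` object with `reset`/`step` methods and a tabular Q-learning update (the slides do not specify the learning rule):

```python
import random
from collections import defaultdict

# Tabular Q-values: Q[state][action] -> estimated return (assumed representation)
Q = defaultdict(lambda: defaultdict(float))

def epsilon_greedy(state, actions, eps=0.1):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[state][a])

def run_episode(env, actions, alpha=0.1, gamma=0.99):
    """One agent-environment loop: observe s, take a, observe r and s'."""
    s = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        a = epsilon_greedy(s, actions)
        s_next, r, done = env.step(a)      # hypothetical environment interface
        # One-step Q-learning update toward the observed reward plus bootstrap
        best_next = max(Q[s_next][b] for b in actions)
        Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
        total_reward += r
        s = s_next
    return total_reward
```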
Policy • In a flat RL model, the policy maps each state to a primitive action • Under the optimal policy, the action taken by the agent yields the highest expected return at each step • The policy can be kept in tabular form for small state and action spaces • Function approximators can be used for large (or continuous) state or action spaces
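A hedged illustration of the two representations mentioned above (tabular vs. function approximation); the linear approximator and the feature function `phi` are assumptions, not part of the slides:

```python
import numpy as np

# Tabular policy: explicit state -> action map extracted from Q-values
def greedy_policy_from_table(Q, actions):
    return {s: max(actions, key=lambda a: Q[s][a]) for s in Q}

# Function approximation: Q(s, a) ~ w . phi(s, a), useful for large/continuous spaces
def q_linear(weights, phi, state, action):
    """phi(state, action) is an assumed feature vector; weights are learned."""
    return float(np.dot(weights, phi(state, action)))
```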
The Need For Hierarchical RL • Increases performance • Makes it feasible to apply RL to problems with large action and/or state spaces • Detected sub-goals let the agent define abstract actions on top of the primitive actions • Sub-goals and abstract actions can be reused in different tasks on the same domain, so knowledge is transferred between tasks • The policy of the agent can be translated into natural language
Sub-goal Detection • A sub-goal can be a single state, a subset of the state space, or a constraint on the state space • Reaching a sub-goal should help the agent reach the main goal (i.e., obtain the highest return) • Sub-goals must be discovered by the agent autonomously
State Clusters • The states in a cluster are strongly connected to each other • The number of state transitions between clusters is small • The two states at the ends of a transition between different clusters are sub-goal candidates • Clusters can be hierarchical • Different clusters can belong to the same cluster at a higher level
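A hedged sketch of the cluster-based candidate detection described above, assuming a precomputed `cluster_of` assignment (e.g., from a graph-partitioning step the slides do not detail) and a log of observed transitions:

```python
def subgoal_candidates_from_clusters(transitions, cluster_of):
    """transitions: iterable of (s, s_next) pairs observed by the agent.
    cluster_of: dict mapping each state to its cluster id (assumed given).
    Returns the states at either end of a transition that crosses clusters."""
    candidates = set()
    for s, s_next in transitions:
        if cluster_of[s] != cluster_of[s_next]:
            candidates.add(s)       # exit state of one cluster
            candidates.add(s_next)  # entry state of the other cluster
    return candidates
```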
Border States • Some actions cannot be applied in some states; such states are called border states • Border states are assumed to form a transition sequence: the agent can travel along the border states by taking some actions • Each end of this transition sequence is a candidate sub-goal, assuming the agent has sufficiently explored the environment
Border State Detection • For discrete state and action spaces • F(s): the set of states that can be reached from state s in one time unit • G(s): the set of actions that cause no state transition when applied in state s • H(s): the set of actions that move the agent to a different state when applied in state s
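These sets can be estimated from experience; a minimal sketch, assuming the agent logs (s, a, s') triples:

```python
from collections import defaultdict

def estimate_sets(experience):
    """experience: iterable of (s, a, s_next) triples collected by the agent.
    Returns F, G, H as dicts of sets, following the definitions on the slide."""
    F = defaultdict(set)  # states reachable from s in one step
    G = defaultdict(set)  # actions that leave the state unchanged
    H = defaultdict(set)  # actions that change the state
    for s, a, s_next in experience:
        if s_next == s:
            G[s].add(a)
        else:
            F[s].add(s_next)
            H[s].add(a)
    return F, G, H
```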
Border State Detection • Detect the longest state sequence s_0, s_1, s_2, …, s_{k-1}, s_k that satisfies the following constraints • s_i ∈ F(s_{i+1}) or s_{i+1} ∈ F(s_i) for 0 ≤ i < k • G(s_i) ∩ G(s_{i+1}) ≠ ∅ for 0 < i < k-1 • H(s_0) ∩ G(s_1) ≠ ∅ • H(s_k) ∩ G(s_{k-1}) ≠ ∅ • s_0 and s_k are candidate sub-goals
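A hedged sketch of checking these constraints on a given sequence, using the F, G, H estimates from the previous sketch; the set-intersection reading of the constraints is an assumption, and the search for the longest such sequence is not specified in the slides, so this only verifies a candidate:

```python
def is_border_sequence(seq, F, G, H):
    """Check whether seq = [s_0, ..., s_k] satisfies the border constraints."""
    k = len(seq) - 1
    if k < 1:
        return False
    # Consecutive states must be reachable from one another in one step
    for i in range(k):
        if seq[i] not in F[seq[i + 1]] and seq[i + 1] not in F[seq[i]]:
            return False
    # Interior states share at least one blocked action with their successor
    for i in range(1, k - 1):
        if not (G[seq[i]] & G[seq[i + 1]]):
            return False
    # At each end, some action that moves the agent is blocked one step inside
    if not (H[seq[0]] & G[seq[1]]) or not (H[seq[k]] & G[seq[k - 1]]):
        return False
    return True

def candidate_subgoals(seq):
    """The two ends of a valid border sequence are candidate sub-goals."""
    return seq[0], seq[-1]
```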
Border States in Continuous State and Action Spaces • The environment is assumed to be bounded • State and action vectors can include both continuous and discrete dimensions • The derivative of the state vector with respect to the action vector can be used • Border state regions have small derivatives for some action vectors • A large change in these derivatives indicates a border state region
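One possible (assumed) way to estimate this derivative numerically is a finite difference over the action vector, using a hypothetical transition model or simulator `next_state`:

```python
import numpy as np

def state_action_jacobian(next_state, s, a, eps=1e-3):
    """Finite-difference estimate of d(next state)/d(action) at (s, a).
    next_state(s, a) -> next state vector (assumed simulator or learned model)."""
    base = np.asarray(next_state(s, a), dtype=float)
    a = np.asarray(a, dtype=float)
    jac = np.zeros((base.size, a.size))
    for j in range(a.size):
        a_pert = a.copy()
        a_pert[j] += eps
        jac[:, j] = (np.asarray(next_state(s, a_pert), dtype=float) - base) / eps
    return jac  # near-zero columns suggest a border region for those action dimensions
```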
Options • An option is a policy • It can be local (defined on a subset of the state space) or global • The option policy can use primitive actions or other options, so options are hierarchical • Options are used to reach sub-goals
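A minimal data-structure sketch of an option in this sense; the field names and the termination test are assumptions for illustration:

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Option:
    """An option: a (possibly local) policy used to reach a sub-goal."""
    initiation_set: set                # states where the option may be invoked
    policy: Callable[[Any], Any]       # maps a state to a primitive action or another option
    subgoal: Any                       # reaching it terminates the option (assumed criterion)
    q_values: dict = field(default_factory=dict)   # option-specific Q-values

    def terminates(self, state) -> bool:
        # Assumed termination rule: stop at the sub-goal or outside the initiation set
        return state == self.subgoal or state not in self.initiation_set
```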
Macro Q-Learning with Parallel Option Discovery • The agent starts with no sub-goals and no options • It detects sub-goals and learns the option policies and the main policy simultaneously • Options are added to and removed from the model according to the sub-goal detection algorithm • When a possible sub-goal is detected, a new option is added to the model whose policy is to reach that sub-goal • All option policies are updated in parallel • The agent generates an internal reward when a sub-goal is reached
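A hedged sketch of the option bookkeeping described above, assuming one option (here, one option-specific Q-table) per detected sub-goal and a simple internal reward bonus; the function names and the bonus value are assumptions:

```python
def sync_options(options, detected_subgoals):
    """Keep exactly one option per currently detected sub-goal.
    options: dict mapping sub-goal -> option-specific Q-table (assumed layout)."""
    for g in detected_subgoals:
        options.setdefault(g, {})          # new sub-goal: add an option with an empty Q-table
    for g in list(options):
        if g not in detected_subgoals:     # sub-goal no longer supported by the detector
            del options[g]
    return options

def internal_reward(subgoal, state, external_r, bonus=1.0):
    """Assumed shaping: external reward plus a bonus when the sub-goal is reached."""
    return external_r + (bonus if state == subgoal else 0.0)
```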
Macro Q-Learning with Parallel Option Discovery • An option is defined by the following: O = (π_o, β_o, I_o, Q_o, r_o), where π_o is the option policy, β_o is the termination condition, I_o is the initiation set, Q_o holds the Q-values for the option, and r_o is the internal reward signal associated with the option • The intra-option learning method is used
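A hedged sketch of updating all option Q-tables in parallel from a single experience; this simplified version updates every option for every transition and omits the policy-consistency check of full intra-option learning, and the Q-table layout and bonus are assumptions:

```python
def intra_option_update(options, s, a, r_ext, s_next, alpha=0.1, gamma=0.99, bonus=1.0):
    """One experience (s, a, r_ext, s_next) updates every option's Q-table.
    options: dict mapping sub-goal -> Q-table {state: {action: value}} (assumed layout)."""
    for subgoal, Q in options.items():
        q_s = Q.setdefault(s, {})
        q_next = Q.setdefault(s_next, {})
        # Option-specific reward: external reward plus internal bonus at the sub-goal
        r = r_ext + (bonus if s_next == subgoal else 0.0)
        if s_next == subgoal:
            target = r                                   # option terminates at its sub-goal
        else:
            target = r + gamma * max(q_next.values(), default=0.0)
        q_s[a] = q_s.get(a, 0.0) + alpha * (target - q_s.get(a, 0.0))
```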
Experiments • Comparison of flat RL and hierarchical RL (results figures)