Hierarchical Reinforcement Learning Using Graphical Models
Victoria Manfredi and Sridhar Mahadevan
Rich Representations for Reinforcement Learning, ICML’05 Workshop, August 7, 2005
Introduction
• Abstraction is necessary to scale RL → hierarchical RL
• Want to learn abstractions automatically
• Other approaches
  • Find subgoals: McGovern & Barto’01, Simsek & Barto’04, Simsek, Wolfe, & Barto’05, Mannor et al.’04, …
  • Build a policy hierarchy: Hengst’02
  • Potentially proto-value functions: Mahadevan’05
• Our approach
  • Learn an initial policy hierarchy using a graphical model framework, then learn how to use those policies with reinforcement learning and reward
  • Related to imitation: Price & Boutilier’03, Abbeel & Ng’04
Outline
• Dynamic Abstraction Networks
• Approach
• Experiments
• Results
• Summary
• Future Work
Dynamic Abstraction Network
[Figure: a DAN unrolled over two time slices (t=1, t=2), with a two-level policy hierarchy (P1 over P0, e.g., “Attend ICML’05” over “Register”), a two-level state hierarchy (S1 over S0, e.g., “Bonn” over “Conference Center”), termination nodes F1 and F0, and observation nodes in each slice.]
• HHMM: Fine, Singer, & Tishby’98
• AHMM: Bui, Venkatesh, & West’02
• DAN: Manfredi & Mahadevan’05
• Just one realization of a DAN; others are possible
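To make the structure concrete, here is a minimal sketch of one possible DAN realization written as intra-slice and inter-slice edge lists. The exact arc set is an assumption (the slide stresses this is just one realization); the node names follow the figure above.

```python
# Minimal sketch of one possible DAN realization as a two-slice DBN edge list.
# Node names (P1, P0, S1, S0, F1, F0, Obs) follow the figure; the exact arcs
# are an assumption -- the slide notes this is "just one realization" of a DAN.

# Edges within a single time slice t.
INTRA_SLICE_EDGES = [
    ("P1", "P0"),   # higher-level policy selects the lower-level policy
    ("P0", "Obs"),  # lower-level policy influences the observed action
    ("S1", "S0"),   # abstract state constrains the lower-level state
    ("S0", "Obs"),  # state influences the observation
    ("S0", "F0"),   # termination of the lower-level policy depends on state
    ("S1", "F1"),   # termination of the higher-level policy depends on state
    ("P0", "F0"),
    ("P1", "F1"),
]

# Edges from slice t to slice t+1.
INTER_SLICE_EDGES = [
    ("P1", "P1"),   # persist higher-level policy unless F1 fired
    ("P0", "P0"),   # persist lower-level policy unless F0 fired
    ("F1", "P1"),   # termination gates re-selection of the policy
    ("F0", "P0"),
    ("S1", "S1"),   # abstract state transition
    ("S0", "S0"),   # state transition
    ("P0", "S0"),   # current activity influences the next state
]

if __name__ == "__main__":
    print(f"{len(INTRA_SLICE_EDGES)} intra-slice, {len(INTER_SLICE_EDGES)} inter-slice edges")
```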
Approach
Phase 1
• Hand-code skills for an expert
• Observe the expert’s trajectories
• Learn a DAN from the trajectories using EM
• Design questions: discrete or continuous variables? how many state values? how many levels?
Phase 2
• Extract abstractions from the learned DAN
• Policy improvement over the extracted policies, e.g., with SMDP Q-learning
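A hedged sketch of this two-phase pipeline is below. The helper calls in two_phase_pipeline (learn_dan_with_em, extract_abstract_policies, smdp_q_learning) and the expert/env interface are hypothetical placeholders, not the authors’ code.

```python
# Hedged sketch of the two-phase approach: learn a DAN from expert trajectories
# with EM, then improve a policy over the extracted skills with SMDP Q-learning.
# All helper names below are illustrative placeholders, not the authors' code.

def collect_trajectories(expert, env, n_episodes):
    """Phase 1a: record (flat state, action) pairs from hand-coded expert skills."""
    trajectories = []
    for _ in range(n_episodes):
        state, done, episode = env.reset(), False, []
        while not done:
            action = expert.act(state)          # assumed expert interface
            episode.append((state, action))
            state, done = env.step(action)      # assumed env interface
        trajectories.append(episode)
    return trajectories

def two_phase_pipeline(expert, env, dan_structure):
    # Phase 1b: fit DAN parameters (policy/state hierarchy CPTs) by EM,
    # given only the number of levels and the number of values per node.
    trajectories = collect_trajectories(expert, env, n_episodes=1000)
    dan = learn_dan_with_em(dan_structure, trajectories)       # hypothetical

    # Phase 2: treat the learned abstract policies as temporally extended
    # actions and learn when to invoke them from reward via SMDP Q-learning.
    policies = extract_abstract_policies(dan)                  # hypothetical
    q = smdp_q_learning(env, options=policies)                 # hypothetical
    return dan, q
```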
DANs vs MAXQ/HAMs
• DANs infer the hierarchy from training sequences; what must be specified:
  • # of levels in the state/policy hierarchies
  • # of values for each (abstract) state/policy node
  • Training sequences: (flat state, action) pairs
• MAXQ [Dietterich’00]
  • # of levels, # of tasks at each level
  • Connections between levels
  • Initiation set for each task
  • Termination set for each task
• HAMs [Parr & Russell’98]
  • # of levels
  • Hierarchy of stochastic finite state machines
  • Explicit action, call, choice, stop states
Why Graphical Models?
• Advantages
  • Joint learning of multiple policy/state abstractions
  • Handle continuous and hidden domains
  • Full machinery of inference can be used
• Disadvantages
  • Parameter learning with hidden variables is expensive
  • Expectation-Maximization can get stuck in local maxima
Domain
• Dietterich’s Taxi domain (2000)
• States
  • Taxi Location (TL): 25
  • Passenger Location (PL): 5
  • Passenger Destination (PD): 5
• Actions
  • North, South, East, West
  • Pickup, Putdown
• Hand-coded policies
  • GotoRed, GotoGreen, GotoYellow, GotoBlue
  • Pickup, Putdown
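For concreteness, a minimal sketch of how the factored Taxi state can be flattened to a single index. The variable sizes are taken from the slide (TL: 25, PL: 5, PD: 5); the mixed-radix encoding itself is an illustrative assumption, not the authors’ code.

```python
# Hedged sketch: index the flat Taxi state from its factored components.
# Sizes follow the slide; the encoding is illustrative, not the authors' code.

N_TL, N_PL, N_PD = 25, 5, 5
ACTIONS = ["North", "South", "East", "West", "Pickup", "Putdown"]

def flat_state(tl: int, pl: int, pd: int) -> int:
    """Map (taxi location, passenger location, passenger destination) to one index."""
    assert 0 <= tl < N_TL and 0 <= pl < N_PL and 0 <= pd < N_PD
    return (tl * N_PL + pl) * N_PD + pd

def unflatten(s: int):
    """Inverse of flat_state."""
    s, pd = divmod(s, N_PD)
    tl, pl = divmod(s, N_PL)
    return tl, pl, pd

assert unflatten(flat_state(12, 3, 4)) == (12, 3, 4)
```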
Experiments
Phase 1
• |S1| = 5, |S0| = 25, |P1| = 6, |P0| = 6
• 1000 sequences from an SMDP Q-learner: {TL, PL, PD, A}_1, …, {TL, PL, PD, A}_n
• Learned with the Bayes Net Toolbox (Murphy’01)
Phase 2
• SMDP Q-learning over the learned DAN policies
• Choose a high-level policy π1 using ε-greedy
• Compute the most likely abstract state s0 given TL, PL, PD
• Select an action using Pr(P0 | P1 = π1, S0 = s0)
[Figure: the Taxi DAN, with policy nodes, state nodes S1, S0, termination nodes F1, F0, and observed TL, PL, PD, and Action nodes in each time slice.]
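A minimal sketch of the Phase 2 action-selection loop described above. The probability tables (p_s0_given_obs, p_a_given_p1_s0) and their shapes are illustrative stand-ins for quantities read off the learned DAN, not the authors’ implementation.

```python
import numpy as np

# Hedged sketch of Phase 2: epsilon-greedy choice of the high-level policy,
# MAP abstract state given the observation, then action selection from the
# DAN's conditional Pr(A | P1, S0). Tables below are random placeholders.

rng = np.random.default_rng(0)

N_P1, N_S0, N_ACTIONS = 6, 25, 6
# Pr(S0 | TL, PL, PD), flattened to one table indexed by the flat observation.
p_s0_given_obs = rng.dirichlet(np.ones(N_S0), size=25 * 5 * 5)
# Pr(A | P1, S0): action distribution for each abstract policy and abstract state.
p_a_given_p1_s0 = rng.dirichlet(np.ones(N_ACTIONS), size=(N_P1, N_S0))

def choose_policy(q_values, epsilon=0.1):
    """Epsilon-greedy choice of the high-level DAN policy pi1 for the current state."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def select_action(obs_index, pi1):
    """MAP abstract state given the observation, then sample from Pr(A | pi1, s0)."""
    s0 = int(np.argmax(p_s0_given_obs[obs_index]))
    return int(rng.choice(N_ACTIONS, p=p_a_given_p1_s0[pi1, s0]))

# Example: one decision with uniform Q-values for observation index 0.
a = select_action(0, choose_policy(np.zeros(N_P1)))
```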
Policy Improvement
• The policy learned over the DAN policies performs well
• Each plot is an average over 10 RL runs and 1 EM run
Policy Recognition
[Figure: example trajectory with the initial passenger location, passenger destination, and pickup (PU) / putdown (PD) points marked; segments of the trajectory are labeled by the DAN with abstract policies (Policy 1, …, Policy 6).]
• Can (sometimes!) recognize a specific sequence of actions as composing a single policy
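A hedged sketch of how such recognition can be scored: rank each abstract policy by the log-likelihood it assigns to an observed (abstract state, action) segment. The table p_a_given_policy_s0 is an illustrative stand-in for the corresponding DAN conditional, not the authors’ full DAN inference.

```python
import numpy as np

# Hedged sketch of policy recognition: score each abstract policy by the
# log-likelihood it assigns to an observed (abstract state, action) segment
# and report the most likely one. The table below is a random placeholder.

rng = np.random.default_rng(1)
N_POLICIES, N_S0, N_ACTIONS = 6, 25, 6
p_a_given_policy_s0 = rng.dirichlet(np.ones(N_ACTIONS), size=(N_POLICIES, N_S0))

def recognize_policy(segment):
    """segment: list of (abstract_state, action) pairs from one trajectory piece."""
    log_lik = np.zeros(N_POLICIES)
    for s0, a in segment:
        log_lik += np.log(p_a_given_policy_s0[:, s0, a])
    return int(np.argmax(log_lik)), log_lik

# Example: a short segment of three decisions.
best, scores = recognize_policy([(0, 2), (1, 2), (2, 4)])
```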
Summary
• Two-phase method for automating hierarchical RL using graphical models
• Advantages
  • Limited information needed (# of levels, # of values)
  • Permits continuous and partially observable states/actions
• Disadvantages
  • EM is expensive
  • Needs a mentor
  • The learned abstractions can be hard to decipher (local maxima?)
Future Work
• Approximate inference in DANs
  • Saria & Mahadevan’04: Rao-Blackwellized particle filtering for multi-agent AHMMs
  • Johns & Mahadevan’05: variational inference for AHMMs
• Take advantage of the ability to do inference during the hierarchical RL phase
• Incorporate reward in the DAN
Thank You Questions?
Abstract State Transitions: S0
• Regardless of the abstract P0 policy being executed, abstract S0 states self-transition with high probability
• Depending on the abstract P0 policy, they may alternatively transition to one of a few other abstract S0 states
• Similarly for abstract S1 states and abstract P1 policies
State Abstractions
• The abstract state to which the agent is most likely to transition is a consequence, in part, of the learned state abstractions
Semi-MDP Q-learning
• Update: Q(s,o) ← Q(s,o) + α [ r + γ^k max_{o′∈O} Q(s′,o′) − Q(s,o) ]
• Q(s,o): activity-value for state s and activity o
• α: learning rate
• γ^k: discount rate raised to the number of time steps k that o took
• r: accumulated discounted reward since o began
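A minimal tabular sketch of this update. The table shape, learning-rate and discount values, and the option-execution interface are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of the SMDP Q-learning update above for a tabular agent.
# Sizes and hyperparameters below are illustrative assumptions.

N_STATES, N_OPTIONS = 500, 6
ALPHA, GAMMA = 0.1, 0.95
Q = np.zeros((N_STATES, N_OPTIONS))

def smdp_q_update(s, o, r, k, s_next):
    """One update: r is the discounted reward accumulated while option o ran
    for k time steps from state s, ending in state s_next."""
    target = r + (GAMMA ** k) * np.max(Q[s_next])
    Q[s, o] += ALPHA * (target - Q[s, o])

# Example: option 2 ran for 4 steps from state 10 to state 42, accumulating reward 3.1.
smdp_q_update(10, 2, 3.1, 4, 42)
```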
Abstract State S1 Transitions • Abstract state S1 transitions under abstract policy P1
Expectation-Maximization (EM)
• Hidden variables and unknown parameters
• E(xpectation)-step
  • Assume the parameters are known and compute the conditional expected values of the hidden variables
• M(aximization)-step
  • Assume the hidden variables are observed and compute the maximizing (argmax) parameters
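To illustrate the E/M alternation on a small scale, here is a sketch of EM for a toy mixture of two categorical distributions, standing in for the much larger EM used to fit DAN parameters in the Bayes Net Toolbox. The toy model and data are illustrative, not the paper’s model.

```python
import numpy as np

# Hedged sketch: EM for a toy mixture of categorical distributions.
# Illustrates the E-step / M-step alternation described on the slide.

rng = np.random.default_rng(2)
N_COMPONENTS, N_SYMBOLS = 2, 4
data = rng.integers(0, N_SYMBOLS, size=200)                    # observed symbols
pi = np.full(N_COMPONENTS, 1.0 / N_COMPONENTS)                 # mixing weights
theta = rng.dirichlet(np.ones(N_SYMBOLS), size=N_COMPONENTS)   # emission probs

for _ in range(50):
    # E-step: assume parameters known; compute posterior responsibilities
    # Pr(component | symbol) for every data point.
    lik = theta[:, data] * pi[:, None]            # shape (components, n)
    resp = lik / lik.sum(axis=0, keepdims=True)

    # M-step: treat the expected assignments as observed and re-estimate
    # the maximizing parameters in closed form.
    pi = resp.sum(axis=1) / len(data)
    counts = np.zeros((N_COMPONENTS, N_SYMBOLS))
    for k in range(N_SYMBOLS):
        counts[:, k] = resp[:, data == k].sum(axis=1)
    theta = counts / counts.sum(axis=1, keepdims=True)
```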
Abstract State S0 Transitions • Abstract state S0 transitions under abstract policy P0