Hierarchical Reinforcement Learning Using Graphical Models. Victoria Manfredi and Sridhar Mahadevan. Rich Representations for Reinforcement Learning, ICML’05 Workshop, August 7, 2005.
Introduction
• Abstraction necessary to scale RL → hierarchical RL
• Want to learn abstractions automatically
• Other approaches
    • Find subgoals: McGovern & Barto’01, Simsek & Barto’04, Simsek, Wolfe, & Barto’05, Mannor et al.’04, …
    • Build policy hierarchy: Hengst’02
    • Potentially proto-value functions: Mahadevan’05
• Our approach
    • Learn initial policy hierarchy using graphical model framework, then learn how to use policies using reinforcement learning and reward
    • Related to imitation: Price & Boutilier’03, Abbeel & Ng’04
Outline
• Dynamic Abstraction Networks
• Approach
• Experiments
• Results
• Summary
• Future Work
Dynamic Abstraction Network
[Figure: DAN unrolled over two time slices (t=1, t=2). Policy hierarchy: P1 (example: Attend ICML’05) and P0 (example: Register), with F nodes and an Obs node. State hierarchy: S1 (example: Bonn) and S0 (example: Conference Center), with F1 and F0 nodes and an Obs node.]
• HHMM: Fine, Singer, & Tishby’98
• AHMM: Bui, Venkatesh, & West’02
• DAN: Manfredi & Mahadevan’05
• Just one realization of a DAN; others are possible
Approach
[Flow diagram]
• Phase 1: Expert → Observe Trajectories → Learn DAN using EM
    • Open choices: Discrete variables? Continuous? How many state values? Levels?
• Phase 2: Extract Abstractions → Policy Improvement (e.g., SMDP Q-Learning)
• Additional diagram label: Hand-code Skills
DANs vs MAXQ/HAMs
(Callout on slide: DANs infer from training sequences)
• DANs
    • # of levels in state/policy hierarchies
    • # of values for each (abstract) state/policy node
    • Training sequences: (flat state, action) pairs
• MAXQ [Dietterich’00]
    • # of levels, # of tasks at each level
    • Connections between levels
    • Initiation set for each task
    • Termination set for each task
• HAMs [Parr & Russell’98]
    • # of levels
    • Hierarchy of stochastic finite state machines
    • Explicit action, call, choice, stop states
Why Graphical Models?
• Advantages of Graphical Models
    • Joint learning of multiple policy/state abstractions
    • Continuous/hidden domains
    • Full machinery of inference can be used
• Disadvantages
    • Parameter learning with hidden variables is expensive
    • Expectation-Maximization can get stuck in local maxima
Domain
• Dietterich’s Taxi (2000)
• States
    • Taxi Location (TL): 25
    • Passenger Location (PL): 5
    • Passenger Destination (PD): 5
• Actions
    • North, South, East, West
    • Pickup, Putdown
• Hand-coded policies
    • GotoRed
    • GotoGreen
    • GotoYellow
    • GotoBlue
    • Pickup, Putdown
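To make the size of the flat state space concrete, the sketch below packs the three state variables from this slide into a single index and back. The variable counts (25, 5, 5) are taken from the slide; the packing order and the function names `encode_state` and `decode_state` are assumptions made for illustration, not something the slides specify.

```python
# Counts taken from the Domain slide: TL = 25, PL = 5, PD = 5.
N_TL, N_PL, N_PD = 25, 5, 5
N_STATES = N_TL * N_PL * N_PD  # number of flat (TL, PL, PD) combinations


def encode_state(tl: int, pl: int, pd: int) -> int:
    """Pack (TL, PL, PD) into one flat index (hypothetical layout)."""
    if not (0 <= tl < N_TL and 0 <= pl < N_PL and 0 <= pd < N_PD):
        raise ValueError("state component out of range")
    return (tl * N_PL + pl) * N_PD + pd


def decode_state(index: int) -> tuple[int, int, int]:
    """Invert encode_state: recover (TL, PL, PD) from a flat index."""
    if not 0 <= index < N_STATES:
        raise ValueError("flat index out of range")
    rest, pd = divmod(index, N_PD)
    tl, pl = divmod(rest, N_PL)
    return tl, pl, pd
```

A flat index like this is what a tabular learner would use as a row key; the DAN instead observes TL, PL, and PD as separate variables.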
Experiments
Phase 1
• |S1| = 5, |S0| = 25, |Π1| = 6, |Π0| = 6
• 1000 sequences from SMDP Q-learner: {TL, PL, PD, A}_1, …, {TL, PL, PD, A}_n
• Bayes Net Toolbox (Murphy’01)
Phase 2
• SMDP Q-learning
• Choose policy π1 using ε-greedy
• Compute most likely abstract state s0 given TL, PL, PD
• Select action π0 using Pr(Π0 | Π1 = π1, S0 = s0)
[Figure: Taxi DAN over two time slices. Node labels per slice: Policy, Policy, F, Action, S1, F1, S0, F0; observed variables TL, PL, PD.]
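The three Phase 2 steps can be sketched as follows. This is a minimal illustration under stated assumptions: the posterior over abstract states is taken as a given list (in the slides it would come from inference in the learned DAN), the conditional table `cpt[pi1][s0]` stands in for Pr(Π0 | Π1 = π1, S0 = s0), and all function names and numbers are hypothetical.

```python
import random


def choose_policy(q_row, epsilon, rng):
    """Epsilon-greedy choice of an abstract policy index from a row of Q-values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))  # explore: uniform random policy
    return max(range(len(q_row)), key=lambda i: q_row[i])  # exploit: greedy


def most_likely_abstract_state(posterior):
    """Index of the most probable abstract state s0 given the observations."""
    return max(range(len(posterior)), key=lambda i: posterior[i])


def sample_action(cpt, pi1, s0, rng):
    """Sample a lower-level choice from the distribution cpt[pi1][s0]."""
    probs = cpt[pi1][s0]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical toy numbers, far smaller than the tables in the experiments.
rng = random.Random(0)
q_row = [0.1, 0.9, 0.3]        # Q-values for 3 abstract policies
posterior = [0.2, 0.7, 0.1]    # assumed output of DAN inference
cpt = [[[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]] for _ in q_row]

pi1 = choose_policy(q_row, epsilon=0.1, rng=rng)
s0 = most_likely_abstract_state(posterior)
a = sample_action(cpt, pi1, s0, rng)
```

Using the single most likely abstract state, rather than averaging over the posterior, mirrors the wording of the slide.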
Policy Improvement
• Policy learned over DAN policies performs well
• Each plot is an average over 10 RL runs and 1 EM run
Policy Recognition
[Figure: policy recognition with the DAN. Labels: Initial Passenger Loc, Passenger Dest, PU, PD, Policy 1, Policy 6.]
• Can (sometimes!) recognize a specific sequence of actions as composing a single policy
Summary
• Two-phased method for automating hierarchical RL using graphical models
• Advantages
    • Limited info needed (# of levels, # of values)
    • Permits continuous and partially observable state/actions
• Disadvantages
    • EM is expensive
    • Need mentor
    • Abstractions learned can be hard to decipher (local maxima?)
Future Work
• Approximate inference in DANs
    • Saria & Mahadevan’04: Rao-Blackwellized particle filtering for multi-agent AHMMs
    • Johns & Mahadevan’05: variational inference for AHMMs
• Take advantage of ability to do inference in hierarchical RL phase
• Incorporate reward in DAN
Thank You Questions?
Abstract State Transitions: S0
• Regardless of the abstract P0 policy being executed, abstract S0 states self-transition with high probability
• Depending on the abstract P0 policy, may alternatively transition to one of a few abstract S0 states
• Similarly for abstract S1 states and abstract P1 policies
State Abstractions Abstract state to which agent is most likely to transition is a consequence, in part, of the learned state abstractions
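Semi-MDP Q-learning
• $Q(s,o) \leftarrow Q(s,o) + \alpha \left[ r + \gamma^{k} \max_{o' \in O_{s'}} Q(s',o') - Q(s,o) \right]$
• $Q(s,o)$: activity-value for state $s$ and activity $o$
• $\alpha$: learning rate
• $\gamma^{k}$: discount rate $\gamma$ raised to the number of time steps $k$ that $o$ took
• $r$: accumulated discounted reward since $o$ began
• $s'$: state in which $o$ terminated; $O_{s'}$: activities available in $s'$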
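The update on this slide can be written as a few lines of tabular code. The sketch below is illustrative only: the dictionary-based table, the function names, and the default `alpha` and `gamma` values are assumptions, not details from the experiments.

```python
from collections import defaultdict


def discounted_return(rewards, gamma):
    """Reward accumulated while an activity runs: sum of gamma^i * r_i."""
    return sum((gamma ** i) * r_i for i, r_i in enumerate(rewards))


def smdp_q_update(Q, s, o, r, s_next, k, options_next, alpha=0.1, gamma=0.9):
    """One SMDP Q-learning update after activity o ran for k steps from s.

    r is the discounted reward accumulated since o began; options_next
    lists the activities available in the state s_next where o terminated.
    """
    best_next = max((Q[(s_next, o2)] for o2 in options_next), default=0.0)
    target = r + (gamma ** k) * best_next
    Q[(s, o)] += alpha * (target - Q[(s, o)])
    return Q[(s, o)]


# Hypothetical usage: an activity that ran for 3 steps with reward -1 per step.
Q = defaultdict(float)
Q[("B", "GotoRed")] = 2.0
r = discounted_return([-1.0, -1.0, -1.0], gamma=0.5)
new_value = smdp_q_update(Q, "A", "GotoGreen", r, "B", k=3,
                          options_next=["GotoRed"], alpha=0.5, gamma=0.5)
```

The only difference from one-step Q-learning is that the bootstrap term is discounted by `gamma ** k` rather than `gamma`, because the activity occupied `k` primitive time steps.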
Abstract State S1 Transitions • Abstract state S1 transitions under abstract policy P1
Expectation-Maximization (EM)
• Hidden variables and unknown parameters
• E(xpectation)-step
    • Assume parameters known and compute the conditional expected values for variables
• M(aximization)-step
    • Assume variables observed and compute the argmax parameters
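The two alternating steps can be seen on a toy problem that is much simpler than a DAN: a mixture of two biased coins, where the hidden variable is which coin produced each batch of flips. This is a minimal sketch under assumptions (equal prior weight on the two coins, a fixed number of flips per batch, made-up data); it is not the model or toolbox used in the slides.

```python
def em_two_coins(heads, flips, theta_a, theta_b, iterations=20):
    """EM for two coins with unknown head probabilities theta_a, theta_b.

    heads[i] is the number of heads in batch i of `flips` tosses; which coin
    generated each batch is hidden. Both coins are assumed equally likely.
    """
    for _ in range(iterations):
        # E-step: with parameters fixed, compute expected head/tail counts
        # attributed to each coin via the posterior over the hidden coin.
        heads_a = tails_a = heads_b = tails_b = 0.0
        for h in heads:
            t = flips - h
            like_a = theta_a ** h * (1 - theta_a) ** t
            like_b = theta_b ** h * (1 - theta_b) ** t
            w_a = like_a / (like_a + like_b)  # Pr(coin A | batch)
            w_b = 1.0 - w_a
            heads_a += w_a * h
            tails_a += w_a * t
            heads_b += w_b * h
            tails_b += w_b * t
        # M-step: treat expected counts as observed and take the
        # maximum-likelihood estimate of each coin's head probability.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b


# Hypothetical data: five batches of 10 flips each.
theta_a, theta_b = em_two_coins([5, 9, 8, 4, 7], 10, theta_a=0.6, theta_b=0.5)
```

Different starting values can lead to different answers, which is the local-maxima issue noted elsewhere in the slides.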
Abstract State S0 Transitions • Abstract state S0 transitions under abstract policy P0