
Hierarchical Learning and Planning



  1. Hierarchical Learning and Planning

  2. Outline • Hierarchical RL • Motivation • Learning with partial programs • Concurrency • Hierarchical lookahead • Semantics for high-level actions • Offline and online algorithms

  3. Scaling up • Human life: 20 trillion actions • World: gazillions of state variables

  4. Structured behavior Behavior is usually very structured and deeply nested: • Moving my tongue • Shaping this syllable • Saying this word • Saying this sentence • Making a point about nesting • Explaining structured behavior • Giving this talk

  5. Structured behavior Behavior is usually very structured and deeply nested: • Moving my tongue • Shaping this syllable • Saying this word • Saying this sentence • Making a point about nesting • Explaining structured behavior • Giving this talk Modularity: Choice of tongue motions is independent of almost all state variables, given the choice of word.

  6. Running example • Peasants can move, pickup and dropoff • Penalty for collision • Cost-of-living each step • Reward for dropping off resources • Goal: gather 10 gold + 10 wood • (3L)^n states s • 7^n primitive actions a

  7. Reinforcement Learning [diagram: the learning algorithm receives state s and reward r from the environment, sends back action a, and outputs a policy]

  8. Q-functions • Can represent policies using a Q-function • Qπ(s,a) = “Expected total reward if I do action a in environment state s and follow policy π thereafter” • Q-learning provides a model-free solution method [table: fragment of an example Q-function]
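
The Q-learning update mentioned above, as a minimal tabular Python sketch. The environment interface (reset / actions / step), the hyperparameters, and the epsilon-greedy exploration are illustrative assumptions, not the actual system from the talk.

import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)  # Q[(s, a)] = estimated total reward for doing a in s
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy choice among the primitive actions available in s
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)
            # model-free backup toward r + gamma * max_a' Q(s', a')
            target = r if done else r + gamma * max(Q[(s2, b)] for b in env.actions(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q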

  9. Temporal abstraction in RL • Define temporally extended actions, e.g., “get gold”, “get wood”, “attack unit x”, etc. • Set up a decision process with choice states and extended choice-free actions • Resulting decision problem is semi-Markov (Forestier & Varaiya, 1978) • Partial programs in RL: HAMs (Parr & Russell), Options (Sutton & Precup), MAXQ (Dietterich), ALisp (Andre & Russell), Concurrent ALisp (Marthi, Russell, Latham & Guestrin) • Key advantage: Factored representation of Q-function => faster learning (Dietterich, 2000)

  10. RL and partial programs [diagram: as in slide 7, but the learning algorithm is constrained by a partial program; it receives s and r, sends action a, and outputs a completion of the partial program]

  11. Program state θ includes: program counter, call stack, global variables

  Single-threaded ALisp program:

  (defun top ()
    (loop do
      (choose 'top-choice
        (gather-wood)
        (gather-gold))))

  (defun gather-wood ()
    (with-choice 'forest-choice (dest *forest-list*)
      (nav dest)
      (action 'get-wood)
      (nav *base-loc*)
      (action 'dropoff)))

  (defun gather-gold ()
    (with-choice 'mine-choice (dest *goldmine-list*)
      (nav dest)
      (action 'get-gold)
      (nav *base-loc*)
      (action 'dropoff)))

  (defun nav (dest)
    (until (= (pos (get-state)) dest)
      (with-choice 'nav-choice (move '(N S E W NOOP))
        (action move))))

  12. Q-functions • Represent completions using Q-function • Joint state ω = [s,θ] (env state + program state) • MDP + partial program = SMDP over {ω} • Qπ(ω,u) = “Expected total reward if I make choice u in ω and follow completion π thereafter” • Modified Q-learning [AR 02] finds optimal completion [table: example Q-function]
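
For intuition, here is a minimal Python sketch of an SMDP-style Q-update over joint states ω = [s, θ], where one "step" is the whole choice-free segment the partial program executes between two choice points. It is a generic semi-Markov update, not necessarily the modified Q-learning rule of [AR 02]; Q is assumed to be a defaultdict keyed by (ω, u).

def smdp_q_update(Q, omega, u, rewards, omega2, next_choices, alpha=0.1, gamma=0.99):
    # rewards: primitive rewards collected over the tau steps the partial program
    # ran choice-free between choice state omega and the next choice state omega2
    R = sum(gamma ** k * r for k, r in enumerate(rewards))  # discounted return of the segment
    tau = len(rewards)
    best_next = max((Q[(omega2, u2)] for u2 in next_choices), default=0.0)
    Q[(omega, u)] += alpha * (R + gamma ** tau * best_next - Q[(omega, u)])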

  13. Internal state • Availability of internal state (e.g., goal stack) can greatly simplify value functions and policies • E.g., while navigating to location (x,y), moving towards (x,y) is a good idea • Natural local shaping potential (distance from destination) impossible to express in external terms

  14. Temporal Q-decomposition [diagram: a trajectory's call stacks (Top > GatherGold > Nav(Mine2), later Top > GatherWood > Nav(Forest1)) with the reward along it split into Qr, Qc, and Qe] • Temporal decomposition Q = Qr + Qc + Qe, where • Qr(ω,u) = reward while doing u (may be many steps) • Qc(ω,u) = reward in current subroutine after doing u • Qe(ω,u) = reward after current subroutine

  15. State abstraction • Temporal decomposition => state abstraction • E.g., while navigating, Qc is independent of gold reserves • In general, local Q-components can depend on few variables => fast learning
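
A rough Python sketch of how the decomposition supports state abstraction: each component keeps its own table and conditions only on its own projection of ω, so the navigation Qc can simply ignore gold reserves. The feature-function interface is an assumption for illustration, not the talk's implementation.

from collections import defaultdict

class DecomposedQ:
    """Q(omega, u) = Qr + Qc + Qe, each component over its own abstracted features."""
    def __init__(self, features_r, features_c, features_e):
        self.Qr = defaultdict(float)  # reward while executing choice u
        self.Qc = defaultdict(float)  # reward in the current subroutine after u
        self.Qe = defaultdict(float)  # reward after the current subroutine exits
        # e.g. features_c might project out gold reserves entirely while navigating
        self.fr, self.fc, self.fe = features_r, features_c, features_e

    def value(self, omega, u):
        return (self.Qr[(self.fr(omega), u)]
                + self.Qc[(self.fc(omega), u)]
                + self.Qe[(self.fe(omega), u)])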

  16. Handling multiple effectors [diagram: threads Get-gold, Get-wood, (Defend-Base), Get-wood, each controlling assigned effectors] • Multithreaded agent programs • Threads = tasks • Each effector assigned to a thread • Threads can be created/destroyed • Effectors can be reassigned • Effectors can be created/destroyed

  17. An example single-threaded ALisp program vs. an example Concurrent ALisp program

  Single-threaded ALisp top level:

  (defun top ()
    (loop do
      (choose 'top-choice
        (gather-gold)
        (gather-wood))))

  Concurrent ALisp top level:

  (defun top ()
    (loop do
      (until (my-effectors) (choose 'dummy))
      (setf peas (first (my-effectors)))
      (choose 'top-choice
        (spawn gather-wood peas)
        (spawn gather-gold peas))))

  Subroutines:

  (defun gather-wood ()
    (with-choice 'forest-choice (dest *forest-list*)
      (nav dest)
      (action 'get-wood)
      (nav *base-loc*)
      (action 'dropoff)))

  (defun gather-gold ()
    (with-choice 'mine-choice (dest *goldmine-list*)
      (nav dest)
      (action 'get-gold)
      (nav *base-loc*)
      (action 'dropoff)))

  (defun nav (dest)
    (until (= (my-pos) dest)
      (with-choice 'nav-choice (move '(N S E W NOOP))
        (action move))))

  18. Concurrent ALisp semantics [diagram: two peasant threads stepping through their programs; Peasant 1 runs gather-wood and nav toward Wood1, Peasant 2 runs gather-gold and nav toward Gold2; across environment timesteps 26 and 27 each thread alternates between running, paused, making a joint choice, and waiting for the joint action]

  19. Concurrent Alisp semantics • Threads execute independently until they hit a choice or action • Wait until all threads are at a choice or action • If all effectors have been assigned an action, do that joint action in environment • Otherwise, make joint choice
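
A Python sketch of that synchronization rule, assuming (for simplicity) one effector per thread; the thread and environment interfaces are hypothetical stand-ins for the Concurrent ALisp runtime.

def run_step(threads, env, choose_jointly):
    # 1. Let every thread execute until it blocks at an action or a choice point.
    for t in threads:
        t.run_until_action_or_choice()
    # 2. If every effector has been assigned a primitive action, execute the
    #    joint action in the environment and resume the threads with the outcome.
    acting = [t for t in threads if t.owns_effector]
    if all(t.pending_action is not None for t in acting):
        joint_action = {t.effector: t.pending_action for t in acting}
        obs, reward = env.step(joint_action)
        for t in threads:
            t.resume_after_action(obs, reward)
    # 3. Otherwise, the threads sitting at choice points make a joint choice.
    else:
        choosing = [t for t in threads if t.at_choice]
        for t, c in zip(choosing, choose_jointly(choosing)):
            t.resume_with_choice(c)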

  20. Q-functions • To complete partial program, at each choice state ω, need to specify choices for all choosing threads • So Q(ω,u) as before, except u is a joint choice • Suitable SMDP Q-learning gives optimal completion [table: example Q-function]

  21. Problems with concurrent activities • Temporal decomposition of Q-function lost • No credit assignment among threads • Suppose peasant 1 drops off some gold at base, while peasant 2 wanders aimlessly • Peasant 2 thinks he’s done very well!! • Significantly slows learning as number of peasants increases

  22. Threadwise decomposition • Idea: decompose reward among threads (Russell & Zimdars, 2003) • E.g., rewards for thread j only when peasant j drops off resources or collides with other peasants • Qjπ(ω,u) = “Expected total reward received by thread j if we make joint choice u and then do π” • Threadwise Q-decomposition Q = Q1 + … + Qn • Recursively distributed SARSA => global optimality

  23. Learning threadwise decomposition [diagram: at an action state, the environment reward r is decomposed into r1, r2, r3 and routed to the Peasant 1, 2, 3 threads, while the joint action a goes to the environment]

  24. Learning threadwise decomposition [diagram: at a choice state ω, each peasant thread maintains its component Qj(ω,·) with its own Q-update; the components are summed to Q(ω,·) and the joint choice u = argmax is taken at the top thread]
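
A minimal Python sketch of the scheme on slides 22-24: the joint choice maximizes the sum of the thread components, and each component is updated SARSA-style with that thread's share of the decomposed reward. The table-based representation and argument names are assumptions for illustration.

def joint_choice(Qs, omega, joint_choices):
    # u* = argmax_u sum_j Qj(omega, u): maximize the summed thread components
    return max(joint_choices, key=lambda u: sum(Qj[(omega, u)] for Qj in Qs))

def threadwise_sarsa_update(Qs, omega, u, rewards, omega2, u2, alpha=0.1, gamma=0.99):
    # rewards[j] is the share of the environment reward credited to thread j
    for Qj, rj in zip(Qs, rewards):
        Qj[(omega, u)] += alpha * (rj + gamma * Qj[(omega2, u2)] - Qj[(omega, u)])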

  25. Threadwise and temporal decomposition [diagram: per-thread call stacks for the top thread and peasant threads 1-3] • Qj = Qj,r + Qj,c + Qj,e where • Qj,r(ω,u) = expected reward gained by thread j while doing u • Qj,c(ω,u) = expected reward gained by thread j after u but before leaving current subroutine • Qj,e(ω,u) = expected reward gained by thread j after current subroutine

  26. Resource gathering with 15 peasants [plot: reward of learnt policy vs. number of learning steps (x 1000), comparing Flat, Undecomposed, Threadwise, and Threadwise + Temporal]

  27. Summary of Part I • Structure in behavior seems essential for scaling up • Partial programs • Provide natural structural constraints on policies • Decompose value functions into simple components • Include internal state (e.g., “goals”) that further simplifies value functions, shaping rewards • Concurrency • Simplifies description of multieffector behavior • Messes up temporal decomposition and credit assignment (but threadwise reward decomposition restores it)

  28. Current directions • http://www.cs.berkeley.edu/~bhaskara/alisp/ • Partial observability ([s,θ] is just [θ]) • Complex motor control tasks • Metalevel RL: choice of computation steps • Transfer of learned subroutines to new tasks • Eliminating Qe by recursive construction [UAI06] • Learning new hierarchical structure • Model-based hierarchical lookahead

  29. Abstract Lookahead [diagram: a chess lookahead tree (Kc3, Rxa1, Kxa1, Qa4, c2, …) alongside an abstract tree over high-level choices ‘a’, ‘b’ and their continuations aa, ab, ba, bb] • k-step lookahead >> 1-step lookahead • e.g., chess • k-step lookahead no use if steps are too small • e.g., first k characters of a NIPS paper • this is one small part of a human life, = ~20,000,000,000,000 primitive actions • Abstract plans (high-level executions of partial programs) are shorter • Much shorter plans for given goals => exponential savings • Can look ahead much further into future

  30. High-level actions • Start with restricted classical HTN form of partial program • A high-level action (HLA) • Has a set of possible immediate refinements into sequences of actions, primitive or high-level • Each refinement may have a precondition on its use • Executable plan space = all primitive refinements of Act • To do offline or online planning with HLAs, need a model • See classical HTN planning literature
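
A small Python sketch of the HLA structure just described: named actions with a set of immediate refinements into sequences of primitive or high-level actions, each refinement optionally guarded by a precondition. The class and field names are illustrative, not taken from the talk.

from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Refinement:
    steps: List["HLA"]                                        # sequence of primitive or high-level actions
    precondition: Optional[Callable[[object], bool]] = None   # optional guard on the current state

@dataclass
class HLA:
    name: str
    refinements: List[Refinement] = field(default_factory=list)  # empty list => primitive action

    def is_primitive(self):
        return not self.refinements

    def applicable_refinements(self, state):
        return [r for r in self.refinements
                if r.precondition is None or r.precondition(state)]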

  31. Not • HTN literature describes two desirable properties for hierarchies: • Downward refinement property: Every successful high-level plan has a successful primitive refinement • Upward refinement property: Every failing high-level plan has no successful primitive refinements • HLA descriptions are typically heuristic guidance or aspirational prescriptions; “It is naïve to expect DRP and URP to hold in general.” • Observation: If assertions about HLAs are true, then DRP and URP always hold

  32. Angelic semantics for HLAs [diagram: state space S with start state s0, goal region t, primitive transitions a1-a4, and the reachable sets of HLAs h1 and h2; because the reachable set of the sequence h1 h2 intersects the goal, h1 h2 is a solution] • Start with state-space view, ignore rewards for now • Central idea is reachable set of an HLA from some state • When extended to sequences of actions, allows proving that a plan can or cannot possibly reach the goal • May seem related to nondeterminism • But the nondeterminism is angelic: the “uncertainty” will be resolved by the agent, not an adversary or nature

  33. Technical development • Exact description Eh : S → 2^S specifies reachable set by any primitive refinement of h; generally not concise • Upper and lower descriptions bound the exact reachable set above and below • Still support proofs of plan success/failure • Possibly-successful plans must be refined • Developed NCSTRIPS language for concise descriptions • Developed sound and complete hierarchical planning algorithms using abstract lookahead trees (ALTs)
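
A Python sketch of how upper and lower descriptions support proofs of plan success or failure, assuming each HLA carries its descriptions as functions from a set of states to a set of states (the concise NCSTRIPS/DNF machinery is not reproduced here): if even the lower-bound reachable set meets the goal, the plan provably succeeds; if the upper-bound reachable set misses the goal, it provably fails; otherwise it must be refined.

def reachable(state_set, plan, which):          # which is "upper" or "lower"
    for hla in plan:
        state_set = hla.descriptions[which](state_set)   # progress the set through one HLA
    return state_set

def classify_plan(init_states, plan, goal_states):
    if reachable(init_states, plan, "lower") & goal_states:
        return "provably successful"            # some primitive refinement reaches the goal
    if not (reachable(init_states, plan, "upper") & goal_states):
        return "provably fails"                 # no primitive refinement can reach the goal
    return "possibly successful - refine"       # bounds inconclusive; refine one of its HLAs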

  34. Experiment • Instance 3 • 5x8 world • 90-step plan • Flat/hierarchical without descriptions did not terminate within 10,000 seconds

  35. Example – Warehouse World • Has similarities to blocks and taxi domains, but more choices and constraints • Gripper must stay in bounds • Can’t pass through blocks • Can only turn around at top row • Goal: have C on T4 • Can’t just move directly • Final plan has 22 steps Left, Down, Pickup, Up, Turn, Down, Putdown, Right, Right, Down, Pickup, Left, Put, Up, Left, Pickup, Up, Turn, Right, Down, Down, Putdown

  36. Representing descriptions: NCSTRIPS • Assume S = set of truth assignments to propositions • Descriptions specify propositions (possibly) added/deleted by HLA • An efficient algorithm exists to progress state sets (represented as DNF formulae) through descriptions • Example: Navigate(xt,yt) (Pre: At(xs,ys)) Upper: -At(xs,ys), +At(xt,yt), FacingRight Lower: IF (Free(xt,yt) ∧ Free(x,ymax)): -At(xs,ys), +At(xt,yt), FacingRight; ELSE: nil
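
To illustrate the intended semantics (though not the efficient DNF-based algorithm mentioned above), here is a Python sketch that progresses an explicit set of states through an NCSTRIPS-style description with definite and "possibly" effects; representing states as frozensets of true propositions is an assumption for illustration.

from itertools import chain, combinations

def powerset(props):
    props = list(props)
    return chain.from_iterable(combinations(props, k) for k in range(len(props) + 1))

def progress(states, add, delete, may_add, may_delete):
    # states: set of frozensets of true propositions; add/delete are definite
    # effects; may_add/may_delete are the "possibly" effects a refinement might realise
    out = set()
    for s in states:
        base = (s - delete) | add
        for extra_add in powerset(may_add):         # every way the optional effects
            for extra_del in powerset(may_delete):  # could be resolved
                out.add(frozenset((base - set(extra_del)) | set(extra_add)))
    return out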

  37. Technical development contd. • Extended angelic semantics to account for rewards • Developed novel algorithms that do lookahead • Hierarchical Forward Search • An optimal, offline algorithm • Can prove optimality of plans entirely at high level • Hierarchical Real-Time Search • An online algorithm • Requires only bounded computation per time step • Guaranteed to reach goal in safely explorable spaces • Converges to optimal policy

  38. Hierarchical Forward Search • Construct the initial ALT • Loop • Select the plan with highest upper reward • If this plan is primitive, return it • Otherwise, refine one of its HLAs • Related to A*
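
A Python sketch of this loop in the spirit of A*, with a priority queue keyed by the (negated) upper reward bound. The helpers upper_reward, is_primitive_plan, and refine_one_hla are hypothetical stand-ins for the abstract-lookahead-tree machinery.

import heapq, itertools

def hierarchical_forward_search(initial_plan, upper_reward, is_primitive_plan, refine_one_hla):
    tie = itertools.count()                      # tie-breaker so plans themselves are never compared
    frontier = [(-upper_reward(initial_plan), next(tie), initial_plan)]
    while frontier:
        _, _, plan = heapq.heappop(frontier)     # plan with the highest upper reward bound
        if is_primitive_plan(plan):
            return plan                          # its reward dominates every remaining bound
        for child in refine_one_hla(plan):       # replace one HLA by each of its refinements
            heapq.heappush(frontier, (-upper_reward(child), next(tie), child))
    return None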

  39-46. Intuitive Picture [animation: starting from s0 with the single highest-level plan (Act), the abstract lookahead tree is repeatedly refined, replacing high-level actions with lower-level ones, until a primitive plan reaching t is found]

  47. An online version? [diagram: start state s0 and target t] • Hierarchical Forward Search plans all the way to t before starting to act • Can't act in real time • Not really useful for RL • In stochastic case, must plan for all contingencies • How can we choose a good first action with bounded computation?

  48. Hierarchical Real-Time Search • Loop until at t • Run Hierarchical Forward Search for (at most) k steps • Choose a plan that • begins with a primitive action • has total upper reward ≥ that of the best unrefined plan • has minimal upper reward up to but not including Act • Do the first action of this plan in the world • Do a Bellman backup for the current state (prevents looping) • Related to LRTA*
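
A Python sketch of the outer loop only; the bounded planner, the plan-selection rule spelled out in the bullets above, and the Bellman backup are hypothetical helpers standing in for the ALT machinery.

def hierarchical_real_time_search(state, is_goal, bounded_plan_search,
                                  select_plan, execute, bellman_backup, k=1000):
    V = {}  # learned value estimates, as in LRTA*
    while not is_goal(state):
        # run (at most) k steps of Hierarchical Forward Search from the current state
        candidate_plans = bounded_plan_search(state, max_steps=k)
        # pick a plan that begins with a primitive action, has upper reward >= that of
        # the best unrefined plan, and minimal upper reward up to but not including Act
        plan = select_plan(candidate_plans, V)
        state = execute(plan[0], state)   # act immediately with bounded computation
        bellman_backup(V, state)          # backup for the current state (prevents looping)
    return V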

  49. Preliminary Results

  50. Discussion • High-level lookahead is effective, perhaps essential • Directions for future work: • Extensions to probabilistic/partially observable environments • Better meta-level control • Integration into DP and RL algorithms
