Hierarchical Learning and Planning
Outline • Hierarchical RL • Motivation • Learning with partial programs • Concurrency • Hierarchical lookahead • Semantics for high-level actions • Offline and online algorithms
Scaling up • Human life: 20 trillion actions • World: gazillions of state variables
Structured behavior Behavior is usually very structured and deeply nested: • Moving my tongue • Shaping this syllable • Saying this word • Saying this sentence • Making a point about nesting • Explaining structured behavior • Giving this talk Modularity: Choice of tongue motions is independent of almost all state variables, given the choice of word.
Running example • Peasants can move, pick up, and drop off • Penalty for collision • Cost-of-living each step • Reward for dropping off resources • Goal: gather 10 gold + 10 wood • (3L)^n states s • 7^n primitive actions a
Reinforcement Learning • [Diagram: the learning algorithm observes state s and reward r from the environment, emits action a, and outputs a policy]
Q-functions • Can represent policies using a Q-function • Qπ(s,a) = “Expected total reward if I do action a in environment state s and follow policy π thereafter” • Q-learning provides a model-free solution method • [Table: fragment of an example Q-function]
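A minimal tabular Q-learning sketch, for illustration only; the env interface (reset/step/actions), learning rate, and epsilon-greedy exploration are assumptions, not part of the original slides:

import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Q maps (state, action) pairs to estimated total discounted reward.
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy selection over current Q estimates.
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            # Model-free update: bootstrap from the best next-state value.
            boot = 0.0 if done else max((Q[(s2, a2)] for a2 in env.actions(s2)), default=0.0)
            Q[(s, a)] += alpha * (r + gamma * boot - Q[(s, a)])
            s = s2
    return Q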
Temporal abstraction in RL • Define temporally extended actions, e.g., “get gold”, “get wood”, “attack unit x”, etc. • Set up a decision process with choice states and extended choice-free actions • Resulting decision problem is semi-Markov (Forestier & Varaiya, 1978) • Partial programs in RL: HAMs (Parr & Russell), Options (Sutton & Precup), MAXQ (Dietterich), ALisp (Andre & Russell), Concurrent ALisp (Marthi, Russell, Latham, & Guestrin) • Key advantage: factored representation of Q-function => faster learning (Dietterich, 2000)
RL and partial programs • [Diagram: the agent is specified by a partial program; the learning algorithm observes state s and reward r, emits action a, and outputs a completion of the partial program]
Single-threaded Alisp program • Program state θ includes: program counter, call stack, global variables

(defun top ()
  (loop do
    (choose 'top-choice
      (gather-wood)
      (gather-gold))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

(defun gather-gold ()
  (with-choice 'mine-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))

(defun nav (dest)
  (until (= (pos (get-state)) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP))
      (action move))))
Q-functions • Represent completions using a Q-function • Joint state ω = [s,θ]: env state + program state • MDP + partial program = SMDP over {ω} • Qπ(ω,u) = “Expected total reward if I make choice u in ω and follow completion π thereafter” • Modified Q-learning [AR 02] finds optimal completion • [Table: example Q-function]
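A sketch of the SMDP-style Q-update applied at choice points over joint states ω = [s,θ]; the argument names, the way discounted reward is accumulated between choice points, and the table representation are illustrative assumptions:

def smdp_q_update(Q, omega, u, reward, k, omega_next, next_choices,
                  alpha=0.1, gamma=0.99):
    """One SMDP Q-learning update at a choice point.

    Q            : mapping (omega, u) -> value, e.g. collections.defaultdict(float)
    omega        : current joint state [env state s, program state theta]
    u            : choice made at omega
    reward       : discounted reward accumulated over the k primitive steps
                   executed between this choice point and the next one
    omega_next   : next choice point reached
    next_choices : choices available at omega_next (empty if terminal)
    """
    boot = max((Q[(omega_next, u2)] for u2 in next_choices), default=0.0)
    target = reward + (gamma ** k) * boot
    Q[(omega, u)] += alpha * (target - Q[(omega, u)])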
Internal state • Availability of internal state (e.g., goal stack) can greatly simplify value functions and policies • E.g., while navigating to location (x,y), moving towards (x,y) is a good idea • Natural local shaping potential (distance from destination) impossible to express in external terms
Temporal Q-decomposition • [Diagram: call stack Top → GatherGold → Nav(Mine2) (similarly Top → GatherWood → Nav(Forest1)), with Qr, Qc, Qe labelling segments of the remaining reward Q] • Temporal decomposition Q = Qr + Qc + Qe, where • Qr(ω,u) = reward while doing u (may be many steps) • Qc(ω,u) = reward in current subroutine after doing u • Qe(ω,u) = reward after current subroutine
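A sketch of how the three-way decomposition might be represented and used to make choices; the per-component state-abstraction functions and table representation are assumptions for illustration:

from collections import defaultdict

class DecomposedQ:
    """Q(omega, u) = Qr(omega, u) + Qc(omega, u) + Qe(omega, u),
    each component keyed on its own (smaller) abstraction of omega."""
    def __init__(self, abstract_r, abstract_c, abstract_e):
        self.Qr = defaultdict(float)   # reward while doing u
        self.Qc = defaultdict(float)   # reward in current subroutine after u
        self.Qe = defaultdict(float)   # reward after current subroutine
        self.abs_r, self.abs_c, self.abs_e = abstract_r, abstract_c, abstract_e

    def value(self, omega, u):
        return (self.Qr[(self.abs_r(omega), u)]
                + self.Qc[(self.abs_c(omega), u)]
                + self.Qe[(self.abs_e(omega), u)])

    def best_choice(self, omega, choices):
        # Choices are ranked by the sum of the components, even though each
        # component is learned and stored separately over fewer variables.
        return max(choices, key=lambda u: self.value(omega, u))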
State abstraction • Temporal decomposition => state abstraction, e.g., while navigating, Qc is independent of gold reserves • In general, local Q-components can depend on only a few variables => fast learning
Handling multiple effectors • Multithreaded agent programs • Threads = tasks (e.g., Get-gold, Get-wood, Defend-Base) • Each effector assigned to a thread • Threads can be created/destroyed • Effectors can be reassigned • Effectors can be created/destroyed
An example Concurrent ALisp program (compare with the single-threaded ALisp program above)

(defun top ()
  (loop do
    (until (my-effectors) (choose 'dummy))
    (setf peas (first (my-effectors)))
    (choose 'top-choice
      (spawn gather-wood peas)
      (spawn gather-gold peas))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

(defun gather-gold ()
  (with-choice 'mine-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))

(defun nav (dest)
  (until (= (my-pos) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP))
      (action move))))
Concurrent Alisp semantics • [Figure: two peasant threads (gather-wood for Peasant 1, gather-gold for Peasant 2) stepping through their own nav loops; at each environment timestep a thread is Running, Paused, Waiting for joint action, or Making joint choice]
Concurrent Alisp semantics • Threads execute independently until they hit a choice or action • Wait until all threads are at a choice or action • If all effectors have been assigned an action, do that joint action in environment • Otherwise, make joint choice
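A sketch of this synchronization rule as an execution loop; the thread and environment interfaces (run_to_sync_point, pending_action, effectors, etc.) are assumed names for illustration, not the Concurrent ALisp API:

def concurrent_step(threads, env, choose_jointly):
    # 1. Run every thread until it blocks at an action or a choice.
    for t in threads:
        t.run_to_sync_point()

    # 2. If every effector has been assigned an action, execute the joint action.
    actions = {t.effector: t.pending_action() for t in threads if t.at_action()}
    if len(actions) == len(env.effectors()):
        env.do_joint_action(actions)
        for t in threads:
            if t.at_action():
                t.resume_after_action()
    else:
        # 3. Otherwise, make a joint choice for all threads waiting at a choice.
        choosing = [t for t in threads if t.at_choice()]
        joint_choice = choose_jointly(choosing)   # e.g., argmax of Q(omega, u)
        for t, c in zip(choosing, joint_choice):
            t.resume_with_choice(c)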
Q-functions • To complete partial program, at each choice state ω, need to specify choices for all choosing threads • So Q(ω,u) as before, except u is a joint choice • Suitable SMDP Q-learning gives optimal completion • [Table: example Q-function over joint choices]
Problems with concurrent activities • Temporal decomposition of Q-function lost • No credit assignment among threads • Suppose peasant 1 drops off some gold at base, while peasant 2 wanders aimlessly • Peasant 2 thinks he’s done very well!! • Significantly slows learning as number of peasants increases
Threadwise decomposition • Idea: decompose reward among threads (Russell & Zimdars, 2003) • E.g., rewards for thread j only when peasant j drops off resources or collides with other peasants • Qjπ(ω,u) = “Expected total reward received by thread j if we make joint choice u and then do π” • Threadwise Q-decomposition Q = Q1 + … + Qn • Recursively distributed SARSA => global optimality (a code sketch follows the two diagrams below)
Learning threadwise decomposition • [Diagram: at an action state, the environment reward r is decomposed into r1, r2, r3 and routed to the corresponding peasant threads under the Top thread; action a is taken]
Learning threadwise decomposition • [Diagram: at a choice state ω, each peasant thread does its own Q-update and reports Qj(ω,·); the components are summed to Q(ω,·) and the joint choice u = argmax is made]
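A sketch of the threadwise-decomposed SARSA step: each thread j updates its own component from its own reward share rj, while the joint choice is made greedily with respect to the sum. The reward-decomposition and table interfaces are assumptions for illustration:

def threadwise_sarsa_step(Qs, omega, u, rewards, omega_next, u_next,
                          alpha=0.1, gamma=0.99):
    """Qs      : dict thread_id -> component Q-table Q_j (e.g. defaultdict(float))
       rewards : dict thread_id -> reward share r_j assigned to thread j
       u, u_next : joint choices actually taken at omega and omega_next (SARSA)."""
    for j, Qj in Qs.items():
        target = rewards[j] + gamma * Qj[(omega_next, u_next)]
        Qj[(omega, u)] += alpha * (target - Qj[(omega, u)])

def greedy_joint_choice(Qs, omega, joint_choices):
    # The global Q is the sum of the per-thread components.
    return max(joint_choices,
               key=lambda u: sum(Qj[(omega, u)] for Qj in Qs.values()))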
Threadwise and temporal decomposition • [Diagram: per-thread call stacks for the Top thread and the peasant threads] • Qj = Qj,r + Qj,c + Qj,e, where • Qj,r(ω,u) = expected reward gained by thread j while doing u • Qj,c(ω,u) = expected reward gained by thread j after u but before leaving current subroutine • Qj,e(ω,u) = expected reward gained by thread j after current subroutine
Resource gathering with 15 peasants • [Plot: reward of learnt policy vs. number of learning steps (×1000), comparing Flat, Undecomposed, Threadwise, and Threadwise + Temporal]
Summary of Part I • Structure in behavior seems essential for scaling up • Partial programs • Provide natural structural constraints on policies • Decompose value functions into simple components • Include internal state (e.g., “goals”) that further simplifies value functions, shaping rewards • Concurrency • Simplifies description of multieffector behavior • Messes up temporal decomposition and credit assignment (but threadwise reward decomposition restores it)
Current directions • http://www.cs.berkeley.edu/~bhaskara/alisp/ • Partial observability ([s,θ] is just [θ]) • Complex motor control tasks • Metalevel RL: choice of computation steps • Transfer of learned subroutines to new tasks • Eliminating Qe by recursive construction [UAI06] • Learning new hierarchical structure • Model-based hierarchical lookahead
Abstract Lookahead • [Diagram: a chess lookahead tree (Kc3, Rxa1, Kxa1, Qa4, c2, …) next to a lookahead tree over the first characters of a NIPS paper (‘a’, ‘b’ → aa, ab, ba, bb)] • k-step lookahead >> 1-step lookahead • e.g., chess • k-step lookahead no use if steps are too small • e.g., first k characters of a NIPS paper • this is one small part of a human life, ≈ 20,000,000,000,000 primitive actions • Abstract plans (high-level executions of partial programs) are shorter • Much shorter plans for given goals => exponential savings • Can look ahead much further into future
High-level actions • Start with restricted classical HTN form of partial program • A high-level action (HLA) • Has a set of possible immediate refinements into sequences of actions, primitive or high-level • Each refinement may have a precondition on its use • Executable plan space = all primitive refinements of Act • To do offline or online planning with HLAs, need a model • See classical HTN planning literature
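A sketch of one way to represent an HLA with preconditioned refinements; the field names and the applicable_refinements helper are illustrative assumptions, not the paper’s formulation:

from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class HLA:
    name: str
    # Each refinement: (precondition on the current state, sequence of steps);
    # a step is either another HLA or a primitive action name.
    refinements: List[Tuple[Callable[[object], bool], List[object]]] = field(default_factory=list)

def applicable_refinements(hla: HLA, state) -> List[List[object]]:
    """Immediate refinements of an HLA whose preconditions hold in state."""
    return [steps for pre, steps in hla.refinements if pre(state)]

def is_primitive(step) -> bool:
    return not isinstance(step, HLA)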
HTN literature describes two desirable properties for hierarchies: • Downward refinement property (DRP): every successful high-level plan has a successful primitive refinement • Upward refinement property (URP): every failing high-level plan has no successful primitive refinement • HLA descriptions are typically heuristic guidance or aspirational prescriptions; “It is naïve to expect DRP and URP to hold in general.” Not so: • Observation: if the assertions about HLAs are true, then DRP and URP always hold
Angelic semantics for HLAs • [Diagram: state space S with initial state s0 and goal set t; the reachable sets of HLAs h1, h2 and of the sequence h1h2 show that h1h2 is a solution] • Start with state-space view, ignore rewards for now • Central idea is the reachable set of an HLA from some state • When extended to sequences of actions, allows proving that a plan can or cannot possibly reach the goal • May seem related to nondeterminism • But the nondeterminism is angelic: the “uncertainty” will be resolved by the agent, not an adversary or nature
Technical development • Exact description E_h : S → 2^S specifies reachable set by any primitive refinement of h; generally not concise • Upper and lower descriptions bound the exact reachable set above and below • Still support proofs of plan success/failure • Possibly-successful plans must be refined • Developed NCSTRIPS language for concise descriptions • Developed sound and complete hierarchical planning algorithms using abstract lookahead trees (ALTs)
Experiment • Instance 3 • 5x8 world • 90-step plan • Flat/hierarchical without descriptions did not terminate within 10,000 seconds
Example – Warehouse World • Has similarities to blocks and taxi domains, but more choices and constraints • Gripper must stay in bounds • Can’t pass through blocks • Can only turn around at top row • Goal: have C on T4 • Can’t just move directly • Final plan has 22 steps: Left, Down, Pickup, Up, Turn, Down, Putdown, Right, Right, Down, Pickup, Left, Put, Up, Left, Pickup, Up, Turn, Right, Down, Down, Putdown
Representing descriptions: NCSTRIPS • Assume S = set of truth assignments to propositions • Descriptions specify propositions (possibly) added/deleted by HLA • An efficient algorithm exists to progress state sets (represented as DNF formulae) through descriptions • Example: Navigate(xt, yt) (Pre: At(xs, ys)) • Upper: -At(xs,ys), +At(xt,yt), ±FacingRight • Lower: IF (Free(xt,yt) ∧ ∃x Free(x,ymax)): -At(xs,ys), +At(xt,yt), ±FacingRight; ELSE: nil
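A much-simplified sketch of progressing a DNF state set through an add/delete description; conditional effects and most of full NCSTRIPS are omitted, and the clause representation is an assumption for illustration:

def progress_clause(clause, adds, deletes, maybe_changed=()):
    """clause: dict mapping proposition -> True/False (a conjunction of literals).
    adds/deletes: propositions definitely made true/false by the description.
    maybe_changed: propositions whose value becomes unknown (dropped from the clause)."""
    new = {p: v for p, v in clause.items() if p not in maybe_changed}
    for p in adds:
        new[p] = True
    for p in deletes:
        new[p] = False
    return new

def progress_dnf(dnf, adds, deletes, maybe_changed=()):
    """Progress a DNF state set (a list of clauses) through one description."""
    return [progress_clause(c, adds, deletes, maybe_changed) for c in dnf]

# E.g., for the upper description of Navigate(xt, yt) above:
# adds = {'At(xt,yt)'}, deletes = {'At(xs,ys)'}, maybe_changed = {'FacingRight'}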
Technical development contd. • Extended angelic semantics to account for rewards • Developed novel algorithms that do lookahead • Hierarchical Forward Search • An optimal, offline algorithm • Can prove optimality of plans entirely at high level • Hierarchical Real-Time Search • An online algorithm • Requires only bounded computation per time step • Guaranteed to reach goal in safely explorable spaces • Converges to optimal policy
Hierarchical Forward Search • Construct the initial ALT • Loop • Select the plan with highest upper reward • If this plan is primitive, return it • Otherwise, refine one of its HLAs • Related to A*
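A sketch of this refinement loop with a priority queue ordered by upper reward; upper_reward, is_primitive, and refinements (which refines one HLA of a plan) are assumed interfaces, not the paper’s ALT machinery:

import heapq
import itertools

def hierarchical_forward_search(initial_plan, upper_reward, is_primitive, refinements):
    """initial_plan: e.g., [Act]; refinements(plan) yields the plans obtained by
    refining one HLA of the plan; upper_reward(plan) is an optimistic bound."""
    counter = itertools.count()          # tie-breaker so plans never get compared
    frontier = [(-upper_reward(initial_plan), next(counter), initial_plan)]
    while frontier:
        _, _, plan = heapq.heappop(frontier)      # plan with highest upper reward
        if all(is_primitive(a) for a in plan):
            return plan                           # best-bound plan is primitive
        for child in refinements(plan):           # otherwise refine one of its HLAs
            heapq.heappush(frontier, (-upper_reward(child), next(counter), child))
    return None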
Intuitive Picture • [Sequence of figures: lookahead trees from s0 toward t; starting from the single plan (Act) at the highest level, the selected plan is repeatedly refined down toward primitive actions]
An online version? • Hierarchical Forward Search plans all the way to t before it starts acting • Can’t act in real time • Not really useful for RL • In the stochastic case, must plan for all contingencies • How can we choose a good first action with bounded computation?
Hierarchical Real-Time Search • Loop until at t • Run Hierarchical Forward Search for (at most) k steps • Choose a plan that • begins with a primitive action • has total upper reward ≥ that of the best unrefined plan • has minimal upper reward up to but not including Act • Do the first action of this plan in the world • Do a Bellman backup for the current state (prevents looping) • Related to LRTA*
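A sketch of the online loop; plan_k_steps stands in for a bounded run of Hierarchical Forward Search, the selection criteria follow the slide, and every interface name here is an assumption for illustration:

def hierarchical_real_time_search(env, plan_k_steps, upper_reward,
                                  prefix_upper_reward_before_Act,
                                  starts_with_primitive, at_goal, bellman_backup):
    state = env.current_state()
    while not at_goal(state):
        plans, best_unrefined_bound = plan_k_steps(state)   # bounded lookahead
        # Candidate plans: begin with a primitive action and have total upper
        # reward at least that of the best unrefined plan.
        candidates = [p for p in plans
                      if starts_with_primitive(p)
                      and upper_reward(p) >= best_unrefined_bound]
        # Among these, pick the one whose upper reward up to (but not
        # including) Act is minimal, i.e., the least optimistic prefix.
        plan = min(candidates, key=prefix_upper_reward_before_Act)
        prev = state
        state = env.do(plan[0])          # execute only the first action in the world
        bellman_backup(prev)             # Bellman backup prevents looping (cf. LRTA*)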
Discussion • High-level lookahead is effective, perhaps essential • Directions for future work: • Extensions to probabilistic / partially observable environments • Better meta-level control • Integration into DP and RL algorithms