
Hierarchical Learning and Planning



  1. Hierarchical Learning and Planning

  2. Outline • Hierarchical RL • Motivation • Learning with partial programs • Concurrency • Hierarchical lookahead • Semantics for high-level actions • Offline and online algorithms

  3. Scaling up • Human life: 20 trillion actions • World: gazillions of state variables

  4. Structured behavior Behavior is usually very structured and deeply nested: • Moving my tongue • Shaping this syllable • Saying this word • Saying this sentence • Making a point about nesting • Explaining structured behavior • Giving this talk

  5. Structured behavior Behavior is usually very structured and deeply nested: • Moving my tongue • Shaping this syllable • Saying this word • Saying this sentence • Making a point about nesting • Explaining structured behavior • Giving this talk Modularity: Choice of tongue motions is independent of almost all state variables, given the choice of word.

  6. Running example • Peasants can move, pickup and dropoff • Penalty for collision • Cost-of-living each step • Reward for dropping off resources • Goal: gather 10 gold + 10 wood • (3L)^n states s • 7^n primitive actions a

  7. Reinforcement Learning [diagram: the learning algorithm receives state s and reward r from the environment, sends back action a, and outputs a policy]

  8. Q-functions • Can represent policies using a Q-function • Qπ(s,a) = “Expected total reward if I do action a in environment state s and follow policy π thereafter” • Q-learning provides a model-free solution method [table: fragment of an example Q-function]
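
The Q-learning update mentioned above, as a minimal tabular Python sketch. The environment interface (reset / actions / step), the hyperparameters, and the epsilon-greedy exploration are illustrative assumptions, not the actual system from the talk.

import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)  # Q[(s, a)] = estimated total reward for doing a in s
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy choice among the primitive actions available in s
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)
            # model-free backup toward r + gamma * max_a' Q(s', a')
            target = r if done else r + gamma * max(Q[(s2, b)] for b in env.actions(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q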

  9. Temporal abstraction in RL • Define temporally extended actions, e.g., “get gold”, “get wood”, “attack unit x”, etc. • Set up a decision process with choice states and extended choice-free actions • Resulting decision problem is semi-Markov (Forestier & Varaiya, 1978) • Partial programs in RL: HAMs (Parr & Russell), Options (Sutton & Precup), MAXQ (Dietterich), ALisp (Andre & Russell), Concurrent ALisp (Marthi, Russell, Latham & Guestrin) • Key advantage: Factored representation of Q-function => faster learning (Dietterich, 2000)

  10. RL and partial programs [diagram: as in slide 7, but the learning algorithm is constrained by a partial program; it receives s and r, sends action a, and outputs a completion of the partial program]

  11. Program state θ includes: program counter, call stack, global variables

  Single-threaded ALisp program:

  (defun top ()
    (loop do
      (choose 'top-choice
        (gather-wood)
        (gather-gold))))

  (defun gather-wood ()
    (with-choice 'forest-choice (dest *forest-list*)
      (nav dest)
      (action 'get-wood)
      (nav *base-loc*)
      (action 'dropoff)))

  (defun gather-gold ()
    (with-choice 'mine-choice (dest *goldmine-list*)
      (nav dest)
      (action 'get-gold)
      (nav *base-loc*)
      (action 'dropoff)))

  (defun nav (dest)
    (until (= (pos (get-state)) dest)
      (with-choice 'nav-choice (move '(N S E W NOOP))
        (action move))))

  12. Q-functions • Represent completions using Q-function • Joint state ω = [s,θ] (env state + program state) • MDP + partial program = SMDP over {ω} • Qπ(ω,u) = “Expected total reward if I make choice u in ω and follow completion π thereafter” • Modified Q-learning [AR 02] finds optimal completion [table: example Q-function]
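
For intuition, here is a minimal Python sketch of an SMDP-style Q-update over joint states ω = [s, θ], where one "step" is the whole choice-free segment the partial program executes between two choice points. It is a generic semi-Markov update, not necessarily the modified Q-learning rule of [AR 02]; Q is assumed to be a defaultdict keyed by (ω, u).

def smdp_q_update(Q, omega, u, rewards, omega2, next_choices, alpha=0.1, gamma=0.99):
    # rewards: primitive rewards collected over the tau steps the partial program
    # ran choice-free between choice state omega and the next choice state omega2
    R = sum(gamma ** k * r for k, r in enumerate(rewards))  # discounted return of the segment
    tau = len(rewards)
    best_next = max((Q[(omega2, u2)] for u2 in next_choices), default=0.0)
    Q[(omega, u)] += alpha * (R + gamma ** tau * best_next - Q[(omega, u)])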

  13. Internal state • Availability of internal state (e.g., goal stack) can greatly simplify value functions and policies • E.g., while navigating to location (x,y), moving towards (x,y) is a good idea • Natural local shaping potential (distance from destination) impossible to express in external terms

  14. Temporal Q-decomposition [diagram: a trajectory's call stacks (Top > GatherGold > Nav(Mine2), later Top > GatherWood > Nav(Forest1)) with the reward along it split into Qr, Qc, and Qe] • Temporal decomposition Q = Qr + Qc + Qe, where • Qr(ω,u) = reward while doing u (may be many steps) • Qc(ω,u) = reward in current subroutine after doing u • Qe(ω,u) = reward after current subroutine

  15. State abstraction • Temporal decomposition => state abstraction • E.g., while navigating, Qc is independent of gold reserves • In general, local Q-components can depend on few variables => fast learning
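
A rough Python sketch of how the decomposition supports state abstraction: each component keeps its own table and conditions only on its own projection of ω, so the navigation Qc can simply ignore gold reserves. The feature-function interface is an assumption for illustration, not the talk's implementation.

from collections import defaultdict

class DecomposedQ:
    """Q(omega, u) = Qr + Qc + Qe, each component over its own abstracted features."""
    def __init__(self, features_r, features_c, features_e):
        self.Qr = defaultdict(float)  # reward while executing choice u
        self.Qc = defaultdict(float)  # reward in the current subroutine after u
        self.Qe = defaultdict(float)  # reward after the current subroutine exits
        # e.g. features_c might project out gold reserves entirely while navigating
        self.fr, self.fc, self.fe = features_r, features_c, features_e

    def value(self, omega, u):
        return (self.Qr[(self.fr(omega), u)]
                + self.Qc[(self.fc(omega), u)]
                + self.Qe[(self.fe(omega), u)])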

  16. Handling multiple effectors [diagram: threads Get-gold, Get-wood, (Defend-Base), Get-wood, each controlling assigned effectors] • Multithreaded agent programs • Threads = tasks • Each effector assigned to a thread • Threads can be created/destroyed • Effectors can be reassigned • Effectors can be created/destroyed

  17. An example single-threaded ALisp program vs. an example Concurrent ALisp program

  Single-threaded ALisp top level:

  (defun top ()
    (loop do
      (choose 'top-choice
        (gather-gold)
        (gather-wood))))

  Concurrent ALisp top level:

  (defun top ()
    (loop do
      (until (my-effectors) (choose 'dummy))
      (setf peas (first (my-effectors)))
      (choose 'top-choice
        (spawn gather-wood peas)
        (spawn gather-gold peas))))

  Subroutines:

  (defun gather-wood ()
    (with-choice 'forest-choice (dest *forest-list*)
      (nav dest)
      (action 'get-wood)
      (nav *base-loc*)
      (action 'dropoff)))

  (defun gather-gold ()
    (with-choice 'mine-choice (dest *goldmine-list*)
      (nav dest)
      (action 'get-gold)
      (nav *base-loc*)
      (action 'dropoff)))

  (defun nav (dest)
    (until (= (my-pos) dest)
      (with-choice 'nav-choice (move '(N S E W NOOP))
        (action move))))

  18. Concurrent ALisp semantics [diagram: two peasant threads stepping through their programs; Peasant 1 runs gather-wood and nav toward Wood1, Peasant 2 runs gather-gold and nav toward Gold2; across environment timesteps 26 and 27 each thread alternates between running, paused, making a joint choice, and waiting for the joint action]

  19. Concurrent Alisp semantics • Threads execute independently until they hit a choice or action • Wait until all threads are at a choice or action • If all effectors have been assigned an action, do that joint action in environment • Otherwise, make joint choice
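
A Python sketch of that synchronization rule, assuming (for simplicity) one effector per thread; the thread and environment interfaces are hypothetical stand-ins for the Concurrent ALisp runtime.

def run_step(threads, env, choose_jointly):
    # 1. Let every thread execute until it blocks at an action or a choice point.
    for t in threads:
        t.run_until_action_or_choice()
    # 2. If every effector has been assigned a primitive action, execute the
    #    joint action in the environment and resume the threads with the outcome.
    acting = [t for t in threads if t.owns_effector]
    if all(t.pending_action is not None for t in acting):
        joint_action = {t.effector: t.pending_action for t in acting}
        obs, reward = env.step(joint_action)
        for t in threads:
            t.resume_after_action(obs, reward)
    # 3. Otherwise, the threads sitting at choice points make a joint choice.
    else:
        choosing = [t for t in threads if t.at_choice]
        for t, c in zip(choosing, choose_jointly(choosing)):
            t.resume_with_choice(c)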

  20. Q-functions • To complete partial program, at each choice state ω, need to specify choices for all choosing threads • So Q(ω,u) as before, except u is a joint choice • Suitable SMDP Q-learning gives optimal completion [table: example Q-function]

  21. Problems with concurrent activities • Temporal decomposition of Q-function lost • No credit assignment among threads • Suppose peasant 1 drops off some gold at base, while peasant 2 wanders aimlessly • Peasant 2 thinks he’s done very well!! • Significantly slows learning as number of peasants increases

  22. Threadwise decomposition • Idea: decompose reward among threads (Russell & Zimdars, 2003) • E.g., rewards for thread j only when peasant j drops off resources or collides with other peasants • Qjπ(ω,u) = “Expected total reward received by thread j if we make joint choice u and then do π” • Threadwise Q-decomposition Q = Q1 + … + Qn • Recursively distributed SARSA => global optimality

  23. Learning threadwise decomposition [diagram: at an action state, the environment reward r is decomposed into r1, r2, r3 and routed to the Peasant 1, 2, 3 threads, while the joint action a goes to the environment]

  24. Learning threadwise decomposition [diagram: at a choice state ω, each peasant thread maintains its component Qj(ω,·) with its own Q-update; the components are summed to Q(ω,·) and the joint choice u = argmax is taken at the top thread]
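
A minimal Python sketch of the scheme on slides 22-24: the joint choice maximizes the sum of the thread components, and each component is updated SARSA-style with that thread's share of the decomposed reward. The table-based representation and argument names are assumptions for illustration.

def joint_choice(Qs, omega, joint_choices):
    # u* = argmax_u sum_j Qj(omega, u): maximize the summed thread components
    return max(joint_choices, key=lambda u: sum(Qj[(omega, u)] for Qj in Qs))

def threadwise_sarsa_update(Qs, omega, u, rewards, omega2, u2, alpha=0.1, gamma=0.99):
    # rewards[j] is the share of the environment reward credited to thread j
    for Qj, rj in zip(Qs, rewards):
        Qj[(omega, u)] += alpha * (rj + gamma * Qj[(omega2, u2)] - Qj[(omega, u)])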

  25. Threadwise and temporal decomposition [diagram: per-thread call stacks for the top thread and peasant threads 1-3] • Qj = Qj,r + Qj,c + Qj,e where • Qj,r(ω,u) = expected reward gained by thread j while doing u • Qj,c(ω,u) = expected reward gained by thread j after u but before leaving current subroutine • Qj,e(ω,u) = expected reward gained by thread j after current subroutine

  26. Resource gathering with 15 peasants [plot: reward of learnt policy vs. number of learning steps (x 1000), comparing Flat, Undecomposed, Threadwise, and Threadwise + Temporal]

  27. Summary of Part I • Structure in behavior seems essential for scaling up • Partial programs • Provide natural structural constraints on policies • Decompose value functions into simple components • Include internal state (e.g., “goals”) that further simplifies value functions, shaping rewards • Concurrency • Simplifies description of multieffector behavior • Messes up temporal decomposition and credit assignment (but threadwise reward decomposition restores it)

  28. Current directions • http://www.cs.berkeley.edu/~bhaskara/alisp/ • Partial observability ([s,θ] is just [θ]) • Complex motor control tasks • Metalevel RL: choice of computation steps • Transfer of learned subroutines to new tasks • Eliminating Qe by recursive construction [UAI06] • Learning new hierarchical structure • Model-based hierarchical lookahead

  29. Abstract Lookahead [diagram: a chess lookahead tree (Kc3, Rxa1, Kxa1, Qa4, c2, …) alongside an abstract tree over high-level choices ‘a’, ‘b’ and their continuations aa, ab, ba, bb] • k-step lookahead >> 1-step lookahead • e.g., chess • k-step lookahead no use if steps are too small • e.g., first k characters of a NIPS paper • this is one small part of a human life, = ~20,000,000,000,000 primitive actions • Abstract plans (high-level executions of partial programs) are shorter • Much shorter plans for given goals => exponential savings • Can look ahead much further into future

  30. High-level actions • Start with restricted classical HTN form of partial program • A high-level action (HLA) • Has a set of possible immediate refinements into sequences of actions, primitive or high-level • Each refinement may have a precondition on its use • Executable plan space = all primitive refinements of Act • To do offline or online planning with HLAs, need a model • See classical HTN planning literature
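
A small Python sketch of the HLA structure just described: named actions with a set of immediate refinements into sequences of primitive or high-level actions, each refinement optionally guarded by a precondition. The class and field names are illustrative, not taken from the talk.

from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Refinement:
    steps: List["HLA"]                                        # sequence of primitive or high-level actions
    precondition: Optional[Callable[[object], bool]] = None   # optional guard on the current state

@dataclass
class HLA:
    name: str
    refinements: List[Refinement] = field(default_factory=list)  # empty list => primitive action

    def is_primitive(self):
        return not self.refinements

    def applicable_refinements(self, state):
        return [r for r in self.refinements
                if r.precondition is None or r.precondition(state)]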

  31. Not • HTN literature describes two desirable properties for hierarchies: • Downward refinement property: Every successful high-level plan has a successful primitive refinement • Upward refinement property: Every failing high-level plan has no successful primitive refinements • HLA descriptions are typically heuristic guidance or aspirational prescriptions; “It is naïve to expect DRP and URP to hold in general.” • Observation: If assertions about HLAs are true, then DRP and URP always hold

  32. Angelic semantics for HLAs [diagram: state space S with start state s0, goal region t, primitive transitions a1-a4, and the reachable sets of HLAs h1 and h2; because the reachable set of the sequence h1 h2 intersects the goal, h1 h2 is a solution] • Start with state-space view, ignore rewards for now • Central idea is reachable set of an HLA from some state • When extended to sequences of actions, allows proving that a plan can or cannot possibly reach the goal • May seem related to nondeterminism • But the nondeterminism is angelic: the “uncertainty” will be resolved by the agent, not an adversary or nature

  33. Technical development • Exact description Eh : S → 2^S specifies reachable set by any primitive refinement of h; generally not concise • Upper and lower descriptions bound the exact reachable set above and below • Still support proofs of plan success/failure • Possibly-successful plans must be refined • Developed NCSTRIPS language for concise descriptions • Developed sound and complete hierarchical planning algorithms using abstract lookahead trees (ALTs)
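
A Python sketch of how upper and lower descriptions support proofs of plan success or failure, assuming each HLA carries its descriptions as functions from a set of states to a set of states (the concise NCSTRIPS/DNF machinery is not reproduced here): if even the lower-bound reachable set meets the goal, the plan provably succeeds; if the upper-bound reachable set misses the goal, it provably fails; otherwise it must be refined.

def reachable(state_set, plan, which):          # which is "upper" or "lower"
    for hla in plan:
        state_set = hla.descriptions[which](state_set)   # progress the set through one HLA
    return state_set

def classify_plan(init_states, plan, goal_states):
    if reachable(init_states, plan, "lower") & goal_states:
        return "provably successful"            # some primitive refinement reaches the goal
    if not (reachable(init_states, plan, "upper") & goal_states):
        return "provably fails"                 # no primitive refinement can reach the goal
    return "possibly successful - refine"       # bounds inconclusive; refine one of its HLAs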

  34. Experiment • Instance 3 • 5x8 world • 90-step plan • Flat/hierarchical without descriptions did not terminate within 10,000 seconds

  35. Example – Warehouse World • Has similarities to blocks and taxi domains, but more choices and constraints • Gripper must stay in bounds • Can’t pass through blocks • Can only turn around at top row • Goal: have C on T4 • Can’t just move directly • Final plan has 22 steps Left, Down, Pickup, Up, Turn, Down, Putdown, Right, Right, Down, Pickup, Left, Put, Up, Left, Pickup, Up, Turn, Right, Down, Down, Putdown

  36. Representing descriptions: NCSTRIPS • Assume S = set of truth assignments to propositions • Descriptions specify propositions (possibly) added/deleted by HLA • An efficient algorithm exists to progress state sets (represented as DNF formulae) through descriptions • Example: Navigate(xt,yt) (Pre: At(xs,ys)) Upper: -At(xs,ys), +At(xt,yt), FacingRight Lower: IF (Free(xt,yt) ∧ Free(x,ymax)): -At(xs,ys), +At(xt,yt), FacingRight; ELSE: nil
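
To illustrate the intended semantics (though not the efficient DNF-based algorithm mentioned above), here is a Python sketch that progresses an explicit set of states through an NCSTRIPS-style description with definite and "possibly" effects; representing states as frozensets of true propositions is an assumption for illustration.

from itertools import chain, combinations

def powerset(props):
    props = list(props)
    return chain.from_iterable(combinations(props, k) for k in range(len(props) + 1))

def progress(states, add, delete, may_add, may_delete):
    # states: set of frozensets of true propositions; add/delete are definite
    # effects; may_add/may_delete are the "possibly" effects a refinement might realise
    out = set()
    for s in states:
        base = (s - delete) | add
        for extra_add in powerset(may_add):         # every way the optional effects
            for extra_del in powerset(may_delete):  # could be resolved
                out.add(frozenset((base - set(extra_del)) | set(extra_add)))
    return out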

  37. Technical development contd. • Extended angelic semantics to account for rewards • Developed novel algorithms that do lookahead • Hierarchical Forward Search • An optimal, offline algorithm • Can prove optimality of plans entirely at high level • Hierarchical Real-Time Search • An online algorithm • Requires only bounded computation per time step • Guaranteed to reach goal in safely explorable spaces • Converges to optimal policy

  38. Hierarchical Forward Search • Construct the initial ALT • Loop • Select the plan with highest upper reward • If this plan is primitive, return it • Otherwise, refine one of its HLAs • Related to A*
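
A Python sketch of this loop in the spirit of A*, with a priority queue keyed by the (negated) upper reward bound. The helpers upper_reward, is_primitive_plan, and refine_one_hla are hypothetical stand-ins for the abstract-lookahead-tree machinery.

import heapq, itertools

def hierarchical_forward_search(initial_plan, upper_reward, is_primitive_plan, refine_one_hla):
    tie = itertools.count()                      # tie-breaker so plans themselves are never compared
    frontier = [(-upper_reward(initial_plan), next(tie), initial_plan)]
    while frontier:
        _, _, plan = heapq.heappop(frontier)     # plan with the highest upper reward bound
        if is_primitive_plan(plan):
            return plan                          # its reward dominates every remaining bound
        for child in refine_one_hla(plan):       # replace one HLA by each of its refinements
            heapq.heappush(frontier, (-upper_reward(child), next(tie), child))
    return None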

  39-46. Intuitive Picture [animation: starting from s0 with the single highest-level plan (Act), the abstract lookahead tree is repeatedly refined, replacing high-level actions with lower-level ones, until a primitive plan reaching t is found]

  47. An online version? [diagram: start state s0 and target t] • Hierarchical Forward Search plans all the way to t before starting to act • Can't act in real time • Not really useful for RL • In stochastic case, must plan for all contingencies • How can we choose a good first action with bounded computation?

  48. Hierarchical Real-Time Search • Loop until at t • Run Hierarchical Forward Search for (at most) k steps • Choose a plan that • begins with a primitive action • has total upper reward ≥ that of the best unrefined plan • has minimal upper reward up to but not including Act • Do the first action of this plan in the world • Do a Bellman backup for the current state (prevents looping) • Related to LRTA*
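
A Python sketch of the outer loop only; the bounded planner, the plan-selection rule spelled out in the bullets above, and the Bellman backup are hypothetical helpers standing in for the ALT machinery.

def hierarchical_real_time_search(state, is_goal, bounded_plan_search,
                                  select_plan, execute, bellman_backup, k=1000):
    V = {}  # learned value estimates, as in LRTA*
    while not is_goal(state):
        # run (at most) k steps of Hierarchical Forward Search from the current state
        candidate_plans = bounded_plan_search(state, max_steps=k)
        # pick a plan that begins with a primitive action, has upper reward >= that of
        # the best unrefined plan, and minimal upper reward up to but not including Act
        plan = select_plan(candidate_plans, V)
        state = execute(plan[0], state)   # act immediately with bounded computation
        bellman_backup(V, state)          # backup for the current state (prevents looping)
    return V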

  49. Preliminary Results

  50. Discussion • High-level lookahead is effective, perhaps essential • Directions for future work: • Extensions to probabilistic/partially observable environments • Better meta-level control • Integration into DP and RL algorithms
