Explore the complexities of decision-making in a vast world of possibilities. Understand Markov Decision Processes and Reinforcement Learning to create optimal policies for intelligent agents. Discover the power of hierarchical structures and abstract actions in enhancing the learning process. Dive into the world of Partial Programs to harness the potential for building intelligent systems efficiently.
Life
• 100 years × 365 days × 24 hrs × 3600 seconds × 640 muscles × 10/second ≈ 20 trillion actions
• (Not to mention choosing brain activities!)
• The world has a very large, partially observable state space and an unknown model
• So, being intelligent is “provably” hard
• How on earth do we manage?
Hierarchical structure!! (among other things)
• Deeply nested behaviors give modularity
  • E.g., tongue controls for [t] are independent of everything else given the decision to say “tongue”
• High-level decisions give scale
  • E.g., “go to IJCAI 13” ≈ 3,000,000,000 actions
  • Look further ahead, reduce computation exponentially
[NIPS 97, ICML 99, NIPS 00, AAAI 02, ICML 03, IJCAI 05, ICAPS 07, ICAPS 08, ICAPS 10, IJCAI 11]
Reminder: MDPs
• Markov decision process (MDP) defined by:
  • Set of states S
  • Set of actions A
  • Transition model P(s' | s, a), Markovian, i.e., it depends only on s_t and not on the history
  • Reward function R(s, a, s')
  • Discount factor γ
• The optimal policy π* maximizes the value function V^π = E[ Σ_t γ^t R(s_t, π(s_t), s_{t+1}) ]
• Q(s, a) = Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V(s') ]
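For concreteness, here is a minimal tabular sketch of these definitions in Python; the class name and the dictionary-based representation of P and R are illustrative choices, not anything from the talk.

```python
class MDP:
    """Tabular MDP: states S, actions A, transition model P, rewards R, discount gamma."""
    def __init__(self, states, actions, P, R, gamma):
        self.states = states    # iterable of states S
        self.actions = actions  # iterable of actions A
        self.P = P              # P[(s, a)] is a dict {s2: probability}
        self.R = R              # R(s, a, s2) -> immediate reward
        self.gamma = gamma      # discount factor

    def q_from_v(self, V, s, a):
        """Q(s,a) = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))."""
        return sum(p * (self.R(s, a, s2) + self.gamma * V[s2])
                   for s2, p in self.P[(s, a)].items())
```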
Reminder: policy invariance
• MDP M with reward function R(s, a, s')
• For any potential function Φ(s), any policy that is optimal for M is also optimal for any version M' with reward function R'(s, a, s') = R(s, a, s') + γΦ(s') - Φ(s)
• So we can provide hints to speed up RL by shaping the reward function
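A potential-based shaping term is easy to add around any reward function; the sketch below just wraps R as in the formula above (the function names are illustrative).

```python
def shaped_reward(R, phi, gamma):
    """Return R'(s,a,s') = R(s,a,s') + gamma*phi(s') - phi(s) for a potential function phi.
    Optimal policies are preserved, so phi can encode hints that speed up RL."""
    def R_prime(s, a, s2):
        return R(s, a, s2) + gamma * phi(s2) - phi(s)
    return R_prime
```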
Reminder: Bellman equation
• Bellman equation for a fixed policy π:
  V^π(s) = Σ_{s'} P(s' | s, π(s)) [ R(s, π(s), s') + γ V^π(s') ]
• Bellman equation for the optimal value function:
  V(s) = max_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V(s') ]
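As a worked example of the optimal Bellman equation, a few lines of value iteration suffice; this assumes the illustrative MDP class sketched earlier.

```python
def value_iteration(mdp, tol=1e-6):
    """Iterate V(s) <- max_a sum_{s'} P(s'|s,a)(R(s,a,s') + gamma*V(s')) to convergence."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            best = max(mdp.q_from_v(V, s, a) for a in mdp.actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```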
Reminder: Reinforcement learning
• Agent observes transitions and rewards: doing a in s leads to s' with reward r
• Temporal-difference (TD) learning:
  V(s) ← (1 - α) V(s) + α [ r + γ V(s') ]
• Q-learning:
  Q(s, a) ← (1 - α) Q(s, a) + α [ r + γ max_{a'} Q(s', a') ]
• SARSA (after observing the next action a'):
  Q(s, a) ← (1 - α) Q(s, a) + α [ r + γ Q(s', a') ]
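The three update rules above translate directly into tabular code; the sketch below assumes dictionary-valued V and Q tables and is purely illustrative.

```python
def td_update(V, s, r, s2, alpha, gamma):
    """TD(0): V(s) <- (1-alpha)V(s) + alpha[r + gamma V(s')]."""
    V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s2])

def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma):
    """Q-learning: bootstrap from the greedy action in s'."""
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    """SARSA: bootstrap from the action a' actually taken in s'."""
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * Q[(s2, a2)])
```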
Reminder: RL issues
• Credit assignment problem: rewards may arrive many steps after the decisions that earned them
• Generalization problem: far too many states to learn a value for each one separately
Abstract states
• First try: define an abstract MDP on abstract states -- “at home,” “at SFO,” “at IJCAI,” etc.
• Oops – lost the Markov property! The action outcome depends on the low-level state, which depends on the history
Abstract actions
• Forestier & Varaiya, 1978: supervisory control
  • An abstract action is a fixed policy for a region of the state space
  • “Choice states” are where the policy exits its region
• The decision process on choice states is semi-Markov
Semi-Markov decision processes
• Add a random duration N for action executions: P(s', N | s, a)
• Modified Bellman equation:
  V^π(s) = Σ_{s',N} P(s', N | s, π(s)) [ R(s, π(s), s', N) + γ^N V^π(s') ]
• Modified Q-learning with abstract actions:
  • For transitions within an abstract action π_a: r_c ← r_c + γ_c R(s, π_a(s), s') ; γ_c ← γ_c γ
  • For transitions that terminate an abstract action begun at s_0:
    Q(s_0, π_a) ← (1 - α) Q(s_0, π_a) + α [ r_c + γ_c max_{a'} Q(s', π_{a'}) ]
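A sketch of the bookkeeping this implies: while an abstract action runs, accumulate the discounted reward r_c and discount γ_c; when it terminates, apply one SMDP Q-update. All names here are illustrative.

```python
def accumulate(r_c, gamma_c, r, gamma):
    """Per primitive step inside an abstract action: r_c <- r_c + gamma_c*r ; gamma_c <- gamma_c*gamma."""
    return r_c + gamma_c * r, gamma_c * gamma

def smdp_q_update(Q, s0, a, r_c, gamma_c, s_end, abstract_actions, alpha):
    """On termination of abstract action a begun at s0, update toward r_c + gamma_c * max_a' Q(s_end, a')."""
    target = r_c + gamma_c * max(Q[(s_end, a2)] for a2 in abstract_actions)
    Q[(s0, a)] = (1 - alpha) * Q[(s0, a)] + alpha * target
```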
Hierarchical abstract machines
• Hierarchy of NDFAs; states come in 4 types:
  • Action states execute a physical action
  • Call states invoke a lower-level NDFA
  • Choice points nondeterministically select the next state
  • Stop states return control to the invoking NDFA
• NDFA internal transitions are dictated by the percept
Running example
• Peasants can move, pickup and dropoff
• Penalty for collision
• Cost-of-living each step
• Reward for dropping off resources
• Goal: gather 10 gold + 10 wood
• ≈ (3L)^n states s and 7^n primitive actions a (for n peasants)
Joint SMDP
• Physical states S, machine states Θ, joint state space Ω = S × Θ
• Joint MDP:
  • Transition model for the physical state (from the physical MDP) and the machine state (as defined by the HAM)
  • A(ω) is fixed except when θ is at a choice point
• Reduction to an SMDP: states are just those where θ is at a choice point
Hierarchical optimality
• Theorem: the optimal solution to the SMDP is hierarchically optimal for the original MDP, i.e., it maximizes value among all behaviors consistent with the HAM
Partial programs
• HAMs (and Dietterich’s MAXQ hierarchies and Sutton’s Options) are all partially constrained descriptions of behaviors with internal state
• A partial program is like an ordinary program in any programming language, with a nondeterministic choice operator added to allow for ignorance about what to do
• HAMs, MAXQs, and Options are trivially expressible as partial programs
• Standard MDP: loop forever { choose an action }
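The sketch below is one way to read “partial program”: ordinary control flow plus a choose operator whose resolution is supplied by a learned completion. The names (choose, completion, env) are assumptions for illustration, not the ALisp API.

```python
def choose(label, options, completion, omega):
    """Nondeterministic choice point: the learned completion picks among the options."""
    return completion(label, omega, options)

def standard_mdp_program(env, completion):
    """'Standard MDP: loop forever { choose an action }' written as a partial program."""
    omega = env.reset()
    while True:
        a = choose('act', env.actions, completion, omega)
        omega = env.step(a)
```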
RL and partial programs
[Figure: the learning algorithm receives states and rewards (s, r) from the environment and emits actions a, subject to the constraints of the partial program; its output is a completion of the program]
• Hierarchically optimal for all terminating programs
Single-threaded ALisp program
• Program state θ includes: program counter, call stack, global variables

(defun top ()
  (loop do
    (choose 'top-choice
      (gather-wood)
      (gather-gold))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

(defun gather-gold ()
  (with-choice 'mine-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))

(defun nav (dest)
  (until (= (pos (get-state)) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP))
      (action move))))
Q-functions
• Represent completions using a Q-function
• Joint state ω = [s, θ] (environment state + program state)
• MDP + partial program = SMDP over {ω}
• Q^π(ω, u) = “expected total reward if I make choice u in ω and follow completion π thereafter”
• Modified Q-learning [AR 02] finds the optimal completion
Internal state
• Availability of internal state (e.g., the goal stack) can greatly simplify value functions and policies
• E.g., while navigating to location (x, y), moving towards (x, y) is a good idea
• The natural local shaping potential (distance from the destination) is impossible to express in external terms
Temporal Q-decomposition
[Figure: a trajectory’s call stacks (Top; Top → GatherGold → Nav(Mine2); Top → GatherWood → Nav(Forest1)) with the reward stream split into the segments Qr, Qc, and Qe]
• Temporal decomposition Q = Qr + Qc + Qe, where
  • Qr(ω, u) = reward while doing u (may be many steps)
  • Qc(ω, u) = reward in the current subroutine after doing u
  • Qe(ω, u) = reward after the current subroutine
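At a choice point, the decomposed Q-value is just the sum of the learned components; a minimal sketch with dictionary-valued component tables (purely illustrative):

```python
def decomposed_q(Qr, Qc, Qe, omega, u):
    """Q(omega, u) = Qr(omega, u) + Qc(omega, u) + Qe(omega, u)."""
    return Qr[(omega, u)] + Qc[(omega, u)] + Qe[(omega, u)]

def best_choice(Qr, Qc, Qe, omega, choices):
    """Pick the choice that maximizes the summed components."""
    return max(choices, key=lambda u: decomposed_q(Qr, Qc, Qe, omega, u))
```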
State abstraction
• Temporal decomposition => state abstraction
  • E.g., while navigating, Qc is independent of gold reserves
• In general, local Q-components can depend on few variables => fast learning
Handling multiple effectors
[Figure: a multithreaded agent program with threads such as Get-gold, Get-wood, and (Defend-Base), each controlling one or more effectors]
• Multithreaded agent programs:
  • Threads = tasks
  • Each effector is assigned to a thread
  • Threads can be created/destroyed
  • Effectors can be reassigned
  • Effectors can be created/destroyed
An example single-threaded ALisp program

(defun top ()
  (loop do
    (choose 'top-choice
      (gather-gold)
      (gather-wood))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

An example Concurrent ALisp program

(defun top ()
  (loop do
    (until (my-effectors) (choose 'dummy))
    (setf peas (first (my-effectors)))
    (choose 'top-choice
      (spawn gather-wood peas)
      (spawn gather-gold peas))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

(defun gather-gold ()
  (with-choice 'mine-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))

(defun nav (dest)
  (until (= (my-pos) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP))
      (action move))))
Concurrent ALisp semantics
[Figure: two peasant threads at environment timesteps 26 and 27, each running its own nav/gather subroutine; at any moment a thread is running, paused, waiting for the joint action, or making a joint choice]
Concurrent ALisp semantics
• Threads execute independently until they hit a choice or action
• Wait until all threads are at a choice or action
• If all effectors have been assigned an action, do that joint action in the environment
• Otherwise, make a joint choice
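A rough sketch of this execution loop, under an assumed thread interface (run_to_sync, pending-action and choice-point accessors); it is not the Concurrent ALisp implementation, just the synchronization pattern described above.

```python
def concurrent_step(threads, env, completion):
    # Run every thread until it reaches its next action or choice point.
    for t in threads:
        t.run_to_sync()
    if all(t.has_pending_action() for t in threads):
        # Every effector has an assigned action: execute the joint action.
        joint_action = {e: a for t in threads for e, a in t.pending_actions().items()}
        env.do(joint_action)
    else:
        # Otherwise, the choosing threads make a joint choice via the completion.
        choosing = [t for t in threads if t.at_choice_point()]
        joint_choice = completion.select(env.state(), choosing)
        for t in choosing:
            t.resume_with(joint_choice[t])
```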
Q-functions
• To complete the partial program, at each choice state ω we need to specify choices for all choosing threads
• So Q(ω, u) is defined as before, except u is a joint choice
• Suitable SMDP Q-learning gives the optimal completion
[Figure: an example Q-function]
Problems with concurrent activities
• Temporal decomposition of the Q-function is lost
• No credit assignment among threads
  • Suppose peasant 1 drops off some gold at the base, while peasant 2 wanders aimlessly
  • Peasant 2 thinks he’s done very well!!
• Significantly slows learning as the number of peasants increases
Threadwise decomposition
• Idea: decompose the reward among threads (Russell & Zimdars, 2003)
  • E.g., rewards for thread j only when peasant j drops off resources or collides with other peasants
• Q_j^π(ω, u) = “expected total reward received by thread j if we make joint choice u and then do π”
• Threadwise Q-decomposition Q = Q1 + … + Qn
• Recursively distributed SARSA => global optimality
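A sketch of the resulting update: each thread j maintains its own component Q_j and updates it from its own reward r_j via SARSA, while the joint choice maximizes the sum of the components (all names illustrative).

```python
def threadwise_sarsa_update(Qs, omega, u, rewards, omega2, u2, alpha, gamma):
    """Qs[j] is thread j's component table; rewards[j] is thread j's share of the reward."""
    for j, Qj in enumerate(Qs):
        target = rewards[j] + gamma * Qj[(omega2, u2)]
        Qj[(omega, u)] = (1 - alpha) * Qj[(omega, u)] + alpha * target

def joint_choice(Qs, omega, choices):
    """Choose u maximizing the summed components Q(omega, u) = sum_j Qj(omega, u)."""
    return max(choices, key=lambda u: sum(Qj[(omega, u)] for Qj in Qs))
```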
Learning threadwise decomposition
[Figure: at an action state, the action a is executed and the environment reward r is decomposed into per-thread rewards r1, r2, r3 for the peasant threads]
Learning threadwise decomposition
[Figure: at a choice state ω, each peasant thread performs a Q-update on its own component Q1(ω,·), Q2(ω,·), Q3(ω,·); the components are summed and the joint choice is u = argmax of Q(ω,·)]
Threadwise and temporal decomposition
• Q_j = Q_{j,r} + Q_{j,c} + Q_{j,e}, where
  • Q_{j,r}(ω, u) = expected reward gained by thread j while doing u
  • Q_{j,c}(ω, u) = expected reward gained by thread j after u but before leaving the current subroutine
  • Q_{j,e}(ω, u) = expected reward gained by thread j after the current subroutine
Resource gathering with 15 peasants
[Figure: reward of the learnt policy vs. number of learning steps (×1000), comparing flat, undecomposed, threadwise, and threadwise + temporal learners]
Summary of Part I
• Structure in behavior seems essential for scaling up
• Partial programs:
  • Provide natural structural constraints on policies
  • Decompose value functions into simple components
  • Include internal state (e.g., “goals”) that further simplifies value functions, shaping rewards
• Concurrency:
  • Simplifies the description of multieffector behavior
  • Messes up temporal decomposition and credit assignment (but threadwise reward decomposition restores it)
Outline
• Hierarchical RL with abstract actions
• Hierarchical RL with partial programs
• Efficient hierarchical planning
Abstract Lookahead
[Figures: a chess lookahead tree (moves such as Rxa1, Kxa1, Qa4, Kc3, c2) and a character-level tree over strings ‘a’, ‘b’, ‘aa’, ‘ab’, ‘ba’, ‘bb’]
• k-step lookahead >> 1-step lookahead, e.g., chess
• k-step lookahead is no use if the steps are too small, e.g., the first k characters of a NIPS paper
• Abstract plans (high-level executions of partial programs) are shorter
  • Much shorter plans for given goals => exponential savings
  • Can look ahead much further into the future
High-level actions
[Figure: a refinement tree in which [Act] refines to [GoSFO, Act] or [GoOAK, Act], and [GoSFO, Act] refines further to [WalkToBART, BARTtoSFO, Act]]
• Start with a restricted, classical HTN form of partial program
• A high-level action (HLA):
  • Has a set of possible immediate refinements into sequences of actions, primitive or high-level
  • Each refinement may have a precondition on its use
• Executable plan space = all primitive refinements of Act
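One simple way to encode such an HLA library (the class and field names below are illustrative): each HLA carries a list of refinements, each a precondition plus a sequence of primitive or high-level steps.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple, Union

@dataclass
class HLA:
    name: str
    # Each refinement: (precondition on the current state, sequence of steps).
    refinements: List[Tuple[Callable[[object], bool], List["Step"]]] = field(default_factory=list)

Step = Union[str, HLA]  # a primitive action name or another HLA

# E.g., Act might refine to [GoSFO, Act] or [GoOAK, Act], and GoSFO to
# [WalkToBART, BARTtoSFO], mirroring the refinement tree sketched above.
```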
Lookahead with HLAs
• To do offline or online planning with HLAs, we need a model
• See the classical HTN planning literature – not!
• Drew McDermott (AI Magazine, 2000): “The semantics of hierarchical planning have never been clarified … no one has ever figured out how to reconcile the semantics of hierarchical plans with the semantics of primitive actions”
Downward refinement
• Downward refinement property (DRP): every apparently successful high-level plan has a successful primitive refinement
• “Holy Grail” of HTN planning: allows commitment to abstract plans without backtracking
• Bacchus and Yang, 1991: it is naïve to expect the DRP to hold in general
• MRW, 2007: Theorem: if the assertions made about HLAs are true, then the DRP always holds
• Problem: how to say true things about the effects of HLAs?
Angelic semantics for HLAs
[Figure: reachable sets in the state space, starting from s0 with goal g; the reachable set of the sequence h1 h2 contains g, so h1 h2 is a solution]
• Begin with the atomic state-space view: start in s0, goal state g
• The central idea is the reachable set of an HLA from each state
• When extended to sequences of actions, this allows proving that a plan can or cannot possibly reach the goal
• May seem related to nondeterminism
  • But the nondeterminism is angelic: the “uncertainty” will be resolved by the agent, not by an adversary or nature
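A sketch of the optimistic half of this idea: if each HLA is described by an (optimistic) reachable-set function, composing those descriptions along a high-level plan and checking whether the goal is reachable tells us whether the plan can possibly succeed. The reachable_sets mapping is an assumed input for illustration, not part of the talk’s notation.

```python
def plan_possibly_reaches(plan, s0, goal, reachable_sets):
    """plan: sequence of HLA names; reachable_sets[h](s) returns the set of states
    some refinement of h can reach from state s (an optimistic description)."""
    reachable = {s0}
    for h in plan:
        reachable = set().union(*(reachable_sets[h](s) for s in reachable))
        if not reachable:
            return False
    return goal in reachable
```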