280 likes | 363 Views
Hierarchical Lookahead Agents: A Preliminary Report. Bhaskara Marthi Stuart Russell Jason Wolfe. High-level actions. For our purposes, a high-level action (HLA) Has a set of possible immediate refinements into sequences of actions, primitive or high-level
E N D
Hierarchical Lookahead Agents:A Preliminary Report Bhaskara Marthi Stuart Russell Jason Wolfe
High-level actions • For our purposes, a high-level action (HLA) • Has a set of possible immediate refinements into sequences of actions, primitive or high-level • Each refinement may have a precondition on its use • Many of the actions we do are high-level • Go to work • Write a NIPS paper • This definition generalizes others we’ve seen • Options don’t provide a choice of refinements • MAXQ tasks can’t represent sequencing constraints
Rxa1 c2 c2 Kxa1 Qa4 ‘a’ ‘b’ a b ‘b’ ‘a’ ‘b’ ‘a’ aa ab ba bb Abstract Lookahead • k-step lookahead >> 1-step lookahead • e.g., chess • k-step lookahead no use if steps are too small • e.g., first k characters of a NIPS paper • this is one small part of a human life, = ~20,000,000,000,000 primitive actions Kc3 • Abstract plans with HLAs are shorter • Much shorter plans for given goals => exponential savings • Can look ahead much further into future • Requires a model (i.e., transition and reward fns) for HLAs • No suitable models to be found in HRL or HTN literature • We proposed angelic semantics [MRW 07], which we extend here
S S S s0 s0 s0 a4 a1 a3 a2 t t t Angelic semantics for HLAs • Models HLAs in deterministic domains without rewards • Central idea is reachable set of an HLA from some state • When extended to sequences of actions, allows proving that a plan can or cannot possibly reach the goal • May seem related to nondeterminism • But the nondeterminism is angelic: the “uncertainty” will be resolved by the agent, not an adversary or nature State space h1h2is a solution h2 h1 h1h2 h2 Goal
Contributions • Extended angelic semantics to account for rewards • Developed novel algorithms that do lookahead • Hierarchical Forward Search • An optimal, offline algorithm • Can prove optimality of plans entirely at high level • Hierarchical Real-Time Search • An online algorithm • Requires only bounded computation per time step • Significantly outperforms previous ``flat’’ lookahead algorithms • Both algorithms require three inputs: • a deterministic MDP model • an action hierarchy (HLAs) • models for HLAs (based on angelic semantics)
S S -5 s0 s0 3 -8 -1 -4 t t Transition & reward functions for action a1 Deterministic MDPs • A deterministic MDP consists of: • State space S • Initial state s02 S, (w.l.o.g.) single terminal state t S • Primitive action set A • Transition function f : S x A ! S • Reward function r : S x A !R • Undiscounted, assume no non-negative-reward cycles
Example – Warehouse World • Has similarities to blocks and taxi domains, but more choices and constraints • Gripper must stay in bounds • Can’t pass through blocks • Can only turn around at top row • All actions have reward -1 • Goal: have C on T4 • Can’t just move directly • Final plan has 22 steps Left, Down, Pickup, Up, Turn, Down, Putdown, Right, Right, Down, Pickup, Left, Put, Up, Left, Pickup, Up, Turn, Right, Down, Down, Putdown
Some HLAs for warehouse world • Each HLA has a set of immediate refinements into sequences of actions • All plans of interest are refinements of the special HLA Act • HLAs define a context free grammar over primitive plans (Act is the start symbol) [Act] … [Move(B,C) Act] … [Get(B) PutOn(C)] … [GetL(B)] [GetR(B)] … …
Act Move(C,B) PutOn(A) 3 4 2 1 Act 2 1 Move(B,C) Get(C) Act 1 Act Move(C,A) 3 Move(C,B) 3 0 2 0 Act Move(A,C) 2 Act 5 3 2 1 Abstract Lookahead Trees • ALTs are primary data structures • Represent a set of potential plans • Edges are (possibly abstract) actions • Basic operation: refine a plan (replace with all refs at some HLA) • Nodes represent potential situations(pairs of upper & lower valuations)
S S 3 -1 s0 s0 - - - - - - 3 3 3 4 h 2 -9 -9 7 7 7 - - - 5 - - - - - - - - - 7 1 2 5 - - - - - - - 2 t t 4 2 - - - - - - - Modeling HLAs • An HLA is fully characterized by primitive MDP + hierarchy • But without abstraction, we lose benefits of hierarchy • Extension of idea from “Angelic Semantics for HLAs” [MRW ‘07]: • Valuation of HLA h from state s: • For each s’, best reward of any prim. ref. of h that takes s to s’ • Exact description of h= valuation of h from each s • But this description has no compact, efficient rep. in general
Lower S S 1 s0 s0 8 Upper t t Upper and lower valuations • Instead, use approximatevaluations • Upper valuations never underestimate best achievable reward • Lower valuations never overestimate best achievable reward • We choose a simple form: reachable set + reward bound on set - - - 7 5 Exact - 2 4 6 -3 - - - - -
~ s t ~ s ~ t ~ x s t Representing descriptions: NCSTRIPS • Assume S = set of truth assignments to propositions • Descriptions specify propositions (possibly) added/deleted by HLA • Also include a bound on rewards • An efficient algorithm exists to progress valuations (represented as DNF formulae + numeric bound) through descriptions Navigate(xt,yt) (Pre: At(xs,ys)) Upper: -At(xs,ys), +At(xt,yt), -FacingRight, +FacingRight, +R(-|xs-xt| - |ys-yt|) Lower: IF (Free(xt,yt) x Free(x,ymax)): -At(xs,ys), +At(xt,yt), -FacingRight, +FacingRight, +R(-|xs-xt| -ys-yt+1) ELSE: nil
S S S s0 s0 s0 0 -5 0 Move(A,C) Act -3 -9 t t t Back to ALT nodes • Initially, we know we are at s0with reward 0 • Each ALT node is created from parent & descriptions of action from parent • Valuations give provable bounds on reachable states & rewards • Lower valuation after Act usually vacuous, upper valuation gives admissible heuristic to t
S S S s0 s0 s0 -5 -7 t t t Pruning Prune whenever a node’s lower valuation dominates another node’s upper valuation 0 -9 -5 -7
Hierarchical Forward Search • Construct the initial ALT • Loop • Select the plan with highest upper reward • If this plan is primitive, return it • Otherwise, refine one of its HLAs • Related to A*
Intuitive Picture t s0 Act highest-level primitive
Intuitive Picture t s0 Act highest-level primitive
Intuitive Picture t s0 Act highest-level primitive
Intuitive Picture Act t s0 highest-level primitive
Intuitive Picture Act t s0 highest-level primitive
Intuitive Picture Act t s0 highest-level primitive
Intuitive Picture t s0 highest-level primitive
Intuitive Picture t s0 highest-level primitive
Analysis • Optimality follows from guarantees offered by upper descriptions • Accurate upper descriptions --> more directed search • Accurate lower descriptions --> more pruning
s0 t An online version? • Hierarchical Forward Search plans all the way to t before starting acting • Can’t act in real time • Not really useful for RL • In stochastic case, must plan for all contingencies • How can we choose a good first action with bounded computation?
Hierarchical Real-Time Search • Loop until at t • Run Hierarchical Forward Search for (at most) k steps • Choose a plan that • begins with a primitive action • has total upper reward ≥ that of the best unrefined plan • has minimal upper reward up to but not including Act • Do the first action of this plan in the world • Do a Bellman backup for the current state (prevents looping) • Related to LRTA*
Preliminary Results Instance 1 (small) Instance 2 (larger)
Discussion • We see a definite benefit to high-level lookahead • Still some subtleties still to be worked out • Each HLA description has different biases, making direct comparisons between plans difficult • Current scheme is overly optimistic • Other directions for future work: • Extensions to probabilitic/partially obs. environments • Better meta-level control • Inferring/learning HLA models • Integration into RL algorithms • Abstract-level DP • … Questions?