Automatic Induction of MAXQ Hierarchies. Neville Mehta, Michael Wynkoop, Soumya Ray, Prasad Tadepalli, Tom Dietterich. School of EECS, Oregon State University. Funded by the DARPA Transfer Learning Program.
Hierarchical Reinforcement Learning • Exploits domain structure to facilitate learning • Policy constraints • State abstraction • Paradigms: Options, HAMs, MaxQ • MaxQ task hierarchy • Directed acyclic graph of subtasks • Leaves are the primitive MDP actions • Traditionally, task structure is provided as prior knowledge to the learning agent
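The following minimal Python sketch shows one way the task-hierarchy DAG described above could be represented; the class and field names (Task, termination, abstraction) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed names) of a MAXQ task hierarchy as a DAG of subtasks:
# leaves wrap primitive MDP actions; composite tasks carry a termination
# predicate and the subset of state variables they may observe (state abstraction).

from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Task:
    name: str
    children: List["Task"] = field(default_factory=list)   # empty list => primitive action
    termination: Optional[Callable[[dict], bool]] = None   # goal / termination predicate
    abstraction: frozenset = frozenset()                    # relevant state variables

    @property
    def is_primitive(self) -> bool:
        return not self.children


# Example fragment: a composite navigation task over two primitive actions.
north, south = Task("North"), Task("South")
goto = Task("Goto(loc)", children=[north, south],
            termination=lambda s: s["agent.loc"] == s["target.loc"],
            abstraction=frozenset({"agent.loc", "target.loc"}))
```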
Model Representation • Dynamic Bayesian Networks for the transition and reward models • Symbolic representation of the conditional probabilities/reward values as decision trees
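As a rough illustration of this representation, here is a hedged Python sketch of a factored action model whose conditional probabilities and rewards are stored as decision trees; the class and function names are assumptions for this example only.

```python
# Hedged sketch of a factored (DBN-style) action model in which each next-state
# variable and the reward are predicted by a decision tree over current-state
# variables. Class and function names here are illustrative assumptions.

from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    distribution: Dict          # e.g. {next_value: probability} or {"reward": value}


@dataclass
class DecisionNode:
    variable: str                        # state variable tested at this node
    branches: Dict[object, "Tree"]       # one subtree per tested value


Tree = Union[Leaf, DecisionNode]


def evaluate(tree: "Tree", state: Dict) -> Dict:
    """Walk the decision tree using the current state; return its distribution."""
    while isinstance(tree, DecisionNode):
        tree = tree.branches[state[tree.variable]]
    return tree.distribution


def tested_variables(tree: "Tree") -> set:
    """State variables this model tests (used later to decide relevance)."""
    if isinstance(tree, Leaf):
        return set()
    out = {tree.variable}
    for child in tree.branches.values():
        out |= tested_variables(child)
    return out
```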
Goal: Learn Task Hierarchies • Avoid the significant manual engineering of task decomposition, which requires a deep understanding of the purpose and function of subroutines, as in computer science • Frameworks for learning exit-option hierarchies: • HEXQ: determine exit states through random exploration • VISA: determine exit states by analyzing DBN action models
Focused Creation of Subtasks • HEXQ & VISA: Create a separate subtask for each possible exit state. • This can generate a large number of subtasks • Claim: Defining good subtasks requires maximizing state abstraction while identifying “useful” subgoals. • Our approach: selectively define subtasks with single abstract exit states
Transfer Learning Scenario • Working hypothesis: • MaxQ value-function learning is much quicker than non-hierarchical (flat) Q-learning • Hierarchical structure is more amenable to transfer from source tasks to the target than value functions • Transfer scenario: • Solve a “source problem” (no CPU time limit) • Learn DBN models • Learn MAXQ hierarchy • Solve a “target problem” under the assumption that the same hierarchical structure applies • Will relax this constraint in future work
MaxNode State Abstraction • [Figure: two-slice DBN for action At over variables Xt → Xt+1 and Yt → Yt+1, with reward Rt+1 depending only on X] • Y is irrelevant within this action: it affects the dynamics but not the reward function • In HEXQ, VISA, and our work, we assume there is only one terminal abstract state, hence no pseudo-reward is needed • As a side effect, this enables “funnel” abstractions in parent tasks
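The irrelevance test behind this abstraction can be read as a backward closure over the DBN: a variable matters only if it can reach the reward through the action models. The sketch below is one plausible rendering of that test under an assumed model interface (reward_parents, parents); it is not the authors' code.

```python
# One plausible rendering of the irrelevance test behind MaxNode state
# abstraction: starting from the variables the reward models test, close
# backwards through the DBN parent relation; anything outside the closure
# (like Y above) never influences the reward inside the subtask and can be
# abstracted away. The model interface (reward_parents, parents) is assumed.

def relevant_variables(actions, model):
    relevant = set()
    for a in actions:                                  # variables the reward tests
        relevant |= set(model.reward_parents(a))
    changed = True
    while changed:                                     # backward closure over DBN edges
        changed = False
        for a in actions:
            for v in list(relevant):
                for parent in model.parents(a, v):     # parents of v in action a's DBN
                    if parent not in relevant:
                        relevant.add(parent)
                        changed = True
    return relevant                                    # everything else is irrelevant
```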
Our Approach: AI-MAXQ • Learn DBN action models via random exploration (other work) • Apply Q-learning to solve the source problem • Generate a good trajectory from the learned Q function • Analyze the trajectory to produce a CAT (this talk) • Analyze the CAT to define a MAXQ hierarchy (this talk)
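To make the data flow between these five stages concrete, here is a small Python sketch that wires them together; every stage is injected as a callable, and none of these function names come from the paper.

```python
# Purely illustrative wiring of the five stages listed above; every stage is
# injected as a callable, so none of these names are claimed to exist in the
# authors' code.

from typing import Callable


def ai_maxq_pipeline(learn_models: Callable, q_learn: Callable,
                     best_trajectory: Callable, build_cat: Callable,
                     induce_hierarchy: Callable, maxq_learn: Callable,
                     source_mdp, target_mdp):
    models = learn_models(source_mdp)             # 1. DBN action models (other work)
    q = q_learn(source_mdp)                       # 2. flat Q-learning on the source
    tau = best_trajectory(source_mdp, q)          # 3. a good demonstration trajectory
    cat = build_cat(tau, models)                  # 4. causally annotated trajectory
    hierarchy = induce_hierarchy(cat, models)     # 5. MAXQ hierarchy from the CAT
    return maxq_learn(target_mdp, hierarchy)      # transfer structure to the target
```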
Causally Annotated Trajectory (CAT) • [Figure: demonstration Start → Goto → MG → Goto → Dep → Goto → CW → Goto → Dep → End, with arcs labeled by variables such as a.l, a.r, a.*, reg.*, req.gold, req.wood] • A variable v is relevant to an action if the DBN for that action tests or changes that variable (this includes both the variable nodes and the reward nodes) • Create an arc from A to B labeled with variable v iff v is relevant to A and B but not to any intermediate action
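The arc-creation rule above translates directly into a short procedure: for each action and each of its relevant variables, link back to the most recent earlier action that is also relevant to that variable. The sketch below assumes the relevance sets have already been read off the learned DBNs; the small example at the end uses made-up relevance sets purely for illustration.

```python
# Hedged sketch of CAT construction from a single demonstration, following the
# rule on this slide: draw an arc labeled v from action A to action B iff v is
# relevant to both and to no action in between.

def build_cat(trajectory, relevance):
    """trajectory: list of action names, e.g. ["Start", "Goto MG", ..., "End"].
    relevance: dict mapping action name -> set of relevant variables.
    Returns a list of arcs (i, j, v) between trajectory positions i < j."""
    arcs = []
    for j, b in enumerate(trajectory):
        for v in relevance[b]:
            # walk backwards to the most recent earlier action relevant to v
            for i in range(j - 1, -1, -1):
                if v in relevance[trajectory[i]]:
                    arcs.append((i, j, v))
                    break
    return arcs


# Tiny usage example with made-up relevance sets (illustrative only):
traj = ["Start", "Goto MG", "Goto Dep", "End"]
rel = {"Start": {"a.l", "req.gold"}, "Goto MG": {"a.l", "a.r"},
       "Goto Dep": {"a.l", "a.r", "req.gold"}, "End": {"req.gold"}}
print(build_cat(traj, rel))
```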
CAT Scan • [Figure: the CAT over Start → Goto → MG → Goto → Dep → Goto → CW → Goto → Dep → End being segmented] • An action is absorbed regressively as long as: • It does not have an effect beyond the trajectory segment (preventing exogenous effects) • It does not increase the state abstraction
[Figure sequence: successive steps of the CAT scan over the trajectory; the segment ending at the first Dep is grouped into a Harvest Gold subtask, the segment ending at the second Dep into a Harvest Wood subtask, and both sit under Root]
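The sketch below is one illustrative reading of the regressive absorption rule from the CAT-scan slides, written against the arc format of the earlier build_cat sketch; it is not the authors' exact procedure.

```python
# An illustrative reading of the regressive absorption rule: grow a segment
# backwards from the action that achieves a subtask goal, and stop when the
# next action either exports an effect past the segment's end or would enlarge
# the segment's state abstraction.

def scan_segment(trajectory, arcs, goal_index):
    """arcs: (i, j, v) triples as produced by the CAT construction sketch;
    goal_index: position of the action achieving the subtask goal.
    Returns (start, end) indices of the induced trajectory segment."""
    start = goal_index
    abstraction = {v for (i, j, v) in arcs if j == goal_index}   # variables feeding the goal
    while start > 0:
        cand = start - 1
        # (a) the candidate's effects must not reach beyond the goal action
        exports = any(i == cand and j > goal_index for (i, j, v) in arcs)
        # (b) absorbing it must not add new variables to the abstraction
        cand_vars = {v for (i, j, v) in arcs if i == cand or j == cand}
        if exports or not cand_vars <= abstraction:
            break
        start = cand
    return start, goal_index
```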
Induced Wargus Hierarchy • [Figure: Root → Harvest Gold, Harvest Wood; Harvest Gold → Get Gold, Put Gold; Harvest Wood → Get Wood, Put Wood; Get Gold → GGoto(goldmine), Mine Gold; Put Gold → GGoto(townhall), GDeposit; Get Wood → WGoto(forest), Chop Wood; Put Wood → WGoto(townhall), WDeposit; the GGoto/WGoto tasks share the primitive Goto(loc)]
Induced Abstraction & Termination Note that because each subtask has a unique terminal state, Result Distribution Irrelevance applies
Claims • The resulting hierarchy is unique • Does not depend on the order in which goals and trajectory sequences are analyzed • All state abstractions are safe • There exists a hierarchical policy within the induced hierarchy that will reproduce the observed trajectory • Extend MaxQ Node Irrelevance to the induced structure • Learned hierarchical structure is “locally optimal” • No local change in the trajectory segmentation can improve the state abstractions (very weak)
Experimental Setup • Randomly generate pairs of source-target resource-gathering maps in Wargus • Learn the optimal policy in the source • Induce the task hierarchy from a single (near) optimal trajectory • Transfer this hierarchical structure to the MaxQ value-function learner for the target • Compare against direct Q-learning and against MaxQ learning on a manually engineered hierarchy in the target
Hand-Built Wargus Hierarchy • [Figure: Root → Get Gold, Get Wood, GWDeposit; Get Gold → Mine Gold; Get Wood → Chop Wood; GWDeposit → Deposit; the subtasks share the primitive Goto(loc)]
Need for Demonstrations • VISA uses DBNs only for causal information • Its analysis is global, covering the entire state space rather than focusing on the pertinent subspace • Problems: • Global variable coupling might prevent concise abstraction • Exit states can grow exponentially: one for each path in the decision-tree encoding • The modified bitflip domain exposes these shortcomings
Modified Bitflip Domain • State space: bits b0, …, bn-1 • Action space: • Flip(i), 0 ≤ i < n-1: if b0 ∧ … ∧ bi-1 = 1 then bi ← ¬bi, else b0 ← 0, …, bi ← 0 • Flip(n-1): if parity(b0, …, bn-2) ∧ bn-2 = 1 then bn-1 ← ¬bn-1, else b0 ← 0, …, bn-1 ← 0 • parity(…) is even if n-1 is even, odd otherwise • Reward: -1 for all actions • Terminal/goal state: b0 ∧ … ∧ bn-1 = 1
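The following runnable Python sketch implements the domain exactly as specified on this slide (n bits, conjunction/parity preconditions, reward -1 per step, goal = all ones); the class name and method names are our own, and the parity convention follows the text above.

```python
# Runnable sketch of the modified bitflip domain as specified on this slide.

class BitflipDomain:
    def __init__(self, n):
        self.n = n
        self.bits = [0] * n

    def _parity_ok(self):
        # parity(b0, ..., b_{n-2}): even if n-1 is even, odd otherwise
        want_even = (self.n - 1) % 2 == 0
        is_even = sum(self.bits[: self.n - 1]) % 2 == 0
        return is_even == want_even

    def flip(self, i):
        """Apply Flip(i); every action costs -1."""
        if i < self.n - 1:
            ok = all(self.bits[:i])                   # b0 ∧ ... ∧ b_{i-1}
        else:
            ok = self._parity_ok() and self.bits[self.n - 2] == 1
        if ok:
            self.bits[i] ^= 1                         # flip bit i
        else:
            for j in range(i + 1):                    # reset b0 ... bi
                self.bits[j] = 0
        return -1

    def is_goal(self):
        return all(self.bits)
```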
Modified Bitflip Domain (example trace, n = 7) • 1110000 → Flip(3) → 1111000 → Flip(1) → 1011000 → Flip(4) → 0000000
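This trace can be reproduced with the BitflipDomain sketch from the previous slide (assuming that class is in scope):

```python
# Reproducing the trace above with the BitflipDomain sketch.
env = BitflipDomain(7)
env.bits = [1, 1, 1, 0, 0, 0, 0]
for action in (3, 1, 4):
    env.flip(action)
    print(action, env.bits)
# 3 [1, 1, 1, 1, 0, 0, 0]
# 1 [1, 0, 1, 1, 0, 0, 0]
# 4 [0, 0, 0, 0, 0, 0, 0]
```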
VISA’s Causal Graph • Variables are grouped into two strongly connected components (dashed ellipses in the figure) • Both components affect the reward node • [Figure: causal graph over b0, b1, b2, …, bn-2, bn-1 and the reward node R, with edges labeled by the Flip(1), Flip(2), Flip(3), Flip(n-2), Flip(n-1) actions; b0, …, bn-2 form one component and bn-1 the other]
VISA Task Hierarchy • [Figure: Root invokes Flip(n-1) together with 2^(n-3) exit options, one for each assignment of b0, …, bn-2 satisfying parity(b0, …, bn-2) ∧ bn-2 = 1; the exit options are built from the primitives Flip(0), Flip(1), …]
Bitflip CAT • [Figure: trajectory Start → Flip(0) → Flip(1) → … → Flip(n-2) → Flip(n-1) → End, with arcs labeled b0, b1, …, bn-2, bn-1, and with the variable sets b0, …, bn-2 and b0, …, bn-1]
Induced MAXQ Task Hierarchy • [Figure: Root → Flip(n-1) plus a subtask with goal b0 ∧ … ∧ bn-2 = 1; that subtask → Flip(n-2) plus a subtask with goal b0 ∧ … ∧ bn-3 = 1; …; the innermost subtask, with goal b0 ∧ b1 = 1, contains the primitives Flip(0) and Flip(1)]
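The regular nesting in this figure can be generated for any n; the short Python sketch below builds the same structure using plain dicts, purely for illustration and not as code from the paper.

```python
# Illustrative generator for the induced bitflip hierarchy shown above:
# each level pairs Flip(i) with a child subtask whose goal is b0 ∧ ... ∧ b_{i-1} = 1.

def induced_bitflip_hierarchy(n):
    # innermost subtask: achieve b0 ∧ b1 = 1 using the primitives Flip(0), Flip(1)
    node = {"goal": "b0 ∧ b1 = 1", "children": ["Flip(0)", "Flip(1)"]}
    for i in range(2, n - 1):
        node = {"goal": f"b0 ∧ ... ∧ b{i} = 1", "children": [node, f"Flip({i})"]}
    return {"goal": "Root", "children": [node, f"Flip({n - 1})"]}


print(induced_bitflip_hierarchy(5))
```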
Conclusion • Causality analysis is the key to our approach • Enables us to find concise subtask definitions from a demonstration • CAT scan is easy to perform • Need to extend to learn from multiple demonstrations • Disjunctive goals