Automatic Induction of MAXQ Hierarchies. Neville Mehta, Michael Wynkoop, Soumya Ray, Prasad Tadepalli, Tom Dietterich. School of EECS, Oregon State University. Funded by the DARPA Transfer Learning Program.
Hierarchical Reinforcement Learning • Exploits domain structure to facilitate learning • Policy constraints • State abstraction • Paradigms: Options, HAMs, MaxQ • MaxQ task hierarchy • Directed acyclic graph of subtasks • Leaves are the primitive MDP actions • Traditionally, task structure is provided as prior knowledge to the learning agent
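The following minimal Python sketch shows one way the task-hierarchy DAG described above could be represented; the class and field names (Task, termination, abstraction) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed names) of a MAXQ task hierarchy as a DAG of subtasks:
# leaves wrap primitive MDP actions; composite tasks carry a termination
# predicate and the subset of state variables they may observe (state abstraction).

from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Task:
    name: str
    children: List["Task"] = field(default_factory=list)   # empty list => primitive action
    termination: Optional[Callable[[dict], bool]] = None   # goal / termination predicate
    abstraction: frozenset = frozenset()                    # relevant state variables

    @property
    def is_primitive(self) -> bool:
        return not self.children


# Example fragment: a composite navigation task over two primitive actions.
north, south = Task("North"), Task("South")
goto = Task("Goto(loc)", children=[north, south],
            termination=lambda s: s["agent.loc"] == s["target.loc"],
            abstraction=frozenset({"agent.loc", "target.loc"}))
```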
Model Representation • Dynamic Bayesian Networks for the transition and reward models • Symbolic representation of the conditional probabilities/reward values as decision trees
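As a rough illustration of this representation, here is a hedged Python sketch of a factored action model whose conditional probabilities and rewards are stored as decision trees; the class and function names are assumptions for this example only.

```python
# Hedged sketch of a factored (DBN-style) action model in which each next-state
# variable and the reward are predicted by a decision tree over current-state
# variables. Class and function names here are illustrative assumptions.

from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    distribution: Dict          # e.g. {next_value: probability} or {"reward": value}


@dataclass
class DecisionNode:
    variable: str                        # state variable tested at this node
    branches: Dict[object, "Tree"]       # one subtree per tested value


Tree = Union[Leaf, DecisionNode]


def evaluate(tree: "Tree", state: Dict) -> Dict:
    """Walk the decision tree using the current state; return its distribution."""
    while isinstance(tree, DecisionNode):
        tree = tree.branches[state[tree.variable]]
    return tree.distribution


def tested_variables(tree: "Tree") -> set:
    """State variables this model tests (used later to decide relevance)."""
    if isinstance(tree, Leaf):
        return set()
    out = {tree.variable}
    for child in tree.branches.values():
        out |= tested_variables(child)
    return out
```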
Goal: Learn Task Hierarchies • Avoid the significant manual engineering of task decomposition, which requires a deep understanding of the purpose and function of subroutines, as in computer science • Frameworks for learning exit-option hierarchies: • HEXQ: determine exit states through random exploration • VISA: determine exit states by analyzing DBN action models
Focused Creation of Subtasks • HEXQ & VISA: Create a separate subtask for each possible exit state. • This can generate a large number of subtasks • Claim: Defining good subtasks requires maximizing state abstraction while identifying “useful” subgoals. • Our approach: selectively define subtasks with single abstract exit states
Transfer Learning Scenario • Working hypothesis: • MaxQ value-function learning is much quicker than non-hierarchical (flat) Q-learning • Hierarchical structure is more amenable to transfer from source tasks to the target than value functions • Transfer scenario: • Solve a “source problem” (no CPU time limit) • Learn DBN models • Learn MAXQ hierarchy • Solve a “target problem” under the assumption that the same hierarchical structure applies • Will relax this constraint in future work
MaxNode State Abstraction • [Figure: two-slice DBN for action At over variables Xt → Xt+1 and Yt → Yt+1, with reward Rt+1 depending only on X] • Y is irrelevant within this action: it affects the dynamics but not the reward function • In HEXQ, VISA, and our work, we assume there is only one terminal abstract state, hence no pseudo-reward is needed • As a side effect, this enables “funnel” abstractions in parent tasks
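The irrelevance test behind this abstraction can be read as a backward closure over the DBN: a variable matters only if it can reach the reward through the action models. The sketch below is one plausible rendering of that test under an assumed model interface (reward_parents, parents); it is not the authors' code.

```python
# One plausible rendering of the irrelevance test behind MaxNode state
# abstraction: starting from the variables the reward models test, close
# backwards through the DBN parent relation; anything outside the closure
# (like Y above) never influences the reward inside the subtask and can be
# abstracted away. The model interface (reward_parents, parents) is assumed.

def relevant_variables(actions, model):
    relevant = set()
    for a in actions:                                  # variables the reward tests
        relevant |= set(model.reward_parents(a))
    changed = True
    while changed:                                     # backward closure over DBN edges
        changed = False
        for a in actions:
            for v in list(relevant):
                for parent in model.parents(a, v):     # parents of v in action a's DBN
                    if parent not in relevant:
                        relevant.add(parent)
                        changed = True
    return relevant                                    # everything else is irrelevant
```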
Our Approach: AI-MAXQ • Learn DBN action models via random exploration (other work) • Apply Q-learning to solve the source problem • Generate a good trajectory from the learned Q function • Analyze the trajectory to produce a CAT (this talk) • Analyze the CAT to define a MAXQ hierarchy (this talk)
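To make the data flow between these five stages concrete, here is a small Python sketch that wires them together; every stage is injected as a callable, and none of these function names come from the paper.

```python
# Purely illustrative wiring of the five stages listed above; every stage is
# injected as a callable, so none of these names are claimed to exist in the
# authors' code.

from typing import Callable


def ai_maxq_pipeline(learn_models: Callable, q_learn: Callable,
                     best_trajectory: Callable, build_cat: Callable,
                     induce_hierarchy: Callable, maxq_learn: Callable,
                     source_mdp, target_mdp):
    models = learn_models(source_mdp)             # 1. DBN action models (other work)
    q = q_learn(source_mdp)                       # 2. flat Q-learning on the source
    tau = best_trajectory(source_mdp, q)          # 3. a good demonstration trajectory
    cat = build_cat(tau, models)                  # 4. causally annotated trajectory
    hierarchy = induce_hierarchy(cat, models)     # 5. MAXQ hierarchy from the CAT
    return maxq_learn(target_mdp, hierarchy)      # transfer structure to the target
```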
Causally Annotated Trajectory (CAT) • [Figure: demonstration Start → Goto → MG → Goto → Dep → Goto → CW → Goto → Dep → End, with arcs labeled by variables such as a.l, a.r, a.*, reg.*, req.gold, req.wood] • A variable v is relevant to an action if the DBN for that action tests or changes that variable (this includes both the variable nodes and the reward nodes) • Create an arc from A to B labeled with variable v iff v is relevant to A and B but not to any intermediate action
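The arc-creation rule above translates directly into a short procedure: for each action and each of its relevant variables, link back to the most recent earlier action that is also relevant to that variable. The sketch below assumes the relevance sets have already been read off the learned DBNs; the small example at the end uses made-up relevance sets purely for illustration.

```python
# Hedged sketch of CAT construction from a single demonstration, following the
# rule on this slide: draw an arc labeled v from action A to action B iff v is
# relevant to both and to no action in between.

def build_cat(trajectory, relevance):
    """trajectory: list of action names, e.g. ["Start", "Goto MG", ..., "End"].
    relevance: dict mapping action name -> set of relevant variables.
    Returns a list of arcs (i, j, v) between trajectory positions i < j."""
    arcs = []
    for j, b in enumerate(trajectory):
        for v in relevance[b]:
            # walk backwards to the most recent earlier action relevant to v
            for i in range(j - 1, -1, -1):
                if v in relevance[trajectory[i]]:
                    arcs.append((i, j, v))
                    break
    return arcs


# Tiny usage example with made-up relevance sets (illustrative only):
traj = ["Start", "Goto MG", "Goto Dep", "End"]
rel = {"Start": {"a.l", "req.gold"}, "Goto MG": {"a.l", "a.r"},
       "Goto Dep": {"a.l", "a.r", "req.gold"}, "End": {"req.gold"}}
print(build_cat(traj, rel))
```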
CAT Scan • [Figure: the CAT over Start → Goto → MG → Goto → Dep → Goto → CW → Goto → Dep → End being segmented] • An action is absorbed regressively as long as: • It does not have an effect beyond the trajectory segment (preventing exogenous effects) • It does not increase the state abstraction
[Figure sequence: successive steps of the CAT scan over the trajectory; the segment ending at the first Dep is grouped into a Harvest Gold subtask, the segment ending at the second Dep into a Harvest Wood subtask, and both sit under Root]
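The sketch below is one illustrative reading of the regressive absorption rule from the CAT-scan slides, written against the arc format of the earlier build_cat sketch; it is not the authors' exact procedure.

```python
# An illustrative reading of the regressive absorption rule: grow a segment
# backwards from the action that achieves a subtask goal, and stop when the
# next action either exports an effect past the segment's end or would enlarge
# the segment's state abstraction.

def scan_segment(trajectory, arcs, goal_index):
    """arcs: (i, j, v) triples as produced by the CAT construction sketch;
    goal_index: position of the action achieving the subtask goal.
    Returns (start, end) indices of the induced trajectory segment."""
    start = goal_index
    abstraction = {v for (i, j, v) in arcs if j == goal_index}   # variables feeding the goal
    while start > 0:
        cand = start - 1
        # (a) the candidate's effects must not reach beyond the goal action
        exports = any(i == cand and j > goal_index for (i, j, v) in arcs)
        # (b) absorbing it must not add new variables to the abstraction
        cand_vars = {v for (i, j, v) in arcs if i == cand or j == cand}
        if exports or not cand_vars <= abstraction:
            break
        start = cand
    return start, goal_index
```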
Induced Wargus Hierarchy • [Figure: Root → Harvest Gold, Harvest Wood; Harvest Gold → Get Gold, Put Gold; Harvest Wood → Get Wood, Put Wood; Get Gold → GGoto(goldmine), Mine Gold; Put Gold → GGoto(townhall), GDeposit; Get Wood → WGoto(forest), Chop Wood; Put Wood → WGoto(townhall), WDeposit; the GGoto/WGoto tasks share the primitive Goto(loc)]
Induced Abstraction & Termination Note that because each subtask has a unique terminal state, Result Distribution Irrelevance applies
Claims • The resulting hierarchy is unique • Does not depend on the order in which goals and trajectory sequences are analyzed • All state abstractions are safe • There exists a hierarchical policy within the induced hierarchy that will reproduce the observed trajectory • Extend MaxQ Node Irrelevance to the induced structure • Learned hierarchical structure is “locally optimal” • No local change in the trajectory segmentation can improve the state abstractions (very weak)
Experimental Setup • Randomly generate pairs of source-target resource-gathering maps in Wargus • Learn the optimal policy in the source • Induce the task hierarchy from a single (near) optimal trajectory • Transfer this hierarchical structure to the MaxQ value-function learner for the target • Compare against direct Q-learning and against MaxQ learning on a manually engineered hierarchy in the target
Hand-Built Wargus Hierarchy • [Figure: Root → Get Gold, Get Wood, GWDeposit; Get Gold → Mine Gold; Get Wood → Chop Wood; GWDeposit → Deposit; the subtasks share the primitive Goto(loc)]
Need for Demonstrations • VISA uses DBNs only for causal information • Its analysis is global, covering the entire state space rather than focusing on the pertinent subspace • Problems: • Global variable coupling might prevent concise abstraction • Exit states can grow exponentially: one for each path in the decision-tree encoding • The modified bitflip domain exposes these shortcomings
Modified Bitflip Domain • State space: bits b0, …, bn-1 • Action space: • Flip(i), 0 ≤ i < n-1: if b0 ∧ … ∧ bi-1 = 1 then bi ← ¬bi, else b0 ← 0, …, bi ← 0 • Flip(n-1): if parity(b0, …, bn-2) ∧ bn-2 = 1 then bn-1 ← ¬bn-1, else b0 ← 0, …, bn-1 ← 0 • parity(…) is even if n-1 is even, odd otherwise • Reward: -1 for all actions • Terminal/goal state: b0 ∧ … ∧ bn-1 = 1
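The following runnable Python sketch implements the domain exactly as specified on this slide (n bits, conjunction/parity preconditions, reward -1 per step, goal = all ones); the class name and method names are our own, and the parity convention follows the text above.

```python
# Runnable sketch of the modified bitflip domain as specified on this slide.

class BitflipDomain:
    def __init__(self, n):
        self.n = n
        self.bits = [0] * n

    def _parity_ok(self):
        # parity(b0, ..., b_{n-2}): even if n-1 is even, odd otherwise
        want_even = (self.n - 1) % 2 == 0
        is_even = sum(self.bits[: self.n - 1]) % 2 == 0
        return is_even == want_even

    def flip(self, i):
        """Apply Flip(i); every action costs -1."""
        if i < self.n - 1:
            ok = all(self.bits[:i])                   # b0 ∧ ... ∧ b_{i-1}
        else:
            ok = self._parity_ok() and self.bits[self.n - 2] == 1
        if ok:
            self.bits[i] ^= 1                         # flip bit i
        else:
            for j in range(i + 1):                    # reset b0 ... bi
                self.bits[j] = 0
        return -1

    def is_goal(self):
        return all(self.bits)
```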
Modified Bitflip Domain (example trace, n = 7) • 1110000 → Flip(3) → 1111000 → Flip(1) → 1011000 → Flip(4) → 0000000
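This trace can be reproduced with the BitflipDomain sketch from the previous slide (assuming that class is in scope):

```python
# Reproducing the trace above with the BitflipDomain sketch.
env = BitflipDomain(7)
env.bits = [1, 1, 1, 0, 0, 0, 0]
for action in (3, 1, 4):
    env.flip(action)
    print(action, env.bits)
# 3 [1, 1, 1, 1, 0, 0, 0]
# 1 [1, 0, 1, 1, 0, 0, 0]
# 4 [0, 0, 0, 0, 0, 0, 0]
```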
VISA’s Causal Graph • Variables are grouped into two strongly connected components (dashed ellipses in the figure) • Both components affect the reward node • [Figure: causal graph over b0, b1, b2, …, bn-2, bn-1 and the reward node R, with edges labeled by the Flip(1), Flip(2), Flip(3), Flip(n-2), Flip(n-1) actions; b0, …, bn-2 form one component and bn-1 the other]
VISA Task Hierarchy • [Figure: Root invokes Flip(n-1) together with 2^(n-3) exit options, one for each assignment of b0, …, bn-2 satisfying parity(b0, …, bn-2) ∧ bn-2 = 1; the exit options are built from the primitives Flip(0), Flip(1), …]
Bitflip CAT • [Figure: trajectory Start → Flip(0) → Flip(1) → … → Flip(n-2) → Flip(n-1) → End, with arcs labeled b0, b1, …, bn-2, bn-1, and with the variable sets b0, …, bn-2 and b0, …, bn-1]
Induced MAXQ Task Hierarchy • [Figure: Root → Flip(n-1) plus a subtask with goal b0 ∧ … ∧ bn-2 = 1; that subtask → Flip(n-2) plus a subtask with goal b0 ∧ … ∧ bn-3 = 1; …; the innermost subtask, with goal b0 ∧ b1 = 1, contains the primitives Flip(0) and Flip(1)]
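The regular nesting in this figure can be generated for any n; the short Python sketch below builds the same structure using plain dicts, purely for illustration and not as code from the paper.

```python
# Illustrative generator for the induced bitflip hierarchy shown above:
# each level pairs Flip(i) with a child subtask whose goal is b0 ∧ ... ∧ b_{i-1} = 1.

def induced_bitflip_hierarchy(n):
    # innermost subtask: achieve b0 ∧ b1 = 1 using the primitives Flip(0), Flip(1)
    node = {"goal": "b0 ∧ b1 = 1", "children": ["Flip(0)", "Flip(1)"]}
    for i in range(2, n - 1):
        node = {"goal": f"b0 ∧ ... ∧ b{i} = 1", "children": [node, f"Flip({i})"]}
    return {"goal": "Root", "children": [node, f"Flip({n - 1})"]}


print(induced_bitflip_hierarchy(5))
```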
Conclusion • Causality analysis is the key to our approach • Enables us to find concise subtask definitions from a demonstration • CAT scan is easy to perform • Need to extend to learn from multiple demonstrations • Disjunctive goals