Learning Procedural Planning Knowledge in Complex Environments Douglas Pearson douglas.pearson@threepenny.net March 2004
Characterizing the Learner
Learners can be placed on two dimensions: learning method (implicit vs. deliberate) and knowledge representation (procedural vs. declarative).
• Implicit methods (e.g. reinforcement learning): simpler agents; weak, slower learning
• Deliberate methods (e.g. symbolic learners, IMPROV): complex agents; strong, faster learning
• Procedural KR (reinforcement learning, IMPROV): suited to complex environments
• Declarative KR (symbolic learners): simple environments
Complex environments – Actions: duration & conditional; Sensing: limited, noisy, delayed; Task: timely response; Domain: change over time, large state space
Why Limit Knowledge Access?
• Procedural – only access is by executing
• Declarative – can answer when it will execute / what it will do
Declarative Problems
• Availability
  • If (x^5 + 3x^3 – 5x^2 + 2) > 7 then Action
  • Chains of rules A -> B -> C -> Action
• Efficiency
  • O(size of knowledge base) or worse
  • Agent slows down as it learns more
IMPROV Representation
• Sets of production rules for operator preconditions and actions
• Assume the learner can only execute rules
• But allow the ability to add declarative knowledge when it's efficient to do so
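To make the distinction concrete, here is a minimal Python sketch (all names are illustrative, not from IMPROV): a procedural rule is an opaque callable that can only be executed, while a declarative rule exposes its condition as data that can be inspected without running it.

# Hypothetical sketch: procedural vs. declarative rule access.
def procedural_rule(state):
    # Opaque: the condition is buried inside executable code.
    if state["x"] ** 5 + 3 * state["x"] ** 3 - 5 * state["x"] ** 2 + 2 > 7:
        return "Action"
    return None

declarative_rule = {
    "condition": lambda s: s["x"] ** 5 + 3 * s["x"] ** 3 - 5 * s["x"] ** 2 + 2 > 7,
    "condition_text": "(x^5 + 3x^3 - 5x^2 + 2) > 7",   # inspectable as data
    "action": "Action",
}

state = {"x": 2}
print(procedural_rule(state))              # can only observe the result of executing
print(declarative_rule["condition_text"])  # can also ask when it would fire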
Focusing on Part of the Problem
[Figure: task performance (0%–100%) plotted against knowledge. The representation and initial rule base are given; the part to learn is the domain knowledge.]
The Problem
• Cast the learning problem as:
  • Error detection (incomplete/incorrect knowledge)
  • Error correction (fixing or adding knowledge)
• But with just limited, procedural access
• The aim is to support learning in complex, scalable agents/environments
Error Detection Problem
[Plan diagram: existing (possibly incorrect) knowledge yields a plan through states S1, S2, S3, S4 via the operators Speed-30, Speed-10, Speed-0, Speed-30.]
How can the plan be monitored during execution without direct knowledge access?
Error Detection Solution
[Diagram: executing the plan through S1–S4, the engine stalls and no operator is proposed.]
• Direct monitoring – not possible
• Instead detect lack of progress to the goal
  • No rules matching, or conflicting rules
• Not predicting behavior of the world (useful in stochastic environments)
• But no implicit notion of quality of solution
• Can add domain-specific error conditions – but not required
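A minimal Python sketch of this idea (illustrative, not IMPROV's implementation): with only procedural access, the monitor watches whether any rule fires at all and whether proposals conflict, rather than inspecting the rules themselves.

class Rule:
    """Opaque precondition rule: it can only be executed, not inspected."""
    def __init__(self, condition, operator):
        self._condition = condition     # hidden inside executable code
        self.operator = operator
    def propose(self, state):
        return self.operator if self._condition(state) else None

def detect_error(rules, state):
    """Flag lack of progress: nothing proposed, or conflicting proposals."""
    proposals = {r.propose(state) for r in rules} - {None}
    if not proposals:
        return "no-proposal"            # e.g. the engine stalls and nothing fires
    if len(proposals) > 1:
        return "conflicting-proposals"
    return None                         # still making progress toward the goal

rules = [Rule(lambda s: s["speed"] > 10, "brake"),
         Rule(lambda s: s["speed"] == 0, "change-gear")]
print(detect_error(rules, {"speed": 30}))   # None: exactly one rule fires
print(detect_error(rules, {"speed": 5}))    # "no-proposal": lack of progress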
IMPROV's Recovery Method
Repeat until the goal is reached:
• Replan (search)
• Execute, recording [State, Op -> Result]
• If the goal is reached – done
• If execution fails – learn:
  • Identify the incorrect operator(s)
  • Train the inductive learner
  • Change the domain knowledge
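Rendered as a loop, the recovery method might look like the following Python sketch; the helper functions stand in for the boxes in the flowchart and are assumptions, not IMPROV's actual code.

def improv_recover(knowledge, state, goal, replan, execute,
                   identify_incorrect_operators, train_inductive_learner,
                   change_domain_knowledge):
    """Sketch of the replan/execute/learn loop described above."""
    trace = []                                      # [State, Op -> Result] records
    while True:
        plan = replan(knowledge, state, goal)       # search for a plan
        state, outcome, steps = execute(plan, state)
        trace.extend(steps)                         # record what happened
        if outcome == "reached-goal":
            return knowledge
        # Execution failed: correct the knowledge using the recorded trace.
        for op in identify_incorrect_operators(trace):
            learned = train_inductive_learner(trace, op)
            knowledge = change_domain_knowledge(knowledge, op, learned)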
Finding the Incorrect Operator(s)
[Diagram: two versions of the plan Speed-30, Speed-10, Speed-0, Speed-30, one with Change-Gear inserted between Speed-10 and Speed-0.]
• Change-Gear is over-specific
• Speed-0 is over-general
• By waiting, the learner can do better credit assignment
Learning to Correct the Operator
• Collected a set of training instances: [State, Operator -> Result]
• Can identify differences between states, e.g.:
  • Speed = 40, Light = green, Self = car, Other = ambulance
  • Speed = 40, Light = green, Self = car, Other = car
• Used as a default bias in training the inductive learner
• Learn preconditions as a classification problem (predict the operator from the state)
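As an illustration only (IMPROV's own inductive learner is symbolic; the library, features, and operator labels below are assumptions), precondition learning as classification can be sketched with an off-the-shelf decision tree:

from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Recorded [State, Operator] training instances (features are illustrative).
instances = [
    ({"speed": 40, "light": "green", "other": "ambulance"}, "pull-over"),
    ({"speed": 40, "light": "green", "other": "car"},        "keep-going"),
    ({"speed": 40, "light": "red",   "other": "car"},        "brake"),
]
states, operators = zip(*instances)

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(list(states))          # symbolic state -> feature vector
clf = DecisionTreeClassifier().fit(X, list(operators))

# The learned tree plays the role of corrected preconditions: given a new
# state, it predicts which operator should be proposed.
print(clf.predict(vec.transform([{"speed": 40, "light": "green",
                                  "other": "ambulance"}])))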
K-Incremental Learning
• Collect a set of k instances, then train the inductive learner
[Diagram: a scale of instance-set sizes from 1 to n – reinforcement learners use 1, k-incremental learners use k1 ("till correction", IMPROV) or k2 ("till unique cause", EXPO), non-incremental learners use all n.]
• k does not grow over time => incremental behavior
• Better decisions about what to discard when generalizing
• When doing "active learning", bad early learning can really hurt
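A minimal sketch of the buffering behavior (class and parameter names are hypothetical): instances accumulate until k have been collected, the inductive learner is trained on that batch, and the buffer is discarded, so memory stays bounded while learning remains incremental.

class KIncrementalLearner:
    def __init__(self, k, train_fn):
        self.k = k
        self.train_fn = train_fn      # inductive learner, e.g. a tree inducer
        self.buffer = []
        self.model = None

    def observe(self, state, operator, result):
        self.buffer.append((state, operator, result))
        if len(self.buffer) >= self.k:
            # Train on the k buffered instances, then discard them;
            # k does not grow over time, giving incremental behavior.
            self.model = self.train_fn(self.buffer)
            self.buffer.clear()
        return self.model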
Extending to Operator Actions
• Decompose each operator into an operator hierarchy
[Diagram: the operator sequence Speed 30, Speed 0, Speed 20; Speed 0 decomposes into Brake and Release, which decompose into the primitives Slow -5, Slow -10, Slow -10, Slow 0.]
• The hierarchy terminates with operators that modify a single symbol
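One way to represent such a hierarchy (a sketch with illustrative names and effects, not the original data structures): each operator either lists sub-operators or bottoms out at a primitive that changes a single symbol.

class Operator:
    """Node in an operator hierarchy."""
    def __init__(self, name, sub_operators=None, effect=None):
        self.name = name
        self.sub_operators = sub_operators or []
        self.effect = effect            # (symbol, change) for primitives
    def is_primitive(self):
        # Primitives modify exactly one symbol; composites decompose further.
        return not self.sub_operators

# Illustrative decomposition of braking to a stop.
slow_5  = Operator("slow-5",  effect=("speed", -5))
slow_10 = Operator("slow-10", effect=("speed", -10))
brake   = Operator("brake",   sub_operators=[slow_5, slow_10, slow_10])
release = Operator("release", sub_operators=[Operator("slow-0", effect=("speed", 0))])
speed_0 = Operator("speed-0", sub_operators=[brake, release])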
Correcting Actions
• Expected effects of braking: Slow -5, Slow -10, Slow -10
• Observed effects of braking on ice: Slow -2, Slow -4, Slow -6 => failure
• Use the correction method to change the preconditions of these sub-operators
Change Procedural Actions
Changing the effects of Brake:
• Specialize Slow -5: Braking & slow=0 & ice => reject slow -5
• Generalize Slow -2: Braking & slow=0 & ice => propose slow -2
Supports complex actions:
• Actions with durations (sequence of operators)
• Conditional actions (branches in the sequence of operators)
• Multiple simultaneous effects
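In code, the correction amounts to editing the preconditions of the sub-operators. The sketch below (state features and rule format are illustrative assumptions) shows slow-5 specialized away from icy conditions and slow-2 generalized to cover them.

def slow_5_precondition(state):
    # Specialized: braking on ice should no longer propose slow -5.
    return state["braking"] and not state["ice"]

def slow_2_precondition(state):
    # Generalized: braking on ice now proposes the weaker slow -2 effect.
    return state["braking"] and state["ice"]

state = {"braking": True, "ice": True}
proposed = [name for name, pre in [("slow-5", slow_5_precondition),
                                   ("slow-2", slow_2_precondition)]
            if pre(state)]
print(proposed)   # ['slow-2'] -- matches the observed effect of braking on ice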
IMPROV Summary
• IMPROV supports:
  -- Powerful agents: multiple goals; faster, deliberate learning
  -- Complex environments: noise, complex actions, dynamic environments
• k-Incremental learning
  -- Improved credit assignment: which operator, which feature
• A general, weak, deliberate learner with only procedural access assumed
  -- General-purpose error detection
  -- General correction method applied to preconditions and actions
  -- Nice re-use of the precondition learner to learn actions
  -- Easy to add domain-specific knowledge to make the method stronger
[Diagram: the method (implicit vs. deliberate) by KR (procedural vs. declarative) grid again – reinforcement learning and IMPROV on the procedural/incremental side, symbolic learners on the declarative/non-incremental side, with IMPROV in the deliberate, procedural cell.]
Redux: Diagram-based Example-driven Knowledge Acquisition Douglas Pearson douglas.pearson@threepenny.net March 2004
2. User selects features to define rules
Later we'll use ML to guess this initial feature set
3. Compare desired with rules
• Desired: Turn-to-face(threat1), Shoot(threat1), Move-through(door1)
• Actual: Turn-to-face(neutral1), Shoot(neutral1), Move-through(door1)
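A tiny sketch of the comparison step (the trace format is illustrative): walk the desired and actual operator sequences in parallel and report where they diverge.

desired = ["Turn-to-face(threat1)", "Shoot(threat1)", "Move-through(door1)"]
actual  = ["Turn-to-face(neutral1)", "Shoot(neutral1)", "Move-through(door1)"]

for step, (want, got) in enumerate(zip(desired, actual), start=1):
    if want != got:
        print(f"step {step}: desired {want}, rules produced {got}")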
4. Identify and correct problems
• Detect differences between desired behavior and the rules
• Detect overgeneral preconditions
• Detect conflicts within the scenario
• Detect conflicts between scenarios
• Detect choice points where there's no guidance
• etc.
• All of these errors are detected automatically when a rule is created
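As one example of such a check (a sketch; the feature names and rule format are assumptions, not Redux's implementation): if two examples agree on all selected features but demand different actions, the selected feature set is too general.

def find_conflicts(examples, features):
    """Return (feature values, action A, action B) triples that conflict."""
    seen, conflicts = {}, []
    for state, action in examples:
        key = tuple(state.get(f) for f in features)
        if key in seen and seen[key] != action:
            conflicts.append((key, seen[key], action))
        seen[key] = action
    return conflicts

examples = [({"type": "threat",  "armed": True}, "Shoot"),
            ({"type": "neutral", "armed": True}, "Hold-fire")]
print(find_conflicts(examples, ["armed"]))   # conflict: 'armed' alone is overgeneral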
5. Fast rule creation by expert
[Diagram: the expert defines behavior with diagram-based examples in a simulation environment; these feed a library of validated behavior examples; analysis & generation tools detect inconsistency, generalize, generate rules, and simulate execution; the engineer receives executable code (rules such as A -> B, C -> D, E, J -> F, G, A, C -> H, E, G -> I, J, K -> L).]