Advice Taking and Transfer Learning: Naturally-Inspired Extensions to Reinforcement Learning
Lisa Torrey, Trevor Walker, Richard Maclin*, Jude Shavlik
University of Wisconsin - Madison
University of Minnesota - Duluth*
Reinforcement Learning
(Diagram: the Agent observes a state from the Environment, takes an action, and receives a reward, which may be delayed)
Q-Learning
policy(state) = argmax_action Q(state, action)
• Update the Q-function incrementally
• Follow the current Q-function to choose actions
• Converges to an accurate Q-function
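To make the update rule concrete, here is a minimal tabular Q-learning sketch in Python; the action names, hyperparameters, and epsilon-greedy exploration are illustrative assumptions (the RoboCup agents described later use function approximation rather than a table).

```python
# Minimal tabular Q-learning sketch (illustrative only).
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1      # assumed hyperparameters
ACTIONS = ["move_ahead", "pass", "shoot"]  # placeholder action names

Q = defaultdict(float)   # Q[(state, action)] -> estimated value

def choose_action(state):
    """Follow the current Q-function, with epsilon-greedy exploration."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """Incremental update toward reward + discounted value of the best next action."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```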
Limitations • Agents begin without any information • Random exploration required in early stages of learning • Long training times can result
Naturally-Inspired Extensions
• Advice Taking: a Human Teacher provides Knowledge to the RL Agent
• Transfer Learning: a Source-task Agent provides Knowledge to the Target-task Agent
Potential Benefits
(Plot: performance vs. training, with and without knowledge; starting with knowledge can give a higher start, higher slope, and higher asymptote)
Outline • RL in a complex domain • Extension #1: Advice Taking • Extension #2: Transfer Learning • Skill Transfer • Macro Transfer • MLN Transfer
The RoboCup Domain
• KeepAway: +1 per time step
• MoveDownfield: +1 per meter
• BreakAway: +1 upon goal
The RoboCup Domain
State features: distBetween(a0, Player), distBetween(a0, GoalPart), distBetween(Attacker, goalCenter), distBetween(Attacker, ClosestDefender), distBetween(Attacker, goalie), angleDefinedBy(topRight, goalCenter, a0), angleDefinedBy(GoalPart, a0, goalie), angleDefinedBy(Attacker, a0, ClosestDefender), angleDefinedBy(Attacker, a0, goalie), timeLeft
Actions: move(ahead), move(away), move(right), move(left), shoot(GoalPart), pass(Teammate)
Q-Learning with Function Approximation
policy(state) = argmax_action Q(state, action), with the Q-function represented by a function approximator rather than a lookup table
Approximating the Q-function
Features: distBetween(a0, a1), distBetween(a0, a2), distBetween(a0, goalie), …
Weights: 0.2, -0.1, 0.9, …
Linear support-vector regression: Q-value = WeightVectorᵀ · FeatureVector
Set weights to minimize: ModelSize + C × DataMisfit
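A small sketch of this approximation, using scikit-learn's LinearSVR as a stand-in for the support-vector regression solver used in the work; the feature values, Q-value targets, and hyperparameters are made up for illustration.

```python
# Sketch: one linear Q-model per action, fit by linear support-vector regression.
import numpy as np
from sklearn.svm import LinearSVR

# Rows are feature vectors: [distBetween(a0,a1), distBetween(a0,a2), distBetween(a0,goalie)]
X = np.array([[15.0,  5.0, 20.0],
              [ 8.0, 12.0, 18.0],
              [22.0,  3.0, 25.0]])
q_targets = np.array([0.7, 0.4, 0.9])   # Q-value estimates for a single action (made up)

# The SVR objective roughly matches the slide: ModelSize + C * DataMisfit
model = LinearSVR(C=1.0, epsilon=0.01, max_iter=10000).fit(X, q_targets)

def q_value(features):
    """Q-value = weight vector dot feature vector (plus a bias term)."""
    # equivalently: model.coef_ @ features + model.intercept_[0]
    return float(model.predict(features.reshape(1, -1))[0])

print(q_value(np.array([15.0, 5.0, 20.0])))
```

One linear model per action is assumed here, so the argmax over actions reduces to comparing the per-action predictions.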
Outline • RL in a complex domain • Extension #1: Advice Taking • Extension #2: Transfer Learning • Skill Transfer • Macro Transfer • MLN Transfer
Extension #1: Advice Taking IF an opponent is near AND a teammate is open THEN pass is the best action
Advice in RL
• Advice sets constraints on Q-values under specified conditions:
IF an opponent is near me AND a teammate is open THEN pass has a high Q-value
• Apply advice as soft constraints in optimization: minimize ModelSize + C × DataMisfit + μ × AdviceMisfit
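A hedged sketch of how an advice rule could enter the objective as a soft constraint: when the advice condition holds in a training state, the model is penalized if the advised action's predicted Q-value falls below a bound. The bound, penalty weights, features, and hinge form are assumptions for illustration, not the exact formulation used in the work.

```python
# Sketch: advice as a soft constraint in the regression objective.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 5.0],      # hypothetical features of two training states
              [0.0, 20.0]])
q_targets = np.array([0.9, 0.2])          # observed Q-value estimates for "pass"
advice_mask = np.array([True, False])     # states where "opponent near AND teammate open" holds
ADVICE_BOUND, C, MU = 0.8, 1.0, 0.5       # assumed bound and penalty weights

def objective(w):
    preds = X @ w
    model_size = np.dot(w, w)                                   # ModelSize
    data_misfit = np.sum((preds - q_targets) ** 2)              # DataMisfit
    # AdviceMisfit: hinge penalty when the advised Q-value is too low
    advice_misfit = np.sum(np.maximum(0.0, ADVICE_BOUND - preds[advice_mask]))
    return model_size + C * data_misfit + MU * advice_misfit

w_opt = minimize(objective, x0=np.zeros(X.shape[1]), method="Powell").x
```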
Outline • RL in a complex domain • Extension #1: Advice Taking • Extension #2: Transfer Learning • Skill Transfer • Macro Transfer • MLN Transfer
Extension #2: Transfer
Transfer knowledge among related tasks: 3-on-2 KeepAway, 3-on-2 MoveDownfield, and 3-on-2 BreakAway
Relational Transfer
• First-order logic describes relationships between objects, e.g.:
distBetween(a0, Teammate) > 10
distBetween(Teammate, goalCenter) < 15
• We want to transfer relational knowledge: it supports human-level reasoning and a general representation
Outline • RL in a complex domain • Extension #1: Advice Taking • Extension #2: Transfer Learning • Skill Transfer • Macro Transfer • MLN Transfer
Skill Transfer
• Learn advice about good actions from the source task
• Select positive and negative examples of good actions and apply inductive logic programming to learn rules
Example 1 (a source-task state where a pass succeeded):
distBetween(a0, a1) = 15
distBetween(a0, a2) = 5
distBetween(a0, goalie) = 20
...
action = pass(a1)
outcome = caught(a1)
Learned rule:
good_action(pass(Teammate)) :-
    distBetween(a0, Teammate) > 10,
    distBetween(Teammate, goalCenter) < 15.
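A rough sketch of the example-selection step; the record format, thresholds, and success test are assumptions, and the ILP rule learner itself is not shown.

```python
# Sketch: labeling positive and negative examples of good_action(pass(Teammate))
# from recorded source-task states, before handing them to an ILP rule learner.
def label_pass_examples(records, high_q=0.8, low_q=0.2):
    positives, negatives = [], []
    for rec in records:
        # rec example: {"state": {...}, "action": "pass(a1)", "q": 0.9, "outcome": "caught(a1)"}
        if not rec["action"].startswith("pass"):
            continue
        if rec["q"] >= high_q and rec["outcome"].startswith("caught"):
            positives.append(rec["state"])    # pass was clearly a good action here
        elif rec["q"] <= low_q:
            negatives.append(rec["state"])    # pass was clearly a poor choice here
    return positives, negatives
```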
User Advice in Skill Transfer • There may be new skills in the target that cannot be learned from the source • E.g., shooting in BreakAway • We allow users to add their own advice about these new skills • User advice simply adds to transfer advice
Outline • RL in a complex domain • Extension #1: Advice Taking • Extension #2: Transfer Learning • Skill Transfer • Macro Transfer • MLN Transfer
Macro Transfer
• Learn a strategy from the source task
• Find an action sequence that separates good games from bad games
• Learn first-order rules to control transitions along the sequence
Example sequence: move(ahead) → pass(Teammate) → shoot(GoalPart)
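One way to realize the second bullet is to score candidate sequences by how often they appear in good games versus bad games; the subsequence test and scoring rule below are assumptions for illustration.

```python
# Sketch: score a candidate action sequence by how well it separates
# good source-task games from bad ones.
def contains_sequence(game_actions, seq):
    """True if seq occurs in order (not necessarily contiguously) in the game."""
    it = iter(game_actions)
    return all(any(a == step for a in it) for step in seq)

def separation_score(seq, good_games, bad_games):
    good_hits = sum(contains_sequence(g, seq) for g in good_games) / len(good_games)
    bad_hits = sum(contains_sequence(g, seq) for g in bad_games) / len(bad_games)
    return good_hits - bad_hits   # high when common in good games and rare in bad ones

good_games = [["move(ahead)", "pass(Teammate)", "shoot(GoalPart)"],
              ["pass(Teammate)", "move(ahead)", "pass(Teammate)", "shoot(GoalPart)"]]
bad_games = [["move(ahead)", "move(left)", "pass(Teammate)"]]
print(separation_score(["pass(Teammate)", "shoot(GoalPart)"], good_games, bad_games))  # 1.0
```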
Transfer via Demonstration
Games 0–100 in the target task: execute the macro strategy while the agent learns an initial Q-function
After game 100: perform standard RL as the agent adapts to the target task
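A minimal sketch of this demonstration schedule, assuming a 100-game demonstration period and placeholder macro_policy / rl_policy callables.

```python
# Sketch of the demonstration schedule: follow the transferred macro strategy
# for the first 100 target-task games, then switch to standard RL.
DEMONSTRATION_GAMES = 100

def select_action(state, games_played, macro_policy, rl_policy):
    if games_played < DEMONSTRATION_GAMES:
        return macro_policy(state)   # demonstrate; the Q-function is trained on these games
    return rl_policy(state)          # adapt to the target task with standard RL
```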
Outline • RL in a complex domain • Extension #1: Advice Taking • Extension #2: Transfer Learning • Skill Transfer • Macro Transfer • MLN Transfer
MLN Transfer
• Learn a Markov Logic Network to represent the source-task policy relationally
• Apply the policy via demonstration in the target task
(Diagram: state → MLN Q-function → value for each action)
Markov Logic Networks
(Diagram: a small example Markov network over nodes A, B, X, Y, Z)
• A Markov network models a joint distribution
• A Markov Logic Network combines probability with logic
• Template: a set of first-order formulas with weights
• Each grounded predicate in a formula becomes a node
• Predicates in a grounded formula are connected by arcs
• Probability of a world: (1/Z) exp( Σi Wi Ni ), where Wi is the weight of formula i and Ni is the number of its true groundings
MLN Q-function
Formula 1 (W1 = 0.75, N1 = 1 teammate):
IF distance(me, Teammate) < 15 AND angle(me, goalie, Teammate) > 45 THEN Q ∈ (0.8, 1.0)
Formula 2 (W2 = 1.33, N2 = 3 goal parts):
IF distance(me, GoalPart) < 10 AND angle(me, goalie, GoalPart) > 45 THEN Q ∈ (0.8, 1.0)
Probability that Q ∈ (0.8, 1.0): exp(W1 N1 + W2 N2) / (1 + exp(W1 N1 + W2 N2))
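Plugging the slide's weights and grounding counts into this formula gives the bin probability directly:

```python
# Worked check of the slide's numbers: the probability that Q falls in (0.8, 1.0)
# given the formula weights and grounding counts above.
import math

W1, N1 = 0.75, 1    # Formula 1: satisfied by 1 teammate
W2, N2 = 1.33, 3    # Formula 2: satisfied by 3 goal parts

score = W1 * N1 + W2 * N2
p_high_bin = math.exp(score) / (1 + math.exp(score))
print(round(p_high_bin, 3))   # about 0.991
```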
Using an MLN Q-function
Bin probabilities for one state-action pair: P1 = 0.75 for Q ∈ (0.8, 1.0), P2 = 0.15 for Q ∈ (0.5, 0.8), P3 = 0.10 for Q ∈ (0, 0.5)
Q = P1 · E[Q | bin1] + P2 · E[Q | bin2] + P3 · E[Q | bin3]
where E[Q | bin] is the Q-value of the most similar training example in that bin
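A small sketch of this combination, using the slide's bin probabilities; the per-bin expected Q-values are placeholder numbers standing in for the most-similar-example lookup.

```python
# Sketch: combine bin probabilities into a single Q-value.
bin_probs = [0.75, 0.15, 0.10]       # P(Q in (0.8,1.0)), P(Q in (0.5,0.8)), P(Q in (0,0.5))
bin_expectations = [0.9, 0.6, 0.3]   # assumed E[Q | bin] values (placeholders)

q_value = sum(p * e for p, e in zip(bin_probs, bin_expectations))
print(q_value)   # 0.795 with these placeholder expectations
```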
Conclusions • Advice and transfer can provide RL agents with knowledge that improves early performance • Relational knowledge is desirable because it is general and involves human-level reasoning • More detailed knowledge produces larger initial benefits, but is less widely transferrable
Acknowledgements • DARPA grant HR0011-04-1-0007 • DARPA grant HR0011-07-C-0060 • DARPA grant FA8650-06-C-7606 • NRL grant N00173-06-1-G002 Thank You