Advice Taking and Transfer Learning: Naturally-Inspired Extensions to Reinforcement Learning
Lisa Torrey, Trevor Walker, Richard Maclin*, Jude Shavlik
University of Wisconsin - Madison
University of Minnesota - Duluth*
Reinforcement Learning
(Diagram: the Agent observes a state from the Environment, takes an action, and receives a reward, which may be delayed)
Q-Learning
policy(state) = argmax_action Q(state, action)
• Update the Q-function incrementally
• Follow the current Q-function to choose actions
• Converges to an accurate Q-function
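To make the update rule concrete, here is a minimal tabular Q-learning sketch in Python; the action names, hyperparameters, and epsilon-greedy exploration are illustrative assumptions (the RoboCup agents described later use function approximation rather than a table).

```python
# Minimal tabular Q-learning sketch (illustrative only).
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1      # assumed hyperparameters
ACTIONS = ["move_ahead", "pass", "shoot"]  # placeholder action names

Q = defaultdict(float)   # Q[(state, action)] -> estimated value

def choose_action(state):
    """Follow the current Q-function, with epsilon-greedy exploration."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """Incremental update toward reward + discounted value of the best next action."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```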
Limitations • Agents begin without any information • Random exploration required in early stages of learning • Long training times can result
Naturally-Inspired Extensions
• Advice Taking: a Human Teacher provides Knowledge to the RL Agent
• Transfer Learning: a Source-task Agent provides Knowledge to the Target-task Agent
Potential Benefits
(Plot: performance vs. training, with and without knowledge; starting with knowledge can give a higher start, higher slope, and higher asymptote)
Outline • RL in a complex domain • Extension #1: Advice Taking • Extension #2: Transfer Learning • Skill Transfer • Macro Transfer • MLN Transfer
The RoboCup Domain
• KeepAway: +1 per time step
• MoveDownfield: +1 per meter
• BreakAway: +1 upon goal
The RoboCup Domain
State features: distBetween(a0, Player), distBetween(a0, GoalPart), distBetween(Attacker, goalCenter), distBetween(Attacker, ClosestDefender), distBetween(Attacker, goalie), angleDefinedBy(topRight, goalCenter, a0), angleDefinedBy(GoalPart, a0, goalie), angleDefinedBy(Attacker, a0, ClosestDefender), angleDefinedBy(Attacker, a0, goalie), timeLeft
Actions: move(ahead), move(away), move(right), move(left), shoot(GoalPart), pass(Teammate)
Q-Learning with Function Approximation
policy(state) = argmax_action Q(state, action), with the Q-function represented by a function approximator rather than a lookup table
Approximating the Q-function
Features: distBetween(a0, a1), distBetween(a0, a2), distBetween(a0, goalie), …
Weights: 0.2, -0.1, 0.9, …
Linear support-vector regression: Q-value = WeightVectorᵀ · FeatureVector
Set weights to minimize: ModelSize + C × DataMisfit
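A small sketch of this approximation, using scikit-learn's LinearSVR as a stand-in for the support-vector regression solver used in the work; the feature values, Q-value targets, and hyperparameters are made up for illustration.

```python
# Sketch: one linear Q-model per action, fit by linear support-vector regression.
import numpy as np
from sklearn.svm import LinearSVR

# Rows are feature vectors: [distBetween(a0,a1), distBetween(a0,a2), distBetween(a0,goalie)]
X = np.array([[15.0,  5.0, 20.0],
              [ 8.0, 12.0, 18.0],
              [22.0,  3.0, 25.0]])
q_targets = np.array([0.7, 0.4, 0.9])   # Q-value estimates for a single action (made up)

# The SVR objective roughly matches the slide: ModelSize + C * DataMisfit
model = LinearSVR(C=1.0, epsilon=0.01, max_iter=10000).fit(X, q_targets)

def q_value(features):
    """Q-value = weight vector dot feature vector (plus a bias term)."""
    # equivalently: model.coef_ @ features + model.intercept_[0]
    return float(model.predict(features.reshape(1, -1))[0])

print(q_value(np.array([15.0, 5.0, 20.0])))
```

One linear model per action is assumed here, so the argmax over actions reduces to comparing the per-action predictions.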
Outline • RL in a complex domain • Extension #1: Advice Taking • Extension #2: Transfer Learning • Skill Transfer • Macro Transfer • MLN Transfer
Extension #1: Advice Taking IF an opponent is near AND a teammate is open THEN pass is the best action
Advice in RL
• Advice sets constraints on Q-values under specified conditions:
IF an opponent is near me AND a teammate is open THEN pass has a high Q-value
• Apply advice as soft constraints in optimization: minimize ModelSize + C × DataMisfit + μ × AdviceMisfit
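A hedged sketch of how an advice rule could enter the objective as a soft constraint: when the advice condition holds in a training state, the model is penalized if the advised action's predicted Q-value falls below a bound. The bound, penalty weights, features, and hinge form are assumptions for illustration, not the exact formulation used in the work.

```python
# Sketch: advice as a soft constraint in the regression objective.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 5.0],      # hypothetical features of two training states
              [0.0, 20.0]])
q_targets = np.array([0.9, 0.2])          # observed Q-value estimates for "pass"
advice_mask = np.array([True, False])     # states where "opponent near AND teammate open" holds
ADVICE_BOUND, C, MU = 0.8, 1.0, 0.5       # assumed bound and penalty weights

def objective(w):
    preds = X @ w
    model_size = np.dot(w, w)                                   # ModelSize
    data_misfit = np.sum((preds - q_targets) ** 2)              # DataMisfit
    # AdviceMisfit: hinge penalty when the advised Q-value is too low
    advice_misfit = np.sum(np.maximum(0.0, ADVICE_BOUND - preds[advice_mask]))
    return model_size + C * data_misfit + MU * advice_misfit

w_opt = minimize(objective, x0=np.zeros(X.shape[1]), method="Powell").x
```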
Outline • RL in a complex domain • Extension #1: Advice Taking • Extension #2: Transfer Learning • Skill Transfer • Macro Transfer • MLN Transfer
Extension #2: Transfer
Transfer knowledge among related tasks: 3-on-2 KeepAway, 3-on-2 MoveDownfield, and 3-on-2 BreakAway
Relational Transfer
• First-order logic describes relationships between objects, e.g.:
distBetween(a0, Teammate) > 10
distBetween(Teammate, goalCenter) < 15
• We want to transfer relational knowledge: it supports human-level reasoning and a general representation
Outline • RL in a complex domain • Extension #1: Advice Taking • Extension #2: Transfer Learning • Skill Transfer • Macro Transfer • MLN Transfer
Skill Transfer
• Learn advice about good actions from the source task
• Select positive and negative examples of good actions and apply inductive logic programming to learn rules
Example 1 (a source-task state where a pass succeeded):
distBetween(a0, a1) = 15
distBetween(a0, a2) = 5
distBetween(a0, goalie) = 20
...
action = pass(a1)
outcome = caught(a1)
Learned rule:
good_action(pass(Teammate)) :-
    distBetween(a0, Teammate) > 10,
    distBetween(Teammate, goalCenter) < 15.
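A rough sketch of the example-selection step; the record format, thresholds, and success test are assumptions, and the ILP rule learner itself is not shown.

```python
# Sketch: labeling positive and negative examples of good_action(pass(Teammate))
# from recorded source-task states, before handing them to an ILP rule learner.
def label_pass_examples(records, high_q=0.8, low_q=0.2):
    positives, negatives = [], []
    for rec in records:
        # rec example: {"state": {...}, "action": "pass(a1)", "q": 0.9, "outcome": "caught(a1)"}
        if not rec["action"].startswith("pass"):
            continue
        if rec["q"] >= high_q and rec["outcome"].startswith("caught"):
            positives.append(rec["state"])    # pass was clearly a good action here
        elif rec["q"] <= low_q:
            negatives.append(rec["state"])    # pass was clearly a poor choice here
    return positives, negatives
```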
User Advice in Skill Transfer • There may be new skills in the target that cannot be learned from the source • E.g., shooting in BreakAway • We allow users to add their own advice about these new skills • User advice simply adds to transfer advice
Outline • RL in a complex domain • Extension #1: Advice Taking • Extension #2: Transfer Learning • Skill Transfer • Macro Transfer • MLN Transfer
Macro Transfer
• Learn a strategy from the source task
• Find an action sequence that separates good games from bad games
• Learn first-order rules to control transitions along the sequence
Example sequence: move(ahead) → pass(Teammate) → shoot(GoalPart)
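One way to realize the second bullet is to score candidate sequences by how often they appear in good games versus bad games; the subsequence test and scoring rule below are assumptions for illustration.

```python
# Sketch: score a candidate action sequence by how well it separates
# good source-task games from bad ones.
def contains_sequence(game_actions, seq):
    """True if seq occurs in order (not necessarily contiguously) in the game."""
    it = iter(game_actions)
    return all(any(a == step for a in it) for step in seq)

def separation_score(seq, good_games, bad_games):
    good_hits = sum(contains_sequence(g, seq) for g in good_games) / len(good_games)
    bad_hits = sum(contains_sequence(g, seq) for g in bad_games) / len(bad_games)
    return good_hits - bad_hits   # high when common in good games and rare in bad ones

good_games = [["move(ahead)", "pass(Teammate)", "shoot(GoalPart)"],
              ["pass(Teammate)", "move(ahead)", "pass(Teammate)", "shoot(GoalPart)"]]
bad_games = [["move(ahead)", "move(left)", "pass(Teammate)"]]
print(separation_score(["pass(Teammate)", "shoot(GoalPart)"], good_games, bad_games))  # 1.0
```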
Transfer via Demonstration
Games 0–100 in the target task: execute the macro strategy while the agent learns an initial Q-function
After game 100: perform standard RL as the agent adapts to the target task
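A minimal sketch of this demonstration schedule, assuming a 100-game demonstration period and placeholder macro_policy / rl_policy callables.

```python
# Sketch of the demonstration schedule: follow the transferred macro strategy
# for the first 100 target-task games, then switch to standard RL.
DEMONSTRATION_GAMES = 100

def select_action(state, games_played, macro_policy, rl_policy):
    if games_played < DEMONSTRATION_GAMES:
        return macro_policy(state)   # demonstrate; the Q-function is trained on these games
    return rl_policy(state)          # adapt to the target task with standard RL
```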
Outline • RL in a complex domain • Extension #1: Advice Taking • Extension #2: Transfer Learning • Skill Transfer • Macro Transfer • MLN Transfer
MLN Transfer
• Learn a Markov Logic Network to represent the source-task policy relationally
• Apply the policy via demonstration in the target task
(Diagram: state → MLN Q-function → value for each action)
Markov Logic Networks
(Diagram: a small example Markov network over nodes A, B, X, Y, Z)
• A Markov network models a joint distribution
• A Markov Logic Network combines probability with logic
• Template: a set of first-order formulas with weights
• Each grounded predicate in a formula becomes a node
• Predicates in a grounded formula are connected by arcs
• Probability of a world: (1/Z) exp( Σi Wi Ni ), where Wi is the weight of formula i and Ni is the number of its true groundings
MLN Q-function
Formula 1 (W1 = 0.75, N1 = 1 teammate):
IF distance(me, Teammate) < 15 AND angle(me, goalie, Teammate) > 45 THEN Q ∈ (0.8, 1.0)
Formula 2 (W2 = 1.33, N2 = 3 goal parts):
IF distance(me, GoalPart) < 10 AND angle(me, goalie, GoalPart) > 45 THEN Q ∈ (0.8, 1.0)
Probability that Q ∈ (0.8, 1.0): exp(W1 N1 + W2 N2) / (1 + exp(W1 N1 + W2 N2))
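Plugging the slide's weights and grounding counts into this formula gives the bin probability directly:

```python
# Worked check of the slide's numbers: the probability that Q falls in (0.8, 1.0)
# given the formula weights and grounding counts above.
import math

W1, N1 = 0.75, 1    # Formula 1: satisfied by 1 teammate
W2, N2 = 1.33, 3    # Formula 2: satisfied by 3 goal parts

score = W1 * N1 + W2 * N2
p_high_bin = math.exp(score) / (1 + math.exp(score))
print(round(p_high_bin, 3))   # about 0.991
```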
Using an MLN Q-function
Bin probabilities for one state-action pair: P1 = 0.75 for Q ∈ (0.8, 1.0), P2 = 0.15 for Q ∈ (0.5, 0.8), P3 = 0.10 for Q ∈ (0, 0.5)
Q = P1 · E[Q | bin1] + P2 · E[Q | bin2] + P3 · E[Q | bin3]
where E[Q | bin] is the Q-value of the most similar training example in that bin
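A small sketch of this combination, using the slide's bin probabilities; the per-bin expected Q-values are placeholder numbers standing in for the most-similar-example lookup.

```python
# Sketch: combine bin probabilities into a single Q-value.
bin_probs = [0.75, 0.15, 0.10]       # P(Q in (0.8,1.0)), P(Q in (0.5,0.8)), P(Q in (0,0.5))
bin_expectations = [0.9, 0.6, 0.3]   # assumed E[Q | bin] values (placeholders)

q_value = sum(p * e for p, e in zip(bin_probs, bin_expectations))
print(q_value)   # 0.795 with these placeholder expectations
```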
Conclusions • Advice and transfer can provide RL agents with knowledge that improves early performance • Relational knowledge is desirable because it is general and involves human-level reasoning • More detailed knowledge produces larger initial benefits, but is less widely transferrable
Acknowledgements • DARPA grant HR0011-04-1-0007 • DARPA grant HR0011-07-C-0060 • DARPA grant FA8650-06-C-7606 • NRL grant N00173-06-1-G002 Thank You