Transfer in Reinforcement Learning via Markov Logic Networks Lisa Torrey, Jude Shavlik, Sriraam Natarajan, Pavan Kuppili, Trevor Walker University of Wisconsin-Madison, USA
Possible Benefits of Transfer in RL
Learning curves in the target task: [plot of performance vs. training time, with the "with transfer" curve above the "without transfer" curve]
The RoboCup Domain: 2-on-1 BreakAway and 3-on-2 BreakAway
Reinforcement Learning
States are described by features: distance(me, teammate1) = 15, distance(me, opponent1) = 5, angle(opponent1, me, teammate1) = 30, …
Actions are: Move, Pass, Shoot
Rewards are: +1 for scoring, 0 otherwise
[Diagram: the agent sends an action to the environment; the environment returns the next state and a reward]
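As a rough illustration of this loop, here is a minimal Python sketch; the env object, its reset/step interface, and the q_function are hypothetical stand-ins for the RoboCup simulator and the learned value function:

def run_episode(env, q_function, actions=("move", "pass", "shoot")):
    """Play one episode, returning the total reward."""
    # state: a dict of features, e.g. {"distance(me,teammate1)": 15, ...}
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        # Greedy policy: pick the action with the highest estimated Q-value.
        action = max(actions, key=lambda a: q_function(state, a))
        state, reward, done = env.step(action)  # reward: +1 for scoring, else 0
        total_reward += reward
    return total_reward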
Our Previous Methods • Skill transfer • Learn a rule for when to take each action • Use rules as advice • Macro transfer • Learn a relational multi-step action plan • Use macro to demonstrate
Transfer via Markov Logic Networks
[Pipeline: the source-task learner produces a source-task Q-function and data; we analyze these and learn an MLN Q-function, which is then used to demonstrate for the target-task learner]
Markov Logic Networks • A Markov network models a joint distribution • A Markov Logic Network combines probability with logic • Template: a set of first-order formulas with weights • Each grounded predicate in a formula becomes a node • Predicates in a grounded formula are connected by arcs • Probability of a world: P(world) = (1/Z) exp(Σi WiNi), where Wi is the weight of formula i and Ni is its number of true groundings Richardson and Domingos, ML 2006
MLN Q-function
Formula 1 (W1 = 0.75): IF distance(me, Teammate) < 15 AND angle(me, goalie, Teammate) > 45 THEN Q ∈ (0.8, 1.0), satisfied by N1 = 1 teammate
Formula 2 (W2 = 1.33): IF distance(me, GoalPart) < 10 AND angle(me, goalie, GoalPart) > 45 THEN Q ∈ (0.8, 1.0), satisfied by N2 = 3 goal parts
Probability that Q ∈ (0.8, 1.0): exp(W1N1 + W2N2) / (1 + exp(W1N1 + W2N2))
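The slide's arithmetic can be checked with a small Python sketch, assuming the two-formula case where the bin probability is a logistic function of the weighted grounding counts (function and variable names are ours):

import math

def bin_probability(weights, counts):
    """P(Q in bin) = exp(sum Wi*Ni) / (1 + exp(sum Wi*Ni)),
    where counts[i] is the number of satisfied groundings of formula i."""
    s = sum(w * n for w, n in zip(weights, counts))
    return math.exp(s) / (1.0 + math.exp(s))

# Values from the slide: formula 1 (W1=0.75, N1=1 teammate),
# formula 2 (W2=1.33, N2=3 goal parts).
print(bin_probability([0.75, 1.33], [1, 3]))  # ≈ 0.99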
Grounded Markov Network
[Network: the query node Q ∈ (0.8, 1.0) is connected by arcs to the grounded predicates distance(me, teammate1) < 15, angle(me, goalie, teammate1) > 45, distance(me, goalLeft) < 10, angle(me, goalie, goalLeft) > 45, distance(me, goalRight) < 10, and angle(me, goalie, goalRight) > 45]
Learning an MLN • Find good Q-value bins using hierarchical clustering • Learn rules that classify examples into bins using inductive logic programming • Learn weights for these formulas to produce the final MLN
Binning via Hierarchical Clustering
[Three histograms of frequency vs. Q-value, showing Q-values being merged into successively fewer clusters that define the bins]
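One plausible realization of this binning step, assuming plain agglomerative (Ward) clustering of scalar Q-values with SciPy; the paper's exact clustering criteria may differ:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def bin_q_values(q_values, n_bins=3):
    """Cluster scalar Q-values agglomeratively; return per-example
    bin labels (0..n_bins-1, ordered by Q) and the bin boundaries."""
    X = np.asarray(q_values, dtype=float).reshape(-1, 1)
    labels = fcluster(linkage(X, method="ward"), t=n_bins, criterion="maxclust")
    # Order clusters by mean Q-value so bin 0 holds the lowest Q-values.
    order = np.argsort([X[labels == c].mean() for c in range(1, n_bins + 1)])
    rank = {c + 1: r for r, c in enumerate(order)}
    bins = np.array([rank[l] for l in labels])
    # Place each boundary midway between two adjacent clusters.
    edges = [(X[bins == b].max() + X[bins == b + 1].min()) / 2
             for b in range(n_bins - 1)]
    return bins, edges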
Classifying Into Bins via ILP • Given examples • Positive: inside this Q-value bin • Negative: outside this Q-value bin • The Aleph* ILP learning system finds rules that separate positive from negative examples • Builds rules one predicate at a time • Top-down search through the feature space * Srinivasan, 2001
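Aleph's top-down search can be suggested by a toy greedy rule learner in Python; this is only a sketch of growing a clause one literal at a time, not Aleph itself, and the candidate literals are hypothetical feature-threshold tests:

def learn_rule(positives, negatives, candidate_literals, max_len=4):
    """Greedily grow one rule (a conjunction of literals) that covers
    positives while excluding negatives. Each literal is a predicate
    test, e.g. lambda ex: ex["distance(me,teammate1)"] < 15, and each
    example is a dict of feature values."""
    rule = []
    pos, neg = list(positives), list(negatives)
    while neg and len(rule) < max_len:
        def precision(lit):
            # Laplace-corrected precision of the rule extended with lit.
            p = sum(1 for ex in pos if lit(ex))
            n = sum(1 for ex in neg if lit(ex))
            return (p + 1) / (p + n + 2)
        best = max(candidate_literals, key=precision)
        rule.append(best)
        pos = [ex for ex in pos if best(ex)]
        neg = [ex for ex in neg if best(ex)]
    return rule  # an example satisfies the rule iff all literals hold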
Learning Formula Weights • Given formulas and examples • Same examples as for ILP • ILP rules as network structure • Alchemy* finds weights that make the probability estimates accurate • Scaled conjugate-gradient algorithm * Kok, Singla, Richardson, Domingos, Sumner, Poon and Lowd, 2004-2007
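For the simplified single-bin case shown earlier, where the bin probability is a logistic function of formula counts, weight learning reduces to logistic regression; below is a plain gradient-ascent sketch (Alchemy itself optimizes the full MLN likelihood with a scaled conjugate-gradient algorithm):

import math

def learn_weights(examples, n_formulas, lr=0.01, epochs=500):
    """examples: list of (counts, in_bin) pairs, where counts[i] is the
    number of satisfied groundings of formula i and in_bin is 1 if the
    example's Q-value falls in this bin, else 0."""
    w = [0.0] * n_formulas
    for _ in range(epochs):
        for counts, y in examples:
            s = sum(wi * ni for wi, ni in zip(w, counts))
            p = 1.0 / (1.0 + math.exp(-s))  # predicted P(Q in bin)
            for i in range(n_formulas):
                # Log-likelihood gradient: (observed - predicted) * count.
                w[i] += lr * (y - p) * counts[i]
    return w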
Using an MLN Q-function
The MLN gives a probability for each bin: P1 = 0.75 for Q ∈ (0.8, 1.0), P2 = 0.15 for Q ∈ (0.5, 0.8), P3 = 0.10 for Q ∈ (0, 0.5)
Q = P1 · E[Q | bin1] + P2 · E[Q | bin2] + P3 · E[Q | bin3]
where E[Q | bin] is the Q-value of the most similar training example in that bin
Example Similarity • E[Q | bin] = Q-value of the most similar training example in the bin • Similarity = dot product of example vectors • An example vector records which bin rules the example satisfies (+1 if satisfied, -1 if not), e.g.:

            Rule 1   Rule 2   Rule 3   …
Example A:    1       -1        1
Example B:    1        1       -1
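Putting the last two slides together, a minimal Python sketch of the final Q-value estimate; the function names are ours, and each bin's training examples are assumed to be stored as (rule_vector, q_value) pairs:

def expected_q(example_vec, bin_examples):
    """E[Q | bin]: Q-value of the most similar training example in the
    bin, where similarity is the dot product of +1/-1 rule vectors."""
    def similarity(other_vec):
        return sum(a * b for a, b in zip(example_vec, other_vec))
    best_vec, best_q = max(bin_examples, key=lambda e: similarity(e[0]))
    return best_q

def mln_q_value(example_vec, bin_probs, bins):
    """Q = sum over bins of P(bin) * E[Q | bin]."""
    return sum(p * expected_q(example_vec, b)
               for p, b in zip(bin_probs, bins))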
Experiments • Source task: 2-on-1 BreakAway • 3000 existing games from the learning curve • Learn MLNs from 5 separate runs • Target task: 3-on-2 BreakAway • Demonstration period of 100 games • Continue training up to 3000 games • Perform 5 target runs for each source run
Discoveries • Results can vary widely with the source-task chunk from which we transfer • Most methods use the “final” Q-function from the last chunk • MLN transfer performs better from chunks halfway through the learning curve
Conclusions • MLN transfer can significantly improve initial target-task performance • Like macro transfer, it is an aggressive approach for tasks with similar strategies • It “lifts” transferred information to first-order logic, making it more general for transfer • Theory refinement in the target task may be viable through MLN revision
Potential Future Work • Model screening for transfer learning • Theory refinement in the target task • Fully relational RL in RoboCup using MLNs as Q-function approximators
Acknowledgements • DARPA Grant HR0011-07-C-0060 • DARPA Grant FA 8650-06-C-7606 Thank You