Transfer in Reinforcement Learning via Markov Logic Networks Lisa Torrey, Jude Shavlik, Sriraam Natarajan, Pavan Kuppili, Trevor Walker University of Wisconsin-Madison, USA
Possible Benefits of Transfer in RL
Learning curves in the target task: [plot of performance vs. training time, with the "with transfer" curve above the "without transfer" curve]
The RoboCup Domain: 2-on-1 BreakAway and 3-on-2 BreakAway
Reinforcement Learning
States are described by features: distance(me, teammate1) = 15, distance(me, opponent1) = 5, angle(opponent1, me, teammate1) = 30, …
Actions are: Move, Pass, Shoot
Rewards are: +1 for scoring, 0 otherwise
[Diagram: the agent sends an action to the environment; the environment returns the next state and a reward]
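As a rough illustration of this loop, here is a minimal Python sketch; the env object, its reset/step interface, and the q_function are hypothetical stand-ins for the RoboCup simulator and the learned value function:

def run_episode(env, q_function, actions=("move", "pass", "shoot")):
    """Play one episode, returning the total reward."""
    # state: a dict of features, e.g. {"distance(me,teammate1)": 15, ...}
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        # Greedy policy: pick the action with the highest estimated Q-value.
        action = max(actions, key=lambda a: q_function(state, a))
        state, reward, done = env.step(action)  # reward: +1 for scoring, else 0
        total_reward += reward
    return total_reward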
Our Previous Methods • Skill transfer • Learn a rule for when to take each action • Use rules as advice • Macro transfer • Learn a relational multi-step action plan • Use macro to demonstrate
Transfer via Markov Logic Networks
[Pipeline: the source-task learner produces a source-task Q-function and data; we analyze these and learn an MLN Q-function, which is then used to demonstrate for the target-task learner]
Markov Logic Networks • A Markov network models a joint distribution • A Markov Logic Network combines probability with logic • Template: a set of first-order formulas with weights • Each grounded predicate in a formula becomes a node • Predicates in a grounded formula are connected by arcs • Probability of a world: P(world) = (1/Z) exp(Σi WiNi), where Wi is the weight of formula i and Ni is its number of true groundings Richardson and Domingos, ML 2006
MLN Q-function
Formula 1 (W1 = 0.75): IF distance(me, Teammate) < 15 AND angle(me, goalie, Teammate) > 45 THEN Q ∈ (0.8, 1.0), satisfied by N1 = 1 teammate
Formula 2 (W2 = 1.33): IF distance(me, GoalPart) < 10 AND angle(me, goalie, GoalPart) > 45 THEN Q ∈ (0.8, 1.0), satisfied by N2 = 3 goal parts
Probability that Q ∈ (0.8, 1.0): exp(W1N1 + W2N2) / (1 + exp(W1N1 + W2N2))
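The slide's arithmetic can be checked with a small Python sketch, assuming the two-formula case where the bin probability is a logistic function of the weighted grounding counts (function and variable names are ours):

import math

def bin_probability(weights, counts):
    """P(Q in bin) = exp(sum Wi*Ni) / (1 + exp(sum Wi*Ni)),
    where counts[i] is the number of satisfied groundings of formula i."""
    s = sum(w * n for w, n in zip(weights, counts))
    return math.exp(s) / (1.0 + math.exp(s))

# Values from the slide: formula 1 (W1=0.75, N1=1 teammate),
# formula 2 (W2=1.33, N2=3 goal parts).
print(bin_probability([0.75, 1.33], [1, 3]))  # ≈ 0.99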
Grounded Markov Network
[Network: the query node Q ∈ (0.8, 1.0) is connected by arcs to the grounded predicates distance(me, teammate1) < 15, angle(me, goalie, teammate1) > 45, distance(me, goalLeft) < 10, angle(me, goalie, goalLeft) > 45, distance(me, goalRight) < 10, and angle(me, goalie, goalRight) > 45]
Learning an MLN • Find good Q-value bins using hierarchical clustering • Learn rules that classify examples into bins using inductive logic programming • Learn weights for these formulas to produce the final MLN
Binning via Hierarchical Clustering
[Three histograms of frequency vs. Q-value, showing Q-values being merged into successively fewer clusters that define the bins]
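One plausible realization of this binning step, assuming plain agglomerative (Ward) clustering of scalar Q-values with SciPy; the paper's exact clustering criteria may differ:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def bin_q_values(q_values, n_bins=3):
    """Cluster scalar Q-values agglomeratively; return per-example
    bin labels (0..n_bins-1, ordered by Q) and the bin boundaries."""
    X = np.asarray(q_values, dtype=float).reshape(-1, 1)
    labels = fcluster(linkage(X, method="ward"), t=n_bins, criterion="maxclust")
    # Order clusters by mean Q-value so bin 0 holds the lowest Q-values.
    order = np.argsort([X[labels == c].mean() for c in range(1, n_bins + 1)])
    rank = {c + 1: r for r, c in enumerate(order)}
    bins = np.array([rank[l] for l in labels])
    # Place each boundary midway between two adjacent clusters.
    edges = [(X[bins == b].max() + X[bins == b + 1].min()) / 2
             for b in range(n_bins - 1)]
    return bins, edges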
Classifying Into Bins via ILP • Given examples • Positive: inside this Q-value bin • Negative: outside this Q-value bin • The Aleph* ILP learning system finds rules that separate positive from negative examples • Builds rules one predicate at a time • Top-down search through the feature space * Srinivasan, 2001
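Aleph's top-down search can be suggested by a toy greedy rule learner in Python; this is only a sketch of growing a clause one literal at a time, not Aleph itself, and the candidate literals are hypothetical feature-threshold tests:

def learn_rule(positives, negatives, candidate_literals, max_len=4):
    """Greedily grow one rule (a conjunction of literals) that covers
    positives while excluding negatives. Each literal is a predicate
    test, e.g. lambda ex: ex["distance(me,teammate1)"] < 15, and each
    example is a dict of feature values."""
    rule = []
    pos, neg = list(positives), list(negatives)
    while neg and len(rule) < max_len:
        def precision(lit):
            # Laplace-corrected precision of the rule extended with lit.
            p = sum(1 for ex in pos if lit(ex))
            n = sum(1 for ex in neg if lit(ex))
            return (p + 1) / (p + n + 2)
        best = max(candidate_literals, key=precision)
        rule.append(best)
        pos = [ex for ex in pos if best(ex)]
        neg = [ex for ex in neg if best(ex)]
    return rule  # an example satisfies the rule iff all literals hold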
Learning Formula Weights • Given formulas and examples • Same examples as for ILP • ILP rules as network structure • Alchemy* finds weights that make the probability estimates accurate • Scaled conjugate-gradient algorithm * Kok, Singla, Richardson, Domingos, Sumner, Poon and Lowd, 2004-2007
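For the simplified single-bin case shown earlier, where the bin probability is a logistic function of formula counts, weight learning reduces to logistic regression; below is a plain gradient-ascent sketch (Alchemy itself optimizes the full MLN likelihood with a scaled conjugate-gradient algorithm):

import math

def learn_weights(examples, n_formulas, lr=0.01, epochs=500):
    """examples: list of (counts, in_bin) pairs, where counts[i] is the
    number of satisfied groundings of formula i and in_bin is 1 if the
    example's Q-value falls in this bin, else 0."""
    w = [0.0] * n_formulas
    for _ in range(epochs):
        for counts, y in examples:
            s = sum(wi * ni for wi, ni in zip(w, counts))
            p = 1.0 / (1.0 + math.exp(-s))  # predicted P(Q in bin)
            for i in range(n_formulas):
                # Log-likelihood gradient: (observed - predicted) * count.
                w[i] += lr * (y - p) * counts[i]
    return w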
Using an MLN Q-function
The MLN gives a probability for each bin: P1 = 0.75 for Q ∈ (0.8, 1.0), P2 = 0.15 for Q ∈ (0.5, 0.8), P3 = 0.10 for Q ∈ (0, 0.5)
Q = P1 · E[Q | bin1] + P2 · E[Q | bin2] + P3 · E[Q | bin3]
where E[Q | bin] is the Q-value of the most similar training example in that bin
Example Similarity • E[Q | bin] = Q-value of the most similar training example in the bin • Similarity = dot product of example vectors • An example vector records which bin rules the example satisfies (+1 if satisfied, -1 if not), e.g.:

            Rule 1   Rule 2   Rule 3   …
Example A:    1       -1        1
Example B:    1        1       -1
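Putting the last two slides together, a minimal Python sketch of the final Q-value estimate; the function names are ours, and each bin's training examples are assumed to be stored as (rule_vector, q_value) pairs:

def expected_q(example_vec, bin_examples):
    """E[Q | bin]: Q-value of the most similar training example in the
    bin, where similarity is the dot product of +1/-1 rule vectors."""
    def similarity(other_vec):
        return sum(a * b for a, b in zip(example_vec, other_vec))
    best_vec, best_q = max(bin_examples, key=lambda e: similarity(e[0]))
    return best_q

def mln_q_value(example_vec, bin_probs, bins):
    """Q = sum over bins of P(bin) * E[Q | bin]."""
    return sum(p * expected_q(example_vec, b)
               for p, b in zip(bin_probs, bins))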
Experiments • Source task: 2-on-1 BreakAway • 3000 existing games from the learning curve • Learn MLNs from 5 separate runs • Target task: 3-on-2 BreakAway • Demonstration period of 100 games • Continue training up to 3000 games • Perform 5 target runs for each source run
Discoveries • Results can vary widely with the source-task chunk from which we transfer • Most methods use the “final” Q-function from the last chunk • MLN transfer performs better from chunks halfway through the learning curve
Conclusions • MLN transfer can significantly improve initial target-task performance • Like macro transfer, it is an aggressive approach for tasks with similar strategies • It “lifts” transferred information to first-order logic, making it more general for transfer • Theory refinement in the target task may be viable through MLN revision
Potential Future Work • Model screening for transfer learning • Theory refinement in the target task • Fully relational RL in RoboCup using MLNs as Q-function approximators
Acknowledgements • DARPA Grant HR0011-07-C-0060 • DARPA Grant FA 8650-06-C-7606 Thank You