Autonomous Inter-Task Transfer in Reinforcement Learning Domains
Matthew E. Taylor
Learning Agents Research Group, Department of Computer Sciences, University of Texas at Austin
6/24/2008
Inter-Task Transfer
• Learning tabula rasa can be unnecessarily slow
• Humans can use past information
  • Soccer with different numbers of players
• Agents should likewise leverage learned knowledge in novel tasks
Primary Questions
[Figure: source task (S_source, A_source) and target task (S_target, A_target)]
• Is it possible to transfer learned knowledge?
• Is it possible to transfer without providing a task mapping?
• Only reinforcement learning tasks are considered
Reinforcement Learning (RL): Key Ideas
• Markov Decision Process (MDP): ⟨S, A, T, R⟩
  • S: states in the task
  • A: actions the agent can take
  • T: T(S, A) → S′, the transition function
  • R: R(S) → ℜ, the reward function
• Policy: π(s) = a
• Action-value function: Q: S × A → ℜ
• State variables: s = ⟨x1, x2, …, xn⟩
[Figure: agent-environment loop - the agent selects an action; the environment returns the next state and a reward]
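As a concrete reference point, here is a minimal sketch of this agent-environment loop with a tabular Q(s, a). The env.reset()/env.step() interface, the epsilon-greedy choice, and all identifiers are assumptions for illustration, not part of the talk.

```python
import random
from collections import defaultdict

def run_episode(env, Q, actions, epsilon=0.1):
    """One pass through the RL loop: observe a state, pick an action
    (epsilon-greedy on Q), and let the environment return the next state
    and reward, as in the agent-environment diagram above."""
    s = env.reset()
    done, steps = False, 0
    while not done:
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s, r, done = env.step(a)   # environment applies T(s, a) and R(s)
        steps += 1                 # r would drive a learning update (see the Sarsa slide)
    return steps

Q = defaultdict(float)   # tabular action-value function Q(s, a), initially 0
```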
Outline
• Reinforcement Learning Background
• Inter-Task Mappings
• Value Function Transfer
• MASTER: Learning Inter-Task Mappings
• Related Work
• Future Work and Conclusion
Enabling Transfer
[Figure: the source task agent learns Q_S: S_S × A_S → ℜ and the target task agent learns Q_T: S_T × A_T → ℜ, each interacting with its own environment through states, actions, and rewards]
Inter-Task Mappings
[Figure: elements of the target task are mapped to their counterparts in the source task]
Inter-Task Mappings
• χ_X: s_target → s_source
  • Given a state variable in the target task (some x from s = ⟨x1, x2, …, xn⟩)
  • Return the corresponding state variable in the source task
• χ_A: a_target → a_source
  • Similar, but for actions
• Intuitive mappings exist in some domains (oracle)
• Used to construct the transfer functional
[Figure: χ_X maps S_Target ⟨x1…xn⟩ to S_Source ⟨x1…xk⟩; χ_A maps A_Target {a1…am} to A_Source {a1…aj}]
Keepaway [Stone, Sutton, and Kuhlmann 2005]
Goal: maintain possession of the ball
• 3 vs. 2: 5 agents, 3 (stochastic) actions, 13 (noisy & continuous) state variables
  • The keeper with the ball may hold it or pass to either teammate
  • Both takers move towards the player with the ball
• 4 vs. 3: 7 agents, 4 actions, 19 state variables
[Figure: keepers K1-K3 and takers T1-T2 on the Keepaway field]
Keepaway Hand-coded χ_A
Actions in 4 vs. 3 have "similar" actions in 3 vs. 2:
• Hold_4v3 → Hold_3v2
• Pass1_4v3 → Pass1_3v2
• Pass2_4v3 → Pass2_3v2
• Pass3_4v3 → Pass2_3v2
[Figure: the 4 vs. 3 pass actions (Pass1, Pass2, Pass3) and their 3 vs. 2 counterparts]
Keepaway Hand-coded χ_X
Define similar state variables in the two tasks
• Example: distances from the player with the ball to its teammates
[Figure: corresponding keeper-to-teammate distances in 3 vs. 2 and 4 vs. 3]
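Written as code, these hand-coded mappings are just small lookup tables. A sketch with illustrative identifiers (the names are mine, and only a fragment of the full 19-variable χ_X is shown):

```python
# Hand-coded action mapping chi_A: 4 vs. 3 action -> "similar" 3 vs. 2 action.
chi_A_keepaway = {
    "hold_4v3":  "hold_3v2",
    "pass1_4v3": "pass1_3v2",
    "pass2_4v3": "pass2_3v2",
    "pass3_4v3": "pass2_3v2",   # the extra pass action reuses an existing 3 vs. 2 action
}

# A fragment of the hand-coded state-variable mapping chi_X: distances from the
# keeper with the ball (K1) to its teammates. Variable names are hypothetical.
chi_X_keepaway = {
    "dist_K1_K2_4v3": "dist_K1_K2_3v2",
    "dist_K1_K3_4v3": "dist_K1_K3_3v2",
    "dist_K1_K4_4v3": "dist_K1_K3_3v2",  # the novel teammate maps onto an existing one
}
```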
Outline
• Reinforcement Learning Background
• Inter-Task Mappings
• Value Function Transfer
• MASTER: Learning Inter-Task Mappings
• Related Work
• Future Work and Conclusion
Value Function Transfer
[Figure: knowledge learned in the source task (S_source, A_source) is transferred to the target task (S_target, A_target)]
Value Function Transfer via ρ
• Q_S is not defined on S_T and A_T
• ρ(Q_S(S_S, A_S)) = Q_T(S_T, A_T): the action-value function is transferred
• ρ is task-dependent: it relies on the inter-task mappings
[Figure: ρ transforms Q_S: S_S × A_S → ℜ from the source task into an initial Q_T: S_T × A_T → ℜ for the target task]
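A minimal sketch of ρ for the tabular case, assuming the index-based mapping conventions below; the ρ actually used in the talk operates on function-approximator weights, as the following slides show.

```python
def rho_tabular(Q_source, target_sa_pairs, chi_X, chi_A, n_source_vars):
    """Sketch of rho for a tabular Q-function: initialize each target (state,
    action) value from the source value of its image under the inter-task
    mappings. States are tuples of state-variable values; chi_X maps a target
    variable index to a source variable index, chi_A maps target actions to
    source actions. (The talk's rho copies CMAC weights, not table entries.)"""
    Q_target = {}
    for s_t, a_t in target_sa_pairs:
        s_s = [0.0] * n_source_vars
        for t_idx, s_idx in chi_X.items():     # build the source-task view of s_t
            s_s[s_idx] = s_t[t_idx]
        Q_target[(s_t, a_t)] = Q_source.get((tuple(s_s), chi_A[a_t]), 0.0)
    return Q_target
```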
Learning Keepaway
• Sarsa update
• CMAC, RBF, and neural network function approximation are all successful
• Q^π(s, a): predicted number of steps the episode will last
• Reward = +1 for every timestep
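For reference, the standard tabular Sarsa update the slide refers to (the talk applies it with CMAC, RBF, or neural-network approximation rather than a table); the α and γ values are assumed parameters, not taken from the talk.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    """Standard Sarsa update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a)).
    Q is a dict-like mapping (state, action) -> value. With reward +1 per
    timestep and gamma = 1, Q(s,a) estimates the number of remaining steps
    in the episode, as on the slide."""
    td_error = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
```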
ρ's Effect on CMACs
• For each weight in the 4 vs. 3 function approximator:
  • Use the inter-task mapping to find the corresponding 3 vs. 2 weight
[Figure: weights in the 3 vs. 2 CMAC are copied into the corresponding positions of the 4 vs. 3 CMAC]
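A sketch of that weight-copying step, assuming each CMAC weight can be keyed by the state variable, tile index, and action it covers; this keying scheme is illustrative, not the talk's actual CMAC implementation.

```python
def transfer_cmac_weights(weights_3v2, target_tiles, chi_X, chi_A):
    """For each tile (weight) in the 4 vs. 3 CMAC, look up the tile it maps to
    in the 3 vs. 2 CMAC via the inter-task mappings and copy that weight.
    Tiles are identified here by a (state_variable, tile_index, action) key."""
    weights_4v3 = {}
    for (var, tile, action) in target_tiles:
        source_key = (chi_X[var], tile, chi_A[action])
        weights_4v3[(var, tile, action)] = weights_3v2.get(source_key, 0.0)
    return weights_4v3
```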
Transfer Evaluation Metrics
Two distinct scenarios:
1. Target Time Metric: successful if target task learning time is reduced
  • The "sunk cost" of source task training is ignored
  • Source task(s) are independently useful
  • AI goal: effectively utilize past knowledge; only the target task matters
2. Total Time Metric: successful if total (source + target) training time is reduced
  • Source task(s) are not independently useful
  • Engineering goal: minimize total training
• Set a threshold performance (e.g., the 8.5 shown on the learning curves) that the majority of agents can achieve with learning, and compare the time to reach it
[Figure: target task learning curves with and without transfer, with the threshold marked]
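A small sketch of how the two metrics might be computed from recorded learning curves; the data format and the 8.5 threshold (taken from the Keepaway curves) are assumptions.

```python
def time_to_threshold(curve, threshold=8.5):
    """Training time at which a learning curve first reaches the threshold.
    `curve` is a list of (training_time, performance) points."""
    for t, perf in curve:
        if perf >= threshold:
            return t
    return float("inf")          # threshold never reached

def transfer_metrics(target_curve_transfer, target_curve_scratch, source_time):
    """Target-time ignores source training ("sunk cost"); total-time charges it."""
    scratch = time_to_threshold(target_curve_scratch)
    with_tl = time_to_threshold(target_curve_transfer)
    return {"target_time_success": with_tl < scratch,
            "total_time_success": source_time + with_tl < scratch}
```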
Value Function Transfer: Time to Threshold in 4 vs. 3
[Figure: time to threshold in 4 vs. 3 without transfer, compared with the target task time and the total (source + target) time with transfer]
Value Function Transfer Flexibility
• Different function approximators
  • Radial basis function & neural network
• Different actuators (pass accuracy)
  • "Accurate" passers have normal actuators
  • "Inaccurate" passers have less capable kick actuators
  • Value function transfer also reduces target task time and total time for:
    • Inaccurate 3 vs. 2 → Inaccurate 4 vs. 3
    • Accurate 3 vs. 2 → Inaccurate 4 vs. 3
    • Inaccurate 3 vs. 2 → Accurate 4 vs. 3
• Different Keepaway tasks: 5 vs. 4, 6 vs. 5, 7 vs. 6
• Partial mappings
• Different domains
  • Knight Joust to 4 vs. 3 Keepaway
  • 3 vs. 2 Flat Reward, 3 vs. 2 Giveaway
Knight Joust: goal is to travel from the start to the goal line; 2 agents, 3 actions, 3 state variables; fully observable, discrete state space (Q-table with ~600 (s, a) pairs); deterministic actions; the opponent moves directly towards the player, and the player may move North or take a knight jump to either side.
Empirical Evaluation
• Keepaway: 3 vs. 2, 4 vs. 3, 5 vs. 4, 6 vs. 5, 7 vs. 6
• Server Job Scheduling
  • Autonomic computing task
  • A server processes jobs in a queue while new jobs arrive
  • The policy selects between jobs with different utility functions
  • Source: job types 1-2; target: job types 1-4
• Mountain Car: 2D and 3D
• Cross-domain transfer
  • Ringworld to Keepaway
  • Knight's Joust to Keepaway
These tasks differ in the number of actions, number of state variables, discrete vs. continuous state, deterministic vs. stochastic dynamics, full vs. partial observability, and single-agent vs. multi-agent settings.
Outline
• Reinforcement Learning Background
• Inter-Task Mappings
• Value Function Transfer
• MASTER: Learning Inter-Task Mappings
• Related Work
• Future Work and Conclusion
Learning Task Relationships
• Sometimes task relationships are unknown
  • Learning them is necessary for autonomous transfer
  • But finding similarities (analogies) can be very hard!
• Key idea:
  • Agents may generate data (experience) in both tasks
  • Leverage existing machine learning techniques
• Two techniques, differing in the amount of background knowledge required
Context
• Steps to enable autonomous transfer:
  1. Select a relevant source task, given a target task
  2. Learn how the source and target tasks are related
  3. Effectively transfer knowledge between tasks
• Transfer is feasible (step 3)
• Steps toward finding mappings between tasks (step 2):
  • Leverage full QDBNs to search for mappings [Liu and Stone, 2006]
  • Test possible mappings on-line [Soni and Singh, 2006]
  • Mapping learning via classification
[Figure: (S, A, r, S′) experience recorded in the source and target tasks is fed to an action classifier that learns an A → A mapping]
MASTER Overview: Modeling Approximate State Transitions by Exploiting Regression
• Goals:
  • Learn an inter-task mapping between tasks
  • Minimize data complexity
  • No background knowledge needed
• Algorithm overview:
  • Record data in the source task
  • Record a small amount of data in the target task
  • Analyze the data off-line to determine the best mapping
  • Use the mapping in the target task
[Figure: MASTER sits between the source task agent and the target task agent and their environments]
MASTER Algorithm
• Record observed (s_source, a_source, s′_source) tuples in the source task
• Record a small number of (s_target, a_target, s′_target) tuples in the target task
• Learn a one-step transition model of the target task: M(s_target, a_target) → s′_target
• For every possible action mapping χ_A and every possible state variable mapping χ_X:
  • Transform the recorded source task tuples
  • Calculate the error of the transformed source task tuples on the target task model: ∑(M(s_transformed, a_transformed) − s′_transformed)²
• Return the χ_A, χ_X with the lowest error
[Figure: MASTER observes the state, action, and reward experience of both the source task agent and the target task agent]
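A compact sketch of this search, using scikit-learn's MLP regressor as a stand-in for the regression model (the Mountain Car experiments later use backprop in Weka); the data layouts, function names, and brute-force enumeration are illustrative, not the talk's implementation.

```python
import itertools
import numpy as np
from sklearn.neural_network import MLPRegressor

def master(source_tuples, target_tuples, source_vars, target_vars,
           source_actions, target_actions):
    """Sketch of MASTER: learn a one-step model of the target task from a small
    sample, then score every candidate mapping by how well transformed
    source-task transitions agree with that model (lower error = better)."""
    # 1. Learn M(s_target, a_target) -> s'_target from the target sample.
    X = np.array([list(s) + [target_actions.index(a)] for s, a, _ in target_tuples])
    Y = np.array([list(s2) for _, _, s2 in target_tuples])
    model = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000).fit(X, Y)

    best, best_err = None, float("inf")
    # 2. Enumerate mappings: target variable i -> source variable x_map[i],
    #    target action j -> source action a_map[j].  (Exponential, as noted.)
    for x_map in itertools.product(range(len(source_vars)), repeat=len(target_vars)):
        for a_map in itertools.product(range(len(source_actions)), repeat=len(target_actions)):
            err = 0.0
            for s, a, s2 in source_tuples:
                # 3. Transform the source tuple into target format under this mapping.
                s_t = [s[i] for i in x_map]
                s2_t = np.array([s2[i] for i in x_map])
                for a_t, a_s in enumerate(a_map):
                    if source_actions[a_s] == a:   # each target action mapping onto a
                        pred = model.predict([s_t + [a_t]])[0]
                        err += float(np.sum((pred - s2_t) ** 2))
            if err < best_err:
                best, best_err = (x_map, a_map), err
    return best   # (state-variable mapping, action mapping) with the lowest error
```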
Observations
• Pros:
  • Very little target task data needed (low sample complexity)
  • The analysis for discovering mappings is off-line
• Cons:
  • Exponential in the number of state variables and actions
Generalized Mountain Car
• Both tasks:
  • Episodic
  • Scaled state variables
  • Sarsa
  • CMAC function approximation
• 2D Mountain Car
  • State variables: x, ẋ
  • Actions: Left, Neutral, Right
• 3D Mountain Car (novel task)
  • State variables: x, y, ẋ, ẏ
  • Actions: Neutral, West, East, South, North
• Hand-coded χ_X
  • x, y → x
  • ẋ, ẏ → ẋ
• Hand-coded χ_A
  • Neutral → Neutral
  • West, South → Left
  • East, North → Right
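The hand-coded 3D → 2D mappings above, written out as dictionaries (string identifiers such as "xdot" for ẋ are illustrative):

```python
# Hand-coded mappings from 3D Mountain Car (target) to 2D Mountain Car (source).
chi_X_mc = {"x": "x", "y": "x", "xdot": "xdot", "ydot": "xdot"}
chi_A_mc = {
    "Neutral": "Neutral",
    "West": "Left",  "South": "Left",
    "East": "Right", "North": "Right",
}
```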
MASTER and Mountain Car
• Record observed (x, ẋ, a_2D, x′, ẋ′) tuples in the 2D task
• Record a small number of (x, y, ẋ, ẏ, a_3D, x′, y′, ẋ′, ẏ′) tuples in the 3D task
• Learn a one-step transition model of the 3D task: M(x, y, ẋ, ẏ, a_3D) → x′, y′, ẋ′, ẏ′
• For every possible action mapping χ_A and every possible state variable mapping χ_X:
  • Transform the recorded 2D tuples
  • Calculate the error of the transformed tuples on the 3D model: ∑(M(s_transformed, a_transformed) − s′_transformed)²
• Return the χ_A, χ_X with the lowest error (of 240 possible mappings: 16 state variable mappings × 15 action mappings)
Utilizing Mappings in 3D Mountain Car
[Figure: 3D Mountain Car learning curves with hand-coded mappings vs. no transfer]
Experimental Setup
• Learn in 2D Mountain Car for 100 episodes
• Learn in 3D Mountain Car for 25 episodes
• Apply MASTER
  • Train the transition model off-line using backprop in Weka
• Transfer from 2D to 3D: Q-Value Reuse
• Learn the 3D task
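Q-Value Reuse, as I understand it from Taylor's work, keeps the source-task Q-function frozen and adds it, through the inter-task mappings, to a separately learned target-task Q-function. A tabular sketch with an assumed lookup convention:

```python
def q_value_reuse(Q_source, Q_target, s_t, a_t, chi_X_idx, chi_A):
    """Q-Value Reuse (sketch): the value used in the target task is the frozen
    source-task estimate, looked up through the mappings, plus a learned
    target-task correction. States are tuples of state-variable values;
    chi_X_idx[i] is the target-variable index whose value fills source
    variable i (e.g., (0, 2) picks x and xdot from a 3D Mountain Car state)."""
    s_s = tuple(s_t[i] for i in chi_X_idx)   # source-task view of the target state
    a_s = chi_A[a_t]
    return Q_source.get((s_s, a_s), 0.0) + Q_target.get((s_t, a_t), 0.0)
```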
Action Mappings Evaluated
• Recorded 2D tuple (x, ẋ, a, x′, ẋ′): (-0.50, 0.01, Right, -0.49, 0.02)
• Transformed 3D tuples under candidate action mappings:
  • (-0.50, -0.50, 0.01, 0.01, East, -0.49, -0.49, 0.02, 0.02)
  • (-0.50, -0.50, 0.01, 0.01, North, -0.49, -0.49, 0.02, 0.02)
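A sketch of the tuple transformation being evaluated here; run on the slide's recorded 2D tuple, it reproduces the two transformed 3D tuples shown above. The dictionary and variable names are illustrative.

```python
def transform_source_tuple(s, a, s2, x_map, a_map_inv):
    """Transform one recorded 2D tuple (s, a, s') into 3D format under a
    candidate mapping: target variable i takes the value of source variable
    x_map[i], and the source action expands to every target action mapping
    onto it."""
    s_t = tuple(s[i] for i in x_map)
    s2_t = tuple(s2[i] for i in x_map)
    return [(s_t, a_t, s2_t) for a_t in a_map_inv[a]]

# The slide's recorded 2D tuple, transformed under the hand-coded mapping:
x_map = (0, 0, 1, 1)   # (x, y, xdot, ydot) take values from (x, x, xdot, xdot)
a_map_inv = {"Right": ["East", "North"], "Left": ["West", "South"],
             "Neutral": ["Neutral"]}
print(transform_source_tuple((-0.50, 0.01), "Right", (-0.49, 0.02),
                             x_map, a_map_inv))
```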
Transfer in 3D Mountain Car
[Figure: 3D Mountain Car learning curves for hand-coded mappings, mappings weighted by 1/MSE, averaging over action mappings, averaging over both mappings, and no transfer]
Transfer in 3D Mountain Car: Zoom
[Figure: zoomed view of the learning curves for the averaged action mappings vs. no transfer]
MASTER Wrap-up
• First fully autonomous mapping-learning method
• Learning is done off-line
• Can be used to select the most relevant source task, or to transfer from multiple source tasks
• Future work:
  • Incorporate heuristic search
  • Use in more complex domains
  • Formulate as an optimization problem?
Outline
• Reinforcement Learning Background
• Inter-Task Mappings
• Value Function Transfer
• MASTER: Learning Inter-Task Mappings
• Related Work
• Future Work and Conclusion
Related Work: A Framework
Transfer methods can be categorized along several dimensions:
• Allowed task differences
• Source task selection
• Type of knowledge transferred
• Allowed base learners
• (plus 3 others)
Selected Related Work: Transfer Methods
• Same state variables and actions [Selfridge+, 1985]
• Multi-task learning [Fernandez and Veloso, 2006]
• Methods to avoid inter-task mappings [Konidaris and Barto, 2007]
• Different state variables and actions [Torrey+]
[Figure: which MDP elements may differ between tasks: T(s, a) = s′, actions, states, rewards, and s = ⟨x1, …, xn⟩]
Selected Related Work: Mapping Learning Methods
On-line:
• Test possible mappings on-line as new actions [Soni and Singh, 2006]
• k-armed bandit, where each arm is a mapping [Talvitie and Singh, 2007]
Off-line:
• Full Qualitative Dynamic Bayes Networks (QDBNs) [Liu and Stone, 2006]
  • Assume T types of task-independent objects
  • The Keepaway domain has 2 object types: keepers and takers
[Figure: Hold in 2 vs. 1 Keepaway]
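To illustrate the bandit framing, a generic epsilon-greedy stand-in (not Talvitie and Singh's actual algorithm); evaluate_episode is an assumed callback that runs one target-task episode with the given mapping and returns its return.

```python
import random

def bandit_over_mappings(candidate_mappings, evaluate_episode,
                         episodes=500, epsilon=0.1):
    """Treat each candidate inter-task mapping as an arm of a k-armed bandit:
    each episode, pick a mapping (epsilon-greedy on its average return so far),
    run an episode with it, and update its estimate. Mappings must be hashable
    (e.g., tuples or strings)."""
    totals = {m: 0.0 for m in candidate_mappings}
    counts = {m: 0 for m in candidate_mappings}
    for _ in range(episodes):
        if random.random() < epsilon or not any(counts.values()):
            m = random.choice(candidate_mappings)
        else:
            m = max(candidate_mappings,
                    key=lambda m: totals[m] / max(counts[m], 1))
        totals[m] += evaluate_episode(m)
        counts[m] += 1
    return max(candidate_mappings, key=lambda m: totals[m] / max(counts[m], 1))
```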
Outline
• Reinforcement Learning Background
• Inter-Task Mappings
• Value Function Transfer
• MASTER: Learning Inter-Task Mappings
• Related Work
• Future Work and Conclusion
Open Question 1: Optimize for Metrics
• Minimize target time: more source task training?
• Minimize total time: a "moderate" amount of source training?
• Depends on task similarity
[Figure: effect of the amount of 3 vs. 2 training on the metrics for 3 vs. 2 → 4 vs. 3 transfer]