Modelling Motivation for Experience-Based Attention Focus in Reinforcement Learning
PhD Thesis Defence, July 2007
Candidate: Kathryn Merrick, School of Information Technologies, University of Sydney
Supervisor: Prof. Mary Lou Maher, Key Centre for Design Computing and Cognition, University of Sydney
Objectives | Contributions | Results | Conclusions
Introduction
• Learning environments may be complex, with many states and possible actions
• The tasks to be learned may change over time
• It may be difficult to predict tasks in advance
• Doing ‘everything’ may be infeasible
• How can artificial agents focus attention to develop behaviours in complex, dynamic environments?
• This thesis considers this question in conjunction with reinforcement learning
Objectives | Contributions | Results | Conclusions
[Diagram: a behavioural cycle over states S1–S4 and actions A1–A4]
1. Develop models of motivation that focus attention based on experiences
2. Model complex, dynamic environments using a representation that enables adaptive behaviour
3. Develop learning agents with three aspects of attention focus:
  • Behavioural cycles
  • Adaptive behaviour
  • Multi-task learning
4. Develop metrics for comparing the adaptability and multi-task learning behaviour of MRL agents
5. Evaluate the performance and scalability of MRL agents using different models of motivation and different RL approaches
Objectives | Contributions | Results | Conclusions
Modelling Motivation as Experience-Based Reward
Interest-based motivation: Rm(t) = I(t)
• Compute observations and events OS(t), ES(t)
• Task selection using a self-organising map
• Compute experience-based reward using:
  • Stanley’s model of habituation
  • Wundt curve
• No arbitration required
Interest- and competence-based motivation: Rm(t) = max(I(t), C(t))
• Compute observations and events OS(t), ES(t)
• Task selection using a self-organising map
• Compute experience-based reward using:
  • Policy error
  • Deci and Ryan’s model of optimal challenges
• Arbitrate by taking the maximum of interest and competence motivation
Objectives | Contributions | Results | Conclusions
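To make the reward computation concrete, the following Python sketch (not the thesis code) shows a Wundt-curve interest signal built as the difference of two sigmoids, a Stanley-style habituation step, and the max arbitration between interest and competence. The function names, default parameter values, and exact functional forms are illustrative assumptions, loosely guided by the parameters listed on the Sensitivity slide.

```python
import math

# Hedged sketch: interest as a Wundt curve over novelty, i.e. the difference of a
# positive (reward) and a negative (punishment) sigmoid. Parameter names echo the
# Sensitivity slide (rho+, rho-, F+min, F-min); the exact form in the thesis may differ.

def sigmoid(novelty, rho, f_min, f_max=1.0):
    """Sigmoid feedback function with slope rho and turning point f_min."""
    return f_max / (1.0 + math.exp(-rho * (novelty - f_min)))

def wundt_interest(novelty, rho_pos=10.0, rho_neg=10.0, f_pos_min=0.5, f_neg_min=1.5):
    """Interest peaks at moderate novelty: the positive sigmoid rewards rising novelty,
    the negative sigmoid suppresses interest when novelty becomes too high."""
    return sigmoid(novelty, rho_pos, f_pos_min) - sigmoid(novelty, rho_neg, f_neg_min)

def habituated_novelty(n_prev, stimulus, alpha=1.0, tau=10.0):
    """One discrete step of a Stanley-style habituation model: novelty relaxes towards
    a resting level (1.0 here) and is driven down by repeated stimulation.
    alpha and tau are illustrative values."""
    return n_prev + (alpha * (1.0 - n_prev) - stimulus) / tau

def motivation_reward(interest, competence=None):
    """Rm(t) = I(t) for the interest-only model, or max(I(t), C(t)) when a
    competence signal (e.g. derived from policy error) is also available."""
    return interest if competence is None else max(interest, competence)

if __name__ == "__main__":
    for n in [0.0, 0.5, 1.0, 1.5, 2.0]:
        print(f"novelty={n:.1f}  interest={wundt_interest(n):.3f}")
```

Evaluating the sketch shows the characteristic Wundt shape: interest is low for very familiar and very unfamiliar experiences and highest for moderately novel ones, which is what directs attention towards learnable tasks.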
Representing Complex, Dynamic Environments
P = {P1, P2, P3, …, Pi, …}
Sensation grammar:
S → <sensations>
<sensations> → <PiSensations> <sensations> | ε
<PiSensations> → <sj> <PiSensations> | ε
<sj> → <number> | <string>
<number> → 1 | 2 | 3 | ...
<string> → ...
Example sensed states and action sets from two environment partitions:
S(1) = (<visiblePick:1> <visibleForge:1> <visibleSmithy:1>)
A(1) = {A(pick-up, pick), A(pick-up, forge), A(pick-up, smithy)}
S(2) = (<visibleAxe:1> <visibleLathe:1>)
A(2) = {A(pick-up, axe), A(pick-up, lathe)}
Action grammar:
A → <actions>
<actions> → <PiActions> <actions> | ε
<PiActions> → <Aj> <PiActions> | ε
<Aj> → ...
Objectives | Contributions | Results | Conclusions
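As a concrete illustration of this representation, here is a small Python sketch (assumed data structures, not the thesis implementation) that builds the S(1)/A(1) and S(2)/A(2) examples as variable-length collections of label:value sensations and parameterised actions.

```python
# Hedged sketch: sensed states as label:value sensations and parameterised actions,
# grouped by environment partition, mirroring the S(1)/A(1) and S(2)/A(2) examples.
# The class and field names are illustrative, not the thesis data structures.

from dataclasses import dataclass

@dataclass(frozen=True)
class Sensation:
    label: str
    value: object  # a <number> or <string> in the grammar

@dataclass(frozen=True)
class Action:
    name: str
    target: str

# Partition P1: a workshop where a pick, forge and smithy are visible
S1 = (Sensation("visiblePick", 1), Sensation("visibleForge", 1), Sensation("visibleSmithy", 1))
A1 = {Action("pick-up", "pick"), Action("pick-up", "forge"), Action("pick-up", "smithy")}

# Partition P2: a different partition where an axe and a lathe are visible
S2 = (Sensation("visibleAxe", 1), Sensation("visibleLathe", 1))
A2 = {Action("pick-up", "axe"), Action("pick-up", "lathe")}

# Because sensed states are variable-length tuples of sensations, the agent can
# sense different partitions of a complex, dynamic environment without committing
# to a fixed-size state vector in advance.
print(len(S1), len(S2))  # 3 2
```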
Metrics and Evaluation
• A classification of different types of MRL and the role played by motivation in these approaches
• Metrics for comparing learned behavioural cycles in terms of adaptability and multi-task learning
• Evaluation of the performance and scalability of MRL agents using different:
  • Models of motivation
  • RL approaches
  • Types of environment
• New approaches to the design of non-player characters for games, which can adapt in open-ended virtual worlds
Objectives | Contributions | Results | Conclusions
Experiment 1
[Charts: behavioural variety and behavioural complexity]
• Task-oriented learning emerges using a task-independent motivation signal to direct learning
• The greatest behavioural variety in simple environments is achieved by MFRL agents
• The greatest behavioural complexity is achieved by MFRL and MHRL agents, which can interleave solutions to multiple tasks
Objectives | Contributions | Results | Conclusions
Experiments 2–4
[Charts: Experiments 2–4, comparing MFRL, MMORL and MHRL agents]
• MFRL agents are the most adaptable and the most scalable as the number of tasks in the environment increases
• MMORL agents are the most scalable as the complexity of tasks increases
• Agents motivated by interest and competence achieve greater adaptability, and show increased behavioural variety and complexity
Objectives | Contributions | Results | Conclusions
Conclusions
• MRL agents can learn task-oriented behavioural cycles using a task-independent motivation signal
• The greatest behavioural variety and complexity in simple environments is achieved by MFRL agents
• The greatest adaptability is displayed by MRL agents motivated by interest and competence
• The most scalable approach when recall is required uses MMORL
Objectives | Contributions | Results | Conclusions
Limitations and Future Work
• Scalability of MRL in other types of environments
• Additional approaches to motivation:
  • Biological models
  • Cognitive models
  • Social models
  • Combined models
• Motivation in other machine learning settings:
  • Motivated supervised learning
  • Motivated unsupervised learning
• Additional metrics for MRL:
  • Usefulness
  • Intelligence
  • Rationality (Linden, 2007)
Objectives | Contributions | Results | Conclusions
[Diagram: the agent senses the world state through sensors, producing a sensed state and an observation]
Tasks
• Maintenance tasks: observations
• Achievement tasks: events
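One plausible way to realise the observation/event distinction in code is sketched below. It assumes that events are computed as changes between successive observations (so achievement tasks are defined by what changed, while maintenance tasks are defined by the observations themselves); the dictionary layout and function name are illustrative, not the thesis implementation.

```python
# Hedged sketch: deriving an event from two successive observations, assuming an
# event is the set of sensation values that changed between them.

def compute_event(prev_obs: dict, curr_obs: dict) -> dict:
    """Event = the sensations whose values changed since the last observation."""
    event = {}
    for label in set(prev_obs) | set(curr_obs):
        before, after = prev_obs.get(label, 0), curr_obs.get(label, 0)
        if before != after:
            event[label] = after - before if isinstance(after, (int, float)) else after
    return event

prev_obs = {"visibleForge": 1, "heldPick": 0}
curr_obs = {"visibleForge": 1, "heldPick": 1}

print(compute_event(prev_obs, curr_obs))  # {'heldPick': 1} -> an achievement-task event
```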
Behavioural Cycles
[Diagram: behavioural cycles as repeating sequences of states S1 … Sn and actions A1 … An]
Example cycle (an agent operating a food machine):
S1 = (<location:Food Machine> <Food Machine:1>)
A1 = use(Food Machine)
S2 = (<location:Food Machine> <Food Machine:1> <Food:1>)
A2 = move to(Food)
S3 = (<location:Food> <Food Machine:1> <Food:1>)
A3 = use(Food)
S4 = (<location:NO_OBJECT> <Food Machine:1> <Food:1>)
A4 = move to(Food Machine)
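The example cycle can be represented directly as a chained sequence of transitions. The sketch below (an illustrative representation, not thesis code) checks that a trace of (state, action, next state) steps closes back on its start state and reports its length, which is the quantity the behavioural complexity metric counts.

```python
# Hedged sketch: a behavioural cycle as a chained list of (state, action, next_state)
# transitions that returns the agent to its starting state. States are modelled here
# as frozensets of "label:value" sensations; the representation is illustrative.

def is_behavioural_cycle(steps):
    """steps: list of (state, action, next_state). The trace is a cycle if each
    transition chains onto the next and the final transition returns to the start."""
    if not steps:
        return False
    chained = all(steps[i][2] == steps[i + 1][0] for i in range(len(steps) - 1))
    return chained and steps[-1][2] == steps[0][0]

S1 = frozenset({"location:Food Machine", "Food Machine:1"})
S2 = frozenset({"location:Food Machine", "Food Machine:1", "Food:1"})
S3 = frozenset({"location:Food", "Food Machine:1", "Food:1"})
S4 = frozenset({"location:NO_OBJECT", "Food Machine:1", "Food:1"})

steps = [(S1, "use(Food Machine)", S2),
         (S2, "move to(Food)", S3),
         (S3, "use(Food)", S4),
         (S4, "move to(Food Machine)", S1)]

print(is_behavioural_cycle(steps), "cycle length =", len(steps))  # True, 4 actions
```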
Agent Models
[Diagrams: MFRL and MMORL agent architectures. Each agent senses the world state W(t) through sensors to obtain S(t), computes observations and events O(t), E(t) and a motivation signal Rm(t) in the motivation module M; a reflex, RL or MORL learning component updates the policy π(t) and behaviours B(t) from S(t), Rm(t) and the previous step's state, action and behaviour; the selected action A(t) is then executed through effectors.]
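A minimal sketch of the motivated flat RL (MFRL) loop implied by these diagrams is given below, assuming tabular Q-learning in which the experience-based motivation signal Rm(t) replaces the external reward. The environment interface (sense, actions, act) and the motivation function are placeholder assumptions, not the thesis implementation.

```python
import random
from collections import defaultdict

# Hedged sketch of a motivated flat RL (MFRL) loop: standard tabular Q-learning in
# which the reward is the experience-based motivation signal Rm(t) rather than an
# external, task-specific reward.

def mfrl_episode(env, motivation, q=None, alpha=0.1, gamma=0.9, epsilon=0.1, steps=100):
    q = q if q is not None else defaultdict(float)    # Q[(state, action)]
    state = env.sense()                                # S(t)
    for _ in range(steps):
        actions = env.actions(state)                   # A(t) available in this state
        if random.random() < epsilon:
            action = random.choice(actions)            # explore
        else:
            action = max(actions, key=lambda a: q[(state, a)])  # exploit
        env.act(action)                                # effectors change the world
        next_state = env.sense()                       # S(t+1)
        r_m = motivation(state, action, next_state)    # Rm(t): interest/competence reward
        best_next = max(q[(next_state, a)] for a in env.actions(next_state))
        q[(state, action)] += alpha * (r_m + gamma * best_next - q[(state, action)])
        state = next_state
    return q
```

Because the motivation signal is task-independent, the same loop can learn behavioural cycles for whatever tasks the environment currently affords, which is the attention-focus behaviour the thesis evaluates.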
Sensitivity
[Figures: change in interest for each parameter setting]
Change in interest with (a) ρ+ = ρ- = 5, F+min = 0.5, F-min = 1.5 and (b) ρ+ = ρ- = 30, F+min = 0.5, F-min = 1.5
Change in interest with (a) ρ+ = ρ- = 10, F+min = 0.1, F-min = 1.9 and (b) ρ+ = ρ- = 10, F+min = 0.9, F-min = 1.1
Metrics
• A task is complete when its defining observation or event is achieved
• A task is learned when the standard deviation of the number of actions in the behavioural cycles completing the task falls below some error threshold
• Behavioural variety measures the number of tasks learned
• Behavioural complexity measures the number of actions in a behavioural cycle
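A possible implementation of these metrics is sketched below. The data layout (a mapping from each task to the lengths of the behavioural cycles that completed it) and the aggregation choices are assumptions for illustration, not the thesis code.

```python
import statistics

# Hedged sketch of the metrics above: a task counts as learned when the standard
# deviation of its completing cycle lengths falls below an error threshold; variety
# counts learned tasks; complexity reports cycle length.

def is_learned(cycle_lengths, error_threshold=1.0):
    """Task learned: at least two completing cycles with a stable (low std-dev) length."""
    return len(cycle_lengths) >= 2 and statistics.stdev(cycle_lengths) < error_threshold

def behavioural_variety(cycles_per_task, error_threshold=1.0):
    """Number of tasks for which a stable behavioural cycle has been learned."""
    return sum(is_learned(lengths, error_threshold) for lengths in cycles_per_task.values())

def behavioural_complexity(cycles_per_task):
    """Mean number of actions per observed behavioural cycle (one simple aggregation;
    the thesis may report this per task rather than averaged)."""
    lengths = [n for lengths in cycles_per_task.values() for n in lengths]
    return statistics.mean(lengths) if lengths else 0

cycles_per_task = {
    "make food": [4, 4, 4],   # stable 4-action cycle -> learned
    "forge axe": [7, 12, 5],  # cycle length still varying -> not yet learned
}
print(behavioural_variety(cycles_per_task))     # 1
print(behavioural_complexity(cycles_per_task))  # mean actions per observed cycle
```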