Motivated Reinforcement Learning for Non-Player Characters in Persistent Computer Game Worlds • Kathryn Merrick, University of Sydney and National ICT Australia • Supervisor: Mary Lou Maher
Introduction • Reinforcement learning uses a reward signal as a learning stimulus • Common assumptions about reward: • Tasks are known at design time and can be modelled as a task specific reward signal or • A teacher is present to provide reward when desirable (or undesirable) tasks are performed
Research Question • How can we extend reinforcement learning to environments where: • Tasks are not known at design time • A teacher is not present In order to achieve: • Efficient learning • Competence at multiple tasks of any complexity
Overview • Related work • Motivation using interest as an intrinsic reward signal • Metrics for evaluating motivated reinforcement learning • Results • Future work
Existing Technologies for NPCs • Reflexive Agents • Rule based • State machines • Learning Agents • Reinforcement learning
Rule-based example:
IF
  !Range(NearestEnemyOf(Myself),3)
  Range(NearestEnemyOf(Myself),8)
THEN
  RESPONSE #40
    EquipMostDamagingMelee()
    AttackReevaluate(NearestEnemyOf(Myself),60)
  RESPONSE #80
    EquipRanged()
    AttackReevaluate(NearestEnemyOf(Myself),30)
END
State machine example:
startup state Startup$ {
  trigger OnGoHandleMessage$(WE_ENTERED_WORLD) {
    SetState Spawn$;
  }
}
Motivated Reinforcement Learning • Motivated reinforcement learning introduces an intrinsic reward signal in addition to or instead of extrinsic reward • Intrinsic reward has been used to: • Speed learning of extrinsically rewarded tasks • Solve maintenance problems • Intrinsic reward has been modelled as: • Curiosity and boredom • Changes in light and sound intensity • Predictability, familiarity, stability • Novelty
Representing the Environment • Existing techniques: • Attribute based representation • Fixed length vectors <wall1x, wall1y, wall2x, wall2y, wall3x, wall3y, wall4x, wall4y> • Problem: • How long should the vector be?
Context Free Grammars • Represent only what is present using a context free grammar:
S → <objects>
<objects> → <object><objects> | ε
<object> → <objectID><objectx><objecty>
<objectID> → <integer>
<objectx> → <integer>
<objecty> → <integer>
<integer> → 1 | 2 | 3 | …
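As a rough illustration of how a variable-length sensation could be read against this grammar, the sketch below parses a flat list of integers into objects. The function name `parse_objects` and the flat-list encoding are assumptions made for this example, not part of the original slides.

```python
from typing import List, Tuple

# Hypothetical encoding: one <object> is (objectID, objectx, objecty).
Obj = Tuple[int, int, int]

def parse_objects(tokens: List[int]) -> List[Obj]:
    """Recursive-descent parse of <objects> -> <object><objects> | ε."""
    if not tokens:
        # ε production: nothing sensed, so nothing is represented.
        return []
    if len(tokens) < 3:
        raise ValueError("incomplete <object>: expected ID, x and y")
    obj_id, x, y = tokens[:3]
    for value in (obj_id, x, y):
        # <integer> -> 1 | 2 | 3 | ... (positive integers only)
        if not isinstance(value, int) or value < 1:
            raise ValueError(f"not a valid <integer>: {value!r}")
    return [(obj_id, x, y)] + parse_objects(tokens[3:])

# Two sensed objects, then an empty world: the state length simply varies.
two_objects = parse_objects([1, 2, 5, 2, 4, 4])
empty_world = parse_objects([])
```

Unlike a fixed-length vector, the parsed state grows and shrinks with the number of objects present, so no maximum length has to be chosen in advance.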
Representing Tasks • Potential learning tasks are represented as events: changes in the environment E(t) = S(t) – S(t-1) • S(1) (<locationX:2>, <locationY:5>, <pick:1>, <forge:1>) • A(1) (move, north) • S(2) (<locationX:2>, <locationY:6>, <lathe:1>) • E(2) (<locationY:1>, <forge:-1>, <lathe:1>)
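The event definition E(t) = S(t) – S(t-1) can be sketched as a difference over labelled attributes. This is a minimal sketch that assumes states are stored as dictionaries and that an attribute missing from a state counts as 0; `compute_event` is a hypothetical name. Applied literally to the two states listed on the slide, the difference also contains <pick:-1>, because pick is absent from S(2) as listed.

```python
from typing import Dict

State = Dict[str, int]

def compute_event(prev: State, curr: State) -> State:
    """Return the non-zero attribute changes between two successive states."""
    keys = set(prev) | set(curr)
    # Assumption: an attribute missing from a state has value 0.
    diff = {k: curr.get(k, 0) - prev.get(k, 0) for k in keys}
    # Unchanged attributes are not part of the event.
    return {k: v for k, v in diff.items() if v != 0}

s1 = {"locationX": 2, "locationY": 5, "pick": 1, "forge": 1}
s2 = {"locationX": 2, "locationY": 6, "lathe": 1}
e2 = compute_event(s1, s2)
```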
Motivation as Interesting Events • Events E(t) are held in memory and presented to a habituated self-organising map (HSOM) • Clustering layer (SOM): winning neighbourhood σ(t) = 1, losing neurons σ(t) = 0 • Habituating layer: novelty N(t) = habituated value from winning clustering neuron • Interest is computed from novelty and used as reward
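One way to picture the habituating layer is a discrete update of a Stanley-style habituation rule, commonly written as τ dN/dt = α(N0 − N) − σ. The sketch below takes one explicit Euler step of size 1 per event; the function name `habituate` and the values chosen for α, N0 and τ are illustrative assumptions, not values taken from the slides.

```python
def habituate(n: float, sigma: float, alpha: float = 1.05,
              n0: float = 1.0, tau: float = 3.3) -> float:
    """One Euler step of tau * dN/dt = alpha * (n0 - N) - sigma.

    sigma = 1 for neurons in the winning neighbourhood (novelty decays),
    sigma = 0 for losing neurons (novelty recovers towards n0).
    """
    return n + (alpha * (n0 - n) - sigma) / tau

# Repeatedly stimulating the same neuron makes its event less novel...
novelty = 1.0
for _ in range(50):
    novelty = habituate(novelty, sigma=1.0)
habituated = novelty

# ...and leaving it unstimulated lets the novelty recover.
for _ in range(50):
    novelty = habituate(novelty, sigma=0.0)
recovered = novelty
```

With these assumed parameters the stimulated value settles near N0 − 1/α, so a frequently repeated event never becomes completely non-novel.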
Novelty and Interest • Novelty and Habituation • Stanley’s model of habituation • Interest • The Wundt curve
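The Wundt curve is often modelled as the difference between a reward sigmoid and a punishment sigmoid, so that interest peaks at moderate novelty. The sketch below follows that common formulation; the midpoints, slope and the names `sigmoid` and `interest` are assumptions chosen for illustration and are not taken from the slides.

```python
import math

def sigmoid(x: float, midpoint: float, slope: float, maximum: float = 1.0) -> float:
    """Logistic curve rising from 0 to `maximum` around `midpoint`."""
    return maximum / (1.0 + math.exp(-slope * (x - midpoint)))

def interest(novelty: float, reward_mid: float = 0.3,
             punish_mid: float = 0.7, slope: float = 20.0) -> float:
    """Wundt-style interest: reward for novelty minus punishment for too much."""
    reward = sigmoid(novelty, reward_mid, slope)
    punish = sigmoid(novelty, punish_mid, slope)
    return reward - punish

# Moderate novelty should be more interesting than very low or very high novelty.
low, moderate, high = interest(0.0), interest(0.5), interest(1.0)
```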
Motivated Reinforcement Learning • Sensation (S) • Computes events • Motivation (M) • Computes an intrinsic reward signal • Learning (L) • Q-Learning update • Activation (A) • ε-greedy action selection • [Figure: agent process. World W(t) → sensors → S(t) → S → S(t), E(t) → M → S(t), R(t) → L → π(t) → A → A(t) → effectors → F(t); memory holds S(t-1), E(t-1), π(t-1), A(t-1)]
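The learning and activation steps can be sketched with a standard tabular Q-learning update and ε-greedy selection, where the reward passed in is the intrinsic reward R(t) from the motivation process. The table layout, function names and the default values of α, γ and ε are assumptions for this example; states only need to be hashable.

```python
import random
from collections import defaultdict

def epsilon_greedy(q, state, actions, epsilon=0.1, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])

def q_update(q, state, action, reward, next_state, actions,
             alpha=0.9, gamma=0.9):
    """Standard Q-learning update; `reward` is the intrinsic reward R(t)."""
    best_next = max(q[(next_state, b)] for b in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

# Tiny usage example with hypothetical states and actions.
actions = ["north", "south"]
q = defaultdict(float)          # unseen state-action pairs default to 0
q_update(q, "s0", "north", 1.0, "s1", actions, alpha=0.5, gamma=0.9)
greedy_choice = epsilon_greedy(q, "s0", actions, epsilon=0.0)
```

Using a `defaultdict` keeps the table open-ended, which suits states whose length and content are not known at design time.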
Motivated Hierarchical Reinforcement Learning • Sensation (S) • Computes events • Motivation (M) • Computes an intrinsic reward signal • Organisation (O) • Manages policies • Learning (L) • Hierarchical Q-Learning update • Activation (A) • Recall reflex • ε-greedy action selection • [Figure: agent process. World W(t) → sensors → S(t) → S → S(t), E(t) → M → S(t), R(t), E(t) → O → B(t) → S(t), R(t) → L → π(t) → S(t), π(t) → A → A(t) → effectors → F(t); memory holds S(t-1), E(t-1), B(t-1), π(t-1)]
Performance Evaluation • Related work: • Characterise the output of the motivation function • Measure learning efficiency • Characterise the emergent behaviour • Our goals: • Efficient learning • Competence at multiple tasks • Tasks of any complexity
Metrics • Existing metrics for learning efficiency • E.g. chart the number of actions against time • Behavioural variety: • Summarises learning efficiency for multiple tasks • Behavioural complexity: $C_E = \mathrm{average}(\bar{a}_E \mid \sigma_E < r)$ • [Equation defining $\sigma_E$ not recovered]
The Agent… • Sensors: • Location sensor • Object sensor • Inventory Sensor • Effectors: • Move to object effector • Pick up object effector • Use object effector
MRL – Behavioural Variety • [Figure labels: E(<inventoryIron:1>), E(<inventoryTimber:-1>), E(<location:-2>)]
MHRL – Behavioural Variety • [Figure labels: E(<inventoryIron:-1>), E(<inventoryTimber:-1>), E(<location:-2>), E(<location:-2>), E(<inventoryTimber:1>), E(<inventoryIron:-1>)]
Emergent Behaviour – Travelling Vendor • Sensors: • Location sensor • Object sensor • Effectors: • Move to object effector
Conclusions • It is possible for efficient task-oriented learning to emerge without explicitly representing tasks in the reward signal. • Agents motivated by interest learn behaviours of greater variety and complexity than agents motivated by a random reward signal. • Motivated hierarchical reinforcement learning agents are able to recall learned behaviours; however, behaviours are learned more slowly.
Conclusions about MRL for NPCs • Motivated reinforcement learning offers a single agent model for many characters. • Motivated characters display progressively emerging behavioural patterns. • Motivated characters can adapt their behaviour to changes in their environment.
Ongoing and Future Work • Scalability testing • Alternative models of motivation • Competence based motivation • Motivation with other classes of machine learning algorithms • Applications to intelligent environments