320 likes | 332 Views
Discover how to automatically uncover complex behavioral patterns across multiple agents in diverse domains using modular neuroevolution. Explore the challenges and solutions in interleaved and blended task environments through simulations involving Ms. Pac-Man. Learn how to identify and optimize distinct task divisions for enhanced performance.
E N D
Solving Interleaved and Blended Sequential Decision-Making Problems through Modular Neuroevolution Jacob Schrum (schrum2@southwestern.edu) Risto Miikkulainen (risto@cs.utexas.edu)
Introduction • Challenge: Discover behavior automatically • Simulations • Robotics • Video games • Why challenging? • Complex domains • Multiple agents • Multiple objectives • Multimodal behavior required
When is Multimodal Behavior Needed? • Agent must exhibit multiple behaviors • Domain consists of multiple tasks • How to identify/distinguish tasks? • Interleaved • Each task is separate • Sharp borders between tasks • Blended • Tasks partially overlap • Blurred border between tasks
What types of policies work best? • Previous results show great success in Ms. Pac-Man (Schrum and Miikkulainen, GECCO 2014) • Ms. Pac-Man has blended tasks • Mixture of threat/edible ghosts • Neuroevolution discovers own task divisions • What about interleaved tasks? • Introduce a version of Ms. Pac-Man with interleaved tasks • Ghosts are all threats or all edible • What about human-specified task divisions? • Evolve multitask networks • One policy for each task • Understand problem better with formalism (in paper)
Ms. Pac-Man • Multimodal behavior needed • Predator/prey variant • Pac-Man takes on both roles • Goals: Maximize score by • Eating all pills in each level • Avoiding threatening ghosts • Eating ghosts (after power pill) • Non-deterministic • Very noisy evaluations • Four mazes • Behavior must generalize Human Play
Blended Original game Edible and threat ghosts present simultaneously Distinct threat/edible regions Four ghosts can be mix of threat/edible/imprisoned Interleaved Modified version Ghosts all edible or all threat Eaten ghosts stay in prison after being eaten Easier to eat ghosts and stay alive in this version Two Versions of Ms. Pac-Man
Modular Policy • One policy consisting of several policies/modules • Number preset, or learned • Means of arbitration also needed • Human-specified, or learned via preference neurons Outputs: Inputs: Multitask (Caruana 1997) Preference Neurons
Constructive Neuroevolution • Genetic Algorithms + Neural Networks • Three basic mutations + Crossover • Other structural mutations possible (NEAT by Stanley and Miikkulainen 2004) Perturb Weight Add Connection Add Node
Module Mutation • A mutation that adds a module • MM(Duplicate) copies previous module • New module can diverge as weights change • Can happen more than once MM(Duplicate) (cf Calabretta et al 2000)
Experiments • Evolved using Modular Multiobjective NEAT (http://nn.cs.utexas.edu/?mm-neat) • Each network type evaluated in 30 runs • Standard networks with one module (1M) • Nets with two (2M) or three (3M) modules + preference neurons • Nets starting with one module, subject to MM(D) • Multitask networks with human-specified task divisions • Interleaved: • One module for edible, one for threats (MT) • Blended: • One module if any ghost is edible, one otherwise (MT2) • One module for edible, one for threats, one for mixed (MT3)
← Interleaved Results • All modular approaches superior • Final differences significant (p < 0.05) • Higher scores overall (easier domain) • Different behaviors have similar scores • Blended Results → • Preference neurons superior • Multitask as bad as one module • Harder domain with lower scores • Certain behaviors distinctly better • Final differences significant (p < 0.05)
Interleaved Post-Evolution Evaluations: Primary Module Usage vs. Score
Interleaved Post-Evolution Evaluations: Primary Module Usage vs. Score Separate modules for escaping toward power pill and everything else Separate modules for threat and edible ghosts
Interleaved Post-Evolution Evaluations: Primary Module Usage vs. Score SMALL GAP LITTLE OVERLAP Separate modules for escaping toward power pill and everything else Separate modules for threat and edible ghosts
Blended Post-Evolution Evaluations: Primary Module Usage vs. Score
Blended Post-Evolution Evaluations: Primary Module Usage vs. Score LARGER GAP MORE OVERLAP Separate modules for escaping toward power pill and everything else Separate modules for threat and edible ghosts (Imperfect division)
Threat/Edible SplitWith Multitask Networks Interleaved Two Module Multitask Champion Blended Two Module Multitask Champion Blended Three Module Multitask Champion Similar behavior is exhibited in both domains. It is competent, but not outstanding
Threat/Edible Split With Preference Neurons Interleaved Two Module Champion Blended Three Module Champion (only uses two modules) MM(D) champions are similar. Significant usage of 3+ modules does not occur.
Escape/Luring Behavior(Only Possible With Preference Neurons) Interleaved Two Module Champion Blended Two Module Champion • A special “escape” module is rarely used, but its use has a big impact. • Escaping ghosts when nearly surrounded helps Ms. Pac-Man survive longer. • Escaping to a power pill after luring ghosts close helps eat ghosts to maximize points
Discussion • Obvious division is between edible and threat • Works fine in interleaved domain • Causes problems in blended domain • Strict Multitask divisions do not perform well • Better division: one module when surrounded • Very asymmetrical: surprising • Useful in interleaved domain, vital in blended domain • Module activates when Pac-Man almost surrounded • Often leads to eating power pill: luring • Helps Pac-Man escape in other risky situations
Conclusion • Interleaved tasks • Many approaches work well • Multitask network with one module per task sufficient • Preference neurons can still learn better task division • Blended tasks • Harder scenario dealt with in more real domains • Human-specified task divisions (Multitask) break down • Evolution needs freedom to discover own task division • Results in novel, successful task division (escaping/luring)
Questions? Contact me: schrum2@southwestern.edu Movies: http://nn.cs.utexas.edu/?interleaved-pm http://nn.cs.utexas.edu/?blended-pm Code: http://nn.cs.utexas.edu/?mm-neat
Multimodal Behavior • Animals can perform many different tasks • Imagine learning a monolithic policy as complex as a cardinal’s behavior: HOW? • Problem more tractable if broken into component behaviors Nesting Foraging Flying
Previous Work in Pac-Man • Custom Simulators • Genetic Programming: Koza 1992 • Neuroevolution: Gallagher & Ledwich 2007, Burrow & Lucas 2009, Tan et al. 2011 • Reinforcement Learning: Burrow & Lucas 2009, Subramanian et al. 2011, Bom 2013 • Alpha-Beta Tree Search: Robles & Lucas 2009 • Screen Capture Competition: Requires Image Processing • Evolution & Fuzzy Logic: Handa & Isozaki 2008 • Influence Map: Wirth & Gallagher 2008 • Ant Colony Optimization: Emilio et al. 2010 • Monte-Carlo Tree Search: Ikehata & Ito 2011 • Decision Trees: Foderaro et al. 2012 • Pac-Man vs. Ghosts Competition: Pac-Man • Genetic Programming: Alhejali & Lucas 2010, 2011, 2013, Brandstetter & Ahmadi 2012 • Monte-Carlo Tree Search: Samothrakis et al. 2010, Alhejali & Lucas 2013 • Influence Map: Svensson & Johansson 2012 • Ant Colony Optimization: Recio et al. 2012 • Pac-Man vs. Ghosts Competition: Ghosts • Neuroevolution: Wittkamp et al. 2008 • Evolved Rule Set: Gagne & Congdon 2012 • Monte-Carlo Tree Search: Nguyen & Thawonmos 2013
Evolved Direction Evaluator • Inspired by Brandstetter and Ahmadi (CIG 2012) • Net with single output and direction-relative sensors • Each time step, run net for each available direction • Pick direction with highest net output Right Preference Left Preference argmax Left
Preference Neuron Arbitration • How can network decide which module to use? • Find preference neuron (grey) with maximum output • Corresponding policy neurons (white) define behavior 0.7 > 0.1, So use Module 2 Policy neuron for Module 2 has output of 0.5 [0.6, 0.1, 0.5, 0.7] Output value 0.5 defines agent’s behavior Outputs: Inputs:
Pareto-based Multiobjective Optimization (Pareto 1890) High health but did not deal much damage Tradeoff between objectives Dealt lot of damage, but lost lots of health (Deb et al. 2000)
Non-dominated Sorting Genetic Algorithm II(Deb et al. 2000) • Population P with size N; Evaluate P • Use mutation (& crossover) to get P´ size N; Evaluate P´ • Calculate non-dominated fronts of P È P´ size 2N • New population size N from highest fronts of P È P´
Direction Evaluator + Modules • Network is evaluated in each direction • For each direction, a module is chosen • Human-specified (Multitask) or Preference Neurons • Chosen module policy neuron sets direction preference [0.5, 0.1, 0.7, 0.6] → 0.6 > 0.1: Left Preference is 0.7 [0.3, 0.8, 0.9, 0.1] → 0.8 > 0.1: Right Preference is 0.3 0.7 > 0.3 Ms. Pac-Man chooses to go left, based on Module 2 [Left Inputs] [Right Inputs]