Our contribution: online learning of sub-symbolic game AI for team-based first-person shooter games.
1. RETALIATE: Learning Winning Policies in First-Person Shooter Games Dept. of Computer Science & Engineering
Lehigh University
2. Outline Introduction
Adaptive Game AI
Domination games in Unreal Tournament
Reinforcement Learning
Adaptive Game AI with Reinforcement Learning
RETALIATE architecture and algorithm
Empirical Evaluation
Final Remarks Main Lessons
3. Introduction Adaptive Game AI, Unreal Tournament, Reinforcement Learning
4. Adaptive AI in Games Symbolic: compositional syntax where atoms have meaning in themselves (e.g., FOL or another interpreted logical theory)
Sub-symbolic: lacks atomic elements that are themselves meaningful representations (e.g., pixels)
GAs (genetic algorithms) are a search technique, not a learning technique
5. Adaptive Game AI and Learning Motivation for learning:
Combinatorial explosion of possible situations
Tactics (e.g., competing teams tactics)
Game worlds (e.g., map where the game is played)
Game modes (e.g., domination, capture the flag)
Little time for development
Cons of learning:
Difficult to control and predict Game AI
Difficult to test
6. Reinforcement Learning Agents learn policies through rewards and punishments
Policy - Determines what action to take from a given state (or situation)
The agent's goal is to maximize some reward
Tabular vs. Generalization Techniques
We maintain a Q-Table:
Q-table: State x Action -> value
Supervised learning -> labeled training data; access to the correct output
Curse of dimensionality
Tabular methods are limited to small numbers of states and actions; the constraint is not just memory, but the time and data needed to fill the table accurately
Generalization uses a limited subset of the state space to produce a good approximation over a much larger subset
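As an illustration of the tabular representation, here is a minimal Python sketch of a Q-table mapping (state, action) pairs to values, with epsilon-greedy action selection; the state encoding (owner of each domination location), the reduced action list, and the epsilon value are illustrative assumptions rather than the exact RETALIATE settings:

```python
import random
from collections import defaultdict

# Minimal Q-table sketch: State x Action -> value, stored in a dictionary.
# State encoding and actions below are illustrative assumptions, not
# RETALIATE's exact settings.
ACTIONS = [
    ("L1", "L1", "L1"),  # send all three bots to location L1
    ("L1", "L2", "L3"),  # spread the bots across the three locations
    ("L2", "L2", "L3"),  # ... (the full joint-action set has 27 entries)
]
q_table = defaultdict(float)  # unseen (state, action) pairs default to 0.0

def choose_action(state, epsilon=0.1):
    """Epsilon-greedy selection over the tabular Q-values."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table[(state, a)])

state = ("E", "F", "N")   # owner of each location: Enemy, Friendly, Neutral
print(choose_action(state))
```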
7. Unreal Tournament (UT) Online FPS developed by Epic Games Inc. 1999
Six gameplay modes including team deathmatch and domination games
Gamebots: a client-server architecture for controlling bots, started by the University of Southern California's Information Sciences Institute (ISI)
UT was a direct competitor to Quake III Arena
8. UT Domination Games A number of fixed domination locations.
Ownership: a location belongs to the team of the last player to step into it
Scoring: a team is awarded one point for every five seconds a location remains under its control
Winning: the first team to reach a pre-determined score (50)
(Figure: top-down view of the map)
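To make the scoring rules concrete, here is a minimal Python sketch of how domination points accumulate under the rules above (five-second ticks, target score of 50); the data layout and function name are illustrative assumptions:

```python
# Sketch of UT domination scoring, assuming the rules above: each controlled
# location earns its owning team 1 point every 5 seconds, and the first team
# to reach 50 points wins.
TICK_SECONDS = 5
TARGET_SCORE = 50

def run_domination(ownership_log):
    """ownership_log: one dict per 5-second tick, mapping location -> owning
    team ('A', 'B', or None). Returns (winner, elapsed seconds, scores)."""
    scores = {"A": 0, "B": 0}
    for tick, owners in enumerate(ownership_log):
        for location, team in owners.items():
            if team is not None:
                scores[team] += 1  # one point per controlled location per tick
        for team, score in scores.items():
            if score >= TARGET_SCORE:
                return team, (tick + 1) * TICK_SECONDS, scores
    return None, len(ownership_log) * TICK_SECONDS, scores

# Example: team A holds two of the three locations on every tick.
log = [{"L1": "A", "L2": "A", "L3": "B"}] * 30
print(run_domination(log))  # ('A', 125, {'A': 50, 'B': 25})
```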
9. Adaptive Game AI with RL RETALIATE (Reinforced Tactic Learning in Agent-Team Environments)
Tactic versus strategy?
10. The RETALIATE Team Controls two or more UT bots
Commands bots to execute actions through the GameBots API
The UT server provides sensory (state and event) information about the UT world and controls all gameplay
Gamebots acts as middleware between the UT server and the game AI
Emphasize that the bots are plug-ins: we learn team strategies, not individual bot tactics.
11. The RETALIATE Algorithm
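The algorithm itself is presented as a figure on this slide. As a rough stand-in, here is a minimal Python sketch of a tabular Q-learning control loop of the kind the surrounding slides describe (team state in, joint bot-to-location action out, non-discounted update). The env wrapper, its method names, the step size, and the exploration rate are hypothetical placeholders, not the GameBots API or the published RETALIATE pseudocode:

```python
import random

def retaliate_like_loop(env, q_table, actions, alpha=0.2, epsilon=0.1):
    """Sketch of a Q-learning control loop over team states and joint actions.
    `env` is a hypothetical wrapper around the game interface; `q_table` is a
    mapping with default value 0.0 (e.g., collections.defaultdict(float))."""
    state = env.observe()                          # e.g., owner of each domination location
    while not env.game_over():
        if random.random() < epsilon:              # epsilon-greedy joint-action selection
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: q_table[(state, a)])
        env.send_team_orders(action)               # assign each bot a location
        reward, next_state = env.step()            # e.g., change in the score difference
        td_target = reward + max(q_table[(next_state, a)] for a in actions)  # gamma = 1
        q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
        state = next_state
```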
12. Initialization
13. Rewards and Utilities
14. Rewards and Utilities
But this will not necessarily converge to an optimal policy
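For reference, the standard tabular Q-learning update written with a non-discounted return (gamma = 1, which the final remarks say works well in this domain) is shown below; the step size alpha is left symbolic, since the slides do not state its value:

\[ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ R + \max_{a'} Q(s',a') - Q(s,a) \right] \]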
15. State Information and Actions Curse of dimensionality
16. Managing (State x Action) Growth Our table:
States: ({E,F,N}, {E,F,N}, {E,F,N}) = 3^3 = 27
Actions: ({L1, L2, L3} per bot) = 3^3 = 27
27 x 27 = 729
Generally, 3^#loc x #loc^#bot
Adding health, discretized (high, med, low):
States: ({E,F,N}, {E,F,N}, {E,F,N}, {h,m,l}) = 27 x 3 = 81
Actions: ({L1, L2, L3, Health} per bot) = 4^3 = 64
81 x 64 = 5184
Generally, 3^(#loc+1) x (#loc+1)^#bot
The number of locations and the size of the team frequently vary.
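As a quick sanity check on these counts, here is a small Python sketch of the growth formulas above (the function name is illustrative):

```python
# State/action table size for the domination task, per the formulas on this slide.
def table_size(num_locations, num_bots, with_health=False):
    """States: 3^#loc (times 3 with a discretized health variable).
    Actions: (#loc [+1 for the Health action])^#bot joint assignments."""
    loc_choices = num_locations + (1 if with_health else 0)
    states = 3 ** num_locations * (3 if with_health else 1)
    actions = loc_choices ** num_bots
    return states, actions, states * actions

print(table_size(3, 3))                    # (27, 27, 729)
print(table_size(3, 3, with_health=True))  # (81, 64, 5184)
```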
17. Empirical Evaluation Opponents, Performance Curves, Videos
18. The Competitors
HTNBots is previous work that beat the other three opponents. We did not modify the HTN bots, and their knowledge base was successful. Therefore HTNBots is not a straw man for proving RETALIATE; it is a good opponent that is currently freely available (since 2005).
19. Summary of Results
Against the opportunistic, possessive, and greedy control strategies, RETALIATE won all 3 games in the tournament.
Within the first half of the first game, RETALIATE developed a competitive strategy.
Possibly get rid of OLD and replace it with the graphs depicting performance against the static opponents.
Make a point to say that HTNBots won all games against the non-RL opponents.
Maybe include the graph that cycles through opponents with a retained Q-table, because it motivates that the strategy needs to remain dynamic!
20. Summary of Results: HTNBots vs. RETALIATE (Round 1)
Epsilon isn't changing, so we can't talk about exploring vs. exploiting; what I really mean is how well the exploitation is working.
21. Summary of Results: HTNBots vs. RETALIATE (Round 2)
Same caveat about exploit vs. explore; be careful.
22. Video: Initial Policy
Add a caption on the left: what do red and blue mean? Draw a wedge of a circle; explain that the apex is the center and the rest is the angle of view.
23. Video: Learned Policy
Keep the same key on this slide.
24. Final Remarks Lessons Learned, Future Work
25. Final Remarks (1) From our work with RETALIATE we learned the following lessons, beneficial to any real-world application of RL for these kinds of games:
Separate individual bot behavior from team strategies.
Model the problem of learning team tactics through a simple state formulation.
The use of non-discounted rewards works well in this domain.
Need to say that we originally tried much more state information, but learned through experiments to keep the state information small.
Explain how we learned we should separate the strategy from the plug-in bots (i.e., state information again).
26. Final Remarks (2) It is very hard to predict all strategies beforehand
As a result, RETALIATE was able to find a weakness and exploit it to produce a winning strategy that HTNBots could not counter
On the other hand, HTNBots produced winning strategies against the other opponents from the beginning, while it took RETALIATE half a game in some situations.
Because tactics emerging from RETALIATE might be difficult to predict, a game developer will have a hard time maintaining the game AI.
Future Work: this suggests that a combination of HTN planning to lay down initial strategies and reinforcement learning to tune those strategies should address the individual weaknesses of both approaches.
27. Thank you! Questions?
28. REMEMBER Emphasis should be given to the main lessons: simple domain representation, gamma = 1, etc. I think we outlined a response to John and included some of this in the book chapter, so we should include it in the presentation.
There might be questions about other RL approaches and whether we tried them; we should think of an answer to that. Also a question about whether we tried gamma less than one (my recollection is that we did, and Megan reported that it converged too slowly toward a competitive policy).