Explore the history and future possibilities of Deep Reinforcement Learning in games research at IBM, focusing on maze navigation and the applications of RL in virtual worlds and text-based games.
Deep RL in Games Research @IBM
Gerry Tesauro, Principal Research Staff Member
IBM T.J. Watson Research Center
<gtesauro AT us DOT ibm DOT com>
http://researcher.watson.ibm.com/researcher/view.php?person=us-gtesauro
Joint work with: Janusz Marecki (IBM, Google DeepMind), Joe Bigus (IBM), Ban Kawas (IBM), Kamil Rocki (IBM)
History of Games @ IBM: backgammon, chess, checkers, Go, Jeopardy!
New TD-Gammon Results! (Tesauro, 1992)
Towards Vision-Based Maze Navigation
• Use one of the earliest FPS ("First Person Shooter") computer (DOS) games – Wolfenstein 3D (1992)
• Display shows a live 3-dimensional visual depiction of the environment from the "first-person" perspective
• By contrast, the depiction of the Atari game state is "flat"
• Game consists of a series of mazes or "levels" – the goal is to exit each level while defeating enemies and picking up useful supplies (ammunition, food, medical items, etc.)
Why Study 3D Maze Games?
• Instances of challenging POMDPs
• Each maze has an unknown layout
• Player cannot infer the full game state from the current visual frame – need to maintain a history of past observations
• Clear metrics to measure progress
• Point scores, time to clear each level
• High-quality simulation model
• Training in simulation is usually more effective than live training
• Potential competition with expert humans
• "adds spice to the study" (Samuel)
• "provides a convincing demonstration for those who do not believe that machines can learn" (Samuel)
Highly Simplified Initial Task
• Eliminate objects, weapons, enemies
• Only goal is to find the exit ("First Person Non-Shooter")
• Create simplified maze, colors, textures
• (Ambiguity increases the challenge)
• Simplify legal actions
• Three discrete actions: (1) slight move forward; (2) slight turn left; (3) slight turn right
Interface Learner to Game Engine
• Wolf3D exe (run in DOSBox)
• Shell script: sends keystrokes to the game, captures the screen and writes frames to file, reads the chosen action from file
• Python NN: loads the image, writes the chosen action to file
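As a rough illustration of this file-based handshake between the shell script and the learner, here is a minimal sketch of the Python side of the loop; the file names (frame.png, action.txt) and the integer action encoding are assumptions, not the actual interface used.

```python
# Minimal sketch of the Python side of the file-based interface.
# The file names (frame.png, action.txt) and the integer action encoding
# are illustrative assumptions, not the interface actually used.
import os
import time
import numpy as np
from PIL import Image

FRAME_PATH = "frame.png"    # written by the shell script (screen capture)
ACTION_PATH = "action.txt"  # read by the shell script and sent as keystrokes

def choose_action(frame):
    # Placeholder policy: 0 = move forward, 1 = turn left, 2 = turn right
    return np.random.randint(3)

while True:
    if not os.path.exists(FRAME_PATH):
        time.sleep(0.01)                           # wait for the next captured frame
        continue
    frame = np.asarray(Image.open(FRAME_PATH))     # load the captured frame
    action = choose_action(frame)
    with open(ACTION_PATH, "w") as f:              # hand the action back to the shell script
        f.write(str(action))
    os.remove(FRAME_PATH)                          # mark the frame as consumed
```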
QNN Learner Architecture
• Inputs: the current observation (frame) o_t plus the previous actions and observations (a_{t-1}, o_{t-1}), ..., (a_{t-n}, o_{t-n})
• First hidden layer H1 is a (previously trained) autoencoder layer (RBM)
• Second hidden layer H2
• Outputs: Q-values
• A recurrent LSTM variant was just implemented
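A minimal sketch of such a feedforward Q-network follows; the layer sizes, activations, and frame dimension are chosen purely for illustration and are not taken from the slides.

```python
# Sketch of a feedforward Q-network with the structure described above.
# Layer sizes, activations, and the frame dimension are illustrative assumptions.
import numpy as np

N_ACTIONS = 3          # forward, turn left, turn right
FRAME_DIM = 64 * 64    # assumed flattened grayscale frame size
HISTORY = 4            # assumed number of past (action, observation) pairs kept

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class QNN:
    def __init__(self, rbm_weights, h2_size=128):
        # H1: weights of a previously trained autoencoder/RBM layer (kept fixed here)
        self.W1 = rbm_weights                       # shape: (input_dim, h1_size)
        h1_size = rbm_weights.shape[1]
        self.W2 = 0.01 * np.random.randn(h1_size, h2_size)
        self.W3 = 0.01 * np.random.randn(h2_size, N_ACTIONS)

    def forward(self, frames, past_actions):
        # frames: (HISTORY + 1, FRAME_DIM) stack of the current and past frames
        # past_actions: (HISTORY,) integer indices of the past actions
        actions_onehot = np.eye(N_ACTIONS)[past_actions].ravel()
        x = np.concatenate([frames.ravel(), actions_onehot])
        h1 = sigmoid(x @ self.W1)      # pretrained autoencoder features
        h2 = sigmoid(h1 @ self.W2)
        return h2 @ self.W3            # one Q-value per discrete action
```

A greedy action would then be np.argmax(qnn.forward(frames, past_actions)).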
Results of Maximal "Purist" Approach
• No knowledge of 2-D or 3-D vision, no knowledge of the 2-D topology of pixels, no knowledge of the 2-D layout of the maze
• Learner only gets two types of rewards: (1) Reward = +1 if the goal is reached (more than 15% of pixels are red); (2) Reward = -0.002 per time step
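A sketch of this reward signal, assuming a simple color-threshold test for "more than 15% of pixels are red" (the actual red-pixel detector is not specified in the slides):

```python
# Sketch of the "purist" reward signal described above; the red-pixel test
# is an assumed implementation of "more than 15% of pixels are red".
import numpy as np

def purist_reward(frame_rgb):
    """frame_rgb: (H, W, 3) uint8 screen capture."""
    r, g, b = frame_rgb[..., 0], frame_rgb[..., 1], frame_rgb[..., 2]
    red_mask = (r > 150) & (g < 80) & (b < 80)   # crude "red pixel" detector
    if red_mask.mean() > 0.15:
        return 1.0, True         # goal reached: +1 and the episode ends
    return -0.002, False         # small per-step penalty otherwise
```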
Minimal Knowledge to Add?
• Try adding a penalty if the agent is detected to be in a "stuck" state: makes the learner avoid going forward – disaster
• Add a "partial credit" reward of ~0.1 if the goal (red pixels) is visible and gets closer (increase in red pixels): helps finish the epoch
• Add a fourth "U-turn" action: randomized turn of 180° +/- 70°
• Immediately cures the stuck state
• Highly randomizing if explored frequently
• Hard-wired constraint on the use of U-turn (see the sketch below): U-turn is disabled if the agent is not stuck; U-turn is mandatory if the agent is stuck
• Hope that this constraint will eventually be learnable
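A minimal sketch of the hard-wired U-turn constraint, assuming a simple frame-difference test for the "stuck" state (the actual stuck detector and action indices are not specified in the slides):

```python
# Sketch of the hard-wired U-turn gating; the stuck detector (a frame-difference
# threshold) and the action indices are illustrative assumptions.
import numpy as np

FORWARD, TURN_LEFT, TURN_RIGHT, U_TURN = 0, 1, 2, 3

def is_stuck(prev_frame, cur_frame, threshold=1.0):
    # Consider the agent stuck if the view barely changes after acting.
    return np.abs(cur_frame.astype(float) - prev_frame.astype(float)).mean() < threshold

def uturn_angle(rng):
    # Randomized U-turn: 180 degrees +/- 70 degrees.
    return 180.0 + rng.uniform(-70.0, 70.0)

def gate_action(q_values, stuck):
    if stuck:
        return U_TURN                          # U-turn is mandatory when stuck
    return int(np.argmax(q_values[:U_TURN]))   # U-turn disabled: pick among the other three
```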
Initial Results with U-turn etc.
• At the beginning of learning, exploration is 100% random; it still takes a long time to stumble upon the goal state
Training Results with U-turn
• Basic wall-following behavior: an unanticipated strategy to maximize cumulative reward
RL for Non-Player Characters in Virtual Worlds
• Massively Multiplayer Online Games:
• World of Warcraft (~10 million users)
• Open-Ended Virtual Worlds: users create/add their own environment (terrain, buildings, objects, even laws of physics!)
• Second Life
• Active Worlds
Games Could Drive RL toward Strong AI
• Text-Based Adventure Games (e.g. Zork series)
• puzzle-solving, qualitative physics, commonsense reasoning
• room descriptions, actions etc. all communicated by a natural language interface
• need an implicit sense of making progress
Learning backgammon using TD(λ)
• Neural net observes a sequence of input patterns x1, x2, x3, ..., xf: the sequence of board positions occurring during a game
• Representation: raw board description (# of White or Black checkers at each location) using a simple truncated unary encoding ("hand-crafted features" added in later versions)
• At the final position xf, a reward signal z is given: z = 1 if White wins; z = 0 if Black wins
• Train the neural net using the gradient version of TD(λ)
• Trained NN output Vt = V(xt, w) should estimate prob(White wins | xt)
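For reference, the gradient version of TD(λ) (Sutton, 1988) used here updates the weights after each step by backing up the one-step prediction error through an exponentially decaying sum of past gradients; at the final step, V_{t+1} is replaced by the outcome z:

```latex
w_{t+1} - w_t \;=\; \alpha \,\bigl(V_{t+1} - V_t\bigr) \sum_{k=1}^{t} \lambda^{\,t-k}\, \nabla_{w} V_k
```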
Q: Who makes the moves??
• A: Let the neural net make the moves itself, using its current evaluator: score all legal moves, and pick max Vt for White, or min Vt for Black
• Hopelessly non-theoretical and crazy:
• Training V on a nonstationary target (no convergence proof)
• Training V with a nonlinear function approximator (no convergence proof)
• Random initial weights → random initial play! An extremely long sequence of random moves and random outcomes; learning seems hopeless to a human observer
• But what the heck, let's just try and see what happens...
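A minimal sketch of this 1-ply self-play move selection; legal_moves(), apply_move(), and the evaluator V() are hypothetical helpers assumed for illustration:

```python
# Sketch of 1-ply self-play move selection as described above.
# legal_moves(), apply_move(), and the value network V() are hypothetical helpers.
WHITE, BLACK = +1, -1

def select_move(board, player, V, legal_moves, apply_move):
    """Score every legal move with the current evaluator and pick the best.

    V(board) estimates prob(White wins | board), so White maximizes it
    and Black minimizes it.
    """
    scored = [(move, V(apply_move(board, move, player)))
              for move in legal_moves(board, player)]
    if player == WHITE:
        return max(scored, key=lambda ms: ms[1])[0]
    return min(scored, key=lambda ms: ms[1])[0]
```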
TD-Gammon can teach itself by playing games against itself and learning from the outcome
• Works even starting from random initial play and zero initial expert knowledge (surprising!): achieves strong intermediate play
• Adding hand-crafted features: advanced level of play (1991)
• 2-ply search: strong master play (1993)
• 3-ply search: superhuman play (1998)