Temporal Difference Learning with Expectimax Search for the Threes-bot National Chiao Tung University Department of Computer Science Computer Games and Intelligence (CGI) Lab Advisor: I-Chen Wu Author: Han Chiang
Reference
• “Threes!”, http://asherv.com/threes/
• “Taiwan 2048 Bot”, http://2048-botcontest.twbbs.org/
• CGI-2048, http://2048.aigames.nctu.edu.tw/replay.php
• “Threesus!”, http://blog.waltdestler.com/2014/04/threesus.html
• Albert L. Zobrist, “A New Hashing Method with Application for Game Playing,” Technical Report #88, April 1970.
• Bruce W. Ballard, “The *-Minimax Search Procedure for Trees Containing Chance Nodes.”
• Marcin Szubert and Wojciech Jaśkowski, Institute of Computing Science, Poznan University of Technology, Poznan, Poland, “Temporal Difference Learning of N-tuple Networks for the Game 2048,” CIG 2014.
• J. Baxter, A. Tridgell, and L. Weaver, “Learning to Play Chess Using Temporal Differences,” Machine Learning, vol. 40, no. 3, pp. 243–263, 2000.
• “Temporal-Difference Learning,” Section II-6 of “An Introduction to Reinforcement Learning.”
Outline • Background knowledge • Expectimax • TD-Learning • Formula • Tuple network • Our algorithm • Features • Apply to expectimax • Result
TD-Learning in Game Threes • TD-learning can be successfully applied to the game 2048 [Szubert & Jaśkowski, 2014]. • We designed our Threes! program with: • a different board definition (Threes! vs. 2048), sketched below • our own features • expectimax search
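A minimal sketch of one way the different board definition might look in code: in Threes! the tile sequence is 1, 2, 3, 6, 12, ... rather than 2048's pure powers of two. The rank encoding and helper names below are assumptions for illustration, not the actual implementation.

```python
# Hypothetical encoding of a Threes! board: each cell holds a small "rank"
# index so that tuple features can index lookup tables directly.
# Rank 0 = empty, 1 = the "1" tile, 2 = the "2" tile, and rank r >= 3 is the
# tile 3 * 2**(r - 3), i.e. 3, 6, 12, 24, ...

def tile_value(rank: int) -> int:
    """Map a rank index back to the tile value printed on the board."""
    if rank == 0:
        return 0                      # empty cell
    if rank <= 2:
        return rank                   # the special 1 and 2 tiles
    return 3 * 2 ** (rank - 3)        # 3, 6, 12, 24, ...

# A 4x4 board is a flat tuple of 16 rank indices, row by row.
EMPTY_BOARD = (0,) * 16
```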
TD-Learning in Game Threes • We use the TD(0) learning method: V(s) ← V(s) + α [ r + V(s'') − V(s) ] • V(s): the expected cumulative reward for a board, implemented with N-tuple networks • α: the learning rate • r, s'': the move reward and the resulting board, defined on the next page • The update minimizes the difference between the current prediction of the cumulative future reward and the one-step-ahead prediction.
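A minimal sketch of this update in code; V is assumed to expose value-lookup and additive-update operations (an N-tuple implementation is sketched later), and the argument names follow the formula above.

```python
def td0_update(V, s, reward, s_next, alpha):
    """One TD(0) step: move V(s) toward reward + V(s_next).

    V       -- value function with V.value(board) and V.adjust(board, delta)
    s       -- board at the decision point
    reward  -- points gained by the chosen move
    s_next  -- board after the move and the new random tile (s'' on the next slide)
    alpha   -- learning rate
    """
    error = reward + V.value(s_next) - V.value(s)
    V.adjust(s, alpha * error)
```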
TD-Learning • Move right: s → s' • Add a new random tile: s' → s'' • Learn the expected cumulative reward for the board
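A sketch of one learning step following the s → s' → s'' picture above. The helpers apply_move and spawn_tile are hypothetical names (the real Threes! tile deck is more involved than a simple random spawn), and td0_update is the function sketched on the previous slide.

```python
def learn_one_move(V, s, move, apply_move, spawn_tile, alpha):
    """Play one move, observe the random tile, and update V."""
    s_prime, reward = apply_move(s, move)   # player move, e.g. "move right": s -> s'
    s_double_prime = spawn_tile(s_prime)    # environment adds a random tile: s' -> s''
    td0_update(V, s, reward, s_double_prime, alpha)
    return s_double_prime
```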
Tuple Networks • Tuple networks implement the value function V(s) mentioned before. • V(s) is the sum of the lookup-table weights selected by the tile patterns on the board (a sketch follows below).
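A minimal sketch of an N-tuple network realizing V(s): each tuple is a fixed set of board positions, and V(s) sums the lookup-table weights indexed by the tile ranks at those positions. The class name, example tuple shapes, and the 15-rank limit are illustrative assumptions, not the actual network used.

```python
class NTupleNetwork:
    """V(s) as a sum of lookup-table weights, one table per tuple of positions."""

    def __init__(self, tuples, num_ranks=15):
        self.tuples = tuples                       # e.g. [(0, 1, 2, 3), (4, 5, 6, 7), ...]
        self.num_ranks = num_ranks
        self.tables = [[0.0] * (num_ranks ** len(t)) for t in tuples]

    def _index(self, board, positions):
        idx = 0
        for p in positions:                        # pack the tile ranks into one table index
            idx = idx * self.num_ranks + board[p]
        return idx

    def value(self, board):
        return sum(table[self._index(board, t)]
                   for t, table in zip(self.tuples, self.tables))

    def adjust(self, board, delta):
        share = delta / len(self.tuples)           # spread the TD error over all tuples
        for t, table in zip(self.tuples, self.tables):
            table[self._index(board, t)] += share
```

With this in place, the V.value and V.adjust calls in the earlier TD(0) sketch resolve to simple table lookups and additive updates.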
Features • Max tile value and position • Possible new tile • 3 different parts of the board, expanded with rotations and reflections (a sketch of the symmetry expansion follows below)
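A sketch of the symmetry expansion used by the board-part features: each pattern can be evaluated on all rotations and reflections of the board. The helpers below assume the flat 16-cell board encoding from the earlier sketch and are illustrative only.

```python
def board_symmetries(board):
    """Return the 8 rotations/reflections of a flat 4x4 board."""
    def rotate(b):   # rotate 90 degrees clockwise
        return tuple(b[4 * (3 - c) + r] for r in range(4) for c in range(4))

    def mirror(b):   # reflect horizontally
        return tuple(b[4 * r + (3 - c)] for r in range(4) for c in range(4))

    boards = []
    cur = board
    for _ in range(4):
        boards.append(cur)
        boards.append(mirror(cur))
        cur = rotate(cur)
    return boards

def max_tile_feature(board):
    """Max tile rank and its position -- one of the feature types listed above."""
    pos = max(range(16), key=lambda i: board[i])
    return board[pos], pos
```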
TD-Learning with Expectimax • At the leaf nodes of the expectimax search tree, we return a heuristic evaluation of the board. • We replace that heuristic with the value V(s) learned by TD learning (see the sketch below).
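A depth-limited expectimax sketch with the learned V(s) at the leaves. The helpers legal_moves, apply_move, and spawns are hypothetical, and the spawn model (a list of probability/board pairs) simplifies the real Threes! tile deck.

```python
def expectimax(board, depth, V, legal_moves, apply_move, spawns):
    """Depth-limited expectimax; leaf boards are scored with the learned V(s).

    spawns(board) -> list of (probability, board_after_spawn) pairs
    """
    moves = legal_moves(board)
    if depth == 0 or not moves:
        return V.value(board)                      # V(s) replaces a hand-tuned heuristic
    best = float("-inf")
    for m in moves:                                # max node: the player's move
        after, reward = apply_move(board, m)
        expected = sum(p * expectimax(b, depth - 1, V, legal_moves, apply_move, spawns)
                       for p, b in spawns(after))  # chance node: the random new tile
        best = max(best, reward + expected)
    return best
```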
Result (in our environment) • Highest score: 255,531 • Average score: 107,833 • Max tile: 3072 • 192 rate: 100% • 384 rate: 100% • 768 rate: 97% • 1536 rate: 86% • 3072 rate: 29% • 6144 rate: 0% • Move count: 81,097 • Time: 199.35
Result (in contest server) • Max score: 246,297 • Avg. score: 110,931 • 192 rate: 100% • 384 rate: 100% • 768 rate: 99% • 1536 rate: 86% • 3072 rate: 31%