Temporal Difference Learning with Expectimax Search for the CGI-Threes
Author: Chia-Chuan Chang, Chao-Chin Liang
Advisor: I-Chen Wu
Speaker: Chia-Chuan Chang
Reference
• “Threes!”, http://asherv.com/threes/
• “Threesus!”, http://blog.waltdestler.com/2014/04/threesus.html
• “THREESJS”, http://threesjs.com/
• “Taiwan 2048 Bot”, www.facebook.com/2048BotContest
• “2048 AI webpage (from our lab)”, http://2048.aigames.nctu.edu.tw/replay.php
• Albert L. Zobrist, “A New Hashing Method with Application for Game Playing,” Technical Report #88, April 1970.
• Bruce W. Ballard, “The *-Minimax Search Procedure for Trees Containing Chance Nodes,” Artificial Intelligence, vol. 21, pp. 327–350, 1983.
• Marcin Szubert and Wojciech Jaśkowski (Institute of Computing Science, Poznan University of Technology, Poznan, Poland), “Temporal Difference Learning of N-tuple Networks for the Game 2048,” IEEE CIG 2014 Conference, August 2014.
• J. Baxter, A. Tridgell, and L. Weaver, “Learning to Play Chess Using Temporal Differences,” Machine Learning, vol. 40, no. 3, pp. 243–263, 2000.
• “Temporal-Difference Learning,” Section II-6 of An Introduction to Reinforcement Learning.
Outline
• Background knowledge
  • Expectimax
  • TD-Learning
    • Formula
    • Tuple network
• Our algorithm
  • Board Design
  • New Features
  • Apply to expectimax
TD-Learning in Game Threes
• TD learning can be successfully applied to the game 2048 [Szubert & Jaśkowski 2014].
• We designed our Threes! program, CGI-Threes:
  • Based on CGI-2048
  • Uses TD learning, as in the program above
  • Uses a different game-board definition (Threes! vs. 2048)
  • Uses our own features
  • Uses expectimax search
TD-Learning in Game Threes
• We use the TD(0) learning method:
  V(s) ← V(s) + α (r + V(s'') − V(s))
  • V(s): the expected cumulative reward for board s, implemented using N-tuple networks
  • α: the learning rate
  • r: the reward gained by the move
  • The remaining variables (s, s', s'') are defined on the next page
• The update minimizes the difference between the current prediction of cumulative future reward, V(s), and the one-step-ahead prediction, r + V(s'').
TD-Learning
[Diagram: board s → (move right) → afterstate s' → (add a new random tile) → s'']
• Learning the expected cumulative reward for the board
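To make the update concrete, here is a minimal TD(0) sketch in C++. It assumes a plain table standing in for V (the real program uses N-tuple networks, described on the next slide); the names tdUpdate and value are illustrative, not taken from the CGI-Threes source.

```cpp
#include <cstdint>
#include <unordered_map>

using Board = uint64_t;                   // 64-bit bitboard (see Board Design)
std::unordered_map<Board, double> value;  // stand-in for the N-tuple network

// One TD(0) step after observing s --move (reward r)--> s' --random tile--> s'':
// V(s) <- V(s) + alpha * (r + V(s'') - V(s))
void tdUpdate(Board s, double r, Board s2, double alpha) {
    double target = r + value[s2];            // one-step-ahead prediction
    value[s] += alpha * (target - value[s]);  // shrink the TD error
}
```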
Tuple Networks
• Implement the value function V(s) mentioned before.
• V(s) is the sum of the lookup-table values indexed by the board's tuples (figure omitted).
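As a rough illustration of how such a network evaluates a bitboard, the sketch below computes V(s) as a sum of lookup-table entries; the Tuple structure and the tuple shapes are placeholders, not the actual CGI-Threes features.

```cpp
#include <cstdint>
#include <vector>

using Board = uint64_t;  // 16 tiles x 4 bits each

// One tuple: a set of board cells plus a lookup table holding one
// weight for every possible pattern of tile codes on those cells.
struct Tuple {
    std::vector<int> cells;   // cell indices, 0..15
    std::vector<double> lut;  // size 16^cells.size()
};

// Pack the 4-bit tile codes of the tuple's cells into a table index.
int patternIndex(Board s, const Tuple& t) {
    int idx = 0;
    for (int c : t.cells)
        idx = idx * 16 + static_cast<int>((s >> (4 * c)) & 0xF);
    return idx;
}

// V(s): the sum of each tuple's table entry for its current pattern.
double V(Board s, const std::vector<Tuple>& network) {
    double v = 0.0;
    for (const Tuple& t : network)
        v += t.lut[patternIndex(s, t)];
    return v;
}
```

During TD learning, the same patternIndex values identify exactly which table entries to adjust, so each update touches only one weight per tuple.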
Our Algorithm – Board Design
• Bitboard:
  • 4 bits (0–f) per tile:
    • 0 for empty,
    • 1 for 1,
    • …,
    • e for 6144
  • 16 tiles on the board fit in one 64-bit integer
• Transposition table:
  • Zobrist hashing (Z-hash)
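A minimal sketch of this representation, with illustrative helper names: tiles live in 4-bit fields of a 64-bit integer, and the Zobrist hash XORs together one random key per (cell, tile-code) pair.

```cpp
#include <cstdint>
#include <random>

using Board = uint64_t;  // 16 cells x 4 bits; 0 = empty, 1/2/3 = tiles 1/2/3,
                         // and each later code doubles the tile (0xE = 6144)

inline int getTile(Board b, int cell) {
    return static_cast<int>((b >> (4 * cell)) & 0xF);
}

inline Board setTile(Board b, int cell, int code) {
    return (b & ~(Board(0xF) << (4 * cell))) | (Board(code) << (4 * cell));
}

// Zobrist keys: one random 64-bit value per (cell, tile-code) pair.
uint64_t zobristKey[16][16];

void initZobrist(uint64_t seed = 1) {
    std::mt19937_64 rng(seed);
    for (auto& cell : zobristKey)
        for (auto& key : cell) key = rng();
}

// Board hash, used to index the transposition table: XOR of the keys
// of all 16 cells.
uint64_t zobristHash(Board b) {
    uint64_t h = 0;
    for (int c = 0; c < 16; ++c) h ^= zobristKey[c][getTile(b, c)];
    return h;
}
```

Because a move changes only a few cells, the hash can also be maintained incrementally by XORing out the old keys and XORing in the new ones instead of rescanning the board.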
Our Algorithm – Our Features
• In principle, we used four features (shown as pictures in the original slides; figures omitted).
• The symmetry of the board needs to be considered.
• We also did some further tuning.
TD-Learning with Expectimax
• At the leaf nodes of the expectimax search tree, a heuristic value of the board is normally returned.
• We replace this heuristic with the value V(s) learned by TD learning, as in the sketch below.
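A sketch of that search skeleton, assuming placeholder interfaces legalMoves and tilePlacements (the real Threes! move and tile-generation rules are not shown): max nodes pick the best move, chance nodes average over random tile placements weighted by probability, and leaves return the TD-learned V(s).

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using Board = uint64_t;

// Assumed interfaces (declarations only, not the actual CGI-Threes API):
double V(Board s);                                             // TD-learned value
std::vector<std::pair<Board, double>> legalMoves(Board s);     // (afterstate, reward)
std::vector<std::pair<Board, double>> tilePlacements(Board s); // (state, probability)

double chanceNode(Board s, int depth);

// Max node: the player chooses the move with the best expected value.
double maxNode(Board s, int depth) {
    if (depth == 0) return V(s);     // leaf: the TD value replaces the heuristic
    auto moves = legalMoves(s);
    if (moves.empty()) return V(s);  // dead board; a real program would score a loss
    double best = -1e100;
    for (const auto& [after, reward] : moves)
        best = std::max(best, reward + chanceNode(after, depth));
    return best;
}

// Chance node: expectation over where (and which) random tile appears.
double chanceNode(Board s, int depth) {
    double expected = 0.0;
    for (const auto& [next, prob] : tilePlacements(s))
        expected += prob * maxNode(next, depth - 1);
    return expected;
}
```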
Contest Results
• Max score: 619,347 (reached 775,095 in another 100 rounds)
• Avg. score: 223,595
• Max tile: 6144
• 192 rate: 100%
• 384 rate: 100%
• 768 rate: 100%
• 1536 rate: 96%
• 3072 rate: 68%
• 6144 rate: 10%
• Speed: 300–500 moves/sec (measured by ourselves on one core of an Intel(R) Xeon(R) CPU E31225 @ 3.10GHz)