Reinforcement Learning for the game of Tetris using Cross Entropy Roee Zinaty and Sarai Duek Supervisor: Sofia Berkovich
Tetris Game • The Tetris game is composed of a 10×20 board and 7 types of blocks that can spawn. • Each block can be rotated and translated to the desired placement. • Points are given upon completion of rows.
Our Tetris Implementation • We used a version of the Tetris game that is common in many computer applications (various machine learning competitions and the like). • We differ from the known game in several ways: • We rotate the pieces at the top and then drop them straight down, simplifying and removing some possible moves. • We don’t award extra points for combos – we simply record how many rows were completed in each game.
Reinforcement Learning
[Diagram: the agent sends an action to the world; the world returns input (the new state and its reward) to the agent.]
• A form of machine learning, where each action is evaluated and then awarded a certain grade – good actions are awarded points, while bad actions are penalized.
• Mathematically, it is defined as follows:
  V(s) = max_a [ R(s,a) + γ·V(s′) ]
  where V(s) is the value given to the state s, based on the reward function R(s,a), which is dependent on the state s and action a, and on the value V(s′) of the next resulting state.
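To make the recursion concrete, here is a minimal sketch in Python. All names are illustrative, not taken from the project's code, and the discount factor γ is an assumption (the slides do not state one):

```python
# A minimal sketch of the value recursion V(s) = max_a [R(s,a) + g*V(s')].
# All names here are illustrative placeholders, not the project's code.
GAMMA = 0.9  # assumed discount factor

def state_value(state, actions, reward, next_state, value):
    """Score a state by the best one-step lookahead over its actions."""
    return max(
        reward(state, a) + GAMMA * value(next_state(state, a))
        for a in actions(state)
    )
```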
Cross Entropy A method for reaching a rare-occurrence result from a given distribution in a minimal number of steps/iterations. We need it to find the optimal weights of the given features in the Tetris game (our value function), because the chance of success in Tetris is much smaller than the chance of failure – a rare occurrence. • This is an iterative method, using the last iteration’s results and improving on them. • We add noise to the CE result to prevent premature convergence to a wrong result.
CE Algorithm For iteration t, with distribution N(μ_t, σ_t²):
• Draw n sample vectors w_1, …, w_n and evaluate their values S(w_1), …, S(w_n).
• Select the best ⌈ρ·n⌉ samples, and denote their indices by I ⊆ {1, …, n}.
• Compute the parameters of the next iteration’s distribution by:
  μ_{t+1} = (1/|I|) · Σ_{i∈I} w_i
  σ_{t+1}² = (1/|I|) · Σ_{i∈I} (w_i − μ_{t+1})² + Z_{t+1}
• Z_{t+1} is a constant vector (dependent on iteration) of noise.
• We’ve tried different kinds of noise, eventually settling on a noise term that decays with the iteration t.
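As a sketch, one CE iteration might look like this in Python. The parameter names n and rho and the evaluate() routine are assumptions: evaluate() would play one Tetris game with the given weight vector and return the number of completed rows.

```python
# A minimal sketch of one cross-entropy update (assumed names throughout).
import numpy as np

def ce_iteration(mu, sigma, evaluate, n=100, rho=0.1, noise=0.0):
    # Draw n weight vectors from the current distribution N(mu, sigma^2).
    samples = np.random.normal(mu, sigma, size=(n, len(mu)))
    scores = np.array([evaluate(w) for w in samples])

    # Keep the best ceil(rho*n) samples (the "elite" set).
    elite = samples[np.argsort(scores)[-int(np.ceil(rho * n)):]]

    # Refit the distribution to the elite set, adding noise to the
    # variance to prevent premature convergence.
    new_mu = elite.mean(axis=0)
    new_sigma = np.sqrt(elite.var(axis=0) + noise)
    return new_mu, new_sigma
```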
RL and CE in the Tetris Case • We use a certain set of in-game features, and generate corresponding weights for each using the CE method, starting from a base normal distribution. • Our reward function is derived from the weights and features: R(s) = Σ_i w_i·φ_i(s), where w_i is the weight of the matching feature φ_i. • Afterwards, we run games using the above weights and sort the weight vectors according to the number of rows completed in each game, computing the next iteration’s distributions from the best results.
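A minimal sketch of how such a linear value function would pick a move. The helpers legal_placements(), resulting_board(), and features() are hypothetical placeholders for the game engine, not functions from the project:

```python
# Pick the placement whose resulting board has the highest score
# R(s) = sum_i w_i * phi_i(s). Helper functions are hypothetical.
import numpy as np

def best_action(board, piece, weights):
    def score(placement):
        s = resulting_board(board, piece, placement)  # drop the piece
        return np.dot(weights, features(s))           # linear value
    return max(legal_placements(board, piece), key=score)
```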
Our Parameters – First Try • We initially used a set of parameters detailing the following features: • Max pile height. • Number of holes. • Individual column heights. • Difference of heights between the columns. • Results from using these features were poor, and didn’t match the original paper they were taken from.
Our Parameters – Second Try • Afterwards, we tried a second set of features, whose results are displayed on the following slides:
2-Piece Strategy • However, we also tried a 2-piece strategy (look at the next piece as well, and plan accordingly) with the first set of parameters. We thus achieved superb results – after ~20 iterations of the algorithm, we scored 4.8 million rows on average! • The downside was running time: approx. 1/10 the speed of our normal algorithm. Coupled with the much longer games, this resulted in very long runs. • Only two games were run using the 2-piece strategy, and they ran for about 3–4 weeks before ending abruptly (the computer restarted).
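A sketch of the 2-piece lookahead, using the same hypothetical helpers as above: every placement of the current piece is tried, then every placement of the next piece on the resulting board, and the pair with the best final score decides where the current piece goes.

```python
# 2-piece lookahead sketch (hypothetical helpers, as before).
import numpy as np

def best_action_two_piece(board, piece, next_piece, weights):
    best, best_score = None, -np.inf
    for p1 in legal_placements(board, piece):
        b1 = resulting_board(board, piece, p1)
        for p2 in legal_placements(b1, next_piece):
            b2 = resulting_board(b1, next_piece, p2)
            s = np.dot(weights, features(b2))
            if s > best_score:
                best, best_score = p1, s
    return best  # only the current piece's placement is executed
```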
The Tetris Algorithm
1. A new Tetris block spawns.
2. Use two blocks for strategy?
   • Yes: compute the best action using both blocks and the feature weights.
   • No: compute the best action using the current block and the feature weights.
3. Move the block according to the best action.
4. Update the board if necessary (collapse full rows, lose).
5. Upon loss, return the number of completed rows; otherwise, continue from step 1.
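The loop above as a sketch, reusing the best_action routines from earlier; empty_board(), spawn_block(), collapse_full_rows(), and is_lost() are hypothetical stand-ins for the game engine:

```python
# Game-loop sketch of the flowchart above (assumed helper names).
def play_game(weights, two_piece=False):
    board, rows_completed = empty_board(), 0
    piece, next_piece = spawn_block(), spawn_block()
    while not is_lost(board):
        if two_piece:
            placement = best_action_two_piece(board, piece, next_piece, weights)
        else:
            placement = best_action(board, piece, weights)
        board = resulting_board(board, piece, placement)
        board, cleared = collapse_full_rows(board)
        rows_completed += cleared
        piece, next_piece = next_piece, spawn_block()
    return rows_completed
```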
Results • Following are some results from running our algorithm with the aforementioned features (second try). • Each run takes approx. two days, with an arbitrary 50 iterations of the CE algorithm. • Each iteration includes 100 randomly generated weight vectors, with one game played per vector to evaluate it, and then 30 games of the “best” result (most rows completed) for statistics.
Results – Sample Output Below is a sample output as printed by our program.
• Performance & weight overview (min, max, avg weight values):
  Iteration 46, average is 163667.57 rows, best is 499363 rows
    min: -41.56542, average: -13.61646, max: 5.54374
  Iteration 47, average is 138849.43 rows, best is 387129 rows
    min: -38.91538, average: -12.93479, max: 4.42429
  Iteration 48, average is 251081.03 rows, best is 806488 rows
    min: -38.60941, average: -11.88640, max: 11.97776
  Iteration 49, average is 251740.57 rows, best is 648248 rows
    min: -38.41177, average: -11.81831, max: 7.05757
• Feature weights & matching STD (of the normal distribution):
  Weights: -10.748, -20.345, -7.5491, -11.033, 7.0576, -8.9337, -12.211, -0.063724, -11.804, -15.959, -38.412
  STDs: 2.9659, 1.4162, 1.3864, 1.6074, 1.4932, 0.93831, 1.0166, 0.34907, 1.1931, 0.7918, 2.536
Results – Each graph is a feature’s weight, averaged over the different simulations, versus the iterations.
Results – Each graph is the STD of a feature’s weight (derived from the CE method), averaged over the different simulations, versus the iterations.
Results – Final weight vectors per simulation, reduced to 2D space, with the average row performance matching the weights (averaged over 30 games each).
Conclusions • We currently see a lack of progress in the games. Games quickly reach a good result, but swing back and forth in later iterations (e.g. 100K rows on one iteration, 200K on the next, then 50K). • We can see from the graphs that the STD of the weights didn’t drop to near zero, meaning we might gain more from further iterations. • Another observation is that the final weight vectors are all different, meaning there was no convergence to a similar weight vector – there is room for improvement.
Possible Directions • We might try different approaches to noise. We currently use a noise model that decays with the iteration t, so we could use either a smaller or a bigger noise to check for changes. • We can try updating the distributions only partly with the new parameters, and partly with the last iteration’s parameters, e.g. a smoothed update μ_{t+1} = α·μ̂_{t+1} + (1 − α)·μ_t, where μ̂_{t+1} is the estimate fitted from the best samples and 0 < α < 1. • We can use a certain threshold for the weights’ variance: once a weight passes it, we “lock” that weight from changing, and thus randomize fewer weight vectors (fewer possible combinations).
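As a sketch, the smoothed update could wrap the CE iteration sketched earlier; the blending factor alpha is an assumed hyperparameter:

```python
# Smoothed-update sketch: blend the freshly fitted CE parameters with
# the previous iteration's (alpha is an assumed hyperparameter).
def smoothed_ce_iteration(mu, sigma, evaluate, alpha=0.7, **kwargs):
    new_mu, new_sigma = ce_iteration(mu, sigma, evaluate, **kwargs)
    mu = alpha * new_mu + (1 - alpha) * mu
    sigma = alpha * new_sigma + (1 - alpha) * sigma
    return mu, sigma
```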