Progressive StrategiesFor Monte-Carlo Tree Search Authors: G.M.J.B. Chaslot, M.H.M. Winands, J.W.H.M. Uiterwijk, H.J. van den Herik and B. Bouzy Presenter: Ling Zhao University of Alberta November 5, 2007
Outline • Monte-Carlo Tree Search (MCTS) and its implementation in MANGO. • Progressive strategies: progressive bias and progressive unpruning. • Experiments. • Conclusions and future work.
Selection • Process: select moves in the UCT tree for the best balance between exploitation and exploration. • A multi-armed bandit problem. • UCB formula: select the child k of node p that maximizes v_i + C·sqrt(ln n_p / n_i), where v_i is the value of child i, n_i its visit count, n_p the visit count of p, and C a constant. • Selection precondition: n_p >= T (= 30).
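As a rough illustration of this selection step, here is a minimal Python sketch (not MANGO's code; the Node fields, the value of C, and the handling of unvisited children are assumptions made for the example):

```python
import math

C = 0.7   # exploration constant (value chosen for illustration)
T = 30    # visit-count threshold below which the simulation strategy is used

class Node:
    def __init__(self):
        self.children = []   # child Nodes
        self.value = 0.0     # v_i: average result of simulated games through this node
        self.visits = 0      # n_i

def ucb(parent, child):
    # v_i + C * sqrt(ln(n_p) / n_i); unvisited children get infinite urgency
    if child.visits == 0:
        return float("inf")
    return child.value + C * math.sqrt(math.log(parent.visits) / child.visits)

def select_child(parent):
    assert parent.visits >= T   # selection is only used once n_p >= T
    return max(parent.children, key=lambda c: ucb(parent, c))
```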
Expansion • Process: for a given leaf node, determine whether it will be expanded by storing one or more of its children in the UCT tree. • Simple rule: expand one node per simulated game (the first node encountered that is not yet in the UCT tree). • In MANGO, when n_p = T (= 30), all of the node's children are expanded.
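A hedged sketch of MANGO's expansion rule, reusing the Node class from the selection sketch; position.legal_moves() is an assumed helper, not the paper's API:

```python
T = 30

def maybe_expand(node, position):
    # Once a node has been visited T times, store all of its children in the tree.
    if node.visits == T and not node.children:
        for move in position.legal_moves():   # assumed helper
            # a real implementation would also record which move each child represents
            node.children.append(Node())
```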
Simulation • Process: self-play until the end of the game. • Rules: 1. Disallow playing in one's own eyes. 2. Stop the game after a certain number of moves. • In MANGO, the probability of a move being selected in a simulation is proportional to its urgency: the sum of its capture value and 3x3 pattern value, modified by proximity.
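A minimal sketch of urgency-proportional move selection in the playout; urgency(), position.legal_moves(), and position.fills_own_eye() are assumed helpers, not part of the paper:

```python
import random

def pick_simulation_move(position):
    # candidate moves, excluding moves that fill the player's own eyes
    moves = [m for m in position.legal_moves() if not position.fills_own_eye(m)]
    # urgency = capture value + pattern value, modified by proximity (assumed helper)
    weights = [urgency(position, m) for m in moves]
    # probability of choosing a move is proportional to its urgency
    return random.choices(moves, weights=weights, k=1)[0]
```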
Backpropagation • Process: use the result of a simulated game to update the nodes it traverses. • Result: +1 for a win, -1 for a loss, 0 for a draw. • v_i of node i is computed by averaging the results of all simulated games played through it.
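A sketch of the backpropagation step, keeping a running average at each node on the path (the per-level sign handling needed for a two-player result is omitted for brevity):

```python
def backpropagate(path, result):
    # result: +1 win, -1 loss, 0 draw
    for node in path:
        node.visits += 1
        # incremental update so node.value stays the average of all results
        node.value += (result - node.value) / node.visits
```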
Progressive Strategies • Soft transition between the selection strategy and the simulation strategy. • Intuition: the selection strategy becomes more accurate than the simulation strategy only when the number of simulated games is large. • A progressive strategy uses the information available to the selection strategy, plus some possibly expensive domain knowledge. • A progressive strategy behaves like the simulation strategy when few games have been played, and converges to the selection strategy when numerous games have been played.
Progressive Bias • Direct the search using possibly expensive heuristic knowledge. • Modify the selection strategy, making sure the knowledge's influence decreases quickly as more games are played.
Progressive Bias Formula • A bias term f(n_i) = H_i / (n_i + 1) is added to the UCB selection value, where H_i is a coefficient representing heuristic knowledge about child i. • For children with n_i = 0, the UCB part is replaced by a constant M with M >> any v_i, so among unvisited children the one with the highest f(n_i) is selected. • If n_p ∈ [30, 100], f(n_i) is dominant. • If n_p ∈ (100, 500], f(n_i) has partial impact. • When n_p > 500, f(n_i) is dominated, but can still serve as a tie breaker.
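A sketch of selection with the progressive bias term folded in, assuming f(n_i) = H_i / (n_i + 1); the child.heuristic field and the values of C and M are illustrative assumptions:

```python
import math

C = 0.7
M = 1e6   # "M >> any v_i", used for unvisited children

def pb_score(parent, child):
    f = child.heuristic / (child.visits + 1)   # f(n_i) = H_i / (n_i + 1)
    if child.visits == 0:
        # all unvisited children get M, so ties are broken by f(n_i)
        return M + f
    return child.value + C * math.sqrt(math.log(parent.visits) / child.visits) + f

def select_child_with_bias(parent):
    return max(parent.children, key=lambda c: pb_score(parent, c))
```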
Alternative Approach • Use prior knowledge when initializing nodes (Gelly and Silver). • "Scalability of this approach to larger board sizes is an open question."
Progressive Unpruning • Artificially reduce the branching factor when the selection strategy is used. • Increase the branching factor progressively as more games are simulated. • Pruning and unpruning are done according to the heuristic values of the children.
Progressive Unpruning (Details) • If n_p = T, only the k0 (= 5) children with the highest heuristic values are left unpruned. • If n_p > T, k = lg(n_p / 40) × 2.67 + k0 children are left unpruned. • k = 5 (n_p = 40), 7 (n_p = 80), 10 (n_p = 120). • A similar idea is used by Coulom (progressive widening).
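A sketch of this unpruning schedule; lg is taken to be log base 2 here (an assumption), and child.heuristic is an assumed field holding H_i:

```python
import math

T, K0 = 30, 5

def unpruned_count(np):
    if np <= T:
        return K0
    # k = lg(np / 40) * 2.67 + k0, never dropping below k0
    return max(K0, int(math.log2(np / 40) * 2.67 + K0))

def unpruned_children(node):
    # keep only the k children with the highest heuristic values
    ranked = sorted(node.children, key=lambda c: c.heuristic, reverse=True)
    return ranked[:unpruned_count(node.visits)]
```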
Heuristic Values • Pattern value: learned offline by pattern matching (89,119 patterns from 2,000 professional games). • Capture value: the number of stones the move captures, or saves from capture. • Proximity value: Euclidean distance to the last move.
Heuristic Value Formula • C_i: capture value • P_i: pattern value • D_{k,i}: distance to the kth last move • Coefficient for the kth last move: 1.25 + k/2 • Computing P_i is the time-consuming part.
Time For Computing Heuristics • Computing H is around 1,000 times slower than playing a move in a simulated game. • So H is computed only once per node, when T (= 30) games have been played through it. • The overall slowdown is only 4%, since the number of nodes with visit count >= 30 is small compared to the total number of moves in simulated games.
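A sketch of this caching idea: the expensive heuristic is computed once, when the node's visit count reaches T; compute_heuristics() is an assumed, slow helper that scores each child move:

```python
T = 30

def record_visit(node, position):
    node.visits += 1
    if node.visits == T and getattr(node, "heuristics", None) is None:
        # ~1000x slower than playing a simulated move, so do it only once per node
        node.heuristics = compute_heuristics(position)   # assumed helper: H_i per child
```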
Experiments • Self-play games on a 13x13 board (10 sec per move): MANGO with progressive strategies won 91% of the 500 games against MANGO without progressive strategies. • MANGO: 20,000 simulated games, 1 sec on 9x9, 2 sec on 13x13, 5 sec on 19x19. • GNU Go: level 10 on 9x9 and 13x13, level 0 on 19x19.
MANGO vs. GNU Go • Plain MCTS does not scale well to the 13x13 or 19x19 board. • Progressive strategies are useful on every board size. • The two progressive strategies combined are the most powerful, especially on 19x19.
Tournament Results • Always in the top half. • But were negative results removed?
Conclusions and Future Work • The two progressive strategies are useful, providing a soft transition between selection and simulation. • Their overhead is negligible. • Combine with RAVE and UCT with prior knowledge. • Combine with the advanced knowledge developed by Coulom. • Use life-and-death information. • Better progressive bias: P.-A. Coquelin and R. Munos. Bandit Algorithms for Tree Search. Technical Report 6141, INRIA, 2007.