10 likes | 174 Views
K th bandit. 2 nd bandit. 1 ST bandit. 00. 00. 00. 01. 01. 01. 10. 10. 10. 11. 11. 11. a 1. a K. a 2. Decision maker. s 0. …. a 0. a 1. a k. s 0. s 1. s n. s 0. s 1. s n. a 0. a 1. a k. …. An episode. s 0. s 1. s n. GPU GO: Accelerating Monte-Carlo UCT.
E N D
Kth bandit 2nd bandit 1ST bandit 00 00 00 01 01 01 10 10 10 11 11 11 a1 aK a2 Decision maker s0 … a0 a1 ak s0 s1 sn s0 s1 sn a0 a1 ak … An episode s0 s1 sn GPU GO: Accelerating Monte-Carlo UCT Tae-hyung Kim*, Ryan Meuth*, Paul Robinette*, Donald C. Wunsch II* Department of Electrical and Computer EngineeringUniversity of Missouri – Rolla, Rolla, MO 65409tk424@umr.edu, rmeuth@umr.edu, pmrmq3@umr.edu,dwunsch@umr.edu GPU Graphics Processing Units Modern graphics processing units (GPU) and game consoles are used for much more than 3D graphics applications and video games. From machine vision to finite element analysis, GPU’s are being used in diverse applications, collectively called General Purpose computation on Graphics Processor Units (GPGPU). Additionally, game consoles are entering the market of high performance computing as inexpensive nodes in computing clusters. Applications in Strategy and Economics Combining novel Computational intelligence methods and state of the art brute-force acceleration methods to solve a game with 10^60 states and wide ranging applications in science and industry. Go is considered to be the most important strategy game to be solved by computers NVIDIA Tesla GPU Image Source: http://www.nvidia.com/docs/IO/43399/tesla_main.jpg A Go Board Image source: http://upload.wikimedia.org/wikipedia/en/2/2e/Go_board.jpg • UCT (Upper Confidence based Tree Search) • An efficient game tree search algorithm for the game of Go by Levente Kocsis and Csaba Szepesvari [1]. • The UCB1 (Upper Confidence Bounds) Algorithm [2] is employed to search the game tree. • Features of UCT include: • - balancing the exploration-exploitation dilemma, • - small probability of selecting erroneous moves even if the algorithm is stopped prematurely, • - convergence to the best action if sufficient time is allowed, • - sampling actions selectively with a fixed look ahead depth d. • - tree built incrementally by selecting a branch or an episode selectively over time (Figure 2). • The Go problem seen from UCT is a multi-arm bandit problem [3], i.e. a K-armed bandit (Figure 1). • Figure 1. The multi-arm bandit problem. Figure 2. The game tree expressed in terms of the Markov Decision Process. • A bandit problem is similar to a slot machine with multiple levers • The goal is to maximize the expected payoff • The problem can be formulated by the Markov Decision Process • - State s is the current board status • - Action a is the choice of a bandit, action set A=(1, …, K) • - Reward ra,nis the payoff of action a and index n in the range of [0,1]. • In the problem of Go, the goal is to select the best move, action, or bandit • to maximize the expected total reward given a state (positions on the Go board). • UCT selects an action a to maximize the summation of the estimate action-value • function at depth d, Qt(s,a), and the bias term, b(s,a). • (1) Pseudocode for a generic • Monte-Carlo planning algorithm • The action-value function Qt(s,a) is estimated by • averaging the rewards r(a,Ns,a,t) where Ns,a,t is the • number of times action a has been selected by time t. • (2) • The bias term used by UCT is given below where Cd • is a non-zero constant to prevent drifting and Ns,t is • the number of times state s has been visited by time t. • (3) Significant performance gains can be elicited from implementing Neural Networks and Approximate Dynamic Programming algorithms on graphics processing units. However, there is an amount of art to these implementations. In some cases the performance gains can be as high as 200 times, but as low as 2 times or even less than CPU operation. Thus it is necessary to understand the limitations of the graphics processing hardware, and to take these limitations into account when developing algorithms targeted at the GPU. Any solution to Go, and this algorithm in particular, can be applied to decision processes such as how to price gas for maximum profit Image Source: http://www.gasbuddy.com/gb_gastemperaturemap.aspx GPU Capabilities are increasing exponentially through the utilization of parallel architectures. The exploration-exploitation components of this algorithm combined with the concept of territory control offered by Go can give insight in deciding where to place a new Starbucks in Manhattan. Image Source: http://gothamist.com/attachments/jake/2006_10_starbuckslocations.jpg In the context of Go, GPU’s can not only reduce the time to evaluate game states, but they can also serve as massively parallel processors, computing move trees in parallel. The most natural application of Go is in military strategy. The UCT algorithm can be applied to maximizing zones of control, planning efficient and safe routes, and allocating resources correctly. Image Source: http://www.globalsecurity.org/military/ops/images/oif_mil-ground-routes-map_may04.jpg GPU – HDP implementation yields 22x performance increase on Legacy hardware. [1] Levente Kocsis and Csaba Szepesvari, “Bandit based Monte-Carlo Planning”, Lecture Notes in Artificial Intelligence, 2006. [2] Peter Auer, Nicolò Cesa-Bianchi, Paul Fischer, “Finite-time Analysis of the Multi-armed Bandit Problem”, Machine Learning, 2002. [3] Richard S. Sutton and Andres G. Barto, "Reinforcement Learning: An Introduction", MIT Press, 1998.