Monte-Carlo Tree Search Alan Fern
Introduction • Rollout does not guarantee optimality or near optimality • It only guarantees policy improvement (under certain conditions) • Theoretical Question: Can we develop Monte-Carlo methods that give us near optimal policies? • With computation that does NOT depend on the number of states! • This was an open theoretical question until the late 90’s. • Practical Question: Can we develop Monte-Carlo methods that improve smoothly and quickly with more computation time?
Look-Ahead Trees • Rollout can be viewed as performing one level of search on top of the base policy • In deterministic games and search problems it is common to build a look-ahead tree at a state to select the best action • Can we generalize this to general stochastic MDPs? • Sparse Sampling is one such algorithm • Strong theoretical guarantees of near optimality (Figure: a tree branching over actions a1 … ak; maybe we should search for multiple levels.)
Online Planning with Look-Ahead Trees • At each state we encounter in the environment we build a look-ahead tree of depth h and use it to estimate optimal Q-values of each action • Select the action with the highest Q-value estimate • s = current state • Repeat • T = BuildLookAheadTree(s) ;; sparse sampling or UCT ;; tree provides Q-value estimates for root actions • a = BestRootAction(T) ;; action with best Q-value • Execute action a in the environment • s = the resulting state
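A minimal Python sketch of this online planning loop, assuming an environment object whose step(a) returns (next_state, reward, done) and a caller-supplied plan(s) function (both names are assumptions, not part of the slides):

def act_in_environment(env, s, plan, num_steps):
    # Online planning: re-plan from scratch at every state we encounter.
    # `plan` maps a state to an action, e.g. by building a sparse-sampling
    # or UCT look-ahead tree rooted at the state and returning the root
    # action with the highest Q-value estimate.
    for _ in range(num_steps):
        a = plan(s)                    # T = BuildLookAheadTree(s); a = BestRootAction(T)
        s, reward, done = env.step(a)  # execute action a in the real environment
        if done:
            break
    return s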
Planning with Look-Ahead Trees (Figure: along the real-world state/action sequence, a new look-ahead tree is built at each encountered state s; each root action a1, a2 leads to sampled next states s11 … s1w and s21 … s2w with rewards R(s1i, a), R(s2i, a), and the expansion continues below them.)
Expectimax Tree (depth 1) • Alternate max & expectation: the root s is a max node (max over actions a, b); each action leads to an expectation node (weighted average over next states s1 … sn, weighted by their probability of occurring after taking a in s) • After taking each action there is a distribution over next states (nature’s moves) • The value of an action depends on the immediate reward and the weighted value of those states
Expectimax Tree (depth 2) • Alternate max & expectation: the root s is a max node over actions a, b; each expectation node (weighted average over states) leads to further max nodes, and so on • Max Nodes: value equal to the max of the values of the child expectation nodes • Expectation Nodes: value is the weighted average of the values of its children • Compute values from the bottom up (leaf values = 0). Select the root action with the best value.
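A short sketch of this bottom-up computation in Python, assuming an explicit MDP model with actions(s), transitions(s, a) returning (next_state, probability) pairs, reward(s, a, s2), and a discount factor beta (all of these interface names are assumptions):

def expectimax_value(mdp, s, h):
    # Exact finite-horizon value V*(s, h) by alternating max and expectation.
    if h == 0:
        return 0.0                                   # leaf values are 0
    best = float("-inf")
    for a in mdp.actions(s):                         # max node: max over actions
        q = 0.0
        for s2, p in mdp.transitions(s, a):          # expectation node: weighted average
            q += p * (mdp.reward(s, a, s2) + mdp.beta * expectimax_value(mdp, s2, h - 1))
        best = max(best, q)
    return best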
Exact Expectimax Tree • Alternate max & expectation (root value V*(s,H), action values Q*(s,a,H)) • In general we can grow the tree to any horizon H • Size depends on the size of the state space. Bad!
Sparse Sampling Tree • Root value V*(s,H), action values Q*(s,a,H) • Replace each expectation with an average over w samples (sampling width w) • The tree has (kw)^H leaves, where k is the number of actions • w will typically be much smaller than n, the number of possible next states
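As an illustrative calculation (the numbers are hypothetical): with k = 5 actions, sampling width w = 10, and horizon H = 4, the sparse tree already has (kw)^H = 50^4 = 6,250,000 leaves, which is why small horizons and leaf heuristics (below) are needed in practice.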
Sparse Sampling [Kearns et al., 2002] • The Sparse Sampling algorithm computes the root value via depth-first expansion • Returns a value estimate V*(s,h) of state s and an estimated optimal action a*

SparseSample(s, h, w)
  If h == 0 Then Return [0, null]
  For each action a in s
    Q*(s,a,h) = 0
    For i = 1 to w
      Simulate taking a in s, resulting in si and reward ri
      [V*(si,h-1), a*] = SparseSample(si, h-1, w)
      Q*(s,a,h) = Q*(s,a,h) + ri + β V*(si,h-1)
    Q*(s,a,h) = Q*(s,a,h) / w     ;; estimate of Q*(s,a,h)
  V*(s,h) = maxa Q*(s,a,h)        ;; estimate of V*(s,h)
  a* = argmaxa Q*(s,a,h)
  Return [V*(s,h), a*]
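A runnable Python sketch of this recursion, assuming a generative model sim(s, a) that returns a sampled (next_state, reward), an actions(s) function, and a discount beta (these names are assumptions):

def sparse_sample(sim, actions, beta, s, h, w):
    # Depth-first sparse sampling.
    # Returns (value estimate, best action) for state s at depth h.
    if h == 0:
        return 0.0, None
    best_q, best_a = float("-inf"), None
    for a in actions(s):
        q = 0.0
        for _ in range(w):                      # average over w sampled next states
            s2, r = sim(s, a)
            v2, _ = sparse_sample(sim, actions, beta, s2, h - 1, w)
            q += r + beta * v2
        q /= w                                  # estimate of Q*(s, a, h)
        if q > best_q:
            best_q, best_a = q, a
    return best_q, best_a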
SparseSample(h=2, w=2) • Alternate max & averaging • Max Nodes: value equal to the max of the values of the child average nodes • Average Nodes: value is the average of the values of the w sampled children • Compute values from the bottom up (leaf values = 0); select the root action with the best value.
SparseSample(h=2, w=2) • The first recursive call SparseSample(h=1, w=2) under action a returns 10.
SparseSample(h=2, w=2) • The second call SparseSample(h=1, w=2) under action a returns 0.
SparseSample(h=2, w=2) • Averaging the two samples (10 and 0) gives Q(s, a) = 5.
SparseSample(h=2, w=2) • Doing the same for action b gives samples 2 and 4, so Q(s, b) = 3. Select action ‘a’ since its value (5) is larger than b’s (3).
Sparse Sampling (Cont’d) • For a given desired accuracy, how large should the sampling width and depth be? • Answered: Kearns, Mansour, and Ng (1999) • Good news: gives values for w and H that achieve a policy arbitrarily close to optimal • The values are independent of the state-space size! • First near-optimal general MDP planning algorithm whose runtime didn’t depend on the size of the state space • Bad news: the theoretical values are typically still intractably large---also exponential in H • Exponential in H is the best we can do in general • In practice: use small H and use a heuristic at the leaves
Sparse Sampling (Cont’d) • In practice: use small H & evaluate leaves with a heuristic • For example, if we have a policy, leaves could be evaluated by estimating the policy value (i.e. average reward across simulated runs of the policy from the leaf) (Figure: policy simulations hanging off the leaves of a depth-limited tree over actions a1, a2.)
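A sketch of such a leaf evaluator: estimate a leaf's value as the average discounted return of a few simulated runs of a base policy. The interface names sim(s, a) and policy(s), and the run parameters, are assumptions:

def rollout_leaf_value(sim, policy, beta, s, num_runs=10, run_len=50):
    # Heuristic leaf value: average discounted return of `policy` run from s.
    total = 0.0
    for _ in range(num_runs):
        state, ret, discount = s, 0.0, 1.0
        for _ in range(run_len):
            state, r = sim(state, policy(state))
            ret += discount * r
            discount *= beta
        total += ret
    return total / num_runs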
Uniform vs. Adaptive Tree Search • Sparse sampling wastes time on bad parts of the tree • Devotes equal resources to each state encountered in the tree • Would like to focus on the most promising parts of the tree • But how to control exploration of new parts of the tree vs. exploiting promising parts? • Need an adaptive bandit algorithm that explores more effectively
What now? • Adaptive Monte-Carlo Tree Search • UCB Bandit Algorithm • UCT Monte-Carlo Tree Search
Bandits: Cumulative Regret Objective • Problem: find an arm-pulling strategy such that the expected total reward at time n is close to the best possible (one pull per time step) • Optimal (in expectation) is to pull the optimal arm n times • UniformBandit is a poor choice --- it wastes time on bad arms • Must balance exploring machines to find good payoffs and exploiting current knowledge (Figure: a single state s with arms a1 … ak.)
Cumulative Regret Objective • Theoretical results are often about the “expected cumulative regret” of an arm-pulling strategy. • Protocol: At time step n the algorithm picks an arm a_n based on what it has seen so far and receives reward r_n (a_n and r_n are random variables). • Expected cumulative regret E[Reg_n]: the difference between the optimal expected cumulative reward and the expected cumulative reward of our strategy at time n
UCB Algorithm for Minimizing Cumulative Regret Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2), 235-256. • Q(a): average reward for trying action a (in our single state s) so far • n(a): number of pulls of arm a so far • Action choice by UCB after n pulls: pull the arm maximizing Q(a) + sqrt(2 ln n / n(a)) • Assumes rewards in [0,1]. We can always normalize if we know the max value.
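A minimal Python sketch of this selection rule (UCB1), assuming rewards are normalized to [0, 1] and that per-arm averages q[a] and pull counts n[a] are maintained elsewhere:

import math

def ucb_choose(q, n):
    # Pick an arm given average rewards q[a] and pull counts n[a].
    # Untried arms are pulled first; otherwise maximize Q(a) + sqrt(2 ln n / n(a)).
    total = sum(n.values())
    for a, pulls in n.items():
        if pulls == 0:
            return a                              # every arm must be tried once
    return max(n, key=lambda a: q[a] + math.sqrt(2.0 * math.log(total) / n[a]))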
UCB: Bounded Sub-Optimality • Value term Q(a): favors actions that looked good historically • Exploration term sqrt(2 ln n / n(a)): actions get an exploration bonus that grows with ln(n) • The expected number of pulls of a sub-optimal arm a is bounded by O((ln n) / Δ_a^2), where Δ_a is the sub-optimality (gap) of arm a • Doesn’t waste much time on sub-optimal arms, unlike uniform!
UCB Performance Guarantee [Auer, Cesa-Bianchi, & Fischer, 2002] • Theorem: The expected cumulative regret of UCB after n arm pulls is bounded by O(log n) • Is this good? Yes. The average per-step regret is O((log n) / n) • Theorem: No algorithm can achieve a better expected regret (up to constant factors)
What now? • Adaptive Monte-Carlo Tree Search • UCB Bandit Algorithm • UCT Monte-Carlo Tree Search
UCT Algorithm [Kocsis & Szepesvari, 2006] • UCT is an instance of Monte-Carlo Tree Search • Applies principle of UCB • Similar theoretical properties to sparse sampling • Much better anytime behavior than sparse sampling • Famous for yielding a major advance in computer Go • A growing number of success stories
Monte-Carlo Tree Search • Builds a sparse look-ahead tree rooted at the current state by repeated Monte-Carlo simulation of a “rollout policy” (what is the rollout policy? see the next slide) • During construction each tree node s stores: • state-visitation count n(s) • action counts n(s,a) • action values Q(s,a) • Repeat until time is up • Execute the rollout policy starting from the root until the horizon (generates a state-action-reward trajectory) • Add the first node not in the current tree to the tree • Update the statistics of each tree node on the trajectory • Increment n(s) and n(s,a) for the selected action a • Update Q(s,a) with the total reward observed after the node
Rollout Policies • Monte-Carlo Tree Search algorithms mainly differ in their choice of rollout policy • Rollout policies have two distinct phases • Tree policy: selects actions at nodes already in the tree (each action must be selected at least once) • Default policy: selects actions after leaving the tree • Key Idea: the tree policy can use statistics collected from previous trajectories to intelligently expand the tree in the most promising direction • Rather than uniformly exploring actions at each node (a rough sketch of the full loop follows below)
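A rough Python sketch of the loop from the last two slides. The names sim(s, a), actions(s), tree_policy, and default_policy are assumed interfaces, and rewards along a trajectory are simply summed (no discounting), as in the example that follows:

def mcts(sim, actions, tree_policy, default_policy, root, horizon, num_iters):
    # Build a sparse look-ahead tree rooted at `root` by repeated rollouts.
    n_s, n_sa, q = {root: 0}, {}, {}                    # visit counts, action counts, values
    for _ in range(num_iters):
        s, traj = root, []
        while s in n_s and len(traj) < horizon:          # tree phase: follow the tree policy
            a = tree_policy(s, actions(s), n_s, n_sa, q)
            s2, r = sim(s, a)
            traj.append((s, a, r))
            s = s2
        if s not in n_s:
            n_s[s] = 0                                   # add first node not in the current tree
        tail = 0.0
        for _ in range(horizon - len(traj)):             # default phase: finish the trajectory
            s, r = sim(s, default_policy(s))
            tail += r
        for (si, ai, ri) in reversed(traj):              # backup along the trajectory
            tail += ri                                   # total reward observed after (si, ai)
            n_s[si] += 1
            n_sa[(si, ai)] = n_sa.get((si, ai), 0) + 1
            q_old = q.get((si, ai), 0.0)
            q[(si, ai)] = q_old + (tail - q_old) / n_sa[(si, ai)]
    return q, n_sa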
Example MCTS • For illustration purposes we will assume the MDP is: • Deterministic • The only non-zero rewards are at terminal/leaf nodes • The algorithm is well-defined without these assumptions
Iteration 1: Initially the tree is a single leaf (the current world state). At a leaf node the tree policy selects a random action and then executes the default policy; the rollout reaches a terminal with reward 1, one new tree node is added, and the root’s value estimate becomes 1. (Assume all non-zero reward occurs at terminal nodes.)
Iteration 2: Each action at a node must be selected at least once, so the other root action is tried; the default-policy rollout reaches a terminal with reward 0 and another new tree node is added.
Iteration 3: The root’s value estimate is now 1/2, with child values 1 and 0.
Iteration 3 (continued): When all of a node’s actions have been tried once, the action is selected according to the tree policy; here it descends into the child with value 1.
Iteration 3 (continued): From that child a random action is taken, a new tree node is added, and the default-policy rollout returns reward 0.
Iteration 4: After the backup the values are 1/3 at the root and 1/2 and 0 at its children; the tree policy again selects where to descend.
Iteration 4 (continued): The simulation below the selected child reaches a terminal with reward 1 and a new node is added.
After the backup, the values along the visited path rise (root 2/3, selected child 2/3). What is an appropriate tree policy? What default policy?
UCT Algorithm [Kocsis & Szepesvari, 2006] • Basic UCT uses a random default policy • In practice we often use a hand-coded or learned policy • The tree policy is based on UCB: select argmax_a [ Q(s,a) + K * sqrt( ln n(s) / n(s,a) ) ] • Q(s,a): average reward received in trajectories so far after taking action a in state s • n(s,a): number of times action a was taken in s • n(s): number of times state s was encountered • K is a theoretical constant that must be selected empirically in practice
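A sketch of this UCB-based tree policy, written so it can be plugged in as tree_policy in the MCTS sketch above (same assumed data structures; K is the exploration constant the slide refers to):

import math

def uct_tree_policy(s, legal_actions, n_s, n_sa, q, K=1.0):
    # UCT action selection at a tree node s.
    # Every action is tried once before the UCB rule applies.
    for a in legal_actions:
        if n_sa.get((s, a), 0) == 0:
            return a
    return max(legal_actions,
               key=lambda a: q[(s, a)] + K * math.sqrt(math.log(n_s[s]) / n_sa[(s, a)]))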
With UCB as the tree policy: at the root both actions (a1, a2) have been tried, so the tree policy decides which branch to descend (current values: root 1/3, children 0 and 1/2).
The tree policy follows the more promising branch down through the tree.
The simulation below the tree reaches a terminal with reward 0 and a new node is added.
After the backup, the values along the visited path drop (root 1/4, selected child 1/3, with a grandchild at 0/1).
The next simulation below that branch reaches a terminal with reward 1.
After its backup, the values along the path rise again (root 2/5, children 1/2 and 1/3); the UCB rule keeps trading off these value estimates against exploration bonuses as the tree grows.
UCT Recap • To select an action at a state s • Build a tree using N iterations of Monte-Carlo tree search • Default policy is uniform random • Tree policy is based on the UCB rule • Select the action that maximizes Q(s,a) (note that this final action selection does not take the exploration term into account, just the Q-value estimate) • The more simulations, the more accurate the estimates
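For completeness, the final root selection under the same assumed data structures as the sketches above:

def best_root_action(root, legal_actions, q):
    # Final action choice at the root: highest Q-value estimate, no exploration bonus.
    return max(legal_actions, key=lambda a: q.get((root, a), float("-inf")))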
Some Successes • Computer Go • Klondike Solitaire (wins 40% of games) • General Game Playing Competition • Real-Time Strategy Games • Combinatorial Optimization • Crowd Sourcing • The list is growing • These usually extend UCT in some ways
Some Improvements • Use domain knowledge to handcraft a more intelligent default policy than random • E.g. don’t choose obviously stupid actions • In Go a hand-coded default policy is used • Learn a heuristic function to evaluate positions • Use the heuristic function to initialize leaf nodes (otherwise initialized to zero)
Practical Issues • Selecting K • There is no fixed rule • Experiment with different values in your domain • Rule of thumb – try values of K that are of the same order of magnitude as the reward signal you expect • The best value of K may depend on the number of iterations N • Usually a single value of K will work well for values of N that are large enough
Practical Issues • UCT can have trouble building deep trees when actions can result in a large # of possible next states • Each time we try action ‘a’ we get a new state (new leaf) and very rarely resample an existing leaf (Figure: action a at state s fans out to many distinct sampled children, so the tree stays shallow below it.)