Monte-Carlo Planning: Introduction and Bandit Basics Alan Fern
Large Worlds • We have considered basic model-based planning algorithms • Model-based planning: assumes MDP model is available • Methods we learned so far are at least poly-time in the number of states and actions • Difficult to apply to large state and action spaces (though this is a rich research area) • We will consider various methods for overcoming this issue
Approaches for Large Worlds • Planning with compact MDP representations • Define a language for compactly describing an MDP • MDP is exponentially larger than description • E.g. via Dynamic Bayesian Networks • Design a planning algorithm that directly works with that language • Scalability is still an issue • Can be difficult to encode the problem you care about in a given language • Study in last part of course
Approaches for Large Worlds • Reinforcement learning w/ function approx. • Have a learning agent directly interact with environment • Learn a compact description of policy or value function • Often works quite well for large problems • Doesn’t fully exploit a simulator of the environment when available • We will study reinforcement learning later in the course
Approaches for Large Worlds: Monte-Carlo Planning • Often a simulator of a planning domain is available or can be learned from data Fire & Emergency Response Klondike Solitaire
Large Worlds: Monte-Carlo Approach • Often a simulator of a planning domain is available or can be learned from data • Monte-Carlo Planning: compute a good policy for an MDP by interacting with an MDP simulator action World Simulator Real World State + reward
Example Domains with Simulators • Traffic simulators • Robotics simulators • Military campaign simulators • Computer network simulators • Emergency planning simulators • large-scale disaster and municipal • Sports domains (Madden Football) • Board games / Video games • Go / RTS In many cases Monte-Carlo techniques yield state-of-the-art performance, even in domains where model-based planners are applicable.
MDP: Simulation-Based Representation • A simulation-based representation gives: S, A, R, T, I: • finite state set S (|S|=n and is generally very large) • finite action set A (|A|=m and will assume is of reasonable size) • Stochastic, real-valued, bounded reward function R(s,a) = r • Stochastically returns a reward r given input s and a • Stochastic transition function T(s,a) = s’ (i.e. a simulator) • Stochastically returns a state s’ given input s and a • Probability of returning s’ is dictated by Pr(s’ | s,a) of MDP • Stochastic initial state function I. • Stochastically returns a state according to an initial state distribution These stochastic functions can be implemented in any language!
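For concreteness, here is a toy sketch (in Python) of what such a simulation-based representation can look like in code. The class name, dynamics, and reward values below are invented purely to illustrate the I, R, T interface; they do not correspond to any real domain.

```python
import random

class ToySimulator:
    """Illustrative simulation-based MDP with the interface described above:
    I() samples an initial state, R(s, a) returns a stochastic bounded reward,
    and T(s, a) samples a next state s' ~ Pr(. | s, a)."""

    def __init__(self, n_states=100, n_actions=4, seed=0):
        self.rng = random.Random(seed)
        self.n_states = n_states
        self.actions = list(range(n_actions))

    def I(self):
        # Stochastic initial state function
        return self.rng.randrange(self.n_states)

    def R(self, s, a):
        # Stochastic, bounded reward (here in [0, 3])
        return self.rng.uniform(0, 1) * (1 + (s + a) % 3)

    def T(self, s, a):
        # Stochastic transition function (one simulator step)
        return (s + a + self.rng.randrange(3)) % self.n_states
```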
Outline • Uniform Monte-Carlo • Single State Case (Uniform Bandit) • Policy rollout • Approximate Policy Iteration • Sparse Sampling • Adaptive Monte-Carlo • Single State Case (UCB Bandit) • UCT Monte-Carlo Tree Search
Single State Monte-Carlo Planning • Suppose MDP has a single state and k actions • Goal: figure out which action has best expected reward • Can sample rewards of actions using calls to simulator • Sampling action a is like pulling slot machine arm with random payoff function R(s,a) s ak a1 a2 … … R(s,ak) R(s,a2) R(s,a1) Multi-Armed Bandit Problem
PAC Bandit Objective: Informal • Probably Approximately Correct (PAC) • Select an arm that probably (w/ high probability) has approximately the best expected reward • Use as few simulator calls (or pulls) as possible to guarantee this s ak a1 a2 … … R(s,ak) R(s,a2) R(s,a1) Multi-Armed Bandit Problem
PAC Bandit Algorithms • Let k be the number of arms, Rmax be an upper bound on reward, and R* = maxa E[R(s,a)] (i.e. R* is the best arm in expectation) • Such an algorithm is efficient in terms of # of arm pulls, and is probably (with probability 1-δ) approximately correct (picks an arm with expected reward within ε of optimal). Definition (Efficient PAC Bandit Algorithm): An algorithm ALG is an efficient PAC bandit algorithm iff for any multi-armed bandit problem, for any 0<ε<1 and any 0<δ<1, ALG pulls a number of arms that is polynomial in 1/ε, 1/δ, k, and Rmax and returns an arm index j such that E[R(s,aj)] ≥ R* - ε with probability at least 1-δ
UniformBandit Algorithm (NaiveBandit from [Even-Dar et al., 2002]) • Pull each arm w times (uniform pulling). • Return the arm with the best average reward. Can we make this an efficient PAC bandit algorithm? s ak a1 a2 … r11 r12 … r1w r21 r22 … r2w rk1 rk2 … rkw
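A minimal sketch of UniformBandit in Python, assuming a simulator object with a stochastic reward method R(s, a) like the toy interface sketched earlier (the function and variable names are illustrative):

```python
def uniform_bandit(sim, s, actions, w):
    """Pull each arm w times and return the arm with the best average reward."""
    best_arm, best_avg = None, float("-inf")
    for a in actions:
        avg = sum(sim.R(s, a) for _ in range(w)) / w
        if avg > best_avg:
            best_arm, best_avg = a, avg
    return best_arm

# Example usage with the toy simulator sketched earlier:
# sim = ToySimulator(); s = sim.I(); a = uniform_bandit(sim, s, sim.actions, w=100)
```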
Aside: Additive Chernoff Bound • Let R be a random variable with maximum absolute value Z. And let ri, i=1,…,w be i.i.d. samples of R • The Chernoff bound gives a bound on the probability that the average of the ri is far from E[R]: Pr( |E[R] - (1/w)Σi ri| > ε ) ≤ exp( -w (ε/Z)² ) Equivalently: With probability at least 1 - δ we have that |E[R] - (1/w)Σi ri| ≤ Z sqrt( (1/w) ln(1/δ) )
Aside: Coin Flip Example • Suppose we have a coin with probability of heads equal to p. • Let X be a random variable where X=1 if the coin flip gives heads and zero otherwise. (so Z from bound is 1) • E[X] = 1*p + 0*(1-p) = p • After flipping a coin w times we can estimate the heads prob. by average of xi. • The Chernoff bound tells us that this estimate converges exponentially fast to the true mean (coin bias) p.
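A quick sketch of this example in code: it computes the sample size suggested by the bound above (with Z = 1) and estimates the bias from that many simulated flips. The specific p, ε, δ values are arbitrary choices for illustration.

```python
import math
import random

def coin_flip_estimate(p=0.3, eps=0.05, delta=0.01, seed=0):
    rng = random.Random(seed)
    # Sample size suggested by the additive Chernoff bound with Z = 1:
    # w = (1/eps)^2 * ln(1/delta)  =>  roughly 1.8k flips for these settings.
    w = math.ceil((1.0 / eps) ** 2 * math.log(1.0 / delta))
    flips = [1 if rng.random() < p else 0 for _ in range(w)]
    return w, sum(flips) / w  # with prob. >= 1 - delta the estimate is within eps of p

print(coin_flip_estimate())
```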
UniformBandit Algorithm (NaiveBandit from [Even-Dar et al., 2002]) • Pull each arm w times (uniform pulling). • Return the arm with the best average reward. How large must w be to provide a PAC guarantee? s ak a1 a2 … r11 r12 … r1w r21 r22 … r2w rk1 rk2 … rkw
UniformBandit PAC Bound • For a single bandit arm the Chernoff bound says: with probability at least 1 - δ we have that |E[R(s,a)] - (1/w)Σi ri| ≤ Rmax sqrt( (1/w) ln(1/δ) ) • Bounding the error by ε gives: Rmax sqrt( (1/w) ln(1/δ) ) ≤ ε, or equivalently w ≥ (Rmax/ε)² ln(1/δ) • Thus, using this many samples for a single arm will guarantee an ε-accurate estimate with probability at least 1 - δ
UniformBandit PAC Bound • So we see that with w = (Rmax/ε)² ln(1/δ') samples per arm, there is no more than a δ' probability that an individual arm's estimate will not be ε-accurate • But we want to bound the probability of any arm being inaccurate. The union bound says that for k events, the probability that at least one event occurs is bounded by the sum of the individual probabilities • Using the above # samples per arm and the union bound (with events being "arm i is not ε-accurate") there is no more than a kδ' probability of any arm not being ε-accurate • Setting δ' = δ/k, all arms are ε-accurate with prob. at least 1 - δ
UniformBandit PAC Bound Putting everything together we get: If w ≥ (Rmax/ε)² ln(k/δ) then |E[R(s,ai)] - (1/w)Σj rij| ≤ ε for all arms simultaneously with probability at least 1 - δ • That is, estimates of all actions are ε-accurate with probability at least 1-δ • Thus selecting the estimate with the highest value is approximately optimal with high probability, or PAC
# Simulator Calls for UniformBandit s ak a1 a2 … … R(s,ak) R(s,a2) R(s,a1) • Total simulator calls for PAC: k·w = O( (k Rmax²/ε²) ln(k/δ) ) • So we have an efficient PAC algorithm • Can get rid of the ln(k) term with a more complex algorithm [Even-Dar et al., 2002].
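A small helper that computes the per-arm sample size and total simulator calls implied by the bound above; this is a sketch that ignores constant factors and lower-order terms, so treat the numbers as order-of-magnitude guidance.

```python
import math

def uniform_bandit_budget(k, r_max, eps, delta):
    """Per-arm pulls w and total pulls k*w for the UniformBandit PAC guarantee,
    using the union bound (delta/k failure probability per arm)."""
    w = math.ceil((r_max / eps) ** 2 * math.log(k / delta))
    return w, k * w

# Example: 10 arms, rewards in [0, 1], eps = 0.1, delta = 0.05
print(uniform_bandit_budget(k=10, r_max=1.0, eps=0.1, delta=0.05))
```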
Outline • Uniform Monte-Carlo • Single State Case (Uniform Bandit) • Policy rollout • Approximate Policy Iteration • Sparse Sampling • Adaptive Monte-Carlo • Single State Case (UCB Bandit) • UCT Monte-Carlo Tree Search
Policy Improvement via Monte-Carlo • Now consider a very large multi-state MDP. • Suppose we have a simulator and a non-optimal policy • E.g. the policy could be a standard heuristic or based on intuition • Can we somehow compute an improved policy? World Simulator + Base Policy action Real World State + reward
Recall: Policy Improvement Theorem • The Q-value function Qπ(s,a) of a policy π gives the expected discounted future reward of starting in state s, taking action a, and then following policy π thereafter • Define: π'(s) = argmaxa Qπ(s,a) • Theorem [Howard, 1960]: For any non-optimal policy π, the policy π' is a strict improvement over π. • Computing π' amounts to finding the action that maximizes the Q-function of π • Can we use the bandit idea to solve this?
Policy Improvement via Bandits s ak a1 a2 … SimQ(s,a1,π) SimQ(s,a2,π) SimQ(s,ak,π) • Idea: define a stochastic function SimQ(s,a,π) that we can implement and whose expected value is Qπ(s,a) • Then use Bandit algorithm to PAC select an improved action How to implement SimQ?
Q-value Estimation • SimQ might be implemented by simulating the execution of action a in state s and then following π thereafter • But for infinite horizon problems this would never finish • So we will approximate via a finite horizon • The h-horizon Q-function Qπ(s,a,h) is defined as: the expected total discounted reward of starting in state s, taking action a, and then following policy π for h-1 steps • The approximation error decreases exponentially fast in h (you will show this in homework)
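To make "exponentially fast" concrete, a standard truncation argument (assuming rewards bounded in absolute value by Rmax and discount factor β < 1) bounds the gap as follows:

```latex
\bigl| Q^{\pi}(s,a) - Q^{\pi}(s,a,h) \bigr|
  \;\le\; \sum_{t \ge h} \beta^{t} R_{\max}
  \;=\; \frac{\beta^{h} R_{\max}}{1-\beta}
```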
Policy Improvement via Bandits s ak a1 a2 … SimQ(s,a1,π,h) SimQ(s,a2,π,h) SimQ(s,ak,π,h) • Refined Idea: define a stochastic function SimQ(s,a,π,h) that we can implement, whose expected value is Qπ(s,a,h) • Use Bandit algorithm to PAC select an improved action How to implement SimQ?
Policy Improvement via Bandits
SimQ(s, a, π, h):
    r = R(s, a); s = T(s, a)            // simulate a in s
    for i = 1 to h-1:                   // simulate h-1 steps of policy
        r = r + β^i · R(s, π(s))
        s = T(s, π(s))
    return r
• Simply simulate taking a in s and following the policy for h-1 steps, returning the discounted sum of rewards
• The expected value of SimQ(s,a,π,h) is Qπ(s,a,h), which can be made arbitrarily close to Qπ(s,a) by increasing h
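A direct Python rendering of the SimQ pseudocode above, assuming a simulator object with R and T methods like the toy interface sketched earlier and a policy given as a plain function from states to actions; the β = 0.95 default is an arbitrary illustrative choice.

```python
def sim_q(sim, s, a, policy, h, beta=0.95):
    """One Monte-Carlo sample whose expected value is the h-horizon Q-value Q^pi(s, a, h)."""
    r = sim.R(s, a)              # simulate a in s
    s = sim.T(s, a)
    for i in range(1, h):        # simulate h-1 steps of the policy
        a_pi = policy(s)
        r += (beta ** i) * sim.R(s, a_pi)
        s = sim.T(s, a_pi)
    return r
```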
Policy Improvement via Bandits
SimQ(s, a, π, h):
    r = R(s, a); s = T(s, a)            // simulate a in s
    for i = 1 to h-1:                   // simulate h-1 steps of policy
        r = r + β^i · R(s, π(s))
        s = T(s, π(s))
    return r
From state s, each action ai starts a trajectory under π; the sum of (discounted) rewards along that trajectory is SimQ(s,ai,π,h)
Policy Rollout Algorithm Rollout[π,h,w](s) • For each ai run SimQ(s,ai,π,h) w times • Return the action with the best average of the SimQ results s ak a1 a2 … SimQ(s,ai,π,h) trajectories Each simulates taking action ai then following π for h-1 steps. q11 q12 … q1w q21 q22 … q2w qk1 qk2 … qkw Samples of SimQ(s,ai,π,h)
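A sketch of Rollout[π,h,w] built on the sim_q sketch above: average w SimQ samples per action and return the action with the best average (the names are illustrative).

```python
def rollout(sim, s, actions, policy, h, w, beta=0.95):
    """Approximate one step of policy improvement at state s."""
    def avg_q(a):
        return sum(sim_q(sim, s, a, policy, h, beta) for _ in range(w)) / w
    return max(actions, key=avg_q)
```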
Policy Rollout: # of Simulator Calls s ak a1 a2 … SimQ(s,ai,π,h) trajectories Each simulates taking action ai then following π for h-1 steps. • For each action, w calls to SimQ, each using h sim calls • Total of khw calls to the simulator
Policy Rollout Quality • Let a* be the action that maximizes the true Q-function Qπ(s,a). • Let a' be the action returned by Rollout[π,h,w](s). • Putting the PAC bandit result together with the finite horizon approximation we can derive the following: if w and h are chosen as in the UniformBandit analysis (e.g. w ≥ (Rmax/ε)² ln(k/δ), with h large enough that the truncation error is small relative to ε), then with probability at least 1-δ, Qπ(s,a') ≥ Qπ(s,a*) - ε. But does this guarantee that the value of Rollout[π,h,w](s) will be close to the value of π'?
Policy Rollout: Quality • How good is Rollout[π,h,w] compared to π’? • Bad News. In general for a fixed h and w there is always an MDP such that the quality of the rollout policy is arbitrarily worse than π’. • The example MDP is somewhat involved, but shows that even small error in Q-value estimates can lead to large performance gaps compared to π’ • But this result is quite pathological
Policy Rollout: Quality • How good is Rollout[π,h,w] compared to π’? • Good News. If we make an assumption about the MDP, then it is possible to select h and w so that the rollout quality is close to π’. • This is a bit involved. • Assume a lower bound on the difference between the best Q-value and the second best Q-value • More Good News. It is possible to select h and w so that Rollout[π,h,w] is (approximately) no worse than π for any MDP • So at least rollout won’t hurt compared to the base policy • At the same time it has the potential to significantly help
Multi-Stage Rollout • A single call to Rollout[π,h,w](s) approximates one iteration of policy iteration starting at policy π • But only computes the action for state s rather than all states (as done by full policy iteration) • We can use more computation time to approximate multiple iterations of policy iteration via nesting calls to Rollout • Gives a way to use more time in order to improve performance
Multi-Stage Rollout s Each step requires khw simulator calls for the Rollout policy ak a1 a2 … Trajectories of SimQ(s,ai,Rollout[π,h,w],h) • Two stage: compute the rollout policy of the "rollout policy of π" • Requires (khw)² calls to the simulator for 2 stages • In general, exponential in the number of stages
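A sketch of how the nesting might look in code, building on the earlier sketches: the one-stage rollout is wrapped as an ordinary policy function and then used as the base policy of another rollout, so each two-stage decision costs roughly (khw)² simulator calls.

```python
def rollout_policy(sim, actions, base_policy, h, w, beta=0.95):
    """Wrap Rollout[base_policy, h, w] as a policy, i.e. a function of the state."""
    return lambda s: rollout(sim, s, actions, base_policy, h, w, beta)

def two_stage_rollout(sim, s, actions, pi, h, w, beta=0.95):
    """Rollout of the rollout policy of pi (roughly (k*h*w)**2 simulator calls)."""
    level_one = rollout_policy(sim, actions, pi, h, w, beta)
    return rollout(sim, s, actions, level_one, h, w, beta)
```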
Rollout Summary • We often are able to write simple, mediocre policies • Network routing policy • Policy for card game of Hearts • Policy for game of Backgammon • Solitaire playing policy • Policy rollout is a general and easy way to improve upon such policies given a simulator • Often observe substantial improvement, e.g. • Compiler instruction scheduling • Backgammon • Network routing • Combinatorial optimization • Game of GO • Solitaire
Example: Rollout for Solitaire [Yan et al., NIPS'04] • Multiple levels of rollout can pay off but are expensive. Can we somehow get the benefit of multiple levels without the complexity?
Outline • Uniform Monte-Carlo • Single State Case (Uniform Bandit) • Policy rollout • Approximate Policy Iteration • Sparse Sampling • Adaptive Monte-Carlo • Single State Case (UCB Bandit) • UCT Monte-Carlo Tree Search
Approximate Policy Iteration: Main Idea • Nested rollout is expensive because the “base policies” (i.e. nested rollouts themselves) are expensive • Suppose that we could approximate a level-one rollout policy with a very fast function (e.g. O(1) time) • Then we could approximate a level-two rollout policy while paying only the cost of level-one rollout • Repeatedly applying this idea leads to approximate policy iteration
Return to Policy Iteration • Policy iteration: compute Vπ at all states for the current policy π, then choose the best action at each state to obtain an improved policy π' • Approximate policy iteration: • Only computes values and improved actions at some states. • Uses those to infer a fast, compact policy over all states.
Sample p’ trajectories using rollout Approximate Policy Iteration technically rollout only approximates π’. p’ trajectories Learn fast approximation of p’ p Current Policy p’ • Generate trajectories of rollout policy (starting state of each trajectory is drawn from initial state distribution I) • “Learn a fast approximation” of rollout policy • Loop to step 1 using the learned policy as the base policy What do we mean by generate trajectories?
Generating Rollout Trajectories • Get trajectories of the current rollout policy from an initial state (a random draw from I): at each visited state s, run policy rollout over the actions a1, a2, …, ak to choose the next action
Generating Rollout Trajectories • Get trajectories of the current rollout policy from an initial state • Multiple trajectories differ since the initial state and transitions are stochastic
Generating Rollout Trajectories • Get trajectories of the current rollout policy from an initial state • Results in a set of state-action pairs giving the action selected by the "improved policy" in the states that it visits: {(s1,a1), (s2,a2),…,(sn,an)}
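A sketch of trajectory generation under these assumptions: draw a start state from I, act with the rollout policy, and record the visited (state, action) pairs. The trajectory count and length are arbitrary illustrative parameters.

```python
def collect_rollout_data(sim, actions, pi, h, w, n_traj=20, traj_len=50, beta=0.95):
    """Return a training set {(s_i, a_i)} of states visited by the rollout policy
    and the actions it selected there."""
    improved = rollout_policy(sim, actions, pi, h, w, beta)
    data = []
    for _ in range(n_traj):
        s = sim.I()              # random draw from the initial state distribution
        for _ in range(traj_len):
            a = improved(s)      # action chosen by the (approximate) improved policy
            data.append((s, a))
            s = sim.T(s, a)
    return data
```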
Sample p’ trajectories using rollout Approximate Policy Iteration technically rollout only approximates π’. p’ trajectories Learn fast approximation of p’ p Current Policy p’ • Generate trajectories of rollout policy (starting state of each trajectory is drawn from initial state distribution I) • “Learn a fast approximation” of rollout policy • Loop to step 1 using the learned policy as the base policy What do we mean by “learn an approximation”?
Aside: Classifier Learning • A classifier is a function that labels inputs with class labels. • "Learning" classifiers from training data is a well-studied problem (decision trees, support vector machines, neural networks, etc.). Training Data {(x1,c1),(x2,c2),…,(xn,cn)} Example problem: xi is an image of a face, ci ∈ {male, female} Learning Algorithm → Classifier H : X → C
Aside: Control Policies are Classifiers • A control policy maps states and goals to actions: π : states → actions Training Data {(s1,a1), (s2,a2),…,(sn,an)} → Learning Algorithm → Classifier/Policy π : states → actions
Sample p’ trajectories using rollout Approximate Policy Iteration p’ training data {(s1,a1), (s2,a2),…,(sn,an)} Learn classifier to approximate p’ p Current Policy p’ • Generate trajectories of rollout policy Results in training set of state-action pairs along trajectories T = {(s1,a1), (s2,a2),…,(sn,an)} • Learn a classifier based on T to approximate rollout policy • Loop to step 1 using the learned policy as the base policy
Sample p’ trajectories using rollout Approximate Policy Iteration p’ training data {(s1,a1), (s2,a2),…,(sn,an)} Learn classifier to approximate p’ p Current Policy p’ • The hope is that the learned classifier will capture the general structure of improved policy from examples • Want classifier to quickly select correct actions in states outside of training data (classifier should generalize) • Approach allows us to leverage large amounts of work in machine learning
API for Inverted Pendulum Consider the problem of balancing a pole by applying either a positive or negative force to the cart. The state space is described by the velocity of the cart and the angle of the pendulum. There is noise in the force that is applied, so the problem is stochastic.