Approximate Dynamic Programming Presented by Yu-Shun, Wang
Agenda • Introduction • Background • Policy Evaluation Algorithms • General Issues of Cost Approximation • Approximate Policy Iteration • Direct and Indirect Approximation • The Role of Monte Carlo Simulation • Direct Policy Evaluation - Gradient Methods • Batch Gradient Methods for Policy Evaluation • Incremental Gradient Methods for Policy Evaluation • Comparison with our approach
Background • A principal aim of these methods is to address problems with a very large number of states n. • Another aim of the methods of this chapter is to address model-free situations. • i.e., problems where a mathematical model is unavailable or hard to construct. • The system and cost structure may be simulated. • For example, a queuing network with complicated but well-defined service disciplines.
Background • The assumption here is that: • There is a computer program that simulates, for a given control u, the probabilistic transitions from any given state i to a successor state j according to the transition probabilities pij(u). • It also generates a corresponding transition cost g(i, u, j). • It may be possible to use repeated simulation to calculate (at least approximately) the transition probabilities of the system and the expected stage costs by averaging.
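As a concrete illustration of such a simulator, the sketch below samples a successor state and its transition cost. The data structures P and g and the function name are assumptions made here for illustration, not part of the original slides.

```python
import random

def simulate_transition(i, u, P, g):
    """Sample a successor state j of state i under control u, using the
    transition probabilities pij(u) stored in P[i][u], and return j
    together with the stage cost g(i, u, j)."""
    probs = P[i][u]  # probabilities over successor states
    j = random.choices(range(len(probs)), weights=probs, k=1)[0]
    return j, g(i, u, j)
```

Repeated calls to such a routine can also be averaged to estimate the transition probabilities and expected stage costs, as noted above.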
Background • We will aim to approximate the cost function of a given policy or even the optimal cost-to-go function by generating one or more simulated system trajectories and associated costs. • In another type of method, which we will discuss only briefly, we use a gradient method and simulation data to approximate directly an optimal policy.
Agenda • Introduction • Background • Policy Evaluation Algorithms • General Issues of Cost Approximation • Approximate Policy Iteration • Direct and Indirect Approximation • The Role of Monte Carlo Simulation • Direct Policy Evaluation - Gradient Methods • Batch Gradient Methods for Policy Evaluation • Incremental Gradient Methods for Policy Evaluation • Comparison with our approach
Policy Evaluation Algorithms • With this class of methods, we aim to approximate the cost function Jμ(i) of a policy μ with a parametric architecture of the form J̃(i, r), where r is a parameter vector. • Alternatively, the approximation may be used to construct an approximate cost-to-go function of a single suboptimal/heuristic policy, for use with one-step or multistep lookahead.
Policy Evaluation Algorithms • We focus primarily on two types of methods. In the first class, called direct, we use simulation to collect samples of costs for various initial states and fit the architecture to these samples. • The second, and currently more popular, class of methods is called indirect. Here we obtain the parameter vector r by solving a projected version of Bellman's equation: Φr = ΠTμ(Φr), where Φ is the matrix of basis functions and Π denotes projection onto the approximation subspace.
Agenda • Introduction • Background • Policy Evaluation Algorithms • General Issues of Cost Approximation • Approximate Policy Iteration • Direct and Indirect Approximation • The Role of Monte Carlo Simulation • Direct Policy Evaluation - Gradient Methods • Batch Gradient Methods for Policy Evaluation • Incremental Gradient Methods for Policy Evaluation • Comparison with our approach
Approximate Policy Iteration • Suppose that the current policy is μ, and for a given r, J̃(i, r) is an approximation of Jμ(i). We generate an “improved” policy μ̄ using the formula: μ̄(i) = arg min over u ∈ U(i) of Σ_j pij(u) ( g(i, u, j) + α J̃(j, r) ), for all i. • When the sequence of policies obtained actually converges to some μ̄, then it can be proved that μ̄ is optimal to within 2αε/(1 − α), where ε is a bound on the cost approximation error max_i |J̃(i, r) − Jμ(i)| over the policies generated, and α is the discount factor.
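A minimal sketch of the one-step lookahead improvement above, assuming access to the model quantities pij(u) and g(i, u, j) and to the approximation J̃(·, r); all names here are illustrative only.

```python
def improved_policy_at(i, controls, P, g, J_tilde, r, alpha):
    """Return the control minimizing
    sum_j pij(u) * (g(i, u, j) + alpha * J_tilde(j, r)),
    i.e., the action chosen by the improved policy at state i."""
    def q_value(u):
        return sum(P[i][u][j] * (g(i, u, j) + alpha * J_tilde(j, r))
                   for j in range(len(P[i][u])))
    return min(controls, key=q_value)
```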
Approximate Policy Iteration • Block diagram of approximate policy iteration
Approximate Policy Iteration • A simulation-based implementation of the algorithm is illustrated in the following figure. It consists of four modules: • The simulator, which, given a state-control pair (i, u), generates the next state j according to the system’s transition probabilities. • The decision generator, which generates the control μ̄(i) of the improved policy at the current state i for use in the simulator. • The cost-to-go approximator, which is the function J̃(j, r) that is used by the decision generator. • The cost approximation algorithm, which accepts as input the output of the simulator and obtains the approximation J̃(·, r̄) of the cost of μ̄.
Approximate Policy Iteration • There are two policies μ and μ̄, and correspondingly two parameter vectors r and r̄, involved in this algorithm. • In particular, r corresponds to the current policy μ, and the approximation J̃(·, r) is used in the policy improvement to generate the new policy μ̄. • At the same time, μ̄ drives the simulation that generates the samples that determine the parameter r̄, which will be used in the next policy iteration.
Approximate Policy Iteration • Figure: simulation-based implementation with four interacting modules: system simulator, decision generator, cost-to-go approximator, and cost approximation algorithm.
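The interaction of the four modules can be sketched as a simple loop. Everything below (function names, the way trajectories are passed around) is an assumption made only for illustration.

```python
def approximate_policy_iteration(r0, improve, simulate, fit, num_iterations):
    """Sketch of simulation-based approximate policy iteration:
    - improve(r): decision generator + cost-to-go approximator; returns the
      improved policy mu_bar defined by one-step lookahead on J_tilde(., r)
    - simulate(mu_bar): system simulator; returns sampled states and costs
      generated under mu_bar
    - fit(samples): cost approximation algorithm; returns the new parameter
      vector r_bar fitted to the sampled costs
    """
    r = r0
    for _ in range(num_iterations):
        mu_bar = improve(r)        # policy improvement via lookahead
        samples = simulate(mu_bar) # trajectories under the new policy
        r = fit(samples)           # evaluate the new policy approximately
    return r
```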
Agenda • Introduction • Background • Policy Evaluation Algorithms • General Issues of Cost Approximation • Approximate Policy Iteration • Direct and Indirect Approximation • The Role of Monte Carlo Simulation • Direct Policy Evaluation - Gradient Methods • Batch Gradient Methods for Policy Evaluation • Incremental Gradient Methods for Policy Evaluation • Comparison with our approach
Direct and Indirect Approximation • An important generic difficulty with simulation-based policy iteration is that evaluating a policy requires cost samples generated under that policy, which biases the simulation: states that are unlikely to occur under the policy are underrepresented. • As a result, the cost-to-go estimates of these underrepresented states may be highly inaccurate, causing serious errors in the calculation of the improved control policy. • This difficulty is known as inadequate exploration of the system’s dynamics because of the use of a fixed policy.
Direct and Indirect Approximation • One possibility for adequate exploration is to frequently restart the simulation and to ensure that the initial states employed form a rich and representative subset. • A related approach, called iterative resampling, is to derive an initial cost evaluation of μ, simulate the next policy obtained on the basis of this initial evaluation to obtain a set of representative states S • And repeat the evaluation of μ using additional trajectories initiated from S.
Direct and Indirect Approximation • The most straightforward algorithmic approach for approximating the cost function is the direct one. • It finds an approximation J̃ ∈ S that best matches Jμ in some normed error sense, i.e., it solves min over J̃ ∈ S of ||J̃ − Jμ||, or equivalently min over r of ||Φr − Jμ||, where S = {Φr | r ∈ ℜ^s} is the approximation subspace. • Here, ||·|| is usually some (weighted) Euclidean norm.
Direct and Indirect Approximation • If the matrix Φ has linearly independent columns, the solution is unique and can also be represented as Φr* = ΠJμ, where Π denotes projection with respect to ||·|| onto the subspace S. • A major difficulty is that specific cost function values Jμ(i) can only be estimated through their simulation-generated cost samples.
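For a linear architecture and a weighted Euclidean norm, the direct (projection) problem min over r of ||Φr − Jμ|| has the closed-form solution sketched below. Phi, xi, and J_mu are placeholder names, and in practice Jμ would be replaced by simulation-generated cost samples rather than being available exactly.

```python
import numpy as np

def weighted_projection(Phi, xi, J_mu):
    """Solve min_r ||Phi r - J_mu||_xi, where ||x||_xi^2 = sum_i xi[i]*x[i]**2.
    Requires the n x s matrix Phi to have linearly independent columns."""
    Xi = np.diag(xi)
    r = np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi @ J_mu)
    return r, Phi @ r  # parameter vector r and the projection Pi*J_mu = Phi r
```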
Direct and Indirect Approximation • An alternative approach, referred to as indirect, is to approximate the solution of Bellman’s equation J = TμJ on the subspace S, i.e., to solve Φr = ΠTμ(Φr). • We can view this equation as a projected form of Bellman’s equation.
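When Tμ is affine, TμJ = g + αPJ (as in the discounted case considered here), and the architecture is linear, the projected equation Φr = ΠTμ(Φr) reduces to a small linear system. The sketch below is an illustration under those assumptions; the matrix names are hypothetical.

```python
import numpy as np

def solve_projected_bellman(Phi, xi, P, g, alpha):
    """Solve Phi r = Pi T_mu(Phi r), with T_mu J = g + alpha * P @ J and
    Pi the xi-weighted projection onto the column space of Phi.
    Equivalent to the s x s system C r = d with
        C = Phi.T @ Xi @ (I - alpha*P) @ Phi   and   d = Phi.T @ Xi @ g."""
    n = P.shape[0]
    Xi = np.diag(xi)
    C = Phi.T @ Xi @ (np.eye(n) - alpha * P) @ Phi
    d = Phi.T @ Xi @ g
    return np.linalg.solve(C, d)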
Direct and Indirect Approximation • Solving projected equations as approximations to more complex/higher-dimensional equations has a long history in scientific computation in the context of Galerkin methods. • The use of the Monte Carlo simulation ideas that are central in approximate DP is an important characteristic that differentiates the methods from the Galerkin methods.
Agenda • Introduction • Background • Policy Evaluation Algorithms • General Issues of Cost Approximation • Approximate Policy Iteration • Direct and Indirect Approximation • The Role of Monte Carlo Simulation • Direct Policy Evaluation - Gradient Methods • Batch Gradient Methods for Policy Evaluation • Incremental Gradient Methods for Policy Evaluation • Comparison with our approach
The Role of Monte Carlo Simulation • The methods of this chapter rely to a large extent on simulation, in conjunction with cost function approximation, in order to deal with large state spaces. • The advantage that simulation holds in this regard can be traced to its ability to compute (approximately) sums with a very large number of terms.
The Role of Monte Carlo Simulation • Example: Approximate Policy Evaluation • Consider the approximate solution of the Bellman equation that corresponds to a given policy of an n-state discounted problem: J = g + αPJ, where g is the vector of expected stage costs, P is the transition probability matrix, and α is the discount factor. • Let us adopt a hard aggregation approach whereby we divide the n states into two disjoint subsets I1 and I2 with I1 ∪ I2 = {1, . . . , n}, and we use the piecewise constant approximation: J(i) = r1 if i ∈ I1, and J(i) = r2 if i ∈ I2.
The Role of Monte Carlo Simulation • Example: Approximate Policy Evaluation (cont.) • This corresponds to the linear feature-based architecture J ≈ Φr, where Φ is the n × 2 matrix whose column components are equal to 1 or 0, depending on whether the corresponding state belongs to I1 or I2. • Substituting into the Bellman equation, we obtain the approximate equations: J(i) ≈ g(i) + α Σ_{j∈I1} pij r1 + α Σ_{j∈I2} pij r2, for i = 1, . . . , n.
The Role of Monte Carlo Simulation • Example: Approximate Policy Evaluation (cont.) • We can reduce these n equations to just two by forming two weighted sums (with equal weights) of the equations corresponding to the states in I1 and I2, respectively: r1 ≈ (1/n1) Σ_{i∈I1} J(i) and r2 ≈ (1/n2) Σ_{i∈I2} J(i), where n1 and n2 are the numbers of states in I1 and I2. We thus obtain the aggregate system of the following two equations in r1 and r2:
r1 = (1/n1) Σ_{i∈I1} g(i) + α (1/n1) Σ_{i∈I1} Σ_{j∈I1} pij r1 + α (1/n1) Σ_{i∈I1} Σ_{j∈I2} pij r2,
r2 = (1/n2) Σ_{i∈I2} g(i) + α (1/n2) Σ_{i∈I2} Σ_{j∈I1} pij r1 + α (1/n2) Σ_{i∈I2} Σ_{j∈I2} pij r2.
The Role of Monte Carlo Simulation • Example: Approximate Policy Evaluation (cont.) • Here the challenge, when the number of states n is very large, is the calculation of the large sums on the right-hand side, which can be of order O(n²). • Simulation allows the approximate calculation of these sums with complexity that is independent of n. • This is similar to the advantage that Monte Carlo integration holds over numerical integration.
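A small sketch of this example follows. The array names and the uniform-sampling scheme are assumptions: the first function forms and solves the exact 2 × 2 aggregate system, and the second estimates one row of it by sampling, so that the required number of samples does not grow with n (here rng.choice with an explicit probability vector merely stands in for a simulator that would draw j directly).

```python
import numpy as np

def aggregate_system(P, g, alpha, I1, I2):
    """Form and solve the exact 2x2 aggregate system in (r1, r2) obtained by
    averaging, with equal weights, the approximate equations over I1 and I2."""
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for row, I in enumerate((I1, I2)):
        nI = len(I)
        b[row] = sum(g[i] for i in I) / nI
        A[row, 0] = alpha * sum(P[i, j] for i in I for j in I1) / nI
        A[row, 1] = alpha * sum(P[i, j] for i in I for j in I2) / nI
    return np.linalg.solve(np.eye(2) - A, b)  # solve r = b + A r

def estimate_row(P, g, alpha, I, I1, I2, num_samples, rng):
    """Monte Carlo estimate of one row (b, A[row,0], A[row,1]) of the
    aggregate system: sample i uniformly from I and then j ~ pij."""
    I1_set, I2_set = set(I1), set(I2)
    b_hat = a1_hat = a2_hat = 0.0
    for _ in range(num_samples):
        i = int(rng.choice(I))
        j = int(rng.choice(P.shape[1], p=P[i]))
        b_hat += g[i]
        a1_hat += alpha * (j in I1_set)
        a2_hat += alpha * (j in I2_set)
    return b_hat / num_samples, a1_hat / num_samples, a2_hat / num_samples
```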
Agenda • Introduction • Background • Policy Evaluation Algorithms • General Issues of Cost Approximation • Approximate Policy Iteration • Direct and Indirect Approximation • The Role of Monte Carlo Simulation • Direct Policy Evaluation - Gradient Methods • Batch Gradient Methods for Policy Evaluation • Incremental Gradient Methods for Policy Evaluation • Comparison with our approach
Batch Gradient Methods for Policy Evaluation • Suppose that the current policy is μ, and for a given r, J̃(i, r) is an approximation of Jμ(i). We generate an “improved” policy μ̄ using the formula: μ̄(i) = arg min over u ∈ U(i) of Σ_j pij(u) ( g(i, u, j) + α J̃(j, r) ), for all i. • To evaluate approximately Jμ̄, we select a subset of “representative” states S (obtained by simulation), and for each i ∈ S, we obtain M(i) samples of the cost Jμ̄(i). • The mth such sample is denoted by c(i, m), and it can be viewed as Jμ̄(i) plus some simulation error.
Batch Gradient Methods for Policy Evaluation • We obtain the corresponding parameter vector r by solving the following least squares problem: min over r of Σ_{i∈S} Σ_{m=1}^{M(i)} ( J̃(i, r) − c(i, m) )². • The above problem can be solved exactly, as a linear least squares problem, if a linear approximation architecture is used. • However, when a nonlinear architecture is used, we may use gradient-like methods for solving the least squares problem.
Batch Gradient Methods for Policy Evaluation • Let us focus on an N-transition portion (i0, . . . , iN) of a simulated trajectory, also called a batch. We view the numbers
ck = Σ_{t=k}^{N−1} α^{t−k} g(it, μ̄(it), it+1), k = 0, . . . , N − 1,
as cost samples, one per initial state i0, . . . , iN−1, which can be used for least squares approximation of the parametric architecture J̃(i, r): min over r of Σ_{k=0}^{N−1} (1/2) ( J̃(ik, r) − ck )².
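The cost samples ck above can be computed with a single backward pass over the batch's stage costs gt = g(it, μ̄(it), it+1); a small sketch (names are illustrative):

```python
def cost_samples(stage_costs, alpha):
    """Return c_k = sum_{t=k}^{N-1} alpha**(t-k) * stage_costs[t] for each k,
    computed by the backward recursion c_k = stage_costs[k] + alpha*c_{k+1}."""
    c = [0.0] * len(stage_costs)
    acc = 0.0
    for k in reversed(range(len(stage_costs))):
        acc = stage_costs[k] + alpha * acc
        c[k] = acc
    return c
```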
Batch Gradient Methods for Policy Evaluation • One way to solve this least squares problem is to use a gradient method, whereby the parameter r associated with μ̄ is updated at time N by
r := r − γ Σ_{k=0}^{N−1} ∇J̃(ik, r) ( J̃(ik, r) − ck ).
• Here, ∇ denotes gradient with respect to r, and γ is a positive stepsize, which is usually diminishing over time (we leave its precise choice open for the moment).
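One batch gradient iteration, as described above, might look as follows. J_tilde and grad_J_tilde are assumed callables for the architecture and its gradient with respect to r; this is a sketch, not the text's implementation.

```python
import numpy as np

def batch_gradient_step(r, states, costs, J_tilde, grad_J_tilde, gamma):
    """Perform r := r - gamma * sum_k grad_J(i_k, r) * (J_tilde(i_k, r) - c_k),
    with every gradient evaluated at the pre-update value of r."""
    correction = np.zeros_like(r)
    for i_k, c_k in zip(states, costs):
        correction += grad_J_tilde(i_k, r) * (J_tilde(i_k, r) - c_k)
    return r - gamma * correction
```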
Batch Gradient Methods for Policy Evaluation • The update of r is done after processing the entire batch, and the gradients ∇J̃(ik, r) are evaluated at the preexisting value of r, i.e., the one before the update. • In a traditional gradient method, the gradient iteration is repeated until convergence to the solution.
Batch Gradient Methods for Policy Evaluation • However, there is an important tradeoff relating to the size N of the batch: • In order to reduce simulation error and generate multiple cost samples for a representatively large subset of states, it is necessary to use a large N, • Yet to keep the work per gradient iteration small, it is necessary to use a small N.
Batch Gradient Methods for Policy Evaluation • To address the issue of the batch size N, batches may be changed after one or more iterations. • Thus, the N-transition batch may come from a potentially longer simulated trajectory, or from one of many simulated trajectories. • We leave the method for generating simulated trajectories and forming batches open for the moment. • But we note that it strongly influences the result of the corresponding least squares optimization.
Agenda • Introduction • Background • Policy Evaluation Algorithms • General Issues of Cost Approximation • Approximate Policy Iteration • Direct and Indirect Approximation • The Role of Monte Carlo Simulation • Direct Policy Evaluation - Gradient Methods • Batch Gradient Methods for Policy Evaluation • Incremental Gradient Methods for Policy Evaluation • Comparison with our approach
Incremental Gradient Methods for Policy Evaluation • We now consider a variant of the gradient method called incremental. This method can also be described through the use of N-transition batches. • But we will see that the method is suitable for use with a single very long simulated trajectory, viewed as a single batch.
Incremental Gradient Methods for Policy Evaluation • For a given N-transition batch (i0, . . . , iN), the batch gradient method processes the N transitions all at once and updates r once. • The incremental method instead updates r a total of N times, once after each transition. • Each time, it adds to r the corresponding portion of the gradient that can be calculated using the newly available simulation data.
Incremental Gradient Methods for Policy Evaluation • Thus, after each transition (ik, ik+1): • We evaluate the gradient ∇J̃(ik, r) at the current value of r. • We sum all the terms of the batch gradient that involve the transition (ik, ik+1), and we update r by making a correction along their sum:
r := r − γ ( ∇J̃(ik, r) J̃(ik, r) − ( Σ_{t=0}^{k} α^{k−t} ∇J̃(it, r) ) g(ik, μ̄(ik), ik+1) ).
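A sketch of processing one batch incrementally, maintaining the discounted sum of past gradients recursively; all names are assumptions and the update follows the correction written above.

```python
import numpy as np

def incremental_gradient_pass(r, states, stage_costs, J_tilde, grad_J_tilde,
                              alpha, gamma):
    """Process one batch transition by transition.  After the k-th transition,
    with stage cost g_k, update
        r := r - gamma * (grad_J(i_k, r) * J_tilde(i_k, r) - z * g_k),
    where z = sum_{t<=k} alpha**(k-t) * grad_J(i_t, r_t) is maintained
    recursively and each gradient uses the then-current value of r."""
    z = np.zeros_like(r)
    for i_k, g_k in zip(states, stage_costs):
        grad_k = grad_J_tilde(i_k, r)
        z = alpha * z + grad_k
        r = r - gamma * (grad_k * J_tilde(i_k, r) - z * g_k)
    return r
```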
Incremental Gradient Methods for Policy Evaluation • By adding the “incremental” correction in the above iteration, we see that after N transitions, all the terms of the batch iteration will have been accumulated. • But there is a difference: • In the incremental version, r is changed during the processing of the batch, and the gradient ∇J̃(it, r) is evaluated at the most recent value of r [i.e., the value prevailing after the transition (it, it+1)].
Incremental Gradient Methods for Policy Evaluation • By contrast, in the batch version these gradients are evaluated at the value of r prevailing at the end of the batch. • Note that the discounted sum of gradients Σ_{t=0}^{k} α^{k−t} ∇J̃(it, r) can be conveniently updated following each transition, thereby resulting in an efficient implementation.
Incremental Gradient Methods for Policy Evaluation • It can be seen that because r is updated at intermediate transitions within a batch (rather than at the end of the batch), the location of the end of the batch becomes less relevant. • In this case, for each state i, we will have one cost sample for every time state i is encountered in the simulation. • Accordingly, state i will be weighted in the least squares optimization in proportion to the frequency of its occurrence within the simulated trajectory.
Incremental Gradient Methods for Policy Evaluation • Generally, the incremental versions of the gradient methods can be implemented more flexibly and tend to converge faster than their batch counterparts. • However, the asymptotic rate of convergence can still be very slow.
Agenda • Introduction • Background • Policy Evaluation Algorithms • General Issues of Cost Approximation • Approximate Policy Iteration • Direct and Indirect Approximation • The Role of Monte Carlo Simulation • Direct Policy Evaluation - Gradient Methods • Batch Gradient Methods for Policy Evaluation • Incremental Gradient Methods for Policy Evaluation • Comparison with our approach
The End Thanks for listening