
Reinforcement Learning in MDPs by Least-Squares Policy Iteration


Presentation Transcript


  1. Reinforcement Learning in MDPs by Least-Squares Policy Iteration. Presented by Lihan He, Machine Learning Reading Group, Duke University, 09/16/2005

  2. Outline
• MDP and Q-function
• Value function approximation
• LSPI: Least-Squares Policy Iteration
• Proto-value functions
• RPI: Representation Policy Iteration

  3. Markov Decision Process (MDP)
An MDP is a model M = <S, A, T, R> consisting of
• a set of environment states S,
• a set of actions A,
• a transition function T: S × A × S → [0,1], with T(s,a,s') = P(s'|s,a),
• a reward function R: S × A → ℝ.
A policy is a function π: S → A. The value function (expected cumulative discounted reward) Vπ: S → ℝ satisfies the Bellman equation
Vπ(s) = R(s, π(s)) + γ Σs' P(s'|s, π(s)) Vπ(s'),
where 0 ≤ γ < 1 is the discount factor.
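As a concrete illustration of the Bellman equation above, here is a minimal Python sketch that evaluates Vπ on a small, made-up two-state MDP. The transition probabilities, rewards, and policy are arbitrary illustrative assumptions, not numbers from the presentation.

```python
import numpy as np

# A minimal, hypothetical 2-state / 2-action MDP used only to illustrate
# V_pi(s) = R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) * V_pi(s').
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.2, 0.8]]])
R = np.array([[0.0, 1.0],                   # R[s, a]
              [0.5, 0.0]])
pi = np.array([1, 0])                       # a deterministic policy: pi[s] -> a

# Iterative policy evaluation: apply the Bellman backup until convergence.
V = np.zeros(n_states)
for _ in range(1000):
    V_new = np.array([R[s, pi[s]] + gamma * P[s, pi[s]] @ V for s in range(n_states)])
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(V)
```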

  4. Markov Decision Process (MDP)
An example of a grid-world environment. [Figure: the optimal policy and the corresponding value function on the grid; cell values range from about 0.5 to 0.9, with a +1 goal state.]

  5. State-action value function Q
The state-action value function Qπ(s,a) of a policy π is defined over all possible combinations of states and actions and gives the expected, discounted, total reward when taking action a in state s and following policy π thereafter:
Qπ(s,a) = R(s,a) + γ Σs' P(s'|s,a) Qπ(s', π(s')).
Given a policy π, we have one value Qπ(s,a) for each state-action pair. In matrix form, the Bellman equation above becomes
Qπ = R + γ P Ππ Qπ
where Qπ and R are vectors of size |S||A|; P is the stochastic matrix of size |S||A| × |S| with P((s,a), s') = P(s'|s,a); and Ππ is the |S| × |S||A| matrix encoding the policy, with Ππ(s', (s',a')) = 1 if a' = π(s') and 0 otherwise.
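For small problems the matrix form can be solved directly. The sketch below does so on the same made-up two-state MDP, with the flattening of (s, a) pairs and the policy matrix Ππ constructed as described above; the indexing convention is an assumption of this illustration.

```python
import numpy as np

# Solve the matrix Bellman equation Q = R + gamma * P * Pi_pi * Q exactly
# on the hypothetical 2-state / 2-action MDP from the previous sketch.
n_states, n_actions, gamma = 2, 2, 0.9
P3 = np.array([[[0.8, 0.2], [0.1, 0.9]],
               [[0.5, 0.5], [0.2, 0.8]]])          # P3[s, a, s']
R = np.array([[0.0, 1.0], [0.5, 0.0]])             # R[s, a]
pi = np.array([1, 0])

# Flatten (s, a) pairs: row index i = s * n_actions + a.
P_flat = P3.reshape(n_states * n_actions, n_states)        # |S||A| x |S|
Pi = np.zeros((n_states, n_states * n_actions))            # |S| x |S||A|, picks a' = pi(s')
for s in range(n_states):
    Pi[s, s * n_actions + pi[s]] = 1.0
R_flat = R.reshape(-1)

# Q_pi = (I - gamma * P * Pi_pi)^{-1} R
Q = np.linalg.solve(np.eye(n_states * n_actions) - gamma * P_flat @ Pi, R_flat)
print(Q.reshape(n_states, n_actions))
```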

  6. How does policy iteration work?
[Diagram: policy π → value evaluation, Qπ = R + γ P Ππ Qπ, using the model → value function Qπ → policy improvement → new policy π.]
For model-free reinforcement learning, we do not have the model P.
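For reference, a model-based version of this evaluate/improve loop on the same toy MDP; LSPI's point, addressed on the following slides, is to run this loop when P is unknown.

```python
import numpy as np

# Exact, model-based policy iteration on the made-up 2-state MDP, showing the
# evaluate / improve loop in the diagram. LSPI performs this loop without P.
n_states, n_actions, gamma = 2, 2, 0.9
P3 = np.array([[[0.8, 0.2], [0.1, 0.9]],
               [[0.5, 0.5], [0.2, 0.8]]])     # P3[s, a, s']
R = np.array([[0.0, 1.0], [0.5, 0.0]])        # R[s, a]

def evaluate(pi):
    """Policy evaluation: solve Q = R + gamma * P * Pi_pi * Q for the current policy."""
    P_flat = P3.reshape(n_states * n_actions, n_states)
    Pi = np.zeros((n_states, n_states * n_actions))
    for s in range(n_states):
        Pi[s, s * n_actions + pi[s]] = 1.0
    Q = np.linalg.solve(np.eye(n_states * n_actions) - gamma * P_flat @ Pi, R.reshape(-1))
    return Q.reshape(n_states, n_actions)

pi = np.zeros(n_states, dtype=int)
while True:
    Q = evaluate(pi)              # value evaluation
    pi_new = Q.argmax(axis=1)     # greedy policy improvement
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new
print(pi, Q)
```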

  7. Value function approximation
Let Q̂π(s,a; w) = Σj=1..k φj(s,a) wj be an approximation to Qπ(s,a) with free parameters w, i.e., Q values are approximated by a linear parametric combination of k basis functions; in matrix form, Q̂π = Φ w, where Φ is the |S||A| × k matrix whose rows are the feature vectors φ(s,a)ᵀ. The basis functions are fixed, but arbitrarily selected (non-linear) functions of s and a. Note that Qπ is a vector of size |S||A|. If k = |S||A| and the bases are independent, we can find w such that Q̂π = Qπ exactly. In general k ≪ |S||A|, so we use a linear combination of only a few bases to approximate the value function Qπ. Solving for w evaluates the policy; taking the greedy policy with respect to Q̂π then gives the updated policy.
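To make this concrete, here is a minimal sketch of a linear Q-function approximator. The feature map (a few polynomial features of a one-dimensional state, decoupled per action) and the names phi, Q_hat, and k_per_action are illustrative assumptions, not the presentation's own choices.

```python
import numpy as np

# Linear Q-function approximation: Q_hat(s, a; w) = phi(s, a)^T w.
n_actions, k_per_action = 2, 3

def phi(s, a):
    """Stack per-action feature blocks so each action gets its own parameters."""
    features = np.zeros(n_actions * k_per_action)
    features[a * k_per_action:(a + 1) * k_per_action] = [1.0, s, s ** 2]
    return features

w = np.random.randn(n_actions * k_per_action)   # free parameters to be learned

def Q_hat(s, a):
    return phi(s, a) @ w

# The greedy policy with respect to Q_hat defines the policy update.
greedy_action = max(range(n_actions), key=lambda a: Q_hat(0.5, a))
```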

  8. Value function approximation
Examples of basis functions:
• Polynomials of the state variables (use the indicator function I(a = ai) to decouple actions so that each action gets its own parameters)
• Radial basis functions (RBF)
• Proto-value functions
• Other manually designed bases, based on the specific problem
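Below is a hedged sketch of the RBF option from this list, combined with the action-indicator decoupling described above; the centers, width, and one-dimensional state space are assumptions made only for illustration.

```python
import numpy as np

# Radial-basis-function (RBF) features with an action indicator:
# phi_j(s, a) = I(a = a_i) * exp(-(s - c_j)^2 / (2 * sigma^2)), plus a per-action bias.
n_actions = 2
centers = np.linspace(0.0, 1.0, 5)   # illustrative RBF centers over a 1-D state space
sigma = 0.25

def phi_rbf(s, a):
    block = np.concatenate(([1.0], np.exp(-(s - centers) ** 2 / (2 * sigma ** 2))))
    features = np.zeros(n_actions * block.size)
    features[a * block.size:(a + 1) * block.size] = block
    return features

print(phi_rbf(0.3, 1).round(3))
```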

  9. Value function approximation
Least-squares fixed-point approximation. Let Q̂π = Φ wπ. Using Qπ = R + γ P Ππ Qπ, and remembering that Q̂π lies in the column space of Φ so that the right-hand side is projected orthogonally onto that space, projection theory finally gives
Φᵀ(Φ − γ P Ππ Φ) wπ = Φᵀ R,  i.e.,  wπ = (Φᵀ(Φ − γ P Ππ Φ))⁻¹ Φᵀ R.
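Spelled out in LaTeX, the derivation sketched on this slide (following Lagoudakis & Parr, 2003, and omitting the sampling-distribution weighting used in the paper) is:

```latex
% Least-squares fixed-point approximation: substitute \hat{Q}^\pi = \Phi w^\pi into the
% Bellman equation and project the right-hand side back onto the column space of \Phi.
\hat{Q}^\pi = \Phi w^\pi, \qquad
Q^\pi = R + \gamma P \Pi_\pi Q^\pi
\;\Longrightarrow\;
\Phi^\top \bigl( \Phi - \gamma P \Pi_\pi \Phi \bigr) w^\pi = \Phi^\top R ,
\qquad
w^\pi = \bigl( \Phi^\top (\Phi - \gamma P \Pi_\pi \Phi) \bigr)^{-1} \Phi^\top R .
```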

  10. Least-Squares Policy Iteration
Solving for the parameters w is equivalent to solving the linear system A w = b, with A = Φᵀ(Φ − γ P Ππ Φ) and b = Φᵀ R. A is a sum of many small matrices and b is a sum of many vectors, weighted by the transition probabilities.
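Written with the sums made explicit (again omitting any weighting by a sampling distribution), the quantities on this slide are:

```latex
% A accumulates one outer product per state-action pair, b one vector,
% each weighted by the transition probabilities P(s' | s, a).
A = \sum_{s,a} \phi(s,a) \Bigl( \phi(s,a)
      - \gamma \sum_{s'} P(s' \mid s,a)\, \phi\bigl(s', \pi(s')\bigr) \Bigr)^{\!\top},
\qquad
b = \sum_{s,a} \phi(s,a)\, R(s,a).
```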

  11. Least-Squares Policy Iteration
If we draw samples (s, a, r, s') from the underlying transition probabilities, A and b can be learned from the samples, either in block (batch) form or with real-time, per-sample updates:
A ← A + φ(s,a) (φ(s,a) − γ φ(s', π(s')))ᵀ,  b ← b + φ(s,a) r.
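A minimal Python sketch of this sample-based (LSTDQ-style) estimate; the names phi and pi, and the sample format (s, a, r, s'), follow the earlier sketches and are assumptions of this illustration.

```python
import numpy as np

# Accumulate A and b from sampled transitions, then solve for w = A^{-1} b.
# phi(s, a): feature map returning a length-k vector; pi: callable mapping a state to an action.
def lstdq(samples, phi, pi, k, gamma=0.9):
    A = np.zeros((k, k))
    b = np.zeros(k)
    for (s, a, r, s_next) in samples:
        f = phi(s, a)
        f_next = phi(s_next, pi(s_next))
        A += np.outer(f, f - gamma * f_next)    # one rank-one term per sample
        b += f * r
    return np.linalg.solve(A, b)                # may need regularization if A is singular
```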

  12. Least-Squares Policy Iteration
LSPI algorithm. Input: D, k, φ, γ, ε, π0 (given by w0)
π' ← π0
repeat
  π ← π'
  % value function update: solve A w = b from the samples (could use the real-time update)
  % policy update: π' ← greedy policy with respect to Φ w
until the change in the weights is below ε
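The loop on this slide, as a compact Python sketch. The stopping test on the weight change and the helper names are illustrative assumptions; the authoritative version is the algorithm in Lagoudakis & Parr's paper.

```python
import numpy as np

# LSPI: alternate LSTDQ policy evaluation with greedy policy improvement until the
# weights stop changing. D is a list of (s, a, r, s') samples; phi returns length-k features.
def lspi(D, phi, actions, k, gamma=0.9, eps=1e-6, max_iter=50):
    w = np.zeros(k)                                        # pi_0 is implied by w_0
    greedy = lambda s, w: max(actions, key=lambda a: phi(s, a) @ w)
    for _ in range(max_iter):
        # Value-function update (LSTDQ) under the current greedy policy.
        A, b = np.zeros((k, k)), np.zeros(k)
        for (s, a, r, s_next) in D:
            f = phi(s, a)
            f_next = phi(s_next, greedy(s_next, w))
            A += np.outer(f, f - gamma * f_next)
            b += f * r
        w_new = np.linalg.solve(A, b)
        # Policy update is implicit: the greedy policy w.r.t. the new weights.
        if np.linalg.norm(w_new - w) < eps:
            return w_new
        w = w_new
    return w
```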

  13. Proto-Value Functions
How do we choose the basis functions in the LSPI algorithm? Proto-value functions are good bases for value function approximation:
• No need to design the bases manually
• The data tell us what the corresponding proto-value functions are
• Generated from the topology of the underlying state space
• Do not require estimating the underlying state transition probabilities
• Capture the intrinsic smoothness constraints that true value functions have

  14. Proto-Value Function
1. Graph representation of the underlying state transitions: states s1, ..., s11 are the nodes, with edges between states that transition to one another. [Figure: the state graph.]
2. Adjacency matrix A. [Figure: the adjacency matrix of the graph.]

  15. Proto-Value Function
3. Combinatorial Laplacian L: L = T − A, where T is the diagonal matrix whose entries are the row sums of the adjacency matrix A.
4. Proto-value functions: the eigenvectors of the combinatorial Laplacian L. Each eigenvector provides one basis function φj(s); combined with the indicator function for action a, we get φj(s,a).
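A short Python sketch of steps 1-4 on a hypothetical 10-state chain graph: build the adjacency matrix, form L = T − A, take the low-order eigenvectors, and attach the action indicator. The graph topology and sizes are assumptions for illustration only.

```python
import numpy as np

# Proto-value functions: low-order eigenvectors of the combinatorial Laplacian of the
# state graph, combined with an action indicator to give phi_j(s, a).
n_states = 10
A = np.zeros((n_states, n_states))
for s in range(n_states - 1):                 # edges between neighboring chain states
    A[s, s + 1] = A[s + 1, s] = 1.0

T = np.diag(A.sum(axis=1))                    # row sums on the diagonal
L = T - A                                     # combinatorial Laplacian

eigvals, eigvecs = np.linalg.eigh(L)          # eigh: L is symmetric
k = 4
proto_value_functions = eigvecs[:, :k]        # lowest-order eigenvectors, one basis each

n_actions = 2
def phi(s, a):
    """Combine the k proto-value functions with the indicator of action a."""
    feats = np.zeros(n_actions * k)
    feats[a * k:(a + 1) * k] = proto_value_functions[s]
    return feats
```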

  16. Example of proto-value functions: a grid world with 1260 states. [Figures: the grid-world layout with goal state G, and its adjacency matrix with a zoomed-in view.]

  17. Proto-value functions: Low-order eigenvectors as basis functions

  18. Value function approximation using 10 proto-value functions as bases. [Figures: the optimal value function and its approximation.]

  19. Representation Policy Iteration (offline)
Input: D, k, γ, ε, π0 (given by w0)
1. Construct the basis functions:
• Use the sample set D to learn a graph that encodes the underlying state-space topology.
• Compute the k lowest-order eigenvectors of the combinatorial Laplacian on the graph. The basis functions φ(s,a) are produced by combining the k proto-value functions with the indicator function of action a.
2. π' ← π0
3. repeat
  π ← π'
  % value function update
  % policy update
until the change in the weights is below ε
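A minimal sketch of step 1 under the assumption of discrete, integer-indexed states: learn the graph from the samples, compute the k lowest-order Laplacian eigenvectors, and return a feature map that the earlier lspi sketch could consume. The helper names are hypothetical.

```python
import numpy as np

# Build proto-value-function features from a sample set D of (s, a, r, s') tuples,
# then hand the resulting phi to an LSPI loop such as the lspi() sketch above.
def build_pvf_basis(D, n_states, n_actions, k):
    A = np.zeros((n_states, n_states))
    for (s, a, r, s_next) in D:               # connect states observed to transition
        A[s, s_next] = A[s_next, s] = 1.0
    L = np.diag(A.sum(axis=1)) - A            # combinatorial Laplacian
    _, eigvecs = np.linalg.eigh(L)
    pvf = eigvecs[:, :k]                      # k lowest-order proto-value functions

    def phi(s, a):
        feats = np.zeros(n_actions * k)
        feats[a * k:(a + 1) * k] = pvf[s]
        return feats
    return phi
```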

  20. Representation Policy Iteration (online)
Input: D0, k, γ, ε, π0 (given by w0)
1. Initialization: using the offline algorithm with D0, k, γ, ε, π0, learn the policy π(0).
2. π' ← π(0)
3. repeat
  (a) π(t) ← π'
  (b) Execute π(t) to get new data D(t) = {st, at, rt, s't}.
  (c) If the new samples D(t) change the topology of the graph G, compute a new set of basis functions.
  (d) % value function update
  (e) % policy update
until convergence

  21. Example: Chain MDP with 50 states; reward +1 at states 10 and 41, 0 otherwise. Optimal policy: Right in states 1-9 and 26-41, Left in states 11-25 and 42-50.
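A sketch of this chain MDP as a simulator plus a random-walk sample collector. The 90% action-success probability follows Lagoudakis & Parr's chain-walk domain and is an assumption here, as is giving the reward on entering states 10 or 41.

```python
import numpy as np

# 50-state chain: actions Left/Right, assumed to succeed with probability 0.9 and to
# move the opposite way otherwise; reward +1 on entering states 10 or 41 (1-indexed).
n_states = 50
LEFT, RIGHT = 0, 1
rng = np.random.default_rng(0)

def step(s, a):
    """One transition in the chain; states are 0-indexed internally."""
    direction = -1 if a == LEFT else 1
    if rng.random() >= 0.9:                   # action fails: move the opposite way
        direction = -direction
    s_next = min(max(s + direction, 0), n_states - 1)
    r = 1.0 if s_next in (9, 40) else 0.0     # states 10 and 41 in the slide's numbering
    return s_next, r

# Collect a sample set D for LSPI / RPI by following a random policy.
D = []
s = rng.integers(n_states)
for _ in range(5000):
    a = rng.integers(2)
    s_next, r = step(s, a)
    D.append((s, a, r, s_next))
    s = s_next
```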

  22. Example: Chain MDP, with 20 bases used. [Figures: the value function and its approximation at each iteration.]

  23. Example: Chain MDP, performance comparison. [Figures: policy L1 error with respect to the optimal policy, and number of steps to convergence.]

  24. References:
M. Lagoudakis and R. Parr. Least-Squares Policy Iteration. Journal of Machine Learning Research 4 (2003), 1107-1149. -- Gives the LSPI algorithm for reinforcement learning.
S. Mahadevan. Proto-Value Functions: Developmental Reinforcement Learning. Proceedings of ICML 2005. -- How to build basis functions for the LSPI algorithm.
C. Kwok and D. Fox. Reinforcement Learning for Sensing Strategies. Proceedings of IROS 2004. -- An application of LSPI.
