Reinforcement Learning in MDPs by Least-Square Policy Iteration
Presented by Lihan He, Machine Learning Reading Group, Duke University, 09/16/2005
Outline
• MDP and Q-function
• Value function approximation
• LSPI: Least-Square Policy Iteration
• Proto-value Functions
• RPI: Representation Policy Iteration
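Markov Decision Process (MDP)
An MDP is a model $M = \langle S, A, T, R \rangle$:
• a set of environment states $S$,
• a set of actions $A$,
• a transition function $T: S \times A \times S \to [0,1]$, with $T(s,a,s') = P(s' \mid s,a)$,
• a reward function $R: S \times A \to \mathbb{R}$.
A policy is a function $\pi: S \to A$.
The value function (expected cumulative reward) $V^\pi: S \to \mathbb{R}$ satisfies the Bellman equation
$V^\pi(s) = R(s,\pi(s)) + \gamma \sum_{s'} P(s' \mid s,\pi(s))\, V^\pi(s')$
where $\gamma$ is the discount factor.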
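The Bellman equation for a fixed policy can be checked numerically by applying it repeatedly until the values stop changing. The sketch below assumes a discount factor below 1 and uses a made-up two-state MDP; the names `evaluate_policy`, `P`, `R` and the state and action labels are illustrative, not taken from the slides.

```python
def evaluate_policy(states, P, R, policy, gamma, sweeps=200):
    """Iterate V(s) = R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) V(s')."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {
            s: R[(s, policy[s])]
            + gamma * sum(p * V[s2] for s2, p in P[(s, policy[s])].items())
            for s in states
        }
    return V

# Hypothetical MDP: from "a" the action "go" leads to "b"; "stay" in "b" pays 1.
states = ["a", "b"]
P = {("a", "go"): {"b": 1.0}, ("b", "stay"): {"b": 1.0}}
R = {("a", "go"): 0.0, ("b", "stay"): 1.0}
policy = {"a": "go", "b": "stay"}

V = evaluate_policy(states, P, R, policy, gamma=0.5)
print(V)  # V["b"] approaches 1 / (1 - 0.5) = 2 and V["a"] approaches 0.5 * 2 = 1
```

With a model available this fixed point could also be found by solving a linear system; the iteration is shown only because it mirrors the equation term by term.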
Markov Decision Process (MDP)
An example of a grid world environment
[Figure: two grid-world panels, "Optimal policy" and "Value function"; the goal cell is marked +1 and the remaining cell values range from 0.5 to 0.9.]
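State-action value function Q
The state-action value function $Q^\pi(s,a)$ of any policy $\pi$ is defined over all possible combinations of states and actions and indicates the expected, discounted, total reward when taking action $a$ in state $s$ and following policy $\pi$ thereafter.
$Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, Q^\pi(s',\pi(s'))$
Given a policy $\pi$, each state-action pair has a value $Q^\pi(s,a)$. In matrix form, the Bellman equation above becomes
$Q^\pi = R + \gamma P \Pi_\pi Q^\pi$
• $Q^\pi$, $R$: vectors of size $|S||A|$
• $P$: stochastic matrix of size $|S||A| \times |S|$, with $P((s,a),s') = P(s' \mid s,a)$
• $\Pi_\pi$: matrix of size $|S| \times |S||A|$ that selects the action taken by $\pi$ in each next state, with $\Pi_\pi(s',(s',a')) = 1$ if $a' = \pi(s')$ and 0 otherwise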
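When the model is known, the matrix form of the Bellman equation for Q is a linear system that can be solved directly. The sketch below is an illustration on a hypothetical two-state, two-action MDP. It uses NumPy and writes the action-selection matrix explicitly as `Pi` (of size |S| × |S||A|, an assumed notation) so that the product with `P` is square.

```python
import numpy as np

gamma = 0.5  # illustrative discount factor

# Rows are (s, a) pairs in the order (0,0), (0,1), (1,0), (1,1); columns are s'.
# Action 0 stays in place, action 1 switches state (a made-up toy model).
P = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=float)
R = np.array([0.0, 1.0, 1.0, 0.0])  # reward 1 for landing in state 1
policy = [1, 0]                      # pi(0) = switch, pi(1) = stay

# Pi[s', (s', a')] = 1 exactly when a' = pi(s').
Pi = np.zeros((2, 4))
for s, a in enumerate(policy):
    Pi[s, 2 * s + a] = 1.0

# Solve (I - gamma * P * Pi) Q = R for the |S||A|-vector Q.
Q = np.linalg.solve(np.eye(4) - gamma * P @ Pi, R)
print(Q.reshape(2, 2))  # rows: states, columns: actions; Q = [[1, 2], [2, 1]]
```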
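How does policy iteration work?
[Diagram: a loop between "Policy" and "Value function Q". Value evaluation maps the policy to Q by solving $Q^\pi = R + \gamma P \Pi_\pi Q^\pi$, which requires the model; policy improvement maps Q back to a new policy.]
For model-free reinforcement learning, we don't have the model $P$.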
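Value function approximation
Let $\hat{Q}^\pi(s,a;w)$ be an approximation to $Q^\pi(s,a)$ with free parameters $w$:
$\hat{Q}^\pi(s,a;w) = \sum_{j=1}^{k} \phi_j(s,a)\, w_j$
i.e., Q values are approximated by a linear parametric combination of $k$ basis functions. The basis functions are fixed, but arbitrarily selected (non-linear) functions of $s$ and $a$.
Note that $Q^\pi$ is a vector of size $|S||A|$. If $k = |S||A|$ and the bases are independent, we can find $w$ such that $\hat{Q}^\pi = Q^\pi$. In general $k \ll |S||A|$, and we use a linear combination of only a few bases to approximate the value function:
$\hat{Q}^\pi = \Phi w$
where $\Phi$ is the $|S||A| \times k$ matrix whose row for $(s,a)$ is $\phi(s,a)^T = (\phi_1(s,a), \ldots, \phi_k(s,a))$.
Solve for $w$ → evaluate $\hat{Q}^\pi$ → get the updated policy $\pi'(s) = \arg\max_a \hat{Q}^\pi(s,a)$.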
Value function approximation
Examples of basis functions:
• Polynomials: use the indicator function $I(a = a_i)$ to decouple actions so that each action gets its own parameters.
• Radial basis functions (RBF)
• Proto-value functions
• Other manually designed bases based on specific problems
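As an illustration of the polynomial case, the sketch below builds a feature vector with one block of monomials per action and zeroes out every block except the one for the chosen action. The function name, the scalar state, and the block layout are assumptions made for this example.

```python
def polynomial_basis(s, a, num_actions, degree):
    """Return phi(s, a): a block (1, s, ..., s^degree) per action,
    multiplied by the indicator I(a == a_i)."""
    block = [float(s) ** p for p in range(degree + 1)]
    phi = []
    for a_i in range(num_actions):
        indicator = 1.0 if a == a_i else 0.0
        phi.extend(indicator * x for x in block)
    return phi

phi = polynomial_basis(s=3, a=1, num_actions=2, degree=2)
print(phi)  # [0.0, 0.0, 0.0, 1.0, 3.0, 9.0]
```

Because only the block of the chosen action is non-zero, the weights attached to different actions never mix, which is what "each action gets its own parameters" means in practice.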
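Value function approximation
Least-Square Fixed-Point Approximation
Let $\hat{Q}^\pi = \Phi w$. We have
$\Phi w \approx R + \gamma P \Pi_\pi \Phi w$
where $\Pi_\pi$ selects the action $\pi(s')$ in each next state. Use $Q^\pi = R + \gamma P \Pi_\pi Q^\pi$, and remember that $\hat{Q}^\pi$ is the projection of $Q^\pi$ onto the space spanned by the columns of $\Phi$. By projection theory, we finally get
$w = \left(\Phi^T(\Phi - \gamma P \Pi_\pi \Phi)\right)^{-1} \Phi^T R$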
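The projection argument can be spelled out in a few lines. This is a sketch that assumes $\Phi$ has linearly independent columns, so that $\Phi(\Phi^T\Phi)^{-1}\Phi^T$ is the orthogonal projection onto its column space. $\Pi_\pi$ denotes the matrix that picks the action $\pi(s')$ in each next state (notation assumed here).

```latex
\begin{align*}
\Phi w &= \Phi(\Phi^T\Phi)^{-1}\Phi^T \,(R + \gamma P \Pi_\pi \Phi w)
  && \text{project the Bellman backup onto the span of } \Phi \\
\Phi^T\Phi\, w &= \Phi^T (R + \gamma P \Pi_\pi \Phi w)
  && \text{left-multiply by } \Phi^T \\
\Phi^T(\Phi - \gamma P \Pi_\pi \Phi)\, w &= \Phi^T R
  && \text{collect the terms in } w \\
w &= \bigl(\Phi^T(\Phi - \gamma P \Pi_\pi \Phi)\bigr)^{-1} \Phi^T R
\end{align*}
```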
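Least-Square Policy Iteration
Solving for the parameter $w$ is equivalent to solving the linear system $A w = b$, where
$A = \Phi^T(\Phi - \gamma P \Pi_\pi \Phi) = \sum_{s} \sum_{a} \sum_{s'} P(s' \mid s,a)\, \phi(s,a) \left[\phi(s,a) - \gamma\, \phi(s',\pi(s'))\right]^T$
$b = \Phi^T R = \sum_{s} \sum_{a} \sum_{s'} P(s' \mid s,a)\, \phi(s,a)\, R(s,a)$
$A$ is the sum of many matrices and $b$ is the sum of many vectors, each weighted by the transition probability.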
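Least-Square Policy Iteration
If we have data sampled from the underlying transition probability, with samples $D = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{L}$, then $A$ and $b$ can be learned in block as
$\tilde{A} = \frac{1}{L} \sum_{i=1}^{L} \phi(s_i,a_i) \left[\phi(s_i,a_i) - \gamma\, \phi(s'_i,\pi(s'_i))\right]^T$
$\tilde{b} = \frac{1}{L} \sum_{i=1}^{L} \phi(s_i,a_i)\, r_i$
or in real time as
$\tilde{A}^{(t+1)} = \tilde{A}^{(t)} + \phi(s_t,a_t) \left[\phi(s_t,a_t) - \gamma\, \phi(s'_t,\pi(s'_t))\right]^T$
$\tilde{b}^{(t+1)} = \tilde{b}^{(t)} + \phi(s_t,a_t)\, r_t$
The factor $1/L$ cancels when solving $\tilde{A} w = \tilde{b}$, so both forms give the same $w$.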
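The two ways of accumulating the estimates can be compared on a toy problem. The sketch below uses NumPy, a one-hot basis over two states and two actions, and four hand-written samples. All of these are assumptions chosen so that the result can be checked by hand, not settings from the slides.

```python
import numpy as np

GAMMA = 0.5  # illustrative discount factor

def phi(s, a):
    """One-hot basis over 2 states x 2 actions (toy assumption)."""
    f = np.zeros(4)
    f[2 * s + a] = 1.0
    return f

def policy(s):
    """Fixed policy being evaluated: move to state 1, then stay there."""
    return 1 if s == 0 else 0

# Samples (s, a, r, s'): action 0 stays, action 1 switches; reward 1 on landing in state 1.
D = [(0, 0, 0.0, 0), (0, 1, 1.0, 1), (1, 0, 1.0, 1), (1, 1, 0.0, 0)]

# Block estimate: stack the features, then form A and b with matrix products.
Phi = np.array([phi(s, a) for s, a, _, _ in D])
Phi_next = np.array([phi(s2, policy(s2)) for _, _, _, s2 in D])
rewards = np.array([r_i for _, _, r_i, _ in D])
A_block = Phi.T @ (Phi - GAMMA * Phi_next) / len(D)
b_block = Phi.T @ rewards / len(D)

# Real-time estimate: add one rank-one term per sample (no 1/L factor).
A_rt, b_rt = np.zeros((4, 4)), np.zeros(4)
for s, a, r_i, s2 in D:
    f = phi(s, a)
    A_rt += np.outer(f, f - GAMMA * phi(s2, policy(s2)))
    b_rt += f * r_i

w_block = np.linalg.solve(A_block, b_block)
w_rt = np.linalg.solve(A_rt, b_rt)
print(w_block, w_rt)  # both equal [1, 2, 2, 1] up to rounding
```

For larger problems the real-time form is usually paired with an incremental update of the inverse, which this sketch leaves out.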
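Least-Square Policy Iteration
Input: $D$, $k$, $\phi$, $\gamma$, $\varepsilon$, $\pi_0(w_0)$
$\pi' \leftarrow \pi_0$
repeat
    $\pi \leftarrow \pi'$
    $\tilde{w} \leftarrow \tilde{A}^{-1}\tilde{b}$, with $\tilde{A}$, $\tilde{b}$ estimated from $D$ under $\pi$    % value function update (could use real-time update)
    $\pi'(s) = \arg\max_a \phi(s,a)^T \tilde{w}$    % policy update
until $\|w - w'\| < \varepsilon$, where $w$ and $w'$ are the parameters of $\pi$ and $\pi'$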
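The loop above can be sketched in a few lines of NumPy. The example is a minimal illustration under stated assumptions: a hypothetical deterministic 5-state chain with reward 1 for arriving at the rightmost state, a one-hot basis, and one sample per state-action pair. The helper names `phi`, `greedy`, `lstdq`, `lspi` and `step` are made up for this sketch.

```python
import numpy as np

N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9  # toy chain; action 0 = left, 1 = right

def phi(s, a):
    """One-hot basis over state-action pairs (assumption for this toy)."""
    f = np.zeros(N_STATES * N_ACTIONS)
    f[s * N_ACTIONS + a] = 1.0
    return f

def greedy(w, s):
    """Policy update: pick the action with the largest approximate Q value."""
    return int(np.argmax([phi(s, a) @ w for a in range(N_ACTIONS)]))

def lstdq(D, w_policy):
    """Value function update: solve A w = b for the policy greedy in w_policy."""
    k = N_STATES * N_ACTIONS
    A, b = np.zeros((k, k)), np.zeros(k)
    for s, a, r, s2 in D:
        f = phi(s, a)
        A += np.outer(f, f - GAMMA * phi(s2, greedy(w_policy, s2)))
        b += f * r
    return np.linalg.solve(A, b)

def lspi(D, w0, eps=1e-6, max_iter=20):
    """Alternate evaluation and improvement until the weights stop changing."""
    w = w0
    for _ in range(max_iter):
        w_new = lstdq(D, w)
        if np.linalg.norm(w_new - w) < eps:
            return w_new
        w = w_new
    return w

def step(s, a):
    """Hypothetical chain dynamics: reward 1 for arriving at the last state."""
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

D = [(s, a, step(s, a)[1], step(s, a)[0])
     for s in range(N_STATES) for a in range(N_ACTIONS)]
w = lspi(D, np.zeros(N_STATES * N_ACTIONS))
learned_policy = [greedy(w, s) for s in range(N_STATES)]
print(learned_policy)  # [1, 1, 1, 1, 1]: always move right
```

With a one-hot basis the evaluation step is exact, so this run behaves like ordinary policy iteration. With fewer bases than state-action pairs the same loop returns an approximation instead.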
Proto-Value Function
How to choose basis functions in the LSPI algorithm?
Proto-value functions are good bases for value function approximation.
• Do not need to design bases manually
• Data tell us what the corresponding proto-value functions are
• Generated from the topology of the underlying state space
• Do not estimate the underlying state transition probability
• Capture the intrinsic smoothness constraints that true value functions have
Proto-Value Function
1. Graph representation of the underlying state transition:
[Figure: graph over the states s1 to s11]
2. Adjacency matrix A:
[Figure: adjacency matrix of the graph]
Proto-Value Function
3. Combinatorial Laplacian L:
$L = T - A$
where $T$ is the diagonal matrix whose entries are the row sums of the adjacency matrix $A$.
4. Proto-value functions: eigenvectors of the combinatorial Laplacian $L$.
Each eigenvector provides one basis function $\phi_j(s)$; combined with the indicator function for action $a$, we get $\phi_j(s,a)$.
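Steps 3 and 4 can be sketched with NumPy as follows. The example assumes an undirected graph with a symmetric adjacency matrix, here a hypothetical 5-state path, and the function names are illustrative. `numpy.linalg.eigh` returns eigenvalues in ascending order, so the first columns are the low-order eigenvectors.

```python
import numpy as np

def proto_value_functions(adjacency, k):
    """Return the k lowest-order eigenpairs of the combinatorial Laplacian L = T - A."""
    A = np.asarray(adjacency, dtype=float)
    T = np.diag(A.sum(axis=1))            # diagonal matrix of row sums
    L = T - A
    eigvals, eigvecs = np.linalg.eigh(L)  # L is symmetric; eigenvalues ascending
    return eigvals[:k], eigvecs[:, :k]

def state_action_features(pvf, s, a, num_actions):
    """Combine phi_j(s) with the action indicator: one block of k entries per action."""
    k = pvf.shape[1]
    phi = np.zeros(k * num_actions)
    phi[a * k:(a + 1) * k] = pvf[s]
    return phi

# Hypothetical path graph s0 - s1 - s2 - s3 - s4.
n = 5
adjacency = np.zeros((n, n))
for i in range(n - 1):
    adjacency[i, i + 1] = adjacency[i + 1, i] = 1.0

eigvals, pvf = proto_value_functions(adjacency, k=3)
phi = state_action_features(pvf, s=2, a=1, num_actions=2)
print(eigvals)  # the smallest eigenvalue is 0 (constant eigenvector) for a connected graph
print(phi)      # first block is zero because a = 1 selects the second block
```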
Example of proto-value functions
Grid world: 1260 states
[Figure: grid world with goal G; adjacency matrix, with a zoomed-in view]
Proto-value functions: Low-order eigenvectors as basis functions
[Figure panels: "Optimal value function" and "Value function approximation using 10 proto-value functions as bases"]
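Representation Policy Iteration (offline)
Input: $D$, $k$, $\gamma$, $\varepsilon$, $\pi_0(w_0)$
1. Construct basis functions:
    • Use the samples $D$ to learn a graph that encodes the underlying state space topology.
    • Compute the lowest-order $k$ eigenvectors of the combinatorial Laplacian on the graph. The basis functions $\phi(s,a)$ are produced by combining the $k$ proto-value functions with the indicator function of action $a$.
2. $\pi' \leftarrow \pi_0$
3. repeat
    $\pi \leftarrow \pi'$
    $\tilde{w} \leftarrow \tilde{A}^{-1}\tilde{b}$    % value function update
    $\pi'(s) = \arg\max_a \phi(s,a)^T \tilde{w}$    % policy update
   until $\|w - w'\| < \varepsilon$, where $w$ and $w'$ are the parameters of $\pi$ and $\pi'$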
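The first step, learning a graph from the samples, can be sketched as below. The sketch assumes discrete states indexed from 0 and treats every observed transition as an undirected, unweighted edge. That is one simple choice among several possible graph constructions, and the function name is made up for this example.

```python
def adjacency_from_samples(samples, num_states):
    """Build a symmetric 0/1 adjacency matrix from samples (s, a, r, s')."""
    adjacency = [[0] * num_states for _ in range(num_states)]
    for s, _a, _r, s_next in samples:
        if s != s_next:  # ignore self-transitions
            adjacency[s][s_next] = 1
            adjacency[s_next][s] = 1
    return adjacency

# Hypothetical samples from a 3-state chain.
D = [(0, 1, 0.0, 1), (1, 1, 0.0, 2), (2, 0, 0.0, 1), (2, 1, 1.0, 2)]
A = adjacency_from_samples(D, 3)
print(A)  # [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```

The resulting matrix is what the combinatorial Laplacian and its eigenvectors are computed from in the second bullet of step 1.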
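Representation Policy Iteration (online)
Input: $D_0$, $k$, $\gamma$, $\varepsilon$, $\pi_0(w_0)$
1. Initialization: using the offline algorithm with $D_0$, $k$, $\gamma$, $\varepsilon$, $\pi_0$, learn the policy $\pi^{(0)}$.
2. $\pi' \leftarrow \pi^{(0)}$
3. repeat
    (a) $\pi^{(t)} \leftarrow \pi'$
    (b) Execute $\pi^{(t)}$ to get new data $D^{(t)} = \{s_t, a_t, r_t, s'_t\}$.
    (c) If the new data sample $D^{(t)}$ changes the topology of the graph $G$, compute a new set of basis functions.
    (d) $\tilde{w} \leftarrow \tilde{A}^{-1}\tilde{b}$    % value function update
    (e) $\pi'(s) = \arg\max_a \phi(s,a)^T \tilde{w}$    % policy update
   until $\|w - w'\| < \varepsilon$, where $w$ and $w'$ are the parameters of $\pi^{(t)}$ and $\pi'$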
Example: Chain MDP
Rewards: +1 at states 10 and 41, 0 otherwise.
Optimal policy: Right in states 1-9 and 26-41; Left in states 11-25 and 42-50.
Example: Chain MDP
20 bases used
[Figure: value function and its approximation in each iteration]
Example: Chain MDP
Performance comparison
[Figure: policy L1 error with respect to the optimal policy; steps to convergence]
References:
• M. Lagoudakis & R. Parr, Least-Squares Policy Iteration. Journal of Machine Learning Research 4 (2003), 1107-1149. (Gives the LSPI algorithm for reinforcement learning.)
• S. Mahadevan, Proto-Value Functions: Developmental Reinforcement Learning. Proceedings of ICML 2005. (How to build basis functions for the LSPI algorithm.)
• C. Kwok & D. Fox, Reinforcement Learning for Sensing Strategies. Proceedings of IROS 2004. (An application of LSPI.)