Regularization and Feature Selection in Least-Squares Temporal Difference Learning

Regularization and Feature Selection in Least-Squares Temporal Difference Learning. J. Zico Kolter and Andrew Y. Ng, Computer Science Department, Stanford University. June 16th, ICML 2009.


Presentation Transcript

  1. Regularization and Feature Selection in Least-Squares Temporal Difference Learning • J. Zico Kolter and Andrew Y. Ng • Computer Science Department, Stanford University • June 16th, ICML 2009

  2. Outline • RL with (linear) function approximation • Least-squares temporal difference (LSTD) algorithms very effective in practice • But, when number of features is large, can be expensive and over-fit to training data

  3. Outline • RL with (linear) function approximation • Least-squares temporal difference (LSTD) algorithms very effective in practice • But, when number of features is large, can be expensive and over-fit to training data • This work: present method for feature selection in LSTD (via L1 regularization) • Introduce notion of L1-regularized TD fixed points, and develop an efficient algorithm

  4. Outline

  5. Outline

  6. Outline

  7. Outline

  8. RL with Least-Squares Temporal Difference

  9. Problem Setup • Markov chain M = (S, R, P, γ) • Set of states S • Reward function R(s) • Transition probabilities P(s’|s) • Discount factor γ • Want to compute the value function for the Markov chain

  10. Problem Setup • Markov chain M = (S, R, P, γ) • Set of states S • Reward function R(s) • Transition probabilities P(s’|s) • Discount factor γ • Want to compute the value function for the Markov chain • But, problem is hard because: • Don’t know the true state transitions / reward (only have access to samples) • State space is too large to represent the value function explicitly
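For a chain small enough to write down, the value function has a closed form, V = (I − γP)⁻¹R, which makes the two difficulties on this slide concrete: the computation needs the full transition matrix and one entry of V per state. The sketch below is illustrative only; the three-state chain and its numbers are assumptions, not taken from the talk.

```python
import numpy as np

def exact_value_function(P, R, gamma):
    """Solve the Bellman equation V = R + gamma * P @ V for a known Markov chain."""
    n_states = P.shape[0]
    return np.linalg.solve(np.eye(n_states) - gamma * P, R)

# Hypothetical 3-state chain: drift to the right, reward only in the last state.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.0, 0.0, 1.0]])
R = np.array([0.0, 0.0, 1.0])
gamma = 0.9

V = exact_value_function(P, R, gamma)
```

Neither P nor R is available in the setting of the talk, and the n × n solve is infeasible when the state space is large, which is what motivates sample-based TD methods with function approximation.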

  11. TD Algorithms • Temporal difference (TD) family of algorithms (Sutton, 1988) addresses this problem setting • In particular, focus on Least-Squares Temporal Difference (LSTD) algorithms (Bradtke and Barto, 1996; Boyan, 1999; Lagoudakis and Parr, 2003) • Work well in practice, make efficient use of data

  12. Brief LSTD Overview • Represent value function using linear approximation

  13. Brief LSTD Overview • Represent value function using linear approximation (annotation: parameter vector)

  14. Brief LSTD Overview • Represent value function using linear approximation (annotation: state features)
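The equation annotated on these three slides is presumably the standard linear value-function approximation. The notation below is an assumption: φ for the feature map, and k for the number of features (chosen to match the "k x k matrix" mentioned on a later slide).

```latex
V(s) \approx w^\top \phi(s),
\qquad w \in \mathbb{R}^k \ \text{(parameter vector)},
\quad \phi(s) \in \mathbb{R}^k \ \text{(state features)}
```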

  15. Brief LSTD Overview • TD methods seek parameters w that satisfy the following fixed-point equation

  16. Brief LSTD Overview • TD methods seek parameters w that satisfy the following fixed-point equation (annotation: optimization variable)

  17. Brief LSTD Overview • TD methods seek parameters w that satisfy the following fixed-point equation (annotation: matrix of all state features)

  18. Brief LSTD Overview • TD methods seek parameters w that satisfy the following fixed-point equation (annotation: vector of all rewards)

  19. Brief LSTD Overview • TD methods seek parameters w that satisfy the following fixed-point equation (annotation: matrix of transition probabilities)

  20. Brief LSTD Overview • TD methods seek parameters w that satisfy the following fixed-point equation • Also sometimes written (equivalently) as
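A plausible reconstruction of the fixed-point equation and its equivalent projected form, written in standard LSTD notation. The symbols are assumptions chosen to match the slide annotations: u is the optimization variable, Φ the matrix of all state features, R the vector of all rewards, P the matrix of transition probabilities, and Π the least-squares projection onto the span of the features. The norm may additionally be weighted by a state distribution; the unweighted form is shown here.

```latex
w = \arg\min_{u \in \mathbb{R}^k}
    \left\| \Phi u - \left( R + \gamma P \Phi w \right) \right\|_2^2
\qquad\Longleftrightarrow\qquad
\Phi w = \Pi \left( R + \gamma P \Phi w \right),
\quad \Pi = \Phi \left( \Phi^\top \Phi \right)^{-1} \Phi^\top
```

Note that w appears on both sides: the regression target R + γPΦw itself depends on the solution, which is what distinguishes a TD fixed point from an ordinary least-squares fit.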

  21. Brief LSTD Overview • TD methods seek parameters w that satisfy the following fixed-point equation • Also sometimes written (equivalently) as • LSTD finds a w that approximately satisfies this equation using only samples from the MDP (gives closed-form expression for optimal w)
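The closed-form expression can be sketched as follows, assuming the usual sample-based estimate w = A⁻¹b with A = Φᵀ(Φ − γΦ′) and b = ΦᵀR, where each row of Φ and Φ′ holds the features of a sampled state and of its observed successor. The function name, argument names, and the tiny deterministic chain are illustrative assumptions.

```python
import numpy as np

def lstd(Phi, Phi_next, rewards, gamma):
    """Sample-based LSTD estimate of the TD fixed point.

    Phi      : (n, k) features of the sampled states
    Phi_next : (n, k) features of the observed successor states
    rewards  : (n,)   observed rewards
    """
    A = Phi.T @ (Phi - gamma * Phi_next)   # k x k matrix
    b = Phi.T @ rewards                    # length-k vector
    return np.linalg.solve(A, b)

# Hypothetical deterministic chain 0 -> 1 -> 2 -> 2 with one-hot (tabular) features.
features = np.eye(3)
transitions = [(0, 1), (1, 2), (2, 2)]
Phi = features[[s for s, _ in transitions]]
Phi_next = features[[s_next for _, s_next in transitions]]
rewards = np.array([0.0, 0.0, 1.0])

w = lstd(Phi, Phi_next, rewards, gamma=0.9)
```

With tabular features and every transition observed, the estimate coincides with the exact value function of the chain. With k features, A is k × k, which is the storage and inversion cost discussed on the next slide.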

  22. Problems with LSTD • Requires storing/inverting k x k matrix (k = number of features) • Can be extremely slow for large k • In practice, often means that the practitioner puts great effort into picking a few “good” features • For many features / few samples, LSTD can over-fit to training data

  23. Regularization and Feature Selection for LSTD

  24. Regularized LSTD • Introduce regularization term into LSTD fixed point equation

  25. Regularized LSTD • Introduce regularization term into LSTD fixed point equation • In particular, focus on L1 regularization • Encourages sparsity in feature weights (i.e., feature selection) • Avoids over-fitting to training samples • Avoids storing/inverting full k x k matrix
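A plausible form of the L1-regularized fixed-point equation, obtained by adding an L1 penalty to the least-squares objective that defines the TD fixed point. The notation is assumed: u is the optimization variable, Φ the feature matrix, R the reward vector, P the transition matrix, and β ≥ 0 the regularization parameter.

```latex
w = \arg\min_{u \in \mathbb{R}^k} \;
    \tfrac{1}{2} \left\| \Phi u - \left( R + \gamma P \Phi w \right) \right\|_2^2
    + \beta \left\| u \right\|_1
```

For a fixed w on the right-hand side this is an ordinary Lasso problem in u; the difficulty is that the target also depends on w.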

  26. Regularized LSTD Solution • Unfortunately, for L1 regularized LSTD • No closed-form solution for optimal w • Optimal w cannot even be expressed as solution to convex optimization problem

  27. Regularized LSTD Solution • Unfortunately, for L1 regularized LSTD • No closed-form solution for optimal w • Optimal w cannot even be expressed as solution to convex optimization problem • Fortunately, can be solved efficiently using algorithm similar to Least Angle Regression (LARS) (Efron et al., 2004)

  28. LARS-TD Algorithm • Intuition of our algorithm (LARS-TD) • Express L1-regularized fixed point in terms of optimality conditions for convex problem • Then, beginning at fully regularized solution (w=0), proceed down regularization path (piecewise linear adjustments to w, which can be computed analytically) • Stop when we reach the desired amount of regularization
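The first step of the intuition can be made concrete with a small checker for the optimality conditions. This is not the LARS-TD algorithm itself, only a sketch of the conditions it maintains, assuming the standard Lasso subgradient conditions applied at the fixed point: with c = Φᵀ(R + γΦ′w − Φw), every feature must satisfy |cᵢ| ≤ β, and every feature with a non-zero weight must satisfy cᵢ = β·sign(wᵢ). Function and variable names are hypothetical, and the data is random.

```python
import numpy as np

def l1_td_violation(Phi, Phi_next, rewards, w, beta, gamma, tol=1e-10):
    """Largest violation of the L1-regularized TD optimality conditions at w."""
    # Correlation of each feature with the TD residual at w.
    c = Phi.T @ (rewards + gamma * (Phi_next @ w) - Phi @ w)
    active = np.abs(w) > tol
    # Features with non-zero weight: c_i must equal beta * sign(w_i).
    on_active = np.abs(c[active] - beta * np.sign(w[active]))
    # Features with zero weight: |c_i| must not exceed beta.
    on_inactive = np.maximum(np.abs(c[~active]) - beta, 0.0)
    return float(np.concatenate([on_active, on_inactive, [0.0]]).max())

# Random illustrative data (assumption): 50 samples, 8 features.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 8))
Phi_next = rng.normal(size=(50, 8))
rewards = rng.normal(size=50)

# Fully regularized starting point: w = 0 is optimal once beta >= max |Phi^T R|.
w0 = np.zeros(8)
beta_start = np.abs(Phi.T @ rewards).max()
violation_at_start = l1_td_violation(Phi, Phi_next, rewards, w0, beta_start, 0.9)
violation_below = l1_td_violation(Phi, Phi_next, rewards, w0, 0.5 * beta_start, 0.9)
```

At β = max|ΦᵀR| the zero vector satisfies the conditions, which is why the path can start at w = 0. Lowering β makes w = 0 violate them, and that is the point at which a path-following method must bring a feature into the active set.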

  29. Theoretical Guarantee • Theorem: Under certain conditions (similar to those required to show convergence of ordinary TD) the L1-regularized fixed point exists and is unique, and the LARS-TD algorithm is guaranteed to find this fixed point.

  30. Computational Complexity • LARS-TD algorithm has computational complexity of approximately O(kp^3) • k = number of total features • p = number of non-zero features (<< k) • Importantly, algorithm is linear in number of total features

  31. Experimental Results

  32. Chain Domain • 20 state chain domain (Lagoudakis and Parr, 2003) • Twenty states, two actions, use LARS-TD for LSPI-style policy iteration • Five “relevant” features: RBFs • Varying number of irrelevant Gaussian noise features
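A feature matrix of the kind described on this slide might be built as below. Only the counts come from the talk (20 states, five RBF features, and the 800 samples and 1000 irrelevant features that appear in later slide titles); the RBF centers, the bandwidth, uniform state sampling, and drawing noise features independently per sample are all assumptions made for illustration.

```python
import numpy as np

def chain_features(states, n_irrelevant, rng, n_states=20, n_rbf=5, width=4.0):
    """Stack relevant RBF features with irrelevant Gaussian noise features.

    Centers are spaced evenly over the chain and share one bandwidth (assumed).
    """
    states = np.asarray(states, dtype=float)
    centers = np.linspace(1, n_states, n_rbf)
    rbf = np.exp(-(states[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))
    # Irrelevant features: pure noise, carrying no information about the state.
    noise = rng.normal(size=(len(states), n_irrelevant))
    return np.hstack([rbf, noise])

rng = np.random.default_rng(0)
states = rng.integers(1, 21, size=800)            # 800 sampled states in 1..20
Phi = chain_features(states, n_irrelevant=1000, rng=rng)
```

With 1005 columns and only 800 rows, an unregularized least-squares fit has more parameters than samples, which is the regime where feature selection is expected to matter.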

  33. Chain – 1000 Irrelevant Features

  34. Chain – 800 Samples

  35. Chain – 800 Samples

  36. Mountain Car Domain • Classic Mountain Car Domain • 500 training samples from 50 episodes • 1365 basis functions (automatically generated RBFs w/ many different bandwidth parameters)

  37. Mountain Car Domain • Classic Mountain Car Domain • 500 training samples from 50 episodes • 1365 basis functions (automatically generated RBFs w/ many different bandwidth parameters)

  38. Related Work • RL feature selection / generation: (Menache et al., 2005), (Keller et al., 2006), (Parr et al., 2007), (Loth et al., 2007), (Parr et al., 2008) • Regularization: (Farahmand et al., 2009) • Kernel selection: (Jung and Polani, 2006), (Xu et al., 2007)

  39. Summary • LSTD able to learn value function approximation using only samples from MDP, but can be computationally expensive and/or over-fit to data • Present feature selection framework for LSTD (using L1 regularization) • Encourages sparse solutions, prevents over-fitting, computationally efficient

  40. Thank you! Extended paper (with full proofs) available at: http://ai.stanford.edu/~kolter
