
A Semiparametric Statistics Approach to Model-Free Policy Evaluation



Presentation Transcript


  1. A Semiparametric Statistics Approach to Model-Free Policy Evaluation Tsuyoshi UENO(1), Motoaki KAWANABE(2), Takeshi MORI(1), Shin-ichi MAEDA(1), Shin ISHII(1),(3) (1) Kyoto University (2) Fraunhofer FIRST

  2. Summary of This Talk • We discuss LSTD-based policy evaluation from the viewpoint of semiparametric statistics and estimating functions. • How good is LSTD? • Can we improve LSTD? We show that LSTD is a type of estimating function method and evaluate the asymptotic estimation variance of LSTD. We derive an optimal estimating function with the minimum asymptotic estimation variance, and we propose a new policy evaluation algorithm (gLSTD).

  3. Model-Free Reinforcement Learning (Diagram: agent-environment loop; the agent's policy selects an action, the environment returns a reward and the next state.) Goal: obtain an optimal policy which maximizes the sum of future rewards.

  4. Policy Iteration [Sutton & Barto, 1998] Policy Evaluation (estimate the value function) Policy Improvement (update the policy). If the value function can be correctly estimated, policy iteration converges to the optimal policy. Value function estimation is a key component of policy iteration!!

  5. Policy Evaluation Method: LSTD [Bradtke & Barto, 1996] • Least Squares Temporal Difference (LSTD) • LSTD-based policy iteration algorithms have shown good practical performance. • Least Squares Policy Iteration (LSPI) [Lagoudakis & Parr, 2003] • Natural Actor-Critic (NAC) [Peters et al., 2003, 2005] • Representation Policy Iteration (RPI) [Mahadevan & Maggioni, 2007] LSTD is one of the important algorithms in the RL field.

  6. Least Squares Temporal Difference (LSTD) • Assumption: the value function is represented by a linear model with a feature vector and a parameter vector. • Bellman equation [Bellman, 1966]. We assume that the linear function 'completely' represents the value function (there is no bias).
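The equations on this slide are images in the original and do not survive in the transcript. As a minimal sketch in standard notation (the symbols below are assumptions, not recovered from the slide), the linear model and the Bellman equation read:

```latex
% Linear value-function model and Bellman equation (standard notation, assumed)
V^{\pi}(s) \approx \phi(s)^{\top}\theta, \qquad
V^{\pi}(s) = \mathbb{E}\bigl[\, r_t + \gamma\, V^{\pi}(s_{t+1}) \mid s_t = s \,\bigr],
```

where φ(s) is the feature vector, θ is the parameter to be estimated, and γ is the discount factor.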

  7. Least Squares Temporal Difference (LSTD) • Linearly approximated Bellman equation, written as a regression with an output (the reward), an input (the feature difference), a parameter, and noise. The input and the observation noise are mutually dependent!! So this is just a linear regression problem, but with errors in the (input) variables [Young, 1984].
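As a sketch of the regression form (notation assumed as above), substituting the linear model into the Bellman equation gives:

```latex
% Regression form of the linearly approximated Bellman equation (assumed notation)
r_t = \bigl(\phi(s_t) - \gamma\,\phi(s_{t+1})\bigr)^{\top}\theta + \epsilon_t,
\qquad x_t = \phi(s_t) - \gamma\,\phi(s_{t+1}), \quad y_t = r_t .
```

Because φ(s_{t+1}) appears both in the input x_t and in the noise ε_t (through the randomness of the state transition), the input and the noise are dependent, which is exactly the errors-in-variables situation.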

  8. Linear Regression with Errors in Variables • Ordinary least squares (OLS): because the observation noise depends on the input variable, the OLS estimator is biased. (The slide's scatter plot of y against x illustrates the biased OLS fit.)

  9. Instrumental Variable Method [Soderstrom and Stoica, 2002] • Introduce an instrumental variable that is correlated with the input x but uncorrelated with the noise. With output y and input x, the resulting instrumental-variable estimator is unbiased.
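A minimal numerical sketch of both points, OLS bias under input-dependent noise and its repair by an instrumental variable, in a toy scalar setting (the data-generating process and variable names are illustrative assumptions, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
theta_true = 2.0

z = rng.normal(size=n)            # latent regressor; also the instrument (correlated with x, not with the noise)
measurement = rng.normal(size=n)  # measurement noise on the input
x = z + measurement               # observed input (errors-in-variables)
y = theta_true * z + rng.normal(size=n)  # output generated from the latent regressor

# In the regression y = theta * x + eps, eps contains -theta * measurement,
# so the noise is correlated with x and OLS is biased (attenuated toward zero):
theta_ols = (x @ y) / (x @ x)

# Instrumental-variable estimator: theta_iv = (z^T x)^{-1} z^T y
theta_iv = (z @ y) / (z @ x)

print(f"OLS estimate: {theta_ols:.3f}  (biased)")
print(f"IV  estimate: {theta_iv:.3f}  (close to {theta_true})")
```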

  10. Least Squares Temporal Difference (LSTD) • LSTD = instrumental variable method, with the current-state feature as the instrumental variable. • Other functions of the current state (for example) are also valid instrumental variables. It is important to choose an appropriate instrumental variable.
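A minimal sketch of LSTD written as an instrumental-variable estimator, using the current-state feature φ(s_t) as the instrument (the function signature and array layout are assumptions for illustration):

```python
import numpy as np

def lstd(phis, rewards, phis_next, gamma=0.9):
    """LSTD as an instrumental-variable estimator.

    Input x_t = phi(s_t) - gamma * phi(s_{t+1}), output y_t = r_t,
    instrument xi_t = phi(s_t):  theta = (Xi^T X)^{-1} Xi^T y.
    """
    X = phis - gamma * phis_next   # (T, d) regression inputs
    A = phis.T @ X                 # sum_t phi(s_t) (phi(s_t) - gamma phi(s_{t+1}))^T
    b = phis.T @ rewards           # sum_t phi(s_t) r_t
    return np.linalg.solve(A, b)   # LSTD parameter estimate
```

Any other function of the current state that is correlated with the input but uncorrelated with the noise could be substituted for `phis` in the two matrix products; the choice affects the estimator's variance, which is the question addressed next.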

  11. Our Approach • How good is LSTD? • Can we improve LSTD? We analyze the asymptotic estimation variance of the instrumental variable method. We optimize the instrumental variable so as to minimize the asymptotic estimation variance. We introduce the viewpoint of semiparametric statistical inference.

  12. Semiparametric Statistics Approach • Semiparametric model: • the target parameter • nuisance parameters (with infinite degrees of freedom) • Linearly approximated Bellman equation. We don't know the noise distribution. We need to estimate only the target parameter, regardless of the nuisance parameters.

  13. Inference of a Semiparametric Model • Estimating function [Godambe, 1985] • [Conditions] • Estimating equation: for any nuisance parameter, the solution converges to the true parameter regardless of the nuisance parameter.
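The conditions on this slide are images in the original; the following is a sketch of the standard definition, with notation assumed rather than copied from the slide:

```latex
% Estimating function g and estimating equation (standard form; notation assumed)
\mathbb{E}_{\theta,\xi}\bigl[\, g(z_t;\theta) \,\bigr] = 0
  \;\; \text{for every nuisance parameter } \xi,
\qquad
\sum_{t=1}^{T} g\bigl(z_t;\hat{\theta}\bigr) = 0 .
```

Under standard regularity conditions (a nonsingular expected derivative with respect to θ and finite second moments), the root θ̂ of the estimating equation converges to the true parameter whatever the nuisance parameters are.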

  14. Estimating Functions • Estimating function = LSTD • Estimating function = instrumental variable method (with the corresponding instrumental variable). Are there any other estimating functions?

  15. Are There Any Other Estimating Functions? No!! Proposition 1: every admissible estimating function must have the form shown on the previous slide (the instrumental-variable form). An "inadmissible" estimating function is one for which a superior estimating function exists.

  16. Asymptotic Variance of LSTD-Based Estimators Lemma 2: the asymptotic estimation variance of an estimating-function estimator of the value function is given in closed (sandwich) form; see the sketch below. Which instrumental variable achieves the minimum asymptotic variance?
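The expression of Lemma 2 is an image in the original. As a sketch, estimating-function estimators have the standard sandwich-form asymptotic variance (generic symbols, not recovered from the slide):

```latex
\sqrt{T}\,\bigl(\hat{\theta}-\theta^{*}\bigr)
  \xrightarrow{\;d\;} \mathcal{N}\bigl(0,\; A^{-1}\Sigma A^{-\top}\bigr),
\qquad
A = \mathbb{E}\bigl[\partial_{\theta}\, g(z_t;\theta^{*})\bigr],
\quad
\Sigma = \mathbb{E}\bigl[g(z_t;\theta^{*})\, g(z_t;\theta^{*})^{\top}\bigr].
```

For the instrumental-variable form g = ξ(s_t)(y_t − x_t^⊤θ), both A and Σ depend on the instrument ξ, which is why the choice of instrument matters.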

  17. The Optimal Estimating Function Theorem 1: the optimal instrumental variable with the minimum asymptotic variance is given in closed form, but it involves the true parameter (unknown) and unknown conditional expectations, so approximation is necessary. This leads to gLSTD.

  18. gLSTD The optimal instrumental variable depends on • the regression residual under the true parameter (unknown), and • unknown conditional expectations. gLSTD replaces the residual under the true parameter with the residual of the LSTD estimator, and approximates the conditional expectations using a sample-based function approximation technique.

  19. Summary of gLSTD • Calculate the initial (LSTD) estimator and replace the true residual with the estimated residual • Approximate the conditional expectations • Construct the instrumental variable • Calculate the gLSTD estimator
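A minimal Python sketch of these four steps, assuming tabular states so that the conditional expectations can be approximated by per-state sample averages. The exact form of the optimal instrument is the paper's Theorem 1; the construction marked below is an illustrative reading, not a verbatim transcription of the slide's formula.

```python
import numpy as np

def glstd(states, rewards, next_states, phi, gamma=0.9):
    """Sketch of the gLSTD procedure (illustrative, not the paper's exact formulas).

    phi: function mapping a (tabular) state to a feature vector.
    """
    Phi  = np.array([phi(s) for s in states])       # (T, d)
    Phi2 = np.array([phi(s) for s in next_states])  # (T, d)
    X = Phi - gamma * Phi2
    r = np.asarray(rewards, dtype=float)

    # Step 1: initial LSTD estimator and its residuals (replacing the true residuals).
    theta0 = np.linalg.solve(Phi.T @ X, Phi.T @ r)
    resid = r - X @ theta0

    # Step 2: approximate the conditional expectations E[x_t | s_t] and E[resid_t^2 | s_t]
    # by per-state sample averages (a simple sample-based function approximation).
    cond_x, cond_v = {}, {}
    for s in set(states):
        idx = [t for t, st in enumerate(states) if st == s]
        cond_x[s] = X[idx].mean(axis=0)
        cond_v[s] = max(np.mean(resid[idx] ** 2), 1e-8)

    # Step 3: construct the instrumental variable from the estimated quantities
    # (illustrative form: conditional-mean input scaled by the inverse residual variance).
    Xi = np.array([cond_x[s] / cond_v[s] for s in states])

    # Step 4: recompute the estimator with the new instrument.
    return np.linalg.solve(Xi.T @ X, Xi.T @ r)
```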

  20. Simulation (Markov Random Walk) (Chain diagram: five states; per-state rewards R=0, R=0, R=0, R=0.5, R=1.0.) • Conditions of the simulation experiment • Policy: random • Number of steps: 100 • Number of episodes: 100 • Discount factor: 0.9 • Basis functions: three basis functions generated by the diffusion model [Mahadevan & Maggioni, 2007]
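A sketch of the data-generation part of this experiment (the transition rule, the assignment of rewards to states, and the basis functions below are simplifying assumptions for illustration; the talk uses three diffusion-model basis functions):

```python
import numpy as np

def run_episode(rng, n_states=5, n_steps=100):
    """Random walk on a 5-state chain under the random policy (illustrative dynamics)."""
    # Per-state rewards as read off the slide's chain diagram (assignment to states assumed).
    rewards = np.array([0.0, 0.0, 0.0, 0.5, 1.0])
    s = rng.integers(n_states)
    states, rs, next_states = [], [], []
    for _ in range(n_steps):
        s_next = int(np.clip(s + rng.choice([-1, 1]), 0, n_states - 1))  # random left/right move
        states.append(s); rs.append(rewards[s_next]); next_states.append(s_next)
        s = s_next
    return states, rs, next_states

def phi(s, n_states=5):
    """Simple 3-dimensional polynomial basis (stand-in for the diffusion-model basis)."""
    x = s / (n_states - 1)
    return np.array([1.0, x, x * x])

rng = np.random.default_rng(0)
episodes = [run_episode(rng) for _ in range(100)]  # 100 episodes of 100 steps, discount 0.9
```

The transitions collected this way can be fed to the LSTD and gLSTD sketches above to compare their mean-squared errors across the 100 episodes.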

  21. Simulation Result (box plot showing the median and the upper and lower quartiles). The gLSTD estimator achieved about 20% smaller MSE than the LSTD estimator.

  22. Conclusion • We discussed LSTD-based policy evaluation in the framework of the semiparametric statistics approach. • We evaluated the asymptotic variance of LSTD-based estimators. • We derived the optimal estimating function with the minimum asymptotic variance and proposed a practical implementation: gLSTD. • Through a simple Markov chain problem, we demonstrated that gLSTD reduces the estimation variance of LSTD.

  23. Future Work • Application to policy improvement: Least Squares Policy Iteration (LSPI), Natural Actor-Critic (NAC), etc. • Extend "A Semiparametric Approach to Model-Free Policy Evaluation" toward "A Semiparametric Approach to Model-Free Reinforcement Learning".

  24. End. Thank you for your attention!!

  25. Cost Function

  26. Simulation Result (per-state results for states 1-5).

  27. Questions • How good is LSTD? • Can we improve LSTD? We show that LSTD is a type of estimating function method and evaluate the asymptotic estimation variance of LSTD. We derive the optimal estimating function with the minimum asymptotic estimation variance.

  28. The Suboptimal Estimating Function (LSTDc) • gLSTD requires estimating functions that depend on the current state. • To avoid estimating these functions, we simply replace them by a constant value, and optimize that constant to minimize the asymptotic variance.

  29. The Suboptimal Estimating Function (LSTDc) Theorem 2: the optimal constant shift is given in closed form.

  30. Summary of This Talk • We introduce a semiparametric statistical viewpoint for estimating value functions with a linear model. • Our aims: • Evaluate the estimation variance of value functions • Develop more efficient estimation methods

  31. Summary of Our Main Results • Formulate the estimation problem of linearly-represented value functions as a semiparametric inference problem • Evaluate the asymptotic variance of value-function estimators • Derive the optimal estimation method with the minimum asymptotic variance

  32. Estimating Functions • Question: which function is appropriate when more than one estimating function exists? • Answer: choose the estimating function with the minimum asymptotic variance.

  33. Instrumental Variable (IV) Method • Instrumental variable: correlated with the input variable, but uncorrelated with the noise. • Instrumental variable method

  34. Statistics approach

  35. What is the Semiparametric Approach? Semiparametric model: parameter and nuisance parameter. Estimating function [Godambe, 1985] [Conditions]. We need to estimate the parameter regardless of the nuisance parameter; the solution of the estimating equation converges to the true parameter. See [Godambe, 1985] for details.
