Regularized Off-Policy TD-Learning
Bo Liu, Sridhar Mahadevan, University of Massachusetts Amherst, {boliu, mahadeva}@cs.umass.edu
Ji Liu, ji-liu@cs.wisc.edu

Objective functions: MSBE(θ) = ||V_θ − T^π V_θ||²_D (mean-square Bellman error) and MSPBE(θ) = ||V_θ − Π T^π V_θ||²_D (mean-square projected Bellman error).

Problem Setting:
• Off-policy training is training on data from one policy in order to learn the value of another policy.
• The TD learning algorithm can diverge under off-policy training.
• TD with Gradient Correction (TDC) is an off-policy convergent RL algorithm [Sutton et al., 2009].
• Regularization helps improve the stability of TD methods.
• RO-TD is the first Regularized Off-Policy convergent TD algorithm with linear computational complexity.

ALGORITHMS:
• Objective function: the l1-regularized approximate solution of the linear-equation formulation of TDC.
• Convex-concave formulation: a bilinear saddle-point representation enables stochastic regularization.
• Linear computation: linear complexity with respect to sample and feature size, O(Nd).
• Control learning extension: RO-GQ(λ).

Off-Policy Convergence: TD with Gradient Correction (TDC)
1. TDC aims at minimizing the mean-square projected Bellman error (MSPBE).
2. The gradient update adds a correction term to the standard TD update (see the TDC sketch below).
3. TDC is in essence solving a linear equation Ax = b by stochastic gradient descent.

Algorithm Details: Convex-Concave Formulation
The regularized approximate solution of the linear equation is reached via a First-Order Method (FOM); see the primal-dual sketch below.
• Method 1: the FOM solver uses both the proximal gradient method and l2-constrained projection.
• Method 2: the FOM solver uses only l∞-constrained projection.

EXPERIMENTAL RESULTS: [Figures: feature selection and control learning experiments.]

Future Work:
• Mirror Descent: introducing mirror descent into off-policy TD learning and policy gradient algorithms.
• Options: scaling to large MDPs, including hierarchical mirror-descent RL, in particular extending to Semi-MDP Q-learning.
• Off-Policy Policy Gradient: to scale to larger MDPs, it is possible to design regularized off-policy policy gradient methods as well.

Convergence Analysis of RO-TD: the approximate saddle-point of RO-TD converges with probability 1 to the global minimizer.
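
TDC sketch. For concreteness, here is a minimal numpy sketch of the TDC update described above: the value-function weights receive the standard TD term plus a gradient-correction term, while auxiliary weights w are learned on a second timescale. Function and variable names are illustrative, not taken from the paper's code, and importance-sampling weights for off-policy sampling are omitted.

```python
import numpy as np

def tdc_update(theta, w, phi, phi_next, reward, gamma, alpha, beta):
    """One TDC (TD with gradient correction) step for a linear value
    function V(s) = theta . phi(s), following Sutton et al. (2009).
    theta: value-function weights; w: auxiliary weights for the correction."""
    # TD error for the observed transition (s, r, s').
    delta = reward + gamma * (phi_next @ theta) - (phi @ theta)
    # Standard TD term plus the correction term -gamma * phi' * (phi^T w).
    theta_new = theta + alpha * (delta * phi - gamma * phi_next * (phi @ w))
    # Auxiliary weights track the projection of the TD error onto the features.
    w_new = w + beta * (delta - phi @ w) * phi
    return theta_new, w_new
```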
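Primal-dual sketch. The convex-concave formulation and the FOM solvers can be illustrated schematically. The sketch below is an assumption-laden illustration rather than the paper's exact algorithm: it treats the l1-regularized linear system min_x ||Ax − b|| + rho*||x||_1 as a bilinear saddle-point problem and performs one stochastic primal-dual step with soft-thresholding (the proximal step for the l1 term) and an l2-ball projection on the dual variable, roughly in the spirit of Method 1 above. A_t and b_t denote hypothetical per-sample estimates of A and b, and the step sizes are simplified.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1 (element-wise soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def project_l2_ball(y, radius=1.0):
    """Project y onto the l2 ball of the given radius."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def saddle_point_step(x, y, A_t, b_t, alpha, rho):
    """One hypothetical primal-dual step for min_x ||Ax - b|| + rho*||x||_1,
    viewed as the saddle point min_x max_{||y|| <= 1} y^T (Ax - b) + rho*||x||_1.
    A_t, b_t are stochastic per-sample estimates of A and b."""
    # Dual ascent on y, followed by projection back onto the constraint ball.
    y_new = project_l2_ball(y + alpha * (A_t @ x - b_t))
    # Primal descent on x, followed by the proximal step for the l1 term.
    x_new = soft_threshold(x - alpha * (A_t.T @ y_new), alpha * rho)
    return x_new, y_new
```

Note that the explicit matrix A_t appears here only for clarity; in the RO-TD setting the per-sample estimate of A is a low-rank combination of feature vectors, so the matrix-vector products can be formed without building A explicitly, which is what keeps the per-step cost linear in the number of features, consistent with the O(Nd) complexity claim above.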