Sparse Q-learning with Mirror Descent
Sridhar Mahadevan and Bo Liu, University of Massachusetts Amherst, Autonomous Learning Laboratory, {mahadeva, boliu}@cs.umass.edu

ABSTRACT:
• This paper explores a new framework for reinforcement learning (RL) based on online convex optimization, in particular mirror descent and related algorithms.
• A new class of proximal-gradient-based temporal difference (TD) methods is presented, built on different Bregman divergences; these are more powerful than regular TD learning.
• A new family of first-order sparse RL methods is proposed, able to find the sparse fixed point of an L1-regularized Bellman equation at significantly lower computational cost than previous second-order methods.

BACKGROUND:
• Mirror descent is an enhanced gradient method that can be viewed as a proximal algorithm in which the distance-generating function is a Bregman divergence (an illustrative update sketch appears at the end of this page).

MOTIVATION:
• Finding this sparse fixed point is a two-step nested optimization problem:
• Projection step: project the Bellman backup onto the feature subspace spanned by the basis Φ, with L1 regularization to induce sparsity.
• Fixed-point step: find weights whose value estimate coincides with its own projected Bellman backup, i.e., the sparse fixed point of the L1-regularized Bellman equation (a reconstruction of the two equations appears at the end of this page).

ALGORITHMS:
• Mirror-descent Q-learning with a decaying p-norm link function.
• Iterative soft-thresholding for sparsity (sketched at the end of this page).

ERROR BOUND ANALYSIS:
• The error bound is controlled by: (1) the expressiveness of the Φ-subspace, (2) the sparsity parameter, and (3) the quality of the empirical ℓ1 solver.

EXPERIMENTAL RESULTS:
• Convergence comparison with LARS-TD (figure): less difference between successive weights and less running time at each iteration.
• Variance comparison with Q-learning (figure): less variance compared with Q-learning.
• Control learning (figure).

DISCUSSIONS AND FUTURE WORK:
• Comparison of the p-norm link function with Exponentiated Gradient (EG): EG is unable to generate sparse solutions, and EG-based methods are prone to overflow in the coefficients.
• The p-norm link function interpolates between additive and multiplicative gradient updates and is thus more flexible and robust across basis functions.
• The regret bound with respect to different link functions in the RL setting remains to be established.
• Introducing mirror descent into off-policy TD learning and policy gradient algorithms.
• Scaling to large MDPs, including hierarchical mirror-descent RL, in particular extending to semi-MDP Q-learning.

Proceedings of the Conference on Uncertainty in AI (UAI), August 15-17, 2012, Catalina Island, CA
For more information, please contact: Prof. Sridhar Mahadevan, Dept. of Computer Science, University of Massachusetts Amherst, Email: mahadeva@cs.umass.edu
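The projection-step and fixed-point-step equations in the MOTIVATION box were images in the original poster. A plausible reconstruction, following the standard L1-regularized Bellman fixed-point formulation that the abstract describes (the symbols Φ, Φ', R, γ, λ are assumed here, not copied from the poster):

```latex
% Projection step: L1-regularized least-squares projection of the
% Bellman backup onto the subspace spanned by the basis \Phi
w(\theta) = \arg\min_{w}\,
  \bigl\| \Phi w - (R + \gamma \Phi' \theta) \bigr\|_2^2
  + \lambda \, \| w \|_1

% Fixed-point step: seek weights reproduced by their own projection,
% i.e., the sparse fixed point of the L1-regularized Bellman equation
\theta^{*} = w(\theta^{*})
```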
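To make the mirror-descent bullet under BACKGROUND concrete, here is a minimal NumPy sketch of one step with the p-norm distance-generating function ψ(θ) = ½‖θ‖_q², 1/p + 1/q = 1; the function names and this particular choice of ψ are assumptions consistent with standard p-norm mirror descent, not the paper's verbatim pseudocode.

```python
import numpy as np

def pnorm_link(v, r):
    """Gradient of (1/2)||v||_r^2, the r-norm link function.

    Maps between the primal and dual spaces of mirror descent; its
    inverse is the same map with the conjugate exponent r/(r-1).
    """
    norm = np.linalg.norm(v, ord=r)
    if norm == 0.0:
        return np.zeros_like(v)
    return np.sign(v) * np.abs(v) ** (r - 1) / norm ** (r - 2)

def mirror_descent_step(theta, grad, alpha, p):
    """One mirror-descent step with distance-generating function
    (1/2)||.||_q^2, where 1/p + 1/q = 1."""
    q = p / (p - 1.0)
    dual = pnorm_link(theta, q)   # map primal weights into the dual space
    dual -= alpha * grad          # ordinary gradient step, taken in the dual
    return pnorm_link(dual, p)    # map back with the conjugate link function
```

With p = 2 both links are the identity and the step reduces to plain gradient descent; larger p makes the update increasingly multiplicative, which is the interpolation between additive and multiplicative updates that the DISCUSSIONS box refers to.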
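The iterative soft-thresholding named under ALGORITHMS is, in standard proximal-gradient treatments, the proximal operator of the L1 penalty; a common implementation is below (the name soft_threshold and the threshold value are ours, for illustration).

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrink every coefficient
    toward zero by tau and clip those that cross zero. The exact zeros
    this produces are what make the learned weights sparse."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)
```

Applied after each gradient step (typically with tau = alpha * lambda), it drives small coefficients exactly to zero, which is how a first-order method can reach sparse solutions without the second-order machinery of LARS-TD.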
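Combining the two helpers above, one plausible shape of the full sparse mirror-descent Q-learning update is sketched below; the semi-gradient TD-error form, the placement of the soft-threshold in the dual space, and all names are illustrative assumptions rather than the paper's verbatim algorithm.

```python
def sparse_mirror_q_step(theta, phi, phi_next, reward, gamma, alpha, lam, p):
    """One illustrative sparse mirror-descent Q-learning step.

    phi: features of the current state-action pair; phi_next: features of
    the greedy next state-action pair; lam: L1 regularization strength.
    Reuses pnorm_link and soft_threshold from the sketches above.
    """
    q = p / (p - 1.0)
    td_error = reward + gamma * phi_next.dot(theta) - phi.dot(theta)
    dual = pnorm_link(theta, q)                # primal -> dual space
    dual += alpha * td_error * phi             # semi-gradient TD step
    dual = soft_threshold(dual, alpha * lam)   # L1 proximal step (sparsity)
    return pnorm_link(dual, p)                 # dual -> primal space
```

A decaying p-norm, per the ALGORITHMS annotation, would shrink p across iterations so the update gradually approaches an additive gradient step; the exact schedule is not recoverable from the poster.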