PEGASUS: A policy search method for large MDP’s and POMDP’s

Andrew Ng, Michael Jordan • Presented by: Geoff Levine

Motivation • For large, complicated domains, estimation of value functions/Q-functions can take a long time. • However, there often exist far simpler policies than the optimal that perform nearly as well. • Can directly search through a policy space

Preliminaries • MDP – M = (S, D, A, {P_sa(·)}, γ, R) • S – set of states • D – initial state distribution • A – set of actions • P_sa(·) : S -> [0,1] – transition probabilities • γ – discount factor • R – deterministic rewards (function of state)

Policies • Policy π : S -> A • Value function V^π : S -> Reals, V^π(s) = R(s) + γ E_{s'~P_{s,π(s)}}[V^π(s')] • For convenience, also define: V(π) = E_{s0~D}[V^π(s0)]
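The value-function recursion above can be illustrated on a tiny finite MDP. The following is a minimal sketch, not part of the original slides: the two-state transition table `P`, the rewards `R`, the start distribution `D`, and the policy are all made-up values chosen so the fixed point is easy to check by hand.

```python
def evaluate_policy(P, R, policy, gamma, tol=1e-10):
    """Iterate V(s) = R(s) + gamma * sum_s' P[s][a][s'] * V(s') until it stops changing."""
    n = len(R)
    V = [0.0] * n
    while True:
        new_V = [
            R[s] + gamma * sum(P[s][policy[s]][s2] * V[s2] for s2 in range(n))
            for s in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_V, V)) < tol:
            return new_V
        V = new_V


def policy_value(V, D):
    """V(pi) = E_{s0 ~ D}[V^pi(s0)] for a finite start distribution D."""
    return sum(d * v for d, v in zip(D, V))


# Hypothetical 2-state, 2-action MDP: P[s][a][s'] is a transition probability.
P = [
    [[0.9, 0.1], [0.2, 0.8]],  # transitions out of state 0
    [[0.5, 0.5], [0.0, 1.0]],  # transitions out of state 1
]
R = [0.0, 1.0]       # reward depends on the state only
D = [1.0, 0.0]       # always start in state 0
gamma = 0.9
policy = [1, 1]      # take action 1 in both states

V = evaluate_policy(P, R, policy, gamma)
value_of_policy = policy_value(V, D)
```

For this policy, state 1 is absorbing, so V(1) = 1 / (1 - 0.9) = 10, and V(0) solves V(0) = 0.9 (0.2 V(0) + 0.8 · 10). This kind of exact evaluation needs the full table P, which is what becomes impractical in large state spaces.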

Application Domain • Helicopter Flight (Hovering in Place) • 12-d continuous state space ([0,1]^12) • (x, y, z, pitch, roll, yaw, x', y', z', pitch', roll', yaw') • 4-d continuous action space ([0,1]^4) • (front/back cyclic pitch control, left/right cyclic pitch control, main rotor pitch control, tail rotor pitch control) • Timesteps correspond to 1/50th of a second • γ = 0.9995 • R(s) = -(a(x-x*)^2 + b(y-y*)^2 + c(z-z*)^2 + (yaw-yaw*)^2)

Helicopter

Transformation of MDP’s • Given M = (S, D, A, {Psa(.)}, γ, R) we construct M’ = (S’, D’, A, {P’sa(.)}, γ, R’), an MDP with deterministic state transitions • Intuition: Instead of rolling the dice when we move from state to state, we will roll all the dice we need ahead of time, and store their results as part of our state.

Parcheesi … …

Deterministic Simulative Model • Assume we have a deterministic functional representation of our MDP transitions • g : S x A x [0,1]^{d_p} -> S such that if p is distributed uniformly in [0,1]^{d_p} then Pr_p[g(s, a, p) = s'] = P_sa(s'). • More powerful than a generative model.
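For a finite MDP, one way to obtain such a g is the inverse-CDF construction with d_p = 1. The sketch below is an illustration under that assumption; the transition table `P` and the helper name `make_g` are hypothetical and do not come from the slides.

```python
import random


def make_g(P):
    """Build g(s, a, p) from a transition table P[s][a][s'] using the inverse CDF."""
    def g(s, a, p):
        cumulative = 0.0
        row = P[s][a]
        for s_next, prob in enumerate(row):
            cumulative += prob
            if p < cumulative:
                return s_next
        return len(row) - 1  # guard against floating-point round-off
    return g


# Hypothetical transition table for a 2-state, 2-action MDP.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
]
g = make_g(P)

# With p ~ Uniform[0, 1), the empirical frequency of s' = 1 from (s=0, a=0)
# should be close to P[0][0][1] = 0.1.
rng = random.Random(0)
n_samples = 10_000
freq = sum(g(0, 0, rng.random()) == 1 for _ in range(n_samples)) / n_samples
```

All randomness now enters through the argument p, so calling g twice with the same (s, a, p) always returns the same next state. A plain generative model, which only returns samples, does not expose that control.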

Transformations of MDP’s • S' = S x [0,1]^∞ • D' – draws (s, p1, p2, p3, …) such that s ~ D, and the p_i's are drawn iid from Uniform[0,1] • P'_ta(t') = 1 if g(s, a, p1) = s', 0 otherwise (taking d_p = 1) • R'(t) = R(s) • Here t = (s, p1, p2, p3, …) and t' = (s', p2, p3, …)

Policies • Given a policy space Π for S, consider a corresponding policy space Π' for S', s.t. • ∀ π in Π, ∃ π' in Π', ∀ s in S, ∀ p1, p2, …: π'((s, p1, p2, p3, …)) = π(s) • As the transition probabilities and rewards are equivalent in the transformed MDP: V_M^π(s) = E_{p~Unif[0,1]^∞}[V_M'^π'((s, p))] and V_M(π) = V_M'(π')

Policy Search • V_M^π(s0) = R(s0) + γ E_{s'~P_{s0,π(s0)}}[V_M^π(s')] • V_M'^π'((s0, p1, p2, …)) = R(s0) + γ R(s1) + γ^2 R(s2) + … • where s1 = g(s0, π(s0), p1), s2 = g(s1, π(s1), p2), and so on • As V_M(π) = V_M'(π'), we can estimate V_M(π) = E_{t0~D'}[V_M'^π'(t0)]
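The deterministic rollout above can be sketched in a few lines. Everything concrete here is an assumption made for illustration: a 1-D integer state, a hypothetical model `g` in which a move succeeds when p < 0.8 and slips in the opposite direction otherwise, the reward R(s) = -|s|, and a finite list of dice standing in for the infinite sequence.

```python
def g(s, a, p):
    """Hypothetical deterministic model: the move succeeds if p < 0.8, else it slips."""
    return s + a if p < 0.8 else s - a


def reward(s):
    """Hypothetical reward: negative distance from the origin."""
    return -abs(s)


def policy(s):
    """Step toward the origin; stay put once there."""
    if s > 0:
        return -1
    if s < 0:
        return 1
    return 0


def rollout_return(s0, dice, gamma):
    """R(s0) + gamma*R(s1) + gamma^2*R(s2) + ..., with s_{k+1} = g(s_k, pi(s_k), p_{k+1})."""
    total, discount, s = 0.0, 1.0, s0
    for p in dice:
        total += discount * reward(s)
        s = g(s, policy(s), p)
        discount *= gamma
    return total + discount * reward(s)  # reward of the last state reached


scenario = (2, [0.1, 0.9, 0.5])  # (s0, p1, p2, p3)
value = rollout_return(scenario[0], scenario[1], gamma=0.5)
```

With these dice the state sequence is 2, 1, 2, 1 (the second move slips), so the return is -2 - 0.5 - 0.5 - 0.125 = -3.125. Re-running with the same scenario always reproduces that number, which is the property the transformed MDP M' is built to provide.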

PEGASUS: Policy Evaluation-of-Goodness and Search Using Scenarios • Draw a sample of m initial states (scenarios) {s0^(1), s0^(2), s0^(3), …, s0^(m)} iid from D' • Estimate V_M(π) by the sample mean over scenarios: (1/m) Σ_{i=1..m} V_M'^π'(s0^(i))

PEGASUS • Given {s0^(1), s0^(2), s0^(3), …, s0^(m)}, the estimate of V_M(π) is a deterministic function of π • The sum of discounted rewards in each scenario is infinite, but we can truncate it after H_ε = log_γ(ε(1-γ)/(2R_max)) steps, introducing at most ε/2 error. This also lets us store our “dice rolls” in finite space.
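The truncation horizon can be checked numerically. This is a small sketch with made-up values (ε = 0.1, γ = 0.9, R_max = 1); it rounds H_ε up to an integer and compares the discarded tail, bounded by γ^H R_max / (1 - γ), against ε/2.

```python
import math


def horizon(eps, gamma, r_max):
    """Smallest integer H with H >= log_gamma(eps * (1 - gamma) / (2 * r_max))."""
    return math.ceil(math.log(eps * (1 - gamma) / (2 * r_max)) / math.log(gamma))


def tail_bound(h, gamma, r_max):
    """Upper bound on |sum_{t >= h} gamma^t R(s_t)| when |R| <= r_max."""
    return gamma ** h * r_max / (1 - gamma)


# Hypothetical values, chosen only for illustration.
eps, gamma, r_max = 0.1, 0.9, 1.0
h = horizon(eps, gamma, r_max)
error_at_h = tail_bound(h, gamma, r_max)
error_one_step_earlier = tail_bound(h - 1, gamma, r_max)
```

Here h comes out as 51: the tail after 51 steps is below ε/2 = 0.05, while stopping one step earlier would leave slightly more than 0.05. Because H_ε grows roughly like 1/(1 - γ), a discount factor close to 1 makes each scenario considerably longer.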

PEGASUS • Given the deterministic function V_M'(π), we can use an optimization technique to find argmax_π V_M'(π). • If working in a continuous, smooth, differentiable domain, we can use gradient ascent • If R is discontinuous, may need to use “continuation” methods to smooth it out
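Putting the pieces together, the whole procedure can be sketched on a toy problem. This is an illustration only and not the authors' implementation: the 1-D model `g`, the reward, the one-parameter policy class `make_policy(theta)`, and the exhaustive search over seven candidate values are all assumptions. The slides mention gradient ascent and random-walk search; plain enumeration is swapped in here because the toy policy class is tiny.

```python
import random


def g(s, a, p):
    """Hypothetical deterministic model: the move succeeds if p < 0.8, else it slips."""
    return s + a if p < 0.8 else s - a


def reward(s):
    """Hypothetical reward: negative distance from the origin."""
    return -abs(s)


def make_policy(theta):
    """Hypothetical policy class: step toward the target position theta."""
    def policy(s):
        if s > theta:
            return -1
        if s < theta:
            return 1
        return 0
    return policy


def draw_scenarios(m, horizon, rng):
    """Each scenario is (s0, [p1, ..., pH]): a start state plus pre-rolled dice."""
    return [(rng.randint(-5, 5), [rng.random() for _ in range(horizon)])
            for _ in range(m)]


def pegasus_estimate(policy, scenarios, gamma):
    """Average truncated discounted return over a fixed set of scenarios."""
    total = 0.0
    for s0, dice in scenarios:
        s, discount = s0, 1.0
        for p in dice:
            total += discount * reward(s)
            s = g(s, policy(s), p)
            discount *= gamma
        total += discount * reward(s)
    return total / len(scenarios)


# The scenarios are drawn once and reused for every candidate policy.
scenarios = draw_scenarios(m=30, horizon=40, rng=random.Random(0))
scores = {theta: pegasus_estimate(make_policy(theta), scenarios, gamma=0.9)
          for theta in range(-3, 4)}
best_theta = max(scores, key=scores.get)
```

Because every candidate is scored on the same dice, two policies differ in score only because they act differently, not because one drew luckier noise. That is what makes the estimate an ordinary deterministic objective that a standard optimizer can work on.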

Results • On 5x5 Gridworld POMDP, discovers near optimal policy in very few scenarios (~5) • On continuous space/action bicycle riding problem, results near optimal and far better than earlier reward shaping methods.

Helicopter Hovering • Policy represented by a hand-crafted neural network. • PEGASUS used to search through set of possible ANN weights. • Tried both gradient ascent and random walk searches

Neural Network Structure • (x, y, z) = (forward, sideways, down) • a1 = front/back cyclic pitch control, a2 = left/right cyclic pitch control, a3 = main rotor pitch control, a4 = tail rotor pitch control

Results • Able to keep helicopter stable on its maiden flight. Hovering • Neural network modified to fly competition class maneuvers Triangle • Finally, hovering upside down accomplished • http://ai.stanford.edu/~ang/rl-videos/helicopter/

Pseudo-Dimension • H is a set of functions X -> Reals • H shatters x1, x2, …, xd ∈ X if there exists a sequence of real numbers t1, t2, …, td s.t. {(h(x1) – t1, h(x2) – t2, …, h(xd) – td) | h ∈ H} intersects all 2^d orthants of R^d • The pseudo-dimension of H, dim_P(H), is the size of the largest set shattered by H

Lipschitz Continuity • A function f is Lipschitz continuous with Lipschitz bound B if ||f(x) – f(y)|| <= B||x – y|| (with respect to Euclidean norm on range and domain)

Realizable Dynamics in an MDP • Let S = [0, 1]^{d_s}, g: S x A x [0, 1]^{d_p} -> S be given. • We can define F_i as the set of functions {F_{ia} : S x [0, 1]^{d_p} -> [0, 1], F_{ia}(s, p1, …, p_{d_p}) = I_i(g(s, a, p1, …, p_{d_p})) | ∀ a in A}, where I_i(x) returns the ith coordinate of x

PEGASUS Theoretical Result • Let S = [0, 1]^{d_s}, policy class Π, and model g: S x A x [0, 1]^{d_p} -> S be given. • F is the family of realizable dynamics in the MDP and F_i the resulting family of coordinate functions. For all i, let dim_P(F_i) <= d, and let F_i be uniformly Lipschitz continuous with bound B • The reward function R is Lipschitz continuous with bound B_R. • Then if: [condition on the number of scenarios m; equation missing from the transcript] with probability at least 1 – δ, the PEGASUS estimate V'(π) will be uniformly close to the actual value: |V'(π) – V(π)| <= ε

Proof (1) • Think of the reward at step i as a random variable • V^π(s0^(1)) = R(s0^(1)) + γ R(s1^(1)) + γ^2 R(s2^(1)) + … • V^π(s0^(2)) = R(s0^(2)) + γ R(s1^(2)) + γ^2 R(s2^(2)) + … • V^π(s0^(3)) = R(s0^(3)) + γ R(s1^(3)) + γ^2 R(s2^(3)) + … • By bounding properties of each R(si^(j)), we can prove uniform convergence for V(π)

Proof (2) • Calling on work by Haussler, we show that if the pseudo-dimension of each F_i satisfies dim_P(F_i) <= d, we can “nearly” represent our world dynamics functions F_{ia} by a smaller set of functions of size [expression missing from the transcript]

Proof (3) • Similarly, if F_i uniformly has Lipschitz bound B, and the reward function R has Lipschitz bound B_R, we can “nearly” represent a function mapping from scenarios to ith-step rewards by a set of size [expression missing from the transcript]

Proof (4) • A result by Haussler then shows that with probability 1 – δ, our ith-step reward will be ε-close to the mean if we select a number of scenarios bounded by [expression missing from the transcript]

Proof (5) • Strengthening the bound to account for all H_ε rewards and employing the union bound, we find that a number of scenarios bounded by [expression missing from the transcript] is sufficient.

Critique • Success limited to very small fairly linear control problem, with high frequency controller • Lots of human bias incorporated into system • Restrictions/Linear Regression for model identification • Structure of neural net for each of the tasks • PAC learning guarantees still out of reach • No theoretical bounds on final policy

Bibliography • Haussler, D. Chapter on the PAC learning model and decision-theoretic generalizations, with applications to neural nets. In Mathematical Perspectives on Neural Networks, Lawrence Erlbaum Associates, 1995. Also in Information and Computation, Vol. 100, September 1992, pp. 78-150. • Ng, A. Y., Jordan, M. I. PEGASUS: A policy search method for large MDP’s and POMDP’s. In Uncertainty in Artificial Intelligence, Sixteenth Conference, 2000. • Ng, A. Y., Kim, H. J., Jordan, M. I., and Sastry, S. Autonomous helicopter flight via reinforcement learning. In Advances in Neural Information Processing Systems 16, 2004. • Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., and Liang, E. Inverted autonomous helicopter flight via reinforcement learning. In International Symposium on Experimental Robotics, 2004.

Application – Helicopter Flight • PEGASUS has been used to derive policies for hovering in place. • Later generalized to handle slow-motion maneuvers and upside-down hovering. • A GPS system relays state information (position and velocity) to an off-board computer, which calculates a 4-dimensional action

Model Identification • Construction of an MDP representation of the world dynamics • Transition Dynamics learned from several minutes of data based on human flight • Fit using linear regression • Forced to respect innate properties of the domain (gravity, symmetry)
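As a rough illustration of fitting transition dynamics by linear regression, the sketch below fits a scalar model s' ≈ a·s + b·u by ordinary least squares. It is a deliberately reduced stand-in: the real helicopter model is 12-dimensional and constrained by domain knowledge, while the data, the coefficients 0.9 and 0.5, and the function name `fit_linear_dynamics` here are invented for the example.

```python
def fit_linear_dynamics(states, actions, next_states):
    """Least-squares fit of s' ~ a*s + b*u (scalar state s, scalar action u)."""
    sxx = sum(s * s for s in states)
    sxu = sum(s * u for s, u in zip(states, actions))
    suu = sum(u * u for u in actions)
    sxy = sum(s * y for s, y in zip(states, next_states))
    suy = sum(u * y for u, y in zip(actions, next_states))
    det = sxx * suu - sxu * sxu
    if abs(det) < 1e-12:
        raise ValueError("states and actions are collinear; the fit is not unique")
    # Solve the 2x2 normal equations [[sxx, sxu], [sxu, suu]] @ [a, b] = [sxy, suy].
    a = (sxy * suu - sxu * suy) / det
    b = (sxx * suy - sxu * sxy) / det
    return a, b


# Made-up, noise-free "flight log" generated from s' = 0.9*s + 0.5*u.
states = [1.0, 2.0, -1.0, 0.5]
actions = [0.0, 1.0, 1.0, -2.0]
next_states = [0.9 * s + 0.5 * u for s, u in zip(states, actions)]

a_hat, b_hat = fit_linear_dynamics(states, actions, next_states)
```

On noise-free data the fit recovers the generating coefficients up to round-off. With real flight data the residuals would also be needed to model the noise that the dice p feed into g.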
