PEGASUS: A policy search method for large MDP’s and POMDP’s Andrew Ng, Michael Jordan Presented by: Geoff Levine
Motivation • For large, complicated domains, estimation of value functions/Q-functions can take a long time. • However, there often exist far simpler policies than the optimal that perform nearly as well. • Can directly search through a policy space
Preliminaries • MDP – $M = (S, D, A, \{P_{sa}(\cdot)\}, \gamma, R)$ • $S$ – set of states • $D$ – initial state distribution • $A$ – set of actions • $P_{sa}(\cdot) : S \to [0,1]$ – transition probabilities • $\gamma$ – discount factor • $R$ – deterministic rewards (function of state)
Policies • Policy $\pi : S \to A$ • Value function $V^{\pi} : S \to \mathbb{R}$, with $V^{\pi}(s) = R(s) + \gamma\, E_{s' \sim P_{s\pi(s)}}[V^{\pi}(s')]$ • For convenience, also define $V(\pi) = E_{s_0 \sim D}[V^{\pi}(s_0)]$
Application Domain • Helicopter flight (hovering in place) • 12-d continuous state space ($[0,1]^{12}$) • $(x, y, z, \text{pitch}, \text{roll}, \text{yaw}, x', y', z', \text{pitch}', \text{roll}', \text{yaw}')$ • 4-d continuous action space ($[0,1]^{4}$) • (front/back cyclic pitch control, left/right cyclic pitch control, main rotor pitch control, tail rotor pitch control) • Timesteps correspond to 1/50th of a second • $\gamma = 0.9995$ • $R(s) = -\big(a(x-x^*)^2 + b(y-y^*)^2 + c(z-z^*)^2 + (\text{yaw}-\text{yaw}^*)^2\big)$
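As a rough back-of-the-envelope check (not from the slides), the discount factor can be turned into an effective horizon with the common rule of thumb $1/(1-\gamma)$, combined with the 1/50 s timestep stated above:

```latex
\frac{1}{1-\gamma} = \frac{1}{1 - 0.9995} = 2000 \ \text{timesteps},
\qquad
2000 \times \tfrac{1}{50}\ \text{s} = 40\ \text{s}
```

Under that rule of thumb, rewards up to roughly 40 seconds ahead still carry noticeable weight in the value of a hovering policy.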
Transformation of MDP’s • Given $M = (S, D, A, \{P_{sa}(\cdot)\}, \gamma, R)$ we construct $M' = (S', D', A, \{P'_{sa}(\cdot)\}, \gamma, R')$, an MDP with deterministic state transitions • Intuition: Instead of rolling the dice each time we move from state to state, we roll all the dice we will need ahead of time and store the results as part of the state.
Parcheesi • [Figure not recoverable from the slide text]
Deterministic Simulative Model • Assume we have a deterministic functional representation of our MDP transitions • $g : S \times A \times [0,1]^{d_P} \to S$ such that if $p$ is distributed uniformly in $[0,1]^{d_P}$ then $\Pr_p[g(s, a, p) = s'] = P_{sa}(s')$. • More powerful than a generative model.
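One way to build such a $g$ for a small discrete MDP is inverse-CDF sampling. The sketch below is only an illustration under assumptions: the two-state MDP, the table `P`, and the action names are hypothetical and do not come from the slides, and it uses $d_P = 1$.

```python
import random
from collections import Counter

# Hypothetical two-state MDP, used only for illustration.
# P[(s, a)] lists (next_state, probability) pairs.
P = {
    ("A", "stay"): [("A", 0.9), ("B", 0.1)],
    ("A", "go"):   [("A", 0.2), ("B", 0.8)],
    ("B", "stay"): [("B", 1.0)],
    ("B", "go"):   [("A", 0.5), ("B", 0.5)],
}

def g(s, a, p):
    """Deterministic simulative model: map (s, a, p), with p in [0, 1), to s'.

    Inverse-CDF construction: walk the cumulative distribution of P_sa and
    return the first next state whose cumulative mass exceeds p.
    """
    cumulative = 0.0
    for next_state, prob in P[(s, a)]:
        cumulative += prob
        if p < cumulative:
            return next_state
    return P[(s, a)][-1][0]  # guard against floating-point round-off

# With p ~ Uniform[0, 1), the empirical frequency of s' approximates P_sa(s').
rng = random.Random(0)
counts = Counter(g("A", "go", rng.random()) for _ in range(100_000))
freq_B = counts["B"] / 100_000
```

All randomness enters through `p`, so calling `g` twice with the same arguments always gives the same next state.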
Transformations of MDP’s • $S' = S \times [0,1]^{\infty}$ • $D'$ – draw $(s, p_1, p_2, p_3, \ldots)$ such that $s \sim D$ and the $p_i$ are drawn iid from Uniform[0,1] • $P'_{ta}(t') = 1$ if $g(s, a, p_1) = s'$, and $0$ otherwise (taking $d_P = 1$) • $R'(t) = R(s)$ • Here $t = (s, p_1, p_2, p_3, \ldots)$ and $t' = (s', p_2, p_3, \ldots)$
Policies • Given a policy space $\Pi$ for $S$, consider a corresponding policy space $\Pi'$ for $S'$, such that • $\forall \pi \in \Pi,\ \exists \pi' \in \Pi',\ \forall s \in S,\ \forall p_1, p_2, \ldots$: $\pi'((s, p_1, p_2, p_3, \ldots)) = \pi(s)$ • As the transition probabilities and rewards are equivalent in the transformed MDP: $V_M^{\pi}(s) = E_{p \sim \mathrm{Unif}[0,1]^{\infty}}[V_{M'}^{\pi'}((s, p))]$ and $V_M(\pi) = V_{M'}(\pi')$
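Policy Search • $V_M^{\pi}(s_0) = R(s_0) + \gamma\, E_{s' \sim P_{s_0\pi(s_0)}}[V_M^{\pi}(s')]$ • $V_{M'}^{\pi'}((s_0, p_1, p_2, \ldots)) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots$ • where $s_1 = g(s_0, \pi(s_0), p_1)$, $s_2 = g(s_1, \pi(s_1), p_2)$, and so on (recall that $\pi'$ ignores the stored random numbers and acts as $\pi$) • As $V_M(\pi) = V_{M'}(\pi')$, we can estimate $V_M(\pi)$ through $V_{M'}(\pi') = E_{t_0 \sim D'}[V_{M'}^{\pi'}(t_0)]$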
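PEGASUS: Policy Evaluation-of-Goodness And Search Using Scenarios • Draw a sample of $m$ initial states (scenarios) $\{s_0^{(1)}, s_0^{(2)}, s_0^{(3)}, \ldots, s_0^{(m)}\}$ iid from $D'$ • Estimate $V_M(\pi)$ by the sample average $\hat{V}(\pi) = \frac{1}{m} \sum_{i=1}^{m} V_{M'}^{\pi'}(s_0^{(i)})$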
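PEGASUS • Given the scenarios $\{s_0^{(1)}, s_0^{(2)}, s_0^{(3)}, \ldots, s_0^{(m)}\}$, the estimate of $V_M(\pi)$ is a deterministic function of $\pi$ • The sum defining each scenario's return is infinite, but it can be truncated after $H_\varepsilon = \log_\gamma\!\big(\varepsilon(1-\gamma)/(2R_{\max})\big)$ steps, introducing at most $\varepsilon/2$ error. Truncation also allows us to store our “dice rolls” in finite space.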
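The scenario-based estimate and the truncation horizon can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the one-dimensional dynamics `g`, the clipped quadratic `reward`, and the linear policy $a = -kx$ are hypothetical stand-ins chosen so the example stays self-contained.

```python
import math
import random

GAMMA = 0.9

def horizon(eps, r_max, gamma=GAMMA):
    """H_eps = log_gamma(eps * (1 - gamma) / (2 * r_max)), rounded up."""
    return math.ceil(math.log(eps * (1 - gamma) / (2 * r_max)) / math.log(gamma))

def g(x, a, p):
    """Toy deterministic simulative model (assumed): noise depends only on p."""
    return x + a + 0.2 * (p - 0.5)

def reward(x):
    """Toy reward (assumed), clipped so that |R| <= R_max = 1."""
    return -min(x * x, 1.0)

def draw_scenarios(m, h, seed=0):
    """Each scenario is an initial state plus the h 'dice rolls' it will use."""
    rng = random.Random(seed)
    return [(rng.uniform(-1, 1), [rng.random() for _ in range(h)])
            for _ in range(m)]

def pegasus_estimate(k, scenarios, gamma=GAMMA):
    """Average truncated discounted return of the policy a = -k * x."""
    total = 0.0
    for x0, rolls in scenarios:
        x, discount, ret = x0, 1.0, reward(x0)
        for p in rolls:
            x = g(x, -k * x, p)   # same rolls every time: no fresh randomness
            discount *= gamma
            ret += discount * reward(x)
        total += ret
    return total / len(scenarios)

H = horizon(eps=0.1, r_max=1.0)
scenarios = draw_scenarios(m=20, h=H)
value = pegasus_estimate(0.5, scenarios)
```

Because the scenarios are drawn once and then reused, evaluating the same policy parameter twice returns exactly the same number, which is what makes the estimate usable as an ordinary objective function.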
PEGASUS • Given the deterministic estimate of $V_{M'}(\pi)$, we can use a standard optimization technique to find the policy $\pi$ that maximizes it. • If working in a continuous, smooth, differentiable domain, we can use gradient ascent • If $R$ is discontinuous, we may need to use “continuation” methods to smooth it out
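Since the estimate is deterministic once the scenarios are fixed, any off-the-shelf optimizer can be applied to it. The sketch below uses a simple derivative-free hill climb rather than gradient ascent, and everything in it is a hypothetical stand-in: the one-dimensional dynamics, the quadratic reward, the policy $a = -kx$, and the horizon `H` were chosen for illustration only.

```python
import random

GAMMA, H = 0.9, 51  # discount and truncation horizon (illustrative values)

def estimate(k, scenarios):
    """Scenario-based value estimate for the toy policy a = -k * x."""
    total = 0.0
    for x0, rolls in scenarios:
        x, discount, ret = x0, 1.0, -x0 * x0
        for p in rolls:
            x = x - k * x + 0.2 * (p - 0.5)  # toy dynamics with stored noise
            discount *= GAMMA
            ret -= discount * x * x          # toy reward R(x) = -x^2
        total += ret
    return total / len(scenarios)

# Draw the scenarios once; every candidate policy is scored on the same ones.
rng = random.Random(0)
scenarios = [(rng.uniform(-1, 1), [rng.random() for _ in range(H)])
             for _ in range(20)]

def hill_climb(k, step=0.5, tol=1e-3):
    """Move k by +/- step while the estimate improves; otherwise halve step."""
    best = estimate(k, scenarios)
    while step > tol:
        improved = False
        for cand in (k + step, k - step):
            value = estimate(cand, scenarios)
            if value > best:
                k, best, improved = cand, value, True
                break
        if not improved:
            step /= 2
    return k, best

k_star, v_star = hill_climb(k=0.0)
```

In this toy problem the search settles near $k = 1$, the gain that cancels the current state in one step. Comparisons between candidates are meaningful because all of them are scored against identical dice rolls.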
Results • On 5x5 Gridworld POMDP, discovers near optimal policy in very few scenarios (~5) • On continuous space/action bicycle riding problem, results near optimal and far better than earlier reward shaping methods.
Helicopter Hovering • Policy represented by a hand-crafted neural network. • PEGASUS used to search through set of possible ANN weights. • Tried both gradient ascent and random walk searches
Neural Network Structure • $(x, y, z)$ = (forward, sideways, down) • $a_1$ = front/back cyclic pitch control • $a_2$ = left/right cyclic pitch control • $a_3$ = main rotor pitch control • $a_4$ = tail rotor pitch control
Results • Able to keep the helicopter stable on its maiden flight (hovering) • Neural network modified to fly competition-class maneuvers (triangle) • Finally, hovering upside down accomplished • http://ai.stanford.edu/~ang/rl-videos/helicopter/
Pseudo-Dimension • $H$ is a set of functions $X \to \mathbb{R}$ • $H$ shatters $x_1, x_2, \ldots, x_d \in X$ if there exists a sequence of real numbers $t_1, t_2, \ldots, t_d$ such that $\{(h(x_1) - t_1, h(x_2) - t_2, \ldots, h(x_d) - t_d) \mid h \in H\}$ intersects all $2^d$ orthants of $\mathbb{R}^d$ • The pseudo-dimension of $H$, $\dim_P(H)$, is the size of the largest set shattered by $H$
Lipschitz Continuity • A function $f$ is Lipschitz continuous with Lipschitz bound $B$ if $\|f(x) - f(y)\| \le B\,\|x - y\|$ (with respect to the Euclidean norm on range and domain)
Realizable Dynamics in an MDP • Let $S = [0,1]^{d_S}$ and $g : S \times A \times [0,1]^{d_P} \to S$ be given. • We can define $F_i$ as the set of functions $\{F_i^a : S \times [0,1]^{d_P} \to [0,1],\ F_i^a(s, p_1, \ldots, p_{d_P}) = I_i(g(s, a, p_1, \ldots, p_{d_P})) \mid a \in A\}$, where $I_i(x)$ returns the $i$th coordinate of $x$
PEGASUS Theoretical Result • Let $S = [0,1]^{d_S}$, a policy class $\Pi$, and a model $g : S \times A \times [0,1]^{d_P} \to S$ be given. • $F$ is the family of realizable dynamics in the MDP and $F_i$ the resulting families of coordinate functions. For all $i$, let $\dim_P(F_i) \le d$, and let each $F_i$ be uniformly Lipschitz continuous with bound $B$. • The reward function $R$ is Lipschitz continuous with bound $B_R$. • Then, if the number of scenarios $m$ is large enough [bound missing in source], with probability at least $1 - \delta$ the PEGASUS estimate $\hat{V}(\pi)$ will be uniformly close to the actual value: $|\hat{V}(\pi) - V(\pi)| \le \varepsilon$ for all $\pi \in \Pi$
Proof (1) • Think of the reward at step $i$ as a random variable: $V^{\pi}(s_0^{(j)}) = R(s_0^{(j)}) + \gamma R(s_1^{(j)}) + \gamma^2 R(s_2^{(j)}) + \cdots$ for each scenario $j = 1, 2, 3, \ldots$ • By bounding properties of each $R(s_i^{(j)})$, we can prove uniform convergence for $V(\pi)$
Proof (2) • Calling on work by Haussler, we show that if the pseudo-dimension of each $F_i$ satisfies $\dim_P(F_i) \le d$, we can “nearly” represent our world dynamics functions $F_i^a$ by a smaller set of functions of size [formula missing in source]
Proof (3) • Similarly, if each $F_i$ uniformly has Lipschitz bound $B$, and the reward function $R$ has Lipschitz bound $B_R$, we can “nearly” represent a function mapping from scenarios to $i$th-step rewards by a set of size [formula missing in source]
Proof (4) • A result by Haussler then shows that, with probability $1 - \delta$, our $i$th-step reward will be $\varepsilon$-close to the mean if we select a number of scenarios bounded by [formula missing in source]
Proof (5) • Strengthening the bound to account for all $H_\varepsilon$ rewards and employing the union bound, we find that a number of scenarios bounded by [formula missing in source] is sufficient.
Critique • Success limited to very small, fairly linear control problems with a high-frequency controller • Lots of human bias incorporated into the system • Restrictions and linear regression for model identification • Structure of the neural net for each of the tasks • PAC learning guarantees still out of reach • No theoretical bounds on the final policy
Bibliography • Haussler, D. Decision-theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, Vol. 100, September 1992, pp. 78-150. Also a chapter in Mathematical Perspectives on Neural Networks, Lawrence Erlbaum Associates, 1995. • Ng, A. Y., Jordan, M. I. PEGASUS: A policy search method for large MDP’s and POMDP’s. In Uncertainty in Artificial Intelligence, Sixteenth Conference, 2000. • Ng, A. Y., Kim, H. J., Jordan, M. I., & Sastry, S. Autonomous helicopter flight via reinforcement learning. Advances in Neural Information Processing Systems 16, 2004. • Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., & Liang, E. Inverted autonomous helicopter flight via reinforcement learning. In International Symposium on Experimental Robotics, 2004.
Application – Helicopter Flight • PEGASUS has been used to derive policies for hovering in place. • Later generalized to handle slow-motion maneuvers and upside-down hovering. • A GPS system relays state information (position and velocity) to an off-board computer, which calculates a 4-dimensional action
Model Identification • Construction of an MDP representation of the world dynamics • Transition dynamics learned from several minutes of data recorded during human-piloted flight • Fit using linear regression • Forced to respect innate properties of the domain (gravity, symmetry)
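Fitting transition dynamics by linear regression can be sketched on a toy scalar system. Everything here is a hypothetical illustration: the model form $x' \approx a x + b u$, the synthetic data, and the "true" coefficients are assumptions, and the sketch omits the domain constraints (gravity, symmetry) mentioned above.

```python
import random

def fit_linear_dynamics(xs, us, x_next):
    """Least-squares fit of x' ~ a*x + b*u via the 2x2 normal equations."""
    sxx = sum(x * x for x in xs)
    sxu = sum(x * u for x, u in zip(xs, us))
    suu = sum(u * u for u in us)
    sxy = sum(x * y for x, y in zip(xs, x_next))
    suy = sum(u * y for u, y in zip(us, x_next))
    det = sxx * suu - sxu * sxu
    a = (sxy * suu - suy * sxu) / det
    b = (suy * sxx - sxy * sxu) / det
    return a, b

# Synthetic "flight log" (assumed): states, controls, and noisy next states.
rng = random.Random(0)
TRUE_A, TRUE_B = 0.95, 0.3
xs = [rng.uniform(-1, 1) for _ in range(500)]
us = [rng.uniform(-1, 1) for _ in range(500)]
x_next = [TRUE_A * x + TRUE_B * u + rng.gauss(0, 0.01) for x, u in zip(xs, us)]

a_hat, b_hat = fit_linear_dynamics(xs, us, x_next)
```

The fitted coefficients, together with an estimate of the residual noise, would then define a simulator that a policy search method can roll out.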