
Apprenticeship Learning by Inverse Reinforcement Learning

Explore how apprenticeship learning matches expert performance by modeling the expert's behaviour and its unknown reward function, reducing the task to a standard control problem. Convergence and sample-complexity results support practical implementation.


Presentation Transcript


  1. Apprenticeship Learning by Inverse Reinforcement Learning • Pieter Abbeel and Andrew Y. Ng, Stanford University

  2. Motivation • Typical control setting • Given: system model, reward function • Return: controller optimal with respect to the given model and reward function • The reward function might be hard to specify exactly • E.g. driving well on a highway: need to trade off distance, speed, lane preference, …

  3. Apprenticeship Learning • = the task of learning from an expert/teacher • Previous work: • Mostly tries to mimic the teacher directly, by learning the mapping from states to actions • Lacks strong performance guarantees • Our approach • Returns a policy with performance as good as the expert's on the expert's unknown reward function • Reduces the problem to solving the control problem with a given reward • Algorithm inspired by Inverse Reinforcement Learning (Ng and Russell, 2000)

  4. Preliminaries • Markov Decision Process (MDP) (S, A, T, γ, D, R) • S: finite set of states • A: set of actions • T = {P_sa}: state transition probabilities • γ ∈ [0, 1): discount factor • D: initial state distribution • R(s) = w^T φ(s): reward function • φ: S → [0, 1]^k: k-dimensional feature vector • Policy π: S → A
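As a concrete illustration of these ingredients, here is a minimal Python sketch of an MDP\R with a linear reward; the class and field names are my own and not from the slides.

```python
import numpy as np
from dataclasses import dataclass
from typing import Callable

@dataclass
class MDPNoReward:
    """MDP\\R: everything except the reward, which is R(s) = w^T phi(s)."""
    n_states: int                      # |S|, finite
    n_actions: int                     # |A|
    P: np.ndarray                      # T = {P_sa}: shape (A, S, S), P[a, s, s'] = P(s' | s, a)
    gamma: float                       # discount factor in [0, 1)
    D: np.ndarray                      # initial state distribution over S
    phi: Callable[[int], np.ndarray]   # feature map phi: S -> [0, 1]^k

def reward(mdp: MDPNoReward, w: np.ndarray, s: int) -> float:
    """Linear reward R(s) = w^T phi(s)."""
    return float(w @ mdp.phi(s))
```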

  5. Value of a Policy U(π) • U_w(π) = E[Σ_t γ^t R(s_t) | π] = E[Σ_t γ^t w^T φ(s_t) | π] = w^T E[Σ_t γ^t φ(s_t) | π] • Define the feature distribution μ(π) • μ(π) = E[Σ_t γ^t φ(s_t) | π] ∈ 1/(1−γ) · [0, 1]^k • So U_w(π) = w^T μ(π) • Optimal policy π* = arg max_π U_w(π)
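A minimal sketch of estimating μ(π) by Monte Carlo rollouts and then evaluating U_w(π) = w^T μ(π); the rollout count, the horizon truncation of the infinite sum, and the helper names are illustrative assumptions layered on the MDP sketch above.

```python
import numpy as np

def feature_expectations(mdp, policy, n_rollouts=100, horizon=200, rng=None):
    """Monte Carlo estimate of mu(pi) = E[sum_t gamma^t phi(s_t) | pi]."""
    rng = np.random.default_rng() if rng is None else rng
    k = len(mdp.phi(0))
    mu = np.zeros(k)
    for _ in range(n_rollouts):
        s = rng.choice(mdp.n_states, p=mdp.D)         # s_0 ~ D
        discount = 1.0
        for _ in range(horizon):                      # truncate the infinite discounted sum
            mu += discount * mdp.phi(s)
            a = policy(s)
            s = rng.choice(mdp.n_states, p=mdp.P[a, s])
            discount *= mdp.gamma
    return mu / n_rollouts

def policy_value(mdp, w, policy, **kwargs):
    """U_w(pi) = w^T mu(pi)."""
    return float(w @ feature_expectations(mdp, policy, **kwargs))
```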

  6. Feature Distribution Closeness and Performance • Assume the feature distribution of the expert/teacher μ_E is given. • If we can find a policy π such that ||μ(π) − μ_E||_2 ≤ ε, then for any underlying reward R*(s) = w*^T φ(s) (with ||w*||_1 ≤ 1) we have |U_w*(π) − U_w*(π_E)| = |w*^T μ(π) − w*^T μ_E| ≤ ε
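The final inequality is Cauchy–Schwarz plus ||w*||_2 ≤ ||w*||_1 ≤ 1; spelled out (this intermediate step is not on the original slide):

```latex
|U_{w^*}(\pi) - U_{w^*}(\pi_E)|
  = |{w^*}^{\top}(\mu(\pi) - \mu_E)|
  \le \|w^*\|_2 \, \|\mu(\pi) - \mu_E\|_2
  \le \|\mu(\pi) - \mu_E\|_2
  \le \epsilon
```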

  7. Algorithm • Input: MDP\R, μ_E • 1: Randomly pick a policy π_0, set i = 1 • 2: Compute t_i = max_{t, w: ||w||_2 ≤ 1} t such that: w^T (μ_E − μ(π_j)) ≥ t for j = 0..i−1 • 3: If t_i ≤ ε, terminate • 4: Compute π_i = arg max_π U_w(π) • 5: Compute μ(π_i) • 6: Set i = i+1, go to step 2 • Return: the set of policies {π_j}, for which there is some j such that: w*^T μ(π_j) ≥ w*^T μ_E − ε
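Step 2 is a max-margin optimization (a small QP once the norm constraint on w is included). As a concrete sketch, the loop below implements the simpler "projection" variant described in the same paper, which replaces the QP with a closed-form update; `solve_mdp` (an RL solver returning a policy optimal for reward w^T φ) and `feature_expectations` are assumed helpers, not part of the slides.

```python
import numpy as np

def apprenticeship_learning(mdp, mu_E, solve_mdp, feature_expectations,
                            eps=0.01, max_iter=100):
    """Projection variant of the algorithm (Abbeel & Ng, 2004).

    solve_mdp(mdp, w)             -> policy optimal for reward R(s) = w^T phi(s)
    feature_expectations(mdp, pi) -> estimate of mu(pi)
    """
    pi0 = lambda s: 0                                  # arbitrary initial policy
    policies = [pi0]
    mus = [feature_expectations(mdp, pi0)]
    mu_bar = mus[0]                                    # projection of mu_E onto hull of {mu_j} so far
    for i in range(1, max_iter + 1):
        w = mu_E - mu_bar
        t = np.linalg.norm(w)                          # margin t_i
        if t <= eps:                                   # step 3: feature expectations close enough
            break
        pi = solve_mdp(mdp, w)                         # step 4: solve the control problem for reward w^T phi
        mu = feature_expectations(mdp, pi)             # step 5
        policies.append(pi)
        mus.append(mu)
        d = mu - mu_bar                                # closed-form projection update
        denom = d @ d
        if denom == 0.0:                               # no progress possible
            break
        mu_bar = mu_bar + (d @ (mu_E - mu_bar)) / denom * d
    return policies, mus
```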

  8. Theoretical Results: Convergence • Let an MDP\R and a k-dimensional feature vector φ be given. Then the algorithm terminates with t_i ≤ ε after at most O( k/[ε(1−γ)]^2 · log( k/[ε(1−γ)] ) ) iterations.
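To get a rough sense of scale, the bound can be evaluated for illustrative values of k, ε, and γ (my choices, not from the slides; constants hidden by the O-notation are ignored).

```python
import math

k, eps, gamma = 64, 0.1, 0.9                       # e.g. 64 features, epsilon = 0.1, gamma = 0.9
x = k / (eps * (1 - gamma)) ** 2                   # k / [eps (1 - gamma)]^2
iterations = x * math.log(k / (eps * (1 - gamma)))
print(f"iteration bound ~ {iterations:.2e}")       # ~ 5.6e6 for these values
```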

  9. Theoretical Results: Sampling • In practice, we have to use sampling estimates for the feature distribution of the expert. We still get ε-optimal performance w.p. (1−δ) for a number of samples m ≥ 9k/(2[ε(1−γ)]^2) · log(2k/δ)
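The sample bound can be evaluated the same way, again with illustrative values (including δ) that are not from the slides.

```python
import math

k, eps, gamma, delta = 64, 0.1, 0.9, 0.05
m = 9 * k / (2 * (eps * (1 - gamma)) ** 2) * math.log(2 * k / delta)
print(f"expert samples needed: m >= {math.ceil(m):.2e}")   # ~ 2.3e7 for these values
```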

  10. Experiments: Gridworld • 128×128 gridworld, 4 actions (the 4 compass directions); each action succeeds with probability 0.7, otherwise the agent moves in one of the other directions at random • Features: indicators of non-overlapping 16×16 regions of cells; a small number of regions have non-zero (positive) reward • Expert is optimal w.r.t. some weights w*
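A minimal sketch of this environment as described on the slide, with a sampling-based step function and per-region indicator features; the details (uniform choice among the other three directions, the number of rewarded regions) are my reading of the description, not the authors' code.

```python
import numpy as np

N, REGION = 128, 16                          # 128x128 grid, 16x16 macro-cell regions
K = (N // REGION) ** 2                       # 64 indicator features, one per region
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # the four compass directions

def phi(state):
    """Indicator of which 16x16 region the cell belongs to."""
    r, c = state
    f = np.zeros(K)
    f[(r // REGION) * (N // REGION) + c // REGION] = 1.0
    return f

def step(state, action, rng):
    """Move in the chosen direction w.p. 0.7, else in one of the other 3 directions."""
    if rng.random() >= 0.7:
        action = rng.choice([a for a in range(4) if a != action])
    dr, dc = MOVES[action]
    r, c = state
    return (min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1))

# Example: the true reward puts positive weight on a few regions only.
rng = np.random.default_rng(0)
w_star = np.zeros(K)
w_star[rng.choice(K, size=4, replace=False)] = rng.random(4)
```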

  11. Experiments: Gridworld (ctd)

  12. Experiments: Car Driving • Illustrate how different driving styles can be learned (videos)

  13. Conclusion • Returns a policy with performance as good as the expert's on the expert's unknown reward function • Reduces the problem to solving the control problem with a given reward • Algorithm guaranteed to converge in a polynomial number of iterations • Sample complexity poly(k, 1/(1−γ), 1/ε)
