Explore how apprenticeship learning matches expert performance by modeling the expert's behavior through feature expectations of an unknown reward function, reducing the task to a standard control problem. Convergence and sampling results support practical implementation.
Apprenticeship Learning by Inverse Reinforcement Learning • Pieter Abbeel, Andrew Y. Ng • Stanford University
Motivation • Typical control setting • Given: system model, reward function • Return: controller optimal with respect to the given model and reward function • The reward function might be hard to specify exactly • E.g. driving well on a highway: need to trade off distance, speed, lane preference, …
Apprenticeship Learning • = task of learning from an expert/teacher • Previous work: • Mostly tries to mimic the teacher directly by learning the mapping from states to actions • Lacks strong performance guarantees • Our approach • Returns a policy with performance as good as the expert's on the expert's unknown reward function • Reduces the problem to solving the control problem with a given reward • Algorithm inspired by Inverse Reinforcement Learning (Ng and Russell, 2000)
Preliminaries • Markov Decision Process (MDP) (S, A, T, γ, D, R) • S: finite set of states • A: set of actions • T = {P_sa}: state transition probabilities • γ ∈ [0,1): discount factor • D: initial state distribution • R(s) = w^T φ(s): reward function • φ: S → [0,1]^k: k-dimensional feature vector • Policy π: S → A
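To make the notation concrete, here is a minimal sketch of how such an MDP with a linear reward R(s) = w^T φ(s) could be represented in code; all names and the exact layout are illustrative assumptions, not part of the paper.

import numpy as np
from dataclasses import dataclass
from typing import Callable

@dataclass
class LinearRewardMDP:
    n_states: int                        # |S|, finite
    n_actions: int                       # |A|
    P: np.ndarray                        # T = {P_sa}: shape (|A|, |S|, |S|), P[a, s, s'] = Psa(s')
    gamma: float                         # discount factor gamma in [0, 1)
    D: np.ndarray                        # initial state distribution, shape (|S|,)
    phi: Callable[[int], np.ndarray]     # feature map phi: S -> [0,1]^k

    def reward(self, w: np.ndarray, s: int) -> float:
        # R(s) = w^T phi(s): linear in the features
        return float(w @ self.phi(s))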
Value of a Policy U(π) • U_w(π) = E[Σ_t γ^t R(s_t) | π] = E[Σ_t γ^t w^T φ(s_t) | π] = w^T E[Σ_t γ^t φ(s_t) | π] • Define the feature distribution μ(π) • μ(π) = E[Σ_t γ^t φ(s_t) | π] ∈ 1/(1-γ) · [0,1]^k • So U_w(π) = w^T μ(π) • Optimal policy π* = arg max_π U_w(π)
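A natural way to estimate μ(π) in practice is by Monte Carlo rollouts. A short sketch, assuming a hypothetical helper sample_trajectory(policy, horizon) that rolls the policy out in the MDP and returns the visited states:

import numpy as np

def estimate_mu(policy, sample_trajectory, phi, k, gamma, n_traj=100, horizon=200):
    """Monte Carlo estimate of mu(pi) = E[ sum_t gamma^t phi(s_t) | pi ]."""
    mu = np.zeros(k)
    for _ in range(n_traj):
        states = sample_trajectory(policy, horizon)   # assumed helper: list of visited states
        mu += sum(gamma ** t * phi(s) for t, s in enumerate(states))
    return mu / n_traj                                # then U_w(pi) is approximately w @ estimate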
Feature Distribution Closeness and Performance • Assume the feature distribution μ_E = μ(π_E) of the expert/teacher is given. • If we can find a policy π such that ||μ(π) - μ_E||_2 ≤ ε, then for any underlying reward R*(s) = w*^T φ(s) with ||w*||_1 ≤ 1: |U_w*(π) - U_w*(π_E)| = |w*^T μ(π) - w*^T μ_E| ≤ ||w*||_2 ||μ(π) - μ_E||_2 ≤ ε
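A quick numeric sanity check of this bound with random vectors (purely illustrative values, not from the paper):

import numpy as np

rng = np.random.default_rng(0)
k, eps = 8, 0.1
w = rng.standard_normal(k)
w /= np.abs(w).sum()                       # enforce ||w||_1 <= 1
mu_pi = rng.random(k)
d = rng.standard_normal(k)
d *= eps / np.linalg.norm(d)
mu_E = mu_pi + d                           # so ||mu(pi) - mu_E||_2 = eps
gap = abs(w @ mu_pi - w @ mu_E)            # |U_w(pi) - U_w(pi_E)|
assert gap <= eps                          # holds for any such w, mu(pi), mu_E
print(gap, "<=", eps)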
Algorithm • Input: MDP\R, μ_E • 1: Randomly pick a policy π_0, set i = 1 • 2: Compute t_i = max_{t, w: ||w||_2 ≤ 1} t such that: w^T(μ_E - μ(π_j)) ≥ t for j = 0..i-1 • 3: If t_i ≤ ε, terminate • 4: Compute π_i = arg max_π U_w(π) • 5: Compute μ(π_i) • 6: Set i = i+1, go to step 2 • Return: set of policies {π_j}; for some j we have w*^T μ(π_j) ≥ w*^T μ_E - ε
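Step 2 is a small max-margin QP. The same paper also gives a simpler "projection" variant that avoids the QP entirely; below is a minimal sketch of that variant. The helpers rl_solver(w) (returns a policy optimal for reward R(s) = w·φ(s), i.e. step 4) and estimate_mu(policy) (returns that policy's feature distribution, i.e. step 5) are assumed interfaces, not code from the paper.

import numpy as np

def apprenticeship_learning_projection(mu_E, rl_solver, estimate_mu, eps=0.01, max_iter=100):
    """Projection variant of the apprenticeship-learning loop (sketch)."""
    # Step 1: start from an arbitrary policy (here: optimal for a random reward direction).
    pi0 = rl_solver(np.random.randn(len(mu_E)))
    policies, mus = [pi0], [estimate_mu(pi0)]
    mu_bar = mus[0]                       # running projection of mu_E onto the hull of {mu(pi_j)}

    for _ in range(max_iter):
        w = mu_E - mu_bar                 # reward direction (projection form of step 2)
        t = np.linalg.norm(w)             # current margin t_i
        if t <= eps:                      # step 3: expert's feature distribution is matched
            break
        pi = rl_solver(w)                 # step 4: solve the control problem for reward w . phi
        mu = estimate_mu(pi)              # step 5: feature distribution of the new policy
        policies.append(pi)
        mus.append(mu)
        d = mu - mu_bar
        if d @ d < 1e-12:                 # new policy adds nothing new; stop
            break
        # Project mu_E onto the line through mu_bar in direction (mu - mu_bar) to update mu_bar.
        mu_bar = mu_bar + ((d @ (mu_E - mu_bar)) / (d @ d)) * d
    return policies, mus

As on the slide, the whole set of policies is returned; the guarantee is that at least one of them (or a mixture of them) comes within ε of the expert on the true reward.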
Theoretical Results: Convergence • Let an MDP\R and a k-dimensional feature vector φ be given. Then the algorithm terminates with t_i ≤ ε after at most O( k/[ε(1-γ)]^2 · log(k/[ε(1-γ)]) ) iterations.
Theoretical Results: Sampling • In practice, we have to use sampling estimates of the expert's feature distribution. We still obtain ε-optimal performance with probability (1-δ) for a number of samples m ≥ 9k/(2[ε(1-γ)]^2) · log(2k/δ)
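For intuition, plugging illustrative values (k = 64 features, ε = 0.1, γ = 0.9, δ = 0.05; these numbers are assumptions for the example, not from the slides) into the bound:

import math

k, eps, gamma, delta = 64, 0.1, 0.9, 0.05      # illustrative values only
m = 9 * k / (2 * (eps * (1 - gamma)) ** 2) * math.log(2 * k / delta)
print(math.ceil(m))                            # worst-case number of expert samples under this bound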
Experiments: Gridworld • 128x128 gridworld, 4 actions (the four compass directions), 70% chance of success (otherwise a random move to one of the other neighbouring squares) • Features: non-overlapping 16x16 macro-cells (8x8 = 64 regions); a small number of them have non-zero (positive) rewards • Expert is optimal w.r.t. some weight vector w*
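A sketch of the macro-cell feature map used here, assuming indicator features (one per 16x16 region, as described above):

import numpy as np

GRID, CELL = 128, 16                    # 128x128 grid split into 16x16 macro-cells
CELLS_PER_SIDE = GRID // CELL           # 8
K = CELLS_PER_SIDE ** 2                 # 64 features, one per macro-cell

def phi(x, y):
    """Indicator feature vector of state (x, y): 1 for its macro-cell, 0 elsewhere."""
    f = np.zeros(K)
    f[(y // CELL) * CELLS_PER_SIDE + (x // CELL)] = 1.0
    return f

# The true reward is then R(s) = w* . phi(s) for a sparse, non-negative weight vector w*.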
Experiments: Car Driving • Illustrate how different driving styles can be learned (videos)
Conclusion • Returns a policy with performance as good as the expert's on the expert's unknown reward function • Reduces the problem to solving the control problem with a given reward • Algorithm guaranteed to converge in a polynomial number of iterations • Sample complexity poly(k, 1/(1-γ), 1/ε)