Explore how apprenticeship learning matches expert performance by modeling the expert's behavior through feature expectations of an unknown reward function, reducing the task to a standard control problem. Convergence and sampling results support practical implementation.
Apprenticeship Learning by Inverse Reinforcement Learning • Pieter Abbeel, Andrew Y. Ng • Stanford University
Motivation • Typical control setting • Given: system model, reward function • Return: controller optimal with respect to the given model and reward function • The reward function might be hard to specify exactly • E.g. driving well on a highway: need to trade off distance, speed, lane preference, …
Apprenticeship Learning • = task of learning from an expert/teacher • Previous work: • Mostly tries to mimic the teacher directly by learning the mapping from states to actions • Lacks strong performance guarantees • Our approach • Returns a policy with performance as good as the expert's on the expert's unknown reward function • Reduces the problem to solving the control problem with a given reward • Algorithm inspired by Inverse Reinforcement Learning (Ng and Russell, 2000)
Preliminaries • Markov Decision Process (MDP) (S, A, T, γ, D, R) • S: finite set of states • A: set of actions • T = {P_sa}: state transition probabilities • γ ∈ [0,1): discount factor • D: initial state distribution • R(s) = w^T φ(s): reward function • φ: S → [0,1]^k: k-dimensional feature vector • Policy π: S → A
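To make the notation concrete, here is a minimal sketch of how such an MDP with a linear reward R(s) = w^T φ(s) could be represented in code; all names and the exact layout are illustrative assumptions, not part of the paper.

import numpy as np
from dataclasses import dataclass
from typing import Callable

@dataclass
class LinearRewardMDP:
    n_states: int                        # |S|, finite
    n_actions: int                       # |A|
    P: np.ndarray                        # T = {P_sa}: shape (|A|, |S|, |S|), P[a, s, s'] = Psa(s')
    gamma: float                         # discount factor gamma in [0, 1)
    D: np.ndarray                        # initial state distribution, shape (|S|,)
    phi: Callable[[int], np.ndarray]     # feature map phi: S -> [0,1]^k

    def reward(self, w: np.ndarray, s: int) -> float:
        # R(s) = w^T phi(s): linear in the features
        return float(w @ self.phi(s))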
Value of a Policy U(π) • U_w(π) = E[Σ_t γ^t R(s_t) | π] = E[Σ_t γ^t w^T φ(s_t) | π] = w^T E[Σ_t γ^t φ(s_t) | π] • Define the feature distribution μ(π) • μ(π) = E[Σ_t γ^t φ(s_t) | π] ∈ 1/(1-γ) · [0,1]^k • So U_w(π) = w^T μ(π) • Optimal policy π* = arg max_π U_w(π)
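A natural way to estimate μ(π) in practice is by Monte Carlo rollouts. A short sketch, assuming a hypothetical helper sample_trajectory(policy, horizon) that rolls the policy out in the MDP and returns the visited states:

import numpy as np

def estimate_mu(policy, sample_trajectory, phi, k, gamma, n_traj=100, horizon=200):
    """Monte Carlo estimate of mu(pi) = E[ sum_t gamma^t phi(s_t) | pi ]."""
    mu = np.zeros(k)
    for _ in range(n_traj):
        states = sample_trajectory(policy, horizon)   # assumed helper: list of visited states
        mu += sum(gamma ** t * phi(s) for t, s in enumerate(states))
    return mu / n_traj                                # then U_w(pi) is approximately w @ estimate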
Feature Distribution Closeness and Performance • Assume the feature distribution μ_E = μ(π_E) of the expert/teacher is given. • If we can find a policy π such that ||μ(π) - μ_E||_2 ≤ ε, then for any underlying reward R*(s) = w*^T φ(s) with ||w*||_1 ≤ 1: |U_w*(π) - U_w*(π_E)| = |w*^T μ(π) - w*^T μ_E| ≤ ||w*||_2 ||μ(π) - μ_E||_2 ≤ ε
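A quick numeric sanity check of this bound with random vectors (purely illustrative values, not from the paper):

import numpy as np

rng = np.random.default_rng(0)
k, eps = 8, 0.1
w = rng.standard_normal(k)
w /= np.abs(w).sum()                       # enforce ||w||_1 <= 1
mu_pi = rng.random(k)
d = rng.standard_normal(k)
d *= eps / np.linalg.norm(d)
mu_E = mu_pi + d                           # so ||mu(pi) - mu_E||_2 = eps
gap = abs(w @ mu_pi - w @ mu_E)            # |U_w(pi) - U_w(pi_E)|
assert gap <= eps                          # holds for any such w, mu(pi), mu_E
print(gap, "<=", eps)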
Algorithm • Input: MDP\R, μ_E • 1: Randomly pick a policy π_0, set i = 1 • 2: Compute t_i = max_{t, w: ||w||_2 ≤ 1} t such that: w^T(μ_E - μ(π_j)) ≥ t for j = 0..i-1 • 3: If t_i ≤ ε, terminate • 4: Compute π_i = arg max_π U_w(π) • 5: Compute μ(π_i) • 6: Set i = i+1, go to step 2 • Return: set of policies {π_j}; for some j we have w*^T μ(π_j) ≥ w*^T μ_E - ε
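Step 2 is a small max-margin QP. The same paper also gives a simpler "projection" variant that avoids the QP entirely; below is a minimal sketch of that variant. The helpers rl_solver(w) (returns a policy optimal for reward R(s) = w·φ(s), i.e. step 4) and estimate_mu(policy) (returns that policy's feature distribution, i.e. step 5) are assumed interfaces, not code from the paper.

import numpy as np

def apprenticeship_learning_projection(mu_E, rl_solver, estimate_mu, eps=0.01, max_iter=100):
    """Projection variant of the apprenticeship-learning loop (sketch)."""
    # Step 1: start from an arbitrary policy (here: optimal for a random reward direction).
    pi0 = rl_solver(np.random.randn(len(mu_E)))
    policies, mus = [pi0], [estimate_mu(pi0)]
    mu_bar = mus[0]                       # running projection of mu_E onto the hull of {mu(pi_j)}

    for _ in range(max_iter):
        w = mu_E - mu_bar                 # reward direction (projection form of step 2)
        t = np.linalg.norm(w)             # current margin t_i
        if t <= eps:                      # step 3: expert's feature distribution is matched
            break
        pi = rl_solver(w)                 # step 4: solve the control problem for reward w . phi
        mu = estimate_mu(pi)              # step 5: feature distribution of the new policy
        policies.append(pi)
        mus.append(mu)
        d = mu - mu_bar
        if d @ d < 1e-12:                 # new policy adds nothing new; stop
            break
        # Project mu_E onto the line through mu_bar in direction (mu - mu_bar) to update mu_bar.
        mu_bar = mu_bar + ((d @ (mu_E - mu_bar)) / (d @ d)) * d
    return policies, mus

As on the slide, the whole set of policies is returned; the guarantee is that at least one of them (or a mixture of them) comes within ε of the expert on the true reward.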
Theoretical Results: Convergence • Let an MDP\R and a k-dimensional feature vector φ be given. Then the algorithm terminates with t_i ≤ ε after at most O( k/[ε(1-γ)]^2 · log(k/[ε(1-γ)]) ) iterations.
Theoretical Results: Sampling • In practice, we have to use sampling estimates of the expert's feature distribution. We still obtain ε-optimal performance with probability (1-δ) for a number of samples m ≥ 9k/(2[ε(1-γ)]^2) · log(2k/δ)
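For intuition, plugging illustrative values (k = 64 features, ε = 0.1, γ = 0.9, δ = 0.05; these numbers are assumptions for the example, not from the slides) into the bound:

import math

k, eps, gamma, delta = 64, 0.1, 0.9, 0.05      # illustrative values only
m = 9 * k / (2 * (eps * (1 - gamma)) ** 2) * math.log(2 * k / delta)
print(math.ceil(m))                            # worst-case number of expert samples under this bound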
Experiments: Gridworld • 128x128 gridworld, 4 actions (the four compass directions), 70% chance of success (otherwise a random move to one of the other neighbouring squares) • Features: non-overlapping 16x16 macro-cells (8x8 = 64 regions); a small number of them have non-zero (positive) rewards • Expert is optimal w.r.t. some weight vector w*
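A sketch of the macro-cell feature map used here, assuming indicator features (one per 16x16 region, as described above):

import numpy as np

GRID, CELL = 128, 16                    # 128x128 grid split into 16x16 macro-cells
CELLS_PER_SIDE = GRID // CELL           # 8
K = CELLS_PER_SIDE ** 2                 # 64 features, one per macro-cell

def phi(x, y):
    """Indicator feature vector of state (x, y): 1 for its macro-cell, 0 elsewhere."""
    f = np.zeros(K)
    f[(y // CELL) * CELLS_PER_SIDE + (x // CELL)] = 1.0
    return f

# The true reward is then R(s) = w* . phi(s) for a sparse, non-negative weight vector w*.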
Experiments: Car Driving • Illustrate how different driving styles can be learned (videos)
Conclusion • Returns a policy with performance as good as the expert's on the expert's unknown reward function • Reduces the problem to solving the control problem with a given reward • Algorithm guaranteed to converge in a polynomial number of iterations • Sample complexity poly(k, 1/(1-γ), 1/ε)