ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 20: Approximate & Neuro-Dynamic Programming, Policy Gradient Methods
November 15, 2010
Dr. Itamar Arel, College of Engineering, Department of Electrical Engineering and Computer Science, The University of Tennessee, Fall 2010
Introduction • We will discuss methods that heuristically approximate the dynamic programming problem • Approximate Dynamic Programming (ADP) • Direct Neuro-Dynamic Programming (NDP) • Assumptions: • DP methods assume fully observed systems • ADP assumes no model and (usually) only partial observability (POMDPs) • Relationship to classical control theories • Optimal control – in the linear case, a solved problem; it estimates the state vector using Kalman filter methodologies • Adaptive control answers the question: what can we do when the dynamics of the system are unknown? • We have a model of the plant but lack its parameter values • Often focuses on stability properties rather than performance
Reference to classic control theories (cont.) • Robust control • Attempts to find a controller design that guarantees stability, i.e. that the plant will not “blow up” • Regardless of what the unknown parameter values are • e.g. Lyapunov-based analysis (e.g. queueing systems) • Adaptive control • Attempts to adapt the controller in real time, based on real-time observations of how the plant actually behaves • ADP may be viewed as an adaptive control framework • “Neural observers” – used to predict the next set of observations, based on which the controller acts
Core principles of ADP • The following three general principles are at the core of ADP • Value approximation – instead of solving for V(s) exactly, we use a universal function approximator V(s,W) ≈ V(s) • Alternate starting points – instead of always starting from the Bellman equation directly, we can start from related recurrence equations • Hybrid design – combining multiple ADP systems into more complex hybrid designs • Usually in order to scale better • Mixture of continuous and discrete variables • Multiple spatio-temporal scales
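To make the first principle concrete, here is a minimal sketch assuming a linear approximator over a state feature vector φ(s) and a TD(0)-style update; the feature map, step size, and update rule are illustrative choices, not prescribed in the lecture.

```python
import numpy as np

# Minimal sketch of the value-approximation principle: instead of a table V(s),
# keep a parameter vector W and compute V(s, W) = W . phi(s). The feature map
# phi, the step size, and the TD(0)-style update are illustrative assumptions.

class LinearValueApproximator:
    def __init__(self, n_features, alpha=0.05, gamma=0.95):
        self.w = np.zeros(n_features)   # parameters W of the approximator
        self.alpha = alpha              # learning rate (step size)
        self.gamma = gamma              # discount factor

    def value(self, phi_s):
        """V(s, W) for a state already encoded as a feature vector phi(s)."""
        return float(self.w @ phi_s)

    def td_update(self, phi_s, reward, phi_s_next):
        """TD(0)-style update: move V(s, W) toward r + gamma * V(s', W)."""
        target = reward + self.gamma * self.value(phi_s_next)
        td_error = target - self.value(phi_s)
        self.w += self.alpha * td_error * phi_s
        return td_error
```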
Direct Neuro-Dynamic Programming (Direct NDP) • Motivation • The intuitive appeal of Reinforcement Learning, in particular the actor/critic design • The power of the calculus of variations, applied in the form of backpropagation, to solve optimal control problems • Can inherently deal with POMDPs (using RNNs, for example) • The method is considered “direct” in that • It does not have an explicit state representation • Temporal progression – everything is a function of time rather than of state/action sets • It is also model-free, as it does not assume a model or attempt to directly estimate the model’s dynamics/structure
Direct NDP Architecture • Critic Network: estimates the future reward-to-go J (i.e. the value function) • Action Network: adjusts the action to minimize the difference between the estimated J and the ultimate objective Uc
Direct NDP vs. Classic RL • [Figure: the classic agent–environment loop (State, Action, Reward) redrawn as a direct NDP agent – an Action Network producing u(t) and a Critic Network producing J(t), with the critic trained against the signal J(t-1) - r(t) and the actor trained toward the ultimate objective Uc(t)]
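The sketch below spells out the two error signals implied by this design, assuming one common direct NDP formulation; the attenuation factor and the target value of Uc are assumptions added here, not taken from the slides.

```python
# Sketch of the two error signals that drive a direct NDP actor/critic pair,
# assuming one common formulation (the attenuation factor gamma and the target
# objective U_c are illustrative assumptions, not taken from the slides).
# In a full design, each error is backpropagated through its own network.

def critic_error(J_t, J_prev, r_t, gamma=0.95):
    """Consistency error for the critic: its current estimate of the
    reward-to-go, gamma * J(t), should match what was predicted one step
    earlier minus the reward just received, J(t-1) - r(t) (the signal
    shown in the figure above)."""
    return gamma * J_t - (J_prev - r_t)

def actor_error(J_t, U_c=0.0):
    """Error for the action network: the action is adjusted so that the
    critic's estimate J(t) approaches the ultimate objective U_c."""
    return J_t - U_c
```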
Solving POMDPs with RNNs • Case study: framework for obtaining an optimal policy in model-free POMDPs using Recurrent Neural Networks • Uses an NDP version of Q-Learning • TRTRL is employed (efficient version of RTRL) • Goal: investigate a scenario in which two states have the same observation (yet different optimal actions) • Method: RNNs in a TD framework (more later) • Model is unknown!
Direct NDP architecture using RNNs • [Figure: observation Ot feeds an RNN that approximates Q(st, at); a softmax over the Q estimates selects the final action at, and the TD error computed from rt drives the Q-Learning updates] • Method is good for small action sets. Q: why? (see the sketch below)
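A minimal sketch of this pipeline, assuming a simple tanh recurrence and Boltzmann (softmax) exploration; the weight shapes and temperature are illustrative. It also suggests the answer to the question above: the network needs one Q output per action and the softmax enumerates them all, so the action set must be small and discrete.

```python
import numpy as np

class RecurrentQNet:
    """Tiny RNN that maps the observation history to Q-value estimates,
    one output per (discrete) action."""
    def __init__(self, obs_dim, hidden_dim, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.standard_normal((hidden_dim, obs_dim)) * 0.1
        self.W_rec = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
        self.W_out = rng.standard_normal((n_actions, hidden_dim)) * 0.1
        # Recurrent state: summarizes the history, so two aliased observations
        # (same O_t, different true state) can still yield different Q-values.
        self.h = np.zeros(hidden_dim)

    def q_values(self, obs):
        self.h = np.tanh(self.W_in @ obs + self.W_rec @ self.h)
        return self.W_out @ self.h

def softmax_action(q_values, temperature=1.0, rng=None):
    """Softmax (Boltzmann) selection over the Q estimates. It enumerates every
    action explicitly, which is one reason the scheme suits small action sets."""
    if rng is None:
        rng = np.random.default_rng()
    prefs = (q_values - q_values.max()) / temperature
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))
```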
Training Robots (1st-gen AIBOs) to Walk (faster) • 1st-generation AIBOs were used (internal CPU) • Fundamental motor capabilities were prescribed • e.g. apply torque to a given joint, turn in a given direction • In other words, a finite action set • Observations were limited to distance (a.k.a. radar view) • The goal was to cross the field in a short time • Reward grew increasingly negative as time progressed • Large positive reward when the goal was met • Multiple robots were trained to observe variability in the learning process
The general RL approach revisited • RL will solve all of your problems, but … • We need lots of experience to train from • Taking random actions can be dangerous • It can take a long time to learn • Not all problems fit into the NDP framework • An alternative approach to RL is to reward whole policies, rather than individual actions • Run whole policy, then receive a single reward • Reward measures success of the entire policy • If there are a small number of policies, we can exhaustively try them all • However, this is not possible in most interesting problems
Policy Gradient Methods • Assume that our policy, π, has a set of n real-valued parameters, θ = {θ1, θ2, θ3, ..., θn} • Running the policy with a particular θ results in a reward, rθ • Estimate the reward gradient, ∂rθ/∂θi, for each θi • Update each parameter by gradient ascent: θi ← θi + α ∂rθ/∂θi (α is another learning rate)
Policy Gradient Methods (cont.) • This results in hill-climbing in policy space • So, it’s subject to all the problems of hill-climbing • But, we can also use tricks from search theory, like random restarts and momentum terms • This is a good approach if you have a parameterized policy • Let’s assume we have a “reasonable” starting policy • Typically faster than value-based methods • “Safe” exploration, if you have a good policy • Learns locally-best parameters for that policy
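A minimal sketch of this hill-climbing loop, assuming a central-difference gradient estimate; evaluate_policy is a hypothetical routine, and ε and α are placeholder values.

```python
import numpy as np

# Hill-climbing in policy space with a central-difference gradient estimate.
# evaluate_policy(theta) is a hypothetical routine that runs the parameterized
# policy and returns its reward r_theta; epsilon and alpha are illustrative.

def estimate_gradient(evaluate_policy, theta, epsilon=0.01):
    grad = np.zeros(len(theta))
    for i in range(len(theta)):
        e_i = np.zeros(len(theta))
        e_i[i] = epsilon
        # Finite-difference approximation of d r_theta / d theta_i
        grad[i] = (evaluate_policy(theta + e_i)
                   - evaluate_policy(theta - e_i)) / (2 * epsilon)
    return grad

def hill_climb_step(evaluate_policy, theta, alpha=0.1):
    """One gradient-ascent step: theta <- theta + alpha * grad(r_theta)."""
    return theta + alpha * estimate_gradient(evaluate_policy, theta)
```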
An Example: Learning to Walk • RoboCup 4-legged league • Walking quickly is a big advantage • Until recently, this was tuned manually • Robots have a parameterized gait controller • 12 parameters • Controls step length, height, etc. • The robot walks across the soccer field and is timed • Reward is a function of the time taken • They know when to stop (distance measure)
An Example: Learning to Walk (cont.) • Basic idea • 1. Pick an initial θ = {θ1, θ2, ..., θ12} • 2. Generate N testing parameter settings by perturbing θ: θj = {θ1 + δ1, θ2 + δ2, ..., θ12 + δ12}, with each δi ∈ {-ε, 0, +ε} • 3. Test each setting, and observe rewards: θj → rj • 4. For each θi ∈ θ, calculate Avg(θi,+), Avg(θi,0), Avg(θi,-) – the average reward over the settings in which θi was perturbed by +ε, left unchanged, or perturbed by -ε – and set θi′ by stepping θi a fixed amount η toward the best-performing of the three • 5. Set θ ← θ′, and go to 2 • (A sketch of one iteration is given below)
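A minimal sketch of one iteration of the procedure above (steps 2–5); evaluate_gait is a hypothetical routine, and the values of N, ε, and η are placeholders.

```python
import numpy as np

# Sketch of one iteration (steps 2-5) of the perturbation-based policy search
# described above. evaluate_gait(theta) is a hypothetical routine that runs one
# timed walk with gait parameters theta and returns the reward; the values of
# N, epsilon, and eta are arbitrary placeholders.

def policy_search_step(evaluate_gait, theta, N=15, epsilon=0.05, eta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: N test settings, each theta_i perturbed by -eps, 0, or +eps.
    deltas = rng.choice([-epsilon, 0.0, epsilon], size=(N, len(theta)))
    # Step 3: test each setting and record its reward.
    rewards = np.array([evaluate_gait(theta + d) for d in deltas])

    new_theta = np.array(theta, dtype=float)
    for i in range(len(theta)):
        # Step 4: average reward over settings where theta_i was perturbed
        # by -eps, left unchanged, or perturbed by +eps ...
        groups = [deltas[:, i] < 0, deltas[:, i] == 0, deltas[:, i] > 0]
        avgs = [rewards[g].mean() if g.any() else -np.inf for g in groups]
        # ... and step theta_i by eta toward the best-performing perturbation.
        new_theta[i] += (int(np.argmax(avgs)) - 1) * eta   # -eta, 0, or +eta
    # Step 5: the caller sets theta <- new_theta and repeats.
    return new_theta
```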
An Example: Learning to Walk (cont.) • Q: Can we translate the policy gradient approach into a direct-policy, actor/critic Neuro-Dynamic Programming system? • [Figure: “Initial” vs. “Final” learned gait shown side by side]
Value Function or Policy Gradient? • When should I use policy gradient? • When there’s a parameterized policy • When there’s a high-dimensional state space • When we expect the gradient to be smooth • Typically for episodic tasks (e.g. AIBO walking) • When should I use a value-based method? • When there is no parameterized policy • When we have no idea how to solve the problem (i.e. no known structure)
Direct NDP with RNNs – Backpropagation through a model • RNNs have memory and can create temporal context • Applies to both the actor and the critic • Much harder to train (in time and in logic/memory resources) • e.g. RTRL issues
Consolidated Actor-Critic Model (Z. Liu, I. Arel, 2007) • A single network (FF or RNN) suffices for both the actor and critic functions • Two passes (TD-style) for both the action and the value-estimate corrections • Training via standard techniques, e.g. BP
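A minimal sketch of the consolidated idea, assuming a single shared hidden layer with an actor head and a critic head; the architecture details are illustrative and do not reproduce the published CACM model.

```python
import numpy as np

# Minimal sketch of the consolidated idea: one network with a shared hidden
# layer and two output heads, an actor head producing the action u(t) and a
# critic head producing the value estimate J(t). Layer sizes and nonlinearities
# are illustrative and do not reproduce the CACM details.

class ConsolidatedActorCritic:
    def __init__(self, obs_dim, hidden_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_h = rng.standard_normal((hidden_dim, obs_dim)) * 0.1    # shared hidden layer
        self.W_a = rng.standard_normal((action_dim, hidden_dim)) * 0.1  # actor head
        self.W_j = rng.standard_normal((1, hidden_dim)) * 0.1           # critic head

    def forward(self, obs):
        h = np.tanh(self.W_h @ obs)
        action = np.tanh(self.W_a @ h)     # actor output u(t)
        J = (self.W_j @ h).item()          # critic output J(t)
        return action, J

# Training (not shown): the critic and actor error signals from the direct NDP
# design are backpropagated through the shared weights in two TD-style passes,
# using standard BP.
```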