1 / 24

Reinforcement Learning: Learning algorithms

Reinforcement Learning: Learning algorithms. Yishay Mansour Tel-Aviv University. Outline . Last week Goal of Reinforcement Learning Mathematical Model (MDP) Planning Value iteration Policy iteration This week: Learning Algorithms Model based Model Free. Planning - Basic Problems.

frye
Download Presentation

Reinforcement Learning: Learning algorithms

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Reinforcement Learning:Learning algorithms Yishay Mansour Tel-Aviv University

  2. Outline • Last week • Goal of Reinforcement Learning • Mathematical Model (MDP) • Planning • Value iteration • Policy iteration • This week: Learning Algorithms • Model based • Model Free

  3. Planning - Basic Problems. Given a complete MDP model. Policy evaluation - Given a policy p, estimate its return. Optimal control - Find an optimal policy p*(maximizes the return from any start state).

  4. Planning - Value Functions Vp(s)The expected value starting at state s followingp. Qp(s,a)The expected value starting at state s with action a and then followingp. V*(s) and Q*(s,a) are define usingan optimal policy p*. V*(s) = maxpVp(s)

  5. Algorithms - optimal control CLAIM: A policy p is optimal if and only if at each state s: Vp(s) = MAXa{Qp(s,a)} (Bellman Eq.) The greedy policy with respect to Qp(s,a) is p(s) = argmaxa{Qp(s,a) }

  6. MDP - computing optimal policy 1. Linear Programming 2. Value Iteration method. 3. Policy Iteration method.

  7. Planning versus Learning Tightly coupled in Reinforcement Learning Goal: maximize return while learning.

  8. Example - Elevator Control Learning (alone): Model the arrival model well. Planning (alone) : Given arrival model build schedule Real objective: Construct a schedule while updating model

  9. Learning Algorithms Given access only to actions perform: 1. policy evaluation. 2. control - find optimal policy. Two approaches: 1. Model based (Dynamic Programming). 2. Model free (Q-Learning).

  10. Learning - Model Based Estimate the model from the observation. (Both transition probability and rewards.) Use the estimated model as the true model, and find optimal policy. If we have a “good” estimated model, we should have a “good” estimation.

  11. Learning - Model Based: off policy • Let the policy run for a “long” time. • what is “long” ?! • Build an “observed model”: • Transition probabilities • Rewards • Use the “observed model” to estimate value of the policy.

  12. Learning - Model Basedoff-policy algorithm • Observe a trajectory generated by a policy π • off-policy: not need to control • For every s,s’ ε S and a ε A: • obsδ(s,a,s’)= #(s,a,s’) / #(s,a,*) • obsR(s,a) = average(r | s,a) • Find an optimal policy for (S,a,obsδ,obsR)

  13. Learning - Model Based • Claim: if the model is “accurate” we will find a near-optimal policy. • Hidden assumption: • Each (s,a,*) is sampled many times. • This is the “responsibility” of the policy π • Simple question: How many samples we need for each (s,a,*)

  14. Learning - Model Basedsample size Sample size (optimal policy): Naive: O(|S|log (|S| |A|) ) samples per (s,a) (approximates each transition d(s,a,s’) well.) Better: O(log (|S| |A|) ) samples per (s,a) (Sufficient to approximate optimal policy.) [KS, NIPS’98]

  15. Learning - Model Based: on policy • The learner has control over the action. • The immediate goal is to lean a model • As before: • Build an “observed model”: • Transition probabilities and Rewards • Use the “observed model” to estimate value of the policy. • Accelerating the learning: • How to reach “new” places ?!

  16. Learning - Model Based: on policy Relatively unknown nodes Well sampled nodes

  17. Learning: Policy improvement • Assume that we can perform: • Given a policy p, • Compute V and Q functions of p • Can run policy improvement: • p = Greedy (Q) • Process converges if estimates are accurate.

  18. Learning: Monte Carlo Methods • Assume we can run in episodes • Terminating MDP • Discounted return • Simplest: sample the return of state s: • Wait to reach state s, • Compute the return from s, • Average all the returns.

  19. Learning: Monte Carlo Methods • First visit: • For each state in the episode, • Compute the return from first occurrence • Average the returns • Every visit: • Might be biased! • Computing optimal policy: • Run policy iteration.

  20. Learning - Model FreePolicy evaluation: TD(0) An online view: At state st we performed action at, received reward rtand moved to state st+1. Our “estimation error” isErrt =rt+gV(st+1)-V(st), The update: Vt +1(st) = Vt(st ) + a Errt Note that for the correct value function we have: E[r+gV(s’)-V(s)] =E[Errt]=0

  21. Learning - Model FreeOptimal Control: off-policy Learn online the Q function. Qt+1 (st ,at ) = Qt (st ,at )+ a [ rt+g Vt (st+1) - Qt (st ,at )] OFF POLICY: Q-Learning Any underlying policy selects actions. Assumes every state-action performed infinitely often Learning rate dependency. Convergence in the limit: GUARANTEED [DW,JJS,S,TS]

  22. Learning - Model FreeOptimal Control: on-policy Learn online the Q function. Qt+1 (st ,at ) = Qt (st ,at )+ a [ rt+g Qt (st+1,at+1) - Qt (st ,at )] ON-Policy:SARSA at+1 the e-greedy policy for Qt. The policy selects the action! Need to balance exploration and exploitation. Convergence in the limit: GUARANTEED [DW,JJS,S,TS]

  23. Learning - Model FreePolicy evaluation: TD() Again: At state st we performed action at, received reward rtand moved to state st+1. Our “estimation error” Errt =rt+gV(st+1)-V(st), Update every state s: Vt +1(s) = Vt(s) + a Errt e(s) Update of e(s) : When visiting s: incremented by 1: e(s) = e(s)+1 For all other s’: decremented by g  factor:e(s’) = g  e(s’)

  24. Summary Markov Decision Process: Mathematical Model. Planning Algorithms. Learning Algorithms: Model Based Monte Carlo TD(0) Q-Learning SARSA TD()

More Related