
Reinforcement Learning: Learning algorithms Function Approximation


  1. Reinforcement Learning: Learning Algorithms / Function Approximation • Yishay Mansour, Tel-Aviv University

  2. Outline • Week I: Basics • Mathematical Model (MDP) • Planning • Value iteration • Policy iteration • Week II: Learning Algorithms • Model based • Model Free • Week III: Large state space

  3. Learning Algorithms • Given access only to the ability to perform actions, solve: 1. policy evaluation; 2. control - finding an optimal policy. • Two approaches: 1. Model based (Dynamic Programming); 2. Model free (Q-Learning, SARSA).

  4. Learning: Policy improvement • Assume that, given a policy π, we can compute its V and Q functions. • Then we can perform policy improvement: π' = Greedy(Q) • The process converges if the estimates are accurate.
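A minimal sketch of the greedy improvement step, assuming a tabular Q stored as a Python dict keyed by (state, action); the names Q, states and actions are illustrative placeholders, not part of the original slides.

```python
# Policy improvement sketch: pi'(s) = the action that maximizes Q(s, a).
# Assumes a tabular Q given as {(state, action): value} and finite,
# enumerable state/action sets (illustrative names only).

def greedy_policy(Q, states, actions):
    """Return the deterministic policy that is greedy with respect to Q."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```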

  5. Learning - Model Free, Optimal Control: off-policy • Learn the Q function online: Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α·Δ_t, where Δ_t = r_t + γ·max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t) • OFF-POLICY: Q-Learning • Note the maximization operator!
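A minimal tabular sketch of one Q-Learning step, assuming Q is a dict {(state, action): value}; the alpha and gamma defaults are arbitrary illustrative values.

```python
# One off-policy Q-Learning update:
#   Q(s_t, a_t) <- Q(s_t, a_t) + alpha * [r_t + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t)]
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    best_next = max(Q[(s_next, b)] for b in actions)  # the maximization operator
    delta = r + gamma * best_next - Q[(s, a)]         # estimation error Delta_t
    Q[(s, a)] += alpha * delta
```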

  6. Learning - Model Free, Policy evaluation: TD(0) • An online view: at state s_t we performed action a_t, received reward r_t and moved to state s_{t+1}. • Our "estimation error" is Δ_t = r_t + γ·V_t(s_{t+1}) − V_t(s_t) • The update: V_{t+1}(s_t) = V_t(s_t) + α·Δ_t • No maximization over actions!
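A minimal tabular TD(0) sketch, assuming V is a dict {state: value}; again the alpha and gamma defaults are illustrative.

```python
# TD(0) policy-evaluation update:
#   V(s_t) <- V(s_t) + alpha * [r_t + gamma * V(s_{t+1}) - V(s_t)]
# Note: no maximization over actions, in contrast to Q-Learning.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    delta = r + gamma * V[s_next] - V[s]  # estimation error Delta_t
    V[s] += alpha * delta
```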

  7. Learning - Model Free, Optimal Control: on-policy • Learn the optimal Q* function online: Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α·[r_t + γ·Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)] • ON-POLICY: SARSA, where a_{t+1} is the ε-greedy action for Q_t. • The policy selects the action! • Need to balance exploration and exploitation.
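A minimal SARSA sketch, with an ε-greedy action selector to illustrate the exploration/exploitation balance; Q is again a dict {(state, action): value}, and all names and default values are illustrative.

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Behaviour policy: random action with probability eps, otherwise greedy."""
    if random.random() < eps:
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy update: a_next is the action actually chosen at s_next.
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * delta
```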

  8. Modified Notation • Rather than Q(s,a), write Q_a(s) • Greedy(Q) selects argmax_a Q_a(s) • Each action a has its own function Q_a(s) • Learn each Q_a(s) independently!

  9. Large state space • Reduce the number of states: • Symmetries (X-O) • Cluster states • Define attributes: • Limited number of attributes • Some states will be identical

  10. Example: X-O • For each action (square), consider the row/diagonal/column through it • The state will encode the status of these "rows": • Two X's • Two O's • Mixed (both X and O) • One X • One O • Empty • Only three types of squares/actions
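One possible (hypothetical) encoding of the "row status" attribute described above; the six-way classification follows the slide, but the function name and board representation are assumptions.

```python
def line_status(line):
    """Classify a row/column/diagonal given as three cells: 'X', 'O' or None."""
    xs, os = line.count('X'), line.count('O')
    if xs and os:
        return 'mixed'      # both X and O
    if xs == 2:
        return 'two_X'
    if os == 2:
        return 'two_O'
    if xs == 1:
        return 'one_X'
    if os == 1:
        return 'one_O'
    return 'empty'
```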

  11. Clustering states • Need to create attributes • Attributes should be "game dependent" • Different "real" states get the same representation • How do we run? • We estimate action values. • Consider only legal actions. • Play the "best" action.

  12. Function Approximation • Use a limited model for Q_a(s) • Have an attribute vector: each state s has a vector vec(s) = (x_1, ..., x_k), normally k << |S| • Examples: Neural Network, Decision tree, Linear Function • Linear function: weights θ = (θ_1, ..., θ_k), value Σ_i θ_i·x_i
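A minimal sketch of the linear value Σ_i θ_i·x_i, assuming the attribute vector and weights are NumPy arrays of length k (names are illustrative).

```python
import numpy as np

def linear_value(theta, features):
    """Value of a state with attribute vector `features` under weights `theta`."""
    return float(np.dot(theta, features))  # sum_i theta_i * x_i
```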

  13. Gradient Descent • Minimize the squared error: SE = ½ Σ_s P(s) [V^π(s) − V_θ(s)]², where P(s) is a weighting on the states • Algorithm: θ(t+1) = θ(t) + α [V^π(s_t) − V_{θ(t)}(s_t)] ∇_{θ(t)} V_{θ(t)}(s_t), where ∇_{θ(t)} denotes the vector of partial derivatives with respect to θ(t) • Replace V^π(s_t) by a sample: • Monte Carlo: use the return R_t for V^π(s_t) • TD(0): use Δ_t for [V^π(s_t) − V_{θ(t)}(s_t)]
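A sketch of the Monte Carlo variant of this update for the linear case (where the gradient is simply the attribute vector, as on the next slide); the return R and step size alpha are illustrative inputs.

```python
import numpy as np

def mc_gradient_update(theta, features, R, alpha=0.01):
    """theta <- theta + alpha * [R_t - V_theta(s_t)] * grad V_theta(s_t)."""
    prediction = np.dot(theta, features)        # V_theta(s_t)
    return theta + alpha * (R - prediction) * features
```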

  14. Linear Functions • Linear function: V_θ(s) = Σ_i θ_i·x_i = <θ, vec(s)> • Derivative: ∇_{θ(t)} V_t(s_t) = vec(s_t) • Update rule: θ_{t+1} = θ_t + α [V^π(s_t) − V_t(s_t)] vec(s_t) • MC: θ_{t+1} = θ_t + α [R_t − <θ_t, vec(s_t)>] vec(s_t) • TD: θ_{t+1} = θ_t + α Δ_t vec(s_t)
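A minimal sketch of the TD update rule above, assuming vec(s_t) and vec(s_{t+1}) are NumPy arrays; the alpha and gamma defaults are illustrative.

```python
import numpy as np

def linear_td0_update(theta, x, x_next, r, alpha=0.01, gamma=0.99):
    """x = vec(s_t), x_next = vec(s_{t+1}); returns the updated weight vector."""
    delta = r + gamma * np.dot(theta, x_next) - np.dot(theta, x)  # Delta_t
    return theta + alpha * delta * x
```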

  15. Example: 4 in a row • Select attributes for each action (column): • 3 in a row (type X or type O) • 2 in a row (type X or O) and [blocked / not blocked] • Next location makes 3 in a row • Next move might lose • Other "features" • RL will learn the weights. • Look-ahead helps significantly: use a max-min tree

  16. Bootstrapping • Playing against a "good" player • Using .... • Self play: • Start with a random player • Play against oneself • Choose a starting point: a Max-Min tree with a simple scoring function • Add some simple guidance: add "compulsory" moves
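A hedged sketch of a self-play loop in the spirit of this slide; the Game class and the agent's choose_move/learn methods are hypothetical placeholders, not an interface defined in the course.

```python
def self_play(agent, Game, num_games=10_000):
    """Start from a (possibly random) agent and train it by playing against itself."""
    for _ in range(num_games):
        game = Game()
        trajectory = []
        while not game.over():
            move = agent.choose_move(game)       # e.g. epsilon-greedy on Q_a(s)
            trajectory.append((game.state(), move))
            game.play(move)
        agent.learn(trajectory, game.outcome())  # update weights from the final result
    return agent
```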

  17. Scoring Function • Checkers: number of pieces, number of Queens • Chess: weighted sum of pieces • Othello/Reversi: difference in number of pieces • Can be used with a Max-Min Tree • (α, β) pruning
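A sketch of a max-min tree search with (α, β) pruning driven by such a scoring function; the game interface (moves/play/undo/score/over) is a hypothetical placeholder.

```python
def alphabeta(game, depth, alpha=float('-inf'), beta=float('inf'), maximizing=True):
    """Max-min search with (alpha, beta) pruning; scores leaves with game.score()."""
    if depth == 0 or game.over():
        return game.score()                # simple scoring function at the leaves
    if maximizing:
        value = float('-inf')
        for move in game.moves():
            game.play(move)
            value = max(value, alphabeta(game, depth - 1, alpha, beta, False))
            game.undo(move)
            alpha = max(alpha, value)
            if alpha >= beta:              # beta cut-off: opponent avoids this branch
                break
        return value
    value = float('inf')
    for move in game.moves():
        game.play(move)
        value = min(value, alphabeta(game, depth - 1, alpha, beta, True))
        game.undo(move)
        beta = min(beta, value)
        if beta <= alpha:                  # alpha cut-off
            break
    return value
```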

  18. Example: Reversi (Othello) • Use simple score functions: • difference in pieces • edge pieces • corner pieces • Use a Max-Min Tree • RL: optimize the weights.

  19. Advanced issues • Time constraints: fast and slow modes • Opening: can help • End game: many cases with few pieces can be solved efficiently • Training on a specific state: might be helpful, though it is not clear that it's worth the effort.

  20. What is Next? • Create teams: • Choose a game! • GUI for the game • Deadline: April 12, 2010 • System specification: • Project outline • High-level component planning • Deadline: May 10, 2010

  21. Schedule (more) • Build the system • Project completion: Aug. 30, 2010 • All supporting documents in HTML! • From next week: • Each group works by itself. • Feel free to contact us.
