This slide deck provides an overview of reinforcement learning algorithms for large state space problems. It covers the basics, the mathematical model (MDP), planning (value iteration, policy iteration), learning algorithms, and function approximation.
For immediate action! • Split into groups • 2 or 3 per group • Hand in the group list • Today, at the end of class! • Reinforcement Learning book • The book is available online (access via the workshop website)
Reinforcement Learning: Learning Algorithms, Function Approximation • Yishay Mansour • Tel-Aviv University
Outline • Week I: Basics • Mathematical Model (MDP) • Planning • Value iteration • Policy iteration • Week II: Learning Algorithms • Model based • Model Free • Week III: Large state space
Learning Algorithms Given only the ability to perform actions and observe their outcomes, the tasks are: 1. policy evaluation. 2. control - finding an optimal policy. Two approaches: 1. Model based (Dynamic Programming). 2. Model free (Q-Learning, SARSA).
Learning: Policy improvement • Assume that we can compute, for a given policy π, the V and Q functions of π • Then we can perform policy improvement: • π' = Greedy(Q) • The process converges if the estimates are accurate. • (see the sketch below)
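To make the improvement step concrete, here is a minimal sketch in Python, assuming the Q function is stored as a dictionary keyed by (state, action) pairs; the function and variable names are illustrative, not from the slides.

```python
def greedy_policy(Q, states, actions):
    """Policy improvement: pi'(s) = argmax_a Q(s, a) for every state."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

# Tiny usage example with two states and two actions.
states, actions = ["s0", "s1"], ["left", "right"]
Q = {("s0", "left"): 1.0, ("s0", "right"): 2.0,
     ("s1", "left"): 0.5, ("s1", "right"): 0.1}
print(greedy_policy(Q, states, actions))   # {'s0': 'right', 's1': 'left'}
```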
Learning - Model Free, Optimal Control: Off-policy • Learn the Q function online: • Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α Δ_t • Δ_t = r_t + γ max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t) • OFF-POLICY: Q-Learning • Note the maximization operator! • (see the sketch below)
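A minimal sketch of the Q-learning update above, assuming a tabular Q held in a dictionary that defaults to 0; the step size, discount factor, and state/action names are illustrative choices, not from the slides.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)   # the maximization operator
    delta = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return Q

# Tiny usage example: one observed transition (s0, up) -> s1 with reward 1.
Q = defaultdict(float)                                    # unseen entries default to 0
Q = q_learning_update(Q, s="s0", a="up", r=1.0, s_next="s1", actions=["up", "down"])
print(Q[("s0", "up")])   # 0.1
```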
Learning - Model Free, Policy Evaluation: TD(0) • An online view: at state s_t we performed action a_t, received reward r_t, and moved to state s_{t+1}. • Our "estimation error" is Δ_t = r_t + γ V_t(s_{t+1}) - V_t(s_t) • The update: V_{t+1}(s_t) = V_t(s_t) + α Δ_t • No maximization over actions! • (see the sketch below)
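A corresponding sketch of the TD(0) update, under the same illustrative tabular representation and parameter choices.

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """V(s) <- V(s) + alpha * delta_t, where delta_t = r + gamma * V(s') - V(s)."""
    delta = r + gamma * V[s_next] - V[s]   # the "estimation error"; no max over actions
    V[s] += alpha * delta
    return V

# Tiny usage example: one observed transition s0 -> s1 with reward 1 under the policy.
V = defaultdict(float)
V = td0_update(V, s="s0", r=1.0, s_next="s1")
print(V["s0"])   # 0.1
```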
Learning - Model Free, Optimal Control: On-policy • Learn the optimal Q* function online: • Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α [r_t + γ Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)] • ON-POLICY: SARSA • a_{t+1} is chosen by the ε-greedy policy for Q_t. • The policy selects the action! • Need to balance exploration and exploitation. • (see the sketch below)
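A sketch of the SARSA update together with ε-greedy action selection; again the tabular dictionary, step size, and names are illustrative assumptions.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon explore uniformly, otherwise take the greedy action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a)); a' comes from the policy."""
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return Q

# Tiny usage example: the policy itself selects the next action a_{t+1}.
Q = defaultdict(float)
actions = ["up", "down"]
a_next = epsilon_greedy(Q, "s1", actions)
Q = sarsa_update(Q, "s0", "up", r=1.0, s_next="s1", a_next=a_next)
print(Q[("s0", "up")])   # 0.1
```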
Modified Notation • Rather than Q(s,a), have Q_a(s) • Greedy(Q)(s) = argmax_a Q_a(s) • Each action has its own function Q_a(s) • Learn each Q_a(s) independently! • (see the sketch below)
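One way to realize the per-action view is to keep a separate model per action, here linear models over an attribute vector (anticipating the function-approximation slides below). All class, method, and parameter names are illustrative assumptions.

```python
import numpy as np

class PerActionQ:
    """One independent value model Q_a per action instead of a single Q(s, a)."""

    def __init__(self, actions, num_features):
        # one weight vector per action, each learned independently
        self.theta = {a: np.zeros(num_features) for a in actions}

    def value(self, a, x):
        # Q_a(s) = <theta_a, vec(s)> for a linear model
        return float(self.theta[a] @ x)

    def greedy(self, x, legal_actions):
        # Greedy(Q)(s) = argmax over the legal actions of Q_a(s)
        return max(legal_actions, key=lambda a: self.value(a, x))

# Tiny usage example with 3 actions and a 4-dimensional attribute vector.
Q = PerActionQ(actions=[0, 1, 2], num_features=4)
x = np.array([1.0, 0.0, 1.0, 0.0])
print(Q.greedy(x, legal_actions=[0, 2]))   # all values are 0 initially, so 0 is returned
```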
Large state space • Reduce the number of states • Symmetries (X-O) • Cluster states • Define attributes • Limited number of attributes • Some states will become identical • Action view of a state
Example: X-O • For each action (square): • consider the row/diagonal/column through it • The state encodes the status of these lines: • Two X's • Two O's • Mixed (both X and O) • One X • One O • Empty • Only three types of squares/actions • (see the sketch below)
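A possible encoding of these attributes for tic-tac-toe, assuming the board is a 3x3 array of 'X', 'O', or None; the helper names and the exact status labels are illustrative, not prescribed by the slides.

```python
def line_status(cells):
    """Classify one 3-cell line: two X's, two O's, mixed, one X, one O, or empty."""
    x_count, o_count = cells.count('X'), cells.count('O')
    if x_count and o_count:
        return 'mixed'
    if x_count >= 2:
        return "two X's"
    if o_count >= 2:
        return "two O's"
    if x_count == 1:
        return 'one X'
    if o_count == 1:
        return 'one O'
    return 'empty'

def square_attributes(board, row, col):
    """Statuses of every line (row, column, diagonals) passing through (row, col)."""
    lines = [board[row], [board[r][col] for r in range(3)]]
    if row == col:
        lines.append([board[i][i] for i in range(3)])
    if row + col == 2:
        lines.append([board[i][2 - i] for i in range(3)])
    return sorted(line_status(line) for line in lines)

# Example: the attributes of playing in the bottom-right corner (2, 2).
board = [['X', None, 'O'],
         [None, 'X', None],
         [None, None, None]]
print(square_attributes(board, 2, 2))   # ['empty', 'one O', "two X's"]
```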
Clustering states • Need to create attributes • Attributes should be "game dependent" • Different "real" states - same representation • How do we differentiate states? • We estimate action values • Consider only legal actions • Play the "best" action
Function Approximation • Use a limited model for Q_a(s) • Have an attribute vector: • Each state s has a vector vec(s) = (x_1, ..., x_k) • Normally k << |S| • Examples: • Neural network • Decision tree • Linear function • Weights: θ = (θ_1, ..., θ_k) • Value: Σ_i θ_i x_i
Gradient Descent • Minimize the squared error • Squared error = ½ Σ_s P(s) [V^π(s) - V_θ(s)]² • P(s) is some weighting over the states • Algorithm: • θ(t+1) = θ(t) + α [V^π(s_t) - V_{θ(t)}(s_t)] ∇_{θ(t)} V_{θ(t)}(s_t) • ∇_{θ(t)} = the vector of partial derivatives • Replace V^π(s_t) by a sample: • Monte Carlo: use the return R_t for V^π(s_t) • TD(0): use Δ_t for [V^π(s_t) - V_{θ(t)}(s_t)] • (see the sketch below)
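A sketch of one such gradient step, with the value model and its gradient passed in as functions so the same code covers the Monte Carlo target R_t; the numbers and names are illustrative assumptions.

```python
import numpy as np

def gradient_step(theta, x, target, value_fn, grad_fn, alpha=0.05):
    """theta <- theta + alpha * [target - V_theta(s)] * grad_theta V_theta(s)."""
    error = target - value_fn(theta, x)      # e.g. the Monte Carlo return R_t as the target
    return theta + alpha * error * grad_fn(theta, x)

# Usage with a linear model, where the gradient is simply the feature vector vec(s).
value_fn = lambda th, x: float(th @ x)
grad_fn = lambda th, x: x
theta = np.zeros(3)
theta = gradient_step(theta, np.array([1.0, 0.0, 2.0]), target=1.0,
                      value_fn=value_fn, grad_fn=grad_fn)
print(theta)   # approx. [0.05, 0.0, 0.1]
```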
Linear Functions • Linear function: Σ_i θ_i x_i = ⟨θ, x⟩ • Derivative: ∇_{θ_t} V_t(s_t) = vec(s_t) • Update rule: • θ_{t+1} = θ_t + α [V^π(s_t) - V_t(s_t)] vec(s_t) • MC: θ_{t+1} = θ_t + α [R_t - ⟨θ_t, vec(s_t)⟩] vec(s_t) • TD: θ_{t+1} = θ_t + α Δ_t vec(s_t) • (see the sketch below)
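A sketch of the linear TD(0) update, assuming states have already been mapped to feature vectors vec(s); the step size, discount, and feature values are illustrative.

```python
import numpy as np

def linear_td0_update(theta, x_s, r, x_s_next, alpha=0.05, gamma=0.9):
    """theta <- theta + alpha * delta_t * vec(s_t),
    where delta_t = r + gamma * <theta, vec(s_{t+1})> - <theta, vec(s_t)>."""
    delta = r + gamma * (theta @ x_s_next) - (theta @ x_s)
    return theta + alpha * delta * x_s

# Tiny usage example with 4 binary attributes per state.
theta = np.zeros(4)
x_s = np.array([1.0, 0.0, 1.0, 0.0])        # vec(s_t)
x_s_next = np.array([0.0, 1.0, 0.0, 1.0])   # vec(s_{t+1})
theta = linear_td0_update(theta, x_s, r=1.0, x_s_next=x_s_next)
print(theta)   # approx. [0.05, 0.0, 0.05, 0.0]
```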
Example: 4 in a row • Select attributes for each action (column): • 3 in a row (of type X or type O) • 2 in a row (X or O), blocked or not • next location makes 3 in a row • next move might lose • other "features" • RL will learn the weights. • Look-ahead helps significantly: • use a max-min tree
Bootstrapping • Playing against a "good" player • Using .... • Self play: • start with a random player • play against oneself • Choose a starting point: • max-min tree with a simple scoring function • Add some simple guidance: • add "compulsory" moves
Scoring Function • Checkers: • Number of pieces • Number of Queens • Chess: • Weighted sum of pieces • Othello/Reversi: • Difference in number of pieces • Can be used with a Max-Min Tree • (α,β) pruning • (see the sketch below)
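A sketch of a max-min search with (α,β) pruning that falls back to a scoring function at the depth limit; the game is abstracted behind two illustrative callbacks (children, score), so the example runs on a tiny hand-built tree rather than a real board game.

```python
def alphabeta(state, depth, alpha, beta, maximizing, children, score):
    """Max-min value of `state`, pruning subtrees that cannot affect the result."""
    kids = children(state)
    if depth == 0 or not kids:
        return score(state)                  # simple scoring function at the leaves
    if maximizing:
        value = float('-inf')
        for child in kids:
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False, children, score))
            alpha = max(alpha, value)
            if alpha >= beta:                # (alpha, beta) cut-off
                break
        return value
    value = float('inf')
    for child in kids:
        value = min(value, alphabeta(child, depth - 1, alpha, beta, True, children, score))
        beta = min(beta, value)
        if alpha >= beta:
            break
    return value

# Tiny usage example on a hand-built game tree; leaves carry their score directly.
tree = {'root': ['a', 'b'], 'a': ['a1', 'a2'], 'b': ['b1', 'b2']}
leaf_scores = {'a1': 3, 'a2': 5, 'b1': 2, 'b2': 9}
children = lambda s: tree.get(s, [])
score = lambda s: leaf_scores.get(s, 0)
print(alphabeta('root', 2, float('-inf'), float('inf'), True, children, score))   # 3
```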
Example: Reversi (Othello) • Use a simple score function: • difference in pieces • edge pieces • corner pieces • Use a Max-Min Tree • RL: optimize the weights.
Advanced issues • Time constraints: • fast and slow modes • Opening: • can help • End game: • many cases with only a few pieces • can be solved efficiently • Training on a specific state: • might be helpful; not sure it's worth the effort.
What is Next? • Create teams: • at least 2 students, at most 3 students • Group size will influence our expectations! • Choose a game! • Submit the names and the game • GUI for the game • Deadline: Dec. 25, 2005
Schedule (more) • System specification • Project outline • High-level component planning • Jan. 29, 2006 • Build the system • Project completion • May 1, 2006 • All supporting documents in HTML!
Next week • GUI interface (using C++) • Afterwards: • each group works by itself