
Reinforcement Learning: Dynamic Programming



  1. Reinforcement Learning: Dynamic Programming Csaba Szepesvári, University of Alberta, Kioloa, MLSS’08. Slides: http://www.cs.ualberta.ca/~szepesva/MLSS08/

  2. Reinforcement Learning RL = “Sampling based methods to solve optimal control problems” • Contents • Defining AI • Markovian Decision Problems • Dynamic Programming • Approximate Dynamic Programming • Generalizations (Rich Sutton)

  3. Literature • Books • Richard S. Sutton, Andrew G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 • Dimitri P. Bertsekas, John Tsitsiklis: Neuro-Dynamic Programming, Athena Scientific, 1996 • Journals • JMLR, MLJ, JAIR, AI • Conferences • NIPS, ICML, UAI, AAAI, COLT, ECML, IJCAI

  4. Some More Books • Martin L. Puterman. Markov Decision Processes. Wiley, 1994. • Dimitri P. Bertsekas: Dynamic Programming and Optimal Control. Athena Scientific. Vol. I (2005), Vol. II (2007). • James S. Spall: Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, Wiley, 2003.

  5. Resources • RL-Glue • http://rlai.cs.ualberta.ca/RLBB/top.html • RL-Library • http://rlai.cs.ualberta.ca/RLR/index.html • The RL Toolbox 2.0 • http://www.igi.tugraz.at/ril-toolbox/general/overview.html • OpenDP • http://opendp.sourceforge.net • RL-Competition (2008)! • http://rl-competition.org/ • June 1st, 2008: Test runs begin! • Related fields: • Operations research (MOR, OR) • Control theory (IEEE TAC, Automatica, IEEE CDC, ECC) • Simulation optimization (Winter Simulation Conference)

  6. Abstract Control Model [Diagram: the environment sends sensations (and reward) to the controller (= agent), which sends actions back to the environment; the “perception-action loop”.]

  7. Zooming in.. [Diagram of the agent’s internals; labelled components: memory, agent state, external sensations, internal sensations, reward, actions.]

  8. A Mathematical Model • Plant (controlled object): • x_{t+1} = f(x_t, a_t, v_t), x_t: state, v_t: noise • z_t = g(x_t, w_t), z_t: sensation/observation, w_t: noise • State: sufficient statistics for the future • Independently of what we measure (“objective state”) ..or.. • Relative to measurements (“subjective state”) • Controller: • a_t = F(z_1, z_2, …, z_t), a_t: action/control => PERCEPTION-ACTION LOOP, “CLOSED-LOOP CONTROL” • Design problem: F = ? • Goal: ∑_{t=1}^T r(z_t, a_t) → max
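
To make the perception-action loop concrete, here is a minimal closed-loop simulation sketch. The plant f, the sensor g, the controller F, the noise levels and the reward below are all invented for illustration; the slide only fixes the interfaces x_{t+1} = f(x_t, a_t, v_t), z_t = g(x_t, w_t), a_t = F(z_1, …, z_t).

import numpy as np

rng = np.random.default_rng(0)

def f(x, a, v):          # plant dynamics (illustrative: noisy linear system)
    return 0.9 * x + a + v

def g(x, w):             # sensor model: noisy observation of the state
    return x + w

def F(history):          # controller: acts on the observation history
    return -0.5 * history[-1]   # simple proportional feedback on the last observation

x, history, total_reward = 0.0, [], 0.0
for t in range(100):
    z = g(x, rng.normal(scale=0.1))            # observe z_t
    history.append(z)
    a = F(history)                             # act
    total_reward += -(z ** 2 + 0.1 * a ** 2)   # an illustrative reward r(z_t, a_t)
    x = f(x, a, rng.normal(scale=0.1))         # plant evolves
print(total_reward)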

  9. A Classification of Controllers • Feedforward: • a_1, a_2, … are designed ahead of time • ??? • Feedback: • Purely reactive systems: a_t = F(z_t) • Why is this bad? • Feedback with memory: m_t = M(m_{t-1}, z_t, a_{t-1}) ~ interpreting sensations; a_t = F(m_t) ~ decision making: deliberative vs. reactive

  10. Feedback controllers • Plant: • x_{t+1} = f(x_t, a_t, v_t) • z_{t+1} = g(x_t, w_t) • Controller: • m_t = M(m_{t-1}, z_t, a_{t-1}) • a_t = F(m_t) • m_t ≈ x_t: state estimation, “filtering”; difficulties: noise, unmodelled parts • How do we compute a_t? • With a model (f’): model-based control • ..assumes (some kind of) state estimation • Without a model: model-free control

  11. Markovian Decision Problems

  12. Markovian Decision Problems • (X, A, p, r, γ) • X – set of states • A – set of actions (controls) • p – transition probabilities p(y|x,a) • r – rewards r(x,a,y), or r(x,a), or r(x) • γ – discount factor, 0 ≤ γ < 1
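
For concreteness, here is one way such a finite MDP could be stored in code. Everything below (the array layout P[a, x, y] = p(y|x,a), R[a, x, y] = r(x,a,y), and the particular numbers) is an illustrative assumption, not part of the slides; the later sketches reuse the same conventions.

import numpy as np

# A tiny made-up finite MDP (X, A, p, r, gamma) with 3 states and 2 actions.
nS, nA, gamma = 3, 2, 0.9

# P[a, x, y] = p(y | x, a); each row P[a, x, :] sums to 1.
P = np.array([[[0.8, 0.2, 0.0],
               [0.0, 0.8, 0.2],
               [0.1, 0.0, 0.9]],
              [[0.2, 0.8, 0.0],
               [0.0, 0.2, 0.8],
               [0.9, 0.0, 0.1]]])

# R[a, x, y] = r(x, a, y)
R = np.zeros((nA, nS, nS))
R[:, :, 2] = 1.0          # illustrative reward: reaching state 2 pays +1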

  13. The Process View • (X_t, A_t, R_t) • X_t – state at time t • A_t – action at time t • R_t – reward at time t • Laws: • X_{t+1} ~ p(·|X_t, A_t) • A_t ~ π(·|H_t) • π: policy • H_t = (X_t, A_{t-1}, R_{t-1}, …, A_1, R_1, X_0) – history • R_t = r(X_t, A_t, X_{t+1})
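
A small sketch of the process view: sampling a trajectory (X_t, A_t, R_t) and accumulating the discounted return under a policy. The random MDP, the uniformly random policy and the flat history list are assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nA, nS))   # P[a, x, :] = p(.|x, a)
R = rng.normal(size=(nA, nS, nS))               # R[a, x, y] = r(x, a, y)

def policy(history):
    """A (here: uniformly random) policy A_t ~ pi(.|H_t)."""
    return rng.integers(nA)

x, history, ret = 0, [0], 0.0
for t in range(20):
    a = policy(history)
    y = rng.choice(nS, p=P[a, x])      # X_{t+1} ~ p(.|X_t, A_t)
    r = R[a, x, y]                     # R_t = r(X_t, A_t, X_{t+1})
    ret += gamma ** t * r              # discounted return sum_t gamma^t R_t
    history += [a, r, y]
    x = y
print(ret)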

  14. The Control Problem • Value functions: V^π(x) = E^π[ ∑_{t=0}^∞ γ^t R_t | X_0 = x ] • Optimal value function: V*(x) = sup_π V^π(x) • Optimal policy: a policy π* with V^{π*} = V*

  15. Applications of MDPs • Operations research • Econometrics • Optimal investments • Replacement problems • Option pricing • Logistics, inventory management • Active vision • Production scheduling • Dialogue control • Control, statistics • Games, AI • Bioreactor control • Robotics (RoboCup Soccer) • Driving • Real-time load balancing • Design of experiments (medical tests)

  16. Variants of MDPs • Discounted • Undiscounted: Stochastic Shortest Path • Average reward • Multiple criteria • Minimax • Games

  17. MDP Problems • Planning: The MDP (X, A, p, r, γ) is known. Find an optimal policy π*! • Learning: The MDP is unknown. You are allowed to interact with it. Find an optimal policy π*! • Optimal learning: While interacting with the MDP, minimize the loss due to not using an optimal policy from the beginning

  18. Solving MDPs – Dimensions • Which problem? (Planning, learning, optimal learning) • Exact or approximate? • Uses samples? • Incremental? • Uses value functions? • Yes: Value-function based methods • Planning: DP, Random Discretization Method, FVI, … • Learning: Q-learning, Actor-critic, … • No: Policy search methods • Planning: Monte-Carlo tree search, Likelihood ratio methods (policy gradient), Sample-path optimization (Pegasus), • Representation • Structured state: • Factored states, logical representation, … • Structured policy space: • Hierarchical methods

  19. Dynamic Programming

  20. Richard Bellman (1920-1984) • Control theory • Systems Analysis • Dynamic Programming:RAND Corporation, 1949-1955 • Bellman equation • Bellman-Ford algorithm • Hamilton-Jacobi-Bellman equation • “Curse of dimensionality” • invariant imbeddings • Grönwall-Bellman inequality

  21. Bellman Operators • Let π: X → A be a stationary policy • B(X) = { V | V: X → ℝ, ||V||_∞ < ∞ } • T^π: B(X) → B(X) • (T^π V)(x) = ∑_y p(y|x,π(x)) [ r(x,π(x),y) + γ V(y) ] • Theorem: T^π V^π = V^π • Note: This is a linear system of equations: r^π + γ P^π V^π = V^π, so V^π = (I − γ P^π)^{-1} r^π
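
The note suggests computing V^π by solving the linear system directly. A minimal sketch, assuming the same illustrative array conventions as before (P[a, x, y] = p(y|x,a), R[a, x, y] = r(x,a,y)) and a made-up deterministic policy pi:

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nA, nS))
R = rng.normal(size=(nA, nS, nS))

pi = np.array([0, 1, 1, 0])            # a stationary deterministic policy pi: X -> A

# P^pi and r^pi: transition matrix and expected one-step reward under pi.
P_pi = P[pi, np.arange(nS), :]                           # P_pi[x, y] = p(y | x, pi(x))
r_pi = np.einsum('xy,xy->x', P_pi, R[pi, np.arange(nS), :])

# V^pi = (I - gamma P^pi)^{-1} r^pi  -- the linear system from the slide.
V_pi = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
print(V_pi)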

  22. Proof of T^π V^π = V^π • What you need to know: • Linearity of expectation: E[A+B] = E[A] + E[B] • Law of total expectation: E[Z] = ∑_x P(X=x) E[Z | X=x], and E[Z | U=u] = ∑_x P(X=x|U=u) E[Z | U=u, X=x]. • Markov property: E[ f(X_1,X_2,…) | X_1=y, X_0=x ] = E[ f(X_1,X_2,…) | X_1=y ] • V^π(x) = E^π[ ∑_{t=0}^∞ γ^t R_t | X_0 = x ] = ∑_y P(X_1=y|X_0=x) E^π[ ∑_{t=0}^∞ γ^t R_t | X_0 = x, X_1 = y ] (by the law of total expectation) = ∑_y p(y|x,π(x)) E^π[ ∑_{t=0}^∞ γ^t R_t | X_0 = x, X_1 = y ] (since X_1 ~ p(·|X_0, π(X_0))) = ∑_y p(y|x,π(x)) { E^π[ R_0 | X_0 = x, X_1 = y ] + γ E^π[ ∑_{t=0}^∞ γ^t R_{t+1} | X_0 = x, X_1 = y ] } (by the linearity of expectation) = ∑_y p(y|x,π(x)) { r(x,π(x),y) + γ V^π(y) } (using the definitions of r and V^π together with the Markov property) = (T^π V^π)(x). (using the definition of T^π)

  23. The Banach Fixed-Point Theorem • B = (B, ||·||) Banach space • T: B_1 → B_2 is L-Lipschitz (L > 0) if for any U, V, ||T U − T V|| ≤ L ||U − V||. • T is a contraction if B_1 = B_2 and L < 1; L is a contraction coefficient of T • Theorem [Banach]: Let T: B → B be a γ-contraction. Then T has a unique fixed point V and, for all V_0 ∈ B, the iterates V_{k+1} = T V_k satisfy V_k → V and ||V_k − V|| = O(γ^k)

  24. An Algebra for Contractions • Prop: If T_1: B_1 → B_2 is L_1-Lipschitz and T_2: B_2 → B_3 is L_2-Lipschitz, then T_2 T_1 is L_1 L_2-Lipschitz. • Def: If T is 1-Lipschitz, T is called a non-expansion • Prop: M: B(X × A) → B(X), (M Q)(x) = max_a Q(x,a) is a non-expansion • Prop: Mul_c: B → B, Mul_c V = c V is |c|-Lipschitz • Prop: Add_r: B → B, Add_r V = r + V is a non-expansion. • Prop: K: B(X) → B(X), (K V)(x) = ∑_y K(x,y) V(y) is a non-expansion if K(x,y) ≥ 0 and ∑_y K(x,y) = 1.

  25. Policy Evaluations are Contractions • Def: ||V||_∞ = max_x |V(x)|, the supremum norm; written ||·|| below • Theorem: Let T^π be the policy evaluation operator of some policy π. Then T^π is a γ-contraction. • Corollary: V^π is the unique fixed point of T^π; V_{k+1} = T^π V_k → V^π for all V_0 ∈ B(X), and ||V_k − V^π|| = O(γ^k).
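
The corollary suggests the iterative scheme V_{k+1} = T^π V_k. A sketch under the same illustrative conventions as before; the tolerance and iteration cap are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nA, nS))
R = rng.normal(size=(nA, nS, nS))
pi = np.array([0, 1, 1, 0])

P_pi = P[pi, np.arange(nS), :]
r_pi = np.einsum('xy,xy->x', P_pi, R[pi, np.arange(nS), :])

V = np.zeros(nS)                        # any V_0 works: T^pi is a gamma-contraction
for k in range(200):
    V_new = r_pi + gamma * P_pi @ V     # (T^pi V)(x)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(V)                                # approx. V^pi, converging at rate O(gamma^k)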

  26. The Bellman Optimality Operator • Let T: B(X) → B(X) be defined by (T V)(x) = max_a ∑_y p(y|x,a) { r(x,a,y) + γ V(y) } • Def: π is greedy w.r.t. V if T^π V = T V. • Prop: T is a γ-contraction. • Theorem (BOE): T V* = V*. • Proof: Let V be the fixed point of T. Since T^π ≤ T for every π, V^π ≤ V, and hence V* ≤ V. Let π be greedy w.r.t. V. Then T^π V = T V = V, so V^π = V, hence V ≤ V* and therefore V = V*.
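
A sketch of the Bellman optimality operator T and of greedy policy extraction, again on a made-up MDP with the array conventions used in the earlier sketches.

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nA, nS))   # P[a, x, y] = p(y|x,a)
R = rng.normal(size=(nA, nS, nS))               # R[a, x, y] = r(x,a,y)

def T(V):
    """(T V)(x) = max_a sum_y p(y|x,a) [ r(x,a,y) + gamma V(y) ]."""
    Q = np.einsum('axy,axy->ax', P, R + gamma * V)   # one-step lookahead values
    return Q.max(axis=0)

def greedy(V):
    """A policy that is greedy w.r.t. V, i.e. T^pi V = T V."""
    Q = np.einsum('axy,axy->ax', P, R + gamma * V)
    return Q.argmax(axis=0)

print(T(np.zeros(nS)), greedy(np.zeros(nS)))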

  27. Value Iteration • Theorem: For any V_0 ∈ B(X), the iterates V_{k+1} = T V_k satisfy V_k → V*, and in particular ||V_k − V*|| = O(γ^k). • What happens when we stop “early”? • Theorem: Let π be greedy w.r.t. V. Then ||V^π − V*|| ≤ 2 ||T V − V||/(1−γ). • Proof: ||V^π − V*|| ≤ ||V^π − V|| + ||V − V*|| … • Corollary: In a finite MDP, the number of policies is finite. We can stop when ||V_k − T V_k|| ≤ Δ(1−γ)/2, where Δ = min{ ||V* − V^π|| : V^π ≠ V* } ⇒ pseudo-polynomial complexity
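
Putting the theorem and the stopping rule together, a sketch of value iteration that stops once ||V_k − T V_k|| is below a tolerance; the MDP and the tolerance are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nA, nS))
R = rng.normal(size=(nA, nS, nS))

def T(V):
    return np.einsum('axy,axy->ax', P, R + gamma * V).max(axis=0)

V, eps = np.zeros(nS), 1e-8
while True:
    TV = T(V)
    if np.max(np.abs(TV - V)) <= eps:      # stop when ||V_k - T V_k|| is small
        break
    V = TV

# A policy greedy w.r.t. the final V is then 2 ||T V - V|| / (1 - gamma)-optimal.
pi = np.einsum('axy,axy->ax', P, R + gamma * V).argmax(axis=0)
print(V, pi)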

  28. Policy Improvement [Howard ’60] • Def: For U, V ∈ B(X), V ≥ U if V(x) ≥ U(x) holds for all x ∈ X. • Def: For U, V ∈ B(X), V > U if V ≥ U and there exists x ∈ X s.t. V(x) > U(x). • Theorem (Policy Improvement): Let π’ be greedy w.r.t. V^π. Then V^{π’} ≥ V^π. If T V^π > V^π then V^{π’} > V^π.

  29. Policy Iteration • Policy Iteration(π): • V ← V^π • Do {improvement} • V’ ← V • Let π be such that T^π V = T V • V ← V^π • While (V > V’) • Return π
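
A runnable sketch of the loop above, using an exact linear solve for the evaluation step and greedy improvement. Stopping when the greedy policy no longer changes is a common implementation choice and only one way to realize the While (V > V’) test; the MDP arrays are the same illustrative ones as before.

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nA, nS))
R = rng.normal(size=(nA, nS, nS))

def evaluate(pi):
    """Exact policy evaluation: V^pi = (I - gamma P^pi)^{-1} r^pi."""
    P_pi = P[pi, np.arange(nS), :]
    r_pi = np.einsum('xy,xy->x', P_pi, R[pi, np.arange(nS), :])
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

def greedy(V):
    """Improvement step: a pi with T^pi V = T V."""
    return np.einsum('axy,axy->ax', P, R + gamma * V).argmax(axis=0)

pi = np.zeros(nS, dtype=int)
while True:
    V = evaluate(pi)            # V <- V^pi
    new_pi = greedy(V)
    if np.array_equal(new_pi, pi):
        break                   # no further strict improvement: pi is optimal
    pi = new_pi
print(pi, evaluate(pi))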

  30. Policy Iteration Theorem • Theorem: In a finite, discounted MDP policy iteration stops after a finite number of steps and returns an optimal policy. • Proof: Follows from the Policy Improvement Theorem.

  31. Linear Programming • V ≥ T V ⇒ V ≥ V* = T V*. • Hence, V* is the smallest V that satisfies V ≥ T V. V ≥ T V ⇔ (*) V(x) ≥ ∑_y p(y|x,a) { r(x,a,y) + γ V(y) }, for all x, a • LinProg(V): • ∑_x V(x) → min s.t. V satisfies (*). • Theorem: LinProg(V) returns the optimal value function, V*. • Corollary: Pseudo-polynomial complexity
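
A sketch of LinProg using scipy.optimize.linprog (assumed to be available): minimize ∑_x V(x) subject to constraint (*), rewritten as (γ P_a − I) V ≤ −r_a for each action a. The MDP arrays are the same illustrative ones as in the earlier sketches.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nA, nS))
R = rng.normal(size=(nA, nS, nS))

A_ub, b_ub = [], []
for a in range(nA):
    r_a = np.einsum('xy,xy->x', P[a], R[a])      # expected one-step reward for action a
    A_ub.append(gamma * P[a] - np.eye(nS))       # (gamma P_a - I) V <= -r_a
    b_ub.append(-r_a)

res = linprog(c=np.ones(nS),                     # minimize sum_x V(x)
              A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub),
              bounds=[(None, None)] * nS)        # V may be negative
print(res.x)                                      # approx. V*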

  32. Variations of a Theme

  33. Approximate Value Iteration • AVI: V_{k+1} = T V_k + ε_k • AVI Theorem: Let ε = max_k ||ε_k||. Then limsup_{k→∞} ||V_k − V*|| ≤ 2γε / (1−γ). • Proof: Let a_k = ||V_k − V*||. Then a_{k+1} = ||V_{k+1} − V*|| = ||T V_k − T V* + ε_k|| ≤ γ ||V_k − V*|| + ε = γ a_k + ε. Hence, a_k is bounded. Take the limsup of both sides: a ≤ γ a + ε; reorder. // (e.g., [BT96])

  34. Fitted Value Iteration – Non-expansion Operators • FVI: Let A be a non-expansion, V_{k+1} = A T V_k. Where does this converge to? • Theorem: Let U, V be such that A T U = U and T V = V. Then ||V − U|| ≤ ||A V − V||/(1−γ). • Proof: Let U’ be the fixed point of T A. Then ||U’ − V|| ≤ γ ||A V − V||/(1−γ). Since A U’ = A T (A U’), U = A U’. Hence, ||U − V|| = ||A U’ − V|| ≤ ||A U’ − A V|| + ||A V − V|| … [Gordon ’95]

  35. Application to Aggregation • Let Π be a partition of X and S(x) the unique cell that x belongs to. • Let A: B(X) → B(X) be (A V)(x) = ∑_z μ(z; S(x)) V(z), where μ(·; S(x)) is a distribution over S(x). • p’(C|B,a) = ∑_{x∈B} μ(x;B) ∑_{y∈C} p(y|x,a), r’(B,a,C) = ∑_{x∈B} μ(x;B) ∑_{y∈C} p(y|x,a) r(x,a,y). • Theorem: Take (Π, A, p’, r’), let V’ be its optimal value function, and V’_E(x) = V’(S(x)). Then ||V’_E − V*|| ≤ ||A V* − V*||/(1−γ).

  36. Action-Value Functions • L: B(X) → B(X × A), (L V)(x,a) = ∑_y p(y|x,a) { r(x,a,y) + γ V(y) }. “One-step lookahead”. • Note: π is greedy w.r.t. V if (L V)(x,π(x)) = max_a (L V)(x,a). • Def: Q* = L V*. • Def: Let Max: B(X × A) → B(X), (Max Q)(x) = max_a Q(x,a). • Note: Max L = T. • Corollary: Q* = L Max Q*. • Proof: Q* = L V* = L T V* = L Max L V* = L Max Q*. • L Max (the Bellman optimality operator on action-value functions) is a γ-contraction • Value iteration, policy iteration, … carry over to Q-functions
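
Because L Max is a γ-contraction, value iteration can be run directly on action-value functions. A sketch of this Q-value iteration, Q_{k+1} = L Max Q_k, on the usual illustrative MDP (array conventions as before).

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nA, nS))
R = rng.normal(size=(nA, nS, nS))

Q = np.zeros((nA, nS))                 # Q[a, x] = Q(x, a)
for k in range(500):
    V = Q.max(axis=0)                  # (Max Q)(x)
    Q_new = np.einsum('axy,axy->ax', P, R + gamma * V)   # (L Max Q)(x, a)
    if np.max(np.abs(Q_new - Q)) < 1e-10:
        break
    Q = Q_new

pi = Q.argmax(axis=0)                  # greedy: (L V)(x, pi(x)) = max_a (L V)(x, a)
print(Q.max(axis=0), pi)               # approx. V* and an optimal policy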

  37. Changing Granularity • Asynchronous Value Iteration: • At every time step, update only a few states • AsyncVI Theorem: If all states are updated infinitely often, the algorithm converges to V*. • How to use? • Prioritized Sweeping • IPS [McMahan & Gordon ’05]: • Instead of performing an update, put the state on a priority queue • When picking a state from the queue, update it • Put its predecessors on the queue • Theorem: Equivalent to Dijkstra’s algorithm on shortest path problems, provided that rewards are non-positive • LRTA* [Korf ’90] ~ RTDP [Barto, Bradtke & Singh ’95] • Focusing on the parts of the state space that matter • Constraints: • Same problem solved from several initial positions • Decisions have to be fast • Idea: Update values along the paths
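
A sketch of the simplest asynchronous scheme on the illustrative MDP: update a single, randomly chosen state in place at each step, which satisfies the “every state updated infinitely often” condition with probability one. Prioritized sweeping and RTDP differ only in how the next state to back up is chosen.

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nA, nS))
R = rng.normal(size=(nA, nS, nS))

V = np.zeros(nS)
for k in range(20000):
    x = rng.integers(nS)               # pick some state (every state, infinitely often)
    # In-place Bellman backup at x only: V(x) <- max_a sum_y p(y|x,a)[r + gamma V(y)]
    V[x] = max(P[a, x] @ (R[a, x] + gamma * V) for a in range(nA))
print(V)                               # converges to V*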

  38. Changing Granularity • Generalized Policy Iteration: • Partial evaluation and partial improvement of policies • Multi-step lookahead improvement • AsyncPI Theorem: If both evaluation and improvement happen at every state infinitely often, then the process converges to an optimal policy. [Williams & Baird ’93]

  39. Variations of a theme [SzeLi99] • Game against nature [Heger ’94]: inf_w ∑_t γ^t R_t(w) with X_0 = x • Risk-sensitive criterion: log ( E[ exp( ∑_t γ^t R_t ) | X_0 = x ] ) • Stochastic Shortest Path • Average Reward • Markov games • Simultaneous action choices (rock-paper-scissors) • Sequential action choices • Zero-sum (or not)

  40. References • [Howard ’60] R.A. Howard: Dynamic Programming and Markov Processes, The MIT Press, Cambridge, MA, 1960. • [Gordon ’95] G.J. Gordon: Stable function approximation in dynamic programming. ICML, pp. 261–268, 1995. • [Watkins ’90] C.J.C.H. Watkins: Learning from Delayed Rewards, PhD Thesis, 1990. • [McMahan & Gordon ’05] H.B. McMahan and G.J. Gordon: Fast Exact Planning in Markov Decision Processes. ICAPS, 2005. • [Korf ’90] R. Korf: Real-Time Heuristic Search. Artificial Intelligence 42, 189–211, 1990. • [Barto, Bradtke & Singh ’95] A.G. Barto, S.J. Bradtke and S. Singh: Learning to act using real-time dynamic programming, Artificial Intelligence 72, 81–138, 1995. • [Williams & Baird ’93] R.J. Williams and L.C. Baird: Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions. Northeastern University Technical Report NU-CCS-93-14, November 1993. • [SzeLi99] Cs. Szepesvári and M.L. Littman: A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms, Neural Computation, 11, 2017–2059, 1999. • [Heger ’94] M. Heger: Consideration of risk in reinforcement learning, ICML, 105–111, 1994.
