
Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs

István Szita & András Lőrincz · University of Alberta, Canada · Eötvös Loránd University, Hungary



  1. Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs. István Szita & András Lőrincz, University of Alberta, Canada & Eötvös Loránd University, Hungary

  2. Outline • Factored MDPs • motivation • definitions • planning in FMDPs • Optimism • Optimism & FMDPs & Model-based learning

  3. Reinforcement learning • the agent makes decisions • … in an unknown world • makes some observations (including rewards) • tries to maximize collected reward

  4. What kind of observation? • structured observations • structure is unclear ???

  5. How to “solve an RL task”? • a model is useful • can reuse experience from previous trials • can learn offline • observations are structured • structure is unknown • structured + model + RL = FMDP ! • (or linear dynamical systems, neural networks, etc…)

  6. Factored MDPs • like ordinary MDPs, except that everything is factored: • states • rewards • transition probabilities • (value functions)

  7. Factored state space • the state is a vector x = (x1, …, xm) of variables, each with a small finite domain • all functions depend on a few variables only

  8. Factored dynamics • the transition model factors as a dynamic Bayesian network: P(x′ | x, a) = ∏i Pi(x′i | x[Γi], a) • each next-state component x′i depends only on a small set of parent variables Γi

  9. Factored rewards • the reward is a sum of local terms: R(x, a) = Σj Rj(x[Zj], a) • each term depends only on a few variables (a code sketch of slides 7–9 follows below)
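To make slides 7–9 concrete, here is a minimal sketch of how a factored MDP could be represented in code. The class and field names are hypothetical illustrations, not the authors' implementation:

```python
class FactoredMDP:
    """Minimal sketch of a factored MDP (hypothetical names, not the
    authors' code). Each state variable x_i has a small parent set
    Gamma_i and a conditional probability table (CPT); the reward is a
    sum of local terms over small scopes Z_j."""

    def __init__(self, domain_sizes, parents, cpts, reward_scopes, reward_tables):
        self.domain_sizes = domain_sizes    # |dom(x_i)| for each variable
        self.parents = parents              # Gamma_i: tuple of parent indices per variable
        self.cpts = cpts                    # cpts[a][i][parent_values] -> distribution over x_i'
        self.reward_scopes = reward_scopes  # Z_j: tuple of variable indices per reward term
        self.reward_tables = reward_tables  # reward_tables[a][j][scope_values] -> local reward

    def transition_prob(self, x, a, x_next):
        # P(x' | x, a) = prod_i P_i(x_i' | x[Gamma_i], a)
        p = 1.0
        for i, gamma_i in enumerate(self.parents):
            parent_vals = tuple(x[j] for j in gamma_i)
            p *= self.cpts[a][i][parent_vals][x_next[i]]
        return p

    def reward(self, x, a):
        # R(x, a) = sum_j R_j(x[Z_j], a)
        return sum(table[tuple(x[k] for k in scope)]
                   for scope, table in zip(self.reward_scopes, self.reward_tables[a]))
```

Because every factor has a small scope, storing and evaluating the model takes space and time polynomial in the number of variables, even though the flat state space is exponentially large.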

  10. (Factored value functions) • V* is not factored in general • so any factored approximation necessarily makes an approximation error

  11. Solving a known FMDP • NP-hard: either exponential time or a non-optimal solution… • exponential-time worst case: • flattening the FMDP • approximate policy iteration [Koller & Parr, 2000; Boutilier, Dearden & Goldszmidt, 2000] • non-optimal solutions (approximating the value function in a factored form): • approximate linear programming [Guestrin, Koller, Parr & Venkataraman, 2002] • ALP + policy iteration [Guestrin et al., 2002] • factored value iteration [Szita & Lőrincz, 2008]

  12. Factored value iteration • H := matrix of basis functions, N(H^T) := row-normalization of H^T • the iteration w_{t+1} = N(H^T) T(H w_t), with T the Bellman backup, converges to a fixed point w̃ • each step can be computed quickly for FMDPs • let Ṽ = H w̃; then Ṽ has bounded error relative to V* (sketched below)
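A small numeric sketch of this iteration on a flat (enumerated) model; it ignores the factored speed-up and the exact error bound, and the function signature is an assumption, but it shows the fixed-point computation with the row-normalized projection N(H^T):

```python
import numpy as np

def factored_value_iteration(H, P, R, gamma=0.95, iters=200):
    """Sketch of FVI's fixed-point iteration on an enumerated model.

    H : (num_states, k) basis matrix (nonnegative entries matter later,
        when optimism must survive the projection; see slide 26)
    P : dict action -> (num_states, num_states) transition matrix
    R : dict action -> (num_states,) reward vector
    """
    # N(H^T): row-normalize H^T so the projection is a non-expansion.
    G = H.T / np.abs(H.T).sum(axis=1, keepdims=True)
    w = np.zeros(H.shape[1])
    for _ in range(iters):
        V = H @ w
        # Bellman backup T(V): max over actions of R + gamma * P V
        TV = np.max(np.stack([R[a] + gamma * P[a] @ V for a in P]), axis=0)
        w = G @ TV  # project the backup onto the span of the basis
    return w, H @ w  # the fixed point w_tilde and V_tilde = H w_tilde
```

In the factored setting, the products H w and G·T(Hw) are never formed over the full state space; they are computed factor by factor, which is what makes each iteration fast.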

  13. Learning in unknown FMDPs • unknown factor decompositions (structure) • unknown rewards • unknown transitions (dynamics)


  15. Outline • Factored MDPs • motivation • definitions • planning in FMDPs • Optimism • Optimism & FMDPs & Model-based learning

  16. Learning in an unknown FMDP, a.k.a. “Explore or exploit?” • after trying a few action sequences… • … should we try to discover better ones? • … or do the best thing according to current knowledge?

  17. Be Optimistic! (when facing uncertainty)

  18. either you get experience…

  19. or you get reward!

  20. Outline • Factored MDPs • motivation • definitions • planning in FMDPs • Optimism • Optimism & FMDPs & Model-based learning

  21. Factored Initial Model • component x1, parents: (x1, x3) • component x2, parent: (x2) • …

  22. Factored Optimistic Initial Model • a “Garden of Eden” state with +$10000 reward (or something else very high) • component x1, parents: (x1, x3) • component x2, parent: (x2) • …

  23. Later on… • according to the initial model, all states have very high value • in frequently visited states, the model becomes more realistic → reward expectations get lower → the agent explores other areas • component x1, parents: (x1, x3) • component x2, parent: (x2) • …

  24. Factored optimistic initial model • initialize the model (optimistically) • for each time step t: • solve the approximate model using factored value iteration • take the greedy action, observe the next state • update the model • the number of non-near-optimal steps (w.r.t. Ṽ) is polynomial, with probability ≈ 1 • (see the sketch below)
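A schematic of this loop in Python. The helper functions and the environment interface are assumptions made for illustration, not the authors' implementation:

```python
def optimistic_fmdp_learning(env, model, num_steps, gamma=0.95):
    """Sketch of the main loop: plan in the optimistically initialized
    factored model, act greedily, update the empirical counts. The helper
    names (solve_with_fvi, greedy_action, update_counts) are hypothetical."""
    x = env.reset()
    for t in range(num_steps):
        V = solve_with_fvi(model, gamma)    # approximate planning step (slide 12)
        a = greedy_action(model, V, x)      # greedy w.r.t. the optimistic values
        x_next, reward = env.step(a)
        update_counts(model, x, a, x_next)  # refine the empirical factor frequencies
        x = x_next
```

Note that there is no explicit exploration bonus or explore/exploit switch anywhere in the loop; the optimistic initialization alone drives exploration.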

  25. Elements of proof: some standard stuff • if two transition models are close, then their value functions are close • if every factor Pi is estimated accurately, then the joint transition model is accurate, too • let mi be the number of visits to the parent configuration of factor i; if mi is large, then the empirical frequencies are accurate for all values yi • more precisely: the estimation error shrinks as O(1/√mi), with high probability (Hoeffding/Azuma inequality; a standard form of the bound is given below)
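The inequality lost from this slide is presumably a Hoeffding-style bound on the empirical factor frequencies; a standard form (an assumed reconstruction, not necessarily the slide's exact statement) is:

```latex
\left| \hat{P}_i(y_i \mid u) - P_i(y_i \mid u) \right|
  \;\le\; \sqrt{\frac{1}{2 m_i} \log \frac{2}{\delta}}
  \qquad \text{with probability at least } 1 - \delta
```

for every value y_i, where m_i counts the visits to the parent configuration u of factor i.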

  26. Elements of proof: main lemma • for any time step, the approximate Bellman updates are more optimistic than the real ones: the lower bound comes from Azuma’s inequality, the bonus is the one promised by the Garden of Eden state • if the Eden value VE is large enough, the bonus term dominates for a long time • if all elements of H are nonnegative, the projection preserves optimism (spelled out below)
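Why a nonnegative basis preserves optimism, in the notation of slide 12 (this spelling-out is mine, not the slide's):

```latex
% With G := \mathcal{N}(H^{\mathsf{T}}) and all entries of H (hence of G) nonnegative,
% componentwise ordering survives the projection:
U \ge V \;\Longrightarrow\; G\,U \ge G\,V \;\Longrightarrow\; H\,G\,U \ge H\,G\,V .
```

So if a backed-up value function is an upper bound on the true one, its projection onto the basis remains an upper bound.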

  27. Elements of proof: wrap up • for a long time, Vt is optimistic enough to boost exploration • at most polynomially many exploration steps can be made • apart from those, the agent must be near-Ṽ-optimal

  28. Previous approaches • extensions of E3, Rmax, and MBIE to FMDPs • using the current model, make a smart plan (explore or exploit) • explore: make the model more accurate • exploit: collect near-optimal reward • unspecified planners • requirement: the output plan is close to optimal • … e.g., solve the flat MDP • polynomial sample complexity • but exponential amounts of computation!

  29. Unknown rewards? • “To simplify the presentation, we assume the reward function is known and does not need to be learned. All results can be extended to the case of an unknown reward function.” This turns out to be false. • problem: the reward components cannot be observed, only their sum • → UAI poster [Walsh, Szita, Diuk & Littman, 2009]

  30. Unknown structure? • can be learned in polynomial time • SLF-Rmax [Strehl, Diuk & Littman, 2007] • Met-Rmax [Diuk, Li & Leffler, 2009]

  31. Take-home message • if your model starts out optimistically enough, you get efficient exploration for free! • (even if your planner is non-optimal, as long as it is monotonic)

  32. Thank you for your attention!

  33. Optimistic initial model for FMDPs • add a “Garden of Eden” value to each state variable • add reward factors for each state variable • initialize the transition model (optimistically, toward Eden; see the sketch below)
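A minimal sketch of what such an initialization could look like, assuming the per-variable layout of the earlier sketches; the deterministic move-to-Eden prior and the even split of the Eden reward across variables are illustrative assumptions, not the authors' exact construction:

```python
def optimistic_initial_model(domain_sizes, v_eden=10000.0):
    """Sketch: give each state variable one extra "Garden of Eden" value.
    Before any data arrives, the model claims every transition leads to
    Eden, and a per-variable reward factor promises a share of v_eden
    there. The layout and the reward split are illustrative assumptions."""
    m = len(domain_sizes)
    model = []
    for n in domain_sizes:
        model.append({
            "counts": {},                        # parent config -> observed next-value counts
            "prior_next": [0.0] * n + [1.0],     # P(x_i' = Eden | anything) = 1 with no data
            "reward": [0.0] * n + [v_eden / m],  # local reward: bonus only at the Eden value
        })
    return model
```

As the visit counts grow, the empirical frequencies replace the Eden prior, expected rewards drop toward their realistic values, and the greedy agent moves on to less-visited regions (slide 23).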

