Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs István Szita & András Lőrincz University of Alberta Canada Eötvös Loránd University Hungary
Outline • Factored MDPs • motivation • definitions • planning in FMDPs • Optimism • Optimism & FMDPs & Model-based learning
Reinforcement learning • the agent makes decisions • … in an unknown world • makes some observations (including rewards) • tries to maximize collected reward
What kind of observation? • structured observations • structure is unclear
How to “solve an RL task”? • a model is useful • can reuse experience from previous trials • can learn offline • observations are structured • structure is unknown • structured + model + RL = FMDP ! • (or linear dynamical systems, neural networks, etc…)
Factored MDPs • ordinary MDPs • everything is factored • states • rewards • transition probabilities • (value functions)
Factored state space • all functions depend on a few variables only
Factored dynamics
Factored rewards
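The factored dynamics above can be sketched concretely. The following is a minimal illustration with binary state variables: each variable's next value depends only on a small parent set, so the joint transition probability is a product of local factors. The parent sets and probability tables here are made up for the example, not taken from the paper.

```python
import numpy as np

n_vars = 3
# Each variable's next value depends only on a small parent set (a DBN).
parents = {0: (0, 2), 1: (1,), 2: (0, 1, 2)}

# One conditional probability table per variable: maps the parents'
# joint value to P(x_i' = 1 | parents).
rng = np.random.default_rng(0)
cpts = {i: rng.random(2 ** len(ps)) for i, ps in parents.items()}

def step(state, rng):
    """Sample the next state variable-by-variable from the factored model."""
    next_state = np.empty_like(state)
    for i, ps in parents.items():
        idx = int("".join(str(state[p]) for p in ps), 2)  # index into the CPT
        next_state[i] = rng.random() < cpts[i][idx]
    return next_state

s = np.array([1, 0, 1])
print(step(s, rng))  # next state, sampled from the product of local factors
```

The point of the factorization is size: storing the model costs one small table per variable instead of one row per joint state, which is what makes polynomial-time learning conceivable at all.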
(Factored value functions) • V* is not factored in general • we will make an approximation error
Solving a known FMDP • NP-hard • either exponential-time or non-optimal… • exponential-time worst case • flattening the FMDP • approximate policy iteration [Koller & Parr, 2000; Boutilier, Dearden & Goldszmidt, 2000] • non-optimal solution (approximating the value function in factored form) • approximate linear programming [Guestrin, Koller, Parr & Venkataraman, 2002] • ALP + policy iteration [Guestrin et al., 2002] • factored value iteration [Szita & Lőrincz, 2008]
Factored value iteration • H := matrix of basis functions • N(H^T) := row-normalization of H^T • the iteration converges to a fixed point w* • can be computed quickly for FMDPs • let V̂ = Hw*; then V̂ has bounded error
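A toy sketch of the iteration on a tiny flat MDP (the model, the aggregation basis, and all constants are illustrative, not from the paper). N(H^T) rescales each row of H^T to unit L1 norm; with the aggregation basis chosen here, H @ N(H^T) is an averaging (stochastic) matrix, so the projected Bellman backup stays a γ-contraction and the iteration converges:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)              # proper transition kernels
r = rng.random((n_actions, n_states))

# Basis: aggregate states {0,1} and {2,3} into one feature each.
H = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
G = H.T / np.abs(H.T).sum(axis=1, keepdims=True)   # N(H^T)

w = np.zeros(H.shape[1])
for _ in range(200):
    backup = (r + gamma * (P @ (H @ w))).max(axis=0)  # Bellman optimality backup
    w = G @ backup                                    # project onto the basis
# w is (numerically) the fixed point w*; H @ w is the bounded-error estimate
```

The normalization is the crucial design choice: an unnormalized least-squares projection can make the combined operator expand, while row-normalizing H^T bounds its sup-norm gain by 1 and restores guaranteed convergence.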
Learning in unknown FMDPs • unknown factor decompositions (structure) • unknown rewards • unknown transitions (dynamics)
Outline • Factored MDPs • motivation • definitions • planning in FMDPs • Optimism • Optimism & FMDPs & Model-based learning
Learning in an unknown FMDP, a.k.a. “Explore or exploit?” • after trying a few action sequences… • … try to discover better ones? • … or do the best thing according to current knowledge?
Be Optimistic! (when facing uncertainty)
either you get experience…
or you get reward!
Outline • Factored MDPs • motivation • definitions • planning in FMDPs • Optimism • Optimism & FMDPs & Model-based learning
Factored Initial Model • component x1, parents: (x1, x3) • component x2, parent: (x2) • …
Factored Optimistic Initial Model • “Garden of Eden” state with +$10000 reward (or something very high)
Later on… • according to the initial model, all states have very high value • in frequently visited states, the model becomes more realistic → reward expectations get lower → the agent explores other areas
Factored optimistic initial model • initialize model (optimistically) • for each time step t: • solve the approximate model using factored value iteration • take greedy action, observe next state • update model • the number of non-near-optimal steps (w.r.t. V̂) is polynomial, with probability ≈ 1
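The loop above can be illustrated on a flat MDP. This toy drops the factored machinery and FVI entirely, keeping only the principle: initialize the model so that every state-action pair has one phantom transition to a fictitious high-reward “Garden of Eden” state, then always act greedily on the learned model. All names and constants here are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9
EDEN, R_EDEN = 3, 100.0                    # index and reward of the phantom state

true_P = rng.random((nA, nS, nS))
true_P /= true_P.sum(axis=2, keepdims=True)
true_r = rng.random((nA, nS))

# Optimistic counts: one phantom visit per (a, s) that ended in Eden.
counts = np.zeros((nA, nS, nS + 1))
counts[:, :, EDEN] = 1.0
rew_sum = np.full((nA, nS), R_EDEN)        # one phantom Eden-level reward sample

def plan():
    """Value iteration on the current empirical (hence optimistic) model."""
    P = counts / counts.sum(axis=2, keepdims=True)
    r = rew_sum / counts.sum(axis=2)       # phantom sample keeps rewards optimistic
    V = np.zeros(nS)
    V_eden = R_EDEN / (1 - gamma)          # Eden is absorbing with reward R_EDEN
    for _ in range(100):
        Q = r + gamma * (P[:, :, :nS] @ V + P[:, :, EDEN] * V_eden)
        V = Q.max(axis=0)
    return Q

s = 0
for t in range(500):
    a = int(plan()[:, s].argmax())         # greedy w.r.t. the optimistic model
    s2 = int(rng.choice(nS, p=true_P[a, s]))
    counts[a, s, s2] += 1.0                # real data dilutes the phantom visit
    rew_sum[a, s] += true_r[a, s]
    s = s2
```

As real visits accumulate, the empirical Eden probability counts[a, s, EDEN] / m shrinks like 1/m, so optimism fades exactly where the agent already has data, and the greedy policy is pulled toward under-visited state-action pairs without any explicit exploration rule.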
elements of proof: some standard stuff • if each local factor is ε-accurate, then the product transition probabilities are O(mε)-accurate • let m_i be the number of visits to a parent configuration of x_i; if m_i is large, then the empirical factor is accurate for all values y_i • more precisely, the error is at most √(log(2/δ) / (2 m_i)) with probability ≥ 1 − δ (Hoeffding/Azuma inequality)
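The Hoeffding/Azuma step can be checked numerically. This sketch (with made-up parameters) verifies that the empirical estimate of a transition probability from m samples stays within √(log(2/δ)/(2m)) of the truth in at least a 1 − δ fraction of trials:

```python
import numpy as np

rng = np.random.default_rng(0)
p_true, m, delta, trials = 0.3, 2000, 0.05, 1000
eps = np.sqrt(np.log(2 / delta) / (2 * m))   # Hoeffding radius

samples = rng.random((trials, m)) < p_true   # trials x m Bernoulli draws
p_hat = samples.mean(axis=1)                 # empirical probabilities
coverage = np.mean(np.abs(p_hat - p_true) <= eps)
print(coverage)  # empirically at least 1 - delta = 0.95
```

The bound is what makes the sample complexity polynomial: driving the per-factor error below ε requires only O(log(1/δ)/ε²) visits per parent configuration, and the number of parent configurations is small by the factored structure.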
elements of proof: main lemma • for any t, the approximate Bellman updates are more optimistic than the real ones (lower-bounded via Azuma’s inequality, plus the bonus promised by the Garden of Eden state) • if the Eden value VE is large enough, the bonus term dominates for a long time • if all elements of H are nonnegative, the projection preserves optimism
elements of proof: wrap-up • for a long time, Vt is optimistic enough to boost exploration • at most polynomially many exploration steps can be made • apart from those, the agent must be near-V̂-optimal
Previous approaches • extensions of E3, Rmax, MBIE to FMDPs • using the current model, make a smart plan (explore or exploit) • explore: make the model more accurate • exploit: collect near-optimal reward • unspecified planners • requirement: output plan is close to optimal • … e.g., solve the flat MDP • polynomial sample complexity, but exponential amounts of computation!
Unknown rewards? • “To simplify the presentation, we assume the reward function is known and does not need to be learned. All results can be extended to the case of an unknown reward function.” False! • problem: one cannot observe the reward components, only their sum • → UAI poster [Walsh, Szita, Diuk & Littman, 2009]
Unknown structure? • can be learnt in polynomial time • SLF-Rmax [Strehl, Diuk & Littman, 2007] • Met-Rmax [Diuk, Li & Littman, 2009]
Take-home message: if your model starts out optimistically enough, you get efficient exploration for free! (even if your planner is non-optimal, as long as it is monotonic)
Optimistic initial model for FMDPs • add a “Garden of Eden” value to each state variable • add a reward factor for each state variable • initialize the transition model
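The three construction steps can be sketched structurally (binary variables; all names and the Eden reward value are illustrative, not from the paper):

```python
n_vars = 3

# Step 1: extend each variable's domain with a "Garden of Eden" value.
domains = [[0, 1, "EDEN"] for _ in range(n_vars)]

# Step 2: one reward factor per variable, paying a huge bonus in Eden.
R_E = 100.0
reward_factors = [{0: 0.0, 1: 0.0, "EDEN": R_E} for _ in range(n_vars)]

# Step 3: before any data, every local transition puts all its
# probability mass on the Eden value, whatever the parents are.
def initial_local_transition(parent_values):
    return {"EDEN": 1.0}

print(initial_local_transition((1, 0)))
```

Because the Eden value and reward factor are added per variable, the optimistic model stays factored with the same parent structure, so factored value iteration can plan in it at no extra asymptotic cost.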