Optimism in the Face of Uncertainty: A Unifying Approach
István Szita & András Lőrincz, Eötvös Loránd University, Hungary
Outline • background • quick overview of exploration methods • construction of the new algorithm • analysis & experimental results • outlook
Background • Markov decision processes • finite, discounted • (…but wait until the end of the talk) • value function-based methods • Q(x,a) values • the efficient exploration problem
Basic exploration: ε-greedy • extremely simple • sufficient for convergence in the limit • for many classical methods like Q-learning, Dyna, Sarsa • …under suitable conditions • extremely inefficient
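A minimal sketch of ε-greedy action selection, just to make the idea concrete; `Q`, `state`, `n_actions`, and `epsilon` are illustrative names, not from the talk:

```python
import random

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    """With probability epsilon explore randomly, otherwise act greedily on Q(x, a)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)                        # random exploratory action
    return max(range(n_actions), key=lambda a: Q[state][a])       # greedy action
```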
Advanced exploration • in case of uncertainty, be optimistic! • …details vary • we will use concepts from • R-max • optimistic initial values • exploration bonus methods • model-based interval estimation • there are many others: • Bayesian methods • UCT • delayed Q-learning • …
R-max (Brafman & Tennenholtz, 2001) • builds a model from observations • uses an optimistic model • unknown transitions go to a "garden of Eden" (a hypothetical state with maximum reward) • transitions are declared known after sufficiently many visits + poly-time convergence − slow in practice
Optimistic initial values • set initial values very high • usually combined with other techniques • with very high initial values, no need for additional exploration + no extra work − wears out slowly − only model-free
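A small sketch of the optimistic-initialization idea, assuming a known reward bound Rmax and discount factor γ; the function and variable names are ours:

```python
def optimistic_q_table(states, actions, r_max, gamma):
    """Start every Q(x, a) at the largest achievable discounted return,
    so untried actions look at least as good as anything already observed."""
    q_init = r_max / (1.0 - gamma)
    return {x: {a: q_init for a in actions} for x in states}
```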
Exploration bonus methods (e.g. Meuleau & Bourgine, 1999; many others) • bonus reward for "interesting" states • rarely visited, large TD-error, etc. • exact size/form varies • can oscillate heavily • regular/bonus rewards accumulated in separate value functions + can be efficient in practice − ad-hoc method − bonuses do not converge
Model-based interval estimation (Wiering, 1998; Strehl & Littman, 2006) • builds a model from observations • estimates confidence intervals of state values • exploration bonus: half-widths of the intervals + poly-time convergence − ???
Assembling the new algorithm • model estimation from the observed transitions: • estimated mean reward R(x,a,y) = (sum of rewards for all (x,a,y) transitions up to t) / (number of visits to (x,a,y) up to t) • estimated transition probability P(y | x,a) = (number of visits to (x,a,y) up to t) / (number of visits to (x,a) up to t)
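A sketch of keeping the three counters above and turning them into the empirical model; class and method names are illustrative, not from the talk:

```python
from collections import defaultdict

class EmpiricalModel:
    """Maximum-likelihood model estimate built from visit and reward counters."""

    def __init__(self):
        self.reward_sum = defaultdict(float)  # sum of rewards on (x, a, y) up to t
        self.n_xay = defaultdict(int)         # number of visits to (x, a, y) up to t
        self.n_xa = defaultdict(int)          # number of visits to (x, a) up to t

    def update(self, x, a, y, r):
        """Record one observed (or fictitious) transition from x via a to y with reward r."""
        self.reward_sum[(x, a, y)] += r
        self.n_xay[(x, a, y)] += 1
        self.n_xa[(x, a)] += 1

    def p_hat(self, x, a, y):
        """Estimated transition probability P(y | x, a)."""
        n = self.n_xa[(x, a)]
        return self.n_xay[(x, a, y)] / n if n else 0.0

    def r_hat(self, x, a, y):
        """Estimated mean reward of the transition (x, a, y)."""
        n = self.n_xay[(x, a, y)]
        return self.reward_sum[(x, a, y)] / n if n else 0.0
```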
Assembling the new algorithm II • Optimistic initial model: a single (fictitious) visit to xE from each (x,a) • xE is a hypothetical "Eden" state with max. reward (cf. R-max) • no extra work after initialization (cf. optimistic initial values) • really optimistic!
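A sketch of the optimistic initialization on top of the empirical model above: the counters are seeded as if every (x, a) had already produced one transition to the Eden state xE paying Rmax, and xE is assumed absorbing; the `EDEN` label and the function name are ours:

```python
EDEN = "x_E"  # hypothetical garden-of-Eden state (our label)

def initialize_optimistic(model, states, actions, r_max):
    """Seed the counters with one fictitious, maximally rewarding visit per (x, a)."""
    for x in states:
        for a in actions:
            model.update(x, a, EDEN, r_max)
    # assume xE is absorbing and keeps paying Rmax under every action
    for a in actions:
        model.update(EDEN, a, EDEN, r_max)
```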
Assembling the new algorithm III • in each step t: • at := greedy with respect to Qt(xt,·) • perform at, observe next state and reward • update counters and model parameters • solve the model MDP • …can be done incrementally & fast, e.g. a few steps of value iteration, or asynchronously by prioritized sweeping • get the new value function Qt+1
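A sketch of the per-step loop, reusing the `EmpiricalModel` above; the `env.step` interface, the synchronous value-iteration solver, and the number of sweeps are our assumptions (the talk notes the model MDP can equally be re-solved incrementally or asynchronously, e.g. by prioritized sweeping):

```python
def solve_model(model, states, actions, gamma, sweeps=5):
    """A few sweeps of value iteration on the current estimated MDP.
    Note: `states` should include the Eden state so its optimistic value is backed up."""
    Q = {x: {a: 0.0 for a in actions} for x in states}
    for _ in range(sweeps):
        V = {x: max(Q[x].values()) for x in states}
        for x in states:
            for a in actions:
                Q[x][a] = sum(model.p_hat(x, a, y) *
                              (model.r_hat(x, a, y) + gamma * V[y])
                              for y in states)
    return Q

def oim_step(env, model, Q, x, states, actions, gamma):
    """One interaction step: greedy action, model update, re-solve, new value function."""
    a = max(actions, key=lambda act: Q[x][act])   # always greedy w.r.t. the current Q
    y, r = env.step(a)                            # perform a, observe next state and reward
    model.update(x, a, y, r)                      # update counters / model parameters
    Q = solve_model(model, states, actions, gamma)
    return y, Q
```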
Assembling the new algorithm IV • potential problem: Rmax is too large! • solution: keep real and bonus rewards in separate value functions (cf. exploration bonus methods) • "real" value function: initialized to 0, the observed ("real") rewards are added to it • "bonus" value function: initialized to 0 or Rmax, nothing else is added to it; it carries the exploration bonus • the real-reward value function can be read out and used at any time!
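A sketch of the split, again over the estimated model above: `Q_real` is backed up with the observed rewards only, `Q_bonus` receives only the fictitious Eden reward (the exploration bonus), and the greedy action is taken with respect to their sum, so `Q_real` alone can be read out at any time. This is our reading of the slide, not code from the authors:

```python
def solve_split(model, states, actions, gamma, r_max, sweeps=5):
    """Value iteration with real and bonus rewards kept in separate tables."""
    Q_real = {x: {a: 0.0 for a in actions} for x in states}   # real rewards only
    Q_bonus = {x: {a: 0.0 for a in actions} for x in states}  # Eden / bonus rewards only
    for _ in range(sweeps):
        V_real, V_bonus = {}, {}
        for x in states:
            # the behaviour policy is greedy w.r.t. the combined value
            best = max(actions, key=lambda a: Q_real[x][a] + Q_bonus[x][a])
            V_real[x], V_bonus[x] = Q_real[x][best], Q_bonus[x][best]
        for x in states:
            for a in actions:
                Q_real[x][a] = sum(model.p_hat(x, a, y) *
                                   ((0.0 if y == EDEN else model.r_hat(x, a, y))
                                    + gamma * V_real[y])
                                   for y in states)
                Q_bonus[x][a] = sum(model.p_hat(x, a, y) *
                                    ((r_max if y == EDEN else 0.0)
                                     + gamma * V_bonus[y])
                                    for y in states)
    return Q_real, Q_bonus
```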
Convergence results • one parameter: Rmax • for large enough Rmax, converges to near-optimum (with high probability) • proof is based on MBIE's proof (and R-max, E3) • by the time the bonus becomes small → numVisits is large → the model estimate is accurate • the bonus term has a different form than MBIE's (a rough estimate is sketched below) • looser bound (but still polynomial!)
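A back-of-the-envelope estimate of the bonus size, using only what the earlier slides state (one fictitious visit to xE paying Rmax from every (x, a), with xE absorbing); this is our reading, not a formula from the talk:

```latex
% After n(x,a) real visits, the fictitious Eden transition has empirical probability
\hat{P}(x_E \mid x, a) = \frac{1}{n(x,a) + 1},
% so its contribution to Q(x,a) -- the exploration bonus -- is roughly
\mathrm{bonus}(x,a) \;\approx\; \frac{1}{n(x,a)+1} \cdot \frac{\gamma\, R_{\max}}{1-\gamma}
\;=\; O\!\left(\tfrac{1}{n(x,a)}\right),
% whereas a confidence-interval bonus (as in MBIE) shrinks only as O(1/\sqrt{n(x,a)}).
```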
Experimental results I (Strehl & Littman, 2006) • “RiverSwim” • “SixArms”
Experimental results II (Meuleau & Bourgine, 1999; Strens, 2000; Dearden, 2000) • “Chain” • “Loop”
Experimental results III (Meuleau & Bourgine, 1999; Strens, 2000; Dearden, 2000) • “FlagMaze”
Experimental results IV (Wiering & Schmidhuber, 1998) • "Maze with subgoals" (figure: maze with rewards +1000, +500, +500)
Outlook • extension to factored MDPs: almost ready • (we need benchmarks) • extension to general function approximation: in progress
Advantages of OIM • polynomial-time convergence (to near-optimum, with high probability) • convincing performance in practice • extremely simple to implement • all work done at initialization • decision making is always greedy • Matlab source code to be released soon
Thank you for your attention! check our web pages at http://szityu.web.eotvos.elte.hu http://inf.elte.hu/lorincz or my reinforcement learning blog “Gimme Reward” at http://gimmereward.wordpress.com