Bayesian Sparse Sampling for On-line Reward Optimization
Presented in the Value of Information Seminar at NIPS 2005
Based on a previous paper from ICML 2005, written by Dale Schuurmans et al.
Background Perspective
• Be Bayesian about reinforcement learning
• Ideal representation of uncertainty for action selection
• Computational barriers
Why are Bayesian approaches not prevalent in RL?
Exploration vs. Exploitation
• Bayes decision theory
• Value of information measured by ultimate return in reward
• Choose actions to maximize expected value
• Exploration/exploitation tradeoff implicitly handled as a side effect
Bayesian Approach
Conceptually clean but computationally disastrous
versus
conceptually disastrous but computationally clean
Overview
• Efficient lookahead search for Bayesian RL
• Sparser sparse sampling
• Controllable computational cost
• Higher quality action selection than current methods:
  - Greedy
  - Epsilon-greedy
  - Boltzmann (Luce 1959)
  - Thompson sampling (Thompson 1933)
  - Bayes optimal (Hee 1978)
  - Interval estimation (Lai 1987, Kaelbling 1994)
  - Myopic value of perfect information (Dearden, Friedman, Andre 1999)
  - Standard sparse sampling (Kearns, Mansour, Ng 2001)
  - Péret & Garcia (Péret & Garcia 2004)
Sequential Decision Making
Requires a model P(r,s'|s,a). How to make an optimal decision? "Planning."
[Slide diagram: a lookahead tree from state s, alternating a MAX over actions a, which gives V(s) from the Q(s,a) values, with an expectation over rewards r and next states s' at each (s,a) node; the same pattern repeats at s', s'', ...]
This is the finite horizon, finite action, finite reward case.
General case: fixed point equations
  V(s) = max_a Q(s,a)
  Q(s,a) = E[ r + γ V(s') | s, a ]   (expectation under P(r,s'|s,a); discount γ in the general case)
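To make the planning backup above concrete, here is a minimal sketch (not from the slides) of the finite-horizon computation when the model is known; the dictionary-based model representation and the function names are my own illustrative choices.

```python
# Minimal sketch (not from the slides) of finite-horizon planning backups:
#   V(s) = max_a Q(s,a),   Q(s,a) = E[ r + V(s') ]
# `model[(s, a)]` is assumed to be a list of (prob, reward, next_state) triples.

def q_value(model, s, a, horizon):
    """Expected return of taking action a in state s with `horizon` steps left."""
    return sum(p * (r + v_value(model, s2, horizon - 1))
               for p, r, s2 in model[(s, a)])

def v_value(model, s, horizon):
    """Optimal value of state s with `horizon` steps left (zero at the horizon)."""
    if horizon == 0:
        return 0.0
    actions = {a for (st, a) in model if st == s}
    return max(q_value(model, s, a, horizon) for a in actions)

# Toy example: one state 'x', two actions with Bernoulli-style rewards.
model = {
    ('x', 0): [(0.7, 1.0, 'x'), (0.3, 0.0, 'x')],
    ('x', 1): [(0.4, 1.0, 'x'), (0.6, 0.0, 'x')],
}
print(v_value(model, 'x', horizon=3))  # 3 steps of the better action: 2.1
```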
Reinforcement Learning
We do not have the model P(r,s'|s,a), so the expectations in the lookahead tree cannot be computed and the planning backup above cannot be applied directly.
[Slide diagram: the same lookahead tree as before, with the expectation steps unavailable because the model is unknown.]
Bayesian Reinforcement Learning
• Prior P(θ) on the model P(r,s'|s,a,θ); belief state b = P(θ), the current distribution over models
• Meta-level MDP: meta-level state (s, b), decision a, outcome (r, s', b'), with meta-level model P(r,s',b'|s,b,a); at the next step, state (s', b'), decision a', outcome (r', s'', b''), with meta-level model P(r',s'',b''|s',b',a')
• Choose actions to maximize long-term reward
• We do have a model for the meta-level transitions, based on posterior updates and expectations over base-level MDPs
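As a concrete illustration of a belief state and its update, here is a minimal sketch assuming the Bernoulli-bandit setting used in the experiments later; the class and method names are my own.

```python
# Sketch (assumed Beta-Bernoulli bandit setting): the belief state b is a
# Beta(alpha, beta) posterior per arm, and the meta-level transition from b
# to b' is just a conjugate posterior update after observing a reward.
from dataclasses import dataclass

@dataclass(frozen=True)
class BetaBelief:
    alpha: tuple  # per-arm success counts (including prior pseudo-counts)
    beta: tuple   # per-arm failure counts (including prior pseudo-counts)

    def update(self, arm, reward):
        """Return the next belief state b' after a Bernoulli reward on `arm`."""
        a, b = list(self.alpha), list(self.beta)
        (a if reward else b)[arm] += 1
        return BetaBelief(tuple(a), tuple(b))

    def mean(self, arm):
        """Posterior-mean reward of `arm` under the current belief."""
        return self.alpha[arm] / (self.alpha[arm] + self.beta[arm])

# Uniform Beta(1,1) prior over 5 arms, then one observed success on arm 2.
b0 = BetaBelief((1,) * 5, (1,) * 5)
b1 = b0.update(arm=2, reward=1)
print(b1.mean(2))  # 2/3
```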
Bayesian RL Decision Making
How to make an optimal decision? Bayes optimal action selection: solve the planning problem in the meta-level MDP for the optimal Q and V values.
[Slide diagram: a lookahead tree over meta-level states, where V(s,b) is a MAX over the Q(s,b,a), and each Q(s,b,a) is an expectation over rewards r and successor values V(s',b').]
Problem: the meta-level MDP is much larger than the base-level MDP, so this is impractical.
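For reference, the fixed-point equations implied by the meta-level tree can be written as follows (a standard undiscounted finite-horizon form, paraphrased rather than copied from the slides; a discount factor can be inserted for the general case):

```latex
% Meta-level Bellman equations (standard form; notation follows the slides)
\begin{align*}
V(s,b)   &= \max_{a} Q(s,b,a) \\
Q(s,b,a) &= \mathbb{E}_{(r,\,s',\,b') \sim P(r,s',b'\mid s,b,a)}
            \bigl[\, r + V(s',b') \,\bigr]
\end{align*}
```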
Bayesian RL Decision Making
Current approximation strategies, starting from the current belief state b:
• Greedy approach: current b → mean base-level MDP model → point estimate for Q, V → choose the greedy action. But it doesn't consider uncertainty.
• Thompson approach: current b → draw a base-level MDP model → point estimate for Q, V → choose its greedy action (so each action is chosen with probability proportional to the probability that it is the max-Q action). ☺ Exploration is based on uncertainty. But it is still myopic.
[Slide diagram: two lookahead trees rooted at the current (s, b), one built from the mean base-level MDP and one from a drawn base-level MDP.]
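In the assumed Beta-Bernoulli bandit setting, the two strategies differ only in whether the point estimate comes from the posterior mean or from a posterior sample; a minimal sketch (function names are my own):

```python
# Sketch (assumed Beta-Bernoulli bandits): greedy vs. Thompson action selection.
import random

def greedy_action(alpha, beta):
    """Mean-model approach: act greedily on the posterior-mean reward of each arm."""
    means = [a / (a + b) for a, b in zip(alpha, beta)]
    return max(range(len(means)), key=means.__getitem__)

def thompson_action(alpha, beta):
    """Thompson approach: sample one model from the posterior, act greedily in it."""
    draws = [random.betavariate(a, b) for a, b in zip(alpha, beta)]
    return max(range(len(draws)), key=draws.__getitem__)

alpha, beta = [3, 1, 1, 1, 1], [1, 1, 1, 1, 1]   # posterior counts for 5 arms
print(greedy_action(alpha, beta))    # always arm 0 here
print(thompson_action(alpha, beta))  # usually arm 0, but sometimes explores
```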
Their Approach
• Try to better approximate Bayes optimal action selection by performing lookahead
• Adapt "sparse sampling" (Kearns, Mansour, Ng)
• Make some practical improvements
Sparse Sampling (Kearns, Mansour, Ng 2001)
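The original slide illustrates the method being adapted. As a reminder of how standard sparse sampling works (paraphrasing Kearns, Mansour, Ng, not the slides): at each state, sample a fixed number C of outcomes per action from a generative model, recurse to depth H, and back up the estimates with max and average. A minimal sketch, with an assumed toy generative model:

```python
# Minimal sketch of standard sparse sampling (paraphrasing Kearns, Mansour, Ng;
# not taken from the slides): estimate Q by drawing C sampled outcomes per
# action from a generative model, recursing to depth H, and backing up.
import random

N_ACTIONS = 2  # assumed toy action space

def sparse_sample_q(sim, state, action, depth, C=3, H=3):
    """Estimate Q(state, action) from C sampled outcomes per level, to depth H."""
    if depth == H:
        return 0.0
    total = 0.0
    for _ in range(C):
        reward, next_state = sim(state, action)       # generative model call
        total += reward + max(
            sparse_sample_q(sim, next_state, a, depth + 1, C, H)
            for a in range(N_ACTIONS))
    return total / C

# Toy generative model: action 1 pays off with probability 0.7, action 0 with 0.4.
def sim(state, action):
    p = 0.7 if action == 1 else 0.4
    return (1 if random.random() < p else 0), state

print(max(range(N_ACTIONS), key=lambda a: sparse_sample_q(sim, 's0', a, 0)))
```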
Bayesian Sparse Sampling: Observation 1
• Action value estimates are not equally important
• Need better Q value estimates for some actions but not all
• Preferentially expand the tree under actions that might be optimal
Biased tree growth: use Thompson sampling to select which actions to expand.
[Slide diagram: a MAX node whose subtree is expanded more deeply under the promising actions.]
Bayesian Sparse Sampling: Observation 2
Correct leaf value estimates to the same depth: use the mean-MDP Q-value multiplied by the remaining depth.
[Slide diagram: a tree with levels t = 1, 2, 3 and effective horizon N = 3; leaves reached at different depths are corrected out to the common effective horizon.]
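One way to write this correction (my reading of the slide for the undiscounted case, not a formula reproduced from it): a leaf (s, b) reached at depth t is credited with the mean-model value of its N - t remaining steps, where \bar{Q} is the Q-value under the mean MDP (in the bandit case, the posterior-mean one-step reward):

```latex
% Leaf correction, as I read the slide (undiscounted case):
\hat{V}(s,b) \;\approx\; (N - t)\,\max_{a}\,\bar{Q}(s,a)
```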
Bayesian Sparse Sampling: Tree growing procedure
Repeat until the tree size limit is reached:
• Descend the sparse tree from the root (s, b), until a new node is added:
  - Thompson-sample an action at the current (s, b): 1. sample a model from the prior/belief, 2. solve its action values, 3. select the optimal action
  - Sample an outcome (reward and next node s', b') for the chosen (s, b, a)
Then execute the chosen action at the root and observe the reward.
Control computation by controlling the tree size.
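A minimal reconstruction of this tree-growing loop for the assumed Beta-Bernoulli bandit case. This is my own simplification, not the authors' code: beliefs are Beta counts per arm, actions to expand are picked by Thompson sampling, outcomes are drawn from the posterior predictive, and unexpanded branches and leaves fall back on the mean-model correction from Observation 2.

```python
# Sketch of Bayesian sparse sampling for an assumed Beta-Bernoulli bandit.
# Not the authors' implementation; class and function names are illustrative.
import random

class Node:
    def __init__(self, alpha, beta, depth):
        self.alpha, self.beta, self.depth = alpha, beta, depth
        self.children = {}                     # (arm, reward) -> child Node

def thompson_arm(alpha, beta):
    """Sample a bandit model from the belief and return its optimal arm."""
    draws = [random.betavariate(a, b) for a, b in zip(alpha, beta)]
    return max(range(len(draws)), key=draws.__getitem__)

def grow_tree(alpha0, beta0, horizon, max_nodes):
    """Grow a sparse lookahead tree over belief states up to `max_nodes` nodes."""
    root = Node(tuple(alpha0), tuple(beta0), 0)
    size, attempts = 1, 0
    while size < max_nodes and attempts < 20 * max_nodes:    # simple safety cap
        attempts += 1
        node = root
        while node.depth < horizon:
            arm = thompson_arm(node.alpha, node.beta)        # biased expansion
            p = node.alpha[arm] / (node.alpha[arm] + node.beta[arm])
            reward = 1 if random.random() < p else 0         # posterior predictive
            key = (arm, reward)
            if key not in node.children:                     # add one new node
                a, b = list(node.alpha), list(node.beta)
                (a if reward else b)[arm] += 1
                node.children[key] = Node(tuple(a), tuple(b), node.depth + 1)
                size += 1
                break
            node = node.children[key]
    return root

def node_value(node, horizon):
    """Back up values; leaves use the mean-model correction for remaining steps."""
    if horizon - node.depth == 0:
        return 0.0
    return max(q_estimate(node, arm, horizon) for arm in range(len(node.alpha)))

def q_estimate(node, arm, horizon):
    samples = [(r, c) for (a, r), c in node.children.items() if a == arm]
    if not samples:   # unexpanded action: mean-MDP value of the remaining steps
        return (horizon - node.depth) * node.alpha[arm] / (node.alpha[arm] + node.beta[arm])
    return sum(r + node_value(c, horizon) for r, c in samples) / len(samples)

def select_action(alpha, beta, horizon=3, max_nodes=50):
    root = grow_tree(alpha, beta, horizon, max_nodes)
    return max(range(len(alpha)), key=lambda arm: q_estimate(root, arm, horizon))

print(select_action([1, 1, 2], [1, 1, 1]))  # tends to pick the better-looking arm 2
```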
Simple experiments
• 5 Bernoulli bandits
• Beta priors
• Sampled model from prior
• Run action selection strategies
• Repeat 3000 times
• Average accumulated reward per step
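A sketch of this protocol (my reconstruction from the list above; the horizon of 20 steps and the Beta(1,1) priors are assumptions, and `thompson_strategy` stands in for any of the compared selection strategies):

```python
# Sketch of the experimental protocol: draw a true 5-armed Bernoulli bandit from
# the Beta priors, run a strategy online while updating its Beta posterior, and
# average the accumulated reward per step over 3000 repetitions.
import random

def thompson_strategy(alpha, beta):
    draws = [random.betavariate(a, b) for a, b in zip(alpha, beta)]
    return max(range(len(draws)), key=draws.__getitem__)

def run_experiment(strategy, n_arms=5, n_steps=20, n_repeats=3000):
    total = 0.0
    for _ in range(n_repeats):
        true_p = [random.betavariate(1, 1) for _ in range(n_arms)]  # model ~ prior
        alpha, beta = [1] * n_arms, [1] * n_arms                    # Beta(1,1) belief
        for _ in range(n_steps):
            arm = strategy(alpha, beta)
            reward = 1 if random.random() < true_p[arm] else 0
            (alpha if reward else beta)[arm] += 1
            total += reward
    return total / (n_repeats * n_steps)   # average accumulated reward per step

print(run_experiment(thompson_strategy))
```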