Reinforcement Learning: Evaluative Feedback and Bandit Problems Subramanian Ramamoorthy School of Informatics 20 January 2012
Recap: What is Reinforcement Learning? • An approach to Artificial Intelligence • Learning from interaction • Goal-oriented learning • Learning about, from, and while interacting with an external environment • Learning what to do—how to map situations to actions—so as to maximize a numerical reward signal • Can be thought of as a stochastic optimization over time
Recap: The Setup for RL • Agent is: • Temporally situated • Continual learning and planning • Objective is to affect the environment – actions and states • Environment is uncertain, stochastic [Diagram: agent–environment loop – the agent sends actions to the environment, which returns the next state and a reward]
Recap: Key Features of RL • Learner is not told which actions to take • Trial-and-Error search • Possibility of delayed reward • Sacrifice short-term gains for greater long-term gains • The need to explore and exploit • Consider the whole problem of a goal-directed agent interacting with an uncertain environment
Multi-arm Bandits (MAB) • N possible actions (arms/machines) • You can play for some period of time and you want to maximize reward (expected utility) • Which is the best arm/machine? DEMO
Real-Life Version • Choose the best content to display to the next visitor of your commercial website • Content options = slot machines • Reward = user's response (e.g., click on an ad) • Also, clinical trials: arm = treatment, reward = patient cured • Simplifying assumption: no context (no visitor profiles). In practice, we want to solve contextual bandit problems, but that is for later discussion.
n-Armed Bandit Problem • Choose repeatedly from one of n actions; each choice is called a play • After each play a_t, you get a reward r_t, where E[r_t | a_t] = Q*(a_t) • These Q*(a) are the unknown action values; the distribution of r_t depends only on a_t • Objective is to maximize the reward in the long term, e.g., over 1000 plays • To solve the n-armed bandit problem, you must explore a variety of actions and exploit the best of them
Exploration/Exploitation Dilemma • Suppose you form action value estimates Q_t(a) ≈ Q*(a) • The greedy action at time t is a_t* = argmax_a Q_t(a) • You can't exploit all the time; you can't explore all the time • You can never stop exploring; but you could reduce exploring. Why?
Action-Value Methods • Methods that adapt action-value estimates and nothing else, e.g.: suppose by the t-th play, action a had been chosen k_a times, producing rewards r_1, r_2, …, r_{k_a}; then Q_t(a) = (r_1 + r_2 + … + r_{k_a}) / k_a, the "sample average"
Remark • The simple greedy action selection strategy: a_t = a_t* = argmax_a Q_t(a) • Why might this be insufficient? • You are estimating, online, from a few samples. How will this behave? DEMO
ε-Greedy Action Selection • Greedy action selection: a_t = a_t* = argmax_a Q_t(a) • ε-Greedy: a_t = a_t* with probability 1 − ε, or a random action with probability ε • … the simplest way to balance exploration and exploitation
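To make the rule concrete, here is a minimal Python sketch (not the lecture's code) of ε-greedy selection with sample-average estimates; the names q_est, counts, and the two helper functions are illustrative choices.

import numpy as np

def epsilon_greedy_action(q_est, epsilon, rng):
    # With probability epsilon explore uniformly at random, otherwise exploit the current best estimate.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_est)))
    return int(np.argmax(q_est))

def update_sample_average(q_est, counts, action, reward):
    # Incremental sample average: Q_new = Q_old + (1/k)(r - Q_old)
    counts[action] += 1
    q_est[action] += (reward - q_est[action]) / counts[action]

# Example usage with hypothetical values:
rng = np.random.default_rng(0)
q_est, counts = np.zeros(10), np.zeros(10)
a = epsilon_greedy_action(q_est, 0.1, rng)
update_sample_average(q_est, counts, a, reward=1.0)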
Worked Example: 10-Armed Testbed • n = 10 possible actions • Each Q*(a) is chosen randomly from a normal distribution N(0, 1) • Each reward r_t is also normal: N(Q*(a_t), 1) • 1000 plays, repeat the whole thing 2000 times and average the results
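A sketch of the testbed as described on the slide, assuming the N(0, 1) action values and unit-variance rewards; run_bandit and the greedy vs. ε-greedy comparison are illustrative, not the course's reference implementation.

import numpy as np

def run_bandit(epsilon, n_arms=10, n_plays=1000, seed=None):
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0.0, 1.0, n_arms)        # Q*(a) ~ N(0, 1)
    q_est = np.zeros(n_arms)
    counts = np.zeros(n_arms)
    rewards = np.empty(n_plays)
    for t in range(n_plays):
        if rng.random() < epsilon:
            a = int(rng.integers(n_arms))        # explore
        else:
            a = int(np.argmax(q_est))            # exploit
        r = rng.normal(q_true[a], 1.0)           # r_t ~ N(Q*(a_t), 1)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]   # sample-average update
        rewards[t] = r
    return rewards

# Average the reward curves over 2000 independent runs, as on the slide.
avg_greedy = np.mean([run_bandit(0.0, seed=i) for i in range(2000)], axis=0)
avg_eps01 = np.mean([run_bandit(0.1, seed=i) for i in range(2000)], axis=0)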
Softmax Action Selection • Softmax action selection methods grade action probabilities by estimated values. • The most common softmax uses a Gibbs, or Boltzmann, distribution: choose action a on play t with probability exp(Q_t(a)/τ) / Σ_b exp(Q_t(b)/τ), where τ is a "temperature" parameter
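A minimal sketch of Boltzmann/softmax selection over the estimates Q_t(a); the tau argument and the subtraction of the maximum (for numerical stability) are implementation choices, not part of the slide.

import numpy as np

def softmax_action(q_est, tau, rng):
    # Gibbs/Boltzmann distribution: P(a) proportional to exp(Q_t(a) / tau)
    prefs = np.asarray(q_est, dtype=float) / tau
    prefs -= prefs.max()              # subtract the max for numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))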
Incremental Implementation Sample average estimation method: the average of the first k rewards is (dropping the dependence on a): Q_k = (r_1 + r_2 + … + r_k) / k. How to do this incrementally (without storing all the rewards)? We could keep a running sum and count, or, equivalently: Q_{k+1} = Q_k + (1/(k+1)) [ r_{k+1} − Q_k ], which is of the general form NewEstimate = OldEstimate + StepSize [Target − OldEstimate]
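A small check, assuming nothing beyond the formula above, that the incremental update reproduces the batch sample average without storing the rewards.

import numpy as np

rng = np.random.default_rng(0)
rewards = rng.normal(size=100)

q = 0.0
for k, r in enumerate(rewards, start=1):
    q += (r - q) / k                      # Q_k = Q_{k-1} + (1/k)(r_k - Q_{k-1})

assert np.isclose(q, rewards.mean())      # identical to storing all rewards and averaging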
Tracking a Nonstationary Problem Choosing the step size α_k = 1/k (the sample average) is appropriate in a stationary problem, i.e., when none of the Q*(a) change over time, but not in a nonstationary problem. Better in the nonstationary case is a constant step size α ∈ (0, 1]: Q_{k+1} = Q_k + α [ r_{k+1} − Q_k ], an exponential, recency-weighted average
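A sketch contrasting the two updates on a drifting action value; the drift model (a small Gaussian random walk) is an assumption made only for this illustration.

import numpy as np

rng = np.random.default_rng(1)
q_true, q_avg, q_const, alpha = 0.0, 0.0, 0.0, 0.1

for k in range(1, 10001):
    q_true += rng.normal(0.0, 0.01)       # nonstationary: Q*(a) drifts slowly
    r = rng.normal(q_true, 1.0)
    q_avg += (r - q_avg) / k              # sample average: weighs all past rewards equally
    q_const += alpha * (r - q_const)      # constant alpha: exponential, recency-weighted

# At the end of the run q_const typically tracks q_true more closely than q_avg does.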
Optimistic Initial Values • All methods so far depend on the initial estimates Q_0(a), i.e., they are biased • Encourage exploration: initialize the action values optimistically, i.e., on the 10-armed testbed (where Q*(a) ~ N(0, 1)), use Q_0(a) = +5 for all a
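A minimal sketch of the idea: start every estimate well above any plausible Q*(a) and act purely greedily; the value 5.0 follows the testbed example above, and greedy_optimistic is an illustrative name.

import numpy as np

def greedy_optimistic(q_true, n_plays=1000, q0=5.0, seed=None):
    rng = np.random.default_rng(seed)
    q_est = np.full(len(q_true), q0, dtype=float)   # optimistic initial estimates
    counts = np.zeros(len(q_true))
    for _ in range(n_plays):
        a = int(np.argmax(q_est))                   # purely greedy selection
        r = rng.normal(q_true[a], 1.0)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]      # optimism decays as each arm is sampled
    return q_est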
An Interpretation of MAB Type Problems [Figure slide: diagram relating the bandit setup to 'rewards']
How to Evaluate an Online Algorithm: Regret • After you have played for T rounds, you experience a regret: Regret(T) = [reward sum of the optimal strategy] − [sum of actually collected rewards] • If the average regret per round goes to zero with probability 1, asymptotically, we say the strategy has the no-regret property, i.e., it is guaranteed to converge to an optimal strategy • ε-greedy is sub-optimal (so has some regret). Why? Randomness in the draw of rewards and in the player's strategy
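A sketch of how regret is typically measured in simulation, assuming the true means q_true are known to the experimenter (though hidden from the player); the expected-regret convention T·max_a Q*(a) − Σ_t Q*(a_t) used here is one common choice.

import numpy as np

def expected_regret(q_true, actions_taken):
    # T * (best mean) minus the summed means of the arms actually pulled.
    q_true = np.asarray(q_true, dtype=float)
    return len(actions_taken) * q_true.max() - q_true[np.asarray(actions_taken)].sum()

# Example: 3 arms with means 0.2, 0.5, 0.9; the player pulled arm 0 twice and arm 2 once.
print(expected_regret([0.2, 0.5, 0.9], [0, 0, 2]))   # 2*(0.9 - 0.2) = 1.4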
Interval Estimation • Attribute to each arm an "optimistic initial estimate" within a certain confidence interval • Greedily choose the arm with the highest optimistic mean (upper bound of the confidence interval) • An infrequently observed arm will have an over-valued reward mean, leading to exploration • Frequent usage pushes the optimistic estimate toward the true value
Interval Estimation Procedure • Associate to each arm a 100(1 − α)% upper bound on its reward mean • Assume, e.g., rewards are normally distributed • An arm observed n times yields an empirical mean μ̂ and standard deviation σ̂ • α-upper bound: μ̂ + (σ̂ / √n) · z_{1−α}, where z_{1−α} is the (1 − α) quantile of the standard normal, i.e., the inverse of its cumulative distribution function • If α is carefully controlled, this could be made a zero-regret strategy • In general, we don't know the true cumulative distribution function
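A minimal sketch of the procedure under the normality assumption on the slide; scipy's norm.ppf supplies the quantile, the choice alpha=0.05 is only an example, and each arm is assumed to have at least two observations.

import numpy as np
from scipy.stats import norm

def upper_bound(rewards, alpha=0.05):
    # 100*(1 - alpha)% upper bound on an arm's mean reward, assuming normality.
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)                                  # assumes n >= 2 observations
    return rewards.mean() + rewards.std(ddof=1) / np.sqrt(n) * norm.ppf(1.0 - alpha)

def interval_estimation_choice(reward_history, alpha=0.05):
    # Greedily pick the arm with the highest optimistic upper bound.
    return int(np.argmax([upper_bound(r, alpha) for r in reward_history]))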
Variant: UCB Strategy • Again, based on the notion of an upper confidence bound, but more generally applicable • Algorithm: • Play each arm once • At time t > K, play the arm i_t maximizing x̄_i + √(2 ln t / n_i), where x̄_i is the average reward observed from arm i and n_i is the number of times arm i has been played so far
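A sketch of the UCB1 rule as stated above; the bookkeeping (incremental means, counts) is the obvious implementation, and the pull(arm) reward function is assumed to be supplied by the caller.

import numpy as np

def ucb1(pull, n_arms, horizon):
    # pull(arm) -> reward; play each arm once, then maximise mean + sqrt(2 ln t / n_i).
    means = np.zeros(n_arms)
    counts = np.zeros(n_arms)
    for a in range(n_arms):                           # initialisation: each arm once
        counts[a] = 1
        means[a] = pull(a)
    for t in range(n_arms + 1, horizon + 1):
        bonus = np.sqrt(2.0 * np.log(t) / counts)
        a = int(np.argmax(means + bonus))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]        # incremental mean update
    return means, counts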
UCB Strategy – Behaviour We will not try to prove the following result, but I quote it to tell you why UCB may be a desirable strategy – its regret is bounded. The standard UCB1 result (Auer, Cesa-Bianchi and Fischer, 2002) is that after n plays the expected regret is at most 8 Σ_{i: μ_i < μ*} (ln n)/Δ_i + (1 + π²/3) Σ_j Δ_j, where the sums run over the K arms and Δ_i = μ* − μ_i is the gap between the best arm and arm i; i.e., regret grows only logarithmically with the number of plays. K = number of arms
Variation on SoftMax • It is possible to drive regret down by annealing the temperature τ • Exp3: Exponential-weight algorithm for Exploration and Exploitation • The probability of choosing arm k at time t is p_k(t) = (1 − γ) w_k(t) / Σ_{j=1}^{K} w_j(t) + γ/K, where the weights w_k(t) are updated multiplicatively from the observed rewards • γ is a user-defined open parameter
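A sketch of Exp3 following the probability rule above, with the usual importance-weighted multiplicative update of the weights; rewards are assumed to lie in [0, 1], and the names (weights, gamma, pull) are illustrative.

import numpy as np

def exp3(pull, n_arms, horizon, gamma=0.1, seed=None):
    # pull(arm) -> reward assumed to lie in [0, 1]
    rng = np.random.default_rng(seed)
    weights = np.ones(n_arms)
    for t in range(horizon):
        probs = (1.0 - gamma) * weights / weights.sum() + gamma / n_arms
        a = int(rng.choice(n_arms, p=probs))
        r = pull(a)
        x_hat = r / probs[a]                          # importance-weighted reward estimate
        weights[a] *= np.exp(gamma * x_hat / n_arms)  # multiplicative weight update
        weights /= weights.max()                      # rescale to avoid numerical overflow
    return weights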
The Gittins Index • Each arm delivers reward with some probability • This probability may change through time, but only when the arm is pulled • Goal is to maximize discounted rewards – the future is discounted by an exponential discount factor d < 1 • The structure of the problem is such that all you need to do is compute an "index" for each arm and play the one with the highest index • The index is of the form: ν_i = sup over stopping times τ > 0 of E[ Σ_{t=0}^{τ−1} d^t r_i(t) ] / E[ Σ_{t=0}^{τ−1} d^t ], i.e., the best achievable expected discounted reward per unit of expected discounted time
Gittins Index – Intuition • Proving optimality isn't within our scope, but it rests on the notion of a stopping time: the point at which you should 'terminate' a bandit • Nice property: the Gittins index of any given bandit is independent of the expected outcomes of all the other bandits • Once you have a good arm, keep playing it until there is a better one • If you add/remove machines, the computation doesn't really change • BUT: • hard to compute, even when you know the distributions • exploration issues; an arm's index isn't updated unless the arm is used (restless bandits?)
Extending the MAB Model • In this lecture, we are in a single casino and the only decision is to pull from a set of n arms – except perhaps in the very last slides, there is exactly one state! Next: • What if there is more than one state? • In this larger state space, what is the effect of the payout distribution changing based on how you pull arms? • What happens if you only obtain a net reward corresponding to a long sequence of arm pulls (at the end)?
Acknowledgements • Many slides are adapted from web resources associated with Sutton and Barto’s Reinforcement Learning book