Multi-armed Bandit Problems with Dependent Arms Sandeep Pandey (spandey@cs.cmu.edu) Deepayan Chakrabarti (deepay@yahoo-inc.com) Deepak Agarwal (dagarwal@yahoo-inc.com)
Background: Bandits
[Diagram: bandit "arms" with unknown reward probabilities μ1, μ2, μ3]
• Pull arms sequentially so as to maximize the total expected reward
  • Show ads on a webpage to maximize clicks
  • Product recommendation to maximize sales
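To make the setup concrete, here is a minimal sketch of one standard single-level bandit policy (UCB1); the slides do not prescribe a particular policy, and the Bernoulli arms, their μ values, and the horizon below are invented for illustration.

import math, random

def ucb1(mus, horizon, seed=0):
    """Minimal UCB1 sketch: pull each arm once, then repeatedly pull the arm
    with the highest upper confidence bound on its estimated reward probability."""
    random.seed(seed)
    n_arms = len(mus)
    pulls = [0] * n_arms      # times each arm was pulled
    successes = [0] * n_arms  # observed rewards per arm
    total_reward = 0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: pull every arm once
        else:
            ucb = [successes[i] / pulls[i] + math.sqrt(2 * math.log(t) / pulls[i])
                   for i in range(n_arms)]
            arm = max(range(n_arms), key=lambda i: ucb[i])
        reward = 1 if random.random() < mus[arm] else 0  # Bernoulli feedback
        pulls[arm] += 1
        successes[arm] += reward
        total_reward += reward
    return total_reward

# Example: three ads with unknown click probabilities mu1, mu2, mu3
print(ucb1([0.30, 0.28, 1e-6], horizon=10000))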
Dependent Arms
• Reward probabilities μi are generally assumed to be independent of each other
• What if they are dependent?
• E.g., ads on similar topics, using similar text/phrases, should have similar rewards:
  "Skiing, snowboarding" μ1 = 0.3, "Skiing, snowshoes" μ2 = 0.28, "Get Vonage!" μ3 = 10^-6, "Snowshoe rental" μ4 = 0.31
Dependent Arms
• Reward probabilities μi are generally assumed to be independent of each other
• What if they are dependent?
• E.g., ads on similar topics, using similar text/phrases, should have similar rewards
• A click on one ad suggests that other "similar" ads may generate clicks as well
• Can we increase the total reward using this dependency?
Cluster Model of Dependence
[Diagram: Cluster 1 contains Arms 1 and 2; Cluster 2 contains Arms 3 and 4]
• μi ~ f(π[i]), where f is some known distribution and π[i] is the unknown cluster-specific parameter of arm i's cluster
• Successes si ~ Bin(ni, μi), where ni = # pulls of arm i
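As a concrete illustration of this generative model, one could sample data as sketched below; the choice of f as a Beta distribution, the cluster parameters, and the pull counts are assumptions made here purely for illustration.

import random

def sample_cluster_model(cluster_params, arms_per_cluster, pulls, seed=0):
    """Cluster-model sketch: each arm i in cluster c draws a hidden reward
    probability mu_i from a known family f with unknown cluster parameter pi_c
    (here f = Beta(a_c, b_c) as an example); successes are s_i ~ Bin(n_i, mu_i)."""
    random.seed(seed)
    data = []
    for c, (a_c, b_c) in enumerate(cluster_params):
        for _ in range(arms_per_cluster):
            mu_i = random.betavariate(a_c, b_c)                       # mu_i ~ f(pi_c)
            s_i = sum(random.random() < mu_i for _ in range(pulls))   # s_i ~ Bin(n_i, mu_i)
            data.append({"cluster": c, "mu": mu_i, "pulls": pulls, "successes": s_i})
    return data

# Two clusters with different (made-up) cluster parameters
print(sample_cluster_model([(8, 2), (1, 9)], arms_per_cluster=2, pulls=100))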
Cluster Model of Dependence
[Diagram: Cluster 1 (Arms 1 and 2) with μi ~ f(π1); Cluster 2 (Arms 3 and 4) with μi ~ f(π2)]
• Total reward:
  • Discounted: ∑_{t=0}^{∞} α^t · E[R(t)], where α is the discounting factor
  • Undiscounted: ∑_{t=0}^{T} E[R(t)]
Discounted Reward
[Diagram: per-cluster MDPs. MDP for cluster 1 over belief states (x1, x2), with transitions to (x'1, x'2) and (x"1, x"2) under Pull Arm 1 / Pull Arm 2; MDP for cluster 2 over belief states (x3, x4), with transitions to (x'3, x'4) and (x"3, x"4) under Pull Arm 3 / Pull Arm 4]
The optimal policy can be computed using per-cluster MDPs only.
• Optimal Policy:
  • Compute an ("index", arm) pair for each cluster
  • Pick the cluster with the largest index, and pull the corresponding arm
Discounted Reward
[Diagram: per-cluster MDPs, as on the previous slide]
The optimal policy can be computed using per-cluster MDPs only.
• Optimal Policy:
  • Compute an ("index", arm) pair for each cluster
  • Pick the cluster with the largest index, and pull the corresponding arm
• Reduces the problem to smaller state spaces
• Reduces to Gittins' Theorem [1979] for independent bandits
• Approximation bounds on the index for k-step lookahead
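The sketch below only illustrates the structure of this policy: compute an (index, arm) pair independently per cluster, then pull the arm from the cluster with the largest index. The index used here is a crude one-step greedy stand-in (posterior mean of the cluster's best arm under an assumed Beta prior), not the actual Gittins-style index or the k-step lookahead approximation from the talk.

def cluster_index(cluster_state, prior=(1, 1)):
    """Stand-in index: posterior mean of the best arm in the cluster.
    cluster_state is a list of (successes, pulls) pairs, one per arm.
    A faithful version would solve the per-cluster MDP (k-step lookahead)."""
    a0, b0 = prior
    means = [(a0 + s) / (a0 + b0 + n) for (s, n) in cluster_state]
    best_arm = max(range(len(means)), key=lambda i: means[i])
    return means[best_arm], best_arm

def pick_next_pull(clusters):
    """Compute an (index, arm) pair per cluster, then pick the cluster with
    the largest index and pull the corresponding arm."""
    scored = [(cluster_index(state), c) for c, state in enumerate(clusters)]
    (index, arm), cluster = max(scored)
    return cluster, arm

# Example belief state: two clusters, (successes, pulls) per arm
clusters = [[(3, 10), (1, 10)], [(6, 10), (5, 10)]]
print(pick_next_pull(clusters))  # -> (cluster, arm) to pull next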
Cluster Model of Dependence
[Diagram: Cluster 1 (Arms 1 and 2) with μi ~ f(π1); Cluster 2 (Arms 3 and 4) with μi ~ f(π2)]
• Total reward:
  • Discounted: ∑_{t=0}^{∞} α^t · E[R(t)], where α is the discounting factor
  • Undiscounted: ∑_{t=0}^{T} E[R(t)]
Undiscounted Reward
[Diagram: "Cluster arm" 1 groups Arms 1 and 2; "Cluster arm" 2 groups Arms 3 and 4]
All arms in a cluster are similar, so they can be grouped into one hypothetical "cluster arm".
Undiscounted Reward
[Diagram: "Cluster arm" 1 (Arms 1 and 2), "Cluster arm" 2 (Arms 3 and 4)]
• Two-Level Policy: in each iteration,
  • Pick a "cluster arm" using a traditional bandit policy
  • Pick an arm within that cluster using a traditional bandit policy
• Each "cluster arm" must have some estimated reward probability
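A minimal simulation sketch of the Two-Level Policy follows. UCB1 is used here as the "traditional bandit policy" at both levels, and the cluster-arm reward estimate is the pooled success rate (the MEAN scheme defined on the next slides); the arm probabilities and horizon are invented for illustration.

import math, random

def two_level_policy(cluster_mus, horizon, seed=0):
    """Two-Level Policy sketch: each round, (1) pick a "cluster arm" by UCB1
    using the cluster's pooled success rate, then (2) pick an arm inside that
    cluster, again by UCB1 on the per-arm statistics."""
    random.seed(seed)
    s = [[0] * len(c) for c in cluster_mus]   # per-arm successes
    n = [[0] * len(c) for c in cluster_mus]   # per-arm pulls
    total = 0
    for t in range(1, horizon + 1):
        def ucb(succ, pulls):
            if pulls == 0:
                return float("inf")           # force initial exploration
            return succ / pulls + math.sqrt(2 * math.log(t) / pulls)
        # Level 1: choose the "cluster arm" (estimate = pooled success rate)
        c = max(range(len(cluster_mus)), key=lambda j: ucb(sum(s[j]), sum(n[j])))
        # Level 2: choose an arm within the chosen cluster
        i = max(range(len(cluster_mus[c])), key=lambda j: ucb(s[c][j], n[c][j]))
        reward = 1 if random.random() < cluster_mus[c][i] else 0
        s[c][i] += reward
        n[c][i] += 1
        total += reward
    return total

# Two clusters of similar arms; the second cluster contains the best arm
print(two_level_policy([[0.30, 0.28], [0.31, 0.35]], horizon=20000))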
Issues • What is the reward probability of a “cluster arm”? • How do cluster characteristics affect performance?
Reward probability of a "cluster arm"
• What is the reward probability r of a "cluster arm"?
• MEAN: r = ∑si / ∑ni, i.e., the average success rate, summing over all arms in the cluster [Kocsis+/2006, Pandey+/2007]
  • Initially, r ≈ μavg, the average μ of the arms in the cluster
  • Eventually, r → μmax, the maximum μ among the arms in the cluster, since the within-cluster bandit policy converges to the best arm
  • So there is "drift" in the reward probability of the "cluster arm"
Reward probability drift causes problems
[Diagram: Cluster 1 (Arms 1 and 2) and Cluster 2 (Arms 3 and 4, the opt cluster), with the best (optimal) arm, with reward probability μopt, in Cluster 2]
• Because of drift, non-optimal clusters might temporarily look better, so the optimal arm is explored only O(log T) times
Reward probability of a "cluster arm"
• What is the reward probability r of a "cluster arm"?
• MEAN: r = ∑si / ∑ni
• MAX: r = max( E[μi] ), taking the max over all arms i in the cluster
• PMAX: r = E[ max(μi) ], taking the max over all arms i in the cluster
• Both MAX and PMAX aim to estimate μmax and thus reduce drift
Reward probability of a "cluster arm"
• MEAN: r = ∑si / ∑ni
• MAX: r = max( E[μi] )
• PMAX: r = E[ max(μi) ]
• Both MAX and PMAX aim to estimate μmax and thus reduce drift

         Bias in estimation of μmax    Variance of estimator
  MAX    High                          Low
  PMAX   Unbiased                      High
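The three estimates could be computed as in the sketch below; the Beta(1,1) prior used for E[μi] and the Monte Carlo approximation of E[max μi] are illustrative choices made here, not prescribed by the slides.

import random

def mean_estimate(stats):
    """MEAN: pooled success rate r = sum(s_i) / sum(n_i) over arms in the cluster."""
    s_total = sum(s for s, n in stats)
    n_total = sum(n for s, n in stats)
    return s_total / n_total if n_total else 0.5

def max_estimate(stats, prior=(1, 1)):
    """MAX: r = max_i E[mu_i], with E[mu_i] taken as the Beta posterior mean."""
    a0, b0 = prior
    return max((a0 + s) / (a0 + b0 + n) for s, n in stats)

def pmax_estimate(stats, prior=(1, 1), samples=2000, seed=0):
    """PMAX: r = E[max_i mu_i], approximated here by Monte Carlo sampling
    from the per-arm Beta posteriors."""
    random.seed(seed)
    a0, b0 = prior
    total = 0.0
    for _ in range(samples):
        total += max(random.betavariate(a0 + s, b0 + n - s) for s, n in stats)
    return total / samples

stats = [(30, 100), (5, 10)]  # (successes, pulls) per arm in one cluster
print(mean_estimate(stats), max_estimate(stats), pmax_estimate(stats))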
Comparison of schemes
[Plot: 10 clusters, 11.3 arms/cluster]
MAX performs best
Issues • What is the reward probability of a “cluster arm”? • How do cluster characteristics affect performance?
Effects of cluster characteristics • We analytically study the effects of cluster characteristics on the “crossover-time” • Crossover-time: Time when the expected reward probability of the optimal cluster becomes highest among all “cluster arms”
Effects of cluster characteristics
• Crossover-time Tc for MEAN depends on:
  • Cluster separation Δ = μopt – μmax outside the opt cluster: as Δ increases, Tc decreases
  • Cluster size Aopt: as Aopt increases, Tc increases
  • Cohesiveness of the opt cluster, 1 – avg(μopt – μi): as cohesiveness increases, Tc decreases
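One can also see the crossover-time empirically. The sketch below runs a two-level UCB1 policy with the MEAN cluster-arm estimate on invented arm probabilities and reports the round after which the optimal cluster's pooled estimate stays the highest; this is a rough illustration of the definition above, not the paper's analysis.

import math, random

def crossover_time(cluster_mus, opt_cluster, horizon, seed=0):
    """Run a two-level UCB1/MEAN policy and return the round right after the
    last time the optimal cluster's pooled success rate was not strictly the
    highest among all "cluster arms"."""
    random.seed(seed)
    s = [[0] * len(c) for c in cluster_mus]
    n = [[0] * len(c) for c in cluster_mus]
    last_not_highest = 0
    for t in range(1, horizon + 1):
        def ucb(succ, pulls):
            return float("inf") if pulls == 0 else \
                succ / pulls + math.sqrt(2 * math.log(t) / pulls)
        c = max(range(len(cluster_mus)), key=lambda j: ucb(sum(s[j]), sum(n[j])))
        i = max(range(len(cluster_mus[c])), key=lambda j: ucb(s[c][j], n[c][j]))
        n[c][i] += 1
        s[c][i] += 1 if random.random() < cluster_mus[c][i] else 0
        means = [sum(s[j]) / max(1, sum(n[j])) for j in range(len(cluster_mus))]
        if any(means[j] >= means[opt_cluster] for j in range(len(means)) if j != opt_cluster):
            last_not_highest = t
    return last_not_highest + 1

# A larger separation Delta between clusters should give an earlier crossover
print(crossover_time([[0.10, 0.12], [0.30, 0.35]], opt_cluster=1, horizon=5000))
print(crossover_time([[0.28, 0.29], [0.30, 0.35]], opt_cluster=1, horizon=5000))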
Experiments (effect of separation)
• As Δ increases, Tc decreases, giving higher reward
Experiments (effect of size)
• As Aopt increases, Tc increases, giving lower reward
Experiments (effect of cohesiveness)
• As cohesiveness increases, Tc decreases, giving higher reward
Related Work
• Typical multi-armed bandit problems
  • Do not consider dependencies
  • Very few arms
• Bandits with side information
  • Cannot handle dependencies among arms
• Active learning
  • Emphasis on # examples required to achieve a given prediction accuracy
Conclusions
• We analyze bandits where dependencies are encapsulated within clusters
• Discounted Reward: the optimal policy is an index scheme on the clusters
• Undiscounted Reward:
  • Two-Level Policy with MEAN, MAX, and PMAX
  • Analysis of the effect of cluster characteristics on performance, for MEAN
Discounted Reward
[Diagram: belief-state MDP over all four arms. A state holds the estimated reward probabilities (x1, x2, x3, x4); pulling Arm 1 changes the belief for both arms 1 and 2, moving to (x'1, x'2, x3, x4) on success and to (x"1, x"2, x3, x4) on failure; Pull Arm 2, Pull Arm 3, and Pull Arm 4 behave analogously]
• Create a belief-state MDP
• Each state contains the estimated reward probabilities for all arms
• Solve for the optimal policy
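As a rough illustration of "create a belief-state MDP and solve for optimal", the sketch below computes the optimal discounted value by recursion over belief states. It simplifies in two ways that the talk does not: the per-arm beliefs are independent Beta-Bernoulli (the cluster model would couple arms through the shared parameter), and the horizon is finite.

from functools import lru_cache

def optimal_value(n_arms=2, horizon=6, alpha=0.9):
    """Belief-state MDP sketch: a state is the tuple of per-arm (successes,
    failures) counts, i.e., the estimated reward probabilities of all arms.
    Pulling arm i succeeds with its posterior mean under Beta(1+s, 1+f) and
    moves the belief accordingly; we maximize the discounted expected reward."""

    @lru_cache(maxsize=None)
    def value(state, steps_left):
        if steps_left == 0:
            return 0.0
        best = 0.0
        for i, (s, f) in enumerate(state):
            p = (1 + s) / (2 + s + f)                      # posterior mean of arm i
            succ = state[:i] + ((s + 1, f),) + state[i + 1:]
            fail = state[:i] + ((s, f + 1),) + state[i + 1:]
            q = p * (1 + alpha * value(succ, steps_left - 1)) \
                + (1 - p) * alpha * value(fail, steps_left - 1)
            best = max(best, q)
        return best

    start = tuple((0, 0) for _ in range(n_arms))           # uniform initial beliefs
    return value(start, horizon)

print(optimal_value())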
Background: Bandits
[Diagram: bandit "arms" with unknown payoff probabilities p1, p2, p3]
Regret = optimal payoff – actual payoff
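Spelled out in the deck's notation (borrowing R(t) for the payoff at pull t from the undiscounted-reward slide), the definition above reads:

\mathrm{Regret} \;=\; \sum_{t=0}^{T} \max_i p_i \;-\; \sum_{t=0}^{T} E[R(t)]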
Reward probability of a "cluster arm"
• What is the reward probability of a "cluster arm"?
• Eventually, every "cluster arm" must converge to the most rewarding arm μmax within that cluster
  • since a bandit policy is used within each cluster
• However, "drift" causes problems
Experiments • Simulation based on one week’s worth of data from a large-scale ad-matching application • 10 clusters, with 11.3 arms/cluster on average
Comparison of schemes
[Plot: 10 clusters, 11.3 arms/cluster]
• Cluster separation Δ = 0.08
• Cluster size Aopt = 31
• Cohesiveness = 0.75
MAX performs best
Reward probability drift causes problems
[Diagram: Cluster 1 (Arms 1 and 2) and Cluster 2 (Arms 3 and 4, the opt cluster), with the best (optimal) arm, with reward probability μopt, in Cluster 2]
Intuitively, to reduce regret, we must:
• Quickly converge to the optimal "cluster arm"
• and then to the best arm within that cluster