Multi-armed Bandit Problems with Dependent Arms Sandeep Pandey (spandey@cs.cmu.edu) Deepayan Chakrabarti (deepay@yahoo-inc.com) Deepak Agarwal (dagarwal@yahoo-inc.com)
Background: Bandits
[Diagram: bandit "arms" with unknown reward probabilities μ1, μ2, μ3]
• Pull arms sequentially so as to maximize the total expected reward
  • Show ads on a webpage to maximize clicks
  • Product recommendation to maximize sales
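To make the setup concrete, here is a minimal sketch of one standard single-level bandit policy (UCB1); the slides do not prescribe a particular policy, and the Bernoulli arms, their μ values, and the horizon below are invented for illustration.

import math, random

def ucb1(mus, horizon, seed=0):
    """Minimal UCB1 sketch: pull each arm once, then repeatedly pull the arm
    with the highest upper confidence bound on its estimated reward probability."""
    random.seed(seed)
    n_arms = len(mus)
    pulls = [0] * n_arms      # times each arm was pulled
    successes = [0] * n_arms  # observed rewards per arm
    total_reward = 0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: pull every arm once
        else:
            ucb = [successes[i] / pulls[i] + math.sqrt(2 * math.log(t) / pulls[i])
                   for i in range(n_arms)]
            arm = max(range(n_arms), key=lambda i: ucb[i])
        reward = 1 if random.random() < mus[arm] else 0  # Bernoulli feedback
        pulls[arm] += 1
        successes[arm] += reward
        total_reward += reward
    return total_reward

# Example: three ads with unknown click probabilities mu1, mu2, mu3
print(ucb1([0.30, 0.28, 1e-6], horizon=10000))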
Dependent Arms
• Reward probabilities μi are generally assumed to be independent of each other
• What if they are dependent?
• E.g., ads on similar topics, using similar text/phrases, should have similar rewards:
  "Skiing, snowboarding" μ1 = 0.3, "Skiing, snowshoes" μ2 = 0.28, "Get Vonage!" μ3 = 10^-6, "Snowshoe rental" μ4 = 0.31
Dependent Arms
• Reward probabilities μi are generally assumed to be independent of each other
• What if they are dependent?
• E.g., ads on similar topics, using similar text/phrases, should have similar rewards
• A click on one ad suggests that other "similar" ads may generate clicks as well
• Can we increase the total reward using this dependency?
Cluster Model of Dependence
[Diagram: Cluster 1 contains Arms 1 and 2; Cluster 2 contains Arms 3 and 4]
• μi ~ f(π[i]), where f is some known distribution and π[i] is the unknown cluster-specific parameter of arm i's cluster
• Successes si ~ Bin(ni, μi), where ni = # pulls of arm i
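As a concrete illustration of this generative model, one could sample data as sketched below; the choice of f as a Beta distribution, the cluster parameters, and the pull counts are assumptions made here purely for illustration.

import random

def sample_cluster_model(cluster_params, arms_per_cluster, pulls, seed=0):
    """Cluster-model sketch: each arm i in cluster c draws a hidden reward
    probability mu_i from a known family f with unknown cluster parameter pi_c
    (here f = Beta(a_c, b_c) as an example); successes are s_i ~ Bin(n_i, mu_i)."""
    random.seed(seed)
    data = []
    for c, (a_c, b_c) in enumerate(cluster_params):
        for _ in range(arms_per_cluster):
            mu_i = random.betavariate(a_c, b_c)                       # mu_i ~ f(pi_c)
            s_i = sum(random.random() < mu_i for _ in range(pulls))   # s_i ~ Bin(n_i, mu_i)
            data.append({"cluster": c, "mu": mu_i, "pulls": pulls, "successes": s_i})
    return data

# Two clusters with different (made-up) cluster parameters
print(sample_cluster_model([(8, 2), (1, 9)], arms_per_cluster=2, pulls=100))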
Cluster Model of Dependence
[Diagram: Cluster 1 (Arms 1 and 2) with μi ~ f(π1); Cluster 2 (Arms 3 and 4) with μi ~ f(π2)]
• Total reward:
  • Discounted: ∑_{t=0}^{∞} α^t · E[R(t)], where α is the discounting factor
  • Undiscounted: ∑_{t=0}^{T} E[R(t)]
Discounted Reward
[Diagram: per-cluster MDPs. MDP for cluster 1 over belief states (x1, x2), with transitions to (x'1, x'2) and (x"1, x"2) under Pull Arm 1 / Pull Arm 2; MDP for cluster 2 over belief states (x3, x4), with transitions to (x'3, x'4) and (x"3, x"4) under Pull Arm 3 / Pull Arm 4]
The optimal policy can be computed using per-cluster MDPs only.
• Optimal Policy:
  • Compute an ("index", arm) pair for each cluster
  • Pick the cluster with the largest index, and pull the corresponding arm
Discounted Reward
[Diagram: per-cluster MDPs, as on the previous slide]
The optimal policy can be computed using per-cluster MDPs only.
• Optimal Policy:
  • Compute an ("index", arm) pair for each cluster
  • Pick the cluster with the largest index, and pull the corresponding arm
• Reduces the problem to smaller state spaces
• Reduces to Gittins' Theorem [1979] for independent bandits
• Approximation bounds on the index for k-step lookahead
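The sketch below only illustrates the structure of this policy: compute an (index, arm) pair independently per cluster, then pull the arm from the cluster with the largest index. The index used here is a crude one-step greedy stand-in (posterior mean of the cluster's best arm under an assumed Beta prior), not the actual Gittins-style index or the k-step lookahead approximation from the talk.

def cluster_index(cluster_state, prior=(1, 1)):
    """Stand-in index: posterior mean of the best arm in the cluster.
    cluster_state is a list of (successes, pulls) pairs, one per arm.
    A faithful version would solve the per-cluster MDP (k-step lookahead)."""
    a0, b0 = prior
    means = [(a0 + s) / (a0 + b0 + n) for (s, n) in cluster_state]
    best_arm = max(range(len(means)), key=lambda i: means[i])
    return means[best_arm], best_arm

def pick_next_pull(clusters):
    """Compute an (index, arm) pair per cluster, then pick the cluster with
    the largest index and pull the corresponding arm."""
    scored = [(cluster_index(state), c) for c, state in enumerate(clusters)]
    (index, arm), cluster = max(scored)
    return cluster, arm

# Example belief state: two clusters, (successes, pulls) per arm
clusters = [[(3, 10), (1, 10)], [(6, 10), (5, 10)]]
print(pick_next_pull(clusters))  # -> (cluster, arm) to pull next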
Cluster Model of Dependence
[Diagram: Cluster 1 (Arms 1 and 2) with μi ~ f(π1); Cluster 2 (Arms 3 and 4) with μi ~ f(π2)]
• Total reward:
  • Discounted: ∑_{t=0}^{∞} α^t · E[R(t)], where α is the discounting factor
  • Undiscounted: ∑_{t=0}^{T} E[R(t)]
Undiscounted Reward
[Diagram: "Cluster arm" 1 groups Arms 1 and 2; "Cluster arm" 2 groups Arms 3 and 4]
All arms in a cluster are similar, so they can be grouped into one hypothetical "cluster arm".
Undiscounted Reward
[Diagram: "Cluster arm" 1 (Arms 1 and 2), "Cluster arm" 2 (Arms 3 and 4)]
• Two-Level Policy: in each iteration,
  • Pick a "cluster arm" using a traditional bandit policy
  • Pick an arm within that cluster using a traditional bandit policy
• Each "cluster arm" must have some estimated reward probability
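A minimal simulation sketch of the Two-Level Policy follows. UCB1 is used here as the "traditional bandit policy" at both levels, and the cluster-arm reward estimate is the pooled success rate (the MEAN scheme defined on the next slides); the arm probabilities and horizon are invented for illustration.

import math, random

def two_level_policy(cluster_mus, horizon, seed=0):
    """Two-Level Policy sketch: each round, (1) pick a "cluster arm" by UCB1
    using the cluster's pooled success rate, then (2) pick an arm inside that
    cluster, again by UCB1 on the per-arm statistics."""
    random.seed(seed)
    s = [[0] * len(c) for c in cluster_mus]   # per-arm successes
    n = [[0] * len(c) for c in cluster_mus]   # per-arm pulls
    total = 0
    for t in range(1, horizon + 1):
        def ucb(succ, pulls):
            if pulls == 0:
                return float("inf")           # force initial exploration
            return succ / pulls + math.sqrt(2 * math.log(t) / pulls)
        # Level 1: choose the "cluster arm" (estimate = pooled success rate)
        c = max(range(len(cluster_mus)), key=lambda j: ucb(sum(s[j]), sum(n[j])))
        # Level 2: choose an arm within the chosen cluster
        i = max(range(len(cluster_mus[c])), key=lambda j: ucb(s[c][j], n[c][j]))
        reward = 1 if random.random() < cluster_mus[c][i] else 0
        s[c][i] += reward
        n[c][i] += 1
        total += reward
    return total

# Two clusters of similar arms; the second cluster contains the best arm
print(two_level_policy([[0.30, 0.28], [0.31, 0.35]], horizon=20000))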
Issues • What is the reward probability of a “cluster arm”? • How do cluster characteristics affect performance?
Reward probability of a "cluster arm"
• What is the reward probability r of a "cluster arm"?
• MEAN: r = ∑si / ∑ni, i.e., the average success rate, summing over all arms in the cluster [Kocsis+/2006, Pandey+/2007]
  • Initially, r ≈ μavg, the average μ of the arms in the cluster
  • Eventually, r → μmax, the maximum μ among the arms in the cluster, since the within-cluster bandit policy converges to the best arm
  • So there is "drift" in the reward probability of the "cluster arm"
Reward probability drift causes problems
[Diagram: Cluster 1 (Arms 1 and 2) and Cluster 2 (Arms 3 and 4, the opt cluster), with the best (optimal) arm, with reward probability μopt, in Cluster 2]
• Because of drift, non-optimal clusters might temporarily look better, so the optimal arm is explored only O(log T) times
Reward probability of a "cluster arm"
• What is the reward probability r of a "cluster arm"?
• MEAN: r = ∑si / ∑ni
• MAX: r = max( E[μi] ), taking the max over all arms i in the cluster
• PMAX: r = E[ max(μi) ], taking the max over all arms i in the cluster
• Both MAX and PMAX aim to estimate μmax and thus reduce drift
Reward probability of a "cluster arm"
• MEAN: r = ∑si / ∑ni
• MAX: r = max( E[μi] )
• PMAX: r = E[ max(μi) ]
• Both MAX and PMAX aim to estimate μmax and thus reduce drift

         Bias in estimation of μmax    Variance of estimator
  MAX    High                          Low
  PMAX   Unbiased                      High
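The three estimates could be computed as in the sketch below; the Beta(1,1) prior used for E[μi] and the Monte Carlo approximation of E[max μi] are illustrative choices made here, not prescribed by the slides.

import random

def mean_estimate(stats):
    """MEAN: pooled success rate r = sum(s_i) / sum(n_i) over arms in the cluster."""
    s_total = sum(s for s, n in stats)
    n_total = sum(n for s, n in stats)
    return s_total / n_total if n_total else 0.5

def max_estimate(stats, prior=(1, 1)):
    """MAX: r = max_i E[mu_i], with E[mu_i] taken as the Beta posterior mean."""
    a0, b0 = prior
    return max((a0 + s) / (a0 + b0 + n) for s, n in stats)

def pmax_estimate(stats, prior=(1, 1), samples=2000, seed=0):
    """PMAX: r = E[max_i mu_i], approximated here by Monte Carlo sampling
    from the per-arm Beta posteriors."""
    random.seed(seed)
    a0, b0 = prior
    total = 0.0
    for _ in range(samples):
        total += max(random.betavariate(a0 + s, b0 + n - s) for s, n in stats)
    return total / samples

stats = [(30, 100), (5, 10)]  # (successes, pulls) per arm in one cluster
print(mean_estimate(stats), max_estimate(stats), pmax_estimate(stats))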
Comparison of schemes
[Plot: 10 clusters, 11.3 arms/cluster]
MAX performs best
Issues • What is the reward probability of a “cluster arm”? • How do cluster characteristics affect performance?
Effects of cluster characteristics • We analytically study the effects of cluster characteristics on the “crossover-time” • Crossover-time: Time when the expected reward probability of the optimal cluster becomes highest among all “cluster arms”
Effects of cluster characteristics
• Crossover-time Tc for MEAN depends on:
  • Cluster separation Δ = μopt – μmax outside the opt cluster: as Δ increases, Tc decreases
  • Cluster size Aopt: as Aopt increases, Tc increases
  • Cohesiveness of the opt cluster, 1 – avg(μopt – μi): as cohesiveness increases, Tc decreases
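One can also see the crossover-time empirically. The sketch below runs a two-level UCB1 policy with the MEAN cluster-arm estimate on invented arm probabilities and reports the round after which the optimal cluster's pooled estimate stays the highest; this is a rough illustration of the definition above, not the paper's analysis.

import math, random

def crossover_time(cluster_mus, opt_cluster, horizon, seed=0):
    """Run a two-level UCB1/MEAN policy and return the round right after the
    last time the optimal cluster's pooled success rate was not strictly the
    highest among all "cluster arms"."""
    random.seed(seed)
    s = [[0] * len(c) for c in cluster_mus]
    n = [[0] * len(c) for c in cluster_mus]
    last_not_highest = 0
    for t in range(1, horizon + 1):
        def ucb(succ, pulls):
            return float("inf") if pulls == 0 else \
                succ / pulls + math.sqrt(2 * math.log(t) / pulls)
        c = max(range(len(cluster_mus)), key=lambda j: ucb(sum(s[j]), sum(n[j])))
        i = max(range(len(cluster_mus[c])), key=lambda j: ucb(s[c][j], n[c][j]))
        n[c][i] += 1
        s[c][i] += 1 if random.random() < cluster_mus[c][i] else 0
        means = [sum(s[j]) / max(1, sum(n[j])) for j in range(len(cluster_mus))]
        if any(means[j] >= means[opt_cluster] for j in range(len(means)) if j != opt_cluster):
            last_not_highest = t
    return last_not_highest + 1

# A larger separation Delta between clusters should give an earlier crossover
print(crossover_time([[0.10, 0.12], [0.30, 0.35]], opt_cluster=1, horizon=5000))
print(crossover_time([[0.28, 0.29], [0.30, 0.35]], opt_cluster=1, horizon=5000))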
Experiments (effect of separation)
• As Δ increases, Tc decreases, giving higher reward
Experiments (effect of size)
• As Aopt increases, Tc increases, giving lower reward
Experiments (effect of cohesiveness)
• As cohesiveness increases, Tc decreases, giving higher reward
Related Work
• Typical multi-armed bandit problems
  • Do not consider dependencies
  • Very few arms
• Bandits with side information
  • Cannot handle dependencies among arms
• Active learning
  • Emphasis on # examples required to achieve a given prediction accuracy
Conclusions
• We analyze bandits where dependencies are encapsulated within clusters
• Discounted Reward: the optimal policy is an index scheme on the clusters
• Undiscounted Reward:
  • Two-Level Policy with MEAN, MAX, and PMAX
  • Analysis of the effect of cluster characteristics on performance, for MEAN
Discounted Reward
[Diagram: belief-state MDP over all four arms. A state holds the estimated reward probabilities (x1, x2, x3, x4); pulling Arm 1 changes the belief for both arms 1 and 2, moving to (x'1, x'2, x3, x4) on success and to (x"1, x"2, x3, x4) on failure; Pull Arm 2, Pull Arm 3, and Pull Arm 4 behave analogously]
• Create a belief-state MDP
• Each state contains the estimated reward probabilities for all arms
• Solve for the optimal policy
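As a rough illustration of "create a belief-state MDP and solve for optimal", the sketch below computes the optimal discounted value by recursion over belief states. It simplifies in two ways that the talk does not: the per-arm beliefs are independent Beta-Bernoulli (the cluster model would couple arms through the shared parameter), and the horizon is finite.

from functools import lru_cache

def optimal_value(n_arms=2, horizon=6, alpha=0.9):
    """Belief-state MDP sketch: a state is the tuple of per-arm (successes,
    failures) counts, i.e., the estimated reward probabilities of all arms.
    Pulling arm i succeeds with its posterior mean under Beta(1+s, 1+f) and
    moves the belief accordingly; we maximize the discounted expected reward."""

    @lru_cache(maxsize=None)
    def value(state, steps_left):
        if steps_left == 0:
            return 0.0
        best = 0.0
        for i, (s, f) in enumerate(state):
            p = (1 + s) / (2 + s + f)                      # posterior mean of arm i
            succ = state[:i] + ((s + 1, f),) + state[i + 1:]
            fail = state[:i] + ((s, f + 1),) + state[i + 1:]
            q = p * (1 + alpha * value(succ, steps_left - 1)) \
                + (1 - p) * alpha * value(fail, steps_left - 1)
            best = max(best, q)
        return best

    start = tuple((0, 0) for _ in range(n_arms))           # uniform initial beliefs
    return value(start, horizon)

print(optimal_value())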
Background: Bandits
[Diagram: bandit "arms" with unknown payoff probabilities p1, p2, p3]
Regret = optimal payoff – actual payoff
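Spelled out in the deck's notation (borrowing R(t) for the payoff at pull t from the undiscounted-reward slide), the definition above reads:

\mathrm{Regret} \;=\; \sum_{t=0}^{T} \max_i p_i \;-\; \sum_{t=0}^{T} E[R(t)]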
Reward probability of a "cluster arm"
• What is the reward probability of a "cluster arm"?
• Eventually, every "cluster arm" must converge to the most rewarding arm μmax within that cluster
  • since a bandit policy is used within each cluster
• However, "drift" causes problems
Experiments • Simulation based on one week’s worth of data from a large-scale ad-matching application • 10 clusters, with 11.3 arms/cluster on average
Comparison of schemes
[Plot: 10 clusters, 11.3 arms/cluster]
• Cluster separation Δ = 0.08
• Cluster size Aopt = 31
• Cohesiveness = 0.75
MAX performs best
Reward probability drift causes problems
[Diagram: Cluster 1 (Arms 1 and 2) and Cluster 2 (Arms 3 and 4, the opt cluster), with the best (optimal) arm, with reward probability μopt, in Cluster 2]
Intuitively, to reduce regret, we must:
• Quickly converge to the optimal "cluster arm"
• and then to the best arm within that cluster