
Multi-armed Bandit Problems with Dependent Arms


Presentation Transcript


  1. Multi-armed Bandit Problems with Dependent Arms Sandeep Pandey (spandey@cs.cmu.edu) Deepayan Chakrabarti (deepay@yahoo-inc.com) Deepak Agarwal (dagarwal@yahoo-inc.com)

  2. Background: Bandits Bandit “arms” with unknown reward probabilities μ1, μ2, μ3 • Pull arms sequentially so as to maximize the total expected reward • Show ads on a webpage to maximize clicks • Product recommendation to maximize sales
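
As a point of reference for the setting on this slide, here is a minimal sketch of a classic (independent-arm) bandit loop using the standard UCB1 policy. The reward probabilities in the example are made up for illustration.

```python
import math
import random

def ucb1(mus, horizon=10000, seed=0):
    """Pull Bernoulli arms sequentially with the UCB1 index policy."""
    rng = random.Random(seed)
    k = len(mus)
    pulls = [0] * k        # n_i: number of pulls of arm i
    successes = [0] * k    # s_i: number of observed rewards from arm i
    total_reward = 0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1    # initialize by pulling every arm once
        else:
            # UCB1 index: empirical mean + exploration bonus
            arm = max(range(k), key=lambda i: successes[i] / pulls[i]
                      + math.sqrt(2 * math.log(t) / pulls[i]))
        reward = 1 if rng.random() < mus[arm] else 0
        pulls[arm] += 1
        successes[arm] += reward
        total_reward += reward
    return total_reward, pulls

# Example: three ads with unknown click probabilities mu1, mu2, mu3
print(ucb1([0.3, 0.28, 1e-6]))
```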

  3. Dependent Arms • Reward probabilities μi are generally assumed to be independent of each other • What if they are dependent? • E.g., ads on similar topics, using similar text/phrases, should have similar rewards: “Skiing, snowboarding” (μ1 = 0.3), “Skiing, snowshoes” (μ2 = 0.28), “Snowshoe rental” (μ4 = 0.31), but “Get Vonage!” (μ3 = 10⁻⁶)

  4. Dependent Arms • Reward probabilities μi are generally assumed to be independent of each other • What if they are dependent? • E.g., ads on similar topics, using similar text/phrases, should have similar rewards • A click on one ad ⇒ other “similar” ads may generate clicks as well • Can we increase total reward using this dependency?

  5. Cluster Model of Dependence • Arms are grouped into clusters (e.g., Cluster 1 = {Arm 1, Arm 2}, Cluster 2 = {Arm 3, Arm 4}) • μi ~ f(π[i]), where f is some known distribution and π[i] is the (unknown) cluster-specific parameter of arm i’s cluster • Successes si ~ Bin(ni, μi), where ni = # pulls of arm i
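
A minimal sketch of this generative model, assuming f is a Beta distribution and using made-up cluster parameters π (the names and numbers are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cluster structure and cluster-specific parameters pi;
# here we assume f(pi) = Beta(a, b) with pi = (a, b)
clusters = {1: [0, 1], 2: [2, 3]}        # cluster -> arm indices
pi = {1: (8.0, 20.0), 2: (1.0, 30.0)}

# mu_i ~ f(pi[i]): arms in the same cluster draw their reward
# probabilities from the same distribution, so they are dependent
mu = np.empty(4)
for c, arms in clusters.items():
    a, b = pi[c]
    mu[arms] = rng.beta(a, b, size=len(arms))

# s_i ~ Bin(n_i, mu_i): successes after n_i pulls of arm i
n = np.array([50, 10, 40, 5])            # pulls of each arm
s = rng.binomial(n, mu)
print(mu, s)
```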

  6. Cluster Model of Dependence • Cluster 1 (Arms 1, 2): μi ~ f(π1); Cluster 2 (Arms 3, 4): μi ~ f(π2) • Total reward: • Discounted: ∑_{t=0}^{∞} α^t · E[R(t)], α = discounting factor • Undiscounted: ∑_{t=0}^{T} E[R(t)]

  7. Discounted Reward • The optimal policy can be computed using per-cluster MDPs only • Optimal Policy: • Compute an (“index”, arm) pair for each cluster • Pick the cluster with the largest index, and pull the corresponding arm • [Figure: MDP for cluster 1, state (x1, x2); Pull Arm 1 leads to (x’1, x’2) or (x”1, x”2). MDP for cluster 2, state (x3, x4); Pull Arm 3 leads to (x’3, x’4) or (x”3, x”4)]

  8. Discounted Reward • The optimal policy can be computed using per-cluster MDPs only • Optimal Policy: • Compute an (“index”, arm) pair for each cluster • Pick the cluster with the largest index, and pull the corresponding arm • Reduces the problem to smaller state spaces • Reduces to Gittins’ Theorem [1979] for independent bandits • Approximation bounds on the index for k-step lookahead • [Figure: same per-cluster MDPs as the previous slide]
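
To make the per-cluster computation concrete, here is a rough sketch of a k-step lookahead value for one cluster, treating each arm's belief as an independent Beta posterior. This is only an illustration of the lookahead idea; it is not the paper's exact index, and it ignores the coupling between arms through the shared cluster parameter π.

```python
ALPHA = 0.9  # discount factor

def lookahead_value(beliefs, k, alpha=ALPHA):
    """k-step lookahead value (and best first arm) for a cluster whose
    arms have Beta(a, b) beliefs; a stand-in for the per-cluster MDP."""
    if k == 0:
        return 0.0, None
    best_val, best_arm = float("-inf"), None
    for i, (a, b) in enumerate(beliefs):
        p = a / (a + b)  # P(success of arm i | current belief)
        succ = beliefs[:i] + ((a + 1, b),) + beliefs[i + 1:]
        fail = beliefs[:i] + ((a, b + 1),) + beliefs[i + 1:]
        val = (p * (1 + alpha * lookahead_value(succ, k - 1, alpha)[0])
               + (1 - p) * alpha * lookahead_value(fail, k - 1, alpha)[0])
        if val > best_val:
            best_val, best_arm = val, i
    return best_val, best_arm

# One ("index", arm) pair per cluster; pull the arm from the best cluster
cluster_beliefs = {1: ((2, 3), (1, 1)), 2: ((5, 20), (1, 2))}
pairs = {c: lookahead_value(b, k=3) for c, b in cluster_beliefs.items()}
best = max(pairs, key=lambda c: pairs[c][0])
print(pairs, "-> pull arm", pairs[best][1], "of cluster", best)
```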

  9. Cluster Model of Dependence • Cluster 1 (Arms 1, 2): μi ~ f(π1); Cluster 2 (Arms 3, 4): μi ~ f(π2) • Total reward: • Discounted: ∑_{t=0}^{∞} α^t · E[R(t)], α = discounting factor • Undiscounted: ∑_{t=0}^{T} E[R(t)]

  10. Undiscounted Reward • All arms in a cluster are similar ⇒ they can be grouped into one hypothetical “cluster arm” (Cluster 1 → “Cluster arm” 1, Cluster 2 → “Cluster arm” 2)

  11. Undiscounted Reward • Two-Level Policy, in each iteration: • Pick a “cluster arm” using a traditional bandit policy • Pick an arm within that cluster using a traditional bandit policy • Each “cluster arm” must have some estimated reward probability
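
Below is a minimal sketch of such a Two-Level Policy, using UCB1 as the traditional bandit policy at both levels and the MEAN scheme (defined on a later slide) as the cluster arm's reward estimate. The reward probabilities and cluster structure are illustrative.

```python
import math
import random

def ucb_pick(stats, t):
    """UCB1 choice over items given [(successes, pulls), ...]."""
    for i, (_, n) in enumerate(stats):
        if n == 0:
            return i  # try unexplored items first
    return max(range(len(stats)),
               key=lambda i: stats[i][0] / stats[i][1]
               + math.sqrt(2 * math.log(t) / stats[i][1]))

def two_level_policy(mus, clusters, horizon=20000, seed=0):
    """Pick a "cluster arm" with UCB1, then an arm inside it with UCB1."""
    rng = random.Random(seed)
    arm_stats = [[0, 0] for _ in mus]   # per-arm (successes, pulls)
    reward = 0
    for t in range(1, horizon + 1):
        # Level 1: MEAN estimate of each cluster arm = pooled (s, n)
        cluster_stats = [(sum(arm_stats[i][0] for i in c),
                          sum(arm_stats[i][1] for i in c)) for c in clusters]
        c = clusters[ucb_pick(cluster_stats, t)]
        # Level 2: ordinary bandit policy among the arms of that cluster
        a = c[ucb_pick([tuple(arm_stats[i]) for i in c], t)]
        r = 1 if rng.random() < mus[a] else 0
        arm_stats[a][0] += r
        arm_stats[a][1] += 1
        reward += r
    return reward

print(two_level_policy([0.30, 0.28, 1e-6, 0.31], [[0, 1], [2, 3]]))
```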

  12. Issues • What is the reward probability of a “cluster arm”? • How do cluster characteristics affect performance?

  13. Reward probability of a “cluster arm” • What is the reward probability r of a “cluster arm”? • MEAN: r = ∑si / ∑ni, i.e., average success rate, summing over all arms in the cluster [Kocsis+/2006, Pandey+/2007] • Initially, r = μavg = average μ of arms in cluster • Finally, r = μmax = max μ among arms in cluster • “Drift” in the reward probability of the “cluster arm”

  14. Reward probability drift causes problems • Drift ⇒ non-optimal clusters might temporarily look better ⇒ the optimal arm is explored only O(log T) times • [Figure: Cluster 1 (Arms 1, 2) and Cluster 2 (Arms 3, 4), where Cluster 2 is the opt cluster and contains the best (optimal) arm, with reward probability μopt]

  15. Reward probability of a “cluster arm” • What is the reward probability r of a “cluster arm”? • MEAN: r = ∑si / ∑ni • MAX: r = max(E[μi]) • PMAX: r = E[max(μi)], taking the max over all arms i in the cluster • Both MAX and PMAX aim to estimate μmax and thus reduce drift
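
A minimal sketch of the three schemes, assuming each arm's mean is given a Beta(1 + si, 1 + ni - si) posterior and estimating PMAX by Monte Carlo (the posterior choice and sample size are assumptions of this sketch, not specified on the slide):

```python
import numpy as np

def cluster_arm_estimates(s, n, n_samples=10000, seed=0):
    """MEAN, MAX, and PMAX estimates of a cluster arm's reward probability
    from per-arm successes s and pulls n."""
    s, n = np.asarray(s, float), np.asarray(n, float)
    rng = np.random.default_rng(seed)

    mean = s.sum() / n.sum()            # MEAN: pooled success rate
    post_mean = (1 + s) / (2 + n)       # E[mu_i] under Beta posterior
    mx = post_mean.max()                # MAX: max of posterior means
    draws = rng.beta(1 + s, 1 + n - s, size=(n_samples, len(s)))
    pmax = draws.max(axis=1).mean()     # PMAX: posterior mean of max(mu_i)
    return mean, mx, pmax

# Example: a cluster of three arms
print(cluster_arm_estimates(s=[12, 3, 0], n=[40, 10, 5]))
```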

  16. Reward probability of a “cluster arm” • MEAN: r = ∑si / ∑ni • MAX: r = max(E[μi]) • PMAX: r = E[max(μi)] • Both MAX and PMAX aim to estimate μmax and thus reduce drift • Bias in estimation of μmax: high for MAX, unbiased for PMAX • Variance of estimator: low for MAX, high for PMAX

  17. Comparison of schemes 10 clusters, 11.3 arms/cluster MAX performs best

  18. Issues • What is the reward probability of a “cluster arm”? • How do cluster characteristics affect performance?

  19. Effects of cluster characteristics • We analytically study the effects of cluster characteristics on the “crossover-time” • Crossover-time: Time when the expected reward probability of the optimal cluster becomes highest among all “cluster arms”

  20. Effects of cluster characteristics • Crossover-time Tc for MEAN depends on: • Cluster separation Δ = μopt – max μ outside the opt cluster: Δ increases ⇒ Tc decreases • Cluster size Aopt: Aopt increases ⇒ Tc increases • Cohesiveness in opt cluster = 1 – avg(μopt – μi): cohesiveness increases ⇒ Tc decreases
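
For concreteness, a small helper that computes these three quantities from given μ values (the numbers below are made up; they are not the paper's data):

```python
import numpy as np

def cluster_characteristics(mu_opt_cluster, mu_other_arms):
    """Separation, size, and cohesiveness of the optimal cluster."""
    mu_opt_cluster = np.asarray(mu_opt_cluster, float)
    mu_opt = mu_opt_cluster.max()
    delta = mu_opt - max(mu_other_arms)                  # cluster separation
    a_opt = len(mu_opt_cluster)                          # cluster size
    cohesiveness = 1 - np.mean(mu_opt - mu_opt_cluster)  # 1 - avg gap to best
    return delta, a_opt, cohesiveness

print(cluster_characteristics([0.31, 0.28, 0.30], [0.23, 1e-6]))
```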

  21. Experiments (effect of separation) Δ increases ⇒ Tc decreases ⇒ higher reward

  22. Experiments (effect of size) Aopt increases ⇒ Tc increases ⇒ lower reward

  23. Experiments (effect of cohesiveness) Cohesiveness increases ⇒ Tc decreases ⇒ higher reward

  24. Related Work • Typical multi-armed bandit problems • Do not consider dependencies • Very few arms • Bandits with side information • Cannot handle dependencies among arms • Active learning • Emphasis on #examples required to achieve a given prediction accuracy

  25. Conclusions • We analyze bandits where dependencies are encapsulated within clusters • Discounted Reward: the optimal policy is an index scheme on the clusters • Undiscounted Reward: • Two-level Policy with MEAN, MAX, and PMAX • Analysis of the effect of cluster characteristics on performance, for MEAN

  26. Discounted Reward • Create a belief-state MDP • Each state contains the estimated reward probabilities (x1, x2, x3, x4) for all arms • Solve for the optimal policy • [Figure: from state (x1, x2, x3, x4), Pull Arm 1 changes the belief for both arms 1 and 2: on success the next state is (x’1, x’2, x3, x4), on failure (x”1, x”2, x3, x4); Pull Arm 2, Pull Arm 3, and Pull Arm 4 transition similarly]
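
The reason a single pull updates the belief for both arms of a cluster is that they share the cluster parameter π. A minimal sketch of that effect (not the paper's construction), using an assumed f and a particle approximation of the joint belief:

```python
import numpy as np

rng = np.random.default_rng(0)

# Particle approximation of the joint belief over (pi, mu1, mu2) for one
# cluster, assuming mu_i ~ Beta(10 * pi, 10 * (1 - pi)), i.e. mean pi
P = 100_000
pi = rng.uniform(0.01, 0.99, P)
mu = rng.beta(10 * pi[:, None], 10 * (1 - pi)[:, None], size=(P, 2))
w = np.full(P, 1.0 / P)                  # particle weights

def update(arm, success):
    """Condition the joint belief on one observed pull of `arm`."""
    global w
    w = w * (mu[:, arm] if success else 1 - mu[:, arm])
    w = w / w.sum()

print("before:", w @ mu)                 # E[mu1], E[mu2] under current belief
update(arm=0, success=True)              # one successful pull of arm 1
print("after :", w @ mu)                 # the belief about arm 2 moves too
```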

  27. Background: Bandits • Bandit “arms” with unknown payoff probabilities p1, p2, p3 • Regret = optimal payoff – actual payoff

  28. Reward probability of a “cluster arm” • What is the reward probability of a “cluster arm”? • Eventually, every “cluster arm” must converge to the most rewarding arm μmax within that cluster • since a bandit policy is used within each cluster • However, “drift” causes problems

  29. Experiments • Simulation based on one week’s worth of data from a large-scale ad-matching application • 10 clusters, with 11.3 arms/cluster on average

  30. Comparison of schemes 10 clusters, 11.3 arms/cluster • Cluster separation Δ = 0.08 • Cluster size Aopt = 31 • Cohesiveness = 0.75 MAX performs best

  31. Reward probability drift causes problems • Intuitively, to reduce regret, we must: • Quickly converge to the optimal “cluster arm” • and then to the best arm within that cluster • [Figure: Cluster 1 (Arms 1, 2) and Cluster 2 (Arms 3, 4), where Cluster 2 is the opt cluster and contains the best (optimal) arm, with reward probability μopt]
