Restless Multi-Arm Bandits Problem (RMAB): An Empirical Study
Anthony Bonifonte and Qiushi Chen
ISYE8813 Stochastic Processes and Algorithms
4/18/2014
Agenda • Restless multi-arm bandits problem • Algorithms and policies • Numerical experiments • Simulated problem instances • Real application: the capacity management problem
Restless Multi-Arm Bandits Problem
• [Diagram: N arms evolving in parallel; in each period M arms are labeled Active and the remaining arms Passive]
Objective
• Discounted rewards (finite or infinite horizon)
• Time-average rewards
• A general modeling framework
  • N-choose-M problem
  • Limited capacity (production capacity, service capacity)
• Connection with the multi-arm bandit problem: in the classical MAB a passive arm does not change state, whereas in the restless setting passive arms keep evolving
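For concreteness, the discounted-reward objective can be written as below. This is the standard RMAB formulation; the notation (a_n(t) for the action taken on arm n, R_n^a for its active/passive rewards, β for the discount factor, T finite or infinite) is assumed here rather than copied from the slides.

```latex
\max_{\pi}\;\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T}\beta^{t}\sum_{n=1}^{N}
  R_{n}^{a_{n}(t)}\bigl(s_{n}(t)\bigr)\right]
\quad\text{s.t.}\quad
\sum_{n=1}^{N} a_{n}(t)=M \;\;\text{for every } t,
\qquad a_{n}(t)\in\{0,1\}.
```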
Exact Optimal Solution: Dynamic Programming
• Markov decision process (MDP)
  • State: the joint state of all N arms
  • Action: the active set, i.e., which M of the N arms to activate (an N-choose-M choice)
  • Transition matrix: arms evolve independently given the chosen active set
  • Rewards: sum of the per-arm active/passive rewards
• Algorithm:
  • Finite horizon: backward induction
  • Infinite horizon (discounted): value iteration, policy iteration
• Problem size: grows exponentially in N and becomes a disaster quickly
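A minimal sketch of this exact approach for a tiny instance, treating the RMAB as one joint MDP. It assumes each arm has its own active/passive transition matrices and reward vectors; all function and variable names are illustrative, and the explicit enumeration of joint states is only feasible for very small S and N.

```python
# Exact value iteration on the joint MDP of a small RMAB (illustrative sketch).
import itertools
import numpy as np

def value_iteration(P_active, P_passive, R_active, R_passive, M, beta,
                    tol=1e-6, max_iter=10_000):
    """P_active/P_passive: per-arm S x S transition matrices (lists of arrays);
    R_active/R_passive: per-arm reward vectors of length S."""
    N = len(P_active)
    S = P_active[0].shape[0]
    joint_states = list(itertools.product(range(S), repeat=N))   # S^N states
    actions = list(itertools.combinations(range(N), M))          # N-choose-M active sets
    V = {s: 0.0 for s in joint_states}

    for _ in range(max_iter):
        V_new = {}
        for s in joint_states:
            best = -np.inf
            for act in actions:
                active = set(act)
                # immediate reward of this joint state under this active set
                r = sum((R_active[n][s[n]] if n in active else R_passive[n][s[n]])
                        for n in range(N))
                # expected value of the next joint state (arms move independently)
                ev = 0.0
                for s_next in joint_states:
                    p = 1.0
                    for n in range(N):
                        Pn = P_active[n] if n in active else P_passive[n]
                        p *= Pn[s[n], s_next[n]]
                    ev += p * V[s_next]
                best = max(best, r + beta * ev)
            V_new[s] = best
        if max(abs(V_new[s] - V[s]) for s in joint_states) < tol:
            return V_new
        V = V_new
    return V
```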
Lagrangian Relaxation: Upper Bound
• Let A(t) = number of active arms at time t
• Original requirement: exactly M arms are active in every period
• Relaxed requirement: an "average" version — the constraint only needs to hold in (discounted) expectation
• Solving the relaxed problem gives an upper bound on the optimal value
  • Occupancy measures
  • Using the dual LP formulation of the MDP
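Written out, the relaxation replaces the per-period constraint with its discounted expectation; dualizing this single constraint with a multiplier W then decouples the problem into N single-arm subproblems. This is the standard Whittle relaxation; the notation below is assumed, not taken verbatim from the slides.

```latex
\max_{\pi}\;\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\beta^{t}\sum_{n=1}^{N}
   R_{n}^{a_{n}(t)}\bigl(s_{n}(t)\bigr)\right]
\quad\text{s.t.}\quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\beta^{t}A(t)\right]=\frac{M}{1-\beta},
\qquad A(t)=\sum_{n=1}^{N}a_{n}(t).
```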
Index Policies
• Philosophy: decomposition — one huge problem with S^N joint states becomes N small problems with S states each
• Index policy:
  • Compute the index for each arm separately
  • Rank the indices
  • Choose the arms with the M smallest/largest indices
• Easy to compute and implement
• Intuitive structure
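The selection step shared by all index policies is simple. A minimal sketch, assuming precomputed per-arm index functions (all names are illustrative):

```python
# Generic index policy: rank per-arm indices and activate the top M arms.
import numpy as np

def index_policy(states, index_fn, M, largest=True):
    """states: current state of each arm; index_fn[n](s): index of arm n in state s."""
    idx = np.array([index_fn[n](states[n]) for n in range(len(states))])
    order = np.argsort(-idx if largest else idx)
    return set(order[:M])          # arms to make active this period
```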
The Whittle Index Policy (Discounted Rewards)
• For a fixed arm and a given state, add a "subsidy" W to the passive reward and solve the resulting W-subsidy problem (active vs. passive-with-subsidy)
• The Whittle index W(s): the subsidy that makes the active and passive actions indifferent in state s
• If W is too small (too large), the active (passive) action is better
• A closed-form solution depends on the specific model
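The single-arm W-subsidy problem and the index it defines can be written as below (standard form; the superscripts 1/0 for active/passive are an assumed notation):

```latex
V_{W}(s)=\max\Bigl\{\,R^{1}(s)+\beta\sum_{s'}P^{1}(s,s')\,V_{W}(s'),\;\;
R^{0}(s)+W+\beta\sum_{s'}P^{0}(s,s')\,V_{W}(s')\,\Bigr\},
```

and W(s) is the value of W at which the two terms inside the max are equal.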
Numerical Algorithm for Solving the Whittle Index
• STEP 1: Find a plausible range for W
  • Start from an initial W and step size
  • Run value iteration on the W-subsidy problem and evaluate V(Passive) − V(Active)
  • Update W: reduce it when the passive action is preferred, increase it when the active action is preferred
  • STOP when V(Passive) − V(Active) reverses sign for the first time; this identifies a range [L, U] containing the index
• STEP 2: Use binary search within the range [L, U]
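A minimal sketch of this two-step procedure for one arm, assuming per-arm inputs P1/P0 (active/passive transition matrices) and R1/R0 (reward vectors). All names, step sizes, and tolerances are illustrative choices, not the authors' code.

```python
import numpy as np

def subsidy_gap(P1, P0, R1, R0, W, beta, tol=1e-8):
    """Value iteration for the single-arm W-subsidy problem.
    Returns V(passive) - V(active) for every state."""
    V = np.zeros(len(R1))
    while True:
        Qa = R1 + beta * P1 @ V          # value of the active action
        Qp = R0 + W + beta * P0 @ V      # value of the passive action plus subsidy W
        V_new = np.maximum(Qa, Qp)
        if np.max(np.abs(V_new - V)) < tol:
            return Qp - Qa
        V = V_new

def whittle_index(P1, P0, R1, R0, s, beta, W0=0.0, step=1.0, tol=1e-4):
    """Subsidy at which active and passive are indifferent in state s."""
    gap = subsidy_gap(P1, P0, R1, R0, W0, beta)[s]
    if abs(gap) < tol:
        return W0
    sign0, W_prev, W = np.sign(gap), W0, W0
    # STEP 1: step W (down if passive is already preferred, up otherwise)
    # until the sign of V(passive) - V(active) flips for the first time.
    while np.sign(subsidy_gap(P1, P0, R1, R0, W, beta)[s]) == sign0:
        W_prev, W = W, (W - step if sign0 > 0 else W + step)
    # STEP 2: binary search between W_prev (original sign) and W (flipped sign).
    lo, hi = W_prev, W
    while abs(hi - lo) > tol:
        mid = 0.5 * (lo + hi)
        if np.sign(subsidy_gap(P1, P0, R1, R0, mid, beta)[s]) == sign0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```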
The Primal-Dual Index Policy
• Solve the Lagrangian relaxation (LP) formulation
• Input:
  • Optimal primal solutions: the occupancy measures — the total expected discounted time spent selecting arm n in each state
  • Optimal reduced costs — the rate of decrease in the objective value per unit increase of the corresponding occupancy measure; they measure how harmful it is to switch an arm from passive to active, or from active to passive
• Policy: an arm is indicated "active" if its occupancy measure for the active action in its current state is > 0; let p = number of such arms
  • (1) p = M: choose them!
  • (2) p < M: add (M − p) more arms — among the remaining arms, choose the (M − p) with the smallest reduced costs
  • (3) p > M: choose M out of the p arms — kick out the (p − M) arms with the smallest reduced costs
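A minimal sketch of the selection rule only, assuming the LP relaxation has already been presolved and that, for each arm, `x_active[n]` is the occupancy measure of the active action in its current state and `gamma[n]` the associated reduced cost; these names, and the exact reduced cost used for each ranking, are assumptions of the sketch.

```python
def primal_dual_select(x_active, gamma, M, eps=1e-9):
    """Pick M arms from presolved LP quantities (illustrative selection rule)."""
    N = len(x_active)
    chosen = [n for n in range(N) if x_active[n] > eps]   # arms the LP keeps active
    if len(chosen) < M:                                   # case (2): add arms
        rest = sorted((n for n in range(N) if n not in chosen),
                      key=lambda n: gamma[n])
        chosen += rest[:M - len(chosen)]                  # least harmful to activate
    elif len(chosen) > M:                                 # case (3): drop arms
        chosen = sorted(chosen, key=lambda n: gamma[n], reverse=True)[:M]
        # i.e., kick out the (p - M) arms with the smallest reduced costs
    return set(chosen)                                    # case (1) falls through
```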
Heuristic Index Policies
• Absolute-greedy policy: choose the M arms with the largest active rewards
• Relative-greedy policy: choose the M arms with the largest marginal rewards (active reward minus passive reward)
• Rolling-horizon policy (H-period look-ahead): choose the M arms with the largest marginal value-to-go, where the value-to-go is the optimal value function over the following H periods
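The two greedy rules amount to ranking arms by a reward-based index. A short sketch with assumed names (`R1[n]` and `R0[n]` are the active and passive reward vectors of arm n):

```python
import numpy as np

def absolute_greedy(states, R1, M):
    idx = np.array([R1[n][states[n]] for n in range(len(states))])
    return set(np.argsort(-idx)[:M])      # largest active rewards

def relative_greedy(states, R1, R0, M):
    idx = np.array([R1[n][states[n]] - R0[n][states[n]] for n in range(len(states))])
    return set(np.argsort(-idx)[:M])      # largest marginal (active - passive) rewards
```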
Agenda • Restless multi-arm bandits problem • Algorithms and policies • Numerical experiments • Simulated problem instances • Real application: the capacity management problem
Experiment Settings
• Assume active rewards are larger than passive rewards
• Non-identical arms
• Structures in transition dynamics:
  • Uniformly sampled transition matrix
  • IFR (increasing failure rate) matrix with non-increasing rewards
  • P1 stochastically smaller than P2
  • Less-connected chain
• Evaluation:
  • Small instances: exact optimal solution
  • Large instances: upper bound & Monte Carlo simulation
• Performance measure: average gap from optimality or from the upper bound
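As an illustration of the "uniformly sampled" instances, one common way to generate them is to draw each transition-matrix row uniformly from the probability simplex (a Dirichlet(1,…,1) draw) and rewards uniformly on an interval, with active rewards kept above passive rewards. The slides do not state the exact sampling scheme, so this is only a plausible sketch with assumed ranges.

```python
import numpy as np

def random_instance(S, N, rng=None):
    """One uniformly sampled arm set: per-arm active/passive dynamics and rewards."""
    rng = np.random.default_rng(rng)
    P1 = [rng.dirichlet(np.ones(S), size=S) for _ in range(N)]   # active transitions
    P0 = [rng.dirichlet(np.ones(S), size=S) for _ in range(N)]   # passive transitions
    R1 = [rng.uniform(0.5, 1.0, size=S) for _ in range(N)]       # active rewards ...
    R0 = [rng.uniform(0.0, 0.5, size=S) for _ in range(N)]       # ... kept above passive
    return P1, P0, R1, R0
```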
5 Questions of Interest
1. How do different policies compare under different problem structures?
2. How do different policies compare under various problem sizes?
3. How do different policies compare under different discount factors?
4. How does a multi-period look-ahead improve a myopic policy?
5. How do different policies compare under different time horizons?
Question 1: Does problem structure help? • Uniformly sampled transition matrix and rewards • Increasing failure rate matrix and non-increasing rewards • Less-connected Markov chain • P1 stochastically smaller than P2, non-increasing rewards
Question 2: Does problem size matter? • Optimality gap: fixed N and M, increasing S
Question 2: Does problem size matter? • Optimality gap: fixed M and S, increasing N (the gap decreases as N grows)
Question 3: Does discount factor matter? • Infinite horizon: discount factors
Question 4: Does look-ahead help a myopic policy?
• Greedy policies vs. rolling-horizon policies with different H
• Problem size: S=8, N=6, M=2
• Problem structure: uniform vs. less-connected
• Results shown for discount factors 0.4, 0.7, 0.9, and 0.98
Agenda • Restless multi-arm bandits problem • Algorithms and policies • Numerical experiments • Simulated problem instances • Real application: the capacity management problem
Clinical Capacity Management Problem (Deo et al. 2013)
• School-based asthma care for children: a van with limited capacity visits schools, and the scheduling policy uses patients' medical records to decide who to schedule (treat)
• RMAB formulation: each patient is an arm with state (h, n), where h = health state at the last appointment and n = time since the last appointment; capacity M, population N; the active set chooses M patients out of N
• Objective: maximize the total benefit to the community
• Policies compared:
  • Current guidelines (fixed-duration policy)
  • Whittle's index policy
  • Primal-dual index policy
  • Greedy (myopic) policy
  • Rolling-horizon policy
  • H-N priority policy, N-H priority policy
  • No-schedule [baseline]
• Performance reported as improvement over the no-schedule baseline
How Large Is It?
• Horizon: 24 periods (2 years)
• Population size: N ≈ 50 patients
• State space:
  • Each arm: 96 states
  • In total: 96^50 ≈ 1.3 × 10^99 joint states
• Decision space:
  • Choose 10 out of 50: 1.2 × 10^10
  • Choose 15 out of 50: 2.3 × 10^12
• Actual computation time:
  • Whittle indices: 96 states/arm × 50 arms = 4,800 indices, 1.5–3 hours
  • Presolving the LP relaxation for primal-dual indices: 4–60 seconds
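A quick way to reproduce these back-of-the-envelope sizes:

```python
# Joint state-space and active-set counts quoted above.
from math import comb

print(f"{96 ** 50:.2e}")      # joint states: 96 states per arm, 50 arms
print(f"{comb(50, 10):.2e}")  # ways to choose 10 active arms out of 50
print(f"{comb(50, 15):.2e}")  # ways to choose 15 active arms out of 50
```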
Performance of Policies
• [Charts: improvement of each scheduling policy over the no-schedule baseline]
Whittle Index vs. Gittins Index
• (S, N, M=1) vs. (S, N, M=2)
• Sample 20 instances for each problem size
• Whittle index policy vs. exact DP solution
• Optimality tolerance = 0.002
• Reported: percentage of time when the Whittle index policy is NOT optimal
Summary
• The Whittle index and primal-dual index policies work well and efficiently
• The relative-greedy policy can work well, depending on the problem structure
• Policies perform worse on the less-connected Markov chain
• All policies tend to work better when capacity is tight
• Look-ahead policies have limited marginal benefit when the discount factor is small
Question 5: Does decision horizon matter? • Finite horizon: # of periods