Online Sampling for Markov Decision Processes
Bob Givan
Joint work w/ E. K. P. Chong, H. Chang, G. Wu
Electrical and Computer Engineering, Purdue University
Markov Decision Process (MDP)
• Ingredients:
  • System state x in state space X
  • Control action a in A(x)
  • Reward R(x,a)
  • State-transition probability P(x,y,a)
• Find control policy to maximize objective function
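The ingredients listed above can be held in a small container. The following is a minimal sketch, not the authors' code: the class name `MDP`, the method `sample_next`, and the two-state "empty/busy" example are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List

State = Hashable
Action = Hashable


@dataclass
class MDP:
    actions: Callable[[State], List[Action]]                    # A(x)
    reward: Callable[[State, Action], float]                    # R(x, a)
    transition: Callable[[State, Action], Dict[State, float]]   # y -> P(x, y, a)

    def sample_next(self, x: State, a: Action, rng: random.Random) -> State:
        """Draw a next state y with probability P(x, y, a)."""
        dist = self.transition(x, a)
        states = list(dist)
        return rng.choices(states, weights=[dist[y] for y in states], k=1)[0]


# Hypothetical two-state example: a server that is "empty" or "busy".
toy = MDP(
    actions=lambda x: ["serve", "idle"],
    reward=lambda x, a: 1.0 if (x == "busy" and a == "serve") else 0.0,
    transition=lambda x, a: (
        {"empty": 0.9, "busy": 0.1} if a == "serve" else {"empty": 0.2, "busy": 0.8}
    ),
)

rng = random.Random(0)
y = toy.sample_next("busy", "serve", rng)
```

Only `sample_next` is needed by the sampling methods later in the talk, which is why a generative model is enough even when P(x,y,a) cannot be enumerated.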
Optimal Policies
• Policy – mapping from state and time to actions
• Stationary policy – mapping from state to actions
• Goal – a policy maximizing the objective function
  $V_H^*(x_0) = \max \mathrm{Obj}[R(x_0,a_0), \dots, R(x_{H-1},a_{H-1})]$
  where the “max” is over all policies $u = u_0, \dots, u_{H-1}$
• For large H, $a_0$ is independent of H (with an ergodicity assumption)
• Stationary optimal action $a_0$ for $H = \infty$ via receding-horizon control
Q Values
Fix a large H, focus on finite-horizon reward
• Define $Q(x,a) = R(x,a) + E[V_{H-1}^*(y)]$, where y is the next state reached from x under action a
  • “Utility” of action a at state x.
  • Name: Q-value of action a at state x.
• Key identities (Bellman’s equations):
  • $V_H^*(x) = \max_a Q(x,a)$
  • $u_0^*(x) = \arg\max_a Q(x,a)$
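For a model small enough to enumerate, the Bellman identities can be evaluated exactly by backward induction. The sketch below assumes the objective is the plain sum of rewards over the horizon (the slides leave "Obj" unspecified); the function name `q_values` and the two-state example are hypothetical.

```python
def q_values(P, R, H):
    """Return (Q_H, V_H) under a total-reward objective, for H >= 1.

    P[x][a] is a dict {y: probability}; R[x][a] is the one-step reward.
    """
    V = {x: 0.0 for x in P}  # V_0^*(x) = 0: no reward once the horizon is used up
    Q = {}
    for _ in range(H):
        # Q(x, a) = R(x, a) + E[V^*(y)] with one fewer step to go
        Q = {
            x: {a: R[x][a] + sum(p * V[y] for y, p in P[x][a].items()) for a in P[x]}
            for x in P
        }
        # V^*(x) = max_a Q(x, a)
        V = {x: max(Q[x].values()) for x in P}
    return Q, V


# Hypothetical deterministic two-state model.
P = {
    "s": {"stay": {"s": 1.0}, "go": {"t": 1.0}},
    "t": {"stay": {"t": 1.0}, "go": {"s": 1.0}},
}
R = {"s": {"stay": 0.0, "go": 1.0}, "t": {"stay": 2.0, "go": 0.0}}

Q, V = q_values(P, R, 3)
greedy = {x: max(Q[x], key=Q[x].get) for x in Q}  # u_0^*(x) = argmax_a Q(x, a)
```

Here `Q["s"]["go"]` is 5.0 and `V["t"]` is 6.0. This exact computation touches every state at every step, which is what becomes infeasible when the state space is very large.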
Solution Methods
• Recall:
  • $u_0^*(x) = \arg\max_a Q(x,a)$
  • $Q(x,a) = R(x,a) + E[V_{H-1}^*(y)]$
• Problem: Q-value depends on optimal policy.
• State space is extremely large (often continuous)
• Two-pronged solution approach:
  • Apply a receding-horizon method
  • Estimate Q-values via simulation/sampling
Methods for Q-value Estimation
Previous work by other authors:
• Unbiased sampling (exact Q value) [Kearns et al., IJCAI-99]
• Policy rollout (lower bound) [Bertsekas & Castanon, 1999]
Our techniques:
• Hindsight optimization (upper bound)
• Parallel rollout (lower bound)
Expectimax Tree for $V^*$ [figure]
Unbiased Sampling [figure]
Unbiased Sampling (Cont’d)
• For a given desired accuracy, how large should sampling width and depth be?
• Answered: Kearns, Mansour, and Ng (1999)
• Requires prohibitive sampling width and depth
  • e.g. C 108, Hs > 60 to distinguish “best” and “worst” policies in our scheduling domain
• We evaluate with smaller width and depth
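The role of sampling width and depth can be seen in a short sketch of a sampled expectimax tree in the style of Kearns, Mansour, and Ng. It assumes an undiscounted total-reward objective and a hypothetical two-state generative model; `sampled_q`, `sampled_v`, and the model functions are illustrative names, not the authors' code.

```python
import random

ACTIONS = (0, 1)


def reward(x, a):
    # Hypothetical model: reward 1 for being in state 1.
    return float(x)


def sample_next(x, a, rng):
    # Hypothetical model: action 1 reaches state 1 with prob. 0.9, action 0 with 0.1.
    p_one = 0.9 if a == 1 else 0.1
    return 1 if rng.random() < p_one else 0


def sampled_q(x, depth, width, rng):
    """Estimate Q(x, a) for each action from `width` sampled next states."""
    q = {}
    for a in ACTIONS:
        total = 0.0
        for _ in range(width):
            y = sample_next(x, a, rng)
            total += sampled_v(y, depth - 1, width, rng)
        q[a] = reward(x, a) + total / width
    return q


def sampled_v(x, depth, width, rng):
    """Estimate V(x) as the max over actions of the sampled Q estimates."""
    if depth == 0:
        return 0.0
    return max(sampled_q(x, depth, width, rng).values())


rng = random.Random(1)
q = sampled_q(0, 3, 25, rng)
```

With k actions, width C, and depth H, the tree draws on the order of (kC)^H samples, so even this toy call with k = 2, C = 25, H = 3 uses over 100,000 transitions.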
How to Look Deeper? [figure]
Policy Rollout [figure]
Policy Rollout in Equations
• Write $V_H^u(y)$ for the value of following policy u from state y over horizon H
• Recall: $Q(x,a) = R(x,a) + E[V_{H-1}^*(y)] = R(x,a) + E[\max_u V_{H-1}^u(y)]$
• Given a base policy u, use $R(x,a) + E[V_{H-1}^u(y)]$ as a lower-bound estimate of the Q-value.
• Resulting policy is PI(u), the one-step policy improvement of u, given infinite sampling
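The rollout estimate above can be sketched as a Monte Carlo average: take action a once, then simulate the base policy for the remaining H-1 steps. The two-state model, the function names, and the "always action 0" base policy are assumptions for illustration, with a total-reward objective.

```python
import random

ACTIONS = (0, 1)


def reward(x, a):
    return float(x)  # hypothetical: reward 1 for being in state 1


def sample_next(x, a, rng):
    return 1 if rng.random() < (0.9 if a == 1 else 0.1) else 0


def simulate(policy, x, steps, rng):
    """Total reward collected by following `policy` for `steps` steps from x."""
    total = 0.0
    for _ in range(steps):
        a = policy(x)
        total += reward(x, a)
        x = sample_next(x, a, rng)
    return total


def rollout_q(x, a, base_policy, horizon, width, rng):
    """Estimate R(x, a) + E[V_{H-1}^u(y)] from `width` simulated traces."""
    total = 0.0
    for _ in range(width):
        y = sample_next(x, a, rng)
        total += simulate(base_policy, y, horizon - 1, rng)
    return reward(x, a) + total / width


def rollout_action(x, base_policy, horizon, width, rng):
    return max(ACTIONS, key=lambda a: rollout_q(x, a, base_policy, horizon, width, rng))


def always_zero(x):
    return 0  # a deliberately poor base policy


rng = random.Random(2)
chosen = rollout_action(0, always_zero, 3, 200, rng)
```

Even with this poor base policy, the rollout policy picks action 1 at state 0, which illustrates the improvement over the base policy. The cost is linear in the horizon, unlike the exponential cost of the full sampled tree.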
Policy Rollout (Cont’d) [figure]
Parallel Policy Rollout
• Generalization of policy rollout, due to [Chang, Givan, and Chong, 2000]
• Given a set U of base policies, use $R(x,a) + E[\max_{u \in U} V_{H-1}^u(y)]$ as an estimate of the Q-value
  • More accurate estimate than policy rollout
  • Still gives a lower bound to true Q-value
  • Still gives a policy no worse than any in U
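A sketch of the parallel rollout estimate follows, evaluating the formula literally: for each sampled next state y, estimate the value of every base policy in U by simulation and keep the best. The inner averaging over `inner` traces, the two-state model, and all names are assumptions; the cited paper may organize the sampling differently.

```python
import random

ACTIONS = (0, 1)


def reward(x, a):
    return float(x)  # hypothetical: reward 1 for being in state 1


def sample_next(x, a, rng):
    return 1 if rng.random() < (0.9 if a == 1 else 0.1) else 0


def simulate(policy, x, steps, rng):
    """Total reward collected by following `policy` for `steps` steps from x."""
    total = 0.0
    for _ in range(steps):
        a = policy(x)
        total += reward(x, a)
        x = sample_next(x, a, rng)
    return total


def policy_value(policy, x, steps, inner, rng):
    """Monte Carlo estimate of V^u(x) from `inner` simulated traces."""
    return sum(simulate(policy, x, steps, rng) for _ in range(inner)) / inner


def parallel_rollout_q(x, a, policies, horizon, width, inner, rng):
    """Estimate R(x, a) + E[max_{u in U} V_{H-1}^u(y)]."""
    total = 0.0
    for _ in range(width):
        y = sample_next(x, a, rng)
        total += max(policy_value(u, y, horizon - 1, inner, rng) for u in policies)
    return reward(x, a) + total / width


policies = [lambda x: 0, lambda x: 1]  # U: two constant base policies
rng = random.Random(3)
q_par = {a: parallel_rollout_q(0, a, policies, 3, 50, 20, rng) for a in ACTIONS}
best_action = max(q_par, key=q_par.get)
```

With a single policy in `policies`, this reduces to ordinary policy rollout.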
Hindsight Optimization – Tree View [figure]
Hindsight Optimization – Equations
• Swap Max and Exp in expectimax tree.
• Solve each off-line optimization problem
  • $O(kC' \cdot f(H))$ time
  • where f(H) is the offline problem complexity
• Jensen’s inequality implies upper bounds
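Swapping the max and the expectation means: sample a complete trace of future randomness, solve the resulting deterministic problem for the best action sequence, and average those optima. Because E[max] is at least max E, the average is an upper bound on the true Q-value. The sketch below uses a hypothetical "guess the next bit" model where the gap is easy to see, and solves each offline problem by brute force; all names are illustrative.

```python
import itertools
import random

ACTIONS = (0, 1)


def reward(x, a):
    return float(x)  # hypothetical: reward 1 for being in state 1


def next_state(x, a, w):
    # Deterministic once the random bit w is known: a correct guess reaches state 1.
    return 1 if a == w else 0


def trace_return(x, actions, trace):
    """Total reward of an action sequence on one fixed trace of random bits."""
    total = 0.0
    for a, w in zip(actions, trace):
        total += reward(x, a)
        x = next_state(x, a, w)
    return total


def hindsight_q(x, a, horizon, width, rng):
    """Average, over sampled traces, of the best return with first action a."""
    total = 0.0
    for _ in range(width):
        trace = [rng.randint(0, 1) for _ in range(horizon)]
        # Offline problem: enumerate the remaining H-1 actions for this trace.
        total += max(
            trace_return(x, (a,) + rest, trace)
            for rest in itertools.product(ACTIONS, repeat=horizon - 1)
        )
    return total / width


rng = random.Random(4)
q_ho = hindsight_q(0, 0, 3, 200, rng)
```

In this model no policy can predict the bits, so the true Q-value at state 0 with H = 3 is 1.0, while the hindsight estimate is never below 1.0 and is close to 1.5. The brute-force inner loop costs k^(H-1) per trace; a problem-specific offline solver would replace it in practice.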
Hindsight Optimization (Cont’d) [figure]
Application to Example Problems
• Apply unbiased sampling, policy rollout, parallel rollout, and hindsight optimization to:
  • Multi-class deadline scheduling
  • Random early dropping
  • Congestion control
Basic Approach
• Traffic model provides a stochastic description of possible future outcomes
• Method
  • Formulate network decision problems as POMDPs by incorporating traffic model
  • Solve belief-state MDP online using sampling (choose time-scale to allow for computation time)
Domain 1: Deadline Scheduling
Objective: Minimize weighted loss
Domain 2: Random Early Dropping
Objective: Minimize delay without sacrificing throughput
Domain 3: Congestion Control [figure]
Traffic Modeling
• A Hidden Markov Model (HMM) for each source
• Note: state is hidden, model is partially observed
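A minimal sketch of an HMM traffic source follows: it can be sampled to generate future arrival traces, and a belief over its hidden state can be updated from observed arrivals. The two hidden states, the probabilities, and the function names are invented for illustration; the talk does not give the actual model parameters.

```python
import random

# Hypothetical two-state source: P(next hidden state | current hidden state).
TRANS = {
    "idle": {"idle": 0.9, "burst": 0.1},
    "burst": {"idle": 0.3, "burst": 0.7},
}
# P(number of arrivals in a slot | hidden state).
EMIT = {
    "idle": {0: 0.8, 1: 0.2},
    "burst": {0: 0.1, 1: 0.9},
}


def sample_trace(state, steps, rng):
    """Sample `steps` slots of arrivals, starting from hidden state `state`."""
    arrivals = []
    for _ in range(steps):
        nxt = TRANS[state]
        state = rng.choices(list(nxt), weights=list(nxt.values()), k=1)[0]
        emit = EMIT[state]
        arrivals.append(rng.choices(list(emit), weights=list(emit.values()), k=1)[0])
    return arrivals


def update_belief(belief, observed):
    """One filtering step: predict with TRANS, then condition on the observation."""
    predicted = {s: sum(belief[r] * TRANS[r][s] for r in belief) for s in TRANS}
    unnorm = {s: predicted[s] * EMIT[s][observed] for s in TRANS}
    z = sum(unnorm.values())
    return {s: v / z for s, v in unnorm.items()}


rng = random.Random(5)
trace = sample_trace("idle", 5, rng)
belief = update_belief({"idle": 1.0, "burst": 0.0}, observed=1)
```

Starting from certainty in "idle" and observing one arrival, the belief in "burst" rises to 1/3. The belief vector is the state of the belief-state MDP that the sampling methods operate on.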
Deadline Scheduling Results
Non-sampling policies:
• EDF: earliest deadline first.
  • Deadline sensitive, class insensitive.
• SP: static priority.
  • Deadline insensitive, class sensitive.
• CM: current minloss [Givan et al., 2000]
  • Deadline and class sensitive.
  • Minimizes weighted loss for the current packets.
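The difference between EDF and SP can be sketched on a fixed batch of packets. The example assumes one packet is served per time slot and no new packets arrive; the `Packet` type and `weighted_loss` helper are hypothetical, and CM is omitted because its algorithm is not described in the slides.

```python
from typing import Callable, List, NamedTuple


class Packet(NamedTuple):
    deadline: int   # last slot in which the packet can still be served
    weight: float   # loss cost of the packet's class


def edf(queue: List[Packet]) -> Packet:
    """Earliest deadline first: ignores the class weight."""
    return min(queue, key=lambda p: p.deadline)


def static_priority(queue: List[Packet]) -> Packet:
    """Static priority: serves the costliest class, ignores deadlines."""
    return max(queue, key=lambda p: p.weight)


def weighted_loss(packets: List[Packet], choose: Callable[[List[Packet]], Packet]) -> float:
    """Serve one packet per slot; return the total weight of expired packets."""
    queue, loss, slot = list(packets), 0.0, 0
    while queue:
        loss += sum(p.weight for p in queue if p.deadline < slot)
        queue = [p for p in queue if p.deadline >= slot]
        if queue:
            queue.remove(choose(queue))
        slot += 1
    return loss


batch = [Packet(0, 1.0), Packet(1, 5.0), Packet(1, 5.0)]
edf_loss = weighted_loss(batch, edf)
sp_loss = weighted_loss(batch, static_priority)
```

On this batch EDF loses a weight-5 packet (loss 5.0) while SP loses only the weight-1 packet (loss 1.0); other batches favor EDF, which is why a policy sensitive to both deadline and class is of interest.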
Deadline Scheduling Results
• Objective: minimize weighted loss
• Comparison:
  • Non-sampling policies
  • Unbiased sampling (Kearns et al.)
  • Hindsight optimization
  • Rollout with CM as base policy
  • Parallel rollout
• Results due to H. S. Chang
Deadline Scheduling Results [figures: three result slides]
Random Early Dropping Results
• Objective: minimize delay subject to throughput loss-tolerance
• Comparison:
  • Candidate policies: RED and “buffer-k”
  • KMN-sampling
  • Rollout of buffer-k
  • Parallel rollout
  • Hindsight optimization
• Results due to H. S. Chang.
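For readers unfamiliar with the RED baseline, the sketch below shows its core idea in simplified form: track an exponentially weighted average of the queue length and drop arrivals with a probability that rises linearly between two thresholds. It omits refinements of the published algorithm (such as spacing drops by a packet count), the parameter values are arbitrary, and "buffer-k" is not sketched because the slides do not define it.

```python
import random


class RedQueue:
    """Simplified random early dropping; parameter names are illustrative."""

    def __init__(self, min_th, max_th, max_p, weight, rng):
        self.min_th = min_th    # below this average, never drop
        self.max_th = max_th    # at or above this average, always drop
        self.max_p = max_p      # drop probability just below max_th
        self.weight = weight    # smoothing weight of the moving average
        self.rng = rng
        self.avg = 0.0

    def drop_probability(self):
        if self.avg < self.min_th:
            return 0.0
        if self.avg >= self.max_th:
            return 1.0
        return self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)

    def on_arrival(self, queue_len):
        """Update the average queue length; return True if the packet is dropped."""
        self.avg = (1.0 - self.weight) * self.avg + self.weight * queue_len
        return self.rng.random() < self.drop_probability()


red = RedQueue(min_th=5, max_th=15, max_p=0.1, weight=0.5, rng=random.Random(6))
dropped_when_short = red.on_arrival(queue_len=4)   # average becomes 2.0
```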
Random Early Dropping Results [figures: two result slides]
Congestion Control Results
• MDP Objective: minimize weighted sum of throughput, delay, and loss-rate
  • Fairness is hard-wired
• Comparisons:
  • PD-k (proportional-derivative with k target queue)
  • Hindsight optimization
  • Rollout of PD-k == parallel rollout
• Results due to G. Wu, in progress
Congestion Control Results [figures: four result slides]
Results Summary
• Unbiased sampling cannot cope
• Parallel rollout wins in 2 domains
  • Not always equal to simple rollout of one base policy
• Hindsight optimization wins in 1 domain
• Simple policy rollout – the cheapest method
  • Poor in domain 1
  • Strong in domain 2 with best base policy – but how to find this policy?
  • So-so in domain 3 with any base policy
Talk Summary
• Case study of MDP sampling methods
• New methods offering practical improvements
  • Parallel policy rollout
  • Hindsight optimization
• Systematic methods for using traffic models to help make network control decisions
  • Feasibility of real-time implementation depends on problem timescale
Ongoing Research
• Apply to other control problems (different timescales):
  • Admission/access control
  • QoS routing
  • Link bandwidth allotment
  • Multiclass connection management
  • Problems arising in proxy-services
  • Diagnosis and recovery
Ongoing Research (Cont’d)
• Alternative traffic models
  • Multi-timescale models
  • Long-range dependent models
  • Closed-loop traffic
  • Fluid models
• Learning traffic model online
• Adaptation to changing traffic conditions
Congestion Control (Cont’d) [figure]
Congestion Control Results [figure]
Hindsight Optimization (Cont’d) [figure]
Policy Rollout (Cont’d)
[Figure labels: Policy-performance, Base Policy]
Receding-horizon Control
• For large horizon H, policy is approximately stationary.
• At each time, if state is x, then apply action
  $u^*(x) = \arg\max_a Q(x,a) = \arg\max_a \{ R(x,a) + E[V_{H-1}^*(y)] \}$
• Compute estimate of Q-value at each time.
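The control loop described above can be sketched as follows: at every step, re-estimate the Q-values at the current state with a fixed lookahead horizon, apply the best action, and observe the next state. The function `receding_horizon` and the toy model are assumptions; `estimate_q` stands in for any estimator (sampled tree, rollout, parallel rollout, or hindsight optimization).

```python
import random


def receding_horizon(x0, steps, actions, estimate_q, sample_next, rng):
    """Run the controller for `steps` steps; return the visited (state, action) pairs."""
    trajectory, x = [], x0
    for _ in range(steps):
        # The horizon inside estimate_q stays fixed while time advances.
        a = max(actions, key=lambda act: estimate_q(x, act))
        trajectory.append((x, a))
        x = sample_next(x, a, rng)
    return trajectory


# Hypothetical two-state model: action a reaches state 1 with probability P_ONE[a],
# and the reward is 1 for being in state 1.
P_ONE = {0: 0.1, 1: 0.9}


def toy_next(x, a, rng):
    return 1 if rng.random() < P_ONE[a] else 0


def toy_q(x, a):
    # Exact two-step Q-value for the toy model: R(x, a) + E[V_1^*(y)] = x + P_ONE[a].
    return float(x) + P_ONE[a]


rng = random.Random(7)
traj = receding_horizon(0, 5, (0, 1), toy_q, toy_next, rng)
```

Because the estimator is called once per decision, its running time bounds the control timescale at which the approach is feasible.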
Congestion Control (Cont’d) [figure]
Domain 3: Congestion Control
[Diagram labels: High-priority Traffic, Best-effort Traffic, Bottleneck Node]
• Resources: Bandwidth and buffer
• Objective: optimize throughput, delay, loss, and fairness
• High-priority traffic:
  • Open-loop controlled
• Low-priority traffic:
  • Closed-loop controlled
Congestion Control Results [figures: three result slides]