Network Utility Maximization over Partially Observable Markov Channels
[Title-slide figure: three channels with unknown states ("Channel State i = ?"), framed as a restless multi-arm bandit.]
Chih-Ping Li, Michael J. Neely, University of Southern California
Information Theory and Applications Workshop, La Jolla, Feb. 2011
This work is from the following papers:*
• Li, Neely, WiOpt 2010
• Li, Neely, arXiv 2010, submitted for conference
• Neely, Asilomar 2010
Chih-Ping Li is graduating and is currently looking for post-doc positions!
*The above paper titles are given below, and are available at: http://www-bcf.usc.edu/~mjneely/
• C. Li and M. J. Neely, "Exploiting Channel Memory for Multi-User Wireless Scheduling without Channel Measurement: Capacity Regions and Algorithms," Proc. WiOpt 2010.
• C. Li and M. J. Neely, "Network Utility Maximization over Partially Observable Markovian Channels," arXiv:1008.3421, Aug. 2010.
• M. J. Neely, "Dynamic Optimization and Learning for Renewal Systems," Proc. Asilomar Conf. on Signals, Systems, and Computers, Nov. 2010.
Restless Multi-Arm Bandit with vector rewards
[Figure: N channels with unknown states Si(t) = ?; each channel i is an ON/OFF Markov chain with Pr[ON→OFF] = εi and Pr[OFF→ON] = δi.]
• N-user wireless system.
• Timeslots t in {0, 1, 2, …}.
• Choose one channel for transmission every slot t.
• Channels Si(t) ON/OFF Markov, current states Si(t) unknown.
Suppose we serve channel i on slot t:
• If Si(t) = ON → ACK → reward vector r(t) = (0, …, 0, 1, 0, …, 0) (a 1 in entry i).
• If Si(t) = OFF → NACK → reward vector r(t) = (0, …, 0, 0, 0, …, 0).
Let ωi(t) = Pr[Si(t) = ON] given the observation history.
If we serve channel i (the ACK/NACK reveals Si(t) exactly), we update:
ωi(t+1) = 1 − εi   if we get "ACK"
ωi(t+1) = δi       if we get "NACK"
If we do not serve channel i (no observation), we propagate the Markov chain:
ωi(t+1) = ωi(t)(1 − εi) + (1 − ωi(t))δi
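To make the belief dynamics concrete, here is a minimal Python sketch of both update rules (the function and variable names are my own, not from the papers):

```python
# Belief update for one ON/OFF Markov channel with
#   Pr[ON -> OFF] = eps  and  Pr[OFF -> ON] = delta.
# omega is the belief Pr[S_i(t) = ON] given the observation history.

def belief_after_serving(ack: bool, eps: float, delta: float) -> float:
    """Served channel i: the ACK/NACK reveals S_i(t) exactly,
    so the next belief is a one-step Markov transition."""
    return 1.0 - eps if ack else delta

def belief_unserved(omega: float, eps: float, delta: float) -> float:
    """Unserved channel i: no observation, so propagate the chain:
    omega(t+1) = omega(t)(1 - eps) + (1 - omega(t)) * delta."""
    return omega * (1.0 - eps) + (1.0 - omega) * delta
```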
We want to:
1) Characterize the capacity region Λ of the system:
Λ = { all stabilizable input rate vectors (λ1, …, λN) }
  = { all possible time-average reward vectors }
2) Perform concave utility maximization over Λ:
Maximize: g(r1, …, rN)
Subject to: (r1, …, rN) in Λ
What is known about such systems?
1) If (S1(t), …, SN(t)) is known every slot:
• Capacity region known [Tassiulas, Ephremides 1993].
• Greedy "Max-Weight" is optimal [Tassiulas, Ephremides 1993].
• The capacity region is the same, and Max-Weight works, for both i.i.d. vectors and time-correlated Markov vectors.
2) If (S1(t), …, SN(t)) is unknown but i.i.d. over slots:
• Capacity region is known.
• Greedy Max-Weight decisions are optimal.
[Gopalan, Caramanis, Shakkottai, Allerton 2007]
[Li, Neely CDC 2007, TMC 2010]
3) If (S1(t), …, SN(t)) is unknown and time-correlated:
• Capacity region is unknown.
• Seems to be an intractable multi-dimensional Markov Decision Problem (MDP): current decisions affect the future belief vector (ω1(t), …, ωN(t)).
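For reference, a minimal sketch of the greedy Max-Weight rule in cases 1 and 2 (names are my own; in case 1 the weights use the known states, in case 2 their i.i.d. ON probabilities):

```python
def max_weight_known(Q, S):
    """Case 1: states S[i] in {0, 1} known each slot.
    Serve the channel maximizing the weight Q[i] * S[i]."""
    return max(range(len(Q)), key=lambda i: Q[i] * S[i])

def max_weight_iid(Q, p_on):
    """Case 2: states unknown but i.i.d. over slots with
    Pr[S_i = ON] = p_on[i].  Serve argmax of Q[i] * p_on[i]."""
    return max(range(len(Q)), key=lambda i: Q[i] * p_on[i])
```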
Our Contributions:
1) We construct an operational capacity region (inner bound).
2) We construct a novel frame-based technique for utility maximization over this region.
Assume channels are positively correlated: εi + δi ≤ 1.
[Figure: the ON/OFF chain for channel i, and a sample path of the belief ωi(t) relaxing toward its steady state over time t.]
• After "ACK": ωi(t) > steady state Pr[Si(t) = ON] = δi/(δi+εi).
• After "NACK": ωi(t) < steady state Pr[Si(t) = ON] = δi/(δi+εi).
• Gives good intuition for scheduling decisions.
• For the special case of channel symmetry (εi = ε, δi = δ for all i), "round-robin" maximizes sum output rate. [Ahmad, Liu, Javidi, Zhao, Krishnamachari, Trans. IT 2009]
• How to use this intuition to construct a capacity region (for possibly asymmetric channels)?
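A quick numeric check of this intuition, with illustrative values not taken from the talk: let εi = 0.2 and δi = 0.3, so εi + δi = 0.5 ≤ 1. The steady state probability is δi/(δi+εi) = 0.3/0.5 = 0.6. After an ACK the belief jumps to 1 − εi = 0.8 > 0.6; after a NACK it drops to δi = 0.3 < 0.6; on subsequent unobserved slots it relaxes back toward 0.6.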
Inner Bound Λint ("Operational Capacity Region"):
Λint = convex hull of all randomized round-robin policies.
Every frame, randomly pick a subset of channels and an ordering according to some probability distribution over the ≈ N!·2^N (subset, ordering) choices, then serve that subset round-robin over a variable-length frame.
[Figure: N queues with input rates λ1, …, λN; a variable-length frame serving channels in the sampled order, e.g. 3, 1, 7, 4.]
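As a sketch of how one policy in this convex hull could be sampled (a simple illustrative distribution; the papers optimize over all distributions on ordered subsets):

```python
import random

def sample_ordered_subset(n: int) -> list[int]:
    """Draw one randomized round-robin policy for the next frame:
    include each channel independently with probability 1/2, then
    serve the chosen channels in a uniformly random order.  This is
    just one example distribution over the (subset, ordering) choices;
    any such distribution yields a point in the convex hull."""
    subset = [i for i in range(n) if random.random() < 0.5]
    random.shuffle(subset)
    return subset

# Example: the order in which to serve channels during the coming frame.
frame_order = sample_ordered_subset(8)  # e.g. [3, 1, 7, 4]
```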
Inner Bound Properties:
• The bound contains a huge number of policies.
• Touches the true capacity boundary as N → ∞.
• Even a good bound for N = 2.
• We can obtain efficient algorithms for optimizing over this region. Let's see how…
New Lyapunov Drift Analysis Technique:
[Figure: variable-length frame k spanning slots t[k] to t[k]+T[k], serving channels in the sampled order, e.g. 3, 1, 7, 4.]
• Lyapunov function: L(t) = ∑i Qi(t)²
• T-slot drift for frame k: Δ[k] = L(t[k] + T[k]) − L(t[k])
• New drift-plus-penalty ratio method: on each frame, minimize
  E{ Δ[k] + V × Penalty[k] | Q(t[k]) } / E{ T[k] | Q(t[k]) }
• Lineage of the method:
  – Tassiulas, Ephremides 1990, 1992, 1993 (queue stability)
  – Neely, Modiano 2003, 2005 (queue stability + utility optimization)
  – Li, Neely 2010 (queue stability + utility optimization for variable-length frames)
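Schematically, the per-frame rule could look like the following sketch (the policy interface `expected_drift`, `expected_penalty`, `expected_frame_length` is hypothetical; in the papers these conditional expectations come from the channel model):

```python
def choose_frame_policy(policies, Q, V):
    """Drift-plus-penalty ratio rule: among candidate round-robin
    policies, pick the one minimizing
        E{ Delta[k] + V * Penalty[k] | Q(t[k]) } / E{ T[k] | Q(t[k]) }.
    Each `policy` object is assumed to supply model-based estimates
    of these conditional expectations (hypothetical interface)."""
    def ratio(policy):
        numerator = policy.expected_drift(Q) + V * policy.expected_penalty(Q)
        denominator = policy.expected_frame_length(Q)  # E{T[k]} >= 1 slot
        return numerator / denominator
    return min(policies, key=ratio)
```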
Conclusions:
• Multi-armed bandit problem with reward vectors (a complex MDP).
• Operational capacity region = convex hull of frame-based randomized round-robin policies.
• Stochastic network optimization via the drift-plus-penalty ratio method.
Quick advertisement, new book:
• M. J. Neely, Stochastic Network Optimization with Application to Communication and Queueing Systems. Morgan & Claypool, 2010.
• PDF also available from the "Synthesis Lecture Series" (on digital library); link available on Mike Neely's homepage.
• Covers Lyapunov optimization theory (including renewal system problems), with detailed examples and problem set questions.