Online Fractional Programming for Markov Decision Systems
[Title-slide figure: an energy-aware, 4-state processor and a timeline of frames T[0], T[1], T[2]]
Michael J. Neely, University of Southern California
Proc. Allerton Conference on Communication, Control, and Computing, September 2011
General System Model
[Figure: sequence of system states over frames of lengths T[0], T[1], T[2], …]
• Frames r in {0, 1, 2, …}.
• k[r] = system state during frame r. k[r] in {1, …, K}.
• ω[r] = random observation at start of frame r. ω[r] in Ω.
• α[r] = control action on frame r. α[r] in A(k[r], ω[r]).
The control action affects the frame size, the penalty vector, and the next-state transition probabilities (a minimal code sketch of one frame follows this slide):
• T[r] = T(k[r], ω[r], α[r]).
• [y1[r], …, yL[r]] = y(k[r], ω[r], α[r]).
• [Pij(ω[r], α[r])] = P(ω[r], α[r]).
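As a concrete illustration, here is a minimal Python sketch of one frame of this model. All names (run_frame, observe, action_set, choose_policy, dynamics) are hypothetical placeholders, not from the paper; the point is only the order of events within a frame.

import random

def run_frame(k, observe, action_set, choose_policy, dynamics):
    """Simulate one frame that starts in state k; return (T, y, next_k)."""
    omega = observe()                                      # random observation omega[r]
    alpha = choose_policy(k, omega, action_set(k, omega))  # action alpha[r] in A(k, omega)
    # The action determines the frame size T[r], the penalty vector
    # [y1[r], ..., yL[r]], and the transition probabilities {j: Pkj(omega, alpha)}.
    T, y, P_row = dynamics(k, omega, alpha)
    next_states = list(P_row)
    probs = [P_row[j] for j in next_states]
    next_k = random.choices(next_states, weights=probs)[0]  # next state drawn from P
    return T, y, next_k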
Example 1: Discrete-Time MDP
Minimize:    E{y0}
Subject to:  E{y1} ≤ 0, …, E{yL} ≤ 0
[Figure: 4-state transition diagram]
• All frames have unit size: T[r] = 1 for all r.
• Control action α[r] affects the penalty vector and the transition probabilities.
• Additionally, we can treat problems with a random observation ω[r] at the start of frame r:
  • ω[r] is i.i.d. over frames r.
  • ω[r] in Ω (a set of arbitrary cardinality).
  • Pr[ω[r] = ω] (unknown probability distribution).
Example 2: Processor with Several Energy-Saving Modes
[Figure: Energy-Aware, 4-State Processor]
• Random job arrivals, L different classes.
• k[r] = processing mode (4 different modes).
• Action α[r]: choose which job to serve, and the next mode.
• k[r] and α[r] affect:
  • Processing time
  • Switching time
  • Energy expenditure
Relation between Averages
Define the frame average of y0[r] over the first R frames:
    (1/R) ∑_{r=0}^{R-1} y0[r]
The time average of y0[r] is then the total penalty divided by the total elapsed time, i.e., a ratio of frame averages:
    ( ∑_{r=0}^{R-1} y0[r] ) / ( ∑_{r=0}^{R-1} T[r] )  =  [ (1/R) ∑_{r=0}^{R-1} y0[r] ] / [ (1/R) ∑_{r=0}^{R-1} T[r] ]
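As a quick numerical illustration (hypothetical numbers, not from the paper): if the first three frames have sizes T = 2, 1, 3 and penalties y0 = 4, 1, 5, then the time average of y0 is (4 + 1 + 5)/(2 + 1 + 3) = 10/6 ≈ 1.67, which equals the ratio of frame averages (10/3)/(6/3). Note this differs from the plain frame average of y0, which is 10/3 ≈ 3.33.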
The General Problem
[Figure: sequence of system states over frames of lengths T[0], T[1], T[2], …]
Minimize:    the time average of y0[r] (a ratio of frame averages, as on the previous slide)
Subject to:  the time average of yl[r] ≤ 0 for each l in {1, …, L},
             with α[r] in A(k[r], ω[r]) and k[r] evolving according to the transition probabilities P(ω[r], α[r]).
Prior Methods for “typical” MDPs
• Offline linear programming methods (known probabilities).
• Q-learning, neuro-dynamic programming (unconstrained).
  • [Bertsekas, Tsitsiklis 1996]
• 2-timescale/fluid models for constrained MDPs.
  • [Borkar 2005] [Salodkar, Bhorkar, Karandikar, Borkar 2008]
  • [Djonin, Krishnamurthy 2007]
  • [Vazquez-Abad, Krishnamurthy 2003]
  • [Fu, van der Schaar 2010]
• The above works typically require:
  • Finite action space
  • No ω[r] process
  • Fixed slot lengths (they do not treat fractional problems).
The Linear Fractional Program
For this slide, assume there is no ω[r] process and that the sets A(k) are finite.
Variables: f(k, α) ≥ 0 for k in {1, …, K} and α in A(k), where f(k, α) is interpreted as the steady-state probability of being in state k[r] = k and using action α[r] = α.
The policy is then: whenever in state k, choose action α in A(k) with conditional probability f(k, α) / ∑_{α' in A(k)} f(k, α').
Note: See “Additional Slides 2” for the Linear Fractional Program with the ω[r] process.
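For reference, a sketch of the resulting linear fractional program in its standard form (the exact formulation is in the paper; the sign and normalization conventions below are assumptions):

Minimize:    ∑_{k,α} f(k,α) y0(k,α)  /  ∑_{k,α} f(k,α) T(k,α)
Subject to:  ∑_{k,α} f(k,α) yl(k,α) ≤ 0                      for l = 1, …, L
             ∑_{α in A(j)} f(j,α) = ∑_{k,α} f(k,α) Pkj(α)     for all j (global balance)
             ∑_{k,α} f(k,α) = 1,   f(k,α) ≥ 0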
Paper Outline
• The linear fractional program involves many variables and would require full knowledge of the probabilities p(ω).
• We develop:
  • Algorithm 1:
    • An online policy for solving the linear fractional program.
    • Allows infinite sets Ω and A(k, ω).
    • Does not require knowledge of p(ω).
    • Does not operate on the actual Markov dynamics.
    • Solves for the values (Pij*), {yl*(k)} associated with the optimal policy.
  • Algorithm 2:
    • Given target values (Pij*), {yl*(k)}, we develop an online system implementation (with the actual Markov state dynamics) that achieves them.
• We can also run these algorithms in parallel, continuously refining the target values for Algorithm 2 based on the running estimates from Algorithm 1.
ALG 1: Solving for Optimal Values
• Define a new stochastic system with the same ω[r] process.
• Define a decision variable k[r]:
  • k[r] is chosen in {1, …, K} every frame.
  • It does not evolve according to the MDP.
• Define a new penalty process qij[r]:
  • qij[r] = 1{k[r]=i} Pij(ω[r], α[r])
  • qij = fraction of time we transition from “state” i to “state” j (the time average of qij[r]).
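For intuition (a gloss on the definition above, not a statement from the slides): by the law of total expectation,

    E{ qij[r] } = Pr[k[r] = i] · E{ Pij(ω[r], α[r]) | k[r] = i },

which plays the same role as the quantity ∑_α f(i, α) Pij(α) in the linear fractional program from the earlier slide (there stated for the case with no ω[r] process).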
Treat as a Stochastic Network Optimization: the “Global Balance Equation”
For every state k, the time averages of the qij[r] processes must satisfy the global balance equation (transitions out of k balance transitions into k):
    time average of ∑_j qkj[r]  =  time average of ∑_i qik[r]
Treat as a Stochastic Network Optimization:
Use a “virtual queue” for each constraint k:
[Figure: virtual queue Hk[r] with arrival process ∑_j qkj[r] and service process ∑_i qik[r]]
(A sketch of the corresponding queue update follows this slide.)
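One natural form for this virtual queue update (an assumption; the paper's exact update, including whether the queue is one-sided or signed and which process plays the role of arrivals, may differ):

    Hk[r+1] = max{ Hk[r] + ∑_j qkj[r] − ∑_i qik[r],  0 }

If Hk[r] is rate stable for every k, then the time average of ∑_j qkj[r] is at most the time average of ∑_i qik[r]. Since summing either side over all k gives the same total (both equal the sum of all time averages of qij[r]), these K inequalities force equality for every k, i.e., the global balance equations.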
Lyapunov Optimization: Drift-Plus-Penalty Ratio
• L[r] = sum of squares of all virtual queues for frame r.
• Δ[r] = L[r+1] − L[r].
• Every frame, observe ω[r] and queues[r]. Then choose k[r] in {1, …, K} and α[r] in A(k[r], ω[r]) to greedily minimize a bound on the drift-plus-penalty ratio:
    E{Δ[r] + V y0[r] | queues[r]} / E{T[r] | queues[r]}
• This can be done “greedily” via a generalization of the max-weight rule.
Alg 1: Special Case of No ω[r] Process
• The drift-plus-penalty rule:
  • Every frame r, observe queues[r] = {Zl[r], Hk[r]}.
  • Then choose k[r] in {1, …, K} and α[r] in A(k[r]) to greedily minimize the resulting drift-plus-penalty ratio expression (a sketch of one such rule is given below).
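A minimal Python sketch of how such a greedy rule can look in this special case (finite sets A(k), no ω[r] process). The specific ratio expression below is an assumption based on the virtual-queue definitions sketched earlier, not the paper's exact formula; Z, H, actions, y, T, and P are hypothetical inputs supplied by the user.

def greedy_frame_choice(Z, H, V, K, actions, y, T, P):
    """Pick (k, alpha) minimizing a drift-plus-penalty ratio bound.

    Z[l-1]         : virtual queue for penalty constraint l (l = 1, ..., L)
    H[k-1]         : virtual queue for the global-balance constraint of state k
    V              : tradeoff parameter (larger V -> closer to the optimal ratio)
    actions(k)     : finite action set A(k)
    y(l, k, alpha) : penalty yl for (k, alpha); l = 0 gives the objective penalty y0
    T(k, alpha)    : frame size for (k, alpha), assumed > 0
    P(k, j, alpha) : transition probability Pkj(alpha)
    """
    L = len(Z)
    best, best_val = None, float("inf")
    for k in range(1, K + 1):
        for alpha in actions(k):
            # Drift contribution of the penalty-constraint queues Zl.
            penalty_terms = sum(Z[l - 1] * y(l, k, alpha) for l in range(1, L + 1))
            # Drift contribution of the global-balance queues Hk:
            # choosing "state" k moves weight Pkj(alpha) from queue k toward queue j.
            balance_terms = sum((H[k - 1] - H[j - 1]) * P(k, j, alpha)
                                for j in range(1, K + 1))
            # Drift-plus-penalty ratio bound for this (k, alpha).
            val = (V * y(0, k, alpha) + penalty_terms + balance_terms) / T(k, alpha)
            if val < best_val:
                best, best_val = (k, alpha), val
    return best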
Alg 1 Performance Theorem
Theorem: If the problem is feasible, then:
(a) All virtual queues are rate stable.
(b) The averages of our variables satisfy all desired constraints.
(c) We find a value within O(1/V) of the optimal objective ratio* = y0*/T*.
(d) Convergence time = O(V³).
(e) We get an efficient set of parameters to plug into ALG 2: (Pij*), {yl*(k)}, {T*(k)}.
ALG 2 does not need the (huge number of) individual probabilities for each ω[r] and each control option α[r].
Alg 2: Targeting the MDP
• Given targets: (Pij*), {yl*(k)}, {T*(k)}.
• Let k[r] = the actual Markov state.
• Define qij[r] as before.
• 1i = time average of the indicator function 1{k[r]=i}.
• MDP structure: k[r] now evolves according to the actual transition probabilities (it is no longer a free decision variable), so this is not a standard stochastic network optimization problem.
Lyapunov Optimization Solution:
• While not a standard problem, we can still solve it greedily.
[Figure: virtual queue Hij[r] driven by the processes qij[r] and 1{k[r]=i} Pij*; the actual state k[r] is an uncontrollable variable with “memory”.]
(A sketch of the target constraint and virtual queue update follows this slide.)
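A sketch of the idea (an assumed form consistent with the queue figure above; the paper gives the precise constraints and updates): the targets are encoded as time-average constraints requiring

    time average of qij[r]   to match   (time average of 1{k[r]=i}) · Pij*    for all i, j,

and each such constraint gets its own virtual queue, for example

    Hij[r+1] = max{ Hij[r] + qij[r] − 1{k[r]=i} Pij*,  0 }.

Rate stability of Hij[r] then enforces the corresponding time-average inequality, even though both the arrival and service processes depend on the uncontrollable state k[r].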
Greedy Algorithm Idea and Theorem
• L[r] = sum of squares of all virtual queues for frame r.
• Δ[r] = L[r+1] − L[r].
• Every frame, observe ω[r], queues[r], and the actual state k[r].
• Then take an action α[r] in A(k[r], ω[r]) to greedily minimize a bound on:
    Δ[r] + V y0[r]
Theorem: If the targets are feasible, then this satisfies all constraints and gives a performance objective within O(1/V) of optimality.
Conclusions
[Figure: the energy-aware, 4-state processor and frame timeline from the title slide]
• General MDP with variable-length frames.
• The ω[r] process can have an infinite number of outcomes.
• The control space can be infinite.
• Combines Lyapunov optimization and MDP theory for “max-weight” rules.
• Online algorithms for linear fractional programming.
Additional Slides -- 1
• Example “degenerate” MDP:
  Minimize:    E{y0}
  Subject to:  E{y1} ≤ 1/2
[Figure: a small MDP whose states have penalties (y0=0, y1=0), (y0=0, y1=1), and (y0=1, y1=0), with transition probabilities p1, p2 and some probability-1 transitions]
• Can solve in an expected sense: optimal E{y0} = ½, and the optimal fraction of time in the lower-left state = fraction in the lower-right state = ½.
• But it is impossible for pure time averages to achieve the constraints.
• Our ALG 1 would find the optimal solution in the expected sense.
Additional Slides -- 2
The linear fractional program with the ω[r] process, but with finite sets Ω and A(k[r], ω[r]).
Let y(k, ω, α) represent the probability of being in state k[r] = k, seeing ω[r] = ω, and using α[r] = α.
The constraints include: the global balance equation (GBE), normalization, and independence of k[r] and ω[r]. (A sketch of these constraints is given after this slide.)
Note: An early draft (on my webpage for 1 week) omitted the normalization and independence constraints for this linear fractional program example. It also used more complex notation: y(k, ω, α) = p(ω) f(k, ω, α). The online solution given in the paper (for the more general problem) enforces these constraints in a different (online) way.
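A sketch of this program in terms of y(k, ω, α) (an assumed form consistent with the definitions above; the paper gives the exact formulation):

Minimize:        ∑_{k,ω,α} y(k,ω,α) y0(k,ω,α)  /  ∑_{k,ω,α} y(k,ω,α) T(k,ω,α)
Subject to:      ∑_{k,ω,α} y(k,ω,α) yl(k,ω,α) ≤ 0                          for l = 1, …, L
(GBE)            ∑_{ω,α} y(j,ω,α) = ∑_{k,ω,α} y(k,ω,α) Pkj(ω,α)             for all j
(Normalization)  ∑_{k,ω,α} y(k,ω,α) = 1,   y(k,ω,α) ≥ 0
(Independence)   ∑_{α} y(k,ω,α) = p(ω) ∑_{ω',α'} y(k,ω',α')                 for all k, ω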