Online Fractional Programming for Markov Decision Systems
[Title-slide figure: an energy-aware, 4-state processor and a timeline of frames T[0], T[1], T[2]]
Michael J. Neely, University of Southern California
Proc. Allerton Conference on Communication, Control, and Computing, September 2011
General System Model
[Figure: sequence of system states over frames of lengths T[0], T[1], T[2], …]
• Frames r in {0, 1, 2, …}.
• k[r] = system state during frame r. k[r] in {1, …, K}.
• ω[r] = random observation at start of frame r. ω[r] in Ω.
• α[r] = control action on frame r. α[r] in A(k[r], ω[r]).
The control action affects the frame size, the penalty vector, and the next-state transition probabilities (a minimal code sketch of one frame follows this slide):
• T[r] = T(k[r], ω[r], α[r]).
• [y1[r], …, yL[r]] = y(k[r], ω[r], α[r]).
• [Pij(ω[r], α[r])] = P(ω[r], α[r]).
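As a concrete illustration, here is a minimal Python sketch of one frame of this model. All names (run_frame, observe, action_set, choose_policy, dynamics) are hypothetical placeholders, not from the paper; the point is only the order of events within a frame.

import random

def run_frame(k, observe, action_set, choose_policy, dynamics):
    """Simulate one frame that starts in state k; return (T, y, next_k)."""
    omega = observe()                                      # random observation omega[r]
    alpha = choose_policy(k, omega, action_set(k, omega))  # action alpha[r] in A(k, omega)
    # The action determines the frame size T[r], the penalty vector
    # [y1[r], ..., yL[r]], and the transition probabilities {j: Pkj(omega, alpha)}.
    T, y, P_row = dynamics(k, omega, alpha)
    next_states = list(P_row)
    probs = [P_row[j] for j in next_states]
    next_k = random.choices(next_states, weights=probs)[0]  # next state drawn from P
    return T, y, next_k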
Example 1: Discrete-Time MDP
Minimize:    E{y0}
Subject to:  E{y1} ≤ 0, …, E{yL} ≤ 0
[Figure: 4-state transition diagram]
• All frames have unit size: T[r] = 1 for all r.
• Control action α[r] affects the penalty vector and the transition probabilities.
• Additionally, we can treat problems with a random observation ω[r] at the start of frame r:
  • ω[r] is i.i.d. over frames r.
  • ω[r] in Ω (a set of arbitrary cardinality).
  • Pr[ω[r] = ω] (unknown probability distribution).
Example 2: Processor with Several Energy-Saving Modes
[Figure: Energy-Aware, 4-State Processor]
• Random job arrivals, L different classes.
• k[r] = processing mode (4 different modes).
• Action α[r]: choose which job to serve, and the next mode.
• k[r] and α[r] affect:
  • Processing time
  • Switching time
  • Energy expenditure
Relation between Averages
Define the frame average of y0[r] over the first R frames:
    (1/R) ∑_{r=0}^{R-1} y0[r]
The time average of y0[r] is then the total penalty divided by the total elapsed time, i.e., a ratio of frame averages:
    ( ∑_{r=0}^{R-1} y0[r] ) / ( ∑_{r=0}^{R-1} T[r] )  =  [ (1/R) ∑_{r=0}^{R-1} y0[r] ] / [ (1/R) ∑_{r=0}^{R-1} T[r] ]
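As a quick numerical illustration (hypothetical numbers, not from the paper): if the first three frames have sizes T = 2, 1, 3 and penalties y0 = 4, 1, 5, then the time average of y0 is (4 + 1 + 5)/(2 + 1 + 3) = 10/6 ≈ 1.67, which equals the ratio of frame averages (10/3)/(6/3). Note this differs from the plain frame average of y0, which is 10/3 ≈ 3.33.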
The General Problem
[Figure: sequence of system states over frames of lengths T[0], T[1], T[2], …]
Minimize:    the time average of y0[r] (a ratio of frame averages, as on the previous slide)
Subject to:  the time average of yl[r] ≤ 0 for each l in {1, …, L},
             with α[r] in A(k[r], ω[r]) and k[r] evolving according to the transition probabilities P(ω[r], α[r]).
Prior Methods for “typical” MDPs
• Offline linear programming methods (known probabilities).
• Q-learning, neuro-dynamic programming (unconstrained).
  • [Bertsekas, Tsitsiklis 1996]
• 2-timescale/fluid models for constrained MDPs.
  • [Borkar 2005] [Salodkar, Bhorkar, Karandikar, Borkar 2008]
  • [Djonin, Krishnamurthy 2007]
  • [Vazquez-Abad, Krishnamurthy 2003]
  • [Fu, van der Schaar 2010]
• The above works typically require:
  • Finite action space
  • No ω[r] process
  • Fixed slot lengths (they do not treat fractional problems).
The Linear Fractional Program
For this slide, assume there is no ω[r] process and that the sets A(k) are finite.
Variables: f(k, α) ≥ 0 for k in {1, …, K} and α in A(k), where f(k, α) is interpreted as the steady-state probability of being in state k[r] = k and using action α[r] = α.
The policy is then: whenever in state k, choose action α in A(k) with conditional probability f(k, α) / ∑_{α' in A(k)} f(k, α').
Note: See “Additional Slides 2” for the Linear Fractional Program with the ω[r] process.
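For reference, a sketch of the resulting linear fractional program in its standard form (the exact formulation is in the paper; the sign and normalization conventions below are assumptions):

Minimize:    ∑_{k,α} f(k,α) y0(k,α)  /  ∑_{k,α} f(k,α) T(k,α)
Subject to:  ∑_{k,α} f(k,α) yl(k,α) ≤ 0                      for l = 1, …, L
             ∑_{α in A(j)} f(j,α) = ∑_{k,α} f(k,α) Pkj(α)     for all j (global balance)
             ∑_{k,α} f(k,α) = 1,   f(k,α) ≥ 0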
Paper Outline
• The linear fractional program involves many variables and would require full knowledge of the probabilities p(ω).
• We develop:
  • Algorithm 1:
    • An online policy for solving the linear fractional program.
    • Allows infinite sets Ω and A(k, ω).
    • Does not require knowledge of p(ω).
    • Does not operate on the actual Markov dynamics.
    • Solves for the values (Pij*), {yl*(k)} associated with the optimal policy.
  • Algorithm 2:
    • Given target values (Pij*), {yl*(k)}, we develop an online system implementation (with the actual Markov state dynamics) that achieves them.
• We can also run these algorithms in parallel, continuously refining the target values for Algorithm 2 based on the running estimates from Algorithm 1.
ALG 1: Solving for Optimal Values
• Define a new stochastic system with the same ω[r] process.
• Define a decision variable k[r]:
  • k[r] is chosen in {1, …, K} every frame.
  • It does not evolve according to the MDP.
• Define a new penalty process qij[r]:
  • qij[r] = 1{k[r]=i} Pij(ω[r], α[r])
  • qij = fraction of time we transition from “state” i to “state” j (the time average of qij[r]).
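For intuition (a gloss on the definition above, not a statement from the slides): by the law of total expectation,

    E{ qij[r] } = Pr[k[r] = i] · E{ Pij(ω[r], α[r]) | k[r] = i },

which plays the same role as the quantity ∑_α f(i, α) Pij(α) in the linear fractional program from the earlier slide (there stated for the case with no ω[r] process).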
Treat as a Stochastic Network Optimization: the “Global Balance Equation”
For every state k, the time averages of the qij[r] processes must satisfy the global balance equation (transitions out of k balance transitions into k):
    time average of ∑_j qkj[r]  =  time average of ∑_i qik[r]
Treat as a Stochastic Network Optimization:
Use a “virtual queue” for each constraint k:
[Figure: virtual queue Hk[r] with arrival process ∑_j qkj[r] and service process ∑_i qik[r]]
(A sketch of the corresponding queue update follows this slide.)
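One natural form for this virtual queue update (an assumption; the paper's exact update, including whether the queue is one-sided or signed and which process plays the role of arrivals, may differ):

    Hk[r+1] = max{ Hk[r] + ∑_j qkj[r] − ∑_i qik[r],  0 }

If Hk[r] is rate stable for every k, then the time average of ∑_j qkj[r] is at most the time average of ∑_i qik[r]. Since summing either side over all k gives the same total (both equal the sum of all time averages of qij[r]), these K inequalities force equality for every k, i.e., the global balance equations.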
Lyapunov Optimization: Drift-Plus-Penalty Ratio
• L[r] = sum of squares of all virtual queues for frame r.
• Δ[r] = L[r+1] − L[r].
• Every frame, observe ω[r] and queues[r]. Then choose k[r] in {1, …, K} and α[r] in A(k[r], ω[r]) to greedily minimize a bound on the drift-plus-penalty ratio:
    E{Δ[r] + V y0[r] | queues[r]} / E{T[r] | queues[r]}
• This can be done “greedily” via a generalization of the max-weight rule.
Alg 1: Special Case of No ω[r] Process
• The drift-plus-penalty rule:
  • Every frame r, observe queues[r] = {Zl[r], Hk[r]}.
  • Then choose k[r] in {1, …, K} and α[r] in A(k[r]) to greedily minimize the resulting drift-plus-penalty ratio expression (a sketch of one such rule is given below).
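A minimal Python sketch of how such a greedy rule can look in this special case (finite sets A(k), no ω[r] process). The specific ratio expression below is an assumption based on the virtual-queue definitions sketched earlier, not the paper's exact formula; Z, H, actions, y, T, and P are hypothetical inputs supplied by the user.

def greedy_frame_choice(Z, H, V, K, actions, y, T, P):
    """Pick (k, alpha) minimizing a drift-plus-penalty ratio bound.

    Z[l-1]         : virtual queue for penalty constraint l (l = 1, ..., L)
    H[k-1]         : virtual queue for the global-balance constraint of state k
    V              : tradeoff parameter (larger V -> closer to the optimal ratio)
    actions(k)     : finite action set A(k)
    y(l, k, alpha) : penalty yl for (k, alpha); l = 0 gives the objective penalty y0
    T(k, alpha)    : frame size for (k, alpha), assumed > 0
    P(k, j, alpha) : transition probability Pkj(alpha)
    """
    L = len(Z)
    best, best_val = None, float("inf")
    for k in range(1, K + 1):
        for alpha in actions(k):
            # Drift contribution of the penalty-constraint queues Zl.
            penalty_terms = sum(Z[l - 1] * y(l, k, alpha) for l in range(1, L + 1))
            # Drift contribution of the global-balance queues Hk:
            # choosing "state" k moves weight Pkj(alpha) from queue k toward queue j.
            balance_terms = sum((H[k - 1] - H[j - 1]) * P(k, j, alpha)
                                for j in range(1, K + 1))
            # Drift-plus-penalty ratio bound for this (k, alpha).
            val = (V * y(0, k, alpha) + penalty_terms + balance_terms) / T(k, alpha)
            if val < best_val:
                best, best_val = (k, alpha), val
    return best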
Alg 1 Performance Theorem
Theorem: If the problem is feasible, then:
(a) All virtual queues are rate stable.
(b) The averages of our variables satisfy all desired constraints.
(c) We find a value within O(1/V) of the optimal objective ratio* = y0*/T*.
(d) Convergence time = O(V³).
(e) We get an efficient set of parameters to plug into ALG 2: (Pij*), {yl*(k)}, {T*(k)}.
ALG 2 does not need the (huge number of) individual probabilities for each ω[r] and each control option α[r].
Alg 2: Targeting the MDP
• Given targets: (Pij*), {yl*(k)}, {T*(k)}.
• Let k[r] = the actual Markov state.
• Define qij[r] as before.
• 1i = time average of the indicator function 1{k[r]=i}.
• MDP structure: k[r] now evolves according to the actual transition probabilities (it is no longer a free decision variable), so this is not a standard stochastic network optimization problem.
Lyapunov Optimization Solution:
• While not a standard problem, we can still solve it greedily.
[Figure: virtual queue Hij[r] driven by the processes qij[r] and 1{k[r]=i} Pij*; the actual state k[r] is an uncontrollable variable with “memory”.]
(A sketch of the target constraint and virtual queue update follows this slide.)
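A sketch of the idea (an assumed form consistent with the queue figure above; the paper gives the precise constraints and updates): the targets are encoded as time-average constraints requiring

    time average of qij[r]   to match   (time average of 1{k[r]=i}) · Pij*    for all i, j,

and each such constraint gets its own virtual queue, for example

    Hij[r+1] = max{ Hij[r] + qij[r] − 1{k[r]=i} Pij*,  0 }.

Rate stability of Hij[r] then enforces the corresponding time-average inequality, even though both the arrival and service processes depend on the uncontrollable state k[r].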
Greedy Algorithm Idea and Theorem
• L[r] = sum of squares of all virtual queues for frame r.
• Δ[r] = L[r+1] − L[r].
• Every frame, observe ω[r], queues[r], and the actual state k[r].
• Then take an action α[r] in A(k[r], ω[r]) to greedily minimize a bound on:
    Δ[r] + V y0[r]
Theorem: If the targets are feasible, then this satisfies all constraints and gives a performance objective within O(1/V) of optimality.
Conclusions
[Figure: the energy-aware, 4-state processor and frame timeline from the title slide]
• General MDP with variable-length frames.
• The ω[r] process can have an infinite number of outcomes.
• The control space can be infinite.
• Combines Lyapunov optimization and MDP theory for “max-weight” rules.
• Online algorithms for linear fractional programming.
Additional Slides -- 1
• Example “degenerate” MDP:
  Minimize:    E{y0}
  Subject to:  E{y1} ≤ 1/2
[Figure: a small MDP whose states have penalties (y0=0, y1=0), (y0=0, y1=1), and (y0=1, y1=0), with transition probabilities p1, p2 and some probability-1 transitions]
• Can solve in an expected sense: optimal E{y0} = ½, and the optimal fraction of time in the lower-left state = fraction in the lower-right state = ½.
• But it is impossible for pure time averages to achieve the constraints.
• Our ALG 1 would find the optimal solution in the expected sense.
Additional Slides -- 2
The linear fractional program with the ω[r] process, but with finite sets Ω and A(k[r], ω[r]).
Let y(k, ω, α) represent the probability of being in state k[r] = k, seeing ω[r] = ω, and using α[r] = α.
The constraints include: the global balance equation (GBE), normalization, and independence of k[r] and ω[r]. (A sketch of these constraints is given after this slide.)
Note: An early draft (on my webpage for 1 week) omitted the normalization and independence constraints for this linear fractional program example. It also used more complex notation: y(k, ω, α) = p(ω) f(k, ω, α). The online solution given in the paper (for the more general problem) enforces these constraints in a different (online) way.
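A sketch of this program in terms of y(k, ω, α) (an assumed form consistent with the definitions above; the paper gives the exact formulation):

Minimize:        ∑_{k,ω,α} y(k,ω,α) y0(k,ω,α)  /  ∑_{k,ω,α} y(k,ω,α) T(k,ω,α)
Subject to:      ∑_{k,ω,α} y(k,ω,α) yl(k,ω,α) ≤ 0                          for l = 1, …, L
(GBE)            ∑_{ω,α} y(j,ω,α) = ∑_{k,ω,α} y(k,ω,α) Pkj(ω,α)             for all j
(Normalization)  ∑_{k,ω,α} y(k,ω,α) = 1,   y(k,ω,α) ≥ 0
(Independence)   ∑_{α} y(k,ω,α) = p(ω) ∑_{ω',α'} y(k,ω',α')                 for all k, ω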