OR II GSLM 52800
Policy and Action
• policy
  • the rules to specify what to do for all states
• action
  • what to do at a state as dictated by the policy
• examples
  • policy: replacement only at state 3
    • do nothing at states 0, 1, and 2, replace at state 3
  • policy: overhaul at state 2 and replacement at state 3
    • do nothing at states 0 and 1, overhaul at state 2, and replace at state 3
Expected Cost
• $p_{ij}(k)$ = the probability of changing from state i to state j when action k is taken
• $q_{ij}(k)$ = the expected cost at state i when action k is taken and the state changes to j
• $C_{ik}$ = the expected cost at state i with action k, i.e. $q_{ij}(k)$ averaged over the next state j:
  $C_{ik} = \sum_{j=0}^{M} q_{ij}(k)\, p_{ij}(k)$
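The definition of the expected cost can be sketched in a few lines of Python. The function name `expected_cost` and all the numbers are hypothetical and chosen only for illustration; they are not data from the course example.

```python
from fractions import Fraction as F

def expected_cost(q_row, p_row):
    """Return C_ik = sum over j of q_ij(k) * p_ij(k) for one state i and one action k."""
    if sum(p_row) != 1:
        raise ValueError("transition probabilities must sum to 1")
    return sum(q * p for q, p in zip(q_row, p_row))

# Hypothetical row for one state/action pair (illustrative numbers only).
p_row = [F(0), F(3, 4), F(1, 8), F(1, 8)]   # p_ij(k) for j = 0..3
q_row = [0, 800, 1600, 1600]                # q_ij(k) for j = 0..3

C_ik = expected_cost(q_row, p_row)          # 600 + 200 + 200
print(C_ik)
```

Exact fractions are used here so that the probability check is not affected by floating-point rounding.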
Definition of Variables
• policy R
• $g(R)$ = the long-term average cost per unit time of policy R
• objective: finding the policy that minimizes $g$
• $v_i(R)$ = the effect on the total expected cost when adopting policy R and starting at state i
Relationship Between $g(R)$ and $v_i(R)$
• claim: the intuitive idea is exact
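Key Result in Policy Improvement
• for a policy R that takes action k at state i:
  $g(R) + v_i(R) = C_{ik} + \sum_{j=0}^{M} p_{ij}(k)\, v_j(R), \quad i = 0, 1, \ldots, M$
• M+1 equations, M+2 unknowns
• $g(R)$ = the long-term average cost of policy R
• $v_i(R)$ = the effect on the total expected cost when adopting policy R and starting at state i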
Idea of Policy Improvement
• the equations remain satisfied when the same constant c is added to every $v_i(R)$
  • if $v_i(R)$ is a solution, so is $v_i(R) + c$
• the set of equations can therefore be solved by arbitrarily setting $v_M(R) = 0$
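The value-determination step with $v_M(R) = 0$ can be sketched as a small exact linear solve. The helper names (`solve_linear`, `value_determination`, `satisfies`) are hypothetical, and the transition probabilities are an assumption: they follow the machine-maintenance data commonly used with this example and are not given in the text above. The costs 0, 1000, 3000, 6000 are those of the policy "replace only at state 3".

```python
from fractions import Fraction as F

def solve_linear(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (assumes A is nonsingular)."""
    n = len(A)
    rows = [[F(x) for x in row] + [F(y)] for row, y in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pv = rows[col][col]
        rows[col] = [x / pv for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [row[n] for row in rows]

def value_determination(P, C):
    """Return (g, v) for a fixed policy, with the normalization v[M] = 0."""
    n = len(C)                      # n = M + 1 states
    A = []
    for i in range(n):
        row = [F(1)]                # coefficient of g
        for j in range(n - 1):      # coefficients of v_0 .. v_{M-1}
            row.append((1 if i == j else 0) - P[i][j])
        A.append(row)
    x = solve_linear(A, C)
    return x[0], x[1:] + [F(0)]

def satisfies(P, C, g, v):
    """Check g + v_i = C_i + sum_j p_ij v_j for every state i."""
    return all(g + v[i] == C[i] + sum(p * vj for p, vj in zip(P[i], v))
               for i in range(len(C)))

# Assumed data: do nothing at states 0, 1, 2; replace at state 3.
P = [[F(0), F(7, 8), F(1, 16), F(1, 16)],
     [F(0), F(3, 4), F(1, 8), F(1, 8)],
     [F(0), F(0), F(1, 2), F(1, 2)],
     [F(1), F(0), F(0), F(0)]]
C = [0, 1000, 3000, 6000]

g, v = value_determination(P, C)
shifted = [vi + 500 for vi in v]    # add the same constant to every v_i
print(g, v, satisfies(P, C, g, shifted))
```

Under these assumed probabilities the shifted values still satisfy all M+1 equations, which is why one $v_i(R)$ can be fixed arbitrarily.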
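Idea of Policy Improvement
• given policy R with action k, suppose that there exists policy $R_o$ with action $k_o$ such that, for every state i,
  $C_{ik_o} + \sum_{j=0}^{M} p_{ij}(k_o)\, v_j(R) \le C_{ik} + \sum_{j=0}^{M} p_{ij}(k)\, v_j(R)$,
  with strict inequality for at least one state
• then it can be shown that $g(R_o) < g(R)$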
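Policy Improvement
• 1 Value Determination: Fix policy R. Set $v_M(R)$ to 0 and solve
  $g(R) + v_i(R) = C_{ik} + \sum_{j=0}^{M} p_{ij}(k)\, v_j(R), \quad i = 0, 1, \ldots, M$
• 2 Policy Improvement: For each state i, find action k as the argument minimum of
  $C_{ik} + \sum_{j=0}^{M} p_{ij}(k)\, v_j(R) - v_i(R)$
• 3 Form a new policy from the actions found in step 2. Stop if this policy is the same as R; else go to step 1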
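The three steps can be sketched as a loop in Python. This is a minimal illustration, not the course's implementation: the names `policy_iteration`, `solve_linear`, and `model` are hypothetical, and the transition probabilities are an assumption taken from the machine-maintenance data commonly used with this example (decision 1 = do nothing, 2 = overhaul, 3 = replace). The costs are those listed in the example.

```python
from fractions import Fraction as F

def solve_linear(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (assumes A is nonsingular)."""
    n = len(A)
    rows = [[F(x) for x in row] + [F(y)] for row, y in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pv = rows[col][col]
        rows[col] = [x / pv for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [row[n] for row in rows]

def policy_iteration(model, policy):
    """model[i][k] = (C_ik, [p_i0(k), ..., p_iM(k)]); policy[i] = action at state i."""
    n = len(model)
    history = []
    while True:
        # Step 1, value determination: unknowns are g, v_0, ..., v_{M-1}; v_M = 0.
        A = [[F(1)] + [(1 if i == j else 0) - model[i][policy[i]][1][j]
                       for j in range(n - 1)] for i in range(n)]
        b = [model[i][policy[i]][0] for i in range(n)]
        x = solve_linear(A, b)
        g, v = x[0], x[1:] + [F(0)]
        history.append((list(policy), g))

        # Step 2, policy improvement: minimize C_ik + sum_j p_ij(k) v_j - v_i.
        new_policy = []
        for i in range(n):
            def score(k):
                cost, p = model[i][k]
                return cost + sum(pj * vj for pj, vj in zip(p, v)) - v[i]
            # On ties, prefer the current action so the loop cannot cycle.
            new_policy.append(min(model[i], key=lambda k: (score(k), k != policy[i])))

        # Step 3: stop when the policy no longer changes.
        if new_policy == policy:
            return policy, g, v, history
        policy = new_policy

# Assumed transition data; costs as in the example.
to_state_0 = [F(1), F(0), F(0), F(0)]
model = [
    {1: (0, [F(0), F(7, 8), F(1, 16), F(1, 16)])},
    {1: (1000, [F(0), F(3, 4), F(1, 8), F(1, 8)]), 3: (6000, to_state_0)},
    {1: (3000, [F(0), F(0), F(1, 2), F(1, 2)]),
     2: (4000, [F(0), F(1), F(0), F(0)]), 3: (6000, to_state_0)},
    {3: (6000, to_state_0)},
]

final_policy, g, v, history = policy_iteration(model, [1, 1, 1, 3])
print(final_policy, g, v)
```

With this assumed data the loop starts from "replace only at state 3" and stops after two value determinations, at the policy that overhauls at state 2.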
Idea of Policy Improvement
• it can be proven that
  • g is non-increasing from one iteration to the next
  • R is optimal if the policy improvement step leaves the policy unchanged
  • the algorithm stops after a finite number of iterations
Example
• Policy: Replacement only at state 3
• transition probability matrix
• $C_{01} = 0$, $C_{11} = 1000$, $C_{21} = 3000$, $C_{33} = 6000$
Example
• Iteration 1:
  • Value Determination
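As a worked sketch, the value-determination equations for the policy "replace only at state 3" can be written out as follows. The transition probabilities used here are an assumption (the machine-maintenance data commonly used with this example; they are not given in the text above), and replacement is assumed to return the machine to state 0.

```latex
\begin{align*}
g(R) + v_0(R) &= 0    + \tfrac{7}{8}\, v_1(R) + \tfrac{1}{16}\, v_2(R) + \tfrac{1}{16}\, v_3(R) \\
g(R) + v_1(R) &= 1000 + \tfrac{3}{4}\, v_1(R) + \tfrac{1}{8}\, v_2(R) + \tfrac{1}{8}\, v_3(R) \\
g(R) + v_2(R) &= 3000 + \tfrac{1}{2}\, v_2(R) + \tfrac{1}{2}\, v_3(R) \\
g(R) + v_3(R) &= 6000 + v_0(R)
\end{align*}
% Setting v_3(R) = 0 and solving:
\[
g(R) = \tfrac{25000}{13} \approx 1923, \quad
v_0(R) = -\tfrac{53000}{13} \approx -4077, \quad
v_1(R) = -\tfrac{34000}{13} \approx -2615, \quad
v_2(R) = \tfrac{28000}{13} \approx 2154
\]
```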
Example
• Iteration 1:
  • Policy Improvement
    • nothing can be done at state 0, and the machine must be replaced at state 3
    • possible decisions at
      • state 1: decision 1 (do nothing, $1000), decision 3 (replace, $6000)
      • state 2: decision 1 (do nothing, $3000), decision 2 (overhaul, $4000), decision 3 (replace, $6000)
Example
• Iteration 1:
  • Policy Improvement: the general expressions
Example
• Iteration 1:
  • Policy Improvement
  • new policy: do nothing at states 0 and 1, overhaul at state 2, and replace at state 3
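The comparison behind this new policy can be sketched numerically. The figures assume the commonly used machine-maintenance transition probabilities (not given in the text above), with overhaul moving the machine to state 1 and replacement to state 0, and the resulting iteration-1 values $v_0 = -53000/13$, $v_1 = -34000/13$, $v_2 = 28000/13$, $v_3 = 0$. Each line evaluates $C_{ik} + \sum_j p_{ij}(k)\, v_j(R) - v_i(R)$.

```latex
\begin{align*}
\text{state 1, decision 1:}\quad & 1000 + \tfrac{3}{4} v_1 + \tfrac{1}{8} v_2 + \tfrac{1}{8} v_3 - v_1 = \tfrac{25000}{13} \approx 1923 \\
\text{state 1, decision 3:}\quad & 6000 + v_0 - v_1 = \tfrac{59000}{13} \approx 4538 \\[4pt]
\text{state 2, decision 1:}\quad & 3000 + \tfrac{1}{2} v_2 + \tfrac{1}{2} v_3 - v_2 = \tfrac{25000}{13} \approx 1923 \\
\text{state 2, decision 2:}\quad & 4000 + v_1 - v_2 = -\tfrac{10000}{13} \approx -769 \\
\text{state 2, decision 3:}\quad & 6000 + v_0 - v_2 = -\tfrac{3000}{13} \approx -231
\end{align*}
```

Under these assumptions the minimum is decision 1 (do nothing) at state 1 and decision 2 (overhaul) at state 2.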
Example
• Iteration 2:
  • Value Determination
  • it can be shown that there is no improvement in policy, so doing nothing at states 0 and 1, overhauling at state 2, and replacing at state 3 is an optimum policy