Making complex decisions

  1. Making complex decisions Chapter 17

  2. Outline • Sequential decision problems (Markov Decision Processes) • Value iteration • Policy iteration • Partially observable MDPs • Decision theoretic agent • Dynamic belief networks

  3. Sequential Decisions • Agent’s utility depends on a sequence of decisions

  4. Markov Decision Process (MDP) • Defined as a tuple: <S, A, M, R> • S: state • A: action • M: transition function • Table Mija = P(sj | si, a), the probability of reaching sj given action a in state si • R: reward • R(si, a) = cost or reward of taking action a in state si • In our case R = R(si) • Choose a sequence of actions (not just one decision or one action) • Utility is based on a sequence of decisions
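
To make the tuple concrete, here is a minimal sketch of one way it could be represented in Python; the MDP class and its field names are illustrative assumptions, not code from the slides:

class MDP:
    """Hypothetical container for the tuple <S, A, M, R> described above."""
    def __init__(self, states, actions, transitions, rewards, gamma=1.0):
        self.states = states            # S: list of states
        self.actions = actions          # A: list of actions
        self.transitions = transitions  # M: dict mapping (s, a) -> {s': P(s' | s, a)}
        self.rewards = rewards          # R: dict mapping s -> R(s)
        self.gamma = gamma              # discount factor (gamma = 1 for purely additive rewards)

    def P(self, s_next, s, a):
        """Transition probability P(s' | s, a); 0 if the pair (s, a) has no successors."""
        return self.transitions.get((s, a), {}).get(s_next, 0.0)

    def R(self, s):
        """Reward collected in state s."""
        return self.rewards[s]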

  5. Generalization • Inputs: • Initial state s0 • Action model • Reward R(si) collected in each state si • A state is terminal if it has no successor • Starting at s0, the agent keeps executing actions until it reaches a terminal state • Its goal is to maximize the expected sum of rewards collected (additive rewards) • Additive rewards: U(s0, s1, s2, …) = R(s0) + R(s1) + R(s2) + … • Discounted rewards (we will not use this reward form): U(s0, s1, s2, …) = R(s0) + γ R(s1) + γ^2 R(s2) + … (0 ≤ γ ≤ 1)
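
As a small worked illustration of the two reward schemes, the utility of a finite reward sequence can be computed as below; the function name and the example numbers are chosen only for illustration (additive rewards are the special case γ = 1):

def utility(rewards, gamma=1.0):
    """U(s0, s1, s2, ...) = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [-0.04, -0.04, -0.04, 1.0]
print(utility(rewards))             # additive (gamma = 1): 3*(-0.04) + 1 = 0.88
print(utility(rewards, gamma=0.9))  # discounted (gamma = 0.9): about 0.62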

  6. Dealing with infinite sequences of actions • Use discounted rewards • The environment contains terminal states that the agent is guaranteed to reach eventually • Infinite sequences are compared by their average rewards

  7. Utility of a State Given a state s we can measure the expected utilities obtained by applying any policy π. We assume the agent is in state s and define St (a random variable) as the state reached at step t. Obviously S0 = s. The expected utility of a state s given a policy π is Uπ(s) = E[Σt γ^t R(St)]. π*s will be a policy that maximizes the expected utility of state s.

  8. Utility of a State The utility of a state s measures its desirability: • If Si is terminal: U(Si) = R(Si) • If Si is non-terminal: U(Si) = R(Si) + γ maxa Σj P(Sj|Si,a) U(Sj) [Bellman equation] [the reward of Si augmented by the expected sum of discounted rewards collected in future states]

  9. Utility of a State The utility of a state s measures its desirability: • If Si is terminal: U(Si) = R(Si) • If Si is non-terminal: U(Si) = R(Si) + γ maxa Σj P(Sj|Si,a) U(Sj) [Bellman equation] [the reward of Si augmented by the expected sum of discounted rewards collected in future states] → dynamic programming
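
A single Bellman backup can be written directly from the equation above; this sketch reuses the hypothetical MDP class introduced earlier and treats a state with no successors as terminal:

def bellman_backup(mdp, U, s):
    """Right-hand side of the Bellman equation: R(Si) + gamma * maxa sum_j P(Sj|Si,a) * U(Sj)."""
    q_values = [
        sum(mdp.P(s2, s, a) * U[s2] for s2 in mdp.states)
        for a in mdp.actions
        if (s, a) in mdp.transitions   # actions are only defined for non-terminal states
    ]
    if not q_values:                   # terminal state: U(Si) = R(Si)
        return mdp.R(s)
    return mdp.R(s) + mdp.gamma * max(q_values)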

  10. Optimal Policy • A policy is a function that maps each state s into the action to execute if s is reached • The optimal policy π* is the policy that always leads to maximizing the expected sum of rewards collected in future states (Maximum Expected Utility principle) • π*(Si) = argmaxa Σj P(Sj|Si,a) U(Sj)
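
Given a table of utilities, the greedy policy in the argmax formula can be extracted as follows (again a sketch against the hypothetical MDP class; terminal states are given no action):

def extract_policy(mdp, U):
    """pi*(Si) = argmaxa sum_j P(Sj|Si,a) * U(Sj) for every non-terminal state."""
    policy = {}
    for s in mdp.states:
        actions = [a for a in mdp.actions if (s, a) in mdp.transitions]
        if not actions:
            continue  # terminal state: nothing to decide
        policy[s] = max(
            actions,
            key=lambda a: sum(mdp.P(s2, s, a) * U[s2] for s2 in mdp.states),
        )
    return policy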

  11. Example • Fully observable environment • Non-deterministic actions • intended effect: 0.8 • not the intended effect: 0.2 • [Figure: 4×3 grid world; each action moves in the intended direction with probability 0.8 and to each of the two perpendicular directions with probability 0.1]

  12. MDP of example • S: state of the agent on the grid, e.g., (4,3) • Note that cells are denoted by (x,y) • A: actions of the agent, i.e., N, E, S, W • M: transition function • E.g., M( (4,2) | (3,2), N) = 0.1 • E.g., M( (3,3) | (3,2), N) = 0.8 • (Robot movement, uncertainty of another agent’s actions, …) • R: reward (more comments on the reward function later) • R(1,1) = -1/25, R(1,2) = -1/25, … • R(4,3) = +1, R(4,2) = -1 • γ = 1
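
One way this 4×3 grid world might be encoded with the hypothetical MDP class from earlier is sketched below; the wall at (2,2) and the stay-in-place behaviour when bumping into a wall or the edge are assumptions about the standard example, not stated on this slide:

# Cells are (x, y) with x = 1..4, y = 1..3; (2, 2) is a wall; (4, 3) and (4, 2) are terminal.
WALL = (2, 2)
TERMINALS = {(4, 3), (4, 2)}
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != WALL]
ACTIONS = ['N', 'E', 'S', 'W']
MOVES = {'N': (0, 1), 'E': (1, 0), 'S': (0, -1), 'W': (-1, 0)}
PERPENDICULAR = {'N': ['E', 'W'], 'S': ['E', 'W'], 'E': ['N', 'S'], 'W': ['N', 'S']}

def step(s, direction):
    """Intended successor of s; moving into the wall or off the grid leaves the agent in place."""
    nxt = (s[0] + MOVES[direction][0], s[1] + MOVES[direction][1])
    return nxt if nxt in STATES else s

def build_transitions():
    """M(Sj | Si, a): 0.8 to the intended direction, 0.1 to each perpendicular direction."""
    M = {}
    for s in STATES:
        if s in TERMINALS:
            continue  # terminal states have no successors
        for a in ACTIONS:
            dist = {}
            for direction, p in [(a, 0.8)] + [(d, 0.1) for d in PERPENDICULAR[a]]:
                s2 = step(s, direction)
                dist[s2] = dist.get(s2, 0.0) + p
            M[(s, a)] = dist
    return M

REWARDS = {s: 1.0 if s == (4, 3) else -1.0 if s == (4, 2) else -1 / 25 for s in STATES}
grid_mdp = MDP(STATES, ACTIONS, build_transitions(), REWARDS, gamma=1.0)

# Sanity checks against the slide: M((4,2) | (3,2), N) = 0.1 and M((3,3) | (3,2), N) = 0.8.
assert abs(grid_mdp.P((4, 2), (3, 2), 'N') - 0.1) < 1e-9
assert abs(grid_mdp.P((3, 3), (3, 2), 'N') - 0.8) < 1e-9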

  13. Policy • A policy is a mapping from states to actions. • Given a policy, one may calculate the expected utility of the series of actions produced by the policy. • The goal: find an optimal policy π*, one that would produce maximal expected utility. • [Figure: example of a policy drawn on the 4×3 grid]

  14. Optimal policy and state utilities for MDP • What will happen if the cost of a step is very low? • [Figures: the optimal policy and the corresponding state utilities shown on the 4×3 grid]

  15. Finding the optimal policy • Solution must satisfy both equations: • π*(Si) = argmaxa Σj P(Sj|Si,a) U(Sj) (1) • U(Si) = R(Si) + γ maxa Σj P(Sj|Si,a) U(Sj) (2) • Value iteration: • start with a guess of a utility function and use (2) to get a better estimate • Policy iteration: • start with a fixed policy and solve for the exact utilities of states; use (1) to find an updated policy.

  16. Value iteration

function Value-Iteration(MDP) returns a utility function
  inputs: P(Sj|Si,a), a transition model
          R, a reward function on states
          γ, discount parameter
  local variables: U, utility function, initially identical to R
                   U', utility function, initially identical to R
  repeat
    U ← U'
    for each state Si do
      U'[Si] ← R[Si] + γ maxa Σj P(Sj|Si,a) U[Sj]
    end
  until Close-Enough(U, U')
  return U
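
A runnable Python version of this pseudocode, using the hypothetical MDP class and grid_mdp from the earlier sketches; the Close-Enough test is approximated here by a small threshold on the largest utility change, which is an illustrative choice:

def value_iteration(mdp, epsilon=1e-6):
    """Repeat U'[Si] <- R[Si] + gamma * maxa sum_j P(Sj|Si,a) U[Sj] until utilities stabilize."""
    U1 = {s: mdp.R(s) for s in mdp.states}   # U', initially identical to R
    while True:
        U = U1.copy()                        # U <- U'
        delta = 0.0
        for s in mdp.states:
            q_values = [
                sum(mdp.P(s2, s, a) * U[s2] for s2 in mdp.states)
                for a in mdp.actions
                if (s, a) in mdp.transitions
            ]
            if q_values:                     # terminal states keep U(Si) = R(Si)
                U1[s] = mdp.R(s) + mdp.gamma * max(q_values)
            delta = max(delta, abs(U1[s] - U[s]))
        if delta < epsilon:                  # Close-Enough(U, U')
            return U

For example, U = value_iteration(grid_mdp) followed by extract_policy(grid_mdp, U) recovers a policy for the grid example.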

  17. Value iteration - convergence of utility values

  18. Policy iteration

function Policy-Iteration(MDP) returns a policy
  inputs: P(Sj|Si,a), a transition model
          R, a reward function on states
          γ, discount parameter
  local variables: U, utility function, initially identical to R
                   π, a policy, initially optimal with respect to U
  repeat
    U ← Value-Determination(π, U, MDP)
    unchanged? ← true
    for each state Si do
      if maxa Σj P(Sj|Si,a) U[Sj] > Σj P(Sj|Si,π(Si)) U[Sj] then
        π(Si) ← argmaxa Σj P(Sj|Si,a) U[Sj]
        unchanged? ← false
  until unchanged?
  return π
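
A Python sketch of the same loop; here Value-Determination is approximated by a fixed number of Bellman sweeps under the current policy, and the initial policy is arbitrary rather than "optimal with respect to U" — both are assumptions of this sketch, not the slides' exact procedure:

def policy_iteration(mdp, eval_sweeps=50):
    """Alternate policy evaluation and greedy improvement until the policy stops changing."""
    U = {s: mdp.R(s) for s in mdp.states}
    # Policy over non-terminal states only, starting from an arbitrary action.
    pi = {s: mdp.actions[0] for s in mdp.states
          if any((s, a) in mdp.transitions for a in mdp.actions)}

    def q(s, a):
        """Expected utility of the successors of s under action a."""
        return sum(mdp.P(s2, s, a) * U[s2] for s2 in mdp.states)

    while True:
        # Value determination: evaluate the current policy pi (approximate, fixed sweeps).
        for _ in range(eval_sweeps):
            for s in pi:
                U[s] = mdp.R(s) + mdp.gamma * q(s, pi[s])
        # Policy improvement: switch to the greedy action where it is strictly better.
        unchanged = True
        for s in pi:
            best = max(mdp.actions, key=lambda a: q(s, a))
            if q(s, best) > q(s, pi[s]):
                pi[s] = best
                unchanged = False
        if unchanged:
            return pi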

  19. Decision Theoretic Agent • Agent that must act in uncertain environments. Actions may be non-deterministic.

  20. Stationarity and Markov assumption • A stationary process is one whose dynamics, i.e., P(Xt|Xt-1, …, X0) for t > 0, are assumed not to change with time. • The Markov assumption is that the current state Xt is dependent only on a finite history of previous states. In this class, we will only consider first-order Markov processes, for which P(Xt|Xt-1, …, X0) = P(Xt|Xt-1)

  21. Transition model and sensor model • For first-order Markov processes, the laws describing how the process state evolves with time are contained entirely within the conditional distribution P(Xt|Xt-1), which is called the transition model for the process. • We will also assume that the observable state variables (evidence) Et depend only on the state variables Xt. P(Et|Xt) is called the sensor or observational model.
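
A minimal sketch of how a transition model and a sensor model might be tabulated and used together for belief updating; the two-state process, the observation names, and all the probabilities below are invented purely for illustration:

# Hypothetical two-state process: hidden state Xt in {'ok', 'fault'}, evidence Et in {'normal', 'alarm'}.
TRANSITION = {                # P(Xt | Xt-1): stationary, first-order Markov
    'ok':    {'ok': 0.95, 'fault': 0.05},
    'fault': {'ok': 0.10, 'fault': 0.90},
}
SENSOR = {                    # P(Et | Xt): evidence depends only on the current state
    'ok':    {'normal': 0.9, 'alarm': 0.1},
    'fault': {'normal': 0.2, 'alarm': 0.8},
}

def filter_step(belief, evidence):
    """Predict with the transition model, then weight by the sensor model and normalize."""
    predicted = {
        x: sum(belief[x_prev] * TRANSITION[x_prev][x] for x_prev in belief)
        for x in TRANSITION
    }
    unnormalized = {x: SENSOR[x][evidence] * predicted[x] for x in predicted}
    z = sum(unnormalized.values())
    return {x: p / z for x, p in unnormalized.items()}

belief = {'ok': 0.5, 'fault': 0.5}
belief = filter_step(belief, 'alarm')   # updated belief after observing an alarm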

  22. Decision Theoretic Agent - implemented

  23. Sensor model – generic example (part of steam train control)

  24. Sensor model II

  25. Model for lane position sensor for automated vehicle

  26. Dynamic belief network – generic structure • Describes the action model, namely the state evolution model.

  27. State evolution model - example

  28. Dynamic Decision Network • Can handle uncertainty • Deal with continuous streams of evidence • Can handle unexpected events (they have no fixed plan) • Can handle sensor noise and sensor failure • Can act to obtain information • Can handle large state spaces
