620 likes | 638 Views
Explore techniques for tracking and predicting changing environments, incorporating new observations in stochastic domains. Understand the concept of temporal probabilistic agents and inference tasks.
E N D
Decision Making Under Uncertainty CMSC 471 – Fall 2011 Class #23-24 – Thursday, November 17 / Tuesday, November 22 R&N, Chapters 15.1-15.2, 16.1-16.3, 17.1-17.3 material from Lise Getoor, Jean-Claude Latombe, and Daphne Koller
sensors environment ? agent actuators Temporal Probabilistic Agent t1, t2, t3, …
Time and Uncertainty • The world changes; we need to track and predict it • Examples: diabetes management, traffic monitoring • Basic idea: • Copy state and evidence variables for each time step • Model uncertainty in change over time, incorporating new observations as they arrive • Reminiscent of situation calculus, but for stochastic domains • Xt – set of unobservable state variables at time t • e.g., BloodSugart, StomachContentst • Et – set of evidence variables at time t • e.g., MeasuredBloodSugart, PulseRatet, FoodEatent • Assumes discrete time steps
States and Observations • Process of change is viewed as series of snapshots, each describing the state of the world at a particular time • Each time slice is represented by a set of random variables indexed by t: • the set of unobservable state variables Xt • the set of observable evidence variables Et • The observation at time t is Et = et for some set of values et • The notation Xa:b denotes the set of variables from Xa to Xb
Stationary Process/Markov Assumption • Markov Assumption: • Xt depends on some previous Xis • First-order Markov process: P(Xt|X0:t-1) = P(Xt|Xt-1) • kth order: depends on previous k time steps • Sensor Markov assumption:P(Et|X0:t, E0:t-1) = P(Et|Xt) • That is: The agent’s observations depend only on the actual current state of the world • Assume stationary process: • Transition model P(Xt|Xt-1) and sensor model P(Et|Xt) are time-invariant (i.e., they are the same for all t) • That is, the changes in the world state are governed by laws that do not themselves change over time
Complete Joint Distribution • Given: • Transition model: P(Xt|Xt-1) • Sensor model: P(Et|Xt) • Prior probability: P(X0) • Then we can specify the complete joint distribution of a sequence of states:
Example Raint-1 Raint Raint+1 Umbrellat-1 Umbrellat Umbrellat+1 This should look a lot like a finite state automaton (since it is one...)
Inference Tasks • Filtering or monitoring: P(Xt|e1,…,et)Compute the current belief state, given all evidence to date • Prediction: P(Xt+k|e1,…,et) Compute the probability of a future state • Smoothing: P(Xk|e1,…,et) Compute the probability of a past state (hindsight) • Most likely explanation: arg maxx1,..xtP(x1,…,xt|e1,…,et)Given a sequence of observations, find the sequence of states that is most likely to have generated those observations
Examples • Filtering: What is the probability that it is raining today, given all of the umbrella observations up through today? • Prediction: What is the probability that it will rain the day after tomorrow, given all of the umbrella observations up through today? • Smoothing: What is the probability that it rained yesterday, given all of the umbrella observations through today? • Most likely explanation: If the umbrella appeared the first three days but not on the fourth, what is the most likely weather sequence to produce these umbrella sightings?
Filtering • We use recursive estimation to compute P(Xt+1 | e1:t+1) as a function of et+1 and P(Xt | e1:t) • We can write this as follows: • This leads to a recursive definition: • f1:t+1 = FORWARD (f1:t, et+1) QUIZLET: What is α?
Filtering Example Raint-1 Raint Raint+1 Umbrellat-1 Umbrellat Umbrellat+1 What is the probability of rain on Day 2, given a uniform prior of rain on Day 0, U1 = true, and U2 = true?
Decision Making Under Uncertainty • Many environments have multiple possible outcomes • Some of these outcomes may be good; others may be bad • Some may be very likely; others unlikely • What’s a poor agent to do??
? ? a a b b c c • {a(pa), b(pb), c(pc)} • decision that maximizes expected utility value Non-Deterministic vs. Probabilistic Uncertainty • {a,b,c} • decision that is best for worst case Non-deterministic model Probabilistic model ~ Adversarial search
Expected Utility • Random variable X with n values x1,…,xn and distribution (p1,…,pn)E.g.: X is the state reached after doing an action A under uncertainty • Function U of XE.g., U is the utility of a state • The expected utility of A is EU[A] = Si=1,…,n p(xi|A)U(xi)
s0 A1 s1 s2 s3 0.2 0.7 0.1 100 50 70 One State/One Action Example U(A1, S0) = 100 x 0.2 + 50 x 0.7 + 70 x 0.1 = 20 + 35 + 7 = 62
s0 A1 A2 s1 s2 s3 s4 0.2 0.7 0.2 0.1 0.8 100 50 70 One State/Two Actions Example • U (A1, S0) = 62 • U (A2, S0) = 74 • U (S0) = maxa{U(a,S0)} • = 74 80
s0 A1 A2 s1 s2 s3 s4 0.2 0.7 0.2 0.1 0.8 100 50 70 Introducing Action Costs • U (A1, S0) = 62 – 5 = 57 • U (A2, S0) = 74 – 25 = 49 • U (S0) = maxa{U(a, S0)} • = 57 -5 -25 80
MEU Principle • A rational agent should choose the action that maximizes agent’s expected utility • This is the basis of the field of decision theory • The MEU principle provides a normative criterion for rational choice of action AI is Solved!!!
Not quite… • Must have a complete model of: • Actions • Utilities • States • Even if you have a complete model, decision making is computationally intractable • In fact, a truly rational agent takes into account the utility of reasoning as well (bounded rationality) • Nevertheless, great progress has been made in this area recently, and we are able to solve much more complex decision-theoretic problems than ever before
Axioms of Utility Theory • Orderability • (A>B) (A<B) (A~B) • Transitivity • (A>B) (B>C) (A>C) • Continuity • A>B>C p [p,A; 1-p,C] ~ B • Substitutability • A~B [p,A; 1-p,C]~[p,B; 1-p,C] • Monotonicity • A>B (p≥q [p,A; 1-p,B] >~ [q,A; 1-q,B]) • Decomposability • [p,A; 1-p, [q,B; 1-q, C]] ~ [p,A; (1-p)q, B; (1-p)(1-q), C]
Money Versus Utility • Money <> Utility • More money is better, but not always in a linear relationship to the amount of money • Expected Monetary Value • Risk-averse: U(L) < U(SEMV(L)) • Risk-seeking: U(L) > U(SEMV(L)) • Risk-neutral: U(L) = U(SEMV(L))
Value Function • Provides a ranking of alternatives, but not a meaningful metric scale • Also known as an “ordinal utility function” • Sometimes, only relative judgments (value functions) are necessary • At other times, absolute judgments (utility functions) are required
Multiattribute Utility Theory • A given state may have multiple utilities • ...because of multiple evaluation criteria • ...because of multiple agents (interested parties) with different utility functions • We will talk about this more later in the semester, when we discuss multi-agent systems and game theory
Sequential Decision Making • Finite Horizon • Infinite Horizon
Simple Robot Navigation Problem • In each state, the possible actions are U, D, R, and L
Probabilistic Transition Model • In each state, the possible actions are U, D, R, and L • The effect of U is as follows (transition model): • With probability 0.8, the robot moves up one square (if the robot is already in the top row, then it does not move)
Probabilistic Transition Model • In each state, the possible actions are U, D, R, and L • The effect of U is as follows (transition model): • With probability 0.8 the robot moves up one square (if the robot is already in the top row, then it does not move) • With probability 0.1 the robot moves right one square (if the robot is already in the rightmost row, then it does not move)
Probabilistic Transition Model • In each state, the possible actions are U, D, R, and L • The effect of U is as follows (transition model): • With probability 0.8 the robot moves up one square (if the robot is already in the top row, then it does not move) • With probability 0.1 the robot moves right one square (if the robot is already in the rightmost row, then it does not move) • With probability 0.1 the robot moves left one square (if the robot is already in the leftmost row, then it does not move)
Markov Property The transition properties depend only on the current state, not on the previous history (how that state was reached)
Sequence of Actions [3,2] 3 2 1 1 2 3 4 • Planned sequence of actions: (U, R)
[3,2] [3,2] [3,3] [4,2] Sequence of Actions 3 2 1 1 2 3 4 • Planned sequence of actions: (U, R) • U is executed
[3,2] [3,2] [3,3] [4,2] [3,1] [3,2] [3,3] [4,1] [4,2] [4,3] Histories 3 2 1 1 2 3 4 • Planned sequence of actions: (U, R) • U has been executed • R is executed • There are 9 possible sequences of states – called histories – and 6 possible final states for the robot!
Probability of Reaching the Goal 3 Note importance of Markov property in this derivation 2 1 1 2 3 4 • P([4,3] | (U,R).[3,2]) = • P([4,3] | R.[3,3]) x P([3,3] | U.[3,2]) + P([4,3] | R.[4,2]) x P([4,2] | U.[3,2]) • P([4,3] | R.[3,3]) = 0.8 • P([4,3] | R.[4,2]) = 0.1 • P([3,3] | U.[3,2]) = 0.8 • P([4,2] | U.[3,2]) = 0.1 • P([4,3] | (U,R).[3,2]) = 0.65
3 +1 2 -1 1 1 2 3 4 Utility Function • [4,3] provides power supply • [4,2] is a sand area from which the robot cannot escape
3 +1 2 -1 1 1 2 3 4 Utility Function • [4,3] provides power supply • [4,2] is a sand area from which the robot cannot escape • The robot needs to recharge its batteries
3 +1 2 -1 1 1 2 3 4 Utility Function • [4,3] provides power supply • [4,2] is a sand area from which the robot cannot escape • The robot needs to recharge its batteries • [4,3] or [4,2] are terminal states
3 +1 2 -1 1 1 2 3 4 Utility of a History • [4,3] provides power supply • [4,2] is a sand area from which the robot cannot escape • The robot needs to recharge its batteries • [4,3] or [4,2] are terminal states • The utility of a history is defined by the utility of the last state (+1 or –1) minus n/25, where n is the number of moves
Utility of an Action Sequence +1 3 -1 2 1 1 2 3 4 • Consider the action sequence (U,R) from [3,2]
[3,2] [3,2] [3,3] [4,2] [3,1] [3,2] [3,3] [4,1] [4,2] [4,3] Utility of an Action Sequence +1 3 -1 2 1 1 2 3 4 • Consider the action sequence (U,R) from [3,2] • A run produces one of 7 possible histories, each with some probability
[3,2] [3,2] [3,3] [4,2] [3,1] [3,2] [3,3] [4,1] [4,2] [4,3] Utility of an Action Sequence +1 3 -1 2 1 1 2 3 4 • Consider the action sequence (U,R) from [3,2] • A run produces one among 7 possible histories, each with some probability • The utility of the sequence is the expected utility of the histories:U = ShUhP(h)
[3,2] [3,2] [3,3] [4,2] [3,1] [3,2] [3,3] [4,1] [4,2] [4,3] Optimal Action Sequence +1 3 -1 2 1 1 2 3 4 • Consider the action sequence (U,R) from [3,2] • A run produces one among 7 possible histories, each with some probability • The utility of the sequence is the expected utility of the histories • The optimal sequence is the one with maximal utility
[3,2] [3,2] [3,3] [4,2] [3,1] [3,2] [3,3] [4,1] [4,2] [4,3] only if the sequence is executed blindly! Optimal Action Sequence +1 3 -1 2 1 1 2 3 4 • Consider the action sequence (U,R) from [3,2] • A run produces one among 7 possible histories, each with some probability • The utility of the sequence is the expected utility of the histories • The optimal sequence is the one with maximal utility • But is the optimal action sequence what we want to compute?
Accessible or observable state Reactive Agent Algorithm Repeat: • s sensed state • If s is terminal then exit • a choose action (given s) • Perform a
+1 3 -1 2 1 1 2 3 4 Policy (Reactive/Closed-Loop Strategy) • A policy P is a complete mapping from states to actions
Reactive Agent Algorithm Repeat: • s sensed state • If s is terminal then exit • aP(s) • Perform a
Note that [3,2] is a “dangerous” state that the optimal policy tries to avoid Makes sense because of Markov property Optimal Policy +1 3 -1 2 1 1 2 3 4 • A policy P is a complete mapping from states to actions • The optimal policyP* is the one that always yields a history (ending at a terminal state) with maximal • expected utility
This problem is called a Markov Decision Problem (MDP) How to compute P*? Optimal Policy +1 3 -1 2 1 1 2 3 4 • A policy P is a complete mapping from states to actions • The optimal policyP* is the one that always yields a history with maximal expected utility