CS 416 Artificial Intelligence • Lecture 20 • Making Complex Decisions • Chapter 17
Midterm Results • AVG: 72 • MED: 75 • STD: 12 • Rough dividing lines at: 58 (C), 72 (B), 85 (A)
Assignment 1 Results • AVG: 87 • MED: 94 • STD: 19 • How to interpret the grade sheet…
Interpreting the grade sheet… • You see the tests we ran listed in the first column • The metrics we accumulated are: • Solution depth, nodes created, nodes accessed, fringe size • All metrics are normalized by dividing by the value obtained using one of the good solutions from last year • The first four columns show these normalized metrics averaged across the entire class’s submissions • The next four columns show these normalized metrics for your submission… • Ex: A value of “1” for “Solution” means your code found a solution at the same depth as the solution from last year. The class average for “solution” might be 1.28 because some submissions searched longer and thus increased the average
Interpreting the grade sheet • SLOW = more than 30 seconds to complete • 66% credit given to reflect partial credit even though we never obtained firm results • N/A = the test would not even launch correctly… it might have crashed or ended without output • 33% credit given to reflect that frequently N/A occurs when no attempt was made to create an implementation • If you have an N/A but you think your code reflects partial credit, let us know.
Gambler’s Ruin • Consider working out examples of gambler’s ruin for $4 and $8 by hand • Ben created some graphs to show solution of gambler’s ruin for $8 • $0 bets are not permitted!
$8-ruin using batch update • Converges after three iterations. • Value vector is only updated after a complete iteration has finished
$8-ruin using in-place updating • Convergence occurs more quickly • Updates to value function occur in-place starting from $1
$100-ruin • A more detailed graph than provided in the assignment
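The difference between the two update styles on the last two slides can be sketched in a few lines. This is only an illustration under stated assumptions: the win probability `p_win = 0.4`, the reward of 1 for reaching the goal and 0 for ruin, the absence of discounting, and the function name `ruin_value_iteration` are all hypothetical and not taken from the assignment.

```python
def ruin_value_iteration(goal=8, p_win=0.4, in_place=False, tol=1e-12):
    """Value iteration for gambler's ruin.

    Returns (values, sweeps), where sweeps counts the passes that changed a value.
    Assumed model: win a bet with probability p_win, value 1 at the goal, 0 at ruin.
    """
    values = [0.0] * (goal + 1)
    values[goal] = 1.0
    sweeps = 0
    while True:
        # Batch mode reads from a frozen copy of the previous iteration;
        # in-place mode reads the live vector, so fresh updates are used immediately.
        source = values if in_place else list(values)
        biggest_change = 0.0
        for money in range(1, goal):  # sweep upward, starting from $1
            best = max(
                p_win * source[money + bet] + (1 - p_win) * source[money - bet]
                for bet in range(1, min(money, goal - money) + 1)  # $0 bets excluded
            )
            biggest_change = max(biggest_change, abs(best - values[money]))
            values[money] = best
        if biggest_change <= tol:
            return values, sweeps
        sweeps += 1


batch_values, batch_sweeps = ruin_value_iteration(in_place=False)
inplace_values, inplace_sweeps = ruin_value_iteration(in_place=True)
```

The two modes differ only in which vector the backup reads from. Both should settle on the same values; how many sweeps each one needs depends on the win probability, the sweep order, and the stopping tolerance.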
Trying it by hand • Assume value update is working… • What’s the best action at $5? When tied… pick the smallest action
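A hand check of the greedy step with the tie-breaking rule might look like the sketch below. The value vector `ASSUMED_VALUES` and `p_win = 0.4` are hypothetical numbers chosen for illustration (they are what value iteration gives for $8 under that assumed win probability), and `best_bet` is an invented helper name.

```python
# Assumed converged values for holdings $0..$8 with a hypothetical p_win of 0.4.
ASSUMED_VALUES = [0.0, 0.064, 0.16, 0.256, 0.4, 0.496, 0.64, 0.784, 1.0]


def best_bet(money, values, p_win=0.4, tie_tol=1e-9):
    """Return the greedy bet at `money`, picking the smallest bet when values tie."""
    goal = len(values) - 1
    best_action, best_value = None, float("-inf")
    # Bets are tried in ascending order and $0 is never considered.
    for bet in range(1, min(money, goal - money) + 1):
        q = p_win * values[money + bet] + (1 - p_win) * values[money - bet]
        # Only a strictly better value (beyond the tolerance) replaces the current
        # choice, so a later bet that merely ties does not displace a smaller one.
        if q > best_value + tie_tol:
            best_action, best_value = bet, q
    return best_action


bet_at_five = best_bet(5, ASSUMED_VALUES)
```

The tolerance matters because two bets that tie on paper can differ in the last floating-point digit, which would otherwise make the choice depend on rounding.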
Office hours • Sunday: 4 – 5 in Thornton Stacks • Send email to Ben (hocking@virginia.edu) by Saturday at midnight to reserve a slot • Also make sure you have stepped through your code (say for the $8 example) to make sure that it is implementing your logic
Compilation • Just for grins • Take your Visual Studio code and compile using g++: • g++ foo.cpp -o foo -Wall
Partially observable Markov Decision Processes (POMDPs) • Relationship to MDPs • Value and Policy Iteration assume you know a lot about the world: • current state, action, next state, reward for state, … • In the real world, you don’t know exactly what state you’re in • Is the car in front braking hard or braking lightly? • Can you successfully kick the ball to your teammate?
Partially observable • Consider not knowing what state you’re in… • Go left, left, left, left, left • Go up, up, up, up, up • You’re probably in upper-left corner • Go right, right, right, right, right
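The intuition that repeated moves shrink uncertainty even without observations can be sketched on a much simpler world than the one in the lecture. Everything here is an assumption made for illustration: a one-dimensional corridor of four cells, a "go left" action that succeeds with probability 0.8 and otherwise leaves the agent in place, and the helper name `predict`.

```python
def predict(belief, p_move=0.8):
    """One 'go left' step with no observation: b'(s') = sum_s T(s, left, s') * b(s)."""
    new_belief = [0.0] * len(belief)
    for cell, prob in enumerate(belief):
        target = max(cell - 1, 0)  # the wall at cell 0 blocks further movement
        new_belief[target] += p_move * prob       # the move succeeds
        new_belief[cell] += (1 - p_move) * prob   # the move fails, agent stays put
    return new_belief


belief = [0.25, 0.25, 0.25, 0.25]  # start with no idea where we are
for _ in range(5):                 # go left, left, left, left, left
    belief = predict(belief)
```

After the five moves nearly all of the probability mass sits in the leftmost cell, which is the one-dimensional analogue of "you're probably in the upper-left corner".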
Extending the MDP model • MDPs have an explicit transition function T(s, a, s’) • We add O(s, o) • The probability of observing o when in state s • We add the belief state, b • The probability distribution over all possible states • b(s) = belief that you are in state s
Two parts to the problem • Figure out what state you’re in • Use Filtering from Chapter 15 • Figure out what to do in that state • Bellman’s equation is useful again • The optimal action depends only on the agent’s current belief state • Update b(s) and π(s) / U(s) after each iteration
Selecting an action • b′(s′) = α O(s′, o) Σs T(s, a, s′) b(s) • α is a normalizing constant that makes the belief state sum to 1 • b′ = FORWARD(b, a, o) • Optimal policy maps belief states to actions • Note that the n-dimensional belief state is continuous • Each belief value is a number between 0 and 1
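The FORWARD update named on this slide (predict with T, weight by O, then normalize) can be sketched as follows. The dictionary layout, the two-state toy model with states `A` and `B`, the action `stay`, and the observations `beep` and `silent` are all hypothetical; only the roles of T, O, b, and the normalizing constant come from the slides.

```python
def forward(belief, action, observation, transition, sensor):
    """Belief update: b'(s') = alpha * O(s', o) * sum_s T(s, a, s') * b(s)."""
    states = list(belief)
    unnormalized = {}
    for s_next in states:
        # Prediction step: probability of landing in s_next after the action.
        predicted = sum(transition[(s, action, s_next)] * belief[s] for s in states)
        # Correction step: weight by how likely the observation is in s_next.
        unnormalized[s_next] = sensor[(s_next, observation)] * predicted
    total = sum(unnormalized.values())  # equals 1 / alpha
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s: v / total for s, v in unnormalized.items()}


# Hypothetical toy model: T[(s, a, s')] and O[(s, o)].
T = {("A", "stay", "A"): 0.9, ("A", "stay", "B"): 0.1,
     ("B", "stay", "A"): 0.1, ("B", "stay", "B"): 0.9}
O = {("A", "beep"): 0.8, ("A", "silent"): 0.2,
     ("B", "beep"): 0.2, ("B", "silent"): 0.8}

new_belief = forward({"A": 0.5, "B": 0.5}, "stay", "beep", T, O)
```

Note that the sum `total` computed before normalizing is exactly the probability of the observation given the action and the old belief, a quantity the later slides use on its own.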
A slight hitch • The previous slide required that you know the outcome o of action a in order to update the belief state • If the policy is supposed to navigate through belief space, we want to know what belief state we’re moving into before executing action a
Predicting future belief states • Suppose you know action a was performed when in belief state b. What is the probability of receiving observation o? • b provides a guess about initial state • a is known • Any observation could be realized… any subsequent state could be realized… any new belief state could be realized
Predicting future belief states • The probability of perceiving o, given action a and belief state b, is given by summing over all the actual states the agent might reach
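The sum described on this slide, P(o | a, b) = Σs′ O(s′, o) Σs T(s, a, s′) b(s), can be sketched directly. As before, the dictionary layout and the two-state toy model (states `A` and `B`, action `stay`, observations `beep` and `silent`) are hypothetical choices made only for illustration.

```python
def observation_probability(belief, action, observation, transition, sensor):
    """P(o | a, b): sum over every state s' the agent might actually reach."""
    states = list(belief)
    return sum(
        sensor[(s_next, observation)]
        * sum(transition[(s, action, s_next)] * belief[s] for s in states)
        for s_next in states
    )


# Hypothetical toy model: T[(s, a, s')] and O[(s, o)].
T = {("A", "stay", "A"): 0.9, ("A", "stay", "B"): 0.1,
     ("B", "stay", "A"): 0.1, ("B", "stay", "B"): 0.9}
O = {("A", "beep"): 0.8, ("A", "silent"): 0.2,
     ("B", "beep"): 0.2, ("B", "silent"): 0.8}

uniform = {"A": 0.5, "B": 0.5}
p_beep = observation_probability(uniform, "stay", "beep", T, O)
p_silent = observation_probability(uniform, "stay", "silent", T, O)
```

A quick sanity check on any model of this shape is that the probabilities over all possible observations add up to 1.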
Predicting future belief states • We just computed the probability of receiving o • We want the new belief state • Let τ(b, a, b′) be the belief transition function • It is built from P(b′ | o, a, b), which equals 1 if b′ = FORWARD(b, a, o) and 0 otherwise
Predicted future belief states • Combining previous two slides • This is a transition model through belief states
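Written out, the combination takes the following form, using τ for the belief transition function and the textbook's notation for the remaining terms. The indicator term P(b′ | o, a, b) is 1 when b′ = FORWARD(b, a, o) and 0 otherwise.

```latex
\tau(b, a, b') = P(b' \mid a, b)
  = \sum_{o} P(b' \mid o, a, b)\, P(o \mid a, b)
  = \sum_{o} P(b' \mid o, a, b) \sum_{s'} O(s', o) \sum_{s} T(s, a, s')\, b(s)
```

Read from the inside out: the inner sum predicts where the action leads, the middle sum gives the probability of each observation, and the indicator keeps only the observations that lead to b′.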
Relating POMDPs to MDPs • We’ve found a model for transitions through belief states • Note MDPs had transitions through states (the real things) • We need a model for rewards based on beliefs • Note MDPs had a reward function based on state
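One standard choice for the belief-based reward, following the textbook treatment in Chapter 17, is the expected reward under the current belief. The symbol ρ is not used on the slide and is introduced here only for illustration; R(s) is the state-based reward function of the underlying MDP.

```latex
\rho(b) = \sum_{s} b(s)\, R(s)
```

When the belief puts all of its mass on a single state, this reduces to the ordinary MDP reward R(s).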
Bringing it all together • We’ve constructed a representation of POMDPs that makes them look like MDPs • Value and Policy Iteration can be used for POMDPs • The optimal policy, π*(b), of the MDP belief-state representation is also optimal for the physical-state POMDP representation
Continuous vs. discrete • Our POMDP in MDP-form is continuous • Cluster continuous space into regions and try to solve for approximations within these regions
Final answer to POMDP problem • [l, u, u, r, u, u, r, u, u, r, …] • It’s deterministic (it already takes into account the absence of observations) • It has an expected utility of 0.38 (compared with 0.08 for the simple l, l, l, u, u, u, r, r, r, …) • It is successful 86.6% of the time • In general, POMDPs with a few dozen states are nearly impossible to optimize