CS 188: Artificial Intelligence, Fall 2008 Lecture 27: Conclusion 12/9/2008 Dan Klein – UC Berkeley
Policy Search [demo]
• Problem: often the feature-based policies that work well aren’t the ones that approximate V / Q best
• E.g., your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions
• The same distinction between modeling and prediction showed up in classification (where?)
• Solution: learn the policy that maximizes rewards rather than the value that predicts rewards
• This is the idea behind policy search, such as the method that controlled the upside-down helicopter (a toy sketch of the idea follows below)
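A minimal sketch of the policy-search idea under stated assumptions: a hypothetical run_episode(weights) plays one game with a feature-weighted policy and returns its total reward, and we hill-climb directly on the weights, keeping a change only when average returns improve. This is only a toy illustration of "optimize the policy, not the value estimate"; the helicopter work used far more sophisticated methods.

import random

def evaluate(weights, run_episode, episodes=20):
    # Average return of the policy defined by these feature weights.
    return sum(run_episode(weights) for _ in range(episodes)) / episodes

def policy_search(run_episode, num_features, iterations=200, step=0.1):
    weights = [0.0] * num_features
    best = evaluate(weights, run_episode)
    for _ in range(iterations):
        # Perturb one weight at random; keep the change only if returns improve.
        candidate = list(weights)
        i = random.randrange(num_features)
        candidate[i] += random.choice([-step, step])
        score = evaluate(candidate, run_episode)
        if score > best:
            weights, best = candidate, score
    return weights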
POMDPs
• Up until now:
  • Search / MDPs: decision making when the world is fully observable (even if the actions are non-deterministic)
  • Probabilistic reasoning: computing beliefs in a static world
  • Learning: discovering how the world works
• What about acting under uncertainty?
• In general, the problem formalization is the partially observable Markov decision process (POMDP)
• A simple case: value of information
POMDPs
• MDPs have:
  • States S
  • Actions A
  • Transition function P(s’|s,a) (or T(s,a,s’))
  • Rewards R(s,a,s’)
• POMDPs add:
  • Observations O
  • Observation function P(o|s) (or O(s,o))
• POMDPs are MDPs over belief states b (distributions over S); the belief update is sketched below
[Diagram: the MDP expectimax tree over s, a, (s,a), s’ alongside the analogous belief-state tree over b, a, (b,a), o, b’]
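To make "MDPs over belief states" concrete, here is a small sketch of the belief update, assuming the models are given as plain dictionaries (hypothetical names, not a fixed API): after taking action a and seeing observation o, b’(s’) ∝ P(o|s’) · Σs P(s’|s,a) · b(s).

def update_belief(belief, action, observation, T, O):
    # belief: dict state -> probability; T[(s, a, s2)] = P(s2|s,a); O[(s2, o)] = P(o|s2)
    new_belief = {}
    for s2 in belief:
        # Predict: push the old belief through the transition model...
        predicted = sum(T[(s, action, s2)] * belief[s] for s in belief)
        # ...then correct: weight by how likely the observation is in s2.
        new_belief[s2] = O[(s2, observation)] * predicted
    total = sum(new_belief.values())  # assumes the observation has nonzero probability
    return {s: p / total for s, p in new_belief.items()}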
Example: Ghostbusters
• In (static) Ghostbusters:
  • Belief state determined by evidence to date {e}
  • Tree really over evidence sets
  • Probabilistic reasoning needed to predict new evidence given past evidence
• Solving POMDPs
  • One way: use truncated expectimax to compute approximate values of actions
  • What if you only considered busting, or one sense followed by a bust?
  • You get the VPI agent from project 4! (sketched below)
[Diagram: expectimax tree over evidence sets, comparing a_bust at {e} against a_sense leading to {e, e’} and then a_bust, with utilities U(a_bust, {e}) and U(a_bust, {e, e’})]
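A sketch of that truncated expectimax / VPI computation, assuming two hypothetical helpers: bust_utility(belief), the expected utility of busting now, and observation_distribution(belief), the probabilities of each possible sense reading; updated_belief(belief, obs) stands in for the belief update above.

def vpi_agent_action(belief, bust_utility, observation_distribution, updated_belief):
    # Option 1: bust immediately under the current belief.
    value_bust_now = bust_utility(belief)
    # Option 2: sense once, then bust under each possible updated belief.
    value_sense_then_bust = sum(
        prob * bust_utility(updated_belief(belief, obs))
        for obs, prob in observation_distribution(belief).items())
    # The difference is (ignoring sensing cost) the value of the information.
    return 'sense' if value_sense_then_bust > value_bust_now else 'bust'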
More Generally
• General solutions map belief functions to actions
  • Can divide regions of belief space (the set of belief functions) into policy regions (gets complex quickly)
  • Can build approximate policies using discretization methods (toy sketch below)
  • Can factor belief functions in various ways
• Overall, POMDPs are very hard (PSPACE-hard, in fact)
• Most real problems are POMDPs, but we can rarely solve them in general!
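As a toy illustration of the discretization idea for a two-state problem, assuming a hand-built, purely hypothetical table from belief buckets to actions:

def discretized_policy(p_state0, policy_table):
    # Represent the belief by P(state 0), round it onto a grid, and look up an action.
    bucket = round(p_state0 * (len(policy_table) - 1))
    return policy_table[bucket]

# Example: 11 buckets covering beliefs 0.0, 0.1, ..., 1.0
policy_table = ['act-left'] * 3 + ['gather-info'] * 5 + ['act-right'] * 3
print(discretized_policy(0.85, policy_table))  # -> 'act-right'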
Pacman Contest
• 8 teams, 26 people qualified
• 3rd Place: Niels Joubert, Michael Ngo, Rohit Nambiar, Tim Chen
• What they did: split offense / defense
• Strong offense: feature-based balance between eating dots and helping defend
Pacman Contest
• Blue Team: Yiding Jia
• Red Team: William Li, York Wu
• What they did (Yiding):
  • Reflex plus tracking!
  • Probabilistic inference with particle filtering, using direct ghost observations and dot vanishings (a particle-filter sketch follows below)
  • Defense: move toward distributions, hope to get better info and hunt, stay near remaining food
  • Offense: move toward guard-free dots, flee guard clouds
• What they did (William, York):
  • … ??
[DEMO]
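A rough sketch of one particle-filter tracking step of the kind described above, assuming hypothetical models transition_sample(pos), which samples a next ghost position, and observation_prob(reading, pos), the likelihood of a noisy distance reading; this conveys the flavor of project 4-style tracking, not the contestants' actual code.

import random

def particle_filter_step(particles, reading, transition_sample, observation_prob):
    # Elapse time: move every particle through the ghost's motion model.
    moved = [transition_sample(p) for p in particles]
    # Observe: weight particles by how well they explain the noisy reading.
    weights = [observation_prob(reading, p) for p in moved]
    if sum(weights) == 0:
        return moved  # all particles inconsistent; a real agent would reinitialize
    # Resample in proportion to the weights.
    return random.choices(moved, weights=weights, k=len(particles))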
Example: Stratagus [DEMO]
Hierarchical RL
• Stratagus: example of a large RL task
• Stratagus is hard for reinforcement learning algorithms
  • > 10^100 states
  • > 10^30 actions at each point
  • Time horizon ≈ 10^4 steps
• Stratagus is hard for human programmers
  • Typically takes several person-months for game companies to write a computer opponent
  • Still, no match for experienced human players
  • Programming involves much trial and error
• Hierarchical RL
  • Humans supply high-level prior knowledge using a partial program
  • Learning algorithm fills in the details
[From Bhaskara Marthi’s thesis]
Partial “Alisp” Program

(defun top ()
  (loop
    (choose (gather-wood) (gather-gold))))

(defun gather-wood ()
  (with-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

(defun gather-gold ()
  (with-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))

(defun nav (dest)
  (until (= (pos (get-state)) dest)
    (with-choice (move '(N S E W NOOP))
      (action move))))
Hierarchical RL
• Define a hierarchical Q-function which learns a linear feature-based mini-Q-function at each choice point (a rough sketch follows below)
• Very good at balancing resources and directing rewards to the right region
• Still not very good at the strategic elements of these kinds of games (i.e., the Markov game aspect)
[DEMO]
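A very rough sketch of the "mini-Q-function per choice point" idea, assuming a hypothetical feature extractor features(state, choice) that returns a dict of feature values; the actual ALisp learner in Marthi's thesis also decomposes the Q-function further, which is omitted here.

from collections import defaultdict

class ChoicePointQ:
    def __init__(self, features, alpha=0.05, gamma=0.99):
        self.features = features
        self.alpha, self.gamma = alpha, gamma
        # One independent weight vector per labeled choice point in the partial program.
        self.weights = defaultdict(lambda: defaultdict(float))

    def q(self, label, state, choice):
        f = self.features(state, choice)
        return sum(self.weights[label][k] * v for k, v in f.items())

    def update(self, label, state, choice, reward, next_label, next_state, next_choices):
        # Feature-based Q-learning update, scoped to this choice point's weight vector.
        future = max((self.q(next_label, next_state, c) for c in next_choices), default=0.0)
        diff = (reward + self.gamma * future) - self.q(label, state, choice)
        for k, v in self.features(state, choice).items():
            self.weights[label][k] += self.alpha * diff * v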
Bugman
• AI = Animal Intelligence?
• Wim van Eck at Leiden University
• Pacman controlled by a human
• Ghosts controlled by crickets
• Vibrations drive crickets toward or away from Pacman’s location
[DEMO] http://pong.hku.nl/~wim/bugman.htm
Where to go next?
• Congratulations, you’ve seen the basics of modern AI!
• More directions:
  • Robotics / vision / IR / language: cs194
  • There will be a web form to get more info, from the 188 page
  • Machine learning: cs281a
  • Cognitive modeling: cog sci 131
  • NLP: 288
  • Vision: 280
  • … and more; ask if you’re interested
That’s It! • Help us out with some course evaluations • Have a good break, and always maximize your expected utilities!