CS 188: Artificial Intelligence Fall 2008


Presentation Transcript


  1. CS 188: Artificial Intelligence Fall 2008, Lecture 27: Conclusion, 12/9/2008, Dan Klein – UC Berkeley

  2. Autonomous Robotics

  3. Policy Search [demo]
  • Problem: often the feature-based policies that work well aren’t the ones that approximate V / Q best
  • E.g. your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions
  • The same distinction between modeling and prediction showed up in classification (where?)
  • Solution: learn the policy that maximizes rewards rather than the value that predicts rewards
  • This is the idea behind policy search, such as what controlled the upside-down helicopter (a minimal sketch follows below)
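
  A minimal sketch of the policy-search idea, in the spirit of the course projects rather than code from the lecture: keep a linear feature-based policy like the one from project 2, but hill-climb directly on the feature weights using rollout returns instead of fitting the weights to predict Q-values. The environment interface (env.reset, env.step), the feature function, and the action set are assumed names for illustration only.

  import random

  def rollout_return(env, weights, features, actions, horizon=100, gamma=0.99):
      """Run one episode greedily under the linear policy; return the discounted reward."""
      state = env.reset()  # assumed interface: reset() -> state
      total, discount = 0.0, 1.0
      for _ in range(horizon):
          # Act greedily under the current weights: this is the policy being searched,
          # not an attempt to estimate V or Q accurately.
          action = max(actions, key=lambda a: sum(weights.get(f, 0.0) * v
                                                  for f, v in features(state, a).items()))
          state, reward, done = env.step(action)  # assumed interface: step(a) -> (s', r, done)
          total += discount * reward
          discount *= gamma
          if done:
              break
      return total

  def policy_search(env, features, actions, weights, iterations=200, step=0.1, episodes=5):
      """Hill-climbing policy search: perturb the weights and keep the perturbation
      whenever the resulting policy earns more reward, regardless of how well it
      approximates future rewards."""
      def score(w):
          return sum(rollout_return(env, w, features, actions) for _ in range(episodes)) / episodes
      best = score(weights)
      for _ in range(iterations):
          candidate = {f: w + random.gauss(0, step) for f, w in weights.items()}
          candidate_score = score(candidate)
          if candidate_score > best:
              weights, best = candidate, candidate_score
      return weights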

  4. POMDPs
  • Up until now:
    • Search / MDPs: decision making when the world is fully observable (even if the actions are non-deterministic)
    • Probabilistic reasoning: computing beliefs in a static world
    • Learning: discovering how the world works
  • What about acting under uncertainty?
  • In general, the problem formalization is the partially observable Markov decision process (POMDP)
  • A simple case: value of information

  5. POMDPs
  [diagram: belief-state expectimax tree with belief nodes b, actions a, observations o, and successor beliefs b’]
  • MDPs have:
    • States S
    • Actions A
    • Transition fn P(s’|s,a) (or T(s,a,s’))
    • Rewards R(s,a,s’)
  • POMDPs add:
    • Observations O
    • Observation function P(o|s) (or O(s,o))
  • POMDPs are MDPs over belief states b (distributions over S); the belief update is sketched below
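
  The “MDP over belief states” view rests on the belief update: after taking action a in belief b and seeing observation o, the new belief b’ weights each successor state by the transition model and the observation model, then renormalizes. A minimal sketch under the definitions above; the dictionary representation and the function names T and O are assumptions for illustration.

  def belief_update(belief, action, observation, T, O):
      """Compute b'(s') proportional to O(s', o) * sum_s b(s) * T(s, a, s').

      belief: dict mapping state -> probability
      T(s, a, s_next): transition probability, as on the slide
      O(s, o): observation probability, as on the slide
      """
      new_belief = {}
      for s_next in belief:  # assumes the successor state set equals the current one
          prior = sum(p * T(s, action, s_next) for s, p in belief.items())
          new_belief[s_next] = O(s_next, observation) * prior
      total = sum(new_belief.values())
      if total == 0:
          raise ValueError("observation has zero probability under this belief and action")
      return {s: p / total for s, p in new_belief.items()}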

  6. Example: Ghostbusters
  [diagram: belief-state tree over evidence sets {e} and {e, e’}]
  • In (static) Ghostbusters:
    • Belief state determined by evidence to date {e}
    • Tree really over evidence sets
    • Probabilistic reasoning needed to predict new evidence given past evidence
  • Solving POMDPs:
    • One way: use truncated expectimax to compute approximate values of actions
    • What if you only considered busting, or one sense followed by a bust?
    • You get the VPI agent from project 4! (a minimal sketch follows below)
  [diagram: truncated expectimax comparing U(a_bust, {e}) against sensing e’ and then taking U(a_bust, {e, e’})]
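
  A minimal sketch of that truncated lookahead, assuming helper functions for the utility of busting in a state, the probability of each sensor reading under the current belief, and the belief update; none of these names come from the project code.

  def bust_utility(belief, utility_of_busting):
      """Expected utility of busting right now under the current belief."""
      return sum(p * utility_of_busting(s) for s, p in belief.items())

  def sense_then_bust_utility(belief, readings, reading_prob, update, utility_of_busting):
      """Expected utility of taking one sensor reading, updating the belief on it,
      and then busting: sum over readings o of P(o | belief) * U(bust, belief after o)."""
      value = 0.0
      for o in readings:
          p_o = reading_prob(o, belief)
          if p_o > 0:
              value += p_o * bust_utility(update(belief, o), utility_of_busting)
      return value

  def vpi_agent_action(belief, readings, reading_prob, update, utility_of_busting, sense_cost=0.0):
      """Sense only if the expected gain from the extra evidence exceeds its cost;
      otherwise bust now. This is the one-sense-then-bust truncation from the slide."""
      now = bust_utility(belief, utility_of_busting)
      later = sense_then_bust_utility(belief, readings, reading_prob, update, utility_of_busting)
      return 'sense' if later - sense_cost > now else 'bust'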

  7. More Generally
  • General solutions map belief functions to actions
  • Can divide belief space (the set of belief functions) into policy regions (gets complex quickly)
  • Can build approximate policies using discretization methods (a small sketch follows below)
  • Can factor belief functions in various ways
  • Overall, POMDPs are very (actually PSPACE-) hard
  • Most real problems are POMDPs, but we can rarely solve them in general!
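
  To make the discretization idea concrete: snap a continuous belief over a small state set onto a coarse probability grid and look up an action precomputed for that grid point. The grid policy is assumed to come from some offline solver; everything here is an illustrative assumption, not a method from the lecture.

  def snap_to_grid(belief, states, resolution=0.1):
      """Round a belief vector onto a coarse probability grid, then renormalize."""
      snapped = {s: round(belief.get(s, 0.0) / resolution) * resolution for s in states}
      total = sum(snapped.values()) or 1.0
      return tuple(round(snapped[s] / total, 6) for s in states)

  def discretized_policy(belief, states, grid_policy, default_action=None):
      """Act with the action stored for the nearest grid point.
      grid_policy: dict mapping snapped belief tuples -> actions (assumed precomputed)."""
      return grid_policy.get(snap_to_grid(belief, states), default_action)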

  8. Pacman Contest
  • 8 teams, 26 people qualified
  • 3rd Place: Niels Joubert, Michael Ngo, Rohit Nambiar, Tim Chen
  • What they did: split offense / defense
  • Strong offense: feature-based balance between eating dots and helping defend

  9. Pacman Contest
  • Blue Team: Yiding Jia
  • Red Team: William Li, York Wu
  • What they did (Yiding):
    • Reflex plus tracking!
    • Probabilistic inference with particle filtering, considering direct ghost observations and dot vanishings (a sketch follows below)
    • Defense: move toward distributions, hope to get better info and hunt, stay near remaining food
    • Offense: move toward guard-free dots, flee guard clouds
  • What they did (William, York):
    • … ?? [DEMO]
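
  The tracking described for the blue team follows the particle-filter pattern from the course projects: push particles through a motion model, reweight them by how well they explain the latest reading, and resample. A minimal sketch; the motion model and observation weighting functions are hypothetical stand-ins, not the team's code.

  import random

  def particle_filter_step(particles, motion_sample, observation_weight):
      """One elapse-time plus observe step of a particle filter over ghost positions.

      particles: list of hypothesized ghost positions
      motion_sample(pos): samples a successor position from an assumed motion model
      observation_weight(pos): likelihood of the current reading if the ghost were at pos
      """
      # Elapse time: move each particle under the motion model
      moved = [motion_sample(p) for p in particles]
      # Observe: weight particles by how well they explain the evidence
      weights = [observation_weight(p) for p in moved]
      if sum(weights) == 0:
          # Evidence is impossible under every particle; keep the moved particles
          # (a fuller implementation would reinitialize them uniformly)
          return moved
      # Resample in proportion to the weights
      return random.choices(moved, weights=weights, k=len(particles))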

  10. Example: Stratagus [DEMO]

  11. Hierarchical RL
  • Stratagus: example of a large RL task
  • Stratagus is hard for reinforcement learning algorithms:
    • > 10^100 states
    • > 10^30 actions at each point
    • Time horizon ≈ 10^4 steps
  • Stratagus is hard for human programmers:
    • Typically takes several person-months for game companies to write a computer opponent
    • Still, no match for experienced human players
    • Programming involves much trial and error
  • Hierarchical RL:
    • Humans supply high-level prior knowledge using a partial program
    • Learning algorithm fills in the details
  [From Bhaskara Marthi’s thesis]

  12. Partial “Alisp” Program

  (defun top ()
    (loop
      (choose
        (gather-wood)
        (gather-gold))))

  (defun gather-wood ()
    (with-choice (dest *forest-list*)
      (nav dest)
      (action 'get-wood)
      (nav *base-loc*)
      (action 'dropoff)))

  (defun gather-gold ()
    (with-choice (dest *goldmine-list*)
      (nav dest)
      (action 'get-gold)
      (nav *base-loc*)
      (action 'dropoff)))

  (defun nav (dest)
    (until (= (pos (get-state)) dest)
      (with-choice (move '(N S E W NOOP))
        (action move))))

  13. Hierarchical RL
  • Define a hierarchical Q-function which learns a linear feature-based mini-Q-function at each choice point (a minimal sketch follows below)
  • Very good at balancing resources and directing rewards to the right region
  • Still not very good at the strategic elements of these kinds of games (i.e. the Markov game aspect) [DEMO]
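
  A minimal sketch of “a linear feature-based mini-Q-function at each choice point”: keep a separate weight vector per choice point in the partial program, score the available choices with it, and nudge the active weights toward an observed target. The feature function, choice labels, and update target are illustrative assumptions, not Marthi’s actual implementation.

  from collections import defaultdict

  class HierarchicalQ:
      """One linear Q-function per choice point in the partial program."""

      def __init__(self, alpha=0.1):
          self.alpha = alpha
          # choice_point -> feature name -> weight
          self.weights = defaultdict(lambda: defaultdict(float))

      def q_value(self, choice_point, features):
          w = self.weights[choice_point]
          return sum(w[f] * v for f, v in features.items())

      def best_choice(self, choice_point, candidates, feature_fn):
          # candidates: the options offered by (choose ...) or (with-choice ...) here
          return max(candidates, key=lambda c: self.q_value(choice_point, feature_fn(c)))

      def update(self, choice_point, features, target):
          """Move Q(choice point, features) toward a learned target, e.g. observed reward-to-go."""
          error = target - self.q_value(choice_point, features)
          for f, v in features.items():
              self.weights[choice_point][f] += self.alpha * error * v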

  14. Bugman
  • AI = Animal Intelligence?
  • Wim van Eck at Leiden University
  • Pacman controlled by a human
  • Ghosts controlled by crickets
  • Vibrations drive crickets toward or away from Pacman’s location
  [DEMO] http://pong.hku.nl/~wim/bugman.htm

  15. Where to go next?
  • Congratulations, you’ve seen the basics of modern AI!
  • More directions:
    • Robotics / vision / IR / language: cs194 (there will be a web form to get more info, linked from the 188 page)
    • Machine learning: cs281a
    • Cognitive modeling: cog sci 131
    • NLP: cs288
    • Vision: cs280
    • … and more; ask if you’re interested

  16. That’s It!
  • Help us out with some course evaluations
  • Have a good break, and always maximize your expected utilities!
