RL Successes and Challenges in High-Dimensional Games

Presentation Transcript


  1. RL Successes and Challenges in High-Dimensional Games Gerry Tesauro IBM T.J. Watson Research Center

  2. Outline • Overview/Definition of “Games” • Why Study Games? • Commonalities of RL successes • RL in Classic Board Games • TD-Gammon, KnightCap, TD-Chinook, RLGO • RL in Robotics Games • Attacker/Defender Robots • Robocup Soccer • RL in Video/Online Games • AI Fighters • Open Discussion / Lessons Learned

  3. What Do We Mean by “Games” ?? • Some Definitions of “Game” • A structured activity, usually undertaken for enjoyment (Wikipedia) • Activity among decision-makers in seeking to achieve objectives in a limiting context (Clark Abt) • A form of play with goals and structure (Kevin Maroney) • Single-Player Game = “Puzzle” • “Competition” if players can’t interfere with other players’ performance • Olympic Hockey vs. Olympic Figure Skating • Common Ingredients: Players, Rules, Objective • But: some games have modifiable rules or no clear objective (e.g., MOOs)

  4. Why Use Games for RL/AI ?? • Clean, Idealized Models of Reality • Rules are clear and known (Samuel: not true in economically important problems) • Can build very good simulators • Clear Metric to Measure Progress • Tournament results, Elo ratings, etc. • Danger: Metric takes on a life of its own • Competition spurs progress • DARPA Grand Challenge, Netflix competition • Man vs. Machine Competition • “adds spice to the study” (Samuel) • “provides a convincing demonstration for those who do not believe that machines can learn” (Samuel)

  5. How Games Extend “Classic RL” • [Diagram: classic RL domains (backgammon, chess, etc.) extended along three axes: complex motivation (“motivated” RL), multi-agent game strategy, and lifelike environment, plus a fourth dimension: non-stationarity. Example games placed in this space: Poker, Chicken, Robocup Soccer, AI Fighters.]

  6. Ingredients for RL success • Several commonalities: • Problems are more-or-less MDPs (near full observability, little history dependence) • |S| is enormous → can’t do DP • State-space representation critical: use of “features” based on domain knowledge • Train in a simulator! Need lots of experience, but still << |S| • Smooth function approximation (linear or NN) → very aggressive generalization/extrapolation • Only visit plausible states; only generalize to plausible states

  7. RL + Gradient Parameter Training • Recall incremental Bellman updates (TD(0)): V(s) ← V(s) + α [r + γ V(s′) − V(s)] • If instead V(s) = Vθ(s), adjust θ to reduce the MSE (R − Vθ(s))² by gradient descent: Δθ = α (R − Vθ(s)) ∇θ Vθ(s)
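
For concreteness, here is a minimal Python sketch of the two updates on this slide: the tabular TD(0) rule and its gradient-descent counterpart for a parameterized value function Vθ(s). The feature map phi, step sizes, and function names below are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def td0_tabular(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """Incremental Bellman (TD(0)) update: move V(s) toward r + gamma * V(s')."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def td0_gradient(theta, phi, s, r, s_next, alpha=0.01, gamma=1.0):
    """Gradient version with a linear approximator V_theta(s) = theta . phi(s):
    reduce the squared error (R - V_theta(s))^2, where R is the bootstrapped
    target r + gamma * V_theta(s')."""
    v_s = theta @ phi(s)
    target = r + gamma * (theta @ phi(s_next))
    # grad_theta V_theta(s) = phi(s) for a linear value function
    theta += alpha * (target - v_s) * phi(s)
    return theta
```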

  8. TD() training of neural networks (episodic; =1 and intermediate r = 0):

  9. RL in Classic Board Games

  10. [Figure: backgammon board layout, board locations numbered 0–25, flanked by the Black bar (Bbar) and White bar (Wbar)]

  11. Learning backgammon using TD() • Neural net observes a sequence of input patterns x1, x2, x3, …, xf : sequence of board positions occurring during a game • Representation: Raw board description (# of White or Black checkers at each location) using simple truncated unary encoding (“hand-crafted features” added in later versions) • 1-D geometry → 28 board locations → 200 “raw” input units → 300 input units incl. features • Train neural net using gradient version of TD() • Trained NN output Vt = V (xt , w) should estimate prob (White wins | xt )

  12. TD-Gammon can teach itself by playing games against itself and learning from the outcome • Works even starting from random initial play and zero initial expert knowledge (surprising!) → achieves strong intermediate play • add hand-crafted features: advanced level of play (1991) • 2-ply search: strong master play (1993) • 3-ply search: superhuman play (1998)

  13. New TD-Gammon Results! (Tesauro, 1992)

  14. Extending TD(λ) to TDLeaf • Checkers and Chess: 2-D geometry, 64 board locations, dozens to thousands (Deep Blue) of features, linear function approximation • Samuel had the basic idea: train the value of the current state to match the minimax backed-up value • Proper mathematical formulation proposed by Beal & Smith; Baxter et al. • Baxter’s Chess program KnightCap showed rapid learning in play vs. humans: 1650→2150 Elo in only 300 games! • Schaeffer et al. retrained the weights of the Checkers program Chinook using TDLeaf + self-play; the learned weights proved as strong as the manually tuned weights (a 5-year effort)
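
A rough sketch of the TDLeaf(λ)-style update described here, assuming the caller has already run a minimax/α-β search from each position of a finished game and recorded the principal-variation leaf evaluation and its gradient (the function name, data layout, and step size are illustrative):

```python
import numpy as np

def tdleaf_update(theta, leaf_values, leaf_grads, alpha=1e-3, lam=0.7):
    """One TDLeaf(lambda)-style weight update over a finished game.

    leaf_values[t] : minimax backed-up evaluation of position t, i.e. the
                     value of the principal-variation leaf under theta
    leaf_grads[t]  : gradient of that leaf evaluation with respect to theta

    Each position's evaluation is nudged toward a lambda-weighted sum of the
    later backed-up evaluations (Beal & Smith; Baxter et al.).
    """
    T = len(leaf_values)
    deltas = [leaf_values[t + 1] - leaf_values[t] for t in range(T - 1)]
    update = np.zeros_like(theta)
    for t in range(T - 1):
        # lambda-discounted sum of future TD errors credited to position t
        credit = sum(lam ** (j - t) * deltas[j] for j in range(t, T - 1))
        update += credit * leaf_grads[t]
    return theta + alpha * update
```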

  15. RL in Computer Go • Go: 2-D geometry, 361 board locations, hundreds to millions (RLGO) of features, linear or NN function approximation • NeuroGo (M. Enzenberger, 1996; 2003) • Multiple reward signals: single-point eyes, connections and live points • Rating ~1880 in 9x9 Go using 3-ply α-β search • RLGO (D. Silver, 2008) uses only primitive local features and a linear value function. Can do live on-the-fly training for each new position encountered in a Go game! • Rating ~2130 in 9x9 Go using α-β search (avg. depth ~6): strongest program not based on Monte-Carlo Tree Search
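
A minimal sketch of the kind of linear value function over binary local-shape features mentioned here, trained with a TD(0)-style update; this is much simpler than RLGO itself, and the feature extractor (the index lists passed in) is a hypothetical helper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LinearShapeValue:
    """Linear value function over binary local-shape features:
    V(s) = sigmoid(sum of weights of the shapes present in s)."""
    def __init__(self, n_features, alpha=0.01):
        self.theta = np.zeros(n_features)
        self.alpha = alpha

    def value(self, active):
        # 'active' is the list of feature indices present in the position
        return sigmoid(self.theta[active].sum())

    def td0_update(self, active_t, active_next, reward=0.0, terminal=False):
        v_t = self.value(active_t)
        target = reward if terminal else self.value(active_next)
        # gradient of sigmoid(theta . phi) w.r.t. each active weight is v(1 - v)
        self.theta[active_t] += self.alpha * (target - v_t) * v_t * (1.0 - v_t)
```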

  16. RL in Robotics Games

  17. Robot Air Hockey • video at: http://www.cns.atr.jp/~dbent/mpeg/hockeyfullsmall.avi • D. Bentivegna & C. Atkeson, ICRA 2001 • 2-D spatial problem • 30 degree-of-freedom arm, 420 decisions/sec • hand-built primitives, supervised learning + RL

  18. WoLF in Adversarial Robot Learning • Gra-WoLF (Bowling & Veloso): Combines WoLF (“Win or Learn Fast”) principle with policy gradient RL (Sutton et al., 2000) • again 2-D spatial geometry, 7 input features, 16 CMAC tiles • video at: http://webdocs.cs.ualberta.ca/~bowling/videos/AdversarialRobotLearning.mp4
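
The WoLF principle itself is easy to state in code: adapt the policy slowly when “winning” (the current policy scores at least as well against its own value estimates as the historical average policy does) and quickly when losing. Below is a sketch in the simpler policy-hill-climbing form; Gra-WoLF applies the same variable learning rate to policy-gradient parameters, and the constants here are illustrative.

```python
import numpy as np

def wolf_phc_step(policy, avg_policy, q_values, delta_win=0.01, delta_lose=0.04):
    """One 'Win or Learn Fast' policy-hill-climbing step: move probability
    mass toward the greedy action, with a small step when winning and a
    larger step when losing."""
    winning = policy @ q_values >= avg_policy @ q_values
    delta = delta_win if winning else delta_lose
    greedy = int(np.argmax(q_values))
    new_policy = np.maximum(policy - delta / (len(policy) - 1), 0.0)
    new_policy[greedy] = policy[greedy] + delta
    return new_policy / new_policy.sum()   # keep it a valid distribution
```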

  19. RL in Robocup Soccer • Once again, 2-D spatial geometry • Much good work by Peter Stone et al. • TPOT-RL: Learned advanced team strategies given limited observability – key to CMUnited victories in late 90s • Fast Gait for Sony Aibo dogs • Ball Acquisition for Sony Aibo dogs • Keepaway in Robocup simulation league

  20. Robocup “Keepaway” Game (Stone et al.) • Uses Robocup simulator, not real robots • Task: one team (“keepers”) tries to maintain possession of the ball as long as possible, while the other team (“takers”) tries to take it away • Keepers are trained using a continuous-time, semi-Markov version of the Sarsa algorithm • Represent Q(s,a) using CMAC (coarse tile coding) function approximation • State representation: small # of distances and angles between teammates, opponents, and ball • Reward = time of possession • Results: learned policies do much better than either random or hand-coded policies, e.g. on 25x25 field: • learned TOP 15.0 sec, hand-coded 8.0 sec, random 6.4 sec
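
A condensed sketch of Sarsa with CMAC (tile coding) function approximation of the kind used for keepaway. The tiles(s, a) helper, exploration scheme, and constants are assumptions, and the actual keepaway learners use a semi-Markov (SMDP) variant with time-of-possession rewards.

```python
import numpy as np

class TileCodedSarsa:
    """Sarsa with a linear value function over binary tile features:
    Q(s, a) is the sum of the weights of the tiles activated by (s, a)."""
    def __init__(self, n_weights, n_actions, alpha=0.1, gamma=1.0, epsilon=0.05):
        self.w = np.zeros(n_weights)
        self.n_actions = n_actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def q(self, tiles_sa):
        return self.w[tiles_sa].sum()

    def choose_action(self, s, tiles):
        # epsilon-greedy over the tile-coded Q estimates
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return max(range(self.n_actions), key=lambda a: self.q(tiles(s, a)))

    def update(self, tiles_sa, reward, tiles_next_sa=None, terminal=False):
        # Sarsa target uses the action actually taken in the next state
        target = reward if terminal else reward + self.gamma * self.q(tiles_next_sa)
        delta = target - self.q(tiles_sa)
        # each active tile shares the step size equally
        self.w[tiles_sa] += (self.alpha / len(tiles_sa)) * delta
```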

  21. RL in Video Games

  22. AI Fighters • Graepel, Herbrich & Gold, 2004 – used commercial game platform Tao Feng (runs on Xbox): real time simulator (3D!) • basic feature set + SARSA + linear value function • multiple challenges of environment (real time, concurrency,…): • opponent state not known exactly • agent state and reward not known exactly • due to game animation, legal moves are not known

  23. Links to AI Fighters videos: before training: http://research.microsoft.com/en-us/projects/mlgames2008/taofengearlyaggressive.wmv after training: http://research.microsoft.com/en-us/projects/mlgames2008/taofenglateaggressive.wmv

  24. Discussion / Lessons Learned ?? • Winning formula: hand-designed features (fairly small number) + aggressive smooth function approx. • Researchers should try raw-input comparisons and nonlinear function approx. • Many/most state variables in real problems seem pretty irrelevant • Opportunity to try recent linear and/or nonlinear Dimensionality Reduction algorithms • Sparsity constraints (L1 regularization etc.) also promising • Brain/retina architecture impressively suited for 2-D spatial problems • More studies using Convolutional Neural Nets etc.
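
To make the sparsity suggestion concrete, here is a tiny sketch of fitting a sparse linear value function with an L1 penalty via proximal gradient descent (ISTA); the data layout, penalty strength, and step size are assumptions for illustration only.

```python
import numpy as np

def sparse_value_fit(Phi, targets, lam=0.01, alpha=0.01, n_iters=1000):
    """Fit V(s) ~ theta . phi(s) by minimizing squared error plus an L1
    penalty; the soft-thresholding step drives weights of irrelevant
    features to exactly zero. Phi: (n_samples, n_features) feature matrix."""
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        grad = Phi.T @ (Phi @ theta - targets) / len(targets)
        theta = theta - alpha * grad
        # proximal operator of the L1 penalty (soft-thresholding)
        theta = np.sign(theta) * np.maximum(np.abs(theta) - alpha * lam, 0.0)
    return theta
```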
