RL Successes and Challenges in High-Dimensional Games Gerry Tesauro IBM T.J.Watson Research Center
Outline • Overview/Definition of “Games” • Why Study Games? • Commonalities of RL successes • RL in Classic Board Games • TD-Gammon, KnightCap, TD-Chinook, RLGO • RL in Robotics Games • Attacker/Defender Robots • Robocup Soccer • RL in Video/Online Games • AI Fighters • Open Discussion / Lessons Learned
What Do We Mean by “Games” ?? • Some Definitions of “Game” • A structured activity, usually undertaken for enjoyment (Wikipedia) • Activity among decision-makers seeking to achieve objectives in a limiting context (Clark Abt) • A form of play with goals and structure (Kevin Maroney) • Single-Player Game = “Puzzle” • “Competition” if players can’t interfere with other players’ performance • Olympic Hockey vs. Olympic Figure Skating • Common Ingredients: Players, Rules, Objective • But: some games have modifiable rules and no clear objective (e.g., MOOs)
Why Use Games for RL/AI ?? • Clean, Idealized Models of Reality • Rules are clear and known (Samuel: not true in economically important problems) • Can build very good simulators • Clear Metric to Measure Progress • Tournament results, Elo ratings, etc. • Danger: Metric takes on a life of its own • Competition spurs progress • DARPA Grand Challenge, Netflix competition • Man vs. Machine Competition • “adds spice to the study” (Samuel) • “provides a convincing demonstration for those who do not believe that machines can learn” (Samuel)
How Games Extend “Classic RL” • [Figure: games plotted along the axes extending classic RL] • Multi-agent game strategy: Poker, Chicken • Lifelike environment: Robocup Soccer, AI Fighters • Complex motivation: “motivated” RL • Fourth dimension: non-stationarity • At the origin (classic RL): backgammon, chess, etc.
Ingredients for RL success • Several commonalities: • Problems are more-or-less MDPs (near full observability, little history dependence) • |S| is enormous → can’t do DP • State-space representation critical: use of “features” based on domain knowledge • Train in a simulator! Need lots of experience, but still << |S| • Smooth function approximation (linear or NN) → very aggressive generalization/extrapolation • Only visit plausible states; only generalize to plausible states
RL + Gradient Parameter Training • Recall incremental Bellman updates (TD(0)): V(s) ← V(s) + α [ r + γ V(s′) − V(s) ] • If instead V(s) = V(s, w), adjust w to reduce MSE (R − V(s, w))² by gradient descent: Δw = α (R − V(s, w)) ∇w V(s, w)
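The TD(0) update and its gradient-descent form can be sketched for a linear value function (a stand-in; the slide's V may be any differentiable approximator — all names below are illustrative):

```python
import numpy as np

def td0_update(w, features, s, s_next, r, alpha=0.1, gamma=1.0, terminal=False):
    """One TD(0) step for a linear value function V(s) = w . phi(s).

    The Bellman target r + gamma * V(s') stands in for the unknown return R,
    and w moves along the gradient of the squared error toward it.
    """
    phi = features(s)
    v = w @ phi
    v_next = 0.0 if terminal else w @ features(s_next)
    delta = r + gamma * v_next - v          # TD error
    return w + alpha * delta * phi          # grad_w V(s) = phi(s) for linear V

# toy 2-state chain: state 0 -> state 1 -> terminal (reward 1 on the last step)
features = lambda s: np.eye(2)[s]
w = np.zeros(2)
for _ in range(100):
    w = td0_update(w, features, 0, 1, 0.0)                   # nonterminal step
    w = td0_update(w, features, 1, None, 1.0, terminal=True)  # terminal step
```

With γ = 1 both state values converge to 1, the value of state 0 bootstrapping off the learned value of state 1.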
TD(λ) training of neural networks (episodic; γ = 1 and intermediate r = 0): Δwt = α (Vt+1 − Vt) Σk=1..t λ^(t−k) ∇w Vk
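A minimal sketch of the episodic TD(λ) update, using a linear value function as a stand-in for the neural net (all names illustrative); the eligibility trace accumulates the λ-discounted gradients of earlier predictions:

```python
import numpy as np

def td_lambda_episode(w, states, z, alpha=0.1, lam=0.7):
    """TD(lambda) weight updates over one episode: gamma = 1, intermediate
    rewards 0, final outcome z. Each step applies
    Delta w = alpha * (V_{t+1} - V_t) * e_t, where the trace e_t sums
    lambda^(t-k) * grad_w V_k over all earlier steps k."""
    e = np.zeros_like(w)
    for t in range(len(states)):
        phi = states[t]
        e = lam * e + phi                      # grad_w V = phi for linear V
        v_t = w @ phi
        v_next = z if t == len(states) - 1 else w @ states[t + 1]
        w = w + alpha * (v_next - v_t) * e
    return w

# toy usage: a one-step episode whose outcome is always 1
w = np.zeros(1)
for _ in range(200):
    w = td_lambda_episode(w, [np.array([1.0])], z=1.0)
```

The single learned value converges to the outcome 1, since every update moves it a fraction α of the remaining error.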
[Figure: backgammon board — locations numbered 0–25 plus the White and Black bars (Wbar, Bbar)]
Learning backgammon using TD(λ) • Neural net observes a sequence of input patterns x1, x2, x3, …, xf : sequence of board positions occurring during a game • Representation: Raw board description (# of White or Black checkers at each location) using simple truncated unary encoding (“hand-crafted features” added in later versions) • 1-D geometry → 28 board locations → 200 “raw” input units → 300 input units incl. features • Train neural net using gradient version of TD(λ) • Trained NN output Vt = V(xt, w) should estimate Prob(White wins | xt)
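The truncated unary encoding of one board location can be sketched as follows; the choice of 4 units per point (three unary thresholds plus a graded overflow unit) follows common descriptions of TD-Gammon's raw input but should be treated as illustrative:

```python
def encode_point(n_checkers, k=4):
    """Truncated unary encoding of one board point for one player:
    unit i (i < k-1) turns on if more than i checkers are present, and the
    last unit grows linearly with the overflow beyond k-1 checkers.
    The overflow scaling of 1/2 is an assumption for illustration."""
    units = [1.0 if n_checkers > i else 0.0 for i in range(k - 1)]
    units.append(max(0.0, (n_checkers - (k - 1)) / 2.0))  # graded overflow unit
    return units
```

For example, 2 checkers light up the first two unary units, while 5 checkers saturate all three and put 1.0 on the overflow unit.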
TD-Gammon can teach itself by playing games against itself and learning from the outcome • Works even starting from random initial play and zero initial expert knowledge (surprising!) → achieves strong intermediate play • add hand-crafted features: advanced level of play (1991) • 2-ply search: strong master play (1993) • 3-ply search: superhuman play (1998)
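The self-play loop can be illustrated on a toy game; everything below (the line-world, the epsilon-greedy afterstate choice, the Monte-Carlo-style outcome update) is a sketch of the idea, not TD-Gammon's actual code:

```python
import random
import numpy as np

# Toy "game" standing in for backgammon: states 0..4 on a line,
# state 4 is a win (outcome 1), state 0 a loss (outcome 0).
def outcome(s):
    return {0: 0.0, 4: 1.0}.get(s)

phi = lambda s: np.eye(5)[s]   # one-hot state features, linear V = w . phi

def self_play_episode(w, alpha=0.1, epsilon=0.3):
    """Play one game greedily against the agent's own value estimate, then
    nudge every visited state's value toward the final outcome: improved
    evaluation and improved play bootstrap off each other."""
    s, history = 2, [2]
    while outcome(s) is None:
        if random.random() < epsilon:
            a = random.choice([-1, +1])
        else:  # pick the afterstate the current value function prefers
            a = max([-1, +1], key=lambda a: w @ phi(s + a))
        s += a
        history.append(s)
    z = outcome(s)
    for s_t in history:                    # train V toward the game outcome
        w = w + alpha * (z - w @ phi(s_t)) * phi(s_t)
    return w

random.seed(0)
w = np.zeros(5)
for _ in range(500):
    w = self_play_episode(w)
```

Starting from a zero (random-play) value function, exploration eventually stumbles into a win, after which the greedy policy locks onto the winning side of the board.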
New TD-Gammon Results! (Tesauro, 1992)
Extending TD(λ) to TDLeaf • Checkers and Chess: 2-D geometry, 64 board locations, dozens to thousands (Deep Blue) of features, linear function approximation • Samuel had the basic idea: train value of current state to match minimax backed-up value • Proper mathematical formulation proposed by Beal & Smith; Baxter et al. • Baxter’s Chess program KnightCap showed rapid learning in play vs. humans: 1650→2150 Elo in only 300 games! • Schaeffer et al. retrained weights of Checkers program Chinook using TDLeaf + self-play; as strong as manually tuned weights (5 year effort)
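A sketch of the TDLeaf idea under a linear evaluation, on a hypothetical toy tree (KnightCap and Chinook accumulate λ-weighted differences over whole games; here a single update is shown):

```python
import numpy as np

def minimax_leaf(s, depth, w, phi, children, to_move):
    """Depth-limited minimax returning (backed-up value, principal leaf).
    The leaf is the position whose gradient TDLeaf actually uses."""
    kids = children(s)
    if depth == 0 or not kids:
        return w @ phi(s), s
    results = [minimax_leaf(c, depth - 1, w, phi, children, -to_move)
               for c in kids]
    pick = max if to_move == 1 else min
    return pick(results, key=lambda r: r[0])

def tdleaf_update(w, phi, s, s_next, depth, children, alpha=0.01, to_move=1):
    """One TDLeaf step: move the evaluation at the current principal-variation
    leaf toward the backed-up value of the successor position
    (linear V, so grad_w V(leaf) = phi(leaf))."""
    v, leaf = minimax_leaf(s, depth, w, phi, children, to_move)
    v_next, _ = minimax_leaf(s_next, depth, w, phi, children, -to_move)
    return w + alpha * (v_next - v) * phi(leaf)

# toy tree: root 'r' with leaves 'x' (value 0.2) and 'y' (value 0.7)
children = lambda s: {'r': ['x', 'y']}.get(s, [])
phi = lambda s: np.array([float(s == 'x'), float(s == 'y'), float(s == 'r')])
w = np.array([0.2, 0.7, 0.0])
v, leaf = minimax_leaf('r', 2, w, phi, children, 1)
```

On the toy tree, the maximizing root backs up 0.7 with principal leaf 'y'; an update toward a weaker successor then lowers the weight of that leaf, not of the root.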
RL in Computer Go • Go: 2-D geometry, 361 board locations, hundreds to millions (RLGO) of features, linear or NN function approximation • NeuroGo (M. Enzenberger, 1996; 2003) • Multiple reward signals: single-point eyes, connections and live points • Rating ~1880 in 9x9 Go using 3-ply α-β search • RLGO (D. Silver, 2008) uses only primitive local features and a linear value function. Can do live on-the-fly training for each new position encountered in a Go game! • Rating ~2130 in 9x9 Go using α-β search (avg. depth ~6): strongest program not based on Monte-Carlo Tree Search
Robot Air Hockey • video at: http://www.cns.atr.jp/~dbent/mpeg/hockeyfullsmall.avi • D. Bentivegna & C. Atkeson, ICRA 2001 • 2-D spatial problem • 30 degree-of-freedom arm, 420 decisions/sec • hand-built primitives, supervised learning + RL
WoLF in Adversarial Robot Learning • Gra-WoLF (Bowling & Veloso): Combines WoLF (“Win or Learn Fast”) principle with policy gradient RL (Sutton et al., 2000) • again 2-D spatial geometry, 7 input features, 16 CMAC tiles • video at: http://webdocs.cs.ualberta.ca/~bowling/videos/AdversarialRobotLearning.mp4
RL in Robocup Soccer • Once again, 2-D spatial geometry • Much good work by Peter Stone et al. • TPOT-RL: Learned advanced team strategies given limited observability – key to CMUnited victories in late 90s • Fast Gait for Sony Aibo dogs • Ball Acquisition for Sony Aibo dogs • Keepaway in Robocup simulation league
Robocup “Keepaway” Game (Stone et al.) • Uses Robocup simulator, not real robots • Task: one team (“keepers”) tries to maintain possession of the ball as long as possible, other team (“takers”) try to take away • Keepers are trained using continuous-time, semi-Markov version of Sarsa algorithm • Represent Q(s,a) using CMAC (coarse tile coding) function approximation • State representation: small # of distances and angles between teammates, opponents, and ball • Reward = time of possession • Results: learned policies do much better than either random or hand-coded policies, e.g. on 25x25 field: • learned TOP 15.0 sec, hand-coded 8.0 sec, random 6.4 sec
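The Sarsa + CMAC combination can be sketched in one dimension (Keepaway tiles several distances and angles; the sizes, names, and toy usage below are illustrative, not the Keepaway code):

```python
import numpy as np

def tile_features(x, n_tilings=4, tiles_per_dim=8, lo=0.0, hi=1.0):
    """Minimal 1-D CMAC: overlapping uniform grids, each shifted by a
    fraction of a tile width; returns one active tile index per tiling."""
    width = (hi - lo) / tiles_per_dim
    idx = []
    for t in range(n_tilings):
        offset = t * width / n_tilings
        i = min(int((x - lo + offset) / width), tiles_per_dim)  # overflow tile
        idx.append(t * (tiles_per_dim + 1) + i)
    return idx

def sarsa_update(w, idx, a, r, idx_next, a_next,
                 alpha=0.1, gamma=1.0, done=False):
    """Sarsa over tile features: Q(s,a) is the sum of one weight per active
    tile, so each update touches only those few weights while nearby states
    share tiles and therefore generalize."""
    q = sum(w[a][i] for i in idx)
    q_next = 0.0 if done else sum(w[a_next][i] for i in idx_next)
    delta = r + gamma * q_next - q
    for i in idx:
        w[a][i] += (alpha / len(idx)) * delta
    return w

# drive Q(x=0.5, action 0) toward a terminal reward of 1
w = {a: np.zeros(4 * 9) for a in range(2)}
idx = tile_features(0.5)
for _ in range(100):
    w = sarsa_update(w, idx, 0, 1.0, idx, 0, done=True)
q = sum(w[0][i] for i in idx)
```

Each tiling contributes one active tile, so a single point activates 4 of the 36 weights; dividing α by the number of active tiles keeps the effective step size independent of the number of tilings.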
AI Fighters • Graepel, Herbrich & Gold, 2004 – used commercial game platform Tao Feng (runs on Xbox): real time simulator (3D!) • basic feature set + SARSA + linear value function • multiple challenges of environment (real time, concurrency,…): • opponent state not known exactly • agent state and reward not known exactly • due to game animation, legal moves are not known
Links to AI Fighters videos: before training: http://research.microsoft.com/en-us/projects/mlgames2008/taofengearlyaggressive.wmv after training: http://research.microsoft.com/en-us/projects/mlgames2008/taofenglateaggressive.wmv
Discussion / Lessons Learned ?? • Winning formula: • hand-designed features (fairly small number) • aggressive smooth function approx. • Researchers should try raw-input comparisons and nonlinear function approx. • Many/most state variables in real problems seem pretty irrelevant • Opportunity to try recent linear and/or nonlinear Dimensionality Reduction algorithms • Sparsity constraints (L1 regularization etc.) also promising • Brain/retina architecture impressively well suited to 2-D spatial problems • More studies using Convolutional Neural Nets etc.