ECE 517: Reinforcement Learning in Artificial Intelligence, Lecture 19: Case Studies • November 10, 2010 • Dr. Itamar Arel • College of Engineering, Department of Electrical Engineering and Computer Science, The University of Tennessee • Fall 2010
Final Project Recap • Requirements: • Presentation • In-class 15 minute presentation + 5 minutes for questions • Presentation assignment slots have been posted on the website • Project report – due Friday, Dec 3rd • Comprehensive documentation of your work • Recall that the Final Project is 30% of the course grade!
Introduction • We’ll discuss several case studies of reinforcement learning • The intention is to illustrate some of the trade-offs and issues that arise in real applications • For example, we emphasize how domain knowledge is incorporated into the formulation and solution of the problem • We also highlight the representation issues that are so often critical to successful applications • Applications of reinforcement learning are still far from routine and typically require as much art as science • Making applications easier and more straightforward is one of the goals of current research in reinforcement learning
TD-Gammon (Tesauro, 1992, 1994, 1995, …) • One of the most impressive applications of RL to date is Gerry Tesauro's (IBM) backgammon player, TD-Gammon • TD-Gammon required little backgammon knowledge, yet learned to play extremely well, near the level of the world's strongest grandmasters • The learning algorithm was a straightforward combination of the TD(λ) algorithm and nonlinear function approximation • FA using a FFNN trained by backpropagating TD errors • There are probably more professional backgammon players than there are professional chess players • Backgammon is in part a game of chance; it can be viewed as a large MDP
TD-Gammon (cont.) • The game is played with 15 white and 15 black pieces on a board of 24 locations, called points • Here’s a typical position early in the game, seen from the perspective of the white player
TD-Gammon (cont.) • White has just rolled a 5 and a 2, so it can move one of its pieces 5 steps and one (possibly the same piece) 2 steps • The objective is to advance all pieces to points 19–24, and then off the board • Hitting – removal of a single opposing piece • 30 pieces and 24 locations imply an enormous number of configurations (the state set is ~10^20) • Effective branching factor of ~400, considering that each dice roll has ~20 possibilities (see the rough arithmetic below)
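Reading the slide's two rough figures together gives the quoted branching factor; both counts below are the slide's approximations, not exact combinatorics:

```latex
% Rough arithmetic behind the quoted branching factor (approximate counts)
\[
\underbrace{\approx 20}_{\text{distinct dice outcomes}}
\;\times\;
\underbrace{\approx 20}_{\text{legal plays per outcome}}
\;\approx\; 400 \text{ successor positions per move},
\]
over a state set on the order of $10^{20}$ configurations.
```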
TD-Gammon – details • Although the game is highly stochastic, a complete description of the game's state is available at all times • The estimated value of any state was meant to predict the probability of winning starting from that state • Reward: 0 at all times except those in which the game is won, when it is 1 • Episodic (game = episode), undiscounted • Non-linear form of TD(λ) using a FF neural network • Weights initialized to small random numbers • Backpropagation of TD errors • Four input units for each point; unary encoding of the number of white pieces, plus other features • Use of afterstates • Learning during self-play – fully incremental (see the sketch below)
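To make the update concrete, here is a minimal sketch of a TD(λ) value-learning loop with a small feed-forward network and eligibility traces, in the spirit of the description above. The layer sizes, step size, trace decay, and the random feature vectors standing in for encoded afterstates are all illustrative assumptions, not Tesauro's actual architecture or code.

```python
# Minimal TD(lambda) value-learning sketch in the spirit of TD-Gammon:
# a small feed-forward network predicts P(win) for a board feature vector,
# and eligibility traces propagate the TD error back to earlier weights.
# The 198-unit board encoding and self-play move selection are NOT shown;
# random feature vectors stand in for encoded afterstates.
import numpy as np

rng = np.random.default_rng(0)

N_IN, N_HID = 198, 40          # TD-Gammon-style input size, small hidden layer
ALPHA, LAMBDA = 0.1, 0.7       # step size and trace decay (illustrative values)

# Weights initialized to small random numbers, as on the slide
W1 = rng.normal(scale=0.1, size=(N_HID, N_IN))
W2 = rng.normal(scale=0.1, size=(1, N_HID))

def predict(x):
    """Forward pass: estimated probability of winning from features x."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x)))        # sigmoid hidden layer
    v = 1.0 / (1.0 + np.exp(-(W2 @ h)))        # sigmoid output in [0, 1]
    return float(v[0]), h

def gradients(x, h, v):
    """Gradient of the scalar output w.r.t. both weight matrices."""
    dv = v * (1.0 - v)                         # sigmoid derivative at the output
    gW2 = dv * h[None, :]
    dh = (W2[0] * dv) * h * (1.0 - h)          # backprop through the hidden layer
    gW1 = np.outer(dh, x)
    return gW1, gW2

def play_episode(n_steps=50):
    """One simulated 'game': random afterstate features, win/loss reward at the end."""
    global W1, W2
    e1, e2 = np.zeros_like(W1), np.zeros_like(W2)     # eligibility traces
    x = rng.random(N_IN)
    v, h = predict(x)
    for t in range(n_steps):
        terminal = (t == n_steps - 1)
        x_next = rng.random(N_IN)
        # Stand-in outcome: in TD-Gammon the reward is 1 only when the game is won
        r = float(rng.random() < 0.5) if terminal else 0.0
        v_next = 0.0 if terminal else predict(x_next)[0]
        delta = r + v_next - v                        # undiscounted TD error
        g1, g2 = gradients(x, h, v)
        e1 = LAMBDA * e1 + g1                         # accumulate traces
        e2 = LAMBDA * e2 + g2
        W1 += ALPHA * delta * e1                      # fully incremental update
        W2 += ALPHA * delta * e2
        x = x_next
        v, h = predict(x)

play_episode()
print("sample value estimate:", predict(rng.random(N_IN))[0])
```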
Summary of TD-Gammon Results • Two players (copies of the same network) played against each other • Each had no prior knowledge of the game • Only the rules of the game were prescribed • Humans learn from machines: TD-Gammon learned to play certain opening positions differently than was the convention among the best human players
Rebuttal on TD-Gammon • For an alternative view, see "Why did TD-Gammon Work?", Jordan Pollack and Alan Blair, NIPS 9 (1997) • Claim: it was the "co-evolutionary training strategy, playing games against itself, which led to the success" • In their view, almost any such self-play approach would work on backgammon • The success does not extend to other problems • e.g. Tetris, maze-type problems, where the exploration issue comes up
The Acrobot • Robotic application of RL • Roughly analogous to a gymnast swinging on a high bar • The first joint (corresponding to the hands on the bar) cannot exert torque • The second joint (corresponding to the gymnast bending at the waist) can • This system has been widely studied by control engineers and machine learning researchers
The Acrobot (cont.) • One objective for controlling the Acrobot is to swing the tip (the "feet") above the first joint by an amount equal to the length of one of the links, in minimum time • In this task, the torque applied at the second joint is limited to three choices: positive torque of a fixed magnitude, negative torque of the same magnitude, or no torque • A reward of –1 is given on all time steps until the goal is reached, which ends the episode. No discounting is used • Thus, the optimal value of any state is the negative of the minimum time to reach the goal (an integer number of steps) • Sutton (1996) addressed the Acrobot swing-up task in an on-line, model-free context (a sketch of such an agent follows below)
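As a rough illustration of the kind of on-line, model-free agent the slide refers to, one common choice for this task is linear Sarsa(λ) with tile-coded features over the four continuous state variables and the three torque actions. The hashed tile coder, learning constants, and the toy random-walk dynamics (`toy_step`) below are placeholders so the loop runs; they are not the Acrobot's actual dynamics or Sutton's original setup.

```python
# Sketch of a linear Sarsa(lambda) agent with a simple hashed tile coding,
# illustrating the on-line, model-free style of solution named on the slide.
# `toy_step` is a stand-in random-walk environment, NOT the acrobot dynamics.
import numpy as np

rng = np.random.default_rng(1)

ACTIONS = [-1.0, 0.0, 1.0]        # negative, zero, positive torque at joint 2
N_TILINGS, TILES_PER_DIM = 8, 6
N_FEATURES = 4096                 # hashed feature table size
ALPHA = 0.1 / N_TILINGS
LAMBDA = 0.9
EPSILON = 0.0                     # greedy: zero-initialized values are optimistic
                                  # because every step yields reward -1

w = np.zeros((len(ACTIONS), N_FEATURES))

def active_tiles(state):
    """Hash each tiling's grid cell for the 4-D state into a feature index."""
    idxs = []
    for t in range(N_TILINGS):
        offset = t / N_TILINGS
        cell = tuple(int((s + offset) * TILES_PER_DIM) for s in state)
        idxs.append(hash((t,) + cell) % N_FEATURES)
    return idxs

def q(state, a):
    return w[a, active_tiles(state)].sum()

def choose(state):
    if rng.random() < EPSILON:
        return int(rng.integers(len(ACTIONS)))
    return int(np.argmax([q(state, a) for a in range(len(ACTIONS))]))

def toy_step(state, a):
    """Stand-in dynamics: random drift nudged by the chosen torque."""
    nxt = np.clip(state + 0.02 * ACTIONS[a] + 0.05 * rng.standard_normal(4), 0, 1)
    done = nxt[0] > 0.95          # placeholder goal ("tip swung high enough")
    return nxt, -1.0, done        # -1 per step until the goal, as on the slide

for episode in range(20):
    z = np.zeros_like(w)                      # eligibility traces
    s = rng.random(4)
    a = choose(s)
    done, steps = False, 0
    while not done and steps < 500:
        s2, r, done = toy_step(s, a)
        delta = r - q(s, a)
        z[a, active_tiles(s)] = 1.0           # replacing traces for the taken action
        if not done:
            a2 = choose(s2)
            delta += q(s2, a2)                # undiscounted Sarsa backup
            a = a2
        w += ALPHA * delta * z
        z *= LAMBDA
        s = s2
        steps += 1

print("sample Q-value after training:", q(rng.random(4), 0))
```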
RL in Robotics • Robot motor capabilities have been investigated using RL • Walking, grabbing and delivering • MIT Media Lab • RoboCup competitions – soccer games • Sony AIBOs are commonly employed • Maze-type problems • Balancing themselves on an unstable platform • Multi-dimensional input streams • Hopefully some new applications soon
Introduction to Wireless Sensor Networks (WSN) • A sensor network is composed of a large number of sensor nodes, which are densely deployed either inside the phenomenon of interest or very close to it • Random deployment • Cooperative capabilities • May be wireless or wired; however, most modern applications require wireless communications • May be mobile or static • Main challenge: maximize the life of the network under battery constraints!
Nodes we have here at the lab: Intel Mote, UCB TelosB
Energy Consumption in WSN • Sources of energy consumption • Sensing • Computation • Communication (dominant) • Energy wasted on communications • Collisions (packet retransmissions increase energy consumption) • Idle listening (listening to the channel when the node is not intending to transmit) • Communication overhead (the communication cost of the MAC protocol itself) • Overhearing (receiving packets that are destined for other nodes)
MAC-related problems in WSN • Goal: to schedule or coordinate the communications among multiple nodes sharing the same wireless radio frequency • Hidden terminal problem: nodes 5 and 3 both want to transmit data to node 1; since node 3 is out of the communication range of node 5, if the two transmissions occur simultaneously, node 1 will experience a collision • Exposed terminal problem: when node 1 sends data to node 3, node 5 also overhears it, so the transmission from node 6 to node 5 is constrained
S-MAC – Example of a WSN MAC Protocol • S-MAC, by Ye, Heidemann and Estrin (2003) • Tradeoffs: latency vs. fairness vs. energy • Major components in S-MAC • Periodic listen and sleep (a duty-cycling sketch follows below) • Collision avoidance • Overhearing avoidance • Message passing
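A minimal sketch of the periodic listen/sleep idea, assuming placeholder frame and listen-window lengths. It only illustrates how the duty cycle trades energy (time awake) against opportunities to exchange packets; it is not S-MAC's actual mechanism (no SYNC packets, RTS/CTS, or adaptive listening are modeled).

```python
# Toy illustration of periodic listen/sleep duty cycling: each node wakes for a
# short listen window at the start of every frame and sleeps for the remainder,
# staying awake a bit longer only when there is traffic to exchange.
# Frame length and listen window are placeholders, not S-MAC's parameters.
import numpy as np

rng = np.random.default_rng(3)

FRAME_MS = 1000          # frame length (placeholder)
LISTEN_MS = 100          # listen window per frame -> nominal 10% duty cycle

def one_frame(node_has_data, neighbor_sends):
    """Return (ms spent awake, packets exchanged) for one frame at one node."""
    awake = LISTEN_MS
    exchanged = 0
    if node_has_data or neighbor_sends:
        awake += 50      # stay awake to complete the exchange, then sleep again
        exchanged += 1
    return awake, exchanged

awake_total, pkts = 0, 0
for _ in range(100):
    a, p = one_frame(node_has_data=rng.random() < 0.2,
                     neighbor_sends=rng.random() < 0.2)
    awake_total += a
    pkts += p

print(f"effective duty cycle ~ {awake_total / (100 * FRAME_MS):.1%}, packets: {pkts}")
```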
RL-MAC (Z. Liu, I. Arel, 2005) • Formulate the MAC problem as an RL problem • Similar frame-based structure as in S-MAC/T-MAC • Each node infers the state of other nodes as part of its decision making process • Active time and duty cycle are both a function of the traffic load and of the inferred state of other nodes • Q-Learning was used • The main effort involved crafting the reward signal, which reflects: • nb – # of packets queued • tr – action (active time) • Ratio of successful rx vs. tx • # of failed attempts • Packet delay • A sketch of this style of controller appears below
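A hedged sketch of an RL-MAC-style controller: tabular Q-learning at a single node chooses the next frame's active time from the current queue occupancy, with a reward shaped from successful transmissions/receptions, failed attempts, and the energy cost of staying awake. The state discretization, reward weights, traffic model, and constants are illustrative assumptions, not the published protocol.

```python
# Illustrative RL-MAC-style duty-cycle controller: tabular Q-learning picks the
# active time for the next frame based on how full the transmit queue is.
# Reward shaping, traffic statistics, and all constants are placeholders.
import numpy as np

rng = np.random.default_rng(2)

QUEUE_BUCKETS = 5                  # coarse state: queue occupancy bucket
ACTIVE_TIMES = [1, 2, 4, 8]        # candidate active slots per frame (placeholder)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = np.zeros((QUEUE_BUCKETS, len(ACTIVE_TIMES)))

def bucket(queue_len, max_q=20):
    """Map a queue length onto a coarse state index."""
    return min(QUEUE_BUCKETS - 1, queue_len * QUEUE_BUCKETS // max_q)

def reward(sent, received, failed, active_time):
    """Placeholder shaping: reward throughput, penalize failures and idle energy."""
    return 1.0 * (sent + received) - 2.0 * failed - 0.1 * active_time

queue = 0
state = bucket(queue)
for frame in range(5000):
    # epsilon-greedy choice of the next frame's active time
    if rng.random() < EPSILON:
        action = int(rng.integers(len(ACTIVE_TIMES)))
    else:
        action = int(np.argmax(Q[state]))
    active = ACTIVE_TIMES[action]

    # Simulated frame: random arrivals, up to `active` send attempts, some failures
    queue = min(20, queue + rng.poisson(1.5))
    attempts = min(active, queue)
    failed = int(rng.binomial(attempts, 0.1))
    sent = attempts - failed
    received = int(rng.poisson(0.5))
    queue -= sent

    r = reward(sent, received, failed, active)
    next_state = bucket(queue)
    Q[state, action] += ALPHA * (r + GAMMA * Q[next_state].max() - Q[state, action])
    state = next_state

print("learned active-time preference per queue bucket:",
      [ACTIVE_TIMES[int(a)] for a in Q.argmax(axis=1)])
```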
Summary • RL is a powerful tool that can support a wide range of applications • There is an art to defining the observations, states, rewards and actions • Main goal: formulate an "as simple as possible" representation • Depends on the application • Can impact results significantly • Fits both high-resource and low-resource systems • Next class, we'll talk about a particular class of RL techniques called Neuro-Dynamic Programming