ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 19: Case Studies
November 10, 2010
Dr. Itamar Arel
College of Engineering, Department of Electrical Engineering and Computer Science
The University of Tennessee, Fall 2010
Final Project Recap • Requirements: • Presentation • In-class 15 minute presentation + 5 minutes for questions • Presentation assignment slots have been posted on the website • Project report – due Friday, Dec 3rd • Comprehensive documentation of your work • Recall that the Final Project is 30% of the course grade!
Introduction • We’ll discuss several case studies of reinforcement learning • The intention is to illustrate some of the trade-offs and issues that arise in real applications • For example, we emphasize how domain knowledge is incorporated into the formulation and solution of the problem • We also highlight the representation issues that are so often critical to successful applications • Applications of reinforcement learning are still far from routine and typically require as much art as science • Making applications easier and more straightforward is one of the goals of current research in reinforcement learning
TD-Gammon (Tesauro, 1992, 1994, 1995, …) • One of the most impressive applications of RL to date is Gerry Tesauro’s (IBM) backgammon player • TD-Gammon required little backgammon knowledge, yet learned to play extremely well, near the level of the world's strongest grandmasters • The learning algorithm was a straightforward combination of the TD(λ) algorithm and nonlinear function approximation • FA using a feedforward neural network trained by backpropagating TD errors • There are probably more professional backgammon players than there are professional chess players • Backgammon is in part a game of chance, and it can be viewed as a large MDP
TD-Gammon (cont.) • The game is played with 15 white and 15 black pieces on a board of 24 locations, called points • Here’s a typical position early in the game, seen from the perspective of the white player
TD-Gammon (cont.) • White has just rolled a 5 and a 2, so it can move one of its pieces 5 steps and another (possibly the same piece) 2 steps • The objective is to advance all pieces to points 19–24, and then off the board • Hitting – a lone opposing piece on a point can be hit and removed to the bar • 30 pieces and 24 locations imply an enormous number of configurations (the state set is on the order of 10^20) • Effective branching factor of about 400, considering that each dice roll can be played in roughly 20 different ways
TD-Gammon – details • Although the game is highly stochastic, a complete description of the game's state is available at all times • The estimated value of any state was meant to predict the probability of winning starting from that state • Reward: 0 at all times except those in which the game is won, when it is 1 • Episodic (game = episode), undiscounted • Non-linear form of TD(λ) using a feedforward neural network • Weights initialized to small random numbers • Backpropagation of TD errors • Four input units for each point; unary encoding of the number of white pieces, plus other features • Use of afterstates • Learning during self-play – fully incremental (see the sketch below)
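The sketch below illustrates the kind of update TD-Gammon performs: a small feedforward network estimates the win probability of an afterstate, and eligibility traces of the network weights accumulate the value gradient so that the TD error can adjust all recent moves. The layer sizes, step size, and λ below are illustrative assumptions, not Tesauro's actual settings.

```python
import numpy as np

# Illustrative sizes; TD-Gammon's network used roughly 198 inputs and 40-80 hidden units.
N_IN, N_HID = 198, 40
rng = np.random.default_rng(0)
net = {
    "W1": rng.normal(scale=0.1, size=(N_HID, N_IN)),  # small random initial weights
    "W2": rng.normal(scale=0.1, size=(1, N_HID)),
}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def value(x):
    """Estimated probability of winning from afterstate feature vector x."""
    h = sigmoid(net["W1"] @ x)
    v = sigmoid(net["W2"] @ h)[0]      # scalar output in [0, 1]
    return v, h

def td_lambda_step(x, x_next, reward, traces, alpha=0.1, lam=0.7):
    """One undiscounted TD(lambda) weight update along a self-play game."""
    v, h = value(x)
    v_next, _ = value(x_next)
    delta = reward + v_next - v        # TD error; gamma = 1, reward = 1 only on a win
    # Backpropagate the scalar output's gradient by hand
    dout = v * (1.0 - v)
    grad_W2 = dout * h.reshape(1, -1)
    dh = (net["W2"].flatten() * dout) * h * (1.0 - h)
    grad_W1 = np.outer(dh, x)
    # Accumulate eligibility traces of the weights, then apply the TD update
    traces["W2"] = lam * traces["W2"] + grad_W2
    traces["W1"] = lam * traces["W1"] + grad_W1
    net["W2"] += alpha * delta * traces["W2"]
    net["W1"] += alpha * delta * traces["W1"]
    return delta
```

In use, the traces would be reset at the start of each game, e.g. `traces = {k: np.zeros_like(w) for k, w in net.items()}`, and `td_lambda_step` would be called once per move of self-play.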
Summary of TD-Gammon Results • Two copies of the network played against each other • Each had no prior knowledge of the game • Only the rules of the game were prescribed • Humans learn from machines: TD-Gammon learned to play certain opening positions differently than was the convention among the best human players
Rebuttal on TD-Gammon • For an alternative view, see “Why did TD-Gammon Work?”, Jordan Pollack and Alan Blair, NIPS 9 (1997) • Claim: it was the “co-evolutionary training strategy, playing games against itself, which led to the success” • Any such self-play approach would have worked for backgammon • Success does not extend to other problems • e.g. Tetris, maze-type problems – the exploration issue comes up
The Acrobot • Robotic application of RL • Roughly analogous to a gymnast swinging on a high bar • The first joint (corresponding to the hands on the bar) cannot exert torque • The second joint (corresponding to the gymnast bending at the waist) can • This system has been widely studied by control engineers and machine learning researchers
The Acrobot (cont.) • One objective for controlling the Acrobot is to swing the tip (the "feet") above the first joint by a height equal to the length of one of the links, in minimum time • In this task, the torque applied at the second joint is limited to three choices: positive torque of a fixed magnitude, negative torque of the same magnitude, or no torque • A reward of –1 is given on all time steps until the goal is reached, which ends the episode; no discounting is used • Thus, the optimal value of any state is minus the minimum time to reach the goal (an integer number of steps) • Sutton (1996) addressed the Acrobot swing-up task in an online, model-free context
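A minimal sketch of the kind of online, model-free learner used for this task: Sarsa(λ) with linear function approximation over binary features and a –1 reward per step. The feature hashing below is a crude stand-in for tile coding, and the environment object, feature size, and parameters are assumptions for illustration, not Sutton's exact setup.

```python
import numpy as np

# All sizes and parameters below are illustrative.
ACTIONS = (+1.0, 0.0, -1.0)    # fixed-magnitude positive, zero, or negative torque
N_FEATURES = 4096              # size of the binary state-action feature space
N_TILINGS = 8
ALPHA, LAMBDA = 0.1, 0.9       # step size and trace-decay parameter

theta = np.zeros(N_FEATURES)   # linear weights over state-action features

def active_tiles(state, action):
    """Crude stand-in for tile coding: coarsely discretize the 4-dimensional
    state (two joint angles, two angular velocities) together with the action,
    once per tiling; real tile coding offsets each tiling systematically."""
    indices = []
    for t in range(N_TILINGS):
        bins = tuple(int((s + t / N_TILINGS) * 6) for s in state)
        indices.append(hash((t, bins, action)) % N_FEATURES)
    return indices

def q_value(state, action):
    return theta[active_tiles(state, action)].sum()

def greedy_action(state):
    return max(ACTIONS, key=lambda a: q_value(state, a))

def sarsa_lambda_episode(env, max_steps=10_000):
    """One undiscounted Sarsa(lambda) episode: reward is -1 per step, so the
    return from any state is minus the number of steps needed to reach the goal.
    `env` is an assumed object exposing reset() and step(action) -> (s, r, done)."""
    z = np.zeros(N_FEATURES)                          # eligibility traces
    state = env.reset()
    action = greedy_action(state)
    for _ in range(max_steps):
        next_state, reward, done = env.step(action)   # reward == -1 every step
        delta = reward - q_value(state, action)
        z[active_tiles(state, action)] = 1.0          # replacing traces
        if done:                                      # tip swung high enough
            theta[:] += ALPHA * delta * z
            break
        next_action = greedy_action(next_state)
        delta += q_value(next_state, next_action)     # gamma = 1 (undiscounted)
        theta[:] += ALPHA * delta * z
        z *= LAMBDA
        state, action = next_state, next_action
```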
RL in Robotics • Robot motor capabilities were investigated using RL • Walking, grabbing and delivering • MIT Media Lab • Robocup competitions – soccer games • Sony AIBOs are commonly employed • Maze-type problems • Balancing themselves on an unstable platform • Multi-dimensional input streams • Hopefully some new applications soon
Introduction to Wireless Sensor Networks (WSN) • A sensor network is composed of a large number of sensor nodes, which are densely deployed either inside the phenomenon or very close to it • Random deployment • Cooperative capabilities • May be wireless or wired, however most modern applications require wireless communications • May be mobile or static • Main challenge: maximize the life of the network under battery constraints!
Nodes we have here at the lab • Intel Mote • UCB TelosB
Energy Consumption in WSN • Sources of Energy Consumption • Sensing • Computation • Communication (dominant) • Energy Wastes in Communications • Collisions (packet retransmission increases energy consumption) • Idle listening (listening to the channel when the node is not intending to transmit) • Communication overhead (the communications cost of the MAC protocol) • Overhearing (receiving packets that are destined for other nodes)
MAC-related problems in WSN • Goal: to schedule or coordinate the communications among multiple nodes sharing the same wireless radio frequency • Hidden Terminal Problem: node 5 and node 3 both want to transmit data to node 1; since node 3 is out of the communication range of node 5, if the two transmissions occur simultaneously, node 1 will experience a collision • Exposed Terminal Problem: when node 1 sends data to node 3, node 5 also overhears the transmission, so the transmission between node 6 and node 5 is needlessly constrained
S-MAC – Example of a WSN MAC Protocol • S-MAC — by Ye, Heidemann and Estrin (2003) • Trades off latency, fairness and energy • Major components in S-MAC • Periodic listen and sleep • Collision avoidance • Overhearing avoidance • Message passing
RL-MAC (Z. Liu, I. Arel, 2005) • Formulates the MAC problem as an RL problem • Similar frame-based structure as in S-MAC/T-MAC • Each node infers the state of other nodes as part of its decision-making process • Active time and duty cycle are both a function of the traffic load • Q-learning was used • The main effort involved crafting the reward signal (see the sketch after this list) • n_b – # of packets queued • t_r – action (active time) • Ratio of successful rx vs. tx • # of failed attempts • Reflects delay
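A minimal sketch of the idea: each node runs per-frame Q-learning, where the action is the next frame's active time and the reward is shaped from the quantities listed above. The discretization, weights, and functional form of the reward below are illustrative assumptions, not the exact values from the RL-MAC paper.

```python
import random
from collections import defaultdict

# Illustrative discretization and parameters only.
ACTIVE_TIMES = (1, 2, 4, 8)          # candidate active-time slots per frame (t_r)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = defaultdict(float)               # Q[(queued_packets, active_time)]

def reward(n_tx_ok, n_rx_ok, n_failed, queued, active_time):
    """Hypothetical per-frame reward: successful transmissions and receptions
    are rewarded, while failed attempts, long active times (energy) and a
    growing queue (delay) are penalized."""
    return (n_tx_ok + n_rx_ok) - 0.5 * n_failed - 0.1 * active_time - 0.05 * queued

def choose_active_time(queued):
    """Epsilon-greedy choice of the next frame's active time."""
    if random.random() < EPSILON:
        return random.choice(ACTIVE_TIMES)
    return max(ACTIVE_TIMES, key=lambda t: Q[(queued, t)])

def q_update(queued, active_time, r, queued_next):
    """Standard one-step Q-learning update after a frame completes."""
    best_next = max(Q[(queued_next, t)] for t in ACTIVE_TIMES)
    Q[(queued, active_time)] += ALPHA * (r + GAMMA * best_next - Q[(queued, active_time)])
```

At the end of each frame, a node would compute `r = reward(...)` from its transmission statistics, call `q_update`, and then pick the next active time with `choose_active_time`.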
Summary • RL is a powerful tool that can support a wide range of applications • There is an art to defining the observations, states, rewards and actions • Main goal: formulate a representation that is as simple as possible • Depends on the application • Can impact results significantly • Fits both high-resource and low-resource systems • Next class, we’ll talk about a particular class of RL techniques called Neuro-Dynamic Programming