This research paper explores the design principles for creating agents that can be shaped by human trainers through evaluative reinforcement. It discusses the benefits of human-shapable agents and presents the TAMER approach for teaching agents manually through human reinforcement. The paper also provides results from experiments conducted on Tetris and Mountain Car tasks.
Design Principles for Creating Human-Shapable Agents W. Bradley Knox, Ian Fasel, and Peter Stone The University of Texas at Austin Department of Computer Sciences
Transferring human knowledge through natural forms of communication Potential benefits over purely autonomous learners: • Decrease sample complexity • Learn in the absence of a reward function • Allow lay users to teach agents the policies that they prefer (no programming!) • Learn in more complex domains
Shaping (image: LOOK magazine, 1952) Def.: creating a desired behavior by reinforcing successive approximations of the behavior
The Shaping Scenario (in this context) A human trainer observes an agent and manually delivers reinforcement (a scalar value), signaling approval or disapproval. E.g., training a dog with treats, as in the previous picture.
The Shaping Problem (for computational agents) Within a sequential decision making task, how can an agent harness state descriptions and occasional scalar human reinforcement signals to learn a good task policy?
Previous work on human-shapable agents • Clicker training for entertainment agents (Blumberg et al., 2002; Kaplan et al., 2002) • Sophie’s World (Thomaz & Breazeal, 2006) • RL with reward = environmental (MDP) reward + human reinforcement • Social software agent Cobot in LambdaMOO (Isbell et al., 2006) • RL with reward = human reinforcement
MDP reward vs. Human reinforcement • MDP reward (within reinforcement learning): • Key problem: credit assignment from sparse rewards • Reinforcement from a human trainer: • Trainer has long-term impact in mind • Reinforcement is within a small temporal window of the targeted behavior • Credit assignment problem is largely removed
Teaching an Agent Manually via Evaluative Reinforcement (TAMER) • TAMER approach: • Learn a model of human reinforcement • Directly exploit the model to determine policy • If greedy: choose the action with the highest predicted human reinforcement
Teaching an Agent Manually via Evaluative Reinforcement (TAMER) Learning from targeted human reinforcement is a supervised learning problem, not a reinforcement learning problem.
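The greedy policy and the supervised-learning view can be written compactly. The notation here is an assumption made for illustration and does not appear on the slides: \hat{H} is the learned model of human reinforcement, h is the scalar reinforcement the trainer delivers, and (s, a) is the state-action pair it targets.

```latex
\begin{align}
  a_t &= \arg\max_{a} \hat{H}(s_t, a)
      && \text{(greedy exploitation of the model)} \\
  \hat{H} &\approx \arg\min_{H'} \sum_{(s, a, h)} \bigl(h - H'(s, a)\bigr)^2
      && \text{(regression on labeled samples)}
\end{align}
```

Each sample (s, a, h) acts as a labeled training example, so no bootstrapping from future values is involved.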
The Shaped Agent’s Perspective • Each time step, agent: • receives state description • might receive a scalar human reinforcement signal • chooses an action • does not receive an environmental reward signal (if learning purely from shaping)
Tetris • Drop blocks to make solid horizontal lines, which then disappear • |state space| > 2^250 • Challenging but slow • 21 features extracted from (s, a) • TAMER model: • Linear model over features • Gradient descent updates • Greedy action selection
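A minimal sketch of the kind of model listed above: a linear predictor over (s, a) features, trained by gradient descent on squared error and exploited greedily. The class name, method names, and default step size are hypothetical, and the 21 Tetris features themselves are not reproduced here.

```python
import numpy as np


class LinearReinforcementModel:
    """Hypothetical linear model of human reinforcement over (s, a) features."""

    def __init__(self, n_features, step_size=0.01):
        self.w = np.zeros(n_features)
        self.step_size = step_size  # assumed value, not taken from the slides

    def predict(self, features):
        """Predicted human reinforcement for one feature vector."""
        return float(self.w @ features)

    def update(self, features, human_reinforcement):
        """One gradient descent step on squared prediction error."""
        error = human_reinforcement - self.predict(features)
        self.w += self.step_size * error * features
        return error

    def greedy_action(self, candidate_features):
        """Pick the action whose features have the highest prediction.

        candidate_features maps each available action to its feature vector.
        """
        return max(candidate_features,
                   key=lambda a: self.predict(candidate_features[a]))
```

In use, the agent would compute one feature vector per legal action, call `greedy_action`, and call `update` only on steps where the trainer gave a signal.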
TAMER in action: Tetris (video panels: before training, training, after training)
Conjectures on how to create an agent that can be interactively shaped by a human trainer • For many tasks, greedily exploiting the human trainer’s reinforcement function yields a good policy. • Modeling a human trainer’s reinforcement is a supervised learning problem (not RL). • Exploration can be driven by negative reinforcement alone. • Credit assignment to a dense state-action history should … • A human trainer’s reinforcement function is not static. • Human reinforcement is a function of states and actions. • In an MDP, human reinforcement should be treated differently from environmental reward. • Human trainers reinforce predicted action as well as recent action.
Mountain Car • Drive back and forth, gaining enough momentum to get to the goal on top of the hill • Continuous state space • Velocity and position • Simple but rapid actions • Feature extraction: • 2D Gaussian RBFs over velocity and position of car • One “grid” of RBFs per action • TAMER model: • Linear model over RBF features • Gradient descent updates
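The feature extraction above can be sketched as follows. The state bounds are the classic mountain car values and, like the grid size, RBF width, and number of actions, are assumptions for illustration; the slides do not give them.

```python
import numpy as np

# Assumed state bounds (classic mountain car), not stated on the slides.
POSITION_RANGE = (-1.2, 0.6)
VELOCITY_RANGE = (-0.07, 0.07)


def rbf_features(position, velocity, action,
                 n_actions=3, grid_size=5, sigma=0.25):
    """2D Gaussian RBF features with one grid of RBFs per action.

    Only the block belonging to `action` is nonzero, so a linear model
    over these features keeps separate weights for each action.
    """
    # Normalize both state variables to [0, 1] so one width fits both.
    p = (position - POSITION_RANGE[0]) / (POSITION_RANGE[1] - POSITION_RANGE[0])
    v = (velocity - VELOCITY_RANGE[0]) / (VELOCITY_RANGE[1] - VELOCITY_RANGE[0])

    # RBF centers on an evenly spaced grid over the normalized state space.
    centers = np.linspace(0.0, 1.0, grid_size)
    cp, cv = np.meshgrid(centers, centers, indexing="ij")
    activations = np.exp(-((p - cp) ** 2 + (v - cv) ** 2) / (2.0 * sigma ** 2))

    block = grid_size * grid_size
    features = np.zeros(n_actions * block)
    features[action * block:(action + 1) * block] = activations.ravel()
    return features
```

With these defaults each (s, a) pair maps to a vector of length 75, of which the 25 entries for the chosen action are active.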
TAMER in action: Mountain Car (video panels: before training, training, after training)
HOW TO: Convert a basic TD-Learning agent into a TAMER agent (without temporal credit assignment) • the underlying function approximator must be a Q-function (for state-action values) • set the discount factor (gamma) to 0 • make action selection fully greedy • human reinforcement replaces environmental reward • if no human input is received, make no update • remove any eligibility traces (setting the parameter lambda to 0 is enough) • consider lowering the step size alpha to 0.01 or less
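The conversion steps above can be checked against a sketch of a linear TD update with accumulating eligibility traces. The function names and signatures are hypothetical. With gamma = 0 and lambda = 0 the TD update reduces to a plain supervised update toward the human signal.

```python
import numpy as np


def td_update(w, trace, features, reward, next_features, alpha, gamma, lam):
    """Linear SARSA(lambda)-style update with accumulating traces."""
    trace = gamma * lam * trace + features
    td_error = reward + gamma * (w @ next_features) - (w @ features)
    return w + alpha * td_error * trace, trace


def tamer_update(w, features, human_reinforcement, alpha=0.01):
    """The same update after the conversion: gamma = 0, lambda = 0.

    Human reinforcement stands in for environmental reward, and a step
    with no human input (None) leaves the weights unchanged.
    """
    if human_reinforcement is None:
        return w
    error = human_reinforcement - (w @ features)
    return w + alpha * error * features
```

Because the next-state term is multiplied by gamma = 0, the successor features no longer affect the update, which is why the result is ordinary regression on the trainer's signal.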
HOW TO: Convert a TD-Learning agent into a TAMER agent (cont.) With credit assignment (more frequent time steps) 1. save (features, human reinforcement) for each time step in a window from 0.2 seconds to about 0.8 seconds before the reinforcement 2. define a probability distribution function over the window (a uniform distribution is probably fine) 3. the credit for each state-action pair is the integral of the pdf from the time of the next most recent time step to the time step for that pair • for the update, both the reward prediction (in place of the state-action-value prediction) used to calculate the error and the gradient for any one weight use the weighted sum, for each action, of the features in the window (the weights are the "credit" calculated in step 3) • time measurements used for credit assignment should be in real time, not simulation time
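One way to read the credit assignment steps above, as a sketch only: the window is taken to be a uniform distribution over delays of 0.2 to 0.8 seconds before the reinforcement, and all function names are hypothetical.

```python
import numpy as np

# Assumed window: delays (seconds before the reinforcement) with uniform density.
WINDOW_START, WINDOW_END = 0.2, 0.8


def uniform_cdf(delay):
    """CDF of the uniform distribution over [WINDOW_START, WINDOW_END]."""
    fraction = (delay - WINDOW_START) / (WINDOW_END - WINDOW_START)
    return float(np.clip(fraction, 0.0, 1.0))


def credits(step_times, reinforcement_time):
    """Credit per time step: the pdf integrated over that step's duration.

    step_times are real-time stamps, oldest first. Each step lasts until
    the next more recent step begins (or until the reinforcement arrives).
    """
    ends = list(step_times[1:]) + [reinforcement_time]
    return np.array([
        uniform_cdf(reinforcement_time - start) - uniform_cdf(reinforcement_time - end)
        for start, end in zip(step_times, ends)
    ])


def credit_weighted_update(w, step_features, step_times,
                           reinforcement_time, human_reinforcement, alpha=0.01):
    """Update using the credit-weighted sum of features in the window."""
    c = credits(step_times, reinforcement_time)
    weighted = c @ np.asarray(step_features)  # used for prediction and gradient
    error = human_reinforcement - (w @ weighted)
    return w + alpha * error * weighted
```

Time stamps would come from a wall clock such as `time.monotonic()` rather than from a simulation step counter, in line with the last bullet above.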