Reinforcement Learning: A survey Xin Chen Information and Computer Science Department University of Hawaii Fall, 2006
I. Introduction • Introduction • Exploitation versus Exploration • Delayed Reward • Learning an Optimal Policy: Model-free Methods • Learning an Optimal Policy: Model-based Methods • Generalization • Partially Observable Environments • Reinforcement Learning Applications • Conclusions
What is Reinforcement Learning? • A subcategory of machine learning. • The problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment.
Research History • Samuel (1959). Checkers learning program. • Bellman (1958), Ford and Fulkerson (1962). Bellman-Ford single-destination shortest-path algorithms. • Bellman (1961), Blackwell (1965). Research on optimal control led to the solution of Markov decision processes. • Holland (1986). Bucket brigade method for learning classifier systems. • Barto et al. (1983). Approach to temporal credit assignment. • Sutton (1988). TD(λ) method and proof of its convergence for λ = 0. • Dayan (1992). Extended the above result to arbitrary values of λ. • Watkins (1989). Q-learning to acquire optimal policies when the reward and action transition functions are unknown. • McCallum (1995) and Littman (1996). Extension of reinforcement learning to settings with hidden state variables that violate the Markov assumption. • Scaling up of these methods: Maclin and Shavlik (1996), Lin (1992), Singh (1993), Lin (1993), Dietterich and Flann (1995), Mitchell and Thrun (1993), Ring (1994). • Recent reviews: Kaelbling et al. (1996); Barto et al. (1995); Sutton and Barto (1998)
Two main strategies for solving Reinforcement Learning problems • Search in the space of behaviors to find one that performs well. E.g. genetic algorithms and genetic programming. (search and optimization) • Use statistical techniques and Dynamic Programming methods to estimate the utility of taking actions in states of the world.
Standard Reinforcement Learning Model • A discrete set of environment states, S; • A discrete set of agent actions, A; and • A set of scalar reinforcement signals (reward); typically {0,1}, or real numbers.
Example of Reinforcement Learning • Agent’s job: find a policy π, mapping states to actions, that maximizes some long-run measure of reinforcement. • In general, the environment is assumed to be non-deterministic: taking the same action in the same state on different occasions may yield different rewards and next states.
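For concreteness, here is a minimal sketch of the agent-environment interaction loop implied by the standard model above. The `env` and `agent` interfaces (`reset`, `step`, `act`, `learn`) are hypothetical placeholders for illustration, not part of any particular library.

```python
# Hypothetical sketch of the standard reinforcement learning interaction loop:
# the agent observes a state, picks an action from its policy, and receives a
# scalar reward plus the next state from the environment.

def run_episode(env, agent, max_steps=1000):
    """env and agent are assumed to expose reset/step and act/learn methods."""
    s = env.reset()                      # initial state
    total_reward = 0.0
    for _ in range(max_steps):
        a = agent.act(s)                 # policy pi: state -> action
        s_next, r, done = env.step(a)    # environment returns reward, next state
        agent.learn(s, a, r, s_next)     # update the agent's estimates
        total_reward += r
        s = s_next
        if done:
            break
    return total_reward
```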
Models of optimal behavior • Finite-horizon model • Infinite-horizon discounted model • Average-reward model
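For reference, the three criteria can be written out as follows (notation follows Kaelbling et al. [1]: r_t is the reward at step t, h a finite horizon, and 0 ≤ γ < 1 a discount factor):

```latex
% Finite-horizon model: optimize expected reward over the next h steps
E\!\left[\sum_{t=0}^{h} r_t\right]

% Infinite-horizon discounted model: geometrically discount future rewards
E\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right], \qquad 0 \le \gamma < 1

% Average-reward model: optimize the long-run average reward
\lim_{h \to \infty} E\!\left[\frac{1}{h}\sum_{t=0}^{h-1} r_t\right]
```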
Measuring learning performance • Eventual convergence to optimal. Many algorithms have a provable guarantee of asymptotic convergence to optimal behavior. This is reassuring but useless in practice: an agent that quickly reaches a plateau at 99% of optimality may often be preferred to one that eventually converges but spends a long time at low levels of performance first. • Speed of convergence to optimality. More practical is the speed of convergence to near-optimality, which requires defining how near to optimality is sufficient. A related measure is level of performance after a given time, which similarly requires that the given time be defined. Measures related to the speed of learning have an additional weakness: an algorithm that merely tries to achieve optimality as fast as possible may incur unnecessarily large penalties during the learning period. • Regret, a more appropriate measure, is the expected decrease in reward gained due to executing the learning algorithm instead of behaving optimally from the very beginning. It penalizes mistakes wherever they occur during the run. Unfortunately, results concerning the regret of algorithms are quite hard to obtain.
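One common formalization of regret (written here for the single-state bandit setting; exact definitions vary by author) is:

```latex
% Regret after T steps: expected shortfall relative to always acting optimally,
% where \mu^{*} is the expected reward of the best action.
\mathrm{Regret}(T) \;=\; T\mu^{*} \;-\; E\!\left[\sum_{t=1}^{T} r_t\right]
```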
II. Exploration versus Exploitation • Introduction • Exploitation versus Exploration • Delayed Reward • Learning an Optimal Policy: Model-free Methods • Learning an Optimal Policy: Model-based Methods • Generalization • Partially Observable Environments • Reinforcement Learning Applications • Conclusions
The k-armed Bandit problem • The simplest reinforcement learning problem: the k-armed bandit problem. • It is an RL environment with a single state and only self-transitions. • Studied extensively in the statistics and applied mathematics literature. • It illustrates the fundamental tradeoff between exploitation and exploration.
Formally justified solutions • Dynamic programming approach • The expense of filling in the table of V* values in this way for all attainable belief states is linear in the number of belief states times actions, and thus exponential in the horizon. • Gittins allocation indices • Guarantees optimal exploration and is simple (given the table of index values), making it promising for more complex applications; it has proved useful in an application to robotic manipulation with immediate reward. Unfortunately, no one has yet been able to find an analog of index values for delayed reinforcement problems. • Learning automata • Always converges to a vector containing a single 1 and the rest 0’s, but does not always converge to the correct action; however, the probability that it converges to the wrong one can be made arbitrarily small.
Ad-hoc solutions • Simple, popular, rarely the best choice, but viewed as reasonable, computationally tractable heuristics. • Greedy strategies • Take the action with the best estimated payoff. No exploration. • Randomized strategies • Epsilon-greedy (sketched below): take the action with the best estimated expected reward by default, but with some small probability ε choose an action at random. • Boltzmann exploration • Interval-based techniques • Kaelbling’s interval estimation algorithm: compute the upper bound of a 100·(1−α)% confidence interval on the success probability of each action and choose the action with the highest upper bound.
More general problems • When there are multiple states but reinforcement is still immediate, all of the methods above can be replicated once for each state. • When generalization is required, these solutions must be integrated with generalization methods. This is easy for the ad-hoc methods, but not for the theoretically justified ones. • Many of these methods converge to a regime in which exploratory actions are rarely or never taken. This is acceptable for a stationary environment, but not for a non-stationary world. Again, this is easy to handle for the ad-hoc solutions, but not for the theoretical ones.
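A minimal sketch of the epsilon-greedy heuristic on a k-armed Bernoulli bandit; the payoff probabilities, parameter values, and function name are made up for illustration.

```python
import random

def epsilon_greedy_bandit(payoff_probs, epsilon=0.1, steps=10000):
    """Epsilon-greedy on a k-armed Bernoulli bandit (illustrative values only)."""
    k = len(payoff_probs)
    counts = [0] * k            # number of pulls per arm
    values = [0.0] * k          # estimated expected payoff per arm
    total = 0.0
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)                       # explore
        else:
            a = max(range(k), key=lambda i: values[i])    # exploit best estimate
        r = 1.0 if random.random() < payoff_probs[a] else 0.0
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]          # incremental mean
        total += r
    return values, total

# Example: three arms with hidden success probabilities 0.2, 0.5, 0.8
estimates, reward = epsilon_greedy_bandit([0.2, 0.5, 0.8])
```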
III. Delayed Reward • Introduction • Exploitation versus Exploration • Delayed Reward • Learning an Optimal Policy: Model-free Methods • Learning an Optimal Policy: Model-based Methods • Generalization • Partially Observable Environments • Reinforcement Learning Applications • Conclusions
Modeling delayed-reward Reinforcement Learning as an MDP • Problems with delayed reinforcement are well modeled as Markov decision processes (MDPs). An MDP consists of: • a set of states S, • a set of actions A, • a reward function R: S × A → ℝ, and • a state transition function T: S × A → Π(S), where a member of Π(S) is a probability distribution over the set S. We write T(s, a, s’) for the probability of making a transition from state s to state s’ using action a. • The model is Markov if the state transitions are independent of any previous environment states or agent actions.
Finding a policy given a model • Value Iteration • Policy Iteration • In practice, value iteration is much faster per iteration, but policy iteration takes fewer iterations. • Puterman’s modified policy iteration algorithm provides a method for trading iteration time for iteration improvement in a smoother way. • Several standard numerical-analysis techniques that speed the convergence of dynamic programming can be used to accelerate value and policy iteration, including multigrid methods, and state aggregation.
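A minimal value-iteration sketch for a finite MDP given as explicit tables. The dictionary-based representation of T and R (and the tolerance `theta`) is an illustrative assumption, not a fixed API.

```python
def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
    """T[s][a] is a list of (prob, s_next) pairs; R[s][a] is the expected reward.
    Returns an estimate of the optimal value function V* and a greedy policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q = [R[s][a] + gamma * sum(p * V[s2] for p, s2 in T[s][a])
                 for a in actions]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:          # stop once the Bellman backup barely changes V
            break
    policy = {s: max(actions,
                     key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in T[s][a]))
              for s in states}
    return V, policy
```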
IV. Model-free methods • Introduction • Exploitation versus Exploration • Delayed Reward • Learning an Optimal Policy: Model-free Methods • Learning an Optimal Policy: Model-based Methods • Generalization • Partially Observable Environments • Reinforcement Learning Applications • Conclusions
Temporal Difference TD(λ) Learning • Sutton, 1988. • Learn by reducing discrepancies between estimates made by the agent at different times. • Use a constant λ to combine the estimates obtained from n-step lookahead; an equivalent recursive definition for Q^λ also exists. Both are shown below.
Q-learning • Watkins, 1989. • A special case of temporal difference algorithms (TD(0)). • Q-learning handles discounted infinite-horizon MDPs. • Q-learning is easy to work with in practice. • Q-learning is exploration insensitive. • Q-learning is the most popular and seems to be the most effective model-free algorithm for learning from delayed reinforcement. • However, it does not address the scaling problem, and it may converge quite slowly.
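The formulas referenced above did not survive conversion; in the standard formulation (as given in Mitchell [3]), the n-step estimate, its λ-weighted combination, and the equivalent recursive definition are:

```latex
% n-step lookahead estimate
Q^{(n)}(s_t, a_t) = r_t + \gamma r_{t+1} + \cdots + \gamma^{\,n-1} r_{t+n-1}
                    + \gamma^{\,n} \max_{a} Q(s_{t+n}, a)

% lambda-weighted combination of the n-step estimates
Q^{\lambda}(s_t, a_t) = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{\,n-1} Q^{(n)}(s_t, a_t)

% equivalent recursive definition
Q^{\lambda}(s_t, a_t) = r_t + \gamma\Big[(1-\lambda)\max_{a} Q(s_{t+1}, a)
                        + \lambda\, Q^{\lambda}(s_{t+1}, a_{t+1})\Big]
```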
Q-learning rule • s – current state • s’ – next state • a – action • a’ – action of the next state • r – immediate reward • γ – discount factor • Q(s,a) – expected discounted reinforcement of taking action a in state s. • <s, a, r, s’> is an experience tuple.
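Written out with the symbols above (plus a learning rate α), the standard Q-learning update applied to each experience tuple is:

```latex
Q(s, a) \;\leftarrow\; Q(s, a) + \alpha\Big[\,r + \gamma \max_{a'} Q(s', a') - Q(s, a)\Big]
```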
Q-learning algorithm • For each s, a, initialize table entry Q(s, a) ← 0 • Observe current state s • Do forever: • Select an action a and execute it • Receive immediate reward r • Observe the new state s’ • Update the table entry for Q(s, a) as follows: • Q(s, a) ← Q(s, a) + α [ r + γ max_a’ Q(s’, a’) − Q(s, a) ] • s ← s’
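A minimal tabular Q-learning sketch corresponding to the algorithm above, using epsilon-greedy action selection; the environment interface (`reset`/`step`) and parameter values are hypothetical placeholders.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning; env is assumed to expose reset() -> s and
    step(a) -> (s_next, r, done)."""
    Q = defaultdict(float)                          # Q[(s, a)], initialized to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            target = r if done else r + gamma * max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```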
Convergence criteria • The system is a deterministic MDP. • The immediate reward values are bounded, i.e., there exists some positive constant c such that for all states s and actions a, |r(s,a)| < c. • The agent selects actions such that it visits every possible state-action pair infinitely often.
V. Model-based methods • Introduction • Exploitation versus Exploration • Delayed Reward • Learning an Optimal Policy: Model-free Methods • Learning an Optimal Policy: Model-based Methods • Generalization • Partially Observable Environments • Reinforcement Learning Applications • Conclusions
Why model-based methods instead of model-free methods? • Model-free methods need very little computation time per experience, but make extremely inefficient use of the gathered data and often need a large amount of experience to achieve good performance. • Model-based methods are important in situations where computation is considered cheap and real-world experience costly.
Model-Based methods • Certainty equivalent methods • Dyna • Prioritized Sweeping / Queue-Dyna • RTDP (real-time dynamic programming) • The Plexus planning system
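As a rough illustration of the Dyna idea from the list above: after each real experience the agent also performs a few extra value backups on transitions replayed from a learned model (here, a simple deterministic last-seen model). Names and parameters are illustrative only, not the original algorithm's API.

```python
import random

def dyna_q_update(Q, model, s, a, r, s_next, actions,
                  alpha=0.1, gamma=0.9, planning_steps=10):
    """One real-experience Q-learning update plus a few simulated (planning)
    updates drawn from the remembered model. Q and model are plain dicts."""
    def backup(bs, ba, br, bs_next):
        target = br + gamma * max(Q.get((bs_next, x), 0.0) for x in actions)
        Q[(bs, ba)] = Q.get((bs, ba), 0.0) + alpha * (target - Q.get((bs, ba), 0.0))

    backup(s, a, r, s_next)            # learn from the real transition
    model[(s, a)] = (r, s_next)        # remember it as a deterministic model

    for _ in range(planning_steps):    # replay randomly chosen remembered transitions
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        backup(ps, pa, pr, ps_next)
```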
VI. Generalization • Introduction • Exploitation versus Exploration • Delayed Reward • Learning an Optimal Policy: Model-free Methods • Learning an Optimal Policy: Model-based Methods • Generalization • Partially Observable Environments • Reinforcement Learning Applications • Conclusions
Generalization over input • Generalization over input • Immediate Reward • CRBP • ARC • REINFORCE Algorithms • Logic-Based Methods • Delayed Reward • Adaptive Resolution Models • Generalization over action
Hierarchical methods • Feudal Q-learning • In the simplest case, there is a high-level master and a low-level slave. • Compositional Q-learning • Consists of a hierarchy based on the temporal sequencing of sub-goals. • Hierarchical Distance to Goal
VII. Partially observable environments • Introduction • Exploitation versus Exploration • Delayed Reward • Learning an Optimal Policy: Model-free Methods • Learning an Optimal Policy: Model-based Methods • Generalization • Partially Observable Environments • Reinforcement Learning Applications • Conclusions
Why Partially Observable Environments? • Complete observability is necessary for learning methods based on MDPs. But in many real-world environments, it’s not possible for the agent to have perfect and complete perception of the state of the environment.
Partially Observable Environments • State-Free Deterministic Policies • State-Free Stochastic Policies • Policies with Internal State • Recurrent Q-learning • Classifier Systems • Finite-history-window Approach • POMDP Approach
VIII. Reinforcement learning applications • Introduction • Exploitation versus Exploration • Delayed Reward • Learning an Optimal Policy: Model-free Methods • Learning an Optimal Policy: Model-based Methods • Generalization • Partially Observable Environments • Reinforcement Learning Applications • Conclusions
Reinforcement learning applications • Game Playing • TD-Gammon: TD in Backgammon. • Special features of Backgammon. • Although experiments with other games (Go, Chess) have in some cases produced interesting learning behavior, no success close to that of TD-Gammon has been repeated. • It remains an open question whether and how the success of TD-Gammon can be repeated in other domains.
Reinforcement learning applications • Robotics and Control • Schaal and Atkeson. Constructed a two-armed robot that learns to juggle a device known as a devil stick. • Mahadevan and Connell. Discussed a task in which a mobile robot pushes large boxes for extended periods of time. • Mataric. Describes a robotics experiment with a very high-dimensional state space, containing many dozens of degrees of freedom. • Crites and Barto. Used Q-learning in an elevator dispatching task. • Kaelbling. Packaging task from the food processing industry. • Interesting results • To make a real system work, it proved necessary to supplement the fundamental algorithm with extra pre-programmed knowledge. • None of the exploration strategies mirrors theoretically optimal (but computationally intractable) exploration, and yet all proved adequate. • The computational requirements of these experiments were all very different, indicating that reinforcement learning algorithms with differing computational demands are indeed suited to an array of differing applications.
IX. Conclusions • Introduction • Exploitation versus Exploration • Delayed Reward • Learning an Optimal Policy: Model-free Methods • Learning an Optimal Policy: Model-based Methods • Generalization • Partially Observable Environments • Reinforcement Learning Applications • Conclusions
Conclusions • Many techniques are good for small problems, but don’t scale. • To scale, we must make use of some bias. E.g., shaping, local reinforcement signals, imitation, problem decomposition and reflexes.
References • [1] Leslie Pack Kaelbling, Michael L. Littman, Andrew W. Moore. Reinforcement Learning: A Survey. (1996) http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/kaelbling96a-html/rl-survey.html • [2] Richard S. Sutton, Andrew G. Barto. Reinforcement Learning: An Introduction. (1998) http://www.cs.ualberta.ca/~sutton/book/ebook/the-book.html • [3] Tom M. Mitchell. Machine Learning. (1997) • [4] Barto, A., Bradtke, S., Singh, S. Learning to act using real-time dynamic programming. Artificial Intelligence, Special volume: Computational research on interaction and agency, 72(1), 81-138. (1995) • [5] Richard S. Sutton. 499/699 courses on Reinforcement Learning. University of Alberta, Spring 2006. http://rlai.cs.ualberta.ca/RLAI/RLAIcourse/RLAIcourse2006.html