200 likes | 313 Views
Tópicos Especiais em Aprendizagem. Prof. Reinaldo Bianchi Centro Universitário da FEI 2012. Objetivo desta Aula. Esta é uma aula mais informativa. Aprendizado por Reforço: Planning and Learning . Relational Reinforcement Learning Uso de Heurísticas para acelerar o AR .
E N D
Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012
Objetivo desta Aula Esta é uma aula mais informativa • Aprendizado por Reforço: • Planning and Learning. • RelationalReinforcement Learning • Uso de Heurísticas para acelerar o AR. • DimensionsofRL - Conclusões. • Aula de hoje: • Capítulos 9 e 10 do Sutton & Barto. • Artigo Relational RL, MLJ, de 2001 • Tese de Doutorado Bianchi.
Dimensions of Reinforcement Learning Capítulo 10 do Sutton e Barto
Objetivos • Review the treatment of RL taken in this course • What have left out? • What are the hot research areas?
Three Common Ideas • Estimation of value functions • Backing up values along real or simulated trajectories • Generalized Policy Iteration: maintain an approximate optimal value function and approximate optimal policy, use each to improve the other
Other Dimensions • Function approximation • tables • aggregation • other linear methods • many nonlinear methods • On-policy/Off-policy • On-policy: learn the value function of the policy being followed • Off-policy: try learn the value function for the best policy, irrespective of what policy is being followed
Still More Dimensions • Definition of return: episodic, continuing, discounted, etc. • Action values vs. state values vs. afterstate values • Action selection/exploration: e-greed, softmax, more sophisticated methods • Synchronous vs. asynchronous • Replacing vs. accumulating traces
Still More Dimensions • Real vs. simulated experience • Location of backups (search control) • Timing of backups: part of selecting actions or only afterward? • Memory for backups: how long should backed up values be retained?
Frontier Dimensions • Prove convergence for bootstrapping control methods. • Trajectory sampling • Non-Markov case: • Partially Observable MDPs (POMDPs) • Bayesian approach: belief states • construct state from sequence of observations • Try to do the best you can with non-Markov states • Modularity and hierarchies • Learning and planning at several different levels • Theory of options
More Frontier Dimensions • Using more structure • factored state spaces: dynamic Bayes nets • factored action spaces
Still More Frontier Dimensions • Incorporating prior knowledge • advice and hints • trainers and teachers • shaping • Lyapunov functions • etc.
Getting the degree of abstraction right • Actions can be low level (e.g. voltages to motors), high level (e.g., accept a job offer), "mental" (e.g., shift in focus of attention) • Situations: direct sensor readings, or symbolic descriptions, or "mental" situations (e.g., the situation of being lost) • An RL agent is not like a whole animal or robot, which consist of many RL agents as well as other components
Getting the degree of abstraction right • The environment is not necessarily unknown to the RL agent, only incompletely controllable • Reward computation is in the RL agent's environment because the RL agent cannot change it arbitrarily
A Unified View • There is a space of methods, with tradeoffs and mixtures • rather than discrete choices • Value function are key • all the methods are about learning, computing, or estimating values • all the methods store a value function • All methods are based on backups • different methods just have backups of different • shapes and sizes, • sources and mixtures
Some of the Open Problems • Incomplete state information • Exploration • Structured states and actions • Incorporating prior knowledge • Using teachers • Theory of RL with function approximators • Modular and hierarchical architectures • Integration with other problem–solving and planning methods
Lessons Learned • 1. Approximate the Solution, Not the Problem • 2. The Power of Learning from Experience • 3. The Central Role of Value Functionsin finding optimal sequential behavior • 4. Learning and Planning can be Radically Similar • 5. Accretive Computation "Solves Dilemmas" • 6. A General Approach need not be Flat, Low-level
Final Summary • RL is about a problem: • goal-directed, interactive, real-time decision making • all hail the problem! • Value functions seem to be the key to solving it,and TD methods the key new algorithmic idea. • There are a host of immediate applications • large-scale stochastic optimal control and scheduling • There are also a host of scientific problems