Reinforcement Learning [Outro] Marco Loog
Rationale • How can an agent learn if there is no teacher around who tells it with every action what’s right and what’s wrong? • E.g., an agent can learn how to play chess by supervised learning, provided that examples of states and their correct actions are available • But what if these examples are not available?
Rationale • But what if these examples are not available? • Through random moves, i.e., exploratory behavior, the agent may be able to infer knowledge about the environment it is in • But what is good and what is bad? That is exactly the knowledge needed to decide what to do in order to reach the goal
Rationale • But what is good and what is bad? That is exactly the knowledge needed to decide what to do in order to reach the goal • ‘Rewarding’ the agent when it does something good and ‘punishing’ it when it does something bad is called reinforcement • The task of reinforcement learning is to use observed rewards to learn a [best] policy for the environment
Reinforcement Learning • Use observed rewards to learn an [almost?] optimal policy for an environment • The reward R(s) assigns a number to every state s • The utility of an environment history is [as an example] the sum of the rewards received • A policy prescribes the agent’s action in any state s in order to reach the goal • The optimal policy is the policy with the highest expected utility
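As a minimal sketch of these definitions, the ‘sum of rewards’ utility of a history can be computed as below; the reward values and the example history are made up for illustration.

```python
# Utility of an environment history as the (possibly discounted) sum of rewards.
# The reward function R and the example history are hypothetical.
def history_utility(history, R, gamma=1.0):
    """Sum of discounted rewards R(s) along a state history."""
    return sum(gamma ** t * R[s] for t, s in enumerate(history))

R = {"start": -0.04, "mid": -0.04, "goal": 1.0}      # hypothetical reward function
print(history_utility(["start", "mid", "goal"], R))  # 0.92 (up to rounding) with gamma = 1
```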
Reinforcement Learning • Might be considered to encompass all of AI : an agent is dropped off somewhere and should figure everything out by itself • We will concentrate on simple settings and agent designs to keep things manageable • E.g., a fully observable environment
Typically in Games • Offline / during development : episodic reinforcement learning • Multiple training instances / several runs from start to end • Online / during actual game playing : incremental reinforcement learning • One continuous sequence of states / possibly without clear ‘end’
3 Agent Designs • Utility-based agent : learns a utility function on the basis of which it chooses its actions • Q-learning agent : learns an action-value function giving the expected utility of taking a given action in a given state • Reflex agent : learns a policy that maps directly from states to actions
Passive Reinforcement • Policy is fixed : state s always leads to the same action • Goal is simply to learn how good this policy is • [Of course this can be extended ‘easily’ to policy learning...]
Direct Utility Estimation • Idea : the utility of a state is the expected total reward from that state onward • Each trial provides a sample of this value for each state visited • After each trial, the utility estimate for every observed state is updated using a running average • In the limit, the sample average converges to the true expectation • Direct utility estimation thus reduces to standard supervised inductive learning
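A minimal sketch of this update, assuming every visit to a state contributes a reward-to-go sample; the trial data and reward values below are hypothetical.

```python
# Sketch of direct utility estimation: after a trial, the observed reward-to-go
# of every visited state is folded into that state's utility estimate via a
# running average. All trial data below is made up.
from collections import defaultdict

U = defaultdict(float)   # utility estimates per state
N = defaultdict(int)     # number of reward-to-go samples per state

def update_from_trial(states, rewards):
    """states[i] received rewards[i]; reward-to-go is the suffix sum."""
    reward_to_go = 0.0
    for s, r in zip(reversed(states), reversed(rewards)):
        reward_to_go += r
        N[s] += 1
        U[s] += (reward_to_go - U[s]) / N[s]   # running average

update_from_trial(["a", "b", "goal"], [-0.04, -0.04, 1.0])
print(U["a"])   # 0.92 after this single trial
```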
More Direct Utility Estimation • ‘Reduction’ of the problem to ‘standard learning’ is nice [of course] • However, an important source of information is not used : the utilities of states are not independent • The utility of each state is its own reward plus the expected utility of its successor states • Bellman equations • Using this prior knowledge can improve [e.g. speed up] learning considerably • As is generally the case
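In standard notation (not spelled out on the slide), for a fixed policy π and discount factor γ this constraint is the Bellman equation for policy evaluation:

```latex
U^{\pi}(s) \;=\; R(s) + \gamma \sum_{s'} P\bigl(s' \mid s, \pi(s)\bigr)\, U^{\pi}(s')
```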
Adaptive Dynamic Programming • Take into account the constraints between states • The passive learning agent learns based on observed rewards and a transition model • The latter models the probability of reaching state s’ from state s when performing action a(s) • Two possibilities • Solve the system of linear equations [for small systems] • Update iteratively
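A sketch of the iterative option, assuming the agent has already estimated a transition model P[s][a] (a dict of successor-state probabilities) and rewards R from its observations; all names and constants below are illustrative, not from the lecture.

```python
# Iterative policy evaluation on the learned model (a sketch).
# P[s][a] maps successor states to probabilities, R gives rewards,
# policy is the fixed policy being evaluated.
def evaluate_policy(P, R, policy, gamma=0.9, iterations=100):
    U = {s: 0.0 for s in R}                      # start from zero utilities
    for _ in range(iterations):
        U = {s: R[s] + gamma * sum(p * U[s2]
                                   for s2, p in P[s][policy[s]].items())
             for s in R}
    return U
```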
Temporal Difference • Take into account the constraints between states • Idea : use observed transitions to adjust the utility values of observed states so that they agree [better] with the constraints
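Concretely, when the agent observes a transition from s to its successor s’, a common form of the TD update (with learning rate α and discount γ; neither constant is given on the slide) is:

```latex
U^{\pi}(s) \;\leftarrow\; U^{\pi}(s) + \alpha \bigl( R(s) + \gamma\, U^{\pi}(s') - U^{\pi}(s) \bigr)
```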
Active Reinforcement • A passive learning agent has a fixed policy... • An active agent must decide [learn] what action to take, i.e., it should find the optimal policy • The agent should make a trade-off between exploitation and exploration
Exploitation & Exploration • Exploitation : use best action [at that time] in order to come to highest reward • Exploration : attempt to get to all states possible by trying all actions possible [resulting in experience from which can be learned]
Exploitation & Exploration • Agent relying completely on exploitation is called greedy and often very suboptimal • Trade-off between greed and curiosity of the agent is controlled by an exploration function
Learning Action-Value • Temporal difference learning can also be used for active reinforcement learning • The action-value function gives the expected utility of taking a given action in a given state • Q-learning is an alternative TD method that learns an action-value function Q(a,s) instead of utilities • The important difference is that Q-learning is ‘model-free’ : no transition model has to be learned, nor the actual utilities
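A minimal sketch of the Q-learning update; the learning rate alpha, discount gamma, and the example transition below are illustrative assumptions.

```python
# Q-learning update: Q(a,s) <- Q(a,s) + alpha * (r + gamma * max_a' Q(a',s') - Q(a,s)).
# No transition model is needed, only the observed transition (s, a, r, s').
from collections import defaultdict

Q = defaultdict(float)   # Q[(a, s)], defaults to 0.0

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Adjust Q(a,s) towards the observed reward plus the best next Q-value."""
    best_next = max(Q[(a2, s_next)] for a2 in actions) if actions else 0.0
    Q[(a, s)] += alpha * (r + gamma * best_next - Q[(a, s)])

q_update("s1", "right", -0.04, "s2", ["up", "down", "left", "right"])
```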
Of Course : Generalization • For large state spaces, representing the utility and/or Q-function exactly as a table becomes unrealistic • Function approximation is needed, i.e., a representation that is not tabular • This makes it possible to represent utility functions for very large state spaces • More importantly, it allows for generalization • All this relates, of course, to decision trees, MAP, regression, density estimation, ML, hypothesis spaces, etc.
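As one common example of such approximation (the linear features, constants, and function names below are assumptions, not from the lecture), the utility can be represented as a weighted sum of features and its weights adjusted with a TD-style step:

```python
# Sketch of linear function approximation for the utility: U(s) is a weighted
# sum of hand-picked features of s, and the weights are nudged to reduce the
# TD error of an observed transition.
def U_hat(weights, features):
    return sum(w * f for w, f in zip(weights, features))

def td_weight_update(weights, feats_s, feats_s2, r, alpha=0.05, gamma=0.9):
    """Move the weights to reduce the TD error for the observed transition."""
    error = r + gamma * U_hat(weights, feats_s2) - U_hat(weights, feats_s)
    return [w + alpha * error * f for w, f in zip(weights, feats_s)]
```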
E.g. Inverted Pendulum MPI Magdeburg, Germany
...and Triple Inverted MPI Magdeburg, Germany
Finally... a Summary • Reinforcement learning enables agents to become skilled in an unknown environment based only on percepts and occasional rewards • 3 approaches • Direct utility estimation : treats observations as independent • Adaptive dynamic programming : learns a model + reward function and uses these to determine utilities or an optimal policy • Temporal difference : adjusts utility values so that they agree with the constraints
More Summary... • The trade-off between exploitation and exploration is important • Large state spaces call for approximate methods, giving rise to function learning, regression, etc. • Reinforcement learning : one of the most active areas of machine learning research, because of its potential for eliminating hand coding of control strategies...
Next Week • Guest lecturer Peter Andreasen on ... I don’t know yet • Place : Auditorium 1 • Start : ±0900 • [Next next week : final lecture, including probably an hour’s lecture on NERO, some words on the course evaluation, and the opportunity to ask questions...]