Reinforcement Learning [Outro] Marco Loog
Rationale • How can an agent learn if there is no teacher around who tells it with every action what’s right and what’s wrong? • E.g., an agent can learn how to play chess by supervised learning, provided that examples of states and their correct actions are available • But what if these examples are not available?
Rationale • But what if these examples are not available? • Through random moves, i.e., exploratory behavior, the agent may be able to infer knowledge about the environment it is in • But what is good and what is bad? That is exactly the knowledge needed to decide what to do in order to reach the goal
Rationale • But what is good and what is bad? That is exactly the knowledge needed to decide what to do in order to reach the goal • ‘Rewarding’ the agent when it does something good and ‘punishing’ it when it does something bad is called reinforcement • The task of reinforcement learning is to use observed rewards to learn a [best] policy for the environment
Reinforcement Learning • Use observed rewards to learn an [almost?] optimal policy for an environment • The reward R(s) assigns a number to every state s • The utility of an environment history is [as an example] the sum of the rewards received • A policy prescribes the agent’s action in any state s in order to reach the goal • The optimal policy is the policy with the highest expected utility
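As a minimal sketch of these definitions, the ‘sum of rewards’ utility of a history can be computed as below; the reward values and the example history are made up for illustration.

```python
# Utility of an environment history as the (possibly discounted) sum of rewards.
# The reward function R and the example history are hypothetical.
def history_utility(history, R, gamma=1.0):
    """Sum of discounted rewards R(s) along a state history."""
    return sum(gamma ** t * R[s] for t, s in enumerate(history))

R = {"start": -0.04, "mid": -0.04, "goal": 1.0}      # hypothetical reward function
print(history_utility(["start", "mid", "goal"], R))  # 0.92 (up to rounding) with gamma = 1
```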
Reinforcement Learning • Might be considered to encompass all of AI : an agent is dropped off somewhere and should figure everything out by itself • We will concentrate on simple settings and agent designs to keep things manageable • E.g., a fully observable environment
Typically in Games • Offline / during development : episodic reinforcement learning • Multiple training instances / several runs from start to end • Online / during actual game playing : incremental reinforcement learning • One continuous sequence of states / possibly without clear ‘end’
3 Agent Designs • Utility-based agent : learns a utility function on the basis of which it chooses its actions • Q-learning agent : learns an action-value function giving the expected utility of taking a given action in a given state • Reflex agent : learns a policy that maps directly from states to actions
Passive Reinforcement • Policy is fixed : state s always leads to the same action • Goal is simply to learn how good this policy is • [Of course this can be extended ‘easily’ to policy learning...]
Direct Utility Estimation • Idea : the utility of a state is the expected total reward from that state onward • Each trial provides a sample of this value for each state visited • After each trial, the utility estimate for every observed state is updated using a running average • In the limit, the sample average converges to the true expectation • Direct utility estimation thus reduces to standard supervised inductive learning
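A minimal sketch of this update, assuming every visit to a state contributes a reward-to-go sample; the trial data and reward values below are hypothetical.

```python
# Sketch of direct utility estimation: after a trial, the observed reward-to-go
# of every visited state is folded into that state's utility estimate via a
# running average. All trial data below is made up.
from collections import defaultdict

U = defaultdict(float)   # utility estimates per state
N = defaultdict(int)     # number of reward-to-go samples per state

def update_from_trial(states, rewards):
    """states[i] received rewards[i]; reward-to-go is the suffix sum."""
    reward_to_go = 0.0
    for s, r in zip(reversed(states), reversed(rewards)):
        reward_to_go += r
        N[s] += 1
        U[s] += (reward_to_go - U[s]) / N[s]   # running average

update_from_trial(["a", "b", "goal"], [-0.04, -0.04, 1.0])
print(U["a"])   # 0.92 after this single trial
```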
More Direct Utility Estimation • ‘Reduction’ of the problem to ‘standard learning’ is nice [of course] • However, an important source of information is not used : the utilities of states are not independent • The utility of each state is its own reward plus the expected utility of its successor states • Bellman equations • Using this prior knowledge can improve [e.g. speed up] learning considerably • As is generally the case
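In standard notation (not spelled out on the slide), for a fixed policy π and discount factor γ this constraint is the Bellman equation for policy evaluation:

```latex
U^{\pi}(s) \;=\; R(s) + \gamma \sum_{s'} P\bigl(s' \mid s, \pi(s)\bigr)\, U^{\pi}(s')
```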
Adaptive Dynamic Programming • Take into account the constraints between states • The passive learning agent learns based on observed rewards and a transition model • The latter models the probability of reaching state s’ from state s when performing action a(s) • Two possibilities • Solve the system of linear equations [for small systems] • Update iteratively
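A sketch of the iterative option, assuming the agent has already estimated a transition model P[s][a] (a dict of successor-state probabilities) and rewards R from its observations; all names and constants below are illustrative, not from the lecture.

```python
# Iterative policy evaluation on the learned model (a sketch).
# P[s][a] maps successor states to probabilities, R gives rewards,
# policy is the fixed policy being evaluated.
def evaluate_policy(P, R, policy, gamma=0.9, iterations=100):
    U = {s: 0.0 for s in R}                      # start from zero utilities
    for _ in range(iterations):
        U = {s: R[s] + gamma * sum(p * U[s2]
                                   for s2, p in P[s][policy[s]].items())
             for s in R}
    return U
```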
Temporal Difference • Take into account the constraints between states • Idea : use observed transitions to adjust the utility values of observed states so that they agree [better] with the constraints
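Concretely, when the agent observes a transition from s to its successor s’, a common form of the TD update (with learning rate α and discount γ; neither constant is given on the slide) is:

```latex
U^{\pi}(s) \;\leftarrow\; U^{\pi}(s) + \alpha \bigl( R(s) + \gamma\, U^{\pi}(s') - U^{\pi}(s) \bigr)
```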
Active Reinforcement • A passive learning agent has a fixed policy... • An active agent must decide [learn] what action to take, i.e., it should find the optimal policy • The agent should make a trade-off between exploitation and exploration
Exploitation & Exploration • Exploitation : use best action [at that time] in order to come to highest reward • Exploration : attempt to get to all states possible by trying all actions possible [resulting in experience from which can be learned]
Exploitation & Exploration • Agent relying completely on exploitation is called greedy and often very suboptimal • Trade-off between greed and curiosity of the agent is controlled by an exploration function
Learning Action-Value • Temporal difference learning can also be used for active reinforcement learning • The action-value function gives the expected utility of taking a given action in a given state • Q-learning is an alternative TD method that learns an action-value function Q(a,s) instead of utilities • The important difference is that Q-learning is ‘model-free’ : no transition model has to be learned, nor the actual utilities
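A minimal sketch of the Q-learning update; the learning rate alpha, discount gamma, and the example transition below are illustrative assumptions.

```python
# Q-learning update: Q(a,s) <- Q(a,s) + alpha * (r + gamma * max_a' Q(a',s') - Q(a,s)).
# No transition model is needed, only the observed transition (s, a, r, s').
from collections import defaultdict

Q = defaultdict(float)   # Q[(a, s)], defaults to 0.0

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Adjust Q(a,s) towards the observed reward plus the best next Q-value."""
    best_next = max(Q[(a2, s_next)] for a2 in actions) if actions else 0.0
    Q[(a, s)] += alpha * (r + gamma * best_next - Q[(a, s)])

q_update("s1", "right", -0.04, "s2", ["up", "down", "left", "right"])
```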
Of Course : Generalization • For large state spaces, representing the utility and/or Q-function exactly as a table becomes unrealistic • Function approximation is needed, i.e., a representation that is not tabular • This makes it possible to represent utility functions for very large state spaces • More importantly, it allows for generalization • All this relates, of course, to decision trees, MAP, regression, density estimation, ML, hypothesis spaces, etc.
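As one common example of such approximation (the linear features, constants, and function names below are assumptions, not from the lecture), the utility can be represented as a weighted sum of features and its weights adjusted with a TD-style step:

```python
# Sketch of linear function approximation for the utility: U(s) is a weighted
# sum of hand-picked features of s, and the weights are nudged to reduce the
# TD error of an observed transition.
def U_hat(weights, features):
    return sum(w * f for w, f in zip(weights, features))

def td_weight_update(weights, feats_s, feats_s2, r, alpha=0.05, gamma=0.9):
    """Move the weights to reduce the TD error for the observed transition."""
    error = r + gamma * U_hat(weights, feats_s2) - U_hat(weights, feats_s)
    return [w + alpha * error * f for w, f in zip(weights, feats_s)]
```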
E.g. Inverted Pendulum MPI Magdeburg, Germany
...and Triple Inverted MPI Magdeburg, Germany
Finally... a Summary • Reinforcement learning enables agents to become skilled in an unknown environment based only on percepts and occasional rewards • 3 approaches • Direct utility estimation : treats observations as independent • Adaptive dynamic programming : learns a model + reward function and uses these to determine utilities or an optimal policy • Temporal difference : adjusts utility values so that they agree with the constraints
More Summary... • The trade-off between exploitation and exploration is important • Large state spaces call for approximate methods, giving rise to function learning, regression, etc. • Reinforcement learning : one of the most active areas of machine learning research, because of its potential for eliminating hand coding of control strategies...
Next Week • Guest lecturer Peter Andreasen on ... I don’t know yet • Place : Auditorium 1 • Start : ±0900 • [Next next week : final lecture, including probably an hour’s lecture on NERO, some words on the course evaluation, and the opportunity to ask questions...]