Reinforcement Learning Part 1 Markov Decision Processes (MDPs), Value, and Search
Outline of Lecture • Markov Decision Processes • What are they conceptually? • What are they formally? • Value Functions • Using standard search methods
Computer Science → Databases, Networks, Artificial Intelligence, Theory, … • Artificial Intelligence → NLP, Planning, Machine Learning, Vision, … • Machine Learning → Unsupervised Learning, Reinforcement Learning, Supervised Learning • Supervised Learning → Classification, Regression, … • Unsupervised Learning → Clustering, Representation discovery, …
Reinforcement Learning • Learning to make optimal decisions without a teacher that says what the correct response is. • Related to: • Learning by trial and error • Operant conditioning
Reinforcement Learning • Sutton and Barto (1998) say: • Reinforcement learning is learning what to do--how to map situations to actions--so as to maximize a numerical reward signal. • The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. • In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics--trial-and-error search and delayed reward--are the two most important distinguishing features of reinforcement learning.
Example: GridWorld • Figure labels: the states, the current state, the available actions, the initial or start state, and an (optional) terminal state.
Policy • A distribution over admissible actions for each state. • π(s,a) = the probability of taking action a in state s.
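As a concrete illustration (not from the slides), a small finite policy like the GridWorld one can be stored as a table of probabilities and sampled from; the state indices and action names below are placeholders:

```python
import random

# Hypothetical tabular policy: pi[s][a] = probability of taking action a in state s.
# States 1..22 as in the GridWorld example; a uniform policy is used as a placeholder.
pi = {s: {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25} for s in range(1, 23)}

def sample_action(pi, s):
    """Draw an action a ~ pi(s, .)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]
```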
Bad Policy *Takes the blue-arrow action shown in the figure with high probability.
Another Good Policy We need to formalize the problem in order to say which is better.
Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • S is the set of possible states • “State set” or “state space” • Can be finite, or infinite (countable or uncountable). • We will mainly discuss the finite setting. • {1, 2, …, 22} in our GridWorld example.
Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • A is the set of possible actions • “Action set” or “action space” • Can be finite, or infinite (countable or uncountable). • We will mainly discuss the finite setting. • {up, down, left, right} in our GridWorld example.
Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • P describes the transition dynamics • “Transition function”, “transition matrix” • P(s,a,s’) gives the probability of entering state s’ when action a is taken in state s. • P(s,a,·) is a probability mass function or probability density function over the state set; that is, P is a conditional distribution over next states given the current state and action.
Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • P describes the transition dynamics • “Transition function”, “transition matrix” • P(s,a,s’) gives the probability of entering state s’ when action a is taken in state s. • Example from the figure: P(s, up, s’) = 0.5, P(s, up, s’’) = 0.2, …
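A minimal sketch of how P might be stored and sampled for a finite MDP; the state indices below are hypothetical, and the 0.5/0.2 probabilities simply mirror the figure's example rather than a full specification:

```python
import random

# P[s][a] maps each possible next state s' to P(s, a, s').
# Hypothetical entries: from state 1, action "up" reaches state 2 w.p. 0.5, state 5 w.p. 0.2, etc.
P = {1: {"up": {2: 0.5, 5: 0.2, 1: 0.3}}}

def sample_next_state(P, s, a):
    """Sample s' ~ P(s, a, .)."""
    next_states, probs = zip(*P[s][a].items())
    return random.choices(next_states, weights=probs, k=1)[0]
```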
Time • We use t to denote the (integer) time step. • s_t is the state at time t. • a_t is the action chosen at time t. • r_t is the reward obtained at time t. • Think of this as the reward for taking action a_t in state s_t, and ending up in state s_{t+1}.
Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • R describes the rewards • “Reward function”, “reward vector” • R(s,a) is the expected reward, r, when action a is taken in state s.
Reward Function • Figure: an actual transition in the GridWorld, with the four actions left, right, up, and down labeled.
Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • d0 is the distribution over the states at time t = 0. • “Initial state distribution” • In our example, it places a probability of one on the state marked S, and a probability of zero on the others.
Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • γ is a real-valued discount parameter in the interval [0,1]. • We will discuss it soon.
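With all six components introduced, one way to bundle them in code is a simple container. This is only a sketch of one possible representation, and the field types assume a finite MDP:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MDP:
    """A finite MDP M = (S, A, P, R, d0, gamma)."""
    S: List[int]                                # state set
    A: List[str]                                # action set
    P: Dict[int, Dict[str, Dict[int, float]]]   # P[s][a][s'] = transition probability
    R: Callable[[int, str], float]              # R(s, a) = expected reward
    d0: Dict[int, float]                        # initial state distribution
    gamma: float                                # discount parameter in [0, 1]
```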
Episodes • An episode is one run of an MDP, starting at t=0, and running until a terminal state is reached. • Some MDPs always have finite-length episodes, some do not.
Trajectory • If you use a policy, π, on an MDP, M, for one episode, you get a trajectory. Here the gold arrow denotes where the agent’s action differs from the direction that it moved.
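A sketch of how one trajectory could be generated, assuming the MDP container and the sample_action / sample_next_state helpers sketched above, plus a set of terminal states and a step limit (both assumptions, since episodes need not be finite):

```python
import random

def run_episode(mdp, pi, terminal_states, max_steps=1000):
    """Follow policy pi for one episode; return the trajectory [(s0, a0, r0), (s1, a1, r1), ...]."""
    states, probs = zip(*mdp.d0.items())
    s = random.choices(states, weights=probs, k=1)[0]     # s_0 ~ d0
    trajectory = []
    for _ in range(max_steps):
        if s in terminal_states:
            break
        a = sample_action(pi, s)                  # a_t ~ pi(s_t, .)
        r = mdp.R(s, a)                           # reward for this step
        s_next = sample_next_state(mdp.P, s, a)   # s_{t+1} ~ P(s_t, a_t, .)
        trajectory.append((s, a, r))
        s = s_next
    return trajectory
```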
Return • The return is the sum of rewards from a trajectory. • This could be infinite!
Discounted Return • Idea: make rewards that are obtained later worth less. • Would you rather have 1 cookie right now or 3 cookies in a week? • We use the parameter γ to discount rewards. If it is less than one and the rewards are bounded, then the discounted return is always finite.
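Written out (standard notation; the symbols are not defined on the slide itself), the discounted return of a trajectory is:

```latex
G = \sum_{t=0}^{\infty} \gamma^{t} r_{t} = r_0 + \gamma r_1 + \gamma^{2} r_2 + \cdots
```

If the rewards satisfy |r_t| ≤ r_max and γ < 1, then |G| ≤ r_max / (1 − γ), so the discounted return is always finite.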
Value Function • The value is defined in terms of a state and a policy: • It is the expected discounted return when following the specified policy starting in the specified state.
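In symbols, using the standard definition that matches the slide's description:

```latex
V^{\pi}(s) = \mathbb{E}\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_0 = s,\ \pi \right]
```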
Objective • Find a policy that maximizes the expected discounted return. • There could be more than one!
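One standard way to write this objective (and this J is the quantity the local-search slide later estimates and maximizes):

```latex
J(\pi) = \mathbb{E}\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_0 \sim d_0,\ \pi \right],
\qquad \pi^{*} \in \arg\max_{\pi} J(\pi)
```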
Problem Settings • The transition dynamics, P, and initial state distribution may or may not be known. • We will discuss the case where they are not. • The reward function, R, may or may not be known. • We will discuss the case where it is not. • Typically the state set, action set, and reward discount parameter are known ahead of time.
Partial Observability • Sometimes the agent doesn’t know the actual state of the world • Sensors make observations • Observations may be noisy • Some properties of the world might not be measured • Partially observable Markov decision processes (POMDPs) model this setting
Example MDPs • Think of your own! • What are the states and actions? • What is the reward function? • What are the transition dynamics? • How much would you discount rewards over time?
One Last Example MDP • You! • States / observations = input to brain • Actions = output of brain • Rewards = stimuli that we have evolved to find nice (e.g., a sweet taste) • Transition dynamics = physics
Parameterized Policies • In order to search the set of policies for an optimal one, we need a way of describing policies. • Let π(s,a,θ) be a parameterized policy, where θ is a vector that we call the policy parameters. • Each fixed θ results in a policy. • E.g., the policy parameters could be a vector that contains the probability of each action in each state.
Softmax Policy • One policy parameter per state-action pair.
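A minimal sketch of a tabular softmax policy, assuming the parameters are kept in a dictionary keyed by (state, action):

```python
import numpy as np

def softmax_policy(theta, s, actions):
    """pi(s, a, theta): softmax over one preference parameter theta[(s, a)] per state-action pair."""
    prefs = np.array([theta[(s, a)] for a in actions])
    prefs -= prefs.max()                        # subtract the max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return dict(zip(actions, probs))

# Example: with all-zero parameters the policy is uniform over the actions.
theta = {(1, a): 0.0 for a in ["up", "down", "left", "right"]}
print(softmax_policy(theta, 1, ["up", "down", "left", "right"]))
```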
Parameterized Gaussian Policy • Let φ(s) be a vector of features associated with the state s. • For example, the action could be the pressure on the brake pedal. • Let the features be [distance from stop sign, speed]T. • Let the policy parameters be [-1.2, 0.7]T.
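A sketch of the brake-pedal example, assuming (as is common but not stated on the slide) that the Gaussian's mean is the dot product θᵀφ(s) and its standard deviation is fixed; the feature values are made up:

```python
import numpy as np

def gaussian_policy_sample(theta, phi_s, sigma=1.0):
    """Sample a continuous action a ~ N(theta . phi(s), sigma^2)."""
    mean = float(np.dot(theta, phi_s))
    return np.random.normal(mean, sigma)

theta = np.array([-1.2, 0.7])      # policy parameters from the slide
phi_s = np.array([2.0, 10.0])      # hypothetical [distance from stop sign, speed]
pressure = gaussian_policy_sample(theta, phi_s)   # mean pressure = -1.2*2.0 + 0.7*10.0 = 4.6
```

Under this reading, the mean brake pressure falls as the distance to the stop sign grows and rises with speed, which appears to be the behaviour the slide's parameters are meant to suggest.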
“Solving an MDP” • Finding a globally optimal policy • Finding a locally optimal policy • Finding a policy that is good enough for the application at hand
Solving an MDP using Local Search • Given policy parameters, θ, we can estimate how good they are by generating many trajectories using them and then averaging the returns; call this estimate J(θ). • Use hill-climbing, simulated annealing, a genetic algorithm, or any other local search method to find a θ that maximizes J(θ).
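A sketch of this approach using simple hill climbing; run_one_episode_return is a hypothetical callable (not from the lecture) that runs one episode with policy parameters θ and returns its discounted return:

```python
import numpy as np

def estimate_J(theta, run_one_episode_return, n_episodes=100):
    """Monte Carlo estimate of J(theta): average return over sampled episodes."""
    return np.mean([run_one_episode_return(theta) for _ in range(n_episodes)])

def hill_climb(theta0, run_one_episode_return, step_size=0.1, n_iterations=200):
    """Local search in policy-parameter space: keep a random perturbation only if it improves J."""
    theta = np.asarray(theta0, dtype=float)
    best_J = estimate_J(theta, run_one_episode_return)
    for _ in range(n_iterations):
        candidate = theta + step_size * np.random.randn(*theta.shape)
        candidate_J = estimate_J(candidate, run_one_episode_return)
        if candidate_J > best_J:
            theta, best_J = candidate, candidate_J
    return theta
```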
Next Time • Can we take advantage of structure in the problem (that we know the underlying problem is an MDP) to do better than a standard local search method? • Is there evidence that suggests that our brains do this?