Reinforcement Learning Part 1 Markov Decision Processes (MDPs), Value, and Search
Outline of Lecture • Markov Decision Processes • What are they conceptually? • What are they formally? • Value Functions • Using standard search methods
Computer Science → Databases, Networks, Artificial Intelligence, Theory, … • Artificial Intelligence → NLP, Planning, Machine Learning, Vision, … • Machine Learning → Unsupervised Learning, Reinforcement Learning, Supervised Learning • Supervised Learning → Classification, Regression, … • Unsupervised Learning → Clustering, Representation discovery, …
Reinforcement Learning • Learning to make optimal decisions without a teacher that says what the correct response is. • Related to: • Learning by trial and error • Operant conditioning
Reinforcement Learning • Sutton and Barto (1998) say: • Reinforcement learning is learning what to do--how to map situations to actions--so as to maximize a numerical reward signal. • The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. • In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics--trial-and-error search and delayed reward--are the two most important distinguishing features of reinforcement learning.
Example: GridWorld • Figure labels: the states, the current state, the available actions, the initial or start state, and an (optional) terminal state.
Policy • A distribution over admissible actions for each state. • π(s,a) = the probability of taking action a in state s.
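As a concrete illustration (not from the slides), a small finite policy like the GridWorld one can be stored as a table of probabilities and sampled from; the state indices and action names below are placeholders:

```python
import random

# Hypothetical tabular policy: pi[s][a] = probability of taking action a in state s.
# States 1..22 as in the GridWorld example; a uniform policy is used as a placeholder.
pi = {s: {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25} for s in range(1, 23)}

def sample_action(pi, s):
    """Draw an action a ~ pi(s, .)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]
```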
Bad Policy *Takes the blue-arrow action shown in the figure with high probability.
Another Good Policy We need to formalize the problem in order to say which is better.
Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • S is the set of possible states • “State set” or “state space” • Can be finite, or infinite (countable or uncountable). • We will mainly discuss the finite setting. • {1, 2, …, 22} in our GridWorld example.
Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • A is the set of possible actions • “Action set” or “action space” • Can be finite, or infinite (countable or uncountable). • We will mainly discuss the finite setting. • {up, down, left, right} in our GridWorld example.
Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • P describes the transition dynamics • “Transition function”, “transition matrix” • P(s,a,s’) gives the probability of entering state s’ when action a is taken in state s. • P(s,a,·) is a probability mass function or probability density function over the state set; that is, P is a conditional distribution over next states given the current state and action.
Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • P describes the transition dynamics • “Transition function”, “transition matrix” • P(s,a,s’) gives the probability of entering state s’ when action a is taken in state s. • Example from the figure: P(s, up, s’) = 0.5, P(s, up, s’’) = 0.2, …
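A minimal sketch of how P might be stored and sampled for a finite MDP; the state indices below are hypothetical, and the 0.5/0.2 probabilities simply mirror the figure's example rather than a full specification:

```python
import random

# P[s][a] maps each possible next state s' to P(s, a, s').
# Hypothetical entries: from state 1, action "up" reaches state 2 w.p. 0.5, state 5 w.p. 0.2, etc.
P = {1: {"up": {2: 0.5, 5: 0.2, 1: 0.3}}}

def sample_next_state(P, s, a):
    """Sample s' ~ P(s, a, .)."""
    next_states, probs = zip(*P[s][a].items())
    return random.choices(next_states, weights=probs, k=1)[0]
```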
Time • We use t to denote the (integer) time step. • s_t is the state at time t. • a_t is the action chosen at time t. • r_t is the reward obtained at time t. • Think of this as the reward for taking action a_t in state s_t, and ending up in state s_{t+1}.
Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • R describes the rewards • “Reward function”, “reward vector” • R(s,a) is the expected reward, r, when action a is taken in state s.
Reward Function • Figure: an actual transition in the GridWorld, with the four actions left, right, up, and down labeled.
Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • d0 is the distribution over the states at time t = 0. • “Initial state distribution” • In our example, it places a probability of one on the state marked S, and a probability of zero on the others.
Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • γ is a real-valued discount parameter in the interval [0,1]. • We will discuss it soon.
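With all six components introduced, one way to bundle them in code is a simple container. This is only a sketch of one possible representation, and the field types assume a finite MDP:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MDP:
    """A finite MDP M = (S, A, P, R, d0, gamma)."""
    S: List[int]                                # state set
    A: List[str]                                # action set
    P: Dict[int, Dict[str, Dict[int, float]]]   # P[s][a][s'] = transition probability
    R: Callable[[int, str], float]              # R(s, a) = expected reward
    d0: Dict[int, float]                        # initial state distribution
    gamma: float                                # discount parameter in [0, 1]
```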
Episodes • An episode is one run of an MDP, starting at t=0, and running until a terminal state is reached. • Some MDPs always have finite-length episodes, some do not.
Trajectory • If you use a policy, π, on an MDP, M, for one episode, you get a trajectory. Here the gold arrow denotes where the agent’s action differs from the direction that it moved.
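A sketch of how one trajectory could be generated, assuming the MDP container and the sample_action / sample_next_state helpers sketched above, plus a set of terminal states and a step limit (both assumptions, since episodes need not be finite):

```python
import random

def run_episode(mdp, pi, terminal_states, max_steps=1000):
    """Follow policy pi for one episode; return the trajectory [(s0, a0, r0), (s1, a1, r1), ...]."""
    states, probs = zip(*mdp.d0.items())
    s = random.choices(states, weights=probs, k=1)[0]     # s_0 ~ d0
    trajectory = []
    for _ in range(max_steps):
        if s in terminal_states:
            break
        a = sample_action(pi, s)                  # a_t ~ pi(s_t, .)
        r = mdp.R(s, a)                           # reward for this step
        s_next = sample_next_state(mdp.P, s, a)   # s_{t+1} ~ P(s_t, a_t, .)
        trajectory.append((s, a, r))
        s = s_next
    return trajectory
```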
Return • The return is the sum of rewards from a trajectory. • This could be infinite!
Discounted Return • Idea: make rewards that are obtained later worth less. • Would you rather have 1 cookie right now or 3 cookies in a week? • We use the parameter γ to discount rewards. If it is less than one and the rewards are bounded, then the discounted return is always finite.
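Written out (standard notation; the symbols are not defined on the slide itself), the discounted return of a trajectory is:

```latex
G = \sum_{t=0}^{\infty} \gamma^{t} r_{t} = r_0 + \gamma r_1 + \gamma^{2} r_2 + \cdots
```

If the rewards satisfy |r_t| ≤ r_max and γ < 1, then |G| ≤ r_max / (1 − γ), so the discounted return is always finite.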
Value Function • The value is defined in terms of a state and a policy: • It is the expected discounted return when following the specified policy starting in the specified state.
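In symbols, using the standard definition that matches the slide's description:

```latex
V^{\pi}(s) = \mathbb{E}\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_0 = s,\ \pi \right]
```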
Objective • Find a policy that maximizes the expected discounted return. • There could be more than one!
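One standard way to write this objective (and this J is the quantity the local-search slide later estimates and maximizes):

```latex
J(\pi) = \mathbb{E}\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_0 \sim d_0,\ \pi \right],
\qquad \pi^{*} \in \arg\max_{\pi} J(\pi)
```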
Problem Settings • The transition dynamics, P, and initial state distribution may or may not be known. • We will discuss the case where they are not. • The reward function, R, may or may not be known. • We will discuss the case where it is not. • Typically the state set, action set, and reward discount parameter are known ahead of time.
Partial Observability • Sometimes the agent doesn’t know the actual state of the world • Sensors make observations • Observations may be noisy • Some properties of the world might not be measured • Partially observable Markov decision processes (POMDPs) model this setting
Example MDPs • Think of your own! • What are the states and actions? • What is the reward function? • What are the transition dynamics? • How much would you discount rewards over time?
One Last Example MDP • You! • States / observations = input to brain • Actions = output of brain • Rewards = stimuli that we have evolved to find nice (e.g., a sweet taste) • Transition dynamics = physics
Parameterized Policies • In order to search the set of policies for an optimal one, we need a way of describing policies. • Let π(s,a,θ) be a parameterized policy, where θ is a vector that we call the policy parameters. • Each fixed θ results in a policy. • E.g., the policy parameters could be a vector that contains the probability of each action in each state.
Softmax Policy • One policy parameter per state-action pair.
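A minimal sketch of a tabular softmax policy, assuming the parameters are kept in a dictionary keyed by (state, action):

```python
import numpy as np

def softmax_policy(theta, s, actions):
    """pi(s, a, theta): softmax over one preference parameter theta[(s, a)] per state-action pair."""
    prefs = np.array([theta[(s, a)] for a in actions])
    prefs -= prefs.max()                        # subtract the max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return dict(zip(actions, probs))

# Example: with all-zero parameters the policy is uniform over the actions.
theta = {(1, a): 0.0 for a in ["up", "down", "left", "right"]}
print(softmax_policy(theta, 1, ["up", "down", "left", "right"]))
```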
Parameterized Gaussian Policy • Let φ(s) be a vector of features associated with the state s. • For example, the action could be the pressure on the brake pedal. • Let the features be [distance from stop sign, speed]T. • Let the policy parameters be [-1.2, 0.7]T.
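A sketch of the brake-pedal example, assuming (as is common but not stated on the slide) that the Gaussian's mean is the dot product θᵀφ(s) and its standard deviation is fixed; the feature values are made up:

```python
import numpy as np

def gaussian_policy_sample(theta, phi_s, sigma=1.0):
    """Sample a continuous action a ~ N(theta . phi(s), sigma^2)."""
    mean = float(np.dot(theta, phi_s))
    return np.random.normal(mean, sigma)

theta = np.array([-1.2, 0.7])      # policy parameters from the slide
phi_s = np.array([2.0, 10.0])      # hypothetical [distance from stop sign, speed]
pressure = gaussian_policy_sample(theta, phi_s)   # mean pressure = -1.2*2.0 + 0.7*10.0 = 4.6
```

Under this reading, the mean brake pressure falls as the distance to the stop sign grows and rises with speed, which appears to be the behaviour the slide's parameters are meant to suggest.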
“Solving an MDP” • Finding a globally optimal policy • Finding a locally optimal policy • Finding a policy that is good enough for the application at hand
Solving an MDP using Local Search • Given policy parameters, θ, we can estimate how good they are by generating many trajectories using them and then averaging the returns; call this estimate J(θ). • Use hill-climbing, simulated annealing, a genetic algorithm, or any other local search method to find a θ that maximizes J(θ).
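A sketch of this approach using simple hill climbing; run_one_episode_return is a hypothetical callable (not from the lecture) that runs one episode with policy parameters θ and returns its discounted return:

```python
import numpy as np

def estimate_J(theta, run_one_episode_return, n_episodes=100):
    """Monte Carlo estimate of J(theta): average return over sampled episodes."""
    return np.mean([run_one_episode_return(theta) for _ in range(n_episodes)])

def hill_climb(theta0, run_one_episode_return, step_size=0.1, n_iterations=200):
    """Local search in policy-parameter space: keep a random perturbation only if it improves J."""
    theta = np.asarray(theta0, dtype=float)
    best_J = estimate_J(theta, run_one_episode_return)
    for _ in range(n_iterations):
        candidate = theta + step_size * np.random.randn(*theta.shape)
        candidate_J = estimate_J(candidate, run_one_episode_return)
        if candidate_J > best_J:
            theta, best_J = candidate, candidate_J
    return theta
```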
Next Time • Can we take advantage of structure in the problem (that we know the underlying problem is an MDP) to do better than a standard local search method? • Is there evidence that suggests that our brains do this?