Reinforcement Learning Part 1


Presentation Transcript


  1. Reinforcement Learning Part 1 Markov Decision Processes (MDPs), Value, and Search

  2. Outline of Lecture • Markov Decision Processes • What are they conceptually? • What are they formally? • Value Functions • Using standard search methods

  3. Computer Science [figure: field hierarchy] • Computer Science → Databases, Networks, Artificial Intelligence, Theory, … • Artificial Intelligence → NLP, Planning, Machine Learning, Vision, … • Machine Learning → Supervised Learning, Unsupervised Learning, Reinforcement Learning • Supervised Learning → Classification, Regression, … • Unsupervised Learning → Clustering, Representation discovery, …

  4. Reinforcement Learning • Learning to make optimal decisions without a teacher that says what the correct response is. • Related to: • Learning by trial and error • Operant conditioning

  5. Reinforcement Learning • Sutton and Barto (1998) say: • Reinforcement learning is learning what to do--how to map situations to actions--so as to maximize a numerical reward signal. • The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. • In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics--trial-and-error search and delayed reward--are the two most important distinguishing features of reinforcement learning.

  6. Example: GridWorld [figure: a grid world with labels for the states, the current state, the actions, the terminal state, and the (optional) initial or start state]

  7. Reward Function

  8. Transition Function

  9. Policy • Distribution over admissible actions for each state. • π(s,a) = the probability of taking action a in state s.
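
A minimal Python sketch of how such a tabular policy π(s,a) might be stored and sampled; the state names, action names, and probabilities below are illustrative assumptions, not values from the slides:

```python
import random

# Hypothetical tabular stochastic policy: for each state, a distribution
# over admissible actions. State/action names are made up for illustration.
policy = {
    "s1": {"up": 0.7, "right": 0.3},
    "s2": {"up": 0.1, "down": 0.4, "left": 0.5},
}

def sample_action(pi, s):
    """Draw an action a with probability pi(s, a)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "s1"))
```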

  10. Bad Policy [figure: a policy that takes the blue-arrow action with high probability]

  11. Good Policy

  12. Another Good Policy • We need to formalize the problem in order to say which policy is better.

  13. Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • S is the set of possible states • “State set” or “state space” • Can be finite, or infinite (countable or uncountable). • We will mainly discuss the finite setting. • {1, 2, …, 22} in our GridWorld example.

  14. Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • A is the set of possible actions • “Action set” or “action space” • Can be finite, or infinite (countable or uncountable). • We will mainly discuss the finite setting. • {up, down, left, right} in our GridWorld example.

  15. Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • P describes the transition dynamics • “Transition function”, “transition matrix” • P(s,a,s’) gives the probability of entering state s’ when action a is taken in state s. • P(s,a,·) is a probability mass function or probability density function over the state set. • That is, P is a conditional distribution over next states given the current state and action.

  16. Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • P describes the transition dynamics • “Transition function”, “transition matrix” • P(s,a,s’) gives the probability of entering state s’ when action a is taken in state s. • Example from the slide’s figure: P(s, up, s’) = 0.5, P(s, up, s’’) = 0.2, …
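
A sketch of how such a tabular transition function could be represented and sampled in Python. The 0.5 and 0.2 values are the ones from the slide; the remaining probability mass and the state names are assumptions added so the distribution sums to one:

```python
import random

# P(s, a, .) stored as a dictionary keyed by (state, action). The 0.5 and
# 0.2 entries mirror the slide; the 0.3 self-transition is an assumption
# added so that the probabilities sum to one.
P = {
    ("s", "up"): {"s_prime": 0.5, "s_double_prime": 0.2, "s": 0.3},
}

def sample_next_state(P, s, a):
    """Sample s' with probability P(s, a, s')."""
    next_states, probs = zip(*P[(s, a)].items())
    return random.choices(next_states, weights=probs, k=1)[0]
```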

  17. Time • We use t to denote the (integer) time step. • s_t is the state at time t. • a_t is the action chosen at time t. • r_t is the reward obtained at time t. • Think of this as the reward for taking action a_t in state s_t, and ending up in state s_{t+1}.

  18. Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • R describes the rewards • “Reward function”, “reward vector” • R(s,a) is the expected reward, r, when action a is taken in state s.

  19. Reward Function [figure: the actual transition, with the actions left, right, up, down]

  20. Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • d0 is the distribution over the states at time t = 0. • “Initial state distribution” • In our example, it places a probability of one on the state marked S, and a probability of zero on the others.

  21. Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • γ is a real-valued discount parameter in the interval [0,1]. • We will discuss it soon.
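
With all six components of M = (S, A, P, R, d0, γ) now introduced, here is one possible way to bundle them in Python for the finite setting discussed in the lecture; the field types are an assumption, not a prescribed interface:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class MDP:
    """Container for M = (S, A, P, R, d0, gamma); types are one reasonable
    choice for the finite setting, not a definitive interface."""
    states: List[str]                           # S: state set
    actions: List[str]                          # A: action set
    P: Dict[Tuple[str, str], Dict[str, float]]  # P(s, a, s'): transition dynamics
    R: Callable[[str, str], float]              # R(s, a): expected reward
    d0: Dict[str, float]                        # initial state distribution
    gamma: float                                # discount parameter in [0, 1]
```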

  22. Episodes • An episode is one run of an MDP, starting at t=0, and running until a terminal state is reached. • Some MDPs always have finite-length episodes, some do not.

  23. Trajectory • If you use a policy, π, on an MDP, M, for one episode, you get a trajectory. • In the figure, the gold arrow denotes where the agent’s action differs from the direction that it actually moved.
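
A sketch of how one episode could be rolled out to produce a trajectory of (state, action, reward) triples. It reuses the hypothetical sample_action and sample_next_state helpers and the MDP container sketched above; the max_steps cap is an added safeguard for episodes that need not terminate:

```python
import random

def run_episode(mdp, pi, terminal_states, max_steps=1000):
    """Roll out one episode under policy pi and return [(s, a, r), ...]."""
    # Draw the start state from the initial state distribution d0.
    s = random.choices(list(mdp.d0), weights=list(mdp.d0.values()), k=1)[0]
    trajectory = []
    for _ in range(max_steps):
        if s in terminal_states:
            break
        a = sample_action(pi, s)            # a ~ pi(s, .)
        r = mdp.R(s, a)                     # expected reward R(s, a)
        trajectory.append((s, a, r))
        s = sample_next_state(mdp.P, s, a)  # s' ~ P(s, a, .)
    return trajectory
```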

  24. Return • The return is the sum of rewards from a trajectory. • This could be infinite!

  25. Discounted Return • Idea: make rewards that are obtained later worth less. • Would you rather have 1 cookie right now or 3 cookies in a week? • We use the parameter γ to discount rewards. If it is less than one and the rewards are bounded, then the discounted return is always finite.
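
The slide's equation is not reproduced in this transcript; the standard definition it describes is:

```latex
G \;=\; \sum_{t=0}^{\infty} \gamma^{t} r_{t}, \qquad \gamma \in [0, 1].
```

If the rewards are bounded, |r_t| ≤ r_max, and γ < 1, then |G| ≤ r_max / (1 − γ), which is why the discounted return is always finite in that case.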

  26. Value Function • The value is defined in terms of a state and a policy: • It is the expected discounted return when following the specified policy starting in the specified state.
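
The value-function equation itself is not in the transcript; the standard definition matching the slide's wording is:

```latex
V^{\pi}(s) \;=\; \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s,\ \pi \right].
```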

  27. Objective • Find a policy that maximizes the expected discounted return. • There could be more than one!

  28. Problem Settings • The transition dynamics, P, and initial state distribution may or may not be known. • We will discuss the case where they are not. • The reward function, R, may or may not be known. • We will discuss the case where it is not. • Typically the state set, action set, and reward discount parameter are known ahead of time.

  29. Partial Observability • Sometimes the agent doesn’t know the actual state of the world • Sensors make observations • Observations may be noisy • Some properties of the world might not be measured • Partially observable Markov decision processes (POMDPs) model this setting

  30. Example MDPs: Pole Balancing

  31. Example MDPs: Mountain Car

  32. Example MDPs: Pendulum

  33. Example MDPs: Functional Electrical Stimulation

  34. Example Removed

  35. Example MDPs • Think of your own! • What are the states and actions? • What is the reward function? • What are the transition dynamics? • How much would you discount rewards over time?

  36. One Last Example MDP • You! • States / observations = input to brain • Actions = output of brain • Rewards = stimuli that we have evolved to find nice (e.g., a sweet taste) • Transition dynamics = physics

  37. Parameterized Policies • In order to search the set of policies for an optimal one, we need a way of describing policies. • Let π(s,a,θ) be a parameterized policy, where θ is a vector that we call the policy parameters. • Each fixed θ results in a policy. • E.g., the policy parameters could be a vector that contains the probability of each action in each state.

  38. Softmax Policy • One policy parameter per state-action pair.
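
The slide's equation is not in the transcript; a hedged Python sketch of one common softmax form with one parameter θ[(s, a)] per state-action pair:

```python
import math

def softmax_policy(theta, s, actions):
    """Return pi(s, .) with probabilities proportional to exp(theta[(s, a)]).

    One common softmax form consistent with the slide's description; the
    slide's exact equation (e.g., any temperature parameter) is not shown
    in the transcript.
    """
    prefs = [theta.get((s, a), 0.0) for a in actions]
    m = max(prefs)                      # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return {a: e / z for a, e in zip(actions, exps)}

# Example with hypothetical parameters for state "s1":
theta = {("s1", "up"): 2.0, ("s1", "down"): 0.5}
print(softmax_policy(theta, "s1", ["up", "down", "left", "right"]))
```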

  39. Parameterized Gaussian Policy • Let φ(s) be a vector of features associated with the state s. • For example, the action could be the pressure on the brake pedal. • Let the features be [distance from stop sign, speed]^T. • Let the policy parameters be [-1.2, 0.7]^T.
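
A sketch of this Gaussian policy using the features and parameters given on the slide; the standard deviation and the example measurement values are assumptions:

```python
import random

theta = [-1.2, 0.7]   # policy parameters from the slide

def phi(distance_from_stop_sign, speed):
    """Feature vector from the slide: [distance from stop sign, speed]."""
    return [distance_from_stop_sign, speed]

def gaussian_policy_action(theta, features, std=0.1):
    """Draw the action (e.g., brake-pedal pressure) from a normal distribution
    whose mean is the inner product theta . phi(s). The standard deviation
    of 0.1 is an assumption; the slide does not give one."""
    mean = sum(t * f for t, f in zip(theta, features))
    return random.gauss(mean, std)

# Hypothetical state: 10 units from the stop sign, moving at speed 5.
print(gaussian_policy_action(theta, phi(10.0, 5.0)))
```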

  40. “Solving an MDP” • Finding a globally optimal policy • Finding a locally optimal policy • Finding a policy that is good enough for the application at hand

  41. Solving an MDP using Local Search • Given policy parameters, θ, we can estimate how good they are by generating many trajectories using them and then averaging the returns; call this estimate J(θ). • Use hill-climbing, simulated annealing, a genetic algorithm, or any other local search method to find θ that maximizes J(θ).
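
A hedged hill-climbing sketch of this procedure: J(θ) is estimated by averaging the returns of many episodes, and θ is nudged toward random perturbations that improve the estimate. The run_episode_with helper, which rolls out one episode under π(s, a, θ) and returns its discounted return, is hypothetical:

```python
import random

def estimate_J(theta, run_episode_with, n_episodes=100):
    """Monte Carlo estimate of J(theta): average return over n_episodes.

    run_episode_with(theta) is a hypothetical helper that rolls out one
    episode under pi(s, a, theta) and returns its discounted return.
    """
    return sum(run_episode_with(theta) for _ in range(n_episodes)) / n_episodes

def hill_climb(theta, run_episode_with, iterations=200, step=0.1):
    """Simple hill climbing over the policy parameter vector theta."""
    best_J = estimate_J(theta, run_episode_with)
    for _ in range(iterations):
        candidate = [t + random.gauss(0.0, step) for t in theta]
        cand_J = estimate_J(candidate, run_episode_with)
        if cand_J > best_J:        # keep the perturbation only if it helps
            theta, best_J = candidate, cand_J
    return theta
```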

  42. Next Time • Can we take advantage of structure in the problem (that we know the underlying problem is an MDP) to do better than a standard local search method? • Is there evidence that suggests that our brains do this?
