
Reinforcement Learning



Presentation Transcript


  1. Reinforcement Learning. Lecturer: 虞台文. Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

  2. Content • Introduction • Main Elements • Markov Decision Process (MDP) • Value Functions

  3. Reinforcement Learning: Introduction. Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

  4. Reinforcement Learning • Learning from interaction (with the environment) • Goal-directed learning • Learning what to do and what effect it has • Trial-and-error search and delayed reward: the two most important distinguishing features of reinforcement learning

  5. Exploration and Exploitation • The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. • Dilemma: neither exploitation nor exploration can be pursued exclusively without failing at the task.

  6. Supervised Learning. [Diagram: Inputs → Supervised Learning System → Outputs.] Training info = desired (target) outputs. Error = (target output – actual output)

  7. Reinforcement Learning. [Diagram: Inputs → RL System → Outputs ("actions").] Training info = evaluations ("rewards" / "penalties"). Objective: get as much reward as possible

  8. Reinforcement Learning: Main Elements. Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

  9. Main Elements. [Diagram: the agent takes an action; the environment responds with a new state and a reward.] Objective: to maximize value

  10. Example (Bioreactor) • state • current temperature and other sensory readings, composition, target chemical • actions • how much heating, stirring, what ingredients to add • reward • moment-by-moment production of desired chemical

  11. Example (Pick-and-Place Robot) • state • current positions and velocities of joints • actions • voltages to apply to motors • reward • reaching the end position successfully, speed, smoothness of trajectory

  12. Example (Recycling Robot) • State • charge level of battery • Actions • look for cans, wait for can, go recharge • Reward • positive for finding cans, negative for running out of battery

  13. Main Elements • Environment • Its state is perceivable • Reinforcement Function • To generate reward • A function of states (or state/action pairs) • Value Function • The potential to reach the goal (with maximum total reward) • To determine the policy • A function of state

  14. The Agent-Environment Interface. [Diagram: at each time step the agent observes state s_t and takes action a_t; the environment responds with reward r_{t+1} and next state s_{t+1}, yielding the trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, ...] Frequently, we model the environment as a Markov Decision Process (MDP).
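
As a rough illustration of this interface (not from the slides), here is a minimal interaction loop in Python; the CoinFlipEnv and RandomAgent classes are made-up stand-ins for a real environment and agent.

```python
import random

class RandomAgent:
    """Picks an action uniformly at random from those available."""
    def act(self, state, actions):
        return random.choice(actions)

class CoinFlipEnv:
    """Toy environment: each step pays reward 1 with probability 0.5; episode ends after 10 steps."""
    def reset(self):
        self.t = 0
        return "s0"                                   # initial state s_0
    def step(self, action):
        self.t += 1
        reward = 1 if random.random() < 0.5 else 0    # r_{t+1}
        return f"s{self.t}", reward, self.t >= 10     # s_{t+1}, r_{t+1}, done?

env, agent = CoinFlipEnv(), RandomAgent()
state, total, done = env.reset(), 0, False
while not done:                                       # the s_t, a_t, r_{t+1}, s_{t+1} loop
    action = agent.act(state, ["a", "b"])
    state, reward, done = env.step(action)
    total += reward
print("undiscounted return:", total)
```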

  15. Reward Function. S: a set of states; A: a set of actions. • A reward function defines the goal in a reinforcement learning problem. • Roughly speaking, it maps perceived states (or state-action pairs) of the environment to a single number, a reward, indicating the intrinsic desirability of the state: r: S → ℝ, or r: S × A → ℝ.
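
As a tiny illustration (the states, actions, and numbers below are invented), either form of reward function is just a lookup over states or state-action pairs:

```python
# Reward as a function of state, R: S -> reals, or of state-action pairs, R: S x A -> reals.
R_state = {"win": +1.0, "loss": -1.0, "draw": 0.0}             # R(s)
R_state_action = {("low_battery", "search"): -3.0,             # R(s, a)
                  ("low_battery", "recharge"): 0.0}

def reward(s, a=None):
    """Look up the reward for a state, or for a state-action pair if an action is given."""
    return R_state_action[(s, a)] if a is not None else R_state[s]

print(reward("win"), reward("low_battery", "search"))
```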

  16. Goals and Rewards • The agent's goal is to maximize the total amount of reward it receives. • This means maximizing not just immediate reward, but cumulative reward in the long run.

  17. Goals and Rewards. [Figure: a task in which reaching the goal gives Reward = 1 and every other transition gives Reward = 0.] Can you design another reward function?

  18. Goals and Rewards. Reward as a function of state: Win → +1; Loss → –1; Draw or non-terminal → 0.

  19. Goals and Rewards. The reward signal is our way of communicating to the agent what we want it to achieve, not how we want it achieved.

  20. Reinforcement Learning: Markov Decision Processes. Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

  21. Definition • An MDP consists of: • A set of states S and a set of actions A • A transition distribution P(s′ | s, a) = Pr{ s_{t+1} = s′ | s_t = s, a_t = a } • Expected next rewards R(s, a, s′) = E[ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′ ]
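
A minimal sketch of how these ingredients can be held in code (all state names, probabilities, and rewards below are made up for illustration):

```python
# An MDP as plain data: states S, actions A, transition distribution P, expected rewards R.
S = ["s0", "s1"]
A = ["stay", "move"]

# P[s][a] is a dict {s': Pr(s' | s, a)}; each distribution sums to 1.
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.8, "s1": 0.2}},
}

# R[s][a][s'] is the expected reward received on the transition (s, a) -> s'.
R = {
    "s0": {"stay": {"s0": 0.0}, "move": {"s0": 0.0, "s1": 1.0}},
    "s1": {"stay": {"s1": 0.5}, "move": {"s0": 0.0, "s1": 0.0}},
}

assert all(abs(sum(P[s][a].values()) - 1.0) < 1e-9 for s in S for a in A)
```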

  22. Decision Making • Many stochastic processes can be modeled within the MDP framework. • The process is controlled by choosing actions in each state so as to attain the maximum long-term reward. How do we find the optimal policy?

  23. Example (Recycling Robot). [State-transition diagram: states High and Low (battery charge); actions search and wait in both states, plus recharge in Low.]

  24. Example (Recycling Robot). [The same transition diagram, annotated with transition probabilities and expected rewards for each state-action pair.]
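
The diagram can also be written out as an MDP table; in the sketch below, alpha, beta, and the reward values are placeholders, not the numbers shown on the slide.

```python
# Recycling-robot MDP as a table (alpha, beta, and rewards are illustrative placeholders).
alpha, beta = 0.9, 0.6        # Pr(battery stays High | search), Pr(battery stays Low | search)
r_search, r_wait = 2.0, 1.0   # expected number of cans collected while searching / waiting

# (state, action) -> list of (probability, next_state, expected_reward)
transitions = {
    ("High", "search"):   [(alpha, "High", r_search), (1 - alpha, "Low", r_search)],
    ("High", "wait"):     [(1.0, "High", r_wait)],
    ("Low",  "search"):   [(beta, "Low", r_search), (1 - beta, "High", -3.0)],  # -3: battery ran out, robot rescued
    ("Low",  "wait"):     [(1.0, "Low", r_wait)],
    ("Low",  "recharge"): [(1.0, "High", 0.0)],
}
```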

  25. Reinforcement Learning: Value Functions. Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

  26. Value Functions • To estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). • The notion of "how good" here is defined in terms of the future rewards that can be expected, or, to be precise, in terms of expected return. • Value functions are defined with respect to particular policies: V^π(s) for states, or Q^π(s, a) for state-action pairs.

  27. Returns • Episodic Tasks • finite-horizon tasks • indefinite-horizon tasks • Continual Tasks • infinite-horizon tasks

  28. Finite-Horizon Tasks (e.g., the k-armed bandit problem). Return at time t: R_t = r_{t+1} + r_{t+2} + … + r_{t+T}, for a fixed horizon T. Expected return at time t: E[R_t].

  29. Indefinite-Horizon Tasks (e.g., playing chess). Return at time t: R_t = r_{t+1} + r_{t+2} + … + r_T, where the final time step T varies from episode to episode. Expected return at time t: E[R_t].

  30. Infinite-Horizon Tasks (e.g., an ongoing control task). Discounted return at time t: R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0}^{∞} γ^k r_{t+k+1}, with 0 ≤ γ < 1. Expected return at time t: E[R_t].

  31. Unified Notation. Reformulation of episodic tasks: let the episode end in an absorbing state that transitions only to itself with reward 0 (r_4 = 0, r_5 = 0, … after termination in the diagram). Both kinds of tasks then use the discounted return at time t, R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}, where γ is the discounting factor: γ = 1 is allowed for episodic tasks, and γ < 1 for continuing tasks.
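
A quick numeric check of this formula for a finite list of sampled rewards (the numbers are arbitrary):

```python
def discounted_return(rewards, gamma):
    """R_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ... for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 0.0, 0.0]              # r_{t+1}, r_{t+2}, ...
print(discounted_return(rewards, gamma=1.0))     # undiscounted episodic return: 3.0
print(discounted_return(rewards, gamma=0.9))     # discounted return: 1 + 0.81 * 2 = 2.62
```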

  32. Policies • A policy, π, is a mapping from states, s ∈ S, and actions, a ∈ A(s), to the probability π(s, a) of taking action a when in state s.
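
For a small, finite MDP such a policy is just a table of probabilities that can be sampled from; a sketch (the states, actions, and probabilities are illustrative only):

```python
import random

# pi[s][a] = probability of taking action a in state s; each row sums to 1.
pi = {
    "High": {"search": 0.7, "wait": 0.3},
    "Low":  {"search": 0.1, "wait": 0.3, "recharge": 0.6},
}

def sample_action(pi, state):
    """Draw an action a with probability pi(state, a)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi, "Low"))
```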

  33. Value Functions under a Policy • State-Value Function: V^π(s) = E_π[ R_t | s_t = s ] • Action-Value Function: Q^π(s, a) = E_π[ R_t | s_t = s, a_t = a ]
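
These expectations can be approximated by averaging sampled returns; a self-contained toy sketch (the two-state chain, its numbers, and the trivial one-action policy are invented, not from the slides):

```python
import random

# Toy chain: from "s0", a step reaches the terminal state "s1" with reward 1
# (probability 0.5) or stays in "s0" with reward 0. There is only one action,
# so the policy pi is trivial.
def step(state):
    if random.random() < 0.5:
        return "s1", 1.0, True           # next state, reward, episode finished
    return "s0", 0.0, False

def mc_state_value(s0, gamma=0.9, episodes=20000):
    """Monte Carlo estimate of V_pi(s0) = E_pi[ R_t | s_t = s0 ] by averaging sampled returns."""
    total = 0.0
    for _ in range(episodes):
        state, G, discount, done = s0, 0.0, 1.0, False
        while not done:
            state, r, done = step(state)
            G += discount * r            # accumulate gamma^k * r_{t+k+1}
            discount *= gamma
        total += G
    return total / episodes

# Analytically V(s0) = 0.5 * 1 + 0.5 * 0.9 * V(s0), i.e. V(s0) = 0.5 / 0.55 ≈ 0.909
print(round(mc_state_value("s0"), 3))
```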

  34. Bellman Equation for a Policy π: State-Value Function. V^π(s) = Σ_a π(s, a) Σ_{s′} P(s′ | s, a) [ R(s, a, s′) + γ V^π(s′) ]

  35. Backup Diagram: State-Value Function. [Backup diagram: from state s, each action a branches to rewards r and successor states s′.]

  36. Bellman Equation for a Policy π: Action-Value Function. Q^π(s, a) = Σ_{s′} P(s′ | s, a) [ R(s, a, s′) + γ Σ_{a′} π(s′, a′) Q^π(s′, a′) ]

  37. Backup Diagram: Action-Value Function. [Backup diagram: from the pair (s, a), each successor state s′ branches to successor actions a′.]

  38. Bellman Equation for a Policy π • This is a set of equations (in fact, linear), one for each state. • The value function for π is its unique solution. • It can be regarded as a consistency condition between the values of states and successor states, and the rewards.
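
Because the system is linear, V^π can be computed exactly by solving (I − γ P^π) v = r^π; a minimal numpy sketch with a made-up two-state example:

```python
import numpy as np

gamma = 0.9
# P_pi[i, j] = Pr(s_{t+1} = j | s_t = i) under policy pi (actions already averaged out by pi).
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])
# r_pi[i] = expected immediate reward from state i when following pi.
r_pi = np.array([1.0, 0.0])

# Bellman equation in matrix form: v = r_pi + gamma * P_pi v, i.e. (I - gamma * P_pi) v = r_pi
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v)                                   # exact V_pi(s) for both states
```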

  39. Example (Grid World) • State: position • Actions: north, south, east, west; deterministic. • Reward: actions that would take the agent off the grid leave its position unchanged but give reward = –1; all other actions give reward = 0, except actions that move the agent out of the special states A and B as shown. [Figure: state-value function for the equiprobable random policy; γ = 0.9]

  40. Optimal Policy (π*) • Optimal State-Value Function: V*(s) = max_π V^π(s) • Optimal Action-Value Function: Q*(s, a) = max_π Q^π(s, a) • What is the relation between them?

  41. Optimal Value Functions. Bellman Optimality Equations: V*(s) = max_a Σ_{s′} P(s′ | s, a) [ R(s, a, s′) + γ V*(s′) ]; Q*(s, a) = Σ_{s′} P(s′ | s, a) [ R(s, a, s′) + γ max_{a′} Q*(s′, a′) ]
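
The first equation is a fixed-point condition, which suggests the iterative scheme known as value iteration; a compact sketch over a made-up two-state MDP:

```python
# Value iteration: repeatedly apply V(s) <- max_a sum_{s'} P(s'|s,a) [ R(s,a,s') + gamma V(s') ].
gamma = 0.9

# (state, action) -> list of (probability, next_state, reward); all numbers invented.
mdp = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "move"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 0.5)],
    ("s1", "move"): [(0.8, "s0", 0.0), (0.2, "s1", 0.5)],
}
states = {s for (s, _) in mdp}
def actions(s):
    return [a for (s2, a) in mdp if s2 == s]

V = {s: 0.0 for s in states}
for _ in range(200):                      # or iterate until the largest change is tiny
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[(s, a)])
                for a in actions(s))
         for s in states}
print(V)                                  # approximate V*(s) for each state
```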

  42. Optimal Value Functions. Bellman Optimality Equations (continued). How do we apply the value function to determine the action to be taken in each state? How do we compute it? How do we store it? (A one-step-lookahead sketch for the first question follows.)
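
One answer to the first question: given V*, acting optimally is a one-step lookahead through the model; a sketch using the same made-up (state, action) → outcomes table format as above:

```python
def greedy_action(state, V, mdp, gamma=0.9):
    """pi*(s) = argmax_a sum_{s'} P(s'|s,a) [ R(s,a,s') + gamma V*(s') ]  (one-step lookahead)."""
    candidates = [a for (s, a) in mdp if s == state]
    return max(candidates,
               key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[(state, a)]))

# Tiny usage with invented numbers.
mdp = {("s0", "stay"): [(1.0, "s0", 0.0)], ("s0", "move"): [(1.0, "s1", 1.0)],
       ("s1", "stay"): [(1.0, "s1", 0.5)], ("s1", "move"): [(1.0, "s0", 0.0)]}
V = {"s0": 5.0, "s1": 6.0}
print(greedy_action("s0", V, mdp))        # "move": 1.0 + 0.9*6.0 beats 0.0 + 0.9*5.0
```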

  43. Example (Grid World). [Figures: the random policy and its value function versus the optimal policy π* and the optimal value function V*.]

  44. Finding an Optimal Solution via the Bellman Equation • Finding an optimal policy by solving the Bellman Optimality Equation requires the following: • accurate knowledge of the environment dynamics; • enough space and time to do the computation; • the Markov property.

  45. Optimality and Approximation • How much space and time do we need? • polynomial in the number of states (via dynamic programming methods) • BUT the number of states is often huge (e.g., backgammon has about 10^20 states). • We usually have to settle for approximations. • Many RL methods can be understood as approximately solving the Bellman Optimality Equation.
