
Reinforcement Learning


Presentation Transcript


  1. Reinforcement Learning

  2. Reinforcement Learning • Reinforcement learning (RL) is a general-purpose framework of artificial intelligence • Mathematically, the learning environment is modeled as a Markov decision process (MDP) • MDPs provide a framework for modeling decision making in situations where the outcomes are partly random and partly under the control of a decision maker • RL emphasizes online performance • It balances exploration of unknown territory against exploitation of current knowledge in the decision process • Correct input/output pairs are never presented, nor are suboptimal actions explicitly corrected • A minimal MDP definition is sketched below
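
  A minimal sketch of the MDP ingredients listed here, written as plain Python data. The particular states, actions, transition probabilities and rewards are invented for illustration; only the structure (states, actions, stochastic transition rules, immediate rewards) follows the slide.

      # Hypothetical 3-state MDP; all names and numbers are illustrative only.
      states = ["s0", "s1", "s2"]
      actions = ["left", "right"]

      # Transition rules P(s'|s, a): outcomes are partly random.
      P = {
          ("s0", "right"): {"s1": 0.8, "s0": 0.2},
          ("s0", "left"):  {"s0": 1.0},
          ("s1", "right"): {"s2": 0.9, "s1": 0.1},
          ("s1", "left"):  {"s0": 1.0},
          ("s2", "right"): {"s2": 1.0},
          ("s2", "left"):  {"s1": 1.0},
      }

      # Immediate reward of each state transition (s, a, s'); unspecified transitions give 0.
      R = {("s1", "right", "s2"): 10.0}

      print(P[("s0", "right")])   # distribution over next states from s0 under "right"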

  3. Reinforcement Learning (cont.) • RL expects users to take proactive actions that reinforce the quality of the input data and thereby improve prediction accuracy • The reward is considered over long-term performance • The idea is inspired by behavioral psychology • The actions can be related to game theory, control theory, operations research, information theory, crowd intelligence, statistics, and genetic algorithms • An RL model is built from five constituent parts • A learning environment characterized by a set of states • A set of actions that can be taken by RL agents

  4. Reinforcement Learning (cont.) • Each action influences the agent’s future states • The agent has the capacity to assess the long-term consequences of the actions • Rules of transitions between the RL states • Rules that determine the immediate reward of a state transition • Rules that specify what the agent can observe • The above rules are often stochastic or probabilistic • The observation typically involves the scalar immediate reward associated with the last transition • The agent is assumed to observe the current state fully • In the opposite case, the agent may have only partial observability

  5. Reinforcement Learning (cont.) • The full-observation case is modeled by the MDP • The conditional probabilities P(s’|s, a) are known • In the partial-observation case • Some conditional probabilities are known • Others are unknown due to partial observability • An RL agent interacts with its environment in discrete time steps • At each time step the agent receives an observation • It then chooses an action from the set of available actions, which is subsequently sent to the environment • The environment moves to a new state • A reward is associated with the transition • The goal of reinforcement learning is to accumulate as much reward as possible over successive steps • This interaction loop is sketched below
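
  A minimal sketch of that interaction loop, assuming a toy environment object with reset() and step() methods; the environment, the random action choice, and all numbers here are illustrative, not part of the slides.

      import random

      class ToyEnv:
          """Hypothetical environment: walk on states 0..4, reward +1 for reaching state 4."""
          def reset(self):
              self.state = 0
              return self.state
          def step(self, action):                 # action: -1 (left) or +1 (right)
              self.state = max(0, min(4, self.state + action))
              reward = 1.0 if self.state == 4 else 0.0
              done = self.state == 4
              return self.state, reward, done     # observation, reward, episode finished?

      env = ToyEnv()
      obs, total_reward, done = env.reset(), 0.0, False
      while not done:
          action = random.choice([-1, +1])        # agent chooses an action
          obs, reward, done = env.step(action)    # environment moves to a new state
          total_reward += reward                  # accumulate reward over successive steps
      print("accumulated reward:", total_reward)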

  6. Reinforcement Learning (cont.)

  7. Reinforcement Learning (cont.) • The agent can choose any action as a function of the history • To act near optimally, the agent must reason about the long-term consequences of its actions • The idea is to select actions that maximize future rewards • e.g., students try various study methods to earn the best score so as to achieve a satisfying career as the future reward • The RL process is mainly characterized by interactions between the learning agent and the working environment • RL offers algorithms for solving sequential decision-making problems

  8. Reinforcement Learning (cont.) • Cumulative reward is maximized by agents • After taking a series of actions in a working environment • Without knowing any rules in advance • An agent observes the current environmental state • Tries some actions to improve the learning process • The reward is fed back to the agent, which adjusts its action strategy accordingly • After numerous adjustments, the algorithm acquires the knowledge of optimal actions that achieve the best results for a specific situation in the environment • Interaction of an agent and environment • At each time t, the agent receives a state s_t and executes an action a_t

  9. Reinforcement Learning (cont.) • It then receives an observation o_t and a reward r_t associated with the action • The environment is typically formulated as a Markov decision process to allow the agent to interact with it

  10. Reinforcement Learning (cont.) • After receiving an action, the environment emits a state and a scalar reward • A sequence of observations, actions and rewards {o_1, r_1, a_1, ⋯, a_{t-1}, o_t, r_t} forms an experience • The state is a function of the experience • The policy and the value function play important roles in how the agent selects actions and in its performance • An RL algorithm demands a policy • A behavior function selecting actions given states • It links the states of the prediction model to the levels of reinforcement actions to take

  11. Reinforcement Learning (cont.) • One typical policy is the deterministic policy • It always executes some action a in a specific state s, i.e. a = 𝜋(s) • The other is the stochastic policy • It gives a probability of performing action a in state s, i.e. 𝜋(a|s) = P[a|s] • The two forms are contrasted in the sketch below • The value function predicts the future reward • It evaluates the efficacy of an action or state • Q𝜋(s, a) is the expected total reward from state s and action a under policy 𝜋 • It is the expected value of the accumulated reward obtained in future states • i.e. at times t+1, t+2, t+3, … etc.
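
  A short sketch contrasting the two policy forms; the state names, actions and probabilities are invented for illustration.

      import random

      # Deterministic policy: a = pi(s) always returns the same action for a given state.
      deterministic_pi = {"s0": "right", "s1": "right", "s2": "left"}

      def act_deterministic(s):
          return deterministic_pi[s]

      # Stochastic policy: pi(a|s) = P[a|s] gives a distribution over actions for each state.
      stochastic_pi = {
          "s0": {"right": 0.9, "left": 0.1},
          "s1": {"right": 0.6, "left": 0.4},
          "s2": {"right": 0.2, "left": 0.8},
      }

      def act_stochastic(s):
          probs = stochastic_pi[s]
          return random.choices(list(probs), weights=probs.values())[0]

      print(act_deterministic("s0"))   # always "right"
      print(act_stochastic("s0"))      # "right" with probability 0.9, "left" with 0.1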

  12. Reinforcement Learning (cont.) • The future reward is discounted as time passes • The discount factor 𝛾 ∈ [0, 1] is applied to decrease the reward in a future state • Q𝜋(s, a) = E[r_{t+1} + 𝛾 r_{t+2} + 𝛾² r_{t+3} + ⋯ | s_t = s, a_t = a] • A small numeric sketch of this discounted sum follows below • There is no perfect model to predict what will exactly happen in the future • The goal is to obtain the maximum value of Q𝜋(s, a) • The optimal policy is obtained by maximizing the value function • Dynamic programming is used to achieve the optimal value through multiple iterations • Given an action a under state s, a reward r_{t+1} is obtained at state s_{t+1}
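
  A small sketch of the discounted return behind that formula: given a sequence of future rewards, each is weighted by an increasing power of the discount factor. The reward values and 𝛾 = 0.9 below are arbitrary.

      def discounted_return(rewards, gamma=0.9):
          """Sum r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ... for one observed trajectory."""
          return sum((gamma ** k) * r for k, r in enumerate(rewards))

      # Rewards observed at t+1, t+2, t+3, t+4 on one illustrative trajectory
      print(discounted_return([-0.5, -0.5, -0.5, 10.0]))   # later rewards count less
      # Q_pi(s, a) is the expectation of this quantity over many trajectories that
      # start in state s with action a and follow policy pi afterwards.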

  13. Reinforcement Learning (cont.) • To achieve the maximum Q-value, the Q-value at state s_{t+1} needs to be optimal • Similarly, the Q-value at state s_{t+2} should be optimized to guarantee the optimal Q-value at state s_{t+1}, etc. • The iterative process goes on until the final state • When the number of states and actions is small • A state-action table can be built to record the optimal Q-values • For infinite state spaces, an approximation function is needed to represent the relationship among state, action and value • The deep neural network is the best choice for this purpose • A table-based sketch is shown below
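
  For the small, discrete case described here, the state-action table can literally be a lookup table; a minimal sketch (state and action names are illustrative).

      from collections import defaultdict

      actions = ["left", "right"]

      # State-action table: one Q-value per (state, action) pair, 0 until learned.
      Q = defaultdict(float)
      Q[("s1", "right")] = 4.2                       # entries get filled in as learning proceeds

      # With a small table, the best action in a state is just a lookup:
      best_action = max(actions, key=lambda a: Q[("s1", a)])
      print(best_action)                             # -> "right"

      # For huge or continuous state spaces the table becomes infeasible; Q(s, a) is then
      # represented by a function approximator such as a deep neural network, as the slide notes.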

  14. Reinforcement Learning (cont.) • Example: hovering of an actual UAV using RL • The AR.Drone 2.0 quadrotor helicopter is used • The drone learns to hover stably using its controller • It uses images from the on-board bottom camera to control hovering • It recognizes a marker placed on the floor and learns to hover over it

  15. Reinforcement Learning (cont.) • The images taken by the drone’s camera are sent to the PC • The velocity commands for the drone are then calculated and sent back to the drone • The learning program runs on the PC • The drone is expected to achieve stable hovering gradually through trial and error • The image taken by the drone’s camera

  16. Reinforcement Learning (cont.) • The red circle is the marker placed on the floor • Each blue rectangle indicates a state • The image is divided into 10 states (s0 ∼ s9) • The state at a particular time is decided by the area in which the marker lies at that time • This image shows the s0 state • Q-learning is a typical RL algorithm with discrete states and actions • It updates the Q-value Q(s, a) assigned to each state-action pair (s, a) • Such a value is an evaluation of an action a ∈ A in a state s ∈ S

  17. Reinforcement Learning (cont.) • s_t and a_t are the state and action, respectively, at time t • r_{t+1} is the reward received for executing a_t in s_t • s_{t+1} is the new state • 𝛾 ∈ [0, 1] and 𝛼 ∈ (0, 1] are the discount rate and learning rate, respectively • The discount rate adjusts the influence of the rewards to be gained in the future • The updated Q-value is calculated on the basis of the current reward and the expectation of future reward • The update rule is sketched below • The rewards in each state • The maximum reward +10 is set at s4 • The minimum reward -60 is set at the outermost area s9 • The reward for all the other states is set to -0.5 • This layout of rewards is decided according to a preliminary experiment
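
  A sketch of the standard tabular Q-learning update these parameters belong to, together with the reward layout described on the slide. The code is illustrative, not the authors' implementation; states are reduced to their labels s0..s9 and the values 𝛾 = 0.8, 𝛼 = 0.9 are the ones reported later.

      from collections import defaultdict

      STATES  = [f"s{i}" for i in range(10)]                  # s0 .. s9
      ACTIONS = ["forward", "back", "left", "right", "stay"]

      # Reward layout from the slide: +10 at s4, -60 at the outermost area s9, -0.5 elsewhere.
      def reward(state):
          return {"s4": 10.0, "s9": -60.0}.get(state, -0.5)

      Q = defaultdict(float)                                   # Q(s, a), initially 0

      def q_learning_update(s, a, r, s_next, gamma=0.8, alpha=0.9):
          """Standard Q-learning update:
          Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))"""
          best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
          Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

      # Example: the drone moved from s0 to s4 by going "forward"
      q_learning_update("s0", "forward", reward("s4"), "s4")
      print(Q[("s0", "forward")])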

  18. Reinforcement Learning (cont.) • The actions are moving forward, back, left, right, and staying • The drone moves using one of these five actions at each time step • When the drone performs the same action a in the same state s

  19. Reinforcement Learning (cont.) • The state can either remain the same or change to another state • So one transition does not correspond to a single action • To address this problem of state-action deviation • The Q-value is updated only when the state changes to a different state • The experiment is executed indoors • The discount rate 𝛾 was 0.8 • The learning rate 𝛼 was 0.9 • During learning, the acquired rewards increase • The drone learns suitable behavior to acquire more rewards • The drone chooses its actions in a greedy way • It chooses the action with the highest Q-value • A sketch of this loop follows below
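
  A sketch of how greedy action selection and the "update only when the state changes" rule might fit together in the control loop. The drone interface (get_state, execute) is hypothetical, not the authors' actual API; only the structure follows the slides.

      def greedy_action(Q, state, actions):
          """Choose the action with the highest Q-value in the current state."""
          return max(actions, key=lambda a: Q[(state, a)])

      def hover_step(drone, Q, actions, reward_fn, update_fn):
          """One cycle of a hypothetical control loop."""
          s = drone.get_state()
          a = greedy_action(Q, s, actions)
          s_next = s
          while s_next == s:                       # repeat the action until the state actually changes
              drone.execute(a)
              s_next = drone.get_state()
          update_fn(s, a, reward_fn(s_next), s_next)   # Q-value updated only on a real state change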

  20. Reinforcement Learning (cont.) • RL is well suited to problems with a long-term vs. short-term reward trade-off • It has been applied successfully in robot control, elevator scheduling and game playing such as chess, checkers, Go and Atari games • RL algorithms encourage the use of samples to optimize performance • And the use of function approximation to deal with large environments • RL is especially effective in handling three machine learning environments • A known-model environment short of an analytic solution • A simulation-based optimization environment

  21. Reinforcement Learning (cont.) • An environment in which information is collected only by interacting with it, with tradeoffs • The ultimate goal is to reach some form of equilibrium under bounded rationality • Similar to that practiced in the MDP with full observation by dynamic programming • Fundamental assumptions on reinforcement learning environments • All events are episodic, forming a sequence of episodes • An episode ends when some terminal state is reached • No matter what course of actions the agent may take, termination is inevitable • The expectation of total reward is well-defined

  22. Reinforcement Learning (cont.) • For any policy and any initial distribution over states • One must be able to work out an RL algorithm that finds a policy with maximum expected gain • The algorithm needs to search for the optimal policy to achieve maximal rewards • Deterministic stationary policies are often applied • They select actions deterministically based only on the current or last state visited • There are a number of approaches to designing reinforcement learning algorithms • A brute-force method is to evaluate every candidate policy and choose the one with the largest expected return, as in the sketch below
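
  A tiny sketch of the brute-force approach on a made-up two-state MDP: enumerate every deterministic stationary policy, estimate each one's expected return by rollouts, and keep the best. Everything here, including the environment dynamics, is invented for illustration.

      import itertools, random

      STATES, ACTIONS = ["s0", "s1"], ["a0", "a1"]

      def rollout(policy, steps=20, gamma=0.9):
          """Simulate one episode under a policy in a made-up stochastic environment."""
          s, ret = "s0", 0.0
          for t in range(steps):
              a = policy[s]
              # illustrative dynamics: a1 tends to move toward s1, which pays reward 1
              s = "s1" if (a == "a1" and random.random() < 0.8) else "s0"
              ret += (gamma ** t) * (1.0 if s == "s1" else 0.0)
          return ret

      best_policy, best_return = None, float("-inf")
      # Enumerate all deterministic stationary policies: one action per state.
      for choice in itertools.product(ACTIONS, repeat=len(STATES)):
          policy = dict(zip(STATES, choice))
          avg = sum(rollout(policy) for _ in range(200)) / 200   # estimate expected return
          if avg > best_return:
              best_policy, best_return = policy, avg
      print(best_policy, round(best_return, 2))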

  23. Reinforcement Learning (cont.) • The major difficulty with this approach is that the set of candidate policies can be very large or even infinite • Value function approaches try to find a policy that maximizes the return by maintaining a set of estimates of expected returns for some policies • These methods rely on the theory of MDPs • Optimality is defined in a stronger sense than the one above • A policy is called optimal if it achieves the best expected return from any initial state • An optimal policy always chooses, in each state, the action with the highest value • The temporal-difference method allows the policy to change before the reward values are settled

  24. Reinforcement Learning (cont.) • A direct policy search method finds a good policy by searching directly in a policy space • Policy search methods are often too slow to converge to the optimal choice • Reinforcement learning is often tied to models of human skill learning or acquisition • The emphasis is on achieving online performance • To reinforce the learning process, the user must first define what optimality means • Instead of using brute force, users can use a value-function approach based on Monte Carlo or temporal difference methods like Q-learning • They can also consider the direct policy search approach
