Learn about Reinforcement Learning, a powerful AI framework that optimizes decision-making in dynamic environments through rewards and actions. Understand how RL models work and how they can be applied across various fields.
Reinforcement Learning • Reinforcement learning is a general-purpose framework of artificial intelligence • Mathematically, a Markov decision process (MDP) is applied in the learning environment • MDPs provide a framework for modeling decision making in situations where the outcomes are partly random and partly under the control of a decision maker • RL emphasizes online performance • By finding a balance between exploration of unknown territory and exploitation of current knowledge in the decision process • The correct input/output pairs are never presented, nor are suboptimal actions explicitly corrected
Reinforcement Learning (cont.) • Expects users to take proactive actions that reinforce the quality of the input data and thereby improve prediction accuracy • The reward is treated as a long-term performance measure • The idea is inspired by behavioral psychology • The actions can be related to game theory, control theory, operations research, information theory, crowd intelligence, statistics, and genetic algorithms • An RL model is built with five constituent parts • A learning environment characterized by a set of states • A set of actions that can be taken by RL agents
Reinforcement Learning (cont.) • Each action influences the agent’s future states • The agent has the capacity to assess the long-term consequences of the actions • Rules of transitions between the RL states • Rules that determine the immediate reward of a state transition • Rules that specify what the agent can observe • The above rules are often stochastic or probabilistic • The observation typically involves the scalar immediate reward associated with the last transition • The agent is assumed to observe the current state fully • In the opposite case, the agent may have only partial observability
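To make these five parts concrete, a minimal Python sketch is given below; the container name RLModel and its field names are assumptions made purely for illustration, not part of any standard library.

```python
from dataclasses import dataclass
from typing import Callable, List

# A minimal sketch of the five constituent parts of an RL model.
# All names and types here are illustrative assumptions, not a standard API.
@dataclass
class RLModel:
    states: List[str]                         # the set of environment states
    actions: List[str]                        # the set of actions available to agents
    transition: Callable[[str, str], str]     # rule: (state, action) -> next state (often stochastic)
    reward: Callable[[str, str, str], float]  # rule: (state, action, next state) -> immediate reward
    observe: Callable[[str], str]             # rule: what the agent can observe of a state
```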
Reinforcement Learning (cont.) • The full observation case is modeled by the MDP • The conditional probabilities P(s’|s, a) are known • In the partial observation case • Some conditional probabilities are known • Others are unknown due to partial observability • An RL agent interacts with its environment in discrete time steps • At each time step the agent receives an observation • Then chooses an action from the set of available actions, which is subsequently sent to the environment • The environment moves to a new state • One reward is associated with the transition • The goal of reinforcement learning is to accumulate as much reward as possible over successive steps
Reinforcement Learning (cont.) • The agent can choose any action as a function of the history • To act near optimally, the agent must reason about the long-term consequences of the actions • The idea is to select actions to maximize the future rewards • e.g., Students try various study methods to earn the best score so as to achieve a satisfying career as the future reward • The RL process is mainly characterized by interactions between the learning agents and the working environment • RL offers an algorithm to solve sequential decision-making problems
Reinforcement Learning (cont.) • Cumulative reward is maximized by agents • After taking a series of actions in a working environment • Without knowing any rules in advance • An agent observes the current environmental state • Tries some actions to improve the learning process • A reward is fed back to the agent, which adjusts its action strategy accordingly • After numerous adjustments, the algorithm acquires the knowledge of optimal actions to achieve the best results for a specific situation in the environment • Interaction of an agent and its environment • At each time t, the agent receives a state sₜ and executes an action aₜ
Reinforcement Learning (cont.) • Then receives an observation oₜ and a reward rₜ associated with the action • The environment is typically formulated as a Markov decision process to allow the agent to interact with it
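The discrete-time interaction loop described above can be sketched as follows; the Env and Agent interfaces (reset, step, act, learn) are hypothetical names chosen for illustration, not a specific library API.

```python
# Hedged sketch of the agent-environment loop: observe, act, receive reward, transition.
def run_episode(env, agent, max_steps=100):
    state = env.reset()                                 # initial state / observation
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)                       # agent chooses an action for the current state
        next_state, reward, done = env.step(action)     # environment moves to a new state and emits a reward
        agent.learn(state, action, reward, next_state)  # agent adjusts its action strategy
        total_reward += reward
        state = next_state
        if done:                                        # episode ends at a terminal state
            break
    return total_reward
```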
Reinforcement Learning (cont.) • After receiving an action, the environment emits a state and a scalar reward • A sequence of observations, actions and rewards {o₁, r₁, a₁, ⋯, aₜ₋₁, oₜ, rₜ} forms an experience • The state is a function of the experience • After the action is selected by the agent • Policy and value function play important roles in its performance • An RL algorithm demands a policy • A behavior function selecting actions given states • Links the states of the prediction model to the levels of reinforcement actions to take
Reinforcement Learning (cont.) • One typical policy is the deterministic policy • It definitely executes some action a under a specific state s, i.e. a = 𝜋(s) • The other is the stochastic policy • It gives a probability of performing some action a under state s, i.e. 𝜋(a|s) = P[a|s] • Value function predicts the future reward • Evaluates the efficacy of an action or state • Q𝜋(s, a) is the expected total reward from state s and action a under policy 𝜋 • Calculates the expected value of the accumulated reward obtained in future states • i.e. t+1, t+2, t+3, …
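A small sketch of the two kinds of policy; the states, actions and probabilities below are made up purely for illustration.

```python
import random

# Illustrative only: a deterministic policy maps a state directly to an action,
# while a stochastic policy samples an action from the distribution pi(a|s).
deterministic_policy = {"s0": "forward", "s1": "left"}   # a = pi(s)

stochastic_policy = {                                    # pi(a|s) = P[a|s]
    "s0": {"forward": 0.7, "left": 0.3},
    "s1": {"forward": 0.2, "left": 0.8},
}

def act_deterministic(state):
    return deterministic_policy[state]

def act_stochastic(state):
    dist = stochastic_policy[state]
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs, k=1)[0]
```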
Reinforcement Learning (cont.) • The future reward is discounted as time passes • The discount factor 𝛾 ∈ [0, 1] is applied to decrease the reward obtained in a future state: Q𝜋(s, a) = E[rₜ₊₁ + 𝛾rₜ₊₂ + 𝛾²rₜ₊₃ + ⋯ | sₜ = s, aₜ = a] • There is no perfect model to predict what will exactly happen in the future • The goal is to obtain the maximum value of Q𝜋(s, a) • The optimal policy is obtained by maximizing the value function • Use dynamic programming to achieve the optimal value through multiple iterations • Given an action a under state s, a reward rₜ₊₁ is obtained at state sₜ₊₁
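A one-function sketch of the discounted sum inside the expectation above; the reward sequence and the value of 𝛾 are arbitrary examples.

```python
# Illustrative only: compute the discounted return sum_k gamma^k * r_{t+1+k}
# for a made-up sequence of future rewards.
def discounted_return(rewards, gamma=0.8):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([-0.5, -0.5, 10.0], gamma=0.8))  # rewards here are arbitrary examples
```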
Reinforcement Learning (cont.) • To achieve the maximum Q-value, the state sₜ₊₁ needs to be optimal • Similarly, the Q-value of state sₜ₊₂ should be optimized to guarantee the optimal Q-value of state sₜ₊₁, etc. • The iterative process goes on until the final state • When the number of states and actions is small • A state-action table can be built to record the optimal Q-values • For infinite states, an approximation function is needed to represent the relationship among state, action and value • The deep neural network is the best choice for this purpose
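A minimal dynamic-programming sketch over a tiny, made-up MDP, iterating the Q-values until they settle; the states, actions, transitions and rewards below are invented purely for illustration and are not taken from the slides.

```python
# Tabular Q-value iteration for a tiny, fully known MDP (illustrative assumptions only).
states = ["s0", "s1", "s2"]
actions = ["stay", "move"]
# transition[(s, a)] = next state (deterministic here for simplicity)
transition = {("s0", "stay"): "s0", ("s0", "move"): "s1",
              ("s1", "stay"): "s1", ("s1", "move"): "s2",
              ("s2", "stay"): "s2", ("s2", "move"): "s2"}
# reward[(s, a)] = immediate reward for taking action a in state s
reward = {k: (1.0 if k == ("s1", "move") else -0.1) for k in transition}

gamma = 0.9
Q = {(s, a): 0.0 for s in states for a in actions}

for _ in range(100):                      # iterate until the Q-values stop changing much
    for (s, a), _q in list(Q.items()):
        s_next = transition[(s, a)]
        Q[(s, a)] = reward[(s, a)] + gamma * max(Q[(s_next, b)] for b in actions)

best_action = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
print(best_action)                        # the greedy action in each state
```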
Reinforcement Learning (cont.) • Hovering of an actual UAV using RL • Use the AR.Drone 2.0 quadrotor helicopter • The drone learns to stably hover using its controller • Uses images from the on-board bottom camera to control hovering • Recognizes a marker placed on the floor and learns to hover over it
Reinforcement Learning (cont.) • The images taken by the drone’s camera are sent to the PC • Then the velocity commands for the drone are calculated and sent to the drone • The learning program runs on the PC • The drone is expected to achieve stable hovering gradually through trial and error • The image taken by the drone’s camera
Reinforcement Learning (cont.) • The red circle is the marker placed on the floor • Each blue rectangle indicates a state • The image is divided into 10 states (s0 ∼ s9) • The state at a particular time is decided by the area where the marker exists at that time • This image shows the s0 state • Q-learning is a typical RL algorithm with discrete states and actions • It updates the Q-value Q(s, a) assigned to each state-action pair (s, a) • Such a value is an evaluation of an action a ∈ A in a state s ∈ S
Reinforcement Learning (cont.) • sₜ and aₜ are the state and action, respectively, at time t • rₜ₊₁ is the received reward for executing aₜ in sₜ • sₜ₊₁ is the new state • 𝛾 ∈ [0, 1] and 𝛼 ∈ (0, 1] are the discount rate and learning rate, respectively • The discount rate adjusts the influence of the rewards to be gained in the future • The updated Q-value is calculated on the basis of the current reward and the expectation of future reward, i.e. Q(sₜ, aₜ) ← Q(sₜ, aₜ) + 𝛼[rₜ₊₁ + 𝛾 maxₐ Q(sₜ₊₁, a) − Q(sₜ, aₜ)] • The rewards in each state • The maximum reward +10 is set at s4 • The minimum reward -60 is set at the outermost area s9 • The reward for all the other states is set to -0.5 • This layout of rewards is decided according to a preliminary experiment
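A hedged sketch of this tabular Q-learning update, reusing the reward layout stated above (+10 at s4, -60 at s9, -0.5 elsewhere) and the discount and learning rates reported later for the experiment; the data structures and function names are assumptions made for illustration.

```python
# Tabular Q-learning update for the hovering task (illustrative sketch).
STATES = [f"s{i}" for i in range(10)]                  # s0 ... s9
ACTIONS = ["forward", "back", "left", "right", "stay"]
REWARD = {s: -0.5 for s in STATES}                     # reward assumed to be attached to the state entered
REWARD["s4"] = 10.0
REWARD["s9"] = -60.0

GAMMA, ALPHA = 0.8, 0.9                                # discount rate and learning rate from the experiment
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def q_update(s, a, s_next):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    r = REWARD[s_next]
    best_next = max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
```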
Reinforcement Learning (cont.) • The actions are moving forward, back, left, right, and staying • The drone moves using one of the above five actions at each time • When the drone performs the same action a in the same state s
Reinforcement Learning (cont.) • The state can either remain the same or change to another state • One transition does not correspond to a single action • To address this problem of state-action deviation • The Q-value is updated only when the state changes to a different state • The experiment is executed indoors • The discount rate was 0.8 • The learning rate was 0.9 • During learning, the acquired rewards increase • The drone can learn suitable behavior to acquire more rewards • The drone chooses its actions in a greedy way • To choose the action with the highest Q-value
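Continuing the sketch above, greedy action selection and the rule of updating only when the state changes might look like this; it reuses the hypothetical Q, ACTIONS and q_update names from the previous sketch.

```python
# Greedy action selection plus the update-only-on-state-change rule (illustrative sketch).
def choose_greedy_action(s):
    return max(ACTIONS, key=lambda a: Q[(s, a)])       # action with the highest Q-value

def step(prev_state, prev_action, new_state):
    if new_state != prev_state:                         # update Q only when the state actually changes
        q_update(prev_state, prev_action, new_state)
    return choose_greedy_action(new_state)
```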
Reinforcement Learning (cont.) • RL is well suited to problems with a long-term vs. short-term reward trade-off • It has been applied successfully in robot control, elevator scheduling, and game playing such as chess, checkers, Go and Atari games • RL algorithms encourage the use of samples to optimize performance • And the use of function approximation to deal with large environments • Especially effective in handling three types of machine learning environments • A known-model environment short of an analytic solution • A simulation-based optimization environment
Reinforcement Learning (cont.) • An environment where information is collected by interacting with it, with tradeoffs • The ultimate goal is to reach some form of equilibrium under bounded rationality • Similar to that practiced in the MDP with full observation by dynamic programming • Fundamental assumptions on reinforcement learning environments • All events are episodic, forming a sequence of episodes • An episode ends when some terminal state is reached • No matter what course of actions the agent may take, termination is inevitable • The expectation of total reward is well-defined
Reinforcement Learning (cont.) • For any policy and any initial distribution over states • One must be able to work out an RL algorithm to find a policy with maximum expected gain • The algorithm needs to search for optimal policy to achieve maximal rewards • Often apply deterministic stationary policies • Select the actions deterministically based only on the current or last state visited • A number of approaches in designing reinforcement learning algorithms • A brute-force method is to choose the policy with the largest expected return
Reinforcement Learning (cont.) • The major difficulty with this approach is that the number of policy choices can be very large or even infinite • Value function approaches try to find a policy • One that maximizes the return by maintaining a set of estimates of expected returns for some policies • These methods rely on the theory of MDPs • Optimality here is defined in a stronger sense than the one above • A policy is called optimal if it achieves the best expected return from any initial state • An optimal policy always results in choosing optimal actions with the highest value in each state • The temporal-difference method • Allows the policy to change before the reward values are settled
Reinforcement Learning (cont.) • A direct policy search method finds a good policy • By searching directly in the policy space • Policy search methods are often too slow to converge to the optimal choice • Reinforcement learning is often tied to models of human skill learning or acquisition • The emphasis is to achieve online performance • To reinforce the learning process, the user must first define what optimality means • Instead of using brute force, users can use a value-function approach based on Monte Carlo or temporal difference methods like Q-learning • They can also consider the direct policy search approach