
Reinforcement Learning


Presentation Transcript


  1. Reinforcement Learning

  2. Reinforcement Learning • Reinforcement learning (RL) is a general-purpose framework of artificial intelligence • Mathematically, the learning environment is modeled as a Markov decision process (MDP) • MDPs provide a framework for modeling decision making in situations where the outcomes are partly random and partly under the control of a decision maker • RL emphasizes online performance • It balances exploration of unknown territory against exploitation of current knowledge in the decision process • Correct input/output pairs are never presented, nor are suboptimal actions explicitly corrected • A minimal MDP definition is sketched below
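
  A minimal sketch of the MDP ingredients listed here, written as plain Python data. The particular states, actions, transition probabilities and rewards are invented for illustration; only the structure (states, actions, stochastic transition rules, immediate rewards) follows the slide.

      # Hypothetical 3-state MDP; all names and numbers are illustrative only.
      states = ["s0", "s1", "s2"]
      actions = ["left", "right"]

      # Transition rules P(s'|s, a): outcomes are partly random.
      P = {
          ("s0", "right"): {"s1": 0.8, "s0": 0.2},
          ("s0", "left"):  {"s0": 1.0},
          ("s1", "right"): {"s2": 0.9, "s1": 0.1},
          ("s1", "left"):  {"s0": 1.0},
          ("s2", "right"): {"s2": 1.0},
          ("s2", "left"):  {"s1": 1.0},
      }

      # Immediate reward of each state transition (s, a, s'); unspecified transitions give 0.
      R = {("s1", "right", "s2"): 10.0}

      print(P[("s0", "right")])   # distribution over next states from s0 under "right"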

  3. Reinforcement Learning (cont.) • RL expects users to take proactive actions that reinforce the quality of the input data and thereby improve prediction accuracy • The reward is considered over long-term performance • The idea is inspired by behavioral psychology • The actions can be related to game theory, control theory, operations research, information theory, crowd intelligence, statistics, and genetic algorithms • An RL model is built from five constituent parts • A learning environment characterized by a set of states • A set of actions that can be taken by RL agents

  4. Reinforcement Learning (cont.) • Each action influences the agent’s future states • The agent has the capacity to assess the long-term consequences of the actions • Rules of transitions between the RL states • Rules that determine the immediate reward of a state transition • Rules that specify what the agent can observe • The above rules are often stochastic or probabilistic • The observation typically involves the scalar immediate reward associated with the last transition • The agent is assumed to observe the current state fully • In the opposite case, the agent may have only partial observability

  5. Reinforcement Learning (cont.) • The full-observation case is modeled by the MDP • The conditional probabilities P(s’|s, a) are known • In the partial-observation case • Some conditional probabilities are known • Others are unknown due to partial observability • An RL agent interacts with its environment in discrete time steps • At each time step the agent receives an observation • It then chooses an action from the set of available actions, which is subsequently sent to the environment • The environment moves to a new state • A reward is associated with the transition • The goal of reinforcement learning is to accumulate as much reward as possible over successive steps • This interaction loop is sketched below
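
  A minimal sketch of that interaction loop, assuming a toy environment object with reset() and step() methods; the environment, the random action choice, and all numbers here are illustrative, not part of the slides.

      import random

      class ToyEnv:
          """Hypothetical environment: walk on states 0..4, reward +1 for reaching state 4."""
          def reset(self):
              self.state = 0
              return self.state
          def step(self, action):                 # action: -1 (left) or +1 (right)
              self.state = max(0, min(4, self.state + action))
              reward = 1.0 if self.state == 4 else 0.0
              done = self.state == 4
              return self.state, reward, done     # observation, reward, episode finished?

      env = ToyEnv()
      obs, total_reward, done = env.reset(), 0.0, False
      while not done:
          action = random.choice([-1, +1])        # agent chooses an action
          obs, reward, done = env.step(action)    # environment moves to a new state
          total_reward += reward                  # accumulate reward over successive steps
      print("accumulated reward:", total_reward)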

  6. Reinforcement Learning (cont.)

  7. Reinforcement Learning (cont.) • The agent can choose any action as a function of the history • To act near optimally, the agent must reason about the long-term consequences of its actions • The idea is to select actions that maximize future rewards • e.g., students try various study methods to earn the best score so as to achieve a satisfying career as the future reward • The RL process is mainly characterized by interactions between the learning agent and the working environment • RL offers algorithms for solving sequential decision-making problems

  8. Reinforcement Learning (cont.) • Cumulative reward is maximized by agents • After taking a series of actions in a working environment • Without knowing any rules in advance • An agent observes the current environmental state • Tries some actions to improve the learning process • The reward is fed back to the agent, which adjusts its action strategy accordingly • After numerous adjustments, the algorithm acquires the knowledge of optimal actions that achieve the best results for a specific situation in the environment • Interaction of an agent and environment • At each time t, the agent receives a state s_t and executes an action a_t

  9. Reinforcement Learning (cont.) • It then receives an observation o_t and a reward r_t associated with the action • The environment is typically formulated as a Markov decision process to allow the agent to interact with it

  10. Reinforcement Learning (cont.) • After receiving an action, the environment emits a state and a scalar reward • A sequence of observations, actions and rewards {o_1, r_1, a_1, ⋯, a_{t-1}, o_t, r_t} forms an experience • The state is a function of the experience • The policy and the value function play important roles in how the agent selects actions and in its performance • An RL algorithm demands a policy • A behavior function selecting actions given states • It links the states of the prediction model to the levels of reinforcement actions to take

  11. Reinforcement Learning (cont.) • One typical policy is the deterministic policy • It always executes some action a in a specific state s, i.e. a = 𝜋(s) • The other is the stochastic policy • It gives a probability of performing action a in state s, i.e. 𝜋(a|s) = P[a|s] • The two forms are contrasted in the sketch below • The value function predicts the future reward • It evaluates the efficacy of an action or state • Q𝜋(s, a) is the expected total reward from state s and action a under policy 𝜋 • It is the expected value of the accumulated reward obtained in future states • i.e. at times t+1, t+2, t+3, … etc.
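
  A short sketch contrasting the two policy forms; the state names, actions and probabilities are invented for illustration.

      import random

      # Deterministic policy: a = pi(s) always returns the same action for a given state.
      deterministic_pi = {"s0": "right", "s1": "right", "s2": "left"}

      def act_deterministic(s):
          return deterministic_pi[s]

      # Stochastic policy: pi(a|s) = P[a|s] gives a distribution over actions for each state.
      stochastic_pi = {
          "s0": {"right": 0.9, "left": 0.1},
          "s1": {"right": 0.6, "left": 0.4},
          "s2": {"right": 0.2, "left": 0.8},
      }

      def act_stochastic(s):
          probs = stochastic_pi[s]
          return random.choices(list(probs), weights=probs.values())[0]

      print(act_deterministic("s0"))   # always "right"
      print(act_stochastic("s0"))      # "right" with probability 0.9, "left" with 0.1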

  12. Reinforcement Learning (cont.) • The future reward is discounted as time passes • The discount factor 𝛾 ∈ [0, 1] is applied to decrease the reward in a future state • Q𝜋(s, a) = E[r_{t+1} + 𝛾 r_{t+2} + 𝛾² r_{t+3} + ⋯ | s_t = s, a_t = a] • A small numeric sketch of this discounted sum follows below • There is no perfect model to predict what will exactly happen in the future • The goal is to obtain the maximum value of Q𝜋(s, a) • The optimal policy is obtained by maximizing the value function • Dynamic programming is used to achieve the optimal value through multiple iterations • Given an action a under state s, a reward r_{t+1} is obtained at state s_{t+1}
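
  A small sketch of the discounted return behind that formula: given a sequence of future rewards, each is weighted by an increasing power of the discount factor. The reward values and 𝛾 = 0.9 below are arbitrary.

      def discounted_return(rewards, gamma=0.9):
          """Sum r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ... for one observed trajectory."""
          return sum((gamma ** k) * r for k, r in enumerate(rewards))

      # Rewards observed at t+1, t+2, t+3, t+4 on one illustrative trajectory
      print(discounted_return([-0.5, -0.5, -0.5, 10.0]))   # later rewards count less
      # Q_pi(s, a) is the expectation of this quantity over many trajectories that
      # start in state s with action a and follow policy pi afterwards.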

  13. Reinforcement Learning (cont.) • To achieve the maximum Q-value, the Q-value at state s_{t+1} needs to be optimal • Similarly, the Q-value at state s_{t+2} should be optimized to guarantee the optimal Q-value at state s_{t+1}, etc. • The iterative process goes on until the final state • When the number of states and actions is small • A state-action table can be built to record the optimal Q-values • For infinite state spaces, an approximation function is needed to represent the relationship among state, action and value • The deep neural network is the best choice for this purpose • A table-based sketch is shown below
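
  For the small, discrete case described here, the state-action table can literally be a lookup table; a minimal sketch (state and action names are illustrative).

      from collections import defaultdict

      actions = ["left", "right"]

      # State-action table: one Q-value per (state, action) pair, 0 until learned.
      Q = defaultdict(float)
      Q[("s1", "right")] = 4.2                       # entries get filled in as learning proceeds

      # With a small table, the best action in a state is just a lookup:
      best_action = max(actions, key=lambda a: Q[("s1", a)])
      print(best_action)                             # -> "right"

      # For huge or continuous state spaces the table becomes infeasible; Q(s, a) is then
      # represented by a function approximator such as a deep neural network, as the slide notes.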

  14. Reinforcement Learning (cont.) • Example: hovering of an actual UAV using RL • The AR.Drone 2.0 quadrotor helicopter is used • The drone learns to hover stably using its controller • It uses images from the on-board bottom camera to control hovering • It recognizes a marker placed on the floor and learns to hover over it

  15. Reinforcement Learning (cont.) • The images taken by the drone’s camera are sent to the PC • The velocity commands for the drone are then calculated and sent back to the drone • The learning program runs on the PC • The drone is expected to achieve stable hovering gradually through trial and error • The image taken by the drone’s camera

  16. Reinforcement Learning (cont.) • The red circle is the marker placed on the floor • Each blue rectangle indicates a state • The image is divided into 10 states (s0 ∼ s9) • The state at a particular time is decided by the area in which the marker lies at that time • This image shows the s0 state • Q-learning is a typical RL algorithm with discrete states and actions • It updates the Q-value Q(s, a) assigned to each state-action pair (s, a) • Such a value is an evaluation of an action a ∈ A in a state s ∈ S

  17. Reinforcement Learning (cont.) • s_t and a_t are the state and action, respectively, at time t • r_{t+1} is the reward received for executing a_t in s_t • s_{t+1} is the new state • 𝛾 ∈ [0, 1] and 𝛼 ∈ (0, 1] are the discount rate and learning rate, respectively • The discount rate adjusts the influence of the rewards to be gained in the future • The updated Q-value is calculated on the basis of the current reward and the expectation of future reward • The update rule is sketched below • The rewards in each state • The maximum reward +10 is set at s4 • The minimum reward -60 is set at the outermost area s9 • The reward for all the other states is set to -0.5 • This layout of rewards is decided according to a preliminary experiment
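
  A sketch of the standard tabular Q-learning update these parameters belong to, together with the reward layout described on the slide. The code is illustrative, not the authors' implementation; states are reduced to their labels s0..s9 and the values 𝛾 = 0.8, 𝛼 = 0.9 are the ones reported later.

      from collections import defaultdict

      STATES  = [f"s{i}" for i in range(10)]                  # s0 .. s9
      ACTIONS = ["forward", "back", "left", "right", "stay"]

      # Reward layout from the slide: +10 at s4, -60 at the outermost area s9, -0.5 elsewhere.
      def reward(state):
          return {"s4": 10.0, "s9": -60.0}.get(state, -0.5)

      Q = defaultdict(float)                                   # Q(s, a), initially 0

      def q_learning_update(s, a, r, s_next, gamma=0.8, alpha=0.9):
          """Standard Q-learning update:
          Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))"""
          best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
          Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

      # Example: the drone moved from s0 to s4 by going "forward"
      q_learning_update("s0", "forward", reward("s4"), "s4")
      print(Q[("s0", "forward")])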

  18. Reinforcement Learning (cont.) • The actions are moving forward, back, left, right, and staying • The drone moves using one of these five actions at each time step • When the drone performs the same action a in the same state s

  19. Reinforcement Learning (cont.) • The state can either remain the same or change to another state • So one transition does not correspond to a single action • To address this problem of state-action deviation • The Q-value is updated only when the state changes to a different state • The experiment is executed indoors • The discount rate 𝛾 was 0.8 • The learning rate 𝛼 was 0.9 • During learning, the acquired rewards increase • The drone learns suitable behavior to acquire more rewards • The drone chooses its actions in a greedy way • It chooses the action with the highest Q-value • A sketch of this loop follows below
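
  A sketch of how greedy action selection and the "update only when the state changes" rule might fit together in the control loop. The drone interface (get_state, execute) is hypothetical, not the authors' actual API; only the structure follows the slides.

      def greedy_action(Q, state, actions):
          """Choose the action with the highest Q-value in the current state."""
          return max(actions, key=lambda a: Q[(state, a)])

      def hover_step(drone, Q, actions, reward_fn, update_fn):
          """One cycle of a hypothetical control loop."""
          s = drone.get_state()
          a = greedy_action(Q, s, actions)
          s_next = s
          while s_next == s:                       # repeat the action until the state actually changes
              drone.execute(a)
              s_next = drone.get_state()
          update_fn(s, a, reward_fn(s_next), s_next)   # Q-value updated only on a real state change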

  20. Reinforcement Learning (cont.) • RL is well suited to problems with a long-term vs. short-term reward trade-off • It has been applied successfully in robot control, elevator scheduling and game playing such as chess, checkers, Go and Atari games • RL algorithms encourage the use of samples to optimize performance • And the use of function approximation to deal with large environments • RL is especially effective in handling three machine learning environments • A known-model environment short of an analytic solution • A simulation-based optimization environment

  21. Reinforcement Learning (cont.) • An environment in which information is collected only by interacting with it, with tradeoffs • The ultimate goal is to reach some form of equilibrium under bounded rationality • Similar to that practiced in the MDP with full observation by dynamic programming • Fundamental assumptions on reinforcement learning environments • All events are episodic, forming a sequence of episodes • An episode ends when some terminal state is reached • No matter what course of actions the agent may take, termination is inevitable • The expectation of total reward is well-defined

  22. Reinforcement Learning (cont.) • For any policy and any initial distribution over states • One must be able to work out an RL algorithm that finds a policy with maximum expected gain • The algorithm needs to search for the optimal policy to achieve maximal rewards • Deterministic stationary policies are often applied • They select actions deterministically based only on the current or last state visited • There are a number of approaches to designing reinforcement learning algorithms • A brute-force method is to evaluate every candidate policy and choose the one with the largest expected return, as in the sketch below
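
  A tiny sketch of the brute-force approach on a made-up two-state MDP: enumerate every deterministic stationary policy, estimate each one's expected return by rollouts, and keep the best. Everything here, including the environment dynamics, is invented for illustration.

      import itertools, random

      STATES, ACTIONS = ["s0", "s1"], ["a0", "a1"]

      def rollout(policy, steps=20, gamma=0.9):
          """Simulate one episode under a policy in a made-up stochastic environment."""
          s, ret = "s0", 0.0
          for t in range(steps):
              a = policy[s]
              # illustrative dynamics: a1 tends to move toward s1, which pays reward 1
              s = "s1" if (a == "a1" and random.random() < 0.8) else "s0"
              ret += (gamma ** t) * (1.0 if s == "s1" else 0.0)
          return ret

      best_policy, best_return = None, float("-inf")
      # Enumerate all deterministic stationary policies: one action per state.
      for choice in itertools.product(ACTIONS, repeat=len(STATES)):
          policy = dict(zip(STATES, choice))
          avg = sum(rollout(policy) for _ in range(200)) / 200   # estimate expected return
          if avg > best_return:
              best_policy, best_return = policy, avg
      print(best_policy, round(best_return, 2))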

  23. Reinforcement Learning (cont.) • The major difficulty with this approach is that the set of candidate policies can be very large or even infinite • Value function approaches try to find a policy that maximizes the return by maintaining a set of estimates of expected returns for some policies • These methods rely on the theory of MDPs • Optimality is defined in a stronger sense than the one above • A policy is called optimal if it achieves the best expected return from any initial state • An optimal policy always chooses, in each state, the action with the highest value • The temporal-difference method allows the policy to change before the reward values are settled

  24. Reinforcement Learning (cont.) • A direct policy search method finds a good policy by searching directly in a policy space • Policy search methods are often too slow to converge to the optimal choice • Reinforcement learning is often tied to models of human skill learning or acquisition • The emphasis is on achieving online performance • To reinforce the learning process, the user must first define what optimality means • Instead of using brute force, users can use a value-function approach based on Monte Carlo or temporal difference methods like Q-learning • They can also consider the direct policy search approach
