Reinforcement Learning: Overview Cheng-Zhong Xu Wayne State University
Introduction • In RL, the learner is a decision-making agent that takes actions in an environment and receives a reward (or penalty) for each action. An action may change the environment's state. After a set of trial-and-error runs, the agent should learn the best policy: the sequence of actions that maximizes the total reward • Supervised learning: learning from examples provided by a teacher • RL: learning with a critic (reward or penalty); goal-directed learning from interaction • Examples: • Game playing: sequence of moves to win a game • Robot in a maze: sequence of actions to find the goal
Example: K-armed Bandit • Given $10 to play on a slot machine with 5 levers: • Each play costs $1; each pull of a lever may produce a payoff of $0, $1, $5, or $10 • Find the optimal policy, the one that pays off the most • Tradeoff between exploitation and exploration • Exploitation: keep pulling the lever that has returned positive payoffs • Exploration: try pulling a new lever • Deterministic model • The payoff of each lever is fixed, but unknown in advance • Stochastic model • The payoff of each lever is uncertain, with known or unknown probability
K-armed Bandit in General • In the deterministic case: Q(a) is the value of action a; the reward of action a is ra, so Q(a) = ra • Choose a* if Q(a*) = maxa Q(a) • In the stochastic model: • Reward is non-deterministic: p(r|a) • Qt(a): estimate of the value of action a at time t • Delta rule: Qt+1(a) ← Qt(a) + η [rt+1(a) − Qt(a)], where η is the learning factor • Qt+1(a) is an expected value and should converge to the mean of p(r|a) as t increases
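A minimal sketch of the delta rule on a stochastic K-armed bandit. The payoff distributions, learning factor, and ε-greedy exploration probability below are illustrative assumptions, not values from the slides.

```python
# Delta-rule value estimation on a stochastic K-armed bandit (sketch).
import random

K = 5                                    # number of levers
true_mean = [0.2, 1.0, 2.5, 0.8, 1.5]    # assumed mean payoff per lever (unknown to the agent)
Q = [0.0] * K                            # estimated value of each action
eta = 0.1                                # learning factor
epsilon = 0.1                            # exploration probability

for t in range(1000):
    # epsilon-greedy choice: explore with prob epsilon, else exploit
    if random.random() < epsilon:
        a = random.randrange(K)
    else:
        a = max(range(K), key=lambda i: Q[i])
    # stochastic reward drawn around the lever's (unknown) mean
    r = random.gauss(true_mean[a], 1.0)
    # delta rule: move the estimate toward the observed reward
    Q[a] = Q[a] + eta * (r - Q[a])

print([round(q, 2) for q in Q])          # estimates approach the true means
```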
K-Armed Bandit as Simplified RL (maze figure: states Start, S2–S8, Goal) • Single state (single slot machine) vs multiple states • p(r|si, aj): different reward probabilities per state • Q(si, aj): value of action aj in state si, to be learned • Actions cause state changes, in addition to rewards • Rewards are not necessarily immediate • Delayed rewards
Elements of RL • st: state of the agent at time t • at: action taken at time t • In st, action at is taken, the clock ticks, reward rt+1 is received, and the state changes to st+1 • Next-state probability: P(st+1 | st, at) — Markov system • Reward probability: p(rt+1 | st, at) • Initial state(s), goal state(s) • Episode (trial): a sequence of actions from an initial state to the goal (see the interaction-loop sketch below)
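A minimal sketch of the agent–environment interaction loop just described. The Environment class, its states, and its dynamics are hypothetical placeholders standing in for P(st+1 | st, at) and p(rt+1 | st, at).

```python
# One episode of agent-environment interaction (sketch).
import random

class Environment:
    def __init__(self):
        self.state = "start"                       # initial state s_0
    def step(self, action):
        # the next state and reward would be drawn from P(s'|s,a) and p(r|s,a);
        # here they are faked for illustration
        next_state = random.choice(["s1", "s2", "goal"])
        reward = 1.0 if next_state == "goal" else 0.0
        self.state = next_state
        return next_state, reward, next_state == "goal"

env = Environment()
done = False
while not done:                                    # episode: initial state to goal
    a = random.choice(["n", "s", "e", "w"])        # policy: pick an action a_t
    s_next, r, done = env.step(a)                  # clock ticks, reward r_{t+1}, state s_{t+1}
```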
Policy and Cumulative Reward • Policy: π : S → A, i.e., at = π(st) • State value of a policy: Vπ(st), the expected cumulative reward obtained by following π from st • Finite horizon: Vπ(st) = E[rt+1 + rt+2 + … + rt+T] • Infinite horizon (discounted): Vπ(st) = E[rt+1 + γ rt+2 + γ2 rt+3 + …] = E[Σi=1..∞ γi−1 rt+i], with discount factor 0 ≤ γ < 1
State Value Function Example • GridWorld: a simple MDP • Grid cells ~ environment states • Four possible actions at each cell: n/s/e/w, moving one cell in the respective direction • The agent remains in place if its move would take it off the grid, but receives a reward of −1; all other moves receive a reward of 0, except • moves out of states A and B; • a reward of +10 for each move out of A (to A') and +5 for each move out of B (to B') • Policy: the agent selects the four actions with equal probability; assume γ = 0.9 (a policy-evaluation sketch for this grid follows)
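A sketch of iterative policy evaluation for this GridWorld under the equiprobable random policy and γ = 0.9. The grid size (5×5) and the positions of A, A', B, B' are assumptions for illustration; the slide only names the cells.

```python
# Iterative policy evaluation for V^pi on an assumed 5x5 GridWorld (sketch).
N, gamma = 5, 0.9
A, A_prime, B, B_prime = (0, 1), (4, 1), (0, 3), (2, 3)   # assumed positions
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]                 # n, s, w, e

V = [[0.0] * N for _ in range(N)]
for _ in range(200):                                       # sweep until values settle
    V_new = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            for di, dj in moves:                           # each action with prob 1/4
                if (i, j) == A:
                    r, (ni, nj) = 10.0, A_prime            # any move out of A
                elif (i, j) == B:
                    r, (ni, nj) = 5.0, B_prime             # any move out of B
                else:
                    ni, nj = i + di, j + dj
                    if 0 <= ni < N and 0 <= nj < N:
                        r = 0.0
                    else:                                  # off-grid: stay put, reward -1
                        r, ni, nj = -1.0, i, j
                V_new[i][j] += 0.25 * (r + gamma * V[ni][nj])
    V = V_new

print([[round(v, 1) for v in row] for row in V])           # state values under the random policy
```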
Model-Based Learning • The environment, P(st+1 | st, at) and p(rt+1 | st, at), is known • There is no need for exploration • Can be solved with dynamic programming • Solve for the optimal value function (Bellman optimality equation): V*(st) = maxat [ E[rt+1 | st, at] + γ Σst+1 P(st+1 | st, at) V*(st+1) ] • Optimal policy: π*(st) = argmaxat [ E[rt+1 | st, at] + γ Σst+1 P(st+1 | st, at) V*(st+1) ]
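A minimal value-iteration sketch for this model-based case, applying the Bellman optimality backup until V converges. The tiny two-state MDP (its transition probabilities and rewards) is made up purely for illustration.

```python
# Value iteration on a tiny hypothetical MDP with known P and R (sketch).
gamma = 0.9
states, actions = [0, 1], [0, 1]
# P[s][a] = list of (prob, next_state); R[s][a] = expected immediate reward
P = {0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
     1: {0: [(1.0, 0)], 1: [(1.0, 1)]}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}

V = {s: 0.0 for s in states}
for _ in range(100):
    # Bellman optimality backup: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    V = {s: max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in actions)
         for s in states}

# optimal policy: the argmax action in each state
pi = {s: max(actions,
             key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
      for s in states}
print(V, pi)
```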
Value Iteration vs Policy Iteration • Value iteration: repeatedly apply the Bellman optimality backup to V until it converges, then read off the greedy policy • Policy iteration: alternate full policy evaluation with greedy policy improvement • Policy iteration typically needs fewer iterations to converge than value iteration, but each of its iterations is more expensive because it includes a full policy-evaluation step (see the sketch below)
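For contrast with the value-iteration sketch above, here is a policy-iteration sketch on the same kind of tiny, made-up MDP: each outer iteration fully evaluates the current policy, then improves it greedily, and stops when the policy no longer changes.

```python
# Policy iteration on a tiny hypothetical MDP with known P and R (sketch).
gamma = 0.9
states, actions = [0, 1], [0, 1]
P = {0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
     1: {0: [(1.0, 0)], 1: [(1.0, 1)]}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}

pi = {s: 0 for s in states}                    # start from an arbitrary policy
while True:
    # policy evaluation: iterate V^pi until it stabilizes
    V = {s: 0.0 for s in states}
    for _ in range(100):
        V = {s: R[s][pi[s]] + gamma * sum(p * V[s2] for p, s2 in P[s][pi[s]])
             for s in states}
    # policy improvement: act greedily with respect to V^pi
    new_pi = {s: max(actions,
                     key=lambda a: R[s][a] +
                         gamma * sum(p * V[s2] for p, s2 in P[s][a]))
              for s in states}
    if new_pi == pi:                           # stable policy => optimal
        break
    pi = new_pi

print(pi, V)
```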
Model-Free Learning • The environment, P(st+1 | st, at) and p(rt+1 | st, at), is not known → model-free learning, based on both exploitation and exploration • Temporal difference learning: use the (discounted) reward received in the next time step to update the value of the current state (action): 1-step TD • Temporal difference: the difference between the value of the current action and the value discounted back from the next state
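A minimal sketch of the 1-step TD (TD(0)) update for state values, which moves V(st) toward the sampled target rt+1 + γ V(st+1). The trajectory data below is made up for illustration.

```python
# 1-step temporal-difference (TD(0)) update for state values (sketch).
gamma, eta = 0.9, 0.1
V = {}                                          # state-value estimates

def td_update(s, r_next, s_next):
    """Move V(s) toward the 1-step target r_{t+1} + gamma * V(s_{t+1})."""
    v, v_next = V.get(s, 0.0), V.get(s_next, 0.0)
    V[s] = v + eta * (r_next + gamma * v_next - v)   # the temporal difference is in brackets

# one illustrative trajectory of (state, reward, next state) transitions
for s, r, s_next in [("s1", 0.0, "s2"), ("s2", 0.0, "s3"), ("s3", 1.0, "goal")]:
    td_update(s, r, s_next)

print(V)
```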
Deterministic Rewards and Actions (maze figure: states Start, S2–S8, Goal) • With deterministic rewards and state transitions, the value of an action reduces to Q(st, at) = rt+1 + γ maxat+1 Q(st+1, at+1) • Therefore, we have the backup update rule Q̂(st, at) ← rt+1 + γ maxat+1 Q̂(st+1, at+1) • Initially Q̂(st, at) = 0, and its values increase as learning proceeds episode by episode • In the maze, all rewards of intermediate states are zero, so nothing changes in the first episode until the goal is reached: we then get reward r, and the Q value of the last state before the goal, say S5, is updated to r. In the next episode, when S5 is reached, the Q value of its preceding state S4 is updated to γr, and so on: the reward propagates backwards by one state per episode (see the sketch below)
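A sketch of the deterministic backup rule on a made-up chain of states leading to the goal (the actual maze topology is not given on the slide), showing how the goal reward propagates backwards by one state per episode: first r, then γr, then γ²r, and so on.

```python
# Deterministic backup Q(s,a) <- r + gamma * max_a' Q(s',a') on a chain (sketch).
gamma = 0.9
chain = ["s2", "s3", "s4", "s5"]                 # hypothetical path to the goal
reward_at_goal = 1.0
Q = {s: 0.0 for s in chain}                      # value of the single "forward" action

for episode in range(4):
    # visit states in order from start to goal, applying the backup rule
    for i, s in enumerate(chain):
        if s == chain[-1]:                       # last state before the goal
            Q[s] = reward_at_goal
        else:
            Q[s] = 0.0 + gamma * Q[chain[i + 1]] # intermediate rewards are zero
    print(episode, {s: round(q, 2) for s, q in Q.items()})
```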
Nondeterministic Rewards and Actions • Uncertainty in rewards and state changes is due to the presence of opponents or randomness in the environment • Q-learning (Watkins & Dayan '92): keep a running average of the sampled backup values for each state–action pair (st, at): Q̂(st, at) ← Q̂(st, at) + η [rt+1 + γ maxat+1 Q̂(st+1, at+1) − Q̂(st, at)]
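A minimal sketch of the Q-learning update with learning factor η, i.e. a running average of the sampled 1-step targets. The environment is abstracted away and the transitions below are made up for illustration.

```python
# Q-learning update for nondeterministic rewards and transitions (sketch).
gamma, eta = 0.9, 0.1
Q = {}                                           # Q[(state, action)] estimates
actions = ["n", "s", "e", "w"]

def q_update(s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + eta * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    q = Q.get((s, a), 0.0)
    Q[(s, a)] = q + eta * (r + gamma * best_next - q)

# a few made-up sampled transitions (s, a, r, s')
for s, a, r, s_next in [("s4", "e", 0.0, "s5"),
                        ("s5", "e", 1.0, "goal"),
                        ("s4", "e", 0.0, "s5")]:
    q_update(s, a, r, s_next)

print(Q)
```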
Exploration Strategies • Greedy: always choose the action with the highest current value estimate (pure exploitation) • ε-greedy: with probability ε, choose an action uniformly at random; with probability 1 − ε, choose the best action • Softmax (Boltzmann) selection: P(a) = exp(Q(a)/T) / Σb exp(Q(b)/T) • To gradually move from exploration to exploitation, the temperature T can be annealed (decreased over time)
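A sketch of softmax (Boltzmann) action selection with temperature T: a high T gives nearly uniform exploration, a low T behaves nearly greedily, and annealing T moves from one to the other. The Q values and temperature schedule are illustrative assumptions.

```python
# Softmax action selection P(a) = exp(Q(a)/T) / sum_b exp(Q(b)/T) (sketch).
import math, random

def softmax_select(Q_values, T):
    exps = [math.exp(q / T) for q in Q_values]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(Q_values)), weights=probs)[0]

Q_values = [0.5, 1.0, 0.2]                 # assumed value estimates for 3 actions
for T in [10.0, 1.0, 0.1]:                 # cooling the temperature (annealing)
    picks = [softmax_select(Q_values, T) for _ in range(1000)]
    print(T, [picks.count(a) / 1000 for a in range(3)])   # empirical selection frequencies
```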
Summary • RL is a process of learning by interaction, in contrast to supervised learning from examples • Elements of RL for an agent and its environment • state value function, state–action value function (Q value), reward, state-transition probability, policy • Tradeoff between exploitation and exploration • Markov decision process • Model-based learning • Value function via the Bellman equation • Dynamic programming • Model-free learning • Temporal difference (TD) and Q-learning (running average) to update Q values • Action selection for exploration • ε-greedy, softmax-based selection