TD-Learning Shiqi Zhang, AI class at Texas Tech, 2010 Fall. Adapted from Kainan University, Taiwan
Model is given? We say the model is given when the transition probabilities and rewards are known in all states • Model-based: estimate R and T, and use them to compute the value function • Model-free: find the value function directly
Policy is fixed? • Passive learning: a policy is fixed, and we try to evaluate this policy • Active learning: the policy is not fixed; we can change it a little every time (policy evaluation and policy improvement)
On-policy or Off-policy • Basic TD-Learning • SARSA (On-policy) • Q-Learning (Off-policy)
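For reference, the three update rules can be written as follows (a standard formulation; the learning rate α and discount factor γ are assumed symbols, not defined on the original slide): • Basic TD(0) value update: V(s) ← V(s) + α [r + γ V(s′) − V(s)] • SARSA (on-policy): Q(s, a) ← Q(s, a) + α [r + γ Q(s′, a′) − Q(s, a)], where a′ is the action actually taken in s′ • Q-Learning (off-policy): Q(s, a) ← Q(s, a) + α [r + γ max over a′ of Q(s′, a′) − Q(s, a)]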
Advantages • TD-learning vs Monte Carlo: TD is an online algorithm (it updates after every step, without waiting for the episode to finish) • TD-learning vs Dynamic Programming: TD is model-free (it does not need the transition and reward model)
Q-Learning Task [Figure: agent–environment interaction loop — the agent observes state s0, takes action a0, receives reward r0, moves to state s1, and so on] Goal: learn to choose actions that maximize r0 + γr1 + γ²r2 + …, where 0 ≤ γ < 1
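As a minimal sketch (not part of the original slides; the function name and reward values are illustrative), the discounted sum the agent tries to maximize can be computed from a finite reward sequence like this:

```python
def discounted_return(rewards, gamma):
    """Compute r0 + gamma*r1 + gamma^2*r2 + ... for a finite list of rewards."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Illustrative values: rewards over three steps, discount factor 0.9.
print(discounted_return([1.0, 0.0, 10.0], gamma=0.9))  # 1.0 + 0.0 + 0.81 * 10 = 9.1
```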
Problem Description • Environment: A to E are rooms, F is outside the building (the target). The aim is for an agent to learn to get out of the building from any of the rooms in an optimal way.
State, Action, Reward and Q-value • Reward matrix
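The reward matrix itself appears only as a figure. Below is a sketch of it as a NumPy array, reconstructed from the room connections described on the following slides and from the classic "get out of the rooms" example this walkthrough follows; the entries involving room A in particular are an assumption, since the text only confirms the rows for B and D:

```python
import numpy as np

# Reconstructed reward matrix for the six-room example (an assumption, see above).
# States and actions are indexed A=0, B=1, C=2, D=3, E=4, F=5.
# R[s, a] = -1 if there is no door from room s to room a,
#            0 if there is a door, and 100 if action a reaches the goal F.
R = np.array([
    # A    B    C    D    E    F
    [ -1,  -1,  -1,  -1,   0,  -1],   # A
    [ -1,  -1,  -1,   0,  -1, 100],   # B
    [ -1,  -1,  -1,   0,  -1,  -1],   # C
    [ -1,   0,   0,  -1,   0,  -1],   # D
    [  0,  -1,  -1,   0,  -1, 100],   # E
    [ -1,   0,  -1,  -1,   0, 100],   # F
])
```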
Q-table and the update rule Q table update rule: Q(state, action) = R(state, action) + γ · max[Q(next state, all actions)], where γ is the learning parameter
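A minimal sketch of this update in Python, assuming Q and R are NumPy arrays indexed by state and action as in the sketch above (the function name is illustrative):

```python
import numpy as np

def q_update(Q, R, state, action, next_state, gamma):
    """Apply Q(s, a) = R(s, a) + gamma * max over a' of Q(s', a') in place."""
    Q[state, action] = R[state, action] + gamma * np.max(Q[next_state, :])
```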
Q-Learning • Given: a state diagram with a goal state (represented by matrix R) • Find: the minimum path from any initial state to the goal state (represented by matrix Q) • Set the parameter γ and the environment reward matrix R • Initialize matrix Q as a zero matrix • For each episode: • ・ Select a random initial state • ・ Do while the goal state has not been reached • ・ Select one among all possible actions for the current state • ・ Using this possible action, consider going to the next state • ・ Get the maximum Q value of this next state over all possible actions • ・ Compute Q(state, action) = R(state, action) + γ · max[Q(next state, all actions)] • ・ Set the next state as the current state • End Do • End For
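A runnable sketch of this episode loop, assuming the R matrix sketched earlier; γ = 0.8 is used purely as an illustrative value, and possible actions in a state are taken to be those with R(state, action) ≥ 0:

```python
import numpy as np

def train_q(R, gamma=0.8, episodes=500, goal=5, seed=0):
    """Tabular Q-learning loop over random episodes; returns the learned Q matrix."""
    # gamma=0.8 is an illustrative default, not taken from the slides.
    rng = np.random.default_rng(seed)
    Q = np.zeros(R.shape)                              # initialize matrix Q as zero
    for _ in range(episodes):
        state = rng.integers(R.shape[0])               # select a random initial state
        while state != goal:                           # do while goal state not reached
            actions = np.flatnonzero(R[state] >= 0)    # all possible actions from state
            action = int(rng.choice(actions))          # select one of them at random
            next_state = action                        # taking the action moves us to that room
            Q[state, action] = R[state, action] + gamma * np.max(Q[next_state])
            state = next_state                         # next state becomes current state
    return Q
```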
Algorithm to utilize the Q table Input: Q matrix, initial state • 1. Set current state = initial state • 2. From the current state, find the action that produces the maximum Q value • 3. Set current state = next state • 4. Go to 2 until current state = goal state The algorithm above returns the sequence of states from the initial state to the goal state. Comments: The parameter γ has a range of 0 to 1 (0 ≤ γ < 1). If γ is closer to zero, the agent will tend to consider only immediate reward. If γ is closer to one, the agent will consider future reward with greater weight, and is willing to delay the reward.
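A minimal sketch of this procedure (names are illustrative; ties between equal Q values are broken by taking the first maximum):

```python
import numpy as np

def greedy_path(Q, start, goal):
    """Follow the action with the maximum Q value in each state until the goal."""
    path = [start]
    state = start
    while state != goal:
        state = int(np.argmax(Q[state]))   # best action = column with the largest Q value
        path.append(state)
    return path
```

With a converged Q matrix this terminates at the goal; with an untrained Q it could cycle, so adding a step limit would be a sensible safeguard.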
Numerical Example Let us set a value for the learning parameter γ and take room B as the initial state.
Episode 1 Look at the second row (state B) of matrix R. There are two possible actions for the current state B: go to state D or go to state F. By random selection, we select going to F as our action.
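Since the Q table is still all zeros at this point, the update rule above works out to Q(B, F) = R(B, F) + γ · max[Q(F, B), Q(F, E), Q(F, F)] = 100 + γ · 0 = 100 (keeping γ symbolic, and assuming the possible actions from F are B, E, and F, as in the reward matrix sketch). Because F is the goal state, episode 1 ends here.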
Episode 2 This time, for instance, we randomly have state D as our initial state. From R, it has three possible actions: B, C, and E. We randomly select going to state B as our action.
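The corresponding update is Q(D, B) = R(D, B) + γ · max[Q(B, D), Q(B, F)] = 0 + γ · 100, since Q(B, F) was set to 100 in episode 1 and Q(B, D) is still zero.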
Episode 2 (cont’d) The next state, B, now becomes the current state. We repeat the inner loop of the Q-learning algorithm because state B is not the goal state. There are two possible actions from the current state B: go to state D or go to state F. By lucky draw, the selected action is to go to state F. The Q table does not change.
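The update here is Q(B, F) = R(B, F) + γ · max[Q(F, B), Q(F, E), Q(F, F)] = 100 + γ · 0 = 100, which is the value Q(B, F) already had, so the Q table does not change. F is the goal state, so episode 2 ends.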
After Many Episodes If our agent gains more and more experience through many episodes, the Q matrix will finally reach its convergence values (the converged matrix is then normalized to percentages by dividing every entry by the largest one).
Once the Q matrix is close to its convergence values, our agent can reach the goal in an optimal way. To trace the sequence of states, it simply finds, at each state, the action that gives the maximum Q value. For example, starting from initial state C and using the Q matrix, we obtain the sequence C – D – B – F or C – D – E – F, as in the usage sketch below.
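Putting the earlier sketches together, a short usage example (assuming the R matrix, train_q, and greedy_path sketches above, with γ = 0.8 still only an illustrative value):

```python
Q = train_q(R, gamma=0.8, episodes=500, goal=5)

# Normalize to percentages of the largest entry, as the slides do.
print((Q / Q.max() * 100).round().astype(int))

# Trace the greedy path from room C (index 2) to the goal F (index 5).
print(greedy_path(Q, start=2, goal=5))   # [2, 3, 1, 5], i.e. C - D - B - F
```

C – D – E – F is equally good; greedy_path returns C – D – B – F here only because ties are broken toward the lower index.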