TD-Learning

TD-Learning. Shiqi Zhang AI class in Texas Tech, 2010 fall Adapted from Kainan University, Taiwan. Model is given?. We say model is given when the conditional probabilities are known in all states Model-based: trying to find R and T, and use them to get the value function

Presentation Transcript


  1. TD-Learning. Shiqi Zhang, AI class at Texas Tech, Fall 2010. Adapted from Kainan University, Taiwan.

  2. Model is given? We say the model is given when the conditional (transition) probabilities are known in all states. • Model-based: try to learn the reward function R and the transition function T, then use them to compute the value function • Model-free: estimate the value function directly

  3. Policy is fixed? • Passive learning: the policy is fixed, and we try to evaluate it • Active learning: the policy is not fixed, and we can change it a little each time. Policy evaluation and policy improvement.

  4. On-policy or Off-policy • Basic TD-Learning • SARSA (On-policy) • Q-Learning (Off-policy)
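For reference, the three methods listed above are usually written as the update rules below. The step size $\alpha$ and the primed symbols $s'$, $a'$ (next state, next action) are standard textbook notation and do not appear in the slides.

```latex
\begin{align*}
\text{TD(0):} \quad & V(s) \leftarrow V(s) + \alpha \left[ r + \gamma V(s') - V(s) \right] \\
\text{SARSA:} \quad & Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma Q(s',a') - Q(s,a) \right] \\
\text{Q-learning:} \quad & Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]
\end{align*}
```

SARSA bootstraps from the action $a'$ that the current policy actually takes, which makes it on-policy. Q-learning bootstraps from the best available action regardless of what is taken next, which makes it off-policy. With $\alpha = 1$ the Q-learning rule reduces to $Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s',a')$.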

  5. Advantages • TD-learning vs. Monte Carlo: TD is an online algorithm • TD-learning vs. Dynamic Programming: TD is model-free

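  6. Q-Learning Task. An agent interacts with an environment: at each step it observes a state, takes an action, and receives a reward, producing the sequence s0, a0, r0, s1, a1, r1, s2, … Goal: learn to choose actions that maximize $r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$, where $0 \le \gamma < 1$.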
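A minimal sketch of the quantity the agent is usually asked to maximize in this setting, the discounted sum of rewards $r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$. The function name and the sample reward list are illustrative assumptions, not taken from the slides.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward list."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Hypothetical episode: no reward for two steps, then 100 at the goal.
rewards = [0, 0, 100]
print(discounted_return(rewards, 0.5))  # 0 + 0 + 0.25 * 100 = 25.0
```

A smaller gamma shrinks the contribution of late rewards, so the same reward is worth more the sooner it is reached.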

  7. Problem Description • Environment: A to E are rooms, and F is outside the building (the target). The aim is for an agent to learn to get out of the building from any room in an optimal way.

  8. Modeling of the environment

  9. State, Action, Reward and Q-value • Reward matrix
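One way to build a reward matrix for this rooms problem is sketched below. The doors B-D, B-F, C-D, D-E and E-F follow from the moves described in the later slides; the A-E door, the reward values (100 for a door into F, 0 for any other door, -1 for no door) and the F-to-F entry are assumptions borrowed from the commonly used version of this example.

```python
STATES = ["A", "B", "C", "D", "E", "F"]

# Undirected doors between rooms. A-E is an assumption (see lead-in).
DOORS = [("A", "E"), ("B", "D"), ("B", "F"), ("C", "D"), ("D", "E"), ("E", "F")]


def build_reward_matrix(goal="F", goal_reward=100, no_door=-1):
    """Rows are current states, columns are actions ("move to that room")."""
    idx = {s: i for i, s in enumerate(STATES)}
    n = len(STATES)
    R = [[no_door] * n for _ in range(n)]
    for a, b in DOORS:
        # Each door can be used in both directions.
        for src, dst in ((a, b), (b, a)):
            R[idx[src]][idx[dst]] = goal_reward if dst == goal else 0
    # Staying at the goal is also rewarded (assumption).
    R[idx[goal]][idx[goal]] = goal_reward
    return R


R = build_reward_matrix()
for name, row in zip(STATES, R):
    print(name, row)
```

With this encoding, the possible actions in a state are exactly the columns whose entry is not -1.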

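  10. Q-table and the update rule. Q-table update rule: $Q(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a')$, where $s'$ is the next state reached by taking action $a$ in state $s$, and $\gamma$ is the learning parameter.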
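A single application of the update can be sketched as follows, assuming the simplified deterministic form Q(s, a) = R(s, a) + γ · max Q(s', ·) that is commonly used with this example. The function name `q_update` and the two-state matrices are hypothetical.

```python
def q_update(Q, R, state, action, gamma):
    """Set Q[state][action] = R[state][action] + gamma * max(Q[next_state]).

    An action here means "move to room `action`", so the next state is the
    action index. Entries for impossible moves stay at 0, so taking the max
    over the whole row is safe as long as Q values are non-negative.
    """
    next_state = action
    Q[state][action] = R[state][action] + gamma * max(Q[next_state])
    return Q[state][action]


# Tiny illustration (hypothetical values): state 0 has a door to goal state 1.
R = [[-1, 100],
     [-1, 100]]
Q = [[0.0, 0.0],
     [0.0, 0.0]]
print(q_update(Q, R, state=0, action=1, gamma=0.8))  # 100 + 0.8 * 0 = 100.0
```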

  11. Q-Learning
     • Given: a state diagram with a goal state (represented by matrix R)
     • Find: the minimum path from any initial state to the goal state (represented by matrix Q)
     • Set parameter γ and the environment reward matrix R
     • Initialize matrix Q as a zero matrix
     • For each episode:
        ・ Select a random initial state
        ・ Do while the goal state has not been reached:
           ・ Select one among all possible actions for the current state
           ・ Using this action, consider going to the next state
           ・ Get the maximum Q value of this next state based on all possible actions
           ・ Compute $Q(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a')$, with $s$ the current state, $a$ the selected action, and $s'$ the next state
           ・ Set the next state as the current state
        ・ End Do
     • End For
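The training loop above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the presenter's implementation: the reward matrix (including the A-E door and the values -1, 0, 100), γ = 0.8, the episode count and the seed are all assumed, and the goal is treated as terminal, so the row for F is never updated.

```python
import random

# -1: no door, 0: door, 100: door into the goal F (assumed values and layout).
R = [
    [-1, -1, -1, -1,  0,  -1],  # A
    [-1, -1, -1,  0, -1, 100],  # B
    [-1, -1, -1,  0, -1,  -1],  # C
    [-1,  0,  0, -1,  0,  -1],  # D
    [ 0, -1, -1,  0, -1, 100],  # E
    [-1,  0, -1, -1,  0, 100],  # F
]
GOAL = 5  # index of F


def train(R, goal, gamma=0.8, episodes=500, seed=0):
    """Run random-exploration Q-learning and return the learned Q matrix."""
    rng = random.Random(seed)
    n = len(R)
    Q = [[0.0] * n for _ in range(n)]
    for _ in range(episodes):
        state = rng.randrange(n)              # random initial state
        while state != goal:                  # inner loop until the goal
            actions = [a for a in range(n) if R[state][a] >= 0]
            action = rng.choice(actions)      # pick any possible action
            next_state = action               # moving to room `action`
            Q[state][action] = R[state][action] + gamma * max(Q[next_state])
            state = next_state
    return Q


Q = train(R, GOAL)
for row in Q:
    print([round(v, 1) for v in row])
```

Because actions are chosen uniformly at random, the loop only explores; the learned Q matrix is exploited afterwards by following the largest entry in each row.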

  12. Algorithm to utilize the Q table
     Input: Q matrix, initial state
     1. Set current state = initial state
     2. From the current state, find the action that produces the maximum Q value
     3. Set current state = next state
     4. Go to 2 until current state = goal state
     The algorithm above returns the sequence of states from the initial state to the goal state.
     Comments: Parameter γ ranges from 0 to 1. If γ is closer to zero, the agent tends to consider only the immediate reward. If γ is closer to one, the agent gives future rewards greater weight and is willing to delay the reward.
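The lookup procedure above can be sketched as a greedy walk over the Q matrix. The matrix below holds illustrative values only (consistent with γ = 0.8 and a terminal goal under an assumed room layout); it is not the matrix shown in the slides. The `max_steps` guard is an added safety measure against loops in an untrained matrix.

```python
STATES = "ABCDEF"

# Illustrative Q values (assumption), rows = current state, columns = action.
Q = [
    [0,  0,    0,  0, 80,   0],  # A
    [0,  0,    0, 64,  0, 100],  # B
    [0,  0,    0, 64,  0,   0],  # C
    [0, 80, 51.2,  0, 80,   0],  # D
    [64, 0,    0, 64,  0, 100],  # E
    [0,  0,    0,  0,  0,   0],  # F (goal)
]


def greedy_path(Q, start, goal, max_steps=20):
    """Follow the highest-valued action from `start` until `goal`."""
    path = [start]
    state = start
    while state != goal and len(path) <= max_steps:
        row = Q[state]
        # max() keeps the first index on ties, e.g. B before E in row D.
        state = max(range(len(row)), key=lambda a: row[a])
        path.append(state)
    return path


path = greedy_path(Q, STATES.index("C"), STATES.index("F"))
print("-".join(STATES[s] for s in path))  # C-D-B-F
```

Row D has a tie between moving to B and moving to E, so C-D-E-F is an equally good route; which one is returned depends only on the tie-breaking rule.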

  13. Numerical Example. Let us set the value of the learning parameter γ and take room B as the initial state.

  14. Episode 1. Look at the second row (state B) of matrix R. There are two possible actions for the current state B: go to state D or go to state F. By random selection, we choose to go to F.

  15. Episode 2. This time, suppose we randomly pick state D as our initial state. According to R, it has three possible actions: go to B, C, or E. We randomly select going to state B as our action.

  16. Episode 2 (cont’d). The next state, B, now becomes the current state. We repeat the inner loop of the Q-learning algorithm because state B is not the goal state. There are two possible actions from the current state B: go to state D or go to state F. By lucky draw, the selected action is to go to F. This update leaves the Q matrix unchanged.
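The arithmetic behind the two episodes can be written out as below. The numbers are assumptions, since the transcript does not preserve them: γ = 0.8, a reward of 100 for a door into F and 0 for any other door, and a Q matrix that starts at zero.

```latex
\begin{align*}
\text{Episode 1:}\quad Q(B,F) &= R(B,F) + \gamma \max_{a} Q(F,a) = 100 + 0.8 \cdot 0 = 100 \\
\text{Episode 2:}\quad Q(D,B) &= R(D,B) + \gamma \max\{Q(B,D),\, Q(B,F)\} = 0 + 0.8 \cdot 100 = 80 \\
Q(B,F) &= R(B,F) + \gamma \max_{a} Q(F,a) = 100 + 0.8 \cdot 0 = 100
\end{align*}
```

The last line recomputes a value that is already stored, which is why the second step of Episode 2 does not change the Q matrix.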

  17. After Many Episodes. As our agent gains more experience over many episodes, the Q matrix finally converges. The converged matrix can then be normalized to percentages.
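One plausible reading of "normalized to percentage" is to divide every entry by the largest entry and multiply by 100. That interpretation, the function name, and the sample matrix below are assumptions.

```python
def normalize_to_percent(Q):
    """Scale a Q matrix so that its largest entry becomes 100."""
    peak = max(max(row) for row in Q)
    if peak == 0:
        # Nothing learned yet; return an unchanged copy to avoid dividing by 0.
        return [row[:] for row in Q]
    return [[100.0 * v / peak for v in row] for row in Q]


# Hypothetical 2x2 matrix, used only to show the scaling.
Q = [[0.0, 50.0],
     [200.0, 100.0]]
print(normalize_to_percent(Q))  # [[0.0, 25.0], [100.0, 50.0]]
```

Normalizing does not change which action is best in any state, because every entry is scaled by the same positive factor.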

  18. Once the Q matrix is close to its converged values, our agent can reach the goal in an optimal way. To trace the sequence of states, it simply picks, in each state, the action with the maximum Q value. For example, from initial state C, the Q matrix gives the sequence C – D – B – F or C – D – E – F.

  19. The End…
