
Outline

Outline: MDP (brief): background, learning MDP, Q learning. Game theory (brief): background. Markov games (2-player): background, learning Markov games, Littman’s Minimax Q learning (zero-sum), Hu & Wellman’s Nash Q learning (general-sum). Stochastic games (SG). Partially observable SG (POSG).

twyla

Presentation Transcript


  1. Outline • MDP (brief) • Background • Learning MDP • Q learning • Game theory (brief) • Background • Markov games (2-player) • Background • Learning Markov games • Littman’s Minimax Q learning (zero-sum) • Hu & Wellman’s Nash Q learning (general-sum)

  2. Stochastic games (SG) Partially observable SG (POSG) / SG / POSG

  3. Expectation over next states • Immediate reward • Value of next state

  4. Model-based reinforcement learning: • Learn the reward function and the state transition function • Solve for the optimal policy • Model-free reinforcement learning: • Directly learn the optimal policy without knowing the reward function or the state transition function

  5. #times action a causes state transition s → s’ • #times action a has been executed in state s • Total reward accrued when applying a in s
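The three counts on this slide are enough to estimate a model by simple averaging. A minimal sketch in Python, assuming experience arrives as (s, a, r, s’) tuples; the function name `estimate_model` and the sample data are hypothetical and not taken from the slides:

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate T(s, a, s') and R(s, a) from (s, a, r, s') samples by counting."""
    sa_count = defaultdict(int)      # times action a has been executed in state s
    sas_count = defaultdict(int)     # times action a caused the transition s -> s'
    reward_sum = defaultdict(float)  # total reward accrued when applying a in s
    for s, a, r, s_next in transitions:
        sa_count[(s, a)] += 1
        sas_count[(s, a, s_next)] += 1
        reward_sum[(s, a)] += r
    # Transition estimate: fraction of (s, a) executions that ended in s'.
    T = {(s, a, s2): n / sa_count[(s, a)] for (s, a, s2), n in sas_count.items()}
    # Reward estimate: average reward observed for (s, a).
    R = {sa: total / sa_count[sa] for sa, total in reward_sum.items()}
    return T, R

# Made-up experience, for illustration only.
samples = [
    ("s1", "go", 1.0, "s2"),
    ("s1", "go", 0.0, "s1"),
    ("s1", "go", 1.0, "s2"),
    ("s2", "stay", 2.0, "s2"),
]
T, R = estimate_model(samples)
```

With these samples, "go" in s1 reached s2 in two of three tries, so the estimated transition probability is 2/3. A model-based learner would then solve the estimated MDP for its optimal policy.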

  6. v(s’)

  7. Start with arbitrary initial values of Q(s,a), for all s ∈ S, a ∈ A • At each time t the agent chooses an action and observes its reward r_t • The agent then updates its Q-values based on the Q-learning rule • The learning rate α_t needs to decay over time in order for the learning algorithm to converge
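The steps on this slide can be sketched as a short tabular loop. This is an illustration under stated assumptions, not the presenter's code: the two-state MDP in `toy_step`, the epsilon-greedy action choice, and the 1/visits learning-rate schedule are all hypothetical choices made for the example.

```python
import random

def q_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q[(s, a)]

def toy_step(s, a):
    # Hypothetical deterministic MDP: "right" leads to state 1 and pays 1,
    # "left" leads to state 0 and pays 0.
    return (1.0, 1) if a == "right" else (0.0, 0)

def run_q_learning(steps=5000, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    states, actions = [0, 1], ["left", "right"]
    Q = {(s, a): 0.0 for s in states for a in actions}  # arbitrary initial values
    visits = {key: 0 for key in Q}
    s = 0
    for _ in range(steps):
        # Epsilon-greedy choice: mostly exploit, sometimes explore.
        if rng.random() < epsilon:
            a = rng.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        r, s_next = toy_step(s, a)
        visits[(s, a)] += 1
        alpha = 1.0 / visits[(s, a)]  # learning rate decays with the visit count
        q_update(Q, s, a, r, s_next, actions, alpha, gamma)
        s = s_next
    return Q

Q = run_q_learning()
```

The update never consults a reward or transition model; it uses only the observed reward and next state, which is what makes the method model-free.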

  8. Famous game theory example

  9. A co-operative game

  10. Generalization of MDP • Mixed strategy

  11. Stationary: the agent’s policy does not change over time • Deterministic: the same action is always chosen whenever the agent is in state s

  12. Example State 2 State 1

  13. v(s, π*) ≥ v(s, π) for all s ∈ S, π ∈ Π

  14. Max V Such that: rock + paper + scissors = 1
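To make the objective concrete: V is the payoff a mixed strategy guarantees against the opponent's best reply, and the constraint says the three probabilities sum to 1. The sketch below evaluates V for a given strategy and then searches a coarse grid of strategies. The grid search is a brute-force stand-in for the linear program, chosen only to keep the example dependency-free; the payoff table is standard rock-paper-scissors with win = 1, loss = -1, tie = 0, and all names are hypothetical.

```python
ACTIONS = ["rock", "paper", "scissors"]

# PAYOFF[mine][theirs] is the row player's payoff.
PAYOFF = {
    "rock":     {"rock": 0, "paper": -1, "scissors": 1},
    "paper":    {"rock": 1, "paper": 0, "scissors": -1},
    "scissors": {"rock": -1, "paper": 1, "scissors": 0},
}

def worst_case_value(policy):
    """Expected payoff of a mixed strategy against the opponent's best reply."""
    assert abs(sum(policy.values()) - 1.0) < 1e-9  # rock + paper + scissors = 1
    return min(
        sum(policy[a] * PAYOFF[a][o] for a in ACTIONS)  # expectation over own actions
        for o in ACTIONS                                # worst case over opponent actions
    )

def grid_search(n=12):
    """Maximize the worst-case value over strategies with probabilities i/n."""
    best_policy, best_value = None, float("-inf")
    for i in range(n + 1):
        for j in range(n + 1 - i):
            policy = {"rock": i / n, "paper": j / n, "scissors": (n - i - j) / n}
            value = worst_case_value(policy)
            if value > best_value:
                best_policy, best_value = policy, value
    return best_policy, best_value

uniform = {a: 1 / 3 for a in ACTIONS}
biased = {"rock": 0.5, "paper": 0.25, "scissors": 0.25}
best_policy, best_value = grid_search()
```

Here the uniform strategy guarantees a value of 0, while the rock-heavy strategy can be exploited by always playing paper, for a value of -0.25. The grid search recovers the uniform strategy as the maximizer.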

  15. Worst case • Expectation over all actions • Best response

  16. Quality of a state-action pair • Discounted value of all succeeding states weighted by their likelihood • This learning rule converges to the correct values of Q and v • Discounted value of all succeeding states
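The equations these labels annotated did not survive in the transcript. As a point of reference, the minimax-Q relations in Littman's formulation are usually written as below. The symbols are assumptions that follow the common notation rather than anything recoverable from the slides: γ is the discount factor, α the learning rate, T the transition function, R the reward function, and PD(A) the set of probability distributions over the agent's actions.

```latex
% Value of a state: best response (max over mixed policies) to the
% worst case (min over opponent actions) of the expectation over own actions.
V(s) = \max_{\pi \in PD(A)} \; \min_{o \in O} \; \sum_{a \in A} Q(s, a, o)\, \pi_a

% Quality of a state-action pair: immediate reward plus the discounted value
% of all succeeding states weighted by their likelihood.
Q(s, a, o) = R(s, a, o) + \gamma \sum_{s'} T(s, a, o, s')\, V(s')

% Sample-based learning rule: the sum over next states is replaced by the
% discounted value of the one observed next state s'.
Q(s, a, o) \leftarrow (1 - \alpha)\, Q(s, a, o) + \alpha \bigl( r + \gamma V(s') \bigr)
```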

  17. Expected reward for taking action a when opponent chooses o from state s • explor controls how often the agent will deviate from its current policy

  18. Hu and Wellman: general-sum Markov games as a framework for RL • Theorem (Nash, 1951): There exists a mixed strategy Nash equilibrium for any finite bimatrix game
