Outline • MDP (brief) • Background • Learning MDP • Q learning • Game theory (brief) • Background • Markov games (2-player) • Background • Learning Markov games • Littman’s Minimax Q learning (zero-sum) • Hu & Wellman’s Nash Q learning (general-sum)
[Diagram: MDPs generalize to stochastic games (SG), which generalize to partially observable stochastic games (POSG)]
The value of a state decomposes into the immediate reward plus the expected value of the next state:

$$V(s) = \max_{a} \Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V(s') \Big]$$

Here $R(s,a)$ is the immediate reward, the sum over $s'$ is the expectation over next states, and $V(s')$ is the value of the next state.
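As a concrete illustration, here is a minimal value-iteration sketch of this recursion (the containers states, actions, R, T and the discount gamma are hypothetical inputs, not from the slides):

```python
# Value iteration: repeatedly apply the Bellman backup until convergence.
# Assumes hypothetical inputs: states, actions, R[s][a] (immediate reward),
# T[s][a][s2] (transition probability), and discount factor gamma.

def value_iteration(states, actions, R, T, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup: immediate reward plus expected discounted next value
            best = max(
                R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in states)
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```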
Model-based reinforcement learning: • Learn the reward function and the state transition function • Solve for the optimal policy • Model-free reinforcement learning: • Directly learn the optimal policy without knowing the reward function or the state transition function
The model is estimated from counts:

$$\hat{T}(s,a,s') = \frac{n(s,a,s')}{n(s,a)}, \qquad \hat{R}(s,a) = \frac{\rho(s,a)}{n(s,a)}$$

where $n(s,a,s')$ is the number of times action a caused the transition s → s', $n(s,a)$ is the number of times a has been executed in s, and $\rho(s,a)$ is the total reward accrued when applying a in s.
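A sketch of these maximum-likelihood estimates, accumulated from experience tuples (s, a, r, s'); the counter names n_sa, n_sas, r_total are illustrative:

```python
from collections import defaultdict

n_sa = defaultdict(int)        # times action a was executed in state s
n_sas = defaultdict(int)       # times a caused the transition s -> s'
r_total = defaultdict(float)   # total reward accrued when applying a in s

def record(s, a, r, s2):
    n_sa[(s, a)] += 1
    n_sas[(s, a, s2)] += 1
    r_total[(s, a)] += r

def T_hat(s, a, s2):
    # Estimated transition probability
    return n_sas[(s, a, s2)] / n_sa[(s, a)]

def R_hat(s, a):
    # Estimated expected reward
    return r_total[(s, a)] / n_sa[(s, a)]
```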
Start with arbitrary initial values of Q(s,a), for all s ∈ S, a ∈ A • At each time t the agent chooses an action and observes its reward rt • The agent then updates its Q-values based on the Q-learning rule:

$$Q(s_t, a_t) \leftarrow (1 - \alpha_t)\, Q(s_t, a_t) + \alpha_t \big[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') \big]$$

• The learning rate αt needs to decay over time in order for the learning algorithm to converge
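A compact Q-learning sketch along these lines; the epsilon-greedy action choice and the 1/n learning-rate decay are common choices assumed here, not prescribed by the slides:

```python
import random
from collections import defaultdict

Q = defaultdict(float)     # Q[(s, a)], arbitrary (zero) initial values
visits = defaultdict(int)  # visit counts used to decay the learning rate

def choose_action(s, actions, epsilon=0.1):
    # epsilon-greedy: explore with probability epsilon, otherwise exploit
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_update(s, a, r, s2, actions, gamma=0.9):
    visits[(s, a)] += 1
    alpha = 1.0 / visits[(s, a)]  # decaying learning rate alpha_t
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```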
Markov games are a generalization of MDPs to multiple agents; a policy is in general a mixed strategy (a probability distribution over actions).
• Stationary: the agent's policy does not change over time • Deterministic: the same action is always chosen whenever the agent is in state s
[Example: a two-state Markov game (State 1 and State 2)]
The optimal mixed strategy for rock-paper-scissors is the solution of a linear program:

Max V, such that:
• (vs rock) paper − scissors ≥ V
• (vs paper) scissors − rock ≥ V
• (vs scissors) rock − paper ≥ V
• rock + paper + scissors = 1
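One way to solve this linear program is scipy.optimize.linprog; the sketch below maximizes V by minimizing −V, with variables [p_rock, p_paper, p_scissors, V]:

```python
from scipy.optimize import linprog

# Row player's payoff matrix for rock-paper-scissors
# (rows: our action; columns: opponent's action).
M = [[0, -1,  1],   # rock
     [1,  0, -1],   # paper
     [-1, 1,  0]]   # scissors

# Variables: x = [p_rock, p_paper, p_scissors, V]; maximize V = minimize -V.
c = [0, 0, 0, -1]
# For each opponent action o: sum_a p_a * M[a][o] >= V,
# rewritten as -sum_a p_a * M[a][o] + V <= 0.
A_ub = [[-M[0][o], -M[1][o], -M[2][o], 1] for o in range(3)]
b_ub = [0, 0, 0]
A_eq = [[1, 1, 1, 0]]   # probabilities sum to 1
b_eq = [1]
bounds = [(0, 1)] * 3 + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.x)  # optimal mixed strategy (1/3, 1/3, 1/3) with game value V = 0
```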
$$V(s) = \max_{\pi \in PD(A)} \min_{o \in O} \sum_{a \in A} \pi(a)\, Q(s, a, o)$$

The min over opponent actions o is the worst case, the sum is the expectation over the agent's own actions, and the outer max selects the best response to that worst case.
$$Q(s, a, o) = R(s, a, o) + \gamma \sum_{s'} T(s, a, o, s')\, V(s')$$

Q(s,a,o) is the quality of a state-action pair, and the sum is the discounted value of all succeeding states weighted by their likelihood. This learning rule converges to the correct values of Q and V.
Q(s,a,o) is the expected reward for taking action a when the opponent chooses o from state s • The parameter explor controls how often the agent will deviate from its current policy
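Putting the pieces together, a sketch of minimax-Q: each update recomputes the stage-game LP (the same formulation as the rock-paper-scissors example, generalized to the learned Q-values) to get the current policy and state value. The names and the fixed learning rate are illustrative:

```python
import random
from collections import defaultdict
from scipy.optimize import linprog

Q = defaultdict(float)  # Q[(s, a, o)]: expected reward for a vs opponent action o
V = defaultdict(float)  # V[s]: minimax value of state s
pi = {}                 # pi[s]: the agent's current mixed strategy in s

def solve_stage_game(s, actions, opp_actions):
    # Maximize v s.t. sum_a pi(a) * Q[(s,a,o)] >= v for every opponent action o,
    # and pi is a probability distribution.
    n = len(actions)
    c = [0.0] * n + [-1.0]
    A_ub = [[-Q[(s, a, o)] for a in actions] + [1.0] for o in opp_actions]
    b_ub = [0.0] * len(opp_actions)
    A_eq = [[1.0] * n + [0.0]]
    b_eq = [1.0]
    bounds = [(0, 1)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return list(res.x[:n]), -res.fun  # mixed strategy and game value v

def minimax_q_update(s, a, o, r, s2, actions, opp_actions, alpha=0.1, gamma=0.9):
    Q[(s, a, o)] += alpha * (r + gamma * V[s2] - Q[(s, a, o)])
    pi[s], V[s] = solve_stage_game(s, actions, opp_actions)

def choose_action(s, actions, explor=0.2):
    # With probability explor, deviate from the current policy (explore).
    if s not in pi or random.random() < explor:
        return random.choice(actions)
    return random.choices(actions, weights=pi[s])[0]
```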
Hu & Wellman: general-sum Markov games as a framework for RL • Theorem (Nash, 1951): There exists a mixed-strategy Nash equilibrium for any finite bimatrix game
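For a small bimatrix game, such an equilibrium can be computed by support enumeration, e.g. with the nashpy library (usage based on its documented Game / support_enumeration API; the payoff matrices here are an arbitrary example, not from the slides):

```python
import numpy as np
import nashpy as nash

# A small general-sum bimatrix game (battle-of-the-sexes payoffs as an example).
A = np.array([[3, 0],
              [0, 2]])   # row player's payoffs
B = np.array([[2, 0],
              [0, 3]])   # column player's payoffs

game = nash.Game(A, B)
# Support enumeration finds the mixed-strategy Nash equilibria
# whose existence is guaranteed by Nash's theorem.
for sigma_row, sigma_col in game.support_enumeration():
    print(sigma_row, sigma_col)
```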