Learning in Games
Georgios Piliouras
Games (i.e. Multi-Body Interactions)
• Interacting entities
• Pursuing their own goals
• Lack of centralized control
Prediction?
Games (review)
• n players
• Set of strategies Si for each player i
• Possible states (strategy profiles) S = ×i Si
• Utility ui : S → R
• Social Welfare Q : S → R
• Extend to allow probabilities Δ(Si), Δ(S): ui(Δ(S)) = E[ui(S)], Q(Δ(S)) = E[Q(S)]
Zero-Sum Games & Equilibria (review)
[Rock-Paper-Scissors payoff matrix; each player mixes uniformly (1/3, 1/3, 1/3)]
Nash: A product of mixed strategies s.t. no player has a profitable deviating strategy.
Why do we study Nash eq?
• Nash eq. have a simple, intuitive definition.
• Nash eq. are applicable to all games.
• In some classes of games, Nash eq. is a reasonably good predictor of rational self-interested behavior (e.g. zero-sum games).
• Even in general games, Nash eq. analysis seems like a natural, albeit optimistic, first step in understanding rational behavior.
Why is it optimistic?
Nash eq. analysis presumes that agents can resolve issues regarding:
• Convergence: Agent behavior will converge to a Nash.
• Coordination: If there are many Nash eq., agents can coordinate on one of them.
• Communication: Agents are fully aware of each other's utilities/rationality.
• Complexity: Computing a Nash can be hard even from a centralized perspective.
Today: Learning in Games
• Agent behavior is an online learning algorithm/dynamic
• Input: Current state of environment/other agents (+ history)
• Output: Chosen (randomized) action
• Analyze the evolution of systems of coupled dynamics, as a way to predict interacting agent behavior.
• Advantages: Weaker assumptions. If the dynamic converges, it converges to a Nash equilibrium (but it may not converge).
• Disadvantages: Harder to analyze.
Today: Learning in Games
• Agent behavior is an online learning algorithm/dynamic
• Input: Current state of environment/other agents (+ history)
• Output: Chosen (randomized) action
• Class 1: Best (Better) Response Dynamics
• Class 2: No-regret dynamics (e.g. Weighted Majority/Hedge dynamic)
Best Response Dynamics (BR)
• Start from an arbitrary state s ∈ S
• Choose an arbitrary agent i
• Agent i deviates to a best (better) response given the strategies of the others.
• Advantages: Simple, widely applicable
• Disadvantages: No intelligence/learning
Does this work?
Congestion Games
• n players and m resources ("edges")
• Each strategy corresponds to a set of resources ("paths")
• Each edge has a cost function ce(x) that determines its cost as a function of the number x of players using it.
• Cost experienced by a player = sum of the costs of the edges it uses
[Example network with edges of cost x and 2x; Cost(red) = 6, Cost(green) = 8]
Potential Games
• A potential game is a game that admits a function Φ : S → R s.t. for every s ∈ S, every agent i, and every deviation si′ ∈ Si:
  ui(si′, s-i) − ui(s) = Φ(si′, s-i) − Φ(s)
• Every congestion game is a potential game: with costs, Rosenthal's potential Φ(s) = Σe Σk=1..loade(s) ce(k) satisfies ci(si′, s-i) − ci(s) = Φ(si′, s-i) − Φ(s).
• This implies that any such game has pure NE and that best response converges. Speed?
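To make the convergence claim concrete, here is a minimal Python sketch (not from the slides; the toy instance and helper names are illustrative): best-response dynamics on a congestion game with two parallel edges, both of cost c(x) = x. Rosenthal's potential strictly decreases with every improving move, so the loop must stop at a pure Nash equilibrium.

```python
# Minimal sketch: best-response dynamics on a toy congestion game.
# Hypothetical instance: n players each choose one of two parallel edges,
# both edges have cost c(x) = x, where x is the number of players on the edge.

n, edges = 4, [0, 1]
c = lambda load: load                          # c(x) = x on every edge

def load(profile, edge):
    return sum(1 for s in profile if s == edge)

def cost_if_played(profile, i, edge):
    """Cost player i would pay by using `edge`, keeping the others fixed."""
    others = load(profile[:i] + profile[i + 1:], edge)
    return c(others + 1)

def rosenthal_potential(profile):
    # Phi(s) = sum_e sum_{k=1}^{load_e(s)} c_e(k); drops with every improving move.
    return sum(sum(c(k) for k in range(1, load(profile, e) + 1)) for e in edges)

profile, improved = [0] * n, True              # start: everyone on edge 0
while improved:
    improved = False
    for i in range(n):
        best = min(edges, key=lambda e: cost_if_played(profile, i, e))
        if cost_if_played(profile, i, best) < cost_if_played(profile, i, profile[i]):
            profile[i] = best
            improved = True
print(profile, rosenthal_potential(profile))   # a balanced split, i.e. a pure Nash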
No Regret Learning
Regret(T) in a history of T periods:
  (total profit of best fixed action in hindsight) − (total profit of the algorithm)
No single action significantly outperforms the dynamic.
An algorithm is characterized as "no regret" if for every input sequence the regret grows sublinearly in T.
[Blackwell '56], [Hannan '57], [Fudenberg, Levine '94], …
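In symbols (a standard formalization consistent with the slide; the notation a_i^t for the action played at time t is mine):

$$\mathrm{Regret}_i(T) \;=\; \max_{s \in S_i} \sum_{t=1}^{T} u_i\!\left(s,\, a_{-i}^{t}\right) \;-\; \sum_{t=1}^{T} u_i\!\left(a_i^{t},\, a_{-i}^{t}\right),$$

and the algorithm is no-regret if Regret_i(T) = o(T) for every input sequence.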
The Multiplicative Weights Algorithm (a.k.a. Hedge, a.k.a. Weighted Majority) [Littlestone, Warmuth '94], [Freund, Schapire '99]
• Pick s with probability proportional to (1−ε)^total(s), where total(s) denotes the cumulative cost of s in all past periods.
• Why is it regret minimizing?
• Proof on the board.
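A minimal Python sketch of this update rule (illustrative only; the function and variable names are my own, and per-period costs are assumed to lie in [0, 1]):

```python
import random

def hedge_pick(actions, total_cost, eps):
    """Sample an action with probability proportional to (1 - eps) ** total_cost[a],
    where total_cost[a] is the cumulative cost of action a over all past periods."""
    weights = [(1 - eps) ** total_cost[a] for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

# One round of play: pick, observe this period's costs, update the totals.
actions = ["rock", "paper", "scissors"]
total_cost = {a: 0.0 for a in actions}
for t in range(100):
    choice = hedge_pick(actions, total_cost, eps=0.1)
    observed = {a: random.random() for a in actions}   # stand-in for the game's costs
    for a in actions:
        total_cost[a] += observed[a]
```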
No Regret and Equilibria
Do no-regret algorithms converge to Nash equilibria in general games?
Do no-regret algorithms converge to other equilibria in general games?
Other Equilibrium Notions (review)
[Rock-Paper-Scissors matrix; both players mix uniformly (1/3, 1/3, 1/3)]
Choose any of the green outcomes uniformly (prob. 1/9).
Nash: A probability distribution over outcomes that is a product of mixed strategies s.t. no player has a profitable deviating strategy.
Other Equilibrium Notions (review)
[Rock-Paper-Scissors matrix; both players mix uniformly (1/3, 1/3, 1/3)]
Coarse Correlated Equilibria (CCE): A probability distribution over outcomes s.t. no player has a profitable deviating strategy (the product-of-mixed-strategies requirement of Nash is dropped).
Other Equilibrium Notions (review)
[Rock-Paper-Scissors matrix; six outcomes highlighted green]
Choose any of the green outcomes uniformly (prob. 1/6).
Coarse Correlated Equilibria (CCE): A probability distribution over outcomes s.t. no player has a profitable deviating strategy.
Other Equilibrium Notions (review)
[Same Rock-Paper-Scissors matrix; choose any of the green outcomes uniformly (prob. 1/6)]
Is this a CE? NO.
Correlated Equilibria (CE): A probability distribution over outcomes s.t. no player has a profitable deviating strategy, even if the player can condition on the advice (recommended action) drawn from the distribution.
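Written out (standard definitions, stated here for reference rather than taken from the slide), a distribution σ over S is:

$$\text{a CCE if}\quad \mathbb{E}_{s\sim\sigma}\!\left[u_i(s)\right] \;\ge\; \mathbb{E}_{s\sim\sigma}\!\left[u_i(s_i', s_{-i})\right] \qquad \forall i,\ \forall s_i' \in S_i,$$

$$\text{a CE if}\quad \mathbb{E}_{s\sim\sigma}\!\left[u_i(s) \,\middle|\, s_i\right] \;\ge\; \mathbb{E}_{s\sim\sigma}\!\left[u_i(s_i', s_{-i}) \,\middle|\, s_i\right] \qquad \forall i,\ \forall s_i, s_i' \in S_i \text{ with } \sigma(s_i) > 0.$$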
Other Equilibrium Notions (review)
[Inclusion diagram: Pure NE ⊆ NE ⊆ CE ⊆ CCE]
No-regret & CCE
A history of no-regret algorithms is a sequence of outcomes s.t. no agent has a single deviating action that can increase her average payoff.
A Coarse Correlated Equilibrium is a probability distribution over outcomes s.t. no agent has a single deviating action that can increase her expected payoff.
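The connection can be made precise (a sketch of the standard argument, not copied from the slide): let σ_T be the empirical distribution of the joint actions s^1, …, s^T. For every player i and every fixed action s_i',

$$\mathbb{E}_{s\sim\sigma_T}\!\left[u_i(s_i', s_{-i})\right] - \mathbb{E}_{s\sim\sigma_T}\!\left[u_i(s)\right]
= \frac{1}{T}\left(\sum_{t=1}^{T} u_i(s_i', s_{-i}^{t}) - \sum_{t=1}^{T} u_i(s^{t})\right)
\le \frac{\mathrm{Regret}_i(T)}{T} \;\longrightarrow\; 0,$$

so the empirical play of no-regret learners converges to the set of coarse correlated equilibria.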
No Regret and Equilibria
Do no-regret algorithms converge to Nash equilibria in general games?
Do no-regret algorithms converge to other equilibria in general games?
Do no-regret algorithms converge to Nash equilibria in interesting games?
CCE in Zero-Sum Games
In general games, CCE ⊇ conv(NE). Why?
In zero-sum games, the marginals and utilities of CCE and NE agree. Why?
What does this imply for no-regret algorithms?
BREAK 2
Can learning beat Nash equilibria by an arbitrary factor?
CCE in Congestion Games
Load balancing: n balls, n bins; each bin (link) has cost c(x) = x.
Makespan: Expected maximum latency over all links.
CCE in Congestion Games
Pure Nash (one ball per bin): Makespan 1.
CCE in Congestion Games [Koutsoupias, Mavronicolas, Spirakis '02], [Czumaj, Vöcking '02]
Mixed Nash (each ball picks a bin uniformly at random, prob. 1/n each): Makespan Θ(log n / log log n).
CCE in Congestion Games [Blum, Hajiaghayi, Ligett, Roth '08]
Coarse Correlated Equilibria: Makespan exponentially worse, Ω(√n).
No-Regret Algs in Congestion Games
Since worst-case CCE can be reproduced by worst-case no-regret algorithms, worst-case no-regret algorithms do not converge to Nash equilibria in general.
(Multiplicative Weights) Algorithm in (Potential) Games
• x(t) is the current state of the system (a tuple of randomized strategies, one for each player).
• Each player tosses their coins and a specific outcome is realized.
• Depending on the outcome of these random events, we transition to the next state x(t+1).
Infinite Markov chain with an infinite state space Δ(S); each possible transition x(t) → x(t+1) moves the state by O(ε).
(Multiplicative Weights) Algorithm in (Potential) Games
• Problem 1: Hard to get intuition about the problem, let alone analyze it.
• Let's try to come up with a "discounted" version of the problem.
• Ideas??
[Same picture: infinite Markov chain over Δ(S), O(ε) steps]
(Multiplicative Weights) Algorithm in (Potential) Games
• Idea 1: Analyze the expected motion.
[Same picture: infinite Markov chain over Δ(S), O(ε) steps]
(Multiplicative Weights) Algorithm in (Potential) Games
• Idea 1: Analyze the expected motion.
• The system evolution is now deterministic, i.e. there exists a function f s.t. E[x(t+1)] = f(x(t), ε).
• I wish to analyze this function (e.g. find its fixed points).
[Picture: x(t) mapped to E[x(t+1)], an O(ε) step inside Δ(S)]
(Multiplicative Weights) Algorithm in (Potential) Games
• Problem 2: The function f is still rather complicated.
• Idea 2: Analyze the MWA dynamics for small ε. Use a Taylor expansion to find a first-order approximation of f:
  f(x(t), ε) = f(x(t), 0) + ε · f′(x(t), 0) + O(ε²)
[Picture: x(t) mapped to E[x(t+1)], an O(ε) step inside Δ(S)]
(Multiplicative Weights) Algorithm in (Potential) Games
• As ε → 0, (f(x(t), ε) − f(x(t), 0)) / ε → f′(x(t), 0). This specifies a vector at each point of our state space Δ(S), i.e. a vector field. The vector field defines a system of ODEs which we are going to analyze.
Deriving the ODE
• Take expectations of the MW update.
• Differentiate w.r.t. ε and take the expected value at ε = 0.
• The resulting ODE is the replicator dynamic studied in evolutionary game theory.
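For reference, the limiting ODE in the cost model (the standard replicator form; the notation x_{iγ} for the probability that player i plays γ and c_{iγ}(x) for the corresponding expected cost is mine, not the slides'):

$$\dot{x}_{i\gamma} \;=\; x_{i\gamma}\left(\hat{c}_i(x) - c_{i\gamma}(x)\right), \qquad \hat{c}_i(x) = \sum_{\gamma' \in S_i} x_{i\gamma'}\, c_{i\gamma'}(x),$$

so probability mass flows toward strategies that are cheaper than player i's current average cost (in the payoff model the sign is flipped).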
Motivating Example
[Two machines, each with cost c(x) = x]
Motivating Example
• Each player's mixed strategy is summarized by a single number (the probability of picking machine 1). Plot the mixed strategy profile in R².
[Phase portrait in the unit square; the mixed Nash and the pure Nash equilibria are marked]
Motivating Example
• Even in the simplest case of two balls and two bins with linear cost functions, the replicator equation has a nonlinear form.
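A minimal numerical sketch of that nonlinearity (illustrative, not from the slides): for two players and two machines with c(x) = x, writing p_i for the probability that player i picks machine 1, the cost-based replicator reduces to dp1/dt = p1(1 − p1)(1 − 2 p2) and symmetrically for p2, which a simple Euler integration can follow.

```python
# Euler integration of the replicator ODE for 2 players / 2 machines, c(x) = x.
# p1, p2 = probabilities of picking machine 1;  dp1/dt = p1 (1 - p1) (1 - 2 p2).
def simulate(p1, p2, step=0.01, iters=5000):
    for _ in range(iters):
        d1 = p1 * (1 - p1) * (1 - 2 * p2)
        d2 = p2 * (1 - p2) * (1 - 2 * p1)
        p1, p2 = p1 + step * d1, p2 + step * d2
    return p1, p2

print(simulate(0.6, 0.3))   # drifts toward a pure Nash: the players split across machines
print(simulate(0.5, 0.5))   # the fully mixed Nash is a fixed point, but an unstable one
```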
The potential function
• The congestion game has a potential function Φ.
• Let Ψ = E[Φ]. A calculation yields that Ψ decreases except when every player randomizes over paths of equal expected cost (i.e. Ψ is a Lyapunov function of the dynamics). [Monderer, Shapley '96]
• Analyzing the spectrum of the Jacobian shows that in "generic" congestion games only pure Nash equilibria are stable. [Kleinberg, Piliouras, Tardos '09]
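One way to make "a calculation yields" concrete (a sketch of the standard computation for the cost-based replicator, stated here under the notation introduced above rather than copied from the slide):

$$\frac{d\Psi}{dt} \;=\; -\sum_{i} \operatorname{Var}_{\gamma \sim x_i}\!\left[c_{i\gamma}(x)\right] \;\le\; 0,$$

with equality exactly when every player randomizes only over paths of equal expected cost, so Ψ serves as a Lyapunov function for the dynamic.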
Cyclic Matching Pennies (Jordan's game) [Jordan '93]
Three players, strategies {H, T}; profit of 1 if you mismatch your opponent, 0 otherwise.
Nash Equilibrium: each player mixes (½, ½).
• Social Welfare of the NE: 3/2
Cyclic Matching Pennies (Jordan's game) [Jordan '93]
Profit of 1 if you mismatch your opponent; 0 otherwise.
Best Response Cycle: (H,H,T) → (H,T,T) → (H,T,H) → (T,T,H) → (T,H,H) → (T,H,T) → (H,H,T)
• Social Welfare of the NE: 3/2
• Social Welfare along the cycle: 2
Asymmetric Cyclic Matching Pennies [Jordan '93]
Nash Equilibrium: each player mixes (1/(M+1), M/(M+1)).
Best Response Cycle: (H,H,T) → (H,T,T) → (H,T,H) → (T,T,H) → (T,H,H) → (T,H,T) → (H,H,T)
• Social Welfare of the NE: 3M/(M+1) < 3
• Social Welfare along the cycle: M+1