Multiagent Social Learning in Large Repeated Games. Jean Oh.
Motivation: "Discovery of strategies that support mutually desirable outcomes." Selfish solutions can be suboptimal when agents are short-sighted.
Multiagent resource selection problem
• A set of agents N = {agent1, agent2, ..., agentn} and a set of resources A = {resource1, resource2, ..., resourcem} (e.g. network edges e1, e2, e3, e4).
• In each state, agent i chooses a strategy si (a path over the resources).
• Given the strategy profile (si, s-i), agent i incurs cost ci(si, s-i).
• Individual objective: to find a path that minimizes cost.
Congestion game
• Congestion cost depends on the number of agents that have chosen the same resource.
• "Selfish solution": the cost of every used path becomes more or less equal, so no one wants to deviate from its current path (a.k.a. Nash equilibrium, Wardrop's first principle).
• Social welfare: the average cost of all agents; each individual's objective is to minimize its own congestion cost.
• "Selfish solutions" can be arbitrarily suboptimal [Roughgarden 2007].
• An important subject in transportation science, computer networks, and algorithmic game theory.
A minimal sketch of this cost model follows.
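As a minimal sketch of the cost model (the function names and the linear edge cost are illustrative assumptions, not from the talk), the per-agent congestion cost can be computed by counting how many agents chose each resource:

```python
from collections import Counter

def congestion_costs(profile, edge_cost):
    # profile:   one path per agent; a path is a tuple of edges (resources).
    # edge_cost: maps (edge, load) -> cost of using that edge under that load.
    load = Counter(e for path in profile for e in path)  # agents per edge
    return [sum(edge_cost(e, load[e]) for e in path) for path in profile]

# Two agents share edge e1, whose cost grows with its load (cost = load here):
profile = [("e1", "e2"), ("e1", "e3")]
print(congestion_costs(profile, lambda e, x: x))  # [3, 3]
```

Each agent pays 2 on the shared edge e1 and 1 on its private edge; adding load to a shared resource raises the cost for everyone using it.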
Example: inefficiency of the selfish solution. Metro vs. Driving [Pigou 1920, Roughgarden 2007]
• Two routes for n agents: metro, with constant cost 1; driving, whose cost depends on the number of drivers (x/n when x of the n agents drive).
• Objective: minimize average cost.
• Selfish solution (the outcome of stationary algorithms such as no-regret or fictitious play): everyone drives, and the average cost = 1.
• With a central administrator sending half the agents to the metro, the optimal average cost = [n/2 · 1 + n/2 · ½]/n = ¾.
• Nonlinear cost functions are considered later in the experiments.
A small numeric check of these two figures follows.
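A quick check of the slide's arithmetic, assuming the standard linear driving cost x/n (which reproduces the ¾ figure):

```python
# Pigou's metro-vs-driving example: metro costs a constant 1;
# driving costs x/n when x of the n agents drive.
n = 100

def avg_cost(num_driving):
    driving_total = num_driving * (num_driving / n)  # each driver pays x/n
    metro_total = (n - num_driving) * 1.0            # each rider pays 1
    return (driving_total + metro_total) / n

print(avg_cost(n))       # selfish solution: everyone drives -> 1.0
print(avg_cost(n // 2))  # optimal split: half drive -> 0.75
```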
If a few agents take the alternative route, everyone else is better off. We just need a few altruistic agents to sacrifice themselves. Any volunteers? "Excellent! As long as it's not me."
Related work: coping with the inefficiency of selfish solutions
• Increase resource capacity [Korilis 1999].
• Redesign the network structure (cf. Braess' paradox) [Roughgarden 2001a].
• Algorithmic mechanism design [Ronen 2000, Calliess & Gordon 2008].
• Centralization [Shenker 1995, Chakrabarty 2005, Blumrosen 2006].
• Periodic policy under the "homo-egualis" principle [Nowé et al. 2003]: take the worst-performing agent into consideration (to avoid inequality).
• Collective Intelligence (COIN) [Wolpert & Tumer 1999]: the Wonderful Life Utility (WLU).
• Altruistic Stackelberg strategy [Roughgarden 2001b]: (market) leaders make the first moves, hoping to induce desired actions from the followers; LLF combines centralized leaders with selfish followers.
• "Explicit coordination is necessary to achieve the system-optimal solution in congestion games" [Milchtaich 2004].
Question: can self-interested agents support a mutually beneficial solution without external intervention?
Related work: strategies that support mutually beneficial solutions
• Explicit threat: grim-trigger, a Nash equilibrium of the repeated game when everyone adopts it. "We'll play the mutually beneficial strategy as long as you stay; if you deviate, I'll punish you with your minimax value forever, whatever you do from then on."
• Minimax value: as good as agent i can get when the rest of the world turns against i.
• Drawbacks: computational intractability (NP-hard [Meyers 2006], NP-complete [Borgs et al. 2008]); requires complete monitoring; may require centralization ("significant coordination overhead"); existing algorithms are limited to 2-player games [Stimpson 2001, Littman & Stone 2003, Sen et al. 2003, Crandall 2005].
• Agenda: to find more efficient strategies that can support mutually beneficial solutions.
Approach: IMPRES (Implicit Reciprocal Strategy Learning)
IMPRES assumptions
• The other agents may be viewed as opponents, as sources of uncertainty, or as sources of knowledge; IMPRES treats them as sources of knowledge.
• The agents may be symmetric or asymmetric in their ability; IMPRES assumes they are asymmetric.
IMPRES intuition: social learning (think of a stop/go traffic signal)
• Learn to act more rationally by giving a strategy to others.
• Learn to act more rationally by using a strategy given by others.
• Acting independently leaves a non-zero probability of collision.
IMPRES overview: 2-layered decision making
• Meta-layer: whose strategy do I follow? Each agent takes one of three roles: strategist, subscriber, or solitary.
• Inner-layer: which path do I take? (e.g. strategist agent i tells subscriber agent j, "Take route 2".)
• Each agent takes a path in the environment, observes its congestion cost, and learns the strategies of both layers from that cost.
IMPRES meta-learning: which strategy?
• Meta-action set A = {strategist, solitary} ∪ {subscribe to j : j in the strategist lookup table L}; each action a has a Q value, and the meta-strategy s puts more probability mass on low-cost (high-Q) actions.
LOOP:
  p ← selectPath(a); take path p; observe congestion cost c
  update the Q value of action a using cost c: Q(a) ← (1 - α)Q(a) + α(MaxCost - c)
  new action ω ← randomPick(strategist lookup table L); A ← A ∪ {ω}
  update the meta-strategy s
  a ← select an action according to s; if a = strategist, L ← L ∪ {i}
A runnable sketch of this loop follows.
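A minimal runnable sketch of the meta-layer update, assuming a softmax meta-strategy and hypothetical values for the learning rate α and MaxCost (neither is specified in the slide text):

```python
import math
import random

ALPHA, MAX_COST = 0.1, 10.0               # hypothetical learning parameters

def update_q(Q, a, cost):
    # The slide's update: Q(a) <- (1 - alpha) Q(a) + alpha (MaxCost - c)
    Q[a] = (1 - ALPHA) * Q[a] + ALPHA * (MAX_COST - cost)

def meta_strategy(Q, temp=1.0):
    # More probability mass to low-cost (high-Q) meta-actions, via softmax.
    w = {a: math.exp(q / temp) for a, q in Q.items()}
    total = sum(w.values())
    return {a: v / total for a, v in w.items()}

def select_action(Q):
    s = meta_strategy(Q)
    return random.choices(list(s), weights=list(s.values()))[0]

Q = {"solitary": 0.0, "strategist": 0.0}  # subscribe-to-j actions join as found
update_q(Q, "solitary", cost=4.0)         # a low cost raises Q("solitary")
print(select_action(Q))
```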
IMPRES inner-learning: which path? (symmetric network congestion games)
• f: the number of subscribers to this strategy; when f = 0, there is no inner-learning.
• The strategist learns a correlated strategy for its f agents: a probability distribution over all possible joint actions, e.g. (drive, metro).
LOOP:
  choose a joint strategy; take path p; observe the number of agents on the edges of p
  predict the traffic generated by the other agents on each edge
  select the best joint strategy for the f agents (exploring with small probability)
  shuffle the joint strategy (randomly permute which agent gets which action)
A brute-force sketch of one inner step appears below.
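A minimal sketch of one inner-layer step under stated assumptions: the joint action space is enumerated by brute force (workable only for tiny f and few paths, purely for illustration), and all names are hypothetical:

```python
import random
from itertools import product

def inner_step(paths, f, predicted_load, edge_cost, epsilon=0.05):
    # Choose one path per subscriber, given predicted traffic from the others.
    def joint_cost(joint):
        load = dict(predicted_load)           # start from the others' traffic
        for p in joint:
            for e in p:
                load[e] = load.get(e, 0) + 1  # add our own subscribers
        return sum(edge_cost(e, load[e]) for p in joint for e in p)

    if random.random() < epsilon:             # explore with small probability
        joint = [random.choice(paths) for _ in range(f)]
    else:                                      # best joint strategy for f agents
        joint = list(min(product(paths, repeat=f), key=joint_cost))
    random.shuffle(joint)                      # shuffle the joint strategy
    return joint

paths = [("e1", "e2"), ("e3", "e4")]
print(inner_step(paths, f=2, predicted_load={"e1": 3}, edge_cost=lambda e, x: x))
```

With heavy predicted traffic on e1, the best joint strategy splits the two subscribers across the two paths rather than piling onto either one.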
Theoretical analysis: IMPRES
• Mechanics of the algorithm: the meta-layer (which strategy?) and the inner-layer (which path?).
• Structure of the learned strategy: IMPRES vs. grim-trigger.
• Main theoretical results.
• Empirical results.
Structure of the learned strategy
• Non-stationary strategy: a strategy that depends on past plays.
• Grim-trigger: any correlated strategy that is better than the minimax value can be supported.
• IMPRES: any correlated strategy that is better than the independent strategy can be supported.
• An IMPRES strategy switches between a correlated strategy C (the subscriber strategy) and an independent strategy I (the solitary strategy): it exploits C while Cost(C) < Cost(I), and explores/falls back to I when Cost(C) ≥ Cost(I).
Grim-trigger vs. IMPRES: strategies that can support a mutually beneficial outcome
• Monitoring: grim-trigger needs perfect monitoring; IMPRES works under imperfect monitoring.
• Complexity: grim-trigger is intractable; IMPRES is tractable.
• Coordination: grim-trigger carries coordination overhead (centralization); IMPRES coordinates efficiently.
• Form: grim-trigger is deterministic; IMPRES is stochastic.
• A grim-trigger strategy: play the mutually beneficial strategy while the other players obey; once a deviator is observed, play the minimax strategy whatever happens thereafter.
• An IMPRES strategy: exploit the correlated strategy C while Cost(C) < Cost(I); explore the independent strategy I when Cost(C) ≥ Cost(I).
This contrast is sketched as two tiny policy functions below.
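To make the structural contrast concrete, here is a toy sketch (all names hypothetical; not the thesis implementation):

```python
def grim_trigger(history, beneficial, minimax):
    # Explicit threat: cooperate until any deviation is observed, then punish forever.
    return minimax if "deviation" in history else beneficial

def impres_like(cost_C, cost_I, correlated, independent):
    # Implicit threat: follow the correlated strategy only while it beats
    # acting independently; otherwise fall back to independence.
    return correlated if cost_C < cost_I else independent

print(grim_trigger(["obey", "deviation"], "cooperate", "minimax"))  # minimax
print(impres_like(0.8, 1.0, "subscribe", "solitary"))               # subscribe
```

The grim-trigger policy needs to observe deviations (perfect monitoring), while the IMPRES-style policy only compares its own realized costs.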
Main result
• General belief: rational agents can support a mutually beneficial strategy with an explicit threat (the minimax strategy).
• This work: rationally bounded IMPRES agents can support a mutually beneficial strategy without an explicit threat; the implicit threat is reverting to the independent strategy.
Empirical evaluation: quantifying "mutually beneficial" and "efficient"
• Selfish solutions: congestion cost arbitrarily suboptimal; coordination overhead: none.
• Mutually beneficial solutions (1-to-n centralization): congestion cost optimal; coordination overhead: significant.
• Where does IMPRES fall on these two axes (congestion cost vs. coordination overhead)?
Evaluation criteria
• Individual rationality: minimax-safety.
• Social welfare: average congestion cost of all agents, reported for each problem p as the ratio Cost(solution_p) / Cost(optimum_p).
• Coordination overhead (the size of subgroups), relative to a 1-to-n centrally administered system: overhead(solution_p) / overhead(max_p).
• Agent demographic (based on the meta-strategy), e.g. the percentage of solitaries, strategists, and subscribers.
These ratios are spelled out in the short sketch below.
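The two ratios reduce to simple normalizations; a minimal sketch (function names hypothetical), checked against the Pigou example above:

```python
def cost_ratio(cost_solution, cost_optimum):
    # Social welfare relative to optimum: Cost(solution_p) / Cost(optimum_p).
    return cost_solution / cost_optimum

def overhead_ratio(overhead_solution, overhead_max):
    # Coordination overhead relative to a 1-to-n centralized system.
    return overhead_solution / overhead_max

print(cost_ratio(1.0, 0.75))  # Pigou: the selfish solution is ~1.33x optimum
```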
Experimental setup • Number of agents n = 100; (n = 2 ~ 1000) • All agents use IMPRES (self-play) • Number of iterations = 20,000 ~ 50,000 • Averaged over 10-30 trials • Learning parameters:
Metro vs. Driving (n = 100): [figure: cost of metro vs. driving and the number of agents choosing each, over time; the lower, the better; agent demographic shown; free riders always drive]
Metro vs. Driving (n = 100) with IMPRES: [figure: cost of metro vs. driving and the number of agents choosing each under IMPRES]
Polynomial cost functions, average number of paths = 5
• [scatter plot: C(IMPRES solution)/C(optimum) vs. C(selfish solution)/C(optimum), one point per problem, with the reference line y = x; C(s) is the congestion cost of solution s; data based on the average cost after 20,000 iterations]
• Selfish solution: the cost of every used path becomes more or less equal, so no one wants to deviate from its current path.
• Baselines: selfish [Fabrikant 2004] and optimal [Meyers 2006].
• For the highlighted problem: selfish solution at (3, 3), IMPRES at (3, 1.2), optimum at (3, 1).
Polynomial cost functions, average number of paths = 5: congestion cost vs. coordination overhead
• [scatter plot: congestion cost C(s)/C(optimum) vs. coordination overhead O(solution)/O(1-to-n solution), where O(s) is the coordination overhead (average communication bandwidth) of solution s; lower is better on both axes; reference points: the optimum and the 1-to-n solution]
Dynamic population: 40 problems with mixed convex cost functions, average number of paths = 5
• In every i-th round, one randomly selected agent is replaced with a new one.
• [scatter plot: C(s)/C(optimum) vs. C(selfish solution)/C(optimum), with the selfish and optimal baselines; data based on the average cost after 50,000 iterations]
Summary of experiments
• Symmetric network congestion games: well-known examples; linear, polynomial, exponential, and discrete cost functions.
• Scalability: the number of alternative paths (|S| = 2 ~ 15) and the population size (n = 2 ~ 1000).
• Robustness under the dynamic-population assumption.
• 2-player matrix games.
• Inefficiency of solutions, over 121 problems: selfish solutions are 120% above optimum; IMPRES solutions are 30% above optimum (a remaining limitation), at 25% of the coordination overhead of the 1-to-n model.
Contributions
• Discovery of social norms (strategies) that can support mutually beneficial solutions.
• Investigated "social learning" in a multiagent context.
• Proposed IMPRES, a 2-layered learning algorithm: a significant extension of classical reinforcement learning models, and the first algorithm that learns non-stationary strategies for more than 2 players under imperfect monitoring.
• Demonstrated that IMPRES agents self-organize: every agent is individually rational (minimax-safety); social welfare improves approximately 4-fold over selfish solutions; coordination is efficient (overhead within 25% of the 1-to-n model).
Future work
• Short-term goals (more asymmetry): give strategists more incentive; individual thresholds (sightseers vs. commuters); tradeoffs among multiple criteria (weights); the free-rider problem.
• Long-term goals: establish the notion of social learning in the context of artificial agent learning, including learning by copying the actions of others and learning by observing the consequences of other agents' actions.
Conclusion: Rationally bounded agents adopting social learning can support mutually beneficial outcomes without an explicit notion of threat.