Multi-Agent Learning Mini-Tutorial

Gerry Tesauro • IBM T.J. Watson Research Center • http://www.research.ibm.com/infoecon • http://www.research.ibm.com/massdist

Outline • Statement of the problem • Tools and concepts from RL & game theory • “Naïve” approaches to multi-agent learning: ordinary single-agent RL and no-regret learning; fictitious play; evolutionary game theory • “Sophisticated” approaches: minimax-Q (Littman) and Nash-Q (Hu & Wellman); tinkering with learning rates, i.e. WoLF (Bowling) and multiple-timescale Q-learning (Leslie & Collins); “strategic teaching” (Camerer talk) • Challenges and Opportunities

Normal single-agent learning • Assume that environment has observable states, characterizable expected rewards and state transitions, and all of the above is stationary (MDP-ish) • Non-learning, theoretical solution to fully specified problem: DP formalism • Learning: solve by trial and error without a full specification: RL + exploration, Monte Carlo, ...

Multi-Agent Learning Problem: • Agent tries to solve its learning problem, while other agents in the environment also are trying to solve their own learning problems. • Non-learning, theoretical solution to fully specified problem: game theory

Basics of game theory • A game is specified by: players (1…N), actions, and payoff matrices (functions of joint actions) • [Figure: payoff matrix indexed by A’s action and B’s action; each cell holds A’s payoff and B’s payoff] • If payoff matrices are identical, the game is cooperative; otherwise it is non-cooperative (zero-sum = purely competitive)
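As a concrete illustration of the specification above, a two-player game can be stored as one payoff table per player, indexed by the joint action. The sketch below is not from the slides: the Prisoner's Dilemma and matching-pennies payoffs and the helper name `classify` are assumptions chosen for illustration.

```python
# Illustrative payoffs (assumed, not taken from the slides).
# Rows are A's action, columns are B's action.
# Prisoner's Dilemma: action 0 = cooperate, action 1 = defect.
PD_A = [[3, 0],
        [5, 1]]
PD_B = [[3, 5],
        [0, 1]]

# Matching pennies: A wins when the actions match, B wins otherwise.
MP_A = [[1, -1],
        [-1, 1]]
MP_B = [[-1, 1],
        [1, -1]]


def classify(a, b):
    """Classify a two-player matrix game from its two payoff tables."""
    cells = [(a[i][j], b[i][j])
             for i in range(len(a)) for j in range(len(a[0]))]
    if all(pa == pb for pa, pb in cells):
        return "cooperative"      # identical payoffs
    if all(pa + pb == 0 for pa, pb in cells):
        return "zero-sum"         # purely competitive
    return "general-sum"          # non-cooperative, not purely competitive


pd_kind = classify(PD_A, PD_B)
mp_kind = classify(MP_A, MP_B)
team_kind = classify(PD_A, PD_A)
```

Here `pd_kind`, `mp_kind`, and `team_kind` cover the three cases named on the slide: a general non-cooperative game, a zero-sum game, and a cooperative game with identical payoffs.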

Basic lingo…(2) • Games with no states: (bi)-matrix games • Games with states: stochastic games, Markov games; (state transitions are functions of joint actions) • Games with simultaneous moves: normal form • Games with alternating turns: extensive form • No. of rounds = 1: one-shot game • No. of rounds > 1: repeated game • deterministic action choice: pure strategy • non-deterministic action choice: mixed strategy

Basic Analysis • A joint strategy x is Pareto-optimal if there is no x’ that improves everybody’s payoffs • An agent’s x_i is a dominant strategy if it is always best regardless of others’ actions • x_i is a best-response to others’ x_{-i} if it maximizes payoff given x_{-i} • A joint strategy x is an equilibrium if each agent’s strategy is simultaneously a best-response to everyone else’s strategy, i.e. no agent has an incentive to deviate (Nash, correlated) • A Nash equilibrium always exists (possibly only in mixed strategies), but there may be exponentially many of them, and they are not easy to compute
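The best-response and equilibrium definitions above can be checked mechanically for pure strategies in a small two-player game. This is a minimal sketch under assumed, illustrative payoffs; the function names are hypothetical, and it only enumerates pure-strategy equilibria, so an empty result means the game's equilibria are all mixed.

```python
def row_best_responses(a, col):
    """Row actions that maximize A's payoff against B's pure action `col`."""
    column = [a[i][col] for i in range(len(a))]
    best = max(column)
    return {i for i, v in enumerate(column) if v == best}


def col_best_responses(b, row):
    """Column actions that maximize B's payoff against A's pure action `row`."""
    best = max(b[row])
    return {j for j, v in enumerate(b[row]) if v == best}


def pure_nash(a, b):
    """All joint pure actions where each action is a best response to the other."""
    return [(i, j)
            for i in range(len(a)) for j in range(len(a[0]))
            if i in row_best_responses(a, j) and j in col_best_responses(b, i)]


# Illustrative payoffs (assumed): Prisoner's Dilemma and matching pennies.
PD_A = [[3, 0], [5, 1]]
PD_B = [[3, 5], [0, 1]]
MP_A = [[1, -1], [-1, 1]]
MP_B = [[-1, 1], [1, -1]]

pd_equilibria = pure_nash(PD_A, PD_B)   # mutual defection only
mp_equilibria = pure_nash(MP_A, MP_B)   # none in pure strategies
```

In the Prisoner's Dilemma the single equilibrium is also a dominant-strategy outcome and is not Pareto-optimal, since mutual cooperation pays both players more.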

What about imperfect information games? • Nash eqm. requires knowledge of all payoffs. For imperfect info. games, corresponding concept is Bayes-Nash equilibrium (Nash plus Bayesian inference over hidden information). Even more intractable than regular Nash.

Can we make game theory more tractable? • Active area of research • Symmetric games: payoffs are invariant under swapping of player labels.  Can look for symmetric equilibria, where all agents play same mixed strategy. • Network games: agent payoffs only depend on interactions with a small # of neighbors • Summarization games: payoffs are simple summarization functions of population joint actions (e.g. voting)

Summary: pros and cons of game theory • Game theory provides a nice conceptual/theoretical framework for thinking about multi-agent learning. • Game theory is appropriate provided that: • Game is stationary and fully specified; • Enough computer power to compute equilibrium; • Can assume other agents are also game theorists; • Can solve equilibrium coordination problem. • Above conditions rarely hold in real applications • Multi-agent learning is not only a fascinating problem, it may be the only viable option.

Naïve Approaches to Multi-Agent Learning • Basic idea: agent adapts, ignoring non-stationarity of other agents’ strategies • 1. Fictitious play: Agent observes time-average frequency of other players’ action choices, and models each opponent as playing a stationary mixed strategy equal to those frequencies: $P(a) = n(a) / \sum_{a'} n(a')$, where $n(a)$ counts how often action $a$ has been observed; agent then plays best-response to this model • Variants of fictitious play: exponential recency weighting, “smoothed” best response (~softmax), small adjustment toward best response, ...
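A minimal sketch of two-player fictitious play, assuming each agent keeps a count of the other's past actions and best-responds to the resulting empirical frequencies. The initial counts of 1 (a uniform prior), the lowest-index tie-breaking rule, and the matching-pennies payoffs are assumptions made for this example.

```python
def best_response(payoff, opp_counts):
    """Action with the highest expected payoff against empirical frequencies.

    payoff[i][j] is my payoff when I play i and the opponent plays j.
    Ties are broken in favor of the lowest action index (an assumption).
    """
    total = sum(opp_counts)
    expected = [sum(payoff[i][j] * opp_counts[j] / total
                    for j in range(len(opp_counts)))
                for i in range(len(payoff))]
    return max(range(len(expected)), key=lambda i: expected[i])


def fictitious_play(a, b, rounds=10000):
    """a[i][j], b[i][j]: payoffs to row and column when row plays i, column plays j."""
    n_rows, n_cols = len(a), len(a[0])
    # Re-index the column player's payoffs as [own action][opponent action].
    b_own = [[b[i][j] for i in range(n_rows)] for j in range(n_cols)]
    row_counts = [1] * n_rows   # observed row actions (uniform prior)
    col_counts = [1] * n_cols   # observed column actions (uniform prior)
    for _ in range(rounds):
        i = best_response(a, col_counts)
        j = best_response(b_own, row_counts)
        row_counts[i] += 1
        col_counts[j] += 1
    row_freq = [n / sum(row_counts) for n in row_counts]
    col_freq = [n / sum(col_counts) for n in col_counts]
    return row_freq, col_freq


# Matching pennies (illustrative payoffs): row wins on a match.
MP_A = [[1, -1], [-1, 1]]
MP_B = [[-1, 1], [1, -1]]
row_freq, col_freq = fictitious_play(MP_A, MP_B)
```

In this zero-sum example the empirical frequencies drift toward the mixed equilibrium (0.5, 0.5), even though the action actually played in each round keeps switching.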

What if all agents use fictitious play? • Strict Nash equilibria are absorbing points for fictitious play • Typical result is limit-cycle behavior of strategies, with increasing period as N → ∞ • In certain cases, product of empirical distributions converges to Nash even though actual play cycles (penny matching example)

More Naïve Approaches… • 2. Evolutionary game theory: “Replicator Dynamics” models: large population of agents using different strategies, fittest agents breed more copies. • Let x = population strategy vector, and x_k = fraction of population playing strategy k. Growth rate then: $\dot{x}_k / x_k = u(e_k, x) - u(x, x)$, where $u(e_k, x)$ is the payoff of pure strategy k against the population and $u(x, x)$ is the population-average payoff • Above equation also derived from an “imitation” model • NE are fixed points of above equation, but not necessarily attractors (unstable or neutrally stable)
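A minimal sketch of replicator dynamics, assuming payoffs come from a single matrix A, so that the fitness of strategy k is (Ax)_k and the update is x_k ← x_k + dt · x_k · ((Ax)_k − x·Ax). The forward-Euler integration, the step size, and the Hawk-Dove payoffs are illustrative assumptions, not taken from the slides.

```python
def replicator_step(x, A, dt):
    """One forward-Euler step of the replicator equation."""
    n = len(x)
    fitness = [sum(A[k][j] * x[j] for j in range(n)) for k in range(n)]
    average = sum(x[k] * fitness[k] for k in range(n))
    # Strategies fitter than the population average grow; others shrink.
    return [x[k] + dt * x[k] * (fitness[k] - average) for k in range(n)]


def simulate(x0, A, dt=0.01, steps=5000):
    """Integrate the population shares forward from x0."""
    x = list(x0)
    for _ in range(steps):
        x = replicator_step(x, A, dt)
    return x


# Hawk-Dove with resource value 2 and fight cost 4 (assumed numbers).
# Index 0 = hawk, index 1 = dove; A[k][j] = payoff of k against j.
HAWK_DOVE = [[-1.0, 2.0],
             [0.0, 1.0]]

final_shares = simulate([0.1, 0.9], HAWK_DOVE)
```

In this game the mixed equilibrium with half hawks is an attractor, so the shares settle near [0.5, 0.5]; in other games the same equation can cycle around an equilibrium instead.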

Many possible dynamic behaviors... • [Figure panels: limit cycles; attractors; unstable fixed points] • Also saddle points, chaotic orbits, ...

Replicator dynamics: auction bidding strategies

More Naïve Approaches… • 3. Iterated Gradient Ascent (Singh, Kearns and Mansour): Again does a myopic adaptation to other players’ current strategy. • Coupled system of linear equations: $\dot{x}_i = \partial u_i(x_i, x_{-i}) / \partial x_i$; each $u_i$ is linear in x_i and x_{-i} • Analysis for two-player, two-action games: either converges to a Nash fixed point on the boundary (at least one pure strategy), or produces limit cycles
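A minimal sketch of gradient ascent in a two-player, two-action game, assuming each player's strategy is summarized by the probability of playing action 0 and that probabilities are clipped back into [0, 1] after every step. The finite step size `eta` and the Prisoner's Dilemma payoffs are assumptions for illustration.

```python
def clip01(v):
    """Project a probability back onto [0, 1]."""
    return min(1.0, max(0.0, v))


def iga(r, c, alpha, beta, eta=0.01, steps=1000):
    """Simultaneous gradient ascent on expected payoff.

    r[i][j], c[i][j]: payoffs to row and column when row plays i, column plays j.
    alpha, beta: probabilities that row and column play action 0.
    """
    for _ in range(steps):
        # Gradient = expected payoff of action 0 minus that of action 1,
        # holding the other player's current mixed strategy fixed.
        grad_alpha = beta * (r[0][0] - r[1][0]) + (1 - beta) * (r[0][1] - r[1][1])
        grad_beta = alpha * (c[0][0] - c[0][1]) + (1 - alpha) * (c[1][0] - c[1][1])
        alpha = clip01(alpha + eta * grad_alpha)
        beta = clip01(beta + eta * grad_beta)
    return alpha, beta


# Prisoner's Dilemma (illustrative payoffs): action 0 = cooperate.
PD_A = [[3, 0], [5, 1]]
PD_B = [[3, 5], [0, 1]]
pd_alpha, pd_beta = iga(PD_A, PD_B, alpha=0.5, beta=0.5)
```

Here both cooperation probabilities are driven to 0, the pure Nash equilibrium on the boundary. Running the same update on matching pennies keeps the strategies circling the mixed equilibrium rather than settling on it.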

Further Naïve Approaches… • 4. Dumb Single-Agent Learning: Use a single-agent algorithm in a multi-agent problem & hope that it works • No-regret learning by pricebots (Greenwald & Kephart) • Simultaneous Q-learning by pricebots (Tesauro & Kephart) • In many cases, this actually works: learners converge either exactly or approximately to self-consistent optimal strategies

“Sophisticated” approaches • Take into account the possibility that other agents’ strategies might change. • 5. Multi-Agent Q-learning: • Minimax-Q (Littman): convergent algorithm for two-player zero-sum stochastic games • Nash-Q (Hu & Wellman): algorithm for two-player general-sum stochastic games, convergent under restrictive conditions on the stage games; requires use of a Nash equilibrium solver
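A rough sketch of the idea behind the minimax-Q update: the `max` over own actions in ordinary Q-learning is replaced by the maximin value of the matrix game formed by Q(s, ·, ·). To stay dependency-free, this example assumes exactly two actions per player and solves the 2x2 game in closed form; the general case needs a linear program. All names, the table layout `Q[s][a][o]`, and the numbers are assumptions for illustration.

```python
def maximin_2x2(q):
    """Maximin value and strategy for the maximizer in a 2x2 zero-sum game.

    q[a][o]: payoff to the maximizer for own action a and opponent action o.
    Returns (value, probability of playing action 0).
    """
    def worst_case(p):
        return min(p * q[0][o] + (1 - p) * q[1][o] for o in (0, 1))

    # The worst-case payoff is concave and piecewise linear in p, so the
    # optimum is at an endpoint or where the two opponent actions tie.
    candidates = [0.0, 1.0]
    denom = q[0][0] - q[1][0] - q[0][1] + q[1][1]
    if denom != 0:
        p = (q[1][1] - q[1][0]) / denom
        if 0.0 <= p <= 1.0:
            candidates.append(p)
    best_p = max(candidates, key=worst_case)
    return worst_case(best_p), best_p


def minimax_q_update(Q, s, a, o, reward, s_next, alpha=0.1, gamma=0.9):
    """One tabular update after observing (s, a, o, reward, s_next)."""
    value_next, _ = maximin_2x2(Q[s_next])
    Q[s][a][o] = (1 - alpha) * Q[s][a][o] + alpha * (reward + gamma * value_next)


# Two states; state 1 already looks like matching pennies (assumed values).
Q = {0: [[0.0, 0.0], [0.0, 0.0]],
     1: [[1.0, -1.0], [-1.0, 1.0]]}
mp_value, mp_strategy = maximin_2x2(Q[1])
minimax_q_update(Q, s=0, a=0, o=1, reward=1.0, s_next=1)
```

Nash-Q has the same shape, except that the maximin step is replaced by the value of a Nash equilibrium of the general-sum stage game, which is why it needs an equilibrium solver.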

More sophisticated approaches... • 6. Varying learning rates • WoLF: “Win or Learn Fast” (Bowling): agent reduces its learning rate when performing well, and increases it when doing badly. Improves convergence of IGA and policy hill-climbing • Multi-timescale Q-Learning (Leslie & Collins): different agents use different power laws $t^{-n}$ for learning rate decay: achieves simultaneous convergence where ordinary Q-learning doesn’t
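A minimal sketch of the WoLF idea applied to gradient ascent in a two-action game. It assumes each player counts as "winning" when its current strategy earns more against the opponent's current strategy than a known equilibrium strategy would, and then uses the small rate `l_min` instead of the large rate `l_max`. Supplying the equilibrium strategies by hand, the specific rates, and the matching-pennies payoffs are all assumptions of this example.

```python
def clip01(v):
    """Project a probability back onto [0, 1]."""
    return min(1.0, max(0.0, v))


def value(payoff, alpha, beta):
    """Expected payoff when row plays action 0 w.p. alpha, column w.p. beta."""
    return (alpha * beta * payoff[0][0]
            + alpha * (1 - beta) * payoff[0][1]
            + (1 - alpha) * beta * payoff[1][0]
            + (1 - alpha) * (1 - beta) * payoff[1][1])


def wolf_iga(r, c, alpha, beta, alpha_eq, beta_eq,
             eta=0.01, l_min=1.0, l_max=4.0, steps=2000):
    """Gradient ascent with a win-or-learn-fast learning rate per player."""
    for _ in range(steps):
        grad_alpha = beta * (r[0][0] - r[1][0]) + (1 - beta) * (r[0][1] - r[1][1])
        grad_beta = alpha * (c[0][0] - c[0][1]) + (1 - alpha) * (c[1][0] - c[1][1])
        # Winning: doing better than the equilibrium strategy would right now.
        row_winning = value(r, alpha, beta) > value(r, alpha_eq, beta)
        col_winning = value(c, alpha, beta) > value(c, alpha, beta_eq)
        l_row = l_min if row_winning else l_max   # learn cautiously when winning
        l_col = l_min if col_winning else l_max   # learn fast when losing
        alpha = clip01(alpha + eta * l_row * grad_alpha)
        beta = clip01(beta + eta * l_col * grad_beta)
    return alpha, beta


# Matching pennies (illustrative payoffs); mixed equilibrium is (0.5, 0.5).
MP_A = [[1, -1], [-1, 1]]
MP_B = [[-1, 1], [1, -1]]
wolf_alpha, wolf_beta = wolf_iga(MP_A, MP_B, alpha=0.9, beta=0.2,
                                 alpha_eq=0.5, beta_eq=0.5)
```

With unequal rates the orbit around the mixed equilibrium shrinks on every quarter turn, so both probabilities end up close to 0.5. Setting `l_min` equal to `l_max` recovers plain gradient ascent, which does not settle in this game.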

More sophisticated approaches... • 7. “Strategic Teaching”: recognizes that other players’ strategies are adaptive • “A strategic teacher may play a strategy which is not myopically optimal (such as cooperating in Prisoner’s Dilemma) in the hope that it induces adaptive players to expect that strategy in the future, which triggers a best-response that benefits the teacher.” (Camerer, Ho and Chong)

Theoretical Research Challenges • Proper theoretical formulation? • “No short-cut” hypothesis: Massive on-line search a la Deep Blue to maximize expected long-term reward • (Bayesian) Model and predict behavior of other players, including how they learn based on my actions (beware of infinite model recursion) • trial-and-error exploration • continual Bayesian inference using all evidence over all uncertainties (Boutilier: Bayesian exploration) • When can you get away with simpler methods?

Real-World Opportunities • Multi-agent systems where you can’t do game theory (covers everything :-)) • Electronic marketplaces (Kephart) • Mobile networks (Chang) • Self-managing computer systems (Kephart) • Teams of robots (Bowling, Stone) • Video games • Military/counter-terrorism applications
