Software Multiagent Systems: CS543 • Milind Tambe • University of Southern California • tambe@usc.edu
Dimensions of Multiagent Learning • Ignore others’ learning vs Model others’ learning • Cooperative vs Competitive • Cooperative • Learn to coordinate with others • Learning organizational roles • Competitive (conflicting learning goals) • Learning to play better against adversary • Opponent modeling • We will focus on reinforcement learning: Q-learning methods
Some Terminology • Q-learning • Model-free vs Model-based
Q-learning • Q-values: Q(s,a) • Related to utility values: U(s) = max_a Q(s,a) • The following equation must hold at equilibrium: Q(i,a) = R(i) + Σ_j P(j|i,a) max_a' Q(j,a') • Requires learning a model!
TD Q-learning • The update equation for TD Q-learning is: Q(i,a) ← Q(i,a) + α (R(i) + max_a' Q(j,a') − Q(i,a)) • What if α = 0? • What if α = 1?
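A minimal Python sketch of this update (the learning rate alpha and the discount gamma are illustrative assumptions; the slide's undiscounted version corresponds to gamma = 1):

```python
def td_q_update(Q, i, a, r, j, actions, alpha=0.1, gamma=1.0):
    """One TD Q-learning backup for the transition (i, a) -> j with reward r.

    Q: dict mapping (state, action) -> value; `actions` are the actions
    available in the successor state j.
    """
    best_next = max(Q.get((j, a2), 0.0) for a2 in actions)
    # alpha = 0: the new sample is ignored, so nothing is ever learned.
    # alpha = 1: the old estimate is discarded entirely, so learning is very noisy.
    Q[(i, a)] = Q.get((i, a), 0.0) + alpha * (r + gamma * best_next - Q.get((i, a), 0.0))
```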
Q Learning Agent
• Q-LEARNING-AGENT(e) returns an action
• e: the percept; Q: table of action values; N: table of state-action frequencies; a: the last action taken; i: the previous state
1. j ← STATE[e]
2. N[i, a] ← N[i, a] + 1
3. Q[i, a] ← Q[i, a] + α (R(i) + max_a' Q[j, a'] − Q[i, a])
4. i ← j
5. Return the action a' that maximizes f(Q(j, a'), N(j, a'))
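A runnable sketch of this agent in Python, assuming the percept carries the current state and the reward of the previous state, and using an optimistic exploration function f with assumed parameters R_PLUS and N_VISITS (this is a sketch, not the course's implementation):

```python
from collections import defaultdict

class QLearningAgent:
    """Tabular Q-learning agent following the slide's pseudocode."""

    def __init__(self, actions, alpha=0.1, gamma=1.0, n_visits=5, r_plus=100.0):
        self.actions = actions
        self.alpha, self.gamma = alpha, gamma          # learning rate and (assumed) discount
        self.n_visits, self.r_plus = n_visits, r_plus  # exploration parameters (assumed)
        self.Q = defaultdict(float)                    # Q[(state, action)]
        self.N = defaultdict(int)                      # visit counts N[(state, action)]
        self.prev_state, self.prev_action = None, None

    def f(self, q, n):
        # Exploration function: pretend rarely tried actions are highly rewarding.
        return self.r_plus if n < self.n_visits else q

    def step(self, state, reward):
        """Percept e = (current state j, reward R(i) of the previous state i); returns the next action."""
        if self.prev_state is not None:
            i, a, j = self.prev_state, self.prev_action, state
            self.N[(i, a)] += 1
            best_next = max(self.Q[(j, a2)] for a2 in self.actions)
            self.Q[(i, a)] += self.alpha * (reward + self.gamma * best_next - self.Q[(i, a)])
        # Step 5: pick the action a' maximizing f(Q(j, a'), N(j, a')).
        action = max(self.actions, key=lambda a2: self.f(self.Q[(state, a2)], self.N[(state, a2)]))
        self.prev_state, self.prev_action = state, action
        return action
```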
Choosing an Action… • Step 5: choosing the best action to take in state j (a' is the action chosen using f(Q(j, a'), N(j, a'))) • Suppose all Q values are initially zero, and f(Q(j, a'), N(j, a')) simply picks max_a' Q(j, a') • Suppose after the first exploration: Q(j, A1) = 10, Q(j, A2) = 0, Q(j, A3) = 0, Q(j, A4) = 0 • What will happen? Is this a problem?
Exploration vs Exploitation • Tradeoff: immediate payoff (exploit) vs long-term gain (explore) • Exploring continuously vs getting stuck on a well-known path • Key question: how to balance the two? • One approach: give some weight to actions not tried often, while avoiding actions of low utility
Exploration • Give "weight" to actions not tried very often: f(Q(j, a'), N(j, a')) = argmax_a' G(Q(j, a'), N(j, a')) • G returns: a very high optimistic reward R if N(j, a') < N-VISITS, and Q(j, a') otherwise • What will be the result of such a function G?
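A minimal sketch of such an exploration function (the optimistic reward R_PLUS and the threshold N_VISITS are assumed values):

```python
R_PLUS = 100.0   # assumed optimistic estimate of the best achievable reward
N_VISITS = 5     # assumed number of tries before the Q estimate is trusted

def G(q_value, visit_count):
    """Optimistic exploration value: very high until the action has been tried N_VISITS times."""
    return R_PLUS if visit_count < N_VISITS else q_value

def choose_action(j, actions, Q, N):
    """argmax over a' of G(Q(j, a'), N(j, a'))."""
    return max(actions, key=lambda a: G(Q.get((j, a), 0.0), N.get((j, a), 0)))
```

The effect of such a G is that every action in every state gets tried at least N_VISITS times before the agent settles into exploiting its Q estimates, so it cannot lock onto the first action that happened to pay off.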
Two Frameworks for Multiagent "Learning" • DCOPs: exploration + exploitation (paper to be posted on the web site) [Jain et al., IJCAI'09] • Stochastic games: multiagent learning to reach Nash equilibrium (in our readings)
DCOP Framework • (figure: constraint graph over variables a1, a2, a3) • Assign values to distributed variables • Optimize total reward • No central control
DCOPs for Mobile Sensor Networks (with Lockheed ATL)
New Challenges • Reward matrices unknown • Algorithms must explore the environment • Maximize total cumulative signal strength • Changes how DCOP algorithms are evaluated • Limited time horizon • Cannot explore everything • Horizon-aware DCOPs
DCOP Framework: Reward Matrix Unknown • (figure: constraint graph over variables a1, a2, a3 with unknown reward entries) • Assigning values to variables = exploration • Exploration takes time (physical movement) • Limited time; full exploration is impossible
Three New Algorithms
• Based on MGM (Maximum Gain Message): hill climbing; communicate possible gain to neighbors; the agent with max gain "moves" (figure: gains of 15 and 20 reported among a1, a2, a3)
• Proposed new algorithms (a sketch follows this slide):
• SE-optimistic: unexplored domain values assumed to yield the 'maximum' reward (optimistic, maximal potential gain messaging; exploration maximized: always look for the max value)
• SE-mean: unexplored domain values assumed to yield the 'mean' reward ("realistic": limits exploration, satisfied by the mean)
• BE-backtrack: lookahead given the reward-function distribution (intelligent: decision-theoretic limit on exploration)
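A minimal sketch of how the value-estimation step could differ between SE-optimistic and SE-mean (the reward bound, the mean, and the table of explored rewards are illustrative assumptions, not the implementation from Jain et al.):

```python
MAX_REWARD = 20.0    # assumed known upper bound on a constraint reward
MEAN_REWARD = 10.0   # assumed mean of the reward distribution

def estimate_reward(value, explored_rewards, mode="SE-optimistic"):
    """Score a domain value: observed reward if explored, otherwise an optimistic or mean guess.

    explored_rewards: dict mapping already-tried domain values to their observed reward.
    """
    if value in explored_rewards:
        return explored_rewards[value]
    return MAX_REWARD if mode == "SE-optimistic" else MEAN_REWARD

def mgm_gain(current_value, domain, explored_rewards, mode="SE-optimistic"):
    """Gain the agent would report to its neighbors in an MGM-style round."""
    current = estimate_reward(current_value, explored_rewards, mode)
    best = max(estimate_reward(v, explored_rewards, mode) for v in domain)
    return best - current
```

Under SE-optimistic every unexplored value looks like a potential gain of up to the maximum, so agents keep exploring; under SE-mean, exploration stops once an observed reward beats the mean.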
DCOP Framework: Reward Matrix Unknown • (figure: constraint graph over variables a1, a2, a3) • What if 20 is the maximum reward? • SE-optimistic: how will it work?
Lookahead • Agent decides: ‘explore’ or ‘backtrack’ to explored state • Let Rb be the best reward among explored states • The agent will explore for T units only if • EU(Explore) > EU(backtrack) • Expected Utility of Backtrack: • EU(backtrack) = Rb*T
Lookahead • The expected utility of exploring is calculated as follows: • P(x, n, te) is the first-order statistic: the probability that the maximum reward found in te trials is x • EU(explore) is the sum of three terms: the utility accumulated while exploring, the utility of finding a better reward than the current Rb, and the utility of failing to find a better reward than the current Rb
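A minimal sketch of the explore-vs-backtrack comparison, under the simplifying assumption that exploration rewards are i.i.d. draws from a known discrete distribution, and using a Monte Carlo estimate in place of the paper's closed-form first-order-statistic computation (the example distribution is made up):

```python
import random

def eu_backtrack(r_best, horizon):
    """Expected utility of returning to the best explored state for the remaining T steps."""
    return r_best * horizon

def eu_explore(r_best, horizon, t_explore, reward_dist, samples=10_000):
    """Monte Carlo estimate: explore for t_explore steps, collect those rewards,
    then exploit the better of (best reward found, r_best) for the rest of the horizon."""
    values, weights = zip(*reward_dist.items())
    total = 0.0
    for _ in range(samples):
        draws = random.choices(values, weights=weights, k=t_explore)
        total += sum(draws) + max(max(draws), r_best) * (horizon - t_explore)
    return total / samples

# Explore only if it beats backtracking, e.g.:
dist = {5: 0.5, 10: 0.3, 20: 0.2}   # hypothetical signal-strength distribution
print(eu_explore(r_best=10, horizon=30, t_explore=5, reward_dist=dist) > eu_backtrack(10, 30))
```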
Sample Results (Jain et al., IJCAI'09) • Decision-theoretic approach to exploration • Interleaved with DCOP algorithms
Towards Multiagent Learning: Stochastic Games • Generalize distributed POMDPs • Different payoffs for each player, not a common payoff • Focus on two-person stochastic games • Learning algorithms for stochastic games
Stochastic 2-player Game • States: S • Action sets for each player: A1, A2 • Transition probabilities: P(s' | s, a1, a2) • Rewards: two separate reward functions, R1(s, a1, a2) and R2(s, a1, a2), each depending on the actions of all agents • If R1(s, a1, a2) + R2(s, a1, a2) = 0, the game is zero-sum • The state is observable (MDP-like) • Each player maximizes its own (discounted) sum of rewards
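A minimal data-structure sketch of this definition in Python (the class and field names are illustrative, not from the readings):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State, Action = str, str

@dataclass
class TwoPlayerStochasticGame:
    states: List[State]
    actions1: List[Action]                                      # A1
    actions2: List[Action]                                      # A2
    P: Dict[Tuple[State, Action, Action], Dict[State, float]]   # P(s' | s, a1, a2)
    R1: Dict[Tuple[State, Action, Action], float]               # reward to player 1
    R2: Dict[Tuple[State, Action, Action], float]               # reward to player 2

    def is_zero_sum(self) -> bool:
        # Zero-sum iff R1(s, a1, a2) + R2(s, a1, a2) = 0 for every state and joint action.
        return all(abs(self.R1[k] + self.R2[k]) < 1e-9 for k in self.R1)
```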
Stochastic Game • (figure: transition diagram from state s0 to states s1 and s2 with probabilities P(s1|s0,a1,a2) and P(s2|s0,a1,a2); each state s has per-player rewards R1(s) and R2(s)) • The reward function depends on the state!
Stochastic Game • How are repeated games related to stochastic games?
Stochastic Game • Strategies = policies • Rewards differ for each agent, so expected values differ as well • v1(s, π1, π2) gives the expected value for agent 1 in state s, given that the agents follow policies π1 and π2 • Nash equilibrium in a stochastic game: a pair of strategies (π1*, π2*) such that for all states s, v1(s, π1*, π2*) ≥ v1(s, π1, π2*) for all π1, and v2(s, π1*, π2*) ≥ v2(s, π1*, π2) for all π2
Nash Equilibrium Policies • In stochastic games, we focus on policies that attain a Nash equilibrium • If we do not find a Nash equilibrium, players may have an incentive to deviate • The search for stability is critical • Policies may be randomized; they need not be deterministic
Example Stochastic Game • The goalie can move or stay • The shooter can move or shoot • Zero-sum game: a goal is worth 10 points to the shooter • A block is worth 5 points to the goalie
Q-learning in Stochastic Games • Nash-Q algorithm: • Q1(s, a1, a2): Q-value of agent 1 for state s • Q2(s, a1, a2): Q-value of agent 2 for state s • Optimal Q values: Q1*(s, a1, a2) = R1(s, a1, a2) + λ Σ_s' P(s'|s, a1, a2) V1(s', π1*, π2*) and Q2*(s, a1, a2) = R2(s, a1, a2) + λ Σ_s' P(s'|s, a1, a2) V2(s', π1*, π2*)
Algorithm • Consider two agents: • Each agent maintains m Q-tables, where m = number of states • For each state, the Q-table has |A1| × |A2| entries: |A1| for my actions, |A2| for the other agent's actions • Each agent keeps Q-tables both for itself and for the other agent
Key Observation • In state s', the bimatrix representation (Q1[s'], Q2[s']) defines a game • We can find a mixed-strategy Nash equilibrium for this game • A mixed-strategy Nash equilibrium provides a probability distribution over which action to execute
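Finding a mixed-strategy Nash equilibrium of a general bimatrix game needs an algorithm such as Lemke-Howson, but for the 2x2 zero-sum case (the shooter/goalie flavour of game) there is a simple closed form. A sketch, assuming the game has no pure-strategy saddle point; the example payoffs are made up:

```python
def zero_sum_2x2_mixed_ne(M):
    """Mixed-strategy Nash equilibrium of a 2x2 zero-sum game.

    M[i][j] is the row player's payoff (the column player receives -M[i][j]).
    Assumes no pure-strategy saddle point, so the denominator is nonzero and
    the resulting probabilities lie in [0, 1].
    """
    (a, b), (c, d) = M
    denom = a - b - c + d
    p = (d - c) / denom               # probability the row player plays row 0
    q = (d - b) / denom               # probability the column player plays column 0
    value = (a * d - b * c) / denom   # expected payoff to the row player
    return (p, 1 - p), (q, 1 - q), value

# Rows: shooter {shoot-left, shoot-right}; columns: goalie {dive-left, dive-right}.
print(zero_sum_2x2_mixed_ne([[0.0, 10.0], [10.0, 5.0]]))
```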
Multiagent Q-Learning • Initialize the Q-tables • Loop: • Choose action a1 based on π1(s), a mixed-strategy Nash equilibrium of the game defined by (Q1(s), Q2(s)) • Observe r1, r2, a2, s' • Update Q1(s) and Q2(s) using the equations below: Q1(s, a1, a2) ← Q1(s, a1, a2) + α (r1 + λ Z1 − Q1(s, a1, a2)), where Z1 = expected reward to agent 1 under the Nash equilibrium in state s', for the game (Q1(s'), Q2(s'))
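A minimal sketch of one Nash-Q backup for agent 1 (the learning rate, the table layout, and the stage_game_ne helper, which returns mixed strategies for the bimatrix game (Q1[s'], Q2[s']), are assumptions; agent 2's update is symmetric):

```python
def nash_q_update(Q1, Q2, s, a1, a2, r1, s_next, A1, A2,
                  stage_game_ne, alpha=0.1, lam=0.9):
    """One Nash-Q backup for agent 1's table Q1.

    Q1, Q2: dicts mapping (state, a1, a2) -> value.
    stage_game_ne: hypothetical solver returning mixed strategies (pi1, pi2),
    indexed like A1 and A2, for the stage game defined by Q1[s_next], Q2[s_next].
    """
    pi1, pi2 = stage_game_ne(Q1, Q2, s_next, A1, A2)
    # Z1: expected value to agent 1 when both agents play the stage-game equilibrium in s_next.
    Z1 = sum(pi1[i] * pi2[j] * Q1[(s_next, b1, b2)]
             for i, b1 in enumerate(A1)
             for j, b2 in enumerate(A2))
    Q1[(s, a1, a2)] += alpha * (r1 + lam * Z1 - Q1[(s, a1, a2)])
```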
What do we end up with? Agents converge to the Nash equilibrium.
Towards Multiagent Learning • Learning as a "single agent" in a multiagent setting • Ignore other agents except for some property like location • Ignore that other agents act intentionally and adapt • Advantages: simpler; converges more easily
Single Agent in Multiagent Setting • RoboCup Soccer Simulation League • Players use model-free reinforcement learning to intercept the ball • Learn online during the game
Finding #1: Online Learning Specialized by Opponent • Same player position against two different RoboCup teams: • Player 1 (forward) against CMUnited and Andhill • Against CMUnited, player turns more aggressively
Finding #2: Online Learning Specialized by Role Same team against different players Player 1 (forward) and Player 10 (fullback) against CMUnited
Lessons Learned • Surprise in tests against opponent teams: • Significant specialization of intercept with both role & opponent • Lesson: Transfer of experience or cross-training may be detrimental