Reinforcement Learning Dynamics in Social Dilemmas Luis R. Izquierdo & Segis Izquierdo
Outline of the presentation • Motivation • BM reinforcement model • Macy and Flache’s SRE and SCE • Self-Reinforcing Equilibrium (SRE): Challenges • Self-Correcting Equilibrium (SCE): Challenges • In-depth analysis of the dynamics of the model • Analysis of the robustness of the model • Conclusions
BM reinforcement model • Reinforcement learners tend to repeat actions that led to satisfactory outcomes in the past, and avoid choices that resulted in unsatisfactory experiences. • The propensity or probability to play an action is increased (decreased) if it leads to a satisfactory (unsatisfactory) outcome.
BM reinforcement model The Prisoner’s Dilemma. pC = probability to cooperate; pD = probability to defect. Aspiration threshold A = 2; learning rate l = 0.5.
BM reinforcement model [Diagram: the player’s own choice (C / D, drawn with probability pa) and the partner’s choice (C / D) determine the outcome [CC, DD, CD, DC] and hence the payoff [R, P, S, T]; comparing the payoff with the aspiration threshold yields the stimulus, with –1 <= Stimulus <= 1, which in turn updates pa. Payoff ladder with A = 2: T = 4 (D+C–), R = 3 (C+C+), A = 2, P = 1 (D–D–), S = 0 (C–D+), where ‘+’ marks a satisfactory payoff for the action played and ‘–’ an unsatisfactory one.]
BM reinforcement model With A = 2, the four outcomes give the following stimuli and probability updates: st(D / T) = 1 → pD ↑↑ (pC ↓↓); st(C / R) = 0.5 → pC ↑; st(D / P) = –0.5 → pD ↓ (pC ↑); st(C / S) = –1 → pC ↓↓.
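These values are consistent with the standard Bush–Mosteller normalisation of the stimulus (our reconstruction; the slide only shows the resulting numbers):

```latex
s_a \;=\; \frac{\pi_a - A}{\max_{\pi \in \{T,R,P,S\}} |\pi - A|}
\qquad\Longrightarrow\qquad
s_T = \tfrac{4-2}{2} = 1,\quad
s_R = \tfrac{3-2}{2} = 0.5,\quad
s_P = \tfrac{1-2}{2} = -0.5,\quad
s_S = \tfrac{0-2}{2} = -1 .
```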
BM reinforcement model Own choice (C / D, with probability pa) and the partner’s choice (C / D) determine the outcome [CC, DD, CD, DC] and the payoff [R, P, S, T]; comparing the payoff with the aspiration threshold gives the stimulus. If the outcome is satisfactory, pa moves towards 1 a proportion (l·st) of the remaining distance. If unsatisfactory, pa moves towards 0 a proportion (l·|st|) of the remaining distance.
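As a concrete illustration of the update just described, here is a minimal Python sketch (names and structure are ours, not from the original slides; it assumes the standard Bush–Mosteller stimulus normalisation shown above):

```python
def stimulus(payoff, aspiration=2.0, payoffs=(4.0, 3.0, 1.0, 0.0)):
    """Normalised stimulus in [-1, 1]: positive if the payoff exceeds
    the aspiration threshold, negative otherwise."""
    max_dev = max(abs(x - aspiration) for x in payoffs)
    return (payoff - aspiration) / max_dev


def bm_update(p_action, payoff, l=0.5, aspiration=2.0):
    """Update the probability of the action just played.

    Satisfactory outcome (st >= 0): move towards 1 a proportion l*st of
    the remaining distance.  Unsatisfactory outcome (st < 0): move
    towards 0 a proportion l*|st| of the distance to 0."""
    st = stimulus(payoff, aspiration)
    if st >= 0:
        return p_action + l * st * (1.0 - p_action)
    return p_action + l * st * p_action  # st < 0, so p_action decreases


# Example: after a CC outcome (payoff R = 3) with pC = 0.5 and l = 0.5,
# pC becomes 0.5 + 0.5 * 0.5 * (1 - 0.5) = 0.625.
```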
BM reinforcement model [Figure: branching evolution of pC (vertical axis, from 0 to 1) over iterations n = 0, 1, 2, … with learning rate l = 0.5. Upward branches are labelled C+ \ D– and downward branches C– \ D+, with the corresponding stimuli st = 1, 0.5, –0.5, –1. Payoff ladder: T = 4, R = 3, A = 2, P = 1, S = 0.]
BM reinforcement model [Figure: most likely move in each region of the (pC1, pC2) state space for the Prisoner’s Dilemma (T = 4, R = 3, A = 2, P = 1, S = 0). The four regions correspond to the most likely outcomes: Player 1 = D (T), Player 2 = C (S); Player 1 = C (R), Player 2 = C (R); Player 1 = D (P), Player 2 = D (P); Player 1 = C (S), Player 2 = D (T).]
BM reinforcement model [Figure: expected motion of the system in the (pC1, pC2) state space for the Prisoner’s Dilemma with T = 4, R = 3, P = 1, S = 0 and A = 2.]
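The expected-motion field plotted in figures like this one can be reproduced by averaging the one-step update over the four possible outcomes. A self-contained sketch (our code, under the same Bush–Mosteller assumptions as above):

```python
def bm_new_pc(pc, played_c, payoff, A=2.0, l=0.5, max_dev=2.0):
    """Cooperation probability after one update, given the action just
    played (C if played_c else D) and the payoff it received."""
    st = (payoff - A) / max_dev                  # stimulus in [-1, 1]
    p = pc if played_c else 1.0 - pc             # probability of the action played
    p += l * st * ((1.0 - p) if st >= 0 else p)  # move towards 1 or towards 0
    return p if played_c else 1.0 - p


def expected_motion(pc1, pc2, payoffs=(4.0, 3.0, 1.0, 0.0), A=2.0, l=0.5):
    """Expected one-step change (E[dpC1], E[dpC2]) at state (pc1, pc2)."""
    T, R, P, S = payoffs
    max_dev = max(abs(x - A) for x in payoffs)
    # (player 1 cooperates?, player 2 cooperates?, payoff to 1, payoff to 2)
    outcomes = [(True,  True,  R, R),   # CC
                (True,  False, S, T),   # CD
                (False, True,  T, S),   # DC
                (False, False, P, P)]   # DD
    d1 = d2 = 0.0
    for c1, c2, u1, u2 in outcomes:
        prob = (pc1 if c1 else 1 - pc1) * (pc2 if c2 else 1 - pc2)
        d1 += prob * (bm_new_pc(pc1, c1, u1, A, l, max_dev) - pc1)
        d2 += prob * (bm_new_pc(pc2, c2, u2, A, l, max_dev) - pc2)
    return d1, d2
```

Evaluating expected_motion on a grid of (pC1, pC2) values gives the arrow field shown in figures like this one.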
MACY M W and Flache A (2002) Learning Dynamics in Social Dilemmas. Proc. Natl. Acad. Sci. USA 99(Suppl. 3), 7229-7236. Two cases are considered: fixed aspirations with P < A < R, and floating aspirations.
MACY M W and Flache A (2002) Learning Dynamics in Social Dilemmas. Proc. Natl. Acad. Sci. USA 99(Suppl. 3), 7229-7236. “We identify a dynamic solution concept, stochastic collusion, based on a random walk from a self-correcting equilibrium (SCE) to a self-reinforcing equilibrium (SRE). These concepts make much more precise predictions about the possible outcomes for repeated games.” “Rewards produce a SRE in which the equilibrium strategy is reinforced by the payoff, even if an alternative strategy has higher utility.” e.g.: PC1 = 1, PC2 = 1 with Ai < Ri. Action/stimulus: C+C+; C+C+; C+C+; … PC1, PC2: 1, 1; 1, 1; 1, 1; …
BM reinforcement model [Figure: expected motion in the Prisoner’s Dilemma (T = 4, R = 3, P = 1, S = 0; A = 2), with the SRE marked.]
MACY M W and Flache A (2002) Learning Dynamics in Social Dilemmas. Proc. Natl. Acad. Sci. USA 99(Suppl. 3), 7229-7236. “A mix of rewards and punishments can produce a SCE in which outcomes that punish cooperation or reward defection (causing the probability of cooperation to decrease) balance outcomes that reward cooperation or punish defection (causing the probability of cooperation to increase).” E(∆PC) = 0. Example: an upward move of +0.2 with probability 0.6 balances a downward move of –0.3 with probability 0.4, so E(∆PC) = 0.6 × 0.2 – 0.4 × 0.3 = 0. “The SRE is a black hole from which escape is impossible. In contrast, players are never permanently trapped in SCE”
BM reinforcement model [Figure: expected motion in the Prisoner’s Dilemma (T = 4, R = 3, P = 1, S = 0; A = 2), with the SRE and the SCE marked.]
SRE: Challenges “The SRE is a black hole from which escape is impossible” “A chance sequence of fortuitous moves can lead both players into a self-reinforcing stochastic collusion. The fewer the number of coordinated moves needed to lock-in SRE, the better the chances.” However… The cooperative SRE implies PC = 1, but according to the BM model, PC = 1 can be approached, but not reached in finite time! How can you “lock-in” if the SRE cannot be reached? Floating-point errors? Can we give a precise definition of SREs?
SCE: Challenges “The SCE obtains when the expected change of probabilities is zero and there is a positive probability of punishment as well as reward” E(∆ PC) = 0 But … Such SCEs are not always attractors of the actual dynamics.
SCE: Challenges Such SCEs are not always attractors of the actual dynamics. [Figure: a state with E(∆PC) = 0, labelled SCE.]
SCE: Challenges “The SCE obtains when the expected change of probabilities is zero and there is a positive probability of punishment as well as reward” E(∆ PC) = 0 But … Such SCEs are not always attractors of the actual dynamics! Apart from describing regularities in simulated processes, can we provide a mathematical basis for the attractiveness of SCEs?
Outline of the presentation • Motivation • In-depth analysis of the dynamics of the model • Formalisation of SRE and SCE • Different regimes in the dynamics of the system • Dynamics with HIGH learning rates (i.e. fast adaptation) • Dynamics with LOW learning rates (i.e. slow adaptation) • Validity of the expected motion approximation • Analysis of the robustness of the model • Conclusions
A definition of SRE SRE: an absorbing state of the system. [Figure: space of states (PC2, PC1) ∈ [0, 1]², with corners CC, CD, DC, DD and outcome labels C+C+, C–D+, D+C–, D+D+; the corner DD is absorbing if Si < Ai < Pi.] An SRE is not an event, nor an infinite chain of events. SREs cannot be reached in finite time, but the probability distribution of the states concentrates around them.
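In symbols (our reading of the definition above): an SRE is a state in which both players play one action with probability one and the realised outcome is satisfactory for both, so the outcome keeps reinforcing itself:

```latex
(p_{C_1}, p_{C_2}) \in \{0,1\}^2
\quad\text{with}\quad
\pi_i(a_1^{*}, a_2^{*}) > A_i \;\;\text{for } i = 1, 2,
```

where a_i* is the action player i takes with probability one. For example, (0, 0) (mutual defection) is absorbing when Si < Ai < Pi, and (1, 1) (mutual cooperation) when Ai < Ri.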
A definition of SCE SCE of a system S: an asymptotically stable critical point of the continuous time limit approximation of its expected motion. [Figure: phase plane with the SRE and the SCE marked.]
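One way to write this (our notation): the mean-field system whose asymptotically stable rest points are the SCEs is

```latex
\dot{p} \;=\; \mathbb{E}\!\left[\,\Delta p_n \,\middle|\, p_n = p\,\right],
\qquad p = (p_{C_1}, p_{C_2}),
```

and an SCE is a rest point of this system that is asymptotically stable; rescaling time by the learning rate only changes the time scale, not the rest points or their stability. The expected_motion sketch above computes the right-hand side.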
SRE & SCE [Figure: expected movement of the system in a Stag Hunt game parameterised as [3, 4, 1, 0 | 0.5 | 0.5]². The numbered balls show the state of the system after the indicated number of iterations in a sample run. One rest point with E(∆PC) = 0 is labelled “NOT AN SCE”; the point that is both SRE & SCE is also marked.]
Three different dynamic regimes “By the ultralong run, we mean a period of time long enough for the asymptotic distribution to be a good description of the behavior of the system. The long run refers to the time span needed for the system to reach the vicinity of the first equilibrium in whose neighborhood it will linger for some time. We speak of the medium run as the time intermediate between the short run [i.e. initial conditions] and the long run, during which the adjustment to equilibrium is occurring.” (Binmore, Samuelson and Vaughan 1995, p. 10)
[Figure: trajectories in the phase plane of the differential equation corresponding to the Prisoner’s Dilemma game parameterised as [4, 3, 1, 0 | 2 | l]², together with a sample simulation run (l = 2⁻²). This system has an SCE at [0.37, 0.37]. The short run (initial conditions), medium run, long run and ultralong run are marked along the sample path.]
The ultralong run (SRE) Most BM systems – in particular all the systems studied by Macy and Flache (2002) with fixed aspirations – converge to an SRE in the ultralong run if there exists at least one SRE.
High learning rates (fast adaptation) Learning rate: l = 0.5. [Figure: sample runs, with the SRE marked.] The ultralong run (i.e. convergence to an SRE) is quickly reached. The other dynamic regimes are not clearly observed.
Low learning rates (slow adaptation) Learning rate: l = 0.25. The three dynamic regimes are clearly observed: medium run -> trajectories; long run -> SCE; ultralong run -> SRE. [Figure: sample run with the short run (initial conditions), medium run, long run and ultralong run marked.]
The medium run (trajectories) For sufficiently small learning rates and number of iterations n not too large (n·l bounded), the medium run dynamics of the system are best characterised by the trajectories in the phase plane. [Figure panels: l = 0.3, n = 10; l = 0.03, n = 100; l = 0.003, n = 1000.]
The long run (SCEs) When trajectories finish in an SCE, the system will approach the SCE and spend a significant amount of time in its neighbourhood if learning rates are low enough and the number of iterations n is large enough (and finite). This regime is the long run.
The ultralong run (SREs) Most BM systems – in particular all the systems studied by Macy and Flache (2002) with fixed aspirations – converge to an SRE in the ultralong run if there exists at least one SRE.
[Figure: learning rate (from l = 2⁻¹ down to l = 2⁻⁷) versus iteration (time).]
The validity of mean field approximations The asymptotic (i.e. ultralong) behaviour of the BM model cannot be approximated using the continuous time limit version of its expected motion. Such an approximation can be valid over bounded time intervals but deteriorates as the time horizon increases.
Outline of the presentation • Motivation • In-depth analysis of the dynamics of the model • Analysis of the robustness of the model • Model with occasional mistakes (trembling hands) • Renewed importance of SCEs • Discrimination among different SREs • Conclusions
Model with trembling hands [Figure panels: original model (l = 0.25, noise = 0) vs. model with trembling hands (l = 0.25, noise = 0.01).]
Model with trembling hands (no SREs) SREUP: SRE of the unperturbed process. The lower the noise, the higher the concentration around SREUPs.
Model with trembling hands Importantly, not all the SREs of the unperturbed process are equally robust to noise. One representative run of the system parameterised as [4, 3, 1, 0 | 0.5 | 0.5] with initial state [0.9, 0.9], and noise εi = ε = 0.1. Without noise: Prob(SRE[1,1]) ≈ 0.7, Prob(SRE[0,0]) ≈ 0.3.
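A trembling-hands run like the one described above can be sketched by flipping each intended action with probability ε before payoffs are assigned (our code; parameter names are illustrative, and the update is the same Bush–Mosteller rule sketched earlier):

```python
import random


def bm_new_pc(pc, played_c, payoff, A=0.5, l=0.5, max_dev=3.5):
    """Bush-Mosteller update of the cooperation probability (as sketched earlier)."""
    st = (payoff - A) / max_dev
    p = pc if played_c else 1.0 - pc
    p += l * st * ((1.0 - p) if st >= 0 else p)
    return p if played_c else 1.0 - p


def trembling_action(pc, noise):
    """Intended action drawn with probability pc, flipped with probability noise."""
    cooperate = random.random() < pc
    return (not cooperate) if random.random() < noise else cooperate


def run(n_iter=10_000, pc1=0.9, pc2=0.9, noise=0.1,
        payoffs=(4.0, 3.0, 1.0, 0.0), A=0.5, l=0.5, seed=0):
    """One run of the two-player BM model with trembling hands,
    roughly matching the parameterisation [4, 3, 1, 0 | 0.5 | 0.5] above."""
    random.seed(seed)
    T, R, P, S = payoffs
    max_dev = max(abs(x - A) for x in payoffs)
    for _ in range(n_iter):
        c1, c2 = trembling_action(pc1, noise), trembling_action(pc2, noise)
        u1, u2 = (R, R) if c1 and c2 else (S, T) if c1 else (T, S) if c2 else (P, P)
        pc1 = bm_new_pc(pc1, c1, u1, A, l, max_dev)
        pc2 = bm_new_pc(pc2, c2, u2, A, l, max_dev)
    return pc1, pc2
```

Comparing many such runs with noise = 0 and noise > 0 is one way to check which SREs of the unperturbed process remain attractive once trembles are added.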
Model with trembling hands Not all the SREs of the unperturbed process are equally robust to noise.
Outline of the presentation • Motivation • In-depth analysis of the dynamics of the model • Analysis of the robustness of the model • Conclusions
Conclusions • Formalisation of SRE and SCE • In-depth analysis of the dynamics of the model • Strongly dependent on speed of learning • Beware of mean field approximations • Analysis of the robustness of the model • Results change dramatically when small quantities of noise are added • Not all the SREs of the unperturbed process are equally robust to noise
Reinforcement Learning Dynamics in Social Dilemmas Luis R. Izquierdo & Segis Izquierdo