W-Learning: Competition Among Selfish Q-Learners. Presented by Alp Sardağ
Autonomous Mobile Robots: • Behaviour-Based AI: emphasizes intelligence as emerging from ongoing interaction with the world. • Subsumption Architecture: by Brooks.
Ideas of the Subsumption Architecture • Default behaviour: the 'Avoid All Things' layer (layer 1) takes control of the robot whenever the 'Look For Food' layer (layer 2) is idle. • Multiple parallel goals: which one should be given control?
The Action Selection Problem • Brooks gives the modules full sensing and acting powers, but action selection is the job of the programmer. • W-learning modules compete for control; in this kind of robot, action selection is not designed but learnt.
Competition Among Selfish Agents • Make Layers peers • Layers Compete for control
Definition and Terms • The collection of agents A1,...,An are: • Selfish agents • No cooperation • No knowledge of each other • Each agent Ai suggests an action ai(x), where x is the world state. • The robot chooses one of these actions ak(x) and executes it.
How the Robot Works • Some way of resolving the competition is needed. • The idea: an agent always has an action to suggest, but it will care more about being obeyed in some states than in others. Example: 'avoid the predator' vs. 'wander around looking for food'.
How to resolve the competition • Each agent suggests some action ai(x) with weight Wi(x); the robot executes the action ak(x) where Wk(x) = max_i Wi(x), i ∈ {1,2,...,n}. • Ak is the leader of the competition for state x.
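A minimal sketch of this winner-take-all rule in Python (the method names suggest_action and w_value are illustrative, not from the paper):

    def select_action(agents, x):
        # Winner-take-all: execute the action of the agent with the highest W_i(x).
        k = max(range(len(agents)), key=lambda i: agents[i].w_value(x))
        return k, agents[k].suggest_action(x)  # A_k is the leader for state x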
Example • No agent is explicitly aware of the existence of any other. • An agent can still 'use' another agent by ceding control to it.
W-Values as Action-Selection • As opposed to agents that share information and make a compromise action, this is a winner-take-all action-selection scheme. • The division of control is state-based rather than time-based. • Blumberg points out that animals sometimes appear to engage in a form of time-sharing. • The same effect can be achieved by a suitable state representation x.
Example • Let x = (e,i) be the state. • e: information from external sensors. • i = (f,c): information from internal sensors. • f: very hungry (2), hungry (1), not hungry (0) • c: very dirty (2), dirty (1), clean (0) • The weights may be ordered: Wf((e,(2,c))) > Wf((e,(1,c))) > Wc((e,(f,2))) > Wf((e,(0,c))) > Wc((e,(f,1))) > Wc((e,(f,0)))
Engaging in Opportunistic Behaviour • Hungry and thirsty animal example: food is only found in the north, water in the south. The animal treks north and eats, but as soon as its hunger is only partially satisfied, thirst is the higher drive. Even before it gets south, it will be starving again. • 1st solution: time-based agents, which keep control for some minimum amount of time. • 2nd solution: the agents can tell the difference between immediate and distant likely payoff, and present their W-values accordingly. • Assigning W-values to actions: • Previous work: treated as a design problem. • Here: using learning methods that automatically assign values to actions.
Reinforcement Learning • By trial-and-error, the agent learns to take the actions which maximise its rewards.
Q-learning • [Figure: (a) a simple stochastic environment with probabilistic transitions; (b) M_ij is provided in PL, M^a_ij is provided in AL.] • NOTE: Transitions are probabilistic. P_xa(y) is the probability that doing a in x will lead to state y.
Q-learning • The agent is interested not in immediate rewards, but in the total discounted reward: R = r_t + γ·r_{t+1} + γ²·r_{t+2} + ..., where 0 ≤ γ < 1. • The expected total discounted reward: V(x_t) = E(R) = E(r_t) + γE(r_{t+1}) + γ²E(r_{t+2}) + ... = E(r_t) + γ[E(r_{t+1}) + γE(r_{t+2}) + ...] = E(r_t) + γV(x_{t+1}) = Σ_r r·P_xa(r) + γ·Σ_y V(y)·P_xa(y)
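As a small worked check of the first formula, the discounted return of a finite reward sequence can be computed directly (the reward list and γ = 0.9 are made-up values):

    def discounted_return(rewards, gamma=0.9):
        # R = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...,  0 <= gamma < 1
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    print(discounted_return([1, 1, 1]))   # 1 + 0.9 + 0.81 = 2.71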
Q-learning • In the learning phase the agent tries to build up Q-values for each pair (x,a). • Temporal-difference learning: Q(x,a) ← Q(x,a) + α(r + γ·max_b Q(y,b) - Q(x,a)), where α is the learning rate and γ the discount rate. • Convergence of Q-learning: Q(x,a) ← (1-α)Q(x,a) + α(r + γ·max_b Q(y,b)), where α takes decreasing successive values α_1, α_2, ... Let n(x,a) = 1, 2, ... be the number of times (x,a) has been visited; then α(x,a) = 1/n(x,a) = 1, 1/2, 1/3, ...
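A minimal tabular sketch of this update with the α = 1/n(x,a) schedule (state and action encodings are placeholders):

    from collections import defaultdict

    GAMMA = 0.9               # discount rate (assumed value)
    Q = defaultdict(float)    # Q-values keyed by (state, action)
    n = defaultdict(int)      # visit counts n(x, a)

    def update_q(x, a, r, y, actions):
        # Temporal-difference update with the decreasing rate alpha = 1/n(x,a).
        n[(x, a)] += 1
        alpha = 1.0 / n[(x, a)]
        best_next = max(Q[(y, b)] for b in actions)
        Q[(x, a)] += alpha * (r + GAMMA * best_next - Q[(x, a)])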
Q-learning • The optimal policy: π*(x) = a*(x), where Q(x, a*(x)) = max_a Q(x,a). • Exploration problem: the agent must keep trying actions that currently look suboptimal. • A new approach to the exploration problem: U(i) ← R(i) + max_a F(Σ_j M^a_ij U(j), N(a,i)), where F(u,n) = R+ if n < N_e, and u otherwise.
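The exploration function F is simple enough to write out directly (the values of R+ and N_e below are assumptions):

    R_PLUS = 10.0   # optimistic estimate of the best reachable reward (assumed)
    N_E = 5         # try each (state, action) pair at least this many times (assumed)

    def explore_f(u, n):
        # F(u, n): stay optimistic about pairs tried fewer than N_e times, else use u.
        return R_PLUS if n < N_E else u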
Multi-Module RL • Most work in RL has focused on single agents. In theory, any problem can be seen as just another I/O mapping to be learnt by a single agent, but the scalability problem leads to combining simple agents to solve a complex task. Some approaches: • Top-down: identify the task and decompose it into subtasks. Moore does this by hand; Tham learns the decomposition, where subtasks combine sequentially to solve the main task. • Bottom-up: study the behaviour that emerges when multiple RL agents are combined in different ways. Tan studies the benefits of cooperation among agents, as in ants.
Selfish Q-learners • Each agent is a Q-learning agent with its own reward function and Q-values. • Co-operation is involuntary and emerges from competition among agents. • Let the agents be A1,…,An. The robot works as follows: observe x; for all agents, get the suggested action ai with strength Wi(x); find Wk(x) = max_i Wi(x); execute ak; observe y; for all agents, get reward ri and update Q and/or W.
W-values • For updating W, use the numerical Q-values. • Static W-values: the agent promotes its action with the same strength no matter what its competition is doing: W(x) = Q(x,a) • W = importance: W could be the difference between the suggested action and the worst possible action: W(x) = Q(x,a) - min_b Q(x,b)
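Both static schemes can be read straight off an agent's own Q-table; a short sketch (Q is assumed to be a dict keyed by (state, action), actions the agent's action set):

    def w_static(Q, x, a):
        # W(x) = Q(x, a): promote the suggested action with its raw Q-value.
        return Q[(x, a)]

    def w_importance(Q, x, a, actions):
        # W(x) = Q(x, a) - min_b Q(x, b): how much is lost if the worst action ran instead.
        return Q[(x, a)] - min(Q[(x, b)] for b in actions)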
Dynamic (learnt) W-values • The previous W-values fail to take into account what the other agents are doing. • Examples: • The suggested actions may be the same. • The other agent might be suggesting an action which would be disastrous for Ai. • Two kinds of state Ai need not compete for: • A state which is relatively unimportant to it. • A state which is important, but where some agent Ak is suggesting an action which is good for Ai.
Meaning of W-values • W = (P - A): the difference between P (what is predicted if we are obeyed) and A (what actually happened). • An agent does not need explicit knowledge of who it is competing with. It becomes aware of the others when they stop its action from being obeyed, and it observes the y and r caused as a result. • The agents set their own W-values incrementally, using their Q-values.
W-learning • Q-learning process: P := (1-α_Q)P + α_Q·A • W-learning process: W := (1-α_W)W + α_W·(P-A) • For updating Q-values: Qi(x,ak) := (1-α_Q)Qi(x,ak) + α_Q(ri + γ·max_b Qi(y,b)) • For updating W-values: Wi(x) := (1-α_W)Wi(x) + α_W(Qi(x,ai) - (ri + γ·max_b Qi(y,b))) NOTE: only the W-values of agents that were not obeyed are updated.
W-learning pseudo-code

    State x := observe();
    For ( all i )
      a[i] := A[i].suggestAction(x);
    Find k such that W[k](x) = max over i of W[i](x);
    Execute ( a[k] );
    State y := observe();
    For ( all i ) {
      r[i] := A[i].reward(x,y);
      A[i].updateQ ( x , a[k] , y , r[i] );
      if ( i != k )
        A[i].updateW ( x , a[k] , y , r[i] );
    }
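A runnable Python rendering of this loop under the update rules above; the environment interface (env.execute), the constant learning rates and the greedy suggest() are assumptions for the sketch, not part of the paper:

    from collections import defaultdict

    GAMMA, ALPHA_Q, ALPHA_W = 0.9, 0.2, 0.2   # assumed constant learning rates

    class SelfishAgent:
        # One selfish Q-learner with its own reward function, Q-values and W-values.
        def __init__(self, reward_fn, actions):
            self.reward_fn = reward_fn
            self.actions = actions
            self.Q = defaultdict(float)
            self.W = defaultdict(float)

        def suggest(self, x):
            # a_i(x): the agent's own greedy action for state x.
            return max(self.actions, key=lambda a: self.Q[(x, a)])

        def update_q(self, x, a_k, y, r):
            # Q_i(x,a_k) := (1-a_Q) Q_i(x,a_k) + a_Q (r_i + gamma max_b Q_i(y,b))
            best = max(self.Q[(y, b)] for b in self.actions)
            self.Q[(x, a_k)] += ALPHA_Q * (r + GAMMA * best - self.Q[(x, a_k)])

        def update_w(self, x, a_i, y, r):
            # W_i(x) := (1-a_W) W_i(x) + a_W (Q_i(x,a_i) - (r_i + gamma max_b Q_i(y,b)))
            predicted = self.Q[(x, a_i)]
            actual = r + GAMMA * max(self.Q[(y, b)] for b in self.actions)
            self.W[x] += ALPHA_W * ((predicted - actual) - self.W[x])

    def robot_step(agents, env, x):
        suggestions = [ag.suggest(x) for ag in agents]
        k = max(range(len(agents)), key=lambda i: agents[i].W[x])   # leader for x
        y = env.execute(x, suggestions[k])        # assumed environment interface
        for i, ag in enumerate(agents):
            r = ag.reward_fn(x, y)
            if i != k:                            # only agents that were not obeyed update W
                ag.update_w(x, suggestions[i], y, r)
            ag.update_q(x, suggestions[k], y, r)
        return y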
Learning Q (somewhat) Before Learning W • Ideally: learn Q first, then W. • But it is impossible to learn Q completely in finite time. • Alternative: learn W while Q is still being learnt: Wi(x) := (1-α_W)Wi(x) + α_W(1-α_Q)^T (Qi(x,ai) - (ri + γ·max_b Qi(y,b))), where T > 0 is the delaying rate.
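The delayed rule differs from the sketch above only by the extra (1-α_Q)^T factor; reusing the same names (all values assumed):

    GAMMA, ALPHA_Q, ALPHA_W = 0.9, 0.2, 0.2   # as in the earlier sketch (assumed values)
    T = 2.0                                    # delaying rate (assumed value)

    def update_w_delayed(agent, x, a_i, y, r):
        # W_i(x) := (1-a_W) W_i(x) + a_W (1-a_Q)^T (Q_i(x,a_i) - (r_i + gamma max_b Q_i(y,b)))
        predicted = agent.Q[(x, a_i)]
        actual = r + GAMMA * max(agent.Q[(y, b)] for b in agent.actions)
        agent.W[x] = (1 - ALPHA_W) * agent.W[x] \
                     + ALPHA_W * (1 - ALPHA_Q) ** T * (predicted - actual)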
After Q has been learnt • Imagine a dynamically changing collection, with agents being continually created and destroyed over time, and the surviving agents adjusting their W-values as the nature of their competition changes. Q is learnt once, whereas W is relearnt again and again. • Compare Edelman's biological theory of Neural Darwinism.
Self-modifying W-values • The update of W for Ai if ak is chosen: Wi(x) := (1-α_W)Wi(x) + α_W·d_ki(x), where d_ki(x) is the difference between P and A. • If Ak leads from the start to infinity: Wi(x) → E(d_ki(x)). This is why we do not update Wk(x): because E(d_kk(x)) = 0. • Benefit: the W-learning algorithm can handle any number of switches of leader.
Will competition ever be resolved? • What we need to show is that the leader will not keep changing forever.
Convergence of W-learning • This process will terminate within n² steps, resolving the competition with a winner Ak for which Wk(x) ≥ E(d_ki(x)) for all i, i ≠ k.
Remark 1: More than one possible winner • Expected deviations E(d_ki(x)), with row i and column k:

    ( 0  3  0 )
    ( 0  0  9 )
    ( 0  0  0 )

• Start with all Wi(x) = 0. Choose A2's action: W1(x) = (1-1)·0 + 1·d_21 = 3. Now A1 is the leader. • Start again with all Wi(x) = 0. Choose A3's action: W2(x) = (1-1)·0 + 1·d_32 = 9. Now A2 is the leader.
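A tiny numeric replay of this remark, assuming the nine numbers form the matrix E(d_ki(x)) with row i and column k, and taking α_W = 1 as in the arithmetic above:

    # E(d_ki(x)) with row i, column k (0-based): d[0][1] = d_21 = 3, d[1][2] = d_32 = 9.
    d = [[0, 3, 0],
         [0, 0, 9],
         [0, 0, 0]]

    # Case 1: A2's action is executed first, so W1 jumps to d_21 = 3 and A1 leads.
    W = [0.0, 0.0, 0.0]
    W[0] = (1 - 1) * W[0] + 1 * d[0][1]
    print(W)   # [3, 0.0, 0.0]

    # Case 2: fresh start, A3's action is executed first, so W2 jumps to d_32 = 9 and A2 leads.
    W = [0.0, 0.0, 0.0]
    W[1] = (1 - 1) * W[1] + 1 * d[1][2]
    print(W)   # [0.0, 9, 0.0]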
Remark 2: Should we score the winner's W? • If we did, Wk(x) → E(d_kk(x)) = 0, i.e. the leader's W would converge to 0. Hence back-and-forth competition would continue forever under any such system.
Remark 3: Scaling, peers and unequal agents • An agent with high rewards will end up with high W-values. • The agents are peers because they compete on the same basis. • All concerns may not be of equal importance.