
Collaborative Reinforcement Learning

Collaborative Reinforcement Learning. Presented by Dr. Ying Lu. Credits: Reinforcement Learning: A User’s Guide, Bill Smart at ICAC 2005.


Presentation Transcript


  1. Collaborative Reinforcement Learning Presented by Dr. Ying Lu

  2. Credits • Reinforcement Learning: A User’s Guide. Bill Smart at ICAC 2005 • Jim Dowling, Eoin Curran, Raymond Cunningham and Vinny Cahill, "Collaborative Reinforcement Learning of Autonomic Behaviour", 2nd International Workshop on Self-Adaptive and Autonomic Computing Systems, pages 700-704, 2004. [Winner Best Paper Award].

  3. What is RL? • “a way of programming agents by reward and punishment without needing to specify how the task is to be achieved” • [Kaelbling, Littman, & Moore, 96]

  4. Basic RL Model • Observe state, $s_t$ • Decide on an action, $a_t$ • Perform action • Observe new state, $s_{t+1}$ • Observe reward, $r_{t+1}$ • Learn from experience • Repeat • Goal: Find a control policy that will maximize the observed rewards over the lifetime of the agent • [Diagram: agent and World exchanging state S, reward R, and action A]
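The observe/decide/act/learn loop on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TwoActionWorld`, `run_agent`, the epsilon-greedy rule, and the running-mean update are hypothetical choices made for the sketch, not something the slides prescribe.

```python
import random

ACTIONS = ["A", "B"]


class TwoActionWorld:
    """Hypothetical stand-in for the World: one state, action A pays 1, B pays 2."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        reward = {"A": 1.0, "B": 2.0}[action]
        return self.state, reward              # s_{t+1}, r_{t+1}


def run_agent(world, steps=200, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    estimates = {a: 0.0 for a in ACTIONS}      # running mean reward per action
    counts = {a: 0 for a in ACTIONS}
    total = 0.0
    state = world.state                        # observe state s_t
    for _ in range(steps):
        untried = [a for a in ACTIONS if counts[a] == 0]
        if untried:                            # decide on an action a_t
            action = untried[0]                # try every action once first
        elif rng.random() < epsilon:
            action = rng.choice(ACTIONS)       # occasional exploration
        else:
            action = max(estimates, key=estimates.get)
        state, reward = world.step(action)     # perform action, observe s_{t+1}, r_{t+1}
        counts[action] += 1                    # learn from experience
        estimates[action] += (reward - estimates[action]) / counts[action]
        total += reward                        # repeat
    return estimates, total
```

Calling `run_agent(TwoActionWorld())` returns the learned reward estimates and the total reward collected; the agent ends up preferring action B because its estimated reward is larger.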

  5. An Example: Gridworld • Canonical RL domain • States are grid cells • 4 actions: N, S, E, W • Reward for entering top right cell (+1 in the figure) • -0.01 for every other move • Maximizing sum of rewards ⇒ shortest path (in this instance)
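A minimal gridworld sketch makes the "maximizing reward means shortest path" point concrete. The grid size, the start cell, the wall-bumping behavior, and the helper names `step` and `path_return` are assumptions for illustration; only the four actions, the +1 goal reward, and the -0.01 move cost come from the slide.

```python
WIDTH, HEIGHT = 4, 3                 # hypothetical grid size
GOAL = (WIDTH - 1, HEIGHT - 1)       # top right cell
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}


def step(state, action):
    """Apply one move; bumping into a wall leaves the agent in place (assumed)."""
    dx, dy = MOVES[action]
    x = min(max(state[0] + dx, 0), WIDTH - 1)
    y = min(max(state[1] + dy, 0), HEIGHT - 1)
    next_state = (x, y)
    reward = 1.0 if next_state == GOAL else -0.01
    return next_state, reward


def path_return(start, actions):
    """Sum the rewards along a sequence of actions, stopping at the goal."""
    state, total = start, 0.0
    for action in actions:
        state, reward = step(state, action)
        total += reward
        if state == GOAL:
            break
    return state, total
```

From the bottom left cell `(0, 0)`, the five-move path `"EEENN"` collects four move penalties and the goal reward, while the seven-move detour `"NSEEENN"` collects six penalties, so the shorter path has the higher sum of rewards.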

  6. The Promise of RL • Specify what to do, but not how to do it • Through the reward function • Learning “fills in the details” • Better final solutions • Based on actual experiences, not programmer assumptions • Less (human) time needed for a good solution

  7. Mathematics of RL • Before we talk about RL, we need to cover some background material • Some simple decision theory • Markov Decision Processes • Value functions

  8. Making Single Decisions • Single decision to be made • Multiple discrete actions • Each action has a reward associated with it • Goal is to maximize reward • Not hard: just pick the action with the largest reward • State 0 has a value of 2 • Sum of rewards from taking the best action from state 0 • [Diagram: from state 0, action A (reward 1) leads to state 1 and action B (reward 2) leads to state 2]

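  9. Markov Decision Processes • We can generalize the previous example to multiple sequential decisions • Each decision affects subsequent decisions • This is formally modeled by a Markov Decision Process (MDP) • [Diagram: example MDP with states 0 to 5. Transitions (action, reward): 0 → 1 (A, 1), 0 → 2 (B, 2), 1 → 3 (A, 1), 1 → 4 (B, 1), 2 → 4 (A, -1000), 3 → 5 (A, 1), 4 → 5 (A, 10)]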

  10. Markov Decision Processes • Formally, an MDP is • A set of states, $S = \{s_1, s_2, \ldots, s_n\}$ • A set of actions, $A = \{a_1, a_2, \ldots, a_m\}$ • A reward function, $R: S \times A \times S \to \mathbb{R}$ • A transition function, $T: S \times A \to S$ • We want to learn a policy, $\pi: S \to A$ • Maximize sum of rewards we see over our lifetime

  11. Policies • There are 3 policies for this MDP • 0 → 1 → 3 → 5 • 0 → 1 → 4 → 5 • 0 → 2 → 4 → 5 • Which is the best one? • [Diagram: the example MDP with states 0 to 5]

  12. Comparing Policies • Order policies by how much reward they see • 0 → 1 → 3 → 5 = 1 + 1 + 1 = 3 • 0 → 1 → 4 → 5 = 1 + 1 + 10 = 12 • 0 → 2 → 4 → 5 = 2 – 1000 + 10 = -988 • [Diagram: the example MDP with states 0 to 5]
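The comparison on this slide can be reproduced with a short sketch. The `MDP` table below is reconstructed from the reward sums and Q values shown in the slides, and the names `MDP`, `TERMINAL`, and `policy_return` are hypothetical helpers introduced only for this illustration.

```python
# (state, action) -> (next_state, reward), reconstructed from the slide's sums
MDP = {
    (0, "A"): (1, 1), (0, "B"): (2, 2),
    (1, "A"): (3, 1), (1, "B"): (4, 1),
    (2, "A"): (4, -1000),
    (3, "A"): (5, 1),
    (4, "A"): (5, 10),
}
TERMINAL = 5


def policy_return(policy, start=0):
    """Follow a policy (state -> action) from start and sum the rewards seen."""
    state, total = start, 0
    while state != TERMINAL:
        state, reward = MDP[(state, policy[state])]
        total += reward
    return total


policies = {
    "0->1->3->5": {0: "A", 1: "A", 3: "A"},
    "0->1->4->5": {0: "A", 1: "B", 4: "A"},
    "0->2->4->5": {0: "B", 2: "A", 4: "A"},
}
returns = {name: policy_return(p) for name, p in policies.items()}
best = max(returns, key=returns.get)   # policy with the largest sum of rewards
```

Here `returns` holds 3, 12, and -988 for the three paths, so `best` is the path through states 1 and 4.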

  13. Value Functions • We can define value without specifying the policy • Specify the value of taking action a from state s and then performing optimally • This is the state-action value function, Q • How do you tell which action to take from each state? • Q(0, A) = 12, Q(0, B) = -988, Q(1, A) = 2, Q(1, B) = 11, Q(2, A) = -990, Q(3, A) = 1, Q(4, A) = 10 • [Diagram: the example MDP with states 0 to 5]

  14. Value Functions • So, we have the value function • $Q(s, a) = R(s, a, s') + \max_{a'} Q(s', a')$, where $s'$ is the next state • In the form of: next reward plus the best I can do from the next state • These extend to probabilistic actions
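As a check on the recursion, the sketch below evaluates Q for the small example MDP used in the earlier slides. The transition table is reconstructed from the numbers on those slides, transitions are assumed deterministic, and `MDP`, `actions`, and `q_value` are hypothetical names for this illustration.

```python
from functools import lru_cache

# (state, action) -> (next_state, reward), reconstructed from the slides
MDP = {
    (0, "A"): (1, 1), (0, "B"): (2, 2),
    (1, "A"): (3, 1), (1, "B"): (4, 1),
    (2, "A"): (4, -1000),
    (3, "A"): (5, 1),
    (4, "A"): (5, 10),
}
TERMINAL = 5


def actions(state):
    """Actions available in a state."""
    return sorted(a for (s, a) in MDP if s == state)


@lru_cache(maxsize=None)
def q_value(state, action):
    """Q(s, a) = R(s, a, s') + max over a' of Q(s', a'); nothing follows the terminal state."""
    next_state, reward = MDP[(state, action)]
    if next_state == TERMINAL:
        return reward
    return reward + max(q_value(next_state, a) for a in actions(next_state))


Q = {(s, a): q_value(s, a) for (s, a) in MDP}
```

The resulting table matches the values listed on the previous slide, for example `Q[(0, "A")] == 12` and `Q[(2, "A")] == -990`.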

  15. Getting the Policy • If we have the value function, then finding the best policy is easy • $\pi(s) = \arg\max_a Q(s, a)$ • We’re looking for the optimal policy, $\pi^*(s)$ • No policy generates more reward than $\pi^*$ • Optimal policy defines optimal value functions • The easiest way to learn the optimal policy is to learn the optimal value function first
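Reading the policy off a value function is a one-pass arg max. The sketch below uses the Q values listed for the example MDP earlier in the deck; `greedy_policy` is a hypothetical helper name.

```python
# Q values for the example MDP, as listed on the Value Functions slide
Q = {
    (0, "A"): 12, (0, "B"): -988,
    (1, "A"): 2, (1, "B"): 11,
    (2, "A"): -990,
    (3, "A"): 1,
    (4, "A"): 10,
}


def greedy_policy(q):
    """For each state, pick the action with the largest Q(s, a)."""
    best = {}
    for (state, action), value in q.items():
        if state not in best or value > q[(state, best[state])]:
            best[state] = action
    return best


policy = greedy_policy(Q)
```

For this table the greedy policy chooses A in state 0 and B in state 1, which is the path 0 → 1 → 4 → 5 with total reward 12.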

  16. Collaborative Reinforcement Learning to Adaptively Optimize MANET Routing • Jim Dowling, Eoin Curran, Raymond Cunningham and Vinny Cahill

  17. Overview • Building autonomic distributed systems with self-* properties • Self-Organizing • Self-Healing • Self-Optimizing • Add collaborative learning mechanism to self-adaptive component model • Improved ad-hoc routing protocol

  18. Introduction • Autonomous distributed systems will consist of interacting components free from human interference • Existing top-down management and programming solutions require too much global state • Bottom-up alternative: a decentralized collection of components that make their own decisions based on local information • System-wide self-* behavior emerges from interactions

  19. Self-* Behavior • Self-adaptive components that change structure and/or behavior at run-time, adapt to • discovered faults • reduced performance • Requires active monitoring of component states and external dependencies

  20. Self-* Distributed Systems using Distributed (collaborative) Reinforcement Learning • For complex systems, programmers cannot be expected to describe all conditions • Self-adaptive behavior learnt by components • Decentralized co-ordination of components to support system-wide properties • Distributed Reinforcement Learning (DRL) is an extension of RL that uses neighbor interactions only

  21. Model-Based Reinforcement Learning • Markov Decision Process = ({States}, {Actions}, R(States, Actions), P(States, Actions, States)) • [Figure labels: 1. Action Reward, 2. State Transition Model, 3. Next State Reward]

  22. Decentralised System Optimisation • Coordinating the solution to a set of Discrete Optimisation Problems (DOPs) • Components have a Partial System View • Coordination Actions • Actions = {delegation} ∪ {DOP actions} ∪ {discovery} • Connection Costs

  23. Collaborative Reinforcement Learning • Advertisement: Update Partial Views of Neighbours • Decay: Negative Feedback on State Values in the Absence of Advertisements • [Equation figure labels: Cached Neighbour’s V-value, State Transition Model, Action Reward, Connection Cost]

  24. Adaptation in CRL System • A feedback process that responds to • Changes in the optimal policy of any RL agent • Changes in the system environment • The passing of time

  25. SAMPLE: Ad-hoc Routing using DRL • Probabilistic ad-hoc routing protocol based on DRL • Adaptation of network traffic around areas of congestion • Exploitation of stable routes • Routing decisions based on local information and information obtained from neighbors • Outperforms Ad-hoc On Demand Distance Vector Routing (AODV) and Dynamic Source Routing (DSR)

  26. SAMPLE: A CRL System (I)

  27. SAMPLE: A CRL System (II) • Instead of always choosing the neighbor with the best Q value, i.e., taking the delegation action $a = \arg\max_a Q_i(B, a)$, a neighbor is chosen probabilistically
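One common way to choose probabilistically while still favoring good neighbors is Boltzmann (softmax) selection over the Q values. The slide does not say which distribution is used, so the softmax rule, the `temperature` parameter, and the function names below are assumptions made for this sketch. It follows the slide's convention that a larger Q value is better.

```python
import math
import random


def boltzmann_probabilities(q_values, temperature=1.0):
    """Map neighbor -> Q value to neighbor -> selection probability (softmax)."""
    top = max(q_values.values())               # subtract the max for numerical stability
    weights = {n: math.exp((q - top) / temperature) for n, q in q_values.items()}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


def choose_neighbor(q_values, temperature=1.0, rng=random):
    """Sample one neighbor; better Q values are more likely but never certain."""
    probs = boltzmann_probabilities(q_values, temperature)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]
```

A high temperature spreads traffic almost uniformly across neighbors, while a low temperature approaches the greedy arg max choice, so the parameter trades exploration of alternative routes against exploitation of the best known one.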

  28. SAMPLE: A CRL System (III) • $P_i(s' \mid s, a_j) = E(C_S / C_A)$
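The slide does not define the two counters. Assuming they stand for successful and attempted transmissions to a neighbor, the transition probability can be estimated from running counts as in the sketch below. That reading, the `LinkEstimator` class, and the optimistic default for an unused link are all assumptions for illustration; the sample ratio is an estimate of the expectation, not the expectation itself.

```python
class LinkEstimator:
    """Estimates the delivery probability to one neighbor from observed counts."""

    def __init__(self, default=1.0):
        self.successes = 0        # assumed meaning of C_S
        self.attempts = 0         # assumed meaning of C_A
        self.default = default    # hypothetical optimistic value before any attempt

    def record(self, success):
        """Record the outcome of one transmission attempt."""
        self.attempts += 1
        if success:
            self.successes += 1

    def success_probability(self):
        """Sample ratio of successful to attempted transmissions."""
        if self.attempts == 0:
            return self.default
        return self.successes / self.attempts
```

After three successful transmissions and one failure, `success_probability()` returns 0.75.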

  29. SAMPLE: A CRL System (IV)

  30. Performance • Metrics: • Maximize • throughput • ratio of delivered packets to undelivered packets • Minimize • number of transmissions required per packet sent • Figures 5-10

  31. Questions/Discussions
