Application of Reinforcement Learning in Network Routing By Chaopin Zhu
Machine Learning
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Supervised Learning
• Feature: learning with a teacher
• Phases: training phase, testing phase
• Applications: pattern recognition, function approximation
Unsupervised Learning
• Feature: learning without a teacher
• Applications: feature extraction and other preprocessing
Reinforcement Learning
• Feature: learning with a critic
• Applications: optimization, function approximation
Elements of Reinforcement Learning
• Agent
• Environment
• Policy
• Reward function
• Value function
• Model of the environment (optional)
Markov Decision Process (MDP)
• Definition: a reinforcement learning task that satisfies the Markov property
• Transition probabilities:
  P^a_{xx'} = Pr{ x_{t+1} = x' | x_t = x, a_t = a }
Markov Decision Process (cont.)
• Parameters: expected rewards R^a_{xx'} and discount rate γ
  R^a_{xx'} = E[ r_{t+1} | x_t = x, a_t = a, x_{t+1} = x' ]
• Value functions
  V^π(x) = E_π[ Σ_{k≥0} γ^k r_{t+k+1} | x_t = x ]
  Q^π(x, a) = E_π[ Σ_{k≥0} γ^k r_{t+k+1} | x_t = x, a_t = a ]
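For reference, V^π satisfies the Bellman equation, which ties the value function back to the transition probabilities and expected rewards:

  V^π(x) = Σ_a π(x, a) Σ_{x'} P^a_{xx'} [ R^a_{xx'} + γ V^π(x') ]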
Elementary Methods for the Reinforcement Learning Problem
• Dynamic programming
• Monte Carlo methods
• Temporal-difference learning
Dynamic Programming Methods
• Policy evaluation
• Policy improvement
Dynamic Programming (cont.)
• Policy iteration: π_0 →E V^{π_0} →I π_1 →E V^{π_1} →I ... →I π* →E V*
  (E ---- policy evaluation, I ---- policy improvement)
• Value iteration
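A minimal policy-iteration sketch in Python, assuming a small tabular MDP whose transition probabilities P[a][x][x'] and expected rewards R[a][x][x'] are given as NumPy arrays (the array layout and names are placeholders, not from the slides):

import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-6):
    # P[a, x, x'] : transition probabilities, R[a, x, x'] : expected rewards
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)        # start from an arbitrary policy
    while True:
        # Policy evaluation (E): iterate the Bellman backup for the current policy
        V = np.zeros(n_states)
        while True:
            V_new = np.array([
                np.sum(P[policy[x], x] * (R[policy[x], x] + gamma * V))
                for x in range(n_states)
            ])
            delta = np.max(np.abs(V_new - V))
            V = V_new
            if delta < tol:
                break
        # Policy improvement (I): act greedily with respect to the evaluated V
        Q = np.einsum('axy,axy->ax', P, R + gamma * V)   # Q[a, x]
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V                             # policy stable -> optimal
        policy = new_policy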
Monte Carlo Methods
• Features
  - Learn from experience
  - Do not need complete transition probabilities
• Idea
  - Partition experience into episodes
  - Average sample returns
  - Update on an episode-by-episode basis
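A first-visit Monte Carlo evaluation sketch along these lines; the generate_episode() function, assumed to return one episode as a list of (state, reward) pairs sampled under the policy, is a placeholder:

from collections import defaultdict

def mc_evaluate(generate_episode, gamma=0.9, n_episodes=1000):
    # generate_episode() -> [(state, reward), ...] for one episode (assumed helper)
    returns = defaultdict(list)   # all first-visit sample returns seen per state
    V = {}
    for _ in range(n_episodes):
        episode = generate_episode()
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = gamma * G + reward
            # First-visit: record G only if the state does not occur earlier in the episode
            if state not in (s for s, _ in episode[:t]):
                returns[state].append(G)
                V[state] = sum(returns[state]) / len(returns[state])   # average sample return
    return V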
Temporal-Difference Learning
• Features (a combination of Monte Carlo and DP ideas)
  - Learn from experience (Monte Carlo)
  - Update estimates based in part on other learned estimates (DP)
• The TD(λ) algorithm seamlessly integrates TD and Monte Carlo methods
TD(0) Learning
Initialize V(x) arbitrarily
π ← the policy to be evaluated
Repeat (for each episode):
  Initialize x
  Repeat (for each step of the episode):
    a ← action given by π for x
    Take action a; observe reward r and next state x'
    V(x) ← V(x) + α[ r + γV(x') - V(x) ]
    x ← x'
  until x is terminal
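A tabular Python sketch of this TD(0) procedure; the env object (with reset() and step() methods) and the policy function are illustrative assumptions, not part of the original pseudocode:

from collections import defaultdict

def td0_evaluate(env, policy, alpha=0.1, gamma=0.9, n_episodes=500):
    # env.reset() -> x and env.step(a) -> (x', r, done) are assumed;
    # policy(x) returns the action the evaluated policy takes in x
    V = defaultdict(float)                  # V(x) initialized arbitrarily (here: 0)
    for _ in range(n_episodes):
        x = env.reset()
        done = False
        while not done:                     # for each step of the episode
            a = policy(x)                   # action given by the policy for x
            x_next, r, done = env.step(a)   # observe reward r and next state x'
            # TD(0) update toward the one-step bootstrapped target
            V[x] += alpha * (r + gamma * V[x_next] - V[x])
            x = x_next
    return V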
Q-Learning
Initialize Q(x, a) arbitrarily
Repeat (for each episode):
  Initialize x
  Repeat (for each step of the episode):
    Choose a from x using a policy derived from Q
    Take action a; observe r and next state x'
    Q(x, a) ← Q(x, a) + α[ r + γ max_a' Q(x', a') - Q(x, a) ]
    x ← x'
  until x is terminal
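A matching tabular Q-learning sketch, using ε-greedy exploration as one common choice for the "policy derived from Q" (the env interface is the same assumed one as above):

import random
from collections import defaultdict

def q_learning(env, actions, alpha=0.1, gamma=0.9, epsilon=0.1, n_episodes=500):
    Q = defaultdict(float)                  # Q(x, a) initialized arbitrarily (here: 0)
    for _ in range(n_episodes):
        x = env.reset()
        done = False
        while not done:
            # Choose a from x using an epsilon-greedy policy derived from Q
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(x, act)])
            x_next, r, done = env.step(a)   # take action a, observe r and x'
            # Q-learning update toward r + gamma * max_a' Q(x', a')
            best_next = max(Q[(x_next, act)] for act in actions)
            Q[(x, a)] += alpha * (r + gamma * best_next - Q[(x, a)])
            x = x_next
    return Q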
Q-Routing
Q_x(y, d) ---- estimated time for a packet to travel from the current node x to the destination node d via x's neighbor node y
T_y(d) ------- y's estimate of the time remaining in the trip
q_y ---------- queuing time in node y
T_xy --------- transmission time between x and y
Algorithm of Q-Routing
1. Set initial Q-values for each node
2. Get the first packet from the packet queue of node x
3. Choose the best neighbor node ŷ = argmin_y Q_x(y, d) and forward the packet to ŷ
4. Get the estimated value T_ŷ(d) back from node ŷ
5. Update Q_x(ŷ, d) ← Q_x(ŷ, d) + η [ q_ŷ + T_xŷ + T_ŷ(d) - Q_x(ŷ, d) ]
6. Go to step 2
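A Python sketch of steps 3-5 under these definitions; the Q-table layout (node, neighbor, destination) -> estimated delivery time, the neighbors dictionary, and the learning rate eta are assumptions for illustration:

def choose_next_hop(Q, x, d, neighbors):
    # Step 3: pick the neighbor with the smallest estimated delivery time to d
    return min(neighbors[x], key=lambda y: Q[(x, y, d)])

def q_routing_update(Q, x, y, d, q_y, T_xy, neighbors, eta=0.5):
    # Steps 4-5: after forwarding the packet to y, fold y's report back into Q_x(y, d).
    # q_y and T_xy are the measured queuing and transmission times.
    T_y_d = min(Q[(y, z, d)] for z in neighbors[y])    # y's own estimate T_y(d)
    Q[(x, y, d)] += eta * (q_y + T_xy + T_y_d - Q[(x, y, d)])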
Initialization / Termination Procedures
• Initialization
  - Initialize and/or register global variables
  - Initialize the routing table
• Termination
  - Destroy the routing table
  - Release memory
Arrival Procedure
• Data packet arrival
  - Update the routing table
  - Route the packet with control information, or destroy it if it has reached its destination
• Control information packet arrival
  - Update the routing table
  - Destroy the packet
Departure Procedure
• Set all fields of the packet
• Get a shortest route
• Send the packet according to the route
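A rough sketch of how the arrival and departure procedures could sit on top of the Q-routing helpers above; the Packet fields and the control-packet payload are illustrative assumptions rather than the original simulation code:

from dataclasses import dataclass, field

@dataclass
class Packet:
    source: int
    destination: int
    is_control: bool = False
    payload: dict = field(default_factory=dict)   # e.g. a neighbor's reported estimate

def on_arrival(node, packet, Q, neighbors, eta=0.5):
    if packet.is_control:
        # Control packet: carries (y, d, q_y, T_xy) reported by the downstream neighbor;
        # use it to update the routing table, then the packet is destroyed
        info = packet.payload
        q_routing_update(Q, node, info["y"], info["d"], info["q_y"], info["T_xy"],
                         neighbors, eta)
        return None
    if packet.destination == node:
        return None                                # data packet reached its destination
    # Data packet in transit: forward it to the currently best-looking neighbor
    return choose_next_hop(Q, node, packet.destination, neighbors)

def on_departure(node, dest, Q, neighbors):
    # Set the packet fields and send it along the currently best estimated route
    packet = Packet(source=node, destination=dest)
    return packet, choose_next_hop(Q, node, dest, neighbors)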
References
[1] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction
[2] Chengan Guo, Applications of Reinforcement Learning in Sequence Detection and Network Routing
[3] Simon Haykin, Neural Networks: A Comprehensive Foundation