Q-Learning for Policy Improvement in Network Routing
W. Chen & S. Meyn, Dept. ECE & CSL, University of Illinois
Q-learning Techniques for Net Opt (W. Chen & S. Meyn)

ACHIEVEMENT DESCRIPTION

STATUS QUO
What is the state of the art and what are its limitations? Control by learning: obtain a routing/scheduling policy for the network that is approximately optimal with respect to delay and throughput. Can these techniques be extended to wireless models?

MAIN RESULT
Q-learning for network routing: by observing the network behavior, we can approximate the optimal policy over a class of policies with restricted complexity and restricted information.

HOW IT WORKS
Step 1: Generate a state trajectory.
Step 2: Learn the best value function approximation by stochastic approximation.
Step 3: Policy for routing: the h-MaxWeight policy derived from the value function approximation.

KEY NEW INSIGHTS
• Extend to wireless? Yes.
• Complexity is similar to MaxWeight; the policies are distributed and throughput optimal.
• Learning an approximately optimal solution by Q-learning is feasible, even for complex networks.
• New application: Q-learning and TD-learning for power control.

IMPACT
Near-optimal performance with a simple solution: Q-learning with steepest descent or the Newton–Raphson method via stochastic approximation. Theoretical analysis of convergence; simulation experiments are ongoing.

NEXT-PHASE GOALS
• Implementation – consensus algorithms and information distribution.
• Adaptation – kernel-based TD-learning and Q-learning.
• Integration with network coding projects: code around network hot-spots.
• Un-consummated union challenge: integrate coding and resource allocation.
• Generally, solutions to complex decision problems should offer insight: algorithms for dynamic routing – visualization and optimization.
Problem: Decentralized resource allocation for a dynamic network is complex. How can we achieve near-optimal performance (in delay or power) while maintaining throughput optimality and a distributed implementation?

Motivation: Optimal control theory is intrinsically based on associated value functions. We can approximate the value function using a relaxation technique or other methods.

Solution: Learn an approximation of the value function over a parametric family (which can be chosen using prior knowledge) and derive an approximately optimal routing policy for the network.
h-MaxWeight Method (Meyn 2008):
• h-MaxWeight uses a perturbation technique to generate a rich class of universally stabilizing policies. Any monotone and convex function can be used as the function h (see the sketch after this list).
• Relaxation-based h-MaxWeight: find the function h through a relaxation of the fluid value function.
• Q-learning-based h-MaxWeight: approximate the optimal function h over a parametric family by local learning.
How to approximate the optimal function h?
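Before turning to that question, the decision rule itself can be made concrete. The Python sketch below is a minimal illustration of a MaxWeight-style rule in which the queue-length vector is replaced by the gradient of a surrogate function h; the network matrix B, the action set, and all function names are hypothetical placeholders under an assumed two-queue tandem model, not the authors' implementation.

```python
import itertools
import numpy as np

def h_maxweight_action(q, B, actions, grad_h):
    """Pick the allocation minimizing the projected drift <grad_h(q), B u>.

    q       : vector of queue lengths
    B       : constituency/routing matrix, so B @ u is the mean change in q
              per unit of activity u (hypothetical network model)
    actions : iterable of feasible allocation vectors u
    grad_h  : callable returning the gradient of the surrogate function h at q
    """
    g = grad_h(q)
    # With h(x) = 0.5*|x|^2 the rule reduces to ordinary MaxWeight; the
    # generalized rule replaces the queue vector by the gradient of h.
    return min(actions, key=lambda u: float(g @ (B @ np.asarray(u))))

# Example: two queues in tandem. Serving queue 1 routes work to queue 2,
# serving queue 2 drains it; arrivals are left out of the one-step drift.
B = np.array([[-1.0, 0.0],
              [ 1.0, -1.0]])
actions = [np.array(a, dtype=float) for a in itertools.product([0, 1], repeat=2)]

q = np.array([5.0, 2.0])
u = h_maxweight_action(q, B, actions, grad_h=lambda x: x)  # plain MaxWeight
```

Supplying grad_h as the gradient of a perturbed monotone, convex h instead of the identity gives an h-MaxWeight policy.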
Find the Function h by Local Learning
Approximate the value function by a general quadratic, with the defining condition required to hold in a sub-region of the state space.
• Architecture for choosing the approximation:
  • Fluid value function approximation
  • Diffusion value function approximation
  • Shortest-path information
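As a purely illustrative example of such an architecture, the sketch below builds a general quadratic feature vector from the queue-length vector; in practice the basis would be augmented with fluid or diffusion value-function terms and shortest-path information, which are not reproduced here. All names are hypothetical.

```python
import numpy as np

def quadratic_features(x):
    """Feature vector for a general quadratic in the queue-length vector x:
    all pairwise products x_i*x_j plus the linear terms (illustrative basis;
    fluid/diffusion value functions or shortest-path costs could supply
    additional hand-crafted features)."""
    x = np.asarray(x, dtype=float)
    quad = np.outer(x, x)[np.triu_indices(len(x))]   # x_i * x_j, i <= j
    return np.concatenate([quad, x])

def h_theta(x, theta):
    """Parametric value-function approximation h_theta(x) = theta . psi(x)."""
    return float(theta @ quadratic_features(x))

# Example: 3-queue network, zero parameter vector just to show the dimensions.
x = np.array([4.0, 1.0, 0.0])
dim = len(quadratic_features(x))     # 6 quadratic + 3 linear = 9 features
theta = np.zeros(dim)
value = h_theta(x, theta)
```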
Q-learning based h-MaxWeight Method: Main Idea and Procedure
Goal: Minimize the Bellman error over a parametric family of value function approximations.
Procedure:
Step 1: Generate a state trajectory.
Step 2: Learn the best value function approximation by stochastic approximation.
Step 3: Policy for routing: the h-MaxWeight policy derived from the value function approximation.
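One standard way to write the objective in Step 2, assuming an average-cost criterion with per-step cost c, average cost eta, and a parametric family {h_theta} (the exact criterion used by the authors may differ), is:

```latex
% Bellman (temporal-difference) error of h_theta under the current policy,
% average-cost setting -- a sketch of one standard formulation.
\mathcal{E}_\theta(x) = c(x) - \eta
    + \mathbb{E}\bigl[\,h_\theta(X(t+1)) \mid X(t)=x\,\bigr] - h_\theta(x),
\qquad
\theta^{*} = \arg\min_\theta \ \mathbb{E}\bigl[\mathcal{E}_\theta(X)^2\bigr].
```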
Define the value function approximation over a subspace spanned by the chosen basis. The parameter is found by the steepest descent method, implemented as a stochastic approximation that approximates the optimal parameter; the resulting policy is the corresponding h-MaxWeight policy.
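A minimal sketch of Steps 1–2 in Python, assuming a linear parameterization h_theta(x) = theta . psi(x) over a feature map such as the one above; the function and argument names are hypothetical, and the update is a plain stochastic-gradient (steepest descent) step on the squared temporal-difference error rather than a faithful reproduction of the authors' algorithm.

```python
import numpy as np

def fit_theta(trajectory, costs, features, dim,
              gain=lambda n: 1.0 / (n + 1)):
    """Stochastic-approximation (steepest-descent) fit of a linear
    value-function approximation h_theta(x) = theta . features(x).

    trajectory : states x_0, ..., x_T from the simulated network (Step 1)
    costs      : per-step costs c(x_0), ..., c(x_{T-1})
    features   : callable mapping a state to its feature vector psi(x)
    dim        : dimension of the feature vector

    A sketch only: it drives the squared temporal-difference error to a local
    minimum, ignores the double-sampling issue of true Bellman-error
    minimization, and uses a crude running estimate of the average cost.
    """
    theta = np.zeros(dim)
    eta = 0.0                                 # running average-cost estimate
    for n in range(len(trajectory) - 1):
        psi_now = features(trajectory[n])
        psi_next = features(trajectory[n + 1])
        a = gain(n)
        eta += a * (costs[n] - eta)
        # temporal-difference (Bellman) error at this transition
        d = costs[n] - eta + theta @ psi_next - theta @ psi_now
        # steepest-descent step on 0.5 * d**2
        theta -= a * d * (psi_next - psi_now)
    return theta
```

The fitted theta defines h_theta, and Step 3 applies the h-MaxWeight rule sketched earlier with grad_h set to the gradient of h_theta.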
Summary and Challenges
Challenges
• CAN WE IMPLEMENT? How can we construct implementable approximations of the optimal routing/scheduling policies?
• CAN WE CODE? How can we combine the routing/scheduling policies with network coding in ITMANET?
KEY CONCLUSION
Approximate the optimal scheduling/routing policies for the network by local learning.
References
• S. Meyn. Stability and asymptotic optimality of generalized MaxWeight policies. SIAM J. Control Optim., 47(6):3259–3294, 2009.
• W. Chen et al. Approximate Dynamic Programming using Fluid and Diffusion Approximations with Applications to Power Management. Accepted to the 48th IEEE Conference on Decision and Control, 2009.
• W. Chen et al. Coding and Control for Communication Networks. Invited paper, to appear in the special issue of Queueing Systems.
• S. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007.
• P. Mehta and S. Meyn. Q-learning and Pontryagin's Minimum Principle. Accepted to the 48th IEEE Conference on Decision and Control, 2009.