Addressing Queuing Bottlenecks at High Speeds
Sailesh Kumar, Patrick Crowley, Jonathan Turner
Agenda and Overview • In this paper, we • Introduce the potential bottlenecks in high-speed queuing systems • Address only the bottlenecks associated with off-chip SRAM • Overview of the queuing system • Bottlenecks • Our Solution • Conclusion
High Speed Packet Queuing Systems • Packet queues are crucial for isolating traffic (QoS, etc.) • Modern systems can have thousands or even millions of queues • Queues must operate at link rates (OC192, OC768, …) • Memory must be efficiently utilized • Dynamic sharing among queues • Minimal wastage due to fragmentation, etc. • Power consumption, chip area, cost, … • Most shared memory queuing systems consist of: • A packet buffer • A queuing subsystem to store per-queue control information • Supports enqueue and dequeue • A scheduler to make dequeue decisions • Currently we concentrate on the queuing subsystem
Assumptions • Considering only a hardware implementation • Shared memory linked-list queues • Packets are broken into cells and stored • Implicit mapping • Will explain shortly • Separate memories for • Packet storage (DRAM) • Queue linked lists • Queue descriptors (head, tail, etc.); a sketch of these structures follows
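As a concrete picture of these assumptions, here is a minimal sketch in C of one linked-list node and one queue descriptor. All field names and widths are illustrative assumptions, not the paper's layout; the point is that with implicit mapping a node's SRAM index doubles as the address of its cell in the DRAM packet buffer, so no explicit cell pointer is stored.

```c
#include <stdint.h>

/* One linked-list node in SRAM (hypothetical layout). The node's own
 * index implicitly names the corresponding 64-byte cell in DRAM. */
typedef struct {
    uint32_t next;    /* SRAM index of the next node in this queue        */
    uint16_t pkt_len; /* length of the packet ending in this cell, if any */
} list_node_t;

/* Per-queue descriptor kept in a separate memory. */
typedef struct {
    uint32_t head;       /* index of the first node  */
    uint32_t tail;       /* index of the last node   */
    uint32_t cell_count; /* cells currently enqueued */
} queue_desc_t;
```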
Limitations of such a Queuing Subsystem • Dequeue throughput is determined by SRAM latency • SRAM latency > 20 ns; with 64-byte cells this caps throughput near 25 Gbps (worked out below) • Might not be desirable in systems: • with speedup • with multipass support • Fair queuing (scheduling) algorithm performance • Most algorithms require packet lengths to make decisions • Since packet lengths are stored in SRAM, each scheduling decision takes at least 20 ns • Cost • Per-bit cost of SRAM is still 100 times that of DRAM • A 512 MB packet buffer needs 40 MB of SRAM (64 B cells) • Cost of the SRAM can be 8 times that of the DRAM • 12 SRAM chips versus 4 DRAM chips (power, area, etc.)
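The 25 Gbps figure follows directly from one SRAM pointer fetch per dequeued cell; as a worked check of the slide's numbers:

```latex
% One SRAM access (> 20 ns) is needed per 64-byte cell dequeued,
% so the dequeue rate is bounded by
\[
  R_{\max} = \frac{\text{cell size}}{\text{SRAM latency}}
           = \frac{64 \times 8\ \text{bits}}{20\ \text{ns}}
           = 25.6\ \text{Gbps}
\]
```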
Scope for Improvement • Reduce the impact of SRAM latency • Reduce SRAM bandwidth • Is it possible to control the cost by replacing the SRAM with DRAM? • With all of the above, is it possible to ensure good worst-case throughput?
Buffer Aggregation • Each node of the linked list maps to multiple buffers • An occupancy or offset tracks the fill level of a node • Note that only the head and tail nodes can be partially filled • Fewer SRAM accesses are needed (1/X on average); see the sketch below
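A minimal C sketch of the aggregated layout and its enqueue path, assuming X = 8 cells per node and stubbed-out memory helpers (all names are illustrative, not the paper's interface):

```c
#include <stdint.h>

#define X 8   /* cells per aggregated node; illustrative value */

/* Stubs standing in for the free list, SRAM, and DRAM interfaces. */
static uint32_t free_top = 1;
static uint32_t alloc_node(void) { return free_top++; }
static void sram_write_next(uint32_t node, uint32_t next) { (void)node; (void)next; }
static void dram_write_cell(uint32_t node, int off, const void *cell)
{ (void)node; (void)off; (void)cell; }

/* Only the head and tail nodes can be partially filled, so two small
 * counters in the descriptor suffice to track fill levels. */
typedef struct {
    uint32_t head, tail;  /* first and last aggregated nodes           */
    uint8_t  head_off;    /* next cell to dequeue within the head node */
    uint8_t  tail_fill;   /* cells already written into the tail node  */
} agg_queue_t;

/* Enqueue one cell: a new node is linked in (one SRAM write) only once
 * every X cells, which is where the 1/X average access count comes from. */
void enqueue_cell(agg_queue_t *q, const void *cell)
{
    if (q->tail_fill == X) {
        uint32_t n = alloc_node();
        sram_write_next(q->tail, n);
        q->tail = n;
        q->tail_fill = 0;
    }
    dram_write_cell(q->tail, q->tail_fill++, cell);
}
```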
Buffer Aggregation • To prove its effectiveness, we must consider 2 cases • 1. When queues remain near empty: not a difficult case, as SRAM accesses are not required • 2. Backlogged queues with near-empty heads: a legitimate concern; throughput can be the same as in a system without buffer aggregation • We show that by adding a few small request queues, performance remains high with good probability
Queuing Model with Request Queues • The arrival process makes enqueue requests • Requests are stored in the enqueuer queue • The departure process makes dequeue requests • Requests requiring SRAM access are stored in the dequeuer queue • The output queue holds the dequeued cells • An elastic buffer holds a few free nodes • One possible service discipline is sketched below
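One plausible per-slot service discipline for the single SRAM port, with queue occupancies modeled as plain counters; this is an illustrative sketch of how the request queues decouple arrivals from SRAM latency, not necessarily the paper's exact policy:

```c
static int enqueue_q;    /* pending enqueue requests                   */
static int dequeue_q;    /* pending dequeues that need an SRAM read    */
static int elastic_free; /* pre-allocated free nodes for fast enqueues */

/* Executed once per time slot: the SRAM port serves at most one request,
 * and otherwise-idle slots top up the elastic free-node buffer. */
void time_slot(void)
{
    if (dequeue_q > 0) {
        dequeue_q--;          /* serve one head-node link fetch (SRAM read)  */
    } else if (enqueue_q > 0 && elastic_free > 0) {
        enqueue_q--;          /* link one pre-fetched free node (SRAM write) */
        elastic_free--;
    } else {
        elastic_free++;       /* refill the elastic buffer of free nodes     */
    }
}
```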
Queuing Model with Request Queues • We use a discrete-time Markov model and show that relatively small request queues (8-16 entries) ensure good throughput with very high probability • Experimental results for a few example systems are shown in the paper
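The flavor of the analysis can be reproduced with a toy discrete-time simulation of one request queue; the arrival probability, latency, and capacity below are made-up parameters, not the paper's model:

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const double p = 0.20;  /* per-slot probability a request needs SRAM    */
    const int    L = 4;     /* SRAM latency in slots (one completion per L) */
    const int    Q = 16;    /* request-queue capacity                       */
    const long   N = 10000000;
    long overflows = 0;
    int occ = 0;

    srand(1);
    for (long t = 0; t < N; t++) {
        if ((double)rand() / RAND_MAX < p) {     /* Bernoulli arrival   */
            if (occ < Q) occ++; else overflows++;
        }
        if (t % L == 0 && occ > 0) occ--;        /* one SRAM completion */
    }
    printf("fraction of slots with overflow: %g\n", (double)overflows / N);
    return 0;
}
```

With the load comfortably below the SRAM service rate (0.20 vs. 0.25 per slot here), a 16-entry queue overflows only rarely, matching the qualitative claim above.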
Queue Descriptor Size • Note that every list node now consists of X buffers • Potentially, X packets can be stored in the head node • Thus the queue descriptor may need to hold the lengths of all X packets • A large queue descriptor memory might not be desirable • We introduce a clever encoding of packet lengths and boundaries, which results in coarse-grained scheduling (a rough size accounting follows)
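A rough size accounting makes the problem concrete, using the slide's symbols (X cells per node, C the cell size in bytes, P the maximum packet size; the exact field widths are an assumption):

```latex
% A naive head descriptor stores up to X packet lengths, each wide
% enough to encode a length of up to XC + P bytes:
\[
  B_{\mathrm{naive}} \approx X \cdot \left\lceil \log_2 (XC + P) \right\rceil \ \text{bits per queue}
\]
% The encoding drops these per-packet length fields and keeps only
% cell counts, at the price of coarse-grained scheduling decisions.
```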
Coarse Grained Scheduling • Coarse-grained scheduling makes scheduling decisions based on the cell count alone • This may cause a short-term error • The error can easily be compensated within an X-cell window • Thus long-term fairness is ensured • The jitter remains negligible for all practical purposes • [Figure annotation: the log2(XC+P)-bit per-packet length field is not needed]
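One way to picture the compensation is a deficit-style credit counter: decisions are charged in whole cells, and the rounding error is refunded once the true packet boundary is seen at the head. This is a DRR-flavored illustration under assumed names, not the paper's exact mechanism:

```c
typedef struct {
    long credit;   /* bytes this queue may still send in its round */
} sched_state_t;

/* At decision time only the cell count is known: charge the coarse
 * estimate of cells * cell_bytes. */
void charge_estimate(sched_state_t *s, int cells, int cell_bytes)
{
    s->credit -= (long)cells * cell_bytes;
}

/* Once the packet boundary is known, refund the over-charge. The
 * per-packet error is less than one cell, so fairness recovers
 * within an X-cell window and only short-term error remains. */
void settle_actual(sched_state_t *s, int cells, int cell_bytes,
                   int actual_bytes)
{
    s->credit += (long)cells * cell_bytes - actual_bytes;
}
```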
Conclusions • We presented the bottlenecks of linked-list queues and addressed them • Similar techniques may already be in use in some production systems • Our aim was to present these ideas to the research community • Further research • Our ideas can be extended to asynchronous packet buffers • This removes the need to segment packets into cells • Improved space and bandwidth efficiency