Addressing Queuing Bottlenecks at High Speeds
Sailesh Kumar, Patrick Crowley, Jonathan Turner
Agenda and Overview • In this paper, we • Introduce the potential bottlenecks in high-speed queuing systems • Address only the bottlenecks associated with off-chip SRAM • Overview of the queuing system • Bottlenecks • Our Solution • Conclusion
High Speed Packet Queuing Systems • Packet queues are crucial for isolating traffic (QoS, etc.) • Modern systems can have thousands or even millions of queues • Queues must operate at link rates (OC192, OC768, …) • Memory must be efficiently utilized • Dynamic sharing among queues • Minimal wastage due to fragmentation, etc. • Power consumption, chip area, cost, … • Most shared memory queuing systems consist of: • A packet buffer • A queuing subsystem to store per-queue control information • Supports enqueue and dequeue • A scheduler to make dequeue decisions • Currently we concentrate on the queuing subsystem
Assumptions • Considering only a hardware implementation • Shared memory linked-list queues • Packets are broken into cells and stored • Implicit mapping • Will explain shortly • Separate memories for • Packet storage (DRAM) • Queue linked lists • Queue descriptors (head, tail, etc.); a sketch of these structures follows
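As a concrete picture of these assumptions, here is a minimal sketch in C of one linked-list node and one queue descriptor. All field names and widths are illustrative assumptions, not the paper's layout; the point is that with implicit mapping a node's SRAM index doubles as the address of its cell in the DRAM packet buffer, so no explicit cell pointer is stored.

```c
#include <stdint.h>

/* One linked-list node in SRAM (hypothetical layout). The node's own
 * index implicitly names the corresponding 64-byte cell in DRAM. */
typedef struct {
    uint32_t next;    /* SRAM index of the next node in this queue        */
    uint16_t pkt_len; /* length of the packet ending in this cell, if any */
} list_node_t;

/* Per-queue descriptor kept in a separate memory. */
typedef struct {
    uint32_t head;       /* index of the first node  */
    uint32_t tail;       /* index of the last node   */
    uint32_t cell_count; /* cells currently enqueued */
} queue_desc_t;
```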
Limitations of such a Queuing Subsystem • Dequeue throughput is determined by SRAM latency • SRAM latency > 20 ns; with 64-byte cells this caps throughput near 25 Gbps (worked out below) • Might not be desirable in systems: • with speedup • with multipass support • Fair queuing (scheduling) algorithm performance • Most algorithms require packet lengths to make decisions • Since packet lengths are stored in SRAM, each scheduling decision takes at least 20 ns • Cost • Per-bit cost of SRAM is still 100 times that of DRAM • A 512 MB packet buffer needs 40 MB of SRAM (64 B cells) • Cost of the SRAM can be 8 times that of the DRAM • 12 SRAM chips versus 4 DRAM chips (power, area, etc.)
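The 25 Gbps figure follows directly from one SRAM pointer fetch per dequeued cell; as a worked check of the slide's numbers:

```latex
% One SRAM access (> 20 ns) is needed per 64-byte cell dequeued,
% so the dequeue rate is bounded by
\[
  R_{\max} = \frac{\text{cell size}}{\text{SRAM latency}}
           = \frac{64 \times 8\ \text{bits}}{20\ \text{ns}}
           = 25.6\ \text{Gbps}
\]
```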
Scope for Improvement • Reduce the impact of SRAM latency • Reduce SRAM bandwidth • Is it possible to control the cost by replacing the SRAM with DRAM? • With all of the above, is it possible to ensure good worst-case throughput?
Buffer Aggregation • Each node of the linked list maps to multiple buffers • An occupancy or offset tracks the fill level of a node • Note that only the head and tail nodes can be partially filled • Fewer SRAM accesses are needed (1/X on average); see the sketch below
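A minimal C sketch of the aggregated layout and its enqueue path, assuming X = 8 cells per node and stubbed-out memory helpers (all names are illustrative, not the paper's interface):

```c
#include <stdint.h>

#define X 8   /* cells per aggregated node; illustrative value */

/* Stubs standing in for the free list, SRAM, and DRAM interfaces. */
static uint32_t free_top = 1;
static uint32_t alloc_node(void) { return free_top++; }
static void sram_write_next(uint32_t node, uint32_t next) { (void)node; (void)next; }
static void dram_write_cell(uint32_t node, int off, const void *cell)
{ (void)node; (void)off; (void)cell; }

/* Only the head and tail nodes can be partially filled, so two small
 * counters in the descriptor suffice to track fill levels. */
typedef struct {
    uint32_t head, tail;  /* first and last aggregated nodes           */
    uint8_t  head_off;    /* next cell to dequeue within the head node */
    uint8_t  tail_fill;   /* cells already written into the tail node  */
} agg_queue_t;

/* Enqueue one cell: a new node is linked in (one SRAM write) only once
 * every X cells, which is where the 1/X average access count comes from. */
void enqueue_cell(agg_queue_t *q, const void *cell)
{
    if (q->tail_fill == X) {
        uint32_t n = alloc_node();
        sram_write_next(q->tail, n);
        q->tail = n;
        q->tail_fill = 0;
    }
    dram_write_cell(q->tail, q->tail_fill++, cell);
}
```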
Buffer Aggregation • To prove its effectiveness, we must consider 2 cases • 1. When queues remain near empty: not a difficult case, as SRAM accesses are not required • 2. Backlogged queues with near-empty heads: a legitimate concern; throughput can be the same as in a system without buffer aggregation • We show that by adding a few small request queues, performance remains high with good probability
Queuing Model with Request Queues • The arrival process makes enqueue requests • Requests are stored in the enqueuer queue • The departure process makes dequeue requests • Requests requiring SRAM access are stored in the dequeuer queue • The output queue holds the dequeued cells • An elastic buffer holds a few free nodes • One possible service discipline is sketched below
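One plausible per-slot service discipline for the single SRAM port, with queue occupancies modeled as plain counters; this is an illustrative sketch of how the request queues decouple arrivals from SRAM latency, not necessarily the paper's exact policy:

```c
static int enqueue_q;    /* pending enqueue requests                   */
static int dequeue_q;    /* pending dequeues that need an SRAM read    */
static int elastic_free; /* pre-allocated free nodes for fast enqueues */

/* Executed once per time slot: the SRAM port serves at most one request,
 * and otherwise-idle slots top up the elastic free-node buffer. */
void time_slot(void)
{
    if (dequeue_q > 0) {
        dequeue_q--;          /* serve one head-node link fetch (SRAM read)  */
    } else if (enqueue_q > 0 && elastic_free > 0) {
        enqueue_q--;          /* link one pre-fetched free node (SRAM write) */
        elastic_free--;
    } else {
        elastic_free++;       /* refill the elastic buffer of free nodes     */
    }
}
```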
Queuing Model with Request Queues • We use a discrete-time Markov model and show that relatively small request queues (8-16 entries) ensure good throughput with very high probability • Experimental results for a few example systems are shown in the paper
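The flavor of the analysis can be reproduced with a toy discrete-time simulation of one request queue; the arrival probability, latency, and capacity below are made-up parameters, not the paper's model:

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const double p = 0.20;  /* per-slot probability a request needs SRAM    */
    const int    L = 4;     /* SRAM latency in slots (one completion per L) */
    const int    Q = 16;    /* request-queue capacity                       */
    const long   N = 10000000;
    long overflows = 0;
    int occ = 0;

    srand(1);
    for (long t = 0; t < N; t++) {
        if ((double)rand() / RAND_MAX < p) {     /* Bernoulli arrival   */
            if (occ < Q) occ++; else overflows++;
        }
        if (t % L == 0 && occ > 0) occ--;        /* one SRAM completion */
    }
    printf("fraction of slots with overflow: %g\n", (double)overflows / N);
    return 0;
}
```

With the load comfortably below the SRAM service rate (0.20 vs. 0.25 per slot here), a 16-entry queue overflows only rarely, matching the qualitative claim above.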
Queue Descriptor Size • Note that every list node now consists of X buffers • Potentially, X packets can be stored in the head node • Thus the queue descriptor may need to hold the lengths of all X packets • A large queue descriptor memory might not be desirable • We introduce a clever encoding of packet lengths and boundaries, which results in coarse-grained scheduling (a rough size accounting follows)
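A rough size accounting makes the problem concrete, using the slide's symbols (X cells per node, C the cell size in bytes, P the maximum packet size; the exact field widths are an assumption):

```latex
% A naive head descriptor stores up to X packet lengths, each wide
% enough to encode a length of up to XC + P bytes:
\[
  B_{\mathrm{naive}} \approx X \cdot \left\lceil \log_2 (XC + P) \right\rceil \ \text{bits per queue}
\]
% The encoding drops these per-packet length fields and keeps only
% cell counts, at the price of coarse-grained scheduling decisions.
```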
Coarse Grained Scheduling • Coarse-grained scheduling makes scheduling decisions based on the cell count alone • This may cause a short-term error • The error can easily be compensated within an X-cell window • Thus long-term fairness is ensured • The jitter remains negligible for all practical purposes • [Figure annotation: the log2(XC+P)-bit per-packet length field is not needed]
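One way to picture the compensation is a deficit-style credit counter: decisions are charged in whole cells, and the rounding error is refunded once the true packet boundary is seen at the head. This is a DRR-flavored illustration under assumed names, not the paper's exact mechanism:

```c
typedef struct {
    long credit;   /* bytes this queue may still send in its round */
} sched_state_t;

/* At decision time only the cell count is known: charge the coarse
 * estimate of cells * cell_bytes. */
void charge_estimate(sched_state_t *s, int cells, int cell_bytes)
{
    s->credit -= (long)cells * cell_bytes;
}

/* Once the packet boundary is known, refund the over-charge. The
 * per-packet error is less than one cell, so fairness recovers
 * within an X-cell window and only short-term error remains. */
void settle_actual(sched_state_t *s, int cells, int cell_bytes,
                   int actual_bytes)
{
    s->credit += (long)cells * cell_bytes - actual_bytes;
}
```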
Conclusions • We presented the bottlenecks of linked-list queues and addressed them • Similar techniques may already be in use in some production systems • Our aim was to present these ideas to the research community • Further research • Our ideas can be extended to asynchronous packet buffers • This removes the need to segment packets into cells • Improved space and bandwidth efficiency