
Addressing Queuing Bottlenecks at High Speeds

Presentation Transcript


  1. Addressing Queuing Bottlenecks at High Speeds Sailesh Kumar Patrick Crowley Jonathan Turner

  2. Agenda and Overview • In this paper, we • Introduce the potential bottlenecks in high-speed queuing systems • Address only the bottlenecks associated with off-chip SRAM • Overview of the queuing system • Bottlenecks • Our solution • Conclusion

  3. High Speed Packet Queuing Systems • Packet queues are crucial for isolating traffic (QoS, etc.) • Modern systems can have thousands or even millions of queues • Queues must operate at link rates (OC-192, OC-768, …) • Memory must be used efficiently • Dynamic sharing among queues • Minimal waste due to fragmentation, etc. • Power consumption, chip area, cost, … • Most shared-memory queuing systems consist of: • A packet buffer • A queuing subsystem that stores per-queue control information • Supports enqueue and dequeue • A scheduler that makes dequeue decisions

  4. High Speed Packet Queuing Systems • Packet queues are crucial for isolating traffic (QoS, etc.) • Modern systems can have thousands or even millions of queues • Queues must operate at link rates (OC-192, OC-768, …) • Memory must be used efficiently • Dynamic sharing among queues • Minimal waste due to fragmentation, etc. • Power consumption, chip area, cost, … • Most shared-memory queuing systems consist of: • A packet buffer • A queuing subsystem that stores per-queue control information • Supports enqueue and dequeue • A scheduler that makes dequeue decisions • In this work we concentrate on the queuing subsystem

  5. Assumptions • We consider only hardware implementations • Shared-memory linked-list queues • Packets are broken into fixed-size cells before being stored • Implicit mapping • Explained shortly • Separate memories for • Packet storage (DRAM) • Queue linked lists (SRAM) • Queue descriptors (head, tail, etc.)
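
To make the per-queue state on this slide concrete, here is a minimal C sketch of a shared-memory linked-list queue. The implicit mapping is assumed here to mean that cell i of the DRAM buffer corresponds to entry i of the SRAM next-pointer array; all names, sizes, and field widths are illustrative assumptions, not the authors' implementation.

```c
#include <stdint.h>

#define NUM_CELLS  (1u << 20)    /* cells in the DRAM packet buffer (assumed) */
#define NUM_QUEUES (1u << 16)    /* number of logical queues (assumed)        */

/* Implicit mapping (as assumed here): cell i of the DRAM buffer corresponds
 * to entry i of the SRAM next-pointer array, so no buffer address is stored. */
static uint32_t next_cell[NUM_CELLS];    /* SRAM: one next pointer per cell   */

typedef struct {
    uint32_t head;    /* first cell of the queue */
    uint32_t tail;    /* last cell of the queue  */
    uint32_t length;  /* cells currently queued  */
} queue_desc_t;

static queue_desc_t qdesc[NUM_QUEUES];   /* queue-descriptor memory           */

/* Enqueue one cell that has already been written into DRAM slot 'cell'.      */
void enqueue_cell(uint32_t q, uint32_t cell)
{
    if (qdesc[q].length == 0)
        qdesc[q].head = cell;
    else
        next_cell[qdesc[q].tail] = cell;  /* one SRAM write                    */
    qdesc[q].tail = cell;
    qdesc[q].length++;
}

/* Dequeue one cell (caller ensures the queue is non-empty); returns its slot. */
uint32_t dequeue_cell(uint32_t q)
{
    uint32_t cell = qdesc[q].head;
    qdesc[q].head = next_cell[cell];      /* one dependent SRAM read: this is
                                             the latency bottleneck of slide 7 */
    qdesc[q].length--;
    return cell;
}
```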

  6. A shared memory queuing subsystem

  7. Limitations of such a Queuing subsystem • Dequeue throughput is determined by SRAM latency • SRAM latency > 20 ns; with 64-byte cells this limits throughput to about 25 Gbps • This might not be adequate in systems • with speedup • with multipass support • Performance of fair queuing (scheduling) algorithms • Most algorithms require packet lengths to make decisions • Since packet lengths are stored in SRAM, each scheduling decision takes at least 20 ns • Cost • The per-bit cost of SRAM is still about 100 times that of DRAM • A 512 MB packet buffer needs 40 MB of SRAM (64 B cells) • The SRAM can cost 8 times as much as the DRAM • 12 SRAM chips versus 4 DRAM chips (power, area, etc.)
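
As a quick check of the 25 Gbps figure above (the latency and cell size come from the slide; the rest is arithmetic): each dequeue must complete one dependent SRAM read of the next pointer before the following dequeue can start, so throughput is bounded by one 64-byte cell per 20 ns, i.e. 64 × 8 bits / 20 ns ≈ 25.6 Gb/s, which falls short of the roughly 40 Gb/s OC-768 line rate.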

  8. Scope for Improvement • Reduce the impact of SRAM latency • Reduce SRAM bandwidth • Is it possible to control the cost by replacing the SRAM with DRAM? • With all of the above, is it possible to ensure good worst-case throughput?

  9. Buffer Aggregation • Each node of the linked list maps to multiple buffers • An occupancy count (offset) tracks the fill level of each node • Note that only the head and tail nodes can be partially filled • Fewer SRAM accesses are needed (1/X of the time on average)
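
A rough C sketch of the aggregated layout described on this slide, assuming X = 4 cells per linked-list node; the structure and function names are illustrative, and the free-node allocation itself is elided.

```c
#include <stdint.h>

#define X 4    /* cells (buffers) per linked-list node (assumed value) */

/* With aggregation, SRAM holds one next pointer per X-cell node rather than
 * one per cell, so pointer memory and SRAM accesses shrink by about X.        */
typedef struct {
    uint32_t next_node;          /* SRAM: next pointer, one per node           */
} agg_node_t;

typedef struct {
    uint32_t head_node, tail_node;
    uint8_t  head_offset;        /* fill level / next cell to read at the head */
    uint8_t  tail_offset;        /* next free cell slot in the tail node       */
    uint32_t cell_count;
} agg_queue_desc_t;

/* Enqueue bookkeeping: the cell itself goes into DRAM slot
 * (tail_node, tail_offset).  Only when the tail node fills up is a fresh node
 * linked in SRAM, i.e. roughly 1 out of every X enqueues touches SRAM.        */
int enqueue_touches_sram(agg_queue_desc_t *q)
{
    q->cell_count++;
    q->tail_offset++;
    if (q->tail_offset < X)
        return 0;                /* current tail node still has room           */
    q->tail_offset = 0;          /* tail node full: link a new node next       */
    return 1;                    /* one SRAM write per X enqueues on average   */
}
```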

  10. Buffer Aggregation • To prove its effectiveness, we must consider two cases • 1. queues that remain near-empty • 2. backlogged queues with near-empty head nodes

  11. Buffer Aggregation • To prove its effectiveness, we must consider two cases • 1. queues that remain near-empty (not a difficult case, since no SRAM accesses are required) • 2. backlogged queues with near-empty head nodes

  12. Buffer Aggregation • To prove its effectiveness, we must consider two cases • 1. queues that remain near-empty • 2. backlogged queues with near-empty head nodes (a legitimate concern: throughput can fall to that of a system without buffer aggregation)

  13. Buffer Aggregation • To prove its effectiveness, we must consider two cases • 1. queues that remain near-empty (not a difficult case, since no SRAM accesses are required) • 2. backlogged queues with near-empty head nodes (a legitimate concern: throughput can fall to that of a system without buffer aggregation) • We show that by adding a few request queues, performance remains high with high probability

  14. Queuing model with request queues • The arrival process makes enqueue requests • Requests are stored in the enqueue request queue • The departure process makes dequeue requests • Requests requiring an SRAM access are stored in the dequeue request queue • The output queue holds the dequeued cells • An elastic buffer holds a few free nodes
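
One way to picture this arrangement in code: a small C sketch of the four FIFOs named on the slide, with an assumed depth of 16 entries; the names and the FIFO implementation are illustrative only.

```c
#include <stdint.h>
#include <stdbool.h>

#define RQ_DEPTH 16    /* request-queue depth; slide 15 suggests 8-16 entries */

/* A fixed-size FIFO used for each of the small queues named on the slide.    */
typedef struct {
    uint32_t entry[RQ_DEPTH];
    unsigned head, count;
} fifo_t;

static fifo_t enqueue_rq;   /* pending enqueue requests                       */
static fifo_t dequeue_rq;   /* dequeue requests that still need an SRAM read  */
static fifo_t output_q;     /* dequeued cells awaiting transmission           */
static fifo_t elastic_buf;  /* a few pre-fetched free node indices            */

static bool fifo_push(fifo_t *f, uint32_t v)
{
    if (f->count == RQ_DEPTH)
        return false;                     /* request queue full               */
    f->entry[(f->head + f->count) % RQ_DEPTH] = v;
    f->count++;
    return true;
}

static bool fifo_pop(fifo_t *f, uint32_t *v)
{
    if (f->count == 0)
        return false;
    *v = f->entry[f->head];
    f->head = (f->head + 1) % RQ_DEPTH;
    f->count--;
    return true;
}
```

The point of the extra FIFOs is to decouple the fixed-rate arrival and departure processes from the variable-latency SRAM, so that only the fraction of operations that cross a node boundary ever waits on an SRAM access.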

  15. Queuing model with request queues • We use a discrete-time Markov model and show that relatively small request queues (8-16 entries) ensure good throughput with very high probability • Experimental results for a few example systems are shown in the paper
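
The paper's analysis is not reproduced here; as a loose intuition for why small request queues rarely fill, a toy discrete-time simulation is sketched below. The arrival probability 1/X, the assumed SRAM latency of 2 cell times, and the stall accounting are all simplifying assumptions, not the paper's model.

```c
#include <stdio.h>
#include <stdlib.h>

#define RQ_DEPTH 16          /* dequeue request-queue size under study        */
#define X        4           /* cells per aggregated node (assumed)           */
#define SLOTS    10000000L   /* simulated cell times                          */

int main(void)
{
    const int L = 2;         /* assumed SRAM latency, in cell times           */
    int occupancy = 0;
    long stalls = 0;

    srand(1);
    for (long t = 0; t < SLOTS; t++) {
        /* With probability ~1/X the dequeue issued in this slot crosses a
         * node boundary and must join the SRAM request queue.                */
        if (rand() < RAND_MAX / X) {
            if (occupancy < RQ_DEPTH)
                occupancy++;
            else
                stalls++;    /* queue full: the departure process would stall */
        }
        /* The SRAM returns one outstanding next pointer every L cell times.  */
        if (t % L == 0 && occupancy > 0)
            occupancy--;
    }
    printf("fraction of slots stalled: %g\n", (double)stalls / SLOTS);
    return 0;
}
```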

  16. Queue Descriptor size • Note that every list node now consists of X buffers • Potentially X packets can be stored in the head node • Thus the queue descriptor may need to hold the lengths of all X packets • Such a large queue descriptor memory might not be desirable • We introduce a clever encoding of packet lengths and boundaries, which results in coarse-grained scheduling

  17. Queue Descriptor size • Note that every list node now consists of X buffers • Potentially X packets can be stored in the head node • Thus the queue descriptor may need to hold the lengths of all X packets • Such a large queue descriptor memory might not be desirable • We introduce a clever encoding of packet lengths and boundaries, which results in coarse-grained scheduling
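
To make the size concern concrete, here is a C sketch contrasting a naive descriptor that stores one length per possible head-node packet with a cell-count-only descriptor of the kind coarse-grained scheduling allows; field widths and names are assumptions for illustration, not the paper's encoding.

```c
#include <stdint.h>

#define X 4    /* cells per node (assumed) */

/* Naive descriptor: an explicit length for every packet that might begin in
 * the head node, so the descriptor grows linearly with X.                    */
typedef struct {
    uint32_t head_node, tail_node;
    uint16_t pkt_len[X];
} naive_desc_t;

/* Coarse-grained descriptor: only a cell count per queue; the scheduler
 * charges queues in cells and corrects the error over an X-cell window.      */
typedef struct {
    uint32_t head_node, tail_node;
    uint32_t cell_count;     /* size independent of X                         */
} coarse_desc_t;
```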

  18. Coarse grained scheduling • Coarse-grained scheduling makes scheduling decisions based on cell counts • This may cause a short-term error • The error can easily be compensated within an X-cell window • Thus long-term fairness is ensured • The jitter remains negligible for all practical purposes • [Figure annotation: log2(XC+P), marked as not needed]
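
As an illustration of how a short-term error can be paid back over a window, here is a minimal deficit-style round-robin sketch in C that charges queues in cells rather than exact bytes; deficit round robin is used only as a familiar stand-in for a fair queuing algorithm and is not claimed to be the paper's scheduler.

```c
#include <stdint.h>

#define MAX_PKTS 64    /* queued packets kept in this toy sketch */

/* Minimal per-queue state: queued packet sizes expressed in cells, plus a
 * running deficit (service owed) carried across scheduling rounds.           */
typedef struct {
    uint32_t pkt_cells[MAX_PKTS];
    unsigned head, count;
    int32_t  deficit_cells;
} cq_t;

/* One round-robin visit: grant a quantum in cells and send whole packets
 * while the coarse cell budget lasts.  Any overshoot is remembered in
 * deficit_cells and repaid on later visits, so the short-term error stays
 * bounded while long-term fairness is preserved.                             */
void serve_queue(cq_t *q, int32_t quantum_cells)
{
    q->deficit_cells += quantum_cells;
    while (q->count > 0 && q->deficit_cells > 0) {
        uint32_t len = q->pkt_cells[q->head];  /* length in cells, not bytes  */
        q->head = (q->head + 1) % MAX_PKTS;
        q->count--;
        q->deficit_cells -= (int32_t)len;
    }
    if (q->count == 0)
        q->deficit_cells = 0;                  /* idle queues keep no credit  */
}
```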

  19. Conclusions • We presented the bottlenecks of linked-list queues and ways to address them • This work might already be in use in some production systems • Our aim was to present these ideas to the research community • Further research • Our ideas can be extended to asynchronous packet buffers • This removes the need to segment packets into cells • Improved space and bandwidth efficiency

  20. Questions?
