A Deficit Round Robin 20MB/s Layer 2 Switch
Muraleedhara Navada, Francois Labonte
Fairness in Switches: Output Queued Switch
• How to provide fair bandwidth allocation at the output link?
• A simple FIFO favors greedy flows
• Separate flows into per-flow FIFOs at the output
• Bit-by-bit fair queuing
• Weighted Fair Queuing allows a different weight for each flow
• Packetized Weighted Fair Queuing (aka PGPS) calculates a departure time for each packet (see the sketch below)
[Figure: round-robin bit-by-bit allocation across per-flow output FIFOs, example packet sizes 50, 100, 150]
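A minimal sketch of the PGPS idea mentioned above, under our own simplifying assumptions (the class and variable names are ours, and virtual-time tracking is reduced to a single counter): each packet gets a virtual finish time of max(virtual time, flow's last finish) + length / weight, and packets leave in order of finish time.

```python
import heapq

class PgpsScheduler:
    def __init__(self, weights):
        self.weights = weights                      # flow_id -> weight
        self.last_finish = {f: 0.0 for f in weights}
        self.virtual_time = 0.0                     # simplified GPS virtual clock
        self.queue = []                             # (finish, seq, flow_id, length)
        self.seq = 0

    def enqueue(self, flow_id, length):
        # Finish time of this packet under the fluid (bit-by-bit) model
        start = max(self.virtual_time, self.last_finish[flow_id])
        finish = start + length / self.weights[flow_id]
        self.last_finish[flow_id] = finish
        heapq.heappush(self.queue, (finish, self.seq, flow_id, length))
        self.seq += 1

    def dequeue(self):
        # Transmit the packet that would finish first under GPS
        finish, _, flow_id, length = heapq.heappop(self.queue)
        self.virtual_time = finish                  # simplification of virtual time
        return flow_id, length
```

The per-packet finish-time computation and the sorted queue are what make PGPS costly in hardware, which motivates the Deficit Round Robin approach on the next slide.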
Deficit Round Robin Credits
• Packetized Weighted Fair Queuing is complicated to implement
• Deficit Round Robin keeps track of credits for each flow
• A flow sends according to its credits
• Credits are added according to the flow's weight
• Essentially PWFQ at a coarser granularity (see the sketch below)
[Figure: per-flow queues (packet sizes 50, 100, 150) with credit counters shown at successive time steps as packets are dequeued]
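A sketch of one DRR round as described on this slide (the function name, data layout, and quantum value are ours, not the design's): each backlogged flow earns quantum × weight credits per round and sends head-of-line packets while its credit covers their length.

```python
from collections import deque

def drr_round(flows, quantum=50):
    """flows: flow_id -> {'queue': deque of packet lengths,
                          'weight': int, 'deficit': int}"""
    sent = []
    for flow_id, f in flows.items():
        if not f['queue']:
            continue                          # idle flows earn no credit
        f['deficit'] += quantum * f['weight'] # credits added according to weight
        while f['queue'] and f['queue'][0] <= f['deficit']:
            length = f['queue'].popleft()
            f['deficit'] -= length            # flow sends according to its credits
            sent.append((flow_id, length))
        if not f['queue']:
            f['deficit'] = 0                  # reset credit when the flow goes idle
    return sent
```

Because only an add, a compare, and a subtract are needed per packet, this approximates weighted fairness without computing per-packet departure times.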
NetFPGA System
• 8-port 10MB/s duplex Ethernet
• Control FPGA (CFPGA) handles the physical interface (MAC)
• Our design targets both User FPGAs (UFPGA0 and UFPGA1)
[Block diagram: the CFPGA connects the 10MB/s Ethernet ports to UFPGA0 and UFPGA1, with 1MB SRAM banks attached]
Design Considerations
• 4 MAC addresses behind each of the 8 ports
• Each flow is a unique Source Address – Destination Address pair
• ~1024 flows, split across the two FPGAs (a mapping sketch follows below)
• Each UFPGA reads incoming packets from a different set of ports (0-3 and 4-7)
• Tradeoff between memory storage and fairness across all flows
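The slides define a flow as a (source MAC, destination MAC) pair with roughly 1024 flows split across the two UFPGAs, but do not spell out how a pair is mapped to a flow entry. Purely as an illustration (the hash and table size here are assumptions, not the design's), one simple scheme hashes the address pair into a per-chip table of 512 entries:

```python
def flow_id(src_mac: bytes, dst_mac: bytes, table_size: int = 512) -> int:
    """Illustrative mapping of an SA-DA pair to one of 512 per-chip flow slots."""
    h = 0
    for b in src_mac + dst_mac:
        h = (h * 31 + b) & 0xFFFFFFFF    # simple multiplicative hash
    return h % table_size
```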
Memory Buffer Allocation
• Static partitioning of the 1MB SRAM across 512 flows gives 2 kbytes per flow, less than 2 maximum-size packets
• Need a more dynamic allocation
• Segments: a smaller size means less fragmentation, but more pointer and list handling overhead
• 128 bytes was chosen
• Keep a free-segment list
• Save only the head and tail pointer of each flow on chip (see the sketch below)
[Figure: packets P1-P6 spread across fixed 128-byte segments]
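A sketch of the segment scheme described on this slide, under our own naming (the class and method names are assumptions): the 1MB SRAM is carved into 128-byte segments, a linked free list tracks unused segments, and only each flow's head and tail segment pointers are kept on chip.

```python
SEGMENT_SIZE = 128
NUM_SEGMENTS = (1 << 20) // SEGMENT_SIZE   # 1 MB SRAM / 128 B = 8192 segments

class SegmentPool:
    def __init__(self):
        self.next = [i + 1 for i in range(NUM_SEGMENTS)]  # linked free list
        self.next[-1] = -1
        self.free_head = 0
        self.flow_head = {}    # flow_id -> first segment of its packet list
        self.flow_tail = {}    # flow_id -> last segment of its packet list

    def alloc(self, flow_id, packet_len):
        """Reserve enough 128-byte segments for one packet of this flow."""
        needed = -(-packet_len // SEGMENT_SIZE)   # ceiling division
        for _ in range(needed):
            seg = self.free_head
            if seg == -1:
                raise MemoryError("out of segments")
            self.free_head = self.next[seg]       # pop from the free list
            self.next[seg] = -1
            if flow_id in self.flow_tail:
                self.next[self.flow_tail[flow_id]] = seg   # append to flow's list
            else:
                self.flow_head[flow_id] = seg              # first segment of the flow
            self.flow_tail[flow_id] = seg
```

The choice of 128 bytes trades internal fragmentation (a minimum-size packet wastes part of one segment) against the number of pointers that must be followed per maximum-size packet.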
MAC Address Learning
• Instead of being told which MAC addresses belong to which port, learn them from the source address of incoming frames
• Note that our split-FPGA design (each chip reading from different ports) requires the two UFPGAs to share the MACs they have learned
• When the destination MAC has not been learned yet, broadcast (send to all other ports)
• So MAC learning implies broadcast capability (see the sketch below)
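A minimal sketch of the learning behaviour described above (class and method names are ours): the port of each source MAC is learned from incoming traffic, and a frame for an unknown destination is flooded to every other port.

```python
class MacLearner:
    def __init__(self, num_ports=8):
        self.table = {}            # MAC address -> port it was learned on
        self.num_ports = num_ports

    def handle_frame(self, src_mac, dst_mac, in_port):
        self.table[src_mac] = in_port          # learn/refresh the source address
        if dst_mac in self.table:
            return [self.table[dst_mac]]       # forward to the learned port
        # unknown destination: broadcast to all other ports
        return [p for p in range(self.num_ports) if p != in_port]
```

In the split design, the equivalent of updating `self.table` has to be mirrored on the other UFPGA, since each chip only sees the source addresses arriving on its own four ports.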
Read Operation
[Block diagram of the read path: Master Control, MAC Learning / Flow Assignment, CFPGA Interface, Control Handler, Packet Memory Manager, DRR Engine, and the 1MB SRAM, with signals Read/port, Share SA, DA/SA, Flow ID, Flow Tail, and Length/ptr]
Write Operation
[Block diagram of the write path: Master Control, MAC Learning / Flow Assignment, CFPGA Interface, Control Handler, Packet Memory Manager, DRR Engine, and the 1MB SRAM, with signals Write/port, Port REQ, Port GNT, Data Ready, Head/length, and Next head/length/latency]
DRR Engine
• How to handle 512 flows and stay work conserving:
• Only one flow is active at any time
• DRR credit allocation happens on dequeuing
• Per-port FIFOs contain the next flow to be serviced for each port (see the sketch below)
• Statistics per flow: weight, latency, bytes sent, packets sent, packets active
[Block diagram: flow data held in a 512 x 160-bit SRAM, with one FIFO per port (0-7) feeding the DRR engine]
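A sketch of the dequeue-side scheduling this slide describes, under our own naming and a placeholder quantum: each output port keeps a FIFO of flow IDs waiting for service, and DRR credits are granted only when a flow reaches the head of that FIFO, so the engine stays work conserving without walking all 512 flow entries.

```python
from collections import deque

def service_port(port_fifo: deque, flows: dict, quantum: int = 50):
    """flows: flow_id -> {'queue': deque of packet lengths,
                          'weight': int, 'deficit': int}"""
    if not port_fifo:
        return []                            # nothing queued on this output port
    flow_id = port_fifo.popleft()            # only one flow active at a time
    f = flows[flow_id]
    f['deficit'] += quantum * f['weight']    # DRR credit granted at dequeue time
    sent = []
    while f['queue'] and f['queue'][0] <= f['deficit']:
        length = f['queue'].popleft()
        f['deficit'] -= length
        sent.append((flow_id, length))
    if f['queue']:
        port_fifo.append(flow_id)            # still backlogged: requeue at the tail
    else:
        f['deficit'] = 0                     # idle flow keeps no leftover credit
    return sent
```

Per-flow counters such as bytes sent and packets sent would be updated alongside the deficit in the same flow-state record.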
Conclusion • A Deficit Round Robin Switch with 1k flows has been implemented • Provides dynamic memory buffer allocation, MAC learning and broadcast • Parallel design split across 2 chips • Gathers statistics on flows