NetFPGA Project: 4-Port Layer 2/3 Switch Ankur Singla (asingla@stanford.edu) Gene Juknevicius (genej@stanford.edu)
Agenda • NetFPGA Development Board • Project Introduction • Design Analysis • Bandwidth Analysis • Top Level Architecture • Data Path Design Overview • Control Path Design Overview • Verification and Synthesis Update • Conclusion
Project Introduction • 4 Port Layer-2/3 Output Queued Switch Design • Ethernet (Layer-2), IPv4, ICMP, and ARP • Programmable Routing Tables – Longest Prefix Match, Exact Match • Register support for Switch Fwd On/Off, Statistics, Queue Status, etc. • Layer-2 Broadcast, and limited Layer-3 Multicast support • Limited support for Access Control • Highly Modular Design for future expandability
Bandwidth Analysis • Available Data Bandwidth • Memory bandwidth: 32 bits * 25 MHz = 800 Mbits/sec • CFPGA to Ingress FIFO/Control Block bandwidth: 32 bits * 25 MHz / 4 = 200 Mbits/sec • Packet Queue to Egress bandwidth: 32 bits * 25 MHz / 4 = 200 Mbits/sec • Packet Processing Requirements • 4 ports operating at 10 Mbits/sec => 40 Mbits/sec aggregate • Minimum-size packet: 64 bytes => 512 bits • 512 bits / 40 Mbits/sec = 12.8 us per packet • Internal clock is 25 MHz • 12.8 us * 25 MHz = 320 clocks to process one packet
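The arithmetic on this slide can be rechecked with a few lines of Python. This is only a sketch of the calculation; the constant and variable names are made up for illustration, and the values are the ones quoted on the slide.

```python
# Figures quoted on the Bandwidth Analysis slide (names are illustrative).
BUS_WIDTH_BITS = 32
CLOCK_HZ = 25_000_000
PORTS = 4
PORT_RATE_BPS = 10_000_000
MIN_PACKET_BYTES = 64

# Available bandwidth.
memory_bw_bps = BUS_WIDTH_BITS * CLOCK_HZ      # full-rate memory interface
shared_path_bw_bps = memory_bw_bps // 4        # the slide divides each path by 4

# Processing requirement for back-to-back minimum-size packets.
aggregate_bps = PORTS * PORT_RATE_BPS
min_packet_bits = MIN_PACKET_BYTES * 8
packet_time_s = min_packet_bits / aggregate_bps
cycles_per_packet = min_packet_bits * CLOCK_HZ // aggregate_bps
packets_per_second = aggregate_bps / min_packet_bits

print(f"memory bandwidth   : {memory_bw_bps / 1e6:.0f} Mbit/s")
print(f"shared path        : {shared_path_bw_bps / 1e6:.0f} Mbit/s")
print(f"time per packet    : {packet_time_s * 1e6:.1f} us")
print(f"clocks per packet  : {cycles_per_packet}")
print(f"packets per second : {packets_per_second:.0f}")
```

The packets-per-second figure that falls out of the same numbers (about 78 thousand) lines up with the 78 Kpps performance requirement quoted for the Packet Processing Engine.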
Data Flow Diagram • Output Queued Shared Memory Switch • Round Robin Scheduling • Packet Processing Engine provides L2/L3 functionality • Coarse Pipelined Arch. at the Block Level
Master Arbiter • Round Robin Scheduling of service to Each Input and Output • Interfaces Rest of the Design with Control FPGA • Co-ordinates activities of all high level blocks • Maintains Queue Status for each Output
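As a rough illustration of round-robin scheduling, the sketch below grants service to the next requester after the one served last. It is a software model under assumed names (`RoundRobinArbiter`, `grant`), not the arbiter logic from the actual design, which the slide does not describe in detail.

```python
class RoundRobinArbiter:
    """Hypothetical software model of a round-robin arbiter."""

    def __init__(self, num_requesters):
        self.num_requesters = num_requesters
        # Start so that requester 0 is the first one considered.
        self.last_granted = num_requesters - 1

    def grant(self, requests):
        """Return the index of the requester to serve, or None if idle.

        `requests` is a list of booleans, one per requester.
        """
        for offset in range(1, self.num_requesters + 1):
            candidate = (self.last_granted + offset) % self.num_requesters
            if requests[candidate]:
                self.last_granted = candidate
                return candidate
        return None


arbiter = RoundRobinArbiter(4)
pending = [True, True, False, True]
order = [arbiter.grant(pending) for _ in range(4)]
print(order)
```

Because the search always starts one position past the last grant, no requester that keeps asserting its request can be starved by the others.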
Ingress FIFO Control Block • Interfaces three blocks • Control FPGA • Forwarding Engine • Packet Buffer Controller • Dual Packet Memories for coarse pipelining • Responsible for Packet Replication for Broadcast
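One common way to use two packet memories for coarse pipelining is ping-pong buffering: one memory is filled while the other is drained. The sketch below models that idea in software. It assumes the two memories alternate strictly, which the slide does not state, and the class and method names are invented for the example.

```python
class PingPongBuffers:
    """Hypothetical model of two packet memories used alternately."""

    def __init__(self):
        self.buffers = [None, None]
        self.write_index = 0  # memory the ingress side fills next
        self.read_index = 0   # memory the forwarding side drains next

    def write(self, packet):
        """Store a packet; return False if the target memory is still full."""
        if self.buffers[self.write_index] is not None:
            return False  # back-pressure towards the ingress side
        self.buffers[self.write_index] = packet
        self.write_index ^= 1
        return True

    def read(self):
        """Return the oldest stored packet, or None if nothing is ready."""
        packet = self.buffers[self.read_index]
        if packet is None:
            return None
        self.buffers[self.read_index] = None
        self.read_index ^= 1
        return packet


memories = PingPongBuffers()
memories.write("pkt-a")
memories.write("pkt-b")
print(memories.read(), memories.read())
```

With two memories, the forwarding side can work on one packet while the next is still being received, so neither side has to wait for the other on every packet.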
Packet Processing Engine Overview • Goals • Features – L3/L2/ICMP/ARP Processing • Performance Requirements – 78 Kpps • Fit within 60% of a Single User FPGA (UFPGA) Block • Modularity / Scalability • Verification / Design Ease • Actual • Support for all required features + L2 broadcast, L3 multicast, LPM, Statistics and Policing (coarse access control) • Performance Achieved – 234 Kpps (worst case 69 Kpps for 1500-byte ICMP echo requests) • Requires only 12% of Single UFPGA resources • Highly Modular Design for design/verification/scalability ease
Pkt Processing Engine Block Diagram • Blocks: First Level Parsing, Statistics and Policing, ARP Processing, L3 Processing, ICMP Processing, L2 Processing, Forwarding Master State Machine • Interfaces: From CFPGA, Packet Memory0, Packet Memory1, Native Packet, To Packet Buffer
Forwarding Master State Machine • Responsible for controlling individual processing blocks • Request/Grant Scheme for future expandability • Initiates a Request for a Packet to the Ingress FIFO and then assigns it to the responsible agents based on packet contents • MSM can be replicated to provide more throughput
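The assignment "based on packet contents" can be pictured as a classifier on the Ethernet and IPv4 headers. The sketch below is an assumption about how such a dispatch might look, not the state machine from the design: it assumes untagged Ethernet II frames, uses the standard EtherType and IP protocol numbers, and the function name `classify` and the agent labels are invented. The ICMP rule and the fall-back to L2 follow the L3 and L2 Processing Engine slides.

```python
import struct

ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_ARP = 0x0806
IP_PROTO_ICMP = 1

ETH_HEADER_LEN = 14
IPV4_MIN_HEADER_LEN = 20


def classify(frame, router_ips):
    """Pick a processing agent for a raw, untagged Ethernet II frame.

    `router_ips` is a set of 4-byte addresses owned by the router.
    Returns one of "ARP", "ICMP", "L3" or "L2" (hypothetical labels).
    """
    if len(frame) < ETH_HEADER_LEN:
        return "L2"
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype == ETHERTYPE_ARP:
        return "ARP"
    if ethertype == ETHERTYPE_IPV4 and len(frame) >= ETH_HEADER_LEN + IPV4_MIN_HEADER_LEN:
        protocol = frame[ETH_HEADER_LEN + 9]                    # IPv4 protocol field
        dest_ip = frame[ETH_HEADER_LEN + 16:ETH_HEADER_LEN + 20]  # IPv4 destination
        if protocol == IP_PROTO_ICMP and dest_ip in router_ips:
            return "ICMP"
        return "L3"
    # Anything unrecognised or malformed falls back to plain L2 switching.
    return "L2"
```

A call such as `classify(frame, {bytes([10, 0, 0, 1])})` returns "ICMP" only when the frame is an IPv4 ICMP packet addressed to that router address; every other IPv4 packet goes to "L3".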
L3 Processing Engine • Parsing of the L3 Information: • Src/Dest Addr, Protocol Type, Checksum, Length, TTL • Longest Prefix Match Engine • Mask Bits to represent the prefix. Lookup Key is Dest Addr • Associated Info Table (AIT) Indexed using the entry hit • AIT provides Destination Port Map, Destination L2 Addr, Statistics Bucket Index • Request/Done scheme to allow for expandability (e.g. future m-way Trie implementation project) • ICMP Support Engine Request (if Dest Addr is the Router's IP Address and Protocol Type is ICMP) • Total 85 cycles for Packet Processing, with 80% of the cycles spent on Table Lookup • If using a 4-way trie, total processing time can be reduced to less than 30 cycles
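A mask-based longest prefix match with an Associated Info Table can be sketched as follows. The table contents, field names and the sequential scan are all assumptions made for the example; the slide does not say how the hardware searches its table.

```python
import ipaddress


def ip(text):
    """Convert dotted-quad text to a 32-bit integer."""
    return int(ipaddress.IPv4Address(text))


def prefix_mask(length):
    """Mask bits for a prefix of the given length (0..32)."""
    return (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF


# Hypothetical routing entries: (prefix value, mask bits).
LPM_TABLE = [
    (ip("10.0.0.0"), prefix_mask(8)),
    (ip("10.1.0.0"), prefix_mask(16)),
    (ip("0.0.0.0"), prefix_mask(0)),  # default route
]

# Hypothetical AIT, indexed by the LPM entry that hit.
AIT = [
    {"port_map": 0b0001, "dest_l2_addr": "00:00:00:00:00:01", "stats_bucket": 0},
    {"port_map": 0b0010, "dest_l2_addr": "00:00:00:00:00:02", "stats_bucket": 1},
    {"port_map": 0b1000, "dest_l2_addr": "00:00:00:00:00:03", "stats_bucket": 2},
]


def lpm_lookup(dest_addr):
    """Return the index of the matching entry with the longest prefix."""
    best_index = None
    best_mask = -1
    for index, (value, mask) in enumerate(LPM_TABLE):
        # For contiguous masks, a numerically larger mask is a longer prefix.
        if (dest_addr & mask) == (value & mask) and mask > best_mask:
            best_index, best_mask = index, mask
    return best_index


hit = lpm_lookup(ip("10.1.2.3"))
print(hit, AIT[hit])
```

Here 10.1.2.3 matches all three entries, and the /16 entry wins because its mask is the longest. A trie, as the slide suggests for a future project, would replace the scan with a few fixed-size steps on the destination address.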
L2 Processing Engine • If there are any processing problems with ARP, ICMP, and/or L3, then L2 switching is done • Exact Match Engine • Re-use of the LPM match engine but with Mask Bits set to all 1’s • Associated Info Table (AIT) Indexed using the entry hit • AIT provides Destination Port Map and Statistics Bucket Index • Request/Done scheme to allow for expandability (e.g. future Hash implementation project) • Learning Engine removed because of Switch/Router Hardware Verification problems (HP Switch bug) • Total 76 cycles for Packet Processing, with over 80% of the cycles spent on Table Lookup • If using a Hashing Function, total processing time can be reduced to less than 20 cycles
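Setting every mask bit to 1 turns a masked match into an exact match, which is why the LPM engine can be reused for L2. The sketch below shows that, next to a hash-based lookup of the kind the slide mentions as a future project. Table contents and all names are invented for the example, and a Python dict stands in for whatever hashing scheme a hardware design would use.

```python
MAC_MASK_ALL_ONES = (1 << 48) - 1


def mac(text):
    """Convert 'aa:bb:cc:dd:ee:ff' to a 48-bit integer."""
    return int(text.replace(":", ""), 16)


# Hypothetical L2 entries: (MAC value, mask bits), every mask all 1's.
L2_TABLE = [
    (mac("00:11:22:33:44:55"), MAC_MASK_ALL_ONES),
    (mac("66:77:88:99:aa:bb"), MAC_MASK_ALL_ONES),
]

# Hypothetical AIT for L2: port map and statistics bucket only.
L2_AIT = [
    {"port_map": 0b0010, "stats_bucket": 0},
    {"port_map": 0b0100, "stats_bucket": 1},
]


def masked_match(key, table):
    """Same compare as a prefix match; all-1 masks make it exact."""
    for index, (value, mask) in enumerate(table):
        if (key & mask) == (value & mask):
            return index
    return None


# Hash-based alternative: one lookup instead of a scan over the table.
HASH_INDEX = {value: index for index, (value, _mask) in enumerate(L2_TABLE)}


def hashed_match(key):
    return HASH_INDEX.get(key)


key = mac("66:77:88:99:aa:bb")
print(masked_match(key, L2_TABLE), hashed_match(key))
```

Both functions return the same entry index, so the AIT lookup that follows is unchanged; only the cost of finding the index differs.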
Packet Buffer Interface • Interfaces with Master Arbiter and Forward Engine • Output Queued Switch • Statically Assigned • Single Queue per port • Off-chip ZBT SRAM on NetFPGA board
Control Block • Typical Register Rd/Wr Functionality • Status Register • Control Register (forwarding disable, reset) • Router’s IP Addresses (port 1-4) • Queue Size Registers • Statistics Registers • Layer-2 Table Programming Registers • Layer-3 Table Programming Registers
Verification • Three Levels of Verification Performed • Simulations: • Module Level – to verify the module design intent and bus functional model • System Level – using the NetFPGA verification environment for packet level simulations • Hardware Verification • Ported System Level tests to create tcpdump files for NetFPGA traffic server • Very good success on Hardware with all System Level tests passing. • Only one modification required (reset generation) after Hardware Porting • Demo - Greg can provide lab access to anyone interested
Synthesis Overview • Design was ported to Altera EP20K400 Device • Logic Elements Utilized – 5833 (35% of Total LEs) • RAM ESB Bits Used – 46848 (21% of Total ESB Bits) • Max Design Clock Frequency ~ 31 MHz • No Timing Violations
Conclusion • Easy to achieve “required” performance in an OQ Shared Memory Switch in NetFPGA • Modularity of the design allows more interesting and challenging future projects • Design/Verification Environment was essential to meet schedule • NetFPGA is an excellent design exploration platform