ECE 526 – Network Processing Systems Design

Network Processor Introduction
Chapters 11 and 12: D. E. Comer

Goal
• Understand the inefficiency of 1st-, 2nd-, and 3rd-generation network processing systems
  • Scalability plus flexibility
• Recognize the necessity of a new solution: the 4th generation (network processor technology)
• Learn
  • the courage to appreciate the challenges
  • the skill to characterize the “real” problem
  • the art to propose an engineering solution
• Be aware that “network processor” is currently a conceptual and general term

Recall: 1st Generation
• 1st-generation network processing system
• Feasibility study
  • Design a software router
    • Data rate: 10 Gbps
    • Assume small packets (64 B)
    • Assume each packet needs 10,000 instructions to process
  • Can an Intel 80986@2007 do the job?
    • CPU: 24 GHz
    • MIPS: 125,000 (Million Instructions Per Second)
    • 1 billion transistors …
• Conclusion: not feasible
• What is the real problem here?
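The "not feasible" conclusion follows from a short calculation. The sketch below uses only the numbers on this slide. It assumes back-to-back 64-byte packets and ignores link framing overhead (an assumption made here for simplicity); the function and constant names are illustrative.

```python
def required_mips(link_bps, packet_bytes, instr_per_packet):
    """Return (packets per second, required MIPS) for a fully loaded link."""
    packets_per_sec = link_bps / (packet_bytes * 8)
    mips = packets_per_sec * instr_per_packet / 1e6
    return packets_per_sec, mips


LINK_BPS = 10e9            # 10 Gbps data rate (from the slide)
PACKET_BYTES = 64          # small packets (from the slide)
INSTR_PER_PACKET = 10_000  # instructions per packet (from the slide)
CPU_MIPS = 125_000         # MIPS of the CPU in question (from the slide)

pps, needed = required_mips(LINK_BPS, PACKET_BYTES, INSTR_PER_PACKET)
feasible = needed <= CPU_MIPS
print(f"{pps / 1e6:.2f} Mpps, {needed:,.0f} MIPS needed, feasible: {feasible}")
```

Under these assumptions the link delivers about 19.5 million packets per second, which requires about 195,000 MIPS, roughly 1.6 times what the CPU offers.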

The Real Problem
• Technology push: uneven
  • Link bandwidth is scaling much faster than CPU and memory technology
  • Transistor scaling and VLSI technology help, but not enough
• Application pull: harder
  • More complex applications are required
  • Processing complexity is defined as the number of instructions and the number of memory accesses needed to process one packet

What is the ideal platform?
• Structured ASIC
• Reconfigurable co-processors
• Network processor
• FPGA

2nd and 3rd Generations
• 2nd generation: offloading and decentralization
• 3rd generation: further offloading and use of specialized devices (ASIC + embedded processors)
• Problems: loss of flexibility and very high cost. Why?

Why not ASIC?
• High cost to develop
  • Network processing is a moderate-volume market
• Long time to market
  • Network processing services change quickly
• Difficult to simulate
  • Complex protocols
• Expensive and time-consuming to change
  • Little reuse across products
  • Limited reuse across versions
• No consensus on framework or supporting chips
• Requires expertise

Network Processors
• Question: where does an NP gain its higher performance, compared with a conventional processor?

Instruction Set: minimality
• Not as general as a RISC or CISC processor
  • E.g., no floating-point instructions
• Optimized for packet processing functions only
• Not specific to a protocol or part of a protocol
• Seek a minimal set of instructions sufficient to handle arbitrary protocols,
• plus specific instructions for protocol processing
  • Example: atomic operations
• A hard problem, to be covered later

Architecture: multiprocessor
• Parallelism
  • The network processing workload is highly parallel by nature
    • Flow-level
    • Queue-level
    • Packet-level
    • Protocol-level
• Pipelining
  • Pipelining improves system throughput at the cost of longer delay
  • Is this acceptable?
• System-on-chip
  • Processing: RISC cores
  • Memory: registers, cache, instruction store, scratch pad, SRAM, and SDRAM
  • I/O: network / switch fabric interfaces
• Question: how hard is it to build and use these NPs?
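The pipelining trade-off can be made concrete with a small model. This is a minimal sketch assuming an ideal pipeline with no stalls; the stage cycle counts and the per-stage handoff cost are hypothetical numbers chosen only for illustration.

```python
def pipeline_metrics(stage_cycles, handoff_cycles=0):
    """Model an ideal pipeline with one packet per stage and no stalls.

    Returns (latency, interval): the cycles one packet needs to traverse
    all stages, and the cycles between successive packet completions.
    """
    per_stage = [c + handoff_cycles for c in stage_cycles]
    latency = sum(per_stage)   # a packet visits every stage in turn
    interval = max(per_stage)  # the slowest stage limits throughput
    return latency, interval


stages = [40, 100, 100, 60]  # hypothetical cycles per stage

# One processor running all stages back to back: no handoff cost.
single_latency, single_interval = pipeline_metrics([sum(stages)])

# Four pipelined processors, each paying a hypothetical 10-cycle handoff.
pipe_latency, pipe_interval = pipeline_metrics(stages, handoff_cycles=10)

print(single_latency, single_interval, pipe_latency, pipe_interval)
```

With these numbers the pipeline completes a packet every 110 cycles instead of every 300, while the delay seen by each packet grows from 300 to 340 cycles.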

Typical Processing (figure not included)

Case Study: IPv4 Packet Forwarding
• 2-port router (2 Gbps)
• Xilinx Virtex-II Pro FPGA (2VP30)
• IP lookup:
  • Longest prefix match
  • (Trie lookup algorithm)
[Figure: task graph with nodes From (0), From (1), Lookup, IPRoute, To (0), and To (1); a lookup trie with a root and nodes a to e, index labels 000 to 003, FFE, FFF and 0, 1, F, and memory accesses 1, 2, 5, and 6 marked; a prefix table (hex : binary) whose entries were lost in extraction.]
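Longest prefix match with a trie can be sketched as follows. This is a plain binary trie (one bit per level), not necessarily the structure used in the case study, whose figure appears to index several bits per step. The class names, prefixes, and port names are hypothetical.

```python
import ipaddress


class TrieNode:
    __slots__ = ("children", "next_hop")

    def __init__(self):
        self.children = [None, None]  # child for bit 0 and for bit 1
        self.next_hop = None          # set only if a prefix ends here


class BinaryTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, prefix, next_hop):
        """Add an IPv4 prefix such as '10.1.0.0/16'."""
        net = ipaddress.ip_network(prefix)
        addr = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):
            bit = (addr >> (31 - i)) & 1
            if node.children[bit] is None:
                node.children[bit] = TrieNode()
            node = node.children[bit]
        node.next_hop = next_hop

    def lookup(self, address):
        """Return (next hop of the longest matching prefix, nodes read)."""
        addr = int(ipaddress.ip_address(address))
        node = self.root
        best = node.next_hop
        accesses = 1  # reading the root counts as one memory access
        for i in range(32):
            node = node.children[(addr >> (31 - i)) & 1]
            if node is None:
                break
            accesses += 1
            if node.next_hop is not None:
                best = node.next_hop  # a longer match overrides a shorter one
        return best, accesses


table = BinaryTrie()
table.insert("0.0.0.0/0", "port0")    # default route
table.insert("10.0.0.0/8", "port1")
table.insert("10.1.0.0/16", "port0")

print(table.lookup("10.1.2.3"))
print(table.lookup("10.2.0.0"))
print(table.lookup("192.168.0.1"))
```

Each trie level costs one memory access, so a one-bit-per-level trie can need many accesses per packet. Consuming several bits per step reduces the number of accesses at the cost of larger nodes.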

Multiprocessor for Header Processing
• FIFO queues
[Figure: block diagram labeled Packet Reception, Packet Transmission, four sets of Verify, Lookup-1, Lookup-2, and Transmit stages, FSL, BRAM, OPB, RS232, Timer, and LEDs.]

Typical Use of NPs (figure not included)

System Implementation Space (figure not included)

Memory Architecture
• Memory access bottleneck
  • Memory consumes a lot of chip area
    • Limited on-chip memory
  • Limited bandwidth to off-chip memory: pin and package cost
  • Off-chip memory access is slow: 100 cycles
• Possible solutions
  • Profile the application's memory access pattern
  • Propose a heterogeneous memory architecture
  • Memory-aware mapping
  • Transactional memory (project topic)
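A rough estimate shows why off-chip accesses dominate and why memory-aware mapping helps. The sketch below takes the 100-cycle off-chip latency from the slide; the single-cycle on-chip access, one cycle per instruction, no overlap of computation with memory access, and the workload numbers are all assumptions made for illustration.

```python
OFF_CHIP_CYCLES = 100  # off-chip access latency (from the slide)
ON_CHIP_CYCLES = 1     # assumption: single-cycle on-chip access


def cycles_per_packet(instructions, mem_accesses, on_chip_fraction):
    """Estimate cycles per packet when the processor stalls on every access.

    Assumes one cycle per instruction and no overlap between computation
    and memory access.
    """
    on_chip = mem_accesses * on_chip_fraction
    off_chip = mem_accesses - on_chip
    return (instructions
            + on_chip * ON_CHIP_CYCLES
            + off_chip * OFF_CHIP_CYCLES)


# Hypothetical workload: 500 instructions and 20 memory accesses per packet.
all_off_chip = cycles_per_packet(500, 20, on_chip_fraction=0.0)
mostly_on_chip = cycles_per_packet(500, 20, on_chip_fraction=0.75)

print(all_off_chip, mostly_on_chip)
```

In this example, serving three quarters of the accesses from on-chip memory cuts the per-packet cost from 2500 to 1015 cycles, even though the instruction count is unchanged.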

Application Mapping
• Current approach: fixed topology, assembly coding, and hand-tuning

Basic Steps for Mapping
• Application description
• High-level optimizations
• Task graph
• (Platform specific)
• Profile
• Architecture configuration
• HW / SW partitioning
• Task allocation
• Data layout
• Communication assignment
• Compilation / Synthesis
[Figure: task graph (From (0), From (1), Lookup, IPRoute, To (0), To (1)) mapped onto a platform of PE, FPGA, and MEM blocks.]
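As one illustration of the task allocation step, the sketch below assigns profiled tasks to processing elements with a simple greedy heuristic (heaviest task first, onto the least loaded PE). This heuristic is a stand-in chosen here for illustration, not a method prescribed by the course. The task names are borrowed from the header-processing case study, and the cycle counts are hypothetical.

```python
import heapq


def allocate_tasks(task_cycles, num_pes):
    """Greedily assign each task to the currently least loaded PE.

    task_cycles maps task name to profiled cycles per packet.
    Returns (assignment, loads): task name to PE index, and cycles per PE.
    """
    heap = [(0, pe) for pe in range(num_pes)]  # (current load, PE index)
    heapq.heapify(heap)
    assignment = {}
    # Place the heaviest tasks first so the lighter ones can fill the gaps.
    for task, cycles in sorted(task_cycles.items(), key=lambda kv: -kv[1]):
        load, pe = heapq.heappop(heap)
        assignment[task] = pe
        heapq.heappush(heap, (load + cycles, pe))
    loads = [0] * num_pes
    for task, pe in assignment.items():
        loads[pe] += task_cycles[task]
    return assignment, loads


# Hypothetical profile, in cycles per packet.
profile = {"Verify": 40, "Lookup-1": 100, "Lookup-2": 100, "Transmit": 60}
assignment, loads = allocate_tasks(profile, num_pes=2)

print(assignment, loads)
```

The sketch balances computation only. It ignores communication cost and data layout, which the slide lists as separate steps and which can change the best allocation.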

Summary
• Network processor
  • Special-purpose, programmable hardware device
  • Optimized for network processing
  • Building block of network processing systems
• Fundamental ideas
  • Flexibility through programmability
  • Scalability through parallelism and pipelining
• Here, NP is a concept
  • We will study an example network processor soon

For Next Class & Announcements
• Read Comer: Chapters 13 and 14
• Lab 1 total grade reduced to 82
• HW 1 due Wed.
• Project topic will be announced after Wed.
