NetThreads: Programming NetFPGA with Threaded Software
Geoff Salmon, Monia Ghobadi, Yashar Ganjali (CS Dept.) · Martin Labrecque, Gregory Steffan (ECE Dept.)
University of Toronto
Real-Life Customers
• Hardware: NetFPGA board, 4 GigE ports, Virtex II Pro FPGA
• Collaboration with CS researchers:
  • interested in performing network experiments, not in coding Verilog
  • want to use the GigE links at maximum capacity
• Requirements:
  • an easy-to-program system
  • an efficient system
What would the ideal solution look like?
Envisioned System (Someday)
[Diagram: many processors exploiting data-level parallelism alongside hardware accelerators exploiting control-flow parallelism]
• Many compute engines
• Delivers the expected performance
• Hardware handles communication and synchronization
Processors inside an FPGA?
Soft Processors in FPGAs
[Diagram: an FPGA containing a processor, DDR controllers, and Ethernet MACs]
• Soft processors: processors implemented in the FPGA fabric
• Easier to program than HDL, and customizable
• FPGAs increasingly implement SoCs with CPUs
• Commercial soft processors: Nios II and MicroBlaze
What is the performance requirement?
Performance in Packet Processing
• The application defines the required throughput:
  • edge routing (≥ 1 Gbps/link)
  • home networking (~100 Mbps/link)
  • scientific instruments (< 100 Mbps/link)
• Our measure of throughput:
  • bisection search of the minimum packet inter-arrival time
  • must not drop any packet
Are soft processors fast enough?
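As an aside on methodology, the bisection search above can be sketched in a few lines of C. The harness function run_trial() and the 1 ns resolution are assumptions for illustration, not the authors' actual tooling:

    #include <stdbool.h>

    bool run_trial(double interarrival_ns);   /* assumed harness: replays a
                                                 packet trace at the given gap
                                                 and reports zero-drop success */

    double min_interarrival(double lo_ns, double hi_ns)
    {
        /* Invariant: trials at hi_ns pass (no drops), trials at lo_ns fail. */
        while (hi_ns - lo_ns > 1.0) {         /* stop at 1 ns resolution */
            double mid = 0.5 * (lo_ns + hi_ns);
            if (run_trial(mid))
                hi_ns = mid;                  /* no drops: push harder */
            else
                lo_ns = mid;                  /* drops: back off */
        }
        return hi_ns;                         /* smallest sustainable gap */
    }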
Realistic Goals
• A 1 Gbps (10⁹ bps) stream with the normal inter-frame gap of 12 bytes
• 2 processors running at 125 MHz
• Cycle budget per packet:
  • 152 cycles for minimum-size 64B packets
  • 3060 cycles for maximum-size 1518B packets
Soft processors: non-trivial processing at line rate! How can they be organized efficiently?
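The cycle budgets follow directly from the link rate and the clock: at 1 Gbps a byte occupies 8 ns on the wire, which is exactly one cycle at 125 MHz, and two processors double the per-packet budget. A minimal sketch of the arithmetic:

    #include <stdio.h>

    int main(void)
    {
        /* One byte on the wire = one 125 MHz cycle, so the budget is
         * (packet_bytes + inter-frame gap) * number of processors.   */
        const int ifg = 12, nproc = 2;
        const int sizes[] = { 64, 1518 };
        for (int i = 0; i < 2; i++)
            printf("%4dB packet: %4d cycles\n",
                   sizes[i], (sizes[i] + ifg) * nproc);
        return 0;   /* prints 152 and 3060, matching the budgets above */
    }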
Efficient Network Processing
1. Memory system with specialized memories
2. Multiple-processor support
3. Multithreaded soft processors
Multiprocessor System Diagram
[Diagram: two 4-threaded processors with private instruction caches, a synchronization unit, instruction and data memories, input and output packet buffers, a shared data cache, and off-chip DDR]
- Overcomes the 2-port limitation of block RAMs
- The shared data cache is not the main bottleneck in our experiments
Performance of Single-Threaded Processors
• Single-issue, in-order pipeline
• Should commit 1 instruction every cycle, but:
  • stalls on instruction dependences
  • stalls on memory, I/O, and accelerator accesses
• Throughput depends on sequential execution of:
  • packet processing
  • device control
  • event monitoring
  → these tasks naturally provide many concurrent threads
Solution to avoid stalls: multithreading
Avoiding Processor Stall Cycles
• Multithreading: execute streams of independent instructions
[Pipeline diagram, before: single-threaded execution in the 5-stage pipeline (F, D, E, M, W); data or control hazards force stall cycles between dependent instructions]
[Pipeline diagram, after: four threads interleaved round-robin, so each stage holds an instruction from a different thread, ideally eliminating all stalls]
• 4 threads eliminate hazards in a 5-stage pipeline
• The 5-stage pipeline is 77% more area-efficient [FPL'07]
Infrastructure
• Compilation: modified versions of GCC 4.0.2 and Binutils 2.16 for the MIPS-I ISA
• Timing: no free PLL, so the processors run at the speed of the Ethernet MACs, 125 MHz
• Platform:
  • 2 processors, 4 MAC + 1 DMA ports, 64 MB of 200 MHz DDR2 SDRAM
  • Virtex II Pro 50 (speed grade 7 ns)
  • 16 KB private instruction caches and a shared write-back data cache
  • capacities would be increased on a more modern FPGA
• Validation:
  • reference trace from a MIPS simulator
  • ModelSim and online instruction trace collection
- A PC server can send maximum-size packets at ~0.7 Gbps
- A simple packet-echo application can keep up
- Complex applications are the bottleneck, not the architecture
Our Benchmarks
Realistic, non-trivial applications dominated by control flow
What Is Limiting Performance?
• Packet backlog due to synchronization
• Serializing tasks
Let's focus on the underlying problem: synchronization
Real Threads Synchronize
• All threads execute the same code
• Concurrent threads may access shared data
• Critical sections ensure correctness:

    Lock();
    shared_var = f();
    Unlock();

What is the impact on round-robin scheduled threads?
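To make the pattern concrete before turning to that question, here is a minimal sketch of a critical section in a packet handler. The lock()/unlock() names are placeholders for the NetThreads primitives, and the field offsets are illustrative:

    void lock(int lock_id);                    /* hypothetical API */
    void unlock(int lock_id);

    #define FLOW_LOCK 0

    static int shared_flow_count;              /* state shared by all threads */

    void handle_packet(const unsigned char *pkt, int len)
    {
        /* Work on private data needs no lock, e.g. classifying the packet. */
        int is_tcp = len > 23 && pkt[23] == 6; /* IPv4 protocol field */

        lock(FLOW_LOCK);                       /* Lock();           */
        if (is_tcp)
            shared_flow_count++;               /* shared_var = f(); */
        unlock(FLOW_LOCK);                     /* Unlock();         */
    }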
Multithreaded Processor with Synchronization
[Pipeline diagram: one thread acquires a lock, executes its critical section, and releases the lock; meanwhile the other threads keep cycling through the 5-stage pipeline, retrying the acquire]
Synchronization Wrecks Round-Robin Multithreading
[Pipeline diagram: threads blocked between a lock acquire and its release leave bubbles in the 5-stage pipeline]
With round-robin thread scheduling and contention on locks:
• fewer than 4 threads execute concurrently
• more than 18% of cycles are wasted while blocked on synchronization
Better Handling of Synchronization
[Pipeline diagram, before: threads blocked on a lock keep occupying pipeline slots uselessly]
[Pipeline diagram, after: threads 3 and 4 are descheduled while they wait, so every cycle goes to threads that can make progress]
Thread Scheduler
• Suspend any thread waiting for a lock
• Round-robin among the remaining threads
• An unlock operation resumes waiting threads across processors
- The multithreaded processor hides hazards across the active threads
- Fewer than N active threads requires hazard detection
But hazard detection was on the critical path of the single-threaded processor. Is there a low-cost solution?
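Before answering, a software model of the policy just described may help; this is purely illustrative, since the real scheduler is hardware in the fetch stage:

    #define NTHREADS 4
    enum tstate { RUNNABLE, BLOCKED };

    static enum tstate state[NTHREADS];
    static int last;                       /* thread issued from last cycle */

    /* Round-robin over runnable threads only; -1 means every thread
     * is descheduled waiting on a lock. */
    int pick_next_thread(void)
    {
        for (int i = 1; i <= NTHREADS; i++) {
            int t = (last + i) % NTHREADS;
            if (state[t] == RUNNABLE)
                return last = t;
        }
        return -1;
    }

    /* An unlock resumes all waiters, across both processors. */
    void on_unlock(const int waiters[], int n)
    {
        for (int i = 0; i < n; i++)
            state[waiters[i]] = RUNNABLE;
    }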
Static Hazard Detection
• Hazards can be determined at compile time
• Hazard distances are encoded as part of the instructions
• Static hazard detection allows scheduling without an extra pipeline stage
• Very low area overhead (5%) and no frequency penalty
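A hazard distance is just the number of instructions until the next one that consumes a still-in-flight result. A simplified compiler-side sketch (registers only, ignoring forwarding special cases, with an assumed cap that fits the spare instruction bits):

    #define MAX_DIST 3   /* with 4 threads, hazards 4+ cycles away are free */

    struct insn {
        int dest;        /* register written, -1 if none   */
        int src1, src2;  /* registers read,   -1 if unused */
    };

    /* Distance from instruction i to the next one that reads its result,
     * capped at MAX_DIST. */
    int hazard_distance(const struct insn *code, int n, int i)
    {
        if (code[i].dest < 0)
            return MAX_DIST;                   /* nothing to wait for */
        for (int j = i + 1; j < n && j - i <= MAX_DIST; j++)
            if (code[j].src1 == code[i].dest || code[j].src2 == code[i].dest)
                return j - i;
        return MAX_DIST;
    }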
Results on 3 Benchmark Applications
- Thread scheduling improves throughput by 63%, 31%, and 41%
- Why isn't the 2nd processor always improving throughput?
Cycle Breakdown in Simulation
[Charts: cycle breakdown for Classifier, NAT, and UDHCP]
- Thread descheduling removed the cycles stalled waiting for a lock
- What is the bottleneck now?
Impact of Allowing Packet Drops
- The system is still under-utilized
- Throughput is still dominated by serialization
Future Work
• Adding custom hardware accelerators:
  • same interconnect as the processors
  • same synchronization interface
• Evaluating speculative threading:
  • alleviates the need for fine-grained synchronization
  • reduces conservative synchronization overhead
Conclusions
• Efficient multithreaded design:
  • parallel threads hide stalls in any one thread
  • the thread scheduler mitigates synchronization costs
• System features:
  • the system is easy to program in C
  • performance from parallelism is easy to get
We are on the lookout for relevant applications suitable for benchmarking.
NetThreads is available with its compiler at: http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads
Software Network Processing
• Not meant for straightforward tasks accomplished at line speed in hardware, e.g. basic switching and routing
• Advantages compared to hardware:
  • complex applications are best described in high-level software
  • easier to design, faster time-to-market
  • can interface with custom accelerators and controllers
  • can be easily updated
• Our focus: stateful applications
  • data structures are modified by most packets
  • difficult to pipeline the code into balanced stages
• Run-to-completion / pool-of-threads model for parallelism (sketched below):
  • each thread processes a packet from beginning to end
  • no thread-specific behavior
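Under this model, every thread on every processor runs the same dispatch loop; whichever thread dequeues a packet carries it from beginning to end. The I/O function names below are hypothetical stand-ins for the NetThreads interface:

    struct packet;
    struct packet *next_packet(void);          /* blocks until a packet arrives */
    void send_packet(struct packet *p, int port);
    void process(struct packet *p);            /* application logic */

    void thread_main(void)
    {
        for (;;) {
            struct packet *p = next_packet();
            process(p);                        /* run to completion */
            send_packet(p, 0);
        }
    }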
Impact of Allowing Packet Drops
[Chart: NAT benchmark, throughput over time t]
Cycle Breakdown in Simulation
[Charts: cycle breakdown for Classifier, NAT, and UDHCP]
- Thread descheduling removed the cycles stalled waiting for a lock
- Throughput is still dominated by serialization
More Sophisticated Thread Scheduling
[Pipeline diagram: Fetch → Thread Selection (MUX) → Register Read → Execute → Memory → Writeback]
• Add a pipeline stage to pick a hazard-free instruction
• Result:
  • increased instruction latency
  • increased hazard window
  • increased branch misprediction cost
Can we add hazard detection without an extra pipeline stage?
Implementation
[Diagram: two 4-threaded processors with instruction caches fed by 36-bit block RAM ports; the off-chip DDR stores 32-bit words]
• Where to store the hazard-distance bits?
  • block RAM widths are multiples of 9 bits
  • a 36-bit word leaves 4 bits available beside the 32-bit instruction
  • the spare bits also encode lock and unlock flags
How do we convert instructions from 36 bits to 32 bits?
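First, the 36-bit packing itself: a sketch holding the cache word in a uint64_t, with bits 0-31 for the MIPS instruction and bits 32-35 for the metadata. The exact field layout is an assumption:

    #include <stdint.h>

    #define HAZARD_MASK 0x3u   /* 2 bits: hazard distance 0..3 */
    #define LOCK_FLAG   0x4u   /* instruction acquires a lock  */
    #define UNLOCK_FLAG 0x8u   /* instruction releases a lock  */

    uint64_t pack36(uint32_t insn, unsigned meta4)
    {
        return ((uint64_t)(meta4 & 0xFu) << 32) | insn;
    }

    uint32_t insn_of(uint64_t w) { return (uint32_t)w; }
    unsigned meta_of(uint64_t w) { return (unsigned)(w >> 32) & 0xFu; }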
Instruction Compaction (36 → 32 bits)
• R-Type instructions, e.g. add rd, rs, rt
• J-Type instructions, e.g. j label
• I-Type instructions, e.g. addi rt, rs, immediate
- De-compaction: 2 block RAMs plus some logic between the DDR and the cache
- Not on a critical path of the pipeline
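The slide gives only the hardware cost, so the following table-driven mapping is a guess at the idea, not the actual encoding: MIPS-I leaves opcode values unused, so a 32-bit word whose opcode has been remapped can index a small table (the "2 block RAMs") that restores the real opcode and supplies the 4 metadata bits, while untouched encodings pass through with zero metadata:

    #include <stdint.h>

    struct expansion {
        uint8_t real_opcode;   /* 0xFF marks "not remapped" */
        uint8_t meta4;
    };

    extern const struct expansion expand_table[64];  /* built by the compiler */

    void decompact(uint32_t w32, uint32_t *insn, unsigned *meta4)
    {
        const struct expansion *e = &expand_table[w32 >> 26];
        if (e->real_opcode != 0xFF) {                /* remapped encoding */
            *insn  = (w32 & 0x03FFFFFFu) | ((uint32_t)e->real_opcode << 26);
            *meta4 = e->meta4;
        } else {                                     /* plain instruction */
            *insn  = w32;
            *meta4 = 0;
        }
    }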