
NetThreads: Programming NetFPGA with Threaded Software

NetThreads: Programming NetFPGA with Threaded Software. Geoff Salmon Monia Ghobadi Yashar Ganjali. Martin Labrecque Gregory Steffan. ECE Dept. CS Dept. University of Toronto. Real-Life Customers. Hardware: NetFPGA board, 4 GigE ports, Virtex 2 Pro FPGA



Presentation Transcript

  1. NetThreads: Programming NetFPGA with Threaded Software Geoff Salmon Monia Ghobadi Yashar Ganjali Martin Labrecque Gregory Steffan ECE Dept. CS Dept. University of Toronto

  2. Real-Life Customers • Hardware: • NetFPGA board, 4 GigE ports, Virtex 2 Pro FPGA • Collaboration with CS researchers • Interested in performing network experiments • Not in coding Verilog • Want to use GigE link at maximum capacity • Requirements: • Easy to program system • Efficient system What would the ideal solution look like?

  3. Envisioned System (Someday) • Many Compute Engines • Delivers the expected performance • Hardware handles communication and synchronization [Diagram: an array of processors and hardware accelerators, labeled with control-flow parallelism and data-level parallelism] Processors inside an FPGA?

  4. Soft Processors in FPGAs • Soft processors: processors in the FPGA fabric • Easier to program than HDL • Customizable • FPGAs increasingly implement SoCs with CPUs • Commercial soft processors: NIOS-II and Microblaze [Diagram: FPGA containing a processor, DDR controllers, and Ethernet MACs] What is the performance requirement?

  5. Performance In Packet Processing • The application defines the throughput required: edge routing (≥ 1 Gbps/link), home networking (~100 Mbps/link), scientific instruments (< 100 Mbps/link) • Our measure of throughput: • Bisection search of the minimum packet inter-arrival time • Must not drop any packet Are soft processors fast enough?
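
The bisection search named on this slide can be sketched in C as follows. This is a minimal sketch, not the authors' measurement harness: the callback type `no_drop_fn`, the function `min_drop_free_gap`, and the stand-in `fake_system` are hypothetical names. It assumes the outcome is monotonic (if one gap is drop-free, every larger gap is too) and that the upper bound itself is drop-free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callback: replays a packet trace with the given
 * inter-arrival gap (in ns) and reports whether no packet was dropped. */
typedef bool (*no_drop_fn)(uint64_t gap_ns, void *ctx);

/* Returns the smallest gap in [lo_ns, hi_ns] at which no packet is dropped.
 * Assumes monotonic behavior and that hi_ns itself is drop-free. */
uint64_t min_drop_free_gap(uint64_t lo_ns, uint64_t hi_ns,
                           no_drop_fn run, void *ctx)
{
    assert(lo_ns <= hi_ns);
    while (lo_ns < hi_ns) {
        uint64_t mid = lo_ns + (hi_ns - lo_ns) / 2;
        if (run(mid, ctx))
            hi_ns = mid;      /* no drops: try a tighter gap */
        else
            lo_ns = mid + 1;  /* drops seen: back off */
    }
    return lo_ns;
}

/* Stand-in for a real experiment: a system that needs at least
 * *(uint64_t *)ctx nanoseconds between packets. */
bool fake_system(uint64_t gap_ns, void *ctx)
{
    return gap_ns >= *(const uint64_t *)ctx;
}
```

The smallest drop-free gap then gives the maximum sustainable packet rate as its reciprocal.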

  6. Realistic Goals • 10^9 bps stream with normal inter-frame gap of 12 bytes • 2 processors running at 125 MHz • Cycle budget: • 152 cycles for minimally-sized 64B packets • 3060 cycles for maximally-sized 1518B packets Soft processors: non-trivial processing at line rate! How can they efficiently be organized?
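
The cycle budget on this slide can be reproduced with a few lines of arithmetic. The sketch below is an illustration only; the function name `cycle_budget` and the macro names are made up here. The slide's figures come out when only the 12-byte inter-frame gap is added to the frame size, so that is what the code assumes.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_RATE_BPS   1000000000ULL /* 10^9 bps link, from the slide */
#define CLOCK_HZ        125000000ULL  /* 125 MHz processors, from the slide */
#define INTER_FRAME_GAP 12ULL         /* bytes, from the slide */

/* Cycles available per packet, summed over all processors, when packets
 * of frame_bytes arrive back to back at line rate. */
uint64_t cycle_budget(uint64_t frame_bytes, uint64_t num_processors)
{
    assert(frame_bytes > 0 && num_processors > 0);
    uint64_t bits = (frame_bytes + INTER_FRAME_GAP) * 8;
    /* One packet slot lasts bits / LINE_RATE_BPS seconds; multiplying by
     * the clock rate gives the cycles one processor sees in that slot. */
    uint64_t cycles_per_slot = bits * CLOCK_HZ / LINE_RATE_BPS;
    return cycles_per_slot * num_processors;
}
```

At 1 Gbps one byte takes 8 ns, which is exactly one cycle at 125 MHz, so the budget is (frame size + 12) cycles per processor: 76 × 2 = 152 for 64B packets and 1530 × 2 = 3060 for 1518B packets.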

  7. Key Design Features

  8. Efficient Network Processing 1. Memory system with specialized memories 2. Multiple processors support 3. Multithreaded soft processor

  9. Multiprocessor System Diagram [Diagram: two 4-thread processors, each with an instruction cache (I$); synchronization unit; instruction, data, input, and output memories; input buffer, data cache, and output buffer; packet input and packet output; off-chip DDR] - Overcomes the 2-port limitation of block RAMs - Shared data cache is not the main bottleneck in our experiments

  10. Performance of Single-Threaded Processors • Single-issue, in-order pipeline • Should commit 1 instruction every cycle, but: • stall on instruction dependences • stall on memory, I/O, accelerator accesses • Throughput depends on sequential execution: • packet processing • device control • event monitoring (many concurrent threads) Solution to Avoid Stalls: Multithreading

  11. Avoiding Processor Stall Cycles • Multithreading: execute streams of independent instructions • 4 threads eliminate hazards in 5-stage pipeline • 5-stage pipeline is 77% more area efficient [FPL’07] [Pipeline diagrams, 5 stages (F, D, E, M, W) over time. BEFORE: single-thread traditional execution, with a data or control hazard. AFTER: instructions from Thread1, Thread2, Thread3, and Thread4 are interleaved; ideally, eliminates all stalls]

  12. Multithreading Evaluation

  13. Infrastructure • Compilation: • modified versions of GCC 4.0.2 and Binutils 2.16 for the MIPS-I ISA • Timing: • no free PLL: processors run at the speed of the Ethernet MACs, 125 MHz • Platform: • 2 processors, 4 MAC + 1 DMA ports, 64 Mbytes 200 MHz DDR2 SDRAM • Virtex II Pro 50 (speed grade 7ns) • 16KB private instruction caches and shared data write-back cache • Capacity would be increased on a more modern FPGA • Validation: • Reference trace from MIPS simulator • Modelsim and online instruction trace collection - PC server can send ~0.7 Gbps of maximally sized packets - Simple packet echo application can keep up - Complex applications are the bottleneck, not the architecture

  14. Our benchmarks Realistic non-trivial applications, dominated by control flow

  15. What is limiting performance? Packet Backlog due to Synchronization Serializing Tasks Let’s focus on the underlying problem: Synchronization

  16. Addressing Synchronization Overhead

  17. Real Threads Synchronize • All threads execute the same code • Concurrent threads may access shared data • Critical sections ensure correctness Thread1, Thread2, Thread3, Thread4: Lock(); shared_var = f(); Unlock(); Impact on round-robin scheduled threads?

  18. Multithreaded processor with Synchronization [Pipeline diagram, 5 stages (F, D, E, M, W) over time, marking where a thread acquires the lock and where it releases the lock]

  19. Synchronization Wrecks Round-Robin Multithreading [Pipeline diagram, 5 stages (F, D, E, M, W) over time, marking where a thread acquires the lock and where it releases the lock] With round-robin thread scheduling and contention on locks: • < 4 threads execute concurrently • > 18% of cycles are wasted while blocked on synchronization

  20. Better Handling of Synchronization [Pipeline diagrams, 5 stages (F, D, E, M, W) over time, BEFORE and AFTER descheduling Thread3 and Thread4]

  21. Thread scheduler • Suspend any thread waiting for a lock • Round-robin among the remaining threads • Unlock operation resumes threads across processors - Multithreaded processor hides hazards across active threads - Fewer than N threads requires hazard detection But, hazard detection was on critical path of single threaded processor Is there a low cost solution?
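
The scheduling policy on this slide (suspend lock waiters, round-robin among the rest, resume on unlock) can be modeled in software as below. This is a behavioral sketch under stated assumptions, not the hardware design: it models one processor with 4 threads and a single lock, and the type `sched_t` and the functions `sched_init`, `sched_next`, `sched_lock`, and `sched_unlock` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_THREADS 4

typedef struct {
    uint8_t suspended;  /* bit t set: thread t is waiting for the lock */
    int     lock_owner; /* -1 when the lock is free */
    int     last;       /* thread selected most recently */
} sched_t;

void sched_init(sched_t *s)
{
    s->suspended = 0;
    s->lock_owner = -1;
    s->last = NUM_THREADS - 1; /* so that thread 0 is selected first */
}

/* Round-robin over the threads that are not suspended; -1 if none can run. */
int sched_next(sched_t *s)
{
    for (int i = 1; i <= NUM_THREADS; i++) {
        int t = (s->last + i) % NUM_THREADS;
        if (!(s->suspended & (1u << t))) {
            s->last = t;
            return t;
        }
    }
    return -1;
}

/* Returns 1 if thread t acquired the lock, 0 if it was suspended instead. */
int sched_lock(sched_t *s, int t)
{
    assert(t >= 0 && t < NUM_THREADS);
    if (s->lock_owner == -1) {
        s->lock_owner = t;
        return 1;
    }
    s->suspended |= (uint8_t)(1u << t);
    return 0;
}

/* Releases the lock and resumes every waiter so that each can retry. */
void sched_unlock(sched_t *s, int t)
{
    assert(s->lock_owner == t);
    s->lock_owner = -1;
    s->suspended = 0;
}
```

With threads 1 and 2 suspended, `sched_next` alternates between threads 0 and 3, which is the "fewer than N threads" case that makes hazard detection necessary.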

  22. Static Hazard Detection • Hazards can be determined at compile time - Hazard distances are encoded as part of the instructions Static hazard detection allows scheduling without an extra pipeline stage Very low area overhead (5%), no frequency penalty
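
One way a compiler pass could derive such hazard distances is sketched below. The slides do not define the encoding, so everything here is an assumption for illustration: the distance is taken to be the number of instructions until the first consumer of a result within the same thread, saturated at `MAX_DIST` so that it fits in 2 bits, and the names `insn_t` and `hazard_distances` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NO_REG   (-1)
#define MAX_DIST 3 /* assumed cap: farther consumers are treated as hazard-free */

typedef struct {
    int dst;        /* register written, or NO_REG */
    int src1, src2; /* registers read, or NO_REG */
} insn_t;

/* For each instruction i, dist[i] is how many instructions later the first
 * consumer of its result appears, saturated at MAX_DIST. */
void hazard_distances(const insn_t *code, size_t n, int *dist)
{
    assert(code != NULL && dist != NULL);
    for (size_t i = 0; i < n; i++) {
        dist[i] = MAX_DIST;
        if (code[i].dst == NO_REG)
            continue;
        for (size_t j = i + 1; j < n && j - i < MAX_DIST; j++) {
            if (code[j].src1 == code[i].dst || code[j].src2 == code[i].dst) {
                dist[i] = (int)(j - i);
                break;
            }
            if (code[j].dst == code[i].dst)
                break; /* value overwritten before any use */
        }
    }
}
```

Because the distances are known before the program runs, the scheduler only has to compare a small stored value against the number of threads currently active, rather than compare register fields at run time.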

  23. Thread Scheduler Evaluation

  24. Results on 3 benchmark applications - Thread scheduling improves throughput by 63%, 31%, and 41% - Why isn’t the 2nd processor always improving throughput?

  25. Cycle Breakdown in Simulation Classifier NAT UDHCP - Removed cycles stalled waiting for a lock - What is the bottleneck?

  26. Impact of Allowing Packet Drops - System still under-utilized - Throughput still dominated by serialization

  27. Future Work • Adding custom hardware accelerators • Same interconnect as processors • Same synchronization interface • Evaluate speculative threading • Alleviate need for fine-grained synchronization • Reduce conservative synchronization overhead

  28. Conclusions • Efficient multithreaded design • Parallel threads hide stalls on one thread • Thread scheduler mitigates synchronization costs • System Features • System is easy to program in C • Performance from parallelism is easy to get On the lookout for relevant applications suitable for benchmarking NetThreads available with compiler at: http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads

  29. Geoff Salmon Monia Ghobadi Yashar Ganjali Martin Labrecque Gregory Steffan ECE Dept. CS Dept. University of Toronto NetThreads available with compiler at: http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads

  30. Backup

  31. Software Network Processing • Not meant for: • Straightforward tasks accomplished at line speed in hardware • E.g. basic switching and routing • Advantages compared to hardware: • Complex applications are best described in high-level software • Easier to design and faster time-to-market • Can interface with custom accelerators, controllers • Can be easily updated • Our focus: stateful applications • Data structures modified by most packets • Difficult to pipeline the code into balanced stages • Run-to-Completion/Pool-of-Threads model for parallelism: • Each thread processes a packet from beginning to end • No thread-specific behavior

  32. Impact of allowing packet drops: NAT benchmark

  33. Cycle Breakdown in Simulation Classifier NAT UDHCP - Removed cycles stalled waiting for a lock - Throughput still dominated by serialization

  34. More Sophisticated Thread Scheduling • Add pipeline stage to pick hazard-free instruction • Result: • Increased instruction latency • Increased hazard window • Increased branch mis-prediction cost [Pipeline diagram: stages labeled Fetch, Thread Selection, Register Read, Execute, Memory, and Writeback, with a MUX] Add hazard detection without an extra pipeline stage?

  35. Implementation • Where to store the hazard distance bits? • Block RAMs are a multiple of 9 bits wide • A 36-bit word leaves 4 bits available • Also encode lock and unlock flags [Diagram: two 4-thread processors with instruction caches (I$), with widths labeled x 36 bits, x 36 bits, and x 32 bits toward the off-chip DDR; a word split into 32 bits and 4 bits] How to convert instructions from 36 bits to 32 bits?
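
To make the 32 + 4 bit split concrete, the sketch below packs an instruction and its annotation into one 36-bit word. The bit layout is an assumption chosen for illustration and is not taken from the slides: bits 33:32 hold a hazard distance, bit 34 a lock flag, bit 35 an unlock flag. The names `annotated_insn_t`, `pack36`, and `unpack36` are hypothetical, and the sketch does not attempt the 36-to-32-bit compaction used for off-chip storage.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of the 4 spare bits (illustrative only):
 * bits 33:32 hazard distance, bit 34 lock flag, bit 35 unlock flag. */
typedef struct {
    uint32_t insn;        /* 32-bit MIPS instruction */
    unsigned hazard_dist; /* 0..3 */
    unsigned is_lock;     /* 0 or 1 */
    unsigned is_unlock;   /* 0 or 1 */
} annotated_insn_t;

/* Packs the instruction and its annotation into the low 36 bits. */
uint64_t pack36(annotated_insn_t a)
{
    assert(a.hazard_dist <= 3 && a.is_lock <= 1 && a.is_unlock <= 1);
    return (uint64_t)a.insn
         | ((uint64_t)a.hazard_dist << 32)
         | ((uint64_t)a.is_lock     << 34)
         | ((uint64_t)a.is_unlock   << 35);
}

/* Recovers the instruction and annotation from a 36-bit word. */
annotated_insn_t unpack36(uint64_t w)
{
    annotated_insn_t a;
    a.insn        = (uint32_t)(w & 0xFFFFFFFFu);
    a.hazard_dist = (unsigned)((w >> 32) & 0x3);
    a.is_lock     = (unsigned)((w >> 34) & 0x1);
    a.is_unlock   = (unsigned)((w >> 35) & 0x1);
    return a;
}
```

A 36-bit word corresponds to four 9-bit block RAM widths, which is why the extra 4 bits cost no additional memory blocks on the on-chip side.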

  36. Instruction Compaction 36 → 32 bits • R-Type instructions, example: add rd, rs, rt • J-Type instructions, example: j label • I-Type instructions, example: addi rt, rs, immediate - De-compaction: 2 block RAMs + some logic between DDR and cache - Not a critical path of the pipeline
