NetThreads: Programming NetFPGA with Threaded Software
Geoff Salmon, Monia Ghobadi, Yashar Ganjali (CS Dept.) · Martin Labrecque, Gregory Steffan (ECE Dept.)
University of Toronto
Real-Life Customers
• Hardware: NetFPGA board, 4 GigE ports, Virtex II Pro FPGA
• Collaboration with CS researchers:
  • interested in performing network experiments, not in coding Verilog
  • want to use the GigE links at maximum capacity
• Requirements:
  • an easy-to-program system
  • an efficient system
What would the ideal solution look like?
Envisioned System (Someday)
[Diagram: many processors exploiting data-level parallelism alongside hardware accelerators exploiting control-flow parallelism]
• Many compute engines
• Delivers the expected performance
• Hardware handles communication and synchronization
Processors inside an FPGA?
Soft Processors in FPGAs
[Diagram: an FPGA containing a processor, DDR controllers, and Ethernet MACs]
• Soft processors: processors implemented in the FPGA fabric
• Easier to program than HDL, and customizable
• FPGAs increasingly implement SoCs with CPUs
• Commercial soft processors: Nios II and MicroBlaze
What is the performance requirement?
Performance in Packet Processing
• The application defines the required throughput:
  • edge routing (≥ 1 Gbps/link)
  • home networking (~100 Mbps/link)
  • scientific instruments (< 100 Mbps/link)
• Our measure of throughput:
  • bisection search of the minimum packet inter-arrival time
  • must not drop any packet
Are soft processors fast enough?
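As an aside on methodology, the bisection search above can be sketched in a few lines of C. The harness function run_trial() and the 1 ns resolution are assumptions for illustration, not the authors' actual tooling:

    #include <stdbool.h>

    bool run_trial(double interarrival_ns);   /* assumed harness: replays a
                                                 packet trace at the given gap
                                                 and reports zero-drop success */

    double min_interarrival(double lo_ns, double hi_ns)
    {
        /* Invariant: trials at hi_ns pass (no drops), trials at lo_ns fail. */
        while (hi_ns - lo_ns > 1.0) {         /* stop at 1 ns resolution */
            double mid = 0.5 * (lo_ns + hi_ns);
            if (run_trial(mid))
                hi_ns = mid;                  /* no drops: push harder */
            else
                lo_ns = mid;                  /* drops: back off */
        }
        return hi_ns;                         /* smallest sustainable gap */
    }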
Realistic Goals
• A 1 Gbps (10⁹ bps) stream with the normal inter-frame gap of 12 bytes
• 2 processors running at 125 MHz
• Cycle budget per packet:
  • 152 cycles for minimum-size 64B packets
  • 3060 cycles for maximum-size 1518B packets
Soft processors: non-trivial processing at line rate! How can they be organized efficiently?
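The cycle budgets follow directly from the link rate and the clock: at 1 Gbps a byte occupies 8 ns on the wire, which is exactly one cycle at 125 MHz, and two processors double the per-packet budget. A minimal sketch of the arithmetic:

    #include <stdio.h>

    int main(void)
    {
        /* One byte on the wire = one 125 MHz cycle, so the budget is
         * (packet_bytes + inter-frame gap) * number of processors.   */
        const int ifg = 12, nproc = 2;
        const int sizes[] = { 64, 1518 };
        for (int i = 0; i < 2; i++)
            printf("%4dB packet: %4d cycles\n",
                   sizes[i], (sizes[i] + ifg) * nproc);
        return 0;   /* prints 152 and 3060, matching the budgets above */
    }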
Efficient Network Processing
1. Memory system with specialized memories
2. Multiple-processor support
3. Multithreaded soft processors
Multiprocessor System Diagram
[Diagram: two 4-threaded processors with private instruction caches, a synchronization unit, instruction and data memories, input and output packet buffers, a shared data cache, and off-chip DDR]
- Overcomes the 2-port limitation of block RAMs
- The shared data cache is not the main bottleneck in our experiments
Performance of Single-Threaded Processors
• Single-issue, in-order pipeline
• Should commit 1 instruction every cycle, but:
  • stalls on instruction dependences
  • stalls on memory, I/O, and accelerator accesses
• Throughput depends on sequential execution of:
  • packet processing
  • device control
  • event monitoring
  → these tasks naturally provide many concurrent threads
Solution to avoid stalls: multithreading
Avoiding Processor Stall Cycles
• Multithreading: execute streams of independent instructions
[Pipeline diagram, before: single-threaded execution in the 5-stage pipeline (F, D, E, M, W); data or control hazards force stall cycles between dependent instructions]
[Pipeline diagram, after: four threads interleaved round-robin, so each stage holds an instruction from a different thread, ideally eliminating all stalls]
• 4 threads eliminate hazards in a 5-stage pipeline
• The 5-stage pipeline is 77% more area-efficient [FPL'07]
Infrastructure
• Compilation: modified versions of GCC 4.0.2 and Binutils 2.16 for the MIPS-I ISA
• Timing: no free PLL, so the processors run at the speed of the Ethernet MACs, 125 MHz
• Platform:
  • 2 processors, 4 MAC + 1 DMA ports, 64 MB of 200 MHz DDR2 SDRAM
  • Virtex II Pro 50 (speed grade 7 ns)
  • 16 KB private instruction caches and a shared write-back data cache
  • capacities would be increased on a more modern FPGA
• Validation:
  • reference trace from a MIPS simulator
  • ModelSim and online instruction trace collection
- A PC server can send maximum-size packets at ~0.7 Gbps
- A simple packet-echo application can keep up
- Complex applications are the bottleneck, not the architecture
Our Benchmarks
Realistic, non-trivial applications dominated by control flow
What Is Limiting Performance?
• Packet backlog due to synchronization
• Serializing tasks
Let's focus on the underlying problem: synchronization
Real Threads Synchronize
• All threads execute the same code
• Concurrent threads may access shared data
• Critical sections ensure correctness:

    Lock();
    shared_var = f();
    Unlock();

What is the impact on round-robin scheduled threads?
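To make the pattern concrete before turning to that question, here is a minimal sketch of a critical section in a packet handler. The lock()/unlock() names are placeholders for the NetThreads primitives, and the field offsets are illustrative:

    void lock(int lock_id);                    /* hypothetical API */
    void unlock(int lock_id);

    #define FLOW_LOCK 0

    static int shared_flow_count;              /* state shared by all threads */

    void handle_packet(const unsigned char *pkt, int len)
    {
        /* Work on private data needs no lock, e.g. classifying the packet. */
        int is_tcp = len > 23 && pkt[23] == 6; /* IPv4 protocol field */

        lock(FLOW_LOCK);                       /* Lock();           */
        if (is_tcp)
            shared_flow_count++;               /* shared_var = f(); */
        unlock(FLOW_LOCK);                     /* Unlock();         */
    }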
Multithreaded Processor with Synchronization
[Pipeline diagram: one thread acquires a lock, executes its critical section, and releases the lock; meanwhile the other threads keep cycling through the 5-stage pipeline, retrying the acquire]
Synchronization Wrecks Round-Robin Multithreading
[Pipeline diagram: threads blocked between a lock acquire and its release leave bubbles in the 5-stage pipeline]
With round-robin thread scheduling and contention on locks:
• fewer than 4 threads execute concurrently
• more than 18% of cycles are wasted while blocked on synchronization
Better Handling of Synchronization
[Pipeline diagram, before: threads blocked on a lock keep occupying pipeline slots uselessly]
[Pipeline diagram, after: threads 3 and 4 are descheduled while they wait, so every cycle goes to threads that can make progress]
Thread Scheduler
• Suspend any thread waiting for a lock
• Round-robin among the remaining threads
• An unlock operation resumes waiting threads across processors
- The multithreaded processor hides hazards across the active threads
- Fewer than N active threads requires hazard detection
But hazard detection was on the critical path of the single-threaded processor. Is there a low-cost solution?
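Before answering, a software model of the policy just described may help; this is purely illustrative, since the real scheduler is hardware in the fetch stage:

    #define NTHREADS 4
    enum tstate { RUNNABLE, BLOCKED };

    static enum tstate state[NTHREADS];
    static int last;                       /* thread issued from last cycle */

    /* Round-robin over runnable threads only; -1 means every thread
     * is descheduled waiting on a lock. */
    int pick_next_thread(void)
    {
        for (int i = 1; i <= NTHREADS; i++) {
            int t = (last + i) % NTHREADS;
            if (state[t] == RUNNABLE)
                return last = t;
        }
        return -1;
    }

    /* An unlock resumes all waiters, across both processors. */
    void on_unlock(const int waiters[], int n)
    {
        for (int i = 0; i < n; i++)
            state[waiters[i]] = RUNNABLE;
    }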
Static Hazard Detection
• Hazards can be determined at compile time
• Hazard distances are encoded as part of the instructions
• Static hazard detection allows scheduling without an extra pipeline stage
• Very low area overhead (5%) and no frequency penalty
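A hazard distance is just the number of instructions until the next one that consumes a still-in-flight result. A simplified compiler-side sketch (registers only, ignoring forwarding special cases, with an assumed cap that fits the spare instruction bits):

    #define MAX_DIST 3   /* with 4 threads, hazards 4+ cycles away are free */

    struct insn {
        int dest;        /* register written, -1 if none   */
        int src1, src2;  /* registers read,   -1 if unused */
    };

    /* Distance from instruction i to the next one that reads its result,
     * capped at MAX_DIST. */
    int hazard_distance(const struct insn *code, int n, int i)
    {
        if (code[i].dest < 0)
            return MAX_DIST;                   /* nothing to wait for */
        for (int j = i + 1; j < n && j - i <= MAX_DIST; j++)
            if (code[j].src1 == code[i].dest || code[j].src2 == code[i].dest)
                return j - i;
        return MAX_DIST;
    }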
Results on 3 Benchmark Applications
- Thread scheduling improves throughput by 63%, 31%, and 41%
- Why isn't the 2nd processor always improving throughput?
Cycle Breakdown in Simulation
[Charts: cycle breakdown for Classifier, NAT, and UDHCP]
- Thread descheduling removed the cycles stalled waiting for a lock
- What is the bottleneck now?
Impact of Allowing Packet Drops
- The system is still under-utilized
- Throughput is still dominated by serialization
Future Work
• Adding custom hardware accelerators:
  • same interconnect as the processors
  • same synchronization interface
• Evaluating speculative threading:
  • alleviates the need for fine-grained synchronization
  • reduces conservative synchronization overhead
Conclusions
• Efficient multithreaded design:
  • parallel threads hide stalls in any one thread
  • the thread scheduler mitigates synchronization costs
• System features:
  • the system is easy to program in C
  • performance from parallelism is easy to get
We are on the lookout for relevant applications suitable for benchmarking.
NetThreads is available with its compiler at: http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads
Software Network Processing
• Not meant for straightforward tasks accomplished at line speed in hardware, e.g. basic switching and routing
• Advantages compared to hardware:
  • complex applications are best described in high-level software
  • easier to design, faster time-to-market
  • can interface with custom accelerators and controllers
  • can be easily updated
• Our focus: stateful applications
  • data structures are modified by most packets
  • difficult to pipeline the code into balanced stages
• Run-to-completion / pool-of-threads model for parallelism (sketched below):
  • each thread processes a packet from beginning to end
  • no thread-specific behavior
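Under this model, every thread on every processor runs the same dispatch loop; whichever thread dequeues a packet carries it from beginning to end. The I/O function names below are hypothetical stand-ins for the NetThreads interface:

    struct packet;
    struct packet *next_packet(void);          /* blocks until a packet arrives */
    void send_packet(struct packet *p, int port);
    void process(struct packet *p);            /* application logic */

    void thread_main(void)
    {
        for (;;) {
            struct packet *p = next_packet();
            process(p);                        /* run to completion */
            send_packet(p, 0);
        }
    }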
Impact of Allowing Packet Drops
[Chart: NAT benchmark, throughput over time t]
Cycle Breakdown in Simulation
[Charts: cycle breakdown for Classifier, NAT, and UDHCP]
- Thread descheduling removed the cycles stalled waiting for a lock
- Throughput is still dominated by serialization
More Sophisticated Thread Scheduling
[Pipeline diagram: Fetch → Thread Selection (MUX) → Register Read → Execute → Memory → Writeback]
• Add a pipeline stage to pick a hazard-free instruction
• Result:
  • increased instruction latency
  • increased hazard window
  • increased branch misprediction cost
Can we add hazard detection without an extra pipeline stage?
Implementation
[Diagram: two 4-threaded processors with instruction caches fed by 36-bit block RAM ports; the off-chip DDR stores 32-bit words]
• Where to store the hazard-distance bits?
  • block RAM widths are multiples of 9 bits
  • a 36-bit word leaves 4 bits available beside the 32-bit instruction
  • the spare bits also encode lock and unlock flags
How do we convert instructions from 36 bits to 32 bits?
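First, the 36-bit packing itself: a sketch holding the cache word in a uint64_t, with bits 0-31 for the MIPS instruction and bits 32-35 for the metadata. The exact field layout is an assumption:

    #include <stdint.h>

    #define HAZARD_MASK 0x3u   /* 2 bits: hazard distance 0..3 */
    #define LOCK_FLAG   0x4u   /* instruction acquires a lock  */
    #define UNLOCK_FLAG 0x8u   /* instruction releases a lock  */

    uint64_t pack36(uint32_t insn, unsigned meta4)
    {
        return ((uint64_t)(meta4 & 0xFu) << 32) | insn;
    }

    uint32_t insn_of(uint64_t w) { return (uint32_t)w; }
    unsigned meta_of(uint64_t w) { return (unsigned)(w >> 32) & 0xFu; }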
Instruction Compaction (36 → 32 bits)
• R-Type instructions, e.g. add rd, rs, rt
• J-Type instructions, e.g. j label
• I-Type instructions, e.g. addi rt, rs, immediate
- De-compaction: 2 block RAMs plus some logic between the DDR and the cache
- Not on a critical path of the pipeline
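The slide gives only the hardware cost, so the following table-driven mapping is a guess at the idea, not the actual encoding: MIPS-I leaves opcode values unused, so a 32-bit word whose opcode has been remapped can index a small table (the "2 block RAMs") that restores the real opcode and supplies the 4 metadata bits, while untouched encodings pass through with zero metadata:

    #include <stdint.h>

    struct expansion {
        uint8_t real_opcode;   /* 0xFF marks "not remapped" */
        uint8_t meta4;
    };

    extern const struct expansion expand_table[64];  /* built by the compiler */

    void decompact(uint32_t w32, uint32_t *insn, unsigned *meta4)
    {
        const struct expansion *e = &expand_table[w32 >> 26];
        if (e->real_opcode != 0xFF) {                /* remapped encoding */
            *insn  = (w32 & 0x03FFFFFFu) | ((uint32_t)e->real_opcode << 26);
            *meta4 = e->meta4;
        } else {                                     /* plain instruction */
            *insn  = w32;
            *meta4 = 0;
        }
    }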