ECE Dept., University of Toronto
FAST CRITICAL SECTIONS VIA THREAD SCHEDULING FOR FPGA-BASED MULTITHREADED PROCESSORS
Martin Labrecque, Gregory Steffan
FPL 2009 – Prague, Czech Republic
NetThreads Project
• Hardware: NetFPGA board, 4 GigE ports, Virtex 2 Pro FPGA
• Collaboration with CS researchers
  • Interested in performing network experiments, e.g. new traffic shaping, encapsulation, or subscription protocols
  • Not interested in coding Verilog
  • Want to use the GigE links at maximum capacity
• Goals: an easy-to-program system, an efficient system
Easiest way to describe an application?
Soft Processors in FPGAs
[Block diagram: FPGA containing a processor, DDR controllers, and Ethernet MACs]
• Soft processors: processors implemented in the FPGA fabric
• Easier to program than HDL
• Customizable
• FPGAs increasingly implement SoCs with CPUs
• Commercial soft processors: NIOS-II and MicroBlaze
Are soft processors fast enough?
Measure of Throughput
[Timeline figure: input rates range from too fast (packets dropped) to too slow (capacity wasted)]
• Fastest constant input packet rate at which no packet is dropped
• Processing time may vary
• Gigabit link, 2 processors running at 125 MHz
• Cycle budget: 152 to 3060 cycles per 64- to 1518-byte packet
Soft processors: non-trivial processing at line rate! How can they be efficiently organized?
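As a rough check on the cycle budget above, here is a small worked example in C. It is only a sketch: the framing overhead counted (a 12-byte inter-frame gap, no preamble) is an assumption chosen because it reproduces the 152 and 3060 figures on the slide.

    /* Cycle budget per packet = inter-arrival time x clock frequency x #processors */
    #include <stdio.h>

    int main(void) {
        const double link_bps  = 1e9;    /* Gigabit Ethernet */
        const double clk_hz    = 125e6;  /* soft-processor clock */
        const int    num_cpus  = 2;
        const int    ifg_bytes = 12;     /* assumed framing overhead per packet */
        const int    sizes[]   = {64, 1518};

        for (int i = 0; i < 2; i++) {
            double arrival_s = (sizes[i] + ifg_bytes) * 8.0 / link_bps;
            double budget    = arrival_s * clk_hz * num_cpus;
            printf("%4d-byte packet: %.0f cycles\n", sizes[i], budget);
        }
        return 0;   /* prints 152 and 3060 */
    }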
Minimalist Multiprocessor System
[Block diagram: two 4-threaded processors with instruction caches and hardware accelerators, a synchronization unit, input and output packet memories, a shared data cache, and off-chip DDR]
Make this system:
• Deliver throughput
• Easier to program
• Overcomes the 2-port limitation of FPGA block RAMs
• The shared data cache is not the main bottleneck in our experiments
• Complex applications are the bottleneck, not the architecture
Multithreading • Synchronization • Thread Scheduling
Conventional Single-Threaded Processors
• Single-issue, in-order pipeline
• Should commit 1 instruction every cycle, but:
  • stalls on instruction dependences
  • stalls on memory, I/O, and accelerator accesses
• Throughput depends on sequential execution of packet processing, device control, and event monitoring → many concurrent threads
Solution to avoid stalls: multithreading
Avoiding Processor Stall Cycles
• Multithreading: execute streams of independent instructions
[Pipeline diagrams, 5 stages (F R E M W): BEFORE, traditional single-threaded execution stalls on data or control hazards; AFTER, interleaving threads 1–4 ideally eliminates all stalls]
• 4 threads eliminate hazards in a 5-stage pipeline
Multithreading is Area Efficient
[Datapath diagram: per-thread PCs and register arrays are replicated; the instruction cache, data cache, ALU, control, and hazard detection logic are shared]
• Replicate state for each thread
• Simplify control logic
• 77% more area efficient than single-threaded [FPL'07]
Infrastructure
• Compilation: MIPS-I instruction set; modified GCC 4.0.2 and Binutils 2.16
• Platform: Virtex II Pro 50, 4 GigE + 1 PCI interfaces; 2 processors @ 125 MHz; 64 MB DDR2 SDRAM @ 200 MHz
• Small caches, which would be larger on a more modern FPGA
A real system executing real applications
Our Benchmarks
Realistic, non-trivial applications dominated by control flow (Classifier, NAT, UDHCP)
Cycle Breakdown
[Chart: breakdown of execution cycles]
• Multithreading is effective at hiding memory stalls
• 18% of cycles are wasted while blocked on synchronization
• Why is there so much time spent waiting for a packet?
Packet Backlog due to Synchronization Serializing Tasks
[Timeline figure: throughput is defined by bursts of activity]
Let's focus on the underlying problem: synchronization
Real Threads Synchronize
• All threads execute the same code
• Concurrent threads may access shared data
• Critical sections ensure correctness
[Figure: Threads 1–4 each execute Lock(); shared_var = f(); Unlock();]
Impact on round-robin scheduled threads?
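A minimal C sketch of the critical-section pattern shown on the slide, using pthread mutexes as a stand-in; the real NetThreads code uses the hardware lock/unlock operations, so the primitives here are illustrative only.

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int shared_var;               /* shared data touched by every thread */

    /* All threads execute the same code; the critical section serializes
       updates to the shared variable so concurrent access stays correct. */
    void update_shared(int new_value) {
        pthread_mutex_lock(&lock);       /* Lock()            */
        shared_var = new_value;          /* shared_var = f(); */
        pthread_mutex_unlock(&lock);     /* Unlock()          */
    }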
Multithreaded Processor with Synchronization
[Pipeline diagram, 5 stages (F R E M W): one thread acquires and later releases the lock; only one thread wants the lock, so all threads continue with no stall]
What happens when more threads want the same lock?
Synchronization Wrecks Round-Robin Multithreading
[Pipeline diagram, 5 stages (F R E M W): all threads want the lock; one thread acquires and releases it while the others spin]
• Only 1 thread makes progress
• 1/4 of the expected throughput
• Can we use the idle time to help the lock-holding thread make progress?
Better Handling of Synchronization
[Pipeline diagrams, 5 stages (F R E M W): BEFORE, threads 3 and 4 occupy pipeline slots while waiting for the lock; AFTER, they are descheduled and their slots go to threads that can make progress]
Thread Scheduler
[Pipeline diagram, 5 stages (F R E M W)]
• Suspend any thread waiting for a lock
• Round-robin among the other threads to hide hazards
• The unlock operation resumes threads across processors
• Fewer active threads means hazards must again be detected
But hazard detection was on the critical path of the single-threaded processor
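A small C model of the scheduling policy just described, purely to make the behavior concrete; in NetThreads the scheduler is a hardware unit, so the data structure and function names below are assumptions.

    #include <stdbool.h>

    #define NUM_THREADS 4

    /* Per-thread flag: set when the thread blocks on a lock, cleared on unlock. */
    static bool waiting_for_lock[NUM_THREADS];

    /* Round-robin among the threads that are not blocked on a lock. */
    int pick_next_thread(int last_issued) {
        for (int i = 1; i <= NUM_THREADS; i++) {
            int candidate = (last_issued + i) % NUM_THREADS;
            if (!waiting_for_lock[candidate])
                return candidate;
        }
        return -1;   /* every thread is blocked: the pipeline idles this cycle */
    }

    /* An unlock operation resumes the waiting threads (single lock assumed). */
    void on_unlock(void) {
        for (int i = 0; i < NUM_THREADS; i++)
            waiting_for_lock[i] = false;
    }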
Typical Thread Scheduling
[Pipeline diagram: Fetch → Thread Selection (MUX) → Register Read → Execute → Memory → Writeback]
• Add a pipeline stage to pick a hazard-free instruction
• Result: increased instruction latency, a larger hazard window, and a higher branch misprediction cost
Can we add hazard detection without an extra pipeline stage?
Static Hazard Detection
• Hazards can be determined at compile time
• Hazard distances are encoded in the instructions
[Pipeline diagram: each instruction (e.g. or r1,r1,r8; or r2,r2,r9) carries a hazard-distance field; a non-zero distance tells the scheduler to schedule another thread]
Static hazard detection allows scheduling without an extra pipeline stage
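A toy C version of the compile-time analysis: for each instruction, find how far away the next dependent instruction is. This is only a sketch of the idea; the actual implementation lives in the modified GCC/Binutils and its encoding may differ.

    /* Minimal instruction model: one destination register, two source registers. */
    struct insn {
        int dest_reg;
        int src_regs[2];
    };

    /* Distance (in instructions) from code[i] to the next instruction that reads
       its destination register, capped at max_dist; 0 means no nearby hazard.
       The result would be written into the spare bits of the instruction word. */
    int hazard_distance(const struct insn *code, int n, int i, int max_dist) {
        for (int d = 1; d <= max_dist && i + d < n; d++) {
            const struct insn *later = &code[i + d];
            if (later->src_regs[0] == code[i].dest_reg ||
                later->src_regs[1] == code[i].dest_reg)
                return d;
        }
        return 0;
    }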
FPGA-Efficient Implementation
[Diagram: two 4-threaded processors with 36-bit-wide instruction caches; off-chip DDR stores 32-bit words]
• Where to store the hazard distance bits?
• Block RAMs are a multiple of 9 bits wide
• A 36-bit word leaves 4 bits available beyond the 32-bit instruction
• Also encode lock and unlock flags
How do we convert instructions from 36 bits to 32 bits?
Instruction Compaction (36 → 32 bits)
[Encoding diagrams for R-type (e.g. add rd, rs, rt), J-type (e.g. j label), and I-type (e.g. addi rt, rs, immediate) instructions]
• De-compaction: 2 block RAMs + some logic between the DDR and the cache
• Not on a critical path of the pipeline
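A C sketch of one possible layout of the 36-bit instruction-cache word after de-compaction, assuming the 4 spare bits hold 2 bits of hazard distance plus lock and unlock flags; the actual bit assignment is not specified on the slides.

    #include <stdint.h>

    /* 36-bit cache word modeled in a 64-bit container.
       Assumed layout: [31:0] MIPS instruction, [33:32] hazard distance,
       [34] lock flag, [35] unlock flag. */
    static inline uint64_t pack_word(uint32_t insn, unsigned hazard_dist,
                                     unsigned is_lock, unsigned is_unlock) {
        return (uint64_t)insn
             | ((uint64_t)(hazard_dist & 0x3u) << 32)
             | ((uint64_t)(is_lock     & 0x1u) << 34)
             | ((uint64_t)(is_unlock   & 0x1u) << 35);
    }

    static inline unsigned hazard_dist_of(uint64_t w) { return (unsigned)(w >> 32) & 0x3u; }
    static inline unsigned lock_flag_of(uint64_t w)   { return (unsigned)(w >> 34) & 0x1u; }
    static inline unsigned unlock_flag_of(uint64_t w) { return (unsigned)(w >> 35) & 0x1u; }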
CAD Results
[Table: synthesis results]
• Preserved 125 MHz operation
• Modest logic and memory footprint
Results on 3 Benchmark Applications
[Chart: packet throughput with and without thread scheduling]
Thread scheduling improves throughput by 63%, 31%, and 41%
Better Cycle Breakdown
[Chart: cycle breakdown for UDHCP, Classifier, and NAT]
• Removed cycles stalled waiting for a lock
• Throughput still dominated by serialization (future work)
Conclusions
• Made performance from parallelism easy to obtain: parallel threads hide stalls on any one thread
• Reduced synchronization cost with thread scheduling:
  • Efficient hardware scheduler, transparent to the programmer
  • Low hardware overhead, capitalizes on the FPGA
  • Throughput improvements of 63%, 31%, and 41%
• On the lookout for relevant applications suitable for benchmarking
NetThreads is available with its compiler at: http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads
ECE Dept. University of Toronto Martin Labrecque Gregory Steffan martinL@eecg.utoronto.ca NetThreads: http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads
Future Work
• Adding custom hardware accelerators
  • Same interconnect as the processors
  • Same synchronization interface
• Evaluate speculative threading
  • Alleviates the need for fine-grained synchronization
  • Reduces conservative synchronization overhead
Software Network Processing
• Not meant for straightforward tasks accomplished at line speed in hardware (e.g. basic switching and routing)
• Advantages compared to hardware:
  • Complex applications are best described in high-level software
  • Easier to design, with a fast time-to-market
  • Can interface with custom accelerators and controllers
  • Can be easily updated
• Our focus: stateful applications
  • Data structures are modified by most packets
  • Difficult to pipeline the code into balanced stages
• Run-to-completion / pool-of-threads model for parallelism (sketched below):
  • Each thread processes a packet from beginning to end
  • No thread-specific behavior
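A minimal C sketch of the run-to-completion / pool-of-threads loop mentioned above; get_packet, process, and send_packet are placeholders, not the actual NetThreads API.

    struct packet;                        /* opaque packet descriptor */

    struct packet *get_packet(void);      /* blocks until a packet arrives (placeholder) */
    void process(struct packet *p);       /* application logic, may take locks (placeholder) */
    void send_packet(struct packet *p);   /* hand the packet to the output (placeholder) */

    /* Every thread runs the same loop: no thread-specific behavior, and each
       packet is handled from beginning to end by a single thread. */
    void thread_main(void) {
        for (;;) {
            struct packet *p = get_packet();
            process(p);
            send_packet(p);
        }
    }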
Cycle Breakdown in Simulation
[Chart: cycle breakdown for Classifier, NAT, and UDHCP]
• Removed cycles stalled waiting for a lock
• Throughput still dominated by serialization
Conclusions
• Recently built a transactional multiprocessor
  • Based on a single-threaded processor
  • Thread count limited by signature size
• Promising performance results
Envisioned System (Someday)
[Diagram: many processors providing control-flow parallelism and many hardware accelerators providing data-level parallelism]
• Many compute engines
• Delivers the expected performance
• Hardware handles communication and synchronization
Processors inside an FPGA?
Performance in Packet Processing
• The application defines the throughput required:
  • Edge routing (≥ 1 Gbps/link)
  • Home networking (~100 Mbps/link)
  • Scientific instruments (< 100 Mbps/link)
• Our measure of throughput (see the sketch below):
  • Bisection search for the minimum packet inter-arrival time
  • Must not drop any packet
Are soft processors fast enough?
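A C sketch of the bisection search over the packet inter-arrival time; run_at_gap is a placeholder for replaying a trace at a fixed rate and reporting whether any packet was dropped.

    /* Returns 1 if the system sustains packets arriving every gap_ns nanoseconds
       without dropping any; 0 otherwise (placeholder for the real measurement). */
    int run_at_gap(double gap_ns);

    /* Bisection search: shrink [lo_ns, hi_ns] around the smallest inter-arrival
       gap that still drops no packets, to within tol_ns. */
    double min_interarrival_ns(double lo_ns, double hi_ns, double tol_ns) {
        while (hi_ns - lo_ns > tol_ns) {
            double mid = 0.5 * (lo_ns + hi_ns);
            if (run_at_gap(mid))
                hi_ns = mid;    /* still no drops: try an even smaller gap */
            else
                lo_ns = mid;    /* drops occur: the gap must be larger */
        }
        return hi_ns;           /* smallest gap known to work */
    }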
Efficient Network Processing
1. Memory system with specialized memories
2. Multiple-processor support
3. Multithreaded soft processor
Multithreading on 3 Benchmark Applications
Why isn't the 2nd processor always improving throughput?
Cycle Breakdown in Simulation
Most of the time is spent waiting for a packet
System Under-utilized
A consequence of the zero-packet-drop policy
Impact of Allowing Packet Drops (NAT benchmark)
The processors can actually process packets much faster