Deepak Unnikrishnan, Jia Zhao, Russell Tessier, Reconfigurable Computing Group

Application-specific Customization and Scalability of Soft Multiprocessors
Deepak Unnikrishnan, Jia Zhao, Russell Tessier
Reconfigurable Computing Group, University of Massachusetts, Amherst
Funded by: Altera Corp. and National Science Foundation

Outline
• Motivation
• Design Components
• Design Flow
• Results
• Conclusion

Motivation
• Emerging soft multiprocessor systems and applications
• Fully automated approaches to soft multiprocessor application development
• Easy to use
• Design space exploration of multi-core systems
• Existing parallel computing benchmarks
• Evaluation of application-specific customizations

Soft Multiprocessor Synthesis
• FPGA-based soft multiprocessor system for IPv4 packet forwarding [1]
• Stream applications: ESPAM [2]
• Latency/throughput-constrained stream applications [3]
• Limitations
  • Tuned for a specific application
  • No individual processor optimizations
  • Not scalable
[1] “An FPGA-based soft multiprocessor system for IPv4 packet forwarding,” Ravindran et al., FPL 2005.
[2] “Efficient automated synthesis, programming and implementation of multi-processor platforms on FPGA chips,” Nikolov et al., FPL 2006.
[3] “Synthesis of an application-specific soft multiprocessor system,” Cong et al., FPGA 2007.

Processor Optimization/Interconnect
• Soft-processor optimization techniques
  • Pipeline stages, ISA, shifter, forwarding logic [1]
  • Custom hardware [2]
  • Instruction scheduling and recoding [3]
• Interconnects
  • Bus/Network on Chip
  • Topologies: ring, star, mesh, hypercube [4]
• Limitations
  • Isolated evaluation of design tradeoffs
  • Limited benchmarks
[1] “Application-specific customization of soft processor microarchitecture,” Yiannacouras et al., FPGA 2006.
[2] “CUSTARD- A customizable threaded FPGA soft processor and tools,” Dimond et al., FPL 2007.
[3] “Combining Instruction Coding and Scheduling to Optimize Energy in System-on-FPGA,” Dimond et al., FCCM 2006.
[4] “Routability of Network Topologies in FPGAs,” Saldana et al., TVLSI, March 2007.

Design Flow
[Figure: design flow diagram. Inputs: StreamIt application, topology, number of processors, custom features. Tools: StreamIt compiler (computation, communication), SoftCoreMapper, binary profiler, soft multiprocessor generator with processor templates (SPREE). Outputs: code for soft multiprocessors and multiprocessor system designs. Back end: SPREE gcc and the Quartus flow, followed by area, performance, and power evaluation.]

Example StreamIt Application
void->void pipeline FMRadio(int N, int freq1, int freq2) {
    add AtoD();
    add FMDemod();
    add splitjoin {
        split duplicate;
        for (int i=0; i<N; i++) {
            add pipeline {
                add LowPassFilter();
                add HighPassFilter();
            }
        }
        join roundrobin();
    }
    add Adder();
    add Speaker();
}
[Figure: stream graph with nodes AtoD, FMDemod, Duplicate, LPF1 to LPF3, HPF1 to HPF3, RoundRobin, Adder, Speaker]
Courtesy: “Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs,” Gordon et al., ASPLOS 2006
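The graph expansion for this example can be sketched in a few lines of Python. This is only an illustration of how the splitjoin loop unrolls into the nodes shown in the figure; the function name `expand_fm_radio` and the edge representation are assumptions made here, not part of the StreamIt compiler.

```python
def expand_fm_radio(n):
    """Return (nodes, edges) for the FMRadio stream graph with n branches."""
    nodes = ["AtoD", "FMDemod", "Duplicate"]
    edges = [("AtoD", "FMDemod"), ("FMDemod", "Duplicate")]
    for i in range(1, n + 1):
        lpf, hpf = f"LPF{i}", f"HPF{i}"
        nodes += [lpf, hpf]
        # The duplicate splitter feeds every branch; each branch is a
        # two-filter pipeline that drains into the round-robin joiner.
        edges += [("Duplicate", lpf), (lpf, hpf), (hpf, "RoundRobin")]
    nodes += ["RoundRobin", "Adder", "Speaker"]
    edges += [("RoundRobin", "Adder"), ("Adder", "Speaker")]
    return nodes, edges


nodes, edges = expand_fm_radio(3)
print(len(nodes), "nodes,", len(edges), "channels")
```

With N = 3 the expansion yields the twelve nodes listed on the slide, which is the granularity at which partitioning onto processors would then operate.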

Soft multiprocessor application compiler
[Figure: compiler flow. StreamIt application → StreamIt compiler (graph expansion, partitioning, layout, scheduling) → SoftCoreMapper (dependency analysis, topology-based rescheduling, code generation) → code for soft multiprocessor systems]

SPREE – Soft processor generator
• Soft Processor Rapid Exploration Environment
• Automatic processor generation from processor descriptions
• Fine-grained micro-architectural customizations
  • Pipeline stages
  • Datapath
  • Instruction set
• Excellent platform for hardware-software co-design evaluation
[Figure: SPREE flow. Processor description → RTL generator → Verilog processor designs; application → MIPS gcc; Quartus CAD flow → area, power, frequency]
Courtesy: “Application-specific customization of soft processor microarchitecture,” P. Yiannacouras et al., FPGA 2006.

Example Multiprocessor Architecture
• Key architectural features:
  • Software flow control
  • Memory-mapped I/O ports
  • Local on-chip memories
• Custom features
  • Topology
  • Interconnect buffer size
  • Pipeline stages
  • Instruction set architecture
[Figure: six soft processors (0 to 5), each with IF/D, EX/M, and WB pipeline stages, connected by circular FIFOs]
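A minimal software model of one circular FIFO link with software flow control might look like the following. It is a sketch under assumptions: the class and method names are hypothetical, and the real design exposes the FIFO through memory-mapped I/O ports rather than Python method calls. The point is only that the sender checks for a full buffer before writing and the receiver checks for an empty one before reading.

```python
class CircularFifo:
    """Fixed-depth ring buffer standing in for one interconnect FIFO."""

    def __init__(self, depth):
        self.buf = [None] * depth
        self.head = 0    # next slot to read
        self.tail = 0    # next slot to write
        self.count = 0   # words currently stored

    def full(self):
        return self.count == len(self.buf)

    def empty(self):
        return self.count == 0

    def push(self, word):
        """Write one word; return False so the sender can retry when full."""
        if self.full():
            return False
        self.buf[self.tail] = word
        self.tail = (self.tail + 1) % len(self.buf)
        self.count += 1
        return True

    def pop(self):
        """Read one word; return None so the receiver can poll when empty."""
        if self.empty():
            return None
        word = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        self.count -= 1
        return word


fifo = CircularFifo(depth=2)
print(fifo.push(1), fifo.push(2), fifo.push(3))  # third write is refused
print(fifo.pop(), fifo.pop(), fifo.pop())        # third read finds it empty
```

In this model a refused `push` corresponds to a sending processor stalling in software until the downstream FIFO has room.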

Experimental Framework and Results
• Generated multiprocessor systems of 4, 6, 9, and 16 processors
• Designs synthesized using Altera Quartus II 8.0 / ModelSim 6.1g
• 16-processor systems verified on the DE3 platform
• Target devices: Stratix II and Stratix III
• 8 StreamIt benchmarks

Topologies
• Topologies
  • Mesh
  • Point to point
• Communication rescheduling
  Comm schedule - graph
  For each generated data
  {
      Discover hop edges
      Eliminate hop edges
      Insert point-to-point edges
  }
  Reschedule communication
[Figure: mesh topology, processors 0 to 5, with per-processor port schedules such as S->In, Out->E, W->E, S->E, W->S, Out->N, W->In, N->In]
[Figure: point-to-point topology, processors 0, 3, 4, 5, with direct schedules such as 3->In, Out->5, Out->0, Out->4, 5->In]
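The rescheduling loop above can be illustrated with a short Python sketch. It assumes each data item's mesh route is given as a list of processor IDs. That representation, the function name, and the sample routes are assumptions for illustration, not the SoftCoreMapper's actual data structures.

```python
def reschedule_point_to_point(routes):
    """Map multi-hop mesh routes to direct point-to-point edges."""
    direct_edges = []
    hops_removed = 0
    for route in routes:                            # for each generated data item
        hop_edges = list(zip(route, route[1:]))     # discover hop edges
        if not hop_edges:
            continue                                # data never leaves its processor
        hops_removed += len(hop_edges) - 1          # eliminate intermediate hops
        direct_edges.append((route[0], route[-1]))  # insert point-to-point edge
    return direct_edges, hops_removed


# Hypothetical routes: processor 3 sends to 0, to 4, and to 5 via 4.
mesh_routes = [[3, 0], [3, 4], [3, 4, 5]]
edges, saved = reschedule_point_to_point(mesh_routes)
print(edges, saved)
```

After this step the communication schedule would be regenerated over the direct edges, since intermediate processors no longer spend cycles forwarding data.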

Results - Point-to-Point vs Mesh

Pipeline Depth
• Deeper pipelines incur more pipeline stalls
• Stalls propagate across multiple processors
• 4-stage pipelines offer a 26% improvement in critical-path frequency
[Figure: application performance vs. pipeline depth for 16 processors]

Interconnect Buffer
• FIFO size can regulate network congestion
• Small FIFOs choke the network
• Very large (effectively infinite) FIFOs offer little performance gain per unit area
[Figure: interconnect sizing results for 9 processors using point-to-point topology]
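The sizing tradeoff can be seen in a toy cycle-level model of one bursty producer and one slower consumer sharing a FIFO. This is not the authors' experiment: the function name and all parameters (burst length, gap, service time) are made-up values chosen only to show that a too-small FIFO stalls the sender, while growing the FIFO past what the burst needs changes nothing.

```python
def cycles_to_transfer(depth, bursts=8, burst_len=4, gap=12, service=3):
    """Count cycles until the consumer has taken every word from the FIFO."""
    total = bursts * burst_len
    fifo = 0             # words currently buffered
    sent = received = 0
    pending = burst_len  # words left in the producer's current burst
    idle = 0             # producer compute cycles left before the next burst
    busy = 0             # consumer cycles left on its current word
    cycle = 0
    while received < total:
        cycle += 1
        # Consumer: keep processing the current word, else take a new one.
        if busy > 0:
            busy -= 1
        elif fifo > 0:
            fifo -= 1
            received += 1
            busy = service - 1
        # Producer: compute between bursts, else push unless the FIFO is full.
        if idle > 0:
            idle -= 1
            if idle == 0 and sent < total:
                pending = burst_len
        elif pending > 0 and fifo < depth:
            fifo += 1
            sent += 1
            pending -= 1
            if pending == 0:
                idle = gap
    return cycle


for depth in (1, 2, 4, 64):
    print(depth, cycles_to_transfer(depth))
```

In this model a depth-1 FIFO forces the producer to wait on the consumer within every burst, whereas depths of 4 and 64 finish in the same number of cycles, so the extra storage buys no performance.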

ISA Subsetting
• Application-specific customization of processor instructions
• Fine-grained applications typically require fewer instructions
• Eliminate unused instructions by profiling the application binary
[Figure: instruction usage patterns for 16 processors]
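A binary profiler for ISA subsetting could be sketched as below. The sketch assumes the binary has already been disassembled into one instruction per line with the mnemonic first. The `SUPPORTED` set is an illustrative MIPS-like subset rather than SPREE's actual instruction list, and the function names are hypothetical.

```python
from collections import Counter

# Illustrative instruction set; NOT the actual SPREE-supported list.
SUPPORTED = {"add", "addiu", "sub", "and", "or", "sll", "srl",
             "mult", "div", "lw", "sw", "beq", "bne", "j"}


def profile(disassembly):
    """Count how often each mnemonic appears in the disassembly text."""
    counts = Counter()
    for line in disassembly.splitlines():
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


def unused_instructions(disassembly, supported=SUPPORTED):
    """Return supported instructions the program never uses, sorted by name."""
    return sorted(supported - set(profile(disassembly)))


program = """
addiu sp,sp,-32
lw    t0,0(a0)
add   t1,t0,t0
sw    t1,0(a1)
bne   t1,zero,loop
"""
print(unused_instructions(program))
```

The returned list is the set of candidates for removal from that processor's datapath; running the profile per processor gives each core its own subset.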

ISA Subsetting
[Figure: average area savings from ISA subsetting for 16-processor systems]

Impact of combined optimizations
• Best case: point to point, ISA subsetted, 4-stage pipeline, optimum FIFO size
• Worst case: mesh, no ISA subsetting, 3-stage pipeline, unoptimized FIFO size
[Figure: impact of combined optimizations for 16-processor systems]

Scalability

Power Consumption

Conclusion
• Fully automatic and scalable flow for design evaluation of large soft multiprocessor systems with existing parallel computing benchmarks
• 16-processor systems demonstrate 3-5x speedup
• Point-to-point topologies show 1.5x-2x speedup over meshes
• Individual micro-architectural customizations yield area and performance improvements

Thank you

Backup slide - Scalability

Backup slides – Dynamic power

Backup slides - Future Work
• Evaluate the impact of StreamIt compiler optimizations on soft multiprocessor systems
  • Example: choice of partitioning (greedy, dynamic programming)
• More aggressive processor optimizations
• Application-specific on-chip memory size reduction
• Target 32-64 soft processors on larger FPGAs with larger applications
