
Deepak Unnikrishnan , Jia Zhao, Russell Tessier Reconfigurable Computing Group

Application-specific Customization and Scalability of Soft Multiprocessors. Deepak Unnikrishnan, Jia Zhao, Russell Tessier. Reconfigurable Computing Group, University of Massachusetts, Amherst. Funded by: Altera Corp. and National Science Foundation.


Presentation Transcript


  1. Application-specific Customization and Scalability of Soft Multiprocessors
     Deepak Unnikrishnan, Jia Zhao, Russell Tessier
     Reconfigurable Computing Group, University of Massachusetts, Amherst
     Funded by: Altera Corp. and National Science Foundation

  2. Outline
     • Motivation
     • Design Components
     • Design Flow
     • Results
     • Conclusion

  3. Motivation
     • Emerging soft multiprocessor systems and applications
     • Fully automated approaches to soft multiprocessor application development
     • Easy to use
     • Design space exploration of multi-core systems
     • Existing parallel computing benchmarks
     • Evaluation of application-specific customizations

  4. Soft Multiprocessor Synthesis
     • FPGA-based soft multiprocessor system for IPv4 packet forwarding [1]
     • Stream applications – ESPAM [2]
     • Latency/throughput-constrained stream applications [3]
     • Limitations
       • Tuned for a specific application
       • No individual processor optimizations
       • Not scalable
     [1] “An FPGA-based soft multiprocessor system for IPv4 packet forwarding,” Ravindran et al., FPL 2005.
     [2] “Efficient automated synthesis, programming and implementation of multi-processor platforms on FPGA chips,” Nikolov et al., FPL 2006.
     [3] “Synthesis of an application-specific soft multiprocessor system,” Cong et al., FPGA 2007.

  5. Processor Optimization/Interconnect
     • Soft-processor optimization techniques
       • Pipeline stages, ISA, shifter, forwarding logic [1]
       • Custom hardware [2]
       • Instruction scheduling and recoding [3]
     • Interconnects
       • Bus/Network on Chip
       • Topologies – Ring, Star, Mesh, Hypercube [4]
     • Limitations
       • Isolated evaluation of design tradeoffs
       • Limited benchmarks
     [1] “Application-specific customization of soft processor microarchitecture,” Yiannacouras et al., FPGA 2006.
     [2] “CUSTARD - A customizable threaded FPGA soft processor and tools,” Dimond et al., FPL 2007.
     [3] “Combining Instruction Coding and Scheduling to Optimize Energy in System-on-FPGA,” Dimond et al., FCCM 2006.
     [4] “Routability of Network Topologies in FPGAs,” Saldana et al., TVLSI, March 2007.

  6. Design Flow
     [Figure: design flow diagram; labels as extracted from the slide]
     • Inputs: Streamit application, Topology, # Processors, Custom features
     • Stages and tools: Streamit Compiler, Processor Templates (SPREE), Computation, Communication, Soft multiprocessor generator, SoftCoreMapper, Binary profiler, SPREE, gcc, Quartus Flow
     • Outputs: Multiprocessor system designs, Code for soft multiprocessors, Area/Performance/Power evaluation

  7. Example Streamit Application

     void->void pipeline FMRadio(int N, int freq1, int freq2) {
         add AtoD();
         add FMDemod();
         add splitjoin {
             split duplicate;
             for (int i=0; i<N; i++) {
                 add pipeline {
                     add LowPassFilter();
                     add HighPassFilter();
                 }
             }
             join roundrobin();
         }
         add Adder();
         add Speaker();
     }

     [Figure: stream graph with nodes AtoD, FMDemod, Duplicate, LPF1, LPF2, LPF3, HPF1, HPF2, HPF3, RoundRobin, Adder, Speaker]
     Courtesy: “Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs,” Gordon et al., ASPLOS 2006.
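The stream graph on this slide can be reproduced from the FMRadio source by unrolling the splitjoin loop. The following is a minimal Python sketch of that expansion, not the Streamit compiler's actual graph expansion pass. The function name `fm_radio_graph` and the edge-list representation are assumptions made for illustration; the node names follow the figure labels.

```python
def fm_radio_graph(n):
    """Return (nodes, edges) for the FMRadio pipeline with n filter branches.

    Hypothetical helper for illustration only; node names follow the
    slide's figure (LPF = LowPassFilter, HPF = HighPassFilter).
    """
    nodes = ["AtoD", "FMDemod", "Duplicate"]
    edges = [("AtoD", "FMDemod"), ("FMDemod", "Duplicate")]

    # "split duplicate" sends every item to each branch; each branch is a
    # two-filter pipeline whose output feeds the round-robin joiner.
    for i in range(1, n + 1):
        lpf, hpf = f"LPF{i}", f"HPF{i}"
        nodes += [lpf, hpf]
        edges += [("Duplicate", lpf), (lpf, hpf), (hpf, "RoundRobin")]

    nodes += ["RoundRobin", "Adder", "Speaker"]
    edges += [("RoundRobin", "Adder"), ("Adder", "Speaker")]
    return nodes, edges


if __name__ == "__main__":
    nodes, edges = fm_radio_graph(3)
    print(len(nodes), "nodes,", len(edges), "edges")
```

With N = 3 the sketch yields the twelve nodes named in the figure; each extra branch adds two nodes and three edges.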

  8. Soft multiprocessor application compiler
     [Figure: compiler flow from Streamit application to code for soft multiprocessor systems]
     • Streamit Compiler: Graph Expansion, Partitioning, Layout, Scheduling
     • SoftCoreMapper: Dependency Analysis, Topology-based Rescheduling, Code generation
     • Output: Code for soft multiprocessor systems

  9. SPREE – Soft processor generator
     • Soft Processor Rapid Exploration Environment
     • Automatic processor generation from processor descriptions
     • Fine-grained micro-architectural customizations
       • Pipeline stages
       • Datapath
       • Instruction set
     • Excellent platform for hardware-software co-design evaluation
     [Figure: SPREE flow. Labels: Processor Description, RTL Generator, Verilog processor designs, App, MIPS gcc, Quartus CAD Flow, Area/Power/Frequency]
     Courtesy: “Application-specific customization of soft processor microarchitecture,” P. Yiannacouras et al., FPGA 2006.

  10. Example Multiprocessor Architecture
     • Key architectural features
       • Software flow control
       • Memory-mapped I/O ports
       • Local on-chip memories
     • Custom features
       • Topology
       • Interconnect buffer size
       • Pipeline stages
       • Instruction set architecture
     [Figure: soft processors 0 to 5 connected by circular FIFOs; pipeline stages shown as IF/D, EX/M, WB]
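The slide names circular FIFOs and software flow control but gives no implementation detail. As a rough illustration, the sketch below models one inter-processor link as a bounded ring buffer where the sender checks for space and the receiver checks for data before touching the buffer. The class name `CircularFifo` and its methods are hypothetical, and the polling scheme is an assumption about what software flow control means here.

```python
class CircularFifo:
    """Bounded ring buffer standing in for one inter-processor link.

    Hypothetical model: the slides do not specify the FIFO interface.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._slots = [None] * capacity
        self._head = 0   # index of the oldest stored word
        self._count = 0  # number of words currently stored

    def is_full(self):
        return self._count == len(self._slots)

    def is_empty(self):
        return self._count == 0

    def try_write(self, word):
        """Store word if there is room; return False so the sender can retry."""
        if self.is_full():
            return False
        tail = (self._head + self._count) % len(self._slots)
        self._slots[tail] = word
        self._count += 1
        return True

    def try_read(self):
        """Return (True, word) if data is available, else (False, None)."""
        if self.is_empty():
            return False, None
        word = self._slots[self._head]
        self._head = (self._head + 1) % len(self._slots)
        self._count -= 1
        return True, word
```

In this model a sender that gets False from `try_write` simply polls again, which is one way a small buffer can stall an upstream processor.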

  11. Experimental Framework and Results
     • Generated multiprocessor systems of size 4, 6, 9 and 16
     • Designs synthesized using Altera Quartus II 8.0 and simulated with ModelSim 6.1g
     • 16-processor systems verified on the DE3 platform
     • Target devices: Stratix II and Stratix III
     • 8 Streamit benchmarks

  12. Topologies
     • Topologies
       • Mesh
       • Point to point
     • Communication rescheduling
       • Comm schedule - graph
       • For each generated data item
         {
           Discover hop edges
           Eliminate hop edges
           Insert point-to-point edges
         }
       • Reschedule communication
     [Figure: Mesh Topology. Processors 0 to 5 annotated with per-hop routing entries such as W->E, W->S and N->In]
     [Figure: Point to point Topology. Processors 0, 3, 4, 5 annotated with direct entries such as 3->In and Out->5]
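The rescheduling loop on this slide can be sketched in a few lines of Python. This is an illustrative reading of the pseudocode, not the SoftCoreMapper implementation: it assumes each data transfer is available as a mesh route (a list of processor ids from producer to consumer), and the name `reschedule_point_to_point` and the returned hop count are hypothetical.

```python
def reschedule_point_to_point(mesh_routes):
    """Collapse multi-hop mesh routes into direct point-to-point edges.

    mesh_routes: list of routes, each a list of processor ids such as
    [3, 4, 5] (assumed input format, not taken from the slides).
    Returns (p2p_edges, hops_removed).
    """
    p2p_edges = []
    hops_removed = 0
    for route in mesh_routes:
        if len(route) < 2:
            continue  # nothing to transfer
        # Discover hop edges: consecutive processor pairs along the route.
        hop_edges = list(zip(route, route[1:]))
        # Eliminate hop edges: all but one traversal disappears.
        hops_removed += len(hop_edges) - 1
        # Insert a point-to-point edge from producer to consumer.
        direct_edge = (route[0], route[-1])
        if direct_edge not in p2p_edges:
            p2p_edges.append(direct_edge)
    return p2p_edges, hops_removed


if __name__ == "__main__":
    print(reschedule_point_to_point([[3, 4, 5], [3, 0], [3, 4, 5]]))
```

The final "Reschedule communication" step, which reorders sends and receives on the new edges, is not modeled here.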

  13. Results - Point-to-Point vs Mesh

  14. Pipeline Depth
     • Deeper pipelines incur more pipeline stalls
     • Stalls propagate across multiple processors
     • 4-stage pipelines offer 26% improvement in critical path frequency
     [Figure: Application performance vs pipeline depth for 16 processors]

  15. Interconnect Buffer
     • FIFO size can regulate network congestion
     • Small FIFOs choke the network
     • Infinite FIFO sizes offer little performance gain per unit area
     [Figure: Interconnect sizing results for 9 processors using point-to-point topology]

  16. ISA Subsetting
     • Application-specific customization of processor instructions
     • Fine-grained applications typically require fewer instructions
     • Eliminate unused instructions by profiling the application binary
     [Figure: Instruction usage patterns for 16 processors]

  17. ISA Subsetting
     [Figure: Average area savings for 16-processor systems by ISA subsetting]
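The profiling step behind ISA subsetting amounts to a set difference between the instructions a processor supports and those its binary actually uses. The Python sketch below illustrates that idea under stated assumptions: the mnemonics are assumed to have already been extracted from a disassembly of the binary, and `profile_isa` and the example instruction names are hypothetical rather than part of the authors' binary profiler.

```python
from collections import Counter


def profile_isa(mnemonics, supported_isa):
    """Count instruction usage and list supported instructions never used.

    mnemonics: iterable of instruction names found in one processor's
    binary (assumed to be pre-extracted from a disassembly).
    supported_isa: set of instruction names the processor implements.
    Returns (usage_counter, removable_instructions).
    """
    usage = Counter(m.lower() for m in mnemonics)
    supported = {m.lower() for m in supported_isa}

    unknown = set(usage) - supported
    if unknown:
        # A binary using unsupported instructions signals a profiling error.
        raise ValueError(f"unsupported instructions in binary: {sorted(unknown)}")

    removable = sorted(supported - set(usage))
    return usage, removable


if __name__ == "__main__":
    usage, removable = profile_isa(
        ["addu", "lw", "addu", "sw"],
        {"addu", "lw", "sw", "mult", "sll"},
    )
    print(usage.most_common(), removable)
```

Running the profile once per processor gives a per-core removable list, matching the slide's point that fine-grained partitions tend to need fewer instructions.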

  18. Impact of combined optimizations
     • Best case – P2P, ISA subsetted, 4-stage, with optimum FIFO size
     • Worst case – Mesh, no ISA subsetting, 3-stage, unoptimized FIFO size
     [Figure: Impact of combined optimizations for 16-processor systems]

  19. Scalability

  20. Power Consumption

  21. Conclusion
     • Fully automatic and scalable flow for design evaluation of large soft-multiprocessor systems with existing parallel computing benchmarks
     • 16-processor systems demonstrate 3-5x speedup
     • Point-to-point topologies show 1.5x-2x speedup over meshes
     • Individual micro-architectural customizations yield area and performance improvements

  22. Thank you

  23. Backup slide - Scalability

  24. Backup slides – Dynamic power

  25. Backup slides - Future Work
     • Evaluate the impact of Streamit compiler optimizations on soft-multiprocessor systems
       • Example: choice of partitioning – greedy, dynamic programming
     • More aggressive processor optimizations
     • Application-specific on-chip memory size reduction
     • Target 32-64 soft processors on larger FPGAs with larger applications
