Synthesis of Custom Processors based on Extensible Platforms Fei Sun+, Srivaths Ravi++, Anand Raghunathan++ and Niraj K. Jha+ +: Dept. of Electrical Engineering Princeton University ++: NEC Laboratories America, Inc.
Outline • SoC design constraints • Background • Previous work in ASIP design • Xtensa platform • Manual custom instruction generation procedure • Automatic custom instruction generation flow • Experimental results • Conclusions
SoC Design Constraints • Time to market • Cost • Performance • Power • Cost-performance trade-off • Flexibility • ……
Comparison of Different Approaches
                   ASIC   ASIP   GPP
Time to market     --     +      ++
Cost               ++     +      --
Performance        ++     +      --
Power              ++     +      --
Cost-performance   ++     +      --
Flexibility        --     +      ++
Legend: ++ Very good, + Good, -- Very bad
Flexibility vs. Energy Efficiency [Figure: flexibility plotted against energy efficiency. ASIC: 500-1000 MOPS/mW; ASIP (Xtensa): 50-100 MIPS/mW; domain-specific processor (DSP): 1-10 MIPS/mW; general embedded processor (AMD-K6E): 0.1-1 MIPS/mW.]
Previous Work in ASIP Design • ASIP architectures and overall design methodologies • [Huang, 1994], [Adams, 1996], [Fisher, 1999], [Kucukcakar, 1999] • Application-specific instruction set selection • [Choi, 1999], [Gschwind, 1999], [Arnold, 1999] • Low power ASIP design • [Kalambur, 1997], [Dougherty, 1999], [Ishihara, 2000], [Sami, 2001] • Commercial offerings • Xtensa, ARCtangent, Jazz, SP-5flex, Carmel
Xtensa Architecture [Figure: block diagram of the Xtensa processor. Blocks shown: instruction memory or cache & tags, align and decode, branch logic & instruction fetch, window register file, ALU & address generation, MAC 16, coprocessor register file, coprocessor execution units, designer-defined instruction execution unit, special function register access, data memory or cache & tags, write buffer, memory protection unit, processor interface, processor controls, exception support, interrupt control, timers 1 to n, on-chip debug, JTAG tap control, TRACE port, instruction address watch 0 to n, data address watch 0 to n. Legend: base ISA feature; configurable function; optional function; configurable & optional function; extensible. Source: www.tensilica.com]
Xtensa Processor Design Flow [Figure: design flow. Inputs: processor configuration inputs and designer-defined instruction descriptions (configuration file). Generator outputs: configured GNU C/C++ compiler, configured GNU assembler/disassembler, configured instruction set simulator/emulator, configured processor HDL, and area, power and timing estimation. Software path: application source code and sample application data; application-specific compile, assemble, link; application simulation with ISS and/or emulator; software debugging/profiling; optimized software. Hardware path: logic synthesis (Synopsys or Ambit); block place/route (Avant! or Cadence); timing verification; hardware profile; optimized hardware. Legend: generator output; internal database; design data; use of generated data. Source: www.tensilica.com]
Manual Custom Instruction Generation Procedure • Identify potential new instructions (profile, read source code) • Describe custom instructions (understand source code) • Insert custom instructions (rewrite source code) • Verify functional correctness • Overall: slow and error-prone
Contributions of Our Work • Automatic custom instruction selection • From an application program to an extensible processor with custom instructions • Features • Efficient design space search • Uses accurate information from the instruction set simulator and synthesis • Bridges the gap between automatically synthesized and manually designed architectures
Key Observations for Pruning • The higher the weight of a template, the higher its potential for improvement (Amdahl’s law) • Scope for optimization is determined by computation: the number of cycles needed to execute the template • Scope for optimization is determined by the read/write port limitation: additional cycles are needed for extra reads/writes of input/output variables
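A short worked form of the Amdahl's law observation above. The symbols w (fraction of the original execution time spent in the template, i.e. its weight) and s (speedup of the template itself once it is replaced by a custom instruction) are notation assumed here, not taken from the slides.

```latex
S_{\text{overall}} = \frac{1}{(1 - w) + \dfrac{w}{s}}
```

For example, w = 0.5 and s = 4 give an overall speedup of 1.6, while w = 0.1 and s = 4 give only about 1.08. As s grows without bound, the overall speedup is capped at 1/(1 - w), which is why low-weight templates are natural candidates for pruning.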
Pruning Algorithm • Ranking criterion: • OriginalTime: Fraction of the total execution time of the original program spent in the template (weight) • In, Out: Number of inputs and outputs of the template, respectively • α, β: Number of inputs/outputs encoded in the instruction • γ: No. of cycles needed for executing the template • Higher priority means greater potential for speed up
Template Generation with Pruning [Figure, four animation frames: templates with priorities 16.35, 12.73, 10.51, 7.92, 5.36, 4.05, 2.13 and 1.18 move between a ranked pool of seed templates and the template set; the highest-priority template is marked in each frame. Threshold: 0.1.]
Custom Instruction Insertion • Care must be taken to insert custom instructions into appropriate places without affecting program’s functional correctness • If custom instructions need extra inputs (outputs), care must be taken to select appropriate variables to write to (read from) user-defined registers
Example Illustration of Custom Instruction Insertion (Contd.)
(a) Original code:
    ....
    offset = t + 1;
    for (i=0; i<100; i++) {
        j = ....
        result = offset + i * j;
    }
    ....
(b) Code after custom instruction insertion:
    ....
    offset = t + 1;
    WUR(offset,0);
    for (i=0; i<100; i++) {
        j = ....
        result = CustomInstr(i,j);
    }
    ....
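The insertion example above can be checked with a plain-C software model. This is only a sketch: `user_reg0`, `wur0`, `custom_instr`, and the stand-in value chosen for the elided `j = ....` are hypothetical names and choices introduced for illustration, not the Xtensa intrinsics or the authors' tool output. It models the custom instruction as reading its third operand (`offset`) from a user-defined register that is written once before the loop.

```c
#include <assert.h>

/* Hypothetical software model of user-defined register 0. */
static int user_reg0;

/* Stand-in for WUR(value, 0): write a value to user register 0. */
static void wur0(int value) { user_reg0 = value; }

/* Stand-in for CustomInstr(i, j): two explicit operands, with the
 * third operand (offset) taken from the user register. */
static int custom_instr(int i, int j) { return user_reg0 + i * j; }

/* Version (a): the original loop body. */
int original_result(int t, int i, int j)
{
    int offset = t + 1;
    return offset + i * j;
}

/* Version (b): offset is written to the user register, then the
 * custom instruction replaces the add/multiply sequence. */
int rewritten_result(int t, int i, int j)
{
    int offset = t + 1;
    wur0(offset);
    return custom_instr(i, j);
}

/* Asserts that both versions agree for every loop iteration. */
void check_equivalence(int t)
{
    for (int i = 0; i < 100; i++) {
        int j = i + 3; /* arbitrary stand-in for the elided "j = ...." */
        assert(original_result(t, i, j) == rewritten_result(t, i, j));
    }
}
```

Calling `check_equivalence(5)` runs all 100 iterations and aborts if the two versions ever differ. The model also shows why the write to the user register must come after `offset` is assigned and before the first use of the custom instruction.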
Custom Instruction Combination Selection --- Problem Statement • Given a set of non-overlapping custom instructions, with each instruction having several versions, find a version for each instruction such that performance is maximized while area is under a certain threshold
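The problem statement above has the shape of a multiple-choice knapsack problem. The sketch below solves small instances by exhaustive search; it is not the selection algorithm used in this work, which the slides do not describe. The type names, the integer area units, and the use of "cycles saved" as the performance measure are assumptions made for illustration.

```c
#include <assert.h>

#define MAX_INSTRS 8
#define MAX_VERSIONS 4

/* One implementation version of a custom instruction (hypothetical units). */
typedef struct {
    int area;          /* area cost of this version */
    int cycles_saved;  /* execution cycles saved, assumed non-negative */
} Version;

typedef struct {
    int num_versions;
    Version versions[MAX_VERSIONS];
} CustomInstr;

/* Try every version of instruction idx, then recurse on the rest. */
static void search(const CustomInstr *instrs, int n, int idx,
                   int area_left, int saved,
                   int *current, int *best_saved, int *best)
{
    if (idx == n) {
        if (saved > *best_saved) {
            *best_saved = saved;
            for (int k = 0; k < n; k++)
                best[k] = current[k];
        }
        return;
    }
    for (int v = 0; v < instrs[idx].num_versions; v++) {
        const Version *ver = &instrs[idx].versions[v];
        if (ver->area > area_left)
            continue; /* this version would exceed the area budget */
        current[idx] = v;
        search(instrs, n, idx + 1, area_left - ver->area,
               saved + ver->cycles_saved, current, best_saved, best);
    }
}

/* Picks exactly one version per instruction so that total cycles saved
 * is maximal and total area stays within area_budget. Writes the chosen
 * version indices to best[] and returns the cycles saved, or -1 if no
 * combination fits the budget. */
int select_versions(const CustomInstr *instrs, int n, int area_budget, int *best)
{
    assert(n >= 0 && n <= MAX_INSTRS);
    int current[MAX_INSTRS];
    int best_saved = -1;
    search(instrs, n, 0, area_budget, 0, current, &best_saved, best);
    return best_saved;
}
```

Exhaustive search is exponential in the number of instructions, so it only suits small sets; it is used here because it makes the constraint and objective explicit. Leaving an instruction out entirely can be modeled by giving it a version with zero area and zero cycles saved.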
Experimental Methodology [Figure: tool flow. Input: C program. Tools: Aristotle, Xtensa GNU profiler, automatic custom instruction generation, Xtensa TIE compiler, cross compiler, ISS, Tensilica processor generator, Synopsys Design Compiler (NEC CB11), Sente WattWatcher. Intermediate results: TIE description, modified C program, custom processor (HDL description). Outputs: execution cycles, power, area, clock period.]
Experimental Results (Contd.) • Average performance improvement: 3.4X • Average energy reduction: 3.2X • Average energy*delay reduction: 12.6X • Average area increase: 1.8%
Conclusions • Automatic custom instruction synthesis for ASIPs • Template generation/selection • Custom instruction insertion • Custom instruction combination selection • Experimental results • 3.4X average performance improvement • 12.6X average energy*delay reduction