Automatic Processor Specialisation using Ad-hoc Functional Units
Paolo.Ienne@epfl.ch, Laura.Pozzi@epfl.ch, Miljan.Vuletic@epfl.ch
EPFL – I&C – LAP
Design Gap!

Classic Options for Systems-on-Chip
Processor Specialisation: Get the Best of Both Options (Embedded!)
VLIW Processor Specialisation
• Two complementary specialisation strategies:
  • Parametric architecture
  • Ad-hoc Functional Units (AFUs)
Automatically Collapsing Clusters of Instructions into New Ones
• One ad-hoc complex operation instead of a long sequence of standard ones
• Gain whenever the ad-hoc functional unit completes the job faster
General Goal: Automatically achieve processor specialisation through high-level application code analysis
Outline
• Introduction
• Motivational example
• Goals
• Opportunities for specialisation
• Challenges, further opportunities, …
Elementary Motivational Example: An Important Kernel…
Shift-and-add unsigned 8 x 8-bit multiplication:

/* init */
a <<= 8;
/* loop */
for (i = 0; i < 8; i++) {
  if (a & 0x8000) {
    a = (a << 1) + b;
  } else {
    a <<= 1;
  }
}
return a & 0xffff;
Software Predication
Predicate mask p1 (0 or -1 = 0xffffffff); shift and predicated add:

/* init */
a <<= 8;
/* loop */
for (i = 0; i < 8; i++) {
  p1 = -((a & 0x8000) >> 15);
  a = (a << 1) + (b & p1);
}
return a & 0xffff;
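To make the equivalence concrete, here is a minimal, self-contained C sketch (not part of the slides; the helper names mul_branchy and mul_predicated are made up for illustration) checking that the branchy kernel, the predicated kernel, and a plain multiplication agree for all 8-bit operands:

#include <assert.h>
#include <stdio.h>

/* Branchy shift-and-add kernel, as on the slide above (hypothetical wrapper). */
static unsigned mul_branchy(unsigned a, unsigned b) {
    a <<= 8;
    for (int i = 0; i < 8; i++) {
        if (a & 0x8000)
            a = (a << 1) + b;
        else
            a <<= 1;
    }
    return a & 0xffff;
}

/* Predicated kernel: p1 is 0 or 0xffffffff, so (b & p1) selects b or 0. */
static unsigned mul_predicated(unsigned a, unsigned b) {
    a <<= 8;
    for (int i = 0; i < 8; i++) {
        unsigned p1 = -((a & 0x8000) >> 15);
        a = (a << 1) + (b & p1);
    }
    return a & 0xffff;
}

int main(void) {
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            assert(mul_branchy(a, b) == a * b && mul_predicated(a, b) == a * b);
    puts("branchy == predicated == a*b for all 8-bit operands");
    return 0;
}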
Loop Kernel DAG (operands a, b): in software the kernel body takes ~6 cycles on the ALU; in hardware the shifts and the mask extraction (& 0x8000, >> 15) reduce to wiring, the predication to AND gates, and the whole kernel completes in 1-2 cycles.
Ad-hoc Unit to Accelerate the Shift-and-Add Multiplication Loop
Datapath: Register File, ALU, LD/ST, MSTEP.
MSTEP semantics: if (Rn[31] == 1) then Rn ← (Rn << 1) + Rm else Rn ← (Rn << 1)
1 ad-hoc instruction added; loop kernel reduced to 15-30%.
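As a rough illustration (the function names and the bit-31 alignment are assumptions consistent with the Rn[31] test above, not taken from the slides), a C model of MSTEP and of the multiplication loop rewritten around it might look like this:

#include <stdint.h>

/* Hypothetical C model of the MSTEP ad-hoc instruction:
 * if the top bit of Rn is set, Rn <- (Rn << 1) + Rm, else Rn <- Rn << 1. */
static uint32_t mstep(uint32_t rn, uint32_t rm) {
    return (rn & 0x80000000u) ? (rn << 1) + rm : (rn << 1);
}

/* The 8x8 kernel then becomes eight back-to-back MSTEPs; the multiplicand
 * is first aligned to bit 31 (instead of bit 15 in the C kernel above). */
static uint32_t mul8x8_with_mstep(uint32_t a, uint32_t b) {  /* a, b: 8-bit values */
    uint32_t rn = a << 24;
    for (int i = 0; i < 8; i++)
        rn = mstep(rn, b);
    return rn & 0xffff;
}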
Loop Unrolling
/* init */
a <<= 8;
/* no loop anymore */
p1 = -((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = -((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = -((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = -((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = -((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = -((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = -((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = -((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
return a & 0xffff;
Full DAG of the unrolled loop (operands a, b): in software ~50 cycles; in hardware the shifts and masks reduce to an AND network, and the chained additions to column compression plus an arithmetic optimiser, bringing the whole graph down to ~3-4 cycles.
Ad-hoc Unit to Accelerate Multiplication?! Yeah, a MUL…
Datapath: Register File, ALU, LD/ST, MUL.
MUL semantics: Rn ← (Rn & 0x0000ffff) x (Rm & 0x0000ffff)
1 ad-hoc instruction added; function reduced by a factor of 10-15.
Classic “Ad-hoc” Customisation…
• Altera Nios
• Can we do more of this, really ad-hoc?!
Mainstream SoC/FPGA Processors and Specialisation?
All the recent embedded processors offer some sort of specialisation:
• Arbitrary functional units or tightly coupled coprocessors (IFX Carmel 20xx, ARM, Tensilica Xtensa, Altera Nios, etc.)
• Parametric resources (STM Lx, ARC Cores, Tensilica Xtensa, Altera Nios, etc.)
But all assume an onerous manual study and design!
Summary of Gain Potentials in Ad-hoc FUs
• Exploit constants for logic simplification
• Some operations reduce to wires in hardware
• Exploit data parallelism in hardware
• Exploit arithmetic properties for efficient chaining of arithmetic operations (e.g., carry save)
Goals
• How much scope is there for AFU specialisation in typical multimedia code?
• Are classic ILP techniques or other optimisations (e.g., arithmetic) important to increase the speedup? To what extent?
• What are the microarchitectural needs for fully exploiting the potential?
  • Memory ports in the AFUs?
  • Number of inputs from the register file? Are two enough?
  • Number of outputs to the register file? Is one enough?
Related Work in Reconfigurable Computing
• Most of the work is in reconfigurable computing; typically, the experiments are tied to a given microarchitecture:
  • CHIMAERA [Ye et al., 2000] has the richest measurements, but only for 1-output AFUs and no AFU-memory interface
  • Similarly, PRISC [Razdan et al., 1994] and ConCISe [Kastrup et al., 1999] use clustering approaches for 2-input, 1-output AFUs
  • GARP [Hauser et al., 1997] concentrates on mapping control flow (hyperblocks in loops) onto a loosely coupled architecture (coprocessor)
• First investigate where the potentials are, then fix the microarchitecture
Related Work in AFU Identification
• Other authors concentrate on identification methods (“what is the best function for an AFU?”), often with some microarchitectural assumptions:
  • MaxMISOs [Alippi et al., 1999] are 1-output candidates of maximal size
  • [Jacome et al., 2000] introduce vertical and horizontal aggregation as heuristic methods to cluster operations (no comparison with other techniques)
  • [Arnold et al., 2001] use library pattern-matching techniques with a dynamic pattern library (instruction clusters), but with very limited cluster complexity (3 instructions) in the experiments
  • ASIP synthesis addresses a different problem (minimal covering)
• First investigate where the potentials are, then develop appropriate identification algorithms
Methodology
• Concentrate on data flow
  • Easier to capture automatically (no architecturally visible state in the AFUs)
  • Constant latency (variable latency would hardly fit into a statically scheduled environment, e.g., VLIW)
• Measurements on basic blocks
  • Represent the upper limit of the potential advantages
  • The upper limit is reachable if microarchitectural constraints are satisfied (e.g., number of inputs and outputs)
Experimental Flow
Software Execution: Approximate RISC Model
(Pipeline diagram: five instructions flowing through IF, ID, EX, WB; one of them with a multi-cycle EX1-EX2-EX3 execution stage.)
• One clock cycle assumed for most SUIF nodes, representing the usage of the execution stage
  • Exceptions: e.g., type casts (zero cycles), divisions (N cycles)
• All forwarding paths assumed to exist
• No data/instruction cache, or perfect hit rates assumed
• Jumps accounted for with a fixed amount added to the cycle count of each basic block
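As a rough illustration of this cost model (the node kinds, the division latency, and the jump penalty below are placeholders, not the actual SUIF node set or the authors' constants), a sketch in C could be:

/* Hypothetical sketch of the approximate software cost model described above:
 * most dataflow nodes cost one EX cycle, type casts cost zero, divisions cost
 * several cycles, and each basic block pays a fixed jump penalty. */
enum node_kind { NODE_CAST, NODE_DIV, NODE_OTHER };

static int node_cycles(enum node_kind k) {
    switch (k) {
    case NODE_CAST: return 0;   /* casts reduce to nothing                  */
    case NODE_DIV:  return 16;  /* N cycles; 16 is just a placeholder value */
    default:        return 1;   /* one EX cycle for most operations         */
    }
}

static int basic_block_cycles(const enum node_kind *nodes, int n, int jump_penalty) {
    int cycles = jump_penalty;            /* fixed amount per basic block       */
    for (int i = 0; i < n; i++)
        cycles += node_cycles(nodes[i]);  /* forwarding assumed, so no stalls   */
    return cycles;
}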
Hardware Execution: Synthesis-based Model
CMOS 0.18 µm, Synopsys Design Compiler + DesignWare
Partitioning of the DFG: Mix of Hardware and Software
• AFU memory bandwidth issue (figure annotations: High Cost! Low Performance?)
• On-AFU (hardware) and off-AFU (software) instructions
• DFG partitioned into HW and SW layers
Example of Layering Hybrid DFGs: hardware and software layers.
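The slides do not spell out the partitioning rule; as a purely illustrative assumption (loads/stores stay in software, and everything between two memory accesses can be collapsed into one hardware layer), a layering sketch in C could be:

#include <stdbool.h>

#define MAX_N 64

typedef struct {
    bool is_mem;                 /* load/store: executed off-AFU, in software */
    int  n_pred, pred[MAX_N];    /* indices of operand-producing nodes        */
} dfg_node;

/* layer[v] = number of memory operations on the longest dependence path
 * ending at v; non-memory nodes sharing a layer index form one on-AFU
 * hardware layer, memory nodes form the software layers in between.
 * Assumes the nodes are given in topological order. */
static void layering(const dfg_node *g, int n, int *layer) {
    for (int v = 0; v < n; v++) {
        int l = 0;
        for (int p = 0; p < g[v].n_pred; p++) {
            int pr = g[v].pred[p];
            int pl = layer[pr] + (g[pr].is_mem ? 1 : 0);
            if (pl > l) l = pl;
        }
        layer[v] = l;
    }
}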
Metrics and Measurements
• Topological basic block information: inputs, outputs, etc.
• Saved cycles and the resulting speedup
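For concreteness, one plausible way to turn these measurements into speedup figures (the formulas below are an assumption for illustration, not necessarily the exact metric used in the talk):

/* Block-level speedup if a basic block takes sw_cycles in software and its
 * AFU version saves saved_cycles of them. */
static double block_speedup(double sw_cycles, double saved_cycles) {
    return sw_cycles / (sw_cycles - saved_cycles);
}

/* Application-level speedup when the accelerated block covers a fraction f
 * of total execution time (Amdahl's law applied to that fraction). */
static double app_speedup(double f, double block_sp) {
    return 1.0 / ((1.0 - f) + f / block_sp);
}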
Basic Blocks Characteristics: Examples
(Plot annotations: execution concentrated in a few, well-separated BBs; few loads/stores; high register-file pressure; small delays.)
Basic Blocks Characteristics
• Moderate hardware resources for AFUs:
  • Often, half of the execution time is concentrated in no more than 2-3 basic blocks
  • Pressure on the register file higher than classically supported
  • Limited importance of memory ports, except for some dramatic cases…
  • Small delay of typical basic blocks
Potential Basic Speedup: Examples
(Plot annotations: good speedup with few BBs; some BBs too simple to bring an advantage; others not critical…)
Inputs and Outputs of Basic Blocks
(Plots: speedup per number of inputs (~50%) and speedup per number of outputs (>60%).)
Potential Basic Speedup
• Limited available parallelism
• Top-ranking basic blocks: 10 to 50% cycle savings
• Hardwired constants not a key advantage
• Small price for a reduction in design risk
• Sequentialisation penalty not dramatic
• AFU memory ports not essential
• Accurate bitwidth analysis and arithmetic optimisations bring limited or no advantage
• Basic blocks are too simple, ceiling effects, …
Effects of ILP Techniques: Examples
(Plots: total speedup; number of basic blocks needed to reach a 30% speedup.)
Effects of ILP Techniques
• Major improvements:
  • Cumulative speedups between 1.7x and 6.3x
  • Register file pressure not significantly modified
• Hardware complexity and Thw increased
  • Area is typically below 2-3x that of a 32-bit multiplier, almost never >10x
• Accurate bitwidth analysis and arithmetic optimisations bring limited or no advantage
  • Baseline advantage already very large
Arithmetic Optimisation Impact
(Plot: results with and without arithmetic optimisation.)
Conclusions
• DFG-level analysis opens potential speedups (2-3x) at low cost (hardware and toolset) and low risk
• A larger number of AFU write ports (2-3) is needed
• Hardcoding of constants not essential
• AFU memory interfaces also not essential
• ILP techniques help, as expected
• Sophisticated and detailed techniques (bitwidth analysis, arithmetic optimisations) sometimes masked by other effects
Ongoing Work
• Measure advantages through a complete toolchain (notably, a compiler):
  • DSP microarchitecture: validate the simple model; find out the bottlenecks and impose real DSP constraints (e.g., non-orthogonality)
  • VLIW microarchitecture: go beyond the simple software execution model
• Develop novel speedup-driven identification algorithms
• How to get more AFU specialisation potential
• Dynamic identification and configuration of AFUs
Typical Identification Algorithms
• Bottom-up greedy approaches to cluster instructions
• Topologically driven rather than speedup driven
• E.g., MaxMISO identification [Alippi et al., 1999]
(Figure: example DAG of addition and multiplication nodes partitioned into MaxMISOs MS1, MS2, MS3.)
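For readers unfamiliar with MaxMISOs, here is a rough C sketch of the growth rule in the spirit of [Alippi et al., 1999] (the data structures and the choice of root are assumptions for illustration): grow a cluster upward from a root node, absorbing a predecessor only if every one of its consumers is already inside, so the cluster keeps a single output.

#include <stdbool.h>

#define MAX_NODES 64

typedef struct {
    int n_succ;               /* number of consumers of this node's result */
    int succ[MAX_NODES];      /* indices of consumer nodes                 */
} node_t;

/* Mark in 'in_miso' the MaxMISO rooted at 'root'. */
static void max_miso(const node_t *g, int n_nodes, int root, bool *in_miso) {
    for (int i = 0; i < n_nodes; i++) in_miso[i] = false;
    in_miso[root] = true;

    bool changed = true;
    while (changed) {                      /* fixed point: keep absorbing predecessors  */
        changed = false;
        for (int v = 0; v < n_nodes; v++) {
            if (in_miso[v]) continue;
            bool has_edge_in = false, all_inside = true;
            for (int s = 0; s < g[v].n_succ; s++) {
                if (in_miso[g[v].succ[s]]) has_edge_in = true;
                else all_inside = false;   /* a consumer outside would create a 2nd output */
            }
            if (has_edge_in && all_inside) {
                in_miso[v] = true;         /* absorbing v keeps the cluster single-output  */
                changed = true;
            }
        }
    }
}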
Speedup-driven Identification
• Prune out an optimal set of low-speedup nodes to achieve the required input/output count
• Handles SIMD-like and unconnected graphs
(Figure: example graph with inputs i0-i7, outputs o0 and o1, and per-node speedup weights between 0.1 and 3.)
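The slide calls for pruning an optimal set of nodes; as a much simpler stand-in (a naive greedy heuristic with an assumed data layout, not the authors' algorithm), one could drop the lowest-gain nodes until the register-file I/O budget fits:

#include <stdbool.h>

#define MAX_N 64

typedef struct {
    int    n_pred, pred[MAX_N];   /* producers of this node's operands          */
    int    n_succ, succ[MAX_N];   /* consumers of this node's result            */
    bool   live_out;              /* result also needed after the basic block?  */
    double gain;                  /* estimated cycles saved by keeping it in HW */
} dfg_node;

/* Count register-file reads/writes of the current selection (simplified:
 * operands coming straight from the register file are not modelled here). */
static void io_count(const dfg_node *g, int n, const bool *sel, int *in, int *out) {
    *in = *out = 0;
    for (int v = 0; v < n; v++) {
        if (!sel[v]) continue;
        for (int p = 0; p < g[v].n_pred; p++)
            if (!sel[g[v].pred[p]]) (*in)++;          /* operand produced in SW */
        bool escapes = g[v].live_out;
        for (int s = 0; s < g[v].n_succ; s++)
            if (!sel[g[v].succ[s]]) escapes = true;   /* result consumed in SW  */
        if (escapes) (*out)++;
    }
}

static void prune(const dfg_node *g, int n, bool *sel, int max_in, int max_out) {
    int in, out;
    io_count(g, n, sel, &in, &out);
    while (in > max_in || out > max_out) {
        int worst = -1;
        for (int v = 0; v < n; v++)                    /* pick the lowest-gain node */
            if (sel[v] && (worst < 0 || g[v].gain < g[worst].gain)) worst = v;
        if (worst < 0) break;
        sel[worst] = false;                            /* move it back to software  */
        io_count(g, n, sel, &in, &out);
    }
}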
Open Issues and Perspectives
• Power consumption advantages?
  • Power possibly down: fewer instruction fetches and decodes, fewer register reads and writebacks
  • Power possibly up: reduced correlation of signals in the AFU, low efficiency of the implementation (in the case of eFPGAs)
• More opportunities to increase the speedup?
  • Detect and implement LUTs (e.g., in quantisers) as discrete CAMs
  • Detect runtime constant values
Dynamic Specialisation?
• Dynamic compilation and optimisation together with hardware specialisation (DAISY, Crusoe, JiT, etc.)
• Specialisation may profit from runtime information
• Identification under runtime conditions
• Dynamic reconfigurability challenge
(Figure: Java Bytecode → JiT + Specialisation → ARM + RFU.)
Conclusions
• Processor customisation opportunities are here: soft cores, FPGA processors, etc.
• A very specific field of hardware/software codesign with a very large potential
  • Do not give up versatility
  • Get most of the performance of custom hardware
• Needs automation to complement compilers and synthesisers (some work exists, but limited in scope)