
Automatic Processor Specialisation using Ad-hoc Functional Units



Presentation Transcript


  1. Automatic Processor Specialisation using Ad-hoc Functional Units Paolo.Ienne@epfl.ch, Laura.Pozzi@epfl.ch, Miljan.Vuletic@epfl.ch EPFL – I&C – LAP

  2. Design Gap! Classic Options for Systems-on-Chip Automatic Processor Specialisation

  3. Processor Specialisation: Get the Best of Both Options
     Embedded!

  4. VLIW Processor Specialisation
     • Two complementary specialisation strategies:
       • Parametric Architecture
       • Ad-hoc Functional Units (AFUs)

  5. Automatically Collapsing Clusters of Instructions into New Ones
     One ad-hoc complex operation instead of a long sequence of standard ones
     If the ad-hoc functional unit completes the job faster → GAIN

  6. General Goal
     Automatically achieve processor specialisation through high-level application code analysis

  7. Outline
     • Introduction
     • Motivational example
     • Goals
     • Opportunities for specialisation
     • Challenges, further opportunities,…

  8. Elementary Motivational Example: An Important Kernel…
     Shift-and-add unsigned 8 x 8-bit multiplication

         /* init */
         a <<= 8;
         /* loop */
         for (i = 0; i < 8; i++) {
             if (a & 0x8000) {
                 a = (a << 1) + b;
             } else {
                 a <<= 1;
             }
         }
         return a & 0xffff;

  9. Software Predication

         /* init */
         a <<= 8;
         /* loop */
         for (i = 0; i < 8; i++) {
             p1 = - ((a & 0x8000) >> 15);   /* predicate mask: 0 or -1 (= 0xffffffff) */
             a = (a << 1) + (b & p1);       /* shift, then predicated add */
         }
         return a & 0xffff;
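The branching kernel of slide 8 and the predicated kernel of slide 9 can be wrapped as functions and compared against each other. What follows is a minimal sketch, not the authors' code: the function names, the `uint32_t` working type and the exhaustive checker are assumptions made for illustration. The addend is written as `(b & p1)` with explicit parentheses, because `+` binds tighter than `&` in C.

```c
#include <stdint.h>

/* Branching shift-and-add kernel: unsigned 8 x 8-bit multiplication. */
uint32_t mul8_branch(uint8_t a8, uint8_t b8)
{
    uint32_t a = (uint32_t)a8 << 8;   /* init: multiplier sits in bits 15..8 */
    uint32_t b = b8;
    for (int i = 0; i < 8; i++) {
        if (a & 0x8000u) {
            a = (a << 1) + b;         /* top multiplier bit set: shift and add */
        } else {
            a <<= 1;                  /* top multiplier bit clear: shift only */
        }
    }
    return a & 0xffffu;               /* product is in the low 16 bits */
}

/* Same kernel with the branch replaced by a software predicate mask. */
uint32_t mul8_predicated(uint8_t a8, uint8_t b8)
{
    uint32_t a = (uint32_t)a8 << 8;
    uint32_t b = b8;
    for (int i = 0; i < 8; i++) {
        uint32_t p1 = 0u - ((a & 0x8000u) >> 15);  /* 0 or 0xffffffff */
        a = (a << 1) + (b & p1);                   /* add b only when p1 is all ones */
    }
    return a & 0xffffu;
}

/* Hypothetical helper: checks both kernels against x * y for all operand pairs.
   Returns 1 if every pair matches, 0 otherwise. */
int mul8_check_all(void)
{
    for (unsigned x = 0; x < 256; x++) {
        for (unsigned y = 0; y < 256; y++) {
            uint32_t want = x * y;
            if (mul8_branch((uint8_t)x, (uint8_t)y) != want) return 0;
            if (mul8_predicated((uint8_t)x, (uint8_t)y) != want) return 0;
        }
    }
    return 1;
}
```

The predicated form has a single straight-line loop body, which is what makes it easy to capture as one data-flow graph.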

  10. Loop Kernel DAG
      [Figure: DAG of one loop iteration (inputs a, b; output a), built from & 0x8000, >> 15, -, << 1, & and +. In SW: ~6 cycles. In HW: some nodes are only wiring, the & is a row of AND gates, the + is an ALU; 1-2 cycles!]

  11. Ad-hoc Unit To Accelerate Shift-and-Add Multiplication Loop
      [Figure: Register File, ALU, LD/ST and MSTEP units]
      MSTEP: if (Rn[31] == 1) then Rn ← (Rn << 1) + Rm else Rn ← (Rn << 1)
      1 ad-hoc instruction added → loop kernel reduced to 15-30%
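The MSTEP semantics on slide 11 can be modelled functionally in a few lines. This is a sketch under stated assumptions: the names `mstep` and `mul8_mstep` are hypothetical, registers are taken to be 32 bits wide, and placing the multiplier in the top byte of Rn (so that bit 31 plays the role bit 15 plays in the C kernel) is an assumption of this example rather than something the slides spell out.

```c
#include <stdint.h>

/* Functional model of one MSTEP:
   if (Rn[31] == 1) Rn <- (Rn << 1) + Rm, else Rn <- (Rn << 1). */
uint32_t mstep(uint32_t rn, uint32_t rm)
{
    if ((rn >> 31) & 1u) {
        return (rn << 1) + rm;
    }
    return rn << 1;
}

/* 8 x 8-bit unsigned multiplication as eight MSTEPs.
   Assumption: the multiplier starts in bits 31..24 of Rn, so after
   eight steps it has been shifted out and Rn holds the product. */
uint32_t mul8_mstep(uint8_t a8, uint8_t b8)
{
    uint32_t rn = (uint32_t)a8 << 24;
    uint32_t rm = b8;
    for (int i = 0; i < 8; i++) {
        rn = mstep(rn, rm);
    }
    return rn;
}
```

Each call to `mstep` stands for one instruction, so the loop body shrinks from a test, a shift and a conditional add to a single operation.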

  12. Loop Unrolling

         /* init */
         a <<= 8;
         /* no loop anymore */
         p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
         p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
         p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
         p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
         p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
         p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
         p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
         p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
         return a & 0xffff;

  13. Full DAG
      [Figure: DAG of the fully unrolled multiplication (inputs a, b; output a). In SW: ~50 cycles. In HW, via the Arithmetic Optimiser: &-network, column compression and a final +; ~3-4 cycles.]

  14. Ad-hoc Unit To Accelerate Multiplication?! Yeah, a MUL…
      [Figure: Register File, ALU, LD/ST and MUL units]
      MUL: Rn ← (Rn & 0x0000.ffff) x (Rm & 0x0000.ffff)
      1 ad-hoc instruction added → function reduced by a factor 10-15

  15. Classic “Ad-hoc” Customisation…
      • Altera Nios:
      • Can we do more of this, really ad-hoc?!

  16. Mainstream SoC/FPGA Processors and Specialisation?
      All the recent embedded processors offer some sort of specialisation:
      • Arbitrary functional units or tightly coupled coprocessors (IFX Carmel 20xx, ARM, Tensilica Xtensa, Altera Nios, etc.)
      • Parametric resources (STM Lx, ARC Cores, Tensilica Xtensa, Altera Nios, etc.)
      But all assume an onerous manual study and design!

  17. Summary of Gain Potentials in Ad-hoc FUs
      • Exploit constants for logic simplification
      • Some operations reduce to wires in hardware
      • Exploit data parallelism in hardware
      • Exploit arithmetic properties for efficient chaining of arithmetic operations (e.g., carry save)
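The carry-save idea mentioned on slide 17 can be illustrated in software. The sketch below is an illustration only: the type `csa_t` and the functions `csa` and `add4_carry_save` are hypothetical names, and a bitwise C model says nothing about actual gate delays. It shows the arithmetic property being exploited: a 3:2 compressor reduces three operands to a sum word and a carry word without propagating carries, so a chain of additions needs only one carry-propagate adder at the end.

```c
#include <stdint.h>

/* Result of a 3:2 carry-save compression: sum + carry equals x + y + z
   (modulo 2^32), but neither word required carry propagation. */
typedef struct {
    uint32_t sum;
    uint32_t carry;
} csa_t;

/* Bitwise full adder applied to every bit position in parallel. */
csa_t csa(uint32_t x, uint32_t y, uint32_t z)
{
    csa_t r;
    r.sum = x ^ y ^ z;                             /* per-bit sum */
    r.carry = ((x & y) | (x & z) | (y & z)) << 1;  /* per-bit carry, moved up one position */
    return r;
}

/* Four-operand addition: two carry-save stages, then one ordinary add. */
uint32_t add4_carry_save(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    csa_t s1 = csa(a, b, c);
    csa_t s2 = csa(s1.sum, s1.carry, d);
    return s2.sum + s2.carry;   /* the only carry-propagate addition */
}
```

In hardware, each `csa` stage costs roughly one full-adder delay regardless of word width, which is why chained additions inside an ad-hoc unit can be much shorter than the same chain executed on an ALU.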

  18. Goals
      • How much scope for AFU specialisation in typical multimedia code?
      • Are classic ILP techniques or other optimisations (e.g., arithmetic) important to increase the speedup? To what extent?
      • What are the microarchitectural needs for exploiting the potential well?
        • Memory ports in the AFUs?
        • Number of inputs from the register file? Are two enough?
        • Number of outputs to the register file? Is one enough?

  19. Related Work in Reconfigurable Computing
      • Most of the work is in reconfigurable computing; typically experiments are linked to a given microarchitecture:
        • CHIMAERA [Ye et al., 2000] has the richest measurements but only for 1-output AFUs and no AFU-memory interface
        • Similarly PRISC [Razdan et al., 1994] and ConCISe [Kastrup et al., 1999] use clustering approaches for 2-input, 1-output AFUs
        • GARP [Hauser et al., 1997] concentrates on the mapping of control flow (hyperblocks in loops) in a loosely coupled architecture (coprocessor)
      First, investigate where potentials are → fix microarchitecture

  20. Related Work in AFU Identification
      • Other authors concentrate on identification methods (“what is the best function for an AFU?”), often with some microarchitectural assumptions
        • MaxMISOs [Alippi et al., 1999] are 1-output candidates of maximal size
        • [Jacome et al., 2000] introduce vertical and horizontal aggregation as heuristic methods to cluster operations (no comparisons with other techniques)
        • [Arnold et al., 2001] use library pattern-matching techniques with a dynamic pattern library (instruction clusters) but very limited cluster complexity (3 instructions) in the experiments
      • ASIP synthesis: different problem (minimal covering)
      First, investigate where potentials are → develop appropriate identification algorithms

  21. Methodology
      • Concentrate on Data Flow
        • Easier to capture automatically (no architecturally visible state in the AFUs)
        • Constant latency (variable latency would hardly fit into a statically scheduled environment, e.g., VLIW)
      • Measurements on Basic Blocks
        • Represent the upper limit of the potential advantages
        • Upper limit is reachable if microarchitectural constraints are satisfied (e.g., no. of inputs and outputs)

  22. Experimental Flow

  23. Software Execution: Approximate RISC Model
      [Figure: pipeline diagram of five instructions through the IF, ID, EX and WB stages; one instruction occupies three execute stages (EX1, EX2, EX3)]
      • One clock cycle assumed for most SUIF nodes, representing the usage of the execution stage
        • Exceptions: e.g., type casts (zero), divisions (N)
      • All forwarding paths assumed to exist
      • No data/instruction cache, or perfect hit rates, assumed
      • Jumps accounted for by adding a fixed amount to the cycle count of each basic block
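The software cost model of slide 23 amounts to summing per-node cycle counts over a basic block. The following sketch shows one way to write that down; it is not the authors' tool. The node kinds, the function names and both constants are assumptions: the slides give the division cost only as "N" and do not state the fixed jump cost, so `DIV_CYCLES` and `JUMP_CYCLES` are placeholders.

```c
#include <stddef.h>

/* Hypothetical classification of intermediate-representation nodes. */
typedef enum { NODE_ALU, NODE_CAST, NODE_DIV } node_kind_t;

#define DIV_CYCLES  32u  /* placeholder for the "N" cycles of a division */
#define JUMP_CYCLES 2u   /* placeholder for the fixed per-block jump cost */

/* Cycles charged to a single node: one by default, with the
   exceptions named on the slide (casts are free, divisions cost N). */
unsigned node_cycles(node_kind_t kind)
{
    switch (kind) {
    case NODE_CAST: return 0u;
    case NODE_DIV:  return DIV_CYCLES;
    default:        return 1u;
    }
}

/* Estimated software cycles of one basic block: the sum over its
   nodes plus the fixed jump amount. */
unsigned bb_sw_cycles(const node_kind_t *nodes, size_t count)
{
    unsigned total = JUMP_CYCLES;
    for (size_t i = 0; i < count; i++) {
        total += node_cycles(nodes[i]);
    }
    return total;
}
```

With the placeholder constants, a block of two ALU nodes, one cast and one division is charged 2 + 1 + 1 + 0 + 32 = 36 cycles.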

  24. Hardware Execution: Synthesis-based Model
      CMOS 0.18µ
      Synopsys Design Compiler + DesignWare

  25. Partitioning of DFG: Mix of Hardware and Software
      • AFU memory bandwidth issue
      • On-AFU (Hardware) and Off-AFU (Software) instructions
      • DFG partitioned in HW and SW layers
      High Cost! Low Performance?

  26. Example of Layering Hybrid DFGs
      Hardware and software layers

  27. Metrics and Measurements
      • Topological basic block information: Inputs, outputs, etc.
      • Saved cycles → speedup
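The step from saved cycles to speedup on slide 27 is plain arithmetic: if a run takes a given number of cycles in software and the AFUs save some of them, the speedup is the ratio of the old to the new cycle count. A minimal sketch follows, with a hypothetical function name and an assumed convention of returning 0.0 for inputs that make no sense; how the original study aggregated cycles across basic blocks is not stated in the slides.

```c
/* Speedup obtained when `saved_cycles` out of `total_cycles` are removed.
   Returns 0.0 for invalid inputs (negative savings, or savings that
   reach or exceed the total). */
double speedup_from_saved(double total_cycles, double saved_cycles)
{
    if (saved_cycles < 0.0 || saved_cycles >= total_cycles) {
        return 0.0;
    }
    return total_cycles / (total_cycles - saved_cycles);
}
```

For example, saving half of the cycles gives a speedup of 2, and saving 30% gives about 1.43.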

  28. Basic Blocks Characteristics: Examples
      [Figure annotations: Execution concentrated in few BBs; High RF pressure; Few Ld/St… …well separated; Small delays]

  29. Basic Blocks Characteristics
      • Moderate hardware resources for AFUs:
        • Often, half of the execution time concentrated in not more than 2-3 basic blocks
      • Pressure on the register file higher than classically supported
      • Limited importance of memory ports
        • Except some dramatic cases…
      • Small delay of typical basic blocks

  30. Potential Basic Speedup: Examples
      [Figure annotations: BBs too simple to bring advantage; Not critical…; Good speedup with few BBs]

  31. Inputs and Outputs of Basic Blocks
      [Figure: two plots, "Speedup per # inputs" and "Speedup per # outputs", annotated ~50% and >60%]

  32. Potential Basic Speedup
      • Limited available parallelism
        • Top-ranking basic blocks: 10 to 50% cycle savings
      • Hardwired constants not a key advantage
        • Small price for a reduction in design risk
      • Sequentialisation penalty not dramatic
        • AFU memory ports not essential
      • Accurate bitwidth analysis and arithmetic optimisations bring limited or no advantage
        • Basic blocks are too simple, ceiling effects,…

  33. Effects of ILP Techniques: Examples
      [Figure: plots annotated with "total speedup 30%" and "number of basic blocks to reach 30% speedup"]

  34. Effects of ILP Techniques
      • Major improvements:
        • Cumulative speedups between 1.7x and 6.3x
      • Register file pressure not significantly modified
      • Hardware complexity and Thw increased
        • Area is typically below 2-3x that of a 32-bit multiplier, almost never >10x
      • Accurate bitwidth analysis and arithmetic optimisations bring limited or no advantage
        • Baseline advantage already very large

  35. Arithmetic Optimisation Impact
      [Figure: results w/o optimisation vs. with optimisation]

  36. Conclusions
      • DFG-level opens potential speedups (2-3x) at low cost (hardware and toolset) and low risk
        • Larger number of AFU write ports (2-3) needed
        • Hardcoding of constants not essential
        • AFU-memory interfaces also not essential
      • ILP techniques help, as expected
      • Sophisticated and detailed techniques (bitwidth analysis, arithmetic optimisations) sometimes masked by other effects

  37. Ongoing Work
      • Measure advantages through a complete toolchain (notably, compiler):
        • DSP microarchitecture:
          • Validate simple model
          • Find out bottlenecks and impose real DSP constraints (e.g., non-orthogonality)
        • VLIW microarchitecture:
          • Go beyond simple software execution model
      • Develop novel speedup-driven identification algorithms
      • How to get more AFU specialisation potentials
      • Dynamic identification and configuration of AFUs

  38. Typical Identification Algorithms
      • Bottom-up greedy approaches to cluster instructions
      • Topologically driven rather than speedup driven
      • E.g., MaxMISO identification [Alippi et al., 1999]:
      [Figure: DFG of + and * nodes partitioned into three clusters labelled MS1, MS2 and MS3]
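The bottom-up greedy clustering that slide 38 describes can be sketched for the single-output case. The code below follows one common formulation of MaxMISO growth and is not taken from the cited paper: start from a chosen output node and keep absorbing any node whose consumers all lie inside the cluster. The adjacency-matrix type `dfg_t`, the bound `MAX_NODES` and the function name are assumptions, and the sketch ignores details a real tool would need (values live beyond the basic block, operations that may not enter an AFU such as loads).

```c
#include <stdbool.h>
#include <string.h>

#define MAX_NODES 16   /* arbitrary bound for this sketch */

/* Data-flow graph: edge[u][v] is true when the result of u feeds v. */
typedef struct {
    int n;
    bool edge[MAX_NODES][MAX_NODES];
} dfg_t;

/* Grows a single-output cluster upwards from `root`.
   A node joins when it feeds the cluster and none of its consumers
   is outside it, so `root` stays the only output.
   Fills in_cluster[] and returns the cluster size. */
int maxmiso_from(const dfg_t *g, int root, bool in_cluster[MAX_NODES])
{
    memset(in_cluster, 0, MAX_NODES * sizeof(bool));
    in_cluster[root] = true;
    int size = 1;
    bool grown = true;
    while (grown) {                      /* repeat until no node can be added */
        grown = false;
        for (int u = 0; u < g->n; u++) {
            if (in_cluster[u]) continue;
            bool feeds_cluster = false;
            bool escapes = false;
            for (int v = 0; v < g->n; v++) {
                if (!g->edge[u][v]) continue;
                if (in_cluster[v]) feeds_cluster = true;
                else escapes = true;     /* a consumer outside the cluster */
            }
            if (feeds_cluster && !escapes) {
                in_cluster[u] = true;
                size++;
                grown = true;
            }
        }
    }
    return size;
}
```

The growth rule looks only at graph topology, never at how many cycles a node would save, which is the limitation the slide points out.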

  39. Speedup-driven Identification
      • Prune out the optimal set of low-speedup nodes to achieve the required input/output count
      SIMD-like and unconnected graphs
      [Figure: DFG with inputs i0-i7 and outputs o0, o1; node labels 0.1, 0.5, 1, 2, 3 and k]

  40. Open Issues and Perspectives
      • Power consumption advantages?
        • Power down because:
          • Fewer instruction fetches and decodes
          • Fewer register reads and writebacks
        • Power possibly up because:
          • Reduced correlation of signals in the AFU
          • Low efficiency of the implementation (in case of eFPGAs)
      • More opportunities to increase speedup?
        • Detect and implement LUTs (e.g., in quantisers) as discrete CAMs
        • Detect runtime constant values

  41. Dynamic Specialisation?
      • Dynamic compilation and optimisation together with hardware specialisation
        • DAISY, Crusoe, JiT, etc.
      • Specialisation may profit from runtime information
        • Identification in runtime conditions
      • Dynamic reconfigurability challenge
      [Figure labels: Java Bytecode; JiT + Specialisation; ARM + RFU]

  42. Conclusions
      • Processor customisation opportunities are here: soft cores, FPGA processors, etc.
      • Very specific field of hardware/software codesign with a very large potential
        • Do not give up versatility
        • Get most of the performance of custom hardware
      • Needs automation, to complement compilers and synthesizers (some work exists but limited in scope)
