Exploring the Design Space of LUT-based Transparent Accelerators
Sami Yehia*, Nathan Clark▪, Scott Mahlke▪, and Krisztian Flautner*
*ARM Ltd. ▪Advanced Computer Architecture Lab, University of Michigan
CASES 2005, September 24-27
Embedded Products Convergence
Concept smart phone of 2008: biometrics, 20 GB HD, 3.5G (HSDPA), WiMAX, GPS, stereo headset, Bluetooth/UWB, DMB (Digital Mobile Broadcast), memory card, PC / Mac, TV out, NFC / RFID
• Performance needs keep growing with application demands
• Embedded systems win through customization: more performance, low power, etc.
• Traditional ISA customization and hardware specialization cannot keep up with the growth in functionality
• One way forward: Transparent Instruction Set Customization
Transparent Instruction Set Customization
• An alternative way to performance
[Figure: a dataflow graph of instructions I1 to I5 is sped up either by a higher frequency, or by collapsing instructions (customization)]
Transparent:
• No (or minor) ISA change
• Baseline CPU unchanged
• Hardware generates control
• Eases software burden
• Forward compatible
Architecture Framework
[Figure: block diagram. Labels: Application, Compiler, Subgraph, Instructions ("… BRL …"), Standard Pipeline, Control Generation, Subgraph Execution Unit (Inputs, Outputs). Annotation: Augments Instruction Stream]
LUT-based accelerator
• LUT-based
[Figure: a 16-bit LUT (1111011001100110) takes the 32-bit inputs r1, r2, r4, r5 and produces the 32-bit output r12, covering the EOR, AND and ORR operations below]
    inst1: EOR r6,r1,r2
    inst2: AND r7,r4,r5
    inst3: ORR r12,r6,r7
• Addition/Subtraction
    inst1: ADD r6,r1,r2
    $r6_i = r1_i \oplus r2_i \oplus Cin_{i-1}$
    $Cin_i = r1_i \cdot r2_i \mid Cin_{i-1} \cdot (r1_i \oplus r2_i)$
• A Carry Generator that is also programmable
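As a rough illustration of the LUT idea on this slide, the sketch below packs the truth table of the three-instruction subgraph into a 16-bit word and applies it at each of the 32 bit positions. The helper names `build_lut` and `apply_lut`, and the choice of r1 as the least significant index bit, are assumptions made for the example, not details taken from the slides.

```python
def build_lut(fn, n_inputs):
    """Pack fn's truth table into an int: bit k holds fn applied to the bits of k."""
    lut = 0
    for k in range(1 << n_inputs):
        bits = [(k >> j) & 1 for j in range(n_inputs)]  # bits[0] is the first input
        lut |= (fn(*bits) & 1) << k
    return lut

def apply_lut(lut, inputs, width=32):
    """Evaluate the same LUT independently at every bit position of the inputs."""
    out = 0
    for i in range(width):
        k = 0
        for j, word in enumerate(inputs):
            k |= ((word >> i) & 1) << j  # assumed ordering: first input is index bit 0
        out |= ((lut >> k) & 1) << i
    return out

# Subgraph from the slide: EOR r6,r1,r2 ; AND r7,r4,r5 ; ORR r12,r6,r7
def subgraph(r1, r2, r4, r5):
    return (r1 ^ r2) | (r4 & r5)

lut = build_lut(subgraph, 4)
print(format(lut, "016b"))  # prints 1111011001100110
```

With this index ordering the packed table is the same bit string shown on the slide, and `apply_lut(lut, [r1, r2, r4, r5])` gives the same result as running the three instructions in sequence.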
Programmable Carry Functional Unit (PCFU)
[Figure: carry network over bit positions 31 down to 0, with five levels of "o" nodes (L1 to L5)]
Input at bit position i: $(g_i, p_i)$
Node operator: $(G, P) \circ (G', P') = (G' \mid G \cdot P',\ P \cdot P')$, where $(G', P')$ is the more significant group
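The carry network can be sketched in software as a parallel-prefix computation over (generate, propagate) pairs. The arrangement below doubles the combining distance at each level (a Kogge-Stone-style layout), which is an assumption made here because it yields five levels for 32 bits; the slides do not name the topology. The function names are hypothetical.

```python
def prefix_op(hi, lo):
    """Combine two adjacent (generate, propagate) groups; hi is the more significant."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)

def prefix_carries(g, p, width=32):
    """Return a word whose bit i is the carry out of bit position i (carry-in of 0)."""
    gp = [((g >> i) & 1, (p >> i) & 1) for i in range(width)]
    dist = 1
    while dist < width:  # distances 1, 2, 4, 8, 16: five levels for width = 32
        gp = [prefix_op(gp[i], gp[i - dist]) if i >= dist else gp[i]
              for i in range(width)]
        dist *= 2
    return sum(gi << i for i, (gi, _) in enumerate(gp))

def add(a, b, width=32):
    """Addition built on the prefix network: g = a AND b, p = a XOR b."""
    mask = (1 << width) - 1
    carry_out = prefix_carries(a & b, a ^ b, width)
    return (a ^ b ^ (carry_out << 1)) & mask
```

Because `prefix_carries` takes the g and p words as arguments, a different pair of functions can feed the same network, which is one way to read "programmable carry" in this sketch.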
Configuration generation
[Figure: a Meta Register file (r1, r2, r3, r4) holds truth tables over in1, in2 and Cin; a Meta Function Unit combines them into the g LUT, p LUT and Out LUT used with the Carry Generator]
Subgraph:
    AND r3, r1, r2
    ADD r4, r1, r2
    XOR r5, r3, r4
• AND: Out = A AND B, so LUT(r3) = LUT(r1) AND LUT(r2)
• ADD: $g = A \cdot B$, $p = A \oplus B$, $Out = A \oplus B \oplus cin$
• XOR: $Out = A \oplus B$
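One way to picture this step is to keep a truth table per register and apply each instruction to whole truth tables instead of data values. The sketch below does that for the subgraph on this slide. The row ordering, the dictionary standing in for the meta register file, the function name `generate_config`, and the restriction to a single ADD are all assumptions made for illustration.

```python
# Truth-table rows are indexed by k = in1 | in2 << 1 | cin << 2 (an assumed ordering).
IN1, IN2, CIN = 0b10101010, 0b11001100, 0b11110000

def generate_config(subgraph, inputs):
    """Walk the subgraph once, keeping one 8-row truth table per register."""
    meta = dict(inputs)  # hypothetical stand-in for the meta register file
    config = {}
    rd = None
    for op, rd, ra, rb in subgraph:
        a, b = meta[ra], meta[rb]
        if op == "AND":
            meta[rd] = a & b              # Out = A AND B
        elif op == "ORR":
            meta[rd] = a | b
        elif op in ("XOR", "EOR"):
            meta[rd] = a ^ b              # Out = A xor B
        elif op == "ADD":
            config["g_lut"] = a & b       # g = A.B
            config["p_lut"] = a ^ b       # p = A xor B
            meta[rd] = a ^ b ^ CIN        # Out = A xor B xor cin
        else:
            raise ValueError(f"unsupported op: {op}")
    config["out_lut"] = meta[rd]          # last destination is taken as the output
    return config

cfg = generate_config(
    [("AND", "r3", "r1", "r2"), ("ADD", "r4", "r1", "r2"), ("XOR", "r5", "r3", "r4")],
    {"r1": IN1, "r2": IN2},
)
print({name: format(table, "08b") for name, table in cfg.items()})
```

The appeal of this style is that configuration needs only bitwise operations on small tables, one per instruction in the subgraph.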
Design Space
[Figure: PCFU datapath with 32-bit inputs (in1 to in6), two g/p LUT pairs (g1 LUT, p1 LUT; g2 LUT, p2 LUT) each feeding a Carry Generator (cin1, cin2), output LUTs (OutLUT, OutLUT2), and Shifters, producing Out and Out2]
• Number of Inputs
• Number of Outputs
• Number of Addition/Subtractions
• Shift support
  • At inputs
  • At outputs
Evaluation
• Ported the Trimaran compiler to the ARM ISA
• Subgraph identification engine
• Synthesized with Synopsys standard cell library at 0.13 µm
• SimpleScalar configured as an ARM926EJ-S
• 5-stage pipeline, 250 MHz
• 1-cycle 16k I/D caches
• Single issue
• Baseline: 1-cycle subgraph execution latency
Speedup – Baseline PCFU
• 4-input, 2-output PCFU design
Number of inputs/outputs
• Area is proportional
Design points (I = inputs, O = outputs, A = addition/subtractions, last field = shift support)
• 4I, 2O, 2A, None
• 5I, 3O, 2A, None
• 4I, 3O, 2A, None
Conclusions
• Transparent Instruction Set Customization needs
  • Extracting computations from the program
  • An efficient substrate to map subgraphs onto
• PCFU LUT-based accelerators
  • Flexible, configurable accelerators
  • Efficient configuration
• You can get up to 66% with a 6-input / 3-output / 2-adder PCFU
• ... but you get 62% with an 8 times smaller, ~40% faster PCFU
LUT-based accelerator
[Figure: a 16-bit LUT (0110100110010110) is applied at each bit position i to the 32-bit inputs r1, r2, r3 and the carry Cin_{i-1}, producing r5]
    ADD r4,r1,r2
    XOR r5,r3,r4
$r5_i = r3_i \oplus (r1_i \oplus r2_i \oplus Cin_{i-1})$
$Cin_i = (r1_i \cdot r2_i) \mid (r1_i \oplus r2_i) \cdot Cin_{i-1}$
• Closer to an FPGA
• Bit-level functions too complex
• Proposed ripple-carry scheme too slow
• May also involve a very complex carry propagation network
• Hard to configure while keeping a reasonable latency in a GPP
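To make the latency concern concrete, the sketch below evaluates the ADD/XOR pair one bit at a time with a rippling carry, so each bit must wait for the carry of the bit below it. The LUT constant is the truth table of a 4-input XOR, which is what the equation for r5 describes; the function name and the index ordering are assumptions made for the example.

```python
PARITY4 = 0b0110100110010110  # truth table of a 4-input XOR

def ripple_eval(r1, r2, r3, lut=PARITY4, width=32):
    """Compute r5 = r3 XOR (r1 + r2) bit by bit, rippling the carry upward."""
    out, cin = 0, 0
    for i in range(width):
        a, b, c = (r1 >> i) & 1, (r2 >> i) & 1, (r3 >> i) & 1
        k = a | (b << 1) | (c << 2) | (cin << 3)  # assumed LUT index ordering
        out |= ((lut >> k) & 1) << i
        cin = (a & b) | ((a ^ b) & cin)           # carry into the next bit position
    return out
```

The loop carries `cin` from one iteration to the next, so the 32 positions cannot be evaluated independently. That serial dependency is the reason the slide calls the ripple-carry scheme too slow.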