Exploring the Design Space of LUT-based Transparent Accelerators

Sami Yehia*, Nathan Clark▪, Scott Mahlke▪, and Krisztian Flautner*
*ARM Ltd.
▪Advanced Computer Architecture Lab, University of Michigan
CASES 2005, September 24-27

Embedded Products Convergence
[Figure: concept smart phone of 2008: Biometrics, 20 GB HD, 3.5G (HSDPA), WiMax, GPS, Stereo Headset, Bluetooth/UWB, DMB (Digital Mobile Broadcast), Memory card, PC / Mac, TV out, NFC / RFID]
• Growing application demands call for more performance
• Embedded systems win through customization: more performance, low power, etc.
• Traditional ISA customization and hardware specialization cannot cope with the growing number of functionalities
• One way: Transparent Instruction Set Customization

Transparent Instruction Set Customization
[Figure: a dataflow graph of instructions I1 to I5, sped up either by a higher frequency or by collapsing instructions (customization)]
• An alternative route to performance
Transparent
• No ISA (or minor) change
• Baseline CPU unchanged
• Hardware generates control
• Eases software burden
• Forward compatible

Architecture Framework
[Figure: block diagram with the labels Application, Compiler (augments instruction stream), Instructions (… BRL …), Standard Pipeline, Control Generation, and Subgraph Execution Unit (Subgraph Inputs, Outputs)]

Pipeline Interface

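LUT-based accelerator
[Figure: a LUT with configuration 1111011001100110 takes the 32-bit inputs r1, r2, r4, r5 and produces the 32-bit output r12; the dataflow graph applies EOR to r1, r2 and AND to r4, r5, then ORR to both results]
• LUT-Based
  inst1: EOR r6,r1,r2
  inst2: AND r7,r4,r5
  inst3: ORR r12,r6,r7
• Addition/Subtraction
  inst1: ADD r6,r1,r2
  $r6_i = r1_i \oplus r2_i \oplus Cin_{i-1}$
  $Cin_i = r1_i \cdot r2_i \mid Cin_{i-1} \cdot (r1_i \oplus r2_i)$
  A Carry Generator that is also programmable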
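The 16-bit configuration on this slide can be reproduced by enumerating the three-instruction subgraph over all input combinations. The sketch below is illustrative only: the function names are hypothetical, and the row order (r5, r4, r2, r1 from most to least significant index bit, printed from row 15 down to row 0) is an assumption chosen because it matches the string on the slide.

```python
def subgraph_bit(r1, r2, r4, r5):
    """One bit of: EOR r6,r1,r2 ; AND r7,r4,r5 ; ORR r12,r6,r7."""
    return (r1 ^ r2) | (r4 & r5)


def lut_string():
    """Truth table of the subgraph, listed from row 15 down to row 0.

    Assumed row index: (r5 << 3) | (r4 << 2) | (r2 << 1) | r1.
    """
    bits = []
    for row in range(15, -1, -1):
        r5 = (row >> 3) & 1
        r4 = (row >> 2) & 1
        r2 = (row >> 1) & 1
        r1 = row & 1
        bits.append(str(subgraph_bit(r1, r2, r4, r5)))
    return "".join(bits)


def apply_lut(lut, r1, r2, r4, r5, width=32):
    """Evaluate the same 16-entry LUT independently at every bit position."""
    out = 0
    for i in range(width):
        row = ((((r5 >> i) & 1) << 3) | (((r4 >> i) & 1) << 2)
               | (((r2 >> i) & 1) << 1) | ((r1 >> i) & 1))
        out |= int(lut[15 - row]) << i  # lut[0] holds row 15
    return out


print(lut_string())  # 1111011001100110
```

Because the three logical instructions act on each bit position separately, one 16-entry table replicated across the 32 bit positions is enough to collapse them into a single operation.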

Programmable Carry Functional Unit (PCFU)
[Figure: parallel-prefix carry network over bit positions 31 to 0, with five levels of "o" nodes (L1 to L5)]
Bit $i$ enters the network as the pair $(g_i, p_i)$. Each "o" node combines two pairs:
$(G, P) \circ (G', P') = (G \mid P \cdot G',\; P \cdot P')$
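A five-level prefix network of this shape can be modelled in software on whole 32-bit words. The sketch below is not the authors' hardware description: it assumes the usual parallel-prefix combine rule (G | P.G', P.P'), takes g = A.B and p = A XOR B as the per-bit generate and propagate signals, and uses hypothetical function names. Subtraction is modelled as addition of the complemented operand with a carry-in of 1.

```python
MASK = 0xFFFFFFFF


def prefix_carries(g, p, width=32):
    """Carry-out of every bit position, given generate/propagate words.

    Each loop iteration is one level of the network (distances 1, 2, 4, 8, 16).
    """
    G, P = g, p
    shift = 1
    while shift < width:
        # Combine each position with the group `shift` bits below it:
        # (G, P) o (G', P') = (G | P.G', P.P')
        G = (G | (P & (G << shift))) & MASK
        P = (P & (P << shift)) & MASK
        shift <<= 1
    return G


def add_sub(a, b, subtract=False):
    """32-bit a + b, or a - b when subtract is True."""
    if subtract:
        b = ~b & MASK
    cin = 1 if subtract else 0
    g = a & b          # generate
    p = a ^ b          # propagate
    g0 = g | (p & cin)  # fold the carry-in into bit 0
    carries = prefix_carries(g0, p)
    carry_in_per_bit = ((carries << 1) | cin) & MASK
    return p ^ carry_in_per_bit


print(hex(add_sub(0xFFFFFFFF, 1)))        # 0x0
print(add_sub(5, 3, subtract=True))       # 2
```

Since the network only sees the g and p words, feeding it different g and p functions changes what the unit computes without changing the network itself.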

Configuration generation
[Figure: in1 and in2 feed a g LUT and a p LUT; their outputs g1 and p1 feed the Carry Generator, whose output cin1 feeds the Out LUT together with in1 and in2. A Meta Register File holds one truth-table column per register (r1, r2, r3, r4), and a Meta Function Unit combines the columns.]
Subgraph
  AND r3, r1, r2
  ADD r4, r1, r2
  XOR r5, r3, r4
Meta Function Unit
• AND: $Out = A \text{ AND } B$, so LUT(r3) = LUT(r1) AND LUT(r2)
• ADD: $Out = A \oplus B \oplus cin$, with $g = A \cdot B$ and $p = A \oplus B$
• XOR: $Out = A \oplus B$
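The rule LUT(r3) = LUT(r1) AND LUT(r2) suggests that a configuration can be derived by running the subgraph on truth tables instead of data values. The following sketch illustrates that idea for the subgraph on this slide. It is a sketch under assumptions, not the authors' implementation: the names `input_table` and `generate_config` are hypothetical, the carry-in of the ADD is treated as one extra table input, and only a single ADD per subgraph is handled.

```python
N_INPUTS = 3                  # r1, r2, and the carry-in of the single ADD
ENTRIES = 1 << N_INPUTS       # 8 truth-table rows
FULL = (1 << ENTRIES) - 1


def input_table(k):
    """Truth table of table input k: row `row` holds bit k of `row`."""
    table = 0
    for row in range(ENTRIES):
        if (row >> k) & 1:
            table |= 1 << row
    return table


def generate_config(program, inputs, carry_name="cin"):
    """Execute the subgraph on truth tables (the 'meta' registers).

    Returns the meta register contents and the (g, p) tables of the ADD.
    """
    meta = {name: input_table(k) for k, name in enumerate(inputs + [carry_name])}
    gp = None
    for op, dst, a, b in program:
        A, B = meta[a], meta[b]
        if op == "AND":
            meta[dst] = A & B
        elif op == "XOR":
            meta[dst] = A ^ B
        elif op == "ADD":
            gp = (A & B, A ^ B)                       # g = A.B, p = A xor B
            meta[dst] = (A ^ B ^ meta[carry_name]) & FULL
        else:
            raise ValueError(f"unsupported meta operation: {op}")
    return meta, gp


program = [
    ("AND", "r3", "r1", "r2"),
    ("ADD", "r4", "r1", "r2"),
    ("XOR", "r5", "r3", "r4"),
]
meta, (g, p) = generate_config(program, ["r1", "r2"])
print(format(g, "08b"), format(p, "08b"), format(meta["r5"], "08b"))
```

Each meta operation is a single bitwise operation on short tables, so under these assumptions the configuration falls out of one pass over the subgraph's instructions.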

Design Space
• Number of Inputs
• Number of Outputs
• Number of Addition/Subtractions
• Shift support
  • At inputs
  • At outputs
[Figures: PCFU block diagrams with up to six 32-bit inputs (in1 to in6), two g/p LUT pairs (g1 LUT – p1 LUT, g2 LUT – p2 LUT), two Carry Generators (cin1, cin2), the output LUTs OutLUT and OutLUT2 producing Out and Out2, and a Shifter]

Evaluation
• Ported Trimaran compiler to ARM ISA
  • Subgraph identification engine
• Synthesized with Synopsys standard cell library at 0.13 µm
• SimpleScalar configured as ARM926EJ-S
  • 5-stage pipeline, 250 MHz
  • 1-cycle 16k I/D caches
  • Single issue
• Baseline: 1-cycle subgraph execution latency

Speedup – Baseline PCFU
• 4-input, 2-output PCFU design

Number of inputs/outputs Area is proportional

Number of addition/subtractions

Collapsing Emulation

Shift support

Design points
• 4I, 2O, 2A, None
• 5I, 3O, 2A, None
• 4I, 3O, 2A, None

Conclusions
• Transparent Instruction Set Customization needs
  • Extracting computations from the program
  • An efficient substrate to map subgraphs
• PCFU LUT-based accelerators
  • Flexible configurable accelerators
  • Efficient configuration
• You can get up to 66% with a 6-input / 3-output / 2-adder PCFU
• … but you get 62% with an 8 times smaller, ~40% faster PCFU

Q & A

Backups

PCFU Design Space

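LUT-based accelerator
[Figure: a LUT with configuration 0010100110010110 takes bit i of the 32-bit inputs r1, r2, r3 together with the carry Cini-1 and produces r5i; in the dataflow graph, r1 and r2 are added to give r4, which is XORed with r3 to give r5]
  ADD r4,r1,r2
  XOR r5,r3,r4
  $r5_i = r3_i \oplus (r1_i \oplus r2_i \oplus cin_{i-1})$
  $cin_i = (r1_i \cdot r2_i) \mid (r1_i \oplus r2_i) \cdot cin_{i-1}$
• Closer to FPGA
• Bit-level functions too complex
• Proposed ripple-carry scheme too slow
• May involve a carry propagation network, which is also very complex
• Hard to configure with a reasonable latency in a GPP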
