
Exploring the Design Space of LUT-based Transparent Accelerators

Sami Yehia*, Nathan Clark▪, Scott Mahlke▪, and Krisztian Flautner*. *ARM Ltd. ▪Advanced Computer Architecture Lab, University of Michigan. CASES 2005, September 24-27.


Presentation Transcript


  1. Exploring the Design Space of LUT-based Transparent Accelerators
     Sami Yehia*, Nathan Clark▪, Scott Mahlke▪, and Krisztian Flautner*
     *ARM Ltd. ▪Advanced Computer Architecture Lab, University of Michigan
     CASES 2005, September 24-27

  2. Embedded Products Convergence
     Concept smart phone of 2008: biometrics, 20 GB HD, 3.5G (HSDPA), WiMax, GPS, stereo headset, Bluetooth/UWB, DMB (Digital Mobile Broadcast), memory card, PC / Mac, TV out, NFC / RFID
     • Increasing application demands require more performance
     • Embedded systems win through customization: more performance, low power, etc.
     • Traditional ISA customization and hardware specialization cannot cope with the growing number of functionalities
     • One way: Transparent Instruction Set Customization

  3. Transparent Instruction Set Customization
     [Figure: instruction sequence I1 to I5, sped up either by a higher frequency or ("OR…") by collapsing instructions (customization)]
     • An alternative way to performance
     Transparent:
     • No (or minor) ISA change
     • Baseline CPU unchanged
     • Hardware generates control
     • Eases software burden
     • Forward compatible

  4. Architecture Framework
     [Figure labels: Application, Compiler, Instructions ("… BRL …"), Augments Instruction Stream, Control Generation, Standard Pipeline, Subgraph Execution Unit, Subgraph Inputs, Outputs]

  5. Pipeline Interface

  6. LUT-based accelerator
     • LUT-based: one LUT (configuration 1111011001100110) maps the 32-bit inputs r1, r2, r4, r5 to the 32-bit output r12, collapsing
       inst1: EOR r6,r1,r2
       inst2: AND r7,r4,r5
       inst3: ORR r12,r6,r7
     • Addition/Subtraction: with inst1: ADD r6,r1,r2 feeding the ORR, each bit also depends on a carry:
       $r6_i = r1_i \oplus r2_i \oplus Cin_{i-1}$
       $Cin_i = r1_i \cdot r2_i \mid Cin_{i-1} \cdot (r1_i \oplus r2_i)$
     • A Carry Generator that is also programmable
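The 16-bit string on this slide can be read as the truth table of the collapsed EOR/AND/ORR subgraph, applied independently at each of the 32 bit positions. The following is a minimal sketch of that reading, not the authors' implementation. It assumes (the slide does not say) that the LUT index is formed from r4, r5, r1, r2, most to least significant, and that the string lists entry 15 first; under that assumption the slide's constant is reproduced. The function names are hypothetical.

```python
def subgraph(r1, r2, r4, r5):
    """The three-instruction subgraph from the slide, on plain integers."""
    r6 = r1 ^ r2        # inst1: EOR r6, r1, r2
    r7 = r4 & r5        # inst2: AND r7, r4, r5
    return r6 | r7      # inst3: ORR r12, r6, r7


def build_lut():
    """Enumerate all 16 single-bit input combinations into a 16-bit table.

    Assumed index layout (not given on the slide): r4 r5 r1 r2, MSB to LSB.
    """
    lut = 0
    for idx in range(16):
        r4, r5, r1, r2 = (idx >> 3) & 1, (idx >> 2) & 1, (idx >> 1) & 1, idx & 1
        lut |= subgraph(r1, r2, r4, r5) << idx
    return lut


def lut_execute(lut, r1, r2, r4, r5, width=32):
    """Apply the same 4-input LUT to every bit position of the operands."""
    out = 0
    for i in range(width):
        idx = ((((r4 >> i) & 1) << 3) | (((r5 >> i) & 1) << 2)
               | (((r1 >> i) & 1) << 1) | ((r2 >> i) & 1))
        out |= ((lut >> idx) & 1) << i
    return out


LUT = build_lut()
print(format(LUT, "016b"))  # prints 1111011001100110
```

Because the three operations are purely bitwise, one table lookup per bit replaces three dependent instructions; an ADD in the subgraph breaks this independence between bit positions, which is what motivates the programmable carry generator.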

  7. Programmable Carry Functional Unit (PCFU)
     [Figure: prefix network over bit positions 31 to 0, with five levels (L1 to L5) of "o" operators]
     Bit i supplies the pair $(g_i, p_i)$
     $(G, P) \circ (G', P') = (G \mid P \cdot G',\; P \cdot P')$, where $(G, P)$ is the more significant group
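The five operator levels over 32 bit positions are what a parallel-prefix carry network looks like. The sketch below is an illustration under assumptions, not the PCFU itself: it uses the usual generate/propagate combination rule with the first argument taken as the more significant group, fixes g and p to those of a plain addition (in the PCFU they would come from programmable LUTs), and checks the result against ordinary integer addition. All names are hypothetical.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1


def combine(hi, lo):
    """Prefix operator 'o': merge a more significant (G, P) group with a less significant one."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def prefix_carries(a, b):
    """Carry out of every bit position of a + b, computed in log2(WIDTH) levels."""
    # Per-bit generate and propagate for addition: g = a.b, p = a xor b.
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(WIDTH)]
    dist = 1
    while dist < WIDTH:  # dist = 1, 2, 4, 8, 16: five levels for 32 bits
        gp = [combine(gp[i], gp[i - dist]) if i >= dist else gp[i]
              for i in range(WIDTH)]
        dist *= 2
    return [g for g, _ in gp]


def prefix_add(a, b):
    """Add two WIDTH-bit values using the prefix carries (carry-in of bit 0 is 0)."""
    carries = prefix_carries(a, b)
    total = 0
    for i in range(WIDTH):
        cin = carries[i - 1] if i > 0 else 0
        total |= (((a >> i) ^ (b >> i) ^ cin) & 1) << i
    return total
```

For example, `prefix_add(0xFFFFFFFF, 1)` wraps to 0, as a 32-bit adder would. Each level only depends on the previous one, so the carry into bit 31 is available after five operator delays instead of 31 sequential ones.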

  8. Configuration generation
     [Figure: truth tables for the g LUT, p LUT and OutLUT over in1, in2 and Cin, feeding the Carry Generator; a Meta Register file (r1, r2) and a Meta Function Unit (r3, r4) build the configuration]
     • Subgraph:
       AND r3, r1, r2
       ADD r4, r1, r2
       XOR r5, r3, r4
     • AND: $Out = A \text{ AND } B$, so LUT(r3) = LUT(r1) AND LUT(r2)
     • ADD: $Out = A \oplus B \oplus cin$, with $g = A \cdot B$ and $p = A \oplus B$
     • XOR: $Out = A \oplus B$
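The rule LUT(r3) = LUT(r1) AND LUT(r2) suggests that a configuration can be built by running each instruction once on truth-table columns instead of on data. Here is a minimal sketch of that idea for the slide's three-instruction subgraph. It is an assumption-laden illustration: the 8-entry index layout (cin, in2, in1), the tuple encoding of instructions, and the function names are all hypothetical, and only a single ADD (one carry chain) is handled.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1

# Truth-table columns over index = cin<<2 | in2<<1 | in1 (layout is an assumption).
IN1, IN2, CIN = 0xAA, 0xCC, 0xF0


def configure(subgraph):
    """Walk the subgraph once, keeping one LUT per register (the 'meta register file')."""
    meta = {"r1": IN1, "r2": IN2}
    g_lut = p_lut = 0
    for op, dst, a, b in subgraph:
        la, lb = meta[a], meta[b]
        if op == "AND":
            meta[dst] = la & lb
        elif op == "XOR":
            meta[dst] = la ^ lb
        elif op == "ADD":
            g_lut, p_lut = la & lb, la ^ lb   # g = A.B, p = A xor B
            meta[dst] = p_lut ^ CIN           # Out = A xor B xor cin
        else:
            raise ValueError(op)
    # The LUT of the last destination register becomes the output LUT.
    return g_lut, p_lut, meta[subgraph[-1][1]]


def execute(config, in1, in2):
    """Evaluate the configured unit; carries are rippled here only for simplicity."""
    g_lut, p_lut, out_lut = config
    out, cin = 0, 0
    for i in range(WIDTH):
        idx = (cin << 2) | (((in2 >> i) & 1) << 1) | ((in1 >> i) & 1)
        out |= ((out_lut >> idx) & 1) << i
        g, p = (g_lut >> idx) & 1, (p_lut >> idx) & 1
        cin = g | (p & cin)
    return out


SUBGRAPH = [("AND", "r3", "r1", "r2"),
            ("ADD", "r4", "r1", "r2"),
            ("XOR", "r5", "r3", "r4")]
CONFIG = configure(SUBGRAPH)
```

With this layout, `CONFIG` is `(0x88, 0x66, 0x1E)`, and `execute(CONFIG, a, b)` agrees with `(a & b) ^ ((a + b) & MASK)`. The point of the sketch is that configuration costs one pass of ordinary logic operations over small tables, one per instruction in the subgraph.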

  9. Design Space
     • Number of inputs
     • Number of outputs
     • Number of additions/subtractions
     • Shift support
       • At inputs
       • At outputs
     [Figure: PCFU variants built from 32-bit inputs (in1 to in6), g1/p1 and g2/p2 LUTs, Carry Generators (cin1, cin2), OutLUT and OutLUT2 producing Out and Out2, and optional Shifters]

  10. Evaluation
      • Ported Trimaran compiler to ARM ISA
      • Subgraph identification engine
      • Synthesized with Synopsys standard cell library at 0.13 µm
      • SimpleScalar configured as ARM926EJ-S
        • 5-stage pipeline, 250 MHz
        • 1-cycle 16k I/D caches
        • Single issue
      • Baseline: 1-cycle subgraph execution latency

  11. Speedup – Baseline PCFU
      • 4-input, 2-output PCFU design

  12. Number of inputs/outputs Area is proportional

  13. Number of addition/subtractions

  14. Collapsing Emulation

  15. Shift support

  16. Design points (inputs, outputs, adders, shift support)
      • 4I, 2O, 2A, None
      • 5I, 3O, 2A, None
      • 4I, 3O, 2A, None

  17. Conclusions
      • Transparent Instruction Set Customization needs
        • Extracting computations from the program
        • An efficient substrate to map subgraphs
      • PCFU: LUT-based accelerators
        • Flexible, configurable accelerators
        • Efficient configuration
      • You can get up to 66% speedup with a 6-input / 3-output / 2-adder PCFU ...
      • ... but you get 62% with an 8 times smaller, ~40% faster PCFU

  18. Q & A

  19. Backups

  20. PCFU Design Space

  21. LUT-based accelerator
      [Figure: LUT (configuration 0010100110010110) with per-bit inputs r1_i, r2_i, r3_i, Cin_{i-1} and output r5_i, for r3 ⊕ r4 → r5]
        ADD r4,r1,r2
        XOR r5,r3,r4
      $r5_i = r3_i \oplus (r1_i \oplus r2_i \oplus cin_{i-1})$
      $cin_i = (r1_i \cdot r2_i) \text{ OR } (r1_i \oplus r2_i) \cdot cin_{i-1}$
      • Closer to FPGA
      • Bit-level functions too complex
      • Proposed ripple-carry scheme too slow
      • May also involve a very complex carry propagation network
      • Hard to configure with a reasonable latency in a GPP
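The two per-bit equations on this backup slide can be evaluated directly, which also makes the "too slow" remark concrete: each bit must wait for the carry of the bit below it. A minimal sketch, assuming a carry-in of 0 at bit 0 and 32-bit operands (the function name is hypothetical):

```python
def ripple_subgraph(r1, r2, r3, width=32):
    """Evaluate ADD r4,r1,r2 ; XOR r5,r3,r4 one bit at a time."""
    r5, cin = 0, 0
    for i in range(width):
        a, b, c = (r1 >> i) & 1, (r2 >> i) & 1, (r3 >> i) & 1
        r5 |= (c ^ (a ^ b ^ cin)) << i      # r5_i = r3_i xor (r1_i xor r2_i xor cin_{i-1})
        cin = (a & b) | ((a ^ b) & cin)     # cin_i = r1_i.r2_i OR (r1_i xor r2_i).cin_{i-1}
    return r5
```

The loop body cannot start iteration i before iteration i-1 has produced `cin`, so in hardware the critical path grows linearly with the word width.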
