Application-Specific Processing on a General Purpose Core via Transparent Instruction Set Customization
Nathan Clark, Manjunath Kudlur, Hyunchul Park, Scott Mahlke, Krisztián Flautner*
Advanced Computer Architecture Lab, University of Michigan
*ARM Ltd.
A Case for Customization
• General purpose processors handle many applications fairly well, but…
• Each application has different requirements
• Need for efficient execution
• Impressive design wins through customization
• Performance, power, area
• Up to 3.5x speedup [Hot Chips 16]
Instruction Set Customization
• Computationally demanding parts of applications run on special hardware
• New instructions use the special hardware
[Figure: dataflow graph of LD, MPY, XOR, SHR, AND, and MOV operations, shown alongside a version that uses a CUSTOM instruction]
Traditional vs. Transparent Customization
• Traditional: high non-recurring engineering (NRE) costs
• Transparent: “universal” accelerator, no ISA change
[Figure: CPU blocks for each approach; the transparent side adds a Compute Accelerator (CCA)]
Design of a Compute Accelerator
• Goal: support important computation subgraphs
• Array of function units
• Exploits subgraph parallelism
• Allows natural data propagation
[Figure: the CCA as rows of function units (FU) fed by inputs IN 1, IN 2, …, placed beside the ALUs in a pipeline with Fetch, Issue, and WB stages]
CCA Shape: 164.gzip
[Figure: example subgraphs from 164.gzip built from Mov, And, and Or operations]
CCA Shape: Blowfish
[Figure: example subgraph from Blowfish built from Add, Mov, Xor, and And operations]
CCA Utilization
• Dynamic % of subgraphs using FU
CCA Operations
• Dynamic opcodes in important subgraphs
• Excluded mpy/div, load/store, branch
• Two main categories: logicals and adds
• Subgraphs rarely have more than 3 dependent adds
Proposed CCA Design
• 4 inputs/2 outputs
• Two FU types
• Arith/logic
• Logic
• Crossbar between rows
• Captures > 99% of important subgraphs
[Figure: CCA with inputs I1, I2, I3, I4 and outputs O1, O2]
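The limits on this slide can be read as a simple admission test for a candidate subgraph. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Instr` record, the `live_out` set, and the opcode sets are hypothetical, and the cap of 3 dependent adds is an assumption borrowed from the "rarely more than 3 dependent adds" observation rather than a stated design parameter.

```python
from dataclasses import dataclass

LOGIC_OPS = {"AND", "OR", "XOR", "MOV"}  # logic ops seen in the example subgraphs
ARITH_OPS = {"ADD"}                      # the "adds" category
MAX_INPUTS, MAX_OUTPUTS = 4, 2           # from the proposed design
MAX_DEPENDENT_ADDS = 3                   # ASSUMPTION: not a stated hardware limit


@dataclass(frozen=True)
class Instr:
    op: str
    dest: str
    srcs: tuple  # source registers only; immediates are ignored


def external_inputs(subgraph):
    """Registers read by the subgraph before any member writes them."""
    produced, inputs = set(), []
    for ins in subgraph:
        for reg in ins.srcs:
            if reg not in produced and reg not in inputs:
                inputs.append(reg)
        produced.add(ins.dest)
    return inputs


def dependent_add_depth(subgraph):
    """Length of the longest chain of adds linked by data dependences."""
    depth = {}  # register -> adds on the longest path that produces it
    longest = 0
    for ins in subgraph:
        d = max((depth.get(r, 0) for r in ins.srcs), default=0)
        d += 1 if ins.op in ARITH_OPS else 0
        depth[ins.dest] = d
        longest = max(longest, d)
    return longest


def fits_cca(subgraph, live_out):
    """True if the subgraph respects the opcode, port, and depth limits."""
    if any(i.op not in LOGIC_OPS | ARITH_OPS for i in subgraph):
        return False  # mpy/div, load/store, and branch are excluded
    outputs = {i.dest for i in subgraph if i.dest in live_out}
    return (len(external_inputs(subgraph)) <= MAX_INPUTS
            and len(outputs) <= MAX_OUTPUTS
            and dependent_add_depth(subgraph) <= MAX_DEPENDENT_ADDS)


# Hypothetical four-instruction chain; only r7 is needed afterwards.
chain = [
    Instr("ADD", "r4", ("r1",)),
    Instr("XOR", "r5", ("r4", "r2")),
    Instr("ADD", "r6", ("r5", "r3")),
    Instr("XOR", "r7", ("r6", "r8")),
]
print(fits_cca(chain, live_out={"r7"}))  # True
```

The chain reads four external registers (r1, r2, r3, r8), produces one live value, and contains two dependent adds, so it passes all three checks in this sketch.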
Synthesis of CCA
• Synopsys design tools, 130 nm library
CCA Utilization
• Static selection, static realization: ASIPs
  – ISA change
  – High NRE
• Static selection, dynamic realization
  + Powerful selection
  + Simple hardware
  – Some ISA change
  – Recompile necessary
• Dynamic selection, dynamic realization
  + No ISA change
  + No recompile
  – Simple selection
  – Hardware complexity
Dynamic Selection – Dynamic Realization
• Detect and replace subgraphs in fill unit of trace cache
Original trace:
  …
  ADD r4, r1, #1
  LSR r2, r2, #4
  XOR r5, r4, r2
  LD r3
  ADD r6, r5, r3
  XOR r7, r6, r8
  SHR …
After subgraph selection and insertion:
  …
  LSR r2, r2, #4
  LD r3
  CUSTOM
  SHR …
[Figure: pipeline with I-Cache, Decode, Execute, and Retire stages plus a Trace Cache; trace construction feeds subgraph selection and insertion]
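The detect-and-replace step can be illustrated with a small greedy pass over a trace. This is a sketch under stated assumptions and not the selection algorithm used in the work: the `Instr` record, the `SUPPORTED` and `BARRIERS` opcode sets, the `live_out` set, and the greedy policy are all hypothetical. It tracks register dependences only; a real fill unit would also have to respect memory ordering and exceptions. The trailing SHR from the slide is left out because its operands are not shown.

```python
from dataclasses import dataclass

# ASSUMPTION: opcode sets are illustrative, not taken from the slides.
SUPPORTED = {"ADD", "AND", "OR", "XOR", "MOV"}  # ops the accelerator accepts
BARRIERS = {"B", "BL"}                          # branches are never hoisted
MAX_INPUTS, MAX_OUTPUTS = 4, 2                  # CCA port limits


@dataclass(frozen=True)
class Instr:
    op: str
    dest: str = ""
    srcs: tuple = ()  # register sources only; immediates are ignored


def can_hoist(trace, members):
    """Non-members between the first and last member must move above the
    subgraph; that is legal only without a hazard against an earlier member."""
    for j in range(members[0] + 1, members[-1]):
        if j in members:
            continue
        later = trace[j]
        if later.op in BARRIERS:
            return False
        for i in members:
            if i > j:
                break
            earlier = trace[i]
            if (earlier.dest in later.srcs            # read after write
                    or later.dest in earlier.srcs     # write after read
                    or later.dest == earlier.dest):   # write after write
                return False
    return True


def subgraph_io(trace, members, live_out):
    """External inputs and live outputs of the selected instructions."""
    produced, inputs = set(), set()
    for i in members:
        inputs |= {r for r in trace[i].srcs if r not in produced}
        produced.add(trace[i].dest)
    read_after = {r for ins in trace[members[-1] + 1:] for r in ins.srcs}
    outputs = {r for r in produced if r in read_after or r in live_out}
    return inputs, outputs


def select_and_replace(trace, live_out):
    """Greedily grow one subgraph, then collapse it into a CUSTOM op."""
    members = []
    for idx, ins in enumerate(trace):
        if ins.op not in SUPPORTED:
            continue
        trial = members + [idx]
        inputs, outputs = subgraph_io(trace, trial, live_out)
        if (len(inputs) <= MAX_INPUTS and len(outputs) <= MAX_OUTPUTS
                and can_hoist(trace, trial)):
            members = trial
    if len(members) < 2:
        return list(trace)  # nothing worth collapsing
    inputs, outputs = subgraph_io(trace, members, live_out)
    hoisted = [trace[j] for j in range(members[0], members[-1])
               if j not in members]
    custom = Instr("CUSTOM", ",".join(sorted(outputs)), tuple(sorted(inputs)))
    return trace[:members[0]] + hoisted + [custom] + trace[members[-1] + 1:]


trace = [
    Instr("ADD", "r4", ("r1",)),
    Instr("LSR", "r2", ("r2",)),
    Instr("XOR", "r5", ("r4", "r2")),
    Instr("LD", "r3"),  # address operand is not shown on the slide
    Instr("ADD", "r6", ("r5", "r3")),
    Instr("XOR", "r7", ("r6", "r8")),
]
print([i.op for i in select_and_replace(trace, live_out={"r7"})])
# ['LSR', 'LD', 'CUSTOM']
```

With r7 assumed to be the only value needed after the trace, the two ADDs and two XORs collapse into one CUSTOM operation, and LSR and LD move ahead of it, matching the ordering in the rewritten trace.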
Simulation
• SimpleScalar, ARM instruction set
• 4-wide execution, 1 compute accelerator
• 128 RUU entries
• 32k inst. trace cache, 256 inst. traces
• 5000 cycle selection/insert latency
• L1 I-cache: 32k, 2 way, 2 cycle hit
• L1 D-cache: 32k, 4 way, 2 cycle hit
Varying CCA Latency
[Chart: speedup (y-axis, 1.00 to 1.45) per benchmark for CCA latencies of 1, 2, 4, and 6. Benchmark groups: SPECint, MediaBench, Encryption. Benchmarks: 164.gzip, 181.mcf, 186.crafty, 197.parser, 300.twolf, cjpeg, djpeg, epic, unepic, g721encode, gsmdecode, mesamipmap, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rawdaudio, 3des, blowfish, rc4, sha, plus the average]
Static Selection – Dynamic Realization
• Compiler selects subgraphs offline
• Communicated to the hardware at load time
• Control bits stored in a table and inserted at decode
Original code:
  …
  ADD r4, r1, #1
  LSR r2, r2, #4
  XOR r5, r4, r2
  LD r3
  ADD r6, r5, r3
  XOR r7, r6, r8
  SHR …
Code with the subgraph marked by the compiler:
  …
  LSR r2, r2, #4
  LD r3
  CCA_Start #2
  ADD r4, r1, #1
  XOR r5, r4, r2
  ADD r6, r5, r3
  XOR r7, r6, r8
  CCA_End
  SHR …
[Figure: pipeline with I-Cache, Decode, Execute, and Retire stages; a Control Table feeds the decode stage]
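The load-time table and decode-time insertion can be pictured with a toy model. This is a sketch under stated assumptions, not the actual hardware mechanism: instructions are plain strings, the table is keyed by the position of each CCA_Start, the bracketed instructions stand in for the real control bits, and the `#2` operand of CCA_Start is carried along without being interpreted. The function names are hypothetical.

```python
def build_control_table(program):
    """Load-time pass: map each CCA_Start position to
    (bracketed instructions, position after CCA_End)."""
    table, start = {}, None
    for pc, text in enumerate(program):
        op = text.split()[0]
        if op == "CCA_Start":
            if start is not None:
                raise ValueError("nested CCA_Start")
            start = pc
        elif op == "CCA_End":
            if start is None:
                raise ValueError("CCA_End without CCA_Start")
            # The instruction tuple is a stand-in for real control bits.
            table[start] = (tuple(program[start + 1:pc]), pc + 1)
            start = None
    if start is not None:
        raise ValueError("unterminated CCA_Start")
    return table


def decode(program, table):
    """Decode-time pass: a table hit emits one CCA operation and skips
    the bracketed region; everything else passes through unchanged."""
    pc, decoded = 0, []
    while pc < len(program):
        if pc in table:
            subgraph, next_pc = table[pc]
            decoded.append(("CCA", subgraph))
            pc = next_pc
        else:
            decoded.append(program[pc])
            pc += 1
    return decoded


program = [
    "LSR r2, r2, #4",
    "LD r3",
    "CCA_Start #2",
    "ADD r4, r1, #1",
    "XOR r5, r4, r2",
    "ADD r6, r5, r3",
    "XOR r7, r6, r8",
    "CCA_End",
]
table = build_control_table(program)
for entry in decode(program, table):
    print(entry)
```

Keying the table by program position keeps the stored code unchanged, so in this model a core without the table could still run the bracketed instructions one by one if it treated the markers as no-ops.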
Dynamic vs. Static Selection
[Chart: speedup (y-axis, 1.00 to 1.45) per benchmark for dynamic selection and static selection. Benchmark groups: SPECint, MediaBench, Encryption. Benchmarks: 164.gzip, 181.mcf, 186.crafty, 197.parser, 300.twolf, cjpeg, djpeg, epic, unepic, g721encode, gsmdecode, mesamipmap, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rawdaudio, 3des, blowfish, rc4, sha, plus the average]
Summary
• Transparent instruction set customization
• Benefits of customization without changing ISA
• Presented design of a compute accelerator
• Handle majority of important computation subgraphs in many benchmarks
• Developed ways to utilize the accelerator
• Table-based static selection – dynamic realization
• Trace cache based dynamic selection – dynamic realization