An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors
Nathan Clark, Jason Blome, Michael Chu, Scott Mahlke, Stuart Biles*, Krisztián Flautner*
Advanced Computer Architecture Lab, University of Michigan
*ARM Ltd.
The Expression Gap
• RISC ISAs are lowest common denominator
• Don’t match applications’ computation
• Don’t match hardware capabilities
• Need efficient execution
• Impressive design wins through customization
• Performance, power, etc.
Customization Gains: Performance
[Figure: bar chart of speedup (y-axis, 0 to 4) for 3Des, AES, Blowfish, Md5, Rc4, SHA. Series: OptimoDE (5 Issue VLIW, 333 MHz) and OptimoDE + Custom ISA]
Traditional ISA Customization
• Demanding parts of applications run on special hardware
• New instructions use the special hardware
[Diagram labels: CPU, Custom Hardware; instruction streams SHR, LD, MPY, CUSTOM, AND and LD, MPY, XOR, SHR, XOR, MOV, XOR]
Objectives of Transparent ISA Customization
• Increase execution efficiency of processors
• Architecture framework for subgraph acceleration
• Create a pipeline with fixed interface
• Design and verify once
• Support Plug-and-Play style accelerators
• CISC on Demand
Traditional vs. Transparent Customization
Traditional:
• Significant ISA change
• High NRE
• Verification
• Masks
• Control placed in binary
• Software migration
• No legacy codes
Transparent:
• No ISA change
• Baseline CPU unchanged
• Hardware generates control
• Eases software burden
• Forward compatible
Architecture Framework
[Diagram with components numbered 1 to 4. Labels: Application; Compiler; Instructions (… Subg. …); Standard Pipeline; Subgraph Execution Unit (Inputs, Outputs); Control Generation (Augments Instruction Stream)]
Configurable Compute Array (CCA)
• Array of function units
• Two types of FUs: arith/logic, logic
• 82% of important subgraphs
• Crossbar between rows
• 3.19 ns critical path
• 0.61 mm² in 0.13 µm
[Diagram labels: inputs I1, I2, I3, I4; outputs O1, O2]
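The bullets above describe rows of function units joined by crossbars. A minimal software sketch of that idea follows. It is not the authors' design: the 32-bit width, the opcode names, the `FU` record, and the rule that each row can read the primary inputs plus the previous row's results are all assumptions made for illustration.

```python
from dataclasses import dataclass

MASK = 0xFFFFFFFF  # assumption: 32-bit datapath

# Assumed opcode sets: every FU can do logic, only arith/logic FUs can add/sub.
LOGIC_OPS = {
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
}
ARITH_OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
}


@dataclass(frozen=True)
class FU:
    """One function unit (hypothetical record, not from the slides)."""
    kind: str   # "arith_logic" or "logic"
    op: str     # opcode name from the tables above
    src_a: int  # crossbar selection: index into the values visible to this row
    src_b: int


def run_fu(fu, visible):
    """Evaluate one FU, rejecting arithmetic on a logic-only unit."""
    if fu.op in LOGIC_OPS:
        fn = LOGIC_OPS[fu.op]
    elif fu.op in ARITH_OPS and fu.kind == "arith_logic":
        fn = ARITH_OPS[fu.op]
    else:
        raise ValueError(f"{fu.kind} FU cannot execute {fu.op!r}")
    return fn(visible[fu.src_a], visible[fu.src_b]) & MASK


def run_cca(rows, inputs):
    """Evaluate the array row by row and return the last row's results."""
    primary = [v & MASK for v in inputs]
    visible = list(primary)
    outs = []
    for row in rows:
        outs = [run_fu(fu, visible) for fu in row]
        # Assumed crossbar: next row sees primary inputs plus this row's outputs.
        visible = primary + outs
    return outs


# Usage: compute (a + b) ^ (c & d) in two rows.
rows = [
    [FU("arith_logic", "add", 0, 1), FU("logic", "and", 2, 3)],
    [FU("logic", "xor", 4, 5)],  # indices 4 and 5 are the two row-0 results
]
result = run_cca(rows, [1, 2, 6, 3])
```

Splitting the units into two kinds mirrors the slide's "arith/logic, logic" distinction: a mapping that places an add on a logic-only unit is rejected rather than silently executed.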
Compiler
• Identify and delineate subgraphs
• “Procedural Abstraction” – used in compression
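Procedural abstraction, as the term is used in code compression, moves a run of instructions into a separate routine and leaves a call in its place. The sketch below shows only that transformation. The function name `outline_subgraph`, the `BL` call mnemonic (borrowed from ARM), and the `LD`/`ST` lines in the sample program are assumptions for illustration, not the authors' compiler.

```python
def outline_subgraph(instructions, start, end, label="Subg"):
    """Move instructions[start:end] into a labelled routine.

    Returns (main, routine): the main stream with a call left in place of
    the subgraph, and the routine holding the subgraph followed by RET.
    """
    if not (0 <= start < end <= len(instructions)):
        raise ValueError("subgraph span is empty or out of range")
    body = instructions[start:end]
    # Assumption: the call is written as an ARM-style branch-and-link.
    main = instructions[:start] + [f"BL {label}"] + instructions[end:]
    routine = [f"{label}:"] + body + ["RET"]
    return main, routine


# Usage with a hypothetical instruction stream around a four-operation subgraph.
program = [
    "LD r1, [r0]",
    "AND r3, r1, #-4",
    "SEXT r2, r4",
    "AND r2, r2, #3",
    "OR r3, r3, r2",
    "ST r3, [r0]",
]
main, routine = outline_subgraph(program, 1, 5)
```

Because the main stream only gains an ordinary call and return, a core without an accelerator can still run the outlined code unchanged, which matches the "No ISA change" and "Forward compatible" points made earlier in the deck.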
Control Generation
[Diagram: the subgraph below and its mapping onto the array; labels I1 to I4 (inputs), O1, O2 (outputs)]
Subg:
    AND  r3, r1, #-4
    SEXT r2, r4
    AND  r2, r2, #3
    OR   r3, r3, r2
    RET
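To map a subgraph like the one above onto an accelerator, something has to work out which registers feed it, which registers it writes, and how its operations depend on each other. The following sketch does that in software for the slide's four operations. It illustrates the dataflow analysis only and is not the authors' hardware control generator; the helper names `parse` and `analyze` and the `dest, sources...` operand convention are assumptions.

```python
import re


def parse(line):
    """Split 'OP dest, src, ...' into (op, dest, register sources)."""
    op, rest = line.split(None, 1)
    operands = [tok.strip() for tok in rest.split(",")]
    dest, srcs = operands[0], operands[1:]
    # Immediates such as '#-4' are not registers and are dropped here.
    regs = [s for s in srcs if re.fullmatch(r"r\d+", s)]
    return op, dest, regs


def analyze(lines):
    """Return (live_in, written, edges) for a straight-line subgraph.

    live_in: registers read before any instruction in the subgraph writes them
    written: registers written, in order of first write
    edges:   (producer_index, consumer_index) dependence pairs
    """
    last_writer = {}
    live_in, written, edges = [], [], []
    for i, line in enumerate(lines):
        _, dest, srcs = parse(line)
        for reg in srcs:
            if reg in last_writer:
                edges.append((last_writer[reg], i))
            elif reg not in live_in:
                live_in.append(reg)
        last_writer[dest] = i
        if dest not in written:
            written.append(dest)
    return live_in, written, edges


subgraph = [
    "AND r3, r1, #-4",
    "SEXT r2, r4",
    "AND r2, r2, #3",
    "OR r3, r3, r2",
]
live_in, written, edges = analyze(subgraph)
# live_in is ['r1', 'r4'], written is ['r3', 'r2'],
# edges is [(1, 2), (0, 3), (2, 3)]
```

Which of the written registers are true outputs depends on what the code after the subgraph reads, so the sketch reports them only as candidates.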
Evaluation
• Ported Trimaran compiler to ARM ISA
• Subgraph identification engine
• Synthesized control generator and accelerator
• SimpleScalar configured as ARM926EJ-S
• 5 stage pipe, 250 MHz
• 1 cycle 16k I/D caches
• Single issue
• 1 cycle subgraph execution latency
Performance Results
[Figure: bar chart of speedup (y-axis, 1 to 5, with one bar labeled 6.51) per benchmark, grouped as SPECint, MediaBench, Encryption. Benchmarks: rc4, sha, md5, epic, djpeg, cjpeg, unepic, Rijndael, 181.mcf, blowfish, 164.gzip, 300.twolf, 256.bzip2, pegwitenc, pegwitdec, rawdaudio, rawcaudio, 197.parser, gsmencode, gsmdecode, g721encode, g721decode]
• 1.6 IPC on a single-issue core
Plug-and-Play Benefits
• Baseline Area: 0.61 mm²
• Baseline Speedup: 1.8
Effect of CCA Pipelining
[Figure: bar chart of speedup (y-axis, 1 to 5) per benchmark for four series labeled 1 to 4. Benchmarks: rc4, sha, md5, cjpeg, djpeg, rijndael, epicdec, epicenc, blowfish, rawdaudio, rawcaudio, pegwitdec, pegwitenc, gsmdecode, gsmencode, g721decode, g721encode]
Average: 2.17, 1.86, 1.64, 1.48
Conclusions
• Expression gap between ISAs and computation
• Inherent inefficiency
• Transparent ISA Customization
• Fixed core ⇒ low NRE
• Plug-and-Play accelerators
• Enables “CISC on demand”
• 1.8x speedup for 15% area overhead
Questions?
More info: http://cccp.eecs.umich.edu