Enhancing Energy Efficiency of Processor-Based Embedded Systems through Post-Fabrication ISA Extension • Hamid Noori†, Farhad Mehdipour‡, Koji Inoue‡, and Kazuaki Murakami‡ • †Institute of Systems, Information Technologies and Nanotechnologies • ‡Kyushu University, Fukuoka, Japan • August 2008 • ISLPED @ Bangalore, India
Outline • Introduction • General Overview of the Proposed Approach • Multi-Exits Custom Instructions • Microarchitecture of the CRFU • Evaluation Results • Conclusions and Future Work
Introduction • Designing Embedded Systems • Embedded Microprocessors • Application Specific Integrated Circuits (ASICs) • Application Specific Instruction-set Processors (ASIPs) & Extensible Processors
Extensible Processors • Base processor (BP)'s fixed instruction set + Custom Instructions • [Figure: CPU with Instruction Dispatcher, Register File, functional units (+, &, x, LD/ST), and custom functional units CFU1 and CFU2; example custom instruction AND2_OR built from two AND nodes and one OR node] • LD/ST: Load/Store • CFU: Custom Functional Unit
Motivations (1/2) • Exploding NRE Costs (source: Keynote @ ASP-DAC 2007)
Motivations (2/2) • This has led to the quest for a flexible and reusable embedded processor that still must achieve the required performance and energy efficiency levels. (source: Keynote @ DATE 2007)
Introduction • Adaptive Extensible Processor [DATE'07] • Has a coarse-grained reconfigurable functional unit • Supports efficient "Multi-Exits CIs" • Achieves high performance at low cost • Question • How about energy efficiency? • Results: Energy saving • vs. base processor: 22% • vs. single basic-block based CIs: 3%
ADaptive EXtensible processOR (ADEXOR) • Generating and adding CIs AFTER chip fabrication (utilization phase) • [Figure: processor with Instruction Dispatcher, Register File, functional units (+, &, x, LD/ST), CFU1, a CRFU, and a configuration memory (Config Mem)]
Execution Overview of ADEXOR • [Figure: GPP datapath (Register File, ID/EXE Reg, ALU, MUX, EXE/MEM Reg) with augmented hardware (CRFU, Configuration Memory, Counter); figure labels: "Indexed by mtc1 or sequencer", "Triggered by mtc1 or sequencer"] • GPP: General Purpose Processor • RFU: Reconfigurable Functional Unit • Hot basic block shown on the slide:
  400680  subiu $25,$25,1
  400688  lbu   $13,0($7)
  400690  lbu   $2,0($4)
  400698  sll   $2,$2,0x18
  4006a0  sra   $14,$2,0x18
  4006a8  addiu $4,$4,1
  4006b0  srl   $8,$2,0x1c
  4006b8  sll   $2,$8,0x2
  4006c0  addu  $2,$2,$25
  4006c8  lw    $2,0($2)
  4006d0  xori  $13,$13,1
  4006d8  addu  $10,$10,$2
  4006e0  bgez  $10,4006f0
  . . . .
Why Multi-Exits Custom Instructions (MECIs)? • Conventional BB-based CI generation (single-entry, single-exit) • Benchmark: adpcm • Required nodes: 4 • Assumption: one CI can include at most 16 nodes
Why Multi-Exits Custom Instructions (MECIs)? • BB-based CI with conditional execution support (single-entry, single-exit) • Benchmark: adpcm • Required nodes: 12 • Assumption: one CI can include at most 16 nodes
Why Multi-Exits Custom Instructions (MECIs)? • BB-based CI with conditional execution support (single-entry, single-exit) • Benchmark: adpcm • Required nodes: 21 (cannot be mapped) • Assumption: one CI can include at most 16 nodes
Why Multi-Exits Custom Instructions (MECIs)? • BB-based CI with conditional execution support (single-entry, single-exit) • Benchmark: adpcm • Required nodes: 16 • Assumption: one CI can include at most 16 nodes
Why Multi-Exits Custom Instructions (MECIs)? • Multiple-exits custom instruction: conditional execution + hot-path selection • Benchmark: adpcm • Required nodes: 16 (the figure marks two exits) • Assumption: one CI can include at most 16 nodes
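The mapping constraint behind these five slides can be sketched as a simple capacity check. The node counts and the 16-node limit come from the slides; the function name and the short case labels are illustrative assumptions, not part of the authors' tool chain.

```python
# Capacity assumed on the slides: at most 16 nodes per custom instruction.
MAX_NODES_PER_CI = 16


def fits_in_one_ci(required_nodes, capacity=MAX_NODES_PER_CI):
    """Return True when a candidate region can be mapped as a single CI."""
    return required_nodes <= capacity


# (label, required nodes) as quoted on the adpcm slides; labels are descriptive only.
slide_cases = [
    ("BB-based CI", 4),
    ("BB-based CI + conditional execution (12 nodes)", 12),
    ("BB-based CI + conditional execution (21 nodes)", 21),
    ("BB-based CI + conditional execution (16 nodes)", 16),
    ("multi-exit CI: conditional execution + hot-path selection", 16),
]

report = [(label, nodes, fits_in_one_ci(nodes)) for label, nodes in slide_cases]

for label, nodes, ok in report:
    print(f"{label}: {nodes} nodes, {'mappable' if ok else 'cannot be mapped'}")
```

Only the 21-node region exceeds the limit, which matches the "cannot be mapped" note on the slide.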
Main features of MECIs • Fixed-point operations: supported • Multiply: not supported • Divide: not supported • Control flow: supported • Memory instructions: not supported
Custom Instruction Invocation • How to change the execution sequence and run custom instructions on the CRFU? • invoke-mtc1 method • invoke-seq method
invoke-mtc1 method • Instruction scheduling • [Figure: dataflow graph of the MECI, entered at mtc1, with nodes 7, 2, 3, 5, 8, 10, 11, 12, 20, 19, branch nodes beq, bgez, bne, bne, and exits exit1 to exit4]
Code before generating MECI (inst. #, address, inst., operands: dest, src1, src2):
   0  400410  addu   R13 R0 R0
   1  400418  lw     R23 R2 100
   2  400420  addiu  R4 R4 2
   3  400428  subu   R3 R2 R11
   4  400430  bgez   400440 R3
   5  400438  addiu  R13 R0 8
   6  400440  beq    400468 R13
   7  400448  subu   R3 R0 R3
   8  400450  addu   R10 R0 R0
   9  400458  lw     R8 R9 0x3
  10  400460  slt    R2 R3 R9
  11  400468  addu   R8 R8 R9
  12  400470  ori    R10 R10 1
  13  400478  bne    4004a8 R2
  14  400480  addiu  R10 R0 4
  15  400488  subu   R3 R3 R9
  16  400490  addu   R8 R8 R9
  17  400498  sra    R9 R9 0x1
  18  4004a0  slt    R2 R3 R9
  19  4004a8  ori    R10 R10 2
  20  4004b0  subu   R3 R3 R9
  21  4004b8  bne    400410 R2
  22  4004c0  slt    R2 R3 R9
Code after generating MECI (only the rows that differ from the listing above):
   1  400410  lw     R23 R2 100
   0  400418  addu   R13 R0 R0  →  mtc1 #CI
  10  400458  slt    R2 R3 R9
   9  400460  lw     R8 R9 0x3
invoke-seq method • Sequencer table (CAM) • [Figure: dataflow graph of the MECI with nodes 0, 7, 2, 3, 5, 8, 10, 11, 12, 20, 19, branch nodes beq, bgez, bne, bne, and exits exit1 to exit4]
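One way to picture the invoke-seq method is a small content-addressable table that is searched with the program counter. The sketch below is a toy software model under stated assumptions: the slides only say that the sequencer is a CAM with 32 entries and that it can index the configuration memory; the class name, its methods, and the idea of keying on the MECI start address are hypothetical.

```python
class SequencerTable:
    """Toy model of a CAM mapping a start address to a CI configuration index."""

    def __init__(self, capacity=32):  # 32 entries, as quoted for the sequencer CAM
        self.capacity = capacity
        self._entries = {}

    def add(self, start_pc, config_index):
        """Register a custom instruction; fails when every entry is in use."""
        if start_pc not in self._entries and len(self._entries) >= self.capacity:
            raise OverflowError("sequencer table is full")
        self._entries[start_pc] = config_index

    def lookup(self, pc):
        """Return the configuration index on a hit, or None on a miss."""
        return self._entries.get(pc)


table = SequencerTable()
table.add(0x400410, 0)          # hypothetical: MECI #0 starts at 0x400410
hit = table.lookup(0x400410)    # 0    -> run the MECI on the CRFU
miss = table.lookup(0x400418)   # None -> execute normally on the base processor
```

In this model the program binary is left untouched, whereas the invoke-mtc1 method places an mtc1 instruction in the code itself.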
Supporting Conditional Execution • Selector-Mux
Implementation Results • VHDL & Hitachi 0.18 μm library • Results • Area: 1.7 mm² • #Configuration bits: 375 bits (control signals) + 240 bits (immediates) = 615 bits (~80 bytes) • Delay (multi-cycle): 2.3 ns, 4.2 ns, 6.1 ns, 8.0 ns, and 9.8 ns • ADEXOR clock freq: 130 MHz (7.7 ns) • Delays 1, 2, 3: one clock • Delays 4, 5: two clocks
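The clock counts on this slide follow from dividing each delay by the clock period and rounding up. The short check below uses only the numbers quoted above; the variable names are illustrative.

```python
import math

CLOCK_MHZ = 130
clock_ns = 1000 / CLOCK_MHZ               # about 7.7 ns per cycle

# The five multi-cycle delays quoted on the slide, in order.
delays_ns = [2.3, 4.2, 6.1, 8.0, 9.8]

# A delay longer than one clock period needs a second cycle.
clocks = [math.ceil(d / clock_ns) for d in delays_ns]

# Configuration size: control signals plus immediates.
config_bits = 375 + 240
config_bytes = math.ceil(config_bits / 8)

print(clocks)                     # one entry per delay
print(config_bits, config_bytes)
```

The first three delays fit in one cycle and the last two need two, and 615 bits round up to 77 bytes, which the slide quotes as roughly 80 bytes.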
Experimental Setup (1/2) Base Processor Configuration
Experimental Setup (2/2) • arch1 (4-read/2-write) • Clock freq: 135 MHz • Input > 4: +1 extra clock • 2 < Output < 5: +1 extra clock • Output > 4: +2 extra clocks • arch2 (8-read/4-write) • Clock freq: 130 MHz • Area overhead: ~5% • Output > 4: +1 extra clock
Area overhead (1/2) • VHDL & Hitachi 0.18 μm library • Base processor: 4.5 mm² • CRFU: 1.7 mm² • CACTI 4.2 (0.18 μm) • I-Cache & D-Cache (32 KB, 4-way): 2.25 mm² • Configuration Memory (SRAM, for 32 MECIs): 0.56 mm² • Sequencer (CAM, 32 entries): 0.092 mm² • Base processor (with caches) • Area: 9.0 mm²
Access Reduction • [Chart: access reduction for the seq and mtc1 invocation methods; axis ticks 15, 35, 55; data values not recoverable]
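The extra-clock rules for the two configurations can be written as a small function. This is a sketch under one explicit assumption: when both the input rule and an output rule apply on arch1, the penalties are added together, which the slide does not state. The function and argument names are hypothetical.

```python
def extra_clocks(arch, n_inputs, n_outputs):
    """Extra clock cycles charged to one custom instruction, per the slide's rules."""
    if arch == "arch1":                 # 4-read / 2-write
        extra = 1 if n_inputs > 4 else 0
        if n_outputs > 4:
            extra += 2                  # 4 < Output
        elif n_outputs > 2:
            extra += 1                  # 2 < Output < 5
        return extra                    # assumption: input and output penalties add up
    if arch == "arch2":                 # 8-read / 4-write
        return 1 if n_outputs > 4 else 0
    raise ValueError(f"unknown architecture: {arch}")


# Example: a CI with 5 inputs and 3 outputs.
print(extra_clocks("arch1", 5, 3))
print(extra_clocks("arch2", 5, 3))
```

Under this reading, the wider port configuration of arch2 removes most of the penalties at the cost of a slightly lower clock frequency (130 MHz vs. 135 MHz).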
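The component areas on this slide can be combined into a relative overhead. The percentages below are derived here, not quoted from the slides, and rest on two assumptions: the 2.25 mm² figure applies to each of the two caches (which reproduces the 9.0 mm² total), and the sequencer is only needed for the invoke-seq method.

```python
# Areas in mm^2, as quoted on the slide.
BASE_CORE = 4.5
CACHE_EACH = 2.25      # assumption: per cache, for both I-Cache and D-Cache
CRFU = 1.7
CONFIG_MEM = 0.56
SEQUENCER = 0.092

base_with_caches = BASE_CORE + 2 * CACHE_EACH

# Added hardware for each invocation method (assumed split).
added_mtc1 = CRFU + CONFIG_MEM
added_seq = CRFU + CONFIG_MEM + SEQUENCER

overhead_mtc1 = added_mtc1 / base_with_caches
overhead_seq = added_seq / base_with_caches

print(f"base with caches: {base_with_caches:.2f} mm^2")
print(f"mtc1: +{added_mtc1:.3f} mm^2 ({overhead_mtc1:.1%})")
print(f"seq:  +{added_seq:.3f} mm^2 ({overhead_seq:.1%})")
```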
Energy Consumption Evaluation • Base Processor • Base processor: 71.5 mW • 32 KB 4-way cache (CACTI): 0.294 nJ • Off-chip memory access: 25 nJ • ADEXOR • (Base Processor + CRFU): 229.7 mW • Configuration Memory (CACTI): 0.146 nJ • Sequencer (32-entry CAM): 0.184 nJ
MECIs vs. CIs • RFU • Area: 1.15 mm² • Delay: 7.66 ns • Power consumption: 206.855 mW • Configuration Memory • #Config. bits: 512 • Energy for each access: 0.116 nJ
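A plausible way to combine these parameters is core power times execution time plus a per-access energy for each memory structure. The slides do not give the formula, so the model below is an assumption, as are the reading of the nJ figures as per-access energies and the workload numbers in the example, which are purely hypothetical.

```python
def total_energy_nj(power_mw, exec_time_ns, access_counts, energy_per_access_nj):
    """Core energy (power x time) plus the sum of per-access energies, in nJ."""
    core_nj = power_mw * exec_time_ns / 1000.0   # mW x ns = pJ; divide by 1000 for nJ
    access_nj = sum(count * energy_per_access_nj[name]
                    for name, count in access_counts.items())
    return core_nj + access_nj


# Parameters quoted on the slide (assumed to be per access).
BASE_ENERGY_NJ = {"cache": 0.294, "offchip": 25.0}
ADEXOR_ENERGY_NJ = {"cache": 0.294, "offchip": 25.0,
                    "config_mem": 0.146, "sequencer": 0.184}

# Hypothetical workload: 1000 ns on the base processor,
# 10 cache accesses and 1 off-chip access.
base_nj = total_energy_nj(71.5, 1000.0, {"cache": 10, "offchip": 1}, BASE_ENERGY_NJ)
print(f"{base_nj:.2f} nJ")
```

The same function accepts the ADEXOR table together with the 229.7 mW figure; real execution times and access counts would have to come from simulation.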
Conclusions • Adaptive Extensible Processor • Multi-Exit Custom Instructions • A Coarse-Grained Reconfigurable Functional Unit with Conditional Execution • No new compiler or opcode, and no source code modification or recompilation • MECIs vs. CIs (on average) • 3% more energy saving • 8% more hardware • 20% more configuration bits • 26% higher speedup • The average energy reduction is 22% for arch1/mtc1, and the average speedup is 1.87 for arch2/sequencer
Future Work • Generating custom instructions targeting low energy consumption • Studying the effect of different nanometer-scale technologies • Supporting memory instructions
Acknowledgement • Institute of Systems, Information Technologies and Nanotechnologies, Fukuoka, Japan • System LSI Laboratory, Kyushu University • Core Research for Evolutional Science and Technology (CREST) of Japan Science and Technology Corporation (JST)
Tool Chain • [Figure not recoverable]
Phases of ADEXOR • Configuration Phase • Normal Phase • (figure label: Utilization phase)
Energy Consumption • Pros • Low activity of hardware components (I-Cache, Bpred, Decoder, Register File, Functional Unit) • Higher I-Cache hit rates • Reduced energy for off-chip accesses • Cons • RFU configuration • Accessing the config. memory • Setting control signals in the RFU • Increased complexity • Communication between the processor's data-path and the RFU