
Hamid Noori†, Farhad Mehdipour‡, Koji Inoue‡, and Kazuaki Murakami‡

Enhancing Energy Efficiency of Processor-Based Embedded Systems through Post-Fabrication ISA Extension. Hamid Noori†, Farhad Mehdipour‡, Koji Inoue‡, and Kazuaki Murakami‡. †Institute of Systems, Information Technologies and Nanotechnologies; ‡Kyushu University, Fukuoka, Japan.


Presentation Transcript


  1. Enhancing Energy Efficiency of Processor-Based Embedded Systems through Post-Fabrication ISA Extension • Hamid Noori†, Farhad Mehdipour‡, Koji Inoue‡, and Kazuaki Murakami‡ • †Institute of Systems, Information Technologies and Nanotechnologies • ‡Kyushu University, Fukuoka, Japan • August 2008, ISLPED @ Bangalore, India

  2. Outline • Introduction • General Overview of the Proposed Approach • Multi-Exits Custom Instructions • Microarchitecture of the CRFU • Evaluation Results • Conclusions and Future Work

  3. Outline • Introduction • General Overview of the Proposed Approach • Multi-Exits Custom Instructions • Microarchitecture of the CRFU • Evaluation Results • Conclusions and Future Work

  4. Introduction • Designing Embedded Systems • Embedded Microprocessors • Application Specific Integrated Circuits (ASICs) • Application Specific Instruction set Processors (ASIPs) & Extensible Processors

  5. Extensible Processors • Base processor (BP)'s fixed instruction set + Custom Instructions • Figure (CPU block diagram), labels: Instruction Dispatcher; Register File; +; &; x; LD/ST; CFU1; CFU2; AND; AND; OR; AND2_OR • LD/ST: Load / Store • CFU: Custom Functional Unit

  6. Motivations (1/2) • Exploding NRE costs (Keynote @ ASP-DAC 2007)

  7. Motivations (2/2) • This has led to the quest for a flexible and reusable embedded processor that still must achieve the required performance and energy efficiency levels. (Keynote @ DATE 2007)

  8. Introduction • Adaptive Extensible Processor [DATE’07] • Has a coarse-grained reconfigurable functional unit • Supports efficient “Multi-Exits CIs” • Achieves high performance at low cost • Question • How about energy efficiency? • Results: energy saving • vs. base processor: 22% • vs. single-basic-block-based CIs: 3%

  9. Outline • Introduction • General Overview of the Proposed Approach • Multi-Exits Custom Instructions • Microarchitecture of the CRFU • Evaluation Results • Conclusions and Future Work

  10. ADaptive EXtensible processOR (ADEXOR) • Generating and adding CIs AFTER chip fabrication (utilization phase) • Figure labels: Instruction Dispatcher; Config Mem; Register File; +; &; x; LD/ST; CFU1; CRFU

  11. Execution Overview of ADEXOR • Figure labels: Register File; Configuration Memory; "Indexed by mtc1 or sequencer"; RFU; ID/EXE Reg; CRFU; ALU; MUX; Counter; "Triggered by mtc1 or sequencer"; EXE/MEM Reg; GPP; Augmented HW • GPP: General Purpose Processor • RFU: Reconfigurable Functional Unit • Hot Basic Block:
      400680 subiu $25,$25,1
      400688 lbu $13,0($7)
      400690 lbu $2,0($4)
      400698 sll $2,$2,0x18
      4006a0 sra $14,$2,0x18
      4006a8 addiu $4,$4,1
      4006b0 srl $8,$2,0x1c
      4006b8 sll $2,$8,0x2
      4006c0 addu $2,$2,$25
      4006c8 lw $2,0($2)
      4006d0 xori $13,$13,1
      4006d8 addu $10,$10,$2
      4006e0 bgez $10,4006f0
      . . . .
      Instructions repeated inside the figure: 400680 subiu $25,$25,1; 400698 sll $2,$2,0x18; 4006a0 sra $14,$2,0x18; 400688 lbu $13,0($7)

  12. Outline • Introduction • General Overview of the Proposed Approach • Multi-Exits Custom Instructions • Microarchitecture of the CRFU • Evaluation Results • Conclusions and Future Work

  13. Why Multi-Exits Custom Instructions (MECIs)? • Conventional BB-based CI generation (single-entry, single-exit) • Benchmark: adpcm • #Required nodes: 4 • Assumption: at most 16 nodes can be included in one CI

  14. Why Multi-Exits Custom Instructions (MECIs)? • BB-based CI with conditional execution support (single-entry, single-exit) • Benchmark: adpcm • #Required nodes: 12 • Assumption: at most 16 nodes can be included in one CI

  15. Why Multi-Exits Custom Instructions (MECIs)? • BB-based CI with conditional execution support (single-entry, single-exit) • Benchmark: adpcm • #Required nodes: 21 (cannot be mapped) • Assumption: at most 16 nodes can be included in one CI

  16. Why Multi-Exits Custom Instructions (MECIs)? • BB-based CI with conditional execution support (single-entry, single-exit) • Benchmark: adpcm • #Required nodes: 16 • Assumption: at most 16 nodes can be included in one CI

  17. Why Multi-Exits Custom Instructions (MECIs)? • Multiple-exits custom instruction: conditional execution + hot-path selection • Benchmark: adpcm • #Required nodes: 16 • Figure labels: Exit, Exit • Assumption: at most 16 nodes can be included in one CI

  18. Main features of MECIs • Fixed point operations √ • Multiply x • Divide x • Control flow √ • Memory instructions x
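
The feature list on slide 18 can be read as an eligibility filter over instruction classes. The following is a minimal sketch, assuming MIPS-style mnemonics like those in the slide listings. The opcode sets and the name `meci_eligible` are illustrative assumptions; only the per-class yes/no support comes from the slide.

```python
# Hypothetical opcode classes. Only the class-level support flags
# (fixed point: yes, control flow: yes, multiply/divide/memory: no)
# are taken from slide 18; the set members are illustrative.
FIXED_POINT = {"addu", "addiu", "subu", "slt", "sll", "srl", "sra", "ori", "xori"}
CONTROL_FLOW = {"beq", "bne", "bgez"}
MULTIPLY = {"mult", "multu"}
DIVIDE = {"div", "divu"}
MEMORY = {"lw", "lbu", "sw", "sb"}


def meci_eligible(opcode: str) -> bool:
    """Return True if the opcode belongs to a class that an MECI may contain."""
    return opcode in FIXED_POINT or opcode in CONTROL_FLOW


# Filter a short instruction sequence down to the MECI-eligible opcodes.
listing = ["addu", "lw", "addiu", "subu", "bgez"]
eligible = [op for op in listing if meci_eligible(op)]
print(eligible)
```

Under these assumptions the load is the only instruction filtered out, which matches the slide's statement that memory instructions are not supported.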

  19. Custom Instruction Invocation • How to change the execution sequence and run custom instructions on the CRFU? • invoke-mtc1 method • invoke-seq method

  20. invoke-mtc1 method • Figure captions: Instruction scheduling; Code before generating MECI; Code after generating MECI • MECI graph labels: mtc1; nodes 0, 7, 2, 3, 5, 8, 10, 11, 12, 20, 19; branches beq, bgez, bne, bne; exit1, exit2, exit3, exit4 • Code listing in instruction-number order (inst. #, address, inst., operands: dest, src1, src2):
      0  400410 addu  R13 R0 R0
      1  400418 lw    R23 R2 100
      2  400420 addiu R4 R4 2
      3  400428 subu  R3 R2 R11
      4  400430 bgez  400440 R3
      5  400438 addiu R13 R0 8
      6  400440 beq   400468 R13
      7  400448 subu  R3 R0 R3
      8  400450 addu  R10 R0 R0
      9  400458 lw    R8 R9 0x3
      10 400460 slt   R2 R3 R9
      11 400468 addu  R8 R8 R9
      12 400470 ori   R10 R10 1
      13 400478 bne   4004a8 R2
      14 400480 addiu R10 R0 4
      15 400488 subu  R3 R3 R9
      16 400490 addu  R8 R8 R9
      17 400498 sra   R9 R9 0x1
      18 4004a0 slt   R2 R3 R9
      19 4004a8 ori   R10 R10 2
      20 4004b0 subu  R3 R3 R9
      21 4004b8 bne   400410 R2
      22 4004c0 slt   R2 R3 R9
      Rows that differ in the second listing: inst. 1 (lw) at 400410; inst. 0 (addu) at 400418, shown together with "mtc1 #CI"; inst. 10 (slt) at 400458; inst. 9 (lw) at 400460

  21. invoke-seq method • Figure labels: sequencer table (CAM); nodes 0, 7, 2, 3, 5, 8, 10, 11, 12, 20, 19; branches beq, bgez, bne, bne; exit1, exit2, exit3, exit4
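
The slides name a sequencer table implemented as a CAM but do not spell out its behavior. The following toy model is a sketch under a loud assumption: that the sequencer matches the current instruction address against stored MECI start addresses and, on a hit, yields an index into the configuration memory. The class name, method names, and the example mapping are hypothetical; the 32-entry capacity is the figure given on slide 29.

```python
from typing import Optional


class SequencerTable:
    """Toy content-addressable table: MECI start address -> configuration index."""

    def __init__(self, capacity: int = 32):  # 32 entries, per slide 29
        self.capacity = capacity
        self.entries = {}  # start address -> configuration memory index

    def add(self, start_pc: int, config_index: int) -> None:
        """Register an MECI start address; refuse new entries once the table is full."""
        if start_pc not in self.entries and len(self.entries) >= self.capacity:
            raise ValueError("sequencer table is full")
        self.entries[start_pc] = config_index

    def lookup(self, pc: int) -> Optional[int]:
        """Return the configuration index on a hit, or None on a miss."""
        return self.entries.get(pc)


table = SequencerTable()
table.add(0x400410, 0)           # address borrowed from the slide 20 listing; mapping is hypothetical
hit = table.lookup(0x400410)     # hit: run the MECI on the CRFU (assumed behavior)
miss = table.lookup(0x400420)    # miss: the base processor executes normally
print(hit, miss)
```

Compared with invoke-mtc1, a lookup of this kind would need no extra instruction in the code stream, at the cost of the CAM's area and per-access energy listed on slides 29 and 32.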

  22. Outline • Introduction • General Overview of the Proposed Approach • Multi-Exits Custom Instructions • Microarchitecture of the CRFU • Evaluation Results • Conclusions and Future Work

  23. Microarchitecture of the RFU

  24. Supporting Conditional Execution • Selector-Mux

  25. Implementation Results • VHDL & Hitachi 0.18 μm library • Results • Area: 1.7 mm² • #Configuration bits: 375 bits (control signals) + 240 bits (immediates) = 615 bits (~80 bytes) • Delay (multi-cycle): 2.3 ns, 4.2 ns, 6.1 ns, 8.0 ns, and 9.8 ns • ADEXOR clock freq: 130 MHz (7.7 ns) • 1, 2, 3 (one clock) • 4, 5 (two clocks)
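
The clock counts on slide 25 follow from dividing each delay by the 7.7 ns clock period and rounding up. A short check of that arithmetic, and of the configuration size in bytes, using only the numbers on the slide (the variable names are illustrative):

```python
import math

CLOCK_PERIOD_NS = 7.7                    # 130 MHz, from slide 25
DELAYS_NS = [2.3, 4.2, 6.1, 8.0, 9.8]    # the five multi-cycle delays on slide 25

# Clocks needed for each delay: round the delay up to whole clock periods.
cycles = [math.ceil(delay / CLOCK_PERIOD_NS) for delay in DELAYS_NS]
print(cycles)  # [1, 1, 1, 2, 2]

# Configuration size: control-signal bits plus immediate bits.
config_bits = 375 + 240
config_bytes = math.ceil(config_bits / 8)
print(config_bits, config_bytes)  # 615 77
```

The first three delays fit in one clock and the last two need two, and 615 bits round up to 77 bytes, consistent with the slide's "~80 bytes".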

  26. Outline • Introduction • General Overview of the Proposed Approach • Multi-Exits Custom Instructions • Microarchitecture of the CRFU • Evaluation Results • Conclusions and Future Work

  27. Experimental Setup (1/2) Base Processor Configuration

  28. Experimental Setup (2/2) • arch1 (4-read/2-write) • Clock freq: 135 MHz • Input > 4 → +1 extra clock • 2 < Output < 5 → +1 extra clock • Output > 4 → +2 extra clocks • arch2 (8-read/4-write) • Clock freq: 130 MHz • Area overhead: ~5% • Output > 4 → +1 extra clock
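
The extra-clock rules on slide 28 can be written as a small function. This is a sketch, not the authors' simulator: the function name is hypothetical, and it assumes that the input and output penalties of arch1 add up when both apply, which the slide does not state.

```python
def extra_clocks(arch: str, n_inputs: int, n_outputs: int) -> int:
    """Extra clocks for a custom instruction with the given operand counts."""
    if arch == "arch1":                       # 4-read/2-write
        extra = 1 if n_inputs > 4 else 0      # Input > 4 -> +1
        if n_outputs > 4:                     # Output > 4 -> +2
            extra += 2
        elif n_outputs > 2:                   # 2 < Output < 5 -> +1
            extra += 1
        return extra                          # assumption: penalties are additive
    if arch == "arch2":                       # 8-read/4-write
        return 1 if n_outputs > 4 else 0      # Output > 4 -> +1
    raise ValueError(f"unknown architecture: {arch}")


# A custom instruction with 5 inputs and 3 outputs on each architecture.
print(extra_clocks("arch1", 5, 3), extra_clocks("arch2", 5, 3))
```

The wider arch2 configuration removes most penalties, which is the trade-off against its lower clock frequency and ~5% area overhead.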

  29. Area overhead (1/2) • VHDL & Hitachi 0.18 μm library • Base processor: 4.5 mm² • CRFU: 1.7 mm² • CACTI 4.2 (0.18 μm) • I-Cache & D-Cache (32KB 4-way): 2.25 mm² • Configuration Memory (SRAM, for 32 MECIs): 0.56 mm² • Sequencer (CAM, 32 entries): 0.092 mm² • Base Processor (with caches) • Area: 9.0 mm²
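
The 9.0 mm² total on slide 29 is the base core plus the two caches. The sketch below checks that sum and adds up the augmented hardware for each invocation method. Two assumptions are made here and are not stated on the slide: that 2.25 mm² is the area of each cache, and that only invoke-seq needs the sequencer. Register-file port changes are not included, so these sums are not the overhead figures of slide 30.

```python
# Areas in mm^2, all taken from slide 29.
BASE_CORE = 4.5
CACHE_EACH = 2.25      # assumption: per cache (I-Cache and D-Cache)
CRFU = 1.7
CONFIG_MEMORY = 0.56
SEQUENCER = 0.092

base_with_caches = BASE_CORE + 2 * CACHE_EACH
print(base_with_caches)  # 9.0, matching the slide

# Augmented hardware per invocation method (assumption: the sequencer
# is only present for invoke-seq).
added_mtc1 = CRFU + CONFIG_MEMORY
added_seq = CRFU + CONFIG_MEMORY + SEQUENCER
print(round(added_mtc1, 3), round(added_seq, 3))
```

That the per-cache reading reproduces the slide's 9.0 mm² exactly is the reason for choosing it.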

  30. Area overhead (2/2)

  31. Access Reduction • Figure residue: axis ticks 55, 35, 15; series seq, mtc1

  32. Energy Consumption Evaluation • Base Processor • Base processor: 71.5 mW • 32KB 4-way cache (CACTI): 0.294 nJ per access • Off-chip memory access: 25 nJ • ADEXOR • (Base Processor + CRFU): 229.7 mW • Configuration Memory (CACTI): 0.146 nJ per access • Sequencer (32-entry CAM): 0.184 nJ per access
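
Slide 32 gives a power figure for the core and per-access energies for the memories. The transcript does not show how the authors combine them, so the following is only a plausible sketch: core energy as power times execution time, plus access counts times per-access energy. The function name, the cycle count, the clock frequency used for the base processor, and the access counts are all hypothetical.

```python
def energy_nj(power_mw, cycles, clock_mhz, accesses):
    """Total energy in nJ.

    accesses is an iterable of (count, energy_per_access_nj) pairs.
    """
    time_ns = cycles * 1000.0 / clock_mhz    # one clock lasts 1000 / MHz ns
    core_nj = power_mw * time_ns / 1000.0    # mW * ns = pJ, so divide by 1000 for nJ
    return core_nj + sum(count * per_access for count, per_access in accesses)


CACHE_NJ = 0.294      # per cache access, slide 32
OFFCHIP_NJ = 25.0     # per off-chip access, slide 32

# Hypothetical workload: 1000 cycles at 130 MHz, 1200 cache accesses, 3 off-chip accesses.
base = energy_nj(71.5, 1000, 130.0, [(1200, CACHE_NJ), (3, OFFCHIP_NJ)])
print(round(base, 1))
```

A model of this shape makes the trade-off visible: ADEXOR draws more power while running, so any saving has to come from fewer cycles and fewer cache and off-chip accesses.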

  33. Energy Consumption Breakdown for arch1/invoke-mtc1

  34. Energy Consumption Breakdown for arch2/invoke-seq

  35. Energy Saving/Speedup/Area Overhead

  36. MECIs vs. CIs • RFU • Area: 1.15 mm² • Delay: 7.66 ns • Power consumption: 206.855 mW • Configuration Memory • #Config. bits: 512 • Energy for each access: 0.116 nJ

  37. Energy Saving

  38. MECIs vs. CIs

  39. Outline • Introduction • General Overview of the Proposed Approach • Multi-Exits Custom Instructions • Microarchitecture of the CRFU • Evaluation Results • Conclusions and Future Work

  40. Conclusions • Adaptive Extensible Processor • Multi-Exit Custom Instructions • A Coarse-Grain Reconfigurable Functional Unit with Conditional Execution • No new compiler, opcode, or source code modification and recompilation • MECIs vs. CIs (on average) • 3% more energy saving • 8% more hardware • 20% more configuration bits • 26% higher speedup • The average energy reduction is 22% for arch1/mtc1, and the average speedup is 1.87 for arch2/sequencer

  41. Future Work • Generating custom instructions targeting low energy consumption • Studying the effect of different nanometer-scale technologies • Supporting memory instructions

  42. Acknowledgement • Institute of Systems, Information Technologies and Nanotechnologies, Fukuoka, Japan • System LSI Laboratory, Kyushu University • Core Research for Evolutional Science and Technology (CREST) of Japan Science and Technology Corporation (JST)

  43. Thank you very much for your kind attention

  44. Backup Slides

  45. Tool Chain

  46. Phases of ADEXOR • Configuration Phase • Normal Phase • Figure label: Utilization phase

  47. Energy Consumption • Pros • Low activity of hardware components: I-Cache, Bpred, Decoder, Register File, Functional Unit • Higher I-Cache hit rates • Reduce the energy for off-chip accesses • Cons • RFU configuration • Accessing the config. memory • Setting control signals in the RFU • Increased complexity • Communication between the processor’s data-path and the RFU

  48. Energy Saving/Speedup/Area Overhead

  49. MECIs vs. CIs
