ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions
Koji Inoue†, Hamid Noori‡, Farhad Mehdipour†, Takaaki Hanada†, and Kazuaki Murakami†
†Department of Advanced Information Technology, Kyushu University, Fukuoka, Japan
‡School of Electrical and Computer Engineering, University of Tehran
Outline • Introduction • ADEXOR: Adaptive Extensible Processor • Overview • Microarchitecture • Coarse-grained Reconfigurable Functional Unit • Evaluation • Conclusions
Motivation and Solution • Embedded processors have to achieve • Low cost • High performance • Low power or low energy consumption • Key point • How can processors adapt to target applications? • Solution: ASIP w/ Re-configurability • Application specific ISA • Provide custom instructions (CIs) • Implement re-configurable FUs
ADaptive EXtensible processOR (ADEXOR) • Has a coarse-grained re-configurable functional unit • Supports efficient “Multi-Exits CIs” • Achieves high performance and low energy
[Figure: Hot Basic Block. Listing: 400680 subiu $25,$25,1 / 400688 lbu $13,0($7) / 400690 lbu $2,0($4) / 400698 sll $2,$2,0x18 / 4006a0 sra $14,$2,0x18 / 4006a8 addiu $4,$4,1 / 4006b0 srl $8,$2,0x1c / 4006b8 sll $2,$8,0x2 / 4006c0 addu $2,$2,$25 / 4006c8 bgez $10,4006f0 / 4006d0 xori $13,$13,1 / 4006d8 addu $10,$10,$2. A second, partial listing follows: 400680 subiu $25,$25,1 / 400698 sll $2,$2,0x18 / 4006a0 sra $14,$2,0x18 / 400688 lbu $13,0($7) / 4006e0 bgez $10,4006f0 / ...]
[Figure: block diagram. GPP: Register File, ID/EXE Reg, ALU, MUX, EXE/MEM Reg. Augmented HW: Configuration Memory (indexed by mtc1 or sequencer), CRFU, Counter (triggered by mtc1 or sequencer).]
GPP: General Purpose Processor • CRFU: Coarse-grained Reconfigurable Functional Unit
CRFU Microarchitecture • 16 FUs controlled by configuration bits • MUX-based interconnection between FUs • Early-stage data can be transferred to output ports
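The three bullets above can be mimicked in software. What follows is a minimal Python sketch, not the authors' design: the operation set in `OPS`, the configuration layout (`fus`, `outs`), and the name `run_crfu` are assumptions made for illustration. Only the 16-FU limit, the MUX-style source selection, and the routing of an early FU result to an output port come from the slide.

```python
import operator

MAX_FUS = 16  # from the slide: 16 FUs controlled by configuration bits

# Hypothetical operation set; the slides do not list the FU opcodes.
OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "and": operator.and_,
    "or": operator.or_,
    "xor": operator.xor,
    "sll": lambda a, b: a << (b & 31),
}


def run_crfu(config, inputs):
    """Evaluate one configuration of a toy ALU array.

    config["fus"]  : list of (op, src_a, src_b). Each src is ("in", i) for
                     input port i or ("fu", j) for the result of an earlier
                     FU j, which stands in for the MUX-based interconnect.
    config["outs"] : FU indices routed to output ports. Any FU may be
                     selected, including an early-stage one.
    """
    if len(config["fus"]) > MAX_FUS:
        raise ValueError("configuration uses more than 16 FUs")

    results = []

    def pick(src):
        kind, idx = src
        if kind == "in":
            return inputs[idx]
        if idx >= len(results):
            raise ValueError("an FU may only read an earlier FU")
        return results[idx]

    for op, a, b in config["fus"]:
        # Keep every result within 32 bits, as a 32-bit datapath would.
        results.append(OPS[op](pick(a), pick(b)) & 0xFFFFFFFF)
    return [results[i] for i in config["outs"]]


# Two FUs: FU0 = in0 + in1, FU1 = FU0 xor in2. Both reach an output port,
# so the early-stage value (FU0) is visible alongside the final one.
cfg = {
    "fus": [("add", ("in", 0), ("in", 1)), ("xor", ("fu", 0), ("in", 2))],
    "outs": [0, 1],
}
print(run_crfu(cfg, [3, 4, 1]))  # [7, 6]
```

In this sketch a "configuration" is just data, which mirrors the idea that the same array executes different custom instructions depending on the bits loaded from the configuration memory.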
Supporting Multi-Exits Custom Instructions (MECIs) • Multiple-Exits Custom Instruction: Conditional Execution + Hot-Path Selection • Assume at most 16 nodes can be included in one CI • [Figure: adpcm example with two exits; #Required nodes: 16]
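To make the idea of a custom instruction with several exits concrete, here is a small Python sketch. It is only an illustration under stated assumptions: the node encoding (`"op"` and `"exit"` tuples), the function `run_meci`, the register names, and the choice of which branch direction is "hot" are all hypothetical. The instruction fragment is loosely based on the hot basic block shown earlier (the `bgez $10,4006f0` branch), and the sketch runs nodes one after another, whereas the hardware evaluates them in the ALU array.

```python
def run_meci(nodes, regs, fallthrough_pc):
    """Run a toy multi-exit custom instruction.

    nodes: list of
        ("op", dest, fn)            compute regs[dest] = fn(regs)
        ("exit", cond, target_pc)   leave the CI if cond(regs) is true
    Returns (updated register dict, PC at which normal execution resumes).
    """
    if len(nodes) > 16:
        raise ValueError("at most 16 nodes per CI (assumption from the slide)")
    regs = dict(regs)  # do not modify the caller's registers
    for node in nodes:
        if node[0] == "op":
            _, dest, fn = node
            regs[dest] = fn(regs)
        else:
            _, cond, target_pc = node
            if cond(regs):
                # The branch leaves the hot path: take this exit early.
                return regs, target_pc
    return regs, fallthrough_pc


# Hot path assumed here: the bgez branch is NOT taken.
ci = [
    ("op", "r25", lambda r: r["r25"] - 1),      # subiu $25,$25,1
    ("exit", lambda r: r["r10"] >= 0, 0x4006F0),  # bgez  $10,4006f0
    ("op", "r13", lambda r: r["r13"] ^ 1),      # xori  $13,$13,1
    ("op", "r10", lambda r: r["r10"] + r["r2"]),  # addu  $10,$10,$2
]

hot, pc_hot = run_meci(ci, {"r25": 10, "r10": -3, "r13": 0, "r2": 8}, 0x4006E0)
cold, pc_cold = run_meci(ci, {"r25": 10, "r10": 5, "r13": 0, "r2": 8}, 0x4006E0)
print(hex(pc_hot), hot["r13"], hot["r10"])     # 0x4006e0 1 5
print(hex(pc_cold), cold["r13"], cold["r10"])  # 0x4006f0 0 5
```

The second call shows the point of multiple exits: work done before the branch (the `subiu`) is kept, while the nodes after the exit have no effect.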
Experimental Setup (1/2) Base Processor Configuration
Experimental Setup (2/2)
• arch1: (4-read/2-write) • Clock freq: 135MHz • RF read/write access • Input: 5, 6, 7, or 8: +1 extra cycle • Output: 3 or 4: +1 extra cycle • Output: 5 or 6: +2 extra cycles • CRFU execution • arch-1-var: variable (1 or 2 cycles) • arch-1-fix: 2 cycles
• arch2: (8-read/4-write) • Clock freq: 130MHz • RF read/write access • Input: no extra cycle • Output: 5 or 6: +1 extra cycle • CRFU execution • arch-2-var: variable (1 or 2 cycles) • arch-2-fix: 2 cycles
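The extra-cycle rules on this slide amount to a small latency model, sketched below in Python. The function names `ci_cycles` and `ci_time_ns` are hypothetical, and two things are assumed rather than stated in the slides: that the extra register-file cycles simply add to the CRFU execution cycles, and that a CI has at most 8 inputs and 6 outputs.

```python
CLOCK_MHZ = {"arch1": 135, "arch2": 130}  # clock frequencies from the slide


def ci_cycles(arch, n_inputs, n_outputs, exec_cycles):
    """Cycles for one custom instruction under the slide's RF access rules.

    exec_cycles is the CRFU execution time: 1 or 2 for the "var" variants,
    always 2 for the "fix" variants.
    """
    if not (0 <= n_inputs <= 8 and 0 <= n_outputs <= 6):
        raise ValueError("input/output count not covered by the slide")
    if exec_cycles not in (1, 2):
        raise ValueError("CRFU execution takes 1 or 2 cycles")

    if arch == "arch1":  # 4-read/2-write
        extra_in = 1 if n_inputs >= 5 else 0
        if n_outputs >= 5:
            extra_out = 2
        elif n_outputs >= 3:
            extra_out = 1
        else:
            extra_out = 0
    elif arch == "arch2":  # 8-read/4-write
        extra_in = 0
        extra_out = 1 if n_outputs >= 5 else 0
    else:
        raise ValueError("unknown architecture: " + arch)

    # Assumption: RF access cycles and execution cycles do not overlap.
    return exec_cycles + extra_in + extra_out


def ci_time_ns(arch, n_inputs, n_outputs, exec_cycles):
    """Wall-clock time of one CI in nanoseconds at the slide's clock rate."""
    return ci_cycles(arch, n_inputs, n_outputs, exec_cycles) * 1000 / CLOCK_MHZ[arch]


# A CI with 6 inputs and 4 outputs, executed in 2 CRFU cycles:
print(ci_cycles("arch1", 6, 4, 2))  # 4
print(ci_cycles("arch2", 6, 4, 2))  # 2
```

The example illustrates the trade-off between the two setups: arch2 avoids the extra register-file cycles at the cost of a slightly lower clock frequency.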
Energy Consumption
Pros. • Low activity of hardware components • I-Cache, Bpred • Decoder • Register File • Functional Unit • Higher I-Cache hit rates • Reduce the energy for off-chip accesses
Cons. • RFU configuration • Accessing the config. memory • Setting control signals in the RFU • Increased complexity • Communication between the processor’s data-path and the RFU
Temperature Analysis • [Figure: CRFU floor plan showing the 16 FUs (1.7 x 1.7 mm2)]
Conclusions • ADEXOR: Adaptive Extensible Processor • Has a coarse-grained reconfigurable functional unit • Supports multi-exit custom instructions • Performance / Energy Analysis • 5x speedup (best case) • 60% energy reduction (best case) • Future Work • Extend for 3D-IC implementation
Acknowledgement • This research was supported in part by • New Energy and Industrial Technology Development Organization • The chip fabrication program of VLSI Design and Education Center (VDEC), the University of Tokyo in collaboration with Hitachi Ltd. and Dai Nippon Printing Corporation.
Area overhead (1/2) • VHDL & Hitachi 0.18μm library • Base processor: 4.5 mm2 • CRFU: 1.7 mm2 • CACTI 4.2 (0.18μm) • I-Cache & D-Cache (32KB 4-way): 2.25 mm2 • Configuration Memory (SRAM, for 32 MECIs): 0.56 mm2 • Sequencer (CAM, 32 entries): 0.092 mm2 • Base Processor (with caches) • Area: 9.0 mm2
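The figures on this slide can be combined into a rough overhead estimate. The short Python calculation below is a sketch, not a number reported by the authors. It assumes that the 2.25 mm2 cache figure applies to each of the I-Cache and D-Cache (which is consistent with the stated 9.0 mm2 total) and that the added hardware consists of the CRFU, the configuration memory, and the sequencer.

```python
# Areas in mm2, taken from the slide.
BASE_CORE_MM2 = 4.5
CACHE_MM2 = 2.25        # assumed to be per cache (I-Cache and D-Cache)
CRFU_MM2 = 1.7
CONFIG_MEM_MM2 = 0.56   # SRAM for 32 MECIs
SEQUENCER_MM2 = 0.092   # 32-entry CAM

# Base processor including both caches; matches the slide's 9.0 mm2.
base_with_caches = BASE_CORE_MM2 + 2 * CACHE_MM2

# Hardware added by the accelerator (assumed to be these three blocks).
added = CRFU_MM2 + CONFIG_MEM_MM2 + SEQUENCER_MM2

overhead = added / base_with_caches

print(f"base with caches: {base_with_caches:.2f} mm2")  # 9.00 mm2
print(f"added hardware:   {added:.3f} mm2")             # 2.352 mm2
print(f"relative overhead: {100 * overhead:.1f}%")      # 26.1%
```

Since this is slide 1 of 2 on area, the authors' own overhead figure may be computed differently; the calculation only shows how the listed areas relate to one another.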
Access Reduction • [Figure: chart with series seq and mtc1; axis ticks 15, 35, 55]