A Reconfigurable Functional Unit for Adaptable Custom Instructions. H. Noori 1 , F. Mehdipour 2 , K. Murakami 1 , K. Inoue 1 and M. Saheb Zamani 2. 1 Kyushu University 2 Amirkabir University of Technology. Agenda. Research goal General overview of the architecture Modes of operation
A Reconfigurable Functional Unit for Adaptable Custom Instructions H. Noori¹, F. Mehdipour², K. Murakami¹, K. Inoue¹ and M. Saheb Zamani² ¹Kyushu University ²Amirkabir University of Technology
Agenda • Research goal • General overview of the architecture • Modes of operation • Profiler • Reconfigurable Functional Unit (RFU) • Sequencer • RFU Architecture: A Quantitative Approach • Tool Chain • Generating Custom Instructions • Mapping Custom Instructions • Integrating RFU with base processor • Configuration Memory • Performance Evaluation • Conclusions • Future work
Some definitions • Hot Basic Block (HBB) • A basic block whose execution frequency is greater than a threshold specified in the profiler • Custom Instructions (CIs) • Extensions to the Instruction Set Architecture (ISA) that are executed on the RFU • Reconfigurable Functional Unit (RFU) • Custom hardware for executing CIs • Training mode • Operation mode for detecting HBBs and generating CIs • Normal mode • Normal operation mode where CIs are executed on the RFU
Research Goal • Proposal of an Adaptive Dynamic Extensible Processor for Embedded Systems • Custom instructions are adaptable to the applications • Custom instructions are detected and created during execution/training • Generation of custom instructions is done transparently and automatically • Advantages of the novel approach • Higher performance than GPPs • Higher flexibility compared to Extensible Processors • Lower design and verification cost and shorter design and verification time compared to ASIPs and Extensible Processors
General overview of the architecture
[Figure: Adaptive Dynamic Extensible Processor]
• Base Processor: N-way in-order general RISC (Fetch, Decode, Execute, Memory, Write stages; Reg File)
• Augmented Hardware
• Profiler: detects start addresses of Hot Basic Blocks (HBBs)
• RFU: executes Custom Instructions
• Sequencer: switches between main processor and RFU
General overview of the architecture • Modes of operation • Training mode • Profiling • Detecting start addresses of Hot Basic Blocks (HBBs) • Generating Custom Instructions • Generating Configuration Data for the RFU • Binary rewriting • Initializing the Sequencer Table ♦ Online • Needs simple hardware for profiling • All tasks are run on the base processor ♦ Offline • Needs a PC trace after taken branches/jumps • Normal mode • Profiling (optional) • Executing Custom Instructions on the RFU and the other parts of the code on the base processor
Operation modes
[Figure: three panels (Training Mode, Training Mode, Normal Mode), each showing Applications, Processor, Profiler, ACC and Sequencer]
• Training Mode: Binary-Level Profiling; Detecting Start Address of HBBs
• Training Mode: Running Tools for Generating Custom Instructions, Generating Configuration Data for ACC and Initializing Sequencer Table; Binary Rewriting
• Normal Mode: Sequencer monitors PC and switches between main processor and ACC; Executing CIs
Profiler
[Figure: flowchart of the profiler and the Profiler Table]
• The current PC is compared with the previous PC
• If the difference is not greater than the instruction length, nothing is done
• If it is greater (a taken branch or jump), we look at the BBSA to see whether the target PC is in the table
• Hit: increment the counter of that entry
• Miss: add the address as a new entry and set the counter to one
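The profiler flow above can be sketched in a few lines of Python. This is only an illustrative model, not the hardware design: the class name, the `threshold` parameter and the 8-byte instruction length (read off the address stride in the listings of this deck) are assumptions, and the sketch treats any non-sequential PC change as a taken branch or jump.

```python
INSTRUCTION_LENGTH = 8  # bytes; assumed from the 8-byte address stride in the listings


class Profiler:
    """Hypothetical software model of the profiler table."""

    def __init__(self, threshold):
        self.threshold = threshold  # HBB threshold (assumed to be configurable)
        self.table = {}             # target PC -> execution counter
        self.previous_pc = None

    def observe(self, current_pc):
        """Feed one PC value; update the table after a taken branch or jump."""
        previous_pc, self.previous_pc = self.previous_pc, current_pc
        if previous_pc is None:
            return  # nothing to compare against yet
        if current_pc - previous_pc == INSTRUCTION_LENGTH:
            return  # sequential execution: do nothing
        # Miss: new entry with counter 1. Hit: increment the counter.
        self.table[current_pc] = self.table.get(current_pc, 0) + 1

    def hot_blocks(self):
        """Start addresses whose counter is greater than the threshold."""
        return [pc for pc, count in self.table.items() if count > self.threshold]
```

Feeding the model a PC trace that loops back to the same address several times makes that address show up in `hot_blocks()` once its counter passes the threshold.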
Reconfigurable Functional Unit (RFU) • RFU is a matrix of Functional Units (FUs) • RFU is a multi-cycle FU with a variable delay that depends on the size of the Custom Instruction • RFU has a two-level configuration memory • A multi-context memory (keeps two or four configurations) • A cache • FUs support only logical operations, add/subtract, shifts and compare • RFU updates the PC
Sequencer • The sequencer mainly determines the microcode execution sequence. • Selects between decoder and config memory for reading the RF • Selects between the output of the Functional Unit and the Accelerator (ACC) • Determines when to switch between the different contexts of the multi-context memory • Determines when to load configuration data from the cache to the multi-context memory • Checks the configuration data of the custom instruction • If it is in the multi-context memory, the custom instruction is executed on the ACC • If it is not in the multi-context memory • If there is enough time to load it from the cache to the multi-context memory, the sequencer loads it and the CI is executed on the ACC • If there is not enough time, the original code is executed
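The configuration check at the end of the sequencer slide amounts to a small decision procedure. The sketch below is a hedged model of it: the function name, its parameters and the way "enough time" is expressed as a cycle comparison are all assumptions made for illustration. The slides also do not say what happens when the configuration is not in the cache either; the sketch falls back to the original code in that case.

```python
def sequencer_decision(ci_id, contexts, cache, cycles_until_ci, load_latency):
    """Decide how a custom instruction (CI) is handled.

    ci_id           -- identifier of the CI about to be reached (hypothetical)
    contexts        -- set of CI ids currently held in the multi-context memory
    cache           -- set of CI ids held in the configuration cache
    cycles_until_ci -- cycles left before the CI must start (assumed to be known)
    load_latency    -- cycles needed to copy a configuration from cache to a context
    """
    if ci_id in contexts:
        return "run_on_acc"
    if ci_id in cache and cycles_until_ci >= load_latency:
        return "load_then_run_on_acc"
    # Not enough time (or, by assumption, not cached): execute the original code.
    return "run_original_code"
```

Keeping the original code as the fallback is what lets the RFU miss a configuration without affecting correctness, since the rewritten binary still contains the instructions the CI replaces.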
Generation of Custom Instructions • Custom instructions • Exclude floating-point, multiply, divide and load instructions • Include at most one STORE, at most one BRANCH/JUMP and all other fixed-point instructions • Simple algorithm for generating custom instructions • HBBs usually include 10~40 instructions for MiBench • The custom instruction generator is executed on the base processor (in online training mode)
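The constraints on custom instructions listed above can be written as a simple filter over MIPS mnemonics. This is a minimal sketch, not the authors' tool: the opcode sets are a small hand-picked subset, and treating `mfc1`/`mtc1` and any mnemonic containing a dot (such as `mov.d`) as floating-point is an assumption.

```python
LOADS = {"lw", "lh", "lhu", "lb", "lbu"}
STORES = {"sw", "sh", "sb"}
BRANCHES = {"beq", "bne", "blez", "bgtz", "bltz", "bgez", "j", "jal", "jr", "jalr"}
MUL_DIV = {"mult", "multu", "div", "divu"}
FP_MOVES = {"mfc1", "mtc1"}  # assumed to count as floating-point instructions


def is_supported(op):
    """True if a single instruction may appear inside a custom instruction."""
    if op in LOADS or op in MUL_DIV or op in FP_MOVES:
        return False
    if "." in op:  # e.g. mov.d, add.s: floating-point formats (assumption)
        return False
    return True


def is_valid_ci(ops):
    """True if the opcode sequence satisfies the constraints on the slide."""
    if not all(is_supported(op) for op in ops):
        return False
    stores = sum(op in STORES for op in ops)
    branches = sum(op in BRANCHES for op in ops)
    return stores <= 1 and branches <= 1
```

For example, `is_valid_ci(["srl", "andi", "sltiu", "addu"])` accepts a run of fixed-point instructions, while any sequence containing `lw` or two stores is rejected.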
Generating Custom Instructions
4052c0 addiu $29,$29,-32
4052c8 mov.d $f0,$f12
4052d0 sw $18,24($29)
4052d8 addu $18,$0,$6
4052e0 sw $31,28($29)
4052e8 sw $16,16($29)
4052f0 mfc1 $16,$f0
4052f8 mfc1 $17,$f1
405300 srl $6,$17,0x14
405308 andi $6,$6,2047
405310 sltiu $2,$6,2047
405318 addu $6,$6,$18
405320 sltiu $2,$6,2047
405328 lui $2,32783
405330 and $17,$17,$2
405338 andi $2,$6,2047
405340 sll $2,$2,0x14
405348 or $17,$17,$2
405350 mtc1 $16,$f0
405358 mtc1 $17,$f1
405360 lw $31,28($29)
405370 lw $16,16($29)
405378 addiu $29,$29,32
405380 jr $31
• Finding the biggest sequence of instructions in the HBB that can be executed on the ACC
• Moving the instructions and appending supportable instructions to the head of the detected instruction sequence after checking flow-dependency and anti-dependency
• Moving the instructions and appending supportable instructions to the tail of the detected instruction sequence after checking flow-dependency and anti-dependency
• Rewriting the object code if instructions have been moved
• Moving instructions should not modify the logic of the application
• Custom instruction generation is done without considering any other constraints
Generating Custom Instructions
[Figure: an HBB divided into blocks of supported instructions (B1, B3, B5) and not-supported instructions (B2, B4), illustrating the exchange of B1 and B2]
• Block 3 (B3) is selected as the biggest instruction sequence that can be executed on the ACC
• Block 2 (B2) cannot be executed on the ACC
• Block 1 (B1) can be executed on the ACC
• If there is no flow dependency or anti-dependency between B1 and B2, exchange them
• The same is done for B3, B4 and B5
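The two core steps of this procedure, picking the biggest run of supported instructions and checking whether two adjacent blocks may be exchanged, can be sketched as follows. The `Instr` record and both function names are hypothetical. The check covers only the flow and anti-dependencies named on the slides; a complete implementation would presumably also have to consider output dependencies and memory ordering.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    writes: frozenset  # registers written by the instruction
    reads: frozenset   # registers read by the instruction


def longest_supported_run(flags):
    """Return (start, end) of the longest run of True values, end exclusive."""
    best, start = (0, 0), None
    for i, flag in enumerate(list(flags) + [False]):  # sentinel closes a final run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


def can_exchange(first, second):
    """True if block `first` (earlier) and block `second` (later) can be swapped."""
    writes1 = frozenset().union(*(i.writes for i in first))
    reads1 = frozenset().union(*(i.reads for i in first))
    writes2 = frozenset().union(*(i.writes for i in second))
    reads2 = frozenset().union(*(i.reads for i in second))
    flow = writes1 & reads2  # second reads what first writes
    anti = reads1 & writes2  # second overwrites what first still reads
    return not flow and not anti
```

With the first listing of this section, `addiu $29,$29,-32` and `mov.d $f0,$f12` touch disjoint registers and could be exchanged, whereas `addu $18,$0,$6` followed by a store of `$18` could not.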
Example 2 (rewriting obj code)
400d10: addiu $29,$29,-8
400d18: addu $8,$0,$4
400d20: addu $7,$0,$0
400d28: lui $9,49152
400d30: sll $4,$4,0x2
400d38: and $2,$8,$9
400d40: srl $2,$2,0x1e
400d48: lw $22,0($29)
400d50: addu $4,$4,$2
400d58: sll $8,$8,0x2
400d60: sll $6,$3,0x1
400d68: sll $3,$3,0x2
400d70: sltu $2,$4,$3
400d78: bne $2,$0,400db8 <usqrt+0xa8>
Example 1
[Figure labels: Customized Instruction 1, Customized Instruction 2]
400d10: addiu $29,$29,-8
400d18: addu $8,$0,$4
400d20: sw $0,0($29)
400d28: addu $4,$0,$0
400d30: addu $7,$0,$0
400d38: lui $9,49152
400d40: sll $4,$4,0x2
400d48: and $2,$8,$9
400d50: srl $29,$2,0x1e
400d58: lw $3,0($29)
400d60: addu $4,$4,$3
400d68: sll $8,$8,0x2
400d70: sll $6,$3,0x1
400d78: sll $3,$3,0x2
400d80: addiu $3,$3,1
400d88: sltu $2,$4,$3
400d90: sw $6,0($29)
400d98: bne $2,$0,400db8 <usqrt+0xa8>
RFU Architecture: A Quantitative Approach • 22 programs of MiBench were chosen • The SimpleScalar toolset was used for simulation • RFU is a matrix of FUs • Number of inputs • Number of outputs • Number of FUs • Connections • Location of inputs and outputs • Some definitions: • Frequency and weight are both considered in the measurements • CI Execution Frequency • Weight (equal to the number of executed instructions) • Average = Σ (Freq × Weight) over all CIs • Rejection: percentage of CIs that could not be mapped on the RFU • Coverage: percentage of CIs that could be mapped on the RFU
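One plausible reading of these definitions is that coverage and rejection are computed over CIs weighted by Freq × Weight rather than by a plain count. The sketch below follows that reading; it is an assumption, since the slides do not spell out the formula, and the function name and tuple layout are made up for illustration.

```python
def coverage_and_rejection(cis):
    """Weighted coverage and rejection percentages.

    cis -- iterable of (freq, weight, mapped) tuples, where freq is the CI
           execution frequency, weight the number of instructions it covers,
           and mapped says whether the CI fits on the RFU (assumed layout).
    """
    cis = list(cis)
    total = sum(freq * weight for freq, weight, _ in cis)
    if total == 0:
        raise ValueError("no executed custom instructions")
    mapped = sum(freq * weight for freq, weight, ok in cis if ok)
    coverage = 100.0 * mapped / total
    return coverage, 100.0 - coverage
```

Under this weighting, a rarely executed CI that cannot be mapped costs little coverage, while a hot one that is rejected costs a lot, which is the effect the frequency and weight terms are meant to capture.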
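RFU Inputs (no constraint) [Chart: data labels 96.37, 89.37 and 98.48; 8 inputs marked]
RFU Outputs (no constraint) [Chart: data label 96.58; 6 outputs marked]
RFU Node No (Input=8, Output=8) [Chart: data label 94.74; 16 nodes marked]
RFU Width (Inp=8, Out=8, Node=16) [Chart: data labels 95.65 and 97.65; width 6 marked]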
RFU Architecture 1 • Input=8 • Output=8 • Node=16 • Width = 6,4,3,2,1 • Depth = 5 • Inputs are applied to the first row • Outputs of each row are connected only to the inputs of the subsequent row • MOVE is used for transferring data • Mapping rate is 77.53% • Rejection rate is 22.47% • Synthesis results using Hitachi 0.18 μm • Area: 0.9069 mm² • Delay: 7.54 ns
RFU Architecture 2 • Distributing inputs in different rows • Row 1 = 7 • Row 2 = 2 • Row 3 = 2 • Row 4 = 2 • Row 5 = 1 • Connections with variable length • row 1 → row 3 = 1 • row 1 → row 4 = 1 • row 1 → row 5 = 1 • row 2 → row 4 = 1 • Mapping rate is 90.48% • Rejection rate is 9.52% • Synthesis results using Hitachi 0.18 μm • Area: 1.1534 mm² • Delay: 9.66 ns
Function Types • Three types of functions: • logical operations (type 1) • add/sub/compare (type 2) • shift operations (type 3)
Integrating RFU with the Base Processor
[Figure: register file (Reg0 ... Reg31), Decoder, Config Mem and Sequencer feeding the DEC/EXE pipeline registers; FU1 to FU4 and the ACC between the DEC/EXE and EXE/MEM pipeline registers, with the Sequencer selecting the result]
Control Bits & Immediate Data • 308 bits are needed as control bits for • Multiplexers • Functional Units • 204 bits are needed for immediates • Each CI configuration therefore needs 308 + 204 = 512 bits
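Combining the 512-bit figure with the two or four contexts mentioned on the RFU slide gives the size of the multi-context memory. The short calculation below only restates that arithmetic; the constant and function names are illustrative, and any per-context overhead such as tags or valid bits is ignored.

```python
CONTROL_BITS = 308    # multiplexer and functional-unit control bits per CI
IMMEDIATE_BITS = 204  # immediate data bits per CI
BITS_PER_CI = CONTROL_BITS + IMMEDIATE_BITS  # 512 bits per configuration


def multicontext_bits(n_contexts):
    """Bits needed to hold n_contexts CI configurations (overheads ignored)."""
    return n_contexts * BITS_PER_CI


# Two contexts need 1024 bits and four contexts need 2048 bits.
sizes = {n: multicontext_bits(n) for n in (2, 4)}
```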
Performance Evaluation • SimpleScalar was configured to behave as a 4-issue in-order RISC processor. The base processor supports the MIPS instruction set. • 22 applications of MiBench
Delay of RFU according to CI length • Synopsys Tools + Hitachi 0.18μm
Conclusions • Adaptive Dynamic Extensible Processor • Binary Profiler • RFU (Inp=8, Out=6, Nodes=16, Width=6,4,3,2,1 & Depth=5) • Sequencer • Average Speedup is 1.21 for 300 MHz base processor • Adaptive Dynamic Extensible Processor • No design time • No extra read port and write port • No design and verification cost • No compiler • No new opcode
Future Work • Generating multi-exit CIs • Proposing an RFU for supporting multi-exit CIs • Scheduling CIs on the processor • Details of sequencer