General Overview of An Adaptive Dynamic Extensible Processor. Hamid Noori, Kazuaki Murakami, Koji Inoue & Victor Goulart. Kyushu University, Department of Informatics. Workshop on Introspective Architecture (WISA06). Agenda. Background Research goal General overview of the architecture
General Overview of An Adaptive Dynamic Extensible Processor. Hamid Noori, Kazuaki Murakami, Koji Inoue & Victor Goulart. Kyushu University, Department of Informatics. Workshop on Introspective Architecture (WISA06)
Agenda • Background • Research goal • General overview of the architecture • Modes of operation • Profiler • Accelerator • Sequencer • Generation of Custom Instructions • Configuration Data for the Accelerator • Experiments and Results • Conclusions & Future work
Some definitions • Hot Basic Block (HBB) • A basic block whose execution frequency is greater than a given threshold specified in the profiler • Custom Instructions (CIs) • Extensions to the Instruction Set Architecture (ISA) that are executed on the ACC • Accelerator (ACC) • Custom hardware for executing CIs • Training mode • Operation mode for detecting HBBs and generating CIs • Normal mode • Normal operation mode where CIs are executed on the ACC
Research Goal • Proposal of an Adaptive Dynamic Extensible Processor for Embedded Systems • Custom instructions are adaptable to the applications • Custom instructions are detected and created during execution/training • Generation of custom instructions is done transparently and automatically • Advantages of the novel approach • Higher performance than GPPs • Higher flexibility compared to Extensible Processors • Shorter turnaround time (TAT) and lower design and verification cost compared to ASIPs and Extensible Processors
General overview of the architecture (block diagram of the Adaptive Dynamic Extensible Processor) • Base Processor: N-way in-order general RISC (Fetch, Decode, Execute, Memory, Write; Reg File) • Augmented Hardware • Profiler: detects start addresses of Hot Basic Blocks (HBBs) • Sequencer: switches between main processor and ACC • ACC: executes Custom Instructions
General overview of the architecture • Modes of operation • Training mode • Profiling • Detecting start addresses of Hot Basic Blocks (HBBs) • Generating Custom Instructions • Generating Configuration Data for the ACC • Binary rewriting • Initializing the Sequencer Table ♦ Online • Needs simple hardware for profiling • All tasks are run on the base processor ♦ Offline • Needs a PC trace after taken branches/jumps • Normal mode • Profiling (optional) • Executing Custom Instructions on the ACC and the other parts of the code on the base processor
Components (block diagram; labels: DMA, Cache, Register File, Multi-Context Memory, ID/EXE Reg, Functional Unit, Online Training, Accelerator, Sequencer, Sequencer Table, Mux, Profiler, Profiler Table (HWT), Augmented HW, GPP, EXE/MEM Reg)
Operation modes (three-panel diagram; each panel shows Applications running on a Processor with Profiler, ACC and Sequencer) • Training Mode: Binary-Level Profiling; Detecting Start Address of HBBs • Training Mode: Running Tools for Generating Custom Instructions, Generating Configuration Data for ACC and Initializing Sequencer Table; Binary Rewriting • Normal Mode: Monitors PC and Switches between main processor and ACC; Executing CIs
Profiler (flowchart over the Profiler Table) • Compare Current PC with Previous PC • If the difference is not greater than the instruction length: do nothing • If it is greater (a taken branch or jump): is Current PC in the table? • Yes: increment the counter • No: add it as a new entry and set the counter to one • After a taken branch or jump we look at the BBSA to see if the target PC is in the table. If it is a miss we add this address and initialize the counter to 1; otherwise we increment its value.
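The flowchart above can be modelled in a few lines of software. This is only an illustrative sketch, not the hardware profiler itself: the class name `Profiler`, its method names, and the 8-byte instruction length (read off the address spacing in the listings) are assumptions, and "greater than instruction length" is interpreted here as an absolute PC difference.

```python
INSTR_LEN = 8  # assumption: addresses in the listings advance by 8 bytes


class Profiler:
    """Software model of the profiler table (hypothetical names)."""

    def __init__(self, instr_len=INSTR_LEN):
        self.instr_len = instr_len
        self.table = {}       # basic-block start address -> counter
        self.prev_pc = None   # no previous PC before the first instruction

    def observe(self, pc):
        """Feed the PC of one executed instruction to the profiler."""
        if self.prev_pc is not None:
            # A PC step larger than one instruction means a taken branch/jump.
            if abs(pc - self.prev_pc) > self.instr_len:
                # Miss: new entry with counter 1; hit: increment the counter.
                self.table[pc] = self.table.get(pc, 0) + 1
        self.prev_pc = pc

    def hot_blocks(self, threshold):
        """Start addresses whose counter is greater than the threshold."""
        return sorted(a for a, c in self.table.items() if c > threshold)


# Usage: a three-instruction loop body entered five times.
profiler = Profiler()
profiler.observe(0x400000)
for _ in range(5):
    for pc in (0x400D10, 0x400D18, 0x400D20):
        profiler.observe(pc)
```

A dictionary stands in for the fixed-size hardware table; a real table would also need a replacement policy, which the slides do not specify.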
Detecting Start Addr of HBBs
HBB:
  400d10: addiu $29,$29,-8
  400d18: addu $8,$0,$4
  400d20: sw $0,0($29)
  400d28: addu $4,$0,$0
  400d30: addu $7,$0,$0
  400d38: lui $9,49152
  400d40: sll $4,$4,0x2
  400d48: and $2,$8,$9
  400d50: bne $2,$0,400db8 <usqrt+0xa8>
  400d58: srl $2,$2,0x1e
  400d60: lw $3,0($29)
  400d68: addu $4,$4,$2
  400d70: sll $8,$8,0x2
  400d78: sll $6,$3,0x1
  400d80: sll $3,$3,0x2
  400d88: addiu $3,$3,1
  400d90: sltu $2,$4,$3
  400d98: sw $6,0($29)
(Figure labels: Profiler Table with columns "Taken Freq" and "Not taken part" and entries 400d10 500, 400db8 X 500; BTA 400db8 50; sub; Exec Freq; Hot? (Counter > Threshold, Threshold = 100); HBB Table)
Size of Profiler Table (chart: number of basic blocks with execution frequency above the threshold)
Accelerator (ACC) • ACC is a matrix of Functional Units (FUs) • ACC has a two-level configuration memory • A multi-context memory (keeps two or four configurations) • A cache • FUs support only logical operations, add/subtract, shifts and compare • ACC updates the PC • ACC has a variable delay that depends on the size of the Custom Instruction
Connecting ACC to the Base Processor (diagram; labels: Reg0 ... Reg31, Config Mem, Decoder, DEC/EXE Pipeline Registers, FU1, FU2, FU3, FU4, ACC, Sequencer, EXE/MEM Pipeline Registers)
Sequencer • The sequencer mainly determines the microcode execution sequence. • Selects between the decoder and the config memory for reading the RF • Selects between the output of the Functional Unit and the Accelerator • Determines when to switch between different contexts of the multi-context memory • Determines when to load configuration data from the cache to the multi-context memory. • Checks the configuration data of the custom instruction • If it is in the multi-context memory, the custom instruction is executed on the accelerator • If it is not in the multi-context memory • If there is enough time to load it from the cache to the multi-context memory, the sequencer loads it and executes the CI on the ACC • If there is not enough time, the original code is executed.
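The configuration check at the end of this slide can be sketched as a small decision function. This is a hedged illustration only: the function name, the cycle-count parameters, and the oldest-first replacement of contexts (via `deque(maxlen=...)`) are assumptions, since the slides do not say how "enough time" is measured or which context is replaced.

```python
from collections import deque


def sequencer_decision(ci_id, contexts, cache, cycles_available, load_cycles):
    """Decide where the code at a CI entry point runs: "ACC" or "BASE".

    contexts: deque(maxlen=N) modelling the multi-context memory
    cache: set of CI ids whose configuration is held in the config cache
    """
    if ci_id in contexts:
        return "ACC"  # configuration already loaded
    if ci_id in cache and cycles_available >= load_cycles:
        # Enough time: load cache -> multi-context memory, then run on ACC.
        # A full deque drops its oldest entry (assumed replacement policy).
        contexts.append(ci_id)
        return "ACC"
    # Not enough time (or no configuration): run the original code instead.
    return "BASE"


# Usage with a two-context memory.
contexts = deque(["ci1"], maxlen=2)
cache = {"ci2", "ci3"}
where = sequencer_decision("ci2", contexts, cache, cycles_available=10, load_cycles=10)
```

Falling back to the original code is what allows the design to claim no penalty when configuration data is absent.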
Generation of Custom Instructions • Custom instructions • Exclude floating point, multiply, divide and load instructions • Include at most one STORE, at most one BRANCH/JUMP and all other fixed point instructions • Simple algorithm for generating custom instructions • HBBs usually include 10 to 40 instructions for MiBench • The custom instruction generator runs on the base processor (in online training mode)
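The constraints above suggest a simple first step: find the longest run of consecutive supportable instructions containing at most one store and at most one branch/jump. The sketch below is one possible way to do that, not the authors' generator; the opcode sets are a small illustrative subset, and treating `mfc1`/`mtc1` and any opcode containing a dot as floating point is an assumption.

```python
# Illustrative opcode sets (assumed, not exhaustive).
UNSUPPORTED = {"lw", "lh", "lhu", "lb", "lbu",        # loads
               "mult", "multu", "div", "divu",        # multiply/divide
               "mfc1", "mtc1"}                        # FP register moves
STORES = {"sw", "sh", "sb"}
CONTROL = {"beq", "bne", "j", "jr", "jal", "jalr"}


def is_supported(op):
    """True if the opcode may appear inside a custom instruction."""
    return op not in UNSUPPORTED and "." not in op  # "." marks FP ops like mov.d


def longest_ci_candidate(ops):
    """Return (start, end) of the longest valid window, end exclusive."""
    best = (0, 0)
    start = stores = branches = 0
    for end, op in enumerate(ops):
        if not is_supported(op):
            # An unsupported instruction ends the current window.
            start, stores, branches = end + 1, 0, 0
            continue
        stores += op in STORES
        branches += op in CONTROL
        # Shrink from the left until the store/branch limits hold again.
        while stores > 1 or branches > 1:
            stores -= ops[start] in STORES
            branches -= ops[start] in CONTROL
            start += 1
        if end + 1 - start > best[1] - best[0]:
            best = (start, end + 1)
    return best


# Usage: opcodes of a short block.
window = longest_ci_candidate(["lw", "addu", "sll", "sw", "mult", "or"])
```

The two-pointer scan is linear in the block length, which matters if the generator is meant to run on the base processor during online training.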
Generating Custom Instructions
• Find the biggest sequence of instructions in the HBB that can be executed on the ACC
• Move instructions and append supportable instructions to the head of the detected instruction sequence, after checking flow dependency and anti-dependency
• Move instructions and append supportable instructions to the tail of the detected instruction sequence, after checking flow dependency and anti-dependency
• Rewrite the object code if instructions have been moved
• Moving instructions must not modify the logic of the application
• Custom instruction generation is done without considering any other constraints
Example HBB:
  4052c0: addiu $29,$29,-32
  4052c8: mov.d $f0,$f12
  4052d0: sw $18,24($29)
  4052d8: addu $18,$0,$6
  4052e0: sw $31,28($29)
  4052e8: sw $16,16($29)
  4052f0: mfc1 $16,$f0
  4052f8: mfc1 $17,$f1
  405300: srl $6,$17,0x14
  405308: andi $6,$6,2047
  405310: sltiu $2,$6,2047
  405318: addu $6,$6,$18
  405320: sltiu $2,$6,2047
  405328: lui $2,32783
  405330: and $17,$17,$2
  405338: andi $2,$6,2047
  405340: sll $2,$2,0x14
  405348: or $17,$17,$2
  405350: mtc1 $16,$f0
  405358: mtc1 $17,$f1
  405360: lw $31,28($29)
  405370: lw $16,16($29)
  405378: addiu $29,$29,32
  405380: jr $31
Generating Custom Instructions (diagram: the HBB is divided into blocks B1, B3 and B5 of supported instructions and blocks B2 and B4 of unsupported instructions) • Block 3 (B3) is selected as the biggest instruction sequence that can be executed on the ACC • Block 2 (B2) cannot be executed on the ACC • Block 1 (B1) can be executed on the ACC • If there is no flow dependency or anti-dependency between B1 and B2, exchange them • The same is done for B4 and B5
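The exchange test between two adjacent blocks can be sketched as a register-set comparison. This is a minimal illustration under stated assumptions: each instruction is modelled as a hypothetical `(destinations, sources)` pair of register names, memory ordering is ignored, and the output-dependency check is an extra conservative condition that the slide does not list.

```python
def written(block):
    """Registers written by a block; $0 is hardwired to zero on MIPS."""
    return {r for dests, _ in block for r in dests} - {"$0"}


def read(block):
    """Registers read by a block, ignoring the constant register $0."""
    return {r for _, srcs in block for r in srcs} - {"$0"}


def can_exchange(first, second):
    """True if `first` may be moved after `second` without changing results."""
    flow = written(first) & read(second)       # second reads what first writes
    anti = read(first) & written(second)       # second overwrites first's inputs
    output = written(first) & written(second)  # assumption: also keep write order
    return not (flow or anti or output)


# Usage: addu $8,$0,$4 followed by sw $0,0($29).
b1 = [(("$8",), ("$0", "$4"))]
b2 = [((), ("$0", "$29"))]
movable = can_exchange(b1, b2)
```

A real implementation would also have to respect load/store ordering, which this register-only model does not capture.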
Example 1 (figure labels: Customized Instruction 1, Customized Instruction 2)
  400d10: addiu $29,$29,-8
  400d18: addu $8,$0,$4
  400d20: sw $0,0($29)
  400d28: addu $4,$0,$0
  400d30: addu $7,$0,$0
  400d38: lui $9,49152
  400d40: sll $4,$4,0x2
  400d48: and $2,$8,$9
  400d50: srl $29,$2,0x1e
  400d58: lw $3,0($29)
  400d60: addu $4,$4,$3
  400d68: sll $8,$8,0x2
  400d70: sll $6,$3,0x1
  400d78: sll $3,$3,0x2
  400d80: addiu $3,$3,1
  400d88: sltu $2,$4,$3
  400d90: sw $6,0($29)
  400d98: bne $2,$0,400db8 <usqrt+0xa8>
Example 2 (rewriting obj code)
  400d10: addiu $29,$29,-8
  400d18: addu $8,$0,$4
  400d20: addu $7,$0,$0
  400d28: lui $9,49152
  400d30: sll $4,$4,0x2
  400d38: and $2,$8,$9
  400d40: srl $2,$2,0x1e
  400d48: lw $22,0($29)
  400d50: addu $4,$4,$2
  400d58: sll $8,$8,0x2
  400d60: sll $6,$3,0x1
  400d68: sll $3,$3,0x2
  400d70: sltu $2,$4,$3
  400d78: bne $2,$0,400db8 <usqrt+0xa8>
ACC Config Data Generation Flow (flow diagram) • Base Processor: MiBench Applications on SimpleScalar (PISA Configuration) • Profiler • Detecting Start Addr of HBBs • Reading HBBs from Obj Code • DFG
Preliminary Performance Evaluation (figure: the ACC drawn as a matrix of FUs)
  400d10: addiu $29,$29,-8
  400d18: addu $8,$0,$4
  400d20: sw $0,0($29)
  400d28: addu $4,$0,$0
  400d30: addu $7,$0,$0
  400d38: lui $9,49152
  400d40: sll $4,$4,0x2
  400d48: and $2,$8,$9
  400d50: srl $2,$2,0x1e
• Depth = 3: 1st row = 1 clock, 2nd row = 0.5 clock, 3rd row = 0.5 clock, Total = 2 clocks
• 9 – 2 = 7 clock cycles saved per execution
• 7 * freq = reduced clock cycles
• 7 * 50K = 350K clock cycles
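The arithmetic on this slide can be written out as a small calculation. It is a sketch under assumptions: the function names are hypothetical, the base processor is taken to need one cycle per instruction (which is what "9 – 2" implies for nine instructions), and fractional ACC latencies are rounded up to whole clocks.

```python
import math


def acc_latency(depth, first_row=1.0, other_rows=0.5):
    """ACC delay in clocks: 1 for the first row, 0.5 for each further row."""
    return first_row + other_rows * (depth - 1)


def cycles_saved(n_instructions, depth, exec_freq, base_cpi=1):
    """Clock cycles saved by running a CI on the ACC exec_freq times."""
    base_cycles = n_instructions * base_cpi      # assumed 1 cycle/instruction
    acc_cycles = math.ceil(acc_latency(depth))   # whole clocks on the ACC
    return (base_cycles - acc_cycles) * exec_freq


# Usage: the slide's example, 9 instructions, depth 3, executed 50K times.
saved = cycles_saved(n_instructions=9, depth=3, exec_freq=50_000)
```

With these inputs the ACC latency is 2 clocks and the saving is 7 cycles per execution, matching the 350K figure on the slide.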
Results – Number of CIs by length (chart; x-axis: Length of CIs; data label shown: 82)
Results – Percentage of CIs by length (chart; x-axis: Length of CIs)
Conclusions • An Adaptive Dynamic Extensible Processor • Training mode and Normal mode • Advantages • It has a simple profiler • CIs are detected and added after production • There is no need for a new compiler • There is no need for new opcodes for CIs • There is no penalty for absence of CI config data • Lower design cost and shorter design time • By accelerating a small part of the code that has a high execution frequency, an average speedup of 25% can be obtained. Compared with a single-issue processor, the speedup ranges from 7.8% to 52%.
Future Work • Linking HBBs • Providing more details on the architecture (Accelerator, Sequencer, etc.) • Designing an Accelerator to support conditional execution • Developing a complete framework • Extending the ACC for floating point operations • Replacing the in-order base processor with an out-of-order one
Example • Application X • CIx1, 100, input = 3 • CIx2, 200, input = 6 • Total executed instructions = 400,000 • Application Y • CIy1, 50, input = 4 • CIy2, 400, input = 6 • Total executed instructions = 800,000 • Input < 5
RFU Design: A Quantitative Approach • RFU or Accelerator is a matrix of ALUs • Number of Inputs • Number of Outputs • Number of ALUs • Connections • Location of Inputs & Outputs • Some definitions: • Considering frequency and weight in measurement • CI Execution Frequency • Weight (to equalize the number of executed instructions) • Average = for all CIs (ΣFreq*Weight) • Rejection: Percentage of CIs that could not be mapped on the RFU • Coverage: Percentage of CIs that could be mapped on the RFU • Basic Block: A sequence of instructions that terminates in a control instruction • Hot Basic Block: A basic block executed more than a threshold number of times
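One plausible reading of these definitions, applied to the Application X / Application Y example, is sketched below. Several points are assumptions rather than statements from the slides: the second value of each CI is taken to be its execution frequency, the weight of an application is taken as the largest instruction count divided by its own count, and "Input < 5" is read as the mapping constraint under test.

```python
def weighted_rejection(apps, input_limit):
    """Fraction of weighted CI executions that cannot be mapped on the RFU.

    apps: list of dicts with "total_instructions" and "cis", where each CI
    is a (frequency, number_of_inputs) pair (hypothetical data layout).
    """
    reference = max(app["total_instructions"] for app in apps)
    rejected = total = 0.0
    for app in apps:
        # Weight scales every application to the same instruction count.
        weight = reference / app["total_instructions"]
        for freq, n_inputs in app["cis"]:
            contribution = freq * weight
            total += contribution
            if n_inputs >= input_limit:  # violates "Input < input_limit"
                rejected += contribution
    return rejected / total


# Usage: the numbers from the example slide.
apps = [
    {"total_instructions": 400_000, "cis": [(100, 3), (200, 6)]},  # Application X
    {"total_instructions": 800_000, "cis": [(50, 4), (400, 6)]},   # Application Y
]
rejection = weighted_rejection(apps, input_limit=5)
coverage = 1.0 - rejection
```

Under this reading the two six-input CIs are rejected; other weighting schemes are possible and would change the resulting percentage.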
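RFU Inputs (no constraint) (chart; data labels shown: 96.37, 89.37, 98.48; marked value: 8)
RFU Outputs (no constraint) (chart; data label shown: 96.58; marked value: 6)
RFU Node No (Input=8, Output=8) (chart; data label shown: 94.74; marked value: 16)
RFU Width (Inp=8, Out=8, Node=16) (chart; data labels shown: 95.65, 97.65; marked value: 6)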
RFU Configuration • Input=8 • Output=8 • Node=16 • Width = 6,4,3,2,1 • Depth = 5
General overview of RFU (Architecture 1) • Inputs are applied to the first row • Outputs of each row are connected only to the inputs of the subsequent row • MOVE is used for transferring data • Rejection is 22.47%
General overview of RFU (Architecture 2) • Distributing Inputs in different rows • Row 1 = 7 • Row 2 = 2 • Row 3 = 2 • Row 4 = 2 • Row 5 = 1 • Connections with Variable Length • row 1 to row 3 = 1 • row 1 to row 4 = 1 • row 1 to row 5 = 1 • row 2 to row 4 = 1 • Rejection is 9.52%
Functional Units • Types of FUs: • Type 1: Logical (xor, nor, and, or) • Type 2: add, sub, compare • Type 3: shift (left/right) • Number of each type in the RFU • Type 1 = 6 • Type 2 = 14 • Type 3 = 9
RFU with 8 outputs (diagram; labels: Accelerator, four Reg blocks, FU1-Output, FU2-Output, FU3-Output, FU4-Output, Sequencer/control bits)
Control Bits & Immediate Data • 287 bits are needed as Control Bits for • Multiplexers • Functional Units • 204 bits are needed for Immediates • Each CI configuration needs 287 + 204 = 491 bits
CI Configuration Memory • 2K x 1-bit multi-context memory: 4 CI configurations • 8K x 1-bit cache: 16 CI configurations • In total, 20 CI configurations can be kept in the configuration memories
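The capacities above follow from the per-CI configuration size of 287 control bits plus 204 immediate bits. The short check below assumes that "2K" and "8K" mean 2048 and 8192 bits and that configurations are stored whole, with no packing across the leftover bits; the function name is hypothetical.

```python
CONTROL_BITS = 287                            # multiplexers and functional units
IMMEDIATE_BITS = 204                          # immediate data
BITS_PER_CI = CONTROL_BITS + IMMEDIATE_BITS   # 491 bits per CI configuration


def configs_that_fit(memory_bits, bits_per_ci=BITS_PER_CI):
    """Number of whole CI configurations that fit in a memory of given size."""
    return memory_bits // bits_per_ci


multi_context = configs_that_fit(2 * 1024)    # 2K x 1-bit multi-context memory
cache = configs_that_fit(8 * 1024)            # 8K x 1-bit cache
total = multi_context + cache
```

Integer division gives 4 and 16 configurations, 20 in total, in agreement with the slide.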
Extension of Custom Instructions over HBBs – Motivating Example (control-flow diagram; node labels B1 to B12, S1 to S10, J1 to J3)
Conclusions • Adaptive Dynamic Extensible Processor • Binary Profiler • RFU (Inp=8, Out=6, Nodes=16, Width=6,4,3,2,1 - Depth=5) • Sequencer • Adaptive Dynamic Extensible Processor • No design time • No extra read port and write port • No design and verification cost • No compiler • No new opcode • No penalty for absence of configuration data of custom instruction in multi-context memory.
Custom Instruction • Generated from HBBs • Using the HBB table • Object code • A custom instruction can include • Logical operations • add/sub • Shift • At most one store • At most one control instruction (jump/branch) • No load • No floating point instructions • New object code • Is logically equivalent to the original (diagram label: Profiler Table)
Processor modes (1/2) • Training mode • Profiling applications • Detecting critical regions of the code • Generating DFGs for critical regions • Generating custom instructions from DFGs • Generating new object code • Generating data for the accelerator configuration memories and initializing the sequencer table • Training can be done in the gap between two consecutive executions of the application if possible; otherwise just once before the processor starts its normal operation
Processor modes (2/2) • Normal mode • Profiling applications • Using the data generated in training mode to execute custom instructions on the accelerator. • Critical regions of the code are executed as custom instructions on the accelerator, and the remaining code is executed on the processor's functional units as usual.