Effective Compilation Support for Variable Instruction Set Architecture
Jack Liu, Timothy Kong, Fred Chow
Cognigine Corp., www.cognigine.com
Outline
• VISC Architecture
• Compile-time Configurable Code Generation
• Managing the Dictionary
• Concluding Remarks
Configurable Computing
Motivation
• Higher performance
  • processor and instruction set customized to type of application
• Lower hardware cost
  • non-essential features excluded
• Shorter time-to-market
Variable Instruction Set Architecture (VISC Architecture™)
A new approach to configurable computing:
• Fixed processor hardware
• Many types of operations provided
• Numerous instruction variants (CISC-style)
• Per-program instruction set tailoring at compile time
Background of this work
Cognigine CGN16100 Network Processor
• Single-chip, fully programmable network processor
• Processing cores: 16 Re-configurable Communications Units (RCUs)
• VISC architecture
• Four 64-bit parallel execution units
• Multi-threaded
• 512 KB on-chip memory (text and data)
VISC Architecture™
Dictionary (instruction set for current program): 256 entries
Dictionary entry:
• 32-bit: 2 operations
• 64-bit: 4 operations
• 128-bit: 8 operations
Instruction: opcode | opnd0 | opnd1 | opnd2 | opnd3
• opcode: 8-bit
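The relationship between an instruction and the dictionary can be sketched as a simple table lookup. This is a minimal illustration, not the hardware's actual decode logic: it assumes, as the 8-bit opcode and the 256-entry dictionary suggest, that the opcode selects one dictionary entry, and that the entry's operations all draw on the instruction's shared operand fields. The names `DictionaryEntry`, `Instruction`, and `decode` are hypothetical, and the sample values assume the IDF example's entry sits at index 3.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

DICTIONARY_SIZE = 256  # from the slide: 256 entries, matching an 8-bit opcode
MAX_OPERANDS = 4       # from the slide: opnd0..opnd3


@dataclass(frozen=True)
class DictionaryEntry:
    # Hypothetical simplification: an entry is just a tuple of operation names.
    operations: Tuple[str, ...]


@dataclass(frozen=True)
class Instruction:
    opcode: int                # assumed to be an index into the dictionary
    operands: Tuple[str, ...]  # operand fields shared by the entry's operations


def decode(instr: Instruction, dictionary: Dict[int, DictionaryEntry]):
    """Return (operations, operands) for one instruction."""
    if not 0 <= instr.opcode < DICTIONARY_SIZE:
        raise ValueError("opcode must fit in 8 bits")
    if len(instr.operands) > MAX_OPERANDS:
        raise ValueError("at most four operand fields per instruction")
    return dictionary[instr.opcode].operations, instr.operands


# Sample values modelled on the IDF example in these slides (index 3 is assumed).
dictionary = {3: DictionaryEntry(("add", "xor", "sub", "nop"))}
instr = Instruction(3, ("8($p1)", "$w70", "0x55", "0xf8"))
ops, operands = decode(instr, dictionary)
```

Because the operation names live in the dictionary, the instruction itself only needs the small opcode plus operand fields, which is where the encoding density comes from.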
Motivation for VISC Architecture
• Efficient way to encode/decode the many operation variants with different addressing modes
  • Not all used in each program
• High instruction encoding density
  • Small opcode bit count
  • Operands shared among multiple operations
• Simplified control logic for VLIW-style ILP
  • Up to 8 operations per cycle
Operation Specification
In Dictionary Entry (only specified once):
• Operation name
• Operation variants:
  • Signed and unsigned
  • Operand and result sizes: 8-bit, 16-bit, 32-bit, 64-bit
  • Support different sizes among operand(s) or result
  • Vector: 64v8, 64v16, 64v32, 32v8, 32v16
• Data path to each operand/result
In Instruction:
• Operands' encoding formats
• Actual operands
RCU Architecture
• 5-stage pipeline
• 4-way multi-threaded
• Hardware RSF synchronization
• 128-bit reconfigurable address path
• 256-bit reconfigurable data path
[Block diagram: Packet Buffers; Registers, Scratch Memory; RSF; RSF Connector; "Back-side" Ports; Data Memory; Instruction Cache; Pointer File; Dictionary; Data Flow Synchronization; Source Route (×4); Address Calculation; Dictionary Decode; Pipeline & Thread Control; Execution Unit (×4, 64-bit each)]
Roles of Compiler for VISC Architecture
• Determine the instruction set to store in the dictionary for the best execution-time performance
• Generate an optimized code sequence based on that instruction set
• Cater to various hardware limitations:
  • Dictionary limit
  • Data path constraints
  • Dictionary and instruction encoding constraints
New Compilation Approach: Configurable Code Generation
• Exact form of generated instructions decided in the last instruction scheduling phase
• Direct result of instruction compaction based on what is allowed by the hardware
Compiler Implementation Method
• Retarget SGI Pro64 (Open64) compiler to an Abstract Machine
• Code generator operates on an Abstract Operation Representation
• Code generation optimizations left intact
• Add new Instruction and Dictionary Finalization (IDF) phase as post-pass
IDF Phase 1:
• Instruction scheduling and folding
• Abstract operations converted to target code sequence
IDF Phase 2:
• Output VISC instructions and dictionary entries
Compiler Phase Structure
C → GNU / Pro64™ front-end → WHIRL → Pro64™ back-end (optimizer, code generator) → IDF → assembly program (instructions, dictionary)
Abstract Operation Representation (AOR)
Each operation corresponds to a micro-operation in the core execution units
• RISC-like formats
  • r1 = op r2, r3
  • r2 = load <offset>(<base>)
  • store r2 <offset>(<base>)
  • r1 = loadimm <imm>
• Optimizations in AOR reflected in final code
• No pre-disposition of compiler to any specific instruction format
Multiple AOR ops can be combined into a single target operation
• Operations taking immediate operand
    r2 = move <imm>; r3 = add r1, r2   =>   r3 = addi r1 <imm>
• Operations supporting memory operands
    r2 = load 4(sp); r3 = add r1, r2   =>   r3 = add r1 4(sp)
• Post-increment/decrement memory operations
    r2 = load 0(r1); r1 = addi r1, 4   =>   r2 = load 0(r1++)
• Branches on condition codes (only if the branch comes immediately after)
    r1 = add r2, r3; compare (r1 != 0); br.z label   =>   r1 = add r2, r3; br.z label
• Others
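The first folding pattern above (a move of an immediate feeding an add) can be sketched as a small peephole pass. This is an illustrative sketch only, not the IDF implementation: the tuple encoding of operations, the function names, and the assumption that a register with no later read inside the block is dead at block exit are all hypothetical simplifications.

```python
from typing import List, Tuple, Union

Operand = Union[str, int]                  # registers are str, immediates are int
Op = Tuple[str, str, Tuple[Operand, ...]]  # (dest, opname, sources)


def used_later(reg: str, rest: List[Op]) -> bool:
    """True if reg is read in rest before being redefined.

    Assumption: a register that is not read again in the block is dead.
    """
    for dest, _name, srcs in rest:
        if reg in srcs:
            return True
        if dest == reg:
            return False
    return False


def fold_immediates(ops: List[Op]) -> List[Op]:
    """Fold `t = move <imm>; d = add s, t` into `d = addi s, <imm>`."""
    out: List[Op] = []
    i = 0
    while i < len(ops):
        dest, name, srcs = ops[i]
        if name == "move" and isinstance(srcs[0], int) and i + 1 < len(ops):
            ndest, nname, nsrcs = ops[i + 1]
            feeds_add = (nname == "add" and len(nsrcs) == 2
                         and nsrcs[1] == dest and nsrcs[0] != dest)
            # The temporary must not be needed after the add.
            temp_dead = ndest == dest or not used_later(dest, ops[i + 2:])
            if feeds_add and temp_dead:
                out.append((ndest, "addi", (nsrcs[0], srcs[0])))
                i += 2
                continue
        out.append(ops[i])
        i += 1
    return out


folded = fold_immediates([("r2", "move", (5,)), ("r3", "add", ("r1", "r2"))])
```

In the compiler described here, such folding is decided during scheduling rather than in a separate pass, since whether a fold is legal depends on the hardware and encoding constraints in force at that point.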
IDF Approach
Instruction scheduling + following tasks:
• Instruction folding
• Opcode selection
• Modelling of irregular hardware constraints
• Modelling of encoding constraints
• Monitoring of states of condition codes and transient registers
• Keeping track of dictionary contents
Use enumeration (branch and bound) approach
Example of IDF Processing
Input:
    $w80 = move 0x55
    $w91 = move 0xf8
    $w70 = add $w70, $w80
    $w71 = xor $w92, $w80
    $w90 = sub $w92, $w91
    store 8($p1) = $w90
Output:
    Dictionary entry 3: add | xor | sub | nop
    Instruction: op3 8($p1) $w70 0x55 0xf8
• move and store instructions subsumed
• $w71, $w92 mapped to transient registers
IDF Scheduling Algorithm
Input: Sequence of operations in BB
Flow: start → estimate initial bound_sch → search for schedule with length <= bound_sch → succeed? yes: end; no: bound_sch = bound_sch + 1 and search again
To speed up the search, shrink the solution space by:
• Coming up with a high initial bound_sch
• Pruning useless search paths continuously
• Tight hardware constraints help
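The search loop in this flowchart can be sketched on a deliberately simplified machine model. The sketch assumes each operation occupies one slot, an instruction holds at most `width` operations, and a dependent operation must be placed in a strictly later instruction; the real IDF additionally models folding, encoding, and dictionary constraints, none of which appear here. All function names are hypothetical.

```python
from math import ceil
from typing import List, Optional, Set, Tuple


def critical_path(deps: List[Set[int]]) -> int:
    """Longest dependence chain; deps[i] holds earlier op indices op i needs."""
    depth: List[int] = []
    for d in deps:
        depth.append(1 + max((depth[j] for j in d), default=0))
    return max(depth, default=0)


def search(deps: List[Set[int]], width: int, bound: int) -> Optional[List[int]]:
    """Enumerate placements; return a cycle per op if length <= bound is feasible."""
    n = len(deps)
    cycle_of = [0] * n
    load = [0] * bound

    def place(i: int) -> bool:
        if i == n:
            return True
        earliest = max((cycle_of[j] + 1 for j in deps[i]), default=0)
        for c in range(earliest, bound):  # cycles >= bound are pruned
            if load[c] < width:
                cycle_of[i] = c
                load[c] += 1
                if place(i + 1):
                    return True
                load[c] -= 1
        return False

    return list(cycle_of) if place(0) else None


def schedule(deps: List[Set[int]], width: int) -> Tuple[int, List[int]]:
    # A high initial bound skips lengths that can never succeed.
    bound = max(critical_path(deps), ceil(len(deps) / width))
    while True:
        result = search(deps, width, bound)
        if result is not None:
            return bound, result
        bound += 1  # the "no" edge of the flowchart


length, cycles = schedule([set(), {0}, {0}, {0}], width=2)
```

In the last call the initial bound is 2, which turns out to be infeasible with two slots per instruction, so the loop retries once and settles on a schedule of length 3. Because the bound only ever increases from a valid lower bound, the first schedule found is also a shortest one under this model.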
Managing the Dictionary
• Dictionary usage increases due to:
  • Program size: more variety of operations
  • High ILP: more combinations of operations
  • Library code linked in
• Currently, dictionary contents fixed for each executable
• Role of linker:
  • Merge dictionary entries with identical contents across files/libraries
  • Error message on dictionary overflow
• Role of compiler:
  • Maximize dictionary entry re-use
Dictionary Compilation
Strategy:
• Keep track of existing dictionary entries during compilation
• Extract dictionary entries from:
  • Libraries and .s files being linked
  • .o files compiled before current file
  Example: cc a.c b.o c.s
• Maintain table of existing dictionary entries
• Add to table as new entries are generated
• Re-use existing dictionary entries
• Bias scheduling towards dictionary conservation as dictionary fills up
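The table of existing dictionary entries can be sketched as an interning table keyed by entry contents. This is a minimal sketch under stated assumptions: entries are modelled as tuples of operation names, the capacity defaults to the 256 entries given earlier, and `DictionaryTable`, `seed`, `intern`, and `DictionaryOverflow` are hypothetical names rather than anything from the Cognigine tools.

```python
from typing import Dict, Iterable, Tuple

Entry = Tuple[str, ...]  # hypothetical: an entry is a tuple of operation names


class DictionaryOverflow(Exception):
    """Raised when a new entry does not fit in the dictionary."""


class DictionaryTable:
    def __init__(self, capacity: int = 256) -> None:
        self.capacity = capacity
        self._index: Dict[Entry, int] = {}  # entry contents -> dictionary index

    def seed(self, entries: Iterable[Entry]) -> None:
        """Load entries extracted from libraries, .s files, or earlier .o files."""
        for entry in entries:
            self.intern(entry)

    def intern(self, entry: Entry) -> int:
        """Return the index of entry, re-using an identical existing entry."""
        if entry in self._index:
            return self._index[entry]
        if len(self._index) >= self.capacity:
            raise DictionaryOverflow("dictionary is full")
        index = len(self._index)
        self._index[entry] = index
        return index

    def __len__(self) -> int:
        return len(self._index)


table = DictionaryTable(capacity=2)
table.seed([("add", "nop")])          # e.g. an entry found in b.o
first = table.intern(("add", "nop"))  # identical contents: re-used
second = table.intern(("sub", "nop"))
```

The same structure covers the linker's job of merging identical entries across files: interning every entry from every input leaves one copy of each distinct entry and surfaces overflow as an error.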
User Control of Dictionary Compilation
Best program performance demands a near-full dictionary. When the dictionary overflows, the program must be recompiled.
Provide user control mechanisms:
• Trade-off between dictionary consumption and program performance
• Command line option: -CG:dict_usage=n, n = 0…10
• Embedded in code: #pragma dict_usage n
dict_usage is the dictionary budget guideline for IDF
• Low dict_usage:
  • Fewer new dictionary entries created
  • Low ILP
• High dict_usage:
  • Tighter instruction schedule
  • More dictionary entries created
IDF Support of dict_usage
• Additional search goal bound_dict
  • Number of new dictionary entries allowed for current BB
  • Automatically adjusted lower as the number of pre-existing entries grows
• When bound_dict is reached during enumeration, disallow creating a new dictionary entry (unless it holds a single operation)
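The admission rule in the last bullet can be sketched as a predicate the enumeration would consult before committing to a candidate entry. This is a hedged sketch: the slides do not say how the per-block budget is derived from dict_usage, so it is simply passed in, and the function name, parameters, and the modelling of an entry as a tuple of its non-nop operation names are all assumptions.

```python
from typing import Set, Tuple

Entry = Tuple[str, ...]  # hypothetical: the non-nop operation names in an entry


def may_use_entry(entry: Entry, existing: Set[Entry],
                  new_in_bb: int, bound_dict: int) -> bool:
    """Decide whether the scheduler may emit an instruction using entry.

    existing   -- entries already in the dictionary table
    new_in_bb  -- new entries created so far in the current basic block
    bound_dict -- budget of new entries for this block (derivation not shown)
    """
    if entry in existing:
        return True   # re-use never consumes budget
    if len(entry) == 1:
        return True   # single-operation entries are always allowed
    return new_in_bb < bound_dict


existing = {("add", "xor")}
allowed = may_use_entry(("add", "sub"), existing, new_in_bb=0, bound_dict=1)
blocked = may_use_entry(("add", "sub"), existing, new_in_bb=1, bound_dict=1)
```

Exempting single-operation entries presumably keeps every block schedulable even when the budget is exhausted, at the cost of lower ILP.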
Experimental Results
Summary (with dict_usage=10):
• ILP from IDF scheduling: 1.38 ops per instruction
• ILP from relaxed scheduling: 1.51 ops per instruction
• 23% of all subsumable operations subsumed
• Each dictionary entry referred to by 2.63 instructions (statically)
• Scheduling via enumeration: 100 times slower than one-pass schedulers
• Compilation time: 1 to 2 minutes per program
Concluding Remarks
• VISC approach most suitable for embedded processors
  • Limited program size
  • Dictionary space less of an issue
  • Slow compilation tolerable
  • CISC-style instructions enable small code size
• Compilation support key to deploying applications on VISC
  • Very hard to write in assembly language
  • Advanced optimizations performed by compiler
  • Dictionary managed by compiler with user hints
• Compile-time configurable code generation enables RISC compilation techniques to generate CISC output