
Effective Compilation Support for Variable Instruction Set Architecture

Jack Liu, Timothy Kong, Fred Chow
Cognigine Corp. www.cognigine.com



  1. Effective Compilation Support for Variable Instruction Set Architecture
     Jack Liu, Timothy Kong, Fred Chow
     Cognigine Corp. www.cognigine.com

  2. Outline
     • VISC Architecture
     • Compile-time Configurable Code Generation
     • Managing the Dictionary
     • Concluding Remarks

  3. Configurable Computing
     Motivation:
     • Higher performance
       • Processor and instruction set customized to type of application
     • Lower hardware cost
       • Non-essential features excluded
     • Shorter time-to-market

  4. Variable Instruction Set Architecture (VISC Architecture™)
     A new approach to configurable computing:
     • Fixed processor hardware
     • Many types of operations provided
     • Numerous instruction variants (CISC-style)
     • Per-program instruction set tailoring during compile time

  5. Background of this work
     Cognigine CGN16100 Network Processor
     • Single-chip, fully programmable network processor
     • Processing cores: 16 Re-configurable Communications Unit (RCU) processor cores
       • VISC architecture
       • 4 64-bit parallel execution units
       • Multi-threaded
     • 512 KB on-chip memory (text and data)

  6. VISC Architecture™
     Dictionary (instruction set for current program):
     • 256 entries
     • Dictionary entry sizes:
       • 32-bit: 2 operations
       • 64-bit: 4 operations
       • 128-bit: 8 operations
     Instruction format: opcode | opnd0 | opnd1 | opnd2 | opnd3
     • opcode: 8-bit
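
The relationship between the dictionary and an instruction can be sketched in a few lines of Python. This is only an illustrative model, not Cognigine's encoding: the class names, the string operands, and the reading of the per-width operation counts as upper limits are all assumptions. The one thing taken from the slide is that an 8-bit opcode can address exactly the 256 dictionary entries.

```python
from dataclasses import dataclass
from typing import List, Tuple

DICT_ENTRIES = 256                      # from the slide: 256 entries, 8-bit opcode
OPS_PER_WIDTH = {32: 2, 64: 4, 128: 8}  # entry width in bits -> operation count

@dataclass(frozen=True)
class DictEntry:
    """Hypothetical model of one dictionary entry."""
    width_bits: int
    ops: Tuple[str, ...]

    def __post_init__(self):
        if self.width_bits not in OPS_PER_WIDTH:
            raise ValueError("unsupported entry width")
        # Assumption: the per-width operation count is treated as a maximum.
        if len(self.ops) > OPS_PER_WIDTH[self.width_bits]:
            raise ValueError("too many operations for this entry width")

@dataclass(frozen=True)
class Instruction:
    """Hypothetical model: an opcode plus up to four operands (opnd0..opnd3)."""
    opcode: int
    operands: Tuple[str, ...]

def decode(dictionary: List[DictEntry], insn: Instruction):
    """Look up the operations an instruction performs via its opcode."""
    if len(dictionary) > DICT_ENTRIES:
        raise ValueError("dictionary overflow")
    if not 0 <= insn.opcode < len(dictionary):
        raise IndexError("opcode does not name a dictionary entry")
    return dictionary[insn.opcode].ops, insn.operands
```

The operation names live only in the dictionary entry, so the instruction itself carries nothing but the index and the operands.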

  7. Motivation for VISC Architecture
     • Efficient way to encode/decode the many operation variants with different addressing modes
       • Not all used in each program
     • High instruction encoding density
       • Small opcode bit count
       • Operands shared among multiple operations
     • Simplified control logic for VLIW-style ILP
       • Up to 8 operations per cycle

  8. Operation Specification
     In Dictionary Entry (only specified once):
     • Operation name
     • Operation variants:
       • Signed and unsigned
       • Operand and result sizes: 8-bit, 16-bit, 32-bit, 64-bit
       • Support different sizes among operand(s) or result
       • Vector: 64v8, 64v16, 64v32, 32v8, 32v16
     • Data path to each operand/result
     In Instruction:
     • Operands' encoding formats
     • Actual operands

  9. RCU Architecture
     [Block diagram. Labeled components: Packet Buffers; Registers, Scratch Memory; RSF; RSF Connector; "Back-side" Ports; Data Memory; Instruction Cache; Pointer File; Dictionary; Data Flow Synchronization; Source Route (x4); Address Calculation; Dictionary Decode; Pipeline & Thread Control; Execution Unit (x4)]
     • 5-stage pipeline
     • 4-way multi-threaded
     • Hardware RSF synchronization
     • 128-bit reconfigurable address path
     • 256-bit reconfigurable data path

  10. Roles of Compiler for VISC Architecture
      • Determine the instruction set to store in the dictionary for the best execution-time performance
      • Generate optimized code sequence based on that instruction set
      • Cater to various hardware limitations:
        • Dictionary limit
        • Data path constraints
        • Dictionary and instruction encoding constraints

  11. New Compilation Approach: Configurable Code Generation
      • Exact form of generated instructions decided in the last instruction scheduling phase
      • Direct result of instruction compaction based on what is allowed by the hardware

  12. Compiler Implementation Method
      • Retarget SGI Pro64 (Open64) compiler to an Abstract Machine
        • Code generator operates on an Abstract Operation Representation
        • Code generation optimizations left intact
      • Add new Instruction and Dictionary Finalization (IDF) phase as post-pass
        IDF Phase 1:
        • Instruction scheduling and folding
        • Abstract operations converted to target code sequence
        IDF Phase 2:
        • Output VISC instructions and dictionary entries

  13. Compiler Phase Structure
      [Phase diagram. Labels in order of appearance: C; GNU / Pro64™ Front-end; WHIRL; Optimizer; Pro64™ Back-end; Code Generator; IDF; Assembly Program (Instructions, Dictionary)]

  14. Abstract Operation Representation (AOR)
      Each operation corresponds to a micro-operation in the core execution units
      • RISC-like formats:
        • r1 = op r2, r3
        • r2 = load <offset>(<base>)
        • store r2 <offset>(<base>)
        • r1 = loadimm <imm>
      • Optimizations in AOR reflected in final code
      • No pre-disposition of compiler to any specific instruction format

  15. Multiple AOR ops can be combined into a single target operation
      Operations taking immediate operand:
          r2 = move <imm>
          r3 = add r1, r2          =>   r3 = addi r1 <imm>
      Operations supporting memory operands:
          r2 = load 4(sp)
          r3 = add r1, r2          =>   r3 = add r1 4(sp)
      Post increment/decrement memory operations:
          r2 = load 0(r1)
          r1 = addi r1, 4          =>   r2 = load 0(r1++)
      Branches on condition codes:
          r1 = add r2, r3
          . . .
          compare (r1 != 0)
          br.z label               =>   r1 = add r2, r3
                                        br.z label   (only if immediately after)
      Others
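
As a rough illustration of the first folding pattern (a `move` of an immediate absorbed into the following `add`), here is a minimal peephole sketch. It is not the IDF implementation, which folds during scheduling. The tuple representation, the function name `fold_immediates`, and the rule that the temporary must have exactly one use in the block and be dead afterwards are assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical AOR-like operation: (destination, opname, sources)
Op = Tuple[str, str, Tuple[str, ...]]

def fold_immediates(ops: List[Op]) -> List[Op]:
    """Fold `t = move <imm>; d = add s, t` into `d = addi s <imm>`.

    Assumes the source of every `move` is an immediate and that a
    temporary used once in the block is not live after the block.
    """
    # Count how often each register is read in the block.
    uses: Dict[str, int] = {}
    for _, _, srcs in ops:
        for s in srcs:
            uses[s] = uses.get(s, 0) + 1

    out: List[Op] = []
    i = 0
    while i < len(ops):
        dest, name, srcs = ops[i]
        if name == "move" and i + 1 < len(ops) and uses.get(dest, 0) == 1:
            ndest, nname, nsrcs = ops[i + 1]
            if (nname == "add" and len(nsrcs) == 2
                    and nsrcs[1] == dest and nsrcs[0] != dest):
                # The move is subsumed: its immediate becomes an operand.
                out.append((ndest, "addi", (nsrcs[0], srcs[0])))
                i += 2
                continue
        out.append(ops[i])
        i += 1
    return out
```

The other patterns on the slide (memory operands, post-increment, condition-code branches) would follow the same shape with different match conditions.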

  16. IDF Approach
      Instruction scheduling + following tasks:
      • Instruction folding
      • Opcode selection
      • Modelling of irregular hardware constraints
      • Modelling of encoding constraints
      • Monitoring of states of condition codes and transient registers
      • Keeping track of dictionary contents
      Use enumeration (branch and bound) approach

  17. Example of IDF Processing
      Input:
          $w80 = move 0x55
          $w91 = move 0xf8
          $w70 = add $w70, $w80
          $w71 = xor $w92, $w80
          $w90 = sub $w92, $w91
          store 8($p1) = $w90
      Dictionary (entry 3): add | xor | sub | nop
      Instruction: op3 8($p1) $w70 0x55 0xf8
      • move and store instructions subsumed
      • $w71, $w92 mapped to transient registers

  18. IDF Scheduling Algorithm
      Input: Sequence of operations in BB
      Flow:
      1. Start: estimate initial bound_sch
      2. Search for schedule with length <= bound_sch
      3. Succeed? If yes, end. If no, set bound_sch = bound_sch + 1 and repeat step 2
      To speed up the search, shrink the solution space by:
      • Coming up with a high initial bound_sch
      • Pruning useless search paths continuously
      Tight hardware constraints help
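
The bound-raising loop on this slide can be sketched as follows. This is a toy model under stated assumptions, not the IDF scheduler: operations have unit latency, the only hardware constraint is an issue width, the dependence graph is acyclic, and the initial bound is the larger of the critical-path length and the resource bound. The names `schedule_block`, `search`, and `critical_path` are made up for the example.

```python
from itertools import combinations
from math import ceil
from typing import Dict, FrozenSet, List, Optional, Set

def critical_path(deps: Dict[str, Set[str]]) -> int:
    """Length of the longest dependence chain (unit latency assumed)."""
    memo: Dict[str, int] = {}
    def depth(op: str) -> int:
        if op not in memo:
            memo[op] = 1 + max((depth(p) for p in deps[op]), default=0)
        return memo[op]
    return max((depth(op) for op in deps), default=0)

def search(deps: Dict[str, Set[str]], width: int, bound: int,
           done: FrozenSet[str],
           schedule: List[List[str]]) -> Optional[List[List[str]]]:
    """Enumerate schedules depth-first, pruning any path longer than bound."""
    if len(done) == len(deps):
        return schedule
    if len(schedule) == bound:
        return None  # prune: bound reached with operations still unscheduled
    ready = [op for op in deps if op not in done and deps[op] <= done]
    # Try the fullest instructions first so short schedules are found early.
    for k in range(min(width, len(ready)), 0, -1):
        for group in combinations(ready, k):
            found = search(deps, width, bound,
                           done | set(group), schedule + [list(group)])
            if found is not None:
                return found
    return None

def schedule_block(deps: Dict[str, Set[str]], width: int) -> List[List[str]]:
    """Estimate a bound, search within it, and raise it by one on failure."""
    if not deps:
        return []
    bound = max(ceil(len(deps) / width), critical_path(deps))
    while True:
        found = search(deps, width, bound, frozenset(), [])
        if found is not None:
            return found
        bound += 1
```

Because the initial bound is a lower bound on any valid schedule length, the first schedule found is also the shortest one under this toy model.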

  19. Managing the Dictionary
      • Dictionary usage increases due to:
        • Program size: more variety of operations
        • High ILP: more combinations of operations
        • Library code linked in
      • Currently, dictionary contents fixed for each executable
      • Role of linker:
        • Merge dictionary entries with identical contents across files/libraries
        • Error message on dictionary overflow
      • Role of compiler:
        • Maximize dictionary entry re-use
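
The linker's role described here amounts to deduplicating entries and remapping opcodes. A minimal sketch follows; the entry representation (a tuple of operation names), the function name `merge_dictionaries`, and the exception type are assumptions for illustration, and only the 256-entry limit comes from the slides.

```python
from typing import Dict, List, Tuple

DICT_LIMIT = 256            # dictionary capacity stated on slide 6

Entry = Tuple[str, ...]     # hypothetical: an entry as a tuple of operation names

class DictionaryOverflow(Exception):
    """Raised when the merged dictionary would exceed its capacity."""

def merge_dictionaries(object_dicts: List[List[Entry]],
                       limit: int = DICT_LIMIT):
    """Merge per-file dictionaries, sharing entries with identical contents.

    Returns the merged dictionary and, for each input file, a list that
    maps its old entry index to the index in the merged dictionary.
    """
    merged: List[Entry] = []
    index_of: Dict[Entry, int] = {}
    remaps: List[List[int]] = []
    for entries in object_dicts:
        remap: List[int] = []
        for entry in entries:
            if entry not in index_of:
                if len(merged) >= limit:
                    raise DictionaryOverflow(
                        f"more than {limit} distinct dictionary entries")
                index_of[entry] = len(merged)
                merged.append(entry)
            remap.append(index_of[entry])
        remaps.append(remap)
    return merged, remaps
```

Each file's instructions would then have their opcodes rewritten through the corresponding remap list.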

  20. Dictionary Compilation Strategy
      • Keep track of existing dictionary entries during compilation
      • Extract dictionary entries from:
        • Libraries and .s files being linked
        • .o files compiled before current file
        Example: cc a.c b.o c.s
      • Maintain table of existing dictionary entries
        • Add to table as new entries are generated
      • Re-use existing dictionary entries
      • Bias scheduling towards dictionary conservation as dictionary fills up

  21. User Control of Dictionary Compilation
      Best program performance demands a near-full dictionary.
      When the dictionary overflows, the program needs to be re-compiled.
      Provide user control mechanisms:
      • Trade-off between dictionary consumption and program performance
      • Command line option: -CG:dict_usage=n, n = 0…10
      • Embedded in code: #pragma dict_usage n
      dict_usage is a dictionary budget guideline for IDF
      • Low dict_usage:
        • Fewer new dictionary entries created
        • Low ILP
      • High dict_usage:
        • Tighter instruction schedule
        • More dictionary entries created

  22. IDF Support of dict_usage
      • Additional search goal bound_dict
        • Number of new dictionary entries allowed for current BB
        • Automatically adjusted lower with more pre-existing entries
      • When bound_dict is reached during enumeration, disallow creating a new dictionary entry (unless single operation)
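
The per-block budget check can be sketched as below. How the bound is derived from dict_usage and from the number of pre-existing entries is not given in the slides, so here it is simply passed in. The class name, the entry representation, and the absence of backtracking (a real enumeration would undo tentative entries when a search path is abandoned) are simplifications assumed for this example.

```python
from typing import Iterable, Set, Tuple

Entry = Tuple[str, ...]   # hypothetical: an entry as a tuple of operation names

class BlockDictBudget:
    """Tracks new dictionary entries created while scheduling one basic block."""

    def __init__(self, existing: Iterable[Entry], bound_dict: int):
        self.existing: Set[Entry] = set(existing)
        self.bound_dict = bound_dict   # new entries allowed for this block
        self.new_entries = 0

    def try_use(self, entry: Entry) -> bool:
        """Return True if the scheduler may emit an instruction using entry."""
        if entry in self.existing:
            return True                # re-use costs nothing
        if self.new_entries >= self.bound_dict and len(entry) > 1:
            return False               # budget spent: only single operations pass
        self.existing.add(entry)
        self.new_entries += 1
        return True
```

Allowing single-operation entries even after the budget is spent keeps every operation schedulable, at the cost of lower ILP.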

  23. Experimental Results
      Summary (with dict_usage=10):
      • ILP from IDF scheduling: 1.38 ops per instruction
      • ILP from relaxed scheduling: 1.51 ops per instruction
      • 23% of all subsumable operations subsumed
      • Each dictionary entry referred to by 2.63 instructions (statically)
      • Scheduling via enumeration: 100 times slower than one-pass schedulers
      • Compilation time: 1 to 2 minutes per program

  24. Concluding Remarks
      • VISC approach most suitable for embedded processors
        • Limited program size
        • Dictionary space less of an issue
        • Slow compilation tolerable
        • CISC-style instructions enable small code size
      • Compilation support key to deploying applications on VISC
        • Very hard to write in assembly language
        • Advanced optimizations performed by compiler
        • Dictionary managed by compiler with user hints
      • Compile-time configurable code generation enables RISC compilation techniques to generate CISC output
