
A Framework for Studying Effects of VLIW Instruction Encoding and Decoding Schemes

A Framework for Studying Effects of VLIW Instruction Encoding and Decoding Schemes. Anup Gangwar. November 28, 2001. Overview. The VLIW code size expansion problem What all such a framework needs to support? Trimaran compiler infrastructure The HPL-PD architecture



  1. A Framework for Studying Effects of VLIW Instruction Encoding and Decoding Schemes Anup Gangwar November 28, 2001

  2. Overview • The VLIW code size expansion problem • What such a framework needs to support • Trimaran compiler infrastructure • The HPL-PD architecture • Extensions to the various modules of Trimaran • Results • Future work • Acknowledgements

  3. Choices for exploiting ILP • The architectural choices for utilizing ILP • Superscalar processors • Try to extract ILP at run time • Complex hardware • Limited clock speeds and high power dissipation • Not suited for embedded applications • VLIW processors • Compiler has a lot of knowledge about the hardware • Compiler extracts ILP statically • Simplified hardware • Possible to attain higher clock speeds

  4. Problems with VLIW processors • Complex compiler required to extract ILP from the application program • Requires adequate hardware support for compiler-controlled execution • Code size expansion due to explicit NOPs if: • The application does not contain enough parallelism • The compiler is not able to extract parallelism from the application • Need for good instruction encoding and NOP compression schemes
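
The code size expansion described on this slide can be illustrated with a small calculation. The sketch below is only an illustration: the 5-slot issue width, the 32-bit operation size, and the sample schedule are assumptions chosen for the example, not figures from the presentation.

```python
def uncompressed_bits(schedule, issue_width, op_bits=32):
    """Bits used when every empty issue slot holds an explicit NOP."""
    return len(schedule) * issue_width * op_bits


def useful_bits(schedule, op_bits=32):
    """Bits occupied by real (non-NOP) operations only."""
    return sum(len(cycle) for cycle in schedule) * op_bits


# Hypothetical schedule: each inner list holds the ops issued in one cycle.
schedule = [["ADD_W", "L_W_C1_C1"], ["ADD_W"], [], ["BRU"]]

total = uncompressed_bits(schedule, issue_width=5)
useful = useful_bits(schedule)
nop_fraction = 1 - useful / total  # share of the code that is NOP padding
```

With little parallelism in the schedule, most of the instruction memory holds NOPs, which is what motivates NOP compression.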

  5. What should such a framework support? • Quick retargetability • Studying the effect of a particular instruction encoding and decoding scheme on processor performance • Studying the code size reduction due to a particular instruction encoding scheme • Studying the memory bandwidth requirements imposed by a particular instruction decoding scheme

  6. Trimaran Compiler Infrastructure • Flow: C Program, IMPACT, Bridge Code, ELCOR, ELCOR IR, SIMULATOR, STATISTICS (ELCOR is parameterized by the HMDES Machine Description) • IMPACT • ANSI C Parsing • Code profiling • Classical machine independent optimizations • Basic block formation • ELCOR • Machine dependent code optimizations • Code scheduling • Register allocation • SIMULATOR • ELCOR IR to low level C files • HPL-PD virtual machine • Cache simulation • Performance statistics • STATISTICS • Compute and stall cycles • Cache stats • Spill code info

  7. Various modules of Trimaran - 1 • IMPACT • Developed by UIUC’s IMPACT group • Trimaran uses only the IMPACT front-end • Classical machine independent optimizations • Outputs a low level IR, Trimaran bridge code • ELCOR • Developed by HPL’s CAR group • It is the compiler backend • Performs register allocation and code scheduling • Parameterized by HMDES machine description • Outputs ELCOR IR with annotated HPL-PD assembly

  8. Various modules of Trimaran - 2 • HMDES • Developed by UIUC’s IMPACT group • Specifies resource usage and latency information for an arch. • Input is translated to a low level representation • Has efficient mechanisms for querying the database • Does not specify instruction format information • HPL-PD Simulator • Developed by NYU’s REACT-ILP group • Converts ELCOR’s annotated IR to low level C representation • Processor performance and cache simulation • Generates statistics and execution trace

  9. Various modules of Trimaran - 3 Example ELCOR Operation in IR Op 7 ( ADD_W [ br<11 :I gpr 14>] [br<27 :I gpr 14> I<1> ] p<t> s_time( 3 ) s_opcode( ADD_W.0 ) attr(lc ^52) flags( sched ) )

  10. Various modules of Trimaran - 4 • HMDES Sections • Field_Type e.g. REG, Lit etc. • Resource e.g. Slot0, Slot1 etc. • Resource_Usage e.g. RU_slot0 time( 0 ) • Reservation_Table e.g. RT_slot0 use( Slot0 ) • Operation_Latency e.g. lat1 ( time( 1 ) ) • Scheduling_Alternative e.g. (format(std1) resv(RT1) latency(lat1) ) • Operation e.g. ADD_W.0 ( Alt_1 Alt_2 ) • Elcor_Operation e.g. ADD_W( op( “ADD_W.0” “ADD_W.1” ) )

  11. Various modules of Trimaran - 5 HPL-PD Simulator in detail [Diagram labels: REBEL • Low level C files • C libraries • Emulation Library Code • Processor HMDES • Native Compiler • Executable for the host platform]

  12. Various modules of Trimaran - 6 HPL-PD Simulator in detail [Diagram labels: HPL-PD Virtual Machine (Fetch Next Instruction, Fetch Data, Execute Instruction) • Instruction Accesses • Data Accesses • Dinero IV Cache Simulator (Level I Instruction-Cache, Level I Data-Cache, Level II Unified Cache)]

  13. The HPL-PD architecture • Parameterized ILP architecture from HP Labs • Possible to vary, • Number and types of FUs • Number and types of registers • Width of instruction words • Instruction latencies • Predicated instruction execution • Compiler visible cache hierarchy • Result multicast is supported for predicate registers • Run time memory disambiguation instructions

  14. The HPL-PD memory hierarchy • Levels: Registers, L1 Cache, Data Prefetch Cache, L2 Cache, Main Memory • Data Prefetch Cache • Independent of L1 Cache • Used to store large amounts of cache-polluting data • Doesn’t require a sophisticated cache replacement mechanism

  15. The Framework [Diagram labels: HMDES • Decoder Model • TRIMARAN • ASSEMBLER (using NJMC) • Obj. File • Code Size • DISASSEMBLER (using NJMC) • Instruction Address or Next Instr Request • Instruction Address • Bytes Fetched • Perf. Stats • Cache Stats]

  16. Studying impact on performance • Modeling the decompressor in HMDES: • Add a new resource with the latency of the decoder • Add a new resource usage section for this decoder • Add this resource usage to all the HPL-PD operations • In the results there are two decompressor units with latency = 1 • The decompressor latency should be estimated or obtained from actual simulation

  17. Studying code size minimization - 1 A simple template-based instruction encoding scheme • Issue slots: IALU.0, IALU.1, FALU.0, MU.0, BU.0, ... • Format: MUL_OP field followed by OPCODE & OPERANDS for each uni-op • Example (ADD_W and L_W_C1_C1): MUL_OP = 00010, then IOP ; Sgpr1, Slit1, Dgpr2 and MemOP ; Sgpr1, Dgpr1 • Multi-ops are decided after profiling the generated assembly code • Multi-op field encodes: • Size and position of each Uni-op • Number, size and position of operands of each Uni-op
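
A template-based scheme of this kind can be sketched as follows. This is a minimal illustration, not the framework's implementation: the 5-bit multi-op field matches the `00010` example on the slide, but the uni-op widths in `UNIOP_BITS`, the slot-to-class representation, and the function names are all assumptions.

```python
from collections import Counter

TEMPLATE_BITS = 5                        # width of the multi-op field ("00010")
UNIOP_BITS = {"IOP": 40, "MemOP": 12}    # assumed uni-op widths, for illustration


def pattern(instr):
    """Slot-usage pattern of one VLIW instruction ({slot: uni-op class})."""
    return tuple(sorted(instr.items()))


def build_templates(profiled_instrs):
    """Give template ids to the most frequent patterns seen while profiling."""
    counts = Counter(pattern(i) for i in profiled_instrs)
    ranked = [p for p, _ in counts.most_common(2 ** TEMPLATE_BITS)]
    return {p: idx for idx, p in enumerate(ranked)}


def encoded_bits(instr, templates):
    """Size of an instruction: template id plus only the occupied slots."""
    p = pattern(instr)
    if p not in templates:
        raise KeyError("no template covers this slot pattern")
    return TEMPLATE_BITS + sum(UNIOP_BITS[cls] for _, cls in p)


add_and_load = {"IALU.0": "IOP", "MU.0": "MemOP"}
add_only = {"IALU.0": "IOP"}
templates = build_templates([add_and_load, add_and_load, add_only])
```

Empty issue slots cost nothing here because the template id alone says which slots are present, which is the source of the code size saving.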

  18. Studying code size minimization - 2 • Instrumenting ELCOR to generate assembly code 1. Arrange all the ops in the IR in forward control order 2. Choose the next basic block (BB) and initialize cycle to 0 3. Walk the ops of this BB and dump those with s_time = cycle, incrementing cycle until all ops of the BB are dumped 4. If any BBs are left, go to step 2 5. Dump the global data • Actual instruction encoding is done using procedures created by NJMC
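
The dumping steps on this slide can be sketched as below. The data layout (dicts with `label`, `ops`, `text`, and `s_time` keys), the `NOP` line for empty cycles, and the function name are hypothetical; the sketch also assumes the basic blocks are already in forward control order (step 1).

```python
def dump_assembly(basic_blocks, global_data):
    """Emit one line per scheduled cycle of each basic block, then the data."""
    lines = []
    for bb in basic_blocks:                        # steps 2 and 4: next BB
        lines.append(f"{bb['label']}:")
        ops = bb["ops"]
        last_cycle = max((op["s_time"] for op in ops), default=-1)
        for cycle in range(last_cycle + 1):        # step 3: walk cycle by cycle
            issued = [op["text"] for op in ops if op["s_time"] == cycle]
            if issued:
                lines.append("  " + " ; ".join(issued))
            else:
                lines.append("  NOP")              # assumed marker for an empty cycle
    lines.extend(global_data)                      # step 5: dump the global data
    return lines


example_blocks = [{
    "label": "BB1",
    "ops": [
        {"text": "ADD_W r1", "s_time": 0},
        {"text": "L_W r2", "s_time": 0},
        {"text": "BRU", "s_time": 2},
    ],
}]
listing = dump_assembly(example_blocks, [".data x"])
```

Ops that share an `s_time` end up on one line, so each line corresponds to one VLIW instruction.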

  19. Studying code size minimization - 3 The New Jersey Machine Code Toolkit • Deals with bits at a symbolic level • Can be used to write assemblers, disassemblers etc. • Supports concatenation to emit large binary data • Representation is specified in SLED • Has been used to write assemblers for Sparc, i486 etc. • VLIW instructions need to be broken up into tokens of at most 32 bits • Emitted binary data must end on an 8-bit boundary
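
The last two constraints can be illustrated with a small helper. This is a sketch under assumptions: the bit-string representation, zero padding, and the function name are chosen for the example and are not part of the toolkit.

```python
def split_into_tokens(bits, max_token_bits=32):
    """Pad a '0'/'1' string to a byte boundary, then cut it into tokens.

    Each token holds at most max_token_bits bits; zero padding is an
    assumption made for this illustration.
    """
    padded_len = -(-len(bits) // 8) * 8            # round up to a multiple of 8
    bits = bits.ljust(padded_len, "0")
    return [bits[i:i + max_token_bits]
            for i in range(0, len(bits), max_token_bits)]


tokens = split_into_tokens("1" * 57)               # e.g. a 57-bit multi-op
```

A 57-bit instruction becomes two 32-bit tokens here, with 7 padding bits at the end of the second one.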

  20. Studying code size minimization - 4 Machine specifications in SLED
      bit 0 is least significant
      fields of TOK32 (32) Dgpr_1 0:3 Slit_1_part1 4:31
      fields of TOK8 (16) Slit_1_part2 0:3 Sgpr_1 4:7 IOP 8:11 tmpl 12:14
      patterns
        IOP_pats is any of [ ADD MUL SUB ], which is tmpl = 1 & IOP = { 0 to 2 }
      constructors
        IOP_pats Sgpr_1, Slit_1, Dgpr_1 is IOP_pats & Sgpr_1 & Slit_1_part2 = Slit_1 @[28:31]; Slit_1_part1 = Slit_1 @[0:27] & Dgpr_1
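
To make the bit layout concrete, the sketch below packs the same fields by hand. It is not output of the toolkit: the function names are hypothetical, the opcode numbering ADD = 0, MUL = 1, SUB = 2 is read off the pattern `IOP = { 0 to 2 }`, and the literal field of the 32-bit token is assumed to be the declared `Slit_1_part1`.

```python
IOP_CODES = {"ADD": 0, "MUL": 1, "SUB": 2}   # order of [ ADD MUL SUB ]


def field(value, lo, hi):
    """Place value into bits lo..hi (bit 0 is least significant)."""
    width = hi - lo + 1
    if not 0 <= value < (1 << width):
        raise ValueError(f"value {value} does not fit in {width} bits")
    return value << lo


def encode_iop(name, sgpr_1, slit_1, dgpr_1):
    """Return (16-bit token, 32-bit token) for one IOP constructor."""
    if not 0 <= slit_1 < (1 << 32):
        raise ValueError("Slit_1 must fit in 32 bits")
    slit_part1 = slit_1 & 0x0FFFFFFF          # Slit_1 @[0:27]
    slit_part2 = (slit_1 >> 28) & 0xF         # Slit_1 @[28:31]
    tok16 = (field(1, 12, 14)                 # tmpl = 1
             | field(IOP_CODES[name], 8, 11)  # IOP
             | field(sgpr_1, 4, 7)            # Sgpr_1
             | field(slit_part2, 0, 3))       # Slit_1_part2
    tok32 = field(slit_part1, 4, 31) | field(dgpr_1, 0, 3)
    return tok16, tok32


tok16, tok32 = encode_iop("ADD", sgpr_1=3, slit_1=0x12345678, dgpr_1=5)
```

The 32-bit literal does not fit in one token next to the register field, which is why the specification splits it into a 28-bit and a 4-bit part.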

  21. Studying code size minimization - 5 Toolkit encoder output
      ADD( unsigned Sgpr_1, unsigned Slit_1, unsigned Dgpr_1 );
      MUL( unsigned Sgpr_1, unsigned Slit_1, unsigned Dgpr_1 );
      SUB( unsigned Sgpr_1, unsigned Slit_1, unsigned Dgpr_1 );
      Specifying matcher for disassembler
      match
      | ADD( Sgpr_1, Slit_1, Dgpr_1 ) => //Do something
      | MUL( Sgpr_1, Slit_1, Dgpr_1 ) => //Do something
      | SUB( Sgpr_1, Slit_1, Dgpr_1 ) => //Do something
      endmatch
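
What such a matcher has to do for the IOP layout given in the SLED slide can be sketched by hand. The function name, the tuple return value, and the opcode numbering ADD = 0, MUL = 1, SUB = 2 are assumptions; a toolkit-generated matcher would look different.

```python
IOP_NAMES = {0: "ADD", 1: "MUL", 2: "SUB"}    # assumed from IOP = { 0 to 2 }


def bits(word, lo, hi):
    """Extract bits lo..hi of word (bit 0 is least significant)."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def match_iop(tok16, tok32):
    """Decode one IOP instruction into (name, Sgpr_1, Slit_1, Dgpr_1)."""
    if bits(tok16, 12, 14) != 1:              # tmpl must select the IOP template
        raise ValueError("not an IOP template")
    opcode = bits(tok16, 8, 11)
    if opcode not in IOP_NAMES:
        raise ValueError(f"unknown IOP opcode {opcode}")
    sgpr_1 = bits(tok16, 4, 7)
    # Reassemble the literal from its 4-bit high part and 28-bit low part.
    slit_1 = (bits(tok16, 0, 3) << 28) | bits(tok32, 4, 31)
    dgpr_1 = bits(tok32, 0, 3)
    return IOP_NAMES[opcode], sgpr_1, slit_1, dgpr_1


decoded = match_iop(0x1031, 0x23456785)
```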

  22. Studying code size minimization - 6 • The matcher application needs functions for fetching data • Bit ordering is different on little and big endian machines • The matcher fails when a large number of complex templates is given • Breaking large multi-ops across 32-bit tokens makes the representation messy and error prone • Specifying addresses for forward branches requires two passes
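
The need for two passes comes from variable-length instructions: the address of a forward branch target is unknown until every instruction before it has been sized. A minimal sketch, with a hypothetical program representation and instruction sizes:

```python
def assemble(program):
    """Resolve branch targets in two passes.

    program items are ("label", name) or ("op", mnemonic, size_bytes, target),
    where target is a label name or None. This layout is illustrative only.
    """
    labels = {}
    addr = 0
    for item in program:                      # pass 1: record label addresses
        if item[0] == "label":
            labels[item[1]] = addr
        else:
            addr += item[2]

    resolved = []
    addr = 0
    for item in program:                      # pass 2: emit with known targets
        if item[0] == "op":
            _, mnemonic, size, target = item
            target_addr = labels[target] if target is not None else None
            resolved.append((addr, mnemonic, target_addr))
            addr += size
    return resolved


listing = assemble([
    ("op", "BRU", 4, "done"),                 # forward branch
    ("op", "ADD_W", 6, None),
    ("label", "done"),
    ("op", "SUB_W", 6, None),
])
```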

  23. Studying impact on memory bandwidth - 1 The Typical VLIW Pipeline [Diagram labels: Instruction Fetch • Instruction Decode (Align, Decompress, Decode) • DF/AG • Execute • Store Results]

  24. Studying impact on memory bandwidth - 2 • Cache simulation requires generating: • The instruction address • The number of bytes to fetch • The instruction address can be generated by disassembling the instructions at run time and keeping track of jumps • The matcher application returns the number of bytes required to disassemble an instruction • The disassembled instruction can be compared with the issued instruction to check correctness
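
A simplified sketch of how such an (address, bytes) trace could be produced is shown below. The `code` dictionary stands in for the disassembler, every listed branch is assumed taken, and all names and sizes are hypothetical; a real run would take branch outcomes from the simulator.

```python
def fetch_trace(code, start, max_steps=1000):
    """Generate (instruction address, bytes fetched) pairs for a cache simulator.

    code maps an address to (size_bytes, branch_target or None); the size is
    what a matcher would report after disassembling the instruction there.
    """
    trace = []
    pc = start
    while pc in code and len(trace) < max_steps:
        size, target = code[pc]
        trace.append((pc, size))
        # Follow the jump if there is one, otherwise fall through.
        pc = target if target is not None else pc + size
    return trace


code = {0: (6, None), 6: (4, 12), 10: (6, None), 12: (2, None)}
trace = fetch_trace(code, start=0)
```

Because compressed instructions differ in length, the bytes-to-fetch value changes per instruction, which is what ties the encoding scheme to memory bandwidth.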

  25. Studying impact on memory bandwidth - 3 • Run time verification of disassembled instructions can be turned off for faster simulation • Due to the restricted size of the matcher, results could not be obtained for larger programs • Memory access addresses and bytes to fetch have been generated by hand for the SumToN application

  26. Results - Impact on code size (Strcpy)

  27. Results - Impact on code size (SumToN)

  28. Results - Size of SLED specification for various archs.

  29. Results - Cache performance comparison (SumToN)

  30. Future work • Need for automation in most parts of the framework • Better representation for VLIW instructions than SLED • Unlimited token size • Facility to bind one field with multiple patterns • Methodology for predicting latency for decompressor • Framework for finding the optimal instruction formats

  31. Acknowledgements • Prof. M.Balakrishnan and Prof. Anshul Kumar • Rodric M. Rabbah, Georgia Institute of Technology • Shail Aditya, HP Labs • All the friends at Philips Lab. for stimulating discussions
