Compiling for VIRAM

Compiling for VIRAM Dave Judd Kathy Yelick Rich Fromm David Martin Computer Science Division UC Berkeley

VIRAM/Cray Compiler vcc • VIRAM/Cray vectorizing compiler • Production compiler • Used on the T90, C90, as well as the T3D and T3E • Being ported (by SGI/Cray) to the SV2 architecture • Has C, C++, and Fortran front-ends (focus on C) • Extensive vectorization capability • VIRAM code generator based on new SV2 code generator • SV2 code gen being developed in parallel w/ VIRAM • SV2 vector architecture similar to VIRAM • SV2 scalar similar to T90 with more registers

VIRAM/Cray Compiler Status Frontends Optimizer Code Generators • MIPS backend developed by last retreat • Compiled code for a commercial test suite • VIRAM vector support developed since last retreat • Compiles & executes a commercial test suite • Compiles and executes a Cray vector test suite • Remaining issues with scheduler and memory consistency model C T3D/T3E C++ PDGCS C90/T90 Fortran95 SV2/VIRAM

Codegen/optimizer issues for VIRAM • Variable virtual processor width (VPW) • Variable vector register length (MVL) • Vector flag registers treated as 1 bit wide vector register • Multiple base, incr, stride regs. + autoincrement • Fixed point arithmetic (saturating add, etc.) • Memory Consistency

Virtual Processors (vl) VP0 VP1 VPvl-1 vr0 vr1 Data Registers vr31 vpw Vector Architectural State • Number of VPs given by the Vector Length register vl • Width of each VP given by the register vpw • vpw is one of {8b,16b,32b,64b} • Maximum vector length is given by a read-only register mvl • mvl depends on implementation and vpw: {NA,128,64,32} in VIRAM-1

Generating Code for Variable VPW • Strategy: vectorizer determines minimum correct vpw for each loop nest • Vectorizer assumes vpw=64 initially • At end of vectorization, discard vectorized copy of loop if greatest width encountered is less than 64 and start vectorization over with new vpw. • Code gen checks vpw for each loop nest. • Limitation: a single loop nest will run at the speed of the widest type. • Reason: simplicity & performance of the common case • No attempt to split/combine loops based on vpw

Generating Code for Variable MVL • Maximum vector length is not specified in IRAM ISA. • However, compiler assumes mvl at compile time • mvl based on vpw • mvl assumption dependent on VIRAM-1 hardware implementation • Recompiling required for future hardware versions if mvl changes • MVL knowledge useful for code gen and vectorizer: • register spilling • short loop vectorization • length-dependent vectorization ( and may eliminate safe vector length computation at run time) for (i = 0; i < n; i=++) a[i] = a[i+32]

Vector Flag Registers • Vector flag (mask) register treated as vector register • Bit width of 1 • Flag registyer under control of vector length • Can spill/reload directly to memory • Optimizer and code gen issues to handle correctly

Multiple Base, Incr, Stride Registers • Dedicated registers for vector memory references • 16 vbase, 8 vinc and 8 vstride registers • optional automatic increment of base register • Vectorizer/Codegen strategy: • Changed from computing base address each time thru loop to incrementing base address by vl *stride*multiplier • Define compiler temporary for each base address • Teach codegen to assign vbase, vinc and vstride registers as needed. • Trick code gen into handling multiple results for single vector load/store instruction. • Results in very clean vector loops with only vector instructions in inner loop + vl computation.

C Compiler Testing • “vector” regression test suite (CRAY) • Specifically tests for vectorization • Compares vector and scalar results • Easy to isolate problems • Status: • 56 of 62 tests pass • Some minor numerical differences • 3 failures w/ wrong answers • 1 failure causes assembler abort on bad instruction (caused by vinc autoincrement feature)

C Compiler Testing (cont.) • C regression test suite (industry standard suite) • Scalar emphasis, C conformance • All tests pass except: • errors with functions returning a structure larger than 16 bytes • errors with long double constants (128 bit floating point)

What Essential Features Remain • Finish instruction scheduler • Implement sync strategy • Support -n32 ABI • Double / long double

Instruction Scheduler • Instruction scheduler working, but needs: • Functional unit layout for VIRAM • Instruction latency and busy time for VIRAM • Support for chaining of vector instructions, including mask registers • Scheduler responsible for sync processing • vector - vector sync analysis is working • vector - scalar & scalar - vector analysis needed • synch instr. currently comments in assembler output

Chaining vadd.s $vr1,$vr3,$vr4 vabs.s $vr2,$vr1 With chaining: 0 1 2 3 4 5 6 7 . . . V1[0] R X X X X W V1[1] R X X X X W V1[2] R X X X X W V1[3] R X X X X W V2[0] R X X W V2[1] R X X W V2[2] R X X W V2[3] R X X W

Convert from -64 to -n32 ABI • Pointer and long type revert to 32 bits from 64 • VIRAM tools were n32 originally • Switched to -64 to accommodate compiler, to match sv2 • Revert to n32 to match vector addressing hardware • Change will allow some gather/scatter loops to execute faster (vpw=32 instead of vpw=64)

C++ • All components being generated now • “Include” files and libC library differences? SGI / Cray • C++ testing on SV2 simulator now • Testing/ problem isotation needed

Fortran • Fortran 95 frontend • FCD (fortran character descriptor) code gen support needed for VIRAM • Differences between IRIX and UNICOS libraries for I/O and array intrinsics • Testing needed

Other Future Compiler Features ? • VIRAM machine “target” • Support for speculative execution • Support Cray additions: • peephole optimizer • vector loop unrolling/ tiling • Compiler extensions for fixed point hardware

Vector functions • Define calling sequence conventions • Must be coded in assembler • Take advantage of C versions of Cray routines • Needed for some key benchmarks?

Memory consistency • Sync instructions: SaV VaS VaV vp RaW WaR WaW

VIRAM Tools • vas: assembler • vdis: disassembler • vsim-isa: simulator • vsim-db: debugger • vsim-p: performance simulator • vsim-sync:memory consistency simulator

vsim-sync • Intended for debugging and optimizing sync’s • Tells you when there is a data hazard (sync needed) • Tells you when a sync executed that didn’t prevent a hazard; • sync may not be needed • according to dynamic execution • sync may be needed on some other execution path

Summary • vcc is a reasonably robust compiler for VIRAM • All of the basic compiler elements are present now • Need to prioritize remaining work • Finish and tune scheduler • Implement sync strategy • C++ • Fortran • IRAM target

Compiling for VIRAM