Compiling for VIRAM
Kathy Yelick, Dave Judd, & Ronny Krashinsky
Computer Science Division, UC Berkeley
Compiling for VIRAM • Long-term success of VIRAM depends on a simple programming model, i.e., a compiler • Needs to handle a significant class of commercial applications • primary focus on multimedia, graphics, speech, and image processing • Needs to utilize hardware features for performance
Questions • Is there sufficient vectorization potential in commercial applications? • Can automatic vectorization find it? • How difficult is code generation for the VIRAM ISA? • How hard is optimization for the VIRAM-1 implementation?
IRAM Compilers • IRAM/Cray vectorizing compiler [Judd] • Production compiler • Used on the T90 and C90, as well as the T3D and T3E • Being ported (by SGI/Cray) to the SV2 architecture • Has C, C++, and Fortran front-ends (focus on C) • Extensive vectorization capability • outer loop vectorization, scatter/gather, short loops, … • VIRAM port is under way • IRAM/VSUIF vectorizing compiler [Krashinsky] • Based on VSUIF from Corinna Lee’s group at Toronto, which is based on MachineSUIF from Mike Smith’s group at Harvard, which is in turn based on the SUIF compiler from Monica Lam’s group at Stanford • This is a “research” compiler, not intended for compiling large, complex applications • It exists now!
IRAM/Cray Compiler Status (Diagram: C, C++, and Fortran frontends feed the PDGCS vectorizer, which feeds the C90 and IRAM code generators) • MIPS backend developed since last retreat • Compiles ~90% of a commercial test suite for code generation • Generated code is run through vas • Remaining issue with register spilling and loader • Vector instructions being added • Question: what information is “known” in each stage?
Vectorizing Mediaprocessing (Dubey): Kernel / Vector length
• Matrix transpose/multiply: # vertices at once
• DCT (video, comm.): image width
• FFT (audio): 256-1024
• Motion estimation (video): image width, image width/16
• Gamma correction (video): image width
• Haar transform (media mining): image width
• Median filter (image processing): image width
• Separable convolution (image processing): image width
(from http://www.research.ibm.com/people/p/pradeep/tutor.html)
Vectorization in SPECint95: Kernel / Vector length
• m88ksim: 5, 16, 32
• comp: 512
• decomp: 512
• lisp interpreter (gc): 80, 1000
• ijpeg: 8, 16, width/2, width
(from K. Asanovic, UCB PhD, 1998)
Automatic Vectorization • Vectorizing compilers very successful on scientific applications • not entirely automatic, especially for C/C++ • good tools for training users • Multimedia applications have • shorter vector lengths • can sometimes exploit outer loop vectorization for longer vectors • often leads to non-unit strides • tree traversals could be written as scatter/gather (breadth-first), although automating this is far from solved, e.g., for image compression
How Hard is Codegen for VIRAM? • Variable virtual processor width (VPW) • Variable vector register length (MVL) • Fixed-point arithmetic (saturating add, etc.) • Memory operations
Vector Architectural State (Diagram: virtual processors VP0, VP1, …, VP[vl-1], each with data registers vr0 … vr31 of width vpw) • Number of VPs given by the vector length register vl • Width of each VP given by the register vpw • vpw is one of {8b, 16b, 32b, 64b} • Maximum vector length is given by a read-only register mvl • mvl depends on implementation and vpw: {NA, 128, 64, 32} in VIRAM-1, so for the supported widths mvl × vpw is a constant 2048 bits per vector register
Generating Code for Variable VPW • Strategy used in IRAM/VSUIF: use a compiler flag
for (i = 0; i < N; i++) a[i] = b[i]*b[i];
for (i = 0; i < N; i++) c[i] = c[i]*2;
• Compile with the widest data type in the vector part of the program • a, b, c all arrays of double => vpw64 • any of a, b, c an array of double => vpw64 • a, b, c all arrays of single (float) => vpw64 ok, vpw32 faster • similarly for int • Limitation: a single file cannot contain loops that should use different vpw’s (see the sketch below) • Reason: simplicity
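As a worked illustration of the flag strategy (the array types and names here are assumptions, not from the slide): once any double appears in the file, the flag must select vpw64, and the float loop loses the chance to run at the faster vpw32.

double a[256], b[256];   /* 64-bit elements force vpw64 for this file */
float  c[256];           /* 32-bit elements: vpw32 would pack twice as many per register */

void kernels(void) {
    /* widest type in the file is double, so the flag selects vpw64 */
    for (int i = 0; i < 256; i++) a[i] = b[i] * b[i];
    /* correct at vpw64, but only half of each element slot is used */
    for (int i = 0; i < 256; i++) c[i] = c[i] * 2.0f;
}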
Example: Decryption (IDEA) • IDEA decryption operates on 16-bit ints • Compiled with IRAM/VSUIF (with unrolling by hand) (Chart: performance vs. # lanes)
Generating Code for Variable VPW • Planned strategy for IRAM/Cray: backend will determine minimum correct vpw for each loop nest
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    a[j][i] = ...;
    b[i][j] = ...;
  }
for (i = 0; i < N; i++) c[i] = c[i]*2;
• Code gen will do a pass over each loop nest • Limitation: a single loop nest will run at the speed of the widest type (illustrated below) • Reason: simplicity & performance of the common case • Assumes the vectorizer does not need vpw.
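A hedged sketch of the per-loop-nest rule (array types and names chosen for illustration): mixing 16-bit and 64-bit data in one nest forces the whole nest to the widest type, while a separate all-32-bit nest can use vpw32.

#define N 256
short  a[N][N];   /* 16-bit elements */
double b[N][N];   /* 64-bit elements */
float  c[N];      /* 32-bit elements */

void nests(void) {
    /* widest type in this nest is double, so the backend selects vpw64
     * for the whole nest, including the 16-bit stores to a[][] */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[j][i] = (short)b[i][j];

    /* a separate nest touching only 32-bit data can run at vpw32 */
    for (int i = 0; i < N; i++)
        c[i] = c[i] * 2.0f;
}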
Generating Code for Variable MVL • Maximum vector length is not specified in the IRAM ISA. • In the Cray compiler, mvl is known at compile time. • Different customer/software model (recompiling not onerous) • Three potential problems with unknown mvl • register spilling • not done in IRAM/VSUIF • short loop vectorization • not treated specially in IRAM/VSUIF • length-dependent vectorization:
for (i = 0; i < n; i++) a[i] = a[i+32];
whether this loop can be vectorized depends on the dependence distance of 32 relative to the vector length, i.e., on mvl
Register Spilling and MVL • Register spilling • normally reserve fixed space in stack frame • with unknown MVL, compiler can’t determine stack frame layout • Possible solutions • make MVL’s values visible in the architectural specification • use dynamic allocation for spilling registers • hybrid (sketched below) • use stack allocation as long as MVL is less than or equal to VIRAM-1’s size, namely 32 64-bit values • store “dope vector” in place otherwise • results in an extra branch when spilling, which is much better than dynamic allocation in the “common” case • proposal: work out details of hybrid, but only implement the VIRAM-1 case
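A minimal C sketch of the hybrid spill policy; the 32-slot frame size is from the slide, while store_vreg is a hypothetical stand-in for the actual store-vector-register instruction sequence.

#include <stdint.h>
#include <stdlib.h>

#define FRAME_SLOTS 32   /* VIRAM-1: mvl is 32 at 64-bit vpw */

/* Hypothetical stand-in for storing vector register 'vr' to memory. */
extern void store_vreg(int vr, uint64_t *dst, unsigned long mvl);

typedef struct {
    uint64_t slots[FRAME_SLOTS]; /* fixed in-frame spill area        */
    uint64_t *dope;              /* "dope vector" for a larger mvl   */
} spill_slot;

void spill(spill_slot *s, int vr, unsigned long mvl) {
    if (mvl <= FRAME_SLOTS) {    /* common case: just one extra branch */
        store_vreg(vr, s->slots, mvl);
    } else {                     /* future implementation with larger mvl */
        s->dope = malloc(mvl * sizeof(uint64_t));
        store_vreg(vr, s->dope, mvl);
    }
}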
Short Vectors and MVL • Strip-mining turns an arbitrary-length loop into a loop of vector operations
for (i = 0; i < n; i += mvl)
  for (ii = i; ii < i + mvl && ii < n; ii++)  /* inner loop becomes a vector instruction */
    c[ii] = a[ii] + b[ii];
• Not a good idea if n < mvl • this was problematic in evaluating IRAM/VSUIF, since strip-mining by hand (in source) produces poor quality code • short vectors are common in multimedia applications • Needed by the vectorizer (which does not know about vpw) • Cray compiler treats similar cases by generating multiple versions; doesn’t match our small-memory view • Plan: avoid dependence on MVL in the vectorizer • Choice of “good” transformations is the hard problem.
Using VIRAM Arithmetic • “Peak” performance depends on use of fused multiply-add, unlike Crays. Other compilers find these. • Fixed-point operations are not supported in HLLs (see the sketch below) • No plans for compiler support in VIRAM/Cray • although vectorizing over short data widths is possible in IRAM/VSUIF; likely in IRAM/Cray as well • Under what circumstances can one determine vpw/fixed-point optimizations? • IDCT does immediate round/pack after operations • Motion estimation does a min after diff/abs • Easier than discovering fixed point from floating point
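A minimal sketch of why this is hard (function name and types are illustrative): this is the scalar C idiom a programmer must write for a saturating 16-bit add, and the pattern a compiler would have to recognize before it could emit a single vector saturating-add instruction.

#include <stdint.h>

/* Saturating 16-bit add, spelled out in portable C: the sum is
 * computed at a wider width and clamped to the int16_t range. */
static int16_t sat_add16(int16_t a, int16_t b) {
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) s = INT16_MAX;
    if (s < INT16_MIN) s = INT16_MIN;
    return (int16_t)s;
}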
VIRAM Memory System • Multiple base/stride registers • IRAM/VSUIF only uses 1, but using more is not difficult • Autoincrement seems possible in principle and may be important for short-vector programs • Memory barriers/synchronization • Experience on Cray vector machines: serious performance impact • Only a single kind of (very expensive) barrier • Some instructions had barrier-like side effects • Ordered scatters: for sparse matrices? Pointer-chasing? (see the sketch below) • Answers may depend on particular compiler assumptions, rather than fundamental problems
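The sparse-matrix case behind the ordered-scatter question, as a hedged sketch (names are illustrative): if idx[] contains duplicates, the read-modify-write updates must complete in element order, which an ordered scatter would guarantee; a naive scatter/gather vectorization could lose updates.

/* Histogram-style sparse update: unsafe to vectorize with plain
 * scatter/gather when idx[i] == idx[j] for some i != j. */
void scatter_add(double *x, const int *idx, const double *v, int n) {
    for (int i = 0; i < n; i++)
        x[idx[i]] += v[i];
}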
VIRAM/VSUIF Matrix/Vector Multiply • VIRAM/VSUIF does reasonably well on long loops • 256×256 single-precision matrix • Compare to 1600 Mflop/s (peak without multiply-add) • Problems with • short loops • reductions (Chart: mvm and vmm performance)
Conclusions • Compiler support for short vectors and non-unit stride is key • Compilers can perform most of the transformations people are doing by hand on multimedia kernels • loop unrolling, interchange, outer loop vectorization, instruction scheduling • exceptions are fixed point and changing algorithms, e.g., DCT • Major challenge is choosing the right transformations • avoid pessimizing transformations • avoid code bloat • Good news: the performance of vector machines is much more easily captured by a compiler performance model than that of out-of-order superscalars with multi-level caches
Backup Slides: Applications and Compilers
Example: Decryption (IDEA) • IDEA decryption operates on 16-bit ints • Compiled with IRAM/VSUIF (with unrolling by hand) (Chart: performance in GOP/s vs. # lanes, by data width)
Operation & Instruction Count: RISC vs. “VSIW” Processor (Spec92fp, from F. Quintana, U. Barcelona)
Program    Operations (M): RISC / VSIW / R:V    Instructions (M): RISC / VSIW / R:V
swim256    115 / 95 / 1.1x                      115 / 0.8 / 142x
hydro2d    58 / 40 / 1.4x                       58 / 0.8 / 71x
nasa7      69 / 41 / 1.7x                       69 / 2.2 / 31x
su2cor     51 / 35 / 1.4x                       51 / 1.8 / 29x
tomcatv    15 / 10 / 1.4x                       15 / 1.3 / 11x
wave5      27 / 25 / 1.1x                       27 / 7.2 / 4x
mdljdp2    32 / 52 / 0.6x                       32 / 15.8 / 2x
VSIW reduces ops by 1.2X, instructions by 20X!
Revive Vector (= VSIW) Architecture!
• Cost: $1M each? Single-chip CMOS MPU/IRAM
• Low latency, high BW memory system? IRAM = low latency, high bandwidth memory
• Code density? Much smaller than VLIW/EPIC
• Compilers? For sale, mature (>20 years)
• Vector performance? Easy to scale speed with technology
• Power/Energy? Parallel to save energy, keep performance
• Scalar performance? Include modern, modest CPU: OK scalar (MIPS 5K vs. 10K)
• Real-time? No caches, no speculation: repeatable speed as input varies
• Limited to scientific applications? Multimedia apps vectorizable too: N*64b, 2N*32b, 4N*16b
Compiler Status: S98 Review • Plans to use SGI’s compiler had fallen through • Short-term effort identified: VIC • based on small extensions to C • mainly to ease benchmarking effort and get more people involved • end of summer deadline for completion • Intermediate term: SUIF • based on VSUIF from Toronto • retarget backend • Other possibilities: Java compiler (Titanium)
Vector Surprise • Use vectors for inner loop parallelism (no surprise) • One dimension of array: A[0,0], A[0,1], A[0,2], ... • think of machine as 32 vector regs each with 64 elements • 1 instruction updates 64 elements of 1 vector register • and for outer loop parallelism! • 1 element from each column: A[0,0], A[1,0], A[2,0], ... • think of machine as 64 “virtual processors” (VPs) each with 32 scalar registers! (like a multithreaded processor) • 1 instruction updates 1 scalar register in 64 VPs • Hardware identical, just 2 compiler perspectives (sketched below)
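A hedged C illustration of the two perspectives (array and function names are assumptions): both functions compute the same scaling, but which loop is vectorized differs.

#define N 64
float A[N][N], s;

void scale_inner(void) {
    /* inner-loop view: vectorize over j; one vector instruction
     * updates the 64 contiguous elements A[i][0..63] */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] *= s;
}

void scale_outer(void) {
    /* outer-loop view: interchange and vectorize over i; the 64
     * "virtual processors" each own one row, and the strided
     * column A[0..63][j] is the vector operand */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            A[i][j] *= s;
}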
Software Technology Trends Affecting V-IRAM? • V-IRAM: any CPU + vector coprocessor/memory • scalar/vector interactions are limited, simple • Example V-IRAM architecture based on ARM 9, MIPS • Vectorizing compilers built for 25 years • can buy one for new machine from The Portland Group • Microsoft “Win CE”/ Java OS for non-x86 platforms • Library solutions (e.g., MMX); retarget packages • Software distribution model is evolving? • New Model: Java byte codes over network? + Just-In-Time compiler to tailor program to machine?
Short-term compiler objectives • Intermediate solution to demonstrate IRAM features and performance while other compiler effort ongoing • Generate efficient vector code, particularly for key loops • Allow use of IRAM-specific ISA features • Variable VP width • Flag registers (masked operations) • Not a fully automatic vectorizing compiler • But higher level than assembly language • Programs should be easily convertible to format suitable as input to standard vectorizing compiler or scalar compiler • Up and running quickly • Minimal compiler modifications necessary • Easy to debug
Current compiler infrastructure • vic: source-to-source translator • input: C with vector extensions • output: C with embedded V-IRAM assembly code • vas: V-IRAM assembler • two paths for vic code: debugging or simulation (Diagram: simulation path: .vic → vic → .c + asm → gcc → .s → vas → a.out → vsim; debugging path: .vic → vic → .c → gcc → a.out, on Sun, SGI, PC, ...)
Vic language extensions to C • Based on register-register machine model • Special C variables represent vector registers • User responsible for register allocation • vfor keyword indicates loop should be vectorized • User must strip-mine loops explicitly, but compiler determines appropriate vector length (a strip-mined sketch follows below)
#define x vr0
#define y vr1
#define z vr2
vfor (i=0; i < 32; i++) {
  z[i] = x[i] + y[i];
}
• VL, MVL, VPW used to access vector control regs.
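A hedged sketch of an explicitly strip-mined vic loop, combining the extensions described above; the exact spelling of MVL and of the vector loads/stores is an assumption, not taken from the vic documentation.

#define x vr0
#define y vr1
#define z vr2

void vadd(int *A, int *B, int *C, int n) {
    int i, j;
    /* the user strip-mines by hand; vic sets the vector length
     * for each strip, including the short final one */
    for (i = 0; i < n; i += MVL) {
        int len = (n - i < MVL) ? (n - i) : MVL;
        vfor (j = 0; j < len; j++) {
            x[j] = A[i + j];    /* unit-stride vector load  */
            y[j] = B[i + j];
            z[j] = x[j] + y[j];
            C[i + j] = z[j];    /* unit-stride vector store */
        }
    }
}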
Language extensions, cont. • Unit stride, non-unit stride, and indexed loads/stores
int A[256];
vfor (i=0; i < 32; i++) {
  x[i] = A[2*i+15]; /* stride 2, starting with the 15th element */
}
• Masked operations using vif keyword
vfor (i=0; i < 32; i++) {
  vif (x[i] < y[i]) {
    z[i] = x[i] + y[i];
  }
}
• Operations with no obvious mapping from C expressed using macros
EXTRACT(x, y, 3, 10);
Kernels in Vic • Vector Add: C = A + B • Vector-Matrix Multiply: C[n] = A[m] * B[m][n] • Matrix-Vector Multiply: C[m] = B[m][n] * A[n] • Matrix Transpose • Load/store tests • FFT, two versions: • based on standard Cooley-Tukey algorithm • version using VIRAM masks for memory ops and extracts (underway) • Sparse Matrix-Vector Multiply (underway) • QR Decomposition (underway)
Vic implementation • Written in Perl • Easy to write and maintain • Relies heavily on gcc code optimization • Makes generating vic output simpler • Generates code for old VIRAM ISA • Some small changes • Doesn’t use new vector ld/st with single address register • Doesn’t handle shadowing of scalar registers in vector unit • fix by adding move-to/from around vector instructions that use scalar registers • or, do real register allocation (not in VIC)
Intermediate-term compiler: SUIF • Port VSUIF to V-IRAM ISA • Vectorizing version of the Stanford SUIF compiler • C and F77 front-ends • Currently generates code for • Vector ISA from University of Toronto • T0 • Sun VIS • Toronto plans for VSUIF extensions • Add outer-loop vectorization • Support masked operations • A Berkeley undergraduate is currently working on the VSUIF port to the V-IRAM ISA
Compiler Status: W99 • Plans to use SGI’s compiler revived • VIC • good news: initial VIC implementation done 7/98 (Oppenheimer) • more good news: undergrad benchmarking group (Thomas) • bad news: change of ISA makes VIC obsolete • SUIF • steep learning curve: vectors, VIRAM, SUIF (Krashinsky) • starting basic backend modifications • may need additional people