CS433: Computer System Organization

Luddy Harrison
Vector Computation
TigerSHARC Examples
Vectorization

What is Vector Computation?
• Vector computation is a simple form of SIMD computation
• SIMD: Single Instruction, Multiple Data
• The data are packed into vectors
• Vectors are tuples of a primitive data type
• The ALU operates directly on these tuples
• We will write e.g. 4x16 for such a type (four elements of 16 bits each)
• The underlying primitive type is taken from context
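
To make the 4x16 notation concrete, here is a minimal C sketch of four 16-bit lanes packed into one 64-bit word. The names `v4x16`, `v4x16_pack` and `v4x16_unpack` are hypothetical, and placing lane 0 in the low-order bits is an assumption made for illustration, not a statement about any particular machine.

```c
#include <stdint.h>

/* Hypothetical type: a "4x16" vector is a tuple of four 16-bit lanes. */
typedef struct { int16_t lane[4]; } v4x16;

/* Pack the four lanes into one 64-bit word.
   Assumption: lane 0 occupies the low-order 16 bits. */
static uint64_t v4x16_pack(v4x16 v) {
    uint64_t w = 0;
    for (int i = 0; i < 4; ++i)
        w |= (uint64_t)(uint16_t)v.lane[i] << (16 * i);
    return w;
}

/* Split a 64-bit word back into its four 16-bit lanes. */
static v4x16 v4x16_unpack(uint64_t w) {
    v4x16 v;
    for (int i = 0; i < 4; ++i)
        v.lane[i] = (int16_t)(uint16_t)(w >> (16 * i));
    return v;
}
```

A vector ALU operates on all four lanes of such a word with a single instruction, rather than unpacking them as this sketch does.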

Simplest Case: Vector / Vector
     A      B      C      D
     +      +      +      +
     E      F      G      H
     =      =      =      =
    A+E    B+F    C+G    D+H
Here the output type is the same as the two input types.
TigerSHARC: XSR1:0 = R4:3 + R7:6 (S)
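
The lane-wise add can be sketched in scalar C as follows. The function names are hypothetical. The sketch assumes that the (S) option on the TigerSHARC instruction requests signed saturation; that reading should be checked against the processor manual.

```c
#include <stdint.h>

/* Add two 16-bit values, clamping to the int16_t range instead of wrapping.
   Assumption: this models the (S) option as signed saturation. */
static int16_t sat_add16(int16_t a, int16_t b) {
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* 4x16 + 4x16 -> 4x16: each output lane depends only on the same input lane. */
static void vadd_4x16(const int16_t x[4], const int16_t y[4], int16_t out[4]) {
    for (int i = 0; i < 4; ++i)
        out[i] = sat_add16(x[i], y[i]);
}
```

The hardware performs all four lane additions at once; the loop here only spells out what each lane computes.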

Vector / Vector With Different Result Type
     A      B      C      D
     ×      ×      ×      ×
     E      F      G      H
     =      =      =      =
    A×E    B×F    C×G    D×H
4x16 × 4x16 → 4x32
Note quad output register YR3:0
YR3:0 = R7:6 * R5:4
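
A scalar C sketch of the widening multiply shows why the result needs twice the storage: each 16-bit by 16-bit product is kept in a full 32-bit lane. The function name is hypothetical.

```c
#include <stdint.h>

/* 4x16 * 4x16 -> 4x32: operands are widened before multiplying,
   so no product can overflow its 32-bit output lane. */
static void vmul_4x16_to_4x32(const int16_t x[4], const int16_t y[4],
                              int32_t out[4]) {
    for (int i = 0; i < 4; ++i)
        out[i] = (int32_t)x[i] * (int32_t)y[i];
}
```

Two 64-bit inputs produce a 128-bit output, which is why the destination is a quad register.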

Variation: Vector / Scalar
Vector: <A,B,C,D>
Scalar: X
     A      B      C      D
     +      +      +      +
     X      X      X      X
     =      =      =      =
    A+X    B+X    C+X    D+X
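
One way to view a vector / scalar operation is as a broadcast ("splat") of the scalar into every lane, followed by an ordinary vector / vector operation. The sketch below uses a hypothetical function name and wraps modulo 2^16 on overflow; no saturation is modeled.

```c
#include <stdint.h>

/* <A,B,C,D> + X: broadcast X into all four lanes, then add lane by lane. */
static void vadd_scalar_4x16(const int16_t v[4], int16_t x, int16_t out[4]) {
    int16_t splat[4] = { x, x, x, x };   /* the scalar, replicated per lane */
    for (int i = 0; i < 4; ++i) {
        /* Unsigned arithmetic makes the wrap-around explicit. */
        uint16_t sum = (uint16_t)((uint16_t)v[i] + (uint16_t)splat[i]);
        out[i] = (int16_t)sum;
    }
}
```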

Extra Credit
• Email me by midnight tonight an example of a Vector / Scalar operation in the TigerSHARC instruction set
• Must be one 32-bit instruction
• Running on one of X or Y (not an XY operation)

Variation: Reduction
    A  +  B  +  C  +  D
             =
          A+B+C+D

TigerSHARC Reduction

Sum Reduction Using PR Regs
PR0 += SUM SR5:4
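
The accumulate-a-reduction pattern can be sketched in C as below. The function name is hypothetical, and the variable `pr0` only plays the role of the accumulator register; its 32-bit width is an assumption of this sketch, not a claim about the hardware register.

```c
#include <assert.h>
#include <stdint.h>

/* Sum an array of 16-bit values four lanes at a time.
   Each trip through the outer loop mirrors one "PR0 += SUM SR5:4" step:
   a sideways sum of four lanes, added to a running accumulator. */
static int32_t sum_by_quads(const int16_t *data, int n) {
    assert(n % 4 == 0);              /* sketch handles whole quads only */
    int32_t pr0 = 0;                 /* stands in for the PR0 accumulator */
    for (int i = 0; i < n; i += 4) {
        int32_t sideways = 0;        /* A+B+C+D for the current quad */
        for (int k = 0; k < 4; ++k)
            sideways += data[i + k];
        pr0 += sideways;
    }
    return pr0;
}
```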

XY Parallelism in TigerSHARC
• We can (sort of) look at this as an additional “×2” vector capacity
• R3:0 = R5:4 * R7:6
• This does 4 multiply / adds on X and 4 on Y (8 total)
• As if < X5:4, Y5:4 > were one vector

XY Operations on TS
[Figure: X and Y]

XY Parallelism
• This view however is difficult to maintain in light of loads and stores
• XR5:4 = [ J2 ] ; YR5:4 = [ K2 ]
• If K2 = J2 + 2 then this loads the 4-word vector into <XR5:4, YR5:4>
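
The split load can be modeled in C as two separate 2-word loads from word addresses J2 and K2 = J2 + 2. The type and function names are hypothetical. The sketch assumes that the lower-numbered register of each pair (R4) receives the lower address; that ordering is an assumption made for illustration.

```c
#include <stdint.h>

/* Models one register pair R5:4 in a single compute block. */
typedef struct {
    int32_t r4;   /* assumed to receive the word at the lower address */
    int32_t r5;
} reg_pair;

/* Model of: XR5:4 = [J2] ; YR5:4 = [K2], with K2 = J2 + 2 (word indices).
   Together the two pairs hold the 4-word vector mem[j2 .. j2+3]. */
static void load_split(const int32_t *mem, int j2, reg_pair *x, reg_pair *y) {
    int k2 = j2 + 2;
    x->r4 = mem[j2];
    x->r5 = mem[j2 + 1];
    y->r4 = mem[k2];
    y->r5 = mem[k2 + 1];
}
```

Note that the 4-word vector is fetched by two independent address registers, so nothing in the instruction itself ties the two halves together.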

Using XY to Implement Vectors
[Figure: J2 and K2]
What are the problems here (there are two of them)?

Vector Computation and Memory
• The structure of in-register vector operations mirrors closely the structure of the in-memory storage of the input and output vectors
• This is primarily what makes hand or automatic vectorization difficult
• The XY feature of TigerSHARC is easiest to use on separate vector operations
• A = B + C (vector op) || D = E + F (vector op)

Automatic Vectorization
• Loop Unrolling
  • To provide more than one vector element per iteration
• Alignment
  • To satisfy load / store alignment restrictions
• Vector register allocation
  • To map vectors into the register set
• This is a very incomplete list, but it is something like a minimum requirement

Loop Unrolling
Original loop:
    for (i=0; i<100; ++i){
        a[i] = b[i] + c[i];
    }
Unrolled by 4:
    for (i=0; i<100; i += 4){
        a[i+0] = b[i+0] + c[i+0];
        a[i+1] = b[i+1] + c[i+1];
        a[i+2] = b[i+2] + c[i+2];
        a[i+3] = b[i+3] + c[i+3];
    }
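
The slide's trip count of 100 divides evenly by 4. When the trip count is not known to be a multiple of the unroll factor, a common remedy is a scalar cleanup loop after the unrolled body. A minimal sketch, with a hypothetical function name and an added length parameter `n`:

```c
/* a[i] = b[i] + c[i] for i in [0, n), unrolled by 4 with a cleanup loop. */
static void add_unrolled(int *a, const int *b, const int *c, int n) {
    int i = 0;

    /* Main body: four elements per iteration, a candidate for one vector op. */
    for (; i + 4 <= n; i += 4) {
        a[i + 0] = b[i + 0] + c[i + 0];
        a[i + 1] = b[i + 1] + c[i + 1];
        a[i + 2] = b[i + 2] + c[i + 2];
        a[i + 3] = b[i + 3] + c[i + 3];
    }

    /* Cleanup: the remaining 0 to 3 elements, one at a time. */
    for (; i < n; ++i)
        a[i] = b[i] + c[i];
}
```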

Alignment: Analysis
Original:
    void f(int *a, int *b, int *c){
        int i;
        for (i=0; i<100; ++i) {
            a[i] = b[i] + c[i];
        }
    }
Unrolled by 4:
    void f(int *a, int *b, int *c){
        int i;
        for (i=0; i<100; i += 4) {
            a[i+0] = b[i+0] + c[i+0];
            a[i+1] = b[i+1] + c[i+1];
            a[i+2] = b[i+2] + c[i+2];
            a[i+3] = b[i+3] + c[i+3];
        }
    }
Can we load c[i+3] : c[i+0] using a quad load?
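
Whether a quad load is legal depends on the address held in the pointer, which the function signature does not reveal. When the compiler cannot prove alignment, one option is a run-time test like the sketch below. The function name is hypothetical, and the assumption is that a quad load requires the address to be a multiple of four `int`-sized words.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero if p sits on a 4-word boundary.
   Assumption: a quad load requires an address that is a multiple
   of 4 * sizeof(int) bytes. */
static int is_quad_aligned(const int *p) {
    return ((uintptr_t)p % (4 * sizeof(int))) == 0;
}
```

If `c` itself is quad aligned, then so is `&c[i]` for every `i` that is a multiple of 4, which is exactly the set of addresses the unrolled loop would load from.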

Alignment
    void f(int *a, int *b, int *c){
        int i;
        for (i=0; i<100; ++i) {
            a[i] = b[i-1] + c[i];
        }
    }
Why is this example difficult?
What will happen (concerning alignment) when we unroll?
How can we fix this?
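
One common approach to a loop like this, not necessarily the answer the slide has in mind, is to load `b` only in aligned quads and build the shifted vector in registers, carrying the last element of each quad into the next iteration. The sketch below uses hypothetical names; `b_prev` stands in for `b[-1]`, and `n` is assumed to be a multiple of 4.

```c
#include <assert.h>

/* a[i] = b[i-1] + c[i] for i in [0, n), reading b only as aligned quads. */
static void add_shifted(int *a, const int *b, int b_prev,
                        const int *c, int n) {
    assert(n % 4 == 0);              /* sketch handles whole quads only */
    int carry = b_prev;              /* plays the role of b[i-1] at quad start */
    for (int i = 0; i < n; i += 4) {
        /* One aligned quad load of b[i+3] : b[i+0]. */
        int bq[4] = { b[i], b[i + 1], b[i + 2], b[i + 3] };

        /* Shift by one lane: <b[i-1], b[i], b[i+1], b[i+2]>. */
        int shifted[4] = { carry, bq[0], bq[1], bq[2] };
        carry = bq[3];               /* needed by the next iteration */

        for (int k = 0; k < 4; ++k)
            a[i + k] = shifted[k] + c[i + k];
    }
}
```

The cost of this scheme is the extra in-register shuffle per iteration; the benefit is that every memory access to `a`, `b` and `c` starts at an index that is a multiple of 4.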

Vector Register Allocation
Vector ADD:
    V1 = V2 + V3
    V4 = V5 + V6
Vector MUL:
    V7 = V2 * V3
    V8 = V5 * V9
Should we allocate <V3, V6> to a vector reg? Or <V3, V9>?
If V6 and V9 are simultaneously live there is a problem!

Vector Register Allocation
Vector ADD:
    V1 = V2 + V3
    V4 = V5 + V6
Vector MUL:
    V7 = V2 * V6
    V8 = V5 * V3
Should V3 go into the HIGH half (odd-numbered) register for the ADD, or the LOW half (even-numbered) register for the MUL?
