CS433: Computer System Organization

  1. CS433: Computer System Organization
     Luddy Harrison
     Vector Computation
     TigerSHARC Examples
     Vectorization

  2. What is Vector Computation?
     • Vector computation is a simple form of SIMD (Single Instruction, Multiple Data) computation
     • The data are packed into vectors: tuples of a primitive data type
     • The ALU operates directly on these tuples
     • We will write e.g. 4x16 for such a type (four elements of 16 bits each)
     • The underlying primitive type is taken from context
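The 4x16 notation on slide 2 can be made concrete with a small C model. This is only a sketch: the type name `vec4x16`, the helpers `pack4x16` and `lane4x16`, and the choice of putting lane 0 in the low bits are all assumptions made for illustration, not TigerSHARC's register layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 4x16 vector: four 16-bit lanes in one 64-bit word.
   Lane 0 occupies bits 15..0 and lane 3 occupies bits 63..48 (an assumption). */
typedef uint64_t vec4x16;

/* Pack four signed 16-bit values into one 4x16 vector. */
static vec4x16 pack4x16(int16_t l0, int16_t l1, int16_t l2, int16_t l3)
{
    return  (uint64_t)(uint16_t)l0
         | ((uint64_t)(uint16_t)l1 << 16)
         | ((uint64_t)(uint16_t)l2 << 32)
         | ((uint64_t)(uint16_t)l3 << 48);
}

/* Extract one lane. The uint16_t to int16_t conversion relies on the usual
   two's-complement behaviour of mainstream compilers. */
static int16_t lane4x16(vec4x16 v, int lane)
{
    assert(lane >= 0 && lane < 4);
    return (int16_t)(uint16_t)(v >> (16 * lane));
}
```

For example, `lane4x16(pack4x16(1, -2, 300, -32768), 1)` yields `-2`: the tuple travels as one word, and the element type (signed 16-bit here) is supplied by context.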

  3. Simplest Case: Vector / Vector
          A     B     C     D
          +     +     +     +
          E     F     G     H
          =     =     =     =
         A+E   B+F   C+G   D+H
     Here the output type is the same as the two input types.
     TigerSHARC: XSR1:0 = R4:3 + R7:6 (S)
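A scalar C model of the lane-by-lane add on slide 3 may help. The sketch below assumes that the `(S)` suffix requests saturating arithmetic and that the lanes are signed 16-bit values; the names `sat16` and `add4x16_sat` are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 32-bit intermediate result into the int16_t range. */
static int16_t sat16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* out[k] = saturate(a[k] + b[k]) for each of the four lanes.
   Input and output share the same 4x16 type, as on the slide. */
static void add4x16_sat(const int16_t a[4], const int16_t b[4], int16_t out[4])
{
    assert(a && b && out);
    for (int k = 0; k < 4; ++k)
        out[k] = sat16((int32_t)a[k] + (int32_t)b[k]);
}
```

The four additions are independent of one another, which is what lets the hardware perform them with a single instruction.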

  4. Vector / Vector With Different Result Type
          A     B     C     D
          ×     ×     ×     ×
          E     F     G     H
          =     =     =     =
         A×E   B×F   C×G   D×H
     4x16 × 4x16 → 4x32
     Note the quad output register YR3:0:
     YR3:0 = R7:6 * R5:4
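The 4x16 × 4x16 → 4x32 case on slide 4 can be modeled in C as a widening multiply. This is a sketch under the assumption of signed lanes; `mul4x16_wide` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Each 16-bit lane pair produces a full 32-bit product, so the result
   vector is twice as wide as either input (4x16 in, 4x32 out). */
static void mul4x16_wide(const int16_t a[4], const int16_t b[4], int32_t out[4])
{
    assert(a && b && out);
    for (int k = 0; k < 4; ++k)
        out[k] = (int32_t)a[k] * (int32_t)b[k];
}
```

Because every product of two 16-bit values fits in 32 bits, no lane can overflow, which is why the output needs a quad register rather than a register pair.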

  5. Variation: Vector / Scalar
     Vector: <A,B,C,D>     Scalar: X
          A     B     C     D
          +     +     +     +
          X     X     X     X
          =     =     =     =
         A+X   B+X   C+X   D+X
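In C terms, the vector / scalar variation on slide 5 broadcasts one scalar to every lane. The sketch below uses 32-bit outputs to sidestep overflow questions; that choice and the name `add4x16_scalar` are assumptions for illustration, and this is a host-side model rather than a TigerSHARC instruction.

```c
#include <assert.h>
#include <stdint.h>

/* out[k] = v[k] + x: the same scalar x is applied to all four lanes. */
static void add4x16_scalar(const int16_t v[4], int16_t x, int32_t out[4])
{
    assert(v && out);
    for (int k = 0; k < 4; ++k)
        out[k] = (int32_t)v[k] + (int32_t)x;
}
```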

  6. Extra Credit
     • Email me by midnight tonight an example of a Vector / Scalar operation in the TigerSHARC instruction set
     • Must be one 32-bit instruction
     • Running on one of X or Y (not an XY operation)

  7. Variation: Reduction
          A  +  B  +  C  +  D
                   =
                A+B+C+D

  8. TigerSHARC Reduction

  9. Sum Reduction Using PR Regs
     PR0 += SUM SR5:4
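Slides 7 through 9 describe a reduction: the lanes of one vector are summed sideways, and the result is added to a running accumulator. A rough C model follows; the variable `pr0` only mimics the role the slide gives to PR0, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Sideways sum of one 4x16 vector: A + B + C + D. */
static int32_t sum4x16(const int16_t v[4])
{
    int32_t s = 0;
    for (int k = 0; k < 4; ++k)
        s += v[k];
    return s;
}

/* Accumulate the sideways sums of n_vecs consecutive 4x16 vectors.
   Each loop iteration corresponds to one "accumulator += SUM vector" step. */
static int32_t sum_reduce(const int16_t *data, int n_vecs)
{
    assert(data && n_vecs >= 0);
    int32_t pr0 = 0;                 /* stand-in for the accumulator register */
    for (int i = 0; i < n_vecs; ++i)
        pr0 += sum4x16(data + 4 * i);
    return pr0;
}
```

Unlike the lane-wise operations, a reduction combines values across lanes, so its result type is a scalar rather than a vector.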

  10. XY Parallelism in TigerSHARC
      • We can (sort of) look at this as an additional “×2” vector capacity
      • R3:0 = R5:4 * R7:6
      • This does 4 multiply / adds on X and 4 on Y (8 total)
      • As if <X5:4, Y5:4> were one vector

  11. XY Operations on TS
      [Figure with panels labeled X and Y]

  12. XY Parallelism
      • This view however is difficult to maintain in light of loads and stores
      • XR5:4 = [ J2 ] ; YR5:4 = [ K2 ]
      • If K2 = J2 + 2 then this loads the 4-word vector into <XR5:4, YR5:4>
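The paired load on slide 12 can be pictured with a small C model in which each side receives two consecutive words from its own pointer. The struct `xy_pair`, the function `load_xy`, and the order of words within each half are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the register pair on each side:
   x[] stands for XR5:4 and y[] stands for YR5:4. */
typedef struct {
    int32_t x[2];
    int32_t y[2];
} xy_pair;

/* Load two words through j2 into the X half and two words through k2
   into the Y half. The two pointers are independent of each other. */
static xy_pair load_xy(const int32_t *j2, const int32_t *k2)
{
    assert(j2 && k2);
    xy_pair r;
    r.x[0] = j2[0];
    r.x[1] = j2[1];
    r.y[0] = k2[0];
    r.y[1] = k2[1];
    return r;
}
```

Calling `load_xy(mem, mem + 2)` gathers four consecutive words, matching the K2 = J2 + 2 case. Nothing forces that relationship between the pointers, which is one reason the "one wide vector" view is hard to maintain.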

  13. Using XY to Implement Vectors
      [Figure with pointers labeled J2 and K2]
      What are the problems here (there are two of them)?

  14. Vector Computation and Memory
      • The structure of in-register vector operations mirrors closely the structure of the in-memory storage of the input and output vectors
      • This is primarily what makes hand or automatic vectorization difficult
      • The XY feature of TigerSHARC is easiest to use on separate vector operations
        • A = B + C (vector op) || D = E + F (vector op)

  15. Automatic Vectorization
      • Loop Unrolling
        • To provide more than one vector element per iteration
      • Alignment
        • To satisfy load / store alignment restrictions
      • Vector register allocation
        • To map vectors into the register set
      • This is a very incomplete list, but it is something like a minimum requirement

  16. Loop Unrolling

      for (i=0; i<100; ++i){
        a[i] = b[i] + c[i];
      }

      for (i=0; i<100; i += 4){
        a[i+0] = b[i+0] + c[i+0];
        a[i+1] = b[i+1] + c[i+1];
        a[i+2] = b[i+2] + c[i+2];
        a[i+3] = b[i+3] + c[i+3];
      }
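The slide's trip count of 100 divides evenly by 4, so the unrolled loop needs no cleanup. For a general trip count, a common arrangement is an unrolled main loop followed by a scalar remainder loop. The sketch below shows that pattern; the function name `vec_add_unrolled` and the parameter `n` are not from the slides.

```c
#include <assert.h>

/* a[i] = b[i] + c[i] for 0 <= i < n, unrolled by 4. */
static void vec_add_unrolled(int *a, const int *b, const int *c, int n)
{
    assert(n >= 0);
    int i = 0;

    /* Main loop: four elements per iteration, a candidate for one vector op. */
    for (; i + 4 <= n; i += 4) {
        a[i + 0] = b[i + 0] + c[i + 0];
        a[i + 1] = b[i + 1] + c[i + 1];
        a[i + 2] = b[i + 2] + c[i + 2];
        a[i + 3] = b[i + 3] + c[i + 3];
    }

    /* Remainder loop: at most three leftover elements, handled one at a time. */
    for (; i < n; ++i)
        a[i] = b[i] + c[i];
}
```

With `n = 10`, the main loop covers elements 0 through 7 and the remainder loop covers elements 8 and 9.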

  17. Alignment: Analysis

      void f(int *a, int *b, int *c){
        int i;
        for (i=0; i<100; ++i) {
          a[i] = b[i] + c[i];
        }
      }

      void f(int *a, int *b, int *c){
        int i;
        for (i=0; i<100; i += 4) {
          a[i+0] = b[i+0] + c[i+0];
          a[i+1] = b[i+1] + c[i+1];
          a[i+2] = b[i+2] + c[i+2];
          a[i+3] = b[i+3] + c[i+3];
        }
      }

      Can we load c[i+3] : c[i+0] using a quad load?
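Whether a quad load is legal depends on where the pointer actually points, which the compiler usually cannot see from the body of `f` alone. One fallback is a run-time check. The sketch below assumes a byte-addressed host where a quad is four 32-bit words (16 bytes) and where converting a pointer to `uintptr_t` exposes its address; both helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUAD_BYTES (4 * sizeof(int32_t))

/* Nonzero if p sits on a quad (4-word) boundary. */
static int is_quad_aligned(const void *p)
{
    return ((uintptr_t)p % QUAD_BYTES) == 0;
}

/* Number of elements to process one at a time before p reaches the next
   quad boundary (0 if it is already aligned). */
static int elems_to_quad_boundary(const int32_t *p)
{
    assert(((uintptr_t)p % sizeof(int32_t)) == 0);
    size_t idx = ((uintptr_t)p / sizeof(int32_t)) % 4;
    return (int)((4 - idx) % 4);
}
```

A vectorized `f` could use quad loads only when `a`, `b`, and `c` all pass `is_quad_aligned`, and otherwise fall back to the scalar loop.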

  18. Alignment

      void f(int *a, int *b, int *c){
        int i;
        for (i=0; i<100; ++i) {
          a[i] = b[i-1] + c[i];
        }
      }

      Why is this example difficult?
      What will happen (concerning alignment) when we unroll?
      How can we fix this?
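One possible response to the last question, offered as a sketch rather than as the intended answer, is to keep every load of `b` on an aligned group and carry the final element of each group into the next iteration. The shifted tuple <b[i-1], b[i], b[i+1], b[i+2]> is then assembled in registers instead of being loaded from a misaligned address. The function name `add_shifted`, the parameter `n`, and the requirement that `n` be a multiple of 4 are assumptions of the example.

```c
#include <assert.h>

/* Computes a[i] = b[i-1] + c[i] for 0 <= i < n, with n a multiple of 4.
   The caller must guarantee that b[-1] is readable and that a does not
   overlap b or c. */
static void add_shifted(int *a, const int *b, const int *c, int n)
{
    assert(n >= 0 && n % 4 == 0);

    /* Element just to the left of the first aligned group of b. */
    int carry = (n > 0) ? b[-1] : 0;

    for (int i = 0; i < n; i += 4) {
        /* One aligned group of b; on vector hardware this would be a quad load. */
        int q0 = b[i + 0], q1 = b[i + 1], q2 = b[i + 2], q3 = b[i + 3];

        /* Shifted view of b built from the carry plus the current group. */
        a[i + 0] = carry + c[i + 0];
        a[i + 1] = q0    + c[i + 1];
        a[i + 2] = q1    + c[i + 2];
        a[i + 3] = q2    + c[i + 3];

        carry = q3;   /* becomes b[i-1] for the next group */
    }
}
```

The price is a lane shuffle per iteration, which a real vectorizer would need the target to support efficiently.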

  19. Vector Register Allocation
      Vector ADD:  V1 = V2 + V3
                   V4 = V5 + V6
      Vector MUL:  V7 = V2 * V3
                   V8 = V5 * V9
      Should we allocate <V3, V6> to a vector reg? Or <V3, V9>?
      If V6 and V9 are simultaneously alive there is a problem!

  20. Vector Register Allocation
      Vector ADD:  V1 = V2 + V3
                   V4 = V5 + V6
      Vector MUL:  V7 = V2 * V6
                   V8 = V5 * V3
      Should V3 go into the HIGH half (odd-numbered) register for the ADD, or the LOW half (even-numbered) register for the MUL?
