1 / 18

Compiler techniques for exposing ILP

Compiler techniques for exposing ILP . Instruction Level Parallelism. Potential overlap among instructions Few possibilities in a basic block Blocks are small (6-7 instructions) Instructions are dependent Goal: Exploit ILP across multiple basic blocks Iterations of a loop

darice
Download Presentation

Compiler techniques for exposing ILP

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Compiler techniques for exposing ILP

  2. Instruction Level Parallelism • Potential overlap among instructions • Few possibilities in a basic block • Blocks are small (6-7 instructions) • Instructions are dependent • Goal: Exploit ILP across multiple basic blocks • Iterations of a loop for (i = 1000; i > 0; i=i-1) x[i] = x[i] + s;

  3. Basic Scheduling Sequential MIPS Assembly Code Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 SUBI R1, R1, #8 BNEZ R1, Loop for (i = 1000; i > 0; i=i-1) x[i] = x[i] + s; Pipelined execution: Loop: LD F0, 0(R1) 1 stall 2 ADDD F4, F0, F2 3 stall 4 stall 5 SD 0(R1), F4 6 SUBI R1, R1, #8 7 stall 8 BNEZ R1, Loop 9 stall 10 Scheduled pipelined execution: Loop: LD F0, 0(R1) 1 SUBI R1, R1, #8 2 ADDD F4, F0, F2 3 stall 4 BNEZ R1, Loop 5 SD 8(R1), F4 6

  4. Loop Unrolling Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 SUBI R1, R1, #8 BEQZ R1, Exit LD F6, 0(R1) ADDD F8, F6, F2 SD 0(R1), F8 SUBI R1, R1, #8 BEQZ R1, Exit LD F10, 0(R1) ADDD F12, F10, F2 SD 0(R1), F12 SUBI R1, R1, #8 BEQZ R1, Exit LD F14, 0(R1) ADDD F16, F14, F2 SD 0(R1), F16 SUBI R1, R1, #8 BNEZ R1, Loop Exit: Pros: Larger basic block More scope for scheduling and eliminating dependencies Cons: Increases code size Comment: Often a precursor step for other optimizations

  5. Loop Transformations • Instruction independency is the key requirement for the transformations • Example • Determine that is legal to move SD after SUBI and BNEZ • Determine that unrolling is useful (iterations are independent) • Use different registers to avoid unnecessary constrains • Eliminate extra tests and branches • Determine that LD and SD can be interchanged • Schedule the code, preserving the semantics of the code

  6. 1. Eliminating Name Dependences Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 LD F0, -8(R1) ADDD F4, F0, F2 SD -8(R1), F4 LD F0, -16(R1) ADDD F4, F0, F2 SD -16(R1), F4 LD F0, -24(R1) ADDD F4, F0, F2 SD -24(R1), F4 SUBI R1, R1, #32 BNEZ R1, Loop Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 LD F6, -8(R1) ADDD F8, F6, F2 SD -8(R1), F8 LD F10, -16(R1) ADDD F12, F10, F2 SD -16(R1), F12 LD F14, -24(R1) ADDD F16, F14, F2 SD -24(R1), F16 SUBI R1, R1, #32 BNEZ R1, Loop Register Renaming

  7. 2. Eliminating Control Dependences Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 SUBI R1, R1, #8 BEQZ R1, Exit LD F6, 0(R1) ADDD F8, F6, F2 SD 0(R1), F8 SUBI R1, R1, #8 BEQZ R1, Exit LD F10, 0(R1) ADDD F12, F10, F2 SD 0(R1), F12 SUBI R1, R1, #8 BEQZ R1, Exit LD F14, 0(R1) ADDD F16, F14, F2 SD 0(R1), F16 SUBI R1, R1, #8 BNEZ R1, Loop Exit: Intermediate BEQZ are never taken Eliminate!

  8. 3. Eliminating Data Dependences Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 SUBI R1, R1, #8 LD F6, 0(R1) ADDD F8, F6, F2 SD 0(R1), F8 SUBI R1, R1, #8 LD F10, 0(R1) ADDD F12, F10, F2 SD 0(R1), F12 SUBI R1, R1, #8 LD F14, 0(R1) ADDD F16, F14, F2 SD 0(R1), F16 SUBI R1, R1, #8 BNEZ R1, Loop • Data dependencies SUBI, LD, SD • Force sequential execution of iterations • Compiler removes this dependency by: • Computing intermediate R1 values • Eliminating intermediate SUBI • Changing final SUBI • Data flow analysis • Can do on Registers • Cannot do easily on memory locations • 100(R1) = 20(R2)

  9. 4. Alleviating Data Dependencies Unrolled loop: Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 LD F6, -8(R1) ADDD F8, F6, F2 SD -8(R1), F8 LD F10, -16(R1) ADDD F12, F10, F2 SD -16(R1), F12 LD F14, -24(R1) ADDD F16, F14, F2 SD -24(R1), F16 SUBI R1, R1, #32 BNEZ R1, Loop Scheduled Unrolled loop: Loop: LD F0, 0(R1) LD F6, -8(R1) LD F10, -16(R1) LD F14, -24(R1) ADDD F4, F0, F2 ADDD F8, F6, F2 ADDD F12, F10, F2 ADDD F16, F14, F2 SD 0(R1), F4 SD -8(R1), F8 SUBI R1, R1, #32 SD 16(R1), F12 BNEZ R1, Loop SD 8(R1), F16

  10. Some General Comments • Dependences are a property of programs • Actual hazards are a property of the pipeline • Techniques to avoid dependence limitations • Maintain dependences but avoid hazards • Code scheduling • hardware • software • Eliminate dependences by code transformations • Complex • Compiler-based

  11. Loop-level Parallelism • Primary focus of dependence analysis • Determine all dependences and find cycles for (i=1; i<=100; i=i+1) { x[i] = y[i] + z[i]; w[i] = x[i] + v[i]; } for (i=1; i<=100; i=i+1) { x[i+1] = x[i] + z[i]; } x[1] = x[1] + y[1]; for (i=1; i<=99; i=i+1) { y[i+1] = w[i] + z[i]; x[i+1] = x[i +1] + y[i +1]; } y[101] = w[100] + z[100]; for (i=1; i<=100; i=i+1) { x[i] = x[i] + y[i]; y[i+1] = w[i] + z[i]; }

  12. Dependence Analysis Algorithms • Assume array indexes are affine (ai + b) • GCD test: For two affine array indexes ai+b and ci+d: if a loop-carried dependence exists, then GCD (c,a) must divide (d-b) x[8*i ] = x[4*i + 2] +3 (2-0)/GCD(8,4) • General graph cycle determination is NP • a, b, c, and d may not be known at compile time

  13. Software Pipelining Start-up Finish-up Iteration 0 Iteration 1 Iteration 2 Iteration 3 Software pipelined iteration

  14. Example Iteration i Iteration i+1 Iteration i+2 LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 SUBI R1, R1, #8 BNEZ R1, Loop Loop: SD 16(R1), F4 ADDD F4, F0, F2 LD F0, 0(R1) SUBI R1, R1, #8 BNEZ R1, Loop

  15. Trace (global-code) Scheduling • Find ILP across conditional branches • Two-step process • Trace selection • Find a trace (sequence of basic blocks) • Use loop unrolling to generate long traces • Use static branch prediction for other conditional branches • Trace compaction • Squeeze the trace into a small number of wide instructions • Preserve data and control dependences

  16. Trace Selection A[I] = A[I] + B[I] LW R4, 0(R1) LW R5, 0(R2) ADD R4, R4, R5 SW 0(R1), R4 BNEZ R4, else . . . . SW 0(R2), . . . J join Else: . . . . X Join: . . . . SW 0(R3), . . . T F A[I] = 0? X B[I] = C[I] =

  17. Summary of Compiler Techniques • Try to avoid dependence stalls • Loop unrolling • Reduce loop overhead • Software pipelining • Reduce single body dependence stalls • Trace scheduling • Reduce impact of other branches • Compilers use a mix of three • All techniques depend on prediction accuracy

  18. Food for thought: Analyze this • Analyze this for different values of X and Y • To evaluate different branch prediction schemes • For compiler scheduling purposes • add r1, r0, 1000 #  all numbers in decimal • add r2, r0, a # Base address of array a • loop: • andi r10, r1, X • beqz r10, even • lw r11, 0(r2) • addi r11, r11, 1 • sw 0(r2), r11 • even: • addi r2, r2, 4 • subi r1, r1, Y • bnez r1, loop

More Related