Compiler techniques for exposing ILP
Instruction Level Parallelism
• Potential overlap among instructions
• Few possibilities in a basic block
    • Blocks are small (6-7 instructions)
    • Instructions are dependent
• Goal: Exploit ILP across multiple basic blocks
    • Iterations of a loop

    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;

Basic Scheduling

    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;

Sequential MIPS assembly code:

    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          SUBI R1, R1, #8
          BNEZ R1, Loop

Pipelined execution (clock cycle on the right):

    Loop: LD   F0, 0(R1)       1
          stall                2
          ADDD F4, F0, F2      3
          stall                4
          stall                5
          SD   0(R1), F4       6
          SUBI R1, R1, #8      7
          stall                8
          BNEZ R1, Loop        9
          stall               10

Scheduled pipelined execution (clock cycle on the right):

    Loop: LD   F0, 0(R1)       1
          SUBI R1, R1, #8      2
          ADDD F4, F0, F2      3
          stall                4
          BNEZ R1, Loop        5
          SD   8(R1), F4       6

Loop Unrolling

    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          SUBI R1, R1, #8
          BEQZ R1, Exit
          LD   F6, 0(R1)
          ADDD F8, F6, F2
          SD   0(R1), F8
          SUBI R1, R1, #8
          BEQZ R1, Exit
          LD   F10, 0(R1)
          ADDD F12, F10, F2
          SD   0(R1), F12
          SUBI R1, R1, #8
          BEQZ R1, Exit
          LD   F14, 0(R1)
          ADDD F16, F14, F2
          SD   0(R1), F16
          SUBI R1, R1, #8
          BNEZ R1, Loop
    Exit:

• Pros:
    • Larger basic block
    • More scope for scheduling and eliminating dependencies
• Cons:
    • Increases code size
• Comment:
    • Often a precursor step for other optimizations

Loop Transformations
• Instruction independence is the key requirement for the transformations
• Example
    • Determine that it is legal to move SD after SUBI and BNEZ
    • Determine that unrolling is useful (iterations are independent)
    • Use different registers to avoid unnecessary constraints
    • Eliminate extra tests and branches
    • Determine that LD and SD can be interchanged
    • Schedule the code, preserving the semantics of the code

1. Eliminating Name Dependences

Unrolled loop, reusing F0 and F4 in every copy:

    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          LD   F0, -8(R1)
          ADDD F4, F0, F2
          SD   -8(R1), F4
          LD   F0, -16(R1)
          ADDD F4, F0, F2
          SD   -16(R1), F4
          LD   F0, -24(R1)
          ADDD F4, F0, F2
          SD   -24(R1), F4
          SUBI R1, R1, #32
          BNEZ R1, Loop

After register renaming:

    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          LD   F6, -8(R1)
          ADDD F8, F6, F2
          SD   -8(R1), F8
          LD   F10, -16(R1)
          ADDD F12, F10, F2
          SD   -16(R1), F12
          LD   F14, -24(R1)
          ADDD F16, F14, F2
          SD   -24(R1), F16
          SUBI R1, R1, #32
          BNEZ R1, Loop

2. Eliminating Control Dependences

    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          SUBI R1, R1, #8
          BEQZ R1, Exit
          LD   F6, 0(R1)
          ADDD F8, F6, F2
          SD   0(R1), F8
          SUBI R1, R1, #8
          BEQZ R1, Exit
          LD   F10, 0(R1)
          ADDD F12, F10, F2
          SD   0(R1), F12
          SUBI R1, R1, #8
          BEQZ R1, Exit
          LD   F14, 0(R1)
          ADDD F16, F14, F2
          SD   0(R1), F16
          SUBI R1, R1, #8
          BNEZ R1, Loop
    Exit:

• The intermediate BEQZ branches are never taken, because the trip count (1000) is a multiple of the unroll factor (4)
• Eliminate them!

3. Eliminating Data Dependences

    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          SUBI R1, R1, #8
          LD   F6, 0(R1)
          ADDD F8, F6, F2
          SD   0(R1), F8
          SUBI R1, R1, #8
          LD   F10, 0(R1)
          ADDD F12, F10, F2
          SD   0(R1), F12
          SUBI R1, R1, #8
          LD   F14, 0(R1)
          ADDD F16, F14, F2
          SD   0(R1), F16
          SUBI R1, R1, #8
          BNEZ R1, Loop

• Data dependences: each SUBI feeds the LD and SD that follow it
    • Force sequential execution of iterations
• Compiler removes this dependence by:
    • Computing intermediate R1 values
    • Eliminating intermediate SUBI
    • Changing final SUBI
• Data flow analysis
    • Can do on registers
    • Cannot do easily on memory locations
        • e.g. is 100(R1) = 20(R2)?

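The effect of removing the intermediate tests and index updates can be sketched at the source level. This is only an illustration in C, not the compiler's actual output: the function names `add_scalar_rolled` and `add_scalar_unrolled4` are hypothetical, and the unrolled version assumes the element count is a multiple of 4, as it is for the 1000-element loop in the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled loop from the slides: one add, one index update and one test
   per element. x is indexed 1..n, so it must have n + 1 slots. */
void add_scalar_rolled(double *x, size_t n, double s) {
    for (size_t i = n; i > 0; i = i - 1)
        x[i] = x[i] + s;
}

/* Unrolled by 4 (hypothetical name). The intermediate index updates are
   folded into the constant offsets -1, -2, -3, and a single update by 4
   plus a single test remain per trip. Assumes n is a multiple of 4. */
void add_scalar_unrolled4(double *x, size_t n, double s) {
    assert(n % 4 == 0);
    for (size_t i = n; i > 0; i = i - 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}
```

The four statements in the unrolled body touch different elements, which is what lets a scheduler reorder them freely.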
4. Alleviating Data Dependencies

Unrolled loop:

    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          LD   F6, -8(R1)
          ADDD F8, F6, F2
          SD   -8(R1), F8
          LD   F10, -16(R1)
          ADDD F12, F10, F2
          SD   -16(R1), F12
          LD   F14, -24(R1)
          ADDD F16, F14, F2
          SD   -24(R1), F16
          SUBI R1, R1, #32
          BNEZ R1, Loop

Scheduled unrolled loop:

    Loop: LD   F0, 0(R1)
          LD   F6, -8(R1)
          LD   F10, -16(R1)
          LD   F14, -24(R1)
          ADDD F4, F0, F2
          ADDD F8, F6, F2
          ADDD F12, F10, F2
          ADDD F16, F14, F2
          SD   0(R1), F4
          SD   -8(R1), F8
          SUBI R1, R1, #32
          SD   16(R1), F12
          BNEZ R1, Loop
          SD   8(R1), F16

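To compare this schedule with the Basic Scheduling slide, it helps to express each loop as cycles per array element. The Basic Scheduling slide gives 10 cycles per iteration unscheduled and 6 scheduled. The figure of 14 cycles for the scheduled unrolled loop is not stated in the slides; it is an assumption obtained by counting its 14 instructions and checking that, with the same stall pattern as on the Basic Scheduling slide, no stalls appear to be needed. The helper name `cycles_per_element` is hypothetical.

```c
#include <assert.h>

/* Cycles per array element: cycles for one trip through the loop body
   divided by the number of elements that trip processes. */
double cycles_per_element(int cycles_per_trip, int elements_per_trip) {
    assert(elements_per_trip > 0);
    return (double)cycles_per_trip / (double)elements_per_trip;
}
```

Under these assumptions, `cycles_per_element(10, 1)` gives 10.0 for the original loop, `cycles_per_element(6, 1)` gives 6.0 for the scheduled loop, and `cycles_per_element(14, 4)` gives 3.5 for the scheduled unrolled loop.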
Some General Comments
• Dependences are a property of programs
• Actual hazards are a property of the pipeline
• Techniques to avoid dependence limitations
    • Maintain dependences but avoid hazards
        • Code scheduling
            • hardware
            • software
    • Eliminate dependences by code transformations
        • Complex
        • Compiler-based

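Loop-level Parallelism
• Primary focus of dependence analysis
• Determine all dependences and find cycles

Dependence only within an iteration (w[i] uses the x[i] computed just before it):

    for (i = 1; i <= 100; i = i + 1) {
        x[i] = y[i] + z[i];
        w[i] = x[i] + v[i];
    }

Loop-carried dependence that forms a cycle (x[i+1] uses the x[i] from the previous iteration):

    for (i = 1; i <= 100; i = i + 1) {
        x[i+1] = x[i] + z[i];
    }

Loop-carried dependence without a cycle (x[i] uses the y[i] written by the previous iteration):

    for (i = 1; i <= 100; i = i + 1) {
        x[i] = x[i] + y[i];
        y[i+1] = w[i] + z[i];
    }

The same loop transformed so that the dependence is no longer loop-carried:

    x[1] = x[1] + y[1];
    for (i = 1; i <= 99; i = i + 1) {
        y[i+1] = w[i] + z[i];
        x[i+1] = x[i+1] + y[i+1];
    }
    y[101] = w[100] + z[100];
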
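One way to convince yourself that the peeled form (the one that starts with `x[1] = x[1] + y[1]`) computes the same values as the loop with statements `x[i] = x[i] + y[i]` and `y[i+1] = w[i] + z[i]` is to run both on the same data. The sketch below generalizes the bound 100 to a parameter `n`; the function names are hypothetical, and the arrays are assumed to have at least `n + 2` elements so that index `n + 1` is valid.

```c
#include <assert.h>

/* Loop as written on the slide: x[i] reads the y[i] that the previous
   iteration wrote, so the dependence is loop-carried. */
void loop_original(double *x, double *y, const double *w,
                   const double *z, int n) {
    for (int i = 1; i <= n; i = i + 1) {
        x[i] = x[i] + y[i];
        y[i + 1] = w[i] + z[i];
    }
}

/* Transformed form: the first x update and the last y update are peeled
   off, and the two statements are swapped, so y[i+1] is produced and
   consumed inside the same iteration. */
void loop_transformed(double *x, double *y, const double *w,
                      const double *z, int n) {
    assert(n >= 1);
    x[1] = x[1] + y[1];
    for (int i = 1; i <= n - 1; i = i + 1) {
        y[i + 1] = w[i] + z[i];
        x[i + 1] = x[i + 1] + y[i + 1];
    }
    y[n + 1] = w[n] + z[n];
}
```

Both versions perform the same floating-point operations on the same operands, so the results should match exactly, not just approximately.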
Dependence Analysis Algorithms
• Assume array indexes are affine (a*i + b)
• GCD test: for two affine array indexes a*i + b and c*i + d, if a loop-carried dependence exists, then GCD(c, a) must divide (d - b)

    x[8*i] = x[4*i + 2] + 3

    • Here a = 8, b = 0, c = 4, d = 2, so (d - b) / GCD(c, a) = (2 - 0) / GCD(8, 4) = 2/4
    • 4 does not divide 2, so no dependence is possible
• In general, determining whether a dependence exists is NP-complete
• a, b, c, and d may not be known at compile time

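The GCD test is small enough to write out directly. The following is a minimal sketch in C with hypothetical function names; it assumes the coefficients are known integers and that at least one of a and c is nonzero. The test is conservative: a `false` result rules out a dependence, while `true` only means one may exist.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Greatest common divisor of |p| and |q| by Euclid's algorithm. */
static int gcd(int p, int q) {
    p = abs(p);
    q = abs(q);
    while (q != 0) {
        int r = p % q;
        p = q;
        q = r;
    }
    return p;
}

/* GCD test for the index pair a*i + b and c*i + d.
   Returns false when GCD(c, a) does not divide (d - b), which rules out
   a dependence; returns true when a dependence cannot be ruled out. */
bool gcd_test_may_depend(int a, int b, int c, int d) {
    assert(a != 0 || c != 0);  /* at least one index must vary with i */
    int g = gcd(c, a);
    return (d - b) % g == 0;
}
```

For the slide's example, `gcd_test_may_depend(8, 0, 4, 2)` returns `false`, matching the conclusion that `x[8*i]` and `x[4*i + 2]` never refer to the same element.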
Software Pipelining
[Figure: iterations 0 to 3 overlapped in time, with start-up and finish-up regions; one software-pipelined iteration is marked across them]

Example

    Iteration i:     LD   F0, 0(R1)
                     ADDD F4, F0, F2
                     SD   0(R1), F4
    Iteration i+1:   LD   F0, 0(R1)
                     ADDD F4, F0, F2
                     SD   0(R1), F4
    Iteration i+2:   LD   F0, 0(R1)
                     ADDD F4, F0, F2
                     SD   0(R1), F4

Original loop:

    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          SUBI R1, R1, #8
          BNEZ R1, Loop

Software-pipelined loop:

    Loop: SD   16(R1), F4
          ADDD F4, F0, F2
          LD   F0, 0(R1)
          SUBI R1, R1, #8
          BNEZ R1, Loop

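The software-pipelined loop on this slide shows only the steady-state body. A source-level sketch that also spells out the start-up and finish-up code might look like the following. It is an illustration under assumptions, not the slide's method: the function name `add_scalar_swp` and the variables `loaded` and `summed` (standing in for F0 and F4) are hypothetical, and at least two elements are assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Software-pipelined x[i] = x[i] + s for i = n down to 1.
   x is indexed 1..n, so it must have n + 1 slots. */
void add_scalar_swp(double *x, size_t n, double s) {
    assert(n >= 2);

    /* Start-up: fill the pipeline. */
    double loaded = x[n];          /* load for element n      */
    double summed = loaded + s;    /* add for element n       */
    loaded = x[n - 1];             /* load for element n - 1  */

    /* Steady state: each trip stores one element, adds the next,
       and loads the one after that. */
    for (size_t i = n - 2; i > 0; i = i - 1) {
        x[i + 2] = summed;         /* store for element i + 2 */
        summed = loaded + s;       /* add for element i + 1   */
        loaded = x[i];             /* load for element i      */
    }

    /* Finish-up: drain the pipeline. */
    x[2] = summed;                 /* store for element 2     */
    summed = loaded + s;           /* add for element 1       */
    x[1] = summed;                 /* store for element 1     */
}
```

Inside the steady-state loop, the store, add and load belong to three different original iterations, so none of them waits on a result produced in the same trip.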
Trace (global-code) Scheduling
• Find ILP across conditional branches
• Two-step process
    • Trace selection
        • Find a trace (sequence of basic blocks)
        • Use loop unrolling to generate long traces
        • Use static branch prediction for other conditional branches
    • Trace compaction
        • Squeeze the trace into a small number of wide instructions
        • Preserve data and control dependences

Trace Selection
[Flowchart: A[I] = A[I] + B[I]; test "A[I] = 0?"; T branch: B[I] = ; F branch: X; join: C[I] = ]

          LW   R4, 0(R1)
          LW   R5, 0(R2)
          ADD  R4, R4, R5
          SW   0(R1), R4
          BNEZ R4, else
          . . . .
          SW   0(R2), . . .
          J    join
    else: . . . .
          X
    join: . . . .
          SW   0(R3), . . .

Summary of Compiler Techniques
• Try to avoid dependence stalls
• Loop unrolling
    • Reduce loop overhead
• Software pipelining
    • Reduce single body dependence stalls
• Trace scheduling
    • Reduce impact of other branches
• Compilers use a mix of all three
• All techniques depend on prediction accuracy

Food for thought: Analyze this
• Analyze this for different values of X and Y
    • To evaluate different branch prediction schemes
    • For compiler scheduling purposes

          add  r1, r0, 1000    # all numbers in decimal
          add  r2, r0, a       # Base address of array a
    loop: andi r10, r1, X
          beqz r10, even
          lw   r11, 0(r2)
          addi r11, r11, 1
          sw   0(r2), r11
    even: addi r2, r2, 4
          subi r1, r1, Y
          bnez r1, loop

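A quick way to start on the exercise is to count how often the `beqz` is taken for a given X and Y. The sketch below is a hypothetical helper, not part of the original exercise: the names `branch_stats` and `simulate_loop` are made up, and it only handles values of Y that divide 1000, so that r1 reaches zero exactly and the loop ends.

```c
#include <assert.h>

/* Counts collected over one full run of the loop. */
struct branch_stats {
    int iterations;   /* times the loop body ran               */
    int beqz_taken;   /* times "beqz r10, even" was taken      */
};

/* Mimics the control flow of the assembly loop for mask X and step Y.
   Assumes Y > 0 and Y divides 1000, so r1 hits 0 exactly. */
struct branch_stats simulate_loop(int x_mask, int y_step) {
    assert(y_step > 0 && 1000 % y_step == 0);
    struct branch_stats st = {0, 0};
    int r1 = 1000;
    do {
        st.iterations++;
        if ((r1 & x_mask) == 0)   /* andi r10, r1, X ; beqz r10, even */
            st.beqz_taken++;
        r1 -= y_step;             /* subi r1, r1, Y */
    } while (r1 != 0);            /* bnez r1, loop  */
    return st;
}
```

Comparing `beqz_taken` with `iterations` for a few (X, Y) pairs shows whether the branch alternates, is always taken, or follows a longer pattern, which is the behavior a branch predictor or a static scheduler would have to cope with.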