Compiler Techniques for ILP
Instruction-level parallelism: scheduling, loop unrolling, software pipelining.
Software: compiler ILP techniques (chapter 4).
(ILP) Instruction-Level Parallelism
• Overlap unrelated instructions
• About 17% of instructions are branches, so a typical basic block is 5 instructions + 1 branch
• Must go beyond a single basic block to get more ILP
• Loop-level parallelism is one opportunity
• SW ch 3
• HW (dynamic scheduling ch 4)
Example: FP Loop Hazards & Stalls
To demo various techniques

    Loop: LD   F0,0(R1)   ;F0=vector element
          ADDD F4,F0,F2   ;add scalar from F2
          SD   0(R1),F4   ;store result
          SUBI R1,R1,8    ;decrement pointer
          BNEZ R1,Loop    ;branch R1!=zero
          NOP             ;delayed branch slot
FP Loop Dependency Stalls
(left column = clock cycle)

    1 Loop: LD   F0,0(R1)   ;F0=vector element
    2       stall           ;latency between LD, ADDD
    3       ADDD F4,F0,F2   ;add scalar in F2
    4       stall           ;latency between ADDD, SD
    5       stall
    6       SD   0(R1),F4   ;store result
    7       SUBI R1,R1,8    ;decrement pointer by 8 bytes (one doubleword)
    8       BNEZ R1,Loop    ;branch R1!=zero
    9       stall           ;delayed branch slot

• 9 clocks: rewrite to minimize stalls
Revised Code to Minimize Stalls
SD moved to slot 6, to separate it from ADDD

    1 Loop: LD   F0,0(R1)
    2       stall
    3       ADDD F4,F0,F2
    4       SUBI R1,R1,8
    5       BNEZ R1,Loop    ;delayed branch
    6       SD   8(R1),F4   ;offset altered when SD moved past SUBI

• 6 clocks
• Unroll loop to make code faster
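The two cycle counts above (9 clocks, then 6) can be reproduced with a small in-order issue model. This is only a sketch under stated assumptions: the stall values (1 between LD and ADDD, 2 between ADDD and SD, none between SUBI and BNEZ) are read off the slides, the model looks at a single loop body and ignores values carried in from the previous iteration, and every name in it (`Instr`, `stalls_between`, `loop_cycles`, the two tables) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { NO_REG = -1, MAX_INSTRS = 16 };
#define R(n) (n)        /* integer registers: ids 0..31 */
#define F(n) (32 + (n)) /* FP registers: ids 32..63     */

typedef enum { OP_LOAD, OP_FPALU, OP_STORE, OP_INT, OP_BRANCH, OP_NOP } OpKind;
typedef struct { OpKind kind; int dst, src1, src2; } Instr;

/* Stall cycles between a producer and a dependent consumer (assumed from the slides). */
static int stalls_between(OpKind producer, OpKind consumer) {
    if (producer == OP_LOAD && consumer == OP_FPALU) return 1;
    if (producer == OP_FPALU && consumer == OP_STORE) return 2;
    return 0;
}

/* Index of the nearest earlier instruction that writes reg, or -1 if none. */
static int producer_of(const Instr *code, size_t j, int reg) {
    for (size_t i = j; i-- > 0; )
        if (code[i].dst == reg) return (int)i;
    return -1;
}

/* Clock cycle in which the last instruction of one loop body issues. */
int loop_cycles(const Instr *code, size_t n) {
    int issue[MAX_INSTRS];
    assert(n > 0 && n <= MAX_INSTRS);
    for (size_t j = 0; j < n; j++) {
        int cycle = (j == 0) ? 1 : issue[j - 1] + 1; /* single issue, in order */
        const int srcs[2] = { code[j].src1, code[j].src2 };
        for (int s = 0; s < 2; s++) {
            if (srcs[s] == NO_REG) continue;
            int p = producer_of(code, j, srcs[s]);
            if (p < 0) continue; /* value comes from outside this loop body */
            int ready = issue[p] + 1 + stalls_between(code[p].kind, code[j].kind);
            if (ready > cycle) cycle = ready;
        }
        issue[j] = cycle;
    }
    return issue[n - 1];
}

/* Original order: LD, ADDD, SD, SUBI, BNEZ, NOP in the delay slot. */
static const Instr original_loop[] = {
    { OP_LOAD,   F(0),   R(1),   NO_REG },
    { OP_FPALU,  F(4),   F(0),   F(2)   },
    { OP_STORE,  NO_REG, F(4),   R(1)   },
    { OP_INT,    R(1),   R(1),   NO_REG },
    { OP_BRANCH, NO_REG, R(1),   NO_REG },
    { OP_NOP,    NO_REG, NO_REG, NO_REG },
};

/* Revised order: SD fills the branch delay slot. */
static const Instr scheduled_loop[] = {
    { OP_LOAD,   F(0),   R(1),   NO_REG },
    { OP_FPALU,  F(4),   F(0),   F(2)   },
    { OP_INT,    R(1),   R(1),   NO_REG },
    { OP_BRANCH, NO_REG, R(1),   NO_REG },
    { OP_STORE,  NO_REG, F(4),   R(1)   },
};
```

Calling `loop_cycles(original_loop, 6)` and `loop_cycles(scheduled_loop, 5)` compares the two orderings; the one stall that remains in the revised code is the LD-to-ADDD latency, which reordering within a single iteration cannot hide.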
Loop Unrolling Example

Before unrolling:

    for (i=0; i<N; i++)
        B[i] = A[i] + C;

    loop: ld   f1, 0(r1)
          add  r1, 8
          fadd f2, f0, f1
          sd   f2, 0(r2)
          add  r2, 8
          bne  r1, r3, loop

After unrolling (inner loop unrolled, 4 iterations at once):

    for (i=0; i<N; i+=4) {
        B[i]   = A[i]   + C;
        B[i+1] = A[i+1] + C;
        B[i+2] = A[i+2] + C;
        B[i+3] = A[i+3] + C;
    }

• Loop unrolling reduces branches and loop-overhead code
• Allows aggressive pipeline scheduling
• If N is not a multiple of the unrolling factor, handle the remainder with a final cleanup loop
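The cleanup loop mentioned in the last bullet can be sketched in C as follows. This is a minimal illustration, not the slide author's code: the function names `add_scalar` and `add_scalar_unrolled4` are hypothetical, and the arrays are assumed to hold doubles.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: one element per iteration. */
void add_scalar(const double *A, double *B, double C, size_t n) {
    for (size_t i = 0; i < n; i++)
        B[i] = A[i] + C;
}

/* Unrolled 4 ways, with a cleanup loop for the n % 4 leftover elements. */
void add_scalar_unrolled4(const double *A, double *B, double C, size_t n) {
    assert(A != NULL && B != NULL);
    size_t limit = n - (n % 4); /* largest multiple of 4 that is <= n */
    size_t i = 0;
    for (; i < limit; i += 4) { /* one branch per four elements */
        B[i]     = A[i]     + C;
        B[i + 1] = A[i + 1] + C;
        B[i + 2] = A[i + 2] + C;
        B[i + 3] = A[i + 3] + C;
    }
    for (; i < n; i++)          /* cleanup: at most 3 iterations */
        B[i] = A[i] + C;
}
```

Both functions should produce the same output for any `n`, including values below 4, where only the cleanup loop runs.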
Scheduled Unrolled Code
• Multiple units needed
• Unroll 4 ways

    loop: ld   f1, 0(r1)
          ld   f2, 8(r1)
          ld   f3, 16(r1)
          ld   f4, 24(r1)
          add  r1, 32
          fadd f5, f0, f1
          fadd f6, f0, f2
          fadd f7, f0, f3
          fadd f8, f0, f4
          sd   f5, 0(r2)
          sd   f6, 8(r2)
          sd   f7, 16(r2)
          sd   f8, 24(r2)
          add  r2, 32
          bne  r1, r3, loop

Schedule (one row per cycle):

    Cycle  Int1    Int2   M1      M2   FP+       FPx
    1 loop:               ld f1
    2                     ld f2
    3                     ld f3
    4      add r1         ld f4        fadd f5
    5                                  fadd f6
    6                                  fadd f7
    7                                  fadd f8
    8                     sd f5
    9                     sd f6
    10                    sd f7
    11     add r2  bne    sd f8

• 4 fadds / 11 cycles = 0.36
• Pointers r1, r2 are now incremented by 32
Another ILP Technique: Software Pipelining
• For independent loop iterations, take instructions from different loop iterations
• Reorganize the loop so that each new iteration is made from instructions taken from different iterations of the original loop
• Dependent operations within each original iteration are separated. More efficient than unrolling.
• Register management is tough. Supported in Itanium.
Software Pipelining Example
• Unroll 4 ways first

    loop: ld   f1, 0(r1)
          ld   f2, 8(r1)
          ld   f3, 16(r1)
          ld   f4, 24(r1)
          add  r1, 32
          fadd f5, f0, f1
          fadd f6, f0, f2
          fadd f7, f0, f3
          fadd f8, f0, f4
          sd   f5, 0(r2)
          sd   f6, 8(r2)
          sd   f7, 16(r2)
          add  r2, 32
          sd   f8, -8(r2)
          bne  r1, r3, loop

Schedule: a prolog fills the pipeline (loads, then loads overlapped with fadds), the loop body iterates, and an epilog drains the remaining fadds and stores. Steady-state loop body (one row per cycle):

           Int1    Int2   M1      M2      FP+       FPx
    loop:                 ld f1   sd f5   fadd f5
                          ld f2   sd f6   fadd f6
           add r2         ld f3   sd f7   fadd f7
           add r1  bne    ld f4   sd f8   fadd f8

• How many FLOPS/cycle? 4 fadds / 4 cycles = 1
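The same prolog / loop body / epilog shape can be written by hand at the source level. The sketch below is an assumption-laden illustration, not the schedule from the slide: it pipelines the simple loop B[i] = A[i] + C in three stages (load, add, store) without unrolling, and the function name `add_scalar_swp` and the variables `loaded` and `summed` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Software-pipelined B[i] = A[i] + C.
 * Each pass of the steady-state loop does work for three different
 * original iterations: store for i, add for i+1, load for i+2. */
void add_scalar_swp(const double *A, double *B, double C, size_t n) {
    assert(A != NULL && B != NULL);
    if (n == 0) return;
    if (n == 1) { B[0] = A[0] + C; return; } /* too short to pipeline */

    /* Prolog: start iterations 0 and 1 so the pipeline is full. */
    double summed = A[0] + C; /* load + add for iteration 0 */
    double loaded = A[1];     /* load for iteration 1       */

    /* Steady state: the three statements are independent of each other
     * within one pass, so hardware can overlap them. */
    size_t i = 0;
    for (; i + 2 < n; i++) {
        B[i]   = summed;      /* store, iteration i     */
        summed = loaded + C;  /* add,   iteration i + 1 */
        loaded = A[i + 2];    /* load,  iteration i + 2 */
    }

    /* Epilog: drain the two iterations still in flight (i == n - 2). */
    B[i]     = summed;
    B[i + 1] = loaded + C;
}
```

The startup and drain code runs once per call, however long the array is, which is the property the slides attribute to software pipelining. The extra live values (`summed`, `loaded`) hint at why register management gets harder as more stages overlap.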
Software Pipelining vs. Loop Unrolling
[Figure: two plots of performance vs. time across loop iterations. The "Loop Unrolled" panel is annotated with startup overhead and wind-down overhead; the second panel is "Software Pipelined".]
• Software pipelining pays startup/wind-down costs only once per loop, not once per iteration
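A rough cost model makes the comparison concrete. It uses the two figures from the scheduling slides (11 cycles per pass of the 4-way unrolled loop, 4 cycles per pass of the software-pipelined loop body); the one-time fill/drain cost is not given in the slides, so `fill_drain_cycles` is a hypothetical parameter, and both function names are illustrative only.

```c
#include <assert.h>

/* 4-way unrolled loop: every pass pays the full 11-cycle schedule,
 * including its startup and wind-down portions. */
long unrolled_cycles(long n_elems) {
    assert(n_elems >= 0 && n_elems % 4 == 0);
    return (n_elems / 4) * 11;
}

/* Software-pipelined loop: 4 cycles per pass in steady state, plus a
 * prolog/epilog cost paid once per loop (assumed, not from the slides). */
long pipelined_cycles(long n_elems, long fill_drain_cycles) {
    assert(n_elems >= 0 && n_elems % 4 == 0 && fill_drain_cycles >= 0);
    return fill_drain_cycles + (n_elems / 4) * 4;
}
```

For 400 elements and an assumed fill/drain cost of 7 cycles, the model gives 1100 cycles for the unrolled loop against 407 for the pipelined one; the gap grows with the trip count because the overhead term does not.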
Register Renaming
• Eliminates name dependencies
• Only RAWs remain

    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)      ;drop DADDUI,BNE
          L.D    F6,-8(R1)
          ADD.D  F8,F6,F2
          S.D    F8,-8(R1)     ;drop DADDUI,BNE
          L.D    F10,-16(R1)
          ADD.D  F12,F10,F2
          S.D    F12,-16(R1)   ;drop DADDUI,BNE
          L.D    F14,-24(R1)
          ADD.D  F16,F14,F2
          S.D    F16,-24(R1)
          DADDUI R1,R1,#-32
          BNE    R1,R2,Loop
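The claim that only RAWs remain can be checked mechanically by counting dependences over the FP registers of the unrolled body, once with every copy reusing F0 and F4 and once with the renamed registers shown above. This is a sketch under assumptions: integer registers and memory dependences are ignored, and the types and names (`Instr`, `DepCount`, `count_deps`, the two tables) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { NONE = -1 };
typedef struct { int dst, src1, src2; } Instr; /* FP register numbers, NONE if unused */
typedef struct { int raw, war, waw; } DepCount;

/* Count register dependences in straight-line code.
 * RAW: a source's nearest earlier writer. WAW: a destination's nearest
 * earlier writer. WAR: earlier readers of a destination since its last write. */
DepCount count_deps(const Instr *code, size_t n) {
    DepCount d = { 0, 0, 0 };
    assert(code != NULL);
    for (size_t j = 0; j < n; j++) {
        const int srcs[2] = { code[j].src1, code[j].src2 };
        for (int s = 0; s < 2; s++) {
            if (srcs[s] == NONE) continue;
            if (s == 1 && srcs[1] == srcs[0]) continue; /* same register twice */
            for (size_t i = j; i-- > 0; )
                if (code[i].dst == srcs[s]) { d.raw++; break; }
        }
        if (code[j].dst == NONE) continue;
        for (size_t i = j; i-- > 0; ) {
            if (code[i].src1 == code[j].dst || code[i].src2 == code[j].dst)
                d.war++;
            if (code[i].dst == code[j].dst) { d.waw++; break; }
        }
    }
    return d;
}

/* Four unrolled copies of L.D / ADD.D / S.D, all reusing F0 and F4. */
static const Instr reused_regs[12] = {
    { 0, NONE, NONE }, { 4, 0, 2 }, { NONE, 4, NONE },
    { 0, NONE, NONE }, { 4, 0, 2 }, { NONE, 4, NONE },
    { 0, NONE, NONE }, { 4, 0, 2 }, { NONE, 4, NONE },
    { 0, NONE, NONE }, { 4, 0, 2 }, { NONE, 4, NONE },
};

/* The same code after renaming: F0/F4, F6/F8, F10/F12, F14/F16. */
static const Instr renamed_regs[12] = {
    { 0,  NONE, NONE }, { 4,  0,  2 }, { NONE, 4,  NONE },
    { 6,  NONE, NONE }, { 8,  6,  2 }, { NONE, 8,  NONE },
    { 10, NONE, NONE }, { 12, 10, 2 }, { NONE, 12, NONE },
    { 14, NONE, NONE }, { 16, 14, 2 }, { NONE, 16, NONE },
};
```

Under this model the true (RAW) dependences are identical in both versions, two per copy of the loop body, while the WAR and WAW counts drop to zero after renaming. That is what frees the scheduler to interleave the four copies.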