
Compiler Techniques for ILP

Compiler techniques for instruction-level parallelism (ILP): scheduling, loop unrolling, and software pipelining.

wellse

Presentation Transcript


  1. Compiler Techniques for ILP (Instruction-Level Parallelism): scheduling, loop unrolling, software pipelining

  2. Software: compiler ILP techniques (chapter 4)

  3. Instruction-level parallelism (ILP): overlap unrelated instructions
     • About 17% of instructions are branches, so a typical basic block is roughly 5 instructions + 1 branch
     • Must go beyond a single basic block to get more ILP
     • Loop-level parallelism is one opportunity
     • Software (ch 3)
     • Hardware (dynamic scheduling, ch 4)

  4. Example: FP loop, hazards and stalls (used to demonstrate the various techniques)

         Loop: LD   F0,0(R1)   ;F0=vector element
               ADDD F4,F0,F2   ;add scalar from F2
               SD   0(R1),F4   ;store result
               SUBI R1,R1,8    ;decrement pointer
               BNEZ R1,Loop    ;branch R1!=zero
               NOP             ;delayed branch slot

  5. FP loop dependency stalls

         1 Loop: LD   F0,0(R1)   ;F0=vector element
         2       stall           ;latency between LD, ADDD
         3       ADDD F4,F0,F2   ;add scalar in F2
         4       stall           ;latency between ADDD, SD
         5       stall
         6       SD   0(R1),F4   ;store result
         7       SUBI R1,R1,8    ;decrement pointer 8B (DW)
         8       BNEZ R1,Loop    ;branch R1!=zero
         9       stall           ;delayed branch slot

     • 9 clocks per iteration: rewrite to minimize stalls

  6. Revised code to minimize stalls: SD moved to slot 6 (the branch delay slot), to separate it from ADDD

         1 Loop: LD   F0,0(R1)
         2       stall
         3       ADDD F4,F0,F2
         4       SUBI R1,R1,8
         5       BNEZ R1,Loop    ;delayed branch
         6       SD   8(R1),F4   ;offset altered when moved past SUBI

     • 6 clocks per iteration
     • Unroll the loop to make the code faster
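The 9-clock and 6-clock counts can be reproduced with a small model. The C sketch below assumes a single-issue, in-order pipeline and takes its stall counts from the slides: one stall between a load and a dependent FP add, two stalls between an FP add and a dependent store, and none elsewhere. The names (`struct instr`, `stalls_between`, `count_cycles`) and the integer register encoding are illustrative assumptions, not part of the original deck.

```c
#include <stddef.h>

enum op_kind { OP_LOAD, OP_FP_ALU, OP_STORE, OP_INT_ALU, OP_BRANCH, OP_NOP };

struct instr {
    enum op_kind kind;
    int dest;    /* register written, NONE if none */
    int src[2];  /* registers read, NONE if unused */
};

/* Hypothetical register encoding used only for this sketch. */
enum { F0 = 0, F2 = 2, F4 = 4, R1 = 101, NONE = -1 };

/* Stall cycles between a producer and a dependent consumer (assumed from the slides). */
static int stalls_between(enum op_kind producer, enum op_kind consumer)
{
    if (producer == OP_LOAD && consumer == OP_FP_ALU) return 1;
    if (producer == OP_FP_ALU && consumer == OP_STORE) return 2;
    return 0;
}

#define MAX_INSTRS 64

/* Returns the clock in which the last instruction issues, or -1 if n is too large. */
int count_cycles(const struct instr *code, size_t n)
{
    int issue[MAX_INSTRS];
    int cycle = 0;
    if (n > MAX_INSTRS) return -1;
    for (size_t i = 0; i < n; i++) {
        int earliest = cycle + 1;  /* in-order: at most one instruction per clock */
        for (size_t j = 0; j < i; j++) {
            int d = code[j].dest;
            if (d < 0 || (d != code[i].src[0] && d != code[i].src[1]))
                continue;          /* instruction i does not read what j wrote */
            int ready = issue[j] + 1 + stalls_between(code[j].kind, code[i].kind);
            if (ready > earliest) earliest = ready;
        }
        issue[i] = earliest;
        cycle = earliest;
    }
    return cycle;
}

/* Original order: LD, ADDD, SD, SUBI, BNEZ, NOP (unfilled delay slot). */
const struct instr original_loop[] = {
    { OP_LOAD,    F0,   { R1, NONE } },
    { OP_FP_ALU,  F4,   { F0, F2 } },
    { OP_STORE,   NONE, { F4, R1 } },
    { OP_INT_ALU, R1,   { R1, NONE } },
    { OP_BRANCH,  NONE, { R1, NONE } },
    { OP_NOP,     NONE, { NONE, NONE } },
};

/* Rescheduled order: SD moved into the branch delay slot. */
const struct instr scheduled_loop[] = {
    { OP_LOAD,    F0,   { R1, NONE } },
    { OP_FP_ALU,  F4,   { F0, F2 } },
    { OP_INT_ALU, R1,   { R1, NONE } },
    { OP_BRANCH,  NONE, { R1, NONE } },
    { OP_STORE,   NONE, { F4, R1 } },
};
```

Calling `count_cycles(original_loop, 6)` and `count_cycles(scheduled_loop, 5)` gives the per-iteration clock counts for the two orderings under these assumed latencies.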

  7. Loop Unrolling Example

     Before unrolling:

         for (i=0; i<N; i++)
             B[i] = A[i] + C;

         loop: ld   f1, 0(r1)
               add  r1, 8
               fadd f2, f0, f1
               sd   f2, 0(r2)
               add  r2, 8
               bne  r1, r3, loop

     After unrolling (inner loop unrolled to do 4 iterations at once):

         for (i=0; i<N; i+=4) {
             B[i]   = A[i]   + C;
             B[i+1] = A[i+1] + C;
             B[i+2] = A[i+2] + C;
             B[i+3] = A[i+3] + C;
         }

     • Loop unrolling reduces the number of branches and loop-overhead instructions executed
     • Allows aggressive pipeline scheduling
     • If N is not a multiple of the unrolling factor, handle the remainder with a final cleanup loop
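The cleanup loop mentioned on this slide can be sketched in C as follows. The function name `add_scalar_unrolled` and the use of `double` arrays are assumptions made for illustration; the unrolling factor of 4 matches the slide.

```c
#include <stddef.h>

/* Computes b[i] = a[i] + c for i in [0, n), unrolled 4 ways. */
void add_scalar_unrolled(double *b, const double *a, double c, size_t n)
{
    size_t i = 0;
    size_t main_end = n - (n % 4);  /* largest multiple of 4 not exceeding n */

    /* Main unrolled loop: one branch per four elements. */
    for (; i < main_end; i += 4) {
        b[i]     = a[i]     + c;
        b[i + 1] = a[i + 1] + c;
        b[i + 2] = a[i + 2] + c;
        b[i + 3] = a[i + 3] + c;
    }

    /* Cleanup loop: handles the 0 to 3 elements left over. */
    for (; i < n; i++)
        b[i] = a[i] + c;
}
```

With `n = 10`, for example, the main loop covers elements 0 through 7 and the cleanup loop handles elements 8 and 9.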

  8. Scheduled Unrolled Code (multiple functional units needed)

     Unroll 4 ways:

         loop: ld   f1, 0(r1)
               ld   f2, 8(r1)
               ld   f3, 16(r1)
               ld   f4, 24(r1)
               add  r1, 32
               fadd f5, f0, f1
               fadd f6, f0, f2
               fadd f7, f0, f3
               fadd f8, f0, f4
               sd   f5, 0(r2)
               sd   f6, 8(r2)
               sd   f7, 16(r2)
               sd   f8, 24(r2)
               add  r2, 32
               bne  r1, r3, loop

     Pointers r1 and r2 are now incremented by 32.

     Schedule (functional units: Int1, Int2, M1, M2, FP+, FPx), one row per cycle:

          1  loop: ld f1
          2  ld f2
          3  ld f3
          4  add r1 | ld f4 | fadd f5
          5  fadd f6
          6  fadd f7
          7  fadd f8
          8  sd f5
          9  sd f6
         10  sd f7
         11  add r2 | bne | sd f8

     4 fadds / 11 cycles = 0.36

  9. Another ILP technique: Software Pipelining
     • For loops with independent iterations, take instructions from different loop iterations
     • Reorganize the loop so that each new iteration is made from instructions drawn from different iterations of the original loop
     • Dependent operations within each original iteration are separated
     • More efficient than unrolling
     • Register management is tough
     • Supported in Itanium
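The reorganization can be illustrated at source level on the loop `B[i] = A[i] + C`. The C sketch below is only an analogy: a real compiler software-pipelines machine instructions, and an optimizing C compiler is free to reorder these statements. The function name `add_scalar_swp` and the three-stage split (load, add, store) are assumptions for illustration. Each pass of the kernel stores the result for iteration i, adds for iteration i+1, and loads for iteration i+2.

```c
#include <stddef.h>

/* Computes b[i] = a[i] + c for i in [0, n) with a prolog, kernel, and epilog. */
void add_scalar_swp(double *b, const double *a, double c, size_t n)
{
    if (n < 2) {                 /* too short to pipeline */
        if (n == 1) b[0] = a[0] + c;
        return;
    }

    /* Prolog: start iterations 0 and 1 without finishing either. */
    double loaded = a[0];        /* load for iteration 0 */
    double sum = loaded + c;     /* add  for iteration 0 */
    loaded = a[1];               /* load for iteration 1 */

    /* Kernel: one stage from each of three different iterations. */
    size_t i;
    for (i = 0; i + 2 < n; i++) {
        b[i] = sum;              /* store for iteration i     */
        sum = loaded + c;        /* add   for iteration i + 1 */
        loaded = a[i + 2];       /* load  for iteration i + 2 */
    }

    /* Epilog: drain the two iterations still in flight (i == n - 2). */
    b[i] = sum;                  /* store for iteration n - 2 */
    sum = loaded + c;            /* add   for iteration n - 1 */
    b[i + 1] = sum;              /* store for iteration n - 1 */
}
```

Inside the kernel, the store, the add, and the load belong to different original iterations, so none of them waits on a result produced in the same pass.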

  10. Software Pipelining Example

      Unroll 4 ways first:

          loop: ld   f1, 0(r1)
                ld   f2, 8(r1)
                ld   f3, 16(r1)
                ld   f4, 24(r1)
                add  r1, 32
                fadd f5, f0, f1
                fadd f6, f0, f2
                fadd f7, f0, f3
                fadd f8, f0, f4
                sd   f5, 0(r2)
                sd   f6, 8(r2)
                sd   f7, 16(r2)
                add  r2, 32
                sd   f8, -8(r2)
                bne  r1, r3, loop

      [Schedule table over functional units Int1, Int2, M1, M2, FP+, FPx, divided into prolog, loop (iterate), and epilog sections]

      How many FLOPS/cycle? 4 fadds / 4 cycles = 1

  11. Software Pipelining vs. Loop Unrolling

      [Figure: two plots of performance vs. time over loop iterations. "Loop Unrolled" is annotated with startup overhead and wind-down overhead; "Software Pipelined" is shown for comparison.]

      Software pipelining pays startup/wind-down costs only once per loop, not once per iteration
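A rough cost model makes the comparison concrete. In the C sketch below, the 11-cycle unrolled body and the 4-cycle pipelined kernel are the figures quoted on the schedule slides; the prolog and epilog lengths are hypothetical inputs, and the model ignores the fact that the prolog and epilog themselves complete some of the work. The function names are illustrative.

```c
/* Unrolled loop: every pass through the body pays its own fill and drain. */
long unrolled_cycles(long passes, long body_cycles)
{
    return passes * body_cycles;
}

/* Software-pipelined loop: prolog and epilog are paid once per loop. */
long pipelined_cycles(long passes, long kernel_cycles,
                      long prolog_cycles, long epilog_cycles)
{
    return prolog_cycles + passes * kernel_cycles + epilog_cycles;
}
```

For 100 passes, `unrolled_cycles(100, 11)` gives 1100 cycles, while `pipelined_cycles(100, 4, 8, 8)` gives 416 under the assumed 8-cycle prolog and epilog. The fixed overhead is amortized over the whole loop rather than repeated on every pass.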

  12. Register renaming eliminates name dependencies; only RAW (read-after-write) dependencies remain

          Loop: L.D    F0,0(R1)
                ADD.D  F4,F0,F2
                S.D    F4,0(R1)       ;drop DADDUI,BNE
                L.D    F6,-8(R1)
                ADD.D  F8,F6,F2
                S.D    F8,-8(R1)      ;drop DADDUI,BNE
                L.D    F10,-16(R1)
                ADD.D  F12,F10,F2
                S.D    F12,-16(R1)    ;drop DADDUI,BNE
                L.D    F14,-24(R1)
                ADD.D  F16,F14,F2
                S.D    F16,-24(R1)    ;drop DADDUI,BNE
                DADDUI R1,R1,#-32
                BNE    R1,R2,Loop
