CS 6461: Computer Architecture Basic Compiler Techniques for Exposing ILP

Instructor: Morris Lancaster
Corresponding to Hennessy and Patterson, Fifth Edition, Section 3.2

Basic Compiler Techniques for Exposing ILP
• Crucial for processors that use static issue, and important for processors that make dynamic issue decisions but use static scheduling
CS 6461 Compiler Based Scheduling

Basic Pipeline Scheduling and Loop Unrolling
• Exploit parallelism among instructions
• Find sequences of unrelated instructions that can be overlapped in the pipeline
• To avoid a stall, separate a dependent instruction from its source instruction by a distance in clock cycles equal to the pipeline latency of the source instruction
• The compiler works from knowledge of the amount of ILP available in the program and the latencies of the functional units in the pipeline
• This couples the compiler's output to the target processor, sometimes to a specific chip version, or at least requires setting the appropriate compiler flags

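Assumed Latencies
    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0
• Latency is the number of intervening clock cycles needed between the producing and the using instruction to avoid a stall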

Basic Pipeline Scheduling and Loop Unrolling (cont)
• Assume the standard five-stage integer pipeline
• Branches have a delay of one clock cycle
• Functional units are fully pipelined or replicated (as many times as the pipeline depth)
• An operation of any type can be issued on every clock cycle, and there are no structural hazards

Basic Pipeline Scheduling and Loop Unrolling (cont)
• Sample code
    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;
• MIPS code
    Loop: L.D    F0,0(R1)     ;F0 = array element
          ADD.D  F4,F0,F2     ;add scalar in F2
          S.D    F4,0(R1)     ;store result
          DADDUI R1,R1,#-8    ;decrement pointer by 8 bytes (one double)
          BNE    R1,R2,Loop   ;branch if R1 != R2
                              ;R2 is precomputed so that 8(R2) is the
                              ;last element to operate on

Basic Pipeline Scheduling and Loop Unrolling (cont)
• MIPS code, unscheduled (comment = clock cycle in which the instruction issues)
    Loop: L.D    F0,0(R1)     ;1
          stall               ;2
          ADD.D  F4,F0,F2     ;3
          stall               ;4
          stall               ;5
          S.D    F4,0(R1)     ;6
          DADDUI R1,R1,#-8    ;7
          stall               ;8
          BNE    R1,R2,Loop   ;9
• 9 clock cycles per element, 4 of them stalls
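The stall pattern above can be reproduced with a small in-order issue model. This is a minimal sketch, not the textbook's method: it assumes single issue, tracks only register RAW hazards, lets a consumer issue `latency + 1` cycles after its producer using the assumed-latency table, and ignores the branch delay slot. The `Instr` record, the `kind` labels, `LATENCY`, and `schedule` are hypothetical names introduced here, and the one-cycle DADDUI-to-BNE entry is an assumption chosen to reproduce the stall in cycle 8.

```python
from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    text: str                # assembly text, for display only
    kind: str                # "load", "fpalu", "store", "intalu" or "branch"
    dest: Optional[str]      # register written (None for S.D and BNE)
    srcs: Tuple[str, ...]    # registers read

# Intervening cycles needed between producer and consumer (RAW only).
# FP entries follow the assumed-latency table; ("intalu", "branch") = 1 is an
# assumption that reproduces the DADDUI -> BNE stall. Missing pairs mean 0.
LATENCY = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("intalu", "branch"): 1,
}

def schedule(code):
    """Issue instructions in order, at most one per cycle, stalling on RAW hazards.

    Returns the issue cycle of each instruction and the total cycle count.
    """
    last_write = {}          # register -> (issue cycle, kind of producer)
    issue = []
    cycle = 0
    for ins in code:
        earliest = cycle + 1
        for reg in ins.srcs:
            if reg in last_write:
                when, kind = last_write[reg]
                earliest = max(earliest, when + LATENCY.get((kind, ins.kind), 0) + 1)
        cycle = earliest
        issue.append(cycle)
        if ins.dest is not None:
            last_write[ins.dest] = (cycle, ins.kind)
    return issue, cycle

loop = [
    Instr("L.D    F0,0(R1)",   "load",   "F0", ("R1",)),
    Instr("ADD.D  F4,F0,F2",   "fpalu",  "F4", ("F0", "F2")),
    Instr("S.D    F4,0(R1)",   "store",  None, ("F4", "R1")),
    Instr("DADDUI R1,R1,#-8",  "intalu", "R1", ("R1",)),
    Instr("BNE    R1,R2,Loop", "branch", None, ("R1", "R2")),
]

issue, total = schedule(loop)
for cycle, ins in zip(issue, loop):
    print(cycle, ins.text)
print("total:", total)   # total: 9
```

The instructions issue in cycles 1, 3, 6, 7 and 9, matching the slide. Feeding the same model the rescheduled order from the next slide (with the S.D offset changed to 8(R1)) gives 7 cycles under the same assumptions.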

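Rescheduling Gives
• Sample code
    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;
• MIPS code, scheduled (comment = clock cycle in which the instruction issues)
    Loop: L.D    F0,0(R1)     ;1
          DADDUI R1,R1,#-8    ;2
          ADD.D  F4,F0,F2     ;3 *
          stall               ;4
          stall               ;5
          S.D    F4,8(R1)     ;6 *
          BNE    R1,R2,Loop   ;7
• * Moved relative to DADDUI; because R1 has already been decremented by 8 when S.D executes, its offset changes from 0(R1) to 8(R1)
• 7 clock cycles per element instead of 9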
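One way to convince yourself that moving DADDUI above S.D (and adjusting the offset) is legal is to execute both versions and compare memory. This is a minimal sketch under stated assumptions: a hypothetical tuple encoding of the five instructions (not real MIPS encoding), byte-addressed memory modeled as a Python dict, and no branch delay slot. `run`, `ORIGINAL`, and `RESCHEDULED` are names introduced here for illustration.

```python
# Hypothetical tuple encoding (not the real MIPS encoding):
#   ("L.D", fd, offset, base)    fd <- mem[base + offset]
#   ("ADD.D", fd, fs, ft)        fd <- fs + ft
#   ("S.D", fs, offset, base)    mem[base + offset] <- fs
#   ("DADDUI", rd, rs, imm)      rd <- rs + imm
#   ("BNE", rs, rt, target)      if rs != rt, jump to instruction index target

def run(program, mem, r1, r2, s):
    """Execute program on a copy of mem (byte address -> double); return the new memory."""
    mem = dict(mem)
    reg = {"R1": r1, "R2": r2, "F2": s}
    pc = 0
    while pc < len(program):
        op, a, b, c = program[pc]
        pc += 1
        if op == "L.D":
            reg[a] = mem[reg[c] + b]
        elif op == "ADD.D":
            reg[a] = reg[b] + reg[c]
        elif op == "S.D":
            mem[reg[c] + b] = reg[a]
        elif op == "DADDUI":
            reg[a] = reg[b] + c
        elif op == "BNE" and reg[a] != reg[b]:
            pc = c
    return mem

ORIGINAL = [
    ("L.D", "F0", 0, "R1"),
    ("ADD.D", "F4", "F0", "F2"),
    ("S.D", "F4", 0, "R1"),
    ("DADDUI", "R1", "R1", -8),
    ("BNE", "R1", "R2", 0),
]

RESCHEDULED = [
    ("L.D", "F0", 0, "R1"),
    ("DADDUI", "R1", "R1", -8),
    ("ADD.D", "F4", "F0", "F2"),
    ("S.D", "F4", 8, "R1"),    # offset adjusted: 0 - (-8) = 8
    ("BNE", "R1", "R2", 0),
]

# Four doubles x[1..4] at byte addresses 8*i, so R1 starts at 32 and
# R2 = 0 (8(R2) is the last element, x[1]).
x = {8 * i: float(i) for i in range(1, 5)}
before = run(ORIGINAL, x, r1=32, r2=0, s=10.0)
after = run(RESCHEDULED, x, r1=32, r2=0, s=10.0)
print(before == after)   # True
```

The general rule behind the adjustment is that a memory operation moved below `DADDUI R1,R1,#imm` must use offset `old_offset - imm` to keep addressing the same byte.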

Unrolling Gives
• Simple unroll: four copies of the loop body, one DADDUI and one BNE
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
          L.D    F0,-8(R1)
          ADD.D  F4,F0,F2
          S.D    F4,-8(R1)
          L.D    F0,-16(R1)
          ADD.D  F4,F0,F2
          S.D    F4,-16(R1)
          L.D    F0,-24(R1)
          ADD.D  F4,F0,F2
          S.D    F4,-24(R1)
          DADDUI R1,R1,#-32
          BNE    R1,R2,Loop
• Name dependences: every copy reuses F0 and F4
• Data dependences: within each copy, L.D feeds ADD.D, which feeds S.D
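The split between name and data dependences in the simple unroll can be found mechanically. This is a minimal sketch, assuming each instruction is described only by the registers it writes and reads (memory dependences are not tracked); `simple_unroll` and `dependences` are hypothetical helpers introduced here. RAW pairs are true data dependences; WAR (antidependence) and WAW (output dependence) pairs are name dependences.

```python
from collections import defaultdict

def simple_unroll(k):
    """k copies of the loop body as (text, registers written, registers read)."""
    code = []
    for c in range(k):
        off = -8 * c
        code += [
            (f"L.D F0,{off}(R1)", ("F0",), ("R1",)),
            ("ADD.D F4,F0,F2", ("F4",), ("F0", "F2")),
            (f"S.D F4,{off}(R1)", (), ("F4", "R1")),
        ]
    code.append((f"DADDUI R1,R1,#-{8 * k}", ("R1",), ("R1",)))
    code.append(("BNE R1,R2,Loop", (), ("R1", "R2")))
    return code

def dependences(code):
    """Register dependences as (producer index, consumer index, kind, register)."""
    deps = []
    last_writer = {}               # register -> index of its most recent writer
    readers = defaultdict(list)    # register -> indices that read it since that write
    for j, (_, writes, reads) in enumerate(code):
        for r in reads:
            if r in last_writer:
                deps.append((last_writer[r], j, "RAW", r))
        for w in writes:
            if w in last_writer:
                deps.append((last_writer[w], j, "WAW", w))
            deps.extend((i, j, "WAR", w) for i in readers[w])
        for r in reads:
            readers[r].append(j)
        for w in writes:           # a new write starts a fresh reader set
            last_writer[w] = j
            readers[w] = []
    return deps

code = simple_unroll(4)
deps = dependences(code)
fp = [d for d in deps if d[3] in ("F0", "F4")]
for i, j, kind, reg in fp:
    print(f"{kind:3} {reg}: {code[i][0]}  ->  {code[j][0]}")
```

On F0 and F4 this finds 8 RAW pairs (two per copy) and 12 name-dependence pairs (a WAW and a WAR on each register between consecutive copies). Only the name dependences can be removed by choosing different registers.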

Unrolling and Renaming Gives
• MIPS code
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
          L.D    F6,-8(R1)
          ADD.D  F8,F6,F2
          S.D    F8,-8(R1)
          L.D    F10,-16(R1)
          ADD.D  F12,F10,F2
          S.D    F12,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F16,F14,F2
          S.D    F16,-24(R1)
          DADDUI R1,R1,#-32
          BNE    R1,R2,Loop
• Renaming removes the name dependences, but stalls remain: each ADD.D must wait for its L.D, and each S.D must wait for its ADD.D
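The register assignment above can be produced by a simple renaming pass: every time a register is redefined inside the unrolled body, give the new value a fresh register and rewrite later uses. This is a minimal sketch, not a production allocator; it assumes no renamed value is live across the loop back edge (true here, since F0 and F4 are dead at the end of each copy) and that the free registers F6 through F16 are unused. `Ins`, `simple_unroll`, and `rename` are hypothetical names, and only the twelve body instructions are modeled (DADDUI and BNE are untouched by renaming).

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

@dataclass(frozen=True)
class Ins:
    op: str                 # "L.D", "ADD.D" or "S.D"
    dst: Optional[str]      # register written (None for S.D)
    srcs: Tuple[str, ...]   # registers read; for L.D and S.D the base register is last
    off: int = 0            # memory displacement for L.D and S.D

    def __str__(self) -> str:
        if self.op == "L.D":
            return f"L.D {self.dst},{self.off}({self.srcs[0]})"
        if self.op == "S.D":
            return f"S.D {self.srcs[0]},{self.off}({self.srcs[1]})"
        return f"{self.op} {self.dst},{','.join(self.srcs)}"

def simple_unroll(k):
    """Body copied k times, every copy reusing F0 and F4."""
    body = []
    for c in range(k):
        off = -8 * c
        body += [Ins("L.D", "F0", ("R1",), off),
                 Ins("ADD.D", "F4", ("F0", "F2")),
                 Ins("S.D", None, ("F4", "R1"), off)]
    return body

def rename(body, free_regs):
    """Give each redefinition of a register a fresh name taken from free_regs."""
    free = list(free_regs)
    current = {}            # original name -> register now holding its latest value
    written = set()         # registers already defined earlier in the body
    out = []
    for ins in body:
        srcs = tuple(current.get(r, r) for r in ins.srcs)   # read latest names
        dst = ins.dst
        if dst is not None:
            if dst in written:          # redefinition: allocate a fresh register
                current[dst] = free.pop(0)
                dst = current[dst]
            else:
                written.add(dst)
        out.append(replace(ins, dst=dst, srcs=srcs))
    return out

renamed = rename(simple_unroll(4), ["F6", "F8", "F10", "F12", "F14", "F16"])
for ins in renamed:
    print(ins)
```

With the free registers handed out in the order F6, F8, ..., F16, the output matches the renamed code on this slide line for line.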

Unrolling and Removing Hazards Gives
• MIPS code, unrolled, renamed, and scheduled
    Loop: L.D    F0,0(R1)     ;total of 14 clock cycles
          L.D    F6,-8(R1)
          L.D    F10,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F4,F0,F2
          ADD.D  F8,F6,F2
          ADD.D  F12,F10,F2
          ADD.D  F16,F14,F2
          S.D    F4,0(R1)
          S.D    F8,-8(R1)
          DADDUI R1,R1,#-32
          S.D    F12,16(R1)   ;16 = -16 + 32, since R1 was already decremented
          S.D    F16,8(R1)    ;8 = -24 + 32
          BNE    R1,R2,Loop
• No stalls remain: 14 clock cycles for four elements, or 3.5 clock cycles per element

Unrolling Summary for Above
• Determine that it was legal to move the S.D instructions after the DADDUI, and find the amount by which to adjust each S.D offset
• Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for the loop maintenance code
• Use different registers to avoid unnecessary constraints that would be forced by using the same registers for different computations
• Eliminate the extra test and branch instructions, and adjust the loop termination and iteration code
• Determine that the loads and stores can be interchanged by determining that the loads and stores from different iterations are independent (they access different addresses)
• Schedule the code, preserving any dependences
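The "eliminate the extra test and branch, adjust the loop termination" step can be illustrated at source level. This is a minimal sketch of one common approach, assuming the trip count n need not be a multiple of 4: a cleanup loop first runs the original body n mod 4 times, then the unrolled loop does one test and one index update per four elements. `add_scalar` and `add_scalar_unrolled4` are hypothetical helpers; x[0] is left unused so the indices match the C loop. In Python this only shows the control structure a compiler would generate; it is not a speedup.

```python
def add_scalar(x, s):
    """Reference loop: for (i = n; i > 0; i = i - 1) x[i] = x[i] + s, with n = len(x) - 1."""
    for i in range(len(x) - 1, 0, -1):
        x[i] = x[i] + s

def add_scalar_unrolled4(x, s):
    """Same result with the loop body unrolled four times."""
    i = len(x) - 1
    # Cleanup loop: run the original body n mod 4 times so that the
    # remaining trip count is a multiple of 4.
    for _ in range(i % 4):
        x[i] = x[i] + s
        i -= 1
    # Unrolled loop: one test and one decrement per four elements.
    while i > 0:
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4

x = [0.0] + [float(v) for v in range(1, 11)]   # x[1..10]; 10 is not a multiple of 4
ref = list(x)
add_scalar(ref, 1.0)
add_scalar_unrolled4(x, 1.0)
print(x == ref)   # True
```

For the slide's loop (1000 iterations) the cleanup loop runs zero times, which is why the MIPS version needs only the unrolled body.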

Unrolling Summary (continued)
• Example on Page 311 shows the steps
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
          L.D    F0,-8(R1)
          ADD.D  F4,F0,F2
          S.D    F4,-8(R1)
          L.D    F0,-16(R1)
          ADD.D  F4,F0,F2
          S.D    F4,-16(R1)
          L.D    F0,-24(R1)
          ADD.D  F4,F0,F2
          S.D    F4,-24(R1)
          DADDUI R1,R1,#-32
          BNE    R1,R2,Loop
• Name dependences: F0 and F4 are reused by every copy
• Data dependences: within each copy, L.D feeds ADD.D, which feeds S.D

Unrolling Summary (Renaming)
• Example on Page 311 shows the steps
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
          L.D    F6,-8(R1)
          ADD.D  F8,F6,F2
          S.D    F8,-8(R1)
          L.D    F10,-16(R1)
          ADD.D  F12,F10,F2
          S.D    F12,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F16,F14,F2
          S.D    F16,-24(R1)
          DADDUI R1,R1,#-32
          BNE    R1,R2,Loop
• Name dependences: removed by giving each copy its own registers
• Data dependences: unchanged; within each copy, L.D feeds ADD.D, which feeds S.D

Unrolling Summary (continued)
• Limits to the impact of unrolling loops
  • Diminishing returns: each additional copy spreads the loop overhead (DADDUI and BNE) over more elements, but the improvement per added copy keeps shrinking
  • Growth in code size
  • Shortfall in available registers (register pressure)
    • Scheduling the code to increase ILP increases the number of live values
    • This can cause a shortage of registers and negatively impact the optimization
• Despite these limits, loop unrolling is useful in a variety of processors today
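The first and third limits can be made concrete with a back-of-the-envelope model. This is a sketch under assumptions that are not from the slides: for k >= 2 copies the unrolled body is assumed to schedule with no stalls (as the four-way schedule above does), so a pass costs one cycle per instruction; k = 1 uses the 7-cycle rescheduled loop; cleanup code for leftover iterations and cache effects are ignored. The function names are hypothetical.

```python
# Model assumptions (not from the slides): for k >= 2 copies the body is
# scheduled with no stalls, so a pass costs one cycle per instruction
# (k L.D + k ADD.D + k S.D + DADDUI + BNE); k = 1 is the 7-cycle
# rescheduled loop. Cleanup code for leftover iterations is ignored.

def cycles_per_pass(k: int) -> int:
    return 7 if k == 1 else 3 * k + 2

def cycles_per_element(k: int) -> float:
    return cycles_per_pass(k) / k

def instructions(k: int) -> int:
    """Static size of the unrolled loop, in instructions."""
    return 3 * k + 2

def fp_registers(k: int) -> int:
    """F2 holds s; each copy needs its own loaded value and its own sum."""
    return 2 * k + 1

print(" k  cycles/elem  instrs  FP regs")
for k in (1, 2, 4, 8, 16):
    print(f"{k:2}  {cycles_per_element(k):11.3f}  {instructions(k):6}  {fp_registers(k):7}")
```

Under this model, going from 1 to 2 copies saves 3 cycles per element, while going from 4 to 8 saves only 0.25, and the code size keeps growing linearly. Sixteen copies would need 33 floating-point registers, more than the 32 that MIPS provides.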
