Compiler Support for Superscalar Processors
Loop Unrolling
• Assumption:
  • Standard five-stage pipeline
  • Empty cycles between instructions before the result can be used:
    • FP-ALU – FP-ALU: 3
    • FP-ALU – Store: 2
    • Load – FP-ALU: 1
    • Load – Store: 0
  • Jumps have one empty cycle
• Independent operations are important for efficient use of the pipeline.
• Loop unrolling is an important technique for providing them.
Example
    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

Compiler output:
    Loop: load  f0, 0(r1)     ; f0 = x[i]
          add   f4, f0, f2    ; x[i] + s
          store f4, 0(r1)     ; x[i] = f4
          addi  r1, r1, -8    ; r1 = r1 - 8
          bne   r1, r2, Loop  ; branch if r1 != r2

Execution:
    Loop: load  f0, 0(r1)     ; 1
          stall               ; 2
          add   f4, f0, f2    ; 3
          stall               ; 4
          stall               ; 5
          store f4, 0(r1)     ; 6
          addi  r1, r1, -8    ; 7
          stall               ; 8
          bne   r1, r2, Loop  ; 9
          stall               ; 10
Instruction Scheduling
• Good instruction scheduling can reduce the execution time from 10 cycles to 6 cycles.
• Requires
  • Dependence analysis
  • Symbolic optimization

    Loop: load  f0, 0(r1)     ; 1
          addi  r1, r1, -8    ; 2
          add   f4, f0, f2    ; 3
          stall               ; 4
          bne   r1, r2, Loop  ; 5
          store f4, 8(r1)     ; 6 (fills the branch delay; offset adjusted because addi now runs first)
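The cycle counts on the two slides above can be reproduced with a small in-order issue model. This is a minimal sketch, not the method used in the slides: the instruction kinds, the `GAPS` table layout and the function name `cycles` are illustrative, and the one-cycle gap between an integer operation and a dependent branch is an assumption read off the 10-cycle listing (addi, stall, bne).

```python
# Empty cycles required between a producer and a dependent consumer.
# The first four entries come from the slides; ("int", "branch") is an
# assumption inferred from the 10-cycle execution listing.
GAPS = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("int", "branch"): 1,
}


def cycles(body):
    """Return the cycles for one pass through `body` with in-order issue.

    Each instruction is (kind, dest, sources). It issues one cycle after
    its predecessor at the earliest, and later if a producer of one of
    its source registers is not far enough back.
    """
    issued = {}  # register -> (kind, issue cycle) of its latest producer
    cycle = 0
    for kind, dest, sources in body:
        earliest = cycle + 1
        for reg in sources:
            if reg in issued:
                prod_kind, prod_cycle = issued[reg]
                gap = GAPS.get((prod_kind, kind), 0)
                earliest = max(earliest, prod_cycle + gap + 1)
        cycle = earliest
        if dest is not None:
            issued[dest] = (kind, cycle)
    # A branch at the very end leaves its empty cycle unfilled.
    if body[-1][0] == "branch":
        cycle += 1
    return cycle


original = [
    ("load", "f0", ["r1"]),
    ("fpalu", "f4", ["f0", "f2"]),
    ("store", None, ["f4", "r1"]),
    ("int", "r1", ["r1"]),
    ("branch", None, ["r1", "r2"]),
]
# load, addi, add, bne, store: the order of the scheduled listing.
scheduled = [original[0], original[3], original[1], original[4], original[2]]

print(cycles(original))   # 10
print(cycles(scheduled))  # 6
```

The model only counts total cycles. It does not decide where a stall sits relative to the branch, and it ignores memory addresses, so the changed store offset plays no role here.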
Loop Unrolling
• The real computation requires only three instructions: load, add, store.
• The remaining instructions are loop control (overhead).
• Loop unrolling by a factor of k means:
  • The loop body is replicated k times.
  • Accesses that depend on the loop variable have to be adapted.
  • The loop control has to be adapted.
  • A post loop is generated if the number of iterations is not divisible by k.
Example
• Advantages of loop unrolling
  • The ratio of useful instructions to overhead improves.
  • More operations are available for instruction scheduling.

    for (i=1000; i>0; i=i-4) {
        x[i]   = x[i]   + s;
        x[i-1] = x[i-1] + s;
        x[i-2] = x[i-2] + s;
        x[i-3] = x[i-3] + s;
    }
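The example above needs no post loop because 1000 is divisible by 4. The following sketch shows the general shape for k = 4, including the post loop. It is only an illustration of the transformation: the function names are made up, the loop counts upward over a 0-based Python list instead of downward from 1000, and no speed-up is implied for Python itself.

```python
def add_scalar(x, s):
    """Reference loop: x[i] = x[i] + s for every element."""
    for i in range(len(x)):
        x[i] = x[i] + s


def add_scalar_unrolled4(x, s):
    """Same computation, body replicated four times plus a post loop."""
    n = len(x)
    main_end = n - n % 4          # largest multiple of 4 not exceeding n
    i = 0
    while i < main_end:           # one loop-control step per four elements
        x[i] = x[i] + s
        x[i + 1] = x[i + 1] + s   # index adapted for the replicated body
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
        i += 4
    while i < n:                  # post loop: at most three leftover elements
        x[i] = x[i] + s
        i += 1


data = list(range(10))            # 10 is not divisible by 4
add_scalar_unrolled4(data, 5)
print(data)                       # [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
```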
Reduction of overhead
    Loop: load  f0, 0(r1)       ; 1   x[i]
          add   f4, f0, f2      ; 3
          store f4, 0(r1)       ; 6
          load  f6, -8(r1)      ; 7   x[i-1]
          add   f8, f6, f2      ; 9
          store f8, -8(r1)      ; 12
          load  f10, -16(r1)    ; 13  x[i-2]
          add   f12, f10, f2    ; 15
          store f12, -16(r1)    ; 18
          load  f14, -24(r1)    ; 19  x[i-3]
          add   f16, f14, f2    ; 21
          store f16, -24(r1)    ; 24
          addi  r1, r1, -32     ; 25
          bne   r1, r2, Loop    ; 27
          stall                 ; 28
• 28 cycles for 4 iterations
• Previously 40 cycles for 4 iterations
Optimized scheduling of instructions
    Loop: load  f0, 0(r1)       ; 1   x[i]
          load  f6, -8(r1)      ; 2   x[i-1]
          load  f10, -16(r1)    ; 3   x[i-2]
          load  f14, -24(r1)    ; 4   x[i-3]
          add   f4, f0, f2      ; 5
          add   f8, f6, f2      ; 6
          add   f12, f10, f2    ; 7
          add   f16, f14, f2    ; 8
          store f4, 0(r1)       ; 9
          store f8, -8(r1)      ; 10
          addi  r1, r1, -32     ; 11
          store f12, 16(r1)     ; 12
          bne   r1, r2, Loop    ; 13
          store f16, 8(r1)      ; 14
• Results in 3.5 cycles per iteration (6 before unrolling)
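The last two stores in the listing above use offsets 16 and 8 instead of -16 and -24 because they now execute after the addi. The correction can be written as a one-line rule; the helper name `offset_after_move` is hypothetical and only illustrates the arithmetic.

```python
def offset_after_move(offset, increment):
    """Displacement to use once an access sits below `addi r1, r1, increment`.

    Before the move the address is r1_old + offset. After the addi,
    r1_new = r1_old + increment, so the same address is
    r1_new + (offset - increment).
    """
    return offset - increment


print(offset_after_move(-16, -32))  # 16, as in store f12, 16(r1)
print(offset_after_move(-24, -32))  # 8,  as in store f16, 8(r1)
print(offset_after_move(0, -8))     # 8,  as in store f4, 8(r1) of the 6-cycle loop
```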
Register Allocation
• Using different registers allows reordering.

Unrolled code that reuses f0 and f4:
    Loop: load  f0, 0(r1)     ; x[i]
          add   f4, f0, f2
          store f4, 0(r1)
          load  f0, -8(r1)    ; x[i-1]
          add   f4, f0, f2
          store f4, -8(r1)
          …

The same code after scheduling; the reused registers still force stalls:
    Loop: load  f0, 0(r1)     ; x[i]
          stall
          add   f4, f0, f2
          load  f0, -8(r1)    ; x[i-1]
          stall
          store f4, 0(r1)
          add   f4, f0, f2
          stall
          stall
          store f4, -8(r1)
          …
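Why reused registers restrict reordering can be checked by classifying the register dependences between two instructions. The sketch below is illustrative only: the `(dest, sources)` encoding and the function name `dependences` are assumptions, and memory dependences are ignored because the accesses in the listing go to different addresses.

```python
def dependences(first, second):
    """Classify register dependences from `first` to a later `second`.

    Each instruction is (dest, sources); dest is None for stores.
    """
    dest1, src1 = first
    dest2, src2 = second
    found = set()
    if dest1 is not None and dest1 in src2:
        found.add("RAW")  # second reads what first writes
    if dest2 is not None and dest2 in src1:
        found.add("WAR")  # second overwrites what first still has to read
    if dest1 is not None and dest1 == dest2:
        found.add("WAW")  # both write the same register
    return found


add_i = ("f4", {"f0", "f2"})        # add  f4, f0, f2
load_reused = ("f0", {"r1"})        # load f0, -8(r1)
load_renamed = ("f6", {"r1"})       # load f6, -8(r1)

print(dependences(add_i, load_reused))   # {'WAR'}
print(dependences(add_i, load_renamed))  # set()
```

With f0 reused, the second load may not move above the first add. With a fresh register there is no dependence between the two, so the scheduler is free to hoist the load.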
Register Allocation
• The compiler starts with an unlimited number of virtual registers.
• These registers are then mapped to the registers of the ISA by graph coloring.
• Live range of a virtual register: the instructions at which the register is live, i.e., from its definition to its last access.
• Creation of a graph
  • Nodes are virtual registers.
  • An edge is inserted if two live ranges overlap.
• Goal: color the nodes with a minimal number of colors so that neighboring nodes do not have the same color. The number of colors has to be less than or equal to the number of ISA registers.
Graph Coloring
• Three registers are required.
• In addition, an index register is needed.

    Loop: load  v0, 0(r1)
          add   v4, v0, v2
          store v4, 0(r1)
          load  v6, -8(r1)
          add   v8, v6, v2
          store v8, -8(r1)
          load  v10, -16(r1)
          add   v12, v10, v2
          store v12, -16(r1)
          load  v14, -24(r1)
          add   v16, v14, v2
          store v16, -24(r1)
          addi  r1, r1, -32
          bne   r1, r2, Loop

[Figure: interference graph with nodes v0, v2, v4, v6, v8, v10, v12, v14, v16]
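The count of three registers can be reproduced with a small interference graph and a greedy coloring. This is a sketch under stated assumptions, not a production allocator: live ranges are taken as inclusive intervals from definition to last access, as defined on the previous slide; v2 is treated as live from the top of the loop; the index register is left out; and all function names are illustrative. Greedy coloring in order of range start is used here because it happens to be optimal for interval-shaped live ranges.

```python
def live_ranges(code, live_in=()):
    """Map each register to (first, last) instruction index, inclusive."""
    ranges = {reg: (0, 0) for reg in live_in}
    for idx, (dest, sources) in enumerate(code):
        for reg in ([dest] if dest else []) + list(sources):
            first = ranges.get(reg, (idx, idx))[0]
            ranges[reg] = (first, idx)
    return ranges


def interference_graph(ranges):
    """Connect two registers whenever their live ranges overlap."""
    graph = {reg: set() for reg in ranges}
    for a, (a_first, a_last) in ranges.items():
        for b, (b_first, b_last) in ranges.items():
            if a != b and a_first <= b_last and b_first <= a_last:
                graph[a].add(b)
    return graph


def greedy_coloring(graph, order):
    """Give each node the smallest color not used by a colored neighbor."""
    color = {}
    for reg in order:
        used = {color[n] for n in graph[reg] if n in color}
        color[reg] = next(c for c in range(len(graph)) if c not in used)
    return color


# FP part of the unrolled loop: load, add, store for each of four elements.
unrolled = []
for ld, ad in [("v0", "v4"), ("v6", "v8"), ("v10", "v12"), ("v14", "v16")]:
    unrolled += [(ld, []), (ad, [ld, "v2"]), (None, [ad])]

ranges = live_ranges(unrolled, live_in=["v2"])
graph = interference_graph(ranges)
order = sorted(ranges, key=ranges.get)      # by start of live range
colors = greedy_coloring(graph, order)
print(len(set(colors.values())))            # 3
```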
Register Allocation after Instruction Scheduling
• 5 FP registers are required.

    Loop: load  v0, 0(r1)
          load  v6, -8(r1)
          load  v10, -16(r1)
          load  v14, -24(r1)
          add   v4, v0, v2
          add   v8, v6, v2
          add   v12, v10, v2
          add   v16, v14, v2
          store v4, 0(r1)
          store v8, -8(r1)
          addi  r1, r1, -32
          store v12, 16(r1)
          bne   r1, r2, Loop
          store v16, 8(r1)

[Figure: interference graph with nodes v0, v4, v6, v8, v10, v12, v14, v16]
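The rise in register demand after scheduling can be made visible by counting how many live ranges overlap at each instruction. The sketch below assumes inclusive live ranges (definition to last access) and, like the graph on this slide, leaves out v2 and the index register; the function name `max_live` and the instruction encoding are illustrative.

```python
def max_live(code):
    """Largest number of registers live at one instruction (inclusive ranges)."""
    first, last = {}, {}
    for idx, (dest, sources) in enumerate(code):
        for reg in ([dest] if dest else []) + list(sources):
            first.setdefault(reg, idx)
            last[reg] = idx
    return max(
        (sum(1 for reg in first if first[reg] <= idx <= last[reg])
         for idx in range(len(code))),
        default=0,
    )


loads = ["v0", "v6", "v10", "v14"]
adds = ["v4", "v8", "v12", "v16"]

# Unscheduled order: load, add, store per element (v2 omitted).
unscheduled = []
for ld, ad in zip(loads, adds):
    unscheduled += [(ld, []), (ad, [ld]), (None, [ad])]

# Scheduled order; addi and bne touch no FP register.
scheduled = (
    [(ld, []) for ld in loads]
    + [(ad, [ld]) for ld, ad in zip(loads, adds)]
    + [(None, ["v4"]), (None, ["v8"]), (None, []),
       (None, ["v12"]), (None, []), (None, ["v16"])]
)

print(max_live(unscheduled))  # 2
print(max_live(scheduled))    # 5
```

Under these assumptions the unscheduled order needs two registers besides v2, and the scheduled order needs five.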
Software Pipelining
• Execution with loop unrolling (a) and software pipelining (b)
[Figure: two panels, each with the axis "Number of overlapped operations". Panel (a) is annotated "Proportional to number of unrolls"; panel (b) is annotated "Start-up" and "Wind-down".]
Software Pipelining
• Loops are restructured such that each iteration of the new loop executes instructions from different iterations of the original loop.
[Figure: Iteration 0, Iteration 1, Iteration 2, Iteration 3, Iteration 4]
Example Software Pipelining

Original loop:
    Loop: load  f0, 0(r1)
          add   f4, f0, f2
          store f4, 0(r1)
          addi  r1, r1, -8
          bne   r1, r2, Loop

Iteration i:
          load  f0, 0(r1)
          add   f4, f0, f2
          store f4, 0(r1)
Iteration i-1:
          load  f0, 0(r1)
          add   f4, f0, f2
          store f4, 0(r1)
Iteration i-2:
          load  f0, 0(r1)
          add   f4, f0, f2
          store f4, 0(r1)

Pipelined loop:
    Loop: store f4, 16(r1)    ; stores into M[i]
          add   f4, f0, f2    ; adds to M[i-1]
          load  f0, 0(r1)     ; loads M[i-2]
          addi  r1, r1, -8
          bne   r1, r2, Loop
Example: Software Pipelining
• Start-up code and wind-down code have been omitted.
• Requires register renaming to get rid of WAR conflicts.
• Requires 5 cycles per iteration if instruction scheduling handles the addi and the branch as before.
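The start-up and wind-down code omitted on the slide can be sketched at source level. This is an illustration under assumptions, not the slide's code: it counts upward over a 0-based list instead of downward through memory, uses the made-up names `add_scalar` and `add_scalar_pipelined`, and falls back to the plain loop for fewer than two elements.

```python
def add_scalar(x, s):
    """Original loop: load, add and store of one element per iteration."""
    for i in range(len(x)):
        loaded = x[i]           # load
        summed = loaded + s     # add
        x[i] = summed           # store


def add_scalar_pipelined(x, s):
    """Software-pipelined form with explicit start-up and wind-down code."""
    n = len(x)
    if n < 2:
        add_scalar(x, s)        # too short to overlap anything
        return
    # Start-up: fill the pipeline.
    loaded = x[0]               # load  for element 0
    summed = loaded + s         # add   for element 0
    loaded = x[1]               # load  for element 1
    # Steady state: store, add and load belong to three different elements.
    for i in range(n - 2):
        x[i] = summed           # store for element i
        summed = loaded + s     # add   for element i + 1
        loaded = x[i + 2]       # load  for element i + 2
    # Wind-down: drain the pipeline.
    x[n - 2] = summed           # store for element n - 2
    summed = loaded + s         # add   for element n - 1
    x[n - 1] = summed           # store for element n - 1


values = [1.0, 2.0, 3.0, 4.0, 5.0]
add_scalar_pipelined(values, 0.5)
print(values)                   # [1.5, 2.5, 3.5, 4.5, 5.5]
```

Inside the steady-state loop no statement depends on a value produced in the same pass, which is what gives the scheduler independent operations to work with.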
Software Pipelining vs Loop Unrolling
• Software pipelining is symbolic loop unrolling.
  • The algorithms are based on loop unrolling.
• Advantages of software pipelining
  • Results in shorter code, especially for long latencies.
  • Confines the phases of low overlap to the start-up and wind-down code.
• Advantage of loop unrolling
  • Reduces loop overhead.
• Advantage of both techniques
  • They use independent operations from different loop iterations.
• Best results are obtained by combining both techniques.
Loop fusion
• Loop fusion combines consecutive loops that have the same loop control.
• Instructions might be executed more efficiently.
• Loop fusion is not always possible.

Separate loops:
    do i=1,n
      a(i) = b(i)+2
    enddo
    do i=1,n
      c(i) = d(i+1) * a(i)
    enddo

Fused loop:
    do i=1,n
      a(i) = b(i)+2
      c(i) = d(i+1) * a(i)
    enddo
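Example: Incorrect Loop Fusion
• In the fused loop, S2 in iteration i reads a(i+1), which S1 does not write until iteration i+1, so S2 sees the old value.

Separate loops:
    do i=1,n
      S1: a(i) = b(i)+2
    enddo
    do i=1,n
      S2: c(i) = d(i+1) * a(i+1)
    enddo

Fused loop:
    do i=1,n
      S1: a(i) = b(i)+2
      S2: c(i) = d(i+1) * a(i+1)
    enddo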
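Example: Correct Loop Fusion
• In the fused loop, S2 in iteration i reads a(i-1), which already holds its final value, just as in the separate loops.

Separate loops:
    do i=1,n
      S1: a(i) = b(i)+2
    enddo
    do i=1,n
      S2: c(i) = d(i+1) * a(i-1)
    enddo

Fused loop:
    do i=1,n
      S1: a(i) = b(i)+2
      S2: c(i) = d(i+1) * a(i-1)
    enddo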
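The difference between the two fusion examples can be checked by running both forms on concrete data. In this sketch the helper names, the `shift` parameter and the test values are assumptions made for illustration; the arrays are padded so that indices 0 and n+1 exist, and the loops run over i = 1..n as in the listings.

```python
def run_separate(a, b, d, n, shift):
    """Two loops: S1 over all i, then S2 reading a[i + shift]."""
    a = list(a)                           # work on a copy
    c = [0] * (n + 1)
    for i in range(1, n + 1):             # first loop, S1
        a[i] = b[i] + 2
    for i in range(1, n + 1):             # second loop, S2
        c[i] = d[i + 1] * a[i + shift]
    return c


def run_fused(a, b, d, n, shift):
    """One loop executing S1 and then S2 in every iteration."""
    a = list(a)
    c = [0] * (n + 1)
    for i in range(1, n + 1):
        a[i] = b[i] + 2                   # S1
        c[i] = d[i + 1] * a[i + shift]    # S2
    return c


n = 4
a_old = [100] * (n + 2)                   # values in a before the loops
b = [0, 1, 2, 3, 4]                       # index 0 unused
d = [1] * (n + 2)

print(run_separate(a_old, b, d, n, +1))   # [0, 4, 5, 6, 100]
print(run_fused(a_old, b, d, n, +1))      # [0, 100, 100, 100, 100]
print(run_fused(a_old, b, d, n, -1) == run_separate(a_old, b, d, n, -1))  # True
```

With a(i+1) the fused loop reads stale values of a; with a(i-1) or a(i) both forms agree.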
Advantages of Transformations
• They increase the number of independent instructions.
• These can be scheduled and executed more efficiently.
Disadvantages of the Transformations
• Transformations increase register pressure.
• They increase code size, which might lead to less efficient use of the memory hierarchy.
• Transformations can also reduce data locality.
Summary of Transformations
• The compiler has a global overview.
• Goal: more operations for instruction scheduling.
• The compiler supports efficient execution in other areas as well.