Computer Architecture Principles. Dr. Mike Frank. CDA 5155, Summer 2003. Module #26: Software Pipelining.
Software Pipelining
• In hardware pipelining, we overlap the execution of multiple instructions.
• In software pipelining, we overlap the issuing of instructions for multiple loop iterations.
• Like loop unrolling, this allows us to separate the issuing of data-dependent instructions without stalling.
• Unlike loop unrolling, it does not (by itself):
  • eliminate loop overheads (index variable updating & branches), or
  • increase loop code size.
Software Pipelining Illustration
[Figure: processing of different array elements (e.g. A[0] through A[4]). Legend: data dependence between two instructions in one iteration of the original loop (processing of an array element).]
• In 1 iteration of the software-pipelined loop, we execute:
  • instruction 3 of element 4,
  • instruction 4 of element 3,
  • instruction 5 of element 2,
  • instruction 6 of element 1,
  • instruction 7 of element 0.
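The pairing in the illustration can be reproduced with a short sketch. It assumes (this numbering is not on the slide) that pipelined iterations are numbered by t and that instruction k of element e is issued in iteration t = k + e; the function name and its parameters are hypothetical.

```python
def kernel_iteration(t, first_instr=3, last_instr=7):
    """List the (instruction, element) pairs issued in pipelined iteration t.

    Assumption for illustration: instruction k of element e is issued
    in iteration t = k + e, so each instruction works on a different element.
    """
    return [(k, t - k)
            for k in range(first_instr, last_instr + 1)
            if t - k >= 0]          # skip elements that do not exist yet


# Iteration 7 reproduces the pairing listed on the slide.
for instr, elem in kernel_iteration(7):
    print(f"instruction {instr} of element {elem}")
```

Each instruction in the pipelined loop body belongs to a different array element, so no two of them are linked by a data dependence within the same iteration.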
Our old friend, “Mr. Loop Example”
• Same old code:
    Loop: LD   F0,0(R1)
          ADDD F4,F0,F2
          SD   0(R1),F4
          SUBI R1,R1,#8
          BNEZ R1,Loop
• As-is, there would be stalls between the LD, ADDD, and SD instructions due to the data value dependences and the resulting RAW hazards (even with forwarding).
• What would a software-pipelined version of this loop look like?
Software-Pipelined Version
• Here is the new code:
    Loop: SD   16(R1),F4   ; M[i]=tmp2
          ADDD F4,F0,F2    ; tmp2=tmp1+F2
          LD   F0,0(R1)    ; tmp1=M[i-2]
          SUBI R1,R1,#8    ; i=i-1
          BNEZ R1,Loop
• Note:
  • All the value dependences now cross loop-iteration boundaries.
  • The whole dependence path from LD through ADDD to SD now crosses 2 loop-iteration boundaries: LD in one iteration, ADDD in the next, SD in the one after.
  • So, we load from M[i-2] (or 0(R1)) and, two iterations later, store back to the same element, which by then is addressed as M[i] (or 16(R1)).
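As a sanity check, the two loops can be modeled in Python and compared. This is a minimal sketch, not code from the slides: memory is modeled as a list of 8-byte words, and the start-up (prologue) and wind-down (epilogue) code around the pipelined loop is an assumption, since the slide shows only the loop body. The function names are hypothetical.

```python
def original_loop(mem, n, f2):
    """Model of the original loop: adds f2 to words 1..n (addresses 8..8n)."""
    m = list(mem)
    r1 = 8 * n
    while True:
        f0 = m[r1 // 8]            # LD   F0,0(R1)
        f4 = f0 + f2               # ADDD F4,F0,F2
        m[r1 // 8] = f4            # SD   0(R1),F4
        r1 -= 8                    # SUBI R1,R1,#8
        if r1 == 0:                # BNEZ R1,Loop
            break
    return m


def pipelined_loop(mem, n, f2):
    """Model of the software-pipelined loop with an assumed prologue/epilogue."""
    assert n >= 3, "this sketch needs at least 3 elements to fill the pipeline"
    m = list(mem)
    r1 = 8 * n
    # Assumed prologue: start the first two elements before entering the loop.
    f0 = m[r1 // 8]                # LD   F0,0(R1)
    f4 = f0 + f2                   # ADDD F4,F0,F2
    f0 = m[(r1 - 8) // 8]          # LD   F0,-8(R1)
    r1 -= 16                       # SUBI R1,R1,#16
    while True:                    # loop body as shown on the slide
        m[(r1 + 16) // 8] = f4     # SD   16(R1),F4
        f4 = f0 + f2               # ADDD F4,F0,F2
        f0 = m[r1 // 8]            # LD   F0,0(R1)
        r1 -= 8                    # SUBI R1,R1,#8
        if r1 == 0:                # BNEZ R1,Loop
            break
    # Assumed epilogue: finish the last two elements still in flight.
    m[(r1 + 16) // 8] = f4         # SD   16(R1),F4
    f4 = f0 + f2                   # ADDD F4,F0,F2
    m[(r1 + 8) // 8] = f4          # SD   8(R1),F4
    return m


data = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # word 0 is never touched by either loop
assert pipelined_loop(data, 5, 10.0) == original_loop(data, 5, 10.0)
```

In this model the pipelined loop body runs n-2 times, and the two remaining elements are handled by the assumed prologue and epilogue.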
Timing across 3 iterations
• Note greater separation between LD, ADDD, and SD for a single array element: no stalls needed.
• Note we decrement R1 by 8 twice between the LD and the corresponding SD, thus the need for the 16 offset in the SD.
• Some antidependences (including the one through memory) are noted by green arrows.
[Figure: the loop body below, shown once for each of 3 consecutive iterations.]
    Loop: SD   16(R1),F4   ; M[i]=tmp2
          ADDD F4,F0,F2    ; tmp2=tmp1+F2
          LD   F0,0(R1)    ; tmp1=M[i-2]
          SUBI R1,R1,#8    ; i=i-1
          BNEZ R1,Loop
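The offset arithmetic can be checked with a small trace. This is a sketch under an assumed starting value of R1 (24 is arbitrary), and the helper name is hypothetical; it records only the addresses touched by the SD and LD in each iteration.

```python
def kernel_trace(r1_start, iterations):
    """Record the addresses used by SD 16(R1) and LD 0(R1) in each iteration."""
    rows = []
    r1 = r1_start
    for _ in range(iterations):
        rows.append({"store_addr": r1 + 16,   # SD 16(R1),F4
                     "load_addr": r1})        # LD F0,0(R1)
        r1 -= 8                               # SUBI R1,R1,#8
    return rows


rows = kernel_trace(r1_start=24, iterations=3)
# The address loaded in the first iteration is stored to two iterations later,
# after R1 has been decremented by 8 twice.
assert rows[0]["load_addr"] == rows[2]["store_addr"]
```

The same trace shows the antidependence through memory: the LD in the first iteration reads the location that the SD in the third iteration later overwrites.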
SW Pipelining vs. Unrolling
• Both of them:
  • Improve scheduling among high-latency instructions in the inner loop.
• Loop unrolling also:
  • Reduces loop overhead (index variable updating & end-of-loop testing).
  • Confines most stalls to once per n iterations.
• Software pipelining also:
  • Confines most stalls to 1st & last iteration only.
  • Keeps code size small.
• Can use both techniques in combination.