Pipeline Optimization

landis
Presentation Transcript


  1. Pipeline Optimization
     • Pipeline with data forwarding and accelerated branch
     • Loop unrolling
     • Dual pipeline

  2. C-code
       k = len;
       do {
           k--;
           A[k] = A[k] + x;
       } while (k > 0);
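To make the loop's behavior concrete, it can be wrapped in a small C function. This is only a minimal sketch: the name `add_to_all` and the `int` element type are assumptions (the deck only implies 4-byte words through its `sll ... 2` scaling). Like the original do-while, it requires `len >= 1`.

```c
#include <assert.h>

/* Hypothetical wrapper around the slide's loop: adds x to every element
 * of A[0..len-1], walking from the last element down to the first.
 * A do-while body always runs at least once, so len must be >= 1. */
void add_to_all(int *A, int len, int x)
{
    assert(len >= 1);
    int k = len;
    do {
        k--;
        A[k] = A[k] + x;
    } while (k > 0);
}
```

Calling `add_to_all(a, 4, 10)` on an array holding 1, 2, 3, 4 leaves it holding 11, 12, 13, 14.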

  3. Register Usage
     • $s0: len
     • $s1: base address of A
     • $s2: x
     • $t1: address of A[k]
     • $t0: value of A[k] (old and new)

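  4. Basic loop using pointer hopping
       sll  $t1, $s0, 2        # $t1 = len * 4
       addu $t1, $t1, $s1      # $t1 = address just past A[len-1]
     loop:
       addi $t1, $t1, -4       # step down to A[k]
       lw   $t0, 0($t1)        # load A[k]
       add  $t0, $t0, $s2      # add x
       sw   $t0, 0($t1)        # store A[k]
       bne  $t1, $s1, loop     # repeat until $t1 reaches the base of A
       xxx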
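The pointer-hopping idea may be easier to follow in C, where the index `k` disappears and a single pointer both addresses the element and terminates the loop. The sketch below is an illustration, not the deck's code: the name `add_to_all_ptr` and the `int` element type are assumptions, and `p` stands in for `$t1` while `A` stands in for `$s1`.

```c
#include <assert.h>

/* Pointer-hopping form of the loop (hypothetical helper).
 * p plays the role of $t1 and A the role of $s1. Requires len >= 1. */
void add_to_all_ptr(int *A, int len, int x)
{
    assert(len >= 1);
    int *p = A + len;      /* sll + addu: address just past A[len-1] */
    do {
        p--;               /* addi $t1, $t1, -4 (one 4-byte word)   */
        *p = *p + x;       /* lw, add, sw                            */
    } while (p != A);      /* bne $t1, $s1, loop                     */
}
```

Comparing the pointer against the base address replaces the separate counter, which is why the assembly loop needs only five instructions.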

  5. Time for 1000 Iterations for Single- and Multi-cycle
     • Single cycle
       • Every instruction takes 800 picoseconds (ps)
       • Time = 5 x 800 x 1000 + 2 x 800 = 4,001,600 ps = 4,001.6 nanoseconds (ns)
     • Multicycle
       • 200 ps per cycle, variable number of cycles per instruction (CPI)
       • Cycles = (1x3 + 3x4 + 1x5) x 1000 + 2x4 = 20,008
       • Time = 20,008 x 200 ps = 4,001,600 ps = 4,001.6 ns
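The two totals above can be checked with a few lines of C. This is a sketch of the slide's arithmetic only; the function names are hypothetical, and the mapping of cycle counts to instructions (3 for `bne`, 4 for `addi`/`add`/`sw`, 5 for `lw`) is an assumption based on the usual textbook multicycle design, since the slide gives only the counts.

```c
#include <assert.h>

/* Single-cycle machine: every instruction takes one 800 ps cycle.
 * 5 instructions per loop iteration plus 2 setup instructions. */
long long single_cycle_ps(long long iterations)
{
    assert(iterations >= 0);
    const long long loop_instrs = 5, setup_instrs = 2, cycle_ps = 800;
    return (loop_instrs * iterations + setup_instrs) * cycle_ps;
}

/* Multicycle machine: 200 ps cycles. Per iteration the slide counts
 * one 3-cycle, three 4-cycle and one 5-cycle instruction (assumed to be
 * bne; addi, add, sw; and lw), plus two 4-cycle setup instructions. */
long long multicycle_ps(long long iterations)
{
    assert(iterations >= 0);
    const long long cycles_per_iter = 1 * 3 + 3 * 4 + 1 * 5;  /* 20 */
    const long long setup_cycles = 2 * 4;                     /* 8  */
    const long long cycle_ps = 200;
    return (cycles_per_iter * iterations + setup_cycles) * cycle_ps;
}
```

Both functions return 4,001,600 for 1000 iterations, which is why the multicycle design gains nothing on this loop: its shorter cycle is offset by an average of four cycles per instruction.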

  6. Pipeline: Filling Stall/Delay slots
       sll  $t1, $s0, 2
       addu $t1, $t1, $s1
     loop:
       addi $t1, $t1, -4
       lw   $t0, 0($t1)
       nop                     # load delay slot
       add  $t0, $t0, $s2
       sw   $t0, 0($t1)
       bne  $t1, $s1, loop
       nop                     # branch delay slot
       xxx

  7. Time for simple pipeline
     • 200 ps per cycle, 1 CPI (including nops)
     • Time = 7 x 200 x 1000 + 2 x 200 = 1,400,400 ps = 1,400.4 ns

  8. Reordering Code to fill branch delay slot
       sll  $t1, $s0, 2
       addu $t1, $t1, $s1
     loop:
       addi $t1, $t1, -4
       lw   $t0, 0($t1)
       nop                     # load delay slot
       add  $t0, $t0, $s2
       bne  $t1, $s1, loop
       sw   $t0, 0($t1)        # store now fills the branch delay slot
       xxx

  9. Time for pipeline with reordered code
     • 200 ps per cycle, 1 CPI (including nops)
     • Time = 6 x 200 x 1000 + 2 x 200 = 1,200,400 ps = 1,200.4 ns
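The pipeline timings on slides 7 and 9 follow the same pattern: cycles per pass through the loop, times the number of passes, plus the setup instructions, all multiplied by the cycle time. A minimal sketch of that formula follows; the function name and parameter names are hypothetical. Like the slides' formula, it has no term for pipeline fill.

```c
#include <assert.h>

/* Total time in picoseconds for a pipelined loop, counted as on the
 * slides: one cycle per issued instruction (nops included) inside the
 * loop, plus the setup instructions executed once before it. */
long long pipeline_ps(long long cycles_per_pass, long long passes,
                      long long setup_cycles, long long cycle_ps)
{
    assert(cycles_per_pass > 0 && passes >= 0);
    assert(setup_cycles >= 0 && cycle_ps > 0);
    return (cycles_per_pass * passes + setup_cycles) * cycle_ps;
}
```

With a 200 ps cycle and 2 setup instructions, `pipeline_ps(7, 1000, 2, 200)` gives the 1,400,400 ps of the simple pipeline, and `pipeline_ps(6, 1000, 2, 200)` gives the 1,200,400 ps of the reordered code.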

  10. Loop Unrolling step 1 (4 iterations)
        sll  $t1, $s0, 2
        addu $t1, $t1, $s1
      loop:
        addi $t1, $t1, -4
        lw   $t0, 0($t1)
        nop
        add  $t0, $t0, $s2
        beq  $t1, $s1, loopend
        sw   $t0, 0($t1)
        addi $t1, $t1, -4
        lw   $t0, 0($t1)
        nop
        add  $t0, $t0, $s2
        beq  $t1, $s1, loopend
        sw   $t0, 0($t1)
        addi $t1, $t1, -4
        lw   $t0, 0($t1)
        nop
        add  $t0, $t0, $s2
        beq  $t1, $s1, loopend
        sw   $t0, 0($t1)
        addi $t1, $t1, -4
        lw   $t0, 0($t1)
        nop
        add  $t0, $t0, $s2
        bne  $t1, $s1, loop
        sw   $t0, 0($t1)
      loopend:
        xxx

  11. Loop Unrolling step 2: One pointer with offsets
        sll  $t1, $s0, 2
        addu $t1, $t1, $s1
      loop:
        addi $t1, $t1, -16
        lw   $t0, 12($t1)
        nop
        add  $t0, $t0, $s2
        sw   $t0, 12($t1)
        lw   $t0, 8($t1)
        nop
        add  $t0, $t0, $s2
        sw   $t0, 8($t1)
        lw   $t0, 4($t1)
        nop
        add  $t0, $t0, $s2
        sw   $t0, 4($t1)
        lw   $t0, 0($t1)
        nop
        add  $t0, $t0, $s2
        bne  $t1, $s1, loop
        sw   $t0, 0($t1)
        xxx
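In C terms, this step corresponds roughly to the sketch below: the pointer moves once per pass by four words, and the four elements are reached through fixed offsets. The name `add_to_all_unrolled4` and the `int` element type are assumptions. Because this version has no early exits, it only works when `len` is a positive multiple of 4, which the assertion makes explicit.

```c
#include <assert.h>

/* Four-way unrolled pointer-hopping loop (hypothetical helper).
 * One pointer update per pass; elements are addressed at offsets
 * 3, 2, 1, 0 words (12, 8, 4, 0 bytes in the assembly).
 * Requires len to be a positive multiple of 4. */
void add_to_all_unrolled4(int *A, int len, int x)
{
    assert(len >= 4 && len % 4 == 0);
    int *p = A + len;
    do {
        p -= 4;              /* addi $t1, $t1, -16 */
        p[3] = p[3] + x;     /* offset 12 */
        p[2] = p[2] + x;     /* offset 8  */
        p[1] = p[1] + x;     /* offset 4  */
        p[0] = p[0] + x;     /* offset 0  */
    } while (p != A);        /* bne $t1, $s1, loop */
}
```

The payoff is fewer overhead instructions: one pointer update and one branch now serve four elements instead of one.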

  12. Loop Unrolling step 3: Filling data hazard slots
        sll  $t1, $s0, 2
        addu $t1, $t1, $s1
      loop:
        addi $t1, $t1, -16
        lw   $t0, 12($t1)
        lw   $t3, 8($t1)
        add  $t0, $t0, $s2
        sw   $t0, 12($t1)
        lw   $t0, 4($t1)
        add  $t3, $t3, $s2
        sw   $t3, 8($t1)
        lw   $t3, 0($t1)
        add  $t0, $t0, $s2
        sw   $t0, 4($t1)
        add  $t3, $t3, $s2
        bne  $t1, $s1, loop
        sw   $t3, 0($t1)
        xxx

  13. Time for pipeline with loop unrolling
      • 200 ps per cycle, 1 CPI (including nops)
      • 4 iterations per pass means 250 passes through the loop
      • Time = 14 x 200 x 250 + 2 x 200 = 700,400 ps = 700.4 ns

  14. Dual Pipeline
      • Two instruction pipes
        • one for arithmetic or branch
        • one for load or store
      • Instructions can be issued at the same time
        • if there are no data dependencies
        • following instructions follow the same delay rules
      • Loop unrolling for more overlap
      • Register renaming to avoid data dependencies

  15. Dual Pipeline Code: pairing instructions
        sll  $t1, $s0, 2
        addu $t1, $t1, $s1
      loop:
        addi $t1, $t1, -4
        lw   $t0, 0($t1)
        nop
        add  $t0, $t0, $s2
        bne  $t1, $s1, loop      sw   $t0, 0($t1)
        nop
        xxx

  16. Dual Pipeline Code: fill branch delay slot
        sll  $t1, $s0, 2
        addu $t1, $t1, $s1
        addi $t1, $t1, -4
      loop:
        lw   $t0, 0($t1)
        nop
        add  $t0, $t0, $s2
        bne  $t1, $s1, loop      sw   $t0, 0($t1)
        addi $t1, $t1, -4
        xxx

  17. Time for dual pipeline (no loop unrolling)
      • 200 ps per cycle, 1 or 2 instructions per cycle
      • Time = 5 x 200 x 1000 + 3 x 200 = 1,000,600 ps = 1,000.6 ns

  18. Dual Pipe Optimization with loop unrolling: unrolled and reordered loop
        sll  $t1, $s0, 2
        addu $t1, $t1, $s1
      loop:
        addi $t1, $t1, -16
        lw   $t0, 12($t1)
        lw   $t3, 8($t1)
        add  $t0, $t0, $s2
        sw   $t0, 12($t1)
        lw   $t0, 4($t1)
        add  $t3, $t3, $s2
        sw   $t3, 8($t1)
        lw   $t3, 0($t1)
        add  $t0, $t0, $s2
        sw   $t0, 4($t1)
        add  $t3, $t3, $s2
        bne  $t1, $s1, loop
        sw   $t3, 0($t1)
      loopend:
        xxx

  19. Step 1: use more registers (register renaming)
        sll  $t1, $s0, 2
        addu $t1, $t1, $s1
      loop:
        addi $t1, $t1, -16
        lw   $t0, 12($t1)
        lw   $t3, 8($t1)
        add  $t0, $t0, $s2
        sw   $t0, 12($t1)
        lw   $t5, 4($t1)
        add  $t3, $t3, $s2
        sw   $t3, 8($t1)
        lw   $t7, 0($t1)
        add  $t5, $t5, $s2
        sw   $t5, 4($t1)
        add  $t7, $t7, $s2
        bne  $t1, $s1, loop
        sw   $t7, 0($t1)
      loopend:
        xxx

  20. Step 2: reorder/pair instructions
        sll  $t1, $s0, 2
        addu $t1, $t1, $s1
      loop:
        addi $t1, $t1, -16
        lw   $t0, 12($t1)
        lw   $t3, 8($t1)
        add  $t0, $t0, $s2       lw   $t5, 4($t1)
        add  $t3, $t3, $s2       sw   $t0, 12($t1)
        add  $t5, $t5, $s2       lw   $t7, 0($t1)
        sw   $t3, 8($t1)
        add  $t7, $t7, $s2       sw   $t5, 4($t1)
        bne  $t1, $s1, loop      sw   $t7, 0($t1)
        nop
        xxx

  21. Step 3: fill branch delay
        sll  $t1, $s0, 2
        addu $t1, $t1, $s1
        addi $t1, $t1, -16
        lw   $t0, 12($t1)
      loop:
        lw   $t3, 8($t1)
        add  $t0, $t0, $s2       lw   $t5, 4($t1)
        add  $t3, $t3, $s2       sw   $t0, 12($t1)
        add  $t5, $t5, $s2       lw   $t7, 0($t1)
        sw   $t3, 8($t1)
        add  $t7, $t7, $s2       sw   $t5, 4($t1)
        bne  $t1, $s1, loop      sw   $t7, 0($t1)
        addi $t1, $t1, -16       lw   $t0, -4($t1)
        xxx

  22. Time for dual pipeline
      • 200 ps per cycle, 1 or 2 instructions per cycle
      • 4 iterations per pass, 250 passes through the loop
      • Time = 8 x 200 x 250 + 4 x 200 = 400,800 ps = 400.8 ns
      • About 10 times faster than single cycle or multi-cycle
      • About 3 ½ times faster than simple pipeline
      • About 1 ¾ times faster than pipeline with loop unrolled
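The speedup figures can be reproduced from the cycle counts given on the timing slides. The sketch below collects those counts in one place; the struct, the constant names and the function names are hypothetical, while the numbers themselves come from slides 5, 7, 13 and 22.

```c
#include <assert.h>

/* One machine configuration, described the way the timing slides do. */
struct config {
    long long cycles_per_pass;  /* cycles for one pass through the loop */
    long long passes;           /* number of passes for 1000 elements   */
    long long setup_cycles;     /* cycles spent before the loop         */
    long long cycle_ps;         /* cycle time in picoseconds            */
};

/* Figures taken from slides 5, 7, 13 and 22. */
const struct config SINGLE_CYCLE    = {  5, 1000, 2, 800 };
const struct config SIMPLE_PIPELINE = {  7, 1000, 2, 200 };
const struct config UNROLLED_PIPE   = { 14,  250, 2, 200 };
const struct config DUAL_UNROLLED   = {  8,  250, 4, 200 };

/* Total run time in picoseconds. */
long long total_ps(struct config c)
{
    assert(c.cycle_ps > 0);
    return (c.cycles_per_pass * c.passes + c.setup_cycles) * c.cycle_ps;
}

/* How many times faster `fast` is than `slow`. */
double speedup(struct config slow, struct config fast)
{
    return (double)total_ps(slow) / (double)total_ps(fast);
}
```

Evaluating `speedup` against `DUAL_UNROLLED` gives ratios just under 10, 3.5 and 1.75 for the single-cycle, simple-pipeline and unrolled-pipeline cases, which the slide rounds to 10, 3 ½ and 1 ¾.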

  23. More Parallelism?
      • Suppose the loop has more operations
      • Multiplication (takes longer than adds)
      • Floating point (takes much longer than integer)
      • More parallel pipelines for different operations: all the above techniques could result in better performance
      • Static (as above) vs dynamic reordering
      • Speculation
      • Out-of-order execution
