Chapter Six Pipelining: Overview
Pipelining • Improve performance by increasing instruction throughput • The ideal speed-up equals the number of stages in the pipeline. Do we achieve this?
single-cycle vs. pipelined performance • This chapter assumes only 8 instructions: lw, sw, add, sub, and, or, slt, beq.
single-cycle vs. pipelined performance • Single-clock implementation: the clock must be as long as the longest instruction, i.e., lw at 8 ns • To execute 3 lw instructions: 3 × 8 = 24 ns • In the pipelined implementation every stage takes a single clock cycle, so the clock must accommodate the slowest stage: 2 ns. • Pipelined time: see next slide.
single-cycle vs. pipelined performance • Pipelined time: 14 ns
single-cycle vs. pipelined performance • Speed-up: under ideal conditions (perfectly balanced stages), $\text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of pipe stages}}$ • Ideal: a 5-stage pipeline gives a 5-times speed-up. • Problems: • the stages may be imperfectly balanced. • pipelining involves some overhead. • Result: the time per instruction in the pipelined machine will exceed the minimum possible.
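The effect of unbalanced stages can be checked with a few lines of Python. This is a sketch under an assumption: the per-stage latencies below (IF 2, ID 1, EX 2, MEM 2, WB 1 ns) are chosen only so that they add up to the 8 ns lw time on the previous slides, and pipeline-register overhead is ignored. The pipelined clock is set by the slowest stage, so the steady-state ratio is 8/2 rather than the stage count.

```python
from fractions import Fraction

# Assumed per-stage latencies (ns) for lw; only their 8 ns total comes from the slides.
STAGE_NS = {"IF": 2, "ID": 1, "EX": 2, "MEM": 2, "WB": 1}

# Hypothetical: the same 8 ns of work split perfectly evenly across the five stages.
BALANCED_NS = {name: Fraction(8, 5) for name in STAGE_NS}


def single_cycle_clock(stages):
    """Single-cycle design: one clock long enough for every stage in sequence."""
    return sum(stages.values())


def pipelined_clock(stages):
    """Pipelined design: each stage gets one clock, so the slowest stage sets it."""
    return max(stages.values())


def steady_state_speedup(stages):
    """Time between instructions, nonpipelined / pipelined (ignores pipeline fill)."""
    return single_cycle_clock(stages) / pipelined_clock(stages)


if __name__ == "__main__":
    print("unbalanced:", steady_state_speedup(STAGE_NS))     # 4.0
    print("balanced:  ", steady_state_speedup(BALANCED_NS))  # 5
```

Only the perfectly balanced split reaches the ideal factor of 5.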
single-cycle vs. pipelined performance • Note that we got 14 ns vs. 24 ns, not a 4-fold speed-up (8 ns / 2 ns). • The total time for only three instructions is misleading: • assume that we had 1003 instructions, i.e., add 1000 instructions to the pipeline • Each added instruction adds 2 ns to the total execution time: 2 × 1000 + 14 = 2014 ns • Single clock: 8 × 1000 + 24 = 8024 ns • Ratio: 8024 / 2014 ≈ 3.98 • Pipelining improves performance by increasing instruction throughput • It does not decrease the execution time of an individual instruction
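The arithmetic above follows from a simple timing model: the single-clock machine spends 8 ns per instruction, while the pipeline needs 5 clocks of 2 ns for the first instruction and one more clock for each later one. A minimal Python sketch of that model (assuming no stalls):

```python
SINGLE_CYCLE_CLOCK_NS = 8  # longest instruction (lw)
PIPELINED_CLOCK_NS = 2     # slowest pipeline stage
NUM_STAGES = 5


def single_cycle_time(n):
    """Every instruction takes one full 8 ns clock, one after another."""
    return n * SINGLE_CYCLE_CLOCK_NS


def pipelined_time(n):
    """First instruction needs all 5 stages; each later one completes 1 clock after the previous."""
    return (NUM_STAGES + n - 1) * PIPELINED_CLOCK_NS


if __name__ == "__main__":
    for n in (3, 1003, 1_000_003):
        s, p = single_cycle_time(n), pipelined_time(n)
        print(f"{n:>9} instructions: {s} ns vs {p} ns, ratio {s / p:.2f}")
```

For 3 and 1003 instructions this reproduces 24 vs. 14 ns and 8024 vs. 2014 ns; as the instruction count grows, the ratio approaches 8/2 = 4, which is the throughput gain.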
Designing instruction sets for pipelining • MIPS instructions are all the same length. • This makes it easier to fetch instructions in stage 1 and decode them in stage 2. • In the 80x86 instruction set, instructions vary from 1 byte to 17 bytes, which makes pipelining harder. • MIPS has only a few instruction formats. • The source register fields are in the same place in each instruction. • So the second stage can begin reading the register file at the same time that the hardware is decoding the instruction. • If the instruction formats were not this regular, MIPS would have to split stage 2, giving 6 stages.
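To make the "same place" point concrete: in both the MIPS R-format and I-format, rs occupies bits 25-21 and rt bits 20-16, so both fields can be extracted and sent to the register file before the opcode is decoded; any read that turns out to be unneeded is simply ignored. The encoder helpers below are illustrative only, not part of any real toolchain.

```python
# Register numbers for the registers used in this chapter's examples.
REG = {"$t0": 8, "$t1": 9, "$s0": 16}


def encode_r(funct, rd, rs, rt, shamt=0):
    """R-format: op(6)=0 | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6)."""
    return (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(op, rt, rs, imm):
    """I-format: op(6) | rs(5) | rt(5) | immediate(16)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def source_fields(word):
    """Extract rs and rt without knowing the format: they sit in the same bits."""
    return (word >> 21) & 0x1F, (word >> 16) & 0x1F


ADD_WORD = encode_r(0x20, rd=REG["$s0"], rs=REG["$t0"], rt=REG["$t1"])  # add $s0, $t0, $t1
LW_WORD = encode_i(0x23, rt=REG["$t0"], rs=REG["$t1"], imm=0)           # lw $t0, 0($t1)

if __name__ == "__main__":
    print(hex(ADD_WORD), source_fields(ADD_WORD))  # rs=$t0 (8), rt=$t1 (9)
    print(hex(LW_WORD), source_fields(LW_WORD))    # rs=$t1 (9), rt=$t0 (8)
```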
Designing instruction sets for pipelining • In MIPS, memory operands appear only in loads and stores. • So the execute stage can calculate the memory address, and memory can be accessed in the following stage. • In the 80x86, instructions can operate on operands in memory. • So stages 3 and 4 would expand to an address stage, a memory stage, and then an execute stage. • MIPS operands must be aligned in memory. • So a single data transfer instruction never requires two data memory accesses. • Data is always transferred between processor and memory in a single pipeline stage.
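Alignment is what guarantees a single access. The Python sketch below assumes a hypothetical memory that returns one aligned 4-byte word per access and counts how many words a 4-byte load or store would touch.

```python
WORD_BYTES = 4  # assumed width of one memory access


def is_aligned(address, size=WORD_BYTES):
    """A word operand is aligned when its address is a multiple of 4."""
    return address % size == 0


def words_touched(address, size=WORD_BYTES):
    """How many aligned memory words an access of `size` bytes at `address` spans."""
    first_word = address // WORD_BYTES
    last_word = (address + size - 1) // WORD_BYTES
    return last_word - first_word + 1


if __name__ == "__main__":
    for addr in (8, 6):
        print(addr, is_aligned(addr), words_touched(addr))
```

An aligned word at address 8 needs one access; an unaligned word at address 6 straddles two words and would need two, which is exactly what the MIPS alignment rule rules out.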
Pipelining • What makes it easy • all instructions are the same length • just a few instruction formats • memory operands appear only in loads and stores • What makes it hard? • structural hazards: suppose we had only one memory • control hazards: need to worry about branch instructions • data hazards: an instruction depends on a previous instruction • We’ll build a simple pipeline and look at these issues • We’ll talk about modern processors and what really makes it hard: • exception handling • trying to improve performance with out-of-order execution, etc.
Pipeline hazards • Structural hazards • The hardware cannot support the combination of instructions that we want to execute in the same clock cycle. • Example: assume a single memory (e.g., one cache). • Suppose the earlier pipeline example had a fourth instruction. • In the same clock cycle, the first instruction is accessing data in memory while the fourth instruction is fetching an instruction from that same memory.
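The collision can be found mechanically by tracking which stage each instruction occupies in each cycle. This Python sketch assumes one instruction enters IF per cycle with no stalls and, as in the earlier lw example, treats every instruction as a load, so every MEM stage uses the single shared memory.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_at(instr, cycle):
    """Stage of 1-indexed instruction `instr` in 1-indexed clock `cycle`, or None."""
    k = cycle - instr
    return STAGES[k] if 0 <= k < len(STAGES) else None


def single_memory_conflicts(num_instrs):
    """Cycles in which one instruction is in IF while another is in MEM."""
    last_cycle = num_instrs + len(STAGES) - 1
    conflicts = []
    for cycle in range(1, last_cycle + 1):
        active = [stage_at(i, cycle) for i in range(1, num_instrs + 1)]
        if "IF" in active and "MEM" in active:
            conflicts.append(cycle)
    return conflicts


if __name__ == "__main__":
    print(single_memory_conflicts(3))  # []
    print(single_memory_conflicts(4))  # [4]
```

With three loads there is no conflict; adding a fourth instruction makes its fetch collide with the first load's data access in cycle 4.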
Pipeline hazards • Control hazards • We need to make a decision based on the result of one instruction while others are executing. • Example: the branch instruction. • One solution: stall. • Assume we have enough extra hardware to test registers, calculate the branch address, and update the PC during the second stage (we’ll do this later). • Result: next slide.
Pipeline hazards • Control hazards • The lw instruction, executed if the branch fails (is not taken), is stalled one extra 2-ns clock cycle before starting. • This is called a pipeline stall or bubble.
Pipeline hazards • Control hazards • Problem: if the branch cannot be resolved in the second stage, we must stall longer. • This is common with longer pipelines. • Too slow. • Solution: predict whether the branch will fail, execute accordingly, and undo the work if the prediction is wrong. • Example: always predict that branches will fail (not be taken). • The pipeline slows down only when a branch is taken. • See next slide.
Pipeline hazards • Top figure: branch not taken. • Bottom figure: branch taken.
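The difference between stalling on every branch and predicting "not taken" can be counted directly. A minimal Python sketch, assuming (as above) that the branch resolves in stage 2 so each lost fetch costs one bubble; the outcome sequence is made up.

```python
BRANCH_PENALTY = 1  # bubble cycles per lost fetch, assuming the branch resolves in stage 2


def extra_cycles(outcomes, policy):
    """Bubble cycles for a sequence of branch outcomes (True = taken)."""
    if policy == "stall":
        # Never fetch past a branch until it resolves: every branch costs a bubble.
        return BRANCH_PENALTY * len(outcomes)
    if policy == "predict_not_taken":
        # Keep fetching sequentially; only a taken branch discards the fetched instruction.
        return BRANCH_PENALTY * sum(1 for taken in outcomes if taken)
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    outcomes = [True, False, False, True]
    for policy in ("stall", "predict_not_taken"):
        print(policy, extra_cycles(outcomes, policy))
```

For two taken and two untaken branches, stalling costs 4 bubbles while predict-not-taken costs 2.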
Pipeline hazards • More sophisticated prediction: • Always predict that a branch at the bottom of a loop is taken • Dynamic hardware predictors: • The guess depends on the behavior of each branch. • Predictions can change over the life of a program. • Example: keep a history of each branch as taken or untaken, and use the past to predict the future. • Accuracy of this: about 90% • If the prediction is wrong, the pipeline must restart from the proper branch address.
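The simplest form of the per-branch history is a 1-bit predictor: remember whether each branch was last taken and predict the same again. The Python sketch below is illustrative; the branch PC and loop trip count are hypothetical, and real hardware predictors are usually more elaborate, so the accuracy here is not meant to match the figure above.

```python
class OneBitPredictor:
    """Predicts that each branch will do what it did last time."""

    def __init__(self):
        self.last_outcome = {}  # branch PC -> True (taken) / False (not taken)

    def predict(self, pc):
        return self.last_outcome.get(pc, False)  # never-seen branch: predict not taken

    def update(self, pc, taken):
        self.last_outcome[pc] = taken


def count_mispredictions(predictor, trace):
    """Run a trace of (pc, taken) pairs; each miss would force a pipeline restart."""
    misses = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# Hypothetical loop-closing branch at PC 0x40: taken 9 times, then falls through.
# The whole loop is executed 3 times.
LOOP_TRACE = ([(0x40, True)] * 9 + [(0x40, False)]) * 3

if __name__ == "__main__":
    misses = count_mispredictions(OneBitPredictor(), LOOP_TRACE)
    print(f"{misses} misses, accuracy {1 - misses / len(LOOP_TRACE):.0%}")
```

Each pass through the loop mispredicts twice (on the first iteration and on the exit), giving 6 misses out of 30 branches, or 80% accuracy for this trace.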
Pipeline hazards • Cost of stalls: assume all instructions have a CPI of 1 and each branch adds a 1-clock delay. • Assume 17% of instructions are branches. • CPI becomes 1 + 0.17 × 1 = 1.17. • So the slowdown is 1.17. • Note that slt and slti are counted as branch instructions here but will not stall, so this is an approximation.
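The calculation generalizes to any branch frequency and penalty: the average CPI is the base CPI plus (branch fraction × stall cycles per branch). A one-function Python sketch:

```python
def effective_cpi(base_cpi, branch_fraction, stall_cycles_per_branch):
    """Average CPI when a fraction of instructions each add a fixed number of stall cycles."""
    return base_cpi + branch_fraction * stall_cycles_per_branch


if __name__ == "__main__":
    cpi = effective_cpi(base_cpi=1.0, branch_fraction=0.17, stall_cycles_per_branch=1)
    print(f"CPI = {cpi:.2f}, slowdown = {cpi / 1.0:.2f}x")
```

With the slide's numbers this gives 1.17; a 2-cycle penalty (branch resolved later in a longer pipeline) would give 1.34.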
Pipeline hazards • Delayed decision (what MIPS actually does) • A delayed branch always executes the next sequential instruction. • The branch takes effect after that one-instruction delay. • The assembler automatically places an instruction into the branch delay slot. • Compilers typically fill 50% of the branch delay slots; an unfilled slot holds a nop.
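One simple way to fill a delay slot is to move the instruction just before the branch into the slot, which is legal when the branch does not read the register that instruction writes (the moved instruction still always executes, one slot later). The Python sketch below implements only that single strategy; the `Instr` tuple and the label `L` are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str              # assembly text, for display
    dest: Optional[str]    # register written, or None
    srcs: Tuple[str, ...]  # registers read


NOP = Instr("nop", None, ())


def fill_delay_slot(before: List[Instr], branch: Instr) -> List[Instr]:
    """Fill the slot after `branch` with the instruction just before it, if legal.

    Assumes `before` contains no branches. If the branch reads the candidate's
    result, moving it would change the branch decision, so a nop is used instead.
    """
    if before and before[-1].dest not in branch.srcs:
        return before[:-1] + [branch, before[-1]]
    return before + [branch, NOP]


if __name__ == "__main__":
    add = Instr("add $s1, $s2, $s3", "$s1", ("$s2", "$s3"))
    beq = Instr("beq $s2, $zero, L", None, ("$s2", "$zero"))  # L: hypothetical label
    print([i.text for i in fill_delay_slot([add], beq)])
```

Here the add moves into the slot; if it wrote $s2 instead, the slot would get a nop, which is the "unfilled" case above.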
Pipeline hazards • Data hazards • An instruction depends on the result of a previous instruction that is still in the pipeline. • Example:
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
• Problem: the sub needs the result of the add (i.e., $s0). • We could add bubbles, but the add doesn’t write its result until stage 5! • We cannot rely on the compiler to remove these: such dependences are too common. • Solution: forwarding (also called bypassing) • Get the needed value as soon as it is calculated, before it is written to the register file.
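Spotting such a dependence amounts to comparing each instruction's source registers with the destinations of the instructions still in flight ahead of it. The Python sketch below uses a simplified, hypothetical parser that handles only R-format instructions and lw/sw. The default `window=2` is an assumption: it presumes the register file is written in the first half of a cycle and read in the second, so a result becomes readable three instructions later.

```python
import re


def parse(line):
    """(dest, srcs) for add/sub/and/or/slt and lw/sw only; simplified and hypothetical."""
    op, operands = line.split(None, 1)
    regs = re.findall(r"\$\w+", operands)
    if op == "sw":            # sw rt, offset(rs): reads rt and rs, writes no register
        return None, regs
    return regs[0], regs[1:]  # add rd, rs, rt  /  lw rt, offset(rs)


def raw_hazards(program, window=2):
    """(producer, consumer, register) triples where an instruction reads a register
    written by one of the `window` instructions before it."""
    decoded = [parse(line) for line in program]
    hazards = []
    for j, (_, srcs) in enumerate(decoded):
        for i in range(max(0, j - window), j):
            dest = decoded[i][0]
            if dest is not None and dest in srcs:
                hazards.append((i, j, dest))
    return hazards


if __name__ == "__main__":
    print(raw_hazards(["add $s0, $t0, $t1", "sub $t2, $s0, $t3"]))  # [(0, 1, '$s0')]
```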
Sidetrack: new pipeline representation • Use symbols to represent the physical resources. • IF: instruction fetch stage. Box represents instruction memory. • ID: instruction decode/register read stage. Box represents register file. • EX: execution stage. Box represents ALU. • MEM: memory access stage. Box represents data memory. • WB: write back stage. Box represents register file. • Shading: shading on the right half of a box means the element is read in that stage; shading on the left half means it is written.
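A rough text equivalent of this multiple-clock-cycle diagram can be generated in Python: one row per instruction, one column per clock, with stage names in place of the resource boxes. It does not show the read/write shading, and it assumes one instruction enters IF per cycle with no stalls.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def diagram(program):
    """Rows = instructions, columns = clock cycles; '' means not in the pipeline."""
    num_cycles = len(program) + len(STAGES) - 1
    rows = []
    for i in range(len(program)):
        row = [""] * num_cycles
        for k, stage in enumerate(STAGES):
            row[i + k] = stage  # instruction i is in stage k during cycle i + k
        rows.append(row)
    return rows


def render(program):
    """Text version of the diagram, one line per instruction."""
    width = max(len(line) for line in program)
    return "\n".join(
        line.ljust(width) + " | " + " ".join(f"{stage:<3}" for stage in row)
        for line, row in zip(program, diagram(program))
    )


if __name__ == "__main__":
    print(render(["add $s0, $t0, $t1", "sub $t2, $s0, $t3"]))
```

For the add/sub pair, the add's EX (where $s0 is computed) sits in the column just before the sub's EX, which is why forwarding that value is possible.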
Pipeline hazards • Example: forwarding solution for the add/sub sequence above.
Pipeline hazards • Forwarding is valid only if the destination stage is later in time than the source stage. • E.g., we cannot forward from the output of the memory access stage of the first instruction to the input of the execution stage of the following instruction. • Forwarding cannot prevent all pipeline stalls. • Example:
    lw  $s0, 0($t1)     # data loaded into $s0 in stage 4
    sub $s0, $s0, $t1   # data needed in stage 3
• Must stall • See next slide.
Pipeline hazards • Load-use data hazard
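The "later in time" rule can be written as a small timing check: a value produced at the end of one cycle can be forwarded into the next cycle, and any shortfall becomes bubbles. This Python sketch assumes stages are numbered 1 to 5 and that an instruction issued in cycle 1 is in stage s during cycle s; ALU results are ready at the end of EX, loaded data at the end of MEM.

```python
# Stage numbers; an instruction issued in cycle 1 is in stage s during cycle s.
STAGE = {"IF": 1, "ID": 2, "EX": 3, "MEM": 4, "WB": 5}


def stalls_with_forwarding(distance, produced_in, needed_in="EX"):
    """Bubbles needed so a forwarded value arrives in time.

    distance:    consumer position minus producer position (1 = next instruction)
    produced_in: stage at whose end the producer has the value (EX for ALU ops, MEM for lw)
    needed_in:   stage at whose start the consumer needs it
    """
    ready_cycle = STAGE[produced_in]            # producer issues in cycle 1
    needed_cycle = distance + STAGE[needed_in]  # consumer issues `distance` cycles later
    # A value produced at the end of cycle c can only be used in cycle c + 1 or later.
    return max(0, ready_cycle + 1 - needed_cycle)


if __name__ == "__main__":
    print("add -> sub:", stalls_with_forwarding(1, "EX"))   # 0
    print("lw  -> sub:", stalls_with_forwarding(1, "MEM"))  # 1
```

An ALU result forwards with no bubble, while a load followed immediately by a user of the loaded register needs one bubble; with one independent instruction in between, the load needs none.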
Pipeline hazards • Can reorder code to avoid pipeline stalls • Example:
    # reg $t1 has address of v[k]
    lw $t0, 0($t1)    # reg $t0 (temp) = v[k]
    lw $t2, 4($t1)    # reg $t2 = v[k+1]
    sw $t2, 0($t1)    # v[k] = reg $t2
    sw $t0, 4($t1)    # v[k+1] = reg $t0 (temp)
• A hazard occurs on register $t2 between the second lw and the first sw. • Swap the two sw instructions to eliminate the hazard:
    # reg $t1 has address of v[k]
    lw $t0, 0($t1)    # reg $t0 (temp) = v[k]
    lw $t2, 4($t1)    # reg $t2 = v[k+1]
    sw $t0, 4($t1)    # v[k+1] = reg $t0 (temp)
    sw $t2, 0($t1)    # v[k] = reg $t2
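The benefit of the swap can be counted. This Python sketch assumes full forwarding, so the only stall is a one-cycle load-use stall; following the slide, the register a sw stores counts as "read" by the sw. Instructions are written as hypothetical (opcode, dest, srcs) tuples.

```python
# Each instruction as (opcode, register written or None, registers read).
ORIGINAL = [
    ("lw", "$t0", ("$t1",)),       # lw $t0, 0($t1)
    ("lw", "$t2", ("$t1",)),       # lw $t2, 4($t1)
    ("sw", None, ("$t2", "$t1")),  # sw $t2, 0($t1)
    ("sw", None, ("$t0", "$t1")),  # sw $t0, 4($t1)
]
REORDERED = [ORIGINAL[0], ORIGINAL[1], ORIGINAL[3], ORIGINAL[2]]

NUM_STAGES = 5


def load_use_stalls(program):
    """One stall whenever an instruction reads the register loaded by the lw just before it."""
    stalls = 0
    for (prev_op, prev_dest, _), (_, _, srcs) in zip(program, program[1:]):
        if prev_op == "lw" and prev_dest in srcs:
            stalls += 1
    return stalls


def total_cycles(program):
    """Fill the pipeline once, then one instruction completes per cycle, plus stalls."""
    return NUM_STAGES + len(program) - 1 + load_use_stalls(program)


if __name__ == "__main__":
    for name, prog in (("original", ORIGINAL), ("reordered", REORDERED)):
        print(name, load_use_stalls(prog), "stall(s),", total_cycles(prog), "cycles")
```

Under these assumptions the original order takes 9 cycles (one stall) and the swapped order takes 8.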
Pipeline hazards • The original MIPS processors required software to follow a load with an instruction independent of that load. • This is called a delayed load. • MIPS was designed to make forwarding easier. • Each MIPS instruction writes at most one result, at the end of its execution. • Forwarding is harder if there are multiple results to forward per instruction. • It is also harder if an instruction needs to write a result before its end.
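Under a delayed-load rule, the simplest legal fix when nothing independent is available is a nop between the load and its user. This Python sketch applies only that fallback; a real scheduler would first try to move an independent instruction into the slot. Instructions are hypothetical (opcode, dest, srcs) tuples.

```python
NOP = ("nop", None, ())


def insert_load_delay_nops(program):
    """Put a nop after any lw whose loaded register the next instruction reads."""
    out = []
    for instr in program:
        _, _, srcs = instr
        if out and out[-1][0] == "lw" and out[-1][1] in srcs:
            out.append(NOP)  # the load delay slot cannot hold a user of the loaded value
        out.append(instr)
    return out


if __name__ == "__main__":
    program = [
        ("lw", "$s0", ("$t1",)),         # lw  $s0, 0($t1)
        ("sub", "$s0", ("$s0", "$t1")),  # sub $s0, $s0, $t1
    ]
    for op, dest, srcs in insert_load_delay_nops(program):
        print(op, dest, srcs)
```

For the lw/sub example from the forwarding slide, this yields lw, nop, sub.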
Pipeline hazards: other hazards • Ian’s hazard: the instruction is not in the cache (memory) • Save the original PC value (current PC - 4) • Stall the pipeline • Fetch the instruction from RAM (or the level 2 cache) • Write the cache entry when it arrives from RAM • Restart the program at the original PC value • This refetches the instruction • and this time finds it in the cache • Data not in cache • Similar, but the processor can continue to execute later instructions while it waits (if they don’t use data from the stalled instruction). • Other techniques are covered in chapter 7.
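The miss-handling steps can be mimicked with a toy instruction cache. This Python sketch assumes a hypothetical direct-mapped cache holding one instruction word per line and a made-up 10-cycle miss penalty; the `pc` argument plays the role of the saved PC (the faulting instruction's address), and the refetch after the fill is simply a second lookup that now hits.

```python
MISS_PENALTY = 10  # hypothetical stall cycles to fetch a word from RAM / L2 cache


class DirectMappedICache:
    """Tiny direct-mapped instruction cache: one instruction word per line."""

    def __init__(self, num_lines, memory):
        self.num_lines = num_lines
        self.memory = memory  # word address -> instruction text (stands in for RAM)
        self.lines = {}       # line index -> (tag, instruction)

    def _lookup(self, word):
        index, tag = word % self.num_lines, word // self.num_lines
        entry = self.lines.get(index)
        return entry[1] if entry is not None and entry[0] == tag else None

    def fetch(self, pc):
        """Return (instruction, stall_cycles) for the byte address `pc`."""
        word = pc // 4
        instruction = self._lookup(word)
        if instruction is not None:
            return instruction, 0  # hit: no stall
        # Miss: keep `pc` as the restart point, stall while RAM delivers the word,
        # then write the cache entry.
        self.lines[word % self.num_lines] = (word // self.num_lines, self.memory[word])
        # Restart at the saved PC: the refetch now hits.
        return self._lookup(word), MISS_PENALTY


if __name__ == "__main__":
    ram = {0: "lw $t0, 0($t1)", 1: "add $s0, $t0, $t1"}
    cache = DirectMappedICache(num_lines=4, memory=ram)
    print(cache.fetch(0))  # first fetch misses and stalls
    print(cache.fetch(0))  # second fetch hits
```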