
ENEE350 Lecture Notes-Weeks 14 and 15 Pipelining & Amdahl’s Law



  1. ENEE350 Lecture Notes-Weeks 14 and 15 Pipelining & Amdahl’s Law Pipelining is a method of processing in which a problem is divided into a number of sub-problems, each sub-problem is solved by a dedicated stage, and the solutions of the sub-problems for different instances of the problem are overlapped.

  2. Example: a[i] = b[i] + c[i] + d[i] + e[i] + f[i], i = 1, 2, 3, …, n. [Figure: a chain of four adders with delay registers D aligning the operands b[i]…f[i], so that each adder works on a different instance i.] Adders have delay D to compute. Computation time = 4D + (n − 1)D = nD + 3D. Speed-up = 4nD/(nD + 3D) → 4 for large n.
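The timing argument above can be sketched in a few lines; the function names are illustrative, not from the slides. A 4-stage adder pipeline with per-stage delay D produces its first result after 4D and one more result every D thereafter.

```python
def pipeline_time(n, d):
    """Time to produce n results on the 4-stage adder pipeline."""
    return 4 * d + (n - 1) * d        # = (n + 3) * d

def serial_time(n, d):
    """Without pipelining, each result needs 4 sequential additions."""
    return 4 * n * d

def speedup(n, d=1.0):
    # Ratio 4nD / (nD + 3D), which tends to 4 as n grows.
    return serial_time(n, d) / pipeline_time(n, d)
```

For n = 10^6 results the ratio is already within about 10^-5 of the limit 4.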

  3. We can describe the computation process in a linear pipeline algorithmically. There are three distinct phases to this computation: (a) filling the pipeline, (b) running the pipeline in the filled state until the last input arrives, and (c) emptying the pipeline.

  4. Instruction pipelines. Goal: (i) to increase the throughput (number of instructions/sec) in executing programs; (ii) to reduce the execution time (clock cycles/instruction). [Figure: a 3-stage pipeline clocked through fetch, decode, and execute stages.]

  5. [Figure: a 5-stage pipeline clocked through fetch, decode, execute, memory, and write-back stages.]

  6. Speed-up of pipelined execution of instructions over a sequential execution, assuming that the systems operate at the same clock rate and use the same number of operations: a k-stage pipeline executing n instructions finishes in k + (n − 1) cycles instead of nk, so Speed-up = nk/(k + n − 1), which approaches k for large n (compare the 4-stage adder example: 4n/(n + 3)).

  7. Example: Suppose that the instruction mix of programs executed on serial and pipelined machines is 40% ALU, 20% branching, and 40% memory, with 4, 2, and 4 cycles per instruction in the three classes respectively. Then, under ideal conditions (no stalls due to hazards), the serial machine needs 0.4 × 4 + 0.2 × 2 + 0.4 × 4 = 3.6 cycles per instruction while the pipelined machine needs 1, for a speed-up of 3.6. If the clock period needs to be increased for the pipelined implementation, then the speed-up will have to be scaled down accordingly.
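The instruction-mix computation above can be checked directly; this is a minimal sketch, assuming the pipelined machine sustains the ideal CPI of 1.

```python
# (fraction of instructions, cycles per instruction) for each class
mix = {"alu": (0.40, 4), "branch": (0.20, 2), "memory": (0.40, 4)}

# Weighted average CPI of the serial machine.
serial_cpi = sum(frac * cycles for frac, cycles in mix.values())

pipelined_cpi = 1.0                     # ideal pipeline: 1 cycle/instruction
speedup = serial_cpi / pipelined_cpi    # 3.6 under ideal conditions
```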

  8. MIPS Pipeline. Register operations: IF ID EX WB. Register/Memory operations: IF ID EX ME WB.

  9. Instruction Pipelines (Hennessy & Patterson)

  10. Hazards: 1- Structural Hazards, 2- Data Hazards, 3- Control Hazards. Structural Hazards: they arise when limited resources are scheduled to operate on different streams during the same clock period.

  11. Structural Hazards: They arise when limited resources are scheduled to operate concurrently on different streams during the same clock period. Example: Memory conflict (data fetch + instruction fetch) or datapath conflict (arithmetic operation + PC update)

  12. Fix: duplicate hardware (too expensive), or stall the pipeline to serialize the operations (too slow).

  13. Speed-up = Tserial/Tpipeline = 5nts/(2nts + 2ts) for odd n, = 5nts/(2nts + 3ts) for even n → 5/2 as the number of instructions, n, tends to infinity. Thus, we lose half the throughput due to stalls. Note: the pipeline execution time can be computed using the recurrences T1 = 4, Ti = Ti-1 + 1 for even i, Ti = Ti-1 + 3 for odd i.
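The recurrences above can be evaluated numerically to confirm the asymptotic speed-up of 5/2; this is a sketch with illustrative names, taking the serial machine at 5 cycles per instruction as in the slide.

```python
def pipeline_time(n):
    """Stalled pipeline time from the slide's recurrence:
    T1 = 4; each even-numbered instruction adds 1 cycle, each odd one 3."""
    t = 4                              # T1 = 4
    for i in range(2, n + 1):
        t += 1 if i % 2 == 0 else 3
    return t

def speedup(n):
    # Serial execution takes 5 cycles per instruction.
    return 5 * n / pipeline_time(n)
```

The pipeline settles into 2 extra cycles per instruction on average (one +1 and one +3 per pair), so the speed-up approaches 5/2.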

  14. Data Hazards: they occur when the executions of two instructions may result in the incorrect reading of operands and/or writing of a result. Read After Write (RAW) Hazard (data dependency); Write After Read (WAR) Hazard (data anti-dependency); Write After Write (WAW) Hazard (data output dependency).

  15. RAW Hazards: they occur when reads are early and writes are late. I1: R1 = R1 + R2; I2: R3 = R1 + R2

  16. RAW Hazards (Cont’d): they can be avoided by stalling the reads, but this increases the execution time. A better approach is to use data forwarding: I1: R1 = R1 + R2; I2: R3 = R1 + R2
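The forwarding decision can be sketched as a simple register comparison; the tuple encoding (dest, src1, src2) and the function name are hypothetical, chosen only for illustration.

```python
def needs_forwarding(producer, consumer):
    """True if the consumer reads the register the producer writes, so
    the ALU result must be forwarded to the consumer's EX stage instead
    of being read (stale) from the register file."""
    dest, _, _ = producer
    _, s1, s2 = consumer
    return dest in (s1, s2)

# I1: R1 = R1 + R2  ->  (1, 1, 2)
# I2: R3 = R1 + R2  ->  (3, 1, 2)   reads R1 written by I1
```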

  17. WAR Hazards: they occur when writes are early and reads are late. I1: R2 = R2 + R3; R9 = R3 + R4. I2: R3 = R7 + R5; R6 = R2 + R8. (I2 writes R3, which I1 must read first.)

  18. Branch Prediction in Pipeline Instruction Sequencing. One of the major issues in pipelined instruction processing is scheduling conditional branch instructions. When a pipeline controller encounters a conditional branch instruction, it must choose one of two instruction streams to continue with. If the branch condition is met, then execution continues from the target of the conditional branch instruction; otherwise, it continues with the instruction that follows the conditional branch instruction.

  19. Example: Suppose that we execute the following assembly code on a 5-stage pipeline (IF, ID, EX, ME, WB): JCD R0 < 10,add; SUB R0,R1; JMP D,halt; add: ADD R0,R1; halt: HLT; If we assume that R0 < 10, then the SUB instruction would have been incorrectly fetched during the second clock cycle, and we will need another fetch cycle to fetch the ADD instruction.

  20. Classification of branch prediction algorithms. Static Branch Prediction: the branch decision does not change over time -- we use a fixed branching policy. Dynamic Branch Prediction: the branch decision does change over time -- we use a branching policy that varies over time.

  21. Static Branch Prediction Algorithms: 1- Don’t predict (stall the pipeline) 2- Never take the branch 3- Always take the branch 4- Delayed branch

  22. 1- Stall the pipeline by 1 clock cycle: this allows us to determine the target of the branch instruction. Stall and decide the branch.

  23. Pipeline Execution Speed (stall case): Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as CPI of the pipeline = CPI of ideal pipeline + the number of idle cycles/instruction = 1 + branch penalty × branch frequency = 1 + branch frequency (with a 1-cycle branch penalty). In general, CPI of the pipeline > 1 + branch frequency because of data and possibly structural hazards. Pros: straightforward to implement. Cons: the time overhead is high when the instruction mix includes a high percentage of branch instructions.
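The CPI estimate above is a one-line formula; this sketch (parameter names are illustrative) makes the penalty and frequency explicit.

```python
def pipeline_cpi(branch_frequency, branch_penalty=1, ideal_cpi=1.0):
    """CPI with branch stalls only: each branch adds `branch_penalty`
    idle cycles, weighted by how often branches occur."""
    return ideal_cpi + branch_penalty * branch_frequency
```

For example, with 20% branches and a 1-cycle stall the CPI rises from 1.0 to 1.2.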

  24. 2- Never take the branch. The instruction in the pipeline is flushed if, after the ID stage is carried out, it is determined that the branch should have been taken. The SUB instruction is always fetched, and then either the IOR instruction is executed next or SUB is flushed and XOR is executed.

  25. Pipeline Execution Speed (never-take-the-branch case): Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as CPI of the pipeline = CPI of ideal pipeline + the number of idle cycles/instruction = 1 + branch penalty × branch frequency × misprediction rate = 1 + branch frequency × misprediction rate (with a 1-cycle branch penalty). Pros: if the prediction is highly accurate, then the pipeline can operate close to its full throughput. Cons: implementation is not as straightforward and requires flushing if decoding the branch address takes more than 1 clock cycle.

  26. 3- Always take the branch. The instruction in the pipeline is flushed if, after the ID stage is carried out, it is determined that the branch should not have been taken. [Figure: branch target address computation.]

  27. Pipeline Execution Speed (always-take-the-branch case): Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as CPI of the pipeline = CPI of ideal pipeline + the number of idle cycles/instruction = 1 + branch penalty × branch frequency × correct-prediction rate + branch penalty × branch frequency × misprediction rate = 1 + branch frequency × correct-prediction rate + 2 × branch frequency × misprediction rate. Pros: better suited for the execution of loops without the compiler's intervention (but this can generally be overcome, see the next slide). Cons: implementation is not as straightforward, and it has a higher misprediction penalty. Not as advantageous as not taking the branch, since the branch address computation is not completed until after the EX segment is carried out.

  28. Example: for (i = 0; i < 10; i++) a[i] = a[i] + 1; “Branch always” will not work well without the compiler’s help: CLR R0; loop: JCD R0 >= 10,exit; LDD R1,R0; ADD R1,1; ST+ R1,R0; JMP D,loop; exit: ---------------------------------------------------------- “Branch always” will work well with the compiler’s help: CLR R0; loop: LDD R1,R0; ADD R1,1; ST+ R1,R0; JCD R0 < 10,loop;

  29. 4- Delayed branch: insert an instruction after a branch instruction, and always execute it whether or not the branch condition applies. Of course, this must be an instruction that can be executed without any side effects on the correctness of the program. Pros: the pipeline is never stalled or flushed, and with the correct choice of delay-slot instruction, performance can approach that of an ideal pipeline. Cons: it is not always possible to find a delay-slot instruction, in which case a NOP instruction may have to be inserted into the delay slot to make sure that the program's integrity is not violated. It also makes compilers work harder.

  30. Which instruction to place into the delayed branch slot? 3.1- Choose an instruction from before the branch, but make sure that the branch does not depend on the moved instruction. If such an instruction can be found, this always pays off. Example: ADD R1,R2; JCD R2 > 10,exit; can be rescheduled as JCD R2 > 10,exit; ADD R1,R2; (delay slot)

  31. 3.2- Choose an instruction from the target of the branch, but make sure that the moved instruction is executable when the branch is not taken. Example: ADD R1,R2; JCD R2 > 10,sub; JMP D,add; …. sub: SUB R4,R5; add: ADI R3,5; can be rescheduled as ADD R1,R2; JCD R2 > 10,sub; ADI R3,5; (delay slot) …. sub: SUB R4,R5;

  32. 3.3- Choose an instruction from the anti-target of the branch, but make sure that the moved instruction is executable when the branch is taken. Example: JCD R2 > 10,exit; ADD R3,R2; exit: SUB R4,R5; can be rescheduled as JCD R2 > 10,exit; ADD R3,R2; (delay slot -- schedule for execution if it does not alter the program flow or output) exit: SUB R4,R5;

  33. Dynamic Branch Prediction -- Dynamic branch prediction relies on the history of how branch conditions were resolved in the past. -- The history of branches is kept in a buffer. To keep this buffer reasonably small and easy to access, it is indexed by some fixed number of lower-order bits of the address of the branch instruction. -- The assumption is that the values in the lower address field are unique enough to prevent frequent collisions or overrides. Thus, if we are trying to predict branches in a program that remains within a block of 256 locations, 8 bits should suffice. [Figure: a branch-history buffer covering addresses x through x + 256.]

  34. Branch instructions in the instruction cache include a branch prediction field that is used to predict if the branch should be taken.

  35. Branch prediction: in the simplest case, the field is a 1-bit tag: 0 <=> branch was not taken last time (state A); 1 <=> branch was taken last time (state B). [State diagram: a not-taken branch moves to (or stays in) state A; a taken branch moves to (or stays in) state B.] While in state A, predict the branch as “not to be taken”; while in state B, predict it as “to be taken”.
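The 1-bit scheme can be sketched in a few lines (the class name is illustrative): the single state bit simply records the last outcome, with state A = False ("not taken") and state B = True ("taken").

```python
class OneBitPredictor:
    def __init__(self, taken=False):
        self.state = taken             # False = state A, True = state B

    def predict(self):
        return self.state              # predict "taken" only in state B

    def update(self, actual_taken):
        self.state = actual_taken      # remember only the last outcome
```

On a branch whose outcome alternates every time, this predictor misses every single prediction, which is exactly the pathological case discussed a few slides below.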

  36. This works relatively well: it accurately predicts the branches in a loop in all but at most two of the iterations. CLR R0; loop: LDD R1,R0; ADD R1,1; ST+ R1,R0; JCD R0 < 10,loop; Assuming that we begin in state A, prediction fails when R0 = 1 (branch is not taken when it should be) and R0 = 10 (branch is taken when it should not be). Assuming that we begin in state B, prediction fails only when R0 = 10 (branch is taken when it should not be).

  37. We can modify the loop to make the branch prediction algorithm fail twice when we begin in state B as well. CLR R0; loop: LDD R1,R0; ADD R1,1; ST+ R1,R0; JCD R0 >= 10,exit; JMP D,loop; exit: Assuming that we begin in state B, prediction fails when R0 = 1 (branch is taken when it should not be) and R0 = 10 (branch is not taken when it should be).

  38. What is worse is that we can make this branch prediction algorithm fail each time it makes a prediction: LDI R0,1; loop: JCD R0 > 0,neg; LDI R0,1; JMP D,loop; neg: LDI R0,-1; JMP D,loop; Assuming that we begin in state A, prediction fails when R0 = 1 (branch is not taken when it should be), R0 = -1 (branch is taken when it should not be), R0 = 1 (branch is not taken when it should be), R0 = -1 (branch is taken when it should not be), and so on.

  39. 2-bit prediction (a more reluctant flip in decision). [State diagram: a taken branch moves A1 → A2 → B1 → B2 (saturating at B2); a not-taken branch moves B2 → B1 → A2 → A1 (saturating at A1).] While in states A1 and A2, predict the branch as “not to be taken”; while in states B1 and B2, predict it as “to be taken”.

  40. CLR R0; loop: LDD R1,R0; ADD R1,1; ST+ R1,R0; JCD R0 < 10,loop; Assuming that we begin in state A1, prediction fails when R0 = 1, 2 (branch is not taken when it should be) and R0 = 10 (branch is taken when it should not be). Assuming that we begin in state B1, prediction fails only when R0 = 10 (branch is taken when it should not be).

  41. 2-bit predictors are more resilient to branch inversions (predictions are reversed only when they are missed twice): LDI R0,1; loop: JCD R0 > 0,neg; LDI R0,1; JMP D,loop; neg: LDI R0,-1; JMP D,loop; Assuming that we begin in state B1, prediction succeeds when R0 = 1 (branch is taken when it should be), fails when R0 = -1 (branch is taken when it should not be), succeeds when R0 = 1 (branch is taken when it should be), fails when R0 = -1 (branch is taken when it should not be), and so on…
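The 2-bit scheme above behaves like a saturating counter: mapping A1, A2, B1, B2 to counter values 0–3 reproduces the failure patterns traced in the last three slides. This is a sketch under that interpretation; the class name is illustrative.

```python
class TwoBitPredictor:
    """2-bit saturating counter: 0 = A1, 1 = A2, 2 = B1, 3 = B2."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2         # "taken" only in states B1, B2

    def update(self, actual_taken):
        # A taken branch nudges the counter up, a not-taken one down,
        # saturating at the ends -- so one miss never flips a strong state.
        if actual_taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

Replaying the counting loop of slide 40 from state A1 (nine taken branches, then one not taken) gives exactly the three mispredictions described there: R0 = 1, 2, and 10.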

  42. Amdahl's Law (Fixed Load Speed-up). Let q be the fraction of a load L that cannot be sped up by introducing more processors, and let T(p) be the amount of time it takes to execute L on p processors, p ≥ 1. Then T(p) = qT(1) + (1 − q)T(1)/p, so the speed-up is S(p) = T(1)/T(p) = 1/(q + (1 − q)/p) ≤ 1/q. All this means is that the maximum speed-up of a system is limited by the fraction of the work that must be completed sequentially: the execution time using p processors can at best be reduced toward qT(1), so the speed-up cannot exceed 1/q.

  43. Example: A 4-processor computer executes instructions that are fetched from a random access memory over a shared bus. [Figure: four processors sharing a bus to a common memory.]

  44. The task to be performed is divided into two parts: Fetch instruction (serial part) -- it takes 30 microseconds. Execute instruction (parallel part) -- it takes 10 microseconds. Thus q = 30/40 = 0.75, and S(4) = T(1)/T(4) = 1/(0.75 + 0.25/4) = 4/3.25 = 1.23.

  45. Now, suppose that the number of processors is doubled. Then S(8) = T(1)/T(8) = 1/(0.75 + 0.25/8) = 8/6.25 = 1.28 Suppose that the number of processors is doubled again. Then S(16) = T(1)/T(16) = 1/(0.75 + 0.25/16) = 16/12.25 = 1.30.

  46. What is the limit? S(p) = T(1)/T(p) = 1/(0.75 + 0.25/p) → 1/0.75 = 1.333 as p grows without bound.

  47. Alternate Forms of Amdahl's Law: if the part of the computation that can be enhanced is sped up by a factor of s, then S = 1/(q + (1 − q)/s) = s/(qs + 1 − q), where s is the speed-up of the computation that can be enhanced and q is the fraction that cannot.

  48. Example: Suppose that you've upgraded your computer from a 2 GHz processor to a 4 GHz processor. What is the maximum speed-up you expect in executing a typical program, assuming that (1) the speed of fetching each instruction is directly proportional to the speed of reading an instruction from the primary memory of your computer, and reading an instruction takes four times longer than executing it, and (2) the speed of executing each instruction is directly proportional to the clock speed of the processor of your computer? Using Amdahl's Law with q = 0.8 and s = 2, we have S = 2/(0.2 + 0.8 × 2) = 2/1.8 = 1.111. Very disappointing, as you are likely to have paid quite a bit of money for the upgrade!
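Both forms of Amdahl's Law used in slides 42-48 reduce to the same one-line function; this sketch (name illustrative) checks the numbers worked out above, with the second argument standing for either the processor count p or the enhancement factor s.

```python
def amdahl_speedup(q, s):
    """Amdahl's Law: q is the fraction that cannot be sped up; the
    remaining (1 - q) is divided by the speed-up factor s (or by the
    processor count p -- the formula is the same)."""
    return 1.0 / (q + (1.0 - q) / s)
```

With q = 0.75 this reproduces S(4) = 1.23, S(8) = 1.28, and the limit 1.333; with q = 0.8 and s = 2 it gives the 1.111 of the upgrade example.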
