
Module 5


Presentation Transcript


  1. Module 5 Pipeline

  2. Pipeline • The CPU breaks the fetch-decode-execute cycle into tasks that can be performed in parallel. • The performance of a computer can be increased by increasing the performance of the CPU, which can be done by executing more than one task at a time. • This procedure is referred to as pipelining. • The concept of pipelining is to allow the processing of a new task to begin even though the processing of the previous task has not ended.

  3. Pipeline Process • A single process is divided into several small independent tasks. [Figure: tasks T1, T2, T3 flowing through Segment 1, Segment 2, and Segment 3.]

  4. Analogy: Pipelined Laundry • Non-pipelined approach: • run a load of clothes through the washer • run the load through the dryer • fold the clothes • put the clothes away • Pipelined approach: while the first load is drying, put the second load in the washing machine. • When the first load is being folded and the second load is in the dryer, put the third load in the washing machine.

  5. [Figure: laundry space-time diagrams, time axis from 6 PM to 2 AM, task order A through D, for the non-pipelined and pipelined cases.] Non-pipelined → 16 units of time; Pipelined → 7 units of time.

  6. Analogy: Pipelined Laundry • For 4 loads: • the non-pipelined approach takes 16 units of time • the pipelined approach takes 7 units of time • For 816 loads: • the non-pipelined approach takes 3264 units of time • the pipelined approach takes 819 units of time (4 stages to finish the first load, then one load per time unit: 4 + 815 = 819).

  7. Pipeline: Space-Time Diagram • Shows the events of a pipeline. [Figure: space-time diagram with clock cycles 1-7 across the top, segments S1-S4 down the side, and processes P1-P4 each advancing one segment per clock.] S = segment; P = process.
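
Such a diagram can be generated mechanically. Below is a minimal Python sketch, my own illustration assuming an ideal pipeline in which every process spends exactly one clock in each segment:

```python
def space_time_diagram(k, n):
    """Print an ideal-pipeline space-time diagram: k segments (rows),
    n processes. Process Pj occupies segment Si during clock i + j - 1."""
    clocks = k + n - 1
    print(f"{'clock':<6}" + "".join(f"{c:>4}" for c in range(1, clocks + 1)))
    for i in range(1, k + 1):                       # one row per segment
        cells = []
        for c in range(1, clocks + 1):
            j = c - i + 1                           # process in Si at clock c
            cells.append(f"P{j}" if 1 <= j <= n else ".")
        print(f"{'S' + str(i):<6}" + "".join(f"{cell:>4}" for cell in cells))

space_time_diagram(4, 4)    # reproduces the figure: 7 clocks for 4 processes
```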

  8. Design Issues • Since the segments are connected to each other in sequence, the next segment cannot start execution until it has received the result from the previous segment (in this case, pipelining is not ideal). • So, the cycle time of the segments must be the same. • However, it is known that the execution time of each segment is not the same. • Therefore, for synchronization, the cycle time for the pipeline is based on the longest execution time of any segment in the pipeline.

  9. Pipeline Performance: Degree of Speedup • Let t_n be the cycle time for non-pipelining and t_p the cycle time for pipelining. • An ideal pipeline divides a task into k independent sequential processes: • each process requires t_p time units to complete, • the task itself then requires k·t_p time units to complete. • For n iterations of the task, the execution times are: • with no pipelining: n·t_n time units, • with pipelining: [k + (n − 1)]·t_p time units (k·t_p for the first result, then one result every t_p thereafter). • The degree of speedup is thus: S = execution time without pipelining / execution time with pipelining = n·t_n / ([k + (n − 1)]·t_p). • If n is much larger than k (n >> k) and t_n = k·t_p, the maximum speedup is S_max = k.
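
As a sanity check on these formulas, here is a small Python sketch (the function names are my own); the printed values reproduce the laundry numbers from slides 5 and 6:

```python
def time_non_pipelined(n, t_n):
    """n iterations executed back to back, each taking t_n time units."""
    return n * t_n

def time_pipelined(n, k, t_p):
    """The first result emerges after k cycles of t_p each; every later
    result emerges one cycle after the previous: [k + (n - 1)] * t_p."""
    return (k + (n - 1)) * t_p

def speedup(n, k, t_n, t_p):
    """Degree of speedup S; tends to t_n / t_p (= k when t_n = k * t_p)."""
    return time_non_pipelined(n, t_n) / time_pipelined(n, k, t_p)

# Laundry analogy: k = 4 stages, 1 time unit per stage, t_n = 4 per load
print(time_non_pipelined(4, 4), time_pipelined(4, 4, 1))      # 16 7
print(time_non_pipelined(816, 4), time_pipelined(816, 4, 1))  # 3264 819
```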

  10. Non-Ideal Pipeline Structure: Example • Data operands pass through all segments in sequence. • Each segment consists of a circuit that performs a sub-operation. • Segments are separated by registers that hold the intermediate results between stages. [Figure: Control Unit driving segments S1, S2, ..., Sm, each preceded by an interface register R1, R2, ..., Rm, from Data In to Data Out.]

  11. Example 1: Pipeline (Non-ideal Case) Given a 4-segment pipeline where each segment has a delay time as follows: • Segment 1: 40 ns • Segment 2: 25 ns • Segment 3: 45 ns • Segment 4: 45 ns The delay time for the interface register is 5 ns. Calculate: i) the cycle time of the non-pipelined and pipelined systems, ii) the execution time for 100 tasks, iii) the real speedup, and iv) the maximum speedup.

  12. Example 1: Pipeline i. Cycle times: t_n = (40 + 25 + 45 + 45 + 5) ns = 160 ns; t_p = longest segment delay + interface delay = (45 + 5) ns = 50 ns. [Figure: Control Unit over Segments 1-4 from Data Input to Data Output, with segment delays of 40 ns, 25 ns, 45 ns, and 45 ns.]

  13. Example 1: Pipeline ii. Execution time for 100 tasks = k·t_p + (n − 1)·t_p = ((4 × 50) + (99 × 50)) ns = (50 × 103) ns = 5150 ns. For the non-pipelined system, the total execution time for 100 tasks = n·t_n = 100 × 160 ns = 16000 ns. iii. The real speedup for 100 tasks: speedup = execution time non-pipelined / execution time pipelined = 16000 / 5150 = 3.1068. iv. Maximum speedup: S_max = k = 4. (These numbers are checked in the snippet below.)
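
The figures above follow directly from the formulas on slide 9; a self-contained snippet (the last line notes that, with these unbalanced segment delays, the large-n speedup limit is t_n / t_p = 3.2 rather than the ideal k = 4):

```python
# Example 1: k = 4 segments, t_p = 50 ns, t_n = 160 ns, n = 100 tasks
k, t_p, t_n, n = 4, 50, 160, 100
pipelined = (k + (n - 1)) * t_p       # (4 + 99) * 50 = 5150 ns
non_pipelined = n * t_n               # 100 * 160     = 16000 ns
print(non_pipelined / pipelined)      # 3.1068... = the real speedup
print(t_n / t_p)                      # 3.2 = large-n limit; ideal Smax = k = 4
```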

  14. Example 2: Pipeline There are 4 segments, with delay times of 52 ns, 40 ns, 30 ns, and 40 ns (as used in the calculation on the next slide). The interface delay is 3 ns. [Figure: simple block diagram of the pipeline, interface registers R feeding the segments IF, OF, IE, and OS in sequence.] Draw the space-time diagram and calculate: i) the cycle time of the non-pipelined and pipelined systems, ii) the execution time for 50 tasks, iii) the speedup, and iv) the maximum speedup.

  15. Example 2: Pipeline [Figure: space-time diagram for the pipeline with 5 processes.] i. Cycle times: t_n = (52 + 40 + 30 + 40 + 3) ns = 165 ns; t_p = longest segment delay + interface delay = (52 + 3) ns = 55 ns. ii. Execution time for 50 tasks = k·t_p + (n − 1)·t_p = ((4 × 55) + (49 × 55)) ns = (220 + 2695) ns = 2915 ns. For the non-pipelined system, the total execution time for 50 tasks = n·t_n = 50 × 165 ns = 8250 ns. iii. The real speedup for 50 tasks: speedup = execution time non-pipelined / execution time pipelined = 8250 / 2915 = 2.83. iv. Maximum speedup: S_max = k = 4.

  16. Instruction Pipeline • The instruction cycle clearly shows the sequence of operations that take place in order to execute a single instruction. • A “good” design goal of any system is to have all of its components performing useful work all of the time – high efficiency. • Following the instruction cycle (fetch-decode-execute) in a strictly sequential fashion does not permit this level of efficiency. • Analogy: an automobile assembly line. • Perform all tasks concurrently, but on different (sequential) instructions. • The result is temporal parallelism – the instruction pipeline. • With the use of a pipeline, the average CPI is reduced (see the sketch below). • However, every instruction still requires the same number of clock cycles for its own execution.
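
To make the CPI claim concrete, here is a hedged sketch (it assumes an ideal k-stage pipeline with no stalls): each instruction still takes k cycles from fetch to completion, but once the pipeline is full one instruction completes per cycle, so the average CPI approaches 1 for long instruction streams.

```python
def average_cpi(k, n):
    """Cycles to complete n instructions on an ideal k-stage pipeline,
    divided by n; each individual instruction still takes k cycles."""
    return (k + (n - 1)) / n

print(average_cpi(6, 9))        # ~1.56 (the 14-cycle case on slide 20)
print(average_cpi(6, 10_000))   # ~1.0005: approaches 1 for long streams
```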

  17. Implementation of Instruction Pipeline: Case 1 • Divide the instruction cycle into two processes • Instruction fetch (Fetch cycle) • Everything else (Execution phase) • While one instruction is in “execution,” overlap the prefetching of the next instruction • Assumes the memory bus will be idle at some point during the execution phase. • Reduces the time to fetch an instruction to zero (ideal situation).

  18. Implementation of Instruction Pipeline: Case 1 [Figure: sequential execution (Fetch #1, Execute #1, Fetch #2, Execute #2) versus pipelined execution, where Fetch #2 overlaps Execute #1.] • Problems • The fetch and execute parts are usually not the same size. • Branching can negate the prefetching: as a result of a branch instruction, you may have prefetched the “wrong” instruction.

  19. Implementation of Instruction Pipeline: Case 2 • Finer division of the instruction cycle into 6 stages – better speedup. • Example 6-stage pipeline: • Fetch instruction (FI) • Decode instruction (DI) • Calculate operands (CO) • Fetch operands (FO) • Execute instruction (EI) • Write (store) operand (WO) • Use multiple execution “functional units” to parallelize the actual execution phase of several instructions. • Use branching strategies to minimize the branch impact.

  20. Implementation of Instruction Pipeline: Case 2 [Figure: space-time diagram for 9 instructions through the 6 stages.] Pipelined = 6 + (9 − 1) = 14 clock cycles; non-pipelined = 9 × 6 = 54 clock cycles.

  21. Implementation of Instruction Pipeline: Case 2 Problems with the space-time diagram for Case 2: • It assumes each instruction goes through all 6 stages → not true. • Example: a load instruction does not need the WO stage. • It assumes there are no memory conflicts. • Example: FI, FO, and WO all involve memory access → they cannot run simultaneously. • It assumes no interrupt or branching happens. • It assumes no data dependency, • where the CO stage may depend on the contents of a register that could be altered by a previous instruction still in the pipeline.

  22. Example 3 • The estimated timings for each of the stages of an instruction pipeline are given in a table (not reproduced here); the figure on the next slide assumes 8 ns per instruction in the non-pipelined case and a 2 ns pipeline cycle.

  23. Example 3 [Figure: program execution order (mov ax, num1; mov bx, num2; mov cx, num3) through instruction fetch, register read, ALU, data access, and register write stages. Non-pipelined: a new instruction starts every 8 ns; pipelined: a new instruction starts every 2 ns.]

  24. Organization of a 4-Segment Instruction Pipeline [Figure: processor with I-cache, D-cache, PC, IR, register file, fetch & decode logic, data read logic, ALU, and data write logic.] • S1: Instruction fetch • S2: Operand load • S3: ALU operation • S4: Operand store

  25. Pipeline Limitations • 3 major issues affect pipeline performance: • Resource conflict (also called a resource hazard or structural hazard) • caused by two segments accessing memory at the same time. • Data dependency (data hazard) • a conflict arises when an instruction depends on the result of a previous instruction, but that result is not yet available. • Branch difficulties (control hazard) • arise from branch and other instructions that change the PC value. • Pipeline depth is often not included in the list, but it does have an effect on the pipeline.

  26. Pipeline Limitation: Pipeline Depth • If the speedup is based on the number of stages, why not build lots of stages? • Each stage uses latches at its input (output) to buffer the next set of inputs. • If the stage granularity is reduced too much, the latches and their control become a significant hardware overhead. • There is also a time overhead in the propagation time through the latches, • which limits the rate at which data can be clocked through the pipeline. • The logic to handle memory and register use and to control the overall pipeline increases significantly with increasing pipeline depth.

  27. Pipeline Limitation: Data Dependency • Pipelining must ensure that computed results are the same as if the computation were performed in strict sequential order. • With multiple stages, two instructions “in execution” in the pipeline may have data dependencies – the pipeline must be designed to prevent this. • Data dependencies limit when an instruction can be input to the pipeline. • Data dependency is shown in the following portion of a program (see the detection sketch below): A = B + C; D = E + A; C = G x H; A = D / H;
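
A hedged sketch of how such dependencies can be detected mechanically; the tuple encoding of the fragment above is my own invention, and the pass only looks for read-after-write (RAW) conflicts between instructions close enough to overlap in the pipeline:

```python
# The fragment above, as (destination, source operands) tuples:
# A = B + C; D = E + A; C = G x H; A = D / H
program = [("A", ("B", "C")),
           ("D", ("E", "A")),
           ("C", ("G", "H")),
           ("A", ("D", "H"))]

def raw_hazards(program, window=2):
    """Flag pairs at most `window` instructions apart where a later
    instruction reads a value an earlier one writes (RAW hazard)."""
    hazards = []
    for i, (dest, _) in enumerate(program):
        for j in range(i + 1, min(i + window + 1, len(program))):
            if dest in program[j][1]:
                hazards.append((i, j, dest))
    return hazards

print(raw_hazards(program))   # [(0, 1, 'A'), (1, 3, 'D')]
```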

  28. Example 4: Data Dependency I1: SUB AX, BX ; [AX] ← [AX] – [BX] I2: ADD AX, CX ; [AX] ← [AX] + [CX] [Figure: space-time diagram (clocks 1-7) for I1-I4 through Fetch Instruction, Decode Instruction, Execute Instruction, and Store Result; I2 idles in the execute stage until I1's result has been stored.]

  29. Solutions to Data Dependency • Hardware interlocks • an interlock is a circuit that detects instructions with data dependencies and inserts the required delays to resolve conflicts. • Operand forwarding • allows the result of the ALU to be used by other ALU operations in the next instruction cycle. • Delayed load (NOP) • the compiler is used to detect conflicts and reorder instructions to delay the loading of conflicting data, • using the NOP instruction.

  30. Solution to Data Dependency: Operand Forwarding [Figure: ALU operation stage (Src1, Src2 → RSLT) followed by the operand store stage, with a forwarding data path from the ALU result back to the ALU inputs.]

  31. Picture of Forwarding [Figure: program execution order (add ax, num; sub bx, ax) through IF, ID, EX, MEM, WB stages; the EX-stage result of the add is forwarded directly to the EX stage of the sub.]

  32. Solution to Data Dependency: Delayed Load (NOP) [Figure: space-time diagram (clocks 1-7) with instruction order I1, NOP, I2, I3, I4 through Fetch Instruction, Decode Instruction, Execute Instruction, and Store Result.]

  33. Example 5: Solution to Data Dependency: Delayed Load (NOP) [Figure: space-time diagram, not reproduced; a sketch of the idea follows below.]
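
Since the original slide is an image, here is a minimal sketch of the compiler-side idea instead (assumptions: the same tuple encoding as above, and a one-NOP bubble as in the diagram on slide 32; deeper pipelines may need more):

```python
def insert_nops(program, nops_needed=1):
    """Naive delayed-load pass: after any instruction whose result is
    read by the very next instruction, insert NOPs so the result is
    stored before the dependent instruction needs it."""
    out = []
    for i, instr in enumerate(program):
        out.append(instr)
        dest = instr[0]
        if i + 1 < len(program) and dest in program[i + 1][1]:
            out.extend([("NOP", ())] * nops_needed)
    return out

# I1: SUB AX, BX then I2: ADD AX, CX -- I2 reads the AX written by I1
prog = [("AX", ("AX", "BX")), ("AX", ("AX", "CX"))]
print(insert_nops(prog))   # one NOP between I1 and I2, as in slide 32
```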

  34. Pipeline Limitation: Resource Conflict • Occurs when two segments need to access memory at the same time. • Can be solved by implementing modular memory. [Figure: space-time diagram (clocks 1-5) for I1-I5 through Fetch Instruction, Decode Instruction, Execute Instruction, and Store Result, with simultaneous memory accesses: the fetch cycle of I4, an indirect operand fetch of I3, and the store/write of I1.]

  35. Pipeline Delay [Figure: space-time diagram for I1-I5 in which a stalled stage repeats (I3 held in fetch/decode, I1 idle before Store Result), delaying every stage behind it.] • A delay in any stage can cause pipeline stalls.

  36. Pipeline Delay • 5-stage pipeline, ideal case. • Assume that memory has a single port, so data reads and writes can only happen one at a time. • Assume that the source operand for I1 is in memory. • We then have a conflict, with 2 instructions needing the same resource. • A delay in any stage can cause pipeline stalls. • The FI stage for I3 must idle for 1 clock cycle before beginning.

  37. Handling Resource Conflict This scenario describes a memory conflict caused by the instruction fetch of I3 and the memory-resident operand fetch of I1.

  38. Handling Resource Conflict

  39. Pipeline Limitation: Branching • For the pipeline to deliver the desired operational speedup, we must “feed it” with long strings of instructions. • However, 15-20% of instructions in an assembly-level stream are (conditional) branches. • Of these, 60-70% take the branch to a target address. • The impact of branching is that the pipeline never really operates at its full capacity, limiting the performance improvement derived from the pipeline (see the estimate below).
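
Those percentages translate directly into an average branch penalty. A back-of-the-envelope sketch (the 3-cycle flush penalty is my assumption, matching k − 1 = 3 for the 4-segment examples that follow):

```python
def effective_cpi(base_cpi, branch_freq, taken_freq, penalty):
    """Average CPI once branch stalls are charged: base CPI plus the
    fraction of instructions that are taken branches times the penalty."""
    return base_cpi + branch_freq * taken_freq * penalty

# 15-20% branches, of which 60-70% are taken; 3 flushed cycles per taken branch
print(effective_cpi(1.0, 0.15, 0.60, 3))   # 1.27
print(effective_cpi(1.0, 0.20, 0.70, 3))   # 1.42
```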

  40. Pipeline Limitation: Branching [Figure: space-time diagram showing the instructions behind a taken branch being flushed out of the Fetch Instruction, Decode Instruction, Execute Instruction, and Store Result stages, which then sit idle.]

  41. Solutions for Branching • Delayed branch (NOP) • Rearranging the instructions • Implementation of an instruction queue

  42. Example 6: Solution for Branching: Delayed Branch – Using NOP • When the compiler detects a branch, it automatically inserts several NOPs so that there are no interruptions in the pipeline. • Example: number of delayed-branch NOPs to insert = number of segments – 1 = k – 1 = 4 – 1 = 3.

  43. Example 6: Delayed Branch – Using NOP [Figure: space-time diagram for Mov, Inc, Add, Ret, NOP, NOP, NOP followed by I1-I4 through Fetch Instruction, Decode Instruction, Execute Instruction, and Store Result.] • 3 pipeline clock cycles are wasted.
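
The k − 1 rule from this example, as a tiny compiler-style pass (a sketch; it assumes every branch-type instruction flushes the k − 1 instructions behind it, and treats JMP/RET as the only branches, per the examples):

```python
def pad_branches(program, k, branches=("JMP", "RET")):
    """Insert k - 1 NOPs after every branch so the pipeline drains
    before the branch target must be fetched."""
    out = []
    for op in program:
        out.append(op)
        if op.split()[0].upper() in branches:
            out.extend(["NOP"] * (k - 1))
    return out

print(pad_branches(["MOV AX, 1", "INC AX", "ADD AX, BX", "RET"], k=4))
# ['MOV AX, 1', 'INC AX', 'ADD AX, BX', 'RET', 'NOP', 'NOP', 'NOP']
```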

  44. Example 7: Solution for Branching: Rearranging Instructions • How to rearrange: move the branch up by the number of segments – 1 = k – 1 = 4 – 1 = 3 positions (move up 3). [Figure: space-time diagram for the reordered stream Ret, Mov, Add, Inc followed by I1-I5 through Fetch Instruction, Decode Instruction, Execute Instruction, and Store Result.]

  45. Example 8: Solution for Branching: Rearranging Instructions [Figure: space-time diagram showing flushed and idle stages before rearranging.]

  46. Example 8: Solution for Branching: Rearranging Instructions • How to rearrange: move the branch up = number of segments – 1 = k – 1 = 4 – 1 = 3 (move up 3 positions).

  47. Instruction Queue • Prefetch the target instruction: prefetch both possible next instructions in the case of a conditional branch. [Figure: Fetch Instruction (S1) feeding two instruction queues ahead of Fetch Operand (S2), Execute Instruction, and Store Result (S4).] • When the FI segment detects a JMP, the next instruction address implied by the JMP is determined. • The instructions prefetched down the wrong path are deleted and the new instruction is fetched.

  48. Example 9: Instruction Pipeline • Given a pipeline that consists of 5 segments: Fetch Instruction (FI), Decode Instruction (DI), Fetch Operand (FO), Execute Instruction (EI), and Store Result (SR).

  49. Example 9 – Space-Time Diagram • Draw the space-time diagram for the execution of the instructions. • Show the data dependency and branching problems. • Flush the fetched instructions when I5 (JMP) is executed. • Branching problem: JMP LOOP1. • Data dependency problem: MOV AX,NUM followed by ADD BX,AX. • The pipeline idles until I5 (JMP) is executed and I12 is fetched and executed.

  50. Example 9 – Delayed Load – Using NOP • Solve the data dependency problem: insert a delayed load with NOP. • Branching problem: JMP LOOP1. • Resulting instruction stream: ... I5, NOP, I6, I7, ...
